Chimely

Operations

Metrics, alerts, dead-letter replay, graceful shutdown, rate limits, and measured capacity.

Everything on this page is enforced by the Phase 3 chaos suite (server/tests/chaos.rs): killing any process at any moment, losing Redis entirely, flooding one environment, or replaying any job must not lose data, drift a counter, or starve a neighbor.

Metrics

GET /metrics (Prometheus text format). Gauges marked sampled are recomputed from Postgres on a fixed cadence (CHIMELY_METRICS_SAMPLE_MS, default 15s), so they survive restarts and cannot freeze when a subsystem stalls.

Metric                                     Kind            Meaning
chimely_queue_depth{environment,job_type}  gauge, sampled  All pending job rows
chimely_queue_due{environment,job_type}    gauge, sampled  Jobs past run_at
chimely_job_wait_seconds{job_type}         histogram       Due-to-claim latency (the fairness signal)
chimely_jobs_processed_total{environment}  counter         Completed claims
chimely_jobs_failed_total{environment}     counter         Failed claims (will retry)
chimely_jobs_retried_total{job_type}       counter         Backoff reschedules
chimely_jobs_parked_total{job_type}        counter         Moves into dead_letters
chimely_dead_letters{job_type}             gauge, sampled  Parked jobs awaiting replay
chimely_hint_publish_duration_seconds      histogram       Pub/sub round trip
chimely_hint_delivery_lag_seconds          histogram       Enqueue-to-publish lag (includes queue wait + debounce)
chimely_sse_connections                    gauge           Open SSE streams on this replica
chimely_counter_drift_unread / _unseen     gauge, sampled  Σ|recount − maintained| over recently-active subscribers
chimely_partitions_remaining{table}        gauge, sampled  Pre-created future partitions
chimely_rate_limited_total                 counter         Requests answered 429
chimely_rate_limit_errors_total            counter         Limiter fail-opens (Redis unreachable)
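
To make the drift formula concrete, here is a minimal sketch of Σ|recount − maintained|. The struct and field names are hypothetical and not taken from Chimely's schema; the point is only that absolute differences are summed, so an over-count on one subscriber cannot cancel an under-count on another.

```rust
/// One subscriber's counter pair. Names are illustrative assumptions,
/// not Chimely's actual schema.
pub struct CounterSample {
    /// Value kept up to date incrementally.
    pub maintained: i64,
    /// Value obtained by recounting from the source rows.
    pub recount: i64,
}

/// Sum of absolute differences across the sampled subscribers.
/// Zero means every maintained counter matches its recount.
pub fn counter_drift(samples: &[CounterSample]) -> u64 {
    samples
        .iter()
        .map(|s| s.recount.abs_diff(s.maintained))
        .sum()
}
```

With pairs (3, 3), (5, 2), and (0, 1) the drift is 0 + 3 + 1 = 4, even though the signed errors partly offset each other.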

Alerts

  • chimely_partitions_remaining < 2 — the partition-maintenance job has been dead for ~11 months; with headroom exhausted, inserts fail loudly (there is deliberately no DEFAULT partition). The gauge decays by one per month while the job is stalled, so this alert fires with a full month of pre-created headroom still left.
  • chimely_counter_drift_unread > 0 (sustained) — a counter-maintenance bug. Transient nonzero readings right after a preference flip resolve as soon as the enqueued counter_rebuild lands.
  • chimely_dead_letters > 0 — jobs exhausted max_attempts and need an operator (see below).
  • chimely_job_wait_seconds p99 growing without a matching queue-depth flood — claim starvation; check for a stuck worker.
  • Timeline write latency vs retention. The status log's idempotency guard probes every notification_status_log partition (the table is range-partitioned on occurred_at, and a prior status row can live in any month), so each append costs one index probe per partition. With the default 12-month retention that is ~26 cheap B-tree probes; if you raise CHIMELY_RETENTION_MONTHS substantially, watch notification create/read latency grow with the partition count.
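
The ~26 figure in the last bullet is consistent with a simple partition count. The sketch below assumes one partition per retained month, one for the current month, and one per pre-created future month; that breakdown is an inference from the defaults on this page, not a documented formula.

```rust
/// Estimated number of notification_status_log partitions, and therefore
/// index probes per status append. Assumed model: retained months +
/// the current month + pre-created future months.
pub fn status_log_partitions(retention_months: u32, precreated_months: u32) -> u32 {
    retention_months + 1 + precreated_months
}
```

Under this model the defaults (12 retained, 13 pre-created) give 26, and raising retention to 36 months would give 50 probes per append.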

Dead-letter replay

A job that exhausts max_attempts (default 10) moves to the dead_letters table with its last_error and progress_cursor — the jobs table itself stays near-empty (completed work leaves no row; parked work does not live in the hot claim path).

chimely dlq list                 # one line per parked job
chimely dlq replay job_01h4…     # replay one (fresh attempt budget, runs now)
chimely dlq replay --all

Replay re-enqueues the original row (same id, payload, and cursor), so a chunked fan-out resumes where it died and side effects apply exactly once.
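
As a rough illustration of what replay preserves and what it resets, the following sketch models a parked job in memory. The types and field names are hypothetical (the real rows live in Postgres); only the behaviour described above is modelled: same id, payload, and cursor, with a fresh attempt budget and an immediate run time.

```rust
/// Hypothetical in-memory stand-in for a jobs row.
#[derive(Debug, Clone, PartialEq)]
pub struct Job {
    pub id: String,
    pub payload: String,
    pub progress_cursor: u64,
    pub attempts: u32,
    pub run_at_ms: u64,
}

/// Hypothetical stand-in for a dead_letters row.
#[derive(Debug, Clone, PartialEq)]
pub struct DeadLetter {
    pub job: Job,
    pub last_error: String,
}

/// Re-enqueue a parked job: identity, payload, and cursor are kept,
/// the attempt counter is reset, and the job becomes due immediately.
pub fn replay(dead: DeadLetter, now_ms: u64) -> Job {
    Job {
        attempts: 0,
        run_at_ms: now_ms,
        ..dead.job
    }
}
```

Because the cursor survives, a fan-out that parked after processing part of its audience would pick up from that cursor instead of starting over.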

Graceful shutdown

On SIGTERM/SIGINT, in order:

  1. /readyz flips to 503 while the listener keeps serving for CHIMELY_SHUTDOWN_GRACE_MS (default 5000), so load balancers drain the replica without dropping in-flight requests.
  2. The listener stops accepting; SSE streams receive a jittered retry: directive and close (no reconnect stampede); workers stop claiming.
  3. In-flight jobs get CHIMELY_SHUTDOWN_DRAIN_DEADLINE_MS (default 30000) to finish. Past the deadline the worker is aborted — the open transaction rolls back and the job is re-claimed elsewhere (at-least-once by design, side effects keyed by job deletion).
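
The drain deadline in step 3 can be sketched with a standard-library channel: each in-flight job signals completion, and the caller waits until either all have reported or the deadline passes. This is an illustrative model only; the function and type names are hypothetical, and the real server's concurrency primitives are not described on this page.

```rust
use std::sync::mpsc::{Receiver, RecvTimeoutError};
use std::time::{Duration, Instant};

#[derive(Debug, PartialEq)]
pub enum DrainOutcome {
    /// Every in-flight job reported completion before the deadline.
    Drained,
    /// The deadline passed (or the workers went away) with jobs still open;
    /// the caller would abort them and let their transactions roll back.
    Aborted { still_running: usize },
}

/// Wait for `in_flight` completion signals, spending at most `deadline` in total.
pub fn drain(done: &Receiver<()>, in_flight: usize, deadline: Duration) -> DrainOutcome {
    let started = Instant::now();
    let mut remaining = in_flight;
    while remaining > 0 {
        // Time left in the overall budget, never negative.
        let left = deadline.saturating_sub(started.elapsed());
        match done.recv_timeout(left) {
            Ok(()) => remaining -= 1,
            Err(RecvTimeoutError::Timeout) | Err(RecvTimeoutError::Disconnected) => {
                return DrainOutcome::Aborted { still_running: remaining };
            }
        }
    }
    DrainOutcome::Drained
}
```

The budget is shared across all jobs, so one slow job cannot extend the total shutdown time beyond the configured deadline.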

Rate limits

Token buckets, enforced atomically in Redis Lua so N replicas share one bucket (the bucket reads the Redis clock — replica clock skew cannot double-fill it). 429 responses carry Retry-After. On Redis loss the limiter fails open: Redis is the hint/cache plane and must never reject traffic Postgres can serve. In Redis-less mode an in-process bucket applies per replica.

Knob                                      Default   Applies to
CHIMELY_API_KEY_RATE_PER_SEC / _BURST     50 / 200  per API key: POST /v1/notifications, POST /v1/broadcasts
CHIMELY_SUBSCRIBER_RATE_PER_SEC / _BURST  10 / 50   per subscriber: GET /v1/inbox/items

A rate of 0 disables the limit.
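
For readers unfamiliar with token buckets, here is a minimal in-process sketch in the spirit of the Redis-less fallback. It is not the Lua script or the server's code: the type, its integer milli-token accounting, and the injected clock are assumptions made to keep the example deterministic.

```rust
/// Illustrative in-process token bucket. Tokens are tracked in thousandths
/// so refill arithmetic stays in integers.
pub struct TokenBucket {
    rate_per_sec: u64,
    capacity_milli: u64,
    tokens_milli: u64,
    last_refill_ms: u64,
}

impl TokenBucket {
    /// Starts full, so a client may spend the whole burst immediately.
    pub fn new(rate_per_sec: u64, burst: u64, now_ms: u64) -> Self {
        let capacity_milli = burst.saturating_mul(1000);
        Self { rate_per_sec, capacity_milli, tokens_milli: capacity_milli, last_refill_ms: now_ms }
    }

    /// Ok(()) admits the request. Err(secs) is the whole-second wait that a
    /// 429 response would carry in Retry-After.
    pub fn try_acquire(&mut self, now_ms: u64) -> Result<(), u64> {
        if self.rate_per_sec == 0 {
            return Ok(()); // a rate of 0 disables the limit
        }
        // tokens/sec * elapsed ms = milli-tokens earned since the last refill.
        let elapsed_ms = now_ms.saturating_sub(self.last_refill_ms);
        let earned = elapsed_ms.saturating_mul(self.rate_per_sec);
        self.tokens_milli = self.tokens_milli.saturating_add(earned).min(self.capacity_milli);
        self.last_refill_ms = self.last_refill_ms.max(now_ms);

        if self.tokens_milli >= 1000 {
            self.tokens_milli -= 1000;
            Ok(())
        } else {
            let deficit = 1000 - self.tokens_milli;
            let wait_ms = (deficit + self.rate_per_sec - 1) / self.rate_per_sec;
            Err((wait_ms + 999) / 1000)
        }
    }
}
```

With a rate of 10 and a burst of 2, two requests at the same instant pass, a third is rejected with a one-second Retry-After, and a request 100 ms later passes again because one token has been refilled.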

Retry backoff

Failed jobs reschedule with equal-jittered exponential backoff: attempt n waits a jittered delay in [exp/2, exp], where exp = min(CHIMELY_RETRY_BACKOFF_CAP_MS, CHIMELY_RETRY_BACKOFF_BASE_MS × 2^(n−1)). Defaults: 5s base, 15min cap.
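
The backoff window can be worked through in a few lines. This sketch follows the formula above; the function names are hypothetical, and the jitter source is passed in as a number in [0, 1] so the example needs no random-number crate. How the server actually draws its jitter is not specified here.

```rust
/// Window [exp/2, exp] in milliseconds for a 1-based attempt number,
/// where exp = min(cap, base * 2^(attempt - 1)).
pub fn backoff_window_ms(attempt: u32, base_ms: u64, cap_ms: u64) -> (u64, u64) {
    let doublings = attempt.saturating_sub(1);
    let exp = 2u64
        .checked_pow(doublings)
        .and_then(|factor| base_ms.checked_mul(factor))
        .unwrap_or(u64::MAX) // overflow means "far beyond the cap"
        .min(cap_ms);
    (exp / 2, exp)
}

/// Pick a delay inside the window. `unit` stands in for a random draw
/// in [0, 1]; values outside that range are clamped.
pub fn backoff_delay_ms(attempt: u32, base_ms: u64, cap_ms: u64, unit: f64) -> u64 {
    let (low, high) = backoff_window_ms(attempt, base_ms, cap_ms);
    let unit = unit.clamp(0.0, 1.0);
    low + ((high - low) as f64 * unit) as u64
}
```

With the defaults (5,000 ms base, 900,000 ms cap), attempt 1 waits between 2.5 s and 5 s, attempt 8 between 320 s and 640 s, and from attempt 9 onward the window is pinned at 450 s to 900 s because 5 s × 2^8 exceeds the cap.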

Retention

Monthly partitions on notifications and notification_status_log are pre-created 13 months ahead (covering the deliver_at cap) and dropped (DETACH + DROP) past CHIMELY_RETENTION_MONTHS (default 12). Idempotency snapshots are purged after CHIMELY_IDEMPOTENCY_RETENTION_DAYS (default 30).
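
The month arithmetic behind that window can be sketched as follows. Everything here is illustrative: the Month type and function names are hypothetical, and the exact boundary conditions (whether the current month counts toward the look-ahead, and whether a partition exactly at the retention edge is kept) are assumptions rather than documented behaviour.

```rust
/// Calendar month; `month` is 1..=12.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct Month {
    pub year: i32,
    pub month: u32,
}

impl Month {
    /// Months since year 0, so month arithmetic becomes integer arithmetic.
    fn index(self) -> i32 {
        self.year * 12 + self.month as i32 - 1
    }

    fn from_index(i: i32) -> Month {
        Month { year: i.div_euclid(12), month: i.rem_euclid(12) as u32 + 1 }
    }

    pub fn plus(self, months: i32) -> Month {
        Month::from_index(self.index() + months)
    }
}

/// Partitions that should exist: the current month plus `months_ahead` future months.
pub fn partitions_to_ensure(current: Month, months_ahead: u32) -> Vec<Month> {
    (0..=months_ahead as i32).map(|k| current.plus(k)).collect()
}

/// Assumed rule: a partition is droppable once it is more than
/// `retention_months` behind the current month.
pub fn is_expired(partition: Month, current: Month, retention_months: u32) -> bool {
    partition.index() < current.index() - retention_months as i32
}
```

For a current month of June 2026 and the defaults, the furthest pre-created partition is July 2027, May 2025 is past retention, and June 2025 is still kept.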

Measured capacity

The jobs table is an inbox work queue, not Kafka. The nightly sustained_jobs_churn_stays_bounded_under_autovacuum chaos run measures the sustained rate on whatever box it runs on and asserts that delete-on-complete plus the table's aggressive autovacuum settings (fillfactor 50, threshold 500, no cost delay) keep dead tuples and table size bounded at that rate.

Reference measurement (Apple M-series dev box, Postgres 15 in a container, 4 workers, Redis-less, June 2026): a 20,000-job single-environment backlog of one-statement jobs drained at ~600 jobs/sec sustained; dead tuples peaked between vacuum passes and settled to 0 with the table at 0 KiB afterwards. Notes for reading that number:

  • It is the worst case: a single environment gets one claim per fairness sweep, so per-environment throughput is sweep-bound. Aggregate throughput scales with the number of environments that have pending work.
  • One job is not one notification: deliver jobs process 500 notification rows per claim, so row throughput during fan-outs is hundreds of times higher.
  • Run the nightly lane on your own hardware for your number; the log line is sustained churn: … jobs/sec.
