Operations
Metrics, alerts, dead-letter replay, graceful shutdown, rate limits, and measured capacity.
Everything on this page is enforced by the Phase 3 chaos suite
(server/tests/chaos.rs): killing any process at any moment, losing Redis
entirely, flooding one environment, or replaying any job must not lose
data, drift a counter, or starve a neighbor.
Metrics
GET /metrics (Prometheus text format). Gauges marked sampled are
recomputed from Postgres on a fixed cadence (CHIMELY_METRICS_SAMPLE_MS,
default 15s) — they survive restarts and cannot freeze when a subsystem
stalls.
| Metric | Kind | Meaning |
|---|---|---|
| chimely_queue_depth{environment,job_type} | gauge, sampled | All pending job rows |
| chimely_queue_due{environment,job_type} | gauge, sampled | Jobs past run_at |
| chimely_job_wait_seconds{job_type} | histogram | Due-to-claim latency (the fairness signal) |
| chimely_jobs_processed_total{environment} | counter | Completed claims |
| chimely_jobs_failed_total{environment} | counter | Failed claims (will retry) |
| chimely_jobs_retried_total{job_type} | counter | Backoff reschedules |
| chimely_jobs_parked_total{job_type} | counter | Moves into dead_letters |
| chimely_dead_letters{job_type} | gauge, sampled | Parked jobs awaiting replay |
| chimely_hint_publish_duration_seconds | histogram | Pub/sub round trip |
| chimely_hint_delivery_lag_seconds | histogram | Enqueue-to-publish lag (includes queue wait + debounce) |
| chimely_sse_connections | gauge | Open SSE streams on this replica |
| chimely_counter_drift_unread / _unseen | gauge, sampled | Σ\|recount − maintained\| over recently-active subscribers |
| chimely_partitions_remaining{table} | gauge, sampled | Pre-created future partitions |
| chimely_rate_limited_total | counter | Requests answered 429 |
| chimely_rate_limit_errors_total | counter | Limiter fail-opens (Redis unreachable) |
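The drift gauges sum absolute differences, so an over-count on one subscriber cannot cancel an under-count on another. A minimal sketch of that arithmetic follows; the function name and the (maintained, recount) pair layout are hypothetical, chosen only for illustration, and are not taken from the server code.

```rust
/// Sum of |recount - maintained| over a sample of subscribers.
/// Each tuple is (maintained counter, fresh recount); this layout is an
/// assumption made for the example, not the server's actual types.
pub fn counter_drift(samples: &[(i64, i64)]) -> u64 {
    samples
        .iter()
        // abs_diff avoids overflow and keeps every term non-negative,
        // so errors in opposite directions add up instead of cancelling.
        .map(|&(maintained, recount)| recount.abs_diff(maintained))
        .sum()
}
```

A reading of 0 therefore means every sampled subscriber matched its recount, not merely that the errors averaged out.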
Alerts
- chimely_partitions_remaining < 2: the partition-maintenance job has been dead for ~11 months; with headroom exhausted, inserts fail loudly (there is deliberately no DEFAULT partition). The gauge decays by one per month while the job is stalled, so this alert fires with a full month of pre-created headroom still left.
- chimely_counter_drift_unread > 0 (sustained): a counter-maintenance bug. Transient nonzero readings right after a preference flip resolve as soon as the enqueued counter_rebuild lands.
- chimely_dead_letters > 0: jobs exhausted max_attempts and need an operator (see below).
- chimely_job_wait_seconds p99 growing without a matching queue-depth flood: claim starvation; check for a stuck worker.
- Timeline write latency vs retention. The status log's idempotency guard probes every notification_status_log partition (the table is range-partitioned on occurred_at, and a prior status row can live in any month), so each append costs one index probe per partition. With the default 12-month retention that is ~26 cheap B-tree probes; if you raise CHIMELY_RETENTION_MONTHS substantially, watch notification create/read latency grow with the partition count.
Dead-letter replay
A job that exhausts max_attempts (default 10) moves to the
dead_letters table with its last_error and progress_cursor — the jobs
table itself stays near-empty (completed work leaves no row; parked work
does not live in the hot claim path).
chimely dlq list # one line per parked job
chimely dlq replay job_01h4… # replay one (fresh attempt budget, runs now)
chimely dlq replay --all
Replay re-enqueues the original row (same id, payload, and cursor), so a chunked fan-out resumes where it died and side effects apply exactly once.
Graceful shutdown
On SIGTERM/SIGINT, in order:
1. /readyz flips to 503 while the listener keeps serving for the grace period (CHIMELY_SHUTDOWN_GRACE_MS, default 5000) so load balancers drain the replica without dropping in-flight requests.
2. The listener stops accepting; SSE streams receive a jittered retry: directive and close (no reconnect stampede); workers stop claiming.
3. In-flight jobs get CHIMELY_SHUTDOWN_DRAIN_DEADLINE_MS (default 30000) to finish. Past the deadline the worker is aborted: the open transaction rolls back and the job is re-claimed elsewhere (at-least-once by design, side effects keyed by job deletion).
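The last step, a drain bounded by one shared deadline, can be sketched with standard-library channels. This is an illustration of the deadline accounting only, not the server's shutdown code: DrainReport and drain_in_flight are hypothetical names, each in-flight job is assumed to signal completion on its own channel, and the real abort (which rolls back the open transaction) is represented here by simply counting the job as abandoned.

```rust
use std::sync::mpsc::{Receiver, RecvTimeoutError};
use std::time::{Duration, Instant};

/// Outcome of waiting for in-flight jobs at shutdown (hypothetical type).
#[derive(Debug, PartialEq, Eq)]
pub struct DrainReport {
    pub finished: usize,
    pub abandoned: usize,
}

/// Waits for every job to signal completion, but never past one shared
/// deadline. Jobs that have not reported by then are counted as abandoned;
/// in the real system they would be aborted and re-claimed elsewhere.
pub fn drain_in_flight(done_signals: Vec<Receiver<()>>, deadline: Duration) -> DrainReport {
    let stop_at = Instant::now() + deadline;
    let mut report = DrainReport { finished: 0, abandoned: 0 };
    for done in done_signals {
        // The budget is shared: time spent on earlier jobs shrinks what is left.
        let remaining = stop_at.saturating_duration_since(Instant::now());
        match done.recv_timeout(remaining) {
            Ok(()) => report.finished += 1,
            // Timeout: the deadline passed. Disconnected: the worker went
            // away without reporting. Both leave the job to be retried.
            Err(RecvTimeoutError::Timeout) | Err(RecvTimeoutError::Disconnected) => {
                report.abandoned += 1
            }
        }
    }
    report
}
```

Because the deadline is computed once up front, the total wait stays bounded by the configured value no matter how many jobs are in flight.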
Rate limits
Token buckets, enforced atomically in Redis Lua so N replicas share one
bucket (the bucket reads the Redis clock — replica clock skew cannot
double-fill it). 429 responses carry Retry-After. On Redis loss the
limiter fails open: Redis is the hint/cache plane and must never
reject traffic Postgres can serve. In Redis-less mode an in-process bucket
applies per replica.
| Knob | Default | Applies to |
|---|---|---|
| CHIMELY_API_KEY_RATE_PER_SEC / _BURST | 50 / 200 | per API key: POST /v1/notifications, POST /v1/broadcasts |
| CHIMELY_SUBSCRIBER_RATE_PER_SEC / _BURST | 10 / 50 | per subscriber: GET /v1/inbox/items |
A rate of 0 disables the limit.
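To make the rate and burst knobs concrete, here is a minimal in-process token bucket of the kind the Redis-less mode describes. It is a sketch under stated assumptions, not the Redis Lua script or the server's limiter: TokenBucket and try_acquire are hypothetical names, the bucket is assumed to start full with burst as its capacity, the clock is passed in as milliseconds so the logic stays deterministic, and the Retry-After value is assumed to be rounded up to whole seconds.

```rust
/// In-process token bucket (hypothetical sketch; one instance per key).
pub struct TokenBucket {
    rate_per_sec: f64,
    burst: f64,
    tokens: f64,
    last_ms: u64,
}

impl TokenBucket {
    /// Starts full, so a fresh key may spend its whole burst immediately.
    pub fn new(rate_per_sec: f64, burst: f64, now_ms: u64) -> Self {
        Self { rate_per_sec, burst, tokens: burst, last_ms: now_ms }
    }

    /// Ok(()) admits the request; Err(secs) is a Retry-After hint for a 429.
    pub fn try_acquire(&mut self, now_ms: u64) -> Result<(), u64> {
        // A rate of 0 disables the limit entirely.
        if self.rate_per_sec <= 0.0 {
            return Ok(());
        }
        // Refill for the elapsed time, capped at the burst size.
        // saturating_sub ignores a clock that appears to run backwards.
        let elapsed_ms = now_ms.saturating_sub(self.last_ms);
        self.last_ms = self.last_ms.max(now_ms);
        let refill = elapsed_ms as f64 / 1000.0 * self.rate_per_sec;
        self.tokens = (self.tokens + refill).min(self.burst);

        if self.tokens >= 1.0 {
            self.tokens -= 1.0;
            Ok(())
        } else {
            // Time until one whole token exists, rounded up to whole seconds.
            let wait_secs = (1.0 - self.tokens) / self.rate_per_sec;
            Err(wait_secs.ceil().max(1.0) as u64)
        }
    }
}
```

With the API-key defaults (50 / 200), a fresh key can send 200 requests back to back, after which it is admitted at 50 per second.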
Retry backoff
Failed jobs reschedule with equal-jittered exponential backoff: attempt
n waits in [exp/2, exp] where
exp = min(CHIMELY_RETRY_BACKOFF_CAP_MS, CHIMELY_RETRY_BACKOFF_BASE_MS × 2^(n−1))
(defaults 5s base, 15min cap).
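The window formula above can be written out as a short sketch. The function names are hypothetical, and the random draw is passed in as a unit fraction in [0, 1] so the example needs no external crate; how the server actually samples its jitter is not shown here.

```rust
/// Equal-jitter window for a 1-based attempt number n, in milliseconds.
/// Returns (exp / 2, exp) with exp = min(cap_ms, base_ms * 2^(n - 1)).
pub fn backoff_window_ms(attempt: u32, base_ms: u64, cap_ms: u64) -> (u64, u64) {
    let shift = attempt.saturating_sub(1);
    // checked_shl rejects shifts of 64 or more and checked_mul catches
    // overflow; either case is far past the cap, so clamp to it.
    let exp = 1u64
        .checked_shl(shift)
        .and_then(|factor| base_ms.checked_mul(factor))
        .unwrap_or(u64::MAX)
        .min(cap_ms);
    (exp / 2, exp)
}

/// Picks a delay inside the window from a caller-supplied fraction in [0, 1]
/// (an assumption standing in for the random draw).
pub fn backoff_delay_ms(attempt: u32, base_ms: u64, cap_ms: u64, unit: f64) -> u64 {
    let (low, high) = backoff_window_ms(attempt, base_ms, cap_ms);
    let unit = unit.clamp(0.0, 1.0);
    low + ((high - low) as f64 * unit) as u64
}
```

With the defaults (5000 ms base, 900000 ms cap), attempt 1 waits between 2.5 s and 5 s, attempt 8 between 320 s and 640 s, and from attempt 9 onward the window is pinned at 7.5 to 15 minutes.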
Retention
Monthly partitions on notifications and notification_status_log are
pre-created 13 months ahead (covering the deliver_at cap) and dropped
(DETACH + DROP) past CHIMELY_RETENTION_MONTHS (default 12). Idempotency
snapshots are purged after CHIMELY_IDEMPOTENCY_RETENTION_DAYS (default
30).
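For a rough sense of how the partition count moves with retention, the sketch below adds up retained months, the current month, and the pre-created months. That breakdown is an assumption: it is one way to arrive at the ~26 probes quoted under Alerts for the defaults, not a formula taken from the server, and the function name is hypothetical.

```rust
/// Estimated number of live monthly partitions, and therefore index probes
/// per status-log append. Assumed breakdown: retained past months, plus the
/// current month, plus the months pre-created ahead.
pub fn estimated_live_partitions(retention_months: u32, precreated_ahead: u32) -> u32 {
    retention_months + 1 + precreated_ahead
}
```

Under that assumption the defaults (12 retained, 13 ahead) give 26, and raising CHIMELY_RETENTION_MONTHS to 36 would give about 50 probes per append.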
Measured capacity
The jobs table is an inbox work queue, not Kafka. The nightly
sustained_jobs_churn_stays_bounded_under_autovacuum chaos run measures
the sustained rate on whatever box it runs on and asserts that
delete-on-complete plus the table's aggressive autovacuum settings
(fillfactor 50, threshold 500, no cost delay) keep dead tuples and table
size bounded at that rate.
Reference measurement (Apple M-series dev box, Postgres 15 in a container, 4 workers, Redis-less, June 2026): a 20,000-job single-environment backlog of one-statement jobs drained at ~600 jobs/sec sustained; dead tuples peaked between vacuum passes and settled to 0 with the table at 0 KiB afterwards. Notes for reading that number:
- It is the worst case: a single environment gets one claim per fairness sweep, so per-environment throughput is sweep-bound. Aggregate throughput scales with the number of environments that have pending work.
- One job is not one notification: deliver jobs process 500 notification rows per claim, so row throughput during fan-outs is hundreds of times higher.
- Run the nightly lane on your own hardware for your number; the log line is sustained churn: … jobs/sec.