Runbook — quota-exhaustion
A bucket has fired AQPRatelimitBucketAt80Percent,
AQPRatelimitBucketAt95Percent, or AQPRatelimitBucketExhausted.
Diagnosis (5 min)
-
Open the rate-limit dashboard at
/data/ratelimit. Find the over-consuming(user_id, service, key_id). -
Inspect the
rl_ledgerpartition for the last hour:SELECT decision, count(*), sum(tokens_consumed)
FROM rl_ledger
WHERE ts > now() - interval '1 hour'
AND key_id = :key_id
GROUP BY decision; -
Cross-reference
audit_logfor the callingtool_id—data.ingest.materializeordata.ingest.preview_sourceare the usual culprits.
Decision tree (10 min)
| Cause | Action |
|---|---|
| Misconfigured backfill | Tell the operator to cancel via alphaswarm materialize cancel <reservation_id>. The reservation auto-releases. |
| Vendor downgrade | Mint a higher-tier key via alphaswarm keys mint --service polygon --rps 100 --burst 1000. |
| Stuck connector loop | alphaswarm ratelimit status --key-id <id> shows the call rate; halt the offending Dagster sensor via the topbar kill-switch. |
| Legitimate traffic | Raise the policy via data.ratelimit.policy.update (Tier-P + step-up MFA). |
Recovery (15 min)
-
Once the cause is addressed, the bucket refills at the policy's
refill_rate; no manual reset is required. -
If the operator wants an immediate reset, run the Phase 6 admin script that explicitly DELs the bucket key:
alphaswarm ratelimit admin reset --user-id <uid> --service polygon --key-id primary -
Verify recovery in Grafana:
rl_bucket_remaining{service="polygon.aggregates"} > 50
Postmortem
Every quota-exhaustion alert that requires manual intervention
must produce a postmortem PR within 72 hours. Template:
alphaswarm_docs/docs/how-to/runbooks/templates/postmortem.md (to be authored).