Runbook — quota-exhaustion

A bucket has fired AQPRatelimitBucketAt80Percent, AQPRatelimitBucketAt95Percent, or AQPRatelimitBucketExhausted.

Diagnosis (5 min)

Open the rate-limit dashboard at /data/ratelimit. Find the over-consuming (user_id, service, key_id).

Inspect the rl_ledger partition for the last hour:

SELECT decision, count(*), sum(tokens_consumed)
  FROM rl_ledger
 WHERE ts > now() - interval '1 hour'
   AND key_id = :key_id
 GROUP BY decision;

Cross-reference audit_log for the calling tool_id — data.ingest.materialize or data.ingest.preview_source are the usual culprits.

Decision tree (10 min)

Cause	Action
Misconfigured backfill	Tell the operator to cancel via `alphaswarm materialize cancel <reservation_id>`. The reservation auto-releases.
Vendor downgrade	Mint a higher-tier key via `alphaswarm keys mint --service polygon --rps 100 --burst 1000`.
Stuck connector loop	`alphaswarm ratelimit status --key-id <id>` shows the call rate; halt the offending Dagster sensor via the topbar kill-switch.
Legitimate traffic	Raise the policy via `data.ratelimit.policy.update` (Tier-P + step-up MFA).

Recovery (15 min)

Once the cause is addressed, the bucket refills at the policy's refill_rate; no manual reset is required.
If the operator wants an immediate reset, run the Phase 6 admin script that explicitly DELs the bucket key:
```
alphaswarm ratelimit admin reset --user-id <uid> --service polygon --key-id primary
```

Verify recovery in Grafana:

rl_bucket_remaining{service="polygon.aggregates"} > 50

Postmortem

Every quota-exhaustion alert that requires manual intervention must produce a postmortem PR within 72 hours. Template: alphaswarm_docs/docs/how-to/runbooks/templates/postmortem.md (to be authored).

Diagnosis (5 min)​

Decision tree (10 min)​

Recovery (15 min)​

Postmortem​

Diagnosis (5 min)

Decision tree (10 min)

Recovery (15 min)

Postmortem