Runbook — Disaster Recovery: full restore (under 30 min)
Restores the Phase 6 reliability surface from S3 in three layers.
Layer 1: rate-limit Redis (5 min)
- The Redis primary in alphaswarm-system is gone. The redis-master.alphaswarm-system.svc Service points to no pod.
- Spin up a fresh Redis pod:
    kubectl -n alphaswarm-system scale statefulset/redis --replicas=1
- The Lua scripts re-register lazily on the first Check call (see RedisTokenBucketStrategy._ensure_initialised: EVALSHA failure paths fall back to EVAL and re-register).
- Buckets that were drained in the previous Redis are now full again; this is intentional. The audit log captures every token consumed pre-incident; the operator can replay the ledger to rebuild bucket state if compliance requires it.
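The lazy re-registration described above can be sketched as follows. This is a minimal illustration, not the actual body of RedisTokenBucketStrategy._ensure_initialised: FakeRedis, NoScriptError and run_script are hypothetical stand-ins defined here so the example runs without a Redis server. The method names mirror the EVALSHA, EVAL and SCRIPT LOAD commands, and with redis-py the corresponding error would be redis.exceptions.NoScriptError.

```python
import hashlib


class NoScriptError(Exception):
    """Stand-in for the NOSCRIPT error raised when the server does not know a SHA."""


class FakeRedis:
    """In-memory stand-in for a Redis client; it models only the script cache."""

    def __init__(self):
        self.scripts = {}  # SHA-1 hex digest -> script source
        self.calls = []    # commands issued, in order

    def script_load(self, script):
        sha = hashlib.sha1(script.encode()).hexdigest()
        self.scripts[sha] = script
        self.calls.append("SCRIPT LOAD")
        return sha

    def evalsha(self, sha, numkeys, *keys_and_args):
        self.calls.append("EVALSHA")
        if sha not in self.scripts:
            raise NoScriptError(sha)
        return "ok"

    def eval(self, script, numkeys, *keys_and_args):
        self.calls.append("EVAL")
        return "ok"


def run_script(client, script, keys, args):
    """Try EVALSHA first; on a cache miss fall back to EVAL and re-register."""
    sha = hashlib.sha1(script.encode()).hexdigest()
    try:
        return client.evalsha(sha, len(keys), *keys, *args)
    except NoScriptError:
        # A fresh pod has an empty script cache, so the first call lands here.
        result = client.eval(script, len(keys), *keys, *args)
        client.script_load(script)  # the next call will hit EVALSHA directly
        return result
```

Against a freshly scaled-up pod the first call issues EVALSHA, EVAL and SCRIPT LOAD in that order; every later call needs only EVALSHA, so no manual script-loading step is needed during the restore.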
Layer 2: audit log (10 min)
- The audit_log table is hash-chain-protected (trigger from Alembic 0079). The S3 export (Celery beat task alphaswarm_ratelimit.tasks.ledger_export.export_ledger_window) carries every row in append-only JSONL form.
- Restore the latest window:
    alphaswarm ratelimit admin restore-ledger \
      --bucket alphaswarm-audit-archive \
      --since 2026-05-01 \
      --until 2026-05-24
- The enforce_audit_log_hash_chain Postgres trigger validates every restored row against its predecessor; on violation the restore aborts and surfaces the exact mismatched hex digest.
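To illustrate what validating a row against its predecessor means, here is a small offline sketch that checks a JSONL export before it is replayed. The hashing scheme is an assumption (SHA-256 over the predecessor digest plus a canonical JSON payload, with an all-zero genesis digest), and so are the payload and hash field names; the real trigger from Alembic 0079 may compute the chain differently.

```python
import hashlib
import json

GENESIS = "0" * 64  # assumed predecessor digest for the very first row


def row_hash(prev_hash, payload):
    """SHA-256 over the predecessor digest plus a canonical JSON payload."""
    canonical = json.dumps(payload, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256((prev_hash + canonical).encode()).hexdigest()


def verify_chain(jsonl_lines, prev_hash=GENESIS):
    """Return None if the chain is intact, else (line_no, expected, found)."""
    for line_no, line in enumerate(jsonl_lines, start=1):
        row = json.loads(line)
        expected = row_hash(prev_hash, row["payload"])
        if row["hash"] != expected:
            return (line_no, expected, row["hash"])
        prev_hash = row["hash"]
    return None


def build_chain(payloads):
    """Demo helper: produce JSONL lines that form a valid chain."""
    lines, prev = [], GENESIS
    for payload in payloads:
        digest = row_hash(prev, payload)
        lines.append(json.dumps({"payload": payload, "hash": digest}))
        prev = digest
    return lines
```

Editing the payload of any exported row makes verify_chain report that row's line number together with the expected and found digests, which is the same kind of information the runbook says the trigger surfaces when it aborts.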
Layer 3: dbt-loom manifest registry (10 min)
- The s3://alphaswarm-dbt-manifests bucket is the source of truth for cross-project ref() lookups.
- Restore the latest manifest per project:
    alphaswarm deploy restore-dbt-manifests \
      --env prod \
      --to-bucket alphaswarm-dbt-manifests-restored
- Update the loom.yml in each team project to point at the restored bucket name; downstream dbt parse succeeds with the rehydrated manifests.
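The "latest manifest per project" selection can be sketched as below. The object key layout is purely an assumption made for illustration (project name, then a UTC timestamp in a fixed sortable format, then manifest.json); the real layout of the alphaswarm-dbt-manifests bucket and the internals of restore-dbt-manifests are not described in this runbook.

```python
def latest_manifest_per_project(keys):
    """Map each project to its newest manifest key.

    Assumes keys look like "<project>/<timestamp>/manifest.json" where the
    timestamp has a fixed format that sorts lexicographically (for example
    20260524T101500Z). Keys that do not match the layout are ignored.
    """
    latest = {}
    for key in keys:
        parts = key.split("/")
        if len(parts) != 3 or parts[2] != "manifest.json":
            continue  # not a manifest under the assumed layout
        project, stamp = parts[0], parts[1]
        if project not in latest or stamp > latest[project][0]:
            latest[project] = (stamp, key)
    return {project: key for project, (_stamp, key) in latest.items()}
```

A quick sanity check after the restore is to compare the projects returned here with the projects whose loom.yml was updated; a missing project means its downstream dbt parse will fail on cross-project ref() lookups.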
Phase-gate verification
The full DR test must complete in under 30 min wall-clock.
tests/chaos/test_dr_restore.py orchestrates the three layers
against a fixture cluster + S3 mock and asserts the under-30-min
deadline.
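A deadline check of this kind might look like the sketch below. It is not the content of tests/chaos/test_dr_restore.py; run_layers and the injectable clock are hypothetical names, used here so the timing logic can be exercised without a cluster.

```python
import time

DEADLINE_SECONDS = 30 * 60  # the under-30-min phase gate


def run_layers(layers, deadline=DEADLINE_SECONDS, clock=time.monotonic):
    """Run each (name, callable) in order and return per-layer durations.

    Raises AssertionError as soon as cumulative wall-clock time reaches
    the deadline, naming the layer that pushed the restore over budget.
    """
    start = clock()
    durations = {}
    for name, restore in layers:
        layer_start = clock()
        restore()
        durations[name] = clock() - layer_start
        elapsed = clock() - start
        assert elapsed < deadline, (
            f"{name} finished at {elapsed:.0f}s, past the {deadline}s deadline"
        )
    return durations
```

Recording per-layer durations alongside the overall assertion makes it easier to compare a run against the 5, 10 and 10 minute budgets given in the layer headings.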