Runbook — Disaster Recovery: full restore (under 30 min)

Restores the Phase 6 reliability surface from S3 in three layers.

Layer 1: rate-limit Redis (5 min)

The Redis primary in alphaswarm-system is gone, so the redis-master.alphaswarm-system.svc Service has no pod behind it.

Spin up a fresh Redis pod:

kubectl -n alphaswarm-system scale statefulset/redis --replicas=1

The Lua scripts re-register lazily on the first Check call (see RedisTokenBucketStrategy._ensure_initialised — EVALSHA failure paths fall back to EVAL + re-register).
Buckets that were drained in the previous Redis instance start full again; this is intentional. The audit log captures every token consumed before the incident, so the operator can replay the ledger to rebuild bucket state if compliance requires it.
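The lazy re-registration described above follows the usual EVALSHA-then-EVAL pattern. The sketch below is illustrative only and is not the RedisTokenBucketStrategy code: FakeRedis, NoScriptError and run_script are hypothetical stand-ins so the example runs without a Redis server (with redis-py, the corresponding exception is redis.exceptions.NoScriptError).

```python
import hashlib


class NoScriptError(Exception):
    """Hypothetical stand-in for the NOSCRIPT error a Redis client raises."""


class FakeRedis:
    """In-memory stand-in that models only the server-side script cache."""

    def __init__(self):
        self.scripts = {}  # sha1 hex digest -> script source
        self.calls = []    # command names, in the order they were issued

    def script_load(self, script):
        self.calls.append("script_load")
        sha = hashlib.sha1(script.encode("utf-8")).hexdigest()
        self.scripts[sha] = script
        return sha

    def evalsha(self, sha, numkeys, *args):
        self.calls.append("evalsha")
        if sha not in self.scripts:
            raise NoScriptError(sha)
        return "ok"

    def eval(self, script, numkeys, *args):
        self.calls.append("eval")
        return "ok"


def run_script(client, script, numkeys, *args):
    """Try the cached script first; on a cache miss, EVAL and re-register."""
    sha = hashlib.sha1(script.encode("utf-8")).hexdigest()
    try:
        return client.evalsha(sha, numkeys, *args)
    except NoScriptError:
        # A fresh Redis has an empty script cache, so run the source directly
        # and load it again so later calls can use EVALSHA.
        result = client.eval(script, numkeys, *args)
        client.script_load(script)
        return result
```

On a freshly scaled-up pod the first call takes the fallback path once; every later call should hit the cached script directly.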

Layer 2: audit log (10 min)

The audit_log table is hash-chain-protected (trigger added in Alembic migration 0079). The S3 export (Celery beat task alphaswarm_ratelimit.tasks.ledger_export.export_ledger_window) carries every row in append-only JSONL form.

Restore the latest window:

alphaswarm ratelimit admin restore-ledger \
  --bucket alphaswarm-audit-archive \
  --since 2026-05-01 \
  --until 2026-05-24

The enforce_audit_log_hash_chain Postgres trigger validates every restored row against its predecessor; on violation the restore aborts and surfaces the exact mismatched hex digest.
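To make the chain check concrete, here is a minimal sketch of validating each row against its predecessor. It is not the enforce_audit_log_hash_chain trigger itself: the digest scheme (SHA-256 over the previous digest plus canonical JSON), the all-zero genesis digest, and the prev_hash, payload and row_hash field names are all assumptions chosen for illustration.

```python
import hashlib
import json

GENESIS_HASH = "0" * 64  # assumed predecessor digest for the very first row


def row_digest(prev_hash, payload):
    """SHA-256 over the predecessor digest plus a canonical JSON payload."""
    canonical = json.dumps(payload, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256((prev_hash + canonical).encode("utf-8")).hexdigest()


def build_chain(payloads):
    """Build a valid chain of rows from a list of payload dicts."""
    rows, prev = [], GENESIS_HASH
    for payload in payloads:
        digest = row_digest(prev, payload)
        rows.append({"prev_hash": prev, "payload": payload, "row_hash": digest})
        prev = digest
    return rows


def first_violation(rows):
    """Return (index, expected_digest) for the first bad row, or None."""
    prev = GENESIS_HASH
    for index, row in enumerate(rows):
        expected = row_digest(prev, row["payload"])
        if row["prev_hash"] != prev or row["row_hash"] != expected:
            return index, expected
        prev = row["row_hash"]
    return None
```

Because each digest folds in the previous one, tampering with any row invalidates that row and everything after it, which is why a restore can stop at the first mismatch.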

Layer 3: dbt-loom manifest registry (10 min)

The s3://alphaswarm-dbt-manifests bucket is the source of truth for cross-project ref() lookups.

Restore the latest manifest per project:

alphaswarm deploy restore-dbt-manifests \
  --env prod \
  --to-bucket alphaswarm-dbt-manifests-restored

Update the loom.yml in each team project to point at the restored bucket name. Downstream dbt parse should then succeed against the rehydrated manifests.
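Selecting "the latest manifest per project" can be sketched as below. This is not the restore-dbt-manifests implementation: the object key layout (project/timestamp/manifest.json) and the timestamp format are hypothetical, since the runbook does not describe how the bucket is organised.

```python
from datetime import datetime

# Hypothetical timestamp format for the middle segment of each object key.
STAMP_FORMAT = "%Y-%m-%dT%H-%M-%SZ"


def latest_manifest_per_project(keys):
    """Return {project: key} holding the newest manifest key per project."""
    latest = {}
    for key in keys:
        parts = key.split("/")
        if len(parts) != 3 or parts[2] != "manifest.json":
            continue  # ignore objects that do not match the assumed layout
        project, stamp, _ = parts
        try:
            when = datetime.strptime(stamp, STAMP_FORMAT)
        except ValueError:
            continue  # ignore keys whose timestamp segment does not parse
        if project not in latest or when > latest[project][0]:
            latest[project] = (when, key)
    return {project: key for project, (when, key) in latest.items()}
```

Parsing the timestamp rather than comparing raw strings keeps the selection correct even if the stamp format is later changed to one that does not sort lexicographically.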

Phase-gate verification

The full DR test must complete in under 30 min wall-clock. tests/chaos/test_dr_restore.py orchestrates the three layers against a fixture cluster and an S3 mock, and asserts the under-30-min deadline.
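A deadline assertion of this kind can be sketched as follows. This is not the content of tests/chaos/test_dr_restore.py: run_layers, assert_within_deadline and the injectable clock are hypothetical names used only to show per-layer timing on a monotonic clock.

```python
import time

DEADLINE_SECONDS = 30 * 60  # the under-30-min gate from this runbook


def run_layers(layers, clock=time.monotonic):
    """Run each (name, step) in order; return per-layer and total seconds."""
    timings = {}
    start = clock()
    for name, step in layers:
        began = clock()
        step()
        timings[name] = clock() - began
    return timings, clock() - start


def assert_within_deadline(total_seconds, deadline=DEADLINE_SECONDS):
    """Fail loudly when the whole restore did not beat the deadline."""
    assert total_seconds < deadline, (
        f"DR restore took {total_seconds:.0f}s, deadline is {deadline}s"
    )
```

Using a monotonic clock avoids false failures from wall-clock adjustments during the run, and recording per-layer timings shows which layer exceeded its 5, 10 or 10 minute budget when the gate fails.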
