Operations runbook — Incident response

Standard playbook for diagnosing + recovering from AlphaSwarm production incidents.

Triage tree

                         Incident detected
                                |
              +-----------------+-----------------+
              |                                   |
       Workload error                     Platform error
              |                                   |
   +----------+----------+         +-------------+-------------+
   |          |          |         |             |             |
  Single   Several    All deps   Auth         Network        Storage
  pod      pods       degraded   (Auth0)      / Ingress      (Postgres
  crashes  CrashLoop                          rate-limited)   / Redis)
   |          |          |         |             |             |
   v          v          v         v             v             v
[A: pod    [B: HPA   [C: drain   [D: jwks    [E: ingress    [F: stateful
 logs]      thrash]   the queue]  503]        503]           failover]

Common diagnostic commands

# Pod status across both AlphaSwarm namespaces
kubectl get pods -n alphaswarm -o wide
kubectl get pods -n alphaswarm-admin -o wide

# Top resource consumers
kubectl top pods -n alphaswarm --sort-by=cpu
kubectl top pods -n alphaswarm --sort-by=memory

# Recent events
kubectl get events -n alphaswarm --sort-by='.lastTimestamp' | tail -n 30

# Tail logs for the API
kubectl logs -n alphaswarm deploy/alphaswarm-core --tail=200 -f

# Control-plane audit log (rolled to stdout by default; if ALPHASWARM_CP_AUDIT_LOG_PATH
# is set, also written to a file).
kubectl logs -n alphaswarm-admin deploy/alphaswarm-cp --tail=200 -f | findstr workload_run

# Recent terraform_runs (provisioning audit ledger).
kubectl exec -n alphaswarm deploy/alphaswarm-core -- python -m alphaswarm.cli runs list --limit 20

Scenario A — single pod crashes

# Identify the crashing pod
kubectl get pods -n alphaswarm -l app=alphaswarm-core | findstr CrashLoop

# Inspect
kubectl describe pod -n alphaswarm <pod-name>
kubectl logs -n alphaswarm <pod-name> --previous

# Rolling restart of the deployment (HPA + PDB keep the service up)
kubectl rollout restart -n alphaswarm deployment/alphaswarm-core

Scenario B — HPA thrashing

The HPA is scaling rapidly up + down, never stabilising.

# Check the HPA's recent decisions
kubectl describe hpa -n alphaswarm alphaswarm-core

# Most common cause: a runaway query or backtest that spikes CPU then
# crashes back. Check the audit log for recent task starts.
kubectl logs -n alphaswarm deploy/alphaswarm-worker --tail=500 | findstr "started\|finished\|FAILED"

# Mitigation: temporarily widen the HPA stabilizationWindow.
kubectl patch hpa -n alphaswarm alphaswarm-core --type='json' -p='[
  {"op":"replace","path":"/spec/behavior/scaleUp/stabilizationWindowSeconds","value":300}
]'

Scenario C — Celery queue depth alarm

# Drain the queue from the worker side
kubectl exec -n alphaswarm deploy/alphaswarm-worker -- celery -A alphaswarm.tasks.celery_app inspect active

# Scale workers up
kubectl scale -n alphaswarm deployment/alphaswarm-worker --replicas=8

# Or via the control plane (lands an audit row)
curl -X PATCH https://manage.alphaswarm.enterprise.com/manage/deployments/alphaswarm-worker/scale?replicas=8 `
  -H "Authorization: Bearer $TOKEN"

Scenario D — Auth0 JWKS returns 503

Symptom: every authenticated request fails with jwks_unreachable.

# Probe JWKS directly from inside a pod
kubectl exec -n alphaswarm deploy/alphaswarm-core -- curl -fsS https://your-tenant.us.auth0.com/.well-known/jwks.json

# Common causes:
# - Auth0 service incident (https://status.auth0.com/)
# - Outbound 443 blocked by NetworkPolicy (check network-policies.yaml)
# - DNS resolution failure inside the cluster

# Mitigation: flip ALPHASWARM_AUTH_ENFORCE to permissive for read-only routes
# while you wait for Auth0 to recover. ONLY do this if your operator
# UI is firewalled at the Ingress layer.
kubectl set env -n alphaswarm deploy/alphaswarm-core ALPHASWARM_AUTH_ENFORCE=permissive

Scenario E — Ingress returns 503

# Check NGINX Ingress controller
kubectl -n ingress-nginx get pods
kubectl -n ingress-nginx logs deploy/ingress-nginx-controller --tail=200

# Check service endpoints
kubectl get endpoints -n alphaswarm alphaswarm-client
kubectl get endpoints -n alphaswarm-admin alphaswarm-cp

# If endpoints are empty, the pods aren't passing readinessProbe.

Scenario F — Stateful service failover

Postgres

The compose stack uses a single Postgres pod backed by a PVC. K8s overlays should be migrated to a managed Postgres (Aurora / Cloud SQL / Azure Database for PostgreSQL) before going to prod.

For dev/staging:

kubectl -n alphaswarm delete pod -l app=postgres   # restarts; data persists in PVC

Redis Stack

kubectl -n alphaswarm delete pod redis-master-0   # StatefulSet brings it back

Post-incident

Open an incident report in alphaswarm_docs/incidents/<yyyy-mm-dd>-<slug>.md.
Capture: timeline, blast radius, root cause, fixes applied, follow-ups.
If a hard rule was bypassed (e.g. ALPHASWARM_AUTH_ENFORCE flipped to permissive), schedule the revert as a P1 task.
Add a regression test to prevent the same class of incident.

Triage tree​

Common diagnostic commands​

Scenario A — single pod crashes​

Scenario B — HPA thrashing​

Scenario C — Celery queue depth alarm​

Scenario D — Auth0 JWKS returns 503​

Scenario E — Ingress returns 503​

Scenario F — Stateful service failover​

Postgres​

Redis Stack​

Post-incident​