Operations runbook — Incident response
Standard playbook for diagnosing + recovering from AlphaSwarm production incidents.
Triage tree
Incident detected
|
+-----------------+-----------------+
| |
Workload error Platform error
| |
+----------+----------+ +-------------+-------------+
| | | | | |
Single Several All deps Auth Network Storage
pod pods degraded (Auth0) / Ingress (Postgres
crashes CrashLoop rate-limited) / Redis)
| | | | | |
v v v v v v
[A: pod [B: HPA [C: drain [D: jwks [E: ingress [F: stateful
logs] thrash] the queue] 503] 503] failover]
Common diagnostic commands
# Pod status across both AlphaSwarm namespaces
kubectl get pods -n alphaswarm -o wide
kubectl get pods -n alphaswarm-admin -o wide
# Top resource consumers
kubectl top pods -n alphaswarm --sort-by=cpu
kubectl top pods -n alphaswarm --sort-by=memory
# Recent events
kubectl get events -n alphaswarm --sort-by='.lastTimestamp' | tail -n 30
# Tail logs for the API
kubectl logs -n alphaswarm deploy/alphaswarm-core --tail=200 -f
# Control-plane audit log (rolled to stdout by default; if ALPHASWARM_CP_AUDIT_LOG_PATH
# is set, also written to a file).
kubectl logs -n alphaswarm-admin deploy/alphaswarm-cp --tail=200 -f | findstr workload_run
# Recent terraform_runs (provisioning audit ledger).
kubectl exec -n alphaswarm deploy/alphaswarm-core -- python -m alphaswarm.cli runs list --limit 20
Scenario A — single pod crashes
# Identify the crashing pod
kubectl get pods -n alphaswarm -l app=alphaswarm-core | findstr CrashLoop
# Inspect
kubectl describe pod -n alphaswarm <pod-name>
kubectl logs -n alphaswarm <pod-name> --previous
# Rolling restart of the deployment (HPA + PDB keep the service up)
kubectl rollout restart -n alphaswarm deployment/alphaswarm-core
Scenario B — HPA thrashing
The HPA is scaling rapidly up + down, never stabilising.
# Check the HPA's recent decisions
kubectl describe hpa -n alphaswarm alphaswarm-core
# Most common cause: a runaway query or backtest that spikes CPU then
# crashes back. Check the audit log for recent task starts.
kubectl logs -n alphaswarm deploy/alphaswarm-worker --tail=500 | findstr "started\|finished\|FAILED"
# Mitigation: temporarily widen the HPA stabilizationWindow.
kubectl patch hpa -n alphaswarm alphaswarm-core --type='json' -p='[
{"op":"replace","path":"/spec/behavior/scaleUp/stabilizationWindowSeconds","value":300}
]'
Scenario C — Celery queue depth alarm
# Drain the queue from the worker side
kubectl exec -n alphaswarm deploy/alphaswarm-worker -- celery -A alphaswarm.tasks.celery_app inspect active
# Scale workers up
kubectl scale -n alphaswarm deployment/alphaswarm-worker --replicas=8
# Or via the control plane (lands an audit row)
curl -X PATCH https://manage.alphaswarm.enterprise.com/manage/deployments/alphaswarm-worker/scale?replicas=8 `
-H "Authorization: Bearer $TOKEN"
Scenario D — Auth0 JWKS returns 503
Symptom: every authenticated request fails with jwks_unreachable.
# Probe JWKS directly from inside a pod
kubectl exec -n alphaswarm deploy/alphaswarm-core -- curl -fsS https://your-tenant.us.auth0.com/.well-known/jwks.json
# Common causes:
# - Auth0 service incident (https://status.auth0.com/)
# - Outbound 443 blocked by NetworkPolicy (check network-policies.yaml)
# - DNS resolution failure inside the cluster
# Mitigation: flip ALPHASWARM_AUTH_ENFORCE to permissive for read-only routes
# while you wait for Auth0 to recover. ONLY do this if your operator
# UI is firewalled at the Ingress layer.
kubectl set env -n alphaswarm deploy/alphaswarm-core ALPHASWARM_AUTH_ENFORCE=permissive
Scenario E — Ingress returns 503
# Check NGINX Ingress controller
kubectl -n ingress-nginx get pods
kubectl -n ingress-nginx logs deploy/ingress-nginx-controller --tail=200
# Check service endpoints
kubectl get endpoints -n alphaswarm alphaswarm-client
kubectl get endpoints -n alphaswarm-admin alphaswarm-cp
# If endpoints are empty, the pods aren't passing readinessProbe.
Scenario F — Stateful service failover
Postgres
The compose stack uses a single Postgres pod backed by a PVC. K8s overlays should be migrated to a managed Postgres (Aurora / Cloud SQL / Azure Database for PostgreSQL) before going to prod.
For dev/staging:
kubectl -n alphaswarm delete pod -l app=postgres # restarts; data persists in PVC
Redis Stack
kubectl -n alphaswarm delete pod redis-master-0 # StatefulSet brings it back
Post-incident
- Open an incident report in
alphaswarm_docs/incidents/<yyyy-mm-dd>-<slug>.md. - Capture: timeline, blast radius, root cause, fixes applied, follow-ups.
- If a hard rule was bypassed (e.g. ALPHASWARM_AUTH_ENFORCE flipped to permissive), schedule the revert as a P1 task.
- Add a regression test to prevent the same class of incident.