Kill Switch Incident Response

Three-scope kill switch (bot / fleet / platform). Quarterly drill required per blueprint caveat #7.

Scopes

Scope	What it halts	Typical use
`bot`	One Pod (one bot slug)	A single bot is misbehaving
`fleet`	Every bot in a fleet	A fleet-wide alpha goes stale
`platform`	Every bot on the platform	Emergency — venue outage, regulatory action

Engage

Via CRD (preferred — leaves audit trail)

kubectl apply -f - <<EOF
apiVersion: quantbot.io/v1
kind: KillSwitch
metadata:
  name: emergency-<reason>
  namespace: alphaswarm-bots
spec:
  scope: bot                # bot | fleet | platform
  target: mm-aapl           # bot slug / fleet name / "platform"
  mode: flatten             # cancel | flatten | freeze
  reason: "venue outage; halting until investigation complete"
  ttl: 1h
EOF
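
When the scope, target, or mode changes between incidents, hand-editing the manifest under pressure invites typos. The following is a minimal sketch, not part of the platform tooling: render_killswitch is a hypothetical helper that validates the scope and mode values listed in the manifest comments above and prints the manifest to stdout. The slug rule and the default TTL of 1h are assumptions taken from the example manifest.

```shell
# Hypothetical helper: print a KillSwitch manifest for `kubectl apply -f -`.
# Usage: render_killswitch SLUG SCOPE TARGET MODE REASON [TTL]
render_killswitch() {
  local slug="$1" scope="$2" target="$3" mode="$4" reason="$5" ttl="${6:-1h}"

  # metadata.name must be a valid Kubernetes object name, so restrict the slug.
  case "$slug" in
    ""|*[!a-z0-9-]*) echo "invalid slug: $slug" >&2; return 1 ;;
  esac
  case "$scope" in
    bot|fleet|platform) ;;
    *) echo "invalid scope: $scope" >&2; return 1 ;;
  esac
  case "$mode" in
    cancel|flatten|freeze) ;;
    *) echo "invalid mode: $mode" >&2; return 1 ;;
  esac

  # Assumes REASON contains no double quotes; it is not YAML-escaped here.
  cat <<EOF
apiVersion: quantbot.io/v1
kind: KillSwitch
metadata:
  name: emergency-${slug}
  namespace: alphaswarm-bots
spec:
  scope: ${scope}
  target: ${target}
  mode: ${mode}
  reason: "${reason}"
  ttl: ${ttl}
EOF
}
```

Example use: render_killswitch venue-outage platform platform flatten "venue outage; halting until investigation complete" | kubectl apply -f -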

Via the REST kill-switch fan-out (UI button)

The operator UI's KillSwitch topbar component calls the following halt endpoints in parallel:

POST /agents/halt
POST /quant-agents/halt
POST /paper/stop-all
POST /bots/halt-all ← halts every active bot deployment
POST /rl/halt-all
POST /workflows/halt

This is the operator-side equivalent of KillSwitch.scope=platform. Use it when GitOps reconciliation is too slow: the CRD path can take up to poll_interval_s seconds to propagate, whereas the REST fan-out halts each subsystem directly, without waiting for a reconcile cycle.
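
If the UI itself is unavailable but the API still answers, the same fan-out can be approximated from a terminal. This is a minimal sketch under stated assumptions: halt_fanout is a hypothetical helper, the base URL is passed in because this runbook does not name the API host, and no authentication headers are sent because the auth scheme is not documented here.

```shell
# Endpoints copied from the UI fan-out list above.
HALT_ENDPOINTS="/agents/halt /quant-agents/halt /paper/stop-all /bots/halt-all /rl/halt-all /workflows/halt"

# Hypothetical helper: POST to every halt endpoint concurrently.
# Usage: halt_fanout BASE_URL   (BASE_URL is an assumption, e.g. the operator API origin)
halt_fanout() {
  local base="$1" pids="" failed=0 ep pid

  for ep in $HALT_ENDPOINTS; do
    # --fail makes curl exit non-zero on HTTP errors (status >= 400).
    curl --silent --show-error --fail --max-time 10 -X POST "${base}${ep}" >/dev/null &
    pids="$pids $!"
  done

  # Collect every background request; count the ones that failed.
  for pid in $pids; do
    wait "$pid" || failed=$((failed + 1))
  done

  echo "failed=${failed}"
  [ "$failed" -eq 0 ]
}
```

A non-zero exit status means at least one endpoint did not acknowledge the halt, in which case the Redis channel described next remains available.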

Via the redundant Redis polling channel (last resort)

If the Argo CD reconciler is unhealthy AND the REST API is unreachable:

# Directly set the kill switch key in the bots namespace's Redis.
kubectl exec -n alphaswarm-bots redis-master-0 -- \
  redis-cli SET 'alphaswarm:bots:killswitch:platform:platform' 'manual-emergency'

Each bot polls this key every 5 seconds (configurable via KillSwitchV2.poll_interval_s) and halts when set. This is the fallback documented in blueprint caveat #7.
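
The key in the command above appears to follow a prefix:scope:target layout. The sketch below builds keys on that assumption; only the platform key is confirmed by this runbook, so the bot and fleet forms, and the idea that a bot checks all three keys, are inferences that should be verified against the bot's polling code. Both function names are hypothetical.

```shell
# Build the kill switch key for one scope/target pair.
# Only 'platform platform' is confirmed by the runbook; other forms are inferred.
killswitch_key() {
  case "$1" in
    bot|fleet|platform)
      printf 'alphaswarm:bots:killswitch:%s:%s\n' "$1" "$2" ;;
    *)
      echo "unknown scope: $1" >&2
      return 1 ;;
  esac
}

# List the keys that would apply to one bot (assumption: a bot is halted by
# its own key, its fleet's key, or the platform key).
# Usage: killswitch_keys_for_bot BOT_SLUG FLEET_NAME
killswitch_keys_for_bot() {
  killswitch_key bot "$1"
  killswitch_key fleet "$2"
  killswitch_key platform platform
}
```

For example, killswitch_key platform platform prints the exact key used in the last-resort command.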

Release

Via CRD

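# Substitute the real suffix for <reason> first; pasted as-is, the shell treats < and > as redirections.
kubectl delete killswitch emergency-<reason> -n alphaswarm-bots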

Via Redis (matching the last-resort engage)

kubectl exec -n alphaswarm-bots redis-master-0 -- \
  redis-cli DEL 'alphaswarm:bots:killswitch:platform:platform'

Verify

# CRD view:
kubectl get killswitches -A
# Status:
kubectl get killswitches -A -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.status.engaged}{"\n"}{end}'
# Redis view:
kubectl exec -n alphaswarm-bots redis-master-0 -- \
  redis-cli --scan --pattern 'alphaswarm:bots:killswitch:*'
# Affected bots (operator status):
kubectl get bots -A -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.status.killSwitchEngaged}{"\t"}{.status.killSwitchReason}{"\n"}{end}'
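
On a large platform the bot listing is long, and the rows that matter are the bots that have not engaged. A small filter over the tab-separated output can surface them. This is a sketch: bots_not_engaged is a hypothetical helper, and it assumes status.killSwitchEngaged renders as the literal string true when engaged and as false or empty otherwise.

```shell
# Read "name<TAB>engaged<TAB>reason" lines on stdin and print the names of
# bots whose engaged column is anything other than "true".
bots_not_engaged() {
  awk -F '\t' '$2 != "true" { print $1 }'
}
```

Piping the "Affected bots" command above into bots_not_engaged should print nothing once a platform-scope kill switch has fully propagated.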

Quarterly drill (caveat #7)

The blueprint mandates a quarterly drill because the worst time to discover the kill switch is broken is during a real incident.

Drill protocol

1. Schedule a 15-minute window during low-activity hours.
2. Engage scope=platform via the CRD path.
3. Verify that every bot listed by kubectl get bots -A transitions to status.phase=Draining within 10 seconds.
4. Verify that every bot reaches Stopped within 30 seconds (HFT bots) or 300 seconds (all other bots).
5. Release the kill switch.
6. Verify that the bots restart automatically (their Deployments / StatefulSets reconcile).
7. Repeat steps 2-6 with the REST fan-out path.
8. Repeat steps 2-6 with the Redis fallback path.
9. Record the drill in the kill_switch_drills evidence section of the next RTS 6 validation report.

Failure of any of the three paths is a P1 incident. Fix before the drill window closes.
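
The timing checks in the protocol (Draining within 10 seconds, Stopped within 30 or 300 seconds) are easier to judge with a polling loop than by eye. The following is a minimal sketch, not an official drill script. All function names are hypothetical, the "hft" class label is an assumption about how HFT bots are identified, and elapsed time is approximated by counting one-second sleeps.

```shell
# Stop deadline from the drill protocol: 30 s for HFT bots, 300 s otherwise.
stop_deadline_s() {
  case "$1" in
    hft) echo 30 ;;
    *)   echo 300 ;;
  esac
}

# Print "name<TAB>phase" for every bot on the platform.
list_bot_phases() {
  kubectl get bots -A -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.status.phase}{"\n"}{end}'
}

# Read "name<TAB>phase" lines on stdin; print the bots not in the wanted phase.
bots_not_in_phase() {
  awk -F '\t' -v want="$1" '$2 != want { print $1 "\t" $2 }'
}

# Poll once per second until every bot reports PHASE or DEADLINE_S elapses.
# Usage: wait_for_phase PHASE DEADLINE_S
wait_for_phase() {
  local want="$1" deadline_s="$2" waited=0
  while [ "$waited" -le "$deadline_s" ]; do
    if [ -z "$(list_bot_phases | bots_not_in_phase "$want")" ]; then
      return 0
    fi
    sleep 1
    waited=$((waited + 1))
  done
  return 1
}
```

During a drill this might be run as wait_for_phase Draining 10, followed by wait_for_phase Stopped "$(stop_deadline_s other)". A stricter drill would apply the 30-second deadline to the HFT bots separately rather than one deadline to the whole platform.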

Common failure modes

Operator pod is down. Symptom: the KillSwitch CR is created but bots do not halt within poll_interval_s. Mitigation: use the Redis polling fallback, which bypasses the operator entirely.
Redis pod is down. Symptom: neither the operator nor the polling fallback can halt bots through Redis. Mitigation: at least one of the operator's in-memory CR watcher or the REST API fan-out path should still halt bots. If all three fail simultaneously, escalate to manually scaling every bot-* Deployment to zero replicas with kubectl scale --replicas=0. Note that kubectl does not expand the bot-* wildcard itself, so the Deployments must be named or listed individually.
Redis Pub/Sub vs SET-key drift. Symptom: the pub/sub channel drops messages while the SET key is intact. Mitigation: the polling fallback still halts bots after at most one interval, because KillSwitchV2.poll_interval_s is the upper bound on its latency.
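
For the manual escalation in the Redis failure mode, kubectl scale needs concrete resource names, since it does not expand a pattern such as bot-*. The sketch below lists the Deployments in a namespace, keeps the ones whose names start with bot-, and scales each to zero. scale_down_bots is a hypothetical helper; the bot- prefix comes from the escalation text above, and bots run as StatefulSets (if any) would need the same treatment with statefulsets in place of deployments.

```shell
# Scale every Deployment named bot-* in a namespace to zero replicas.
# Usage: scale_down_bots NAMESPACE
scale_down_bots() {
  local ns="$1" name

  # `-o name` prints one "deployment.apps/<name>" per line.
  kubectl get deployments -n "$ns" -o name |
    grep '^deployment\.apps/bot-' |
    while IFS= read -r name; do
      kubectl scale "$name" -n "$ns" --replicas=0
    done
}
```

Example use: scale_down_bots alphaswarm-bots. The kill switch paths above remain preferable whenever any of them works, so this is a last step.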
