Kill Switch Incident Response
Three-scope kill switch (bot / fleet / platform). Quarterly drill required per blueprint caveat #7.
Scopes
| Scope | What it halts | Typical use |
|---|---|---|
| bot | One Pod (one bot slug) | A single bot is misbehaving |
| fleet | Every bot in a fleet | A fleet-wide alpha goes stale |
| platform | Every bot on the platform | Emergency: venue outage, regulatory action |
Engage
Via CRD (preferred — leaves audit trail)
kubectl apply -f - <<EOF
apiVersion: quantbot.io/v1
kind: KillSwitch
metadata:
  name: emergency-<reason>
  namespace: alphaswarm-bots
spec:
  scope: bot        # bot | fleet | platform
  target: mm-aapl   # bot slug / fleet name / "platform"
  mode: flatten     # cancel | flatten | freeze
  reason: "venue outage; halting until investigation complete"
  ttl: 1h
EOF
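The manifest above can be templated so that only the variable fields are typed during an incident. This is a minimal sketch, assuming the field names and allowed values shown in the manifest comments; the function name render_killswitch is hypothetical and not part of the platform's tooling.

```shell
# Hypothetical helper: print a KillSwitch manifest for the given arguments.
# Usage: render_killswitch NAME SCOPE TARGET MODE REASON [TTL]
# REASON must not contain double quotes (it is emitted inside a quoted string).
render_killswitch() (
  name=$1 scope=$2 target=$3 mode=$4 reason=$5 ttl=${6:-1h}

  # Reject values outside the sets listed in the manifest comments.
  case "$scope" in
    bot|fleet|platform) ;;
    *) echo "invalid scope: $scope" >&2; exit 1 ;;
  esac
  case "$mode" in
    cancel|flatten|freeze) ;;
    *) echo "invalid mode: $mode" >&2; exit 1 ;;
  esac

  cat <<EOF
apiVersion: quantbot.io/v1
kind: KillSwitch
metadata:
  name: $name
  namespace: alphaswarm-bots
spec:
  scope: $scope
  target: $target
  mode: $mode
  reason: "$reason"
  ttl: $ttl
EOF
)
```

A typical call would be render_killswitch emergency-venue-outage bot mm-aapl flatten "venue outage" | kubectl apply -f -, which keeps the CRD path's audit trail while rejecting a mistyped scope or mode before anything reaches the cluster.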
Via the REST kill-switch fan-out (UI button)
The operator UI's KillSwitch topbar component calls the following
halt endpoints in parallel:
- POST /agents/halt
- POST /quant-agents/halt
- POST /paper/stop-all
- POST /bots/halt-all ← halts every active bot deployment
- POST /rl/halt-all
- POST /workflows/halt
This is the equivalent of KillSwitch.scope=platform from the operator
side. Use it when GitOps reconciliation is too slow (the CRD path can
take up to poll_interval_s seconds; the REST fan-out is instant).
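If the UI itself is unavailable but the API is reachable, the same fan-out can be approximated from a shell. This is a sketch only: the base URL argument, the 10-second timeout, and the absence of authentication headers are assumptions, and halt_all is a hypothetical name.

```shell
# Halt endpoints called by the operator UI, as listed in this runbook.
HALT_ENDPOINTS="/agents/halt /quant-agents/halt /paper/stop-all /bots/halt-all /rl/halt-all /workflows/halt"

# Hypothetical helper: POST to every halt endpoint in parallel.
# Usage: halt_all BASE_URL
# Authentication is omitted here and must be added for a real API.
# Returns non-zero if any request fails.
halt_all() (
  base_url=$1
  pids=""
  for path in $HALT_ENDPOINTS; do
    # --fail makes curl exit non-zero on HTTP 4xx/5xx responses.
    curl --silent --show-error --fail --max-time 10 \
      -X POST "$base_url$path" >/dev/null &
    pids="$pids $!"
  done

  # Wait for each request individually so one failure is not masked.
  failed=0
  for pid in $pids; do
    wait "$pid" || failed=1
  done
  exit "$failed"
)
```

A non-zero return means at least one subsystem may still be running, so follow up with the Verify commands rather than assuming the platform is halted.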
Via the redundant Redis polling channel (last resort)
If the Argo CD reconciler is unhealthy AND the REST API is unreachable:
# Directly set the kill switch key in the bots namespace's Redis.
kubectl exec -n alphaswarm-bots redis-master-0 -- \
redis-cli SET 'alphaswarm:bots:killswitch:platform:platform' 'manual-emergency'
Each bot polls this key every 5 seconds (configurable via
KillSwitchV2.poll_interval_s) and halts when set. This is the
fallback documented in blueprint caveat #7.
Release
Via CRD
kubectl delete killswitch emergency-<reason> -n alphaswarm-bots
Via Redis (matching the last-resort engage)
kubectl exec -n alphaswarm-bots redis-master-0 -- \
redis-cli DEL 'alphaswarm:bots:killswitch:platform:platform'
Verify
# CRD view:
kubectl get killswitches -A
# Status:
kubectl get killswitches -A -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.status.engaged}{"\n"}{end}'
# Redis view:
kubectl exec -n alphaswarm-bots redis-master-0 -- \
redis-cli --scan --pattern 'alphaswarm:bots:killswitch:*'
# Affected bots (operator status):
kubectl get bots -A -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.status.killSwitchEngaged}{"\t"}{.status.killSwitchReason}{"\n"}{end}'
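On a large platform the bot listing is long, and the rows that matter are the bots that have not halted. A small filter over the tab-separated output above can surface them. This is a sketch that assumes status.killSwitchEngaged prints as the literal true when engaged; unhalted_bots is a hypothetical name.

```shell
# Hypothetical helper: read "name<TAB>engaged<TAB>reason" lines on stdin and
# print the names of bots whose kill switch is not reported as engaged.
# A missing or empty status field counts as not engaged.
unhalted_bots() {
  awk -F '\t' '$1 != "" && $2 != "true" { print $1 }'
}
```

Pipe the "Affected bots" command into unhalted_bots; empty output means every listed bot reports the kill switch as engaged.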
Quarterly drill (caveat #7)
The blueprint mandates a quarterly drill because the worst time to discover the kill switch is broken is during a real incident.
Drill protocol
- Schedule a 15-minute window during low-activity hours.
- Engage scope=platform via the CRD path.
- Verify every bot in kubectl get bots -A transitions to status.phase=Draining within 10 seconds.
- Verify every bot reaches Stopped within 30 seconds (HFT) or 300 seconds (everything else).
- Release the kill switch.
- Verify bots auto-restart (their Deployments / StatefulSets reconcile).
- Repeat with the REST fan-out path.
- Repeat with the Redis fallback path.
- Record the drill in the next RTS 6 validation report's kill_switch_drills evidence section.
Failure of any of the three paths is a P1 incident. Fix before the drill window closes.
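The two timing checks in the protocol can be scripted so that a missed deadline is unambiguous. The following is a sketch under stated assumptions: wait_for_phase is a hypothetical helper, it checks every bot on the platform without distinguishing HFT bots from the rest, and it measures elapsed time by counting one-second sleeps, so the deadline is approximate.

```shell
# Hypothetical helper: wait until every bot reports the given phase.
# Usage: wait_for_phase PHASE TIMEOUT_SECONDS
# Returns 0 once all bots match, 1 if the deadline passes first.
# Note: an empty bot list counts as "all match".
wait_for_phase() (
  want=$1 timeout=$2
  elapsed=0
  while :; do
    rows=$(kubectl get bots -A -o \
      jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.status.phase}{"\n"}{end}') || exit 1

    # Count bots whose phase differs from the expected one (or is missing).
    pending=$(printf '%s\n' "$rows" |
      awk -F '\t' -v want="$want" '$1 != "" && $2 != want { n++ } END { print n + 0 }')

    if [ "$pending" -eq 0 ]; then exit 0; fi
    if [ "$elapsed" -ge "$timeout" ]; then exit 1; fi
    sleep 1
    elapsed=$((elapsed + 1))
  done
)
```

After engaging the kill switch, wait_for_phase Draining 10 followed by wait_for_phase Stopped 300 covers the general case. Checking the 30-second HFT deadline separately would need a selector for HFT bots, which this runbook does not define.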
Common failure modes
- Operator pod is down. Symptom: KillSwitch CR created but bots don't halt within poll_interval_s. Mitigation: the Redis polling fallback bypasses the operator entirely.
- Redis pod is down. Symptom: neither operator nor polling fallback works. Mitigation: at least one of the operator's in-memory CR watcher or the REST API fan-out path will still halt bots; if all three fail simultaneously, escalate to manual kubectl scale deployment/bot-* --replicas=0. Note that kubectl does not expand wildcards in resource names, so the bot Deployments must be selected by explicit name or by label.
- Redis Pub/Sub vs SET-key drift. KillSwitchV2.poll_interval_s defines the upper bound on the polling fallback's latency; if the pub/sub channel is dropping messages, polling still works after at most one interval.
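For the manual escalation in the Redis failure mode, the bot Deployments have to be enumerated before they can be scaled. This is a minimal sketch, assuming bots run as Deployments whose names start with bot- in the alphaswarm-bots namespace; scale_down_bots is a hypothetical name, and bots running as StatefulSets are not covered.

```shell
# Hypothetical helper: scale every Deployment named bot-* to zero replicas.
# Usage: scale_down_bots [NAMESPACE]
# kubectl does not expand wildcards in resource names, so list and filter.
scale_down_bots() (
  ns=${1:-alphaswarm-bots}
  kubectl get deployments -n "$ns" -o name |
    grep '^deployment\.apps/bot-' |
    while read -r dep; do
      # Stop at the first failure so a partial scale-down is noticed.
      kubectl scale -n "$ns" "$dep" --replicas=0 </dev/null || exit 1
    done
)
```

Scaling to zero stops the Pods without running whatever the kill switch's cancel, flatten, or freeze modes do, so open orders and positions need separate handling.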