AWS Hybrid Operational Runbook
Companion to aws-deploy.md. Page this when the AgentCore proxy / admin BFF / Bedrock KB / ECS Fargate cluster misbehaves.
On-call checklist (first 5 minutes)
- Confirm the blast radius. Cloudflare hosts
alpha-swarm.ai/api.alpha-swarm.ai/manage.alpha-swarm.ai; CloudFront hostsadmin.alpha-swarm.ai/agentcore.alpha-swarm.ai. A Cloudflare outage does NOT touch the admin / AgentCore surface (and vice versa). - Hit
/healthz.https://admin.alpha-swarm.ai/healthz— if 200, the ALB + ECS Fargate path is healthy. If 5xx, jump to "ECS Fargate service down" below. - Fan-out kill switch. If the incident is touching trading,
immediately POST
/portfolio/kill_switchAND/workloads/haltso every long-running runtime (paper, bots, RL, AgentCore) stops. The topbarKillSwitchcomponent does the fan-out automatically for logged-in operators.
Halt every AgentCore session
# 1. Disable new invocations at the gateway:
aws bedrock-agentcore update-gateway \
--gateway-id $(aws ssm get-parameter \
--name /alphaswarm/prod/agentcore_gateway_arn \
--query 'Parameter.Value' --output text | awk -F/ '{print $NF}') \
--status DISABLED
# 2. Stop the AgentCore proxy ECS Fargate service (the ALB stops
# forwarding to the proxy immediately):
aws ecs update-service \
--cluster $(aws ssm get-parameter \
--name /alphaswarm/prod/ecs_cluster_name \
--query 'Parameter.Value' --output text) \
--service alphaswarm-agentcore-proxy-prod \
--desired-count 0
The matching audit row lands in workload_runs via
WorkloadRuntime.start_run BEFORE the boto3 call returns (rule 45).
You can verify with:
aws rds-data execute-statement \
--resource-arn $RDS_ARN --secret-arn $DB_SECRET_ARN \
--database alphaswarm \
--sql "SELECT id, action, status, user_id, started_at \
FROM workload_runs ORDER BY started_at DESC LIMIT 10"
Roll back to the previous tag
# 1. Identify the previous good SHA:
git log --oneline --decorate -20
# 2. Tag the previous SHA + push to trigger the apply path:
git tag v1.0.1-rollback <previous-sha>
git push origin v1.0.1-rollback
# 3. Manual dispatch — terraform-pipeline.yml with
# tree=alphaswarm_platform, env=prod, action=apply (requires 2 reviewers).
#
# The DB / S3 buckets / KB source bucket are NOT touched — every data
# resource carries lifecycle.prevent_destroy=true and retains the
# previous Terraform-managed state.
ECS Fargate service down
# 1. Inspect recent task stops (most common = ECR pull failure or
# secrets resolution failure):
aws ecs describe-services \
--cluster alphaswarm-cluster-prod \
--services alphaswarm-admin-prod \
--query 'services[0].events[:10]'
# 2. Tail the application log group (the ADOT sidecar logs to a
# sibling stream prefixed adot/):
aws logs tail /aws/ecs/alphaswarm-admin-prod --follow
# 3. Force a fresh rollout (also rolls the ADOT sidecar):
aws ecs update-service \
--cluster alphaswarm-cluster-prod \
--service alphaswarm-admin-prod \
--force-new-deployment
The matching AwsProvider.restart call from WorkloadRuntime writes
a workload_runs row with action=restart BEFORE the boto3 call
runs, so the audit trail is intact even when the rollout fails halfway.
Drain a Celery worker (EKS path)
The Celery workers continue to run on EKS Karpenter; the ECS Fargate surface is admin + AgentCore only. To drain a worker pod without losing in-flight tasks:
# 1. Annotate the pod for graceful shutdown — the Celery preboot
# hook in alphaswarm/tasks/celery_app.py listens for this annotation:
kubectl annotate pod alphaswarm-worker-xxxxx \
-n alphaswarm \
alphaswarm.io/drain-requested-at="$(date -u +%Y-%m-%dT%H:%M:%SZ)"
# 2. Wait for the worker to finish in-flight tasks (max grace =
# ALPHASWARM_AGENT_STALL_THRESHOLD_SECONDS, default 1800s):
kubectl wait --for=delete pod alphaswarm-worker-xxxxx -n alphaswarm --timeout=1900s
# 3. Karpenter replaces the deleted pod automatically; the new pod
# inherits the same Celery queue subscriptions.
Assume the break-glass role
Reserved for catastrophic incidents (org-wide outage, suspected
account compromise). The AlphaSwarm-BreakGlass Identity Center
permission set has Admin across every account, MFA-required, and
alarms on every assumption (CloudWatch alarm +
workload_runs row + Cloudflare Access policy).
# IAM Identity Center -> Sign in -> Pick AlphaSwarm-BreakGlass for the
# target account -> Acknowledge the on-call ticket prompt before
# the assume completes.
aws sts get-caller-identity
# Outputs the breakglass session arn — paste into the incident ticket.
Rotate Cloudflare origin secret
When the CloudFront X-CloudFront-Secret header value is suspected
leaked:
new=$(openssl rand -hex 32)
aws ssm put-parameter \
--name /alphaswarm/prod/cloudfront_origin_secret \
--value "$new" --type SecureString --overwrite
# Re-apply the cloudfront module so the new value lands at the edge:
gh workflow run terraform-pipeline.yml -f tree=alphaswarm_platform -f env=prod -f action=apply
Common failure modes + fixes
| Symptom | Likely cause | Fix |
|---|---|---|
AccessDeniedException on first KB ingestion | aoss:APIAccessAll not propagated | Wait 20s + retry (the time_sleep.settle block handles this on apply, but manual ingestion-job starts can race). |
| AgentCore Runtime invocation returns 403 | The runtime role's IAM policy denies the FM | Verify the model ARN is in var.allowed_model_arns for the env. |
terraform apply fails on aws_bedrockagentcore_* resource | Provider version too old | Pin hashicorp/aws ~> 6.21 in alphaswarm_platform/terraform/environments/live/main.tf. |
ALB 504 on admin.alpha-swarm.ai | Cognito redirect loop or ALB OIDC misconfig | Inspect the listener rule's authenticate-cognito action; confirm user_pool_arn + user_pool_client_id + user_pool_domain SSM params match. |
| ECS exec hangs | Task definition missing enableExecuteCommand=true | Re-deploy with enable_execute_command=true in the service spec; ECS Exec also needs ALPHASWARM_AWS_ECS_EXEC_ENABLED=true on the provider. |
| Bedrock smoke workflow times out on X-Ray | ADOT sidecar not propagating; check alphaswarm-adot-sidecar SA has the X-Ray + Application Signals + CloudWatch policies. |
Cost guardrails
AWS Budgets alarms ride on top of the SCP region allowlist:
devbudget = $300/month (alerts at 50/80/100%)stagingbudget = $500/monthprodbudget = $1500/month (excluding Bedrock token spend; that is metered separately under thealphaswarm.io/cost-bucket=bedrock-tokenstag).
If a budget alarm fires AND the Bedrock token cost is the driver:
# Disable streaming responses (cheaper) and switch the agent spec to
# Haiku for the next 24h while the operator investigates:
aws ssm put-parameter --name /alphaswarm/prod/llm_model_preference \
--value "anthropic.claude-haiku-4-5-20251022-v1:0" \
--type String --overwrite
aws ecs update-service \
--cluster alphaswarm-cluster-prod \
--service alphaswarm-agentcore-proxy-prod \
--force-new-deployment