Skip to main content

AWS Hybrid Operational Runbook

Companion to aws-deploy.md. Page this when the AgentCore proxy / admin BFF / Bedrock KB / ECS Fargate cluster misbehaves.

On-call checklist (first 5 minutes)

  1. Confirm the blast radius. Cloudflare hosts alpha-swarm.ai / api.alpha-swarm.ai / manage.alpha-swarm.ai; CloudFront hosts admin.alpha-swarm.ai / agentcore.alpha-swarm.ai. A Cloudflare outage does NOT touch the admin / AgentCore surface (and vice versa).
  2. Hit /healthz. https://admin.alpha-swarm.ai/healthz — if 200, the ALB + ECS Fargate path is healthy. If 5xx, jump to "ECS Fargate service down" below.
  3. Fan-out kill switch. If the incident is touching trading, immediately POST /portfolio/kill_switch AND /workloads/halt so every long-running runtime (paper, bots, RL, AgentCore) stops. The topbar KillSwitch component does the fan-out automatically for logged-in operators.

Halt every AgentCore session

# 1. Disable new invocations at the gateway:
aws bedrock-agentcore update-gateway \
--gateway-id $(aws ssm get-parameter \
--name /alphaswarm/prod/agentcore_gateway_arn \
--query 'Parameter.Value' --output text | awk -F/ '{print $NF}') \
--status DISABLED

# 2. Stop the AgentCore proxy ECS Fargate service (the ALB stops
# forwarding to the proxy immediately):
aws ecs update-service \
--cluster $(aws ssm get-parameter \
--name /alphaswarm/prod/ecs_cluster_name \
--query 'Parameter.Value' --output text) \
--service alphaswarm-agentcore-proxy-prod \
--desired-count 0

The matching audit row lands in workload_runs via WorkloadRuntime.start_run BEFORE the boto3 call returns (rule 45). You can verify with:

aws rds-data execute-statement \
--resource-arn $RDS_ARN --secret-arn $DB_SECRET_ARN \
--database alphaswarm \
--sql "SELECT id, action, status, user_id, started_at \
FROM workload_runs ORDER BY started_at DESC LIMIT 10"

Roll back to the previous tag

# 1. Identify the previous good SHA:
git log --oneline --decorate -20

# 2. Tag the previous SHA + push to trigger the apply path:
git tag v1.0.1-rollback <previous-sha>
git push origin v1.0.1-rollback

# 3. Manual dispatch — terraform-pipeline.yml with
# tree=alphaswarm_platform, env=prod, action=apply (requires 2 reviewers).
#
# The DB / S3 buckets / KB source bucket are NOT touched — every data
# resource carries lifecycle.prevent_destroy=true and retains the
# previous Terraform-managed state.

ECS Fargate service down

# 1. Inspect recent task stops (most common = ECR pull failure or
# secrets resolution failure):
aws ecs describe-services \
--cluster alphaswarm-cluster-prod \
--services alphaswarm-admin-prod \
--query 'services[0].events[:10]'

# 2. Tail the application log group (the ADOT sidecar logs to a
# sibling stream prefixed adot/):
aws logs tail /aws/ecs/alphaswarm-admin-prod --follow

# 3. Force a fresh rollout (also rolls the ADOT sidecar):
aws ecs update-service \
--cluster alphaswarm-cluster-prod \
--service alphaswarm-admin-prod \
--force-new-deployment

The matching AwsProvider.restart call from WorkloadRuntime writes a workload_runs row with action=restart BEFORE the boto3 call runs, so the audit trail is intact even when the rollout fails halfway.

Drain a Celery worker (EKS path)

The Celery workers continue to run on EKS Karpenter; the ECS Fargate surface is admin + AgentCore only. To drain a worker pod without losing in-flight tasks:

# 1. Annotate the pod for graceful shutdown — the Celery preboot
# hook in alphaswarm/tasks/celery_app.py listens for this annotation:
kubectl annotate pod alphaswarm-worker-xxxxx \
-n alphaswarm \
alphaswarm.io/drain-requested-at="$(date -u +%Y-%m-%dT%H:%M:%SZ)"

# 2. Wait for the worker to finish in-flight tasks (max grace =
# ALPHASWARM_AGENT_STALL_THRESHOLD_SECONDS, default 1800s):
kubectl wait --for=delete pod alphaswarm-worker-xxxxx -n alphaswarm --timeout=1900s

# 3. Karpenter replaces the deleted pod automatically; the new pod
# inherits the same Celery queue subscriptions.

Assume the break-glass role

Reserved for catastrophic incidents (org-wide outage, suspected account compromise). The AlphaSwarm-BreakGlass Identity Center permission set has Admin across every account, MFA-required, and alarms on every assumption (CloudWatch alarm + workload_runs row + Cloudflare Access policy).

# IAM Identity Center -> Sign in -> Pick AlphaSwarm-BreakGlass for the
# target account -> Acknowledge the on-call ticket prompt before
# the assume completes.
aws sts get-caller-identity
# Outputs the breakglass session arn — paste into the incident ticket.

Rotate Cloudflare origin secret

When the CloudFront X-CloudFront-Secret header value is suspected leaked:

new=$(openssl rand -hex 32)

aws ssm put-parameter \
--name /alphaswarm/prod/cloudfront_origin_secret \
--value "$new" --type SecureString --overwrite

# Re-apply the cloudfront module so the new value lands at the edge:
gh workflow run terraform-pipeline.yml -f tree=alphaswarm_platform -f env=prod -f action=apply

Common failure modes + fixes

SymptomLikely causeFix
AccessDeniedException on first KB ingestionaoss:APIAccessAll not propagatedWait 20s + retry (the time_sleep.settle block handles this on apply, but manual ingestion-job starts can race).
AgentCore Runtime invocation returns 403The runtime role's IAM policy denies the FMVerify the model ARN is in var.allowed_model_arns for the env.
terraform apply fails on aws_bedrockagentcore_* resourceProvider version too oldPin hashicorp/aws ~> 6.21 in alphaswarm_platform/terraform/environments/live/main.tf.
ALB 504 on admin.alpha-swarm.aiCognito redirect loop or ALB OIDC misconfigInspect the listener rule's authenticate-cognito action; confirm user_pool_arn + user_pool_client_id + user_pool_domain SSM params match.
ECS exec hangsTask definition missing enableExecuteCommand=trueRe-deploy with enable_execute_command=true in the service spec; ECS Exec also needs ALPHASWARM_AWS_ECS_EXEC_ENABLED=true on the provider.
Bedrock smoke workflow times out on X-RayADOT sidecar not propagating; check alphaswarm-adot-sidecar SA has the X-Ray + Application Signals + CloudWatch policies.

Cost guardrails

AWS Budgets alarms ride on top of the SCP region allowlist:

  • dev budget = $300/month (alerts at 50/80/100%)
  • staging budget = $500/month
  • prod budget = $1500/month (excluding Bedrock token spend; that is metered separately under the alphaswarm.io/cost-bucket=bedrock-tokens tag).

If a budget alarm fires AND the Bedrock token cost is the driver:

# Disable streaming responses (cheaper) and switch the agent spec to
# Haiku for the next 24h while the operator investigates:
aws ssm put-parameter --name /alphaswarm/prod/llm_model_preference \
--value "anthropic.claude-haiku-4-5-20251022-v1:0" \
--type String --overwrite

aws ecs update-service \
--cluster alphaswarm-cluster-prod \
--service alphaswarm-agentcore-proxy-prod \
--force-new-deployment