Operations runbook — CI/CD deploy

Task-oriented steps for the AlphaSwarm AWS CI/CD pipeline. For the design and the topology diagrams see the concept page CI/CD pipelines. This runbook is the companion to the bootstrap and incident playbooks in AWS Hybrid Deployment Guide and AWS Hybrid Operational Runbook — start there for the first-ever account bring-up; come here for the day-to-day pipeline.

All deploys run through GitHub Actions over GitHub OIDC. Never run terraform apply or alphaswarm deploy up against a shared environment from a laptop.
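The rule above can be backed by a small guard in any wrapper script. This is a minimal sketch, not something the pipeline is known to ship: `require_ci` is a hypothetical helper name, and it relies only on the `GITHUB_ACTIONS` variable, which GitHub Actions sets to `true` for every step it runs.

```shell
# Hypothetical guard: refuse to continue outside a GitHub Actions run.
# GitHub Actions sets GITHUB_ACTIONS=true for every step it executes.
require_ci() {
  if [ "${GITHUB_ACTIONS:-}" != "true" ]; then
    echo "refusing to deploy: not running in GitHub Actions" >&2
    return 1
  fi
}
```

A wrapper would call `require_ci || exit 1` before any terraform apply or alphaswarm deploy up. The variable is trivial to set by hand, so this only catches accidents; the real control remains the OIDC-only role trust.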

(a) One-time setup — Environments, reviewers, variables

Do this once per repo (the steps are the same for alphaswarm_platform and alphaswarm_admin).

Create the three GitHub Environments. In the repo: Settings → Environments → New environment, for each of dev, staging, prod.
Set required reviewers. Edit each Environment's protection rules:
- dev — no required reviewers (auto).
- staging — 1 required reviewer.
- prod — 2 required reviewers (4-eyes).

Set the per-env role variables. For each Environment add the apply role ARN (published by the infrastructure/modules/github-oidc module) plus the read-only plan role ARN:

```
# Apply role (one per environment):
gh variable set AWS_DEPLOYER_ROLE_ARN \
  --env prod \
  --body "arn:aws:iam::<prod-account-id>:role/aqp-gha-apply"

# Plan role (read-only, used by pr-validate.yml):
gh variable set AWS_PLAN_ROLE_ARN \
  --env prod \
  --body "arn:aws:iam::<prod-account-id>:role/aqp-gha-plan"
```

Repeat for dev and staging with their account IDs.
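To avoid typing six near-identical commands, the per-environment variables can be generated. The sketch below only prints the commands so they can be reviewed before running; `role_arn` and `print_role_vars` are hypothetical helper names, and it assumes the role names aqp-gha-apply and aqp-gha-plan are the same in every account.

```shell
# Hypothetical helper: build an IAM role ARN from an account ID and a role name.
role_arn() {
  printf 'arn:aws:iam::%s:role/%s\n' "$1" "$2"
}

# Print (do not run) the gh commands for one environment.
# Usage: print_role_vars <env> <account-id>
# Assumes the role names are identical across accounts.
print_role_vars() {
  printf 'gh variable set AWS_DEPLOYER_ROLE_ARN --env %s --body "%s"\n' \
    "$1" "$(role_arn "$2" aqp-gha-apply)"
  printf 'gh variable set AWS_PLAN_ROLE_ARN --env %s --body "%s"\n' \
    "$1" "$(role_arn "$2" aqp-gha-plan)"
}
```

For example, `print_role_vars dev <dev-account-id>` emits the two commands for dev; pipe the output to `sh` once it looks right.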

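Set the cross-repo dispatch token (admin repo only). The admin pipeline fires a repository_dispatch at alphaswarm_platform, so it needs a token that can create dispatch events on the platform repo: a fine-grained PAT with Contents: write on alphaswarm_platform, or a classic PAT with repo scope. Store it as a secret in the admin repo: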
```
gh secret set PLATFORM_DISPATCH_TOKEN \
  --repo Alpha-Swarm-ai/alphaswarm_admin \
  --body "<fine-grained-PAT-with-platform-dispatch>"
```

(b) Deploy the landing zone (infrastructure/)

The infrastructure/ tree is applied with native Terraform over OIDC into AqpTerraformExecutionRole. Always plan first, review the diff in the workflow summary, then apply.

```
# 1. Plan dev:
gh workflow run terraform-pipeline.yml \
  -f tree=infrastructure -f env=dev -f action=plan

# 2. Review the plan in the run summary, then apply:
gh workflow run terraform-pipeline.yml \
  -f tree=infrastructure -f env=dev -f action=apply
```

Promote by repeating with -f env=staging then -f env=prod. The staging apply waits on 1 reviewer and the prod apply on 2 (the GitHub Environment gate).
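The promotion order can be written down as a checklist generator. This is a sketch under the assumption that every environment gets a plan followed by an apply; `print_promotion` is a hypothetical name, and it prints the dispatch commands rather than running them, because each plan still has to be reviewed before its apply.

```shell
# Print the dispatch commands for one tree in promotion order.
# Nothing is executed; each environment gets a plan followed by an apply.
# Usage: print_promotion <tree>
print_promotion() {
  for env_name in dev staging prod; do
    for action in plan apply; do
      printf 'gh workflow run terraform-pipeline.yml -f tree=%s -f env=%s -f action=%s\n' \
        "$1" "$env_name" "$action"
    done
  done
}
```

`print_promotion infrastructure` yields six lines, from the dev plan through to the prod apply.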

(c) Deploy the app tier (terraform/)

Same workflow, tree=alphaswarm_platform. This path delegates to CodeBuild, which runs alphaswarm deploy plan / alphaswarm deploy up (TerraformRuntime) and writes a terraform_runs audit row.

```
gh workflow run terraform-pipeline.yml \
  -f tree=alphaswarm_platform -f env=dev -f action=plan

gh workflow run terraform-pipeline.yml \
  -f tree=alphaswarm_platform -f env=dev -f action=apply
```

A push to main automatically runs an infrastructure plan against dev, so you usually only dispatch the apply actions explicitly.

(d) Release images — push a v* tag

build-publish.yml triggers on a v* tag. It builds each service multi-arch to ECR, signs with Cosign keyless, emits a syft SBOM and SLSA provenance, and runs Trivy + Grype scans.

```
git tag v1.4.0
git push origin v1.4.0
# Watch the workflow:
gh run watch
```
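Because any tag starting with v triggers a full build, a pre-push check can catch typos. The sketch below is an assumption layered on top of the pipeline: `is_release_tag` is a hypothetical helper, and the strict v<major>.<minor>.<patch> shape is stricter than the v* trigger itself, so suffixed tags such as a rollback tag would be rejected by it.

```shell
# Hypothetical pre-push check: accept only tags shaped like v<major>.<minor>.<patch>.
# build-publish.yml itself triggers on any v* tag; the stricter pattern is an assumption.
is_release_tag() {
  printf '%s\n' "$1" | grep -Eq '^v[0-9]+\.[0-9]+\.[0-9]+$'
}
```

Typical use is `is_release_tag "$tag" && git push origin "$tag"`.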

(e) Admin deploy flow

alphaswarm_admin builds two images and hands off to the platform.

Push to the admin repo's main (or push a v* tag).
The admin workflow builds and pushes two images to ECR: alphaswarm-admin and alphaswarm-admin-frontend.
After both land, it fires a repository_dispatch event admin-image-published at alphaswarm_platform (using PLATFORM_DISPATCH_TOKEN).
The platform's app-tier redeploy runs and rolls the admin service onto ECS Fargate (Cognito + ALB) via terraform/environments/{dev,staging,prod}, reading infra handles from SSM /alphaswarm/<env>/*.

To re-trigger the handoff manually (for example after a token fix without a new build):

```
gh api repos/Alpha-Swarm-ai/alphaswarm_platform/dispatches \
  -f event_type=admin-image-published \
  -f 'client_payload[env]=dev'
```
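For reference, the -f flags in the gh api call above are sent as a JSON request body. The sketch below prints that body for a given environment; `dispatch_payload` is a hypothetical helper, and restricting the value to dev, staging, and prod is an assumption based on the three Environments in this runbook.

```shell
# Print the JSON body equivalent to the -f flags in the gh api call above.
# Only dev, staging, and prod are accepted, so the value needs no JSON escaping.
dispatch_payload() {
  case "$1" in
    dev|staging|prod) ;;
    *) echo "unknown env: $1" >&2; return 1 ;;
  esac
  printf '{"event_type":"admin-image-published","client_payload":{"env":"%s"}}\n' "$1"
}
```

The output can be sent as-is with `dispatch_payload dev | gh api --method POST repos/Alpha-Swarm-ai/alphaswarm_platform/dispatches --input -`, which is handy when the payload needs to be logged or reviewed first.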

(f) Approving a prod release (4-eyes)

A prod apply (infra or app tier) pauses on the GitHub Environment gate until two distinct reviewers approve. The apply role cannot be assumed before that, so nothing touches prod until both sign off.

Dispatch the apply (step b or c) with -f env=prod.
Two reviewers open the run → "Review deployments" → select prod → Approve. Approvals must come from two different people.
The job then assumes vars.AWS_DEPLOYER_ROLE_ARN for prod over OIDC and proceeds.

```
# List runs awaiting approval:
gh run list --workflow terraform-pipeline.yml --status waiting
```

(g) Where the terraform_runs audit row lands

Every app-tier alphaswarm deploy plan / up writes a row to the terraform_runs table in the platform Postgres (platform AGENTS rule 42) — the same ledger used by TerraformRuntime for in-app Terraform actions. Native infrastructure/ applies do not write this row (their history is the Terraform state in S3). To inspect recent app-tier runs:

```
aws rds-data execute-statement \
  --resource-arn "$RDS_ARN" --secret-arn "$DB_SECRET_ARN" \
  --database alphaswarm \
  --sql "SELECT id, action, status, env, started_at \
         FROM terraform_runs ORDER BY started_at DESC LIMIT 10"
```
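When only one environment is of interest, the query can be narrowed. The sketch below builds the SQL text for the --sql argument; `terraform_runs_sql` is a hypothetical helper, and it assumes the env column stores the values dev, staging, and prod, which this runbook does not confirm.

```shell
# Hypothetical helper: build the audit query for one environment.
# Inputs are validated because they are interpolated into the SQL text.
# Usage: terraform_runs_sql <env> <limit>
terraform_runs_sql() {
  case "$1" in
    dev|staging|prod) ;;
    *) echo "unknown env: $1" >&2; return 1 ;;
  esac
  case "$2" in
    ''|*[!0-9]*) echo "limit must be a whole number" >&2; return 1 ;;
  esac
  printf "SELECT id, action, status, env, started_at FROM terraform_runs WHERE env = '%s' ORDER BY started_at DESC LIMIT %s\n" "$1" "$2"
}
```

Pass the result through as `--sql "$(terraform_runs_sql prod 10)"` in the command above.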

(h) Rollback

Pick the path that matches what changed.

Bad image (app or admin): re-point the deploy at the prior immutable image tag and redeploy — no rebuild required.

```
# Re-run the app-tier apply pinned to the previous tag:
gh workflow run terraform-pipeline.yml \
  -f tree=alphaswarm_platform -f env=prod -f action=apply \
  -f image_tag=v1.3.0
```
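Finding the prior tag by eye is error-prone during an incident. The sketch below picks the tag that sorts immediately before the current one; `previous_tag` is a hypothetical helper, it depends on version ordering from `sort -V` (GNU coreutils), and it assumes image tags mirror the git release tags.

```shell
# Read tags on stdin and print the tag that sorts immediately before <current>.
# Relies on version ordering from `sort -V` (GNU coreutils); prints nothing if
# <current> is the oldest tag or is not in the list.
# Usage: <tag list> | previous_tag <current>
previous_tag() {
  sort -V | awk -v cur="$1" '$0 == cur { if (prev != "") print prev; exit } { prev = $0 }'
}
```

For example, `git tag --list 'v*' | previous_tag v1.4.0` prints the candidate for image_tag. Filter out non-release tags (such as rollback tags) from the list first, and confirm the image actually exists in ECR before dispatching.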

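Bad infra/app-tier change: re-apply the previous good state by dispatching apply from the prior good commit. Tag and push the previous SHA, then dispatch the apply with --ref set to that tag; without --ref, gh workflow run uses the default branch rather than the rollback commit. Prod still needs 2 reviewers. Because the tag matches v*, pushing it also triggers build-publish.yml (step d).
```
git tag v1.3.1-rollback <previous-good-sha>
git push origin v1.3.1-rollback
# --ref makes the workflow run from the rollback tag, not the default branch:
gh workflow run terraform-pipeline.yml \
  --ref v1.3.1-rollback \
  -f tree=alphaswarm_platform -f env=prod -f action=apply
```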

Data resources (RDS, S3, the KB source bucket) carry lifecycle.prevent_destroy = true, so Terraform rejects any plan that would destroy or replace them. A re-apply therefore rolls forward the service definitions without deleting stateful resources. See the rollback section of AWS Hybrid Operational Runbook for the full data-safety notes.
