Operations runbook — CI/CD deploy
Task-oriented steps for the AlphaSwarm AWS CI/CD pipeline. For the design and the topology diagrams see the concept page CI/CD pipelines. This runbook is the companion to the bootstrap and incident playbooks in AWS Hybrid Deployment Guide and AWS Hybrid Operational Runbook — start there for the first-ever account bring-up; come here for the day-to-day pipeline.
All deploys run through GitHub Actions over GitHub OIDC. Never run
terraform apply or alphaswarm deploy up against a shared
environment from a laptop.
(a) One-time setup — Environments, reviewers, variables
Do this once per repo (the steps are the same for alphaswarm_platform
and alphaswarm_admin).
- Create the three GitHub Environments. In the repo: Settings → Environments → New environment, for each of dev, staging, prod.
- Set required reviewers. Edit each Environment's protection rules:
  - dev: no required reviewers (auto).
  - staging: 1 required reviewer.
  - prod: 2 required reviewers (4-eyes).
- Set the per-env role variables. For each Environment add the apply role ARN (published by the infrastructure/modules/github-oidc module) plus the read-only plan role ARN:
  # Apply role (one per environment):
  gh variable set AWS_DEPLOYER_ROLE_ARN \
  --env prod \
  --body "arn:aws:iam::<prod-account-id>:role/aqp-gha-apply"
  # Plan role (read-only, used by pr-validate.yml):
  gh variable set AWS_PLAN_ROLE_ARN \
  --env prod \
  --body "arn:aws:iam::<prod-account-id>:role/aqp-gha-plan"
  Repeat for dev and staging with their account IDs.
- Set the cross-repo dispatch token (admin repo only). The admin pipeline fires a repository_dispatch at alphaswarm_platform, so it needs a token with repo scope on the platform repo. Store it as a secret in the admin repo:
  gh secret set PLATFORM_DISPATCH_TOKEN \
  --repo Alpha-Swarm-ai/alphaswarm_admin \
  --body "<fine-grained-PAT-with-platform-dispatch>"
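The role-variable step repeats for each Environment, so it can be scripted. The following is a minimal sketch, assuming the role names aqp-gha-apply and aqp-gha-plan are the same in every account; set_role_vars is a hypothetical helper, not part of the pipeline, and the account IDs are values you supply.

```shell
# Hypothetical helper: set both role-ARN variables on one GitHub Environment.
# Assumption: the role names are identical in every account.
set_role_vars() {
  local env="$1" account_id="$2"
  # AWS account IDs are exactly 12 digits; reject anything else early.
  if ! printf '%s\n' "$account_id" | grep -Eq '^[0-9]{12}$'; then
    echo "invalid account id for $env: $account_id" >&2
    return 1
  fi
  gh variable set AWS_DEPLOYER_ROLE_ARN --env "$env" \
    --body "arn:aws:iam::${account_id}:role/aqp-gha-apply"
  gh variable set AWS_PLAN_ROLE_ARN --env "$env" \
    --body "arn:aws:iam::${account_id}:role/aqp-gha-plan"
}
```

Call it once per Environment, for example set_role_vars dev 111111111111, then again for staging and prod with their own account IDs.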
(b) Deploy the landing zone (infrastructure/)
The infrastructure/ tree is applied with native Terraform over OIDC
into AqpTerraformExecutionRole. Always plan first, review the diff
in the workflow summary, then apply.
# 1. Plan dev:
gh workflow run terraform-pipeline.yml \
-f tree=infrastructure -f env=dev -f action=plan
# 2. Review the plan in the run summary, then apply:
gh workflow run terraform-pipeline.yml \
-f tree=infrastructure -f env=dev -f action=apply
Promote by repeating with -f env=staging then -f env=prod. The
staging apply waits on 1 reviewer and the prod apply on 2 (the
GitHub Environment gate).
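Promotion repeats the same dispatch with different inputs, which makes typos easy. As a sketch only, a small wrapper can reject unknown values before anything reaches the workflow. dispatch_tf is a hypothetical name, and the allowed values are the ones this runbook uses.

```shell
# Hypothetical wrapper around the dispatch commands in this runbook.
# It validates tree, env, and action before calling gh.
dispatch_tf() {
  local tree="$1" env="$2" action="$3"
  case "$tree" in
    infrastructure|alphaswarm_platform) ;;
    *) echo "unknown tree: $tree" >&2; return 1 ;;
  esac
  case "$env" in
    dev|staging|prod) ;;
    *) echo "unknown env: $env" >&2; return 1 ;;
  esac
  case "$action" in
    plan|apply) ;;
    *) echo "unknown action: $action" >&2; return 1 ;;
  esac
  gh workflow run terraform-pipeline.yml \
    -f tree="$tree" -f env="$env" -f action="$action"
}
```

A staging promotion is then dispatch_tf infrastructure staging plan followed, after review, by dispatch_tf infrastructure staging apply. The Environment gate still applies; the wrapper only guards the inputs.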
(c) Deploy the app tier (terraform/)
Same workflow, tree=alphaswarm_platform. This path delegates to
CodeBuild, which runs alphaswarm deploy plan / alphaswarm deploy up
(TerraformRuntime) and writes a terraform_runs audit row.
gh workflow run terraform-pipeline.yml \
-f tree=alphaswarm_platform -f env=dev -f action=plan
gh workflow run terraform-pipeline.yml \
-f tree=alphaswarm_platform -f env=dev -f action=apply
A push to main automatically runs an infrastructure plan against
dev, so you usually only dispatch the apply actions explicitly.
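Because the plan on push runs without a manual dispatch, it helps to confirm it finished cleanly before dispatching the apply. The sketch below assumes the push-triggered plan runs inside terraform-pipeline.yml, which this runbook does not state; both function names are hypothetical.

```shell
# Hypothetical helpers. Assumption: the push-triggered plan runs inside
# terraform-pipeline.yml; change the workflow name if it lives elsewhere.

# Print the ID of the most recent push-triggered run on main.
latest_push_run_id() {
  gh run list --workflow terraform-pipeline.yml --branch main \
    --event push --limit 1 --json databaseId --jq '.[0].databaseId'
}

# Wait for that run to succeed, then dispatch the dev infrastructure apply.
apply_dev_after_push_plan() {
  local run_id
  run_id="$(latest_push_run_id)" || return 1
  if [ -z "$run_id" ] || [ "$run_id" = "null" ]; then
    echo "no push-triggered run found on main" >&2
    return 1
  fi
  # --exit-status makes gh exit non-zero when the watched run fails.
  gh run watch "$run_id" --exit-status || return 1
  gh workflow run terraform-pipeline.yml \
    -f tree=infrastructure -f env=dev -f action=apply
}
```

The most recent push-triggered run is not guaranteed to be the plan for the commit you intend to apply, so check the run summary before relying on this.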
(d) Release images — push a v* tag
build-publish.yml triggers on a v* tag. It builds each service as a
multi-arch image and pushes it to ECR, signs the image with Cosign
(keyless), emits a Syft SBOM and SLSA provenance, and runs Trivy and Grype scans.
git tag v1.4.0
git push origin v1.4.0
# Watch the workflow:
gh run watch
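A mistyped tag still matches v* and starts a release build, so a check before pushing can be worthwhile. This sketch assumes release tags follow vMAJOR.MINOR.PATCH, which is stricter than the v* trigger itself; is_release_tag and release_tag are hypothetical helpers.

```shell
# Hypothetical helpers. Assumption: release tags look like v1.4.0
# (vMAJOR.MINOR.PATCH); the workflow itself accepts any v* tag.

# Succeed only when the argument is a vMAJOR.MINOR.PATCH tag.
is_release_tag() {
  printf '%s\n' "$1" | grep -Eq '^v[0-9]+\.[0-9]+\.[0-9]+$'
}

# Create and push the tag only after it passes the format check.
release_tag() {
  local tag="$1"
  if ! is_release_tag "$tag"; then
    echo "not a vMAJOR.MINOR.PATCH tag: $tag" >&2
    return 1
  fi
  git tag "$tag" && git push origin "$tag"
}
```

For example, release_tag v1.4.0 tags and pushes, while release_tag v1.4 stops before touching the repository.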
(e) Admin deploy flow
alphaswarm_admin builds two images and hands off to the platform.
- Push to the admin repo's main (or push a v* tag).
- The admin workflow builds and pushes two images to ECR: alphaswarm-admin and alphaswarm-admin-frontend.
- After both land, it fires a repository_dispatch event admin-image-published at alphaswarm_platform (using PLATFORM_DISPATCH_TOKEN).
- The platform's app-tier redeploy runs and rolls the admin service onto ECS Fargate (Cognito + ALB) via terraform/environments/{dev,staging,prod}, reading infra handles from SSM /alphaswarm/<env>/*.
To re-trigger the handoff manually (for example after a token fix without a new build):
gh api repos/Alpha-Swarm-ai/alphaswarm_platform/dispatches \
-f event_type=admin-image-published \
-f 'client_payload[env]=dev'
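The -f 'client_payload[env]=dev' form makes gh build a nested JSON body. To make that body explicit, the sketch below prints the equivalent JSON. dispatch_payload is a hypothetical helper, and restricting env to dev, staging, and prod is an assumption based on the environment names used in this runbook.

```shell
# Hypothetical helper: print the JSON body for the admin-image-published
# dispatch. Assumption: env is one of the three runbook environments.
dispatch_payload() {
  local env="$1"
  case "$env" in
    dev|staging|prod) ;;
    *) echo "unknown env: $env" >&2; return 1 ;;
  esac
  printf '{"event_type":"admin-image-published","client_payload":{"env":"%s"}}\n' "$env"
}
```

The output can be piped to the same endpoint with dispatch_payload dev | gh api --method POST repos/Alpha-Swarm-ai/alphaswarm_platform/dispatches --input -, which reads the request body from standard input.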
(f) Approving a prod release (4-eyes)
A prod apply (infra or app tier) pauses on the GitHub Environment
gate until two distinct reviewers approve. The apply role cannot
be assumed before that, so nothing touches prod until both sign off.
- Dispatch the apply (step b or c) with -f env=prod.
- Two reviewers open the run → "Review deployments" → select prod → Approve. Approvals must come from two different people.
- The job then assumes vars.AWS_DEPLOYER_ROLE_ARN for prod over OIDC and proceeds.
# List runs awaiting approval:
gh run list --workflow terraform-pipeline.yml --status waiting
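The run list shows which runs exist but not which Environment gate a given run is paused on. A sketch using the GitHub REST endpoint for a run's pending deployments follows. pending_envs is a hypothetical helper, and the repo and run ID are values you supply.

```shell
# Hypothetical helper: print the Environment names a run is waiting on.
# Uses GET /repos/{owner}/{repo}/actions/runs/{run_id}/pending_deployments.
pending_envs() {
  local repo="$1" run_id="$2"
  case "$run_id" in
    ''|*[!0-9]*) echo "run id must be numeric: $run_id" >&2; return 1 ;;
  esac
  gh api "repos/${repo}/actions/runs/${run_id}/pending_deployments" \
    --jq '.[].environment.name'
}
```

For example, pending_envs Alpha-Swarm-ai/alphaswarm_platform <run-id> prints prod while that run is still paused on the prod gate.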
(g) Where the terraform_runs audit row lands
Every app-tier alphaswarm deploy plan / up writes a row to the
terraform_runs table in the platform Postgres (platform AGENTS
rule 42) — the same ledger used by TerraformRuntime for in-app
Terraform actions. Native infrastructure/ applies do not write this
row (their history is the Terraform state in S3). To inspect recent
app-tier runs:
aws rds-data execute-statement \
--resource-arn "$RDS_ARN" --secret-arn "$DB_SECRET_ARN" \
--database alphaswarm \
--sql "SELECT id, action, status, env, started_at \
FROM terraform_runs ORDER BY started_at DESC LIMIT 10"
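When the row limit needs to vary, building the SQL in a helper keeps unvalidated input out of the statement. A minimal sketch, assuming the column names shown in the query above; recent_runs_sql is a hypothetical helper.

```shell
# Hypothetical helper: print the audit query with a validated row limit.
# The limit must be digits only, so no other text can reach the SQL.
recent_runs_sql() {
  local limit="${1:-10}"
  case "$limit" in
    ''|*[!0-9]*) echo "limit must be a non-negative integer: $limit" >&2; return 1 ;;
  esac
  printf 'SELECT id, action, status, env, started_at FROM terraform_runs ORDER BY started_at DESC LIMIT %s\n' "$limit"
}
```

Pass the result to the same command with --sql "$(recent_runs_sql 25)".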
(h) Rollback
Pick the path that matches what changed.
- Bad image (app or admin): re-point the deploy at the prior immutable image tag and redeploy; no rebuild required.
  # Re-run the app-tier apply pinned to the previous tag:
  gh workflow run terraform-pipeline.yml \
  -f tree=alphaswarm_platform -f env=prod -f action=apply \
  -f image_tag=v1.3.0
- Bad infra/app-tier change: re-apply the previous good state by dispatching apply from the prior good commit. Tag and push the previous SHA, then dispatch the apply from that tag with --ref (prod still needs 2 reviewers). Without --ref the workflow runs from the default branch, not the prior commit. The v* tag also triggers build-publish.yml (step d), so expect an image build for the rollback tag.
  git tag v1.3.1-rollback <previous-good-sha>
  git push origin v1.3.1-rollback
  gh workflow run terraform-pipeline.yml --ref v1.3.1-rollback \
  -f tree=alphaswarm_platform -f env=prod -f action=apply
Data resources (RDS, S3, the KB source bucket) carry
lifecycle.prevent_destroy = true, so Terraform fails any plan that
would destroy them. A re-apply therefore rolls the service definitions
forward without deleting stateful resources. See the rollback section
of AWS Hybrid Operational Runbook for the full data-safety notes.
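Both rollback paths start from knowing which release came before the bad one. The sketch below picks the previous release tag from a list on standard input, assuming release tags follow vMAJOR.MINOR.PATCH; other tags, such as a -rollback tag, are ignored. previous_tag is a hypothetical helper and relies on a sort that supports -V (version sort).

```shell
# Hypothetical helper: read tags on stdin, print the release that precedes
# the given one. Assumption: release tags look like v1.4.0.
previous_tag() {
  local current="$1"
  grep -E '^v[0-9]+\.[0-9]+\.[0-9]+$' \
    | sort -V \
    | awk -v cur="$current" '
        $0 == cur { found = 1; exit }   # stop at the current release
        { prev = $0 }                   # remember the last tag before it
        END {
          if (!found || prev == "") exit 1
          print prev
        }'
}
```

For example, git tag --list 'v*' | previous_tag v1.4.0 prints the release to pass as image_tag. It exits non-zero when the given tag is unknown or has no predecessor.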
See also
- CI/CD pipelines — the design and topology.
- AWS Hybrid Deployment Guide — first-time bootstrap.
- AWS Hybrid Operational Runbook — incident playbooks + rollback data safety.