Cell-router cutover runbook
Phase 3 §6 of RESTRUCTURING_PLAN.md. Covers the cutover from the single-container Python FastAPI cell proxy (in
alphaswarm_client/) to the Envoy + alphaswarm-tenant-router two-component cell router. This runbook is the operator-facing companion to the deployment manifests at alphaswarm_platform/deployments/kubernetes/edge/.
Architecture (Phase 3 §6.4)
[ user / agent ]
│ TLS
▼
[ Cloudflare Tunnel (alpha-swarm.ai) ]
│
▼
[ alphaswarm-edge — Envoy (HTTP-only) ]
│ ext_authz callout
│ ──────────────────────▶ [ alphaswarm-tenant-router ]
│ │ /resolve
│ ▼
│ [ cells registry (control plane) ]
│ ◀──────────────────── x-alphaswarm-cell header
│
▼ Route on x-alphaswarm-cell:
[ alphaswarm-cell-<id>-api (FastAPI) ]
[ alphaswarm-cell-<id>-workers (Celery, gVisor for agents) ]
[ alphaswarm-cell-<id>-postgres ] [ alphaswarm-cell-<id>-minio ]
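The routing decision in the diagram can be illustrated with a short Python sketch. In production the decision is made by Envoy's route configuration, not by Python code; the function name, the error raised for a missing header, and the cell id "a1" are all hypothetical and exist only to show the header-to-upstream mapping.

```python
CELL_HEADER = "x-alphaswarm-cell"  # header set by the ext_authz callout


def upstream_for(headers: dict[str, str]) -> str:
    """Map the cell header to the per-cell API service name from the diagram.

    Illustrative only: Envoy performs this match in its route config.
    """
    # HTTP header names are case-insensitive, so normalise before lookup.
    lowered = {name.lower(): value for name, value in headers.items()}
    cell_id = lowered.get(CELL_HEADER, "").strip()
    if not cell_id:
        # Assumption: a request without a resolved cell is not routable.
        raise LookupError("no x-alphaswarm-cell header; request cannot be routed")
    return f"alphaswarm-cell-{cell_id}-api"


# "a1" is a made-up cell id used only for this example.
print(upstream_for({"X-AlphaSwarm-Cell": "a1"}))
```

The point of the two-component design is visible here: the tenant-router only resolves a cell id, and everything downstream keys off that single header.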
Prerequisites
- The four canonical AlphaSwarm images (alphaswarm-api, alphaswarm-worker, alphaswarm-client, alphaswarm-controller) are running on the pre-Phase-3 single-namespace topology. The Phase 3 work runs IN PARALLEL until the canary completes; nothing is taken away from the running fleet.
- The Alembic head is at 0083_audit_cell_id_column.py. Verify:
  alembic current
  # expected: 0083_audit_cell_id_column (head)
- The cells registry has at least one state=active cell row. Verify via the control plane:
  curl -sS https://manage.alpha-swarm.ai/manage/cells | jq '.data[].id'
- The alphaswarm-edge namespace exists and carries the alphaswarm.io/host-network-allowed: "true" exception label per Phase 2 §5.4.
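The cells-registry prerequisite can be checked programmatically once the response has been saved. The following is a minimal sketch, assuming the payload has the shape {"data": [{"id": ..., "state": ...}]}; only the data[].id path is confirmed by the jq filter above, so the "state" key and the second sample row are assumptions.

```python
from typing import Any


def active_cell_ids(payload: dict[str, Any]) -> list[str]:
    """Return the ids of cells whose state is "active".

    Assumes each cell row carries a "state" key; the runbook only
    confirms the data[].id path.
    """
    return [
        cell["id"]
        for cell in payload.get("data", [])
        if cell.get("state") == "active"
    ]


sample = {
    "data": [
        {"id": "cell-shared-std-local", "state": "active"},
        {"id": "cell-example-draining", "state": "draining"},  # hypothetical row
    ]
}

ids = active_cell_ids(sample)
if not ids:
    raise SystemExit("prerequisite not met: no active cell in the registry")
print(ids)
```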
Step 0 — Build the Phase 3 images
Both images build from the alphaswarm_platform repo root (the
post-repo-split context):
cd alphaswarm_platform
# alphaswarm-edge (Envoy)
docker buildx build \
--platform linux/amd64,linux/arm64 \
--file build/docker/alphaswarm-edge/Dockerfile \
--tag ghcr.io/julianwiley/alphaswarm-edge:v0.2.0 \
--push .
# alphaswarm-tenant-router (Python + uvloop)
docker buildx build \
--platform linux/amd64,linux/arm64 \
--file build/docker/alphaswarm-tenant-router/Dockerfile \
--tag ghcr.io/julianwiley/alphaswarm-tenant-router:v0.2.0 \
--push .
Tagged releases build both images automatically:
alphaswarm_platform/.github/workflows/build-publish.yml pushes them
to ECR with Cosign keyless signatures, SBOM + SLSA provenance, and a
Trivy scan via the build-sign-push composite.
Step 1 — Deploy in parallel (week 6)
Auth posture first. The tenant-router ships fail-closed (AUTH_MODE=required with an empty issuer) and will crash-loop until the IdP issuer/audience are stamped into alphaswarm-tenant-router-config. Complete steps 1-2 of the tenant-router auth rollout runbook before (or together with) this apply.
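To make "fail-closed" concrete, here is a sketch of the kind of startup check that produces the crash-loop described above. It is not the tenant-router's actual code: AUTH_MODE comes from this runbook, but the key names AUTH_ISSUER and AUTH_AUDIENCE, the default mode, and the function name are hypothetical.

```python
def validate_auth_config(env: dict[str, str]) -> None:
    """Refuse to start when auth is required but the IdP settings are empty.

    AUTH_ISSUER / AUTH_AUDIENCE are placeholder key names, not confirmed
    entries of alphaswarm-tenant-router-config.
    """
    mode = env.get("AUTH_MODE", "required")  # assumption: unset means required
    if mode != "required":
        return
    missing = [
        key
        for key in ("AUTH_ISSUER", "AUTH_AUDIENCE")
        if not env.get(key, "").strip()
    ]
    if missing:
        # Raising at startup makes the pod exit, which Kubernetes
        # surfaces as a crash-loop until the ConfigMap is filled in.
        raise RuntimeError(f"AUTH_MODE=required but empty: {', '.join(missing)}")


# Passes once both (made-up) values are present.
validate_auth_config({
    "AUTH_MODE": "required",
    "AUTH_ISSUER": "https://idp.example.invalid/",
    "AUTH_AUDIENCE": "alphaswarm",
})
```

Failing at startup rather than at first request is what makes a misconfigured rollout visible before any traffic is shifted.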
# Apply both Deployments + Services + PodDisruptionBudgets
# (+ the tenant-router's ConfigMap, NetworkPolicy, HPA, and the
# alphaswarm-cell-bound-validator Service):
kubectl apply -k alphaswarm_platform/deployments/kubernetes/edge/alphaswarm-edge/
kubectl apply -k alphaswarm_platform/deployments/kubernetes/edge/alphaswarm-tenant-router/
# Verify the tenant-router hydrated the cells cache. The port-forward
# blocks, so run it in a separate terminal:
kubectl -n alphaswarm-edge port-forward svc/alphaswarm-tenant-router 18080:8080
curl -sS http://127.0.0.1:18080/readyz
# expected: {"status":"ok","cells":<n>,"auth_mode":"required","cba_mode":"enforce"}
DNS still points to the Python proxy. No user traffic flows to
alphaswarm-edge yet.
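The /readyz check from this step can be scripted rather than inspected by eye. The sketch below assumes the body has exactly the fields shown in the expected output of the curl command; requiring cells to be at least 1 is an inference from the prerequisites, and the function name is hypothetical.

```python
import json

# Field values taken from the expected /readyz output in this step.
EXPECTED = {"status": "ok", "auth_mode": "required", "cba_mode": "enforce"}


def readyz_problems(body: str) -> list[str]:
    """Return a list of problems with a /readyz body; empty means healthy."""
    doc = json.loads(body)
    problems = [
        f"{key}={doc.get(key)!r}, expected {want!r}"
        for key, want in EXPECTED.items()
        if doc.get(key) != want
    ]
    cells = doc.get("cells")
    # Assumption: a hydrated cache holds at least one cell.
    if not isinstance(cells, int) or cells < 1:
        problems.append(f"cells={cells!r}, expected at least 1")
    return problems


healthy = '{"status":"ok","cells":3,"auth_mode":"required","cba_mode":"enforce"}'
print(readyz_problems(healthy))
```

A possible use is to pipe the curl output into a small wrapper and gate the next step on an empty problem list.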
Step 2 — DNS canary 10% (week 7)
Cloudflare Workers + Load Balancer split the apex hostname (alpha-swarm.ai)
across the two backends:
# cloudflare/alphaswarm_load_balancer.tf (excerpt)
resource "cloudflare_load_balancer_pool" "alphaswarm_proxy_legacy" {
origins = [{ name = "alphaswarm-client", address = "...", weight = 0.9 }]
}
resource "cloudflare_load_balancer_pool" "alphaswarm_proxy_envoy" {
origins = [{ name = "alphaswarm-edge", address = "...", weight = 0.1 }]
}
Apply via alphaswarm deploy terraform plan apply (NEVER raw terraform apply per AGENTS rule 42).
Verify both pools healthy:
kubectl -n alphaswarm-edge get pods -l app=alphaswarm-edge
kubectl -n alphaswarm-edge get pods -l app=alphaswarm-tenant-router
# Tail tenant-router logs for any 503s / cache misses:
kubectl -n alphaswarm-edge logs -l app=alphaswarm-tenant-router --tail=200 -f
Stop conditions (rollback to 100% legacy):
- alphaswarm-tenant-router /readyz returns 503 for > 1 minute.
- Envoy 5xx rate on alphaswarm-edge ingress > 0.5% over a 5-minute window.
- Any audit event with cell_id IS NULL after the canary starts (indicates the X-AlphaSwarm-Cell header isn't propagating into RequestContext).
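The three stop conditions can be encoded as a single check, which is handy if the canary is watched by a script. This is a sketch under the assumption that the metrics have already been collected; how they are gathered (Prometheus, logs, SQL) is outside its scope, and the class and field names are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CanarySnapshot:
    readyz_503_seconds: float    # how long /readyz has been returning 503
    edge_5xx: int                # Envoy 5xx responses in the last 5 minutes
    edge_total: int              # all Envoy responses in the same window
    null_cell_audit_events: int  # audit rows with cell_id IS NULL since canary start


def stop_reasons(snapshot: CanarySnapshot) -> list[str]:
    """Return the stop conditions that currently hold; empty means continue."""
    reasons = []
    if snapshot.readyz_503_seconds > 60:
        reasons.append("tenant-router /readyz 503 for > 1 minute")
    if snapshot.edge_total > 0 and snapshot.edge_5xx / snapshot.edge_total > 0.005:
        reasons.append("Envoy 5xx rate > 0.5% over 5 minutes")
    if snapshot.null_cell_audit_events > 0:
        reasons.append("audit events with NULL cell_id")
    return reasons


print(stop_reasons(CanarySnapshot(0, 6, 1000, 0)))
```

Any non-empty result maps to the same action: set the Cloudflare LB weights back to 100% legacy.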
Step 3 — 50% traffic (week 8)
Cloudflare LB weight: 0.5 / 0.5. Repeat the verification + stop
conditions from step 2. Watch the alphaswarm.cell.id distribution in
Tempo (TraceQL metrics, with the query step set to 5m):
{ .alphaswarm.cell.id = "cell-shared-std-local" } | count_over_time()
Both routes should converge on the same cell-id distribution.
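One way to quantify "converge on the same distribution" is to compare each cell's share of spans on the two routes. The sketch below assumes per-cell span counts have already been exported for each route; the function names are hypothetical, and any pass/fail threshold applied to the result is an operator decision that this runbook does not specify.

```python
def shares(counts: dict[str, int]) -> dict[str, float]:
    """Convert per-cell span counts into fractions of the route's total."""
    total = sum(counts.values())
    return {cell: n / total for cell, n in counts.items()} if total else {}


def max_share_gap(legacy: dict[str, int], envoy: dict[str, int]) -> float:
    """Largest absolute difference in any cell's traffic share between routes."""
    a, b = shares(legacy), shares(envoy)
    return max(
        (abs(a.get(cell, 0.0) - b.get(cell, 0.0)) for cell in a.keys() | b.keys()),
        default=0.0,
    )


# Made-up counts: both routes send every span to the same single cell.
gap = max_share_gap({"cell-shared-std-local": 900}, {"cell-shared-std-local": 100})
print(gap)
```

Comparing shares rather than raw counts keeps the check meaningful at any weight split, since the two routes carry different volumes by design.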
Step 4 — 100% traffic (week 9)
Cloudflare LB weight: 0.0 / 1.0. The Python proxy continues to run but receives no live traffic. Keep it running for 7 days as the rollback safety net.
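The weight progression across steps 2 to 4 can be written down as data and sanity-checked before any change is applied. This is a sketch only: the weights are the ones stated in this runbook, while the structure and the validate function are hypothetical and not part of the Terraform configuration.

```python
# (legacy weight, envoy weight) per step, as stated in this runbook.
SCHEDULE = {
    "step 2 (week 7)": (0.9, 0.1),
    "step 3 (week 8)": (0.5, 0.5),
    "step 4 (week 9)": (0.0, 1.0),
}
ROLLBACK = (1.0, 0.0)  # all traffic back on the legacy proxy


def validate(schedule: dict[str, tuple[float, float]]) -> None:
    """Check that each split sums to 1 and the Envoy share never decreases."""
    previous_envoy = 0.0
    for step, (legacy, envoy) in schedule.items():
        if abs(legacy + envoy - 1.0) > 1e-9:
            raise ValueError(f"{step}: weights {legacy} + {envoy} do not sum to 1")
        if envoy < previous_envoy:
            raise ValueError(f"{step}: envoy share decreased")
        previous_envoy = envoy


validate(SCHEDULE)
```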
Step 5 — Remove the Python FastAPI proxy (week 10)
This step is intentionally NOT in the Phase 3 PR; it lands as a
follow-up after the 7-day soak. The follow-up removes the FastAPI
proxy module (the production stage's uvicorn entrypoint) from
alphaswarm_platform/build/docker/alphaswarm_client/Dockerfile and
strips the /api/*, /ws/*, /manage/*, and /static route
handlers from alphaswarm/api/main.py.
Tag the last buildable proxy image (alphaswarm-client:proxy-last-stable)
before the removal lands so a regression has a known-good rollback
target.
Rollback at any step
- Cloudflare LB weight back to 1.0 / 0.0: instant traffic drain back to the legacy proxy.
- kubectl -n alphaswarm-edge scale deployment alphaswarm-edge --replicas=0 prevents Envoy from accepting any traffic even if DNS still points at it.
Phase 3 §6.6 follow-up — the removal PR
The Python proxy lives at
alphaswarm/api/proxy.py + the relevant routes in
alphaswarm/api/main.py. The Phase 3 §6.6 removal PR:
- Cuts the route registrations.
- Updates the alphaswarm-client Dockerfile to drop the proxy CMD.
- Removes the proxy's tests under tests/api/.
- Tags the prior commit alphaswarm-client-proxy-final so a rollback restores the buildable artifact.