Cell-router cutover runbook

Phase 3 §6 of RESTRUCTURING_PLAN.md. Covers the cutover from the single-container Python FastAPI cell proxy (in alphaswarm_client/) to the Envoy + alphaswarm-tenant-router two-component cell router. This runbook is the operator-facing companion to the deployment manifests at alphaswarm_platform/deployments/kubernetes/edge/.

Architecture (Phase 3 §6.4)

[ user / agent ]
        │ TLS
        ▼
[ Cloudflare Tunnel (alpha-swarm.ai) ]
        │
        ▼
[ alphaswarm-edge: Envoy (HTTP-only) ]
        │ ext_authz callout
        │ ──────────────────────▶ [ alphaswarm-tenant-router ]
        │                                   │ /resolve
        │                                   ▼
        │                         [ cells registry (control plane) ]
        │ ◀──────────────────── x-alphaswarm-cell header
        │
        ▼ Route on x-alphaswarm-cell:
[ alphaswarm-cell-<id>-api (FastAPI) ]
[ alphaswarm-cell-<id>-workers (Celery, gVisor for agents) ]
[ alphaswarm-cell-<id>-postgres ] [ alphaswarm-cell-<id>-minio ]
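
The request flow in the diagram can be sketched in a few lines of Python. This is an illustration only, not the tenant-router's code: the registry shape, the tenant keys, the cell ids, the function names resolve_cell and upstream_for, and the choice to deny tenants whose cell is not active are all assumptions. Only the x-alphaswarm-cell header name and the alphaswarm-cell-<id>-api naming come from the diagram.

```python
from typing import Optional

CELL_HEADER = "x-alphaswarm-cell"  # header name taken from the diagram

# Hypothetical registry snapshot: tenant -> cell row (ids are made up).
REGISTRY = {
    "tenant-a": {"id": "std-01", "state": "active"},
    "tenant-b": {"id": "std-02", "state": "draining"},
}


def resolve_cell(tenant: str, registry: dict) -> Optional[dict]:
    """Mimic the ext_authz answer: headers to inject, or None to deny."""
    row = registry.get(tenant)
    if row is None or row["state"] != "active":
        return None  # assumed behavior: unknown tenant or non-active cell is denied
    return {CELL_HEADER: row["id"]}


def upstream_for(headers: dict) -> str:
    """Mimic Envoy's route match on the injected header."""
    return f"alphaswarm-cell-{headers[CELL_HEADER]}-api"
```

For example, upstream_for(resolve_cell("tenant-a", REGISTRY)) yields alphaswarm-cell-std-01-api, while resolve_cell("tenant-b", REGISTRY) returns None and the request never reaches a cell.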

Prerequisites

  1. The four canonical AlphaSwarm images (alphaswarm-api, alphaswarm-worker, alphaswarm-client, alphaswarm-controller) are running on the pre-Phase-3 single-namespace topology. The Phase 3 work runs IN PARALLEL until the canary completes — nothing is taken away from the running fleet.
  2. The Alembic head is at 0083_audit_cell_id_column.py. Verify:
    alembic current
    # expected: 0083_audit_cell_id_column (head)
  3. The cells registry has at least one state=active cell row. Verify via the control plane:
    curl -sS https://manage.alpha-swarm.ai/manage/cells | jq '.data[].id'
  4. The alphaswarm-edge namespace exists and carries the alphaswarm.io/host-network-allowed: "true" exception label per Phase 2 §5.4.
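
Prerequisites 2 and 3 can be checked mechanically once the command output is captured. The sketch below parses the text printed by alembic current and the JSON returned by /manage/cells. It assumes the response has a top-level data list (consistent with the jq filter above) and that each row carries a state field; the function names are hypothetical.

```python
import json

EXPECTED_HEAD = "0083_audit_cell_id_column"


def alembic_at_head(current_output: str, expected: str = EXPECTED_HEAD) -> bool:
    """True if some line of `alembic current` output reads '<expected> (head)'."""
    return any(
        line.split() == [expected, "(head)"]
        for line in current_output.splitlines()
    )


def active_cell_ids(cells_json: str) -> list:
    """Return ids of rows whose (assumed) `state` field equals 'active'."""
    rows = json.loads(cells_json)["data"]
    return [row["id"] for row in rows if row.get("state") == "active"]
```

An empty list from active_cell_ids means prerequisite 3 is not met, even if the registry contains rows in other states.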

Step 0 — Build the Phase 3 images

Both images build from the alphaswarm_platform repo root (the post-repo-split context):

cd alphaswarm_platform

# alphaswarm-edge (Envoy)
docker buildx build \
  --platform linux/amd64,linux/arm64 \
  --file build/docker/alphaswarm-edge/Dockerfile \
  --tag ghcr.io/julianwiley/alphaswarm-edge:v0.2.0 \
  --push .

# alphaswarm-tenant-router (Python + uvloop)
docker buildx build \
  --platform linux/amd64,linux/arm64 \
  --file build/docker/alphaswarm-tenant-router/Dockerfile \
  --tag ghcr.io/julianwiley/alphaswarm-tenant-router:v0.2.0 \
  --push .

Tagged releases build both images automatically: alphaswarm_platform/.github/workflows/build-publish.yml pushes them to ECR with Cosign keyless signatures, SBOM + SLSA provenance, and a Trivy scan via the build-sign-push composite.

Step 1 — Deploy in parallel (week 6)

Auth posture first. The tenant-router ships fail-closed (AUTH_MODE=required with an empty issuer) and will crash-loop until the IdP issuer/audience are stamped into alphaswarm-tenant-router-config. Complete steps 1-2 of the tenant-router auth rollout runbook before (or together with) this apply.

# Apply both Deployments + Services + PodDisruptionBudgets
# (+ the tenant-router's ConfigMap, NetworkPolicy, HPA, and the
# alphaswarm-cell-bound-validator Service):
kubectl apply -k alphaswarm_platform/deployments/kubernetes/edge/alphaswarm-edge/
kubectl apply -k alphaswarm_platform/deployments/kubernetes/edge/alphaswarm-tenant-router/

# Verify the tenant-router hydrated the cells cache.
# Terminal 1 (port-forward blocks until interrupted):
kubectl -n alphaswarm-edge port-forward svc/alphaswarm-tenant-router 18080:8080
# Terminal 2:
curl -sS http://127.0.0.1:18080/readyz
# expected: {"status":"ok","cells":<n>,"auth_mode":"required","cba_mode":"enforce"}

DNS still points to the Python proxy. No user traffic flows to alphaswarm-edge yet.
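
The /readyz check in this step can be turned into a pass/fail gate. The sketch below validates a response body against the expected output shown above. Requiring cells to be at least 1 is an assumption derived from prerequisite 3, and the function name is hypothetical.

```python
import json


def readyz_ok(body: str) -> bool:
    """Check a /readyz body against the expectations in step 1."""
    doc = json.loads(body)
    cells = doc.get("cells")
    return (
        doc.get("status") == "ok"
        and isinstance(cells, int)
        and cells >= 1  # assumed: a hydrated cache holds at least one cell
        and doc.get("auth_mode") == "required"
        and doc.get("cba_mode") == "enforce"
    )
```

Feeding it the output of the curl command above gives a single boolean that a deploy script can act on.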

Step 2 — DNS canary 10% (week 7)

Cloudflare Workers + Load Balancer split the apex hostname (alpha-swarm.ai) across the two backends:

# cloudflare/alphaswarm_load_balancer.tf (excerpt)
resource "cloudflare_load_balancer_pool" "alphaswarm_proxy_legacy" {
  origins = [{ name = "alphaswarm-client", address = "...", weight = 0.9 }]
}
resource "cloudflare_load_balancer_pool" "alphaswarm_proxy_envoy" {
  origins = [{ name = "alphaswarm-edge", address = "...", weight = 0.1 }]
}

Apply via alphaswarm deploy terraform plan apply (NEVER raw terraform apply per AGENTS rule 42).

Verify that both pools are healthy:

kubectl -n alphaswarm-edge get pods -l app=alphaswarm-edge
kubectl -n alphaswarm-edge get pods -l app=alphaswarm-tenant-router

# Tail tenant-router logs for any 503s / cache misses:
kubectl -n alphaswarm-edge logs -l app=alphaswarm-tenant-router --tail=200 -f

Stop conditions (rollback to 100% legacy):

  • alphaswarm-tenant-router /readyz returns 503 for > 1 minute.
  • Envoy 5xx rate on alphaswarm-edge ingress > 0.5% over a 5-minute window.
  • Any audit event with cell_id IS NULL after the canary starts (indicates the X-AlphaSwarm-Cell header isn't propagating into RequestContext).
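
The three stop conditions above can be evaluated together from a handful of numbers. The sketch below is one way to encode them. The CanaryWindow fields are hypothetical names for values that would come from the monitoring stack and the audit table; the thresholds are the ones listed above.

```python
from dataclasses import dataclass


@dataclass
class CanaryWindow:
    readyz_503_seconds: float    # longest continuous run of 503s from /readyz
    requests_5m: int             # Envoy ingress requests in the last 5 minutes
    errors_5xx_5m: int           # of those, responses with a 5xx status
    null_cell_audit_events: int  # audit rows with cell_id IS NULL since canary start


def stop_reasons(w: CanaryWindow) -> list:
    """Return the stop conditions that fired; an empty list means keep going."""
    reasons = []
    if w.readyz_503_seconds > 60:
        reasons.append("readyz 503 for more than 1 minute")
    # Integer form of errors / requests > 0.5%, which avoids float rounding.
    if w.errors_5xx_5m * 1000 > w.requests_5m * 5:
        reasons.append("5xx rate above 0.5% over 5 minutes")
    if w.null_cell_audit_events > 0:
        reasons.append("audit events with NULL cell_id")
    return reasons
```

Any non-empty result maps to the same action: set the load balancer back to 100% legacy as described under "Rollback at any step".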

Step 3 — 50% traffic (week 8)

Cloudflare LB weight: 0.5 / 0.5. Repeat the verification and stop conditions from step 2. Watch the alphaswarm.cell.id distribution in Tempo (TraceQL metrics query, run with a 5-minute step):

{ .alphaswarm.cell.id = "cell-shared-std-local" } | count_over_time()

Both routes should converge on the same cell-id distribution.
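
"Converge" can be made measurable by comparing per-cell request counts from the two routes. The sketch below uses total variation distance, which is a choice made here for illustration and not something the plan prescribes. The input dictionaries (cell id to count) are assumed to be exported from the Tempo query per route.

```python
def shares(counts: dict) -> dict:
    """Normalize per-cell counts into fractions of the total."""
    total = sum(counts.values())
    return {cell: n / total for cell, n in counts.items()} if total else {}


def total_variation(legacy: dict, envoy: dict) -> float:
    """Half the L1 distance between the two cell-id distributions (0 = identical)."""
    p, q = shares(legacy), shares(envoy)
    cells = set(p) | set(q)
    return 0.5 * sum(abs(p.get(c, 0.0) - q.get(c, 0.0)) for c in cells)
```

A value near 0 means both routes send the same mix of traffic to each cell; a value near 1 means they disagree almost completely. Any alerting threshold would need to be chosen from observed baseline noise.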

Step 4 — 100% traffic (week 9)

Cloudflare LB weight: 0.0 / 1.0. The Python proxy continues to run but receives no live traffic. Keep it running for 7 days as the rollback safety net.

Step 5 — Remove the Python FastAPI proxy (week 10)

This step is intentionally NOT in the Phase 3 PR; it lands as a follow-up after the 7-day soak. The follow-up removes the FastAPI proxy module wired into alphaswarm_platform/build/docker/alphaswarm_client/Dockerfile (the production stage's uvicorn entrypoint) and strips the /api/*, /ws/*, /manage/*, and /static route handlers from alphaswarm/api/main.py.

Tag the last buildable proxy image (alphaswarm-client:proxy-last-stable) before the removal lands so a regression has a known-good rollback target.

Rollback at any step

  • Cloudflare LB weight back to 1.0 / 0.0 — instant traffic drain back to the legacy proxy.
  • kubectl -n alphaswarm-edge scale deployment alphaswarm-edge --replicas=0 prevents Envoy from accepting any traffic even if DNS still points at it.
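
The weight schedule from steps 1 to 4 and the rollback target can be written down as data, which makes it easy to sanity-check before each Terraform change. This is a sketch: the percentages and weeks are transcribed from this runbook, while the names SCHEDULE, ROLLBACK, validate, and weights_for are hypothetical.

```python
# (week, legacy %, envoy %), transcribed from steps 1-4 of this runbook.
SCHEDULE = [
    (6, 100, 0),   # step 1: deployed in parallel, no user traffic
    (7, 90, 10),   # step 2: canary
    (8, 50, 50),   # step 3
    (9, 0, 100),   # step 4
]
ROLLBACK = (100, 0)  # all traffic back on the legacy proxy


def validate(schedule: list) -> None:
    """Raise ValueError if a row does not sum to 100 or Envoy's share drops."""
    previous_envoy = 0
    for week, legacy, envoy in schedule:
        if legacy + envoy != 100:
            raise ValueError(f"week {week}: weights sum to {legacy + envoy}, not 100")
        if envoy < previous_envoy:
            raise ValueError(f"week {week}: Envoy share went down")
        previous_envoy = envoy


def weights_for(week: int, schedule: list = SCHEDULE) -> tuple:
    """Return (legacy %, envoy %) in force during the given week."""
    current = ROLLBACK  # before the rollout starts, everything is on legacy
    for start_week, legacy, envoy in schedule:
        if week >= start_week:
            current = (legacy, envoy)
    return current
```

Calling validate(SCHEDULE) before a change catches transcription mistakes such as a 0.9 / 0.2 split; weights_for(8) returns (50, 50).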

Phase 3 §6.6 follow-up — the removal PR

The Python proxy lives in alphaswarm/api/proxy.py, together with the relevant routes in alphaswarm/api/main.py. The Phase 3 §6.6 removal PR:

  1. Cuts the route registrations.
  2. Updates the alphaswarm-client Dockerfile to drop the proxy CMD.
  3. Removes the proxy's tests under tests/api/.
  4. Tags the prior commit alphaswarm-client-proxy-final so a rollback restores the buildable artifact.