
Operations runbook — Kubernetes deployment

End-to-end walkthrough for shipping AlphaSwarm to any Kubernetes cluster (EKS, AKS, GKE, vanilla k3s, or the Raspberry Pi k3s cluster owned by rpi_kubernetes). AlphaSwarm is fully self-contained: every shared service it depends on (Postgres, Redis, Kafka, MinIO, MLflow, observability stack, etc.) ships in alphaswarm_platform/deployments/kubernetes/. There is no implicit dependency on rpi_kubernetes or any other repository.

Prerequisites

  • kubectl 1.30+ with a current context pointing at the target cluster.
  • Cluster-admin access (you will create namespaces and RBAC objects).
  • A container registry the cluster can pull from (Docker Hub / ECR / ACR / GCR).
  • An ingress controller (ingress-nginx recommended) and cert-manager with a letsencrypt-prod ClusterIssuer for the AlphaSwarm TLS hosts.
  • Auth0 tenant configured per alphaswarm_docs/architecture/decisions/003-auth0-zero-trust.md (default tenant alphaswarm-fund.us.auth0.com).
  • Cluster operators / CRDs installed via alphaswarm_platform/scripts/cluster_install/ (Strimzi, Spark Operator, OpenTelemetry Operator, Phoenix, Redpanda, etc.). Run the relevant installer before applying the AlphaSwarm base kustomization.

Targeted runbooks

Step 1 — provision Auth0 (one-time)

$env:AUTH0_DOMAIN = "your-tenant.us.auth0.com"
$env:AUTH0_M2M_CLIENT_ID = "..."
$env:AUTH0_M2M_CLIENT_SECRET = "..."
$env:ALPHASWARM_SYNC_URL = "https://api.alphaswarm.enterprise.com/_internal/auth0/sync"

python alphaswarm_platform/build/scripts/provision_auth0.py --dry-run # preview
python alphaswarm_platform/build/scripts/provision_auth0.py # apply

This idempotently creates the API resource server, the four roles, and the post-login Action.
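
Before running the provisioning script, it can help to confirm that the four variables from the block above are actually set. The following is a minimal sketch of such a pre-flight check. It assumes the script reads exactly these environment variables, which the source implies but does not state, and the helper name missing_vars is hypothetical.

```python
import os

# The four variable names set in the step above.
REQUIRED_VARS = (
    "AUTH0_DOMAIN",
    "AUTH0_M2M_CLIENT_ID",
    "AUTH0_M2M_CLIENT_SECRET",
    "ALPHASWARM_SYNC_URL",
)


def missing_vars(environ=None):
    """Return required names that are unset, empty, or still the "..." placeholder."""
    env = os.environ if environ is None else environ
    return [
        name
        for name in REQUIRED_VARS
        if env.get(name, "").strip() in ("", "...")
    ]
```

Calling missing_vars() with no arguments inspects the current process environment; an empty list means all four values are present.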

Step 2 — generate the K8s ConfigMap + Secret scaffold

make generate-config ENV=k8s

Produces:

  • alphaswarm_platform/deployments/kubernetes/base/configmaps/alphaswarm-config.yaml (commit this)
  • alphaswarm_platform/deployments/kubernetes/base/secrets/alphaswarm-secrets.yaml.template (DO NOT commit values — CI/CD or external-secrets-operator patches real values)

Step 3 — build + push images

$env:IMAGE_TAG = "rc-$(git rev-parse --short HEAD)-$(Get-Date -Format yyyy-MM-dd)"
make build-client IMAGE_TAG=$env:IMAGE_TAG
make build-cp IMAGE_TAG=$env:IMAGE_TAG

# Optional (only if the Dockerfiles exist in alphaswarm_platform/build/docker/*)
make build-worker IMAGE_TAG=$env:IMAGE_TAG
make build-ingestion IMAGE_TAG=$env:IMAGE_TAG

docker login
docker push docker.io/julianwiley/alphaswarm-client:$env:IMAGE_TAG
docker push docker.io/julianwiley/alphaswarm-controller:$env:IMAGE_TAG
docker push docker.io/julianwiley/alphaswarm-worker:$env:IMAGE_TAG
docker push docker.io/julianwiley/alphaswarm-ingestion:$env:IMAGE_TAG

If make build-worker or make build-ingestion reports a missing Dockerfile, pin those image tags to known-good prebuilt registry tags in the target overlay before applying.
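
The tag format used above (rc-, short commit hash, ISO date) can also be produced and validated outside PowerShell, for example in CI. This is a minimal sketch; the function names and the validation pattern are illustrative assumptions, not part of the AlphaSwarm build tooling.

```python
import datetime
import re

# Assumed shape of a release-candidate tag: rc-<hex short sha>-<YYYY-MM-DD>.
TAG_PATTERN = re.compile(r"rc-[0-9a-f]{4,40}-\d{4}-\d{2}-\d{2}")


def build_image_tag(short_sha, day):
    """Mirror the PowerShell expression: rc-<short sha>-<yyyy-MM-dd>."""
    return f"rc-{short_sha}-{day.isoformat()}"


def is_release_candidate_tag(tag):
    """Return True when the whole tag matches the assumed rc format."""
    return TAG_PATTERN.fullmatch(tag) is not None
```

A check like is_release_candidate_tag can guard against pushing a mistyped or empty tag before the docker push commands run.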

Step 3b — one-shot Alembic migration (cluster)

After alphaswarm-api is pullable on the cluster, run:

kubectl apply -f alphaswarm_platform/deployments/kubernetes/base/jobs/alembic-upgrade.yaml
kubectl -n alphaswarm wait --for=condition=complete job/alphaswarm-alembic-upgrade --timeout=900s
kubectl -n alphaswarm logs job/alphaswarm-alembic-upgrade

The Job uses the same alphaswarm-config / alphaswarm-secrets env as alphaswarm-core and targets postgresql.alphaswarm-data-services.svc.cluster.local (the AlphaSwarm-owned Postgres in the alphaswarm-data-services namespace). A completed Job does not run again on its own, so to rerun the upgrade to head, delete the previous Job first (kubectl -n alphaswarm delete job alphaswarm-alembic-upgrade) and then re-apply the manifest.

alembic/env.py widens alembic_version.version_num to VARCHAR(128) automatically before migrations run (revision slugs longer than 32 characters otherwise fail at 0039_extended_instrument_taxonomy).
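
To see why the widening matters: Alembic creates alembic_version.version_num as VARCHAR(32) by default, and the slug named above is 33 characters long. The sketch below checks a revision id against that limit and shows plain PostgreSQL DDL with the same effect as the widening. It is an illustration only; the real alembic/env.py may implement the change differently.

```python
DEFAULT_VERSION_NUM_WIDTH = 32   # Alembic's default column width
WIDENED_VERSION_NUM_WIDTH = 128  # width stated in this runbook


def overflows_default_column(revision_id):
    """Return True when the revision id would not fit in VARCHAR(32)."""
    return len(revision_id) > DEFAULT_VERSION_NUM_WIDTH


# Equivalent PostgreSQL DDL (assumption: not copied from alembic/env.py).
WIDEN_SQL = (
    "ALTER TABLE alembic_version "
    f"ALTER COLUMN version_num TYPE VARCHAR({WIDENED_VERSION_NUM_WIDTH})"
)
```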

Brownfield Postgres (pre-Alembic or partial schema)

If alembic upgrade head fails with DuplicateTable / DuplicateColumn, the database was created outside Alembic tracking. From a workstation with the API image and a port-forward to cluster Postgres:

kubectl -n alphaswarm-data-services port-forward svc/postgresql 15432:5432
$env:ALPHASWARM_POSTGRES_DSN = "postgresql+psycopg2://alphaswarm:alphaswarm@host.docker.internal:15432/alphaswarm"
# Optional: stamp to the highest revision whose objects already exist, then upgrade.
# $env:ALPHASWARM_ALEMBIC_STAMP_REVISION = "0015_dbt_foundation"
bash scripts/cluster_alembic_upgrade.sh

Use ALPHASWARM_POSTGRES_DSN (maps to settings.postgres_dsn) — not a raw DATABASE_URL alias. Migration 0040_normalized_identifiers_backfill can take several minutes on large instruments tables.
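
A quick way to sanity-check the DSN before running the script is to split it into its parts and confirm that the port matches the local side of the port-forward (15432 above). This is a minimal sketch using only the standard library; describe_dsn is a hypothetical helper, and it deliberately omits the password from its output.

```python
from urllib.parse import urlsplit


def describe_dsn(dsn):
    """Split a SQLAlchemy-style DSN into its parts, leaving out the password."""
    parts = urlsplit(dsn)
    return {
        "driver": parts.scheme,
        "user": parts.username,
        "host": parts.hostname,
        "port": parts.port,
        "database": parts.path.lstrip("/"),
    }
```

If the reported port is 5432 rather than 15432, the DSN points at the in-cluster port instead of the forwarded local one.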

Postgres prerequisites (alphaswarm-data-services)

Migration 0045_pgvector_foundation requires the vector extension in the alphaswarm database. On existing clusters where the Postgres init script ran before the alphaswarm database was added, enable the extension once as the Postgres superuser:

kubectl -n alphaswarm-data-services exec deploy/postgresql -- \
psql -U postgres -d alphaswarm -c "CREATE EXTENSION IF NOT EXISTS vector;"

Fresh installs use the AlphaSwarm-owned alphaswarm_platform/deployments/kubernetes/base-services/postgres-shared/ manifests, whose init SQL creates the alphaswarm role/database and enables vector there.

Step 4 — pin the image tag in the target overlay

Edit alphaswarm_platform/deployments/kubernetes/overlays/<env>/kustomization.yaml:

images:
- name: docker.io/julianwiley/alphaswarm-client
  newTag: rc-abcdef01-2026-05-19
...
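
When several images share one tag, the images block can be generated rather than edited by hand. The sketch below renders the block as text for whatever image names are passed in. The helper name is hypothetical, and which images a given overlay actually lists is left to the caller.

```python
def render_images_block(image_names, new_tag):
    """Render a kustomize images: block that pins every name to new_tag."""
    lines = ["images:"]
    for name in image_names:
        lines.append(f"- name: {name}")
        lines.append(f"  newTag: {new_tag}")
    return "\n".join(lines) + "\n"
```

The output keeps newTag aligned with name under each list item, which is what kustomize needs to treat the two keys as one entry.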

Docker Hub pull secret (private repos)

Deployments reference dockerhub-pull-secret. Create it in both workload namespaces before rollout:

$env:DOCKERHUB_USER = "<dockerhub-username>"
$env:DOCKERHUB_TOKEN = "<dockerhub-access-token>" # hub.docker.com → Account Settings → Security

kubectl create secret docker-registry dockerhub-pull-secret `
--docker-server=https://index.docker.io/v1/ `
--docker-username=$env:DOCKERHUB_USER `
--docker-password=$env:DOCKERHUB_TOKEN `
-n alphaswarm --dry-run=client -o yaml | kubectl apply -f -

kubectl create secret docker-registry dockerhub-pull-secret `
--docker-server=https://index.docker.io/v1/ `
--docker-username=$env:DOCKERHUB_USER `
--docker-password=$env:DOCKERHUB_TOKEN `
-n alphaswarm-admin --dry-run=client -o yaml | kubectl apply -f -

Public repositories can omit the secret by removing imagePullSecrets from the deployment manifests.
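
For reference, a docker-registry Secret stores a single .dockerconfigjson key whose value is a small JSON document. The sketch below builds that payload from the same three inputs as the kubectl commands above, which can be useful when the Secret is managed through a secret store instead of kubectl. The function name is hypothetical.

```python
import base64
import json


def docker_config_json(server, username, token):
    """Build the JSON stored under the .dockerconfigjson key of a pull secret."""
    # The "auth" field is base64("<username>:<token>").
    auth = base64.b64encode(f"{username}:{token}".encode("utf-8")).decode("ascii")
    config = {
        "auths": {
            server: {"username": username, "password": token, "auth": auth}
        }
    }
    return json.dumps(config)
```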

Step 5 — apply

# Dry-run first
kubectl apply -k alphaswarm_platform/deployments/kubernetes/overlays/tower-dev --dry-run=server

# Apply
kubectl apply -k alphaswarm_platform/deployments/kubernetes/overlays/tower-dev

# Verify
kubectl -n alphaswarm get pods,svc,hpa,pdb
kubectl -n alphaswarm-admin get pods,svc

Step 6 — populate the Secret

If you're not using external-secrets-operator, populate the placeholder Secret manually:

kubectl -n alphaswarm create secret generic alphaswarm-secrets `
--from-literal=ALPHASWARM_DATABASE_PASSWORD="<value>" `
--from-literal=ALPHASWARM_AUTH_M2M_CLIENT_SECRET="<value>" `
--from-literal=ALPHASWARM_SESSION_COOKIE_SECRET="<value>" `
--dry-run=client -o yaml | kubectl apply -f -

For external-secrets-operator users, point an ExternalSecret at your secret store (Vault / SSM / Key Vault / Secret Manager) and let the operator create the K8s Secret.
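
As an illustration of the ExternalSecret route, the sketch below builds a manifest as a Python dict covering the three keys from the manual command above; the result can be serialized with json.dumps and passed to kubectl apply -f, which accepts JSON. The store name, the remote key layout, and the refresh interval are all hypothetical, and the apiVersion depends on the installed operator version.

```python
import json

# The three keys populated by the manual kubectl command above.
SECRET_KEYS = (
    "ALPHASWARM_DATABASE_PASSWORD",
    "ALPHASWARM_AUTH_M2M_CLIENT_SECRET",
    "ALPHASWARM_SESSION_COOKIE_SECRET",
)


def external_secret(store_name, remote_prefix,
                    api_version="external-secrets.io/v1beta1"):
    """Build an ExternalSecret that materializes alphaswarm-secrets.

    store_name and remote_prefix are placeholders for your own secret store.
    """
    return {
        "apiVersion": api_version,
        "kind": "ExternalSecret",
        "metadata": {"name": "alphaswarm-secrets", "namespace": "alphaswarm"},
        "spec": {
            "refreshInterval": "1h",  # assumed value
            "secretStoreRef": {"name": store_name, "kind": "ClusterSecretStore"},
            "target": {"name": "alphaswarm-secrets"},
            "data": [
                {"secretKey": key, "remoteRef": {"key": f"{remote_prefix}/{key}"}}
                for key in SECRET_KEYS
            ],
        },
    }
```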

Step 7 — DNS + TLS

The Ingresses expect:

  • alpha-swarm.ai -> alphaswarm-client Service in the alphaswarm namespace
  • api.alpha-swarm.ai -> alphaswarm-core Service in the alphaswarm namespace
  • manage.alpha-swarm.ai -> alphaswarm-cp Service in the alphaswarm-admin namespace

Point DNS at the NGINX Ingress controller's LoadBalancer IP. cert-manager handles TLS via the letsencrypt-prod ClusterIssuer (configure separately).
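
Once the records are in place, a short check can confirm that all three hosts resolve to the ingress controller's address. This is a minimal sketch: the host-to-Service mapping is taken from the list above, while the function names and the pluggable resolver argument are illustrative assumptions.

```python
import socket

# host -> (namespace, Service), as listed above.
INGRESS_HOSTS = {
    "alpha-swarm.ai": ("alphaswarm", "alphaswarm-client"),
    "api.alpha-swarm.ai": ("alphaswarm", "alphaswarm-core"),
    "manage.alpha-swarm.ai": ("alphaswarm-admin", "alphaswarm-cp"),
}


def resolve_ipv4(host):
    """Return the set of IPv4 addresses for host, or an empty set on failure."""
    try:
        return set(socket.gethostbyname_ex(host)[2])
    except OSError:
        return set()


def hosts_not_pointing_at(lb_ip, resolver=resolve_ipv4):
    """Return the hosts whose DNS answers do not include lb_ip."""
    return sorted(host for host in INGRESS_HOSTS if lb_ip not in resolver(host))
```

An empty result from hosts_not_pointing_at means every host already resolves to the LoadBalancer IP you passed in.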

Step 8 — smoke test

# Client should serve the SPA shell (/C: matches the phrase literally, /I ignores case)
curl -fsS https://alpha-swarm.ai/ | findstr /I /C:"<!doctype html"

# Control plane health (unauthenticated)
curl -fsS https://manage.alpha-swarm.ai/manage/health

# OpenAPI spec
curl -fsS https://manage.alpha-swarm.ai/manage/openapi.json | python -m json.tool | findstr title

# Cluster verification helper
bash scripts/verify_tower_cluster.sh
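
The first smoke check can also be written without curl and findstr, for use on machines where those are unavailable. This is a minimal sketch using only the standard library; the function names are hypothetical, and the doctype test mirrors the case-insensitive substring match rather than anything AlphaSwarm-specific.

```python
import urllib.request


def is_spa_shell(body):
    """Return True when the body contains an HTML doctype, ignoring case."""
    return "<!doctype html" in body.lower()


def fetch_text(url, timeout=10.0):
    """Fetch url and return the decoded body; HTTP 4xx/5xx responses raise."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return response.read().decode("utf-8", errors="replace")
```

For example, is_spa_shell(fetch_text("https://alpha-swarm.ai/")) should be true once the client is serving.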

Rollback

# Re-apply the previous overlay with the previous image tag.
git checkout HEAD~1 -- alphaswarm_platform/deployments/kubernetes/overlays/dev/kustomization.yaml
make deploy-k8s ENV=dev

Or, for an immediate rollback that doesn't touch git:

kubectl -n alphaswarm rollout undo deployment/alphaswarm-client
kubectl -n alphaswarm rollout undo deployment/alphaswarm-core
kubectl -n alphaswarm rollout undo deployment/alphaswarm-worker
kubectl -n alphaswarm-admin rollout undo deployment/alphaswarm-cp
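
If the immediate rollback is scripted, it is usually worth waiting for each Deployment to settle afterwards. The sketch below builds the same four rollout undo commands plus matching rollout status commands as argument lists. The helper names and the 120s timeout are assumptions, not part of the AlphaSwarm tooling.

```python
# (namespace, deployment) pairs from the commands above.
ROLLBACK_TARGETS = (
    ("alphaswarm", "alphaswarm-client"),
    ("alphaswarm", "alphaswarm-core"),
    ("alphaswarm", "alphaswarm-worker"),
    ("alphaswarm-admin", "alphaswarm-cp"),
)


def rollback_commands():
    """Return one kubectl rollout undo argv list per target."""
    return [
        ["kubectl", "-n", ns, "rollout", "undo", f"deployment/{name}"]
        for ns, name in ROLLBACK_TARGETS
    ]


def status_commands(timeout="120s"):
    """Return one kubectl rollout status argv list per target."""
    return [
        ["kubectl", "-n", ns, "rollout", "status",
         f"deployment/{name}", f"--timeout={timeout}"]
        for ns, name in ROLLBACK_TARGETS
    ]
```

Each list can be passed to subprocess.run(cmd, check=True), so that a failed undo or a rollout that does not settle stops the script.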