Tenant-router auth rollout runbook

Operator companion to Edge authentication & cell routing and to the manifests at alphaswarm_platform/deployments/kubernetes/edge/alphaswarm-tenant-router/. This runbook follows the cell-router cutover; run that one first if the Envoy edge is not serving yet.

The tenant-router ships fail-closed: AUTH_MODE=required with an empty issuer, so a fresh apply crash-loops with a SettingsError until you stamp real IdP values. That is intentional — complete this runbook to bring the edge up authenticated.

1. Prerequisites

  1. The IdP is provisioned (Auth0 via terraform/modules/auth0_identity or Entra via alphaswarm_entra_directory) and the per-cell backends already validate the same issuer/audience (ALPHASWARM_AUTH_OIDC_ISSUER / ..._AUDIENCE in alphaswarm-config, stamped by build/scripts/sync_auth0_env_to_k8s.py).
  2. The claims pipeline stamps the namespaced routing claims (https://alphaswarm.internal/tenant_id, workspace_id, and — for B2B premium plans — tier). See Auth0 Actions / MSAL setup.
  3. The cells registry has at least one state=active cell per tier you route to (curl -sS $CP/manage/cells | jq '.data[].tier').
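
Prerequisite 2 can be spot-checked by decoding a token's payload and looking for the routing claims. The helpers below are a minimal sketch, not platform tooling: jwt_payload and has_claim are hypothetical names, the payload is assumed to be compact JSON, and nothing here verifies the signature.

```shell
# Hypothetical helpers (not shipped with the platform): inspect a JWT's
# payload without verifying its signature.

# Print the decoded JSON payload (second dot-separated segment) of a JWT.
jwt_payload() {
  local seg="${1#*.}"      # drop the header segment
  seg="${seg%%.*}"         # drop the signature segment
  # base64url -> base64, then restore the stripped '=' padding.
  seg=$(printf '%s' "$seg" | tr '_-' '/+')
  while [ $(( ${#seg} % 4 )) -ne 0 ]; do seg="${seg}="; done
  printf '%s' "$seg" | base64 -d
}

# Succeed if the payload JSON on stdin contains the given claim name as a key.
# Assumes compact JSON (no whitespace between the key and the colon).
has_claim() {
  grep -qF "\"$1\":"
}
```

For example, jwt_payload "$TOKEN" | has_claim 'https://alphaswarm.internal/tenant_id' exits 0 when the tenant claim is present. The fully namespaced names of workspace_id and tier are not spelled out above, so pass whatever names your claims pipeline actually stamps.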

2. Stamp the auth ConfigMap

Edit (or overlay-patch) alphaswarm-tenant-router-config in deployments/kubernetes/edge/alphaswarm-tenant-router/configmap.yaml:

data:
  ALPHASWARM_TENANT_ROUTER_AUTH_MODE: "permissive" # step 3 flips to required
  ALPHASWARM_TENANT_ROUTER_OIDC_ISSUER: "https://<tenant>.us.auth0.com/"
  ALPHASWARM_TENANT_ROUTER_OIDC_AUDIENCE: "https://api.alphaswarm.internal/manage"

The JWKS URI derives from the issuer (<issuer>/.well-known/jwks.json); set ALPHASWARM_TENANT_ROUTER_JWKS_URI only for non-standard IdPs. Only asymmetric algorithms are accepted — if you change OIDC_ALGORITHMS, HS* values are refused at boot.
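
The two rules above can be checked before applying the ConfigMap. This is a sketch under assumptions, not the router's implementation: jwks_uri_for and algorithms_ok are hypothetical helper names, the trailing slash on the issuer is assumed to be normalized away before the path is appended, and OIDC_ALGORITHMS is assumed to be a comma-separated list.

```shell
# Hypothetical helpers mirroring the rules described above; the router's own
# implementation may differ.

# Derive the default JWKS URI from an issuer, tolerating a trailing slash.
jwks_uri_for() {
  local issuer="${1%/}"
  printf '%s/.well-known/jwks.json\n' "$issuer"
}

# Succeed only if no algorithm in a comma-separated list is HMAC-based (HS*).
# The comma-separated format is an assumption about OIDC_ALGORITHMS.
algorithms_ok() {
  local alg
  for alg in $(printf '%s' "$1" | tr ',' ' '); do
    case "$alg" in
      HS*) echo "refused: $alg is symmetric" >&2; return 1 ;;
    esac
  done
  return 0
}
```

Running curl -sf "$(jwks_uri_for "$ISSUER")" from a pod in alphaswarm-edge is a quick way to confirm the derived URI is reachable before the router depends on it.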

Apply + restart:

kubectl apply -k alphaswarm_platform/deployments/kubernetes/edge/
kubectl -n alphaswarm-edge rollout restart deploy/alphaswarm-tenant-router
kubectl -n alphaswarm-edge rollout status deploy/alphaswarm-tenant-router

3. Canary in permissive, then enforce

In permissive mode the router denies invalid tokens but lets anonymous requests through, flagged with x-alphaswarm-auth: anonymous (per-cell gates still reject them where they require auth). Watch the decision counters:

kubectl -n alphaswarm-edge port-forward svc/alphaswarm-tenant-router 8080 &
curl -s localhost:8080/metrics | grep authz_decisions_total
# alphaswarm_tenant_router_authz_decisions_total{decision="allow",mode="permissive",reason="verified"} 1042
# alphaswarm_tenant_router_authz_decisions_total{decision="allow",mode="permissive",reason="anonymous"} 3
# alphaswarm_tenant_router_authz_decisions_total{decision="deny",mode="permissive",reason="expired_token"} 7
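
The counters above can be reduced to a single number for the canary window. A minimal sketch: anonymous_share is a hypothetical helper that reads the /metrics text on stdin and prints the fraction of allowed decisions that were anonymous. It assumes the label layout shown in the sample output.

```shell
# Hypothetical helper: read Prometheus text-format metrics on stdin and print
# the fraction of allowed authz decisions that were anonymous.
anonymous_share() {
  awk '
    /^alphaswarm_tenant_router_authz_decisions_total[{]/ && /decision="allow"/ {
      allowed += $NF                                  # sample value is the last field
      if ($0 ~ /reason="anonymous"/) anon += $NF
    }
    END {
      if (allowed > 0) printf "%.4f\n", anon / allowed
      else print "no-data"
    }
  '
}
```

Usage: curl -s localhost:8080/metrics | anonymous_share. The counters are cumulative since pod start, so compare readings taken across the window rather than relying on a single scrape.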

When reason="anonymous" stays near zero over a representative window (only unauthenticated probes remain), flip to enforcement:

  ALPHASWARM_TENANT_ROUTER_AUTH_MODE: "required"

Re-apply, restart, and confirm that readyz reports the new posture:

curl -s localhost:8080/readyz | jq
# {"status":"ok","cells":3,"auth_mode":"required","cba_mode":"enforce",...}

4. Verification checks

# Anonymous is denied (required mode):
curl -s -o /dev/null -w '%{http_code}\n' -XPOST localhost:8080/ext_authz/v3/check \
  -H 'content-type: application/json' \
  -d '{"attributes":{"request":{"http":{"headers":{}}}}}'
# 401

# A live SPA token is verified and routed:
TOKEN=$(...fetch from the SPA / device flow...)
curl -s -XPOST localhost:8080/ext_authz/v3/check \
  -H 'content-type: application/json' \
  -d "{\"attributes\":{\"request\":{\"http\":{\"headers\":{\"authorization\":\"Bearer ${TOKEN}\"}}}}}" \
  -D - -o /dev/null | grep -i x-alphaswarm
# x-alphaswarm-cell: cell-shared-std-us-east-1a
# x-alphaswarm-auth: verified
# x-alphaswarm-sub: auth0|...

End-to-end through the edge, a tampered or expired token must produce 401 from Envoy, and x-alphaswarm-* request headers sent by the client must arrive at the cell overwritten with verified values.

5. Cross-cell CBA keys (Phase 5 §8.5)

Cross-cell calls present a Cell-Bound-Authorization JWT. The validator (co-located in the router) reads each source cell's verification keys from the cells-registry annotation — publish them when you enable cross-cell MCP:

curl -sS -XPATCH "$CP/manage/cells/cell-shared-std-us-east-1a" \
  -H "authorization: Bearer $MGMT_TOKEN" -H 'content-type: application/json' \
  -d '{"annotations":{"alphaswarm.internal/cba-jwks":"{\"keys\":[...]}"}}'

CBA_MODE=enforce (default) is safe before any workload mints CBAs — requests without the header pass through. Use monitor to log would-be denials during key rollout; check cba_decisions_total{decision="deny"} before returning to enforce. Single-cell edges should additionally pin ALPHASWARM_TENANT_ROUTER_CBA_DESTINATION_CELL_ID to their own cell id.
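
The deny check before returning to enforce can be scripted. A sketch under assumptions: cba_deny_count is a hypothetical helper, and the cba_decisions_total series is assumed to use the same text layout and decision label as the authz counters.

```shell
# Hypothetical helper: sum cba_decisions_total samples labelled decision="deny"
# from Prometheus text-format metrics on stdin. The label layout is assumed to
# mirror the authz counters.
cba_deny_count() {
  awk '
    /cba_decisions_total[{]/ && !/^#/ && /decision="deny"/ { denies += $NF }
    END { printf "%d\n", denies }
  '
}
```

Usage: curl -s localhost:8080/metrics | cba_deny_count. Because the counter is cumulative, take one reading at the start of the monitor window and one at the end; an unchanged value suggests no would-be denials occurred in between.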

6. Rollback

Auth enforcement is config-only — no image rollback needed:

  1. Flip AUTH_MODE back to permissive (NOT disabled; the insecure mode also demands ALLOW_INSECURE=true and is for local dev only).
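  2. Re-apply and restart as in step 2: kubectl apply -k alphaswarm_platform/deployments/kubernetes/edge/, then kubectl -n alphaswarm-edge rollout restart deploy/alphaswarm-tenant-router.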
  3. The decision counters (/metrics) and structured authz_deny logs (reason codes: missing_token, expired_token, wrong_audience, wrong_issuer, no_matching_key, forbidden_algorithm, jwks_unreachable) identify what was being denied before you re-enforce.
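
For step 3, the deny counters can be grouped by reason code. A minimal sketch: deny_reasons is a hypothetical helper that assumes the label layout shown in the step 3 sample output and sums across modes.

```shell
# Hypothetical helper: print "<count> <reason>" for each denied authz decision,
# largest first, from Prometheus text-format metrics on stdin.
deny_reasons() {
  awk '
    /^alphaswarm_tenant_router_authz_decisions_total[{]/ && /decision="deny"/ {
      if (match($0, /reason="[^"]*"/)) {
        # Skip the 8-character label prefix and drop the closing quote.
        reason = substr($0, RSTART + 8, RLENGTH - 9)
        counts[reason] += $NF
      }
    }
    END { for (r in counts) printf "%d %s\n", counts[r], r }
  ' | sort -rn
}
```

Usage: curl -s localhost:8080/metrics | deny_reasons. A list dominated by expired_token usually points at clients, while wrong_audience, wrong_issuer or jwks_unreachable point back at the ConfigMap values or egress.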

Failure modes worth knowing

Symptom | Cause | Response
--- | --- | ---
Pod crash-loops with SettingsError | Missing issuer/audience in required/permissive | Stamp the ConfigMap (step 2).
All requests 401 jwks_unreachable | Router cannot reach the IdP JWKS (egress 443 blocked, wrong issuer) | Check the NetworkPolicy + issuer URL; the JWKS cache serves stale once warmed, so this bites hardest on cold boots.
401 no_matching_key after IdP key rotation | kid not in cached JWKS | The router force-refreshes once per unknown kid automatically; persistent failures mean the issuer/JWKS URI points at the wrong tenant.
503 no_cell_available for premium users | No active shared-prem cell | Explicit tiers are never downgraded; activate a cell for the tier or fix the claim pipeline.
readyz shows registry_stale: true | Control plane unreachable > REGISTRY_STALENESS_WARN_SECONDS | Routing continues on last-known-good cells; restore alphaswarm-cp before making placement changes.