
Edge authentication & cell routing

Every request entering the hosted platform crosses one authentication decision point before it reaches a cell: alphaswarm-edge (Envoy) makes a single ext_authz callout to alphaswarm-tenant-router, which verifies identity and decides cell placement in one pass.

Fail-closed verification

The router's posture is an explicit setting (ALPHASWARM_TENANT_ROUTER_AUTH_MODE), and the default is the strict one — see the rollout runbook for the operational details:

| Mode | No token | Invalid token | Valid token |
| --- | --- | --- | --- |
| required (default, hosted cells) | 401 | 401 | allow |
| permissive (canary/migration) | allow, flagged | 401 | allow |
| disabled (local dev; needs ALLOW_INSECURE=true too) | allow | unsigned decode | unsigned decode |
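The mode table can be read as a small decision function. The sketch below is illustrative only: the names `AuthMode`, `TokenState`, and `decide` are hypothetical and not taken from the router's code, and the additional ALLOW_INSECURE=true requirement for disabled mode is not modeled.

```python
from enum import Enum
from typing import Tuple


class AuthMode(str, Enum):
    REQUIRED = "required"
    PERMISSIVE = "permissive"
    DISABLED = "disabled"


class TokenState(str, Enum):
    MISSING = "missing"
    INVALID = "invalid"
    VALID = "valid"


def decide(mode: AuthMode, token: TokenState) -> Tuple[bool, str]:
    """Return (allowed, note) following the mode table above.

    Hypothetical helper: the real router's types and return shape may differ.
    """
    if mode is AuthMode.DISABLED:
        # Local dev only: nothing is verified, tokens are decoded unsigned.
        if token is TokenState.MISSING:
            return True, "allow"
        return True, "unsigned decode"
    if token is TokenState.VALID:
        return True, "allow"
    if token is TokenState.INVALID:
        # An invalid token is rejected in both verifying modes.
        return False, "401"
    # Token is missing: only permissive lets it through, and flags it.
    if mode is AuthMode.PERMISSIVE:
        return True, "allow, flagged"
    return False, "401"
```

Note that permissive relaxes only the missing-token case; a token that is present but invalid is still answered with 401.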

Three design rules keep the edge honest:

  1. Boot-time refusal. In required/permissive the pod exits at startup unless issuer + audience (and a derivable JWKS URI) are configured. A crash-looping edge is strictly better than one that silently routes unauthenticated traffic.
  2. Asymmetric algorithms only. RS*/PS*/ES*/EdDSA are the only acceptable JWT algorithms; HS* and none are rejected before any key material is consulted, closing the alg-confusion class of attacks. Verification semantics mirror alphaswarm_core.auth.jwt_validator.JwtValidator (kid selection, one forced JWKS refresh on unknown kid for key rotation, TTL cache that serves stale on IdP blips).
  3. Identity headers are always overwritten. On every ALLOW the router emits the full verified set — x-alphaswarm-sub, x-alphaswarm-tenant, x-alphaswarm-workspace, x-alphaswarm-org, x-alphaswarm-auth — empty when a claim is absent, so a client can never smuggle its own x-alphaswarm-* values past the edge. Per-cell FastAPI gates (alphaswarm.api.security) still re-validate the JWT; the edge is defense-in-depth, not the only boundary (AGENTS rule 11 applies at every layer).
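Rules 2 and 3 can be sketched in a few lines. This is a minimal illustration, not the router's implementation: the function names, the claim names mapped to each header (`tenant_id`, `workspace_id`, `organization_id`), and the value carried in x-alphaswarm-auth are all assumptions.

```python
import base64
import json
from typing import Dict, Mapping

# Explicit allowlist of asymmetric algorithms; HS* and "none" are simply absent.
ALLOWED_ALGS = frozenset({
    "RS256", "RS384", "RS512",
    "PS256", "PS384", "PS512",
    "ES256", "ES384", "ES512",
    "EdDSA",
})

# Header -> claim name. The claim names are assumptions for illustration.
IDENTITY_CLAIMS = {
    "x-alphaswarm-sub": "sub",
    "x-alphaswarm-tenant": "tenant_id",
    "x-alphaswarm-workspace": "workspace_id",
    "x-alphaswarm-org": "organization_id",
}


def check_algorithm(token: str) -> str:
    """Reject disallowed algorithms using only the JWT header, before any key lookup."""
    header_b64 = token.split(".", 1)[0]
    header_b64 += "=" * (-len(header_b64) % 4)  # restore stripped base64url padding
    header = json.loads(base64.urlsafe_b64decode(header_b64))
    alg = header.get("alg")
    if not isinstance(alg, str) or alg not in ALLOWED_ALGS:
        raise ValueError(f"rejected JWT alg: {alg!r}")
    return alg


def identity_headers(claims: Mapping[str, object], auth_state: str) -> Dict[str, str]:
    """Build the full header set; absent claims become empty strings.

    Emitting every header on every ALLOW overwrites anything the client sent.
    """
    headers = {
        name: str(claims.get(claim) or "")
        for name, claim in IDENTITY_CLAIMS.items()
    }
    headers["x-alphaswarm-auth"] = auth_state  # value semantics are an assumption
    return headers
```

Using an allowlist rather than a blocklist means an unknown or future algorithm name is rejected by default.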

B2C / B2B tier routing

Cell selection composes the multi-tenancy model with the deployment tiers from RESTRUCTURING_PLAN.md §6.1:

| Plan | JWT tier claim | Cell tier | Tenancy strategy |
| --- | --- | --- | --- |
| B2C consumer | (none) or shared-std | shared-std | shared_schema_rls |
| B2B premium | shared-prem | shared-prem | schema_per_tenant |
| Regulated enterprise | (registry pinning) | silo-reg | database_per_enterprise |
| Custom contract | (registry pinning) | silo-custom | hybrid |

Resolution order, per request:

  1. Registry pinning is authoritative — a tenant listed in a cell's pinned_tenants always lands there (silo cells, controlled migrations), regardless of token claims.
  2. The verified tier claim (namespaced https://alphaswarm.internal/tier, stamped by the Auth0 Action / Entra claims pipeline) selects the tier. An explicit tier is honored or refused with 503 — never silently downgraded onto another tier's tenancy strategy.
  3. Default tier (shared-std) otherwise.
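The resolution order above can be sketched as follows. The `Cell` dataclass, `TierUnavailable`, and `resolve_candidates` are hypothetical names for illustration. The source specifies the 503 only for an explicit tier; raising the same error when the default tier has no active cell is an assumption of this sketch.

```python
from dataclasses import dataclass
from typing import FrozenSet, List, Mapping, Sequence

TIER_CLAIM = "https://alphaswarm.internal/tier"
DEFAULT_TIER = "shared-std"


@dataclass(frozen=True)
class Cell:
    name: str
    tier: str
    active: bool = True
    pinned_tenants: FrozenSet[str] = frozenset()


class TierUnavailable(Exception):
    """Would be mapped to HTTP 503 by the caller."""


def resolve_candidates(
    tenant_id: str, claims: Mapping[str, object], cells: Sequence[Cell]
) -> List[Cell]:
    # 1. Registry pinning wins, regardless of token claims.
    for cell in cells:
        if tenant_id in cell.pinned_tenants:
            return [cell]
    # 2. A verified tier claim selects the tier; 3. otherwise use the default.
    tier = claims.get(TIER_CLAIM) or DEFAULT_TIER
    candidates = [c for c in cells if c.tier == tier and c.active]
    if not candidates:
        # Never fall back to another tier's tenancy strategy.
        raise TierUnavailable(str(tier))
    return candidates
```

The returned list holds the cells eligible for this request; choosing one of them is a separate step.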

Within a tier, unpinned tenants spread across active cells by rendezvous (highest-random-weight) hashing, keyed on the first available of tenant_id, organization_id, and sub. Every router replica picks the same cell with no shared state, a tenant is sticky to its cell, and adding or draining a cell remaps only the tenants that hash to that cell.
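A minimal sketch of rendezvous hashing follows, to make the stickiness property concrete. The choice of SHA-256, the key/cell separator, and the function names are assumptions; the router may weight cells differently.

```python
import hashlib
from typing import Optional, Sequence


def routing_key(
    tenant_id: Optional[str], organization_id: Optional[str], sub: Optional[str]
) -> str:
    """Pick the first identifier that is present, in the documented order."""
    for value in (tenant_id, organization_id, sub):
        if value:
            return value
    raise ValueError("no identifier available to key the hash on")


def pick_cell(key: str, cells: Sequence[str]) -> str:
    """Return the cell with the highest hash weight for this key."""
    if not cells:
        raise ValueError("no active cells in tier")

    def weight(cell: str) -> int:
        digest = hashlib.sha256(f"{cell}\x00{key}".encode("utf-8")).digest()
        return int.from_bytes(digest[:8], "big")

    # The cell name breaks the (extremely unlikely) weight tie deterministically.
    return max(cells, key=lambda cell: (weight(cell), cell))
```

Because each (cell, key) weight is computed independently, removing a cell changes the winner only for keys whose winner was that cell, and adding a cell moves only the keys for which the new cell now has the highest weight.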

Registry staleness (the router caches the control plane's /manage/cells view) is reported, never failed closed — the data plane keeps routing on last-known-good cells through a control-plane outage, surfacing registry_stale in /readyz and a counter in /metrics.
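One way to picture "reported, never failed closed" is a cache that keeps its last-known-good view when a refresh fails and exposes a staleness flag. The class below is a hypothetical sketch; `CellRegistryCache`, `max_age_s`, and the 30-second default are assumptions, not the router's actual names or values.

```python
import time
from typing import Callable, List, Optional


class CellRegistryCache:
    """Serve the last successfully fetched cell list; report staleness separately."""

    def __init__(
        self,
        fetch: Callable[[], List[str]],
        max_age_s: float = 30.0,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._fetch = fetch          # e.g. a call to the control plane's /manage/cells
        self._max_age_s = max_age_s
        self._clock = clock
        self._cells: List[str] = []
        self._fetched_at: Optional[float] = None
        self.refresh_failures = 0    # would back a counter in /metrics

    def refresh(self) -> bool:
        try:
            cells = self._fetch()
        except Exception:
            # Control-plane outage: keep the old view, only count the failure.
            self.refresh_failures += 1
            return False
        self._cells = list(cells)
        self._fetched_at = self._clock()
        return True

    @property
    def registry_stale(self) -> bool:
        """Would be surfaced in /readyz; routing never consults this flag."""
        if self._fetched_at is None:
            return True
        return self._clock() - self._fetched_at > self._max_age_s

    def cells(self) -> List[str]:
        return list(self._cells)
```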

Cell-Bound-Authorization (cross-cell calls)

Cross-cell calls are the highest-risk path (Phase 5 §8.5). The mint side lives in alphaswarm.auth.cell_bound; the router hosts the validator at POST /cell_bound/v1/check (the alphaswarm-cell-bound-validator Service selects the same pods):

  • No Cell-Bound-Authorization header → pass. External user traffic and same-cell calls never carry one; the response still emits empty x-alphaswarm-cell-source-* headers so smuggled values are stripped.
  • Header present → the token must verify against the source cell's published keys (cells-registry annotation alphaswarm.internal/cba-jwks, JWKS JSON or PEM), with iss = source cell, aud = destination cell, a ≤90 s lifetime (mint stamps 60 s), required jti, and per-replica replay rejection. Valid CBAs inject x-alphaswarm-cell-source + x-alphaswarm-cell-source-workload (SPIFFE id) so destination-cell services can authorize the calling workload.
  • CBA_MODE=monitor logs would-be denials without blocking (rollout aid); enforce is the default and is safe before any workload mints CBAs because headerless requests pass through.
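The claim checks in the bullets above can be sketched as follows, assuming the signature has already been verified against the source cell's published keys. All names here are hypothetical, measuring lifetime as `exp - iat` is an assumption, and `aud` is treated as a single string for simplicity.

```python
from typing import Dict, Mapping, Optional

MAX_LIFETIME_S = 90  # upper bound from the text; the mint side stamps 60 s


class ReplayCache:
    """Per-replica memory of jti values that have not yet expired."""

    def __init__(self) -> None:
        self._seen: Dict[str, float] = {}

    def first_use(self, jti: str, exp: float, now: float) -> bool:
        # Drop entries whose tokens have expired; they can no longer be replayed.
        self._seen = {j: e for j, e in self._seen.items() if e > now}
        if jti in self._seen:
            return False
        self._seen[jti] = exp
        return True


def cba_denial_reason(
    claims: Mapping[str, object],
    *,
    source_cell: str,
    destination_cell: str,
    replay: ReplayCache,
    now: float,
) -> Optional[str]:
    """Return None when the claims pass, otherwise a short denial reason."""
    if claims.get("iss") != source_cell:
        return "iss is not the source cell"
    if claims.get("aud") != destination_cell:
        return "aud is not the destination cell"
    iat, exp = claims.get("iat"), claims.get("exp")
    if not isinstance(iat, (int, float)) or not isinstance(exp, (int, float)):
        return "missing iat or exp"
    if exp <= now:
        return "expired"
    if exp - iat > MAX_LIFETIME_S:
        return "lifetime exceeds 90 s"
    jti = claims.get("jti")
    if not isinstance(jti, str) or not jti:
        return "missing jti"
    # Replay check runs last so that invalid tokens never occupy the cache.
    if not replay.first_use(jti, exp, now):
        return "replayed jti"
    return None


def allows(reason: Optional[str], mode: str = "enforce") -> bool:
    """In monitor mode a would-be denial is logged by the caller but not blocked."""
    return reason is None or mode == "monitor"
```

Since the replay cache is per replica, a token replayed against a different replica within its lifetime would not be caught by this check alone; the short lifetime bounds that window.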

Where things live

| Surface | Path |
| --- | --- |
| Router service + tests | alphaswarm_platform/tenant_router/ |
| Edge Envoy config (canonical template) | alphaswarm_platform/build/docker/alphaswarm-edge/envoy.template.yaml |
| Deployment (ConfigMap, NetworkPolicy, HPA, Services) | alphaswarm_platform/deployments/kubernetes/edge/alphaswarm-tenant-router/ |
| Backend JWT validation (per-cell) | alphaswarm/auth/oidc.py, alphaswarm_core/auth/jwt_validator.py |
| CBA mint/verify library | alphaswarm/auth/cell_bound.py |
| Operator runbook | Tenant-router auth rollout |
| Cutover history | Cell-router cutover |