AlphaSwarm Management Engine
Canonical narrative for the unified management/control surface
shipped by the alphaswarm_management_engine plan
(.cursor/plans/alphaswarm_management_engine_fd9f1de7.plan.md).
What it owns
The Management Engine is the single direct-control surface for:
- Workload lifecycle: start / stop / scale / restart / exec / tail logs / apply config / rotate secret. One Python ABC (alphaswarm_core.providers.InfrastructureProvider), one runtime (alphaswarm_core.runtime.WorkloadRuntime), one audit ledger row per action (workload_runs).
- Identity provider configuration: Auth0 + Microsoft Entra ID (MSAL) + Cloudflare Access, all registered through IdentityProviderMeta. The BFF (/auth/{providers,exchange,refresh,logout}) is the canonical surface for SPA + Theia clients.
- Cloudflare edge: tunnels, DNS records, Access apps. Runtime CRUD via alphaswarm.cloudflare.CloudflareEdgeAdapter; IaC via the alphaswarm_platform/terraform/modules/cloudflare_edge module (provider cloudflare/cloudflare ~> 5.6).
- Entra tenant onboarding: pending -> active via POST /tenancy/entra-links/{id}/promote (Phase E of the plan).
- alphaswarm_admin service identity: per-deployment Microsoft Entra Agent Identities (alphaswarm_admin_agent_identity Terraform module). Replaces the legacy shared-client_credentials path for outbound admin-to-CP + admin-to-monolith calls. See admin-agent-identity.md.
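The "one ABC, one runtime, one ledger row per action" shape can be sketched as below. This is an illustration only, not the shipped code: the class names InfrastructureProvider and WorkloadRuntime and the workload_runs ledger come from the list above, while the method signatures, the WorkloadRun fields, the status strings, and the in-memory ledger are assumptions made for the example.

```python
from __future__ import annotations

from abc import ABC, abstractmethod
from dataclasses import dataclass
from datetime import datetime, timezone


class InfrastructureProvider(ABC):
    """Hypothetical shape of the provider ABC; the real method set may differ."""

    @abstractmethod
    def start(self, target: str) -> None: ...

    @abstractmethod
    def stop(self, target: str) -> None: ...


@dataclass
class WorkloadRun:
    """Stand-in for one workload_runs ledger row (field names are assumed)."""

    action: str
    target: str
    status: str
    started_at: datetime


class WorkloadRuntime:
    """Routes each action to the provider and records exactly one row for it."""

    def __init__(self, provider: InfrastructureProvider) -> None:
        self._actions = {"start": provider.start, "stop": provider.stop}
        self.ledger: list[WorkloadRun] = []  # in-memory stand-in for the audit sink

    def run(self, action: str, target: str) -> WorkloadRun:
        if action not in self._actions:
            raise ValueError(f"unsupported action: {action!r}")
        started = datetime.now(timezone.utc)
        try:
            self._actions[action](target)
            status = "SUCCEEDED"  # assumed status value
        except Exception:
            status = "FAILED"  # assumed status value
        row = WorkloadRun(action, target, status, started)
        self.ledger.append(row)  # one row per action, success or failure
        return row


class InMemoryProvider(InfrastructureProvider):
    """Toy provider used only to exercise the sketch."""

    def __init__(self) -> None:
        self.running: set[str] = set()

    def start(self, target: str) -> None:
        self.running.add(target)

    def stop(self, target: str) -> None:
        self.running.remove(target)  # raises KeyError when the target is not running
```

Recording the row after the provider call, whether it succeeded or raised, is what keeps the ledger at one row per action in this sketch.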
Architecture
Deployment modes
ALPHASWARM_MANAGEMENT_MODE controls how the engine runs:
| Mode | Workload calls go to | Audit sink | Use case |
|---|---|---|---|
| embedded (default) | In-process WorkloadRuntime | PostgresWorkloadAuditSink | Single-image deployment |
| sidecar | HTTP /manage/* proxy -> alphaswarm_controller | JsonlAuditSink | Air-gapped or multi-tenant deployments |
Both modes import the SAME WorkloadRuntime class — operators
choose by setting the env var; no code branches.
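Mode selection from the environment might look like the following. This is a minimal sketch under stated assumptions: the variable name, the two mode values, the default, and the sink names come from the table above, but the helper functions, the case-insensitive parsing, and the rejection of unknown values are hypothetical.

```python
import os
from typing import Mapping, Optional

VALID_MODES = ("embedded", "sidecar")

# Sink names are taken from the table; they are plain strings here because the
# constructors of the real sink classes are not described in this document.
AUDIT_SINK_BY_MODE = {
    "embedded": "PostgresWorkloadAuditSink",
    "sidecar": "JsonlAuditSink",
}


def resolve_management_mode(env: Optional[Mapping[str, str]] = None) -> str:
    """Return the configured mode, falling back to 'embedded' when unset."""
    source = os.environ if env is None else env
    mode = source.get("ALPHASWARM_MANAGEMENT_MODE", "embedded").strip().lower()
    if mode not in VALID_MODES:
        # Assumption: unknown values fail fast rather than silently defaulting.
        raise ValueError(f"unsupported ALPHASWARM_MANAGEMENT_MODE: {mode!r}")
    return mode


def audit_sink_name(mode: str) -> str:
    """Map a validated mode to the audit sink listed for it."""
    return AUDIT_SINK_BY_MODE[mode]
```

Passing the environment as an argument keeps the helper testable without mutating os.environ.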
Provider matrix
| Provider | start / stop / scale | restart | exec | tail_logs | rotate_secret | Notes |
|---|---|---|---|---|---|---|
| docker_compose | yes | yes | yes (Docker SDK) | yes | no | Local dev + admin overlays |
| kubernetes | yes | yes (annotation bump) | yes (stream + _preload_content=False) | yes (watch.Watch().stream) | yes (rolling restart) | Production target |
| aws | stub | stub | stub | stub | stub | Real health + delegated list_deployments when EKS attached |
| azure | stub | stub | stub | stub | stub | Real health + delegated list_deployments when AKS attached |
| gcp | stub | stub | stub | stub | stub | Real health + delegated list_deployments when GKE attached |
| cloudflare | yes | yes (config reload) | n/a | n/a | destructive (opt-in) | Tunnel + Access app + DNS lifecycle |
Cloud providers gate K8s delegation on
ALPHASWARM_CP_{AWS,AZURE,GCP}_DELEGATE_K8S=true.
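The delegation gate could be read as in this sketch. The variable naming pattern comes from the line above; the helper name, the case-insensitive comparison, and the "anything other than true means off" behaviour are assumptions.

```python
import os
from typing import Mapping, Optional

CLOUD_PROVIDERS = ("aws", "azure", "gcp")


def k8s_delegation_enabled(
    provider: str, env: Optional[Mapping[str, str]] = None
) -> bool:
    """True only when ALPHASWARM_CP_<PROVIDER>_DELEGATE_K8S is set to 'true'."""
    if provider not in CLOUD_PROVIDERS:
        raise ValueError(f"not a cloud provider: {provider!r}")
    source = os.environ if env is None else env
    flag = source.get(f"ALPHASWARM_CP_{provider.upper()}_DELEGATE_K8S", "")
    # Assumption: an unset or non-'true' value leaves the provider on its stubs.
    return flag.strip().lower() == "true"
```

With the flag off, a cloud provider in this sketch would keep serving only the stubbed operations from the matrix.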
Halt + audit
- POST /workloads/halt fires the WorkloadRuntime.halt_all helper (per-process registry) and writes a HALTED finish row for every in-flight workload_runs entry. Wired into the frontend KillSwitch alongside the existing halt endpoints (rule 45 + frontend rule 2).
- Every audit row carries experiment_id + test_id per AGENTS rule 34. The Postgres mirror table (workload_runs, Alembic 0055) is indexed on status + started_at DESC, action + started_at DESC, and provider_alias + target.
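The halt behaviour described above can be sketched with a small per-process registry. The HALTED status and the experiment_id / test_id fields come from the text; the RunRow and RunRegistry names, the remaining fields, and the "RUNNING" status are hypothetical stand-ins, and the real halt_all lives on WorkloadRuntime rather than on a separate class.

```python
from __future__ import annotations

from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Optional


@dataclass
class RunRow:
    """Stand-in for a workload_runs row; only experiment_id/test_id are documented."""

    run_id: str
    experiment_id: str
    test_id: str
    status: str = "RUNNING"  # assumed in-flight status value
    finished_at: Optional[datetime] = None


class RunRegistry:
    """Per-process registry of in-flight runs (hypothetical shape)."""

    def __init__(self) -> None:
        self._in_flight: dict[str, RunRow] = {}

    def begin(self, row: RunRow) -> None:
        self._in_flight[row.run_id] = row

    def halt_all(self) -> list[RunRow]:
        """Mark every in-flight run HALTED and return the finished rows."""
        now = datetime.now(timezone.utc)
        halted = []
        for row in self._in_flight.values():
            row.status = "HALTED"
            row.finished_at = now
            halted.append(row)
        self._in_flight.clear()  # nothing stays in flight after a halt
        return halted
```

Because the registry is per-process, a halt in this sketch only reaches runs started by the same process; a multi-process deployment would need each process to receive the halt.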
Cloudflare end-to-end
Phase D of the plan ships:
- alphaswarm/cloudflare/{client,adapter}.py: Python SDK wrapper + CloudflareEdgeAdapter (tunnels, DNS, Access apps).
- alphaswarm/api/routes/cloudflare.py: REST surface under /cloudflare/* (cluster:admin for writes, cluster:read for reads).
- alphaswarm/data/mcp/tools/cloudflare.py: DataMCP tools for agents (data.cloudflare.{health,list_tunnels,create_tunnel,put_tunnel_config,list_access_apps,put_access_app,put_dns_record}).
- alphaswarm/auth/providers/cloudflare_access.py: new CloudflareAccessProvider that validates Cf-Access-Jwt-Assertion headers and merges claims into the active RequestContext.
- alphaswarm_platform/terraform/modules/cloudflare_edge + Jinja codegen template (alphaswarm/terraform/codegen/templates/cloudflare_edge.tf.j2) + cloudflare = "~> 5.6" in alphaswarm_platform/terraform/versions.tf.
- Optional cloudflare_enabled block in alphaswarm_platform/terraform/environments/rpi/main.tf: replaces the manual cloudflared deployment under rpi_kubernetes/kubernetes/base-services/cloudflared/.
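The read/write scope split on the /cloudflare/* routes can be illustrated as follows. Only the two scope names come from the list above. Treating GET, HEAD, and OPTIONS as reads, and treating cluster:admin as not implying cluster:read, are assumptions of this sketch; the actual route guards may behave differently.

```python
from typing import Iterable

# Assumption: these HTTP methods count as reads; everything else is a write.
READ_METHODS = frozenset({"GET", "HEAD", "OPTIONS"})


def required_scope(method: str) -> str:
    """Return the scope a /cloudflare/* request would need for this method."""
    return "cluster:read" if method.upper() in READ_METHODS else "cluster:admin"


def is_authorized(method: str, granted_scopes: Iterable[str]) -> bool:
    """Check the caller's scopes against the scope required by the method."""
    return required_scope(method) in set(granted_scopes)
```

For example, a caller holding only cluster:read could list tunnels under this sketch but could not create one.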
Frontend
- alphaswarm_client/src/lib/api/{workloads,cloudflare,clusterPods}.ts: typed clients matching the new REST surface.
- alphaswarm_client/src/routes/manage/page.tsx: Workload Studio.
- alphaswarm_client/src/routes/cluster-mgmt/page.tsx: Cluster pods browser (exec + log tail land in Phase F-2).
- alphaswarm_client/src/routes/cloudflare/page.tsx: Cloudflare edge studio.
- alphaswarm_client/src/lib/auth/MsalProvider.tsx: new MSAL branch of AuthProvider; selects between <MsalProvider> and <Auth0Provider> based on authConfig.provider.
- alphaswarm_client/public/redirect.html: MSAL v5 redirect bridge.
Theia
- theia-extensions/alphaswarm/src/browser/auth/alphaswarm-auth-service.ts: additive BFF auth service (calls /auth/providers + /auth/refresh). Auth0Service still owns the direct PKCE flow.
- theia-extensions/alphaswarm/src/browser/widgets/management-widget.tsx: iframe embedding the Vite Workload Studio, cluster-mgmt, and cloudflare routes inside Theia. New env vars on browser.Dockerfile: ALPHASWARM_THEIA_FRONTEND_URL, ALPHASWARM_THEIA_PROVIDERS_URL.
Subagent + rule + skill
- .cursor/agents/alphaswarm-management-engine.md: direct-control subagent that maps every control route to a data.* MCP tool and refuses raw HTTP shortcuts.
- .cursor/rules/alphaswarm-management-engine.mdc: always-on rule that bans printing tokens, refresh tokens, M2M client_secrets, MFA secrets, Cf-Access-Jwt-Assertion values, kubeconfig contents, and full Authorization headers in any transcript.
- .cursor/skills/alphaswarm-management-engine/SKILL.md: named workflows the subagent reaches for first (start, stop, restart, exec, tail-logs, provision-tunnel, rotate-secret, promote-entra-link, halt-all).
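The transcript rule is enforced by the agent rule file rather than by code, but the same idea can be applied to headers before they are logged. The helper below is a hypothetical sketch and not part of the repository: it keeps the Authorization scheme while dropping the credential, and fully masks Cf-Access-Jwt-Assertion.

```python
from typing import Dict, Mapping

REDACTED = "[REDACTED]"


def redact_headers(headers: Mapping[str, str]) -> Dict[str, str]:
    """Return a copy of the headers that is safe to print in a transcript."""
    safe: Dict[str, str] = {}
    for name, value in headers.items():
        lowered = name.lower()
        if lowered == "authorization":
            scheme, sep, _credential = value.partition(" ")
            # Keep the scheme (e.g. "Bearer") only when one is present;
            # a bare token with no scheme is masked entirely.
            safe[name] = f"{scheme} {REDACTED}" if sep else REDACTED
        elif lowered == "cf-access-jwt-assertion":
            safe[name] = REDACTED
        else:
            safe[name] = value
    return safe
```

Header names are matched case-insensitively because HTTP header names are case-insensitive; the original casing is preserved in the returned dict.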