AlphaSwarm Management Engine

Canonical narrative for the unified management/control surface shipped by the alphaswarm_management_engine plan (.cursor/plans/alphaswarm_management_engine_fd9f1de7.plan.md).

What it owns

The Management Engine is the single direct-control surface for:

  • Workload lifecycle — start / stop / scale / restart / exec / tail logs / apply config / rotate secret. One Python ABC (alphaswarm_core.providers.InfrastructureProvider), one runtime (alphaswarm_core.runtime.WorkloadRuntime), one audit ledger row per action (workload_runs).
  • Identity provider configuration — Auth0 + Microsoft Entra ID (MSAL) + Cloudflare Access, all registered through IdentityProviderMeta. The BFF (/auth/{providers,exchange,refresh,logout}) is the canonical surface for SPA + Theia clients.
  • Cloudflare edge — tunnels, DNS records, Access apps. Runtime CRUD via alphaswarm.cloudflare.CloudflareEdgeAdapter; IaC via the alphaswarm_platform/terraform/modules/cloudflare_edge module (provider cloudflare/cloudflare ~> 5.6).
  • Entra tenant onboarding: links are promoted from pending to active via POST /tenancy/entra-links/{id}/promote (Phase E of the plan).
  • alphaswarm_admin service identity — per-deployment Microsoft Entra Agent Identities (alphaswarm_admin_agent_identity Terraform module). Replaces the legacy shared-client_credentials path for outbound admin-to-CP + admin-to-monolith calls. See admin-agent-identity.md.
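
The "one ABC, one runtime, one audit row per action" contract can be pictured with a minimal sketch. Everything below is an illustrative assumption rather than the shipped code: the class bodies, the AuditRow fields, the in-memory sink, and the SUCCEEDED / FAILED status names are hypothetical, and only two of the lifecycle actions are modelled.

```python
from __future__ import annotations

import abc
import uuid
from dataclasses import dataclass, field
from datetime import datetime, timezone


class InfrastructureProvider(abc.ABC):
    """Hypothetical, reduced stand-in for the real provider ABC."""

    alias: str = "base"

    @abc.abstractmethod
    def start(self, target: str) -> None: ...

    @abc.abstractmethod
    def stop(self, target: str) -> None: ...


@dataclass
class AuditRow:
    """One ledger row per action; field names are assumptions."""

    run_id: str
    provider_alias: str
    target: str
    action: str
    status: str
    started_at: datetime


@dataclass
class InMemoryAuditSink:
    """Test double standing in for a real audit sink."""

    rows: list[AuditRow] = field(default_factory=list)

    def write(self, row: AuditRow) -> None:
        self.rows.append(row)


class WorkloadRuntime:
    """Dispatches an action to the provider and always records it."""

    ACTIONS = ("start", "stop")

    def __init__(self, provider: InfrastructureProvider, sink: InMemoryAuditSink) -> None:
        self.provider = provider
        self.sink = sink

    def run(self, action: str, target: str) -> AuditRow:
        if action not in self.ACTIONS:
            raise ValueError(f"unsupported action: {action}")
        started = datetime.now(timezone.utc)
        status = "SUCCEEDED"
        try:
            getattr(self.provider, action)(target)
        except Exception:
            status = "FAILED"
            raise
        finally:
            # The row is written whether the provider call succeeded or raised.
            row = AuditRow(uuid.uuid4().hex, self.provider.alias, target, action, status, started)
            self.sink.write(row)
        return row
```

Writing the row in the finally clause is one way to keep the ledger complete when a provider call fails; the real runtime may record failures differently.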

Architecture

Deployment modes

ALPHASWARM_MANAGEMENT_MODE controls how the engine runs:

| Mode | Workload calls go to | Audit sink | Use case |
| --- | --- | --- | --- |
| embedded (default) | In-process WorkloadRuntime | PostgresWorkloadAuditSink | Single-image deployment |
| sidecar | HTTP /manage/* proxy -> alphaswarm_controller | JsonlAuditSink | Air-gapped or multi-tenant deployments |

Both modes import the same WorkloadRuntime class. Operators pick a mode by setting the env var; there is no mode-specific fork of the runtime code.
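
As a rough illustration of env-var-driven mode selection, the sketch below resolves ALPHASWARM_MANAGEMENT_MODE to the transport and audit sink named in the table above. The ManagementWiring type and the resolve_wiring helper are hypothetical names introduced for this example, and treating an unset or empty variable as embedded is an assumption based on the "(default)" label.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass(frozen=True)
class ManagementWiring:
    """Hypothetical description of what a mode wires together."""

    mode: str
    transport: str
    audit_sink: str


_WIRING = {
    "embedded": ManagementWiring(
        "embedded", "in-process WorkloadRuntime", "PostgresWorkloadAuditSink"
    ),
    "sidecar": ManagementWiring(
        "sidecar", "HTTP /manage/* proxy -> alphaswarm_controller", "JsonlAuditSink"
    ),
}


def resolve_wiring(environ: Optional[Mapping[str, str]] = None) -> ManagementWiring:
    """Pick the wiring for the configured mode, defaulting to embedded."""
    env = os.environ if environ is None else environ
    mode = (env.get("ALPHASWARM_MANAGEMENT_MODE") or "embedded").strip().lower()
    if mode not in _WIRING:
        raise ValueError(f"unknown ALPHASWARM_MANAGEMENT_MODE: {mode!r}")
    return _WIRING[mode]
```

Passing a plain dict as environ keeps the helper easy to test without mutating the process environment.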

Provider matrix

| Provider | start / stop / scale | restart | exec | tail_logs | rotate_secret | Notes |
| --- | --- | --- | --- | --- | --- | --- |
| docker_compose | yes | yes | yes (Docker SDK) | yes | no | Local dev + admin overlays |
| kubernetes | yes | yes (annotation bump) | yes (stream + _preload_content=False) | yes (watch.Watch().stream) | yes (rolling restart) | Production target |
| aws | stub | stub | stub | stub | stub | Real health + delegated list_deployments when EKS attached |
| azure | stub | stub | stub | stub | stub | Real health + delegated list_deployments when AKS attached |
| gcp | stub | stub | stub | stub | stub | Real health + delegated list_deployments when GKE attached |
| cloudflare | yes | yes (config reload) | n/a | n/a | destructive (opt-in) | Tunnel + Access app + DNS lifecycle |
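
For readers unfamiliar with the "annotation bump" restart listed for kubernetes: changing an annotation on a Deployment's pod template makes Kubernetes roll the pods, which is how kubectl rollout restart works. The sketch below only builds the patch body. The restart_patch helper is a hypothetical name, and using the kubectl annotation key is an assumption; the engine may bump a different key.

```python
from datetime import datetime, timezone
from typing import Optional


def restart_patch(
    now: Optional[datetime] = None,
    annotation: str = "kubectl.kubernetes.io/restartedAt",  # assumed key
) -> dict:
    """Build a patch that changes a pod-template annotation to force a rollout."""
    stamp = (now or datetime.now(timezone.utc)).isoformat()
    return {
        "spec": {
            "template": {
                "metadata": {"annotations": {annotation: stamp}},
            }
        }
    }
```

With the official Kubernetes Python client, a body like this would typically be passed to AppsV1Api.patch_namespaced_deployment(name, namespace, body).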

Cloud providers gate K8s delegation on ALPHASWARM_CP_{AWS,AZURE,GCP}_DELEGATE_K8S=true.
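
A minimal sketch of that gate, assuming the flag is read per cloud and compared case-insensitively against "true". The delegates_to_k8s and list_deployments helpers are hypothetical, as is raising NotImplementedError when delegation is off; the document only says the cloud providers are stubs.

```python
import os
from typing import Callable, List, Mapping, Optional

_CLOUDS = ("aws", "azure", "gcp")


def delegates_to_k8s(cloud: str, environ: Optional[Mapping[str, str]] = None) -> bool:
    """Return True when ALPHASWARM_CP_<CLOUD>_DELEGATE_K8S is set to true."""
    if cloud not in _CLOUDS:
        raise ValueError(f"unknown cloud provider: {cloud!r}")
    env = os.environ if environ is None else environ
    flag = env.get(f"ALPHASWARM_CP_{cloud.upper()}_DELEGATE_K8S", "")
    return flag.strip().lower() == "true"


def list_deployments(
    cloud: str,
    k8s_list: Callable[[], List[str]],
    environ: Optional[Mapping[str, str]] = None,
) -> List[str]:
    """Delegate to the Kubernetes provider only when the gate is open."""
    if delegates_to_k8s(cloud, environ):
        return k8s_list()
    # Assumed stub behaviour; the real stubs may respond differently.
    raise NotImplementedError(f"{cloud}: list_deployments is a stub without K8s delegation")
```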

Halt + audit

  • POST /workloads/halt fires the WorkloadRuntime.halt_all helper (per-process registry) and writes a HALTED finish row for every in-flight workload_runs entry. Wired into the frontend KillSwitch alongside the existing halt endpoints (rule 45 + frontend rule 2).
  • Every audit row carries experiment_id + test_id per AGENTS rule 34. The Postgres mirror table (workload_runs, Alembic 0055) is indexed on status + started_at DESC, action + started_at DESC, and provider_alias + target.
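
The halt behaviour described above can be sketched as a per-process registry that marks every in-flight run HALTED. The WorkloadRun fields and the RunRegistry class are hypothetical; only the HALTED status and the experiment_id / test_id columns come from the text, and the RUNNING status name is an assumption.

```python
from __future__ import annotations

from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass
class WorkloadRun:
    """Illustrative in-flight row; every row carries experiment_id and test_id."""

    run_id: str
    experiment_id: str
    test_id: str
    status: str = "RUNNING"  # assumed name for the in-flight state
    finished_at: datetime | None = None


class RunRegistry:
    """Per-process registry of in-flight runs (hypothetical)."""

    def __init__(self) -> None:
        self._in_flight: dict[str, WorkloadRun] = {}

    def register(self, run: WorkloadRun) -> None:
        self._in_flight[run.run_id] = run

    def halt_all(self) -> list[WorkloadRun]:
        """Finish every in-flight run as HALTED and clear the registry."""
        now = datetime.now(timezone.utc)
        halted = []
        for run_id in list(self._in_flight):
            run = self._in_flight.pop(run_id)
            run.status = "HALTED"
            run.finished_at = now
            halted.append(run)
        return halted
```

Because the registry is emptied as runs are halted, calling halt_all a second time is a no-op, which suits a kill switch that may be pressed more than once.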

Cloudflare end-to-end

Phase D of the plan ships:

  • alphaswarm/cloudflare/{client,adapter}.py — Python SDK wrapper + CloudflareEdgeAdapter (tunnels, DNS, Access apps).
  • alphaswarm/api/routes/cloudflare.py — REST surface under /cloudflare/* (cluster:admin for writes, cluster:read for reads).
  • alphaswarm/data/mcp/tools/cloudflare.py — DataMCP tools for agents (data.cloudflare.{health,list_tunnels,create_tunnel,put_tunnel_config,list_access_apps,put_access_app,put_dns_record}).
  • alphaswarm/auth/providers/cloudflare_access.py — new CloudflareAccessProvider that validates Cf-Access-Jwt-Assertion headers and merges claims into the active RequestContext.
  • alphaswarm_platform/terraform/modules/cloudflare_edge + Jinja codegen template (alphaswarm/terraform/codegen/templates/cloudflare_edge.tf.j2) + cloudflare = "~> 5.6" in alphaswarm_platform/terraform/versions.tf.
  • Optional cloudflare_enabled block in alphaswarm_platform/terraform/environments/rpi/main.tf — replaces the manual cloudflared deployment under rpi_kubernetes/kubernetes/base-services/cloudflared/.
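
The read/write scope split on /cloudflare/* could be expressed roughly as follows. This is a sketch under stated assumptions: the helper names are hypothetical, GET and HEAD are assumed to be the only read methods, and cluster:admin is not assumed to imply cluster:read.

```python
from typing import Iterable

# Assumption: only these HTTP methods count as reads.
READ_METHODS = frozenset({"GET", "HEAD"})


def required_scope(method: str) -> str:
    """Map an HTTP method to the scope the /cloudflare/* surface requires."""
    return "cluster:read" if method.upper() in READ_METHODS else "cluster:admin"


def is_allowed(method: str, granted_scopes: Iterable[str]) -> bool:
    """Exact-match check; no scope implies another in this sketch."""
    return required_scope(method) in set(granted_scopes)
```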

Frontend

  • alphaswarm_client/src/lib/api/{workloads,cloudflare,clusterPods}.ts — typed clients matching the new REST surface.
  • alphaswarm_client/src/routes/manage/page.tsx — Workload Studio.
  • alphaswarm_client/src/routes/cluster-mgmt/page.tsx — Cluster pods browser (exec + log tail land in Phase F-2).
  • alphaswarm_client/src/routes/cloudflare/page.tsx — Cloudflare edge studio.
  • alphaswarm_client/src/lib/auth/MsalProvider.tsx — new MSAL branch of AuthProvider; selects between <MsalProvider> and <Auth0Provider> based on authConfig.provider.
  • alphaswarm_client/public/redirect.html — MSAL v5 redirect bridge.

Theia

  • theia-extensions/alphaswarm/src/browser/auth/alphaswarm-auth-service.ts — additive BFF auth service (calls /auth/providers + /auth/refresh). Auth0Service still owns the direct PKCE flow.
  • theia-extensions/alphaswarm/src/browser/widgets/management-widget.tsx — iframe embedding the Vite Workload Studio, cluster-mgmt, and cloudflare routes inside Theia. New env vars on browser.Dockerfile: ALPHASWARM_THEIA_FRONTEND_URL, ALPHASWARM_THEIA_PROVIDERS_URL.

Subagent + rule + skill

  • .cursor/agents/alphaswarm-management-engine.md — direct-control subagent that maps every control route to a data.* MCP tool and refuses raw HTTP shortcuts.
  • .cursor/rules/alphaswarm-management-engine.mdc — always-on rule that bans printing tokens, refresh tokens, M2M client_secrets, MFA secrets, Cf-Access-Jwt-Assertion values, kubeconfig contents, and full Authorization headers in any transcript.
  • .cursor/skills/alphaswarm-management-engine/SKILL.md — named workflows the subagent reaches for first (start, stop, restart, exec, tail-logs, provision-tunnel, rotate-secret, promote-entra-link, halt-all).
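
To make the intent of the secret-printing ban concrete, here is a minimal redaction sketch. It is not the rule's implementation: the redact helper and the field names it matches (access_token, refresh_token, client_secret, mfa_secret) are assumptions, and it covers only headers and simple key/value pairs, not kubeconfig contents.

```python
import re

# Header lines whose values must never appear in a transcript.
_HEADER_RE = re.compile(
    r"(?im)^([ \t]*(?:Authorization|Cf-Access-Jwt-Assertion)[ \t]*:[ \t]*)(.+)$"
)

# key: value or key=value pairs; the field names are hypothetical examples.
_FIELD_RE = re.compile(
    r'(?i)("?(?:access_token|refresh_token|client_secret|mfa_secret)"?'
    r'\s*[:=]\s*)("[^"]*"|\S+)'
)


def redact(text: str) -> str:
    """Replace sensitive header and field values with a placeholder."""
    text = _HEADER_RE.sub(r"\1[REDACTED]", text)
    return _FIELD_RE.sub(r"\1[REDACTED]", text)
```

A filter like this is a backstop at best; the rule itself asks that such values never be printed in the first place.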