ADR 004 — Abstract InfrastructureProvider ABC for workload runtime ops
- Status: Accepted (2026-05-18)
- Authors: Platform team
- Supersedes: Tightens AGENTS hard rule 42
- Related: ADR 005 (separated control plane), AGENTS.md
Context
AlphaSwarm's existing IaC story is Terraform-first (AGENTS hard rule 42): every state-mutating cluster operation goes through alphaswarm/terraform/runtime.py::TerraformRuntime. That guarantee is great for provisioning (create cluster, create namespace, apply RBAC, register Auth0 tenant) but it's an awkward fit for live workload operations — restarting a pod, scaling a Deployment, exec-ing a shell, tailing logs — which today incur a full terraform plan + apply round trip and write to terraform_runs even though no IaC actually changed.
The refactor introduces the alphaswarm_controller micro-project that needs to support five backends (docker_compose, kubernetes, AWS, Azure, GCP). Two paths were considered:
- Translate every workload op into Terraform: every restart becomes a Terraform null_resource + provisioner. Preserves the rule 42 ledger as a single source of truth, but turns Terraform into a glorified kubectl wrapper.
- Introduce a sibling abstraction: an InfrastructureProvider ABC with five implementations, each calling its backend's native SDK (kubernetes-client, docker SDK, boto3, azure-mgmt, google-cloud-run). Terraform stays for provisioning only.
Decision
Adopt path 2: an abstract InfrastructureProvider ABC for runtime workload operations. Specifically:
from abc import ABC, abstractmethod

class InfrastructureProvider(ABC):
    @abstractmethod
    async def start(self, spec: DeploymentSpec) -> DeploymentStatus: ...
    @abstractmethod
    async def stop(self, service_id: str) -> DeploymentStatus: ...
    @abstractmethod
    async def scale(self, service_id: str, replicas: int) -> DeploymentStatus: ...
    @abstractmethod
    async def status(self, service_id: str) -> DeploymentStatus: ...
    @abstractmethod
    async def apply_config(self, service_id: str, config: dict) -> bool: ...
    @abstractmethod
    async def stream_metrics(self, service_id: str): ...  # async generator
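To make the contract concrete, here is a minimal sketch of a toy in-memory provider that satisfies the ABC. The fields of DeploymentSpec and DeploymentStatus are not defined in this ADR, so the dataclasses below are hypothetical stand-ins, and InMemoryProvider is an illustrative name rather than one of the five planned backends.

```python
import asyncio
from abc import ABC, abstractmethod
from dataclasses import dataclass


@dataclass
class DeploymentSpec:  # hypothetical shape; the ADR does not define the fields
    service_id: str
    image: str
    replicas: int = 1


@dataclass
class DeploymentStatus:  # hypothetical shape
    service_id: str
    state: str
    replicas: int


class InfrastructureProvider(ABC):
    @abstractmethod
    async def start(self, spec: DeploymentSpec) -> DeploymentStatus: ...
    @abstractmethod
    async def stop(self, service_id: str) -> DeploymentStatus: ...
    @abstractmethod
    async def scale(self, service_id: str, replicas: int) -> DeploymentStatus: ...
    @abstractmethod
    async def status(self, service_id: str) -> DeploymentStatus: ...
    @abstractmethod
    async def apply_config(self, service_id: str, config: dict) -> bool: ...
    @abstractmethod
    async def stream_metrics(self, service_id: str): ...


class InMemoryProvider(InfrastructureProvider):
    """Toy backend that keeps all state in a dict (illustration only)."""

    def __init__(self) -> None:
        self._services: dict[str, DeploymentStatus] = {}

    async def start(self, spec: DeploymentSpec) -> DeploymentStatus:
        status = DeploymentStatus(spec.service_id, "running", spec.replicas)
        self._services[spec.service_id] = status
        return status

    async def stop(self, service_id: str) -> DeploymentStatus:
        status = self._services[service_id]
        status.state, status.replicas = "stopped", 0
        return status

    async def scale(self, service_id: str, replicas: int) -> DeploymentStatus:
        status = self._services[service_id]
        status.replicas = replicas
        return status

    async def status(self, service_id: str) -> DeploymentStatus:
        return self._services[service_id]

    async def apply_config(self, service_id: str, config: dict) -> bool:
        # A real backend would push the config; here we only check the target exists.
        return service_id in self._services

    async def stream_metrics(self, service_id: str):
        # Async generator: yields three samples, then finishes.
        for _ in range(3):
            yield {"service_id": service_id,
                   "replicas": self._services[service_id].replicas}
            await asyncio.sleep(0)


async def demo():
    provider = InMemoryProvider()
    await provider.start(DeploymentSpec("api", "example/api:1", replicas=2))
    scaled = await provider.scale("api", 5)
    samples = [s async for s in provider.stream_metrics("api")]
    return scaled, samples
```

A fake like this is one plausible way to exercise callers of the ABC in CI without touching a real SDK.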
Five concrete providers live under alphaswarm_controller/src/alphaswarm_controller/providers/:
- docker_compose.py: docker Python SDK + docker compose subprocess for multi-container profiles
- kubernetes.py: kubernetes-client/python (in-cluster + kubeconfig); Deployment apply, scale-to-0, ConfigMap patch, Metrics Server query
- aws.py: boto3; EKS delegates to kubernetes.py; ECS/Fargate via update_service; config sync via SSM Parameter Store
- azure.py: azure-mgmt; AKS delegates to kubernetes.py; ACI via container groups; config sync via App Configuration / Key Vault
- gcp.py: google-cloud SDKs; GKE delegates to kubernetes.py; Cloud Run via revision updates; config sync via Secret Manager
Each provider:
- Reads credentials from env vars only (alphaswarm_core.credentials.CredentialResolver).
- Translates DeploymentSpec to its backend's native API.
- Returns a normalised DeploymentStatus.
- Maps backend-specific exceptions to structured {status, data, error} envelopes.
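The exception-to-envelope mapping could be sketched as a decorator, as below. This ADR fixes only the three envelope keys, so the "ok"/"error" status values, the shape of the error payload, the BackendError class, and the enveloped helper are all assumptions made for illustration. Where the wrapping happens (inside each provider or in the calling layer) is likewise not specified here.

```python
import asyncio
import functools


class BackendError(Exception):
    """Stand-in for an SDK-specific exception (hypothetical)."""


def enveloped(func):
    """Wrap an async call so it always returns a {status, data, error} dict."""

    @functools.wraps(func)
    async def wrapper(*args, **kwargs):
        try:
            data = await func(*args, **kwargs)
        except BackendError as exc:
            # Only known backend errors are mapped; anything else propagates.
            return {
                "status": "error",
                "data": None,
                "error": {"type": type(exc).__name__, "message": str(exc)},
            }
        return {"status": "ok", "data": data, "error": None}

    return wrapper


@enveloped
async def scale_ok(service_id: str, replicas: int) -> dict:
    return {"service_id": service_id, "replicas": replicas}


@enveloped
async def scale_fail(service_id: str, replicas: int) -> dict:
    raise BackendError(f"service {service_id!r} not found")
```

Catching a narrow exception type rather than bare Exception keeps programming errors visible instead of hiding them inside an error envelope.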
Amendment to AGENTS hard rule 42 (this PR)
Rule 42 changes from "all Terraform IaC lifecycle actions go through TerraformRuntime" to:
- All Terraform IaC PROVISIONING actions go through alphaswarm/terraform/runtime.py::TerraformRuntime. Cluster bootstrap, IAM, Auth0 tenant, namespaces, secrets, network policies, and Ingress class registration are all "provisioning". The terraform_runs ledger, the terraform_stack_spec_versions hash-lock, the kill-switch hook (/terraform/halt), and OPA policy enforcement all depend on it.
A new rule 45 covers the workload ops side:
- All runtime workload operations go through alphaswarm_controller.InfrastructureProvider (via WorkloadRuntime). Start, stop, scale, restart, exec, log-tail, and apply_config are workload ops. They never reach for Terraform. A new workload_runs ledger row is created per mutating action with full audit context (user_id, action, target, provider, timestamp) BEFORE the provider call executes.
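The "ledger row before provider call" ordering in rule 45 can be sketched as follows. Only the five audit fields come from this ADR. The WorkloadRun dataclass, the list-backed ledger, the WorkloadRuntime constructor, and RecordingProvider are hypothetical stand-ins for the real persistence model and providers.

```python
import asyncio
from dataclasses import dataclass, field
from datetime import datetime, timezone


@dataclass
class WorkloadRun:
    """One audit row; field names follow the audit context listed in rule 45."""
    user_id: str
    action: str
    target: str
    provider: str
    timestamp: datetime = field(default_factory=lambda: datetime.now(timezone.utc))


class WorkloadRuntime:
    """Hypothetical wrapper that audits each mutating action before executing it."""

    def __init__(self, provider, provider_name: str, ledger: list) -> None:
        self.provider = provider
        self.provider_name = provider_name
        self.ledger = ledger  # a plain list stands in for the workload_runs table

    async def scale(self, user_id: str, service_id: str, replicas: int):
        # Write the audit row first, so an attempt is recorded even if the call fails.
        self.ledger.append(
            WorkloadRun(user_id, "scale", service_id, self.provider_name)
        )
        return await self.provider.scale(service_id, replicas)


class RecordingProvider:
    """Fake provider that notes how many ledger rows existed when it was called."""

    def __init__(self, ledger: list) -> None:
        self.ledger = ledger
        self.rows_seen_at_call = None

    async def scale(self, service_id: str, replicas: int) -> dict:
        self.rows_seen_at_call = len(self.ledger)
        return {"service_id": service_id, "replicas": replicas}


def demo():
    ledger: list = []
    provider = RecordingProvider(ledger)
    runtime = WorkloadRuntime(provider, "kubernetes", ledger)
    result = asyncio.run(runtime.scale("user-1", "api", 3))
    return ledger, provider, result
```

In the demo, the fake provider observes one ledger row at call time, which is the ordering the rule requires.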
Consequences
Positive
- Restart latency drops from ~30 s (Terraform plan + apply) to ~200 ms (kubectl scale).
- The five providers are fully independent: each can be implemented + tested in parallel by an orchestrate fan-out (see plan §8.2).
- Terraform stays clean for IaC provisioning and immutable audit trails. The terraform_runs ledger remains the source of truth for "what infrastructure exists".
- The alphaswarm_controller micro-project becomes a thin, testable layer with mocked SDKs in CI.
- Hard rules 27 (IdentityProvider) and 28 (KubernetesAdapter) and the new ABC all follow the same self-registering metaclass pattern, which keeps the codebase consistent.
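For readers unfamiliar with the self-registering metaclass pattern mentioned above, a minimal sketch follows. The pattern is named in this ADR but not shown, so the ProviderMeta class, the registry dict, and the backend class attribute are assumptions, not the codebase's actual names. The ABC is trimmed to one method for brevity.

```python
from abc import ABCMeta, abstractmethod


class ProviderMeta(ABCMeta):
    """Metaclass that records every subclass declaring a backend name."""

    registry: dict = {}

    def __new__(mcls, name, bases, namespace, **kwargs):
        cls = super().__new__(mcls, name, bases, namespace, **kwargs)
        backend = namespace.get("backend")
        if backend is not None:
            # Registration happens at class-definition time, with no explicit call.
            mcls.registry[backend] = cls
        return cls


class InfrastructureProvider(metaclass=ProviderMeta):
    backend = None  # the base class itself is not registered

    @abstractmethod
    async def status(self, service_id: str): ...


class KubernetesProvider(InfrastructureProvider):
    backend = "kubernetes"

    async def status(self, service_id: str):
        return {"service_id": service_id, "state": "running"}


class DockerComposeProvider(InfrastructureProvider):
    backend = "docker_compose"

    async def status(self, service_id: str):
        return {"service_id": service_id, "state": "running"}


def provider_for(backend: str) -> InfrastructureProvider:
    """Look up and instantiate the provider registered for a backend name."""
    return ProviderMeta.registry[backend]()
```

Deriving the metaclass from ABCMeta keeps abstract-method enforcement working alongside registration.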
Negative
- Two separate audit ledgers (terraform_runs + workload_runs) instead of one. Documented in alphaswarm_docs/docs/how-to/operations/incident-response.md.
- The five providers each take their own credential chain. Mitigated by CredentialResolver so service code never sees raw env vars.
- Provisioning vs runtime boundary is a soft line: adding a new namespace is provisioning, but auto-creating a per-tenant namespace at user signup is workload-ish. Each new operation requires an explicit choice; ADR 005 includes a decision tree.
Alternatives considered
- Translate every op into Terraform: rejected. Operational cost of running terraform apply on every pod restart is prohibitive (~30 s p99), and Terraform's lock semantics serialise unrelated ops on the same workspace.
- Use Crossplane: investigated; rejected for now. Crossplane is excellent for declarative cloud APIs but adds a CRD layer and operator dependency for marginal value over the five-provider Python ABC. Revisit when AlphaSwarm exceeds five backends.
- Use Pulumi instead of Terraform: out of scope. The existing TerraformRuntime works and is hash-locked; replacing it is a separate ADR.
Implementation references
- ABC: alphaswarm_controller/src/alphaswarm_controller/providers/base.py
- Five providers: alphaswarm_controller/src/alphaswarm_controller/providers/{docker_compose,kubernetes,aws,azure,gcp}.py
- Workload ledger model: alphaswarm/persistence/models_workload.py (new in this PR)
- Telemetry streaming: alphaswarm_controller/src/alphaswarm_controller/services/telemetry.py
- AGENTS rule 45: AGENTS.md (this PR)