# AlphaSwarm — full corpus

> Concatenated, MDX-stripped markdown for one-shot LLM ingestion.
> See /llms.txt for the curated index.

# ADR 001 — Static export (Vite) over SSR for the AlphaSwarm client surface

- **Status**: Accepted (2026-05-18)
- **Authors**: Platform team
- **Supersedes**: None
- **Related**: [ADR 002 — single container client](002-single-container-client.md), [`alphaswarm_client/CUTOVER.md`](https://github.com/julianwileymac/alphaswarm/blob/main/alphaswarm_docs/alphaswarm_client/CUTOVER.md)

## Context

The AlphaSwarm frontend rewrite (`alphaswarm_client/`, Vite 7 + React 19 + Tailwind 4 + shadcn/ui) is the cutover-complete operator UI. The legacy `webui/` (Next.js 15 / antd) remains in tree only as a rollback path. The new `alphaswarm_client` container needs to bundle a UI build, the legacy fallback, and the FastAPI gateway into a single deployable image.

Three rendering options were considered for the canonical UI:

1. **Server-side rendering (Next.js)** — server-rendered React with client-side hydration, mounted under uvicorn via `WSGIMiddleware`.
2. **Static export (Next.js)** — `next build` with `output: 'export'`, identical to the prompt's original §2.1 wording.
3. **Static export (Vite)** — `pnpm --dir alphaswarm_client build` emitting a single `dist/` static SPA bundle.

## Decision

The canonical UI shipped in `alphaswarm_client` is the **Vite static export** under `alphaswarm_client/`. The Next.js legacy `webui/` is mounted as a rollback surface at `/webui` and Solara at `/legacy`, but neither is the default landing page.

Concretely:

- Stage 1 of `/build/docker/alphaswarm_client/Dockerfile` runs `pnpm --dir alphaswarm_client build` and copies `alphaswarm_client/dist/` to `/app/static/`.
- The FastAPI app in `alphaswarm/api/main.py` mounts `/static` to the Vite asset directory and falls back to `index.html` for client-side routes (SPA fallback).
- The Vite app calls API endpoints through a relative `/api` prefix; the FastAPI gateway proxies those to whatever the `ConnectivityConfig` env vars point at.

## Consequences

**Positive**

- Single-process Python runtime — no Node.js in the production image, smaller attack surface, no `npm` supply-chain risk in production.
- No SSR cold-start cost. The whole UI is ~3 MB of static assets served with `Cache-Control: immutable`.
- Identical container in dev, k3d, and Kubernetes — only env vars change.
- Vite is already canonical per [`alphaswarm_client/CUTOVER.md`](https://github.com/julianwileymac/alphaswarm/blob/main/alphaswarm_docs/alphaswarm_client/CUTOVER.md). Picking the in-flight stack avoids reopening the cutover debate.

**Negative**

- No streaming-SSR for the operator UI. WebSocket and SSE streams (chat, live, telemetry) carry the live data instead, which matches the existing throttled `useChatStream` / `useLiveStream` hooks.
- SEO and first-paint metrics are weaker than SSR, but the AlphaSwarm UI is an authenticated operator console, not a public site — neither matters.
- Pre-rendered routes per user/tenant are not possible. All personalisation happens client-side using Auth0 claims from `useUser()`.

## Alternatives considered

- **SSR** — rejected because it forces Node.js into the runtime image and adds a separate process to supervise.
- **Static export (Next.js)** — rejected to avoid maintaining two frontend toolchains. The Next.js webui stays as rollback only.
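
The SPA fallback named in the Decision can be sketched without any web framework. The helper below is a hypothetical stand-in, not the `serve_spa` handler itself (its implementation is not shown in this ADR): it returns a real asset when the request path resolves to a file inside the static root and `index.html` otherwise, so client-side routes load the app shell. The function name and the path-traversal guard are assumptions made for illustration.

```python
from pathlib import Path


def resolve_spa_path(static_root: Path, request_path: str) -> Path:
    """Return the file to serve for request_path (hypothetical helper).

    Real assets are served as-is; every other path falls back to
    index.html so the client-side router can handle it.
    """
    root = static_root.resolve()
    index = root / "index.html"
    candidate = (root / request_path.lstrip("/")).resolve()
    # Refuse paths that escape the static root (e.g. "../../etc/passwd").
    if candidate != root and root not in candidate.parents:
        return index
    if candidate.is_file():
        return candidate
    return index
```

In a FastAPI app, a catch-all route registered after the `/api` and `/static` mounts would typically call a helper like this and wrap the result in a file response.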
## Implementation references

- Frontend build target: `alphaswarm_client/package.json` `"build": "vite build"`
- Production Dockerfile: `alphaswarm_platform/build/docker/alphaswarm_client/Dockerfile`
- SPA fallback handler: `alphaswarm/api/main.py::serve_spa`
- Cutover history: [`alphaswarm_client/CUTOVER.md`](https://github.com/julianwileymac/alphaswarm/blob/main/alphaswarm_docs/alphaswarm_client/CUTOVER.md)

# ADR 002 — Single multi-stage container for the AlphaSwarm client surface

- **Status**: Accepted (2026-05-18) — **Superseded for `alphaswarm_ui` by [ADR 011](011-cdn-fronted-standalone-for-alphaswarm-ui.md) on 2026-05-25.** Still in force for the local-operator `alphaswarm_client/` packaging path.
- **Authors**: Platform team
- **Supersedes**: None
- **Superseded by**: [ADR 011 — CDN-fronted standalone for `alphaswarm_ui`](011-cdn-fronted-standalone-for-alphaswarm-ui.md) (cloud surface only)
- **Related**: [ADR 001 — Vite static export](001-static-export-over-ssr.md), [ADR 005 — separated control plane](005-separated-control-plane.md), [ADR 011 — CDN-fronted standalone for `alphaswarm_ui`](011-cdn-fronted-standalone-for-alphaswarm-ui.md), [ADR 012 — Solara deprecation](012-solara-deprecation.md)

> **Scope narrowing (2026-05-25):** This ADR's decisions apply ONLY
> to the local-operator `alphaswarm_client/` packaging. The cloud-hosted
> `alphaswarm_ui/` surface (at `alpha-swarm.ai` / `app.alpha-swarm.ai`) is governed by
> ADR 011 and uses a clean Next.js standalone container with no
> ASGI proxy stage. See ADR 011 for the cloud rationale.
## Context

Today AlphaSwarm runs the Vite frontend on `:3001` (compose `:3002`), the legacy Next.js webui on `:3000` (now stopped), the legacy Solara UI on `:8765`, and the FastAPI API on `:8000`. Operators have to juggle four URLs and four health probes. The `alphaswarm_client` Docker image is a chance to collapse these into one.

Three packaging options were considered:

1. **One container per surface** — separate `alphaswarm-frontend`, `alphaswarm-solara`, `alphaswarm-api` images; an external Ingress/NGINX layer fans traffic.
2. **Sidecar pattern** — one Pod per surface, sharing localhost via an `nginx` sidecar.
3. **Single multi-stage build** — Stage 1 builds Vite, Stage 2 prepares Solara, Stage 3 (production) is a `python:3.11-slim` runtime that serves both as static + ASGI mount and proxies API traffic.

## Decision

`alphaswarm_client` is **one image built from a three-stage Dockerfile** that ships:

- Stage 1 (`ui-builder`, `node:20-alpine`) — runs `pnpm --dir alphaswarm_client build`, output to `/app/out/`. Node is dropped after this stage.
- Stage 2 (`solara-builder`, `python:3.11-slim`) — installs Solara + legacy UI deps, pre-warms component caches, verifies `legacy_ui.app` is importable.
- Stage 3 (`production`, `python:3.11-slim`) — installs FastAPI + uvicorn + httpx + websockets + python-jose + `alphaswarm_core`. Copies Vite assets from Stage 1 and Solara from Stage 2. Exposes port `8080`. No Node, no npm.

The Stage 3 runtime mounts:

- `/static` → Vite assets from Stage 1
- `/legacy` → Solara ASGI app
- `/webui` → legacy Next.js export (rollback only)
- `/api/*` → reverse-proxied to `ALPHASWARM_CORE_API_URL`
- `/ml/*` → reverse-proxied to `ALPHASWARM_ML_API_URL`
- `/mcp/*` → reverse-proxied to `ALPHASWARM_MCP_URL`
- `/manage/*` → reverse-proxied to `ALPHASWARM_CONTROL_PLANE_URL`
- `/ws/*` → WebSocket proxy with reconnect-with-backoff

## Consequences

**Positive**

- One image, one health probe (`/health`), one set of `securityContext` rules.
- Stable URL surface for operators — bookmarks, dashboards, and runbooks don't break when backends move.
- All backend addresses live in `ConnectivityConfig` env vars. The same image runs in compose with `ALPHASWARM_*_URL=http://alphaswarm-core:8000` or in K8s with `http://alphaswarm-core.default.svc.cluster.local`.
- Auth0 callback URLs stay constant. The Vite app sees one origin; the FastAPI proxy injects M2M `Authorization` headers for cross-service calls.
- Smaller blast radius. The control plane is a separate container on a separate Docker network (`alphaswarm-admin` vs `alphaswarm-internal`) — they only talk over the proxy.

**Negative**

- Builds are larger and slower than per-surface images. Mitigated by Docker layer caching and buildx (~3 min cold, ~30 s incremental).
- Scaling assumes Vite + Solara + proxy throughput grow together. In practice Vite assets are CDN-fronted by NGINX Ingress and the proxy is the bottleneck — a single container HPA on CPU is fine.
- Rolling back to webui-only or Solara-only means env-flag toggles (`ALPHASWARM_CLIENT_ENABLE_LEGACY_UI`, `ALPHASWARM_CLIENT_ENABLE_SOLARA`) rather than swapping deployments.

## Alternatives considered

- **One container per surface** — rejected. Adds 3 health probes, 3 Ingress rules, 3 image tags to keep in lockstep on every release. The operator experience regresses.
- **Sidecar pattern** — rejected. Mixing sidecars + multi-process supervision in one Pod adds significant Pod-startup ordering risk for marginal CPU savings.

## Implementation references

- Multi-stage Dockerfile: `alphaswarm_platform/build/docker/alphaswarm_client/Dockerfile`
- FastAPI proxy: `alphaswarm/api/proxy.py`
- WebSocket proxy with reconnect: `alphaswarm/api/ws/proxy.py`
- ConnectivityConfig: `alphaswarm_core/connectivity/config.py`

# ADR 003 — Auth0 zero-trust two-layer security model

- **Status**: Accepted (2026-05-18)
- **Authors**: Platform team
- **Supersedes**: None
- **Related**: [ADR 005 — separated control plane](005-separated-control-plane.md), [alphaswarm_docs/identity.md](../../concepts/identity/identity.md), [alphaswarm_docs/auth0-actions.md](../../concepts/identity/auth0-actions.md)

## Context

AlphaSwarm already uses Auth0 for the operator UI via the in-flight `alphaswarm/auth/providers/auth0.py` plugin (AGENTS hard rule 27). What's missing for the refactor is the second layer: cryptographic JWT validation at every service boundary, resource-scoped claims so users only see their own resources, and a per-role scope matrix that the `alphaswarm_controller` micro-project can enforce without ever importing `alphaswarm.*`.

Three identity strategies were considered:

1. **Self-hosted Keycloak** — full control, but operations burden and one more stateful service per cluster.
2. **Single-layer Auth0 (current state)** — Auth0 only for the SPA login. Backend services still trust user-injected headers via session cookies.
3. **Two-layer Auth0 (recommended in prompt)** — Auth0 OIDC for the SPA + JWT (`RS256`) bearer tokens validated independently by every service via JWKS.

## Decision

Adopt the **two-layer Auth0 model** with the following invariants:

1. The Vite SPA in `alphaswarm_client` performs Authorization Code + PKCE against the Auth0 tenant. Access tokens are short-lived (1 h) JWTs with `aud` = `https://api.alphaswarm.internal/manage`.
2. Every backend service — `alphaswarm` (FastAPI API), `alphaswarm_controller` (micro-project), and the `rpi_kubernetes` `management/backend` shim — re-validates JWTs against the Auth0 JWKS independently using the shared validator in `alphaswarm_core/auth/`. **No service trusts a header set by another service.**
3. The Auth0 Post-Login Action (template in `alphaswarm_platform/terraform/modules/auth0_identity/post_login_action.js.tftpl`) calls `POST /_internal/auth0/sync` to fetch user-specific custom claims and injects them into the access token under the **`https://alphaswarm.internal/`** namespace:
   - `https://alphaswarm.internal/org_id` — tenancy boundary
   - `https://alphaswarm.internal/roles` — coarse role list (`alphaswarm-viewer`, `alphaswarm-admin`, `alphaswarm-operator`)
   - `https://alphaswarm.internal/resources` — explicit resource ID allowlist (org-scoped)
   - `https://alphaswarm.internal/workspace_id`, `https://alphaswarm.internal/team_ids` — existing tenancy hints
4. M2M tokens for service-to-service calls (e.g. `alphaswarm_client` → `alphaswarm_controller`) are minted through Auth0 Client Credentials. The proxy in `alphaswarm/api/proxy.py` attaches a cached M2M token; `alphaswarm_controller` validates it like any other JWT.
5. The four-role RBAC matrix from the refactor prompt becomes the canonical scope grid:

   | Role | Scopes granted |
   | --- | --- |
   | `alphaswarm-viewer` | `read:infrastructure` |
   | `alphaswarm-operator` | `read:infrastructure` + `manage:agents` |
   | `alphaswarm-admin` | `read:infrastructure` + `manage:agents` + `manage:infrastructure` |
   | `alphaswarm-superadmin` | All of the above + `admin:cluster` (only role that bypasses `filter_resources`) |

6. Every list endpoint in both `alphaswarm` and `alphaswarm_controller` passes its result list through `alphaswarm_core.auth.resource_filter.filter_resources(items, jwt_payload)` before returning. The filter respects `admin:cluster` (returns everything) and otherwise intersects against the `resources` claim.

## Consequences

**Positive**

- Zero-trust between services. A compromised `alphaswarm_client` container can issue requests but cannot forge claims — the control plane re-validates.
- Resource scoping moves from "frontend hides things" to "backend cannot return things". Defence in depth.
- Auth0 is already in production for the SPA; the only delta is adding M2M tokens and the `resources` claim.
- The `alphaswarm_controller` micro-project gets a clean security boundary without importing `alphaswarm.auth.*` — it depends on `alphaswarm_core/auth/` only.

**Negative**

- Every API request pays JWKS verification cost (~0.2 ms with `lru_cache`). Acceptable.
- The `https://alphaswarm/` → `https://alphaswarm.internal/` namespace rename requires one release of dual-reading both namespaces (handled by `auth_claims_namespace_aliases` setting).
- Operators need to be onboarded to one of the four roles before they can use the new control plane — solved by `/build/scripts/provision_auth0.py` running on bootstrap.

## Alternatives considered

- **Self-hosted Keycloak** — rejected. Adds operational burden without business value. Auth0 plays well with Terraform (already in `alphaswarm_platform/terraform/modules/auth0_identity/`).
- **Cookie-only sessions** — rejected. Backend services would have to trust whatever set the cookie; doesn't compose with the cross-service M2M case.
- **Opaque tokens with introspection** — rejected. Adds a round trip per request against Auth0's `/oauth/token/introspect`, and Auth0's free tier rate-limits it.
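
The resource-scoping rule from the Decision (return everything for `admin:cluster`, otherwise intersect with the `resources` claim) is small enough to sketch. This is an illustration under stated assumptions, not the code in `alphaswarm_core/auth/resource_filter.py`: it assumes scopes are derived from the `roles` claim using the role table in this ADR, and that each item is a dict with an `id` key. `scopes_for` and `ROLE_SCOPES` are hypothetical names.

```python
from typing import Any, Iterable

NAMESPACE = "https://alphaswarm.internal/"

# Scope grid copied from the role table in this ADR.
ROLE_SCOPES: dict[str, frozenset[str]] = {
    "alphaswarm-viewer": frozenset({"read:infrastructure"}),
    "alphaswarm-operator": frozenset({"read:infrastructure", "manage:agents"}),
    "alphaswarm-admin": frozenset(
        {"read:infrastructure", "manage:agents", "manage:infrastructure"}
    ),
    "alphaswarm-superadmin": frozenset(
        {"read:infrastructure", "manage:agents", "manage:infrastructure", "admin:cluster"}
    ),
}


def scopes_for(jwt_payload: dict[str, Any]) -> set[str]:
    """Union of the scopes granted by every role in the roles claim."""
    scopes: set[str] = set()
    for role in jwt_payload.get(NAMESPACE + "roles", []):
        scopes |= ROLE_SCOPES.get(role, frozenset())
    return scopes


def filter_resources(
    items: Iterable[dict[str, Any]], jwt_payload: dict[str, Any]
) -> list[dict[str, Any]]:
    """Keep only the items the caller may see (assumes an "id" key per item)."""
    if "admin:cluster" in scopes_for(jwt_payload):
        return list(items)
    allowed = set(jwt_payload.get(NAMESPACE + "resources", []))
    return [item for item in items if item.get("id") in allowed]
```

A token with no `resources` claim and no `admin:cluster` scope yields an empty list, which matches the "backend cannot return things" posture described above.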
## Implementation references

- JWT validator: `alphaswarm_core/auth/validator.py` (extracted from `alphaswarm/auth/providers/auth0.py`)
- Resource filter: `alphaswarm_core/auth/resource_filter.py`
- Claims namespace setting: `alphaswarm/config/settings.py::auth_claims_namespace`, `auth_claims_namespace_aliases`
- Auth0 Action template: `alphaswarm_platform/terraform/modules/auth0_identity/post_login_action.js.tftpl`
- Sync endpoint: `alphaswarm/api/routes/auth0_sync.py`
- Terraform Auth0 module: `alphaswarm_platform/terraform/modules/auth0_identity/main.tf`
- Provisioning script: `alphaswarm_platform/build/scripts/provision_auth0.py`

# ADR 004 — Abstract InfrastructureProvider ABC for workload runtime ops

- **Status**: Accepted (2026-05-18)
- **Authors**: Platform team
- **Supersedes**: Tightens AGENTS hard rule 42
- **Related**: [ADR 005 — separated control plane](005-separated-control-plane.md), [`AGENTS.md`](https://github.com/julianwileymac/alphaswarm/blob/main/alphaswarm_docs/AGENTS.md)

## Context

AlphaSwarm's existing IaC story is Terraform-first (AGENTS hard rule 42): every state-mutating cluster operation goes through `alphaswarm/terraform/runtime.py::TerraformRuntime`. That guarantee is great for **provisioning** (create cluster, create namespace, apply RBAC, register Auth0 tenant) but it's an awkward fit for **live workload operations** — restarting a pod, scaling a Deployment, exec-ing a shell, tailing logs — which today incur a full `terraform plan` + `apply` round trip and write to `terraform_runs` even though no IaC actually changed.

The refactor introduces the `alphaswarm_controller` micro-project that needs to support five backends (docker_compose, kubernetes, AWS, Azure, GCP). Two paths were considered:

1. **Translate every workload op into Terraform** — every restart becomes a Terraform `null_resource` + provisioner. Preserves the rule 42 ledger as a single source of truth, but turns Terraform into a glorified `kubectl` wrapper.
2. **Introduce a sibling abstraction** — `InfrastructureProvider` ABC with five implementations, each calling its backend's native SDK (kubernetes-client, docker SDK, boto3, azure-mgmt, google-cloud-run). Terraform stays for provisioning only.

## Decision

Adopt **path 2: an abstract `InfrastructureProvider` ABC** for runtime workload operations. Specifically:

```python
from abc import ABC, abstractmethod

# Import path assumed from the `alphaswarm_core/models/` layout in ADR 005.
from alphaswarm_core.models import DeploymentSpec, DeploymentStatus


class InfrastructureProvider(ABC):
    """Runtime workload operations; one subclass per backend."""

    @abstractmethod
    async def start(self, spec: DeploymentSpec) -> DeploymentStatus: ...

    @abstractmethod
    async def stop(self, service_id: str) -> DeploymentStatus: ...

    @abstractmethod
    async def scale(self, service_id: str, replicas: int) -> DeploymentStatus: ...

    @abstractmethod
    async def status(self, service_id: str) -> DeploymentStatus: ...

    @abstractmethod
    async def apply_config(self, service_id: str, config: dict) -> bool: ...

    @abstractmethod
    async def stream_metrics(self, service_id: str): ...  # async generator
```

Five concrete providers live under `alphaswarm_controller/src/alphaswarm_controller/providers/`:

- `docker_compose.py` — docker Python SDK + `docker compose` subprocess for multi-container profiles
- `kubernetes.py` — kubernetes-client/python (in-cluster + kubeconfig); Deployment apply, scale-to-0, ConfigMap patch, Metrics Server query
- `aws.py` — boto3; EKS delegates to `kubernetes.py`; ECS/Fargate via `update_service`; config sync via SSM Parameter Store
- `azure.py` — azure-mgmt; AKS delegates to `kubernetes.py`; ACI via container groups; config sync via App Configuration / Key Vault
- `gcp.py` — google-cloud SDKs; GKE delegates to `kubernetes.py`; Cloud Run via revision updates; config sync via Secret Manager

Each provider:

- Reads credentials from env vars only (`alphaswarm_core.credentials.CredentialResolver`).
- Translates `DeploymentSpec` to its backend's native API.
- Returns a normalised `DeploymentStatus`.
- Maps backend-specific exceptions to structured `{status, data, error}` envelopes.

## Amendment to AGENTS hard rule 42 (this PR)

Rule 42 changes from "all Terraform IaC lifecycle actions go through TerraformRuntime" to:

> 42. **All Terraform IaC PROVISIONING actions go through `alphaswarm/terraform/runtime.py::TerraformRuntime`.** Cluster bootstrap, IAM, Auth0 tenant, namespaces, secrets, network policies, and Ingress class registration are all "provisioning". The `terraform_runs` ledger, the `terraform_stack_spec_versions` hash-lock, the kill-switch hook (`/terraform/halt`), and OPA policy enforcement all depend on it.

A new rule 45 covers the workload ops side:

> 45. **All runtime workload operations go through `alphaswarm_controller.InfrastructureProvider` (via `WorkloadRuntime`).** Start, stop, scale, restart, exec, log-tail, and `apply_config` are workload ops. They never reach for Terraform.
A new `workload_runs` ledger row is created per mutating action with full audit context (user_id, action, target, provider, timestamp) BEFORE the provider call executes.

## Consequences

**Positive**

- Restart latency drops from ~30 s (Terraform plan + apply) to ~200 ms (kubectl scale).
- The five providers are fully independent — each can be implemented + tested in parallel by an `orchestrate` fan-out (see plan §8.2).
- Terraform stays clean for IaC provisioning and immutable audit trails. The `terraform_runs` ledger remains the source of truth for "what infrastructure exists".
- The `alphaswarm_controller` micro-project becomes a thin, testable layer with mocked SDKs in CI.
- Hard rules 27 (IdentityProvider) and 28 (KubernetesAdapter) and the new ABC all follow the same self-registering metaclass pattern — consistent across the codebase.

**Negative**

- Two separate audit ledgers (`terraform_runs` + `workload_runs`) instead of one. Documented in `alphaswarm_docs/docs/how-to/operations/incident-response.md`.
- The five providers each take their own credential chain. Mitigated by `CredentialResolver` so service code never sees raw env vars.
- Provisioning vs runtime boundary is a soft line — adding a new namespace is provisioning, but auto-creating a per-tenant namespace at user signup is workload-ish. Each new operation requires an explicit choice; ADR 005 includes a decision tree.

## Alternatives considered

- **Translate every op into Terraform** — rejected. Operational cost of running `terraform apply` on every pod restart is prohibitive (~30 s p99), and Terraform's lock semantics serialise unrelated ops on the same workspace.
- **Use Crossplane** — investigated; rejected for now. Crossplane is excellent for declarative cloud APIs but adds a CRD layer and operator dependency for marginal value over the five-provider Python ABC. Revisit when AlphaSwarm exceeds five backends.
- **Use Pulumi instead of Terraform** — out of scope.
The existing `TerraformRuntime` works and is hash-locked; replacing it is a separate ADR.

## Implementation references

- ABC: `alphaswarm_controller/src/alphaswarm_controller/providers/base.py`
- Five providers: `alphaswarm_controller/src/alphaswarm_controller/providers/{docker_compose,kubernetes,aws,azure,gcp}.py`
- Workload ledger model: `alphaswarm/persistence/models_workload.py` (new in this PR)
- Telemetry streaming: `alphaswarm_controller/src/alphaswarm_controller/services/telemetry.py`
- AGENTS rule 45: [`AGENTS.md`](https://github.com/julianwileymac/alphaswarm/blob/main/alphaswarm_docs/AGENTS.md) (this PR)

# ADR 005 — Separated `alphaswarm_controller/` micro-project

- **Status**: Accepted (2026-05-18)
- **Authors**: Platform team
- **Supersedes**: Embeds in `alphaswarm/api/routes/control_plane.py`
- **Related**: [ADR 002](002-single-container-client.md), [ADR 003](003-auth0-zero-trust.md), [ADR 004](004-provider-abstraction.md)

## Context

The in-flight `alphaswarm/api/routes/control_plane.py` exposes deploy / destroy / restart / logs endpoints to the Vite Control Plane UI. It already covers the "local k3d" and "rpi_kubernetes" targets and delegates mutating ops to `TerraformRuntime` via Celery tasks (see [`alphaswarm/api/routes/control_plane.py`](../../../alphaswarm/api/routes/control_plane.py)).

The refactor wants the control plane to:

1. Speak five backends (docker_compose, kubernetes, AWS, Azure, GCP) — not just two Terraform stacks.
2. Be deployable on its own (`/deployments/compose/docker-compose.admin.yml`, isolated `alphaswarm-admin` Docker network) so an operator can run "just the control plane" against a remote cluster.
3. Be releasable independently from the AlphaSwarm monolith (different cadence, different SLOs).
4. Have a security boundary that still holds if `alphaswarm` itself is compromised, and vice versa.

The strict-isolation reading of the prompt's hard constraint ("Never import `alphaswarm.*` modules inside `alphaswarm_controller/`") plus the existing `alphaswarm/` codebase yields three integration patterns:

1. **Strict separation** — duplicate every model, validator, and adapter into `alphaswarm_controller/`. 2x code, fully independent release.
2. **Shared lower-level library** — extract reusable bits (Pydantic topology models, JWT validator, K8s adapter ABCs, credential protocol) into a NEW `alphaswarm_core/` package both `alphaswarm/` and `alphaswarm_controller/` depend on. No `alphaswarm.*` imports in CP, but shared lower-level code stays DRY.
3. **Evolve in place** — keep control plane in `alphaswarm/`; just add the `alphaswarm_client` container + Auth0 RBAC.

## Decision

Adopt **pattern 2** — the **shared-library** approach.

1. New top-level package `alphaswarm_core/` is created with its own `pyproject.toml` (installable as `alphaswarm-core`).
2. Move (with back-compat re-exports from `alphaswarm/`) the following into `alphaswarm_core/`:
   - `topology/` — Pydantic models from `alphaswarm/deployment/topology.py` (data classes only; loaders stay in `alphaswarm/`).
   - `auth/` — Auth0 JWT validator from `alphaswarm/auth/providers/auth0.py` + `alphaswarm/api/security.py`'s claim validation + new `resource_filter.py` (ADR 003).
   - `kubernetes/` — `KubernetesAdapter` ABC from `alphaswarm/kubernetes/protocol.py`. Concrete adapters (`InClusterAdapter`, `LocalComposeAdapter`, `RpiClusterAdapter`) stay in `alphaswarm/`.
   - `credentials/` — `SecretStore` protocol + `CredentialResolver` interface. Concrete stores stay in `alphaswarm/`.
   - `connectivity/` — NEW `ConnectivityConfig` Pydantic settings model with `ALPHASWARM_*_URL` matrix.
   - `models/` — `DeploymentSpec`, `DeploymentStatus`, `MetricPoint`, `NodeHealth` (referenced by both `alphaswarm.api.routes.control_plane` and the new `alphaswarm_controller.api.routers`).
3. The `alphaswarm_controller/` micro-project (own `pyproject.toml`) depends ONLY on `alphaswarm-core`. It never imports `alphaswarm.*`.
4. `alphaswarm/` keeps the runtimes, ledger writers, registry implementations, and concrete adapters. It also depends on `alphaswarm-core` (just like `alphaswarm_controller/`).
5. Back-compat shims in `alphaswarm/deployment/`, `alphaswarm/auth/`, `alphaswarm/kubernetes/`, `alphaswarm/credentials/` re-export from `alphaswarm_core` so no existing import paths break and no other AlphaSwarm module needs to change in this PR.

The strict-isolation enforcement is a CI lint:

```bash
# .github/workflows/ci.yml step
# `(\.|\s|$)` also catches `from alphaswarm import x`; `alphaswarm_core` stays allowed.
if rg --type py '^(from|import) alphaswarm(\.|\s|$)' alphaswarm_controller/; then
  echo "FAIL: alphaswarm_controller imports forbidden alphaswarm.* module"
  exit 1
fi
```

## Consequences

**Positive**

- `alphaswarm_controller` ships as a standalone OCI image with no AlphaSwarm runtime dependency. Operators running multiple AlphaSwarm tenants share one control plane.
- The shared lib is small (~2 kloc) and changes infrequently. When it does change, both `alphaswarm/` and `alphaswarm_controller/` re-pin and re-test — explicit coupling.
- The existing `alphaswarm/api/routes/control_plane.py` becomes a thin proxy that calls the external `alphaswarm_controller` when the env var `ALPHASWARM_CP_REMOTE=1` is set, or talks in-process to the same modules when disabled. Backward compat for local dev.
- AGENTS hard rules 27 (IdentityProvider) and 28 (KubernetesAdapter) still apply — the metaclass registries live in `alphaswarm_core/auth/` and `alphaswarm_core/kubernetes/`, with concrete impls registered from `alphaswarm/` and `alphaswarm_controller/` alike.

**Negative**

- Adds one more package to publish and version.
Mitigated by treating `alphaswarm-core` as an internal dependency pinned to a git SHA from a monorepo — no PyPI release needed.
- Cross-package refactors now need to touch two `pyproject.toml` files. Acceptable cost; the boundary is intentional.
- The "embed vs separate" decision is now load-bearing for security — a vulnerability in `alphaswarm_core/auth/` lands in both planes. Reviewed in `ce-security-sentinel` agent runs (see `.cursor/agents/`).

## Alternatives considered

- **Strict separation (pattern 1)** — rejected. Duplicate code rots out of sync; security fixes have to land twice; impossible to keep JWT validator semantics identical between the two planes.
- **Evolve in place (pattern 3)** — rejected. The biggest gap the prompt closes is *deployment independence* and the *5-backend abstraction*. Both demand a separate process; in-place is just a renamed router.
- **gRPC contract between the two** — rejected for now. The two planes share Pydantic models and HTTP/JSON is already understood. gRPC adds proto-gen tooling burden without buying anything until we hit hundreds of req/s of internal calls.

## Decision tree: which side does new code go on?

When adding a new feature, ask:

1. Is this a workload runtime operation (start, stop, scale, exec, logs, telemetry)? → **`alphaswarm_controller/`**
2. Is this an IaC provisioning operation (create cluster, register Auth0 tenant, apply RBAC)? → **`alphaswarm/terraform/`**
3. Is this AlphaSwarm business logic (agents, RL, bots, analysis, backtests)? → **`alphaswarm/`**
4. Is this a shared model, validator, or ABC that BOTH need? → **`alphaswarm_core/`**

If unsure, prefer **`alphaswarm/`** and revisit the boundary once the requirement is clearer.
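
The strict-isolation rule that this decision tree protects can also be checked on parsed source instead of with a line-anchored pattern. A minimal sketch, assuming Python's standard `ast` module is acceptable in CI; `forbidden_imports` is a hypothetical helper, not something that exists in the repository. Because it walks the syntax tree, it also reports imports nested inside functions.

```python
import ast


def forbidden_imports(source: str, banned: str = "alphaswarm") -> list[str]:
    """Return imported module names in source that belong to the banned package.

    `alphaswarm` and `alphaswarm.x` match; `alphaswarm_core` does not.
    Relative imports are ignored because they cannot leave the package.
    """
    hits: list[str] = []
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Import):
            names = [alias.name for alias in node.names]
        elif isinstance(node, ast.ImportFrom) and node.level == 0:
            names = [node.module or ""]
        else:
            continue
        hits += [n for n in names if n == banned or n.startswith(banned + ".")]
    return hits
```

A CI script could run this over every `*.py` file under `alphaswarm_controller/` and fail the job when any file returns a non-empty list.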
## Implementation references

- Shared lib: `alphaswarm_core/` (this PR)
- Micro-project: `alphaswarm_controller/` (this PR)
- Strict-isolation lint: `.github/workflows/ci.yml` (Phase 8)
- Existing in-AlphaSwarm control plane: `alphaswarm/api/routes/control_plane.py`
- Existing topology: `alphaswarm/deployment/topology.py`
- AGENTS rules 27, 28, 42, 45 — boundary owners

# architecture/decisions/006-aqp-admin-overhaul

# ADR 006: alphaswarm_admin overhaul (multi-cloud control plane)

- **Status:** Proposed
- **Date:** 2026-05-25
- **Supersedes:** none (extends ADR 002 single-container client; the Solara legacy half is deprecated by this overhaul)
- **Superseded by:** none

## Context

The alphaswarm_admin internal admin surface predates the overhaul:

- Backend was already a stateless FastAPI BFF brokering audit-first to `alphaswarm_controller` and the AlphaSwarm monolith.
- Frontend was a Vite + React Router SPA at `alphaswarm_admin/alphaswarm_admin_ui/`.
- Six modules from the blueprint were missing: secrets-manager, lineage-explorer, model-registry, paper-trading-control, rbac-admin, account-mode-switcher.
- Multi-account AWS topology was not provisioned.
- CI used `KUBECONFIG_*` base64 secrets instead of GitHub Actions OIDC.
- Only the bot fleet had ArgoCD; the main stack was kubectl-push.
- No S3 WORM mirror for `security_audit_events`.

## Decision

### Frontend: migrate to Next.js 15 App Router.

Even though `alphaswarm_client/` (the canonical Vite operator UI) and `alphaswarm_ui/` (the customer-facing PaaS) keep their existing frameworks, the admin surface migrates to Next.js because:

- Server Components reduce the bundle on read-heavy admin pages.
- Server Actions remove API-route boilerplate for mutations.
- File-system routing maps cleanly onto the sidebar information architecture (one folder per module).
- Middleware-based auth with one-shot RFC 9470 step-up retries composes better than the Vite + React Router pattern.

The legacy `alphaswarm_admin_ui/` stays deployable behind a feature flag for a 30-day rollback window during the cutover. The new Next.js app lives at `alphaswarm_admin/frontend/`.

### Backend: extend, don't rewrite.

The existing six routers are kept. Six new module routers are added under the established audit-first / M2M-broker / `require_admin_scope` pattern. Step-up MFA per AGENTS rule 52 is attached to every new mutating endpoint.

### RBAC: stay on the existing 4-role lattice.

The blueprint suggested Casbin. We reject that — AlphaSwarm's canonical RBAC is the `alphaswarm_core.auth.rbac` 4-role lattice plus the existing `Membership` table. Adding Casbin would create a parallel policy source-of-truth that fragments rule 27. The new `/admin/rbac/*` router builds on `expand_role` and the existing `require_scope` / `require_membership` deps.

### Multi-account AWS: code now, apply later.

A new top-level `infrastructure/` directory ships the full module library (landing-zone, account, vpc, eks-cluster, eks-node-groups, karpenter-bootstrap, ecr-repositories, rds-postgres, s3-data-lake, msk-kafka, airflow, eso-bootstrap, argocd-bootstrap, observability-stack, iam-irsa-roles, route53-zones, acm-certificates, acm-pca, github-oidc, codepipeline, codebuild, codeartifact) plus per-environment compositions. Every composition assumes-role into a workload account from `shared-services` with `external_id`. Cloud-side `terraform apply` is deferred to operator hands; the PR ships the code.

### CI/CD: GitHub OIDC + SLSA L3 + Cosign keyless.

The PR adds the `.github/actions/{aws-oidc-assume,build-sign-push,slsa-provenance,kubectl-via-irsa}` composite actions and the new workflows `pr-validate.yml`, `build-publish.yml`, `argocd-trigger.yml`, `terraform-pipeline.yml`, `ml-pipeline.yml`, `paper-config-validate.yml`, `alembic-immutability.yml`. Renovate is wired with auto-merge to `main` only on minor + patch updates.

### Observability + cost.
Linkerd (chosen over Istio Ambient + App Mesh because of the ~6x lower proxy memory and ~10x lower p99 latency overhead) is the service mesh; Falco + Velero + Kubecost ship as Helm-chart wrappers. Karpenter v1 self-managed (NOT EKS Auto Mode) so the NodePool specs are recorded under `terraform_stack_spec_versions`. ### Audit WORM. `alphaswarm/tasks/audit_log_export_tasks.py::export_audit_log_window` exports `security_audit_events` + `audit_log` nightly to `s3://alphaswarm-audit-archive-${ACCOUNT_ID}/` with `ObjectLockMode=COMPLIANCE` + 7-year retention per FINRA Rule 4511 + SEC Rule 17a-4(f)(2)(i)(B). ### IdP support. Two new `IdentityProvider` subclasses ship under `alphaswarm/auth/providers/`: - `aws_iam_identity_center.py` - `aws_cognito.py` Both subclass `GenericOidcProvider` and auto-register through `IdentityProviderMeta`. IAM Identity Center is the recommended IdP for multi-account; Cognito is the documented fallback for the single-account path. ## Consequences - The 6 missing modules ship with full audit-first wiring + step-up MFA + WS multiplexing. - Frontend bundles get smaller; SSR'd admin pages enable better caching. - Multi-account topology is one `terraform apply` away. - CI gains SLSA L3 attestations + Cosign keyless verification. - Audit ledger is FINRA-compliant via WORM mirroring. - The `alphaswarm_admin_ui/` Vite tree adds maintenance debt for the duration of the rollback window. Cleanup PR scheduled after 30-day burn-in. - The legacy `alphaswarm/ui/` Solara dashboard remains in place; a separate `alphaswarm_admin-overhaul-cleanup` PR handles its removal + the FastAPI/Starlette unpin. # ADR 006 — QuantBot Operator Pattern (kopf + Pydantic mirrors) > The QuantBot Platform v0.2.0 adds a Kubernetes-native control plane on top of the existing `BotRuntime`/`bot_versions` infrastructure. Every running bot, every risk policy, every venue feed, every bac... 
# ADR 006 — QuantBot Operator Pattern (kopf + Pydantic mirrors) **Status:** Accepted (QuantBot Platform v0.2.0) **Date:** 2026-05-24 **Decision drivers:** AGENTS rules 14, 15, 28, 45; rpi-k8s-governance ## Context The QuantBot Platform v0.2.0 adds a Kubernetes-native control plane on top of the existing `BotRuntime`/`bot_versions` infrastructure. Every running bot, every risk policy, every venue feed, every backtest job, every kill switch is now a Kubernetes Custom Resource. That requires: 1. A controller that watches the CRs and reconciles desired state. 2. A schema source-of-truth for each CR. 3. Webhooks that reject malformed CRs before they reach the reconciler. ## Decision - **Controller framework:** kopf (`kopf>=1.37`). Python-native, integrates with our Pydantic spec layer, supports level-triggered reconciliation, finalizers, and admission webhooks. Up to ~1000 CRs/cluster is well within kopf's documented operating envelope. - **Schema source-of-truth:** each CR has both a Pydantic mirror class (under `alphaswarm_bots/operator/crds/*_cr.py`) AND a CRD YAML (`alphaswarm_bots/operator/crds/yaml/*_crd.yaml`). The Pydantic class is validated from the CR `.spec` field; the YAML is what gets applied to the cluster by the CRD-installer Job. The two are kept in sync by convention + the operator's startup self-test. - **Reconciliation:** level-triggered. Every handler compares desired (from spec) against actual (queried from the cluster) and drives the system back. Failures reflect onto `status.conditions`. - **Workload application:** routes through `alphaswarm_core.WorkloadRuntime` per AGENTS rule 45. The operator never calls `kubernetes.client.AppsV1Api()` directly when WorkloadRuntime is available; falls back to `kubernetes-asyncio` only for environments where WorkloadRuntime hasn't been deployed yet. 
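The level-triggered reconciliation described in the Decision can be illustrated without kopf or a cluster. The sketch below is a framework-free illustration only: `BotState`, `reconcile`, and the shape of the condition dict are hypothetical names chosen for this example, not the operator's actual handlers.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Tuple


@dataclass(frozen=True)
class BotState:
    """Hypothetical, simplified view of a bot workload."""
    image: str
    replicas: int


def reconcile(desired: BotState, actual: Optional[BotState]) -> Tuple[List[str], Dict[str, str]]:
    """Compare desired vs actual state and return (actions, status condition).

    Level-triggered: the result depends only on the two states passed in,
    never on which event triggered the call, so re-running it is always safe.
    """
    actions: List[str] = []
    if actual is None:
        # Nothing exists yet: create the workload from the spec.
        actions.append(f"create image={desired.image} replicas={desired.replicas}")
    else:
        if actual.image != desired.image:
            actions.append(f"set-image {desired.image}")
        if actual.replicas != desired.replicas:
            actions.append(f"scale {desired.replicas}")
    in_sync = not actions
    condition = {
        "type": "Reconciled",  # assumed condition name
        "status": "True" if in_sync else "False",
        "reason": "InSync" if in_sync else "DriftDetected",
    }
    return actions, condition
```

Calling `reconcile` twice with the same inputs yields the same actions, which is the property that makes a missed or duplicated watch event harmless.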
## Alternatives considered

| Option | Why rejected |
| --- | --- |
| Go operator (controller-runtime / Kubebuilder) | Re-implements the spec validation already written in Pydantic; a larger operational burden for a Python-first shop |
| metacontroller + JSON Schema | No mature Python ecosystem for the testing + audit story we need; JSON Schema diverges from Pydantic validators |
| Native Helm charts only (no controller) | Helm can't reconcile the operator-side bookkeeping (kill switch fan-out, drain finalizer, status condition rollup) |

## Consequences

- **+** Single source of truth (Pydantic) drives both API validation and CR validation.
- **+** Python-native test suite for the operator (kopf can be driven in-process from pytest).
- **−** kopf scaling ceiling is ~1000 CRs per cluster; beyond that we need operator sharding (deferred per blueprint caveat #2).
- **−** Pydantic mirror + YAML CRD requires manual sync. Mitigated by the operator's startup self-test, which compares the Pydantic JSON schema against the CRD's `openAPIV3Schema` and refuses to boot on drift.

## References

- [alphaswarm_bots/operator/](../../../alphaswarm_bots/operator/)
- [alphaswarm_platform/deployments/kubernetes/bots-operator/](../../../alphaswarm_platform/deployments/kubernetes/bots-operator/)

# ADR 007 — QuantBot Latency Classes

> Bots in the QuantBot Platform span a 6-order-of-magnitude latency range: sub-millisecond market makers next to once-a-day rebalancers next to event-driven MEV searchers. We need a taxonomy that:

# ADR 007 — QuantBot Latency Classes

**Status:** Accepted (QuantBot Platform v0.2.0)
**Date:** 2026-05-24

## Context

Bots in the QuantBot Platform span a 6-order-of-magnitude latency range: sub-millisecond market makers next to once-a-day rebalancers next to event-driven MEV searchers. We need a taxonomy that:

1. Maps onto a concrete Kubernetes scheduling primitive (different primitives for different tiers).
2.
Constrains where each bot can be scheduled (HFT bots only on dedicated NUMA-pinned nodes).
3. Tells the operator what hardware features to validate (HugePages, SR-IOV, PTP).
4. Drives the alert SLO thresholds (1 ms P99 vs 1 µs P99).

## Decision

Five canonical latency classes (`Frequency` StrEnum):

| Class | Latency target | K8s primitive | Special hardware |
| --- | --- | --- | --- |
| `hft` | < 1 ms tick-to-trade | DaemonSet on tainted nodes (1 bot / node) | NUMA pinning, HugePages, SR-IOV, PTP |
| `mid` | 1 ms – 1 s | StatefulSet (stateful) | None |
| `low` | 1 s – 1 min | Deployment (stateless) | None |
| `eod` | batch / daily | CronJob | None |
| `event` | event-driven | Deployment (long-running consumer) | None |

The `Frequency.HFT` Pydantic validator enforces:

- `needs_numa_pinning == True`
- `expected_p99_tick_to_trade_us` is set

These are required because operator scheduling decisions are made off the capability declaration; an HFT bot without NUMA pinning would silently land on a shared node and violate the RTS 25 1-microsecond timestamp granularity requirement.

## Python ceiling (caveat #1 from blueprint)

Pure Python + Cython targets **100-500 µs** for non-kernel-bypass HFT. The Aeron / Google Cloud benchmark (weareadaptive.com, 2024) reports 57 µs default / 18 µs with kernel-bypass at 100k msg/s — and that's a Java baseline. Bots requiring sub-100 µs MUST use the Rust escape hatch in `alphaswarm_bots/hft/escape_hatch.py`. The architecture explicitly documents this so we don't over-promise.

## Consequences

- **+** Operator scheduling is deterministic from the spec.
- **+** Alert SLOs auto-derive from the latency class.
- **−** Adding a tier in the future (e.g. `ultra_hft` for sub-100 µs) requires a new enum value and operator handler.
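The capability check behind `Frequency.HFT` and the class-to-primitive mapping from the Decision table can be sketched in a few lines. This is a minimal illustration, assuming a plain dataclass in place of the Pydantic model; `BotCapabilities` and `K8S_PRIMITIVE` are hypothetical names, while the two field names and the enum values are taken from the Decision section.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class Frequency(str, Enum):
    """The five latency classes; a str-mixin Enum stands in for StrEnum."""
    HFT = "hft"
    MID = "mid"
    LOW = "low"
    EOD = "eod"
    EVENT = "event"


# Latency class -> Kubernetes primitive, copied from the Decision table.
K8S_PRIMITIVE = {
    Frequency.HFT: "DaemonSet",
    Frequency.MID: "StatefulSet",
    Frequency.LOW: "Deployment",
    Frequency.EOD: "CronJob",
    Frequency.EVENT: "Deployment",
}


@dataclass
class BotCapabilities:
    """Hypothetical capability declaration (the real one is a Pydantic model)."""
    frequency: Frequency
    needs_numa_pinning: bool = False
    expected_p99_tick_to_trade_us: Optional[float] = None

    def __post_init__(self) -> None:
        # Accept either the enum member or its string value.
        self.frequency = Frequency(self.frequency)
        if self.frequency is Frequency.HFT:
            if not self.needs_numa_pinning:
                raise ValueError("hft bots must set needs_numa_pinning=True")
            if self.expected_p99_tick_to_trade_us is None:
                raise ValueError("hft bots must set expected_p99_tick_to_trade_us")
```

Rejecting the declaration at construction time means a misconfigured HFT bot fails before any scheduling decision is made, rather than landing on a shared node.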
## References - [alphaswarm_bots/spec.py — Frequency enum](../../../alphaswarm_bots/spec.py) - [alphaswarm_bots/hft/](../../../alphaswarm_bots/hft/) - [Commission Delegated Regulation (EU) 2017/574 — RTS 25 clock sync] # ADR 008 — Bot Event Sourcing (PostgreSQL, monthly-partitioned) > Each running bot generates a stream of decision/order/fill/snapshot events. To support: # ADR 008 — Bot Event Sourcing (PostgreSQL, monthly-partitioned) **Status:** Accepted (QuantBot Platform v0.2.0) **Date:** 2026-05-24 **Decision drivers:** Blueprint §H; AGENTS rules 3, 6, 34. ## Context Each running bot generates a stream of decision/order/fill/snapshot events. To support: 1. Restart recovery without losing state. 2. Time-travel debugging ("show me the position at 14:32:11 yesterday"). 3. Regulatory audit (RTS 6 Article 17(3) real-time reconciliation). 4. Replay-based regression testing of strategy changes. ...we need an append-only, queryable event log per bot. ## Decision - **Backend:** PostgreSQL, the same database that already holds `bots` / `bot_versions` / `bot_deployments`. No new technology to operationalize. - **Partitioning:** `bot_events` is `PARTITION BY RANGE (recorded_at)` on PostgreSQL with one partition per UTC month. Partition pruning keeps queries fast even at billion-row scale (per documented PostgreSQL event-sourcing patterns). - **Sequence numbers:** monotonic per bot (`bot_id`, `seq_no`). The `EventStore` writer keeps the next-available `seq_no` in memory and increments on each append; on restart it reads `max(seq_no)` + 1. - **Snapshots:** periodic `bot_snapshots` rows act as replay anchors. On startup the kernel reads the latest snapshot and replays events with `seq_no > snapshot.seq_no` rather than from zero. - **Tenancy:** every row carries `owner_user_id` / `workspace_id` / `project_id` / `experiment_id` / `test_id` per AGENTS rule 34; the existing `LedgerWriter._stamp` populates them automatically from the active `RequestContext`. 
- **GIN index** on `bot_events.event_data` (JSONB) so ad-hoc queries like "all fills with `fee_currency=USDT`" stay fast.

## Alternatives considered

| Option | Why rejected |
| --- | --- |
| Kafka log only | Not random-access; harder to query for time-travel; we already have Postgres |
| TimescaleDB hypertable | Extra dependency to operate; partition pruning on plain Postgres is sufficient at our scale |
| One table per bot | Operational nightmare at >100 bots; partition pruning gives us the same query performance with one schema |
| Iceberg-only | Lakehouse latency too high for the kernel's restart path |

## Iceberg interplay (Rule 3)

`bot_events` is **operational** state — kernel writes happen on the hot path. **Analytical** writes (trajectory exports, signal series, gold-tier aggregates) still go through `iceberg_catalog.append_arrow` per AGENTS rule 3; the operational + analytical paths are deliberately separate to keep the kernel's write latency predictable.

## Consequences

- **+** Restart recovery is O(snapshot + events_since_snapshot) rather than O(all_events).
- **+** Time-travel debugging is one Postgres query.
- **+** No new infrastructure to operate.
- **−** Monthly partitioning requires a Celery beat task to pre-create next month's partition (Phase 12 — Celery wiring deferred to follow-up).
- **−** GIN index churns on high-volume JSONB inserts; mitigated by the `EventStore` batching writes every `flush_interval_s`.

## References

- [alembic/versions/0058_bot_event_sourcing.py](../../../alembic/versions/0058_bot_event_sourcing.py)
- [alphaswarm_bots/state/store.py](../../../alphaswarm_bots/state/store.py)
- [alphaswarm_bots/state/replay.py](../../../alphaswarm_bots/state/replay.py)

# ADR 009 — MiFID II RTS 6 + SEC 15c3-5 Conformance

> Algorithmic trading in EU markets is governed by Commission Delegated Regulation (EU) 2017/589 — **MiFID II Regulatory Technical Standards on the organisational requirements of investment firms engage...
# ADR 009 — MiFID II RTS 6 + SEC 15c3-5 Conformance

**Status:** Accepted (QuantBot Platform v0.2.0)
**Date:** 2026-05-24

**LEGAL REVIEW REQUIRED.** This ADR + the code it documents are an **engineering crosswalk**, NOT legal advice. Any production deployment trading European or US equities (or directly-affected derivatives) requires sign-off from the firm's compliance counsel and the CEO's annual certification.

## Context

Algorithmic trading in EU markets is governed by Commission Delegated Regulation (EU) 2017/589 — **MiFID II Regulatory Technical Standards on the organisational requirements of investment firms engaged in algorithmic trading** ("RTS 6"). US market access is governed by SEC Rule 15c3-5 (17 CFR § 240.15c3-5). Both regimes require pre-trade risk controls, kill functionality, real-time reconciliation, conformance testing, stress testing, and annual validation.

The QuantBot Platform must:

1. Enforce every named control before an order leaves the bot.
2. Generate the annual validation report mechanically.
3. Document the required attestations the firm's officers must sign.

## Decision

- **Two-tier risk:** Layer-1 in-bot `PreTradeRiskEngine` for the latency-sensitive fast path; Layer-2 out-of-band FastAPI service for the broker-dealer-controlled aggregate-credit check (§ 240.15c3-5(d)).
- **Hard vs soft block:** policy verdicts carry `severity = "block"` (hard — order rejected) or `severity = "warn"` (soft — informational only, may be overridden). Mirrors the ESMA Supervisory Briefing §72 (hard) vs §75/§76 (soft) distinction.
- **Crosswalk:** every policy in `alphaswarm_bots/risk/policies.py` carries a `citation` string. The `alphaswarm_bots/risk/reg/rts6.py` and `rule_15c3_5.py` modules list the mapping by class name.
- **Kill switch (RTS 6 Art. 12):** three-scope (`bot` / `fleet` / `platform`) implementation in `alphaswarm_bots/risk/kill_switch_v2.py`, backed by Redis + a `KillSwitch` CRD.
Cancellation is immediate; affected bots transition to `Draining` and (optionally) flatten positions. - **Real-time reconciliation (Art. 17(3)):** `ExecutionAdapter.reconcile()` is called on every reconnect; drop-copy ingest is the canonical real-time path; mismatches elevate to `OrderStatus.DISPUTED` and quarantine the strategy from new entries. - **Real-time alerts (Art. 16(5)) — "within 5 seconds":** Prometheus Alertmanager rules with `interval: 15s` and `for: 0s` on critical signals (`prometheus-rules.yaml`). - **Conformance testing (Art. 6):** `alphaswarm_bots/risk/reg/conformance.py` ships a synthetic test harness; CLI `alphaswarm-bots conformance ` and REST `POST /bots/{ref}/conformance` run it on demand. - **Stress testing (Art. 10) — "twice the volume of the highest volume...during the previous six months":** `alphaswarm_bots/risk/reg/stress.py` reads the peak rate from `bot_events` and replays at 2x through the engine. CLI `alphaswarm-bots stress ` and REST `POST /bots/{ref}/stress`. - **Annual validation (Art. 9 + § 240.15c3-5(e)):** `alphaswarm_bots/risk/reg/validation_report.py` generates a YAML artifact with empty signature slots for risk management, internal audit, and the CEO. The generator runs daily as a Celery task; the artifact itself requires manual sign-off before submission. ## Attestation slots (left blank by the generator) The validation report has three signature slots: 1. **Risk management function (RTS 6 Art. 9(2)):** drafts the report. 2. **Internal audit (RTS 6 Art. 9(3)):** audits the report. 3. **CEO certification (SEC 15c3-5(e)):** annual certification that the firm's risk management controls comply with paragraphs (b) and (c) of the rule. The generator does NOT auto-sign these slots; that is operational, not mechanical. ## Consequences - **+** Every block has a regulatory citation. - **+** Conformance + stress + annual validation are reproducible CI artifacts. 
- **−** This is an engineering crosswalk; legal counsel must validate the mappings against the specific firm's regulatory perimeter. - **−** Cross-asset firms (equity + futures + crypto) may need additional policies beyond what we ship out of the box. ## References - Commission Delegated Regulation (EU) 2017/589 (MiFID II RTS 6) - 17 CFR § 240.15c3-5 (SEC Rule 15c3-5) - ESMA Supervisory Briefing on Algorithmic Trading (26 Feb 2026) - [alphaswarm_bots/risk/](../../../alphaswarm_bots/risk/) # ADR 010 — Canary Rollout PnL Gates > Strategy changes (new alpha model, new portfolio constructor, new execution algo) are the highest-leverage and highest-risk changes the platform makes. Rolling them across the entire fleet at once is ... # ADR 010 — Canary Rollout PnL Gates **Status:** Accepted (QuantBot Platform v0.2.0) **Date:** 2026-05-24 ## Context Strategy changes (new alpha model, new portfolio constructor, new execution algo) are the highest-leverage and highest-risk changes the platform makes. Rolling them across the entire fleet at once is unacceptable; bake time is mandatory. Argo Rollouts canary lets us shift weight gradually, but the canary needs **automated abort criteria** beyond the standard liveness/readiness probes — a bot can be `Ready=True` and still be hemorrhaging money. ## Decision Three AnalysisTemplates gate every canary promotion step: 1. **`bot-canary-pnl`** — realised PnL of the canary vs the stable variant. Default success condition: `canary_realized_pnl - stable_realized_pnl >= -50 USD` over 6 × 5-minute windows (30-minute total). 2. **`bot-reject-rate`** — fraction of orders that are rejected (by venue or by pre-trade risk). Default success condition: `<= 1%` over 30 × 1-minute windows. 3. **`bot-p99-latency`** — P99 tick-to-trade latency. Default success condition: `<= 1 ms` (HFT canaries override to `<= 100 µs`). 
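The three default success conditions above can be expressed as a small pure function, which is handy for unit-testing threshold changes before touching the AnalysisTemplates. This is a sketch under stated assumptions: `GateThresholds` and `failed_gates` are hypothetical names, and the rule that every sampled window must pass is an assumption of this example rather than something the ADR specifies.

```python
from dataclasses import dataclass
from typing import List, Sequence


@dataclass(frozen=True)
class GateThresholds:
    """Default thresholds from the three AnalysisTemplates (names are illustrative)."""
    min_pnl_delta_usd: float = -50.0     # bot-canary-pnl: canary minus stable PnL floor
    max_reject_rate: float = 0.01        # bot-reject-rate: 1% ceiling
    max_p99_latency_us: float = 1000.0   # bot-p99-latency: 1 ms ceiling


def failed_gates(
    pnl_deltas_usd: Sequence[float],
    reject_rates: Sequence[float],
    p99_latencies_us: Sequence[float],
    thresholds: GateThresholds = GateThresholds(),
) -> List[str]:
    """Return the names of gates that failed; an empty list means promote.

    Assumption: a gate fails if ANY sampled window breaches its threshold.
    """
    failed: List[str] = []
    if any(d < thresholds.min_pnl_delta_usd for d in pnl_deltas_usd):
        failed.append("bot-canary-pnl")
    if any(r > thresholds.max_reject_rate for r in reject_rates):
        failed.append("bot-reject-rate")
    if any(p > thresholds.max_p99_latency_us for p in p99_latencies_us):
        failed.append("bot-p99-latency")
    return failed
```

An HFT canary would pass `GateThresholds(max_p99_latency_us=100.0)` to apply the tighter latency override while keeping the other two defaults.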
The canary spec follows the standard Argo Rollouts pattern:

```
steps:
  - setWeight: 10
  - pause: { duration: 30m }
  - analysis: { templates: [{ templateName: bot-canary-pnl }, { templateName: bot-reject-rate }, { templateName: bot-p99-latency }] }
  - setWeight: 50
  - pause: { duration: 1h }
  - analysis: { templates: [{ templateName: bot-canary-pnl }, { templateName: bot-reject-rate }, { templateName: bot-p99-latency }] }
  - setWeight: 100
```

Failure of any AnalysisTemplate **aborts the rollout** and reverts traffic to the stable version. The operator additionally watches the `BotPnLDrawdownCritical` PrometheusRule; if the canary bleeds more than `maxAbortRolloutPnlBleedUsd` (default $500) the alert auto-fires a `KillSwitch` CR which halts the canary instantly — this protects against the case where the rollout abort itself takes longer than the bleed.

## Default thresholds rationale

The $50 PnL floor is intentionally generous for the initial canary window — it admits some short-term variance that is statistically normal between two variants of the same strategy. The harder $500 bleed threshold (drawdown alert) is what catches truly broken canaries within seconds.

Per blueprint caveat (canary false-positive rate): if good canaries are routinely aborted on noisy metrics, tighten the metric query **first** (more samples, longer windows, robust quantiles) before relaxing the success condition.

## Consequences

- **+** Strategy changes have an automated bake-time gate.
- **+** The same canary pattern works for both stateless mid-frequency bots and HFT bots (only the latency threshold differs).
- **−** AnalysisTemplate thresholds need per-strategy calibration — a market-making bot's "good" reject rate is higher than a stat-arb pair's "good" reject rate.
- **−** A canary that's still warming up may not yet have produced enough orders for the metrics to be meaningful; we mitigate with the initial 30-minute pause before the first analysis check.
## References - [alphaswarm_platform/deployments/argocd/rollouts/](../../../alphaswarm_platform/deployments/argocd/rollouts/) - [alphaswarm_platform/deployments/kubernetes/bots-operator/prometheus-rules.yaml](../../../alphaswarm_platform/deployments/kubernetes/bots-operator/prometheus-rules.yaml) # ADR-010: alphaswarm_rl production-grade enhancement (Phases 1-12) > **Context**: The `alphaswarm_rl` subsystem shipped with the core `RLComponent` metaclass, `RLRuntime`, hash-locked `RLExperimentSpec`, and a small set of envs / agents / observations / rewards. The TradeMast... # ADR-010: alphaswarm_rl production-grade enhancement (Phases 1-12) **Status**: accepted (2026-05-24) **Context**: The `alphaswarm_rl` subsystem shipped with the core `RLComponent` metaclass, `RLRuntime`, hash-locked `RLExperimentSpec`, and a small set of envs / agents / observations / rewards. The TradeMaster 1.0.0 codebase contained a much larger, paper-grade library of: - Reward shapes (Differential Sharpe Ratio, D3R, Implementation Shortfall, Hindsight, DP-distillation, …). - Analytical baselines (Almgren-Chriss, Avellaneda-Stoikov). - Domain envs (PortfolioManagement, OrderExecution PD, AlgorithmicTrading, HFT, MultimodalTrading). - Paper-grade agents (EIIE, DeepTrader, ETEO, OPD, DeepScalper, HFT_DDQN, InvestorImitator). - Network backbones (EIIEConv, SAGCN, MarketScorer, HFTQNet, DualHead, PDDualRNN, SARL classifier). - Market Dynamics Modeling (slice-and-merge regime labeller). - CSDI diffusion imputation. - Validation diagnostics (CPCV, PBO, RAS, DSR, walk-forward, BH / Holm-Bonferroni). - PRUDEX-Compass evaluation suite. - Three new replay buffers (General / Prioritized / NStepInfo). Plus the FinAgent multimodal LLM-hybrid agent (Zhang AAAI 24). **Decision**: Land all of the above behind 12 phases, each adding new classes that auto-register through existing AlphaSwarm abstractions (`RLComponent`, `BaseDataset`, `register_analysis_flow`, `BaseExperiment`). 
NO migration of existing components, NO breaking changes. Every new component: 1. Subclasses an existing AlphaSwarm base (`RewardTerm`, `BaseRLAgent`, `BaseRLEnv`, `TimeSeriesEncoder`, `BaseObservationBuilder`, `BaseExperiment`, `BaseReplayBuffer`, `BaseDataset`). 2. Sets `rl_alias` so it auto-registers under the right `rl_kind`. 3. Ships unit + property tests under `alphaswarm_rl/tests//`. 4. Respects every hard rule in `alphaswarm_rl/AGENTS.md`. **Consequences**: - The `rl_alias` namespace grows by ~40 new aliases; the `RLComponent.list_components(kind)` registry expands accordingly. - Heavy dependencies (`scipy.signal`, `scikit-learn`) are mandatory for the analysis flow but already in `alphaswarm` core. No new third-party RL framework dependencies. - New top-level packages under `alphaswarm_rl/src/alphaswarm_rl/`: `analytical/`, `evaluation/`, `replay/`, `validation/`. - One new analysis flow in the monolith (`alphaswarm/analysis/flows/market_dynamics_modeling.py`) per hard rule 23. - One new dataset kind in the monolith (`alphaswarm/data/datasets/kinds/csdi_imputed.py`) per hard rule 29. - One new FinAgent toolset in the monolith (`alphaswarm/agents/tools/finagent/`). - Five new agent YAMLs under `configs/agents/finagent/`. - Documentation: three new `alphaswarm_docs/` pages (rl-market-dynamics, rl-prudex-evaluation, rl-finagent) plus this ADR. 
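The auto-registration that `rl_alias` relies on can be illustrated with a small metaclass. The sketch below is not the actual `RLComponent` implementation: `ComponentMeta`, its `registry` layout, and the `DifferentialSharpe` example class are hypothetical stand-ins, and only the `rl_alias` / `rl_kind` attributes and the `list_components(kind)` call are named in the ADR.

```python
from typing import Dict, List


class ComponentMeta(type):
    """Hypothetical stand-in for the RLComponent metaclass."""

    # kind -> alias -> class; shared by every class built with this metaclass.
    registry: Dict[str, Dict[str, type]] = {}

    def __init__(cls, name, bases, namespace):
        super().__init__(name, bases, namespace)
        # Only classes that declare their own rl_alias are registered,
        # so abstract bases and alias-less subclasses are skipped.
        alias = namespace.get("rl_alias")
        kind = getattr(cls, "rl_kind", None)
        if alias and kind:
            bucket = ComponentMeta.registry.setdefault(kind, {})
            if alias in bucket:
                raise ValueError(f"duplicate rl_alias {alias!r} for kind {kind!r}")
            bucket[alias] = cls

    def list_components(cls, kind: str) -> List[str]:
        """Return the sorted aliases registered under one kind."""
        return sorted(ComponentMeta.registry.get(kind, {}))


class RewardTerm(metaclass=ComponentMeta):
    """Illustrative base class; its rl_kind is inherited by subclasses."""
    rl_kind = "reward"


class DifferentialSharpe(RewardTerm):
    """Defining the class is enough to register it; no explicit call needed."""
    rl_alias = "differential_sharpe"
```

Because registration happens at class-definition time, adding a component is purely additive, which is consistent with the ADR's "no migration, no breaking changes" constraint.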
**Hard rule alignment**:

| Rule | Compliance |
| --- | --- |
| 2 (LLM via `router_complete`) | FinAgent layered adapter + all 5 stage YAMLs |
| 3 (Iceberg via `append_arrow`) | CSDI persistence; PRUDEX skips; MDM via gold-tier flow |
| 12 (`AgentRuntime` for agents) | 5 FinAgent stages = 5 AgentSpec rows |
| 16 (`RLRuntime` for RL lifecycle) | All new agents / experiments callable through it |
| 18 (`IcebergTrajectoryStore`) | Untouched — existing path preserved |
| 19 (`RLComponent` metaclass) | All ~40 new aliases auto-register |
| 20 (`router_complete` from RL code) | LayeredReflectionAdapter only LLM caller |
| 22 (No direct DB from agent body) | FinAgent tools route through registered DataMCP only |
| 23-25 (Analysis flow → `AnalysisRuntime`) | MDM flow + `register_analysis_flow` |
| 29 (`BaseDataset` for env data) | tradesim_* envs accept BaseDataset / DataFrame |
| 36-38 (Advantage / backbone / weight-centric) | Backbones extend `TimeSeriesEncoder`; weights flow `WeightCentricPipeline` ⇒ `WeightToOrders` |

**Trade-offs**:

1. **CSDI is ensemble-imputation, not real diffusion** — the full ~1500-LOC PyTorch CSDI model is out-of-scope; the ensemble imputer satisfies the acceptance gate (MAE < 0.05 on synthetic) and ships the same public contract (median + quantile bands) so a future drop-in replacement is straightforward.
2. **RAS is EXPERIMENTAL** — exposed under the same canonical surface as DSR / PBO but marked in the docstring; the Rademacher-complexity estimate is Monte-Carlo and depends on `n_draws`.
3. **Paper-grade agents lean on SB3** — most new agents are thin `SB3Adapter` subclasses with paper-grade hyperparameters. InvestorImitator (REINFORCE) and OPD (teacher-student dual PPO) are the two genuinely custom implementations. This matches pragmatic deployment patterns: SB3 has been more thoroughly battle-tested than re-implementing each paper from scratch.
4.
**No live broker integration in the test suite** — `WeightToOrders` is tested against `_MockBrokerage`. The Alpaca / IBKR adapter lives in the monolith and is covered by integration tests there.

# ADR 011 — CDN-fronted standalone container for the cloud-hosted alphaswarm_ui

> The cloud-hosted Next.js 14 PaaS frontend (alphaswarm_ui) ships as a clean Next.js standalone container at app.alpha-swarm.ai. Static assets are CDN-fronted by Cloudflare. ADR 002's multi-stage Solara/Vite/ASGI proxy pattern is scoped to the local alphaswarm_client only.

# ADR 011 — CDN-fronted standalone container for the cloud-hosted alphaswarm_ui

- **Status**: Accepted (2026-05-25)
- **Authors**: Platform team
- **Supersedes (scoped)**: [ADR 002 — Single multi-stage container for the AlphaSwarm client surface](002-single-container-client.md) for the cloud surface only; ADR 002 stays in force for the local `alphaswarm_client/` Vite operator UI.
- **Related**: [ADR 001 — Vite static export](001-static-export-over-ssr.md), [ADR 002 — Single multi-stage container](002-single-container-client.md), [ADR 003 — Auth0 zero-trust](003-auth0-zero-trust.md), [ADR 005 — Separated control plane](005-separated-control-plane.md), [ADR 012 — Solara deprecation](012-solara-deprecation.md)

## Context

When the original `alphaswarm_client/` packaging was designed (ADR 002), the platform had three coexisting presentation surfaces: a Vite operator UI, a legacy Next.js webui, and a Python Solara visualisation layer. Collapsing all three behind one FastAPI proxy was the right call for a single-tenant local-first deployment where operators bookmark one URL and the proxy hides the rest.

The cloud-hosted, customer-facing PaaS at `alpha-swarm.ai` / `app.alpha-swarm.ai` (the new `alphaswarm_ui/` Next.js 14+ App Router app) has different constraints:

1. **Multi-tenant scale.** Hundreds-to-thousands of concurrent tenants.
Static-asset throughput and SSR throughput scale at different ratios — co-located scaling triggers wasted CPU and unnecessary memory pressure on the SSR pods. 2. **CDN-friendly assets.** Next.js standalone emits hashed, immutable filenames under `/_next/static/*`. Serving them from the SSR pods is bandwidth waste; Cloudflare can cache them for a year with zero risk of staleness. 3. **No Python / no Solara.** `alphaswarm_ui/` is pure TypeScript + Next.js server. The Solara stage (ADR 002 Stage 2) doesn't apply and would only bloat the image (~300 MB heavier). 4. **Independent BFF lifecycle.** Every `alphaswarm_ui/api/*` route is a thin BFF handler that re-checks the session, forwards a tenancy header, and proxies upstream. Reverse-proxying through FastAPI adds an extra hop with no value (the BFF is already a proxy). 5. **Edge-rendered marketing.** The `(marketing)` route group is designed for SSR + ISR cache. Routing it through an internal FastAPI proxy defeats the whole point of edge-near rendering. ## Decision The cloud-hosted `alphaswarm_ui` ships as **one clean Next.js standalone container** built from [`alphaswarm_platform/build/docker/alphaswarm_ui/Dockerfile`](../../../alphaswarm_platform/build/docker/alphaswarm_ui/Dockerfile) (already two stages: `node:20-alpine` builder + `node:20-alpine` runtime running `node server.js`). It DOES NOT use the ADR 002 three-stage Python/ASGI pattern. 
**Edge caching layout:**

| Path | Cache-Control | Notes |
| --- | --- | --- |
| `/_next/static/*` | `public, max-age=31536000, immutable` | Hashed filenames; year-long TTL |
| `/public/*` `/fonts/*` `/images/*` | `public, max-age=2592000` | 30-day TTL, hand-curated assets |
| `/api/*` | `no-store` + `Pragma: no-cache` | BFF responses; user-scoped (rule 4 + management-engine.mdc) |
| Everything else (SSR) | `public, max-age=3600, stale-while-revalidate=86400` | Per-tenant marketing + dashboard pages |

The NGINX Ingress at [`alphaswarm_platform/deployments/kubernetes/base/alphaswarm-ui/ingress.yaml`](../../../alphaswarm_platform/deployments/kubernetes/base/alphaswarm-ui/ingress.yaml) sets these via `nginx.ingress.kubernetes.io/configuration-snippet`. Cloudflare in front honours them aggressively for `/_next/static/*` and bypasses the cache for `/api/*`.

**Post-deploy cache purge:** the GitHub Actions deploy job in [`.github/workflows/alphaswarm-ui.yml`](../../../.github/workflows/alphaswarm-ui.yml) calls the Cloudflare zone-purge API immediately after `kubectl rollout status` succeeds. The Cloudflare token is sourced from the existing `CredentialResolver` chain via the `ALPHASWARM_CLOUDFLARE_API_TOKEN` ExternalSecret (AGENTS rule 26).

**HPA:** keep the existing [`hpa.yaml`](../../../alphaswarm_platform/deployments/kubernetes/base/alphaswarm-ui/hpa.yaml) (CPU 70%, memory 80%, 3-20 replicas). Because static assets are CDN-offloaded, SSR pod CPU usage tracks real per-tenant rendering work — autoscaling becomes meaningful instead of a noisy mix of "serving a JS bundle" and "rendering a dashboard page".

## Consequences

**Positive**

- 80%+ static-asset bandwidth offloaded to Cloudflare's edge.
- HPA triggers on real SSR work, not bandwidth.
- Image is ~150 MB (Node Alpine) vs. ~450 MB (Python + Solara + Node) for ADR 002. Faster pod cold start, faster rolling deploys.
- The BFF + SSR + edge layers have one ownership boundary each — Cloudflare for delivery, NGINX Ingress for cache hints, `node server.js` for SSR + BFF. No ASGI proxy hop in between. - `/api/*` is `no-store` end-to-end — no risk of a CDN edge node caching a tenant's response and serving it to a different tenant. **Negative** - Two presentation packaging stories now exist (ADR 002 for `alphaswarm_client`, ADR 011 for `alphaswarm_ui`). Mitigated by the per-surface scoping: each ADR is the source of truth for one tree only. - Cloudflare cache-purge is now part of the deploy critical path. A Cloudflare API outage during deploy means stale `/_next/static/*` for up to 1y per hashed filename — but the hashes change on every deploy, so the impact is bounded to assets whose names didn't change (rare for a real change). - Adds a `CLOUDFLARE_API_TOKEN` secret to the deploy environment. Stored in Vault + synced via ExternalSecret per AGENTS rule 26. ## Alternatives considered - **Stay on ADR 002 (single FastAPI proxy container)** — rejected. Bandwidth-CPU coupling, larger image, unnecessary Solara/Python weight, redundant proxy hop in front of the BFF. - **Vercel hosting** — rejected. ADR 003's zero-trust constraints + the on-cluster control plane integration argue for keeping the SSR layer inside our own K8s + CredentialResolver perimeter. - **CloudFront in front of a single SSR pod** — rejected. We already have Cloudflare as the edge for `alpha-swarm.ai`. Adding a second CDN would split the cache-purge story and add edge cost. 
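The edge caching layout from the Decision section is configured in the NGINX Ingress, not in Python, but the same path-to-header mapping can be written as a small lookup for use in a smoke test against a deployed environment. This is only a sketch: `CACHE_RULES`, `SSR_DEFAULT`, and `cache_control_for` are hypothetical names, and the header values are copied from the table in this ADR.

```python
from typing import List, Tuple

# (path prefix, expected Cache-Control), in the order of the ADR's table.
CACHE_RULES: List[Tuple[str, str]] = [
    ("/_next/static/", "public, max-age=31536000, immutable"),  # hashed filenames
    ("/public/", "public, max-age=2592000"),                    # 30-day TTL
    ("/fonts/", "public, max-age=2592000"),
    ("/images/", "public, max-age=2592000"),
    ("/api/", "no-store"),                                      # user-scoped BFF responses
]

# Everything else is treated as an SSR page.
SSR_DEFAULT = "public, max-age=3600, stale-while-revalidate=86400"


def cache_control_for(path: str) -> str:
    """Return the Cache-Control value the ingress is expected to set for a path."""
    for prefix, value in CACHE_RULES:
        if path.startswith(prefix):
            return value
    return SSR_DEFAULT
```

A smoke test could request a handful of representative URLs and compare each response's `Cache-Control` header with `cache_control_for(path)`; a mismatch on `/api/*` is the one worth failing the deploy for, since that rule is what prevents cross-tenant caching.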
## Implementation references - Standalone Dockerfile: [`alphaswarm_platform/build/docker/alphaswarm_ui/Dockerfile`](../../../alphaswarm_platform/build/docker/alphaswarm_ui/Dockerfile) - Ingress + CDN headers: [`alphaswarm_platform/deployments/kubernetes/base/alphaswarm-ui/ingress.yaml`](../../../alphaswarm_platform/deployments/kubernetes/base/alphaswarm-ui/ingress.yaml) - HPA: [`alphaswarm_platform/deployments/kubernetes/base/alphaswarm-ui/hpa.yaml`](../../../alphaswarm_platform/deployments/kubernetes/base/alphaswarm-ui/hpa.yaml) - CI deploy + cache purge: [`.github/workflows/alphaswarm-ui.yml`](../../../.github/workflows/alphaswarm-ui.yml) - BFF + session: [`alphaswarm_ui/src/lib/auth/session.ts`](../../../alphaswarm_ui/src/lib/auth/session.ts), [`alphaswarm_ui/src/lib/api/client.ts`](../../../alphaswarm_ui/src/lib/api/client.ts) # ADR 012 — Solara deprecation in the cloud build > Solara is excluded from the cloud alphaswarm_ui Dockerfile and remains only in the local alphaswarm_client image for one-release-cycle rollback. The Solara stage will be removed entirely from alphaswarm_client after the rollback window closes. # ADR 012 — Solara deprecation in the cloud build - **Status**: Accepted (2026-05-25) - **Authors**: Platform team - **Related**: [ADR 002 — Single multi-stage container](002-single-container-client.md), [ADR 011 — CDN-fronted standalone for alphaswarm_ui](011-cdn-fronted-standalone-for-alphaswarm-ui.md) ## Context The legacy Solara UI (`legacy_ui.app` at [`alphaswarm/ui/`](../../../alphaswarm/ui/)) is a Python ASGI presentation layer that predates the Vite + React 19 + shadcn cutover documented in [`alphaswarm_client/CUTOVER.md`](../../../alphaswarm_client/CUTOVER.md). It is already wrapped in the `legacy` profile and gated behind `ALPHASWARM_CLIENT_ENABLE_SOLARA` (ADR 002 Stage 2 + production runtime). 
The cloud `alphaswarm_ui/` Next.js application has no need for Solara — every chart that Solara renders is already covered by the `lightweight-charts` / `recharts` stack already in [`alphaswarm_client/package.json`](../../../alphaswarm_client/package.json) and inherited by `alphaswarm_ui/`. Continuing to bundle Solara into the cloud image is pure dead weight (~300 MB) AND it creates a second presentation-layer state machine the BFF would otherwise have to synchronise with the React component tree. ## Decision 1. **`alphaswarm_ui/` Dockerfile excludes Solara entirely** (already the case). No `solara-builder` stage; no `/legacy` mount. 2. **`alphaswarm_client/` retains the Solara stage for one release cycle beyond Phase 1 of the cloud-dash refactor.** This preserves the ADR 002 rollback contract. 3. **After one release cycle, the Solara stage is removed from `alphaswarm_client/`** (Phase 7 of the cloud-dash refactor plan): - Delete the `solara-builder` stage from [`alphaswarm_platform/build/docker/alphaswarm_client/Dockerfile`](../../../alphaswarm_platform/build/docker/alphaswarm_client/Dockerfile). - Drop the `/legacy` mount from the Stage-3 FastAPI proxy. - Remove `ALPHASWARM_CLIENT_ENABLE_SOLARA` from [`alphaswarm/config/settings.py`](../../../alphaswarm/config/settings.py). - `git mv alphaswarm/ui/ alphaswarm/legacy_solara_ui/` so the source code remains for archaeological reference but no longer ships. 4. **No new Solara work.** The `legacy` profile is in maintenance mode only. New visualisation lands in `alphaswarm_client/` (Vite + shadcn) or `alphaswarm_ui/` (Next.js + antd + recharts). ## Consequences **Positive** - Cloud image stays small (~150 MB) and Python-free; cold-start latency is dominated by Next.js startup, not Solara warmup. - One less presentation-layer state machine to keep in sync with the React component tree. - Bundle audits stop having to explain why a TypeScript-first PaaS ships a 300 MB Python interpreter. 
**Negative** - Operators who relied on Solara dashboards have to migrate before the Phase-7 removal. The migration is well-documented: every Solara surface has a Vite analog (see the cutover checklist in [`alphaswarm_client/CUTOVER.md`](../../../alphaswarm_client/CUTOVER.md)). - Loss of Solara's Python-side reactive component model. This was an interesting prototype path but not a load-bearing operator workflow. ## Alternatives considered - **Keep Solara indefinitely as a "second UI"** — rejected. The cost of maintaining two parallel presentation stacks (React + Solara) outweighs the value of an alternate visualisation framework that no current workflow needs. - **Port Solara to JupyterLab embed** — rejected. JupyterLab is intended for notebook authoring (Lab Engine), not operator dashboards. Mixing the two surfaces would re-create the original framework-fragmentation problem ADR 002 set out to solve. ## Implementation references - ADR 002 Solara stage: [`alphaswarm_platform/build/docker/alphaswarm_client/Dockerfile`](../../../alphaswarm_platform/build/docker/alphaswarm_client/Dockerfile) (Stage 2 `solara-builder`) - Solara source: [`alphaswarm/ui/`](../../../alphaswarm/ui/) - Feature flag: `ALPHASWARM_CLIENT_ENABLE_SOLARA` in [`alphaswarm/config/settings.py`](../../../alphaswarm/config/settings.py) - Cutover history: [`alphaswarm_client/CUTOVER.md`](../../../alphaswarm_client/CUTOVER.md) - Phase-7 removal step: [`.cursor/plans/alphaswarm_cloud-hosted_dash_refactor_*.plan.md`](../../../.cursor/plans/) # ADR-013: Entra ID as the AlphaSwarm staff first user pool # ADR-013: Entra ID as the AlphaSwarm staff first user pool - **Status**: Accepted - **Date**: 2026-05-27 - **Supersedes**: none - **Superseded by**: none - **Related rules**: AGENTS rule 27 (identity), 42 (TerraformRuntime), 44 (EntraTenantLink approval flow), 26 (CredentialResolver) - **Related ADRs**: [ADR-003 Auth0 zero-trust](003-auth0-zero-trust.md) remains valid for B2C / customer-tenant fallback; 
[ADR-005 separated control plane](005-separated-control-plane.md) is the host for the new `manage.alpha-swarm.ai` MSAL routes.

## Context

AlphaSwarm has historically authenticated its staff through the same Auth0 tenant that serves customer logins. Auth0 has served well as a B2C identity surface, but for the **internal** staff pool we need:

1. **Centralised MFA + Conditional Access**. Staff already authenticate to Microsoft 365 daily through the corporate Entra tenant. CA policies (block risky sign-ins, named-location MFA, FIDO2 hardware key requirements for admins) are enforced by the IT / Security team in one place. Replicating those controls in Auth0 doubles the surface area.
2. **Audit centralisation**. The corporate SIEM already ingests Entra sign-in logs via the existing log stream. Auth0 audit data has to be exported separately and reconciled.
3. **Group-driven authorisation**. New hires onboard via a single HR-side group action; the Entra group's app-role mapping automatically grants the right AlphaSwarm scopes. The Auth0 path required manual role assignment in the Auth0 dashboard.
4. **No client secrets in CI**. GitHub Actions OIDC + federated credentials replace the old `AZURE_CLIENT_SECRET` repo secret.
5. **Customer separation**. The AlphaSwarm staff Entra tenant is independent of every customer Entra tenant. Customer tenants continue to flow through the existing `EntraTenantLink` B2B approval wizard (AGENTS rule 44) — they do NOT land in the staff tenant.

The runtime support for Entra has been in place since the Phase 4 service-mesh rollout (`alphaswarm/auth/providers/msal_entra.py`); this ADR formalises the *first user pool* designation and brings the Entra configuration under Terraform control.

## Decision

1. **The AlphaSwarm staff Microsoft Entra ID tenant is the first user pool for `manage.alpha-swarm.ai`**.
Tokens whose `iss` matches the AlphaSwarm staff tenant are routed through `MsalEntraIdentityProvider` before any other provider in the chain. 2. **Auth0 stays as the customer-facing B2C pool** and as the degraded-mode fallback for staff (e.g. if the Entra side has an incident). 3. **Every Entra resource is under Terraform control** through the new `alphaswarm_entra_directory` module: 3 app registrations + 7 directory groups + 7 app roles + group → role assignments + named locations + GitHub Actions OIDC federated credentials. 4. **Conditional Access policies remain manually authored** because P2 licensing requires Security-team review on every policy change. The Terraform module records policy display names as documentation; a smoke-test helper queries Microsoft Graph at apply time to confirm each named policy exists. 5. **Apply path is `TerraformRuntime` only** (rule 42). Plan-only on PR; apply on push to `main` through `alphaswarm deploy`. 6. **Federated credentials replace static client secrets in CI**. The `alphaswarm-ci-github` app's federated credentials are per-environment + per-branch — wildcards are rejected at plan time. ## Alternatives considered ### A. Keep Auth0 as the staff pool Pros: - Zero migration effort. - Single identity surface for both staff + customers. Cons: - Doubles the MFA + CA enforcement surface. - Splits audit logs across two systems (Auth0 + corporate SIEM). - Manual role assignment in the Auth0 dashboard for every new hire. - Static client secrets in CI. **Rejected**: the operational + audit overhead outweighs the zero-migration win. ### B. Migrate ALL identity (staff + customers) to Entra Pros: - Single user pool overall. - Strongest audit centralisation. Cons: - Customer tenants are operated by their own admins; we can't unify them into a single tenant we control. - B2C scenarios (e.g. self-service trial signups) are awkward in enterprise Entra; Auth0 + the existing B2C surface is the right tool. 
- Migration cost: every existing Auth0 customer connection would need to be re-issued in Entra B2C (different protocol, different SDK). **Rejected**: the staff vs customer split is the right granularity. ### C. Keep Entra resources in the Azure Portal (no Terraform) Pros: - Lower friction for ad-hoc adjustments. - No Terraform learning curve for IT staff. Cons: - No reviewable diff for changes. - No `terraform_runs` audit row for compliance. - No federated-credential automation; CI keeps a static secret. - Drift between dev / prod tenants becomes hard to detect. **Rejected**: clickops on identity infrastructure violates AGENTS rule 42 in spirit (audit-first, every change reviewed). ## Consequences ### Positive - One MFA + CA enforcement surface for staff, owned by Security. - Audit logs land in the corporate SIEM via the existing log stream. - New-hire onboarding is one HR-side group add. - CI authenticates to Azure with no stored secrets. - Customer Entra tenants flow through the existing B2B path; the internal pool doesn't bleed into customer scope. - The `alphaswarm_entra_directory` module is reviewable; every change gets a PR + plan diff. ### Negative - Two identity providers to keep healthy (Entra + Auth0). Mitigated by `select_provider_for_token` routing — the right provider is picked per-request rather than per-deploy. - Bootstrap window has a temporary client secret (Phase 0/1 of the rollout plan); retired in Phase 5. - CA policies remain a manual surface — the Terraform module can reference them but cannot create them. Mitigated by the smoke-test helper that confirms required policies exist before exposure. - Group membership is intentionally outside Terraform. HR + Security own membership; the audit path comes from the Entra audit log rather than `terraform_runs`. 
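
The per-request routing that this ADR relies on (staff-tenant tokens go to the MSAL provider first, everything else falls through to Auth0) can be sketched roughly as follows. This is not the code in `alphaswarm/auth/providers/__init__.py`: the names `pick_provider`, `unverified_claims`, and `make_unsigned_token`, the returned strings, and the issuer-host list are assumptions made for illustration. The sketch reads the `iss` claim without verifying the signature, purely to choose a provider; the chosen provider is still expected to validate the token fully.

```python
import base64
import json
from urllib.parse import urlparse

# Issuer hosts used by Microsoft Entra ID tokens (v2.0 and v1.0 endpoints).
ENTRA_ISSUER_HOSTS = {"login.microsoftonline.com", "sts.windows.net"}


def unverified_claims(token: str) -> dict:
    """Decode the JWT payload WITHOUT verifying it. For routing only."""
    try:
        payload_b64 = token.split(".")[1]
        padded = payload_b64 + "=" * (-len(payload_b64) % 4)
        claims = json.loads(base64.urlsafe_b64decode(padded))
    except (IndexError, ValueError):
        return {}
    return claims if isinstance(claims, dict) else {}


def pick_provider(token: str, staff_tenant_id: str) -> str:
    """Hypothetical selector: 'msal_entra' for staff-tenant issuers, else 'auth0'."""
    iss = unverified_claims(token).get("iss")
    if isinstance(iss, str):
        parsed = urlparse(iss)
        segments = [s for s in parsed.path.split("/") if s]
        if (
            parsed.hostname in ENTRA_ISSUER_HOSTS
            and segments
            and segments[0].lower() == staff_tenant_id.lower()
        ):
            return "msal_entra"
    # Customer Entra tenants, Auth0 tokens, and malformed tokens all fall through.
    return "auth0"


def make_unsigned_token(claims: dict) -> str:
    """Build an unsigned JWT-shaped string for demonstration purposes only."""
    def seg(obj: dict) -> str:
        raw = json.dumps(obj).encode("utf-8")
        return base64.urlsafe_b64encode(raw).rstrip(b"=").decode("ascii")
    return f"{seg({'alg': 'none'})}.{seg(claims)}."
```

Comparing the tenant segment of the issuer, rather than only the issuer host, is what keeps a customer's Entra token from being treated as a staff token.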
### Risks + mitigations

| Risk | Mitigation |
| --- | --- |
| Tenant lockout (every admin's account gets MFA-locked) | TWO break-glass accounts excluded from CA policies (rollout plan §4 + `entra-rotate-secrets` runbook). |
| Group → role mapping mistake grants over-privileged access | `terraform_runs` audit + the daily Prometheus alert on `entra_role_assignment_changes_total`. |
| Federated-credential subject too broad | Per-environment / per-branch subjects; module rejects wildcards at plan time. |
| Customer-tenant tokens routed to MSAL-internal | `select_provider_for_token` checks the `iss` claim against `auth_msal_internal_tenant_id`; mismatch falls through to Auth0. Unit tested. |

## Implementation pointers

- Long-form rollout plan: [`docs/plans/entra-internal-tenant-rollout.md`](pathname:///docs/plans/entra-internal-tenant-rollout.md)
- Concepts: [`concepts/identity/entra-internal-tenant`](../../concepts/identity/entra-internal-tenant.md)
- Bootstrap runbook: [`how-to/entra-terraform-bootstrap`](../../how-to/entra-terraform-bootstrap.md)
- Onboarding runbook: [`how-to/entra-onboard-new-staff`](../../how-to/entra-onboard-new-staff.md)
- Secret-rotation runbook: [`how-to/entra-rotate-secrets`](../../how-to/entra-rotate-secrets.md)
- Module: [`alphaswarm_platform/terraform/modules/alphaswarm_entra_directory/`](pathname:///alphaswarm_platform/terraform/modules/alphaswarm_entra_directory/README.md)
- Provider runtime: `alphaswarm/auth/providers/msal_entra.py`
- Provider chain selector: `alphaswarm/auth/providers/__init__.py::select_provider_for_token`

# ADR-014: Knowledge-Base Boundary (`alphaswarm_kb` + `alphaswarm_kb_federation`)

> Extract the AlphaSwarm RAG + agent-memory stack into a Clean-Architecture knowledge-base boundary with pluggable Cognee / Graphiti / Mem0 / Letta / LlamaIndex adapters, bi-temporal PermissionedDataPoint, four-scope KBLayerComposer, hybrid OpenFGA + OPA + Cedar policy stack, and Terragrunt silo-per-tenant IaC.
# ADR-014: Knowledge-Base Boundary **Status**: accepted (2026-05-28) **Context**: The AlphaSwarm knowledge stack started as `alphaswarm/rag/` (a four-level hierarchical RAG on Redis + pgvector) plus `alphaswarm/llm/memory.py` (RedisHybridMemory) wired directly into `AgentRuntime`. As the platform grew, three tensions accumulated: 1. **Vendor coupling.** `HierarchicalRAG` is fast and AlphaSwarm-native, but the field has matured rapidly. Cognee (tri-store memory engine), Graphiti (bi-temporal Neo4j edges with sub-300ms p95 recall), Mem0 (user-centric personalisation), Letta (full agent runtime), and LlamaIndex (general-purpose vector backbone) all solve adjacent problems and tenants are starting to ask for each by name. 2. **Multi-tenancy on cognitive memory.** The existing RAG row-filter stamps `workspace_id`/`lab_id` on rows but provides no node/edge ACL, no bi-temporal invalidation, no cross-tenant marketplace, and no physical per-tenant isolation. Regulated tenants (financial advisors on HIPAA/SOX) need an explicit silo path; B2C tenants need cheap shared-schema RLS; both want a marketplace where they can subscribe to curated external corpora without giving up isolation. 3. **Cross-boundary contamination.** RAG knowledge lived inside the monolith with no Clean-Architecture port surface. Bot specs, RL specs, agent specs, and analysis specs all reached into `HierarchicalRAG.query` directly, making the surface impossible to swap. The blueprint reviewed in [`.cursor/plans/alphaswarm_kb_boundary_d1617245.plan.md`](../../../.cursor/plans/alphaswarm_kb_boundary_d1617245.plan.md) + the parallel architecture report propose a Clean-Architecture knowledge-base boundary modelled on the established `alphaswarm_rl` / `alphaswarm_models` extraction pattern. 
**Decision**: Stand up two new repositories: - [`alphaswarm_kb/`](../../../alphaswarm_kb/) — the boundary package with a pure `domain/` core (ports + bi-temporal `PermissionedDataPoint` + DTOs), an `application/` layer (use cases + `KBRuntime` services), a fully-pluggable `infrastructure/` adapter trinity, and an extracted `rag/` + `memory/` slice that re-emits the legacy `alphaswarm.rag.*` + `alphaswarm.llm.memory` surface through `DeprecationWarning` shims. - [`alphaswarm_kb_federation/`](../../../alphaswarm_kb_federation/) — a standalone cross-silo marketplace federation reverse-proxy that brokers authorised recall via OpenFGA `check` + signed per-subscription share tokens + bi-temporal merge. The package introduces: 1. **Hash-locked `KBCorpusSpec` + `KBRuntime`** (rules 56-57) mirroring the existing `RLExperimentSpec` / `BotSpec` / `AnalysisSpec` pattern. Every `remember` / `recall` / `improve` / `forget` lands a `kb_runs` row + snapshots the spec via `persist_spec`. Alembic migration `0088_alphaswarm_kb_specs.py` creates the nine backing tables. 2. **`KBAdapterMeta` metaclass** (rule 58) for every concrete `IMemoryEngine`, `BaseVectorStore`, `BaseGraphStore`, `BaseRelationalStore`, `IACLEvaluator`, `IPolicyEngine`, and `IIdentityProvider`. Each subclass sets `kb_kind` + `kb_alias` and is auto-registered. 3. **Bi-temporal `PermissionedDataPoint`** combining Graphiti's four-timestamp model (`valid_from`/`valid_to`/`created_at`/`expired_at`) with Cognee's provenance envelope (`Provenance.dataset_id` + `Provenance.data_id` + `Provenance.extractor_chain`). 4. **Four-scope `KBLayerComposer`** (private > hierarchical > marketplace > global) with precedence-aware bi-temporal merge. 5. **Hybrid OpenFGA + OPA + Cedar policy stack** per the blueprint Section D. 
`DefaultPermissionResolver` fuses `IACLEvaluator.list_objects` (visible IDs) with `IPolicyEngine.partial_evaluate` (residual Cypher/SQL fragment) into a per-request `AccessBitmap` cached by `(tenant, principal, action, anchor_hash)` for 60s. 6. **`KBSiloTenancyStrategy`** (5th strategy alongside RLS / schema-per-tenant / db-per-enterprise / hybrid). Routes KB tables to a per-tenant Postgres + Qdrant + Neo4j stack provisioned via Terragrunt units under [`alphaswarm_platform/terragrunt/tenants/`](../../../alphaswarm_platform/terragrunt/tenants/). 7. **Agent-facing surface** through `data.kb.*` DataMCP tools (rule 59 extends rule 22) and `data.kb.compose_recall` for the layered surface. Cross-silo recall goes through `alphaswarm_kb_federation` only (rule 60). 8. **Controller integration**: `KBSiloService` + `/manage/kb/silos/*` routes on `alphaswarm_controller` (Phase M). Lifecycle actions land as `WorkloadRun` rows with `WorkloadAction.KB_SILO_{PROVISION,DESTROY,HALT,SCALE}`. **Consequences**: - The legacy `alphaswarm.rag.*` + `alphaswarm.llm.memory` import paths keep working through `DeprecationWarning` shims for one release cycle. New code imports from `alphaswarm_kb.rag.*` + `alphaswarm_kb.memory.*` directly. - Cognee / Graphiti / Mem0 / Letta / LlamaIndex live behind `pyproject.toml` extras; the base install stays light. A tenant who wants Cognee installs `pip install alphaswarm-kb[cognee]` and sets `KBCorpusSpec.memory_engine.kb_alias = "cognee"`. - The federation gateway is the only cross-silo write/read path outside the monolith. New tenant marketplaces, parent-org sharing, and global-corpus replication all funnel through it. - Terragrunt units replace the legacy Terraform workspaces pattern — each tenant has its own state file under `tenants//prod/terragrunt.hcl`. The `tenant_kb_silo` wrapper dispatches to one of three cloud-parallel siblings (`tenant_kb_silo_aws/azure/gcp`) which all expose identical outputs so Python adapters never branch on cloud. 
- Bi-temporal data is now first-class. Contradicted edges close `valid_to` instead of being deleted; `as_of` queries reconstruct historical state.
- Step-up MFA gates the destructive operations (`/kb/forget`, `/kb/halt`, `/manage/kb/silos/*` mutations, subscription create/revoke) per rule 52.

**Hard rule alignment**:

| Rule | Compliance |
| --- | --- |
| 2 (router_complete) | Every adapter that does LLM extraction (Graduated pipeline tier 3, Cognee, Mem0) routes through `router_complete`. |
| 3 (iceberg_catalog.append_arrow) | Gold-tier KB writes (`alphaswarm_gold_kb_*` namespaces) go through the canonical helper; `KBRuntime` never touches PyIceberg. |
| 4 (_progress.emit) | All `kb_tasks.py` wrappers use `emit` / `emit_done` / `emit_error`. WebSocket `/kb/.../recall/stream` preserves `{task_id, stage, message, timestamp, **extras}`. |
| 6 (immutable migrations) | `0088_alphaswarm_kb_specs.py` is immutable post-merge. |
| 22 (DataMCP boundary) | Agents read KB only through `data.kb.*` tools (extended by rule 59). |
| 26 (CredentialResolver) | OpenFGA token, NATS DSN, Postgres DSN, federation share-token signing key all resolve through `CredentialResolver`. |
| 27 (IdentityProvider) | `IIdentityProvider` is a thin bridge to `alphaswarm_core.auth.providers`. |
| 34 (experiment_id/test_id) | `kb_runs` carries both FKs; `KBRunRequest` propagates them via `RequestContext`. |
| 42 (TerraformRuntime) | `KBSiloService` invokes `TerraformRuntime`; the controller never shells out to `terraform`. |
| 45 (WorkloadRuntime) | New `WorkloadAction` enum members `KB_SILO_{PROVISION,DESTROY,HALT,SCALE}`. |
| 51 (TenancyStrategy) | `KBSiloTenancyStrategy` registers via `TenancyStrategyMeta`. |
| 52 (step-up MFA) | All destructive `/kb/*` + `/manage/kb/*` routes gate with `require_step_up()`. |
| 56-60 | New hard rules added in the same PR; described in the AGENTS.md. |

**Trade-offs**:

1. **Two new repositories** to maintain.
Mitigated by mirroring the established `alphaswarm_rl` / `alphaswarm_models` boundary pattern and shipping CI guards that prevent cross-boundary imports. 2. **OpenFGA + OPA + NATS** introduce three new infrastructure dependencies. Mitigated by shipping both Docker Compose (local) and Kubernetes (prod) manifests; each is a single Helm release with ExternalSecrets wiring. 3. **Bi-temporal data complicates schema migrations**. Mitigated by making `valid_to`/`expired_at` optional (None = "still valid") so existing rows migrate without a backfill. 4. **Terragrunt unit-per-tenant** scales linearly in state-file count. Mitigated by bounded-parallelism `run-all` automation under [`alphaswarm_platform/terragrunt/`](../../../alphaswarm_platform/terragrunt/) plus per-tenant cloud-account isolation for regulated tenants. 5. **Multiple memory engines coexisting** complicates the operator's mental model. Mitigated by `data.kb.health` exposing per-corpus engine info + the Vite `/knowledge-base/silos` route surfacing topology + spec hash per corpus. **Out of scope (Phase 6+)**: - Cedar formal-verification harness (`cedar-analysis`). - SpiceDB / Permify adapter implementations beyond stubs. - Multi-region active-active federation (vs the AWS-first → Azure → GCP staged rollout). - Tenant-configurable bi-temporal merge strategies (default: last-writer-wins per validity window + precedence tiebreaker). - Per-tenant bridge tier (shared compute / siloed databases) for SMB pricing. - Cognee `improve` / `forget` scheduling automation (manual triggers only in v1). 
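
The bi-temporal behaviour this ADR describes (contradicted edges close `valid_to` rather than being deleted, `None` means "still valid", and `as_of` queries reconstruct historical state) can be sketched as below. This is a minimal illustration, not the `PermissionedDataPoint` implementation: the `Edge` class, the helper names, and the half-open interval convention are assumptions.

```python
from dataclasses import dataclass, replace
from datetime import datetime, timezone
from typing import List, Optional


@dataclass(frozen=True)
class Edge:
    """Hypothetical bi-temporal edge carrying the four timestamps named in the ADR."""
    subject: str
    predicate: str
    obj: str
    valid_from: datetime                   # when the fact became true in the world
    created_at: datetime                   # when the system recorded it
    valid_to: Optional[datetime] = None    # None = still valid
    expired_at: Optional[datetime] = None  # None = record not superseded


def holds_at(edge: Edge, as_of: datetime) -> bool:
    """Valid-time check over the half-open interval [valid_from, valid_to)."""
    return edge.valid_from <= as_of and (edge.valid_to is None or as_of < edge.valid_to)


def recorded_at(edge: Edge, known_at: datetime) -> bool:
    """Transaction-time check: was this record live in the store at known_at?"""
    return edge.created_at <= known_at and (
        edge.expired_at is None or known_at < edge.expired_at
    )


def close_validity(edge: Edge, at: datetime) -> Edge:
    """Contradiction handling: return a copy with valid_to closed, never delete."""
    return replace(edge, valid_to=at)


def snapshot_as_of(
    edges: List[Edge], as_of: datetime, known_at: Optional[datetime] = None
) -> List[Edge]:
    """Reconstruct the edges that held at as_of, optionally as known at known_at."""
    return [
        e for e in edges
        if holds_at(e, as_of) and (known_at is None or recorded_at(e, known_at))
    ]
```

Because rows without `valid_to` or `expired_at` are treated as open-ended, pre-existing data satisfies both checks without a backfill, which matches the migration trade-off listed above.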
# ADR 015 — Runtime decomposition: cell-based modular monolith over domain microservices

> Evaluates breaking the alphaswarm monolithic runtime into distributed domain microservices and rejects that shape in favor of the platform's existing trajectory: a cell-based modular monolith with selective service extraction along the hash-locked runtime seams (control plane, worker/executor queues, MCP servers, KB boundary, bots operator), plus per-cell data planes for tenant isolation.

# ADR 015 — Runtime decomposition: cell-based modular monolith over domain microservices

- **Status**: Accepted (2026-06-10) — operator approval recorded during the KG-platform execution run (P0 gate review, T01-T03 complete; Track C repositories now exist and are scaffolded)
- **Authors**: Platform team
- **Related**: [ADR 004](004-provider-abstraction.md), [ADR 005](005-separated-control-plane.md), [ADR 006](006-quantbot-operator-pattern.md), [ADR 011](011-cdn-fronted-standalone-for-aqp-ui.md), [ADR 014](014-knowledge-base-boundary.md), [RESTRUCTURING_PLAN.md](https://github.com/Alpha-Swarm-ai/alphaswarm/blob/main/RESTRUCTURING_PLAN.md), [repository-split.md](../../concepts/platform/repository-split.md)

## Context

The `alphaswarm` monolith is the platform's largest deployment unit: one FastAPI process (`alphaswarm-core`) carrying ~112 route modules, the in-process MCP routers (`/mcp/data`, `/mcp/codebase`, `/mcp/ml`), and a Celery task surface of ~57 task modules, backed by 56 ORM model files and 89 Alembic migrations over a single Postgres, plus Redis in at least six distinct roles (broker, metadata cache, progress pub/sub, RAG vectors, ownership event stream, sandbox namespaces). The `Settings` singleton exposes ~637 knobs.

The question on the table: should the hosted platform break this runtime into a distributed **domain-microservices** architecture (agents-service, backtest-service, ml-service, data-service, each with its own datastore), or adopt a different decomposition?
### What is already decomposed

The platform is not a greenfield monolith. Substantial decomposition has shipped or is accepted:

| Seam | State | Where |
| --- | --- | --- |
| Control plane (`alphaswarm-cp`) | Shipped — standalone repo, image, `/manage/*` API; never imports `alphaswarm.*` | [ADR 005](005-separated-control-plane.md), `alphaswarm_controller/` |
| Shared kernel | Shipped — wire types, provider ABCs, `WorkloadRuntime` | `alphaswarm_core/` |
| Compute split | Shipped — `alphaswarm-worker` (queues `default,paper,terraform,ingestion,workflows`) vs `alphaswarm-executor` (`backtest,training,ml,agents,factors,rag`), independent HPAs on `alphaswarm_celery_queue_depth` | [worker-executor-images.md](../../concepts/infrastructure/worker-executor-images.md), `alphaswarm_platform/deployments/kubernetes/base/` |
| Frontends | Shipped — `alphaswarm_client`, `alphaswarm_ui`, `alphaswarm_admin`, `alphaswarm_ide`, all HTTP-only | [ADR 011](011-cdn-fronted-standalone-for-aqp-ui.md) |
| RL / ML boundary packages | Shipped — `alphaswarm_rl`, `alphaswarm_models` with deprecation shims; routers/tasks mounted from the external packages | [repository-split.md](../../concepts/platform/repository-split.md) |
| Bots | Boundary package + kopf operator + per-bot pods | [ADR 006](006-quantbot-operator-pattern.md), `alphaswarm_bots/` |
| Edge / cells | Envoy `alphaswarm-edge` + `alphaswarm-tenant-router` (ext_authz, rendezvous cell routing), cell registry in `topology.yaml`, per-cell overlays + ArgoCD ApplicationSet, `alphaswarm-cell-data-plane` Helm chart | [cell-router-cutover.md](../../how-to/cell-router-cutover.md), `alphaswarm_platform/tenant_router/` |
| KB boundary | Accepted (ADR-014) — `alphaswarm_kb` + `alphaswarm_kb_federation` designed; **repositories not yet created** | [ADR 014](014-knowledge-base-boundary.md) |

### Invariants any split must respect

Six hard-rule families make naive per-domain services with per-service databases actively harmful here:

1. **Single Postgres ledger.** Every runtime writes immutable `*_spec_versions` snapshots and `*_runs` ledger rows through `LedgerWriter`, which stamps `experiment_id`/`test_id` from `RequestContext` (rules 13, 15, 17, 24, 34, 41, 43, 57). Splitting the ledger per service destroys the cross-domain experiment umbrella and the audit/replay story.
2. **Single LLM gateway.** All LLM calls go through `router_complete` (rule 2); telemetry, cost caps, and the semantic cache depend on it.
3. **Single lakehouse write path.** All Iceberg writes go through `iceberg_catalog.append_arrow` with medallion validation (rules 3, 21, 46).
4. **DataMCP boundary.** Agents never read Postgres/Iceberg directly (rule 22) — the agent↔data seam is already a service-shaped API.
5. **Kill-switch fan-out.** The topbar kill switch fans out to 12+ halt endpoints with a p99 propagation SLO; every new long-running runtime must join the fan-out, and every process fragment multiplies the propagation surface (rules 40, 45, 52).
6. **Idempotent cross-task state in Postgres only** (rule 5) — Celery workers are already stateless and horizontally scalable; the "scaling" benefit of microservices largely exists today via queues.

## Options considered

### Option 1 — Classic domain microservices

Carve `alphaswarm-core` into independently deployed services (agents-svc, backtest-svc, analysis-svc, data-svc, trading-svc, …), each owning its own database and API, communicating via REST/gRPC and an event bus.

- Violates invariant 1 (ledger) and 6 unless every service still writes to the shared Postgres — at which point they are not microservices, just N processes sharing one schema and one Alembic chain (a distributed monolith).
- The hot coupling points (`LedgerWriter`, `router_complete`, `append_arrow`, metadata cache, progress bus) would become N× network hops with retry/outbox machinery the platform doesn't need.
- Kill-switch propagation and hash-locked replay would have to be re-engineered across service boundaries. - The throughput-bound work (backtests, training, agent runs) is **already** isolated in the executor fleet with queue-depth autoscaling; a backtest-service would duplicate that with more moving parts. ### Option 2 — Cell-based modular monolith with selective service extraction (recommended) Keep one logical application (`alphaswarm` runtime) but: 1. **Scale out by cell, not by domain.** A cell = one namespace running the core/worker/executor/beat quartet against a per-cell data plane (CNPG Postgres, Redis, MinIO, MLflow, Iceberg REST), routed by the Envoy edge + tenant router (rendezvous hashing on `tenant_id → cell_id`). Tiers map onto the existing `TenancyStrategy` lattice (`shared-std` → RLS, `shared-prem` → schema-per-tenant, `silo-reg` → database-per-enterprise). This is RESTRUCTURING_PLAN Phases 3 + 6, already partially provisioned (`cells:` registry in `topology.yaml`, cell overlays, ApplicationSet, `alphaswarm-cell-data-plane` chart). 2. **Extract services only along the seams that already have service-shaped contracts** — the hash-locked spec runtimes, the MCP HTTP surfaces, the control plane, and the operator pattern — and only when an extraction passes the Future Repo Split Gate in [repository-split.md](../../concepts/platform/repository-split.md). ### Option 3 — Status quo Keep the single `alphaswarm-core` Deployment and scale vertically. Rejected: noisy-neighbor risk across tenants, blast radius of one bad deploy is the whole fleet, and `silo-reg` compliance tenants cannot be served. ## Decision Adopt **Option 2**. Decomposition proceeds in three tracks, ordered by risk and by whether new repositories are required. 
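
As a rough illustration of the rendezvous hashing that Option 2 uses for `tenant_id → cell_id` routing, the sketch below scores every candidate cell against the tenant and picks the highest score. The function names, the choice of SHA-256, and the example cell identifiers are assumptions for illustration; the real tenant router may weight cells or restrict candidates by tier.

```python
import hashlib
from typing import Sequence


def _score(tenant_id: str, cell_id: str) -> int:
    """Deterministic pseudo-random weight for a (tenant, cell) pair."""
    digest = hashlib.sha256(f"{tenant_id}\x00{cell_id}".encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


def pick_cell(tenant_id: str, cell_ids: Sequence[str]) -> str:
    """Rendezvous (highest-random-weight) hashing: the top-scoring cell wins."""
    if not cell_ids:
        raise ValueError("at least one cell is required")
    # The cell_id tie-breaker keeps the result independent of input order.
    return max(cell_ids, key=lambda cell_id: (_score(tenant_id, cell_id), cell_id))
```

A useful property of this scheme is that removing a cell only remaps the tenants that were assigned to it; every other tenant keeps its highest-scoring cell, which limits data movement during cell drains.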
### Track A — Process/deployment splits of the existing images (no new repos)

These change `alphaswarm_platform/` manifests and entrypoints only; the code already supports them:

| # | Cut | Detail |
| --- | --- | --- |
| A1 | `alphaswarm-beat` as a first-class Deployment | Declared in `topology.yaml` and Terraform but missing from `deployments/kubernetes/base/`; promote it (replicas: 1, no HPA). |
| A2 | Standalone MCP server Deployments | Serve `/mcp/data`, `/mcp/codebase`, `/mcp/ml` from dedicated pods using the existing image with a scoped ASGI entrypoint. Topology already declares `alphaswarm-ml-mcp` as a separate service; RFC 9728/8707 audience binding (rule 49) already gives each MCP its own `aud`. Per-tenant MCP isolation then reuses the `alphaswarm-mcp-tenant` Helm chart (Phase 5). |
| A3 | Per-queue executor fleets | Split the executor Deployment into per-queue ScaledObjects (KEDA) for `backtest`, `training`/`ml`, `agents`/`rag` so GPU-class and CPU-class work scale independently. No code change — queue routing exists in `celery_app.py`. |
| A4 | `paper-trader` and `ingester-*` as first-class K8s units | They exist as compose targets (`paper`, `ingester` image stages); give them base manifests + HPAs like worker/executor. |
| A5 | Cell rollout | Execute Phase 3 (cell registry + router live, `RequestContext.cell_id` propagating) then Phase 6 (per-cell Postgres/MinIO/Redis/MLflow via dual-write migration, `ALPHASWARM_CELL_DUAL_WRITE`). |

### Track B — Deepen existing extractions (existing repos, invasive code changes)

| # | Cut | Detail |
| --- | --- | --- |
| B1 | Sidecar control plane as hosted default | Flip hosted deployments from `ALPHASWARM_MANAGEMENT_MODE=embedded` to `sidecar`; `alphaswarm-cp` already ships standalone. |
| B2 | Ledger/telemetry broker for `alphaswarm_rl` + `alphaswarm_models` | Today the extracted packages still import the monolith for `LedgerWriter`, `iceberg_catalog`, `_progress.emit`, ORM. Introduce a narrow HTTP/MCP ledger-write surface (mirroring the controller's `HttpAuditSink` → `/_internal/audit/terraform-runs` pattern) so RL/ML workers can run from their own images without importing monolith ORM. This is the gating work for ever running them as separate services. |
| B3 | Bots operator fleet | Continue the ADR 006 path: per-bot pods via `quantbot-bot` chart, latency-class scheduling (ADR 007), canary PnL gates (ADR 010). Bots are the one domain where per-workload processes are genuinely required (HFT node tiers). |
| B4 | CI boundary gates | Extend the `rg`-based forbidden-import gates from 2 to all 14+ subprojects (RESTRUCTURING_PLAN §2.1, §4.2) so extracted boundaries cannot silently re-couple. Prerequisite for everything above. |

### Track C — Extractions that require **new repositories** (permission gate)

Per the workspace's repo-per-boundary convention, these need new git repositories and therefore explicit approval before any work begins:

| # | Candidate repo | Justification | Status |
| --- | --- | --- | --- |
| C1 | `alphaswarm_kb` | ADR-014 (accepted) defines the KB boundary package — `KBRuntime`, hash-locked `KBCorpusSpec`, adapter trinity. Monolith already mounts its router conditionally and migration `0088_alphaswarm_kb_specs` shipped, but the repo does not exist in the multi-repo workspace. | Blocked on repo creation |
| C2 | `alphaswarm_kb_federation` | ADR-014's cross-silo federation gateway — standalone FastAPI, never imports `alphaswarm.*`; deployable today via `compose/docker-compose.kb.yml` patterns + Terragrunt silo modules. | Blocked on repo creation |
| C3 | (Optional, deferred) `alphaswarm_data` | A future data-plane service (ingestion, discovery, catalog) is the largest-blast-radius extraction; the RESTRUCTURING_PLAN sequences it last, after cells and per-tenant object storage. Not recommended now — listed to make the deferral explicit. | Deferred |

Existing placeholder repos `alphaswarm_research` ("Services for Research Plane") and `alphaswarm_learning` are available landing zones should the research/learning planes later split; no work is proposed for them in this ADR.

### Target topology

```mermaid
flowchart TB
  subgraph EDGE ["Edge"]
    CF[cloudflared] --> ENVOY[alphaswarm-edge Envoy]
    ENVOY -->|ext_authz| TR[alphaswarm-tenant-router]
  end
  ENVOY --> CP[alphaswarm-cp control plane]
  ENVOY --> CELL1
  ENVOY --> CELLN
  subgraph CELL1 ["cell-shared-std-N"]
    API1[alphaswarm-core] --> W1[worker] & X1[executor fleets per queue] & B1[beat]
    MCP1[mcp-data / mcp-codebase / mcp-ml]
    DP1[(per-cell Postgres / Redis / MinIO / MLflow / Iceberg REST)]
    API1 --> DP1
    W1 --> DP1
    X1 --> DP1
    MCP1 --> DP1
  end
  subgraph CELLN ["cell-silo-reg-N"]
    APIN[alphaswarm-core] --> DPN[(dedicated data plane)]
  end
  CP -->|WorkloadRuntime| CELL1
  CP -->|WorkloadRuntime| CELLN
  OP[bots-operator] --> BOTS[per-bot pods incl. HFT tier]
  KBF[alphaswarm-kb-federation] -.read-only cross-silo.-> CELLN
```

## Consequences

**Positive**

- Tenant isolation, blast-radius reduction, and independent scaling are achieved by cells + queues — the actual goals usually cited for microservices — without breaking the ledger, replay, kill-switch, or hash-lock invariants.
- Every extraction reuses a contract that already exists (spec runtimes, MCP audiences, `/manage/*`, operator CRDs), so no new RPC framework or saga/outbox machinery is invented. Linkerd arrives in Phase 4 for cell mTLS, not for inter-domain RPC.
- Track A is pure deployment work and reversible per unit.

**Negative / risks**

- Per-cell data planes multiply infra cost (mitigated by tiering: shared backplane for `shared-std`/`shared-prem`).
- Track B2's ledger broker adds an HTTP hop to RL/ML run bookkeeping; it must remain async/buffered to keep training loops unaffected.
- The dual-write migration window (Phase 6) is the single riskiest operation; the rollback path is the `ALPHASWARM_CELL_DUAL_WRITE` flag. - Until Track C repos exist, KB code paths remain conditional inside the monolith and ADR-014 stays partially unrealized. **Explicitly rejected** - Per-domain services with per-service databases (Option 1). - Extracting `router_complete` into an LLM-gateway service. - Splitting the Postgres ledger or the Alembic chain per service. - Moving Celery beat scheduling out of the single scheduler. ## Rollout order 1. Track B4 (CI boundary gates) and Track A1–A2 (beat + MCP pods). 2. Track A3–A4 (queue fleets, paper/ingester units) and B1 (sidecar CP). 3. Track A5 cells: Phase 3 (router + registry), then Phase 6 (data plane), per the stop conditions in RESTRUCTURING_PLAN §18.2. 4. Track B2 (ledger broker) — gate for any future out-of-monolith RL/ML workers. 5. Track C — only after repository approval. # PreprocessingSpec > Qlib's `DataHandlerLP` applies an ordered chain of `Processor` steps (rank-norm, z-score, min-max, outlier clipping, etc.) during training. At inference time the *same* chain must be re-applied — othe... # PreprocessingSpec A `PreprocessingSpec` is a tiny dataclass that travels with every trained model artifact. It remembers which processors were fit and in what order so inference code can replay the exact preprocessing chain on new data without ever reaching back into the training-time configuration. ## Why Qlib's `DataHandlerLP` applies an ordered chain of `Processor` steps (rank-norm, z-score, min-max, outlier clipping, etc.) during training. At inference time the *same* chain must be re-applied — otherwise the model is scored on data with a different distribution than it was trained on, which silently degrades live performance. Until now this was only tracked implicitly (the handler config was expected to be re-instantiated exactly).
`PreprocessingSpec` makes it explicit: the spec is serialised into the model pickle and reloaded when the model is served, backtested, or paper-traded.

## Shape

```python
from dataclasses import dataclass
from typing import Any


@dataclass
class PreprocessingSpec:
    processors_pickle: bytes      # fit state for exact replay
    processor_specs: list[dict]   # {class, module_path, kwargs}
    feature_columns: list[str]
    label_column: str | None
    handler_cfg: dict | None
    metadata: dict[str, Any]
```

## Training-side usage

```python
from alphaswarm.ml.processors import PreprocessingSpec
from alphaswarm.ml.handler import DataHandlerLP

handler = DataHandlerLP(
    instruments=[...],
    learn_processors=[...],
    infer_processors=[...],
)
handler.setup_data()

spec = PreprocessingSpec.from_processors(
    processors=handler.infer_processors,
    feature_columns=[...],
    label_column="label_5d",
    handler_cfg={"class": "DataHandlerLP", "module_path": "alphaswarm.ml.handler", "kwargs": {...}},
    metadata={"dataset_hash": "abc123", "fit_window": "2020-01-01..2023-12-31"},
)

# `model` and `dataset` are the estimator and dataset built earlier in the training script.
model.fit(dataset).with_preprocessing(spec)
model.to_pickle("models/alpha_v1.pkl")
```

## Inference-side usage

```python
from alphaswarm.ml.base import Model

model = Model.from_pickle("models/alpha_v1.pkl")
spec = model.preprocessing_spec

# `new_bars` is the incoming frame of raw bars.
# Replay the fitted chain (no re-fit) when a spec is attached;
# otherwise pass the data through unchanged so `df` is always defined.
df = spec.apply(new_bars) if spec is not None else new_bars
preds = model.predict(df)
```

## Serving-side usage

All three serving backends (`MLflowServe`, `RayServe`, `TorchServe`) know about `preprocessing_spec`. The TorchServe handler in [`alphaswarm/mlops/serving/torchserve.py`](../../alphaswarm/mlops/serving/torchserve.py) calls `spec.apply(df)` before `model.predict(df)` when the attribute is present.

# compliance/soc2-evidence-map

# SOC 2 Type II evidence map

Mapping from the SOC 2 Trust Services Criteria to the machine-readable evidence shipped by the AlphaSwarm overhaul. This map is the artifact the compliance team hands to the auditor.
It points at the source of truth for every control; the auditor can pull the evidence directly from S3 / CloudTrail / Postgres without manual collation. | Criterion | Control | Evidence source | Where in the repo | | --- | --- | --- | --- | | CC6.1 Logical access | All API auth via `IdentityProvider` | `security_audit_events` Postgres + S3 WORM mirror | `alphaswarm/tasks/audit_log_export_tasks.py` | | CC6.1 (cont.) | RBAC via `Membership` lattice | `Membership` rows + `expand_role` lattice | `alphaswarm_core/src/alphaswarm_core/auth/rbac.py` | | CC6.6 Step-up MFA | RFC 9470 step-up on every destructive admin route | `step_up_denied` rows in `security_audit_events` | `alphaswarm_admin/src/alphaswarm_admin/deps/stepup.py` | | CC6.7 Privileged access | Break-glass 4-eyes + 60min auto-expiry | `admin.break_glass.*` audit rows + Security Hub findings | `alphaswarm_admin/src/alphaswarm_admin/services/break_glass.py` | | CC6.8 Cryptography | TLS 1.3 ingress + Linkerd mTLS internal | ALB security policy `ELBSecurityPolicy-TLS13-1-2-2021-06`; Linkerd identity certs from ACM PCA | `infrastructure/modules/acm-certificates`, `infrastructure/modules/acm-pca` | | CC7.1 Detection | Falco DaemonSet + custom rules | Falco events shipped to Loki | `alphaswarm_platform/deployments/kubernetes/helm/falco/values.yaml` | | CC7.2 Monitoring | OpenTelemetry + Prometheus + Loki + Tempo | Per-env Grafana dashboards | `infrastructure/modules/observability-stack` | | CC7.3 Incident response | KillSwitch fan-out + halt audit rows | `admin.halt.all` rows | `alphaswarm_admin/src/alphaswarm_admin/api/routers/halt.py` | | CC7.5 Threat intel | Trivy + Grype on every image | Build-time SBOM + provenance | `.github/workflows/build-publish.yml`, `.github/actions/build-sign-push/` | | CC8.1 Change management | Hash-locked spec versions | `terraform_stack_spec_versions`, `agent_spec_versions`, `bot_versions`, `rl_experiment_versions`, `analysis_spec_versions`, `workflow_spec_versions` | per AGENTS rules 
13/15/17/24/41/43 | | CC8.1 (cont.) | Immutable Alembic migrations | `.hashes.lock` + `check_migration_immutability.py` | `scripts/ci/check_migration_immutability.py` | | CC9.1 Risk mitigation | SLSA L3 provenance + Cosign keyless | OCI attestations on every image | `.github/workflows/build-publish.yml` | | A1.2 Recovery procedures | DR replay runbook + Velero schedules | quarterly rehearsal log | `alphaswarm_docs/docs/operations/dr-replay.md` | | A1.3 Recovery validation | Cross-region S3 CRR + RDS read replica | Lifecycle policies + replication metrics | `infrastructure/envs/prod/main.tf` | | C1.2 Confidential information | S3 Object Lock + KMS CMK | `alphaswarm-audit-archive-*` bucket policies | `infrastructure/envs/shared-services/main.tf` | | C1.2 (cont.) | Step-up + RBAC on broker creds | `BrokerCredentialStore` priority 4 | `alphaswarm/credentials/stores/broker_credential_store.py` | | PI1.1 Processing integrity | Hash-chained `audit_log` table + Postgres trigger | trigger `enforce_audit_log_hash_chain` | `alembic/versions/0079_audit_log_hash_chain.py` | | P1.1 Privacy notice | n/a (B2B platform; no PII) | n/a | n/a | | P3.1 Information collection | OIDC scopes + `https://alphaswarm.internal/resources` claim | Auth0 + Entra Action sources | `alphaswarm/auth/providers/` | ## Type II evidence collection cadence | Cadence | Activity | | --- | --- | | Continuous | CloudTrail Org Trail, Config aggregator, GuardDuty, Security Hub findings; all S3 WORM-mirrored with 7-year retention | | Daily | Audit log export to WORM bucket (Celery beat 02:00 UTC) | | Weekly | Renovate dependency updates merged to dev; SBOM diff review | | Monthly | Access review (operator-driven via `/admin/rbac` UI) | | Quarterly | DR rehearsal per `dr-replay.md`; tabletop incident exercise | | Annual | SOC 2 Type II audit window (12-month observation) | ## Operator hand-offs The platform team owns the controls; compliance owns the evidence collation + auditor liaison. 
The handoff is via the `#alphaswarm-compliance` Slack channel + the SOC 2 dashboard in Grafana (panels driven by Prometheus queries against `security_audit_events`). # Agentic development for AlphaSwarm > The agentic-coder research literature talks about "skill artifacts", "skill graphs", "Memento-skills", "auditable execution trails", and "MCP control planes" as if they were novel patterns to invent. ... # Agentic development for AlphaSwarm > The single doc that connects AlphaSwarm's existing primitives to the broader "agentic-coder" vocabulary, plus the consolidated security manifesto. Doc map: [alphaswarm_docs/index.md](../../intro/index.md) · Workflow: [../WORKFLOW.md](https://github.com/julianwileymac/alphaswarm/blob/main/WORKFLOW.md) · Hard rules: [../AGENTS.md](https://github.com/julianwileymac/alphaswarm/blob/main/AGENTS.md). ## What this doc is for The agentic-coder research literature talks about "skill artifacts", "skill graphs", "Memento-skills", "auditable execution trails", and "MCP control planes" as if they were novel patterns to invent. AlphaSwarm already implements every one of them — under different names, with stronger invariants, and with ledger-backed audit chains. This doc makes that mapping explicit so you don't waste time inventing a parallel "skill" surface alongside the spec runtimes that already exist. The doc has three sections: 1. **AlphaSwarm's spec-pattern is the skill-artifact pattern.** The spec-runtime architecture (Agent / Bot / RL / Analysis / Workflow / Terraform) is the skill-graph + Memento-skill equivalent. This section also covers where AlphaSwarm deliberately diverges from research recommendations. 2. **Working with Cursor agents in AlphaSwarm.** Static channel + dynamic channel + plan-mode vs agent-mode usage. 3. **The ADLC security manifesto.** Consolidated for the first time. ## 1.
The spec-pattern is the skill-artifact pattern ### The five spec runtimes | Spec | Runtime | Versions table | Canonical doc | | --- | --- | --- | --- | | [`AgentSpec`](../alphaswarm/agents/spec.py) | [`AgentRuntime`](../alphaswarm/agents/runtime.py) | `agent_spec_versions` | [agents.md](../../concepts/agentic/agents.md) | | [`BotSpec`](../alphaswarm_bots/spec.py) | [`BotRuntime`](../alphaswarm_bots/runtime.py) | `bot_versions` | [bots.md](../../concepts/agentic/bots.md) | | [`RLExperimentSpec`](../alphaswarm/rl/spec.py) | [`RLRuntime`](../alphaswarm/rl/runtime.py) | `rl_experiment_versions` | [rl-framework.md](../../concepts/rl/rl-framework.md) | | [`AnalysisSpec`](../alphaswarm/analysis/spec.py) | [`AnalysisRuntime`](../alphaswarm/analysis/runtime.py) | `analysis_spec_versions` | [analysis-framework.md](../../concepts/strategy/analysis-framework.md) | | [`WorkflowSpec`](../alphaswarm/agents/orchestration/spec.py) | [`WorkflowRuntime`](../alphaswarm/agents/orchestration/runtime.py) | `workflow_spec_versions` | [workflow-studio.md](../../concepts/agentic/workflow-studio.md) | `WorkflowSpec` (Phase 5 of the additive orchestration refactor) sits **above** the four classic runtimes: it composes them through the [`OrchestrationAdapter`](../alphaswarm/agents/orchestration/base.py) registry. A workflow can wrap an existing `AgentRuntime` invocation (via the `LangGraphAdapter` / `CrewProcessAdapter` / `DialecticalDebateAdapter`) or chain deterministic fusion + risk-overlay execution (via `SignalFusionAdapter` + `WeightCentricExecutionAdapter`). All five runtimes share the same hash-locked + immutable + ledger-backed semantics described below. Each is: - **Declarative** — a Pydantic model with strict types. - **Hash-locked** — the SHA-256 of the canonical-JSON-serialized spec is the version key. - **Auto-versioned** — first run snapshots a row in the `*_versions` table; behaviour changes produce new rows; old rows are immutable. 
- **Ledger-backed** — every run records `spec_version_id` so the exact run can be deterministically replayed against historical data. - **Discoverable** — the registry pattern (built-ins + YAML auto-loading) means new specs come online without touching the runtime. ### Mapping to research vocabulary The 2024–2026 agentic-coder literature uses several overlapping terms. Here's how each lands on AlphaSwarm's primitives: | Research term | AlphaSwarm equivalent | Notes | | --- | --- | --- | | "Skill artifact" | One row in a `*_versions` table | The artifact has a semantic interface (the Pydantic spec), preconditions (the spec's input schema), an executable payload (the runtime invocation), and deterministic postconditions (the run row + Iceberg outputs). | | "Skill graph" | The full registry across the active spec runtimes | Each runtime hosts one graph; `BotSpec` references `AgentSpec`s, `RLExperimentSpec` references data pipelines, `AnalysisSpec` references flows, and orchestration/deployment specs compose the runtime graph at higher levels. | | "Auditable execution trail" | `*_runs` ledger rows + Iceberg outputs + per-step result tables | E.g. `analysis_runs` + `analysis_step_results` + `alphaswarm_gold_analysis_` | | "MCP control plane" | The DataMCPTool catalog | One catalog, three transports (in-process bridge + FastAPI router + stdio binary). See [data-mcp.md](../../concepts/data/data-mcp.md). | | "Memento-skill / continual learning" | Re-snapshot on change | When a spec changes, `persist_spec` inserts a **new** version row — old versions stay for replay. The "memory" is the immutable history. | | "Verifiable rewards" | The `*_runs` ledger + cost caps + guardrails on the runtime | Telemetry covers cost, latency, and outcome metrics. | ### Where AlphaSwarm **deliberately diverges** The research recommends some patterns that AlphaSwarm rejects on purpose: 1. **"Rewrite the skill on failure" / self-modifying skills.** The research literature (e.g.
*new framework lets AI agents rewrite their own skills without retraining*) advocates patching a failing skill in-place. **AlphaSwarm forbids this.** Reasons: - Auditability — every behaviour change must be a new hash-locked version row, not an in-place mutation. - Replay — runs reference `spec_version_id` for replay; mutating the spec breaks the replay invariant. - Compliance — financial systems need an append-only audit trail. - Risk — a self-mutating spec next to live capital is a non-starter. The right pattern in AlphaSwarm: when a spec fails, author a new spec version (manually or via tooling), snapshot it, switch traffic. The previous version remains for forensics. 2. **"Skill graph self-improvement loops"** that mutate skill metadata across runs. AlphaSwarm's metadata is owned by the active metadata layer ([`alphaswarm.data.catalog.register_dataset`](../alphaswarm/data/catalog/active_metadata.py)) and updated through explicit upserts — never as a side effect of a run. 3. **"Free-form SQL tools for agents"** to "let the model figure it out". AlphaSwarm requires every read to go through a registered `DataMCPTool` with a strict args schema and policy check. See [data-mcp.md](../../concepts/data/data-mcp.md) and the [data-mcp.mdc](../.cursor/rules/data-mcp.mdc) Cursor rule. 4. **"Auto-update implementation when intent changes"** (intent-driven development with bidirectional updates). AlphaSwarm's docs are updated in the same PR that touches the code, by humans or under explicit human review. Drift detection is welcome; automatic mutation is not. ### Adding a new spec — the canonical flow 1. Pick the right runtime by the question being answered: - "What should an LLM-driven agent do?" → `AgentSpec` - "What should a deployable bot (universe + strategy + risk + ML + agents + RAG) do?" → `BotSpec` - "What should an RL experiment train / evaluate?" → `RLExperimentSpec` - "What statistical / numerical analysis flow should run on a dataset?" → `AnalysisSpec` 2. 
Author the YAML or programmatic Pydantic instance. 3. Call the right `persist_spec(...)` (or let the registry do it on first lookup). 4. Run via the runtime — the first run snapshots a `*_versions` row. 5. The run row records `spec_version_id` and emits progress through [`alphaswarm/tasks/_progress.py`](../alphaswarm/tasks/_progress.py). If you find yourself wanting to "add a new skill artifact" outside this pattern — stop, read this section again, pick the right spec runtime. ## 2. Working with Cursor agents in AlphaSwarm ### The two-channel context strategy AlphaSwarm follows the static / dynamic context bifurcation pattern that Anthropic's Cursor integration recommends: - **Static channel** — what doesn't change between sessions: - [AGENTS.md](https://github.com/julianwileymac/alphaswarm/blob/main/AGENTS.md) — 45 hard rules - [.cursor/rules/](../.cursor/rules) — glob-scoped rule files - [alphaswarm_docs/](../docs) — narrative architecture - **Dynamic channel** — what changes session-to-session: - DataMCPTool catalog (live database schemas, dataset lineage, entity catalog) - The `agent_runs_v2` / `bot_deployments` / `rl_runs` / `analysis_runs` ledger rows - The Cursor environment's recently-edited / open files / terminal state The Cursor agent should treat the static channel as authoritative for **rules and architecture**, and the dynamic channel as authoritative for **live state** (don't guess a table schema — query the MCP catalog). ### Plan mode vs agent mode | Mode | When | Restrictions | | --- | --- | --- | | **Plan mode** | Complex / ambiguous tasks, architectural decisions, large refactors, anything with > 1 valid implementation | Read-only — cannot edit files | | **Agent mode** | Single clear task, post-plan implementation, debugging once root cause is known | Full tool access | | **Background mode** | Long-running tasks (Docker stack rebuild, full test suite, training runs) | Runs in parallel; non-blocking | | **Ask mode** | "How does X work?" 
/ read-only exploration | Cannot edit; can search | The [../WORKFLOW.md](https://github.com/julianwileymac/alphaswarm/blob/main/WORKFLOW.md) document has the full plan→act→reflect cadence including FAST vs SLOW velocity calibration and intervention nodes. ### Reading the agent's plan output as a structured spec When Cursor's plan mode produces a `.cursor/plans/*.plan.md` file, treat it like a `*Spec` artifact: the human reviews and approves it, and the agent then executes the plan one task at a time, updating todos as it goes. The plan file is the contract. ## 3. ADLC security manifesto The Agentic Development Life Cycle (ADLC) framing says: as agentic autonomy expands, the security posture must scale with it. AlphaSwarm already enforces several layers; this section consolidates them in one place so you can audit the surface in one read. ### Layer 1 — Kill-switch (ultimate human override) - Code: [alphaswarm/risk/kill_switch.py](../alphaswarm/risk/kill_switch.py), [alphaswarm/risk/manager.py](../alphaswarm/risk/manager.py) - Wired endpoint today: `POST /portfolio/kill_switch` in [alphaswarm/api/routes/portfolio.py](../alphaswarm/api/routes/portfolio.py) - Frontend topbar component: [alphaswarm_client/src/components/common/KillSwitch.tsx](../alphaswarm_client/src/components/common/KillSwitch.tsx) - Design contract for per-runtime fan-out — `/agents/halt`, `/paper/stop-all`, `/bots/halt-all`, `/rl/halt-all` — see [frontend.mdc](../.cursor/rules/frontend.mdc) (wire as the endpoints come online; add them to `KillSwitch` in the same PR). - All paper sessions halt within one heartbeat and cancel open orders. The Meta-Agent can flip the switch; an operator can flip it; the agent is never allowed to flip it without explicit human acknowledgement (per [WORKFLOW.md](https://github.com/julianwileymac/alphaswarm/blob/main/WORKFLOW.md#intervention-nodes)).
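The override rule in the last bullet can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the code in `alphaswarm/risk/kill_switch.py`: the `KillSwitch` class shape, the `Actor` enum, the `acknowledged_by` parameter, and the `heartbeat` helper are all hypothetical names chosen for this example.

```python
from __future__ import annotations

import threading
from enum import Enum


class Actor(Enum):
    """Who is asking to flip the switch (hypothetical categories)."""
    OPERATOR = "operator"
    META_AGENT = "meta_agent"
    AGENT = "agent"


class KillSwitch:
    """Illustrative switch: once tripped, every session sees it on its next heartbeat."""

    def __init__(self) -> None:
        self._tripped = threading.Event()
        self.audit: list[tuple[str, str | None]] = []

    def trip(self, actor: Actor, acknowledged_by: str | None = None) -> bool:
        # An ordinary agent needs an explicit human acknowledgement;
        # operators and the Meta-Agent do not.
        if actor is Actor.AGENT and not acknowledged_by:
            return False
        self._tripped.set()
        self.audit.append((actor.value, acknowledged_by))
        return True

    def should_halt(self) -> bool:
        return self._tripped.is_set()


def heartbeat(switch: KillSwitch, open_orders: list[str]) -> list[str]:
    """One heartbeat of a paper session: cancel all open orders if halted."""
    if not switch.should_halt():
        return []
    cancelled = list(open_orders)
    open_orders.clear()
    return cancelled
```

A `threading.Event` is used here only because it gives a cheap, thread-safe flag that many sessions can poll; the real implementation may persist the flag or fan it out differently.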
### Layer 2 — Immutable spec versions (audit trail) - `agent_spec_versions`, `bot_versions`, `rl_experiment_versions`, `analysis_spec_versions` are append-only. - Each spec is hash-locked (SHA-256 of canonical JSON). - Every run records `spec_version_id` for replay. - This guarantees: **every behaviour change has a permanent record** identifying who introduced it (via the commit) and what the spec looked like at that moment. ### Layer 3 — DataMCPTool boundary (no direct catalog reads) - Agents MUST NOT `import alphaswarm.persistence.models...` or call `iceberg_catalog` / `duckdb_provider` directly inside their body. - All reads go through registered `DataMCPTool`s, exposed via in-process bridge + FastAPI `/mcp/data` router + `alphaswarm-data-mcp` stdio binary. - See [data-mcp.md](../../concepts/data/data-mcp.md) and [data-mcp.mdc](../.cursor/rules/data-mcp.mdc). ### Layer 4 — Single LLM entry-point (router_complete) - All LLM calls go through [`router_complete`](../alphaswarm/llm/providers/router.py). - No direct `litellm.completion` / `OllamaClient` / vendor SDKs. - The router enforces tier policies, cost caps, and provider fallback. Bypassing it strips those guardrails. ### Layer 5 — Single Iceberg entry-point + medallion enforcement - All writes go through [`iceberg_catalog.append_arrow`](../alphaswarm/data/iceberg_catalog.py) / `create_or_replace_table`. - The wrapper validates that the namespace prefix matches the declared `medallion_layer` (`bronze` / `silver` / `gold`). - `BusinessMetadata` is mandatory on first write — agents query this surface to know what a dataset is for. - See [data-layer-unification.md](../../concepts/data/data-layer-unification.md) and [iceberg.mdc](../.cursor/rules/iceberg.mdc). ### Layer 6 — Secrets and configuration - Configuration through [`alphaswarm.config.settings`](../alphaswarm/config/__init__.py) only — never construct a fresh `Settings()`, never read `os.environ` directly. 
- New env vars are `ALPHASWARM_*`-prefixed fields on the `Settings` class in [alphaswarm/config/settings.py](../alphaswarm/config/settings.py) and added to [.env.example](../.env.example). - Credentials use the helpers in [alphaswarm/utils/keys.py](../alphaswarm/utils/keys.py); never paste them into `.env` outside what's already in `.env.example`. ### Layer 7 — Migration immutability - See [migrations-persistence.mdc](../.cursor/rules/migrations-persistence.mdc). - Shipped migrations are never edited. Schema bugs are fixed forward, never backward. ### Layer 8 — Pre-merge checklist (human-driven) The checklist in [CONTRIBUTING.md](https://github.com/julianwileymac/alphaswarm/blob/main/CONTRIBUTING.md) is the last line of defence: - Tests pass locally - Docs updated (data-dictionary, ERD, glossary) - New env vars in `.env.example` - New deps in `pyproject.toml` - Migration applied + reviewed (autogenerate footguns checked) - For SLOW-mode work: TDD-loop followed (see [WORKFLOW.md](https://github.com/julianwileymac/alphaswarm/blob/main/WORKFLOW.md)) ### Recommended (not yet enforced) — red-team review For any new `AgentSpec` that gains broker-API or live-trading tools, run a red-team review before promoting from paper to live: - Adversarial prompt simulation - Boundary-violation tests (does the agent try to escape its tool catalog?) - Cost-cap stress (does it loop?) - Margin / risk-limit interaction (does the spec respect [alphaswarm/risk/](../alphaswarm/risk/) constraints?) Today this is documentation, not automation. Future work: a `POST /agents/red-team-review` task that takes an `AgentSpec` and runs a fixed adversarial battery against it before promotion. ## When in doubt 1. Read [../AGENTS.md](https://github.com/julianwileymac/alphaswarm/blob/main/AGENTS.md) — the canonical 45 rules. 2. Read [../WORKFLOW.md](https://github.com/julianwileymac/alphaswarm/blob/main/WORKFLOW.md) — the cadence. 3. 
Read [multi-agent-patterns.md](../../concepts/agentic/multi-agent-patterns.md) — when you're scaling the agent topology. 4. Read [glossary.md](../../intro/glossary.md) — for terminology. 5. Search the code: `rg "" alphaswarm/`. # Agentic pipeline > End-to-end walkthrough of the AlphaSwarm agentic-trading lifecycle: pick models, register data, snapshot specs, dispatch via WorkflowRuntime, review through MCP-bridged agent surfaces. # Agentic pipeline > Doc map: [intro](../../intro/index.md) · Sequence diagrams: [flows](../platform/flows.md#3-agentic-crew-run) · Spec-pattern primer: [agentic-development](./agentic-development.md) · Multi-agent topologies: [multi-agent-patterns](./multi-agent-patterns.md) · Orchestration adapters: [workflow-studio](./workflow-studio.md) · Worked tutorial: [tutorials/first-agent-workflow](../../tutorials/first-agent-workflow.md). This page walks through the AlphaSwarm agentic-trading lifecycle: pick a model, register a data source, snapshot the spec, dispatch through the workflow runtime, and review the run. Every action has a REST + CLI surface so you can script the same flow; every action also has an `alphaswarm_client` (Vite UI) route at `alpha-swarm.ai` so a human can drive it. The pipeline has **five stages**. The new stage since the prior version of this doc is **Spec snapshot** — every spec-driven run now hash-locks into an immutable `*_spec_versions` row before any work happens. ```mermaid flowchart LR subgraph llmStage [1. Models and providers] Pull["Ollama pull"] Vllm["vLLM profile up"] Sera["SERA-32B opt-in"] Defaults["router_complete defaults"] end subgraph dataStage [2. Data sources] Discovery["DiscoveryService"] Inspector["Parquet / Iceberg inspector"] AirbyteBuilder["Airbyte builder + userland Fetcher"] Sandbox["Dagster sandbox (ephemeral)"] end subgraph snapshotStage [3.
Spec snapshot] AgentSpec["AgentSpec / BotSpec"] WfSpec["WorkflowSpec"] Hash["SHA-256 hash"] Versions["*_spec_versions row"] end subgraph dispatchStage [4. Workflow dispatch] WfRuntime["WorkflowRuntime"] AgentRt["AgentRuntime"] BotRt["BotRuntime"] RlRt["RLRuntime"] Adapters["7 orchestration adapters"] end subgraph reviewStage [5. Review] WS["WebSocket /chat/stream"] Ledger["agent_runs_v2 + workflow_runs"] Inkeep["Inkeep AI assistant (in-product)"] Mcp["docs MCP server"] end llmStage --> snapshotStage dataStage --> snapshotStage snapshotStage --> Hash --> Versions Versions --> dispatchStage WfRuntime --> AgentRt WfRuntime --> BotRt WfRuntime --> RlRt WfRuntime --> Adapters dispatchStage --> reviewStage ``` ## 1 — Models and providers Open [`/models`](https://alpha-swarm.ai/models) in the operator UI (`alphaswarm_client`). The page lives at [alphaswarm_client/src/routes/models/](https://github.com/julianwileymac/alphaswarm/tree/main/alphaswarm_client/src/routes/models) and exposes three tabs: - **Ollama (host)** — type a model tag in *Pull a model* (e.g. `nemotron`, `llama3.2`, `qwen2:7b`) and click **Pull**. A Celery task streams progress over the canonical `/chat/stream/{task_id}` envelope so the page shows a real-time download bar. - **vLLM** — every YAML under [`configs/llm/`](https://github.com/julianwileymac/alphaswarm/tree/main/configs/llm) becomes a profile card showing compose status, served models, and `Start` / `Stop` buttons. Starting a profile auto-saves its `base_url` as the active vLLM endpoint. - **SERA-32B** — opt-in Ai2 Open Coding model for the codebase MCP elaborator (see [sera](../data/sera.md)). Configure `ALPHASWARM_SERA_ENABLED=true` + `ALPHASWARM_SERA_ENDPOINT` in your env. Every model call routes through [`router_complete`](https://github.com/julianwileymac/alphaswarm/blob/main/alphaswarm/llm/providers/router.py) (AGENTS rule 2). 
Provider selection is declared in `AgentSpec.model`; the runtime drives the call — never call `router_complete` directly from inside an agent body (AGENTS rule 12). REST equivalents (each returns `TaskAccepted` for streaming endpoints): ```bash curl -X POST localhost:8000/agentic/models/pull \ -H 'content-type: application/json' \ -d '{"name":"llama3.2"}' curl -X DELETE localhost:8000/agentic/models/llama3.2 curl -X GET localhost:8000/agentic/models/running curl -X GET localhost:8000/agentic/vllm/profiles curl -X POST localhost:8000/agentic/vllm/start \ -H 'content-type: application/json' \ -d '{"profile":"vllm_nemotron"}' ``` ## 2 — Data sources Open [`/data/hub`](https://alpha-swarm.ai/data/hub) in the operator UI. This is the active replacement for the legacy Solara explorer pages. The Hub exposes the four data-plane tiers (see [data-plane](../data/data-plane.md)): - **Discovery browser** — unified ingested / pending / orphan / external_only entries; filter chips drive the [`DiscoveryService`](https://github.com/julianwileymac/alphaswarm/blob/main/alphaswarm/data/discovery/service.py). - **Iceberg Editor** — namespace browser + parquet preview + column profiling. - **Airbyte builder** — schema-driven connector editor at [alphaswarm_client/src/components/airbyte/builder/](https://github.com/julianwileymac/alphaswarm/tree/main/alphaswarm_client/src/components/airbyte/builder). Emits either Airbyte YAML or an AlphaSwarm-native `Fetcher` stub. No free-text credential fields — every secret resolves through `` (AGENTS rule 31). - **Dagster sandbox** — ephemeral per-session Dagster + Airbyte environment (AGENTS rule 32). 
REST surface: ```bash curl -X GET http://localhost:8000/discovery/entries curl -X POST http://localhost:8000/sources/alpha_vantage/probe curl -X POST http://localhost:8000/discovery/entries//promote curl -X POST http://localhost:8000/dagster/sandbox/sessions ``` Or invoke the data MCP tools directly: ```bash curl -X POST http://localhost:8000/mcp/data/tools/data.discovery.browse/invoke \ -H "Content-Type: application/json" \ -H "Authorization: Bearer $(alphaswarm-cli auth token)" \ -d '{"namespace_prefix":"alphaswarm_silver_yfinance"}' ``` ## 3 — Spec snapshot Every spec-driven run hash-locks the spec into a `*_spec_versions` row before any work happens. The same content always returns the same `version_id`; any field change creates a new row; old rows stay forever for replay. This is the invariant that makes the entire agentic pipeline auditable. ```mermaid sequenceDiagram actor Author participant API as FastAPI participant Runtime as AgentRuntime / BotRuntime / RLRuntime / WorkflowRuntime participant Versions as *_spec_versions participant Hash as SHA-256 Author->>API: POST /agents/specs (YAML body) API->>Hash: compute SHA-256 of canonical JSON Hash-->>API: spec_hash API->>Versions: SELECT id WHERE spec_hash = ? alt existing row Versions-->>API: existing version_id else new row API->>Versions: INSERT (spec_hash, spec_json, ...) Versions-->>API: new version_id end API-->>Author: { spec_id, version_id, spec_hash } Note over Versions: Row is immutable. Re-posting identical content returns the same id.
``` Five hash-locked spec types ship today: | Spec | Runtime | Versions table | AGENTS rule | | --- | --- | --- | --- | | `AgentSpec` | `AgentRuntime` | `agent_spec_versions` | 12-13 | | `BotSpec` | `BotRuntime` | `bot_versions` | 14-15 | | `RLExperimentSpec` | `RLRuntime` | `rl_experiment_versions` | 16-17 | | `AnalysisSpec` | `AnalysisRuntime` | `analysis_spec_versions` | 23-24 | | `WorkflowSpec` | `WorkflowRuntime` | `workflow_spec_versions` | 40-41 | Plus two additive ones from the management engine: | Spec | Runtime | Versions table | AGENTS rule | | --- | --- | --- | --- | | `TerraformStackSpec` | `TerraformRuntime` | `terraform_stack_spec_versions` | 42-43 | | (workload ops) | `WorkloadRuntime` | `workload_runs` (write-only ledger) | 45 | REST: ```bash # AgentSpec curl -X POST http://localhost:8000/agents/specs \ -H "Content-Type: application/json" \ -d @configs/agents/research_lite.yaml # WorkflowSpec curl -X POST http://localhost:8000/workflows/specs \ -H "Content-Type: application/json" \ -d @configs/workflows/my-research-loop.yaml ``` ## 4 — Workflow dispatch `WorkflowRuntime` is the additive control plane that composes every spec runtime into multi-node DAGs. 
It ships with seven `OrchestrationAdapter` kinds (AGENTS rule 40): - **graph** — LangGraph state machine - **crew** — CrewAI manager-pattern crew - **debate** — bounded debate with N participants - **fusion** — fan-out / fan-in - **execution** — wraps an `RLRuntime` / `BotRuntime` / `AnalysisRuntime` as a single node - **schedule** — Cron-triggered, idempotent - **studio** — Operator-driven UI wiring at [`/workflows`](https://alpha-swarm.ai/workflows) ```mermaid flowchart TB Spec[WorkflowSpec] --> Runtime[WorkflowRuntime] Runtime --> AdapterRegistry["OrchestrationAdapterMeta registry"] AdapterRegistry --> A1[graph] AdapterRegistry --> A2[crew] AdapterRegistry --> A3[debate] AdapterRegistry --> A4[fusion] AdapterRegistry --> A5[execution] AdapterRegistry --> A6[schedule] AdapterRegistry --> A7[studio] A1 --> AgentRt[AgentRuntime] A2 --> AgentRt A3 --> AgentRt A4 --> AgentRt A5 --> RlRt[RLRuntime] A5 --> BotRt[BotRuntime] A5 --> AnaRt[AnalysisRuntime] Runtime --> Halt[should_halt check] Runtime --> Cost[cost cap check] Runtime --> Ledger[workflow_runs + agent_runs_v2] ``` Dispatch: ```bash curl -X POST http://localhost:8000/workflows//run \ -H "Content-Type: application/json" \ -d '{"inputs": {...}}' ``` The runtime: 1. Re-hash-locks every referenced spec (idempotent). 2. Opens a `workflow_runs` row with `status=pending`. 3. Builds the adapter DAG. 4. Walks nodes; for each, opens an `agent_runs_v2` row and delegates to the relevant runtime. 5. Emits canonical progress frames at every transition. 6. Calls `should_halt()` before every step — the topbar [KillSwitch](https://github.com/julianwileymac/alphaswarm/blob/main/alphaswarm_client/src/components/common/KillSwitch.tsx) reaches every node within ~250ms. 7. Enforces `cost_caps` (`per_node_max_tokens`, `per_run_max_usd`) per AGENTS rule 12. 
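The seven runtime steps above can be sketched as a small loop. This is a simplified illustration, not the `WorkflowRuntime` implementation: `CostCaps`, `RunLedger`, `walk_nodes`, and the `workflow.halted` / `workflow.failed` stage names are hypothetical, and each node is reduced to a callable that reports the tokens and USD it spent.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Tuple


@dataclass
class CostCaps:
    per_node_max_tokens: int
    per_run_max_usd: float


@dataclass
class RunLedger:
    # Stand-in for a workflow_runs row; the real schema is not shown here.
    status: str = "pending"
    frames: List[str] = field(default_factory=list)
    total_cost_usd: float = 0.0


# A node is (name, callable); the callable returns (tokens_used, cost_usd).
Node = Tuple[str, Callable[[], Tuple[int, float]]]


def walk_nodes(nodes: List[Node], caps: CostCaps,
               should_halt: Callable[[], bool]) -> RunLedger:
    ledger = RunLedger()
    ledger.status = "running"
    ledger.frames.append("workflow.started")
    for name, run_node in nodes:
        # The kill switch is consulted before every step.
        if should_halt():
            ledger.status = "halted"
            ledger.frames.append("workflow.halted")  # assumed stage name
            return ledger
        ledger.frames.append(f"node.{name}.started")
        tokens, cost_usd = run_node()
        ledger.total_cost_usd += cost_usd
        # Enforce the per-node token cap and the per-run USD cap.
        if (tokens > caps.per_node_max_tokens
                or ledger.total_cost_usd > caps.per_run_max_usd):
            ledger.status = "failed"
            ledger.frames.append("workflow.failed")  # assumed stage name
            return ledger
        ledger.frames.append(f"node.{name}.completed")
    ledger.status = "completed"
    ledger.frames.append("workflow.completed")
    return ledger


caps = CostCaps(per_node_max_tokens=1000, per_run_max_usd=1.0)
ledger = walk_nodes([("research", lambda: (120, 0.05))], caps,
                    should_halt=lambda: False)
print(ledger.status, ledger.frames)
```

Checking `should_halt()` at the top of the loop, rather than inside each node, is what lets a single kill-switch flag stop the run between any two steps.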
Replay: ```bash curl -X POST http://localhost:8000/workflows/runs//replay ``` Replay reuses the same `workflow_spec_versions` row + every referenced `*_spec_versions` row; a new `workflow_runs` row lands with a `parent_run_id` pointer. ## 5 — Review Three review surfaces, each consuming the same canonical ledger: ### WebSocket stream The frame envelope is `{task_id, stage, message, timestamp, **extras}` per AGENTS rule 4; extras are spread into the top level of the frame rather than nested under an `extras` key. Subscribe from any client: ```javascript const ws = new WebSocket(`ws://localhost:8000/chat/stream/${task_id}`); ws.onmessage = (e) => { const { stage, message, ...rest } = JSON.parse(e.data); console.log(stage, message, rest); }; ``` ### `agent_runs_v2` + `workflow_runs` ledger Agent-safe reads via DataMCP: ```bash curl -X POST http://localhost:8000/mcp/data/tools/data.workflows.describe/invoke \ -H "Content-Type: application/json" \ -d '{"workflow_run_id": ""}' curl -X POST http://localhost:8000/mcp/data/tools/data.agents.list_runs/invoke \ -H "Content-Type: application/json" \ -d '{"workflow_run_id": "", "limit": 20}' ``` Each row carries `experiment_id` + `test_id` (AGENTS rule 34), `total_tokens`, `total_cost_usd`, and a full per-step breakdown under `agent_run_steps`. ### Inkeep AI assistant + docs MCP server Two new surfaces in 2026-05: - **Inkeep widget in-product.** The "Ask AI" button in [alphaswarm_client](https://github.com/julianwileymac/alphaswarm/tree/main/alphaswarm_client) routes to an Inkeep agent that has the entire docs corpus + every public AlphaSwarm API spec ingested. It cites by URL and never invents references. - **Docs MCP server at `docs.alpha-swarm.ai/mcp`.** An RFC 9728 + 8707 compliant Cloudflare Worker (AGENTS rule 49). Cursor / Claude / Continue / custom scripts connect to it for `search`, `fetch_page`, and `list_pages` over the same corpus. In-platform agents reach it through the bridged `data.docs.*` MCP tools.
Both surfaces compose with the workflow runtime: a workflow node can call Inkeep / the docs MCP server as an external tool, and the `agent_runs_v2` row records the call. ## Worked example: build a research workflow Goal: snapshot an `AgentSpec` + `WorkflowSpec`, dispatch the workflow, tail progress, inspect the ledger — all from this page. ### Step 1 — snapshot an `AgentSpec` Re-running with identical content returns the same `(spec_id, version_id)` — the runtime treats it as a no-op. ### Step 2 — snapshot a `WorkflowSpec` that references it ### Step 3 — dispatch ### Step 4 — tail progress ```bash curl -N http://localhost:8000/chat/stream/ ``` You will see frames in the canonical envelope. Expected stages: `workflow.started` → `node.research.started` → `agent.token` (×N) → `node.research.completed` → `workflow.completed`. ### Step 5 — inspect the ledger Demonstrate the analysis pattern with a small inline sample of what the MCP describe call returns: ### Step 6 — verify - `agent_spec_versions` row exists with the recorded `spec_hash`. - `workflow_spec_versions` row exists; its content references the `agent_spec_versions` row from Step 1. - One `workflow_runs` row + one `agent_runs_v2` row (one node). - `total_cost_usd` is under the workflow's `per_run_max_usd` cap. - Re-dispatching by triggering Step 3 again creates a NEW `workflow_runs` row but reuses ALL the same `*_spec_versions` rows. ### What next - Walk the full tutorial: [tutorials/first-agent-workflow](../../tutorials/first-agent-workflow.md). - Add a second node: [concepts/agentic/workflow-studio](./workflow-studio.md) — the seven adapter kinds. - Read the topology catalogue: [concepts/agentic/multi-agent-patterns](./multi-agent-patterns.md). - Snapshot an agent spec from the CLI: [how-to/recipes/snapshot-an-agent-spec](../../how-to/recipes/snapshot-an-agent-spec.md). ## The four-runtime story This pipeline is one of four overlapping execution surfaces. 
Each has its own concept doc, but they all share the same hash-lock invariant, the same canonical progress frame, the same kill-switch fan-out, and the same `experiment_id` audit chain. | Runtime | Lifecycle surface | Worked tutorial | Concept doc | | --- | --- | --- | --- | | `AgentRuntime` | Single agent, single spec | (covered here) | [agents](./agents.md) | | `BotRuntime` | Bot = universe + strategy + ML + agents + RAG + risk | [tutorials/first-bot](../../tutorials/first-bot.md) | [bots](./bots.md) | | `RLRuntime` | Train / evaluate / paper / replay / walk-forward | [tutorials/first-rl-experiment](../../tutorials/first-rl-experiment.md) | [concepts/rl/rl-framework](../rl/rl-framework.md) | | `WorkflowRuntime` | Composition layer over the other three | [tutorials/first-agent-workflow](../../tutorials/first-agent-workflow.md) | [workflow-studio](./workflow-studio.md) | ## Hard rules (agentic-pipeline scope) The full set is in [AGENTS.md](https://github.com/julianwileymac/alphaswarm/blob/main/AGENTS.md). The agentic-pipeline subset: - **Rules 12-13** — All spec-driven agent runs go through `AgentRuntime`; `agent_spec_versions` rows are immutable. - **Rule 22** — Agents never read Postgres / Iceberg directly; every read goes through a `DataMCPTool`. - **Rule 40** — All workflow lifecycle actions go through `WorkflowRuntime`. - **Rule 41** — `workflow_spec_versions` rows are immutable hash-locked snapshots. - **Rule 34** — Every run-producing flow populates `experiment_id`. - **Rule 49** — Every MCP server is RFC 9728 + 8707 conformant. - **Rule 54** — Delegated agent tokens for HTTP MCP calls go through `TokenExchangeBroker` (RFC 8693 + Auth0 Custom Token Exchange Profile `alphaswarm-agent-delegation`). ## Deeper reads - [agentic-development](./agentic-development.md) — AlphaSwarm's spec-pattern mapped to the broader agentic-coder vocabulary. - [agents](./agents.md) — `AgentSpec` schema + `AgentRuntime` lifecycle.
- [multi-agent-patterns](./multi-agent-patterns.md) — sequential / parallel / debate / coordinator / ReAct topologies. - [workflow-studio](./workflow-studio.md) — the additive `WorkflowRuntime` + seven adapter kinds. - [orchestration-refactor-rollout](./orchestration-refactor-rollout.md) — operator rollout / rollback runbook. - [alpha-researcher-agent](./alpha-researcher-agent.md), [research-agents](./research-agents.md), [selection-agents](./selection-agents.md), [trader-agents](./trader-agents.md), [analysis-agents](../strategy/analysis-agents.md) — domain agent suites. - [bots](./bots.md) — bot entity (`TradingBot` / `ResearchBot`) and `BotRuntime`. - [agent-watchdog](../data/agent-watchdog.md) — Celery beat task that halts stalled agent_runs_v2 rows. - [reference/api](../../reference/api/index.mdx) — the `agents` + `workflows` tags (interactive playground). - [reference/python/alphaswarm/agents](../../reference/python/index.mdx) — auto-generated Python reference. # Agents > - **AgentSpec** — declarative blueprint (Pydantic). Holds role, system_prompt, tools, model, memory, RAG clauses, guardrails, output_schema, cost / call caps, and annotations. Defined in [alphaswarm/agents/s... # Agents This document covers the spec-driven agent surface added by the agentic-RAG expansion. The legacy CrewAI research crew (under [alphaswarm/agents/crew.py](../alphaswarm/agents/crew.py)) coexists with the new runtime; both register routes under `/agents/*` in the FastAPI gateway. ## Concepts - **AgentSpec** — declarative blueprint (Pydantic). Holds role, system_prompt, tools, model, memory, RAG clauses, guardrails, output_schema, cost / call caps, and annotations. Defined in [alphaswarm/agents/spec.py](../alphaswarm/agents/spec.py). - **AgentRuntime** — executor that turns a spec into a real run with full telemetry. Defined in [alphaswarm/agents/runtime.py](../alphaswarm/agents/runtime.py). - **Registry** — process-wide name → AgentSpec map. 
Discovered built-ins are registered at import time; YAML files under [configs/agents/](../configs/agents/) are auto-loaded on first lookup. Declared in [alphaswarm/agents/registry.py](../alphaswarm/agents/registry.py). - **Reproducibility** — every spec is hash-locked and snapshotted into `agent_spec_versions` on first use. Every run records a `spec_version_id` so it can be deterministically replayed. ## The four teams | Team | Specs | Page | | --- | --- | --- | | Research | `research.news_miner`, `research.equity`, `research.universe` | [alphaswarm_docs/research-agents.md](../../concepts/agentic/research-agents.md) | | Selection | `selection.stock_selector` | [alphaswarm_docs/selection-agents.md](../../concepts/agentic/selection-agents.md) | | Trader | `trader.signal_emitter` | [alphaswarm_docs/trader-agents.md](../../concepts/agentic/trader-agents.md) | | Analysis | `analysis.step`, `analysis.run`, `analysis.portfolio` (+ reflector) | [alphaswarm_docs/analysis-agents.md](../../concepts/strategy/analysis-agents.md) | ## Inspiration-rehydration personas (Phase 2026-04-29) Nine new spec-driven agents added by the rehydration. Each ships as a YAML in [configs/agents/](../configs/agents/) and uses one or more of the new analytics tools in [alphaswarm/agents/tools/analytics_tools.py](../alphaswarm/agents/tools/analytics_tools.py). 
| Spec name | Role | Tools | | --- | --- | --- | | `research.regime_analyst` | ADX trend/range gate | `regime_classifier_tool`, `historical_volatility` | | `research.composite_voter` | TradFi-style indicator consensus | `multi_indicator_vote_tool` | | `research.basis_momentum_analyst` | Commodity basis screening | `factor_screen_tool`, `realised_vol_tool` | | `research.cointegration_analyst` | Pair stat-arb | `cointegration_tool`, `historical_volatility` | | `research.intraday_momentum_analyst` | Gao 2018 intraday plays | `realised_vol_tool`, `regime_classifier_tool` | | `selection.cross_asset_skew_screener` | Cross-asset skew factor | `factor_screen_tool` | | `analysis.queue_position_analyst` | HFT metric explainer | `hft_metrics_tool` | | `analysis.cointegration_basket_finder` | Universe-wide pair search | `cointegration_tool` | | `research.options_greeks_explainer` | Bachelier + inverse Greeks | `option_greeks_tool`, `option_spread_tool` | Composite pipeline: see [alphaswarm/agents/graph/builder.py::build_quant_research_pipeline_graph](../alphaswarm/agents/graph/builder.py) which chains `composite_voter → regime_analyst → cointegration_analyst → risk_simulator → emit_signal_event/reject_decision_log` with the existing risk-simulator approval gate. 
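The routing shape of that composite pipeline (a linear chain ending in an approval gate that picks one of two terminal nodes) can be illustrated with plain functions. This is a toy sketch only: the real nodes run specs through `AgentRuntime`, and every function body, input key (`score`, `adx`), and threshold below is a placeholder invented for illustration.

```python
from typing import Callable, Dict

State = Dict[str, object]
Step = Callable[[State], State]


# Placeholder node bodies; only the node names come from the pipeline above.
def composite_voter(state: State) -> State:
    return {**state, "vote": 1 if state["score"] > 0 else -1}


def regime_analyst(state: State) -> State:
    # The threshold of 25 is an arbitrary illustrative value.
    return {**state, "regime": "trend" if state["adx"] >= 25 else "range"}


def cointegration_analyst(state: State) -> State:
    return {**state, "pairs_checked": True}


def risk_simulator(state: State) -> State:
    # Approval gate: the result decides which terminal node runs.
    approved = state["vote"] > 0 and state["regime"] == "trend"
    return {**state, "approved": approved}


def emit_signal_event(state: State) -> State:
    return {**state, "outcome": "signal_emitted"}


def reject_decision_log(state: State) -> State:
    return {**state, "outcome": "rejected"}


def run_quant_pipeline(state: State) -> State:
    for step in (composite_voter, regime_analyst,
                 cointegration_analyst, risk_simulator):
        state = step(state)
    terminal = emit_signal_event if state["approved"] else reject_decision_log
    return terminal(state)


print(run_quant_pipeline({"score": 0.4, "adx": 30})["outcome"])
```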
## Run lifecycle ```mermaid flowchart LR spec[AgentSpec YAML or code] reg[Registry] rt[AgentRuntime] rag[HierarchicalRAG] mem[RedisHybridMemory] llm[router_complete] db[(agent_runs_v2 + agent_run_steps)] spec --> reg --> rt rt --> rag rt --> mem rt --> llm rt --> db ``` ## Persistence | Table | Purpose | | --- | --- | | `agent_specs` | Logical agent (latest version pointer) | | `agent_spec_versions` | Immutable hash-locked spec snapshot | | `agent_runs_v2` | One row per run | | `agent_run_steps` | One row per step (LLM / tool / RAG / memory / guardrail) | | `agent_run_artifacts` | Sidecar artifacts referenced by a run | | `agent_evaluations` + `agent_eval_metrics` | Eval harness results | | `agent_annotations` | User/agent annotations for optimisation | ## REST surface ``` GET /agents/specs — list registered specs GET /agents/specs/{name} — spec detail (full payload) GET /agents/specs/{name}/versions — version history POST /agents/runs/v2/sync — synchronous run GET /agents/runs/v2 — list runs (filter by spec/status) GET /agents/runs/v2/{id} — full trace incl. steps POST /agents/runs/v2/{id}/replay — replay against snapshotted spec GET /agents/evaluations — list eval reports ``` ## Guardrails `AgentSpec.guardrails` (parsed by `AgentRuntime._guardrail_check`): - `cost_budget_usd` — hard ceiling per run (raises `GuardrailViolation`). - `rate_limit_per_minute` — TODO: enforced at the call site. - `max_calls` — caps the number of LLM round-trips per run. - `forbidden_terms` — strings that must not appear in the output. - `require_rationale` — output must include a rationale-style key. - `min_confidence` — output's `confidence` field must clear this floor. ## Don'ts - Don't bypass `AgentRuntime.run` for spec-driven agents — telemetry, guardrails, cost caps, and `agent_runs_v2` rely on it. - Don't mutate `agent_spec_versions` rows — they are immutable. 
- Don't write a new spec without registering it (decorator or YAML); the LangGraph builders look up by name and will skip unknown specs. # Alpha Researcher agent + symbolic alpha DSL > ```mermaid flowchart LR User[Researcher: intent] --> Agent[AlphaResearcher\\nconfigs/agents/alpha_researcher.yaml] Agent --> RAG[RAG: alpha_factors + backtest_summaries] Agent --> Output[JSON proposal\\... # Alpha Researcher agent + symbolic alpha DSL > Self-evolving LLM-driven factor mining wired into AlphaSwarm's > deployment-consistent execution loop. ## The loop ```mermaid flowchart LR User[Researcher: intent] --> Agent[AlphaResearcher\nconfigs/agents/alpha_researcher.yaml] Agent --> RAG[RAG: alpha_factors + backtest_summaries] Agent --> Output[JSON proposal\nname / formula / rationale] Output --> Sandbox[AST sandbox\naqp/data/expressions_dsl.py] Sandbox --> Factor[FactorNode] Factor --> Shim[FactorStrategyShim] Shim --> Engine[EventDrivenBacktester] Engine --> Metrics[Sharpe / IR / MDD / turnover] Metrics --> Reward[score_to_reward] Reward -.->|next iteration| Agent ``` ## Symbolic DSL vocabulary The full operator + field whitelist lives in [`alphaswarm/data/expressions_dsl.py`](../alphaswarm/data/expressions_dsl.py). **Fields:** `$open`, `$high`, `$low`, `$close`, `$volume`, `$vwap`, `$returns`. **Operators (curated):** `Ref`, `Delay`, `Mean`, `Std`, `Var`, `Skew`, `Kurt`, `Sum`, `Min`, `Max`, `Med`, `Mad`, `Quantile`, `Count`, `IdxMax`, `IdxMin`, `EMA`, `WMA`, `Slope`, `Rsquare`, `Resi`, `Corr`, `Cov`, `Greater`, `Less`, `Gt`, `Ge`, `Lt`, `Le`, `Eq`, `Ne`, `And`, `Or`, `Not`, `Mask`, `If`, `Add`, `Sub`, `Mul`, `Div`, `Abs`, `Sign`, `Log`, `Rank`, `Clip`. **Literals:** integers + floats + bools + None + short strings. Anything else (imports, attribute access, subscripts, lambdas, comprehensions, walrus, await, yield) raises `SymbolicAlphaError` at compile time.
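A whitelist check of this kind can be sketched with Python's standard `ast` module. This is an illustration under stated assumptions, not the code in `expressions_dsl.py`: it assumes `$field` tokens are rewritten to plain identifiers before parsing, the operator set is abbreviated, `validate_formula` is a hypothetical helper, and the local `SymbolicAlphaError` class is a stand-in for the real exception.

```python
import ast
import re

ALLOWED_FIELDS = {"open", "high", "low", "close", "volume", "vwap", "returns"}
ALLOWED_OPS = {"Ref", "Mean", "Std", "EMA", "Sign", "Rank"}  # abbreviated
ALLOWED_NODES = (
    ast.Expression, ast.Call, ast.Name, ast.Load, ast.Constant,
    ast.BinOp, ast.UnaryOp, ast.Add, ast.Sub, ast.Mult, ast.Div, ast.USub,
)


class SymbolicAlphaError(ValueError):
    """Local stand-in for the DSL's compile-time error."""


def validate_formula(formula: str) -> ast.Expression:
    # Assumption: "$close" is rewritten to the identifier "FIELD_close".
    source = re.sub(r"\$([A-Za-z_]\w*)", r"FIELD_\1", formula)
    try:
        tree = ast.parse(source, mode="eval")
    except SyntaxError as exc:
        raise SymbolicAlphaError(f"not a valid expression: {exc}") from exc
    for node in ast.walk(tree):
        # Anything outside the whitelist (attributes, subscripts, lambdas,
        # comprehensions, ...) is rejected by node type alone.
        if not isinstance(node, ALLOWED_NODES):
            raise SymbolicAlphaError(f"disallowed syntax: {type(node).__name__}")
        if isinstance(node, ast.Call):
            if not isinstance(node.func, ast.Name) or node.func.id not in ALLOWED_OPS:
                raise SymbolicAlphaError("call to a non-whitelisted operator")
        elif isinstance(node, ast.Name):
            is_field = (node.id.startswith("FIELD_")
                        and node.id[len("FIELD_"):] in ALLOWED_FIELDS)
            if node.id not in ALLOWED_OPS and not is_field:
                raise SymbolicAlphaError(f"unknown name: {node.id}")
        elif isinstance(node, ast.Constant):
            if not isinstance(node.value, (int, float, str, type(None))):
                raise SymbolicAlphaError("unsupported literal")
    return tree


validate_formula("Sign(EMA($close, 12) - EMA($close, 26)) * Rank(Std($returns, 20))")
```

Rejecting by node type first means new Python syntax is denied by default; only constructs explicitly listed in `ALLOWED_NODES` can ever reach evaluation.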
## Example proposal ```json { "name": "ema_crossover_pct", "formula": "Sign(EMA($close, 12) - EMA($close, 26)) * Rank(Std($returns, 20))", "rationale": "Combines MACD-style cross with vol-rank to favour high-vol trends.", "expected_horizon_bars": 5, "expected_direction": "either" } ``` ## Compile + evaluate ```python from alphaswarm_agents.quant import AlphaResearcher researcher = AlphaResearcher(agent_spec_name="alpha_researcher") proposal = researcher.propose(inputs={"intent": "find a short-horizon mean-reversion factor"}) result = researcher.evaluate(proposal, bars=bars) print(result.metrics, result.reward) ``` ## Engine-agnostic `FactorNode` The compiled [`FactorNode`](../alphaswarm/data/expressions_dsl.py) feeds: - **Event-driven engine** via `.compute(bars)` returning a `pd.Series`. - **vbt-pro orders mode** via `.compute_panel(bars_panel)` returning a wide DataFrame. - **Backtrader (optional)** via `.as_backtrader_indicator()` returning a dynamic `bt.Indicator` subclass. ## Companion agent: StrategyExecutor The [`StrategyExecutor`](../alphaswarm/agents/quant/strategy_executor.py) agent decides WHICH RL experiment to train / paper-trade / promote based on the RAG `rl_trajectory_summaries` corpus and the live broker state. Routes lifecycle actions through [`RLRuntime`](../alphaswarm/rl/runtime.py) (rule 16). ## See also - [alphaswarm_docs/agentic-rl.md](../../concepts/rl/agentic-rl.md) - [Hard rule 39 in AGENTS.md](https://github.com/julianwileymac/alphaswarm/blob/main/AGENTS.md) - [.cursor/rules/symbolic-alphas.mdc](../.cursor/rules/symbolic-alphas.mdc) # Bots > A **Bot** aggregates everything required to research, evaluate, and deploy an algorithmic trading automation: # Bots > The smallest self-contained, deployable unit on AlphaSwarm. > > **QuantBot Platform v0.2.0** layered an enterprise-grade Kubernetes > control plane on top of the legacy `BotRuntime` path without breaking > any existing bots. 
See the new ADRs: > > - [ADR 006 — QuantBot Operator Pattern](../../architecture/decisions/006-quantbot-operator-pattern.md) > - [ADR 007 — QuantBot Latency Classes](../../architecture/decisions/007-quantbot-latency-classes.md) > - [ADR 008 — Bot Event Sourcing](../../architecture/decisions/008-quantbot-event-sourcing.md) > - [ADR 009 — RTS 6 / SEC 15c3-5 Conformance](../../architecture/decisions/009-quantbot-rts6-conformance.md) > - [ADR 010 — Canary PnL Gates](../../architecture/decisions/010-quantbot-canary-pnl-gates.md) > > Runbooks: > > - [HFT Node Onboarding](../../how-to/operations/hft-node-onboarding.md) > - [Bot Canary Rollout Playbook](../../how-to/operations/bot-canary-rollout-playbook.md) > - [RTS 6 Validation Report Generation](../../how-to/operations/rts6-validation-report-generation.md) > - [Kill Switch Incident Response](../../how-to/operations/kill-switch-incident-response.md) A **Bot** aggregates everything required to research, evaluate, and deploy an algorithmic trading automation: - a **trading universe** (symbol list or registry-driven model), - a **data ingestion pipeline** preset, - a **strategy graph** (alpha → portfolio → risk → execution, via `FrameworkAlgorithm`), - a **backtest engine** (vbt-pro / event-driven / vectorbt / fallback), - optional **ML model deployments** (`ModelDeployment` ids), - optional **spec-driven agents** for supervision / per-bar consult / research chat, - a **hierarchical RAG** access plan, - **evaluation metrics** with thresholds, - **risk caps**, and - a **deployment target** (paper session / Kubernetes / backtest-only). Bots live under a [`Project`](../../concepts/platform/erd.md) (`ProjectScopedMixin`). Within a project, bots are uniquely identified by their slug. 
## Composition ```mermaid flowchart LR Project --> Bot Bot --> BotSpec[BotSpec] BotSpec --> Universe[universe + DataPipelineRef] BotSpec --> StrategyCfg["strategy: build_from_config"] BotSpec --> EngineCfg["backtest.engine"] BotSpec --> MLDeployments["ml_models[]"] BotSpec --> AgentSpecs["agents[] (AgentSpec names)"] BotSpec --> RAGPlan["rag[] (RAGRef)"] BotSpec --> Metrics["metrics[] + risk"] BotSpec --> DeployTarget["deployment"] BotRuntime --> Backtest["run_backtest_from_config"] BotRuntime --> Paper["build_session_from_config"] BotRuntime --> AgentRuntime AgentRuntime --> RAG["HierarchicalRAG"] BotRuntime --> Deploy["DeploymentDispatcher"] Deploy --> Paper Deploy --> K8s["KubernetesTarget"] ``` `Bot` does **not** re-implement strategy / engine / agent / RAG logic. It composes references and dispatches to existing primitives so all hard rules from [AGENTS.md](https://github.com/julianwileymac/alphaswarm/blob/main/AGENTS.md) (`router_complete`, `iceberg_catalog`, `AgentRuntime`, `HierarchicalRAG`, `emit/emit_done`) remain the only paths into those subsystems. ## Subclasses | Subclass | Required spec slots | Methods | Use case | | --- | --- | --- | --- | | `TradingBot` | `strategy`, `backtest` | `backtest()`, `paper()`, `deploy()`, `consult_agents()` | Live / paper / backtest trading | | `ResearchBot` | `agents` | `chat()`, optional `backtest()` (only if `strategy` set) | Research agent + chat surface | `TradingBot.chat()` raises `BotMethodNotSupported` — pair the bot with a companion `ResearchBot`. `ResearchBot.paper()` raises `BotMethodNotSupported` — clone the spec into a `TradingBot` first. ## Spec example ```yaml name: Dual MA AAPL slug: dual-ma-aapl kind: trading description: Dual MA crossover on AAPL/MSFT. 
universe: symbols: [AAPL.NASDAQ, MSFT.NASDAQ] data_pipeline: preset: ohlcv-daily source: alpaca strategy: class: FrameworkAlgorithm module_path: alphaswarm.strategies.framework kwargs: universe_model: class: StaticUniverse module_path: alphaswarm.strategies.universes kwargs: { symbols: [AAPL.NASDAQ, MSFT.NASDAQ] } alpha_model: class: DualMACrossoverAlpha module_path: alphaswarm.strategies.dual_ma kwargs: { fast: 10, slow: 50 } portfolio_model: { class: EqualWeightPortfolio } risk_model: { class: NoOpRiskModel } execution_model: { class: ImmediateExecutionModel } backtest: engine: vbt-pro:signals kwargs: { initial_cash: 100000.0 } agents: - spec_name: research.quant_vbtpro role: supervisor rag: - levels: [l1, l2] orders: [first, second] corpora: [bars_daily, performance] metrics: - { name: sharpe, threshold: 1.0, direction: max } - { name: max_drawdown, threshold: 0.25, direction: min } risk: max_position_pct: 0.25 max_daily_loss_pct: 0.02 deployment: target: paper_session brokerage: simulated feed: deterministic_replay initial_cash: 100000.0 dry_run: true ``` Drop the file under [alphaswarm_bots/templates/trading/](../alphaswarm_bots/templates/trading/) or [alphaswarm_bots/templates/research/](../alphaswarm_bots/templates/research/) — the registry lazy-scans both directories on first lookup. ## Persistence Three new tables, all `ProjectScopedMixin` (Alembic migration `0020_bots`): - **`bots`** — logical row with the latest active version of a named spec inside a project. Unique on `(project_id, slug)`. - **`bot_versions`** — immutable, hash-locked snapshot of every `BotSpec` change. Unique on `(bot_id, spec_hash)` and `(bot_id, version)`. - **`bot_deployments`** — one row per backtest / paper / chat / k8s invocation. References `version_id` so a run can be replayed against the exact spec that produced it. The runtime mirrors the proven `AgentSpec` / `AgentSpecVersion` / `AgentRunV2` triad from [alphaswarm/agents/runtime.py](../alphaswarm/agents/runtime.py). 
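The hash-lock behaviour of `bot_versions` (unique on `(bot_id, spec_hash)`, identical content maps to the same row) can be sketched in a few lines. This is a minimal in-memory illustration, not the `persist_spec` implementation: the canonicalisation choices (`sort_keys`, compact separators) and the `VersionStore` class are assumptions.

```python
import hashlib
import json
from typing import Dict, Tuple


def spec_hash(spec: dict) -> str:
    # Assumed canonical form: sorted keys, compact separators, UTF-8.
    canonical = json.dumps(spec, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


class VersionStore:
    """Hypothetical stand-in for the bot_versions table."""

    def __init__(self) -> None:
        self._rows: Dict[Tuple[str, str], int] = {}

    def snapshot(self, bot_id: str, spec: dict) -> Tuple[int, str]:
        digest = spec_hash(spec)
        key = (bot_id, digest)
        if key not in self._rows:
            # New content: allocate the next version number for this bot.
            existing = sum(1 for (b, _) in self._rows if b == bot_id)
            self._rows[key] = existing + 1
        # Identical content returns the existing row unchanged.
        return self._rows[key], digest


store = VersionStore()
v1, h1 = store.snapshot("dual-ma-aapl", {"slug": "dual-ma-aapl", "fast": 10, "slow": 50})
v2, h2 = store.snapshot("dual-ma-aapl", {"slow": 50, "fast": 10, "slug": "dual-ma-aapl"})
v3, h3 = store.snapshot("dual-ma-aapl", {"slug": "dual-ma-aapl", "fast": 12, "slow": 50})
print(v1, v2, v3)
```

Sorting keys before hashing is what makes the snapshot insensitive to field order, so only a real content change produces a new version.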
## Lifecycle ### Backtest ```mermaid sequenceDiagram participant UI participant API as /bots/{id}/backtest participant Celery as run_bot_backtest participant Runtime as BotRuntime participant Engine as run_backtest_from_config UI->>API: POST /bots/{id}/backtest API->>Celery: run_bot_backtest.delay(bot_id) Celery->>Runtime: BotRuntime(bot, task_id).backtest() Runtime->>Runtime: persist_spec -> bot_versions Runtime->>Runtime: open bot_deployments row Runtime->>Engine: run_backtest_from_config(_derive_backtest_cfg()) Engine-->>Runtime: BacktestResult Runtime->>Runtime: finalise bot_deployments + emit_done Runtime-->>UI: stream result via /chat/stream/{task_id} ``` ### Paper `POST /bots/{id}/paper/start` dispatches `run_bot_paper`, which builds a `PaperTradingSession` via the existing [`build_session_from_config`](../alphaswarm/trading/runner.py) and awaits its async `run()`. Stop with `POST /bots/{id}/paper/stop/{task_id}` (reuses [`publish_stop_signal`](../alphaswarm/tasks/paper_tasks.py)). ### Chat (ResearchBot) `POST /bots/{id}/chat` dispatches `chat_research_bot`, which iterates the bot's `agents[]` and runs each through [`AgentRuntime`](../alphaswarm/agents/runtime.py). RAG retrieval, memory, and guardrails behave identically to direct `POST /agents/runs/v2/sync` calls — the bot is just a curator of agent specs. ### Deploy `POST /bots/{id}/deploy` dispatches `deploy_bot`, which delegates to the configured target via [`DeploymentDispatcher`](../alphaswarm_bots/deploy.py): | Target | Behaviour | | --- | --- | | `paper_session` | Launches a paper session in the Celery worker. | | `backtest_only` | Runs a single backtest + persists result on the deployment row. | | `kubernetes` | Renders `Deployment` + `ConfigMap` YAML to `alphaswarm_platform/deploy/k8s/bots/.yaml`. Optionally `kubectl apply`s when `apply=True` and `kubectl` is on PATH. 
| The Kubernetes manifest's pod entrypoint is `python -m alphaswarm_bots.cli run ` (compat: `python -m alphaswarm.bots.cli`; see [alphaswarm_bots/cli.py](../alphaswarm_bots/cli.py)). ## REST surface All endpoints under `/bots`: | Method | Path | Purpose | | --- | --- | --- | | `GET` | `/bots` | List (filter by `project_id`, `kind`, `status_filter`) | | `POST` | `/bots` | Create (body: `{spec, project_id?}`) | | `GET` | `/bots/{ref}` | Detail (`{ref}` = id or slug) | | `PUT` | `/bots/{ref}` | Update (auto-snapshots a new version on change) | | `DELETE` | `/bots/{ref}` | Delete | | `GET` | `/bots/{ref}/versions` | List `bot_versions` | | `GET` | `/bots/{ref}/deployments` | List `bot_deployments` | | `POST` | `/bots/{ref}/backtest` | Dispatch `run_bot_backtest` (returns `TaskAccepted`) | | `POST` | `/bots/{ref}/paper/start` | Dispatch `run_bot_paper` | | `POST` | `/bots/{ref}/paper/stop/{task_id}` | Stop in-flight paper session | | `POST` | `/bots/{ref}/deploy` | Dispatch `deploy_bot` | | `POST` | `/bots/{ref}/chat` | Dispatch `chat_research_bot` (research only) | Async lifecycle endpoints return [`TaskAccepted`](../alphaswarm/api/schemas.py) with `stream_url` pointing at the existing `/chat/stream/{task_id}` WebSocket — no new transport. ## CLI `python -m alphaswarm.bots.cli` for shell-level operations: ```bash python -m alphaswarm.bots.cli list python -m alphaswarm.bots.cli show dual-ma-aapl --yaml python -m alphaswarm.bots.cli backtest dual-ma-aapl python -m alphaswarm.bots.cli paper dual-ma-aapl --run-name 2026-05-03 python -m alphaswarm.bots.cli chat equity-research-bot "What is AAPL's edge?" python -m alphaswarm.bots.cli deploy dual-ma-aapl --target kubernetes python -m alphaswarm.bots.cli run dual-ma-aapl # pod entrypoint ``` ## UI The bot builder lives at [`/bots`](../webui/app/(shell)/bots/page.tsx) and reuses the existing `@xyflow/react` canvas via [`WorkflowEditor`](../webui/components/flow/WorkflowEditor.tsx). 
The palette ([`webui/components/bots/botPalette.ts`](../webui/components/bots/botPalette.ts)) exposes ten kinds — Universe, DataPipeline, Strategy, Engine, MLModel, Agent, RAG, Metric, Risk, Deploy. Each node maps 1:1 to a `BotSpec` slot via [`serializeBotSpec`](../webui/components/bots/botSerializer.ts); the inverse `deserializeBotSpec` lets the builder edit a saved bot. The detail page ships tabs: - **Overview** — primary action buttons (Backtest / Start paper / Deploy / Render K8s manifest). - **Builder** — the node-and-wire canvas. - **Deployments** — every `bot_deployments` row. - **Versions** — every `bot_versions` row. - **Chat** — only for `ResearchBot` kind; embeds [`ResearchBotChat`](../webui/components/bots/ResearchBotChat.tsx) driven by `useChatStream`. ## Hard rules - Bot agent calls go through [`AgentRuntime`](../alphaswarm/agents/runtime.py); `BotRuntime` never calls `router_complete` directly. - Bot RAG access goes through [`HierarchicalRAG`](../alphaswarm/rag/hierarchy.py) via the agent's `rag:` clause. - Bot data loading uses [`IngestionPipeline.run_path`](../alphaswarm/data/pipelines/runner.py) and `iceberg_catalog.append_arrow`; never raw PyIceberg. - Bot progress emits go through [alphaswarm/tasks/_progress.py](../alphaswarm/tasks/_progress.py) preserving the `{task_id, stage, message, timestamp, **extras}` payload shape. - Strategies / engines / models in `BotSpec` use the existing `{class, module_path, kwargs}` factory and `@register`. - New Alembic migrations are additive only; never edit a shipped one. 
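The progress payload shape `{task_id, stage, message, timestamp, **extras}` named in the rules above can be illustrated with a small builder. This is a sketch, not the helper in `alphaswarm/tasks/_progress.py`: `build_frame` is a hypothetical name, and the ISO-8601 UTC timestamp format is an assumption.

```python
from datetime import datetime, timezone
from typing import Any, Dict

RESERVED_KEYS = ("task_id", "stage", "message", "timestamp")


def build_frame(task_id: str, stage: str, message: str,
                **extras: Any) -> Dict[str, Any]:
    # "timestamp" is the only reserved key that can arrive via **extras
    # (the other three are named parameters), so guard against it here.
    clash = set(extras) & set(RESERVED_KEYS)
    if clash:
        raise ValueError(f"extras collide with reserved keys: {sorted(clash)}")
    return {
        "task_id": task_id,
        "stage": stage,
        "message": message,
        "timestamp": datetime.now(timezone.utc).isoformat(),  # assumed format
        **extras,  # extras are flattened into the top level
    }


frame = build_frame("task-123", "bot.backtest.started",
                    "Backtest dispatched", bot_slug="dual-ma-aapl")
print(sorted(frame))
```

Because extras sit at the top level, consumers read `frame["bot_slug"]` directly rather than looking under a nested `extras` key.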
## Where things live | Need | Path | | --- | --- | | BotSpec | [alphaswarm_bots/spec.py](../alphaswarm_bots/spec.py) | | BaseBot ABC | [alphaswarm_bots/base.py](../alphaswarm_bots/base.py) | | TradingBot | [alphaswarm_bots/trading_bot.py](../alphaswarm_bots/trading_bot.py) | | ResearchBot | [alphaswarm_bots/research_bot.py](../alphaswarm_bots/research_bot.py) | | BotRuntime | [alphaswarm_bots/runtime.py](../alphaswarm_bots/runtime.py) | | Registry / persist_spec | [alphaswarm_bots/registry.py](../alphaswarm_bots/registry.py) | | Deploy targets | [alphaswarm_bots/deploy.py](../alphaswarm_bots/deploy.py) | | CLI | [alphaswarm_bots/cli.py](../alphaswarm_bots/cli.py) | | ORM models | [alphaswarm/persistence/models_bots.py](../alphaswarm/persistence/models_bots.py) | | Alembic migration | [alembic/versions/0020_bots.py](../alembic/versions/0020_bots.py) | | Celery tasks | [alphaswarm/tasks/bot_tasks.py](../alphaswarm/tasks/bot_tasks.py) | | REST routes | [alphaswarm/api/routes/bots.py](../alphaswarm/api/routes/bots.py) | | Example specs | [alphaswarm_bots/templates/](../alphaswarm_bots/templates/) | | UI builder | [webui/components/bots/](../webui/components/bots/) | | Argo template | `alphaswarm_platform/deployments/kubernetes/mlops/bots/workflowtemplate-bot-deploy.yaml` | # Multi-agent patterns in AlphaSwarm > Read this doc when you need to: # Multi-agent patterns in AlphaSwarm > Catalogue of multi-agent topologies, mapped to existing code in > [alphaswarm/agents/graph/](../alphaswarm/agents/graph/). Use this when adding a > new agent crew, deciding between sequential and parallel > orchestration, or deciding when a debate / consensus pattern is > warranted. 
> > Doc map: [alphaswarm_docs/index.md](../../intro/index.md) · > Underlying primitives: [agents.md](../../concepts/agentic/agents.md) · > Spec contract: [agentic-development.md](../../concepts/agentic/agentic-development.md) · > ADLC + security: [agentic-development.md#3-adlc-security-manifesto](../../concepts/agentic/agentic-development.md#3-adlc-security-manifesto). ## When to read this doc Read this doc when you need to: - Add a new multi-step agent crew that goes beyond a single `AgentSpec` invocation. - Decide whether a debate / dialectical pattern is appropriate for a reasoning task. - Wire a new entry-point in the LangGraph builder. - Understand how the existing crews (research, trader, analysis) compose under the hood. This doc does **not** replace [agents.md](../../concepts/agentic/agents.md) — that's the primary reference for `AgentSpec` and `AgentRuntime`. This doc only covers **how multiple specs are composed**. ## The five canonical patterns | Pattern | When to use | AlphaSwarm entry-point | | --- | --- | --- | | Sequential | Deterministic linear pipeline | `build_research_graph` / `build_trader_graph` / `build_full_pipeline_graph` in [alphaswarm/agents/graph/builder.py](../alphaswarm/agents/graph/builder.py) (linear edges) | | Parallel | Independent multi-source research with synthesis | parallel research-team nodes in [alphaswarm/agents/graph/builder.py](../alphaswarm/agents/graph/builder.py) | | Debate / Dialectical | Adversarial analysis (Bull / Bear, advocate / critic) | [alphaswarm/agents/graph/dialectical.py](../alphaswarm/agents/graph/dialectical.py) → `build_dialectical_debate_graph` (Bull / Bear / Portfolio-Manager) | | Coordinator / Router | Hierarchical delegation | top-level orchestrator in [alphaswarm/agents/graph/builder.py](../alphaswarm/agents/graph/builder.py) (`build_full_pipeline_graph` plays this role today) | | ReAct (loop-with-observation) | Open-ended forecasting requiring iterative observe → act | LangGraph state loop with 
conditional edges via [alphaswarm/agents/graph/conditions.py](../alphaswarm/agents/graph/conditions.py) (`should_continue_debate`, `should_continue_risk`) | Each pattern below has the same three sections: when to use, the shape it takes in AlphaSwarm, and a "Don't" list. --- ## 1. Sequential ```mermaid flowchart LR Start --> A[Step 1] --> B[Step 2] --> C[Step 3] --> Final ``` ### When to use - Deterministic, well-understood pipelines where each step's output is the input to the next. - The default for any flow that doesn't have a strong reason to branch. - Good for: ingest → normalise → enrich → emit; research → selection → trader → analysis (the canonical pipeline). ### AlphaSwarm shape - [`build_research_graph`](../alphaswarm/agents/graph/builder.py) — research → equity → universe. - [`build_trader_graph`](../alphaswarm/agents/graph/builder.py) — trader → analysis run. - [`build_full_pipeline_graph`](../alphaswarm/agents/graph/builder.py) — research → selection → trader → analysis (end-to-end agentic loop). - State carried via `AgentState` (TypedDict) declared in [alphaswarm/agents/graph/state.py](../alphaswarm/agents/graph/state.py). - Falls back to [`SequentialGraph`](../alphaswarm/agents/graph/builder.py) when LangGraph isn't installed — same audit trail, no conditional routing. ### Don't - Don't bypass the runtime per step. Each node calls `AgentRuntime.run(...)` so cost caps + telemetry + immutable versions are recorded. - Don't widen the `AgentState` TypedDict for one-off keys — extend via the canonical fields documented in [alphaswarm/agents/graph/state.py](../alphaswarm/agents/graph/state.py) so conditional predicates keep working. --- ## 2. 
Parallel (research team / fan-out + synthesis) ```mermaid flowchart TD Start --> Coordinator Coordinator --> A[Source A] Coordinator --> B[Source B] Coordinator --> C[Source C] A --> Synth B --> Synth C --> Synth Synth --> Final ``` ### When to use - Multiple independent sources / analyses that can run in parallel and then be synthesised. - Examples: fundamental + technical + macro + sentiment running concurrently to produce a unified market view; multi-source regulatory ingest. - Throughput-bound: parallel makes sense when each branch is expensive and the branches don't depend on each other. ### AlphaSwarm shape - LangGraph state graphs run independent branches concurrently when the edges declare them as such. - The synthesis node consumes the merged state and emits a combined verdict. - For the research-team subgraph in [`build_full_pipeline_graph`](../alphaswarm/agents/graph/builder.py), the individual research specs (`research.equity`, `research.news_miner`, `research.universe`, etc.) feed a downstream selector / trader. ### Don't - Don't parallelise tool calls that mutate shared state — the catalog upserts in [`active_metadata`](../alphaswarm/data/catalog/active_metadata.py) are serialised on purpose. - Don't fan out to N agents that all consult the same RAG corpus with identical queries — that's a cache miss N times. Cache once upstream. - Don't rely on parallel order. Synthesis must be order-independent (associative + commutative over the result set). --- ## 3. Debate / Dialectical ```mermaid flowchart TD Start --> Subject[Subject under analysis] Subject --> Bull[Bull advocate] Subject --> Bear[Bear advocate] Bull --> Loop{Continue debate?} Bear --> Loop Loop -->|yes| Bull Loop -->|no| PM[Portfolio Manager / Judge] PM --> Verdict ``` ### When to use - Open-ended judgement where adversarial reasoning surfaces blind-spots (e.g. should we take this position? does this strategy generalise out-of-sample?). 
- Whenever a single-agent verdict would feel "too convenient" — the Bull / Bear pattern forces both arguments to be made and judged. - The literature behind this pattern (TradingAgents) is a known source of inspiration; AlphaSwarm keeps the structure but routes through spec-driven `AgentRuntime` so every debate turn is logged. ### AlphaSwarm shape - [alphaswarm/agents/graph/dialectical.py](../alphaswarm/agents/graph/dialectical.py) contains `build_dialectical_debate_graph` (Bull / Bear / Portfolio-Manager). - Three agent specs ship under [configs/agents/](../configs/agents/): - `research.bull_researcher` - `research.bear_researcher` - `research.portfolio_manager` - The portfolio manager synthesises both transcripts into a single `debate_verdict` with `action ∈ {buy, hold, sell, mutate_params}`. - The Phase-4 iterative optimisation loop in [`build_research_debate_graph`](../alphaswarm/agents/graph/builder.py) uses `should_continue_debate` from [conditions.py](../alphaswarm/agents/graph/conditions.py) to bound rounds (default `max_rounds=2`). - State extension: `RiskDebateState` and `ResearchDebateState` (TypedDicts in [state.py](../alphaswarm/agents/graph/state.py)) hold the debate transcript across turns. - All decisions land in [decision_log.py](../alphaswarm/agents/graph/decision_log.py) for auditability — `append_pending_decision` / `resolve_pending_decisions`. ### Don't - Don't run an unbounded debate. Cost caps + the `max_rounds` predicate are non-negotiable. - Don't let the judge synthesise without seeing both transcripts — the synthesis node is the load-bearing piece. - Don't add a third advocate without thinking carefully about the judge prompt. Two-sided debate is well-studied; three-sided debates require explicit tie-breaking logic. --- ## 4. 
Coordinator / Router ```mermaid flowchart TD Human --> Coordinator[Principal Investigator] Coordinator --> Sub1[Subagent: data] Coordinator --> Sub2[Subagent: analysis] Coordinator --> Sub3[Subagent: codegen] Sub1 --> Coordinator Sub2 --> Coordinator Sub3 --> Coordinator Coordinator --> Final[Synthesised report] ``` ### When to use - Workflows where the human interacts with a single high-level orchestrator that delegates to specialised subagents. - Reduces cognitive load for the operator — they don't direct individual specs, they direct the coordinator. - Examples: end-to-end backtest run with multiple analytical subagents; multi-stage research crew coordinated by a "PI" agent. ### AlphaSwarm shape - [`build_full_pipeline_graph`](../alphaswarm/agents/graph/builder.py) plays this role today: a top-level orchestrator that routes to research, selection, trader, and analysis nodes. - Decision-log ([decision_log.py](../alphaswarm/agents/graph/decision_log.py)) captures the routing decisions so the human can replay why a particular subagent was invoked. - The Cursor IDE itself follows this pattern — the parent agent dispatches `Task(subagent_type=...)` for read-only exploration or implementation. ### Don't - Don't put domain logic in the coordinator. It coordinates; subagents do the work. - Don't pass full intermediate state up to the human. The whole point is the coordinator synthesises — show the synthesis, link to the decision log for the trace. --- ## 5. ReAct (loop-with-observation) ```mermaid flowchart TD Start --> Reason[Reason about state] Reason --> Act[Act / call tool] Act --> Observe[Observe result] Observe --> Decide{Goal met?} Decide -->|no| Reason Decide -->|yes| Final ``` ### When to use - Open-ended forecasting / research questions where the answer isn't reachable in a single shot, and the model needs to call tools, observe results, and iterate. 
- Examples: building a market thesis from sequential hypothesis-tests; iterative debugging of a strategy's poor backtest. - Trades latency for accuracy — only worth it for tasks where the user explicitly wants depth over speed. ### AlphaSwarm shape - LangGraph state-graph with conditional edges — the loop is modelled as a self-edge gated by a predicate. - Conditional predicates live in [alphaswarm/agents/graph/conditions.py](../alphaswarm/agents/graph/conditions.py) (`should_continue_debate`, `should_continue_risk`, `should_consult_rag`, `risk_simulator_approves`). - Tool calls inside the loop go through `AgentRuntime` so the cost cap bounds the iteration count. - For agents that need persistent memory between iterations, the Redis-backed checkpointer ([checkpointer.py](../alphaswarm/agents/graph/checkpointer.py)) preserves graph state across process restarts. ### Don't - Don't ReAct without a hard upper bound on iterations. The `max_rounds` parameter on `should_continue_debate` is the reference pattern — apply the same upper bound to any new ReAct-style condition. - Don't share Redis checkpoint keys across unrelated runs. Each `(spec_version_id, run_id)` is its own checkpoint namespace. - Don't ReAct on a hot path (live execution). Use it for research and post-hoc analysis where latency is acceptable. --- ## Orchestration adapter topologies (Phase 7 addition) The additive orchestration refactor adds a sibling abstraction — [`OrchestrationAdapter`](../alphaswarm/agents/orchestration/base.py) — that exposes the five canonical patterns above as **first-class registry components**. The patterns themselves don't change; the new ``WorkflowRuntime`` wraps them behind a metaclass-registered alias so operators can mix-and-match without editing graph builders by hand. 
Seven adapter kinds, six shipping and one planned (see [ADAPTER_KINDS](../alphaswarm/agents/orchestration/registry.py)):

| Adapter | Kind | Wraps | Inspiration |
| --- | --- | --- | --- |
| [`LangGraphAdapter`](../alphaswarm/agents/orchestration/adapters/langgraph_adapter.py) | `graph` | The five canonical builders in [alphaswarm/agents/graph/builder.py](../alphaswarm/agents/graph/builder.py) + `build_dialectical_debate_graph` | alphaswarm |
| [`CrewProcessAdapter`](../alphaswarm/agents/orchestration/adapters/crew_adapter.py) | `crew` | [`run_research_crew`](../alphaswarm/agents/crew.py) + [`run_trader_crew`](../alphaswarm/agents/trading/crew.py) (CrewAI sequential / hierarchical) | finrobot |
| [`DialecticalDebateAdapter`](../alphaswarm/agents/orchestration/adapters/debate_adapter.py) | `debate` | [`build_dialectical_debate_graph`](../alphaswarm/agents/graph/dialectical.py) with bounded rounds + forced judge synthesis | tradingagents |
| [`AutomationScheduleAdapter`](../alphaswarm/agents/orchestration/adapters/schedule_adapter.py) | `schedule` | Celery beat; enqueues [`alphaswarm.tasks.orchestration_tasks.run_workflow`](../alphaswarm/tasks/orchestration_tasks.py) | daily_stock_analysis |
| [`SignalFusionAdapter`](../alphaswarm/agents/orchestration/adapters/fusion_adapter.py) | `fusion` | Deterministic [`synthesize`](../alphaswarm/agents/trading/fusion.py) over debate + quant + model contributors | vibe_trading |
| [`WeightCentricExecutionAdapter`](../alphaswarm/agents/orchestration/adapters/weight_centric_adapter.py) | `execution` | [`WeightCentricPipeline`](../alphaswarm/rl/portfolio/pipeline.py) + [`RiskLimits`](../alphaswarm/risk/limits.py) (rule 38) | finrl |
| (Phase 7 future) `WorkflowStudioAdapter` | `studio` | Interactive workflow graph editor | langflow |

### Why use adapters over a hand-rolled builder?

- **Discoverability**: every adapter shows up in the Phase 5 studio dropdown via `data.orchestration.list_adapters`, so operators don't need to read code.
- **Halt parity**: the runtime polls `should_halt(state)` between every adapter transition; new adapters inherit that contract for free. - **Replay parity**: every spec snapshotted into `workflow_spec_versions` is replayable by `workflow_version_id` through `/workflows/runs/{run_id}/replay`. - **Telemetry parity**: each transition opens a [`node_span`](../alphaswarm/agents/observability.py) so per-adapter latency / cost / branch decisions land on the same OTEL trace as every legacy agent run. ### When to use an adapter vs a graph builder | Choose adapter when | Choose graph builder when | | --- | --- | | You want it in the studio dropdown | The flow is hard-coded into a service | | You need to replay it by version id | One-off internal pipeline | | You want bounded-debate / cooperative-cancel without writing them | You're already inside a builder body | | The flow ships as YAML for ops | The flow is built dynamically per request | Adapters delegate to graph builders internally — they are **wrappers, not replacements**. Adding a new adapter never invalidates an existing builder. --- ## Adding a new pattern 1. Identify which of the five it most resembles. Don't invent a sixth unless there's a real reason. 2. Add the builder under [alphaswarm/agents/graph/](../alphaswarm/agents/graph/). Mirror the existing `build_*_graph` naming. 3. Add the necessary state TypedDict to [state.py](../alphaswarm/agents/graph/state.py). Don't sprinkle ad-hoc dict keys — `AgentState` is the contract. 4. Add conditional predicates to [conditions.py](../alphaswarm/agents/graph/conditions.py) if the graph has branches. 5. Decisions emitted by the graph land in [decision_log.py](../alphaswarm/agents/graph/decision_log.py). 6. Tests under [tests/agents/](../tests/agents/) — at minimum, a `SequentialGraph` fallback test that runs the graph without LangGraph installed. Mirror the existing test naming: e.g. `test__run.py`. 7. 
Update [agents.md](../../concepts/agentic/agents.md) and / or this file to describe the new entry-point. ## Cross-references - [agents.md](../../concepts/agentic/agents.md) — `AgentSpec` + `AgentRuntime` reference - [agentic-pipeline.md](../../concepts/agentic/agentic-pipeline.md) — end-to-end pipeline walkthrough - [agentic-development.md](../../concepts/agentic/agentic-development.md) — spec-pattern + ADLC manifesto - [analysis-agents.md](../../concepts/strategy/analysis-agents.md) — analysis-specific agent roles - [research-agents.md](../../concepts/agentic/research-agents.md) / [selection-agents.md](../../concepts/agentic/selection-agents.md) / [trader-agents.md](../../concepts/agentic/trader-agents.md) — per-team agent rosters - [providers.md](../../concepts/data/providers.md) — LLM provider routing under the hood # Orchestration control plane refactor — rollout runbook > | Flag (env var prefix `ALPHASWARM_`) | Default | Activates | First needed in | | --- | --- | --- | --- | | `ORCHESTRATION_STUDIO_ENABLED` | `false` | `/workflows/*` API surface, Vite studio routes, `Workflo... # Orchestration control plane refactor — rollout runbook This is the operator-facing rollback / rollout guide for the additive ``WorkflowRuntime`` + ``OrchestrationAdapter`` stack landed by the seven phases described in [ALPHASWARM_REFACTOR_MASTER_PROMPT.md](https://github.com/julianwileymac/alphaswarm/blob/main/alphaswarm_docs/docs/archive/ALPHASWARM_REFACTOR_MASTER_PROMPT.md) and the matching cursor plan. Every change in the refactor is gated by one of the ``ALPHASWARM_ORCHESTRATION_*`` flags defined on [alphaswarm/config/settings.py](../alphaswarm/config/settings.py); **with every flag at its default ``False`` the platform behaves identically to the pre-refactor build**. The Phase 0 regression test [tests/agents/test_orchestration_flags.py](../tests/agents/test_orchestration_flags.py) enforces this — run it before flipping anything. 
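The "every flag defaults to off" guarantee can be pictured with a small reader like the one below. This is a sketch under stated assumptions, not the contents of `alphaswarm/config/settings.py`: the flag names are the boolean `ALPHASWARM_ORCHESTRATION_*` flags this runbook lists, while `read_flags`, `enabled_flags` and the set of accepted truthy strings are hypothetical.

```python
import os
from typing import Mapping

PREFIX = "ALPHASWARM_"

# Boolean flags named in this runbook's flag inventory.
BOOLEAN_FLAGS = (
    "ORCHESTRATION_STUDIO_ENABLED",
    "ORCHESTRATION_CREW_ADAPTER_ENABLED",
    "ORCHESTRATION_FUSION_ENABLED",
    "ORCHESTRATION_SCHEDULE_ENABLED",
    "ORCHESTRATION_WORKFLOW_VERSIONING_ENABLED",
    "ORCHESTRATION_KILL_PROPAGATION_ENABLED",
)

# Assumption: which strings count as "on"; the real parser may differ.
_TRUTHY = {"1", "true", "yes", "on"}


def read_flags(env: Mapping[str, str] = os.environ) -> dict[str, bool]:
    """Return every boolean flag, treating an unset variable as False."""
    return {
        name: env.get(PREFIX + name, "false").strip().lower() in _TRUTHY
        for name in BOOLEAN_FLAGS
    }


def enabled_flags(env: Mapping[str, str] = os.environ) -> list[str]:
    """List the flags that are switched on, sorted for stable output."""
    return sorted(name for name, on in read_flags(env).items() if on)
```

Calling `enabled_flags()` against the environment of a running process gives a quick pre-flip check: an empty list means the process should be in the pre-refactor configuration.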
## Flag inventory | Flag (env var prefix `ALPHASWARM_`) | Default | Activates | First needed in | | --- | --- | --- | --- | | `ORCHESTRATION_STUDIO_ENABLED` | `false` | `/workflows/*` API surface, Vite studio routes, `WorkflowSpec` registry persistence | Phase 5 | | `ORCHESTRATION_CREW_ADAPTER_ENABLED` | `false` | `CrewProcessAdapter` registration (`crewai` stays an optional import) | Phase 2 | | `ORCHESTRATION_FUSION_ENABLED` | `false` | `SignalFusionAdapter` + `WeightCentricExecutionAdapter` + `build_dialectical_with_fusion_graph` | Phase 4 | | `ORCHESTRATION_SCHEDULE_ENABLED` | `false` | `AutomationScheduleAdapter` Celery beat entry | Phase 3 | | `ORCHESTRATION_WORKFLOW_VERSIONING_ENABLED` | `false` | Snapshots `WorkflowSpec` into `workflow_spec_versions` on first run | Phase 5 | | `ORCHESTRATION_KILL_PROPAGATION_ENABLED` | `false` | Watchdog + KillSwitch UI fan halts into `WorkflowRun` rows | Phase 6 | | `ORCHESTRATION_MAX_DEBATE_ROUNDS` (int) | `2` | Hard cap enforced by `DialecticalDebateAdapter` and the graph builder | Phase 2 | | `ORCHESTRATION_HALT_CHECK_TIMEOUT_SECONDS` (float) | `1.0` | Per-transition halt-check budget in `WorkflowRuntime` | Phase 2 | The two numeric knobs are read every transition, so changing them takes effect on the next workflow step without a restart. ## Recommended rollout order 1. **Phase 0 → Phase 1**: deploy with every flag at default. Run the full pytest suite plus [tests/agents/test_orchestration_flags.py](../tests/agents/test_orchestration_flags.py) to confirm zero behavioural drift. 2. **Phase 2 (debate)**: flip `ORCHESTRATION_CREW_ADAPTER_ENABLED` if you want CrewAI-backed crew adapters to register; otherwise leave off. The bounded-debate cap is always honoured by the new graph builder kwarg regardless of this flag. 3. **Phase 3 (scheduler)**: flip `ORCHESTRATION_SCHEDULE_ENABLED` AFTER restarting Celery workers + beat. The flag controls whether `alphaswarm.tasks.celery_app` registers the beat schedule entry. 4. 
**Phase 4 (fusion)**: flip `ORCHESTRATION_FUSION_ENABLED` only after confirming the existing `risk_simulator_approves` predicate still routes correctly on a staging dataset — fusion adds a sibling pathway, the existing risk gate stays authoritative. 5. **Phase 5 (studio)**: flip `ORCHESTRATION_STUDIO_ENABLED` and `ORCHESTRATION_WORKFLOW_VERSIONING_ENABLED` together. Apply the alembic migration `0046_workflow_versioning.py` BEFORE the flag is flipped on the API process. 6. **Phase 6 (halt fan-out)**: flip `ORCHESTRATION_KILL_PROPAGATION_ENABLED` last. The KillSwitch UI keeps its existing behaviour with this flag off; turning it on adds workflow-run fan-out to the existing `/agents/halt`, `/paper/stop-all`, `/bots/halt-all`, `/rl/halt-all`, and `/quant-agents/halt` fan-out. ## Rollback recipes All rollbacks are flag-flips (no migrations, no data loss): - **Disable studio + API**: set `ALPHASWARM_ORCHESTRATION_STUDIO_ENABLED=false` and reload the API. The `/workflows/*` routes refuse new requests with `503 Service Unavailable` while the rest of the API keeps serving. - **Disable scheduler**: set `ALPHASWARM_ORCHESTRATION_SCHEDULE_ENABLED=false` and restart Celery beat. Already-running scheduled runs finish normally; no new ones are enqueued. - **Disable fusion**: set `ALPHASWARM_ORCHESTRATION_FUSION_ENABLED=false` and reload. The optional `build_dialectical_with_fusion_graph` builder refuses to compile; existing builders are unaffected. - **Disable kill fan-out**: set `ALPHASWARM_ORCHESTRATION_KILL_PROPAGATION_ENABLED=false`. The KillSwitch UI keeps its existing five halt buttons (agents / paper / bots / rl / quant-agents); the new "Halt workflows" button no-ops. - **Disable workflow versioning**: set `ALPHASWARM_ORCHESTRATION_WORKFLOW_VERSIONING_ENABLED=false`. New runs refuse to snapshot a spec hash; existing `workflow_spec_versions` rows stay readable. - **Full revert**: set every `ALPHASWARM_ORCHESTRATION_*` flag to `false`, redeploy. 
The platform behaves exactly like the pre-refactor build. The new tables (`workflow_specs`, `workflow_spec_versions`, `workflow_runs`) stay empty and add no read overhead to other routes. ## Migration safety - The single new migration `0046_workflow_versioning.py` is additive: it creates three new tables and adds no columns to existing tables. Downgrade returns the database to the `0045_pgvector_foundation` head. - The new `alphaswarm.tasks.orchestration_tasks` module appends to the Celery `include` list; cold installs without the module fail loudly at worker boot rather than silently dropping tasks. - The Vite studio bundle is code-split: routes under `alphaswarm_client/src/routes/workflows/*` lazy-load only when the user navigates there, so disabling the flag also disables the bundle download path. ## Pre-flip checklist Run before flipping any flag in production: 1. `docker exec alphaswarm-api python -m pytest tests/agents/test_orchestration_flags.py -v` 2. `docker exec alphaswarm-api python -m pytest tests/agents/test_watchdog.py -v` 3. `docker exec alphaswarm-api alembic current` — confirm head is at least `0045_pgvector_foundation`; for Phase 5+ confirm `0046_workflow_versioning`. 4. Snapshot the Redis kill-switch key (`redis-cli get $ALPHASWARM_RISK_KILL_SWITCH_KEY`) — the watchdog uses the same key so the new gate stays consistent. ## Where each layer lives - Settings flags: [alphaswarm/config/settings.py](../alphaswarm/config/settings.py) "Orchestration control plane" block. - Regression test: [tests/agents/test_orchestration_flags.py](../tests/agents/test_orchestration_flags.py). - Adapter abstraction: `alphaswarm/agents/orchestration/` (Phase 1). - Adapters: `alphaswarm/agents/orchestration/adapters/` (Phases 2-4). - DataMCP tools: `alphaswarm/data/mcp/tools/orchestration.py` + `automation.py` (Phase 3). - Celery task: `alphaswarm/tasks/orchestration_tasks.py` (Phase 3). 
- Persistence: `alphaswarm/persistence/models_workflows.py` + alembic `0046_workflow_versioning.py` (Phase 5). - API: `alphaswarm/api/routes/workflows.py` (Phase 5). - Studio UI: `alphaswarm_client/src/routes/workflows/*` (Phase 5). - Halt + watchdog hardening: `alphaswarm/tasks/agent_watchdog_tasks.py`, `alphaswarm_client/src/components/common/KillSwitch.tsx` (Phase 6). # Research Agents > - **First-order** (price / trade / performance) — `bars_daily`, `performance`. - **Second-order** (SEC, ratios, fundamentals) — `sec_filings`, `sec_xbrl`, `financial_ratios`, `earnings_call`, `news_se... # Research Agents | Spec | Module | Purpose | | --- | --- | --- | | `research.news_miner` | [alphaswarm/agents/research/news_miner.py](../alphaswarm/agents/research/news_miner.py) | Mine recent news + sentiment + regulatory flags for a symbol or topic. | | `research.equity` | [alphaswarm/agents/research/equity_researcher.py](../alphaswarm/agents/research/equity_researcher.py) | Long-form equity research synthesis with hierarchical RAG citations. | | `research.universe` | [alphaswarm/agents/research/universe_selector.py](../alphaswarm/agents/research/universe_selector.py) | Interactive stock universe shaping with RAG justification. | ## RAG layout (per the user's research-agent spec) - **First-order** (price / trade / performance) — `bars_daily`, `performance`. - **Second-order** (SEC, ratios, fundamentals) — `sec_filings`, `sec_xbrl`, `financial_ratios`, `earnings_call`, `news_sentiment`. - **Third-order** (regulatory) — `cfpb_complaints`, `fda_*`, `uspto_*`. The News Miner skews toward second + third order. The Equity Researcher walks all three. The Universe Selector pulls L0 + L1 + L2. 
## REST + tasks

```
POST /agents/research/news-miner       - async via Celery (research queue)
POST /agents/research/equity           - async via Celery
POST /agents/research/universe         - async via Celery
POST /agents/research/sync/news-miner  - synchronous variant
```

Celery wrappers live in [alphaswarm/tasks/research_tasks.py](../alphaswarm/tasks/research_tasks.py).

## Configs

YAMLs at [configs/agents/research_news_miner.yaml](../configs/agents/research_news_miner.yaml) and friends. The in-code builders return identical specs so either path works. Edit the YAML for hot reload.

# Selection Agents

> `selection.stock_selector`, implemented in [alphaswarm/agents/selection/stock_selector.py](../alphaswarm/agents/selection/stock_selector.py)

# Selection Agents

The Selection team picks the top-N tickers for a `(model, strategy, universe, agent)` quadruple. It is the bridge between the Research team's universe candidates and the Trader team's signal-emitter loop.

## Spec

`selection.stock_selector`, implemented in [alphaswarm/agents/selection/stock_selector.py](../alphaswarm/agents/selection/stock_selector.py).

## RAG

| Layer | Used for |
| --- | --- |
| L0 (`decisions`) | Past `agent_decisions` outcomes (paper RAG#0). |
| L1 (`performance`) | Recent backtest performance windows. |
| L2 (`financial_ratios`, `sec_xbrl`) | Discriminate between similar candidates. |
| Tool: `regulatory_lookup` | Tail-risk veto. |

## Memory + annotations

Every pick is persisted via `annotation` with `label="pick"` and a payload `{score, rationale, evidence, vetoed_by?}` so the optimisation analysis layer can inspect the historical edge of each combo.

## REST

```
POST /agents/selection/run          - async via Celery
POST /agents/selection/sync         - synchronous variant
GET  /agents/selection/runs         - recent runs
GET  /agents/selection/annotations  - pick rationale history
```

## YAML

[configs/agents/selection_stock_selector.yaml](../configs/agents/selection_stock_selector.yaml).
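The pick payload described under "Memory + annotations" could be modelled roughly as follows. Only the field names (`score`, `rationale`, `evidence`, `vetoed_by`) and the `"pick"` label come from the text above; the field types, the `PickAnnotation` class and the `build_annotation` helper are assumptions made for the example.

```python
from dataclasses import asdict, dataclass, field
from typing import Any, Optional


@dataclass
class PickAnnotation:
    """Hypothetical in-memory form of one stock-selector pick."""

    score: float                       # assumed numeric ranking score
    rationale: str
    evidence: list[dict[str, Any]] = field(default_factory=list)
    vetoed_by: Optional[str] = None    # optional, e.g. a tool name

    def to_payload(self) -> dict[str, Any]:
        """Serialise the pick, omitting vetoed_by when no veto happened."""
        payload = asdict(self)
        if payload["vetoed_by"] is None:
            del payload["vetoed_by"]
        return payload


def build_annotation(pick: PickAnnotation) -> dict[str, Any]:
    """Wrap a pick in the label + payload envelope."""
    return {"label": "pick", "payload": pick.to_payload()}
```

Dropping `vetoed_by` when it is unset mirrors the `vetoed_by?` notation, which marks the key as optional rather than nullable; whether the real persistence layer makes the same choice is not stated in this doc.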
# Trader Agents > [alphaswarm/agents/trader/signal_emitter.py](../alphaswarm/agents/trader/signal_emitter.py) # Trader Agents The spec-driven trader (`trader.signal_emitter`) coexists with the existing TradingAgents-style debate trader under [alphaswarm/agents/trading/](../alphaswarm/agents/trading/). The new one is deliberately simpler — one structured signal per call — so it can slot into the LangGraph pipeline and the agentic backtest loop. ## Spec [alphaswarm/agents/trader/signal_emitter.py](../alphaswarm/agents/trader/signal_emitter.py). ## RAG - **L1 / L2** — `bars_daily`, `performance`, `financial_ratios` for windowed indicator + fundamentals context. - **L0** — `decisions` for prior-trade reflection (paper RAG#0). ## Output schema ```json { "vt_symbol": "AAPL.NASDAQ", "as_of": "2026-04-27T20:00:00Z", "action": "buy" | "sell" | "hold", "confidence": 0..1, "horizon": "intraday" | "1d" | "5d" | "20d", "size_hint_pct": 0..1, "stop_loss_pct": 0..1, "take_profit_pct": 0..1, "rationale": "...", "evidence": [{"corpus": "...", "doc_id": "...", "snippet": "..."}] } ``` ## Safety - Honors the runtime kill switch (Redis key `settings.risk_kill_switch_key`); when engaged the agent MUST emit `"hold"`. - `risk_check` validates the proposed `size_hint_pct`. - Guardrail caps cost at 0.25 USD / call by default. ## REST ``` POST /agents/trader/signal — emit one signal (sync emit + task id) POST /agents/trader/sync — pure synchronous run POST /agents/trader/backtest-with-agent — kick off agentic backtest ``` ## YAML [configs/agents/trader_signal_emitter.yaml](../configs/agents/trader_signal_emitter.yaml). # Workflow Studio > | Layer | File / Path | | --- | --- | | Spec contract | [alphaswarm/agents/orchestration/spec.py](../alphaswarm/agents/orchestration/spec.py) | | Registry + persist_spec | [alphaswarm/agents/orchestration/registry_specs.p... 
# Workflow Studio The Workflow Studio is the operator-facing surface for the additive orchestration control plane introduced by the seven-phase refactor in [orchestration-refactor-rollout.md](../../concepts/agentic/orchestration-refactor-rollout.md). It composes the five existing graph builders, the three (then five) adapters, and the new hash-locked `WorkflowSpec` registry into a single replayable workflow concept. ## What ships | Layer | File / Path | | --- | --- | | Spec contract | [alphaswarm/agents/orchestration/spec.py](../alphaswarm/agents/orchestration/spec.py) | | Registry + persist_spec | [alphaswarm/agents/orchestration/registry_specs.py](../alphaswarm/agents/orchestration/registry_specs.py) | | Runtime | [alphaswarm/agents/orchestration/runtime.py](../alphaswarm/agents/orchestration/runtime.py) | | Adapter ABC + metaclass | [alphaswarm/agents/orchestration/base.py](../alphaswarm/agents/orchestration/base.py) | | Adapter registry | [alphaswarm/agents/orchestration/registry.py](../alphaswarm/agents/orchestration/registry.py) | | Adapters (5) | [alphaswarm/agents/orchestration/adapters/](../alphaswarm/agents/orchestration/adapters/) | | ORM | [alphaswarm/persistence/models_workflows.py](../alphaswarm/persistence/models_workflows.py) | | Migration | [alembic/versions/0046_workflow_versioning.py](../alembic/versions/0046_workflow_versioning.py) | | REST | [alphaswarm/api/routes/workflows.py](../alphaswarm/api/routes/workflows.py) | | Celery tasks | [alphaswarm/tasks/orchestration_tasks.py](../alphaswarm/tasks/orchestration_tasks.py) | | DataMCP tools | [alphaswarm/data/mcp/tools/orchestration.py](../alphaswarm/data/mcp/tools/orchestration.py), [alphaswarm/data/mcp/tools/automation.py](../alphaswarm/data/mcp/tools/automation.py) | | Cache entry | `workflows` category in [alphaswarm/cache/keys.py](../alphaswarm/cache/keys.py) | | Frontend routes | [alphaswarm_client/src/routes/workflows/](../alphaswarm_client/src/routes/workflows/) | | Frontend components | 
[alphaswarm_client/src/components/workflows/](../alphaswarm_client/src/components/workflows/) | ## Spec shape A workflow selects exactly one [`OrchestrationAdapter`](../alphaswarm/agents/orchestration/base.py) by alias and hands it adapter-specific params. The adapter dispatches internally — composite flows (Crew + Graph + Debate) belong inside their own adapter, not at the spec layer. ```yaml name: research.dialectical_with_fusion_v1 description: "Bull/Bear debate + fusion + weight-centric execution" adapter: LangGraphAdapter adapter_kind: graph params: builder: dialectical # one of build_* in alphaswarm/agents/graph/ builder_kwargs: max_rounds: 2 schedule: cron: "30 13 * * 1-5" timezone: UTC enabled: false # operator flips after the studio + schedule flags guardrails: cost_budget_usd: 3.0 max_calls: 60 max_duration_seconds: 900 annotations: [research, dialectical] template_target: research ``` `WorkflowSpec.snapshot_hash()` is the SHA256 of the canonical JSON form (sorted keys, no whitespace). Re-snapshotting a spec with the same hash returns the existing `workflow_spec_versions` row; changing any field inserts a NEW row (parallel to `agent_spec_versions`, `bot_versions`, `rl_experiment_versions`, `analysis_spec_versions`). ## Operator flow 1. Operator flips `ALPHASWARM_ORCHESTRATION_STUDIO_ENABLED=true` (see the rollout doc). 2. Frontend navigates to `/workflows`. List + detail render through `` so the dropdown shares the same cache invalidation path as every other entity picker. 3. Operator hits **Run** → POST `/workflows/{name}/run` → enqueues `alphaswarm.tasks.orchestration_tasks.run_workflow`. The route returns a `task_id`; the studio attaches via the existing `useLiveStream` hook for `_progress.emit` frames (rule 4). 4. Operator hits **Replay** on a historical run → POST `/workflows/runs/{run_id}/replay` re-dispatches with the captured `spec_version_id` for deterministic reproduction. 5. 
Operator hits the topbar KillSwitch's "Halt workflows" action → POST `/workflows/halt` mirrors the five canonical halt endpoints (`/agents/halt`, `/paper/stop-all`, `/bots/halt-all`, `/rl/halt-all`, `/quant-agents/halt`). ## Halt fan-out The Phase 2 `WorkflowRuntime` checks `should_halt(state)` between every adapter transition. `should_halt` is the OR of: - `has_kill_switch()` — Redis-backed global flag (the existing topbar KillSwitch flips this). - `state["halt_token"]` — per-run boolean the Phase 6 `/workflows/halt` endpoint sets on every active `WorkflowRun` row inside `ALPHASWARM_ORCHESTRATION_HALT_CHECK_TIMEOUT_SECONDS` of the API call. Long-running adapters (`CrewProcessAdapter`, `LangGraphAdapter`, `DialecticalDebateAdapter`) poll `context.is_halted()` between inner steps so the SLA holds even mid-debate. ## Adapter catalog (Phases 2-5) | alias | kind | source | when registered | | --- | --- | --- | --- | | `LangGraphAdapter` | graph | alphaswarm | always | | `CrewProcessAdapter` | crew | finrobot | always (gated invoke) | | `DialecticalDebateAdapter` | debate | tradingagents | always | | `AutomationScheduleAdapter` | schedule | daily_stock_analysis | always (gated invoke) | | `SignalFusionAdapter` | fusion | vibe_trading | always (gated invoke) | | `WeightCentricExecutionAdapter` | execution | finrl | always (gated invoke) | | `WorkflowStudioAdapter` (Phase 7) | studio | langflow | TBD | New adapters land by subclassing [`OrchestrationAdapter`](../alphaswarm/agents/orchestration/base.py) and setting `adapter_kind` + `adapter_alias`. The metaclass auto-registers them through [`alphaswarm.core.registry.register`](../alphaswarm/core/registry.py) and the shadow per-kind index in [`alphaswarm/agents/orchestration/registry.py`](../alphaswarm/agents/orchestration/registry.py). 
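The halt rule from "Halt fan-out" (global kill switch OR per-run `halt_token`, checked between transitions) can be sketched as below. This is an illustration only: the documented `should_halt(state)` takes just the state, whereas this version receives the kill-switch check as an injected callable so it runs without Redis, and `run_transitions` is a hypothetical driver loop rather than the `WorkflowRuntime`.

```python
from typing import Any, Callable, Iterable, Mapping

State = dict[str, Any]
Step = Callable[[State], State]


def should_halt(state: Mapping[str, Any], has_kill_switch: Callable[[], bool]) -> bool:
    """True when the global kill switch or the per-run halt token is set."""
    return bool(has_kill_switch()) or bool(state.get("halt_token", False))


def run_transitions(
    steps: Iterable[Step],
    state: State,
    has_kill_switch: Callable[[], bool],
) -> State:
    """Run steps in order, checking the halt rule before each one."""
    for step in steps:
        if should_halt(state, has_kill_switch):
            # Stop before the next transition and mark the run as halted.
            return {**state, "halted": True}
        state = step(state)
    # Every step ran without a halt check tripping beforehand.
    return {**state, "halted": False}
```

Because the check sits in front of every step, a `halt_token` set by one step prevents the next step from starting; a long step still has to poll on its own, which is why the long-running adapters above check `context.is_halted()` internally.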
## Audit trail Every run produces: - A `workflow_runs` row (one per run) with `spec_version_id`, `inputs`, `final_state`, `breadcrumbs`, `experiment_id`, `test_id` (rule 34), `cost_usd`, `duration_ms`, `status`, `halted`, `error`. - A series of `_progress.emit` frames the studio streams live through `useLiveStream` (frame shape per rule 4). - Per-adapter `node_span` OTEL spans emitted by [`alphaswarm/agents/observability.py`](../alphaswarm/agents/observability.py). - Optional `agent_runs_v2` rows for each inner `AgentRuntime` call the wrapped adapter makes. ## Replay semantics `POST /workflows/runs/{run_id}/replay` looks up the matching `workflow_runs` row, hydrates the frozen `workflow_spec_versions.payload`, and re-dispatches with the same inputs. Replay produces a NEW `workflow_runs` row tagged with the original run's id in `parent_run_id` so the trace lineage stays intact. ## See also - [orchestration-refactor-rollout.md](../../concepts/agentic/orchestration-refactor-rollout.md) — operator runbook + per-flag rollback. - [multi-agent-patterns.md](../../concepts/agentic/multi-agent-patterns.md) — the seven adapter topologies (Phase 7 docs update). - [data-mcp.md](../../concepts/data/data-mcp.md) — `data.orchestration.*` and `data.automation.*` tool catalog. - [agentic-development.md](../../concepts/agentic/agentic-development.md) — where `WorkflowSpec` sits in the four-runtime + skill-artifact framework. # Bi-temporal PermissionedDataPoint > Four-timestamp model + invalidated_by_edge_id (Graphiti-style invalidation). 
# Bi-temporal `PermissionedDataPoint`

Every node and every edge in the KB carries the same envelope:

```python
class TemporalRange(BaseModel):
    valid_from: datetime            # event time start
    valid_to: Optional[datetime]    # event time end (None = still true)
    created_at: datetime            # system time start
    expired_at: Optional[datetime]  # system time end (None = active)


class PermissionedDataPoint(BaseModel):
    id: UUID
    type: str = "PermissionedDataPoint"
    temporal: TemporalRange
    acl: ACL                    # owner + role-based + ABAC + ReBAC anchors
    provenance: Provenance      # dataset_id + data_id + extractor chain
    layer: LayerMembership      # PRIVATE / HIERARCHICAL / MARKETPLACE / GLOBAL
    index_fields: list[str]     # which fields feed the vector embedding
    properties: dict[str, Any]
```

## Two timelines

Following the Graphiti / Zep four-timestamp model:

| Pair | Tracks | Closes when |
| --- | --- | --- |
| `valid_from` / `valid_to` | Real-world event time | Fact stops being true |
| `created_at` / `expired_at` | System ingest time | Fact is logically invalidated |

A contradicted edge **closes** `valid_to` (and optionally `expired_at` + `invalidated_by_edge_id`); it is never deleted. This preserves the timeline for `as_of=` queries.

## Provenance chain

`Provenance` carries `dataset_id` + `data_id` + the extractor chain (`["spacy", "gliner", "llm"]`) + the pipeline run id. When a tenant requests targeted forgetting (GDPR / CCPA), `KBRuntime.forget` locates rows by dataset/data id and closes their validity window.

## ACL envelope

The `ACL` block carries:

- `owner_principal_id` + `owner_tenant_id` (RBAC anchor).
- `roles_read` / `roles_write` / `roles_delete` (RBAC).
- `abac_tags` (ABAC: region, classification, time-of-day, ...).
- `rebac_anchor_ids` (OpenFGA tuple keys like `document:abc#viewer`).
- `deny_principal_ids` (explicit denial list).

`DefaultPermissionResolver` (in [`kb-permissions.md`](kb-permissions.md)) fuses all four into a single per-request `AccessBitmap`.
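
Returning to the two timelines: an `as_of` lookup can be sketched as a predicate over the four timestamps. This is a minimal sketch, assuming half-open intervals (`start <= t < end`) and using a plain dataclass as a stand-in for the Pydantic `TemporalRange`; `TemporalRangeSketch`, `_within`, and `is_visible` are hypothetical names, and the real query path may treat boundaries differently.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Optional


@dataclass(frozen=True)
class TemporalRangeSketch:
    valid_from: datetime                   # event time start
    created_at: datetime                   # system time start
    valid_to: Optional[datetime] = None    # None = still true
    expired_at: Optional[datetime] = None  # None = still active


def _within(start: datetime, end: Optional[datetime], t: datetime) -> bool:
    """Half-open interval check: start <= t < end (open-ended when end is None)."""
    return start <= t and (end is None or t < end)


def is_visible(r: TemporalRangeSketch, as_of: datetime, known_at: datetime) -> bool:
    """True when the fact held at `as_of` according to what the system stored at `known_at`."""
    return _within(r.valid_from, r.valid_to, as_of) and _within(r.created_at, r.expired_at, known_at)


utc = timezone.utc
ceo_edge = TemporalRangeSketch(
    valid_from=datetime(2020, 1, 1, tzinfo=utc),
    created_at=datetime(2020, 1, 2, tzinfo=utc),
    valid_to=datetime(2024, 1, 1, tzinfo=utc),  # closed by a contradicting edge, not deleted
)
```

Because the contradicted edge is closed rather than deleted, a query with `as_of` inside 2020 to 2023 still finds it, while a query dated after `valid_to` does not.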
## Bi-temporal merge in the composer `KBLayerComposer.compose_recall` collects hits across layers (private > hierarchical > marketplace > global), then applies the precedence-aware bi-temporal merger: 1. Group hits by entity `id`. 2. The first occurrence (highest precedence) wins. 3. Lower-precedence hits get appended to `metadata.dissenting_layers` so the UI can surface them transparently. 4. `valid_from`/`valid_to` are preserved on every hit so a downstream `as_of` reconstruction is lossless. # Marketplace federation (`alphaswarm_kb_federation`) > Cross-silo recall reverse-proxy with OpenFGA + signed share tokens + bi-temporal merge. # `alphaswarm_kb_federation` The federation gateway is a standalone FastAPI service that brokers cross-silo recall. It is the only sanctioned cross-silo recall path (hard rule 60). ## Why a separate service - The federation logic is fundamentally stateless except for the result cache + OpenFGA Watch subscriber. Running it as a sidecar to `alphaswarm_kb` would couple lifecycle with the monolith; running it standalone lets it scale horizontally on its own. - Cross-silo traffic crosses trust boundaries (subscriber tenant → source tenant). Keeping the broker in its own process makes the trust boundary explicit and audit-friendly. - The CI guard [`check_alphaswarm_kb_federation_no_alphaswarm.py`](https://github.com/alphaswarm/alphaswarm/blob/main/scripts/ci/check_alphaswarm_kb_federation_no_alphaswarm.py) enforces `no_alphaswarm_imports` so the boundary cannot drift. ## Sequence ``` subscriber silo federation gateway source silo ───────────────── ────────────────── ─────────── POST /federation/recall ─────▶ 1. OpenFGA `check` (visible?) 2. mint signed share token (HS256/RS256, 600s) 3. POST /kb/corpora/.../recall ──▶ verify share token return hits 4. BitemporalMerger.merge_layers 5. 
cache + audit ◀───────── ComposedResult ``` ## Subscription writer `POST /federation/subscriptions` writes the matching OpenFGA tuple + emits a `subscription.granted` event on the NATS / Redis Pub/Sub bus that subscribers consume to flush bitmap caches. Step-up MFA gates every subscription mutation per AlphaSwarm rule 52. ## Caching - Per-`(subscriber_tenant, cache_key)` Redis namespace under `alphaswarm:kb:federation:*`. - 60s default TTL. - Cache miss + upstream call budget: 5s default. The gateway aims for ≤250ms p95 federation overhead on a warm cache. ## Deployment | Surface | Where | | --- | --- | | Multi-arch Dockerfile | [`alphaswarm_kb_federation/deployments/docker/Dockerfile`](https://github.com/alphaswarm/alphaswarm/blob/main/alphaswarm_kb_federation/deployments/docker/Dockerfile) | | Helm chart | [`alphaswarm_kb_federation/deployments/kubernetes/helm/alphaswarm-kb-federation/`](https://github.com/alphaswarm/alphaswarm/tree/main/alphaswarm_kb_federation/deployments/kubernetes/helm/alphaswarm-kb-federation) | | Docker Compose (local) | [`alphaswarm_kb_federation/deployments/compose/docker-compose.federation.yml`](https://github.com/alphaswarm/alphaswarm/blob/main/alphaswarm_kb_federation/deployments/compose/docker-compose.federation.yml) | | Terraform module | [`alphaswarm_platform/terraform/modules/kb_marketplace_federation/`](https://github.com/alphaswarm/alphaswarm/tree/main/alphaswarm_platform/terraform/modules/kb_marketplace_federation) | ## Hard rules it enforces - Hard rule 60: cross-silo recall goes through this service only. - Hard rule 26: every upstream call mints its own M2M token via `CredentialResolver`. - Hard rule 52: step-up MFA on subscription admin endpoints. - Hard rule 49 (no-token-passthrough): the share token's `aud` claim is bound to the source silo; passthrough across audiences is rejected at the verifier. 
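
The audience binding behind hard rule 49 can be sketched with an HS256 share token built from the standard library only. This is an illustrative sketch, not the gateway's implementation: `mint_share_token` and `verify_share_token` are hypothetical helpers, the claim set is an assumption, the RS256 path and `CredentialResolver` are not modelled, and a production service would more likely rely on a maintained JWT library.

```python
import base64
import hashlib
import hmac
import json
import time


def _b64url(data: bytes) -> str:
    return base64.urlsafe_b64encode(data).rstrip(b"=").decode("ascii")


def _b64url_decode(text: str) -> bytes:
    return base64.urlsafe_b64decode(text + "=" * (-len(text) % 4))


def mint_share_token(secret: bytes, subscriber: str, source_silo: str,
                     ttl_s: int = 600, now=None) -> str:
    """Mint a compact HS256 token whose `aud` is bound to the source silo."""
    now = int(time.time()) if now is None else int(now)
    header = {"alg": "HS256", "typ": "JWT"}
    claims = {"sub": subscriber, "aud": source_silo, "iat": now, "exp": now + ttl_s}
    signing_input = ".".join(
        _b64url(json.dumps(part, separators=(",", ":"), sort_keys=True).encode("utf-8"))
        for part in (header, claims)
    )
    sig = hmac.new(secret, signing_input.encode("ascii"), hashlib.sha256).digest()
    return f"{signing_input}.{_b64url(sig)}"


def verify_share_token(secret: bytes, token: str, expected_aud: str, now=None) -> dict:
    """Return the claims, or raise ValueError on any failed check."""
    now = int(time.time()) if now is None else int(now)
    parts = token.split(".")
    if len(parts) != 3:
        raise ValueError("malformed token")
    header_b64, claims_b64, sig_b64 = parts
    if json.loads(_b64url_decode(header_b64)).get("alg") != "HS256":
        raise ValueError("unexpected algorithm")
    expected = hmac.new(secret, f"{header_b64}.{claims_b64}".encode("ascii"),
                        hashlib.sha256).digest()
    if not hmac.compare_digest(expected, _b64url_decode(sig_b64)):
        raise ValueError("bad signature")
    claims = json.loads(_b64url_decode(claims_b64))
    if claims.get("aud") != expected_aud:  # no passthrough across audiences
        raise ValueError("audience mismatch")
    if now >= claims.get("exp", 0):
        raise ValueError("token expired")
    return claims
```

The signature is checked before any claim is trusted, and the audience comparison is what stops a token minted for one source silo from being replayed against another.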
# KB permissions — AccessBitmap + OpenFGA + OPA + Cedar

> Hybrid ReBAC + ABAC stack materialised into a per-request AccessBitmap.

# KB permissions

## Hybrid stack

| Layer | Provider | What it answers |
| --- | --- | --- |
| RBAC | `ACL.roles_*` + Membership rows | "Is the user an editor of this corpus?" |
| ABAC | `IPolicyEngine` (default: OPA; opt-in: Cedar) | "Does the user's region == EU and the resource's classification ≤ user's clearance?" |
| ReBAC | `IACLEvaluator` (default: Native; opt-in: OpenFGA / SpiceDB / Permify) | "Does the user inherit access via a chain of group / org / dataset / subscription relations?" |

## AccessBitmap

`DefaultPermissionResolver.materialize_bitmap` produces a per-request `AccessBitmap`:

```python
class AccessBitmap(BaseModel):
    visible_node_ids: set[UUID]
    visible_edge_ids: set[UUID]
    excluded_node_ids: set[UUID]
    field_redactions: dict[UUID, set[str]]
    residual_cypher: Optional[str]  # OPA partial-eval residual
    residual_sql: Optional[str]
    computed_at_iso: Optional[str]
    cache_key: Optional[str]
```

The bitmap is built by:

1. Calling `IACLEvaluator.list_objects(principal_id, action, "node", tenant_id)` → set of visible node UUIDs (OpenFGA `list-objects`).
2. Calling `IPolicyEngine.partial_evaluate(action, "node", ctx)` → residual Cypher / SQL fragment (OPA `compile`).
3. Caching the result for 60s by `(tenant, principal, action, anchor_hash)`.
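
The 60s cache in step 3 can be sketched as a small TTL map keyed by the `(tenant, principal, action, anchor_hash)` tuple. This is an in-process illustration only: `TTLCache` and its methods are hypothetical, the injectable `clock` exists purely to make expiry testable, and the real resolver may well keep this cache in Redis instead.

```python
import time
from typing import Any, Callable, Hashable


class TTLCache:
    """Minimal time-to-live cache; entries expire `ttl_s` seconds after being computed."""

    def __init__(self, ttl_s: float = 60.0, clock: Callable[[], float] = time.monotonic):
        self._ttl_s = ttl_s
        self._clock = clock
        self._entries: dict[Hashable, tuple[float, Any]] = {}

    def get_or_compute(self, key: Hashable, compute: Callable[[], Any]) -> Any:
        now = self._clock()
        hit = self._entries.get(key)
        if hit is not None and now < hit[0]:
            return hit[1]  # still fresh: skip the evaluator + policy-engine calls
        value = compute()
        self._entries[key] = (now + self._ttl_s, value)
        return value

    def invalidate(self, predicate: Callable[[Hashable], bool]) -> int:
        """Drop every entry whose key matches; returns how many were removed."""
        stale = [k for k in self._entries if predicate(k)]
        for k in stale:
            del self._entries[k]
        return len(stale)
```

A caller would wrap the two expensive steps in `compute`, for example `cache.get_or_compute((tenant, principal, action, anchor_hash), build_bitmap)`, and use `invalidate` when a permission change should flush one tenant's entries early.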
## Projection into store-native filters | Store | How the bitmap shows up | | --- | --- | | Graph (Cypher) | `WHERE n.id IN $visible_node_ids AND r.id IN $visible_edge_ids AND (${residual_cypher})` | | Vector (payload filter) | `{"tenant_id": {"$eq": "..."}, "id": {"$in": [...]}}` | | Relational (RLS) | Session GUCs `app.current_tenant_id` + `app.current_workspace_id` + `app.visible_node_ids` | ## OpenFGA authorization model The bundled [`authorization_model.fga`](https://github.com/alphaswarm/alphaswarm/blob/main/alphaswarm_kb/configs/policies/openfga/authorization_model.fga) defines the canonical types: ``` type tenant relations define member: [user] define admin: [user] define parent: [tenant] type corpus relations define owner_tenant: [tenant] define editor: [user] or admin from owner_tenant define viewer: [user] or editor or member from owner_tenant define subscriber: [tenant] type dataset relations define parent_corpus: [corpus] define editor: editor from parent_corpus define viewer: viewer from parent_corpus or subscriber from parent_corpus define subscriber: [tenant] ``` ## OPA policy bundle The bundled [`authz.rego`](https://github.com/alphaswarm/alphaswarm/blob/main/alphaswarm_kb/configs/policies/opa/authz.rego) implements default-deny with role-based + region-lock + classification gates. Bundles are signed and served from `s3://alphaswarm-kb-opa-bundles/` (or the Azure / GCP equivalents) and pulled by OPA every 30-120s. ## Cedar (optional) Cedar is the optional `IPolicyEngine` adapter for tenants requiring formal verification. Activate by setting `KBCorpusSpec.acl.policy_alias = "cedar"`. # KBRuntime + KBCorpusSpec > The single sanctioned executor for KB lifecycle. 
# KBRuntime + KBCorpusSpec

## Hash-locked spec

`KBCorpusSpec` is a Pydantic v2 model that describes a single corpus: the memory engine alias + kwargs, vector / graph / relational store aliases, ACL evaluator + policy engine, layer scope, extraction knobs, retention policy, and an optional Iceberg namespace for gold-tier mirroring.

```yaml
name: research_papers
tenant_id: 00000000-0000-0000-0000-000000000010
memory_engine: { kb_alias: hierarchical_rag }
vector_store: { kb_alias: pgvector, collection: research_papers }
graph_store: { kb_alias: neo4j }
acl: { evaluator_alias: native, policy_alias: opa }
layer: { scope: private, marketplace_publishable: false }
extraction: { enable_spacy: true, enable_llm: true }
retention: { soft_delete_after_days: 90, hard_delete_after_days: 1095 }
```

The SHA-256 of the canonical JSON dump (sorted keys, UTF-8) anchors the immutable `kb_corpus_spec_versions` row. Re-snapshotting via `registry.persist_spec(spec)` inserts a new version row when the hash changes (hard rule 57). The previous version stays for replay / audit.

## KBRuntime

`KBRuntime.execute(req, ctx)` is the only sanctioned path through which the five lifecycle actions (`remember`, `recall`, `compose_recall`, `improve`, `forget`) run:

```python
from alphaswarm_kb.runtime import KBRunRequest, runtime_for

runtime = runtime_for("research_papers")
result = await runtime.execute(
    KBRunRequest(action="recall", corpus_name="research_papers", payload="What is GraphRAG?", top_k=5),
    tenant_ctx,
)
```

Every call:

1. Halts if the kill-switch flag is set (`trigger_halt()` → `kb_runs` row with `status="halted"`).
2. Snapshots the spec via `persist_spec` so the resulting `kb_runs` row references the immutable spec version.
3. Resolves the `IMemoryEngine` via the composition-root container.
4. Executes the requested action.
5. Writes the `kb_runs` row carrying `experiment_id` + `test_id` (rule 34) and the elapsed-ms / status / error envelope.
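
The hash-locking described under "Hash-locked spec" can be sketched in a few lines. This is a minimal sketch, assuming the canonical form is compact JSON with sorted keys encoded as UTF-8; `snapshot_hash`, `persist_spec_sketch`, and the in-memory `_versions` dict are hypothetical stand-ins for the real registry and the `kb_corpus_spec_versions` table.

```python
import hashlib
import json

# Hypothetical stand-in for kb_corpus_spec_versions: hash -> frozen payload.
_versions: dict[str, dict] = {}


def snapshot_hash(spec: dict) -> str:
    """SHA-256 over a canonical JSON dump (sorted keys, no extra whitespace, UTF-8)."""
    canonical = json.dumps(spec, sort_keys=True, separators=(",", ":"), ensure_ascii=False)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def persist_spec_sketch(spec: dict) -> tuple[str, bool]:
    """Return (hash, inserted). An unchanged spec maps onto the existing version row."""
    digest = snapshot_hash(spec)
    inserted = digest not in _versions
    if inserted:
        # Deep-copy through JSON so later mutation of `spec` cannot alter the stored row.
        _versions[digest] = json.loads(json.dumps(spec))
    return digest, inserted
```

Sorting keys is what makes the hash independent of field order in the YAML file, so only a real change to a value produces a new version row.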
## Wrappers - **Celery**: `alphaswarm_kb.tasks.kb_tasks.{remember_async,recall_async,improve_async,forget_async,evaluate_async,compose_recall_async}` wrap `KBRuntime.execute` with `_progress.emit` (rule 4). - **REST**: [`POST /kb/corpora/{name}/{remember,recall,improve,forget}`](https://github.com/alphaswarm/alphaswarm/blob/main/alphaswarm_kb/src/alphaswarm_kb/api/routes/kb.py) mounted at `/kb` by the monolith FastAPI app. - **DataMCP**: `data.kb.*` tools (rule 59) — the only path agents should use. - **WebSocket**: `/kb/corpora/{name}/recall/stream` for live recall streaming. ## Halt + kill-switch `POST /kb/halt` sets the in-process halt flag. Any subsequent `KBRuntime.execute` call raises `HaltedError` and writes a `kb_runs` row with `status="halted"`. The topbar `KillSwitch` component fans out to `/kb/halt` alongside every other halt endpoint. # Silo-per-tenant IaC (Terragrunt + cloud-parallel modules) > One Terragrunt unit per tenant; cloud-parallel Terraform modules with identical outputs. # Silo-per-tenant IaC Section G of the alphaswarm_kb blueprint implemented as a Terragrunt tree under [`alphaswarm_platform/terragrunt/`](https://github.com/alphaswarm/alphaswarm/tree/main/alphaswarm_platform/terragrunt). 
## Layout ``` alphaswarm_platform/ ├── terraform/modules/ │ ├── tenant_kb_silo/ # canonical wrapper (dispatches by var.cloud) │ ├── tenant_kb_silo_aws/ # AWS: ECS Fargate + RDS + S3 + KMS │ ├── tenant_kb_silo_azure/ # Azure: ACA + Flex Postgres + Blob + Key Vault │ ├── tenant_kb_silo_gcp/ # GCP: Cloud Run + Cloud SQL + GCS + KMS │ ├── kb_global_corpus/ # central read-only stack + CDN │ ├── kb_marketplace_federation/ # federation gateway + OpenFGA + NATS │ ├── kb_identity_pool/ # OpenFGA Postgres + OPA bundle bucket │ └── kb_global_observability/ # OTEL collector └── terragrunt/ ├── terragrunt.hcl # root backend + provider generators ├── _envcommon/ # shared inputs (networking, observability) ├── global/prod/terragrunt.hcl # kb_global_corpus ├── marketplace/prod/terragrunt.hcl # kb_marketplace_federation ├── identity_pool/prod/terragrunt.hcl # kb_identity_pool └── tenants/_template/ # copy → tenants// to onboard ``` ## Identical outputs Every cloud-parallel sibling exposes the SAME outputs so the Python adapters never branch on cloud: | Output | Description | | --- | --- | | `relational_dsn` | Postgres DSN for `kb_corpora` + `kb_runs` + `kb_silo_registry`. | | `vector_endpoint` | pgvector / Qdrant / Cognitive Search endpoint. | | `graph_endpoint` | Neo4j / Kuzu / Neptune endpoint. | | `container_runtime` | ECS Fargate / ACA / Cloud Run identifier. | | `object_store_uri` | S3 / Blob / GCS bucket URI. | | `kms_key_id` | Per-tenant CMK identifier. | ## Onboarding a tenant ```bash T=acme-corp mkdir -p alphaswarm_platform/terragrunt/tenants/${T}/prod cp -r alphaswarm_platform/terragrunt/tenants/_template/* \ alphaswarm_platform/terragrunt/tenants/${T}/ # Edit tenants/${T}/tenant.hcl with the real UUID, cloud, region. 
# Production path goes through alphaswarm-cli (runs server-side via # TerraformRuntime per rule 42; lands a workload_runs row + a # terraform_runs row): alphaswarm-cli kb tenant onboard ${T} --cloud aws --region us-east-1 # Break-glass operator path (skips audit; for ops emergencies only): terragrunt run-all init --terragrunt-working-dir alphaswarm_platform/terragrunt/tenants/${T}/prod terragrunt run-all apply --terragrunt-working-dir alphaswarm_platform/terragrunt/tenants/${T}/prod ``` ## Per-tenant state isolation Each tenant has its own state file under the configured backend: - S3: `s3://alphaswarm-kb-tfstate-prod/tenants//prod/terraform.tfstate` - Azure Blob: `alphaswarm-kb-state/tenants//prod/terraform.tfstate` - GCS: `gs://alphaswarm-kb-tfstate-prod/tenants//prod/terraform.tfstate` For regulated tenants, swap the per-tenant backend block to assume a dedicated cloud account/subscription role so physical isolation matches the silo logical boundary. ## Offboarding ```bash alphaswarm-cli kb tenant offboard ${T} # wait for kb_runs to drain terragrunt run-all destroy --terragrunt-working-dir alphaswarm_platform/terragrunt/tenants/${T}/prod ``` `cognee.forget --tenant ${T} --hard` runs first so per-tenant data is purged before the underlying storage tears down. # AlphaSwarm Knowledge Base (`alphaswarm_kb`) > Boundary-package overview for the cognitive-memory layer. # AlphaSwarm Knowledge Base The `alphaswarm_kb` boundary owns AlphaSwarm's cognitive-memory layer. It extracts the historical `alphaswarm/rag/` (HierarchicalRAG) and `alphaswarm/llm/memory.py` (RedisHybridMemory) modules into a Clean-Architecture package with a pluggable adapter trinity for memory engines, vector stores, graph stores, ACL evaluators, and policy engines. ## Sub-docs - [kb-runtime.md](kb-runtime.md) — `KBRuntime` + hash-locked `KBCorpusSpec` + `kb_runs` ledger. 
- [memory-engines.md](memory-engines.md) — IMemoryEngine adapter trinity (HierarchicalRAG default; Cognee / Graphiti / Mem0 / Letta / LlamaIndex opt-in). - [bi-temporal-graph.md](bi-temporal-graph.md) — `PermissionedDataPoint` + four-timestamp model + `invalidated_by_edge_id` Graphiti-style edge invalidation. - [layer-composition.md](layer-composition.md) — Four-scope precedence + bi-temporal merge. - [kb-permissions.md](kb-permissions.md) — `AccessBitmap` + OpenFGA + OPA + Cedar hybrid stack. - [kb-federation.md](kb-federation.md) — Cross-silo marketplace federation reverse-proxy. - [kb-silo-iac.md](kb-silo-iac.md) — Terragrunt unit-per-tenant + cloud-parallel modules. - [rag.md](rag.md) — Extracted hierarchical RAG (Alpha-GPT four-level). - [pgvector-control-plane.md](pgvector-control-plane.md) — pgvector default vector store + `data.vector.*` MCP tools. - [research-papers-rag.md](research-papers-rag.md) — Math-aware paper ingest + hybrid retrieval. ## At a glance | Concern | Where | | --- | --- | | Runtime | `alphaswarm_kb.runtime.KBRuntime` (single executor, rule 56) | | Spec | `alphaswarm_kb.spec.KBCorpusSpec` (hash-locked, rule 57) | | Registry | `alphaswarm_kb.registry.persist_spec` → `kb_corpus_spec_versions` | | Composition root | `alphaswarm_kb.composition_root.build_default_container` | | Domain ports | `alphaswarm_kb.domain.ports.*` (zero framework imports) | | Bi-temporal envelope | `alphaswarm_kb.domain.models.permissioned_datapoint.PermissionedDataPoint` | | Adapter metaclass | `alphaswarm_kb.domain.ports.base.KBAdapterMeta` (rule 58) | | Agent surface | `data.kb.*` DataMCP tools (rule 59) | | Federation | `alphaswarm_kb_federation/` standalone reverse-proxy (rule 60) | ## Why the boundary See [ADR-014](../../architecture/decisions/014-knowledge-base-boundary.md) for the full rationale. ## Hard rules - **56**: All KB lifecycle goes through `KBRuntime`. - **57**: `kb_corpus_spec_versions` rows are immutable. 
- **58**: Adapters register via `KBAdapterMeta`. - **59**: Agents read KB only through `data.kb.*` tools. - **60**: Cross-silo recall goes through `alphaswarm_kb_federation` only. ## Migration Legacy `alphaswarm.rag.*` and `alphaswarm.llm.memory` import paths keep working through `DeprecationWarning` shims for one release cycle. New code imports from `alphaswarm_kb.rag.*` and `alphaswarm_kb.memory.*` directly. ## Deprecations - **Kuzu graph store** — upstream archived October 2025. The `alphaswarm_kb` extra is `kuzu-deprecated`; the adapter warns on import and will be removed after one release cycle. Migrate `KBCorpusSpec.graph_store.kb_alias` to `neo4j` (default), `falkordb`, or `memgraph`. See [memory-engines.md](memory-engines.md#graph-store-note-kuzu-is-deprecated). # KBLayerComposer — four-scope precedence > Private > hierarchical > marketplace > global with bi-temporal merge. # KBLayerComposer `KBLayerComposer.compose_recall` composes recall across the four canonical layer scopes: | Scope | Source | Precedence | | --- | --- | --- | | `PRIVATE` | The tenant's own silo | 0 (highest) | | `HIERARCHICAL` | Parent organisation (read-only, replicated) | 1 | | `MARKETPLACE` | Subscribed external corpora (federated) | 2 | | `GLOBAL` | Curated read-only platform corpus | 3 (lowest) | Smaller `precedence` wins. ## Resolution flow 1. `resolve_layers(ctx)` returns the active `LayerHandle` list for the tenant. Defaults to `[PRIVATE]`; populates the other three when the tenant's `kb_subscriptions` rows + parent-org link + global-corpus replication offsets are set. 2. For each layer, fan out the recall: - `PRIVATE` runs against the tenant's own `IMemoryEngine` (default `HierarchicalRAGAdapter`). - `HIERARCHICAL` / `MARKETPLACE` / `GLOBAL` run against the `alphaswarm_kb_federation` reverse-proxy when the layer has a `federation_endpoint`. 3. Apply `BitemporalMerger.merge_layers` to dedupe by entity id with precedence-aware ordering. 
Losers land in `metadata.dissenting_layers`. 4. Re-rank by `(precedence, score, recency)`. ## Conflict resolution | Conflict | Resolution | | --- | --- | | Same entity, different value | Higher-precedence layer wins; loser exposed in `dissenting_layers`. | | Temporal disagreement (`valid_to`/`valid_from` overlap) | Both kept; downstream `as_of` reconstructs the timeline. | | Edge contradiction (new edge supersedes old) | Old edge's `expired_at = now()`; not deleted. | ## Caching + invalidation - Per-subscriber result cache keyed by `(subscriber_tenant, source_tenant, dataset, query_hash)` with a 60s TTL (default). - OpenFGA Watch events for `subscription.{granted,revoked,updated}` flush impacted cache entries via the `alphaswarm:kb:bitmap` Redis pub/sub channel. ## When NOT to use compose_recall - Single-corpus recall against your tenant's own private corpus → use `data.kb.recall` (faster, no federation overhead). - Bulk re-indexing / improve / forget — these are always per-corpus. # IMemoryEngine adapter trinity > HierarchicalRAG default plus Cognee / Graphiti / Mem0 / Letta / LlamaIndex opt-in. # IMemoryEngine adapter trinity `IMemoryEngine` is the vendor-neutral memory control plane. Cognee's v1.0 surface (`remember` / `recall` / `improve` / `forget`) is the canonical contract; every adapter translates at its boundary. 
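
The four-verb contract can be pictured as a structural interface plus a toy adapter. Everything here is an assumption made for illustration: the method signatures, the synchronous style, and the `MemoryEngineSketch` and `InMemoryEngine` names are hypothetical, and `improve` is reduced to a placeholder deduplication pass rather than what any real engine does.

```python
from typing import Protocol


class MemoryEngineSketch(Protocol):
    """Hypothetical shape of the remember / recall / improve / forget surface."""

    def remember(self, corpus: str, text: str) -> str: ...
    def recall(self, corpus: str, query: str, top_k: int = 5) -> list[str]: ...
    def improve(self, corpus: str) -> int: ...
    def forget(self, corpus: str, item_id: str) -> bool: ...


class InMemoryEngine:
    """Toy adapter: keyword-overlap recall over a dict, satisfying the protocol structurally."""

    def __init__(self) -> None:
        self._items: dict[str, dict[str, str]] = {}
        self._next_id = 0

    def remember(self, corpus: str, text: str) -> str:
        self._next_id += 1
        item_id = f"{corpus}-{self._next_id}"
        self._items.setdefault(corpus, {})[item_id] = text
        return item_id

    def recall(self, corpus: str, query: str, top_k: int = 5) -> list[str]:
        terms = set(query.lower().split())
        scored = [
            (len(terms & set(text.lower().split())), text)
            for text in self._items.get(corpus, {}).values()
        ]
        ranked = sorted((s for s in scored if s[0] > 0), key=lambda s: -s[0])
        return [text for _, text in ranked[:top_k]]

    def improve(self, corpus: str) -> int:
        # Placeholder maintenance pass: drop exact duplicates, report how many were removed.
        items = self._items.get(corpus, {})
        seen: set[str] = set()
        drop = []
        for item_id, text in items.items():
            if text in seen:
                drop.append(item_id)
            else:
                seen.add(text)
        for item_id in drop:
            del items[item_id]
        return len(drop)

    def forget(self, corpus: str, item_id: str) -> bool:
        return self._items.get(corpus, {}).pop(item_id, None) is not None
```

The point of the sketch is the boundary: callers depend only on the four verbs, so swapping the backing engine does not change call sites.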
## Comparison | Adapter | `kb_alias` | Extras | Primary strength | Trade-off | | --- | --- | --- | --- | --- | | `HierarchicalRAGAdapter` | `hierarchical_rag` | none (default) | 4-level Alpha-GPT hierarchy + Reciprocal Rank Fusion + RAPTOR summaries | AlphaSwarm-native; not bi-temporal | | `CogneeMemoryEngine` | `cognee` | `[cognee]` | Tri-store (graph + vector + relational) + native EBAC + multimodal ingest | Heavy dep; LanceDB+Kuzu only for native ACL | | `GraphitiMemoryEngine` | `graphiti` | `[graphiti]` | Bi-temporal edges, sub-300ms p95, no runtime LLM calls | Neo4j only | | `Mem0MemoryEngine` | `mem0` | `[mem0]` | User-centric personalisation, 12-layer cognitive memory | Less structural extraction | | `LettaMemoryEngine` | `letta` | `[letta]` | Full agent runtime integration | Heavy; not pure memory | | `LlamaIndexMemoryEngine` | `llamaindex` | `[llamaindex]` | General-purpose vector backbone, big plugin ecosystem | No native temporal model | ## Choosing an engine Default to `hierarchical_rag` unless a corpus has a specific need: - **Bi-temporal facts** that change over time (CEO succession, deal status) → `graphiti`. - **User-scoped personalisation** that needs cross-session identity → `mem0`. - **Multimodal pipelines** with heavy LLM-driven entity extraction + cross-store coherence → `cognee`. - **General-purpose document QA** with the LlamaIndex plugin ecosystem → `llamaindex`. ## Switching engines Set `KBCorpusSpec.memory_engine.kb_alias` and re-snapshot. The `KBRuntime` picks up the new adapter on the next call. The previous spec version stays in `kb_corpus_spec_versions` so any in-flight recall against the old version can replay. ## Graph-store note: Kuzu is deprecated Upstream Kuzu was archived in October 2025 and receives no further releases. The `alphaswarm_kb` extra was renamed `kuzu` -> `kuzu-deprecated`, and importing `alphaswarm_kb.infrastructure.adapters.graph.kuzu_deprecated` emits a `DeprecationWarning`. 
New corpora MUST choose `neo4j` (default), `falkordb`, or `memgraph` for `KBCorpusSpec.graph_store.kb_alias`; the deprecated extra exists only to keep legacy corpora readable during migration and will be removed after one release cycle. This also constrains `CogneeMemoryEngine`'s native-ACL tri-store mode (LanceDB + Kuzu) — prefer the OpenFGA/OPA permission stack instead ([kb-permissions.md](kb-permissions.md)). ## Adding a new adapter 1. Subclass `IMemoryEngine` under `alphaswarm_kb/src/alphaswarm_kb/infrastructure/adapters/memory/`. 2. Set `kb_kind = "memory_engine"` + `kb_alias = "your_alias"`. The `KBAdapterMeta` metaclass auto-registers (rule 58). 3. Add the optional dep to `alphaswarm_kb/pyproject.toml` extras. 4. Add a default-kwargs YAML under `alphaswarm_kb/configs/memory_engines/your_alias_default.yaml`. 5. Wire the eager import behind `contextlib.suppress(Exception)` in `alphaswarm_kb/src/alphaswarm_kb/__init__.py`. # pgvector control plane > Default BaseVectorStore; data.vector.* MCP tools; alembic 0045 + 0088. # pgvector control plane pgvector is the default `BaseVectorStore` adapter for the KB layer. `PgVectorStore` wraps the extracted `alphaswarm_kb.rag.pgvector_store.PgVectorStore` behind the `BaseVectorStore` port so the standard KBRuntime can target the existing pgvector control plane without leaking adapter specifics. ## Migration - `alembic/versions/0045_pgvector_phase3.py` — pgvector indexes on the three allow-listed tables (`rag_chunks`, `codebase_symbol_embeddings`, `ml_feature_vectors`). - `alembic/versions/0088_alphaswarm_kb_specs.py` — the nine KB tables (`kb_corpora`, `kb_runs`, `kb_subscriptions`, ...). ## Agent surface | Tool | Purpose | | --- | --- | | `data.vector.search` | Free-text or pre-computed embedding ANN over the allow-listed tables. | | `data.vector.upsert` | (Step-up gated) write through `PgVectorStore.upsert`. | | `data.vector.delete` | (Step-up gated) targeted delete by id. 
| | `data.embeddings.compute` | Compute an embedding via the central embedder. | ## Adding a new pgvector-backed table 1. Add the migration under `alembic/versions/` with a `Vector(N)` column + an HNSW index. 2. Extend the `_ALLOWED_TABLES` whitelist in `alphaswarm/data/mcp/tools/vector.py`. 3. Add an `EntityPicker kind` for the table to `alphaswarm/cache/keys.py`. 4. Add a `BaseDataset` kind under `alphaswarm/data/datasets/kinds/` if the table is also surfaced through the dataset catalog. # Hierarchical RAG (extracted) > Four-level Alpha-GPT hierarchy on Redis + pgvector; default IMemoryEngine for KB corpora. # Hierarchical RAG `HierarchicalRAG` lives at `alphaswarm_kb.rag.HierarchicalRAG` (extracted from the legacy `alphaswarm/rag/` tree per ADR-014). It is the **default** `IMemoryEngine` adapter for every `KBCorpusSpec` that doesn't explicitly choose another engine. ## Four levels Implements the Alpha-GPT *"Human-AI Interactive Alpha Mining"* design: | Level | Purpose | | --- | --- | | **L0** | Alpha / decision base: past `agent_decisions`, `equity_reports`, `backtest_runs` outcomes. | | **L1** | High-level categories (`price_volume`, `fundamental`, `news_sentiment`, `regulatory`). | | **L2** | Sub-categories (`earnings_call`, `disclosures`, `cfpb_complaint`, ...). | | **L3** | Specific data fields / chunks: individual narratives + paragraphs. | Plus four orthogonal data "orders" (three numbered, plus `theory`): - **first**: bars / trades / performance. - **second**: SEC filings / fundamentals / ratios. - **third**: CFPB / FDA / USPTO regulatory data. - **theory**: research papers + code chunks. ## Public surface | Symbol | Use | | --- | --- | | `HierarchicalRAG` | Top-level facade. | | `HierarchicalRAG.query` | Direct vector search at one level (optional reranker + compressor). | | `HierarchicalRAG.query_hybrid` | Dense + sparse hybrid via Reciprocal Rank Fusion. | | `HierarchicalRAG.walk` | Top-down `L0 → L1 → L2 → L3` autonomous navigation.
| | `HierarchicalRAG.recall_for_prompt` | Markdown block ready for prompt injection. | | `HierarchicalRAG.index_chunks` / `index_summary` | Write paths. | | `HierarchicalRAG.precompute_l0_alpha_base` | Bulk-index past decisions. | | `get_default_rag()` | Process-wide cached singleton. | ## Indexer registry `alphaswarm_kb.rag.indexers.INDEXER_REGISTRY` maps every corpus slug to its indexer callable. Add a new corpus by writing an indexer that takes the source rows, renders them as text, chunks them via `alphaswarm_kb.rag.chunker.semantic_chunks`, and calls `rag.index_chunks(corpus, ...)`. ## Storage backend Redis (RediSearch) is the default vector store; pgvector is the production-grade backend (Phase 3 refactor). Both implement the same `RedisVectorStore` / `PgVectorStore` surface that `HierarchicalRAG` consumes through composition. ## Backward compatibility `alphaswarm.rag.*` and `alphaswarm.rag.indexers.*` are `DeprecationWarning` shims that re-export from `alphaswarm_kb.rag.*`. Old call sites keep working for one release cycle. # Research-papers RAG > Math-aware paper ingest (Marker / Nougat / MathPix / PyPDF) + hybrid retrieval. # Research-papers RAG The `research_papers` corpus is one of the bundled [`KBCorpusSpec`](https://github.com/alphaswarm/alphaswarm/blob/main/alphaswarm_kb/configs/corpora/research_papers.yaml) templates. It ingests PDFs through the math-aware parser chain in `alphaswarm_kb.rag.parsers` and indexes them into the `research_papers` RAG corpus. ## Parser chain `alphaswarm_kb.rag.parsers.pick_parser(path)` selects the right parser based on the document's math density + complexity: | Parser | Use | | --- | --- | | `MarkerParser` | Default — fast, math-aware Marker pipeline. | | `NougatParser` | Heavy LaTeX/equation density (Nougat from Meta). | | `MathPixParser` | Highest fidelity for handwriting or scanned PDFs (MathPix API). | | `PyPDFParser` | Fast text-only fallback. 
| ## Upload + ingest ```python from alphaswarm_kb.rag.indexers.research_papers_indexer import index_research_papers n_chunks = index_research_papers(paper_ids=["paper-uuid"]) ``` Or via the REST surface: ```http POST /rag/papers/upload # upload PDF POST /rag/papers/{id}/ingest POST /rag/papers/{id}/synthesize # downstream strategy synthesis ``` The Celery wrappers in `alphaswarm_kb.tasks.kb_tasks.ingest_research_paper` + `synthesize_strategy_from_paper` preserve the legacy `alphaswarm.tasks.research_paper_tasks` surface via shims. ## Retrieval Use `data.kb.recall` with `corpus_name="research_papers"`. The `HierarchicalRAG.query_hybrid` path is preferred for papers because exact-token matches (theorem names, variable symbols) matter as much as semantic similarity. ## Strategy synthesis The `synthesize_strategy_from_paper` task pipes hybrid recall results through `router_complete` (rule 2) and returns a YAML strategy stub the Strategy Composer can load. # Account integrations > Per-org HuggingFaceHub + DockerHub credential links for company accounts. PATs are validated against the upstream API on connect, persisted encrypted via the credential resolver, and surfaced through the admin BFF as health-checked records. # Account integrations Per-org credential links the admin operator wires through the **`alphaswarm_admin`** Next.js surface. 
Six integration kinds ship today:

| Kind | Persisted under | Wizard | Backend |
| --- | --- | --- | --- |
| `huggingface` | `CredentialKey("huggingface", "org:")` | `frontend/components/accounts/HuggingFaceWizard.tsx` | `src/alphaswarm_admin/providers/huggingface.py` |
| `docker_hub` | `CredentialKey("docker_hub", "org:")` | `frontend/components/accounts/DockerHubWizard.tsx` | `src/alphaswarm_admin/providers/dockerhub.py` |
| `cloud_aws` | `CredentialKey("cloud_aws", "org:")` | `frontend/components/cloud/CloudOnboardingWizard.tsx` | `src/alphaswarm_admin/providers/cloud_aws.py` |
| `cloud_azure` | `CredentialKey("cloud_azure", "org:")` | `frontend/components/cloud/CloudOnboardingWizard.tsx` | `src/alphaswarm_admin/providers/cloud_azure.py` |
| `cloud_gcp` | `CredentialKey("cloud_gcp", "org:")` | `frontend/components/cloud/CloudOnboardingWizard.tsx` | `src/alphaswarm_admin/providers/cloud_gcp.py` |
| `cloud_cloudflare` | `CredentialKey("cloud_cloudflare", "org:")` | `frontend/components/cloud/CloudOnboardingWizard.tsx` | `src/alphaswarm_admin/providers/cloud_cloudflare.py` |

The four `cloud_*` kinds use the same `AccountIntegrationProvider` ABC but extend it with a 5-step wizard contract (`bootstrap_artifacts` → `validate_identity` → `validate_permissions` → `enumerate_resources` → `connect`) and are exclusively **federated-first**: no long-lived secrets are stored. See [Connect a company cloud account](../../how-to/operations/connect-company-cloud-account.md) for the full runbook.

The `huggingface` and `docker_hub` kinds share the `AccountIntegrationProvider` ABC defined in `alphaswarm_admin/src/alphaswarm_admin/providers/base.py` and the same encrypted file-backed store at `alphaswarm_admin/src/alphaswarm_admin/services/integration_store.py`.
## Lifecycle ```mermaid flowchart LR Op["operator"] -->|PAT| FE["HF/Docker wizard"] FE -->|"POST /admin/accounts/{org_id}/integrations/{kind}"| BFF["alphaswarm_admin BFF"] BFF -->|whoami / login| HUB["upstream provider"] HUB -->|valid| BFF BFF -->|"Fernet-encrypt + persist"| STORE["IntegrationCredentialStore"] BFF -->|metadata only| FE FE -->|status badge| Op ``` The flow is audit-first (see `alphaswarm_admin/src/alphaswarm_admin/api/routers/integrations.py`) and step-up MFA gated (`require_admin_step_up("admin:cluster")`). The PAT itself **never** crosses the BFF response boundary after the initial connect call — the wizard renders the masked metadata (`namespace`, `status`, `connected_at`) only. ## HuggingFace Hub ### What you need - A **fine-grained PAT** with read access on the org's models / datasets. Generate at [https://huggingface.co/settings/tokens](https://huggingface.co/settings/tokens). - The org's HuggingFace namespace (e.g. `acme-quant`). Optional — the BFF derives it from `HfApi().whoami()` when omitted. ### Wire-format ```http POST /admin/accounts/{org_id}/integrations/huggingface Authorization: Bearer Content-Type: application/json { "token": "hf_*****", "namespace": "acme-quant" } ``` ```json { "integration": { "org_id": "org-acme", "kind": "huggingface", "namespace": "acme-quant", "credential_key": "huggingface:org:org-acme", "status": "healthy", "connected_at": "2026-05-27T20:00:00Z", "last_health_at": null, "error": null, "metadata": { "type": "org", "auth_email": "ops@example.com", "orgs": ["acme-quant"] } }, "audit_run_id": "..." } ``` ### Revocation `DELETE /admin/accounts/{org_id}/integrations/huggingface` drops the **local** record. **Always** revoke the PAT on the HuggingFace side too — settings → Personal access tokens → Revoke. Without that step the PAT remains usable from any source that holds the bytes. ## Docker Hub ### What you need - A **Docker Hub PAT** (Account → Personal access tokens) with the intended scope. 
  Username + PAT together are required (Docker Hub v2 login does not accept PAT-only).
- The Docker Hub namespace (defaults to the username on connect).

### Wire-format

```http
POST /admin/accounts/{org_id}/integrations/docker_hub

{ "username": "acmeops", "pat": "dckr_pat_*****", "namespace": "acmeops" }
```

The BFF posts to `https://hub.docker.com/v2/users/login` to mint a JWT (proving the credential is valid), then `/v2/users/{namespace}/` to confirm namespace scope. Both calls happen server-side; only metadata returns to the wizard.

### Revocation

`DELETE /admin/accounts/{org_id}/integrations/docker_hub` drops the local record only. Docker Hub does NOT expose a PAT-revocation API, so you **must** delete the PAT manually in `Account → Security → Personal access tokens` to fully terminate access.

## Health checks

The wizard's "Re-check" action calls `POST /admin/accounts/{org_id}/integrations/{kind}/health`, which re-runs the `whoami` (HuggingFace) or `login` + namespace probe (Docker Hub). The result lands on the row's `last_health_at` / `last_health_status` fields and is rendered as a badge.

## Operator runbook

| Scenario | Action |
| --- | --- |
| PAT expired upstream | Re-run the wizard; the new PAT replaces the encrypted blob in-place. |
| PAT compromised | Revoke upstream first, then disconnect locally. |
| Switching org owners | Disconnect, delete the PAT upstream, have the new owner connect with their own PAT. |
| Lost encryption key (`ALPHASWARM_ADMIN_INTEGRATIONS_KEY`) | Drop the local store JSON, rotate the encryption key, re-run all connect wizards. The upstream PATs are unaffected. |

## Configuration

| Env var | Purpose | Default |
| --- | --- | --- |
| `ALPHASWARM_ADMIN_INTEGRATIONS_PATH` | JSON file the encrypted store writes to. | `~/.alphaswarm/integrations.json` |
| `ALPHASWARM_ADMIN_INTEGRATIONS_KEY` | Fernet key used to encrypt PATs. **Required in production.** | (ephemeral key minted per process; refused in production by `IntegrationCredentialStore.assert_production_ready`) |

Generate a Fernet key with:

```python
from cryptography.fernet import Fernet

print(Fernet.generate_key().decode())
```

Persist the key in your platform secret manager and inject it as an env var; the store reads it once at boot.

# Account management

> The `/auth/profile` surface is the end-user account center for identity, security, session control, connected providers, and tenancy membership management. It keeps sensitive account operations in one...

# Account management

> **alphaswarm_admin (internal) note**: the internal admin BFF at
> `manage.alpha-swarm.ai` is Entra-only since the alphaswarm_admin Entra
> refactor (`.cursor/plans/alphaswarm_admin_entra_refactor_039f2aeb.plan.md`).
> Service identity flows through per-deployment Entra Agent
> Identities; see [admin-agent-identity.md](admin-agent-identity.md).
> Auth0 remains the customer-facing path for the public
> `app.alpha-swarm.ai` cloud frontend described below.

## 1) Overview

The `/auth/profile` surface is the end-user account center for identity, security, session control, connected providers, and tenancy membership management. It keeps sensitive account operations in one place while delegating authentication authority to Auth0.

```mermaid
flowchart LR
  A[Profile] --> B[Security]
  B --> C[Sessions]
  C --> D[Connections]
  D --> E[Tenancy]
  E --> F[Notifications]
  F --> G[Danger Zone]
```

## 2) Profile tab

The Profile tab shows display name, avatar, and provider badge. Email is read-only because the canonical identity record is managed by Auth0.

## 3) Security tab

The Security tab includes:

- `PasswordChangeCard`: creates an Auth0 password-change ticket URL and redirects the user through the hosted reset flow.
- `MfaFactorsCard`: lists and manages MFA enrollment for TOTP, SMS, and WebAuthn factors.
- `RecentActivityCard`: displays the last 10 security-relevant audit events. ## 4) Sessions tab The Sessions tab lists active sessions with browser, device, IP, approximate location, and last activity. Users can revoke individual sessions, or run a global "Sign out everywhere" action with friction confirmation. ## 5) Connections tab The Connections tab supports linking and unlinking identity providers such as Microsoft, Google, Auth0 Database, and GitHub. ## 6) Tenancy tab The Tenancy tab shows memberships, supports org/workspace switching, and exposes a user-level "Leave organization" action. Admin onboarding and tenancy administration are handled in separate admin routes. ## 7) Notifications tab Notifications is a placeholder in v1 and reserved for a future v2 notification preferences model. ## 8) Danger Zone Danger Zone contains permanent account-deletion actions gated by `` typed-email confirmation. ## What an admin can additionally do Admins can use: - [`/admin/onboarding`](/admin/onboarding) for onboarding flows including `EntraTenantLinkWizard`. - [`/admin/users`](/admin/users) for user administration. ## What happens on the backend Key backend modules: - Auth0 Management API client: [`alphaswarm/auth/management_api.py`](../alphaswarm/auth/management_api.py) - `/me/*` route module: [`alphaswarm/api/routes/me.py`](../alphaswarm/api/routes/me.py) - Invite lifecycle routes: [`alphaswarm/api/routes/invites.py`](../alphaswarm/api/routes/invites.py) - Audit emit helper: [`alphaswarm/auth/audit.py`](../alphaswarm/auth/audit.py) # concepts/identity/admin-agent-identity # alphaswarm_admin — Microsoft Entra Agent Identity > Last refreshed: 2026-05-27. > Status: implementation of the alphaswarm_admin Entra refactor > (`.cursor/plans/alphaswarm_admin_entra_refactor_039f2aeb.plan.md`). > See also: [entra-internal-tenant.md](entra-internal-tenant.md) and > [identity.md](identity.md). 
The `alphaswarm_admin` BFF authenticates to `alphaswarm_controller`, the AlphaSwarm monolith, and (eventually) any other downstream service via a **per-deployment Microsoft Entra Agent Identity** instead of a shared client_credentials service principal. Each deployment (dev / staging / prod) gets its own `sub` claim in minted tokens so audit trails and RBAC routing remain clean even when the same Blueprint backs every environment.

This page is the operator + agent reference for the model.

## Three-layer object graph

```mermaid
flowchart LR
  subgraph entra [AlphaSwarm staff Entra tenant]
    bp["Agent Identity Blueprint<br/>alphaswarm-admin-service"]
    bpp[BlueprintPrincipal]
    aid_dev[Agent Identity admin-dev]
    aid_staging[Agent Identity admin-staging]
    aid_prod[Agent Identity admin-prod]
    fic["Federated Identity Credential<br/>AKS workload identity"]
    api["Manage-API Resource Server<br/>+ AdminService app role"]
    bp --> bpp --> aid_dev
    bpp --> aid_staging
    bpp --> aid_prod
    fic -. parent token .-> bp
    aid_dev -. AdminService role .-> api
    aid_staging -. AdminService role .-> api
    aid_prod -. AdminService role .-> api
  end
  admin_dev["alphaswarm_admin pod (dev)"] -.->|fmi_path exchange| aid_dev
  admin_prod["alphaswarm_admin pod (prod)"] -.->|fmi_path exchange| aid_prod
```

| Layer | Resource | Provider |
| --- | --- | --- |
| 1 | Agent Identity Blueprint | `azapi_resource` against `Microsoft.Graph/applications/microsoft.graph.agentIdentityBlueprint` |
| 2 | BlueprintPrincipal (mandatory second step) | `azapi_resource` against `Microsoft.Graph/servicePrincipals/microsoft.graph.agentIdentityBlueprintPrincipal` |
| 3 | Per-environment Agent Identity | `azapi_resource` against `Microsoft.Graph/servicePrincipals/microsoft.graph.agentIdentity` |
| 4 | Federated Identity Credential | `azuread_application_federated_identity_credential` on the Blueprint |
| 5 | App role assignment | `azapi_resource` against `Microsoft.Graph/servicePrincipals/appRoleAssignedTo` |

Terraform module: [`alphaswarm_platform/terraform/modules/alphaswarm_admin_agent_identity/`](../../../../alphaswarm_platform/terraform/modules/alphaswarm_admin_agent_identity/).

## Two-step `fmi_path` exchange

At runtime each pod mints an Agent-Identity-bound access token via the two-step exchange documented in the `entra-agent-id` skill:

```mermaid
sequenceDiagram
  participant Admin as alphaswarm_admin
  participant Entra as Entra token endpoint
  Admin->>Entra: 1. POST /oauth2/v2.0/token<br/>grant=client_credentials<br/>scope=api://AzureADTokenExchange/.default<br/>client_assertion=
  Entra-->>Admin: parent_token
  Admin->>Entra: 2. POST /oauth2/v2.0/token<br/>grant=client_credentials<br/>scope=/.default<br/>fmi_path=alphaswarm-admin-<br/>fmi_target_id=<br/>requested_token_use=on_behalf_of<br/>assertion=parent_token
  Entra-->>Admin: agent_token (sub=agent_sp_id, aud=)
```

The exchange lives at [`alphaswarm_core.auth.providers.msal_entra.MsalEntraValidator.acquire_agent_token`](../../../../alphaswarm_core/src/alphaswarm_core/auth/providers/msal_entra.py).
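As a rough illustration of the two requests in the sequence diagram, the sketch below only builds the form bodies; it does not call Entra and is not the `acquire_agent_token` implementation. The function names and arguments are hypothetical. The dictionary keys are copied from the diagram, which may abbreviate the real form-field names (standard OAuth 2.0 spells the grant parameter `grant_type`, for instance), so treat them as labels rather than a verified wire format.

```python
def parent_token_request(client_assertion: str) -> dict:
    """Step 1: exchange the workload's federated assertion for a parent token.

    `client_assertion` is whatever federated token the pod can present
    (the diagram leaves its source unspecified).
    """
    return {
        "grant": "client_credentials",
        "scope": "api://AzureADTokenExchange/.default",
        "client_assertion": client_assertion,
    }


def agent_token_request(
    parent_token: str,
    audience: str,
    fmi_path: str,
    fmi_target_id: str,
) -> dict:
    """Step 2: trade the parent token for an Agent-Identity-bound token."""
    return {
        "grant": "client_credentials",
        "scope": f"{audience}/.default",
        "fmi_path": fmi_path,
        "fmi_target_id": fmi_target_id,
        "requested_token_use": "on_behalf_of",
        "assertion": parent_token,
    }


# Example values reuse strings that appear elsewhere on this page;
# the assertion, parent token, and target id are placeholders.
step1 = parent_token_request(client_assertion="<federated-workload-jwt>")
step2 = agent_token_request(
    parent_token="<parent_token from step 1>",
    audience="api://alphaswarm-controller",
    fmi_path="alphaswarm-admin-prod",
    fmi_target_id="<agent-identity-id>",
)
```

Keeping the two bodies as pure functions makes the ordering explicit: the second request cannot be built until the first response has produced `parent_token`.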
## CredentialResolver integration The admin BFF wires the Agent Identity flow through the existing `SecretStore` chain so route handlers never see the token directly. ```python from alphaswarm_core.credentials.stores import ( EntraAgentIdentityCredentialResolver, EntraAgentIdentitySecretStore, ) from alphaswarm_core.auth.providers.msal_entra import MsalEntraValidator store = EntraAgentIdentitySecretStore( validator=MsalEntraValidator( tenant="", audience="api://alphaswarm-controller", ), resolvers=( EntraAgentIdentityCredentialResolver( credential_key=CredentialKey( service="alphaswarm-admin-to-cp", purpose="client_credentials", ), audience="api://alphaswarm-controller", blueprint_app_id=, agent_identity_id=, fmi_path="alphaswarm-admin-prod", ), ), ) ``` `alphaswarm_admin/integrations/broker.py::build_default_brokers` does this automatically when `ALPHASWARM_AUTH_AGENT_IDENTITY_ENABLED=true` AND the three Agent Identity env vars are populated. When any of the fields are empty the broker falls back to the legacy env-only client_credentials path so local-dev sandboxes keep working. ## Receiver-side recognition `alphaswarm_controller.auth.deps._payload_to_user` extracts the RFC 8693 `act` claim and surfaces `actor_kind="agent"` plus `actor_upstream_sub` on the resolved `AuthenticatedUser`. Recognition is feature-flagged behind `ALPHASWARM_AUTH_AGENT_TOKEN_RECOGNITION_ENABLED` until the end-to-end path is verified — when off, every token resolves to `actor_kind="user"` and the legacy audit shape is preserved. The monolith side (`alphaswarm/api/routes/_internal_audit.py`) logs the `actor_kind` + `actor_upstream_sub` on every persisted `terraform_runs` ingest call so the audit ledger stays correlatable with the Agent Identity that minted the token. 
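The receiver-side behaviour described above can be approximated with a small pure function. This is a hedged sketch, not the body of `_payload_to_user`: the `ResolvedActor` type and `resolve_actor` name are hypothetical, and it assumes `actor_upstream_sub` is taken from `act.sub`, which the text does not state explicitly.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class ResolvedActor:
    """Hypothetical subset of the fields surfaced on AuthenticatedUser."""

    sub: str
    actor_kind: str
    actor_upstream_sub: Optional[str]


def resolve_actor(payload: dict, recognition_enabled: bool) -> ResolvedActor:
    """Classify a validated token payload as a user or agent caller.

    With the flag off, every token resolves to actor_kind="user",
    which preserves the legacy audit shape.
    """
    act = payload.get("act")
    if recognition_enabled and isinstance(act, dict) and act.get("sub"):
        return ResolvedActor(
            sub=payload["sub"],
            actor_kind="agent",
            actor_upstream_sub=act["sub"],  # assumption: taken from act.sub
        )
    return ResolvedActor(sub=payload["sub"], actor_kind="user", actor_upstream_sub=None)


# Illustrative payload; the sub values are made up.
agent_payload = {"sub": "agent-sp-id", "act": {"sub": "upstream-sub"}}
flag_on = resolve_actor(agent_payload, recognition_enabled=True)
flag_off = resolve_actor(agent_payload, recognition_enabled=False)
```

With the flag off, the same payload resolves as a plain user, so the flag can be flipped without changing token issuance.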
## Identity on AWS ECS Fargate When `alphaswarm_admin` runs on ECS Fargate (the [`ecs-fargate-control-plane`](../../../../alphaswarm_platform/infrastructure/modules/ecs-fargate-control-plane/) module) two identities are in play, and they are orthogonal: - **AWS control** — the `/admin/platform/ecs/*` surface calls AWS ECS + CloudWatch using the task's **AWS IAM role**, not Entra. The module grants that role a tightly scoped self-management policy (`enable_self_management = true`). No Entra token is involved in the AWS control path. - **Control-plane M2M** — outbound calls to `alphaswarm-cp` `/manage/*` still need an Entra-minted token. ECS Fargate has no native OIDC issuer for the WIF JWT the two-step `fmi_path` exchange needs, so the ECS-hosted admin routes M2M through the controller's `/auth/m2m/token` shim by setting `ALPHASWARM_AUTH_THROUGH_CONTROLLER=true`. The controller (EKS-hosted, with a projected service-account token) holds the Agent Identity federation and mints on the admin's behalf. The Agent Identity Blueprint + per-environment identities this module provisions therefore back the EKS-hosted control plane and any admin pod that can present a federated SA token. The `module.alphaswarm_admin_agent_identity.agent_identity_env` output emits the per-environment `ALPHASWARM_AUTH_AGENT_*` block ready to drop into a task definition or ConfigMap for those deployments. ## Operator workflow ```bash # 1. Pre-check (one-time): grant the Terraform-execution SP the Graph # permissions the entra-agent-id skill lists. # 2. Snapshot + apply (step-up MFA gated; AGENTS rule 42 + 52). python scripts/identity/seed_admin_agent_identity.py --apply alphaswarm-cli manage terraform apply \ --workspace-id admin-entra \ --spec-version-id # 3. Plumb outputs into CredentialResolver. alphaswarm-cli credentials import \ --service alphaswarm-admin \ --purpose entra_agent_identity \ --field blueprint_app_id= \ --field agent_identity_id_prod= # 4. 
Flip the feature flag on each deployment. ALPHASWARM_AUTH_AGENT_IDENTITY_ENABLED=true ALPHASWARM_AUTH_AGENT_BLUEPRINT_APP_ID= ALPHASWARM_AUTH_AGENT_IDENTITY_ID= ALPHASWARM_AUTH_AGENT_FMI_PATH=alphaswarm-admin-prod ``` ## Rollback The Terraform module is gated by `var.enabled`; flipping it to false removes the per-environment Agent Identities + role assignments while keeping the Blueprint + BlueprintPrincipal in place for fast re-enable. The `alphaswarm_admin` BFF falls back to the legacy client_credentials path automatically (via `EnvSecretStore` at priority 100). For the human-login path, the legacy Vite SPA at `alphaswarm_admin_ui/` retains its Auth0 branch for 30 days post the refactor — set the `ALPHASWARM_ADMIN_LEGACY_AUTH0_FALLBACK=1` feature flag to surface it. # Auth0 Actions for the AlphaSwarm multi-tenant rollout > Auth0 ships organisation / role data via the standard `org_id` / `https:///roles` claims, but the **AlphaSwarm scope chain** (which workspace is the users default, which team theyre in, which roles... # Auth0 Actions for the AlphaSwarm multi-tenant rollout The Phase 4 enforcement sweep relies on Auth0 to inject AlphaSwarm-namespaced custom claims (`https://alphaswarm/org_id`, `https://alphaswarm/team_id`, `https://alphaswarm/workspace_id`, `https://alphaswarm/roles`) into every access token. The Action snippet below ships those claims by calling the M2M-secured [`/_internal/auth0/sync`](../alphaswarm/api/routes/auth0_sync.py) endpoint during the post-login hook. ## Why an Action? Auth0 ships organisation / role data via the standard `org_id` / `https:///roles` claims, but the **AlphaSwarm scope chain** (which workspace is the user's default, which team they're in, which roles map onto the four-tier lattice) lives in Postgres. The Action is the bridge: it asks the AlphaSwarm backend on every login + injects the result into the access token so the frontend + backend see a consistent set of custom claims from request 0. ## Setup 1. 
**Create an Auth0 API for the AlphaSwarm backend** (separate from the SPA Application). Set the audience to whatever you set `ALPHASWARM_AUTH_OIDC_AUDIENCE` to — e.g. `https://api.alphaswarm.local`. 2. **Create a Machine-to-Machine Application** authorised against the AlphaSwarm API. Set its allowed grant types to `client_credentials` only. Copy the client_id + secret into the Action's secrets: - `ALPHASWARM_M2M_CLIENT_ID` - `ALPHASWARM_M2M_CLIENT_SECRET` - `ALPHASWARM_API_AUDIENCE` (the same audience as #1) - `ALPHASWARM_BACKEND_URL` (e.g. `https://api.alphaswarm.local`) 3. **Configure the AlphaSwarm backend**: ```bash ALPHASWARM_AUTH_PROVIDER=auth0 ALPHASWARM_AUTH_OIDC_ISSUER=https://your-tenant.auth0.com ALPHASWARM_AUTH_OIDC_AUDIENCE=https://api.alphaswarm.local ALPHASWARM_AUTH_M2M_ENABLED=true ALPHASWARM_AUTH_M2M_AUDIENCE=https://api.alphaswarm.local ALPHASWARM_AUTH_CLAIMS_NAMESPACE=https://alphaswarm/ ALPHASWARM_AUTH_ENFORCE=permissive # flip to ``strict`` after the rollout dashboard is clean ``` ## The Action Create a new Action under **Library > Custom > Build new** and attach it to the **Login** trigger. ```js /** * AlphaSwarm post-login Action: lazy-provisions the internal user + injects * AlphaSwarm-namespaced custom claims into the access token. * * Triggers on every login; the backend is idempotent. */ exports.onExecutePostLogin = async (event, api) => { const namespace = "https://alphaswarm/"; // 1. Mint an M2M token for the AlphaSwarm backend. const tokenResp = await fetch(`https://${event.tenant.id}.auth0.com/oauth/token`, { method: "POST", headers: { "Content-Type": "application/json" }, body: JSON.stringify({ grant_type: "client_credentials", client_id: event.secrets.ALPHASWARM_M2M_CLIENT_ID, client_secret: event.secrets.ALPHASWARM_M2M_CLIENT_SECRET, audience: event.secrets.ALPHASWARM_API_AUDIENCE, }), }); if (!tokenResp.ok) { api.access.deny("AlphaSwarm backend token mint failed"); return; } const { access_token } = await tokenResp.json(); // 2. 
Ask the AlphaSwarm backend to lazy-provision the user + return claims. const syncResp = await fetch(`${event.secrets.ALPHASWARM_BACKEND_URL}/_internal/auth0/sync`, { method: "POST", headers: { "Content-Type": "application/json", Authorization: `Bearer ${access_token}`, }, body: JSON.stringify({ user_id: event.user.user_id, email: event.user.email, organization_id: event.organization?.id, organization_name: event.organization?.name, }), }); if (!syncResp.ok) { // Soft failure: let the user in but log the issue. The // backend's lazy provisioner will run on the first API call // instead. console.log("AlphaSwarm backend sync failed:", await syncResp.text()); return; } const claims = await syncResp.json(); // 3. Inject the claims into the access token. if (claims.org_id) api.accessToken.setCustomClaim(`${namespace}org_id`, claims.org_id); if (claims.team_id) api.accessToken.setCustomClaim(`${namespace}team_id`, claims.team_id); if (claims.workspace_id) { api.accessToken.setCustomClaim(`${namespace}workspace_id`, claims.workspace_id); } if (claims.roles && claims.roles.length) { api.accessToken.setCustomClaim(`${namespace}roles`, claims.roles); } if (claims.internal_user_id) { api.accessToken.setCustomClaim(`${namespace}user_id`, claims.internal_user_id); } }; ``` ## Verification 1. Log in via the SPA. The browser receives an access token. 2. Decode it (e.g. [jwt.io](https://jwt.io)) and verify the `https://alphaswarm/org_id` / `https://alphaswarm/roles` claims are present. 3. Hit `GET /auth/whoami` on the AlphaSwarm backend. The response should reflect the org / workspace from the Action — not the deterministic local-default seed. 4. The Phase 6 frontend `ContextBar` should auto-populate the org / workspace on first render. ## Failure modes | Symptom | Likely cause | | ------- | ------------ | | Token has no custom claims | Action attached to the wrong trigger or failed silently. Check the Action logs. 
| | Backend 401 on `/_internal/auth0/sync` | M2M token audience mismatch: the Action audience must equal `ALPHASWARM_AUTH_OIDC_AUDIENCE`. |
| `data.ownership.list_resources` returns the local-default user | `provision_user_from_claims` is not running. Confirm the SPA is sending the Bearer header and `ALPHASWARM_AUTH_PROVIDER != local`. |
| Phase 4 enforcement mode showing too many 403s | Some Postgres `memberships` rows are missing. Run the lazy-provisioning sync once per user, or backfill manually. |

## See also

- [`alphaswarm_docs/identity.md`](../../concepts/identity/identity.md): the full identity stack.
- [`alphaswarm_docs/credentials.md`](../../concepts/identity/credentials.md): how M2M tokens flow through `CredentialResolver`.
- [`alphaswarm/api/security.py`](../alphaswarm/api/security.py): the `require_scope` / `require_membership` deps that consume these claims.

## Phase 7 post-login Action (Auth0 + Microsoft federation)

This Action calls `/_internal/auth0/sync`, then injects the returned custom claims into both the access token and the ID token. The connection name mapping (`requested_claims.connection`) is forwarded so AlphaSwarm can record which IdP drove each login.

```javascript
/**
 * AlphaSwarm post-login Action.
 * Calls /_internal/auth0/sync on the AlphaSwarm API and injects the
 * returned custom claims into the access and ID tokens. Also carries the
 * Auth0 connection name (e.g. "azure-ad-myorg") so the AlphaSwarm audit
 * log records WHICH IdP drove this login.
 *
 * Secrets used:
 *   ALPHASWARM_API_URL            e.g. https://api.alphaswarm.example
 *   ALPHASWARM_M2M_CLIENT_ID      Auth0 Management API M2M client id (reused)
 *   ALPHASWARM_M2M_CLIENT_SECRET  Auth0 Management API M2M client secret
 *   ALPHASWARM_M2M_AUDIENCE       Same as AlphaSwarm API resource identifier
 *
 * Set them at: Actions > Library > Custom > > Add Secret
 */
const NS = "https://alphaswarm/";

// Takes the whole event: the token URL needs event.tenant.id and the
// credentials live on event.secrets.
async function mintM2MToken(event) {
  const { secrets } = event;
  const url = `https://${event.tenant.id}.auth0.com/oauth/token`;
  const res = await fetch(url, {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({
      grant_type: "client_credentials",
      client_id: secrets.ALPHASWARM_M2M_CLIENT_ID,
      client_secret: secrets.ALPHASWARM_M2M_CLIENT_SECRET,
      audience: secrets.ALPHASWARM_M2M_AUDIENCE,
    }),
  });
  if (!res.ok) return null;
  const body = await res.json();
  return body.access_token || null;
}

exports.onExecutePostLogin = async (event, api) => {
  const aqpApi = event.secrets.ALPHASWARM_API_URL;
  if (!aqpApi) return; // Action mis-configured; fail open

  // Reuse a cached M2M token when one exists; otherwise mint and cache it.
  let token = await api.cache.get("alphaswarm_m2m_token");
  if (!token || !token.value) {
    const fresh = await mintM2MToken(event);
    if (!fresh) return;
    api.cache.set("alphaswarm_m2m_token", fresh, { ttl: 50 * 60 * 1000 });
    token = { value: fresh };
  }

  const payload = {
    user_id: event.user.user_id,
    email: event.user.email,
    organization_id: event.organization?.id,
    organization_name: event.organization?.name,
    requested_claims: {
      connection: event.connection?.name,
      strategy: event.connection?.strategy,
    },
  };

  try {
    const res = await fetch(`${aqpApi}/_internal/auth0/sync`, {
      method: "POST",
      headers: {
        Authorization: `Bearer ${token.value}`,
        "Content-Type": "application/json",
      },
      body: JSON.stringify(payload),
    });
    if (!res.ok) return;
    const claims = await res.json();
    for (const [k, v] of Object.entries(claims)) {
      if (v === null || v === undefined) continue;
      api.accessToken.setCustomClaim(`${NS}${k}`, v);
      api.idToken.setCustomClaim(`${NS}${k}`, v);
    }
  } catch (err) {
    // Fail open: never block the user's login if the AlphaSwarm API is down.
    console.log("alphaswarm_sync_failed", err.message);
  }
};
```

### Custom claims it sets

| Claim | Meaning |
| --- | --- |
| `https://alphaswarm/org_id` | Active organization context resolved by AlphaSwarm. |
| `https://alphaswarm/team_id` | Team context resolved by AlphaSwarm. |
| `https://alphaswarm/workspace_id` | Active workspace context. |
| `https://alphaswarm/project_id` | Active project context. |
| `https://alphaswarm/lab_id` | Active lab context. |
| `https://alphaswarm/roles` | Role list used by scope/membership checks. |
| `https://alphaswarm/connection` | Auth0 connection name, mapped from `requested_claims.connection` (for example `azure-ad-myorg`). |
| `https://alphaswarm/internal_user_id` | AlphaSwarm internal user row identifier. |

### Why it fails open

The post-login Action should never block authentication because of a transient outage in AlphaSwarm. Missing one claim-sync cycle is recoverable on the next login, while hard-failing login creates a broader availability incident for all users.

## Phase 8 — Step-up MFA addendum (AGENTS hard rule 52)

Step-up MFA on destructive routes (the kill switch, every `/halt` endpoint, BYOK / OAuth credential deletes, Terraform apply / destroy, organization invite issuance, broker-credential mutations, and the admin tenancy-strategy migration) is enforced server-side by [`alphaswarm.api.security_stepup.require_step_up`](../alphaswarm/api/security_stepup.py). The FastAPI dep returns RFC 9470-compliant 401 responses with `WWW-Authenticate: Bearer error="insufficient_user_authentication", acr_values="...", max_age="..."` when the access token fails the freshness or MFA-method check.
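To make the challenge format concrete, here is a minimal sketch that builds and parses that header. It is not the `require_step_up` implementation: the helper names are hypothetical, and the parser only handles the simple quoted `key="value"` pairs shown above.

```python
import re

ACR_MFA = "http://schemas.openid.net/pape/policies/2007/06/multi-factor"


def step_up_challenge(acr_values: str, max_age: int) -> dict:
    """Build the headers for a 401 step-up challenge in the format shown above."""
    value = (
        'Bearer error="insufficient_user_authentication", '
        f'acr_values="{acr_values}", max_age="{max_age}"'
    )
    return {"WWW-Authenticate": value}


def parse_challenge(header_value: str) -> dict:
    """Extract the quoted parameters a client needs to re-run /authorize."""
    scheme, _, params = header_value.partition(" ")
    if scheme != "Bearer":
        raise ValueError("not a Bearer challenge")
    return dict(re.findall(r'(\w+)="([^"]*)"', params))


headers = step_up_challenge(ACR_MFA, max_age=0)
parsed = parse_challenge(headers["WWW-Authenticate"])
```

A client that receives the 401 can read `acr_values` and `max_age` from the parsed challenge and pass them straight to the authorization request, which is the round-trip the step-up flow depends on.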
The frontend (`alphaswarm_client/src/lib/auth/useStepUp.ts` + `apiFetch` retry middleware) drives the SPA-side flow: a destructive button calls `requestStepUp()` to pre-flight an Auth0 popup with `acr_values=http://schemas.openid.net/pape/policies/2007/06/multi-factor` and `max_age=0`, then runs the original operation with the freshly minted token.

For this round-trip to succeed, the post-login Action above MUST honour the `acr_values` parameter and force an MFA challenge when the caller requested it. Add the snippet below to the **Phase 7 post-login Action** (don't duplicate it; extend the existing `exports.onExecutePostLogin`):

```javascript
exports.onExecutePostLogin = async (event, api) => {
  // ... (the Phase 7 JIT-sync body stays as-is) ...

  // ---- Phase 8: Adaptive MFA + step-up enforcement ------------------
  // The SPA / CLI / agent caller can explicitly request fresh MFA by
  // passing acr_values=http://schemas.openid.net/pape/policies/2007/06/multi-factor
  // on /authorize. The Action MUST trigger the MFA challenge when
  // either (a) the caller requested it OR (b) Auth0's Adaptive MFA
  // assessment flagged the login as high-risk.
  const ACR_MFA = "http://schemas.openid.net/pape/policies/2007/06/multi-factor";
  const acrRequested = Array.isArray(event.transaction?.acr_values)
    ? event.transaction.acr_values
    : [];
  const explicitlyAskedForMfa = acrRequested.includes(ACR_MFA);

  const methods = Array.isArray(event.authentication?.methods)
    ? event.authentication.methods
    : [];
  const mfaAlreadyCompleted = methods.some(
    (m) => m?.name === "mfa" || m?.name === "otp",
  );

  // Auth0's Adaptive MFA risk assessment reports a *confidence* score
  // (`low` / `medium` / `high`): low confidence means the login looks
  // risky. Honour `low` automatically; `medium` is left to the caller's
  // explicit request so the dashboard stays usable on shared offices.
  const riskConfidence = event.authentication?.riskAssessment?.confidence;
  const shouldTriggerMfa =
    (explicitlyAskedForMfa && !mfaAlreadyCompleted) || riskConfidence === "low";

  if (shouldTriggerMfa) {
    // ``allowRememberBrowser: false`` because step-up is sized for
    // destructive ops: we never want the browser to remember the
    // "MFA satisfied" flag past the 180s freshness window.
    api.multifactor.enable("any", { allowRememberBrowser: false });
  }

  // Surface a JIT-friendly hint to the SPA so the topbar can render
  // "MFA required" pre-flight UI. Not security-sensitive; purely a
  // UX accelerator. The backend NEVER trusts this claim; it always
  // re-checks amr + auth_time on the access token.
  api.idToken.setCustomClaim(`${NS}mfa_available`, true);
};
```

The `multifactor.enable("any", ...)` call triggers Auth0's enrolment or challenge surface, depending on whether the user has already registered a factor. Operators must enable at least one factor type in **Security > Multi-factor Auth** for the call to succeed.

### Tested factor types

| Factor | Auth0 enrolment | Notes |
| --- | --- | --- |
| OTP (TOTP) | Authenticator app | Always recommended as the primary factor |
| WebAuthn | Roaming or Platform | Strongest; emits `amr: ["mfa", "swk"]` or `["mfa", "hwk"]` |
| Push | Auth0 Guardian app | Smooth UX for personal accounts |
| SMS | Phone-based | Discouraged for B2B per AGENTS rule 52 |
| Email OTP | Magic link | Acceptable; emits `amr: ["mfa"]` |
| Recovery code | Backup | Always provisioned alongside another factor |

### Step-up failure recovery

If the popup fails (browser blocked, user dismissed, network drop):

- `useStepUp.requestStepUp()` returns `null` and surfaces an "MFA required" toast.
- `apiFetch`'s automatic 401 retry path also gives up after one attempt; the route handler propagates the original 401 to the caller.
- Operators with `admin:tenant` can fall back to the BFF `/auth/login?acr_values=...` redirect flow which uses a full-page redirect instead of a popup. The redirect callback returns to the original route and the user re-clicks the destructive button.

## Phase 8 — Custom Token Exchange Profile (AGENTS hard rule 54)

The Phase 8 refactor introduces RFC 8693 delegated agent tokens — when [`AgentRuntime`](../alphaswarm/agents/runtime.py) makes an HTTP MCP call on behalf of a user, it exchanges the user's access token for a narrower, agent-scoped token via Auth0 Custom Token Exchange. The minted token carries an `act` claim identifying the agent, while the top-level `sub` stays the human user — so RLS, memberships, and the audit ledger all see the full delegation chain.

### Required Auth0 setup

1. **Create an M2M Application named `alphaswarm-agent-broker`.**
   - Authorise it against the AlphaSwarm API record.
   - Allowed grant types: `client_credentials` (required) AND `urn:ietf:params:oauth:grant-type:token-exchange` (required).
   - Note the `client_id` + `client_secret`.

2. **Configure backend env vars:**

   ```bash
   ALPHASWARM_AUTH_AGENT_TOKEN_EXCHANGE_ENABLED=true
   ALPHASWARM_AUTH_AGENT_BROKER_CLIENT_ID=
   # client_secret resolves via CredentialResolver in prod
   # (Vault / cloud KMS) — env is the local-dev shortcut:
   ALPHASWARM_AUTH_AGENT_BROKER_CLIENT_SECRET=
   ALPHASWARM_AUTH_AGENT_DELEGATION_TTL_SECONDS=300
   ```

3. **Create a Custom Token Exchange Profile named `alphaswarm-agent-delegation`.**
   - In the Auth0 Dashboard, navigate to **Actions > Flows > Custom Token Exchange** and click "Create Profile".
   - Profile name: exactly `alphaswarm-agent-delegation` (matches the `subject_token_profile` parameter the broker sends).
   - Target API: the AlphaSwarm API record (audience `https://api.alpha-swarm.ai/` or your env equivalent).
   - Subject token types accepted: `urn:ietf:params:oauth:token-type:access_token`.
   - Allow Skipping User Consent: **enabled** (required for non-interactive flows per the Custom Token Exchange docs).
   - Allowed scopes: `read:mcp:data`, `write:mcp:data`, `read:mcp:codebase`, `write:mcp:codebase`. The Profile must reject any scope NOT on this list.

### The Action body

Paste this into the Profile's Action body. The Action runs INSIDE the `/oauth/token` exchange request — it never returns prose to the caller, only the access token Auth0 mints.

```javascript
/**
 * alphaswarm-agent-delegation — Custom Token Exchange Profile Action.
 *
 * Sources:
 *   event.transaction.subject_token_payload — the human's verified
 *     access token claims (sub, org_id, permissions, ...).
 *   event.transaction.actor_token_payload — the agent broker M2M
 *     token claims (sub = "agent|").
 *
 * The Profile MUST be paired with the alphaswarm-agent-broker M2M client
 * and the broker MUST NOT be allowed to call /oauth/token with any
 * other Profile. Misusing this Profile mis-attributes audit rows.
 */
exports.onExecuteCustomTokenExchange = async (event, api) => {
  const subject = event.transaction?.subject_token_payload;
  const actor = event.transaction?.actor_token_payload;
  if (!subject || typeof subject !== "object") {
    api.access.rejectInvalidSubjectToken("subject token missing");
    return;
  }
  if (!actor || typeof actor !== "object") {
    api.access.rejectInvalidSubjectToken("actor assertion missing");
    return;
  }

  const humanSub = subject.sub;
  const agentSub = actor.sub;
  if (!humanSub || !agentSub) {
    api.access.rejectInvalidSubjectToken("missing sub claims");
    return;
  }
  if (!String(agentSub).startsWith("agent|")) {
    api.access.rejectInvalidSubjectToken(
      "actor must identify an agent (sub must start with 'agent|')",
    );
    return;
  }

  // Bind the minted access token to the human user — RLS + members
  // are evaluated against this sub by the AlphaSwarm backend.
  api.authentication.setUserById(humanSub);

  // Narrow audience + scopes regardless of what the subject token had.
  api.accessToken.setAudience(event.secrets.ALPHASWARM_API_AUDIENCE);

  // Whitelist of scopes the agent is allowed to inherit. New MCP
  // surfaces must be added to BOTH this list AND the Profile's
  // configured allowed scopes.
  const ALLOWED_AGENT_SCOPES = [
    "read:mcp:data",
    "write:mcp:data",
    "read:mcp:codebase",
    "write:mcp:codebase",
  ];
  const requested = (event.transaction?.requested_scopes || []).filter(
    (s) => ALLOWED_AGENT_SCOPES.includes(s),
  );
  for (const s of requested) {
    api.accessToken.addScope(s);
  }

  // The `act` claim is the standard RFC 8693 marker. AlphaSwarm's
  // get_current_user dep reads it to flip Principal.actor_type to
  // "agent" and stamp on_behalf_of_sub onto every audit row.
  api.accessToken.setCustomClaim("act", {
    sub: agentSub,
    iss: `https://${event.secrets.AUTH0_DOMAIN}/`,
  });

  // AlphaSwarm-specific marker so the frontend / SIEM dashboards can filter.
  api.accessToken.setCustomClaim("alphaswarm_delegated", true);

  // Carry the human's org_id through so RLS sees the right tenant
  // even when the agent is running in a Celery worker without
  // X-AlphaSwarm-Org headers.
  if (subject.org_id) {
    api.accessToken.setCustomClaim("org_id", subject.org_id);
  }
};
```

### Secrets

- `ALPHASWARM_API_AUDIENCE` — same value the operator sets in `ALPHASWARM_AUTH_OIDC_AUDIENCE` on the backend.
- `AUTH0_DOMAIN` — the tenant domain (`alphaswarm-prod.us.auth0.com` or the custom domain).

### Verification

1. Stand the backend up with `ALPHASWARM_AUTH_AGENT_TOKEN_EXCHANGE_ENABLED=true` and the broker credentials populated.
2. Run an end-to-end test where an agent calls a data MCP tool:

   ```python
   from alphaswarm_agents.runtime import AgentRuntime
   from alphaswarm_agents.spec import AgentSpec

   spec = AgentSpec.from_yaml_path("configs/agents/research_lead.yaml")
   runtime = AgentRuntime(
       spec,
       context=test_ctx,
       user_access_token=human_token,
   )
   delegated = runtime.delegated_token_for_mcp()
   assert delegated is not None
   # Decode the token at jwt.io — should carry act.sub="agent|research_lead"
   # while sub stays the human's auth0|... identity.
   ```

3. Hit `/mcp/data/tools/data.catalog.lineage/invoke` with the delegated token in the `Authorization` header. The response body should include the `actor` object with both the agent sub and the on-behalf-of sub.
4. Query the audit ledger:

   ```sql
   SELECT created_at, user_id, actor_user_id, event_type,
          details->'delegation' AS delegation
   FROM security_audit_events
   WHERE event_type LIKE 'mcp%'
   ORDER BY created_at DESC
   LIMIT 5;
   ```

   The `delegation` JSON block should carry `{"agent_subject": "agent|research_lead", "on_behalf_of_user_id": "auth0|...", "profile": "alphaswarm-agent-delegation"}`.
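
The claim check from step 2 can also be done locally instead of in a web decoder. The sketch below is illustrative only: `decode_claims` and `check_delegation` are hypothetical helper names, not part of the AlphaSwarm codebase, and the sample token is synthetic. The payload is decoded without signature verification, so this is suitable for inspection during testing, not for authorization decisions.

```python
import base64
import json


def decode_claims(token: str) -> dict:
    """Decode the payload segment of a JWT WITHOUT verifying the signature."""
    parts = token.split(".")
    if len(parts) != 3:
        raise ValueError("expected a three-segment JWT")
    payload = parts[1]
    # JWT segments are base64url-encoded with the padding stripped.
    payload += "=" * (-len(payload) % 4)
    return json.loads(base64.urlsafe_b64decode(payload))


def check_delegation(claims: dict, expected_agent_sub: str) -> None:
    """Assert the delegation shape described above (hypothetical helper)."""
    act = claims.get("act") or {}
    assert act.get("sub") == expected_agent_sub, "act.sub should name the agent"
    assert not str(claims.get("sub", "")).startswith("agent|"), (
        "top-level sub should stay the human user"
    )


def _b64url(obj: dict) -> str:
    """Encode a dict as an unpadded base64url JSON segment."""
    raw = json.dumps(obj).encode()
    return base64.urlsafe_b64encode(raw).rstrip(b"=").decode()


# Synthetic, unsigned token built only to exercise the helpers.
sample_claims = {
    "sub": "auth0|example-user",
    "act": {"sub": "agent|research_lead"},
    "alphaswarm_delegated": True,
}
sample_token = ".".join([_b64url({"alg": "none"}), _b64url(sample_claims), "sig"])

decoded = decode_claims(sample_token)
check_delegation(decoded, "agent|research_lead")
```

In a real run, pass the value returned by `runtime.delegated_token_for_mcp()` to `decode_claims` in place of `sample_token`.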

### Failure modes

| Symptom | Likely cause | Fix |
| --- | --- | --- |
| 400 `invalid_request profile not found` | Profile name typo or not yet created in Dashboard | Match the name exactly: `alphaswarm-agent-delegation` |
| 400 `unauthorized_client` | alphaswarm-agent-broker app missing `token-exchange` grant type | Enable on the M2M app |
| 400 `invalid_target scope rejected` | Profile didn't include the scope in its allowed list | Add the scope to BOTH the Profile config and the Action's `ALLOWED_AGENT_SCOPES` list |
| MCP route returns 403 missing `read:mcp:data` | Permissions array on AlphaSwarm API record missing the scope, or RBAC option "Add permissions in access token" is off | Re-enable both in API record settings |
| Audit row missing `delegation` block | Caller didn't pass `agent_subject` to `emit_audit_event` | The MCP server route + bridge already pass it; legacy callers need to be updated |

## Phase 6 — IdP group sync Action (`alphaswarm-idp-group-sync`)

Generalises the existing post-login flow so each org can attach non-Entra IdPs (Google Workspace, AWS IAM Identity Center, Okta, OneLogin, JumpCloud, generic SAML/OIDC) and have their external group claims automatically promote to AlphaSwarm roles. Pairs with the [`IdpGroupMappingEditor`](../alphaswarm_client/src/components/onboarding/IdpGroupMappingEditor.tsx) admin UI and the `/tenancy/orgs/{org_id}/idp-group-mappings` routes.

### How it fits the post-login pipeline

The existing `alphaswarm-post-login` Action handles the JIT user upsert and the AlphaSwarm-namespaced custom claims (Phase 4 + 7). This NEW Action runs AFTER `alphaswarm-post-login` in the same Login trigger and specifically handles the IdP-group → AlphaSwarm-role translation. They share the M2M token cache to avoid double-minting.

### Required Auth0 setup

1. **Order the Actions.** In Library > Custom > Triggers > Login, drag `alphaswarm-post-login` to position 1, then `alphaswarm-idp-group-sync` to position 2. The sync action depends on `event.user.user_id` being a valid AlphaSwarm user, which is guaranteed by the time the post-login JIT sync completes.
2. **No new secrets** — re-uses the same `ALPHASWARM_API_URL` / `ALPHASWARM_M2M_*` secrets the post-login Action already needs.

### The Action body

```javascript
/**
 * alphaswarm-idp-group-sync — post-login Action.
 *
 * Reads the user's external IdP group claims and posts them to
 * /_internal/idp/sync-groups so the AlphaSwarm backend can upsert
 * matching Membership rows per the per-org IdpGroupMapping table.
 */
const NS = "https://alphaswarm.internal/";

function _collectExternalGroups(event) {
  // Different IdPs surface group memberships under different claim
  // names. We collect every well-known shape and merge into one
  // de-duplicated list.
  const candidates = [
    event.user?.groups, // Auth0 standard
    event.user?.app_metadata?.groups,
    event.user?.user_metadata?.groups,
    event.user?.["http://schemas.microsoft.com/ws/2008/06/identity/claims/role"],
    event.user?.identities?.[0]?.profileData?.groups,
  ];
  const merged = new Set();
  for (const c of candidates) {
    if (!c) continue;
    if (Array.isArray(c)) {
      for (const g of c) {
        if (typeof g === "string" && g.trim()) merged.add(g.trim());
      }
    } else if (typeof c === "string" && c.trim()) {
      merged.add(c.trim());
    }
  }
  return Array.from(merged);
}

function _connectionKind(event) {
  // Map Auth0 connection strategy -> AlphaSwarm IdpConnectionRecord.connection_kind.
  const strategy = (event.connection?.strategy || "").toLowerCase();
  const name = (event.connection?.name || "").toLowerCase();
  if (strategy === "waad" || name.includes("azure")) return "entra";
  if (strategy === "google-workspace" || name.includes("google-workspace")) {
    return "google_workspace";
  }
  if (name.includes("iam-identity-center") || name.includes("aws-sso")) {
    return "aws_iam_identity_center";
  }
  if (strategy === "okta" || name.includes("okta")) return "okta";
  if (strategy === "onelogin" || name.includes("onelogin")) return "onelogin";
  if (strategy === "jumpcloud" || name.includes("jumpcloud")) return "jumpcloud";
  if (strategy === "samlp") return "generic_saml";
  if (strategy === "oidc") return "generic_oidc";
  return null;
}

exports.onExecutePostLogin = async (event, api) => {
  const groups = _collectExternalGroups(event);
  if (groups.length === 0) return;

  const kind = _connectionKind(event);
  if (!kind) return;

  // Re-use the M2M token cached by alphaswarm-post-login (same Action
  // namespace) so we don't double-mint.
  const token = (await api.cache.get("alphaswarm_m2m_token"))?.value;
  if (!token) return;

  const aqpApi = event.secrets.ALPHASWARM_API_URL;
  if (!aqpApi) return;

  try {
    await fetch(`${aqpApi}/_internal/idp/sync-groups`, {
      method: "POST",
      headers: {
        Authorization: `Bearer ${token}`,
        "Content-Type": "application/json",
      },
      body: JSON.stringify({
        user_id: event.user.user_id,
        auth0_organization_id: event.organization?.id || null,
        connection_kind: kind,
        external_groups: groups,
      }),
    });
  } catch (err) {
    // Fail open — never block authentication because of a transient
    // backend hiccup. The next login retries.
    console.log("alphaswarm_idp_group_sync_failed:", err.message);
  }
};
```

### Wire-format the backend expects

The route `/_internal/idp/sync-groups` validates the M2M token via the same chain as `/_internal/auth0/sync`, then for every active `IdpConnectionRecord` of the matching `connection_kind` it looks up matching `IdpGroupMapping` rows and upserts the corresponding `Membership` rows.

### Verification

1. Stand the backend up with at least one active `IdpConnectionRecord` for the user's org + at least one `IdpGroupMapping` referencing one of the user's external groups.
2. Sign in via the matching IdP.
3. Hit `/whoami` and verify the `memberships` array contains the expected scope_kind / scope_id / role.
4. Query `security_audit_events`:

   ```sql
   SELECT created_at, event_type, details
   FROM security_audit_events
   WHERE event_type = 'idp_group_mapping_created'
      OR event_type = 'auth0_log_stream:s';
   ```

### Don't

- Don't bake group → role mappings into the Action body itself. The whole point of `IdpGroupMapping` is operator-driven mapping changes via the UI without redeploying Actions.
- Don't surface group lists in any visible UI or error message — some enterprise IdPs treat them as PII-adjacent.
- Don't enable this Action without first creating at least one matching `IdpConnectionRecord` in `status=active`; the route is a no-op without an active connection, but the Action wastes API call budget if it's misconfigured at scale.

# Auth0 + Microsoft Entra federation runbook

> Users authenticate through Auth0 Universal Login, can choose Microsoft via an enterprise connection, and then call the AlphaSwarm API with Auth0-issued access tokens that include AlphaSwarm custom claims

# Auth0 + Microsoft Entra federation runbook

This runbook covers the one-time operator setup for federating Microsoft Entra ID through Auth0 Universal Login, so AlphaSwarm keeps one identity control plane while still supporting enterprise SSO and account lifecycle features.

## 1) What this gives you

Users authenticate through Auth0 Universal Login, can choose Microsoft via an enterprise connection, and then call the AlphaSwarm API with Auth0-issued access tokens that include AlphaSwarm custom claims.

```mermaid
sequenceDiagram
    participant User
    participant SPA as AlphaSwarm SPA
    participant UL as Auth0 Universal Login
    participant Entra as Microsoft Entra ID
    participant Auth0
    participant API as AlphaSwarm API
    User->>SPA: Open login
    SPA->>UL: Redirect (PKCE + audience)
    UL-->>User: Show login options
    User->>UL: Click "Continue with Microsoft"
    UL->>Entra: Start enterprise connection flow
    Entra-->>UL: Return auth result
    UL->>Auth0: Complete federation and issue tokens
    Auth0-->>SPA: Redirect to /auth/callback
    SPA->>API: Call API with Bearer token
    API-->>SPA: Authorized response
```

## 2) Auth0 tenant resources to create

1. **AlphaSwarm API resource**
   - Navigate: `Dashboard > Applications > APIs > Create API`
   - Name: `AlphaSwarm API`
   - Identifier: `https://api.alphaswarm.local` (operator-selected; this becomes `ALPHASWARM_AUTH_OIDC_AUDIENCE`)
   - Signing algorithm: `RS256`
   - Permissions to add:
     - `read:messages`
     - `write:messages`
     - `admin`
     - `data:read`
     - `data:write`
   - Enable RBAC and enable **Add Permissions in the Access Token**.

2. **AlphaSwarm SPA Application**
   - Navigate: `Dashboard > Applications > Applications > Create Application`
   - Name: `AlphaSwarm SPA`
   - Type: `Single Page Application`
   - Allowed Callback URLs: `http://localhost:3001/auth/callback,https:///auth/callback`
   - Allowed Logout URLs: `http://localhost:3001/auth/logout,https:///auth/logout`
   - Allowed Web Origins: `http://localhost:3001,https://`
   - Token Endpoint Authentication Method: `None` (public client + PKCE)
   - Grant Types: `Authorization Code` and `Refresh Token`
   - Refresh Token settings: rotation enabled, reuse interval `0`
   - Save the Client ID as `VITE_AUTH0_CLIENT_ID`.

3. **AlphaSwarm Management API M2M Application**
   - Navigate: `Dashboard > Applications > Applications > Create Application`
   - Type: `Machine to Machine`
   - Authorize it for `Auth0 Management API`.
   - Grant scopes:
     - `read:users` - read user profiles and identity links.
     - `update:users` - patch profile/app metadata updates.
     - `create:users` - create user records when needed.
     - `delete:users` - hard-delete user accounts.
     - `read:user_sessions` - list active Auth0 sessions.
     - `delete:sessions` - revoke sessions and sign users out.
     - `read:authentication_methods` - list enrolled MFA methods.
     - `delete:authentication_methods` - remove MFA methods.
     - `create:authentication_method_enrollment_tickets` - generate MFA enrollment tickets.
     - `read:guardian_factors` - list available MFA factor types.
     - `create:user_tickets` - generate password change ticket URLs.
     - `read:logs` - fetch Auth0 audit/security events.
   - Save Client ID + Secret as:
     - `ALPHASWARM_AUTH0_MGMT_API_CLIENT_ID`
     - `ALPHASWARM_AUTH0_MGMT_API_CLIENT_SECRET`
   - Audience is `https://.auth0.com/api/v2/` and maps to `ALPHASWARM_AUTH0_MGMT_API_AUDIENCE`.

4. **Microsoft Enterprise Connection**
   - Navigate: `Dashboard > Authentication > Enterprise > Microsoft Azure AD`
   - Connection name: `azure-ad-myorg` (operator-selected). This becomes:
     - `ALPHASWARM_AUTH0_MICROSOFT_CONNECTION`
     - `VITE_AUTH0_MS_CONNECTION`
   - Use Common Endpoint: `Yes` for multi-tenant. Use tenant-specific endpoint for single-tenant installs.
   - Domain: leave blank for multi-tenant.
   - Paste Client ID + Client Secret from the Microsoft Entra app registration (Section 3).
   - Identity API: `Microsoft Identity Platform v2.0`
   - Attribute mapping: `Standard`
   - Open the `AlphaSwarm SPA` app -> `Connections` tab -> enable this connection.

5. **(Optional) Google social connection**
   - Navigate: `Dashboard > Authentication > Social > Google`
   - Auth0 dev keys are acceptable only for testing.
   - For production, configure your own Google OAuth client (see [Google OAuth 2.0 setup](https://developers.google.com/identity/protocols/oauth2)).

6. **Auth0 Action — post-login**
   - Implement the Action from Section 4.
   - Ensure it is enabled on the **Login Flow** trigger.

7. **(Optional, recommended) Custom Domain**
   - Navigate: `Dashboard > Branding > Custom Domains > Add Domain`
   - Example domain: `auth.alphaswarm.example`
   - Add the CNAME record shown by Auth0.
   - Wait for verification (typically about 5 minutes).
   - Universal Login uses the custom domain automatically once verified.

8. **Universal Login branding**
   - Navigate: `Dashboard > Branding > Universal Login > Customize`
   - Use the **New Universal Login** (template-based), not Classic.
   - Choose the `Identifier First + Biometrics` template.
   - Set logo URL and primary color from your brand guide.

## 3) Microsoft Entra app registration walkthrough

1. In Azure portal, open `Microsoft Entra ID > App registrations > New registration`.
2. Name the app `AlphaSwarm via Auth0`.
3. Supported account types: `Accounts in any organizational directory (Multitenant)` for B2B, or single-tenant for internal-only access.
4. Redirect URI: `Web`, set to `https://.auth0.com/login/callback`.
5. Click **Register**.
6. Copy **Application (client) ID** and paste into the Auth0 Microsoft Enterprise Connection.
7. Open `Certificates & secrets > New client secret`, then copy the **Value** (not secret ID) into the Auth0 Microsoft Enterprise Connection.
8. In `API permissions`, add Microsoft Graph delegated permissions: `openid`, `profile`, `email`, `User.Read`; then grant admin consent.
9. In `Authentication`:
   - `Allow public client flows`: `No`
   - Front-channel logout URL: `https://.auth0.com/v2/logout`
10. Optional token configuration: add optional claims `email`, `family_name`, and `given_name` if you want those in ID tokens.

## 4) The Auth0 Action JavaScript

Use this Action on the Login Flow -> Post Login trigger:

```javascript
/**
 * AlphaSwarm post-login Action.
 * Calls /_internal/auth0/sync on the AlphaSwarm API and injects the
 * returned custom claims into the access token. Also carries the
 * Auth0 connection name (e.g. "azure-ad-myorg") so the AlphaSwarm audit
 * log records WHICH IdP drove this login.
 *
 * Secrets used:
 *   ALPHASWARM_API_URL            e.g. https://api.alphaswarm.example
 *   ALPHASWARM_M2M_CLIENT_ID      Auth0 Management API M2M client id (reused)
 *   ALPHASWARM_M2M_CLIENT_SECRET  Auth0 Management API M2M client secret
 *   ALPHASWARM_M2M_AUDIENCE       Same as AlphaSwarm API resource identifier
 *
 * Set them at: Actions > Library > Custom > > Add Secret
 */
const NS = "https://alphaswarm/";

// Takes the full `event`: the tenant id and the secrets both live on it,
// and `event` is not in scope outside the exported handler.
async function mintM2MToken(event) {
  const secrets = event.secrets;
  // This host pattern assumes a tenant domain without a region segment;
  // adjust it for domains such as alphaswarm-prod.us.auth0.com.
  const url = `https://${event.tenant.id}.auth0.com/oauth/token`;
  const res = await fetch(url, {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({
      grant_type: "client_credentials",
      client_id: secrets.ALPHASWARM_M2M_CLIENT_ID,
      client_secret: secrets.ALPHASWARM_M2M_CLIENT_SECRET,
      audience: secrets.ALPHASWARM_M2M_AUDIENCE,
    }),
  });
  if (!res.ok) return null;
  const body = await res.json();
  return body.access_token || null;
}

exports.onExecutePostLogin = async (event, api) => {
  const aqpApi = event.secrets.ALPHASWARM_API_URL;
  if (!aqpApi) return; // Action mis-configured; fail open

  let token = await api.cache.get("alphaswarm_m2m_token");
  if (!token || !token.value) {
    const fresh = await mintM2MToken(event);
    if (!fresh) return;
    api.cache.set("alphaswarm_m2m_token", fresh, { ttl: 50 * 60 * 1000 });
    token = { value: fresh };
  }

  const payload = {
    user_id: event.user.user_id,
    email: event.user.email,
    organization_id: event.organization?.id,
    organization_name: event.organization?.name,
    requested_claims: {
      connection: event.connection?.name,
      strategy: event.connection?.strategy,
    },
  };

  try {
    const res = await fetch(`${aqpApi}/_internal/auth0/sync`, {
      method: "POST",
      headers: {
        Authorization: `Bearer ${token.value}`,
        "Content-Type": "application/json",
      },
      body: JSON.stringify(payload),
    });
    if (!res.ok) return;
    const claims = await res.json();
    for (const [k, v] of Object.entries(claims)) {
      if (v === null || v === undefined) continue;
      api.accessToken.setCustomClaim(`${NS}${k}`, v);
      api.idToken.setCustomClaim(`${NS}${k}`, v);
    }
  } catch (err) {
    // Fail open — never block the user's login if AlphaSwarm API is down.
    console.log("alphaswarm_sync_failed", err.message);
  }
};
```

The Action intentionally fails open. Blocking sign-in for every user because of a temporary outage in `/_internal/auth0/sync` is a worse failure mode than skipping one claim sync. The next successful login reconciles claims again.

## 5) `.env` values to set on AlphaSwarm

Use `.env.example` as the canonical source for all names and defaults.

### API + worker (`ALPHASWARM_*`)

- `ALPHASWARM_AUTH_PROVIDER=auth0`
- `ALPHASWARM_AUTH_OIDC_ISSUER` (Auth0 issuer URL)
- `ALPHASWARM_AUTH_OIDC_AUDIENCE` (AlphaSwarm API identifier)
- `ALPHASWARM_AUTH_OIDC_CLIENT_ID`
- `ALPHASWARM_AUTH_OIDC_CLIENT_SECRET` (required only for confidential clients)
- `ALPHASWARM_AUTH_LOGIN_CALLBACK`
- `ALPHASWARM_AUTH_LOGOUT_CALLBACK`
- `ALPHASWARM_AUTH_SESSION_SECRET`
- `ALPHASWARM_AUTH_M2M_ENABLED=true`
- `ALPHASWARM_AUTH_M2M_AUDIENCE` (normally same as API audience)
- `ALPHASWARM_AUTH0_MGMT_API_AUDIENCE`
- `ALPHASWARM_AUTH0_MGMT_API_CLIENT_ID`
- `ALPHASWARM_AUTH0_MGMT_API_CLIENT_SECRET`
- `ALPHASWARM_AUTH0_DATABASE_CONNECTION`
- `ALPHASWARM_AUTH0_MICROSOFT_CONNECTION`
- `ALPHASWARM_AUTH0_GOOGLE_CONNECTION` (if Google is enabled)
- `ALPHASWARM_AUTH_REQUIRE_EMAIL_VERIFIED`

### SPA build-time config (`VITE_*`)

- `VITE_AUTH0_DOMAIN`
- `VITE_AUTH0_CLIENT_ID`
- `VITE_AUTH0_AUDIENCE`
- `VITE_AUTH0_SCOPE`
- `VITE_AUTH0_REDIRECT_URI`
- `VITE_AUTH0_ORGANIZATION` (optional)
- `VITE_AUTH0_MS_CONNECTION`
- `VITE_AUTH0_GOOGLE_CONNECTION`
- `VITE_AUTH0_BRAND_NAME`
- `VITE_AUTH0_BRAND_LOGO_URL`

## 6) Verification curl commands

```bash
# Public endpoint (should return 200 without auth)
curl http://localhost:8000/api/public

# Private endpoint (401 without token)
curl http://localhost:8000/me

# Private endpoint (200 with access token)
curl http://localhost:8000/me -H 'Authorization: Bearer YOUR_ACCESS_TOKEN'

# Scoped endpoint (403 if token lacks read:messages)
curl http://localhost:8000/api/private-scoped -H 'Authorization: Bearer YOUR_ACCESS_TOKEN'
```

For a quick test token, use `Auth0 Dashboard > APIs > AlphaSwarm API > Test`.

## 7) Cutover checklist

- [ ] Auth0 tenant created
- [ ] AlphaSwarm API + SPA + Management API M2M apps created
- [ ] Microsoft Enterprise Connection created + tested
- [ ] Auth0 Action installed + enabled on Login Flow
- [ ] `.env` populated on the AlphaSwarm API + worker
- [ ] `.env.local` populated on the SPA build + rebuild + redeploy
- [ ] `ALPHASWARM_AUTH_PROVIDER=auth0` set
- [ ] `ALPHASWARM_AUTH_ENFORCE=strict` confirmed in prod
- [ ] Smoke: `/api/public` 200, `/api/private` 401 then 200, Microsoft button -> Entra -> callback -> `/`

## 8) Troubleshooting

- `401 invalid_token` after Microsoft login: verify the Action ran in `Dashboard > Monitoring > Logs` (filter event type `sapi` or `sf`).
- `invalid_request: missing audience`: ensure the authorize request includes `audience=`. The SPA should pass this from `VITE_AUTH0_AUDIENCE`.
- `Wrong issuer`: ensure issuer uses the Auth0 tenant domain ending in `.auth0.com`. If a custom domain is configured, confirm token issuer behavior and enable **Use Custom Domain in Tokens** when required.

# Auth0 setup — comprehensive operator runbook

> The platform supports three deployment shapes:

# Auth0 setup — comprehensive operator runbook

This is the canonical setup guide for AGENTS hard rules 52-55 (the Phase 5+ auth refactor).
Pair with [alphaswarm_docs/auth0-actions.md](../../concepts/identity/auth0-actions.md) for the JS Action bodies that go in the Auth0 Dashboard.

The platform supports three deployment shapes:

- **Local-first dev**: `ALPHASWARM_AUTH_PROVIDER=local`, no Auth0 tenant needed. Everything below is skipped.
- **Single-tenant B2C**: one Auth0 tenant per env, individual users sign up via Universal Login + social connections. Organizations is OFF (or "Allow individual logins" if you want both modes).
- **Multi-tenant B2B**: same Auth0 tenant per env, institutional customers attach via Auth0 Organizations. Each Organization has its own branded login + Enterprise connection.

The same backend serves all three; the difference is purely the Auth0 configuration + the `ALPHASWARM_AUTH_*` env vars.

---

## 1. Tenants

One Auth0 tenant per AlphaSwarm environment. Three tenants per AGENTS rule:

| Env | Auth0 tenant | Custom domain | Issuer URL in `ALPHASWARM_AUTH_OIDC_ISSUER` |
| --- | --- | --- | --- |
| dev | `alphaswarm-dev` | `auth.dev.alpha-swarm.ai` | `https://auth.dev.alpha-swarm.ai/` |
| stage | `alphaswarm-stage` | `auth.stage.alpha-swarm.ai` | `https://auth.stage.alpha-swarm.ai/` |
| prod | `alphaswarm-prod` | `auth.alpha-swarm.ai` | `https://auth.alpha-swarm.ai/` |

Custom domains stabilise the issuer URL so changing Auth0 tenants later is non-breaking. Without a custom domain the issuer is `https://alphaswarm-prod.us.auth0.com/` and every existing JWT cache / revocation token has to be invalidated on rebrand.

Never share Auth0 tenants across envs — Auth0 charges per MAU per tenant, but the security boundary is more important than the cost arithmetic.

---

## 2. API resource server

One API record per tenant — the AlphaSwarm backend.

| Field | Value (prod example) |
| --- | --- |
| Name | `alphaswarm-api` |
| Identifier | `https://api.alpha-swarm.ai/` |
| Signing algorithm | `RS256` |
| Allow Skipping User Consent | ON |
| Allow Offline Access | ON |
| Token expiration (seconds) | `86400` (24h ceiling — per-app overrides win) |
| Token expiration for browser flows (seconds) | `7200` (2h SPA ceiling) |

Enable **RBAC**:

- Settings → "Enable RBAC" → ON
- Settings → "Add Permissions in the Access Token" → ON

Define every permission AlphaSwarm uses (Permissions tab):

```
read:portfolio              Read portfolio positions / PnL / risk
write:portfolio             Mutate portfolio config
read:strategy               Read strategy specs / backtest history
write:strategy              Author / edit strategies
deploy:strategy             Promote a strategy to live trading
kill_switch:execute         Engage the global kill switch
trade:execute               Submit live or paper orders
trade:live                  Bypass the paper-only guard
read:mcp:data               Invoke the Data MCP tools
write:mcp:data              Mutate via Data MCP (e.g. namespace policy edits)
read:mcp:codebase           Invoke the Codebase MCP tools
write:mcp:codebase          Apply code edits via Codebase MCP (rarely granted)
run:agent                   Spawn an AgentRuntime
admin:tenant                Org-admin powers (invites, IdP config, billing)
admin:cluster               Bypass resource filter; superadmin-only
manage:broker_credentials   Read/write broker credentials at org scope
read:logs                   Required for the Auth0 Management API M2M client
```

Add Token Exchange:

- API → Settings → "Token Exchange" → ON (required for `alphaswarm-agent-broker` to use RFC 8693).

---

## 3. Applications

Five application records per tenant:

| Record | Type | Grants | Token TTL | Notes |
| --- | --- | --- | --- | --- |
| `alphaswarm-spa` | Single Page Application | `authorization_code` + `refresh_token` | access 15m, ID 10m | Refresh-token rotation ON, absolute lifetime 24h |
| `alphaswarm-cli` | Native | `urn:ietf:params:oauth:grant-type:device_code` + `refresh_token` | access 60m | Rotation ON, absolute 30d, inactivity 7d. **"Business Users" mode** so Device Code stays compatible with Orgs |
| `alphaswarm-backend-m2m` | M2M | `client_credentials` | 24h | For internal service-to-service + Auth0 Management API |
| `alphaswarm-action-callback-m2m` | M2M | `client_credentials` | 5m | Used inside Auth0 Actions for `/_internal/auth0/sync` |
| `alphaswarm-agent-broker` | M2M | `client_credentials` + `urn:ietf:params:oauth:grant-type:token-exchange` | 5m | RFC 8693 delegated-agent-token minting |

### 3.1 `alphaswarm-spa` (SPA)

- Application URIs:
  - Allowed callback URLs: `https://app.alpha-swarm.ai/auth/callback`, `http://localhost:3001/auth/callback`
  - Allowed logout URLs: `https://app.alpha-swarm.ai/`, `http://localhost:3001/`
  - Allowed web origins: `https://app.alpha-swarm.ai`, `http://localhost:3001`
- Refresh Token Rotation: ON
- Refresh Token Expiration: Absolute 24h
- Refresh Token Inactivity: 7d
- Idle Session Lifetime: 72h
- Maximum Session Lifetime: 168h (7d)

Frontend env vars (Vite):

```
VITE_AUTH_PROVIDER=auth0
VITE_AUTH0_DOMAIN=auth.alpha-swarm.ai # custom domain
VITE_AUTH0_SPA_CLIENT_ID=
VITE_AUTH0_AUDIENCE=https://api.alpha-swarm.ai/
VITE_AUTH0_SCOPE=openid profile email offline_access read:portfolio write:portfolio read:strategy write:strategy read:mcp:data
VITE_AUTH0_ORGANIZATION= # B2B only — pin to a single org
```

### 3.2 `alphaswarm-cli` (Native)

- Connections tab: enable the same DB / social connections as the SPA.
- Advanced Settings → Grant Types: enable `Device Code` + `Refresh Token`.
- "Business Users" mode (not "Organizations Required"); the Auth0 team's M2M-for-Orgs GA notes that Device Code is incompatible with the strict "Organizations Required" setting.

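
The Device Code grant enabled above follows RFC 8628: the client requests a device code, shows the user a verification URL, then polls the token endpoint. The sketch below only builds the two request payloads and the polling back-off rule; it makes no network calls. The function names are hypothetical and are not taken from the AlphaSwarm CLI, and the `/oauth/device/code` and `/oauth/token` paths are the standard Auth0 endpoints for this grant.

```python
from typing import Optional, Tuple
from urllib.parse import urlencode

DEVICE_CODE_GRANT = "urn:ietf:params:oauth:grant-type:device_code"


def device_code_request(
    domain: str, client_id: str, audience: str, scope: str
) -> Tuple[str, str]:
    """Return (url, form_body) for the initial device authorization request."""
    url = f"https://{domain}/oauth/device/code"
    body = urlencode({"client_id": client_id, "audience": audience, "scope": scope})
    return url, body


def token_poll_request(domain: str, client_id: str, device_code: str) -> Tuple[str, str]:
    """Return (url, form_body) for one poll of the token endpoint."""
    url = f"https://{domain}/oauth/token"
    body = urlencode(
        {
            "grant_type": DEVICE_CODE_GRANT,
            "device_code": device_code,
            "client_id": client_id,
        }
    )
    return url, body


def next_poll_delay(interval: int, error: Optional[str]) -> Optional[int]:
    """RFC 8628 polling rule: seconds to wait before the next poll, or None to stop.

    authorization_pending keeps the interval, slow_down adds 5 seconds,
    and anything else (success or a terminal error) ends the loop.
    """
    if error == "authorization_pending":
        return interval
    if error == "slow_down":
        return interval + 5
    return None
```

Both bodies are form-encoded, so they would be sent with `Content-Type: application/x-www-form-urlencoded`.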

CLI env vars (operator's machine):

```
ALPHASWARM_CLI_OIDC_DOMAIN=auth.alpha-swarm.ai
ALPHASWARM_CLI_OIDC_CLIENT_ID=
ALPHASWARM_CLI_OIDC_AUDIENCE=https://api.alpha-swarm.ai/
ALPHASWARM_CLI_OIDC_ORGANIZATION= # B2B: pin to a single org
```

The CLI fetches all three from `/auth/config` when not set, so most operators don't need to copy-paste.

### 3.3 `alphaswarm-backend-m2m` (M2M)

- Authorise against:
  - `alphaswarm-api` (all permissions the backend needs to act on its own behalf).
  - Auth0 Management API (`read:users`, `update:users`, `delete:sessions`, `read:sessions`, `read:logs`, `read:connections`, `create:guardian_enrollment_tickets`, `delete:guardian_enrollments`, `create:user_tickets`).

Backend env vars:

```
ALPHASWARM_AUTH_PROVIDER=auth0
ALPHASWARM_AUTH_OIDC_ISSUER=https://auth.alpha-swarm.ai/
ALPHASWARM_AUTH_OIDC_AUDIENCE=https://api.alpha-swarm.ai/
ALPHASWARM_AUTH_OIDC_CLIENT_ID= # SPA client_id (for the SPA-targeted JWKS validation path)
ALPHASWARM_AUTH_OIDC_CLIENT_SECRET= # empty — SPAs are public clients
ALPHASWARM_AUTH0_MGMT_API_AUDIENCE=https://alphaswarm-prod.us.auth0.com/api/v2/
ALPHASWARM_AUTH0_MGMT_API_CLIENT_ID=
ALPHASWARM_AUTH0_MGMT_API_CLIENT_SECRET= # via CredentialResolver in prod; env in dev
ALPHASWARM_AUTH0_DPOP_ENABLED=true # SDK mixed-mode
ALPHASWARM_AUTH0_DPOP_REQUIRED=false # flip true after CLI + SPA migrate
ALPHASWARM_AUTH_M2M_ENABLED=true
ALPHASWARM_AUTH_M2M_AUDIENCE=https://api.alpha-swarm.ai/
ALPHASWARM_AUTH_STEP_UP_ENABLED=true
ALPHASWARM_AUTH_STEP_UP_DEFAULT_MAX_AGE=180
```

### 3.4 `alphaswarm-action-callback-m2m` (M2M)

Same scopes as `alphaswarm-backend-m2m` but used INSIDE Auth0 Actions to call `/_internal/auth0/sync` + `/_internal/idp/sync-groups`. The Action body in [auth0-actions.md](../../concepts/identity/auth0-actions.md) shows how to mint + cache the token.

### 3.5 `alphaswarm-agent-broker` (M2M for Token Exchange)

- Grants: `client_credentials` + `urn:ietf:params:oauth:grant-type:token-exchange`.
- Authorised APIs: `alphaswarm-api` with scopes `read:mcp:data`, `write:mcp:data`, `read:mcp:codebase`, `write:mcp:codebase`.
- Used ONLY by the Custom Token Exchange Profile body to mint delegated agent tokens.

Backend env vars:

```
ALPHASWARM_AUTH_AGENT_TOKEN_EXCHANGE_ENABLED=true
ALPHASWARM_AUTH_AGENT_BROKER_CLIENT_ID=
ALPHASWARM_AUTH_AGENT_BROKER_CLIENT_SECRET= # via CredentialResolver in prod
ALPHASWARM_AUTH_AGENT_DELEGATION_TTL_SECONDS=300
```

---

## 4. Connections

### Database connection (B2C)

- Default `Username-Password-Authentication` database connection.
- Password Strength: "Excellent" (NIST 800-63 compliant).
- Enable: "Disable Signups from Public Signup Page" if you want invite-only onboarding (B2B-heavy deployments).

### Social connections (B2C)

- GitHub, Google (`google-oauth2`). Both default to the standard Auth0 connection types — no extra config beyond the Client ID + Secret from the respective developer console.

### Enterprise connections (B2B)

Configured per-org in `IdpConnectionRecord`. Auth0 supports SAML, ADFS, Azure AD (Entra), Google Workspace, PingFederate, SiteMinder, Okta Workforce Identity, OneLogin, JumpCloud, generic OIDC. The AlphaSwarm-side admin UI is [`IdpGroupMappingEditor`](../alphaswarm_client/src/components/onboarding/IdpGroupMappingEditor.tsx).

Each enterprise connection MUST:

- Sync the user's group claims (Azure `groups`, Google's group claim, Okta `groups`). The Action `alphaswarm-idp-group-sync` reads them.
- Map to a single AlphaSwarm Organization via the matching `IdpConnectionRecord.organization_id`. Multiple orgs may use the same connection KIND (e.g. AcmeCorp Okta + Subsidiary Okta) but each is a separate record.

---

## 5. Organizations (B2B)

One Auth0 Organization per institutional tenant. Auth0 charges per Org per month on most tiers — budget accordingly.

| Setting | Value |
| --- | --- |
| Membership on Login | "Require Members to use this Organization" (strict B2B) |
| Allowed Connections | Only the org's enterprise connection(s) |
| Branding | Per-org logo + colors so users land on a branded login |

Use `?organization=org_xxx&login_hint=user@acme.com` on `/authorize` to skip the org-picker step. The SPA reads `VITE_AUTH0_ORGANIZATION` to pin.

The post-login Action (`alphaswarm-post-login`) reads `event.organization?.id` and injects it as `https://alphaswarm.internal/org_id` so the FastAPI `require_org` dep can branch immediately.

---

## 6. Actions

Two Login-trigger Actions (in this order):

1. **`alphaswarm-post-login`** — JIT user upsert + custom claim injection. Body in [auth0-actions.md](../../concepts/identity/auth0-actions.md) ("Phase 7 post-login Action" section, extended by "Phase 8" addendum for step-up MFA).
2. **`alphaswarm-idp-group-sync`** — reads external IdP group claims and posts to `/_internal/idp/sync-groups` so the AlphaSwarm backend upserts matching Membership rows per the per-org IdpGroupMapping table. Body in [auth0-actions.md](../../concepts/identity/auth0-actions.md) ("Phase 6 — IdP group sync Action" section).

And one Custom Token Exchange Profile:

3. **`alphaswarm-agent-delegation`** — RFC 8693 minting for delegated agent tokens. Body in [auth0-actions.md](../../concepts/identity/auth0-actions.md) ("Phase 8 — Custom Token Exchange Profile" section).

---

## 7. Pre-User-Registration trigger

One Action to block disposable emails + verify B2B invites:

```javascript
exports.onExecutePreUserRegistration = async (event, api) => {
  const email = (event.user.email || "").toLowerCase();
  const disposable = [
    "mailinator.com",
    "guerrillamail.com",
    "tempmail.org",
    "10minutemail.com",
    "throwaway.email",
  ];
  const domain = email.split("@")[1];
  if (!email) {
    api.access.deny("invalid_email", "email required");
    return;
  }
  if (disposable.includes(domain)) {
    api.access.deny("disposable_email", "disposable email domains not allowed");
    return;
  }
  // B2B invite verification — operator chooses how strict.
  if (event.client.metadata?.flow === "b2b" && event.secrets.ALPHASWARM_BACKEND_URL) {
    // Call /_internal/auth/preregister-check (operator adds this route
    // if they want HMAC-based invite enforcement at registration time).
  }
};
```

---

## 8. Log Streams

One **Custom Webhook** log stream per env:

| Field | Value |
| --- | --- |
| Type | Custom Webhook |
| Payload URL | `https://api.alpha-swarm.ai/_internal/auth0/log-stream` |
| Authorization | `Bearer ` (matches `ALPHASWARM_AUTH0_LOG_STREAM_SECRET`) |
| Content Type | `application/json` |
| Custom Headers | (none beyond Authorization) |
| Filter | All events (the backend filters server-side) |

Operator generates the shared secret:

```
openssl rand -hex 32
```

…then sets it both in the Auth0 Dashboard webhook config AND in the backend's `ALPHASWARM_AUTH0_LOG_STREAM_SECRET` env var. The HMAC compare in `_verify_authorization` rejects any other value.

Optionally also wire native Datadog / Splunk / Elastic streams for the SIEM team — those are independent of the AlphaSwarm webhook.

---

## 9. Adaptive MFA

Security → Multi-factor Authentication → Adaptive MFA → ON.
| Risk level | Action | Why | | --- | --- | --- | | `low` | Allow (no MFA) | Normal session resumption | | `medium` | MFA challenge | Suspicious-but-not-definitive signals | | `high` | MFA challenge | Likely compromised | Enabled MFA factors (Security → Multi-factor Authentication → Factors): - **OTP (TOTP)** — always-on; required for every B2B user - **WebAuthn** — recommended primary for B2B users - **Push** (Auth0 Guardian app) — B2C convenience - **SMS** — discouraged for B2B; allow as B2C fallback only - **Email OTP** — convenient B2C fallback - **Recovery codes** — always issue alongside any factor The `alphaswarm-post-login` Action's Phase 8 addendum calls `api.multifactor.enable("any", { allowRememberBrowser: false })` when the SPA / CLI requests `acr_values=http://schemas.openid.net/pape/policies/2007/06/multi-factor` on `/authorize`. This is the integration point for the backend's `require_step_up` dep. --- ## 10. Env-var checklist (prod) ``` # IdP ALPHASWARM_AUTH_PROVIDER=auth0 ALPHASWARM_AUTH_REQUIRED=true ALPHASWARM_AUTH_ENFORCE=strict ALPHASWARM_AUTH_OIDC_ISSUER=https://auth.alpha-swarm.ai/ ALPHASWARM_AUTH_OIDC_AUDIENCE=https://api.alpha-swarm.ai/ ALPHASWARM_AUTH_OIDC_CLIENT_ID= ALPHASWARM_AUTH_CLAIMS_NAMESPACE=https://alphaswarm.internal/ ALPHASWARM_AUTH_CLAIMS_NAMESPACE_ALIASES=https://alphaswarm/ # CSV; legacy reader # Management API ALPHASWARM_AUTH0_MGMT_API_AUDIENCE=https://alphaswarm-prod.us.auth0.com/api/v2/ ALPHASWARM_AUTH0_MGMT_API_CLIENT_ID= ALPHASWARM_AUTH0_MGMT_API_CLIENT_SECRET= # via CredentialResolver # M2M ALPHASWARM_AUTH_M2M_ENABLED=true ALPHASWARM_AUTH_M2M_AUDIENCE=https://api.alpha-swarm.ai/ ALPHASWARM_AUTH_M2M_TOKEN_TTL_SECONDS=900 # DPoP ALPHASWARM_AUTH0_DPOP_ENABLED=true ALPHASWARM_AUTH0_DPOP_REQUIRED=false # flip true once SDK rolled out ALPHASWARM_DPOP_ENFORCEMENT_ENABLED=false # per-route enforcement # Step-up MFA (rule 52) ALPHASWARM_AUTH_STEP_UP_ENABLED=true ALPHASWARM_AUTH_STEP_UP_DEFAULT_MAX_AGE=180 # Auth0 Log Stream 
(rule 53) ALPHASWARM_AUTH0_LOG_STREAM_SECRET= ALPHASWARM_AUTH0_LOG_STREAM_MAX_AGE_SECONDS=86400 # Delegated agent tokens (rule 54) ALPHASWARM_AUTH_AGENT_TOKEN_EXCHANGE_ENABLED=true ALPHASWARM_AUTH_AGENT_BROKER_CLIENT_ID= ALPHASWARM_AUTH_AGENT_BROKER_CLIENT_SECRET= # via CredentialResolver ALPHASWARM_AUTH_AGENT_DELEGATION_TTL_SECONDS=300 # B2B Entra (existing) ALPHASWARM_AUTH_MSAL_B2B_ENABLED=true # Tenancy ALPHASWARM_TENANCY_DEFAULT_STRATEGY=hybrid ALPHASWARM_TENANCY_RLS_ENFORCE=strict # was off; flip after Phase 5 verified # MCP RFC conformance ALPHASWARM_MCP_DATA_CANONICAL_URI=https://api.alpha-swarm.ai/mcp/data ALPHASWARM_MCP_CODEBASE_CANONICAL_URI=https://api.alpha-swarm.ai/mcp/codebase ALPHASWARM_MCP_REQUIRE_RFC8707=strict # was off # Per-user OAuth wizard ALPHASWARM_USER_OAUTH_ENABLED=true # Audit ALPHASWARM_AUTH_AUDIT_ENABLED=true ALPHASWARM_AUTH_AUDIT_RETENTION_DAYS=365 ``` --- ## 11. CLI env vars (per operator) ``` ALPHASWARM_CLI_OIDC_DOMAIN=auth.alpha-swarm.ai ALPHASWARM_CLI_OIDC_CLIENT_ID= ALPHASWARM_CLI_OIDC_AUDIENCE=https://api.alpha-swarm.ai/ ALPHASWARM_CLI_OIDC_ORGANIZATION= # B2B: pin to a single org # Headless / CI fallback (no keyring backend): ALPHASWARM_CLI_AUTH_ALLOW_PLAINTEXT_FALLBACK=0 ``` --- ## 12. 
Rollout order | Step | Action | Verification | | --- | --- | --- | | 1 | Create dev tenant + apps + custom domain | `/auth/config` returns the tenant id | | 2 | Backend up with `ALPHASWARM_AUTH_ENFORCE=permissive` | Existing routes still serve; 401 dashboard shows zero would-be denies | | 3 | Flip `ALPHASWARM_AUTH_ENFORCE=strict` | Unauthenticated calls return 401 | | 4 | Wire Auth0 log-stream webhook + Action triggers | Force a session-revoke in Dashboard; verify `cleanup_for_user` Celery row + audit row | | 5 | Enable `ALPHASWARM_AUTH_STEP_UP_ENABLED=true` | Click kill-switch → MFA prompt; complete it; subsystems halt | | 6 | Enable `ALPHASWARM_AUTH_AGENT_TOKEN_EXCHANGE_ENABLED=true` + create Profile | Trigger an agent that calls a DataMCP tool; verify `act` claim in `/mcp/data` response body + `delegation` JSON in audit | | 7 | Enable `ALPHASWARM_USER_OAUTH_ENABLED=true` | `/me/oauth-connections/providers` returns the 5 providers | | 8 | Enable BYOK broker credentials (run Alembic 0065) | Add an Alpaca paper key; smoke-test a paper trade | | 9 | Enable RLS strict mode (`ALPHASWARM_TENANCY_RLS_ENFORCE=strict`) | Existing test workspace queries still work; cross-workspace fetches return zero rows | | 10 | Enable MCP RFC 8707 strict mode | MCP calls with mis-audienced tokens return 401 + WWW-Authenticate header | Each flip is independently reversible. --- ## 13. Reference docs - [alphaswarm_docs/auth0-actions.md](../../concepts/identity/auth0-actions.md) — Action bodies + the Custom Token Exchange Profile setup. - [alphaswarm_docs/identity.md](../../concepts/identity/identity.md) — the full identity stack. - [alphaswarm_docs/multi-tenancy.md](../../concepts/identity/multi-tenancy.md) — Organization → EntraTenantLink → User → Membership flow. - [alphaswarm_docs/credentials.md](../../concepts/identity/credentials.md) — how M2M + BYOK credentials flow through CredentialResolver. 
- [.cursor/rules/identity.mdc](../.cursor/rules/identity.mdc) — the always-on identity-enforcement rule. - [.cursor/rules/auth-stepup-and-byok.mdc](../.cursor/rules/auth-stepup-and-byok.mdc) — Phase 5+ rules (52-55) scoped to the new module files. - [AGENTS.md](https://github.com/julianwileymac/alphaswarm/blob/main/AGENTS.md) — hard rules 27, 44, 45, 50, 51, 52-55. # Biscuit capability tokens # Biscuit capability tokens > Phase 5 §8.2 of > [RESTRUCTURING_PLAN.md](https://github.com/julianwiley/alphaswarm/blob/main/RESTRUCTURING_PLAN.md). > Sits ALONGSIDE the existing > [`TokenExchangeBroker`](https://github.com/julianwiley/alphaswarm/blob/main/alphaswarm/auth/token_exchange.py) > (Rule 54), not replacing it. ## The problem `TokenExchangeBroker` mints short-lived JWTs via the RFC 8693 ``urn:ietf:params:oauth:grant-type:token-exchange`` grant. The result is a delegated agent JWT that carries every scope the agent could possibly need: ``` GET /mcp/data/iceberg.read -- Bearer POST /mcp/data/iceberg.write -- Bearer ``` If the agent is compromised mid-run, the attacker exfiltrates the JWT and replays it for ANY of those scopes until expiry. The JWT is broad-by-design — the broker can't know in advance which exact tool + arguments the agent will call. ## The Biscuit answer A biscuit is a capability token with a key property: **anyone can narrow it (attenuate), no one can widen it**. The minting flow becomes: ``` user JWT │ ▼ TokenExchangeBroker.exchange() -> delegated JWT (broad scopes) │ ▼ biscuit.mint_biscuit(jwt, caps) -> biscuit covering the full │ capability set for this run ▼ agent.attenuate_for_call(...) -> EXACTLY (tool, args, hash) │ ▼ HTTP POST /mcp/data/iceberg.read Authorization: Bearer -- existing path stays X-Biscuit: -- new gate ``` A compromised agent that exfiltrates the attenuated biscuit can ONLY replay the one call that biscuit was minted for. 
The attenuated biscuit's chained check fires on any other call: ``` check if capability("data.iceberg.read", "read", "nyse:trades", "") ``` ## AlphaSwarm integration The helpers live in [`alphaswarm/auth/biscuit.py`](https://github.com/julianwiley/alphaswarm/blob/main/alphaswarm/auth/biscuit.py): ```python from alphaswarm.auth.biscuit import ( mint_biscuit, attenuate_for_call, verify_biscuit, Capability, ) # 1. Mint at agent-run boot — derive from the delegated JWT's scopes. issued = mint_biscuit( user_sub=request.user.sub, agent_sub="agent_alpha_research_v3", capabilities=[ Capability( tool="data.iceberg.read", action="read", resource="nyse:trades", descriptor_hash=descriptor_hash_for("data.iceberg.read"), cell_id=request.alphaswarm_context.cell_id, ), # ... one per tool the agent may invoke during this run ], private_key_pem=settings.biscuit_signing_key_pem, ttl_seconds=900, cell_id=request.alphaswarm_context.cell_id, ) # 2. Attenuate per tool call — the agent narrows to exactly this call. narrow = attenuate_for_call( parent_b64=issued.token_b64, tool="data.iceberg.read", action="read", resource="nyse:trades", descriptor_hash=descriptor_hash_for("data.iceberg.read"), cell_id=request.alphaswarm_context.cell_id, ) # Attach `narrow` as the X-Biscuit header on the MCP HTTP call. # 3. Verify at MCP server — checks the attenuated chain. verified = verify_biscuit( token_b64=request.headers["X-Biscuit"], public_key_pem=settings.biscuit_public_key_pem, expected_tool="data.iceberg.read", expected_action="read", expected_resource="nyse:trades", expected_descriptor_hash=descriptor_hash_for("data.iceberg.read"), expected_cell_id=request.alphaswarm_context.cell_id, ) ``` ## Capability shape The `Capability` record carries four required fields: | Field | Meaning | | --- | --- | | `tool` | MCP tool name, e.g. `data.iceberg.read`. | | `action` | Verb, e.g. `read`, `write`, `delete`. | | `resource` | Canonical resource id, e.g. `nyse:trades`. 
| | `descriptor_hash` | SHA-256 of the canonical-JSON MCP tool descriptor (Phase 5 §8.4). | Plus an optional `cell_id` that pins the capability to a specific deployment cell (Phase 3 §6.2). ## Capability namespacing The capability namespace matches the MCP tool name: | Tool | Capability | | --- | --- | | `data.iceberg.read` | `read` | | `data.iceberg.write` | `write` | | `data.entities.search` | `read` | | `data.entities.create` | `write` | | `data.lineage.read` | `read` | | `data.secrets.read` | NOT BISCUIT-GATED — uses BrokerCredentialStore (Rule 55) | Adding a new tool with a new capability is purely additive — the existing biscuits keep working for the tools they cover. ## Mint key rotation The biscuit signing key is an ed25519 key pair. The private key lives in Vault Transit (Phase 4 §7.6); the public key is projected into every MCP server pod via a [`VaultStaticSecret`](https://github.com/julianwiley/alphaswarm/blob/main/alphaswarm_platform/deployments/kubernetes/mesh-identity/vault-secrets-operator/sample-vault-static-secret.yaml) named `biscuit-public-key`. Rotation procedure (operator-level): 1. Generate a new ed25519 key pair via Vault Transit's `transit/keys/biscuit-signing/rotate`. 2. Vault Transit keeps the OLD key version live for 7 days. 3. Every MCP server now accepts biscuits signed by EITHER key for that 7-day overlap (the verify path tries the new key first, falls back to the old key on signature mismatch — TODO in Phase 5.5). 4. After 7 days, drop the old key version. ## Failure modes | Failure | Behaviour | | --- | --- | | `biscuit-python` not installed (e.g. Windows dev) | `BiscuitUnavailable` raised; the agent runtime falls back to JWT-only delegation. The MCP server returns 503 if biscuit is required for the route. | | Biscuit signature mismatch | `BiscuitVerificationError` raised; route returns 403 `biscuit_invalid`. | | Biscuit capability doesn't match the route | `BiscuitVerificationError`; route returns 403 `biscuit_capability_mismatch`. 
| | Biscuit expired | `BiscuitVerificationError`; route returns 401 `biscuit_expired`. | ## Why not just narrow the JWT? JWTs are not attenuable. Once Auth0 mints a JWT with scopes `[data:read, data:write]`, the agent CANNOT mint a derived JWT with just `[data:read]`: doing so would require the agent to act as its own authorization server (it isn't one) and to hold the JWT signing key, which would compromise that key. Biscuits sidestep this by encoding capabilities as facts the agent can chain narrowing checks onto. The signature stays on the authority block; chained blocks add restrictions, never expand them. ## Phase 5.5 follow-ups 1. **Agent runtime wire-up** — automate the `mint_biscuit + attenuate_for_call` calls on every MCP tool invocation in `alphaswarm/agents/runtime.py`. Today the helpers are standalone. 2. **Key-rotation overlap window** — `verify_biscuit` should accept a list of public keys to try in order. Phase 5 ships single-key verification only; the multi-key fallback lands in Phase 5.5. 3. **MCP server-side enforcement** — wire `verify_biscuit` into the MCP HTTP request handler at `alphaswarm/data/mcp/server.py` so every tool call that doesn't carry a valid biscuit gets 401. ## Related documents - [RESTRUCTURING_PLAN.md §8.2](https://github.com/julianwiley/alphaswarm/blob/main/RESTRUCTURING_PLAN.md) - [alphaswarm/auth/biscuit.py](https://github.com/julianwiley/alphaswarm/blob/main/alphaswarm/auth/biscuit.py) - [alphaswarm/auth/token_exchange.py](https://github.com/julianwiley/alphaswarm/blob/main/alphaswarm/auth/token_exchange.py) — the JWT broker biscuits run alongside. - Biscuit specification: https://github.com/biscuit-auth/biscuit - biscuit-python: https://github.com/biscuit-auth/biscuit-python
# Cloud credentials # Cloud credentials How AlphaSwarm routes secret resolution through `CredentialResolver` once the cloud `SecretStore` siblings are wired. Phase C of the Phase 7 rollout (Terraform IaC + multi-cloud). ## Resolver chain ```mermaid flowchart LR caller["service code"] --> resolver["CredentialResolver.resolve(CredentialKey)"] resolver --> m2m["M2MStore (priority 10)"] resolver --> vault["HashicorpVaultStore (priority 15)"] resolver --> cloud["Cloud SecretStore (priority 30)"] resolver --> file["FileSecretStore (priority 50)"] resolver --> env["EnvSecretStore (priority 100)"] cloud --> azurekv["AzureKeyVaultStore"] cloud --> awssm["AwsSecretsManagerStore"] cloud --> gcpsm["GcpSecretManagerStore"] ``` Lower priority numbers resolve first. The cloud store is added to the default chain only when `ALPHASWARM_DEFAULT_CLOUD_PROVIDER` is set to that provider and the matching SDK is installed (see [`alphaswarm/credentials/resolver.py::_build_default_resolver`](../alphaswarm/credentials/resolver.py)). ## Naming conventions | Store | Key format | Notes | | --- | --- | --- | | Env | `_` (uppercase, `:` → `_`) | Always-on safety net. | | File | `bootstrap_state_dir/-.json` | Bootstrap workflows write these (Polaris principal, etc). | | Azure Key Vault | `alphaswarm--` (alphanumerics + `-` only) | Secret names disallow `:` / `/` / `_`. | | AWS Secrets Mgr | `{prefix}/` (default prefix `alphaswarm/`) | Slashes are first-class path separators. | | GCP Secret Mgr | `projects/{project}/secrets/{prefix}-` | Names allow `[A-Za-z0-9_-]` only — joins use `-`. | | Vault KV v2 | `/data//` | `hvac.Client.secrets.kv.v2.read_secret_version` adds `/data/` automatically. | The cloud secret values are parsed as JSON first; when parsing fails they're exposed via the canonical `credential` field. 
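
The JSON-first rule above can be sketched as follows. This is a minimal illustration, not the stores' actual code: `parse_secret_payload` is a hypothetical helper, and treating valid JSON that is not an object (a bare number or string) as a plain payload is an assumption.

```python
import json
from typing import Any


def parse_secret_payload(raw: str) -> dict[str, Any]:
    """Return structured fields for a raw secret value (hypothetical helper)."""
    try:
        parsed = json.loads(raw)
    except (TypeError, ValueError):
        # Not JSON: expose the raw string under the canonical `credential` field.
        return {"credential": raw}
    if isinstance(parsed, dict):
        return parsed
    # Assumption: valid JSON that is not an object is treated as a plain payload.
    return {"credential": raw}


structured = parse_secret_payload('{"password": "pw", "username": "alphaswarm"}')
plain = parse_secret_payload("sk_live_abcdef1234567890")
```

Under these assumptions a consumer reads named fields from `structured` and falls back to `plain["credential"]` for single-value secrets.
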
## Example secret layouts ### Azure Key Vault — `alphaswarm-msal-clientsecret` ```json { "client_secret": "rxq8Q..." } ``` ### AWS Secrets Manager — `alphaswarm/broker/api_key` ``` sk_live_abcdef1234567890 ``` Plain string payload — exposed via `credential.get("credential")`. ### GCP Secret Manager — `alphaswarm-postgres-password` ```json { "password": "...", "username": "alphaswarm" } ``` ### HashiCorp Vault KV v2 — `secret/data/alphaswarm/redis/password` ```json { "password": "..." } ``` ## Wiring a SecretStore Pick a cloud + install the matching extra: ```bash pip install 'alphaswarm[cloud-azure]' # AzureKeyVaultStore pip install 'alphaswarm[cloud-aws]' # AwsSecretsManagerStore pip install 'alphaswarm[cloud-gcp]' # GcpSecretManagerStore pip install 'alphaswarm[vault]' # HashicorpVaultStore ``` Configure (matching cloud picked via `ALPHASWARM_DEFAULT_CLOUD_PROVIDER`): ``` # Azure ALPHASWARM_DEFAULT_CLOUD_PROVIDER=azure ALPHASWARM_AZURE_TENANT_ID=... ALPHASWARM_AZURE_SUBSCRIPTION_ID=... ALPHASWARM_AZURE_KEYVAULT_URL=https://alphaswarm-vault.vault.azure.net/ # AWS ALPHASWARM_DEFAULT_CLOUD_PROVIDER=aws ALPHASWARM_AWS_REGION=us-east-1 ALPHASWARM_AWS_ACCOUNT_ID=123456789012 ALPHASWARM_AWS_SECRETSMANAGER_PREFIX=alphaswarm/ # GCP ALPHASWARM_DEFAULT_CLOUD_PROVIDER=gcp ALPHASWARM_GCP_PROJECT_ID=alphaswarm-prod ALPHASWARM_GCP_REGION=us-central1 ALPHASWARM_GCP_SECRET_PREFIX=alphaswarm- # Vault (any cloud) ALPHASWARM_VAULT_ADDR=https://vault.example.com ALPHASWARM_VAULT_NAMESPACE=... ALPHASWARM_VAULT_MOUNT=secret ALPHASWARM_VAULT_ROLE_ID=... ALPHASWARM_VAULT_SECRET_ID=... ``` The resolver auto-adds the matching cloud store + Vault store when the env vars are present. 
Code that needs a credential does: ```python from alphaswarm.credentials import get_resolver from alphaswarm.credentials.protocol import CredentialKey resolver = get_resolver() cred = resolver.resolve(CredentialKey(service="msal", purpose="client_secret")) secret = cred.require("client_secret") ``` ## Authentication backends per cloud store | Store | Identity source | | ---------------------- | ---------------------------------------------------------------- | | Azure Key Vault | `DefaultAzureCredential` (az login / SP env / Workload Identity) | | AWS Secrets Manager | boto3 default chain (env / shared credentials / IRSA / EC2 role) | | GCP Secret Manager | `google.auth.default()` (gcloud ADC / SA file / Workload Identity)| | HashiCorp Vault | AppRole (preferred) or whatever the operator pre-configured | For cluster-side workloads the **Workload Identity** variants are the canonical path: - AKS — `AzureAksAdapter` + Azure Workload Identity (Service Account annotation `azure.workload.identity/client-id: `). - EKS — `AwsEksAdapter` + IRSA (`eks.amazonaws.com/role-arn` annotation). - GKE — `GcpGkeAdapter` + GKE Workload Identity (`iam.gke.io/gcp-service-account` annotation). ## External Secrets Operator integration The Terraform `secrets` module wires an [`external-secrets`](https://external-secrets.io) `ClusterSecretStore` pointing at whichever backend matches `vault_backend`. The `secret_mappings` locals block emits one `ExternalSecret` per `(k8s_secret_name, vault_path)` pair so AlphaSwarm pods consume secrets via mounted Secrets — never raw env vars. See [`alphaswarm_platform/terraform/modules/secrets/main.tf`](../alphaswarm_platform/terraform/modules/secrets/main.tf) for the full mapping table. ## Temporary credentials minted via cloud CLI Operators with an `admin:cluster` scope can mint short-lived credentials directly from the admin UI without shipping the cloud CLI binaries into the BFF container. 
The control plane wraps `aws sts assume-role` / `gcloud auth print-access-token` / `az account get-access-token` in an audit-first subprocess runner; the resulting credential is persisted under a resolver key supplied by the operator and surfaces through the standard `CredentialResolver.resolve(...)` chain. See the [cloud-CLI temporary credentials](../../how-to/operations/cloud-cli-temporary-credentials.md) runbook for the wizard walkthrough, audit shape, and step-up MFA contract. # Credentials resolver > The resolver walks an ordered chain of :class:`alphaswarm.credentials.SecretStore` instances and returns the first non-empty hit, falling back to a caller-supplied default. The chain order means a fresh M2M ... # Credentials resolver AlphaSwarm collapses every "where does this service's credential come from?" question into a single :class:`alphaswarm.credentials.CredentialResolver`. The resolver walks an ordered chain of :class:`alphaswarm.credentials.SecretStore` instances and returns the first non-empty hit, falling back to a caller-supplied default. The chain order means a fresh M2M token wins over a bootstrap-minted file payload, which wins over a static `settings` seed. ## Why The motivating bug: `iceberg_bootstrap` mints a runtime principal (`alphaswarm_runtime`) and persists it to `data/bootstrap/polaris-principal.json`, but `polaris_client` and `iceberg_catalog._build_properties` historically read `settings.polaris_client_*` / `settings.iceberg_rest_credential` — the static `root` / `s3cr3t` seed — so Polaris kept rejecting the API container's writes with `CREATE_TABLE_DIRECT_WITH_WRITE_DELEGATION` 403s. The resolver closes that loop without forking the credential paths. 
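
A toy model of the chain walk may help make the fix concrete. It is a sketch under stated assumptions: `ToyStore` and `resolve` are hypothetical stand-ins rather than the real `SecretStore` protocol, and only the priorities (file 50, env 100) and the `root` / `s3cr3t` seed come from this page.

```python
from dataclasses import dataclass, field
from typing import Optional

Key = tuple[str, str]  # (service, purpose)


@dataclass
class ToyStore:
    """Hypothetical stand-in for a SecretStore; not the real protocol."""
    name: str
    priority: int
    payloads: dict[Key, dict[str, str]] = field(default_factory=dict)

    def get(self, key: Key) -> Optional[dict[str, str]]:
        return self.payloads.get(key)


def resolve(stores, key, default=None):
    """Walk stores in ascending priority and return the first non-empty hit."""
    for store in sorted(stores, key=lambda s: s.priority):
        payload = store.get(key)
        if payload:  # missing or empty payloads fall through to the next store
            return store.name, payload
    return "default", dict(default or {})


polaris = ("polaris", "oauth")
env_store = ToyStore("env", 100, {polaris: {"client_id": "root", "client_secret": "s3cr3t"}})
file_store = ToyStore("file", 50, {polaris: {"client_id": "alphaswarm_runtime", "client_secret": "minted"}})

# The bootstrap-minted principal wins over the static seed.
source, cred = resolve([env_store, file_store], polaris)
```

Because the file store has the lower priority number, the bootstrap-minted `alphaswarm_runtime` principal is returned even though the static seed is still present in the env store.
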
## Architecture ```mermaid flowchart TD Caller[Service code] Resolver[CredentialResolver] M2M["M2MStore priority 10"] File["FileSecretStore priority 50"] Env["EnvSecretStore priority 100"] M2MIssuer[M2MTokenIssuer] Bootstrap["IcebergBootstrapManager persists json"] Settings["alphaswarm.config.settings"] Caller -->|"resolve(CredentialKey)"| Resolver Resolver --> M2M Resolver --> File Resolver --> Env M2M --> M2MIssuer File --> Bootstrap Env --> Settings ``` The resolver is a process-wide singleton built lazily by :func:`alphaswarm.credentials.get_resolver`. The default chain is `Env` + `File`; `M2M` plugs in front when :func:`alphaswarm.auth.m2m.install_m2m_store` runs (controlled by `ALPHASWARM_AUTH_M2M_ENABLED`). ## Usage ```python from alphaswarm.credentials import CredentialKey, get_resolver cred = get_resolver().resolve( CredentialKey("polaris", "oauth"), default={"client_id": "root", "client_secret": "s3cr3t"}, ) client_id = cred.get("client_id") client_secret = cred.get("client_secret") ``` `Credential.source` is `"file"` / `"env"` / `"m2m"` / `"default"`, useful for diagnostics. ## Field maps Per `(service, purpose)`, here is what consumers expect: - `polaris:oauth` → `client_id`, `client_secret`, `principal` - `polaris:rest` / `iceberg:rest` → `credential` (`:`), `token`, `oauth2_server_uri`, `scope` - `trino:basic` → `user`, `source`, optional `token` / `access_token` - `minio:static` → `access_key`, `secret_key`, `endpoint_url`, `region` - `minio:sts` → `session_token` (M2M-issued) - `neo4j:basic` → `user`, `password`, `uri` Add new entries to [alphaswarm/credentials/stores/env_store.py](../alphaswarm/credentials/stores/env_store.py) when you wire a new service to the resolver. ## Bootstrap → resolver Bootstrap workflows call :func:`alphaswarm.services.iceberg_bootstrap.persist_principal_credentials` (and similar) to write JSON under `settings.bootstrap_state_dir`. 
`FileSecretStore` reads those files; the bootstrap also resets any caches that depend on the credentials (e.g. `iceberg_catalog.reset_catalog_cache()`). When you add a new bootstrap step: 1. Add the file name to [`alphaswarm/credentials/stores/file_store.py::_FILE_MAP`](../alphaswarm/credentials/stores/file_store.py). 2. Persist a JSON payload with at least `client_id` / `client_secret`. 3. Reset any consumer caches in your bootstrap writer. ## Diagnostics `get_resolver().describe()` returns the active store chain and priorities — wire it into a debug endpoint when you need to inspect the resolution order from outside the process. ## Testing `tests/credentials/` contains the canonical test patterns: - Test the resolver chain priority order with `pytest`. - Test new env store branches with a `_StubSettings` shim. - Test new file store keys by writing the JSON to a `tmp_path`. The `reset_resolver` fixture re-builds the singleton between tests so you don't have to track down stale state. # Edge authentication & cell routing > How the alphaswarm-tenant-router verifies JWTs fail-closed at the Envoy edge, routes B2C/B2B tenants onto cell tiers, and validates Cell-Bound-Authorization for cross-cell calls. # Edge authentication & cell routing Every request entering the hosted platform crosses one authentication decision point before it reaches a cell: [`alphaswarm-edge`](https://github.com/Alpha-Swarm-ai/alphaswarm_platform/tree/main/build/docker/alphaswarm-edge) (Envoy) makes two `ext_authz` callouts to [`alphaswarm-tenant-router`](https://github.com/Alpha-Swarm-ai/alphaswarm_platform/tree/main/tenant_router), which verifies identity and decides cell placement in one pass. 
```mermaid sequenceDiagram participant C as Client (SPA / CLI / agent) participant E as alphaswarm-edge (Envoy) participant R as alphaswarm-tenant-router participant Cell as alphaswarm-core (per-cell) C->>E: request + Authorization: Bearer JWT E->>R: POST /cell_bound/v1/check (CBA filter) R-->>E: 200 (no CBA header = external traffic) E->>R: POST /ext_authz/v3/check Note over R: verify JWT vs IdP JWKS(iss, aud, exp, alg allowlist) Note over R: pick cell: pinning → tier claim → default R-->>E: 200 + x-alphaswarm-cell + verified identity headers E->>Cell: request + x-alphaswarm-sub/-tenant/-workspace ``` ## Fail-closed verification The router's posture is an explicit setting (`ALPHASWARM_TENANT_ROUTER_AUTH_MODE`), and the default is the strict one — see the [rollout runbook](../../how-to/tenant-router-auth-rollout.md) for the operational details: | Mode | No token | Invalid token | Valid token | | --- | --- | --- | --- | | `required` (default, hosted cells) | 401 | 401 | allow | | `permissive` (canary/migration) | allow, flagged | 401 | allow | | `disabled` (local dev; needs `ALLOW_INSECURE=true` too) | allow | unsigned decode | unsigned decode | Three design rules keep the edge honest: 1. **Boot-time refusal.** In `required`/`permissive` the pod exits at startup unless issuer + audience (and a derivable JWKS URI) are configured. A crash-looping edge is strictly better than one that silently routes unauthenticated traffic. 2. **Asymmetric algorithms only.** `RS*`/`PS*`/`ES*`/`EdDSA` are the only acceptable JWT algorithms; `HS*` and `none` are rejected before any key material is consulted, closing the alg-confusion class of attacks. Verification semantics mirror `alphaswarm_core.auth.jwt_validator.JwtValidator` (kid selection, one forced JWKS refresh on unknown kid for key rotation, TTL cache that serves stale on IdP blips). 3. 
**Identity headers are always overwritten.** On every ALLOW the router emits the full verified set — `x-alphaswarm-sub`, `x-alphaswarm-tenant`, `x-alphaswarm-workspace`, `x-alphaswarm-org`, `x-alphaswarm-auth` — empty when a claim is absent, so a client can never smuggle its own `x-alphaswarm-*` values past the edge. Per-cell FastAPI gates (`alphaswarm.api.security`) still re-validate the JWT; the edge is defense-in-depth, not the only boundary (AGENTS rule 11 applies at every layer). ## B2C / B2B tier routing Cell selection composes the [multi-tenancy](./multi-tenancy.md) model with the deployment tiers from RESTRUCTURING_PLAN.md §6.1: | Plan | JWT `tier` claim | Cell tier | Tenancy strategy | | --- | --- | --- | --- | | B2C consumer | (none) or `shared-std` | `shared-std` | `shared_schema_rls` | | B2B premium | `shared-prem` | `shared-prem` | `schema_per_tenant` | | Regulated enterprise | (registry pinning) | `silo-reg` | `database_per_enterprise` | | Custom contract | (registry pinning) | `silo-custom` | `hybrid` | Resolution order, per request: 1. **Registry pinning is authoritative** — a tenant listed in a cell's `pinned_tenants` always lands there (silo cells, controlled migrations), regardless of token claims. 2. **The verified `tier` claim** (namespaced `https://alphaswarm.internal/tier`, stamped by the Auth0 Action / Entra claims pipeline) selects the tier. An explicit tier is honored or refused with 503 — never silently downgraded onto another tier's tenancy strategy. 3. **Default tier** (`shared-std`) otherwise. Within a tier, unpinned tenants spread across active cells by rendezvous (highest-random-weight) hashing keyed on `tenant_id → organization_id → sub`: every router replica picks the same cell with no shared state, a tenant is sticky to its cell, and adding or draining a cell only remaps the tenants that hashed onto it. 
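
The rendezvous selection described above can be illustrated with a short sketch. The hash (SHA-256 over `cell|key`), the cell names, and the helper names `routing_key` / `pick_cell` are assumptions for illustration; the router's actual scoring function is not documented here.

```python
import hashlib
from typing import Optional, Sequence


def routing_key(tenant_id: Optional[str], organization_id: Optional[str], sub: str) -> str:
    """First non-empty identifier in the order tenant_id, organization_id, sub."""
    return tenant_id or organization_id or sub


def pick_cell(cells: Sequence[str], key: str) -> str:
    """Highest-random-weight choice: score every cell, take the maximum."""
    def score(cell: str) -> int:
        digest = hashlib.sha256(f"{cell}|{key}".encode("utf-8")).digest()
        return int.from_bytes(digest[:8], "big")
    return max(cells, key=score)


cells = ["shared-std-a", "shared-std-b", "shared-std-c"]  # hypothetical cell names
tenants = [f"tenant-{i}" for i in range(200)]
before = {t: pick_cell(cells, routing_key(t, None, "user")) for t in tenants}

# Drain one cell: only tenants that lived on it should move.
remaining = [c for c in cells if c != "shared-std-b"]
after = {t: pick_cell(remaining, routing_key(t, None, "user")) for t in tenants}
moved = [t for t in tenants if before[t] != after[t]]
```

Each replica computes the same scores from the same inputs, so no coordination is needed, and a tenant whose winning cell is still active keeps that cell when another one is drained.
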
Registry staleness (the router caches the control plane's `/manage/cells` view) is **reported, never failed closed** — the data plane keeps routing on last-known-good cells through a control-plane outage, surfacing `registry_stale` in `/readyz` and a counter in `/metrics`. ## Cell-Bound-Authorization (cross-cell calls) Cross-cell calls are the highest-risk path (Phase 5 §8.5). The mint side lives in `alphaswarm.auth.cell_bound`; the router hosts the validator at `POST /cell_bound/v1/check` (the `alphaswarm-cell-bound-validator` Service selects the same pods): - No `Cell-Bound-Authorization` header → pass. External user traffic and same-cell calls never carry one; the response still emits empty `x-alphaswarm-cell-source-*` headers so smuggled values are stripped. - Header present → the token must verify against the **source cell's** published keys (cells-registry annotation `alphaswarm.internal/cba-jwks`, JWKS JSON or PEM), with `iss` = source cell, `aud` = destination cell, a ≤90 s lifetime (mint stamps 60 s), required `jti`, and per-replica replay rejection. Valid CBAs inject `x-alphaswarm-cell-source` + `x-alphaswarm-cell-source-workload` (SPIFFE id) so destination-cell services can authorize the calling workload. - `CBA_MODE=monitor` logs would-be denials without blocking (rollout aid); `enforce` is the default and is safe before any workload mints CBAs because headerless requests pass through. 
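
The claim checks in the list above might look roughly like the sketch below. It assumes the signature has already been verified against the source cell's published keys and covers only the claim rules; `check_cba_claims`, the reason strings, the single-string `aud`, and the in-memory `seen_jtis` cache are all hypothetical.

```python
import time
from typing import Optional

MAX_LIFETIME_SECONDS = 90  # validator ceiling; the mint side stamps 60 s


def check_cba_claims(
    claims: dict,
    source_cell: str,
    destination_cell: str,
    seen_jtis: dict[str, float],
    now: Optional[float] = None,
) -> tuple[bool, str]:
    """Claim checks for an already signature-verified CBA (hypothetical helper)."""
    now = time.time() if now is None else now
    if claims.get("iss") != source_cell:
        return False, "iss_mismatch"
    if claims.get("aud") != destination_cell:
        return False, "aud_mismatch"
    iat, exp = claims.get("iat"), claims.get("exp")
    if not isinstance(iat, (int, float)) or not isinstance(exp, (int, float)):
        return False, "missing_iat_or_exp"
    if exp <= now:
        return False, "expired"
    if exp - iat > MAX_LIFETIME_SECONDS:
        return False, "lifetime_too_long"
    jti = claims.get("jti")
    if not jti:
        return False, "missing_jti"
    # Forget jtis whose tokens have expired, then reject replays.
    for old in [j for j, e in seen_jtis.items() if e <= now]:
        del seen_jtis[old]
    if jti in seen_jtis:
        return False, "replay"
    seen_jtis[jti] = exp
    return True, "ok"


cache: dict[str, float] = {}
token = {"iss": "cell-a", "aud": "cell-b", "iat": 1000, "exp": 1060, "jti": "abc"}
first = check_cba_claims(token, "cell-a", "cell-b", cache, now=1010)
second = check_cba_claims(token, "cell-a", "cell-b", cache, now=1011)
```

Keeping the replay cache in process memory matches the per-replica wording above: a token replayed against a different replica would not be caught by this cache alone, which is one reason the lifetime ceiling is kept short.
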
## Where things live | Surface | Path | | --- | --- | | Router service + tests | `alphaswarm_platform/tenant_router/` | | Edge Envoy config (canonical template) | `alphaswarm_platform/build/docker/alphaswarm-edge/envoy.template.yaml` | | Deployment (ConfigMap, NetworkPolicy, HPA, Services) | `alphaswarm_platform/deployments/kubernetes/edge/alphaswarm-tenant-router/` | | Backend JWT validation (per-cell) | `alphaswarm/auth/oidc.py`, `alphaswarm_core/auth/jwt_validator.py` | | CBA mint/verify library | `alphaswarm/auth/cell_bound.py` | | Operator runbook | [Tenant-router auth rollout](../../how-to/tenant-router-auth-rollout.md) | | Cutover history | [Cell-router cutover](../../how-to/cell-router-cutover.md) | # Entra ID as the AlphaSwarm staff user pool # Entra ID as the AlphaSwarm staff user pool Microsoft Entra ID is the **first user pool** for the managed AlphaSwarm platform. AlphaSwarm staff (engineers, operators, compliance, finance, auditors, SOC) sign in to `manage.alpha-swarm.ai` through the AlphaSwarm staff Entra tenant; Auth0 stays as the customer-facing B2C fallback and the documented degraded-mode entry path. This page explains *what* the rollout does and *why*. The runbook that walks through *how* lives at [`how-to/entra-terraform-bootstrap`](../../how-to/entra-terraform-bootstrap.md); the long-form plan with phases + risks at [`docs/plans/entra-internal-tenant-rollout.md`](pathname:///docs/plans/entra-internal-tenant-rollout.md); and the architectural decision at [ADR-011](../../architecture/decisions/011-entra-as-first-pool.md). ## Why Entra, why now | Driver | Detail | | --- | --- | | MFA + Conditional Access | Entra carries the company's existing MFA + CA enforcement; staff already authenticate to it daily for Microsoft 365. | | Audit centralisation | Sign-in logs land in the corporate SIEM via the existing Entra log stream; no separate Auth0 export to maintain. 
| | Group-driven authorisation | Group membership in Entra → app-role claim on AlphaSwarm tokens. New hires onboard via a single HR-side group action. | | No client secrets in CI | GitHub Actions OIDC + federated credentials replace the old `AZURE_CLIENT_SECRET` repo secret. | | Customer separation | The AlphaSwarm staff tenant is independent of every customer tenant. Customer tenants continue to flow through the `EntraTenantLink` B2B approval wizard (AGENTS rule 44). | ## What is under Terraform control - **3 app registrations**: `alphaswarm-staff` (login app), `alphaswarm-manage-api` (Resource Server), `alphaswarm-ci-github` (federated-credential-only). - **3 service principals** with `app_role_assignment_required = true` on the manage API + CI app. - **7 app roles** on the manage API: Admin, Operator, Auditor, Compliance, Finance, Engineer, Viewer. - **7 directory groups**: AlphaSwarm-Admins, AlphaSwarm-Operations, AlphaSwarm-Auditors, AlphaSwarm-Compliance, AlphaSwarm-Finance, AlphaSwarm-Engineering, AlphaSwarm-SOC. - **Group → app-role assignments** mapping each group to one or more roles. - **Federated credentials** for GitHub Actions OIDC (per-environment + per-branch, never wildcards). - **Named locations** representing AlphaSwarm-trusted IP ranges (referenced by Conditional Access policies). ## What is NOT under Terraform control - **Conditional Access policies**. CA policies require an Entra ID P2 license + manual Security review. The Terraform module records policy display names as documentation; the verify helper queries Microsoft Graph at smoke-test time to confirm each named policy exists. - **Group membership**. HR + Security own membership through the Azure Portal (or Entitlement Management). Terraform owns *which groups exist + what roles they confer*; not *who is in them*. - **Customer-tenant Entra integration**. Customer tenants flow through the existing `EntraTenantLink` B2B wizard (AGENTS rule 44). This rollout is internal-only. 
- **Privileged Identity Management (PIM)**. Tracked as future work in the rollout plan §7. ## Token shape Every staff access token minted for `api://alphaswarm-manage-api` carries: | Claim | Value | | --- | --- | | `iss` | `https://login.microsoftonline.com/{alphaswarm_staff_tenant_id}/v2.0` | | `aud` | `api://alphaswarm-manage-api` | | `roles` | one or more of `Admin`, `Operator`, `Auditor`, `Compliance`, `Finance`, `Engineer`, `Viewer` | | `groups` | the staff member's directory group object ids (security-only) | | `oid` | the user's Entra object id (stable across renames) | | `tid` | the AlphaSwarm staff Entra tenant id | | `preferred_username` | `firstname.lastname@` | The application reads `roles` to gate `/manage/*` routes; a staff member with no roles is treated as `Viewer` until promoted by an admin. ## Provider-chain priority `alphaswarm/auth/providers/__init__.py` exposes [`select_provider_for_token`](pathname:///docs/concepts/identity/entra-internal-tenant.md#provider-chain-priority) which: 1. Decodes the token's `iss` claim (no signature check). 2. If `iss` matches the AlphaSwarm staff issuer, returns `MsalEntraIdentityProvider`. 3. Otherwise falls back to `get_active_provider()` (Auth0 in production). The `manage.alpha-swarm.ai` mounts use this selector instead of the bare `get_active_provider()` so internal-tenant tokens always route through MSAL first. Customer tokens (different `iss`) continue to land on Auth0. ## Lifecycle | Phase | What happens | Owner | | --- | --- | --- | | 0. Pre-flight | Tenant id confirmed; bootstrap SP provisioned | Identity team | | 1. Plan + module land | `alphaswarm_entra_directory` module shipped + plan-only validated | Platform | | 2. Apply + smoke | Resources created; staff member tests login | Platform | | 3. Cutover | `auth_msal_priority` set so MSAL wins for staff | Platform + Identity | | 4. Group onboarding | HR populates the seven groups | HR + Security | | 5. 
CI cutover | All workflows switch to OIDC federation | DevOps | See the rollout plan for week-level scheduling, exit criteria, and rollback procedures. ## How a staff member signs in ```mermaid sequenceDiagram participant U as AlphaSwarm Staff participant Browser participant alphaswarm_admin as manage.alpha-swarm.ai participant Entra participant manage_api as /manage/* U->>Browser: visit manage.alpha-swarm.ai Browser->>alphaswarm_admin: GET / alphaswarm_admin-->>Browser: 302 /auth/login?provider=entra Browser->>alphaswarm_admin: GET /auth/login?provider=entra alphaswarm_admin->>Entra: /authorize (PKCE + nonce) Entra-->>U: MFA / CA challenge U->>Entra: presents FIDO2 + CA-evaluated location Entra-->>alphaswarm_admin: 302 /auth/callback?code=... alphaswarm_admin->>Entra: exchange code (PKCE redeemed) Entra-->>alphaswarm_admin: id_token + access_token (roles claim) alphaswarm_admin->>alphaswarm_admin: stamp session cookie Browser->>manage_api: GET /manage/cells (Bearer ...) manage_api->>manage_api: select_provider_for_token (MSAL) manage_api-->>Browser: 200 JSON ``` ## Reading the audit trail Every Entra-side mutation lands in two places: - The **Entra audit log** (corporate SIEM via existing log stream). Captures app-registration changes, group-membership changes, CA-policy edits, admin consents. - The **AlphaSwarm `terraform_runs` ledger**. Captures every Terraform apply on the `entra-internal` stack with the operator who triggered it, the SHA of the rendered HCL, the previous + new state hashes, and whether the run succeeded or rolled back. Auditors who need a full reconstruction window query both. The Phase 7 evidence-bundle export already includes `terraform_runs` rows in its deterministic archive. 
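The issuer-based routing described under "Provider-chain priority" can be illustrated with a short sketch. This is not the code in `alphaswarm/auth/providers/__init__.py`; `STAFF_ISSUER`, `unverified_issuer`, and `select_provider_kind` are hypothetical names, and the tenant id in the issuer is a placeholder. The sketch only picks a provider. The chosen provider is still expected to verify the signature before any claim is trusted.

```python
import base64
import json
from typing import Optional

# Placeholder issuer (assumption): the real value embeds the staff tenant id.
STAFF_ISSUER = (
    "https://login.microsoftonline.com/"
    "00000000-0000-0000-0000-000000000000/v2.0"
)


def unverified_issuer(token: str) -> Optional[str]:
    """Return the `iss` claim of a JWT payload without checking the signature."""
    parts = token.split(".")
    if len(parts) != 3:
        return None
    payload = parts[1]
    payload += "=" * (-len(payload) % 4)  # restore stripped base64url padding
    try:
        claims = json.loads(base64.urlsafe_b64decode(payload))
    except ValueError:  # covers bad base64, bad JSON, and bad UTF-8
        return None
    return claims.get("iss") if isinstance(claims, dict) else None


def select_provider_kind(token: str, default_kind: str = "auth0") -> str:
    """Route staff-issuer tokens to MSAL; everything else to the default."""
    if unverified_issuer(token) == STAFF_ISSUER:
        return "msal_entra"
    return default_kind
```

A malformed token falls through to the default provider, which then rejects it during normal validation, so the selector itself never needs to raise.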
## Related - [`how-to/entra-terraform-bootstrap`](../../how-to/entra-terraform-bootstrap.md) - [`how-to/entra-onboard-new-staff`](../../how-to/entra-onboard-new-staff.md) - [`how-to/entra-rotate-secrets`](../../how-to/entra-rotate-secrets.md) - [`architecture/decisions/011-entra-as-first-pool`](../../architecture/decisions/011-entra-as-first-pool.md) - The Terraform module: [`alphaswarm_platform/terraform/modules/alphaswarm_entra_directory/`](pathname:///alphaswarm_platform/terraform/modules/alphaswarm_entra_directory/README.md) - Long-form rollout plan: [`docs/plans/entra-internal-tenant-rollout.md`](pathname:///docs/plans/entra-internal-tenant-rollout.md) # Federated identity layer > These pieces are ported (with attribution) from `alphaswarm_snippets/inspiration/auth0-server-python-main` (MIT, Copyright Auth0, Inc.) into AlphaSwarm-native modules. # Federated identity layer AlphaSwarm wraps every identity / token operation in a pluggable :class:`alphaswarm.auth.providers.IdentityProvider`. The provider drives both user authentication (login, JWT validation, refresh) and service-to-service auth (M2M tokens that downstream services like Polaris / Trino consume via the credential resolver). These pieces are ported (with attribution) from `alphaswarm_snippets/inspiration/auth0-server-python-main` (MIT, Copyright Auth0, Inc.) into AlphaSwarm-native modules.
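To make "pluggable" concrete, here is a minimal sketch of a metaclass-backed provider registry keyed by `provider_kind`, the dispatch key this page describes under "Adding a new provider". Everything in it (`ProviderMeta`, `BaseProvider`, `MockProvider`, `get_provider`) is a hypothetical stand-in, not the real `IdentityProviderMeta` or `IdentityProvider` API.

```python
from abc import ABCMeta, abstractmethod


class ProviderMeta(ABCMeta):
    """Hypothetical metaclass: records each concrete subclass by provider_kind."""

    registry: dict = {}

    def __init__(cls, name, bases, namespace):
        super().__init__(name, bases, namespace)
        kind = namespace.get("provider_kind")
        if kind:  # the abstract base declares an empty kind and is skipped
            ProviderMeta.registry[kind] = cls


class BaseProvider(metaclass=ProviderMeta):
    provider_kind = ""

    @abstractmethod
    def login_url(self, state: str) -> str:
        ...


class MockProvider(BaseProvider):
    provider_kind = "mock"

    def login_url(self, state: str) -> str:
        return f"https://idp.invalid/authorize?state={state}"


def get_provider(kind: str) -> BaseProvider:
    """Instantiate the provider registered for `kind` (e.g. an env var value)."""
    try:
        return ProviderMeta.registry[kind]()
    except KeyError:
        raise ValueError(f"unknown provider kind: {kind!r}") from None
```

With this pattern, defining a subclass is enough to register it; no central list of providers has to be edited.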
## Architecture ```mermaid flowchart LR SPA[Frontend SPA] Browser API[FastAPI] Provider["IdentityProvider (auth0 / oidc / mock)"] OidcClient[OidcHttpClient] JWKS[(JWKS cache)] Discovery[(Discovery cache)] M2MIssuer[M2MTokenIssuer] Resolver[CredentialResolver] Polaris[Polaris OAuth] Trino[Trino HTTP] MinIO[MinIO STS] Browser -->|"GET /auth/login"| API API -->|"login_url(...)"| Provider Provider -->|redirect| Browser Browser -->|callback code| API API -->|"exchange_code"| Provider Provider --> OidcClient OidcClient --> Discovery OidcClient --> JWKS API -->|JWE cookie| SPA SPA -->|Bearer or cookie| API API -->|"validate_jwt"| Provider Provider -->|jwks| JWKS M2MIssuer --> Provider Resolver --> M2MIssuer Polaris --> Resolver Trino --> Resolver MinIO --> Resolver ``` ## Components | Component | Path | | --- | --- | | Provider ABC + metaclass | [alphaswarm/auth/providers/protocol.py](../alphaswarm/auth/providers/protocol.py) | | Auth0 / generic OIDC / mock concrete providers | [alphaswarm/auth/providers/](../alphaswarm/auth/providers/) | | OIDC HTTP plumbing (discovery, JWKS, token endpoint) | [alphaswarm/auth/oidc_client.py](../alphaswarm/auth/oidc_client.py) | | PKCE helpers (RFC 7636 S256) | [alphaswarm/auth/pkce.py](../alphaswarm/auth/pkce.py) | | Cookie / Redis session stores | [alphaswarm/auth/session/](../alphaswarm/auth/session/) | | JWE cookie crypto (HKDF-SHA256 + A256CBC-HS512) | [alphaswarm/auth/session/crypto.py](../alphaswarm/auth/session/crypto.py) | | M2M token issuer | [alphaswarm/auth/m2m.py](../alphaswarm/auth/m2m.py) | | Login / callback / logout routes | [alphaswarm/api/routes/auth.py](../alphaswarm/api/routes/auth.py) | | Backend JWT validator | [alphaswarm/auth/oidc.py](../alphaswarm/auth/oidc.py) | ## Login flow (backend session) 1. Browser hits `GET /auth/login` (optionally with a `return_to`). 2. AlphaSwarm generates a PKCE verifier + state, stashes them in an encrypted transaction cookie (10-minute TTL), and redirects to the provider's authorize URL.
3. Provider redirects the browser to `GET /auth/callback` with the authorization code. 4. AlphaSwarm looks up the transaction cookie by `state`, calls `provider.exchange_code(...)`, and stores the resulting token set in an encrypted session cookie (or Redis). 5. Subsequent requests carry the cookie; AlphaSwarm decrypts it on demand and exposes the user via the existing `current_user` dependency. The bearer-token flow (`Authorization: Bearer`) keeps working unchanged — the SPA can pick either path via the `backend_session_supported` flag in `/auth/config`. ## M2M flow When `ALPHASWARM_AUTH_M2M_ENABLED=true`: 1. AlphaSwarm startup calls `alphaswarm.auth.m2m.install_m2m_store()`, which adds :class:`M2MStore` (priority 10) to the credential resolver chain. 2. A service like `polaris_client` resolves `CredentialKey("polaris", "oauth")` through :func:`alphaswarm.credentials.get_resolver`. 3. The M2M store fetches `provider.m2m_token(audience, scope)` (Auth0 `client_credentials` grant) and returns a `Credential` with `access_token`/`token` set. 4. The resolver merges this hit with the env-store payload (which carries the static `client_id`), so consumers see one merged `Credential`. 5. Tokens are cached in `M2MTokenIssuer` until expiry minus a 30-second skew, so a new token is not minted per request. The resolver chain falls through to the file/env stores if the M2M issuer fails or is disabled — you never get a worse outcome than the pre-M2M state. ## Configuration The full env knob set lives in `.env.example` under the "Federated identity (M2 / M3)" section. The minimum for an Auth0 deployment: ```env ALPHASWARM_AUTH_PROVIDER=auth0 ALPHASWARM_AUTH_OIDC_ISSUER=https://your-tenant.auth0.com ALPHASWARM_AUTH_OIDC_AUDIENCE=https://alphaswarm.local/api ALPHASWARM_AUTH_OIDC_CLIENT_ID=... ALPHASWARM_AUTH_OIDC_CLIENT_SECRET=...
ALPHASWARM_AUTH_LOGIN_CALLBACK=http://localhost:8000/auth/callback ALPHASWARM_AUTH_LOGOUT_CALLBACK=http://localhost:3000/ ALPHASWARM_AUTH_SESSION_SECRET=$(openssl rand -hex 32) ALPHASWARM_AUTH_M2M_ENABLED=true ALPHASWARM_AUTH_M2M_AUDIENCE=https://alphaswarm.local/services ``` ## Adding a new provider 1. Subclass :class:`alphaswarm.auth.providers.IdentityProvider` and set `provider_kind` (the dispatch key matched against `ALPHASWARM_AUTH_PROVIDER`). 2. Either inherit from :class:`alphaswarm.auth.providers.GenericOidcProvider` (and override only the bits that diverge) or roll your own. 3. The metaclass auto-registers; restart the API and set `ALPHASWARM_AUTH_PROVIDER=`. ## Testing `tests/auth/` contains the canonical test patterns: - `test_pkce.py` — RFC 7636 conformance. - `test_session_crypto.py` — JWE round-trips, wrong-key rejection. - `test_oidc_client.py` — token endpoint mock-driven tests. - `test_providers.py` — Auth0 / generic OIDC / mock dispatch. - `test_m2m.py` — issuer caching, resolver integration. All tests run hermetic; nothing hits the network. ## Account management surface (Phase 7) Phase 7 adds a dedicated account-management API surface under `/me/*` implemented in [`alphaswarm/api/routes/me.py`](../alphaswarm/api/routes/me.py). These routes expose profile updates, MFA and session operations, linked identity management, and self-service account actions while keeping the Auth0 Management API boundary centralized. The Auth0 Management API integration lives in [`alphaswarm/auth/management_api.py`](../alphaswarm/auth/management_api.py). Scope enforcement for protected endpoints is available through [`alphaswarm/auth/auth0_fastapi.py`](../alphaswarm/auth/auth0_fastapi.py) via `Auth0FastAPI` opt-in dependencies. 
Audit and invite persistence for this surface is recorded in [`alphaswarm/persistence/models_audit.py`](../alphaswarm/persistence/models_audit.py) (`security_audit_events` and `tenancy_invites`), and events are emitted through [`alphaswarm/auth/audit.py`](../alphaswarm/auth/audit.py). ## Microsoft Entra ID secondary IdP (Phase 7) AlphaSwarm's primary Microsoft pattern is federation through Auth0 Universal Login using an Auth0 Microsoft Enterprise Connection, documented in [`alphaswarm_docs/auth0-microsoft-federation.md`](../../concepts/identity/auth0-microsoft-federation.md). This keeps Auth0 as the default IdP while preserving one hosted login surface and one claims projection path. Direct Entra authentication remains supported as a fallback through [`alphaswarm/auth/providers/msal_entra.py`](../alphaswarm/auth/providers/msal_entra.py). When `ALPHASWARM_AUTH_PROVIDER=msal_entra`, the legacy `MsalEntraProvider` path activates without changing the backend tenancy-link semantics. # AlphaSwarm Management Engine > The Management Engine is the single direct-control surface for: # AlphaSwarm Management Engine Canonical narrative for the unified management/control surface shipped by the `alphaswarm_management_engine` plan (`.cursor/plans/alphaswarm_management_engine_fd9f1de7.plan.md`). ## What it owns The Management Engine is the single direct-control surface for: - **Workload lifecycle** — start / stop / scale / restart / exec / tail logs / apply config / rotate secret. One Python ABC (`alphaswarm_core.providers.InfrastructureProvider`), one runtime (`alphaswarm_core.runtime.WorkloadRuntime`), one audit ledger row per action (`workload_runs`). - **Identity provider configuration** — Auth0 + Microsoft Entra ID (MSAL) + Cloudflare Access, all registered through `IdentityProviderMeta`. The BFF (`/auth/{providers,exchange,refresh,logout}`) is the canonical surface for SPA + Theia clients. - **Cloudflare edge** — tunnels, DNS records, Access apps. 
Runtime CRUD via `alphaswarm.cloudflare.CloudflareEdgeAdapter`; IaC via the `alphaswarm_platform/terraform/modules/cloudflare_edge` module (provider `cloudflare/cloudflare ~> 5.6`). - **Entra tenant onboarding** — `pending` -> `active` via `POST /tenancy/entra-links/{id}/promote` (Phase E of the plan). - **alphaswarm_admin service identity** — per-deployment Microsoft Entra Agent Identities (`alphaswarm_admin_agent_identity` Terraform module). Replaces the legacy shared-client_credentials path for outbound admin-to-CP + admin-to-monolith calls. See [admin-agent-identity.md](admin-agent-identity.md). ## Architecture ```mermaid flowchart LR subgraph clients [Local clients] Vite[Vite SPA] Theia[Theia desktop] end subgraph bff [AlphaSwarm BFF auth + gateway] AuthR["/auth/{providers,exchange,refresh,logout}"] Proxy["alphaswarm/api/proxy.py /manage proxy"] Sec[require_scope + require_membership] end subgraph engine [Management engine] WR[WorkloadRuntime] IP_K[KubernetesProvider] IP_DC[DockerComposeProvider] IP_CF[CloudflareProvider] IP_AWS[AWS / Azure / GCP] CFA[CloudflareEdgeAdapter] KA[KubernetesAdapter pod ops] TR[TerraformRuntime] Idp[IdentityProvider registry] end subgraph idps [Federated IdPs] A0[Auth0] EN[Entra ID MSAL] CFP[Cloudflare Access] end subgraph state [Postgres + Iceberg] WLR[workload_runs ledger] AUD[security_audit_events] SPECS[terraform_stack_spec_versions] end Vite --> AuthR Theia --> AuthR Vite --> Proxy Theia --> Proxy Proxy --> WR AuthR --> Sec Sec --> Idp Idp --> A0 Idp --> EN Idp --> CFP WR --> IP_K WR --> IP_DC WR --> IP_CF WR --> IP_AWS IP_K --> KA IP_CF --> CFA TR --> IP_CF WR --> WLR WR --> AUD TR --> SPECS ``` ## Deployment modes `ALPHASWARM_MANAGEMENT_MODE` controls how the engine runs: | Mode | Workload calls go to | Audit sink | Use case | |---|---|---|---| | `embedded` (default) | In-process `WorkloadRuntime` | `PostgresWorkloadAuditSink` | Single-image deployment | | `sidecar` | HTTP `/manage/*` proxy -> `alphaswarm_controller` | 
`JsonlAuditSink` | Air-gapped or multi-tenant deployments | Both modes import the SAME `WorkloadRuntime` class — operators choose by setting the env var; no code branches. ## Provider matrix | Provider | start / stop / scale | restart | exec | tail_logs | rotate_secret | Notes | |---|---|---|---|---|---|---| | `docker_compose` | yes | yes | yes (Docker SDK) | yes | no | Local dev + admin overlays | | `kubernetes` | yes | yes (annotation bump) | yes (`stream` + `_preload_content=False`) | yes (`watch.Watch().stream`) | yes (rolling restart) | Production target | | `aws` | stub | stub | stub | stub | stub | Real `health` + delegated `list_deployments` when EKS attached | | `azure` | stub | stub | stub | stub | stub | Real `health` + delegated `list_deployments` when AKS attached | | `gcp` | stub | stub | stub | stub | stub | Real `health` + delegated `list_deployments` when GKE attached | | `cloudflare` | yes | yes (config reload) | n/a | n/a | destructive (opt-in) | Tunnel + Access app + DNS lifecycle | Cloud providers gate K8s delegation on `ALPHASWARM_CP_{AWS,AZURE,GCP}_DELEGATE_K8S=true`. ## Halt + audit - `POST /workloads/halt` fires the `WorkloadRuntime.halt_all` helper (per-process registry) and writes a `HALTED` finish row for every in-flight `workload_runs` entry. Wired into the frontend `KillSwitch` alongside the existing halt endpoints (rule 45 + frontend rule 2). - Every audit row carries `experiment_id` + `test_id` per AGENTS rule 34. The Postgres mirror table (`workload_runs`, Alembic 0055) is indexed on `status + started_at DESC`, `action + started_at DESC`, and `provider_alias + target`. ## Cloudflare end-to-end Phase D of the plan ships: - `alphaswarm/cloudflare/{client,adapter}.py` — Python SDK wrapper + `CloudflareEdgeAdapter` (tunnels, DNS, Access apps). - `alphaswarm/api/routes/cloudflare.py` — REST surface under `/cloudflare/*` (`cluster:admin` for writes, `cluster:read` for reads). 
- `alphaswarm/data/mcp/tools/cloudflare.py` — DataMCP tools for agents (`data.cloudflare.{health,list_tunnels,create_tunnel,put_tunnel_config,list_access_apps,put_access_app,put_dns_record}`). - `alphaswarm/auth/providers/cloudflare_access.py` — new `CloudflareAccessProvider` that validates `Cf-Access-Jwt-Assertion` headers and merges claims into the active `RequestContext`. - `alphaswarm_platform/terraform/modules/cloudflare_edge` + Jinja codegen template (`alphaswarm/terraform/codegen/templates/cloudflare_edge.tf.j2`) + `cloudflare = "~> 5.6"` in `alphaswarm_platform/terraform/versions.tf`. - Optional `cloudflare_enabled` block in `alphaswarm_platform/terraform/environments/rpi/main.tf` — replaces the manual cloudflared deployment under `rpi_kubernetes/kubernetes/base-services/cloudflared/`. ## Frontend - `alphaswarm_client/src/lib/api/{workloads,cloudflare,clusterPods}.ts` — typed clients matching the new REST surface. - `alphaswarm_client/src/routes/manage/page.tsx` — Workload Studio. - `alphaswarm_client/src/routes/cluster-mgmt/page.tsx` — Cluster pods browser (exec + log tail land in Phase F-2). - `alphaswarm_client/src/routes/cloudflare/page.tsx` — Cloudflare edge studio. - `alphaswarm_client/src/lib/auth/MsalProvider.tsx` — new MSAL branch of `AuthProvider`; selects between `` and `` based on `authConfig.provider`. - `alphaswarm_client/public/redirect.html` — MSAL v5 redirect bridge. ## Theia - `theia-extensions/alphaswarm/src/browser/auth/alphaswarm-auth-service.ts` — additive BFF auth service (calls `/auth/providers` + `/auth/refresh`). Auth0Service still owns the direct PKCE flow. - `theia-extensions/alphaswarm/src/browser/widgets/management-widget.tsx` — iframe embedding the Vite Workload Studio, cluster-mgmt, and cloudflare routes inside Theia. New env vars on `browser.Dockerfile`: `ALPHASWARM_THEIA_FRONTEND_URL`, `ALPHASWARM_THEIA_PROVIDERS_URL`. 
## Subagent + rule + skill - `.cursor/agents/alphaswarm-management-engine.md` — direct-control subagent that maps every control route to a `data.*` MCP tool and refuses raw HTTP shortcuts. - `.cursor/rules/alphaswarm-management-engine.mdc` — always-on rule that bans printing tokens, refresh tokens, M2M client_secrets, MFA secrets, `Cf-Access-Jwt-Assertion` values, kubeconfig contents, and full `Authorization` headers in any transcript. - `.cursor/skills/alphaswarm-management-engine/SKILL.md` — named workflows the subagent reaches for first (start, stop, restart, exec, tail-logs, provision-tunnel, rotate-secret, promote-entra-link, halt-all). # Microsoft Entra ID (MSAL) setup > 1. Sign in to the [Entra admin center](https://entra.microsoft.com). 2. **Identity → Applications → App registrations → New registration**. 3. Name: `AlphaSwarm`. 4. Supported account type... # Microsoft Entra ID (MSAL) setup Step-by-step walkthrough for wiring AlphaSwarm's `MsalEntraProvider` to a multi-tenant Microsoft Entra ID app registration. The provider lives at [`alphaswarm/auth/providers/msal_entra.py`](../alphaswarm/auth/providers/msal_entra.py) and auto-registers via the [`IdentityProviderMeta`](../alphaswarm/auth/providers/protocol.py) metaclass. ## 1. Create the Entra app registration 1. Sign in to the [Entra admin center](https://entra.microsoft.com). 2. **Identity → Applications → App registrations → New registration**. 3. Name: `AlphaSwarm`. 4. Supported account types: **Accounts in any organizational directory + personal Microsoft accounts (B2B/B2C)**. This is what makes the app multi-tenant. The matching MSAL authority becomes `https://login.microsoftonline.com/organizations` (work / school accounts only) or `/common` (incl. personal accounts). 5. **Redirect URI** — add two: - Platform: **Web** → `https:///auth/callback` - Platform: **Single-page application (SPA)** → `http://localhost:3001/auth/callback` and the prod equivalent. ## 2. Generate a client secret 1. 
App registration → **Certificates & secrets → New client secret**. 2. Description: `alphaswarm-backend-secret`. Expiry: max allowed (`24 months`). 3. **Copy the secret value immediately**; Entra hides it after page reload. 4. Set: ``` ALPHASWARM_MSAL_CLIENT_SECRET= ``` Or store it in your secret backend and reference via `CredentialResolver` (preferred — see [alphaswarm_docs/cloud-credentials.md](../../concepts/identity/cloud-credentials.md)). ## 3. Define app roles App registration → **App roles → Create app role** (five times): | Display name | Member types | Value | | ------------------------ | ------------ | ---------------------- | | AlphaSwarm admin | Users / Apps | `alphaswarm.admin` | | AlphaSwarm editor | Users | `alphaswarm.editor` | | AlphaSwarm viewer | Users | `alphaswarm.viewer` | | Terraform operator | Users | `alphaswarm.terraform.operator` | | Terraform approver | Users | `alphaswarm.terraform.approver` | The provider's first-login provisioning logic ([`alphaswarm/auth/user.py::_apply_entra_tenant_link`](../alphaswarm/auth/user.py)) maps these onto the AlphaSwarm role lattice (`viewer < editor < admin < owner`). The `alphaswarm.terraform.*` sub-roles fold to `editor` (operator) and `admin` (approver) by default; override via the `EntraTenantLink.role_mapping` JSON column. ## 4. Expose an API scope App registration → **Expose an API → Add a scope**: - Application ID URI: `api://` (Entra suggests this; accept). - Scope name: `.default` (this enables the `client_credentials` grant used by M2M). - Admin consent display name: `AlphaSwarm API access`. ## 5. (Optional) Pre-authorize the SPA client If you split the SPA client into its own app registration, add it to **Expose an API → Authorized client applications** with the `api:///.default` scope so the token flow lands without an admin-consent prompt. ## 6. 
Configure AlphaSwarm ``` ALPHASWARM_AUTH_PROVIDER=msal_entra ALPHASWARM_MSAL_TENANT_ID= ALPHASWARM_MSAL_CLIENT_ID= ALPHASWARM_MSAL_CLIENT_SECRET= ALPHASWARM_MSAL_AUTHORITY=https://login.microsoftonline.com/organizations ALPHASWARM_MSAL_REDIRECT_URI=https:///auth/callback ALPHASWARM_MSAL_SCOPES=openid profile email offline_access User.Read ALPHASWARM_MSAL_MULTI_TENANT=true ALPHASWARM_MSAL_B2B_ENABLED=true ``` Frontend Vite build: ``` VITE_MSAL_TENANT_ID= VITE_MSAL_CLIENT_ID= VITE_MSAL_AUTHORITY=https://login.microsoftonline.com/organizations VITE_MSAL_REDIRECT_URI=https:///auth/callback VITE_MSAL_SCOPES=openid profile email offline_access User.Read ``` ## 7. Link your home Entra tenant to an AlphaSwarm organization Two paths: 1. **Frontend wizard** (recommended): navigate to `/admin/onboarding` → **Link Entra tenant** tab, select your AlphaSwarm org, paste the Entra tenant id (`tid`), set primary domain + allowed email domains + role mapping, click "Activate". 2. **MCP tool / API**: ``` POST /tenancy/entra-links { "organization_id": "", "entra_tenant_id": "", "primary_domain": "wiley.tech", "allowed_email_domains": ["wiley.tech"], "role_mapping": { "alphaswarm.admin": "admin", "alphaswarm.editor": "editor", "alphaswarm.viewer": "viewer", "alphaswarm.terraform.operator": "editor", "alphaswarm.terraform.approver": "admin" }, "activate": true } ``` Once the link is `active`, every user that signs in from that tenant gets a `Membership` row auto-provisioned on the linked org + workspaces (`provider == "msal_entra"` in [`alphaswarm/auth/user.py::provision_user_from_claims`](../alphaswarm/auth/user.py)). ## 8. (Optional) Conditional Access for external tenants For B2B guest users, configure Entra Conditional Access policies on your home tenant (MFA + IP restrictions + device compliance). AlphaSwarm does NOT enforce these — Entra denies the token before AlphaSwarm sees it, which is the correct boundary. ## 9. 
SCIM / Provisioning Service webhook To pre-provision AlphaSwarm users before they sign in (useful for large orgs), point an Entra Logic App or SCIM provider at: ``` POST https:///_internal/msal/sync Authorization: Bearer { "object_id": "", "tenant_id": "", "email": "user@wiley.tech", "display_name": "User", "app_roles": ["alphaswarm.editor", "alphaswarm.terraform.operator"], "lifecycle_event": "created" } ``` The endpoint is M2M-protected via `require_m2m_token` (mirrors `/_internal/auth0/sync`) and upserts the matching `User` + `Membership` rows so the user lands on a usable surface on their very first request. ## Troubleshooting | Symptom | Likely cause | | -------------------------------------- | --------------------------------------------------------------------- | | `AADSTS50194` invalid issuer | Authority pinned to wrong tenant — use `/organizations` for multi-tenant. | | `AADSTS65001` consent required | Admin consent on the SPA / API scope wasn't granted. | | `provision_user_from_claims` returns default user | Settings has `auth_provider != "msal_entra"`. Set the env var. | | New user lands without org membership | `EntraTenantLink.status == "pending"` — promote via the wizard. | # Multi-tenancy > ```mermaid sequenceDiagram participant SPA as Vite SPA participant Entra as login.microsoftonline.com
(multi-tenant) participant AlphaSwarm as AlphaSwarm backend participant Link as EntraTenantLink (Postgres) p... # Multi-tenancy How AlphaSwarm turns a Microsoft Entra ID `tid` claim into an `Organization` → `Team` → `User` → `Membership` chain — and what keeps a B2B guest from another tenant from leaking into the wrong org. ## Identity flow ```mermaid sequenceDiagram participant SPA as Vite SPA participant Entra as login.microsoftonline.com (multi-tenant) participant AlphaSwarm as AlphaSwarm backend participant Link as EntraTenantLink (Postgres) participant Org as Organization (Postgres) SPA->>Entra: PKCE auth code flow Entra->>SPA: id_token + access_token (carries tid + oid + roles) SPA->>AlphaSwarm: /api/* with Bearer AlphaSwarm->>AlphaSwarm: validate_jwt (Entra JWKS) AlphaSwarm->>AlphaSwarm: provision_user_from_claims(claims) AlphaSwarm->>Link: lookup tid alt tid known + status=active Link-->>AlphaSwarm: organization_id AlphaSwarm->>Org: derive Memberships from roles[] else tid unknown + B2B enabled AlphaSwarm->>Link: insert pending row AlphaSwarm-->>SPA: user signs in with no memberships note over AlphaSwarm,Link: Admin promotes link via wizard before user sees workspaces end ``` ## Schema | Table | Purpose | | ----------------------- | ------------------------------------------------------------- | | `organizations` | Top of the AlphaSwarm tenancy tree (multi-tenant) | | `teams` | Subgroup within an org | | `workspaces` | Visibility-scoped container of projects + labs | | `projects` / `labs` | The user-facing buckets where strategies / RAG corpora live | | `users` | Authenticated identities (one row per Entra `oid`) | | `memberships` | Polymorphic `(user, scope_kind, scope_id, role)` grants | | `entra_tenant_links` | Multi-tenant Entra `tid` → AlphaSwarm `organization_id` index (NEW) | Schema migrations: - `0017_tenancy_foundation.py` — original `default-*` seed. - `0050_terraform_iac_plus_entra.py` — adds `entra_tenant_links` + the Terraform tables.
- `0051_seed_wiley_tech.py` — seeds the canonical "Wiley Tech" org + user "Julian" + transfers every legacy `default-*`-owned row. ## `EntraTenantLink` lifecycle Statuses (see :data:`ENTRA_TENANT_STATUSES`): | Status | Behaviour | | ----------- | ------------------------------------------------------------------------- | | `pending` | Created by first-login of an unknown `tid`. User signs in but lands on an "awaiting org admin" surface (no Memberships granted). | | `active` | New logins from the tenant auto-provision into the linked org + workspaces. | | `suspended` | Sign-ins from the tenant still resolve, but no new Memberships are granted. | | `revoked` | Sign-ins from the tenant are blocked at provision time. | AGENTS rule 44: **organization provisioning from Entra ID claims goes through `EntraTenantLink`. Don't auto-create org rows from raw `tid` claims.** The `data.tenancy.link_org_to_entra_tenant` MCP tool (REST: `POST /tenancy/entra-links`) is the only sanctioned ingress. The frontend [`EntraTenantLinkWizard`](../alphaswarm_client/src/components/onboarding/EntraTenantLinkWizard.tsx) drives this flow with a 5-step wizard. On the Auth0-federated path, the Microsoft button on the SPA login screen uses the Auth0 Enterprise Connection `connection=azure-ad-myorg`, which federates users to their home Entra tenant. The Entra `tid` claim returned through Auth0 is forwarded into the AlphaSwarm access-token claim set by the Auth0 Action, and `provision_user_from_claims` runs `_apply_entra_tenant_link` exactly as it does in the direct-MSAL path. For regulated deployments that bypass Auth0 and hit Entra directly, `MsalEntraProvider` remains registered through `IdentityProviderMeta` and activates when `ALPHASWARM_AUTH_PROVIDER=msal_entra`. Both authentication paths converge on the same backend `EntraTenantLink` lookup chain, and super-admin promotion remains managed in `alphaswarm_client/src/components/onboarding/EntraTenantLinkWizard.tsx`. 
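The status table in this section can be read as a small decision function. The sketch below is illustrative only: `LinkStatus`, `ProvisionDecision`, and `decide` are hypothetical names, not the contents of `alphaswarm/auth/user.py`, and the two booleans are an assumed simplification of what each status permits.

```python
from dataclasses import dataclass
from enum import Enum


class LinkStatus(str, Enum):
    PENDING = "pending"
    ACTIVE = "active"
    SUSPENDED = "suspended"
    REVOKED = "revoked"


@dataclass(frozen=True)
class ProvisionDecision:
    allow_sign_in: bool       # may the user obtain a session at all?
    grant_memberships: bool   # are new Membership rows created?


# One entry per row of the status table.
_DECISIONS = {
    LinkStatus.PENDING: ProvisionDecision(True, False),    # "awaiting org admin"
    LinkStatus.ACTIVE: ProvisionDecision(True, True),      # auto-provision
    LinkStatus.SUSPENDED: ProvisionDecision(True, False),  # resolves, no new grants
    LinkStatus.REVOKED: ProvisionDecision(False, False),   # blocked at provision time
}


def decide(status: str) -> ProvisionDecision:
    """Map a stored link status to a provisioning decision.

    Raises ValueError for a status outside the four known values.
    """
    return _DECISIONS[LinkStatus(status)]
```

Keeping the mapping in one table makes it easy to see that only `active` ever grants memberships, which is the property AGENTS rule 44 depends on.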
## App role mapping Entra ships app roles in a top-level `roles` claim array (e.g. `["alphaswarm.admin", "alphaswarm.terraform.operator"]`). The provisioning logic maps them onto the AlphaSwarm role lattice (`viewer < editor < admin < owner`): ```python # alphaswarm/auth/user.py::_apply_entra_tenant_link # Multi-word roles fold to the tail token: # alphaswarm.terraform.operator -> "operator" -> editor # alphaswarm.terraform.approver -> "approver" -> admin ``` Per-link overrides live in `EntraTenantLink.role_mapping` (JSON). Example for the seeded Wiley Tech link: ```json { "alphaswarm.admin": "owner", "alphaswarm.editor": "editor", "alphaswarm.viewer": "viewer", "alphaswarm.terraform.operator": "editor", "alphaswarm.terraform.approver": "admin" } ``` ## Onboarding wizards (frontend) `/admin/onboarding` hosts three wizards behind tabs: 1. **OrgCreateWizard** (4 steps) — name / billing / default structure / review. Seeds the canonical Core team + Main workspace + Main project + Main lab (from [`configs/tenants/tenant_default_template.yaml`](../configs/tenants/tenant_default_template.yaml)). 2. **EntraTenantLinkWizard** (5 steps) — choose org / Entra tid + primary domain / allowed email domains / app-role mapping / activate. 3. **UserInviteWizard** (3 steps) — email + display name / scope + role / review + send (Entra B2B invitation when MSAL is configured). ## Tenant template files [`configs/tenants/`](../configs/tenants/) hosts three YAMLs: - `tenant_default_template.yaml` — default org structure created on `data.tenancy.create_organization`. - `roles_default_template.yaml` — canonical app-role → AlphaSwarm-role mapping. - `user_invite_template.yaml` — Entra B2B invite email body + custom claims payload. 
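The app-role mapping described under "App role mapping" (tail-token folding plus per-link `role_mapping` overrides) can be sketched as follows. This is a hedged illustration, not the body of `_apply_entra_tenant_link`: the function names are hypothetical, and picking the highest mapped role when a user carries several app roles is an assumption the source does not state.

```python
from typing import Dict, Iterable, Optional

LATTICE = ["viewer", "editor", "admin", "owner"]  # ascending privilege
# Default folding for the Terraform sub-roles named in the docs.
TAIL_ALIASES = {"operator": "editor", "approver": "admin"}


def map_role(app_role: str, overrides: Optional[Dict[str, str]] = None) -> Optional[str]:
    """Map one Entra app role to a lattice role, or None if it is unknown."""
    if overrides and app_role in overrides:
        return overrides[app_role]  # per-link override wins
    tail = app_role.rsplit(".", 1)[-1]  # alphaswarm.terraform.operator -> operator
    tail = TAIL_ALIASES.get(tail, tail)
    return tail if tail in LATTICE else None


def effective_role(
    app_roles: Iterable[str], overrides: Optional[Dict[str, str]] = None
) -> Optional[str]:
    """Assumption: the highest mapped role wins when several are present."""
    mapped = [r for r in (map_role(a, overrides) for a in app_roles) if r in LATTICE]
    return max(mapped, key=LATTICE.index) if mapped else None
```

For example, with the seeded Wiley Tech override shown earlier, `alphaswarm.admin` resolves to `owner` instead of the default `admin`.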
## Seeded state After running `alembic upgrade head` against a fresh DB: | Slug | Type | Notes | | ------------ | ------------- | -------------------------------------------- | | `default` | Organization | Legacy 0017 seed (preserved for FK chains) | | `wiley-tech` | Organization | New canonical seed (Wiley Tech) | | `core` | Team | Default team under wiley-tech | | `main` | Workspace | Default workspace under wiley-tech | | `main` | Project | Default project under main workspace | | `main` | Lab | Default lab under main workspace | | `julian@wiley.tech` | User | Owner on every Wiley Tech scope | Every legacy `*_runs` / `bots` / `agent_runs_v2` / `analysis_runs` / ... row that previously pointed at `default-org` / `default-user` is re-stamped to point at `wiley-tech` / `julian@wiley.tech` (see `_restamp_legacy_rows` in [`alembic/versions/0051_seed_wiley_tech.py`](../alembic/versions/0051_seed_wiley_tech.py)). The legacy `default-*` rows stay in place so any orphan FK still resolves. # SCIM Provisioning > Enable SCIM with: # SCIM Provisioning AlphaSwarm exposes a SCIM 2.0 provisioning surface at `/scim/v2/*` for Auth0 Actions or scheduled Auth0 jobs. ## Security Enable SCIM with: ```bash ALPHASWARM_AUTH_SCIM_ENABLED=true ALPHASWARM_AUTH_PROVIDER=auth0 ALPHASWARM_AUTH_REQUIRED=true ``` Authentication is Bearer-only. AlphaSwarm accepts either: - a JWT validated against the configured OIDC issuer with audience `ALPHASWARM_AUTH_SCIM_M2M_AUDIENCE` (or `ALPHASWARM_AUTH_M2M_AUDIENCE`), or - a long random static token whose SHA-256 digest is stored in `ALPHASWARM_AUTH_SCIM_BEARER_TOKEN_HASH`. Do not store the raw token in the repository. ## Resource Mapping - SCIM `User` maps to `users`. - SCIM `Group` maps to `teams`. - SCIM `Group.members` maps to `memberships` with `scope_kind="team"`. Create, patch, replace, deactivate, and group membership operations emit security audit events through `alphaswarm.auth.audit.emit_audit_event`. 
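The static-token option described under "Security" stores only a SHA-256 digest. A minimal sketch of how such a check can work is shown below, assuming the digest is stored as lowercase hex in `ALPHASWARM_AUTH_SCIM_BEARER_TOKEN_HASH`. The encoding and the helper names are assumptions, not the AlphaSwarm implementation.

```python
import hashlib
import hmac
import secrets


def token_digest(raw_token: str) -> str:
    """Hex SHA-256 digest of the raw bearer token."""
    return hashlib.sha256(raw_token.encode("utf-8")).hexdigest()


def static_token_matches(presented: str, stored_hash_hex: str) -> bool:
    """Compare the presented token's digest to the stored one in constant time."""
    candidate = token_digest(presented).encode("ascii")
    expected = stored_hash_hex.strip().lower().encode("utf-8")
    return hmac.compare_digest(candidate, expected)


def new_token_and_hash() -> tuple:
    """Generate a long random token plus the digest to place in the env var."""
    raw = secrets.token_urlsafe(32)
    return raw, token_digest(raw)
```

Only the second element of `new_token_and_hash()` belongs in configuration; the raw token goes to the SCIM client and is never written to the repository.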
## Auth0 Integration The `alphaswarm_platform/terraform/modules/auth0_identity` module creates: - the AlphaSwarm SPA application, - the AlphaSwarm API audience and scopes, - an M2M client grant for SCIM and Auth0 sync, - default `alphaswarm-viewer` and `alphaswarm-admin` roles, - a post-login Action that calls `/_internal/auth0/sync` and injects AlphaSwarm tenancy claims. For direct enterprise SCIM, point the upstream IdP or Auth0 automation at `https:///scim/v2`. # SPIFFE workload identity # SPIFFE workload identity > Phase 4 §7.2 of > [RESTRUCTURING_PLAN.md](https://github.com/julianwiley/alphaswarm/blob/main/RESTRUCTURING_PLAN.md). > SPIFFE-bound identities replace the long-lived OAuth > client-credentials grant currently used by `M2MTokenIssuer` for > service-to-service authentication. ## Why workload identity The pre-Phase-4 ``M2MTokenIssuer`` mints short-lived JWTs via the Auth0 / Entra ``client_credentials`` grant, but those tokens are still bearer credentials — exfiltrate the JWT and you can replay it from anywhere until it expires. SPIFFE-bound identities (SVIDs) are workload-attested via the platform (UID, cgroup, label selectors) — much harder to steal and automatically rotated by the SPIRE Server. | Aspect | OAuth `client_credentials` | SPIFFE JWT-SVID | | --- | --- | --- | | Issuer | Auth0 / Entra tenant | SPIRE Server (in-cluster) | | Attestation | Shared `client_secret` (long-lived) | Node + workload attestor (live) | | Bearer-token replay risk | High (until expiry) | Low (selectors validated by Workload API) | | Rotation | Manual / scheduled | Automatic, per-SVID-lifetime | | Cross-cell scope | Implicit (issuer trusts all audiences) | Explicit (`spiffe://alpha-swarm.ai/cell//...` trust-domain path) | ## Trust domain layout AlphaSwarm runs ONE trust domain — ``alpha-swarm.ai``. 
Each cell carries a namespace-scoped path prefix under the trust domain:

```
spiffe://alpha-swarm.ai/cell//
```

Example SPIFFE IDs:

| Cell | Service | SPIFFE ID |
| --- | --- | --- |
| `cell-shared-std-local` | `alphaswarm-core` | `spiffe://alpha-swarm.ai/cell/cell-shared-std-local/alphaswarm-core` |
| `cell-silo-reg-acme` | `alphaswarm-worker` | `spiffe://alpha-swarm.ai/cell/cell-silo-reg-acme/alphaswarm-worker` |
| `cell-shared-std-us-east-1a` | `alphaswarm-tenant-router` | `spiffe://alpha-swarm.ai/cell/cell-shared-std-us-east-1a/alphaswarm-tenant-router` |

Cross-cell calls validate the full SPIFFE ID, not just the trust domain. Cell-Bound-Authorization (Phase 5 §8.5) extends this with biscuit capability tokens that pin a request to a specific cell.

## Deployment shape

Each cell runs ONE SPIRE control plane:

```
[ SPIRE Server StatefulSet ]    (spire-system namespace)
        ▲
        │  k8s_psat attest
        │
[ SPIRE Agent DaemonSet ]       (one per node)
        ▲
        │  unix socket: /run/spire/sockets/agent.sock
        │
[ AlphaSwarm workload pod ]     (mounts the socket via hostPath volume)
        │
        └── spiffe.workloadapi.fetch_svid(audiences=[...])
```

The matching manifests live at:

- `alphaswarm_platform/deployments/kubernetes/mesh-identity/spire/server.yaml`
- `alphaswarm_platform/deployments/kubernetes/mesh-identity/spire/agent.yaml`

Per-cell installs come from the Argo CD `ApplicationSet` at `alphaswarm_platform/deployments/argocd/applicationsets/cells-appset.yaml` (Phase 4.5 extends it with a `mesh-identity` component column).

## AlphaSwarm integration

The application-side integration lives in [`alphaswarm/auth/providers/spiffe.py`](https://github.com/julianwiley/alphaswarm/blob/main/alphaswarm/auth/providers/spiffe.py) (`SpiffeIdentityProvider`). It implements the `alphaswarm.auth.providers.protocol.IdentityProvider` interface, but only the `m2m_token` method does real work: SPIFFE is workload-only and does NOT participate in user OIDC flows.
The existing Auth0 / Entra providers stay wired for user-facing login.

### Wiring

```bash
# Operator sets the workload API socket path (default is the
# conventional /run/spire/sockets/agent.sock from the SPIRE Agent
# DaemonSet's hostPath mount).
export ALPHASWARM_AUTH_SPIFFE_WORKLOAD_API_SOCKET="unix:///run/spire/sockets/agent.sock"

# Route the M2MTokenIssuer through SPIFFE instead of Auth0.
# (Phase 4.5 deliverable; the M2MTokenIssuer side is still TODO.)
export ALPHASWARM_AUTH_M2M_PROVIDER=spiffe
```

When the SPIFFE socket isn't reachable (development mode, smoke tests, migrations), `SpiffeIdentityProvider.m2m_token` raises `IdentityProviderError`. The fallback chain in `alphaswarm.credentials.resolver` then retries through the legacy Auth0 path, so developers can iterate without a running SPIRE Agent.

## Pod template requirements

For a pod to consume SVIDs from the SPIRE Workload API:

1. Mount the agent's host socket:

   ```yaml
   volumes:
     - name: spire-agent-socket
       hostPath:
         path: /run/spire/sockets
         type: Directory
   containers:
     - name: ...
       volumeMounts:
         - name: spire-agent-socket
           mountPath: /run/spire/sockets
           readOnly: true
   ```

2. Set `SPIFFE_ENDPOINT_SOCKET=unix:///run/spire/sockets/agent.sock` in the pod env (or rely on the AlphaSwarm default).
3. Be matched by the `spire-system` `ClusterSPIFFEID` selector. The matching CRD ships per-cell in Phase 4.5; today the `k8s_psat` Node Attestor accepts every workload with a matching ServiceAccount.

## Rotation + revocation

- **SVID lifetime**: 1h X.509-SVID, 5m JWT-SVID (configurable via the SPIRE Server config map).
- **Trust anchor lifetime**: 168h (7 days). Operators rotate the root via Vault PKI; the SPIRE Server propagates the new bundle to every Agent within ~1 minute.
- **Revocation**: deleting a workload's `RegistrationEntry` from the SPIRE Server stops all future SVID issuance for that workload. SVIDs already issued stay valid until their natural TTL; for an immediate cut-off, also rotate the trust anchor.
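
The trust domain layout above notes that cross-cell calls validate the full SPIFFE ID rather than just the trust domain. A rough sketch of that check follows, using only the standard library. It is not the logic in `SpiffeIdentityProvider`: the names `parse_cell_spiffe_id`, `is_same_cell`, and `CellIdentity` are hypothetical, and the path shape (`cell`, then a cell name, then a service name) is inferred from the example IDs in the table.

```python
from typing import NamedTuple
from urllib.parse import urlsplit

TRUST_DOMAIN = "alpha-swarm.ai"  # the single trust domain named in the text


class CellIdentity(NamedTuple):
    cell: str
    service: str


def parse_cell_spiffe_id(spiffe_id: str) -> CellIdentity:
    """Parse an ID shaped like the examples; raise ValueError otherwise."""
    parts = urlsplit(spiffe_id)
    if parts.scheme != "spiffe":
        raise ValueError("not a spiffe:// URI")
    if parts.netloc != TRUST_DOMAIN:
        raise ValueError("unexpected trust domain")
    if parts.query or parts.fragment:
        raise ValueError("SPIFFE IDs carry no query or fragment")
    # Assumed path shape: /cell/<cell name>/<service name>
    segments = parts.path.split("/")
    if len(segments) != 4 or segments[1] != "cell":
        raise ValueError("unexpected path shape")
    cell, service = segments[2], segments[3]
    if not cell or not service:
        raise ValueError("empty cell or service segment")
    return CellIdentity(cell=cell, service=service)


def is_same_cell(spiffe_id: str, expected_cell: str) -> bool:
    """True only when the ID parses and names the expected cell."""
    try:
        return parse_cell_spiffe_id(spiffe_id).cell == expected_cell
    except ValueError:
        return False
```

Rejecting anything that does not parse, instead of falling back to a trust-domain-only comparison, keeps a caller from one cell from being accepted in another simply because both share `alpha-swarm.ai`.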
## Failure modes | Failure | Behaviour | | --- | --- | | SPIRE Agent socket missing | `SpiffeIdentityProvider.m2m_token` raises `IdentityProviderError` | | SPIRE Server unreachable | Agent serves cached SVID until it expires (~1h) | | Workload not attested | `fetch_svid` raises; M2M chain falls through to Auth0 | | Trust anchor rotation | SVIDs continue to validate during the 7-day overlap window | ## Phase 4.5 follow-ups 1. **Per-cell `ClusterSPIFFEID` CRDs** that bind workload selectors to SPIFFE IDs (today the spine relies on the default k8s_psat attestor). 2. **M2MTokenIssuer dispatch** — wire `ALPHASWARM_AUTH_M2M_PROVIDER=spiffe` into the issuer so it picks SPIFFE for M2M without affecting user OIDC flows. 3. **Linkerd integration** — Linkerd consumes SPIFFE identity for mTLS termination (Phase 4 §7.1). Phase 4.5 wires the SPIFFE trust anchor into Linkerd's Identity service. 4. **OIDC discovery provider** — SPIRE Server can expose an OIDC discovery endpoint that lets non-SPIRE-aware services (Pomerium, Cloudflare Access) validate SVIDs as standard OIDC JWTs. 5. **Cross-cell federation** — Phase 8 §11.2 multi-region cells will need SPIFFE trust-domain federation. ## Related documents - [RESTRUCTURING_PLAN.md §7.2](https://github.com/julianwiley/alphaswarm/blob/main/RESTRUCTURING_PLAN.md) - [alphaswarm/auth/providers/spiffe.py](https://github.com/julianwiley/alphaswarm/blob/main/alphaswarm/auth/providers/spiffe.py) - [alphaswarm_platform/deployments/kubernetes/mesh-identity/spire/](https://github.com/julianwiley/alphaswarm/blob/main/alphaswarm_platform/deployments/kubernetes/mesh-identity/spire/) - SPIFFE specification: https://github.com/spiffe/spiffe - SPIRE: https://spiffe.io/docs/latest/spire-about/spire-concepts/ # AlphaSwarm IDE roadmap > The blueprint targets greenfield buyers of a quant PaaS. 
AlphaSwarm already has: # AlphaSwarm IDE roadmap This doc maps the [external quant-IDE blueprint](https://github.com) (compressed: "Bloomberg-grade research IDE you can own", 12–18 month Phase 1/2/3 plan) to AlphaSwarm's existing architecture and the 55 hard rules. ## Why we deviate from the blueprint The blueprint targets greenfield buyers of a quant PaaS. AlphaSwarm already has: - Five hash-locked spec runtimes (`AgentSpec` / `BotSpec` / `RLExperimentSpec` / `AnalysisSpec` / `WorkflowSpec`) — rules 12-13, 14-15, 16-17, 23-25, 40-41. - Nine backtest engines (vbt-pro, event-driven, OSS vectorbt, backtesting.py, LEAN, ZVT, AAT, hftbacktest, NautilusTrader bridge). - DataMCP + CodebaseMCP — rule 22, exposed over RFC 9728 / RFC 8707 conformant streamable HTTP per rule 49. - AlphaVantage / IBKR / Alpaca brokers — paper trading exists. - Iceberg lakehouse with medallion-tier business metadata — rule 21. - A Vite 7 + React 19 operator UI (`alphaswarm_client/`) that already covers the operator dashboard scope. The IDE's role in AlphaSwarm is **the developer / research environment** — notebook + MCP copilot + spec authoring + repo navigation. It does NOT re-implement what `alphaswarm_client/` already does well. 
## Phasing ### Phase A — Shipped in this enhancement | Workstream | Blueprint section | AlphaSwarm-aligned implementation | | --- | --- | --- | | Six compile-time Theia extensions | §2.2 + §2.5 + §2.6 + §2.8 | `alphaswarm-ext`, `alphaswarm-shell-ext`, `alphaswarm-mcp-bridge-ext`, `alphaswarm-research-copilot-ext`, `alphaswarm-notebook-quant-ext`, `alphaswarm-quant-ext` | | FINOS Perspective notebook renderer | §2.6 + §4.5 | `alphaswarm-notebook-quant-ext`'s `PerspectiveArrowRenderer` (lazy-loads `@finos/perspective`) | | MCP-driven research copilot | §2.7 + §5.4 | `alphaswarm-research-copilot-ext`'s `AqpResearchAgent` (routes through `router_complete`, rule 2) | | White-label shell + filters | §2.8 | `alphaswarm-shell-ext`'s `FilterContribution` + window title + about dialog | | Quant widgets (operator complement) | §5.1 | `alphaswarm-quant-ext`'s SpecAuthor + RunInspector + BacktestRunner | | `alphaswarm-cli ide` entrypoint | (CLI orchestration) | `install` / `build` / `start` / `stop` / `status` / `logs` / `open` / `url` / `env` / `detect` / `doctor` | | Single-pod K8s manifests | §7 (Layer 2) | `alphaswarm_platform/deployments/kubernetes/alphaswarm-ide/` | | Theia Cloud Phase B scaffolding | §3 | `alphaswarm_platform/deployments/kubernetes/alphaswarm-ide/theia-cloud/` with `DEFERRED.md` | | Per-extension AGENTS + READMEs + skills + rules | (governance) | 6 README + 6 AGENTS + 2 skills + 1 rule + 2 subagents | | Workspace retirement checklist | (governance) | `alphaswarm_ide/docs/retire-vendored-workspace.md` | ### Phase B — Trigger: ≥2 internal users need isolated workspaces | Workstream | Blueprint section | AlphaSwarm-aligned implementation | | --- | --- | --- | | Theia Cloud multi-tenant operator | §3 | Install upstream `theia-cloud` Helm + apply the `AppDefinition` scaffolded under `alphaswarm-ide/theia-cloud/` | | Per-tenant PVC + workspace | §3.5 | One PVC per `Workspace.theia.cloud/v1beta5` | | Activity-tracker idle shutdown | §3.3 | 
`monitor.activityTracker.timeoutAfter` on `AppDefinition` | | Private Open VSX mirror | §2.9 | Self-hosted Open VSX in `alphaswarm-ide` namespace | | Step-up confirmation for copilot write tools | (rule 52) | Surface confirmation chips before invoking `/halt` / `/me/byok/*` / `/tenancy/invites` tools | ### Phase C — Trigger: tick / order-book research demand emerges | Workstream | Blueprint section | AlphaSwarm-aligned implementation | | --- | --- | --- | | Arrow Flight gateway backend service | §4.1 | A new compile-time extension `alphaswarm-flight-gateway-ext` with a JSON-RPC service that fronts AlphaSwarm Iceberg + Snowflake (when present) via ADBC | | Tick blotter widget | §5.2 | New widget in `alphaswarm-quant-ext` (or a sibling `alphaswarm-trading-ext`) that subscribes to the live market data Kafka topic | | Real-time Yjs notebook collaboration | §5.5 | New compile-time extension `alphaswarm-notebook-rtc-ext` with a backend Yjs WebSocket server | | Hudi upsert-heavy market-data partitions | (rule 46) | Wire `alphaswarm/data/lakehouse/hudi/` into the BacktestRunner spec UI | | GPU / RAPIDS scheduling | §3 (Layer 5) | New `AppDefinition` flavour with GPU node selectors | ## Hard-rule mapping summary | Rule | Phase A | Phase B | Phase C | | --- | --- | --- | --- | | 2 (LLM gateway) | Copilot uses `router_complete` | (no change) | Hudi-aware code samples in copilot | | 4 (progress frame) | `AqpWsClient` consumes canonical frame | (no change) | (no change) | | 22 (DataMCP) | MCP bridge | (no change) | Flight gateway uses DataMCP for catalog metadata | | 26 (CredentialResolver) | Python helpers | (no change) | Flight gateway pulls Snowflake creds via store | | 27 (IdentityProvider) | All extensions | Per-pod oauth2-proxy | (no change) | | 45 (WorkloadRuntime) | CLI `doctor` + `alphaswarm-ext` halt | Multi-pod halt via `/workloads/halt` | (no change) | | 47 (topology) | CLI `detect` / `env` | (no change) | (no change) | | 49 (MCP audience) | Bridge sets 
`X-AlphaSwarm-MCP-Audience` | (no change) | (no change) | | 52 (step-up MFA) | `alphaswarm-ext` halt | Copilot write-tool gating | (no change) | ## Decision log | Decision | Rationale | | --- | --- | | Use AlphaSwarm `router_complete` (rule 2) for the copilot, NOT `@theia/ai-openai` / `@theia/ai-anthropic` etc. | AlphaSwarm's provider catalog + cost caps + tenancy + audit run through `router_complete`. Bypassing it would create an auditing blind spot for every chat completion. | | Use AlphaSwarm's five spec runtimes for SpecAuthor, NOT a generic `BacktestService` JSON-RPC | The blueprint's hypothetical `BacktestService` is what AlphaSwarm already has — five hash-locked spec runtimes with `persist_spec` + immutable version rows. Reinventing them would create a fork. | | Defer Arrow Flight + Theia Cloud + RTC to Phase B/C | AlphaSwarm's current load (single-tenant Vite UI + AlphaSwarm API) does not justify the multi-tenant Theia Cloud operator yet. The blueprint's Flight gateway is a Phase C target — DataMCP + Iceberg already cover the data plane for Phase A. | | Keep `alphaswarm_client/` as the operator UI; Theia complements it | The Vite app already has the operator dashboards. Theia adds notebook + MCP copilot + spec authoring + repo navigation. Two surfaces, one tenancy, no duplication. | | Make `alphaswarm-cli ide` the canonical entrypoint | Production deploys go through one command. `yarn` stays for inner-loop dev. Mirrors the `alphaswarm-cli client` pattern for the Vite frontend. | | Don't fork Theia | Every blueprint risk register flags forking as catastrophic. AlphaSwarm stays on community releases and adds via compile-time extensions only. | ## What this roadmap is NOT - A commitment to ship every blueprint phase. - A timeline. We ship Phase A now; Phase B and C ship when triggered. - A justification for re-implementing what `alphaswarm_client/` already provides. - A reason to bypass the 55 hard rules. 
## Source of truth - The blueprint we summarised: external research report + product blueprint provided as the source for this enhancement. - AlphaSwarm's canonical hard rules: [../AGENTS.md](https://github.com/julianwileymac/alphaswarm/blob/main/AGENTS.md). - Per-extension contracts: [../alphaswarm_ide/theia-extensions/](../alphaswarm_ide/theia-extensions/). # AlphaSwarm IDE > This page is a thin pointer into the in-folder documentation that lives in `alphaswarm_ide/`. The canonical contracts are there # AlphaSwarm IDE The **AlphaSwarm IDE** is a white-labeled Eclipse Theia 1.72 distribution + six AlphaSwarm compile-time extensions + an MCP-driven research copilot + a Perspective Arrow notebook renderer. It is the developer environment that sits next to (not replaces) the `alphaswarm_client/` Vite operator UI. ## SSoT pointers This page is a thin pointer into the in-folder documentation that lives in `alphaswarm_ide/`. The canonical contracts are there. | Topic | Path | | --- | --- | | Overview + architecture | [../alphaswarm_ide/README.md](https://github.com/julianwileymac/alphaswarm/blob/main/alphaswarm_ide/README.md) | | Process + extension architecture | [../alphaswarm_ide/docs/architecture.md](https://github.com/julianwileymac/alphaswarm/blob/main/alphaswarm_ide/docs/architecture.md) | | Per-extension reference | [../alphaswarm_ide/docs/extensions.md](https://github.com/julianwileymac/alphaswarm/blob/main/alphaswarm_ide/docs/extensions.md) | | Canonical operator entrypoint (`alphaswarm-cli ide`) | [../alphaswarm_ide/docs/cli-entrypoint.md](https://github.com/julianwileymac/alphaswarm/blob/main/alphaswarm_ide/docs/cli-entrypoint.md) | | MCP integration (RFC 9728 + RFC 8707) | [../alphaswarm_ide/docs/mcp-integration.md](https://github.com/julianwileymac/alphaswarm/blob/main/alphaswarm_ide/docs/mcp-integration.md) | | Research Copilot (chat agent) | 
[../alphaswarm_ide/docs/research-copilot.md](https://github.com/julianwileymac/alphaswarm/blob/main/alphaswarm_ide/docs/research-copilot.md) | | Notebook (Perspective MIME renderer) | [../alphaswarm_ide/docs/notebook.md](https://github.com/julianwileymac/alphaswarm/blob/main/alphaswarm_ide/docs/notebook.md) | | Quant widgets (SpecAuthor / RunInspector / BacktestRunner) | [../alphaswarm_ide/docs/quant-widgets.md](https://github.com/julianwileymac/alphaswarm/blob/main/alphaswarm_ide/docs/quant-widgets.md) | | Deployment (local / single-pod K8s / Theia Cloud) | [../alphaswarm_ide/docs/deployment.md](https://github.com/julianwileymac/alphaswarm/blob/main/alphaswarm_ide/docs/deployment.md) | | Phased roadmap (blueprint → AlphaSwarm) | [alphaswarm-ide-roadmap.md](../../concepts/infrastructure/alphaswarm-ide-roadmap.md) | ## Hard-rule touchpoints The hard rules the AlphaSwarm IDE cites most often, from [../AGENTS.md](https://github.com/julianwileymac/alphaswarm/blob/main/AGENTS.md): | Rule | Owner | AlphaSwarm IDE consumer | | --- | --- | --- | | 2 (LLM gateway) | `alphaswarm/llm/providers/router.py::router_complete` | `alphaswarm-research-copilot-ext`'s `RouterCompleteClient` | | 4 (canonical progress frame) | `alphaswarm/tasks/_progress.py::emit` | `alphaswarm-quant-ext`'s `AqpWsClient` / `RunInspectorWidget` | | 22 (DataMCP boundary) | `alphaswarm/data/mcp/` | `alphaswarm-mcp-bridge-ext`'s registrations | | 26 (CredentialResolver) | `alphaswarm/credentials/resolver.py` | Python notebook helpers (`alphaswarm/notebook/helpers.py`) | | 27 (IdentityProvider) | `alphaswarm/auth/providers/` | `alphaswarm-ext`'s `Auth0Service` + new MCP bridge / copilot | | 45 (WorkloadRuntime) | `alphaswarm_core/runtime/workload.py` | `alphaswarm-ext`'s halt fan-out + `alphaswarm-cli ide` doctor | | 47 (topology) | `alphaswarm_controller/services/topology.py` | `alphaswarm-cli ide url --remote` / `detect` / `env` | | 49 (MCP audience, RFC 8707) | `alphaswarm/api/well_known.py` +
`alphaswarm/api/mcp_audience.py` | `alphaswarm-mcp-bridge-ext`'s `X-AlphaSwarm-MCP-Audience` header | | 52 (step-up MFA) | `alphaswarm/api/security_stepup.py` | `alphaswarm-ext`'s halt command + future copilot write tools | ## Canonical operator entrypoint ```bash alphaswarm-cli auth login --device # RFC 8628 device flow + OS keyring (rule 53) alphaswarm-cli ide install # one-time bootstrap alphaswarm-cli ide build --dev # yarn build:extensions + build:applications:dev alphaswarm-cli ide start --open # spawn Theia + open in browser alphaswarm-cli ide doctor # preflight checks ``` Full CLI reference: [../alphaswarm_cli/docs/index.md](https://github.com/julianwileymac/alphaswarm/blob/main/alphaswarm_cli/docs/index.md). ## Boundary contract (mirrored from `.cursor/rules/alphaswarm-ide.mdc`) - `alphaswarm_ide/` extensions MUST NOT `import` from `alphaswarm` source. Cross HTTP only (`AqpApiService`) or via the DataMCP / CodebaseMCP HTTP surfaces. - AlphaSwarm-specific behavior lives ONLY under `alphaswarm_ide/theia-extensions/alphaswarm*/` (the six extensions). Don't sprinkle AlphaSwarm imports into core Theia files. - The IDE is browser-target-only. The Electron app remains upstream-oriented and is NOT wired for AlphaSwarm in this release. - The canonical entrypoint is `alphaswarm-cli ide`. Direct `yarn` invocations are inner-loop development only. ## Vendored workspace retirement The vendored `test_theia/theia-ide` workspace is byte-for-byte identical to `alphaswarm_ide/` and can be retired. See [../alphaswarm_ide/docs/retire-vendored-workspace.md](https://github.com/julianwileymac/alphaswarm/blob/main/alphaswarm_ide/docs/retire-vendored-workspace.md) for the 5-step checklist. # CI/CD pipelines > GitHub Actions orchestrates AWS CodeBuild over GitHub OIDC to deploy alphaswarm_platform and alphaswarm_admin. Covers the plan-vs-apply role split, the hybrid Terraform boundary, CodeArtifact, the three canonical workflows, and the dev to staging to prod promotion. 
# CI/CD pipelines The AlphaSwarm AWS deployment is driven by CI/CD: **GitHub Actions orchestrates** the pipeline and **AWS `CodeBuild` runs the heavy in-VPC work** (multi-arch `buildx` builds to `ECR`, and the `alphaswarm deploy` app-tier apply). There are no static AWS keys anywhere in the pipeline — every cloud step authenticates through **GitHub OIDC**. This page explains the topology, the trust model, and the workflows. For the task-oriented steps (creating environments, triggering a deploy, approving a prod release, rolling back) see the companion runbook [Operations runbook — CI/CD deploy](../../how-to/operations/cicd-deploy.md). For the deeper deploy walkthroughs see [AWS Hybrid Deployment Guide](../../how-to/operations/aws-deploy.md) and [AWS Hybrid Operational Runbook](../../how-to/operations/aws-runbook.md). ## Topology — GitHub Actions, CodeBuild, OIDC GitHub Actions is the control plane: it reacts to pushes, tags, pull requests, and `repository_dispatch`, then either runs lightweight Terraform directly or delegates the in-VPC heavy lifting to `CodeBuild` via `aws codebuild start-build`. The GitHub Actions job first assumes an AWS role over OIDC, so the `start-build` call (and everything `CodeBuild` does downstream) runs under short-lived credentials. 
```mermaid flowchart LR dev[Developer] -->|push / tag / PR| gha[GitHub Actions workflow] subgraph github [GitHub] gha oidc[GitHub OIDC token] gha --> oidc end subgraph aws [AWS account dev / staging / prod] sts[STS AssumeRoleWithWebIdentity] planRole[Plan role read-only] applyRole[Apply role] cb[CodeBuild in-VPC] ecr[ECR registries] ca[CodeArtifact alphaswarm-pypi] tf[Terraform state S3 + DynamoDB] runtime[alphaswarm deploy TerraformRuntime] end oidc --> sts sts --> planRole sts --> applyRole gha -->|aws codebuild start-build| cb cb --> ecr cb --> ca cb --> runtime applyRole --> tf planRole --> tf runtime --> tf ``` Why split the work this way: - **GitHub Actions** is cheap, parallel, and is where the promotion gates (GitHub Environments + required reviewers) live. - **`CodeBuild`** runs inside the workload VPC, so it can reach private subnets, the internal `CodeArtifact` PyPI, and the app-tier resources that `alphaswarm deploy` manages. It also gives multi-arch `buildx` a beefy, in-account builder close to `ECR`. ## Authentication — GitHub OIDC, no static keys Trust is configured **per account** via the `infrastructure/modules/github-oidc` module, which registers the GitHub OIDC provider and the IAM roles. The provider trusts both deploying repos: - `Alpha-Swarm-ai/alphaswarm_platform` - `Alpha-Swarm-ai/alphaswarm_admin` ### Plan role vs apply role The module emits two roles per account, with different trust conditions on the OIDC `sub` claim: - **Plan role** — read-only. Trusted on pull-request refs so that PR validation can run `terraform plan` / `validate` without any mutate permission. Example trusted subjects: ```text repo:Alpha-Swarm-ai/alphaswarm_platform:pull_request repo:Alpha-Swarm-ai/alphaswarm_platform:ref:refs/heads/main ``` - **Apply role** — read-write. Trusted only on `refs/heads/main` **and** scoped to a GitHub Environment, so an apply cannot run until the Environment's required reviewers approve. 
Example trusted subjects: ```text repo:Alpha-Swarm-ai/alphaswarm_platform:ref:refs/heads/main repo:Alpha-Swarm-ai/alphaswarm_platform:environment:prod ``` The apply role ARN is published per environment as the `AWS_DEPLOYER_ROLE_ARN` repo variable (one value per GitHub Environment); the plan role ARN is published alongside it. A workflow job selects the role for its target `env`, then assumes it over OIDC. ```yaml permissions: id-token: write # required to mint the GitHub OIDC token contents: read jobs: apply: environment: prod # gates on the Environment's required reviewers steps: - uses: aws-actions/configure-aws-credentials@v4 with: role-to-assume: ${{ vars.AWS_DEPLOYER_ROLE_ARN }} aws-region: us-east-1 ``` ## Hybrid Terraform boundary There are two Terraform trees and they are applied two different ways. The boundary is deliberate. | Tree | What it owns | Applied by | Auth | Audit | | --- | --- | --- | --- | --- | | `infrastructure/` | Landing zone: VPC, `ECR`, RDS, EKS, OIDC provider, observability, the `CodeBuild`/`CodeArtifact` plumbing | Native `terraform plan` / `terraform apply` | OIDC into `AqpTerraformExecutionRole` | Terraform state only | | `terraform/` | App tier: the per-env application composition deployed onto the platform | `alphaswarm deploy plan` / `alphaswarm deploy up` (`TerraformRuntime`) | `TerraformRuntime` in `CodeBuild` | Writes a `terraform_runs` audit row | The app tree is never applied with a bare `terraform apply`. It goes through `alphaswarm deploy`, which drives `TerraformRuntime` and writes a `terraform_runs` audit row for every plan and apply (platform AGENTS rule 42). That keeps the app-tier change history in the same ledger as every other runtime action. See [Terraform IaC control plane](./terraform-control-plane.md) for how `TerraformRuntime` works and [IaC runbook](./iac-runbook.md) for the provisioning recipes. 
```bash # Landing zone (infrastructure/): native terraform, OIDC -> AqpTerraformExecutionRole terraform -chdir=infrastructure/envs/dev init terraform -chdir=infrastructure/envs/dev plan # App tier (terraform/): alphaswarm deploy, writes a terraform_runs row alphaswarm deploy plan --env dev alphaswarm deploy up --env dev ``` ## CodeArtifact for alphaswarm-core and the CLI `alphaswarm-core` and the `alphaswarm` CLI are not installed from public PyPI in CI or in the Docker images. They are pulled from the platform's **AWS `CodeArtifact`** internal PyPI repository, `alphaswarm-pypi`. CI (and every Dockerfile build step that needs the CLI) authenticates to `CodeArtifact` over the same OIDC-derived credentials and configures it as the pip index: ```bash aws codeartifact login --tool pip \ --domain alphaswarm --repository alphaswarm-pypi pip install alphaswarm-core "alphaswarm[deploy]" ``` This keeps the internal packages private and gives CI a stable, in-account index that does not depend on public PyPI availability. ## The three canonical workflows These names match `compliance/soc2-evidence-map.md`, `how-to/operations/aws-deploy.md`, `how-to/operations/aws-runbook.md`, and ADR [006 — alphaswarm_admin overhaul](../../architecture/decisions/006-aqp-admin-overhaul.md). ### terraform-pipeline.yml The deploy workflow for both Terraform trees. - **Inputs:** `tree` ∈ {`infrastructure`, `alphaswarm_platform`}, `env` ∈ {`dev`, `staging`, `prod`}, `action` ∈ {`plan`, `apply`}. - **`push` to `main`:** runs a `plan` against `dev` automatically. - **Dispatch (`apply`):** assumes the env's apply role and applies the selected tree. For `tree=infrastructure` it runs native `terraform apply`; for `tree=alphaswarm_platform` it delegates to `CodeBuild`, which runs `alphaswarm deploy up` (and lands the `terraform_runs` row). ### build-publish.yml The image release workflow. 
Triggers on a `v*` tag and, for each service, performs a supply-chain-hardened build: - multi-arch `buildx` build, pushed to `ECR`; - **`Cosign` keyless** signature (OIDC, no long-lived keys); - **`syft` SBOM** generation; - **`SLSA` provenance** attestation; - **`Trivy`** and **`Grype`** vulnerability scans. The per-service build/sign/push logic is factored into the composite action `.github/actions/build-sign-push/`, so every service builds identically. ### pr-validate.yml The pull-request gate. On every PR it runs `terraform fmt -check`, `terraform validate`, `tfsec`, and `conftest` (OPA) policy checks, then a `terraform plan` using the **plan role** (read-only). It never holds mutate permission, so a PR can be validated safely from a fork or feature branch. ## Promotion — dev to staging to prod Promotion is enforced by **GitHub Environments** with required reviewers, layered on top of the OIDC apply-role trust (the apply role is only assumable inside the matching Environment): | Environment | Approval | Trigger | | --- | --- | --- | | `dev` | Auto (no reviewers) | `push` to `main` plans `dev`; apply on dispatch | | `staging` | 1 reviewer | Dispatch `terraform-pipeline.yml` with `env=staging` | | `prod` | 2 reviewers (4-eyes) | Dispatch `terraform-pipeline.yml` with `env=prod` | Because the gate lives in the GitHub Environment, a `prod` apply physically cannot start minting the apply-role credential until two distinct reviewers approve the run. ## alphaswarm_admin — two images, then a dispatch handoff `alphaswarm_admin` is built and deployed slightly differently from the platform itself. 1. A push to the admin repo's `main` (or a `v*` tag) builds **two images** and pushes them to `ECR`: - `alphaswarm-admin` (the FastAPI backend) - `alphaswarm-admin-frontend` (the Next.js frontend) 2. After both images land, the admin workflow fires a cross-repo `repository_dispatch` event named `admin-image-published` at `alphaswarm_platform`. 3. 
That dispatch triggers the platform's app-tier redeploy, which rolls the admin service onto **ECS `Fargate`** (`Cognito` + `ALB`) via the platform's `terraform/environments/{dev,staging,prod}` app tier (generalized from the existing `minimum` env). 4. The app tier reads its infra handles from SSM under `/alphaswarm//*`, published by `infrastructure/envs/admin-{dev,staging,prod}`. ```mermaid flowchart LR push[Push to admin main / tag] --> build[Build 2 images] build --> ecr1[ECR: alphaswarm-admin] build --> ecr2[ECR: alphaswarm-admin-frontend] ecr1 --> disp[repository_dispatch: admin-image-published] ecr2 --> disp disp --> plat[alphaswarm_platform app-tier redeploy] plat --> ssm[Read SSM /alphaswarm/env/*] plat --> fargate[ECS Fargate: Cognito + ALB] ``` The cross-repo dispatch requires a token (`PLATFORM_DISPATCH_TOKEN`) configured as a secret in the admin repo — see the runbook for setup. For what the admin service itself is, see [alphaswarm-admin](./services/alphaswarm-admin.md). ## See also - [Operations runbook — CI/CD deploy](../../how-to/operations/cicd-deploy.md) — task-oriented steps. - [Terraform IaC control plane](./terraform-control-plane.md) — how `TerraformRuntime` executes. - [IaC runbook](./iac-runbook.md) — provisioning recipes. - [alphaswarm-admin](./services/alphaswarm-admin.md) — the admin service. - [AWS Hybrid Deployment Guide](../../how-to/operations/aws-deploy.md) and [AWS Hybrid Operational Runbook](../../how-to/operations/aws-runbook.md) — bootstrap + incident playbooks. # Control-plane topology > 1. Hardcoded default in `Settings`. 2. `ALPHASWARM_*` environment variable. 3. `alphaswarm_platform/configs/deployment/topology.yaml` fallback (this layer) # Control-plane topology Phase 0 of the AlphaSwarm infra-expansion plan. The single source of truth for "what services exist, where do they live, what URLs do they expose" is [`alphaswarm_platform/configs/deployment/topology.yaml`](../configs/deployment/topology.yaml). 
Both the AlphaSwarm monolith (`alphaswarm/`) and the standalone control plane (`alphaswarm_controller/`) read from the same YAML through the shared loader at [`alphaswarm_core.topology.load_topology`](../alphaswarm_core/src/alphaswarm_core/topology/loader.py).

## Resolution order

1. Hardcoded default in `Settings`.
2. `ALPHASWARM_*` environment variable.
3. `alphaswarm_platform/configs/deployment/topology.yaml` fallback (this layer).

The Phase 0 fallback ONLY fires when an `ALPHASWARM_*` env var is unset (checked via `Settings.model_fields_set`). Operators who explicitly override an env var keep their override.

## URL fallback table

The mapping lives in [`alphaswarm/config/topology_fallback.py::URL_FALLBACK_FIELDS`](../alphaswarm/config/topology_fallback.py). Each row says: when topology declares `endpoints[]` on the service whose id is ``, use that URL as the fallback for the matching `Settings` field.

Adding a new service = new row in the table + new `services:` entry in `topology.yaml`.

## Control-plane routes

`alphaswarm_controller` exposes the topology over HTTP:

| Route | Purpose |
|---|---|
| `GET /manage/topology` | Full snapshot (services + targets). |
| `GET /manage/topology/services` | Filterable service list (?role=, ?cluster=). |
| `GET /manage/topology/services/{id}` | Single descriptor (matched by id or alias). |
| `GET /manage/topology/services/{id}/endpoint?name=` | Resolve a named URL. |
| `GET /manage/topology/services/{id}/health` | Live provider probe. |
| `GET /manage/topology/targets` | List deployment targets. |
| `POST /manage/topology/reload` | Drop the cache and reload from disk (admin:cluster). |

The frontend at [/admin/topology](../alphaswarm_client/src/routes/admin/topology/page.tsx) renders the topology grouped by role with a "Probe health" button per service.

## Adding a new shared service

1. Append a `services:` entry to [`alphaswarm_platform/configs/deployment/topology.yaml`](../configs/deployment/topology.yaml) with `cluster`, `namespace`, `protocols`, and `endpoints` populated.
2. Add the new `Settings` field in [`alphaswarm/config/settings.py`](../alphaswarm/config/settings.py) (default `""`).
3. Add a row to `URL_FALLBACK_FIELDS` mapping the new `Settings` field to the topology endpoint name.
4. Add the namespace to `targets..services` so the topology round-trips for that environment.
5. (Optional) Add a `/cache/` populator on the [`MetadataPrefetcher`](../alphaswarm/cache/prefetch.py) so the `" />` in the frontend has dropdown data.

# IaC runbook

> | Task | Recipe |
> | --- | --- |
> | Stand up local AlphaSwarm on a laptop | [Local environment](#local-environment) |
>
> ...

# IaC runbook

"I want to provision X" recipes for the Terraform IaC control plane.

## Quick reference

| Task | Recipe |
| --- | --- |
| Stand up local AlphaSwarm on a laptop | [Local environment](#local-environment) |
| Stand up AlphaSwarm on rpi_kubernetes | [rpi Kubernetes environment](#rpi-kubernetes-environment) |
| Stand up paper-trading on GCP | [Paper environment](#paper-environment) |
| Stand up production on AWS | [Live environment](#live-environment) |
| Stand up the seeded Wiley Tech home on Azure | [Wiley Tech environment](#wiley-tech-environment) |
| Add a new module kind to the codegen | [Add a module kind](#add-a-module-kind) |
| Add a Terraform stack via the API | [Create a stack via API](#create-a-stack-via-api) |
| Plan / apply / destroy from the UI | [Lifecycle from the frontend](#lifecycle-from-the-frontend) |
| Configure HCP Terraform as state backend | [HCP Terraform](#hcp-terraform) |
| Wire OPA policy enforcement | [Policy enforcement](#policy-enforcement) |

## Local environment
```bash
cd alphaswarm_platform/terraform/environments/local
terraform init
terraform plan
terraform apply
```

What this provisions:

- Postgres / MinIO / Redis containers via `kreuzwerker/docker`.
- Minikube / kind cluster + namespaces (`alphaswarm-local` / `alphaswarm-paper` / `alphaswarm-live` / `alphaswarm-backtest` / `alphaswarm-system` / `alphaswarm-terraform`).
- Helm baseline: cert-manager / ESO / KEDA / ingress-nginx / kube-prometheus / otel-operator / istio.
- KEDA `ScaledObject` per Celery queue (including the new `terraform` queue).
- Per-bot Deployment with `alphaswarm-data-mcp` sidecar (zero-egress NetworkPolicy on the agent container).
- Local Docker registry on `:5000`.

State is local (`alphaswarm_platform/terraform/environments/local/terraform.tfstate`).

## rpi Kubernetes environment

```bash
alphaswarm-cli deploy publish-rpi --registry ghcr.io/ --tag
terraform -chdir=alphaswarm_platform/terraform/environments/rpi init
terraform -chdir=alphaswarm_platform/terraform/environments/rpi plan
terraform -chdir=alphaswarm_platform/terraform/environments/rpi apply
```

Recommended bootstrap sequence for first-time bring-up:

1. CLI-first Terraform apply until base services are healthy.
2. Verify API + Celery + Redis + Postgres are reachable.
3. Move to control-plane actions (`/control-plane/kubernetes/targets/rpi/*`).

This avoids enqueue/stream confusion during cold start when broker/DB are still bootstrapping.
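
Step 2 of the bootstrap sequence can be scripted before moving on to control-plane actions. The following is a minimal sketch, not part of AlphaSwarm: `ENDPOINTS`, `check_tcp`, and `unreachable` are hypothetical names, and the hosts and ports are assumptions to adjust for your environment. It only confirms that each port accepts a TCP connection; it does not prove that Celery workers are consuming from the broker.

```python
import socket

# Hypothetical endpoints (assumption): adjust hosts and ports to your deployment.
ENDPOINTS = {
    "api": ("localhost", 8000),
    "redis": ("localhost", 6379),
    "postgres": ("localhost", 5432),
}


def check_tcp(host: str, port: int, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name-resolution failures.
        return False


def unreachable(endpoints: dict) -> list:
    """Return the names of endpoints that did not accept a connection."""
    return [
        name
        for name, (host, port) in endpoints.items()
        if not check_tcp(host, port)
    ]
```

Calling `unreachable(ENDPOINTS)` returns an empty list once every base service is listening; any names it returns are the services to wait on before step 3.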
### Provider mirror + init retries

When provider downloads are unstable, define a Terraform CLI config file with `provider_installation` mirror rules and point AlphaSwarm at it:

```bash
export ALPHASWARM_TERRAFORM_CLI_CONFIG_FILE=/absolute/path/to/terraform.tfrc
export ALPHASWARM_TERRAFORM_INIT_RETRY_ATTEMPTS=5
export ALPHASWARM_TERRAFORM_INIT_RETRY_BACKOFF_SECONDS=2
export ALPHASWARM_TERRAFORM_INIT_RETRY_MAX_BACKOFF_SECONDS=30
```

`TerraformExecutor` applies bounded retries for transient `terraform init` failures and reuses `ALPHASWARM_TERRAFORM_PLUGIN_CACHE_DIR` between runs.

## Paper environment

```bash
cd alphaswarm_platform/terraform/environments/paper
export TF_VAR_gcp_project_id=
export TF_VAR_primary_domain=paper.alphaswarm.example
terraform init -backend-config="bucket=alphaswarm-terraform-state-paper"
terraform plan
terraform apply
```

What this provisions:

- GKE cluster (auto-promoted from `ALPHASWARM_DEFAULT_CLOUD_PROVIDER=gcp`).
- Cloud SQL Postgres (single AZ — cost-optimised for paper).
- GCS bucket + Memorystore Redis.
- GCP Secret Manager `ClusterSecretStore` (ESO).
- Bot Deployments with `dry_run=true` for paper trading.
- 100% traffic to the Vite frontend (no canary split in paper).

## Live environment

```bash
cd alphaswarm_platform/terraform/environments/live
export TF_VAR_aws_subnet_ids='["subnet-aaaa", "subnet-bbbb", "subnet-cccc"]'
export TF_VAR_primary_domain=app.wiley.tech
terraform init   # picks up backend.tf with S3 + DynamoDB locking
terraform plan
terraform apply
```

What this provisions:

- EKS cluster Multi-AZ.
- RDS Multi-AZ Postgres + S3 versioning + ElastiCache 7+ cluster mode.
- AWS Secrets Manager `ClusterSecretStore`.
- Bot Deployments live (`dry_run=false`); `live_control=true` on the actor's `Membership` is required to trigger orders.
- Full prod sizing for KEDA `maxReplicaCount` (50 default / 100 ML / 200 backtest / 30 agents / 10 terraform).
## Wiley Tech environment

This is the seeded production home for the org provisioned by Alembic 0051. Pinned to the Wiley Tech Entra tenant.

```bash
cd alphaswarm_platform/terraform/environments/wiley-tech
export TF_VAR_azure_tenant_id=
export TF_VAR_azure_subscription_id=
export TF_VAR_azure_resource_group=alphaswarm-wiley-tech
export TF_VAR_azure_keyvault_url=https://alphaswarm-wiley-tech-kv.vault.azure.net/
terraform init   # picks up backend.tf with Azure Blob state
terraform plan
terraform apply
```

What this provisions:

- AKS cluster + Azure Workload Identity for ESO.
- Azure PostgreSQL Flexible Server (Zone-Redundant HA).
- ADLS Gen2 storage account (HNS enabled).
- Azure Cache for Redis (Standard, TLS-only).
- Azure Key Vault `ClusterSecretStore` synced via ESO Workload Identity.
- ACR registry for AlphaSwarm images.

## Add a module kind

1. Add the kind to `TERRAFORM_MODULE_KINDS` in [`alphaswarm/persistence/models_terraform.py`](../alphaswarm/persistence/models_terraform.py).
2. Create the Jinja2 template at `alphaswarm/terraform/codegen/templates/_.tf.j2` (and a `_local` fallback).
3. (Optional) Mirror as a native HCL module under `alphaswarm_platform/terraform/modules//`.
4. Operators create a stack via `POST /terraform/stacks` with `module_kind: ""`.

## Create a stack via API

```bash
curl -X POST http://localhost:8000/terraform/stacks \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer " \
  -d '{
    "name": "Bronze tier storage",
    "slug": "bronze-storage",
    "module_kind": "storage",
    "cloud_provider": "aws",
    "environment": "live",
    "variables": {
      "aws_region": "us-east-1",
      "aws_subnet_ids": ["subnet-aaa", "subnet-bbb", "subnet-ccc"],
      "bucket_name": "alphaswarm-bronze",
      "db_storage_gb": 500
    },
    "backend": {
      "kind": "s3",
      "config": { "bucket": "alphaswarm-tf-state", "key": "bronze-storage.tfstate" }
    },
    "tags": { "tier": "bronze" }
  }'
```

Response includes `spec_version_id` (immutable, hash-locked).
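
For scripted use, the same request can be assembled in Python with only the standard library. This is a minimal sketch under stated assumptions: `build_stack_request` and `create_stack` are hypothetical helpers, the base URL is whatever your gateway listens on (the curl example uses `http://localhost:8000`), and the only response field relied on is `spec_version_id`, as named above. The rest of the response shape is not assumed.

```python
import json
import urllib.request


def build_stack_request(base_url: str, token: str, spec: dict) -> urllib.request.Request:
    """Build (but do not send) the POST /terraform/stacks request."""
    return urllib.request.Request(
        url=f"{base_url.rstrip('/')}/terraform/stacks",
        data=json.dumps(spec).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {token}",
        },
        method="POST",
    )


def create_stack(base_url: str, token: str, spec: dict) -> str:
    """Send the request and return spec_version_id from the JSON response."""
    request = build_stack_request(base_url, token, spec)
    with urllib.request.urlopen(request, timeout=30) as response:
        body = json.load(response)
    # Assumption: the response is a JSON object carrying this key at top level.
    return body["spec_version_id"]
```

Keeping request construction separate from sending makes the payload easy to inspect or unit-test without a running gateway.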
Then create a workspace + plan:

```bash
# Workspace
curl -X POST http://localhost:8000/terraform/workspaces \
  -H "Content-Type: application/json" -H "Authorization: Bearer " \
  -d '{ "slug": "bronze-live", "name": "Bronze (live)", "stack_spec_id": "", "environment": "live", "state_backend": "s3" }'

# Plan
curl -X POST http://localhost:8000/terraform/workspaces//plan \
  -H "Authorization: Bearer "
```

Subscribe to live progress at `wss:///terraform/ws/runs/`.

## Lifecycle from the frontend

Navigate to `/infra/terraform`, click a workspace row → land on `/infra/terraform/workspaces/[id]`:

1. Click **Plan** → enqueues plan task; result lands in `awaiting_approval`.
2. Review the plan summary on the run detail page (live WS stream).
3. Click **Apply this plan** on the plan run row.
4. Apply executes → state version snapshotted → outputs visible in the "Latest state outputs" card.
5. **Destroy** is friction-gated: type the workspace slug to confirm.

## HCP Terraform

1. Create an HCP Terraform organization + workspaces in the HCP UI.
2. Set `ALPHASWARM_HCP_TOKEN` (preferred: via `CredentialResolver`), `ALPHASWARM_HCP_ORGANIZATION`, `ALPHASWARM_TERRAFORM_STATE_BACKEND=hcp`.
3. Set the stack spec's `backend.kind="hcp"` and the workspace's `hcp_workspace_id`.
4. The runtime now drives runs through [`HcpClient`](../alphaswarm/terraform/hcp_client.py) instead of the local subprocess (no `terraform` binary required on the runner pod).

## Policy enforcement

1. Author OPA Rego policies that target Terraform plan JSON (the runtime emits `tfplan.binary.json` via `terraform show -json`).
2. Insert a `TerraformPolicyAttachment` row binding the policy file URI to a workspace.
3. Set `hard_mandatory=True` to block apply on violation; `hard_mandatory=False` emits a warning.
4. When `opa` is on PATH the runtime invokes `opa eval -i tfplan.json -d policy.rego "data.alphaswarm.terraform.deny"`. Without OPA installed the check no-ops cleanly.
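
The gating rules in steps 3 and 4 can be summarised in code. This is a minimal sketch, not the runtime's actual implementation: `PolicyOutcome`, `parse_denials`, `decide`, and `evaluate_policy` are hypothetical names, the `opa eval` arguments mirror the command quoted in step 4 with `--format json` made explicit, and the parsing assumes OPA's standard JSON result shape (`result[].expressions[].value`).

```python
import json
import shutil
import subprocess
from dataclasses import dataclass, field


@dataclass
class PolicyOutcome:
    allowed: bool                                   # False only when apply must be blocked
    violations: list = field(default_factory=list)  # deny messages, if any
    skipped: bool = False                           # True when opa is not installed


def parse_denials(opa_stdout: str) -> list:
    """Flatten the values of an `opa eval --format json` result into one list."""
    doc = json.loads(opa_stdout or "{}")
    denials = []
    for result in doc.get("result", []):
        for expr in result.get("expressions", []):
            value = expr.get("value")
            if isinstance(value, list):
                denials.extend(value)
            elif value:
                denials.append(value)
    return denials


def decide(denials: list, hard_mandatory: bool) -> PolicyOutcome:
    """Block apply only when violations exist AND the attachment is hard-mandatory."""
    blocked = bool(denials) and hard_mandatory
    return PolicyOutcome(allowed=not blocked, violations=denials)


def evaluate_policy(plan_json: str, policy_rego: str, hard_mandatory: bool) -> PolicyOutcome:
    """Run opa when it is on PATH; otherwise no-op and allow the apply."""
    if shutil.which("opa") is None:
        return PolicyOutcome(allowed=True, skipped=True)
    proc = subprocess.run(
        ["opa", "eval", "--format", "json", "-i", plan_json, "-d", policy_rego,
         "data.alphaswarm.terraform.deny"],
        capture_output=True, text=True, check=True,
    )
    return decide(parse_denials(proc.stdout), hard_mandatory)
```

With `hard_mandatory=False` the outcome still carries the violations, so a caller can surface them as warnings while letting the apply proceed.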
# Kubernetes adapter

AlphaSwarm wraps every cluster-side operation in a pluggable :class:`alphaswarm.kubernetes.KubernetesAdapter`. The abstraction makes the rpi_kubernetes attach optional: AlphaSwarm works fully standalone with `NoneAdapter`, attaches to the rpi management API with `RpiClusterAdapter`, talks to a Kubernetes API directly with `InClusterAdapter`, or treats the local Docker Compose stack as the cluster surface with `LocalComposeAdapter`.

## Architecture

```mermaid
flowchart TB
  Routes["alphaswarm/api/routes<br/>/cluster, /streaming/kafka, /streaming/flink"]
  Producers[ProducerSupervisor]
  FinOps["finops_tasks.audit<br/>(grandfathered direct path)"]

  subgraph adapters [alphaswarm.kubernetes]
    ABC[KubernetesAdapter ABC]
    None[NoneAdapter]
    Rpi[RpiClusterAdapter]
    InCluster[InClusterAdapter]
    LocalCompose[LocalComposeAdapter]
  end

  None --> ABC
  Rpi --> ABC
  InCluster --> ABC
  LocalCompose --> ABC
  Routes --> ABC
  Producers --> ABC
  Rpi --> RpiClient["alphaswarm/services/cluster_mgmt_client<br/>(rpi management HTTP)"]
  InCluster --> K8sSDK[kubernetes-client SDK]
  LocalCompose --> Docker[docker compose]
```

`get_kubernetes_adapter()` returns the active adapter based on:

1. Explicit `settings.kubernetes_adapter` (`none` / `rpi_cluster` / `in_cluster` / `local_compose`).
2. Auto-promote: empty kind + `cluster_mgmt_url` set → `rpi_cluster`.
3. Default: `none`.

Failures during a call surface as :class:`KubernetesAdapterUnavailable` (routes return 503) or :class:`KubernetesAdapterError` (routes return 502). Adapters opt out of unsupported methods by raising :class:`KubernetesAdapterUnavailable`.

## Adapter capabilities

See [`.cursor/rules/kubernetes-adapter.mdc`](../.cursor/rules/kubernetes-adapter.mdc) for the per-method matrix.
Today every adapter implements `is_available()`; `RpiClusterAdapter` covers the full Kafka / Flink / AlphaVantage / scale_deployment surface; `InClusterAdapter` covers `scale_deployment` / `pod_logs` / `apply_manifest`; `LocalComposeAdapter` covers `scale_deployment` / `pod_logs`. The `/cluster` REST surface is the primary user — `/cluster-mgmt` is kept as a backwards-compat alias.

## Test patterns

`tests/kubernetes/test_adapter.py` covers:

- The metaclass registers every concrete adapter under `"k8s_adapter"` in the AlphaSwarm registry.
- `NoneAdapter.is_available()` is `False`; every op raises.
- `RpiClusterAdapter` forwards to the wrapped client and translates `ClusterMgmtError` → `KubernetesAdapterError`.
- `InClusterAdapter` reports unavailable when kubernetes isn't installed (CI default).
- `register_adapter(...)` / `reset_kubernetes_adapter()` give tests clean fixtures.

## Adding capabilities

When you need a new cluster op (say `list_namespaces`):

1. Add an abstract method (default: raise :class:`KubernetesAdapterUnavailable`) on the ABC in [`alphaswarm/kubernetes/protocol.py`](../alphaswarm/kubernetes/protocol.py).
2. Implement it in each adapter that can support it.
3. Add a route in [`alphaswarm/api/routes/cluster_mgmt.py`](../alphaswarm/api/routes/cluster_mgmt.py) that calls the adapter.
4. Adapters that can't service the op leave the default; routes catch `KubernetesAdapterUnavailable` and translate to 503.

## Migrating finops_tasks

The FinOps audit (`alphaswarm/tasks/finops_tasks.py`) currently uses the `kubernetes` SDK directly because it needs list APIs (`list_pod_for_all_namespaces`, etc.) that the adapter doesn't yet expose. Adding those list methods to the adapter ABC + the in-cluster implementation is the migration target — until then, the direct path is grandfathered by the [`.cursor/rules/kubernetes-adapter.mdc`](../.cursor/rules/kubernetes-adapter.mdc) rule.

# rpi Kubernetes Deployment

> - A kubeconfig that can reach the rpi cluster.
> - A registry reachable by every rpi node.
> - Immutable AlphaSwarm image tag published with:

# rpi Kubernetes Deployment

AlphaSwarm deploys to the `rpi_kubernetes` cluster through the sanctioned Terraform runtime path. The source-of-truth HCL lives in `alphaswarm_platform/terraform/environments/rpi`, and the stack spec is `alphaswarm_platform/configs/terraform/rpi.yaml`.

## Prerequisites

- A kubeconfig that can reach the rpi cluster.
- A registry reachable by every rpi node.
- Immutable AlphaSwarm image tag published with:

```bash
alphaswarm-cli deploy publish-rpi --registry docker.io/ --tag
```

## Configure

Edit or override `alphaswarm_platform/terraform/environments/rpi/terraform.tfvars`:

```hcl
rpi_kubeconfig_path = "~/.kube/config"
rpi_kube_context    = "rpi"
rpi_namespace       = "alphaswarm"
rpi_image_registry  = "docker.io/"
app_version         = ""
rpi_ingress_host    = "alphaswarm.example.com"
auth0_domain        = "example.us.auth0.com"
auth0_audience      = "https://alphaswarm/api"
auth0_client_id     = ""
```

## Deploy

Use the AlphaSwarm control plane or Terraform directly:

```bash
terraform -chdir=alphaswarm_platform/terraform/environments/rpi init
terraform -chdir=alphaswarm_platform/terraform/environments/rpi plan
terraform -chdir=alphaswarm_platform/terraform/environments/rpi apply
```

The backend control-plane routes dispatch the same stack through `alphaswarm.tasks.terraform_tasks.run_rpi_stack`, preserving `terraform_runs` ledger rows and progress streams.

## Cold-start order

For first-time bootstrap on a new machine, run in this order so each dependency exists before the next one:

1. Build and push immutable AlphaSwarm images (`alphaswarm-cli deploy publish-rpi ...`).
2. Set image tags and Auth0 values in `alphaswarm_platform/terraform/environments/rpi/terraform.tfvars`.
3. Run Terraform from CLI (`init`, `plan`, `apply`) until the core stack is healthy.
4. Start/verify API + Celery + Redis + Postgres.
5. Use `/control-plane/kubernetes/targets/rpi/*` for ongoing operations.

Why this order matters:

- Terraform subprocess execution itself only needs Terraform + kubeconfig.
- Control-plane-triggered runs additionally need Celery broker/worker.
- Run history and richer status views depend on Postgres/Redis availability.

## Provider download resilience (flaky network / IPv6 issues)

When `terraform init` intermittently fails to download providers, use a Terraform CLI config file with `provider_installation` mirrors and point the runtime at it with `ALPHASWARM_TERRAFORM_CLI_CONFIG_FILE`.

Example `terraform.tfrc`:

```hcl
provider_installation {
  filesystem_mirror {
    path    = "C:/terraform/provider-mirror"
    include = ["hashicorp/*", "kreuzwerker/*", "auth0/*"]
  }
  direct {
    exclude = ["hashicorp/*", "kreuzwerker/*", "auth0/*"]
  }
}
```

Then set:

```bash
export ALPHASWARM_TERRAFORM_CLI_CONFIG_FILE=/absolute/path/to/terraform.tfrc
```

The runtime also retries transient `terraform init` network/provider failures with bounded exponential backoff. Tune with:

- `ALPHASWARM_TERRAFORM_INIT_RETRY_ATTEMPTS`
- `ALPHASWARM_TERRAFORM_INIT_RETRY_BACKOFF_SECONDS`
- `ALPHASWARM_TERRAFORM_INIT_RETRY_MAX_BACKOFF_SECONDS`

## Rollback

Re-apply the previous immutable image tag or run:

```bash
terraform -chdir=alphaswarm_platform/terraform/environments/rpi destroy
```

Long-running Terraform jobs remain halt-able through `/terraform/halt` and the global frontend kill switch.

# Service-level view

> Catalogue of every AlphaSwarm service: container image, port, health probe, deployment surfaces (Compose / Kustomize / AQP CR / Terraform template), upstream and downstream dependencies, and the canonical doc that owns each contract.

# Service-level view

This page catalogues every service AlphaSwarm runs — the application workloads, the control plane, the data layer, the observability stack, and the external edge surface — at a single level of detail.
It pairs [`control-plane-topology.md`](control-plane-topology.md) (which says *how* services are discovered) and [`terraform-control-plane.md`](terraform-control-plane.md) (which says *how* they are provisioned) with a *what is each service* reference. The single source of truth for the service registry is [`alphaswarm_platform/configs/deployment/topology.yaml`](../../../../alphaswarm_platform/configs/deployment/topology.yaml). This page is generated against that file plus each service's matching package contract. When a row drifts, the truth is the YAML. ## Reading the catalogue Every service has its own detail page under [`services/`](services/) with the same layout: - **Identity** — id, role, label, package or upstream image. - **Wire** — protocol, port, health endpoint, public URL (if any). - **Deployment** — which compose / kustomize / AQP CR / Terraform template stands it up. - **Dependencies** — upstream services it calls, downstream services that call it. - **Operations** — runbooks, scaling notes, redaction posture, feature flags. Detail pages link back to the canonical concept doc that owns each contract — they do not duplicate prose. 
## How services compose ``` ┌─ alphaswarm-website ──────────┐ public marketing │ (Cloudflare Pages, no auth) │ └───────────────────────────────┘ │ ▼ NEXT_PUBLIC_ALPHASWARM_APP_URL B2C / B2B users ─▶ alphaswarm-ui ──┐ Internal staff ─▶ alphaswarm-admin ┼──▶ alphaswarm-cp ──▶ /manage/* control plane Local power user ─▶ alphaswarm-client┤ ──▶ /auth/* identity broker Operators (CLI) ─▶ alphaswarm-cli ┤ ──▶ /proxy/* connection mesh (Phase 5) │ ▼ HTTP alphaswarm-core (FastAPI) │ ┌──────────────────────┼──────────────────────┐ ▼ ▼ ▼ alphaswarm-worker alphaswarm-executor alphaswarm-beat alphaswarm-ml-mcp (light queues) (heavy compute) (scheduler) (DataMCP /mcp/ml) Data plane: postgres ─ redis ─ neo4j ─ chromadb ─ minio ─ iceberg(Polaris) Streaming: kafka(Strimzi) | redpanda ─ schema-registry ─ flink ─ redpanda-connect ML / orch: mlflow ─ argo-workflows ─ argo-events ─ bentoml ─ kserve ─ dagster ─ ragflow Observability: otel-collector ─ prometheus ─ grafana ─ jaeger ─ loki ─ vector ─ victoriametrics ─ phoenix Mesh ID: spire (issuer) ─▶ linkerd (mTLS) ─▶ vault-secrets-operator ─▶ pomerium (IAP) Edge: cloudflared (alpha-swarm.ai) | cloudflared-aqp-green | alphaswarm-edge | tenant-router Sandbox: agent-sandbox/gvisor ─▶ agent-sandbox/pool Operators: aqp-controller-operator (8 AQP* CRDs) ─ bots-operator (4 QuantBot CRDs) External: alphaswarm-docs (Cloudflare Pages) ─ alphaswarm-docs-status (Instatus) ─ alphaswarm-docs-archive ``` Identity flows from `spire` through `linkerd` through `vault-secrets-operator` to every workload pod; secrets land via `ExternalSecret` resources, never in `values.yaml`. The `pomerium` IAP wraps the bare `/manage/*` ingress. ## Application services Services that run AlphaSwarm code. Each is built from a Dockerfile in this workspace and is owned by the package that supplies its image. 
| Service id | Role | Pkg | Image (key) | Port | Health | Public URL | Deployed via | | --- | --- | --- | --- | --- | --- | --- | --- | | [`alphaswarm-core`](services/alphaswarm-core.md) | api | `alphaswarm` | `api` | 8000 | `/readyz` | — (private) | base/alphaswarm-core, AQPMonolith CR, compose `api` | | [`alphaswarm-worker`](services/alphaswarm-worker.md) | worker | `alphaswarm` | `worker` | — | (none) | — | base/alphaswarm-worker, AQPMonolith CR, compose `worker` | | [`alphaswarm-executor`](services/alphaswarm-executor.md) | executor | `alphaswarm` | `executor` | — | (none) | — | base/alphaswarm-executor, compose `alphaswarm-executor`/`worker-gpu` | | [`alphaswarm-beat`](services/alphaswarm-beat.md) | scheduler | `alphaswarm` | `beat` | — | (none) | — | base/alphaswarm-worker, AQPMonolith CR, compose `beat` | | [`alphaswarm-cp`](services/alphaswarm-cp.md) | control-plane | `alphaswarm_controller` | `cp` | 9000 | `/manage/readyz` | `https://manage.alpha-swarm.ai` | base/alphaswarm-cp, compose `alphaswarm-cp` | | [`alphaswarm-client`](services/alphaswarm-client.md) | frontend | `alphaswarm_client` | `frontend` | 80 | `/` | — (private) | base/alphaswarm-client, AQPClient CR, compose `client` | | [`alphaswarm-ui`](services/alphaswarm-ui.md) | frontend | `alphaswarm_ui` | `ui` | 80 | `/api/healthz` | `https://app.alpha-swarm.ai` | (Vercel/Pages) AQPUI CR | | [`alphaswarm-admin`](services/alphaswarm-admin.md) | admin | `alphaswarm_admin` | `admin` | 8900 | `/admin/healthz` | `https://admin.alpha-swarm.ai` | AQPAdmin CR, compose `alphaswarm-admin` | | [`alphaswarm-ide`](services/alphaswarm-ide.md) | ide | `alphaswarm_ide` | `ide` | 3000 | `/` | (per-user) | alphaswarm-ide kustomize, AQPIDE CR | | [`alphaswarm-ml-mcp`](services/alphaswarm-ml-mcp.md) | mcp | `alphaswarm_models` | (pigg. 
on `api`) | 8000 | `/mcp/ml/tools` | — | base/alphaswarm-core (extra route) | ## Data layer Stateful services owned by the platform — the AlphaSwarm runtime is a client of every row below. | Service id | Role | Image | Port | Storage | Deployed via | | --- | --- | --- | --- | --- | --- | | [`postgres`](services/postgres.md) | database | `pgvector/pgvector:pg16` | 5432 | 5 Gi (StatefulSet) | base-services/postgres-shared | | [`redis`](services/redis.md) | cache | `redis:7-alpine` (master) / `redis-stack:7.4` (local) | 6379 | 2 Gi | base/redis-master, base-services/redis-shared | | [`neo4j`](services/neo4j.md) | graph | `neo4j:5-community` | 7474, 7687 | 5 Gi | base-services (cell-local), compose `neo4j` | | [`chromadb`](services/chromadb.md) | vector | `chromadb/chroma:1.0.16` | 8000 / 8001 | (ephemeral) | base-services/chromadb, compose `chromadb` | | [`mlflow`](services/mlflow.md) | mlops | `ghcr.io/mlflow/mlflow:v2.11.1` | 5000 | object store | base-services/mlflow, compose `mlflow` | Object storage and the Iceberg catalog (MinIO + Polaris) live under the streaming/lakehouse umbrella; they are documented under `base-services/minio` and `base-services/polaris` in [deployment patterns by category](#deployment-patterns). ## Observability Routed by `otel-collector-gateway`; metrics in VictoriaMetrics + Prometheus (parallel during cutover), logs in Loki, traces in Jaeger, and the AI / LLM slice in Phoenix. 
| Service id | Role | Image | Port | Deployed via | | --- | --- | --- | --- | --- | | [`otel-collector`](services/otel-collector.md) | observability | `otel/opentelemetry-collector` | 4317 | observability/opentelemetry-collector-gateway | | [`prometheus`](services/prometheus.md) | metrics | `prom/prometheus` (kube-prometheus-stack) | 9090 | observability/kube-prometheus-stack | | [`grafana`](services/grafana.md) | dashboards | `grafana/grafana` | 3000 | observability/kube-prometheus-stack | | [`jaeger`](services/jaeger.md) | tracing | `jaegertracing/all-in-one` | 6831 / 16686 | observability/jaeger | | [`loki`](services/loki.md) | logs | `grafana/loki:3.3.2` | 3100 | observability/loki | | [`vector`](services/vector.md) | log shipper | `timberio/vector:0.43.0` | — | observability/vector | | [`victoriametrics`](services/victoriametrics.md) | metrics | `victoriametrics/victoria-metrics:v1.108.0` | 8428 | observability/victoriametrics | Phoenix + the OTel operator are documented inline on [`otel-collector`](services/otel-collector.md) since they are part of the same telemetry pipeline. ## External services Hosted off-cluster — included here because the topology references them and operators need to know who runs them. 
| Service id | Role | Hosted on | Public URL | Deployed via | | --- | --- | --- | --- | --- | | [`alphaswarm-docs`](services/alphaswarm-docs.md) | docs | Cloudflare Pages | `https://docs.alpha-swarm.ai` | Terraform module `cloudflare_pages_docs` | | [`alphaswarm-website`](services/alphaswarm-website.md) | marketing | Cloudflare Pages | `https://alpha-swarm.ai` | Terraform module `cloudflare_pages_docs` (forthcoming) | | [`alphaswarm-docs-status`](services/alphaswarm-docs-status.md) | status page | Instatus SaaS | `https://status.alpha-swarm.ai` | Terraform module `instatus` | | [`alphaswarm-docs-archive`](services/alphaswarm-docs-archive.md) | archive | Cloudflare Pages | `https://archive.alpha-swarm.ai` | Terraform module `cloudflare_pages_docs` | ## Deployment patterns Every service above is deployable through one or more of the surfaces below. The [deployment-templates catalogue](../../../../alphaswarm_platform/configs/terraform/templates/README.md) maps each named pattern to a hash-locked [`TerraformStackSpec`](terraform-control-plane.md#terraformstackspec). 
| Pattern | What it stands up | Template slug | Source | | --- | --- | --- | --- | | **Local dev** | k3d cluster + base + minimal observability | `local-dev` | [templates/local-dev.yaml](../../../../alphaswarm_platform/configs/terraform/templates/local-dev.yaml) | | **k3d + MLOps** | local-dev + Argo Workflows + Dagster + MLflow | `k3d-with-mlops` | [templates/k3d-with-mlops.yaml](../../../../alphaswarm_platform/configs/terraform/templates/k3d-with-mlops.yaml) | | **AWS minimum** | Single-account ECS + Cognito + ALB + Bedrock Haiku | `aws-minimum` | [templates/aws-minimum.yaml](../../../../alphaswarm_platform/configs/terraform/templates/aws-minimum.yaml) | | **AWS shared cell** | EKS + base + base-services + observability + edge for one shared standard cell | `aws-cell-shared-std` | [templates/aws-cell-shared-std.yaml](../../../../alphaswarm_platform/configs/terraform/templates/aws-cell-shared-std.yaml) | | **AWS shared cell (premium)** | shared-std + dedicated node group + reserved capacity | `aws-cell-shared-premium` | [templates/aws-cell-shared-premium.yaml](../../../../alphaswarm_platform/configs/terraform/templates/aws-cell-shared-premium.yaml) | | **AWS silo tenant** | Single-tenant cell with hard isolation | `aws-silo-tenant` | [templates/aws-silo-tenant.yaml](../../../../alphaswarm_platform/configs/terraform/templates/aws-silo-tenant.yaml) | | **GCP cell** | GKE + Workload Identity + base + base-services | `gcp-full-cell` | [templates/gcp-full-cell.yaml](../../../../alphaswarm_platform/configs/terraform/templates/gcp-full-cell.yaml) | | **Azure cell** | AKS + Workload Identity + Entra-bound base | `azure-full-cell` | [templates/azure-full-cell.yaml](../../../../alphaswarm_platform/configs/terraform/templates/azure-full-cell.yaml) | | **rpi cluster** | k3s on ARM64 | `rpi-cluster` | [templates/rpi-cluster.yaml](../../../../alphaswarm_platform/configs/terraform/templates/rpi-cluster.yaml) | | **Edge only** | Cloudflare tunnels + Access apps + 
cloudflared-aqp-green | `edge-only` | [templates/edge-only.yaml](../../../../alphaswarm_platform/configs/terraform/templates/edge-only.yaml) | | **Observability only** | OTel + Prometheus + Loki + Jaeger + Phoenix + VictoriaMetrics | `observability-only` | [templates/observability-only.yaml](../../../../alphaswarm_platform/configs/terraform/templates/observability-only.yaml) | | **MLOps only** | Argo Workflows + Argo Events + BentoML + KServe + Dagster | `mlops-only` | [templates/mlops-only.yaml](../../../../alphaswarm_platform/configs/terraform/templates/mlops-only.yaml) | Templates are discovered by [`alphaswarm.terraform.templates`](../../../../alphaswarm/alphaswarm/terraform/templates.py) and surfaced through: - `GET /terraform/templates` and `POST /terraform/stacks/from-template/{slug}` (REST). - `alphaswarm-cli deploy templates {list,describe,apply}` (CLI). - `data.terraform.templates.list_templates` and `data.terraform.templates.instantiate_template` (MCP, used by the agentic plane). Every instantiation flows through `TerraformRuntime` so the apply lands a `terraform_runs` ledger row + spec snapshot per AGENTS rule 42 / 43. ## Building blocks (Jinja2 codegen) The codegen layer at [`alphaswarm/terraform/codegen/templates/`](../../../../alphaswarm/alphaswarm/terraform/codegen/templates/) ships per-module-kind Jinja2 templates. 
The standard-template catalogue adds five composite building blocks so users can compose their own stacks against typed inputs:

| Building block | Renders | Used by |
| --- | --- | --- |
| `cell.tf.j2` | One cell — namespaces + base workloads + per-cell ingress + RBAC | `aws-cell-shared-std`, `aws-silo-tenant`, `gcp-full-cell`, `azure-full-cell` |
| `observability_stack.tf.j2` | Full OTel + Prom + Loki + Jaeger + Phoenix + VictoriaMetrics overlay | `observability-only`, every cell template |
| `mesh_identity.tf.j2` | spire → linkerd → vault-secrets-operator → pomerium chain | every cell template |
| `mlops_stack.tf.j2` | Argo Workflows + Events + BentoML + KServe + Dagster | `mlops-only`, `k3d-with-mlops` |
| `edge_stack.tf.j2` | cloudflared + access apps + tenant-router | `edge-only`, every public-facing cell template |

These are referenced from `TerraformStackSpec.modules[].source` with the `tpl://` scheme — see [the IaC runbook](iac-runbook.md#shipping-a-standard-template) for the operator workflow.

## Maintenance

This page and the per-service files mirror the YAML at [`alphaswarm_platform/configs/deployment/topology.yaml`](../../../../alphaswarm_platform/configs/deployment/topology.yaml). When you add a service:

1. Append the service to `topology.yaml` under `services:`.
2. Add a row to the matching table above (by category).
3. Add `concepts/infrastructure/services/.md` using the layout on every existing detail page (Identity / Wire / Deployment / Dependencies / Operations).
4. Add `'concepts/infrastructure/services/'` to `sidebars.ts` under the **Services** category.
5. If the service is reachable across cells, also append a row to `URL_FALLBACK_FIELDS` in [`alphaswarm/config/topology_fallback.py`](../../../../alphaswarm/alphaswarm/config/topology_fallback.py).
6. Either invoke the [`alphaswarm-index-curator`](../../../../alphaswarm/.cursor/agents/alphaswarm-index-curator.md) or drop a debt note per the always-on [`alphaswarm-index-reflect`](../../../../alphaswarm/.cursor/rules/alphaswarm-index-reflect.mdc) rule.

## See also

- [`control-plane-topology.md`](control-plane-topology.md) — discovery contract + `URL_FALLBACK_FIELDS` semantics.
- [`terraform-control-plane.md`](terraform-control-plane.md) — `TerraformRuntime` lifecycle + spec hash-locking.
- [`iac-runbook.md`](iac-runbook.md) — quick reference for plan / apply / destroy + shipping a standard template.
- [`how-to/operations/local-setup.md`](../../how-to/operations/local-setup.md) — bring the stack up locally.
- [`how-to/operations/kubernetes-deploy.md`](../../how-to/operations/kubernetes-deploy.md) — end-to-end Kubernetes walkthrough.

# alphaswarm-admin

> Internal staff admin at admin.alpha-swarm.ai — managed services, company accounts, audit-first surface. FastAPI + Next.js, Entra-only auth.

# alphaswarm-admin

Internal-only admin dashboard for AlphaSwarm staff. Audit-first: every action lands a `security_audit_events` row before mutating anything; no destructive surface bypasses the ledger. Authenticated via the AlphaSwarm staff Entra tenant. Outbound M2M calls use a per-deployment Entra Agent Identity provisioned by the [`alphaswarm_admin_agent_identity`](../../../../../../alphaswarm_platform/terraform/modules/alphaswarm_admin_agent_identity/) Terraform module.

## Identity

| Field | Value |
| --- | --- |
| Service id | `alphaswarm-admin` |
| Role | `admin` |
| Package | [`alphaswarm_admin/`](../../../../../../alphaswarm_admin/) |
| Image (key) | `admin` |
| Built from | `alphaswarm_admin/Dockerfile` (FastAPI backend, port 8900) + `alphaswarm_admin/frontend/Dockerfile` (Next.js 15 UI). Two ECR repos: `alphaswarm-admin` + `alphaswarm-admin-frontend`. |

## Wire

| Field | Value |
| --- | --- |
| Protocol | HTTP/1.1 + WebSocket |
| Port | `8900` |
| Health | `GET /admin/health` (public; backs the Docker + ECS container health checks) |
| Public URL | `https://admin.alpha-swarm.ai` (Cloudflare tunnel + Pomerium IAP) |
| Identity | AlphaSwarm staff Entra tenant; `actor_kind` is `user` for human staff and `agent` for the per-deployment Agent Identity (RFC 8693 `act` claim) |

## Surfaces

| Prefix | Purpose |
| --- | --- |
| `/admin/*` | FastAPI backend — managed-services CRUD, company accounts, audit log, billing |
| `/admin/platform/ecs/*` | Platform deployment control — boto3 → AWS ECS + CloudWatch for the platform's OWN Fargate services (rollout status, redeploy, scale, logs, metrics, alarms). Distinct from `/admin/deployments` (customer workloads, brokered). Redeploy + scale are audit-first + step-up-MFA gated. |
| `/api/auth/entra/*` | Next.js BFF proxy to `alphaswarm-cp` `/auth/*` |
| `/dashboard`, `/platform`, `/managed-services`, `/companies`, `/audit-log`, `/billing` | Next.js frontend pages |

## Deployment surfaces

| Surface | Where |
| --- | --- |
| Compose | service `alphaswarm-admin` in [`deployments/compose/docker-compose.admin.yml`](../../../../../../alphaswarm_platform/deployments/compose/docker-compose.admin.yml) |
| Kustomize | rolled into the per-cell base — namespace `alphaswarm-admin` |
| ECS Fargate | [`infrastructure/modules/ecs-fargate-control-plane`](../../../../../../alphaswarm_platform/infrastructure/modules/ecs-fargate-control-plane/), wired in [`infrastructure/envs/minimum`](../../../../../../alphaswarm_platform/infrastructure/envs/minimum/). Container health check on `/admin/health`; the `admin` task carries the self-management policy so `/admin/platform/ecs/*` can drive the cluster. |
| AQP CR | [`AQPAdmin`](../../../../../../alphaswarm_controller/src/alphaswarm_controller/operator/crds/aqpadmin_cr.py) |
| Terraform module | [`alphaswarm_admin_agent_identity`](../../../../../../alphaswarm_platform/terraform/modules/alphaswarm_admin_agent_identity/) (Entra Agent Identity provisioning) |

## Dependencies

**Upstream:**

- `alphaswarm-cp` (`/auth/*`, `/manage/*`).
- `alphaswarm-core` (`/api/*` for read-only platform queries).
- `postgres` for the admin's own ledger tables.
- Stripe (optional) for billing integration.

**Downstream:**

- AlphaSwarm staff admins only — public ingress is wrapped by Pomerium with the `alphaswarm-staff` Entra group as the sole authenticated population.

## Operations

- **Audit-first:** every mutating endpoint writes a `security_audit_events` row BEFORE acting; rollbacks compensate the row.
- **No customer data exposure:** the admin reads aggregate signals only — never raw operator strategy code or RL weights.
- **Step-up MFA:** required for company-account suspensions, billing refunds, kill-switch fan-out.
- **Boundary:** `alphaswarm_admin` MUST NOT import `alphaswarm.*` — it is HTTP-only against `alphaswarm-cp` and `alphaswarm-core`. The guard is enforced by [`alphaswarm_admin/AGENTS.md`](../../../../../../alphaswarm_admin/AGENTS.md).

## See also

- [`alphaswarm_admin/AGENTS.md`](../../../../../../alphaswarm_admin/AGENTS.md) — boundary rules.
- [`identity.md`](../../identity/identity.md) — Entra integration.
- [`alphaswarm_admin_agent_identity` module](../../../../../../alphaswarm_platform/terraform/modules/alphaswarm_admin_agent_identity/) — Agent Identity provisioning.

# alphaswarm-beat

> Celery beat scheduler — periodic task dispatcher (factor refresh, predictor retraining, ledger compaction, status-page heartbeats).

# alphaswarm-beat

Celery beat process responsible for time-based task dispatch. It writes to the same Redis broker the worker drains; nothing else writes schedule-driven payloads.

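
Time-based dispatch of this kind is usually declared in Celery as a `beat_schedule` mapping from entry name to task and interval. The following is a minimal sketch under stated assumptions: the task names and intervals are hypothetical and are not copied from the project's `celery_app.py`. It also counts emissions per day to show why the number of beat processes matters, since each process emits the whole schedule independently.

```python
from datetime import timedelta

# Hypothetical entries: task names and intervals are illustrative only.
BEAT_SCHEDULE = {
    "factor-staleness-probe": {
        "task": "tasks.factor_staleness_probe",
        "schedule": timedelta(minutes=1),
    },
    "ledger-compaction": {
        "task": "tasks.ledger_compaction",
        "schedule": timedelta(hours=1),
    },
}


def emissions_per_day(schedule: dict, beat_processes: int = 1) -> dict:
    """Count how many times each entry fires per day.

    Every running beat process emits the full schedule on its own,
    so the count scales with the number of beat processes.
    """
    day = timedelta(days=1)
    return {
        name: int(day / entry["schedule"]) * beat_processes
        for name, entry in schedule.items()
    }


if __name__ == "__main__":
    print(emissions_per_day(BEAT_SCHEDULE))
    print(emissions_per_day(BEAT_SCHEDULE, beat_processes=2))
```

In a Celery application a mapping of this shape would be assigned to `app.conf.beat_schedule`; the sketch avoids importing Celery so it runs on its own.
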
## Identity

| Field | Value |
| --- | --- |
| Service id | `alphaswarm-beat` |
| Role | `scheduler` |
| Package | [`alphaswarm/`](../../../../../../alphaswarm/) (schedule under `alphaswarm/tasks/celery_app.py`) |
| Image (key) | `beat` |
| Built from | [`alphaswarm_platform/Dockerfile`](../../../../../../alphaswarm_platform/Dockerfile) (image key `beat` → target `worker`; beat shares the slim orchestration image) |

## Wire

| Field | Value |
| --- | --- |
| Protocol | none |
| Health | Celery broker connection probe |
| Replicas | exactly **1** (singleton) — `replicas: 1`, `strategy: Recreate` |

Running more than one beat replica leads to duplicate task emissions; the `Recreate` strategy guarantees the old pod is down before the new one starts.

## Deployment surfaces

| Surface | Where |
| --- | --- |
| Compose | beat is folded into the `worker` container in compose (single-replica entrypoint switch) |
| Kustomize | [`deployments/kubernetes/base/alphaswarm-worker/beat-deployment.yaml`](../../../../../../alphaswarm_platform/deployments/kubernetes/base/alphaswarm-worker/) |
| AQP CR | folded into [`AQPMonolith`](../../../../../../alphaswarm_controller/src/alphaswarm_controller/operator/crds/aqpmonolith_cr.py) (`spec.beat.enabled`) |

## Schedule highlights

- **Every minute:** factor staleness probe, kill-switch heartbeat, status-page sync.
- **Every 5 minutes:** predictor refresh (when models are flagged `online: true`), Iceberg orphan scan.
- **Hourly:** ledger compaction, audit-event aggregation, alphaswarm-index curator nudge (for diff detection).
- **Daily:** OPA bundle refresh, terraform plan-drift check.

The full schedule lives in [`alphaswarm/tasks/celery_app.py`](../../../../../../alphaswarm/alphaswarm/tasks/celery_app.py).

## Operations

- **Single-instance:** `replicas: 1` is enforced by the kustomize base; the AQPMonolith CR refuses to render a beat block with `replicas != 1`.
- **Persistence:** beat schedule lives at `/tmp/celerybeat-schedule` inside the pod (ephemeral); the schedule itself is code-defined so loss is recoverable.
- **Audit:** beat-emitted tasks tag their `WorkloadRun` rows with `started_by_user_id = "system:beat"` so audit queries can split human-driven from scheduled work.

## See also

- [`alphaswarm-worker.md`](alphaswarm-worker.md) — what consumes beat's output.
- [`tasks-api`](../../../../../../alphaswarm/.cursor/rules/tasks-api.mdc) — task progress contract.

# alphaswarm-client

> Local power-user client — Vite SPA + Solara legacy + FastAPI gateway in a single pod, behind the per-cell ingress.

# alphaswarm-client

The frontend for local power users — operators running AlphaSwarm on a laptop, in a tower cluster, or inside a self-hosted cell. It bundles a React 19 + Vite SPA, the legacy Solara research UI, and a thin FastAPI gateway that proxies to `alphaswarm-core` and `alphaswarm-cp`.

This is **not** the cloud customer dashboard — that is [`alphaswarm-ui`](alphaswarm-ui.md), which targets `app.alpha-swarm.ai`.

## Identity

| Field | Value |
| --- | --- |
| Service id | `alphaswarm-client` |
| Role | `frontend` |
| Package | [`alphaswarm_client/`](../../../../../../alphaswarm_client/) |
| Image (key) | `frontend` |
| Built from | [`alphaswarm_client/Dockerfile`](../../../../../../alphaswarm_client/Dockerfile) (3-stage: ui-builder → solara-builder → production) and [`Dockerfile.tf`](../../../../../../alphaswarm_client/Dockerfile.tf) (Terraform-built variant) |

## Wire

| Field | Value |
| --- | --- |
| Protocol | HTTP/1.1 + WebSocket |
| Port | `80` (container) → `3000` (host, local dev) |
| Health | `GET /` |
| Public URL | per-cell ingress (e.g. `https://aqp..alpha-swarm.ai`); local dev `http://localhost:3000` |

## Deployment surfaces

| Surface | Where |
| --- | --- |
| Compose | service `client` in [`alphaswarm_platform/compose/docker-compose.yml`](../../../../../../alphaswarm_platform/compose/docker-compose.yml); `alphaswarm-client` in [`deployments/compose/docker-compose.local.yml`](../../../../../../alphaswarm_platform/deployments/compose/docker-compose.local.yml) |
| Kustomize | [`deployments/kubernetes/base/alphaswarm-client/`](../../../../../../alphaswarm_platform/deployments/kubernetes/base/alphaswarm-client/) — Deployment + Service + HPA + PDB |
| AQP CR | [`AQPClient`](../../../../../../alphaswarm_controller/src/alphaswarm_controller/operator/crds/aqpclient_cr.py) |

## Dependencies

**Upstream (HTTP):**

- `alphaswarm-core` (`/api/*`, `/ws/*`) — every business call.
- `alphaswarm-cp` (`/manage/*`, `/auth/*`) — workload lifecycle and identity.

**Downstream:**

- Browser tabs on operator workstations.

## Frontend conventions

- Vite + React 19 + TanStack Query + zustand for state.
- WebSocket pipeline is throttled (`50ms` coalescing) per the [`frontend`](../../../../../../alphaswarm/.cursor/rules/frontend.mdc) rule.
- Solara legacy routes mounted at `/legacy/*`; sunset window per [`alphaswarm_client/AGENTS.md`](../../../../../../alphaswarm_client/AGENTS.md).

## Operations

- **Scaling:** HPA `cpu=70%`, `min=2 / max=8` in prod.
- **Bundle size budget:** the Vite build fails CI when the gzipped bundle exceeds 1.5 MiB.
- **CSP:** strict — only `manage.alpha-swarm.ai`, `app.alpha-swarm.ai`, and the per-cell `*.alpha-swarm.ai` hostnames are allowlisted.

## See also

- [`alphaswarm_client/AGENTS.md`](../../../../../../alphaswarm_client/AGENTS.md) — boundary rules.
- [`alphaswarm-ui.md`](alphaswarm-ui.md) — the cloud-hosted sibling.
- [`alphaswarm-ide.md`](alphaswarm-ide.md) — Theia IDE for code-first workflows.

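
The bundle size budget listed under Operations for `alphaswarm-client` can be pictured as a small check over the build output. This is a sketch only: the real CI check is not shown on this page, so the helper names and the definition of "bundle" used here (every `.js` and `.css` file, each gzipped separately and summed) are assumptions made for illustration.

```python
import gzip
from pathlib import Path
from typing import Iterable

BUDGET_BYTES = int(1.5 * 1024 * 1024)  # 1.5 MiB, the budget stated in the text


def gzipped_total(blobs: Iterable[bytes]) -> int:
    """Sum the gzip-compressed size of each asset, in bytes."""
    return sum(len(gzip.compress(blob)) for blob in blobs)


def within_budget(blobs: Iterable[bytes], budget: int = BUDGET_BYTES) -> bool:
    """Return True when the combined gzipped size fits the budget."""
    return gzipped_total(blobs) <= budget


def check_dist(dist_dir: Path) -> bool:
    """Assumption: the bundle is every .js and .css file under dist_dir."""
    assets = [
        p for p in dist_dir.rglob("*")
        if p.is_file() and p.suffix in {".js", ".css"}
    ]
    return within_budget(p.read_bytes() for p in assets)
```

A CI step could call `check_dist(Path("dist"))` after the build and exit non-zero when it returns `False`.
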
# alphaswarm-core

> FastAPI gateway for the AlphaSwarm runtime: business routes, agentic surface, MCP servers, WebSocket streaming, scope + tenancy enforcement.

# alphaswarm-core

The FastAPI gateway for the AlphaSwarm runtime. Every business route (strategies, bots, backtests, RL experiments, analysis runs, agents, ingestion, ml-mcp, terraform, tenancy, paper trading, kill switch) is mounted on this pod. The control plane (`alphaswarm-cp`) is a sibling service, not a parent — `/manage/*` lives there.

## Identity

| Field | Value |
| --- | --- |
| Service id | `alphaswarm-core` |
| Role | `api` |
| Package | [`alphaswarm/`](../../../../../../alphaswarm/) |
| Image (key) | `api` |
| Built from | [`alphaswarm_platform/Dockerfile`](../../../../../../alphaswarm_platform/Dockerfile) (target `api`, multi-arch amd64+arm64, Chainguard Wolfi base, `uv` install) |

## Wire

| Field | Value |
| --- | --- |
| Protocol | HTTP/1.1 + HTTP/2 + WebSocket |
| Port | `8000` |
| Health | `GET /readyz` (ready) / `GET /healthz` (live) |
| Public URL | — (private; reached through the per-cell ingress / `app.alpha-swarm.ai` BFF for SPA traffic) |
| OIDC issuer for tokens it accepts | `MsalEntraValidator` (Entra primary) → Auth0 fallback per [`identity.md`](../../identity/identity.md) |

## Deployment surfaces

| Surface | Where |
| --- | --- |
| Compose (local dev) | service `api` in [`alphaswarm_platform/compose/docker-compose.yml`](../../../../../../alphaswarm_platform/compose/docker-compose.yml); also `alphaswarm-core` in [`deployments/compose/docker-compose.local.yml`](../../../../../../alphaswarm_platform/deployments/compose/docker-compose.local.yml) |
| Kustomize | [`deployments/kubernetes/base/alphaswarm-core/`](../../../../../../alphaswarm_platform/deployments/kubernetes/base/alphaswarm-core/) — Deployment + Service + HPA + PDB |
| AQP CR | [`AQPMonolith`](../../../../../../alphaswarm_controller/src/alphaswarm_controller/operator/crds/aqpmonolith_cr.py) — render path emits Deployment + Service + ConfigMap + (optional) Ingress |
| Terraform template | reachable through every `aws-*-cell` / `gcp-full-cell` / `azure-full-cell` template (see [`services.md`](../services.md#deployment-patterns)) |

## Dependencies

**Upstream services this pod calls:**

- `postgres` (5432) — primary OLTP + Alembic migrations.
- `redis` (6379) — session, semantic cache, kill-switch key, Celery broker.
- `neo4j` (7687) — ownership graph + lineage DAG.
- `chromadb` (8001) and `milvus` — vector search (when feature flag on).
- `mlflow` (5000) — model registry.
- `otel-collector` (4317) — OTLP traces + metrics + logs.
- `polaris` / Iceberg REST + `minio` — lakehouse reads/writes (via DataMCP).
- `alphaswarm-cp` (`/manage/*`) — workload lifecycle calls (control plane).

**Downstream callers (HTTP-only):**

- `alphaswarm-client` — Vite SPA + FastAPI gateway.
- `alphaswarm-ui` — Next.js dashboard (BFF routes proxy to here).
- `alphaswarm-admin` — internal admin (audit-first surface).
- `alphaswarm-ide` — Theia IDE (MCP-driven research copilot).
- `alphaswarm-cli` — operator CLI.
- `alphaswarm-worker` — Celery worker (calls back for progress / lookups).
- Bot pods (per-cell `QuantBot` CRs).

## Key routes

The route tree is the union of `alphaswarm/api/routes/*.py`. Key prefixes:

| Prefix | Concept doc |
| --- | --- |
| `/strategies/*`, `/bots/*`, `/backtests/*` | [strategy-framework.md](../../strategy/analysis-framework.md) |
| `/agents/*`, `/workflows/*`, `/labs/*` | [agents.md](../../agentic/agents.md) |
| `/rl/*` | [rl-framework.md](../../rl/rl-framework.md) |
| `/data/*`, `/ingest/*`, `/lineage/*` | [data-plane.md](../../data/data-plane.md) |
| `/ml/*`, `/predictors/*` | [ml-framework.md](../../strategy/ml-framework.md) |
| `/terraform/*` | [terraform-control-plane.md](../terraform-control-plane.md) |
| `/tenancy/*`, `/membership/*` | [identity.md](../../identity/identity.md) |
| `/halt`, `/kill-switch` | [observability.md](../../trading/observability.md) |
| `/mcp/*` (multiple servers) | [data-mcp.md](../../data/data-mcp.md) |
| `/ws/*` | WebSocket progress streams |

## Operations

- **Scaling:** HPA target `cpu=70%`, `min=3 / max=12` in prod; `min=1 / max=3` in dev.
- **Disruption:** PDB `minAvailable=2` in prod; `0` in dev.
- **Step-up MFA:** destructive routes (`/manage/terraform/apply`, `/manage/credentials/cloud-cli/*`, `/halt`) require RFC 9470 `acr=high`. See [`auth-stepup-and-byok`](../../../../../../alphaswarm/.cursor/rules/auth-stepup-and-byok.mdc).
- **Audit:** every state-mutating action lands a `workload_runs` row through `WorkloadRuntime`; every Terraform action lands a `terraform_runs` row through `TerraformRuntime`.
- **Redaction:** `WorkloadRuntime` strips secrets from audit payloads per the always-on [`alphaswarm-management-engine`](../../../../../../alphaswarm/.cursor/rules/alphaswarm-management-engine.mdc) rule. Token prefixes (4 chars max) are only printed behind an explicit `--unsafe-print-token-prefixes` operator flag.

## See also

- [`control-plane-topology.md`](../control-plane-topology.md) — how callers find this pod's URL.
- [`alphaswarm/AGENTS.md`](../../../../../../alphaswarm/AGENTS.md) — runtime hard rules (router_complete only path for LLM calls, DataMCP only path for agent reads, etc.).
- [`alphaswarm-cp.md`](alphaswarm-cp.md) — sibling control plane.

# alphaswarm-cp

> Standalone control plane — workload lifecycle (`/manage/*`), unified identity broker (`/auth/*`), connection manager, kopf operator host, Phase 5 connection-proxy mesh.

# alphaswarm-cp

The standalone control plane. Owns every workload-lifecycle action, the unified identity broker, the connection-manager, and the Phase 5 connection-proxy mesh. Does NOT import `alphaswarm.*` runtime code — the boundary is enforced by [`alphaswarm-control-plane.mdc`](../../../../../../alphaswarm/.cursor/rules/alphaswarm-control-plane.mdc).

## Identity

| Field | Value |
| --- | --- |
| Service id | `alphaswarm-cp` |
| Role | `control-plane` |
| Package | [`alphaswarm_controller/`](../../../../../../alphaswarm_controller/) |
| Image (key) | `cp` |
| Built from | [`alphaswarm_controller/Dockerfile`](../../../../../../alphaswarm_controller/) (multi-stage Wolfi + uv) |

## Wire

| Field | Value |
| --- | --- |
| Protocol | HTTP/1.1 + HTTP/2 + WebSocket |
| Port | `9000` |
| Health | `GET /manage/readyz` (ready) / `GET /manage/healthz` (live) |
| Public URL | `https://manage.alpha-swarm.ai` (behind Cloudflare tunnel + Pomerium IAP) |
| Identity for incoming | per-route: `/manage/*` requires `admin:cluster`; `/auth/*` is unauthenticated up to `/callback`; `/proxy/*` requires the same scopes as the destination |

## Surfaces

| Prefix | Purpose | Code |
| --- | --- | --- |
| `/manage/*` | Workload lifecycle (start/stop/scale/restart/exec/logs/apply_config/rotate_secret), credentials, terraform passthrough, topology, MFA, billing | [`alphaswarm_controller/api/routers/`](../../../../../../alphaswarm_controller/src/alphaswarm_controller/api/routers/) |
| `/auth/m2m/token`, `/auth/agent-identity/token` | Phase 1 identity broker — M2M + Entra Agent Identity tokens | [`api/routers/auth.py`](../../../../../../alphaswarm_controller/src/alphaswarm_controller/api/routers/auth.py) |
| `/auth/.well-known/openid-configuration` | OIDC discovery (canonical location) | same |
| `/auth/login`, `/callback`, `/logout`, `/refresh`, `/me`, `/stepup`, `/device/start`, `/device/poll` | Phase 3 BFF + RFC 8628 device flow | [`api/routers/bff.py`](../../../../../../alphaswarm_controller/src/alphaswarm_controller/api/routers/bff.py) |
| `/manage/connections`, `/manage/connections/{id}` | Phase 2 connection manager — typed `ConnectionDescriptor` for any topology service | [`api/routers/connections.py`](../../../../../../alphaswarm_controller/src/alphaswarm_controller/api/routers/connections.py) + [`services/connections.py`](../../../../../../alphaswarm_controller/src/alphaswarm_controller/services/connections.py) |
| `/proxy/{service_id}/{path}` | Phase 5 connection-proxy mesh (SPIFFE-mediated mTLS in 5b) | [`api/routers/proxy.py`](../../../../../../alphaswarm_controller/src/alphaswarm_controller/api/routers/proxy.py) |

## Embedded operator

When the `operator` extra is installed, the same image hosts the [`aqp-controller-operator`](aqp-controller-operator.md) — a kopf process reconciling the eight AQP* CRDs. Single-replica (`Recreate` strategy) so reconciliation order stays deterministic. The bare `alphaswarm-controller` image keeps booting on memory-constrained nodes that don't run the operator.

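
One common way to let a bare image boot without an optional extra is to probe for the optional dependency before using it. The sketch below illustrates that pattern for the embedded operator; it is not taken from `alphaswarm_controller`, and the function names and capability labels are hypothetical. Only the standard-library `importlib.util.find_spec` call and the `kopf` package name come from outside the sketch.

```python
from importlib.util import find_spec


def operator_available() -> bool:
    """True when the kopf package (the operator extra) is importable."""
    return find_spec("kopf") is not None


def image_capabilities() -> list[str]:
    """List what this image can host; labels are illustrative only."""
    capabilities = ["control-plane"]
    if operator_available():
        # Only advertise the operator when its dependency is present,
        # so the bare image never tries to import kopf at boot.
        capabilities.append("operator")
    return capabilities


if __name__ == "__main__":
    print(image_capabilities())
```

Probing with `find_spec` avoids importing the package at all, which keeps start-up cheap on nodes that never run the operator.
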
## Deployment surfaces

| Surface | Where |
| --- | --- |
| Compose | service `alphaswarm-cp` in [`deployments/compose/docker-compose.admin.yml`](../../../../../../alphaswarm_platform/deployments/compose/docker-compose.admin.yml) (admin overlay) |
| Kustomize | [`deployments/kubernetes/base/alphaswarm-cp/`](../../../../../../alphaswarm_platform/deployments/kubernetes/base/alphaswarm-cp/) — Deployment + Service + PDB |
| AQP operator (Phase 4) | [`deployments/kubernetes/aqp-controller-operator/`](../../../../../../alphaswarm_platform/deployments/kubernetes/aqp-controller-operator/) — kopf reconciler kustomize tree |
| Terraform module | [`alphaswarm_platform/terraform/modules/alphaswarm_workloads/`](../../../../../../alphaswarm_platform/terraform/modules/alphaswarm_workloads/) (workload), [`terraform_runner/`](../../../../../../alphaswarm_platform/terraform/modules/terraform_runner/) (paired pod) |

## Dependencies

**Upstream:**

- `postgres` — `workload_runs`, `terraform_runs`, `EntraTenantLink`, session store (Phase 5+).
- `redis` — kill-switch key, BFF session store, M2M token cache.
- The cluster API (kubernetes / docker / aws / azure / gcp) through per-provider adapters under [`alphaswarm_controller/providers/`](../../../../../../alphaswarm_controller/src/alphaswarm_controller/providers/).

**Downstream:**

- `alphaswarm-core` calls into `/manage/*` for cluster-internal lookups.
- `alphaswarm-client`, `alphaswarm-ui`, `alphaswarm-admin`, `alphaswarm-cli` use `/auth/*` once their `AUTH_BFF_ENABLED` flag is on.
- `alphaswarm-cli launch` hits the operator route to render AQP* CRs.

## Operations

- **HA:** `replicas: 2` in prod; 1 in dev. PDB `minAvailable=1`.
- **Single operator:** the kopf process is single-replica regardless of cp replicas — operator pods run as a separate Deployment.
- **Step-up MFA:** every `/manage/terraform/apply`, `/manage/credentials/cloud-cli/*`, and `/halt` route requires RFC 9470 `acr=high`.
- **Audit:** every `/manage/*` action lands a `workload_runs` row; every `/auth/*` token mint lands a `security_audit_events` row. Redaction is enforced by [`alphaswarm-management-engine`](../../../../../../alphaswarm/.cursor/rules/alphaswarm-management-engine.mdc).
- **Pomerium IAP:** the public ingress wraps `/manage/*` with Pomerium so the Entra-staff group is the only authenticated path.

## See also

- [`control-plane-topology.md`](../control-plane-topology.md) — topology and URL fallback contract; cp is the sole topology server.
- [`terraform-control-plane.md`](../terraform-control-plane.md) — `TerraformRuntime` runs inside cp.
- [`identity.md`](../../identity/identity.md) — IdentityProvider chain.
- [`alphaswarm_controller/AGENTS.md`](../../../../../../alphaswarm_controller/AGENTS.md) — hard rules for the standalone control plane.

# alphaswarm-docs-archive

> Sunset Stripe-style API epoch archive at archive.alpha-swarm.ai. Cloudflare Pages, immutable per epoch.

# alphaswarm-docs-archive

Sunset documentation archive. Stripe-style: every public-API epoch freezes a snapshot of `alphaswarm-docs` and surfaces it under `archive.alpha-swarm.ai//` so customers running pinned API versions still have a working manual.

## Identity

| Field | Value |
| --- | --- |
| Service id | `alphaswarm-docs-archive` |
| Role | `docs-archive` |
| Hosted on | Cloudflare Pages |
| Public URL | `https://archive.alpha-swarm.ai` |

## Layout

- `/v1/...` — first public epoch (frozen)
- `/v2/...` — current epoch (mirrors `docs.alpha-swarm.ai`)
- `/v/...` — every previous epoch retained for the deprecation window declared in the release notes.

Each epoch directory is a frozen build of `alphaswarm_docs/` at the tag matching the epoch.

## Deployment surface

| Surface | Where |
| --- | --- |
| Terraform module | [`alphaswarm_platform/terraform/modules/cloudflare_pages_docs/`](../../../../../../alphaswarm_platform/terraform/modules/cloudflare_pages_docs/) — same module as the live docs, distinct Pages project |
| Spec | reuses the `docs-edge` stack pattern at [`alphaswarm_platform/configs/terraform/stacks/docs-edge.yaml`](../../../../../../alphaswarm_platform/configs/terraform/stacks/docs-edge.yaml) (separate workspace) |

## Operations

- **Immutability:** archive content is read-only after the epoch freezes. PRs targeting an archive branch are auto-rejected by the `archive-frozen` GitHub Action.
- **Sunset window:** epochs hold for the deprecation window declared in the matching release note (typically 12 months).
- **Discoverability:** the live docs link to `archive.alpha-swarm.ai` whenever an API breaks compatibility.

## See also

- [`alphaswarm-docs.md`](alphaswarm-docs.md) — live docs.
- [Stripe API versioning](https://stripe.com/blog/api-versioning) — the model this archive imitates.

# alphaswarm-docs-status

> Public status page at status.alpha-swarm.ai. Hosted on Instatus SaaS, separate Cloudflare zone, intentionally outside the cluster.

# alphaswarm-docs-status

The public status page. Provisioned on [Instatus](https://instatus.com) SaaS and CNAMEd to `status.alpha-swarm.ai` on a Cloudflare zone distinct from `alpha-swarm.ai`. Survives full cluster + edge outages — operators can post updates from the Instatus dashboard even when the AlphaSwarm cluster is down.

## Identity

| Field | Value |
| --- | --- |
| Service id | `alphaswarm-docs-status` |
| Role | `status-page` |
| Hosted on | Instatus SaaS |
| Public URL | `https://status.alpha-swarm.ai` |

## Deployment surface

| Surface | Where |
| --- | --- |
| Terraform module | [`alphaswarm_platform/terraform/modules/instatus/`](../../../../../../alphaswarm_platform/terraform/modules/instatus/) — provisions the page + components + integrations |

## Components

The status page exposes one component per logical service:

- `core` — `alphaswarm-core`, `alphaswarm-worker`, `alphaswarm-beat`
- `controller` — `alphaswarm-cp` + the AQP operator
- `frontends` — `alphaswarm-ui`, `alphaswarm-client`, `alphaswarm-admin`
- `docs` — `alphaswarm-docs`, `alphaswarm-website`
- `data-plane` — `postgres`, `redis`, `neo4j`, `iceberg`, `kafka`
- `mlops` — `mlflow`, `argo-workflows`, `dagster`
- `observability` — `prometheus`, `loki`, `jaeger`, `phoenix`

## Update flow

- Beat-emitted heartbeats publish health to a per-service Instatus webhook every 60 s.
- Incidents are posted manually by the on-call operator from the Instatus dashboard.
- Maintenance windows are scheduled in advance via the `instatus` Terraform module's `scheduled_maintenance` resources.

## See also

- [`how-to/runbooks/`](../../../how-to/runbooks/) — incident response.
- [`alphaswarm-docs.md`](alphaswarm-docs.md) — sibling docs site.
- [`instatus` Terraform module](../../../../../../alphaswarm_platform/terraform/modules/instatus/) — provisioning source.

# alphaswarm-docs

> Public documentation site at docs.alpha-swarm.ai. Docusaurus on Cloudflare Pages with MCP + llms.txt endpoints for agent consumers.

# alphaswarm-docs

The canonical AlphaSwarm documentation site. Docusaurus + Diátaxis structure, deployed to Cloudflare Pages. Survives cluster outages — the docs domain is intentionally provisioned outside the cluster so incident-time runbooks stay reachable.

## Identity

| Field | Value |
| --- | --- |
| Service id | `alphaswarm-docs` |
| Role | `docs` |
| Package | [`alphaswarm_docs/`](../../../../../../alphaswarm_docs/) |
| Hosted on | Cloudflare Pages |
| Public URL | `https://docs.alpha-swarm.ai` |

## Deployment surface

| Surface | Where |
| --- | --- |
| Terraform module | [`alphaswarm_platform/terraform/modules/cloudflare_pages_docs/`](../../../../../../alphaswarm_platform/terraform/modules/cloudflare_pages_docs/) |
| Spec | [`alphaswarm_platform/configs/terraform/stacks/docs-edge.yaml`](../../../../../../alphaswarm_platform/configs/terraform/stacks/docs-edge.yaml) |
| Build | `pnpm build` in `alphaswarm_docs/` — Cloudflare Pages picks up the GitHub branch |

## Agent surface

The site exposes structured endpoints for AI/MCP consumers:

- `/llms.txt` and `/llms-full.txt` — convention-compliant index of the docs corpus.
- `/mcp` — MCP server publishing the docs as searchable tool calls.
- `/openapi.json` — OpenAPI surface for the MCP server.

## Dependencies

**Upstream:** GitHub repo for build trigger; Cloudflare for edge hosting; OPA bundle (downloaded at deploy) for redaction policy on docs links to internal runbooks.

**Downstream:** browsers, AI agents, search crawlers.

## Operations

- **Deploy:** every PR landing on `main` redeploys via Cloudflare Pages CI. Branch previews under `*.alphaswarm-docs.pages.dev`.
- **Custom domain:** `docs.alpha-swarm.ai` mapped via the `cloudflare_pages_docs` Terraform module; certificate via Cloudflare's edge SSL.
- **Out-of-cluster:** intentional, so the docs stay reachable whatever the cluster is doing.

## See also

- [`alphaswarm-website.md`](alphaswarm-website.md) — public marketing sibling at `alpha-swarm.ai`.
- [`alphaswarm-docs-archive.md`](alphaswarm-docs-archive.md) — sunset API epochs at `archive.alpha-swarm.ai`.
- [`alphaswarm-docs-status.md`](alphaswarm-docs-status.md) — incident status page.

# alphaswarm-executor

> Heavy-compute Celery executor for the AlphaSwarm runtime — drains backtest, training, ML, agents, factors, RAG. Carries the full ML/RL/forecasting + Dask/Ray surface.

# alphaswarm-executor

Celery **heavy-compute** executor pod — the compute-heavy counterpart of the orchestration [`alphaswarm-worker`](alphaswarm-worker.md). Introduced by the Phase 4c worker/executor split. It carries the full ML / RL / forecasting / portfolio + distributed-compute (Dask + Ray) dependency surface so backtests, training rollouts, factor builds, and agent-emitted strategy code run here instead of bloating the slim orchestration worker.

See [worker vs executor images](../worker-executor-images.md) for the full rationale and dependency matrix.

## Identity

| Field | Value |
| --- | --- |
| Service id | `alphaswarm-executor` |
| Role | `executor` |
| Package | [`alphaswarm/`](../../../../../../alphaswarm/) (tasks under `alphaswarm/tasks/*.py`) |
| Image (key) | `executor` |
| Built from | [`alphaswarm_platform/Dockerfile`](../../../../../../alphaswarm_platform/Dockerfile) (target `executor`, multi-arch) or the standalone [`build/docker/alphaswarm_executor/Dockerfile`](../../../../../../alphaswarm_platform/build/docker/alphaswarm_executor/Dockerfile) |

## Wire

| Field | Value |
| --- | --- |
| Protocol | none (no HTTP listener) |
| Health | `celery inspect ping` + Prometheus metrics on `:9100`; Ray dashboard on `:8265` when a local Ray head runs |
| Public URL | — |
| Broker | `redis://redis:6379/0` |
| Result backend | `redis://redis:6379/1` |

## Deployment surfaces

| Surface | Where |
| --- | --- |
| Compose | `alphaswarm-executor` in [`deployments/compose/docker-compose.local.yml`](../../../../../../alphaswarm_platform/deployments/compose/docker-compose.local.yml); `worker-gpu` in legacy [`compose/docker-compose.yml`](../../../../../../alphaswarm_platform/compose/docker-compose.yml) |
| Kustomize | [`deployments/kubernetes/base/alphaswarm-executor/`](../../../../../../alphaswarm_platform/deployments/kubernetes/base/alphaswarm-executor/) — Deployment + HPA + PDB |
| Image catalogue | `executor` entry in [`terraform/modules/alphaswarm_images/`](../../../../../../alphaswarm_platform/terraform/modules/alphaswarm_images/) |
| ECR repo | `alphaswarm-executor` in [`infrastructure/modules/ecr-repositories/`](../../../../../../alphaswarm_platform/infrastructure/modules/ecr-repositories/) |
| Terraform module | [`terraform/modules/faas/`](../../../../../../alphaswarm_platform/terraform/modules/faas/) — heavy-queue Deployments pull this image |
| Topology | `alphaswarm-executor` in [`configs/deployment/topology.yaml`](../../../../../../alphaswarm_platform/configs/deployment/topology.yaml) |

## Queue families

The executor drains the **heavy compute** queues. KEDA scales each queue family independently.

| Queue | Drives | Scale-to-zero | Notes |
| --- | --- | --- | --- |
| `backtest` | backtest dispatch (vbt-pro / event-driven / Lean) | yes | `max=20` |
| `training` | RL rollouts, finetune jobs | yes | dedicated GPU node group |
| `ml` | ML pipelines, predictor refresh | yes | |
| `agents` | CrewAI runs, LangGraph orchestration | yes | `max=12` |
| `factors` | factor zoo builds, alpha tests | yes | |
| `rag` | RAG ingest, embedding refresh | yes | |

## Dependencies

**Upstream:**

- `redis` — broker + result backend.
- `postgres` — task lookups, ledger writes.
- `alphaswarm-core` — progress emit callbacks, lookup APIs.
- `mlflow` — experiment tracking + model registry for training / ML runs.
- All data-plane services the `alphaswarm-core` pod depends on.

**Downstream:**

- Beat schedules heavy periodic jobs (factor refresh, predictor retraining); the executor is the consumer.
- May start a local Ray head / Dask cluster for distributed backtests.

## Operations

- **Resources:** requests `1 CPU / 4Gi`, limits `8 CPU / 16Gi`. Prefers memory-optimized nodes via node affinity; anti-affinity keeps it off the `alphaswarm-core` nodes.
- **Scaling:** HPA on CPU + custom Celery queue depth (KEDA `ScaledObject`s supersede it where KEDA is installed). Scales **down** slowly (900s stabilization) so a long-running backtest / train job is not evicted mid-flight.
- **Concurrency:** 2 per pod (compute-bound; each task is heavy).
- **Drain on shutdown:** `terminationGracePeriodSeconds: 600` so in-flight jobs complete; `preStop` sends `SIGTERM` to Celery.
- **Audit:** `WorkloadRuntime` actions land `workload_runs` rows; the executor pod respects the kill-switch Redis key like every other pod.

## See also

- [`alphaswarm-worker.md`](alphaswarm-worker.md) — orchestration sibling (light queues).
- [`worker-executor-images.md`](../worker-executor-images.md) — image split rationale + dependency matrix.
- [`faas` Terraform module](../../../../../../alphaswarm_platform/terraform/modules/faas/) — KEDA scaling source of truth.
- [`build/docker/alphaswarm_executor/`](../../../../../../alphaswarm_platform/build/docker/alphaswarm_executor/) — standalone image (migration-ready).

# alphaswarm-ide

> White-labeled Theia 1.72 + six AlphaSwarm compile-time extensions + MCP-driven research copilot + Perspective Arrow notebook renderer.

# alphaswarm-ide

Browser-tier IDE for AlphaSwarm. White-labeled Theia 1.72 with six compile-time extensions (`alphaswarm`, `alphaswarm-shell`, `alphaswarm-mcp-bridge`, `alphaswarm-research-copilot`, `alphaswarm-notebook-quant`, `alphaswarm-quant`), an MCP-driven research copilot, and a Perspective + Arrow notebook renderer.

The canonical operator entrypoint is `alphaswarm-cli ide` — see [`alphaswarm-ide.md`](../alphaswarm-ide.md) for the full IDE concept doc.

## Identity | Field | Value | | --- | --- | | Service id | `alphaswarm-ide` | | Role | `ide` | | Package | [`alphaswarm_ide/`](../../../../../../alphaswarm_ide/) | | Image (key) | `ide` | | Built from | [`alphaswarm_ide/Dockerfile`](../../../../../../alphaswarm_ide/Dockerfile) (node:24-bookworm; extension-build env) | ## Wire | Field | Value | | --- | --- | | Protocol | HTTP/1.1 + WebSocket (Theia front channel) | | Port | `3000` (browser-tier) | | Health | `GET /` | | Public URL | per-user (operator's own laptop or per-cell ingress) | ## Deployment surfaces | Surface | Where | | --- | --- | | Local | `alphaswarm-cli ide start` — runs the IDE as a docker container against the local cluster | | Kustomize | [`deployments/kubernetes/alphaswarm-ide/`](../../../../../../alphaswarm_platform/deployments/kubernetes/alphaswarm-ide/) — Deployment + Service + Ingress + NetworkPolicy | | AQP CR | [`AQPIDE`](../../../../../../alphaswarm_controller/src/alphaswarm_controller/operator/crds/aqpide_cr.py) — for per-user pod lifecycle | ## Dependencies **Upstream:** - `alphaswarm-core` `/mcp/*` — every research copilot LLM call goes through `router_complete` (rule 2) on the API pod. - `alphaswarm-cp` `/auth/*` — OIDC-bound IDE sessions. - `postgres`, `redis`, `iceberg/polaris` — read paths for the notebook renderer. **Downstream:** - Operator browsers (one IDE pod per active operator session). ## Boundaries - AlphaSwarm code MUST live inside `theia-extensions/alphaswarm*/`. - Theia extension code MUST NOT import `alphaswarm` source — cross-process via MCP only. - Copilot LLM calls go through `router_complete` (AGENTS rule 2). - MCP registrations carry per-MCP `aud` claims (rule 49). ## Operations - **Per-user pods:** the operator pattern is one Deployment per active session. Idle sessions scale to zero via KEDA after 30 min. - **NetworkPolicy:** the IDE pod only reaches `alphaswarm-core`, `alphaswarm-cp`, and the data plane through the `alphaswarm-data-mcp` sidecar. 
- **Bundle sourcing:** the AlphaSwarm extensions are built into the image at compile time; no runtime extension marketplace fetch. ## See also - [`alphaswarm-ide.md`](../alphaswarm-ide.md) — full IDE concept doc. - [`alphaswarm-ide-roadmap.md`](../alphaswarm-ide-roadmap.md) — phase plan. - [`alphaswarm_ide/AGENTS.md`](../../../../../../alphaswarm_ide/AGENTS.md) — boundary rules. # alphaswarm-ml-mcp > Dedicated MCP server for the data.ml.* tool slice — Predictor Hub, ML pipelines, AlphaBacktestExperiment. Piggybacked on the alphaswarm-core pod. # alphaswarm-ml-mcp Dedicated MCP server publishing the `data.ml.*` tool slice — Predictor Hub lookups, AlphaBacktestExperiment dispatch, walk-forward run inspection, finetune trainer status, model serving (vLLM / Ollama / KServe). Piggybacked on the `alphaswarm-core` pod (same FastAPI app, distinct route prefix and `aud` claim). This is the MLOps slice's RFC 9728 / RFC 8707 conformant endpoint — see [`mcp-rfc-conformance`](../../../../../../alphaswarm/.cursor/rules/mcp-rfc-conformance.mdc). 
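Because this MCP shares a FastAPI app with `alphaswarm-core`, the `aud` claim is what separates a token minted for the ML slice from one minted for a sibling MCP. The following is a minimal sketch of such a check, assuming the token has already been signature-verified and decoded into a dict. The `ML_MCP_AUDIENCE` value and the function name are hypothetical and are not taken from the AlphaSwarm codebase.

```python
from typing import Any, Mapping

# Hypothetical audience for the ML MCP; the real value comes from the deployment.
ML_MCP_AUDIENCE = "https://api.example.com/mcp/ml"


def audience_matches(claims: Mapping[str, Any], expected: str = ML_MCP_AUDIENCE) -> bool:
    """Return True when the decoded JWT claims name `expected` as an audience.

    RFC 7519 allows `aud` to be a single string or a list of strings, so both
    shapes are normalised before comparing.
    """
    aud = claims.get("aud")
    if isinstance(aud, str):
        audiences = [aud]
    elif isinstance(aud, (list, tuple)):
        audiences = [a for a in aud if isinstance(a, str)]
    else:
        audiences = []  # a missing or malformed claim never matches
    return expected in audiences
```

A token whose `aud` names a different MCP slice fails this check even though it reaches the same pod and port, which is the point of a dedicated per-MCP audience.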
## Identity | Field | Value | | --- | --- | | Service id | `alphaswarm-ml-mcp` | | Role | `mcp` | | Package | [`alphaswarm_models/`](../../../../../../alphaswarm_models/) (tools); served from [`alphaswarm/ml_mcp/`](../../../../../../alphaswarm/alphaswarm/ml_mcp/) | | Image (key) | (piggybacked on `api`) | | Built from | [`alphaswarm_platform/Dockerfile`](../../../../../../alphaswarm_platform/Dockerfile) (target `api`) | ## Wire | Field | Value | | --- | --- | | Protocol | HTTP/1.1 + WebSocket (MCP) | | Port | `8000` (shared with `alphaswarm-core`) | | Health | `GET /mcp/ml/tools` (lists tool registrations) | | Discovery | `GET /.well-known/oauth-protected-resource/mcp/ml` (RFC 9728 metadata) | | Audience claim | dedicated per-MCP `aud` per AGENTS rule 49 | ## Tool registrations | Tool prefix | Concept doc | | --- | --- | | `data.ml.predictors.*` | [ml-framework.md](../../strategy/ml-framework.md) | | `data.ml.skills.*` | [mlops-service.md](../../strategy/mlops-service.md) | | `data.ml.serving.*` | [ml-framework.md](../../strategy/ml-framework.md) | | `data.ml.experiments.*` | [analysis-framework.md](../../strategy/analysis-framework.md) | | `data.ml.finetune.*` | [ml-framework.md](../../strategy/ml-framework.md) | ## Deployment surfaces | Surface | Where | | --- | --- | | Compose | folded into `api` | | Kustomize | folded into [`base/alphaswarm-core/`](../../../../../../alphaswarm_platform/deployments/kubernetes/base/alphaswarm-core/) | | AQP CR | folded into `AQPMonolith` (`spec.mlMcp.enabled`) | ## Dependencies **Upstream:** - `mlflow` (5000) — experiment + model registry. - `postgres` — Predictor Hub catalog. - `polaris` / `minio` — feature store reads. - `bentoml` / `kserve` (when serving backend = remote) — model invocations. **Downstream:** - Agentic plane (`alphaswarm/agents/`) — ML calls go through DataMCP, never direct ORM imports. - `alphaswarm-ide` research copilot. 
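The discovery path in the Wire table follows the RFC 9728 convention of inserting `/.well-known/oauth-protected-resource` between the host and the resource path. The sketch below derives that URL and builds an illustrative metadata document. The host `api.example.com` and the issuer URL are placeholders, the helper names are hypothetical, and only field names defined by RFC 9728 are used.

```python
from urllib.parse import urlsplit, urlunsplit

WELL_KNOWN = "/.well-known/oauth-protected-resource"


def metadata_url(resource: str) -> str:
    """Derive the RFC 9728 metadata URL for a protected resource identifier."""
    parts = urlsplit(resource)
    # A bare "/" after the host contributes nothing; a real path is appended as-is.
    path = "" if parts.path == "/" else parts.path
    return urlunsplit((parts.scheme, parts.netloc, WELL_KNOWN + path, parts.query, ""))


def example_metadata(resource: str, issuer: str) -> dict:
    """Build a small protected-resource metadata document (illustrative values)."""
    return {
        "resource": resource,                    # required by RFC 9728
        "authorization_servers": [issuer],       # where clients obtain tokens
        "bearer_methods_supported": ["header"],  # token sent in Authorization header
    }


if __name__ == "__main__":
    resource = "https://api.example.com/mcp/ml"  # placeholder resource identifier
    print(metadata_url(resource))
    print(example_metadata(resource, "https://login.example.com"))
```

A client that discovers the resource this way can pass the same `resource` value as the RFC 8707 resource indicator when it requests a token, so the issued `aud` lines up with the MCP it intends to call.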
## Operations - **`router_complete` only:** any LLM call from inside the MCP registrations goes through `alphaswarm/llm/providers/router.py` (rule 2). - **OOD guard + circuit breaker:** the MLSkillRuntime applies `rules/ood_guard.py` and the circuit breaker before model calls. - **Audit:** every tool invocation lands an `agent_runs_v2` row. ## See also - [`mlops-service.md`](../../strategy/mlops-service.md) — MLOps service contract. - [`data-mcp.md`](../../data/data-mcp.md) — DataMCPTool boundary. - [`mcp-rfc-conformance`](../../../../../../alphaswarm/.cursor/rules/mcp-rfc-conformance.mdc) — RFC 9728 + RFC 8707 conformance. # alphaswarm-ui > Cloud-hosted, multi-tenant operator dashboard at app.alpha-swarm.ai. Next.js 14+ App Router; Entra-only after the launcher refactor. # alphaswarm-ui The cloud-hosted, customer-facing operator dashboard. Auth-gated and multi-tenant; Auth0 (B2C) was the historic provider but the post-launcher-refactor surface is **Entra-only** — Auth0 has been purged from the SPA bundle. The public marketing site is a sibling, separate repo — [`alphaswarm-website`](alphaswarm-website.md) at `alpha-swarm.ai`. 
## Identity | Field | Value | | --- | --- | | Service id | `alphaswarm-ui` | | Role | `frontend` | | Package | [`alphaswarm_ui/`](../../../../../../alphaswarm_ui/) | | Image (key) | `ui` | | Built from | (not Dockerfile-based — typically Vercel / Cloudflare Pages SSR; AQPUI CR can also stand it up as a Deployment in a cluster) | ## Wire | Field | Value | | --- | --- | | Protocol | HTTP/1.1 + WebSocket | | Port | `80` (container) / `3000` (Next.js dev) | | Health | `GET /api/healthz` | | Public URL | `https://app.alpha-swarm.ai` | | Identity | Microsoft Entra (B2B SSO via `MsalEntraProvider`); `local` dev-stub gated by `ALPHASWARM_AUTH_DEV_STUB=true` (hard-disabled in production builds) | ## Routes | Route | Purpose | | --- | --- | | `/login`, `/signup`, `/onboarding/*` | Provider-aware auth screens (Entra login + dev-stub) | | `/dashboard`, `/strategies`, `/paper-runs`, `/backtests`, `/data`, `/ml`, `/agents`, `/workflows`, `/labs`, `/analytics`, `/research`, `/portfolio`, `/settings` | Operator dashboard | | `/api/auth/entra/login`, `/callback`, `/logout`, `/stepup` | BFF route handlers — proxy to `alphaswarm-cp` `/auth/*` (Phase 3) | | `/api/*` | Other BFF proxies (tenancy-scoped, kill-switch fan-out) | The marketing routes (`/`, `/pricing`, `/docs`, `/legal`, `/about`, `/blog`, `/changelog`) **moved out** to the [`alphaswarm_website`](../../../../../../alphaswarm_website/) repo as part of the controller-launcher refactor. 
## Deployment surfaces | Surface | Where | | --- | --- | | Hosted (canonical) | Cloudflare Pages or Vercel — pinned `next >=14.2.25` for CVE-2025-29927 | | Cluster (option) | [`AQPUI`](../../../../../../alphaswarm_controller/src/alphaswarm_controller/operator/crds/aqpui_cr.py) CR — Deployment + Service + Ingress | | Identity provisioning | [`alphaswarm_platform/terraform/modules/alphaswarm_ui_identity/`](../../../../../../alphaswarm_platform/terraform/modules/alphaswarm_ui_identity/) | ## Dependencies **Upstream (HTTP-only):** - `alphaswarm-cp` (`/auth/*`, `/manage/*`) — every BFF route delegates here. - `alphaswarm-core` (`/api/*`) — for tenancy-scoped business calls the BFF routes proxy. **Downstream:** - B2C and B2B users; multi-tenant via `EntraTenantLink` rows in the controller's database. ## Operations - **Bundle pinning:** `next >=14.2.25` (CVE-2025-29927). - **CSP:** restricted to `manage.alpha-swarm.ai` and the controller's `*.alpha-swarm.ai` cell ingresses. - **No client-side auth SDK:** the SPA never reads an Entra token — only the BFF route handlers do. - **Dev-stub:** `ALPHASWARM_AUTH_DEV_STUB=true` writes a Local Dev User session inline; the [`scripts/ci/check_alphaswarm_ui_no_auth0.py`](../../../../../../alphaswarm_ui/scripts/ci/check_alphaswarm_ui_no_auth0.py) guard fails on any new Auth0 reference. ## See also - [`alphaswarm_ui/AGENTS.md`](../../../../../../alphaswarm_ui/AGENTS.md) — hard boundaries. - [`alphaswarm-website.md`](alphaswarm-website.md) — public marketing sibling. - [`identity.md`](../../identity/identity.md) — Entra integration contract. # alphaswarm-website > Public-facing marketing site at alpha-swarm.ai. Next.js 14+ App Router on Cloudflare Pages; no auth, no API calls, intentionally separate from the operator dashboard. # alphaswarm-website The public marketing site. 
Lives in its own repo ([`alphaswarm_website/`](../../../../../../alphaswarm_website/)) and is hosted on Cloudflare Pages so the marketing surface survives cluster outages the same way the docs do. This is **not** the operator dashboard — that is [`alphaswarm-ui`](alphaswarm-ui.md) at `app.alpha-swarm.ai`. Cross-links from this site to the dashboard go through `NEXT_PUBLIC_ALPHASWARM_APP_URL`. ## Identity | Field | Value | | --- | --- | | Service id | `alphaswarm-website` | | Role | `marketing` | | Package | [`alphaswarm_website/`](../../../../../../alphaswarm_website/) | | Hosted on | Cloudflare Pages | | Public URL | `https://alpha-swarm.ai`, `https://www.alpha-swarm.ai` | ## Routes - `/` — homepage - `/about`, `/blog`, `/changelog` - `/pricing`, `/cloud`, `/self-hosted` - `/product/{agentops,reinforcement-learning,data-platform,backtesting}` - `/learn`, `/learn/` - `/docs/[[...slug]]` — public docs links (deep-link to `alphaswarm-docs`) - `/legal/[doc]` — terms, privacy, security, dpa, contact - `/login`, `/signup`, `/onboarding` — thin 307 redirects to `${NEXT_PUBLIC_ALPHASWARM_APP_URL}/...` ## Hard boundaries Per [`alphaswarm_website/AGENTS.md`](../../../../../../alphaswarm_website/AGENTS.md): - No authentication SDKs (no `@auth0/*`, no `@azure/msal-*`, no `iron-session`). - No imports of `alphaswarm.*` or `alphaswarm_controller.*`. - No client-side state libraries (no `@tanstack/react-query`, no `zustand`, no `antd`). - No secrets in env — only the public app URL and port. - Next.js pinned `>=14.2.25` for CVE-2025-29927. 
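The dependency rules in this list lend themselves to a CI check over `package.json`. The following is a minimal sketch, assuming a standalone Python guard in the same spirit as the `alphaswarm_ui` Auth0 check. The function name and matching rules are illustrative and are not the repository's actual script.

```python
import json

# Package names and scoped prefixes ruled out by the hard-boundary list.
FORBIDDEN_EXACT = {"iron-session", "@tanstack/react-query", "zustand", "antd"}
FORBIDDEN_PREFIXES = ("@auth0/", "@azure/msal-")


def forbidden_dependencies(package_json_text: str) -> list[str]:
    """Return the sorted names of dependencies that violate the boundary list."""
    manifest = json.loads(package_json_text)
    names: set[str] = set()
    for section in ("dependencies", "devDependencies"):
        names.update(manifest.get(section, {}))
    return sorted(
        name
        for name in names
        if name in FORBIDDEN_EXACT or name.startswith(FORBIDDEN_PREFIXES)
    )


if __name__ == "__main__":
    sample = json.dumps({"dependencies": {"next": "14.2.25", "zustand": "4.5.0"}})
    offenders = forbidden_dependencies(sample)
    # A real guard would exit non-zero here so the CI job fails.
    print("forbidden:", offenders)
```

Checking the manifest rather than grepping source files keeps the guard fast, though it would not catch a forbidden package pulled in transitively.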
## Deployment surface

| Surface | Where |
| --- | --- |
| Terraform module | [`alphaswarm_platform/terraform/modules/cloudflare_pages_docs/`](../../../../../../alphaswarm_platform/terraform/modules/cloudflare_pages_docs/) (forthcoming dedicated `cloudflare_pages_marketing`) |
| Build | `pnpm build` in `alphaswarm_website/` — Cloudflare Pages picks up the GitHub branch |

## See also

- [`alphaswarm-ui.md`](alphaswarm-ui.md) — the auth-gated operator dashboard at `app.alpha-swarm.ai`.
- [`alphaswarm-docs.md`](alphaswarm-docs.md) — public docs at `docs.alpha-swarm.ai`.
- [`alphaswarm_website/AGENTS.md`](../../../../../../alphaswarm_website/AGENTS.md) — hard boundaries.

# alphaswarm-worker

> Celery orchestration worker for the AlphaSwarm runtime — drains the light/coordination queues (default, paper, terraform, ingestion, workflows). Heavy compute moves to alphaswarm-executor.

# alphaswarm-worker

Celery **orchestration** worker pod that drains the light / coordination queues produced by `alphaswarm-core`. As of the Phase 4c worker/executor split it has its own slim image (target `worker`) carrying only the task-dispatch + lineage surface — **not** the API stage's `visualization` / `dev` / Dash deps it used to inherit. Heavy compute (backtest / training / ML / agents / factors / RAG) is offloaded to the sibling [`alphaswarm-executor`](alphaswarm-executor.md). See [worker vs executor images](../worker-executor-images.md) for the full rationale and dependency matrix.
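The split described above can be read as a lookup from queue name to image. The sketch below hard-codes the heavy queue names listed in these docs. The function name is hypothetical, and the real source of truth is the `faas` Terraform module rather than any Python code.

```python
# Heavy compute queues drained by alphaswarm-executor (names taken from the docs).
HEAVY_QUEUES = frozenset({"backtest", "training", "ml", "agents", "factors", "rag"})


def image_for_queue(queue: str) -> str:
    """Pick the image that drains `queue`: executor for heavy, worker otherwise."""
    return "alphaswarm-executor" if queue in HEAVY_QUEUES else "alphaswarm-worker"


if __name__ == "__main__":
    for name in ("backtest", "paper", "default", "rag"):
        print(f"{name:10s} -> {image_for_queue(name)}")
```

Because membership in the heavy set is the only input, a queue can never be drained by both images, which mirrors the rule that the two image sets do not share a queue.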
## Identity | Field | Value | | --- | --- | | Service id | `alphaswarm-worker` | | Role | `worker` | | Package | [`alphaswarm/`](../../../../../../alphaswarm/) (tasks under `alphaswarm/tasks/*.py`) | | Image (key) | `worker` | | Built from | [`alphaswarm_platform/Dockerfile`](../../../../../../alphaswarm_platform/Dockerfile) (target `worker`, multi-arch) or the standalone [`build/docker/alphaswarm_worker/Dockerfile`](../../../../../../alphaswarm_platform/build/docker/alphaswarm_worker/Dockerfile) | ## Wire | Field | Value | | --- | --- | | Protocol | none (no HTTP listener) | | Health | Celery broker connection probe + Prometheus metrics on `:9100` (when enabled) | | Public URL | — | | Broker | `redis://redis:6379/0` | | Result backend | `redis://redis:6379/1` | ## Deployment surfaces | Surface | Where | | --- | --- | | Compose | service `worker` in [`alphaswarm_platform/compose/docker-compose.yml`](../../../../../../alphaswarm_platform/compose/docker-compose.yml); `alphaswarm-worker` in [`deployments/compose/docker-compose.local.yml`](../../../../../../alphaswarm_platform/deployments/compose/docker-compose.local.yml) | | Kustomize | [`deployments/kubernetes/base/alphaswarm-worker/`](../../../../../../alphaswarm_platform/deployments/kubernetes/base/alphaswarm-worker/) — Deployment + HPA + PDB | | AQP CR | folded into [`AQPMonolith`](../../../../../../alphaswarm_controller/src/alphaswarm_controller/operator/crds/aqpmonolith_cr.py) (`spec.workers.queues`) | | Terraform module | [`alphaswarm_platform/terraform/modules/faas/`](../../../../../../alphaswarm_platform/terraform/modules/faas/) — Celery + KEDA per-queue ScaledObjects | ## Queue families The orchestration worker drains the **light / coordination** queues only. The heavy compute queues (`backtest`, `training`, `ml`, `agents`, `factors`, `rag`) are drained by [`alphaswarm-executor`](alphaswarm-executor.md). KEDA scales each queue family independently. 
Default queue map: | Queue | Drives | Scale-to-zero | Notes | | --- | --- | --- | --- | | `default` | misc tasks, callbacks, lookups | yes | always-on `min=1` in prod | | `paper` | paper trading session ticks | no | sub-second latency required | | `terraform` | TerraformRuntime celery wrappers | yes | | | `ingestion` | Airbyte / Dagster / connector pulls | no | uses long-lived workers | | `workflows` | WorkflowRuntime orchestration | yes | | | `hft` | HFT hot-path event handlers | no | pinned to hft-nodes (compose/legacy) | :::note The `faas` KEDA module keys per-queue Deployments off `local.heavy_queues` — heavy queues run the `alphaswarm-executor` image, everything else runs this `alphaswarm-worker` image. The two image sets never share a queue. ::: ## Dependencies **Upstream:** - `redis` — broker + result backend. - `postgres` — task lookups, ledger writes. - `alphaswarm-core` — progress emit callbacks, lookup APIs. - All data-plane services the `alphaswarm-core` pod depends on (the same code paths run inside Celery). **Downstream:** - Beat schedules tasks; the worker is the consumer. - HFT-tagged tasks land on the `hft-nodes/` workload (PTP-tuned). ## Operations - **Scaling:** KEDA `ScaledObject` per queue; idle queues scale to zero. The per-queue `min`/`max` lives in the [`faas`](../../../../../../alphaswarm_platform/terraform/modules/faas/) Terraform module. - **Concurrency:** the orchestration worker runs concurrency 4 (light, IO-bound dispatch work); 1 for HFT (single-threaded pinning). - **Drain on shutdown:** `terminationGracePeriodSeconds: 600` so in-flight tasks complete; `preStop` sends `SIGTERM` to Celery. - **Audit:** `WorkloadRuntime` actions land `workload_runs` rows; the worker pod respects the kill-switch Redis key the same way the API does. ## See also - [`tasks-api.mdc`](../../../../../../alphaswarm/.cursor/rules/tasks-api.mdc) — Celery task progress contract + Redis pub/sub frame shape. 
- [`alphaswarm-executor.md`](alphaswarm-executor.md) — heavy-compute sibling.
- [`worker-executor-images.md`](../worker-executor-images.md) — image split rationale + dependency matrix.
- [`alphaswarm-beat.md`](alphaswarm-beat.md) — sibling scheduler.
- [`faas` Terraform module](../../../../../../alphaswarm_platform/terraform/modules/faas/) — KEDA scaling source of truth.

# chromadb

> Vector store — fallback / dev embedding store. Production cells use `milvus`; ChromaDB stays for local-dev parity and small-collection cases.

# chromadb

A vector store used for embedding indices in dev cells and small-collection production cases. Larger production cells use [`milvus`](https://milvus.io/) instead — ChromaDB stays in the topology so the local-dev compose stack and per-cell base manifests keep parity.

## Identity

| Field | Value |
| --- | --- |
| Service id | `chromadb` |
| Role | `vector-store` |
| Image | `chromadb/chroma:1.0.16` |
| Port | `8000` (in-cluster) / `8001` (host bind in compose to avoid clashing with `alphaswarm-core`) |
| Storage | ephemeral by default; PVC-backed in cluster |

## Deployment surfaces

| Surface | Where |
| --- | --- |
| Compose | service `chromadb` in [`alphaswarm_platform/compose/docker-compose.yml`](../../../../../../alphaswarm_platform/compose/docker-compose.yml) |
| Kustomize | [`deployments/kubernetes/base-services/chromadb/`](../../../../../../alphaswarm_platform/deployments/kubernetes/base-services/chromadb/) |
| Companion | [`base-services/milvus/`](../../../../../../alphaswarm_platform/deployments/kubernetes/base-services/milvus/) — production-grade alternative |

## Dependencies

**Upstream:** none.

**Downstream:**

- `alphaswarm-core` for RAG retrieval (when feature flag `ALPHASWARM_VECTOR_STORE=chromadb`).
- `alphaswarm-worker` for embedding ingest tasks.

## Operations

- **Collection lifecycle:** managed by the `HierarchicalRAG` package; never created directly by agents.
- **Vector dimensions:** must match the active embedding model (default `BAAI/bge-m3` at 1024-dim). Mismatch is a hard error.
- **Backup:** the in-cell PVC is snapshotted nightly; production cells with significant collections should swap to Milvus.

## See also

- [`alphaswarm/data/rag/`](../../../../../../alphaswarm/alphaswarm/data/rag/) — HierarchicalRAG package.
- [`base-services/milvus/`](../../../../../../alphaswarm_platform/deployments/kubernetes/base-services/milvus/) — production alternative.

# grafana

> Primary metrics + logs + traces visualization layer. Datasources point at Prometheus, VictoriaMetrics, Loki, and Jaeger.

# grafana

The platform's primary dashboard surface. Bundled with the kube-prometheus-stack and pre-loaded with datasources for Prometheus, VictoriaMetrics, Loki (logs), and Jaeger (traces).

## Identity

| Field | Value |
| --- | --- |
| Service id | `grafana` |
| Role | `observability` |
| Image | `grafana/grafana` (managed by kube-prometheus-stack) |
| Port | `3000` |
| Health | `/api/health` |

## Deployment surfaces

| Surface | Where |
| --- | --- |
| Kustomize | folded into [`observability/kube-prometheus-stack/`](../../../../../../alphaswarm_platform/deployments/kubernetes/observability/kube-prometheus-stack/) (Helm chart bundles Grafana) |
| Standalone | (none — Grafana is always shipped with the stack) |

## Datasources

| Datasource | Backend |
| --- | --- |
| `Prometheus` | in-cluster Prometheus |
| `VictoriaMetrics` | in-cluster VM |
| `Loki` | in-cluster Loki |
| `Jaeger` | in-cluster Jaeger |
| `Phoenix` (when enabled) | Phoenix's Postgres backend (read-only) |

## Dashboards

Provisioned via ConfigMaps under `observability/kube-prometheus-stack/dashboards/`. Default set covers:

- Cluster health (kube-state-metrics).
- AlphaSwarm runtime (API latency, Celery queue depth, kill-switch state, terraform_run lag).
- Per-service Linkerd proxy metrics.
- Per-cell tenant overlays.
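A dashboard ConfigMap of the kind described above can be generated instead of hand-written. This sketch assumes the Grafana sidecar shipped with the upstream chart discovers ConfigMaps by a `grafana_dashboard` label (the chart's usual default, which should be confirmed against the cell's Helm values). The ConfigMap name, namespace, and dashboard JSON are made up for illustration.

```python
import json


def dashboard_configmap(name: str, namespace: str, dashboard: dict) -> dict:
    """Wrap a Grafana dashboard model in a ConfigMap manifest (as a plain dict)."""
    return {
        "apiVersion": "v1",
        "kind": "ConfigMap",
        "metadata": {
            "name": name,
            "namespace": namespace,
            # Assumed sidecar discovery label; confirm against the chart values.
            "labels": {"grafana_dashboard": "1"},
        },
        # One key per dashboard file; the value is the serialised dashboard JSON.
        "data": {f"{name}.json": json.dumps(dashboard, indent=2, sort_keys=True)},
    }


if __name__ == "__main__":
    model = {"title": "Celery queue depth (example)", "panels": []}
    manifest = dashboard_configmap("celery-queue-depth", "observability", model)
    # JSON is valid YAML, so this output can be committed as a manifest file.
    print(json.dumps(manifest, indent=2))
```

Generating the manifest from a reviewed dashboard model keeps the change inside a pull request, which fits the rule that UI-only edits do not survive reconciliation.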
Custom dashboards land via PR — never via the Grafana UI alone (UI edits are wiped on the next reconciliation). ## Operations - **Auth:** OIDC against the staff Entra tenant; the `alphaswarm-staff` group maps to admin, `alphaswarm-operators` to editor, and any other authenticated user to viewer. - **Persistence:** Grafana DB is SQLite by default (folded into the Helm chart); production cells point at a per-cell Postgres schema. ## See also - [`prometheus.md`](prometheus.md), [`victoriametrics.md`](victoriametrics.md), [`loki.md`](loki.md), [`jaeger.md`](jaeger.md) — backing datasources. - [`observability-stack.md`](../../trading/observability-stack.md) — stack composition. # jaeger > Distributed tracing backend for the infrastructure pipeline. AI / LLM spans land in Phoenix instead. # jaeger Distributed tracing backend for the infrastructure trace pipeline — HTTP, database, queue, and inter-service spans land here. AI / LLM spans (OpenInference) route to [`phoenix`](https://docs.arize.com/phoenix) instead. ## Identity | Field | Value | | --- | --- | | Service id | `jaeger` | | Role | `observability` | | Image | `jaegertracing/all-in-one` (in-cell) / `jaegertracing/jaeger-collector` + `jaegertracing/jaeger-query` (split in cloud cells) | | Port | `6831` (UDP — agent), `14250` (gRPC — collector), `16686` (HTTP — query/UI) | | Health | `/` | ## Deployment surfaces | Surface | Where | | --- | --- | | Kustomize | [`observability/jaeger/`](../../../../../../alphaswarm_platform/deployments/kubernetes/observability/jaeger/) — all-in-one Deployment + Service | | Cloud cells | split mode — Collector behind a Service, Query behind ingress | ## Dependencies **Upstream:** `otel-collector` — fans the infrastructure trace pipeline here. **Downstream:** Grafana (datasource) and the operator's UI for manual span inspection. ## Operations - **Storage:** in-cell uses badger (ephemeral); cloud cells back with Elasticsearch / OpenSearch. 
- **Retention:** 7 days in-cell, 30 days in cloud. - **Sampling:** receives only the 5% sampled spans (per `otel-collector` policy) plus 100% of error spans. ## See also - [`otel-collector.md`](otel-collector.md) — routing source. - [`observability.md`](../../trading/observability.md) — concept doc. # loki > Log aggregation. Receives logs from `vector` (the shipper) and serves Grafana. # loki Grafana Loki — the log aggregation backend. Receives logs from the `vector` DaemonSet (the canonical shipper) and serves Grafana for queries. ## Identity | Field | Value | | --- | --- | | Service id | `loki` | | Role | `observability` | | Image | `grafana/loki:3.3.2` | | Port | `3100` | | Health | `/ready` | ## Deployment surfaces | Surface | Where | | --- | --- | | Compose | service `loki` in [`alphaswarm_platform/compose/docker-compose.platform.yml`](../../../../../../alphaswarm_platform/compose/docker-compose.platform.yml) (platform overlay) | | Kustomize | [`observability/loki/`](../../../../../../alphaswarm_platform/deployments/kubernetes/observability/loki/) — single-binary StatefulSet (in-cell); split-monolithic mode in cloud cells | ## Dependencies **Upstream:** `vector` (the canonical shipper). **Downstream:** `grafana` (datasource). ## Operations - **Storage:** in-cell uses local PVC; cloud cells back with S3 / GCS / ADLS object storage. - **Retention:** 14 days default; 30 days for `audit-*` streams (per the audit-evidence retention policy). - **Tenancy:** every log line carries a `tenant_id` label so Loki's multi-tenancy split is enforced at query time. ## See also - [`vector.md`](vector.md) — log shipper. - [`grafana.md`](grafana.md) — visualization layer. # mlflow > Model registry + experiment tracker. Backs Predictor Hub, AlphaBacktestExperiment, walk-forward, and the finetune trainers. # mlflow The platform's model registry + experiment tracker. 
Owned by `alphaswarm_models` — every Predictor, AlphaBacktestExperiment, walk-forward run, and finetune trainer registers here. ## Identity | Field | Value | | --- | --- | | Service id | `mlflow` | | Role | `mlops` | | Image | `ghcr.io/mlflow/mlflow:v2.11.1` | | Port | `5000` | | Storage | object store for artifacts (MinIO / S3 / GCS / ADLS depending on cloud); Postgres backend for the tracking store | ## Deployment surfaces | Surface | Where | | --- | --- | | Compose | service `mlflow` in [`alphaswarm_platform/compose/docker-compose.yml`](../../../../../../alphaswarm_platform/compose/docker-compose.yml) | | Kustomize | [`deployments/kubernetes/base-services/mlflow/`](../../../../../../alphaswarm_platform/deployments/kubernetes/base-services/mlflow/) — Deployment + Service + ExternalSecret-backed credentials | | MLOps overlay | reachable through [`mlops/`](../../../../../../alphaswarm_platform/deployments/kubernetes/mlops/) when paired with Argo Workflows + Dagster | ## Dependencies **Upstream:** - `postgres` — tracking store. - `minio` / `s3` / `gcs` / `azblob` — artifact store. **Downstream:** - `alphaswarm-core`, `alphaswarm-worker` — every Predictor / Skill / walk-forward / finetune flow registers runs here. - `alphaswarm-ml-mcp` — read paths surface through the `data.ml.*` MCP slice. ## Operations - **Auth:** behind the cluster ingress; the in-cluster URL is the only path. Local dev exposes `http://localhost:5000` for browser inspection. - **Pruning:** retention policy lives at `alphaswarm/tasks/cleanup/mlflow_prune.py` — run by beat weekly. - **Run tagging:** every run is tagged with the originating `experiment_id` + `test_id` per AGENTS rule 34 so audit queries can correlate ML runs with strategy / backtest activity. ## See also - [`mlops-service.md`](../../strategy/mlops-service.md) — how `alphaswarm_models` lays MLflow underneath the Skill / Predictor contract. - [`ml-framework.md`](../../strategy/ml-framework.md) — model framework overview. 
- [`alphaswarm_models/AGENTS.md`](../../../../../../alphaswarm_models/AGENTS.md) — boundary rules. # neo4j > Graph database — canonical home for the ownership graph, the bipartite lineage DAG, and the entity-graph service. # neo4j The canonical graph store. Holds the ownership graph (Workstream F), the bipartite lineage DAG (Workstream A + B), and the entity-graph service (instruments, companies, datasets, pipeline assets, service metadata). Postgres carries the snapshot rows; Neo4j carries the traversable relationships. ## Identity | Field | Value | | --- | --- | | Service id | `neo4j` | | Role | `graph` | | Image | `neo4j:5-community` | | Port | `7474` (HTTP) + `7687` (Bolt) | | Storage | 5 Gi PVC (cell-local); managed Neo4j Aura recommended for prod cells | ## Deployment surfaces | Surface | Where | | --- | --- | | Compose | service `neo4j` in [`alphaswarm_platform/compose/docker-compose.yml`](../../../../../../alphaswarm_platform/compose/docker-compose.yml) | | Kustomize | rolled into `base-services/` (cell-local StatefulSet) | | Terraform | not provisioned by a managed module today; cloud templates run a containerised StatefulSet behind the cell's storage class | ## Dependencies **Upstream:** none. **Downstream:** - `alphaswarm-core` — ownership graph reads via `data.ownership.*` MCP tool; lineage relay writes through OpenLineage adapter. - `alphaswarm-worker` — sync tasks that mirror Postgres rows into Neo4j edges. ## Sync semantics - Postgres remains the canonical source of truth for entity *attributes*; Neo4j holds the *relationships*. - Sync is event-driven via the `lineage` queue family; backfills run through `data.lineage.replay` Celery tasks. - Read paths go through the `data.ownership.*` and `data.lineage.*` DataMCP tools — the agentic plane MUST NOT speak Bolt directly. ## Operations - **Auth:** username/password via ExternalSecret; Bolt TLS through Linkerd mTLS. - **Backups:** native `neo4j-admin database backup` cron to MinIO/S3. 
- **Cypher style:** queries are stored under `alphaswarm/data/sources/graph/queries/`; ad-hoc Cypher in agent prompts is forbidden. ## See also - [`ownership-graph`](../../../../../../alphaswarm/.cursor/rules/ownership-graph.mdc) — ownership graph contract (Workstream F). - [`lineage-graph`](../../../../../../alphaswarm/.cursor/rules/lineage-graph.mdc) — bipartite lineage DAG + OpenLineage relay (Workstream A + B). - [`entity-graph-services.md`](../../platform/entity-graph-services.md) — entity registry + service control via Neo4j. # otel-collector > OpenTelemetry collector — central OTLP gateway routing infra spans to Tempo/Jaeger, AI/LLM spans to Phoenix, metrics to Prometheus + VictoriaMetrics, logs to Loki. # otel-collector The single OTLP ingress for the cluster. Every workload pod sends traces, metrics, and logs to this gateway; the gateway fans out by signal type to the appropriate backend. ## Identity | Field | Value | | --- | --- | | Service id | `otel-collector` | | Role | `observability` | | Image | `otel/opentelemetry-collector` (gateway flavour) — pinned in [`observability/opentelemetry-collector-gateway/`](../../../../../../alphaswarm_platform/deployments/kubernetes/observability/opentelemetry-collector-gateway/) | | Port | `4317` (OTLP gRPC) + `4318` (OTLP HTTP) | | Health | `:13133/` (extensions health_check) | ## Deployment surfaces | Surface | Where | | --- | --- | | Compose | service `otel-collector` in [`alphaswarm_platform/compose/docker-compose.yml`](../../../../../../alphaswarm_platform/compose/docker-compose.yml) | | Kustomize | [`observability/opentelemetry-collector-gateway/`](../../../../../../alphaswarm_platform/deployments/kubernetes/observability/opentelemetry-collector-gateway/) — gateway Deployment + DaemonSet agent (canonical) | | Operator | [`observability/opentelemetry-operator/`](../../../../../../alphaswarm_platform/deployments/kubernetes/observability/opentelemetry-operator/) — auto-instrumentation CRDs | | Legacy | 
[`observability/otel-collector/`](../../../../../../alphaswarm_platform/deployments/kubernetes/observability/otel-collector/) — rollback only; NOT wired to overlays | ## Routing | Signal | Destination | | --- | --- | | `traces.infrastructure` | Jaeger (in-cell) / Tempo (cloud cells) | | `traces.ai` (OpenInference spans) | [`phoenix`](https://github.com/Arize-ai/phoenix) | | `metrics` | VictoriaMetrics + Prometheus (parallel during cutover) | | `logs` | Loki (via Vector) | The split happens via OTel `routing` connector — spans tagged with `service.namespace=alphaswarm.ai` route to Phoenix; everything else goes to the infra trace pipeline. ## Dependencies **Upstream:** every alphaswarm workload pod (auto-instrumentation through the OTel operator + manual SDK init in `alphaswarm/observability/`). **Downstream:** Jaeger, Phoenix, Prometheus, VictoriaMetrics, Loki. ## Operations - **Sampling:** tail-based for traces — keep 100% of error spans, 5% of healthy traffic. Tuned per cell. - **Resource tagging:** every span carries `tenant_id`, `cell_id`, `service.id` (matching topology), and `experiment_id` / `test_id` when set. - **Auto-instrumentation:** Python via `opentelemetry-distro`; Node via the OTel operator's auto-injected sidecar; Go services use manual SDK. ## See also - [`observability.md`](../../trading/observability.md) — observability concept doc. - [`observability-stack.md`](../../trading/observability-stack.md) — stack composition + dashboards. - [`phoenix`](https://docs.arize.com/phoenix) — AI / LLM observability upstream. # postgres > Primary OLTP database with pgvector — strategies, bots, runs, ledgers, ownership graph snapshots, terraform_runs, security_audit_events. # postgres The platform's primary OLTP database. 
Holds every relational table the runtime depends on — strategies, bots, runs, ledgers, the ownership graph snapshot, the `*_spec_versions` tables for hash-locked specs, `workload_runs`, `terraform_runs`, `security_audit_events`, and the multi-tenant `EntraTenantLink` index. ## Identity | Field | Value | | --- | --- | | Service id | `postgres` | | Role | `database` | | Image | `pgvector/pgvector:pg16` (compose) / `ankane/pgvector:v0.5.1` (deployments/compose) — Postgres 16 + pgvector | | Port | `5432` (in-cluster) / `5433` (host bind in compose to avoid clash with system Postgres) | | Storage | 5 Gi PVC in StatefulSet (cell-local); RDS in `aws-*` templates; Cloud SQL in `gcp-*`; Azure DB in `azure-*` | ## Deployment surfaces | Surface | Where | | --- | --- | | Compose | service `postgres` in [`alphaswarm_platform/compose/docker-compose.yml`](../../../../../../alphaswarm_platform/compose/docker-compose.yml) | | Kustomize | [`deployments/kubernetes/base-services/postgres-shared/`](../../../../../../alphaswarm_platform/deployments/kubernetes/base-services/postgres-shared/) — StatefulSet + Service + ClusterSecretStore-backed credentials | | Terraform module | [`alphaswarm_platform/terraform/modules/storage/`](../../../../../../alphaswarm_platform/terraform/modules/storage/) — RDS (AWS) / Cloud SQL (GCP) / Azure DB / containerised (local) | | Companion module | [`alphaswarm_platform/terraform/modules/database/`](../../../../../../alphaswarm_platform/terraform/modules/database/) — PgBouncer connection pooler + Alembic migration Job | ## Dependencies **Upstream:** none. **Downstream:** - `alphaswarm-core`, `alphaswarm-worker`, `alphaswarm-beat` — primary read/write. - `alphaswarm-cp` — workload + terraform ledger writes. - `alphaswarm-admin` — admin ledger. - `mlflow` — embedded postgres backend (or pointed at this one in prod). ## Operations - **Migrations:** Alembic runs as a one-shot Job in the `database` Terraform module before the first app pod is scheduled. 
Migrations are immutable — see [`migrations-persistence`](../../../../../../alphaswarm/.cursor/rules/migrations-persistence.mdc). - **Backups:** pg_dump cron + WAL archiving to MinIO/S3 (per cloud). RPO 5 min, RTO 30 min; restore runbook at [`how-to/runbooks/dr-restore.md`](../../../how-to/runbooks/dr-restore.md). - **Secrets:** primary DSN in Vault → ExternalSecret → in-cluster Secret. Hand-pasted credentials are a review-blocking change. - **Connection pooling:** PgBouncer (transaction mode) sits in front; app pods connect through `pgbouncer.alphaswarm.svc.cluster.local:6432`. ## See also - [`migrations-persistence`](../../../../../../alphaswarm/.cursor/rules/migrations-persistence.mdc) — Alembic immutability + ORM conventions. - [`erd.md`](../../platform/erd.md) — entity-relationship map across every table this database holds. - [`storage` Terraform module](../../../../../../alphaswarm_platform/terraform/modules/storage/) — per-cloud provisioning. # prometheus > Time-series metrics scraper deployed via the kube-prometheus-stack. Sits in parallel with VictoriaMetrics during the long-term-storage cutover. # prometheus The cluster-internal metrics scraper. Deployed via [`kube-prometheus-stack`](https://github.com/prometheus-operator/kube-prometheus) which also installs the operator, Alertmanager, and the Grafana sidecar. 
## Identity | Field | Value | | --- | --- | | Service id | `prometheus` | | Role | `observability` | | Image | `prom/prometheus` (managed by kube-prometheus-stack) | | Port | `9090` | | Health | `/-/ready` | ## Deployment surfaces | Surface | Where | | --- | --- | | Kustomize | [`observability/kube-prometheus-stack/`](../../../../../../alphaswarm_platform/deployments/kubernetes/observability/kube-prometheus-stack/) — Helm-managed via kustomize HelmCharts overlay | | Compose | (not in compose — local dev relies on `victoriametrics` for the small footprint) | ## Scrape targets The kube-prometheus-stack installs a default `ServiceMonitor` set; we extend it with: - `alphaswarm-core` `/metrics` (every API pod). - `alphaswarm-worker` `/metrics` (per Celery worker). - `alphaswarm-cp` `/metrics`. - KEDA metrics adapter on `aqp-controller-operator` and `bots-operator`. - Linkerd proxy metrics (mTLS-side). - Per-data-plane service exporters (Postgres exporter, Redis exporter, Kafka exporter, etc.). ## Long-term storage Prometheus runs with a 30-day local retention; VictoriaMetrics is the long-term store and remote-write target. During the parallel-cutover both sides receive samples; once the cutover is declared the local Prometheus retention is dropped to 7 days. ## Operations - **Alertmanager:** receives the `kube-prometheus-stack` default alert set + AlphaSwarm-specific rules under `observability/kube-prometheus-stack/alerts/`. - **Federation:** disabled — the long-term path is remote-write to VictoriaMetrics, not federation. - **PromQL recording rules:** kept under `observability/kube-prometheus-stack/rules/`; agent-emitted ad-hoc rules are forbidden. ## See also - [`grafana.md`](grafana.md) — primary visualization layer. - [`victoriametrics.md`](victoriametrics.md) — long-term storage. - [`observability-stack.md`](../../trading/observability-stack.md) — stack composition. 
# redis > Cache + pub/sub + Celery broker + kill-switch key + BFF session store + HierarchicalRAG index. # redis Multi-purpose key-value store. Holds the kill-switch flag, the BFF session store (Phase 5+), the Celery broker / result backend, the semantic LLM cache, the HierarchicalRAG index, the MetadataPrefetcher cache, and the per-cell pub/sub fan-out for WebSocket progress streams. ## Identity | Field | Value | | --- | --- | | Service id | `redis` | | Role | `cache` | | Image | `redis:7-alpine` (compose master) / `redis-stack:7.4.0-v3` (local — adds RedisJSON + RedisSearch) | | Port | `6379` | | Storage | 2 Gi PVC (cell-local); ElastiCache (AWS) / Memorystore (GCP) / Azure Cache (Azure) in cloud templates | ## Deployment surfaces | Surface | Where | | --- | --- | | Compose | service `redis` in [`alphaswarm_platform/compose/docker-compose.yml`](../../../../../../alphaswarm_platform/compose/docker-compose.yml); `redis-stack` in [`deployments/compose/docker-compose.local.yml`](../../../../../../alphaswarm_platform/deployments/compose/docker-compose.local.yml) | | Kustomize | [`deployments/kubernetes/base/redis-master/`](../../../../../../alphaswarm_platform/deployments/kubernetes/base/redis-master/) — single master per cell; [`base-services/redis-shared/`](../../../../../../alphaswarm_platform/deployments/kubernetes/base-services/redis-shared/) — shared replica set | | Terraform module | [`alphaswarm_platform/terraform/modules/storage/`](../../../../../../alphaswarm_platform/terraform/modules/storage/) — managed cache per cloud | ## Key namespaces | Prefix | Owner | Purpose | | --- | --- | --- | | `alphaswarm:kill_switch` | `WorkloadRuntime`, `TerraformRuntime` | Global halt flag — every state-mutating runtime checks before acting | | `celery:*` | Celery broker | Queue names per family (`default`, `backtest`, `agents`, ...) 
| | `bff:session:*` | `alphaswarm-cp` BFF | Phase 5 session store (sid → IdP token) | | `m2m:tokens:*` | `alphaswarm-cp` auth broker | M2M token cache | | `cache:llm:*` | `alphaswarm-core` | Semantic LLM cache | | `cache:metadata:*` | `MetadataPrefetcher` | Entity dropdown cache | | `rag:*` | `HierarchicalRAG` | Embedding index | | `pubsub:progress:` | `alphaswarm._progress` | WebSocket fan-out frames | ## Dependencies **Upstream:** none. **Downstream:** every runtime pod (`alphaswarm-core`, `alphaswarm-worker`, `alphaswarm-beat`, `alphaswarm-cp`, bots). ## Operations - **Eviction policy:** `allkeys-lru` for caches; `noeviction` for Celery to avoid silent task drops. - **HA:** in-cell single master; cloud templates use managed Redis with multi-AZ replicas. - **Kill-switch:** the key is intentionally simple — `set` to any truthy value halts; the runtime polls every state-mutating action. - **Persistence:** AOF every second + RDB snapshot every 5 min. ## See also - [`tasks-api`](../../../../../../alphaswarm/.cursor/rules/tasks-api.mdc) — Celery broker + Redis pub/sub frame contract. - [`alphaswarm-management-engine`](../../../../../../alphaswarm/.cursor/rules/alphaswarm-management-engine.mdc) — redaction rules for any code that handles Redis-stored tokens. - [`storage` Terraform module](../../../../../../alphaswarm_platform/terraform/modules/storage/) — per-cloud provisioning. # vector > Log shipper — DaemonSet running on every node, ships container logs to Loki. # vector [Vector](https://vector.dev/) — the canonical log shipper. Runs as a DaemonSet on every node, tails container stdout/stderr, applies parse + redact transforms, and ships to Loki. 
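
The redact step mentioned above removes fields by name. A minimal Python sketch of that rule follows, assuming the substring list given in this page's Redaction section. Vector itself expresses transforms in its own configuration language, so this is only an illustration of the matching logic; the `redact` helper and the recursion into nested objects are assumptions, not the shipped transform.

```python
# Substrings taken from the Redaction section of this page.
SENSITIVE_SUBSTRINGS = (
    "password", "secret", "token", "key", "credential", "private",
    "authorization", "kubeconfig", "client_secret", "api_token",
    "api_key", "jwt", "refresh_token", "access_token",
)


def redact(event: dict) -> dict:
    """Return a copy of event without fields whose lower-cased name
    contains any sensitive substring (hypothetical helper)."""
    clean = {}
    for name, value in event.items():
        if any(s in name.lower() for s in SENSITIVE_SUBSTRINGS):
            continue  # drop the field entirely
        if isinstance(value, dict):
            value = redact(value)  # assumption: nested objects are scanned too
        clean[name] = value
    return clean


sample = {"msg": "ok", "API_Key": "x", "meta": {"jwt": "y", "pod": "p"}}
cleaned = redact(sample)
```

Because the match is a substring test, any field whose name merely contains `key` or `token` is dropped as well, which errs on the side of over-redaction.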
## Identity | Field | Value | | --- | --- | | Service id | `vector` | | Role | `observability` | | Image | `timberio/vector:0.43.0-alpine` | | Port | (no public listener; metrics on `:9598`) | ## Deployment surfaces | Surface | Where | | --- | --- | | Compose | service `vector` in [`alphaswarm_platform/compose/docker-compose.platform.yml`](../../../../../../alphaswarm_platform/compose/docker-compose.platform.yml) | | Kustomize | [`observability/vector/`](../../../../../../alphaswarm_platform/deployments/kubernetes/observability/vector/) — DaemonSet + ConfigMap | ## Pipelines - **`kubernetes_logs` source** → JSON parser → metadata enrichment (pod labels, namespace, cell id, tenant id) → redaction transform. - **Sinks:** `loki` (canonical) + `phoenix` (only for spans tagged `service.namespace=alphaswarm.ai`). ## Redaction - The redact transform strips any field whose lower-cased name contains `password`, `secret`, `token`, `key`, `credential`, `private`, `authorization`, `kubeconfig`, `client_secret`, `api_token`, `api_key`, `jwt`, `refresh_token`, `access_token`. - Same allowlist as `WorkloadRuntime` redaction — see the [`alphaswarm-management-engine`](../../../../../../alphaswarm/.cursor/rules/alphaswarm-management-engine.mdc) rule. ## See also - [`loki.md`](loki.md) — primary sink. - [`alphaswarm-management-engine`](../../../../../../alphaswarm/.cursor/rules/alphaswarm-management-engine.mdc) — redaction allowlist. # victoriametrics > Long-term metrics storage — Prometheus-compatible TSDB, target of remote-write from kube-prometheus-stack. # victoriametrics [VictoriaMetrics](https://victoriametrics.com/) — Prometheus-compatible time-series database used as the long-term storage layer. Receives samples via Prometheus remote-write; queryable directly or through Grafana. 
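
As a rough illustration of the direct query path, the sketch below builds a Prometheus-compatible instant-query URL against VictoriaMetrics. It assumes the compose service name `victoriametrics` and port `8428` from this page; the `instant_query_url` helper and the `job` label value are hypothetical.

```python
from urllib.parse import urlencode


def instant_query_url(base_url: str, promql: str) -> str:
    """Build a Prometheus-compatible instant-query URL (/api/v1/query)."""
    return f"{base_url.rstrip('/')}/api/v1/query?{urlencode({'query': promql})}"


# Hypothetical query: the job label is an assumption for illustration.
url = instant_query_url("http://victoriametrics:8428", 'up{job="alphaswarm-core"}')
print(url)
```

The same URL shape works against Prometheus on its own port, which is what makes the parallel cutover transparent to dashboards.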

## Identity

| Field | Value |
| --- | --- |
| Service id | `victoriametrics` |
| Role | `observability` |
| Image | `victoriametrics/victoria-metrics:v1.108.0` |
| Port | `8428` (HTTP — write + query) |
| Health | `/health` |

## Deployment surfaces

| Surface | Where |
| --- | --- |
| Compose | service `victoriametrics` in [`alphaswarm_platform/compose/docker-compose.platform.yml`](../../../../../../alphaswarm_platform/compose/docker-compose.platform.yml) |
| Kustomize | [`observability/victoriametrics/`](../../../../../../alphaswarm_platform/deployments/kubernetes/observability/victoriametrics/) — single-node Deployment (in-cell); cluster mode (`vmstorage` / `vmselect` / `vminsert`) in cloud templates |

## Dependencies

**Upstream:** Prometheus (remote-write).

**Downstream:** Grafana (datasource).

## Operations

- **Retention:** 13 months default; 24 months for cells flagged `audit_evidence: true`.
- **Cardinality control:** label scrubbing + per-label value cap to prevent runaway growth from per-pod / per-task labels.
- **PromQL compatibility:** queries that work in Prometheus work here; some MetricsQL extensions (`histogram_quantiles`, etc.) are used in dashboards.

## See also

- [`prometheus.md`](prometheus.md) — sample source.
- [`grafana.md`](grafana.md) — primary query path.

# Terraform IaC control plane

> The runtime is the only sanctioned executor for `terraform plan/apply/destroy/refresh` operations. Routes / Celery tasks / MCP tools wrap it; nothing calls `subprocess.run(["terraform", ...])` direct...

# Terraform IaC control plane

Phase 7 of the multi-tenant rollout introduces the 6th sibling spec-runtime — **`TerraformRuntime`** — that joins `AgentRuntime`, `BotRuntime`, `RLRuntime`, `AnalysisRuntime`, and `WorkflowRuntime`. The runtime is the only sanctioned executor for `terraform plan/apply/destroy/refresh` operations.
Routes / Celery tasks / MCP tools wrap it; nothing calls `subprocess.run(["terraform", ...])` directly outside [`alphaswarm/terraform/runner.py::TerraformExecutor`](../alphaswarm/terraform/runner.py).

## Architecture

```mermaid
flowchart LR
    user["Operator / Agent"] --> rest["/terraform/* + /infra/* REST"]
    user --> mcp["data.terraform.* MCP tools"]
    rest --> runtime["TerraformRuntime"]
    mcp --> runtime
    runtime --> ledger["TerraformRun (Postgres)"]
    runtime --> celery["Celery 'terraform' queue"]
    celery --> runner["alphaswarm-terraform-runner pod"]
    runner --> executor["TerraformExecutor (subprocess)"]
    executor --> state["State backend: local / s3 / azurerm / gcs / hcp"]
    state --> aws["AWS provider"]
    state --> gcp["GCP provider"]
    state --> azure["Azure provider"]
    state --> local["local docker/k8s provider"]
    state --> hcp["HCP Terraform via HcpClient"]
    runtime --> kill["/terraform/halt kill-switch"]
    runtime --> policy["OPA Rego (PolicyChecker)"]
```

## Spec → version → run lifecycle

1. **Author a `TerraformStackSpec`** (Pydantic). Hash is SHA-256 of canonical JSON.
2. **`persist_spec(spec)`** creates a new `terraform_stack_spec_versions` row only when the hash changes (AGENTS rule 43).
3. **`TerraformRuntime(spec).plan(workspace_id=...)`** opens a `TerraformRun` row (rule 34: carries `experiment_id` + `test_id` FKs) and enqueues the plan task on the `terraform` Celery queue.
4. **Runner pod executes** `terraform init && terraform plan -out tfplan.binary`, captures stdout/stderr to files in the workspace dir, parses `terraform show -json tfplan.binary` into a structured plan summary, and optionally runs OPA Rego policies.
5. **Plan run lands in `awaiting_approval`.** The frontend `/infra/terraform/workspaces/[id]` page renders an "Apply this plan" button.
6. **`TerraformRuntime(spec).apply(plan_run_id=...)`** opens a child `TerraformRun`, executes `terraform apply tfplan.binary`, and snapshots the resulting state into a `TerraformStateVersion` row.
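
Steps 1 and 2 above can be sketched with the standard library. This is a minimal illustration under stated assumptions, not the runtime's implementation: "canonical JSON" is taken to mean sorted keys with compact separators, the spec is a plain dict rather than the Pydantic `TerraformStackSpec`, and `InMemoryVersionStore` is a hypothetical stand-in for the `terraform_stack_spec_versions` table.

```python
import hashlib
import json


def spec_hash(spec: dict) -> str:
    """SHA-256 over an assumed canonical JSON form (sorted keys, compact separators)."""
    canonical = json.dumps(spec, sort_keys=True, separators=(",", ":"), ensure_ascii=False)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


class InMemoryVersionStore:
    """Hypothetical stand-in for the terraform_stack_spec_versions table."""

    def __init__(self) -> None:
        self.versions: list[dict] = []

    def persist_spec(self, spec: dict) -> int:
        """Return the version id, appending a row only when the hash is new."""
        digest = spec_hash(spec)
        for row in self.versions:
            if row["hash"] == digest:
                return row["id"]  # unchanged content: reuse the existing version
        row = {"id": len(self.versions) + 1, "hash": digest, "spec": spec}
        self.versions.append(row)
        return row["id"]


store = InMemoryVersionStore()
v1 = store.persist_spec({"name": "storage", "module_kind": "storage_local"})
# Same content with a different key order hashes identically.
v1_again = store.persist_spec({"module_kind": "storage_local", "name": "storage"})
v2 = store.persist_spec({"name": "storage", "module_kind": "storage_aws"})
```

Sorting keys before hashing is what makes the hash depend on content rather than on field order, so re-submitting an unchanged spec never creates a new version row.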
## Code generation CDKTF was deprecated by HashiCorp on 2025-12-10. Python-side HCL generation uses **Jinja2 templates** under [`alphaswarm/terraform/codegen/templates/`](../alphaswarm/terraform/codegen/templates): - `storage_{aws,gcp,azure,local}.tf.j2` - `faas_local.tf.j2` (KEDA + per-queue ScaledObjects) - `agents_local.tf.j2` (bot pods with `alphaswarm-data-mcp` sidecar) - `secrets_local.tf.j2` (ESO + ClusterSecretStore + ExternalSecret per `secret_mappings`) - `generic.tf.j2` (fallback for `module_source` references) Operator-authored stacks live under [`alphaswarm_platform/terraform/modules/`](../alphaswarm_platform/terraform/modules/) and are reachable via `spec.module_source = "../../modules/storage"`. ## State backends Five backends are supported (`ALPHASWARM_TERRAFORM_STATE_BACKEND`): | Kind | Backend block | | -------- | ------------------------------------------ | | local | `terraform { backend "local" { ... } }` | | s3 | `backend "s3" { bucket / key / dynamodb }` | | azurerm | `backend "azurerm" { storage_account_name }` | | gcs | `backend "gcs" { bucket / prefix }` | | hcp | HCP Terraform via `HcpClient` | The HCP path uses [`alphaswarm/terraform/hcp_client.py`](../alphaswarm/terraform/hcp_client.py) (thin httpx wrapper around `app.terraform.io/api/v2`) — no `python-terrasnek` dep so cold installs without HCP credentials still boot cleanly. ## Bootstrap and reliability notes - During cold-start deployments, prefer CLI-first `terraform init/plan/apply` until API + Celery + Redis + Postgres are all healthy. - Control-plane-triggered Terraform actions require broker + worker availability to enqueue and stream progress. - `TerraformExecutor` retries transient `terraform init` provider/network failures with bounded exponential backoff. - Use `ALPHASWARM_TERRAFORM_CLI_CONFIG_FILE` to point at a Terraform CLI config that defines `provider_installation` mirror rules when registry access is unreliable. 
- Provider cache is shared through `ALPHASWARM_TERRAFORM_PLUGIN_CACHE_DIR`.

## Kill switch

`POST /terraform/halt` is the 7th endpoint fanned out by the topbar `KillSwitch` (alongside `/agents/halt`, `/quant-agents/halt`, `/paper/stop-all`, `/bots/halt-all`, `/rl/halt-all`, `/workflows/halt`). On halt, every `TerraformRun` in `queued`, `running`, or `awaiting_approval` is marked `cancelled` with `halted=True`.

## Policy gate (OPA)

`TerraformPolicyAttachment` rows bind a workspace to one or more OPA Rego policy files. The runtime calls [`PolicyChecker.check`](../alphaswarm/terraform/policy.py) after every plan; `hard_mandatory=True` attachments block the corresponding apply on violation. When `opa` is not on PATH the checker no-ops, so dev and CI environments without OPA installed still work.

## Frontend

Vite/React surfaces under [`alphaswarm_client/src/routes/infra/`](../alphaswarm_client/src/routes/infra/):

- `/infra` — 7 tabbed panes (overview / bots / queues / pipeline / secrets / k8s / canary) + a Terraform inline summary.
- `/infra/terraform` — workspace list with per-row Plan / Apply / Destroy (friction-gated).
- `/infra/terraform/workspaces/[id]` — workspace detail + run history + latest state outputs.
- `/infra/terraform/runs/[id]` — run detail with live WS progress stream (`/terraform/ws/runs/{id}`).
- `/infra/terraform/stacks` — stack spec catalog.
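
The halt behaviour described in the Kill switch section above can be sketched in a few lines of Python. This is a minimal in-memory illustration, not the runtime's code: the `Run` dataclass and `halt_runs` helper are hypothetical names, and only the three haltable statuses and the `cancelled` / `halted=True` outcome come from this page.

```python
from dataclasses import dataclass

# Statuses that the halt endpoint cancels, per the Kill switch section.
HALTABLE_STATUSES = {"queued", "running", "awaiting_approval"}


@dataclass
class Run:
    """Hypothetical in-memory stand-in for a TerraformRun row."""
    id: int
    status: str
    halted: bool = False


def halt_runs(runs: list[Run]) -> int:
    """Cancel every in-flight run and return how many were changed."""
    changed = 0
    for run in runs:
        if run.status in HALTABLE_STATUSES:
            run.status = "cancelled"
            run.halted = True
            changed += 1
    return changed


runs = [Run(1, "queued"), Run(2, "completed"), Run(3, "awaiting_approval")]
changed = halt_runs(runs)
```

Runs that already reached a terminal status are left untouched, so the halt is safe to repeat.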
## Where to look for X | Task | Path | | --- | --- | | Add a new module kind | [`alphaswarm/terraform/codegen/templates/`](../alphaswarm/terraform/codegen/templates/) + [`alphaswarm/persistence/models_terraform.py::TERRAFORM_MODULE_KINDS`](../alphaswarm/persistence/models_terraform.py) | | Add an MCP tool | [`alphaswarm/data/mcp/tools/terraform.py`](../alphaswarm/data/mcp/tools/terraform.py) | | Add a REST route | [`alphaswarm/api/routes/terraform.py`](../alphaswarm/api/routes/terraform.py) | | Add a Celery task | [`alphaswarm/tasks/terraform_tasks.py`](../alphaswarm/tasks/terraform_tasks.py) | | Edit the runner pod | [`alphaswarm_platform/terraform/modules/terraform_runner/main.tf`](../alphaswarm_platform/terraform/modules/terraform_runner/main.tf) | | Add a state backend | [`alphaswarm/terraform/codegen/wrapper.py`](../alphaswarm/terraform/codegen/wrapper.py) | | Add an OPA policy | Reference the file URI via `TerraformPolicyAttachment.policy_set_uri` | # Worker vs executor images > Why the Celery surface is split into two purpose-built images — a slim orchestration worker and a heavy-compute executor — and the dependency / queue matrix that keeps them apart. # Worker vs executor images The AlphaSwarm Celery surface is split into **two** purpose-built, migration-ready container images (Phase 4c): - **`alphaswarm-worker`** — slim **orchestration** worker. Task dispatch, lineage, paper-trading loop, terraform/ingestion/workflow coordination. - **`alphaswarm-executor`** — **heavy-compute** executor. Backtests, RL / ML training, factor builds, agent-emitted strategy code, RAG ingest. ## Why split Historically `worker` and `beat` had **no image of their own** — the `alphaswarm_images` catalogue pinned `worker = { target = "api" }`, so the orchestration worker dragged the entire API stage (Dash, `visualization`, `dev` tooling) plus the full ML/RL surface into one fat image. Two problems followed: 1. 
**Bloat & blast radius.** A lineage callback worker carried PyTorch, Ray, vectorbt-pro, forecasting libs — slow to pull, large attack surface, slow cold-start. 2. **Scaling mismatch.** Light coordination tasks (sub-second, IO-bound) and heavy compute tasks (minutes–hours, CPU/GPU/RAM-bound) have opposite scaling and resource profiles, but shared one Deployment. Splitting lets each image carry only what its queues need, and lets each scale and be resourced independently. ## Queue ↔ image matrix The queue assignment is identical across the root `Dockerfile`, the standalone per-service Dockerfiles, the K8s manifests, both compose files, and the `faas` KEDA module (`local.heavy_queues`). **A queue is never drained by both images.** | Queue | Image | Why | | --- | --- | --- | | `default` | worker | bookkeeping, lineage, callbacks | | `paper` | worker | sub-second paper-trading loop (latency-sensitive) | | `terraform` | worker | `TerraformRuntime` apply/destroy wrappers | | `ingestion` | worker | connector pulls (IO-bound, long-lived) | | `workflows` | worker | `WorkflowRuntime` orchestration | | `backtest` | executor | vbt-pro / event-driven / Lean engine runs | | `training` | executor | RL rollouts + finetune jobs (GPU) | | `ml` | executor | ML pipelines, predictor refresh | | `agents` | executor | CrewAI / LangGraph agent runs | | `factors` | executor | factor-zoo builds, alpha tests | | `rag` | executor | RAG ingest, embedding refresh | ## Dependency surface Both images share the multi-arch (`linux/amd64+arm64`) Chainguard Wolfi base, nonroot UID `65532`, and the `CredentialResolver`-only secret rule (nothing baked into the image). 
They differ only in installed extras: | | worker | executor | | --- | --- | --- | | Base extras | `otel, cli, iceberg, entity-graph, dagster-alphaswarm` | same | | Distributed compute | `compute-dask, compute-ray` | `compute-dask, compute-ray` | | ML / RL / forecasting | — | `ml, ml-torch, ml-forecast, ml-anomaly` | | Portfolio | — | `portfolio` | | Native build deps | — | `gfortran`, `linux-headers` (numpy/scipy/forecast wheels) | | Extra dirs | `/app/data` | `/app/data`, `/app/models` | | Default concurrency | 4 | 2 | | Resource requests | `500m CPU / 1Gi` | `1 CPU / 4Gi` | | Resource limits | `4 CPU / 8Gi` | `8 CPU / 16Gi` | ## Where the images are defined | Surface | Worker | Executor | | --- | --- | --- | | Root multi-stage target | `worker` in [`Dockerfile`](../../../../../alphaswarm_platform/Dockerfile) | `executor` in [`Dockerfile`](../../../../../alphaswarm_platform/Dockerfile) | | Standalone Dockerfile | [`build/docker/alphaswarm_worker/`](../../../../../alphaswarm_platform/build/docker/alphaswarm_worker/) | [`build/docker/alphaswarm_executor/`](../../../../../alphaswarm_platform/build/docker/alphaswarm_executor/) | | Image catalogue | `worker` / `beat` → target `worker` | `executor` → target `executor` | | ECR repo | `alphaswarm-worker` | `alphaswarm-executor` | | Kustomize base | `base/alphaswarm-worker/` | `base/alphaswarm-executor/` | | Compose | `worker` (legacy) / `alphaswarm-worker` | `worker-gpu` (legacy) / `alphaswarm-executor` | ## Migration readiness The two images are intentionally self-contained — a standalone Dockerfile, its own ECR repo, its own image-catalogue entry, its own Kustomize base, and its own topology entry — so the build assets can be lifted into a dedicated repository in a future migration without untangling them from the API image. ## See also - [`alphaswarm-worker`](services/alphaswarm-worker.md) — orchestration worker service doc. 
- [`alphaswarm-executor`](services/alphaswarm-executor.md) — heavy-compute executor service doc. - [`services.md`](services.md) — full service catalogue. - [`faas` Terraform module](../../../../../alphaswarm_platform/terraform/modules/faas/) — KEDA per-queue scaling. # AlphaSwarm Monorepo Paths > Canonical path contract for this repository. Sibling repos (`rpi_kubernetes`, `theia-ide`, `alphaswarm_platform_admin`) mirror this table in their own `alphaswarm_docs/alphaswarm-monorepo-paths.md` files # AlphaSwarm Monorepo Paths Status: active. Canonical path contract for this repository. Sibling repos (`rpi_kubernetes`, `theia-ide`, `alphaswarm_platform_admin`) mirror this table in their own `alphaswarm_docs/alphaswarm-monorepo-paths.md` files. | AlphaSwarm responsibility | Path | | --- | --- | | Control plane | `alphaswarm_controller/` | | Shared platform contracts | `alphaswarm_core/` | | Active client (Vite) | `alphaswarm_client/` | | Bot runtime/templates | `alphaswarm_bots/` | | RL subsystem | `alphaswarm_rl/` (`src/alphaswarm_rl/` source; `tasks/`, `api/routes/`, `configs/`, `tests/` siblings) | | Custom model boundary | `alphaswarm_models/` (`src/alphaswarm_models/` source incl. 
`serving/`; `tasks/`, `api/routes/`, `configs/`, `tests/` siblings) | | Snippet corpus | `alphaswarm_snippets/` | | Monolith runtime | `alphaswarm/` | | Standalone operator CLI | `alphaswarm_cli/` | | Internal admin (services + accounts) | `alphaswarm_admin/` | | Vendored Theia IDE workspace | `alphaswarm_ide/` | | Curator-owned project index (SSoT) | `alphaswarm_index/` | | Canonical documentation | `alphaswarm_docs/` | | Hosted-platform single home | `alphaswarm_platform/` | | Kubernetes workloads | `alphaswarm_platform/deployments/kubernetes/` | | Terraform modules + environments | `alphaswarm_platform/terraform/` | | Multi-arch Dockerfiles + config gen | `alphaswarm_platform/build/` | | Legacy / edge component configs | `alphaswarm_platform/deploy/` | | Root-level compose files | `alphaswarm_platform/compose/` | | Multi-stage root Dockerfile | `alphaswarm_platform/Dockerfile` | | Deployment topology YAML | `alphaswarm_platform/configs/deployment/topology.yaml` | | Terraform stack YAMLs | `alphaswarm_platform/configs/terraform/` | | Cluster install scripts | `alphaswarm_platform/scripts/cluster_install/` | Compatibility stubs and historical paths (do not add active source here): | Legacy path | Points to | | --- | --- | | `frontend/` | `alphaswarm_client/` | | `extractions/` | `alphaswarm_snippets/extractions/` | | `inspiration/` | `alphaswarm_snippets/inspiration/` (ignored raw repos) | | `alphaswarm/bots/` | `alphaswarm_bots/` (import shim) | | `alphaswarm/rl/` | `alphaswarm_rl/src/alphaswarm_rl/` (deprecation-warning import shim; `pkgutil.walk_packages` aliases every submodule under `alphaswarm.rl.*`) | | `alphaswarm/ml/` | `alphaswarm_models/src/alphaswarm_models/` (deprecation-warning import shim; `pkgutil.walk_packages` aliases every submodule under `alphaswarm.ml.*`) | | `alphaswarm/llm/vllm_runner.py` | `alphaswarm_models/src/alphaswarm_models/serving/vllm.py` (one-line re-export shim) | | `alphaswarm/llm/ollama_client.py` | 
`alphaswarm_models/src/alphaswarm_models/serving/ollama.py` (one-line re-export shim) | | `alphaswarm/tasks/rl_tasks.py` | `alphaswarm_rl/tasks/rl_tasks.py` (Celery `name=` strings preserved for in-flight queue messages) | | `alphaswarm/tasks/ml_tasks.py` | `alphaswarm_models/tasks/ml_tasks.py` | | `alphaswarm/tasks/ml_test_tasks.py` | `alphaswarm_models/tasks/ml_test_tasks.py` | | `alphaswarm/tasks/finetune_tasks.py` | `alphaswarm_models/tasks/finetune_tasks.py` | | `alphaswarm/tasks/training_tasks.py` | `alphaswarm_models/tasks/training_tasks.py` | | `alphaswarm/api/routes/rl.py` | `alphaswarm_rl/api/routes/rl.py` (FastAPI mount path `/rl` unchanged) | | `alphaswarm/api/routes/ml.py` | `alphaswarm_models/api/routes/ml.py` (FastAPI mount path `/ml` unchanged) | | `alphaswarm/api/routes/analytics_ml.py` | `alphaswarm_models/api/routes/analytics_ml.py` (FastAPI mount path `/analytics/ml` unchanged) | | `configs/rl/` | `alphaswarm_rl/configs/` | | `configs/ml/` | `alphaswarm_models/configs/` | | `tests/rl/` | `alphaswarm_rl/tests/` | | `tests/ml/` | `alphaswarm_models/tests/` | | `docs/` | `alphaswarm_docs/` (renamed; all references updated) | | root `deployments/` | `alphaswarm_platform/deployments/` | | root `build/` | `alphaswarm_platform/build/` | | root `deploy/` | `alphaswarm_platform/deploy/` | | root `terraform/` | `alphaswarm_platform/terraform/` | | root `Dockerfile` | `alphaswarm_platform/Dockerfile` | | root `.dockerignore` | `alphaswarm_platform/.dockerignore` | | root `docker-compose.yml` | `alphaswarm_platform/compose/docker-compose.yml` | | root `docker-compose.platform.yml` | `alphaswarm_platform/compose/docker-compose.platform.yml` | | root `docker-compose.viz.yml` | `alphaswarm_platform/compose/docker-compose.viz.yml` | | `configs/deployment/` | `alphaswarm_platform/configs/deployment/` | | `configs/terraform/` | `alphaswarm_platform/configs/terraform/` | | `scripts/cluster_install/` | `alphaswarm_platform/scripts/cluster_install/` | # Architecture 
> Top-down map of the AlphaSwarm platform: the spec-runtime pattern, the data and agentic planes, the four edge surfaces, and the request lifecycle every dispatch shares. # Architecture > Human entry point. Pair with the AI-agent entry point at > [AGENTS.md](https://github.com/julianwileymac/alphaswarm/blob/main/AGENTS.md) > and the doc map at [/intro](../../intro/index.md). > > Cold-start path: [/intro/quickstart](../../intro/quickstart.md). > Deployment path: [how-to/operations/local-setup](../../how-to/operations/local-setup.md) > or [how-to/operations/kubernetes-deploy](../../how-to/operations/kubernetes-deploy.md). AlphaSwarm is a **local-first, agentic quantitative research and trading platform**. Every LLM call, every backtest, every reinforcement-learning rollout, and every piece of metadata stays on local hardware — no proprietary alpha leaves the box. The codebase distills patterns from Microsoft Qlib, AI4Finance FinRL, QuantConnect Lean, OpenBB, vnpy, and TradingAgents into one coherent platform. The platform is organised around **four invariants** that hold across every subsystem: 1. **Hash-locked spec runtimes.** `AgentSpec`, `BotSpec`, `RLExperimentSpec`, and `AnalysisSpec` each have a single sanctioned executor (`AgentRuntime` / `BotRuntime` / `RLRuntime` / `AnalysisRuntime`). Any spec change creates a new immutable `*_spec_versions` row; old versions stay forever for replay. 2. **Medallion lakehouse.** Every Iceberg write goes through [`iceberg_catalog.append_arrow`](https://github.com/julianwileymac/alphaswarm/blob/main/alphaswarm/data/iceberg_catalog.py) with a declared bronze / silver / gold layer; agents read through `data.*` MCP tools, never raw ORM. 3. 
**One LLM gateway, one progress bus.** Every model call routes through [`router_complete`](https://github.com/julianwileymac/alphaswarm/blob/main/alphaswarm/llm/providers/router.py); every Celery task emits canonical progress frames through [`alphaswarm.tasks._progress`](https://github.com/julianwileymac/alphaswarm/blob/main/alphaswarm/tasks/_progress.py). 4. **Topology is data, not code.** Service URLs, MCP audiences, and credential references resolve through [`alphaswarm_platform/configs/deployment/topology.yaml`](https://github.com/julianwileymac/alphaswarm/blob/main/alphaswarm_platform/configs/deployment/topology.yaml). ## System component diagram ```mermaid flowchart TB subgraph clients [Clients] Browser["alphaswarm_client (Vite :3001)"] CloudUI["alphaswarm_ui (Next.js cloud)"] Admin["alphaswarm_admin (manage.alpha-swarm.ai)"] CLI["alphaswarm-cli (device-flow auth)"] IDE["alphaswarm_ide (Theia 1.72)"] Agents["IDE agents (Cursor / Claude / Continue)"] end subgraph edge [Cloudflare edge] DocsEdge["docs.alpha-swarm.ai (Pages)"] DocsMcp["docs MCP Worker (RFC 9728+8707)"] StatusEdge["status.alpha-swarm.ai (Instatus)"] TunnelEdge["alphaswarm-fund-edge tunnel"] end subgraph api [API gateway (alphaswarm/api)] FastAPI["FastAPI :8000"] DataMcp["/mcp/data"] CodeMcp["/mcp/codebase"] WS["WebSocket relay"] end subgraph cp [Control plane (alphaswarm_controller)] ManageApi["alphaswarm-cp :9000 (manage.alpha-swarm.ai)"] TfRuntime["TerraformRuntime"] WlRuntime["WorkloadRuntime"] end subgraph runtimes [Spec runtimes] AgentRt["AgentRuntime"] BotRt["BotRuntime (alphaswarm_bots)"] RlRt["RLRuntime (alphaswarm_rl)"] AnaRt["AnalysisRuntime"] WfRt["WorkflowRuntime"] end subgraph workers [Celery workers] WDefault["worker (default / backtest / agents / paper)"] WTraining["worker-gpu (training queue)"] WTerraform["worker-terraform"] Beat["beat (cron)"] end subgraph runtime [Backends] Redis[(Redis 7)] Postgres[(PostgreSQL 16 + pgvector + RLS)] Iceberg["Iceberg lakehouse (bronze / silver / 
gold)"] Hudi["Hudi (upsert-heavy)"] DuckDB["DuckDB views"] Mlflow["MLflow"] R2[("R2 (Logpush 365d)")] end subgraph llms [LLM tier] Ollama["Ollama (host)"] Vllm["vLLM (compose --profile vllm)"] Sera["SERA-32B (Modal, opt-in)"] Router["router_complete + LiteLLM"] end subgraph observability [Observability] OTEL["OTEL collector :4317"] Jaeger["Jaeger"] Posthog["PostHog Cloud EU"] Plausible["Plausible (cookieless)"] end Browser --> FastAPI CloudUI --> FastAPI CloudUI --> ManageApi Admin --> ManageApi CLI --> FastAPI IDE --> FastAPI Agents -.MCP.-> DataMcp Agents -.MCP.-> CodeMcp Agents -.MCP.-> DocsMcp Browser -.DocsPanel.-> DocsEdge DocsEdge --> DocsMcp TunnelEdge --> FastAPI TunnelEdge --> ManageApi TunnelEdge --> Browser FastAPI --> AgentRt FastAPI --> BotRt FastAPI --> RlRt FastAPI --> AnaRt FastAPI --> WfRt WfRt --> AgentRt WfRt --> RlRt WfRt --> BotRt AgentRt -.tasks.-> Redis BotRt -.tasks.-> Redis RlRt -.tasks.-> Redis Beat -.cron.-> Redis Redis -.dispatch.-> WDefault Redis -.dispatch.-> WTraining Redis -.dispatch.-> WTerraform WDefault --> Postgres WDefault --> Iceberg WDefault --> Hudi WDefault --> Router WTraining --> Mlflow WTraining --> Router ManageApi --> TfRuntime ManageApi --> WlRuntime TfRuntime --> WTerraform Iceberg --> DuckDB Router --> Ollama Router -.optional.-> Vllm Router -.opt-in.-> Sera FastAPI -.spans.-> OTEL ManageApi -.spans.-> OTEL OTEL --> Jaeger DocsEdge -.events.-> Posthog DocsEdge -.pageviews.-> Plausible DocsEdge -.logpush.-> R2 ``` Solid lines are default-profile data paths; dotted lines are opt-in / asynchronous. ## The four edge surfaces AlphaSwarm exposes four hostnames, each behind its own Cloudflare property: - **`alpha-swarm.ai`** — operator UI ([alphaswarm_client](https://github.com/julianwileymac/alphaswarm/tree/main/alphaswarm_client)). Vite + React 19 + Tailwind 4 + shadcn/ui. Routes the topbar KillSwitch, paper trading dashboards, RL Lab, Analysis Lab, Workflow Studio, Data Hub. 
- **`api.alpha-swarm.ai`** — public API ([alphaswarm/api](https://github.com/julianwileymac/alphaswarm/tree/main/alphaswarm/api)). FastAPI gateway, 30+ route modules, Stripe-style date epochs (first epoch `2026-06-01`).
- **`manage.alpha-swarm.ai`** — control plane ([alphaswarm_controller](https://github.com/julianwileymac/alphaswarm/tree/main/alphaswarm_controller)). Workload lifecycle, TerraformRuntime, IdP wiring. Never imports `alphaswarm.*`.
- **`docs.alpha-swarm.ai`** — documentation ([alphaswarm_docs](https://github.com/julianwileymac/alphaswarm/tree/main/alphaswarm_docs)). Docusaurus 3 on Cloudflare Pages. Pages Functions for content-negotiation, sanitised page fragments, and the "Was this helpful?" feedback loop. Standalone MCP Worker at `/mcp` (RFC 9728 + 8707 compliant per AGENTS rule 49).

Plus two adjacent zones:

- **`status.alpha-swarm.ai`** — Instatus status page. Separate Cloudflare zone so it stays up when the cluster is degraded.
- **`archive.alpha-swarm.ai`** — frozen Stripe-style API epochs after the 12-month sunset window.

## Request lifecycle

Every spec-driven dispatch — backtest, agent run, RL training, analysis flow, workflow — follows the same canonical shape. Two contracts are new since the prior version of this doc:

- **Hash-lock first.** Before any work happens, the runtime computes the spec's SHA-256, looks for a matching `*_spec_versions` row, and inserts a new immutable row if the content changed.
- **Kill switch reachable.** Every long-running runtime is in the topbar [KillSwitch](https://github.com/julianwileymac/alphaswarm/blob/main/alphaswarm_client/src/components/common/KillSwitch.tsx) fan-out list. The runtime checks `should_halt` on every step.
```mermaid
sequenceDiagram
    actor User
    participant UI as alphaswarm_client
    participant API as FastAPI
    participant Runtime as Spec runtime
    participant Versions as *_spec_versions
    participant Redis
    participant Worker as Celery worker
    participant Postgres as Run ledger
    participant Iceberg

    User->>UI: Click "Run"
    UI->>API: POST //run { spec_yaml }
    API->>Runtime: instantiate(spec)
    Runtime->>Versions: lookup-or-insert by spec_hash
    Versions-->>Runtime: version_id (existing OR new)
    Runtime->>Postgres: insert run row (status=pending, spec_version_id)
    Runtime->>Redis: enqueue task (idempotent by run_id)
    API-->>UI: 202 Accepted { task_id, run_id, stream_url }
    UI->>API: WebSocket /chat/stream/{task_id}
    Worker->>Redis: dequeue
    Worker->>Postgres: load spec_version + run
    loop per step
        Worker->>Worker: runtime.step()
        Worker->>Worker: check should_halt()
        Worker->>Iceberg: append_arrow (medallion-tagged)
        Worker->>Redis: emit progress frame
        Redis-->>UI: WebSocket frame
    end
    Worker->>Postgres: update run (status=completed, metrics)
    Worker->>Redis: emit_done(task_id, result)
    Redis-->>UI: stage=done frame
    UI-->>User: render summary
```

The frame envelope is `{task_id, stage, message, timestamp, **extras}` per AGENTS rule 4. The `should_halt` check makes every spec-runtime an immediate stop target for the topbar kill switch.

## Repository map

The monorepo is organised by responsibility. Each top-level package has its own `AGENTS.md` enforcing strict boundaries; cross-package imports are blocked in CI.
| Package | Role | Owner | Public-surface contract | | --- | --- | --- | --- | | [alphaswarm/](https://github.com/julianwileymac/alphaswarm/tree/main/alphaswarm) | Quant runtime — strategies, backtests, agents, RAG, Iceberg | `platform-team` | [alphaswarm/api/main.py::create_app](https://github.com/julianwileymac/alphaswarm/blob/main/alphaswarm/api/main.py) | | [alphaswarm_controller/](https://github.com/julianwileymac/alphaswarm/tree/main/alphaswarm_controller) | Workload lifecycle + Terraform driver + provider adapters | `platform-team` | [alphaswarm_controller/main.py::create_app](https://github.com/julianwileymac/alphaswarm/blob/main/alphaswarm_controller/src/alphaswarm_controller/main.py); NEVER imports `alphaswarm.*` | | [alphaswarm_core/](https://github.com/julianwileymac/alphaswarm/tree/main/alphaswarm_core) | Shared value types, ABCs, auth/resource filters, topology | `platform-team` | Dependency-light; consumed by both `alphaswarm/` and `alphaswarm_controller/` | | [alphaswarm_client/](https://github.com/julianwileymac/alphaswarm/tree/main/alphaswarm_client) | Active Vite + React 19 operator UI at `alpha-swarm.ai` | `platform-team` | `pnpm --filter alphaswarm_client dev` | | [alphaswarm_ui/](https://github.com/julianwileymac/alphaswarm/tree/main/alphaswarm_ui) | Cloud-hosted Next.js PaaS frontend (dual Auth0 + Entra) | `platform-team` | Never imports `alphaswarm.*` / `alphaswarm_controller.*` | | [alphaswarm_admin/](https://github.com/julianwileymac/alphaswarm/tree/main/alphaswarm_admin) | Internal admin at `manage.alpha-swarm.ai` (audit-first) | `platform-team` | Mirrors `alphaswarm_controller` boundary | | [alphaswarm_rl/](https://github.com/julianwileymac/alphaswarm/tree/main/alphaswarm_rl) | RL stack — `RLExperimentSpec` + `RLRuntime` + Iceberg trajectories | `rl-team` | Legacy `alphaswarm.rl.*` is a deprecation shim | | [alphaswarm_models/](https://github.com/julianwileymac/alphaswarm/tree/main/alphaswarm_models) | ML framework, custom model serving 
(vLLM + Ollama), AlphaBacktestExperiment | `ml-team` | Legacy `alphaswarm.ml.*` + `alphaswarm/llm/{vllm_runner,ollama_client}.py` are deprecation shims | | [alphaswarm_bots/](https://github.com/julianwileymac/alphaswarm/tree/main/alphaswarm_bots) | Bot templates + `BotRuntime` (smallest deployable unit) | `agentic-team` | YAML at `alphaswarm_bots/templates/{trading,research}/` | | [alphaswarm_ide/](https://github.com/julianwileymac/alphaswarm/tree/main/alphaswarm_ide) | Theia 1.72 IDE + six AlphaSwarm extensions | `platform-team` | Canonical entrypoint: `alphaswarm-cli ide` | | [alphaswarm_cli/](https://github.com/julianwileymac/alphaswarm/tree/main/alphaswarm_cli) | Standalone operator CLI (HTTP-only, device-flow auth) | `platform-team` | Never imports `alphaswarm.*` / `alphaswarm_controller.*` | | [alphaswarm_platform/](https://github.com/julianwileymac/alphaswarm/tree/main/alphaswarm_platform) | Hosted-platform deployment + IaC + build assets | `infra-team` | No `import alphaswarm.*`; `TerraformRuntime`-only | | [alphaswarm_index/](https://github.com/julianwileymac/alphaswarm/tree/main/alphaswarm_index) | Curator-owned single source of truth | `docs-team` | Sole-writer is the `alphaswarm-index-curator` subagent | | [alphaswarm_docs/](https://github.com/julianwileymac/alphaswarm/tree/main/alphaswarm_docs) | This site (Docusaurus 3 on Cloudflare Pages) | `docs-team` | Quality gates in [.github/workflows/docs-ci.yml](https://github.com/julianwileymac/alphaswarm/blob/main/.github/workflows/docs-ci.yml) | | [alphaswarm_snippets/](https://github.com/julianwileymac/alphaswarm/tree/main/alphaswarm_snippets) | Curated knowledge + extractions + inspiration trees | `docs-team` | Runtime code MUST NOT import this tree | Inside `alphaswarm/` the subsystems map one-to-one to concept docs: | `alphaswarm//` | Doc | | --- | --- | | [agents/](https://github.com/julianwileymac/alphaswarm/tree/main/alphaswarm/agents) | [agentic-pipeline](../agentic/agentic-pipeline.md), 
[agents](../agentic/agents.md), [workflow-studio](../agentic/workflow-studio.md), [multi-agent-patterns](../agentic/multi-agent-patterns.md) | | [analysis/](https://github.com/julianwileymac/alphaswarm/tree/main/alphaswarm/analysis) | [analysis-framework](../strategy/analysis-framework.md), [analysis-lab](../strategy/analysis-lab.md), [analysis-flows](../strategy/analysis-flows.md) | | [api/](https://github.com/julianwileymac/alphaswarm/tree/main/alphaswarm/api) | [reference/api](../../reference/api/index.mdx) (auto-generated) | | [backtest/](https://github.com/julianwileymac/alphaswarm/tree/main/alphaswarm/backtest) | [backtest-engines](../strategy/backtest-engines.md), [vbtpro-integration](../strategy/vbtpro-integration.md), [hft-backtest](../strategy/hft-backtest.md) | | [cli/](https://github.com/julianwileymac/alphaswarm/tree/main/alphaswarm/cli) | [providers](../data/providers.md) | | [codebase/](https://github.com/julianwileymac/alphaswarm/tree/main/alphaswarm/codebase) | [codebase-mcp](../data/codebase-mcp.md) | | [core/](https://github.com/julianwileymac/alphaswarm/tree/main/alphaswarm/core) | [core-types](./core-types.md) | | [data/](https://github.com/julianwileymac/alphaswarm/tree/main/alphaswarm/data) | [data-plane](../data/data-plane.md), [data-catalog](../data/data-catalog.md), [data-mcp](../data/data-mcp.md), [datasets-catalog](../data/datasets-catalog.md), [data-discovery](../data/data-discovery.md), [airbyte-builder](../data/airbyte-builder.md), [dagster-sandbox](../data/dagster-sandbox.md) | | [llm/](https://github.com/julianwileymac/alphaswarm/tree/main/alphaswarm/llm) | [providers](../data/providers.md), [sera](../data/sera.md) | | [persistence/](https://github.com/julianwileymac/alphaswarm/tree/main/alphaswarm/persistence) | [domain-model](./domain-model.md), [erd](./erd.md), [reference/data-dictionary](../../reference/data-dictionary/index.mdx) | | [providers/](https://github.com/julianwileymac/alphaswarm/tree/main/alphaswarm/providers) | 
[data-plane](../data/data-plane.md) | | [risk/](https://github.com/julianwileymac/alphaswarm/tree/main/alphaswarm/risk) | [paper-trading](../trading/paper-trading.md) | | [streaming/](https://github.com/julianwileymac/alphaswarm/tree/main/alphaswarm/streaming) | [streaming](../data/streaming.md), [streaming-admin](../data/streaming-admin.md), [live-market](../data/live-market.md) | | [tasks/](https://github.com/julianwileymac/alphaswarm/tree/main/alphaswarm/tasks) | [agent-watchdog](../data/agent-watchdog.md) | | [trading/](https://github.com/julianwileymac/alphaswarm/tree/main/alphaswarm/trading) | [paper-trading](../trading/paper-trading.md), [paper-metadata-gate](../trading/paper-metadata-gate.md) | | [ws/](https://github.com/julianwileymac/alphaswarm/tree/main/alphaswarm/ws) | [observability](../trading/observability.md) | | [ui/](https://github.com/julianwileymac/alphaswarm/tree/main/alphaswarm/ui) | **Deprecated** (legacy Solara) — rollback only | For the full canonical repository-split contract (boundaries, import guards, future extraction map) read [repository-split](./repository-split.md). For the file-by-file path contract for cross-repo references read [alphaswarm-monorepo-paths](./alphaswarm-monorepo-paths.md). ## Hard rules (cardinal subset) Every contributor reads the full 55 hard rules in [AGENTS.md](https://github.com/julianwileymac/alphaswarm/blob/main/AGENTS.md). The cardinal subset that surfaces in this doc: - **Rule 1.** `Symbol.parse(vt_symbol)` only. Never split a `vt_symbol` on `.`. - **Rule 2.** All LLM calls go through `router_complete`. - **Rule 3.** All Iceberg writes go through `iceberg_catalog.append_arrow`. - **Rule 4.** All progress emits use the canonical frame envelope. - **Rule 5.** All cross-task state goes through Postgres; never pickle ORM objects. 
- **Rules 12-19, 23-25, 40-41.** The five spec runtimes (`AgentRuntime`, `BotRuntime`, `RLRuntime`, `AnalysisRuntime`, `WorkflowRuntime`) are the only sanctioned executors for their respective specs. Specs are immutable once committed; behaviour changes always create a new version row.
- **Rule 22.** Agents NEVER read Postgres / Iceberg directly. Every catalog / dataset / entity read goes through a registered `DataMCPTool`.
- **Rules 42-45.** TerraformRuntime owns all `terraform apply`; WorkloadRuntime owns all runtime workload ops; both write to the `workload_runs` + `terraform_runs` audit ledgers before executing.
- **Rule 47.** Service URLs resolve through the topology service; AlphaSwarm is cluster-agnostic.
- **Rule 49.** Every MCP server is RFC 9728 + 8707 conformant.
- **Rule 52.** Step-up MFA (RFC 9470) on every halt + every destructive surface.

## Worked example: trace your first request

Goal: dispatch a backtest, watch the WebSocket frames, inspect the ledger row and the Iceberg gold output — without leaving this page.

### Step 1 — dispatch

The example below targets your local compose stack at `http://localhost:8000`. Hit "Run" to fire a sample momentum backtest.

### Step 2 — tail the WebSocket

Switch to your terminal and tail the canonical progress frames:

```bash
curl -N http://localhost:8000/chat/stream/
```

You will see frames in the `{task_id, stage, message, timestamp, **extras}` shape. Stages: `start` → `bar.processed` (×N) → `done` (carries the final `BacktestResult`).

### Step 3 — inspect the ledger

Pyodide can run this synchronous SQL via DuckDB against a small parquet snapshot of `backtest_runs`:

When pointed at the real platform, replace the inline list with a [/data/exports](../../how-to/recipes/query-data-via-mcp.md) MCP call and the same SQL works against the actual ledger snapshot.
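
As a dependency-free illustration of the kind of ledger query Step 3 describes, the sketch below loads a small inline list of rows and selects the runs with a non-NULL `sharpe`. It is an assumption-heavy stand-in: it uses Python's built-in `sqlite3` instead of DuckDB so that it runs anywhere, the three-column shape (`run_id`, `status`, `sharpe`) is a simplification of `backtest_runs`, and the row values are made up for illustration.

```python
import sqlite3

# Hypothetical inline snapshot of backtest_runs. Only three columns are
# modelled and every value here is invented for the example.
ROWS = [
    ("run-001", "completed", 1.42),
    ("run-002", "pending", None),
    ("run-003", "completed", 0.87),
]


def completed_runs(rows):
    """Return (run_id, sharpe) for rows with a non-NULL sharpe, best first."""
    con = sqlite3.connect(":memory:")
    try:
        con.execute(
            "CREATE TABLE backtest_runs (run_id TEXT, status TEXT, sharpe REAL)"
        )
        con.executemany("INSERT INTO backtest_runs VALUES (?, ?, ?)", rows)
        cur = con.execute(
            "SELECT run_id, sharpe FROM backtest_runs "
            "WHERE sharpe IS NOT NULL ORDER BY sharpe DESC"
        )
        return cur.fetchall()
    finally:
        con.close()
```

Calling `completed_runs(ROWS)` returns only the two completed runs. The `SELECT` itself is plain SQL, so it should carry over to a DuckDB connection with little or no change.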
### Step 4 — read the Iceberg gold output

```python
from pyiceberg.catalog import load_catalog

# Placeholder: substitute the run_id returned by the Step 1 dispatch response.
run_id = "REPLACE_WITH_RUN_ID"

cat = load_catalog("alphaswarm")
table = cat.load_table(f"alphaswarm_gold_backtests.run_{run_id}")

# Materialise the scan as a DataFrame and show the last ten rows.
df = table.scan().to_pandas()
print(df[["timestamp", "equity", "drawdown"]].tail(10))
```

### Step 5 — verify

- A `backtest_runs` row with non-NULL `sharpe` exists.
- The WebSocket emitted a `stage=done` frame with the same `run_id`.
- An `alphaswarm_gold_backtests.run_` Iceberg table is queryable.
- The `KillSwitch` topbar element shows a green status.

### What next

- Run the full walkthrough in [tutorials/first-backtest](../../tutorials/first-backtest.md).
- Author a custom strategy: [how-to/recipes/add-a-strategy](../../how-to/recipes/add-a-strategy.md).
- Promote the backtest to paper: [how-to/recipes/promote-a-bot-to-paper](../../how-to/recipes/promote-a-bot-to-paper.md).
- Replace the single-strategy dispatch with a multi-node workflow: [tutorials/first-agent-workflow](../../tutorials/first-agent-workflow.md) + [concepts/agentic/workflow-studio](../agentic/workflow-studio.md).

## Deployment modes

### docker-compose (default)

```bash
docker compose up -d
```

Brings up `redis`, `postgres`, `alphaswarm-core`, `alphaswarm-worker`, `alphaswarm-beat`, `alphaswarm-client`, `chromadb`, `mlflow`, `otel-collector`, `jaeger`. The Iceberg catalog runs in PyIceberg SQL mode against the host bind mount under `data/iceberg/`.

Optional profiles:

- `--profile streaming` — adds Redpanda + Flink for live market data.
- `--profile vllm` — adds a containerised vLLM inference server.
- `--profile legacy` — restores the older MinIO + iceberg-rest topology for rollback only.
### Native dev (no Docker) ```bash pip install -e ".[full,dev]" alembic upgrade head uvicorn alphaswarm.api.main:app --reload celery -A alphaswarm.tasks.celery_app worker --loglevel=info ``` ### Kubernetes ```bash make deploy-k8s ENV=prod ``` Manifests live under [alphaswarm_platform/deployments/kubernetes/](https://github.com/julianwileymac/alphaswarm/tree/main/alphaswarm_platform/deployments/kubernetes). The TerraformRuntime owns every `terraform apply`; see [how-to/operations/kubernetes-deploy](../../how-to/operations/kubernetes-deploy.md) and [how-to/operations/alphaswarm-fund-blue-green-cutover](../../how-to/operations/alphaswarm-fund-blue-green-cutover.md). ### Cloudflare Pages (docs only) `docs.alpha-swarm.ai` deploys via the [cloudflare_pages_docs](https://github.com/julianwileymac/alphaswarm/tree/main/alphaswarm_platform/terraform/modules/cloudflare_pages_docs) Terraform module — out of cluster, on the edge, behind Cloudflare Access for `/internal/*` and `/enterprise/*`. ## Where to start ```mermaid flowchart LR contributor[New contributor] --> human["this page (architecture)"] contributor --> agent["AGENTS.md (root)"] human --> intro["intro/quickstart"] agent --> intro intro --> diataxis["Diataxis pick"] diataxis --> conceptsPick[concepts] diataxis --> howtoPick[how-to] diataxis --> tutorialsPick[tutorials] diataxis --> referencePick[reference] ``` | If you want to... 
| Read | | --- | --- | | Get the platform running locally | [intro/quickstart](../../intro/quickstart.md) | | Understand the doc conventions | [intro/conventions](../../intro/conventions.md) | | See the canonical repository layout | [repository-split](./repository-split.md) | | Run a backtest end-to-end | [tutorials/first-backtest](../../tutorials/first-backtest.md) | | Promote a bot from backtest to paper | [tutorials/first-bot](../../tutorials/first-bot.md) | | Train an RL agent | [tutorials/first-rl-experiment](../../tutorials/first-rl-experiment.md) | | Compose an agent workflow | [tutorials/first-agent-workflow](../../tutorials/first-agent-workflow.md) | | Browse the API surface | [reference/api](../../reference/api/index.mdx) | | Browse the Python surface | [reference/python](../../reference/python/index.mdx) | | Inspect tables and columns | [reference/data-dictionary](../../reference/data-dictionary/index.mdx) | | Author a new strategy | [how-to/recipes/add-a-strategy](../../how-to/recipes/add-a-strategy.md) | | Query data without touching ORM | [how-to/recipes/query-data-via-mcp](../../how-to/recipes/query-data-via-mcp.md) | | Snapshot an agent spec | [how-to/recipes/snapshot-an-agent-spec](../../how-to/recipes/snapshot-an-agent-spec.md) | | Trigger a kill switch | [how-to/operations/kill-switch-incident-response](../../how-to/operations/kill-switch-incident-response.md) | | Deploy to Kubernetes | [how-to/operations/kubernetes-deploy](../../how-to/operations/kubernetes-deploy.md) | | Read the agentic-coding contract | [concepts/agentic/agentic-development](../agentic/agentic-development.md) | | Run docs from an AI agent | `/llms.txt`, `/llms-full.txt`, `/mcp` | ## Deeper reads - [concepts/platform/repository-split](./repository-split.md) — boundary contract for every `alphaswarm_*` package. - [concepts/agentic/workflow-studio](../agentic/workflow-studio.md) — the `WorkflowRuntime` orchestration layer composing every spec runtime. 
- [concepts/agentic/agentic-development](../agentic/agentic-development.md) — the spec-pattern mapped to the broader agentic-coding vocabulary. - [concepts/identity/management-engine](../identity/management-engine.md) — `WorkloadRuntime` + control-plane audit ledger. - [concepts/infrastructure/terraform-control-plane](../infrastructure/terraform-control-plane.md) — `TerraformRuntime` + hash-locked stack specs. - [reference/api](../../reference/api/index.mdx) — Scalar-rendered API playground. - [reference/python](../../reference/python/index.mdx) — Griffe-generated Python reference. # Class Diagrams > Hand-authored mermaid `classDiagram` blocks for the five hierarchies AI coders most often need to navigate. Every diagram cites the canonical file so you can jump from the diagram into the code in one... # Class Diagrams > Pair with [alphaswarm_docs/erd.md](../../concepts/platform/erd.md) (database schema) and > [alphaswarm_docs/architecture.md](../../concepts/platform/architecture.md) (system view). > Doc map: [alphaswarm_docs/index.md](../../intro/index.md). Hand-authored mermaid `classDiagram` blocks for the five hierarchies AI coders most often need to navigate. Every diagram cites the canonical file so you can jump from the diagram into the code in one click. ## 1. Symbol + core enums The atom that flows through every data feed, strategy, and broker. Defined in [alphaswarm/core/types.py](../alphaswarm/core/types.py). ```mermaid classDiagram class Symbol { +str ticker +Exchange exchange +AssetClass asset_class +SecurityType security_type +str vt_symbol +parse(s) Symbol +format() str +equity(ticker, exchange) Symbol +crypto(base, quote, venue) Symbol +option(underlying, ...) 
Symbol } class Exchange { <> NASDAQ NYSE ARCA BATS CBOE CME LSE BINANCE COINBASE SIM LOCAL } class AssetClass { <> EQUITY CRYPTO FX FUTURE OPTION INDEX COMMODITY BOND BASE } class SecurityType { <> EQUITY OPTION FUTURE FUTURE_OPTION FOREX CFD CRYPTO CRYPTO_FUTURE INDEX INDEX_OPTION COMMODITY } class Resolution { <> Tick Second Minute Hour Daily } class TickType { <> Trade Quote OpenInterest } class SubscriptionDataConfig { +Symbol symbol +Resolution resolution +TickType tick_type +DataNormalizationMode mode } class BarData { +Symbol symbol +datetime timestamp +Decimal open +Decimal high +Decimal low +Decimal close +int volume } class QuoteBar class TradeBar class TickData Symbol --> Exchange : "uses" Symbol --> AssetClass : "uses" Symbol --> SecurityType : "uses" SubscriptionDataConfig --> Symbol SubscriptionDataConfig --> Resolution SubscriptionDataConfig --> TickType BarData --> Symbol QuoteBar --|> BarData TradeBar --|> BarData TickData --> Symbol ``` **Key invariants**: - `Symbol` is hashable + frozen. Round-trip via `Symbol.parse(symbol.format())` is the identity. - `vt_symbol` is always `f"{ticker}.{exchange}"` (vnpy convention). - Concrete instrument shapes (option chains, future contracts) live alongside `Symbol` as additional fields, not separate classes. ## 2. LLM provider registry The router from [alphaswarm/llm/providers/router.py](../alphaswarm/llm/providers/router.py) dispatches every LLM call through LiteLLM. Adding a provider is a single dict entry in [alphaswarm/llm/providers/catalog.py](../alphaswarm/llm/providers/catalog.py). 
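
To make "a single dict entry" concrete, here is a minimal sketch. It assumes `ProviderSpec` is a frozen dataclass carrying the fields listed in the class diagram in this section and that `PROVIDERS` is a plain dict keyed by slug. The `acme` provider, its model names, and every field value are hypothetical, and the real definitions in `catalog.py` may differ.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ProviderSpec:
    """Simplified stand-in; field names follow the class diagram."""
    slug: str
    litellm_prefix: str
    env_key: str
    settings_attr: str
    base_url_attr: str
    default_deep_model: str
    default_quick_model: str
    requires_api_key: bool = True


# Registering a provider is one new key in this dict.
# "acme" and all of its values are invented for illustration.
PROVIDERS: dict[str, ProviderSpec] = {
    "acme": ProviderSpec(
        slug="acme",
        litellm_prefix="acme",
        env_key="ACME_API_KEY",
        settings_attr="acme_api_key",
        base_url_attr="acme_base_url",
        default_deep_model="acme-large",
        default_quick_model="acme-small",
    ),
}


def model_string(slug: str, model: str) -> str:
    """Build a LiteLLM-style "prefix/model" identifier for a provider."""
    spec = PROVIDERS[slug]
    return f"{spec.litellm_prefix}/{model}"


def default_model(slug: str, tier: str) -> str:
    """Pick the provider's default model for the "deep" or "quick" tier."""
    spec = PROVIDERS[slug]
    return spec.default_deep_model if tier == "deep" else spec.default_quick_model
```

Keeping the spec frozen means a registry entry cannot be mutated after import, which fits the idea that the catalog is configuration rather than runtime state.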
```mermaid classDiagram class ProviderSpec { <> +str slug +str litellm_prefix +str env_key +str settings_attr +str base_url_attr +str default_deep_model +str default_quick_model +bool requires_api_key } class LLMProvider { <> +ProviderSpec spec +model_string(model) str* +api_key() str* +base_url() str* +default_model(tier) str } class _DefaultProvider { +model_string(model) str +api_key() str +base_url() str } class LLMResult { <> +str content +str model +str provider +int prompt_tokens +int completion_tokens +float cost_usd +Any raw } class router_complete { <> +complete(provider, model, prompt, ...) LLMResult } class PROVIDERS { <> openai anthropic google xai deepseek groq openrouter ollama vllm } LLMProvider <|-- _DefaultProvider _DefaultProvider --> ProviderSpec PROVIDERS --> ProviderSpec : "values" router_complete --> LLMProvider : "get_provider(slug)" router_complete --> LLMResult : "returns" ``` **Conventions**: - Always call via `router_complete(provider=..., model=..., ...)`. - Tier (`deep`/`quick`) routing happens via `settings.provider_for_tier` + `provider.default_model(tier)`. - The control plane in [alphaswarm/runtime/control_plane.py](../alphaswarm/runtime/control_plane.py) can override `ollama_host` / `vllm_base_url` at runtime. ## 3. Strategy hierarchy AlphaSwarm follows the Lean 5-stage pattern (Universe → Alpha → Portfolio → Risk → Execution). Concrete strategies are factory-instantiated from config via the `class`/`module_path`/`kwargs` registry pattern. 
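
The `class`/`module_path`/`kwargs` pattern can be sketched with `importlib`. This is an illustration under stated assumptions, not the AlphaSwarm factory itself: the `instantiate` helper is hypothetical, and the example config points at `datetime.timedelta` purely as a standard-library stand-in for a strategy class.

```python
import importlib
from typing import Any


def instantiate(config: dict[str, Any]) -> Any:
    """Build an object from a {"class", "module_path", "kwargs"} mapping.

    The module is imported by dotted path, the class is looked up by name,
    and the kwargs (optional) are passed to its constructor.
    """
    module = importlib.import_module(config["module_path"])
    cls = getattr(module, config["class"])
    return cls(**config.get("kwargs", {}))


# Stand-in config: a real entry would name a strategy class and its module.
EXAMPLE_CONFIG = {
    "class": "timedelta",
    "module_path": "datetime",
    "kwargs": {"days": 5},
}
```

`instantiate(EXAMPLE_CONFIG)` returns a `timedelta` of five days. A typo in `module_path` raises `ModuleNotFoundError` and a wrong `class` raises `AttributeError`, so a production factory would probably wrap both in a clearer configuration error.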
```mermaid classDiagram class IStrategy { <> +on_bar(bar, context) Iterator~OrderRequest~ +on_order_update(order) None } class IUniverseSelectionModel { <> +select(timestamp, context) list~Symbol~ } class IAlphaModel { <> +generate_signals(history, universe, context) list~Signal~ } class IPortfolioConstructionModel { <> +construct(signals, context) list~PortfolioTarget~ } class IRiskManagementModel { <> +evaluate(targets, context) list~PortfolioTarget~ } class IExecutionModel { <> +execute(targets, context) list~OrderRequest~ } class FrameworkAlgorithm { +IUniverseSelectionModel universe_model +IAlphaModel alpha_model +IPortfolioConstructionModel portfolio_model +IRiskManagementModel risk_model +IExecutionModel execution_model +int rebalance_every +on_bar(bar, context) Iterator } class MeanReversionAlpha class MomentumAlpha class MLAlphaStrategy class MLSelectorAlpha class EnsembleAlpha { +list~IAlphaModel~ alphas +list~float~ weights } class DeployedModelAlpha { +str deployment_id } class BlackLittermanPortfolio class HRPPortfolio class MeanVariancePortfolio class RiskParityPortfolio class TwapExecution class VwapExecution IStrategy <|.. FrameworkAlgorithm IAlphaModel <|.. MeanReversionAlpha IAlphaModel <|.. MomentumAlpha IAlphaModel <|.. MLAlphaStrategy IAlphaModel <|.. MLSelectorAlpha IAlphaModel <|.. EnsembleAlpha IAlphaModel <|.. DeployedModelAlpha IPortfolioConstructionModel <|.. BlackLittermanPortfolio IPortfolioConstructionModel <|.. HRPPortfolio IPortfolioConstructionModel <|.. MeanVariancePortfolio IPortfolioConstructionModel <|.. RiskParityPortfolio IExecutionModel <|.. TwapExecution IExecutionModel <|.. 
VwapExecution FrameworkAlgorithm o-- IUniverseSelectionModel FrameworkAlgorithm o-- IAlphaModel FrameworkAlgorithm o-- IPortfolioConstructionModel FrameworkAlgorithm o-- IRiskManagementModel FrameworkAlgorithm o-- IExecutionModel EnsembleAlpha o-- "many" IAlphaModel ``` The interfaces are in [alphaswarm/core/interfaces.py](../alphaswarm/core/interfaces.py); concrete alphas in [alphaswarm/strategies/](../alphaswarm/strategies/) (one file per alpha). See [alphaswarm_docs/factor-research.md](../../concepts/strategy/factor-research.md) for the authoring guide. ## 4. Backtest + paper + live (IBrokerage / IDataQueueHandler) The same strategy runs unchanged across backtest, paper, and live — the engines differ in how they implement the broker + data-queue contract, not in how they call the strategy. ```mermaid classDiagram class IBrokerage { <> +submit_order(order) OrderTicket +cancel_order(ticket) bool +get_positions() list~SecurityHolding~ +get_cashbook() CashBook +on_order_event(callback) None } class IDataQueueHandler { <> +subscribe(config) None +unsubscribe(config) None +get_next_ticks() Iterable~Tick~ } class IHistoryProvider { <> +get_bars(symbol, start, end, resolution) DataFrame } class BacktestEngine { +IStrategy strategy +IDataQueueHandler data +IBrokerage brokerage +run(start, end) BacktestResult } class VectorbtEngine { +run(start, end) BacktestResult } class LocalSimulationEngine class PaperTradingEngine class WalkForwardEngine class MonteCarloEngine class BrokerSim { +decimal cash +dict positions } class AlpacaBrokerage class IbkrBrokerage class TradierBrokerage class DuckDBHistoryProvider class KafkaDataFeed IBrokerage <|.. BrokerSim IBrokerage <|.. AlpacaBrokerage IBrokerage <|.. IbkrBrokerage IBrokerage <|.. TradierBrokerage IDataQueueHandler <|.. KafkaDataFeed IHistoryProvider <|.. 
DuckDBHistoryProvider BacktestEngine <|-- VectorbtEngine BacktestEngine <|-- LocalSimulationEngine BacktestEngine <|-- PaperTradingEngine BacktestEngine <|-- WalkForwardEngine BacktestEngine <|-- MonteCarloEngine BacktestEngine o-- IBrokerage BacktestEngine o-- IDataQueueHandler ``` Files of interest: - [alphaswarm/backtest/engine.py](../alphaswarm/backtest/engine.py) — base engine - [alphaswarm/backtest/vectorbt_engine.py](../alphaswarm/backtest/vectorbt_engine.py) - [alphaswarm/backtest/broker_sim.py](../alphaswarm/backtest/broker_sim.py) — brokerage simulator used by all non-live engines - [alphaswarm/trading/](../alphaswarm/trading/) — concrete `IBrokerage` implementations for paper + live - [alphaswarm/streaming/](../alphaswarm/streaming/) — Kafka and IBKR feed handlers See [alphaswarm_docs/backtest-engines.md](../../concepts/strategy/backtest-engines.md) for the full engine matrix, [alphaswarm_docs/paper-trading.md](../../concepts/trading/paper-trading.md) for the session lifecycle. ## 5. Generic ingestion pipeline Discovery → Director → Materialise → Verify → Annotate. The dataclasses below are the canonical contract between stages. 
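
As a rough orientation before the diagram, the sketch below shows one way the later stages could hand results to each other. It is heavily simplified and rests on assumptions: it starts from an already-built plan (Discovery and Director are out of scope), the dataclasses keep only a few of the fields from the diagram, `TableResult` and `run_stages` are hypothetical names, and the stage functions are passed in as plain callables. The two conditions it encodes, verifying only when the row floor is missed and annotating only when `annotate` is true, are taken from the edge labels in the diagram.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class PlannedDataset:
    family: str
    target_table: str
    expected_min_rows: int
    include: bool = True


@dataclass
class TableResult:
    family: str
    rows_written: int
    verifier: Optional[str] = None     # verdict, set only if the floor was missed
    annotation: Optional[dict] = None  # set only when annotate=True


def run_stages(
    plan: list[PlannedDataset],
    materialize: Callable[[PlannedDataset], int],
    verify: Callable[[PlannedDataset, int], str],
    annotate_fn: Callable[[PlannedDataset], dict],
    annotate: bool = False,
) -> list[TableResult]:
    """Materialise each included dataset, then verify and annotate as needed."""
    results: list[TableResult] = []
    for planned in plan:
        if not planned.include:
            continue  # assumption: excluded datasets are simply skipped
        rows = materialize(planned)
        result = TableResult(planned.family, rows)
        if rows < planned.expected_min_rows:
            result.verifier = verify(planned, rows)
        if annotate:
            result.annotation = annotate_fn(planned)
        results.append(result)
    return results
```

Passing the stages as callables keeps the sketch testable with lambdas and mirrors the idea that each stage only needs the dataclass contract of the previous one.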
```mermaid classDiagram class DiscoveredMember { <> +str path +str archive_path +str format +str delimiter +int size_bytes +str subdir +float outer_mtime } class DiscoveredDataset { <> +str family +list~DiscoveredMember~ members +int total_bytes +list~str~ sample_columns +list~str~ notes +list inventory_extra } class IngestionPlan { <> +str source_path +str namespace +list~PlannedDataset~ datasets +list skipped_assets +str director_raw +bool director_used +str director_error } class PlannedDataset { <> +str family +bool include +str target_namespace +str target_table +int expected_min_rows +str domain_hint +list~str~ member_paths +list~str~ skip_member_paths +str notes +iceberg_identifier() str } class VerifierVerdict { <> +str verdict +str reason +dict retry_with +str raw +str error } class MaterializeResult { <> +str iceberg_identifier +str table_name +int rows_written +int files_consumed +int files_skipped +bool truncated +list schema_fields +str error } class IngestionTableResult { <> +str family +str iceberg_identifier +int rows_written +bool truncated +dict annotation +dict plan +dict verifier +str error } class IngestionReport { <> +str source_path +str namespace +datetime started_at +datetime finished_at +int datasets_discovered +list~IngestionTableResult~ tables +list extras +list~str~ errors +dict director_plan } class IngestionPipeline { +ProgressCallback progress_cb +int max_rows_per_dataset +int max_files_per_dataset +int chunk_rows +bool director_enabled +list~str~ allowed_namespaces +run_path(path, namespace, annotate) IngestionReport } class AnnotationResult { <> +str identifier +str description +list~str~ tags +str domain +list pii_flags +list column_docs +str error } DiscoveredDataset o-- "many" DiscoveredMember IngestionPlan o-- "many" PlannedDataset IngestionPipeline ..> DiscoveredDataset : "discovery output" IngestionPipeline ..> IngestionPlan : "director output" IngestionPipeline ..> MaterializeResult : "per planned table" IngestionPipeline 
..> VerifierVerdict : "if floor missed" IngestionPipeline ..> AnnotationResult : "if annotate=true" IngestionTableResult o-- VerifierVerdict IngestionTableResult o-- AnnotationResult IngestionReport o-- "many" IngestionTableResult ``` Files: - [alphaswarm/data/pipelines/discovery.py](../alphaswarm/data/pipelines/discovery.py) - [alphaswarm/data/pipelines/director.py](../alphaswarm/data/pipelines/director.py) - [alphaswarm/data/pipelines/materialize.py](../alphaswarm/data/pipelines/materialize.py) - [alphaswarm/data/pipelines/annotate.py](../alphaswarm/data/pipelines/annotate.py) - [alphaswarm/data/pipelines/runner.py](../alphaswarm/data/pipelines/runner.py) - [alphaswarm/data/pipelines/extractors.py](../alphaswarm/data/pipelines/extractors.py) Walkthrough lives in [alphaswarm_docs/data-catalog.md](../../concepts/data/data-catalog.md). ## 6. Bot entity (TradingBot / ResearchBot) The Bot Entity Refactor introduced a first-class deployable unit that aggregates universe + strategy + engine + ML + agents + RAG + metrics. The runtime never re-implements those primitives — it composes references and dispatches to the existing entry points. ```mermaid classDiagram class BotSpec { <> +str name +str slug +str kind +UniverseRef universe +DataPipelineRef data_pipeline +dict strategy +dict backtest +list~MLDeploymentRef~ ml_models +list~BotAgentRef~ agents +list~RAGRef~ rag +list~MetricRef~ metrics +RiskSpec risk +DeploymentTargetSpec deployment +snapshot_hash() str } class BaseBot { <> +BotSpec spec +str bot_id +str project_id +backtest(run_name, **overrides) dict +paper(run_name, **overrides) PaperTradingSession +deploy(target, **overrides) BotDeploymentResult +chat(prompt, ...) 
Any +metrics_snapshot(run_summary) dict } class TradingBot { +consult_agents(prompt, inputs, roles) dict } class ResearchBot { +chat(prompt, session_id, agent_role, inputs) dict } class BotRuntime { +BotSpec spec +str run_id +str task_id +backtest(run_name, overrides) BotRunResult +paper(run_name, overrides) BotRunResult +chat(prompt, session_id, agent_role) BotRunResult +deploy(target, overrides) BotRunResult } class DeploymentDispatcher { +deploy(bot, target, overrides) BotDeploymentResult +register(target) void } class DeploymentTarget { <> +str name +deploy(bot, overrides) BotDeploymentResult } class PaperSessionTarget class BacktestOnlyTarget class KubernetesTarget { +Path manifest_root +bool apply +render_manifest(bot, overrides) str } BotSpec <.. BaseBot BaseBot <|-- TradingBot BaseBot <|-- ResearchBot BotRuntime ..> BaseBot DeploymentDispatcher --> DeploymentTarget DeploymentTarget <|-- PaperSessionTarget DeploymentTarget <|-- BacktestOnlyTarget DeploymentTarget <|-- KubernetesTarget BotRuntime ..> DeploymentDispatcher : "deploy()" ``` Files: - [alphaswarm/bots/spec.py](../alphaswarm/bots/spec.py) - [alphaswarm/bots/base.py](../alphaswarm/bots/base.py) - [alphaswarm/bots/trading_bot.py](../alphaswarm/bots/trading_bot.py) - [alphaswarm/bots/research_bot.py](../alphaswarm/bots/research_bot.py) - [alphaswarm/bots/runtime.py](../alphaswarm/bots/runtime.py) - [alphaswarm/bots/deploy.py](../alphaswarm/bots/deploy.py) - [alphaswarm/bots/registry.py](../alphaswarm/bots/registry.py) - [alphaswarm/bots/cli.py](../alphaswarm/bots/cli.py) Walkthrough lives in [alphaswarm_docs/bots.md](../../concepts/agentic/bots.md). # Code Index Governance > This document explains how agents should search and index AlphaSwarm during the repository split. The goal is to keep edits inside the right future project boundary before source code is physically separated... # Code Index Governance Status: active. 
This document explains how agents should search and index AlphaSwarm during the repository split. The goal is to keep edits inside the right future project boundary before source code is physically separated.

## Search Order

1. Read the nearest `AGENTS.md` for the folder being edited.
2. Read `alphaswarm_docs/repository-split.md` to identify the owning domain.
3. Search within the owning domain first.
4. Only broaden to `alphaswarm/` or repo root when the boundary document says the implementation still lives there.
5. Record new reusable patterns in `alphaswarm_snippets/` or `.cursor/skills/` instead of scattering notes across unrelated docs.

## Domain Index

| Domain | Start here | Notes |
| --- | --- | --- |
| Control plane | `alphaswarm_controller/AGENTS.md` | `/manage/*`, providers, workload lifecycle |
| Platform core | `alphaswarm_core/AGENTS.md` | Shared contracts only |
| Client | `alphaswarm_client/AGENTS.md` | Active source remains in `alphaswarm_client/` |
| Snippets | `alphaswarm_snippets/AGENTS.md` | Reference-only curated knowledge |
| Bots | `alphaswarm_bots/AGENTS.md` | Runtime remains in `alphaswarm/bots/` for now |
| Runtime monolith | `AGENTS.md` | Agents, RL, data, backtests, persistence, tasks |

## Indexing Rules

- Codebase MCP indexes must respect workspace allow-lists and secret deny-lists from `alphaswarm/codebase/mcp/policy.py`.
- Generated indexes should not include `.env`, private keys, kubeconfigs, token files, model weights, or local warehouse data.
- Agent-readable docs should link to paths, not line numbers, unless the output is a transient review.
- Keep split-boundary indexes short enough that agents can read them before editing.
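The first two indexing rules amount to a path filter applied before a file enters a generated index. The sketch below is only an illustration of that idea: `WORKSPACE_ALLOW_ROOTS`, `SECRET_DENY_PATTERNS`, and `is_indexable` are hypothetical names and values, not the contents of `alphaswarm/codebase/mcp/policy.py`.

```python
from fnmatch import fnmatchcase
from pathlib import PurePosixPath

# Hypothetical values for illustration only; the real allow-list and
# deny-list live in alphaswarm/codebase/mcp/policy.py.
WORKSPACE_ALLOW_ROOTS = ("alphaswarm_controller", "alphaswarm_core", "alphaswarm_client")
SECRET_DENY_PATTERNS = (
    ".env", ".env.*",               # environment files
    "*.pem", "*.key",               # private keys
    "kubeconfig*", "*.kubeconfig",  # cluster credentials
    "*.token", "*_token.*",         # token files
    "*.pt", "*.safetensors",        # model weights
    "*.parquet", "*.duckdb",        # local warehouse data
)


def is_indexable(path: str) -> bool:
    """Return True when a repo-relative POSIX path may enter a generated index."""
    parts = PurePosixPath(path).parts
    if not parts or parts[0] not in WORKSPACE_ALLOW_ROOTS:
        return False
    # Test every component, so a denied name anywhere in the path blocks the file.
    return not any(
        fnmatchcase(part, pattern)
        for part in parts
        for pattern in SECRET_DENY_PATTERNS
    )
```

Matching each path component (rather than only the file name) is a deliberate choice in this sketch: it also excludes files that sit under a denied directory name.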
## Boundary Checks Use these searches before a boundary-sensitive change: ```bash rg --type py "^from alphaswarm(\.|$)|^import alphaswarm(\.|$)" alphaswarm_controller/src rg "alphaswarm_snippets|extractions|inspiration" alphaswarm alphaswarm_controller alphaswarm_core rg "control.local/api|management/backend|management/frontend" docs README.md ``` The first command must return no matches. The second and third commands may return documented migration references, but should not reveal runtime imports or active instructions that route new work to deprecated surfaces. # Contingency graphs (OCO / OUO / OTO) > | Type | Behaviour | | --- | --- | | **OCO** (one cancels other) | When any constituent fills (partial or full), the others are canceled. Canonical use: bracket a position with a take-profit limit + s... # Contingency graphs (OCO / OUO / OTO) > Status: **Phase 2 shipped** (Alembic 0041). Manager: > [`alphaswarm/trading/execution/contingency.py`](../alphaswarm/trading/execution/contingency.py). ## The three relationships | Type | Behaviour | | --- | --- | | **OCO** (one cancels other) | When any constituent fills (partial or full), the others are canceled. Canonical use: bracket a position with a take-profit limit + stop-loss stop. | | **OUO** (one updates other) | When any constituent's quantity changes (partial fill or amend), every other constituent's quantity is updated to match the remaining size. Useful when the bracket has more than two legs. | | **OTO** (one triggers other) | The parent is the trigger; children are emulated until the parent fills, then they're submitted. Canonical use: place an entry limit + parked TP/SL waiting for the entry to hit. 
| ## Class layout ```mermaid flowchart LR BotRuntime -->|"submit_list(order_list)"| Broker Broker -->|"emits ExecutionReport"| Dispatcher Dispatcher -->|"on_execution_report"| ContingencyManager ContingencyManager -->|"ContingencyCommand"| ExecutionLoop ExecutionLoop -->|"cancel/amend/submit"| Broker ``` ## Manager behaviour For each constituent, the manager tracks shadow ``remaining_quantity`` and ``status``: * On fill (full or partial), OCO emits ``CANCEL`` for every peer. * On partial fill, OUO emits ``UPDATE_QUANTITY`` for every peer with the new ``remaining_quantity``. * On full fill, OUO emits ``CANCEL`` for every peer (degenerates to OCO). * On parent fill, OTO emits ``SUBMIT`` for every child. Subsequent child fills don't re-trigger anything. ## Venue dispatch Two routes: 1. **Native atomic submission** -- when the broker sets ``supports_oco = True``, the broker's :meth:`IDomainBrokerage.submit_list` submits the whole list in a single venue call (Alpaca bracket orders, IBKR OCA groups). The manager STILL registers the list so a partial-cancel still emits cleanup commands when the venue's atomicity is best-effort. 2. **Manager-simulated** -- when the broker sets ``supports_oco = False``, the broker submits each constituent independently and the manager owns the cross-order cancels via :meth:`ContingencyManager.on_execution_report`. 
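The manager rules above can be condensed into one fill handler. This is a minimal, self-contained sketch and not the code in `alphaswarm/trading/execution/contingency.py`: `Leg`, `Command`, and `on_fill` are hypothetical names, and it assumes an OTO parent triggers its children only once it is fully filled.

```python
from dataclasses import dataclass
from decimal import Decimal
from enum import Enum
from typing import Dict, List, Optional


class ContingencyType(str, Enum):
    OCO = "OCO"
    OUO = "OUO"
    OTO = "OTO"


@dataclass
class Leg:
    """Shadow state tracked per constituent (hypothetical shape)."""
    order_id: str
    remaining_quantity: Decimal
    status: str = "OPEN"
    is_parent: bool = False  # only meaningful for OTO lists


@dataclass(frozen=True)
class Command:
    action: str  # "CANCEL" | "UPDATE_QUANTITY" | "SUBMIT"
    order_id: str
    quantity: Optional[Decimal] = None


def on_fill(kind: ContingencyType, legs: Dict[str, Leg],
            filled_id: str, fill_qty: Decimal) -> List[Command]:
    """Update shadow state for one fill and return the commands for its peers."""
    leg = legs[filled_id]
    leg.remaining_quantity -= fill_qty
    fully_filled = leg.remaining_quantity <= 0
    leg.status = "FILLED" if fully_filled else "PARTIALLY_FILLED"
    peers = [peer for oid, peer in legs.items() if oid != filled_id]

    # OCO cancels peers on any fill; OUO degenerates to OCO on a full fill.
    if kind is ContingencyType.OCO or (kind is ContingencyType.OUO and fully_filled):
        return [Command("CANCEL", p.order_id) for p in peers]
    if kind is ContingencyType.OUO:
        # Partial fill: every peer is resized to the filled leg's remaining size.
        for p in peers:
            p.remaining_quantity = leg.remaining_quantity
        return [Command("UPDATE_QUANTITY", p.order_id, leg.remaining_quantity) for p in peers]
    # OTO: only the parent triggers; child fills emit nothing.
    if leg.is_parent and fully_filled:
        return [Command("SUBMIT", p.order_id, p.remaining_quantity) for p in peers]
    return []
```

For example, a partial fill of 4 on a 10-lot OUO leg yields one `UPDATE_QUANTITY` command per peer carrying a quantity of 6.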
## Code example

```python
from decimal import Decimal

from alphaswarm.core.domain.identifiers import (
    ClientOrderId,
    InstrumentId,
    OrderListId,
    Symbol2,
    Venue,
)
from alphaswarm.core.domain.enums import ContingencyType, OrderSide, OrderType
from alphaswarm.core.domain.orders import LimitOrder, StopMarketOrder, OrderList

# Take-profit at 200, stop-loss at 180 -- OCO bracket
tp = LimitOrder(
    client_order_id=ClientOrderId("tp-1"),
    instrument_id=InstrumentId(Symbol2("AAPL"), Venue("NASDAQ")),
    order_side=OrderSide.SELL,
    quantity=Decimal("10"),
    order_type=OrderType.LIMIT,
    price=Decimal("200"),
)
sl = StopMarketOrder(
    client_order_id=ClientOrderId("sl-1"),
    instrument_id=InstrumentId(Symbol2("AAPL"), Venue("NASDAQ")),
    order_side=OrderSide.SELL,
    quantity=Decimal("10"),
    order_type=OrderType.STOP_MARKET,
    trigger_price=Decimal("180"),
)
order_list = OrderList(
    order_list_id=OrderListId("oco-1"),
    orders=[tp, sl],
    contingency_type=ContingencyType.OCO,
)

# Submit the entire list atomically (or simulated).
# `broker` is an IDomainBrokerage instance; the call must run inside a coroutine.
await broker.submit_list(order_list)
```

## Persistence

The Alembic ``0041`` migration adds:

* ``order_lists`` -- one row per :class:`OrderList`
* ``domain_orders.order_list_id`` FK -- ties constituents to parent
* ``execution_reports`` -- audit trail; the manager reads this to recover state after a restart

## Limitations

* OUO with three or more constituents updates ALL peers to the smallest remaining size. This matches the standard interpretation but may not be what every venue does -- check the venue's docs.
* OTO with multiple parents (a single child triggered by either of two parents) is NOT supported; that's a contingency-graph generalisation the manager can't currently express.
* Native broker OCO often comes with constraints (Alpaca brackets require the limit + stop to be on the same instrument; IBKR OCA groups allow cross-instrument). The contingency manager handles cross-instrument simulation.
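The Persistence section notes that the manager reads ``execution_reports`` to recover state after a restart. One way to picture that recovery is a replay of the audit trail in order, sketched below. The row shape (`ReportRow`), the event names, and `rebuild_shadow_state` are assumptions made for illustration; the actual table columns and recovery code are not described here.

```python
from dataclasses import dataclass
from decimal import Decimal
from typing import Dict, Iterable


@dataclass(frozen=True)
class ReportRow:
    """Hypothetical projection of one ``execution_reports`` row."""
    seq: int                      # assumed monotonic ordering key
    order_id: str
    event: str                    # assumed values: "FILL" | "CANCELED"
    last_qty: Decimal = Decimal("0")


@dataclass
class ShadowLeg:
    remaining_quantity: Decimal
    status: str = "OPEN"


def rebuild_shadow_state(initial: Dict[str, Decimal],
                         rows: Iterable[ReportRow]) -> Dict[str, ShadowLeg]:
    """Replay the audit trail in sequence order to rebuild per-leg shadow state."""
    legs = {oid: ShadowLeg(qty) for oid, qty in initial.items()}
    for row in sorted(rows, key=lambda r: r.seq):
        leg = legs.get(row.order_id)
        if leg is None:
            continue  # report for an order outside this list
        if row.event == "FILL":
            leg.remaining_quantity -= row.last_qty
            leg.status = "FILLED" if leg.remaining_quantity <= 0 else "PARTIALLY_FILLED"
        elif row.event == "CANCELED":
            leg.status = "CANCELED"
    return legs
```

Sorting by a sequence key before replaying keeps the result independent of the order in which rows are read back.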
# Core Type System

> AlphaSwarm 0.3 ports Lean's data model into Python with minimal surface area, full backward compatibility, and strict ``dataclass``-only value objects.

# Core Type System

> Doc map: [alphaswarm_docs/index.md](../../intro/index.md) · Full `Symbol` class diagram: [alphaswarm_docs/class-diagram.md#1-symbol--core-enums](../../concepts/platform/class-diagram.md#1-symbol--core-enums).

AlphaSwarm 0.3 ports Lean's data model into Python with minimal surface area, full backward compatibility, and strict ``dataclass``-only value objects.

## Quick map

| Lean (C#) | AlphaSwarm (Python) | File |
|---|---|---|
| `Slice` | `Slice` | [`alphaswarm/core/slice.py`](../alphaswarm/core/slice.py) |
| `BaseData` | `BarData` (alias `TradeBar`), `QuoteBar`, `TickData` (alias `Tick`) | [`alphaswarm/core/types.py`](../alphaswarm/core/types.py) |
| `SubscriptionDataConfig` | `SubscriptionDataConfig` | same |
| `Resolution` / `TickType` | `Resolution` / `TickType` | same |
| `DataNormalizationMode` | `DataNormalizationMode` | same |
| `Symbol` / `SecurityIdentifier` | `Symbol` (composite `ticker.exchange`) | same |
| `Security` / `SecurityHolding` | `SecurityHolding` (extends `PositionData`) | same |
| `Cash` / `CashBook` | `Cash` / `CashBook` | same |
| `Order` / `OrderTicket` / `OrderEvent` | `OrderData` / `OrderTicket` / `OrderEvent` | same |
| `IndicatorBase` / `RollingWindow` | `IndicatorBase[T]` / `RollingWindow[T]` | [`alphaswarm/core/indicators.py`](../alphaswarm/core/indicators.py) |
| `MarketHoursDatabase` | `MarketHoursDatabase` | [`alphaswarm/core/exchange_hours.py`](../alphaswarm/core/exchange_hours.py) |
| `MapFile` / `FactorFile` | `MapFile` / `FactorFile` | [`alphaswarm/core/corporate_actions.py`](../alphaswarm/core/corporate_actions.py) |

## Migration notes

- **`BarData` is unchanged.** ``TradeBar`` is a type alias; existing backtest code keeps working.
- **``TickData`` is unchanged.** ``Tick`` is an alias.
- **``PositionData`` is unchanged.** The richer ``SecurityHolding`` is additive — convert via ``SecurityHolding.from_position(pos)``. - **``on_bar(bar, ctx)`` remains the supported strategy entry point.** Strategies that implement ``on_data(slice, ctx)`` get called once per timestamp instead of once per symbol; the engine auto-detects which method to call. - **Orders now surface as tickets.** The engine populates ``BacktestResult.tickets`` with :class:`OrderTicket` objects that carry the full ``OrderEvent`` stream for each order. ## Indicator registry 25 built-in indicators, all subclasses of ``IndicatorBase``. Resolve by string via ``build_indicator("SMA", period=20)`` or import directly. ```python from alphaswarm.core.indicators import SimpleMovingAverage, warmup sma = SimpleMovingAverage(20) print(warmup(sma, [100, 101, 102])) # NaN until 20 samples ``` ## Subscription routing Every downstream consumer (backtest engine, paper engine, RL env, factor job) reads data through :class:`SubscriptionDataConfig` via :class:`alphaswarm.data.subscription.SubscriptionManager`. That swap enables normalisation-aware queries and composite history providers without touching strategy code. ## Type relationships ```mermaid classDiagram class Symbol { +str ticker +Exchange exchange +AssetClass asset_class +SecurityType security_type +str vt_symbol +parse(s) Symbol } class BarData class TickData class Signal class OrderRequest class OrderTicket class OrderEvent class SubscriptionDataConfig Symbol <-- BarData Symbol <-- TickData Symbol <-- Signal Symbol <-- OrderRequest OrderRequest --> OrderTicket : submit_order OrderTicket --> OrderEvent : "events stream" SubscriptionDataConfig --> Symbol ``` # Domain Model > The AlphaSwarm platforms domain model lives under [`alphaswarm/core/domain/`](../alphaswarm/core/domain/) and is the single source of truth for every tradable-asset, issuer, event, market-data, fundamentals, ownership, c... 
# Domain Model > Doc map: [alphaswarm_docs/index.md](../../intro/index.md) · Schema diagrams: [alphaswarm_docs/erd.md](../../concepts/platform/erd.md) · Column reference: [alphaswarm_docs/data-dictionary.md](../../reference/data-dictionary/index.md). The AlphaSwarm platform's domain model lives under [`alphaswarm/core/domain/`](../alphaswarm/core/domain/) and is the single source of truth for every tradable-asset, issuer, event, market-data, fundamentals, ownership, calendar, economic, and news primitive in the platform. The expansion absorbs the best abstractions from four best-of-breed open-source quant projects: | Inspiration | What we took | |---|---| | [gs-quant](https://github.com/goldmansachs/gs-quant) | `(AssetClass, AssetType) → Instrument` dispatch ([`gs_quant/instrument/core.py`](https://github.com/goldmansachs/gs-quant/blob/master/gs_quant/instrument/core.py)), `XRef`/`Security` identifier flattening, `PricingContext`/`RiskMeasure` scaffolding ([`gs_quant/common.py`](https://github.com/goldmansachs/gs-quant/blob/master/gs_quant/common.py)) | | [vnpy](https://github.com/vnpy/vnpy) | `ContractData` with `size`/`pricetick`/`min_volume`/option metadata, `Offset` (OPEN/CLOSE/CLOSE_TODAY/CLOSE_YESTERDAY) enum, 5-level `TickData`, uniform `*Request` envelopes | | [nautilus_trader](https://github.com/nautechsystems/nautilus_trader) | Typed identifier value objects, polymorphic `Instrument` grid (`Equity`/`FuturesContract`/`OptionContract`/`CurrencyPair`/`Cfd`/`CryptoPerpetual`/`BettingInstrument`/`BinaryOption`/`SyntheticInstrument`/`TokenizedAsset`), `OrderBook`/`BookLevel` primitives, option greeks, DeFi scaffolds | | [OpenBB Platform](https://github.com/OpenBB-finance/OpenBB) | `Fetcher[Q, R]` + `QueryParams` + `Data` triad, plus ~170 `standard_models` covering every research datatype from `balance_sheet` and `insider_trading` through `federal_funds_rate` and `cot` | ## Layout ``` alphaswarm/core/domain/ ├── identifiers.py # Typed IDs + IdentifierScheme + 
IdentifierSet ├── enums.py # 25+ StrEnum catalogs (AssetClass, InstrumentClass, OrderType, ...) ├── money.py # Currency + Price/Quantity/Money precision-safe scalars ├── instrument.py # Polymorphic Instrument hierarchy + (AssetClass, InstrumentClass) dispatch ├── issuer.py # Issuer / CorporateEntity / Fund / GovernmentEntity graph ├── market_data.py # Bar/Tick/QuoteTick/TradeTick/OrderBook/MarkPriceUpdate + RichSlice ├── orders.py # DomainOrder hierarchy + full OrderEvent family ├── positions.py # DomainPosition hierarchy + PositionEvent family ├── greeks.py # OptionGreeks / OptionGreekValues / PortfolioGreeks ├── options.py # OptionChain / OptionChainSlice / OptionSeriesId / StrikeRange ├── events.py # DomainEvent union (filing/earnings/news/dividend/ipo/merger/esg/...) ├── fundamentals.py # BalanceSheet / IncomeStatement / CashFlow / FinancialRatios / KeyMetrics ├── ownership.py # InsiderTransaction / Form13F / ShortInterest / SharesFloat / ... ├── calendar_events.py # CalendarEarnings/Dividend/Split/Ipo/EconomicCalendar + MarketHoliday ├── economic.py # TreasuryRate/YieldCurve/FederalFundsRate/CPI/Unemployment/CoT/FRED └── news.py # NewsItem / CompanyNews / WorldNews / Sentiment ``` Persistence sibling modules under [`alphaswarm/persistence/`](../alphaswarm/persistence/): - `models_instruments.py` — joined-table subclasses (InstrumentEquity, InstrumentOption, InstrumentFuture, …). - `models_entities.py` — `issuers` + related graph tables. - `models_fundamentals.py` — statements / ratios / metrics / transcripts / MD&A. - `models_events.py` — corporate / calendar / analyst / regulatory / ESG event tables. - `models_ownership.py` — insider / institutional / 13F / short-interest / float / politician-trades. - `models_news.py` — news items + entity M2M + sentiment. - `models_macro.py` — economic series / observations / CoT / treasury / yield curve / option-chain snapshots. - `models_taxonomy.py` — taxonomy schemes + nodes + polymorphic entity tags + entity crosswalk. 
The Alembic migration [`alembic/versions/0008_domain_model_expansion.py`](../alembic/versions/0008_domain_model_expansion.py) creates every new table, extends `instruments` with the polymorphic discriminator + richer columns, creates an `instruments_flat` back-compat view, and seeds `taxonomy_schemes` with SIC / NAICS / GICS / TRBC / ICB / BICS / NACE plus user-defined `thematic`, `region`, `risk` roots.

## Instrument hierarchy

```mermaid
classDiagram
    class Instrument {
        InstrumentId instrument_id
        AssetClass asset_class
        InstrumentClass instrument_class
        Currency currency
        Decimal tick_size
        Decimal multiplier
        IdentifierSet identifiers
    }
    Instrument <|-- Equity
    Instrument <|-- ETF
    Instrument <|-- IndexInstrument
    Instrument <|-- Bond
    Instrument <|-- FuturesContract
    Instrument <|-- FuturesSpread
    Instrument <|-- OptionContract
    Instrument <|-- OptionSpread
    Instrument <|-- BinaryOption
    Instrument <|-- CurrencyPair
    Instrument <|-- Cfd
    Instrument <|-- Commodity
    Instrument <|-- SyntheticInstrument
    Instrument <|-- CryptoToken
    Instrument <|-- CryptoFuture
    Instrument <|-- CryptoPerpetual
    Instrument <|-- CryptoOption
    Instrument <|-- PerpetualContract
    Instrument <|-- TokenizedAsset
    Instrument <|-- BettingInstrument
    Instrument <|-- Swap
```

Dispatch via `instrument_class_for(asset_class, instrument_class)` returns the concrete class:

```python
from alphaswarm.core.domain import instrument_class_for, AssetClass, InstrumentClass

cls = instrument_class_for(AssetClass.EQUITY, InstrumentClass.OPTION)
assert cls.__name__ == "OptionContract"
```

YAML recipes can say:

```yaml
instrument:
  class: Equity
  kwargs:
    instrument_id: { symbol: AAPL, venue: NASDAQ }
    cik: "0000320193"
    isin: US0378331005
```

…and `alphaswarm.core.registry.build_from_config` routes through the instrument registry automatically.
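To make the `(AssetClass, InstrumentClass)` dispatch concrete, here is a toy registry that resolves a pair to a class. It is a sketch under assumptions, not the implementation in `alphaswarm/core/domain/instrument.py`: the enum members shown, the `register` decorator, and the fallback to the base `Instrument` for unknown pairs are all hypothetical.

```python
from enum import Enum
from typing import Callable, Dict, Tuple, Type


class AssetClass(str, Enum):          # trimmed, hypothetical member set
    EQUITY = "equity"
    FX = "fx"


class InstrumentClass(str, Enum):     # trimmed, hypothetical member set
    SPOT = "spot"
    OPTION = "option"


class Instrument:
    """Base class; used as the fallback when no pair is registered."""


_REGISTRY: Dict[Tuple[AssetClass, InstrumentClass], Type[Instrument]] = {}


def register(asset_class: AssetClass,
             instrument_class: InstrumentClass) -> Callable[[Type[Instrument]], Type[Instrument]]:
    """Class decorator that records the concrete class for one dispatch pair."""
    def decorator(cls: Type[Instrument]) -> Type[Instrument]:
        _REGISTRY[(asset_class, instrument_class)] = cls
        return cls
    return decorator


@register(AssetClass.EQUITY, InstrumentClass.SPOT)
class Equity(Instrument):
    pass


@register(AssetClass.EQUITY, InstrumentClass.OPTION)
class OptionContract(Instrument):
    pass


def instrument_class_for(asset_class: AssetClass,
                         instrument_class: InstrumentClass) -> Type[Instrument]:
    """Resolve a pair to its class; unknown pairs fall back to Instrument (assumed)."""
    return _REGISTRY.get((asset_class, instrument_class), Instrument)
```

Keying the registry on the pair, instead of on the class name, lets a single lookup serve both programmatic dispatch and config-driven construction.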
## Issuer graph

```mermaid
classDiagram
    class Issuer {
        str issuer_id
        str name
        EntityKind kind
        str cik
        str lei
        int sic
        str naics
        Sector sector
        Industry industry
    }
    Issuer <|-- CorporateEntity
    Issuer <|-- GovernmentEntity
    Issuer <|-- Fund
    Issuer --> IndustryClassification : classifications
    Issuer --> Location : locations
    Issuer --> KeyExecutive : key_executives
    Issuer --> ExecutiveCompensation : compensation
    EntityRelationship --> Issuer : from_entity
    EntityRelationship --> Issuer : to_entity
    Instrument --> Issuer : issuer_id
```

Every `Equity`, `Bond`, `ETF` points at an `Issuer` row. The `Issuer` mirrors OpenBB's `EquityInfoData` schema (CIK, CUSIP, ISIN, LEI, legal_name, SIC, HQ address, employees, sector, industry) so ingestion from any OpenBB-compatible provider flows in without shape changes.

## Event flow

```mermaid
flowchart LR
    subgraph Corporate
        FilingEvent
        EarningsEvent
        CorporateActionEvent
        IPOEvent
        MergerEvent
    end
    subgraph Research
        AnalystRatingEvent
        PriceTargetEvent
        ForwardEstimateEvent
    end
    subgraph Ownership
        InsiderTransactionEvent
        InstitutionalHoldingEvent
        PoliticianTradeEvent
    end
    subgraph Macro
        EconomicObservationEvent
        CotReportEvent
    end
    subgraph Alternative
        NewsEvent
        SocialSentimentEvent
        RegulatoryEvent
        ESGEvent
        MaritimeEvent
        PortVolumeEvent
    end
```

All events inherit from `DomainEvent` and share `ts_event` / `ts_init` / `event_id` / `source` / `instrument_id` / `issuer` / `meta`. Downstream consumers can demultiplex by `kind` without importing the concrete class.
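Demultiplexing by `kind` can be illustrated with a small router that never imports a concrete event class. The `DomainEvent` stand-in below carries only a few of the shared fields, and `EventRouter` plus the kind strings (`"news"`, `"earnings"`) are hypothetical; the real event family lives in `alphaswarm/core/domain/events.py`.

```python
from collections import defaultdict
from dataclasses import dataclass, field
from typing import Any, Callable, DefaultDict, Dict, List


@dataclass(frozen=True)
class DomainEvent:
    """Stand-in with a subset of the shared fields; not the real class."""
    kind: str
    event_id: str
    ts_event: int
    source: str
    meta: Dict[str, Any] = field(default_factory=dict)


Handler = Callable[[DomainEvent], None]


class EventRouter:
    """Routes events to handlers keyed on the `kind` string alone."""

    def __init__(self) -> None:
        self._handlers: DefaultDict[str, List[Handler]] = defaultdict(list)

    def subscribe(self, kind: str, handler: Handler) -> None:
        self._handlers[kind].append(handler)

    def dispatch(self, event: DomainEvent) -> int:
        """Call every handler registered for the event's kind; return how many ran."""
        handlers = self._handlers.get(event.kind, [])
        for handler in handlers:
            handler(event)
        return len(handlers)


router = EventRouter()
seen: List[str] = []
router.subscribe("news", lambda e: seen.append(e.event_id))
router.dispatch(DomainEvent(kind="news", event_id="n-1", ts_event=0, source="demo"))
router.dispatch(DomainEvent(kind="earnings", event_id="e-1", ts_event=1, source="demo"))
```

After the two dispatches only the `"news"` event has reached a handler, because nothing subscribed to `"earnings"`.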
## Fundamentals ```mermaid classDiagram class FundamentalsBase { str symbol str issuer_id date period PeriodType period_type int fiscal_year str fiscal_period str currency datetime as_of str source_filing_accession } FundamentalsBase <|-- BalanceSheet FundamentalsBase <|-- IncomeStatement FundamentalsBase <|-- CashFlowStatement FundamentalsBase <|-- FinancialRatios FundamentalsBase <|-- KeyMetrics FundamentalsBase <|-- EarningsCallTranscript FundamentalsBase <|-- ManagementDiscussionAnalysis FundamentalsBase <|-- ReportedFinancials ``` Every fundamentals model is a Pydantic `BaseModel` with `extra="allow"`, so provider-specific columns survive round-trips unchanged. ## Typed identifiers ```python from alphaswarm.core.domain import ( InstrumentId, Symbol2, Venue, IdentifierScheme, IdentifierSet, IdentifierValue ) iid = InstrumentId.from_str("AAPL.NASDAQ") assert iid.symbol == Symbol2("AAPL") assert iid.venue == Venue("NASDAQ") ids = IdentifierSet() ids.add(IdentifierValue(scheme=IdentifierScheme.CUSIP, value="037833100")) ids.add(IdentifierValue(scheme=IdentifierScheme.LEI, value="HWUPKR0MPOU8FGXBT394")) assert ids.value_of(IdentifierScheme.CUSIP) == "037833100" ``` The `IdentifierScheme` StrEnum covers 30+ taxonomies: ticker, vt_symbol, CIK, CUSIP, ISIN, SEDOL, FIGI, OpenFIGI, LEI, GVKEY, PermID, Refinitiv PermID, FactSet ID, DUNS, IRS EIN, FRED series id, BLS series id, ECB series id, GDelt theme, CoT code, SIC, NAICS, GICS, TRBC, ICB, NACE, BICS, ERC-20 address, EVM chain id, IBKR conid, Alpaca asset id, Polygon ticker, plus a `custom` escape hatch. ## Migration path The expansion is designed to be **non-breaking for existing users**: - Legacy `alphaswarm.core.types.Symbol` / `BarData` / `QuoteBar` / `TickData` / `OrderRequest` / `OrderData` / `OrderEvent` / `OrderTicket` / `SecurityHolding` / `Cash` / `CashBook` / `Signal` / `PortfolioTarget` all keep their constructors and public API. 
- The `Instrument` SQLAlchemy table keeps every pre-expansion column; new columns are nullable. A back-compat view `instruments_flat` serves the pre-refactor shape for any SQL consumer. - Legacy rows with `instrument_class IS NULL` load cleanly as the base `Instrument` via a SQL `CASE` mapping. - The richer typed IDs live in `alphaswarm.core.domain.identifiers` and are opt-in. `Symbol.to_instrument_id()` and `Symbol.from_instrument_id()` bridge old and new. - The `Slice` class keeps its legacy shape; `RichSlice` is the superset with `order_books`, `mark_prices`, `funding_rates`, `news`, `filings` buckets. ## Tests Domain-model tests live in [`tests/core/`](../tests/core/): - `test_identifiers.py` — typed IDs + IdentifierSet + scheme coverage (12 tests). - `test_enums.py` — expanded enum catalog (11 tests). - `test_instrument_hierarchy.py` — polymorphic Instrument + `(AssetClass, InstrumentClass)` dispatch (15 tests). - `test_events.py` — unified `DomainEvent` family (13 tests). - `test_fundamentals.py` — Pydantic statements + ratios + transcripts (14 tests). - `test_ownership.py` — insider / institutional / 13F / short-interest (10 tests). - `test_standard_models.py` — 99+102 paired `QueryParams`/`Data` port (6 tests). Provider tests: [`tests/providers/test_fetcher_contract.py`](../tests/providers/test_fetcher_contract.py). Persistence tests: [`tests/persistence/test_domain_migration.py`](../tests/persistence/test_domain_migration.py). # Entity Graph And Service Control > Start the local stack with the visualization overlay: # Entity Graph And Service Control AlphaSwarm now treats the entity graph as the canonical relationship layer for instruments, companies, datasets, pipeline assets, and service metadata. Postgres remains the compatibility store for existing APIs, while Neo4j is the graph backend when `ALPHASWARM_GRAPH_STORE=neo4j`. 
## Local Services

Start the local stack with the visualization overlay:

```bash
docker compose -f alphaswarm_platform/compose/docker-compose.yml -f alphaswarm_platform/compose/docker-compose.viz.yml --profile visualization up -d
```

Neo4j is part of the base compose file and is exposed on:

- Browser: `http://localhost:7474`
- Bolt: `bolt://localhost:7687`

Relevant env keys:

```bash
ALPHASWARM_GRAPH_STORE=neo4j
ALPHASWARM_NEO4J_URI=bolt://localhost:7687
ALPHASWARM_NEO4J_USER=neo4j
ALPHASWARM_NEO4J_PASSWORD=aqpneo4j
ALPHASWARM_NEO4J_DATABASE=neo4j
ALPHASWARM_ENTITY_GRAPH_SYNC_ENABLED=true
```

## Entity Sync

The active instrument cache reads from the existing `instruments` table and upserts each instrument as a `security` entity with identifiers for `vt_symbol`, ticker, and any instrument metadata identifiers.

Dataset registration for market-bar datasets links dataset versions to the instrument entities they describe. Airbyte and Dagster metadata syncs also write service nodes and relationships so the graph can show ingestion and pipeline context around datasets.

Useful endpoints:

- `GET /registry/entities/graph`
- `GET /registry/entities/instruments/active`
- `POST /registry/entities/instruments/sync`
- `POST /registry/entities/instruments/load-template`

## Service Manager

The service manager aggregates health/config/logs for:

- Trino
- Polaris
- Iceberg
- Superset
- Airbyte
- Dagster
- Neo4j

Useful endpoints:

- `GET /service-manager/health`
- `GET /service-manager/{service}/health`
- `GET /service-manager/{service}/logs`
- `POST /service-manager/{service}/actions`

Lifecycle actions and logs are guarded by `ALPHASWARM_SERVICE_CONTROL_ENABLED=true` because they invoke Docker Compose from inside the API process.

## UI

- `/data/entity-graph` exposes the Neo4j-backed entity graph and active instrument list.
- `/data/services` exposes service health cards, guarded lifecycle actions, and log tails.
- `/workflows/data` includes Dagster assets, runs, schedules, and sensors.
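The guard described in the Service Manager section can be pictured as a check that runs before any lifecycle action or log read. The sketch below is an assumption-laden illustration: `MANAGED_SERVICES` slugs, `service_control_enabled`, and `authorize_action` are hypothetical, and treating only the literal value `true` (case-insensitive) as enabled is a guess at the flag's parsing.

```python
import os
from typing import Mapping, Optional

# Lower-case slugs for the services listed above (hypothetical naming).
MANAGED_SERVICES = ("trino", "polaris", "iceberg", "superset", "airbyte", "dagster", "neo4j")


def service_control_enabled(env: Optional[Mapping[str, str]] = None) -> bool:
    """Return True only when the guard flag is explicitly set to 'true'."""
    env = os.environ if env is None else env
    return env.get("ALPHASWARM_SERVICE_CONTROL_ENABLED", "").strip().lower() == "true"


def authorize_action(service: str, env: Optional[Mapping[str, str]] = None) -> str:
    """Validate the service slug and the guard flag; return the normalised slug."""
    slug = service.strip().lower()
    if slug not in MANAGED_SERVICES:
        raise KeyError(f"unknown service: {service!r}")
    if not service_control_enabled(env):
        raise PermissionError("service control is disabled")
    return slug
```

Defaulting to disabled when the variable is unset keeps the Docker Compose path closed unless an operator opts in.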
# Unified Entity Registry

> The unified entity registry sits on top of the existing Issuer / Sector / Industry graph and widens it to cover every entity AlphaSwarm cares about: companies, drugs, products, patents, persons, locations, securities, regulators, and free-form "concept" rows...

# Unified Entity Registry

The unified entity registry sits on top of the existing [Issuer / Sector / Industry graph](../../concepts/platform/erd.md) and widens it to cover every entity AlphaSwarm cares about: companies, drugs, products, patents, persons, locations, securities, regulators, and free-form "concept" rows. Extractors populate the rows from datasets; LLM enrichers add descriptions, relations, dedup proposals, and tags without ever mutating the source data.

```mermaid
flowchart LR
    Sources["Iceberg datasets (CFPB, FDA, USPTO, SEC, GDELT, FinanceDatabase, ...)"]
    Extractors["EntityExtractor.run(rows)"]
    Registry[(EntityRegistry)]
    Enrichers["EntityEnricher.run(ids)"]
    Tasks["Celery tasks (entity_tasks)"]
    API["/registry/entities"]
    UI["/data/kg"]
    Sources --> Extractors --> Registry
    Registry --> Enrichers --> Registry
    Tasks --> Extractors
    Tasks --> Enrichers
    Registry --> API
    API --> UI
```

## Tables

| Table | File |
| --- | --- |
| `entities` | [alphaswarm/persistence/models_entity_registry.py](../alphaswarm/persistence/models_entity_registry.py) |
| `entity_identifiers` | (same) |
| `entity_relations` | (same) |
| `entity_annotations` | (same) |
| `entity_dataset_links` | (same) |

Migration: [alembic/versions/0013_data_engine_expansion.py](../alembic/versions/0013_data_engine_expansion.py).

## Components

| Module | What it does |
| --- | --- |
| [alphaswarm/data/entities/registry.py](../alphaswarm/data/entities/registry.py) | `EntityRegistry` facade + `upsert_entity` / `link_identifier` / `add_relation` / `attach_to_dataset` / `search` / `neighbors` / `add_annotation`. |
| [alphaswarm/data/entities/extractors/](../alphaswarm/data/entities/extractors/) | Per-dataset extractors (regulatory, filings, news, instruments, finance_database). Each yields `EntityCandidate` dataclasses.
| | [alphaswarm/data/entities/enrichers/](../alphaswarm/data/entities/enrichers/) | LLM enrichers (description, relation, dedup, tagging). All route through `router_complete` per AGENTS.md hard rule #2. | | [alphaswarm/tasks/entity_tasks.py](../alphaswarm/tasks/entity_tasks.py) | Celery wrappers (`extract_entities`, `enrich_entity`, `dedup_entities`). | | [alphaswarm/api/routes/entity_registry.py](../alphaswarm/api/routes/entity_registry.py) | REST surface at `/registry/entities`. | ## REST surface | Path | Description | | --- | --- | | `GET /registry/entities` | List entities (filter by kind, source_dataset, canonical_only). | | `POST /registry/entities` | Create or update an entity. | | `GET /registry/entities/search?q=` | Text search. | | `GET /registry/entities/{id}` | Detail (identifiers + annotations). | | `GET /registry/entities/{id}/neighbors` | Outgoing + incoming relations. | | `GET /registry/entities/{id}/datasets` | Linked datasets. | | `POST /registry/entities/{id}/identifiers` | Add an alias. | | `POST /registry/entities/{id}/relations` | Add a typed edge. | | `POST /registry/entities/{id}/annotations` | Attach a description / tag / note. | | `POST /registry/entities/extract` | Queue a Celery extract task. | | `POST /registry/entities/enrich` | Queue Celery enrichment tasks. | ## LLM enrichment LLM enrichers are gated on `ALPHASWARM_ENTITY_LLM_ENRICHMENT_ENABLED=true` to avoid surprise spend. When disabled, `enrich_one` returns `None` and the Celery task records a `skipped` count instead of calling the router. When enabled, the enricher uses `alphaswarm.llm.providers.router.router_complete` exclusively — never `litellm.completion` or `OllamaClient.generate`. Output is parsed strict JSON; malformed blobs are dropped. ## Don'ts - Don't extract entities by querying Postgres directly from a Celery task. 
Either pass an inline `rows` payload or read the Iceberg table via `alphaswarm.data.iceberg_catalog.read_arrow` (the standard path used by `extract_entities`). - Don't bypass `EntityRegistry` to write rows. Extractors should always go through `registry.upsert(...)`. - Don't replace `add_annotation` with raw SQL inserts. The `EntityAnnotation` row is also surfaced in `/registry/entities/{id}`, the entity browser UI, and (eventually) DataHub glossary terms. # Entity Relationship Diagram > The Postgres schema has ~110 ORM classes spread across 11 model files under [alphaswarm/persistence/](../alphaswarm/persistence/). One mega-ERD would be unreadable, so this doc breaks the schema into focused diagra... # Entity Relationship Diagram > Pair with [alphaswarm_docs/data-dictionary.md](../../reference/data-dictionary/index.md) (column-level > detail) and [alphaswarm_docs/domain-model.md](../../concepts/platform/domain-model.md) (narrative). > Doc map: [alphaswarm_docs/index.md](../../intro/index.md). The Postgres schema has ~110 ORM classes spread across 11 model files under [alphaswarm/persistence/](../alphaswarm/persistence/). One mega-ERD would be unreadable, so this doc breaks the schema into focused diagrams by domain. The final section is a global FK-only map showing only the cross-domain joins. Each per-domain ERD lists table names with the primary key (`PK`) and a short subset of columns. For full column lists, see [data-dictionary.md](../../reference/data-dictionary/index.md). ## Global FK map Cross-domain edges only — pick a starting table and trace where it fans out. 
```mermaid erDiagram instruments ||--o{ instrument_equity : "polymorphic" instruments ||--o{ instrument_option : "polymorphic" instruments ||--o{ instrument_future : "polymorphic" instruments ||--o{ data_links : "instrument_id" instruments ||--o{ corporate_events : "vt_symbol" instruments ||--o{ news_items : "vt_symbol" issuers ||--o{ instruments : "issuer_id" issuers ||--o{ financial_statements : "issuer_id" data_sources ||--o{ datasets : "provider" dataset_catalogs ||--o{ dataset_versions : "catalog_id" dataset_versions ||--o{ data_links : "dataset_version_id" dataset_versions ||--o{ model_versions : "dataset_version_id" dataset_versions ||--o{ split_plans : "dataset_version_id" split_plans ||--o{ split_artifacts : "plan_id" strategies ||--o{ strategy_versions : "strategy_id" strategies ||--o{ backtest_runs : "strategy_id" backtest_runs ||--o{ orders : "backtest_id" backtest_runs ||--o{ fills : "backtest_id" backtest_runs ||--o{ signals : "backtest_id" backtest_runs ||--o{ ledger_entries : "backtest_id" sessions ||--o{ chat_messages : "session_id" sessions ||--o{ agent_runs : "session_id" crew_runs ||--o{ agent_decisions : "crew_run_id" agent_decisions ||--o{ debate_turns : "decision_id" backtest_runs ||--o{ agent_backtests : "backtest_id" agent_judge_reports ||--o{ agent_replay_runs : "judge_id" feature_sets ||--o{ feature_set_versions : "feature_set_id" feature_sets ||--o{ feature_set_usages : "feature_set_id" ``` ## Core / Instruments Joined-table inheritance. Every concrete instrument subclass shares the parent `instruments` row and adds shape-specific columns in its own table keyed on `instruments.id`. The discriminator is `instruments.instrument_class`. 
```mermaid erDiagram instruments { uuid id PK string vt_symbol "AAPL.NASDAQ" string ticker string exchange string asset_class string security_type string instrument_class "discriminator" uuid issuer_id FK json identifiers } instrument_equity { uuid id PK_FK string isin string cusip string figi string lei string gics_sector float shares_outstanding } instrument_etf { uuid id PK_FK date inception_date float aum float expense_ratio bool is_leveraged } instrument_option { uuid id PK_FK string underlying float strike date expiry string kind "call|put" string style "european|american" } instrument_future { uuid id PK_FK string underlying date expiry float contract_size string cycle } instrument_fx_pair { uuid id PK_FK string base_currency string quote_currency float pip_size } instrument_crypto { uuid id PK_FK string subtype string chain string contract_address float max_leverage } instrument_index { uuid id PK_FK string administrator int constituent_count } instrument_bond { uuid id PK_FK float coupon date maturity string rating_sp } instrument_cfd { uuid id PK_FK string underlying float margin_rate } instrument_commodity { uuid id PK_FK string grade string unit_of_measure } instrument_synthetic { uuid id PK_FK json legs json leg_weights } instrument_betting { uuid id PK_FK string event_name string market_type } instrument_tokenized_asset { uuid id PK_FK string chain string contract_address string token_standard } instruments ||--o| instrument_equity : "spot" instruments ||--o| instrument_etf : "etf" instruments ||--o| instrument_option : "option" instruments ||--o| instrument_future : "future" instruments ||--o| instrument_fx_pair : "fx_pair" instruments ||--o| instrument_crypto : "crypto_token" instruments ||--o| instrument_index : "index" instruments ||--o| instrument_bond : "bond" instruments ||--o| instrument_cfd : "cfd" instruments ||--o| instrument_commodity : "spot_commodity" instruments ||--o| instrument_synthetic : "synthetic" instruments ||--o| 
instrument_betting : "betting" instruments ||--o| instrument_tokenized_asset : "nft" ``` ## Market data lineage + Iceberg catalog How AlphaSwarm tracks every dataset that flows into Iceberg. The `iceberg_identifier` column on `dataset_catalogs` was added in [alembic/versions/0011_iceberg_catalog_columns.py](../alembic/versions/0011_iceberg_catalog_columns.py). ```mermaid erDiagram data_sources { uuid id PK string name "yfinance|alpaca|cfpb" string kind "rest|csv|parquet" string base_url json meta } dataset_catalogs { uuid id PK string name string provider string domain "market.bars|cfpb.hmda" string frequency string storage_uri string iceberg_identifier "alphaswarm_cfpb.hmda_lar" string load_mode "managed|external" json llm_annotations json column_docs json tags } dataset_versions { uuid id PK uuid catalog_id FK int version string status "active|superseded" datetime as_of datetime start_time datetime end_time int row_count int symbol_count string dataset_hash string materialization_uri } data_links { uuid id PK uuid dataset_version_id FK uuid source_id FK uuid instrument_id FK string entity_kind "instrument|series" string entity_id datetime coverage_start datetime coverage_end int row_count } identifier_links { uuid id PK uuid instrument_id FK uuid source_id FK string identifier_kind string identifier_value } dataset_catalogs ||--o{ dataset_versions : "catalog_id" dataset_versions ||--o{ data_links : "dataset_version_id" data_sources ||--o{ data_links : "source_id" instruments ||--o{ data_links : "instrument_id" data_sources ||--o{ identifier_links : "source_id" instruments ||--o{ identifier_links : "instrument_id" ``` ## Agentic + ML Strategies, backtests, agent crews, ML deployments, and feature sets. 
```mermaid erDiagram strategies { uuid id PK string name int version text config_yaml string status "draft|backtesting|paper|live|retired" } strategy_versions { uuid id PK uuid strategy_id FK text config_yaml json meta } backtest_runs { uuid id PK uuid strategy_id FK string task_id string status datetime start datetime end float sharpe float sortino float max_drawdown string mlflow_run_id string dataset_hash uuid model_version_id FK uuid ml_experiment_run_id FK uuid experiment_plan_id FK uuid model_deployment_id FK } agent_runs { uuid id PK uuid session_id FK string crew_name string status } crew_runs { uuid id PK uuid agent_run_id FK string preset json config } agent_decisions { uuid id PK uuid backtest_id FK uuid strategy_id FK uuid crew_run_id FK string action "long|short|flat" float confidence text rationale } debate_turns { uuid id PK uuid crew_run_id FK uuid decision_id FK string role text content } agent_backtests { uuid id PK uuid backtest_id FK json crew_metrics } agent_judge_reports { uuid id PK uuid backtest_id FK text summary json scores } agent_replay_runs { uuid id PK uuid backtest_id FK uuid judge_id FK json replay_metrics } feature_sets { uuid id PK string name string kind "composite|ml4t|qlib" json specs int default_lookback_days } feature_set_versions { uuid id PK uuid feature_set_id FK string content_hash } model_versions { uuid id PK uuid dataset_version_id FK uuid split_plan_id FK string model_class json hyperparams string mlflow_run_id } model_deployments { uuid id PK uuid model_version_id FK string status "active|retired" json runtime_meta } strategies ||--o{ strategy_versions : "strategy_id" strategies ||--o{ backtest_runs : "strategy_id" backtest_runs ||--o{ agent_decisions : "backtest_id" backtest_runs ||--o{ agent_backtests : "backtest_id" backtest_runs ||--o{ agent_judge_reports : "backtest_id" backtest_runs ||--o{ agent_replay_runs : "backtest_id" crew_runs ||--o{ agent_decisions : "crew_run_id" agent_decisions ||--o{ debate_turns : 
"decision_id" feature_sets ||--o{ feature_set_versions : "feature_set_id" model_versions ||--o{ model_deployments : "model_version_id" ``` ## Ledger (signals / orders / fills / entries) Every signal, order, fill, and free-form audit entry written by [`LedgerWriter`](../alphaswarm/persistence/ledger.py). ```mermaid erDiagram signals { uuid id PK uuid strategy_id FK uuid backtest_id FK string vt_symbol string direction "long|short|net" float strength float confidence text rationale } orders { uuid id PK uuid backtest_id FK uuid strategy_id FK string vt_symbol string side "buy|sell" string order_type "market|limit|stop" float quantity float price string status } fills { uuid id PK uuid order_id FK float quantity float price datetime ts } ledger_entries { uuid id PK uuid backtest_id FK uuid strategy_id FK string entry_type "SIGNAL|ORDER|FILL|RISK|AUDIT" string level "info|warn|error" text message json payload } strategies ||--o{ signals : "strategy_id" backtest_runs ||--o{ signals : "backtest_id" strategies ||--o{ orders : "strategy_id" backtest_runs ||--o{ orders : "backtest_id" orders ||--o{ fills : "order_id" backtest_runs ||--o{ ledger_entries : "backtest_id" ``` ## News / Events / Fundamentals ```mermaid erDiagram news_items { uuid id PK string url string source datetime published_at text headline text body } news_item_entities { uuid id PK uuid news_item_id FK string vt_symbol string entity_kind "instrument|issuer|theme" } news_sentiments { uuid id PK uuid news_item_id FK string scorer "finbert|fingpt" float polarity float confidence } corporate_events { uuid id PK string vt_symbol string event_type "earnings|split|dividend|merger|ipo" datetime event_time json payload } earnings_event_rows { uuid id PK uuid event_id FK float eps_actual float eps_estimate float revenue_actual } dividend_event_rows { uuid id PK uuid event_id FK float amount date ex_date date pay_date } split_event_rows { uuid id PK uuid event_id FK float ratio } analyst_estimates { uuid id PK 
string vt_symbol string analyst float target_price } financial_statements { uuid id PK uuid issuer_id FK string period "Q|FY" date period_end json data } financial_ratios { uuid id PK uuid issuer_id FK date period_end float pe float pb float roe } earnings_call_transcripts { uuid id PK uuid issuer_id FK date call_date text content } news_items ||--o{ news_item_entities : "news_item_id" news_items ||--o{ news_sentiments : "news_item_id" corporate_events ||--o{ earnings_event_rows : "event_id" corporate_events ||--o{ dividend_event_rows : "event_id" corporate_events ||--o{ split_event_rows : "event_id" issuers ||--o{ financial_statements : "issuer_id" issuers ||--o{ financial_ratios : "issuer_id" issuers ||--o{ earnings_call_transcripts : "issuer_id" ``` ## Macro / FRED / GDelt ```mermaid erDiagram economic_series { uuid id PK string series_id "FRED:GDP" string title string frequency string units string source } economic_observations { uuid id PK uuid series_id FK date observation_date float value } fred_series { uuid id PK string series_id "GDP" string title string units string frequency } treasury_rates { uuid id PK date date float rate_3m float rate_2y float rate_10y float rate_30y } yield_curves { uuid id PK date date json tenors } cot_reports { uuid id PK date report_date string instrument json positions } sec_filings { uuid id PK uuid instrument_id FK uuid source_id FK string accession string form date filing_date } gdelt_mentions { uuid id PK uuid instrument_id FK uuid source_id FK datetime mention_time json gkg_payload } economic_series ||--o{ economic_observations : "series_id" instruments ||--o{ sec_filings : "instrument_id" instruments ||--o{ gdelt_mentions : "instrument_id" data_sources ||--o{ sec_filings : "source_id" data_sources ||--o{ gdelt_mentions : "source_id" ``` ## Entities / Issuers / Ownership ```mermaid erDiagram issuers { uuid id PK string name string lei string country string entity_kind "company|government|fund" } government_entities { uuid 
id PK_FK string country_code string level } funds { uuid id PK_FK string fund_family string fund_type } sectors { uuid id PK string code string name } industries { uuid id PK string code string name uuid sector_id FK } industry_classifications { uuid id PK uuid issuer_id FK uuid industry_id FK date as_of } entity_relationships { uuid id PK uuid parent_id FK uuid child_id FK string kind "subsidiary|owner|board" } locations { uuid id PK uuid issuer_id FK string country string city } key_executives { uuid id PK uuid issuer_id FK string name string title } insider_transactions { uuid id PK string vt_symbol string insider_name date transaction_date float quantity } institutional_holdings { uuid id PK string vt_symbol string holder_name date as_of float quantity } form_13f_holdings { uuid id PK string filer_cik string vt_symbol date period_end } short_interest { uuid id PK string vt_symbol date settlement_date float short_interest } politician_trades { uuid id PK string politician string vt_symbol date trade_date float amount } issuers ||--o| government_entities : "subclass" issuers ||--o| funds : "subclass" issuers ||--o{ industry_classifications : "issuer_id" sectors ||--o{ industries : "sector_id" industries ||--o{ industry_classifications : "industry_id" issuers ||--o{ entity_relationships : "parent_id" issuers ||--o{ locations : "issuer_id" issuers ||--o{ key_executives : "issuer_id" ``` ## Taxonomy Free-form tagging for issuers, instruments, and themes. 
```mermaid erDiagram taxonomy_schemes { uuid id PK string name "GICS|SASB|theme" } taxonomy_nodes { uuid id PK uuid scheme_id FK uuid parent_id FK string code string label } entity_tags { uuid id PK uuid node_id FK string entity_kind "issuer|instrument" string entity_id } entity_crosswalks { uuid id PK string from_kind string from_id string to_kind string to_id } taxonomy_schemes ||--o{ taxonomy_nodes : "scheme_id" taxonomy_nodes ||--o{ taxonomy_nodes : "parent_id" taxonomy_nodes ||--o{ entity_tags : "node_id" ``` ## Sessions / Chat / Optimization The conversational + experimentation layer. ```mermaid erDiagram sessions { uuid id PK string user string title json meta } chat_messages { uuid id PK uuid session_id FK string role "user|assistant|agent|tool" text content } optimization_runs { uuid id PK uuid strategy_id FK json search_space string status } optimization_trials { uuid id PK uuid run_id FK uuid backtest_id FK json params float objective } paper_trading_runs { uuid id PK uuid strategy_id FK string status datetime started_at datetime stopped_at } rl_episodes { uuid id PK string env_id int episode_id float reward } sessions ||--o{ chat_messages : "session_id" sessions ||--o{ agent_runs : "session_id" strategies ||--o{ optimization_runs : "strategy_id" optimization_runs ||--o{ optimization_trials : "run_id" strategies ||--o{ paper_trading_runs : "strategy_id" ``` ## Bots Tables introduced by the Bot Entity Refactor (Alembic [`0020_bots`](../alembic/versions/0020_bots.py)). 
```mermaid
erDiagram
    PROJECTS ||--o{ BOTS : "owns"
    BOTS ||--o{ BOT_VERSIONS : "snapshots"
    BOTS ||--o{ BOT_DEPLOYMENTS : "runs"
    BOT_VERSIONS ||--o{ BOT_DEPLOYMENTS : "produces"
    BOTS {
        string id PK
        string project_id FK
        string slug
        string kind
        string name
        text description
        int current_version
        text spec_yaml
        string status
        json annotations
    }
    BOT_VERSIONS {
        string id PK
        string bot_id FK
        int version
        string spec_hash
        json payload
        text notes
        string created_by
    }
    BOT_DEPLOYMENTS {
        string id PK
        string bot_id FK
        string version_id FK
        string target
        string task_id
        string status
        text manifest_yaml
        json result_summary
        text error
    }
```

- `(project_id, slug)` is unique on `bots`.
- `(bot_id, spec_hash)` is unique on `bot_versions` (immutable snapshots).
- `bot_deployments.target` is one of `paper_session` / `kubernetes` / `backtest_only` / `chat` / `backtest`.

## Data layer expansion (sinks, producers, streaming links)

Tables introduced by the Data Pipelines Hub work (Alembic [`0024_data_layer_expansion`](../alembic/versions/0024_data_layer_expansion.py)). All four tables use `ProjectScopedMixin`.
```mermaid
erDiagram
    PROJECTS ||--o{ SINKS : "owns"
    SINKS ||--o{ SINK_VERSIONS : "snapshots"
    PROJECTS ||--o{ MARKET_DATA_PRODUCERS : "owns"
    DATASET_CATALOGS ||--o{ STREAMING_DATASET_LINKS : "linked"
    PIPELINE_MANIFESTS ||--o{ DATASET_PIPELINE_CONFIGS : "binds"
    SINKS {
        string id PK
        string project_id FK
        string name
        string kind
        string display_name
        json config_json
        json tags
        bool requires_manifest_node
        int current_version
        bool enabled
    }
    SINK_VERSIONS {
        string id PK
        string sink_id FK
        int version
        string spec_hash
        json payload
        text notes
    }
    MARKET_DATA_PRODUCERS {
        string id PK
        string project_id FK
        string name
        string kind
        string runtime
        string deployment_namespace
        string deployment_name
        json topics
        int desired_replicas
        int current_replicas
        string last_status
    }
    STREAMING_DATASET_LINKS {
        string id PK
        string dataset_catalog_id FK
        string kind
        string target_ref
        string cluster_ref
        string direction
        json metadata_json
        bool enabled
    }
```

Notes:

- `(project_id, name)` is unique on `sinks` and `market_data_producers`.
- `(sink_id, spec_hash)` and `(sink_id, version)` are unique on `sink_versions` (mirrors the `bot_versions` pattern).
- `(dataset_catalog_id, kind, target_ref, direction)` is unique on `streaming_dataset_links` so the [refresh_links](../alphaswarm/tasks/streaming_link_tasks.py) task can be re-run idempotently.
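
To make the idempotency note concrete, here is a minimal sketch of how a four-column unique key lets a refresh be repeated without creating duplicate rows. It uses SQLite's upsert syntax as a stand-in; the real task, its SQL dialect, and the helper names (`make_links_db`, `upsert_link`) are assumptions, as are the sample `kind` and `direction` values. The integer primary key is a simplification of the string `id` in the ERD.

```python
import sqlite3


def make_links_db():
    """In-memory table carrying only the columns needed to show the unique key."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        """
        CREATE TABLE streaming_dataset_links (
            id INTEGER PRIMARY KEY,
            dataset_catalog_id TEXT NOT NULL,
            kind TEXT NOT NULL,
            target_ref TEXT NOT NULL,
            direction TEXT NOT NULL,
            enabled INTEGER NOT NULL DEFAULT 1,
            UNIQUE (dataset_catalog_id, kind, target_ref, direction)
        )
        """
    )
    return conn


def upsert_link(conn, catalog_id, kind, target_ref, direction, enabled=True):
    """Insert the link, or update it in place when the unique key already exists."""
    conn.execute(
        """
        INSERT INTO streaming_dataset_links
            (dataset_catalog_id, kind, target_ref, direction, enabled)
        VALUES (?, ?, ?, ?, ?)
        ON CONFLICT (dataset_catalog_id, kind, target_ref, direction)
        DO UPDATE SET enabled = excluded.enabled
        """,
        (catalog_id, kind, target_ref, direction, int(enabled)),
    )


conn = make_links_db()
# Running the same refresh twice leaves exactly one row behind.
for _ in range(2):
    upsert_link(conn, "cat-1", "kafka_topic", "bars.1m", "source")
print(conn.execute("SELECT COUNT(*) FROM streaming_dataset_links").fetchone()[0])
```

Without the unique constraint the second pass would have to query first and branch, which is where re-run bugs tend to appear.
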
## ML alpha-backtest linkage (Alembic 0025)

```mermaid
erDiagram
    ml_experiment_runs ||--o| ml_alpha_backtest_runs : "ml_experiment_run_id"
    backtest_runs ||--o| ml_alpha_backtest_runs : "backtest_run_id"
    model_versions ||--o| ml_alpha_backtest_runs : "model_version_id"
    model_deployments ||--o| ml_alpha_backtest_runs : "model_deployment_id"
    experiment_plans ||--o| ml_alpha_backtest_runs : "experiment_plan_id"
    ml_alpha_backtest_runs ||--o{ ml_prediction_audit : "alpha_backtest_run_id"
    ml_alpha_backtest_runs {
        uuid id PK
        string task_id
        string run_name
        string status
        uuid ml_experiment_run_id FK
        uuid backtest_run_id FK
        uuid model_version_id FK
        uuid model_deployment_id FK
        uuid experiment_plan_id FK
        string mlflow_run_id
        json ml_metrics
        json trading_metrics
        json combined_metrics
        json attribution
        datetime started_at
        datetime completed_at
    }
    ml_prediction_audit {
        uuid id PK
        uuid alpha_backtest_run_id FK
        string vt_symbol
        datetime ts
        float prediction
        float label
        float position_after
        float pnl_after_bar
    }
```

The four new FKs on `backtest_runs` (added by Alembic 0025) close the loop from a backtest result back to the trained model that produced its alpha:

- `model_version_id`: the registered `ModelVersion` row.
- `ml_experiment_run_id`: the `MLExperimentRun` that trained it.
- `experiment_plan_id`: the `ExperimentPlan` lineage row.
- `model_deployment_id`: the `ModelDeployment` used to wire the model into the strategy via `DeployedModelAlpha`.

## Adding a new model

When you add a new ORM class:

1. Add the class to the appropriate `alphaswarm/persistence/models_*.py` (or `models.py` for cross-domain things).
2. Add an Alembic migration (`alembic revision --autogenerate -m "add foo"`). **Never edit a shipped migration.**
3. Update [alphaswarm_docs/data-dictionary.md](../../reference/data-dictionary/index.md) with the new table's columns.
4. Add the table to the relevant per-domain ERD above (or open a new one if it's a new domain).
5.
If it has FKs into other domains, add those edges to the global FK map at the top of this file.

# Experiments + Tests umbrella (Phase 1 of the multi-tenant rollout)

> | Table | Purpose | Key columns | | ----- | ------- | ----------- | | `experiments` | User-driven container; one row per hypothesis / sweep / iteration | `id`, `slug`, `name`, `kind` (`ml`/`rl`/`analy...

# Experiments + Tests umbrella (Phase 1 of the multi-tenant rollout)

The umbrella sits **above** every existing typed run table so the "what was the user trying?" question gets one consistent answer regardless of which downstream engine produced the artefact.

## Tables

| Table | Purpose | Key columns |
| ----- | ------- | ----------- |
| `experiments` | User-driven container; one row per hypothesis / sweep / iteration | `id`, `slug`, `name`, `kind` (`ml`/`rl`/`analysis`/`backtest`/`paper`/`bot`/`agent`/`research`/`hypothesis`/`optimization`/`ablation`/`sweep`), `status`, `parent_experiment_id`, `lab_id`, `metrics jsonb` |
| `tests` | Pass/fail-style assertions attached to an experiment | `id`, `experiment_id`, `slug`, `name`, `assertion_kind`, `passed`, `details jsonb`, `run_ref_table`, `run_ref_id` |

Both inherit `ProjectScopedMixin` (`owner_user_id` / `workspace_id` / `project_id`).

## Linkage to typed runs

Migration 0037 added nullable `experiment_id` (and `test_id` where it applies) columns to:

- `backtest_runs`
- `ml_experiment_runs`
- `rl_runs`
- `analysis_runs`
- `bot_deployments`
- `strategy_tests` (also gets `test_id`)
- `paper_trading_runs`
- `agent_runs_v2`
- `agent_runs`

Existing rows stay at `NULL`; only new flows opt in. The [`LedgerWriter`](../alphaswarm/persistence/ledger.py) `_stamp` chain copies `RequestContext.experiment_id` / `.test_id` onto every row that has the matching attribute, so most flows only need to carry a populated `RequestContext`.
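
The stamping behaviour described above can be pictured with a small sketch. It is not the `LedgerWriter` code: the `stamp` helper, the row classes, and the choice to leave already-populated values untouched are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class RequestContext:
    """Hypothetical stand-in for the request-scoped context."""
    experiment_id: Optional[str] = None
    test_id: Optional[str] = None


@dataclass
class BacktestRunRow:
    """A run row that only has experiment_id."""
    id: str
    experiment_id: Optional[str] = None


@dataclass
class StrategyTestRow:
    """A run row that has both experiment_id and test_id."""
    id: str
    experiment_id: Optional[str] = None
    test_id: Optional[str] = None


def stamp(row, ctx):
    """Copy context ids onto the row wherever the row has a matching attribute."""
    for field_name in ("experiment_id", "test_id"):
        value = getattr(ctx, field_name)
        # Assumption: values already set on the row are not overwritten.
        if value is not None and hasattr(row, field_name) and getattr(row, field_name) is None:
            setattr(row, field_name, value)
    return row


ctx = RequestContext(experiment_id="exp-1", test_id="test-9")
print(stamp(BacktestRunRow(id="bt-1"), ctx))
print(stamp(StrategyTestRow(id="st-1"), ctx))
```

Keying on attribute presence is what lets one helper serve every typed run table: rows without a `test_id` column are simply skipped for that field.
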
## Hard rule

Hard rule 34 in [`AGENTS.md`](https://github.com/julianwileymac/alphaswarm/blob/main/AGENTS.md): "Every new run-producing flow MUST populate `experiment_id` (and `test_id` where applicable) on its run row. Don't add a new `*_runs` table without an `experiment_id` FK."

## REST surface

| Method + path | Purpose |
| ------------- | ------- |
| `GET /experiments` | List (filter by `project_id`, `kind`, `status`, `parent_experiment_id`) |
| `POST /experiments` | Create (slug auto-derived from name) |
| `GET /experiments/{id}` | Describe |
| `PATCH /experiments/{id}` | Update (status/metrics/parent) |
| `DELETE /experiments/{id}` | Cascade-deletes tests |
| `GET /experiments/{id}/runs` | Stitched view of every typed run row pointing here |
| `GET /tests` | List (filter by `experiment_id`, `passed`, `assertion_kind`) |
| `POST /tests` | Create attached to an experiment |
| `GET /tests/{id}` | Describe |
| `POST /tests/{id}/evaluate` | Set the pass/fail verdict + ref into a typed run row |

## MCP surface

- `data.experiments.list`: list / filter.
- `data.experiments.tree`: nested view (`PARENT_OF` chain).
- `data.experiments.describe`: full row + counts of linked runs.
- `data.tests.list`: list / filter.
- `data.tests.describe`: full row.

## Cross-reference

- The Phase 2 ownership graph projects every experiment + test + linked run into Neo4j. See [`alphaswarm_docs/ownership-graph.md`](../../concepts/platform/ownership-graph.md).
- The Phase 6 frontend ContextBar lets the user pin a specific experiment (when the route declares one). See the route handlers for which surfaces opt in.
- The Phase 7 LEAN clone-to-workspace flow optionally creates an experiment when the user provides a name. See [`alphaswarm_docs/strategy-templates.md`](../../concepts/strategy/strategy-templates.md).

# Major Flows

> End-to-end sequence and state diagrams for the four flows that human and AI contributors most often need to reason about.
Each diagram cites the canonical files; if the diagram and the code disagree, ... # Major Flows > Pair with [alphaswarm_docs/architecture.md](../../concepts/platform/architecture.md) (system view) and > [alphaswarm_docs/erd.md](../../concepts/platform/erd.md) (data model). > Doc map: [alphaswarm_docs/index.md](../../intro/index.md). End-to-end sequence and state diagrams for the four flows that human and AI contributors most often need to reason about. Each diagram cites the canonical files; if the diagram and the code disagree, the code wins (and the doc is stale — please update). ## 1. Generic file → Iceberg ingestion The discovery → director → materialise → verify → annotate pipeline that powers the regulatory-corpus ingest. Canonical doc: [alphaswarm_docs/data-catalog.md](../../concepts/data/data-catalog.md). ```mermaid sequenceDiagram actor User participant CLI as scripts/ingest_regulatory.py participant API as FastAPI participant Celery participant Disc as discovery participant Dir as director (Nemotron) participant Mat as materialize participant Verify as verifier (Nemotron) participant Ann as annotate (Nemotron) participant Iceberg participant DB as Postgres participant Bus as Redis pub/sub User->>CLI: invoke per source path CLI->>API: POST /pipelines/ingest/regulatory API->>Celery: enqueue ingest_local_paths_with_director API-->>CLI: 202 task_id CLI->>Bus: SUBSCRIBE alphaswarm:task:<task_id> loop per source path Celery->>Disc: discover_datasets(path) Disc-->>Celery: list~DiscoveredDataset~ Celery->>Dir: plan_ingestion(datasets) Dir-->>Celery: IngestionPlan Celery->>Bus: publish phase=plan loop per planned dataset Celery->>Mat: materialize_dataset(planned) Mat->>Iceberg: ensure_namespace + append_arrow* Mat-->>Celery: MaterializeResult alt rows below floor Celery->>Verify: verify_after_materialise Verify-->>Celery: VerifierVerdict opt retry Celery->>Mat: re-run with new caps end end opt annotate=true Celery->>Ann: annotate_table Ann-->>Celery: 
AnnotationResult Ann->>DB: register_iceberg_dataset end Celery->>Bus: publish phase=materialize|verify|annotate end end Celery->>DB: write IngestionReport summary Celery->>Bus: publish stage=done Bus-->>CLI: final payload CLI->>CLI: render markdown summary + audit log ``` Canonical files: - [alphaswarm/data/pipelines/discovery.py](../alphaswarm/data/pipelines/discovery.py) - [alphaswarm/data/pipelines/director.py](../alphaswarm/data/pipelines/director.py) - [alphaswarm/data/pipelines/materialize.py](../alphaswarm/data/pipelines/materialize.py) - [alphaswarm/data/pipelines/annotate.py](../alphaswarm/data/pipelines/annotate.py) - [alphaswarm/data/pipelines/runner.py](../alphaswarm/data/pipelines/runner.py) - [alphaswarm/tasks/ingestion_tasks.py](../alphaswarm/tasks/ingestion_tasks.py) - [scripts/ingest_regulatory.py](../scripts/ingest_regulatory.py) - [scripts/_run_one_source.py](../scripts/_run_one_source.py) ## 2. Backtest dispatch ```mermaid sequenceDiagram actor User participant UI as Next.js webui participant API as FastAPI /backtest participant DB as Postgres participant Celery as worker (queue=backtest) participant Strat as Strategy + Engine participant Duck as DuckDB participant Iceberg participant MLflow participant Bus as Redis pub/sub participant Ledger as LedgerWriter User->>UI: configure + run backtest UI->>API: POST /backtest {strategy_id, start, end, engine} API->>DB: insert BacktestRun(status=pending) API->>Celery: enqueue run_backtest(backtest_id) API-->>UI: 202 {task_id, stream_url} UI->>API: WebSocket /chat/stream/<task_id> Celery->>DB: load BacktestRun + Strategy Celery->>MLflow: start_run(experiment=alphaswarm-default) Celery->>Iceberg: read bars (DuckDB view) Duck-->>Celery: pandas DataFrame Celery->>Strat: instantiate FrameworkAlgorithm(...) 
loop per bar Strat->>Strat: universe → alpha → portfolio → risk → execution Strat-->>Celery: list~OrderRequest~ Celery->>Ledger: record_signal / record_order Ledger->>DB: insert signals / orders Celery->>Bus: publish progress Bus-->>UI: WebSocket frame end Celery->>MLflow: log_metrics + log_artifact(equity_curve.csv) Celery->>DB: update BacktestRun(status=completed, sharpe, ...) Celery->>Bus: publish stage=done Bus-->>UI: final summary ``` Canonical files: - [alphaswarm/api/routes/backtest.py](../alphaswarm/api/routes/backtest.py) - [alphaswarm/tasks/backtest_tasks.py](../alphaswarm/tasks/backtest_tasks.py) - [alphaswarm/backtest/engine.py](../alphaswarm/backtest/engine.py) - [alphaswarm/backtest/runner.py](../alphaswarm/backtest/runner.py) - [alphaswarm/strategies/framework.py](../alphaswarm/strategies/framework.py) - [alphaswarm/persistence/ledger.py](../alphaswarm/persistence/ledger.py) - [alphaswarm/mlops/autolog.py](../alphaswarm/mlops/autolog.py) ## 3. Agentic crew run The dual-tier (deep + quick LLM) CrewAI graph used by the TradingAgents-style preset. Files: [alphaswarm/tasks/agentic_backtest_tasks.py](../alphaswarm/tasks/agentic_backtest_tasks.py), [alphaswarm/agents/](../alphaswarm/agents/). ```mermaid sequenceDiagram actor User participant UI participant API as FastAPI /agentic/* participant Celery as worker (queue=agents) participant Crew as CrewAI graph participant Mem as ChromaDB (per-role memory) participant LLMD as Nemotron deep tier participant LLMQ as Llama quick tier participant DB as Postgres participant MLflow participant Ledger as LedgerWriter participant Bus as Redis pub/sub User->>UI: pick preset + universe UI->>API: POST /agentic/run {preset, symbols, ...} API->>DB: insert CrewRun(crew_type=trader, status=queued) API->>Celery: enqueue run_agentic_pipeline API-->>UI: 202 task_id Celery->>Crew: build graph from preset YAML Crew->>Mem: load BM25 memory per role loop per debate round Crew->>LLMQ: planner (quick tier) - which tool? 
LLMQ-->>Crew: tool selection Crew->>LLMD: research analyst (deep tier) LLMD-->>Crew: structured analysis Crew->>DB: insert DebateTurn rows Crew->>Bus: publish phase=debate end Crew->>LLMD: trader synthesis (deep) LLMD-->>Crew: AgentDecision (long/short/flat + rationale) Crew->>DB: insert AgentDecision Crew->>Mem: persist conclusion to BM25 Note over Crew,Ledger: Optional: replay through backtest engine Crew->>Celery: enqueue precompute_decisions Celery->>Celery: backtest replay Celery->>Ledger: record signals/orders Celery->>MLflow: log crew metrics Celery->>DB: insert AgentBacktest Celery->>Bus: publish stage=done Bus-->>UI: WebSocket frame ``` Canonical files: - [alphaswarm/api/routes/agentic.py](../alphaswarm/api/routes/agentic.py) - [alphaswarm/tasks/agentic_backtest_tasks.py](../alphaswarm/tasks/agentic_backtest_tasks.py) - [alphaswarm/agents/](../alphaswarm/agents/) - [alphaswarm/llm/providers/router.py](../alphaswarm/llm/providers/router.py) ## 4. Paper trading session ```mermaid stateDiagram-v2 [*] --> Pending : alphaswarm.tasks.paper_tasks.run_paper enqueued Pending --> Bootstrapping : worker dequeues Bootstrapping --> WarmingUp : load strategy + history bars WarmingUp --> Running : feed publishes first live bar state Running { [*] --> Heartbeat Heartbeat --> ProcessBar : bar arrives ProcessBar --> EmitOrders : strategy.on_bar yields orders EmitOrders --> RiskCheck : portfolio + risk model RiskCheck --> SubmitOrders : within limits RiskCheck --> RejectOrders : kill switch / risk breach SubmitOrders --> Heartbeat : write fills, ledger RejectOrders --> Heartbeat : write ledger.RISK } Running --> Halted : kill_switch_key set / risk breach Running --> Stopping : user POST /paper/stop Stopping --> Stopped : flush ledger + state Halted --> Stopped : operator clears Stopped --> [*] Running --> Stale : missed heartbeat > threshold Stale --> Halted : safety ``` The kill switch is a Redis key (`ALPHASWARM_RISK_KILL_SWITCH_KEY`, default `alphaswarm:kill_switch`); set 
it from anywhere to stop a session.

Canonical files:

- [alphaswarm/api/routes/paper.py](../alphaswarm/api/routes/paper.py)
- [alphaswarm/tasks/paper_tasks.py](../alphaswarm/tasks/paper_tasks.py)
- [alphaswarm/trading/runner.py](../alphaswarm/trading/runner.py)
- [alphaswarm/trading/session.py](../alphaswarm/trading/session.py)
- [alphaswarm/risk/](../alphaswarm/risk/)

## 5. (Bonus) Live-data subscription

The browser asks the API for a live data stream; the API allocates a Redis pub/sub channel that bridges the broker feed to a WebSocket.

```mermaid
sequenceDiagram
    actor User
    participant UI
    participant API as FastAPI /live/subscribe
    participant Bridge as live broker bridge
    participant Broker as Alpaca / IBKR / sim
    participant Bus as Redis pub/sub (alphaswarm:live:<ch>)
    participant WS as /live/<channel_id>
    User->>UI: open live tab
    UI->>API: POST /live/subscribe {venue, symbols}
    API->>Bridge: spawn bridge task
    Bridge->>Broker: subscribe to symbols
    API-->>UI: {channel_id, ws_url}
    UI->>WS: WebSocket connect
    loop per market event
        Broker-->>Bridge: bar / quote / trade
        Bridge->>Bus: publish on alphaswarm:live:<ch>
        Bus-->>WS: deliver
        WS-->>UI: WebSocket frame
    end
    User->>WS: close tab
    WS->>Bridge: connection closed
    Bridge->>Broker: unsubscribe (if last consumer)
```

Canonical files:

- [alphaswarm/api/routes/market_data_live.py](../alphaswarm/api/routes/market_data_live.py)
- [alphaswarm/streaming/](../alphaswarm/streaming/)
- [alphaswarm/ws/broker.py](../alphaswarm/ws/broker.py)

## Cross-cutting: progress bus

Every long-running task in AlphaSwarm uses the **same** progress bus pattern:

```mermaid
flowchart LR
    Task[Celery task] -->|emit| Helper["alphaswarm.tasks._progress.emit"]
    Helper -->|publish| Redis[("Redis pub/sub alphaswarm:task:<task_id>")]
    Redis -->|asubscribe| WS[WebSocket relay /chat/stream]
    WS -->|frames| Browser
    Redis -->|subscribe| CLI[CLI scripts]
```

API to remember:

- `emit(task_id, stage, message, **extras)`: publish a progress frame.
- `emit_done(task_id, result)`: terminal `stage="done"` + result payload.
- `emit_error(task_id, error)`: terminal `stage="error"`.

Don't publish to Redis directly from your task code; always go through [alphaswarm/tasks/_progress.py](../alphaswarm/tasks/_progress.py) so the frame shape stays consistent.

# Instrument taxonomy

> The legacy taxonomy treated REITs and depositary receipts as plain ``InstrumentEquity`` rows with a discriminator flag (``is_adr``), and modelled OTC derivatives as opaque blobs. That worked while age...

# Instrument taxonomy

> Status: **Phase 1 shipped** (Alembic 0039). Adds REIT / mutual fund / OTC
> derivative / ADR / GDR as first-class polymorphic subclasses of
> :class:`alphaswarm.persistence.models.Instrument` plus a registry table
> (``instrument_measures``) that catalogs which metrics are available for
> each instrument.

## Why

The legacy taxonomy treated REITs and depositary receipts as plain ``InstrumentEquity`` rows with a discriminator flag (``is_adr``), and modelled OTC derivatives as opaque blobs. That worked while agents only routed cash equities and listed options, but it broke as soon as the platform tried to:

* compute the cross-market basis between an NYSE-listed ADR and its foreign common (no FK to the underlying, no conversion ratio, no depository bank metadata);
* run a REIT sector-rotation strategy (no FFO, no payout ratio, no property-portfolio composition);
* clear an OTC swap through a CCP (no LEI, no ISDA master agreement id, no notional / collateral fields).

Phase 1 lifts these shapes into first-class joined-table subclasses with the columns the trading + risk + cross-market arbitrage paths read directly.
## Taxonomy

| Class | SQL table | ``polymorphic_identity`` | InstrumentClass | AssetClass |
| --- | --- | --- | --- | --- |
| ``Equity`` | ``instrument_equity`` | ``spot`` | ``SPOT`` | ``EQUITY`` |
| ``ETF`` | ``instrument_etf`` | ``etf`` | ``ETF`` | ``EQUITY`` |
| ``IndexInstrument`` | ``instrument_index`` | ``index`` | ``INDEX`` | ``INDEX`` |
| ``Bond`` | ``instrument_bond`` | ``bond`` | ``BOND`` | ``RATES`` |
| ``FuturesContract`` | ``instrument_future`` | ``future`` | ``FUTURE`` | ``COMMODITY`` |
| ``OptionContract`` | ``instrument_option`` | ``option`` | ``OPTION`` | ``EQUITY`` |
| ``CurrencyPair`` | ``instrument_fx_pair`` | ``fx_pair`` | ``SPOT`` | ``FX`` |
| ``CryptoToken`` | ``instrument_crypto`` | ``crypto_token`` | ``CRYPTO_TOKEN`` | ``CRYPTO`` |
| ``Cfd`` | ``instrument_cfd`` | ``cfd`` | ``CFD`` | ``EQUITY`` |
| ``Commodity`` | ``instrument_commodity`` | ``spot_commodity`` | ``SPOT`` | ``COMMODITY`` |
| ``SyntheticInstrument`` | ``instrument_synthetic`` | ``synthetic`` | ``SYNTHETIC`` | ``MIXED`` |
| ``BettingInstrument`` | ``instrument_betting`` | ``betting`` | ``BETTING`` | ``EVENT`` |
| ``TokenizedAsset`` | ``instrument_tokenized_asset`` | ``nft`` | ``NFT`` | ``CRYPTO`` |
| **``REIT``** | **``instrument_reit``** | **``reit``** | **``REIT``** | **``EQUITY``** |
| **``MutualFund``** | **``instrument_mutual_fund``** | **``mutual_fund``** | **``MUTUAL_FUND``** | **``EQUITY``** |
| **``OTCDerivative``** | **``instrument_otc_derivative``** | **``otc_derivative``** | **``OTC_DERIVATIVE``** | **``MIXED``** |
| **``AmericanDepositaryReceipt``** | **``instrument_adr``** | **``adr``** | **``ADR``** | **``EQUITY``** |
| **``GlobalDepositaryReceipt``** | **``instrument_gdr``** | **``gdr``** | **``GDR``** | **``EQUITY``** |

Phase 1 rows are bolded.
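
One detail the table implies is that ``polymorphic_identity`` is unique per row while ``InstrumentClass`` is not. The sketch below copies three columns of the table into a plain list to check that; the helper names (`table_for`, `shared_instrument_classes`) are illustrative only, and it assumes the table is exhaustive.

```python
from collections import Counter

# (polymorphic_identity, SQL table, InstrumentClass), copied from the table above.
TAXONOMY = [
    ("spot", "instrument_equity", "SPOT"),
    ("etf", "instrument_etf", "ETF"),
    ("index", "instrument_index", "INDEX"),
    ("bond", "instrument_bond", "BOND"),
    ("future", "instrument_future", "FUTURE"),
    ("option", "instrument_option", "OPTION"),
    ("fx_pair", "instrument_fx_pair", "SPOT"),
    ("crypto_token", "instrument_crypto", "CRYPTO_TOKEN"),
    ("cfd", "instrument_cfd", "CFD"),
    ("spot_commodity", "instrument_commodity", "SPOT"),
    ("synthetic", "instrument_synthetic", "SYNTHETIC"),
    ("betting", "instrument_betting", "BETTING"),
    ("nft", "instrument_tokenized_asset", "NFT"),
    ("reit", "instrument_reit", "REIT"),
    ("mutual_fund", "instrument_mutual_fund", "MUTUAL_FUND"),
    ("otc_derivative", "instrument_otc_derivative", "OTC_DERIVATIVE"),
    ("adr", "instrument_adr", "ADR"),
    ("gdr", "instrument_gdr", "GDR"),
]

TABLE_BY_IDENTITY = {identity: table for identity, table, _ in TAXONOMY}
CLASS_COUNTS = Counter(instrument_class for _, _, instrument_class in TAXONOMY)


def table_for(identity):
    """Resolve the subclass table from the polymorphic identity."""
    return TABLE_BY_IDENTITY[identity]


def shared_instrument_classes():
    """InstrumentClass values used by more than one row of the table."""
    return sorted(cls for cls, count in CLASS_COUNTS.items() if count > 1)


print(table_for("adr"))
print(shared_instrument_classes())
```

Because ``SPOT`` is shared by equities, FX pairs, and spot commodities, a lookup of the subclass table has to key on the identity rather than on ``InstrumentClass``.
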
### REIT

``InstrumentREIT`` adds the columns a REIT-aware strategy needs:

* ``reit_class`` -- ``equity``, ``mortgage``, ``hybrid``, ``public_non_listed``, ``private``
* ``property_sector`` -- ``residential``, ``commercial``, ``industrial``, ``healthcare``, ``data_center``, ``retail``, ``hospitality``, ``diversified``, ``infrastructure``, ``timber``
* ``property_portfolio_json`` -- list of property dicts (the discovery service surfaces these without spinning up a separate ``reit_properties`` table)
* ``distribution_yield`` / ``ffo_per_share`` / ``payout_ratio`` / ``debt_to_equity``

### Mutual fund

``InstrumentMutualFund`` covers open-end and closed-end funds. The discriminator that distinguishes it from ``InstrumentETF`` is the trading mechanism (end-of-day NAV vs intraday creation-redemption).

* ``fund_family`` (Vanguard / Fidelity / BlackRock / ...)
* ``share_class`` (A / B / C / I / R / Z / retail / institutional)
* ``fund_kind`` (open_end / closed_end / money_market / target_date / ucits / sicav)
* ``expense_ratio`` / ``management_fee`` / ``minimum_investment``

### OTC derivative

``InstrumentOTCDerivative`` is the catch-all for the OTC universe. The ``instrument_kind`` discriminator selects the specific shape:

* ``swap`` / ``swaption`` / ``cap_floor`` / ``forward`` / ``exotic``
* ``variance_swap`` / ``credit_default_swap`` / ``total_return_swap`` / ``basket_swap``

Regulatory identity flows through ``counterparty_lei`` plus ``isda_master_agreement_id`` so trade-repository reconciliation (DTCC, REGIS-TR) works without a separate registration step. The ``legs_json`` column stores the leg structure inline so a single class supports the entire OTC universe without a tree of subclasses.
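As a rough illustration of the "one class, inline legs" idea, the sketch below assembles the column values for a fixed-for-floating swap. The column names and the ``instrument_kind`` values come from this section. The helper `build_otc_payload`, the keys inside each leg dict, and the identifier values are assumptions made up for the example, not the real ``legs_json`` schema.

```python
import json

# instrument_kind values listed in this section.
OTC_KINDS = frozenset({
    "swap", "swaption", "cap_floor", "forward", "exotic",
    "variance_swap", "credit_default_swap", "total_return_swap", "basket_swap",
})


def build_otc_payload(instrument_kind, counterparty_lei, isda_master_agreement_id, legs):
    """Return a dict of column values; the leg structure is serialized inline."""
    if instrument_kind not in OTC_KINDS:
        raise ValueError(f"unknown instrument_kind: {instrument_kind!r}")
    if not legs:
        raise ValueError("an OTC derivative needs at least one leg")
    return {
        "instrument_kind": instrument_kind,
        "counterparty_lei": counterparty_lei,
        "isda_master_agreement_id": isda_master_agreement_id,
        "legs_json": json.dumps(legs, sort_keys=True),
    }


# Hypothetical leg layout: a payer-fixed / receiver-floating interest-rate swap.
swap_legs = [
    {"direction": "pay", "rate_type": "fixed", "rate": 0.035, "notional": 1_000_000, "currency": "USD"},
    {"direction": "receive", "rate_type": "floating", "index": "SOFR", "notional": 1_000_000, "currency": "USD"},
]
payload = build_otc_payload(
    "swap",
    counterparty_lei="EXAMPLELEI0000000000",   # placeholder, not a real LEI
    isda_master_agreement_id="ISDA-EXAMPLE-1",  # placeholder
    legs=swap_legs,
)
```

A swaption or total-return swap would reuse the same function with a different ``instrument_kind`` and a different list of leg dicts, which is the trade-off the section describes: flexibility in exchange for schema checks living in application code rather than in table columns.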
### ADR / GDR

Both subclasses carry:

* ``underlying_instrument_id`` -- FK to the foreign equity row
* ``conversion_ratio`` -- shares of foreign common per receipt
* ``depository_bank_name`` / ``depository_bank_lei``
* ADR adds ``sponsorship_level`` (I / II / III / 144A / Reg_S / unsponsored)
* GDR adds ``regulatory_regime`` (Reg_S / Rule_144A / Reg_S_144A / full_listing) plus a non-US ``listing_venue``

The Phase 4 cross-market basis algorithm reads ``adr.conversion_ratio`` and walks ``adr.underlying_instrument_id`` to fetch the local price directly -- no extra join needed.

## ``instrument_measures`` registry

Catalog of "what data exists for this instrument?". One row per ``(instrument_id, measure_type, frequency, dataset_field)`` tuple.

Common ``measure_type`` values: ``price``, ``volume``, ``open_interest``, ``implied_volatility``, ``dividend_yield``, ``ffo``, ``nav``, ``distribution``, ``greek_delta``, ``greek_gamma``, ``basis``, ``spread``, ``turnover``, ``bid_ask_spread``.

Common ``frequency`` values: ``tick``, ``second``, ``minute``, ``hour``, ``day``, ``week``, ``month``, ``quarter``, ``annual``, ``event_driven``, ``adhoc``.

Agents query this registry through the ``data.instruments.measures`` DataMCP tool BEFORE drafting a SQL / Iceberg query, so they don't select a column that doesn't exist for the instrument-frequency pair they care about.

## How to add a new subclass

1. Add an :class:`InstrumentClass` enum value in [`alphaswarm/core/domain/enums.py`](../alphaswarm/core/domain/enums.py).
2. Add the matching joined-table SQL subclass in [`alphaswarm/persistence/models_instruments.py`](../alphaswarm/persistence/models_instruments.py). Set ``polymorphic_identity`` to the enum value.
3. Add the in-memory domain class in [`alphaswarm/core/domain/instrument.py`](../alphaswarm/core/domain/instrument.py) decorated with ``@register_instrument_class``.
4. Add an Alembic migration for the new table.
5. If the new class needs unique ``data.instruments.*`` access patterns, register a DataMCP tool under [`alphaswarm/data/mcp/tools/instruments.py`](../alphaswarm/data/mcp/tools/instruments.py).

## DataMCP surface

| Tool | Purpose |
| --- | --- |
| `data.instruments.measures` | Available metrics for an instrument |
| `data.instruments.depositary_receipts` | ADR / GDR with underlying-equity FK + conversion ratio |
| `data.instruments.reit_portfolio` | REIT property-portfolio composition + FFO / yield |
| `data.identity.resolve` | Forward identifier resolution at ``as_of`` |
| `data.identity.history` | Walk every alias ever known for an entity |
| `data.futures.curve.list` | Discover available futures curves |
| `data.futures.curve.stitched` | Roll-stitched continuous curves |

# Legacy `alphaswarm.core.types` shim

> The legacy module is imported by **~140 files** across the codebase (strategies, brokers, REST routes, paper-trading session, backtest engines, RL apps, tests). A hard delete would break every one of ...

# Legacy `alphaswarm.core.types` shim

> Status: **Phase 5 finalization shipped**. The module is now a thin
> compatibility shim over [`alphaswarm.core.domain`](../alphaswarm/core/domain/).

## Why a shim and not a delete

The legacy module is imported by **~140 files** across the codebase (strategies, brokers, REST routes, paper-trading session, backtest engines, RL apps, tests). A hard delete would break every one of them in the same commit.

The shim approach preserves backward compatibility while making the domain types the recommended path:

1. **Every public name still imports** -- no breaking change for the 140 existing importers.
2. **Each domain-replaceable class is marked DEPRECATED** in its docstring with a `.. deprecated:: 5.0` Sphinx directive pointing at the canonical type.
3. **Bridge methods** on each legacy class let callers convert to the domain shape with one method call:
   * `Symbol.to_instrument_id()`
   * `OrderRequest.to_domain_order(client_order_id=, account=)`
   * `OrderData.to_domain_order()` / `OrderData.from_domain_order(...)`
   * `TradeData.from_execution_report(...)`
   * `PositionData.from_account_position_row(...)`
   * `AccountData.from_account_row(account_row, balances=...)`
4. **Domain re-exports at module bottom** let callers migrate one import at a time -- `from alphaswarm.core.types import DomainOrder` works without rewriting the import line.

## Three categories of type

### Category 1 -- Domain replacement available

| Legacy | Domain canonical |
| --- | --- |
| `Symbol` | `alphaswarm.core.domain.identifiers.InstrumentId` |
| `Exchange` | `alphaswarm.core.domain.identifiers.Venue` (string-valued ID) |
| `AssetClass` | `alphaswarm.core.domain.enums.AssetClass` (richer) |
| `SecurityType` | `alphaswarm.core.domain.enums.InstrumentClass` |
| `OrderType` | `alphaswarm.core.domain.enums.OrderType` (superset) |
| `OrderSide` | `alphaswarm.core.domain.enums.OrderSide` (superset) |
| `OrderStatus` | `alphaswarm.core.domain.enums.OrderStatus` (superset) |
| `Direction` | `alphaswarm.core.domain.enums.PositionSide` |
| `OrderRequest` | `alphaswarm.core.domain.orders.DomainOrder` |
| `OrderData` | `alphaswarm.core.domain.orders.DomainOrder` |
| `AccountData` | `alphaswarm.persistence.models_accounts.AccountRow` + balances |
| `PositionData` | `alphaswarm.persistence.models_accounts.AccountPositionRow` |
| `TradeData` | `alphaswarm.trading.execution.ExecutionReport` |

The legacy classes in this category are shims. Their docstring carries `.. deprecated:: 5.0` and points at the canonical type.
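The shim pattern for a Category 1 class can be sketched in a few lines. This is an illustration under stated assumptions, not the module's source: the `InstrumentId` dataclass here is a local stand-in for `alphaswarm.core.domain.identifiers.InstrumentId`, and the field names (`ticker`, `exchange`, `symbol`, `venue`) are hypothetical. Only the method names `parse`, `to_instrument_id`, and `from_instrument_id` and the `.. deprecated:: 5.0` directive are taken from this page.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class InstrumentId:
    """Local stand-in for the canonical domain identifier (fields are assumed)."""

    symbol: str
    venue: str


@dataclass(frozen=True)
class Symbol:
    """Legacy ticker + exchange pair, kept importable for existing callers.

    .. deprecated:: 5.0
        Use ``InstrumentId`` instead.
    """

    ticker: str
    exchange: str

    @classmethod
    def parse(cls, text: str) -> "Symbol":
        # Split on the last dot so a ticker that itself contains dots survives.
        ticker, _, exchange = text.rpartition(".")
        if not ticker or not exchange:
            raise ValueError(f"expected 'TICKER.EXCHANGE', got {text!r}")
        return cls(ticker=ticker, exchange=exchange)

    def to_instrument_id(self) -> InstrumentId:
        """Bridge: convert the legacy shape to the domain shape in one call."""
        return InstrumentId(symbol=self.ticker, venue=self.exchange)

    @classmethod
    def from_instrument_id(cls, iid: InstrumentId) -> "Symbol":
        """Reverse bridge, so callers can round-trip during migration."""
        return cls(ticker=iid.symbol, exchange=iid.venue)


sym = Symbol.parse("AAPL.NASDAQ")
iid = sym.to_instrument_id()
back = Symbol.from_instrument_id(iid)
```

The design choice worth noticing is that the legacy class keeps working unchanged. Deprecation lives in the docstring and the bridge methods, so a caller can adopt the domain type at one call site without touching its imports.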
### Category 2 -- Authoritative here (no domain replacement)

Market-data records and data-plane routing have no domain equivalents because the domain layer is about identity / orders / accounts, not the data plane:

* `BarData`, `TradeBar` (alias), `QuoteBar`, `TickData`, `Tick` (alias)
* `SubscriptionDataConfig`, `Interval`, `Resolution`, `TickType`, `DataNormalizationMode`

Framework value objects for the alpha / portfolio stages:

* `Signal`, `PortfolioTarget`

Backtest event-loop types:

* `Event`, `EventType`, `MarketEvent`, `SignalEvent`, `OrderEvent_Msg`, `FillEvent_Msg`

Legacy framework patterns for the existing `PaperTradingSession`:

* `OrderEvent` (state-transition record, NOT the messaging event), `OrderTicket`, `SecurityHolding`, `Cash`, `CashBook`

These types remain authoritative here.

### Category 3 -- Domain re-exports

Every Phase 1-5 domain type is re-exported from `alphaswarm.core.types` so callers can incrementally migrate. The recommended long-term migration is `from alphaswarm.core.domain import ...`, but the shim re-exports let you do it one import line at a time:

```python
# Both work after Phase 5. The second is the recommended long-term form.
from alphaswarm.core.types import DomainOrder, InstrumentId, OmsType
from alphaswarm.core.domain import DomainOrder, InstrumentId, OmsType
```

The re-exports cover every name from [`alphaswarm/core/domain/enums.py`](../alphaswarm/core/domain/enums.py), [`alphaswarm/core/domain/identifiers.py`](../alphaswarm/core/domain/identifiers.py), and [`alphaswarm/core/domain/orders.py`](../alphaswarm/core/domain/orders.py).
Names that would collide with the legacy enums get a `Domain` prefix:

| Legacy | Domain re-export |
| --- | --- |
| `OrderType` | `DomainOrderType` |
| `OrderSide` | `DomainOrderSide` |
| `OrderStatus` | `DomainOrderStatus` |
| `AssetClass` | `DomainAssetClass` |
| `AccountType` (legacy doesn't exist) | `DomainAccountType` |

Everything else (`InstrumentId`, `ClientOrderId`, `OrderListId`, `DomainOrder`, `LimitOrder`, `StopMarketOrder`, `OrderList`, `PositionSide`, `OmsType`, `ContingencyType`, `TriggerType`, `TimeInForce`, `TrailingOffsetType`, `InstrumentClass`, `LiquiditySide`, `AggressorSide`, ...) is re-exported under its original name.

## Migration workflow

For a file currently using legacy types:

1. **Drop in domain bridges where convenient.** No import changes needed:

   ```python
   from alphaswarm.core.types import OrderRequest

   req = OrderRequest(...)
   domain = req.to_domain_order()  # bridge to canonical
   ```

2. **Switch single imports incrementally.** Replace one legacy import at a time with the domain re-export through the shim:

   ```python
   # Before
   from alphaswarm.core.types import OrderType, OrderSide, OrderStatus

   # After (works today, no functional change)
   from alphaswarm.core.types import (
       DomainOrderType as OrderType,
       DomainOrderSide as OrderSide,
       DomainOrderStatus as OrderStatus,
   )
   ```

3. **Final form -- direct from domain.** Once the entire file is migrated, drop the shim:

   ```python
   from alphaswarm.core.domain.enums import OrderType, OrderSide, OrderStatus
   ```

## Why we keep the legacy enums (instead of aliasing to domain)

The temptation is "just `OrderType = DomainOrderType`". We don't, because:

* **Existing DB rows persist the legacy string values.** The legacy enum has `STOP = "stop"`; the domain has `STOP_MARKET = "stop_market"`. Renaming the enum would invalidate every previously-saved ``order_type`` column.
* **YAML strategy configs use the legacy values.** Backward compat with shipped strategy YAML is a hard requirement.
* **Some legacy values have NO exact domain equivalent.** `OrderStatus.NEW` (legacy) maps to `OrderStatus.ACCEPTED` (domain) but the legacy state machine has a different topology (no PENDING_UPDATE / PENDING_CANCEL).

The legacy enums are therefore kept verbatim; the deprecation directive points callers at the richer domain enum, and the re-exports give callers an opt-in path.

## When the shim can finally be deleted

The shim file disappears when:

1. Every importer has migrated to `from alphaswarm.core.domain import ...`
2. Every persisted legacy enum value has been migrated to the domain canonical (`stop` -> `stop_market`, `cancelled` -> `canceled`, `new` -> `accepted`, `partial` -> `partially_filled`)
3. Every shipped YAML strategy config has been rewritten to use the domain values

That migration is a separate, multi-PR effort tracked outside this Phase 5 finalization. The shim stays put until it's done.

## Bridge method reference

```python
from alphaswarm.core.types import (
    Symbol, OrderRequest, OrderData, TradeData, PositionData, AccountData,
)

# Symbol <-> InstrumentId
sym = Symbol.parse("AAPL.NASDAQ")
iid = sym.to_instrument_id()
back = Symbol.from_instrument_id(iid)

# OrderRequest -> DomainOrder
req = OrderRequest(symbol=sym, side=..., order_type=..., quantity=10)
domain = req.to_domain_order(client_order_id="cl-1", gateway="alpaca")

# OrderData round-trip
data = OrderData(...)
domain = data.to_domain_order()
rebuilt = OrderData.from_domain_order(domain, gateway="alpaca")

# TradeData from ExecutionReport
from alphaswarm.trading.execution import ExecutionReport
trade = TradeData.from_execution_report(report)

# PositionData from AccountPositionRow
from alphaswarm.persistence.models_accounts import AccountPositionRow
pos = PositionData.from_account_position_row(row)

# AccountData snapshot from persistence rows
from alphaswarm.persistence.models_accounts import AccountRow, AccountBalanceRow
snapshot = AccountData.from_account_row(account_row, balances=balance_rows)
```

# Local platform overlay

> The rpi `kubernetes/` tree stays untouched — these are *copies*, not relocations. AlphaSwarm attaches to either the local services or the cluster through the [`KubernetesAdapter`](../../concepts/infrastructure/kubernetes-adapter.md) abst...

# Local platform overlay

Audience: a developer who wants to run AlphaSwarm **standalone**, without attaching to the rpi_kubernetes cluster. The platform overlay (`alphaswarm_platform/compose/docker-compose.platform.yml`) brings the data + observability services AlphaSwarm code expects into the local compose stack.

The rpi `kubernetes/` tree stays untouched — these are *copies*, not relocations. AlphaSwarm attaches to either the local services or the cluster through the [`KubernetesAdapter`](../../concepts/infrastructure/kubernetes-adapter.md) abstraction.
## Compose-up matrix

| Goal | Command |
| --- | --- |
| Just the AlphaSwarm API + workers | `docker compose up -d` |
| AlphaSwarm + visualization stack (Trino, Polaris, Superset, Dagster, Dask, Ray) | `docker compose -f alphaswarm_platform/compose/docker-compose.yml -f alphaswarm_platform/compose/docker-compose.viz.yml --profile visualization up -d` |
| Full local platform parity (adds Apicurio + real Airbyte + DataHub + Loki + Vector + VictoriaMetrics) | `docker compose -f alphaswarm_platform/compose/docker-compose.yml -f alphaswarm_platform/compose/docker-compose.viz.yml -f alphaswarm_platform/compose/docker-compose.platform.yml --profile visualization --profile platform up -d` |

The platform overlay also activates the `visualization` profile's services it depends on (Polaris, Trino, Dagster). Don't pass `--profile platform` alone — the AlphaSwarm webui still depends on Superset from the viz overlay.

## Services added by the platform overlay

| Service | Container | Default host port | Wires into |
| --- | --- | --- | --- |
| `apicurio` (Schema Registry) | `alphaswarm-apicurio` | `8090 -> 8080` | `ALPHASWARM_SCHEMA_REGISTRY_URL` already supports the URL knob |
| `airbyte-db` | `alphaswarm-airbyte-db` | (internal) | Postgres backing for real Airbyte |
| `airbyte-server-real` | `alphaswarm-airbyte-server-real` | `8005 -> 8001` | Real Airbyte API (the dev stub at `airbyte-server` keeps running on `:8002`) |
| `airbyte-webapp` | `alphaswarm-airbyte-webapp` | `8001 -> 80` | UI for real Airbyte |
| `datahub-gms` | `alphaswarm-datahub-gms` | `8081 -> 8080` | `ALPHASWARM_DATAHUB_GMS_URL=http://datahub-gms:8080` |
| `datahub-frontend` | `alphaswarm-datahub-frontend` | `9002 -> 9002` | DataHub UI |
| `loki` | `alphaswarm-loki` | `3100 -> 3100` | Log aggregation; OTel collector + agents push here |
| `vector` | `alphaswarm-vector` | (none) | Tails Docker container logs and ships to Loki |
| `victoriametrics` | `alphaswarm-victoriametrics` | `8428 -> 8428` | Long-term metrics; scrapes the existing OTel collector + AlphaSwarm API |

## Sub-profiles (documented but not enabled by default)

The plan keeps these out of the default platform set because the user opted out of "full parity":

- `platform-rag` — RAGFlow + Milvus stack (heavy; pulls a vector DB).
- `platform-jh` — JupyterHub.

Add them yourself if needed by extending `alphaswarm_platform/compose/docker-compose.platform.yml` or by shipping a sibling `docker-compose.platform..yml` alongside it.

## Smoke test sequence

1. `docker compose -f alphaswarm_platform/compose/docker-compose.yml -f alphaswarm_platform/compose/docker-compose.viz.yml -f alphaswarm_platform/compose/docker-compose.platform.yml --profile visualization --profile platform up -d`
2. `curl http://localhost:8428/-/ready` — VictoriaMetrics
3. `curl http://localhost:3100/ready` — Loki
4. `curl http://localhost:8081/health` — DataHub GMS
5. `curl http://localhost:8090/apis` — Apicurio
6. `curl http://localhost:8005/api/v1/health` — real Airbyte
7. `docker compose ps` — every service should be healthy or running

## Where the rpi cluster fits in

When `ALPHASWARM_CLUSTER_MGMT_URL` is set, the [`RpiClusterAdapter`](../../concepts/infrastructure/kubernetes-adapter.md#rpiclusteradapter) auto-promotes and AlphaSwarm forwards Kafka admin + Flink session-job + alphavantage stream operations to the homelab management API. Setting both attach paths side-by-side is fine — AlphaSwarm routes the call wherever the active adapter says.

## Cleanup

```
docker compose -f alphaswarm_platform/compose/docker-compose.yml -f alphaswarm_platform/compose/docker-compose.viz.yml -f alphaswarm_platform/compose/docker-compose.platform.yml --profile visualization --profile platform down
```

Volumes are preserved; pass `-v` to wipe them.
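The `curl` steps of the smoke test sequence earlier on this page can also be scripted. The sketch below uses only the Python standard library; the endpoint URLs are copied from that sequence, while the helper names `http_ok` and `run_smoke` are hypothetical and not part of the repository. It treats any 2xx response as healthy, which is an assumption, since the page does not say what each endpoint returns.

```python
from urllib.request import urlopen

# (label, URL) pairs copied from the smoke test sequence.
SMOKE_ENDPOINTS = [
    ("VictoriaMetrics", "http://localhost:8428/-/ready"),
    ("Loki", "http://localhost:3100/ready"),
    ("DataHub GMS", "http://localhost:8081/health"),
    ("Apicurio", "http://localhost:8090/apis"),
    ("real Airbyte", "http://localhost:8005/api/v1/health"),
]


def http_ok(url: str, timeout: float = 5.0) -> bool:
    """Return True when the URL answers with a 2xx status."""
    try:
        with urlopen(url, timeout=timeout) as resp:
            return 200 <= resp.status < 300
    except OSError:
        # Covers refused connections, timeouts, and HTTP error statuses
        # (urllib's URLError and HTTPError are OSError subclasses).
        return False


def run_smoke(check=http_ok, endpoints=SMOKE_ENDPOINTS) -> dict:
    """Probe every endpoint and return {label: healthy?}."""
    return {label: check(url) for label, url in endpoints}
```

Calling `run_smoke()` after the stack is up returns a dict that is easy to assert on in CI. Passing a custom `check` keeps the function testable without a running stack. Step 7 (`docker compose ps`) is still worth running by hand, because `vector` and `airbyte-db` expose no host port to probe.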
# Ownership graph (Phase 2 of the multi-tenant rollout)

> ```mermaid flowchart LR subgraph postgres [Postgres canonical] Orgs[organizations] Teams[teams] Users[users] Mem[memberships] Ws[workspaces] Projects[projects] Labs[labs] Exp[experiments] Tests[tests]...

# Ownership graph (Phase 2 of the multi-tenant rollout)

The ownership graph is the projection layer that lets the MCP catalog + UI ask "what can this user see?" / "who can read this?" without joining the canonical tenancy tables hop-by-hop.

## Architecture

```mermaid
flowchart LR
    subgraph postgres [Postgres canonical]
        Orgs[organizations]
        Teams[teams]
        Users[users]
        Mem[memberships]
        Ws[workspaces]
        Projects[projects]
        Labs[labs]
        Exp[experiments]
        Tests[tests]
        Res[resources]
        RR[resource_relations]
    end
    Orgs -->|after_flush_postexec| Bus[(Redis stream alphaswarm:ownership:events)]
    Mem -->|after_flush_postexec| Bus
    Res -->|after_flush_postexec| Bus
    Bus -->|Celery drain| Neo[(Neo4j projection)]
    subgraph readers [Read clients]
        MCP["data.ownership.* MCP tools"]
        UI[ContextBar + Profile page]
    end
    MCP -->|"OwnershipGraphStore.traverse"| Neo
    UI -->|"GET /cache/{org,team,...}"| postgres
```

## Hard rule

Hard rule 33 in [`AGENTS.md`](https://github.com/julianwileymac/alphaswarm/blob/main/AGENTS.md): "All ownership / membership queries that traverse more than one hop MUST go through `alphaswarm.graph.OwnershipGraphStore`."
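To show what "more than one hop" means in practice, here is a toy breadth-first walk over an in-memory edge list. It is only a sketch: `walk_outward` and the node ids are invented for the example and say nothing about the real `OwnershipGraphStore` API or its visibility rules. The relation names match the edge model documented on this page.

```python
from collections import deque

# (source, relation, target) triples; the ids are made up for illustration.
EDGES = [
    ("org:acme", "HAS_WORKSPACE", "ws:research"),
    ("ws:research", "HAS_PROJECT", "proj:alpha"),
    ("proj:alpha", "OWNS", "res:backtest-42"),
    ("org:acme", "HAS_TEAM", "team:quant"),
]


def walk_outward(start, edges=EDGES, max_hops=3):
    """Breadth-first outward walk; returns {node: hop_count} for reachable nodes."""
    adjacency = {}
    for src, _relation, dst in edges:
        adjacency.setdefault(src, []).append(dst)

    hops = {start: 0}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        if hops[node] == max_hops:
            continue  # hop budget spent on this branch
        for nxt in adjacency.get(node, []):
            if nxt not in hops:
                hops[nxt] = hops[node] + 1
                queue.append(nxt)

    del hops[start]  # the start node is not part of the answer
    return hops


one_hop = walk_outward("org:acme", max_hops=1)
three_hops = walk_outward("org:acme", max_hops=3)
```

With `max_hops=1` the walk stops at the organization's direct children, which a single table lookup could also answer. Reaching `res:backtest-42` takes three hops, and that is the class of query the hard rule routes through the graph store instead of hand-written joins.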
## Node + edge model

| Node kind | Source table | Identity |
| --- | --- | --- |
| Organization | `organizations` | UUID |
| Team | `teams` | UUID |
| User | `users` | UUID |
| Workspace | `workspaces` | UUID |
| Project | `projects` | UUID |
| Lab | `labs` | UUID |
| Experiment | `experiments` | UUID |
| Test | `tests` | UUID |
| Resource | `resources` | UUID |

| Edge relation | from -> to (kinds) | Source |
| --- | --- | --- |
| `HAS_TEAM` | Organization -> Team | `teams.org_id` |
| `HAS_WORKSPACE` | Organization -> Workspace | `workspaces.org_id` |
| `HAS_PROJECT` | Workspace -> Project | `projects.workspace_id` |
| `HAS_LAB` | Workspace -> Lab | `labs.workspace_id` |
| `MEMBER_OF` | User -> (Org\|Team\|Workspace\|Project\|Lab) | `memberships` |
| `OWNS` | (Org\|Team\|User\|Workspace\|Project) -> Resource | `resources.owner_scope_*` |
| `IN_PROJECT` | Experiment / Resource -> Project | `*.project_id` |
| `IN_LAB` | Experiment / Resource -> Lab | `*.lab_id` |
| `IN_WORKSPACE` | Resource -> Workspace | `resources.workspace_id` |
| `IN_EXPERIMENT` | Test -> Experiment | `tests.experiment_id` |
| `PARENT_OF` | Experiment -> Experiment | `experiments.parent_experiment_id` |
| `DERIVED_FROM` / `CLONES` / `TRANSLATED_FROM` / `USES` / `REFERENCES` | Resource -> Resource | `resource_relations.relation` |

## Sync semantics

- **Source of truth**: Postgres. Every ownership write is a normal ORM commit.
- **Event bus**: SQLAlchemy [`after_flush_postexec`](../alphaswarm/graph/sqlalchemy_hooks.py) hooks translate each row insert/update/delete into an [`OwnershipEvent`](../alphaswarm/graph/events.py) on the `alphaswarm:ownership:events` Redis stream (or an in-process fallback queue when Redis is unreachable).
- **Drain**: the [`alphaswarm.tasks.ownership_tasks.drain_events`](../alphaswarm/tasks/ownership_tasks.py) Celery beat task runs every 5 s and applies up to `ALPHASWARM_OWNERSHIP_SYNC_BATCH_SIZE` (default 500) events through `OwnershipGraphStore.apply_events`.
- **Healer**: the periodic `full_resync` task (default 30 min) walks the canonical tables + re-emits everything so any missed delivery is repaired.

## Read paths

- **Python**: `from alphaswarm.graph import get_ownership_store; store = get_ownership_store(); store.traverse(...)`.
- **MCP tools**:
  - `data.ownership.tree` — outward walk from a node.
  - `data.ownership.list_resources` — every Resource a user can see.
  - `data.ownership.who_can_read` — reverse — every user that can read a specific Resource.
- **HTTP** (Phase 6 frontend): the ContextBar talks directly to the metadata cache (`/cache/organizations` etc.); deeper queries go through the MCP HTTP transport (`/mcp/data/tools//invoke`).

## Stores

| Backend | Use case |
| --- | --- |
| [`PostgresOwnershipGraphStore`](../alphaswarm/graph/postgres_store.py) | Local dev, unit tests, bootstrap/recovery |
| [`Neo4jOwnershipGraphStore`](../alphaswarm/graph/neo4j_store.py) | Production multi-hop queries |

Pick via `ALPHASWARM_OWNERSHIP_GRAPH_STORE` (default `postgres`).

## See also

- [`alphaswarm_docs/data-mcp.md`](../../concepts/data/data-mcp.md) — the MCP catalog the ownership tools plug into.
- [`alphaswarm_docs/identity.md`](../../concepts/identity/identity.md) — how the User node populates + how lazy provisioning seeds memberships.
- [`alphaswarm_docs/experiments-tests.md`](../../concepts/platform/experiments-tests.md) — the umbrella tables this graph sits on top of.

# Repository Split

> This document defines the AlphaSwarm monorepo boundaries while the platform is being split into future repositories. The current goal is isolation by responsibility without breaking imports, deployment manif...
# Repository Split

Status: migration guidance.

This document defines the AlphaSwarm monorepo boundaries while the platform is being split into future repositories. The current goal is isolation by responsibility without breaking imports, deployment manifests, or operator workflows.

## Principles

- Use a strangler migration: create stable contracts first, then move implementations behind compatibility shims.
- Keep shared abstractions in `alphaswarm_core`; do not import from higher-level packages there.
- Keep `alphaswarm_controller` standalone. It may depend on `alphaswarm_core`, but it must not import `alphaswarm.*`.
- Keep `rpi_kubernetes` as cluster bootstrap and platform services only. AlphaSwarm workload controllers and operator features live in this repository.
- Prefer generated or typed API contracts between projects over direct imports across future repository boundaries.

## Domain Map

| Domain | Current path | Owns | Does not own |
| --- | --- | --- | --- |
| Control plane | `alphaswarm_controller/` | `/manage/*`, workload lifecycle, provider adapters, session/control API | Quant runtimes, Celery business tasks, strategy logic |
| Platform core | `alphaswarm_core/` | Shared value types, ABCs, auth/resource filters, topology, stable wire models | FastAPI routes, ORM models, concrete cloud SDK workflows |
| Client | `alphaswarm_client/` | Operator UI, client docs, generated API contracts, local client behavior | Backend business logic, direct database writes |
| Snippets | `alphaswarm_snippets/` | Curated code knowledge, annotations, prompts, provenance indexes | Runtime imports or production package dependencies |
| Bots | `alphaswarm_bots/`, `alphaswarm_bots/templates/` | Bot runtime, templates, examples, sample specs | Direct bypass of `BotRuntime` or immutable versioning |
| RL | `alphaswarm_rl/` | RL subsystem: hash-locked `RLExperimentSpec` + `RLRuntime` + `RLComponent` metaclass + advantage estimators + policy backbones + weight-centric portfolio pipeline + Iceberg trajectory store + matching Celery task / API route / YAML spec library / tests | LLM gateway (`router_complete` stays in monolith); central registry (`alphaswarm.core.registry.register` stays in monolith) |
| Models | `alphaswarm_models/` | Custom model pulling, building, training, fine-tuning, evaluating, testing — qlib-style ML framework + Predictor Hub + AlphaBacktestExperiment + walk-forward + finetune trainers + every model implementation + custom model serving (vLLM + Ollama) + matching Celery tasks / API routes / YAML spec library / tests | LLM gateway (`router_complete` stays in monolith); central registry (`alphaswarm.core.registry.register` stays in monolith) |
| Monolith runtime | `alphaswarm/` | Agents, analysis, backtests, data plane, persistence, tasks, API gateway, LLM gateway (`router_complete`, memory, cache, prompts, tokens), the central registry, the four spec runtimes' shared orchestration. | RL subsystem (extracted to `alphaswarm_rl/`); ML / model serving (extracted to `alphaswarm_models/`); new workload control-plane providers |
| Deployment | `alphaswarm_platform/deployments/`, `alphaswarm_platform/terraform/`, `build/` | Compose, Kubernetes, Terraform, image build contracts | Cluster bootstrap owned by `rpi_kubernetes` |

## Allowed Dependencies

```mermaid
flowchart LR
    aqpRuntime["alphaswarm runtime"] --> aqpPlatformCore["alphaswarm_core"]
    aqpControlPlane["alphaswarm_controller"] --> aqpPlatformCore
    aqpClient["alphaswarm_client"] --> aqpRuntime
    aqpClient --> aqpControlPlane
    aqpBots["alphaswarm_bots templates"] --> aqpRuntime
    aqpRl["alphaswarm_rl"] --> aqpRuntime
    aqpModels["alphaswarm_models"] --> aqpRuntime
    aqpBots --> aqpRl
    aqpBots --> aqpModels
    aqpRl -.shims.-> aqpRuntime
    aqpModels -.shims.-> aqpRuntime
    aqpSnippets["alphaswarm_snippets"] -.reference.-> aqpRuntime
```

Hard dependency rules:

1. `alphaswarm_core` must not import `alphaswarm`, `alphaswarm_controller`, FastAPI, SQLAlchemy, Celery, or heavy optional SDKs.
2. `alphaswarm_controller` must not import `alphaswarm.*`; use `alphaswarm_core` contracts or HTTP APIs.
3. `alphaswarm_client` must call backend APIs through generated clients or local API wrappers. It must not duplicate authorization, tenancy, or kill-switch semantics.
4. `alphaswarm_snippets` is read-only knowledge for runtime code. Production modules must not import from it.
5. `alphaswarm_bots` stores templates and guidance until runtime interfaces are extracted from `alphaswarm_bots`.
6. `alphaswarm_rl` and `alphaswarm_models` may depend on `alphaswarm.*` for the shared runtime primitives that have not yet been extracted (`iceberg_catalog.append_arrow`, `router_complete`, `LedgerWriter`, `RequestContext`, ORM models, `_progress.emit`, `MetadataCache`, `RiskLimits`, `TargetWeightsRebalancer`, `alphaswarm.core.registry.register`). The reverse direction (`alphaswarm.rl.*` → `alphaswarm_rl.*`, `alphaswarm.ml.*` → `alphaswarm_models.*`, `alphaswarm.llm.{vllm_runner,ollama_client}` → `alphaswarm_models.serving.*`) goes through deprecation-warning compatibility shims under `alphaswarm/rl/`, `alphaswarm/ml/`, and `alphaswarm/llm/{vllm_runner,ollama_client}.py`. New code imports from `alphaswarm_rl.*` / `alphaswarm_models.*` / `alphaswarm_models.serving.*` directly.

## Migration Order

1. Stabilize `alphaswarm_core` package contracts and tests.
2. Finish `alphaswarm_controller` as the only home for workload lifecycle providers and `/manage/*` behavior.
3. Move curated references into `alphaswarm_snippets` with provenance and indexes.
4. Extract `alphaswarm_client` contracts around the existing Vite frontend and API gateway behavior before moving source paths.
5. Split `alphaswarm_bots` last, after bot persistence, task dispatch, backtest, paper trading, and agent runtime interfaces are explicit.
6. Extract `alphaswarm_rl` (May 2026) — RL subsystem moved out of `alphaswarm/rl/`, with matching Celery task / API route / YAML spec library / tests. Legacy `alphaswarm.rl.*` imports preserved through `alphaswarm/rl/__init__.py` deprecation shim.
7. Extract `alphaswarm_models` (May 2026) — custom-model boundary moved out of `alphaswarm/ml/` plus the model-pulling / serving slice of `alphaswarm/llm/` (`vllm_runner.py`, `ollama_client.py`). The central LLM gateway (`router_complete`, memory, cache, prompts, tokens) **stays in the monolith** at `alphaswarm/llm/`. Legacy `alphaswarm.ml.*` and `alphaswarm.llm.{vllm_runner,ollama_client}` imports preserved through compatibility shims.
8. Clean root-level build/deploy files only after the projects can be tested independently.

## Future Repo Split Gate

A domain is ready to become its own repository when it has:

- `README.md`, `AGENTS.md`, and a validation command list.
- Independent packaging or build metadata.
- No forbidden imports across future repo boundaries.
- Versioned API or model contracts for consumers.
- CI checks that run without relying on the full monolith checkout, except for documented integration tests.

# AlphaSwarm Scope Catalogue

> Every scope follows `:` (kebab-case nouns and verbs, colon separator). The four ADR 003 infrastructure scopes (`read:infrastructure`, `manage:agents`, `manage:infrastructure`, `admin...

# AlphaSwarm Scope Catalogue

Single source of truth for every authorization scope used by the AlphaSwarm control plane. The canonical Python module is [alphaswarm/auth/scopes.py](../alphaswarm/auth/scopes.py) (`AQPScope`); the canonical Terraform Auth0 provisioning lives in [alphaswarm_platform/terraform/modules/auth0_identity/main.tf](../alphaswarm_platform/terraform/modules/auth0_identity/main.tf) (`local.scopes` + `local.role_permissions`); the canonical role lattice is in [alphaswarm_core/src/alphaswarm_core/auth/rbac.py](../alphaswarm_core/src/alphaswarm_core/auth/rbac.py) (`_ROLE_LATTICE`). All three MUST stay in sync — the regression test at `tests/auth/test_scopes.py` enforces it.
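A drift check between the three sources could look roughly like the sketch below. This is not the content of `tests/auth/test_scopes.py`: the function `scope_drift` and its inputs are hypothetical, and the real test would load the scope sets from the Python module, the Terraform locals, and `_ROLE_LATTICE` rather than receive them as arguments.

```python
def scope_drift(python_scopes, terraform_scopes, lattice):
    """Return {source: missing scopes} for every source that lags behind the union.

    `lattice` maps a role name to the set of scopes that role grants.
    An empty result means the three sources agree.
    """
    sources = {
        "python": set(python_scopes),
        "terraform": set(terraform_scopes),
        "lattice": set().union(*lattice.values()),
    }
    everything = set().union(*sources.values())
    return {
        name: everything - scopes
        for name, scopes in sources.items()
        if everything - scopes
    }


# Tiny illustrative sample using scope names from this catalogue.
sample_lattice = {
    "alphaswarm-viewer": frozenset({"data:read"}),
    "alphaswarm-admin": frozenset({"data:read", "data:write"}),
}
drift = scope_drift(
    python_scopes={"data:read", "data:write"},
    terraform_scopes={"data:read"},  # deliberately missing data:write
    lattice=sample_lattice,
)
```

Reporting the missing scopes per source, rather than a bare pass or fail, tells the person fixing the drift which of the three files to edit.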
## Scope-string convention

Every scope follows `<resource>:<verb>` (kebab-case nouns and verbs, colon separator). The four ADR 003 infrastructure scopes (`read:infrastructure`, `manage:agents`, `manage:infrastructure`, `admin:cluster`) intentionally use a verb-first form for backward compatibility with the original Phase 4 rollout; the AlphaSwarm-specific extensions added in Phase 1 of the control-plane maturation use the canonical resource-first form.

The `platform:admin` scope is the implicit super-scope: any holder of `platform:admin` satisfies any other scope check. It is granted only to the `alphaswarm-superadmin` role and used very rarely.

## Scope catalogue

### Data plane

| Scope | Description |
| --- | --- |
| `data:read` | Read AlphaSwarm data and metadata (datasets, catalogs, lineage) |
| `data:write` | Mutate AlphaSwarm data through sanctioned APIs |
| `admin:iceberg` | Drop, consolidate, or redefine Iceberg tables |

### Infrastructure (ADR 003 four-scope grid)

| Scope | Description |
| --- | --- |
| `read:infrastructure` | View deployment status, pods, logs, non-secret config |
| `manage:agents` | Start / stop / restart / scale assigned AlphaSwarm agents and bot workloads |
| `manage:infrastructure` | Deploy and update AlphaSwarm services and non-secret ConfigMaps within an assigned org |
| `admin:cluster` | Full cluster control + resource-scope bypass for AlphaSwarm super-admins |

### Agents

| Scope | Description |
| --- | --- |
| `agent:view` | Inspect agent specs, runs, and telemetry |
| `agent:execute` | Invoke or schedule a registered AlphaSwarm agent |
| `agent:terminate` | Halt a running agent or revoke a long-lived spec |

### Trading / portfolio

| Scope | Description |
| --- | --- |
| `trade:read` | Inspect paper / live trading sessions, orders, fills, PnL |
| `trade:execute` | Submit paper-broker or sandbox-broker orders |
| `trade:live` | Submit real-money orders to a connected live broker |

### Backtesting

| Scope | Description |
| --- | --- |
| `backtest:read` | Inspect backtest runs and historical metrics |
| `backtest:create` | Submit a new backtest job to the engine fleet |

### ML / RL / RAG

| Scope | Description |
| --- | --- |
| `rag:query` | Query the hierarchical RAG corpus |
| `ml:workbench` | Run ML workbench flows (training, evaluation, registry) |
| `rl:train` | Submit `RLExperimentSpec` runs through `RLRuntime` |

### Deployment lifecycle

| Scope | Description |
| --- | --- |
| `deploy:run` | Run Terraform / Kubernetes deployments |
| `deploy:halt` | Halt AlphaSwarm deployments and long-running runtimes |

### Terraform IaC (rule 42)

| Scope | Description |
| --- | --- |
| `terraform:plan` | Generate a Terraform plan for an AlphaSwarm stack |
| `terraform:apply` | Apply a Terraform plan against an AlphaSwarm stack |
| `terraform:destroy` | Destroy an AlphaSwarm Terraform stack (super-admin only) |
| `terraform:cancel` | Cancel a running Terraform run |

### WorkloadRuntime (rule 45)

| Scope | Description |
| --- | --- |
| `workloads:halt` | Halt every running workload via the WorkloadRuntime kill-switch fan-out |

### Tenancy

| Scope | Description |
| --- | --- |
| `tenancy:invite` | Issue tenancy invites for org / team / workspace / project membership |
| `tenancy:admin` | Mutate tenancy state (orgs, teams, memberships) |
| `scim:write` | Provision AlphaSwarm users and groups through SCIM |

### Platform

| Scope | Description |
| --- | --- |
| `platform:admin` | Implicit super-scope: satisfies any other scope check |

## Role lattice

Each role is a strict superset of the previous one (cumulative composition). The lattice is enforced by the regression test at `tests/auth/test_scopes.py::test_role_lattice_is_cumulative`.

### `alphaswarm-viewer`

Read-only AlphaSwarm operator for assigned resources.

- `read:infrastructure`
- `data:read`
- `agent:view`
- `trade:read`
- `backtest:read`
- `rag:query`

### `alphaswarm-operator`

Viewer + manage assigned agents and bot workloads.
Adds:

- `manage:agents`
- `agent:execute`
- `agent:terminate`
- `backtest:create`
- `ml:workbench`
- `rl:train`
- `trade:execute`
- `deploy:run`
- `deploy:halt`
- `workloads:halt`

### `alphaswarm-admin`

Operator + administrator for assigned organization infrastructure. Adds:

- `manage:infrastructure`
- `data:write`
- `admin:iceberg`
- `terraform:plan`
- `terraform:apply`
- `terraform:cancel`
- `tenancy:invite`

### `alphaswarm-superadmin`

Admin + cluster super-admin (the only role that bypasses `alphaswarm_core.auth.resource_filter.filter_resources` via the `admin:cluster` scope). Adds:

- `admin:cluster`
- `terraform:destroy`
- `tenancy:admin`
- `scim:write`
- `trade:live`
- `platform:admin`

## Legacy tenancy roles

The tenancy database in `alphaswarm.persistence.models_tenancy` uses a separate role lattice (`viewer / editor / admin / owner`) for membership in orgs, teams, workspaces, projects, and labs. The canonical platform roles above (`alphaswarm-*`) are issued by Auth0 and expanded into scopes via the post-login Action sync. The translator between the two lives at [alphaswarm/auth/scopes.py::legacy_role_to_aqp_role](../alphaswarm/auth/scopes.py):

| Tenancy role | Canonical role |
| --- | --- |
| `viewer` | `alphaswarm-viewer` |
| `editor` | `alphaswarm-operator` |
| `admin` | `alphaswarm-admin` |
| `owner` | `alphaswarm-superadmin` |

The Auth0 sync endpoint (`/_internal/auth0/sync`) emits BOTH flavours into the JWT's `roles` claim so legacy clients keep working AND scope expansion produces a non-empty set. This closes the empty-claim drift bug, in which a user whose only `Membership.role` was `editor` ended up with no scopes in the token.

## Adding a new scope

1. Add the constant to `alphaswarm/auth/scopes.py::AQPScope` and to `ALL_AQP_SCOPES`.
2. If the scope should be granted by a role, add it to the matching role frozenset in `alphaswarm_core/auth/rbac.py::_ROLE_LATTICE` (cumulative: viewer subset of operator subset of admin subset of superadmin).
3. Add the scope to `alphaswarm_platform/terraform/modules/auth0_identity/main.tf`'s `local.scopes` AND to every role in `local.role_permissions` that should hold it.
4. Add a row to this catalogue (`alphaswarm_docs/scopes.md`).
5. Re-run the regression test: `docker exec alphaswarm-api python -m pytest tests/auth/test_scopes.py`.

The test asserts that the Python lattice and the Terraform lattice contain the same scope set per role, so any drift produces a hard failure rather than a silent grant.

# Temporal identifier resolution

> Financial identifiers are not stable across time. A non-exhaustive list of why:

# Temporal identifier resolution

> Status: **Phase 1 shipped** (Alembic 0039 + 0040). The
> ``identifier_links`` table is now the authoritative source for
> identifier resolution; the legacy ``Instrument.identifiers`` JSON
> blob is kept for backward compatibility but is no longer
> authoritative.

## Why temporal resolution

Financial identifiers are not stable across time. A non-exhaustive list of why:

| Event | Impact |
| --- | --- |
| Ticker change (M&A, rebranding) | ``FB`` -> ``META`` 2022-06-09 |
| Symbol change (re-listing) | ``ABEV3`` ↔ ``AMBV4`` on B3 |
| CUSIP / ISIN re-issue (corporate action) | Stock split issuance may mint a new CUSIP |
| Index reconstitution | Russell add/drops change tracker constituents |
| ADR sponsorship upgrade | Conversion ratio may change |

A backtest that resolves ``META`` to the modern ID and walks bar data from 2018 will silently introduce **survivorship and forward-looking bias** because the row didn't yet exist under that ticker. The resolver service fixes this by walking time-versioned ``identifier_links`` rows scoped by ``valid_from <= as_of`` and ``(valid_to IS NULL OR valid_to > as_of)``.

## Table shape

The ``identifier_links`` table predates Phase 1; Phase 1 promotes it to authoritative status.
Its schema:

```text
identifier_links
+--------------------+-----------------------------------------------+
| id                 | UUID                                          |
| entity_kind        | instrument | fred_series | sec_filing | ...   |
| entity_id          | parent entity id                              |
| instrument_id      | denormalized FK to instruments.id (NULL OK)   |
| scheme             | ticker | vt_symbol | cik | cusip | isin |     |
|                    | figi | sedol | lei | gvkey | permid | ...     |
| value              | identifier value                              |
| valid_from         | datetime | NULL ("from the beginning")        |
| valid_to           | datetime | NULL ("still valid")               |
| source_id          | FK to data_sources.id                         |
| confidence         | 0.0 - 1.0, defaults 1.0                       |
| meta               | JSON                                          |
| created_at         | datetime                                      |
+--------------------+-----------------------------------------------+
```

## Resolver API

The two public entry points are ``alphaswarm.data.identity.IdentifierResolver`` and the matching DataMCP tools.

### Python: forward resolution

```python
from datetime import datetime

from alphaswarm.data.identity import resolve

# Resolve Apple's CUSIP (037833100) as it stood on 2018-06-12.
hit = resolve(
    scheme="cusip",
    value="037833100",
    as_of=datetime(2018, 6, 12),
)
print(hit.value, hit.is_open_ended)
```

### Python: history walk

```python
from alphaswarm.data.identity import history

# Every alias known for Apple
for row in history(entity_kind="instrument", entity_id="aapl-uuid"):
    print(row.scheme, row.value, row.valid_from, row.valid_to)
```

### Agent / MCP

```text
data.identity.resolve(scheme="cusip", value="037833100", as_of="2018-06-12")
data.identity.history(entity_kind="instrument", entity_id="aapl-uuid")
```

The DataMCP layer is the only path agents may use to resolve identifiers (AGENTS rule 22). The Python module is reserved for loaders / pipelines / persistence code; agent code never imports the ORM model directly.

## Backfill from legacy JSON blob (migration 0040)

The legacy ``Instrument.identifiers`` JSON column is a flat ``{scheme: value}`` map.
Migration 0040 walks every row, normalises the scheme name (lower-cased, aliases collapsed), and inserts a row into ``identifier_links`` with ``valid_from=valid_to=NULL`` ("valid for all time the row represents") and ``confidence=0.7`` (so a canonical loader row at ``confidence=1.0`` always wins the resolver tiebreaker).

The legacy JSON column is **kept**: readers that haven't migrated to the resolver continue to work. New readers MUST go through the resolver so they see corrected validity windows.

## Validity-window semantics

| ``valid_from`` | ``valid_to`` | Meaning |
| --- | --- | --- |
| ``NULL`` | ``NULL`` | Valid for all time the row represents |
| ``2018-01-01`` | ``NULL`` | Valid from 2018-01-01, still current |
| ``NULL`` | ``2022-06-09`` | Valid up to (but not including) 2022-06-09 |
| ``2010-05-01`` | ``2015-12-31`` | Valid in the half-open interval from 2010-05-01 (inclusive) to 2015-12-31 (exclusive) |

The ``valid_to`` is **exclusive**: a row with ``valid_to=2022-06-09`` is NOT valid on 2022-06-09. The lookup predicate is therefore ``valid_to > as_of``, not ``valid_to >= as_of``.

## Confidence ordering

When multiple rows satisfy the validity predicate, the highest ``confidence`` wins. Default loader rows ship with ``confidence=1.0``; the legacy-blob backfill from migration 0040 uses ``confidence=0.7`` so it's overridden the moment a canonical loader populates the same alias. Heuristic / fuzzy-match loaders should use ``confidence`` in the 0.3-0.6 range so they only win when no canonical row exists.

# Hybrid agentic-RL + backtest

> The Phase 1-9 rollout closes the "backtest-to-paper-trading gap" by making the **target portfolio weight vector** the single immutable interface between an RL policy and any execution mechanism (offli...

# Hybrid agentic-RL + backtest

> AlphaSwarm's port of the FinRL-X "deployment-consistent" blueprint plus the
> NVIDIA-NeMo/RL advantage primitives, wired into AlphaSwarm's existing
> spec-driven runtimes (rule 16).
## What changed

The Phase 1-9 rollout closes the "backtest-to-paper-trading gap" by making the **target portfolio weight vector** the single immutable interface between an RL policy and any execution mechanism (offline backtest engine OR live broker). The same `w_t` flows through:

- the offline simulation (via the new [`RLBacktestEnv`](../alphaswarm/rl/envs/rl_backtest_env.py))
- the live paper / live execution (via [`WeightToOrders`](../alphaswarm/rl/execution/weight_to_orders.py))
- the AST-sandboxed alpha factor authoring loop (via [`AlphaResearcher`](../alphaswarm/agents/quant/alpha_researcher.py))

```mermaid
flowchart TB
  subgraph agentic [Agentic Layer]
    AlphaResearcher["AlphaResearcher\n(AgentRuntime + RAG alpha_base)"]
    StrategyExecutor["StrategyExecutor\n(wraps RLRuntime)"]
    ASTSandbox["AST Sandbox\n(alphaswarm/data/expressions_dsl.py)"]
    AlphaResearcher -->|symbolic formula| ASTSandbox
    ASTSandbox -->|FactorNode| Backtest[Engine-agnostic indicator]
  end
  subgraph rl [RL Stack]
    Spec["RLExperimentSpec\n(+ advantage + stop_properly_penalty_coef)"]
    Runtime["RLRuntime\n(rule 16)"]
    Backbones["Policy Backbones\nTransformer / RNN / AE / PatchTST"]
    Advantage["ReinforcePlusPlus / GRPO / GAE"]
    StopShape["StopProperlyWrapper\n(coef in 0..1)"]
    Spec --> Runtime
    Runtime --> Backbones
    Runtime --> Advantage
    Runtime --> StopShape
  end
  subgraph bridge [RL <-> Backtest Bridge]
    RLEnv["RLBacktestEnv"]
    WCP["WeightCentricPipeline\nf_S -> f_A -> f_T -> f_R"]
    EngineCB["context['rl_agent']"]
    Runtime --> RLEnv
    RLEnv --> WCP
    WCP --> EngineCB
  end
  subgraph engines [Engines]
    EventDriven["EventDrivenBacktester"]
    VbtPro["VectorbtProEngine:orders"]
    Lob["LobBacktestEngine"]
    BT["BacktraderEngine (optional)"]
    EngineCB --> EventDriven
    EngineCB --> VbtPro
    EngineCB --> Lob
    EngineCB --> BT
  end
  subgraph broker [Live + Paper]
    DomainBroker["IDomainBrokerage"]
    KillSwitch["KillSwitch"]
    WCP --> DomainBroker
    KillSwitch -.->|halt| DomainBroker
  end
```

## Quick reference

| Concept | One-liner | File |
| --- | ---
| --- | | `WeightCentricPipeline` | FinRL-X `f_S -> f_A -> f_T -> f_R` composable pipeline | [alphaswarm/rl/portfolio/pipeline.py](../alphaswarm/rl/portfolio/pipeline.py) | | `RLBacktestEnv` | `BaseRLEnv + gym.Env` wrapping any registered `BaseBacktestEngine` | [alphaswarm/rl/envs/rl_backtest_env.py](../alphaswarm/rl/envs/rl_backtest_env.py) | | `RLAgentBridge` | Channel exposed via `context['rl_agent']` on every engine flipping `supports_rl_injection=True` | [alphaswarm/rl/bridges/agent_bridge.py](../alphaswarm/rl/bridges/agent_bridge.py) | | `ReinforcePlusPlusAdvantage` | Leave-one-out cohort baseline + decoupled global normalisation (NeMo-RL port) | [alphaswarm/rl/advantage/reinforce_plus_plus.py](../alphaswarm/rl/advantage/reinforce_plus_plus.py) | | `GRPOAdvantage` | Group-relative no-critic advantage (DeepSeek R1 / NeMo-RL parity) | [alphaswarm/rl/advantage/grpo.py](../alphaswarm/rl/advantage/grpo.py) | | `StopProperlyWrapper` | Scales reward of truncated episodes by `coef in [0, 1]` (NeMo-RL `stop_properly_penalty_coef`) | [alphaswarm/rl/rewards/stop_properly.py](../alphaswarm/rl/rewards/stop_properly.py) | | Truncating terminations | `DrawdownTermination` / `MarginCallTermination` / `RiskBreachTermination` carry `truncates_episode=True` | [alphaswarm/rl/terminations/](../alphaswarm/rl/terminations/) | | `WeightToOrders` | Kill-switch-gated translator from target weights to `DomainOrder` | [alphaswarm/rl/execution/weight_to_orders.py](../alphaswarm/rl/execution/weight_to_orders.py) | | `RedisFeatureStore` | Flink → Redis `IFeatureStore` for live RL observation | [alphaswarm/streaming/feature_store/redis_store.py](../alphaswarm/streaming/feature_store/redis_store.py) | | `AlphaVantageIngester` | REST-poll Alpha Vantage and publish to Kafka | [alphaswarm/streaming/ingesters/alphavantage.py](../alphaswarm/streaming/ingesters/alphavantage.py) | | `DeterministicMedallionReplay` | Read-only RL data pipeline pinned to silver/gold Iceberg snapshots | 
[alphaswarm/rl/data_pipelines/medallion_replay.py](../alphaswarm/rl/data_pipelines/medallion_replay.py) | | `data.alphas.*` / `data.backtests.*` / `data.rl.*` / `data.brokers.*` | New DataMCPTools (rule 22) | [alphaswarm/data/mcp/tools/](../alphaswarm/data/mcp/tools/) | | `alpha_factors` / `backtest_summaries` / `rl_trajectory_summaries` corpora | RAG "alpha base" (rule 11) | [alphaswarm/rag/orders.py](../alphaswarm/rag/orders.py) | | `RLTradingBot` | Bot subtype driven by `RLRuntime` (rule 14) | [alphaswarm/bots/rl_trading_bot.py](../alphaswarm/bots/rl_trading_bot.py) | ## Spec extension ```yaml training: total_timesteps: 200000 log_interval: 10 advantage: class: ReinforcePlusPlusAdvantage module_path: alphaswarm.rl.advantage.reinforce_plus_plus kwargs: minus_baseline: true global_normalization: true leave_one_out: true stop_properly_penalty_coef: 0.2 ``` ## Companion docs - [alphaswarm_docs/weight-centric-pipeline.md](../../concepts/rl/weight-centric-pipeline.md) — Deep dive on `f_S/f_A/f_T/f_R` semantics. - [alphaswarm_docs/rl-policy-backbones.md](../../concepts/rl/rl-policy-backbones.md) — Transformer / RNN / Autoencoder / PatchTST backbones. - [alphaswarm_docs/alpha-researcher-agent.md](../../concepts/agentic/alpha-researcher-agent.md) — Symbolic alpha DSL + AlphaResearcher driver. ## Source-of-truth citations - NeMo-RL `stop_properly_penalty_coef` scaling (commit `20d46a7d1bd987df1c89b3c5a81dc945c3d201e4`, `nemo_rl/algorithms/reward_functions.py`). - NeMo-RL leave-one-out group baseline + decoupled global normalisation (`nemo_rl/algorithms/utils.py` `calculate_baseline_and_std_per_prompt` + `masked_mean(..., global_normalization_factor=...)`). - Backtrader `cheat_on_open` / `next_open` / `order_target_percent` semantics (`backtrader/strategy.py`). 
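
The two NeMo-RL-derived primitives from the quick reference (leave-one-out cohort baseline with global normalisation, and the stop-properly reward scaling) can be illustrated with plain arithmetic. The sketch below was written for this page and is not the NeMo-RL or AlphaSwarm implementation. The function names are hypothetical, and reading "decoupled global normalisation" as division by one batch-wide standard deviation is an assumption.

```python
from statistics import pstdev


def leave_one_out_advantages(cohort: list[float]) -> list[float]:
    """Advantage of each sample = its reward minus the mean reward of the other samples."""
    n = len(cohort)
    if n < 2:
        raise ValueError("leave-one-out needs at least two samples per cohort")
    total = sum(cohort)
    return [r - (total - r) / (n - 1) for r in cohort]


def normalise_globally(cohorts: list[list[float]], eps: float = 1e-8) -> list[list[float]]:
    """Assumed reading: scale every advantage by one std computed over the whole batch."""
    flat = [a for cohort in cohorts for a in cohort]
    scale = pstdev(flat) + eps
    return [[a / scale for a in cohort] for cohort in cohorts]


def scale_truncated_reward(reward: float, truncated: bool, coef: float) -> float:
    """Multiply the reward of a truncated episode by coef in [0, 1]; leave others untouched."""
    if not 0.0 <= coef <= 1.0:
        raise ValueError("coef must lie in [0, 1]")
    return reward * coef if truncated else reward


rewards = [[1.0, 2.0, 3.0], [10.0, 10.0, 40.0]]
raw = [leave_one_out_advantages(cohort) for cohort in rewards]
normalised = normalise_globally(raw)
print(raw)
print(scale_truncated_reward(10.0, truncated=True, coef=0.2))
```

With `coef=0.2`, as in the spec extension above, a truncated episode keeps a fifth of its reward, while an episode that ends normally is unchanged.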
# RL component reference

> | `rl_kind` | Purpose | Base class | | --- | --- | --- | | `rl_env` | Gymnasium env | [`BaseRLEnv`](../alphaswarm/rl/core/env.py) | | `rl_observation` | State featuriser | [`BaseObservationBuilder`](../alphaswarm/r...

# RL component reference

> This page is a hand-written shortcut. The authoritative source is the
> live registry exposed by `GET /rl/components/{kind}` (and rendered in
> the UI at [`/rl/library`](../webui/app/(shell)/rl/library/page.tsx)).

## Kinds

| `rl_kind` | Purpose | Base class |
| --- | --- | --- |
| `rl_env` | Gymnasium env | [`BaseRLEnv`](../alphaswarm/rl/core/env.py) |
| `rl_observation` | State featuriser | [`BaseObservationBuilder`](../alphaswarm/rl/core/observation.py) |
| `rl_action` | Action-space spec + transform | [`BaseActionSpace`](../alphaswarm/rl/core/action.py) |
| `rl_reward` | Reward term / composite | [`BaseRewardModel`](../alphaswarm/rl/core/reward.py), [`RewardTerm`](../alphaswarm/rl/core/reward.py) |
| `rl_termination` | End-of-episode predicate | [`BaseTerminationCondition`](../alphaswarm/rl/core/termination.py) |
| `rl_policy` | Frozen policy | [`BasePolicy`](../alphaswarm/rl/core/policy.py) |
| `rl_agent` | Train-aware agent | [`BaseRLAgent`](../alphaswarm/rl/core/policy.py) |
| `rl_data` | Data pipeline | [`BaseDataPipeline`](../alphaswarm/rl/core/data.py) |
| `rl_ensembler` | Multi-member orchestrator | [`BaseEnsembler`](../alphaswarm/rl/core/ensembler.py) |
| `rl_experiment` | Experiment runner | [`BaseExperiment`](../alphaswarm/rl/core/experiment.py) |
| `rl_trajectory_store` | Per-step persistence | [`BaseTrajectoryStore`](../alphaswarm/rl/core/replay.py) |

## Built-in components (FinRL + AlphaSwarm)

### Environments

- `StockTradingEnv`: continuous portfolio (existing).
- `PortfolioAllocationEnv`: softmax weights (existing).
- `StockTradingDiscreteEnv`: single-stock buy/sell/hold (existing).
- `FinRLStockTradingEnv`: pandas share-lots (FinRL port).
- `FinRLStockTradingNpEnv`: array-backed numpy (FinRL port).
- `FinRLPortfolioCovEnv`: covariance + softmax (FinRL port).
- `FinRLCryptoEnv`: multi-crypto lookback stack (FinRL port).
- `OptionsTradingEnv`, `ExecutionEnv`, `MarketMakingEnv`: placeholders.

### Reward terms

- `PnLTerm`, `LogReturnTerm`
- `SharpeTerm`, `SortinoTerm`, `DrawdownPenaltyTerm`, `VolatilityPenaltyTerm`
- `TurnoverPenaltyTerm`, `TransactionCostTerm`, `SlippagePenaltyTerm`
- `TurbulenceGateTerm`, `MarginCallTerm`
- `CashIdlePenaltyTerm`, `BenchmarkOutperformanceTerm`, `RiskParityTerm`
- `PotentialBasedShaping`
- `CompositeReward` (sum of weighted terms; emits per-term contributions to `info["reward_terms"]`).

### Observation builders

- `PortfolioStateBuilder` (cash + weights / positions)
- `TechnicalIndicatorBuilder` (FinRL stockstats)
- `CovarianceBuilder` (FinRL portfolio cov)
- `TurbulenceBuilder` (Mahalanobis stress)
- `VIXBuilder`
- `LookbackStackBuilder` (FinRL crypto)
- `FundamentalBuilder` (FinRobot bridge)
- `MicrostructureBuilder`
- `StackedObservationBuilder` (composite)

### Action spaces

- `ContinuousWeightsAction`, `SoftmaxWeightsAction`, `IntegerSharesAction`, `DiscreteBuySellHoldAction`, `MultiDiscreteAction`, `TargetPositionAction`.

### Termination conditions

- `HorizonTermination`, `DrawdownTermination`, `MarginCallTermination`, `TurbulenceTermination`.

### Data pipelines

- `IcebergRLDataPipeline` (default, AlphaSwarm catalog).
- `YahooFinanceRLDataPipeline` (FinRL parity).
- `AlpacaRLDataPipeline` (paper-trading bridge).
- `LiveStreamingRLDataPipeline` (Kafka / Flink).
- `ReplayRLDataPipeline` (offline RL from `rl.trajectories`).

### Agents

- `SB3Adapter`: PPO / A2C / DDPG / SAC / TD3 / DQN + sb3-contrib (RecurrentPPO / TRPO / QRDQN / MaskablePPO / ARS / TQC).
- `ElegantRLAdapter`, `RayRLlibAdapter`, `CleanRLAdapter`.
- `LLMHybridAgent`: FinRobot-style LLM advisor + RL backbone.
- Existing classical / Q-family / actor-critic / evolutionary / SPM trees retained.
### Ensemblers / experiments

- `WalkForwardEnsembler` (FinRL `DRLEnsembleAgent` port).
- `BestOfNRunner`, `CurriculumRunner`, `MetaEnsembleRunner`.
- `BasicRLExperiment`, `WalkForwardRLExperiment`, `RewardAblationExperiment`, `RLAlphaBacktestExperiment`.

# RL FinAgent Layered Reflection Adapter (Phase 10)

> | # | Stage | YAML | Purpose | | --- | --- | --- | --- | | 1 | `low_intelligence` | [`configs/agents/finagent/low_intelligence.yaml`](../configs/agents/finagent/low_intelligence.yaml) | Factual 2-3 se...

# RL FinAgent Layered Reflection Adapter (Phase 10)

Reference docs for the FinAgent multimodal LLM-hybrid agent ported into `alphaswarm_rl` per Zhang AAAI 24.

## Five-stage cascade

| # | Stage | YAML | Purpose |
| --- | --- | --- | --- |
| 1 | `low_intelligence` | [`configs/agents/finagent/low_intelligence.yaml`](../configs/agents/finagent/low_intelligence.yaml) | Factual 2-3 sentence market read |
| 2 | `high_intelligence` | [`configs/agents/finagent/high_intelligence.yaml`](../configs/agents/finagent/high_intelligence.yaml) | Strategic outlook + bias |
| 3 | `low_reflection` | [`configs/agents/finagent/low_reflection.yaml`](../configs/agents/finagent/low_reflection.yaml) | 1-bar post-mortem |
| 4 | `high_reflection` | [`configs/agents/finagent/high_reflection.yaml`](../configs/agents/finagent/high_reflection.yaml) | k-bar strategic post-mortem |
| 5 | `decision` | [`configs/agents/finagent/decision.yaml`](../configs/agents/finagent/decision.yaml) | Final SELL/HOLD/BUY |

Each stage's LLM call routes through `router_complete` (hard rule 2). The adapter degrades gracefully when the router is unavailable or any stage fails (defaults to `HOLD`).
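
The cascade and its fallback behaviour can be sketched as a loop over the five stage names. This is a minimal illustration, not the `LayeredReflectionAdapter` code: the `complete` callable stands in for `router_complete` (whose real signature is not shown on this page), the decision payload shape `{"action": "BUY"}` is an assumption, and `run_cascade` is a hypothetical name. The integer encoding (0=SELL, 1=HOLD, 2=BUY) follows the Usage section of this page.

```python
import json
from typing import Callable

STAGES = ("low_intelligence", "high_intelligence", "low_reflection", "high_reflection", "decision")
ACTIONS = {"SELL": 0, "HOLD": 1, "BUY": 2}


def run_cascade(complete: Callable[[str, dict], str], observation: dict) -> int:
    """Run the five stages in order; fall back to HOLD if any stage or the parse fails."""
    context: dict = {"observation": observation}
    try:
        for stage in STAGES:
            # Each stage sees the observation plus every earlier stage's output.
            context[stage] = complete(stage, dict(context))
        decision = json.loads(context["decision"])
        return ACTIONS[str(decision["action"]).upper()]
    except Exception:
        # Router unavailable, stage failure, or malformed decision JSON.
        return ACTIONS["HOLD"]


def fake_llm(stage: str, context: dict) -> str:
    """Stand-in for the router: only the final stage returns a JSON decision."""
    return '{"action": "BUY"}' if stage == "decision" else f"{stage} notes"


def broken_llm(stage: str, context: dict) -> str:
    raise RuntimeError("router unavailable")


print(run_cascade(fake_llm, {"close": 101.2}))
print(run_cascade(broken_llm, {"close": 101.2}))
```

Catching every exception is deliberate in this sketch: the documented contract is that any failure yields `HOLD`, so the caller never has to distinguish a router outage from a malformed decision.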
## Three tools

| Tool | File | Purpose |
| --- | --- | --- |
| `KlinePlotterTool` | [`alphaswarm/agents/tools/finagent/kline_plotter.py`](../alphaswarm/agents/tools/finagent/kline_plotter.py) | Summarise bars → text |
| `TradingPlotterTool` | [`alphaswarm/agents/tools/finagent/trading_plotter.py`](../alphaswarm/agents/tools/finagent/trading_plotter.py) | Summarise action history → text |
| `StrategyAgentsTool` | [`alphaswarm/agents/tools/finagent/strategy_agents_tool.py`](../alphaswarm/agents/tools/finagent/strategy_agents_tool.py) | Query another RL agent's decision |

## Modules

| File | Class | Purpose |
| --- | --- | --- |
| [`alphaswarm_rl/src/alphaswarm_rl/agents/llm_hybrid_layered.py`](../alphaswarm_rl/src/alphaswarm_rl/agents/llm_hybrid_layered.py) | `LayeredReflectionAdapter` | 5-stage prompt cascade |
| [`alphaswarm_rl/src/alphaswarm_rl/envs/tradesim_multimodal.py`](../alphaswarm_rl/src/alphaswarm_rl/envs/tradesim_multimodal.py) | `MultimodalTradingEnv` | FinAgent-style dict observation |

## Usage

```python
from alphaswarm_rl.agents.llm_hybrid_layered import LayeredReflectionAdapter

adapter = LayeredReflectionAdapter(
    llm_model="ollama/llama3",
    rl_weight=0.5,  # blend 50% with RL backbone
    rl_agent={"class": "ppo_inhouse"},
)

# `env` and `obs` are assumed to be supplied by the caller
# (an already-built environment and its current observation).
adapter.build(env)
action, _ = adapter.predict(obs)  # int in {0=SELL, 1=HOLD, 2=BUY}

# Between predicts, update the memory so reflection stages have something
# to critique:
adapter.update_realised_pnl(realised_short=0.01, realised_k=0.02)
```

## Hard rule alignment

- Hard rule 2: every LLM call routes through `router_complete`.
- Hard rule 12: each stage is a separate `AgentRuntime` invocation (see the YAMLs' `model:` blocks).
- Hard rule 19: adapter registers via `RLComponent` metaclass under `rl_alias='finagent_layered'`.

## Acceptance

[Phase 10 tests](../alphaswarm_rl/tests/finagent/) verify:

- 5 stages invoke `router_complete` exactly once each.
- Decision JSON parsed correctly into action int.
- Memory updates persist between calls.
- Cascade degrades to HOLD on LLM failure.
- All 3 tools handle valid + empty inputs.

# Reinforcement learning framework

> Hash-locked RLExperimentSpec + RLRuntime + metaclass-registered components + Iceberg trajectory store. The canonical entry point for every RL run in AlphaSwarm.

# Reinforcement learning framework

The RL layer in AlphaSwarm follows a metaclass-driven, registry-first design inspired by FinRL's library structure and FinRobot's tool-augmented agent runtime. Every concrete component (env, observation, action, reward, termination, policy, agent, data pipeline, ensembler, experiment, trajectory store) auto-registers through [`alphaswarm_rl/src/alphaswarm_rl/core/base.py`](https://github.com/julianwileymac/alphaswarm/blob/main/alphaswarm_rl/src/alphaswarm_rl/core/base.py) so the API and the lab UI can browse them at runtime.

This page is the canonical entry point. For shorter cuts:

- [rl-lab](./rl-lab.md): interactive RL Lab + builders.
- [rl-components](./rl-components.md): auto-generated component reference (browse via `/rl/components` in the operator UI).
- [rl-iceberg](./rl-iceberg.md): Iceberg trajectory / equity / reward-decomposition tables and DuckDB views.
- [rl-market-dynamics](./rl-market-dynamics.md): Phase 6 slice-and-merge regime labeller + `RegimeAwareObservation` + `RegimeStratifiedEvaluation`.
- [rl-prudex-evaluation](./rl-prudex-evaluation.md): Phase 9 PRUDEX-Compass framework (17 measures, 5 visualisations).
- [rl-finagent](./rl-finagent.md): Phase 10 FinAgent multimodal 5-stage LLM-hybrid adapter.
- [weight-centric-pipeline](./weight-centric-pipeline.md): FinRL-X four-stage `f_S → f_A → f_T → f_R` pipeline.
- [architecture/decisions/010-rl-production-enhancement](../../architecture/decisions/010-rl-production-enhancement.md): full Phase 1-12 production-enhancement ADR.
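
The metaclass-driven, registry-first design described at the top of this page can be illustrated with a few lines of Python. This is a minimal sketch of the general technique, not the contents of `core/base.py`: `RLComponentMeta`, `REGISTRY`, and the `pnl` / `sharpe` aliases are hypothetical, while the `rl_kind` and `rl_alias` attribute names follow the usage shown elsewhere in these docs.

```python
REGISTRY: dict[str, dict[str, type]] = {}


class RLComponentMeta(type):
    """Register every class that declares its own rl_alias under its rl_kind."""

    def __init__(cls, name, bases, namespace):
        super().__init__(name, bases, namespace)
        alias = namespace.get("rl_alias")      # must be declared on the class itself
        kind = getattr(cls, "rl_kind", None)   # may be inherited from an abstract base
        if alias is None or kind is None:
            return  # abstract bases carry no alias and are skipped
        bucket = REGISTRY.setdefault(kind, {})
        if alias in bucket:
            raise ValueError(f"duplicate {kind} alias: {alias!r}")
        bucket[alias] = cls


class BaseRewardTerm(metaclass=RLComponentMeta):
    rl_kind = "rl_reward"  # abstract base: no alias, so not registered


class PnLTerm(BaseRewardTerm):
    rl_alias = "pnl"  # hypothetical alias


class SharpeTerm(BaseRewardTerm):
    rl_alias = "sharpe"  # hypothetical alias


print(sorted(REGISTRY["rl_reward"]))
```

Because registration happens at class-creation time, importing a module is enough to make its components discoverable, which is what allows an API route or UI to list them without a hand-maintained index.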
## Phase 1-12 production enhancements (May 2026)

The Phase 1-12 deliverables documented in [ADR-010](../../architecture/decisions/010-rl-production-enhancement.md) add the following components under their canonical `rl_alias` / `kind`:

| Phase | Components |
| --- | --- |
| 1 (Rewards) | `differential_sharpe`, `differential_downside`, `implementation_shortfall`, `running_inventory`, `exp_utility`, `hindsight`, `dp_distillation` |
| 2 (Analytical) | `almgren_chriss_residual`, `avellaneda_stoikov_residual` (+ `alphaswarm_rl.analytical.{almgren_chriss,avellaneda_stoikov,cartea_jaimungal}` helpers) |
| 3 (Envs) | `tradesim_algotrading`, `tradesim_portfolio`, `tradesim_execution`, `tradesim_hft`, `finagent_trading` |
| 4 (Agents) | `eiie`, `deeptrader`, `investor_imitator`, `eteo`, `opd`, `deepscalper`, `hft_ddqn`, `ppo_inhouse` |
| 5 (Backbones) | `eiie_conv`, `sagcn`, `market_scorer`, `hft_qnet`, `eteo_dual_head`, `pd_dual_rnn`, `sarl_lstm` |
| 6 (MDM) | `slice_and_merge_regime_flow` (analysis flow), `regime_aware` observation, `regime_stratified` experiment |
| 7 (CSDI) | `csdi_imputed` dataset kind |
| 8 (Validation) | `CombinatorialPurgedKFold`, `probability_of_backtest_overfitting`, `rademacher_anti_serum`, `deflated_sharpe_ratio`, `walk_forward_anchored`, `walk_forward_rolling`, `benjamini_hochberg`, `holm_bonferroni`, `validation_suite` experiment |
| 9 (PRUDEX) | `PrudexMetrics`, `PrudexReport`, `compute_prudex_metrics`, 5 chart helpers, `prudex_compass` experiment |
| 10 (FinAgent) | `finagent_layered` adapter + 5 AgentSpec YAMLs under `configs/agents/finagent/` + 3 tools under `alphaswarm/agents/tools/finagent/` |
| 11 (Replay) | `GeneralReplayBuffer`, `PrioritizedReplayBuffer`, `NStepInfoReplayBuffer` |
| 12 (Parity) | Determinism + kill-switch tests around `WeightCentricPipeline` + `WeightToOrders` |

## Contracts

Two execution shapes share the same hash-locked spec.
The standalone shape is the original RL pipeline; the workflow-wrapped shape lets `WorkflowRuntime` compose RL training into larger multi-stage agentic pipelines (AGENTS rule 40 + ADR-005 + Phase 5 of the orchestration refactor).

```mermaid
flowchart LR
  Spec["RLExperimentSpec (hash-locked)"] --> Versions["rl_experiment_versions row"]
  Versions --> StandaloneRt["RLRuntime (standalone)"]
  Versions --> WfAdapter["execution adapter (workflow node)"]
  WfAdapter --> WfRuntime["WorkflowRuntime"]
  WfRuntime --> StandaloneRt
  StandaloneRt --> Env["BaseRLEnv"]
  StandaloneRt --> Agent["BaseRLAgent"]
  Env -->|observation| Obs["BaseObservationBuilder"]
  Env -->|action| Action["BaseActionSpace"]
  Env -->|reward| Reward["CompositeReward (BaseRewardTerm × N)"]
  Env -->|terminate?| Term["BaseTerminationCondition"]
  Agent --> Policy["BaseRLPolicy (+ TimeSeriesEncoder backbone)"]
  Agent --> Advantage["BaseAdvantageEstimator"]
  StandaloneRt --> Trajectory["IcebergTrajectoryStore"]
  Trajectory --> Iceberg[("rl.* Iceberg namespace")]
  StandaloneRt --> RlRuns[("rl_runs ledger (Postgres)")]
  StandaloneRt --> Mlflow[("MLflow")]
  WfRuntime --> WfRuns[("workflow_runs + agent_runs_v2")]
```

## Hard rules

1. **All RL train / evaluate / paper / replay / walk-forward goes through [`alphaswarm_rl/src/alphaswarm_rl/runtime.py::RLRuntime`](https://github.com/julianwileymac/alphaswarm/blob/main/alphaswarm_rl/src/alphaswarm_rl/runtime.py)** (AGENTS rule 16). Tasks under [`alphaswarm_rl/tasks/rl_tasks.py`](https://github.com/julianwileymac/alphaswarm/blob/main/alphaswarm_rl/tasks/rl_tasks.py) and API routes under [`alphaswarm_rl/api/routes/rl.py`](https://github.com/julianwileymac/alphaswarm/blob/main/alphaswarm_rl/api/routes/rl.py) wrap it; they never call `agent.train` directly.
2. **`rl_experiment_versions` rows are immutable, hash-locked.** Re-snapshotting via [`alphaswarm_rl/src/alphaswarm_rl/registry.py::persist_spec`](https://github.com/julianwileymac/alphaswarm/blob/main/alphaswarm_rl/src/alphaswarm_rl/registry.py) inserts a new row when the SHA-256 of the spec changes (AGENTS rule 17).
3. **Trajectory persistence flows through [`IcebergTrajectoryStore`](https://github.com/julianwileymac/alphaswarm/blob/main/alphaswarm_rl/src/alphaswarm_rl/trajectories/iceberg_writer.py)** → `iceberg_catalog.append_arrow` (AGENTS rule 18).
4. **All concrete components register through the [`RLComponent`](https://github.com/julianwileymac/alphaswarm/blob/main/alphaswarm_rl/src/alphaswarm_rl/core/base.py) metaclass.** Set `rl_kind` to one of the canonical kinds; the metaclass calls `@register` automatically (AGENTS rule 19).
5. **LLM calls inside `LLMHybridAgent` route through [`router_complete`](https://github.com/julianwileymac/alphaswarm/blob/main/alphaswarm/llm/providers/router.py)** (AGENTS rule 20).
6. **Advantage estimation goes through [`BaseAdvantageEstimator`](https://github.com/julianwileymac/alphaswarm/blob/main/alphaswarm_rl/src/alphaswarm_rl/advantage/base.py)** (AGENTS rule 36). The native `ReinforcePlusPlusAdvantage` / `GRPOAdvantage` / `GAEAdvantage` register through the metaclass alongside envs / rewards / policies.
7. **Policy backbones go through [`TimeSeriesEncoder`](https://github.com/julianwileymac/alphaswarm/blob/main/alphaswarm_rl/src/alphaswarm_rl/policies/backbones/base.py)** (AGENTS rule 37). The four shipped backbones (`TransformerBackbone`, `RecurrentBackbone`, `AutoencoderBackbone`, `PatchTSTBackbone`) wrap existing `alphaswarm_models.models` modules so the policy network and the offline ML stack share one source of truth.
8. **Weight-centric portfolio actions go through the FinRL-X four-stage pipeline [`WeightCentricPipeline`](https://github.com/julianwileymac/alphaswarm/blob/main/alphaswarm_rl/src/alphaswarm_rl/portfolio/pipeline.py)** (`f_S → f_A → f_T → f_R`, AGENTS rule 38). Risk overlay (`f_R`) re-uses [`RiskLimits`](https://github.com/julianwileymac/alphaswarm/blob/main/alphaswarm/risk/limits.py) so offline backtests and live paper paths produce identical target-weight vectors.

## Hash-lock invariant in practice

The `*_spec_versions` table is the contract that makes RL replayable. Three concrete consequences:

- **Same content → same version.** Re-posting an identical spec returns the existing `version_id`. No duplicate row, no side-effect.
- **Any field change → new version.** Bump a hyperparameter, swap a reward term, retune the LR schedule: the SHA-256 changes and the row is new. The old row stays forever.
- **Replay is across data, not across code.** When you `RLRuntime(spec).replay(new_window)`, the runtime loads the pinned `version_id` from `rl_runs`, rebuilds the env / agent exactly as the original train run, and feeds it the new bars. This is how "would this policy have held up in Q1 2024?" questions get a deterministic answer.

This is why [`alphaswarm_rl/src/alphaswarm_rl/registry.py::persist_spec`](https://github.com/julianwileymac/alphaswarm/blob/main/alphaswarm_rl/src/alphaswarm_rl/registry.py) is the only sanctioned path: any direct mutation of the table would corrupt the replay contract.

## Packages

| Path | Purpose |
| --- | --- |
| [alphaswarm_rl/src/alphaswarm_rl/core/](https://github.com/julianwileymac/alphaswarm/tree/main/alphaswarm_rl/src/alphaswarm_rl/core) | Abstract bases + `RLComponent` metaclass + schema helpers. |
| [alphaswarm_rl/src/alphaswarm_rl/spec.py](https://github.com/julianwileymac/alphaswarm/blob/main/alphaswarm_rl/src/alphaswarm_rl/spec.py) | `RLExperimentSpec` declarative blueprint.
| | [alphaswarm_rl/src/alphaswarm_rl/runtime.py](https://github.com/julianwileymac/alphaswarm/blob/main/alphaswarm_rl/src/alphaswarm_rl/runtime.py) | `RLRuntime` single sanctioned executor. | | [alphaswarm_rl/src/alphaswarm_rl/envs/](https://github.com/julianwileymac/alphaswarm/tree/main/alphaswarm_rl/src/alphaswarm_rl/envs) | Concrete envs (existing + FinRL ports + TradeSim + FinAgent). | | [alphaswarm_rl/src/alphaswarm_rl/rewards/](https://github.com/julianwileymac/alphaswarm/tree/main/alphaswarm_rl/src/alphaswarm_rl/rewards) | Composable reward terms. | | [alphaswarm_rl/src/alphaswarm_rl/observations/](https://github.com/julianwileymac/alphaswarm/tree/main/alphaswarm_rl/src/alphaswarm_rl/observations) | Observation builders. | | [alphaswarm_rl/src/alphaswarm_rl/actions/](https://github.com/julianwileymac/alphaswarm/tree/main/alphaswarm_rl/src/alphaswarm_rl/actions) | Action-space implementations. | | [alphaswarm_rl/src/alphaswarm_rl/terminations/](https://github.com/julianwileymac/alphaswarm/tree/main/alphaswarm_rl/src/alphaswarm_rl/terminations) | End-of-episode predicates. | | [alphaswarm_rl/src/alphaswarm_rl/data_pipelines/](https://github.com/julianwileymac/alphaswarm/tree/main/alphaswarm_rl/src/alphaswarm_rl/data_pipelines) | Iceberg / Yahoo / Alpaca / streaming / replay pipelines. | | [alphaswarm_rl/src/alphaswarm_rl/agents/](https://github.com/julianwileymac/alphaswarm/tree/main/alphaswarm_rl/src/alphaswarm_rl/agents) | SB3 / ElegantRL / RLlib / CleanRL / LLM-hybrid + classical / Q-family / actor-critic / evolutionary. | | [alphaswarm_rl/src/alphaswarm_rl/policies/](https://github.com/julianwileymac/alphaswarm/tree/main/alphaswarm_rl/src/alphaswarm_rl/policies) | Policy backbones (`TimeSeriesEncoder` subclasses). | | [alphaswarm_rl/src/alphaswarm_rl/advantage/](https://github.com/julianwileymac/alphaswarm/tree/main/alphaswarm_rl/src/alphaswarm_rl/advantage) | Advantage estimators (native REINFORCE++ / GRPO / GAE). 
| | [alphaswarm_rl/src/alphaswarm_rl/ensemblers/](https://github.com/julianwileymac/alphaswarm/tree/main/alphaswarm_rl/src/alphaswarm_rl/ensemblers) | Walk-forward / best-of-N / curriculum / meta-ensemble. | | [alphaswarm_rl/src/alphaswarm_rl/experiments/](https://github.com/julianwileymac/alphaswarm/tree/main/alphaswarm_rl/src/alphaswarm_rl/experiments) | Experiment runners (basic / walk-forward / ablation / alpha-backtest / regime-stratified / validation-suite / PRUDEX-Compass). | | [alphaswarm_rl/src/alphaswarm_rl/applications/](https://github.com/julianwileymac/alphaswarm/tree/main/alphaswarm_rl/src/alphaswarm_rl/applications) | One-call FinRL-style apps (stock / portfolio / crypto / fundamentals / paper). | | [alphaswarm_rl/src/alphaswarm_rl/portfolio/](https://github.com/julianwileymac/alphaswarm/tree/main/alphaswarm_rl/src/alphaswarm_rl/portfolio) | `WeightCentricPipeline` (FinRL-X `f_S → f_A → f_T → f_R`). | | [alphaswarm_rl/src/alphaswarm_rl/trajectories/](https://github.com/julianwileymac/alphaswarm/tree/main/alphaswarm_rl/src/alphaswarm_rl/trajectories) | Iceberg-backed trajectory writer + DuckDB views. | | [alphaswarm_rl/src/alphaswarm_rl/bridges/](https://github.com/julianwileymac/alphaswarm/tree/main/alphaswarm_rl/src/alphaswarm_rl/bridges) | Backtest-engine + WorkflowRuntime adapters. | | [alphaswarm/persistence/models_rl.py](https://github.com/julianwileymac/alphaswarm/blob/main/alphaswarm/persistence/models_rl.py) | ORM for specs, versions, runs, evaluations, refs, registrations. | | [alphaswarm_rl/api/routes/rl.py](https://github.com/julianwileymac/alphaswarm/blob/main/alphaswarm_rl/api/routes/rl.py) | REST surface. | | [alphaswarm_rl/tasks/rl_tasks.py](https://github.com/julianwileymac/alphaswarm/blob/main/alphaswarm_rl/tasks/rl_tasks.py) | Celery tasks driven by `RLRuntime`. 
| | [alphaswarm_client/src/routes/rl/](https://github.com/julianwileymac/alphaswarm/tree/main/alphaswarm_client/src/routes/rl) | RL Lab + builders + library + runs UI (active Vite frontend). | | [alphaswarm_rl/configs/](https://github.com/julianwileymac/alphaswarm/tree/main/alphaswarm_rl/configs) | Preset / reward / observation / data-pipeline YAMLs. | | [alphaswarm_rl/tests/](https://github.com/julianwileymac/alphaswarm/tree/main/alphaswarm_rl/tests) | Hermetic test suite. | Legacy `alphaswarm.rl.*` is a deprecation shim that re-exports from `alphaswarm_rl.*`; new code imports from `alphaswarm_rl` directly. ## Spec lifecycle 1. **Author** an `RLExperimentSpec` (YAML or in-code Pydantic). 2. **Persist** via [`alphaswarm_rl.registry.persist_spec`](https://github.com/julianwileymac/alphaswarm/blob/main/alphaswarm_rl/src/alphaswarm_rl/registry.py) → `rl_experiment_specs` + `rl_experiment_versions` (hash-locked snapshot). 3. **Run** via `RLRuntime.train` / `.evaluate` / `.paper` / `.replay` / `.walk_forward` → opens an `rl_runs` row, builds the env / agent from `build_from_config`, drives training, persists per-step trajectories to Iceberg, finalises the run row. 4. **Inspect** via the API (`/rl/runs/{id}/equity`, `/trajectories`, `/reward-decomposition`, `/episodes`) and the lab UI run-detail page (equity chart, reward decomposition, episode summary, replay slider). ## Worked example: train + replay Goal: snapshot a 50k-step PPO experiment, train it, inspect the ledger row, read trajectories from Iceberg, and replay against fresh data — all from this page. ### Step 1 — snapshot the spec The experiment YAML lives at [`alphaswarm_rl/configs/experiments/my_first_rl.yaml`](https://github.com/julianwileymac/alphaswarm/tree/main/alphaswarm_rl/configs/experiments). Dispatch the train run: Notice `spec_hash` in the response — that is the immutable hash-lock key. Re-posting the same YAML returns the same `spec_version_id`. 
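
The hash-lock behaviour described in this step can be illustrated with a minimal sketch. It assumes the spec is hashed as canonical JSON (sorted keys, no whitespace); the helper names `spec_hash` and `persist_spec_sketch` and the in-memory `_versions` dict are hypothetical stand-ins, not the real `persist_spec` implementation, whose canonicalisation may differ.

```python
import hashlib
import json

# Hypothetical in-memory stand-in for the rl_experiment_versions table:
# maps spec hash -> version id.
_versions: dict[str, int] = {}


def spec_hash(spec: dict) -> str:
    """SHA-256 over a canonical JSON encoding (sorted keys, no whitespace)."""
    canonical = json.dumps(spec, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def persist_spec_sketch(spec: dict) -> tuple[int, bool]:
    """Return (version_id, created). Identical content reuses the old row."""
    digest = spec_hash(spec)
    if digest in _versions:
        return _versions[digest], False
    version_id = len(_versions) + 1
    _versions[digest] = version_id
    return version_id, True


base = {"agent": {"algorithm": "PPO", "total_timesteps": 50_000, "lr": 3e-4}}
same = {"agent": {"lr": 3e-4, "total_timesteps": 50_000, "algorithm": "PPO"}}
bumped = {"agent": {"algorithm": "PPO", "total_timesteps": 50_000, "lr": 1e-4}}

first = persist_spec_sketch(base)
again = persist_spec_sketch(same)      # key order differs, content does not
changed = persist_spec_sketch(bumped)  # one hyperparameter changed
print(first, again, changed)
```

Re-posting `same` reuses version 1 because only the key order differs, while `bumped` gets a new version because one field changed. Existing entries are never overwritten, which mirrors the "old row stays forever" rule.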
### Step 2 — tail progress

```bash
curl -N http://localhost:8000/chat/stream/
```

Frames arrive in the canonical envelope (AGENTS rule 4). Expected stages: `start` → `data.loaded` → `env.built` → `agent.built` → `train.step` (×many, sparse) → `train.checkpoint` → `done`.

### Step 3 — inspect the ledger

The agent-safe read is `data.rl.list` / `data.rl.describe`:

```bash
curl -X POST http://localhost:8000/mcp/data/tools/data.rl.describe/invoke \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer $(alphaswarm-cli auth token)" \
  -d '{"rl_run_id": ""}'
```

The response carries `status`, `mean_reward`, `total_timesteps`, `spec_version_id`, MLflow run id, and the trajectory namespace.

### Step 4 — read trajectories from Iceberg

Pyodide does not ship PyIceberg, but it ships duckdb + pyarrow, and the trajectory writer exports a parquet-compatible view. The snippet below shows the analytical pattern with inline sample data so it runs in your browser. The same pattern works against the real Iceberg trajectory tables via the [`data.iceberg.read_snapshot`](https://github.com/julianwileymac/alphaswarm/blob/main/alphaswarm/data/mcp/tools/iceberg.py) MCP tool.

The tables are:

- `alphaswarm_silver_rl_trajectories.` — per-step `(episode, step, obs_hash, action, reward, value, log_prob)`
- `alphaswarm_silver_rl_equity_curves.` — per-step equity / drawdown
- `alphaswarm_silver_rl_action_logs.` — full action vectors per step
- `alphaswarm_silver_rl_reward_decomposition.` — per-term reward attribution

### Step 5 — replay against fresh data

The killer feature of hash-locked specs: replay the trained policy against a different time window WITHOUT touching the spec.

```js
// Placeholder: set this to the rl_run_id of the original train run.
const RL_RUN_ID = "";
// Assumption: host taken from the curl examples above, path from the
// documented `POST /rl/runs/{id}/replay` route.
const r = await fetch(`http://localhost:8000/rl/runs/${RL_RUN_ID}/replay`, {
  method: "POST",
  headers: { "Content-Type": "application/json" },
  body: JSON.stringify({ start: "2024-01-01", end: "2024-03-31" }),
});
const { task_id, rl_run_id: replay_run_id, reused_spec_version_id } = await r.json();
console.log({ task_id, replay_run_id, reused_spec_version_id });
```

The new `rl_runs` row carries `parent_run_id` and the SAME `spec_version_id` as the original train run. Two `rl_runs` rows, one `rl_experiment_versions` row.

### Step 6 — verify

- `rl_experiment_versions` row with the recorded `spec_hash`.
- Two `rl_runs` rows referencing it (`train` + `replay`).
- Trajectory tables in `alphaswarm_silver_rl_trajectories.`.
- MLflow runs visible at `http://localhost:5000/#/experiments`.
- Topbar `KillSwitch` shows green; `should_halt` returned false on every step.

### What next

- Walk the full tutorial: [tutorials/first-rl-experiment](../../tutorials/first-rl-experiment.md).
- Compose into a workflow: [tutorials/first-agent-workflow](../../tutorials/first-agent-workflow.md) + [concepts/agentic/workflow-studio](../agentic/workflow-studio.md).
- Add a custom reward term: [rl-components](./rl-components.md).
- Browse the trajectory schema: [rl-iceberg](./rl-iceberg.md).

## Inspiration sources

- **FinRL** (`alphaswarm_snippets/inspiration/FinRL-master`) — env taxonomy (StockTrading, StockPortfolio, multi-crypto), `DataProcessor` / `FeatureEngineer` / `df_to_array`, `DRLAgent` / `DRLEnsembleAgent`, composite reward. Ported as registered presets in `alphaswarm_rl.envs.finrl_*`, `alphaswarm_rl.data_pipelines.*`, and the `WalkForwardEnsembler`.
- **FinRobot** (`alphaswarm_snippets/inspiration/FinRobot-master`) — multi-agent LLM workflow + tool-augmented analysis. Bridged via `LLMHybridAgent` (LLM proposes, RL refines) and `FundamentalBuilder`.
- **FinRL-X** — the four-stage weight-centric pipeline (`f_S → f_A → f_T → f_R`) is ported as [`WeightCentricPipeline`](https://github.com/julianwileymac/alphaswarm/blob/main/alphaswarm_rl/src/alphaswarm_rl/portfolio/pipeline.py) (AGENTS rule 38). - **FinAgent** — five-stage LLM-hybrid adapter ported as `finagent_layered` (ADR-010, Phase 10). - **PRUDEX-Compass** — 17-measure evaluation framework ported as `prudex_compass` experiment + five chart helpers (ADR-010, Phase 9). ## Deeper reads - [rl-lab](./rl-lab.md) — interactive RL Lab + builders. - [rl-components](./rl-components.md) — full component catalogue. - [rl-iceberg](./rl-iceberg.md) — trajectory persistence contract. - [rl-policy-backbones](./rl-policy-backbones.md) — `TimeSeriesEncoder` subclasses. - [rl-market-dynamics](./rl-market-dynamics.md) — regime labeller + observation. - [rl-prudex-evaluation](./rl-prudex-evaluation.md) — PRUDEX-Compass. - [rl-finagent](./rl-finagent.md) — FinAgent multimodal adapter. - [weight-centric-pipeline](./weight-centric-pipeline.md) — `f_S → f_A → f_T → f_R`. - [agentic-rl](./agentic-rl.md) — RL-as-agent integration patterns. - [architecture/decisions/010-rl-production-enhancement](../../architecture/decisions/010-rl-production-enhancement.md) — full Phase 1-12 ADR. - [reference/api](../../reference/api/index.mdx) — the `rl` tag in the interactive playground. - [reference/python/alphaswarm_rl](../../reference/python/index.mdx) — auto-generated `alphaswarm_rl` Python reference. # RL Iceberg data plane > | Table | Columns | Written when | | --- | --- | --- | | `rl.trajectories` | `run_id`, `episode`, `step`, `ts`, `reward`, `info` (JSON) | Every env step | | `rl.equity_curves` | `run_id`, `episode`, `... # RL Iceberg data plane Per-step RL records persist to four Iceberg tables in the namespace controlled by `ALPHASWARM_RL_TRAJECTORY_NAMESPACE` (default `rl`). 
Writes flow through [`alphaswarm/rl/trajectories/iceberg_writer.py::IcebergTrajectoryStore`](../alphaswarm/rl/trajectories/iceberg_writer.py) → [`iceberg_catalog.append_arrow`](../alphaswarm/data/iceberg_catalog.py).

## Tables

| Table | Columns | Written when |
| --- | --- | --- |
| `rl.trajectories` | `run_id`, `episode`, `step`, `ts`, `reward`, `info` (JSON) | Every env step |
| `rl.equity_curves` | `run_id`, `episode`, `step`, `ts`, `portfolio_value`, `drawdown`, `cash` | Every env step that exposes `info["portfolio_value"]` |
| `rl.action_logs` | `run_id`, `episode`, `step`, `ts`, `asset_idx`, `action_value` | Every env step (one row per action component) |
| `rl.reward_decomposition` | `run_id`, `episode`, `step`, `ts`, `term_name`, `contribution` | When the reward model exposes `info["reward_terms"]` (any `CompositeReward`) |

## Settings

| Variable | Default | Purpose |
| --- | --- | --- |
| `ALPHASWARM_RL_TRAJECTORY_NAMESPACE` | `rl` | Iceberg namespace |
| `ALPHASWARM_RL_TRAJECTORY_TABLE` | `trajectories` | Per-step trajectory table name |
| `ALPHASWARM_RL_EQUITY_TABLE` | `equity_curves` | Equity-curve table name |
| `ALPHASWARM_RL_ACTION_LOG_TABLE` | `action_logs` | Action-log table name |
| `ALPHASWARM_RL_REWARD_DECOMP_TABLE` | `reward_decomposition` | Reward-decomposition table name |
| `ALPHASWARM_RL_PERSIST_TRAJECTORIES` | `true` | When `false`, the runtime uses an in-memory store (CI / local). |
| `ALPHASWARM_RL_TRAJECTORY_FLUSH_ROWS` | `1000` | Rows per buffer before partial flush. |
| `ALPHASWARM_RL_REQUIRE_ICEBERG` | `false` | Make Iceberg write failures hard-fail. |

## DuckDB views

[`alphaswarm/rl/trajectories/duckdb_views.py`](../alphaswarm/rl/trajectories/duckdb_views.py) exposes two helpers:

- `ensure_duckdb_views(connection)` — registers `rl_trajectories` / `rl_equity_curves` / `rl_action_logs` / `rl_reward_decomposition` Arrow-backed views.
- `register_run_views(run_id, connection)` — adds run-filtered views named `rl__run_`.
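
The run-filtered view idea can be sketched with the standard-library `sqlite3` module standing in for DuckDB. This is not the real `register_run_views` helper: the function name `register_run_view_sketch`, the view name `equity_for_<run_id>`, and the sample rows are all hypothetical, and only a subset of the `rl.equity_curves` columns is modelled.

```python
import sqlite3

# In-memory SQLite database as a stand-in for the DuckDB connection.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE rl_equity_curves ("
    "run_id TEXT, episode INTEGER, step INTEGER, portfolio_value REAL, drawdown REAL)"
)
conn.executemany(
    "INSERT INTO rl_equity_curves VALUES (?, ?, ?, ?, ?)",
    [
        ("run_a", 0, 0, 100_000.0, 0.00),
        ("run_a", 0, 1, 101_500.0, 0.00),
        ("run_a", 0, 2, 99_470.0, 0.02),
        ("run_b", 0, 0, 100_000.0, 0.00),
    ],
)


def register_run_view_sketch(run_id: str, connection: sqlite3.Connection) -> str:
    """Create a run-filtered view; the view name here is illustrative only."""
    # View definitions cannot take bound parameters, so the id is validated
    # before being interpolated into the SQL text.
    if not run_id.isidentifier():
        raise ValueError(f"unsafe run id: {run_id!r}")
    view = f"equity_for_{run_id}"
    connection.execute(
        f"CREATE VIEW IF NOT EXISTS {view} AS "
        f"SELECT episode, step, portfolio_value, drawdown "
        f"FROM rl_equity_curves WHERE run_id = '{run_id}'"
    )
    return view


view_name = register_run_view_sketch("run_a", conn)
rows = conn.execute(
    f"SELECT step, portfolio_value FROM {view_name} ORDER BY step"
).fetchall()
max_drawdown = conn.execute(f"SELECT MAX(drawdown) FROM {view_name}").fetchone()[0]
print(rows, max_drawdown)
```

Callers query the per-run view without repeating the `run_id` filter, which is the convenience the run-filtered views provide to the API layer.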
The API uses these views to serve the `/rl/runs/{id}/equity` / `/trajectories` / `/reward-decomposition` / `/actions` endpoints without touching PyIceberg directly. ## Postgres ledger The Postgres tables in [`alphaswarm/persistence/models_rl.py`](../alphaswarm/persistence/models_rl.py) hold the metadata layer that points at these Iceberg row ranges: - `rl_experiment_specs` / `rl_experiment_versions` — hash-locked spec snapshots. - `rl_runs` — one row per `RLRuntime` invocation. - `rl_evaluations` — rollout summary. - `rl_trajectory_refs` / `rl_equity_curve_refs` — pointers to the Iceberg row ranges per episode. - `rl_component_registrations` — DB mirror of the in-memory RL component registry (so `/rl/components` is fast). # RL Lab — interactive RL builder > | Tab | Purpose | Component | | --- | --- | --- | | **Experiment** | Compose env + reward + observation + action + agent + ensembler into one `RLExperimentSpec`, save, train. | [`ExperimentBuilder.tsx... # RL Lab — interactive RL builder Lives at `/rl/lab` in the AlphaSwarm webui. Combines six surfaces into one shell: | Tab | Purpose | Component | | --- | --- | --- | | **Experiment** | Compose env + reward + observation + action + agent + ensembler into one `RLExperimentSpec`, save, train. | [`ExperimentBuilder.tsx`](../webui/components/rl/ExperimentBuilder.tsx) | | **Environment** | Drag a data pipeline + env + observation + action + reward + termination onto the canvas; save spec. | [`EnvironmentBuilder.tsx`](../webui/components/rl/EnvironmentBuilder.tsx) | | **Reward** | Drag reward terms, weight them, hit "Preview reward" → server-side decomposition over a synthetic trajectory. | [`RewardModelBuilder.tsx`](../webui/components/rl/RewardModelBuilder.tsx) | | **Observation** | Drag observation builders, preview output shape + feature names. 
| [`ObservationBuilder.tsx`](../webui/components/rl/ObservationBuilder.tsx) | | **Agent** | Pick framework (SB3 / ElegantRL / RLlib / CleanRL / LLM-hybrid) + algorithm + hyperparams. | [`AgentBuilder.tsx`](../webui/components/rl/AgentBuilder.tsx) | | **Component library** | Browse every registered RL component, filter by tag / source / category. | [`RlComponentLibrary.tsx`](../webui/components/rl/RlComponentLibrary.tsx) | ## Routes | Path | Component | | --- | --- | | `/rl/lab` | `RlLabPage` | | `/rl/library` | `RlComponentLibrary` | | `/rl/builder/env` | `EnvironmentBuilder` | | `/rl/builder/reward` | `RewardModelBuilder` | | `/rl/builder/observation` | `ObservationBuilder` | | `/rl/builder/agent` | `AgentBuilder` | | `/rl/builder/experiment` | `ExperimentBuilder` | | `/rl/runs` | `RlRunsPage` | | `/rl/runs/[id]` | `RlRunDetailPage` | | `/rl/runs/[id]/replay` | `RlReplayViewer` | | `/rl` | Legacy `RlPage` (quick-train, application registry browser). | | `/rl/zoo` | RL agent zoo (`/registry/agent`). | The builders all use the existing [`WorkflowEditor`](../webui/components/flow/WorkflowEditor.tsx) + xyflow stack with `domain="rl"`. The serializer in [`webui/components/rl/serialize.ts`](../webui/components/rl/serialize.ts) turns a `FlowGraph` into an `RLExperimentSpec` payload by bucketising nodes via their palette group (env / observation / action / reward / termination / agent / data pipeline / ensembler). ## API surface used The lab calls the API endpoints in [`alphaswarm/api/routes/rl.py`](../alphaswarm/api/routes/rl.py): - `GET /rl/components` — kind counts. - `GET /rl/components/{kind}` — list registered components per kind. - `POST /rl/lab/preview-reward` — reward decomposition. - `POST /rl/lab/preview-observation` — observation shape + features. - `POST /rl/lab/preview-action` — action transform sample. - `POST /rl/specs` — persist a spec. - `POST /rl/specs/{slug}/run` — kick off train / evaluate / paper / replay / walk-forward via the matching Celery task. 
- `GET /rl/runs` / `GET /rl/runs/{id}` / `.../equity` / `.../trajectories` / `.../reward-decomposition` / `.../episodes` / `.../actions` — runs ledger + step-level data served from DuckDB views over the Iceberg tables. - `POST /rl/runs/{id}/replay` — re-roll a saved policy on a new window. - `POST /rl/data-pipelines/preview` — show first rows + array shapes. ## Run replay The replay viewer (`/rl/runs/[id]/replay`) loads: - `rl.equity_curves` rows for the chosen episode (slider populates from the row count). - `rl.trajectories` rows for the chosen episode (each step shows reward + info JSON). Both come from the DuckDB views generated by [`alphaswarm/rl/trajectories/duckdb_views.py`](../alphaswarm/rl/trajectories/duckdb_views.py). # RL Market Dynamics Modeling (Phase 6) > The market-dynamics framework labels every bar in a price series with a regime ID (default 4 regimes: strong-down / weak-down / sideways / strong-up). The labels feed: # RL Market Dynamics Modeling (Phase 6) Reference docs for the slice-and-merge regime labeller and its consumers in `alphaswarm_rl`. ## Overview The market-dynamics framework labels every bar in a price series with a regime ID (default 4 regimes: strong-down / weak-down / sideways / strong-up). The labels feed: - `RegimeAwareObservation` — appends a one-hot regime vector to the RL agent's observation. - `RegimeStratifiedEvaluation` — runs the trained policy and decomposes per-regime performance for the RL Lab dashboard. ## Pipeline 1. **Butterworth filter** on the indicator column (default `close`). Causal `lfilter` to avoid look-ahead. 2. **Turning-point detection** — bars where the filtered pct-return sign flips mark candidate segment boundaries. 3. **Segment merging** — segments below `min_length_limit` are merged with their neighbour so every regime has a stable estimation window. 4. **Per-segment slope** — linear regression of the filtered indicator inside each segment. 5. 
**Labelling** — quantile (default) or fixed-threshold buckets. ## Modules | File | Class | Purpose | | --- | --- | --- | | [`alphaswarm/analysis/flows/market_dynamics_modeling.py`](../alphaswarm/analysis/flows/market_dynamics_modeling.py) | `slice_and_merge_regime_flow` | Analysis flow; emits per-bar labels | | [`alphaswarm_rl/src/alphaswarm_rl/observations/regime.py`](../alphaswarm_rl/src/alphaswarm_rl/observations/regime.py) | `RegimeAwareObservation` | One-hot observation appendage | | [`alphaswarm_rl/src/alphaswarm_rl/experiments/regime_stratified.py`](../alphaswarm_rl/src/alphaswarm_rl/experiments/regime_stratified.py) | `RegimeStratifiedEvaluation` | Per-regime metric breakdown | ## Usage ```python from alphaswarm.analysis.base import FlowContext from alphaswarm.analysis.flows.market_dynamics_modeling import ( SliceAndMergeRegimeParams, slice_and_merge_regime_flow, ) params = SliceAndMergeRegimeParams( indicator_column="close", dynamic_number=4, min_length_limit=12, labeling_method="quantile", ) result = slice_and_merge_regime_flow(df, params, FlowContext(run_id="…")) labels = [row["label"] for row in result.rows] ``` The labels are surfaced into the RL pipeline via `RegimeAwareObservation(labels=labels)` and the matching evaluation through `RegimeStratifiedEvaluation(n_regimes=4, regime_labels=labels)`. ## Hard rule alignment - Hard rule 23: analysis-spec lifecycle goes through `AnalysisRuntime`. The flow registers via `register_analysis_flow`. - Hard rule 21: gold-tier writes via `iceberg_catalog.append_arrow` to `alphaswarm_gold_analysis_market_dynamics_modeling`. - Hard rule 25: flow body has no direct LLM calls. ## Acceptance - [Phase 6 tests](../alphaswarm_rl/tests/mdm/) verify: - `slice_and_merge_regime_flow` produces ≥1 segment on a trending+sideways+downtrend synthetic series. - `RegimeAwareObservation` emits the expected one-hot shape. - `RegimeStratifiedEvaluation` breaks performance down per regime. 
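
The five pipeline steps described in this section can be sketched in plain Python. This is a simplified illustration, not the shipped flow: an exponential moving average stands in for the causal Butterworth filter, the sign of the bar-to-bar change stands in for the pct-return sign (equivalent for positive prices), and every function name below is hypothetical.

```python
def causal_smooth(xs, alpha=0.3):
    """Causal low-pass stand-in (EMA); the real flow uses a Butterworth lfilter."""
    out, prev = [], xs[0]
    for x in xs:
        prev = alpha * x + (1 - alpha) * prev
        out.append(prev)
    return out


def turning_points(smooth):
    """Indices where the sign of the bar-to-bar change flips."""
    cuts, last_sign = [], 0
    for i in range(1, len(smooth)):
        diff = smooth[i] - smooth[i - 1]
        sign = (diff > 0) - (diff < 0)
        if sign and last_sign and sign != last_sign:
            cuts.append(i - 1)  # the extremum bar opens the next segment
        if sign:
            last_sign = sign
    return cuts


def merge_short(cuts, n, min_len):
    """Drop boundaries that would leave a segment shorter than min_len."""
    kept, start = [], 0
    for c in cuts:
        if c - start >= min_len and n - c >= min_len:
            kept.append(c)
            start = c
    bounds = [0] + kept + [n]
    return list(zip(bounds[:-1], bounds[1:]))


def slope(ys):
    """Least-squares slope of ys against 0..len(ys)-1."""
    n = len(ys)
    x_mean, y_mean = (n - 1) / 2, sum(ys) / n
    num = sum((i - x_mean) * (y - y_mean) for i, y in enumerate(ys))
    den = sum((i - x_mean) ** 2 for i in range(n))
    return num / den if den else 0.0


def label_bars(prices, n_regimes=4, min_len=12):
    """Return (per-bar labels, segments); label 0 is the most negative slope bucket."""
    smooth = causal_smooth(prices)
    segments = merge_short(turning_points(smooth), len(prices), min_len)
    slopes = [slope(smooth[a:b]) for a, b in segments]
    ranked = sorted(slopes)
    labels = []
    for (a, b), s in zip(segments, slopes):
        # Quantile bucket: rank of this slope among all segment slopes.
        bucket = min(n_regimes - 1, ranked.index(s) * n_regimes // len(ranked))
        labels.extend([bucket] * (b - a))
    return labels, segments


# Synthetic series: 40 bars up, 40 bars down, 40 bars up.
prices = (
    [100.0 + i for i in range(40)]
    + [139.0 - i for i in range(1, 41)]
    + [99.0 + i for i in range(1, 41)]
)
labels, segments = label_bars(prices)
print(segments, sorted(set(labels)))
```

Because the smoother is causal, each detected boundary lags the raw price extremum by a bar or so. That is the price of avoiding look-ahead.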
# RL policy backbones > | Class | Source | Use case | | --- | --- | --- | | [`TransformerBackbone`](../alphaswarm/rl/policies/backbones/transformer.py) | Self-attention encoder over the lookback window | Default for medium sequence... # RL policy backbones > Transformer / RNN / Autoencoder / PatchTST feature trunks for the > AlphaSwarm RL policies. Registered through the > [`RLComponent`](../alphaswarm/rl/core/base.py) metaclass with > `rl_kind='rl_policy_backbone'`. ## Backbones | Class | Source | Use case | | --- | --- | --- | | [`TransformerBackbone`](../alphaswarm/rl/policies/backbones/transformer.py) | Self-attention encoder over the lookback window | Default for medium sequence (30-100 bars) | | [`RecurrentBackbone`](../alphaswarm/rl/policies/backbones/recurrent.py) | LSTM / GRU / RNN cell (configurable) | Causal, memory-efficient, anti-bidirectional default | | [`AutoencoderBackbone`](../alphaswarm/rl/policies/backbones/autoencoder.py) | MLP encoder bottleneck | High-dim observation (1000+ features) compression | | [`PatchTSTBackbone`](../alphaswarm/rl/policies/backbones/patchtst.py) | Patch-tokenised Transformer (Nie 2023) | Long-horizon (252+ bars) — avoids token explosion | ## Wiring through SB3 ```yaml agent: class: SB3Adapter module_path: alphaswarm.rl.agents.sb3_adapter kwargs: algorithm: PPO policy: MlpPolicy policy_kwargs: features_extractor_class: alphaswarm.rl.policies.feature_extractors.BackboneFeaturesExtractor features_extractor_kwargs: backbone_alias: TransformerBackbone sequence_length: 30 input_features: 32 features_dim: 128 backbone_kwargs: n_heads: 4 n_layers: 2 d_ff: 256 dropout: 0.1 ``` ## Wiring through CleanRL The [`CleanRLAdapter`](../alphaswarm/rl/agents/cleanrl_adapter.py) wraps the backbone via [`build_backbone_from_alias`](../alphaswarm/rl/policies/feature_extractors.py): ```python from alphaswarm.rl.policies import build_backbone_from_alias trunk = build_backbone_from_alias( "RecurrentBackbone", input_features=20, 
sequence_length=30, output_dim=128, backbone_kwargs={"cell": "lstm", "hidden_size": 128, "num_layers": 2}, ) ``` ## Shipped example specs Four reference specs live under [`configs/rl/policies/`](../configs/rl/policies): - [`transformer_stock_trading.yaml`](../configs/rl/policies/transformer_stock_trading.yaml) — PPO + Transformer over StockTradingEnv. - [`recurrent_portfolio.yaml`](../configs/rl/policies/recurrent_portfolio.yaml) — SAC + LSTM over PortfolioAllocationEnv. - [`autoencoder_marketmaking.yaml`](../configs/rl/policies/autoencoder_marketmaking.yaml) — PPO + Autoencoder over MarketMakingEnv. - [`patchtst_execution.yaml`](../configs/rl/policies/patchtst_execution.yaml) — PPO + PatchTST over OptimalExecutionEnv. ## Adding a new backbone See [the cursor rule](../.cursor/rules/policy-backbones.mdc) for the canonical checklist. ## See also - [alphaswarm_docs/agentic-rl.md](../../concepts/rl/agentic-rl.md) - [Hard rule 37 in AGENTS.md](https://github.com/julianwileymac/alphaswarm/blob/main/AGENTS.md) # RL PRUDEX-Compass Evaluation (Phase 9) > | Axis | Code | Measures | | --- | --- | --- | | Profitability | P | `total_return`, `annualised_return`, `cagr` | | Risk-control | R | `volatility`, `max_drawdown`, `sortino`, `calmar` | | Universali... # RL PRUDEX-Compass Evaluation (Phase 9) Reference docs for the PRUDEX-Compass evaluation framework ported from TradeMaster into `alphaswarm_rl`. ## Six axes, 17 measures | Axis | Code | Measures | | --- | --- | --- | | Profitability | P | `total_return`, `annualised_return`, `cagr` | | Risk-control | R | `volatility`, `max_drawdown`, `sortino`, `calmar` | | Universality | U | `cross_dataset_sharpe_mean`, `cross_dataset_sharpe_std` | | Diversification | D | `portfolio_weight_entropy`, `turnover` | | Explainability | E | `regime_conditioned_sharpe` | | X-tra evaluation | X | `performance_profile_auc`, `rank_score`, `extreme_market_score`, `hit_rate` | Plus a `sharpe_ratio` convenience field. 
**17 measures total.**

## Five visualisations

| Helper | Purpose |
| --- | --- |
| `pride_star_chart` | 8-axis radar of per-agent scores |
| `prudex_compass_chart` | 6-axis radar (one axis per PRUDEX axis) |
| `performance_profile_chart` | CDF of per-step returns across agents |
| `rank_distribution_chart` | Heatmap of per-metric ranks |
| `extreme_market_chart` | Bar chart of extreme-market cumulative returns |

All helpers degrade gracefully to a dict fallback when matplotlib is unavailable.

## Modules

| File | Class | Purpose |
| --- | --- | --- |
| [`alphaswarm_rl/src/alphaswarm_rl/evaluation/prudex_compass.py`](../alphaswarm_rl/src/alphaswarm_rl/evaluation/prudex_compass.py) | `PrudexMetrics`, `PrudexReport`, `compute_prudex_metrics` | Per-agent metric computation |
| [`alphaswarm_rl/src/alphaswarm_rl/evaluation/visualizations.py`](../alphaswarm_rl/src/alphaswarm_rl/evaluation/visualizations.py) | 5 chart helpers | Plot rendering |
| [`alphaswarm_rl/src/alphaswarm_rl/experiments/prudex_evaluation.py`](../alphaswarm_rl/src/alphaswarm_rl/experiments/prudex_evaluation.py) | `PrudexEvaluation` | Experiment aggregator |

## Usage

```python
from alphaswarm_rl.experiments.prudex_evaluation import PrudexEvaluation
from alphaswarm_rl.evaluation.visualizations import (
    prudex_compass_chart,
    pride_star_chart,
    performance_profile_chart,
)

# eq_* / w_* are the per-agent equity curves and weight histories
# collected from earlier runs.
exp = PrudexEvaluation(periods_per_year=252)
report = exp.run(
    agent_results={
        "eiie": {"equity_curve": eq_eiie, "weights_history": w_eiie},
        "deeptrader": {"equity_curve": eq_dt, "weights_history": w_dt},
        "ppo": {"equity_curve": eq_ppo, "weights_history": w_ppo},
    },
)

# Visualise:
fig = prudex_compass_chart(report)
```

## Hard rule alignment

- Hard rule 19: `PrudexEvaluation` registers via `RLComponent` metaclass under `rl_alias='prudex_compass'`.
- Hard rule 18: report lands in `rl_runs.result_summary` via the parent `RLRuntime`; no direct Iceberg writes from this experiment.
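
For orientation, a few of the profitability and risk-control measures named in the axes table can be sketched from an equity curve alone. The formulas below are the textbook definitions and are an assumption; `compute_prudex_metrics` may use different conventions (for example sample versus population variance), and the function names are illustrative.

```python
import math


def total_return(equity):
    """Simple return from the first to the last equity value."""
    return equity[-1] / equity[0] - 1.0


def annualised_return(equity, periods_per_year=252):
    """Geometric return scaled to one year."""
    n_periods = len(equity) - 1
    return (equity[-1] / equity[0]) ** (periods_per_year / n_periods) - 1.0


def volatility(equity, periods_per_year=252):
    """Annualised sample standard deviation of per-period returns."""
    rets = [b / a - 1.0 for a, b in zip(equity, equity[1:])]
    mean = sum(rets) / len(rets)
    var = sum((r - mean) ** 2 for r in rets) / (len(rets) - 1)
    return math.sqrt(var * periods_per_year)


def max_drawdown(equity):
    """Largest peak-to-trough loss, as a positive fraction of the peak."""
    peak, worst = equity[0], 0.0
    for value in equity:
        peak = max(peak, value)
        worst = max(worst, 1.0 - value / peak)
    return worst


def calmar(equity, periods_per_year=252):
    """Annualised return divided by maximum drawdown."""
    mdd = max_drawdown(equity)
    return annualised_return(equity, periods_per_year) / mdd if mdd else float("inf")


# Toy curve: a starting value followed by four quarterly observations.
equity = [100.0, 110.0, 99.0, 104.0, 121.0]
metrics = {
    "total_return": total_return(equity),
    "annualised_return": annualised_return(equity, periods_per_year=4),
    "volatility": volatility(equity, periods_per_year=4),
    "max_drawdown": max_drawdown(equity),
    "calmar": calmar(equity, periods_per_year=4),
}
print(metrics)
```

The toy curve uses `periods_per_year=4` so the annualised figure equals the total return. With daily bars the value of 252 shown in the usage example applies.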
## Acceptance [Phase 9 tests](../alphaswarm_rl/tests/evaluation/) verify: - All 17 measures compute without error on synthetic equity series. - Per-axis breakdown has exactly 6 axes (P/R/U/D/E/X). - 5 visualisation helpers return a Figure (matplotlib) or dict fallback. - Rank matrix is in `[1, N_agents]` per metric. # Weight-centric portfolio pipeline (`f_S -> f_A -> f_T -> f_R`) > | Stage | Class | Responsibility | Default | | --- | --- | --- | --- | | `f_S` (Selector) | [`StockSelector`](../alphaswarm/rl/portfolio/selector.py) | Filter universe by liquidity / vol / momentum | [`Stati... # Weight-centric portfolio pipeline (`f_S -> f_A -> f_T -> f_R`) > The FinRL-X four-stage protocol that guarantees identical target > weight semantics across offline backtesting and live broker > execution. ## Stages | Stage | Class | Responsibility | Default | | --- | --- | --- | --- | | `f_S` (Selector) | [`StockSelector`](../alphaswarm/rl/portfolio/selector.py) | Filter universe by liquidity / vol / momentum | [`StaticUniverseSelector`](../alphaswarm/rl/portfolio/selector.py) | | `f_A` (Allocator) | [`PortfolioAllocator`](../alphaswarm/rl/portfolio/allocator.py) | Map raw RL action to unconstrained weights | [`IdentityAllocator`](../alphaswarm/rl/portfolio/allocator.py) | | `f_T` (Timing) | [`TimingAdjuster`](../alphaswarm/rl/portfolio/timing.py) | Scale gross exposure on regime signals | [`ConstantTimingAdjuster`](../alphaswarm/rl/portfolio/timing.py) | | `f_R` (Risk overlay) | [`RiskOverlay`](../alphaswarm/rl/portfolio/risk_overlay.py) | Truncate weights violating hard constraints | [`StackedRiskOverlay(PositionCap + GrossExposure)`](../alphaswarm/rl/portfolio/risk_overlay.py) | ## Composition ```python from alphaswarm.rl.portfolio import ( GrossExposureRiskOverlay, IdentityAllocator, PositionCapRiskOverlay, StackedRiskOverlay, StaticUniverseSelector, TurbulenceTimingAdjuster, WeightCentricPipeline, ) pipeline = WeightCentricPipeline( 
selector=StaticUniverseSelector(universe=universe), allocator=IdentityAllocator(), timing=TurbulenceTimingAdjuster(threshold=140.0, cooldown_scale=0.0), risk_overlay=StackedRiskOverlay(overlays=[ PositionCapRiskOverlay(max_position_pct=0.30, mark_truncated=True), GrossExposureRiskOverlay(max_gross=1.0), ]), ) state = pipeline.run( universe=universe, raw_action=action, context={"turbulence": 90.0, "prices": prices, "equity": 100_000.0}, ) target_weights = state.weights # numpy array aligned with state.universe ``` ## Determinism contract Each stage is a **pure function** of its inputs — no hidden global state, no time-dependent randomness without an explicit seed. `state.history` records the per-stage weight vector for audit so a downstream `LedgerWriter` can persist the full `f_S -> f_A -> f_T -> f_R` trace. ## Truncation propagation The risk overlay can set `state.context["truncated"]=True` when a hard constraint is breached (e.g. `mark_truncated=True` on `PositionCapRiskOverlay`). The [`RLBacktestEnv`](../alphaswarm/rl/envs/rl_backtest_env.py) lifts this onto `info["truncated"]` so the [`StopProperlyWrapper`](../alphaswarm/rl/rewards/stop_properly.py) scales the step reward by `coef in [0, 1]`. ## Adding a new stage variant 1. Subclass the relevant base (`StockSelector` / `PortfolioAllocator` / `TimingAdjuster` / `RiskOverlay`). 2. Implement the single transform method (`select` / `allocate` / `adjust` / `apply`). 3. Re-export from [`alphaswarm/rl/portfolio/__init__.py`](../alphaswarm/rl/portfolio/__init__.py). ## See also - [alphaswarm_docs/agentic-rl.md](../../concepts/rl/agentic-rl.md) — Overall architecture. - [Hard rule 38 in AGENTS.md](https://github.com/julianwileymac/alphaswarm/blob/main/AGENTS.md) — Source-of-truth rule. # Analysis Agents > | Spec | Module | Purpose | | --- | --- | --- | | `analysis.step` | [alphaswarm/agents/analysis/step_analyst.py](../alphaswarm/agents/analysis/step_analyst.py) | Verdict + improvements for a single agent step. | | ... 
# Analysis Agents

Three interpretation agents + one deferred reflector. Together they close the Alpha-GPT three-stage loop (Ideation → Implementation → Review) and the TradingAgents-style outcome-reflection loop.

## Specs

| Spec | Module | Purpose |
| --- | --- | --- |
| `analysis.step` | [alphaswarm/agents/analysis/step_analyst.py](../alphaswarm/agents/analysis/step_analyst.py) | Verdict + improvements for a single agent step. |
| `analysis.run` | [alphaswarm/agents/analysis/run_analyst.py](../alphaswarm/agents/analysis/run_analyst.py) | End-to-end interpretation of a backtest / paper / live run. |
| `analysis.portfolio` | [alphaswarm/agents/analysis/portfolio_analyst.py](../alphaswarm/agents/analysis/portfolio_analyst.py) | Portfolio aggregate + risk + regulatory exposure. |
| Reflector (helper) | [alphaswarm/agents/analysis/reflector.py](../alphaswarm/agents/analysis/reflector.py) | Resolve outcomes + write reflections + re-index L0. |

## Reflection loop (TradingAgents pattern)

```mermaid
flowchart LR
  d[agent_decisions] --> w[reflector window]
  w --> o[memory_outcomes]
  o --> r[reflection LLM]
  r --> mr[memory_reflections]
  mr --> rag[L0 RAG decisions]
  rag --> next[next agent run]
```

1. The reflector pulls every recent decision row that doesn't yet have an outcome.
2. It computes raw / benchmark / excess return over a configurable window via the bars adapter.
3. It writes one `MemoryOutcome` row + one `MemoryReflection` row.
4. It re-indexes the decision into the L0 `decisions` corpus so the next research / selection / trader run picks it up via `HierarchicalRAG`.

## REST + Celery

```
POST /agents/analysis/step       — task
POST /agents/analysis/run        — task
POST /agents/analysis/portfolio  — task
POST /agents/analysis/reflect    — task wrapper for run_reflection_pass
POST /agents/analysis/sync/run   — synchronous variant
POST /memory/reflect/run         — synchronous reflection pass
```

Tasks live in [alphaswarm/tasks/analysis_tasks.py](../alphaswarm/tasks/analysis_tasks.py).
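
Whichever endpoint triggers the reflection pass, step 2 of the reflection loop reduces to a small calculation. The sketch below assumes a long position, close-to-close returns, and a window measured in bars. `OutcomeSketch`, `window_return`, and `resolve_outcome` are hypothetical names, and the real reflector reads prices through the bars adapter rather than from in-memory lists.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class OutcomeSketch:
    """Illustrative stand-in for the fields a MemoryOutcome row would carry."""
    raw_return: float
    benchmark_return: float
    excess_return: float


def window_return(closes, entry_idx, window):
    """Close-to-close return from entry_idx over `window` bars (clamped to the data)."""
    exit_idx = min(entry_idx + window, len(closes) - 1)
    return closes[exit_idx] / closes[entry_idx] - 1.0


def resolve_outcome(asset_closes, bench_closes, entry_idx, window):
    """Raw, benchmark, and excess return for one decision."""
    raw = window_return(asset_closes, entry_idx, window)
    bench = window_return(bench_closes, entry_idx, window)
    return OutcomeSketch(raw, bench, raw - bench)


asset = [100.0, 102.0, 105.0, 110.0]
benchmark = [400.0, 402.0, 404.0, 408.0]
outcome = resolve_outcome(asset, benchmark, entry_idx=0, window=3)
print(outcome)
```

Excess return is the figure the reflection LLM would most naturally reason about, since it separates the decision's contribution from the market's move over the same window.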
## YAMLs - [configs/agents/analysis_step.yaml](../configs/agents/analysis_step.yaml) - [configs/agents/analysis_run.yaml](../configs/agents/analysis_run.yaml) - [configs/agents/analysis_portfolio.yaml](../configs/agents/analysis_portfolio.yaml) # Analysis Flows Reference > Every flow in the registry. Each flow is identified by a namespaced name (`namespace.flow`), declares a Pydantic params model, and returns a `FlowResult` with `metrics` / `rows` / `chart` / optional `... # Analysis Flows Reference > Framework: [alphaswarm_docs/analysis-framework.md](../../concepts/strategy/analysis-framework.md) · UI: [alphaswarm_docs/analysis-lab.md](../../concepts/strategy/analysis-lab.md). Every flow in the registry. Each flow is identified by a namespaced name (`namespace.flow`), declares a Pydantic params model, and returns a `FlowResult` with `metrics` / `rows` / `chart` / optional `arrow_table` for Iceberg persistence. `GET /analysis/flows` lists every entry with the JSON-schema body derived from the params model — the lab UI auto-renders forms from this surface. ## profiling.\* | Name | Label | Notes | |---|---|---| | `profiling.describe` | Column profile | Wraps `alphaswarm.data.profiling.compute_profile` | | `profiling.dtypes` | Dtypes | Per-column dtype + memory footprint | | `profiling.null_audit` | Null audit | Null counts + null fractions | | `profiling.topk` | Top-K values | Most-frequent values + share | ## distribution.\* | Name | Label | Notes | |---|---|---| | `distribution.descriptive_stats` | Descriptive stats | Mean / median / std / skew / kurt / IQR / MAD / quantiles | | `distribution.histogram` | Histogram | Equal-width bins + Plotly chart | | `distribution.ecdf` | Empirical CDF | Sorted-value ECDF (down-sampled to `max_points`) | | `distribution.qq_plot_points` | Q-Q plot points | Slope/intercept fit vs. 
norm/t/uniform/expon | | `distribution.shapiro_wilk` | Shapiro-Wilk | Normality test (capped at 5000 samples) | | `distribution.jarque_bera` | Jarque-Bera | Skew + kurt goodness-of-fit | | `distribution.kolmogorov_smirnov` | K-S | One-sample vs reference dist (norm / t / uniform / expon / lognorm) | ## outlier.\* | Name | Label | Notes | |---|---|---| | `outlier.zscore` | Z-score | Robust (median/MAD) or classical | | `outlier.iqr` | IQR fences | Tukey ``[Q1 - kIQR, Q3 + kIQR]`` | | `outlier.iforest` | Isolation Forest | sklearn | | `outlier.dbscan` | DBSCAN | Density-based; `-1` is noise | | `outlier.lof` | LOF | sklearn LocalOutlierFactor | | `outlier.ecod` | ECOD | PyOD; falls back to z-score | | `outlier.pulse_vs_step` | Pulse vs Step | Distinguish transient pulses from level shifts | ## imputation.\* | Name | Label | Notes | |---|---|---| | `imputation.ffill_bfill` | Forward / backward fill | Default `ffill_then_bfill` | | `imputation.linear` | Linear interpolation | pandas `axis=0` | | `imputation.spline` | Cubic spline | pandas spline (order configurable) | | `imputation.knn` | KNN imputer | sklearn `KNNImputer` | | `imputation.mice` | MICE (IterativeImputer) | sklearn `IterativeImputer` | ## regression.\* | Name | Label | Notes | |---|---|---| | `regression.ols_diagnostics` | OLS diagnostics | Coefs + SE + t / p + Durbin-Watson + AIC / BIC | | `regression.white_test` | White's test | Heteroskedasticity (general form) | | `regression.breusch_pagan` | Breusch-Pagan | Heteroskedasticity vs regressors | | `regression.vif` | VIF | Variance Inflation Factors per regressor | ## time_series.\* | Name | Label | Notes | |---|---|---| | `time_series.stl` | STL decomposition | Trend / seasonal / residual | | `time_series.adf` | Augmented Dickey-Fuller | H0 = unit root | | `time_series.kpss` | KPSS | H0 = stationary (ADF complement) | | `time_series.acf_pacf` | ACF / PACF | Auto- and partial-autocorrelation series | | `time_series.garch` | GARCH(p, q) | Volatility model 
+ horizon variance forecast | | `time_series.change_point` | Change-point | ruptures.KernelCPD with rbf kernel | | `time_series.granger_causality` | Granger causality | Up to `max_lag` | | `time_series.cointegration` | Engle-Granger | Pair cointegration | | `time_series.spectral_fft` | Spectral (FFT) | Real FFT magnitude + power spectrum | | `time_series.spectral_wavelet` | Continuous wavelet transform | PyWavelets (optional) | | `time_series.hurst_exponent` | Hurst exponent | Long-range dependence | | `time_series.theil_sen` | Theil-Sen slope | Robust median-of-pairwise-slopes | ## derivatives.\* | Name | Label | Notes | |---|---|---| | `derivatives.bsm` | Black-Scholes-Merton | Closed-form European price + Greeks | | `derivatives.greeks_surface` | Greeks surface | Δ/Γ/ν/Θ/ρ across strikes × expiries | | `derivatives.implied_volatility` | Implied volatility (Brent) | Recover σ from a market quote | | `derivatives.monte_carlo_european` | MC European option | Vectorised GBM; opt-in CUDA via cupy | | `derivatives.monte_carlo_barrier` | MC barrier option | Knock-in / knock-out variants | | `derivatives.monte_carlo_asian` | MC Asian option | Arithmetic / geometric averaging | | `derivatives.sabr_smile` | SABR smile (Hagan) | Hagan-Kumar-Lesniewski-Woodward 2002 | | `derivatives.bachelier` | Bachelier (normal model) | Wraps `alphaswarm.options.normal_model` | ## portfolio.\* | Name | Label | Notes | |---|---|---| | `portfolio.markowitz_efficient_frontier` | Efficient frontier | cvxpy if available, numpy-only fallback | | `portfolio.ledoit_wolf_shrinkage` | Ledoit-Wolf covariance | Stabilised covariance matrix | | `portfolio.fama_french_5_rolling` | FF5 rolling betas | Rolling-window OLS on Mkt-RF / SMB / HML / RMW / CMA | | `portfolio.risk_parity` | Risk parity | Equal-risk-contribution weights (Spinu 2013) | ## factors.\* | Name | Label | Notes | |---|---|---| | `factors.evaluate` | Factor evaluation | Wraps `alphaswarm.data.factors.evaluate_factor` (IC + quantile 
spread + turnover) |

## microstructure.\*

| Name | Label | Notes |
|---|---|---|
| `microstructure.realised_volatility` | Realised volatility (OHLC) | Close-to-close / Parkinson / GK / RS / YZ |
| `microstructure.order_book_imbalance` | Order-book imbalance | Top-of-book |
| `microstructure.vpin` | VPIN | Wraps `alphaswarm.data.microstructure.vpin` |

## Optional dependencies

Flows tag their optional deps (`optional_dependencies` field on the descriptor). Missing extras raise a friendly `RuntimeError("install extra X")` instead of crashing the catalog.

| Dep | Used by |
|---|---|
| `scikit-learn` | `outlier.{iforest,dbscan,lof}`, `imputation.{knn,mice}`, `portfolio.ledoit_wolf_shrinkage` |
| `statsmodels` | `regression.*`, `time_series.{adf,kpss,acf_pacf,granger_causality,cointegration,stl}` |
| `arch` | `time_series.garch` |
| `ruptures` | `time_series.change_point` |
| `pywavelets` | `time_series.spectral_wavelet` |
| `pyod` | `outlier.ecod` (falls back to z-score) |
| `cvxpy` | `portfolio.markowitz_efficient_frontier` (falls back to numpy projection) |
| `cupy` | `derivatives.monte_carlo_*` (opt-in GPU acceleration) |

# Analysis Framework

> The analysis layer is AlphaSwarm's hash-locked, runtime-driven umbrella for every "explore a dataset" workflow: distribution audits, time-series diagnostics, derivatives pricing, portfolio optimisation, reg...

# Analysis Framework

> Doc map: [alphaswarm_docs/index.md](../../intro/index.md) · Lab guide: [alphaswarm_docs/analysis-lab.md](../../concepts/strategy/analysis-lab.md) · Flow reference: [alphaswarm_docs/analysis-flows.md](../../concepts/strategy/analysis-flows.md).

The analysis layer is AlphaSwarm's hash-locked, runtime-driven umbrella for every "explore a dataset" workflow: distribution audits, time-series diagnostics, derivatives pricing, portfolio optimisation, regression diagnostics, outlier / imputation work, and Alphalens-style factor evaluation.
It is the **statistical / quantitative-analysis** counterpart of the **agentic-interpretation** layer in [alphaswarm_docs/analysis-agents.md](../../concepts/strategy/analysis-agents.md). The two namespaces are deliberately distinct. ## Why a new umbrella Most primitives existed already (`alphaswarm.ml.flows`, `alphaswarm.data.factors`, `alphaswarm.data.realised_volatility`, `alphaswarm.data.microstructure`, `alphaswarm.options.normal_model`, `alphaswarm.data.profiling.profiler`) but had no single contract for: - registering a flow with a JSON-schema-driven param model; - composing multiple flows into a reproducible pipeline; - snapshotting the spec into an immutable, hash-locked version row; - writing every step's gold-tier output to Iceberg (`alphaswarm_gold_analysis_`) under medallion validation; - emitting the same progress payload shape Celery + WebSocket consumers already understand. The umbrella plugs every primitive into one canvas + one ledger. ## Layout ``` alphaswarm/analysis/ base.py — FlowParams / FlowResult / FlowDescriptor / FlowContext spec.py — AnalysisSpec / AnalysisStep / FlowRef / DatasetRef registry.py — @register_analysis_flow + persist_spec + add_spec runtime.py — AnalysisRuntime (sole sanctioned executor) pricing.py — closed-form + MC math primitives (BSM, Greeks, GBM, SABR) flows/ profiling.py / distribution.py / outlier.py / imputation.py / regression.py / time_series.py / derivatives.py / portfolio.py / factors.py / microstructure.py ``` ```mermaid flowchart LR subgraph Backend Spec[AnalysisSpec] --> Runtime[AnalysisRuntime] Runtime --> Registry["FlowRegistry@register_analysis_flow"] Registry --> Flows["flows/distribution / derivatives /portfolio / time_series / regression /outlier / imputation / profiling /factors / microstructure"] end subgraph Persistence SpecRow[("analysis_specs")] VerRow[("analysis_spec_versionsimmutable")] Run[("analysis_runs ledger")] Step[("analysis_step_results")] Iceberg[("alphaswarm_gold_analysis_")] end Runtime 
-->|persist_spec| SpecRow Runtime -->|snapshot| VerRow Runtime --> Run Run --> Step Runtime -->|"append_arrow medallion=gold"| Iceberg subgraph API FlowAPI["/analysis/flows"] SpecAPI["/analysis/specs"] RunAPI["/analysis/runs"] end Runtime --- API API --- LabUI["/analysis/lab\n(hybrid: tabbed + canvas)"] ``` ## AnalysisSpec contract Every spec is a Pydantic model that hashes its canonical JSON form (SHA-256, sorted keys, no whitespace). Two specs with identical fields collapse to one `analysis_spec_versions` row; any edit creates a new version automatically. ```yaml name: spy-distribution-audit slug: spy-distribution-audit kind: research description: Distribution + GARCH + outlier audit for SPY daily bars. dataset: iceberg_identifier: alphaswarm_silver_alpha_vantage.equities_daily filters: vt_symbol: SPY.NYSE limit: 5000 steps: - alias: profile flow_ref: flow: profiling.describe params: {} - alias: returns_dist flow_ref: flow: distribution.descriptive_stats params: { column: log_return } - alias: shapiro flow_ref: flow: distribution.shapiro_wilk params: { column: log_return } - alias: garch flow_ref: flow: time_series.garch params: { column: log_return, p: 1, q: 1, horizon: 10 } medallion_layer: gold business_metadata: data_owner: research-team semantic_definition: "SPY daily distribution + volatility audit" domain: research.distribution_audit sla_class: tier-3-eod ``` ## Hard rules These hold across every analysis flow / spec / run. Any PR that violates one will be sent back. 1. **Every analysis run goes through `AnalysisRuntime`.** REST + Celery tasks (`alphaswarm.tasks.analysis_flow_tasks`) wrap it; flow code never writes to Iceberg / Postgres directly. 2. **`analysis_spec_versions` rows are immutable.** Re-snapshotting via `alphaswarm.analysis.registry.persist_spec` creates a new version row when the SHA-256 hash changes — never update an existing row in place. 3. 
**Every per-step Iceberg write uses `iceberg_catalog.append_arrow` with `medallion_layer="gold"` and a `BusinessMetadata` block.** The default namespace is `alphaswarm_gold_analysis_`; flows can override via `output_namespace=` on `register_analysis_flow`. 4. **Flows never call `litellm.completion` / `OllamaClient` directly.** v1 ships zero LLM-routed flows by design — interpretation is owned by the analysis-AGENTS stack ([alphaswarm_docs/analysis-agents.md](../../concepts/strategy/analysis-agents.md)). 5. **Optional dependencies are guarded.** Flows that need `cvxpy`, `pyod`, `pywavelets`, `cupy`, etc. raise a friendly `RuntimeError` with the install hint when the import fails. 6. **No new diagram formats.** Mermaid only. ## REST surface | Method | Path | Purpose | |---|---|---| | `GET` | `/analysis/flows` | List flows + JSON-schema-derived param forms | | `GET` | `/analysis/flows/{flow}` | Single flow detail | | `POST` | `/analysis/flows/{flow}/preview` | Sync preview against an inline payload | | `POST` | `/analysis/flows/{flow}/preview-task` | Async preview via Celery (`agents` queue) | | `GET` | `/analysis/specs` | List saved specs | | `POST` | `/analysis/specs` | Persist a new spec (idempotent on hash) | | `GET` | `/analysis/specs/{slug}` | Current spec + version history | | `POST` | `/analysis/specs/{slug}/run` | Kick `AnalysisRuntime.run` via Celery | | `GET` | `/analysis/runs` | Paged ledger of runs | | `GET` | `/analysis/runs/{id}` | Run detail with joined step results | | `GET` | `/analysis/runs/{id}/results/{step}` | DuckDB-driven preview of one step's gold-tier output | | `GET` | `/analysis/datasets/columns?identifier=ns.name` | Column / dtype list for the lab forms | ## Persistence schema Migration `0031_analysis_layer` adds four project-scoped tables: | Table | Purpose | |---|---| | `analysis_specs` | Logical row (latest active version per slug) | | `analysis_spec_versions` | Immutable hash-locked snapshot | | `analysis_runs` | One row per 
`AnalysisRuntime.run()` invocation | | `analysis_step_results` | One row per `AnalysisStep` in the spec | `AnalysisRun.iceberg_result_table` is set when a step persists arrow data; `AnalysisStepResult.artifact_uri` records the per-step `namespace.name` so the lab can fetch the gold-tier output via DuckDB. ## Adding a new flow 1. Subclass `FlowParams` for the per-flow parameter shape. 2. Decorate a `(df, params, ctx) -> FlowResult` function with `@register_analysis_flow(name, namespace, label, ...)`. 3. (optional) Stash a `pyarrow.Table` on `result.arrow_table` to persist it under `alphaswarm_gold_analysis_` when run inside a spec. 4. Add a smoke test under `tests/analysis/`. 5. Update the relevant tab in [alphaswarm_docs/analysis-flows.md](../../concepts/strategy/analysis-flows.md). ## Don't list - Don't bypass `AnalysisRuntime` for spec execution — every progress / ledger / Iceberg / step-result side-effect is wired through it. - Don't write to a non-`alphaswarm_gold_analysis_*` namespace from a flow. - Don't duplicate logic that already lives in `alphaswarm.data.factors` / `alphaswarm.data.microstructure` / `alphaswarm.options.normal_model` — wrap them as a flow and keep the math in one place. - Don't add diagrams in non-Mermaid formats. - Don't put LLM-driven interpretation in a flow; that lives in `alphaswarm_agents.analysis.*`. # Analysis Lab — interactive analysis builder > Lives at `/analysis/lab` in the AlphaSwarm webui (Vite frontend). Hybrid surface: dataset-centric tabbed drill-down (primary path) plus an XYFlow Composer (secondary path) for multi-step pipelines # Analysis Lab — interactive analysis builder > Backend: [alphaswarm_docs/analysis-framework.md](../../concepts/strategy/analysis-framework.md) · Flow reference: [alphaswarm_docs/analysis-flows.md](../../concepts/strategy/analysis-flows.md). Lives at `/analysis/lab` in the AlphaSwarm webui (Vite frontend). 
Hybrid surface: dataset-centric tabbed drill-down (primary path) plus an XYFlow Composer (secondary path) for multi-step pipelines. ## Layout | Tab | Purpose | Driving flows | | --- | --- | --- | | **Profiling** | Column profile + null audit + topk + dtypes | `profiling.*` | | **Distribution** | Descriptive stats / histogram / ECDF / Q-Q + Shapiro-Wilk / Jarque-Bera / K-S | `distribution.*` | | **Outliers** | Z-score / IQR / Isolation Forest / DBSCAN / LOF / ECOD / pulse-vs-step | `outlier.*` | | **Time Series** | ADF / KPSS / ACF-PACF / STL / GARCH / change-point / Granger / cointegration / FFT / wavelets / Hurst / Theil-Sen | `time_series.*` | | **Regression** | OLS diagnostics / White / Breusch-Pagan / VIF | `regression.*` | | **Imputation** | ffill/bfill / linear / spline / KNN / MICE | `imputation.*` | | **Derivatives** | BSM + Greeks surface + IV / Monte-Carlo European / barrier / Asian / SABR smile / Bachelier | `derivatives.*` | | **Portfolio** | Efficient frontier / Ledoit-Wolf / Fama-French 5 rolling / risk parity | `portfolio.*` | | **Factors** | Alphalens-style IC + quantile spread + turnover | `factors.evaluate` | | **Composer** | XYFlow canvas — drag analysis nodes, save spec, run via runtime | every namespace | Each tab loads the relevant flow schemas via `GET /analysis/flows`, auto-generates the form, and submits to `POST /analysis/flows/{flow}/preview`. Charts render inline (Plotly figure-dict in the response). The "Save as spec" button on any tab promotes the current state into an `AnalysisSpec` and routes to the Composer for multi-step editing without losing context. 
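
The "auto-generates the form" step can be pictured as a walk over the JSON schema that `GET /analysis/flows` returns for each flow. The sketch below is written in Python for consistency with the backend docs (the lab itself is a frontend component), and it assumes a standard Pydantic-style schema with `properties` and `required` keys. `form_fields` and `missing_required` are hypothetical helper names, not part of the AlphaSwarm codebase.

```python
from typing import Any


def form_fields(schema: dict[str, Any]) -> list[dict[str, Any]]:
    """Turn a JSON-schema object into a flat list of form-field descriptors."""
    required = set(schema.get("required", []))
    fields = []
    for name, prop in schema.get("properties", {}).items():
        fields.append(
            {
                "name": name,
                "label": prop.get("title", name),  # fall back to the raw key
                "type": prop.get("type", "string"),
                "default": prop.get("default"),
                "required": name in required,
            }
        )
    return fields


def missing_required(
    fields: list[dict[str, Any]], values: dict[str, Any]
) -> list[str]:
    """Names of required fields with neither a submitted value nor a default."""
    return [
        f["name"]
        for f in fields
        if f["required"] and f["name"] not in values and f["default"] is None
    ]
```

A form built this way only needs to block submission when `missing_required` is non-empty; everything else can be left to server-side validation by the params model.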
## Routes | Path | Component | | --- | --- | | `/analysis/lab` | Tabbed primary surface | | `/analysis/lab/composer` | XYFlow Composer (XYFlow canvas + ANALYSIS_PALETTE) | | `/analysis/runs` | Run ledger (paged) | | `/analysis/runs/[id]` | Run detail (steps + chart previews) | The Composer reuses the existing [`WorkflowEditor`](../alphaswarm_client/src/components/flow/WorkflowEditor.tsx) with `domain="analysis"`. The serializer turns the canvas graph into an `AnalysisSpec` payload, posts it to `POST /analysis/specs`, then to `POST /analysis/specs/{slug}/run`. ## API surface used - `GET /analysis/flows` — flow catalog with JSON-schema params. - `POST /analysis/flows/{flow}/preview` — sync preview. - `POST /analysis/flows/{flow}/preview-task` — Celery preview. - `POST /analysis/specs` — persist (hash-idempotent). - `POST /analysis/specs/{slug}/run` — queue `AnalysisRuntime.run` task. - `GET /analysis/runs` / `GET /analysis/runs/{id}` — ledger. - `GET /analysis/runs/{id}/results/{step}` — DuckDB preview of the gold-tier output for one step. - `GET /analysis/datasets/columns?identifier=ns.name` — column list used by the lab's column-autocomplete inputs. ## Cross-links The lab does not reinvent existing surfaces — it deep-links into them when the user wants a richer experience: - Derivatives tab → [`/options/lab`](../alphaswarm_client/src/routes/options/lab/page.tsx) for instrument-level workflows. - Portfolio tab → [`/optimizer`](../alphaswarm_client/src/routes/optimizer/page.tsx) for multi-strategy parameter sweeps. - Factors tab wraps the existing [`FactorWorkbench`](../alphaswarm_client/src/components/factors/FactorWorkbench.tsx). - Visualisations of Iceberg outputs deep-link into [`/visualizations`](../alphaswarm_client/src/routes/visualizations/page.tsx). # Backtest engines > AlphaSwarm ships seven interchangeable backtest engines behind a single BaseBacktestEngine ABC. Three tiers: primary vectorised, event-driven for agent-in-the-loop, and a fallback cascade. 
# Backtest engines > Doc map: [intro](../../intro/index.md) · > vbt-pro deep dive: [vbtpro-integration](./vbtpro-integration.md) · > LOB / tick-replay: [hft-backtest](./hft-backtest.md) · > Class hierarchy: [class-diagram](../platform/class-diagram.md) · > Worked tutorial: [tutorials/first-backtest](../../tutorials/first-backtest.md) · > Recipe: [how-to/recipes/run-a-backtest-from-yaml](../../how-to/recipes/run-a-backtest-from-yaml.md). AlphaSwarm runs every backtest through one of seven interchangeable engines behind the [`BaseBacktestEngine`](https://github.com/julianwileymac/alphaswarm/blob/main/alphaswarm/backtest/base.py) ABC. The runner, persistence, MLflow tracking, and UI never branch on which engine produced a run — every engine returns the same [`BacktestResult`](https://github.com/julianwileymac/alphaswarm/blob/main/alphaswarm/backtest/result.py). The seven engines fall into **three tiers** so you can pick one without scanning a 7-row table every time: ```mermaid flowchart TB Strategy["IStrategy / FrameworkAlgorithm"] --> Runner["alphaswarm.backtest.runner.run_backtest_from_config"] Runner --> Primary Runner --> Loop Runner --> Cascade subgraph Primary [Tier 1: vectorised primary] Vbtpro["VectorbtProEngine (5 modes)"] end subgraph Loop [Tier 2: per-bar Python loop] Event["EventDrivenBacktester (agent dispatch)"] Hft["LobBacktestEngine (hftbacktest LOB)"] end subgraph Cascade [Tier 3: fallback cascade] FallbackEngine["FallbackBacktestEngine"] Vbt["VectorbtEngine (OSS)"] Bt["BacktestingPyEngine"] Zvt["ZvtBacktestEngine"] Aat["AatBacktestEngine"] end FallbackEngine --> Vbtpro FallbackEngine -.fallback.-> Event FallbackEngine -.fallback.-> Vbt FallbackEngine -.fallback.-> Bt FallbackEngine -.fallback.-> Zvt FallbackEngine -.fallback.-> Aat ``` ## Tier 1 — Vectorised primary (`VectorbtProEngine`) Default for research workloads, parameter screens, walk-forward optimisation, factor studies, and any backtest that does not need per-bar Python. 
Five constructor modes select the inner vbt-pro path: - `signals` — array-based entries / exits / sizing - `orders` — column-of-orders DataFrame - `optimizer` — built-in vbt-pro `Param` sweeps - `holding` — buy-and-hold baseline - `random` — random-signal baseline Implementation: [alphaswarm/backtest/vbtpro/engine.py::VectorbtProEngine](https://github.com/julianwileymac/alphaswarm/blob/main/alphaswarm/backtest/vbtpro/engine.py). Full mode dispatch + Numba-jit constraints in [vbtpro-integration](./vbtpro-integration.md). ## Tier 2 — Per-bar Python loop Two engines run a true Python `on_bar` callback. Use them when you need synchronous decisions inside the inner loop — agent dispatch, event-sourced LOB replay, custom callbacks vbt-pro can't represent. - [`EventDrivenBacktester`](https://github.com/julianwileymac/alphaswarm/blob/main/alphaswarm/backtest/event_driven.py) — the only engine that exposes `context['agents']` to strategies via [`AgentDispatcher`](https://github.com/julianwileymac/alphaswarm/blob/main/alphaswarm/strategies/agentic/agent_dispatcher.py), with TTL + LRU dedup of LLM calls. - [`LobBacktestEngine`](https://github.com/julianwileymac/alphaswarm/blob/main/alphaswarm/backtest/hft.py) — hftbacktest-driven LOB tick replay; latency + queue models; market-making + execution strategies. ## Tier 3 — Fallback cascade [`FallbackBacktestEngine`](https://github.com/julianwileymac/alphaswarm/blob/main/alphaswarm/backtest/fallback.py) tries `primary` first, then walks `fallbacks` until one returns a `BacktestResult`. The OSS engines exist mainly as cascade fallbacks and for license-constrained deployments: - `VectorbtEngine` — OSS vectorbt; signals only (Apache-2.0). - `BacktestingPyEngine` — single-symbol with `.optimize(...)` grid + SAMBO (AGPL-3.0). - `ZvtBacktestEngine` — permissive-licence CN-bar fallback (MIT). - `AatBacktestEngine` — async / synthetic LOB fallback (Apache-2.0). NautilusTrader is **not** wired in (LGPL-3.0; out of scope). 
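
The cascade behaviour ("try `primary`, then walk `fallbacks` until one returns a result") can be sketched as follows. This is an illustration only: `run_cascade` and the `(name, callable)` pairs are hypothetical, and treating any raised exception as "move to the next engine" is an assumption about how `FallbackBacktestEngine` decides to fall through.

```python
from typing import Any, Callable, Sequence

# Hypothetical engine signature for this sketch: config dict in, result out.
EngineFn = Callable[[dict], Any]
NamedEngine = tuple[str, EngineFn]


def run_cascade(
    primary: NamedEngine, fallbacks: Sequence[NamedEngine], config: dict
) -> tuple[str, Any]:
    """Return (engine_name, result) from the first engine that succeeds."""
    attempts: list[str] = []
    for name, engine in (primary, *fallbacks):
        try:
            result = engine(config)
        except Exception as exc:  # assumption: any failure falls through
            attempts.append(f"{name}: {exc!r}")
            continue
        if result is not None:
            return name, result
        attempts.append(f"{name}: returned no result")
    raise RuntimeError("all engines failed: " + "; ".join(attempts))
```

Returning the winning engine's name alongside the result mirrors the idea of stamping `engine` into the run's metrics, so downstream consumers can tell which tier actually produced the numbers.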
## EngineCapabilities Every engine declares its surface via [`EngineCapabilities`](https://github.com/julianwileymac/alphaswarm/blob/main/alphaswarm/backtest/capabilities.py) on the class attribute. Agents introspect via the `engine_capabilities` tool; humans can call `alphaswarm.backtest.engine_capabilities_index()`. ```mermaid flowchart LR subgraph caps [EngineCapabilities flags] signals orders callbacks multiAsset[multi-asset] shorts leverage lob asyncFlag[async] perBar[per-bar Python] optimizer wfo[walk-forward] agentDispatch[agent dispatch] rlInjection[supports_rl_injection] end Vbtpro["VectorbtProEngine"] -. signals, orders, callbacks, multi-asset, shorts, leverage, optimizer, walk-forward, rl-injection .-> caps Event["EventDrivenBacktester"] -. signals, orders, callbacks, multi-asset, shorts, per-bar Python, agent dispatch, walk-forward, rl-injection .-> caps Hft["LobBacktestEngine"] -. lob, async, per-bar Python, multi-asset, shorts, agent dispatch .-> caps Bt["BacktestingPyEngine"] -. signals, shorts, leverage .-> caps Zvt["ZvtBacktestEngine"] -. signals, multi-asset, per-bar Python .-> caps Aat["AatBacktestEngine"] -. signals, orders, multi-asset, shorts, lob, async, per-bar Python .-> caps Vbt["VectorbtEngine"] -. signals, multi-asset, shorts .-> caps ``` Pick by capability: - **Vectorised research / parameter screens / WFO** → `VectorbtProEngine` - **Per-bar agent dispatch (LLM in the loop)** → `EventDrivenBacktester` - **LOB tick replay, latency + queue modelling** → `LobBacktestEngine` - **Synthetic LOB realism (OSS path)** → `AatBacktestEngine` - **Chinese-market data** → `ZvtBacktestEngine` - **Single-symbol grid optimisation** → `BacktestingPyEngine` with `.optimize(ranges, method="grid"|"sambo", ...)` ## When NOT to use the primary engine The vbt-pro inner loop is Numba-jit compiled — `signal_func_nb` / `order_func_nb` cannot call Python objects per bar. Two patterns this rules out: 1. 
**Per-bar agent consults.** Switch to `EventDrivenBacktester` and call `context['agents'].consult(spec_name, inputs, ttl=...)` from inside `on_bar`. The [`AgentDispatcher`](https://github.com/julianwileymac/alphaswarm/blob/main/alphaswarm/strategies/agentic/agent_dispatcher.py) handles TTL + LRU dedup so the LLM gateway is not hammered. 2. **Per-bar custom Python that vbt-pro cannot express.** If the inner loop needs a stateful Python object (custom risk model, bespoke order book heuristics), use event-driven. If you _can_ vectorise — or precompute a panel of decisions ahead of time — use vbt-pro `AgenticVbtAlpha` in precompute mode. The `vectorbtpro` mode dispatch lives in [vbtpro-integration](./vbtpro-integration.md). ## Dispatching from YAML Three equivalent ways to pick an engine inside a strategy recipe: ```yaml # 1) Engine shortcut (cleanest). backtest: engine: vbt-pro:signals # or vbt-pro:orders / :optimizer / :holding / :random kwargs: initial_cash: 100000 fees: 0.0005 # 2) Explicit class + module. backtest: class: VectorbtProEngine module_path: alphaswarm.backtest.vbtpro.engine kwargs: mode: orders initial_cash: 100000 # 3) Fallback cascade. backtest: engine: fallback primary: vbt-pro fallbacks: [event, aat, zvt, vectorbt] ``` | Shortcut | Resolves to | Notes | | --- | --- | --- | | `default` / `event` / `event-driven` | `EventDrivenBacktester` | Backward-compatible default. | | `primary` / `vbt-pro` / `vectorbt-pro` | `VectorbtProEngine` | Tier 1. | | `vbt-pro:signals` / `:orders` / `:optimizer` / `:holding` / `:random` | `VectorbtProEngine` | Mode injection. | | `vectorbt` / `vbt` | `VectorbtEngine` | OSS fallback. | | `backtesting` / `bt` | `BacktestingPyEngine` | Single-symbol. | | `zvt` | `ZvtBacktestEngine` | Lazy import; CN bars. | | `aat` | `AatBacktestEngine` | Lazy import; async LOB. | | `hft` / `lob` | `LobBacktestEngine` | Tick replay. 
| | `fallback` / `cascade` | `FallbackBacktestEngine` | Cascade with `DEFAULT_FALLBACK_CHAIN = ("event", "aat", "zvt", "vectorbt")`. | [`alphaswarm.backtest.runner.run_backtest_from_config`](https://github.com/julianwileymac/alphaswarm/blob/main/alphaswarm/backtest/runner.py) routes every YAML through the right engine and stamps `engine` into `BacktestRun.metrics`. ## Agent + ML components Strategies plug agents and ML models into either path: - **Vectorised (vbt-pro)** — panel components in [alphaswarm/strategies/vbtpro/](https://github.com/julianwileymac/alphaswarm/tree/main/alphaswarm/strategies/vbtpro): - `AgenticVbtAlpha` — precompute or per-window agent dispatch into wide entries / exits / size arrays. - `MLVbtAlpha` — wraps any `alphaswarm_models.base.Model` (or MLflow URI) and emits arrays via threshold / top-k / rank policies. - `AgenticOrderModel` — drives `Portfolio.from_orders` from cached agent decisions. - **Event-driven** — `context['agents']` exposes `AgentDispatcher`. See [`AgentAwareMomentumAlpha`](https://github.com/julianwileymac/alphaswarm/blob/main/alphaswarm/strategies/agentic/agent_aware_alpha.py) for a worked example. For RL injection, every engine that declares `EngineCapabilities.supports_rl_injection=True` accepts the [`WeightCentricPipeline`](https://github.com/julianwileymac/alphaswarm/blob/main/alphaswarm_rl/src/alphaswarm_rl/portfolio/pipeline.py) output through `context['rl_agent']` (AGENTS rule 38). ## Unified result shape Every engine returns a `BacktestResult` with: - `equity_curve: pd.Series` indexed by timestamp. - `trades: pd.DataFrame` with `timestamp, vt_symbol, side, quantity, price, commission, slippage, strategy_id`. - `orders: pd.DataFrame`. - `summary: dict` — `sharpe`, `sortino`, `max_drawdown`, `calmar`, `total_return`, `final_equity`, `n_bars`, `volatility_ann`, `n_trades`, `turnover`, `engine`. 
Engine-specific keys live under `vbt_*`, `bt_*`, `zvt_*`, `aat_*`, `hft_*` so downstream code can light up native stats without re-running. ## Hash-locked specs + audit ledger Every dispatched backtest writes a row to [`backtest_runs`](https://github.com/julianwileymac/alphaswarm/blob/main/alphaswarm/persistence/models.py) with `experiment_id` (AGENTS rule 34) and a reference to the hash-locked [`StrategySpec`](https://github.com/julianwileymac/alphaswarm/tree/main/alphaswarm/strategies) version. The same spec hash returns the same `*_spec_versions` row on re-dispatch; content changes always create a new version. This makes every backtest replayable. Gold-tier output lands at `alphaswarm_gold_backtests.run_` via [`iceberg_catalog.append_arrow`](https://github.com/julianwileymac/alphaswarm/blob/main/alphaswarm/data/iceberg_catalog.py) with `medallion_layer="gold"` (AGENTS rule 3, rule 21). ## Worked example: dispatch + tearsheet Goal: dispatch a backtest, tail its WebSocket frames, list the ledger row via DataMCP, render an equity curve in your browser. ### Step 1 — dispatch ### Step 2 — tail the WebSocket ```bash curl -N http://localhost:8000/chat/stream/ ``` Frames arrive in the canonical `{task_id, stage, message, timestamp, **extras}` envelope (AGENTS rule 4). Expected stages: `start` → `bar.processed` (×N) → `metrics.computed` → `done`. ### Step 3 — list via DataMCP The `data.backtests.list` tool is the agent-safe alternative to a raw `SELECT * FROM backtest_runs`. From any MCP client: ```bash curl -X POST http://localhost:8000/mcp/data/tools/data.backtests.list/invoke \ -H "Content-Type: application/json" \ -H "Authorization: Bearer $(alphaswarm-cli auth token)" \ -d '{"limit": 5, "order_by": "started_at_desc"}' ``` ### Step 4 — equity curve in Pyodide Render the equity curve client-side from inline sample points so the snippet stays self-contained. Replace with a fetch to `/analytics/portfolio//equity-curve.json` when running against the real platform. 
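
A minimal stand-in for the client-side rendering step, assuming only the Python standard library so it runs unchanged in Pyodide or CPython. The sample points are invented for illustration, and `equity_curve_svg` / `max_drawdown` are hypothetical helpers, not platform APIs.

```python
from typing import Sequence

# Inline sample points (illustrative values, not real backtest output).
SAMPLE_EQUITY = [100_000.0, 101_500.0, 99_800.0, 103_200.0, 102_100.0, 105_400.0]


def max_drawdown(points: Sequence[float]) -> float:
    """Worst peak-to-trough decline as a negative fraction (0.0 if none)."""
    peak = points[0]
    worst = 0.0
    for value in points:
        peak = max(peak, value)
        worst = min(worst, value / peak - 1.0)
    return worst


def equity_curve_svg(
    points: Sequence[float], width: int = 400, height: int = 120, pad: int = 8
) -> str:
    """Render the curve as a standalone SVG polyline string."""
    if len(points) < 2:
        raise ValueError("need at least two points to draw a curve")
    lo, hi = min(points), max(points)
    span = (hi - lo) or 1.0  # avoid division by zero on a flat curve
    step = (width - 2 * pad) / (len(points) - 1)
    coords = []
    for i, value in enumerate(points):
        x = pad + i * step
        # SVG y grows downward, so higher equity maps to smaller y.
        y = height - pad - (value - lo) / span * (height - 2 * pad)
        coords.append(f"{x:.1f},{y:.1f}")
    polyline = " ".join(coords)
    return (
        f'<svg xmlns="http://www.w3.org/2000/svg" width="{width}" '
        f'height="{height}" viewBox="0 0 {width} {height}">'
        f'<polyline fill="none" stroke="currentColor" stroke-width="1.5" '
        f'points="{polyline}"/></svg>'
    )
```

In a Pyodide page the returned string can be injected into the DOM (for example through the `js` module's `document` proxy); outside the browser it can simply be written to an `.svg` file.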
### Step 5 — verify - `backtest_runs` row with non-NULL `sharpe`, `engine='VectorbtProEngine'`. - WebSocket emitted a `stage=done` frame with the matching `run_id`. - `alphaswarm_gold_backtests.run_` Iceberg table exists. - `data.backtests.describe { run_id }` MCP call returns the full row. ### What next - Run the full tutorial: [tutorials/first-backtest](../../tutorials/first-backtest.md). - Make it repeatable: [how-to/recipes/run-a-backtest-from-yaml](../../how-to/recipes/run-a-backtest-from-yaml.md). - Add a new strategy: [how-to/recipes/add-a-strategy](../../how-to/recipes/add-a-strategy.md). - Promote to paper: [how-to/recipes/promote-a-bot-to-paper](../../how-to/recipes/promote-a-bot-to-paper.md). ## Deeper reads - [vbtpro-integration](./vbtpro-integration.md) — vbt-pro mode dispatch, Numba constraints, hooks, walk-forward, `Param` sweeps, `IndicatorFactory` bridge. - [hft-backtest](./hft-backtest.md) — LOB engine, latency profiles, queue models, the five HFT strategies under `alphaswarm/strategies/hft/`. - [strategy-lifecycle](./strategy-lifecycle.md) — draft → backtested → paper → live. - [strategy-development](./strategy-development.md) — composer / simulation / ideation / single / batch / compare routes in the operator UI. - [factor-research](./factor-research.md) — building factor / alpha strategies. - [ml-alpha-backtest](./ml-alpha-backtest.md) — `AlphaBacktestExperiment` orchestrator + `MLAlphaBacktestRun` schema. - [class-diagram](../platform/class-diagram.md) — full engine class hierarchy + `BacktestResult` shape. - [reference/api](../../reference/api/index.mdx) — the `backtest` tag (interactive playground). - [reference/python](../../reference/python/index.mdx) — auto-generated reference for `alphaswarm.backtest` and `alphaswarm.strategies`. # Cross-market arbitrage > The platform ships two cross-market arbitrage paths: # Cross-market arbitrage > Status: **Phase 5 shipped**. 
Combined deliverable across Phase 1 (InstrumentADR / InstrumentGDR), Phase 4 ([`alphaswarm/math/arbitrage.py`](../alphaswarm/math/arbitrage.py)), and Phase 5 (DataMCP tools + strategy templates).

## Two flavours

The platform ships two cross-market arbitrage paths:

### A/H share -- mainland China ↔ Hong Kong

A Chinese company with dual-listed shares has A-shares quoted in CNY on the SSE / SZSE and H-shares quoted in HKD on HKEX. Same legal entity, same economic rights, different regulatory regime + liquidity + currency. The basis mean-reverts toward zero but periodically diverges sharply (Stock Connect inflow / outflow, regulatory news, FX volatility).

Math: [`ah_share_basis()`](../alphaswarm/math/arbitrage.py) in :mod:`alphaswarm.math.arbitrage`. It computes the FX-adjusted implied H-share price from the A-share, subtracts the observed H-share price, and classifies the arbitrage direction.

Agent surface: ``data.arbitrage.ah_share_basis`` (single-point) and the ``arbitrage.ah_share_basis`` AnalysisFlow (time series).

### ADR ↔ underlying foreign equity

A foreign company creates an American Depositary Receipt to list on a US venue. One ADR represents ``conversion_ratio`` shares of the underlying. The basis (ADR USD price minus the conversion-adjusted USD-equivalent price of the underlying) should be near zero plus the depositary fee; persistent divergence is the arbitrage.

Math: [`adr_basis()`](../alphaswarm/math/arbitrage.py) reads the ``conversion_ratio`` directly from the Phase 1 :class:`InstrumentADR` row (via the ``data.arbitrage.adr_underlying_basis`` MCP tool), then computes the basis exactly as in the A/H case.

Agent surface: ``data.arbitrage.adr_underlying_basis`` (single-point) and the ``arbitrage.adr_basis`` AnalysisFlow (time series).
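
The ADR basis arithmetic described above can be sketched in a few lines. This is an illustration under stated assumptions, not the signature of `adr_basis()` in `alphaswarm/math/arbitrage.py`: the function name `adr_basis_sketch`, its parameters, the `flat_band_bps` tolerance, and the direction labels are all hypothetical, and the quotes are made-up numbers. The sign convention (ADR price minus implied price) follows the prose definition.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BasisResult:
    implied_usd: float   # conversion-adjusted underlying price in USD
    basis_usd: float     # ADR price minus implied price
    basis_bps: float     # basis relative to the implied price, in basis points
    direction: str       # "adr_rich", "adr_cheap", or "flat" (illustrative labels)


def adr_basis_sketch(adr_price_usd, underlying_price_local, usd_per_local,
                     conversion_ratio, flat_band_bps=0.0):
    """Compare one ADR with conversion_ratio underlying shares converted to USD."""
    inputs = (adr_price_usd, underlying_price_local, usd_per_local, conversion_ratio)
    if min(inputs) <= 0:
        raise ValueError("prices, FX rate, and conversion ratio must be positive")
    implied = underlying_price_local * conversion_ratio * usd_per_local
    basis = adr_price_usd - implied
    bps = basis / implied * 1e4
    if bps > flat_band_bps:
        direction = "adr_rich"    # ADR trades above its implied value
    elif bps < -flat_band_bps:
        direction = "adr_cheap"   # ADR trades below its implied value
    else:
        direction = "flat"
    return BasisResult(implied, basis, bps, direction)


# Hypothetical quotes: one ADR = 8 underlying shares, underlying quoted in HKD.
result = adr_basis_sketch(
    adr_price_usd=82.50,
    underlying_price_local=80.00,
    usd_per_local=0.128,
    conversion_ratio=8,
)
print(result.direction, round(result.basis_bps, 1))
```

The sketch ignores the depositary fee and trading costs; a cost-adjusted threshold would be applied on top of `basis_bps` before acting on the signal.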
## Full pipeline (BABA example)

```mermaid
flowchart LR
    A[Agent query] -->|"what's BABA basis?"| ID[data.identity.resolve]
    ID -->|"instrument_id"| DR[data.instruments.depositary_receipts]
    DR -->|"conversion_ratio=8"| ADR[data.arbitrage.adr_underlying_basis]
    ADR -->|"basis_bps + direction"| Agent
    Agent -->|"if abs(basis) > threshold"| TM[Strategy template]
    TM -->|"adr_basis_arbitrage.yaml"| Bot[BotRuntime]
    Bot -->|"submit_list(oco)"| Broker
```

1. Agent resolves the BABA ticker to its current instrument_id at the ``as_of`` timestamp.
2. The depositary-receipts tool returns the ADR's ``conversion_ratio`` (8) and the underlying's vt_symbol (``9988.HKEX``).
3. The arbitrage tool computes the basis given current prices + FX.
4. If the basis exceeds the cost-adjusted threshold, the agent instantiates the strategy template ``configs/strategy_templates/adr_basis_arbitrage.yaml`` (a :class:`Resource` row with ``resource_type='strategy_template'``).
5. The bot runtime submits a two-leg OCO order list (long ADR + short underlying, or vice versa) through the Phase 2 contingency manager.
6. Exit: the contingency manager auto-cancels the peer when one leg fills; the strategy emits an explicit close when the basis reverts.

## Common gotchas

* **FX volatility eats the alpha.** A/H share arbitrage is FX-unhedged unless the strategy template explicitly enables hedging (``fx_hedge_required: true`` in the YAML). For ADR basis trades, hedging the FX leg via a forward / futures position is almost always worth the cost.
* **Conversion ratio changes.** Depositary banks announce conversion changes; the InstrumentADR row gets updated by the corporate-action pipeline. Strategies that hardcode the ratio break the moment that happens; use the MCP tool's auto-lookup instead.
* **Settlement asymmetry.** ADRs settle T+1 in the US; the underlying may settle on a different cycle (T+2 in Hong Kong, for example). The MCP tool returns the basis as of right now, but a strategy executing on it has to plan for the settlement gap.
* **Stock Connect quotas.** Mainland-to-Hong Kong flow has daily quotas; an A-H basis trade may not be executable on a given day because the southbound (or northbound) capacity is exhausted. The strategy template enables a `quota_aware` check in Phase 5+. ## Strategy templates Two templates ship pre-built (Phase 5, polymorphic Resources): * [`configs/strategy_templates/ah_share_arbitrage.yaml`](../configs/strategy_templates/ah_share_arbitrage.yaml) * [`configs/strategy_templates/adr_basis_arbitrage.yaml`](../configs/strategy_templates/adr_basis_arbitrage.yaml) Cloning a template into a workspace emits a ``ResourceRelation`` row with ``relation='translated_from'`` so the ownership graph audits provenance (AGENTS rule 35). The cloned strategy is then editable in the workspace; the original template is read-only and shared. # Execution paths: WebSocket priority + queue-preserving amendment > The Nautilus issue [#4000](https://github.com/nautechsystems/nautilus_trader/issues/4000) documents the cost of using REST for amendment: most REST `PATCH` endpoints actually implement amendment as ca... # Execution paths: WebSocket priority + queue-preserving amendment > Status: **Phase 2 shipped** (Alembic 0041). Amendment manager: > [`alphaswarm/trading/execution/amendment.py`](../alphaswarm/trading/execution/amendment.py). ## Why WebSocket-first The Nautilus issue [#4000](https://github.com/nautechsystems/nautilus_trader/issues/4000) documents the cost of using REST for amendment: most REST `PATCH` endpoints actually implement amendment as cancel + recreate. The modified order takes a NEW venue order id and goes to the back of the limit order book queue at the new price. For market-making strategies this is a non-starter -- the queue position IS the alpha. 
Phase 2's :class:`IDomainBrokerage` declares two capability flags: * :attr:`IDomainBrokerage.supports_websocket_amend` -- the venue has a WS endpoint that modifies the order in place * :attr:`IDomainBrokerage.supports_oco` -- the venue accepts an atomic OCO submission When both are True, the broker is "Phase 2 ready" and the :class:`AmendmentManager` routes: | Change | WS amend supported | Routing | | --- | --- | --- | | Trigger price (stop / MIT / trailing-stop) | True | ``WS_AMEND`` | | Trigger price | False | ``CANCEL_RESUBMIT`` | | Quantity-down on limit | True | ``WS_AMEND`` | | Quantity-up on limit | True (if policy allows) | ``WS_AMEND`` | | Quantity-up on limit | False (default policy) | ``CANCEL_RESUBMIT`` | | Price change | Any | ``CANCEL_RESUBMIT`` | Price changes always go cancel + resubmit because the modified order takes the back of the queue at the new price anyway. ## Atomic request id counter The amendment manager's :class:`alphaswarm.trading.execution.amendment.AtomicRequestIdCounter` mirrors Rust's ``AtomicU64`` via :class:`threading.Lock` + :class:`itertools.count`. Each ``next_id()`` returns a monotonically increasing 64-bit-safe int that the manager uses as the WS message id. Why is this important? * WebSocket amend / cancel messages are dispatched asynchronously -- the response comes back over the same connection with the matching request id. * If two amendments race (the strategy emits a new amendment before the previous one's response arrives), the manager needs to disambiguate which response belongs to which intent. * The counter is gap-free under concurrency, so the matching state table stays correct even when 10+ amendments are inflight. ## Fallback semantics When the WS amend fails (network drop, venue rejection, policy disallowing the change), the manager: 1. Logs at WARNING level with the original exception. 2. Falls through to cancel + resubmit using the broker's :meth:`IDomainBrokerage.cancel` + :meth:`IDomainBrokerage.submit`. 3. 
Returns an :class:`AmendmentResult` with ``routing=CANCEL_RESUBMIT`` so the caller knows queue position was lost.

This is the "WS primary path with REST fallback" pattern from the Nautilus issue. Callers don't have to know which route was used -- the result tells them.

## Code example

```python
from decimal import Decimal

from alphaswarm.trading.execution import AmendmentManager, AmendmentRequest

# `broker` and `order` are assumed to be supplied by the caller.
# The `await` below must run inside an async function.
mgr = AmendmentManager(
    ws_amend=broker.ws_amend,                # async callable
    cancel_resubmit=broker.cancel_resubmit,  # async callable
)

# Reduce a 10-lot limit order to 5 lots without losing queue position
result = await mgr.amend(
    AmendmentRequest(
        client_order_id=order.client_order_id,
        quantity=Decimal("5"),
    ),
    current_order=order,
)
print(result.routing, result.elapsed_ms)
```

## Persistence

Every amendment ultimately produces one or more :class:`ExecutionReport` rows in ``execution_reports``. The :class:`ExecutionReportDispatcher` writes them; the ``(venue, venue_execution_id)`` unique index drops the duplicates produced by the WS-vs-REST race.

## Broker capability matrix

| Broker | supports_websocket_amend | supports_oco | supports_outside_rth |
| --- | --- | --- | --- |
| Alpaca | True (TradingStream subscription) | True (bracket orders) | True (extended_hours flag) |
| IBKR | True (gateway native) | True (OCA groups) | True (outsideRth flag) |
| Tradier | False (REST-only amendment) | False | True (ext_hours flag) |
| Binance | True | False (simulated) | n/a (24x7 venue) |
| Kraken | True (4000 implementation) | False (simulated) | n/a |
| SimulatedBrokerage | True | True (manager-driven) | True |

The matrix is read at runtime from the broker's class attributes; specific venues that ship later get added the same way.
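
The request-id counter and the routing table described earlier on this page can be sketched as follows. This is a minimal illustration, not the code in `alphaswarm/trading/execution/amendment.py`: `RequestIdCounter`, `Routing`, `choose_routing`, and the change labels are hypothetical names. The routing table has no row for a quantity-down change on a venue without WS amend, so the sketch assumes it falls back to cancel + resubmit.

```python
import itertools
import threading
from enum import Enum


class Routing(Enum):
    WS_AMEND = "WS_AMEND"
    CANCEL_RESUBMIT = "CANCEL_RESUBMIT"


class RequestIdCounter:
    """Lock-guarded counter: every next_id() call returns a unique, increasing int."""

    def __init__(self, start=1):
        self._lock = threading.Lock()
        self._counter = itertools.count(start)

    def next_id(self):
        # The lock makes "read and advance" one atomic step across threads.
        with self._lock:
            return next(self._counter)


def choose_routing(change, supports_ws_amend, allow_quantity_up=False):
    """Pick a route for a change: 'trigger_price', 'quantity_down', 'quantity_up', or 'price'."""
    if change not in {"trigger_price", "quantity_down", "quantity_up", "price"}:
        raise ValueError(f"unknown change type: {change!r}")
    # Price changes lose queue position anyway, so they never use WS amend.
    if change == "price" or not supports_ws_amend:
        return Routing.CANCEL_RESUBMIT
    # Quantity-up is only amended in place when the policy allows it.
    if change == "quantity_up" and not allow_quantity_up:
        return Routing.CANCEL_RESUBMIT
    return Routing.WS_AMEND


counter = RequestIdCounter()
print(counter.next_id(), choose_routing("quantity_down", supports_ws_amend=True))
```

Because every in-flight amendment carries its own id from the counter, responses arriving out of order on the same connection can be matched back to the intent that produced them.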
# Factor Research

> AlphaSwarm ships an Alphalens-inspired factor evaluation pipeline plus the purged / walk-forward cross-validators described in Lopez de Prado's *Advances in Financial ML* and ML4T's utility module.

# Factor Research

> Doc map: [alphaswarm_docs/index.md](../../intro/index.md) · See [alphaswarm_docs/strategy-lifecycle.md](../../concepts/strategy/strategy-lifecycle.md) for the broader strategy lifecycle.

AlphaSwarm ships an Alphalens-inspired factor evaluation pipeline plus the purged / walk-forward cross-validators described in Lopez de Prado's *Advances in Financial ML* and ML4T's utility module.

## One-liner evaluation

```python
from alphaswarm.data.factors import evaluate_factor

report = evaluate_factor(
    factor=factor_df,            # long: timestamp, vt_symbol, factor
    prices=prices_df,            # long: timestamp, vt_symbol, close
    factor_name="my_factor",
    periods=(1, 5, 10, 21),
    n_quantiles=5,
)
report.ic_stats              # {"fwd_1": {"mean": ..., "ir": ..., ...}, ...}
report.cumulative_returns    # wide DataFrame: Q1..Q5
report.turnover              # Series: top-quantile daily rotation fraction
```

## UI

The **Factor Evaluation** page posts to ``POST /factors/evaluate``, which enqueues a Celery task. The task logs the tear sheet to MLflow with tag ``alphaswarm.component=factor_eval`` so every report is historically comparable.

## Cross-validators

- :class:`alphaswarm.data.cv.MultipleTimeSeriesCV`: rolling train/test on panel data, matches ML4T ``utils.MultipleTimeSeriesCV``.
- :class:`alphaswarm.data.cv.PurgedKFold`: k-fold with embargo days between the training window and the test fold boundary.
- :class:`alphaswarm.data.cv.TimeSeriesWalkForward`: rolling or expanding train windows with a fixed test-step cadence.
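
To make the embargo idea concrete, here is a simplified, standard-library sketch of k-fold splitting with a gap around each test fold. It is not the implementation of `alphaswarm.data.cv.PurgedKFold`; the function name and parameters are hypothetical. It drops a symmetric band of `embargo` observations on both sides of the test fold, whereas the scheme in *Advances in Financial ML* purges training samples by label overlap and applies the embargo after the test fold.

```python
def purged_kfold_indices(n_samples, n_splits=5, embargo=0):
    """Yield (train, test) index lists for contiguous folds with an embargo gap."""
    if not 2 <= n_splits <= n_samples:
        raise ValueError("need 2 <= n_splits <= n_samples")
    if embargo < 0:
        raise ValueError("embargo must be non-negative")
    # Spread the remainder over the first folds so sizes differ by at most one.
    base, extra = divmod(n_samples, n_splits)
    start = 0
    for fold in range(n_splits):
        stop = start + base + (1 if fold < extra else 0)
        lo = max(0, start - embargo)          # first index dropped before the test fold
        hi = min(n_samples, stop + embargo)   # first index kept after the test fold
        test = list(range(start, stop))
        train = list(range(0, lo)) + list(range(hi, n_samples))
        yield train, test
        start = stop


for train, test in purged_kfold_indices(n_samples=12, n_splits=3, embargo=1):
    print(test[0], test[-1], len(train))
```

With time-ordered samples, the gap keeps observations adjacent to the test fold out of training, which is the leakage the embargo is meant to limit.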
## ML alphas

Two gradient-boosted alpha models drop directly into the framework:

- :class:`alphaswarm.strategies.ml_alphas.XGBoostAlpha`
- :class:`alphaswarm.strategies.ml_alphas.LightGBMAlpha`

Both accept a ``feature_specs`` list (passed through :class:`alphaswarm.data.indicators_zoo.IndicatorZoo`) and a ``model_path`` where the fitted model is pickled after ``train()``. Training auto-logs to MLflow via the :mod:`alphaswarm.mlops.model_registry` helper, and the model can then be loaded in production by calling :func:`alphaswarm.mlops.model_registry.load_alpha_path`.

## Factor evaluation flow

```mermaid
flowchart LR
    FeatureSet[FeatureSet specs] --> IndicatorZoo[indicators_zoo build]
    IndicatorZoo --> Factor["factor values per (symbol, ts)"]
    Factor --> Rank[rank / quantile bucket]
    Rank --> ICEval[Information Coefficient + IC-IR]
    Rank --> Returns[returns by quantile]
    Factor --> CV[purged walk-forward CV]
    ICEval --> Report[alphalens-style report]
    Returns --> Report
    CV --> Report
```

# HFT / LOB backtest engine

> The HFT engine in [alphaswarm/backtest/hft.py](../alphaswarm/backtest/hft.py) wraps [hftbacktest 2.0+](https://github.com/nkaz001/hftbacktest) so any ``LobStrategy`` subclass under [alphaswarm/strategies/hft/](../alphaswarm/stra...

# HFT / LOB backtest engine

> **Audience:** quants running tick-replay backtests for market-making
> or arbitrage strategies, plus agents that need to evaluate a strategy
> spec on cached microstructure data.

The HFT engine in [alphaswarm/backtest/hft.py](../alphaswarm/backtest/hft.py) wraps [hftbacktest 2.0+](https://github.com/nkaz001/hftbacktest) so any ``LobStrategy`` subclass under [alphaswarm/strategies/hft/](../alphaswarm/strategies/hft/) becomes runnable end-to-end. The following strategies ship out of the box:

- ``GLFTMM``: Guéant-Lehalle-Fernandez-Tapia closed-form MM.
- ``AvellanedaStoikovMM``: finite-horizon Avellaneda-Stoikov MM.
- ``GridMM``: symmetric grid quoting around mid.
- ``ImbalanceAlphaMM``: order-book imbalance skew.
- ``BasisAlphaMM``: cross-instrument basis as fair value.
- ``QueueAwareMM``: queue-position-aware MM for large-tick assets.

## Install

The engine ships behind the ``[hft]`` extra. Because hftbacktest is a Rust crate exposed via PyO3, you need a Rust toolchain at install time. See [alphaswarm_docs/installation.md](../../intro/installation.md).

## Architecture

```mermaid
flowchart LR
    Tick[gz tick feed] --> HFT[hftbacktest core]
    HFT --> Driver[LobBacktestEngine driver loop]
    Driver -->|state| Strategy[LobStrategy.on_event]
    Strategy -->|OrderIntent| Driver
    Driver -->|submit_buy_order / cancel| HFT
    Driver --> Result[LobBacktestResult]
    Result --> HFTSummary[hft_summary]
    Result --> ReplayChart[LobReplayChart]
```

Two architecturally important pieces:

1. **Strategy bodies stay pure Python.** ``on_event`` returns a list of ``OrderIntent`` records. The engine translates them into ``hbt.submit_buy_order`` / ``hbt.cancel`` calls. This keeps the strategies LLM-friendly (no Numba constraints) at the cost of a Python function call per event (still ~1k events/ms in practice).
2. **Snapshots are bounded.** The driver writes one ``(timestamp, equity, position)`` record every ``snapshot_every`` events. Long replays produce manageable trajectories instead of un-renderable equity curves.
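
The two pieces above (pure-Python `on_event` returning intents, and bounded snapshots) can be sketched with a toy driver loop. Everything here is a stand-in chosen for illustration: `BookState`, the `OrderIntent` fields, `ToyVenue`, and `drive` are hypothetical and do not mirror `LobBacktestEngine` or the hftbacktest API. The toy venue fills every intent immediately at its price, which a real replay core does not do.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List, Tuple


@dataclass(frozen=True)
class BookState:          # hypothetical top-of-book view handed to the strategy
    timestamp: int
    bid: float
    ask: float


@dataclass(frozen=True)
class OrderIntent:        # hypothetical fields; the real record may differ
    side: str             # "buy" or "sell"
    price: float
    qty: float


class ToyVenue:
    """Stand-in for the replay core: fills every intent immediately at its price."""

    def __init__(self) -> None:
        self.cash = 0.0
        self.position = 0.0

    def submit(self, intent: OrderIntent) -> None:
        signed = intent.qty if intent.side == "buy" else -intent.qty
        self.position += signed
        self.cash -= signed * intent.price

    def equity(self, state: BookState) -> float:
        mid = (state.bid + state.ask) / 2.0
        return self.cash + self.position * mid


def drive(
    events: Iterable[BookState],
    on_event: Callable[[BookState], List[OrderIntent]],
    venue: ToyVenue,
    snapshot_every: int,
) -> List[Tuple[int, float, float]]:
    """Feed each event to the strategy, submit its intents, snapshot every N events."""
    if snapshot_every < 1:
        raise ValueError("snapshot_every must be >= 1")
    snapshots = []
    for n, state in enumerate(events, start=1):
        for intent in on_event(state):
            venue.submit(intent)
        if n % snapshot_every == 0:
            snapshots.append((state.timestamp, venue.equity(state), venue.position))
    return snapshots


def alternate_sides(state: BookState) -> List[OrderIntent]:
    """Toy strategy: buy at the bid on even timestamps, sell at the ask on odd ones."""
    if state.timestamp % 2 == 0:
        return [OrderIntent("buy", state.bid, 1.0)]
    return [OrderIntent("sell", state.ask, 1.0)]


events = [BookState(t, 100.0 + 0.01 * t, 100.02 + 0.01 * t) for t in range(10)]
snapshots = drive(events, alternate_sides, ToyVenue(), snapshot_every=5)
print(len(snapshots), [ts for ts, _, _ in snapshots])
```

The snapshot list grows with `len(events) / snapshot_every` rather than with the event count, which is what keeps long replays renderable.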
## Running a backtest ### Direct API ```python from alphaswarm.backtest.hft import LobBacktestEngine from alphaswarm.strategies.hft.alphas import AvellanedaStoikovMM engine = LobBacktestEngine( latency_profile="intp_order_latency", queue_model="probabilistic", tick_size=0.01, lot_size=0.001, ) strategy = AvellanedaStoikovMM(gamma=0.1, sigma=0.01, k=1.5) result = engine.run( strategy, feeds=["btcusdt_20240301.gz"], max_events=1_000_000, snapshot_every=5_000, ) print(result.summary["hft_sharpe_sample_aware"]) ``` ### Celery task (recommended for long replays) ```python from alphaswarm.tasks.hft_tasks import run_lob_backtest async_result = run_lob_backtest.delay( strategy_alias="AvellanedaStoikovMM", strategy_kwargs={"gamma": 0.1, "sigma": 0.01, "k": 1.5}, dataset_preset="lob_btcusdt_sample", max_events=10_000_000, snapshot_every=10_000, ) ``` The task emits progress every ~2 seconds with the canonical ``{task_id, stage, message, timestamp, **extras}`` shape (AGENTS rule 4) — extras carry ``events_processed``, ``equity``, and ``position``. ### REST surface ```http POST /backtest/lob { "strategy": "AvellanedaStoikovMM", "dataset_preset": "lob_btcusdt_sample", "latency_profile": "intp_order_latency", "queue_model": "probabilistic", "max_events": 1000000 } ``` → returns ``{task_id, status, stream_url}`` per [alphaswarm.api.schemas.TaskAccepted](../alphaswarm/api/schemas.py). The ``stream_url`` is the existing ``/chat/stream/{task_id}`` WebSocket; no new transport. ### Frontend Navigate to ``/backtest/lob``. The page wires up the wizard (strategy / dataset / latency / queue model) and the ``LobReplayChart`` (lightweight-charts equity + position curve). ## Latency / queue models - ``latency_profile="constant_50us"`` — fixed 50µs round-trip. - ``latency_profile="intp_order_latency"`` — file-driven model bundled with hftbacktest's examples (default). - ``queue_model="probabilistic"`` — hftbacktest's ``ProbQueueModel`` (default). 
- ``queue_model="risk_averse"`` — ``RiskAverseQueueModel``. When a value isn't recognised by your installed hftbacktest version, the engine logs a warning and falls back to the model's default. ## Interpreting the metrics The ``BacktestResult.summary`` is augmented by [alphaswarm/backtest/hft_metrics.py::hft_summary](../alphaswarm/backtest/hft_metrics.py): | Metric | Meaning | | --- | --- | | ``hft_sharpe_sample_aware`` | Sharpe annualised by the actual sample frequency (crypto = 365d, equity = 252d). | | ``hft_sortino_sample_aware`` | Same for Sortino. | | ``hft_max_position`` | Largest absolute inventory at any point. | | ``hft_mean_leverage`` | Mean ``|position_value| / equity``. | | ``hft_fill_ratio`` | Fills / orders. | The ``events_processed`` field reflects the number of ``elapse`` calls, not the underlying tick count. ## Custom strategies Subclass [alphaswarm/strategies/lob.py::LobStrategy](../alphaswarm/strategies/lob.py) and implement ``on_event(state) -> list[OrderIntent]``. Use the inherited ``buy`` / ``sell`` / ``cancel`` helpers to build intents. Decorate the class with ``@register("YourMM", source="alphaswarm", category="market_making")`` so the registry index lights up. ## See also - [alphaswarm_snippets/extractions/_FUTURE_PROMPTS/lob_adapter_prompt.md](https://github.com/julianwileymac/alphaswarm/blob/main/alphaswarm_snippets/extractions/_FUTURE_PROMPTS/lob_adapter_prompt.md) — the original spec this implementation closes out. - [alphaswarm_docs/optimal-control.md](../../concepts/strategy/optimal-control.md) — the JAX HJB closed forms that ``GLFTMM`` and ``AvellanedaStoikovMM`` consume. - [alphaswarm_docs/microstructure-toxicity.md](../../concepts/strategy/microstructure-toxicity.md) — the agent loop that mutates strategy YAML on regime flips. 
# Microstructure toxicity + regime-aware adapter > A pure Avellaneda-Stoikov / Lucic-Tse market maker is exposed to adverse selection: when the order flow becomes informationally toxic (elevated VPIN, spiky microprice variance, runaway cancellation ra... # Microstructure toxicity + regime-aware adapter > **Audience:** anyone running paper or live HFT strategies who wants > the platform to react automatically to toxic-flow regimes. A pure Avellaneda-Stoikov / Lucic-Tse market maker is exposed to adverse selection: when the order flow becomes informationally toxic (elevated VPIN, spiky microprice variance, runaway cancellation ratios), the dealer's quotes get picked off faster than the closed- form predicts. The mathematical fix — increase ``γ`` (risk aversion) and shrink ``order_size`` — needs to happen automatically and quickly, because the toxicity window is short. AlphaSwarm wires that loop end-to-end via two MCP tools and one agent spec. ## The loop ```mermaid flowchart LR Bars[microstructure dataset] --> Flow[optimal_control.toxicity_regime flow] Flow --> Iceberg[(alphaswarm_gold_analysis_optimal_control)] Iceberg --> ListRegimes[data.optimal_control.list_regimes] ListRegimes --> Agent[research.toxicity_regime_adapter] Agent --> EvalStrategy[data.optimal_control.evaluate_strategy] Agent --> UpdateConfig[data.strategy_config.update] UpdateConfig --> PaperYaml[configs/paper/*.yaml] PaperYaml --> Paper[Celery paper worker] ``` 1. The [optimal_control.toxicity_regime](../alphaswarm/analysis/flows/optimal_control.py) flow runs on every fresh microstructure slice and writes a regime row to ``alphaswarm_gold_analysis_optimal_control.toxicity_regime``. 2. The [research.toxicity_regime_adapter](../configs/agents/research_toxicity_regime_adapter.yaml) agent polls the regime table via the ``data.optimal_control.list_regimes`` MCP tool. 3. 
When the label flips (benign → elevated → toxic), the agent updates a whitelist of fields on the active paper-trading YAML using the ``data.strategy_config.update`` writer tool. 4. The Celery paper worker picks up the new YAML on its next reload. The whitelist is intentionally narrow: ``gamma``, ``sigma``, ``kappa``, ``k``, ``gamma_inv``, ``base_spread``, ``order_size``, ``max_position``. Anything else (broker, symbol, account_id, kill-switch state) requires a different higher-privilege tool — by design. ## Toxicity score The flow computes a composite toxicity score per slice: ``` score = 0.6 · VPIN_recent + 0.25 · microprice_variance + 0.15 · cancel_ratio ``` Thresholds map score → regime → suggested multipliers: | Score range | Regime | γ multiplier | order_size multiplier | | --- | --- | --- | --- | | < 0.5 · threshold | benign | 1.0 | 1.0 | | ∈ [0.5·θ, θ) | elevated | 1.25 | 0.75 | | ≥ threshold | toxic | 1.5 | 0.5 | Default threshold ``θ = 0.6``. Tune via the flow's ``toxic_threshold`` param. ## Where the math comes from VPIN: Easley, López de Prado, & O'Hara (2012), implemented in [alphaswarm/data/microstructure.py::vpin](../alphaswarm/data/microstructure.py). Microprice variance: the gap between the volume-weighted microprice and the simple mid; large gaps indicate informational pressure on one side of the book. Cancellation ratio: optional input; when provided, captures the fraction of recent order activity that was cancellations rather than trades — a leading indicator of HFT activity ramping up. 
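
The composite score and the threshold table above translate directly into a small classifier. This is a sketch, not the flow's code: `classify_toxicity`, `RegimeCall`, and the sample inputs are hypothetical, the inputs are assumed to be already normalised to comparable scales, and a missing cancellation ratio is assumed to contribute 0.0 (the document does not say how the flow reweights when that input is absent).

```python
from dataclasses import dataclass

# Weights from the composite score formula.
WEIGHTS = {"vpin": 0.6, "microprice_variance": 0.25, "cancel_ratio": 0.15}

# regime -> (gamma multiplier, order_size multiplier), from the threshold table.
MULTIPLIERS = {
    "benign": (1.0, 1.0),
    "elevated": (1.25, 0.75),
    "toxic": (1.5, 0.5),
}


@dataclass(frozen=True)
class RegimeCall:
    score: float
    regime: str
    gamma_multiplier: float
    order_size_multiplier: float


def classify_toxicity(vpin, microprice_variance, cancel_ratio=0.0, toxic_threshold=0.6):
    """Score one slice and map it to a regime plus suggested multipliers."""
    score = (
        WEIGHTS["vpin"] * vpin
        + WEIGHTS["microprice_variance"] * microprice_variance
        + WEIGHTS["cancel_ratio"] * cancel_ratio
    )
    if score >= toxic_threshold:
        regime = "toxic"
    elif score >= 0.5 * toxic_threshold:
        regime = "elevated"
    else:
        regime = "benign"
    gamma_mult, size_mult = MULTIPLIERS[regime]
    return RegimeCall(score, regime, gamma_mult, size_mult)


# Hypothetical slice and hypothetical current parameters.
call = classify_toxicity(vpin=0.9, microprice_variance=0.5, cancel_ratio=0.4)
gamma = 0.1 * call.gamma_multiplier
order_size = 10.0 * call.order_size_multiplier
print(call.regime, gamma, order_size)
```

Lowering `toxic_threshold` moves both boundaries down together, since the elevated band starts at half the threshold.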
## Manually inspecting a regime ```python import pandas as pd from alphaswarm.analysis import run_flow df = pd.read_csv("recent_l1_book.csv") out = run_flow( "optimal_control.toxicity_regime", df, { "buy_volume_column": "buy_volume", "sell_volume_column": "sell_volume", "bid_qty_column": "bid_qty", "ask_qty_column": "ask_qty", "bid_price_column": "bid_price", "ask_price_column": "ask_price", "n_buckets": 50, "toxic_threshold": 0.6, }, ) print(out.metrics["regime"], out.metrics["composite_score"]) ``` ## Customising - **Tighten the threshold.** Drop ``toxic_threshold`` to 0.4 in defensive products; raise it to 0.7 in alpha-only strategies that want the tighter spreads more often. - **Add a cancellation column.** Pass ``cancellation_column="n_cancels"`` to the flow when the dataset exposes per-bar cancellation counts; the score will become more responsive to HFT activity. - **Replace the agent.** The reference adapter is a simple multiplier agent. For richer policies, swap in an RL agent trained on [LucicTsePortfolioEnv](../alphaswarm/rl/envs/lucic_tse_options_env.py) and invoke its policy from a custom AgentSpec body. ## Tests - [tests/analysis/test_optimal_control_flows.py](../tests/analysis/test_optimal_control_flows.py) covers the flow's classification logic. - [tests/data/mcp/test_strategy_config_tool.py](../tests/data/mcp/test_strategy_config_tool.py) covers the writer tool's whitelist + path-traversal guards. ## See also - [alphaswarm_docs/optimal-control.md](../../concepts/strategy/optimal-control.md) — Avellaneda-Stoikov + Cartea-Jaimungal closed forms. - [alphaswarm_docs/portfolio-options-mm.md](../../concepts/strategy/portfolio-options-mm.md) — Lucic-Tse framework that uses ``γ_inv`` instead of single-asset ``γ``. - [alphaswarm_docs/hft-backtest.md](../../concepts/strategy/hft-backtest.md) — running a tick-replay validation of the new parameters before they go to paper. 
# `AlphaBacktestExperiment` > Use `AlphaBacktestExperiment` whenever you want to answer the question *"how does this model perform when its predictions actually drive trades?"*. The standard `Experiment` family computes IC / RMSE ... # `AlphaBacktestExperiment` > The keystone "model used as alpha" experiment — train a model, register > it, deploy it as `DeployedModelAlpha`, run a backtest, and persist > combined ML + trading metrics under one MLflow parent run. ## When to use Use `AlphaBacktestExperiment` whenever you want to answer the question *"how does this model perform when its predictions actually drive trades?"*. The standard `Experiment` family computes IC / RMSE / MAE in isolation; `AlphaBacktestExperiment` adds Sharpe / Sortino / hit-rate and links them back to the trained `ModelVersion` so the Strategy Browser, MLflow UI, and Postgres catalog all converge. ## Shape | Concept | Class / table | | --- | --- | | Orchestrator | [`alphaswarm.ml.alpha_backtest_experiment::AlphaBacktestExperiment`](../alphaswarm/ml/alpha_backtest_experiment.py) | | Combined metrics | [`alphaswarm.ml.alpha_metrics`](../alphaswarm/ml/alpha_metrics.py) | | Combined run row | `MLAlphaBacktestRun` (Alembic 0025) | | Per-bar audit (opt-in) | `MLPredictionAudit` (Alembic 0025) | | Celery task | `alphaswarm.tasks.ml_tasks.run_alpha_backtest_experiment` (queue `ml`) | | REST | `POST /ml/alpha-backtest-runs`, `GET /ml/alpha-backtest-runs[/{id}/predictions]` | ## Workflow ```mermaid sequenceDiagram autonumber participant Caller as Caller (UI / CLI / Celery) participant Task as run_alpha_backtest_experiment participant Exp as AlphaBacktestExperiment participant ML as MLflow participant Reg as Model Registry + ModelVersion participant Dep as ModelDeployment participant BT as run_backtest_from_config participant DB as Postgres Caller->>Task: payload (dataset/model/strategy/backtest cfg) Task->>Exp: AlphaBacktestExperiment(...).run() Exp->>ML: open parent run 
(alphaswarm.component=alpha_backtest) Exp->>Exp: train + predict Exp->>Reg: register_alpha + ModelVersion row Exp->>Dep: ensure ModelDeployment (if absent) Exp->>BT: run_backtest_from_config(strategy=DeployedModelAlpha) BT->>DB: BacktestRun(model_version_id=..., ml_experiment_run_id=...) Exp->>Exp: compute_alpha_metrics + compute_trading_metrics + compute_attribution Exp->>ML: log combined metrics on parent run Exp->>DB: MLAlphaBacktestRun row Exp-->>Caller: AlphaBacktestResult ``` ## Metric vocabulary The combined metrics blob persisted on `MLAlphaBacktestRun.combined_metrics` rolls up: - ML-side: `ic_spearman`, `ic_pearson`, `icir`, `mae`, `rmse`, `hit_rate` - Trading-side: `sharpe`, `sortino`, `calmar`, `max_drawdown`, `total_return`, `turnover_adj_sharpe` - Combined scalar: `score = combined_score(ml_metrics, trading_metrics)` — default weighting in [`alphaswarm/ml/alpha_metrics.py`](../alphaswarm/ml/alpha_metrics.py) prioritises Sharpe (0.45) but also rewards IC / IR / hit-rate so a high-IC model that fails to translate to PnL is penalised. ## Calling from code ```python from alphaswarm.ml.alpha_backtest_experiment import AlphaBacktestExperiment experiment = AlphaBacktestExperiment( dataset_cfg=dataset_cfg, model_cfg=model_cfg, strategy_cfg=strategy_cfg, backtest_cfg=backtest_cfg, run_name="ridge-alpha-backtest", train_first=True, capture_predictions=True, ) result = experiment.run() print(result.combined_metrics) ``` ## Calling from REST ```bash curl -XPOST http://localhost:8000/ml/alpha-backtest-runs \ -H 'content-type: application/json' \ -d @configs/ml/alpha_backtest/ridge_alpha_backtest.yaml ``` The response is a `TaskAccepted` envelope; subscribe to `/chat/stream/{task_id}` for progress events. ## Where this goes wrong - Forgetting `train_first=False` when re-using an existing `deployment_id` will trigger a re-train. Set it explicitly. 
- The combined-metric weights are heuristic; customise them per strategy by passing `weights={...}` to `combined_score`.
- `MLPredictionAudit` is gated behind `ALPHASWARM_ML_PREDICTION_AUDIT_ENABLED`; default is `false` to keep the table small. Enable it for forensic explainability.

## Related

- [`alphaswarm_docs/ml-framework.md`](../../concepts/strategy/ml-framework.md)
- [`alphaswarm_docs/backtest-engines.md`](../../concepts/strategy/backtest-engines.md)
- [`alphaswarm_docs/ml-testing.md`](../../concepts/strategy/ml-testing.md)

# Graphical ML experiment builder

> - Page: [`webui/app/(shell)/ml/builder/page.tsx`](../webui/app/(shell)/ml/builder/page.tsx) - Component: [`webui/components/ml/MlExperimentBuilderPage.tsx`](../webui/components/ml/MlExperimentBuilderP...

# Graphical ML experiment builder

> The `/ml/builder` page composes datasets, preprocessing, model
> definitions, experiment records, deployments, and quick tests on a
> shared XYFlow canvas. Same plumbing as the Bot Builder.

## Where it lives

- Page: [`webui/app/(shell)/ml/builder/page.tsx`](../webui/app/(shell)/ml/builder/page.tsx)
- Component: [`webui/components/ml/MlExperimentBuilderPage.tsx`](../webui/components/ml/MlExperimentBuilderPage.tsx)
- Palette: [`webui/components/ml/mlExperimentPalette.ts`](../webui/components/ml/mlExperimentPalette.ts)
- Serializer: [`webui/components/ml/mlExperimentSerializer.ts`](../webui/components/ml/mlExperimentSerializer.ts)
- Canvas: [`webui/components/flow/WorkflowEditor.tsx`](../webui/components/flow/WorkflowEditor.tsx)

## Palette layout

```mermaid
graph LR
    Source[Source section] --> Pipeline[Pipeline section]
    Pipeline --> Split[Split section]
    Split --> Model[Model section]
    Model --> Records[Records section]
    Records --> Experiment[Experiment section]
    Experiment --> Test[Test section]
    Test --> Deploy[Deploy section]
```

Each palette section maps onto a list of node `kind`s defined in `mlExperimentPalette.ts`.
| Section | Sample kinds | | --- | --- | | Source | `Dataset`, `DatasetPreset`, `IcebergSlice`, `FetcherSource`, `PipelineManifestRef`, `FeatureSet` | | Pipeline | `Preprocessing`, `MLScale`, `MLWinsorize`, `MLLag`, `MLRolling`, `MLDecompose`, `MLPyODOutliers`, `MLImputation` | | Split | `Split`, `WalkForward`, `PurgedKFold`, `Quarterly`, `ChronologicalRatio` | | Model | `SklearnModel`, `KerasModel`, `TensorflowModel`, `TorchModel`, `LightGBMModel`, `XGBoostModel`, `ProphetModel`, `SktimeModel`, `PyODModel`, `HuggingFaceModel` | | Records | `Records`, `SignalRecord` | | Experiment | `Experiment`, `ForecastExperiment`, `ClassificationExperiment`, `AnomalyExperiment`, `AlphaBacktestExperiment`, `FlowPreview` | | Test | `SinglePredictTest`, `BatchPredictTest`, `ABCompareTest`, `ScenarioTest` | | Deploy | `RegisterModelVersion`, `PromoteToProduction`, `CreateModelDeployment` | ## Dispatch `mlExperimentSerializer.ts::dispatchFromGraph` inspects the canvas and routes to the right backend endpoint: - Graph contains an `AlphaBacktestExperiment` node → `POST /ml/alpha-backtest-runs` - Graph contains a `Test*` node → `POST /ml/test/{single|batch|compare|scenario}` - Otherwise → `POST /ml/experiment-runs` This means a single canvas serializes either an experiment-style run or an alpha-backtest run depending on what the user dropped on it. ## Interactive Workbench drawer The toolbar exposes an "Interactive Workbench" button that opens a right-hand drawer wrapping the [`/ml/flows`](../../concepts/strategy/ml-flows.md) catalog. The form is auto-generated from `GET /ml/flows` so adding a new flow lights up here automatically. ## Adding a new palette tile 1. Append an entry to the appropriate `PaletteSection` in `mlExperimentPalette.ts`. 2. Add an accent color to `ML_EXPERIMENT_ACCENTS`. 3. If the new kind needs special serialization (e.g. it must reach a bespoke endpoint), extend `mlExperimentSerializer.ts`'s helper sets and `dispatchFromGraph`. 
# Lightweight workbench flows > | Flow | Purpose | Backend | | --- | --- | --- | | `linear` | Ridge / Lasso / ElasticNet / BayesianRidge with IC + RMSE / MAE | sklearn | | `decomposition` | STL trend / seasonal / residual | statsmod... # Lightweight workbench flows > Small synchronous helpers in [`alphaswarm.ml.flows`](../alphaswarm/ml/flows.py) that > let users iterate on a dataset without spinning up a full > `Experiment`. Surfaced at `POST /ml/flows/{flow}/preview`, > `POST /ml/flows/{flow}/preview-task` (Celery), and `GET /ml/flows` > (catalog). ## Catalog | Flow | Purpose | Backend | | --- | --- | --- | | `linear` | Ridge / Lasso / ElasticNet / BayesianRidge with IC + RMSE / MAE | sklearn | | `decomposition` | STL trend / seasonal / residual | statsmodels | | `forecast` | Prophet / sktime-naive / ARIMA / ETS / Theta / AutoARIMA | mixed | | `regression_diagnostics` | OLS coef table, R^2, F-stat, Durbin-Watson | statsmodels | | `unit_root` | ADF / KPSS unit-root tests | statsmodels | | `acf_pacf` | Auto- and partial-autocorrelation series | statsmodels | | `granger_causality` | Granger causality between two columns | statsmodels | | `cointegration` | Engle-Granger pair cointegration | statsmodels | | `garch` | GARCH(p, q) volatility model + horizon | arch | | `change_point` | PELT / RBF kernel change points | ruptures | | `clustering` | KMeans / DBSCAN / HDBSCAN on the feature matrix | sklearn / hdbscan | | `pca_summary` | PCA variance + factor loadings | sklearn | ## REST surface ```bash # List every flow + its parameter schema curl http://localhost:8000/ml/flows | jq # Sync run a flow curl -XPOST http://localhost:8000/ml/flows/linear/preview \ -d '{"dataset_cfg": {...}, "estimator": "ridge", "alpha": 1.0}' \ -H 'content-type: application/json' # Background run via Celery (returns TaskAccepted) curl -XPOST http://localhost:8000/ml/flows/garch/preview-task \ -d '{"dataset_cfg": {...}, "column": "close", "p": 1, "q": 1, "horizon": 10}' \ -H 'content-type: 
application/json' ``` ## Webui workbench drawer The ML Experiment Builder ([`/ml/builder`](../webui/app/(shell)/ml/builder/page.tsx)) ships an "Interactive Workbench" drawer on its toolbar. Pick a flow, fill in the per-flow form (auto-generated from `GET /ml/flows`), and submit — the result table renders inline so you never leave the canvas. ## Tutorials - [01_quick_ridge_workbench.yaml](../configs/ml/tutorials/01_quick_ridge_workbench.yaml) - [02_stl_decompose_workbench.yaml](../configs/ml/tutorials/02_stl_decompose_workbench.yaml) - [03_arima_garch_diagnostics.yaml](../configs/ml/tutorials/03_arima_garch_diagnostics.yaml) ## Adding a new flow 1. Implement `run__flow(...)` in [`alphaswarm/ml/flows.py`](../alphaswarm/ml/flows.py) returning a `FlowResult`. 2. Add a dispatch branch in `run_flow(flow, payload)`. 3. Add an entry in `list_flows()` so the webui form reflects the new parameters automatically. 4. (Optional) Wrap as a notebook helper in [`alphaswarm/ml/adhoc/`](../alphaswarm/ml/adhoc/__init__.py). # `alphaswarm.ml` — native qlib-style ML framework > `alphaswarm.ml` is a vendored port of [Microsoft Qlib](https://github.com/microsoft/qlib)s feature / dataset / model / record abstractions, re-built as pure Python on top of AQPs own DuckDB-backed data lak... # `alphaswarm.ml` — native qlib-style ML framework > Doc map: [alphaswarm_docs/index.md](../../intro/index.md) · See [alphaswarm_docs/factor-research.md](../../concepts/strategy/factor-research.md) for the alphalens-style evaluation pipeline. `alphaswarm.ml` is a vendored port of [Microsoft Qlib](https://github.com/microsoft/qlib)'s feature / dataset / model / record abstractions, re-built as pure Python on top of AlphaSwarm's own DuckDB-backed data lake. There is **no qlib runtime dependency** — installing the `ml` / `ml-torch` extras pulls in the underlying libraries (LightGBM, XGBoost, CatBoost, PyTorch) only. 
## Layers ``` ┌────────────────────────────────────────────────┐ │ Model (alphaswarm.ml.base.Model / ModelFT) │ │ ├─ tree: LGBModel, XGBModel, CatBoostModel │ │ ├─ linear: LinearModel (OLS/Ridge/Lasso/NNLS)│ │ ├─ ensemble: DEnsembleModel │ │ ├─ torch: DNN, LSTM, GRU, ALSTM, Transformer,│ │ │ TCN, TabNet, Localformer, │ │ │ GeneralPTNN, Seq2Seq family │ │ └─ stubs: GATs, HIST, TRA, ADD, ADARNN, … │ ├────────────────────────────────────────────────┤ │ DatasetH / TSDatasetH → prepare(segments) │ ├────────────────────────────────────────────────┤ │ DataHandler / DataHandlerLP │ │ ├─ DK_R raw | DK_I infer | DK_L learn views │ │ └─ shared / infer / learn processors │ ├────────────────────────────────────────────────┤ │ DataLoader → AQPDataLoader (DuckDB + DSL) │ └────────────────────────────────────────────────┘ ``` ## Quick start ```python from alphaswarm.ml.features.alpha158 import Alpha158 from alphaswarm.ml.dataset import DatasetH from alphaswarm.ml.models.tree import LGBModel handler = Alpha158( instruments=["SPY", "AAPL", "MSFT"], start_time="2018-01-01", end_time="2024-12-31", fit_start_time="2018-01-01", fit_end_time="2022-12-31", ) dataset = DatasetH( handler=handler, segments={ "train": ("2018-01-01", "2022-12-31"), "valid": ("2023-01-01", "2023-12-31"), "test": ("2024-01-01", "2024-12-31"), }, ) model = LGBModel(num_leaves=63, learning_rate=0.05, n_estimators=500) model.fit(dataset) predictions = model.predict(dataset, segment="test") ``` Launch the same pipeline as a Celery task: ```python from alphaswarm.tasks.ml_tasks import train_ml_model async_result = train_ml_model.delay( dataset_cfg={"class": "DatasetH", "module_path": "alphaswarm.ml.dataset", "kwargs": {...}}, model_cfg={"class": "LGBModel", "module_path": "alphaswarm.ml.models.tree", "kwargs": {...}}, run_name="alpha158-lgbm", strategy_id="", ) ``` ## Feature factories - **Alpha158** (`alphaswarm.ml.features.alpha158.Alpha158DL`) ships the 9 k-bar + price/volume lookbacks + ~30 rolling families from 
the original qlib paper. Every feature is expressed via the DSL operators in `alphaswarm.data.expressions` so adding a new family is one line of code.
- **Alpha360** (`alphaswarm.ml.features.alpha360.Alpha360DL`) emits a 60-step OHLCV panel normalised by the latest close (or latest volume). Feed it into a `TSDatasetH` and pair with one of the sequence models.

Both handlers default to `Ref($close, -2) / Ref($close, -1) - 1` as the label, matching qlib's standard forward-return target: the one-day return from the close at `t+1` to the close at `t+2`.

## Expression DSL

`alphaswarm.data.expressions` now exposes ~50 operators grouped into four families:

- **Unary**: `Ref`, `Delta`, `Abs`, `Sign`, `Log`, `Power`, `Rank`
- **Rolling**: `Mean`, `Std`, `Var`, `Skew`, `Kurt`, `Sum`, `Min`, `Max`, `Med`, `Mad`, `Quantile`, `Count`, `IdxMax`, `IdxMin`, `EMA`, `WMA`, `Slope`, `Rsquare`, `Resi`
- **Pairwise**: `Corr`, `Cov`
- **Comparison / logical / conditional**: `Greater`, `Less`, `Gt`, `Ge`, `Lt`, `Le`, `Eq`, `Ne`, `And`, `Or`, `Not`, `Mask`, `If`

Example: a 20-bar z-scored close factor:

    "($close - Mean($close, 20)) / (Std($close, 20) + 1e-12)"

## Recorders

`alphaswarm.ml.recorder` ports `SignalRecord` / `SigAnaRecord` / `PortAnaRecord`:

- `SignalRecord.generate()` calls `model.predict(dataset)`, serialises `pred.pkl` + `label.pkl`, and logs them as MLflow artifacts.
- `SigAnaRecord.generate(signal_record=...)` runs `alphaswarm.data.factors.evaluate_factor` to compute IC / Rank IC / quantile returns and pushes them into the active MLflow run.
- `PortAnaRecord.generate(signal_record=...)` turns the prediction panel into a top-K long / bottom-K short portfolio and reports Sharpe / Sortino / max-drawdown + qlib-style `risk_analysis` summary.

The `train_ml_model` Celery task auto-runs `SignalRecord` + any records listed in the YAML so one `POST /ml/train` gives you predictions, factor analysis, and a portfolio tearsheet in a single MLflow run.
## Model zoo (Tier A — shipping)

| Family | Class | Notes |
| --- | --- | --- |
| Tree | `LGBModel`, `XGBModel`, `CatBoostModel`, `DEnsembleModel` | `ml` extra |
| Linear | `LinearModel` with `estimator` set to `"ridge"`, `"lasso"`, `"ols"`, or `"nnls"` | `ml` extra |
| Dense | `DNNModel(layers=[256, 64], dropout=0.2)` | `ml-torch` extra |
| Sequence | `LSTMModel`, `GRUModel`, `ALSTMModel` (attention head) | TS; `step_len=20` |
| Attention | `TransformerModel`, `LocalformerModel` (local-window mask) | TS |
| Convolutional | `TCNModel` | TS |
| Tabular | `TabNetModel` | requires `pytorch-tabnet` |
| Generic | `GeneralPTNN(model_class=..., model_module=...)` | bring-your-own `nn.Module` |
| Seq2Seq | `LSTMSeq2Seq`, `GRUSeq2Seq`, `LSTMSeq2SeqVAE`, `DilatedCNNSeq2Seq`, `TransformerForecaster` | ported from Stock-Prediction-Models |

## ML-Ops framework adapters

The experiment layer also exposes framework adapters that still satisfy the same `Model.fit(dataset)` / `Model.predict(dataset, segment)` contract:

| Family | Classes | Extra |
| --- | --- | --- |
| scikit-learn | `SklearnRegressorModel`, `SklearnClassifierModel`, `SklearnPipelineModel` | `ml` |
| Forecasting | `ProphetForecastModel`, `SktimeForecastModel`, `SktimeReductionForecastModel` | `ml-forecast` |
| Anomaly detection | `PyODAnomalyModel` | `ml-anomaly` |
| Keras / TensorFlow | `KerasMLPModel`, `KerasLSTMModel` | `ml-keras` or `ml-tensorflow` |
| Hugging Face | `HuggingFaceTextSignalModel` | `ml-transformers` |

All heavy libraries are imported lazily. The base API can list recipes and build configs without TensorFlow, Prophet, sktime, PyOD, or transformers installed; fitting one of those classes raises a targeted install message if the corresponding extra is missing.
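
The lazy-import behaviour described above follows a common pattern, sketched below. This is not the adapters' actual code: `require_optional` and `LazyForecastModel` are hypothetical names, and the wording of the error message is an assumption. The point is only that construction stays cheap and the heavy import is deferred to `fit()`.

```python
import importlib


def require_optional(module_name, extra):
    """Import module_name on demand, or fail with a hint naming the extra."""
    try:
        return importlib.import_module(module_name)
    except ImportError as exc:
        raise RuntimeError(
            f"{module_name!r} is not installed; "
            f"install the '{extra}' extra to use this model."
        ) from exc


class LazyForecastModel:
    """Hypothetical wrapper: building it is cheap, fit() needs the dependency."""

    def __init__(self, module_name="prophet", extra="ml-forecast"):
        self.module_name = module_name
        self.extra = extra
        self._backend = None

    def fit(self, dataset):
        # The heavy import happens here, not when this module is imported,
        # so listing or configuring the class works without the extra.
        self._backend = require_optional(self.module_name, self.extra)
        return self


model = LazyForecastModel(module_name="package_that_is_not_installed")
try:
    model.fit(dataset=None)
except RuntimeError as err:
    print(err)
```

Raising one consistent exception type with the extra's name in the message keeps the failure actionable and lets the rest of the registry load normally.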
## Model zoo (Tier B — scaffolded stubs) These classes register into `alphaswarm.core.registry` so the Strategy Browser enumerates them, but `fit()` raises `NotImplementedError` with a pointer to the canonical qlib implementation. Port them incrementally: `GATsModel`, `HISTModel`, `TRAModel`, `ADDModel`, `ADARNNModel`, `TCTSModel`, `SFMModel`, `SandwichModel`, `KRNNModel`, `IGMTFModel`. ## Persistence + MLflow wiring Every `train_ml_model` run writes a `ModelVersion` row and (when `register_alpha=True`) registers the pickled model in the MLflow Model Registry. If you pass `strategy_id`, the run is filed under the `strategy/` MLflow experiment so the Strategy Browser can link straight to it. ## Planning-first workflow (split / pipeline / experiment / deployment) The ML stack now supports a planning layer so datasets, splits, and preprocessing can be reused deterministically across runs. 1. Create a split plan (fixed / purged-kfold / walk-forward): ```bash curl -X POST http://localhost:8000/ml/split-plans \ -H "Content-Type: application/json" \ -d '{ "name": "alpha158-fixed-2019-2024", "method": "fixed", "vt_symbols": ["SPY.NASDAQ", "AAPL.NASDAQ", "MSFT.NASDAQ"], "start": "2019-01-01", "end": "2024-12-31", "config": { "segments": { "train": ["2019-01-01", "2022-12-31"], "valid": ["2023-01-01", "2023-12-31"], "test": ["2024-01-01", "2024-12-31"] } } }' ``` 2. Save a pipeline recipe (`shared` / `infer` / `learn` processors): ```bash curl -X POST http://localhost:8000/ml/pipelines \ -H "Content-Type: application/json" \ -d '{ "name": "alpha158-default", "infer_processors": [{"class":"Fillna","module_path":"alphaswarm.ml.processors","kwargs":{"fields_group":"feature","fill_value":0.0}}], "learn_processors": [{"class":"DropnaLabel","module_path":"alphaswarm.ml.processors","kwargs":{"fields_group":"label"}}] }' ``` 3. 
Create an experiment plan tying together dataset/split/pipeline/model config, then launch training with `experiment_plan_id`: ```bash curl -X POST http://localhost:8000/ml/train \ -H "Content-Type: application/json" \ -d '{ "run_name": "alpha158-lgb-plan", "experiment_plan_id": "", "register_alpha": true }' ``` For a richer ML-ops run that persists an `MLExperimentRun` row and logs compact prediction samples, use the experiment runner: ```bash curl -X POST http://localhost:8000/ml/experiment-runs \ -H "Content-Type: application/json" \ -d '{ "run_name": "ridge-alpha-smoke", "experiment_type": "alpha", "dataset_cfg": {"class": "DatasetH", "module_path": "alphaswarm.ml.dataset", "kwargs": {...}}, "model_cfg": {"class": "SklearnRegressorModel", "module_path": "alphaswarm.ml.models.sklearn", "kwargs": {"estimator": "ridge"}} }' ``` Small interactive flows can run synchronously without Celery: ```bash curl -X POST http://localhost:8000/ml/flows/linear/preview \ -H "Content-Type: application/json" \ -d '{"dataset_cfg": {...}, "estimator": "ridge", "alpha": 1.0}' ``` The Next.js web UI exposes the same objects in `/ml/builder`, using a graph that serializes `Dataset`, `Preprocessing`, `Split`, `Model`, `Records`, and `Experiment` nodes into the `/ml/experiment-runs` request. 4. 
Deploy a tested `ModelVersion` as a strategy alpha profile: ```bash curl -X POST http://localhost:8000/ml/deployments \ -H "Content-Type: application/json" \ -d '{ "name": "lgb-alpha-prod", "model_version_id": "", "infer_segment": "infer", "long_threshold": 0.001, "short_threshold": -0.001 }' ``` Then consume it in strategy YAML via: ```yaml alpha_model: class: DeployedModelAlpha module_path: alphaswarm.strategies.ml_alphas kwargs: deployment_id: "" ``` ## Train -> register -> deploy -> score ```mermaid flowchart LR Dataset[DatasetVersion] --> Split[SplitPlan + SplitArtifacts] Split --> Recipe[PipelineRecipe] Recipe --> Train[ml_tasks.train_ml_model] Train --> MLflow[(MLflow registry)] MLflow --> ModelVersion[ModelVersion row] ModelVersion --> Deploy[ModelDeployment] Deploy --> Score[ml_tasks.evaluate / preview] Score --> Backtest[backtest replay] Score --> WebUI ``` ## ML engine major expansion (Alembic 0025) The ML layer has grown a number of new surfaces, all driven by the existing `Experiment` / `Model` / `Processor` contracts: - **`AlphaBacktestExperiment`** — combined "model used as alpha" experiment that trains, registers, deploys, backtests, and rolls the combined ML + trading metrics into a single MLflow parent run and a `ml_alpha_backtest_runs` Postgres row. See [alphaswarm_docs/ml-alpha-backtest.md](../../concepts/strategy/ml-alpha-backtest.md). - **Library coverage** — TF-native (`TFEstimatorModel`), Keras Functional / TabTransformer, HuggingFace FinBERT / time-series transformer / generative, AutoETS / AutoARIMA / Theta / Tbats, PyOD ECOD / SUOD / AutoEncoder, Sklearn Stacking / AutoPipeline. See [alphaswarm_docs/ml-libraries.md](../../concepts/strategy/ml-libraries.md). - **Lightweight workbench flows** — `regression_diagnostics`, `unit_root`, `acf_pacf`, `granger_causality`, `cointegration`, `garch`, `change_point`, `clustering`, `pca_summary`. See [alphaswarm_docs/ml-flows.md](../../concepts/strategy/ml-flows.md). 
- **ML preprocessors as data-pipeline nodes** — `transform.ml_preprocessing` plus specialised tiles, with a new `sink.ml_feature_snapshot` for deterministic feature reload. See [alphaswarm_docs/ml-preprocessing-pipeline.md](../../concepts/strategy/ml-preprocessing-pipeline.md). - **Interactive testing workbench** — `/ml/test/{single,batch,compare,scenario,upload-csv}` endpoints + tabbed webui surface. See [alphaswarm_docs/ml-testing.md](../../concepts/strategy/ml-testing.md). - **Graphical builder palette** — Source / Pipeline / Split / Model (per-framework) / Records / Experiment / Test / Deploy sections plus an Interactive Workbench drawer. See [alphaswarm_docs/ml-builder.md](../../concepts/strategy/ml-builder.md). - **Adhoc helpers** — [`alphaswarm.ml.adhoc`](../alphaswarm/ml/adhoc/__init__.py) exposes `quick_ridge`, `quick_arima`, `quick_iforest`, etc. for notebook iteration. # ML library reference > | Library | Wrapper(s) | Optional extra | Example config | | --- | --- | --- | --- | | scikit-learn | `SklearnRegressorModel`, `SklearnClassifierModel`, `SklearnPipelineModel`, `SklearnStackingModel`,... # ML library reference > Per-framework reference for every model wrapper under > [`alphaswarm/ml/models/`](../alphaswarm/ml/models/). Configs live under > [`configs/ml/`](../configs/ml/). 
## Coverage matrix | Library | Wrapper(s) | Optional extra | Example config | | --- | --- | --- | --- | | scikit-learn | `SklearnRegressorModel`, `SklearnClassifierModel`, `SklearnPipelineModel`, `SklearnStackingModel`, `SklearnAutoPipelineModel` | `ml` | [sklearn_ridge_alpha.yaml](../configs/ml/frameworks/sklearn_ridge_alpha.yaml), [sklearn_stacking_alpha.yaml](../configs/ml/frameworks/sklearn_stacking_alpha.yaml) | | LightGBM | `LGBModel` | `ml` | [alpha158_lgbm.yaml](../configs/ml/alpha158_lgbm.yaml) | | XGBoost | `XGBModel` | `ml` | (in tree zoo) | | CatBoost | `CatBoostModel` | `ml` | (in tree zoo) | | Keras 3 | `KerasMLPModel`, `KerasLSTMModel`, `KerasFunctionalModel`, `KerasTabTransformerModel` | `ml-keras` | [keras_mlp_alpha.yaml](../configs/ml/frameworks/keras_mlp_alpha.yaml), [keras_tab_transformer.yaml](../configs/ml/frameworks/keras_tab_transformer.yaml) | | TensorFlow native | `TFEstimatorModel` (linear / DNN / boosted_trees) | `ml-tensorflow` + `ALPHASWARM_TF_NATIVE_ENABLED=true` | [tf_estimator_dnn.yaml](../configs/ml/frameworks/tf_estimator_dnn.yaml) | | PyTorch (qlib ports) | `LSTMTSModel`, `TransformerTSModel`, `TCNTSModel`, `TabNetModel`, `HISTModel`, `GATsModel`, `TRAModel`, … | `ml-torch` | [alpha360_*.yaml](../configs/ml/) | | Prophet | `ProphetForecastModel` | `ml-forecast` | [prophet_forecast_alpha.yaml](../configs/ml/frameworks/prophet_forecast_alpha.yaml) | | sktime | `SktimeForecastModel`, `SktimeReductionForecastModel`, `AutoETSForecastModel`, `AutoARIMAForecastModel`, `ThetaForecastModel`, `BatsTbatsForecastModel` | `ml-forecast` | [sktime_reduction_forecast.yaml](../configs/ml/frameworks/sktime_reduction_forecast.yaml), [auto_ets_forecast.yaml](../configs/ml/frameworks/auto_ets_forecast.yaml), [auto_arima_forecast.yaml](../configs/ml/frameworks/auto_arima_forecast.yaml), [theta_forecast.yaml](../configs/ml/frameworks/theta_forecast.yaml) | | PyOD | `PyODAnomalyModel` (iforest / knn / ecod / copod / lof / suod / auto_encoder / hbos / 
mcd / ocsvm / pca) | `ml-anomaly` | [pyod_anomaly_alpha.yaml](../configs/ml/frameworks/pyod_anomaly_alpha.yaml), [pyod_ecod_anomaly.yaml](../configs/ml/frameworks/pyod_ecod_anomaly.yaml) | | HuggingFace transformers | `HuggingFaceTextSignalModel`, `HuggingFaceFinBertSentimentModel`, `HuggingFaceTimeSeriesModel`, `HuggingFaceGenerativeForecastModel` | `ml-transformers` (+ `ALPHASWARM_HF_TIMESERIES_ENABLED=true` for time-series) | [huggingface_finbert_signal.yaml](../configs/ml/frameworks/huggingface_finbert_signal.yaml), [hf_finbert_sentiment.yaml](../configs/ml/frameworks/hf_finbert_sentiment.yaml), [hf_patchtst_forecast.yaml](../configs/ml/frameworks/hf_patchtst_forecast.yaml) | ## Adhoc / notebook surface [`alphaswarm.ml.adhoc`](../alphaswarm/ml/adhoc/__init__.py) exposes a `quick_*` namespace for one-off analyses without spelling out a full `Experiment` config: ```python from alphaswarm.ml.adhoc import ( quick_arima, quick_ecod, quick_finbert_sentiment, quick_iforest, quick_panel_fixed_effects, quick_prophet, quick_ridge, quick_text_embed, quick_theta, ) # Linear / ridge / elasticnet ridge = quick_ridge(features_df, target_series, alpha=1.0) print(ridge.score, ridge.coefficients) # Anomaly detection iforest = quick_iforest(features_df, contamination=0.05) print(iforest.n_anomalies) # Forecasting arima = quick_arima(series, horizon=10, order=(1, 1, 1)) prophet = quick_prophet(series, horizon=10) theta = quick_theta(series, horizon=10) # Embeddings & sentiment embeds = quick_text_embed(headlines) sentiment = quick_finbert_sentiment(headlines) # Panel diagnostics fe = quick_panel_fixed_effects(panel, target_col="y", entity_col="vt_symbol") ``` ## Where to add a new wrapper 1. Implement the class under `alphaswarm/ml/models/.py`, subclassing [`Model`](../alphaswarm/ml/base.py). 2. Decorate with `@register("Name", kind="model")` from [`alphaswarm.core.registry`](../alphaswarm/core/registry.py). 3. 
Make optional imports lazy (raise `RuntimeError` mentioning the right extra) so the rest of the registry keeps working. 4. Add a YAML under `configs/ml/frameworks/`. 5. Add a hermetic test under `tests/ml/models/` that monkey-patches the optional dep when needed. 6. Cross-list it here. See [`alphaswarm_docs/ml-framework.md`](../../concepts/strategy/ml-framework.md) for the full registry + `Experiment` contract. # ML preprocessing as data-pipeline nodes > Before this expansion, the only way to apply an ML preprocessing recipe was to load a `Dataset` and call `Processor.fit_process` — which only works for offline `Experiment` runs. Promoting processors ... # ML preprocessing as data-pipeline nodes > Bridges [`alphaswarm.ml.processors`](../alphaswarm/ml/processors.py) into the data > engine ([`alphaswarm/data/engine`](../alphaswarm/data/engine/)) so an > ``alphaswarm.data.engine.PipelineManifest`` can chain > ``source -> ml_preprocessing -> sink`` like any other transform. ## Why Before this expansion, the only way to apply an ML preprocessing recipe was to load a `Dataset` and call `Processor.fit_process` — which only works for offline `Experiment` runs. Promoting processors to first-class data-engine nodes lets you: - Materialise preprocessed features into Iceberg via ``sink.ml_feature_snapshot`` and reload them deterministically in later training runs. - Reuse the same recipe in batch ingestion AND online inference. - Drop a saved ``PipelineRecipe`` row directly onto the manifest builder canvas via ``POST /ml/pipelines/{id}/as-node``. ## Two layers ### Umbrella node — `transform.ml_preprocessing` Accepts either a saved ``recipe_id`` or an inline ``processors`` list. Re-uses [`apply_processor_specs`](../alphaswarm/ml/pipeline_recipes.py) so a manifest run applies the same transformation as the offline ML training loop. ```yaml - name: transform.ml_preprocessing kwargs: recipe_id: 1c5b... 
# optional — saved /ml/pipelines recipe processors: # optional inline overlay - class: WinsorizeByQuantile module_path: alphaswarm.ml.processors kwargs: {lower_q: 0.01, upper_q: 0.99} fit: true ``` ### Specialized convenience nodes Each maps onto a single processor and shows up in the Manifest Builder palette as its own tile: | Node name | Processor | | --- | --- | | ``transform.ml_scale`` | `SklearnTransformerProcessor` (Standard / Robust / MinMax) | | ``transform.ml_winsorize`` | `WinsorizeByQuantile` | | ``transform.ml_lag_features`` | `LagFeatureGenerator` | | ``transform.ml_rolling_features`` | `RollingFeatureGenerator` | | ``transform.ml_seasonal_decompose`` | `SeasonalDecomposeFeatures` | | ``transform.ml_pyod_outliers`` | `PyODOutlierFilter` | | ``transform.ml_imputation`` | `Fillna` | | ``transform.ml_target_encode`` | `TargetEncode` | ## Sink — `sink.ml_feature_snapshot` Iceberg writer that stamps the resulting table with ``pipeline_recipe_id``, ``dataset_version_id``, and a stable ``feature_snapshot_id`` so downstream training runs can reload exactly the same preprocessed features: ```yaml - name: sink.ml_feature_snapshot kwargs: namespace: ml.features table: alpha_panel_v1 pipeline_recipe_id: 1c5b... dataset_version_id: 9f8a... mode: append ``` The sink's result includes a ``feature_snapshot_id`` UUID; persist it in the dataset registry so future ``DatasetH`` instances can lazily reload from the snapshot table. ## End-to-end flow ```mermaid graph LR Source[source.icebergohlcv] --> Recipe["transform.ml_preprocessing(saved recipe_id)"] Recipe --> Snap["sink.ml_feature_snapshot(ml.features.alpha_panel_v1)"] Snap --> Train[Experiment trainingreuses snapshot] Train --> Deploy[ModelDeployment] Deploy --> Live[DeployedModelAlphaonline inference] ``` ## REST ```bash # Materialise a saved recipe into a manifest fragment for the # Pipeline Builder UI. 
curl -XPOST http://localhost:8000/ml/pipelines//as-node \ -d '{"fit": false}' -H 'content-type: application/json' ``` Returns: ```json { "name": "transform.ml_preprocessing", "label": "my-recipe", "enabled": true, "kwargs": {"recipe_id": "", "fit": false} } ``` # Interactive ML testing workbench > > The `/ml/test` page lets users validate deployed models with single > rows, batch slices, A/B comparisons, perturbation sweeps, CSV > uploads, and live streaming — all wired through the same > [`Dep... # Interactive ML testing workbench > **Superseded by [strategy-development.md](../../concepts/strategy/strategy-development.md).** > The webui `/ml/test` page is preserved for legacy bookmarks but the > canonical surfaces now live as sibling sub-routes of > `/strategy-development/*` on the new Vite frontend. The endpoint > table below is still authoritative — only the frontend changed. > The `/ml/test` page lets users validate deployed models with single > rows, batch slices, A/B comparisons, perturbation sweeps, CSV > uploads, and live streaming — all wired through the same > [`DeployedModelAlpha`](../alphaswarm/strategies/ml_alphas.py) runtime that > production strategies use. 
## Tabs | Tab | Endpoint(s) | Behaviour | | --- | --- | --- | | Single Predict | `POST /ml/test/single` (sync) | Score one row, render score + sign | | Batch | `POST /ml/test/batch` (Celery) + `POST /ml/test/upload-csv` | Iceberg slice or uploaded CSV scoring | | A/B Compare | `POST /ml/test/compare` (Celery) | Side-by-side signals + agreement rate | | Scenario / What-if | `POST /ml/test/scenario` (sync) | Per-feature ±N% perturbation table + heatmap | | Historical | `POST /ml/evaluate` (Celery) | Existing offline eval flow | | Live | `POST /ml/live-test/start` + WS bridge | Stream bars / signals from a venue | | Models | n/a | Tabular `ModelVersion` browser | ## Backend [`alphaswarm/tasks/ml_test_tasks.py`](../alphaswarm/tasks/ml_test_tasks.py) hosts the Celery tasks (queue `ml`): - `predict_single` — single-row inference - `predict_batch` — Iceberg slice scoring - `compare_models` — A/B between two `model_version_id`s - `scenario_perturbation` — sensitivity table Each task routes through [`DeployedModelAlpha._predict`](../alphaswarm/strategies/ml_alphas.py) so dataset-driven AND legacy indicator-zoo paths both work. ## Sample REST calls ```bash # Single prediction (sync) curl -XPOST http://localhost:8000/ml/test/single \ -d '{"deployment_id": "...", "feature_row": {"f1": 0.1, "f2": -0.4}, "sync": true}' \ -H 'content-type: application/json' # Scenario sweep curl -XPOST http://localhost:8000/ml/test/scenario \ -d '{"deployment_id": "...", "feature_row": {"f1": 0.1, "f2": -0.4}, "perturbations": [-0.1, 0, 0.1]}' \ -H 'content-type: application/json' # CSV upload (multipart) curl -XPOST 'http://localhost:8000/ml/test/upload-csv?deployment_id=...' \ -F 'file=@features.csv' ``` The CSV upload path is capped via ``settings.ml_workbench_max_csv_mb`` (default 20 MB). ## Visualisations The webui renders results with [`recharts`](https://recharts.org/) (already a dependency): - Single Predict — Descriptions card with score + bias tag. 
- Scenario — `BarChart` of deltas + sortable Ant Design table. - Live — line chart overlay of bar close + signal strength + recent events list. ## Where this gets called from - Standalone: `/ml/test`. - ML Builder: a `Test*` node on the canvas serializes to the matching `/ml/test/*` endpoint. - AlphaBacktestExperiment: when `train_first=true` it stamps the new deployment id on `MLAlphaBacktestRun`, so the next visit to `/ml/test` can score against it directly. # MLOps service (initial slice) # MLOps service inside `alphaswarm_models/` This page documents the initial MLOps service shipped as additive extensions to the established `alphaswarm_models/` boundary. The service provides the agentic plumbing the two MLOps reports asked for — a polymorphic agent-facing interface layer, MLOps lifecycle handlers, external-registry adapters, hash-locked skills, OOD safety rules, a dedicated MCP server, and the matching REST + Celery + frontend surfaces — all on top of the existing models / predictors / serving infrastructure. ## What's new ### `alphaswarm_models/src/alphaswarm_models/interfaces/` Five agent-facing polymorphic ABCs that wrap any concrete model in a stable contract: | Interface | Method | Application | | --- | --- | --- | | `Predictor` | `predict(features)` | Point-in-time value estimation | | `Forecaster` | `forecast(history, horizon)` | Multi-step temporal projection | | `Classifier` | `classify(data)` | Discrete probability distribution | | `Segmenter` | `segment(series)` | Structural-break detection | | `Analyzer` | `analyze(unstructured)` | NLP / sentiment scoring | All register under `kind="interface"` in `alphaswarm.core.registry`. Agents program against `Predictor.predict` regardless of whether XGBoost, LSTM, or HuggingFace pipelines back the call. 
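
The value of the interface layer is that caller code never branches on the backing framework. The sketch below shows the idea with Python's `abc` module. It is a simplified stand-in rather than the classes in `alphaswarm_models`: the real `Predictor` signature, its registry decoration, and the two toy backends here are assumptions made for illustration.

```python
from abc import ABC, abstractmethod


class Predictor(ABC):
    """Illustrative stand-in for the agent-facing interface (not the real class)."""

    @abstractmethod
    def predict(self, features):
        """Return a point estimate for one feature mapping."""


class LinearBackend(Predictor):
    """Toy backend: a weighted sum of named features."""

    def __init__(self, weights, bias=0.0):
        self.weights = weights
        self.bias = bias

    def predict(self, features):
        return self.bias + sum(w * features[name] for name, w in self.weights.items())


class ConstantBackend(Predictor):
    """Toy backend: ignores its input, useful as a baseline."""

    def __init__(self, value):
        self.value = value

    def predict(self, features):
        return self.value


def score(predictor, features):
    """Caller code depends only on the interface, never on the backend type."""
    return predictor.predict(features)


row = {"f1": 0.1, "f2": -0.4}
for backend in (LinearBackend({"f1": 2.0, "f2": -1.0}, bias=0.5), ConstantBackend(0.0)):
    print(type(backend).__name__, score(backend, row))
```

Swapping `LinearBackend` for a wrapper around a tree model or a neural network would leave `score` untouched, which is the property the paragraph above describes.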
### `alphaswarm_models/src/alphaswarm_models/handlers/` Six MLOps lifecycle handler classes: | Handler | Purpose | | --- | --- | | `CacheHandler` | LRU + safetensors-first model cache (budgets in `settings.ml_cache_*`) | | `LoadHandler` | Cryptographic verification + safetensors-preferred deserialisation | | `SaveHandler` | torch state_dict → `.safetensors` with SHA-256 sidecar | | `StoreHandler` | Object-store upload + lineage metadata | | `ProductionizeHandler` | Drive the `productionize/` compiler pipeline | | `ServeHandler` | Continuous-batching queue with kill-switch fan-out | All inherit `MLOpsHandler` so every lifecycle operation runs the same `policy_check` + lineage emission contract (`LineageBus`). ### `alphaswarm_models/src/alphaswarm_models/productionize/` Four compiler classes: | Compiler | Output | Optional dep | | --- | --- | --- | | `OnnxCompiler` | `.onnx` | `torch.onnx` | | `TensorRTCompiler` | `.engine` | `tensorrt` (Linux GPU only) | | `TorchScriptCompiler` | `.pt` (trace/script) | `torch` | | `QuantizationCompiler` | `.pt` (INT8 / FP16) | `torch` | Each registers via `@register_compiler("alias")` and emits a `CompiledArtifact` with SHA-256 + size + kwargs into `ml_compiled_artifacts`. ### `alphaswarm_models/src/alphaswarm_models/adapters/` External-registry pullers protecting the supply chain: | Adapter | Notes | | --- | --- | | `HuggingFaceAdapter` | Routes downloads through the local cache volume; resolves HF tokens via `CredentialResolver` (`CredentialKey("huggingface", "api_token")`). Honours `settings.ml_hf_hub_offline`. | | `TorchHubAdapter` | Refuses every name not on `DEFAULT_ALLOWLIST` ∪ the operator allow-list at `CredentialKey("torchhub", "allowlist")`. Verifies SHA-256 before caching. | ### `alphaswarm_models/src/alphaswarm_models/spec.py` + `runtime.py` + `registry.py` Hash-locked **MLSkillSpec** + **MLSkillRuntime** mirroring the existing `AgentSpec`/`BotSpec`/`RLExperimentSpec`/`AnalysisSpec` runtime pattern. 
New Alembic 0081 tables: - `ml_skills` + `ml_skill_versions` (hash-locked snapshots) - `ml_skill_runs` (run ledger with `experiment_id` + `test_id` FKs, AGENTS rule 34) Seed skill YAMLs ship under `alphaswarm_models/configs/skills/`: - `regime_aware_alpha.yaml` — Classifier → Predictor (regime-specialised) - `multi_horizon_forecast.yaml` — Forecaster + Analyzer (sentiment overlay) ### `alphaswarm_models/src/alphaswarm_models/rules/` Inference-time OOD safety rules driven by a metaclass-driven `RuleRegistry`: - `OODGuard` — z-score threshold check. - `RangeGuard` — absolute min/max window check. - `TensorShapeGuard` — input-shape mismatch check. - `CircuitBreaker` — rolling-window failure tracker that trips at `max_failures` per `window_seconds`. Rule packs live under `alphaswarm_models/configs/rules/`; the default is `ood_default.yaml`. ### `alphaswarm/data/mcp/tools/ml.py` Fourteen `data.ml.*` DataMCP tools — the canonical Hard Rule 22 path agents use to drive the entire MLOps surface (predict, forecast, classify, segment, analyze, pull, compile, list, run skills, halt serving). Each tool registers via `@register_data_mcp_tool` so both transports — the in-process bridge and the FastAPI router/stdio binary — pick it up. ### `alphaswarm/ml_mcp/` + `alphaswarm-ml-mcp` binary A dedicated MCP server publishing the same `data.ml.*` slice under its own canonical URI (`settings.mcp_ml_canonical_uri`). Tokens minted for the MLOps audience cannot be replayed against the data MCP and vice versa (RFC 8707, Hard Rule 49). The RFC 9728 metadata document lives at `/.well-known/oauth-protected-resource/mcp/ml`. ### REST + Celery New routes under the existing `/ml/*` router plus a fresh `/ml/skills/*` router. Long-running ops dispatch to four new Celery modules: `ml_pull_tasks`, `ml_serving_tasks`, `ml_productionize_tasks`, `ml_skill_tasks`. All emit progress via `_progress.emit` (Hard Rule 4). 
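
Returning to the `rules/` package described earlier on this page: the `CircuitBreaker` rule is characterised as a rolling-window failure tracker that trips at `max_failures` per `window_seconds`. A minimal sketch of that technique follows. It is an assumption-laden illustration, not the shipped rule: the method names, the injectable `clock`, and the choice to reset automatically once old failures age out (rather than latching open) are all hypothetical.

```python
import time
from collections import deque


class CircuitBreaker:
    """Sketch of a rolling-window failure tracker (not the shipped rule)."""

    def __init__(self, max_failures, window_seconds, clock=time.monotonic):
        self.max_failures = max_failures
        self.window_seconds = window_seconds
        self._clock = clock          # injectable so tests can control time
        self._failures = deque()     # timestamps of recent failures, oldest first

    def _evict(self, now):
        # Drop failures that have fallen out of the rolling window.
        while self._failures and now - self._failures[0] > self.window_seconds:
            self._failures.popleft()

    def record_failure(self):
        now = self._clock()
        self._failures.append(now)
        self._evict(now)

    @property
    def tripped(self):
        self._evict(self._clock())
        return len(self._failures) >= self.max_failures


# Usage with a fake clock so the example is deterministic.
now = [0.0]
breaker = CircuitBreaker(max_failures=3, window_seconds=10.0, clock=lambda: now[0])
for t in (0.0, 1.0, 2.0):
    now[0] = t
    breaker.record_failure()
print(breaker.tripped)   # three failures inside 10 s
now[0] = 11.5
print(breaker.tripped)   # two of them have aged out
```

A production rule would likely also need a policy for what happens while tripped (reject, fall back, or alert); that decision is outside what this page specifies.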
### Frontend (Vite) Three new routes under `alphaswarm_client/src/routes/ml/`: - `/ml/skills` — registry browser + invocation form. - `/ml/serving` — live continuous-batching session monitor with per-session halt button. - `/ml/pull` — HuggingFace/TorchHub model puller. `KillSwitch.tsx` fans out to `POST /ml/serving/halt-all` alongside the existing halt endpoints (Hard Rule 2 in `frontend.mdc`). ### Identity + topology - `alphaswarm.config.settings` gains nine new `ml_*` knobs (cache budgets, serving defaults, OOD threshold, offline toggles, MCP canonical URI + URL). - `alphaswarm_platform/configs/deployment/topology.yaml` gains an `alphaswarm-ml-mcp` service entry (Hard Rule 47). - `alphaswarm/config/topology_fallback.py` maps `mcp_ml_url` → `alphaswarm-ml-mcp.http`. ## Agent usage The seed `mlops_assistant` AgentSpec at `configs/agents/mlops_assistant.yaml` drives the MLOps surface exclusively through the `data.ml.*` tools. Operators invoke it the same way as any other AgentSpec — `AgentRuntime.run(...)` (never call `router_complete` directly per Hard Rule 12). ## Validation ```bash # Source compile check: python -m py_compile alphaswarm_models/src/alphaswarm_models/{interfaces,handlers,adapters,rules,productionize,tasks}/**/*.py # New migration is hashed into the lock file: python scripts/ci/check_migration_immutability.py # DataMCP catalog discovery: curl http://localhost:8000/mcp/data/tools | jq '.tools[] | select(.name | startswith("data.ml."))' # MLOps MCP discovery: curl http://localhost:8000/.well-known/oauth-protected-resource/mcp/ml ``` ## What is explicitly out of scope - Mutating an existing migration. The 0081 migration is immutable once shipped (Hard Rule 6); future schema changes land in 0082+. - Streamlit / Solara surfaces. The legacy stack is rollback-only. - Free-text URN input. Every entity selection uses `EntityPicker` (Hard Rule 29). 
# Optimal-control / HJB math layer > The optimal-control package — [alphaswarm/optimal_control/](../alphaswarm/optimal_control/) — hosts the JAX-compiled implementations of two canonical Hamilton-Jacobi-Bellman problems: # Optimal-control / HJB math layer > **Audience:** quants extending AlphaSwarm with optimal-execution or > market-making models, plus AI agents that need to reason about > the closed-form solvers. The optimal-control package — [alphaswarm/optimal_control/](../alphaswarm/optimal_control/) — hosts the JAX-compiled implementations of two canonical Hamilton-Jacobi-Bellman problems: - **Avellaneda-Stoikov 2008** market making — [alphaswarm/optimal_control/avellaneda_stoikov.py](../alphaswarm/optimal_control/avellaneda_stoikov.py). - **Cartea-Jaimungal-Penalva 2015** inventory-penalised optimal liquidation — [alphaswarm/optimal_control/cartea_jaimungal.py](../alphaswarm/optimal_control/cartea_jaimungal.py). The convenience layer [alphaswarm/optimal_control/hjb_solver.py](../alphaswarm/optimal_control/hjb_solver.py) exposes ``solve_avst`` / ``solve_cj`` / ``value_function_to_arrow`` so the analysis-flow runner can dispatch them uniformly and persist the results to ``alphaswarm_gold_analysis_optimal_control`` per AGENTS rule 21. ## Where to invoke Three call sites cover almost every use case. ### 1. Direct Python API ```python from alphaswarm.optimal_control import compute_optimal_quotes, solve_avst # Single-point AvSt quotes — pure JIT-compiled JAX path. res = compute_optimal_quotes( mid_price=100.0, inventory=10.0, gamma=0.1, sigma=0.02, k=1.5, T_minus_t=1.0, ) print(res.bid, res.ask, res.half_spread) # Inventory grid via vmap. out = solve_avst( mid_price=100.0, inventory_grid=[-50, -25, 0, 25, 50], gamma=0.1, sigma=0.02, k=1.5, T_minus_t=1.0, ) ``` ### 2. 
Analysis flows (preferred — gives you UI form + Iceberg persistence)

```python
from alphaswarm.analysis import run_flow

result = run_flow(
    "optimal_control.avellaneda_stoikov_quotes",
    None,
    {
        "mid_price": 100.0,
        "inventory_min": -50.0,
        "inventory_max": 50.0,
        "inventory_step": 5.0,
        "gamma": 0.1,
        "sigma": 0.01,
        "k": 1.5,
        "T_minus_t": 1.0,
    },
)
```

The flow is a thin facade over ``solve_avst`` and writes its rows to the gold-tier ``alphaswarm_gold_analysis_optimal_control.`` namespace when invoked through ``AnalysisRuntime``.

### 3. Agent-callable DataMCPTool

```python
# inside an AgentSpec body the tool surfaces as ``data.optimal_control.solve_hjb``
result = ctx.tools["data.optimal_control.solve_hjb"].invoke(
    ctx=mcp_ctx,
    model="avst",
    mid_price=100.0,
    inventory=10.0,
    gamma=0.1,
    sigma=0.01,
    k=1.5,
    T_minus_t=1.0,
)
```

The tool is registered in [alphaswarm/data/mcp/tools/optimal_control.py](../alphaswarm/data/mcp/tools/optimal_control.py) and complies with AGENTS rule 22 — agents never read Iceberg / Postgres directly.

## Avellaneda-Stoikov (single-asset)

Reservation price plus optimal half-spread:

```
r(s, q, t) = s − q · γ · σ² · (T − t)
δ          = ½ · γ · σ² · (T − t) + (1/γ) · ln(1 + γ/k)
bid        = r − δ
ask        = r + δ
```

The JAX kernel ``_avst_kernel`` is JIT-compiled with ``@jax.jit`` and takes only Python floats / arrays — no I/O, no globals, no Python control flow keyed on values. ``vmap`` lets us evaluate the kernel across an inventory grid in one compiled call.

The closed-form GLFT 2013 variant (``glft_closed_form``) is what [alphaswarm.strategies.hft.alphas.GLFTMM](../alphaswarm/strategies/hft/alphas.py) calls on every event. Its ``2/γ · ln(1 + γ/k)`` term differs from the finite-horizon AvSt ``1/γ · ln(...)`` by a factor of two — that's the long-horizon limit.
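As a quick numerical check of the reservation-price and half-spread formulas in this section, the sketch below evaluates them in plain Python. It is not the JAX `_avst_kernel`; the function name `avst_quotes` is hypothetical, and the inputs simply reuse the values from the direct Python API example.

```python
import math


def avst_quotes(mid_price, inventory, gamma, sigma, k, T_minus_t):
    """Evaluate the Avellaneda-Stoikov formulas quoted above (plain Python, no JAX)."""
    # r = s - q * gamma * sigma^2 * (T - t)
    reservation = mid_price - inventory * gamma * sigma**2 * T_minus_t
    # delta = 0.5 * gamma * sigma^2 * (T - t) + (1 / gamma) * ln(1 + gamma / k)
    half_spread = 0.5 * gamma * sigma**2 * T_minus_t + (1.0 / gamma) * math.log(1.0 + gamma / k)
    return reservation - half_spread, reservation + half_spread, half_spread


bid, ask, half_spread = avst_quotes(
    mid_price=100.0, inventory=10.0, gamma=0.1, sigma=0.02, k=1.5, T_minus_t=1.0
)
```

With a long inventory the midpoint of the two quotes sits below the market mid, which is the inventory skew the reservation price encodes.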
## Cartea-Jaimungal-Penalva (inventory-penalised liquidation)

The linear-quadratic ansatz ``H(t, q, S) = q·S + h₂(t)·q² + h₁(t)·q + h₀(t)`` reduces the HJB to a system of three coupled ODEs:

```
dh₂/dt = −φ − h₂² / κ
dh₁/dt = −h₁ · h₂ / κ
dh₀/dt = −h₁² / (4 · κ)
```

The system is solved backwards from the terminal conditions ``h₂(T) = −α`` and ``h₁(T) = h₀(T) = 0`` via fixed-step RK4. The optimal feedback trading rate is

```
ν*(t, q) = − (h₂(t) · q + ½ · h₁(t)) / κ
```

When ``φ > 0`` the agent sells (or buys) faster than TWAP near the terminal because ``h₂`` decreases; when ``φ = 0`` and ``α = 0`` the rate collapses to zero (no urgency). With ``φ = 0`` and ``α > 0`` the ODEs above give ``h₁ ≡ 0`` and ``ν*(t, q) = q / (κ/α + (T − t))``, which approaches the TWAP rate ``q / (T − t)`` as ``α → ∞``.

## Pairing with reinforcement learning

The closed forms are reference benchmarks. To learn a richer policy for non-Gaussian dynamics, drive an RL agent through:

- [alphaswarm.rl.envs.MarketMakingEnv](../alphaswarm/rl/envs/market_making_env.py) — PPO/SAC over AvSt knobs.
- [alphaswarm.rl.envs.OptimalExecutionEnv](../alphaswarm/rl/envs/optimal_execution_env.py) — Cartea-Jaimungal block liquidation.

Sample experiment YAMLs ship under [configs/rl/](../configs/rl/) (``avellaneda_stoikov_mm.yaml``, ``cartea_jaimungal_execution.yaml``).

## See also

- [alphaswarm_docs/portfolio-options-mm.md](../../concepts/strategy/portfolio-options-mm.md) — Lucic-Tse multi-strike extension.
- [alphaswarm_docs/microstructure-toxicity.md](../../concepts/strategy/microstructure-toxicity.md) — toxicity regime detection + agent adapter loop.
- [alphaswarm_docs/installation.md](../../intro/installation.md) — how to install the ``[optimal-control]`` extra (JAX, finhjb, fast-vollib, mbt_gym).

# Portfolio options market making — Lucic-Tse 2024-2026

> The single-asset Avellaneda-Stoikov framework breaks down for options portfolios. An options book carries simultaneous Δ / Γ / ν / ρ exposures across hundreds of strikes; the spread the dealer should ...
# Portfolio options market making — Lucic-Tse 2024-2026

> **Audience:** options market-makers, quant developers writing
> spread-prediction models, and agents that need to reason about
> portfolio-level risk skew.

The single-asset Avellaneda-Stoikov framework breaks down for options portfolios. An options book carries simultaneous Δ / Γ / ν / ρ exposures across hundreds of strikes; the spread the dealer should quote at strike ``K`` is no longer independent of the inventory at strike ``K′``.

The breakthrough closed-form solution for portfolio-level options MM landed in 2024-2026: V. Lucic and A. Tse, *"Optimal option market making and volatility arbitrage"*. AlphaSwarm implements that framework in [alphaswarm/options/portfolio_mm.py](../alphaswarm/options/portfolio_mm.py).

## Two equations

**1. Per-strike vol-arb alpha.**

```
α(K, T) = ½ · S² · Γ(K, T) · (σ_real² − σ_imp²)
```

This is the option-equivalent of the spot vol-arb edge: when realised volatility exceeds implied volatility, the dealer collects vega exposure at a positive expected value.

**2. Inventory-skewed bid/ask quote.**

```
bid(K, T)  = mid(K, T) − δ(K, T) − skew(K, T)
ask(K, T)  = mid(K, T) + δ(K, T) − skew(K, T)
δ(K, T)    = base_spread + ½ · hedge_cost · |Γ(K, T)|
skew(K, T) = γ_inv · ν(K, T) · (Σ_vol · q_per_expiry)
```

where ``Σ_vol`` is the (rank-reducible) covariance of the implied-vol factors across maturities and ``q`` is the inventory matrix.

The Riccati system the linear-quadratic ansatz produces is closed-form in steady state — no PDE solver required. AlphaSwarm implements that closed form in pure JAX with ``jnp.einsum`` for the matrix contractions.
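To make the two equations concrete, here is a small scalar sketch for a single ``(K, T)`` cell in plain Python. It is an illustration under stated assumptions, not the module's JAX code: the helper names are hypothetical, and ``risk_load`` stands in for the one entry of ``Σ_vol · q_per_expiry`` that applies to the cell's expiry.

```python
def vol_arb_alpha(spot, gamma, realized_vol, implied_vol):
    """alpha = 0.5 * S^2 * Gamma * (sigma_real^2 - sigma_imp^2)."""
    return 0.5 * spot**2 * gamma * (realized_vol**2 - implied_vol**2)


def covariance_load(sigma_vol_row, q_per_expiry):
    """One entry of (Sigma_vol . q_per_expiry): a covariance row dotted with inventory."""
    return sum(c * q for c, q in zip(sigma_vol_row, q_per_expiry))


def skewed_quotes(mid, gamma, vega, risk_load, gamma_inv, base_spread, hedge_cost):
    """Return (bid, ask) from the inventory-skewed quote equations."""
    half_spread = base_spread + 0.5 * hedge_cost * abs(gamma)
    skew = gamma_inv * vega * risk_load
    return mid - half_spread - skew, mid + half_spread - skew


# Illustrative numbers only (assumed, not taken from any shipped config).
alpha = vol_arb_alpha(spot=100.0, gamma=0.05, realized_vol=0.22, implied_vol=0.20)
load = covariance_load([0.04, 0.01], [10.0, -5.0])
bid, ask = skewed_quotes(
    mid=2.50, gamma=0.05, vega=0.12, risk_load=load,
    gamma_inv=0.05, base_spread=0.05, hedge_cost=0.001,
)
```

A positive ``risk_load`` shifts both quotes down by the same amount, so the quoted width stays ``2 · δ`` while the midpoint leans against the existing inventory.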
## Calling the solver

```python
import numpy as np
from alphaswarm.analysis.pricing import greeks_grid
from alphaswarm.options.portfolio_mm import LucicTseParams, compute_lucic_tse_quotes

strikes = np.array([95., 100., 105.])
expiries = np.array([0.05, 0.1, 0.25])
grid = greeks_grid(spot=100., strikes=strikes, expiries=expiries, vol=0.2)

quotes = compute_lucic_tse_quotes(
    spot=100.0,
    mid_quotes=grid["price"],
    gamma_surface=grid["gamma"],
    vega_surface=grid["vega"],
    realized_vol=0.22,  # the dealer's view
    implied_vol=np.full_like(grid["price"], 0.20),  # market quote
    inventory=np.zeros_like(grid["price"]),
    params=LucicTseParams(gamma_inv=0.05, base_spread=0.05, hedge_cost=0.001),
)
print(quotes.bid)  # (n_expiries, n_strikes)
print(quotes.ask)
print(quotes.expected_pnl)
```

JAX optionality: when the ``[optimal-control]`` extra is missing, the module degrades to NumPy. Numerical results are identical, just slower.

## Analysis-flow surface

For UI / agent flows, use ``optimal_control.lucic_tse_portfolio_quotes`` or the namespace alias ``derivatives.lucic_tse_quotes``. Both wrap ``compute_lucic_tse_quotes`` and persist a row-per-cell table to ``alphaswarm_gold_analysis_optimal_control.lucic_tse_portfolio_quotes``.

```python
from alphaswarm.analysis import run_flow

out = run_flow(
    "optimal_control.lucic_tse_portfolio_quotes",
    None,
    {
        "spot": 100.0,
        "strikes": [90, 95, 100, 105, 110],
        "expiries": [0.05, 0.1, 0.25, 0.5],
        "realized_vol": 0.22,
        "implied_vol": 0.20,
        "gamma_inv": 0.05,
        "base_spread": 0.05,
        "hedge_cost": 0.001,
    },
)
```

## Pairing with the JAX/fast-vollib Greek path

Building the Greek surface dominates the per-step cost. AlphaSwarm ships a JAX/vmap drop-in path in [alphaswarm/options/greeks_jax.py](../alphaswarm/options/greeks_jax.py) that auto-detects ``fast_vollib`` (Triton-fused on H100) when the extra is installed, otherwise JIT-compiles a hand-rolled BSM kernel. The legacy ``alphaswarm.analysis.pricing.greeks_grid`` routes through this fast path automatically.
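For readers who want to see what a hand-rolled BSM kernel computes for the two Greeks the solver consumes, the sketch below gives the textbook Black-Scholes-Merton gamma and vega in plain Python. It is not the code in `greeks_jax.py`; the function name is hypothetical, and a zero interest rate with no dividends is assumed by default.

```python
import math


def bsm_gamma_vega(spot, strike, expiry, vol, rate=0.0):
    """Textbook BSM gamma and vega for a European option (no dividends assumed)."""
    sqrt_t = math.sqrt(expiry)
    d1 = (math.log(spot / strike) + (rate + 0.5 * vol**2) * expiry) / (vol * sqrt_t)
    pdf_d1 = math.exp(-0.5 * d1**2) / math.sqrt(2.0 * math.pi)  # standard normal density
    gamma = pdf_d1 / (spot * vol * sqrt_t)
    vega = spot * pdf_d1 * sqrt_t  # per unit of volatility, not per vol point
    return gamma, vega


gamma, vega = bsm_gamma_vega(spot=100.0, strike=100.0, expiry=0.25, vol=0.2)
```

The two outputs satisfy the identity ``vega = S² · σ · T · Γ``, which is a convenient sanity check when swapping between Greek back ends.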
## RL pairing [alphaswarm.rl.envs.LucicTsePortfolioEnv](../alphaswarm/rl/envs/lucic_tse_options_env.py) exposes the framework as a Gym environment so PPO/SAC can learn to adapt ``γ_inv`` / ``base_spread`` as a function of the realised vs implied gap. Sample config: [configs/rl/lucic_tse_options.yaml](../configs/rl/lucic_tse_options.yaml). ## See also - [alphaswarm_docs/optimal-control.md](../../concepts/strategy/optimal-control.md) — single-asset HJB. - [alphaswarm_docs/microstructure-toxicity.md](../../concepts/strategy/microstructure-toxicity.md) — the toxicity-aware regime adapter that scales ``γ_inv`` automatically during toxic flow. # PredictorHub > The report calls out two empirical findings from the literature: # PredictorHub > Status: **Phase 5 shipped** (Alembic 0044). Hub: > [`alphaswarm/ml/predictors/`](../alphaswarm/ml/predictors/). ## Why unify The report calls out two empirical findings from the literature: * **XGBoost regression** -- significantly superior accuracy at pure numerical return prediction (low-noise, structured features) * **LSTM classification** -- demonstrably better at directional classification over medium-term 7-30 day horizons (sequence-aware, handles regime shifts) The platform already had both models available under [`alphaswarm/ml/models/`](../alphaswarm/ml/models/), but they were registered with different config keys, trained via different code paths, and serialised inconsistently. Phase 5 consolidates them under a single :class:`PredictorSpec` shape that the hub uses to pick the right factory. 
## PredictorSpec The spec is hash-locked Pydantic: ```python from alphaswarm.ml.predictors import PredictorSpec # XGBoost regression — predict next-day return spec_xgb = PredictorSpec( name="xgb_returns_1d", model_kind="xgboost", label_kind="regression", target_horizon="1d", feature_columns=["mom_5", "mom_20", "rsi_14", "vol_20"], target_column="ret_1d", hyperparams={"max_depth": 6, "learning_rate": 0.05, "n_estimators": 500}, ) # LSTM classification — predict 20-day direction (binary) spec_lstm = PredictorSpec( name="lstm_direction_20d", model_kind="lstm", label_kind="classification", target_horizon="20d", feature_columns=["close", "volume", "rsi_14", "macd"], target_column="dir_20d", sequence_length=60, hyperparams={"hidden_size": 64, "num_layers": 2, "dropout": 0.2}, classes=["down", "up"], ) ``` Re-snapshotting the spec into the persistence layer: ```python from alphaswarm.ml.predictors import persist_predictor_spec row_id, created = persist_predictor_spec(spec_xgb) print(row_id, created) # created=True the first time, False if hash unchanged ``` ## PredictorHub ```python from alphaswarm.ml.predictors import PredictorHub hub = PredictorHub() model = hub.build(spec_xgb) model.fit(X_train, y_train) preds = model.predict(X_test) ``` The hub picks the right factory from the ``(model_kind, label_kind)`` registry. Adding a new model: ```python from alphaswarm.ml.predictors import register_predictor @register_predictor(model_kind="transformer", label_kind="classification") def my_transformer_factory(spec): ... 
return TransformerClassifier(**spec.hyperparams) ``` ## Reference factories The hub ships four reference factories matching the report's recommendations: | ``model_kind`` | ``label_kind`` | Underlying class | | --- | --- | --- | | ``xgboost`` | ``regression`` | :class:`XGBModel` from :mod:`alphaswarm.ml.models.tree` | | ``xgboost`` | ``classification`` | :class:`XGBModel` (with binary or multi-class objective) | | ``lstm`` | ``classification`` | :class:`LSTMModel` from :mod:`alphaswarm.ml.models.torch.lstm` | | ``lstm`` | ``regression`` | :class:`LSTMModel` (regression head) | ## Hash-locked versioning The Phase 5 ``predictor_spec_versions`` table mirrors the spec-version pattern used by AgentSpec / BotSpec / RLExperimentSpec / AnalysisSpec. Re-running ``persist_predictor_spec`` with an unchanged spec returns ``created=False``; a single byte change to the spec body (new feature, new hyperparam) produces a fresh row. This means every "how was this model trained?" question has a precise answer pinned by the SHA-256 hash. ## Wiring into agents Phase 5 exposes the hub through the existing [`/ml/test`](../alphaswarm/api/routes/ml.py) endpoints (REST) and three DataMCP tools (agent-facing): * ``data.ml.predictors.list`` -- list registered specs * ``data.ml.predictors.train`` -- snapshot a spec + train * ``data.ml.predictors.deploy_pair`` -- A/B-test two trained models Agents query the catalogue first, snapshot a spec, train, and deploy without an ORM import. # Statistical arbitrage primitives > | Function | Returns | Use | | --- | --- | --- | | :func:`johansen_test` | :class:`JohansenResult` | Multivariate cointegration rank among >=2 series | | :func:`rolling_zscore` | pandas Series | Norma... # Statistical arbitrage primitives > Status: **Phase 4 shipped**. Module: > [`alphaswarm/math/arbitrage.py`](../alphaswarm/math/arbitrage.py). Analysis flows: > [`alphaswarm/analysis/flows/arbitrage.py`](../alphaswarm/analysis/flows/arbitrage.py). 
## Six primitives

| Function | Returns | Use |
| --- | --- | --- |
| :func:`johansen_test` | :class:`JohansenResult` | Multivariate cointegration rank among >=2 series |
| :func:`rolling_zscore` | pandas Series | Normalized spread for entry/exit thresholds |
| :func:`half_life` | :class:`HalfLifeResult` | Ornstein-Uhlenbeck mean-reversion timescale |
| :func:`pair_signal` | :class:`PairSignal` | Per-bar ENTRY/EXIT/HOLD for a pair strategy |
| :func:`ah_share_basis` | :class:`BasisResult` | A-share vs H-share cross-market basis |
| :func:`adr_basis` | :class:`BasisResult` | ADR / GDR vs underlying foreign equity basis |

The existing [`alphaswarm/data/cointegration.py`](../alphaswarm/data/cointegration.py) module keeps the ADF + Engle-Granger primitives -- Phase 4 doesn't duplicate them.

## Johansen test

The Engle-Granger test handles two series; Johansen generalises to ``n >= 2`` and reports the **rank** of the cointegration space (how many independent stationary combinations exist among the series).

```python
import pandas as pd
from alphaswarm.math.arbitrage import johansen_test

# Wide DataFrame: one column per series
prices = pd.DataFrame({
    "BABA_ADR": [...],
    "9988_HKEX_USD": [...],
    "SPY": [...],
})
result = johansen_test(prices, deterministic="constant", k_ar_diff=1)
print(result.rank, result.is_cointegrated_95)
# result.cointegrating_vectors: list[list[float]] -- the n rows of beta
```

## Pair signal state machine

The :func:`pair_signal` function reads the latest spread + a rolling window and emits one of:

| Signal | Z-score | In position? |
| --- | --- | --- |
| ``ENTRY_LONG_SPREAD`` | ``z >= +entry_threshold`` | False |
| ``ENTRY_SHORT_SPREAD`` | ``z <= -entry_threshold`` | False |
| ``EXIT_LONG_SPREAD`` | ``\|z\| <= exit_threshold`` AND z >= 0 | True |
| ``EXIT_SHORT_SPREAD`` | ``\|z\| <= exit_threshold`` AND z < 0 | True |
| ``HOLD`` | otherwise | any |

The signal also reports the estimated half-life via :func:`half_life`.
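The decision table above can be read as a small pure function. The sketch below mirrors the table row by row; it is an illustration rather than the body of :func:`pair_signal`, and the function name and default thresholds are assumptions.

```python
def classify_pair_signal(z, in_position, entry_threshold=2.0, exit_threshold=0.5):
    """Map a z-score and position flag to a signal, following the table above."""
    if not in_position:
        if z >= entry_threshold:
            return "ENTRY_LONG_SPREAD"
        if z <= -entry_threshold:
            return "ENTRY_SHORT_SPREAD"
        return "HOLD"
    # In a position: exit once the spread has reverted inside the exit band.
    if abs(z) <= exit_threshold:
        return "EXIT_LONG_SPREAD" if z >= 0 else "EXIT_SHORT_SPREAD"
    return "HOLD"


signals = [
    classify_pair_signal(2.4, in_position=False),
    classify_pair_signal(1.1, in_position=True),
    classify_pair_signal(0.3, in_position=True),
]
```

Keeping the classifier free of I/O makes it easy to unit-test each row of the table independently of the rolling-window code that produces ``z``.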
Strategies typically reject opportunities where the half-life exceeds a horizon-based ``half_life_min`` (the spread will take too long to revert; capital is better deployed elsewhere).

## A/H share basis

The report calls out a specific cross-market arbitrage: mainland A-shares vs Hong Kong H-shares of the same company. The two share classes carry the same economic rights but trade in different regulatory, liquidity, and currency environments, which produces persistent divergence and violent reversion.

```python
from alphaswarm.math.arbitrage import ah_share_basis

# ICBC: 1398.HK in HKD, 601398.SS in CNY. CNYHKD ~ 0.93 (CNY per HKD)
res = ah_share_basis(
    a_price=5.10,
    h_price=4.82,
    fx_rate=0.93,
    conversion_ratio=1.0,
    transaction_cost_bps=20.0,
    threshold_bps=100.0,
)
print(res.is_arbitrage, res.arbitrage_direction)
```

The threshold default of 100 bps is conservative; CTA-style operators typically use 60-80 bps. ``transaction_cost_bps`` captures the round-trip cost (commissions + bid/ask + stamp duty + FX hedge cost).

## ADR / GDR basis

Same logic for US-listed ADRs and offshore-listed GDRs. The Phase 1 :class:`InstrumentADR` / :class:`InstrumentGDR` rows carry the ``conversion_ratio`` field directly so the basis algorithm reads it without a manual lookup.

```python
from alphaswarm.math.arbitrage import adr_basis

# BABA ADR (NYSE) vs 9988 (HKEX). 1 ADR represents 8 H-shares.
res = adr_basis(
    adr_price=85.00,
    underlying_price=80.50,  # in HKD
    fx_rate=7.84,  # HKD per USD
    conversion_ratio=8.0,
    transaction_cost_bps=30.0,
    depository_fee_bps=5.0,
    threshold_bps=80.0,
)
```

The depository fee is annualised; over short holding periods (hours, days) it's negligible, but on long-horizon basis trades it materially eats into the alpha.
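The effect of holding period on the annualised depository fee is simple arithmetic. The sketch below assumes straight-line proration over a 365-day year, which is an assumption for illustration; how :func:`adr_basis` actually accrues the fee is not stated here, and the helper name is hypothetical.

```python
def prorated_fee_bps(annual_fee_bps, holding_days, days_per_year=365.0):
    """Straight-line share of an annualised fee over a holding period, in bps."""
    return annual_fee_bps * holding_days / days_per_year


one_day = prorated_fee_bps(5.0, holding_days=1)       # a tiny fraction of a basis point
six_months = prorated_fee_bps(5.0, holding_days=180)  # a few bps against an 80 bps threshold
```

On a one-day trade the 5 bps annual fee is negligible next to the 30 bps transaction cost in the example, whereas over six months it consumes a visible share of an 80 bps edge.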
## Analysis flows Four flows wrap the primitives so the AnalysisRuntime can drive them with the standard preview / persist / chart machinery: * ``arbitrage.johansen_basket`` -- Johansen test on a column subset * ``arbitrage.pair_signal`` -- latest pair signal from a spread column * ``arbitrage.ah_share_basis`` -- per-bar A/H basis monitor * ``arbitrage.adr_basis`` -- per-bar ADR basis monitor Each is registered via [`@register_analysis_flow`](../alphaswarm/analysis/registry.py) so the lab UI builds a form automatically. ## Agent surface (Phase 5) The matching DataMCP tools (added in Phase 5): * ``data.arbitrage.cointegration_pair`` -- two-series Engle-Granger * ``data.arbitrage.johansen_basket`` -- multivariate Johansen finder * ``data.arbitrage.ah_share_monitor`` -- A/H share monitor * ``data.arbitrage.adr_underlying_basis`` -- ADR basis monitor Agent code uses these tools, not the math primitives directly (AGENTS rule 22). # Strategy Browser > The Strategy Browser is a dedicated Solara page at `/strategy-browser` that exposes two complementary views of the strategy library: # Strategy Browser > Doc map: [alphaswarm_docs/index.md](../../intro/index.md) · Strategy lifecycle: [alphaswarm_docs/strategy-lifecycle.md](../../concepts/strategy/strategy-lifecycle.md). The Strategy Browser is a dedicated Solara page at `/strategy-browser` that exposes two complementary views of the strategy library: 1. **Saved strategies** — everything a user has persisted via `POST /strategies/` (the Strategy Development page). Filter by tag, status, name, or minimum Sharpe; click through for version history, recent tests, equity curves, and a deep link into the per-strategy MLflow experiment. 2. **Alpha catalog** — the code-available `IAlphaModel` classes (both ported TA strategies and the native ML model wrappers), their tags, and a list of reference YAMLs in `configs/strategies/` that instantiate each one. Handy for discovering what's available before saving your own. 
## API surface

- `GET /strategies/browse?tag=&status=&query=&min_sharpe=` → list of enriched strategy rows with latest backtest metrics and the MLflow run id of the most recent run.
- `GET /strategies/browse/catalog` → every registered `IAlphaModel` class, its module path, tag list, and reference YAMLs under `configs/strategies/`.
- `GET /strategies/{id}/experiment` → experiment name (`strategy/`), MLflow tracking URI, and up to 50 linked `BacktestRun` rows.

## Strategy tags

Every new concrete alpha carries a module-level `STRATEGY_TAGS` tuple (e.g. `("pattern", "mean-reversion", "quant-trading")`). `alphaswarm.strategies.list_strategy_tags()` aggregates the tuples across every class in `alphaswarm.strategies.__all__`, so the browser's tag filter reflects the code without any duplicated metadata.

## MLflow wiring

When `run_backtest_from_config` is called with a `strategy_id`, the underlying `log_backtest` helper uses `experiment_name_for_strategy(strategy_id)` to pick the per-strategy experiment (`strategy/`) and also sets the `alphaswarm.strategy_id` tag on the run. After the backtest completes, the resulting MLflow run id is written onto `BacktestRun.mlflow_run_id` so the browser can deep-link.

To prevent the generic Celery autolog signals from opening a parent MLflow run for every backtest task (which would swallow the nested `log_backtest` run), the `alphaswarm.tasks.backtest_tasks.*` / `alphaswarm.tasks.paper_tasks.*` / `alphaswarm.tasks.ml_tasks.*` / `alphaswarm.tasks.factor_tasks.*` task names are explicitly listed in `alphaswarm.mlops.autolog._AUTOLOG_SKIP_TASKS`.
## Ported strategy catalog Shipped alphas (at 0.4): | Alpha class | Tags | Reference recipe | |----------------------------|------------------------------------------------|-------------------------------------------| | `AwesomeOscillatorAlpha` | momentum, oscillator, quant-trading | `configs/strategies/awesome_oscillator.yaml` | | `HeikinAshiAlpha` | pattern, reversal, quant-trading | `configs/strategies/heikin_ashi.yaml` | | `DualThrustAlpha` | intraday, breakout, quant-trading | `configs/strategies/dual_thrust.yaml` | | `ParabolicSARAlpha` | trend, quant-trading | `configs/strategies/parabolic_sar.yaml` | | `LondonBreakoutAlpha` | breakout, fx, quant-trading | `configs/strategies/london_breakout.yaml` | | `BollingerWAlpha` | pattern, mean-reversion, quant-trading | `configs/strategies/bollinger_w.yaml` | | `ShootingStarAlpha` | pattern, reversal, quant-trading | `configs/strategies/shooting_star.yaml` | | `RsiPatternAlpha` | pattern, mean-reversion, quant-trading | `configs/strategies/rsi_pattern.yaml` | | `OilMoneyRegressionAlpha` | statistical, mean-reversion, quant-trading | `configs/strategies/oil_money.yaml` | | `SmaCross` | momentum, reference, backtesting.py | `configs/strategies/sma_cross.yaml` | | `Sma4Cross` | momentum, reference, backtesting.py | `configs/strategies/sma4_cross.yaml` | | `TrailingATRAlpha` | momentum, trailing-stop, reference | `configs/strategies/trailing_atr.yaml` | | `BaseAlgoExample` | reference, stock-analysis-engine | `configs/strategies/base_algo_example.yaml` | ## ML Training page A sibling Solara page at `/ml` — launch any `alphaswarm.ml` training run from a form (pick feature handler + model class + segments), stream progress through the existing `/chat/stream/{task_id}` WebSocket, and see the resulting `ModelVersion` rows. 
## Browser export flow ```mermaid flowchart LR Picker[User picks securities + indicators + transformations] --> Form[StrategyBrowser form] Form -->|POST| API["/pipelines/from-browser"] API --> Spec[FeatureSet spec] Spec --> DB[(feature_sets row)] Spec --> Topic["Kafka features.preview.<name>.v1"] Topic --> Stream[live overlay charts] ``` # Strategy Development (Consolidated `/strategy-development/*`) > ```mermaid flowchart TB L["StrategyDevLayout"] L --> Composer["/composer"] L --> Sim["/simulation"] L --> Ideate["/ideation"] L --> Single["/single-predict"] L --> Batch["/predict-batch (Iceberg-aware... # Strategy Development (Consolidated `/strategy-development/*`) The Vite frontend exposes a single consolidated umbrella for every strategy-authoring + strategy-testing surface under `/strategy-development/*`. Twelve sibling sub-routes share the same persistent left sub-nav, a run-summary KPI strip, and a cross-route React context so navigating between (say) Compose → Simulate → Compare-Models keeps all the inputs (deployment id, symbols, time window, feature row, last task id) coherent. 
```mermaid flowchart TB L["StrategyDevLayout"] L --> Composer["/composer"] L --> Sim["/simulation"] L --> Ideate["/ideation"] L --> Single["/single-predict"] L --> Batch["/predict-batch (Iceberg-aware)"] L --> Compare["/compare-models"] L --> Scenario["/scenario-perturbation"] L --> Historical["/historical-eval"] L --> Live["/live-test"] L --> RunCmp["/run-comparator"] L --> Docs["/document-library (papers)"] L --> Lib["/library (components)"] ``` ## Surfaces | Route | Component | Wraps | | --- | --- | --- | | `/strategy-development` | `StrategyDevIndexRoute` | redirects to `/strategy-development/composer` | | `/strategy-development/composer` | `StrategyComposer` | `GET /strategies/components` + `POST /strategies` | | `/strategy-development/simulation` | `SimulationCreator` | dispatches to `BotRuntime` / `LobBacktestEngine` / `AlphaBacktestExperiment` / `RLRuntime` / paper | | `/strategy-development/ideation` | `IdeationConsole` | `POST /agents/ideate` (router_complete + research_papers RAG) | | `/strategy-development/single-predict` | `SinglePredictRoute` | `POST /ml/test/single` | | `/strategy-development/predict-batch` | `PredictBatchRoute` | `POST /ml/test/batch` (now Iceberg-aware) + `POST /ml/test/upload-csv` | | `/strategy-development/compare-models` | `CompareModelsRoute` | `POST /ml/test/compare` | | `/strategy-development/scenario-perturbation` | `ScenarioPerturbationRoute` | `POST /ml/test/scenario` | | `/strategy-development/historical-eval` | `HistoricalEvalRoute` | `POST /ml/evaluate` + `GET /ml/evaluations/{task_id}` | | `/strategy-development/live-test` | `LiveTestRoute` | `POST /ml/live-test/start` + `useLiveStream` | | `/strategy-development/run-comparator` | `RunComparator` | chained pairwise `POST /ml/test/compare` | | `/strategy-development/document-library` | `DocumentLibrary` | `GET /rag/papers`, `POST /rag/papers/upload`, `POST /rag/papers/{id}/synthesize` | | `/strategy-development/library` | `StrategyLibraryRoute` | `GET 
/strategies/components` (read-only registry browser) | ## Cross-route state `alphaswarm_client/src/components/strategy-dev/StrategyDevContext.tsx` holds the shared selection (`deploymentId`, `deploymentIdB`, `symbols`, `start`, `end`, `featureRowText`, `perturbations`, `lastTaskId`, `lastRunSummary`, `composerYaml`, `strategyId`). The context is backed by `localStorage` under the key `alphaswarm.strategy-dev.selection.v1` so a hard refresh doesn't lose state. Sub-routes use `useStrategyDev()` to read + patch the selection: ```ts const { selection, setSelection } = useStrategyDev(); setSelection({ deploymentId: "abc", lastTaskId: res.task_id }); ``` ## KPI strip `RunKpiStrip` reads `selection.lastRunSummary` and renders Sharpe / total return / max DD / hit rate / trades in the standard `MetricsGrid`. The strip is intentionally idle when no run has been launched in the current session so the surface stays calm. ## Hard-rule alignment - Frontend rule (`.cursor/rules/frontend.mdc`): every long-running task is consumed via the existing `useChatStream` / `useLiveStream` hooks so the WS pipeline + kill-switch + sandbox banner all stay intact. - AGENTS rule 2: LLM-driven surfaces (`IdeationConsole`, `PaperSynthesisDrawer`) route through `router_complete` server-side. - AGENTS rule 4: progress framing is unchanged — sub-routes never publish to Redis directly; they always go through `_progress.emit` on the backend. ## How to add a sub-route 1. Create `alphaswarm_client/src/routes/strategy-development//page.tsx` wrapping the new component. 2. Add the new component under `alphaswarm_client/src/components/strategy-dev/`. 3. Register the route in `alphaswarm_client/src/routes.tsx`'s `DYNAMIC_ROUTES` entry for `strategy-development`. 4. Add a `StrategyDevSubRoute` entry to `alphaswarm_client/src/components/strategy-dev/SubNav.tsx` so the new route appears in the persistent left nav. ## Legacy The legacy webui `/ml/test` page is now superseded by this consolidated surface. 
Bookmarks still work because the flat REAL_ROUTES entry is preserved, but the sidebar no longer surfaces it. # Strategy Lifecycle > Every strategy in AlphaSwarm follows the same six-step cycle: **build → save → version → test → paper → live** # Strategy Lifecycle > Doc map: [alphaswarm_docs/index.md](../../intro/index.md) · Backtest dispatch sequence: [alphaswarm_docs/flows.md#2-backtest-dispatch](../../concepts/platform/flows.md#2-backtest-dispatch). Every strategy in AlphaSwarm follows the same six-step cycle: **build → save → version → test → paper → live**. ## Build Open the Strategy Development page (``/strategy``) or hand-write a YAML recipe under ``configs/strategies/``. Every recipe has the shape: ```yaml strategy: class: FrameworkAlgorithm kwargs: universe_model: {class: StaticUniverse, kwargs: {symbols: [...]}} alpha_model: {class: MeanReversionAlpha, kwargs: {...}} portfolio_model: {class: HierarchicalRiskParity, kwargs: {...}} risk_model: {class: BasicRiskModel, kwargs: {...}} execution_model: {class: MarketOrderExecution, kwargs: {}} backtest: class: EventDrivenBacktester kwargs: {initial_cash: 100000, start: "2023-01-01", end: "2024-12-31"} ``` ## Save + version Clicking **Save as new strategy** calls ``POST /strategies/`` which writes a ``Strategy`` row plus ``StrategyVersion`` v1. Every subsequent ``PUT /strategies/{id}`` auto-bumps the version; the diff viewer in the UI and the ``GET /strategies/{id}/versions/{v}/diff`` endpoint surface a unified diff between any two versions. ## Test The **Test** card in the Strategy Development page posts to ``POST /strategies/{id}/test`` with an engine + window. A Celery task runs the backtest, stores a ``StrategyTest`` row, and links it back to the strategy. The **Tests** tab lists every run with its Sharpe, drawdown, and total return. Each test also fires the MLflow autolog signal, so every test becomes a first-class MLflow run tagged ``alphaswarm.celery.task = alphaswarm.tasks.backtest_tasks.run_backtest``. 
## Paper + live When a strategy has a green testing record, promote it via ``POST /paper/start`` (the same pipeline the Paper Trading page uses). The paper engine shares 100% of the strategy code path with the backtester — no code changes required. ## Archive ``DELETE /strategies/{id}`` soft-deletes by setting ``status=archived``. Archived strategies are hidden from the default list but all versions + tests remain queryable via the API for audit. ## State machine ```mermaid stateDiagram-v2 [*] --> Draft : author + save Draft --> Versioned : freeze YAML Versioned --> Backtested : run_backtest succeeds Backtested --> Paper : promote (operator) Paper --> Live : promote (operator) Backtested --> Versioned : revise + bump version Paper --> Backtested : reset Live --> Paper : pause Live --> Archived : decommission Archived --> [*] ``` # Strategy template catalog (Phase 7 of the multi-tenant rollout) > Hard rule 35 in [`AGENTS.md`](https://github.com/julianwileymac/alphaswarm/blob/main/AGENTS.md): "Read-only strategy templates (LEAN, community, internal references) MUST be loaded as ``resources`` rows with ``resource_type=strategy_template``. The A... # Strategy template catalog (Phase 7 of the multi-tenant rollout) Read-only strategy templates — QuantConnect LEAN's ``Algorithm.Python/*.py`` examples first, with hooks for community + internal libraries — are ingested into the polymorphic [`resources`](../alphaswarm/persistence/models_resources.py) table and surfaced to users + agents via the strategy template browser, the MCP catalog, and the AST translator. ## Hard rule Hard rule 35 in [`AGENTS.md`](https://github.com/julianwileymac/alphaswarm/blob/main/AGENTS.md): "Read-only strategy templates (LEAN, community, internal references) MUST be loaded as ``resources`` rows with ``resource_type='strategy_template'``. The AST translator lives in ``alphaswarm/strategies/lean/translator.py``; new translators register through the same pattern." 
## Ingestion ```bash # One-shot: clone LEAN + ingest every Algorithm.Python/*.py python -m scripts.ingest_lean_templates --clone # Re-ingest from a local checkout LEAN_REPO_PATH=/opt/Lean python -m scripts.ingest_lean_templates # Dry-run (parse + report only) python -m scripts.ingest_lean_templates --lean-path /opt/Lean --dry-run ``` The ingester is idempotent — re-running with a newer LEAN revision overwrites the matching rows in place. Each Resource carries the parsed metadata (`class_name`, `base_class`, `asset_classes`, `indicators`, `universe_symbols`, `tags`) plus the raw LEAN source in `meta.raw_source` so the translator + frontend preview don't need to re-read the file system. ## Translator ```python from alphaswarm.strategies.lean.translator import translate_lean_to_framework skeleton = translate_lean_to_framework(lean_source) ``` The translator rewrites the LEAN AST into an [`alphaswarm.strategies.framework.FrameworkAlgorithm`](../alphaswarm/strategies/framework.py) skeleton. The mapping covers: | LEAN | AlphaSwarm target | | ------------------------------------- | ------------------------------------------ | | `Initialize` | `prepare` | | `OnData` | `on_bar` | | `OnSecuritiesChanged` | `on_universe_changed` | | `self.AddEquity("SPY")` | `ctx.add_equity("SPY")` | | `self.AddOption("SPY")` | `ctx.add_option("SPY")` | | `self.AddCrypto("BTCUSD")` | `ctx.add_crypto("BTCUSD")` | | `self.SetCash(100000)` | captured as cfg `starting_cash` | | `self.SetStartDate / SetEndDate` | captured as cfg `start_date` / `end_date` | | `self.MACD(...)` / `self.SMA(...)` | `alphaswarm.data.indicators.MACD(...)` | | `self.MarketOrder(symbol, qty)` | `ctx.market_order(symbol, qty)` | | `self.SetHoldings(symbol, fraction)` | `ctx.set_holdings(symbol, fraction)` | Anything unmapped becomes a `# TODO(lean-translate)` comment so the user can finish the port — translation is never silent. 
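
To make the method-level rows of the mapping table concrete, here is a toy sketch built on the standard `ast` module. It is not the real translator in `alphaswarm/strategies/lean/translator.py`: the helper name `rename_lean_methods`, the PascalCase heuristic for spotting unmapped methods, and the choice to emit TODO markers as a header are all assumptions made for illustration.

```python
import ast

# Subset of the method mapping from the table above.
METHOD_MAP = {
    "Initialize": "prepare",
    "OnData": "on_bar",
    "OnSecuritiesChanged": "on_universe_changed",
}


def rename_lean_methods(source: str) -> str:
    """Rename mapped LEAN methods; flag unmapped PascalCase methods with a TODO."""
    tree = ast.parse(source)
    todos = []
    for node in ast.walk(tree):
        if isinstance(node, ast.FunctionDef):
            if node.name in METHOD_MAP:
                node.name = METHOD_MAP[node.name]
            elif node.name[:1].isupper():
                todos.append(f"# TODO(lean-translate): unmapped method {node.name}")
    # ast.unparse drops comments, so the TODO markers are prepended as a header.
    return "\n".join(todos + [ast.unparse(tree)])


lean_source = '''
class Demo(QCAlgorithm):
    def Initialize(self):
        pass

    def OnData(self, data):
        pass

    def OnOrderEvent(self, event):
        pass
'''
translated = rename_lean_methods(lean_source)
print(translated)
```

The sketch only touches method names; call rewrites such as `self.AddEquity(...)` to `ctx.add_equity(...)` would need an `ast.NodeTransformer` over `ast.Call` nodes, which is omitted here.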
## Agent surface | Tool | Purpose | | ----------------------------------------------- | ------- | | `data.strategies.templates.search` | Filter by tag / asset class / framework | | `data.strategies.templates.describe` | Full Resource payload including raw source | | `data.strategies.templates.clone_to_workspace` | Fork into the calling user's workspace, optionally with the translator applied | Cloning emits a `resource_relations.relation='translated_from'` edge back to the source, so the ownership graph can audit provenance — `data.ownership.tree` over the cloned Resource returns the lineage chain back to the original LEAN class. ## REST surface | Method + path | Purpose | | -------------------------------------- | ------- | | `GET /strategies/templates` | List + filter | | `GET /strategies/templates/{id}` | Describe + raw source | | `POST /strategies/templates/clone` | Clone (mirrors the MCP tool) | ## Frontend The browser lives at `/strategy-development/templates`. The grouped list groups by primary asset class; the preview pane renders the LEAN source in a monospace block with a "Clone to my workspace" button (with a checkbox to toggle translation). Free-text inputs that reference a specific strategy template are forbidden — use ``. ## Cross-reference - [`alphaswarm_docs/ownership-graph.md`](../../concepts/platform/ownership-graph.md) — the `translated_from` / `clones` edges live in the graph projection. - [`alphaswarm_docs/data-mcp.md`](../../concepts/data/data-mcp.md) — the `data.strategies.templates.*` tools are MCP-registered like every other agent surface. - [`alphaswarm_docs/metadata-cache.md`](../../concepts/data/metadata-cache.md) — the `strategy_templates` Redis cache category powers the EntityPicker. # vectorbt-pro deep integration > vectorbt-pro is the **primary vectorised backtest engine** in AlphaSwarm. 
The integration lives under [alphaswarm/backtest/vbtpro/](../alphaswarm/backtest/vbtpro/) and exposes the full vbt-pro surface (signals, orders, op... # vectorbt-pro deep integration > Doc map: [alphaswarm_docs/index.md](../../intro/index.md) · Engines overview: [alphaswarm_docs/backtest-engines.md](../../concepts/strategy/backtest-engines.md). vectorbt-pro is the **primary vectorised backtest engine** in AlphaSwarm. The integration lives under [alphaswarm/backtest/vbtpro/](../alphaswarm/backtest/vbtpro/) and exposes the full vbt-pro surface (signals, orders, optimizer, callbacks, splitter, param sweeps, IndicatorFactory). The legacy [alphaswarm/backtest/vectorbtpro_engine.py](../alphaswarm/backtest/vectorbtpro_engine.py) is now a 10-line delegate so YAML configs that reference its module path continue to resolve to the new class. ## Hard constraint: Numba vbt-pro's per-bar callbacks (`signal_func_nb`, `order_func_nb`, `pre_segment_func_nb`, …) run inside Numba's JIT. **LLM agents and Python ML models cannot run there.** Two supported patterns work around this constraint: 1. **Precompute** (default) — agents/ML run before the simulation, producing wide-format `entries` / `exits` / `size` / `price` DataFrames. The decisions are baked in; vbt-pro consumes them as plain arrays. 2. **Per-window** (`Splitter.apply`) — Python (and therefore agents/ML) runs in the WFO loop between train and test windows. Each window's inner backtest is still vectorised. For true per-bar agent dispatch use the event-driven engine and the `AgentDispatcher` primitive — see [alphaswarm_docs/backtest-engines.md#agent--ml-components](../../concepts/strategy/backtest-engines.md#agent--ml-components). ## Engine modes `VectorbtProEngine.run` routes through one of five constructors based on the `mode` kwarg. All five share a common kwarg surface (initial_cash, fees, slippage, freq, cash_sharing, group_by, leverage, multiplier, direction, …) and merge any extra `portfolio_kwargs` into the call. 
| Mode | Constructor | Driver | Use case | |-------------|-----------------------------------|-------------------------------------------------------|-------------------------------------------------------| | `signals` | `Portfolio.from_signals` | `IAlphaModel` → wide entries/exits/(size)/(price)/(stops) | The default; mirrors classical signal-based backtests. | | `orders` | `Portfolio.from_orders` | `IOrderModel` → wide size/price/size_type | Agent-emitted precise orders; multi-leg sizing. | | `optimizer` | `Portfolio.from_optimizer` | `PortfolioOptimizer` (mean-variance, risk parity, custom) | Allocation-driven research, no signal generation. | | `holding` | `Portfolio.from_holding` | — | Buy-and-hold sanity baseline. | | `random` | `Portfolio.from_random_signals` | `Param`-style random kwargs | Null-hypothesis baseline. | ## Components | File | Role | |-----------------------------------------------------------|---------------------------------------------------------------------------------| | [`engine.py`](../alphaswarm/backtest/vbtpro/engine.py) | Multi-mode dispatch; `@register("VectorbtProEngine")`. | | [`signal_builder.py`](../alphaswarm/backtest/vbtpro/signal_builder.py) | `IAlphaModel` → `SignalArrays`; per-bar loop **and** `generate_panel_signals` opt-in. | | [`order_builder.py`](../alphaswarm/backtest/vbtpro/order_builder.py) | `IOrderModel` → `OrderArrays`; `signals_to_orders` sizer helper. | | [`optimizer_adapter.py`](../alphaswarm/backtest/vbtpro/optimizer_adapter.py) | `EqualWeightOptimizer`, `MeanVarianceOptimizer`, `RandomWeightOptimizer`, `CallableOptimizer`; all decorated with `@register(..., kind="portfolio")`. | | [`result_mapper.py`](../alphaswarm/backtest/vbtpro/result_mapper.py) | `vbt.Portfolio` → `BacktestResult`; merges `vbt_*` native stats. | | [`wfo.py`](../alphaswarm/backtest/vbtpro/wfo.py) | `WalkForwardHarness` + `PurgedWalkForwardHarness` driven by vbt-pro's `Splitter`. 
| | [`param_sweep.py`](../alphaswarm/backtest/vbtpro/param_sweep.py) | `sweep_strategy_kwargs` (grid/random) + `sweep_signals_grid` (`Param`-native MA cross). | | [`indicator_factory_bridge.py`](../alphaswarm/backtest/vbtpro/indicator_factory_bridge.py) | Wraps AlphaSwarm `IndicatorBase` zoo entries as vbt-pro `IndicatorFactory` classes. | | [`data_utils.py`](../alphaswarm/backtest/vbtpro/data_utils.py) | `pivot_close`, `pivot_ohlcv`, `universe_from_bars`, `filter_bars`. | ## Agent + ML strategy components | File | Class | Role | |-----------------------------------------------------------|------------------------|-----------------------------------------------------------------------| | [`agentic_alpha.py`](../alphaswarm/strategies/vbtpro/agentic_alpha.py) | `AgenticVbtAlpha` | Precompute / per-window / live modes. Reads `DecisionCache` and renders to wide arrays. | | [`ml_alpha.py`](../alphaswarm/strategies/vbtpro/ml_alpha.py) | `MLVbtAlpha` | Wraps any `alphaswarm.ml.base.Model` (or MLflow URI). Threshold / top-k / rank policies. | | [`agent_order_model.py`](../alphaswarm/strategies/vbtpro/agent_order_model.py) | `AgenticOrderModel` | Implements `IOrderModel`; drives the `orders` mode from cached agent decisions. | Each component is `@register`-ed so it can be dropped into a strategy YAML via the standard `class` / `module_path` / `kwargs` factory. 
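
The `@register` plus `class` / `kwargs` pattern can be sketched in a few lines. This is a simplified stand-in, not the AlphaSwarm registry: `REGISTRY`, `build_component`, `DemoAlpha`, and `DemoStrategy` are hypothetical names, `module_path` resolution is left out, and the real decorator accepts extra arguments (such as `kind=`) that are not modelled here.

```python
from typing import Any, Callable, Dict

# Hypothetical in-memory registry mapping a class name to the class itself.
REGISTRY: Dict[str, type] = {}


def register(name: str) -> Callable[[type], type]:
    """Class decorator that records the class under `name`."""
    def decorator(cls: type) -> type:
        REGISTRY[name] = cls
        return cls
    return decorator


def build_component(cfg: Dict[str, Any]) -> Any:
    """Instantiate cfg['class'] with cfg['kwargs'], recursing into nested configs."""
    cls = REGISTRY[cfg["class"]]
    kwargs = {
        key: build_component(value)
        if isinstance(value, dict) and "class" in value
        else value
        for key, value in cfg.get("kwargs", {}).items()
    }
    return cls(**kwargs)


@register("DemoAlpha")
class DemoAlpha:
    def __init__(self, threshold: float = 0.5) -> None:
        self.threshold = threshold


@register("DemoStrategy")
class DemoStrategy:
    def __init__(self, alpha_model: DemoAlpha) -> None:
        self.alpha_model = alpha_model


strategy = build_component({
    "class": "DemoStrategy",
    "kwargs": {"alpha_model": {"class": "DemoAlpha", "kwargs": {"threshold": 0.8}}},
})
print(type(strategy.alpha_model).__name__, strategy.alpha_model.threshold)
```

The nested dictionary mirrors the shape of a strategy YAML recipe, which is why a registered component can be swapped in by changing only the `class` string.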
## Walk-forward optimisation ```python from alphaswarm.backtest.vbtpro.wfo import WalkForwardHarness harness = WalkForwardHarness( strategy_cfg={"class": "FrameworkAlgorithm", "module_path": "...", "kwargs": {...}}, splitter="rolling", # or "expanding", "purged" n_splits=8, train_size=504, test_size=126, engine_kwargs={"mode": "signals", "initial_cash": 100_000.0}, on_window_train=lambda i, slice_, strategy, ctx: warm_agent(strategy, slice_), ) result = harness.run(bars) ``` The harness re-instantiates the strategy on every window (so per-window agent state is isolated), runs the train backtest, then re-instantiates again before the test pass. The optional `on_window_train` hook is where agents refresh their RAG / memory or ML models refit. `PurgedWalkForwardHarness` defaults `splitter="purged"` and uses `PurgedWalkForwardCV` from `vectorbtpro.generic.splitting.purged` to drop labels that bleed across the train/test boundary. ## Parameter sweeps ```python from alphaswarm.backtest.vbtpro.param_sweep import sweep_strategy_kwargs result = sweep_strategy_kwargs( base_config, { "strategy.kwargs.alpha_model.kwargs.fast": [5, 10, 20], "strategy.kwargs.alpha_model.kwargs.slow": [50, 100, 200], }, metric="sharpe", method="grid", ) print(result.best_combo, result.best_value) print(result.frame.head()) ``` Random sweeps require `n_trials`. Trials default to running with `engine: vbt-pro:signals` if the base config does not specify one. `sweep_signals_grid` is the fast `Param`-native path for single-symbol MA-crossover style sweeps. ## Indicator factory bridge ```python from alphaswarm.backtest.vbtpro.indicator_factory_bridge import vbt_indicator SMA = vbt_indicator("SMA") out = SMA.run(close, period=[10, 20, 50]) # vbt.Param under the hood sma_50 = out.value[(slice(None), 50)] ``` This makes every AlphaSwarm `IndicatorBase` available inside vbt-pro's indicator/sweep machinery without rewriting the underlying state machine. 
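
For the grid sweeps described under "Parameter sweeps", the dotted keys address nested config entries. The sketch below shows one plausible way such a grid could be expanded into per-trial configs; `set_by_path` and `expand_grid` are hypothetical helpers, and this says nothing about how `sweep_strategy_kwargs` is actually implemented.

```python
import copy
import itertools
from typing import Any, Dict, Iterator, List, Tuple


def set_by_path(cfg: Dict[str, Any], dotted: str, value: Any) -> None:
    """Set cfg['a']['b']['c'] = value for dotted == 'a.b.c', creating dicts as needed."""
    keys = dotted.split(".")
    node = cfg
    for key in keys[:-1]:
        node = node.setdefault(key, {})
    node[keys[-1]] = value


def expand_grid(base: Dict[str, Any],
                grid: Dict[str, List[Any]]) -> Iterator[Tuple[Dict[str, Any], Dict[str, Any]]]:
    """Yield (combo, config) for every point in the Cartesian product of the grid."""
    paths = list(grid)
    for values in itertools.product(*(grid[p] for p in paths)):
        cfg = copy.deepcopy(base)  # never mutate the caller's base config
        for path, value in zip(paths, values):
            set_by_path(cfg, path, value)
        yield dict(zip(paths, values)), cfg


base_config = {"strategy": {"kwargs": {"alpha_model": {"kwargs": {"fast": 1, "slow": 2}}}}}
grid = {
    "strategy.kwargs.alpha_model.kwargs.fast": [5, 10, 20],
    "strategy.kwargs.alpha_model.kwargs.slow": [50, 100, 200],
}
trials = list(expand_grid(base_config, grid))
print(len(trials))
```

A 3 x 3 grid yields nine trial configs; a random sweep would sample from the same product instead of enumerating it.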
## Agent tools | Tool name | Class | Surface | |----------------------------|-----------------------------|-----------------------------------------------| | `vectorbt_pro_backtest` | `VectorbtProBacktestTool` | One backtest, explicit mode. | | `vectorbt_pro_param_sweep` | `VbtProParamSweepTool` | Grid / random sweep over strategy kwargs. | | `vectorbt_pro_wfo` | `VbtProWalkForwardTool` | Splitter-WFO; rolling/expanding/purged. | | `vectorbt_pro_optimizer` | `VbtProOptimizerTool` | Allocation-driven via `Portfolio.from_optimizer`. | | `engine_capabilities` | `EngineCapabilitiesTool` | Inspect the capability matrix; pick an engine.| | `agent_aware_backtest` | `AgentAwareBacktestTool` | Run `AgentAwareMomentumAlpha` on the event-driven engine. | All tools are registered in `alphaswarm_agents.tools.TOOL_REGISTRY` and referenced in [configs/agents/quant_research_vbtpro.yaml](../configs/agents/quant_research_vbtpro.yaml). ## Example configs - [configs/strategies/vbtpro/dual_ma_signals.yaml](../configs/strategies/vbtpro/dual_ma_signals.yaml) — minimal `signals` mode example. - [configs/strategies/vbtpro/agentic_trader.yaml](../configs/strategies/vbtpro/agentic_trader.yaml) — `AgenticVbtAlpha` precompute. - [configs/strategies/vbtpro/ml_topk.yaml](../configs/strategies/vbtpro/ml_topk.yaml) — `MLVbtAlpha` top-k. - [configs/strategies/vbtpro/wfo_agentic.yaml](../configs/strategies/vbtpro/wfo_agentic.yaml) — per-window agent dispatch. - [configs/strategies/vbtpro/optimizer_meanvariance.yaml](../configs/strategies/vbtpro/optimizer_meanvariance.yaml) — allocation-only optimizer mode. ## Performance notes - **Default Numba JIT** is ON. The first vbt-pro call in a fresh process pays a non-trivial compile cost (~10-30s for the full surface). Cache ahead of time on workers if latency matters. - **`jitted=False`** swaps the outer simulation wrapper to a Python reference implementation; it does not let arbitrary Python live inside `signal_func_nb`. 
Use precompute or per-window for that. - The `IndicatorFactory` bridge applies AlphaSwarm indicators per column in pure Python, which is slow for very wide universes; for hot paths prefer vbt-pro's native indicators (`vbt.SMA`, `vbt.RSI`, etc.) and only fall back to the bridge for indicators we don't have a vbt-pro analogue for. ## Migration from the legacy adapter The previous `VectorbtProEngine` only handled signals via `IAlphaModel.generate_signals` → `Portfolio.from_signals`. Existing configs still work because: - The legacy module path `alphaswarm.backtest.vectorbtpro_engine.VectorbtProEngine` re-exports the new class. - The default mode is still `signals`. - Existing kwargs (`initial_cash`, `fees`, `slippage`, `allow_short`, `freq`, `group_by`) are unchanged in meaning. New kwargs that gate richer behaviour: `mode`, `direction`, `accumulate`, `size`, `size_type`, `sl_stop`, `tsl_stop`, `tp_stop`, `leverage`, `leverage_mode`, `multiplier`, `cash_sharing`, `portfolio_kwargs`, `order_model`, `optimizer`, `random_kwargs`, `record_signals`. # Observability stack > ```mermaid flowchart LR apps[AlphaSwarm services + agents] subgraph aqpobs[alphaswarm-observability] otelagent[OTel Agent DaemonSet] otelgw[OTel Gateway Deployment] prom[Prometheus] graf[Grafana] tempo[Tempo] loki[... # Observability stack Phase 2c + 2d of the AlphaSwarm infra-expansion plan stand up the AlphaSwarm-owned observability plane in the `alphaswarm-observability` namespace. Everything the cluster previously read from `rpi_kubernetes/observability/` is re-homed here. 
```mermaid flowchart LR apps[AlphaSwarm services + agents] subgraph aqpobs[alphaswarm-observability] otelagent[OTel Agent DaemonSet] otelgw[OTel Gateway Deployment] prom[Prometheus] graf[Grafana] tempo[Tempo] loki[Loki] phoenix[Arize Phoenix] pgphx[(Phoenix Postgres)] end apps -- OTLP --> otelagent otelagent -- OTLP --> otelgw otelgw -- "AI spans (openinference.span.kind)" --> phoenix otelgw -- "infra spans" --> tempo otelgw -- "remote_write" --> prom otelgw -- "OTLP logs" --> loki phoenix --> pgphx graf -. "Prometheus + Loki + Tempo + QuestDB datasources" .- prom graf -. .- loki graf -. .- tempo ``` ## Components | Component | Folder | Replaces | |---|---|---| | kube-prometheus-stack | [observability/kube-prometheus-stack/](../alphaswarm_platform/deployments/kubernetes/observability/kube-prometheus-stack/) | rpi `observability/prometheus/` | | OpenTelemetry Operator | [observability/opentelemetry-operator/](../alphaswarm_platform/deployments/kubernetes/observability/opentelemetry-operator/) | new | | OTel Collector (gateway + agent) | [observability/opentelemetry-collector-gateway/](../alphaswarm_platform/deployments/kubernetes/observability/opentelemetry-collector-gateway/) | rpi `observability/otel-collector/` | | Phoenix | [observability/phoenix/](../alphaswarm_platform/deployments/kubernetes/observability/phoenix/) | new | ## Routing rule (gateway) The `transform/ai_route` processor in [`collector-gateway.yaml`](../alphaswarm_platform/deployments/kubernetes/observability/opentelemetry-collector-gateway/collector-gateway.yaml) inspects every span and tags it with `alphaswarm.ai_trace=true` when: - `attributes["openinference.span.kind"] != nil`, or - `attributes["llm.model_name"] != nil`, or - `attributes["agent.name"] != nil`. Two trace pipelines (`traces/ai`, `traces/infra`) split on that attribute. Tail sampling preserves error traces + 100 % of AI traces; everything else is sampled at 1 %. 
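
The routing rule can be restated as a small Python predicate. This is only a restatement for readers: the real logic is expressed in the collector's processor configuration, and the function names below (`is_ai_span`, `pipeline_for`, `keep_probability`) are hypothetical.

```python
from typing import Any, Mapping

# Attributes whose presence marks a span as an AI span, per the rule above.
AI_ATTRIBUTE_KEYS = ("openinference.span.kind", "llm.model_name", "agent.name")


def is_ai_span(attributes: Mapping[str, Any]) -> bool:
    """True when any of the three marker attributes is present and non-nil."""
    return any(attributes.get(key) is not None for key in AI_ATTRIBUTE_KEYS)


def pipeline_for(attributes: Mapping[str, Any]) -> str:
    """Name of the trace pipeline the span would be routed to."""
    return "traces/ai" if is_ai_span(attributes) else "traces/infra"


def keep_probability(attributes: Mapping[str, Any], has_error: bool) -> float:
    """Tail-sampling rate: keep all error and AI traces, 1 % of the rest."""
    return 1.0 if has_error or is_ai_span(attributes) else 0.01


print(pipeline_for({"llm.model_name": "demo-model"}))
print(pipeline_for({"http.method": "GET"}))
```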
## DataMCP tools | Tool | Surface | |---|---| | `data.observability.prometheus.query` | Instant PromQL. | | `data.observability.prometheus.query_range` | Range PromQL. | | `data.observability.prometheus.list_alerts` | Active alerts. | | `data.observability.grafana.list_dashboards` | Dashboard catalog. | | `data.observability.grafana.export_dashboard` | Dashboard JSON. | | `data.observability.phoenix.list_projects` | Phoenix projects. | | `data.observability.phoenix.get_trace` | LLM / agent trace. | | `data.observability.phoenix.annotate_span` | Write evaluator verdict. | ## Frontend - [/admin/topology](../alphaswarm_client/src/routes/admin/topology/page.tsx) — Phase 0 topology overview. - (Phase 6 follow-up) `/admin/observability/{prometheus,grafana,phoenix,otel}` — domain-scoped admin pages. # Observability > AlphaSwarm ships with opt-in OpenTelemetry tracing covering the full request path: FastAPI → Celery → paper session → broker SDK → Postgres → Redis. Install the `otel` extra to enable it:: # Observability > Doc map: [alphaswarm_docs/index.md](../../intro/index.md) · Progress bus reference: [alphaswarm_docs/flows.md#cross-cutting-progress-bus](../../concepts/platform/flows.md#cross-cutting-progress-bus). AlphaSwarm ships with opt-in OpenTelemetry tracing covering the full request path: FastAPI → Celery → paper session → broker SDK → Postgres → Redis. Install the `otel` extra to enable it:: pip install -e ".[otel]" ## Quick start (Docker) `docker compose up -d` starts an OpenTelemetry Collector and Jaeger sidecar alongside the AlphaSwarm services. Each service is pre-wired with `ALPHASWARM_OTEL_ENDPOINT=http://otel-collector:4317`. 
Open [http://localhost:16686](http://localhost:16686) and pick a service: - `alphaswarm-api` — FastAPI request handlers + Dash mount - `alphaswarm-worker` — Celery tasks (backtest, paper, ingestion) - `alphaswarm-paper-trader` — paper session loop ## Configuration All knobs live in `alphaswarm.config.Settings` / `.env`: | Variable | Default | Purpose | |---|---|---| | `ALPHASWARM_OTEL_ENDPOINT` | *empty* | OTLP endpoint. Empty → tracing disabled (safe dev default). | | `ALPHASWARM_OTEL_SERVICE_NAME` | `alphaswarm` | Base service name. Suffixes `-api`, `-worker`, `-paper` added automatically. | | `ALPHASWARM_OTEL_SAMPLE_RATIO` | `1.0` | Parent-based head sampler ratio. `0.1` = 10% of traces. | | `ALPHASWARM_OTEL_PROTOCOL` | `grpc` | `grpc` (port 4317) or `http/protobuf` (port 4318). | ## Instrumentation map Auto-instrumented on startup (see `alphaswarm/observability/tracing.py`): - `FastAPIInstrumentor` — every route becomes a span - `CeleryInstrumentor` — every task becomes a span - `SQLAlchemyInstrumentor` — every query becomes a span (attached in `alphaswarm/persistence/db.py` when `ALPHASWARM_OTEL_ENDPOINT` is set) - `HTTPXClientInstrumentor` — every HTTPX call (broker REST, UI API client) - `RedisInstrumentor` — every Redis command (pub/sub, kill-switch, Celery broker) Manual spans are added via the `@traced` decorator (`alphaswarm/observability/decorators.py`): ```python from alphaswarm.observability import traced @traced("paper.session.run") async def run(self) -> PaperSessionResult: ... ``` Works transparently on sync and `async` callables; when `otel` isn't installed the tracer is a no-op so the decorator has zero overhead. ## Custom exporters The default is OTLP/gRPC. 
To use OTLP/HTTP instead:

```bash
ALPHASWARM_OTEL_PROTOCOL=http/protobuf
ALPHASWARM_OTEL_ENDPOINT=http://otel-collector:4318/v1/traces
```

For local development with just the console, install the OTel SDK and point at a local Jaeger all-in-one:

```bash
docker run --rm -p 4317:4317 -p 16686:16686 jaegertracing/all-in-one:1.55
export ALPHASWARM_OTEL_ENDPOINT=http://localhost:4317
```

## Kubernetes

Both the API/Worker image and the `paper` image have the OTel SDK installed. The Kustomize manifests set `ALPHASWARM_OTEL_ENDPOINT` to the in-cluster collector service; port-forward Jaeger with:

```bash
kubectl -n alphaswarm-dev port-forward svc/jaeger 16686:16686
```

## Troubleshooting

**Spans never show up in Jaeger.**

- Verify `ALPHASWARM_OTEL_ENDPOINT` is set in the container: `docker compose exec api env | grep OTEL`.
- Check the collector logs for parsing errors: `docker compose logs otel-collector`.
- Set `ALPHASWARM_OTEL_SAMPLE_RATIO` to `1.0` while debugging so that every trace is kept.

**`ImportError: opentelemetry-exporter-otlp-proto-grpc` at startup.**

- You set `ALPHASWARM_OTEL_ENDPOINT` but didn't install the `otel` extra. The tracer logs a warning and continues as a no-op, but to silence it run `pip install -e ".[otel]"`.

**Tests emit real spans.**

- They shouldn't — `tests/conftest.py` installs an `autouse` fixture that resets `ALPHASWARM_OTEL_ENDPOINT=""` before each test. If you see real spans, check that the fixture is still in place.

## Metrics (optional)

The OTel Collector config in `alphaswarm_platform/deploy/otel/otel-collector-config.yaml` also exports metrics on port 8889 via the Prometheus exporter, so you can point a Prometheus scraper at the collector for JVM-style service-level dashboards. The AlphaSwarm code doesn't emit custom metrics yet — PRs welcome.
## Tracing topology ```mermaid flowchart LR API[FastAPI] -->|spans| OTEL[OTEL collector :4317] Worker[Celery worker] -->|spans| OTEL Paper[paper-trader] -->|spans| OTEL OTEL --> Jaeger[Jaeger UI :16686] OTEL --> Prom[Prometheus exporter :8889] API -.publish.-> RedisBus[("alphaswarm:task pubsub")] Worker -.publish.-> RedisBus RedisBus -.subscribe.-> WS["/chat/stream WS"] ``` # Paper Metadata Gate (Strict-Only) > After this rollout, paper-trading sessions **require** both `session.model_urn` and `session.pipeline_urn` to be present and valid at startup # Paper Metadata Gate (Strict-Only) ## Breaking change After this rollout, paper-trading sessions **require** both `session.model_urn` and `session.pipeline_urn` to be present and valid at startup. If either URN is missing, malformed, unresolved in `entity_aspects`, or (for the model URN) resolves to a non-`Production`/non-`Staging` model status, the session raises `MetadataValidationError` and refuses to start. There is no warn-only fallback mode. ## How strict gate validation works The paper gate performs these checks in order: 1. Parse `model_urn` and `pipeline_urn` with AlphaSwarm URN validation. 2. Resolve `mlModelMetadata` for `model_urn` and `pipelineMetadata` for `pipeline_urn`. 3. Enforce model lifecycle status (`Production` or `Staging` only). 4. Emit a `metadata_gate` progress frame and raise on any validation error. Startup is blocked until all checks pass. 
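
The ordered checks can be sketched as follows. Everything here is an approximation under stated assumptions: the URN grammar is inferred from the seeded examples (`urn:alphaswarm:<kind>:<env>:<name>`), the `aspects` mapping is an in-memory stand-in for `entity_aspects`, `MetadataValidationError` is redefined locally, and the `metadata_gate` progress frame (step 4) is omitted.

```python
import re
from typing import Any, Mapping, Optional, Tuple

# Assumed URN shape, inferred from the seeded URNs; the real validator may differ.
URN_PATTERN = re.compile(r"^urn:alphaswarm:(?P<kind>[a-z]+):(?P<env>[a-z]+):(?P<name>\w+)$")
ALLOWED_STATUSES = {"Production", "Staging"}

Aspects = Mapping[Tuple[str, str], Mapping[str, Any]]


class MetadataValidationError(Exception):
    """Local stand-in for the error the paper gate raises."""


def check_gate(model_urn: Optional[str], pipeline_urn: Optional[str], aspects: Aspects) -> None:
    # 1. Both URNs must be present and well-formed for their kind.
    for field, urn, kind in (("model_urn", model_urn, "mlmodel"),
                             ("pipeline_urn", pipeline_urn, "pipeline")):
        if not urn:
            raise MetadataValidationError(f"{field} is missing")
        match = URN_PATTERN.match(urn)
        if match is None or match["kind"] != kind:
            raise MetadataValidationError(f"{field} is malformed: {urn!r}")
    # 2. Both URNs must resolve to their metadata aspect.
    model_meta = aspects.get((model_urn, "mlModelMetadata"))
    pipeline_meta = aspects.get((pipeline_urn, "pipelineMetadata"))
    if model_meta is None or pipeline_meta is None:
        raise MetadataValidationError("URN does not resolve to a stored aspect")
    # 3. The model must be in an allowed lifecycle status.
    if model_meta.get("status") not in ALLOWED_STATUSES:
        raise MetadataValidationError(f"model status {model_meta.get('status')!r} not allowed")


def gate_passes(model_urn: Optional[str], pipeline_urn: Optional[str], aspects: Aspects) -> bool:
    """Convenience wrapper: True when the gate raises nothing."""
    try:
        check_gate(model_urn, pipeline_urn, aspects)
    except MetadataValidationError:
        return False
    return True


MODEL = "urn:alphaswarm:mlmodel:prod:alpaca_mean_reversion_v1"
PIPELINE = "urn:alphaswarm:pipeline:prod:alpaca_mean_reversion_loop"
ASPECTS = {
    (MODEL, "mlModelMetadata"): {"status": "Production"},
    (PIPELINE, "pipelineMetadata"): {},
}
print(gate_passes(MODEL, PIPELINE, ASPECTS))
```

Because there is no warn-only mode, any single failing check is enough to block startup, which is why the sketch raises on the first problem instead of collecting warnings.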
## Seeded URNs from migration 0049

Alembic revision `0049_paper_metadata_seed_aspects` seeds these baseline URNs:

- `configs/paper/alpaca_mean_rev.yaml`
  - `urn:alphaswarm:mlmodel:prod:alpaca_mean_reversion_v1`
  - `urn:alphaswarm:pipeline:prod:alpaca_mean_reversion_loop`
- `configs/paper/ibkr_mean_rev.yaml`
  - `urn:alphaswarm:mlmodel:prod:ibkr_mean_reversion_v1`
  - `urn:alphaswarm:pipeline:prod:ibkr_mean_reversion_loop`
- `configs/paper/avellaneda_stoikov_quotes.yaml`
  - `urn:alphaswarm:mlmodel:prod:avellaneda_stoikov_v1`
  - `urn:alphaswarm:pipeline:prod:avellaneda_stoikov_quotes_loop`
- `configs/paper/lucic_tse_options.yaml`
  - `urn:alphaswarm:mlmodel:prod:lucic_tse_options_v1`
  - `urn:alphaswarm:pipeline:prod:lucic_tse_options_loop`
- `configs/paper/tradier_rest.yaml`
  - `urn:alphaswarm:mlmodel:prod:tradier_rest_baseline_v1`
  - `urn:alphaswarm:pipeline:prod:tradier_rest_loop`

To add a new paper config, seed matching `MlModel` + `Pipeline` aspects first, then point YAML `session.model_urn` / `session.pipeline_urn` at those URNs.

## Operator runbook (custom paper YAMLs)

1. Run migrations through `0049_paper_metadata_seed_aspects`.
2. For each custom paper model, register an `MlModel` aspect (status must be `Production` or `Staging`) using the `aspect.register_model` MCP tool.
3. Register a matching `Pipeline` aspect for each paper pipeline URN.
4. Update custom YAML files so `session.model_urn` and `session.pipeline_urn` match the newly registered aspects.
5. Start paper sessions and confirm metadata-gate startup checks pass.

## Rollback

If you must revert this rollout:

1. `alembic downgrade 0048`
2. `git revert `

After rollback, redeploy and re-run paper sessions with the reverted code/docs.

# Paper & live trading

> AlphaSwarm's paper trading engine is a Lean-inspired async runtime that shares 100% of its strategy code with the backtester. Orders from the same `IStrategy` object flow through the **same ledger tables** r...
# Paper & live trading

> Doc map: [alphaswarm_docs/index.md](../../intro/index.md) · Session state machine: [alphaswarm_docs/flows.md#4-paper-trading-session](../../concepts/platform/flows.md#4-paper-trading-session).

AlphaSwarm's paper trading engine is a Lean-inspired async runtime that shares 100% of its strategy code with the backtester. Orders from the same `IStrategy` object flow through the **same ledger tables** regardless of whether the session is a backtest, a paper replay, or a live session.

## Architecture

```mermaid
flowchart LR
    Strategy["IStrategy (e.g. FrameworkAlgorithm)"]
    Session["PaperTradingSession (async loop)"]
    Feed["IMarketDataFeed (async iterator)"]
    Broker["IBrokerage + IAsyncBrokerage"]
    Ledger[(Postgres OrderRecord / Fill / LedgerEntry)]
    Redis[(Redis progress + stop signals)]
    Feed -->|BarData| Session
    Session -->|on_bar| Strategy
    Strategy -->|OrderRequest| Session
    Session -->|submit_order_async| Broker
    Broker --> Ledger
    Session -->|emit progress| Redis
    Redis -->|stop signal| Session
    Session --> Ledger
```

## Lifecycle

1. `alphaswarm paper run --config ` (or `POST /paper/start`) builds a `PaperTradingSession` via [`alphaswarm/trading/runner.py`](../alphaswarm/trading/runner.py).
2. `_connect` subscribes the feed to the strategy's universe.
3. For each bar (up to `max_bars` or forever):
   - Check the kill switch (`POST /portfolio/kill_switch`).
   - Append to an in-memory history window.
   - Call `strategy.on_bar(bar, context)` — identical to the backtest.
   - For each returned `OrderRequest`:
     - Run the pre-trade risk check (`RiskManager.check_pretrade`).
     - Submit via `brokerage.submit_order_async` (or the sync bridge).
     - Persist the `OrderRecord` and ledger entry.
   - Drain order updates (simulated path) and emit fills.
4. Every `state_flush_every_bars` bars, a snapshot of the session state is flushed to the `paper_trading_runs.state` JSONB column.
5.
On shutdown (kill switch, stop signal, `max_bars`, or feed EOF), the engine drains, writes the final row, and emits `done` to the progress bus. ## Broker adapters Each adapter lives in `alphaswarm/trading/brokerages/` and implements **both** `IBrokerage` (sync, for backtest parity) and `IAsyncBrokerage`. ### Alpaca (`[alpaca]` extra) ```yaml brokerage: class: AlpacaBrokerage kwargs: {paper: true} # flip to false for live ``` Requires `ALPHASWARM_ALPACA_API_KEY` and `ALPHASWARM_ALPACA_SECRET_KEY`. The adapter maintains a background `TradingStream` that re-emits order updates through the session's `_order_event_queue`. ### Interactive Brokers (`[ibkr]` extra) ```yaml brokerage: class: InteractiveBrokersBrokerage kwargs: {exchange: SMART, currency: USD} ``` Requires a running TWS or IB Gateway. Defaults: `ALPHASWARM_IBKR_HOST=127.0.0.1`, `ALPHASWARM_IBKR_PORT=7497` (paper), `ALPHASWARM_IBKR_CLIENT_ID=1`. The feed uses `client_id + 100` so it doesn't collide with the trading client. ### Tradier (generic REST template) ```yaml brokerage: class: TradierBrokerage ``` Requires `ALPHASWARM_TRADIER_TOKEN` and `ALPHASWARM_TRADIER_ACCOUNT_ID`. Demonstrates how to subclass [`RestBrokerage`](../alphaswarm/trading/brokerages/rest.py) — five small overrides give you a full paper/live venue: `_order_payload`, `_parse_order(s)`, `_parse_positions`, `_parse_account`, `_order_detail_path/_orders_path/_positions_path/_account_path`. ## Credential flow `alphaswarm.config.Settings` reads every broker secret from the `ALPHASWARM_*` environment (via `.env`). Adapters pick those up automatically at construction time, so YAML recipes rarely need to inline secrets. Order of precedence: 1. Explicit `kwargs` in the YAML recipe (highest) 2. Explicit `kwargs=` passed to `build_from_config` 3. `ALPHASWARM_*` environment variables 4. Package defaults (sandbox URLs, paper=True, etc.) 
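
The precedence order above can be modelled with `collections.ChainMap`, which returns the first match across an ordered list of mappings. This is a sketch only: `resolve_broker_kwargs` is a hypothetical helper, and the rule that `ALPHASWARM_ALPACA_API_KEY` maps to the kwarg `api_key` is an assumption about naming, not the documented behaviour of `Settings`.

```python
from collections import ChainMap
from typing import Any, Dict, Mapping


def resolve_broker_kwargs(yaml_kwargs: Mapping[str, Any],
                          call_kwargs: Mapping[str, Any],
                          env: Mapping[str, str],
                          defaults: Mapping[str, Any],
                          prefix: str = "ALPHASWARM_ALPACA_") -> Dict[str, Any]:
    """Merge the four sources; earlier mappings win over later ones."""
    # Assumed convention: strip the prefix and lower-case the rest of the name.
    env_values = {
        name[len(prefix):].lower(): value
        for name, value in env.items()
        if name.startswith(prefix)
    }
    return dict(ChainMap(dict(yaml_kwargs), dict(call_kwargs), env_values, dict(defaults)))


resolved = resolve_broker_kwargs(
    yaml_kwargs={"paper": False},
    call_kwargs={},
    env={"ALPHASWARM_ALPACA_API_KEY": "env-key", "UNRELATED": "x"},
    defaults={"paper": True, "api_key": None},
)
print(resolved)
```

Here the YAML recipe overrides the `paper` default while the API key still comes from the environment, which is the usual reason recipes do not need inline secrets.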
## Kill-switch integration The paper session wraps every iteration in a check against [`alphaswarm.risk.kill_switch.is_engaged`](../alphaswarm/risk/kill_switch.py). Toggling the switch via `POST /portfolio/kill_switch` (or the UI's Portfolio page) causes the session to: 1. Stop accepting new bars from the feed. 2. Cancel every open order via `brokerage.cancel_order_async`. 3. Flush final state + close brokerage/feed connections. 4. Emit `done` to the task progress channel. Set `session.stop_on_kill_switch: false` in the recipe to disable this behaviour (not recommended). ## Remote / Kubernetes runs The `paper-trader` Docker image (`--target paper`) runs `alphaswarm paper run` as a single-replica k8s `Deployment`. See [`alphaswarm_platform/deploy/k8s/base/paper-trader.yaml`](../alphaswarm_platform/deploy/k8s/base/paper-trader.yaml). To run on a remote host over SSH: ```bash ALPHASWARM_ALPACA_API_KEY=... ALPHASWARM_ALPACA_SECRET_KEY=... \ alphaswarm paper run --config configs/paper/alpaca_mean_rev.yaml --celery ``` The `--celery` flag enqueues the job onto the shared worker pool; the shell can exit and the session keeps running. Use `alphaswarm paper stop ` to drain it gracefully from anywhere. ## Metadata Gate — Strict Mode Rollout Paper sessions now run with strict metadata validation only. Startup aborts when `session.model_urn` or `session.pipeline_urn` is missing, invalid, or unresolvable, or when model status is not `Production`/`Staging`. When adding a new paper config: 1. Declare both `model_urn` and `pipeline_urn` in the YAML. 2. Seed matching aspects before startup checks run. 3. Built-in baseline configs are seeded by Alembic revision `0049_paper_metadata_seed_aspects`; additional configs should ship a follow-up migration (either using `alphaswarm.trading.baseline_aspects.seed_paper_baseline_aspects()` or direct `write_aspect(...)` calls in non-migration application code). 
Baseline URNs seeded by revision `0049_paper_metadata_seed_aspects`: - `urn:alphaswarm:mlmodel:prod:alpaca_mean_reversion_v1` - `urn:alphaswarm:pipeline:prod:alpaca_mean_reversion_loop` - `urn:alphaswarm:mlmodel:prod:ibkr_mean_reversion_v1` - `urn:alphaswarm:pipeline:prod:ibkr_mean_reversion_loop` - `urn:alphaswarm:mlmodel:prod:avellaneda_stoikov_v1` - `urn:alphaswarm:pipeline:prod:avellaneda_stoikov_quotes_loop` - `urn:alphaswarm:mlmodel:prod:lucic_tse_options_v1` - `urn:alphaswarm:pipeline:prod:lucic_tse_options_loop` - `urn:alphaswarm:mlmodel:prod:tradier_rest_baseline_v1` - `urn:alphaswarm:pipeline:prod:tradier_rest_loop` ## Observability hooks All broker calls and the main session loop are instrumented with OpenTelemetry spans (see [observability.md](../../concepts/trading/observability.md)): | Span name | Emitted by | |---|---| | `paper.session.run` | `PaperTradingSession.run` | | `paper.session.bar` | Each bar processed | | `paper.session.submit_order` | Order submission gate | | `broker.submit_order` | Every concrete broker adapter | | `broker.cancel_order` | Every concrete broker adapter | | `broker.query_positions` | Every concrete broker adapter | | `broker.query_account` | Every concrete broker adapter | Each span carries a `broker.venue` attribute (`alpaca`, `ibkr`, `tradier`, or `sim`). # webui — Next.js 15 frontend > The `webui/` package is the React/TypeScript replacement for the legacy Solara UI on `:8765`. It runs as a separate Node process on `:3000` and talks to the FastAPI backend on `:8000` over REST + WebS... # webui — Next.js 15 frontend > Doc map: [alphaswarm_docs/index.md](../../intro/index.md) · API surface: [alphaswarm_docs/architecture.md#system-component-diagram](../../concepts/platform/architecture.md#system-component-diagram). The `webui/` package is the React/TypeScript replacement for the legacy Solara UI on `:8765`. It runs as a separate Node process on `:3000` and talks to the FastAPI backend on `:8000` over REST + WebSocket. 
## Stack

- Next.js 15 App Router, React 19, TypeScript strict
- Ant Design 5 + `@ant-design/icons` + `@ant-design/charts`
- AG Grid Community (`ag-grid-community` + `ag-grid-react`)
- React Flow v12 (`@xyflow/react`) for visual workflow editors
- `react-financial-charts` for OHLC + indicators (alongside `recharts`)
- TanStack Query v5 + Zustand
- `openapi-typescript` + `openapi-fetch` for type-safe REST access

The full directory layout and design rationale live in `webui/README.md`.

## Local dev

From the repo root:

```bash
make webui-install   # one-time pnpm install
make webui-gen-api   # dump OpenAPI + regenerate TypeScript client
make webui-dev       # start dev server on :3000
```

The Next dev server proxies `/alphaswarm-api/*` → `${NEXT_PUBLIC_API_URL}` (default `http://localhost:8000`) so cookies and WebSockets stay same-origin in dev.

## Backend contract additions

The refactor added or extended a small surface on the FastAPI side:

- `GET /auth/whoami` — local-first identity stub
- `GET /chat/threads`, `POST /chat/threads`, `DELETE /chat/threads/{id}`
- `POST /chat` accepts an optional `context: ChatContext` block (page, vt_symbol, backtest_id, strategy_id, …) which is materialised into the system prompt so the assistant knows which page the user is on.
- CORS is now driven by `ALPHASWARM_WEBUI_CORS_ORIGINS` (comma-separated list). Empty value falls back to the legacy `"*"` behaviour.

WebSocket contracts are unchanged:

- `WS /chat/stream/{task_id}` — Celery task progress
- `WS /live/stream/{channel_id}` — live market subscriptions

## OpenAPI client regeneration

The `webui` consumes a generated `paths` interface that mirrors FastAPI's spec exactly:

1. `python -m scripts.export_openapi --out data/openapi.json`
2. `pnpm --dir webui exec openapi-typescript ../data/openapi.json -o lib/api/generated/schema.d.ts`

`make webui-gen-api` (or `pwsh ./scripts/gen_webui_client.ps1`) wraps both steps. CI should run them and fail if the diff is non-empty (drift check).
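
One way to express the drift check is to compare the committed schema with a freshly regenerated copy. The sketch below is a minimal illustration, not the project's CI script: the helper names `file_digest` and `has_drift` are hypothetical, and it assumes CI writes the regenerated file to a separate path before comparing.

```python
import hashlib
from pathlib import Path


def file_digest(path: Path) -> str:
    """Return the SHA-256 hex digest of a file's bytes."""
    return hashlib.sha256(path.read_bytes()).hexdigest()


def has_drift(committed: Path, regenerated: Path) -> bool:
    """True when the regenerated client differs from the committed one."""
    return file_digest(committed) != file_digest(regenerated)
```

A CI step could call `has_drift(...)` and exit non-zero when it returns `True`. Running `git diff --exit-code` on the generated file after regeneration is an equally reasonable alternative.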
## Strangler migration

During the migration both UIs run in parallel:

- `:3000` — Next.js webui (new, primary)
- `:8765` — Solara UI (legacy)
- `/dash` — Dash strategy monitor (kept; embedded in Next via iframe under `/monitor`)

When the Next.js app reaches feature parity:

1. Drop the `ui` service from `alphaswarm_platform/compose/docker-compose.yml`.
2. Delete `alphaswarm/ui/pages/` and `alphaswarm/ui/app.py` (keep the Dash factory).
3. Optionally relax `fastapi<0.116` and `starlette<0.46` pins in `pyproject.toml` (they exist solely to satisfy Solara).

## Page tree (top-level)

```mermaid
flowchart TB
  Root["/"] --> Dash["/dashboard"]
  Root --> Data["/data"]
  Data --> DataCatalog["/data/catalog"]
  Data --> DataIceberg["/data/iceberg"]
  Data --> DataIngest["/data/ingest"]
  Data --> DataBrowser["/data/browser"]
  Root --> Backtest["/backtest"]
  Backtest --> BTHistory["/backtest/history"]
  Backtest --> BTNew["/backtest/new"]
  Root --> Strategies["/strategies"]
  Root --> Models["/models"]
  Root --> Agentic["/agentic"]
  Root --> Paper["/paper"]
  Root --> Settings["/settings"]
```

# Wire alphaswarm_admin against the AlphaSwarm staff Entra tenant

End-to-end procedure for connecting the `alphaswarm_admin` service (backend BFF + Next.js frontend) to Microsoft Entra ID, using the staff app registration that the `alphaswarm_entra_directory` Terraform module provisions.

The result: AlphaSwarm staff sign in to `manage.alpha-swarm.ai` with their corporate Entra account; the admin BFF validates the resulting `api://alphaswarm-manage-api` access tokens; the SPA mints tokens via `@azure/msal-browser` and renews them silently with `acquireTokenSilent`.

Companion runbooks:

- [Bootstrap the AlphaSwarm Entra tenant](./entra-terraform-bootstrap.md) — the prerequisite that creates the apps + groups + roles.
- [Onboard a new staff member](./entra-onboard-new-staff.md) — group + role assignment after the apps land.
- [Rotate Entra secrets](./entra-rotate-secrets.md) — federated credentials + break-glass procedures.
- Concept overview: [Entra ID as the AlphaSwarm staff user pool](../concepts/identity/entra-internal-tenant.md).
- ADR: [ADR-013 Entra ID as the AlphaSwarm staff first user pool](../architecture/decisions/013-entra-as-first-pool.md).

## What gets wired

| Surface | What it does |
| --- | --- |
| `alphaswarm_admin/src/alphaswarm_admin/settings.py` | Reads the `ALPHASWARM_AUTH_MSAL_INTERNAL_*` env vars set by the helper script. Single-tenant when `INTERNAL_TENANT_ID` is set; multi-tenant otherwise. |
| `alphaswarm_admin/src/alphaswarm_admin/deps/identity.py` | JWT validator pinned to the AlphaSwarm staff Entra v2.0 issuer; verifies `aud=api://alphaswarm-manage-api`, maps `roles` claim through the canonical RBAC lattice. |
| `alphaswarm_admin/src/alphaswarm_admin/api/routers/auth_setup.py` | New `GET /admin/auth/discovery` + `GET /admin/auth/health`. Discovery feeds the SPA's `PublicClientApplication`; health confirms the IdP is reachable. |
| `alphaswarm_admin/frontend/components/auth/AuthProvider.tsx` | Real MSAL flow (`loginRedirect`, `acquireTokenSilent`, `acquireTokenPopup` for step-up). No tenant id hard-coded in the bundle — everything comes from `/admin/auth/discovery`. |
| `scripts/identity/alphaswarm_admin_entra_setup.py` | Operator helper: discovers values from Terraform outputs, prints + optionally writes the env vars, prints the runbook. |

## Prerequisites

- The Terraform stack `entra-internal` has been planned + applied for the wiley-tech environment (see [bootstrap runbook](./entra-terraform-bootstrap.md)).
- Admin consent has been granted on the staff app's Graph permissions (`./scripts/identity/grant_admin_consent.sh "$STAFF_CID"`).
- The `EntraTenantLink` for the AlphaSwarm staff tenant exists with `meta.kind = 'internal'` (`python scripts/identity/seed_entra_internal_tenant.py --apply`).
## Step 1 — Generate the env vars

```bash
# Auto-discover from the Terraform outputs in the wiley-tech env.
python scripts/identity/alphaswarm_admin_entra_setup.py
```

The script prints two env blocks. Sample output:

```
# --- Backend env (alphaswarm_admin BFF) ---
ALPHASWARM_ADMIN_AUTH_PROVIDER=msal_entra
ALPHASWARM_ADMIN_AUTH_REQUIRED=true
ALPHASWARM_AUTH_MSAL_INTERNAL_TENANT_ID=12345678-aaaa-bbbb-cccc-deadbeef0000
ALPHASWARM_AUTH_MSAL_INTERNAL_APP_ID=99999999-1111-2222-3333-444444444444
ALPHASWARM_AUTH_MSAL_INTERNAL_AUDIENCE=api://alphaswarm-manage-api
ALPHASWARM_AUTH_OIDC_AUDIENCE=api://alphaswarm-manage-api
ALPHASWARM_ADMIN_ENTRA_TENANT=12345678-aaaa-bbbb-cccc-deadbeef0000
ALPHASWARM_ADMIN_ENTRA_REDIRECT_PATH=/api/auth/entra/callback

# --- Frontend env (alphaswarm_admin/frontend) ---
NEXT_PUBLIC_AQP_AUTH_PROVIDER=msal_entra
NEXT_PUBLIC_AQP_ADMIN_API_URL=http://localhost:8900
```

To write a `.env.alphaswarm_admin.entra` file alongside the printout:

```bash
python scripts/identity/alphaswarm_admin_entra_setup.py --write-env
```

The script is intentionally additive: it never overwrites values that weren't generated by it; the operator merges the block into their existing Kubernetes manifests / Helm values / `.env.local`.

## Step 2 — Verify the backend can reach Entra

Boot the admin BFF (or restart your existing instance) with the env vars sourced:

```bash
set -a; source .env.alphaswarm_admin.entra; set +a
uv run alphaswarm-admin   # or: python -m alphaswarm_admin.main
```

Then hit the new health endpoint:

```bash
curl -fsSL http://localhost:8900/admin/auth/health | jq .
```

Expected output:

```json
{
  "ok": true,
  "auth_enabled": true,
  "issuer": "https://login.microsoftonline.com/12345678-aaaa-bbbb-cccc-deadbeef0000/v2.0",
  "audience": "api://alphaswarm-manage-api",
  "jwks_uri": "https://login.microsoftonline.com/12345678-.../discovery/v2.0/keys",
  "discovery_url": "https://login.microsoftonline.com/12345678-.../v2.0/.well-known/openid-configuration",
  "key_count": 7
}
```

If `ok=false`, the JSON body's `stage` field tells you what failed (`discovery`, `issuer-mismatch`, `jwks`, `jwks-empty`). Common causes:

- Wrong `ALPHASWARM_AUTH_MSAL_INTERNAL_TENANT_ID` → fix the env var, restart.
- Tenant restrictions block the BFF from reaching `login.microsoftonline.com` → talk to Network about egress.

## Step 3 — Verify discovery returns the frontend config

```bash
curl -fsSL http://localhost:8900/admin/auth/discovery | jq .
```

Expected:

```json
{
  "provider": "msal_entra",
  "auth_enabled": true,
  "issuer": "https://login.microsoftonline.com/.../v2.0",
  "audience": "api://alphaswarm-manage-api",
  "scopes": ["api://alphaswarm-manage-api/.default"],
  "jwks_uri": "...",
  "authority": "https://login.microsoftonline.com/...",
  "client_id": "99999999-...",
  "tenant_id": "12345678-...",
  "redirect_path": "/api/auth/entra/callback",
  "claims_namespace": "https://alphaswarm.internal/"
}
```

The frontend fetches this on mount; no tenant ids land in the JS bundle.

## Step 4 — Boot the frontend with MSAL

```bash
cd alphaswarm_admin/frontend
# .env.local picks up NEXT_PUBLIC_* automatically.
pnpm dev
open http://localhost:3001
```

The first page load triggers the `AuthProvider` to:

1. `fetch('/admin/auth/discovery')` against the BFF.
2. Lazy-import `@azure/msal-browser`.
3. Construct a `PublicClientApplication` with the discovered config.
4. Call `handleRedirectPromise()` (consumes any pending login round-trip).
5. Surface the active account via `useAuth()`.

A signed-in user should see their name + roles in the dashboard header within a few seconds.
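
Before booting the frontend, an operator can sanity-check the discovery document from a script. The sketch below is an assumption-laden illustration rather than part of the repository: `missing_discovery_fields` is a hypothetical helper, and the required keys are taken from the `discovery missing client_id/authority` message the frontend is documented to log.

```python
import json

# Keys the SPA needs to build its MSAL client, per the documented
# "discovery missing client_id/authority" console message.
REQUIRED_KEYS = ("client_id", "authority")


def missing_discovery_fields(doc: dict) -> list:
    """Return the required keys that are absent or empty in `doc`."""
    return [key for key in REQUIRED_KEYS if not doc.get(key)]


if __name__ == "__main__":
    # Sample payload shaped like the Step 3 output, with client_id blanked.
    sample = json.loads('{"authority": "https://login.microsoftonline.com/...", "client_id": ""}')
    print(missing_discovery_fields(sample))
```

An empty result means the SPA has what it needs to construct its `PublicClientApplication`; a non-empty result points back at the backend env vars from Step 1.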
## Step 5 — End-to-end smoke test

The repo's MSAL round-trip helper validates the full chain:

```bash
python scripts/identity/verify_entra_login.py
```

Expected:

```
INFO Got access token: eyJ0… (1456 chars)
INFO Claims look correct.
INFO CA policies found: AlphaSwarm-Admins-MFA-Required, AlphaSwarm-Block-Risky-Sign-Ins
INFO All checks passed.
```

## How auth is enforced at runtime

```mermaid
sequenceDiagram
  participant Browser
  participant SPA as alphaswarm_admin SPA
  participant BFF as alphaswarm_admin BFF
  participant Entra
  Browser->>SPA: GET / (initial load)
  SPA->>BFF: GET /admin/auth/discovery
  BFF-->>SPA: { provider, issuer, audience, authority, client_id, scopes }
  SPA->>SPA: new PublicClientApplication(discovered)
  Browser->>SPA: click "Sign in"
  SPA->>Entra: /authorize (PKCE + nonce)
  Entra-->>Browser: MFA / CA challenge
  Browser->>Entra: present FIDO2
  Entra-->>SPA: redirect to /api/auth/entra/callback
  SPA->>SPA: handleRedirectPromise() -> account + tokens
  SPA->>BFF: GET /admin/cells (Authorization: Bearer ...)
  BFF->>BFF: require_admin -> JwtValidator.validate(token)
  BFF->>BFF: extract roles -> map to AlphaSwarm scopes
  BFF-->>SPA: 200 JSON
```

Every subsequent call:

1. SPA pulls the bearer via `acquireTokenSilent`.
2. Backend `require_admin` dependency validates issuer + audience + signature against the cached JWKS, expands `roles` through `alphaswarm_core.auth.rbac.expand_role`.
3. Step-up routes (`require_admin_step_up`) trigger `acquireTokenPopup` for a fresh MFA evaluation.

## Local dev (no Entra tenant needed)

Set:

```bash
export ALPHASWARM_ADMIN_AUTH_REQUIRED=false
# or:
export NEXT_PUBLIC_AQP_AUTH_PROVIDER=mock
```

Both backend and frontend fall back to a synthetic anonymous user with `admin:cluster` scope. The dashboard renders without any IdP round-trip — ideal for offline contributors.

## Troubleshooting

| Symptom | Cause / Fix |
| --- | --- |
| `GET /admin/auth/health → 502 stage=discovery` | BFF cannot reach `login.microsoftonline.com`. Check egress. |
| `GET /admin/auth/health → 502 stage=issuer-mismatch` | The configured tenant id doesn't match the tenant that responded. Double-check `ALPHASWARM_AUTH_MSAL_INTERNAL_TENANT_ID`. |
| Frontend stuck on the loading spinner | Inspect the browser console. The most common message is `discovery missing client_id/authority` — the BFF returned an incomplete discovery doc, meaning `ALPHASWARM_AUTH_MSAL_INTERNAL_APP_ID` is empty. |
| Login completes but the user has no roles | The user isn't in any AlphaSwarm-* directory group, or the staff app's API permission consent wasn't granted. Re-run `grant_admin_consent.sh`. |
| 401 on every API call after login | The bearer's `aud` doesn't match what the BFF expects. Check that the SPA's `scopes` came from `/admin/auth/discovery` (so they include `api://alphaswarm-manage-api/.default`). |
| Step-up popup never appears | `setStepUpSupported(false)` was set because the SPA fell back to mock. Confirm `NEXT_PUBLIC_AQP_AUTH_PROVIDER=msal_entra`. |

## Production deployment notes

The same env vars apply in production. In Kubernetes you typically:

1. Sync the values into a `Secret` via the External Secrets operator, sourcing from `secret/alphaswarm/admin/entra/*` in Vault.
2. Mount the Secret as env on the `alphaswarm-admin` Deployment.
3. Build the frontend image with `NEXT_PUBLIC_*` baked in (Next.js inlines these at build time).

The Terraform module `alphaswarm_entra_directory` already creates the staff app with the production redirect URI `https://manage.alpha-swarm.ai/api/auth/entra/callback`; the helper script's `--admin-origin` defaults to `http://localhost:3001` for dev, override to `https://manage.alpha-swarm.ai` for production manifests.

## Audit trail

Every Entra-side mutation lands in:

- The **Entra audit log** — exported to the corporate SIEM via the existing log stream.
- The **AlphaSwarm `terraform_runs` ledger** for every Terraform apply on the `entra-internal` stack.
- The **AlphaSwarm audit log** (Phase 7 §10) on the admin side — `require_admin` attaches the user's `oid` to every `workload_runs` row, so the admin's mutation surface is fully attributed.

## Related

- [`how-to/entra-terraform-bootstrap`](./entra-terraform-bootstrap.md)
- [`how-to/entra-onboard-new-staff`](./entra-onboard-new-staff.md)
- [`how-to/entra-rotate-secrets`](./entra-rotate-secrets.md)
- [`concepts/identity/entra-internal-tenant`](../concepts/identity/entra-internal-tenant.md)
- ADR-013: [`architecture/decisions/013-entra-as-first-pool`](../architecture/decisions/013-entra-as-first-pool.md)
- Long-form plan: [`docs/plans/entra-internal-tenant-rollout.md`](pathname:///docs/plans/entra-internal-tenant-rollout.md)

# how-to/audit-lake-reconstruction

# Audit lake reconstruction runbook

Phase 7 §10 (`RESTRUCTURING_PLAN.md`) — operating procedure for the hash-chained audit lake: hourly flush, transparency-log anchoring, replay harness, and regulatory-grade evidence bundles.

This runbook is the canonical companion to:

| Surface | Path |
| --- | --- |
| Hourly flush task | `alphaswarm/tasks/audit_lake_tasks.py::flush` |
| Anchor sinks | `alphaswarm/audit/sinks/{rekor,qldb,rfc3161}.py` |
| Replay harness | `alphaswarm/audit/replay.py` |
| Evidence bundle route | `alphaswarm_controller/src/alphaswarm_controller/api/routers/evidence_bundles.py` |
| Alembic migrations | `0085_audit_lake_anchors.py`, `0086_lineage_cell_id.py` |
| MinIO retention | `alphaswarm_platform/deployments/helm/alphaswarm-cell-data-plane/templates/minio.yaml` |
| OpenLineage relay extension | `alphaswarm/audit/openlineage_anchor.py` |

## Architecture in one paragraph

The Postgres ``audit_log`` hash chain (Alembic 0079) is the hot write path. Every hour the ``alphaswarm.tasks.audit_lake_tasks.flush`` Celery beat task seals the previous hour's segment, materialises it to ``alphaswarm_gold_audit.events_`` via Iceberg, copies the manifest to ``s3://alphaswarm--warehouse/audit/...`` with Object Lock COMPLIANCE, and submits the segment tip-hash to every configured transparency-log sink. The verification handle (Rekor entry UUID, QLDB document id, RFC 3161 ``TimeStampResp``) lands in ``audit_lake_anchors``. The ``BipartiteGraphObserver`` (Phase 7 §10.3 + Alembic 0086) stamps ``cell_id`` on every new lineage row so downstream queries can join audit + lineage by cell. Auditors call ``POST /manage/evidence-bundles`` to download a deterministic ``.tar.zst`` archive of every artifact needed to reconstruct an event window.

## When to enable

Flip ``ALPHASWARM_AUDIT_LAKE_ENABLED=true`` once:

1. Phase 6 §9.2 MinIO chart is rolled out per cell with ``objectLockOnAudit: true``.
2. ``objectLockRetention`` is set to the regulatory minimum (``7y`` for FINRA / SEC; ``30d`` for dev).
3. Alembic 0085 + 0086 have run against every per-cell Postgres.
4. At least one transparency sink is configured via ``ALPHASWARM_AUDIT_TRANSPARENCY_SINKS`` (comma-separated: ``rekor`` / ``qldb`` / ``rfc3161``).
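
To make the hash chain and segment tip-hash from the architecture paragraph concrete, here is a minimal sketch. It is not the scheme the Postgres trigger or the flush task uses (this document does not specify that); the canonical-JSON encoding, the all-zero genesis value, and the names `row_hash` and `segment_tip` are illustrative assumptions.

```python
import hashlib
import json

# Assumed genesis value for the very first row; the real one is not documented here.
GENESIS = "0" * 64


def row_hash(prev_hash: str, payload: dict) -> str:
    """Hash one audit row together with the hash of the row before it."""
    body = json.dumps(payload, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256((prev_hash + body).encode("utf-8")).hexdigest()


def segment_tip(prev_tip: str, rows: list) -> str:
    """Fold a segment's rows into a single tip hash, starting from the
    previous segment's tip so consecutive segments stay chained."""
    tip = prev_tip
    for payload in rows:
        tip = row_hash(tip, payload)
    return tip
```

Because each hash covers its predecessor, changing any earlier row changes the tip. That property is what makes anchoring only the tip in a transparency log sufficient to detect tampering anywhere in the segment.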
## Step 1 — Configure transparency sinks

| Sink | Use when | Required env |
| --- | --- | --- |
| **Rekor** (default) | Shared cells, public verifiability | `ALPHASWARM_AUDIT_REKOR_URL` (default `https://rekor.sigstore.dev`) + Vault `secret/alphaswarm/rekor/sigstore` with `signing_key_pem` + `signing_cert_pem` |
| **AWS QLDB** | `silo-reg` cells on AWS | `ALPHASWARM_AUDIT_QLDB_LEDGER_NAME`, `ALPHASWARM_AUDIT_QLDB_REGION`, AWS IAM role with `qldb:SendCommand` |
| **RFC 3161 TSA** | `silo-reg` cells on-prem | `ALPHASWARM_AUDIT_RFC3161_TSA_URL` + Vault `secret/alphaswarm/rfc3161/tsa:` with optional `client_cert_pem`/`client_key_pem` |

The three sinks are pluggable adapters of the `TransparencyAnchorSink` ABC (`alphaswarm/audit/protocol.py`). Operators MAY ship a custom subclass; the metaclass auto-registers it as long as it sets `sink_kind` and lives in an imported module.

Belt-and-braces example for a `silo-reg`-on-prem cell:

```bash
ALPHASWARM_AUDIT_TRANSPARENCY_SINKS=rekor,rfc3161
```

The flush task tries every configured sink and records every successful anchor as one row in `audit_lake_anchors`; an auditor who needs cross-verification can pick whichever sink suits.

## Step 2 — Enable the hourly flush

```bash
# Per cell namespace.
kubectl set env -n cell-shared-std-us-east-1a deploy/alphaswarm-core \
  ALPHASWARM_AUDIT_LAKE_ENABLED=true \
  ALPHASWARM_AUDIT_TRANSPARENCY_SINKS=rekor
kubectl rollout status -n cell-shared-std-us-east-1a deploy/alphaswarm-core
```

The flush task is already registered in `alphaswarm/tasks/celery_app.py::beat_schedule` as `audit-lake-flush` (default interval 3600 s). The settings layer (`alphaswarm_lake_enabled=False`) keeps it inert until you flip the switch.
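
How the two environment variables combine can be sketched as follows. This is a reading of the behaviour described in this runbook, not the settings layer itself: `parse_sinks` and `flush_plan` are hypothetical names, and treating only the literal string `true` as enabled is an assumption.

```python
def parse_sinks(raw: str) -> list:
    """Split the comma-separated sink list, dropping blanks and whitespace."""
    return [part.strip().lower() for part in raw.split(",") if part.strip()]


def flush_plan(env: dict) -> dict:
    """Describe what one scheduled flush would do for a given environment."""
    enabled = env.get("ALPHASWARM_AUDIT_LAKE_ENABLED", "false").strip().lower() == "true"
    if not enabled:
        # Disabled: the beat task still fires but does nothing.
        return {"flush": False, "anchor_sinks": []}
    sinks = parse_sinks(env.get("ALPHASWARM_AUDIT_TRANSPARENCY_SINKS", ""))
    # Enabled with no sinks: the segment is flushed but not anchored.
    return {"flush": True, "anchor_sinks": sinks}
```

The sketch deliberately does not reject unknown sink names, since operators may register custom `TransparencyAnchorSink` subclasses.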
Verify a single flush manually:

```bash
celery -A alphaswarm.tasks.celery_app call alphaswarm.tasks.audit_lake_tasks.flush
# Then inspect the new rows:
psql -c "SELECT cell_id, segment_start_ts, state, row_count, iceberg_snapshot_id
         FROM audit_lake_segments
         ORDER BY segment_start_ts DESC LIMIT 5"
```

A successful flush emits the OpenLineage `RunEvent` `alphaswarm/audit/segment-anchor` to the existing `lineage_openlineage_outbox`; the Marquez relay carries it through the standard pipeline.

## Step 3 — Verify a segment manually

```bash
python -c "
from alphaswarm.audit import AnchorRecord
from alphaswarm.audit.sinks import RekorSink
from datetime import datetime, timezone
from sqlalchemy import text
from alphaswarm.persistence.db import get_session

# Load the newest segment for the cell and its Rekor anchor.
with get_session() as s:
    row = s.execute(text(
        'SELECT * FROM audit_lake_segments WHERE cell_id = :c '
        'ORDER BY segment_start_ts DESC LIMIT 1'
    ), {'c': 'cell-shared-std-us-east-1a'}).first()
    anchor = s.execute(text(
        'SELECT * FROM audit_lake_anchors WHERE segment_id = :id AND sink_kind = :k'
    ), {'id': row.id, 'k': 'rekor'}).first()

# Rebuild the record that was anchored and verify it against the sink.
record = AnchorRecord(
    cell_id=row.cell_id,
    segment_start_ts=row.segment_start_ts,
    segment_end_ts=row.segment_end_ts,
    prev_tip_hash=row.prev_segment_tip_hash,
    tip_hash=row.segment_tip_hash,
    iceberg_snapshot_id=row.iceberg_snapshot_id or '',
    s3_manifest_uri=row.s3_manifest_uri or '',
)
print(RekorSink().verify(record, anchor.verification_handle))
"
```

Should print `True`. If `False`, STOP and investigate before producing any evidence bundles — the chain is broken or the anchor was tampered with.

## Step 4 — Replay a recorded run

`alphaswarm/audit/replay.py` re-executes a run against its hash-locked spec.

```python
from alphaswarm.audit.replay import replay_run, ReplayEnvironment

report = replay_run(
    run_id="agent-run-abc123",
    cell_id="cell-shared-std-us-east-1a",
    target_environment=ReplayEnvironment.AUDIT_SHADOW,
)
print(report.to_dict())
```

The harness:

1. Looks up the run row in whichever runtime table contains the id.
2. Loads the immutable spec snapshot via ``_spec_versions``.
3. Looks up the MCP tool descriptor hashes recorded at original run time.
4. Provisions a deterministic shadow Postgres schema named ``replay___`` (see ``_shadow_schema_name``).
5. Verifies the anchored audit segment covering the run's timestamp.
6. Returns a `ReplayReport` with `output_matches` / `anchor_verified` for sign-off.

The actual re-execution slot is currently a Phase 7.5 TODO — until then `replay_output_hash` mirrors `original_output_hash` so the report covers spec-pinning + anchor verification only. That's the audit-essential surface.

## Step 5 — Produce an evidence bundle

```bash
curl -X POST https://manage.alpha-swarm.ai/manage/evidence-bundles \
  -H "Authorization: Bearer ${ALPHASWARM_ADMIN_TOKEN}" \
  -H "Content-Type: application/json" \
  --data '{
    "tenant_id": "tenant_acme",
    "cell_id": "cell-silo-reg-acme",
    "from_ts": "2026-05-01T00:00:00Z",
    "to_ts": "2026-05-31T23:59:59Z"
  }' \
  --output evidence-acme-may2026.tar.zst
```

The bundle contents (every part is a deterministic JSON file):

| File | Source |
| --- | --- |
| `manifest.json` | Top-level manifest with SHA-256 of every other part |
| `audit_rows.json` | Every `audit_log` row in the window |
| `audit_segments.json` | Every `audit_lake_segments` row + its anchors |
| `spec_snapshots.json` | Every immutable spec referenced by an audit row |
| `lineage.json` | Bipartite lineage rows for the same cell + window |

The manifest hash IS the canonical bundle id; auditors archive ``manifest.manifest_hash`` alongside the .tar.zst.

## Reverting

Phase 7 is incrementally adopted. Reverting is easy:

- `ALPHASWARM_AUDIT_LAKE_ENABLED=false` — the task no-ops; existing data remains.
- `ALPHASWARM_AUDIT_TRANSPARENCY_SINKS=` (empty) — the segment still flushes to Iceberg but no anchors are submitted.
- The Iceberg `alphaswarm_gold_audit.events_` tables are read-only by policy; do NOT delete them. They are the cold-storage backup of the Postgres `audit_log`.
- MinIO Object Lock COMPLIANCE means the `audit/` prefix CANNOT be deleted by anyone — not even the root user — until retention expires. This is the regulatory commitment, not a bug.

## SLOs

| SLO | Target |
| --- | --- |
| Flush latency p99 | ≤ 5 minutes after segment close |
| Anchor latency p99 | ≤ 10 minutes after flush completes |
| Per-segment row throughput | ≥ 10 000 audit rows / minute |
| Evidence bundle build time | ≤ 30 s for a 30-day window |
| Anchor verify success rate | ≥ 99.9% (excluding Internet outage windows) |

## Where to file alerts

Prometheus + Alertmanager rules live in `alphaswarm_platform/deployments/kubernetes/base-services/prometheus-operator/` (future Phase 7.5 deliverable). Until then, monitor:

- `audit_lake_segments.state = 'flushed'` rows that haven't progressed to `'anchored'` within 30 minutes — indicates sink failure.
- `audit_lake_anchors.last_verified_ok = FALSE` rows — indicates an anchor was tampered with or the sink is unreachable.
- `audit_log` insert errors with text `hash chain` — the Postgres trigger is rejecting a row.

## Audit trail of THIS subsystem

Every Phase 7 mutation lands in `workload_runs`:

- Flipping `ALPHASWARM_AUDIT_LAKE_ENABLED` lands as an `apply_config` row.
- Each evidence-bundle export lands as an `evidence_bundle_export` row BEFORE the bytes leave the process (AGENTS rule 45).
- The hourly flush itself does NOT land a `workload_runs` row by design (it's a routine background task, not an operator action) — the per-segment write to `audit_lake_segments` IS the audit trail for the flush.
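
Until the Prometheus rules exist, the first interim check from "Where to file alerts" (segments stuck in `flushed` for more than 30 minutes) can be approximated in a few lines. This is a sketch under assumptions: the rows are plain dicts already fetched from `audit_lake_segments`, and `flushed_at` is a hypothetical timestamp column that this document does not confirm.

```python
from datetime import datetime, timedelta, timezone


def stale_flushed_segments(segments, now, max_age=timedelta(minutes=30)):
    """Return segments still in state 'flushed' for longer than `max_age`.

    Each segment is a dict with 'state' and a timezone-aware 'flushed_at'
    (an assumed column name, used here for illustration only).
    """
    return [
        seg for seg in segments
        if seg["state"] == "flushed" and now - seg["flushed_at"] > max_age
    ]


if __name__ == "__main__":
    now = datetime.now(timezone.utc)
    demo = [{"id": 1, "state": "flushed", "flushed_at": now - timedelta(hours=1)}]
    print(stale_flushed_segments(demo, now))
```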
# how-to/cell-data-plane-migration

# Cell data plane migration runbook

Phase 6 §9 (`RESTRUCTURING_PLAN.md`) — operating procedure for provisioning a per-cell data plane and migrating a tenant from the shared cluster-wide Postgres/Redis/MinIO/MLflow/Iceberg into the dedicated cell.

This runbook is the canonical companion to:

| Surface | Path |
| --- | --- |
| Helm chart | `alphaswarm_platform/deployments/helm/alphaswarm-cell-data-plane/` |
| Topology models | `alphaswarm_core/src/alphaswarm_core/topology/models.py` |
| Cell registry seed | `alphaswarm_platform/configs/deployment/topology.yaml` |
| Dual-write switch | `ALPHASWARM_CELL_DUAL_WRITE` (`alphaswarm/config/settings.py`) |
| Backfill script | `scripts/cells/dual_write_backfill.py` |
| Iceberg cell-awareness | `alphaswarm/data/iceberg_catalog.py:_cell_data_plane` |
| Vault Transit cell-key | `alphaswarm/credentials/vault_transit.py:_resolve_transit_key_name` |
| Engine cell-keying | `alphaswarm/persistence/db.py:_sync_engine_for_cell` |

## When to use this runbook

You SHOULD migrate a tenant into a dedicated per-cell data plane when:

- A regulatory commitment requires cryptographic data-plane separation (FINRA, ISO 27001, SOC 2 with customer-side isolation).
- The tenant signs onto a `silo-reg` or `silo-custom` contract.
- A multi-AZ blast-radius failure isolated to one cell should not affect other tenants.

You SHOULD NOT use this runbook for:

- A `shared-std` cell — those share the cluster-wide data plane by design.
- An ordinary regional cutover (use `cell-router-cutover.md` instead).

## Pre-flight

1. **Cell exists in `topology.yaml`.** Verify the destination cell id in the `cells:` section with a populated `data_plane:` block:

   ```yaml
   - id: cell-silo-reg-acme
     tier: silo-reg
     tenancy_strategy: database_per_enterprise
     # ...
     data_plane:
       postgres_dsn_secret: secret/alphaswarm/cells/cell-silo-reg-acme/postgres
       iceberg_rest_uri: http://alphaswarm-cell-iceberg-rest.cell-silo-reg-acme.svc.cluster.local:8181
       iceberg_warehouse_uri: s3://alphaswarm-cell-silo-reg-acme-warehouse/
       minio_endpoint: http://alphaswarm-cell-minio.cell-silo-reg-acme.svc.cluster.local:9000
       vault_transit_key: alphaswarm-cell-silo-reg-acme
   ```

   The `vault_transit_key` is **mandatory** for `silo-reg` cells.

2. **Vault paths exist.** Seed every credential the chart consumes:

   - `secret/alphaswarm/cells//postgres` — `username` + `password`
   - `secret/alphaswarm/cells//minio` — `access_key` + `secret_key`
   - `secret/alphaswarm/cells//mlflow` — `dsn` (Postgres DSN under the per-cell Postgres)
   - `secret/alphaswarm/cells//iceberg` — `jdbc_uri` + `username` + `password`

   The Phase 4 §7.6 `vault-secrets-operator` materialises these into Kubernetes `Secret` objects via the chart's `VaultStaticSecret` CRs.

3. **Operator (Phase 6.5) prerequisites installed cluster-wide.**

   - CloudNativePG operator (`postgresql.cnpg.io/v1`)
   - vault-secrets-operator (`secrets.hashicorp.com/v1beta1`)
   - Linkerd 2.16 (Phase 4 §7.1)

## Step 1 — Provision the per-cell data plane

Install the Helm chart for the target cell. The chart stamps a CNPG `Cluster`, Redis StatefulSet, MinIO StatefulSet + bucket bootstrap Job (with Object Lock COMPLIANCE on the `audit/` prefix), MLflow Deployment, and Iceberg REST Deployment.

```bash
helm install data-plane alphaswarm_platform/deployments/helm/alphaswarm-cell-data-plane/ \
  --namespace cell-silo-reg-acme \
  --set cell_id=cell-silo-reg-acme \
  --set tier=silo-reg \
  --set region=us-east-1 \
  --set minio.replicas=4 \
  --set postgres.instances=3
```

Wait for every Pod to reach `Ready=true`.
Then:

```bash
kubectl -n cell-silo-reg-acme get pods
kubectl -n cell-silo-reg-acme get vaultstaticsecret
kubectl -n cell-silo-reg-acme exec alphaswarm-cell-postgres-1 -- psql -c "SELECT 1"
```

The MinIO bootstrap Job creates 4 buckets with Object Lock COMPLIANCE on `alphaswarm-cell-silo-reg-acme-audit` for 30 days; verify:

```bash
kubectl -n cell-silo-reg-acme exec deploy/alphaswarm-cell-minio-bootstrap -- \
  mc retention info "cell/alphaswarm-cell-silo-reg-acme-audit"
# expect: Mode=COMPLIANCE Validity=30d
```

## Step 2 — Run schema migrations against the new Postgres

Inside the cell namespace, run `alembic upgrade head` against the per-cell DSN. The CNPG cluster ships the application schema only after this step.

```bash
kubectl -n cell-silo-reg-acme run alembic --rm -it --image=ghcr.io/julianwiley/alphaswarm-api:latest --restart=Never -- \
  alembic -c /app/alembic.ini upgrade head
```

The Alembic chain immutability check (`scripts/ci/check_migration_immutability.py`) guarantees the same numeric head as the shared plane.

## Step 3 — Enable dual writes

Flip `ALPHASWARM_CELL_DUAL_WRITE=true` in the **API** environment. This is the critical safety window — once enabled, every new write goes to BOTH planes (the shared cluster-wide plane AND the per-cell plane bound via `RequestContext.cell_id`). It does NOT affect callers without an active request context.

```bash
kubectl set env -n alphaswarm deployment/alphaswarm-core ALPHASWARM_CELL_DUAL_WRITE=true
kubectl rollout status -n alphaswarm deployment/alphaswarm-core
```

Verify the new cells are reachable by issuing a noop write from a test tenant pinned to the cell.
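
The dual-write rule from Step 3 can be summarised in a few lines. This is a conceptual sketch only: `Plane` and `write_row` are hypothetical stand-ins for the real engines keyed by cell, and in-memory lists replace Postgres.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Plane:
    """In-memory stand-in for one data plane (shared or per-cell)."""
    name: str
    rows: list = field(default_factory=list)

    def write(self, row: dict) -> None:
        self.rows.append(row)


def write_row(row: dict, shared: Plane, cell_planes: dict,
              cell_id: Optional[str], dual_write: bool) -> None:
    """Always write to the shared plane; mirror to the cell plane only when
    dual-write is on AND the request context carries a cell id."""
    shared.write(row)
    if dual_write and cell_id is not None:
        cell_planes[cell_id].write(row)
```

Callers without a request context (here, `cell_id=None`) only ever touch the shared plane, which matches the note that the flag does not affect them.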
## Step 4 — Backfill historical rows

```bash
# Dry-run first to print row counts:
python scripts/cells/dual_write_backfill.py \
  --tenant tenant_acme \
  --target-cell cell-silo-reg-acme

# When the plan looks right, apply:
python scripts/cells/dual_write_backfill.py \
  --tenant tenant_acme \
  --target-cell cell-silo-reg-acme \
  --apply
```

The script copies every tenant-owned table (workspaces, strategy specs, agent runs, bot runs, RL experiments, paper trading, dataset specs, …) but never deletes from the source. It refuses to write if the destination plane already has rows for the same tenant — that is the idempotency guard against duplicate inserts.

## Step 5 — Reconcile

```bash
python scripts/cells/dual_write_backfill.py \
  --tenant tenant_acme \
  --target-cell cell-silo-reg-acme \
  --reconcile-only
```

Every table MUST show `OK` (matching row count AND matching SHA-256 roll-up). If even one shows `MISMATCH`, STOP — investigate before proceeding. The script exits with code 2 on mismatch.

## Step 6 — Cutover

Mutate `tenant_cells.cell_id` for the tenant. This step is intentionally NOT automated by the backfill script — operators run it manually so the change generates an explicit `workload_runs` audit row.

```sql
-- in the SHARED plane
INSERT INTO workload_runs (organization_id, action, ...)
VALUES ('tenant_acme', 'cell_cutover', ...);

UPDATE tenant_cells
SET cell_id = 'cell-silo-reg-acme', cutover_at = NOW()
WHERE tenant_id = 'tenant_acme';
```

The cell-router (Phase 3 §6.4) picks up the new mapping on the next JWT exchange. Existing in-flight sessions stay bound to the source plane until the next request — no in-flight rollback needed.

## Step 7 — Disable dual writes

```bash
kubectl set env -n alphaswarm deployment/alphaswarm-core ALPHASWARM_CELL_DUAL_WRITE=false
kubectl rollout status -n alphaswarm deployment/alphaswarm-core
```

The tenant is now isolated in the cell data plane. The historical rows remain in the shared plane (Phase 6 keeps them as the immutable fallback path); a separate retention policy (90 days) prunes them after sufficient bake time. Do NOT delete source rows from this runbook.

## Reverting

If anything goes wrong between Step 3 and Step 6 you can revert cleanly because writes are landing in BOTH planes. Set `ALPHASWARM_CELL_DUAL_WRITE=false`, restore the previous `tenant_cells.cell_id`, and the tenant resumes on the shared plane.

After Step 6 the cutover is sticky — reverting requires running the inverse backfill (`--tenant tenant_acme --target-cell cell-shared-std-local`) and is a manual operation. Coordinate with the on-call.

## Audit trail

Every step writes audit rows:

- Step 1 (Helm install): captured by Argo CD's `Application` revision.
- Step 2 (`alembic upgrade head`): writes `alembic_version` in the per-cell Postgres.
- Step 3 (`ALPHASWARM_CELL_DUAL_WRITE=true`): captured by `alphaswarm_controller.audit.write_workload_run` when the env flip lands.
- Step 4-5 (backfill): the script logs to stdout AND writes a `cell_backfill_runs` row (Alembic 0085, future).
- Step 6 (cutover): the explicit `workload_runs` INSERT above.
- Step 7 (`ALPHASWARM_CELL_DUAL_WRITE=false`): captured by `alphaswarm_controller.audit.write_workload_run`.

The auditor SHOULD verify all seven rows exist before signing off on the migration.

# Cell-router cutover runbook

> Phase 3 §6 of
> [RESTRUCTURING_PLAN.md](https://github.com/julianwiley/alphaswarm/blob/main/RESTRUCTURING_PLAN.md).
> Covers the cutover from the single-container Python FastAPI cell
> proxy (in `alphaswarm_client/`) to the Envoy + `alphaswarm-tenant-router`
> two-component cell router. This runbook is the operator-facing
> companion to the deployment manifests at
> `alphaswarm_platform/deployments/kubernetes/edge/`.
## Architecture (Phase 3 §6.4) ``` [ user / agent ] │ TLS ▼ [ Cloudflare Tunnel (alpha-swarm.ai) ] │ ▼ [ alphaswarm-edge — Envoy (HTTP-only) ] │ ext_authz callout │ ──────────────────────▶ [ alphaswarm-tenant-router ] │ │ /resolve │ ▼ │ [ cells registry (control plane) ] │ ◀──────────────────── x-alphaswarm-cell header │ ▼ Route on x-alphaswarm-cell: [ alphaswarm-cell--api (FastAPI) ] [ alphaswarm-cell--workers (Celery, gVisor for agents) ] [ alphaswarm-cell--postgres ] [ alphaswarm-cell--minio ] ``` ## Prerequisites 1. The four canonical AlphaSwarm images (`alphaswarm-api`, `alphaswarm-worker`, `alphaswarm-client`, `alphaswarm-controller`) are running on the pre-Phase-3 single-namespace topology. The Phase 3 work runs IN PARALLEL until the canary completes — nothing is taken away from the running fleet. 2. The Alembic head is at `0083_audit_cell_id_column.py`. Verify: ```bash alembic current # expected: 0083_audit_cell_id_column (head) ``` 3. The `cells` registry has at least one `state=active` cell row. Verify via the control plane: ```bash curl -sS https://manage.alpha-swarm.ai/manage/cells | jq '.data[].id' ``` 4. The `alphaswarm-edge` namespace exists and carries the `alphaswarm.io/host-network-allowed: "true"` exception label per Phase 2 §5.4. ## Step 0 — Build the Phase 3 images Both images build from the `alphaswarm_platform` repo root (the post-repo-split context): ```bash cd alphaswarm_platform # alphaswarm-edge (Envoy) docker buildx build \ --platform linux/amd64,linux/arm64 \ --file build/docker/alphaswarm-edge/Dockerfile \ --tag ghcr.io/julianwiley/alphaswarm-edge:v0.2.0 \ --push . # alphaswarm-tenant-router (Python + uvloop) docker buildx build \ --platform linux/amd64,linux/arm64 \ --file build/docker/alphaswarm-tenant-router/Dockerfile \ --tag ghcr.io/julianwiley/alphaswarm-tenant-router:v0.2.0 \ --push . 
``` Tagged releases build both images automatically: `alphaswarm_platform/.github/workflows/build-publish.yml` pushes them to ECR with Cosign keyless signatures, SBOM + SLSA provenance, and a Trivy scan via the `build-sign-push` composite. ## Step 1 — Deploy in parallel (week 6) > **Auth posture first.** The tenant-router ships fail-closed > (`AUTH_MODE=required` with an empty issuer) and will crash-loop > until the IdP issuer/audience are stamped into > `alphaswarm-tenant-router-config`. Complete steps 1-2 of the > [tenant-router auth rollout runbook](./tenant-router-auth-rollout.md) > before (or together with) this apply. ```bash # Apply both Deployments + Services + PodDisruptionBudgets # (+ the tenant-router's ConfigMap, NetworkPolicy, HPA, and the # alphaswarm-cell-bound-validator Service): kubectl apply -k alphaswarm_platform/deployments/kubernetes/edge/alphaswarm-edge/ kubectl apply -k alphaswarm_platform/deployments/kubernetes/edge/alphaswarm-tenant-router/ # Verify the tenant-router hydrated the cells cache: kubectl -n alphaswarm-edge port-forward svc/alphaswarm-tenant-router 18080:8080 curl -sS http://127.0.0.1:18080/readyz # expected: {"status":"ok","cells":,"auth_mode":"required","cba_mode":"enforce"} ``` DNS still points to the Python proxy. No user traffic flows to `alphaswarm-edge` yet. ## Step 2 — DNS canary 10% (week 7) Cloudflare Workers + Load Balancer split the apex hostname (`alpha-swarm.ai`) across the two backends: ```hcl # cloudflare/alphaswarm_load_balancer.tf (excerpt) resource "cloudflare_load_balancer_pool" "alphaswarm_proxy_legacy" { origins = [{ name = "alphaswarm-client", address = "...", weight = 0.9 }] } resource "cloudflare_load_balancer_pool" "alphaswarm_proxy_envoy" { origins = [{ name = "alphaswarm-edge", address = "...", weight = 0.1 }] } ``` Apply via `alphaswarm deploy terraform plan apply` (NEVER raw `terraform apply` per AGENTS rule 42).
Verify both pools are healthy:

```bash
kubectl -n alphaswarm-edge get pods -l app=alphaswarm-edge
kubectl -n alphaswarm-edge get pods -l app=alphaswarm-tenant-router

# Tail tenant-router logs for any 503s / cache misses:
kubectl -n alphaswarm-edge logs -l app=alphaswarm-tenant-router --tail=200 -f
```

Stop conditions (rollback to 100% legacy):

- `alphaswarm-tenant-router` `/readyz` returns 503 for > 1 minute.
- Envoy `5xx` rate on `alphaswarm-edge` ingress > 0.5% over a 5-minute window.
- Any audit event with `cell_id IS NULL` after the canary starts (indicates the X-AlphaSwarm-Cell header isn't propagating into `RequestContext`).

## Step 3 — 50% traffic (week 8)

Cloudflare LB weight: 0.5 / 0.5. Repeat the verification + stop conditions from step 2. Watch the `alphaswarm.cell.id` distribution in Tempo:

```
{alphaswarm.cell.id="cell-shared-std-local"} | count_over_time(span_count[5m])
```

Both routes should converge on the same cell-id distribution.

## Step 4 — 100% traffic (week 9)

Cloudflare LB weight: 0.0 / 1.0. The Python proxy continues to run but receives no live traffic. Keep it running for 7 days as the rollback safety net.

## Step 5 — Remove the Python FastAPI proxy (week 10)

This step is intentionally NOT in the Phase 3 PR; it lands as a follow-up after the 7-day soak. The follow-up PR removes the FastAPI proxy module from `alphaswarm_platform/build/docker/alphaswarm_client/Dockerfile` (the `production` stage's uvicorn entrypoint) and strips the `/api/*`, `/ws/*`, `/manage/*`, `/static` route handlers from `alphaswarm/api/main.py`. Tag the last buildable proxy image (`alphaswarm-client:proxy-last-stable`) before the removal lands so a regression has a known-good rollback target.

## Rollback at any step

- Cloudflare LB weight back to 1.0 / 0.0 — instant traffic drain back to the legacy proxy.
- `kubectl -n alphaswarm-edge scale deployment alphaswarm-edge --replicas=0` prevents Envoy from accepting any traffic even if DNS still points at it.
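
The three stop conditions listed for the canary stages can be written down as a small check that an operator script might evaluate before each weight change. This is a minimal sketch under stated assumptions, not part of the AlphaSwarm tooling: the `CanarySnapshot` type, its field names, and the idea of feeding it hand-collected numbers are hypothetical; only the three thresholds come from this runbook.

```python
from dataclasses import dataclass


@dataclass
class CanarySnapshot:
    """Hypothetical observations collected for one evaluation window."""

    readyz_503_seconds: float    # how long /readyz has continuously returned 503
    envoy_requests_5m: int       # requests seen at the alphaswarm-edge ingress, last 5 min
    envoy_5xx_5m: int            # 5xx responses in the same 5-minute window
    null_cell_audit_events: int  # audit events with cell_id IS NULL since the canary started


def stop_reasons(snapshot: CanarySnapshot) -> list[str]:
    """Return every stop condition that currently fires (empty list = keep going)."""
    reasons = []
    if snapshot.readyz_503_seconds > 60:
        reasons.append("tenant-router /readyz returned 503 for more than 1 minute")
    # Guard against an empty window so the rate is never computed as 0/0.
    if snapshot.envoy_requests_5m > 0:
        rate = snapshot.envoy_5xx_5m / snapshot.envoy_requests_5m
        if rate > 0.005:
            reasons.append("Envoy 5xx rate above 0.5% over a 5-minute window")
    if snapshot.null_cell_audit_events > 0:
        reasons.append("audit events with cell_id IS NULL after canary start")
    return reasons


if __name__ == "__main__":
    # Illustrative numbers only.
    for reason in stop_reasons(CanarySnapshot(90.0, 2000, 3, 0)):
        print("STOP:", reason)
```

Any non-empty result maps to the first rollback action above: set the Cloudflare LB weight back to 1.0 / 0.0.
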
## Phase 3 §6.6 follow-up — the removal PR The Python proxy lives at `alphaswarm/api/proxy.py` + the relevant routes in `alphaswarm/api/main.py`. The Phase 3 §6.6 removal PR: 1. Cuts the route registrations. 2. Updates the `alphaswarm-client` Dockerfile to drop the proxy CMD. 3. Removes the proxy's tests under `tests/api/`. 4. Tags the prior commit `alphaswarm-client-proxy-final` so a rollback restores the buildable artifact. ## Related documents - [RESTRUCTURING_PLAN.md §6](https://github.com/julianwiley/alphaswarm/blob/main/RESTRUCTURING_PLAN.md) - [alphaswarm_platform/deployments/kubernetes/cells/README.md](https://github.com/julianwiley/alphaswarm/blob/main/alphaswarm_platform/deployments/kubernetes/cells/README.md) - [alphaswarm_platform/deployments/argocd/applicationsets/cells-appset.yaml](https://github.com/julianwiley/alphaswarm/blob/main/alphaswarm_platform/deployments/argocd/applicationsets/cells-appset.yaml) - [alphaswarm_tenant_router/AGENTS.md](https://github.com/julianwiley/alphaswarm/blob/main/alphaswarm_tenant_router/AGENTS.md) # Chainguard base migration runbook # Chainguard base migration runbook > Phase 2 §5.1 + §5.2 + §5.3 + §5.4 of > [RESTRUCTURING_PLAN.md](https://github.com/julianwiley/alphaswarm/blob/main/RESTRUCTURING_PLAN.md). > Owns the cutover from Debian-slim base images to Chainguard Wolfi > bases, the cosign + SBOM signing pipeline, and the Kyverno > admission policies that gate signed-only images in production. 
## Scope Four AlphaSwarm-owned images move to Chainguard Wolfi in Phase 2 §5.1: | Image | Dockerfile | Base before | Base after | | --- | --- | --- | --- | | `alphaswarm-api` / `alphaswarm-worker` (shared `api` target) | `alphaswarm_platform/Dockerfile` | `python:3.11-slim` | `cgr.dev/chainguard/python:3.11-dev` | | `alphaswarm-controller` | `alphaswarm_platform/build/docker/alphaswarm_controller/Dockerfile` | `python:3.11-slim` | `cgr.dev/chainguard/python:3.11-dev` (builder) + `cgr.dev/chainguard/python:3.11` (runtime) | | `alphaswarm-client` | `alphaswarm_platform/build/docker/alphaswarm_client/Dockerfile` | `node:20-alpine` + `python:3.11-slim` | `cgr.dev/chainguard/node:20-dev` + `cgr.dev/chainguard/python:3.11-dev` (builders) + `cgr.dev/chainguard/python:3.11` (runtime) | | `alphaswarm-ui` | `alphaswarm_platform/build/docker/alphaswarm_ui/Dockerfile` | `node:20-alpine` | `cgr.dev/chainguard/node:20-dev` (builder) + `cgr.dev/chainguard/node:20` (runtime) | Two images carry **documented exemptions** and stay on their current bases: | Image | Dockerfile | Reason | | --- | --- | --- | | `alphaswarm-bots` standard | `alphaswarm_bots/Dockerfile` | Already on `gcr.io/distroless/python3-debian12:nonroot` — smaller and more locked-down than Chainguard Python, no shell at all. Builder stage stays on `python:3.12-slim-bookworm` for build-essential availability. | | `alphaswarm-bots` HFT | `alphaswarm_bots/Dockerfile.hft` | Kernel-bypass libs (DPDK, Onload, Mellanox OFED) require kernel headers + `libnuma1` + `linuxptp` + `ethtool` + `kmod` which Chainguard's nonroot Wolfi runtime image does not ship. Per ADR 007. 
| Two **future-phase scaffolds** are created in Phase 2 §5.6: | Image | Dockerfile | Activation phase | | --- | --- | --- | | `alphaswarm-edge` (Envoy cell router) | `alphaswarm_platform/build/docker/alphaswarm-edge/Dockerfile` | Phase 3 §6.4 (cell topology) | | `alphaswarm-agent-sandbox` (gVisor target) | `alphaswarm_platform/build/docker/alphaswarm-agent-sandbox/Dockerfile` | Phase 5 §8 (per-tenant MCP + agent sandbox) | ## Why Chainguard Wolfi - **glibc**, not musl — keeps native wheel compatibility for `numpy`, `pyarrow`, `torch`, `psycopg2`, etc. The RESTRUCTURING_PLAN footnote at §5.1 explicitly notes that Alpine/musl-style minimalism breaks the native-wheel toolchain. - **Continuously rebuilt** — Chainguard ships a fresh image set every ~24 hours, so CVE patches land without us doing anything beyond a rebuild. Pair with Renovate (Phase 1 §4.7) to re-trigger the build matrix on a base-image bump. - **No CVEs in the base** — Chainguard runs distroless-style scans and publishes a daily-zero-CVE SLO for `latest` tags. We still run `grype --fail-on high` per Phase 2 §5.2 because application-level CVEs are our responsibility. - **Single nonroot UID convention (65532)** — matches the Phase 2 §5.4 PSS restricted profile. The runtime stages never run as root; the `-dev` builder runs as root only for `apk add`. ## Build verification Local one-off build (no push, no signing — for inner-loop dev): ```bash docker buildx build \ --platform linux/amd64,linux/arm64 \ --target api \ --file alphaswarm_platform/Dockerfile \ --tag alphaswarm-api:dev \ . ``` Multi-arch build via `build-multi-arch.yml` (CI canonical path): ```bash gh workflow run build-multi-arch.yml \ --ref feat/phase-2-supply-chain ``` The workflow signs every pushed image with cosign keyless OIDC and uploads a CycloneDX SBOM. The `inspect` job at the bottom of the workflow runs `cosign verify` + `cosign verify-attestation` to confirm the signature + SBOM attestation land in Rekor. 
### Verify cosign signature locally

```bash
cosign verify \
  --certificate-identity-regexp 'https://github.com/julianwiley/alphaswarm/.github/workflows/build-multi-arch\.yml@refs/.*' \
  --certificate-oidc-issuer https://token.actions.githubusercontent.com \
  docker.io/julianwiley/alphaswarm-api:latest
```

Expected exit code: 0. The output prints the signature payload including the Rekor transparency log entry index.

### Verify CycloneDX SBOM attestation locally

```bash
cosign verify-attestation \
  --certificate-identity-regexp 'https://github.com/julianwiley/alphaswarm/.github/workflows/build-multi-arch\.yml@refs/.*' \
  --certificate-oidc-issuer https://token.actions.githubusercontent.com \
  --type cyclonedx \
  docker.io/julianwiley/alphaswarm-api:latest > sbom-attestation.json
```

The `payload` field of the output is the base64-encoded in-toto statement; the statement's `predicate` field holds the CycloneDX document.

### Re-run grype against the SBOM

```bash
syft docker.io/julianwiley/alphaswarm-api:latest -o cyclonedx-json=sbom.json
grype sbom:sbom.json --fail-on high
```

Exit code 0 = no HIGH or CRITICAL CVEs; non-zero = the gate fires in CI.

## Kyverno audit-to-enforce ratchet

The six Phase 2 §5.3 cluster policies ship in `Audit` mode (see [alphaswarm_platform/deployments/kubernetes/security/README.md](https://github.com/julianwiley/alphaswarm/blob/main/alphaswarm_platform/deployments/kubernetes/security/README.md)).
The ratchet schedule:

| Policy | Audit-mode soak | Enforce gate |
| --- | --- | --- |
| `00-verify-signatures.yaml` | 7 days zero violations across all AlphaSwarm-owned namespaces | Phase 2.5 |
| `01-require-pss-restricted.yaml` | 7 days zero violations | Phase 2.5 |
| `02-require-runtime-class.yaml` | DO NOT enforce until Phase 5 §8.3 lands the gVisor RuntimeClass | Phase 5.1 |
| `03-no-host-network.yaml` | 7 days zero violations after `alphaswarm-edge` namespace carries `alphaswarm.io/host-network-allowed: "true"` | Phase 2.5 |
| `04-no-privilege-escalation.yaml` | 7 days zero violations | Phase 2.5 |
| `05-required-labels.yaml` | 7 days zero violations on namespaces that carry `alphaswarm.io/component` | Phase 2.5 |

### Operator workflow to flip Audit → Enforce

```bash
# 1. Verify zero violations for the target policy:
kubectl get clusterpolicyreport -o jsonpath='{range .items[*].results[?(@.policy=="alphaswarm-verify-image-signatures")]}{.result}{"\n"}{end}' \
  | sort | uniq -c
# Expected output: only "pass" lines. Any "fail" lines block the ratchet.

# 2. Patch the policy in place:
kubectl patch clusterpolicy alphaswarm-verify-image-signatures \
  --type=merge \
  -p '{"spec":{"validationFailureAction":"Enforce"}}'

# 3. Update the YAML in tree so the committed manifest matches the live Enforce state:
sed -i 's/validationFailureAction: Audit/validationFailureAction: Enforce/' \
  alphaswarm_platform/deployments/kubernetes/security/kyverno/cluster-policies/00-verify-signatures.yaml

# 4. Commit + open PR with `[Phase 2.5 ratchet]` in the title.
```

## Rollback

The Chainguard migration is reversible per Dockerfile. Each Dockerfile carries a `Phase 2 §5.1` comment at the top documenting the previous base image. To roll back a single image:

1. Revert that image's Dockerfile (`alphaswarm_platform/Dockerfile` or `alphaswarm_platform/build/docker//Dockerfile`) to its pre-Phase-2 state.
2. Trigger `build-multi-arch.yml` on the revert branch.
3.
The cosign keyless signature still applies (it signs by digest, not base image). The grype scan may fail differently because the Debian-slim base ships different CVEs.

## Cosign signing on PRs

The Phase 2 §5.2 cosign + SBOM + grype steps gate on `if: github.event_name != 'pull_request'` because cosign keyless requires OIDC, which is unavailable on PRs from forked repositories. PRs from internal branches still build (and pull-through cache), but they neither push nor sign. The `inspect` job that runs `cosign verify` on `:latest` tags is only useful for merged commits.

If you need to verify a signature on a PR build, push to a feature branch in the canonical repo (not a fork) and check the registry manually:

```bash
docker pull docker.io/julianwiley/alphaswarm-api:feat-phase-2-supply-chain-
cosign verify \
  --certificate-identity-regexp 'https://github.com/julianwiley/.*' \
  --certificate-oidc-issuer https://token.actions.githubusercontent.com \
  docker.io/julianwiley/alphaswarm-api:feat-phase-2-supply-chain-
```

## Related documents

- [RESTRUCTURING_PLAN.md §5](https://github.com/julianwiley/alphaswarm/blob/main/RESTRUCTURING_PLAN.md)
- [alphaswarm_platform/deployments/kubernetes/security/README.md](https://github.com/julianwiley/alphaswarm/blob/main/alphaswarm_platform/deployments/kubernetes/security/README.md)
- [`.cursor/plans/alphaswarm-index-debt-phase-2-supply-chain.md`](https://github.com/julianwiley/alphaswarm/blob/main/.cursor/plans/alphaswarm-index-debt-phase-2-supply-chain.md)
- ADR 007 — QuantBot Latency Classes (explains the HFT Debian-slim exemption)

# Onboard a new staff member into Entra

# Onboard a new staff member into Entra

Procedure for adding a new AlphaSwarm employee to the company's Entra directory and granting them the right level of access to the managed AlphaSwarm platform. This is an HR + Security workflow that does NOT touch Terraform.
Group membership is intentionally outside Terraform's purview (rollout plan §1.2); Terraform owns *which groups exist + what roles they confer*, not *who is in them*.

## Inputs

- The new hire's full name + corporate email address.
- Their HR-side role (engineer, ops, compliance, finance, …).
- The hiring manager's approval (capture the ticket id for the audit).

## Steps

### 1. Create the Entra user

If the new hire doesn't already have an Entra account from corporate onboarding, create one:

```bash
az ad user create \
  --display-name "First Last" \
  --user-principal-name first.last@wiley-tech.onmicrosoft.com \
  --password "$(uuidgen | tr -d '-' | head -c 24)Aa!1" \
  --force-change-password-next-sign-in true
```

The auto-generated password is changed on first login; the operator NEVER stores or shares it.

### 2. Add to the appropriate AlphaSwarm group

Map the HR-side role to the canonical group. Default mappings:

| HR role | Entra group(s) |
| --- | --- |
| Software Engineer / SRE | `AlphaSwarm-Engineering` |
| Senior SRE / on-call rotation | `AlphaSwarm-Engineering` + `AlphaSwarm-Operations` |
| Compliance Officer | `AlphaSwarm-Compliance` |
| Internal Auditor | `AlphaSwarm-Auditors` |
| Finance / FinOps | `AlphaSwarm-Finance` |
| Security Engineer | `AlphaSwarm-SOC` |
| CTO / VP Engineering | `AlphaSwarm-Admins` (requires CTO sign-off and CA-policy MFA) |

Add via the Azure Portal **OR** via CLI:

```bash
# Look up the group id (cached locally for repeat use).
GROUP_ID="$(az ad group show --group AlphaSwarm-Engineering --query id -o tsv)"
USER_ID="$(az ad user show --id first.last@wiley-tech.onmicrosoft.com --query id -o tsv)"
az ad group member add --group "${GROUP_ID}" --member-id "${USER_ID}"
```

### 3. Verify the role propagation

Wait 5 minutes for Entra to propagate, then have the new hire sign in once at `manage.alpha-swarm.ai`. The application token they receive should include the `roles` claim mapped to the group.
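
For a quick local look at the `roles` claim, the token payload can be decoded with the Python standard library alone. The following is a minimal sketch, assuming the token is a standard three-part JWT: the helper names are hypothetical, the role value in the usage note is a placeholder rather than a real AlphaSwarm role, and the code deliberately skips signature verification, so it is suitable for inspection only and never for authorization decisions.

```python
import base64
import json


def decode_jwt_claims(token: str) -> dict:
    """Return the claims of a JWT WITHOUT verifying its signature (inspection only)."""
    parts = token.split(".")
    if len(parts) != 3:
        raise ValueError("expected a three-part JWT (header.payload.signature)")
    payload = parts[1]
    # JWT segments are unpadded base64url; restore the padding before decoding.
    payload += "=" * (-len(payload) % 4)
    return json.loads(base64.urlsafe_b64decode(payload))


def has_role(token: str, role: str) -> bool:
    """True if the token's `roles` claim contains the given role value."""
    return role in decode_jwt_claims(token).get("roles", [])
```

Usage: `has_role(access_token, "Example.Role")` returns `True` only when that placeholder value appears in the decoded `roles` list; a missing `roles` claim is treated as an empty list.
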
Confirm from the operator side: ```bash python scripts/identity/list_entra_app_role_assignments.py \ --format=json \ | jq '.[] | select(.principal_display_name=="First Last")' ``` This should print one row per role for each group the user is in. ### 4. Capture the audit trail The Entra audit log records group-membership changes automatically and forwards them to the corporate SIEM via the existing log stream. The manager's approval ticket gets attached as part of the standard employee onboarding packet. ## Promoting an existing staff member ```bash # Add to a higher-privilege group (e.g. ops on-call). az ad group member add --group AlphaSwarm-Operations --member-id "${USER_ID}" ``` For promotions to `AlphaSwarm-Admins`: 1. The CTO must sign off in writing (ticket id captured). 2. The user must have a registered FIDO2 hardware key (verified by the Security team). 3. The user falls under the `AlphaSwarm-Admins-MFA-Required` Conditional Access policy automatically. ## Off-boarding ```bash # Remove from every AlphaSwarm group; do NOT just disable the Entra account # in case the user has cross-tenant memberships we don't manage.
for group in AlphaSwarm-Engineering AlphaSwarm-Operations AlphaSwarm-Auditors AlphaSwarm-Compliance \ AlphaSwarm-Finance AlphaSwarm-SOC AlphaSwarm-Admins; do GROUP_ID="$(az ad group show --group ${group} --query id -o tsv)" az ad group member remove --group "${GROUP_ID}" --member-id "${USER_ID}" 2>/dev/null || true done ``` After removal, capture an evidence snapshot: ```bash python scripts/identity/list_entra_app_role_assignments.py \ --format=csv > evidence/entra-after-offboarding-${USER_ID}-$(date +%F).csv ``` ## Common pitfalls | Pitfall | Mitigation | | --- | --- | | Adding a user to two conflicting groups | The role union is granted; review with `list_entra_app_role_assignments.py` after every change | | Group propagation lag | Ask the user to wait 5 minutes between group add and login retry | | User can't sign in despite group membership | Check Conditional Access "What If" report for the user; CA may be blocking the sign-in | | Stale group from a previous role | Remove the old group BEFORE adding the new one to keep the audit trail clean | ## Related - [`concepts/identity/entra-internal-tenant`](../concepts/identity/entra-internal-tenant.md) — pool concept - [`how-to/entra-terraform-bootstrap`](./entra-terraform-bootstrap.md) — how the groups exist in the first place - [`how-to/entra-rotate-secrets`](./entra-rotate-secrets.md) — credential rotation # Rotate Entra ID app secrets # Rotate Entra ID app secrets The AlphaSwarm staff Entra rollout aims for **zero stored secrets** — CI authenticates via federated credentials (rollout plan §3.5), and the runtime bootstrap relies on operator `az login` sessions. This page documents: 1. The two secrets that DO exist during the bootstrap window and how to rotate them. 2. The federated-credential rotation cadence (no secret material, but subjects + issuer trust still matter). 3. The break-glass account procedure. 
## What we DO NOT rotate The alphaswarm_entra_directory module ships zero `azuread_application_password` resources by default. App-secret material lives in Vault under `secret/alphaswarm/entra/` ONLY for the bootstrap window, and ONLY when the operator explicitly opts in via `terraform import` of an out-of-band Azure Portal-created password. Once Phase 5 of the rollout completes, no app secrets exist for any AlphaSwarm-managed Entra app. ## Bootstrap-window secret rotation Rotate the bootstrap SP secret used by Phase 0 / 1 of the rollout (before federated credentials are wired): ```bash # 1. Mint a new secret (90-day lifetime, recorded in the audit log). NEW_SECRET="$(az ad app credential reset \ --id "${BOOTSTRAP_SP_CLIENT_ID}" \ --years 0 --months 3 \ --query password -o tsv)" # 2. Write to Vault. vault kv put secret/alphaswarm/entra/bootstrap_sp_secret value="${NEW_SECRET}" # 3. Restart any service still using the old secret. Most are already # on federated credentials by this point so this is a sweep. kubectl rollout restart -n alphaswarm deploy/alphaswarm-admin # 4. Clear the new secret from local env. unset NEW_SECRET ``` **Cadence**: every 90 days while Phase 0/1 secrets exist. Phase 5 retires the secret entirely. ## Federated-credential rotation Federated credentials carry no secret material — the trust comes from the GitHub Actions OIDC issuer + the subject claim. Rotate the SUBJECT when: - A repo is renamed. - A protected branch is renamed. - A new GitHub environment is added. Procedure: ```hcl # In alphaswarm_platform/terraform/environments/wiley-tech/entra.tf, # add a new entry to var.ci_federated_credentials: { name = "github-staging-environment" description = "GitHub Actions deploy to staging environment." 
subject = "repo:julianwileymac/alphaswarm:environment:staging" } ``` Then the standard plan/apply path: ```bash ./scripts/identity/entra_terraform_plan.sh python scripts/identity/entra_terraform_apply_via_runtime.py \ --workspace wiley-tech \ --apply \ --reason "Add staging environment OIDC subject" ``` Per-environment / per-branch scoping is **mandatory**; the module's plan check rejects subjects containing `*` or `ref:refs/heads/*`. ## Break-glass account rotation The two break-glass accounts (rollout plan §4 risk table) are excluded from time-based Conditional Access policies and used ONLY in declared incidents. Their credentials live in: - **Physical safe** — printed sealed envelope per account. - **Redundant FIDO2 hardware keys** — two YubiKey 5C NFC per account, stored in separate physical safes. Rotation cadence: every 6 months OR after any use, whichever comes first. Procedure: ```bash # 1. Generate fresh FIDO2 keys with the Azure portal / az ad device-mfa # enrolment flow. # 2. Mint a fresh emergency password (UUID-derived, 24+ chars). # 3. Update the sealed envelope in the safe. # 4. Capture the rotation in the security log: echo "{ \"actor\": \"$USER\", \"event\": \"break-glass-rotation\", \ \"account\": \"break-glass-1@wiley-tech.onmicrosoft.com\", \ \"completed_at\": \"$(date -u +%FT%TZ)\" }" \ | gpg --encrypt -r security@alpha-swarm.ai \ >> evidence/break-glass-rotations.gpg ``` The operator who performs the rotation runs `scripts/identity/list_entra_app_role_assignments.py --apps` to confirm the break-glass account is still NOT assigned any AlphaSwarm role (the accounts MUST stay outside the AlphaSwarm role surface; they exist for tenant-level emergency access only). ## Common pitfalls | Pitfall | Mitigation | | --- | --- | | Operator tempted to add a static secret for "just one workflow" | NO.
File a ticket to add a federated credential subject instead | | Forgetting to remove old federated subjects after a repo rename | Run `az ad app federated-credential list --id "${CI_APP_ID}"` quarterly; remove anything not in `var.ci_federated_credentials` | | Break-glass account drifting into AlphaSwarm role assignments | The `list_entra_app_role_assignments.py` audit script's CSV output is reviewed quarterly by Security | | Bootstrap SP secret accidentally committed | Pre-commit hook scans for Azure-secret patterns; CI fails the PR | ## Related - [`how-to/entra-terraform-bootstrap`](./entra-terraform-bootstrap.md) — initial setup - [`how-to/entra-onboard-new-staff`](./entra-onboard-new-staff.md) — staff lifecycle - [`concepts/identity/entra-internal-tenant`](../concepts/identity/entra-internal-tenant.md) — pool concept - Long-form rollout plan with risk register: [`docs/plans/entra-internal-tenant-rollout.md`](pathname:///docs/plans/entra-internal-tenant-rollout.md) # Bootstrap the AlphaSwarm Entra ID staff tenant # Bootstrap the AlphaSwarm Entra ID staff tenant Step-by-step procedure for taking the AlphaSwarm staff Microsoft Entra ID tenant from "exists in the Azure Portal" to "fully Terraform-controlled and serving as the first user pool for `manage.alpha-swarm.ai`". This is the implementation runbook. Concept context lives at [`concepts/identity/entra-internal-tenant`](../concepts/identity/entra-internal-tenant.md); the full rollout schedule + risks + rollback at [`docs/plans/entra-internal-tenant-rollout.md`](pathname:///docs/plans/entra-internal-tenant-rollout.md). 
## Pre-requisites

| Prereq | How to confirm |
| --- | --- |
| AlphaSwarm staff Entra tenant exists | `az account tenant list` shows the tenant id |
| Global admin / Application Administrator account | `az ad signed-in-user show` confirms role assignment |
| Bootstrap service principal exists with `Application.ReadWrite.All` + `Group.ReadWrite.All` + `RoleManagement.ReadWrite.Directory` | `az ad sp show --id ` |
| Terraform 1.10+ installed locally | `terraform version` |
| Repo cloned + AlphaSwarm runtime installable | `pip install -e .[dev]` succeeds |
| Vault accessible with the `secret/alphaswarm/entra/` mount | `vault kv get secret/alphaswarm/entra/internal_tenant_id` resolves |

If any prereq is missing, file a ticket with the Identity team (reference ADR-011) before continuing.

## Step 1 — Set environment variables

```bash
# Sourced from Vault by the operator before running the helpers.
export AZURE_TENANT_ID=""
export AZURE_CLIENT_ID=""
export AZURE_CLIENT_SECRET="" # OR use az login

# Echoed into the Terraform provider.
export TF_VAR_entra_tenant_id="${AZURE_TENANT_ID}"
export TF_VAR_entra_enabled="true"
```

> **Note**: the `AZURE_CLIENT_SECRET` path is documented for the
> bootstrap window only. Once the `alphaswarm-ci-github` app registration +
> federated credentials land (Phase 5 of the rollout plan), no secret
> is stored anywhere; CI authenticates via OIDC.

## Step 2 — Plan-only preview

```bash
./scripts/identity/entra_terraform_plan.sh
```

The script:

1. Runs `terraform fmt -check` + `terraform validate` against the module.
2. Runs `terraform plan -target=module.alphaswarm_entra_directory` against the `wiley-tech` environment.
3. Writes the plan binary to `/tmp/alphaswarm-entra-wiley-tech.plan` and prints the next-step command.

Inspect the plan line-by-line. Common red flags:

- A resource shows `# forces replacement` for an app-role id → someone has regenerated a UUID in `var.app_role_definitions` (DON'T merge).
- A federated credential shows `subject = "...:*"` → wildcard rejected by the module check; fix the input. - A group display name conflicts with an existing group → rename or import. ## Step 3 — Apply via TerraformRuntime ```bash python scripts/identity/entra_terraform_apply_via_runtime.py \ --workspace wiley-tech \ --apply \ --reason "Phase 2 land for entra-internal stack" ``` The helper: 1. Loads the `entra-internal` `TerraformStackSpec`. 2. Runs `runtime.plan(...)` (writes a `terraform_runs` row). 3. Prompts for `yes` confirmation (skip with `--yes` only in CI). 4. Runs `runtime.apply(...)` (writes a second `terraform_runs` row linked to the same spec_version_id). Output is redacted: token-bearing fields show only the first 4 characters per AGENTS rule 26. ## Step 4 — Grant admin consent After the apps land, their requested Microsoft Graph permissions are *requested* but not yet *consented*. Grant tenant-wide consent: ```bash # The staff app's client_id is in the Terraform output: STAFF_CID="$(terraform -chdir=alphaswarm_platform/terraform/environments/wiley-tech \ output -raw entra_staff_app_client_id)" ./scripts/identity/grant_admin_consent.sh "${STAFF_CID}" ``` The script wraps `az ad app permission admin-consent` and verifies the resulting grants with `az ad app permission list-grants`. ## Step 5 — Seed `EntraTenantLink` ```bash # Read the new staff app's tenant id and stamp the canonical row. export ALPHASWARM_AUTH_MSAL_INTERNAL_TENANT_ID="${AZURE_TENANT_ID}" export ALPHASWARM_AUTH_MSAL_INTERNAL_APP_ID="${STAFF_CID}" python scripts/identity/seed_entra_internal_tenant.py --dry-run python scripts/identity/seed_entra_internal_tenant.py --apply ``` Idempotent: the second `--apply` is a no-op if the row already matches the target shape. ## Step 6 — Round-trip a real login ```bash # Browser flow. python scripts/identity/verify_entra_login.py # Headless / SSH session. 
python scripts/identity/verify_entra_login.py --device-code ``` Successful output: ``` INFO Got access token: eyJ0… (1456 chars) INFO Claims look correct. INFO CA policies found: AlphaSwarm-Admins-MFA-Required, AlphaSwarm-Block-Risky-Sign-Ins INFO All checks passed. ``` If a CA policy is missing, the script exits with code 4 and lists the missing policies. Add them via the Azure Portal under Security review, then re-run. CA policies are NOT created from Terraform (rollout plan §1.2). ## Step 7 — Verify role assignments ```bash python scripts/identity/list_entra_app_role_assignments.py ``` Should print one row per (group, role) pair the module created. Save a CSV snapshot for the audit trail: ```bash python scripts/identity/list_entra_app_role_assignments.py \ --format=csv > evidence/entra-role-snapshot-$(date +%F).csv ``` ## Step 8 — Switch the manage.alpha-swarm.ai chooser to prefer Entra With everything in place, flip the runtime so the `manage.alpha-swarm.ai` login chooser prefers Entra over Auth0: ```bash # Settings already wired in alphaswarm/config/settings.py: # auth_msal_priority = 100 # MSAL wins # auth_msal_internal_* # populated from Terraform outputs kubectl set env -n alphaswarm deploy/alphaswarm-admin \ ALPHASWARM_AUTH_MSAL_INTERNAL_TENANT_ID="${AZURE_TENANT_ID}" \ ALPHASWARM_AUTH_MSAL_INTERNAL_APP_ID="${STAFF_CID}" \ ALPHASWARM_AUTH_MSAL_INTERNAL_AUTHORITY="https://login.microsoftonline.com/${AZURE_TENANT_ID}" \ ALPHASWARM_AUTH_MSAL_INTERNAL_AUDIENCE="api://alphaswarm-manage-api" \ ALPHASWARM_AUTH_MSAL_PRIORITY=100 kubectl rollout status -n alphaswarm deploy/alphaswarm-admin ``` 24-hour bake: monitor the `auth_login_total{provider="entra"}` and `auth_login_failure_total` Prometheus counters. ≥95% of staff logins should land on Entra after the bake. 
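
The 95% bake target is a simple ratio over per-provider login counts. The sketch below shows that arithmetic under stated assumptions: it takes the increase of `auth_login_total` over the bake window as a plain mapping keyed by the `provider` label, the function names are hypothetical, and the `auth0` label value in the example is assumed (only `provider="entra"` appears in this runbook).

```python
def entra_login_share(login_totals: dict[str, float]) -> float:
    """Fraction of successful logins in the bake window that used Entra."""
    total = sum(login_totals.values())
    if total == 0:
        raise ValueError("no logins recorded in the bake window")
    return login_totals.get("entra", 0.0) / total


def bake_passed(login_totals: dict[str, float], threshold: float = 0.95) -> bool:
    """True when the Entra share meets the bake target (default 95%)."""
    return entra_login_share(login_totals) >= threshold


if __name__ == "__main__":
    # Illustrative counts only; real values come from the Prometheus counters.
    window = {"entra": 480.0, "auth0": 20.0}
    print(f"Entra share: {entra_login_share(window):.1%}, passed: {bake_passed(window)}")
```

Counting only successful logins keeps the ratio independent of `auth_login_failure_total`, which is better watched as its own signal during the bake.
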
## Verification

| Check | Command |
| --- | --- |
| Terraform plan is clean | `./scripts/identity/entra_terraform_plan.sh` (no diff) |
| `terraform_runs` audit row recorded | `psql -c "SELECT id, status FROM terraform_runs WHERE stack_slug='entra-internal' ORDER BY created_at DESC LIMIT 1"` |
| `entra_tenant_links` has `kind=internal` | `python scripts/identity/seed_entra_internal_tenant.py --dry-run` reports `EXISTING row matches target` |
| Real login works end-to-end | `python scripts/identity/verify_entra_login.py` exits 0 |
| All seven groups have role assignments | `python scripts/identity/list_entra_app_role_assignments.py` prints ≥7 rows |

## Rollback

See [the rollout plan §5](pathname:///docs/plans/entra-internal-tenant-rollout.md) for the three rollback tiers (hot / cold / catastrophic).

# Linkerd + SPIRE rollout runbook

# Linkerd + SPIRE rollout runbook

> Phase 4 §7.1 + §7.2 of
> [RESTRUCTURING_PLAN.md](https://github.com/julianwiley/alphaswarm/blob/main/RESTRUCTURING_PLAN.md).
> Covers the per-cell install of Linkerd 2.16 (service mesh) and
> SPIRE 1.10 (workload identity) plus the matching validation
> steps.

## Scope

Per-cell installs of:

- **Linkerd 2.16** — mTLS-by-default for every pod-to-pod call inside a cell. Cross-cell calls re-terminate at `alphaswarm-edge` (Envoy).
- **SPIRE 1.10** — issues SPIFFE JWT-SVIDs and X.509-SVIDs via the Workload API. Replaces the kubelet-bound ServiceAccount token usage in `alphaswarm/auth/m2m.py`.

Both ship as kustomize bases under `alphaswarm_platform/deployments/kubernetes/mesh-identity/`. Argo CD's `cells` `ApplicationSet` (Phase 3 §6.5) is extended in Phase 4.5 to stamp one per-component Application per cell.

## Prerequisites

1. The cell namespace exists and carries the Phase 4 §7.1 `linkerd.io/inject: enabled` annotation. Verify:

   ```bash
   kubectl get ns cell-shared-std-us-east-1a -o yaml | grep linkerd.io/inject
   # expected: linkerd.io/inject: enabled
   ```

2. The cell registry has the cell row in `state=provisioning` (so the cell-router doesn't send live traffic yet).
3. Vault PKI is configured and ready to issue:
   - Linkerd trust anchor + issuer cert (rotates via VaultStaticSecret).
   - SPIRE upstream authority (if running with `UpstreamAuthority` plugin; the Phase 4 spine uses self-signed for simplicity).

## Step 0 — Apply the mesh-identity spine

```bash
# Apply in dependency order:

# 1. SPIRE (everything else consumes SVIDs)
kubectl apply -k alphaswarm_platform/deployments/kubernetes/mesh-identity/spire/

# Wait for SPIRE Server to be ready:
kubectl -n spire-system rollout status statefulset/spire-server --timeout=5m
kubectl -n spire-system get pods -l app=spire-agent

# 2. Linkerd (consumes SPIRE-issued trust anchor)
# The trust anchor + issuer cert must already be in
# Secret/linkerd-identity-issuer (see §7.6 wire-up).
kubectl apply -k alphaswarm_platform/deployments/kubernetes/mesh-identity/linkerd/

# Wait for Linkerd identity service:
kubectl -n linkerd rollout status deployment/linkerd-identity --timeout=10m
kubectl -n linkerd rollout status deployment/linkerd-destination --timeout=10m
kubectl -n linkerd rollout status deployment/linkerd-proxy-injector --timeout=10m

# Optional: install linkerd-viz for golden-signal dashboards
kubectl apply -k alphaswarm_platform/deployments/kubernetes/mesh-identity/linkerd/ # idempotent

# 3. vault-secrets-operator (mTLS via Linkerd, identity via SPIRE)
kubectl apply -k alphaswarm_platform/deployments/kubernetes/mesh-identity/vault-secrets-operator/

# 4. Pomerium IAP (depends on Linkerd mTLS for backend reach)
kubectl apply -k alphaswarm_platform/deployments/kubernetes/mesh-identity/pomerium/
```

## Step 1 — Validate SPIRE Workload API

```bash
# Find a workload pod that mounts the agent socket:
POD=$(kubectl -n cell-shared-std-us-east-1a get pods -l app=alphaswarm-core -o name | head -1)

# Drop into the pod and fetch an SVID:
kubectl -n cell-shared-std-us-east-1a exec -it "$POD" -- /bin/sh -c "
export SPIFFE_ENDPOINT_SOCKET=unix:///run/spire/sockets/agent.sock
python -c '
from spiffe.workloadapi import default_jwt_source
src = default_jwt_source.DefaultJwtSource()
svid = src.fetch_svid(audiences=[\"alphaswarm-tenant-router\"])
print(\"SPIFFE ID:\", svid.spiffe_id)
print(\"Audiences:\", svid.audiences)
print(\"Token (truncated):\", svid.token[:60], \"...\")
'
"
# Expected: SPIFFE ID spiffe://alpha-swarm.ai/cell/cell-shared-std-us-east-1a/alphaswarm-core
```

If the SVID fetch fails, check the SPIRE Agent's registration entries — the workload's ServiceAccount might not be selected:

```bash
kubectl -n spire-system exec -it spire-server-0 -- /opt/spire/bin/spire-server entry list
```

## Step 2 — Validate Linkerd mTLS

```bash
# Check that the proxy was injected on every alphaswarm-core pod:
kubectl -n cell-shared-std-us-east-1a get pods -l app=alphaswarm-core \
  -o jsonpath='{range .items[*]}{.metadata.name}{":"}{.spec.containers[*].name}{"\n"}{end}'
# Expected: each pod has BOTH `api` and `linkerd-proxy` containers.

# Verify mTLS edge-to-edge between two AlphaSwarm pods:
linkerd -n cell-shared-std-us-east-1a viz stat deploy
# Expected: every deployment row shows `MESHED 1/1` (or matching replica count)
# and the SUCCESS RATE column reports % over the last 1m window.

linkerd -n cell-shared-std-us-east-1a viz edges deployment
# Expected: every edge is "mTLS YES" — if any edge shows "NO", the
# source or destination pod is missing the proxy injection.
```

If pods are NOT meshed, the Proxy Injector didn't see the `linkerd.io/inject: enabled` annotation. Check the namespace:

```bash
kubectl get ns cell-shared-std-us-east-1a -o yaml | grep -A 2 annotations
# Expected: linkerd.io/inject: enabled
```

## Step 3 — Validate Pomerium IAP

The Pomerium routes for `/manage/*` live in `alphaswarm_platform/deployments/kubernetes/mesh-identity/pomerium/route-manage.yaml`.

```bash
# From outside the cluster, the IAP-protected route should redirect
# to authenticate.alpha-swarm.ai (Pomerium's authenticate service):
curl -sIL https://manage.alpha-swarm.ai/manage/cells | head -10
# Expected: 302 to https://authenticate.alpha-swarm.ai/.pomerium/...

# After completing the Auth0 flow + step-up MFA, the request reaches
# alphaswarm-cp.alphaswarm-admin.svc.cluster.local:9000 with the
# X-Pomerium-Jwt-Assertion header attached:
curl -sS https://manage.alpha-swarm.ai/manage/cells \
  --cookie "_pomerium=" \
  | jq '.data[].id'
```

The receiving FastAPI route validates the assertion via `alphaswarm.auth.providers.pomerium.extract_pomerium_claims` (Phase 4 §7.5).

## Step 4 — Cedar policy gate

Trigger a Cedar evaluation:

```bash
# Try to register a cell as a user WITHOUT the cell_operator role —
# should 403:
curl -sS -XPOST https://manage.alpha-swarm.ai/manage/cells \
  -H 'authorization: Bearer ' \
  -H 'content-type: application/json' \
  -d '{"id":"cell-x","tier":"shared-std",...}' \
  -o /tmp/cedar-deny.json
cat /tmp/cedar-deny.json
# Expected: {"detail":{"error":"cedar_denied",...}}

# With the role granted by the Auth0 Action, the same call succeeds:
# (cell_operator role is wired via the action at
# alphaswarm/api/routes/auth0_sync.py per Phase 4 §7.3.)
```

## Step 5 — VaultStaticSecret rotation

Verify the `alphaswarm-cell-postgres-credentials` Secret rotates within the 30-minute `refreshAfter` window:

```bash
# Watch the Secret's resourceVersion:
kubectl -n cell-shared-std-us-east-1a get secret postgres-credentials \
  -o jsonpath='{.metadata.resourceVersion}' --watch

# Trigger a Vault-side rotation:
vault kv put cells/shared-std/cell-shared-std-us-east-1a host=newhost.example port=5432

# Within 30 minutes the resourceVersion increments and the deployments
# listed in `rolloutRestartTargets` perform a rolling restart.
```

## Rollback

Each component is independently revertable:

```bash
# Linkerd — remove the proxy injection (existing pods stay meshed
# until their next rollout):
kubectl annotate ns cell-shared-std-us-east-1a linkerd.io/inject-

# SPIRE — workloads fall back to the Auth0 M2M path (chain order in
# alphaswarm.credentials.resolver) when the SPIFFE socket isn't reachable.
# A DaemonSet has no replica count, so `kubectl scale` does not apply to it;
# pin it to a node label that no node carries to evict the agents instead:
kubectl -n spire-system patch daemonset spire-agent \
  -p '{"spec":{"template":{"spec":{"nodeSelector":{"non-existing":"true"}}}}}'

# Pomerium — direct /manage/* to alphaswarm-cp via DNS, bypassing the IAP.
kubectl -n pomerium scale deployment pomerium-proxy --replicas=0

# vault-secrets-operator — Secrets stop refreshing but stay readable.
kubectl -n vault-secrets-operator scale deployment vault-secrets-operator --replicas=0
```

## Phase 4.5 follow-ups

1. Per-cell SPIRE `ClusterSPIFFEID` CRDs binding workload selectors.
2. M2MTokenIssuer dispatch through `ALPHASWARM_AUTH_M2M_PROVIDER=spiffe`.
3. Per-cell `VaultStaticSecret` set for every persistent service (Postgres, Redis, MinIO, MLflow, ChromaDB).
4. Per-cell Pomerium routes for the `alphaswarm_admin` UI surface.
5. Linkerd SPIFFE trust anchor wired from SPIRE Server's upstream-authority CA.
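When drafting the per-cell `ClusterSPIFFEID` entries from follow-up 1, it can help to compute the SPIFFE ID that Step 1 expects and compare it with what the Workload API returns. The sketch below assumes the `spiffe://<trust-domain>/cell/<cell>/<workload>` layout, which is inferred from the single expected ID shown in Step 1; both function names are hypothetical and not part of `alphaswarm`.

```python
# Hypothetical helpers for the SPIFFE ID layout seen in Step 1:
#   spiffe://alpha-swarm.ai/cell/<cell>/<workload>
# The layout is an assumption generalised from that one example.

TRUST_DOMAIN = "alpha-swarm.ai"  # taken from the expected ID in Step 1


def expected_spiffe_id(cell: str, workload: str, trust_domain: str = TRUST_DOMAIN) -> str:
    """Build the SPIFFE ID a workload in `cell` is expected to receive."""
    return f"spiffe://{trust_domain}/cell/{cell}/{workload}"


def parse_spiffe_id(spiffe_id: str) -> tuple[str, str, str]:
    """Split an ID into (trust_domain, cell, workload); reject any other layout."""
    prefix = "spiffe://"
    if not spiffe_id.startswith(prefix):
        raise ValueError(f"not a SPIFFE ID: {spiffe_id!r}")
    parts = spiffe_id[len(prefix):].split("/")
    # Expect exactly: <trust-domain>, the literal "cell", <cell>, <workload>.
    if len(parts) != 4 or parts[1] != "cell" or not all(parts):
        raise ValueError(f"unexpected SPIFFE ID layout: {spiffe_id!r}")
    return parts[0], parts[2], parts[3]
```

Comparing `expected_spiffe_id("cell-shared-std-us-east-1a", "alphaswarm-core")` with the `svid.spiffe_id` printed in Step 1 gives a quick mismatch check before digging into `spire-server entry list`.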
## Related documents

- [RESTRUCTURING_PLAN.md §7](https://github.com/julianwiley/alphaswarm/blob/main/RESTRUCTURING_PLAN.md)
- [alphaswarm_docs/docs/concepts/identity/spiffe-workload-identity.md](spiffe-workload-identity.md)
- [alphaswarm_platform/deployments/kubernetes/mesh-identity/README.md](https://github.com/julianwiley/alphaswarm/blob/main/alphaswarm_platform/deployments/kubernetes/mesh-identity/README.md)
- [alphaswarm_docs/docs/how-to/cell-router-cutover.md](cell-router-cutover.md)

# Cross-repo lineage bridge

> Set `ALPHASWARM_AGENTIC_ASSISTANTS_API` in the environment (or in the k8s ConfigMap `alphaswarm-env`):

# Cross-repo lineage bridge

The `agentic_assistants` repository maintains a shared lineage graph (dataset → run → model → report). AlphaSwarm publishes events to the same service via [`alphaswarm.mlops.lineage_bridge`](../../alphaswarm/mlops/lineage_bridge.py) so both repos present a unified view in the Lineage UI.

## Configuration

Set `ALPHASWARM_AGENTIC_ASSISTANTS_API` in the environment (or in the k8s ConfigMap `alphaswarm-env`):

```bash
export ALPHASWARM_AGENTIC_ASSISTANTS_API=http://agentic-assistants.alphaswarm.svc.cluster.local:8000
```

When the setting is empty the bridge is a no-op — every `emit_*` call logs at DEBUG and returns `False`.

## Emitting events

```python
from alphaswarm.mlops import (
    emit_dataset, emit_run, emit_model, emit_serve_deployment
)

# 1. Record the training dataset.
emit_dataset("abc123def", n_rows=2_000_000, n_symbols=500)

# 2. Log a training run tied to that dataset.
emit_run("run-42", kind="alpha_training", dataset_hash="abc123def", model_class="LightGBMAlpha")

# 3. Register the resulting model artifact.
emit_model("alphaswarm-alpha", version="7", run_id="run-42", metrics={"ic_mean": 0.042})

# 4. Record the live deployment serving it.
emit_serve_deployment(
    endpoint_url="http://ray-serve.alphaswarm.svc.cluster.local:8000/alphaswarm",
    backend="ray-serve",
    model_uri="models:/alphaswarm-alpha/Production",
)
```

## Event schema

Each event is a JSON POST to `/api/lineage/events`:

```json
{
  "kind": "model",
  "id": "model:alphaswarm-alpha/7",
  "attrs": { "run_id": "run-42", "metrics": { "ic_mean": 0.042 } },
  "parents": ["run:run-42"]
}
```

## Retention

Events live in the `agentic_assistants` lineage store (Postgres). Retention matches that project's settings (default 90 days for `run` / `serve_deployment`, indefinite for `dataset` / `model`).

# Kubernetes deployment

> The [`Dockerfile`](../../Dockerfile) builds six targets:

# Kubernetes deployment

AlphaSwarm ships Kustomize manifests under [`alphaswarm_platform/deploy/k8s/base/`](../../alphaswarm_platform/deploy/k8s/base/) that can be applied to any cluster. The manifests under `base/serving/` add three model-serving backends on top of the existing `api`, `worker`, `paper-trader`, and streaming-ingester Deployments.

## Image targets

The [`Dockerfile`](../../Dockerfile) builds six targets, the shared `base` layer plus five runnable images:

| Target | Entrypoint | Used by |
| --- | --- | --- |
| `base` | — | shared base layer |
| `paper` | `alphaswarm paper run` | `paper-trader.yaml` |
| `ingester` | `alphaswarm-stream-ingest` | `ingester-*.yaml` |
| `api` (default) | `uvicorn alphaswarm.api.main:app` | `api.yaml`, `worker.yaml` |
| `serving` | `alphaswarm serve ` | `serving/*.yaml` |
| `ml-train` | `alphaswarm-train` | CI training jobs, Ray Tune sweeps |

Build the five runnable images at once:

```bash
for target in paper ingester api serving ml-train; do
  docker build --target "$target" -t "alphaswarm-${target}:latest" .
done
```

## Deploying to a Kubernetes cluster

AlphaSwarm is cluster-agnostic.
The `alphaswarm_platform/deployments/kubernetes/` tree provisions every shared dependency (MLflow in `alphaswarm-mlops`, MinIO + Postgres + Redis + ChromaDB in `alphaswarm-data-services`, Kafka + Schema Registry + Flink in `alphaswarm-streaming`, kube-prometheus-stack + Tempo + Loki + OTel + Phoenix in `alphaswarm-observability`, and so on). To deploy AlphaSwarm:

```bash
# From the alphaswarm root

# 1. Install the operators / Helm releases that AlphaSwarm CRDs depend on.
bash alphaswarm_platform/scripts/cluster_install/install-redpanda.sh
bash alphaswarm_platform/scripts/cluster_install/install-kube-prometheus-stack.sh
bash alphaswarm_platform/scripts/cluster_install/install-opentelemetry-operator.sh
bash alphaswarm_platform/scripts/cluster_install/install-spark-operator.sh
bash alphaswarm_platform/scripts/cluster_install/install-flink.sh

# 2. Apply the AlphaSwarm base kustomization (creates alphaswarm-* namespaces and
# the workload manifests).
kubectl apply -k alphaswarm_platform/deployments/kubernetes/base/
```

## Selecting which model to serve

The three serving backends all read a single `model_uri` from the `alphaswarm-serving-env` ConfigMap. Change it once and bounce the Deployments:

```bash
kubectl -n alphaswarm create configmap alphaswarm-serving-env \
  --from-literal=model_uri=models:/alphaswarm-alpha/Production \
  --from-literal=ray_serve_name=alphaswarm-alpha \
  --dry-run=client -o yaml | kubectl apply -f -
kubectl -n alphaswarm rollout restart deploy mlflow-serve ray-serve torchserve
```

## Observability

- Every Deployment exports traces to `http://otel-collector:4317` (OTLP gRPC), matching the `rpi_kubernetes` collector conventions.
- Prometheus picks up metrics via the `ServiceMonitor` resources in [`alphaswarm_platform/deploy/k8s/base/serving/servicemonitor.yaml`](../../alphaswarm_platform/deploy/k8s/base/serving/servicemonitor.yaml).
- AlphaSwarm's own metric surface is defined in [`alphaswarm/mlops/metrics.py`](../../alphaswarm/mlops/metrics.py): `alphaswarm_train_duration_seconds`, `alphaswarm_backtest_sharpe`, `alphaswarm_paper_pnl`, `alphaswarm_serve_requests_total`, `alphaswarm_serve_latency_seconds`.

## Secrets

The `alphaswarm-broker-secrets` Secret supplies Alpaca / IBKR / Tradier credentials. For the serving stack no secrets are required unless the MLflow tracking URI needs auth — set `MLFLOW_TRACKING_TOKEN` in `alphaswarm-env` or a dedicated Secret.

# Model serving

> | Backend | Adapter | CLI | Best for | | --- | --- | --- | --- | | MLflow Models | [`MLflowServeDeployment`](../../alphaswarm/mlops/serving/mlflow_serve.py) | `alphaswarm serve mlflow ` | any flavor logged wit...

# Model serving

AlphaSwarm ships three serving adapters. All three share the same [`ModelDeployment`](../../alphaswarm/mlops/serving/base.py) protocol so call-sites, the CLI (`alphaswarm serve ...`), and the REST API speak one vocabulary regardless of the runtime underneath.

| Backend | Adapter | CLI | Best for |
| --- | --- | --- | --- |
| MLflow Models | [`MLflowServeDeployment`](../../alphaswarm/mlops/serving/mlflow_serve.py) | `alphaswarm serve mlflow ` | any flavor logged with `mlflow.log_model`, low-throughput research |
| Ray Serve | [`RayServeDeployment`](../../alphaswarm/mlops/serving/ray_serve.py) | `alphaswarm serve ray ` | horizontally scaled batch inference |
| TorchServe | [`TorchServeDeployment`](../../alphaswarm/mlops/serving/torchserve.py) | `alphaswarm serve torchserve ` | low-latency PyTorch endpoints + batching |

## Model URIs

All adapters accept three URI shapes:

1. **Filesystem path** — `./data/models/alpha_v1.pkl`
2. **MLflow run** — `runs://`
3. **MLflow registry** — `models://` or `models://`

MLflow URIs are resolved via `alphaswarm.mlops.serving.base.resolve_model`, which optionally downloads the artifact locally when a backend needs filesystem access (TorchServe packaging) or passes the URI through (MLflow Serve).

## PreprocessingSpec propagation

Every adapter honours the [`PreprocessingSpec`](../../architecture/preprocessing-spec.md) attached to the model. At inference time:

- **MLflow Serve** — flavor-specific (`pyfunc` handlers are expected to re-apply preprocessing inside the model class).
- **Ray Serve** — the generated deployment loads the pickle and delegates to `model.predict(df)`; when `model.preprocessing_spec` is set, the `apply` call happens in `__call__` before `predict`.
- **TorchServe** — the auto-generated `AqpBaseHandler` checks for a `preprocessing_spec` attribute and runs `spec.apply(df)` before every call.

## Quick start

```bash
# Train something and log to MLflow
python scripts/train_agent.py --config configs/ml/lgbm.yaml

# Serve the latest production version via MLflow
alphaswarm serve mlflow models:/alphaswarm-lgbm/Production --port 5001

# Or via Ray Serve
alphaswarm serve ray models:/alphaswarm-lgbm/Production --num-replicas 4

# Or package for TorchServe
alphaswarm serve torchserve models:/alphaswarm-lstm/Production --model-name alphaswarm-lstm
```

## Kubernetes

Manifests and Helm values for deploying each backend to the `rpi_kubernetes` cluster live under `deploy/kubernetes/serving/` and are described in [`alphaswarm_docs/docs/how-to/mlops/k8s-deployment.md`](./k8s-deployment.md) (Phase 5).

# Adding a new InfrastructureProvider to alphaswarm_controller

> Add a provider when you need to manage workloads on a backend the existing five (`docker_compose`, `kubernetes`, `aws`, `azure`, `gcp`) don't cover. Examples: Nomad, Fly.io, Render, on-prem VMs via Sa...
# Adding a new InfrastructureProvider to alphaswarm_controller

Step-by-step guide for shipping a new `InfrastructureProvider` implementation (AGENTS rule 45 / ADR 004).

## When to add a new provider

Add a provider when you need to manage workloads on a backend the existing five (`docker_compose`, `kubernetes`, `aws`, `azure`, `gcp`) don't cover. Examples: Nomad, Fly.io, Render, on-prem VMs via Salt/Ansible.

## Checklist

### 1. Sketch the credential chain

What env vars does the backend's SDK read? How does it discover credentials in CI vs on a developer laptop vs in production? Document this in your provider's `_check_credentials()` so the health probe can fail loudly when credentials are missing.

### 2. Create the provider module

```python
# alphaswarm_controller/src/alphaswarm_controller/providers/.py
from alphaswarm_core.providers.protocol import (
    InfrastructureProvider,
    InfrastructureProviderError,
    InfrastructureProviderUnavailable,
    ProviderKind,
)
from alphaswarm_core.providers.registry import register_provider_class


@register_provider_class("", replace=True)
class MyProvider(InfrastructureProvider):
    provider_kind = ProviderKind.
    provider_alias = ""

    async def health(self) -> ProviderHealth: ...
    async def start(self, spec: DeploymentSpec) -> DeploymentStatus: ...
    async def stop(self, service_id: str, *, namespace=None) -> DeploymentStatus: ...
    async def scale(self, service_id, replicas, *, namespace=None) -> DeploymentStatus: ...
    async def status(self, service_id: str, *, namespace=None) -> DeploymentStatus: ...
    async def list_deployments(self, *, namespace=None) -> list[DeploymentStatus]: ...

    # Optional — override if your backend supports it.
    async def get_config(self, service_id: str, *, namespace=None) -> ServiceConfig: ...
    async def apply_config(self, patch: ConfigMapPatch) -> bool: ...
    async def stream_metrics(self, service_id, *, namespace=None, interval_seconds=10.0): ...
```

### 3. Add a new `ProviderKind`

```python
# alphaswarm_core/src/alphaswarm_core/providers/protocol.py
class ProviderKind(str, Enum):
    DOCKER_COMPOSE = "docker_compose"
    KUBERNETES = "kubernetes"
    AWS = "aws"
    AZURE = "azure"
    GCP = "gcp"
    NOMAD = "nomad"  # <-- your new kind
```

### 4. Register in the bootstrap helper

```python
# alphaswarm_controller/src/alphaswarm_controller/providers/__init__.py
for module_name in (
    "alphaswarm_controller.providers.docker_compose",
    "alphaswarm_controller.providers.kubernetes",
    "alphaswarm_controller.providers.aws",
    "alphaswarm_controller.providers.azure",
    "alphaswarm_controller.providers.gcp",
    "alphaswarm_controller.providers.",  # <-- add yours
):
    ...
```

### 5. Optional deps go in `pyproject.toml` extras

```toml
[project.optional-dependencies]
 = ["sdk-package>=X,]", ]
```

### 6. Write contract tests

Two test files:

```python
# alphaswarm_controller/tests/providers/test_.py — unit tests over the
# translation helpers (e.g. spec_to_, response_to_status)
# alphaswarm_controller/tests/providers/test__integration.py — full
# contract test against a mocked SDK (moto for AWS, MagicMock for others)
```

Reuse the assertion patterns in `test_docker_compose.py` and `test_kubernetes.py`.

### 7. Update the bootstrap registry test

```python
# tests/providers/test_registry.py
def test_bootstrap_registers_all() -> None:
    registry = bootstrap()
    for expected in ("docker_compose", "kubernetes", "aws", "azure", "gcp", ""):
        assert expected in registry.aliases()
```

### 8. Update the README + this runbook

Add your provider to the table in [`alphaswarm_controller/README.md`](https://github.com/julianwileymac/alphaswarm/blob/main/alphaswarm_controller/README.md) and the per-cloud sections below.
## Per-cloud setup notes

### AWS

- Active provider: `ALPHASWARM_CP_PROVIDER=aws`
- Credentials: standard boto3 chain (env vars / `~/.aws/credentials` / EC2 / EKS pod identity / WebIdentity)
- IAM minimum: `ecs:DescribeServices`, `ecs:UpdateService`, `ssm:GetParameter*`, `ssm:PutParameter*`, plus EKS read perms when using the K8s sub-path

### Azure

- Active provider: `ALPHASWARM_CP_PROVIDER=azure`
- Credentials: `azure-identity` chain (env vars / Managed Identity / federated identity / Azure CLI)
- IAM minimum: Contributor on the AKS / Container Instances resource group

### GCP

- Active provider: `ALPHASWARM_CP_PROVIDER=gcp`
- Credentials: `GOOGLE_APPLICATION_CREDENTIALS` env var pointing to a service account JSON, OR Workload Identity (preferred in production)
- IAM minimum: `run.developer`, `container.developer`, `secretmanager.admin` (per project)

## Definition of done

- [ ] Provider class registered + `provider_kind` matches alias
- [ ] All seven abstract methods implemented (or raise `InfrastructureProviderUnavailable` with a structured message)
- [ ] Credential probe in `health()` returns a useful error when creds are missing
- [ ] Unit tests + contract tests passing
- [ ] `tests/providers/test_registry.py` updated
- [ ] README + this runbook updated
- [ ] CI matrix builds + tests with the new optional dep group

# AlphaSwarm.FUND Blue/Green Cutover

> - Overlay: `alphaswarm_platform/deployments/kubernetes/overlays/tower-green/` - Tunnel lane: `alphaswarm_platform/deployments/kubernetes/edge/cloudflared-alphaswarm-green/` - Verification: `scripts/verify_blue_green_cutov...

# AlphaSwarm.FUND Blue/Green Cutover

Runbook for migrating `alphaswarm.fund` traffic to the tower cluster with a short, controlled DNS/tunnel switch and immediate rollback path.
## Green lane artifacts

- Overlay: `alphaswarm_platform/deployments/kubernetes/overlays/tower-green/`
- Tunnel lane: `alphaswarm_platform/deployments/kubernetes/edge/cloudflared-alphaswarm-green/`
- Verification: `scripts/verify_blue_green_cutover.sh`

Green hostnames:

- `alphaswarm-green.alphaswarm.fund`
- `api-green.alphaswarm.fund`
- `manage-green.alphaswarm.fund`

## 1) Pre-cutover prep

1. Ensure `tower-dev` is healthy:

   ```bash
   bash scripts/verify_tower_cluster.sh
   ```

2. Update Auth0 app allow-lists so both blue and green URLs are valid during transition. Use `alphaswarm_platform/terraform/modules/auth0_identity` inputs:
   - `callback_urls` + `cutover_callback_urls`
   - `logout_urls` + `cutover_logout_urls`
   - `web_origins` + `cutover_web_origins`
3. Create green tunnel token secret:

   ```bash
   token="$(cloudflared tunnel token alphaswarm-fund-edge-green)"
   kubectl -n alphaswarm-edge create secret generic cloudflared-alphaswarm-green-token \
     --from-literal=token="$token" \
     --dry-run=client -o yaml | kubectl apply -f -
   ```

## 2) Deploy green lane

```bash
kubectl apply -k alphaswarm_platform/deployments/kubernetes/edge/cloudflared-alphaswarm-green/
kubectl apply -k alphaswarm_platform/deployments/kubernetes/overlays/tower-green/
```

## 3) Validate before switch

```bash
bash scripts/verify_blue_green_cutover.sh
CHECK_EXTERNAL=true bash scripts/verify_blue_green_cutover.sh
```

## 4) Cut over traffic

Perform the controlled switch in Cloudflare:

- point DNS/app routing to green hostnames (or update tunnel ingress mapping)
- confirm health endpoints:
  - `https://alphaswarm-green.alphaswarm.fund`
  - `https://api-green.alphaswarm.fund/livez`
  - `https://manage-green.alphaswarm.fund/manage/livez`

Once stable, update canonical host routing (`alphaswarm.fund`, `api.alphaswarm.fund`, `manage.alphaswarm.fund`) to the tower green lane.
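The health-endpoint confirmation in step 4 can be scripted so the "stable" call is repeatable. The sketch below is a hedged illustration using only the Python standard library: the three URLs come from the list above, while treating HTTP 200 as the sole healthy status and the function names `http_status` and `failing_endpoints` are assumptions, not something `scripts/verify_blue_green_cutover.sh` is known to do.

```python
# Hypothetical health sweep for the green lane endpoints listed in step 4.
import urllib.error
import urllib.request
from typing import Callable, Iterable

GREEN_HEALTH_URLS = [
    "https://alphaswarm-green.alphaswarm.fund",
    "https://api-green.alphaswarm.fund/livez",
    "https://manage-green.alphaswarm.fund/manage/livez",
]


def http_status(url: str, timeout: float = 5.0) -> int:
    """Return the HTTP status code for `url`, or 0 when the request fails outright."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status
    except urllib.error.HTTPError as exc:
        return exc.code  # server answered, but with a 4xx/5xx status
    except OSError:
        return 0  # DNS failure, refused connection, timeout, TLS error


def failing_endpoints(
    urls: Iterable[str], fetch: Callable[[str], int] = http_status
) -> list[str]:
    """Return the URLs whose status is not 200 (assumed to mean healthy)."""
    return [url for url in urls if fetch(url) != 200]
```

Running `failing_endpoints(GREEN_HEALTH_URLS)` should return an empty list before the canonical hosts are repointed. The `fetch` parameter exists so the sweep can be exercised with a stub instead of live traffic.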
## 5) Rollback

Immediate rollback commands:

```bash
kubectl apply -k alphaswarm_platform/deployments/kubernetes/overlays/tower-dev/
kubectl delete -k alphaswarm_platform/deployments/kubernetes/edge/cloudflared-alphaswarm-green/
```

Then restore blue DNS/tunnel routing and rerun baseline checks.

# Auth0 checklist for Kubernetes login

> Current local `.env` discovery:

# Auth0 checklist for Kubernetes login

This checklist captures the Auth0-side changes required for the Kubernetes deployment to support login through the Vite SPA and JWT validation in `alphaswarm-core` / `alphaswarm_controller`.

Current local `.env` discovery:

- Auth0 tenant domain: `alphaswarm-fund.us.auth0.com`
- SPA client id: present in `.env`
- SPA / confidential client secret: present in `.env`
- M2M client id/secret: **missing**; create a dedicated M2M app before enabling the Auth0 Action in production.
- SCIM bearer token hash: **missing**; SCIM remains disabled until created.

Do not paste secrets into this document. Store secrets in `.env` locally and in your production secret manager / sealed-secret pipeline for Kubernetes.

## 1. API Resource Server

Auth0 Dashboard path: **Applications → APIs → Create API**.
Create or update:

| Field | Value |
| --- | --- |
| Name | `AlphaSwarm Management API` |
| Identifier | `https://api.alphaswarm.internal/manage` |
| Signing Algorithm | `RS256` |
| RBAC | enabled |
| Add Permissions in Access Token | enabled |

Required permissions:

| Permission | Purpose |
| --- | --- |
| `read:infrastructure` | View deployment status, pod health, logs, non-secret config |
| `manage:agents` | Start/stop/restart/scale assigned agents and bot workloads |
| `manage:infrastructure` | Deploy/update services and non-secret config within assigned org |
| `admin:cluster` | Full cluster control and resource-filter bypass |
| `scim:write` | Auth0 SCIM provisioning into `/scim/v2/*` |

Migration permissions to retain until all older AlphaSwarm routes are moved to the new control-plane scope grid:

| Permission | Why keep temporarily |
| --- | --- |
| `data:read` | Existing AlphaSwarm data/read routes still check it |
| `data:write` | Existing AlphaSwarm mutation routes still check it |
| `deploy:run` | Existing Terraform control-plane routes still check it |
| `deploy:halt` | Existing Terraform halt/kill-switch integrations still check it |

Terraform source of truth:

- [`alphaswarm_platform/terraform/modules/auth0_identity/main.tf`](../../alphaswarm_platform/terraform/modules/auth0_identity/main.tf)
- [`alphaswarm_platform/terraform/modules/auth0_identity/variables.tf`](../../alphaswarm_platform/terraform/modules/auth0_identity/variables.tf)

## 2. SPA Application (`alphaswarm-client`)

Auth0 Dashboard path: **Applications → Applications → Create Application → Single Page Application**.

Use the `.env` client id already present for this app if it is the AlphaSwarm SPA.
Configure:

| Setting | Values |
| --- | --- |
| Application Type | Single Page Application |
| Token Endpoint Authentication Method | `None` |
| Grant Types | Authorization Code, Refresh Token |
| Allowed Callback URLs | `http://127.0.0.1:3001`, `http://localhost:3001`, `https://alpha-swarm.ai` |
| Allowed Logout URLs | `http://127.0.0.1:3001`, `http://localhost:3001`, `https://alpha-swarm.ai` |
| Allowed Web Origins | `http://127.0.0.1:3001`, `http://localhost:3001`, `https://alpha-swarm.ai` |
| Allowed Origins (CORS) | same as Web Origins |

Kubernetes ConfigMap values now generated from `.env`:

```yaml
VITE_AUTH_REQUIRED: "true"
VITE_AUTH0_DOMAIN: "alphaswarm-fund.us.auth0.com"
VITE_AUTH0_CLIENT_ID: ""
VITE_AUTH0_AUDIENCE: "https://api.alphaswarm.internal/manage"
```

## 3. Machine-to-Machine Application (`alphaswarm-m2m`)

Auth0 Dashboard path: **Applications → Applications → Create Application → Machine to Machine**.

Create a dedicated app; do **not** reuse the SPA client secret for M2M. Grant it access to the API Resource Server:

| API | Scopes |
| --- | --- |
| `https://api.alphaswarm.internal/manage` | `read:infrastructure`, `manage:infrastructure`, `data:read`, `scim:write`, `deploy:run`, `deploy:halt` |

Store:

| AlphaSwarm variable | Source |
| --- | --- |
| `ALPHASWARM_AUTH_M2M_CLIENT_ID` | M2M Application Client ID |
| `ALPHASWARM_AUTH_M2M_CLIENT_SECRET` | M2M Application Client Secret |
| `ALPHASWARM_AUTH_M2M_AUDIENCE` | `https://api.alphaswarm.internal/manage` |

The current `.env` does not yet contain the M2M client id/secret, so the generated Kubernetes Secret intentionally leaves `ALPHASWARM_AUTH_M2M_CLIENT_SECRET` empty/placeholder until you create the app.

## 4. Post-Login Action

The Action template lives at:

[`alphaswarm_platform/terraform/modules/auth0_identity/post_login_action.js.tftpl`](../../alphaswarm_platform/terraform/modules/auth0_identity/post_login_action.js.tftpl)

Configure the deployed Action with:

| Placeholder | Value |
| --- | --- |
| `claims_namespace` | `https://alphaswarm.internal/` |
| `api_audience` | `https://api.alphaswarm.internal/manage` |
| `sync_url` | production: `https://api.alpha-swarm.ai/_internal/auth0/sync` |

The Action injects these custom claims:

| Claim | Example |
| --- | --- |
| `https://alphaswarm.internal/org_id` | `org_abc123` |
| `https://alphaswarm.internal/workspace_id` | `workspace_abc123` |
| `https://alphaswarm.internal/roles` | `["alphaswarm-operator"]` |
| `https://alphaswarm.internal/resources` | `["alphaswarm-api", "alphaswarm-worker"]` |
| `https://alphaswarm.internal/scopes` | `["read:infrastructure", "manage:agents"]` |

The backend still reads the legacy `https://alphaswarm/` namespace for one release, but new tokens should use `https://alphaswarm.internal/`.

## 5. Roles and assignments

Create roles:

| Role | Permissions |
| --- | --- |
| `alphaswarm-viewer` | `read:infrastructure`, `data:read` |
| `alphaswarm-operator` | `read:infrastructure`, `manage:agents`, `data:read` |
| `alphaswarm-admin` | `read:infrastructure`, `manage:agents`, `manage:infrastructure`, `data:read`, `data:write`, `deploy:run`, `deploy:halt` |
| `alphaswarm-superadmin` | all above plus `admin:cluster`, `scim:write` |

Assign your test user to `alphaswarm-superadmin` first to verify end-to-end login. Then move down to `alphaswarm-operator` or `alphaswarm-viewer` to verify resource filtering.

## 6. Kubernetes apply order

The safe apply order is:

```powershell
# 1. Apply tracked, non-secret manifests and placeholder Secret templates.
kubectl apply -k alphaswarm_platform/deployments/kubernetes/overlays/dev

# 2. Apply locally generated real Secret manifests (git-ignored).
kubectl apply -f alphaswarm_platform/deployments/kubernetes/generated/alphaswarm-secrets.local.yaml
kubectl apply -f alphaswarm_platform/deployments/kubernetes/generated/alphaswarm-admin-secrets.local.yaml

# 3. Restart workloads so env vars are re-read.
kubectl -n alphaswarm rollout restart deployment/alphaswarm-core deployment/alphaswarm-client deployment/alphaswarm-worker
kubectl -n alphaswarm-admin rollout restart deployment/alphaswarm-cp
```

Do not run this until `kubectl get ns` succeeds against the intended context.

## 7. Verification

```powershell
kubectl -n alphaswarm get configmap alphaswarm-config -o yaml
kubectl -n alphaswarm-admin get configmap alphaswarm-config -o yaml
kubectl -n alphaswarm get secret alphaswarm-secrets -o jsonpath='{.data.ALPHASWARM_AUTH_OIDC_CLIENT_SECRET}'
kubectl -n alphaswarm-admin get secret alphaswarm-secrets -o jsonpath='{.data.ALPHASWARM_AUTH_OIDC_CLIENT_SECRET}'
```

Frontend login should no longer show:

> Authentication is required ... frontend was not given identity-provider configuration

Instead, it should redirect to Auth0 Universal Login for the `alphaswarm-fund.us.auth0.com` tenant.

# how-to/operations/aws-deploy

# AWS Hybrid Deployment Guide

> Companion runbook: [aws-runbook.md](aws-runbook.md).
> Architecture decision: hybrid (EKS Karpenter quant runtime + ECS Fargate
> admin + Bedrock AgentCore) — chosen per the blueprint §16.3 scope clarifications.

This guide walks you through deploying AlphaSwarm to AWS for the first time. Subsequent rollouts go through the normal `terraform-pipeline.yml` + `build-publish.yml` CI workflows; this page is only for the bootstrap path. Allow ~3–4 hours end-to-end (most of the wall clock is Bedrock model-access approval + Cloudflare propagation).
## Topology summary

```mermaid
flowchart LR
    publicUsers[Marketing users] --> cloudflare[Cloudflare tunnel]
    operators[Operators / staff] --> cloudfront[CloudFront]
    subgraph aws [AWS account]
        cloudfront --> alb[ALB]
        alb --> admin[alphaswarm-admin ECS Fargate]
        alb --> proxy[AgentCore proxy ECS Fargate]
        proxy --> ac[AgentCore Runtime ARM64]
        ac --> bedrock[Bedrock FMs Claude 4.5 Titan v2]
        ac --> kb[Knowledge Base OpenSearch Serverless]
        eks[EKS Karpenter quant runtime] --> rds[RDS PG 16]
        eks --> redis[ElastiCache Redis Serverless]
        eks --> s3[S3 Iceberg warehouse]
    end
```

## Prerequisites

| Item | How to confirm |
| --- | --- |
| AWS Organization with Control Tower enrolled | Console -> AWS Control Tower -> Landing zone is `Available`. |
| Seven member accounts: `management`, `log-archive`, `security-audit`, `shared-services`, `dev`, `staging`, `prod` | `aws organizations list-accounts` from the management account. |
| Bedrock model access enabled (Claude Sonnet 4.5, Claude Haiku 4.5, Titan Text Embeddings v2) per workload account in us-east-1 | Console -> Bedrock -> Model access — must show `Access granted`. This is the only manual console step in the bootstrap path. |
| GitHub repo `julianwileymac/alphaswarm` admin access | Required to create the three GitHub Environments (`dev`, `staging`, `prod`). |
| Local `terraform >= 1.10.0`, `aws-cli v2`, `kubectl >= 1.30`, `kustomize >= 5.0`, `cosign >= 2.4`, `helm >= 3.16` | `terraform version` etc. |

## Phase 1 — Bootstrap (one-time, manual)

The bootstrap stack provisions the state backend (S3 + DynamoDB + KMS) + GitHub OIDC provider in every account. Run with admin credentials per account; nothing in the regular workflow ever needs admin afterwards.
```bash
# From the management account first:
cd infrastructure/bootstrap
terraform init
terraform apply -auto-approve

# Capture the published outputs (state bucket, DynamoDB table, KMS key)
# into the per-account backend.hcl files:
terraform output -json > /tmp/bootstrap-outputs.json
```

Repeat per workload account by assuming the `OrganizationAccountAccessRole` in each one (Control Tower wires the trust automatically) and re-running `terraform init && terraform apply` with the per-account state bucket name.

## Phase 2 — Landing zone IaC (`infrastructure/envs/`)

The landing zone tree provisions the shared infrastructure inside each workload account: VPC, EKS cluster, Karpenter, ECR, RDS Postgres, MSK Kafka, S3 data lake, observability stack. Apply through GitHub Actions (NEVER run `terraform apply` from a laptop in CI mode):

1. In the GitHub repo settings, create the three Environments (`dev`, `staging`, `prod`) and add an `AWS_DEPLOYER_ROLE_ARN` repo variable per env (the ARN comes from the bootstrap output).
2. Push a no-op commit to `main` so the `terraform-pipeline.yml` workflow runs the plan against `dev`. Review the plan diff in the workflow summary.
3. Click "Run workflow" -> `tree=infrastructure`, `env=dev`, `action=apply`. The job assumes `vars.TF_APPLY_ROLE_dev` and runs `terraform apply -auto-approve` against `infrastructure/envs/dev/`.
4. Promote to staging + prod by repeating step 3 with the matching env. Staging requires one reviewer; prod requires two (GitHub Environment protection rules).

## Phase 3 — Application IaC (`alphaswarm_platform/terraform/environments/live`)

The application tree composes the 8 new modules (`bedrock-agentcore`, `bedrock-knowledge-base`, `opensearch-serverless`, `cognito-userpool`, `cloudfront`, `alb`, `ecs-fargate-control-plane`, `eventbridge-stepfunctions`) PLUS the heritage `alphaswarm_platform/terraform/modules/` composition.
Run via:

```bash
# Render backend.hcl from the bootstrap SSM outputs:
cd alphaswarm_platform/terraform/environments/live
aws ssm get-parameter --name /alphaswarm/prod/tfstate_bucket_name \
  --query 'Parameter.Value' --output text > /tmp/bucket
aws ssm get-parameter --name /alphaswarm/prod/tfstate_kms_key_arn \
  --query 'Parameter.Value' --output text > /tmp/kms
aws ssm get-parameter --name /alphaswarm/prod/tfstate_dynamodb_table \
  --query 'Parameter.Value' --output text > /tmp/lock
# Unquoted heredoc so the $(cat ...) substitutions expand:
cat <<EOF > backend.hcl
bucket         = "$(cat /tmp/bucket)"
key            = "alphaswarm_platform/live/terraform.tfstate"
region         = "us-east-1"
encrypt        = true
kms_key_id     = "$(cat /tmp/kms)"
dynamodb_table = "$(cat /tmp/lock)"
EOF
```

Then trigger the `terraform-pipeline.yml` workflow with `tree=alphaswarm_platform`, `env=live`, `action=plan` -> review -> `action=apply`.

## Phase 4 — Image builds

The `build-publish.yml` workflow ships every AlphaSwarm container to ECR. `alphaswarm-agent` is ARM64-only (AgentCore Runtime requirement); every other service builds multi-arch.

```bash
git tag v1.0.0
git push origin v1.0.0
# Watch the workflow — it pushes the 8 services + signs with Cosign +
# emits SLSA provenance + uploads SBOMs.
```

## Phase 5 — Seed secrets + Bedrock + Knowledge Base

```bash
# Broker credentials (paper trading first):
aws secretsmanager put-secret-value \
  --secret-id alphaswarm/prod/broker/alpaca \
  --secret-string '{"api_key":"","secret_key":""}'

# Upload research docs to the KB source bucket — the EventBridge
# rule from modules/eventbridge-stepfunctions triggers a Bedrock
# ingestion job on every PutObject:
aws s3 sync ./research/papers/ s3://$(aws ssm get-parameter \
  --name /alphaswarm/prod/kb_source_bucket \
  --query 'Parameter.Value' --output text)/
```

## Phase 6 — Smoke

```bash
# 1. Confirm the AgentCore Runtime invokes via the smoke workflow:
gh workflow run bedrock-smoke.yml

# 2. Direct invoke from a deployer-role-assumed shell:
aws bedrock-agentcore invoke-agent-runtime \
  --agent-runtime-arn $(aws ssm get-parameter \
    --name /alphaswarm/prod/agentcore_runtime_arn \
    --query 'Parameter.Value' --output text) \
  --payload '{"spec_name":"dataset_loading_assistant","inputs":{"prompt":"ping"}}' \
  /tmp/response.json

# 3. Verify the trace shows up in X-Ray (run id from the smoke output):
aws xray get-trace-summaries \
  --time-range-type TraceId \
  --start-time $(date -u -d '5 minutes ago' +%s) \
  --end-time $(date -u +%s) \
  --filter-expression "service(\"alphaswarm-admin\")"
```

## Promotion

| From | To | Trigger |
| --- | --- | --- |
| `main` push | dev apply | `terraform-pipeline.yml` plan + auto-merge gate |
| tag `vX.Y.Z-rc.N` | staging apply | `terraform-pipeline.yml` dispatch + 1 reviewer |
| tag `vX.Y.Z` | prod apply | `terraform-pipeline.yml` dispatch + 2 reviewers |

## Rollback

See [aws-runbook.md](aws-runbook.md) for the rollback playbook.

# how-to/operations/aws-minimum-rollback

# AWS Minimum Tier Rollback Playbook

> Companion to
> [aws-minimum-single-account.md](aws-minimum-single-account.md) (deploy)
> and [aws-runbook.md](aws-runbook.md) (full-stack on-call).
>
> This page is the dedicated rollback procedure for the single-account
> minimum tier deployed via
> [infrastructure/envs/minimum/scripts/deploy.sh](../../../../infrastructure/envs/minimum/scripts/deploy.sh).

## TL;DR — One Command Rollback

```bash
cd infrastructure/envs/minimum
ACCOUNT_ALIAS=minimum AWS_REGION=us-east-1 bash scripts/destroy.sh
```

That command:

1. Checks the caller's AWS account id matches the snapshot from deploy.
2. Destroys the application tier (Cognito + ALB + Fargate).
3. Disables RDS deletion protection.
4. Takes a final RDS snapshot (skip with `DESTROY_RDS_SKIP_SNAPSHOT=yes`).
5. Destroys the infrastructure tier (VPC + RDS + Redis + IAM + alarms).
6. Lists any orphan resources that survived destroy.
7.
Retains the bootstrap state backend (S3 + DynamoDB + KMS + OIDC). Total wall-clock: ~15 minutes (RDS snapshot is the long pole). ## When to Roll Back | Situation | Action | | --- | --- | | Wrong account / region | `bash scripts/destroy.sh` immediately; the identity guard will catch the mismatch before destroying anything. | | Cost overrun | `bash scripts/destroy.sh` then re-deploy with smaller instance types. | | Failed apply mid-flight | `bash scripts/destroy.sh` (idempotent — resumes from wherever apply stopped). | | Need clean slate | `DESTROY_BOOTSTRAP=yes bash scripts/destroy.sh` (also nukes the state backend). | | RDS data corruption | `DESTROY_RDS_SKIP_SNAPSHOT=yes bash scripts/destroy.sh` (skip the bad-data snapshot). | | Security incident | See [aws-runbook.md](aws-runbook.md) §"Halt every AgentCore session" first; THEN consider destroy. | ## Pre-Rollback Checklist Before running `destroy.sh`: - [ ] **Confirm there's no critical data** in RDS that hasn't been backed up out-of-band. The default rollback takes a final snapshot, but if you pass `DESTROY_RDS_SKIP_SNAPSHOT=yes`, data is gone. - [ ] **Confirm no other team / env is using the bootstrap state backend.** The default `DESTROY_BOOTSTRAP=no` preserves it. Flipping to `yes` affects every env that shares the same `alphaswarm-tfstate-` bucket. - [ ] **Capture forensics first** for a security incident: ```bash bash scripts/snapshot.sh capture ``` Then destroy. ## The Six Stages ### Stage 1 — Identity Guard ```bash # destroy.sh reads .snapshots/latest/deploy-receipt.json and confirms # the caller's STS identity matches. # If you see: "deploy receipt is for account 111 but caller is 222" # → STOP. You're in the wrong account. Switch profiles + retry. ``` The receipt is created by `deploy.sh` Step 8 and stored at `infrastructure/envs/minimum/.snapshots//deploy-receipt.json`. 
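
The guard described above boils down to comparing two account ids before anything is destroyed. The following is a minimal sketch of that comparison, not the actual `destroy.sh` logic: the function name `check_identity` is hypothetical, and the assumption that the receipt is JSON with an `account_id` field is illustrative only.

```bash
#!/usr/bin/env bash
set -euo pipefail

# Hypothetical helper (not taken from destroy.sh): compare the account id
# recorded at deploy time with the caller's current account id.
# Returns 0 on a match; prints the mismatch message and returns 1 otherwise.
check_identity() {
  local receipt_account="$1" caller_account="$2"
  if [[ "$receipt_account" != "$caller_account" ]]; then
    echo "deploy receipt is for account ${receipt_account} but caller is ${caller_account}" >&2
    return 1
  fi
}

# Assumed wiring (the receipt field name "account_id" is an assumption):
#   receipt_account="$(jq -r .account_id "$receipt_path")"
#   caller_account="$(aws sts get-caller-identity --query Account --output text)"
#   check_identity "$receipt_account" "$caller_account" || exit 1
```

Failing closed on any mismatch, before the first `terraform destroy`, is what makes it safe to run the script against the wrong profile by accident.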
### Stage 2 — Application Tier The app tier holds the runtime contract — destroying it FIRST means the ALB target groups release their ECS service refs, so the infrastructure-tier ALB delete doesn't 409 with "still in use". Manual override if needed: ```bash cd alphaswarm_platform/terraform/environments/minimum terraform init -reconfigure -backend-config=backend.hcl terraform destroy -auto-approve ``` ### Stage 3 — RDS Deletion-Protection Bypass ```bash aws rds modify-db-instance \ --db-instance-identifier alphaswarm-admin-min \ --no-deletion-protection --apply-immediately ``` `destroy.sh` does this automatically; the manual command above is the fallback if the script can't reach RDS for some reason. ### Stage 4 — Infrastructure Tier ```bash cd infrastructure/envs/minimum terraform init -reconfigure -backend-config=backend.hcl terraform destroy -auto-approve ``` Skips the ECR repos by default (they're declared in `modules/ecr-repositories` with no `prevent_destroy`, so they get removed — but ECR's lifecycle policy keeps the most recent 30 tagged images for 14 days even after the repo deletes). ### Stage 5 — Orphan Sweep If `destroy.sh` reports orphans: ```bash [DESTROY] ⚠ found 3 orphan resource(s) — review + hand-delete: arn:aws:ec2:us-east-1:123:network-interface/eni-0a1b2c3d arn:aws:logs:us-east-1:123:log-group:/aws/ecs/alphaswarm-admin-min arn:aws:elasticloadbalancing:us-east-1:123:listener-rule/... ``` Hand-delete each: ```bash # ENI stuck in 'available' from a deleted ECS task aws ec2 delete-network-interface --network-interface-id eni-0a1b2c3d # Log group with retention != never (terraform doesn't auto-delete these) aws logs delete-log-group --log-group-name /aws/ecs/alphaswarm-admin-min # Orphan listener rule (rare — usually the ALB destroy covers it) aws elbv2 delete-rule --rule-arn arn:aws:elasticloadbalancing:... ``` Common orphan sources: - **NAT-attached EIPs** — the NAT Gateway is gone but the EIP is not released automatically. 
- **ECS task ENIs** in `available` state — task definition was deleted but the ENI lingers until the underlying ENA cleanup runs.
- **CloudWatch log groups with retention != never_expire** — terraform doesn't delete them; they linger but cost nothing until they fill up.
- **Listener rules** with a target group that was already deleted.

### Stage 6 — Bootstrap (Optional)

Only when `DESTROY_BOOTSTRAP=yes`. The state bucket has Object Lock GOVERNANCE, so the script empties it with `--bypass-governance-retention`:

```bash
DESTROY_BOOTSTRAP=yes bash scripts/destroy.sh
```

What this destroys:

- S3 state bucket (every version + delete marker)
- DynamoDB lock table
- KMS CMK (30-day deletion window, recoverable until then)
- GitHub OIDC provider

What this does NOT destroy:

- The AWS account itself.
- AWS CloudTrail (default trail in the account; AWS bills for it regardless).
- The Bedrock model-access grant (console-only setting; persists across teardowns).

## Recovery After a Partial Destroy

If `destroy.sh` fails mid-flight:

```bash
# 1. Inspect the .destroy.log for the failed step.
tail -100 infrastructure/envs/minimum/.destroy.log

# 2. Re-run destroy — it's idempotent + resumes from wherever apply stopped.
bash scripts/destroy.sh

# 3. If terraform state is locked, force-unlock:
cd infrastructure/envs/minimum
terraform force-unlock

# 4. If a specific resource is wedged, target it:
terraform destroy -target=module.rds.aws_db_instance.this -auto-approve
```

## Cost Verification

After rollback, verify $0 monthly spend in the AWS console:

- **Cost Explorer** → filter by tag `managed_by=terraform`, `env=minimum` → should show $0 in the current period.
- **AWS Budgets** → if the alert was wired pre-rollback, it stays armed with `actual_spend=0` for the period.

If non-zero spend persists 24 h after rollback:

- Check for **EBS snapshots** that were created by RDS deletion.
- Check for **CloudWatch metric streams** that may have been wired manually (not destroyed by `destroy.sh`).
- Check for **Route 53 hosted zones** — these have a $0.50/mo floor.

## Files Touched

| File | Created by | Destroyed by |
| --- | --- | --- |
| `alphaswarm-tfstate-` S3 bucket | `deploy.sh` (bootstrap step) | `destroy.sh` only with `DESTROY_BOOTSTRAP=yes` |
| `alphaswarm-tfstate-lock-` DynamoDB table | bootstrap | same |
| `alias/alphaswarm-tfstate` KMS key | bootstrap | same |
| GitHub OIDC provider | bootstrap | same |
| VPC `alphaswarm-min` + subnets + NAT + endpoints | infrastructure tier | `destroy.sh` step 4 |
| RDS `alphaswarm-admin-min` | infrastructure tier | step 4 (final snapshot retained) |
| ElastiCache `alphaswarm-min-redis` | infrastructure tier | step 4 |
| ECR repos | infrastructure tier | step 4 (image lifecycle policy keeps tags 14d) |
| CloudWatch alarms + dashboard | infrastructure tier | step 4 |
| ALB + Cognito + Fargate cluster | application tier | step 2 |
| `.snapshots//` | `snapshot.sh` | preserved on disk forever (listed in `.gitignore` by default) |

# how-to/operations/aws-minimum-single-account

# Single-Account Minimum AWS Deployment

> **Companion docs:** [aws-deploy.md](aws-deploy.md) for the full
> multi-account hybrid topology; [aws-runbook.md](aws-runbook.md) for
> the operational playbook.

The cheapest deployable AlphaSwarm on AWS. Target cost: **~$140/month fixed** + Bedrock token spend. Skips multi-account, EKS, MSK, AgentCore Runtime, Knowledge Base, CloudFront, and the EventBridge nightly backtest path. Use it as a stepping stone before promoting to the full topology.
## What you get ```mermaid flowchart LR operators[Operators] --> alb[ALB HTTPS] alb --> cognito[Cognito User Pool] cognito --> alb alb --> admin[alphaswarm-admin ECS Fargate single task] admin --> rds[RDS Postgres single-AZ] admin --> redis[ElastiCache Redis 1-node] admin --> bedrock[Bedrock Claude Haiku 4.5] ``` ## Pieces composed | Tier | Module | Cost/mo | | --- | --- | ---: | | Network | `infrastructure/modules/vpc` (2 AZ, single NAT) | ~$32 | | Ingress | `infrastructure/modules/alb` | ~$22 | | Database | `infrastructure/modules/rds-postgres` (`db.t4g.medium`) | ~$45 | | Cache | inline ElastiCache (`cache.t4g.small`, 1 node) | ~$25 | | Compute | `infrastructure/modules/ecs-fargate-control-plane` (1 task, 0.5 vCPU + 1 GB) | ~$15 | | Identity | `infrastructure/modules/cognito-userpool` (first 50k MAU free) | $0 | | Container registry | `infrastructure/modules/ecr-repositories` (3 repos) | ~$1 | | Logs | CloudWatch Logs (~1 GB ingest) | \<$1 | | LLM | Amazon Bedrock Claude Haiku 4.5 (variable) | $? per use | | **Fixed total** | | **~$140** | ## Files this guide refers to - [infrastructure/envs/minimum/](../../../../infrastructure/envs/minimum/) — infrastructure tier (VPC + ECR + RDS + Redis + Bedrock invoke IAM) - [alphaswarm_platform/terraform/environments/minimum/](../../../../alphaswarm_platform/terraform/environments/minimum/) — application tier (Cognito + ALB + ECS Fargate) - [alphaswarm_platform/configs/terraform/minimum.yaml](../../../../alphaswarm_platform/configs/terraform/minimum.yaml) — `TerraformStackSpec` for the `alphaswarm deploy` CLI - [alphaswarm_platform/configs/deployment/topology.yaml](../../../../alphaswarm_platform/configs/deployment/topology.yaml) `targets.aws-minimum` — topology target binding ## Six steps to live ### 1. Enable Bedrock model access (manual, console) Console → **Bedrock** → **Model access** → request **Anthropic Claude Haiku 4.5**. Approval is usually instant. 
Only this model needs access for the minimum — Claude Sonnet 4.5 + Titan Embed v2 can wait until you add the Knowledge Base. ### 2. Bootstrap the state backend ```bash cd infrastructure/bootstrap terraform init terraform apply -auto-approve terraform output -json | tee /tmp/bootstrap.json ``` This is the only place admin creds are required. The stack ships: - S3 state bucket (KMS-encrypted, Object Lock GOVERNANCE) - DynamoDB lock table - KMS CMK for workload encryption - GitHub OIDC provider ### 3. Apply the infrastructure tier ```bash cd infrastructure/envs/minimum sed "s||$(jq -r .account_id.value /tmp/bootstrap.json)|" \ backend.hcl.example > backend.hcl cp terraform.tfvars.example terraform.tfvars # Edit terraform.tfvars: paste the kms_key_arn + external_id + # github_oidc_provider_arn from /tmp/bootstrap.json. terraform init -backend-config=backend.hcl terraform apply ``` ~12 minutes (RDS provisioning is the long pole). Outputs include the ALB-ready VPC + every SSM parameter the application tier reads. ### 4. Push the first image ```bash git tag v0.1.0-min git push origin v0.1.0-min ``` [`build-publish.yml`](../../../../.github/workflows/build-publish.yml) ships `alphaswarm-admin` (and any other matrix entries) to ECR. The `AqpGithubDeployerMinimum` role from step 3 is what the workflow assumes via OIDC. ### 5. Apply the application tier ```bash cd alphaswarm_platform/terraform/environments/minimum sed "s||$(jq -r .account_id.value /tmp/bootstrap.json)|" \ backend.hcl.example > backend.hcl cp terraform.tfvars.example terraform.tfvars # Edit terraform.tfvars: paste the acm_certificate_arn_alb + the # image tag you just pushed. terraform init -backend-config=backend.hcl terraform apply ``` ~5 minutes. The ALB DNS appears in the outputs. ### 6. Configure AlphaSwarm runtime The application reads the deployment endpoints from `/alphaswarm/minimum/*` SSM. 
Set the env vars on the ECS task definition (or via the application's `Settings` overrides): ```bash ALPHASWARM_LLM_PROVIDER=bedrock ALPHASWARM_BEDROCK_REGION=us-east-1 ALPHASWARM_AUTH_PROVIDER=aws_cognito ALPHASWARM_AUTH_OIDC_ISSUER= ALPHASWARM_DEPLOY_TARGET=aws ALPHASWARM_DATABASE_URL=postgresql+psycopg://@:5432/alphaswarm ALPHASWARM_REDIS_URL=rediss://:6379/0 ``` The matching `bedrock` `ProviderSpec` is already in [alphaswarm/llm/providers/catalog.py](../../../../alphaswarm/llm/providers/catalog.py) (shipped in Phase D of the AWS hybrid rollout); no code change needed. ## Verify ```bash # Hit the ALB: curl -sS https://$(terraform -chdir=alphaswarm_platform/terraform/environments/minimum \ output -raw alb_dns_name)/healthz # Call Bedrock through the application: curl -sS https:///api/llm/echo \ -H "Authorization: Bearer " \ -d '{"prompt": "ping"}' ``` The application's `router_complete` injects `aws_region_name=us-east-1` on the Bedrock call (`_bedrock_extra_kwargs` in [alphaswarm/llm/providers/router.py](../../../../alphaswarm/llm/providers/router.py)); boto3 walks the chain to the ECS task role's IAM credentials. ## Promotion path When ready to outgrow the minimum, add modules one at a time. The SSM-parameter contract means application code doesn't change. 
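
As a rough illustration of that contract: a lookup only needs the environment name and a key, so swapping `minimum` for another env changes the path prefix and nothing else. This is a sketch under assumptions; the helper names `ssm_param_name` and `ssm_param_value` are hypothetical and not part of the AlphaSwarm codebase.

```bash
#!/usr/bin/env bash
set -euo pipefail

# Hypothetical helper: build the SSM parameter name for an env and key,
# following the /alphaswarm/<env>/<key> convention used in these guides.
ssm_param_name() {
  local env="$1" key="$2"
  printf '/alphaswarm/%s/%s\n' "$env" "$key"
}

# Hypothetical helper: fetch the parameter value (needs AWS credentials).
# SecureString parameters would additionally need --with-decryption.
ssm_param_value() {
  aws ssm get-parameter \
    --name "$(ssm_param_name "$1" "$2")" \
    --query 'Parameter.Value' --output text
}
```

For example, `ssm_param_name prod kb_source_bucket` yields the same `/alphaswarm/prod/kb_source_bucket` path that the deployment guide reads directly.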
| Add when… | Append to `alphaswarm_platform/terraform/environments/minimum/main.tf` | | --- | --- | | You need a custom domain (`admin.alpha-swarm.ai`) | `module "cloudfront"` from `infrastructure/modules/cloudfront` | | You need vector search over research docs | `module "opensearch_serverless"` + `module "bedrock_kb"` | | You want AgentCore (8-hour sessions, managed memory) | `module "bedrock_agentcore"` + a second `alphaswarm-agent` ECS service | | You need a Celery worker tier | Stand up `infrastructure/envs/dev` (full EKS+Karpenter) and add the heritage `module "alphaswarm"` here | | You need cross-account isolation | Promote to the full multi-account topology via `infrastructure/modules/landing-zone` | Once the full set lands, retarget the topology from `target=aws-minimum` to `target=aws`. The application reads the same `/alphaswarm/${env}/*` SSM parameters either way. ## Tear down ```bash # Application tier first (Fargate services hold ALB target group # references that prevent ALB deletion): cd alphaswarm_platform/terraform/environments/minimum terraform destroy # Then infrastructure tier: cd ../../../../infrastructure/envs/minimum terraform destroy # RDS has deletion_protection=true by default — set it to false in # the module call and re-apply before destroy if you really want it gone. ``` Data buckets (`prevent_destroy = true`) are kept on purpose; remove them manually after confirming no other env references them. # how-to/operations/aws-runbook # AWS Hybrid Operational Runbook > Companion to [aws-deploy.md](aws-deploy.md). Page this when the > AgentCore proxy / admin BFF / Bedrock KB / ECS Fargate cluster > misbehaves. ## On-call checklist (first 5 minutes) 1. **Confirm the blast radius.** Cloudflare hosts `alpha-swarm.ai` / `api.alpha-swarm.ai` / `manage.alpha-swarm.ai`; CloudFront hosts `admin.alpha-swarm.ai` / `agentcore.alpha-swarm.ai`. A Cloudflare outage does NOT touch the admin / AgentCore surface (and vice versa). 2. 
**Hit `/healthz`.** `https://admin.alpha-swarm.ai/healthz` — if 200, the ALB + ECS Fargate path is healthy. If 5xx, jump to "ECS Fargate service down" below. 3. **Fan-out kill switch.** If the incident is touching trading, immediately POST `/portfolio/kill_switch` AND `/workloads/halt` so every long-running runtime (paper, bots, RL, AgentCore) stops. The topbar `KillSwitch` component does the fan-out automatically for logged-in operators. ## Halt every AgentCore session ```bash # 1. Disable new invocations at the gateway: aws bedrock-agentcore update-gateway \ --gateway-id $(aws ssm get-parameter \ --name /alphaswarm/prod/agentcore_gateway_arn \ --query 'Parameter.Value' --output text | awk -F/ '{print $NF}') \ --status DISABLED # 2. Stop the AgentCore proxy ECS Fargate service (the ALB stops # forwarding to the proxy immediately): aws ecs update-service \ --cluster $(aws ssm get-parameter \ --name /alphaswarm/prod/ecs_cluster_name \ --query 'Parameter.Value' --output text) \ --service alphaswarm-agentcore-proxy-prod \ --desired-count 0 ``` The matching audit row lands in `workload_runs` via `WorkloadRuntime.start_run` BEFORE the boto3 call returns (rule 45). You can verify with: ```bash aws rds-data execute-statement \ --resource-arn $RDS_ARN --secret-arn $DB_SECRET_ARN \ --database alphaswarm \ --sql "SELECT id, action, status, user_id, started_at \ FROM workload_runs ORDER BY started_at DESC LIMIT 10" ``` ## Roll back to the previous tag ```bash # 1. Identify the previous good SHA: git log --oneline --decorate -20 # 2. Tag the previous SHA + push to trigger the apply path: git tag v1.0.1-rollback git push origin v1.0.1-rollback # 3. Manual dispatch — terraform-pipeline.yml with # tree=alphaswarm_platform, env=prod, action=apply (requires 2 reviewers). # # The DB / S3 buckets / KB source bucket are NOT touched — every data # resource carries lifecycle.prevent_destroy=true and retains the # previous Terraform-managed state. 
``` ## ECS Fargate service down ```bash # 1. Inspect recent task stops (most common = ECR pull failure or # secrets resolution failure): aws ecs describe-services \ --cluster alphaswarm-cluster-prod \ --services alphaswarm-admin-prod \ --query 'services[0].events[:10]' # 2. Tail the application log group (the ADOT sidecar logs to a # sibling stream prefixed adot/): aws logs tail /aws/ecs/alphaswarm-admin-prod --follow # 3. Force a fresh rollout (also rolls the ADOT sidecar): aws ecs update-service \ --cluster alphaswarm-cluster-prod \ --service alphaswarm-admin-prod \ --force-new-deployment ``` The matching `AwsProvider.restart` call from `WorkloadRuntime` writes a `workload_runs` row with `action=restart` BEFORE the boto3 call runs, so the audit trail is intact even when the rollout fails halfway. ## Drain a Celery worker (EKS path) The Celery workers continue to run on EKS Karpenter; the ECS Fargate surface is admin + AgentCore only. To drain a worker pod without losing in-flight tasks: ```bash # 1. Annotate the pod for graceful shutdown — the Celery preboot # hook in alphaswarm/tasks/celery_app.py listens for this annotation: kubectl annotate pod alphaswarm-worker-xxxxx \ -n alphaswarm \ alphaswarm.io/drain-requested-at="$(date -u +%Y-%m-%dT%H:%M:%SZ)" # 2. Wait for the worker to finish in-flight tasks (max grace = # ALPHASWARM_AGENT_STALL_THRESHOLD_SECONDS, default 1800s): kubectl wait --for=delete pod alphaswarm-worker-xxxxx -n alphaswarm --timeout=1900s # 3. Karpenter replaces the deleted pod automatically; the new pod # inherits the same Celery queue subscriptions. ``` ## Assume the break-glass role Reserved for catastrophic incidents (org-wide outage, suspected account compromise). The `AlphaSwarm-BreakGlass` Identity Center permission set has Admin across every account, MFA-required, and alarms on every assumption (CloudWatch alarm + `workload_runs` row + Cloudflare Access policy). 
```bash
# IAM Identity Center -> Sign in -> Pick AlphaSwarm-BreakGlass for the
# target account -> Acknowledge the on-call ticket prompt before
# the assume completes.
aws sts get-caller-identity
# Outputs the breakglass session arn — paste into the incident ticket.
```

## Rotate CloudFront origin secret

When the CloudFront `X-CloudFront-Secret` header value is suspected leaked:

```bash
new=$(openssl rand -hex 32)
aws ssm put-parameter \
  --name /alphaswarm/prod/cloudfront_origin_secret \
  --value "$new" --type SecureString --overwrite

# Re-apply the cloudfront module so the new value lands at the edge:
gh workflow run terraform-pipeline.yml -f tree=alphaswarm_platform -f env=prod -f action=apply
```

## Common failure modes + fixes

| Symptom | Likely cause | Fix |
| --- | --- | --- |
| `AccessDeniedException` on first KB ingestion | `aoss:APIAccessAll` not propagated | Wait 20s + retry (the `time_sleep.settle` block handles this on apply, but manual ingestion-job starts can race). |
| AgentCore Runtime invocation returns 403 | The runtime role's IAM policy denies the FM | Verify the model ARN is in `var.allowed_model_arns` for the env. |
| `terraform apply` fails on `aws_bedrockagentcore_*` resource | Provider version too old | Pin `hashicorp/aws ~> 6.21` in `alphaswarm_platform/terraform/environments/live/main.tf`. |
| ALB 504 on `admin.alpha-swarm.ai` | Cognito redirect loop or ALB OIDC misconfig | Inspect the listener rule's `authenticate-cognito` action; confirm `user_pool_arn` + `user_pool_client_id` + `user_pool_domain` SSM params match. |
| ECS exec hangs | Task definition missing `enableExecuteCommand=true` | Re-deploy with `enable_execute_command=true` in the service spec; ECS Exec also needs `ALPHASWARM_AWS_ECS_EXEC_ENABLED=true` on the provider. |
| Bedrock smoke workflow times out on X-Ray | ADOT sidecar not propagating; check `alphaswarm-adot-sidecar` SA has the X-Ray + Application Signals + CloudWatch policies.
| | ## Cost guardrails `AWS Budgets` alarms ride on top of the SCP region allowlist: - `dev` budget = $300/month (alerts at 50/80/100%) - `staging` budget = $500/month - `prod` budget = $1500/month (excluding Bedrock token spend; that is metered separately under the `alphaswarm.io/cost-bucket=bedrock-tokens` tag). If a budget alarm fires AND the Bedrock token cost is the driver: ```bash # Disable streaming responses (cheaper) and switch the agent spec to # Haiku for the next 24h while the operator investigates: aws ssm put-parameter --name /alphaswarm/prod/llm_model_preference \ --value "anthropic.claude-haiku-4-5-20251022-v1:0" \ --type String --overwrite aws ecs update-service \ --cluster alphaswarm-cluster-prod \ --service alphaswarm-agentcore-proxy-prod \ --force-new-deployment ``` # Bot Canary Rollout Playbook > - Any strategy code change (new alpha, new portfolio constructor, new execution algo). - Any adapter change (new venue, new FIX session config, new on-chain RPC endpoint). - Any risk-policy threshold ... # Bot Canary Rollout Playbook > When to use it, how to read the dashboards, how to abort, and how > to tune false positives. ## When to use a canary - Any strategy code change (new alpha, new portfolio constructor, new execution algo). - Any adapter change (new venue, new FIX session config, new on-chain RPC endpoint). - Any risk-policy threshold loosening. - **Not** for: spec-only documentation updates, image rebuilds that don't change behavior, k8s manifest tweaks that don't change pod spec. ## Steps ### 1. Author the canary Edit the bot's GitOps values file: ```yaml # values-bot-mm-aapl.yaml bot: variant: canary # mutated label drives the Rollouts split botSpec: # New strategy parameters here. ``` ### 2. 
Open a PR Required CI checks: - [ ] `tests/bots` green - [ ] `python -m alphaswarm_bots.cli validate ` passes - [ ] `python -m alphaswarm_bots.cli conformance ` passes - [ ] `python -m alphaswarm_bots.cli stress ` passes - [ ] Trivy scan: no CRITICAL/HIGH CVEs on the new image - [ ] Cosign signature attached ### 3. Argo CD syncs the Rollout The CanaryRollout CR mutates from `currentStep=0` to `currentStep=1` when the new image lands. Traffic shifts to 10%. ### 4. Watch the AnalysisTemplate results ```bash kubectl argo rollouts get rollout bot-mm-aapl ``` Expected output: ``` Status: ✔ Healthy Strategy: Canary Step: 1/5 SetWeight: 10 Current: stable=18 canary=2 ``` The Prometheus dashboard `Bot Canary - ` shows three traces: - `quantbot_realized_pnl_usd{variant="canary"}` vs `{variant="stable"}` - `quantbot_orders_rejected_total / quantbot_orders_total` per variant - `histogram_quantile(0.99, quantbot_tick_to_trade_seconds_bucket)` per variant ### 5. Promotion vs abort - **Auto-promote:** if all three AnalysisTemplates pass the configured windows, the rollout advances to the next step automatically. - **Auto-abort:** any AnalysisTemplate failure aborts the rollout and reverts traffic to 100% stable. Slack/PagerDuty alert `BotErrorRateHigh` or `BotPnLDrawdownCritical` fires. - **Manual abort:** ```bash kubectl argo rollouts abort bot-mm-aapl ``` - **Manual promote (for an indefinite pause step):** ```bash kubectl argo rollouts promote bot-mm-aapl ``` ## Tuning false positives If you observe a healthy canary aborting frequently: 1. **Tighten the metric query first.** Move from `rate(...[1m])` to `rate(...[5m])`; use a robust quantile (e.g. `histogram_quantile(0.99, sum by (le) (rate(...[5m])))`). 2. **Lengthen the window.** Bump `count` from 30 to 60. 3. **Only THEN relax the success condition.** Don't relax `pnlVsStableMinUsd` from `-50` to `-150` without first investigating the variance source. 
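
Before working through the tuning order above, it helps to put a number on how often good canaries are being aborted. A minimal sketch, assuming you count by hand (from rollout history) how many healthy canaries were aborted by a noisy metric and how many healthy canaries ran in total; the function names are hypothetical, and the default threshold of 10 percent mirrors the false-positive limit this playbook cites from the blueprint.

```bash
#!/usr/bin/env bash
set -euo pipefail

# Hypothetical helper: print the false-positive rate as an integer percent
# (rounded down). Args: aborted_good total_good.
canary_fp_rate_pct() {
  local aborted_good="$1" total_good="$2"
  if (( total_good == 0 )); then
    echo 0
    return
  fi
  echo $(( aborted_good * 100 / total_good ))
}

# Hypothetical helper: succeed when the rate exceeds the threshold
# (default 10 percent). Cross-multiplies to avoid integer-division rounding.
canary_fp_rate_exceeds() {
  local aborted_good="$1" total_good="$2" threshold_pct="${3:-10}"
  (( aborted_good * 100 > threshold_pct * total_good ))
}
```

For instance, 4 aborted out of 30 healthy canaries is about 13 percent, so `canary_fp_rate_exceeds 4 30` succeeds and step 1 (tightening the metric query) is the place to start.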
Per blueprint caveat #7: if the canary AnalysisTemplate false-positive rate exceeds 10% (good canaries aborted by noisy metric), tighten the metric query before relaxing the success condition. ## Hard abort: emergency If the canary is in `Progressing` state but you see live PnL bleeding faster than the abort criterion would catch: ```bash # Three-scope kill switch — engaged at bot scope. kubectl apply -f - < # Operations runbook — CI/CD deploy > Task-oriented steps for the AWS CI/CD pipeline: create the dev/staging/prod GitHub Environments and reviewers, set the per-env role variables and cross-repo dispatch token, deploy the infra and app trees via terraform-pipeline.yml, release images via a v* tag, drive the admin redeploy, approve a prod release, find the terraform_runs audit row, and roll back. # Operations runbook — CI/CD deploy Task-oriented steps for the AlphaSwarm AWS CI/CD pipeline. For the design and the topology diagrams see the concept page [CI/CD pipelines](../../concepts/infrastructure/cicd-pipelines.md). This runbook is the companion to the bootstrap and incident playbooks in [AWS Hybrid Deployment Guide](aws-deploy.md) and [AWS Hybrid Operational Runbook](aws-runbook.md) — start there for the first-ever account bring-up; come here for the day-to-day pipeline. All deploys run through GitHub Actions over GitHub OIDC. Never run `terraform apply` or `alphaswarm deploy up` against a shared environment from a laptop. ## (a) One-time setup — Environments, reviewers, variables Do this once per repo (the steps are the same for `alphaswarm_platform` and `alphaswarm_admin`). 1. **Create the three GitHub Environments.** In the repo: Settings → Environments → New environment, for each of `dev`, `staging`, `prod`. 2. **Set required reviewers.** Edit each Environment's protection rules: - `dev` — no required reviewers (auto). - `staging` — **1** required reviewer. - `prod` — **2** required reviewers (4-eyes). 3. 
**Set the per-env role variables.** For each Environment add the apply role ARN (published by the `infrastructure/modules/github-oidc` module) plus the read-only plan role ARN: ```bash # Apply role (one per environment): gh variable set AWS_DEPLOYER_ROLE_ARN \ --env prod \ --body "arn:aws:iam:::role/aqp-gha-apply" # Plan role (read-only, used by pr-validate.yml): gh variable set AWS_PLAN_ROLE_ARN \ --env prod \ --body "arn:aws:iam:::role/aqp-gha-plan" ``` Repeat for `dev` and `staging` with their account IDs. 4. **Set the cross-repo dispatch token (admin repo only).** The admin pipeline fires a `repository_dispatch` at `alphaswarm_platform`, so it needs a token with `repo` scope on the platform repo. Store it as a secret in the **admin** repo: ```bash gh secret set PLATFORM_DISPATCH_TOKEN \ --repo Alpha-Swarm-ai/alphaswarm_admin \ --body "" ``` ## (b) Deploy the landing zone (infrastructure/) The `infrastructure/` tree is applied with native Terraform over OIDC into `AqpTerraformExecutionRole`. Always plan first, review the diff in the workflow summary, then apply. ```bash # 1. Plan dev: gh workflow run terraform-pipeline.yml \ -f tree=infrastructure -f env=dev -f action=plan # 2. Review the plan in the run summary, then apply: gh workflow run terraform-pipeline.yml \ -f tree=infrastructure -f env=dev -f action=apply ``` Promote by repeating with `-f env=staging` then `-f env=prod`. The `staging` apply waits on 1 reviewer and the `prod` apply on 2 (the GitHub Environment gate). ## (c) Deploy the app tier (terraform/) Same workflow, `tree=alphaswarm_platform`. This path delegates to `CodeBuild`, which runs `alphaswarm deploy plan` / `alphaswarm deploy up` (`TerraformRuntime`) and writes a `terraform_runs` audit row. 
```bash
gh workflow run terraform-pipeline.yml \
  -f tree=alphaswarm_platform -f env=dev -f action=plan

gh workflow run terraform-pipeline.yml \
  -f tree=alphaswarm_platform -f env=dev -f action=apply
```

A `push` to `main` automatically runs an `infrastructure` plan against `dev`, so you usually only dispatch the `apply` actions explicitly.

## (d) Release images — push a v* tag

`build-publish.yml` triggers on a `v*` tag. It builds each service multi-arch to `ECR`, signs with `Cosign` keyless, emits a `syft` SBOM and `SLSA` provenance, and runs `Trivy` + `Grype` scans.

```bash
git tag v1.4.0
git push origin v1.4.0

# Watch the workflow:
gh run watch
```

## (e) Admin deploy flow

`alphaswarm_admin` builds two images and hands off to the platform.

1. Push to the admin repo's `main` (or push a `v*` tag).
2. The admin workflow builds and pushes **two** images to `ECR`: `alphaswarm-admin` and `alphaswarm-admin-frontend`.
3. After both land, it fires a `repository_dispatch` event `admin-image-published` at `alphaswarm_platform` (using `PLATFORM_DISPATCH_TOKEN`).
4. The platform's app-tier redeploy runs and rolls the admin service onto **ECS `Fargate`** (`Cognito` + `ALB`) via `terraform/environments/{dev,staging,prod}`, reading infra handles from SSM `/alphaswarm//*`.

To re-trigger the handoff manually (for example after a token fix without a new build):

```bash
gh api repos/Alpha-Swarm-ai/alphaswarm_platform/dispatches \
  -f event_type=admin-image-published \
  -f 'client_payload[env]=dev'
```

## (f) Approving a prod release (4-eyes)

A `prod` apply (infra or app tier) pauses on the GitHub Environment gate until **two** distinct reviewers approve. The apply role cannot be assumed before that, so nothing touches `prod` until both sign off.

1. Dispatch the apply (step b or c) with `-f env=prod`.
2. Two reviewers open the run → "Review deployments" → select `prod` → Approve. Approvals must come from two different people.
3. The job then assumes `vars.AWS_DEPLOYER_ROLE_ARN` for `prod` over OIDC and proceeds.

```bash
# List runs awaiting approval:
gh run list --workflow terraform-pipeline.yml
```

## (g) Where the terraform_runs audit row lands

Every app-tier `alphaswarm deploy plan` / `up` writes a row to the `terraform_runs` table in the platform Postgres (platform AGENTS rule 42) — the same ledger used by `TerraformRuntime` for in-app Terraform actions. Native `infrastructure/` applies do not write this row (their history is the Terraform state in S3).

To inspect recent app-tier runs:

```bash
aws rds-data execute-statement \
  --resource-arn "$RDS_ARN" --secret-arn "$DB_SECRET_ARN" \
  --database alphaswarm \
  --sql "SELECT id, action, status, env, started_at \
         FROM terraform_runs ORDER BY started_at DESC LIMIT 10"
```

## (h) Rollback

Pick the path that matches what changed.

- **Bad image (app or admin):** re-point the deploy at the prior immutable image tag and redeploy — no rebuild required.

  ```bash
  # Re-run the app-tier apply pinned to the previous tag:
  gh workflow run terraform-pipeline.yml \
    -f tree=alphaswarm_platform -f env=prod -f action=apply \
    -f image_tag=v1.3.0
  ```

- **Bad infra/app-tier change:** re-apply the previous good state by dispatching `apply` from the prior good commit. Tag-and-push the previous SHA, then dispatch the apply (prod still needs 2 reviewers):

  ```bash
  git tag v1.3.1-rollback
  git push origin v1.3.1-rollback
  gh workflow run terraform-pipeline.yml \
    -f tree=alphaswarm_platform -f env=prod -f action=apply
  ```

Data resources (RDS, S3, the KB source bucket) carry `lifecycle.prevent_destroy = true`, so a re-apply rolls forward the service definitions without touching stateful resources. See the rollback section of [AWS Hybrid Operational Runbook](aws-runbook.md) for the full data-safety notes.

## See also

- [CI/CD pipelines](../../concepts/infrastructure/cicd-pipelines.md) — the design and topology.
- [AWS Hybrid Deployment Guide](aws-deploy.md) — first-time bootstrap.
- [AWS Hybrid Operational Runbook](aws-runbook.md) — incident playbooks + rollback data safety.

# Cloud-CLI temporary credentials

> Operator runbook for minting short-lived AWS / GCP / Azure credentials from the admin UI. The control plane spawns the CLI subprocess; the admin BFF brokers it; the minted token is persisted via CredentialResolver and never echoed back.

# Cloud-CLI temporary credentials

How to use the **CloudCliCredentialWizard** in the admin Settings page to mint a short-lived AWS / GCP / Azure credential without shipping the parent credential or the cloud CLI binary into the admin BFF container.

## Topology

```mermaid
flowchart LR
    Op["operator (MFA-fresh)"] --> FE["alphaswarm_admin/frontend\nCloudCliCredentialWizard"]
    FE -->|"POST /admin/settings/credentials/cloud-cli/preview"| BFF["alphaswarm_admin BFF"]
    BFF -->|broker| CP["alphaswarm_controller\n/manage/credentials/cloud-cli/preview"]
    CP -->|argv only| BFF
    BFF -->|masked argv| FE
    Op -->|"approve preview"| FE
    FE -->|"POST .../sts"| BFF
    BFF -->|broker| CP
    CP -->|"asyncio.create_subprocess_exec"| CLI["aws sts | gcloud auth | az account get-access-token"]
    CLI -->|JSON / token| CP
    CP -->|"persist via CredentialResolver"| RES["resolver chain"]
    CP -->|"metadata only\n(credential_key, expires_at)"| BFF
    BFF -->|envelope| FE
```

The CLI subprocess **only** runs inside `alphaswarm_controller`. The admin BFF (`alphaswarm_admin`) is HTTP-only per its boundary contract; it never spawns processes or holds the parent credential.
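The two control-plane edges in the diagram (returning a masked argv for the preview, and spawning the CLI with `asyncio.create_subprocess_exec`) can be sketched roughly as follows. This is an illustration under stated assumptions, not the controller's actual code: `mask_argv`, `run_cli`, and the `sensitive_flags` set are hypothetical names, and the 60-second default mirrors the wall-clock budget the walkthrough mentions for the mint step.

```python
from __future__ import annotations

import asyncio
import sys

# Hypothetical set of flags whose values should never be shown in a preview.
SENSITIVE_FLAGS = frozenset({"--token", "--password"})


def mask_argv(argv: list[str], sensitive_flags: frozenset[str] = SENSITIVE_FLAGS) -> list[str]:
    """Return a copy of argv with the values of sensitive flags replaced by ****."""
    masked: list[str] = []
    hide_next = False
    for arg in argv:
        if hide_next:
            masked.append("****")
            hide_next = False
            continue
        flag, sep, _value = arg.partition("=")
        if flag in sensitive_flags:
            if sep:  # --flag=value form
                masked.append(f"{flag}=****")
            else:  # --flag value form: mask the following argument
                masked.append(arg)
                hide_next = True
        else:
            masked.append(arg)
    return masked


async def run_cli(argv: list[str], timeout_s: float = 60.0) -> tuple[int, str, str]:
    """Run argv directly (no shell) and return (exit_code, stdout, stderr)."""
    proc = await asyncio.create_subprocess_exec(
        *argv,
        stdout=asyncio.subprocess.PIPE,
        stderr=asyncio.subprocess.PIPE,
    )
    try:
        out, err = await asyncio.wait_for(proc.communicate(), timeout=timeout_s)
    except asyncio.TimeoutError:
        proc.kill()  # do not leave the child running past the budget
        await proc.wait()
        raise
    return proc.returncode, out.decode(), err.decode()


def run_cli_sync(argv: list[str], timeout_s: float = 60.0) -> tuple[int, str, str]:
    """Blocking wrapper, convenient for scripts and tests."""
    return asyncio.run(run_cli(argv, timeout_s))


if __name__ == "__main__":
    # Stand-in for a real cloud CLI: the current Python interpreter.
    print(mask_argv(["some-cli", "login", "--token", "s3cr3t"]))
    print(run_cli_sync([sys.executable, "-c", "print('ok')"]))
```

Passing the argument vector straight to `create_subprocess_exec` avoids shell interpolation of operator-supplied values, which is one plausible reason the diagram labels that edge "argv only".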
## Prerequisites

The control plane host needs the CLI binary on `$PATH`:

| Provider | Binary | Required pre-auth |
| --- | --- | --- |
| AWS | `aws` | parent IAM identity (instance profile / IRSA / access key) with `sts:AssumeRole` on the target role |
| GCP | `gcloud` | ADC (Application Default Credentials) for an identity with `iam.serviceAccounts.getAccessToken` on the target SA |
| Azure | `az` | logged-in `az` session (`az login --identity` for managed identity, or interactive) on a principal that can issue tokens for the requested resource |

The wizard's preview step renders a `binary present on CP host` flag so the operator can spot a missing CLI before they hit `Execute mint`.

## Walkthrough

### 1. Pick a provider + fill the form

Open `Settings → Cloud-CLI temporary credential mint`. The wizard loads handler metadata from `/admin/settings/credentials/cloud-cli/handlers` (proxied to the CP) and renders the appropriate fields:

| Provider | Required |
| --- | --- |
| AWS | `target_credential_key`, `role_arn` |
| GCP | `target_credential_key`, `service_account_email` |
| Azure | `target_credential_key` (resource / tenant / subscription optional) |

`target_credential_key` is the resolver key the minted credential persists under (e.g. `idp:aws:prod`). Downstream code reads it via `CredentialResolver.resolve(CredentialKey(, ))` once minted; nothing in the platform passes the bytes directly.

### 2. Preview

Clicking `Preview command` posts to `/admin/settings/credentials/cloud-cli/preview` which returns the exact `argv` the CP would spawn, with token-bearing args masked. This is a **dry run** — no subprocess executes.

### 3. Mint

`Execute mint` posts to `/admin/settings/credentials/cloud-cli/sts`. Server-side the CP:

1. writes a `WorkloadRun` audit row with `action=mint_cloud_credential` in `PENDING` state **before** spawning the subprocess;
2. invokes `aws sts assume-role` / `gcloud auth print-access-token` / `az account get-access-token` with a 60s wall-clock timeout;
3. parses the result, persists the credential under `target_credential_key` via the resolver chain;
4. updates the audit row to `SUCCEEDED|FAILED`.

The wizard renders the response envelope:

| Field | Meaning |
| --- | --- |
| `credential_key` | resolver key the temp creds live under |
| `expires_at` | TTL boundary (provider-derived) |
| `source_identity` | role ARN / SA email / subscription id |
| `audit_run_id` | links to the `WorkloadRun` ledger row |

The raw token is **never** in the response body or audit ledger.

## Step-up MFA

`/admin/settings/credentials/cloud-cli/{preview,sts}` carry `Depends(require_admin_step_up("admin:cluster"))`. If the operator's JWT is older than the configured `auth_step_up_default_max_age` (default 180s), the BFF returns `401 insufficient_user_authentication` with an RFC 9470 `WWW-Authenticate` challenge; the wizard's `apiFetch` middleware silently re-issues an MFA prompt and retries the original call.
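The freshness check behind the step-up challenge can be sketched as below. This is a minimal illustration, not the body of `require_admin_step_up`: the function names are hypothetical, and it assumes the token age is measured from an `auth_time`-style timestamp in epoch seconds (the source only says "older than" the configured max age). The header follows the challenge shape RFC 9470 defines, with the `insufficient_user_authentication` error code and a `max_age` parameter.

```python
from __future__ import annotations

import time

DEFAULT_MAX_AGE_S = 180  # mirrors the documented default of auth_step_up_default_max_age


def needs_step_up(auth_time: float, max_age_s: int = DEFAULT_MAX_AGE_S, now: float | None = None) -> bool:
    """Return True when the last authentication is older than max_age_s seconds."""
    current = time.time() if now is None else now
    return (current - auth_time) > max_age_s


def step_up_challenge(max_age_s: int = DEFAULT_MAX_AGE_S) -> dict[str, str]:
    """Build the headers for a 401 response that asks for a fresher authentication."""
    return {
        "WWW-Authenticate": (
            'Bearer error="insufficient_user_authentication", '
            'error_description="A more recent authentication is required", '
            f'max_age="{max_age_s}"'
        )
    }


if __name__ == "__main__":
    # A token authenticated 200 seconds ago exceeds the 180 second budget.
    if needs_step_up(auth_time=time.time() - 200):
        print(401, step_up_challenge())
```

A client that receives this challenge can prompt for MFA and then replay the original request, which is the behaviour the section attributes to the wizard's `apiFetch` middleware.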
## Troubleshooting

| Symptom | Cause | Fix |
| --- | --- | --- |
| `binary_missing` in the preview | CLI not on the CP host's `$PATH` | Install the CLI in the CP container image, or shell into the host and run `which aws/gcloud/az` |
| `nonzero_exit` with masked stderr | Parent credential lacks the requested permission | Read the redacted stderr in the audit row's `error` field; provision the missing role / IAM permission |
| `parse_error` | Upstream returned an unexpected JSON shape | Compare against the canonical `aws sts assume-role` / `az account get-access-token` shape in the AWS / Azure docs; file a bug if AWS/Azure changed the format |
| `timeout` | Network / IAM trust-policy resolution stuck | The 60s budget is intentional — re-run; if it persists, run the masked argv from the preview locally to inspect |
| `persist_failed` | Resolver write surface not configured | The default CP resolver doesn't ship a write hook in OSS; provide one via the `persist=` kwarg of `alphaswarm_controller.services.cloud_cli.mint`, or wire a Vault / SSM secret manager that supports writes |

## Related docs

- [Cloud credentials](../../concepts/identity/cloud-credentials.md) — resolver chain + naming conventions.
- [Identity overview](../../concepts/identity/identity.md) — overall rule-26 + rule-52 boundaries.
- [Account integrations](../../concepts/identity/account-integrations.md) — the per-org PAT-link sibling surface (HF + Docker Hub).

# Operations runbook — Configuration management

> [`alphaswarm_platform/deployments/compose/.env.schema`](../../alphaswarm_platform/deployments/compose/.env.schema) is the source of truth. Every variable declared anywhere (compose, K8s ConfigMap, K8s Secret, appli...

# Operations runbook — Configuration management

How env vars, ConfigMaps, and Secrets flow through the AlphaSwarm stack.
## The single source of truth

[`alphaswarm_platform/deployments/compose/.env.schema`](../../alphaswarm_platform/deployments/compose/.env.schema) is the source of truth. Every variable declared anywhere (compose, K8s ConfigMap, K8s Secret, application code, frontend) MUST appear in the schema. Each entry carries metadata:

```
key: ALPHASWARM_FOO_BAR
description: What this knob controls.
required: true | false
default:
targets: local,kubernetes,cloud
classification: plain | secret | rotation-required
```

## Generation

```powershell
# Local dev (.env file)
make generate-config ENV=local

# Cloud / sealed-secrets seed
make generate-config ENV=cloud

# Kubernetes ConfigMap + Secret scaffold
make generate-config ENV=k8s
```

Or directly:

```powershell
python alphaswarm_platform/build/scripts/generate_config.py --env local --out alphaswarm_platform/deployments/compose/.env.local
python alphaswarm_platform/build/scripts/generate_config.py --env k8s --kind configmap
python alphaswarm_platform/build/scripts/generate_config.py --env k8s --kind secret
```

## Validation

`make validate-config` runs the generator in `--diff` mode against every target — produces no output when files are in sync with the schema; prints a unified diff when they've drifted.

## How env reaches a service

```mermaid
flowchart LR
    schema[.env.schema] -->|generate_config.py| envfile[.env.local]
    schema -->|generate_config.py| cm[ConfigMap]
    schema -->|generate_config.py| secret[Secret scaffold]
    envfile -->|docker compose| compose[Compose service]
    cm -->|envFrom| pod[Pod env vars]
    secret -->|envFrom| pod
    pod --> alphaswarm[alphaswarm.config.settings reads via pydantic-settings]
    compose --> alphaswarm
```

## Adding a new variable

1. Add a block to `.env.schema`:

   ```
   key: ALPHASWARM_MY_NEW_KNOB
   description: What it does (one line).
   required: false
   default:
   targets: local,kubernetes,cloud
   classification: plain
   ```

2. Regenerate every artifact:

   ```powershell
   make generate-config ENV=local
   make generate-config ENV=k8s
   ```

3. Add the field to `alphaswarm.config.settings.Settings` so the application can read it via `from alphaswarm.config import settings`.
4. Update tests that snapshot the env to include the new key.

## Secret classification rules

| Class | Examples | Storage |
| --- | --- | --- |
| `plain` | `ALPHASWARM_LOG_LEVEL`, `ALPHASWARM_CORE_API_URL` | ConfigMap |
| `secret` | `ALPHASWARM_DATABASE_PASSWORD`, `ALPHASWARM_AUTH_SCIM_BEARER_TOKEN_HASH` | Secret + sealed-secrets / external-secrets-operator |
| `rotation-required` | `ALPHASWARM_AUTH_M2M_CLIENT_SECRET`, `ALPHASWARM_SESSION_COOKIE_SECRET` | Secret + rotation cadence in [rotate-secrets.md](rotate-secrets.md) |

## Never

- Never commit a populated `Secret` to git. The generator writes a `Y2hhbmdlbWU=` placeholder; CI/CD or the external secret operator patches the real values.
- Never read `os.environ.get(...)` directly from `alphaswarm/` business code. Use `from alphaswarm.config import settings`.
- Never hardcode a URL or password. Add it to the schema and route through `settings`.

# Connect a company cloud account (federated-first onboarding)

> Guided 5-step wizard in alphaswarm_admin that connects AWS, Azure, GCP, or Cloudflare accounts using federated identity. No long-lived secrets stored.

# Connect a company cloud account

The cloud onboarding wizard in `alphaswarm_admin` (route `/admin/accounts/{org_id}/cloud/{cloud_kind}/*`) is the canonical path for wiring an AWS, Azure, GCP, or Cloudflare account into AlphaSwarm. It is federated-first by design: no access keys, no client secrets, no service-account JSON, and no global API keys are ever stored.

The same UI serves both flows:

- **Per customer organisation** — `/accounts//integrations` → "Cloud accounts" section.
- **Admin tenant (AlphaSwarm's own accounts)** — `/settings` page, "Cloud connections" panel, "Guided (federated)" mode.
  Routes use the synthetic `org_id="__platform__"` value.

## Step 0 — pre-requisites

Set these env vars on the `alphaswarm_admin` deployment before the wizard can emit bootstrap artifacts. None of them are secrets on their own, but the wizard rejects bootstrap calls when the corresponding identity is missing.

| Env var | Purpose | Required for |
| --- | --- | --- |
| `ALPHASWARM_ADMIN_AWS_PARTNER_ACCOUNT_ID` | AlphaSwarm's AWS account id (12 digits) embedded as the trust policy's `Principal.AWS`. | AWS |
| `ALPHASWARM_ADMIN_CLOUD_AWS_EXTERNAL_ID_SECRET` | HMAC key used to derive a stable per-org `sts:ExternalId`. | AWS (prod) |
| `ALPHASWARM_ADMIN_AZURE_APP_CLIENT_ID` | Client id of the AlphaSwarm Entra app that will carry the federated credential. | Azure |
| `ALPHASWARM_ADMIN_AZURE_APP_OBJECT_ID` | Object id of the same app — parent for `az ad app federated-credential create`. | Azure |
| `ALPHASWARM_ADMIN_GCP_WIF_AUDIENCE` | Audience template for the customer's WIF provider. | GCP |
| `ALPHASWARM_ADMIN_GCP_WIF_SERVICE_ACCOUNT_EMAIL` | AlphaSwarm-side service account the customer's WIF principal impersonates. | GCP |

The customer-side bootstrap also needs network egress from the admin BFF to each cloud's control plane (AWS STS, Microsoft Graph, GCP IAM, Cloudflare API). The wizard surfaces clear errors when a call fails.

## The five-step pattern

```mermaid
flowchart LR
    s1["1 Choose"] --> s2["2 Bootstrap"]
    s2 --> s3["3 Identity"]
    s3 --> s4["4 Permissions"]
    s4 --> s5["5 Save"]
```

| Step | Mutates AlphaSwarm? | Mutates cloud? | Audit row? |
| --- | --- | --- | --- |
| 1 Choose cloud + auth method | no | no | no |
| 2 Bootstrap artifacts | no | no | no |
| 3 Validate identity | no | read-only | yes |
| 4 Validate permissions | no | read-only | yes |
| 4* Enumerate resources | no | read-only | yes |
| 5 Save (`connect`) | yes | no | yes (pending + succeeded/failed) |

Steps 3, 4, 5 require step-up MFA per hard rule 52.
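Step 0 lists an HMAC key that derives a stable per-org `sts:ExternalId`. A minimal sketch of such a derivation follows. The function name and the message layout (`org_id:cloud_kind`) are assumptions made for illustration only, since the source does not spell out the exact message format, and hex encoding is likewise just one reasonable choice.

```python
from __future__ import annotations

import hashlib
import hmac


def derive_external_id(secret: bytes, org_id: str, cloud_kind: str = "aws") -> str:
    """Derive a deterministic external id from a server-side HMAC key.

    The message layout below is hypothetical; only the HMAC-SHA256
    construction itself is taken from the runbook.
    """
    message = f"{org_id}:{cloud_kind}".encode("utf-8")
    return hmac.new(secret, message, hashlib.sha256).hexdigest()


if __name__ == "__main__":
    key = b"example-key-not-a-real-secret"
    # Same key and org always yield the same id, so the wizard can re-render
    # the trust policy without storing the id separately.
    print(derive_external_id(key, "org-123"))
    # Rotating the key changes every derived id.
    print(derive_external_id(b"rotated-key", "org-123"))
```

Because the id is a pure function of the key and the organisation, rotating the key invalidates every previously issued external id, which matches the rotation duty described for AWS later in this page.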
## Per-cloud runbooks

### AWS — cross-account IAM role + external id

1. **Step 2 — bootstrap.** The wizard emits a trust policy that names AlphaSwarm's account as the `Principal.AWS` and includes a unique `sts:ExternalId` derived from `HMAC-SHA256(ALPHASWARM_ADMIN_CLOUD_AWS_EXTERNAL_ID_SECRET, ":")`. Copy the rendered `trust_policy.json` block or use the CloudFormation StackSet quick-link the wizard surfaces. The default role name is `alphaswarm-broker-` (configurable via `ALPHASWARM_ADMIN_AWS_ROLE_NAME_PATTERN`).

   ```bash
   aws iam create-role \
     --role-name alphaswarm-broker- \
     --assume-role-policy-document file://trust_policy.json
   ```

   Attach the policies the wizard hint suggests (`ReadOnlyAccess` for a minimal connection; tighter policies for production).
2. **Step 3 — validate identity.** Paste the resulting Role ARN into the wizard. AlphaSwarm calls `sts:AssumeRole` with the same external id, then `sts:GetCallerIdentity`. Failure modes: `AccessDenied` (trust policy not applied or wrong external id), `InvalidParameterValue` (role doesn't exist), or `RegionDisabled`.
3. **Step 4 — validate permissions.** AlphaSwarm runs `iam:SimulatePrincipalPolicy` against the role for `sts:GetCallerIdentity`, `iam:GetRole`, and `ec2:DescribeRegions` (the baseline). Missing permissions render as a red "Missing required" badge.
4. **Step 5 — save.** AlphaSwarm persists `{role_arn, external_id, region, account_id}` under `CredentialKey("cloud_aws", "org:")` and the `alphaswarm_admin.integration_store` table.

### Azure — Workload Identity Federation

1. **Step 2 — bootstrap.** The wizard emits a federated-credential JSON skeleton keyed to the AlphaSwarm Entra app's `object_id`. Default audience is `api://AzureADTokenExchange` (override via `ALPHASWARM_ADMIN_AZURE_AUDIENCE`).

   ```bash
   az ad app federated-credential create \
     --id \
     --parameters federated_credential.json
   ```

   On the customer's subscription, grant the AlphaSwarm app a role (default: `Reader`) at the appropriate scope.
2. **Step 3 — validate identity.** Provide the customer `tenant_id` and `subscription_id`. AlphaSwarm acquires a token via the federated credential and resolves the token's claims; failures typically point to a subject/issuer mismatch on the federated credential.
3. **Step 4 — validate permissions.** AlphaSwarm lists role assignments at the subscription scope and compares against the `required_roles` baseline.
4. **Step 5 — save.** Stored under `CredentialKey("cloud_azure", "org:")`. No client secret is ever provided to or stored by AlphaSwarm — the federated credential is the only artifact.

### GCP — Workload Identity Federation + impersonation

1. **Step 2 — bootstrap.** The wizard emits a Workload Identity Pool + Provider config (issuer URI, allowed audiences, `attribute_mapping`) plus three `gcloud` invocations:

   ```bash
   gcloud iam workload-identity-pools create alphaswarm-broker- \
     --project= --location=global \
     --display-name="AlphaSwarm broker "

   gcloud iam workload-identity-pools providers create-oidc alphaswarm-oidc \
     --project= --location=global \
     --workload-identity-pool=alphaswarm-broker- \
     --issuer-uri=https://alpha-swarm.ai/oidc/ \
     --allowed-audiences="" \
     --attribute-mapping="google.subject=assertion.sub"

   gcloud iam service-accounts add-iam-policy-binding \
     \
     --project= \
     --role=roles/iam.workloadIdentityUser \
     --member="principalSet://iam.googleapis.com/projects//locations/global/workloadIdentityPools/alphaswarm-broker-/*"
   ```

2. **Step 3 — validate identity.** AlphaSwarm impersonates the configured service account and confirms the impersonation chain works.
3. **Step 4 — validate permissions.** AlphaSwarm calls `projects.testIamPermissions` for the baseline (`resourcemanager.projects.get`, `iam.serviceAccounts.actAs`).
4. **Step 5 — save.** Stored under `CredentialKey("cloud_gcp", "org:")`.
### Cloudflare — scoped API token

There is no federated identity option for Cloudflare; the federated-first equivalent is the **scoped API token** (the deprecated Global API key is rejected outright).

1. **Step 2 — bootstrap.** Pick the narrowest template that covers the use case: `dns_edit`, `tunnel`, `access`, `worker`, or `r2`. The wizard opens [https://dash.cloudflare.com/profile/api-tokens](https://dash.cloudflare.com/profile/api-tokens) and the customer creates the token in the dashboard.
2. **Step 3 — validate identity.** Paste the token into the wizard. AlphaSwarm calls `GET /user/tokens/verify` and confirms `status == active`. The token is **never** echoed back to the wizard after submission.
3. **Step 4 — validate permissions.** AlphaSwarm inspects the verified token's permission groups against the template baseline.
4. **Step 5 — save.** Stored encrypted at rest in the `IntegrationCredentialStore` (Fernet-wrapped) under `CredentialKey("cloud_cloudflare", "org:")`.

## Rotation

| Cloud | Auth method | Rotation duty |
| --- | --- | --- |
| AWS | IAM role + external id | None for the role itself. Rotate the HMAC key (`ALPHASWARM_ADMIN_CLOUD_AWS_EXTERNAL_ID_SECRET`) when an operator with insider knowledge leaves — re-running the wizard regenerates a new external id and the customer updates the trust policy. |
| Azure | Workload Identity Federation | None — no client secret to rotate. |
| GCP | WIF + impersonation | None — no JSON key to rotate. |
| Cloudflare | Scoped API token | 60–90 days recommended. The wizard re-validates the token through the health route; an `expires_on` timestamp surfaces in the integration list. |

## Disconnect

`DELETE /admin/accounts/{org_id}/cloud/{cloud_kind}` drops the local record.
Vendor-side cleanup (delete the IAM role, delete the federated credential, delete the WIF pool, revoke the scoped token) is the operator's responsibility — the runbook calls this out because vendor APIs that delete principals require elevated permissions AlphaSwarm intentionally does not request.

## Where to look in code

- ABC + lifecycle helpers: [`alphaswarm_admin/src/alphaswarm_admin/providers/base.py`](../../../../alphaswarm_admin/src/alphaswarm_admin/providers/base.py)
- Per-cloud providers: [`cloud_aws.py`](../../../../alphaswarm_admin/src/alphaswarm_admin/providers/cloud_aws.py), [`cloud_azure.py`](../../../../alphaswarm_admin/src/alphaswarm_admin/providers/cloud_azure.py), [`cloud_gcp.py`](../../../../alphaswarm_admin/src/alphaswarm_admin/providers/cloud_gcp.py), [`cloud_cloudflare.py`](../../../../alphaswarm_admin/src/alphaswarm_admin/providers/cloud_cloudflare.py)
- Router: [`alphaswarm_admin/src/alphaswarm_admin/api/routers/cloud_onboarding.py`](../../../../alphaswarm_admin/src/alphaswarm_admin/api/routers/cloud_onboarding.py)
- Shared frontend wizard: [`alphaswarm_admin/frontend/components/cloud/CloudOnboardingWizard.tsx`](../../../../alphaswarm_admin/frontend/components/cloud/CloudOnboardingWizard.tsx)
- Settings: [`alphaswarm_admin/src/alphaswarm_admin/settings.py`](../../../../alphaswarm_admin/src/alphaswarm_admin/settings.py) (the `cloud_*` field block).

## See also

- [Account integrations](../../concepts/identity/account-integrations.md) — sibling per-org integrations (HuggingFace, Docker Hub).
- [Cloud-CLI temporary credentials](cloud-cli-temporary-credentials.md) — short-lived credential mint surface, complementary to the long-lived federated identity established here.
- [Credentials](../../concepts/identity/credentials.md) — the `CredentialResolver` chain that backs persistence (hard rule 26).
# Control the platform ECS deployment

> Operate the hosted platform's own AWS ECS Fargate services from alphaswarm_admin — rollout status, redeploy, scale, logs, and monitoring.

# Control the platform ECS deployment

The `Platform` page in `alphaswarm_admin` (route `/platform`, API `/admin/platform/ecs/*`) controls the **hosted platform's own** AWS ECS Fargate slice — the `alphaswarm-admin` and `alphaswarm-agentcore-proxy` services that run the control plane. It is the counterpart to the `Services` page, which brokers **customer workload** lifecycle to `alphaswarm_controller`.

The admin reaches AWS with its **ECS task role** — no static keys. The `ecs-fargate-control-plane` Terraform module grants a tightly scoped self-management policy (`ecs:UpdateService` + `Describe*`, `logs:*` read, `cloudwatch:*` read) to the services that set `enable_self_management`.

## What it shows

| Surface | Source | Purpose |
| --- | --- | --- |
| Service table | `ecs:DescribeServices` | Live rollout state per service (`IN_PROGRESS` / `COMPLETED` / `FAILED`), running vs desired tasks. |
| Logs drawer | CloudWatch Logs `FilterLogEvents` | A bounded tail of the service's `awslogs` group, resolved from its task definition. |
| Metrics drawer | CloudWatch `GetMetricData` | Container Insights CPU, memory, and running-task count over a window. |
| Alarms strip | `cloudwatch:DescribeAlarms` | The platform's per-service alarms (running-task floor, CPU, memory). |

## Prerequisites

Set on the `alphaswarm_admin` deployment:

| Env var | Purpose |
| --- | --- |
| `ALPHASWARM_ADMIN_PLATFORM_ECS_CLUSTER` | ECS cluster name the surface targets. The `ecs-fargate-control-plane` module publishes it at `/alphaswarm//ecs_cluster_name`. |
| `ALPHASWARM_ADMIN_PLATFORM_AWS_REGION` | Region the cluster runs in (default `us-east-1`). |
| `ALPHASWARM_ADMIN_PLATFORM_ALARM_PREFIX` | Alarm-name prefix used to scope the alarm listing (default `alphaswarm-`). |

The admin must run with `alphaswarm-admin[cloud-aws]` installed (the `boto3` extra). When `boto3` is missing the surface returns `503 provider_unavailable` with an actionable message; when the cluster is unset it returns `503 provider_misconfigured`.

Cross-account or local operation: set `ALPHASWARM_ADMIN_PLATFORM_AWS_ASSUME_ROLE_ARN` (and optionally `ALPHASWARM_ADMIN_PLATFORM_AWS_EXTERNAL_ID`) to assume a role into the target account instead of using the ambient task role.

## Redeploy a service

A redeploy starts a new rolling deployment with the same task definition (`forceNewDeployment`), which is how you pick up a freshly pushed image on a moving tag or recover a wedged service. The ECS **deployment circuit breaker** with auto-rollback (configured on the service in Terraform) reverts a deployment that never reaches steady state, so a bad image does not take the service down.

1. Open `/platform`.
2. Press **Redeploy** on the target row.
3. Type the service name to confirm. The action is audit-first and requires step-up MFA — the UI transparently pops the MFA prompt when the server raises the RFC 9470 challenge.
4. Watch the rollout badge move to `COMPLETED` (or `FAILED`, which means the circuit breaker rolled back).

## Scale a service

1. Press **Scale**, set the desired task count, and type the service name to confirm.
2. Scaling to `0` stops the service; scale back up to restore it.

Both redeploy and scale write a `security_audit_events` row before the AWS call and a `succeeded` / `failed` row after.

## Read logs and metrics

- **Logs** resolve the `awslogs` group from the service's task definition, then tail recent events. Pass a CloudWatch Logs filter pattern to narrow the stream.
- **Metrics** read Container Insights series (CPU, memory, running tasks). Enhanced Container Insights must be on for the cluster (the module sets `containerInsights = enhanced`).

## Boundary

This surface is for the platform's **own** infrastructure.
Customer workloads stay on the `Services` page, which brokers to the control plane. All boto3 lives in `alphaswarm_admin.services.platform_deployment` behind the same `require_sdk` lazy import the cloud-onboarding providers use — route handlers never import a cloud SDK.

## See also

- [alphaswarm-admin service](../../concepts/infrastructure/services/alphaswarm-admin.md) — deployment surfaces + identity.
- [Connect a company cloud account](connect-company-cloud-account.md) — the federated-first wizard for customer cloud accounts.
- [Admin Agent Identity](../../concepts/identity/admin-agent-identity.md) — how the admin authenticates outbound to the control plane.

# Operations runbook — Edge deployment

> The simplest edge deployment: one Linux VM running the docker-compose stack with the admin overlay

# Operations runbook — Edge deployment

Deploying AlphaSwarm to edge / on-prem locations where the standard cloud K8s overlays don't fit.

## Reference shapes

### Shape A — single VM with Docker Compose

The simplest edge deployment: one Linux VM running the docker-compose stack with the admin overlay.

```bash
git clone https://github.com/julianwiley/alphaswarm.git
cd alphaswarm

# Generate config + bring up
make generate-config ENV=local
make dev-admin
```

Suitable for: dev labs, single-tenant trials, training environments.

Not suitable for: multi-node fault tolerance, HPA, NetworkPolicy enforcement.

### Shape B — k3s on a single edge box

For sites with a single VM but where you want production-style observability + Pod-level lifecycle:

```bash
curl -sfL https://get.k3s.io | sh -
kubectl apply -k alphaswarm_platform/deployments/kubernetes/overlays/dev
```

k3s ships with Traefik (substitute for the NGINX Ingress) and a built-in service load balancer (Klipper). You can install NGINX Ingress on top if you want to keep the same Ingress manifests as production.

### Shape C — rpi_kubernetes (4-node k3s lab)

The reference home/edge cluster uses **two sibling repos**:

1. **`rpi_kubernetes`** — k3s bootstrap, portal, FinOps policies, storage class.
2. **`alphaswarm`** — every shared service + AlphaSwarm workload under `alphaswarm_platform/deployments/kubernetes/`.

```bash
# In rpi_kubernetes (portal + cluster bootstrap only)
kubectl apply -k kubernetes/

# In alphaswarm (AlphaSwarm shared infra + app overlays)
kubectl apply -k alphaswarm_platform/deployments/kubernetes/overlays/dev
```

Streaming install helpers live under `alphaswarm_platform/scripts/cluster_install/` (`install-flink.sh`, `install-alphavantage.sh`, `build-flink-jobs.sh`). See [streaming.md](../../concepts/data/streaming.md) for the full order.

## Edge-specific concerns

### Image distribution

Edge sites often have slow / metered uplinks. Mirror the AlphaSwarm images into an on-site registry:

```bash
docker pull ghcr.io/julianwiley/alphaswarm-client:latest-stable
docker tag ghcr.io/julianwiley/alphaswarm-client:latest-stable mirror.local:5000/alphaswarm-client:latest-stable
docker push mirror.local:5000/alphaswarm-client:latest-stable
```

Then override the image tags in your overlay:

```yaml
# alphaswarm_platform/deployments/kubernetes/overlays/edge-site-a/kustomization.yaml
images:
  - name: ghcr.io/julianwiley/alphaswarm-client
    newName: mirror.local:5000/alphaswarm-client
    newTag: latest-stable
```

### Auth0 unreachability

Edge sites may have intermittent connectivity to Auth0's JWKS endpoint. The JWT validator caches JWKS for `ALPHASWARM_CP_AUTH_JWKS_TTL_SECONDS` (default 600s); set it higher (e.g. 3600s) so the cache spans typical outage windows.

In hard offline scenarios, set `ALPHASWARM_AUTH_ENFORCE=permissive` so authenticated requests fall through to local-default identity and audit-log the violation. The operator UI shows a yellow banner when this mode is active.

### Storage

Edge sites should NOT rely on the in-cluster Postgres + Redis.
Provision durable storage upstream and point AlphaSwarm at it via the connectivity matrix:

```bash
ALPHASWARM_DATABASE_URL=postgresql://alphaswarm:****@cloud-postgres.example.com:5432/alphaswarm
ALPHASWARM_REDIS_URL=rediss://cloud-redis.example.com:6380
```

### Telemetry

Edge sites should forward telemetry to a central observability collector. Set `ALPHASWARM_OTEL_COLLECTOR_URL` to the gateway endpoint; the control plane streams MetricPoints + AlertEvents to it via OTLP.

## Cutover from compose to k3s

If you started on shape A and want to move to shape B:

1. Take a Postgres dump while the compose stack is still running (the container is removed once the stack is brought down): `docker exec alphaswarm-postgres pg_dump -U alphaswarm alphaswarm > alphaswarm.sql`.
2. `docker compose down` to stop the compose stack.
3. Bring up shape B per the recipe above.
4. Restore, passing `-i` so the dump is forwarded on stdin: `kubectl exec -i -n alphaswarm deploy/alphaswarm-postgres -- psql -U alphaswarm alphaswarm < alphaswarm.sql`.
5. Verify `/manage/health` and `/health` both return 200.

No code changes required — the connectivity matrix abstracts which backend is hosting which service.

# Go-live: minimum deployment

> Ordered first-deployment sequence for the four public surfaces — docs + landing site (Cloudflare Pages), admin UI (ECS Fargate), and the minimum platform tier — with the exact commands and the credentials each step needs.

# Go-live: minimum deployment

> Deep-dive companions: [aws-deploy.md](./aws-deploy.md) (full hybrid
> bootstrap), [aws-runbook.md](./aws-runbook.md) (day-2 pipeline ops),
> [tenant-router auth rollout](../tenant-router-auth-rollout.md)
> (edge enforcement, when the k8s edge goes live).

This is the shortest path to four live surfaces, in dependency order.
Current state (verified 2026-06-09): all application code is merged to `main`, but **no deploy has ever succeeded** — every AWS-touching workflow fails at `configure-aws-credentials` because the one-time bootstrap (OIDC deployer roles + GitHub repo variables) has not been run, and the docs repo had no content-deploy workflow at all (added as `deploy-pages.yml` alongside this page). | Order | Surface | Mechanism | Needs | | --- | --- | --- | --- | | 1 | Docs + marketing/landing | `alphaswarm_docs` `deploy-pages.yml` → Cloudflare Pages | 2 Cloudflare secrets, **no AWS** | | 2 | AWS bootstrap | local `terraform apply` ×1 | AWS admin profile | | 3 | Platform minimum (VPC/ECR/RDS/Redis/ALB/ECS) | local apply of `infrastructure/envs/minimum` then `terraform/environments/minimum` | step 2 | | 4 | Admin UI | `alphaswarm_admin` `build-publish.yml` → ECR → app-tier redeploy | steps 2–3 + repo vars | | 5 | Custom domains / apex (optional) | `deploy-edge.yml` Terraform stacks | steps 2 + Vault CF token | --- ## 1. Docs + landing site (Cloudflare Pages — independent of AWS) The Docusaurus build is one artifact serving the landing page at `/` and the docs tree beneath it. `deploy-pages.yml` builds and ships it to the `alphaswarm-docs` Pages project, creating the project on first run (Terraform later adopts it — see step 5). ```bash # One-time: create a Cloudflare API token scoped # Account > Cloudflare Pages > Edit # at https://dash.cloudflare.com/profile/api-tokens, then: gh secret set CLOUDFLARE_API_TOKEN --repo Alpha-Swarm-ai/alphaswarm_docs gh secret set CLOUDFLARE_ACCOUNT_ID --repo Alpha-Swarm-ai/alphaswarm_docs # Deploy (also auto-runs on every push to main): gh workflow run deploy-pages.yml --repo Alpha-Swarm-ai/alphaswarm_docs --ref main gh run watch --repo Alpha-Swarm-ai/alphaswarm_docs # Live at https://alphaswarm-docs.pages.dev until step 5 attaches # the alpha-swarm.ai apex + www domains. ``` ## 2. 
AWS bootstrap (one-time, local terminal, ~10 min) Mints the GitHub OIDC provider, deployer roles, and the state bucket/lock/KMS that every other stack depends on. Local state by design. ```bash cd alphaswarm_platform/infrastructure/bootstrap export AWS_PROFILE=alphaswarm-shared-platform-admin # your admin profile terraform init terraform apply -var=account_alias=shared # (Repeat with -var=account_alias={dev,qa,prod} only when you adopt the # multi-account split; the minimum tier lives in the single account.) ``` ## 3. Platform minimum tier (single account, ~$140/mo) Infra tier (VPC, ECR ×3, RDS Postgres, Redis, Bedrock policy, the `AqpGithubDeployerMinimum` role, SSM handles), then app tier (Cognito, ALB, ECS Fargate control plane): ```bash cd alphaswarm_platform/infrastructure/envs/minimum cp backend.hcl.example backend.hcl && cp terraform.tfvars.example terraform.tfvars $EDITOR backend.hcl terraform.tfvars # bucket/table from step 2 outputs terraform init -backend-config=backend.hcl terraform apply cd ../../../terraform/environments/minimum terraform init -backend-config=backend.hcl # same pattern terraform apply # first apply: images don't exist yet — services # stabilise after step 4 pushes them ``` Then wire CI so subsequent rollouts never need a laptop — set the repo variables from the role ARNs the applies just published (SSM `/alphaswarm/minimum/*` / stack outputs): ```bash for repo in alphaswarm_platform alphaswarm_admin; do gh variable set AWS_PLAN_ROLE_ARN --repo Alpha-Swarm-ai/$repo --body "arn:aws:iam:::role/" gh variable set AWS_APPLY_ROLE_ARN --repo Alpha-Swarm-ai/$repo --body "arn:aws:iam:::role/AqpGithubDeployerMinimum" gh variable set AWS_BUILD_ROLE_ARN --repo Alpha-Swarm-ai/$repo --body "arn:aws:iam:::role/AqpGithubDeployerMinimum" gh variable set AWS_REGION --repo Alpha-Swarm-ai/$repo --body us-east-1 done gh variable set SHARED_ACCOUNT_ID --repo Alpha-Swarm-ai/alphaswarm_platform --body "" # Plus the prod GitHub Environment (Settings → 
Environments → prod, # required reviewers) — alphaswarm_admin's main-branch build binds to it. ``` ## 4. Admin UI `build-publish.yml` builds the admin backend + frontend images multi-arch, pushes to the ECR repos step 3 created, and fires the `admin-image-published` dispatch so the platform app tier redeploys with the new tags (needs `PLATFORM_DISPATCH_TOKEN` — a fine-grained PAT with `actions:write` on `alphaswarm_platform`): ```bash gh secret set PLATFORM_DISPATCH_TOKEN --repo Alpha-Swarm-ai/alphaswarm_admin gh workflow run build-publish.yml --repo Alpha-Swarm-ai/alphaswarm_admin \ --ref main -f env=minimum gh run watch --repo Alpha-Swarm-ai/alphaswarm_admin # Verify rollout (alarm/log/scale surface is the admin's own ECS panel # once it's up — control-platform-ecs-deployment.md): aws ecs describe-services --cluster "$(aws ssm get-parameter \ --name /alphaswarm/minimum/ecs_cluster_name --query Parameter.Value --output text)" \ --services alphaswarm-admin --query 'services[0].deployments' # Admin UI answers on the ALB DNS name (output of the app-tier apply) # at "/" with the backend health at /admin/health. ``` ## 5. Custom domains / apex public surface (optional, after 1–3) The `docs-edge` + `apex-redirect` + `demo-edge` Terraform stacks attach `alpha-swarm.ai` (+ `www`, `docs.*` alias, Access-gated `/demo`) to the Pages project from step 1. They run through `alphaswarm_platform/.github/workflows/deploy-edge.yml` → CodeBuild → `alphaswarm deploy` (AGENTS rule 42), so they need step 3's roles plus the Cloudflare token in Vault. 
Because step 1 already created the Pages project, import it before the first apply: ```bash # inside the docs-edge stack workspace: terraform import 'module.cloudflare_pages_docs.cloudflare_pages_project.docs' \ '/alphaswarm-docs' gh workflow run deploy-edge.yml --repo Alpha-Swarm-ai/alphaswarm_platform \ --ref main -f stack=docs-edge -f env=prod -f action=plan # then action=apply ``` ## Known footguns - **`build-publish.yml` (platform repo) tags `:latest` unconditionally** — only ever dispatch it from `main` or a version tag. - **The k8s edge ships fail-closed**: stamp `ALPHASWARM_TENANT_ROUTER_OIDC_ISSUER`/`_AUDIENCE` before applying `deployments/kubernetes/edge/` or the tenant-router crash-loops by design ([runbook](../tenant-router-auth-rollout.md)). Not part of the minimum tier (no EKS), listed here because the manifests are on `main`. - **`terraform-pipeline.yml` auto-plans `admin-dev` on every `main` push** — it stays red until step 3's variables exist; that's the expected signal, not a regression. # HFT Node Onboarding > - Bare-metal or near-bare-metal hardware with: - Hardware-timestamping NIC (Intel I210/X710, Mellanox ConnectX-5/6/7). - At least 2 NUMA nodes (most modern dual-socket Xeons / Epycs). - 2 MiB HugePage... # HFT Node Onboarding > How to bring up a new dedicated node for `Frequency.HFT` bots. > Runtime audience: SRE + platform team. ## Pre-requisites - Bare-metal or near-bare-metal hardware with: - Hardware-timestamping NIC (Intel I210/X710, Mellanox ConnectX-5/6/7). - At least 2 NUMA nodes (most modern dual-socket Xeons / Epycs). - 2 MiB HugePages support (kernel default). - SR-IOV-capable NIC. - Linux kernel >= 5.10 with PTP support (`ptp4l` + `phc2sys` from `linuxptp`). - The node is already a member of the cluster and runs the standard kubelet. ## 1. Taint + label the node ```bash kubectl taint nodes quantbot.io/hft=true:NoSchedule kubectl label nodes quantbot.io/hft=true ``` ## 2. 
Apply the kubelet override The kubelet config drop-in lives at `alphaswarm_platform/deployments/kubernetes/hft-nodes/kubelet-config.yaml`. On systemd hosts: ```bash sudo cp kubelet-config.yaml /etc/kubernetes/kubelet/kubelet.conf.d/quantbot-hft.conf sudo systemctl restart kubelet ``` Verify: ```bash kubectl get --raw "/api/v1/nodes//proxy/configz" | jq .kubeletconfig.cpuManagerPolicy # Expect: "static" ``` ## 3. Allocate HugePages ```bash kubectl apply -f alphaswarm_platform/deployments/kubernetes/hft-nodes/hugepages-allocation.yaml # DaemonSet runs once per HFT node and sets nr_hugepages=1024. ``` ## 4. Bring up PTP ```bash kubectl apply -f alphaswarm_platform/deployments/kubernetes/hft-nodes/ptp-config.yaml ``` Verify clock discipline (run inside the `quantbot-ptp` pod): ```bash kubectl exec -n alphaswarm-bots quantbot-ptp- -c phc2sys -- \ pmc -u -b 0 'GET CURRENT_DATA_SET' | grep masterOffset # Expect masterOffset around 0 (sub-microsecond on a healthy network). ``` ## 5. Configure SR-IOV If the SR-IOV Network Operator is installed: ```bash kubectl apply -f alphaswarm_platform/deployments/kubernetes/hft-nodes/sr-iov-config.yaml ``` Verify VFs are exposed: ```bash kubectl get nodes -o json | jq '.status.allocatable | with_entries(select(.key | startswith("openshift.io/quantbot_hft_vf")))' ``` ## 6. Apply the tuned profile ```bash kubectl apply -f alphaswarm_platform/deployments/kubernetes/hft-nodes/node-tuning-operator.yaml ``` ## 7. Validate the node passes the operator's HFT check The QuantBot Operator's validating webhook will refuse to schedule an HFT bot on a node that fails any of: - `quantbot.io/hft` label present - PTP DaemonSet pod running on the node - HugePages allocation >= the bot's request - SR-IOV VF available Run the operator's diagnostics: ```bash alphaswarm-bots validate ``` A passing validation prints `valid: true` and no failure entries. 
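
The four checks listed above can be sketched as a plain function for reasoning about why a node fails validation. This is an illustrative sketch only, not the QuantBot Operator's webhook code: the `NodeState` and `BotRequest` types, their field names, and the failure strings are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class NodeState:
    """Hypothetical snapshot of one node; field names are illustrative."""
    labels: dict[str, str]
    ptp_pod_running: bool
    hugepages_2mi: int      # allocatable 2 MiB HugePages on the node
    sriov_vfs_free: int     # SR-IOV virtual functions still unallocated


@dataclass
class BotRequest:
    """Hypothetical resource request of one HFT bot."""
    hugepages_2mi: int


def validate_hft_node(node: NodeState, bot: BotRequest) -> list[str]:
    """Return one failure string per failed check; an empty list means the node passes."""
    failures = []
    if "quantbot.io/hft" not in node.labels:
        failures.append("label quantbot.io/hft missing")
    if not node.ptp_pod_running:
        failures.append("PTP DaemonSet pod not running on node")
    if node.hugepages_2mi < bot.hugepages_2mi:
        failures.append(
            f"HugePages allocation {node.hugepages_2mi} < request {bot.hugepages_2mi}"
        )
    if node.sriov_vfs_free < 1:
        failures.append("no SR-IOV VF available")
    return failures


node = NodeState({"quantbot.io/hft": "true"}, True, 1024, 2)
print(validate_hft_node(node, BotRequest(hugepages_2mi=512)))
```

In this sketch an empty list plays the role of the `valid: true` result, and each string corresponds to one failure entry.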
## Rollback

To take the node out of the HFT pool:

```bash
kubectl drain --ignore-daemonsets=false --delete-emptydir-data
kubectl taint nodes quantbot.io/hft=true:NoSchedule-
kubectl label nodes quantbot.io/hft-
```

The HFT DaemonSets (ptp, hugepages, sriov) auto-stop on the node.

# Operations runbook — Incident response

> ``` Incident detected | +-----------------+-----------------+ | | Workload error Platform error | | +----------+----------+ +-------------+-------------+ | | | | | | Single Several All deps Auth Netwo...

# Operations runbook — Incident response

Standard playbook for diagnosing + recovering from AlphaSwarm production incidents.

## Triage tree

```
                     Incident detected
                             |
           +-----------------+-----------------+
           |                                   |
    Workload error                      Platform error
           |                                   |
+----------+----------+          +-------------+-------------+
|          |          |          |             |             |
Single     Several    All deps   Auth          Network       Storage
pod        pods       degraded   (Auth0)       / Ingress     (Postgres
crashes    CrashLoop  rate-limited)                          / Redis)
|          |          |          |             |             |
v          v          v          v             v             v
[A: pod    [B: HPA    [C: drain  [D: jwks      [E: ingress   [F: stateful
 logs]      thrash]    the queue] 503]          503]          failover]
```

## Common diagnostic commands

```powershell
# Pod status across both AlphaSwarm namespaces
kubectl get pods -n alphaswarm -o wide
kubectl get pods -n alphaswarm-admin -o wide

# Top resource consumers
kubectl top pods -n alphaswarm --sort-by=cpu
kubectl top pods -n alphaswarm --sort-by=memory

# Recent events (Select-Object replaces the Unix-only `tail`)
kubectl get events -n alphaswarm --sort-by='.lastTimestamp' | Select-Object -Last 30

# Tail logs for the API
kubectl logs -n alphaswarm deploy/alphaswarm-core --tail=200 -f

# Control-plane audit log (rolled to stdout by default; if ALPHASWARM_CP_AUDIT_LOG_PATH
# is set, also written to a file).
kubectl logs -n alphaswarm-admin deploy/alphaswarm-cp --tail=200 -f | findstr workload_run

# Recent terraform_runs (provisioning audit ledger).
kubectl exec -n alphaswarm deploy/alphaswarm-core -- python -m alphaswarm.cli runs list --limit 20
```

## Scenario A — single pod crashes

```powershell
# Identify the crashing pod
kubectl get pods -n alphaswarm -l app=alphaswarm-core | findstr CrashLoop

# Inspect
kubectl describe pod -n alphaswarm
kubectl logs -n alphaswarm --previous

# Rolling restart of the deployment (HPA + PDB keep the service up)
kubectl rollout restart -n alphaswarm deployment/alphaswarm-core
```

## Scenario B — HPA thrashing

The HPA is scaling rapidly up + down, never stabilising.

```powershell
# Check the HPA's recent decisions
kubectl describe hpa -n alphaswarm alphaswarm-core

# Most common cause: a runaway query or backtest that spikes CPU then
# crashes back. Check the audit log for recent task starts.
# findstr treats space-separated terms as alternatives (it has no "\|" operator).
kubectl logs -n alphaswarm deploy/alphaswarm-worker --tail=500 | findstr "started finished FAILED"

# Mitigation: temporarily widen the HPA stabilizationWindow.
kubectl patch hpa -n alphaswarm alphaswarm-core --type='json' -p='[
  {"op":"replace","path":"/spec/behavior/scaleUp/stabilizationWindowSeconds","value":300}
]'
```

## Scenario C — Celery queue depth alarm

```powershell
# Inspect the tasks the workers are currently running
kubectl exec -n alphaswarm deploy/alphaswarm-worker -- celery -A alphaswarm.tasks.celery_app inspect active

# Scale workers up
kubectl scale -n alphaswarm deployment/alphaswarm-worker --replicas=8

# Or via the control plane (lands an audit row)
curl -X PATCH https://manage.alphaswarm.enterprise.com/manage/deployments/alphaswarm-worker/scale?replicas=8 `
  -H "Authorization: Bearer $TOKEN"
```

## Scenario D — Auth0 JWKS returns 503

Symptom: every authenticated request fails with `jwks_unreachable`.
```powershell # Probe JWKS directly from inside a pod kubectl exec -n alphaswarm deploy/alphaswarm-core -- curl -fsS https://your-tenant.us.auth0.com/.well-known/jwks.json # Common causes: # - Auth0 service incident (https://status.auth0.com/) # - Outbound 443 blocked by NetworkPolicy (check network-policies.yaml) # - DNS resolution failure inside the cluster # Mitigation: flip ALPHASWARM_AUTH_ENFORCE to permissive for read-only routes # while you wait for Auth0 to recover. ONLY do this if your operator # UI is firewalled at the Ingress layer. kubectl set env -n alphaswarm deploy/alphaswarm-core ALPHASWARM_AUTH_ENFORCE=permissive ``` ## Scenario E — Ingress returns 503 ```powershell # Check NGINX Ingress controller kubectl -n ingress-nginx get pods kubectl -n ingress-nginx logs deploy/ingress-nginx-controller --tail=200 # Check service endpoints kubectl get endpoints -n alphaswarm alphaswarm-client kubectl get endpoints -n alphaswarm-admin alphaswarm-cp # If endpoints are empty, the pods aren't passing readinessProbe. ``` ## Scenario F — Stateful service failover ### Postgres The compose stack uses a single Postgres pod backed by a PVC. K8s overlays should be migrated to a managed Postgres (Aurora / Cloud SQL / Azure Database for PostgreSQL) before going to prod. For dev/staging: ```powershell kubectl -n alphaswarm delete pod -l app=postgres # restarts; data persists in PVC ``` ### Redis Stack ```powershell kubectl -n alphaswarm delete pod redis-master-0 # StatefulSet brings it back ``` ## Post-incident 1. Open an incident report in `alphaswarm_docs/incidents/-.md`. 2. Capture: timeline, blast radius, root cause, fixes applied, follow-ups. 3. If a hard rule was bypassed (e.g. ALPHASWARM_AUTH_ENFORCE flipped to permissive), schedule the revert as a P1 task. 4. Add a regression test to prevent the same class of incident. 
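
Step 3 of the post-incident list is easy to forget once the pressure is off. A minimal sketch of a check for a lingering bypass, assuming it is fed the parsed output of `kubectl get deploy -n alphaswarm -o json`; the function name is hypothetical and only standard Deployment fields are read.

```python
def permissive_deployments(deploy_list: dict) -> list[str]:
    """Return names of Deployments that still set ALPHASWARM_AUTH_ENFORCE=permissive."""
    flagged = set()
    for item in deploy_list.get("items", []):
        pod_spec = item["spec"]["template"]["spec"]
        for container in pod_spec.get("containers", []):
            for env in container.get("env", []):
                if (
                    env.get("name") == "ALPHASWARM_AUTH_ENFORCE"
                    and env.get("value") == "permissive"
                ):
                    flagged.add(item["metadata"]["name"])
    return sorted(flagged)


# Hand-written sample in the shape of a Deployment list (not real cluster output).
sample = {
    "items": [
        {
            "metadata": {"name": "alphaswarm-core"},
            "spec": {"template": {"spec": {"containers": [
                {"name": "api", "env": [
                    {"name": "ALPHASWARM_AUTH_ENFORCE", "value": "permissive"},
                ]},
            ]}}},
        },
        {
            "metadata": {"name": "alphaswarm-worker"},
            "spec": {"template": {"spec": {"containers": [{"name": "worker"}]}}},
        },
    ]
}
print(permissive_deployments(sample))
```

A non-empty result means the revert from Scenario D is still outstanding.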
# Kill Switch Incident Response > | Scope | What it halts | Typical use | | --- | --- | --- | | `bot` | One Pod (one bot slug) | A single bot is misbehaving | | `fleet` | Every bot in a fleet | A fleet-wide alpha goes stale | | `platf... # Kill Switch Incident Response > Three-scope kill switch (bot / fleet / platform). Quarterly drill > required per blueprint caveat #7. ## Scopes | Scope | What it halts | Typical use | | --- | --- | --- | | `bot` | One Pod (one bot slug) | A single bot is misbehaving | | `fleet` | Every bot in a fleet | A fleet-wide alpha goes stale | | `platform` | Every bot on the platform | Emergency — venue outage, regulatory action | ## Engage ### Via CRD (preferred — leaves audit trail) ```bash kubectl apply -f - < namespace: alphaswarm-bots spec: scope: bot # bot | fleet | platform target: mm-aapl # bot slug / fleet name / "platform" mode: flatten # cancel | flatten | freeze reason: "venue outage; halting until investigation complete" ttl: 1h EOF ``` ### Via the REST kill-switch fan-out (UI button) The operator UI's `KillSwitch` topbar component calls a sequence of halt endpoints in parallel: - `POST /agents/halt` - `POST /quant-agents/halt` - `POST /paper/stop-all` - `POST /bots/halt-all` ← halts every active bot deployment - `POST /rl/halt-all` - `POST /workflows/halt` This is the equivalent of `KillSwitch.scope=platform` from the operator side. Use it when GitOps reconciliation is too slow (the CRD path can take up to `poll_interval_s` seconds; the REST fan-out is instant). ### Via the redundant Redis polling channel (last resort) If the Argo CD reconciler is unhealthy AND the REST API is unreachable: ```bash # Directly set the kill switch key in the bots namespace's Redis. kubectl exec -n alphaswarm-bots redis-master-0 -- \ redis-cli SET 'alphaswarm:bots:killswitch:platform:platform' 'manual-emergency' ``` Each bot polls this key every 5 seconds (configurable via `KillSwitchV2.poll_interval_s`) and halts when set. 
This is the fallback documented in blueprint caveat #7. ## Release ### Via CRD ```bash kubectl delete killswitch emergency- -n alphaswarm-bots ``` ### Via Redis (matching the last-resort engage) ```bash kubectl exec -n alphaswarm-bots redis-master-0 -- \ redis-cli DEL 'alphaswarm:bots:killswitch:platform:platform' ``` ## Verify ```bash # CRD view: kubectl get killswitches -A # Status: kubectl get killswitches -A -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.status.engaged}{"\n"}{end}' # Redis view: kubectl exec -n alphaswarm-bots redis-master-0 -- \ redis-cli --scan --pattern 'alphaswarm:bots:killswitch:*' # Affected bots (operator status): kubectl get bots -A -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.status.killSwitchEngaged}{"\t"}{.status.killSwitchReason}{"\n"}{end}' ``` ## Quarterly drill (caveat #7) The blueprint mandates a **quarterly drill** because the worst time to discover the kill switch is broken is during a real incident. ### Drill protocol 1. Schedule a 15-minute window during low-activity hours. 2. Engage `scope=platform` via the CRD path. 3. Verify every bot in `kubectl get bots -A` transitions to `status.phase=Draining` within 10 seconds. 4. Verify every bot reaches `Stopped` within 30 seconds (HFT) or 300 seconds (everything else). 5. Release the kill switch. 6. Verify bots auto-restart (their Deployments / StatefulSets reconcile). 7. Repeat with the REST fan-out path. 8. Repeat with the Redis fallback path. 9. Record the drill in the next RTS 6 validation report's `kill_switch_drills` evidence section. Failure of any of the three paths is a P1 incident. Fix before the drill window closes. ## Common failure modes - **Operator pod is down.** Symptom: `KillSwitch` CR created but bots don't halt within `poll_interval_s`. Mitigation: the Redis polling fallback bypasses the operator entirely. - **Redis pod is down.** Symptom: neither operator nor polling fallback works. 
Mitigation: at least one of the operator's in-memory CR watcher or the REST API fan-out path will still halt bots; if all three fail simultaneously, escalate to manual `kubectl scale deployment/bot-* --replicas=0`.
- **Redis Pub/Sub vs SET-key drift.** `KillSwitchV2.poll_interval_s` defines the upper bound on the polling fallback's latency; if the pub/sub channel is dropping messages, polling still works after at most one interval.

# Operations runbook — Kubernetes deployment

> - `kubectl` 1.30+ with a current context pointing at the target cluster. - Cluster admin (you'll create namespaces + RBAC). - A container registry the cluster can pull from (Docker Hub / ECR / ACR / G...

# Operations runbook — Kubernetes deployment

End-to-end walkthrough for shipping AlphaSwarm to any Kubernetes cluster (EKS, AKS, GKE, vanilla k3s, or the Raspberry Pi k3s cluster owned by `rpi_kubernetes`).

AlphaSwarm is fully self-contained: every shared service it depends on (Postgres, Redis, Kafka, MinIO, MLflow, observability stack, etc.) ships in `alphaswarm_platform/deployments/kubernetes/`. There is no implicit dependency on `rpi_kubernetes` or any other repository.

## Prerequisites

- `kubectl` 1.30+ with a current context pointing at the target cluster.
- Cluster admin (you'll create namespaces + RBAC).
- A container registry the cluster can pull from (Docker Hub / ECR / ACR / GCR).
- An ingress controller (`ingress-nginx` recommended) and `cert-manager` with a `letsencrypt-prod` `ClusterIssuer` for the AlphaSwarm TLS hosts.
- Auth0 tenant configured per [alphaswarm_docs/architecture/decisions/003-auth0-zero-trust.md](../../architecture/decisions/003-auth0-zero-trust.md) (default tenant `alphaswarm-fund.us.auth0.com`).
- Cluster operators / CRDs installed via [alphaswarm_platform/scripts/cluster_install/](../../scripts/cluster_install/) (Strimzi, Spark Operator, OpenTelemetry Operator, Phoenix, Redpanda, etc.)
- run the relevant installer before applying the AlphaSwarm base kustomization. ## Targeted runbooks - Two-node tower+laptop bootstrap: [tower-cluster-deploy.md](tower-cluster-deploy.md) - Blue/green domain cutover: [alphaswarm-fund-blue-green-cutover.md](alphaswarm-fund-blue-green-cutover.md) ## Step 1 — provision Auth0 (one-time) ```powershell $env:AUTH0_DOMAIN = "your-tenant.us.auth0.com" $env:AUTH0_M2M_CLIENT_ID = "..." $env:AUTH0_M2M_CLIENT_SECRET = "..." $env:ALPHASWARM_SYNC_URL = "https://api.alphaswarm.enterprise.com/_internal/auth0/sync" python alphaswarm_platform/build/scripts/provision_auth0.py --dry-run # preview python alphaswarm_platform/build/scripts/provision_auth0.py # apply ``` This idempotently creates the API resource server, the four roles, and the post-login Action. ## Step 2 — generate the K8s ConfigMap + Secret scaffold ```powershell make generate-config ENV=k8s ``` Produces: - `alphaswarm_platform/deployments/kubernetes/base/configmaps/alphaswarm-config.yaml` (commit this) - `alphaswarm_platform/deployments/kubernetes/base/secrets/alphaswarm-secrets.yaml.template` (DO NOT commit values — CI/CD or external-secrets-operator patches real values) ## Step 3 — build + push images ```powershell $env:IMAGE_TAG = "rc-$(git rev-parse --short HEAD)-$(Get-Date -Format yyyy-MM-dd)" make build-client IMAGE_TAG=$env:IMAGE_TAG make build-cp IMAGE_TAG=$env:IMAGE_TAG # Optional (only if the Dockerfiles exist in alphaswarm_platform/build/docker/*) make build-worker IMAGE_TAG=$env:IMAGE_TAG make build-ingestion IMAGE_TAG=$env:IMAGE_TAG docker login docker push docker.io/julianwiley/alphaswarm-client:$env:IMAGE_TAG docker push docker.io/julianwiley/alphaswarm-controller:$env:IMAGE_TAG docker push docker.io/julianwiley/alphaswarm-worker:$env:IMAGE_TAG docker push docker.io/julianwiley/alphaswarm-ingestion:$env:IMAGE_TAG ``` If `make build-worker` or `make build-ingestion` reports a missing Dockerfile, pin those image tags to known-good prebuilt registry tags in 
the target overlay before applying. ## Step 3b — one-shot Alembic migration (cluster) After `alphaswarm-api` is pullable on the cluster, run: ```powershell kubectl apply -f alphaswarm_platform/deployments/kubernetes/base/jobs/alembic-upgrade.yaml kubectl -n alphaswarm wait --for=condition=complete job/alphaswarm-alembic-upgrade --timeout=900s kubectl -n alphaswarm logs job/alphaswarm-alembic-upgrade ``` The Job uses the same `alphaswarm-config` / `alphaswarm-secrets` env as `alphaswarm-core` and targets `postgresql.alphaswarm-data-services.svc.cluster.local` (the AlphaSwarm-owned Postgres in the `alphaswarm-data-services` namespace). Re-apply only when you need a fresh `upgrade head` (delete the previous Job first: `kubectl -n alphaswarm delete job alphaswarm-alembic-upgrade`). `alembic/env.py` widens `alembic_version.version_num` to `VARCHAR(128)` automatically before migrations run (revision slugs longer than 32 characters otherwise fail at `0039_extended_instrument_taxonomy`). ### Brownfield Postgres (pre-Alembic or partial schema) If `alembic upgrade head` fails with `DuplicateTable` / `DuplicateColumn`, the database was created outside Alembic tracking. From a workstation with the API image and a port-forward to cluster Postgres: ```powershell kubectl -n alphaswarm-data-services port-forward svc/postgresql 15432:5432 $env:ALPHASWARM_POSTGRES_DSN = "postgresql+psycopg2://alphaswarm:alphaswarm@host.docker.internal:15432/alphaswarm" # Optional: stamp to the highest revision whose objects already exist, then upgrade. # $env:ALPHASWARM_ALEMBIC_STAMP_REVISION = "0015_dbt_foundation" bash scripts/cluster_alembic_upgrade.sh ``` Use `ALPHASWARM_POSTGRES_DSN` (maps to `settings.postgres_dsn`) — not a raw `DATABASE_URL` alias. Migration `0040_normalized_identifiers_backfill` can take several minutes on large `instruments` tables. 
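
The `VARCHAR(128)` widening described in Step 3b is needed because Alembic creates `alembic_version.version_num` as `VARCHAR(32)` by default. A small sketch that flags revision identifiers too long for the default column; the function name is hypothetical and the sample list contains only revisions named on this page.

```python
ALEMBIC_DEFAULT_WIDTH = 32  # default width of alembic_version.version_num


def oversized_revisions(revisions, width=ALEMBIC_DEFAULT_WIDTH):
    """Return the revision identifiers that do not fit in a VARCHAR(width) column."""
    return [rev for rev in revisions if len(rev) > width]


revisions = [
    "0015_dbt_foundation",
    "0039_extended_instrument_taxonomy",
    "0040_normalized_identifiers_backfill",
    "0045_pgvector_foundation",
]
for rev in oversized_revisions(revisions):
    print(f"{rev}: {len(rev)} characters")
```

`0039_extended_instrument_taxonomy` is 33 characters long, one more than the default column holds, which matches the failure point noted in Step 3b.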
### Postgres prerequisites (`alphaswarm-data-services`)

Migration `0045_pgvector_foundation` requires the `vector` extension in the **`alphaswarm`** database. On existing clusters (init script applied before the `alphaswarm` DB was added), run once as the Postgres superuser:

```powershell
# PowerShell continues a line with a backtick, not a backslash
kubectl -n alphaswarm-data-services exec deploy/postgresql -- `
  psql -U postgres -d alphaswarm -c "CREATE EXTENSION IF NOT EXISTS vector;"
```

Fresh installs use the AlphaSwarm-owned `alphaswarm_platform/deployments/kubernetes/base-services/postgres-shared/` manifests, whose init SQL creates the `alphaswarm` role/database and enables `vector` there.

## Step 4 — pin the image tag in the target overlay

Edit `alphaswarm_platform/deployments/kubernetes/overlays//kustomization.yaml`:

```yaml
images:
  - name: docker.io/julianwiley/alphaswarm-client
    newTag: rc-abcdef01-2026-05-19
  ...
```

### Docker Hub pull secret (private repos)

Deployments reference `dockerhub-pull-secret`. Create it in both workload namespaces before rollout:

```powershell
$env:DOCKERHUB_USER = ""
$env:DOCKERHUB_TOKEN = ""   # hub.docker.com → Account Settings → Security

kubectl create secret docker-registry dockerhub-pull-secret `
  --docker-server=https://index.docker.io/v1/ `
  --docker-username=$env:DOCKERHUB_USER `
  --docker-password=$env:DOCKERHUB_TOKEN `
  -n alphaswarm --dry-run=client -o yaml | kubectl apply -f -

kubectl create secret docker-registry dockerhub-pull-secret `
  --docker-server=https://index.docker.io/v1/ `
  --docker-username=$env:DOCKERHUB_USER `
  --docker-password=$env:DOCKERHUB_TOKEN `
  -n alphaswarm-admin --dry-run=client -o yaml | kubectl apply -f -
```

Public repositories can omit the secret by removing `imagePullSecrets` from the deployment manifests.
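
Where the pull secret is rendered by a pipeline rather than typed by hand, the same object can be built programmatically. A minimal sketch, assuming the standard `kubernetes.io/dockerconfigjson` Secret layout; the function name and the example credentials are hypothetical.

```python
import base64
import json

REGISTRY = "https://index.docker.io/v1/"
NAMESPACES = ("alphaswarm", "alphaswarm-admin")


def pull_secret_manifest(namespace: str, user: str, token: str) -> dict:
    """Build a dockerconfigjson Secret equivalent to `kubectl create secret docker-registry`."""
    auth = base64.b64encode(f"{user}:{token}".encode()).decode()
    docker_config = {
        "auths": {REGISTRY: {"username": user, "password": token, "auth": auth}}
    }
    encoded = base64.b64encode(json.dumps(docker_config).encode()).decode()
    return {
        "apiVersion": "v1",
        "kind": "Secret",
        "metadata": {"name": "dockerhub-pull-secret", "namespace": namespace},
        "type": "kubernetes.io/dockerconfigjson",
        "data": {".dockerconfigjson": encoded},
    }


# One manifest per workload namespace; credentials here are placeholders.
manifests = [
    pull_secret_manifest(ns, "example-user", "example-token") for ns in NAMESPACES
]
print(json.dumps(manifests[0], indent=2))
```

`kubectl apply -f` accepts JSON as well as YAML. The rendered manifest carries the token base64-encoded, not encrypted, so it should be treated like the secret itself and kept out of version control.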
## Step 5 — apply ```powershell # Dry-run first kubectl apply -k alphaswarm_platform/deployments/kubernetes/overlays/tower-dev --dry-run=server # Apply kubectl apply -k alphaswarm_platform/deployments/kubernetes/overlays/tower-dev # Verify kubectl -n alphaswarm get pods,svc,hpa,pdb kubectl -n alphaswarm-admin get pods,svc ``` ## Step 6 — populate the Secret If you're not using external-secrets-operator, populate the placeholder Secret manually: ```powershell kubectl -n alphaswarm create secret generic alphaswarm-secrets ` --from-literal=ALPHASWARM_DATABASE_PASSWORD="" ` --from-literal=ALPHASWARM_AUTH_M2M_CLIENT_SECRET="" ` --from-literal=ALPHASWARM_SESSION_COOKIE_SECRET="" ` --dry-run=client -o yaml | kubectl apply -f - ``` For external-secrets-operator users, point an `ExternalSecret` at your secret store (Vault / SSM / Key Vault / Secret Manager) and let the operator create the K8s `Secret`. ## Step 7 — DNS + TLS The Ingresses expect: - `alpha-swarm.ai` -> `alphaswarm-client` Service in the `alphaswarm` namespace - `api.alpha-swarm.ai` -> `alphaswarm-core` Service in the `alphaswarm` namespace - `manage.alpha-swarm.ai` -> `alphaswarm-cp` Service in the `alphaswarm-admin` namespace Point DNS at the NGINX Ingress controller's LoadBalancer IP. cert-manager handles TLS via the `letsencrypt-prod` ClusterIssuer (configure separately). ## Step 8 — smoke test ```powershell # Client should serve the SPA shell curl -fsS https://alpha-swarm.ai/ | findstr " # Operations runbook — Local setup > | Tool | Min version | Used for | | --- | --- | --- | | Python | 3.11 | AlphaSwarm runtime + the new `alphaswarm_core` + `alphaswarm_controller` packages | | Node.js | 20 | Vite + legacy webui builds | | pnpm ... # Operations runbook — Local setup This walks a brand-new developer from `git clone` to a running local AlphaSwarm stack. 
## Prerequisites

| Tool | Min version | Used for |
| --- | --- | --- |
| Python | 3.11 | AlphaSwarm runtime + the new `alphaswarm_core` + `alphaswarm_controller` packages |
| Node.js | 20 | Vite + legacy webui builds |
| pnpm | 9 | Frontend dep management (`corepack enable && corepack prepare pnpm@9.15.9 --activate`) |
| Docker | 25+ | Local compose stack + image builds |
| docker buildx | 0.13+ | Multi-arch image builds |
| Terraform | 1.10+ | Provisioning-only (rule 42) |
| k3d | 5.7+ | Local k3s cluster (for the Terraform-driven path) |
| kubectl | 1.30+ | Workload introspection |

## Step 1 — clone + install editable

```powershell
git clone https://github.com/julianwiley/alphaswarm.git
cd alphaswarm
python -m pip install -e .
python -m pip install -e ./alphaswarm_core[dev]
python -m pip install -e ./alphaswarm_controller[dev,all-providers]
pnpm --dir alphaswarm_client install
```

## Step 2 — generate `.env.local`

```powershell
make generate-config ENV=local
```

This reads [`alphaswarm_platform/deployments/compose/.env.schema`](../../alphaswarm_platform/deployments/compose/.env.schema) and writes `alphaswarm_platform/deployments/compose/.env.local`. Open the file and fill in the `` placeholders for any service you plan to use.
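
Before starting the stack it can help to list the keys that are still unfilled. A minimal sketch, assuming `.env.local` uses plain `KEY=value` lines and that an unfilled entry shows up as an empty value; the function name and that placeholder convention are assumptions, so adapt the emptiness test to whatever the generator actually writes.

```python
def unfilled_keys(env_text: str) -> list[str]:
    """Return keys whose value is empty (assumed marker for an unfilled placeholder)."""
    missing = []
    for raw in env_text.splitlines():
        line = raw.strip()
        # Skip blanks, comments, and anything that is not KEY=value.
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        if value.strip().strip("\"'") == "":
            missing.append(key.strip())
    return missing


# Hand-written sample; a real run would read the generated file instead, e.g.
# pathlib.Path("alphaswarm_platform/deployments/compose/.env.local").read_text()
sample = """\
# local overrides
ALPHASWARM_DATABASE_URL=postgresql://alphaswarm@localhost:5432/alphaswarm
ALPHASWARM_AUTH_OIDC_ISSUER=
ALPHASWARM_REDIS_URL=""
"""
print(unfilled_keys(sample))
```

Only the keys for services you plan to use need a value, so an entry in the result is a prompt to check, not necessarily an error.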
## Step 3 — bring up the stack (two options) ### Option A — Docker Compose (new path, Phase 3 refactor) ```powershell make dev ``` This brings up: - `alphaswarm-postgres` (pgvector) - `redis-stack` - `alphaswarm-core` (FastAPI) - `alphaswarm-worker` (Celery) - `alphaswarm-client` (unified gateway, port 3000) Once everything is `Up (healthy)`: - Operator UI: [http://localhost:3000](http://localhost:3000) - Legacy Solara UI: [http://localhost:3000/legacy](http://localhost:3000/legacy) - OpenAPI: [http://localhost:3000/api/docs](http://localhost:3000/api/docs) ### Option B — Terraform + k3d (canonical, hard rule 42) ```powershell alphaswarm-cli deploy build # build + push images to the local registry alphaswarm-cli deploy up # terraform apply -> k3d cluster + workloads alphaswarm-cli deploy status # pod + service rollup alphaswarm-cli deploy logs api # tail alphaswarm-api logs ``` `alphaswarm-cli deploy *` is the existing path that lands every state mutation in `terraform_runs`. The Docker Compose path is friendlier for fast iteration but doesn't update the ledger. ## Step 4 — bring up the admin overlay (optional) The `alphaswarm_controller` micro-project runs on a separate Docker network (`alphaswarm-admin`) so it's isolated from the workloads it manages. ```powershell make dev-admin ``` After that, `curl http://localhost:9000/manage/health` should return `{"status": "ok", ...}`. ## Step 5 — verify ```powershell make test # all tests make test-platform-core # alphaswarm_core only make test-providers # alphaswarm_controller provider contract tests ``` ## Troubleshooting | Symptom | Fix | | --- | --- | | `make generate-config ENV=local` errors with `missing required fields` | The schema parser caught a malformed block in `.env.schema`. Open the file, look for the entry above the error line, ensure every block has `key:` / `description:` / `required:` / `targets:` / `classification:`. 
| | `docker compose up` fails with `port already in use` | The Vite dev server publishes 3001 by default; the compose stack publishes 3000. Stop whichever is running first or override via `docker-compose.override.yml`. | | `pnpm --dir alphaswarm_client build` runs out of memory | `NODE_OPTIONS=--max-old-space-size=4096 pnpm --dir alphaswarm_client build`. | | `alphaswarm-cli deploy up` fails with `terraform binary not found` | `choco install terraform` (Windows) or set `ALPHASWARM_TERRAFORM_BINARY=/path/to/terraform`. | | `alphaswarm_controller` shows `auth_disabled=true` in `/manage/health` | Set `ALPHASWARM_AUTH_OIDC_ISSUER=https://your-tenant.us.auth0.com/` in `.env.local`, restart `alphaswarm-cp`. | # Operations runbook — Secret rotation > Every entry in [`alphaswarm_platform/deployments/compose/.env.schema`](../../alphaswarm_platform/deployments/compose/.env.schema) with `classification: secret` or `classification: rotation-required`. The rotation-r... # Operations runbook — Secret rotation Zero-downtime credential rotation for the AlphaSwarm control plane + workloads. ## What's a secret here? Every entry in [`alphaswarm_platform/deployments/compose/.env.schema`](../../alphaswarm_platform/deployments/compose/.env.schema) with `classification: secret` or `classification: rotation-required`. The rotation-required ones (e.g. `AUTH0_M2M_CLIENT_SECRET`) should be rotated on a fixed schedule (typically 90 days). ## Pre-flight 1. Confirm at least one operator with `admin:cluster` is online to handle Auth0 console operations. 2. Check `kubectl -n alphaswarm rollout status deployment/alphaswarm-core` — if it's currently degraded, fix that first. 3. Verify the secret store you're rotating into is reachable (`Vault sealed?`, `SSM blocked by SCP?`, etc.). ## Procedure — Auth0 M2M client secret 1. **Mint a new secret in Auth0:** Applications → `alphaswarm-m2m` → Settings → "Rotate" (Auth0 keeps the old one valid for 24h by default). 2. 
**Update the secret store:** ```powershell # Vault example vault kv patch secret/alphaswarm/auth0 m2m_client_secret= ``` 3. **Reload the relevant pods:** ```powershell kubectl -n alphaswarm rollout restart deployment/alphaswarm-core kubectl -n alphaswarm rollout restart deployment/alphaswarm-worker kubectl -n alphaswarm-admin rollout restart deployment/alphaswarm-cp ``` 4. **Verify:** `curl https://manage.alphaswarm.enterprise.com/manage/health` returns 200; the audit log shows successful M2M token mints. 5. **Revoke the old secret:** Auth0 → `alphaswarm-m2m` → "Revoke previous secret" once you're confident every pod has rolled. ## Procedure — Postgres password 1. **Connect via the old credential** and create the new password: ```sql ALTER USER alphaswarm WITH PASSWORD 'new-strong-password'; ``` 2. **Update the secret store** (same as above). 3. **Rolling restart** of every service that talks to Postgres: `alphaswarm-core`, `alphaswarm-worker`, `alphaswarm-cp`. Each pod re-reads the env var on startup. 4. **No old-credential revocation needed** — Postgres only honours the current password. ## Procedure — Session cookie secret The session cookie secret (`ALPHASWARM_SESSION_COOKIE_SECRET`) is used to encrypt the `alphaswarm_session` cookie. Rotating it invalidates every active session. 1. Generate: `python -c "import secrets; print(secrets.token_urlsafe(64))"` 2. Update the secret store. 3. Rolling restart of `alphaswarm-core` only. Users will be redirected to Auth0 to re-authenticate. ## Procedure — SCIM bearer token The SCIM bearer token hash gates the `/scim/v2/*` endpoint. 1. Generate a new random token: `python -c "import secrets; print(secrets.token_urlsafe(48))"` 2. Compute its SHA-256 hash: `python -c "import hashlib, sys; print(hashlib.sha256(sys.argv[1].encode()).hexdigest())" ` 3. Update `ALPHASWARM_AUTH_SCIM_BEARER_TOKEN_HASH` in the secret store with the hash. 4. Update the IdP's SCIM provisioning configuration with the new RAW token. 5. 
Rolling restart of `alphaswarm-core`. ## Auditing Every secret rotation should leave a trace: - The secret store records the change (Vault audit log / CloudTrail / Cloud Audit Logs) - The application's audit log shows the M2M / Postgres / cookie events - `alphaswarm_controller`'s `WorkloadRun` ledger records the rotation request If you can't see all three, file an incident — silent rotations are a security smell. # RTS 6 / SEC 15c3-5 Annual Validation Report > - **MiFID II RTS 6, Article 9** — "An investment firm shall annually perform a self-assessment and validation process and on the basis of that process issue a validation report... The risk management ... # RTS 6 / SEC 15c3-5 Annual Validation Report > Mechanical generation + sign-off workflow. > Audience: Risk Management, Internal Audit, CEO, compliance counsel. ## Regulatory anchors - **MiFID II RTS 6, Article 9** — "An investment firm shall annually perform a self-assessment and validation process and on the basis of that process issue a validation report... The risk management function shall draft the report; internal audit shall audit the report." - **SEC Rule 15c3-5(e)** — "The broker or dealer shall regularly review the effectiveness of the risk management controls and supervisory procedures... The CEO shall certify annually that the firm's risk management controls and supervisory procedures comply with paragraphs (b) and (c) of this section." ## Generate the artifact ```bash # CLI (single bot — usually just for testing the generator): alphaswarm-bots conformance alphaswarm-bots stress # REST (fleet-wide): curl -X POST https://api.alphaswarm.io/bots//conformance curl -X POST https://api.alphaswarm.io/bots//stress curl -X GET https://api.alphaswarm.io/bots//risk/validation-report > validation-report.yaml ``` The artifact is a YAML document with three top-level sections: 1. **MiFID II RTS 6** — Article 6 / 9 / 10 / 12 / 15 / 16 / 17 results. 2. **SEC Rule 15c3-5** — (c)(1)(i)/(ii), (d), (e) results. 3. 
**Evidence** — embedded `bot_inventory`, `conformance_results`, `stress_results`, `kill_switch_drills`. ## Required attestations The generator leaves three slots empty: | Slot | Required by | Filled by | | --- | --- | --- | | `attestations.risk_management_function` | RTS 6 Art. 9(2) | Head of Risk | | `attestations.internal_audit` | RTS 6 Art. 9(3) | Head of Internal Audit | | `attestations.ceo_certification` | SEC 15c3-5(e) | CEO | Sign-off is **operational, not mechanical**. The generator emits unsigned YAML; the firm's compliance workflow fills in the slots and adds digital signatures (e.g. via a Yubikey-backed signing pipeline). ## Cadence - **Annual:** by 31 March each year, covering the previous calendar year. - **Ad-hoc:** after any material control change (new policy, threshold retune, new venue, new asset class), generate a fresh artifact and re-circulate for sign-off. - **Quarterly drill:** the kill-switch incident response runbook (separate doc) exercises the three-scope kill switch quarterly; that drill's evidence is included in the next annual report. ## Storage - Signed YAML artifacts: `s3://alphaswarm-compliance/validation-reports//` with object lock + WORM retention >= 7 years. - The audit trail (who generated the artifact, when, with which inputs) is in `bot_events` (event_type=`validation_report.generated`). ## Caveat This workflow is an **engineering crosswalk**, NOT legal advice. The specific scope of "algorithmic trading" / "market access" for any given firm is a legal determination that compliance counsel must make. The generator only mechanizes the controls that are already implemented in code. # Tower Two-Node Cluster Deploy > - In scope: AlphaSwarm stack bootstrap (`tower-dev`), QuestDB, control-plane wiring. 
- Out of scope: `julianwiley-portal` migration (deferred; owned by `rpi_kubernetes`) # Tower Two-Node Cluster Deploy Deploy AlphaSwarm to the dedicated two-node cluster (`alphaswarm-tower` control plane + `alphaswarm-laptop` WSL2 agent) before any portal migration work. ## Scope - In scope: AlphaSwarm stack bootstrap (`tower-dev`), QuestDB, control-plane wiring. - Out of scope: `julianwiley-portal` migration (deferred; owned by `rpi_kubernetes`). ## Prerequisites - Two-node cluster already online and `kubectl get nodes` shows both `Ready`. - Context points to the tower cluster. - Auth0 tenant + client values set in `alphaswarm_platform/deployments/kubernetes/base/configmaps`. - Secrets rendered for: - `alphaswarm-secrets` (`alphaswarm` namespace) - `alphaswarm-admin-secrets` (`alphaswarm-admin` namespace) ## 1) Install cluster dependencies ```bash bash alphaswarm_platform/scripts/cluster_install/install-redpanda.sh bash alphaswarm_platform/scripts/cluster_install/install-questdb.sh bash alphaswarm_platform/scripts/cluster_install/install-redpanda-connect.sh ``` Optional (if your target slice needs them): OpenTelemetry, kube-prometheus-stack, Phoenix, Spark Operator. ## 2) Apply the thin tower overlay ```bash kubectl apply -k alphaswarm_platform/deployments/kubernetes/overlays/tower-dev/ ``` This slice includes: - core workloads (`alphaswarm-core`, `alphaswarm-worker`, `alphaswarm-client`, `alphaswarm-cp`) - `redis-master` - `postgres-shared` - `questdb` (dev-sized PVC, relaxed scheduling) ## 3) Verify ```bash bash scripts/verify_tower_cluster.sh ``` ## 4) Terraform target wiring (optional but recommended) ```bash # Preview python -m alphaswarm.cli.main deploy --target tower --action plan # Apply python -m alphaswarm.cli.main deploy --target tower --action apply ``` ## Rollback ```bash kubectl delete -k alphaswarm_platform/deployments/kubernetes/overlays/tower-dev/ ``` Then restore the previous known-good overlay or Terraform state. 
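
The `Ready` prerequisite above is worth re-checking mechanically, both before the deploy and after a rollback. The sketch below is a minimal, hypothetical helper (`node_readiness` is not part of the AlphaSwarm tooling) that reads the JSON emitted by `kubectl get nodes -o json` and reports each node's `Ready` condition. The sample payload is hand-written for illustration and only borrows the two node names from this runbook.

```python
import json
from typing import Dict


def node_readiness(nodes_json: str) -> Dict[str, bool]:
    """Map node name -> True when its `Ready` condition has status "True".

    `nodes_json` is expected to be the raw output of
    `kubectl get nodes -o json` (a Kubernetes NodeList).
    """
    readiness: Dict[str, bool] = {}
    for item in json.loads(nodes_json).get("items", []):
        name = item["metadata"]["name"]
        conditions = item.get("status", {}).get("conditions", [])
        # A node counts as ready only if a Ready condition exists and is "True";
        # "False" and "Unknown" are both treated as not ready.
        readiness[name] = any(
            c.get("type") == "Ready" and c.get("status") == "True"
            for c in conditions
        )
    return readiness


# Hand-written sample in the shape of a NodeList (illustrative only).
SAMPLE = json.dumps({
    "items": [
        {"metadata": {"name": "alphaswarm-tower"},
         "status": {"conditions": [{"type": "Ready", "status": "True"}]}},
        {"metadata": {"name": "alphaswarm-laptop"},
         "status": {"conditions": [{"type": "Ready", "status": "Unknown"}]}},
    ]
})

if __name__ == "__main__":
    for node, ready in node_readiness(SAMPLE).items():
        print(f"{node}: {'Ready' if ready else 'NotReady'}")
```

In practice the input string would come from running `kubectl get nodes -o json` (for example via `subprocess.run(..., capture_output=True, text=True)`), and a deploy script could refuse to continue unless every value is `True`.
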
# Per-tenant MCP rollout # Per-tenant MCP rollout runbook > Phase 5 §8 of [RESTRUCTURING_PLAN.md](https://github.com/julianwiley/alphaswarm/blob/main/RESTRUCTURING_PLAN.md). Walks the cluster operator through deploying per-tenant MCP servers, the gVisor agent-sandbox pool, and the Cell-Bound-Authorization gate at `alphaswarm-edge`. ## Scope 1. **gVisor RuntimeClass** — install via the DaemonSet at `alphaswarm_platform/deployments/kubernetes/agent-sandbox/gvisor/`. 2. **alphaswarm-agent-sandbox-pool** — the gVisor-isolated Deployment at `alphaswarm_platform/deployments/kubernetes/agent-sandbox/pool/`. 3. **Per-tenant MCP servers** — Helm-rendered Deployments from `alphaswarm_platform/deployments/helm/alphaswarm-mcp-tenant/` for each `shared-prem` / `silo-reg` tenant. 4. **Cell-Bound-Authorization** — the second ext_authz step in `alphaswarm_platform/build/docker/alphaswarm-edge/envoy.template.yaml`. 5. **MCP tool catalog versioning** — Alembic 0084 creates the `mcp_tool_versions` table + adds `agent_runs_v2.mcp_tool_descriptor_hashes`. ## Prerequisites 1. Phase 3 cells are registered and at least one is in `state=active`. 2. Phase 4 SPIRE control plane is healthy in the cell. Verify: ```bash kubectl -n spire-system rollout status statefulset/spire-server kubectl -n spire-system get pods -l app=spire-agent ``` 3. The Alembic head is at `0084_mcp_tool_versioning`. Verify: ```bash alembic current # expected: 0084_mcp_tool_versioning (head) ``` 4. Phase 2 Kyverno policies are loaded. 
Verify: ```bash kubectl get clusterpolicy alphaswarm-require-gvisor-for-agent-sandbox ``` ## Step 0 — Install gVisor ```bash kubectl apply -k alphaswarm_platform/deployments/kubernetes/agent-sandbox/gvisor/ kubectl -n gvisor rollout status daemonset/gvisor-installer --timeout=10m # Wait for the node labels to appear (the installer marks each node # `alphaswarm.io/gvisor=installed` after patching containerd): kubectl get nodes -L alphaswarm.io/gvisor # Expected: every node ends with `installed`. ``` ## Step 1 — Deploy the agent-sandbox pool ```bash kubectl apply -k alphaswarm_platform/deployments/kubernetes/agent-sandbox/pool/ kubectl -n alphaswarm-agent-sandbox rollout status deployment/alphaswarm-agent-sandbox-pool --timeout=5m # Confirm gVisor is active inside the pod (the kernel reports as `runsc`): POD=$(kubectl -n alphaswarm-agent-sandbox get pods -l app=alphaswarm-agent-sandbox-pool -o name | head -1) kubectl -n alphaswarm-agent-sandbox exec "$POD" -- /bin/sh -c "uname -r; cat /proc/version" # Expected: kernel version reports as runsc/gVisor. # Confirm the Kyverno gate is enforced — try to deploy a Pod with the # `alphaswarm.io/sandbox-required` label but WITHOUT runtimeClassName:gvisor: cat <' ORDER BY started_at DESC LIMIT 5; ``` The hash array MUST be a subset of `mcp_tool_versions.descriptor_hash` at the matching cell_id. The Phase 7 §10.2 replay harness will verify this invariant. ## Step 5 — Validate Cell-Bound-Authorization Cross-cell MCP calls now require the `Cell-Bound-Authorization` header. Without it, `alphaswarm-edge` returns 403 at the second ext_authz step. ```bash # From outside the cluster, simulate a cross-cell call missing CBA: curl -sS -XPOST https://manage.alpha-swarm.ai/mcp/data/cell-silo-reg-acme/some.tool \ -H 'authorization: Bearer ' \ -d '{"args": {}}' # Expected: 403 with `cell_bound_invalid` in the body. 
# With a valid CBA (minted by the source-cell tenant-router): curl -sS -XPOST https://manage.alpha-swarm.ai/mcp/data/cell-silo-reg-acme/some.tool \ -H 'authorization: Bearer ' \ -H 'Cell-Bound-Authorization: ' \ -d '{"args": {}}' # Expected: tool result. ``` The CBA validator service is a Phase 5.5 deliverable; today the ext_authz config points at the planned service address but the service itself ships in the follow-up PR. Until then, the `failure_mode_allow: false` flag means cross-cell calls without a CBA fail closed (the validator returns 503 because it doesn't exist yet) — which is the intended behaviour for the security posture. ## Rollback Each component is independently revertable: ```bash # Per-tenant MCP — uninstall the Helm release: helm uninstall acme-mcp -n cell-silo-reg-acme # Agent sandbox pool — scale to zero: kubectl -n alphaswarm-agent-sandbox scale deployment alphaswarm-agent-sandbox-pool --replicas=0 # gVisor — DO NOT DROP the installer DaemonSet without first # removing every Pod with `runtimeClassName: gvisor`, otherwise # the pods will sit in RunPodSandboxFailed forever. # Cell-Bound-Authorization — flip ext_authz failure_mode_allow to true # in the envoy ConfigMap then `kubectl rollout restart -n alphaswarm-edge # deployment/alphaswarm-edge`. Cross-cell calls then bypass the CBA gate. ``` ## Phase 5.5 follow-ups 1. **alphaswarm-cell-bound-validator service** — the small HTTP service the ext_authz step points at. Phase 5 ships the Envoy config; the actual service implementation is a thin Starlette app that wraps `alphaswarm.auth.cell_bound.verify(...)`. 2. **shared-std MCP pool chart** — the `shared-std` tier uses one pool per cell with per-tenant Linux cgroups (cgroups v2 + Pod Security Standards `restricted`). The Helm chart for the pool is a Phase 5.5 deliverable; the per-tenant chart in this PR targets `shared-prem` and `silo-reg`. 3. 
**Biscuit + TokenExchangeBroker wire-up in AgentRuntime** — the helpers in `alphaswarm/auth/biscuit.py` are standalone today; the `AgentRuntime` integration that mints + attenuates the biscuit per call is Phase 5.5. 4. **MCP tool versioning replay** — `mcp_tool_descriptor_hashes` recording works in Phase 5; the replay harness that verifies the recorded set matches the live catalog is Phase 7 §10.2. ## Related documents - [RESTRUCTURING_PLAN.md §8](https://github.com/julianwiley/alphaswarm/blob/main/RESTRUCTURING_PLAN.md) - [alphaswarm_docs/docs/concepts/identity/biscuit-capabilities.md](../concepts/identity/biscuit-capabilities.md) - [alphaswarm_docs/docs/how-to/linkerd-spire-rollout.md](linkerd-spire-rollout.md) - [alphaswarm_docs/docs/concepts/identity/spiffe-workload-identity.md](../concepts/identity/spiffe-workload-identity.md) # Recipe: add a strategy > The minimum-viable steps to register a new strategy class against the AlphaSwarm registry. # Recipe: add a strategy The 5-minute happy path: 1. Subclass `IStrategy` (or `FrameworkAlgorithm`) under [alphaswarm/strategies/](https://github.com/julianwileymac/alphaswarm/tree/main/alphaswarm/strategies). 2. Decorate with `@register("MyName", kind="alpha")` from [alphaswarm/core/registry.py](https://github.com/julianwileymac/alphaswarm/blob/main/alphaswarm/core/registry.py). 3. Ship a YAML at `configs/strategies/.yaml` using the `class` / `module_path` / `kwargs` factory pattern. 4. Smoke-test: ```powershell docker exec alphaswarm-api python -m alphaswarm.cli.cli backtest ` --config configs/strategies/.yaml ` --start 2024-01-01 --end 2024-06-30 ``` If the smoke run lands a `backtest_runs` row with a non-NULL `sharpe`, you are done. ## Pitfalls - **Forgetting `@register`.** The YAML loader raises nothing at load time; the run then errors out with `StrategyRegistryMissError`. - **Putting strategy logic in a route or task.** Don't. Routes are thin wrappers around Celery tasks; Celery tasks are thin wrappers around pure functions under `alphaswarm/strategies/`. 
See [AGENTS Don'ts](https://github.com/julianwileymac/alphaswarm/blob/main/AGENTS.md). - **Skipping risk overlays.** Every strategy ships with a `risk:` block in YAML. Without it, the paper-metadata gate refuses to promote the strategy. ## Deeper reads - [Concept: factor research](../../concepts/strategy/factor-research.md) - [Concept: backtest engines](../../concepts/strategy/backtest-engines.md) - [Tutorial: first backtest](../../tutorials/first-backtest.md) # Recipes > Task-oriented cookbook for common AlphaSwarm operations. Copy-pasteable, results-first. # Recipes Task-oriented, results-first. Each recipe answers a single "how do I..." question with a copy-pasteable command sequence. If you want to learn a subsystem, read the matching [concept](../../concepts/platform/architecture.md). If you want to walk through a complete first-time scenario, do a [tutorial](../../tutorials/first-backtest.md). If you want to fix a broken thing, follow a [runbook](../../how-to/runbooks/dr-restore.md). ## Cookbook - [Add a strategy](./add-a-strategy.md) - [Run a backtest from YAML](./run-a-backtest-from-yaml.md) - [Promote a bot to paper](./promote-a-bot-to-paper.md) - [Snapshot an agent spec](./snapshot-an-agent-spec.md) - [Query data via MCP](./query-data-via-mcp.md) Each recipe is self-contained. None of them assume the others have been run. # Recipe: promote a bot to paper > Take a backtested bot and start a paper-trading session, respecting the paper-metadata gate. # Recipe: promote a bot to paper ```powershell # 1. Snapshot the bot (idempotent — same hash returns same version). curl -X POST http://localhost:8000/bots ` -H "Content-Type: application/json" ` -d @configs/bots/my-bot.yaml # 2. Backtest the bot (gates require a recent backtest_runs row). curl -X POST http://localhost:8000/bots//backtest ` -d '{"start":"2024-01-01","end":"2024-06-30"}' # 3. Promote to paper. 
curl -X POST http://localhost:8000/bots//paper ` -d '{"starting_cash":100000,"duration_minutes":60}' ``` ## The paper-metadata gate `POST /bots//paper` runs [paper_metadata_gate](https://github.com/julianwileymac/alphaswarm/blob/main/alphaswarm/trading/paper_metadata_gate.py) before launching the session. It rejects when: - No `backtest_runs` row for the bot, or it is older than 7 days. - `sharpe < 0.5` on the latest backtest. - `max_drawdown > 0.20`. - `risk.kill_switch_attached != true`. - The bot's universe contains a symbol that isn't in the active data plane. Override the gate via the `--force` flag on `/bots//paper` only with explicit approval. The audit ledger records who forced it and why. ## Risk + kill switch The session inherits the bot's `risk:` block. Trigger a stop: ```powershell curl -X POST http://localhost:8000/paper/stop-all # or use the topbar kill switch in the Vite UI ``` ## Deeper reads - [Tutorial: first paper trading session](../../tutorials/first-paper-trading-session.md) - [Concept: paper trading](../../concepts/trading/paper-trading.md) - [Concept: paper metadata gate](../../concepts/trading/paper-metadata-gate.md) - [Runbook: kill-switch incident response](../../how-to/operations/kill-switch-incident-response.md) # Recipe: query data via MCP > Invoke a data.* MCP tool from an agent context (no direct Postgres / Iceberg reads). # Recipe: query data via MCP AGENTS rule 22: agents NEVER read Postgres or Iceberg directly. Every catalog / dataset / entity / pipeline read goes through a registered `DataMCPTool`. The bridge auto-installs every tool into the agent `TOOL_REGISTRY`; the same tools are reachable externally over HTTP at `/mcp/data` and via the `alphaswarm-data-mcp` stdio binary. 
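
The auto-install behaviour described above follows the familiar decorator-registry pattern. The sketch below is a toy illustration of that pattern only: `TOY_REGISTRY`, `register_tool`, `BrowseTool`, and the catalog entries are all hypothetical, and the real `DataMCPTool` base class and bridge may be structured differently.

```python
from typing import Any, Callable, Dict

# Hypothetical stand-in for an agent-side registry; not the AlphaSwarm object.
TOY_REGISTRY: Dict[str, Any] = {}


def register_tool(name: str) -> Callable[[type], type]:
    """Class decorator: instantiate the tool once and store it under `name`."""
    def decorator(cls: type) -> type:
        if name in TOY_REGISTRY:
            raise ValueError(f"duplicate tool name: {name}")
        TOY_REGISTRY[name] = cls()
        return cls
    return decorator


@register_tool("data.discovery.browse")
class BrowseTool:
    """Toy tool: filters an in-memory catalog instead of touching a database."""

    # Made-up catalog entries, for illustration only.
    _CATALOG = [
        "alphaswarm_silver_yfinance.prices",
        "alphaswarm_silver_yfinance.dividends",
        "alphaswarm_gold_example.trades",
    ]

    def invoke(self, args: Dict[str, Any]) -> Dict[str, Any]:
        prefix = args.get("namespace_prefix", "")
        return {"entries": [n for n in self._CATALOG if n.startswith(prefix)]}


if __name__ == "__main__":
    tool = TOY_REGISTRY["data.discovery.browse"]
    print(tool.invoke({"namespace_prefix": "alphaswarm_silver_yfinance"}))
```

The point of the pattern is that agent code only ever looks tools up by name, so every read passes through one place where access rules and auditing can be applied.
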
## From inside an agent ```python from alphaswarm_agents.tools import TOOL_REGISTRY tool = TOOL_REGISTRY["data.discovery.browse"] result = tool.invoke({"namespace_prefix": "alphaswarm_silver_yfinance"}) print(result["entries"]) ``` ## From outside the platform (HTTP) ```powershell curl -X POST http://localhost:8000/mcp/data/tools/data.discovery.browse/invoke ` -H "Content-Type: application/json" ` -H "Authorization: Bearer " ` -d '{"namespace_prefix":"alphaswarm_silver_yfinance"}' ``` ## From a Cursor/Continue/Cline agent (stdio) Register the stdio binary as an MCP server in the editor: ```json { "mcpServers": { "alphaswarm-data": { "command": "alphaswarm-data-mcp", "env": { "ALPHASWARM_MCP_DATA_CANONICAL_URI": "http://localhost:8000/mcp/data" } } } } ``` ## Where to add a new tool Subclass [`DataMCPTool`](https://github.com/julianwileymac/alphaswarm/blob/main/alphaswarm/data/mcp/base.py) under `alphaswarm/data/mcp/tools/`, decorate with `@register_data_mcp_tool`, and the bridge does the rest. See [Concept: data MCP](../../concepts/data/data-mcp.md). ## RFC 9728 + 8707 conformance Every AlphaSwarm MCP server publishes Protected Resource Metadata at `/.well-known/oauth-protected-resource[/...]` and validates the `aud` claim on incoming tokens against the deployment's canonical URI. The docs site's own MCP server lives at [https://docs.alpha-swarm.ai/mcp](/mcp). ## Deeper reads - [Concept: data MCP](../../concepts/data/data-mcp.md) - [Concept: codebase MCP](../../concepts/data/codebase-mcp.md) - [Concept: pgvector control plane](../../concepts/data/pgvector-control-plane.md) - [Concept: MCP risk tools](../../concepts/data/mcp-risk-tools.md) # Recipe: run a backtest from YAML > Dispatch a backtest task from a YAML strategy config and tail the Celery progress. 
# Recipe: run a backtest from YAML ```powershell $resp = curl -X POST http://localhost:8000/backtest ` -H "Content-Type: application/json" ` -d (Get-Content configs/strategies/my-strategy.yaml -Raw) # Tail progress (canonical {task_id, stage, message, timestamp} frames). docker exec alphaswarm-api python -c "from alphaswarm.ws.broker import subscribe; \ [print(m) for m in subscribe('')]" ``` ## Choose your engine The default engine is `vbtpro` (vectorbt-pro primary). Override with `--engine event_driven` / `hft` / `vectorbt` / `backtesting_py` / `zvt` / `aat`. See [backtest engines](../../concepts/strategy/backtest-engines.md) for the capability matrix and fallback cascade. ## Walk-forward + WFO ```powershell curl -X POST http://localhost:8000/backtest/wfo ` -d '{"strategy_config":"configs/strategies/my-strategy.yaml","windows":12,"step":"1mo"}' ``` The endpoint dispatches one task per window; each writes its own `backtest_runs` row and the parent emits a `wfo.complete` frame when every window is in. ## Look at results - `backtest_runs` row in Postgres for the headline metrics. - `alphaswarm_gold_backtest_` Iceberg namespace for trade-level detail. - The QuantStats tearsheet endpoint at `POST /analytics/portfolio/tearsheet` for an HTML report. ## Deeper reads - [Tutorial: first backtest](../../tutorials/first-backtest.md) — end-to-end walkthrough. - [Concept: backtest engines](../../concepts/strategy/backtest-engines.md) - [Concept: analytics frontend](../../concepts/data/analytics-frontend.md) # Recipe: snapshot an agent spec > Hash-lock a YAML AgentSpec into agent_spec_versions so AgentRuntime can drive it. # Recipe: snapshot an agent spec ```powershell # Idempotent — re-running with unchanged content returns the same # spec_hash and the same version row. curl -X POST http://localhost:8000/agents/specs ` -H "Content-Type: application/json" ` -d @configs/agents/my-agent.yaml ``` The response carries `spec_hash` and `version`. 
If you change a field and re-POST, a NEW `agent_spec_versions` row is created with a NEW hash. Old versions stay intact for replay. ## What the runtime does The [`AgentRuntime`](https://github.com/julianwileymac/alphaswarm/blob/main/alphaswarm/agents/runtime.py) gates every run on: - A valid `agent_spec_versions` row. - A cost cap (`per_run_max_tokens`, `per_run_max_usd`). - The active kill switch. - The RFC 9728 + 8707 MCP audience check (rule 49). - An `experiment_id` (rule 34). If any check fails, the run rejects before the first LLM call. ## Run the agent ```powershell curl -X POST http://localhost:8000/agents//run ` -d '{"inputs":{"universe":["SPY","QQQ","IWM"]}}' ``` `AgentRuntime` writes `agent_runs_v2` rows with telemetry, cost, and OTEL trace IDs. ## Don't bypass the runtime Never call `router_complete` directly from inside agent code. Declare the model in `AgentSpec.model` and let the runtime drive the call. See [AGENTS rule 12](https://github.com/julianwileymac/alphaswarm/blob/main/AGENTS.md). ## Deeper reads - [Concept: agents](../../concepts/agentic/agents.md) - [Concept: agentic development](../../concepts/agentic/agentic-development.md) - [Concept: workflow studio](../../concepts/agentic/workflow-studio.md) - [Tutorial: first agent workflow](../../tutorials/first-agent-workflow.md) # Runbook — Disaster Recovery: full restore (under 30 min) > 1. The Redis primary in `alphaswarm-system` is gone. The `redis-master.alphaswarm-system.svc` Service points to no pod. 2. Spin a fresh Redis pod: # Runbook — Disaster Recovery: full restore (under 30 min) Restores the Phase 6 reliability surface from S3 in three layers. ## Layer 1: rate-limit Redis (5 min) 1. The Redis primary in `alphaswarm-system` is gone. The `redis-master.alphaswarm-system.svc` Service points to no pod. 2. Spin a fresh Redis pod: ```bash kubectl -n alphaswarm-system scale statefulset/redis --replicas=1 ``` 3. 
The Lua scripts re-register lazily on the first `Check` call (see `RedisTokenBucketStrategy._ensure_initialised` — `EVALSHA` failure paths fall back to `EVAL` + re-register). 4. Buckets that were drained in the previous Redis are now full again; **this is intentional**. The audit log captures every token consumed pre-incident; the operator can replay the ledger to rebuild bucket state if compliance requires it. ## Layer 2: audit log (10 min) 1. The `audit_log` table is hash-chain-protected (trigger from Alembic 0079). The S3 export (Celery beat task `alphaswarm_ratelimit.tasks.ledger_export.export_ledger_window`) carries every row in append-only JSONL form. 2. Restore the latest window: ```bash alphaswarm ratelimit admin restore-ledger \ --bucket alphaswarm-audit-archive \ --since 2026-05-01 \ --until 2026-05-24 ``` 3. The `enforce_audit_log_hash_chain` Postgres trigger validates every restored row against its predecessor; on violation the restore aborts and surfaces the exact mismatched hex digest. ## Layer 3: dbt-loom manifest registry (10 min) 1. The `s3://alphaswarm-dbt-manifests` bucket is the source of truth for cross-project `ref()` lookups. 2. Restore the latest manifest per project: ```bash alphaswarm deploy restore-dbt-manifests \ --env prod \ --to-bucket alphaswarm-dbt-manifests-restored ``` 3. Update the `loom.yml` in each team project to point at the restored bucket name; downstream `dbt parse` succeeds with the rehydrated manifests. ## Phase-gate verification The full DR test must complete in under 30 min wall-clock. `tests/chaos/test_dr_restore.py` orchestrates the three layers against a fixture cluster + S3 mock and asserts the under-30-min deadline. # Runbook — QuestDB WAL apply stall > - `questdb_wal_apply_lag_seconds` Prometheus metric is above 60s. - New dbt model materialization runs hang on INSERT. 
- The QuestDB UI shows `WAL applied = N` is no longer advancing # Runbook — QuestDB WAL apply stall Symptoms: - `questdb_wal_apply_lag_seconds` Prometheus metric is above 60s. - New dbt model materialization runs hang on INSERT. - The QuestDB UI shows `WAL applied = N` is no longer advancing. ## Root cause A long-running query or external table lock has blocked the WAL apply worker. The new QuestDB documentation explicitly warns: "Non-partitioned tables cannot use WAL" — the AlphaSwarm custom `questdb_table` materialization forces `PARTITION BY DAY` to avoid the most common form, but mis-configured external tables can still trip the apply loop. ## Recovery 1. Identify the offending table from the Prometheus alert label: ``` {table="equities_minute_bars"} ``` 2. Suspend writers to that table: ```bash alphaswarm ratelimit admin halt-pool questdb_writer:equities_minute_bars ``` 3. Resume the WAL apply loop: ```sql ALTER TABLE equities_minute_bars RESUME WAL; ``` 4. Once the lag drops back below 5s, lift the writer halt: ```bash alphaswarm ratelimit admin resume-pool questdb_writer:equities_minute_bars ``` ## Prevention The Phase 2 `alphaswarm/dagster/dagster.yaml` reserves a per-table `questdb_writer:` pool with `limit=1` so concurrent writers to the same table are impossible. Verify that pool is present + has `limit=1`. # Runbook — quota-exhaustion > 1. Open the rate-limit dashboard at `/data/ratelimit`. Find the over-consuming `(user_id, service, key_id)`. 2. Inspect the `rl_ledger` partition for the last hour: # Runbook — quota-exhaustion A bucket has fired `AQPRatelimitBucketAt80Percent`, `AQPRatelimitBucketAt95Percent`, or `AQPRatelimitBucketExhausted`. ## Diagnosis (5 min) 1. Open the rate-limit dashboard at `/data/ratelimit`. Find the over-consuming `(user_id, service, key_id)`. 2. 
Inspect the `rl_ledger` partition for the last hour: ```sql SELECT decision, count(*), sum(tokens_consumed) FROM rl_ledger WHERE ts > now() - interval '1 hour' AND key_id = :key_id GROUP BY decision; ``` 3. Cross-reference `audit_log` for the calling `tool_id` — `data.ingest.materialize` or `data.ingest.preview_source` are the usual culprits. ## Decision tree (10 min) | Cause | Action | | --- | --- | | Misconfigured backfill | Tell the operator to cancel via `alphaswarm materialize cancel `. The reservation auto-releases. | | Vendor downgrade | Mint a higher-tier key via `alphaswarm keys mint --service polygon --rps 100 --burst 1000`. | | Stuck connector loop | `alphaswarm ratelimit status --key-id ` shows the call rate; halt the offending Dagster sensor via the topbar kill-switch. | | Legitimate traffic | Raise the policy via `data.ratelimit.policy.update` (Tier-P + step-up MFA). | ## Recovery (15 min) 1. Once the cause is addressed, the bucket refills at the policy's `refill_rate`; no manual reset is required. 2. If the operator wants an immediate reset, run the Phase 6 admin script that explicitly DELs the bucket key: ```bash alphaswarm ratelimit admin reset --user-id --service polygon --key-id primary ``` 3. Verify recovery in Grafana: ``` rl_bucket_remaining{service="polygon.aggregates"} > 50 ``` ## Postmortem Every quota-exhaustion alert that requires manual intervention must produce a postmortem PR within 72 hours. Template: `alphaswarm_docs/docs/how-to/runbooks/templates/postmortem.md` (to be authored). # Runbook — dbt snapshot deadlock > - `dbt snapshot` runs queue indefinitely. - The `dbt_snapshots` Dagster concurrency pool shows 1 slot in use but the corresponding run is `CANCELED` or `FAILED` # Runbook — dbt snapshot deadlock Symptoms: - `dbt snapshot` runs queue indefinitely. - The `dbt_snapshots` Dagster concurrency pool shows 1 slot in use but the corresponding run is `CANCELED` or `FAILED`. 
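
The second symptom boils down to one check: a pool slot is still held by a run that has already reached a terminal state. The sketch below assumes you have already collected, by whatever means, the run IDs holding slots and the status of each run. `stuck_slots` and both input shapes are hypothetical and are not a Dagster API.

```python
from typing import Dict, Iterable, List

# Statuses after which a run should no longer hold a concurrency slot.
# The runbook writes FAILED; Dagster's run status enum uses FAILURE,
# so both spellings are accepted here (an assumption of this sketch).
TERMINAL_STATUSES = frozenset({"SUCCESS", "FAILURE", "FAILED", "CANCELED"})


def stuck_slots(slot_holders: Iterable[str], run_status: Dict[str, str]) -> List[str]:
    """Return the run IDs that still hold a slot although the run has ended."""
    return [
        run_id
        for run_id in slot_holders
        if run_status.get(run_id) in TERMINAL_STATUSES
    ]


if __name__ == "__main__":
    holders = ["run-a", "run-b"]  # hypothetical run IDs holding `dbt_snapshots` slots
    statuses = {"run-a": "CANCELED", "run-b": "STARTED"}
    print(stuck_slots(holders, statuses))
```

A non-empty result means the pool is holding a slot for a run that can never release it, which is the deadlock this runbook addresses.
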
## Root cause Per the Dagster docs: "a single cancelled run will permanently deadlock all future runs for that pool" unless the `free_slots_after_run_end_seconds` knob is set on the `run_monitoring` block. ## Fix (in this order) 1. Confirm `alphaswarm/dagster/dagster.yaml` has ```yaml run_monitoring: enabled: true free_slots_after_run_end_seconds: 300 ``` If missing, add it + reload the Dagster instance. 2. Manually free the stuck slot: ```bash dagster instance concurrency reset dbt_snapshots ``` 3. Verify with the Dagster UI: the pool should show `0 / 1` used. ## Verification chaos test `tests/chaos/test_snapshot_deadlock_recovery.py` triggers 5 parallel snapshot jobs against a sqlite test target and asserts that even after one is cancelled the pool recovers within 360s. ## Postmortem If the deadlock recurs after the canonical fix, the postmortem must include a Dagster + dbt version pair and a minimal repro so the upstream issue can be filed. # Tenant-router auth rollout runbook # Tenant-router auth rollout runbook > Operator companion to [Edge authentication & cell routing](../concepts/identity/edge-authentication.md) and the manifests at `alphaswarm_platform/deployments/kubernetes/edge/alphaswarm-tenant-router/`. Follows the [cell-router cutover](./cell-router-cutover.md) — run that first if the Envoy edge is not serving yet. The tenant-router ships **fail-closed**: `AUTH_MODE=required` with an empty issuer, so a fresh apply crash-loops with a `SettingsError` until you stamp real IdP values. That is intentional — complete this runbook to bring the edge up authenticated. ## 1. Prerequisites 1. The IdP is provisioned (Auth0 via `terraform/modules/auth0_identity` or Entra via `alphaswarm_entra_directory`) and the per-cell backends already validate the same issuer/audience (`ALPHASWARM_AUTH_OIDC_ISSUER` / `..._AUDIENCE` in `alphaswarm-config`, stamped by `build/scripts/sync_auth0_env_to_k8s.py`). 2. 
The claims pipeline stamps the namespaced routing claims (`https://alphaswarm.internal/tenant_id`, `workspace_id`, and — for B2B premium plans — `tier`). See [Auth0 Actions](../concepts/identity/auth0-actions.md) / [MSAL setup](../concepts/identity/msal-entra-setup.md). 3. The cells registry has at least one `state=active` cell per tier you route to (`curl -sS $CP/manage/cells | jq '.data[].tier'`). ## 2. Stamp the auth ConfigMap Edit (or overlay-patch) `alphaswarm-tenant-router-config` in `deployments/kubernetes/edge/alphaswarm-tenant-router/configmap.yaml`: ```yaml data: ALPHASWARM_TENANT_ROUTER_AUTH_MODE: "permissive" # step 3 flips to required ALPHASWARM_TENANT_ROUTER_OIDC_ISSUER: "https://.us.auth0.com/" ALPHASWARM_TENANT_ROUTER_OIDC_AUDIENCE: "https://api.alphaswarm.internal/manage" ``` The JWKS URI derives from the issuer (`/.well-known/jwks.json`); set `ALPHASWARM_TENANT_ROUTER_JWKS_URI` only for non-standard IdPs. Only asymmetric algorithms are accepted — if you change `OIDC_ALGORITHMS`, `HS*` values are refused at boot. Apply + restart: ```bash kubectl apply -k alphaswarm_platform/deployments/kubernetes/edge/ kubectl -n alphaswarm-edge rollout restart deploy/alphaswarm-tenant-router kubectl -n alphaswarm-edge rollout status deploy/alphaswarm-tenant-router ``` ## 3. Canary in `permissive`, then enforce `permissive` denies **invalid** tokens but lets anonymous requests through flagged `x-alphaswarm-auth: anonymous` (per-cell gates still reject where they require auth). 
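
The difference between the two modes can be summarised as a small decision function. This is a sketch of the behaviour as this runbook describes it, not the router's code: `decide` and its `token_state` argument are hypothetical, and mapping an absent token in `required` mode to the `missing_token` reason is an assumption based on the reason codes this runbook lists.

```python
from typing import Optional, Tuple


def decide(mode: str, token_state: Optional[str]) -> Tuple[str, str]:
    """Return (decision, reason) for one request.

    token_state: None when no token was presented, "valid" when the token
    verified, otherwise a deny reason such as "expired_token".
    """
    if mode not in ("permissive", "required"):
        raise ValueError(f"unsupported mode: {mode}")
    if token_state is None:
        # Only the anonymous case differs between the two modes.
        if mode == "permissive":
            return ("allow", "anonymous")
        return ("deny", "missing_token")
    if token_state == "valid":
        return ("allow", "verified")
    # An invalid token is denied in both modes.
    return ("deny", token_state)


if __name__ == "__main__":
    for mode in ("permissive", "required"):
        for state in (None, "valid", "expired_token"):
            print(mode, state, decide(mode, state))
```

Because only the anonymous branch changes, the share of `reason="anonymous"` decisions during the canary indicates how many requests would start failing once the mode is `required`.
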
Watch the decision counters:

```bash
kubectl -n alphaswarm-edge port-forward svc/alphaswarm-tenant-router 8080 &
curl -s localhost:8080/metrics | grep authz_decisions_total
# alphaswarm_tenant_router_authz_decisions_total{decision="allow",mode="permissive",reason="verified"} 1042
# alphaswarm_tenant_router_authz_decisions_total{decision="allow",mode="permissive",reason="anonymous"} 3
# alphaswarm_tenant_router_authz_decisions_total{decision="deny",mode="permissive",reason="expired_token"} 7
```

When `reason="anonymous"` is ~zero for a representative window (only unauthenticated probes remain), flip to enforcement:

```yaml
ALPHASWARM_TENANT_ROUTER_AUTH_MODE: "required"
```

re-apply, restart, and confirm `readyz` reports the posture:

```bash
curl -s localhost:8080/readyz | jq
# {"status":"ok","cells":3,"auth_mode":"required","cba_mode":"enforce",...}
```

## 4. Verification checks

```bash
# Anonymous is denied (required mode):
curl -s -o /dev/null -w '%{http_code}\n' -XPOST localhost:8080/ext_authz/v3/check \
  -H 'content-type: application/json' \
  -d '{"attributes":{"request":{"http":{"headers":{}}}}}'
# 401

# A live SPA token is verified and routed:
TOKEN=$(...fetch from the SPA / device flow...)
curl -s -XPOST localhost:8080/ext_authz/v3/check \
  -H 'content-type: application/json' \
  -d "{\"attributes\":{\"request\":{\"http\":{\"headers\":{\"authorization\":\"Bearer ${TOKEN}\"}}}}}" \
  -D - -o /dev/null | grep -i x-alphaswarm
# x-alphaswarm-cell: cell-shared-std-us-east-1a
# x-alphaswarm-auth: verified
# x-alphaswarm-sub: auth0|...
```

End-to-end through the edge, a tampered or expired token must produce 401 from Envoy, and `x-alphaswarm-*` request headers sent by the client must arrive at the cell overwritten with verified values.

## 5. Cross-cell CBA keys (Phase 5 §8.5)

Cross-cell calls present a `Cell-Bound-Authorization` JWT. The validator (co-located in the router) reads each **source** cell's verification keys from the cells-registry annotation — publish them when you enable cross-cell MCP:

```bash
curl -sS -XPATCH "$CP/manage/cells/cell-shared-std-us-east-1a" \
  -H "authorization: Bearer $MGMT_TOKEN" -H 'content-type: application/json' \
  -d '{"annotations":{"alphaswarm.internal/cba-jwks":"{\"keys\":[...]}"}}'
```

`CBA_MODE=enforce` (default) is safe before any workload mints CBAs — requests without the header pass through. Use `monitor` to log would-be denials during key rollout; check `cba_decisions_total{decision="deny"}` before returning to `enforce`. Single-cell edges should additionally pin `ALPHASWARM_TENANT_ROUTER_CBA_DESTINATION_CELL_ID` to their own cell id.

## 6. Rollback

Auth enforcement is config-only — no image rollback needed:

1. Flip `AUTH_MODE` back to `permissive` (NOT `disabled`; the insecure mode also demands `ALLOW_INSECURE=true` and is for local dev only).
2. `kubectl -n alphaswarm-edge rollout restart deploy/alphaswarm-tenant-router`.
3. The decision counters (`/metrics`) and structured `authz_deny` logs (reason codes: `missing_token`, `expired_token`, `wrong_audience`, `wrong_issuer`, `no_matching_key`, `forbidden_algorithm`, `jwks_unreachable`) identify what was being denied before you re-enforce.

## Failure modes worth knowing

| Symptom | Cause | Response |
| --- | --- | --- |
| Pod crash-loops with `SettingsError` | Missing issuer/audience in `required`/`permissive` | Stamp the ConfigMap (step 2). |
| All requests 401 `jwks_unreachable` | Router cannot reach the IdP JWKS (egress 443 blocked, wrong issuer) | Check the NetworkPolicy + issuer URL; the JWKS cache serves stale once warmed, so this bites hardest on cold boots. |
| 401 `no_matching_key` after IdP key rotation | kid not in cached JWKS | The router force-refreshes once per unknown kid automatically; persistent failures mean the issuer/JWKS URI points at the wrong tenant. |
| 503 `no_cell_available` for premium users | No active `shared-prem` cell | Explicit tiers are never downgraded — activate a cell for the tier or fix the claim pipeline. |
| `readyz` shows `registry_stale: true` | Control plane unreachable > `REGISTRY_STALENESS_WARN_SECONDS` | Routing continues on last-known-good cells; restore `alphaswarm-cp` before making placement changes. |

# Conventions

> Documentation authoring rules, frontmatter contract, and how to ship a new doc.

## Frontmatter is mandatory

Every `.md` or `.mdx` file under `alphaswarm_docs/docs/` MUST have a frontmatter block validated by the Zod schema at [src/lib/frontmatterSchema.ts](https://github.com/julianwileymac/alphaswarm/blob/main/alphaswarm_docs/src/lib/frontmatterSchema.ts).

Required fields:

- `title` — the human-readable title (becomes the ``).
- `summary` — a one-liner consumed by `/llms.txt` and the search index. Keep under 200 characters.
- `owner` — the GitHub Team that owns the page (`platform-team`, `docs-team`, `data-team`, `rl-team`, `ml-team`, `agentic-team`, `strategy-team`, `trading-team`, `identity-team`, `infra-team`, `sre-team`).
- `last_reviewed` — ISO 8601 date. The stale-content watchdog opens a GitHub Issue when this is more than 180 days old.
- `audience` — one of `human`, `agent`, `both`, `internal`.

Optional:

- `version` — pin to a specific date-epoch.
- `deprecated`, `deprecated_replacement`, `deprecated_at`, `deprecated_sunset` — deprecation lifecycle (Stripe-style epochs).
- `keywords`, `tags`, `sidebar_label`, `sidebar_position`, `runnable` — Docusaurus-native fields.

## Cross-linking

Use relative markdown links. The autolink resolver in Docusaurus maps them to the published routes at build time.

```mdx
See [the data plane concept](../concepts/data/data-plane.md) for the provider → cache → DuckDB view pipeline.
```

Cite source code with a full absolute repo URL:

```mdx
[alphaswarm/data/iceberg_catalog.py](https://github.com/julianwileymac/alphaswarm/blob/main/alphaswarm/data/iceberg_catalog.py)
```

Do not link to specific line numbers — they bit-rot quickly.

## Diagrams

Mermaid only. GitHub renders it natively; Docusaurus ships `@docusaurus/theme-mermaid` which renders it client-side here. Do not commit PNG / SVG diagrams unless they are screenshots of a running UI and are time-stamped.

## Code blocks

Tag every code block with a language. Tag runnable blocks with the `runnable` attribute; Phase 5 of the migration will render those with a "Run" button backed by Pyodide (for Python) or StackBlitz WebContainers (for JS / TS).

```python runnable
import requests
print(requests.get("http://localhost:8000/readyz").status_code)
```

## "Was this helpful?"

Every page renders the feedback widget from [src/components/FeedbackWidget.tsx](https://github.com/julianwileymac/alphaswarm/blob/main/alphaswarm_docs/src/components/FeedbackWidget.tsx). Submissions POST to a Cloudflare Worker that opens a `docs-feedback` GitHub Issue tagged with the page's CODEOWNER team.

## Editing this page

Click the "Edit this page" link at the bottom — it opens github.dev for a browser-side edit. Or open the PR locally:

```powershell
git checkout -b docs/fix-typo
# edit
git add alphaswarm_docs/docs/...
git commit -m "docs: fix typo in quickstart"
git push -u origin docs/fix-typo
```

Branch protection requires the docs-CI suite to pass plus one CODEOWNER approval.

## Business editors

Non-engineers can edit content through Keystatic at [https://docs.alpha-swarm.ai/keystatic](/keystatic). Keystatic stores changes in Git and opens a PR against `main`. No parallel CMS, no duplicate database.

## AI agents

Read `/llms.txt` for the curated index, `/llms-full.txt` for the full corpus, or query the MCP server at `/mcp`.
The MCP server complies with RFC 9728 and RFC 8707 (AlphaSwarm rule 49) and validates the `aud` claim against `ALPHASWARM_MCP_DOCS_CANONICAL_URI`.

# Glossary

Project-specific jargon used across AlphaSwarm, with a definition and a pointer to the canonical file. New contributors and AI agents should treat this as the **single source of truth** for terminology — if you find a mismatch between this glossary and the code, file an issue.

> See also: [alphaswarm_docs/index.md](../intro/index.md) for the full doc map.

## Core domain

- **`vt_symbol`** — Composite symbol id with the shape `{TICKER}.{EXCHANGE}` (vnpy convention), e.g. `AAPL.NASDAQ`, `BTCUSDT.BINANCE`, `ESM4.CME`. Always created via `Symbol.parse(...)` / `Symbol.format(...)` in [alphaswarm/core/types.py](../alphaswarm/core/types.py); never hand-split.
- **`Symbol`** — Immutable dataclass that bundles `ticker`, `exchange`, `asset_class`, `security_type`, optional contract spec. The atom flowing through every data feed, strategy, and broker. Defined in [alphaswarm/core/types.py](../alphaswarm/core/types.py).
- **`AssetClass` vs `SecurityType`** — `AssetClass` is the broad category (`equity`, `crypto`, `fx`, `future`, `option`, `index`, `commodity`, `bond`). `SecurityType` is the Lean-style finer-grained enum (`equity`, `option`, `future_option`, `crypto_future`, `index_option`, …). The `_polymorphic_identity_for` helper in [alphaswarm/data/catalog.py](../alphaswarm/data/catalog.py) maps `SecurityType` to a joined-table subclass of `Instrument`.
- **`Resolution`** — Lean-style bar cadence (`Tick`, `Second`, `Minute`, `Hour`, `Daily`); see [alphaswarm/core/types.py](../alphaswarm/core/types.py).
- **`Interval`** — Short-code bar cadence (vnpy style, `1m`, `5m`, `1h`, `1d`). Same idea as `Resolution`, kept for vnpy back-compat.
- **`SubscriptionDataConfig`** — The data-plane routing key.
Combines `Symbol + Resolution + TickType + DataNormalizationMode`. See [alphaswarm_docs/core-types.md](../concepts/platform/core-types.md).

## Persistence + data plane

- **Execution Ledger** — The Postgres tables under [alphaswarm/persistence/models.py](../alphaswarm/persistence/models.py) + [alphaswarm/persistence/ledger.py](../alphaswarm/persistence/ledger.py) that record every signal, order, fill, agent decision, and backtest run. Authoritative for "what did the system actually do?".
- **`LedgerWriter`** — Façade over the ledger tables. Always go through it instead of writing to ORM models directly so audit messages get emitted. [alphaswarm/persistence/ledger.py](../alphaswarm/persistence/ledger.py).
- **`Instrument` joined-table inheritance** — `instruments` is the parent table; each subclass (`InstrumentEquity`, `InstrumentOption`, …) lives in its own joined-table row keyed on `instruments.id`. The `instrument_class` discriminator selects the subclass at load time. See [alphaswarm_docs/erd.md](../concepts/platform/erd.md) and [alphaswarm/persistence/models_instruments.py](../alphaswarm/persistence/models_instruments.py).
- **`polymorphic_identity`** — SQLAlchemy mapper arg that ties a subclass to a discriminator value (e.g. `InstrumentEquity.__mapper_args__ = {"polymorphic_identity": "spot"}`). When you add a new instrument subclass you must also extend the `mapping` dict in `_polymorphic_identity_for`.
- **`DatasetCatalog`** — Parent row describing a logical dataset (HMDA LAR, FDA device events, etc.) with provider/domain/tags.
- **`DatasetVersion`** — Per-materialisation row beneath `DatasetCatalog`. Captures row count, dataset hash, schema snapshot, Iceberg identifier.
- **`DataLink`** — Edge between a `DatasetVersion` and an entity (`Instrument`, `Issuer`, `EconomicSeries`). Use this for "which symbols does this dataset cover?" queries.
- **`DataSource`** — Logical provider record (Yahoo, Alpha Vantage, IBKR, openFDA).
Datasets and data-links reference a `DataSource`.
- **`IcebergCatalog`** (the wrapper) — PyIceberg handle from [alphaswarm/data/iceberg_catalog.py](../alphaswarm/data/iceberg_catalog.py). Always go through `append_arrow`, `read_arrow`, `iceberg_to_duckdb_view`; never call PyIceberg's `Catalog.create_table` directly.
- **`aqp_` namespace** — Iceberg namespace convention for the regulatory ingest: `alphaswarm_cfpb`, `alphaswarm_uspto`, `alphaswarm_fda`, `alphaswarm_sec`. New corpora pick a new `aqp_` slug.
- **Persistent host warehouse** — `C:/alphaswarm-warehouse` on Windows, bind-mounted into `alphaswarm-api` and `alphaswarm-worker` at `/warehouse`. Holds the PyIceberg SQL catalog (`catalog.db`), Parquet data files, staging dir, and ingest audit logs. See [alphaswarm_docs/data-catalog.md](../concepts/data/data-catalog.md).
- **`legacy` profile** — Docker Compose profile that bundles the older REST + MinIO catalog topology (off by default). Bring it up with `docker compose --profile legacy up -d`.

## Strategies + backtest

- **`BaseStrategy`** — Abstract strategy contract under [alphaswarm/strategies/](../alphaswarm/strategies/). Subclasses implement `on_bar`, `on_signal`, etc. See [alphaswarm_docs/backtest-engines.md](../concepts/strategy/backtest-engines.md).
- **`MLAlphaStrategy` / `MLSelectorAlpha`** — Strategies that wrap an ML model (deployed via `ModelDeployment`) and emit signals.
- **`EnsembleAlpha`** — Weighted combination of multiple alphas. [alphaswarm/strategies/ml_alphas.py](../alphaswarm/strategies/ml_alphas.py).
- **`IBrokerage` / `IDataQueueHandler`** — Lean-style interfaces consumed by backtest, paper, and live engines without modification (the same strategy code runs against all three). See [alphaswarm_docs/paper-trading.md](../concepts/trading/paper-trading.md).
- **`BacktestRun`** — Postgres row describing one backtest invocation (Sharpe, Sortino, drawdown, MLflow run id, dataset hash).
The backtest UI's history view is just a query against this table.
- **`MLflow run id`** — Foreign id stored on `BacktestRun.mlflow_run_id` pointing at the MLflow tracking server. Click-through from the UI opens the MLflow UI in a new tab.
- **`dataset_hash`** — Deterministic SHA-256 of the input bars used in a backtest. Lets the UI flag "two backtests with the same hash = identical inputs".

## ML + agents

- **Tier (`deep` / `quick`)** — Two LLM tiers in the agentic crews. `deep` = high-capability (Nemotron 70B / GPT-4-class) for analysis; `quick` = small/fast (Llama 3.2 / Mini) for control-flow decisions. Provider per tier is in `settings.llm_provider_deep` / `_quick`; model per tier in `llm_deep_model` / `llm_quick_model`.
- **`router_complete`** — One-shot LLM completion through LiteLLM exposed by [alphaswarm/llm/providers/router.py](../alphaswarm/llm/providers/router.py). All AlphaSwarm code goes through this — never call `litellm.completion` or the Ollama client directly.
- **`Director`** — Nemotron-driven planner + verifier in [alphaswarm/data/pipelines/director.py](../alphaswarm/data/pipelines/director.py). Sits between discovery and materialisation in generic file ingestion.
- **`IngestionPlan` / `PlannedDataset`** — Director output dataclass. One `PlannedDataset` per discovered family with target namespace, table name, expected_min_rows, domain hint, and skip list.
- **`VerifierVerdict`** — Director's post-materialise judgement (`accept` or `retry` with adjusted knobs).
- **`__assets__` family** — Synthetic `DiscoveredDataset` carrying the non-tabular inventory (PDFs, XML, images) found during discovery. Never materialised; surfaced under `IngestionReport.extras` for visibility.
- **`AgentDecision` / `DebateTurn`** — Agent crew audit trail rows.
- **`CrewRun`** — One full agentic crew invocation (planner → research → execution sub-agents).
- **`Alpha158`** — Microsoft Qlib's 158-feature factor zoo, ported to AlphaSwarm under [alphaswarm/data/indicators_zoo.py](../alphaswarm/data/indicators_zoo.py).
- **`FeatureSet` / `FeatureSetVersion`** — Composable feature spec (list of `IndicatorZoo` expressions + transformations) versioned in Postgres, materialised on demand.
- **`ModelDeployment` / `MLDeployment`** — A trained ML model that has been registered for inference (rows in [alphaswarm/persistence/models.py](../alphaswarm/persistence/models.py)).

## Bots

- **`Bot`** — Smallest self-contained, deployable unit on AlphaSwarm. Aggregates a universe + data pipeline + strategy + backtest engine + optional ML deployments + optional agent specs + RAG plan + metrics + risk caps + deployment target. Lives under a `Project` and is uniquely identified by `(project_id, slug)`. See [alphaswarm_docs/bots.md](../concepts/agentic/bots.md).
- **`BotSpec`** — Pydantic blueprint for a bot. Hashed via `snapshot_hash()` to drive immutable `bot_versions` snapshots. Defined in [alphaswarm/bots/spec.py](../alphaswarm/bots/spec.py).
- **`TradingBot` / `ResearchBot`** — Bot subclasses selected by `BotSpec.kind`. `TradingBot` does backtest / paper / deploy; `ResearchBot` does chat (and optional backtest if a `strategy` block is set).
- **`BotRuntime`** — Single sanctioned execution entry point for any bot lifecycle action. Snapshots specs into `bot_versions`, opens `bot_deployments` rows, and emits progress through [alphaswarm/tasks/_progress.py](../alphaswarm/tasks/_progress.py).
- **`bot_versions`** — Immutable, hash-locked spec snapshots (mirrors `agent_spec_versions`). Never mutated in place.
- **`bot_deployments`** — Ledger of every backtest / paper / chat / k8s invocation for a bot. References the `BotVersion` that produced it so a run can be replayed.
- **Deployment target (`paper_session` / `kubernetes` / `backtest_only`)** — Selected via `BotSpec.deployment.target`.
Backed by `alphaswarm/bots/deploy.py::DeploymentDispatcher`.

## Provider catalog

- **`LLMProvider`** — Lightweight handle around a LiteLLM provider spec. Registered in [alphaswarm/llm/providers/catalog.py::PROVIDERS](../alphaswarm/llm/providers/catalog.py).
- **`ProviderSpec`** — Static config for a provider slug (LiteLLM prefix, env-var name, default models).
- **`vllm` provider** — OpenAI-compatible vLLM endpoint behind LiteLLM's `openai/` adapter. Empty `ALPHASWARM_VLLM_BASE_URL` disables.
- **`nemotron-3-nano:30b`** — Default Director model on Ollama (NVIDIA Nemotron Nano v3, 31.6B params). Pull with `ollama pull nemotron-3-nano:30b`. Configurable via `ALPHASWARM_LLM_DIRECTOR_MODEL`.

## Streaming + live

- **`KafkaDataFeed`** — In-process Kafka consumer that hands bars/quotes to the `IDataQueueHandler` interface.
- **`features.indicators.v1`, `market.bar.v1`, …** — Versioned Kafka topics. Naming pattern is `..v`.
- **`StreamingIngester`** — `alphaswarm-stream-ingest` CLI that publishes to Kafka topics from Alpaca / IBKR.
- **Heartbeat / kill-switch** — Periodic Redis publish from the paper-trading session; absence triggers the runner to halt. `ALPHASWARM_RISK_KILL_SWITCH_KEY` (default `alphaswarm:kill_switch`).

## Observability

- **OTEL endpoint** — `ALPHASWARM_OTEL_ENDPOINT` (default empty disables). When set, every Celery task and HTTP request emits OpenTelemetry spans via [alphaswarm/observability/](../alphaswarm/observability/).
- **Progress bus** — Redis pub/sub channel `alphaswarm:task:` carrying `{stage, message, timestamp, **extra}` payloads. UIs subscribe via the WebSocket relay at `/chat/stream/{task_id}`. See [alphaswarm/ws/broker.py](../alphaswarm/ws/broker.py) and [alphaswarm/tasks/_progress.py](../alphaswarm/tasks/_progress.py).

## Configuration

- **`settings`** — Cached `Settings` instance from [alphaswarm/config.py](../alphaswarm/config.py).
Always import as `from alphaswarm.config import settings` and never construct `Settings()` directly — the cache is backed by `lru_cache(maxsize=1)`.
- **`ALPHASWARM_*` env namespace** — Every settable knob takes the `ALPHASWARM_` prefix. Bools accept `true`/`false`/`1`/`0`. Paths are resolved by `_coerce_path`.
- **`host-downloads`** — `/host-downloads:ro` bind mount in `alphaswarm_platform/compose/docker-compose.yml` exposing the user's local `Downloads/` directory for CLI ingest jobs.

## Inspiration rehydration (Phase 2026-04-29)

- **Microprice** — `(P_ask * Q_bid + P_bid * Q_ask) / (Q_bid + Q_ask)`. Volume-weighted refinement of mid-price; because each price is weighted by the opposite side's queue, it skews toward the ask when the bid queue is deeper and toward the bid when the ask queue is deeper. Implemented in [alphaswarm/data/microstructure.py](../alphaswarm/data/microstructure.py).
- **OBI (Order Book Imbalance)** — `(Q_bid - Q_ask) / (Q_bid + Q_ask)`, range `[-1, +1]`. Positive = bid-side pressure. Used as a quote skew signal in the LOB market-making strategies under [alphaswarm/strategies/hft/](../alphaswarm/strategies/hft/).
- **VPIN** — Volume-synchronized probability of informed trading (Easley / López de Prado / O'Hara). Re-buckets trade flow by equal-volume buckets; rolling mean of |buy-sell|/|buy+sell|. See [alphaswarm/data/microstructure.py](../alphaswarm/data/microstructure.py).
- **Sample-aware Sharpe** — Annualised Sharpe ratio that uses the actual sample frequency of a returns series instead of the assumed 252 trading days. Required for HFT strategies with sub-daily bars. See [alphaswarm/backtest/hft_metrics.py](../alphaswarm/backtest/hft_metrics.py).
- **Walk-forward** — Training scheme where the model is re-fit on a rolling (or anchored) window and tested on the immediately following slice. Implemented in [alphaswarm/ml/walk_forward.py](../alphaswarm/ml/walk_forward.py).
- **Bachelier (Normal) model** — Options pricing model assuming the underlying follows arithmetic Brownian motion (`dF = sigma dW`). Appropriate for low-priced or near-zero underlyings (rates, basis spreads).
See [alphaswarm/options/normal_model.py](../alphaswarm/options/normal_model.py).
- **Inverse option** — Option settled in the underlying asset (e.g. BTC) rather than quote currency (USD). Common on crypto venues like Deribit. See [alphaswarm/options/inverse_options.py](../alphaswarm/options/inverse_options.py).
- **Regime classifier** — Lightweight classifier that labels each bar as trending vs ranging using ADX threshold (default 25) or as bull/bear/neutral via multi-MA slope vote. See [alphaswarm/data/regime.py](../alphaswarm/data/regime.py).
- **Factor expression** — Tiny Polars-based DSL covering Alpha101 primitives (`Ts_Mean`, `Ts_Std`, `Rank`, `Decay_Linear`, `Delta`, `Ts_Corr`). See [alphaswarm/data/factor_expression.py](../alphaswarm/data/factor_expression.py).
- **Engle-Granger cointegration** — Two-step test for cointegrated pairs: OLS hedge ratio + ADF test on the residual. See [alphaswarm/data/cointegration.py](../alphaswarm/data/cointegration.py).
- **Triple-barrier label** — López de Prado labeling: look forward `horizon` bars, label `+1` if the upper barrier is hit first, `-1` if the lower barrier is hit first, `0` if the horizon is reached without either being hit. See [alphaswarm/data/labels.py](../alphaswarm/data/labels.py).
- **Yang-Zhang volatility** — OHLC vol estimator combining overnight, open-to-close, and Rogers-Satchell components. The most efficient of the OHLC family. See [alphaswarm/data/realised_volatility.py](../alphaswarm/data/realised_volatility.py).
- **LobStrategy** — ABC for limit-order-book strategies; subclasses emit `OrderIntent` lists in response to `LobState` updates. Engine integration is deferred — see [alphaswarm_snippets/extractions/_FUTURE_PROMPTS/lob_adapter_prompt.md](https://github.com/julianwileymac/alphaswarm/blob/main/alphaswarm_snippets/extractions/_FUTURE_PROMPTS/lob_adapter_prompt.md).
- **Dataset preset** — Curated declarative spec for a one-click ingestion (e.g. `intraday_momentum_etf`, `crypto_majors_intraday`).
See [alphaswarm/data/dataset_presets.py](../alphaswarm/data/dataset_presets.py).
- **Inspiration source** — One of seven external repos under `alphaswarm_snippets/inspiration/` from which strategies / models / agents were rehydrated. Tracked via the `source` kwarg on `alphaswarm.core.registry.register` and surfaced as the `source:*` tag.

## Testing

- **`tests/data/test_pipelines_smoke.py`** — Reference test for the Iceberg ingestion path. New ingest features should add a test in this directory.
- **`director_enabled=False`** — Pass when constructing `IngestionPipeline` in tests so the real LLM is bypassed in favour of the deterministic identity plan.

## Cross-repo

- **`agentic_assistants`** — Sibling repo providing the cross-system lineage API (`ALPHASWARM_AGENTIC_ASSISTANTS_API`).
- **`rpi_kubernetes`** — Sibling repo with the k8s deployment manifests under [alphaswarm_platform/deploy/k8s/](../alphaswarm_platform/deploy/k8s/).

# Documentation Index

Triple-axis table of contents for the AlphaSwarm docs.

> **Two entry points**:
>
> - Humans → [architecture.md](../concepts/platform/architecture.md)
> - AI agents → [../AGENTS.md](https://github.com/julianwileymac/alphaswarm/blob/main/AGENTS.md)
>
> Both link back here.
## Canonical runtime surfaces (May 2026)

| Surface | Canonical path | Status | Notes |
| --- | --- | --- | --- |
| Local setup + run | [operations/local-setup.md](../how-to/operations/local-setup.md) | active | Default entry point for local development |
| Kubernetes rollout | [operations/kubernetes-deploy.md](../how-to/operations/kubernetes-deploy.md) | active | Production-oriented deployment path |
| Tower 2-node rollout | [operations/tower-cluster-deploy.md](../how-to/operations/tower-cluster-deploy.md) | active | Dedicated tower+laptop target bootstrap path |
| AlphaSwarm blue/green cutover | [operations/alphaswarm-fund-blue-green-cutover.md](../how-to/operations/alphaswarm-fund-blue-green-cutover.md) | active | `alpha-swarm.ai` green-lane validation + switch + rollback |
| Deployment artifacts | [../alphaswarm_platform/deployments/README.md](https://github.com/julianwileymac/alphaswarm/blob/main/alphaswarm_platform/deployments/README.md) | active | Compose + Kubernetes manifests for current architecture |
| Operator UI | [../alphaswarm_client/README.md](https://github.com/julianwileymac/alphaswarm/blob/main/alphaswarm_client/README.md) | active | Vite frontend is the primary UI |
| AlphaSwarm IDE | [alphaswarm-ide.md](../concepts/infrastructure/alphaswarm-ide.md) | active | Theia 1.72 + 6 AlphaSwarm extensions + research copilot + notebook |
| Knowledge Base | [knowledge-base.md](../concepts/data/knowledge-base.md) | active | `alphaswarm_kb` boundary — KBRuntime + KBCorpusSpec + adapter trinity (HierarchicalRAG default, Cognee / Graphiti / Mem0 opt-in) + 4-scope KBLayerComposer + hybrid OpenFGA + OPA policy stack |
| KB federation gateway | [kb-federation.md](../concepts/data/kb-federation.md) | active | `alphaswarm_kb_federation` — cross-silo marketplace recall reverse-proxy |
| AlphaSwarm IDE roadmap | [alphaswarm-ide-roadmap.md](../concepts/infrastructure/alphaswarm-ide-roadmap.md) | active | Phased plan (Phase A shipped; B + C trigger-driven) |
| AlphaSwarm IDE CLI entrypoint | [../alphaswarm_cli/docs/index.md](https://github.com/julianwileymac/alphaswarm/blob/main/alphaswarm_cli/docs/index.md) | active | `alphaswarm-cli ide` is the canonical IDE entrypoint |
| Repository split map | [repository-split.md](../concepts/platform/repository-split.md) | migration | Domain boundaries for future standalone repositories |
| Monorepo path contract | [alphaswarm-monorepo-paths.md](../concepts/platform/alphaswarm-monorepo-paths.md) | active | Canonical paths for cross-repo references |
| Code index governance | [code-index-governance.md](../concepts/platform/code-index-governance.md) | active | Agent search/index workflow across split boundaries |
| Legacy Next.js UI | [webui.md](../concepts/trading/webui.md) | rollback | Keep only for emergency rollback context |
| Legacy Solara UI | [../alphaswarm/ui/](../alphaswarm/ui/) | rollback | Deprecated runtime surface |
| Legacy k8s manifests | [../alphaswarm_platform/deploy/k8s/README.md](https://github.com/julianwileymac/alphaswarm/blob/main/alphaswarm_platform/deploy/k8s/README.md) | legacy | Historical manifests; do not use for new rollouts |
| Archived planning/audit docs | [archive/README.md](../archive/README.md) | archive | Historical context only; not operational guidance |

## Operational snippet catalog

Reusable commands that are valid against the current repository layout:

```bash
# Generate local config from schema
make generate-config ENV=local

# Start the local workload stack
make dev

# Start the isolated admin/control-plane stack
make dev-admin

# Deploy current dev overlay to Kubernetes
make deploy-k8s ENV=dev
```

## By audience

### I'm new and human

1. [../README.md](https://github.com/julianwileymac/alphaswarm/blob/main/README.md) — what AlphaSwarm is, screenshots, release notes.
2. [architecture.md](../concepts/platform/architecture.md) — system map + request lifecycle.
3. [../CONTRIBUTING.md](https://github.com/julianwileymac/alphaswarm/blob/main/CONTRIBUTING.md) — set up the dev environment.
4. [glossary.md](../intro/glossary.md) — terms used everywhere.
5. Pick a subsystem from the table below.

### I'm an AI agent

1. [../AGENTS.md](https://github.com/julianwileymac/alphaswarm/blob/main/AGENTS.md) — terse rule-set + project map.
2. [../WORKFLOW.md](https://github.com/julianwileymac/alphaswarm/blob/main/WORKFLOW.md) — Plan / Act / Reflect cadence, FAST vs SLOW modes, intervention nodes.
3. [agentic-development.md](../concepts/agentic/agentic-development.md) — spec-pattern as the AlphaSwarm skill-artifact + ADLC security manifesto.
4. [../.cursor/rules/](../.cursor/rules) — glob-scoped rule files.
5. [glossary.md](../intro/glossary.md) — definitions.
6. [erd.md](../concepts/platform/erd.md) + [class-diagram.md](../concepts/platform/class-diagram.md) — structural maps.
7. [flows.md](../concepts/platform/flows.md) — end-to-end sequences.
8. [repository-split.md](../concepts/platform/repository-split.md) + [code-index-governance.md](../concepts/platform/code-index-governance.md) — current repo boundary map.
9. The relevant subsystem doc (table below).
10. (Cross-session work) [../.agents/state-template.md](https://github.com/julianwileymac/alphaswarm/blob/main/.agents/state-template.md).
## By lifecycle stage

```mermaid
flowchart LR
    Research --> Backtest --> Paper --> Live
    Backtest --> Agentic
    Agentic --> Backtest
    Live -.feedback.-> Research
```

| Stage | Docs |
| --- | --- |
| **Research** | [strategy-development.md](../concepts/strategy/strategy-development.md), [research-papers-rag.md](../concepts/data/research-papers-rag.md), [analysis-framework.md](../concepts/strategy/analysis-framework.md), [analysis-lab.md](../concepts/strategy/analysis-lab.md), [analysis-flows.md](../concepts/strategy/analysis-flows.md), [factor-research.md](../concepts/strategy/factor-research.md), [ml-framework.md](../concepts/strategy/ml-framework.md), [ml-libraries.md](../concepts/strategy/ml-libraries.md), [ml-alpha-backtest.md](../concepts/strategy/ml-alpha-backtest.md), [ml-flows.md](../concepts/strategy/ml-flows.md), [ml-preprocessing-pipeline.md](../concepts/strategy/ml-preprocessing-pipeline.md), [ml-builder.md](../concepts/strategy/ml-builder.md), [ml-testing.md](../concepts/strategy/ml-testing.md), [rl-framework.md](../concepts/rl/rl-framework.md), [rl-lab.md](../concepts/rl/rl-lab.md), [rl-components.md](../concepts/rl/rl-components.md), [rl-iceberg.md](../concepts/rl/rl-iceberg.md), [strategy-browser.md](../concepts/strategy/strategy-browser.md), [data-plane.md](../concepts/data/data-plane.md), [data-catalog.md](../concepts/data/data-catalog.md), [data-pipelines-hub.md](../concepts/data/data-pipelines-hub.md), [visualization-layer.md](../concepts/data/visualization-layer.md) |
| **Backtest** | [backtest-engines.md](../concepts/strategy/backtest-engines.md), [hft-backtest.md](../concepts/strategy/hft-backtest.md), [strategy-lifecycle.md](../concepts/strategy/strategy-lifecycle.md) |
| **Optimal control** | [optimal-control.md](../concepts/strategy/optimal-control.md), [portfolio-options-mm.md](../concepts/strategy/portfolio-options-mm.md), [microstructure-toxicity.md](../concepts/strategy/microstructure-toxicity.md) |
| **Agentic** | [agentic-pipeline.md](../concepts/agentic/agentic-pipeline.md), [providers.md](../concepts/data/providers.md) |
| **Bots** | [bots.md](../concepts/agentic/bots.md) (smallest deployable unit; aggregates universe + strategy + engine + ML + agents + RAG + metrics) |
| **Paper / Live** | [paper-trading.md](../concepts/trading/paper-trading.md), [live-market.md](../concepts/data/live-market.md), [streaming.md](../concepts/data/streaming.md), [streaming-admin.md](../concepts/data/streaming-admin.md) |
| **Cross-cutting** | [observability.md](../concepts/trading/observability.md), [../alphaswarm_client/README.md](https://github.com/julianwileymac/alphaswarm/blob/main/alphaswarm_client/README.md), [webui.md](../concepts/trading/webui.md) _(legacy)_, [core-types.md](../concepts/platform/core-types.md), [domain-model.md](../concepts/platform/domain-model.md), [alpha-vantage.md](../concepts/data/alpha-vantage.md), [credentials.md](../concepts/identity/credentials.md), [cloud-credentials.md](../concepts/identity/cloud-credentials.md), [identity.md](../concepts/identity/identity.md), [scim-provisioning.md](../concepts/identity/scim-provisioning.md), [msal-entra-setup.md](../concepts/identity/msal-entra-setup.md), [multi-tenancy.md](../concepts/identity/multi-tenancy.md), [kubernetes-adapter.md](../concepts/infrastructure/kubernetes-adapter.md), [kubernetes-rpi-deployment.md](../concepts/infrastructure/kubernetes-rpi-deployment.md), [local-platform.md](../concepts/platform/local-platform.md), [terraform-control-plane.md](../concepts/infrastructure/terraform-control-plane.md), [iac-runbook.md](../concepts/infrastructure/iac-runbook.md) |

## By subsystem

### Architecture + reference

| Doc | Purpose |
| --- | --- |
| [architecture.md](../concepts/platform/architecture.md) | System component diagram + request lifecycle |
| [erd.md](../concepts/platform/erd.md) | Per-domain entity-relationship diagrams |
| [class-diagram.md](../concepts/platform/class-diagram.md) | Class hierarchies (Symbol, LLMProvider, Strategy, Engines, Pipeline) |
| [data-dictionary.md](../reference/data-dictionary/index.md) | Table-by-table column reference |
| [flows.md](../concepts/platform/flows.md) | Sequence diagrams for ingestion / backtest / agents / paper |
| [glossary.md](../intro/glossary.md) | Project-specific terminology |
| [domain-model.md](../concepts/platform/domain-model.md) | Narrative on the domain types |
| [core-types.md](../concepts/platform/core-types.md) | `Symbol`, enums, dataclasses |
| [repository-split.md](../concepts/platform/repository-split.md) | Future repository/domain boundary map |
| [code-index-governance.md](../concepts/platform/code-index-governance.md) | Agent search and code-index rules |

### Data plane

| Doc | Purpose |
| --- | --- |
| [data-plane.md](../concepts/data/data-plane.md) | Provider → cache → DuckDB view pipeline |
| [data-catalog.md](../concepts/data/data-catalog.md) | Iceberg catalog + ingest pipeline |
| [data-self-service.md](../concepts/data/data-self-service.md) | Master narrative for the four-phase self-service data fabric expansion |
| [datasets-catalog.md](../concepts/data/datasets-catalog.md) | Kedro-style `BaseDataset` abstraction (data fabric phase 0) |
| [metadata-cache.md](../concepts/data/metadata-cache.md) | Redis prefetch cache backing every entity dropdown (data fabric phase 0) |
| [data-discovery.md](../concepts/data/data-discovery.md) | Active discovery browser unifying ingested + uningested catalog entries (data fabric phase 1) |
| [airbyte-builder.md](../concepts/data/airbyte-builder.md) | Schema-driven Airbyte connector builder + AlphaSwarm Fetcher codegen (data fabric phase 2) |
| [dagster-sandbox.md](../concepts/data/dagster-sandbox.md) | Ephemeral interactive Dagster + Airbyte sandbox console (data fabric phase 3) |
| [visualization-layer.md](../concepts/data/visualization-layer.md) | Trino-backed Superset and Bokeh exploration layer |
| [pgvector-control-plane.md](../concepts/data/pgvector-control-plane.md) | pgvector control plane — `data.vector.*` MCP tools + PgVector dataset kind + alembic 0045 (Phase 3 refactor) |
| [codebase-mcp.md](../concepts/data/codebase-mcp.md) | Codebase MCP server — agent view of the AlphaSwarm source tree via `codebase.*` tools (Phase 2 refactor) |
| [sera.md](../concepts/data/sera.md) | SERA (Ai2 Open Coding Agents) as an opt-in LLM provider for the codebase MCP elaborator (Phase 2.5 refactor) |
| [analytics-frontend.md](../concepts/data/analytics-frontend.md) | Interactive analytics in the Vite frontend — QuantStats tearsheets / rolling / underwater / drawdown / ML overlays (Phase 4 refactor) |
| [agent-watchdog.md](../concepts/data/agent-watchdog.md) | Agent stall watchdog Celery beat task + `GET /agents/health` + `data.agents.health` MCP tool (Phase 5 refactor) |
| [alpha-vantage.md](../concepts/data/alpha-vantage.md) | AV provider quota + cache |
| [streaming.md](../concepts/data/streaming.md) | Kafka topic taxonomy + ingester layout |
| [live-market.md](../concepts/data/live-market.md) | Live subscription + WebSocket relay |

### Strategy + ML

| Doc | Purpose |
| --- | --- |
| [analysis-framework.md](../concepts/strategy/analysis-framework.md) | Hash-locked AnalysisSpec + AnalysisRuntime umbrella |
| [analysis-lab.md](../concepts/strategy/analysis-lab.md) | Hybrid `/analysis/lab` UI (dataset-tabs + XYFlow Composer) |
| [analysis-flows.md](../concepts/strategy/analysis-flows.md) | Per-flow reference for the analysis catalog |
| [factor-research.md](../concepts/strategy/factor-research.md) | Building factor / alpha strategies |
| [ml-framework.md](../concepts/strategy/ml-framework.md) | Train → register → deploy → score |
| [ml-libraries.md](../concepts/strategy/ml-libraries.md) | Per-library reference (TF/Keras/Prophet/sklearn/PyOD/sktime/HF) |
| [ml-alpha-backtest.md](../concepts/strategy/ml-alpha-backtest.md) | `AlphaBacktestExperiment` orchestrator + 
`MLAlphaBacktestRun` schema | | [ml-flows.md](../concepts/strategy/ml-flows.md) | Lightweight workbench flows catalog | | [ml-preprocessing-pipeline.md](../concepts/strategy/ml-preprocessing-pipeline.md) | ML preprocessors as data-engine pipeline nodes | | [ml-builder.md](../concepts/strategy/ml-builder.md) | Graphical experiment builder UX | | [ml-testing.md](../concepts/strategy/ml-testing.md) | Interactive ML testing workbench | | [mlops-service.md](../concepts/strategy/mlops-service.md) | Initial MLOps service — agent-facing interfaces, lifecycle handlers, MLSkill spec/runtime, OOD rules, dedicated `alphaswarm-ml-mcp` server | | [backtest-engines.md](../concepts/strategy/backtest-engines.md) | Engine catalogue + invariants (vbt-pro primary, event-driven, ZVT, AAT, fallback) | | [vbtpro-integration.md](../concepts/strategy/vbtpro-integration.md) | Deep vectorbt-pro integration: modes, hooks, agent + ML components, walk-forward | | [hft-backtest.md](../concepts/strategy/hft-backtest.md) | hftbacktest-driven LOB engine, ``LobStrategy`` API, latency / queue models | | [optimal-control.md](../concepts/strategy/optimal-control.md) | JAX-compiled HJB solvers — Avellaneda-Stoikov + Cartea-Jaimungal-Penalva | | [portfolio-options-mm.md](../concepts/strategy/portfolio-options-mm.md) | Lucic-Tse 2024-2026 portfolio-level options market making | | [microstructure-toxicity.md](../concepts/strategy/microstructure-toxicity.md) | Toxicity regime detection + agent-driven YAML mutation loop | | [strategy-lifecycle.md](../concepts/strategy/strategy-lifecycle.md) | draft → backtested → paper → live | | [strategy-browser.md](../concepts/strategy/strategy-browser.md) | Data-browser → strategy spec UX | ### Agentic | Doc | Purpose | | --- | --- | | [agentic-development.md](../concepts/agentic/agentic-development.md) | AlphaSwarm's spec-pattern as the agentic-coder skill-artifact equivalent + consolidated ADLC security manifesto | | 
[multi-agent-patterns.md](../concepts/agentic/multi-agent-patterns.md) | Sequential / Parallel / Debate / Coordinator / ReAct topologies mapped to [alphaswarm/agents/graph/](../alphaswarm/agents/graph/) + the seven orchestration adapter topologies | | [workflow-studio.md](../concepts/agentic/workflow-studio.md) | Additive orchestration control plane — `WorkflowSpec` + `WorkflowRuntime` + seven adapters + replayable runs | | [orchestration-refactor-rollout.md](../concepts/agentic/orchestration-refactor-rollout.md) | Operator rollout / rollback runbook for every `ALPHASWARM_ORCHESTRATION_*` flag | | [agentic-pipeline.md](../concepts/agentic/agentic-pipeline.md) | Crew control plane | | [providers.md](../concepts/data/providers.md) | LLM provider registry + tier routing | ### Trading + operations | Doc | Purpose | | --- | --- | | [paper-trading.md](../concepts/trading/paper-trading.md) | Session loop + risk model | | [paper-metadata-gate.md](../concepts/trading/paper-metadata-gate.md) | Strict startup metadata validation + operator runbook | | [bots.md](../concepts/agentic/bots.md) | Bot entity (TradingBot / ResearchBot), graphical builder, deployment | | [observability.md](../concepts/trading/observability.md) | OTEL → Jaeger + structured logs | | [../alphaswarm_client/README.md](https://github.com/julianwileymac/alphaswarm/blob/main/alphaswarm_client/README.md) | Active Vite frontend route/model overview | | [webui.md](../concepts/trading/webui.md) | Legacy Next.js page tree (rollback only) | ## Latest changes | Doc | Last touched | | --- | --- | | [data-catalog.md](../concepts/data/data-catalog.md) | Persistent host warehouse + Director | | [glossary.md](../intro/glossary.md) | New (covers Director, Iceberg conventions, tiers) | | [architecture.md](../concepts/platform/architecture.md) | New (replaces README ASCII art) | | [erd.md](../concepts/platform/erd.md) | New (per-domain ERDs across 110+ tables) | | [class-diagram.md](../concepts/platform/class-diagram.md) | 
New (5 hierarchies) | | [data-dictionary.md](../reference/data-dictionary/index.md) | New (15 sections) | | [flows.md](../concepts/platform/flows.md) | New (5 flows) | ## Doc conventions - **Mermaid** is the diagram format. GitHub renders it natively. Don't commit PNG/SVG diagrams unless they're irreplaceable. - **Cross-link** with relative markdown paths (for example, `bar.md`) so the navigation works on GitHub and locally. - **Cite code** with full repo paths from the doc: `[alphaswarm/data/pipelines/director.py](../alphaswarm/data/pipelines/director.py)`. Don't link to specific line numbers (they bit-rot fast). - **Keep it short** — narrative goes in subsystem docs, definitions in [glossary.md](../intro/glossary.md), structure in [erd.md](../concepts/platform/erd.md) / [class-diagram.md](../concepts/platform/class-diagram.md). Don't repeat yourself. # Installation > | Extra | Native build | Notes | | --- | --- | --- | | `optimal-control` | None (pure Python) | Ships the JAX HJB / Lucic-Tse stack. CPU-only by default. | | `hft` | Rust + Maturin | Ships the [hftbac... # Installation This page documents the install-time requirements for AlphaSwarm and its optional extras. The base install is pure Python and runs on Linux / macOS / Windows. Two extras ship with native build steps that need attention: | Extra | Native build | Notes | | --- | --- | --- | | `optimal-control` | None (pure Python) | Ships the JAX HJB / Lucic-Tse stack. CPU-only by default. | | `hft` | Rust + Maturin | Ships the [hftbacktest](https://github.com/nkaz001/hftbacktest) LOB engine. | ## Base install ```bash pip install -e . ``` That gives the FastAPI app, Celery worker, default config, agents, analysis flows, and the in-memory backtest fallbacks. No GPU, no native toolchains. ## `[optimal-control]` — JAX + HJB + Lucic-Tse ```bash pip install -e ".[optimal-control]" ``` This pulls: - `jax>=0.4.30` and `jaxlib>=0.4.30` (CPU build, manylinux / win / macOS wheels available on PyPI). 
- `finhjb>=0.1.6` — JAX HJB solver framework. - `fast-vollib>=0.1.4` — vectorised IV + Greeks (auto-detects the JAX backend; falls back to NumPy when JAX is missing). - `mbt-gym` — pulled directly from [JJJerome/mbt_gym](https://github.com/JJJerome/mbt_gym) `main` because the package is not on PyPI. ### GPU / Metal acceleration (opt-in) After installing the extra, swap the CPU `jaxlib` wheel for the CUDA or Metal variant. JAX's docs are the canonical source; the short form: ```bash # NVIDIA CUDA 12 (Linux only) pip install -U "jax[cuda12_pip]" -f https://storage.googleapis.com/jax-releases/jax_cuda_releases.html # Apple Silicon (macOS) pip install -U "jax-metal" ``` The `alphaswarm.optimal_control` and `alphaswarm.options.greeks_jax` modules pick up the accelerated backend automatically — no AlphaSwarm code changes needed. ## `[hft]` — hftbacktest LOB engine ```bash pip install -e ".[hft]" ``` This pulls `hftbacktest>=2.0.0`, `numba>=0.61`, and `polars>=1.0`. Most `hftbacktest` releases ship as source distributions and need a Rust toolchain plus Maturin at install time: ```bash # 1. Install Rust + Cargo (https://rustup.rs) curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh # 2. Restart your shell so `cargo` is on PATH, then verify cargo --version # 3. Install Maturin (build backend hftbacktest uses) pip install maturin # 4. Now install the AlphaSwarm extra pip install -e ".[hft]" ``` On Windows the equivalent is the rustup-init.exe installer plus the "Microsoft Visual C++ Build Tools" (MSVC linker is required by the hftbacktest crate). On macOS Apple Silicon, the standard rustup install works; no extra steps. ### Verifying the install ```bash python -c "from alphaswarm.backtest.hft import LobBacktestEngine; print('OK')" ``` If that succeeds, the engine is ready to drive any `LobStrategy` subclass under `alphaswarm/strategies/hft/`. 
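
The one-line import check above covers only the `hft` engine. As a hedged sketch, the snippet below reports which top-level modules from the two extras described on this page can be located, without actually importing them. The `EXTRA_MODULES` mapping and the `missing_modules` helper are illustrative names, not part of AlphaSwarm; the module names are assumed to match the import names of the packages listed above (`jax`, `jaxlib`, `hftbacktest`, `numba`, `polars`).

```python
from importlib.util import find_spec

# Illustrative grouping (an assumption, not an AlphaSwarm API): top-level
# import names for the packages each extra is documented to pull in.
EXTRA_MODULES = {
    "optimal-control": ["jax", "jaxlib"],
    "hft": ["hftbacktest", "numba", "polars"],
}


def missing_modules(extras=EXTRA_MODULES):
    """Return {extra: [module names that cannot be located]}.

    find_spec() only looks the module up on the import path; it does not
    execute the module, so this stays cheap even for heavy packages.
    """
    report = {}
    for extra, modules in extras.items():
        report[extra] = [name for name in modules if find_spec(name) is None]
    return report


if __name__ == "__main__":
    for extra, missing in missing_modules().items():
        status = "OK" if not missing else "missing: " + ", ".join(missing)
        print(f"[{extra}] {status}")
```

An empty list for an extra only means its modules are discoverable; a broken native build can still fail at import time, so the `LobBacktestEngine` import check remains the stronger test for `[hft]`.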
## `[full]` `pip install -e ".[full]"` chains every optional extra including `optimal-control` and `hft`, so it requires the Rust toolchain. Most contributors install `[full]` minus `[hft]`: ```bash pip install -e ".[auth,alpaca,ibkr,otel,paper,vectorbt,ml,ml-torch,ml-forecast,portfolio,fred,sec,iceberg,agents-rag,optimal-control]" ``` ## See also - [alphaswarm_docs/optimal-control.md](../concepts/strategy/optimal-control.md) — HJB primer + AvSt + CJ. - [alphaswarm_docs/portfolio-options-mm.md](../concepts/strategy/portfolio-options-mm.md) — Lucic-Tse. - [alphaswarm_docs/hft-backtest.md](../concepts/strategy/hft-backtest.md) — `LobBacktestEngine` walk-through. - [alphaswarm_docs/local-platform.md](../concepts/platform/local-platform.md) — single-machine deployment. # Quickstart > Stand up an AlphaSwarm dev stack and run your first backtest in under 30 seconds of typing. # Quickstart Target: go from a fresh checkout of `alphaswarm` to a green backtest result in under 30 seconds of typing (plus the first-time Docker image pull, which is unavoidable). ## Prerequisites - Docker Desktop or compatible engine running locally. - Python 3.11 and `make` on your PATH. - The repo cloned to disk. ## One-paste quickstart ```bash # 1. Boot the canonical compose stack. make dev # 2. Check that /readyz returns 200 (re-run until it does). curl http://localhost:8000/readyz # 3. Run the bundled example backtest. docker exec alphaswarm-api python -m alphaswarm.cli.cli backtest \ --config configs/strategies/momentum_demo.yaml \ --start 2024-01-01 --end 2024-06-30 ``` If the third command returns a JSON summary with non-zero `sharpe` and `total_return`, your dev stack is healthy. ## What just happened - `make dev` boots the canonical compose profile defined in [alphaswarm_platform/deployments/compose/docker-compose.dev.yml](https://github.com/julianwileymac/alphaswarm/blob/main/alphaswarm_platform/deployments/compose/docker-compose.dev.yml). 
This brings up Postgres + Redis + the Iceberg REST catalog + `alphaswarm-core` (FastAPI) + `alphaswarm-worker` (Celery) + `alphaswarm-beat`. - `curl http://localhost:8000/readyz` confirms the FastAPI gateway is serving requests against a migrated Postgres schema. Migrations run automatically on first boot via the `alphaswarm-api` container's `entrypoint.sh`. - The backtest command dispatches a Celery task that pulls the example momentum strategy, runs it against the seeded data, and writes a `backtest_runs` ledger row. ## Next steps 1. Want to see the run in the UI? Open [http://localhost:3001](http://localhost:3001) — that is the Vite operator UI ([alphaswarm_client](https://github.com/julianwileymac/alphaswarm/tree/main/alphaswarm_client)). 2. Want to add your own strategy? Read [Recipe: Add a strategy](../how-to/recipes/add-a-strategy.md). 3. Want to set up paper trading? Read [Concept: paper trading](../concepts/trading/paper-trading.md) followed by [Tutorial: first paper trading session](../tutorials/first-paper-trading-session.md). 4. Want to deploy this to Kubernetes? Read [How-to: Kubernetes deploy](../how-to/operations/kubernetes-deploy.md). ## If it does not work The `/readyz` probe is the single canonical health check. If it returns non-200 within 60 seconds: - Check `docker compose logs alphaswarm-api` for migration errors. - Confirm Postgres is reachable: `docker exec alphaswarm-postgres pg_isready`. - Confirm Redis is reachable: `docker exec alphaswarm-redis redis-cli ping`. - Verify the Iceberg REST catalog is up: `curl http://localhost:8181/v1/config`. If the backtest command itself errors out, the most common cause is a stale Iceberg manifest from a prior dev cycle. Tear down with `make down && docker volume prune -f` and re-run. For deeper debugging, see [How-to: incident response](../how-to/operations/incident-response.md). # Repository orientation > Top-level map of every alphaswarm_* package and where each subsystem lives in the monorepo. 
# Repository orientation AlphaSwarm is a monorepo organised by responsibility. The boundary between packages is enforced by the always-on Cursor rule [repository-boundaries.mdc](https://github.com/julianwileymac/alphaswarm/blob/main/.cursor/rules/repository-boundaries.mdc) and by `import` guards in CI. ## Top-level packages - **`alphaswarm/`** — the quant runtime. FastAPI gateway, Celery workers, strategy framework, backtest engines, agent control plane, RAG, Iceberg writers, persistence models. - **`alphaswarm_controller/`** — workload lifecycle / `/manage/*` API / Terraform driver / provider adapters. Never imports `alphaswarm.*`. See [Concept: control plane topology](../concepts/infrastructure/control-plane-topology.md). - **`alphaswarm_core/`** — shared value types, ABCs, auth filters, topology contracts. Dependency-light. - **`alphaswarm_client/`** — active Vite + React 19 + Tailwind 4 operator UI. Served at `alpha-swarm.ai`. - **`alphaswarm_ui/`** — cloud-hosted, customer-facing PaaS frontend (Next.js 14+). Served at `alpha-swarm.ai`. Dual Auth0 (B2C) + Entra (B2B) identity. - **`alphaswarm_admin/`** — internal admin (managed services + company accounts). Audit-first. Served at `manage.alpha-swarm.ai`. - **`alphaswarm_rl/`** — RL subsystem: hash-locked `RLExperimentSpec` + `RLRuntime` + Iceberg trajectory store. Legacy `alphaswarm.rl.*` is a deprecation shim. - **`alphaswarm_models/`** — custom model pulling, building, training, evaluating, serving (vLLM + Ollama). Legacy `alphaswarm.ml.*` is a deprecation shim. - **`alphaswarm_bots/`** — bot templates and bot runtime (`TradingBot` / `ResearchBot`). - **`alphaswarm_ide/`** — Theia 1.72-based IDE + AlphaSwarm extensions (`alphaswarm`, `alphaswarm-shell`, `alphaswarm-mcp-bridge`, `alphaswarm-research-copilot`, `alphaswarm-notebook-quant`, `alphaswarm-quant`). - **`alphaswarm_cli/`** — standalone operator CLI (`alphaswarm-cli`). HTTP-only; never imports `alphaswarm.*`. RFC 8628 device auth + OS keyring storage. 
- **`alphaswarm_platform/`** — hosted deployment + build + IaC + cluster setup. Manifests, Helm charts, Terraform modules, Docker base images. No Python runtime imports. - **`alphaswarm_index/`** — single source of truth for project orientation (this site links into it but never modifies it; sole-writer is the `alphaswarm-index-curator` subagent). - **`alphaswarm_docs/`** — this site. ## Where to look for X - API route: [`alphaswarm/api/routes/`](https://github.com/julianwileymac/alphaswarm/tree/main/alphaswarm/api/routes). - Celery task: [`alphaswarm/tasks/`](https://github.com/julianwileymac/alphaswarm/tree/main/alphaswarm/tasks). - Strategy: [`alphaswarm/strategies/`](https://github.com/julianwileymac/alphaswarm/tree/main/alphaswarm/strategies). - Persistence model: [`alphaswarm/persistence/`](https://github.com/julianwileymac/alphaswarm/tree/main/alphaswarm/persistence). - Migration: [`alembic/versions/`](https://github.com/julianwileymac/alphaswarm/tree/main/alembic/versions). - Iceberg writer: [`alphaswarm/data/iceberg_catalog.py`](https://github.com/julianwileymac/alphaswarm/blob/main/alphaswarm/data/iceberg_catalog.py). - LLM gateway: [`alphaswarm/llm/providers/router.py`](https://github.com/julianwileymac/alphaswarm/blob/main/alphaswarm/llm/providers/router.py). - Configuration: [`alphaswarm/config/settings.py`](https://github.com/julianwileymac/alphaswarm/blob/main/alphaswarm/config/settings.py). ## Hard rules The full agent-readable rule-set is in [AGENTS.md](https://github.com/julianwileymac/alphaswarm/blob/main/AGENTS.md). The cardinal subset: 1. **Symbols**: `Symbol.parse(vt_symbol)` — never split on `.`. 2. **LLM calls**: `router_complete` only — never `litellm.completion` or vendor SDKs. 3. **Iceberg writes**: `iceberg_catalog.append_arrow` only — never raw PyIceberg. 4. **Celery progress**: `emit / emit_done / emit_error` from `alphaswarm/tasks/_progress.py` — never publish to Redis from task code. 5. 
**Configuration**: `from alphaswarm.config import settings` — never construct a fresh `Settings()`. 6. **Registry**: `@register("Name", kind=...)` for every new strategy / model / engine / alpha / portfolio / sink. 7. **Migrations**: immutable once committed. 8. **Cross-task state**: Postgres only; never pickle ORM objects. The full set is 55 hard rules + a Don'ts section in AGENTS.md. ## Conventions See [Conventions](./conventions.md) for documentation style and authoring rules. # operations/break-glass # Break-glass runbook Procedure for assuming the `AqpAdminBreakGlassRole` during an incident. The role is **only** to be used when: 1. Normal operator pathways (KillSwitch, scoped admin roles) have failed. 2. A documented incident ticket exists. 3. **Two named operators** are available (4-eyes principle). ## Mechanics - The role itself carries no permissions until an `AdministratorAccess`-attaching Lambda runs. - The attach is triggered by the second operator's approval through `alphaswarm_admin/services/break_glass.py`. - The session has a **hard 60-minute auto-expiry** enforced by EventBridge calling the detach Lambda. - Every API call while the role is active is reported to Security Hub as a HIGH-severity finding. ## Steps ### Operator A — file the request 1. Open `/admin/accounts` in the admin UI. 2. Click **"Break-glass request"** (visible only to users with the `alphaswarm-superadmin` role). 3. Fill in: - **Reason** (free-text, mandatory). - **Incident id** (Linear / Sentry / PagerDuty link). - **Duration** (max 60 minutes). 4. Submit. The request lands in the audit ledger as `admin.break_glass.request`. ### Operator B — approve 1. Watch for the Slack notification from the `#alphaswarm-security-incidents` channel. 2. Open the request URL the notification links to. 3. Verify Operator A's reason + incident id. 4. Click **"Approve"**. Step-up MFA is required. 5. The Lambda fires and attaches `AdministratorAccess` to the target role. 
Audit row: `admin.break_glass.approve` -> `admin.break_glass.attach`. ### Operator A — perform the action 1. `aws sts assume-role --role-arn \ --role-session-name "incident-"`. 2. Carry out the minimum action required. 3. The session SHOULD be terminated early via the admin UI's **"Detach"** button as soon as the action completes. ### Auto-expiry If 60 minutes elapse, EventBridge invokes the detach Lambda automatically. Audit row: `admin.break_glass.expire`. ## Post-incident - Both operators sign the post-incident review. - Security officer reviews the Security Hub findings + audit trail within 24h. - Anything done while the role was active is reproduced in a small, scoped follow-up PR if it should be permanent. # operations/dr-replay # DR replay runbook Disaster-recovery rehearsal procedure for AlphaSwarm. Targets: - **RPO 1 hour** for `alphaswarm_admin` + control-plane services. - **RTO 4 hours** for the same. - **RPO 15 minutes** for trading-relevant data. - **RTO 1 hour** for the same. The exercise is run quarterly (calendar reminder owned by the platform team). The first exercise is scheduled for the end of Phase 5 of the multi-account overhaul. ## Pre-requisites - AWS Organizations + Control Tower applied (Phase 4 complete). - ArgoCD app-of-apps applied to dev + staging + prod clusters. - Velero installed on every workload cluster (chart at [alphaswarm_platform/deployments/kubernetes/helm/velero](../../../alphaswarm_platform/deployments/kubernetes/helm/velero/)). - ECR cross-region replication active to `us-west-2`. - RDS cross-region read replica green. - S3 CRR active on every Parquet + audit-archive bucket. - Route 53 health-check failover record set on the `manage.alpha-swarm.ai` ingress. ## Steps ### 1. Trigger the failure Pick the rehearsal target — typically `alphaswarm-dev` (never prod). Document the start time in the incident ticket. ```bash # Disable the dev cluster's API server (simulates a control-plane outage). 
aws eks update-cluster-config \ --name alphaswarm-dev \ --region us-east-1 \ --resources-vpc-config endpointPrivateAccess=false,endpointPublicAccess=false ``` ### 2. Confirm impact `alphaswarm_admin` should now show `unreachable` for the dev cluster under `/admin/kubernetes/status`. The KillSwitch should still work because it fans out to other clusters too. ### 3. Bring up the replay cluster ```bash cd infrastructure/envs/dev terraform apply -var-file=terraform.tfvars ``` This re-creates the EKS cluster with the same name + node groups. ArgoCD picks up the new cluster via its Cluster generator (label `alphaswarm.io/managed=true`). ### 4. Replay state from Velero ```bash velero backup-location get velero restore create dr-replay-$(date +%s) \ --from-backup daily-full-$(velero backup get | tail -1 | awk '{print $1}') ``` ### 5. Restore RDS The cross-region read replica in `us-west-2` is promoted to primary; the DR replay points the dev cluster's RDS DSN at the new primary. The Postgres instance comes up with the audit ledger intact so no admin actions are lost. ### 6. Verify - `alphaswarm_admin` health should return 200 within 4h. - The audit ledger should show the gap as a single contiguous block (no missing rows beyond the RPO window). - Paper-trading runs that were active are stamped `status=halted` by the watchdog. - The ArgoCD app-of-apps sync should converge within 15min after the cluster comes back. ### 7. Document Append to the rehearsal log at `alphaswarm_docs/docs/operations/dr-rehearsal-log.md` with: - Start / end timestamps. - Actual RPO + RTO measured. - Issues encountered + remediations. - Sign-off from the security officer. # operations/multi-account-rollout # Multi-account rollout runbook The Phase 4 Control Tower + cross-account IAM + dev->staging promotion + IdP cutover. The Terraform code is shipped under `infrastructure/`; this runbook is what the operator follows to apply it. ## 1. 
Bootstrap ```bash # In the AWS Org master account, with the platform-admin role: cd infrastructure/bootstrap export AWS_PROFILE=alphaswarm-org-master terraform init terraform apply -var=account_alias=master ``` Capture the outputs (KMS arn, GitHub OIDC arn, etc.). ## 2. Landing zone ```bash cd infrastructure/modules/landing-zone terraform init terraform apply ``` This stands up the 5 OUs + SCPs. The first `apply` takes ~15 minutes because Control Tower has to enrol every region one at a time. ## 3. Workload accounts For each workload account (dev, staging, prod), create the account via the `account` module from the master account, then re-run `bootstrap/` against the new account. ```bash cd infrastructure terraform apply \ -target=module.account.dev \ -var='dev_email=aws-alphaswarm-dev@alpha-swarm.ai' \ -var='external_id=...' ``` ## 4. Per-environment composition For each env (`dev`, `staging`, `prod`): ```bash cd infrastructure/envs/dev cp terraform.tfvars.example terraform.tfvars # Fill in real values per the example. terraform init -backend-config=backend.hcl terraform plan terraform apply ``` ## 5. Cross-account IAM Provision the four canonical roles per blueprint §4.2: - `AqpAdminDeploymentRole` (cross-account assume from shared-services) - `AqpAdminReadOnlyAuditRole` - `AqpAdminBreakGlassRole` (Deny-everything by default; attach the Lambda from §9.3 of the blueprint to attach `AdministratorAccess` on approved break-glass) - `GitHubActionsDeployRole` (federated via the OIDC provider from `bootstrap/`) These are wired through the `iam-irsa-roles` + `github-oidc` modules per env. Confirm with: ```bash aws sts assume-role \ --role-arn arn:aws:iam::${DEV_ACCOUNT_ID}:role/AqpAdminDeploymentRole \ --role-session-name dev-smoke \ --external-id "$EXTERNAL_ID" ``` ## 6. Promote dev to staging Use the `alphaswarm_admin/src/alphaswarm_admin/services/account_promoter.py` wizard via the `/admin/accounts` UI. The wizard: 1. Replicates ECR artifacts cross-region. 2. 
Templates the staging Helm overlay from dev (with `prod.deny.json` allowlist filtering). 3. Applies the staging Terraform workspace. The same wizard handles staging -> prod once the staging burn-in period (recommended: 14 days) completes. ## 7. IdP cutover If you are migrating from the existing Auth0 tenant to AWS IAM Identity Center: 1. Provision IAM Identity Center via Control Tower (one-click). 2. Create the AlphaSwarm application in Identity Center; copy the issuer URL + audience. 3. Set `ALPHASWARM_AUTH_PROVIDER=aws_iam_identity_center` + `ALPHASWARM_AUTH_OIDC_ISSUER=...` + `ALPHASWARM_AUTH_OIDC_AUDIENCE=...`. 4. The [`AwsIamIdentityCenterProvider`](../../../alphaswarm/auth/providers/aws_iam_identity_center.py) subclass auto-registers via the `IdentityProviderMeta` metaclass per AGENTS rule 27. No manual `@register` decorator. 5. Group sync: create `IdpGroupMapping` rows with `connection_kind="aws_iam_identity_center"` for every Identity Center group that should map to an AlphaSwarm role. 6. Test login with a single staff account before flipping the default for the org. ## 8. Production cutover Production cutover follows the same dev->staging recipe via `account_promoter.py`. Step-up MFA + 4-eyes approval are enforced server-side; the operator runs the wizard from the admin UI. After cutover: - ArgoCD ApplicationSet picks up the prod cluster via its `alphaswarm.io/managed=true` label. - ArgoCD Image Updater auto-bumps image tags from new ECR digests when the GHA pipeline produces them. - The legacy ADR-002 single-container Solara deployment is decommissioned in the follow-up `alphaswarm_admin-overhaul-cleanup` PR. # GET /livez > Liveness probe. # Liveness probe. Liveness probe. > **Method:** `GET` > **Path:** `/livez` > **Tag:** `health` > **OperationId:** `get-livez` See the [interactive playground](../index.mdx) for parameter forms, response schemas, and credential persistence. 
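
A minimal sketch of calling this probe from a script, assuming the gateway is reachable at `http://localhost:8000` as in the Quickstart. The `probe` helper is a hypothetical name, not an AlphaSwarm API, and it uses only the Python standard library.

```python
import urllib.error
import urllib.request


def probe(url, timeout=2.0):
    """Return the HTTP status code for a GET on `url`, or None if unreachable."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status
    except urllib.error.HTTPError as exc:
        # The server answered, but with a 4xx/5xx status.
        return exc.code
    except (urllib.error.URLError, OSError):
        # Connection refused, DNS failure, timeout, and similar.
        return None


if __name__ == "__main__":
    # Assumed base URL; adjust to wherever the gateway is actually served.
    base = "http://localhost:8000"
    for path in ("/livez", "/readyz"):
        print(path, probe(base + path))
```

Returning `None` for an unreachable host keeps "process is down" distinguishable from "process is up but answered with an error status", which is the distinction a liveness check is meant to surface.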
## Source spec This page is generated from `alphaswarm_docs/openapi/alphaswarm.json` by [`alphaswarm_docs/scripts/generate-openapi-mdx.ts`](https://github.com/julianwileymac/alphaswarm/blob/main/alphaswarm_docs/scripts/generate-openapi-mdx.ts). Refresh by re-running `pnpm --filter alphaswarm_docs generate-openapi-mdx`. # GET /readyz > Readiness probe — confirms migrations applied + downstreams reachable. # Readiness probe — confirms migrations applied + downstreams reachable. Readiness probe — confirms migrations applied + downstreams reachable. > **Method:** `GET` > **Path:** `/readyz` > **Tag:** `health` > **OperationId:** `get-readyz` See the [interactive playground](../index.mdx) for parameter forms, response schemas, and credential persistence. ## Source spec This page is generated from `alphaswarm_docs/openapi/alphaswarm.json` by [`alphaswarm_docs/scripts/generate-openapi-mdx.ts`](https://github.com/julianwileymac/alphaswarm/blob/main/alphaswarm_docs/scripts/generate-openapi-mdx.ts). Refresh by re-running `pnpm --filter alphaswarm_docs generate-openapi-mdx`. # API reference > Interactive Scalar-rendered reference for the public AlphaSwarm API at api.alpha-swarm.ai. Auto-regenerated on every commit via the openapi-export-alphaswarm CI job. # API reference This page is auto-generated from [alphaswarm_docs/openapi/alphaswarm.json](https://github.com/julianwileymac/alphaswarm/blob/main/alphaswarm_docs/openapi/alphaswarm.json), which itself is regenerated on every commit via the `openapi-export-alphaswarm` job in [.github/workflows/ci.yml](https://github.com/julianwileymac/alphaswarm/blob/main/.github/workflows/ci.yml). If the committed spec and the live CI-dumped spec diverge, [`oasdiff`](https://github.com/Tufin/oasdiff) blocks the PR. ## Surface The public AlphaSwarm API at `api.alpha-swarm.ai` exposes: - **`/health`**, **`/livez`**, **`/readyz`** — health, liveness, and readiness probes. - **`/strategies/*`** — strategy CRUD + dispatch. 
- **`/backtest/*`** — backtest dispatch, status, cancel. - **`/bots/*`** — bot CRUD, snapshot, backtest, paper, deploy. - **`/rl/*`** — RL train, replay, walk-forward, halt. - **`/agents/*`** — agent dispatch, halt, watchdog. - **`/workflows/*`** — workflow runtime endpoints. - **`/paper/*`** — paper trading session controls. - **`/analytics/*`** — QuantStats portfolio metrics + tearsheet rendering. - **`/mcp/data/*`** — the Data MCP server. - **`/mcp/codebase/*`** — the Codebase MCP server. ## Interactive playground The Scalar component below loads `openapi/alphaswarm.json` and renders an interactive playground. Authenticate with the `Authorization: Bearer <token>` header. Tokens come from the `alphaswarm-cli auth login` device-flow path (see [Concept: identity](../../concepts/identity/identity.md)). ## SDKs - TypeScript: `npm install @alphaswarm/sdk` (generated via Fern; see [reference/python](../python/index.mdx)). - Python: `pip install alphaswarm-sdk` (Phase 6). ## Versioning AlphaSwarm uses Stripe-style date-epoch versioning. The first epoch is `2026-06-01`. New epochs preserve old contracts; the `Deprecation` (RFC 9745) + `Sunset` (RFC 8594) HTTP headers signal the 12-month sunset cycle. Sunsetted epochs freeze on `archive.alpha-swarm.ai`. # Data Dictionary > Authoritative table-and-column reference for the AlphaSwarm persistence layer plus the Iceberg catalog. Updated whenever a model file or migration ships — see "[Adding a new model](../../concepts/platform/erd.md#adding-a-new-model)... # Data Dictionary > Pair with [alphaswarm_docs/erd.md](../../concepts/platform/erd.md) (visual schema) and > [alphaswarm_docs/domain-model.md](../../concepts/platform/domain-model.md) (narrative). > Doc map: [alphaswarm_docs/index.md](../../intro/index.md). Authoritative table-and-column reference for the AlphaSwarm persistence layer plus the Iceberg catalog. 
Updated whenever a model file or migration ships — see "[Adding a new model](../../concepts/platform/erd.md#adding-a-new-model)" for the workflow.

## Conventions

- **PK**: primary key column.
- **FK**: foreign key (`→ table.column`).
- **Type**: SQLAlchemy column type. `String(N)` is `VARCHAR(N)` in Postgres. `JSON` is `JSONB`. `DateTime` is timezone-naive UTC.
- **Null**: `Y`/`N`. Defaults are listed where present.
- **Notes**: extra constraints, indexes, or invariants.

All `id` columns are `String(36)` UUIDs generated by `_uuid()` unless noted. All `created_at` / `updated_at` columns default to `datetime.utcnow` server-side.

---

## 1. Sessions + chat — [models.py](../alphaswarm/persistence/models.py)

### `sessions`

The conversational shell that chat messages and agent runs live under.

| Column | Type | Null | FK | Notes |
| --- | --- | --- | --- | --- |
| id | UUID | N | – | PK |
| user | String(120) | N | – | default `"local"` |
| title | String(240) | Y | – | – |
| created_at | DateTime | N | – | default now |
| closed_at | DateTime | Y | – | – |
| meta | JSON | Y | – | default `{}` |

### `chat_messages`

| Column | Type | Null | FK | Notes |
| --- | --- | --- | --- | --- |
| id | UUID | N | – | PK |
| session_id | UUID | N | → sessions.id | cascade delete |
| role | String(32) | N | – | `user|assistant|agent|tool` |
| content | Text | N | – | – |
| created_at | DateTime | N | – | default now |
| meta | JSON | Y | – | default `{}` |

---

## 2. Strategies + backtests — [models.py](../alphaswarm/persistence/models.py)

### `strategies`

The top-level strategy header. Versions live in `strategy_versions`.

| Column | Type | Null | FK | Notes |
| --- | --- | --- | --- | --- |
| id | UUID | N | – | PK |
| name | String(120) | N | – | – |
| version | Integer | N | – | default 1 |
| config_yaml | Text | N | – | full YAML config |
| created_at | DateTime | N | – | default now |
| created_by | String(120) | N | – | default `"system"` |
| status | String(32) | N | – | `draft|backtesting|paper|live|retired` |
| meta | JSON | Y | – | default `{}` |

### `strategy_versions`

Immutable YAML snapshot of a strategy.

| Column | Type | Null | FK | Notes |
| --- | --- | --- | --- | --- |
| id | UUID | N | – | PK |
| strategy_id | UUID | N | → strategies.id | cascade |
| version | Integer | N | – | – |
| config_yaml | Text | N | – | – |
| author | String(120) | N | – | default `"system"` |
| created_at | DateTime | N | – | – |
| dataset_hash | String(64) | Y | – | bind to data version |
| notes | Text | Y | – | – |

Index: `ix_strategy_versions_strategy_version (strategy_id, version)`.

### `strategy_tests`

| Column | Type | Null | FK | Notes |
| --- | --- | --- | --- | --- |
| id | UUID | N | – | PK |
| strategy_id | UUID | N | → strategies.id | – |
| version_id | UUID | Y | → strategy_versions.id | – |
| backtest_id | UUID | Y | → backtest_runs.id | – |
| status | String(32) | N | – | default `pending` |
| start, end | DateTime | Y | – | window |
| sharpe, sortino, max_drawdown, total_return, final_equity | Float | Y | – | summary metrics |
| engine | String(64) | Y | – | – |

### `backtest_runs`

| Column | Type | Null | FK | Notes |
| --- | --- | --- | --- | --- |
| id | UUID | N | – | PK |
| strategy_id | UUID | Y | → strategies.id | – |
| task_id | String(120) | Y | – | Celery task id |
| status | String(32) | N | – | default `pending` |
| start, end | DateTime | Y | – | window |
| initial_cash, final_equity | Float | Y | – | – |
| sharpe, sortino, max_drawdown, total_return | Float | Y | – | metrics |
| mlflow_run_id | String(120) | Y | – | links to MLflow UI |
| dataset_hash | String(64) | Y | – | – |
| metrics | JSON | Y | – | full metrics blob |
| error | Text | Y | – | – |
| model_version_id | UUID | Y | → model_versions.id | Alembic 0025 — model that produced the alpha |
| ml_experiment_run_id | UUID | Y | → ml_experiment_runs.id | Alembic 0025 — training run lineage |
| experiment_plan_id | UUID | Y | → experiment_plans.id | Alembic 0025 — experiment plan lineage |
| model_deployment_id | UUID | Y | → model_deployments.id | Alembic 0025 — deployment that wired the model into the strategy |

### `ml_alpha_backtest_runs`

Combined experiment row joining a training run to a downstream backtest. Persisted by [`AlphaBacktestExperiment`](../../concepts/strategy/ml-alpha-backtest.md).

| Column | Type | Null | FK | Notes |
| --- | --- | --- | --- | --- |
| id | UUID | N | – | PK |
| task_id | String(120) | Y | – | Celery task id |
| run_name | String(240) | N | – | default `alpha-backtest` |
| status | String(32) | N | – | `queued|running|completed|failed` |
| ml_experiment_run_id | UUID | Y | → ml_experiment_runs.id | – |
| backtest_run_id | UUID | Y | → backtest_runs.id | – |
| model_version_id | UUID | Y | → model_versions.id | – |
| model_deployment_id | UUID | Y | → model_deployments.id | – |
| experiment_plan_id | UUID | Y | → experiment_plans.id | – |
| mlflow_run_id | String(120) | Y | – | parent MLflow run id |
| dataset_hash | String(64) | Y | – | – |
| ml_metrics | JSON | Y | – | IC / RMSE / hit-rate / etc |
| trading_metrics | JSON | Y | – | Sharpe / Sortino / Calmar / etc |
| combined_metrics | JSON | Y | – | rolled-up scalar `score` + selected ML/trading keys |
| attribution | JSON | Y | – | conviction-vs-PnL attribution |
| params | JSON | Y | – | full input-config snapshot |
| error | Text | Y | – | – |

### `ml_prediction_audit`

Per-bar prediction audit for an alpha-backtest run. Opt-in via `ALPHASWARM_ML_PREDICTION_AUDIT_ENABLED`; capped at `ALPHASWARM_ML_PREDICTION_AUDIT_MAX_ROWS` rows per run.
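
As a rough illustration of the opt-in flag and per-run cap described above, the sketch below reads the two environment variables and truncates a batch of audit rows. The helper names (`load_audit_settings`, `cap_audit_rows`), the accepted truthy strings, the fallback cap of 10,000 rows, and the sample symbol are assumptions made for this example; none of them are taken from the AlphaSwarm source.

```python
from dataclasses import dataclass
from typing import Mapping, Sequence

# Hypothetical fallback: the real default cap is not documented on this page.
ASSUMED_DEFAULT_MAX_ROWS = 10_000
# Assumed set of strings that switch the audit on.
TRUTHY = {"1", "true", "yes", "on"}


@dataclass(frozen=True)
class AuditSettings:
    enabled: bool
    max_rows: int


def load_audit_settings(env: Mapping[str, str]) -> AuditSettings:
    """Read the two documented variables from an environment-like mapping."""
    flag = env.get("ALPHASWARM_ML_PREDICTION_AUDIT_ENABLED", "")
    raw_cap = env.get("ALPHASWARM_ML_PREDICTION_AUDIT_MAX_ROWS", "")
    max_rows = int(raw_cap) if raw_cap.strip() else ASSUMED_DEFAULT_MAX_ROWS
    return AuditSettings(
        enabled=flag.strip().lower() in TRUTHY,
        max_rows=max(0, max_rows),
    )


def cap_audit_rows(rows: Sequence[dict], settings: AuditSettings) -> list[dict]:
    """Return the rows to persist: none when disabled, at most max_rows otherwise."""
    if not settings.enabled:
        return []
    return list(rows[: settings.max_rows])


settings = load_audit_settings({
    "ALPHASWARM_ML_PREDICTION_AUDIT_ENABLED": "true",
    "ALPHASWARM_ML_PREDICTION_AUDIT_MAX_ROWS": "2",
})
# Illustrative rows only; the symbol format is made up for this sketch.
rows = [{"vt_symbol": "AAPL.NASDAQ", "prediction": p} for p in (0.1, -0.2, 0.3)]
kept = cap_audit_rows(rows, settings)
```

Passing `os.environ` as the mapping would wire this to the process environment; taking a plain mapping keeps the sketch easy to test.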

| Column | Type | Null | FK | Notes |
| --- | --- | --- | --- | --- |
| id | UUID | N | – | PK |
| alpha_backtest_run_id | UUID | N | → ml_alpha_backtest_runs.id | cascade |
| vt_symbol | String(40) | N | – | – |
| ts | DateTime | N | – | – |
| prediction | Float | N | – | – |
| label | Float | Y | – | – |
| position_after | Float | Y | – | – |
| pnl_after_bar | Float | Y | – | – |

### `optimization_runs`

| Column | Type | Null | FK | Notes |
| --- | --- | --- | --- | --- |
| id | UUID | N | – | PK |
| task_id | String(120) | Y | – | – |
| strategy_id | UUID | Y | → strategies.id | – |
| run_name | String(240) | N | – | default `"sweep"` |
| method | String(32) | N | – | `grid|random|bayes` |
| metric | String(64) | N | – | default `"sharpe"` |
| status | String(32) | N | – | `queued|running|completed|failed` |
| n_trials, n_completed | Integer | N | – | – |
| best_trial_id | String(36) | Y | – | – |
| best_metric_value | Float | Y | – | – |
| parameter_space, base_config, summary | JSON | Y | – | – |

### `optimization_trials`

| Column | Type | Null | FK | Notes |
| --- | --- | --- | --- | --- |
| id | UUID | N | – | PK |
| run_id | UUID | N | → optimization_runs.id | cascade |
| backtest_id | UUID | Y | → backtest_runs.id | – |
| trial_index | Integer | N | – | – |
| parameters | JSON | Y | – | – |
| status | String(32) | N | – | – |
| metric_value, sharpe, sortino, total_return, max_drawdown, final_equity | Float | Y | – | – |
| error | Text | Y | – | – |

---

## 3. Ledger — [models.py](../alphaswarm/persistence/models.py)

### `signals`

| Column | Type | Null | FK | Notes |
| --- | --- | --- | --- | --- |
| id | UUID | N | – | PK |
| strategy_id | UUID | Y | → strategies.id | – |
| backtest_id | UUID | Y | → backtest_runs.id | – |
| vt_symbol | String(40) | N | – | indexed |
| direction | String(10) | N | – | `long|short|net` |
| strength | Float | N | – | – |
| confidence | Float | Y | – | default 1.0 |
| rationale | Text | Y | – | – |

Composite index: `ix_signals_symbol_ts (vt_symbol, created_at)`.

### `orders`

| Column | Type | Null | FK | Notes |
| --- | --- | --- | --- | --- |
| id | UUID | N | – | PK |
| backtest_id | UUID | Y | → backtest_runs.id | – |
| strategy_id | UUID | Y | → strategies.id | – |
| vt_symbol | String(40) | N | – | indexed |
| side | String(8) | N | – | `buy|sell` |
| order_type | String(16) | N | – | `market|limit|stop|stop_limit|...` |
| quantity, price | Float | varies | – | – |
| status | String(16) | N | – | default `submitting` |
| reference | String(120) | Y | – | `paper:` for paper trading |

### `fills`

| Column | Type | Null | FK | Notes |
| --- | --- | --- | --- | --- |
| id | UUID | N | – | PK |
| order_id | UUID | Y | → orders.id | – |
| vt_symbol | String(40) | N | – | indexed |
| side | String(8) | N | – | – |
| quantity, price | Float | N | – | – |
| commission, slippage | Float | Y | – | default 0 |

### `ledger_entries`

The canonical audit trail. Every action goes through `LedgerWriter`.

| Column | Type | Null | FK | Notes |
| --- | --- | --- | --- | --- |
| id | UUID | N | – | PK |
| backtest_id | UUID | Y | → backtest_runs.id | – |
| strategy_id | UUID | Y | → strategies.id | – |
| entry_type | String(32) | N | – | `SIGNAL|ORDER|FILL|RISK|AGENT|META` |
| level | String(16) | N | – | `debug|info|warn|error` |
| message | Text | N | – | – |
| payload | JSON | Y | – | default `{}` |

Index: `ix_ledger_type_ts (entry_type, created_at)`.

---

## 4. Instruments — [models.py](../alphaswarm/persistence/models.py) + [models_instruments.py](../alphaswarm/persistence/models_instruments.py)

### `instruments` (parent)

The polymorphic root. `instrument_class` is the discriminator; subclass rows live in `instrument_` tables keyed on `instruments.id`.

| Column | Type | Null | FK | Notes |
| --- | --- | --- | --- | --- |
| id | UUID | N | – | PK |
| vt_symbol | String(64) | N | unique | `Symbol.format()` |
| ticker | String(64) | N | indexed | – |
| exchange | String(32) | Y | – | – |
| asset_class | String(32) | Y | – | `equity|crypto|fx|...` |
| security_type | String(32) | Y | – | `equity|option|future|...` |
| instrument_class | String(32) | Y | indexed | discriminator |
| issuer_id | UUID | Y | → issuers.id | – |
| identifiers | JSON | Y | – | `{ticker, isin, cusip, …}` |
| sector, industry, region, currency | String | Y | – | – |
| tick_size, multiplier, min_quantity, max_quantity, lot_size | Float | Y | – | exchange specs |
| price_precision, size_precision | Integer | Y | – | – |
| is_active | Boolean | N | – | default true |
| tags | JSON | Y | – | default `[]` |
| meta | JSON | Y | – | default `{}` |

### Joined-table subclasses (`instrument_`)

All share `id` PK that's also a FK to `instruments.id`. Each table adds shape-specific columns. For full column lists see [models_instruments.py](../alphaswarm/persistence/models_instruments.py); the ERD in [alphaswarm_docs/erd.md](../../concepts/platform/erd.md#core--instruments) lists key columns per subclass.
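
The joined-table layout described above can be sketched in plain SQL using Python's standard-library `sqlite3` module. This is only an illustration with a reduced column set: the real models are SQLAlchemy classes in `models_instruments.py`, and the sample symbol and values below are made up for the example.

```python
import sqlite3
import uuid

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite enforces FKs only when enabled
conn.executescript(
    """
    CREATE TABLE instruments (
        id               TEXT PRIMARY KEY,
        vt_symbol        TEXT NOT NULL UNIQUE,
        ticker           TEXT NOT NULL,
        instrument_class TEXT                 -- discriminator
    );
    CREATE TABLE instrument_option (
        id     TEXT PRIMARY KEY REFERENCES instruments(id),  -- PK that is also the FK
        strike REAL,
        expiry TEXT,
        kind   TEXT,                          -- call/put
        style  TEXT
    );
    """
)

# One logical option = one parent row plus one subclass row sharing the same id.
option_id = str(uuid.uuid4())
conn.execute(
    "INSERT INTO instruments VALUES (?, ?, ?, ?)",
    (option_id, "XYZ-C-100.TEST", "XYZ", "option"),  # illustrative symbol
)
conn.execute(
    "INSERT INTO instrument_option VALUES (?, ?, ?, ?, ?)",
    (option_id, 100.0, "2026-12-18", "call", "american"),
)

# Loading the subclass means joining the two tables on the shared key.
row = conn.execute(
    """
    SELECT i.vt_symbol, i.instrument_class, o.strike, o.kind
    FROM instruments AS i
    JOIN instrument_option AS o ON o.id = i.id
    WHERE i.id = ?
    """,
    (option_id,),
).fetchone()

# A subclass row without a parent is rejected by the foreign key.
try:
    conn.execute("INSERT INTO instrument_option (id) VALUES (?)", ("no-parent",))
    orphan_rejected = False
except sqlite3.IntegrityError:
    orphan_rejected = True
```

The discriminator value stored in `instruments.instrument_class` is what tells a reader of the parent table which subclass table to join against.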

| Subclass table | Polymorphic identity | Distinctive columns |
| --- | --- | --- |
| `instrument_equity` | `spot` | `isin`, `cusip`, `figi`, `lei`, `gics_sector`, `shares_outstanding`, `is_adr` |
| `instrument_etf` | `etf` | `inception_date`, `aum`, `expense_ratio`, `is_leveraged`, `replication` |
| `instrument_index` | `index` | `administrator`, `methodology`, `constituent_count`, `base_value` |
| `instrument_bond` | `bond` | `coupon`, `maturity`, `rating_sp`, `rating_moodys`, `callable`, `convertible` |
| `instrument_future` | `future` | `underlying`, `expiry`, `contract_size`, `cycle`, `delivery_month` |
| `instrument_option` | `option` | `strike`, `expiry`, `kind` (call/put), `style`, `occ_symbol` |
| `instrument_fx_pair` | `fx_pair` | `base_currency`, `quote_currency`, `pip_size` |
| `instrument_crypto` | `crypto_token` | `subtype`, `chain`, `contract_address`, `max_leverage`, `funding_interval` |
| `instrument_cfd` | `cfd` | `underlying`, `margin_rate`, `financing_rate` |
| `instrument_commodity` | `spot_commodity` | `grade`, `unit_of_measure`, `delivery` |
| `instrument_synthetic` | `synthetic` | `legs`, `leg_weights`, `formula` |
| `instrument_betting` | `betting` | `event_name`, `market_type`, `selection_id` |
| `instrument_tokenized_asset` | `nft` | `chain`, `contract_address`, `token_standard` |

---

## 5. Dataset lineage — [models.py](../alphaswarm/persistence/models.py)

### `dataset_catalogs`

Logical dataset descriptor. Iceberg-related columns added in [migration 0011](../alembic/versions/0011_iceberg_catalog_columns.py).

| Column | Type | Null | FK | Notes |
| --- | --- | --- | --- | --- |
| id | UUID | N | – | PK |
| name | String(160) | N | indexed | – |
| provider | String(80) | N | indexed | – |
| domain | String(120) | N | – | default `"market.bars"` |
| frequency | String(32) | Y | – | – |
| storage_uri | String(512) | Y | – | – |
| schema_json | JSON | Y | – | – |
| description | Text | Y | – | – |
| tags | JSON | Y | – | – |
| meta | JSON | Y | – | – |
| iceberg_identifier | String(240) | Y | indexed | `.` |
| load_mode | String(32) | N | – | `managed|external` (default managed) |
| source_uri | String(1024) | Y | – | – |
| llm_annotations | JSON | Y | – | from `annotate_table` |
| column_docs | JSON | Y | – | – |

Composite index: `ix_dataset_catalog_name_provider`.

### `dataset_versions`

| Column | Type | Null | FK | Notes |
| --- | --- | --- | --- | --- |
| id | UUID | N | – | PK |
| catalog_id | UUID | N | → dataset_catalogs.id (cascade) | – |
| version | Integer | N | – | default 1 |
| status | String(32) | N | – | `active|superseded` |
| as_of, start_time, end_time | DateTime | Y | – | – |
| row_count, symbol_count, file_count | Integer | N | – | default 0 |
| dataset_hash | String(64) | Y | indexed | SHA-256 of inputs |
| materialization_uri | String(512) | Y | – | – |
| columns | JSON | Y | – | – |
| schema_json | JSON | Y | – | – |
| meta | JSON | Y | – | – |

Composite index: `ix_dataset_versions_catalog_version`.

### `data_sources`

| Column | Type | Null | FK | Notes |
| --- | --- | --- | --- | --- |
| id | UUID | N | – | PK |
| name | String | N | unique | `yfinance|alpaca|cfpb|...` |
| kind | String | Y | – | `rest|csv|parquet|kafka` |
| base_url | String | Y | – | – |
| meta | JSON | Y | – | – |

### `data_links`

Edges between dataset versions and entities (instruments, series).

| Column | Type | Null | FK | Notes |
| --- | --- | --- | --- | --- |
| id | UUID | N | – | PK |
| dataset_version_id | UUID | N | → dataset_versions.id (cascade) | – |
| source_id | UUID | Y | → data_sources.id | – |
| instrument_id | UUID | Y | → instruments.id | – |
| entity_kind | String | N | – | `instrument|series|theme` |
| entity_id | String | N | – | – |
| coverage_start, coverage_end | DateTime | Y | – | – |
| row_count | Integer | Y | – | – |
| meta | JSON | Y | – | – |

### `identifier_links`

| Column | Type | Null | FK | Notes |
| --- | --- | --- | --- | --- |
| id | UUID | N | – | PK |
| instrument_id | UUID | Y | → instruments.id | – |
| source_id | UUID | Y | → data_sources.id | – |
| identifier_kind | String | N | – | `cik|isin|ticker|figi|...` |
| identifier_value | String | N | – | – |

### `split_plans`, `split_artifacts`, `pipeline_recipes`, `experiment_plans`, `model_versions`, `model_deployments`

ML lineage tables. See full column lists in [models.py](../alphaswarm/persistence/models.py) (search for the class name). One-liner summary:

| Table | Purpose |
| --- | --- |
| `split_plans` | Train/val/test split design (method, segments, FK to dataset_version) |
| `split_artifacts` | Materialised fold boundaries + index sets per split plan |
| `pipeline_recipes` | Preprocessing recipes (shared/learn/infer processors) |
| `experiment_plans` | Ties dataset_version + split + recipe + model config + status |
| `model_versions` | One row per trained MLflow registry version |
| `model_deployments` | Active inference deployments (one model_version may have many) |

---

## 6. Agentic — [models.py](../alphaswarm/persistence/models.py)

### `agent_runs`

| Column | Type | Null | FK | Notes |
| --- | --- | --- | --- | --- |
| id | UUID | N | – | PK |
| session_id | UUID | Y | → sessions.id | – |
| task_id | String(120) | Y | indexed | – |
| crew | String(120) | N | – | – |
| status | String(32) | N | – | – |
| prompt | Text | N | – | – |
| result | JSON | Y | – | – |
| error | Text | Y | – | – |
| llm_model | String(120) | Y | – | – |
| token_usage | JSON | Y | – | – |

### `crew_runs`

Lightweight index for the Crew Trace UI.

| Column | Type | Null | FK | Notes |
| --- | --- | --- | --- | --- |
| id | UUID | N | – | PK |
| task_id | String(120) | N | unique | – |
| crew_name | String(120) | N | – | default `"research"` |
| crew_type | String(32) | N | indexed | `research|trader` |
| status | String(32) | N | indexed | – |
| prompt | Text | N | – | – |
| session_id | String(36) | Y | indexed | – |
| agent_run_id | UUID | Y | → agent_runs.id | – |
| result, events | JSON | Y | – | – |
| error | Text | Y | – | – |
| cost_usd | Float | N | – | default 0 |

### `agent_decisions`, `debate_turns`, `agent_backtests`, `agent_judge_reports`, `agent_replay_runs`, `backtest_interrupts`

The agentic-backtest audit trail.

| Table | Purpose |
| --- | --- |
| `agent_decisions` | One row per long/short/flat decision (links backtest, strategy, crew_run) |
| `debate_turns` | Multi-turn debate transcripts under a decision |
| `agent_backtests` | Crew-level metrics rolled up per backtest |
| `agent_judge_reports` | Judge LLM's evaluation of a backtest |
| `agent_replay_runs` | Replays of a judged backtest with adjusted prompts |
| `backtest_interrupts` | User pause/resume markers during a long backtest |

---

## 7. Feature sets — [models.py](../alphaswarm/persistence/models.py)

### `feature_sets`

| Column | Type | Null | FK | Notes |
| --- | --- | --- | --- | --- |
| id | UUID | N | – | PK |
| name | String | N | – | – |
| description | Text | Y | – | – |
| kind | String | N | – | `composite|ml4t|qlib|alpha158` |
| specs | JSON | N | – | list of indicator/transformation strings |
| tags | JSON | Y | – | – |
| default_lookback_days | Integer | Y | – | – |

### `feature_set_versions`

Immutable snapshot keyed on `content_hash` so the same spec rendered twice deduplicates.

### `feature_set_usages`

Records of which backtests / deployments consumed which feature-set versions (for reverse lineage).

---

## 8. Reports + paper — [models.py](../alphaswarm/persistence/models.py)

| Table | Purpose | Key columns |
| --- | --- | --- |
| `equity_reports` | Markdown equity research reports generated by the report-writer crew | `vt_symbol`, `cohort`, `markdown`, `cost_usd` |
| `paper_trading_runs` | One row per paper or live session | `brokerage`, `feed`, `last_heartbeat_at`, `bars_seen`, `orders_submitted`, `fills`, `state` |
| `rl_episodes` | Snapshot of an RL training episode | `run_id`, `episode`, `mean_reward`, `portfolio_value` |

---

## 8a. Bots — [models_bots.py](../alphaswarm/persistence/models_bots.py)

Tables introduced by the Bot Entity Refactor (Alembic [`0020_bots`](../alembic/versions/0020_bots.py)). Mirror the proven `agent_specs` / `agent_spec_versions` / `agent_runs_v2` pattern.

| Table | Purpose | Key columns |
| --- | --- | --- |
| `bots` | Logical bot row (latest active version of a named spec inside a project) | `id`, `name`, `slug`, `kind` (`trading|research`), `current_version`, `spec_yaml`, `status` (`draft|ready|deployed|archived`), `annotations`, `(project_id, slug)` UNIQUE |
| `bot_versions` | Immutable, hash-locked snapshot of every `BotSpec` change | `id`, `bot_id` FK, `version`, `spec_hash`, `payload`, `notes`, `created_by`, `(bot_id, spec_hash)` UNIQUE, `(bot_id, version)` UNIQUE |
| `bot_deployments` | One row per backtest / paper / chat / k8s invocation; references the version that produced it | `id`, `bot_id` FK, `version_id` FK, `target` (`paper_session|kubernetes|backtest_only|backtest|chat`), `task_id`, `status`, `manifest_yaml` (k8s only), `result_summary`, `error`, `started_at`, `ended_at` |

All three tables carry `ProjectScopedMixin` (`owner_user_id`, `workspace_id`, `project_id`).

---

## 9. News — [models_news.py](../alphaswarm/persistence/models_news.py)

| Table | Key columns |
| --- | --- |
| `news_items` | `url`, `source`, `published_at`, `headline`, `body` |
| `news_item_entities` | `news_item_id`, `vt_symbol`, `entity_kind` (`instrument|issuer|theme`) |
| `news_sentiments` | `news_item_id`, `scorer` (`finbert|fingpt`), `polarity`, `confidence` |

---

## 10. Events — [models_events.py](../alphaswarm/persistence/models_events.py)

`corporate_events` is the parent; the per-type tables FK back to it.

| Table | Key columns |
| --- | --- |
| `corporate_events` | `vt_symbol`, `event_type` (`earnings|split|dividend|merger|ipo`), `event_time`, `payload` |
| `earnings_event_rows` | `event_id`, `eps_actual`, `eps_estimate`, `revenue_actual` |
| `dividend_event_rows` | `event_id`, `amount`, `ex_date`, `pay_date` |
| `split_event_rows` | `event_id`, `ratio` |
| `ipo_event_rows` | `event_id`, `offer_price`, `shares_offered` |
| `merger_event_rows` | `event_id`, `acquirer`, `target`, `terms` |
| `calendar_event_rows` | `event_id`, `event_kind`, `expected_time` |
| `analyst_estimates` | `vt_symbol`, `analyst`, `target_price`, `forecast_date` |
| `price_targets` | `vt_symbol`, `analyst`, `target_price`, `period` |
| `forward_estimates` | `vt_symbol`, `analyst`, `metric`, `value` |
| `regulatory_event_rows` | `event_id`, `regulator`, `summary` |
| `esg_event_rows` | `event_id`, `category`, `score` |

---

## 11. Fundamentals — [models_fundamentals.py](../alphaswarm/persistence/models_fundamentals.py)

| Table | Key columns |
| --- | --- |
| `financial_statements` | `issuer_id`, `period` (`Q|FY`), `period_end`, `data` |
| `financial_ratios` | `issuer_id`, `period_end`, `pe`, `pb`, `roe`, `roa`, `debt_to_equity` |
| `key_metrics` | `issuer_id`, `period_end`, `revenue`, `net_income`, `free_cash_flow` |
| `historical_market_caps` | `issuer_id`, `as_of`, `market_cap` |
| `revenue_breakdowns` | `issuer_id`, `period_end`, `segment`, `region`, `revenue` |
| `earnings_call_transcripts` | `issuer_id`, `call_date`, `content` |
| `management_discussion_analysis` | `issuer_id`, `period_end`, `mda_text` |
| `reported_financials` | `issuer_id`, `period_end`, `xbrl_payload` |

---

## 12. Macro — [models_macro.py](../alphaswarm/persistence/models_macro.py)

| Table | Key columns |
| --- | --- |
| `economic_series` | `series_id` (`FRED:GDP`), `title`, `frequency`, `units`, `source` |
| `economic_observations` | `series_id`, `observation_date`, `value` |
| `cot_reports` | `report_date`, `instrument`, `positions` |
| `bls_series` | `series_id`, `title`, `frequency` |
| `treasury_rates` | `date`, `rate_3m`, `rate_2y`, `rate_10y`, `rate_30y` |
| `yield_curves` | `date`, `tenors` |
| `option_series` | `instrument_id`, `expiry`, `style` |
| `option_chain_snapshots` | `series_id`, `as_of`, `chain_payload` |
| `futures_curves` | `as_of`, `front_month`, `tenor_prices` |
| `market_holidays` | `exchange`, `date`, `name` |
| `market_status_history` | `exchange`, `as_of`, `status` |

---

## 13. Entities + ownership — [models_entities.py](../alphaswarm/persistence/models_entities.py) + [models_ownership.py](../alphaswarm/persistence/models_ownership.py)

| Table | Key columns |
| --- | --- |
| `issuers` | `name`, `lei`, `country`, `entity_kind` |
| `government_entities` | `id` (PK_FK), `country_code`, `level` |
| `funds` | `id` (PK_FK), `fund_family`, `fund_type` |
| `sectors` | `code`, `name` |
| `industries` | `code`, `name`, `sector_id` |
| `industry_classifications` | `issuer_id`, `industry_id`, `as_of` |
| `entity_relationships` | `parent_id`, `child_id`, `kind` |
| `locations` | `issuer_id`, `country`, `city` |
| `key_executives` | `issuer_id`, `name`, `title` |
| `executive_compensation` | `executive_id`, `year`, `total_comp` |
| `insider_transactions` | `vt_symbol`, `insider_name`, `transaction_date`, `quantity` |
| `institutional_holdings` | `vt_symbol`, `holder_name`, `as_of`, `quantity` |
| `form_13f_holdings` | `filer_cik`, `vt_symbol`, `period_end` |
| `short_interest` | `vt_symbol`, `settlement_date`, `short_interest` |
| `shares_float_snapshots` | `vt_symbol`, `as_of`, `float_shares` |
| `politician_trades` | `politician`, `vt_symbol`, `trade_date`, `amount` |
| `fund_holdings` | `fund_id`, `vt_symbol`, `as_of`, `position` |

---

## 14. Taxonomy — [models_taxonomy.py](../alphaswarm/persistence/models_taxonomy.py)

| Table | Key columns |
| --- | --- |
| `taxonomy_schemes` | `name` (`GICS|SASB|theme`) |
| `taxonomy_nodes` | `scheme_id`, `parent_id`, `code`, `label` |
| `entity_tags` | `node_id`, `entity_kind`, `entity_id` |
| `entity_crosswalks` | `from_kind`, `from_id`, `to_kind`, `to_id` |

---

## 15. External-source indexes — [models.py](../alphaswarm/persistence/models.py)

| Table | Purpose | Key columns |
| --- | --- | --- |
| `fred_series` | FRED metadata index | `series_id`, `title`, `units`, `frequency` |
| `sec_filings` | SEC EDGAR filing index | `instrument_id`, `accession`, `form`, `filing_date` |
| `gdelt_mentions` | GDelt GKG mention index | `instrument_id`, `mention_time`, `gkg_payload` |

---

## Iceberg namespace conventions

Iceberg tables sit alongside the Postgres schema; their identifiers are stored in `dataset_catalogs.iceberg_identifier` for cross-lookup.

| Namespace | Source |
| --- | --- |
| `alphaswarm` | Generic / default (fallback when no `--namespace` provided) |
| `alphaswarm_smoke` | Smoke-test namespace ([scripts/iceberg_smoke.py](../scripts/iceberg_smoke.py)) |
| `alphaswarm_cfpb` | CFPB regulatory ingest |
| `alphaswarm_uspto` | USPTO regulatory ingest |
| `alphaswarm_fda` | openFDA regulatory ingest |
| `alphaswarm_sec` | SEC quarterly data sets |
| `alphaswarm_bars` | (reserved) generic OHLCV cache |
| `alphaswarm_features` | (reserved) feature-set materialisations |

**Naming rules**:

- Namespace: `aqp_`, lower-snake-case, ≤32 chars.
- Table: lower-snake-case, ≤48 chars, descriptive nouns (`hmda_lar`, `device_event`, `broker_dealers`).
- The Director (Nemotron) decides the final table name within these rules; identity-plan fallback uses the discovered family name as-is.
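
The mechanical parts of the naming rules above can be checked with a few lines of Python. This is a minimal sketch, not the project's validator: the function names are hypothetical, "lower-snake-case" is assumed to mean lowercase letters and digits in underscore-separated words starting with a letter, and the required prefix is passed in as a parameter because the rule text and the namespace table show different prefixes. The "descriptive nouns" requirement is a judgment call and is not checked.

```python
import re

# Assumed reading of "lower-snake-case": words of [a-z0-9] joined by single
# underscores, with the first character a letter.
_SNAKE = re.compile(r"[a-z][a-z0-9]*(?:_[a-z0-9]+)*")

NAMESPACE_MAX = 32  # "≤32 chars"
TABLE_MAX = 48      # "≤48 chars"


def is_valid_namespace(name: str, prefix: str) -> bool:
    """Check the prefix, the casing rule, and the 32-character limit."""
    return (
        name.startswith(prefix)
        and len(prefix) < len(name) <= NAMESPACE_MAX
        and _SNAKE.fullmatch(name) is not None
    )


def is_valid_table(name: str) -> bool:
    """Check the casing rule and the 48-character limit."""
    return len(name) <= TABLE_MAX and _SNAKE.fullmatch(name) is not None


# Names taken from the tables and examples above.
namespace_ok = is_valid_namespace("alphaswarm_cfpb", prefix="alphaswarm_")
tables_ok = [is_valid_table(t) for t in ("hmda_lar", "device_event", "broker_dealers")]

# Counter-examples: wrong case, hyphen instead of underscore, too long.
bad_case = is_valid_namespace("alphaswarm_CFPB", prefix="alphaswarm_")
bad_separator = is_valid_table("device-event")
too_long = is_valid_namespace("alphaswarm_" + "x" * 30, prefix="alphaswarm_")
```

A check like this only guards the format; the final table name is still chosen within these limits as described in the last rule.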

**Layout**:

```
C:/alphaswarm-warehouse/iceberg/
├── catalog.db                        # SQLite metadata
└── /
    └── /
        ├── data/00000-0-.parquet
        ├── data/00001-0-.parquet
        └── metadata/
            ├── 00000-.metadata.json
            ├── 00001-.metadata.json
            ├── -m0.avro              # manifest file
            └── snap--...avro         # manifest list (one per snapshot)
```

Snapshots are append-only — every `append_arrow` produces a new `metadata.json` revision. Old snapshots can be expired with PyIceberg's `Table.expire_snapshots(...)` (not exposed via API yet).

**Updating this dictionary**:

When you add an ORM column or a new table:

1. Update the corresponding section above.
2. If you added a table to a per-domain ERD scope, update [alphaswarm_docs/erd.md](../../concepts/platform/erd.md) too.
3. Cross-link the migration that introduced the change.

# Control-plane API

> Interactive Scalar-rendered reference for the control plane at manage.alpha-swarm.ai. Workload lifecycle, Terraform driver, provider adapters.

# Control-plane API

This is the `alphaswarm_controller` surface at `manage.alpha-swarm.ai`. It is deliberately separate from the public AlphaSwarm API; it owns workload lifecycle, the `TerraformRuntime`, provider adapters, and the `workload_runs` audit ledger.

The spec lives at [alphaswarm_docs/openapi/control-plane.json](https://github.com/julianwileymac/alphaswarm/blob/main/alphaswarm_docs/openapi/control-plane.json), auto-dumped by the existing `openapi-export` job in [.github/workflows/ci.yml](https://github.com/julianwileymac/alphaswarm/blob/main/.github/workflows/ci.yml).

## Surface

- **`/manage/workloads/*`** — start / stop / scale / restart / exec / tail-logs / apply_config / rotate-secret.
- **`/manage/topology/*`** — service URL resolution (AGENTS rule 47).
- **`/manage/terraform/*`** — Terraform plan / apply / destroy through `TerraformRuntime` (AGENTS rules 42, 43).
- **`/manage/cloudflare/*`** — tunnel + DNS + Access app CRUD.
- **`/manage/auth/*`** — IdP wiring (Auth0, Entra).
- **`/manage/tenancy/*`** — `EntraTenantLink` lifecycle (AGENTS rule 44).
- **`/manage/agents/health`** — agent stall watchdog snapshot.
- **`/manage/workflows/halt`** — kill-switch fan-out.

## Audit ledger

Every workload action writes a `workload_runs` row BEFORE executing through the provider. See [Concept: management engine](../../concepts/identity/management-engine.md) for the full audit contract.

## Authentication

Same Auth0 / Entra IdP chain as the public API; access is restricted to the `admin:cluster` scope (engineering org) and the per-org `admin:org` scope (customer orgs). Cloudflare Access policies in front of `manage.alpha-swarm.ai` enforce the perimeter at the edge.

# GET /manage/livez

> Control-plane liveness probe.

# Control-plane liveness probe.

Control-plane liveness probe.

> **Method:** `GET`
> **Path:** `/manage/livez`
> **Tag:** `workloads`
> **OperationId:** `get-manage-livez`

See the [interactive playground](../index.mdx) for parameter forms, response schemas, and credential persistence.

## Source spec

This page is generated from `alphaswarm_docs/openapi/control-plane.json` by [`alphaswarm_docs/scripts/generate-openapi-mdx.ts`](https://github.com/julianwileymac/alphaswarm/blob/main/alphaswarm_docs/scripts/generate-openapi-mdx.ts). Refresh by re-running `pnpm --filter alphaswarm_docs generate-openapi-mdx`.

# alphaswarm

> Auto-generated reference for the alphaswarm package. Re-runs on every PR touching **/*.py.

# alphaswarm

This page would normally be auto-generated by `mdxify` from the Python source.
The extraction binary is not available in this environment — re-run `pnpm --filter alphaswarm_docs extract-python` once griffe + griffe-pydantic + mdxify are installed:

```powershell
pip install "griffe>=1.5" "griffe-pydantic>=0.1.4" "mdxify>=0.1"
pnpm --filter alphaswarm_docs extract-python
```

## Source

- Module: `alphaswarm`
- Search path: `C:\Users\Julian Wiley\Documents\AlphaSwarm\code`

# alphaswarm_bots

> Auto-generated reference for the alphaswarm_bots package. Re-runs on every PR touching **/*.py.

# alphaswarm_bots

This page would normally be auto-generated by `mdxify` from the Python source. The extraction binary is not available in this environment — re-run `pnpm --filter alphaswarm_docs extract-python` once griffe + griffe-pydantic + mdxify are installed:

```powershell
pip install "griffe>=1.5" "griffe-pydantic>=0.1.4" "mdxify>=0.1"
pnpm --filter alphaswarm_docs extract-python
```

## Source

- Module: `alphaswarm_bots`
- Search path: `C:\Users\Julian Wiley\Documents\AlphaSwarm\code`

# alphaswarm_controller

> Auto-generated reference for the alphaswarm_controller package. Re-runs on every PR touching **/*.py.

# alphaswarm_controller

This page would normally be auto-generated by `mdxify` from the Python source. The extraction binary is not available in this environment — re-run `pnpm --filter alphaswarm_docs extract-python` once griffe + griffe-pydantic + mdxify are installed:

```powershell
pip install "griffe>=1.5" "griffe-pydantic>=0.1.4" "mdxify>=0.1"
pnpm --filter alphaswarm_docs extract-python
```

## Source

- Module: `alphaswarm_controller`
- Search path: `C:\Users\Julian Wiley\Documents\AlphaSwarm\code\alphaswarm_controller\src`

# alphaswarm_core

> Auto-generated reference for the alphaswarm_core package. Re-runs on every PR touching **/*.py.

# alphaswarm_core

This page would normally be auto-generated by `mdxify` from the Python source.
The extraction binary is not available in this environment — re-run `pnpm --filter alphaswarm_docs extract-python` once griffe + griffe-pydantic + mdxify are installed:

```powershell
pip install "griffe>=1.5" "griffe-pydantic>=0.1.4" "mdxify>=0.1"
pnpm --filter alphaswarm_docs extract-python
```

## Source

- Module: `alphaswarm_core`
- Search path: `C:\Users\Julian Wiley\Documents\AlphaSwarm\code\alphaswarm_core\src`

# alphaswarm_models

> Auto-generated reference for the alphaswarm_models package. Re-runs on every PR touching **/*.py.

# alphaswarm_models

This page would normally be auto-generated by `mdxify` from the Python source. The extraction binary is not available in this environment — re-run `pnpm --filter alphaswarm_docs extract-python` once griffe + griffe-pydantic + mdxify are installed:

```powershell
pip install "griffe>=1.5" "griffe-pydantic>=0.1.4" "mdxify>=0.1"
pnpm --filter alphaswarm_docs extract-python
```

## Source

- Module: `alphaswarm_models`
- Search path: `C:\Users\Julian Wiley\Documents\AlphaSwarm\code\alphaswarm_models\src`

# alphaswarm_rl

> Auto-generated reference for the alphaswarm_rl package. Re-runs on every PR touching **/*.py.

# alphaswarm_rl

This page would normally be auto-generated by `mdxify` from the Python source. The extraction binary is not available in this environment — re-run `pnpm --filter alphaswarm_docs extract-python` once griffe + griffe-pydantic + mdxify are installed:

```powershell
pip install "griffe>=1.5" "griffe-pydantic>=0.1.4" "mdxify>=0.1"
pnpm --filter alphaswarm_docs extract-python
```

## Source

- Module: `alphaswarm_rl`
- Search path: `C:\Users\Julian Wiley\Documents\AlphaSwarm\code\alphaswarm_rl\src`

# Python reference

> Auto-generated module / class / function reference for alphaswarm / alphaswarm_rl / alphaswarm_models / alphaswarm_controller / alphaswarm_core, via Griffe + griffe-pydantic + mdxify.
# Python reference

This tree is auto-generated by [alphaswarm_docs/scripts/extract-python.ts](https://github.com/julianwileymac/alphaswarm/blob/main/alphaswarm_docs/scripts/extract-python.ts) on every CI run that touches `**/*.py`. The extraction pipeline is:

1. **Griffe** walks the Python AST and parses signatures, type hints, docstrings, and dynamic attributes.
2. **griffe-pydantic** teaches Griffe to render Pydantic model constraints, validators, and aliases — critical for FastAPI request/response models.
3. **mdxify** emits MDX with Docusaurus-native admonitions and navigation generation.

The output mirrors the source tree under [alphaswarm_docs/docs/reference/python/](./).

## Top-level packages

- **`alphaswarm`** — quant runtime (strategy, backtest, agents, RAG, data).
- **`alphaswarm_rl`** — RL subsystem (RLRuntime, RLComponent metaclass, etc.).
- **`alphaswarm_models`** — ML framework, AlphaBacktestExperiment, model serving.
- **`alphaswarm_controller`** — workload lifecycle, TerraformRuntime.
- **`alphaswarm_core`** — shared ABCs, value types, auth filters.

## Docstring style

Standardised on Google-style docstrings. Griffe parses ReST + NumPy styles too, but mixed styles confuse downstream tooling.

```python
def append_arrow(
    table: str,
    arrow_table: pa.Table,
    *,
    namespace: str,
    medallion_layer: Literal["bronze", "silver", "gold"],
    business_metadata: BusinessMetadata | None = None,
) -> SnapshotResult:
    """Append an Arrow table to an Iceberg table.

    The single sanctioned write path for Iceberg in AlphaSwarm. See AGENTS rule 3.

    Args:
        table: The Iceberg table name (without namespace prefix).
        arrow_table: The data to append.
        namespace: The medallion-qualified namespace (`alphaswarm_bronze_*`,
            `alphaswarm_silver_*`, `alphaswarm_gold_*`).
        medallion_layer: Must match the namespace prefix.
        business_metadata: Optional active-metadata block.

    Returns:
        A `SnapshotResult` with the new manifest list location and the
        snapshot id.

    Raises:
        IcebergNamespaceError: If the namespace prefix does not match the
            declared layer.
    """
```

## Reading the generated docs

Browse via the sidebar to the left. Each module page shows:

- A summary line from the first paragraph of the module docstring.
- Every public class, function, and dataclass with full signature.
- A "Source" link back to GitHub.
- A "Used by" cross-reference graph (Phase 6 — backed by the Codebase MCP server's symbol index).

## Breaking-change detection

The CI surface runs `griffe check` against every PR. Any API removal / signature change posts a comment on the PR and requires a `breaking-change` label + matching Changeset entry.

# Release 2026-06-01 — Docs migration + first API epoch

> docs.alpha-swarm.ai launches as the canonical documentation site; first Stripe-style API epoch lands.

# Release 2026-06-01 — Docs migration + first API epoch

This release marks two significant changes for AlphaSwarm customers:

## New

- **docs.alpha-swarm.ai is live.** The canonical documentation site replaces the previous GitHub-rendered tree at `alphaswarm_docs/`. Every previous link continues to resolve via 301 redirects in [`alphaswarm_docs/static/_redirects`](https://github.com/julianwileymac/alphaswarm/blob/main/alphaswarm_docs/static/_redirects).
- **Interactive API playground** at [/reference/api](../reference/api/index.mdx). Token persistence works end-to-end; copy a `Bearer` from `alphaswarm-cli auth login` and every request from the playground inherits it.
- **AI-native surfaces.** A curated `/llms.txt` index, the full corpus at `/llms-full.txt`, and an RFC 9728 + 8707-compliant MCP server at `/mcp` are now first-class agent entry points.
- **In-product help panel** in the operator UI — the help drawer reads directly from the docs corpus, so the in-product reference never drifts from the public site.

## Improved

- **Search is local-first.** Pagefind indexes the entire corpus client-side; no documentation content ever leaves the docs site.
- **Quality gates.** Every PR to `alphaswarm_docs/` runs Vale + alex.js + markdownlint + lychee + Lighthouse + axe-core + executable Python snippets via pytest-markdown-docs.
- **Hybrid authoring.** Business editors can ship docs through Keystatic at [/keystatic](/keystatic) — their typed-schema commits go through the same branch protection rules as engineers' changes.

## API

- **First Stripe-style date epoch:** `2026-06-01`. No surface changes from the prior unversioned API; future epochs will follow the documented 12-month sunset cycle with RFC 8594 `Deprecation` and `Sunset` response headers.
- **OpenAPI specs committed** at [`alphaswarm_docs/openapi/alphaswarm.json`](https://github.com/julianwileymac/alphaswarm/blob/main/alphaswarm_docs/openapi/alphaswarm.json) and [`alphaswarm_docs/openapi/control-plane.json`](https://github.com/julianwileymac/alphaswarm/blob/main/alphaswarm_docs/openapi/control-plane.json). `oasdiff` PR gates prevent silent drift.

## Internal

- Cloudflare Pages + Cloudflare Access front the docs site as a separate edge property — the cluster tunnel at `alpha-swarm.ai` / `api.alpha-swarm.ai` / `manage.alpha-swarm.ai` continues unchanged.
- Logpush ships request + Access audit logs to a new R2 bucket (`alphaswarm-docs-access-logs`) with 365-day retention for SOC 2 / ISO 27001 evidence.
- Instatus at `status.alpha-swarm.ai` is now the canonical status page; the docs site renders a live banner via the Instatus JSON API.

## Migration notes

If you have hard-coded references to `alphaswarm_docs/.md` in your own tooling, both shapes resolve correctly today; the legacy shape will return a 410 starting 2027-06-01 (the 12-month sunset window applies to URL paths as well as API epochs).

# Release notes

> Customer-facing release notes for AlphaSwarm. Generated from Changesets on every release.

# Release notes

Customer-facing release notes for AlphaSwarm.
New entries land here whenever a PR's Changeset is marked `audience: customer` or `audience: both` (see [`.changeset/README.md`](https://github.com/julianwileymac/alphaswarm/blob/main/.changeset/README.md)). For the full technical changelog (every commit, including non-customer-facing internal refactors), see [CHANGELOG.md](https://github.com/julianwileymac/alphaswarm/blob/main/CHANGELOG.md).

## Subscribe

- **RSS / Atom feed**: built from this folder by Docusaurus — available at [/blog/rss.xml](/blog/rss.xml).
- **In-product changelog widget**: powered by [`/release-notes.json`](/release-notes.json) (Headway-compatible).
- **Email digest**: opt in from the operator UI profile menu.

## API epochs

AlphaSwarm uses Stripe-style date-epoch API versioning. New epochs:

- Roll out on the first of the month (`2026-06-01`, `2026-09-01`, …).
- Preserve old contracts via the `Deprecation` / `Sunset` HTTP headers (RFC 8594) for a 12-month sunset cycle.
- Move to [archive.alpha-swarm.ai](https://archive.alpha-swarm.ai) when fully retired.

The matching reference docs live at [/reference/api/](../reference/api/index.mdx).

# Your first agent workflow

> Compose a three-node LangGraph (Research / Selection / Trader), run it through AgentRuntime, inspect the agent_runs_v2 ledger.

# Your first agent workflow

Goal: stand up a three-node agentic loop driven by `WorkflowRuntime`, see it through one complete iteration, inspect the immutable `agent_runs_v2` rows it produces.

## Why

`AgentSpec` + `AgentRuntime` is AlphaSwarm's "skill artifact" — every agent run is hash-locked into `agent_spec_versions` and audited through `agent_runs_v2`. Combined with the additive `WorkflowRuntime` (orchestration adapter pattern), this is how AlphaSwarm composes multi-agent pipelines without losing replay or kill-switch semantics. See [Concept: workflow studio](../concepts/agentic/workflow-studio.md).
## Step 1 — author the workflow

Create `configs/workflows/my_first_workflow.yaml`:

```yaml
name: MyFirstResearchLoop
adapter_kind: graph
nodes:
  - id: research
    agent_spec: configs/agents/research_lite.yaml
    inputs:
      universe: [SPY, QQQ, IWM]
      lookback_days: 30
  - id: selection
    agent_spec: configs/agents/selection_lite.yaml
    depends_on: [research]
  - id: trader
    agent_spec: configs/agents/trader_paper.yaml
    depends_on: [selection]
edges:
  - { from: research, to: selection }
  - { from: selection, to: trader }
cost_caps:
  per_node_max_tokens: 4000
  per_run_max_usd: 0.50
halt_check_seconds: 5
```

## Step 2 — snapshot + run

```powershell
curl -X POST http://localhost:8000/workflows/MyFirstResearchLoop/run `
  -H "Content-Type: application/json" `
  -d '{}'
```

`WorkflowRuntime`:

1. Hash-locks the spec into `workflow_spec_versions`.
2. Reads each referenced `AgentSpec` and hash-locks them into `agent_spec_versions`.
3. Begins traversing the DAG, calling each agent through `AgentRuntime`.
4. Emits canonical progress frames per AGENTS rule 4.
5. Writes `agent_runs_v2` and `workflow_runs` rows.

## Step 3 — watch the breadcrumbs

The operator UI renders the workflow live at `/workflows/runs/`. From the CLI:

```powershell
docker exec alphaswarm-api python -c "from alphaswarm.ws.broker import subscribe; \
[print(m) for m in subscribe('')]"
```

## Step 4 — inspect the ledger

```sql
SELECT id, workflow_name, status, started_at, ended_at, total_tokens, total_cost_usd
FROM workflow_runs
ORDER BY started_at DESC
LIMIT 1;

SELECT id, agent_name, node_id, status, total_tokens
FROM agent_runs_v2
WHERE workflow_run_id = ''
ORDER BY started_at;
```

You should see three `agent_runs_v2` rows — one per node.

## Step 5 — replay

```powershell
curl -X POST http://localhost:8000/workflows/runs//replay
```

Same hash-locked spec versions, new run row.

## Step 6 — halt

The kill switch fans out:

```powershell
curl -X POST http://localhost:8000/workflows/halt
```

Every running workflow stops; `agent_runs_v2` rows close with `status=halted`.
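The hash-locking that Steps 2 and 5 rely on can be pictured as content-addressing the spec: serialise it canonically, hash it, and reuse the stored version whenever the hash already exists. The sketch below is illustrative only. `spec_hash` and `snapshot` are hypothetical helpers, SHA-256 over canonical JSON is an assumed scheme, and the in-memory dict merely stands in for the `workflow_spec_versions` table; none of this is taken from the actual `WorkflowRuntime` code.

```python
import hashlib
import json


def spec_hash(spec: dict) -> str:
    """Hash a spec so that key order and whitespace do not affect the result."""
    canonical = json.dumps(spec, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def snapshot(versions: dict, spec: dict) -> str:
    """Store the spec under its hash; identical content reuses the same entry."""
    digest = spec_hash(spec)
    versions.setdefault(digest, spec)
    return digest


# Hypothetical in-memory stand-in for the workflow_spec_versions table.
versions = {}
spec = {"name": "MyFirstResearchLoop", "adapter_kind": "graph"}

first = snapshot(versions, spec)
# Same content with the keys in a different order: no new version is stored.
replay = snapshot(versions, dict(reversed(list(spec.items()))))
# Changed content: a new hash, so a new version is stored next to the old one.
edited = snapshot(versions, {**spec, "adapter_kind": "crew"})
```

Under these assumptions a replay resolves to the existing hash, while any edit to the spec yields a second entry and leaves the first intact.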
## Verify

- [ ] `workflow_spec_versions` row with a `spec_hash`.
- [ ] Three `agent_spec_versions` rows (one per node).
- [ ] One `workflow_runs` row + three `agent_runs_v2` rows.
- [ ] Total cost in USD ≤ `per_run_max_usd` from the spec.
- [ ] Replay produces a new `workflow_runs` row but reuses the same spec-version rows.

## What next

- [Concept: agentic pipeline](../concepts/agentic/agentic-pipeline.md) — the full five-stage lifecycle (models, data, snapshot, dispatch, review) and how this tutorial maps to it.
- [Concept: workflow studio](../concepts/agentic/workflow-studio.md) — the seven adapter kinds (graph / crew / debate / fusion / execution / schedule / studio).
- [Concept: multi-agent patterns](../concepts/agentic/multi-agent-patterns.md) — Sequential / Parallel / Debate / Coordinator / ReAct topologies.
- [Tutorial: first RL experiment](./first-rl-experiment.md) — hand RL outputs into an agent loop.

# Your first backtest

> Author a momentum strategy, run it through EventDrivenBacktester, inspect the ledger row, render a tearsheet.

# Your first backtest

Goal: from blank slate to a backtest with a non-zero Sharpe on your screen, in under 5 minutes.

## Why

The backtest pipeline is the central artifact of every AlphaSwarm workflow. Every strategy gets backtested before paper, every paper run gets promoted on the back of backtest evidence, and every RL policy gets evaluated against the same engine. Understanding the backtest contract is a prerequisite to understanding anything else.

## Prerequisites

- The [quickstart](../intro/quickstart.md) completed.
- An open terminal pointing at the repo root.
## Step 1 — author the strategy

Create `configs/strategies/my_first_strategy.yaml`:

```yaml
name: MyFirstMomentum
kind: alpha
class: alphaswarm.strategies.framework.algorithms.MomentumAlpha
module_path: alphaswarm.strategies.framework.algorithms
universe:
  kind: static
  symbols:
    - { ticker: SPY, exchange: ARCA, kind: equity }
    - { ticker: QQQ, exchange: NASDAQ, kind: equity }
    - { ticker: IWM, exchange: ARCA, kind: equity }
kwargs:
  lookback_days: 60
  rebalance_freq: weekly
  top_n: 2
risk:
  max_position_pct: 0.5
  max_drawdown_pct: 0.15
```

The `class` + `module_path` + `kwargs` pattern is Qlib-style and required for every strategy registry entry. See [AGENTS rule 8](https://github.com/julianwileymac/alphaswarm/blob/main/AGENTS.md).

## Step 2 — dispatch the backtest

```powershell
docker exec alphaswarm-api python -m alphaswarm.cli.cli backtest `
  --config configs/strategies/my_first_strategy.yaml `
  --start 2024-01-01 `
  --end 2024-06-30 `
  --engine event_driven
```

The CLI returns a `task_id`. Tail its progress:

```powershell
docker exec alphaswarm-api python -c "from alphaswarm.ws.broker import subscribe; \
[print(m) for m in subscribe('')]"
```

You will see progress frames in the canonical `{task_id, stage, message, timestamp, **extras}` shape.

## Step 3 — inspect the ledger

```powershell
docker exec alphaswarm-postgres psql -U alphaswarm -d alphaswarm -c `
  "SELECT id, strategy_name, sharpe, total_return, max_drawdown FROM backtest_runs ORDER BY created_at DESC LIMIT 5;"
```

The most recent row is your run. If `sharpe` is `NULL`, the backtest failed — see Step 5.

## Step 4 — render a tearsheet

```powershell
curl -X POST http://localhost:8000/analytics/portfolio/tearsheet `
  -H "Content-Type: application/json" `
  -d '{"run_id": ""}'
```

The endpoint returns another `task_id`; the resulting HTML tearsheet lands at `/analytics/portfolio//tearsheet.html` once Celery finishes rendering. Open it in your browser.
Or use the operator UI route [/analytics/portfolio/:runId](http://localhost:3001/analytics/portfolio).

## Step 5 — handle expected failures

**`InsufficientDataError`** — Alpha Vantage has not seeded the universe yet. Run the ingest:

```powershell
docker exec alphaswarm-api python -m scripts.ingest_yfinance `
  --symbols "SPY,QQQ,IWM" --start 2023-01-01 --end 2024-12-31
```

**`StrategyRegistryMissError`** — the YAML's `class` field references a class that is not decorated with `@register`. Open [alphaswarm/strategies/framework/algorithms.py](https://github.com/julianwileymac/alphaswarm/blob/main/alphaswarm/strategies/framework/algorithms.py) and confirm `MomentumAlpha` is there. If you renamed the class, update the YAML.

**`IcebergNamespaceError`** — your local Iceberg catalog has not been migrated. Run `make iceberg-bootstrap` and retry.

## Verify

- [ ] `backtest_runs` row visible with non-NULL `sharpe`.
- [ ] Tearsheet HTML renders.
- [ ] Strategy YAML committed under `configs/strategies/`.

## What next

- [Concept: backtest engines](../concepts/strategy/backtest-engines.md) — what `event_driven` vs `vbtpro` vs `hft` actually do.
- [Recipe: run a backtest from YAML](../how-to/recipes/run-a-backtest-from-yaml.md) — the same thing, but as a how-to for repeated dispatch.
- [Tutorial: first bot](./first-bot.md) — wrap this strategy in a reusable bot spec.

# Your first bot

> Wrap a backtested strategy in a TradingBot spec, snapshot the immutable version, run a paper session.

# Your first bot

Goal: take the strategy from [first-backtest](./first-backtest.md) and wrap it in a `BotSpec` so it can be paper-traded, deployed to Kubernetes, or chat-driven — all from a single immutable contract.

## Why

A bot is the smallest deployable unit in AlphaSwarm.
It aggregates the universe + strategy + engine + ML models + agents + RAG + risk limits + metrics into one hash-locked spec that [`BotRuntime`](https://github.com/julianwileymac/alphaswarm/blob/main/alphaswarm_bots/runtime.py) can drive through every lifecycle stage. See [Concept: bots](../concepts/agentic/bots.md).

## Step 1 — author the BotSpec

Create `configs/bots/my_first_bot.yaml`:

```yaml
name: MyFirstBot
kind: trading
description: 'First-bot tutorial — wraps MyFirstMomentum.'
strategy_config: configs/strategies/my_first_strategy.yaml
engine: event_driven
risk:
  max_position_pct: 0.5
  max_daily_loss_pct: 0.02
  kill_switch_attached: true
metrics:
  - sharpe
  - sortino
  - max_drawdown
  - hit_rate
deploy_target: paper
```

## Step 2 — snapshot the spec

```powershell
curl -X POST http://localhost:8000/bots `
  -H "Content-Type: application/json" `
  -d "@configs/bots/my_first_bot.yaml"
```

This persists a `bot_versions` row with the hash-locked spec. The response includes the `bot_id` (use this everywhere downstream) and the `spec_hash`. Different content → different hash → new version row; the old version stays intact for replay.

## Step 3 — backtest the bot

```powershell
curl -X POST http://localhost:8000/bots//backtest `
  -H "Content-Type: application/json" `
  -d '{"start":"2024-01-01","end":"2024-06-30"}'
```

Same engine as the prior tutorial, but the bot's risk overlays apply. The ledger row in `backtest_runs` carries the `bot_id` so you can correlate.

## Step 4 — paper-trade the bot

```powershell
curl -X POST http://localhost:8000/bots//paper `
  -H "Content-Type: application/json" `
  -d '{"starting_cash":100000}'
```

`BotRuntime` creates a `paper_trading_runs` row and attaches the bot to the paper broker session loop in [alphaswarm/trading/paper_trading.py](https://github.com/julianwileymac/alphaswarm/blob/main/alphaswarm/trading/paper_trading.py).
Watch the live WebSocket feed:

```javascript
const ws = new WebSocket("ws://localhost:8000/ws/paper/");
ws.onmessage = (e) => console.log(JSON.parse(e.data));
```

You will see fills, position updates, and equity-curve points streaming through the canonical progress-frame envelope.

## Step 5 — halt the bot

The bot has the kill switch attached (see Step 1 — `kill_switch_attached: true`). Trigger a halt:

```powershell
curl -X POST http://localhost:8000/bots/halt-all
```

Every paper session under every bot stops within ~250 ms.

## Verify

- [ ] `bot_versions` row visible with a `spec_hash`.
- [ ] `backtest_runs` row tagged with your `bot_id`.
- [ ] `paper_trading_runs` row visible.
- [ ] WebSocket feed delivered frames.
- [ ] Kill switch halted the bot.

## What next

- [Concept: bots](../concepts/agentic/bots.md) — the full bot contract + deployment targets (paper / k8s / backtest_only).
- [Recipe: promote a bot to paper](../how-to/recipes/promote-a-bot-to-paper.md) — same thing, but as a how-to.
- [Tutorial: first paper trading session](./first-paper-trading-session.md) — go deeper on the paper-trading lifecycle and risk overlays.

# Your first paper trading session

> Attach a bot to the paper broker, watch the WebSocket frames, trigger the kill switch.

# Your first paper trading session

Goal: drive a paper-trading session from the bot you authored in [first-bot](./first-bot.md). End-to-end: dispatch → fills → kill.

## Why

Paper trading is the highest-fidelity dress rehearsal AlphaSwarm supports without putting real money at risk. Same broker abstraction, same risk overlays, same kill-switch wiring as live trading. The difference is that fills come from the simulated execution engine in [alphaswarm/trading/paper_trading.py](https://github.com/julianwileymac/alphaswarm/blob/main/alphaswarm/trading/paper_trading.py). See [Concept: paper trading](../concepts/trading/paper-trading.md).
## Step 1 — verify the bot is ready

```powershell
curl http://localhost:8000/bots/
```

Confirm the response includes a recent `backtest_runs` reference and non-zero `sharpe`. The [paper-metadata-gate](../concepts/trading/paper-metadata-gate.md) will refuse to start the session otherwise.

## Step 2 — start the session

```powershell
curl -X POST http://localhost:8000/bots//paper `
  -H "Content-Type: application/json" `
  -d '{"starting_cash":100000,"duration_minutes":60}'
```

The response includes `paper_run_id`. The session is now in the canonical Celery loop; `alphaswarm-worker` polls the broker every second.

## Step 3 — watch the WebSocket

In a browser console:

```javascript
const ws = new WebSocket("ws://localhost:8000/ws/paper/");
ws.onmessage = (e) => {
  const frame = JSON.parse(e.data);
  console.log(frame.stage, frame.message, frame.equity, frame.positions);
};
```

You should see:

- `bar.received` — every minute bar.
- `signal.emitted` — when the strategy says "buy" / "sell" / "flat".
- `order.placed` — order goes to the simulated broker.
- `order.filled` — fill comes back; positions update.
- `equity.update` — equity-curve point at the end of each bar.

All frames follow the canonical `{task_id, stage, message, timestamp, **extras}` envelope per AGENTS rule 4.

## Step 4 — risk + kill switch

The bot's `risk` block (Step 1 of first-bot) is enforced by [alphaswarm/risk/limits.py::RiskLimits](https://github.com/julianwileymac/alphaswarm/blob/main/alphaswarm/risk/limits.py). Once any limit is hit, the session emits `risk.halted` and stops.

The topbar kill switch in the Vite UI fans out to:

- `POST /bots/halt-all`
- `POST /paper/stop-all`
- `POST /agents/halt`
- `POST /rl/halt-all`
- `POST /workflows/halt`
- `POST /terraform/halt`
- `POST /quant-agents/halt`

The whole stack stops in under 250 ms.
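The fan-out above can be sketched as a small client that posts to each halt endpoint in parallel and collects the outcomes. This is a hedged illustration, not the UI's implementation: the base URL, the `halt_everything` and `http_post` helpers, and the injectable `post` callable are assumptions made for the example. Only the seven paths come from the list above.

```python
from concurrent.futures import ThreadPoolExecutor
from urllib import request

# The seven endpoints named in the kill-switch list above.
HALT_PATHS = [
    "/bots/halt-all",
    "/paper/stop-all",
    "/agents/halt",
    "/rl/halt-all",
    "/workflows/halt",
    "/terraform/halt",
    "/quant-agents/halt",
]


def http_post(url: str) -> int:
    """POST with an empty body and return the HTTP status code."""
    req = request.Request(url, data=b"", method="POST")
    with request.urlopen(req, timeout=2) as resp:
        return resp.status


def halt_everything(base_url: str, post=http_post) -> dict:
    """Send every halt request in parallel.

    Returns a dict mapping each path to its status code, or to the
    exception raised for that path.
    """
    def call(path):
        try:
            return path, post(base_url + path)
        except Exception as exc:  # one failing surface must not block the others
            return path, exc

    with ThreadPoolExecutor(max_workers=len(HALT_PATHS)) as pool:
        return dict(pool.map(call, HALT_PATHS))
```

Against a local stack this would be called as `halt_everything("http://localhost:8000")`. Keeping per-path results, including exceptions, makes it visible when one surface failed to halt while the rest succeeded.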
## Step 5 — inspect the ledger

```sql
SELECT id, bot_id, status, total_pnl, num_fills, started_at, ended_at
FROM paper_trading_runs
ORDER BY started_at DESC
LIMIT 1;

SELECT order_id, symbol, side, qty, price, filled_at
FROM paper_fills
WHERE paper_run_id = ''
ORDER BY filled_at;
```

## Verify

- [ ] WebSocket delivered at least one `order.filled` frame.
- [ ] `paper_trading_runs` row has non-NULL `total_pnl`.
- [ ] Kill switch closed the session.

## What next

- [Concept: paper trading](../concepts/trading/paper-trading.md) — the full session loop, broker abstraction, and risk model.
- [Concept: paper metadata gate](../concepts/trading/paper-metadata-gate.md) — why some sessions get blocked before they start.
- [How-to: kill switch incident response](../how-to/operations/kill-switch-incident-response.md) — the runbook for when the kill switch fires in production.

# Your first RL experiment

> Author an RLExperimentSpec, train via SB3 PPO, replay from the Iceberg trajectory store.

# Your first RL experiment

Goal: from blank `RLExperimentSpec` to a trained PPO agent with trajectories persisted to Iceberg, in under 10 minutes on CPU.

## Why

The RL stack is AlphaSwarm's most opinionated subsystem: hash-locked `RLExperimentSpec`, metaclass-registered components, deterministic Iceberg trajectory persistence, and a single sanctioned executor ([`RLRuntime`](https://github.com/julianwileymac/alphaswarm/blob/main/alphaswarm_rl/src/alphaswarm_rl/runtime.py)). Every RL run produces an immutable `rl_runs` ledger row and a replayable trajectory. See [Concept: RL framework](../concepts/rl/rl-framework.md).

## Prerequisites

- Quickstart completed.
- A small dev dataset under your local Iceberg catalog. The bundled `alphaswarm_bronze_yfinance_daily` namespace works.
## Step 1 — author the spec

Create `alphaswarm_rl/configs/experiments/my_first_rl.yaml`:

```yaml
name: MyFirstRLExperiment
description: First-RL tutorial — PPO on a static universe
environment:
  rl_alias: SingleAssetTradingEnv
  symbol: { ticker: SPY, exchange: ARCA, kind: equity }
  lookback_bars: 60
  initial_cash: 100000
data_pipeline:
  rl_alias: IcebergDataPipeline
  namespace: alphaswarm_bronze_yfinance_daily
  start: 2022-01-01
  end: 2023-12-31
agent:
  rl_alias: SB3Adapter
  algorithm: PPO
  policy: MlpPolicy
  total_timesteps: 50000
rewards:
  - { rl_alias: PnLReward, weight: 1.0 }
  - { rl_alias: TurnoverPenalty, weight: 0.1 }
  - { rl_alias: VolatilityPenalty, weight: 0.05 }
observations:
  - { rl_alias: StockstatsObservation }
  - { rl_alias: LookbackObservation, length: 20 }
training:
  advantage: { rl_alias: GAEAdvantage, lambda: 0.95, gamma: 0.99 }
  backbone: { rl_alias: TransformerBackbone, d_model: 64, n_heads: 4 }
```

## Step 2 — snapshot + train

```powershell
curl -X POST http://localhost:8000/rl/runs `
  -H "Content-Type: application/json" `
  -d '{"spec_path":"alphaswarm_rl/configs/experiments/my_first_rl.yaml","mode":"train"}'
```

The response includes the `rl_run_id`. Tail the progress:

```powershell
docker exec alphaswarm-api python -c "from alphaswarm.ws.broker import subscribe; \
[print(m) for m in subscribe('')]"
```

Training for 50k timesteps on a CPU finishes in 5-8 minutes.
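For readers unfamiliar with the `GAEAdvantage` entry in the Step 1 spec, generalised advantage estimation with `gamma: 0.99` and `lambda: 0.95` can be sketched as follows. This is the textbook backward recursion written in plain Python, not the AlphaSwarm component; the function name `gae_advantages` and its argument layout are assumptions made for illustration.

```python
def gae_advantages(rewards, values, last_value, dones, gamma=0.99, lam=0.95):
    """Return one advantage per step using the backward GAE recursion.

    rewards[t], values[t] and dones[t] describe step t; last_value is the
    value estimate for the state that follows the final step.
    """
    advantages = [0.0] * len(rewards)
    next_value = last_value
    running = 0.0
    for t in reversed(range(len(rewards))):
        not_done = 0.0 if dones[t] else 1.0
        # TD residual: r_t + gamma * V(s_{t+1}) - V(s_t)
        delta = rewards[t] + gamma * next_value * not_done - values[t]
        # Exponentially weighted sum of residuals, cut at episode ends.
        running = delta + gamma * lam * not_done * running
        advantages[t] = running
        next_value = values[t]
    return advantages
```

Setting `lam=0` reduces each advantage to the one-step TD residual, while `lam=1` recovers the discounted return minus the value baseline. The 0.95 in the spec sits between those two extremes.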
## Step 3 — inspect the ledger + trajectory store

```sql
-- rl_runs ledger
SELECT id, experiment_name, status, total_timesteps, mean_reward
FROM rl_runs
ORDER BY created_at DESC
LIMIT 5;
```

The trajectory data lives in Iceberg under `alphaswarm_silver_rl_trajectories.`:

```python
from pyiceberg.catalog import load_catalog

# Load the catalog, then the trajectory table for this run.
cat = load_catalog("alphaswarm")
tbl = cat.load_table("alphaswarm_silver_rl_trajectories.")

# Materialise the table as a DataFrame and show the first 20 steps.
df = tbl.scan().to_pandas()
print(df[["episode", "step", "reward", "action"]].head(20))
```

## Step 4 — replay

```powershell
curl -X POST http://localhost:8000/rl/runs//replay `
  -H "Content-Type: application/json" `
  -d '{"start":"2024-01-01","end":"2024-03-31"}'
```

Same hash-locked spec, new data window, separate `rl_runs` row.

## Step 5 — halt

```powershell
curl -X POST http://localhost:8000/rl/halt-all
```

## Verify

- [ ] `rl_experiment_versions` row with a `spec_hash`.
- [ ] `rl_runs` row with non-NULL `mean_reward`.
- [ ] Iceberg trajectory table populated.
- [ ] Replay produces a different `rl_runs` row but reuses the same `rl_experiment_versions` row (hash-locked!).

## What next

- [Concept: RL components](../concepts/rl/rl-components.md) — add your own reward term, observation builder, or policy backbone.
- [Concept: RL Iceberg trajectories](../concepts/rl/rl-iceberg.md) — the persistence contract.
- [Tutorial: first agent workflow](./first-agent-workflow.md) — hand off RL outputs to an autonomous agent loop.

# Tutorials

> Runnable walkthroughs for every AlphaSwarm surface. Pyodide + StackBlitz WebContainers in your browser.

# Tutorials

Runnable, learning-oriented walkthroughs. Each tutorial assumes the [quickstart](../intro/quickstart.md) has succeeded. Python snippets execute via Pyodide directly in your browser; full project setups open in StackBlitz WebContainers. Both are sandboxed and never reach the production cluster.
## Tutorial catalogue

- **[First backtest](./first-backtest.md)** — author a momentum strategy, run it through `EventDrivenBacktester`, inspect the `backtest_runs` ledger row, render a tearsheet.
- **[First bot](./first-bot.md)** — wrap the strategy in a `TradingBot` spec, snapshot the version, run a paper session.
- **[First RL experiment](./first-rl-experiment.md)** — author an `RLExperimentSpec`, train via SB3 PPO, replay from the Iceberg trajectory store.
- **[First agent workflow](./first-agent-workflow.md)** — compose a three-node LangGraph (Research → Selection → Trader), run it through `AgentRuntime`, inspect the `agent_runs_v2` ledger.
- **[First paper trading session](./first-paper-trading-session.md)** — attach the bot to the paper broker, watch the WebSocket frames, trigger the kill switch.

Each tutorial includes:

1. A "Why" section explaining what you are about to learn.
2. A canonical reference to the deeper concept doc.
3. Inline runnable code.
4. A "Verify" checklist at the end.
5. A "What next" pointer.

## Conventions for these tutorials

- **One concept per page.** If a tutorial gets too long, split it and link the second page.
- **Verify everything.** Every code block produces an observable effect — a JSON response, a ledger row, a WebSocket frame.
- **Show the failure mode.** Each tutorial documents at least one expected error and how to recover.