# AlphaSwarm — full corpus

> Concatenated, MDX-stripped markdown for one-shot LLM ingestion.
> See /llms.txt for the curated index.


<!-- https://alpha-swarm.ai/architecture/decisions/001-static-export-over-ssr -->
# ADR 001 — Static export (Vite) over SSR for the AlphaSwarm client surface
> The AlphaSwarm frontend rewrite (`alphaswarm_client/`, Vite 7 + React 19 + Tailwind 4 + shadcn/ui) is the cutover-complete operator UI. The legacy `webui/` (Next.js 15 / antd) remains in tree only as a rollback pat...


- **Status**: Accepted (2026-05-18)
- **Authors**: Platform team
- **Supersedes**: None
- **Related**: [ADR 002 — single container client](002-single-container-client.md), [`alphaswarm_client/CUTOVER.md`](https://github.com/julianwileymac/alphaswarm/blob/main/alphaswarm_docs/alphaswarm_client/CUTOVER.md)

## Context

The AlphaSwarm frontend rewrite (`alphaswarm_client/`, Vite 7 + React 19 + Tailwind 4 + shadcn/ui) is the cutover-complete operator UI. The legacy `webui/` (Next.js 15 / antd) remains in tree only as a rollback path. The new `alphaswarm_client` container needs to bundle a UI build, the legacy fallback, and the FastAPI gateway into a single deployable image.

Three rendering options were considered for the canonical UI:

1. **Server-side rendering (Next.js)** — server-rendered React with client-side hydration, mounted under uvicorn via `WSGIMiddleware`.
2. **Static export (Next.js)** — `next build` with `output: 'export'`, identical to the prompt's original §2.1 wording.
3. **Static export (Vite)** — `pnpm --dir alphaswarm_client build` emitting a single `dist/` static SPA bundle.

## Decision

The canonical UI shipped in `alphaswarm_client` is the **Vite static export** under `alphaswarm_client/`. The Next.js legacy `webui/` is mounted as a rollback surface at `/webui` and Solara at `/legacy`, but neither is the default landing page.

Concretely:

- Stage 1 of `alphaswarm_platform/build/docker/alphaswarm_client/Dockerfile` runs `pnpm --dir alphaswarm_client build` and copies `alphaswarm_client/dist/` to `/app/static/`.
- The FastAPI app in `alphaswarm/api/main.py` mounts `/static` to the Vite asset directory and falls back to `index.html` for client-side routes (SPA fallback).
- The Vite app calls API endpoints through a relative `/api` prefix; the FastAPI gateway proxies those to whatever the `ConnectivityConfig` env vars point at.
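
The SPA fallback in the second bullet boils down to a path-resolution rule. The sketch below illustrates that rule only; it is not the contents of `serve_spa`. The helper name `resolve_spa_path` and the directory layout are assumptions made for this example.

```python
from pathlib import Path


def resolve_spa_path(static_root: Path, request_path: str) -> Path:
    """Return the file to serve for a request path (hypothetical helper).

    Real files under static_root are served as-is; anything else falls
    back to index.html so the client-side router can handle the route.
    """
    root = static_root.resolve()
    candidate = (root / request_path.lstrip("/")).resolve()
    # Refuse paths that escape the static root (e.g. "/../secrets.txt").
    inside_root = candidate == root or root in candidate.parents
    if inside_root and candidate.is_file():
        return candidate
    return root / "index.html"
```

A request for a hashed asset resolves to the asset itself, while a deep link such as `/strategies/42` (an invented route) resolves to `index.html` and is handled by the client-side router.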

## Consequences

**Positive**
- Single-process Python runtime — no Node.js in the production image, smaller attack surface, no `npm` supply-chain risk in production.
- No SSR cold-start cost. The whole UI is ~3 MB of static assets served with `Cache-Control: immutable`.
- Identical container in dev, k3d, and Kubernetes — only env vars change.
- Vite is already canonical per [`alphaswarm_client/CUTOVER.md`](https://github.com/julianwileymac/alphaswarm/blob/main/alphaswarm_docs/alphaswarm_client/CUTOVER.md). Picking the in-flight stack avoids reopening the cutover debate.

**Negative**
- No streaming-SSR for the operator UI. WebSocket and SSE streams (chat, live, telemetry) carry the live data instead, which matches the existing throttled `useChatStream` / `useLiveStream` hooks.
- SEO and first-paint metrics are weaker than SSR, but the AlphaSwarm UI is an authenticated operator console, not a public site — neither matters.
- Pre-rendered routes per user/tenant are not possible. All personalisation happens client-side using Auth0 claims from `useUser()`.

## Alternatives considered

- **SSR** — rejected because it forces Node.js into the runtime image and adds a separate process to supervise.
- **Static export (Next.js)** — rejected to avoid maintaining two frontend toolchains. The Next.js webui stays as rollback only.

## Implementation references

- Frontend build target: `alphaswarm_client/package.json` `"build": "vite build"`
- Production Dockerfile: `alphaswarm_platform/build/docker/alphaswarm_client/Dockerfile`
- SPA fallback handler: `alphaswarm/api/main.py::serve_spa`
- Cutover history: [`alphaswarm_client/CUTOVER.md`](https://github.com/julianwileymac/alphaswarm/blob/main/alphaswarm_docs/alphaswarm_client/CUTOVER.md)


<!-- https://alpha-swarm.ai/architecture/decisions/002-single-container-client -->
# ADR 002 — Single multi-stage container for the AlphaSwarm client surface
> Today AlphaSwarm runs the Vite frontend on `:3001` (compose `:3002`), the legacy Next.js webui on `:3000` (now stopped), the legacy Solara UI on `:8765`, and the FastAPI API on `:8000`. Operators have to jug...


- **Status**: Accepted (2026-05-18) — **Superseded for `alphaswarm_ui` by [ADR 011](011-cdn-fronted-standalone-for-alphaswarm-ui.md) on 2026-05-25.** Still in force for the local-operator `alphaswarm_client/` packaging path.
- **Authors**: Platform team
- **Supersedes**: None
- **Superseded by**: [ADR 011 — CDN-fronted standalone for `alphaswarm_ui`](011-cdn-fronted-standalone-for-alphaswarm-ui.md) (cloud surface only)
- **Related**: [ADR 001 — Vite static export](001-static-export-over-ssr.md), [ADR 005 — separated control plane](005-separated-control-plane.md), [ADR 011 — CDN-fronted standalone for `alphaswarm_ui`](011-cdn-fronted-standalone-for-alphaswarm-ui.md), [ADR 012 — Solara deprecation](012-solara-deprecation.md)

> **Scope narrowing (2026-05-25):** This ADR's decisions apply ONLY
> to the local-operator `alphaswarm_client/` packaging. The cloud-hosted
> `alphaswarm_ui/` surface (at `alpha-swarm.ai` / `app.alpha-swarm.ai`) is governed by
> ADR 011 and uses a clean Next.js standalone container with no
> ASGI proxy stage. See ADR 011 for the cloud rationale.

## Context

Today AlphaSwarm runs the Vite frontend on `:3001` (compose `:3002`), the legacy Next.js webui on `:3000` (now stopped), the legacy Solara UI on `:8765`, and the FastAPI API on `:8000`. Operators have to juggle four URLs and four health probes. The `alphaswarm_client` Docker image is a chance to collapse these into one.

Three packaging options were considered:

1. **One container per surface**: separate `alphaswarm-frontend`, `alphaswarm-solara`, `alphaswarm-api` images; an external Ingress/NGINX layer fans out traffic.
2. **Sidecar pattern**: one container per surface in a single Pod, sharing localhost behind an `nginx` sidecar.
3. **Single multi-stage build**: Stage 1 builds Vite, Stage 2 prepares Solara, Stage 3 (production) is a `python:3.11-slim` runtime that serves both as static + ASGI mount and proxies API traffic.

## Decision

`alphaswarm_client` is **one image built from a three-stage Dockerfile** that ships:

- Stage 1 (`ui-builder`, `node:20-alpine`) — runs `pnpm --dir alphaswarm_client build`, output to `/app/out/`. Node is dropped after this stage.
- Stage 2 (`solara-builder`, `python:3.11-slim`) — installs Solara + legacy UI deps, pre-warms component caches, verifies `legacy_ui.app` is importable.
- Stage 3 (`production`, `python:3.11-slim`) — installs FastAPI + uvicorn + httpx + websockets + python-jose + `alphaswarm_core`. Copies Vite assets from Stage 1 and Solara from Stage 2. Exposes port `8080`. No Node, no npm.

The Stage 3 runtime mounts:

- `/static` → Vite assets from Stage 1
- `/legacy` → Solara ASGI app
- `/webui` → legacy Next.js export (rollback only)
- `/api/*` → reverse-proxied to `ALPHASWARM_CORE_API_URL`
- `/ml/*` → reverse-proxied to `ALPHASWARM_ML_API_URL`
- `/mcp/*` → reverse-proxied to `ALPHASWARM_MCP_URL`
- `/manage/*` → reverse-proxied to `ALPHASWARM_CONTROL_PLANE_URL`
- `/ws/*` → WebSocket proxy with reconnect-with-backoff
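
The HTTP proxy rows of the mount table can be read as a prefix-to-env-var lookup. The following is a minimal sketch under stated assumptions: `resolve_upstream` is a hypothetical helper, and the ADR does not say whether the real proxy strips or forwards the prefix, so this version forwards the path unchanged.

```python
import os
from typing import Mapping, Optional

# Prefix -> env var, copied from the mount table above.
PROXY_ROUTES = {
    "/api/": "ALPHASWARM_CORE_API_URL",
    "/ml/": "ALPHASWARM_ML_API_URL",
    "/mcp/": "ALPHASWARM_MCP_URL",
    "/manage/": "ALPHASWARM_CONTROL_PLANE_URL",
}


def resolve_upstream(path: str, env: Mapping[str, str] = os.environ) -> Optional[str]:
    """Map an incoming path to its upstream URL, or None if it is not proxied.

    Assumption: the path is forwarded as-is (prefix included).
    """
    for prefix, var in PROXY_ROUTES.items():
        if path.startswith(prefix):
            base = env.get(var)
            if not base:
                raise LookupError(f"{var} is not set")
            return base.rstrip("/") + path
    return None
```

Because the only inputs are the request path and the environment, the same lookup would give compose and Kubernetes deployments different upstreams without any code change.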

## Consequences

**Positive**
- One image, one health probe (`/health`), one set of `securityContext` rules.
- Stable URL surface for operators — bookmarks, dashboards, and runbooks don't break when backends move.
- All backend addresses live in `ConnectivityConfig` env vars. The same image runs in compose with `ALPHASWARM_*_URL=http://alphaswarm-core:8000` or in K8s with `http://alphaswarm-core.default.svc.cluster.local`.
- Auth0 callback URLs stay constant. The Vite app sees one origin; the FastAPI proxy injects M2M `Authorization` headers for cross-service calls.
- Smaller blast radius. The control plane is a separate container on a separate Docker network (`alphaswarm-admin` vs `alphaswarm-internal`) — they only talk over the proxy.

**Negative**
- Builds are larger and slower than per-surface images. Mitigated by Docker layer caching and buildx (~3 min cold, ~30s incremental).
- Scaling assumes Vite + Solara + proxy throughput grow together. In practice Vite assets are cached at the NGINX Ingress and the proxy is the bottleneck, so a single CPU-based HPA on the container is sufficient.
- Rolling back to webui-only or Solara-only means env-flag toggles (`ALPHASWARM_CLIENT_ENABLE_LEGACY_UI`, `ALPHASWARM_CLIENT_ENABLE_SOLARA`) rather than swapping deployments.

## Alternatives considered

- **One container per surface** — rejected. Adds 3 health probes, 3 Ingress rules, 3 image tags to keep in lockstep on every release. The operator experience regresses.
- **Sidecar pattern** — rejected. Mixing sidecars + multi-process supervision in one Pod adds significant Pod-startup ordering risk for marginal CPU savings.

## Implementation references

- Multi-stage Dockerfile: `alphaswarm_platform/build/docker/alphaswarm_client/Dockerfile`
- FastAPI proxy: `alphaswarm/api/proxy.py`
- WebSocket proxy with reconnect: `alphaswarm/api/ws/proxy.py`
- ConnectivityConfig: `alphaswarm_core/connectivity/config.py`


<!-- https://alpha-swarm.ai/architecture/decisions/003-auth0-zero-trust -->
# ADR 003 — Auth0 zero-trust two-layer security model
> AlphaSwarm already uses Auth0 for the operator UI via the in-flight `alphaswarm/auth/providers/auth0.py` plugin (AGENTS hard rule 27). What's missing for the refactor is the second layer: cryptographic JWT validati...


- **Status**: Accepted (2026-05-18)
- **Authors**: Platform team
- **Supersedes**: None
- **Related**: [ADR 005 — separated control plane](005-separated-control-plane.md), [alphaswarm_docs/identity.md](../../concepts/identity/identity.md), [alphaswarm_docs/auth0-actions.md](../../concepts/identity/auth0-actions.md)

## Context

AlphaSwarm already uses Auth0 for the operator UI via the in-flight `alphaswarm/auth/providers/auth0.py` plugin (AGENTS hard rule 27). What's missing for the refactor is the second layer: cryptographic JWT validation at every service boundary, resource-scoped claims so users only see their own resources, and a per-role scope matrix that the `alphaswarm_controller` micro-project can enforce without ever importing `alphaswarm.*`.

Three identity strategies were considered:

1. **Self-hosted Keycloak** — full control, but operations burden and one more stateful service per cluster.
2. **Single-layer Auth0 (current state)** — Auth0 only for the SPA login. Backend services still trust user-injected headers via session cookies.
3. **Two-layer Auth0 (recommended in prompt)** — Auth0 OIDC for the SPA + JWT (`RS256`) bearer tokens validated independently by every service via JWKS.

## Decision

Adopt the **two-layer Auth0 model** with the following invariants:

1. The Vite SPA in `alphaswarm_client` performs Authorization Code + PKCE against the Auth0 tenant. Access tokens are short-lived (1 h) JWTs with `aud` = `https://api.alphaswarm.internal/manage`.
2. Every backend service — `alphaswarm` (FastAPI API), `alphaswarm_controller` (micro-project), and the `rpi_kubernetes` `management/backend` shim — re-validates JWTs against the Auth0 JWKS independently using the shared validator in `alphaswarm_core/auth/`. **No service trusts a header set by another service.**
3. Auth0 Post-Login Action (template in `alphaswarm_platform/terraform/modules/auth0_identity/post_login_action.js.tftpl`) calls `POST /_internal/auth0/sync` to fetch user-specific custom claims and injects them into the access token under the **`https://alphaswarm.internal/`** namespace:
   - `https://alphaswarm.internal/org_id`: tenancy boundary
   - `https://alphaswarm.internal/roles`: coarse role list (`alphaswarm-viewer`, `alphaswarm-operator`, `alphaswarm-admin`, `alphaswarm-superadmin`)
   - `https://alphaswarm.internal/resources`: explicit resource ID allowlist (org-scoped)
   - `https://alphaswarm.internal/workspace_id`, `https://alphaswarm.internal/team_ids`: existing tenancy hints
4. M2M tokens for service-to-service calls (e.g. `alphaswarm_client` → `alphaswarm_controller`) mint through Auth0 Client Credentials. The proxy in `alphaswarm/api/proxy.py` attaches a cached M2M token; `alphaswarm_controller` validates it like any other JWT.
5. The four-role RBAC matrix from the refactor prompt becomes the canonical scope grid:

   | Role                    | Scopes granted                                                                  |
   | ----------------------- | ------------------------------------------------------------------------------- |
   | `alphaswarm-viewer`     | `read:infrastructure`                                                           |
   | `alphaswarm-operator`   | `read:infrastructure` + `manage:agents`                                         |
   | `alphaswarm-admin`      | `read:infrastructure` + `manage:agents` + `manage:infrastructure`               |
   | `alphaswarm-superadmin` | All of the above + `admin:cluster` (only role that bypasses `filter_resources`) |

6. Every list endpoint in both `alphaswarm` and `alphaswarm_controller` passes its result list through `alphaswarm_core.auth.resource_filter.filter_resources(items, jwt_payload)` before returning. The filter respects `admin:cluster` (returns everything) and otherwise intersects against the `resources` claim.
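
The filtering rule in item 6 can be sketched in a few lines. This is an illustration, not the code in `alphaswarm_core/auth/resource_filter.py`. It assumes scopes arrive in the space-delimited OAuth2 `scope` claim and that each item carries its resource ID under an `id` key; neither detail is stated in this ADR.

```python
from typing import Any, Iterable, List, Mapping

NAMESPACE = "https://alphaswarm.internal/"


def filter_resources(
    items: Iterable[Mapping[str, Any]], jwt_payload: Mapping[str, Any]
) -> List[Mapping[str, Any]]:
    """Keep only the items the token holder may see (illustrative sketch)."""
    # Assumption: scopes are carried in the space-delimited "scope" claim.
    scopes = set(str(jwt_payload.get("scope", "")).split())
    if "admin:cluster" in scopes:
        return list(items)  # superadmin bypass: return everything
    allowed = set(jwt_payload.get(NAMESPACE + "resources", []))
    # Assumption: each item exposes its resource ID under the "id" key.
    return [item for item in items if item.get("id") in allowed]
```

A token without a `resources` claim yields an empty list in this sketch, so a missing claim fails closed rather than open.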

## Consequences

**Positive**
- Zero-trust between services. A compromised `alphaswarm_client` container can issue requests but cannot forge claims — the control plane re-validates.
- Resource scoping moves from "frontend hides things" to "backend cannot return things". Defence in depth.
- Auth0 is already in production for the SPA; the only delta is adding M2M tokens and the `resources` claim.
- The `alphaswarm_controller` micro-project gets a clean security boundary without importing `alphaswarm.auth.*` — it depends on `alphaswarm_core/auth/` only.

**Negative**
- Every API request pays a JWT signature-verification cost (~0.2 ms once the JWKS is held in `lru_cache`). Acceptable.
- The `https://alphaswarm/` → `https://alphaswarm.internal/` namespace rename requires one release of dual-reading both namespaces (handled by `auth_claims_namespace_aliases` setting).
- Operators need to be onboarded to one of the four roles before they can use the new control plane; `alphaswarm_platform/build/scripts/provision_auth0.py` handles this at bootstrap.
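
Dual-reading during the namespace rename amounts to trying the new prefix first and falling back to the legacy one. The sketch below shows that lookup order; `read_claim` and the two constants are hypothetical names, and how the `auth_claims_namespace_aliases` setting is actually consumed is not shown in this ADR.

```python
from typing import Any, Mapping, Sequence

PRIMARY_NAMESPACE = "https://alphaswarm.internal/"
NAMESPACE_ALIASES = ("https://alphaswarm/",)  # legacy namespace, read-only


def read_claim(
    payload: Mapping[str, Any],
    name: str,
    namespace: str = PRIMARY_NAMESPACE,
    aliases: Sequence[str] = NAMESPACE_ALIASES,
    default: Any = None,
) -> Any:
    """Read a custom claim, preferring the new namespace over any alias."""
    for prefix in (namespace, *aliases):
        key = prefix + name
        if key in payload:
            return payload[key]
    return default
```

Once the dual-read release has shipped, passing an empty `aliases` tuple would turn off the legacy lookup without touching call sites.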

## Alternatives considered

- **Self-hosted Keycloak** — rejected. Adds operational burden without business value. Auth0 plays well with Terraform (already in `alphaswarm_platform/terraform/modules/auth0_identity/`).
- **Cookie-only sessions** — rejected. Backend services would have to trust whatever set the cookie; doesn't compose with the cross-service M2M case.
- **Opaque tokens with introspection** — rejected. Adds a round trip per request against Auth0's `/oauth/token/introspect`, and Auth0's free tier rate-limits it.

## Implementation references

- JWT validator: `alphaswarm_core/auth/validator.py` (extracted from `alphaswarm/auth/providers/auth0.py`)
- Resource filter: `alphaswarm_core/auth/resource_filter.py`
- Claims namespace setting: `alphaswarm/config/settings.py::auth_claims_namespace`, `auth_claims_namespace_aliases`
- Auth0 Action template: `alphaswarm_platform/terraform/modules/auth0_identity/post_login_action.js.tftpl`
- Sync endpoint: `alphaswarm/api/routes/auth0_sync.py`
- Terraform Auth0 module: `alphaswarm_platform/terraform/modules/auth0_identity/main.tf`
- Provisioning script: `alphaswarm_platform/build/scripts/provision_auth0.py`


<!-- https://alpha-swarm.ai/architecture/decisions/004-provider-abstraction -->
# ADR 004 — Abstract InfrastructureProvider ABC for workload runtime ops
> AlphaSwarm's existing IaC story is Terraform-first (AGENTS hard rule 42): every state-mutating cluster operation goes through `alphaswarm/terraform/runtime.py::TerraformRuntime`. That guarantee is great for **provi...


- **Status**: Accepted (2026-05-18)
- **Authors**: Platform team
- **Supersedes**: Tightens AGENTS hard rule 42
- **Related**: [ADR 005 — separated control plane](005-separated-control-plane.md), [`AGENTS.md`](https://github.com/julianwileymac/alphaswarm/blob/main/alphaswarm_docs/AGENTS.md)

## Context

AlphaSwarm's existing IaC story is Terraform-first (AGENTS hard rule 42): every state-mutating cluster operation goes through `alphaswarm/terraform/runtime.py::TerraformRuntime`. That guarantee is great for **provisioning** (create cluster, create namespace, apply RBAC, register Auth0 tenant) but it's an awkward fit for **live workload operations** — restarting a pod, scaling a Deployment, exec-ing a shell, tailing logs — which today incur a full `terraform plan` + `apply` round trip and write to `terraform_runs` even though no IaC actually changed.

The refactor introduces the `alphaswarm_controller` micro-project that needs to support five backends (docker_compose, kubernetes, AWS, Azure, GCP). Two paths were considered:

1. **Translate every workload op into Terraform** — every restart becomes a Terraform `null_resource` + provisioner. Preserves the rule 42 ledger as a single source of truth, but turns Terraform into a glorified `kubectl` wrapper.
2. **Introduce a sibling abstraction** — `InfrastructureProvider` ABC with five implementations, each calling its backend's native SDK (kubernetes-client, docker SDK, boto3, azure-mgmt, google-cloud-run). Terraform stays for provisioning only.

## Decision

Adopt **path 2: an abstract `InfrastructureProvider` ABC** for runtime workload operations. Specifically:

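```python
from abc import ABC, abstractmethod

# Shared models; per ADR 005 these live under `alphaswarm_core/models/`
# (the exact import path is assumed here).
from alphaswarm_core.models import DeploymentSpec, DeploymentStatus


class InfrastructureProvider(ABC):
    """Runtime workload operations, implemented once per backend."""

    @abstractmethod
    async def start(self, spec: DeploymentSpec) -> DeploymentStatus: ...

    @abstractmethod
    async def stop(self, service_id: str) -> DeploymentStatus: ...

    @abstractmethod
    async def scale(self, service_id: str, replicas: int) -> DeploymentStatus: ...

    @abstractmethod
    async def status(self, service_id: str) -> DeploymentStatus: ...

    @abstractmethod
    async def apply_config(self, service_id: str, config: dict) -> bool: ...

    @abstractmethod
    async def stream_metrics(self, service_id: str): ...  # async generator
```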

Five concrete providers live under `alphaswarm_controller/src/alphaswarm_controller/providers/`:

- `docker_compose.py` — docker Python SDK + `docker compose` subprocess for multi-container profiles
- `kubernetes.py` — kubernetes-client/python (in-cluster + kubeconfig); Deployment apply, scale-to-0, ConfigMap patch, Metrics Server query
- `aws.py` — boto3; EKS delegates to `kubernetes.py`; ECS/Fargate via `update_service`; config sync via SSM Parameter Store
- `azure.py` — azure-mgmt; AKS delegates to `kubernetes.py`; ACI via container groups; config sync via App Configuration / Key Vault
- `gcp.py` — google-cloud SDKs; GKE delegates to `kubernetes.py`; Cloud Run via revision updates; config sync via Secret Manager

Each provider:
- Reads credentials from env vars only (`alphaswarm_core.credentials.CredentialResolver`).
- Translates `DeploymentSpec` to its backend's native API.
- Returns a normalised `DeploymentStatus`.
- Maps backend-specific exceptions to structured `{status, data, error}` envelopes.

## Amendment to AGENTS hard rule 42 (this PR)

Rule 42 changes from "all Terraform IaC lifecycle actions go through TerraformRuntime" to:

> 42. **All Terraform IaC PROVISIONING actions go through `alphaswarm/terraform/runtime.py::TerraformRuntime`.** Cluster bootstrap, IAM, Auth0 tenant, namespaces, secrets, network policies, and Ingress class registration are all "provisioning". The `terraform_runs` ledger, the `terraform_stack_spec_versions` hash-lock, the kill-switch hook (`/terraform/halt`), and OPA policy enforcement all depend on it.

A new rule 45 covers the workload ops side:

> 45. **All runtime workload operations go through `alphaswarm_controller.InfrastructureProvider` (via `WorkloadRuntime`).** Start, stop, scale, restart, exec, log-tail, and `apply_config` are workload ops. They never reach for Terraform. A new `workload_runs` ledger row is created per mutating action with full audit context (user_id, action, target, provider, timestamp) BEFORE the provider call executes.
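
The "ledger row BEFORE the provider call" ordering in rule 45 is sketched below. `WorkloadRun` and this `WorkloadRuntime` are simplified stand-ins written for the example: an in-memory list replaces the real `workload_runs` table, and the field names follow the audit context listed in the rule.

```python
import asyncio
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import Any, Awaitable, Callable, List


@dataclass
class WorkloadRun:
    user_id: str
    action: str
    target: str
    provider: str
    timestamp: datetime = field(default_factory=lambda: datetime.now(timezone.utc))


class WorkloadRuntime:
    """Hypothetical stand-in; the list replaces the `workload_runs` table."""

    def __init__(self) -> None:
        self.ledger: List[WorkloadRun] = []

    async def run(
        self,
        user_id: str,
        action: str,
        target: str,
        provider: str,
        call: Callable[[], Awaitable[Any]],
    ) -> Any:
        # Record the intent first, so a failed call still leaves an audit row.
        self.ledger.append(WorkloadRun(user_id, action, target, provider))
        return await call()
```

Writing the row first means the ledger records attempted actions, not just successful ones, which is what the sketch takes "BEFORE the provider call executes" to imply.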

## Consequences

**Positive**
- Restart latency drops from ~30 s (Terraform plan + apply) to ~200 ms (kubectl scale).
- The five providers are fully independent — each can be implemented + tested in parallel by an `orchestrate` fan-out (see plan §8.2).
- Terraform stays clean for IaC provisioning and immutable audit trails. The `terraform_runs` ledger remains the source of truth for "what infrastructure exists".
- The `alphaswarm_controller` micro-project becomes a thin, testable layer with mocked SDKs in CI.
- Hard rules 27 (IdentityProvider) and 28 (KubernetesAdapter) and the new ABC all follow the same self-registering metaclass pattern, so the approach is consistent across the codebase.

**Negative**
- Two separate audit ledgers (`terraform_runs` + `workload_runs`) instead of one. Documented in `alphaswarm_docs/docs/how-to/operations/incident-response.md`.
- The five providers each take their own credential chain. Mitigated by `CredentialResolver` so service code never sees raw env vars.
- Provisioning vs runtime boundary is a soft line — adding a new namespace is provisioning, but auto-creating a per-tenant namespace at user signup is workload-ish. Each new operation requires an explicit choice; ADR 005 includes a decision tree.

## Alternatives considered

- **Translate every op into Terraform** — rejected. Operational cost of running `terraform apply` on every pod restart is prohibitive (~30 s p99), and Terraform's lock semantics serialise unrelated ops on the same workspace.
- **Use Crossplane** — investigated; rejected for now. Crossplane is excellent for declarative cloud APIs but adds a CRD layer and operator dependency for marginal value over the five-provider Python ABC. Revisit when AlphaSwarm exceeds five backends.
- **Use Pulumi instead of Terraform** — out of scope. The existing `TerraformRuntime` works and is hash-locked; replacing it is a separate ADR.

## Implementation references

- ABC: `alphaswarm_controller/src/alphaswarm_controller/providers/base.py`
- Five providers: `alphaswarm_controller/src/alphaswarm_controller/providers/{docker_compose,kubernetes,aws,azure,gcp}.py`
- Workload ledger model: `alphaswarm/persistence/models_workload.py` (new in this PR)
- Telemetry streaming: `alphaswarm_controller/src/alphaswarm_controller/services/telemetry.py`
- AGENTS rule 45: [`AGENTS.md`](https://github.com/julianwileymac/alphaswarm/blob/main/alphaswarm_docs/AGENTS.md) (this PR)


<!-- https://alpha-swarm.ai/architecture/decisions/005-separated-control-plane -->
# ADR 005 — Separated `alphaswarm_controller/` micro-project
> The in-flight `alphaswarm/api/routes/control_plane.py` exposes deploy / destroy / restart / logs endpoints to the Vite Control Plane UI. It already covers the "local k3d" and "rpi_kubernetes" targets and del...


- **Status**: Accepted (2026-05-18)
- **Authors**: Platform team
- **Supersedes**: The control-plane routes embedded in `alphaswarm/api/routes/control_plane.py`
- **Related**: [ADR 002](002-single-container-client.md), [ADR 003](003-auth0-zero-trust.md), [ADR 004](004-provider-abstraction.md)

## Context

The in-flight `alphaswarm/api/routes/control_plane.py` exposes deploy / destroy / restart / logs endpoints to the Vite Control Plane UI. It already covers the "local k3d" and "rpi_kubernetes" targets and delegates mutating ops to `TerraformRuntime` via Celery tasks (see [`alphaswarm/api/routes/control_plane.py`](../../../alphaswarm/api/routes/control_plane.py)).

The refactor wants the control plane to:

1. Speak five backends (docker_compose, kubernetes, AWS, Azure, GCP) — not just two Terraform stacks.
2. Be deployable on its own (`/deployments/compose/docker-compose.admin.yml`, isolated `alphaswarm-admin` Docker network) so an operator can run "just the control plane" against a remote cluster.
3. Be releasable independently from the AlphaSwarm monolith (different cadence, different SLOs).
4. Have a security boundary that holds even if `alphaswarm` itself is compromised, and vice versa.

The strict-isolation reading of the prompt's hard constraint ("Never import `alphaswarm.*` modules inside `alphaswarm_controller/`") plus the existing `alphaswarm/` codebase yields three integration patterns:

1. **Strict separation**: duplicate every model, validator, and adapter into `alphaswarm_controller/`. 2x code, fully independent release.
2. **Shared lower-level library**: extract reusable bits (Pydantic topology models, JWT validator, K8s adapter ABCs, credential protocol) into a NEW `alphaswarm_core/` package that both `alphaswarm/` and `alphaswarm_controller/` depend on. No `alphaswarm.*` imports in the control plane, but shared lower-level code stays DRY.
3. **Evolve in place**: keep the control plane in `alphaswarm/`; just add the `alphaswarm_client` container + Auth0 RBAC.

## Decision

Adopt **pattern 2** — the **shared-library** approach.

1. New top-level package `alphaswarm_core/` is created with its own `pyproject.toml` (installable as `alphaswarm-core`).
2. Move (with back-compat re-exports from `alphaswarm/`) the following into `alphaswarm_core/`:
   - `topology/` — Pydantic models from `alphaswarm/deployment/topology.py` (data classes only; loaders stay in `alphaswarm/`).
   - `auth/` — Auth0 JWT validator from `alphaswarm/auth/providers/auth0.py` + `alphaswarm/api/security.py`'s claim validation + new `resource_filter.py` (ADR 003).
   - `kubernetes/` — `KubernetesAdapter` ABC from `alphaswarm/kubernetes/protocol.py`. Concrete adapters (`InClusterAdapter`, `LocalComposeAdapter`, `RpiClusterAdapter`) stay in `alphaswarm/`.
   - `credentials/` — `SecretStore` protocol + `CredentialResolver` interface. Concrete stores stay in `alphaswarm/`.
   - `connectivity/` — NEW `ConnectivityConfig` Pydantic settings model with `ALPHASWARM_*_URL` matrix.
   - `models/` — `DeploymentSpec`, `DeploymentStatus`, `MetricPoint`, `NodeHealth` (referenced by both `alphaswarm.api.routes.control_plane` and the new `alphaswarm_controller.api.routers`).
3. The `alphaswarm_controller/` micro-project (own `pyproject.toml`) depends ONLY on `alphaswarm-core`. It never imports `alphaswarm.*`.
4. `alphaswarm/` keeps the runtimes, ledger writers, registry implementations, and concrete adapters. It also depends on `alphaswarm-core` (just like `alphaswarm_controller/`).
5. Back-compat shims in `alphaswarm/deployment/`, `alphaswarm/auth/`, `alphaswarm/kubernetes/`, `alphaswarm/credentials/` re-export from `alphaswarm_core` so no existing import paths break and no other AlphaSwarm module needs to change in this PR.

The strict-isolation enforcement is a CI lint:

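```bash
# .github/workflows/ci.yml step
# rg exits 0 only when it finds a match, so a match means a forbidden import.
# \b keeps alphaswarm_core and alphaswarm_controller from matching.
if rg --type py '^\s*(from|import)\s+alphaswarm\b' alphaswarm_controller/; then
  echo "FAIL: alphaswarm_controller imports forbidden alphaswarm.* module"
  exit 1
fi
```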
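
A regex lint catches the common cases. If the check ever needs to see indented or multi-name imports reliably, the same rule can be expressed over the parsed syntax tree. The sketch below is an alternative illustration using the standard-library `ast` module; it is not part of the CI workflow described here, and `forbidden_imports` is a hypothetical name.

```python
import ast
from typing import List

FORBIDDEN = "alphaswarm"


def forbidden_imports(source: str) -> List[str]:
    """Return the forbidden module names imported by a piece of source code."""
    found: List[str] = []
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Import):
            names = [alias.name for alias in node.names]
        elif isinstance(node, ast.ImportFrom) and node.level == 0:
            names = [node.module or ""]
        else:
            continue  # not an import, or a relative import
        # Match "alphaswarm" and "alphaswarm.x", but not "alphaswarm_core".
        found += [n for n in names if n == FORBIDDEN or n.startswith(FORBIDDEN + ".")]
    return found
```

A wrapper script could run this over every `.py` file under `alphaswarm_controller/` and exit non-zero when the returned list is non-empty.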

## Consequences

**Positive**
- `alphaswarm_controller` ships as a standalone OCI image with no AlphaSwarm runtime dependency. Operators running multiple AlphaSwarm tenants share one control plane.
- The shared lib is small (~2 kloc) and changes infrequently. When it does change, both `alphaswarm/` and `alphaswarm_controller/` re-pin and re-test — explicit coupling.
- The existing `alphaswarm/api/routes/control_plane.py` becomes a thin proxy that calls the external `alphaswarm_controller` when the env var `ALPHASWARM_CP_REMOTE=1` is set, or talks in-process to the same modules when disabled. Backward compat for local dev.
- AGENTS hard rules 27 (IdentityProvider), 28 (KubernetesAdapter) still apply — the metaclass registries live in `alphaswarm_core/auth/` and `alphaswarm_core/kubernetes/`, with concrete impls registered from `alphaswarm/` and `alphaswarm_controller/` alike.

**Negative**
- Adds one more package to publish and version. Mitigated by treating `alphaswarm-core` as an internal dependency pinned to a git SHA from a monorepo — no PyPI release needed.
- Cross-package refactors now need to touch two `pyproject.toml` files. Acceptable cost; the boundary is intentional.
- The "embed vs separate" decision is now load-bearing for security — a vulnerability in `alphaswarm_core/auth/` lands in both planes. Reviewed in `ce-security-sentinel` agent runs (see `.cursor/agents/`).

## Alternatives considered

- **Strict separation (pattern 1)** — rejected. Duplicate code rots out of sync; security fixes have to land twice; impossible to keep JWT validator semantics identical between the two planes.
- **Evolve in place (pattern 3)** — rejected. The biggest gap the prompt closes is *deployment independence* and the *5-backend abstraction*. Both demand a separate process; in-place is just a renamed router.
- **gRPC contract between the two** — rejected for now. The two planes share Pydantic models and HTTP/JSON is already understood. gRPC adds proto-gen tooling burden without buying anything until we hit hundreds of req/s of internal calls.

## Decision tree: which side does new code go on?

When adding a new feature, ask:

1. Is this a workload runtime operation (start, stop, scale, exec, logs, telemetry)? → **`alphaswarm_controller/`**
2. Is this an IaC provisioning operation (create cluster, register Auth0 tenant, apply RBAC)? → **`alphaswarm/terraform/`**
3. Is this AlphaSwarm business logic (agents, RL, bots, analysis, backtests)? → **`alphaswarm/`**
4. Is this a shared model, validator, or ABC that BOTH need? → **`alphaswarm_core/`**

If unsure, prefer **`alphaswarm/`** and revisit the boundary once the requirement is clearer.

## Implementation references

- Shared lib: `alphaswarm_core/` (this PR)
- Micro-project: `alphaswarm_controller/` (this PR)
- Strict-isolation lint: `.github/workflows/ci.yml` (Phase 8)
- Existing in-AlphaSwarm control plane: `alphaswarm/api/routes/control_plane.py`
- Existing topology: `alphaswarm/deployment/topology.py`
- AGENTS rules 27, 28, 42, 45 — boundary owners


<!-- https://alpha-swarm.ai/architecture/decisions/006-aqp-admin-overhaul -->

# ADR 006: alphaswarm_admin overhaul (multi-cloud control plane)

- **Status:** Proposed
- **Date:** 2026-05-25
- **Supersedes:** none (extends ADR 002 single-container client; the
  Solara legacy half is deprecated by this overhaul)
- **Superseded by:** none

## Context

The alphaswarm_admin internal admin surface predates the overhaul:

- Backend was already a stateless FastAPI BFF brokering audit-first
  to `alphaswarm_controller` and the AlphaSwarm monolith.
- Frontend was a Vite + React Router SPA at `alphaswarm_admin/alphaswarm_admin_ui/`.
- Six modules from the blueprint were missing: secrets-manager,
  lineage-explorer, model-registry, paper-trading-control,
  rbac-admin, account-mode-switcher.
- Multi-account AWS topology was not provisioned.
- CI used `KUBECONFIG_*` base64 secrets instead of GitHub Actions
  OIDC.
- Only the bot fleet had ArgoCD; the main stack was kubectl-push.
- No S3 WORM mirror for `security_audit_events`.

## Decision

### Frontend: migrate to Next.js 15 App Router.

Even though `alphaswarm_client/` (the canonical Vite operator UI) and
`alphaswarm_ui/` (the customer-facing PaaS) keep their existing
frameworks, the admin surface migrates to Next.js because:

- Server Components reduce the bundle on read-heavy admin pages.
- Server Actions remove API-route boilerplate for mutations.
- File-system routing maps cleanly onto the sidebar information
  architecture (one folder per module).
- Middleware-based auth with one-shot RFC 9470 step-up retries
  composes better than the Vite + React Router pattern.

The legacy `alphaswarm_admin_ui/` stays deployable behind a feature flag
for a 30-day rollback window during the cutover. The new Next.js
app lives at `alphaswarm_admin/frontend/`.

### Backend: extend, don't rewrite.

The existing six routers are kept. Six new module routers are
added under the established audit-first / M2M-broker /
`require_admin_scope` pattern. Step-up MFA per AGENTS rule 52 is
attached to every new mutating endpoint.

### RBAC: stay on the existing 4-role lattice.

The blueprint suggested Casbin. We reject that: AlphaSwarm's canonical
RBAC is the `alphaswarm_core.auth.rbac` 4-role lattice plus the
existing `Membership` table. Adding Casbin would create a parallel
policy source of truth that fragments AGENTS rule 27. The new
`/admin/rbac/*` router builds on `expand_role` and the existing
`require_scope` / `require_membership` deps.

### Multi-account AWS: code now, apply later.

A new top-level `infrastructure/` directory ships the full module
library (landing-zone, account, vpc, eks-cluster, eks-node-groups,
karpenter-bootstrap, ecr-repositories, rds-postgres, s3-data-lake,
msk-kafka, airflow, eso-bootstrap, argocd-bootstrap,
observability-stack, iam-irsa-roles, route53-zones,
acm-certificates, acm-pca, github-oidc, codepipeline, codebuild,
codeartifact) plus per-environment compositions. Every
composition assumes-role into a workload account from
`shared-services` with `external_id`. Cloud-side `terraform
apply` is deferred to operator hands; the PR ships the code.

### CI/CD: GitHub OIDC + SLSA L3 + Cosign keyless.

`.github/actions/{aws-oidc-assume,build-sign-push,slsa-provenance,
kubectl-via-irsa}` composite actions; new workflows
`pr-validate.yml`, `build-publish.yml`, `argocd-trigger.yml`,
`terraform-pipeline.yml`, `ml-pipeline.yml`,
`paper-config-validate.yml`, `alembic-immutability.yml`. Renovate
is wired with auto-merge to `main` only on minor + patch updates.

### Observability + cost.

Linkerd (chosen over Istio Ambient and App Mesh because of the
~6x lower proxy memory and ~10x lower p99 latency overhead) is
the service mesh; Falco + Velero + Kubecost ship as Helm-chart
wrappers. Karpenter v1 is self-managed (NOT EKS Auto Mode) so that
the NodePool specs are recorded under
`terraform_stack_spec_versions`.

### Audit WORM.

`alphaswarm/tasks/audit_log_export_tasks.py::export_audit_log_window`
exports `security_audit_events` + `audit_log` nightly to
`s3://alphaswarm-audit-archive-${ACCOUNT_ID}/` with
`ObjectLockMode=COMPLIANCE` + 7-year retention per FINRA Rule
4511 + SEC Rule 17a-4(f)(2)(i)(B).
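
As a rough illustration of the Object Lock settings above, the sketch below builds the keyword arguments that boto3's S3 `put_object` call accepts for a COMPLIANCE-mode upload. The helper names, the key layout, and the account id are assumptions made for this example; the real `export_audit_log_window` task may be organised quite differently.

```python
from datetime import datetime, timezone

RETENTION_YEARS = 7  # retention period stated in this ADR


def retain_until(now: datetime, years: int = RETENTION_YEARS) -> datetime:
    """Return the Object Lock expiry `years` after `now`."""
    try:
        return now.replace(year=now.year + years)
    except ValueError:
        # 29 Feb has no counterpart in a non-leap target year; round up to 1 Mar.
        return now.replace(year=now.year + years, month=3, day=1)


def build_put_kwargs(account_id: str, key: str, body: bytes, now: datetime) -> dict:
    """Keyword arguments for an S3 put_object call with a COMPLIANCE lock."""
    return {
        "Bucket": f"alphaswarm-audit-archive-{account_id}",
        "Key": key,
        "Body": body,
        "ObjectLockMode": "COMPLIANCE",
        "ObjectLockRetainUntilDate": retain_until(now),
    }


kwargs = build_put_kwargs(
    "123456789012",  # placeholder account id
    "security_audit_events/2026-05-25.jsonl",  # hypothetical key layout
    b"{}\n",
    datetime(2026, 5, 25, tzinfo=timezone.utc),
)
# With boto3 installed and credentials configured, the upload would be:
#   boto3.client("s3").put_object(**kwargs)
```

The bucket itself must have Object Lock enabled for these parameters to be accepted; that is a bucket-level setting and is not shown here.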

### IdP support.

Two new `IdentityProvider` subclasses ship under
`alphaswarm/auth/providers/`:

- `aws_iam_identity_center.py`
- `aws_cognito.py`

Both subclass `GenericOidcProvider` and auto-register through
`IdentityProviderMeta`. IAM Identity Center is the recommended IdP
for multi-account; Cognito is the documented fallback for the
single-account path.
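
One way the auto-registration could work is sketched below. Only the names `IdentityProviderMeta` and `GenericOidcProvider` come from this ADR; the `provider_id` attribute, the registry layout, and the concrete class names are assumptions for illustration.

```python
class IdentityProviderMeta(type):
    """Metaclass that records every provider class declaring a `provider_id`."""

    registry: dict[str, type] = {}

    def __new__(mcls, name, bases, namespace):
        cls = super().__new__(mcls, name, bases, namespace)
        provider_id = namespace.get("provider_id")  # hypothetical attribute
        if provider_id is not None:
            mcls.registry[provider_id] = cls
        return cls


class GenericOidcProvider(metaclass=IdentityProviderMeta):
    """Stand-in base; it declares no `provider_id`, so it is not registered."""

    def __init__(self, issuer: str) -> None:
        self.issuer = issuer


class AwsIamIdentityCenterProvider(GenericOidcProvider):
    provider_id = "aws_iam_identity_center"


class AwsCognitoProvider(GenericOidcProvider):
    provider_id = "aws_cognito"


# Defining the subclasses is enough; no explicit register() call is needed.
available = sorted(IdentityProviderMeta.registry)
```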

## Consequences

- The 6 missing modules ship with full audit-first wiring +
  step-up MFA + WS multiplexing.
- Frontend bundles get smaller; SSR'd admin pages enable better
  caching.
- Multi-account topology is one `terraform apply` away.
- CI gains SLSA L3 attestations + Cosign keyless verification.
- Audit ledger is FINRA-compliant via WORM mirroring.
- The `alphaswarm_admin_ui/` Vite tree adds maintenance debt for the
  duration of the rollback window. A cleanup PR is scheduled
  after the 30-day burn-in.
- The legacy `alphaswarm/ui/` Solara dashboard remains in place; a
  separate `alphaswarm_admin-overhaul-cleanup` PR handles its removal +
  the FastAPI/Starlette unpin.


<!-- https://alpha-swarm.ai/architecture/decisions/006-quantbot-operator-pattern -->
# ADR 006 — QuantBot Operator Pattern (kopf + Pydantic mirrors)
> The QuantBot Platform v0.2.0 adds a Kubernetes-native control plane on top of the existing `BotRuntime`/`bot_versions` infrastructure. Every running bot, every risk policy, every venue feed, every bac...

# ADR 006 — QuantBot Operator Pattern (kopf + Pydantic mirrors)

**Status:** Accepted (QuantBot Platform v0.2.0)
**Date:** 2026-05-24
**Decision drivers:** AGENTS rules 14, 15, 28, 45; rpi-k8s-governance

## Context

The QuantBot Platform v0.2.0 adds a Kubernetes-native control plane on
top of the existing `BotRuntime`/`bot_versions` infrastructure. Every
running bot, every risk policy, every venue feed, every backtest job,
every kill switch is now a Kubernetes Custom Resource. That requires:

1. A controller that watches the CRs and reconciles desired state.
2. A schema source-of-truth for each CR.
3. Webhooks that reject malformed CRs before they reach the reconciler.

## Decision

- **Controller framework:** kopf (`kopf>=1.37`). Python-native, integrates
  with our Pydantic spec layer, supports level-triggered reconciliation,
  finalizers, and admission webhooks. Up to ~1000 CRs/cluster is well
  within kopf's documented operating envelope.
- **Schema source-of-truth:** each CR has both a Pydantic mirror class
  (under `alphaswarm_bots/operator/crds/*_cr.py`) AND a CRD YAML
  (`alphaswarm_bots/operator/crds/yaml/*_crd.yaml`). The Pydantic class
  validates the CR `.spec` field; the YAML is what gets applied to
  the cluster by the CRD-installer Job. The two are kept in sync by
  convention and by the operator's startup self-test.
- **Reconciliation:** level-triggered. Every handler compares desired
  (from spec) against actual (queried from the cluster) and drives the
  system back. Failures reflect onto `status.conditions`.
- **Workload application:** routes through `alphaswarm_core.WorkloadRuntime`
  per AGENTS rule 45. The operator never calls
  `kubernetes.client.AppsV1Api()` directly when WorkloadRuntime is
  available; it falls back to `kubernetes-asyncio` only in environments
  where WorkloadRuntime hasn't been deployed yet.
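
The Pydantic-versus-YAML sync check can be pictured as a recursive comparison of two JSON-schema-like dictionaries. The sketch below is an assumption about how such a self-test might look, not the operator's actual code: `schema_drift` and the sample schemas are hypothetical, and the comparison ignores `$ref`, `anyOf`, and other constructs that a real Pydantic JSON schema can contain.

```python
def schema_drift(pydantic_schema: dict, crd_schema: dict, path: str = "spec") -> list[str]:
    """Return human-readable differences between two JSON-schema-like dicts."""
    problems: list[str] = []
    p_props = pydantic_schema.get("properties", {})
    c_props = crd_schema.get("properties", {})
    for name in sorted(set(p_props) | set(c_props)):
        field = f"{path}.{name}"
        if name not in c_props:
            problems.append(f"{field}: missing from CRD")
        elif name not in p_props:
            problems.append(f"{field}: missing from Pydantic mirror")
        elif p_props[name].get("type") != c_props[name].get("type"):
            problems.append(
                f"{field}: type {p_props[name].get('type')!r} != {c_props[name].get('type')!r}"
            )
        elif p_props[name].get("type") == "object":
            # Same type on both sides: descend into nested objects.
            problems.extend(schema_drift(p_props[name], c_props[name], field))
    if sorted(pydantic_schema.get("required", [])) != sorted(crd_schema.get("required", [])):
        problems.append(f"{path}: required fields differ")
    return problems


# Stand-ins for the Pydantic JSON schema and the CRD's openAPIV3Schema.
mirror = {
    "type": "object",
    "required": ["botRef"],
    "properties": {"botRef": {"type": "string"}, "replicas": {"type": "integer"}},
}
crd = {
    "type": "object",
    "required": ["botRef"],
    "properties": {
        "botRef": {"type": "string"},
        "replicas": {"type": "string"},
        "paused": {"type": "boolean"},
    },
}
drift = schema_drift(mirror, crd)
```

A startup check built on this idea would raise (and so refuse to boot) whenever `drift` is non-empty.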

## Alternatives considered

| Option | Why rejected |
| --- | --- |
| Go operator (controller-runtime / Kubebuilder) | Re-implements the spec validation already written in Pydantic; higher operational burden for a Python-first team |
| metacontroller + JSON Schema | No mature Python ecosystem for the testing + audit story we need; JSON Schema diverges from Pydantic validators |
| Native Helm charts only (no controller) | Helm can't reconcile the operator-side bookkeeping (kill switch fan-out, drain finalizer, status condition rollup) |

## Consequences

- **+** Single source of truth (Pydantic) drives both API validation and
  CR validation.
- **+** Python-native test suite for the operator (kopf can be driven
  in-process from pytest).
- **−** kopf scaling ceiling is ~1000 CRs per cluster; beyond that we
  need operator sharding (deferred per blueprint caveat #2).
- **−** Pydantic mirror + YAML CRD requires manual sync. Mitigated by
  the operator's startup self-test, which compares the Pydantic JSON
  schema against the CRD's `openAPIV3Schema` and refuses to boot on drift.

## References

- [alphaswarm_bots/operator/](../../../alphaswarm_bots/operator/)
- [alphaswarm_platform/deployments/kubernetes/bots-operator/](../../../alphaswarm_platform/deployments/kubernetes/bots-operator/)


<!-- https://alpha-swarm.ai/architecture/decisions/007-quantbot-latency-classes -->
# ADR 007 — QuantBot Latency Classes
> Bots in the QuantBot Platform span a 6-order-of-magnitude latency range: sub-millisecond market makers next to once-a-day rebalancers next to event-driven MEV searchers. We need a taxonomy that:

# ADR 007 — QuantBot Latency Classes

**Status:** Accepted (QuantBot Platform v0.2.0)
**Date:** 2026-05-24

## Context

Bots in the QuantBot Platform span a 6-order-of-magnitude latency
range: sub-millisecond market makers next to once-a-day rebalancers
next to event-driven MEV searchers. We need a taxonomy that:

1. Maps onto a concrete Kubernetes scheduling primitive (different
   primitives for different tiers).
2. Constrains where each bot can be scheduled (HFT bots only on
   dedicated NUMA-pinned nodes).
3. Tells the operator what hardware features to validate (HugePages,
   SR-IOV, PTP).
4. Drives the alert SLO thresholds (1 ms P99 vs 1 µs P99).

## Decision

Five canonical latency classes (`Frequency` StrEnum):

| Class | Latency target | K8s primitive | Special hardware |
| --- | --- | --- | --- |
| `hft`   | < 1 ms tick-to-trade | DaemonSet on tainted nodes (1 bot / node) | NUMA pinning, HugePages, SR-IOV, PTP |
| `mid`   | 1 ms – 1 s | StatefulSet (stateful) | None |
| `low`   | 1 s – 1 min | Deployment (stateless) | None |
| `eod`   | batch / daily | CronJob | None |
| `event` | event-driven | Deployment (long-running consumer) | None |

The `Frequency.HFT` Pydantic validator enforces:

- `needs_numa_pinning == True`
- `expected_p99_tick_to_trade_us` is set

These are required because operator scheduling decisions are
made off the capability declaration; an HFT bot without NUMA
pinning would silently land on a shared node and violate the
RTS 25 1-microsecond timestamp granularity requirement.
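
A minimal sketch of the taxonomy and the HFT checks follows. It uses a plain dataclass with `__post_init__` in place of the Pydantic validator so that it runs on the standard library alone, and `str, Enum` in place of `StrEnum` for the same reason. `BotCapabilities` and `K8S_PRIMITIVE` are hypothetical names; the field names and the class-to-primitive mapping are taken from the text and table above.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class Frequency(str, Enum):
    HFT = "hft"
    MID = "mid"
    LOW = "low"
    EOD = "eod"
    EVENT = "event"


# Scheduling primitive per latency class, as listed in the table above.
K8S_PRIMITIVE = {
    Frequency.HFT: "DaemonSet",
    Frequency.MID: "StatefulSet",
    Frequency.LOW: "Deployment",
    Frequency.EOD: "CronJob",
    Frequency.EVENT: "Deployment",
}


@dataclass(frozen=True)
class BotCapabilities:
    frequency: Frequency
    needs_numa_pinning: bool = False
    expected_p99_tick_to_trade_us: Optional[float] = None

    def __post_init__(self) -> None:
        # Only the HFT class carries extra requirements.
        if self.frequency is Frequency.HFT:
            if not self.needs_numa_pinning:
                raise ValueError("hft bots must declare needs_numa_pinning=True")
            if self.expected_p99_tick_to_trade_us is None:
                raise ValueError("hft bots must set expected_p99_tick_to_trade_us")


hft_bot = BotCapabilities(
    Frequency.HFT, needs_numa_pinning=True, expected_p99_tick_to_trade_us=250.0
)
primitive = K8S_PRIMITIVE[hft_bot.frequency]
```

Rejecting the spec at construction time means an HFT bot without the NUMA declaration never reaches the scheduling step.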

## Python ceiling (caveat #1 from blueprint)

Pure Python + Cython targets **100-500 µs** for non-kernel-bypass
HFT. The Aeron / Google Cloud benchmark (weareadaptive.com, 2024)
reports 57 µs default / 18 µs with kernel-bypass at 100k msg/s — and
that's a Java baseline. Bots requiring sub-100 µs MUST use the Rust
escape hatch in `alphaswarm_bots/hft/escape_hatch.py`. The architecture
explicitly documents this so we don't over-promise.

## Consequences

- **+** Operator scheduling is deterministic from the spec.
- **+** Alert SLOs auto-derive from the latency class.
- **−** Adding a tier in the future (e.g. `ultra_hft` for sub-100 µs)
  requires a new enum value and operator handler.

## References

- [alphaswarm_bots/spec.py — Frequency enum](../../../alphaswarm_bots/spec.py)
- [alphaswarm_bots/hft/](../../../alphaswarm_bots/hft/)
- Commission Delegated Regulation (EU) 2017/574 (RTS 25, clock synchronisation)


<!-- https://alpha-swarm.ai/architecture/decisions/008-quantbot-event-sourcing -->
# ADR 008 — Bot Event Sourcing (PostgreSQL, monthly-partitioned)
> Each running bot generates a stream of decision/order/fill/snapshot events. To support:

# ADR 008 — Bot Event Sourcing (PostgreSQL, monthly-partitioned)

**Status:** Accepted (QuantBot Platform v0.2.0)
**Date:** 2026-05-24
**Decision drivers:** Blueprint §H; AGENTS rules 3, 6, 34.

## Context

Each running bot generates a stream of decision/order/fill/snapshot
events. To support:

1. Restart recovery without losing state.
2. Time-travel debugging ("show me the position at 14:32:11 yesterday").
3. Regulatory audit (RTS 6 Article 17(3) real-time reconciliation).
4. Replay-based regression testing of strategy changes.

...we need an append-only, queryable event log per bot.

## Decision

- **Backend:** PostgreSQL, the same database that already holds
  `bots` / `bot_versions` / `bot_deployments`. No new technology to
  operationalize.
- **Partitioning:** `bot_events` is `PARTITION BY RANGE (recorded_at)`
  on PostgreSQL with one partition per UTC month. Partition pruning
  keeps queries fast even at billion-row scale (per documented
  PostgreSQL event-sourcing patterns).
- **Sequence numbers:** monotonic per bot (`bot_id`, `seq_no`). The
  `EventStore` writer keeps the next-available `seq_no` in memory
  and increments on each append; on restart it resumes from
  `max(seq_no) + 1`.
- **Snapshots:** periodic `bot_snapshots` rows act as replay anchors.
  On startup the kernel reads the latest snapshot and replays events
  with `seq_no > snapshot.seq_no` rather than from zero.
- **Tenancy:** every row carries `owner_user_id` / `workspace_id` /
  `project_id` / `experiment_id` / `test_id` per AGENTS rule 34;
  the existing `LedgerWriter._stamp` populates them automatically
  from the active `RequestContext`.
- **GIN index** on `bot_events.event_data` (JSONB) so ad-hoc queries
  like "all fills with `fee_currency=USDT`" stay fast.
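
The snapshot-plus-replay recovery described in the list above can be sketched in a few lines. The `Snapshot` and `FillEvent` shapes and the position-keeping logic are assumptions for illustration; the real kernel state is richer than a dictionary of positions.

```python
from dataclasses import dataclass, field


@dataclass
class Snapshot:
    seq_no: int  # last event already folded into this snapshot
    positions: dict[str, float] = field(default_factory=dict)


@dataclass(frozen=True)
class FillEvent:
    seq_no: int
    symbol: str
    signed_qty: float  # positive for buys, negative for sells


def replay(snapshot: Snapshot, events: list[FillEvent]) -> tuple[dict[str, float], int]:
    """Rebuild positions from a snapshot plus later events.

    Returns the positions and the next seq_no the writer should use.
    """
    positions = dict(snapshot.positions)
    last_seq = snapshot.seq_no
    for event in sorted(events, key=lambda e: e.seq_no):
        if event.seq_no <= snapshot.seq_no:
            continue  # already reflected in the snapshot
        positions[event.symbol] = positions.get(event.symbol, 0.0) + event.signed_qty
        last_seq = event.seq_no
    return positions, last_seq + 1


snapshot = Snapshot(seq_no=2, positions={"BTC-USDT": 1.0})
events = [
    FillEvent(1, "BTC-USDT", 1.0),    # older than the snapshot, skipped
    FillEvent(3, "BTC-USDT", -0.25),
    FillEvent(4, "ETH-USDT", 2.0),
]
positions, next_seq_no = replay(snapshot, events)
```

Only events newer than the snapshot are applied, which is what keeps restart cost proportional to the events since the last snapshot.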

## Alternatives considered

| Option | Why rejected |
| --- | --- |
| Kafka log only | Not random-access; harder to query for time-travel; we already have Postgres |
| TimescaleDB hypertable | Extra dependency to operate; partition pruning on plain Postgres is sufficient at our scale |
| One table per bot | Operational nightmare at >100 bots; partition pruning gives us the same query performance with one schema |
| Iceberg-only | Lakehouse latency too high for the kernel's restart path |

## Iceberg interplay (Rule 3)

`bot_events` is **operational** state — kernel writes happen on the
hot path. **Analytical** writes (trajectory exports, signal series,
gold-tier aggregates) still go through `iceberg_catalog.append_arrow`
per AGENTS rule 3; the operational + analytical paths are deliberately
separate to keep the kernel's write latency predictable.

## Consequences

- **+** Restart recovery is O(snapshot + events_since_snapshot) rather
  than O(all_events).
- **+** Time-travel debugging is one Postgres query.
- **+** No new infrastructure to operate.
- **−** Monthly partitioning requires a Celery beat task to pre-create
  next month's partition (Phase 12 — Celery wiring deferred to follow-up).
- **−** GIN index churns on high-volume JSONB inserts; mitigated by
  the `EventStore` batching writes every `flush_interval_s`.
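
The partition pre-creation step could generate its DDL along the following lines. This is a sketch under assumptions: the `bot_events_YYYY_MM` naming scheme and the helper names are invented for the example, and the scheduled task that would execute the statement is not shown.

```python
from datetime import date


def month_bounds(day: date) -> tuple[date, date]:
    """First day of `day`'s month and first day of the following month."""
    start = day.replace(day=1)
    if start.month == 12:
        end = date(start.year + 1, 1, 1)
    else:
        end = date(start.year, start.month + 1, 1)
    return start, end


def partition_ddl(day: date, parent: str = "bot_events") -> str:
    """DDL for the monthly partition covering `day` (half-open UTC range)."""
    start, end = month_bounds(day)
    name = f"{parent}_{start:%Y_%m}"  # hypothetical naming scheme
    return (
        f"CREATE TABLE IF NOT EXISTS {name} PARTITION OF {parent} "
        f"FOR VALUES FROM ('{start.isoformat()} 00:00:00+00') "
        f"TO ('{end.isoformat()} 00:00:00+00');"
    )


def next_month_ddl(today: date) -> str:
    """Statement a scheduled task could run ahead of the month rollover."""
    _, next_start = month_bounds(today)
    return partition_ddl(next_start)


ddl = next_month_ddl(date(2026, 12, 15))
```

The explicit `+00` offsets keep the range boundaries on UTC month edges regardless of the session time zone, and `IF NOT EXISTS` makes the task safe to re-run.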

## References

- [alembic/versions/0058_bot_event_sourcing.py](../../../alembic/versions/0058_bot_event_sourcing.py)
- [alphaswarm_bots/state/store.py](../../../alphaswarm_bots/state/store.py)
- [alphaswarm_bots/state/replay.py](../../../alphaswarm_bots/state/replay.py)


<!-- https://alpha-swarm.ai/architecture/decisions/009-quantbot-rts6-conformance -->
# ADR 009 — MiFID II RTS 6 + SEC 15c3-5 Conformance
> Algorithmic trading in EU markets is governed by Commission Delegated Regulation (EU) 2017/589 — **MiFID II Regulatory Technical Standards on the organisational requirements of investment firms engage...

# ADR 009 — MiFID II RTS 6 + SEC 15c3-5 Conformance

**Status:** Accepted (QuantBot Platform v0.2.0)
**Date:** 2026-05-24

**LEGAL REVIEW REQUIRED.** This ADR and the code it documents are an
**engineering crosswalk**, NOT legal advice. Any production deployment
trading European or US equities (or directly-affected derivatives) requires
sign-off from the firm's compliance counsel and the CEO's annual
certification.

## Context

Algorithmic trading in EU markets is governed by Commission Delegated
Regulation (EU) 2017/589 — **MiFID II Regulatory Technical Standards
on the organisational requirements of investment firms engaged in
algorithmic trading** ("RTS 6"). US market access is governed by SEC
Rule 15c3-5 (17 CFR § 240.15c3-5).

Both regimes require pre-trade risk controls, kill functionality, real-
time reconciliation, conformance testing, stress testing, and annual
validation. The QuantBot Platform must:

1. Enforce every named control before an order leaves the bot.
2. Generate the annual validation report mechanically.
3. Document the required attestations the firm's officers must sign.

## Decision

- **Two-tier risk:** Layer-1 in-bot `PreTradeRiskEngine` for the
  latency-sensitive fast path; Layer-2 out-of-band FastAPI service
  for the broker-dealer-controlled aggregate-credit check (§ 240.15c3-5(d)).
- **Hard vs soft block:** policy verdicts carry `severity = "block"`
  (hard — order rejected) or `severity = "warn"` (soft — informational
  only, may be overridden). Mirrors the ESMA Supervisory Briefing
  §72 (hard) vs §75/§76 (soft) distinction.
- **Crosswalk:** every policy in `alphaswarm_bots/risk/policies.py` carries a
  `citation` string. The `alphaswarm_bots/risk/reg/rts6.py` and
  `rule_15c3_5.py` modules list the mapping by class name.
- **Kill switch (RTS 6 Art. 12):** three-scope (`bot` / `fleet` /
  `platform`) implementation in `alphaswarm_bots/risk/kill_switch_v2.py`,
  backed by Redis + a `KillSwitch` CRD. Cancellation is immediate;
  affected bots transition to `Draining` and (optionally) flatten
  positions.
- **Real-time reconciliation (Art. 17(3)):** `ExecutionAdapter.reconcile()`
  is called on every reconnect; drop-copy ingest is the canonical
  real-time path; mismatches elevate to `OrderStatus.DISPUTED` and
  quarantine the strategy from new entries.
- **Real-time alerts (Art. 16(5)) — "within 5 seconds":**
  Prometheus alerting rules with `interval: 15s` and `for: 0s`
  on critical signals (`prometheus-rules.yaml`).
- **Conformance testing (Art. 6):** `alphaswarm_bots/risk/reg/conformance.py`
  ships a synthetic test harness; CLI `alphaswarm-bots conformance ` and
  REST `POST /bots/{ref}/conformance` run it on demand.
- **Stress testing (Art. 10) — "twice the volume of the highest
  volume...during the previous six months":** `alphaswarm_bots/risk/reg/stress.py`
  reads the peak rate from `bot_events` and replays at 2x through the
  engine. CLI `alphaswarm-bots stress ` and REST `POST /bots/{ref}/stress`.
- **Annual validation (Art. 9 + § 240.15c3-5(e)):**
  `alphaswarm_bots/risk/reg/validation_report.py` generates a YAML artifact
  with empty signature slots for risk management, internal audit, and
  the CEO. The generator runs daily as a Celery task; the artifact
  itself requires manual sign-off before submission.
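
To make the hard/soft distinction concrete, the sketch below shows one way verdicts with a `severity` and a `citation` could be aggregated into an accept/reject decision. The policy functions, thresholds, and citation strings are illustrative placeholders and are not taken from `alphaswarm_bots/risk/policies.py`.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass(frozen=True)
class Order:
    symbol: str
    qty: float
    price: float


@dataclass(frozen=True)
class Verdict:
    policy: str
    severity: str  # "block" (hard) or "warn" (soft)
    citation: str
    reason: str


Policy = Callable[[Order], Optional[Verdict]]


def max_notional(limit: float) -> Policy:
    def check(order: Order) -> Optional[Verdict]:
        notional = order.qty * order.price
        if notional > limit:
            # Citation string is an illustrative placeholder.
            return Verdict("max_notional", "block", "RTS 6 Art. 15",
                           f"notional {notional:.2f} > {limit:.2f}")
        return None
    return check


def large_qty_warning(threshold: float) -> Policy:
    def check(order: Order) -> Optional[Verdict]:
        if order.qty > threshold:
            return Verdict("large_qty", "warn", "RTS 6 Art. 15",
                           f"qty {order.qty:g} > {threshold:g}")
        return None
    return check


def evaluate(order: Order, policies: list[Policy]) -> tuple[bool, list[Verdict]]:
    """Run every policy; accept unless at least one verdict is a hard block."""
    verdicts = [v for v in (policy(order) for policy in policies) if v is not None]
    accepted = all(v.severity != "block" for v in verdicts)
    return accepted, verdicts


policies = [max_notional(10_000.0), large_qty_warning(100.0)]
soft_ok, soft_verdicts = evaluate(Order("ACME", 500, 10.0), policies)
hard_ok, hard_verdicts = evaluate(Order("ACME", 2000, 10.0), policies)
```

Running every policy rather than stopping at the first block keeps the full list of citations available for the audit trail.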

## Attestation slots (left blank by the generator)

The validation report has three signature slots:

1. **Risk management function (RTS 6 Art. 9(2)):** drafts the report.
2. **Internal audit (RTS 6 Art. 9(3)):** audits the report.
3. **CEO certification (SEC 15c3-5(e)):** annual certification that
   the firm's risk management controls comply with paragraphs (b)
   and (c) of the rule.

The generator does NOT auto-sign these slots; that is operational, not
mechanical.

## Consequences

- **+** Every block has a regulatory citation.
- **+** Conformance + stress + annual validation are reproducible
  CI artifacts.
- **−** This is an engineering crosswalk; legal counsel must validate
  the mappings against the specific firm's regulatory perimeter.
- **−** Cross-asset firms (equity + futures + crypto) may need
  additional policies beyond what we ship out of the box.

## References

- Commission Delegated Regulation (EU) 2017/589 (MiFID II RTS 6)
- 17 CFR § 240.15c3-5 (SEC Rule 15c3-5)
- ESMA Supervisory Briefing on Algorithmic Trading (26 Feb 2026)
- [alphaswarm_bots/risk/](../../../alphaswarm_bots/risk/)


<!-- https://alpha-swarm.ai/architecture/decisions/010-quantbot-canary-pnl-gates -->
# ADR 010 — Canary Rollout PnL Gates
> Strategy changes (new alpha model, new portfolio constructor, new execution algo) are the highest-leverage and highest-risk changes the platform makes. Rolling them across the entire fleet at once is ...

# ADR 010 — Canary Rollout PnL Gates

**Status:** Accepted (QuantBot Platform v0.2.0)
**Date:** 2026-05-24

## Context

Strategy changes (new alpha model, new portfolio constructor, new
execution algo) are the highest-leverage and highest-risk changes the
platform makes. Rolling them across the entire fleet at once is
unacceptable; bake time is mandatory. Argo Rollouts canary lets us
shift weight gradually, but the canary needs **automated abort
criteria** beyond the standard liveness/readiness probes — a bot can
be `Ready=True` and still be hemorrhaging money.

## Decision

Three AnalysisTemplates gate every canary promotion step:

1. **`bot-canary-pnl`** — realised PnL of the canary vs the stable
   variant. Default success condition:
   `canary_realized_pnl - stable_realized_pnl >= -50 USD` over 6 ×
   5-minute windows (30-minute total).
2. **`bot-reject-rate`** — fraction of orders that are rejected
   (by venue or by pre-trade risk). Default success condition:
   `<= 1%` over 30 × 1-minute windows.
3. **`bot-p99-latency`** — P99 tick-to-trade latency. Default
   success condition: `<= 1 ms` (HFT canaries override to `<= 100 µs`).
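
The three default success conditions reduce to simple comparisons, sketched below for a single measurement. In practice Argo Rollouts evaluates each AnalysisTemplate over repeated windows; the function and field names here are assumptions, and only the threshold values come from the list above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GateThresholds:
    min_pnl_delta_usd: float = -50.0    # canary minus stable realised PnL
    max_reject_rate: float = 0.01       # 1% of submitted orders
    max_p99_latency_us: float = 1000.0  # 1 ms; HFT canaries override to 100


def failed_gates(
    canary_pnl: float,
    stable_pnl: float,
    rejected: int,
    submitted: int,
    p99_latency_us: float,
    thresholds: GateThresholds = GateThresholds(),
) -> list[str]:
    """Return the names of the gates that fail for one measurement."""
    failures = []
    if canary_pnl - stable_pnl < thresholds.min_pnl_delta_usd:
        failures.append("bot-canary-pnl")
    # With no orders yet, the reject rate is treated as zero in this sketch.
    reject_rate = rejected / submitted if submitted else 0.0
    if reject_rate > thresholds.max_reject_rate:
        failures.append("bot-reject-rate")
    if p99_latency_us > thresholds.max_p99_latency_us:
        failures.append("bot-p99-latency")
    return failures


healthy = failed_gates(-120.0, -90.0, rejected=3, submitted=1000, p99_latency_us=640.0)
hft = GateThresholds(max_p99_latency_us=100.0)
broken = failed_gates(-200.0, -90.0, rejected=30, submitted=1000,
                      p99_latency_us=640.0, thresholds=hft)
```

Note that the PnL gate compares the canary against the stable variant, so a canary that loses money alongside the stable version still passes.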

The canary spec follows the standard Argo Rollouts pattern:

```
steps:
  - setWeight: 10
  - pause: { duration: 30m }
  - analysis: { templates: [bot-canary-pnl, bot-reject-rate, bot-p99-latency] }
  - setWeight: 50
  - pause: { duration: 1h }
  - analysis: { templates: [bot-canary-pnl, bot-reject-rate, bot-p99-latency] }
  - setWeight: 100
```

Failure of any AnalysisTemplate **aborts the rollout** and reverts
traffic to the stable version. The operator additionally watches
the `BotPnLDrawdownCritical` PrometheusRule; if the canary bleeds
more than `maxAbortRolloutPnlBleedUsd` (default $500) the alert
auto-fires a `KillSwitch` CR which halts the canary instantly —
this protects against the case where the rollout abort itself takes
longer than the bleed.

## Default thresholds rationale

The $50 PnL floor is intentionally generous for the initial canary
window — it admits some short-term variance that is statistically
normal between two variants of the same strategy. The harder $500
bleed threshold (drawdown alert) is what catches truly broken canaries
within seconds.

Per blueprint caveat (canary false-positive rate): if good canaries
are routinely aborted on noisy metrics, tighten the metric query
**first** (more samples, longer windows, robust quantiles) before
relaxing the success condition.

## Consequences

- **+** Strategy changes have an automated bake-time gate.
- **+** The same canary pattern works for both mid-frequency
  bots and HFT bots (only the latency threshold differs).
- **−** AnalysisTemplate thresholds need per-strategy calibration —
  a market-making bot's "good" reject rate is higher than a stat-arb
  pair's "good" reject rate.
- **−** A canary that's still warming up may not yet have produced
  enough orders for the metrics to be meaningful; we mitigate with
  the initial 30-minute pause before the first analysis check.

## References

- [alphaswarm_platform/deployments/argocd/rollouts/](../../../alphaswarm_platform/deployments/argocd/rollouts/)
- [alphaswarm_platform/deployments/kubernetes/bots-operator/prometheus-rules.yaml](../../../alphaswarm_platform/deployments/kubernetes/bots-operator/prometheus-rules.yaml)


<!-- https://alpha-swarm.ai/architecture/decisions/010-rl-production-enhancement -->
# ADR-010: alphaswarm_rl production-grade enhancement (Phases 1-12)
> **Context**: The `alphaswarm_rl` subsystem shipped with the core `RLComponent` metaclass, `RLRuntime`, hash-locked `RLExperimentSpec`, and a small set of envs / agents / observations / rewards. The TradeMast...

# ADR-010: alphaswarm_rl production-grade enhancement (Phases 1-12)

**Status**: accepted (2026-05-24)

**Context**: The `alphaswarm_rl` subsystem shipped with the core
`RLComponent` metaclass, `RLRuntime`, hash-locked `RLExperimentSpec`,
and a small set of envs / agents / observations / rewards. The
TradeMaster 1.0.0 codebase contained a much larger, paper-grade
library of:

- Reward shapes (Differential Sharpe Ratio, D3R, Implementation
  Shortfall, Hindsight, DP-distillation, …).
- Analytical baselines (Almgren-Chriss, Avellaneda-Stoikov).
- Domain envs (PortfolioManagement, OrderExecution PD,
  AlgorithmicTrading, HFT, MultimodalTrading).
- Paper-grade agents (EIIE, DeepTrader, ETEO, OPD, DeepScalper,
  HFT_DDQN, InvestorImitator).
- Network backbones (EIIEConv, SAGCN, MarketScorer, HFTQNet,
  DualHead, PDDualRNN, SARL classifier).
- Market Dynamics Modeling (slice-and-merge regime labeller).
- CSDI diffusion imputation.
- Validation diagnostics (CPCV, PBO, RAS, DSR, walk-forward, BH /
  Holm-Bonferroni).
- PRUDEX-Compass evaluation suite.
- Three new replay buffers (General / Prioritized / NStepInfo).

Plus the FinAgent multimodal LLM-hybrid agent (Zhang et al., KDD 2024).

**Decision**: Land all of the above behind 12 phases, each adding
new classes that auto-register through existing AlphaSwarm abstractions
(`RLComponent`, `BaseDataset`, `register_analysis_flow`,
`BaseExperiment`). NO migration of existing components, NO breaking
changes. Every new component:

1. Subclasses an existing AlphaSwarm base (`RewardTerm`, `BaseRLAgent`,
   `BaseRLEnv`, `TimeSeriesEncoder`, `BaseObservationBuilder`,
   `BaseExperiment`, `BaseReplayBuffer`, `BaseDataset`).
2. Sets `rl_alias` so it auto-registers under the right `rl_kind`.
3. Ships unit + property tests under `alphaswarm_rl/tests//`.
4. Respects every hard rule in `alphaswarm_rl/AGENTS.md`.
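
The `rl_alias` auto-registration in step 2 might look roughly like the following. Only `RLComponent`, `rl_alias`, `rl_kind`, `list_components`, `RewardTerm`, and `BaseReplayBuffer` are names from this ADR; the registry layout, the `rl_kind` values, and the concrete subclasses are assumptions for illustration.

```python
class RLComponent(type):
    """Metaclass that files each aliased subclass under its `rl_kind`."""

    _registry: dict[str, dict[str, type]] = {}

    def __init__(cls, name, bases, namespace):
        super().__init__(name, bases, namespace)
        alias = namespace.get("rl_alias")
        kind = getattr(cls, "rl_kind", None)  # inherited from the base class
        if alias is None or kind is None:
            return  # abstract bases carry no alias and stay unregistered
        bucket = RLComponent._registry.setdefault(kind, {})
        if alias in bucket:
            raise ValueError(f"duplicate rl_alias {alias!r} for kind {kind!r}")
        bucket[alias] = cls

    @classmethod
    def list_components(mcls, kind: str) -> list[str]:
        return sorted(mcls._registry.get(kind, {}))


class RewardTerm(metaclass=RLComponent):
    rl_kind = "reward"  # hypothetical kind label


class BaseReplayBuffer(metaclass=RLComponent):
    rl_kind = "replay_buffer"  # hypothetical kind label


class DifferentialSharpeReward(RewardTerm):
    rl_alias = "differential_sharpe"


class ImplementationShortfallReward(RewardTerm):
    rl_alias = "implementation_shortfall"


class PrioritizedReplayBuffer(BaseReplayBuffer):
    rl_alias = "prioritized"


rewards = RLComponent.list_components("reward")
```

Raising on a duplicate alias is a design choice of this sketch; it surfaces naming collisions at import time rather than at lookup time.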

**Consequences**:

- The `rl_alias` namespace grows by ~40 new aliases; the
  `RLComponent.list_components(kind)` registry expands accordingly.
- Heavy dependencies (`scipy.signal`, `scikit-learn`) are mandatory
  for the analysis flow but already in `alphaswarm` core. No new
  third-party RL framework dependencies.
- New top-level packages under `alphaswarm_rl/src/alphaswarm_rl/`:
  `analytical/`, `evaluation/`, `replay/`, `validation/`.
- One new analysis flow in the monolith
  (`alphaswarm/analysis/flows/market_dynamics_modeling.py`) per hard rule
  23.
- One new dataset kind in the monolith
  (`alphaswarm/data/datasets/kinds/csdi_imputed.py`) per hard rule 29.
- One new FinAgent toolset in the monolith
  (`alphaswarm/agents/tools/finagent/`).
- Five new agent YAMLs under `configs/agents/finagent/`.
- Documentation: three new `alphaswarm_docs/` pages (rl-market-dynamics,
  rl-prudex-evaluation, rl-finagent) plus this ADR.

**Hard rule alignment**:

| Rule | Compliance |
| --- | --- |
| 2 (LLM via `router_complete`) | FinAgent layered adapter + all 5 stage YAMLs |
| 3 (Iceberg via `append_arrow`) | CSDI persistence; PRUDEX skips; MDM via gold-tier flow |
| 12 (`AgentRuntime` for agents) | 5 FinAgent stages = 5 AgentSpec rows |
| 16 (`RLRuntime` for RL lifecycle) | All new agents / experiments callable through it |
| 18 (`IcebergTrajectoryStore`) | Untouched — existing path preserved |
| 19 (`RLComponent` metaclass) | All ~40 new aliases auto-register |
| 20 (`router_complete` from RL code) | LayeredReflectionAdapter only LLM caller |
| 22 (No direct DB from agent body) | FinAgent tools route through registered DataMCP only |
| 23-25 (Analysis flow → `AnalysisRuntime`) | MDM flow + `register_analysis_flow` |
| 29 (`BaseDataset` for env data) | tradesim_* envs accept BaseDataset / DataFrame |
| 36-38 (Advantage / backbone / weight-centric) | Backbones extend `TimeSeriesEncoder`; weights flow `WeightCentricPipeline` ⇒ `WeightToOrders` |

**Trade-offs**:

1. **CSDI is ensemble-imputation, not real diffusion** — the full
   ~1500-LOC PyTorch CSDI model is out-of-scope; the ensemble
   imputer satisfies the acceptance gate (MAE < 0.05 on synthetic)
   and ships the same public contract (median + quantile bands)
   so a future drop-in replacement is straightforward.
2. **RAS is EXPERIMENTAL** — exposed under the same canonical
   surface as DSR / PBO but marked in the docstring; the
   Rademacher-complexity estimate is Monte-Carlo and depends on
   `n_draws`.
3. **Paper-grade agents lean on SB3** — most new agents are thin
   `SB3Adapter` subclasses with paper-grade hyperparameters.
   InvestorImitator (REINFORCE) and OPD (teacher-student dual PPO)
   are the two genuinely custom implementations. This matches
   pragmatic deployment patterns: SB3 has been more thoroughly
   battle-tested than re-implementing each paper from scratch.
4. **No live broker integration in the test suite** — `WeightToOrders`
   is tested against `_MockBrokerage`. The Alpaca / IBKR adapter
   lives in the monolith and is covered by integration tests there.


<!-- https://alpha-swarm.ai/architecture/decisions/011-cdn-fronted-standalone-for-aqp-ui -->
# ADR 011 — CDN-fronted standalone container for the cloud-hosted alphaswarm_ui
> The cloud-hosted Next.js 14 PaaS frontend (alphaswarm_ui) ships as a clean Next.js standalone container at app.alpha-swarm.ai. Static assets are CDN-fronted by Cloudflare. ADR 002's multi-stage Solara/Vite/ASGI proxy pattern is scoped to the local alphaswarm_client only.

# ADR 011 — CDN-fronted standalone container for the cloud-hosted alphaswarm_ui

- **Status**: Accepted (2026-05-25)
- **Authors**: Platform team
- **Supersedes (scoped)**: [ADR 002 — Single multi-stage container for the AlphaSwarm client surface](002-single-container-client.md) for the cloud surface only; ADR 002 stays in force for the local `alphaswarm_client/` Vite operator UI.
- **Related**: [ADR 001 — Vite static export](001-static-export-over-ssr.md), [ADR 002 — Single multi-stage container](002-single-container-client.md), [ADR 003 — Auth0 zero-trust](003-auth0-zero-trust.md), [ADR 005 — Separated control plane](005-separated-control-plane.md), [ADR 012 — Solara deprecation](012-solara-deprecation.md)

## Context

When the original `alphaswarm_client/` packaging was designed (ADR 002), the
platform had three coexisting presentation surfaces: a Vite operator
UI, a legacy Next.js webui, and a Python Solara visualisation layer.
Collapsing all three behind one FastAPI proxy was the right call for
a single-tenant local-first deployment where operators bookmark one
URL and the proxy hides the rest.

The cloud-hosted, customer-facing PaaS at `alpha-swarm.ai` /
`app.alpha-swarm.ai` (the new `alphaswarm_ui/` Next.js 14+ App Router app) has
different constraints:

1. **Multi-tenant scale.** Hundreds-to-thousands of concurrent
   tenants. Static-asset throughput and SSR throughput scale at
   different ratios — co-located scaling triggers wasted CPU and
   unnecessary memory pressure on the SSR pods.
2. **CDN-friendly assets.** Next.js standalone emits hashed,
   immutable filenames under `/_next/static/*`. Serving them from
   the SSR pods is bandwidth waste; Cloudflare can cache them for a
   year with zero risk of staleness.
3. **No Python / no Solara.** `alphaswarm_ui/` is pure TypeScript + Next.js
   server. The Solara stage (ADR 002 Stage 2) doesn't apply and
   would only bloat the image (~300 MB heavier).
4. **Independent BFF lifecycle.** Every `alphaswarm_ui/api/*` route is a
   thin BFF handler that re-checks the session, forwards a tenancy
   header, and proxies upstream. Reverse-proxying through FastAPI
   adds an extra hop with no value (the BFF is already a proxy).
5. **Edge-rendered marketing.** The `(marketing)` route group is
   designed for SSR + ISR cache. Routing it through an internal
   FastAPI proxy defeats the whole point of edge-near rendering.

## Decision

The cloud-hosted `alphaswarm_ui` ships as **one clean Next.js standalone
container** built from
[`alphaswarm_platform/build/docker/alphaswarm_ui/Dockerfile`](../../../alphaswarm_platform/build/docker/alphaswarm_ui/Dockerfile)
(already two stages: `node:20-alpine` builder + `node:20-alpine`
runtime running `node server.js`). It DOES NOT use the ADR 002
three-stage Python/ASGI pattern.

**Edge caching layout:**

| Path                  | Cache-Control                                            | Notes |
| --------------------- | -------------------------------------------------------- | ----- |
| `/_next/static/*`     | `public, max-age=31536000, immutable`                    | Hashed filenames; year-long TTL |
| `/public/*` `/fonts/*` `/images/*` | `public, max-age=2592000`                | 30-day TTL, hand-curated assets |
| `/api/*`              | `no-store` + `Pragma: no-cache`                          | BFF responses; user-scoped (rule 4 + management-engine.mdc) |
| Everything else (SSR) | `public, max-age=3600, stale-while-revalidate=86400`     | Per-tenant marketing + dashboard pages |
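
The table can be read as a first-match rule set. The sketch below expresses it as a small Python function for clarity; in the deployment these headers are set by the ingress configuration, so `cache_headers` is a hypothetical helper rather than production code.

```python
def cache_headers(path: str) -> dict[str, str]:
    """Return the response headers the caching layout assigns to `path`."""
    if path.startswith("/_next/static/"):
        # Hashed filenames: safe to cache for a year.
        return {"Cache-Control": "public, max-age=31536000, immutable"}
    if path.startswith(("/public/", "/fonts/", "/images/")):
        # Hand-curated assets: 30 days.
        return {"Cache-Control": "public, max-age=2592000"}
    if path == "/api" or path.startswith("/api/"):
        # User-scoped BFF responses must never be stored.
        return {"Cache-Control": "no-store", "Pragma": "no-cache"}
    # Everything else is SSR output.
    return {"Cache-Control": "public, max-age=3600, stale-while-revalidate=86400"}


static = cache_headers("/_next/static/chunks/main-abc123.js")
api = cache_headers("/api/portfolios")
page = cache_headers("/dashboard")
```

Checking `/api/` before the SSR fallback matters: it guarantees that a BFF response can never pick up the public SSR policy by accident.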

The NGINX Ingress at
[`alphaswarm_platform/deployments/kubernetes/base/alphaswarm-ui/ingress.yaml`](../../../alphaswarm_platform/deployments/kubernetes/base/alphaswarm-ui/ingress.yaml)
sets these via `nginx.ingress.kubernetes.io/configuration-snippet`.
Cloudflare in front honours them aggressively for `/_next/static/*`
and bypasses the cache for `/api/*`.

**Post-deploy cache purge:** the GitHub Actions deploy job in
[`.github/workflows/alphaswarm-ui.yml`](../../../.github/workflows/alphaswarm-ui.yml)
calls the Cloudflare zone-purge API immediately after
`kubectl rollout status` succeeds. The Cloudflare token is sourced
from the existing `CredentialResolver` chain via the
`ALPHASWARM_CLOUDFLARE_API_TOKEN` ExternalSecret (AGENTS rule 26).
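
As a rough illustration of the purge step, the sketch below builds a request for Cloudflare's zone `purge_cache` endpoint. It assumes a purge-everything call, which may differ from what the workflow actually does; `build_purge_request` and the zone identifier argument are hypothetical names for this example.

```python
import json
import urllib.request


def build_purge_request(zone_id: str, token: str) -> urllib.request.Request:
    """Build (but do not send) a purge-everything request for one Cloudflare zone."""
    return urllib.request.Request(
        url=f"https://api.cloudflare.com/client/v4/zones/{zone_id}/purge_cache",
        data=json.dumps({"purge_everything": True}).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {token}",  # API token, never logged
            "Content-Type": "application/json",
        },
        method="POST",
    )
```

Sending it is a single `urllib.request.urlopen(request, timeout=30)` call. Keeping construction separate from sending lets the deploy job unit-test the request shape without network access.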

**HPA:** keep the existing
[`hpa.yaml`](../../../alphaswarm_platform/deployments/kubernetes/base/alphaswarm-ui/hpa.yaml)
(CPU 70%, memory 80%, 3-20 replicas). Because static assets are
CDN-offloaded, SSR pod CPU usage tracks real per-tenant rendering
work — autoscaling becomes meaningful instead of a noisy mix of
"serving a JS bundle" and "rendering a dashboard page".

## Consequences

**Positive**

- 80%+ static-asset bandwidth offloaded to Cloudflare's edge.
- HPA triggers on real SSR work, not bandwidth.
- Image is ~150 MB (Node Alpine) vs. ~450 MB (Python + Solara +
  Node) for ADR 002. Faster pod cold start, faster rolling deploys.
- The BFF + SSR + edge layers have one ownership boundary each —
  Cloudflare for delivery, NGINX Ingress for cache hints,
  `node server.js` for SSR + BFF. No ASGI proxy hop in between.
- `/api/*` is `no-store` end-to-end — no risk of a CDN edge node
  caching a tenant's response and serving it to a different tenant.

**Negative**

- Two presentation packaging stories now exist (ADR 002 for
  `alphaswarm_client`, ADR 011 for `alphaswarm_ui`). Mitigated by the per-surface
  scoping: each ADR is the source of truth for one tree only.
- Cloudflare cache-purge is now part of the deploy critical path. If
  the Cloudflare API is unavailable during a deploy, the purge does
  not run and stale `/_next/static/*` entries can persist for up to
  one year per hashed filename. Because the hashes change on every
  deploy, the impact is bounded to assets whose names didn't change
  (rare for a real change).
- Adds a `CLOUDFLARE_API_TOKEN` secret to the deploy environment.
  Stored in Vault + synced via ExternalSecret per AGENTS rule 26.

## Alternatives considered

- **Stay on ADR 002 (single FastAPI proxy container)** — rejected.
  Bandwidth-CPU coupling, larger image, unnecessary Solara/Python
  weight, redundant proxy hop in front of the BFF.
- **Vercel hosting** — rejected. ADR 003's zero-trust constraints
  + the on-cluster control plane integration argue for keeping the
  SSR layer inside our own K8s + CredentialResolver perimeter.
- **CloudFront in front of a single SSR pod** — rejected. We
  already have Cloudflare as the edge for `alpha-swarm.ai`. Adding a
  second CDN would split the cache-purge story and add edge cost.

## Implementation references

- Standalone Dockerfile: [`alphaswarm_platform/build/docker/alphaswarm_ui/Dockerfile`](../../../alphaswarm_platform/build/docker/alphaswarm_ui/Dockerfile)
- Ingress + CDN headers: [`alphaswarm_platform/deployments/kubernetes/base/alphaswarm-ui/ingress.yaml`](../../../alphaswarm_platform/deployments/kubernetes/base/alphaswarm-ui/ingress.yaml)
- HPA: [`alphaswarm_platform/deployments/kubernetes/base/alphaswarm-ui/hpa.yaml`](../../../alphaswarm_platform/deployments/kubernetes/base/alphaswarm-ui/hpa.yaml)
- CI deploy + cache purge: [`.github/workflows/alphaswarm-ui.yml`](../../../.github/workflows/alphaswarm-ui.yml)
- BFF + session: [`alphaswarm_ui/src/lib/auth/session.ts`](../../../alphaswarm_ui/src/lib/auth/session.ts), [`alphaswarm_ui/src/lib/api/client.ts`](../../../alphaswarm_ui/src/lib/api/client.ts)


<!-- https://alpha-swarm.ai/architecture/decisions/012-solara-deprecation -->
# ADR 012 — Solara deprecation in the cloud build
> Solara is excluded from the cloud alphaswarm_ui Dockerfile and remains only in the local alphaswarm_client image for one-release-cycle rollback. The Solara stage will be removed entirely from alphaswarm_client after the rollback window closes.

# ADR 012 — Solara deprecation in the cloud build

- **Status**: Accepted (2026-05-25)
- **Authors**: Platform team
- **Related**: [ADR 002 — Single multi-stage container](002-single-container-client.md), [ADR 011 — CDN-fronted standalone for alphaswarm_ui](011-cdn-fronted-standalone-for-alphaswarm-ui.md)

## Context

The legacy Solara UI (`legacy_ui.app` at
[`alphaswarm/ui/`](../../../alphaswarm/ui/)) is a Python ASGI presentation layer
that predates the Vite + React 19 + shadcn cutover documented in
[`alphaswarm_client/CUTOVER.md`](../../../alphaswarm_client/CUTOVER.md). It is
already wrapped in the `legacy` profile and gated behind
`ALPHASWARM_CLIENT_ENABLE_SOLARA` (ADR 002 Stage 2 + production runtime).

The cloud `alphaswarm_ui/` Next.js application has no need for Solara —
every chart that Solara renders is covered by the
`lightweight-charts` / `recharts` stack already in
[`alphaswarm_client/package.json`](../../../alphaswarm_client/package.json) and
inherited by `alphaswarm_ui/`. Continuing to bundle Solara into the cloud
image is pure dead weight (~300 MB) AND it creates a second
presentation-layer state machine the BFF would otherwise have to
synchronise with the React component tree.

## Decision

1. **`alphaswarm_ui/` Dockerfile excludes Solara entirely** (already the
   case). No `solara-builder` stage; no `/legacy` mount.
2. **`alphaswarm_client/` retains the Solara stage for one release cycle
   beyond Phase 1 of the cloud-dash refactor.** This preserves the
   ADR 002 rollback contract.
3. **After one release cycle, the Solara stage is removed from
   `alphaswarm_client/`** (Phase 7 of the cloud-dash refactor plan):
   - Delete the `solara-builder` stage from
     [`alphaswarm_platform/build/docker/alphaswarm_client/Dockerfile`](../../../alphaswarm_platform/build/docker/alphaswarm_client/Dockerfile).
   - Drop the `/legacy` mount from the Stage-3 FastAPI proxy.
   - Remove `ALPHASWARM_CLIENT_ENABLE_SOLARA` from
     [`alphaswarm/config/settings.py`](../../../alphaswarm/config/settings.py).
   - `git mv alphaswarm/ui/ alphaswarm/legacy_solara_ui/` so the source code
     remains for archaeological reference but no longer ships.
4. **No new Solara work.** The `legacy` profile is in maintenance
   mode only. New visualisation lands in `alphaswarm_client/` (Vite +
   shadcn) or `alphaswarm_ui/` (Next.js + antd + recharts).

## Consequences

**Positive**

- Cloud image stays small (~150 MB) and Python-free; cold-start
  latency is dominated by Next.js startup, not Solara warmup.
- One less presentation-layer state machine to keep in sync with
  the React component tree.
- Bundle audits stop having to explain why a TypeScript-first PaaS
  ships a 300 MB Python interpreter.

**Negative**

- Operators who relied on Solara dashboards have to migrate before
  the Phase-7 removal. The migration is well-documented: every
  Solara surface has a Vite analog (see the cutover checklist in
  [`alphaswarm_client/CUTOVER.md`](../../../alphaswarm_client/CUTOVER.md)).
- Loss of Solara's Python-side reactive component model. This was
  an interesting prototype path but not a load-bearing operator
  workflow.

## Alternatives considered

- **Keep Solara indefinitely as a "second UI"** — rejected. The
  cost of maintaining two parallel presentation stacks (React +
  Solara) outweighs the value of an alternate visualisation
  framework that no current workflow needs.
- **Port Solara to JupyterLab embed** — rejected. JupyterLab is
  intended for notebook authoring (Lab Engine), not operator
  dashboards. Mixing the two surfaces would re-create the original
  framework-fragmentation problem ADR 002 set out to solve.

## Implementation references

- ADR 002 Solara stage: [`alphaswarm_platform/build/docker/alphaswarm_client/Dockerfile`](../../../alphaswarm_platform/build/docker/alphaswarm_client/Dockerfile) (Stage 2 `solara-builder`)
- Solara source: [`alphaswarm/ui/`](../../../alphaswarm/ui/)
- Feature flag: `ALPHASWARM_CLIENT_ENABLE_SOLARA` in [`alphaswarm/config/settings.py`](../../../alphaswarm/config/settings.py)
- Cutover history: [`alphaswarm_client/CUTOVER.md`](../../../alphaswarm_client/CUTOVER.md)
- Phase-7 removal step: [`.cursor/plans/alphaswarm_cloud-hosted_dash_refactor_*.plan.md`](../../../.cursor/plans/)


<!-- https://alpha-swarm.ai/architecture/decisions/013-entra-as-first-pool -->
# ADR-013: Entra ID as the AlphaSwarm staff first user pool

# ADR-013: Entra ID as the AlphaSwarm staff first user pool

- **Status**: Accepted
- **Date**: 2026-05-27
- **Supersedes**: none
- **Superseded by**: none
- **Related rules**: AGENTS rule 27 (identity), 42 (TerraformRuntime),
  44 (EntraTenantLink approval flow), 26 (CredentialResolver)
- **Related ADRs**: [ADR-003 Auth0 zero-trust](003-auth0-zero-trust.md)
  remains valid for B2C / customer-tenant fallback;
  [ADR-005 separated control plane](005-separated-control-plane.md)
  is the host for the new `manage.alpha-swarm.ai` MSAL routes.

## Context

AlphaSwarm has historically authenticated its own staff
through the same Auth0 tenant that serves customer logins. Auth0 has
served well as a B2C identity surface, but for the **internal** staff
pool we need:

1. **Centralised MFA + Conditional Access**. Staff already authenticate
   to Microsoft 365 daily through the corporate Entra tenant. CA
   policies (block risky sign-ins, named-location MFA, FIDO2 hardware
   key requirements for admins) are enforced by the IT / Security team
   in one place. Replicating those controls in Auth0 doubles the
   surface area.
2. **Audit centralisation**. The corporate SIEM already ingests Entra
   sign-in logs via the existing log stream. Auth0 audit data has to
   be exported separately and reconciled.
3. **Group-driven authorisation**. New hires onboard via a single
   HR-side group action; the Entra group's app-role mapping
   automatically grants the right AlphaSwarm scopes. The Auth0 path required
   manual role assignment in the Auth0 dashboard.
4. **No client secrets in CI**. GitHub Actions OIDC + federated
   credentials replace the old `AZURE_CLIENT_SECRET` repo secret.
5. **Customer separation**. The AlphaSwarm staff Entra tenant is independent
   of every customer Entra tenant. Customer tenants continue to flow
   through the existing `EntraTenantLink` B2B approval wizard
   (AGENTS rule 44) — they do NOT land in the staff tenant.

The runtime support for Entra has been in place since the Phase 4
service-mesh rollout (`alphaswarm/auth/providers/msal_entra.py`); this ADR
formalises the *first user pool* designation and brings the Entra
configuration under Terraform control.

## Decision

1. **The AlphaSwarm staff Microsoft Entra ID tenant is the first user pool
   for `manage.alpha-swarm.ai`**. Tokens whose `iss` matches the AlphaSwarm staff
   tenant are routed through `MsalEntraIdentityProvider` before any
   other provider in the chain.
2. **Auth0 stays as the customer-facing B2C pool** and as the
   degraded-mode fallback for staff (e.g. if the Entra side has an
   incident).
3. **Every Entra resource is under Terraform control** through the
   new `alphaswarm_entra_directory` module: 3 app registrations + 7 directory
   groups + 7 app roles + group → role assignments + named locations
   + GitHub Actions OIDC federated credentials.
4. **Conditional Access policies remain manually authored** because P2
   licensing requires Security-team review on every policy change. The
   Terraform module records policy display names as documentation; a
   smoke-test helper queries Microsoft Graph at apply time to confirm
   each named policy exists.
5. **Apply path is `TerraformRuntime` only** (rule 42). Plan-only on
   PR; apply on push to `main` through `alphaswarm deploy`.
6. **Federated credentials replace static client secrets in CI**. The
   `alphaswarm-ci-github` app's federated credentials are per-environment +
   per-branch — wildcards are rejected at plan time.
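
The wildcard rule in item 6 can be illustrated with a small validator. This is a sketch only: the real check lives in the Terraform module at plan time, and the function name, regular expression, and example repository below are assumptions. The subject shapes follow GitHub's OIDC `sub` claim for environment-scoped and branch-scoped jobs.

```python
import re

# repo:<org>/<repo>:environment:<name>  or  repo:<org>/<repo>:ref:refs/heads/<branch>
_SUBJECT = re.compile(
    r"repo:[\w.-]+/[\w.-]+:(?:environment:[\w.-]+|ref:refs/heads/[\w./-]+)"
)


def is_allowed_subject(subject: str) -> bool:
    """Accept only fully qualified per-environment or per-branch subjects."""
    # The explicit '*' test is redundant with the pattern but documents intent.
    return "*" not in subject and _SUBJECT.fullmatch(subject) is not None
```

Under these assumptions `repo:acme/alphaswarm:environment:prod` passes, while `repo:acme/alphaswarm:ref:refs/heads/*` is rejected.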

## Alternatives considered

### A. Keep Auth0 as the staff pool

Pros:
- Zero migration effort.
- Single identity surface for both staff + customers.

Cons:
- Doubles the MFA + CA enforcement surface.
- Splits audit logs across two systems (Auth0 + corporate SIEM).
- Manual role assignment in the Auth0 dashboard for every new hire.
- Static client secrets in CI.

**Rejected**: the operational + audit overhead outweighs the
zero-migration win.

### B. Migrate ALL identity (staff + customers) to Entra

Pros:
- Single user pool overall.
- Strongest audit centralisation.

Cons:
- Customer tenants are operated by their own admins; we can't unify
  them into a single tenant we control.
- B2C scenarios (e.g. self-service trial signups) are awkward in
  enterprise Entra; Auth0 + the existing B2C surface is the right
  tool.
- Migration cost: every existing Auth0 customer connection would need
  to be re-issued in Entra B2C (different protocol, different SDK).

**Rejected**: the staff vs customer split is the right granularity.

### C. Keep Entra resources in the Azure Portal (no Terraform)

Pros:
- Lower friction for ad-hoc adjustments.
- No Terraform learning curve for IT staff.

Cons:
- No reviewable diff for changes.
- No `terraform_runs` audit row for compliance.
- No federated-credential automation; CI keeps a static secret.
- Drift between dev / prod tenants becomes hard to detect.

**Rejected**: clickops on identity infrastructure violates AGENTS rule
42 in spirit (audit-first, every change reviewed).

## Consequences

### Positive

- One MFA + CA enforcement surface for staff, owned by Security.
- Audit logs land in the corporate SIEM via the existing log stream.
- New-hire onboarding is one HR-side group add.
- CI authenticates to Azure with no stored secrets.
- Customer Entra tenants flow through the existing B2B path; the
  internal pool doesn't bleed into customer scope.
- The `alphaswarm_entra_directory` module is reviewable; every change gets a
  PR + plan diff.

### Negative

- Two identity providers to keep healthy (Entra + Auth0). Mitigated
  by `select_provider_for_token` routing — the right provider is
  picked per-request rather than per-deploy.
- Bootstrap window has a temporary client secret (Phase 0/1 of the
  rollout plan); retired in Phase 5.
- CA policies remain a manual surface — the Terraform module can
  reference them but cannot create them. Mitigated by the smoke-test
  helper that confirms required policies exist before exposure.
- Group membership is intentionally outside Terraform. HR + Security
  own membership; the audit path comes from the Entra audit log
  rather than `terraform_runs`.

### Risks + mitigations

| Risk | Mitigation |
| --- | --- |
| Tenant lockout (every admin's account gets MFA-locked) | TWO break-glass accounts excluded from CA policies (rollout plan §4 + `entra-rotate-secrets` runbook). |
| Group → role mapping mistake grants over-privileged access | `terraform_runs` audit + the daily Prometheus alert on `entra_role_assignment_changes_total`. |
| Federated-credential subject too broad | Per-environment / per-branch subjects; module rejects wildcards at plan time. |
| Customer-tenant tokens routed to MSAL-internal | `select_provider_for_token` checks the `iss` claim against `auth_msal_internal_tenant_id`; mismatch falls through to Auth0. Unit tested. |
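
The last row describes issuer-based routing. The sketch below shows the general shape under stated assumptions: `pick_provider` is a hypothetical stand-in for `select_provider_for_token` (whose real signature is not shown here), the returned strings are illustrative labels, and the two issuer URL forms are the standard Entra v2.0 and v1.0 formats. The payload is decoded without signature verification purely to choose a provider; the selected provider must still validate the token in full.

```python
import base64
import json


def unverified_issuer(token: str) -> str:
    """Read the `iss` claim from a JWT WITHOUT verifying it (routing only)."""
    try:
        payload = token.split(".")[1]
        padded = payload + "=" * (-len(payload) % 4)  # restore base64 padding
        claims = json.loads(base64.urlsafe_b64decode(padded))
    except (IndexError, ValueError):
        return ""  # malformed token: no issuer, falls through to the default
    return claims.get("iss", "") if isinstance(claims, dict) else ""


def pick_provider(token: str, internal_tenant_id: str) -> str:
    """Route staff-tenant issuers to Entra; everything else falls through to Auth0."""
    staff_issuers = {
        f"https://login.microsoftonline.com/{internal_tenant_id}/v2.0",
        f"https://sts.windows.net/{internal_tenant_id}/",
    }
    return "msal_entra" if unverified_issuer(token) in staff_issuers else "auth0"
```

A token issued by a customer's Entra tenant carries a different tenant ID in `iss`, so it does not match the staff set and takes the fall-through path.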

## Implementation pointers

- Long-form rollout plan:
  [`docs/plans/entra-internal-tenant-rollout.md`](pathname:///docs/plans/entra-internal-tenant-rollout.md)
- Concepts:
  [`concepts/identity/entra-internal-tenant`](../../concepts/identity/entra-internal-tenant.md)
- Bootstrap runbook:
  [`how-to/entra-terraform-bootstrap`](../../how-to/entra-terraform-bootstrap.md)
- Onboarding runbook:
  [`how-to/entra-onboard-new-staff`](../../how-to/entra-onboard-new-staff.md)
- Secret-rotation runbook:
  [`how-to/entra-rotate-secrets`](../../how-to/entra-rotate-secrets.md)
- Module:
  [`alphaswarm_platform/terraform/modules/alphaswarm_entra_directory/`](pathname:///alphaswarm_platform/terraform/modules/alphaswarm_entra_directory/README.md)
- Provider runtime: `alphaswarm/auth/providers/msal_entra.py`
- Provider chain selector: `alphaswarm/auth/providers/__init__.py::select_provider_for_token`


<!-- https://alpha-swarm.ai/architecture/decisions/014-knowledge-base-boundary -->
# ADR-014: Knowledge-Base Boundary (`alphaswarm_kb` + `alphaswarm_kb_federation`)
> Extract the AlphaSwarm RAG + agent-memory stack into a Clean-Architecture knowledge-base boundary with pluggable Cognee / Graphiti / Mem0 / Letta / LlamaIndex adapters, bi-temporal PermissionedDataPoint, four-scope KBLayerComposer, hybrid OpenFGA + OPA + Cedar policy stack, and Terragrunt silo-per-tenant IaC.

# ADR-014: Knowledge-Base Boundary

**Status**: accepted (2026-05-28)

**Context**:

The AlphaSwarm knowledge stack started as `alphaswarm/rag/` (a four-level
hierarchical RAG on Redis + pgvector) plus `alphaswarm/llm/memory.py`
(RedisHybridMemory) wired directly into `AgentRuntime`. As the platform
grew, three tensions accumulated:

1. **Vendor coupling.** `HierarchicalRAG` is fast and AlphaSwarm-native, but
   the field has matured rapidly. Cognee (tri-store memory engine),
   Graphiti (bi-temporal Neo4j edges with sub-300ms p95 recall), Mem0
   (user-centric personalisation), Letta (full agent runtime), and
   LlamaIndex (general-purpose vector backbone) all solve adjacent
   problems, and tenants are starting to ask for each by name.
2. **Multi-tenancy on cognitive memory.** The existing RAG row-filter
   stamps `workspace_id`/`lab_id` on rows but provides no node/edge ACL,
   no bi-temporal invalidation, no cross-tenant marketplace, and no
   physical per-tenant isolation. Regulated tenants (financial advisors
   on HIPAA/SOX) need an explicit silo path; B2C tenants need cheap
   shared-schema RLS; both want a marketplace where they can subscribe
   to curated external corpora without giving up isolation.
3. **Cross-boundary contamination.** RAG knowledge lived inside the
   monolith with no Clean-Architecture port surface. Bot specs, RL
   specs, agent specs, and analysis specs all reached into
   `HierarchicalRAG.query` directly, making the surface impossible to
   swap.

The blueprint reviewed in
[`.cursor/plans/alphaswarm_kb_boundary_d1617245.plan.md`](../../../.cursor/plans/alphaswarm_kb_boundary_d1617245.plan.md)
+ the parallel architecture report propose a Clean-Architecture
knowledge-base boundary modelled on the established `alphaswarm_rl` /
`alphaswarm_models` extraction pattern.

**Decision**:

Stand up two new repositories:

- [`alphaswarm_kb/`](../../../alphaswarm_kb/) — the boundary package
  with a pure `domain/` core (ports + bi-temporal `PermissionedDataPoint`
  + DTOs), an `application/` layer (use cases + `KBRuntime` services),
  a fully-pluggable `infrastructure/` adapter trinity, and an extracted
  `rag/` + `memory/` slice that re-emits the legacy
  `alphaswarm.rag.*` + `alphaswarm.llm.memory` surface through
  `DeprecationWarning` shims.
- [`alphaswarm_kb_federation/`](../../../alphaswarm_kb_federation/) —
  a standalone cross-silo marketplace federation reverse-proxy that
  brokers authorised recall via OpenFGA `check` + signed per-subscription
  share tokens + bi-temporal merge.

The package introduces:

1. **Hash-locked `KBCorpusSpec` + `KBRuntime`** (rules 56-57) mirroring
   the existing `RLExperimentSpec` / `BotSpec` / `AnalysisSpec`
   pattern. Every `remember` / `recall` / `improve` / `forget` lands a
   `kb_runs` row + snapshots the spec via `persist_spec`. Alembic
   migration `0088_alphaswarm_kb_specs.py` creates the nine backing
   tables.
2. **`KBAdapterMeta` metaclass** (rule 58) for every concrete
   `IMemoryEngine`, `BaseVectorStore`, `BaseGraphStore`,
   `BaseRelationalStore`, `IACLEvaluator`, `IPolicyEngine`, and
   `IIdentityProvider`. Each subclass sets `kb_kind` + `kb_alias` and
   is auto-registered.
3. **Bi-temporal `PermissionedDataPoint`** combining Graphiti's
   four-timestamp model (`valid_from`/`valid_to`/`created_at`/`expired_at`)
   with Cognee's provenance envelope (`Provenance.dataset_id` +
   `Provenance.data_id` + `Provenance.extractor_chain`).
4. **Four-scope `KBLayerComposer`** (private > hierarchical >
   marketplace > global) with precedence-aware bi-temporal merge.
5. **Hybrid OpenFGA + OPA + Cedar policy stack** per the blueprint
   Section D. `DefaultPermissionResolver` fuses
   `IACLEvaluator.list_objects` (visible IDs) with
   `IPolicyEngine.partial_evaluate` (residual Cypher/SQL fragment)
   into a per-request `AccessBitmap` cached by
   `(tenant, principal, action, anchor_hash)` for 60s.
6. **`KBSiloTenancyStrategy`** (5th strategy alongside RLS /
   schema-per-tenant / db-per-enterprise / hybrid). Routes KB tables
   to a per-tenant Postgres + Qdrant + Neo4j stack provisioned via
   Terragrunt units under
   [`alphaswarm_platform/terragrunt/tenants/`](../../../alphaswarm_platform/terragrunt/tenants/).
7. **Agent-facing surface** through `data.kb.*` DataMCP tools (rule 59
   extends rule 22) and `data.kb.compose_recall` for the layered
   surface. Cross-silo recall goes through
   `alphaswarm_kb_federation` only (rule 60).
8. **Controller integration**: `KBSiloService` +
   `/manage/kb/silos/*` routes on `alphaswarm_controller` (Phase M).
   Lifecycle actions land as `WorkloadRun` rows with
   `WorkloadAction.KB_SILO_{PROVISION,DESTROY,HALT,SCALE}`.
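
Item 2's auto-registration can be pictured as a metaclass that records each concrete subclass under its `(kb_kind, kb_alias)` pair. The sketch below is one plausible shape, not the package's actual implementation: the registry layout, the duplicate check, and the `in_memory` alias are assumptions.

```python
from __future__ import annotations


class KBAdapterMeta(type):
    """Metaclass that registers concrete adapters under (kb_kind, kb_alias)."""

    registry: dict[tuple[str, str], type] = {}

    def __new__(mcls, name, bases, namespace, **kwargs):
        cls = super().__new__(mcls, name, bases, namespace, **kwargs)
        kind = getattr(cls, "kb_kind", None)  # may be inherited from the port
        alias = namespace.get("kb_alias")     # must be declared on the class itself
        if kind and alias:                    # abstract ports without an alias are skipped
            key = (kind, alias)
            if key in mcls.registry:
                raise ValueError(f"duplicate KB adapter {key}")
            mcls.registry[key] = cls
        return cls


class IMemoryEngine(metaclass=KBAdapterMeta):
    kb_kind = "memory_engine"

    def recall(self, query: str) -> list[str]:
        raise NotImplementedError


class InMemoryEngine(IMemoryEngine):
    kb_alias = "in_memory"  # hypothetical alias for this example

    def recall(self, query: str) -> list[str]:
        return []
```

Looking up `KBAdapterMeta.registry[("memory_engine", "in_memory")]` then yields `InMemoryEngine` without any explicit registration call, which is the property that lets a spec select an engine by alias.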

**Consequences**:

- The legacy `alphaswarm.rag.*` + `alphaswarm.llm.memory` import paths
  keep working through `DeprecationWarning` shims for one release
  cycle. New code imports from `alphaswarm_kb.rag.*` +
  `alphaswarm_kb.memory.*` directly.
- Cognee / Graphiti / Mem0 / Letta / LlamaIndex live behind
  `pyproject.toml` extras; the base install stays light. A tenant who
  wants Cognee installs `pip install alphaswarm-kb[cognee]` and sets
  `KBCorpusSpec.memory_engine.kb_alias = "cognee"`.
- The federation gateway is the only cross-silo write/read path
  outside the monolith. New tenant marketplaces, parent-org sharing,
  and global-corpus replication all funnel through it.
- Terragrunt units replace the legacy Terraform workspaces pattern —
  each tenant has its own state file under
  `tenants//prod/terragrunt.hcl`. The
  `tenant_kb_silo` wrapper dispatches to one of three
  cloud-parallel siblings (`tenant_kb_silo_aws/azure/gcp`) which all
  expose identical outputs so Python adapters never branch on cloud.
- Bi-temporal data is now first-class. Contradicted edges close
  `valid_to` instead of being deleted; `as_of` queries reconstruct
  historical state.
- Step-up MFA gates the destructive operations (`/kb/forget`,
  `/kb/halt`, `/manage/kb/silos/*` mutations, subscription
  create/revoke) per rule 52.
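
The shim behaviour in the first bullet can be sketched with a module-level `__getattr__` (PEP 562). This is an assumption about the mechanism rather than the shipped code; `build_shim` is a hypothetical helper, and the stdlib `json` module stands in for `alphaswarm_kb.memory` so the example runs on its own.

```python
import importlib
import types
import warnings


def build_shim(old_name: str, new_name: str) -> types.ModuleType:
    """Create a module that forwards attribute access to `new_name` with a warning."""
    shim = types.ModuleType(old_name)

    def _forward(attr: str):
        warnings.warn(
            f"{old_name}.{attr} is deprecated; import it from {new_name}",
            DeprecationWarning,
            stacklevel=2,  # attribute the warning to the caller, not the shim
        )
        return getattr(importlib.import_module(new_name), attr)

    # A function stored as `__getattr__` in a module's namespace handles missing attributes.
    shim.__getattr__ = _forward
    return shim


# Demonstration: `json` stands in for the new package location.
legacy = build_shim("alphaswarm.llm.memory", "json")
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    dumps = legacy.dumps  # resolved from the new module, with one warning
```

Existing imports keep resolving during the deprecation window, and each access emits a `DeprecationWarning` pointing at the new path.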

**Hard rule alignment**:

| Rule | Compliance |
| --- | --- |
| 2 (router_complete) | Every adapter that does LLM extraction (Graduated pipeline tier 3, Cognee, Mem0) routes through `router_complete`. |
| 3 (iceberg_catalog.append_arrow) | Gold-tier KB writes (`alphaswarm_gold_kb_*` namespaces) go through the canonical helper; `KBRuntime` never touches PyIceberg. |
| 4 (_progress.emit) | All `kb_tasks.py` wrappers use `emit` / `emit_done` / `emit_error`. WebSocket `/kb/.../recall/stream` preserves `{task_id, stage, message, timestamp, **extras}`. |
| 6 (immutable migrations) | `0088_alphaswarm_kb_specs.py` is immutable post-merge. |
| 22 (DataMCP boundary) | Agents read KB only through `data.kb.*` tools (extended by rule 59). |
| 26 (CredentialResolver) | OpenFGA token, NATS DSN, Postgres DSN, federation share-token signing key all resolve through `CredentialResolver`. |
| 27 (IdentityProvider) | `IIdentityProvider` is a thin bridge to `alphaswarm_core.auth.providers`. |
| 34 (experiment_id/test_id) | `kb_runs` carries both FKs; `KBRunRequest` propagates them via `RequestContext`. |
| 42 (TerraformRuntime) | `KBSiloService` invokes `TerraformRuntime`; the controller never shells out to `terraform`. |
| 45 (WorkloadRuntime) | New `WorkloadAction` enum members `KB_SILO_{PROVISION,DESTROY,HALT,SCALE}`. |
| 51 (TenancyStrategy) | `KBSiloTenancyStrategy` registers via `TenancyStrategyMeta`. |
| 52 (step-up MFA) | All destructive `/kb/*` + `/manage/kb/*` routes gate with `require_step_up()`. |
| 56-60 | New hard rules added in the same PR; described in `AGENTS.md`. |

**Trade-offs**:

1. **Two new repositories** to maintain. Mitigated by mirroring the
   established `alphaswarm_rl` / `alphaswarm_models` boundary pattern and
   shipping CI guards that prevent cross-boundary imports.
2. **OpenFGA + OPA + NATS** introduce three new infrastructure
   dependencies. Mitigated by shipping both Docker Compose (local) and
   Kubernetes (prod) manifests; each is a single Helm release with
   ExternalSecrets wiring.
3. **Bi-temporal data complicates schema migrations**. Mitigated by
   making `valid_to`/`expired_at` optional (None = "still valid") so
   existing rows migrate without a backfill.
4. **Terragrunt unit-per-tenant** scales linearly in state-file count.
   Mitigated by bounded-parallelism `run-all` automation under
   [`alphaswarm_platform/terragrunt/`](../../../alphaswarm_platform/terragrunt/)
   plus per-tenant cloud-account isolation for regulated tenants.
5. **Multiple memory engines coexisting** complicates the operator's
   mental model. Mitigated by `data.kb.health` exposing per-corpus
   engine info + the Vite `/knowledge-base/silos` route surfacing
   topology + spec hash per corpus.
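
Trade-off 3 relies on `None` meaning "still valid". The sketch below shows how an `as_of` filter might treat the four timestamps. `Fact` is a hypothetical stand-in for the temporal fields of `PermissionedDataPoint`, and the reading of `valid_from`/`valid_to` as world time and `created_at`/`expired_at` as record time is the usual bi-temporal interpretation, assumed here.

```python
from __future__ import annotations

from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass(frozen=True)
class Fact:
    value: str
    valid_from: datetime                # when the fact became true in the world
    created_at: datetime                # when the system recorded it
    valid_to: datetime | None = None    # None = still valid
    expired_at: datetime | None = None  # None = record not superseded


def visible_as_of(facts, valid_at, known_at=None):
    """Facts true at `valid_at`, as the system knew them at `known_at`."""
    known_at = known_at or valid_at
    return [
        f for f in facts
        if f.valid_from <= valid_at
        and (f.valid_to is None or valid_at < f.valid_to)
        and f.created_at <= known_at
        and (f.expired_at is None or known_at < f.expired_at)
    ]


def day(d: int) -> datetime:
    return datetime(2026, 1, d, tzinfo=timezone.utc)


# A contradicted fact is closed via valid_to rather than deleted.
old = Fact("advisor=alice", valid_from=day(1), created_at=day(1), valid_to=day(10))
new = Fact("advisor=bob", valid_from=day(10), created_at=day(10))
```

Querying day 5 returns only `old` and querying day 15 returns only `new`. Rows that predate the bi-temporal columns behave like `new`: open-ended and unexpired.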

**Out of scope (Phase 6+)**:

- Cedar formal-verification harness (`cedar-analysis`).
- SpiceDB / Permify adapter implementations beyond stubs.
- Multi-region active-active federation (vs the AWS-first → Azure →
  GCP staged rollout).
- Tenant-configurable bi-temporal merge strategies (default:
  last-writer-wins per validity window + precedence tiebreaker).
- Per-tenant bridge tier (shared compute / siloed databases) for SMB
  pricing.
- Cognee `improve` / `forget` scheduling automation (manual triggers
  only in v1).


<!-- https://alpha-swarm.ai/architecture/decisions/015-runtime-decomposition -->
# ADR 015 — Runtime decomposition: cell-based modular monolith over domain microservices
> Evaluates breaking the alphaswarm monolithic runtime into distributed domain microservices and rejects that shape in favor of the platform's existing trajectory: a cell-based modular monolith with selective service extraction along the hash-locked runtime seams (control plane, worker/executor queues, MCP servers, KB boundary, bots operator), plus per-cell data planes for tenant isolation.

# ADR 015 — Runtime decomposition: cell-based modular monolith over domain microservices

- **Status**: Accepted (2026-06-10) — operator approval recorded during the KG-platform execution run (P0 gate review, T01-T03 complete; Track C repositories now exist and are scaffolded)
- **Authors**: Platform team
- **Related**: [ADR 004](004-provider-abstraction.md), [ADR 005](005-separated-control-plane.md), [ADR 006](006-quantbot-operator-pattern.md), [ADR 011](011-cdn-fronted-standalone-for-aqp-ui.md), [ADR 014](014-knowledge-base-boundary.md), [RESTRUCTURING_PLAN.md](https://github.com/Alpha-Swarm-ai/alphaswarm/blob/main/RESTRUCTURING_PLAN.md), [repository-split.md](../../concepts/platform/repository-split.md)

## Context

The `alphaswarm` monolith is the platform's largest deployment unit: one
FastAPI process (`alphaswarm-core`) carrying ~112 route modules, the
in-process MCP routers (`/mcp/data`, `/mcp/codebase`, `/mcp/ml`), and a
Celery task surface of ~57 task modules, backed by 56 ORM model files
and 89 Alembic migrations over a single Postgres, plus Redis in at least
six distinct roles (broker, metadata cache, progress pub/sub, RAG
vectors, ownership event stream, sandbox namespaces). The `Settings`
singleton exposes ~637 knobs.

The question on the table: should the hosted platform break this
runtime into a distributed **domain-microservices** architecture
(agents-service, backtest-service, ml-service, data-service, each with
its own datastore), or adopt a different decomposition?

### What is already decomposed

The platform is not a greenfield monolith. Substantial decomposition
has shipped or is accepted:

| Seam | State | Where |
| --- | --- | --- |
| Control plane (`alphaswarm-cp`) | Shipped — standalone repo, image, `/manage/*` API; never imports `alphaswarm.*` | [ADR 005](005-separated-control-plane.md), `alphaswarm_controller/` |
| Shared kernel | Shipped — wire types, provider ABCs, `WorkloadRuntime` | `alphaswarm_core/` |
| Compute split | Shipped — `alphaswarm-worker` (queues `default,paper,terraform,ingestion,workflows`) vs `alphaswarm-executor` (`backtest,training,ml,agents,factors,rag`), independent HPAs on `alphaswarm_celery_queue_depth` | [worker-executor-images.md](../../concepts/infrastructure/worker-executor-images.md), `alphaswarm_platform/deployments/kubernetes/base/` |
| Frontends | Shipped — `alphaswarm_client`, `alphaswarm_ui`, `alphaswarm_admin`, `alphaswarm_ide`, all HTTP-only | [ADR 011](011-cdn-fronted-standalone-for-aqp-ui.md) |
| RL / ML boundary packages | Shipped — `alphaswarm_rl`, `alphaswarm_models` with deprecation shims; routers/tasks mounted from the external packages | [repository-split.md](../../concepts/platform/repository-split.md) |
| Bots | Boundary package + kopf operator + per-bot pods | [ADR 006](006-quantbot-operator-pattern.md), `alphaswarm_bots/` |
| Edge / cells | Envoy `alphaswarm-edge` + `alphaswarm-tenant-router` (ext_authz, rendezvous cell routing), cell registry in `topology.yaml`, per-cell overlays + ArgoCD ApplicationSet, `alphaswarm-cell-data-plane` Helm chart | [cell-router-cutover.md](../../how-to/cell-router-cutover.md), `alphaswarm_platform/tenant_router/` |
| KB boundary | Accepted (ADR-014) — `alphaswarm_kb` + `alphaswarm_kb_federation` designed; **repositories not yet created** | [ADR 014](014-knowledge-base-boundary.md) |

### Invariants any split must respect

Six hard-rule families make naive per-domain services with per-service
databases actively harmful here:

1. **Single Postgres ledger.** Every runtime writes immutable
   `*_spec_versions` snapshots and `*_runs` ledger rows through
   `LedgerWriter`, which stamps `experiment_id`/`test_id` from
   `RequestContext` (rules 13, 15, 17, 24, 34, 41, 43, 57). Splitting
   the ledger per service destroys the cross-domain experiment umbrella
   and the audit/replay story.
2. **Single LLM gateway.** All LLM calls go through `router_complete`
   (rule 2); telemetry, cost caps, and the semantic cache depend on it.
3. **Single lakehouse write path.** All Iceberg writes go through
   `iceberg_catalog.append_arrow` with medallion validation (rules 3,
   21, 46).
4. **DataMCP boundary.** Agents never read Postgres/Iceberg directly
   (rule 22) — the agent↔data seam is already a service-shaped API.
5. **Kill-switch fan-out.** The topbar kill switch fans out to 12+
   halt endpoints with a p99 propagation SLO; every new long-running
   runtime must join the fan-out, and every process fragment multiplies
   the propagation surface (rules 40, 45, 52).
6. **Idempotent cross-task state in Postgres only** (rule 5) — Celery
   workers are already stateless and horizontally scalable; the
   "scaling" benefit of microservices largely exists today via queues.

## Options considered

### Option 1 — Classic domain microservices

Carve `alphaswarm-core` into independently deployed services
(agents-svc, backtest-svc, analysis-svc, data-svc, trading-svc, …),
each owning its own database and API, communicating via REST/gRPC and
an event bus.

- Violates invariants 1 (ledger) and 6 unless every service still
  writes to the shared Postgres — at which point they are not
  microservices, just N processes sharing one schema and one Alembic
  chain (a distributed monolith).
- The hot coupling points (`LedgerWriter`, `router_complete`,
  `append_arrow`, metadata cache, progress bus) would become N×
  network hops with retry/outbox machinery the platform doesn't need.
- Kill-switch propagation and hash-locked replay would have to be
  re-engineered across service boundaries.
- The throughput-bound work (backtests, training, agent runs) is
  **already** isolated in the executor fleet with queue-depth
  autoscaling; a backtest-service would duplicate that with more
  moving parts.

### Option 2 — Cell-based modular monolith with selective service extraction (recommended)

Keep one logical application (`alphaswarm` runtime) but:

1. **Scale out by cell, not by domain.** A cell = one namespace running
   the core/worker/executor/beat quartet against a per-cell data plane
   (CNPG Postgres, Redis, MinIO, MLflow, Iceberg REST), routed by the
   Envoy edge + tenant router (rendezvous hashing on
   `tenant_id → cell_id`). Tiers map onto the existing
   `TenancyStrategy` lattice (`shared-std` → RLS, `shared-prem` →
   schema-per-tenant, `silo-reg` → database-per-enterprise). This is
   RESTRUCTURING_PLAN Phases 3 + 6, already partially provisioned
   (`cells:` registry in `topology.yaml`, cell overlays, ApplicationSet,
   `alphaswarm-cell-data-plane` chart).
2. **Extract services only along the seams that already have
   service-shaped contracts** — the hash-locked spec runtimes, the MCP
   HTTP surfaces, the control plane, and the operator pattern — and only
   when an extraction passes the Future Repo Split Gate in
   [repository-split.md](../../concepts/platform/repository-split.md).
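
The `tenant_id → cell_id` routing in item 1 can be illustrated with a minimal rendezvous-hashing sketch. The function name, the SHA-256 scoring, and the cell names below are assumptions for illustration; the actual tenant router may score and weight cells differently.

```python
import hashlib


def rendezvous_cell(tenant_id: str, cell_ids: list[str]) -> str:
    """Pick the cell with the highest (tenant, cell) hash score."""
    if not cell_ids:
        raise ValueError("at least one cell is required")

    def score(cell_id: str) -> int:
        digest = hashlib.sha256(f"{tenant_id}:{cell_id}".encode()).digest()
        return int.from_bytes(digest[:8], "big")

    return max(cell_ids, key=score)


# Hypothetical cell names, not taken from topology.yaml.
cells = ["cell-shared-std-1", "cell-shared-std-2", "cell-shared-std-3"]
home = rendezvous_cell("tenant-42", cells)
```

The property that makes this attractive for cells is stability: removing a cell only remaps the tenants that lived on it, because every other tenant's highest-scoring cell is unchanged.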

### Option 3 — Status quo

Keep the single `alphaswarm-core` Deployment and scale vertically.
Rejected: noisy-neighbor risk across tenants, blast radius of one bad
deploy is the whole fleet, and `silo-reg` compliance tenants cannot be
served.

## Decision

Adopt **Option 2**. Decomposition proceeds in three tracks, ordered by
risk and by whether new repositories are required.

### Track A — Process/deployment splits of the existing images (no new repos)

These change `alphaswarm_platform/` manifests and entrypoints only; the
code already supports them:

| # | Cut | Detail |
| --- | --- | --- |
| A1 | `alphaswarm-beat` as a first-class Deployment | Declared in `topology.yaml` and Terraform but missing from `deployments/kubernetes/base/`; promote it (replicas: 1, no HPA). |
| A2 | Standalone MCP server Deployments | Serve `/mcp/data`, `/mcp/codebase`, `/mcp/ml` from dedicated pods using the existing image with a scoped ASGI entrypoint. Topology already declares `alphaswarm-ml-mcp` as a separate service; RFC 9728/8707 audience binding (rule 49) already gives each MCP its own `aud`. Per-tenant MCP isolation then reuses the `alphaswarm-mcp-tenant` Helm chart (Phase 5). |
| A3 | Per-queue executor fleets | Split the executor Deployment into per-queue ScaledObjects (KEDA) for `backtest`, `training`/`ml`, `agents`/`rag` so GPU-class and CPU-class work scale independently. No code change — queue routing exists in `celery_app.py`. |
| A4 | `paper-trader` and `ingester-*` as first-class K8s units | They exist as compose targets (`paper`, `ingester` image stages); give them base manifests + HPAs like worker/executor. |
| A5 | Cell rollout | Execute Phase 3 (cell registry + router live, `RequestContext.cell_id` propagating) then Phase 6 (per-cell Postgres/MinIO/Redis/MLflow via dual-write migration, `ALPHASWARM_CELL_DUAL_WRITE`). |

### Track B — Deepen existing extractions (existing repos, invasive code changes)

| # | Cut | Detail |
| --- | --- | --- |
| B1 | Sidecar control plane as hosted default | Flip hosted deployments from `ALPHASWARM_MANAGEMENT_MODE=embedded` to `sidecar`; `alphaswarm-cp` already ships standalone. |
| B2 | Ledger/telemetry broker for `alphaswarm_rl` + `alphaswarm_models` | Today the extracted packages still import the monolith for `LedgerWriter`, `iceberg_catalog`, `_progress.emit`, ORM. Introduce a narrow HTTP/MCP ledger-write surface (mirroring the controller's `HttpAuditSink` → `/_internal/audit/terraform-runs` pattern) so RL/ML workers can run from their own images without importing monolith ORM. This is the gating work for ever running them as separate services. |
| B3 | Bots operator fleet | Continue the ADR 006 path: per-bot pods via `quantbot-bot` chart, latency-class scheduling (ADR 007), canary PnL gates (ADR 010). Bots are the one domain where per-workload processes are genuinely required (HFT node tiers). |
| B4 | CI boundary gates | Extend the `rg`-based forbidden-import gates from 2 to all 14+ subprojects (RESTRUCTURING_PLAN §2.1, §4.2) so extracted boundaries cannot silently re-couple. Prerequisite for everything above. |
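
The B4 gates are described as `rg`-based. As an illustration of the same idea in Python, the sketch below swaps in the standard `ast` module to list forbidden absolute imports in a source string. The function name and the forbidden prefix are assumptions, not the project's actual gate.

```python
import ast


def forbidden_imports(source: str, forbidden_prefixes: tuple[str, ...]) -> list[str]:
    """Return the absolute imports in `source` that fall under a forbidden prefix."""
    hits: list[str] = []
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Import):
            names = [alias.name for alias in node.names]
        elif isinstance(node, ast.ImportFrom) and node.module and node.level == 0:
            names = [node.module]
        else:
            continue
        for name in names:
            # Match the package itself or any submodule, but not look-alikes
            # such as `alphaswarm_rl` when the prefix is `alphaswarm`.
            if any(name == p or name.startswith(p + ".") for p in forbidden_prefixes):
                hits.append(name)
    return sorted(hits)


sample = "import json\nfrom alphaswarm.persistence import models\nimport alphaswarm_rl.env\n"
violations = forbidden_imports(sample, ("alphaswarm",))
```

A CI job could fail the build whenever the returned list is non-empty for files inside an extracted package.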

### Track C — Extractions that require **new repositories** (permission gate)

Per the workspace's repo-per-boundary convention, these need new git
repositories and therefore explicit approval before any work begins:

| # | Candidate repo | Justification | Status |
| --- | --- | --- | --- |
| C1 | `alphaswarm_kb` | ADR-014 (accepted) defines the KB boundary package — `KBRuntime`, hash-locked `KBCorpusSpec`, adapter trinity. Monolith already mounts its router conditionally and migration `0088_alphaswarm_kb_specs` shipped, but the repo does not exist in the multi-repo workspace. | Blocked on repo creation |
| C2 | `alphaswarm_kb_federation` | ADR-014's cross-silo federation gateway — standalone FastAPI, never imports `alphaswarm.*`; deployable today via `compose/docker-compose.kb.yml` patterns + Terragrunt silo modules. | Blocked on repo creation |
| C3 | (Optional, deferred) `alphaswarm_data` | A future data-plane service (ingestion, discovery, catalog) is the largest-blast-radius extraction; the RESTRUCTURING_PLAN sequences it last, after cells and per-tenant object storage. Not recommended now — listed to make the deferral explicit. | Deferred |

Existing placeholder repos `alphaswarm_research` ("Services for
Research Plane") and `alphaswarm_learning` are available landing zones
should the research/learning planes later split; no work is proposed
for them in this ADR.

### Target topology

```mermaid
flowchart TB
  subgraph EDGE ["Edge"]
    CF[cloudflared] --> ENVOY[alphaswarm-edge Envoy]
    ENVOY -->|ext_authz| TR[alphaswarm-tenant-router]
  end
  ENVOY --> CP[alphaswarm-cp control plane]
  ENVOY --> CELL1
  ENVOY --> CELLN
  subgraph CELL1 ["cell-shared-std-N"]
    API1[alphaswarm-core] --> W1[worker] & X1[executor fleets per queue] & B1[beat]
    MCP1[mcp-data / mcp-codebase / mcp-ml]
    DP1[(per-cell Postgres / Redis / MinIO / MLflow / Iceberg REST)]
    API1 --> DP1
    W1 --> DP1
    X1 --> DP1
    MCP1 --> DP1
  end
  subgraph CELLN ["cell-silo-reg-N"]
    APIN[alphaswarm-core] --> DPN[(dedicated data plane)]
  end
  CP -->|WorkloadRuntime| CELL1
  CP -->|WorkloadRuntime| CELLN
  OP[bots-operator] --> BOTS[per-bot pods incl. HFT tier]
  KBF[alphaswarm-kb-federation] -.read-only cross-silo.-> CELLN
```

## Consequences

**Positive**

- Tenant isolation, blast-radius reduction, and independent scaling are
  achieved by cells + queues — the actual goals usually cited for
  microservices — without breaking the ledger, replay, kill-switch, or
  hash-lock invariants.
- Every extraction reuses a contract that already exists (spec
  runtimes, MCP audiences, `/manage/*`, operator CRDs), so no new
  RPC framework or saga/outbox machinery is invented. Linkerd arrives
  in Phase 4 for cell mTLS, not for inter-domain RPC.
- Track A is pure deployment work and reversible per unit.

**Negative / risks**

- Per-cell data planes multiply infra cost (mitigated by tiering:
  shared backplane for `shared-std`/`shared-prem`).
- Track B2's ledger broker adds an HTTP hop to RL/ML run bookkeeping;
  it must remain async/buffered to keep training loops unaffected.
- The dual-write migration window (Phase 6) is the single riskiest
  operation; the rollback path is the `ALPHASWARM_CELL_DUAL_WRITE`
  flag.
- Until Track C repos exist, KB code paths remain conditional inside
  the monolith and ADR-014 stays partially unrealized.
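
The "async/buffered" requirement on the B2 ledger broker can be sketched as a bounded in-memory queue that the training loop writes to without awaiting, with batches flushed separately. The class name, the `send_batch` callable, and the drop-on-overflow policy are all assumptions; a real broker would more likely spill to disk than drop ledger events.

```python
import asyncio


class BufferedLedgerSink:
    def __init__(self, send_batch, max_buffer: int = 1000, batch_size: int = 50):
        self._send_batch = send_batch  # hypothetical async callable doing the HTTP POST
        self._queue: asyncio.Queue = asyncio.Queue(maxsize=max_buffer)
        self._batch_size = batch_size
        self.dropped = 0

    def record(self, event: dict) -> None:
        """Never blocks the caller; counts drops when the buffer is full."""
        try:
            self._queue.put_nowait(event)
        except asyncio.QueueFull:
            self.dropped += 1

    async def flush(self) -> int:
        """Drain everything currently buffered, in batches; return events sent."""
        sent = 0
        while not self._queue.empty():
            batch = []
            while len(batch) < self._batch_size and not self._queue.empty():
                batch.append(self._queue.get_nowait())
            await self._send_batch(batch)
            sent += len(batch)
        return sent


async def _demo() -> tuple[int, list[int]]:
    batches: list[int] = []

    async def fake_send(batch):  # stands in for the HTTP hop
        batches.append(len(batch))

    sink = BufferedLedgerSink(fake_send, max_buffer=100, batch_size=40)
    for step in range(100):
        sink.record({"step": step})
    return await sink.flush(), batches


sent, batch_sizes = asyncio.run(_demo())
```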

**Explicitly rejected**

- Per-domain services with per-service databases (Option 1).
- Extracting `router_complete` into an LLM-gateway service.
- Splitting the Postgres ledger or the Alembic chain per service.
- Moving Celery beat scheduling out of the single scheduler.

## Rollout order

1. Track B4 (CI boundary gates) and Track A1–A2 (beat + MCP pods).
2. Track A3–A4 (queue fleets, paper/ingester units) and B1 (sidecar CP).
3. Track A5 cells: Phase 3 (router + registry), then Phase 6 (data
   plane), per the stop conditions in RESTRUCTURING_PLAN §18.2.
4. Track B2 (ledger broker) — gate for any future out-of-monolith
   RL/ML workers.
5. Track C — only after repository approval.


<!-- https://alpha-swarm.ai/architecture/preprocessing-spec -->
# PreprocessingSpec
> Qlib's `DataHandlerLP` applies an ordered chain of `Processor` steps (rank-norm, z-score, min-max, outlier clipping, etc.) during training. At inference time the *same* chain must be re-applied — othe...

# PreprocessingSpec

A `PreprocessingSpec` is a tiny dataclass that travels with every trained
model artifact. It remembers which processors were fit and in what order
so inference code can replay the exact preprocessing chain on new data
without ever reaching back into the training-time configuration.

## Why

Qlib's `DataHandlerLP` applies an ordered chain of `Processor` steps
(rank-norm, z-score, min-max, outlier clipping, etc.) during training.
At inference time the *same* chain must be re-applied — otherwise the
model is scored on data with a different distribution than it was trained
on, which silently degrades live performance.

Until now this was only tracked implicitly (the handler config was
expected to be re-instantiated exactly). `PreprocessingSpec` makes it
explicit: the spec is serialised into the model pickle and reloaded when
the model is served, backtested, or paper-traded.

## Shape

```python
@dataclass
class PreprocessingSpec:
    processors_pickle: bytes                # fit state for exact replay
    processor_specs: list[dict]             # {class, module_path, kwargs}
    feature_columns: list[str]
    label_column: str | None
    handler_cfg: dict | None
    metadata: dict[str, Any]
```
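
To make the role of `processors_pickle` concrete, here is a toy, standard-library-only sketch of "fit once, replay without re-fitting". `ToyPreprocessingSpec`, `fit_zscore`, and `apply_step` are hypothetical stand-ins, not the real `PreprocessingSpec` API, and the fit state is a plain dict rather than a Qlib processor.

```python
import pickle
import statistics
from dataclasses import dataclass


def fit_zscore(values: list[float]) -> dict:
    """Fit a toy z-score step; the returned dict is its frozen fit state."""
    return {
        "kind": "zscore",
        "mean": statistics.fmean(values),
        "std": statistics.pstdev(values) or 1.0,
    }


def apply_step(step: dict, values: list[float]) -> list[float]:
    if step["kind"] == "zscore":
        return [(v - step["mean"]) / step["std"] for v in values]
    raise ValueError(f"unknown processor kind: {step['kind']}")


@dataclass
class ToyPreprocessingSpec:
    processors_pickle: bytes  # fit state, serialised once at training time

    @classmethod
    def from_steps(cls, steps: list[dict]) -> "ToyPreprocessingSpec":
        return cls(processors_pickle=pickle.dumps(steps))

    def apply(self, values: list[float]) -> list[float]:
        for step in pickle.loads(self.processors_pickle):  # ordered replay, no re-fit
            values = apply_step(step, values)
        return values


spec = ToyPreprocessingSpec.from_steps([fit_zscore([1.0, 2.0, 3.0])])
replayed = spec.apply([2.0, 4.0])  # scaled with the training mean/std, not its own
```

The new data is scaled with the statistics learned on the training window, which is the distribution-matching guarantee the section above describes.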

## Training-side usage

```python
from alphaswarm.ml.processors import PreprocessingSpec
from alphaswarm.ml.handler import DataHandlerLP

handler = DataHandlerLP(
    instruments=[...],
    learn_processors=[...],
    infer_processors=[...],
)
handler.setup_data()

spec = PreprocessingSpec.from_processors(
    processors=handler.infer_processors,
    feature_columns=[...],
    label_column="label_5d",
    handler_cfg={"class": "DataHandlerLP", "module_path": "alphaswarm.ml.handler", "kwargs": {...}},
    metadata={"dataset_hash": "abc123", "fit_window": "2020-01-01..2023-12-31"},
)

model.fit(dataset).with_preprocessing(spec)
model.to_pickle("models/alpha_v1.pkl")
```

## Inference-side usage

```python
from alphaswarm.ml.base import Model

model = Model.from_pickle("models/alpha_v1.pkl")
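spec = model.preprocessing_spec
# Replay the fitted chain (no re-fit); fall back to the raw bars when the
# model was saved without a spec, so `df` is always defined.
df = spec.apply(new_bars) if spec is not None else new_bars
preds = model.predict(df)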
```

## Serving-side usage

All three serving backends (`MLflowServe`, `RayServe`, `TorchServe`) know
about `preprocessing_spec`. The TorchServe handler in
[`alphaswarm/mlops/serving/torchserve.py`](../../alphaswarm/mlops/serving/torchserve.py)
calls `spec.apply(df)` before `model.predict(df)` when the attribute is
present.


<!-- https://alpha-swarm.ai/compliance/soc2-evidence-map -->
# compliance/soc2-evidence-map

# SOC 2 Type II evidence map

Mapping from the SOC 2 Trust Services Criteria to the
machine-readable evidence shipped by the AlphaSwarm overhaul.

This map is the artifact the compliance team hands to the auditor.
It points at the source of truth for every control; the auditor
can pull the evidence directly from S3 / CloudTrail / Postgres
without manual collation.

| Criterion | Control | Evidence source | Where in the repo |
| --- | --- | --- | --- |
| CC6.1 Logical access | All API auth via `IdentityProvider` | `security_audit_events` Postgres + S3 WORM mirror | `alphaswarm/tasks/audit_log_export_tasks.py` |
| CC6.1 (cont.) | RBAC via `Membership` lattice | `Membership` rows + `expand_role` lattice | `alphaswarm_core/src/alphaswarm_core/auth/rbac.py` |
| CC6.6 Step-up MFA | RFC 9470 step-up on every destructive admin route | `step_up_denied` rows in `security_audit_events` | `alphaswarm_admin/src/alphaswarm_admin/deps/stepup.py` |
| CC6.7 Privileged access | Break-glass 4-eyes + 60min auto-expiry | `admin.break_glass.*` audit rows + Security Hub findings | `alphaswarm_admin/src/alphaswarm_admin/services/break_glass.py` |
| CC6.8 Cryptography | TLS 1.3 ingress + Linkerd mTLS internal | ALB security policy `ELBSecurityPolicy-TLS13-1-2-2021-06`; Linkerd identity certs from ACM PCA | `infrastructure/modules/acm-certificates`, `infrastructure/modules/acm-pca` |
| CC7.1 Detection | Falco DaemonSet + custom rules | Falco events shipped to Loki | `alphaswarm_platform/deployments/kubernetes/helm/falco/values.yaml` |
| CC7.2 Monitoring | OpenTelemetry + Prometheus + Loki + Tempo | Per-env Grafana dashboards | `infrastructure/modules/observability-stack` |
| CC7.3 Incident response | KillSwitch fan-out + halt audit rows | `admin.halt.all` rows | `alphaswarm_admin/src/alphaswarm_admin/api/routers/halt.py` |
| CC7.5 Threat intel | Trivy + Grype on every image | Build-time SBOM + provenance | `.github/workflows/build-publish.yml`, `.github/actions/build-sign-push/` |
| CC8.1 Change management | Hash-locked spec versions | `terraform_stack_spec_versions`, `agent_spec_versions`, `bot_versions`, `rl_experiment_versions`, `analysis_spec_versions`, `workflow_spec_versions` | per AGENTS rules 13/15/17/24/41/43 |
| CC8.1 (cont.) | Immutable Alembic migrations | `.hashes.lock` + `check_migration_immutability.py` | `scripts/ci/check_migration_immutability.py` |
| CC9.1 Risk mitigation | SLSA L3 provenance + Cosign keyless | OCI attestations on every image | `.github/workflows/build-publish.yml` |
| A1.2 Recovery procedures | DR replay runbook + Velero schedules | quarterly rehearsal log | `alphaswarm_docs/docs/operations/dr-replay.md` |
| A1.3 Recovery validation | Cross-region S3 CRR + RDS read replica | Lifecycle policies + replication metrics | `infrastructure/envs/prod/main.tf` |
| C1.2 Confidential information | S3 Object Lock + KMS CMK | `alphaswarm-audit-archive-*` bucket policies | `infrastructure/envs/shared-services/main.tf` |
| C1.2 (cont.) | Step-up + RBAC on broker creds | `BrokerCredentialStore` priority 4 | `alphaswarm/credentials/stores/broker_credential_store.py` |
| PI1.1 Processing integrity | Hash-chained `audit_log` table + Postgres trigger | trigger `enforce_audit_log_hash_chain` | `alembic/versions/0079_audit_log_hash_chain.py` |
| P1.1 Privacy notice | n/a (B2B platform; no PII) | n/a | n/a |
| P3.1 Information collection | OIDC scopes + `https://alphaswarm.internal/resources` claim | Auth0 + Entra Action sources | `alphaswarm/auth/providers/` |
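
The PI1.1 control relies on a hash-chained `audit_log` enforced by a Postgres trigger. The chain idea itself can be sketched in a few lines of Python. The row layout, the genesis value, and the hashing recipe below are assumptions and do not describe the trigger's actual implementation.

```python
import hashlib
import json

GENESIS = "0" * 64  # assumed hash for the first row's predecessor


def row_hash(prev_hash: str, payload: dict) -> str:
    body = json.dumps(payload, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256((prev_hash + body).encode()).hexdigest()


def append(chain: list[dict], payload: dict) -> None:
    prev = chain[-1]["hash"] if chain else GENESIS
    chain.append({"prev_hash": prev, "payload": payload, "hash": row_hash(prev, payload)})


def verify(chain: list[dict]) -> bool:
    """Recompute every link; any edited or reordered row breaks the chain."""
    prev = GENESIS
    for row in chain:
        if row["prev_hash"] != prev or row["hash"] != row_hash(prev, row["payload"]):
            return False
        prev = row["hash"]
    return True


log: list[dict] = []
append(log, {"event": "admin.halt.all", "actor": "operator-1"})
append(log, {"event": "step_up_denied", "actor": "operator-2"})
intact = verify(log)

log[0]["payload"]["actor"] = "someone-else"  # simulate tampering with an old row
tampered_ok = verify(log)
```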

## Type II evidence collection cadence

| Cadence | Activity |
| --- | --- |
| Continuous | CloudTrail Org Trail, Config aggregator, GuardDuty, Security Hub findings; all S3 WORM-mirrored with 7-year retention |
| Daily | Audit log export to WORM bucket (Celery beat 02:00 UTC) |
| Weekly | Renovate dependency updates merged to dev; SBOM diff review |
| Monthly | Access review (operator-driven via `/admin/rbac` UI) |
| Quarterly | DR rehearsal per `dr-replay.md`; tabletop incident exercise |
| Annual | SOC 2 Type II audit window (12-month observation) |

## Operator hand-offs

The platform team owns the controls; compliance owns the
evidence collation + auditor liaison. The handoff is via the
`#alphaswarm-compliance` Slack channel + the SOC 2 dashboard in Grafana
(panels driven by Prometheus queries against
`security_audit_events`).


<!-- https://alpha-swarm.ai/concepts/agentic/agentic-development -->
# Agentic development for AlphaSwarm
> The agentic-coder research literature talks about "skill artifacts", "skill graphs", "Memento-skills", "auditable execution trails", and "MCP control planes" as if they were novel patterns to invent. ...

# Agentic development for AlphaSwarm

> The single doc that connects AlphaSwarm's existing primitives to the
> broader "agentic-coder" vocabulary, plus the consolidated security
> manifesto. Doc map: [alphaswarm_docs/index.md](../../intro/index.md) ·
> Workflow: [../WORKFLOW.md](https://github.com/julianwileymac/alphaswarm/blob/main/WORKFLOW.md) ·
> Hard rules: [../AGENTS.md](https://github.com/julianwileymac/alphaswarm/blob/main/AGENTS.md).

## What this doc is for

The agentic-coder research literature talks about
"skill artifacts", "skill graphs", "Memento-skills", "auditable
execution trails", and "MCP control planes" as if they were novel
patterns to invent. AlphaSwarm already implements every one of them — under
different names, with stronger invariants, and with ledger-backed
audit chains. This doc makes that mapping explicit so you don't waste
time inventing a parallel "skill" surface alongside the spec runtimes
that already exist.

The doc has three sections:

1. **AlphaSwarm's spec-pattern is the skill-artifact pattern.** The
   spec-runtime architecture (Agent / Bot / RL / Analysis / Workflow /
   Terraform) is the skill-graph + Memento-skill equivalent. This
   section also covers where AlphaSwarm deliberately diverges from
   research recommendations.
2. **Working with Cursor agents in AlphaSwarm.** Static channel + dynamic
   channel + plan-mode vs agent-mode usage.
3. **The ADLC security manifesto.** Consolidated for the first time.

## 1. The spec-pattern is the skill-artifact pattern

### The five spec runtimes

| Spec | Runtime | Versions table | Canonical doc |
| --- | --- | --- | --- |
| [`AgentSpec`](../alphaswarm/agents/spec.py) | [`AgentRuntime`](../alphaswarm/agents/runtime.py) | `agent_spec_versions` | [agents.md](../../concepts/agentic/agents.md) |
| [`BotSpec`](../alphaswarm_bots/spec.py) | [`BotRuntime`](../alphaswarm_bots/runtime.py) | `bot_versions` | [bots.md](../../concepts/agentic/bots.md) |
| [`RLExperimentSpec`](../alphaswarm/rl/spec.py) | [`RLRuntime`](../alphaswarm/rl/runtime.py) | `rl_experiment_versions` | [rl-framework.md](../../concepts/rl/rl-framework.md) |
| [`AnalysisSpec`](../alphaswarm/analysis/spec.py) | [`AnalysisRuntime`](../alphaswarm/analysis/runtime.py) | `analysis_spec_versions` | [analysis-framework.md](../../concepts/strategy/analysis-framework.md) |
| [`WorkflowSpec`](../alphaswarm/agents/orchestration/spec.py) | [`WorkflowRuntime`](../alphaswarm/agents/orchestration/runtime.py) | `workflow_spec_versions` | [workflow-studio.md](../../concepts/agentic/workflow-studio.md) |

`WorkflowSpec` (Phase 5 of the additive orchestration refactor) sits
**above** the four classic runtimes: it composes them through the
[`OrchestrationAdapter`](../alphaswarm/agents/orchestration/base.py)
registry. A workflow can wrap an existing `AgentRuntime` invocation
(via the `LangGraphAdapter` / `CrewProcessAdapter` / `DialecticalDebateAdapter`)
or chain deterministic fusion + risk-overlay execution (via
`SignalFusionAdapter` + `WeightCentricExecutionAdapter`). All five
runtimes share the same hash-locked + immutable + ledger-backed
semantics described below.

Each is:

- **Declarative** — a Pydantic model with strict types.
- **Hash-locked** — the SHA-256 of the canonical-JSON-serialized
  spec is the version key.
- **Auto-versioned** — first run snapshots a row in the `*_versions`
  table; behaviour changes produce new rows; old rows are immutable.
- **Ledger-backed** — every run records `spec_version_id` so the
  exact run can be deterministically replayed against historical
  data.
- **Discoverable** — the registry pattern (built-ins + YAML
  auto-loading) means new specs come online without touching the
  runtime.
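
The "hash-locked" property can be sketched as follows. The canonicalisation used here (sorted keys, compact separators, UTF-8) is an assumption; the runtimes may canonicalise differently, and `spec_hash` is a hypothetical helper name.

```python
import hashlib
import json


def spec_hash(spec: dict) -> str:
    """SHA-256 over a canonical JSON rendering, so key order never matters."""
    canonical = json.dumps(spec, sort_keys=True, separators=(",", ":"), ensure_ascii=True)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


a = spec_hash({"name": "momentum", "params": {"lookback": 20, "top_k": 5}})
b = spec_hash({"params": {"top_k": 5, "lookback": 20}, "name": "momentum"})  # same spec, reordered
c = spec_hash({"name": "momentum", "params": {"lookback": 21, "top_k": 5}})  # behaviour change
```

Here `a` and `b` agree because only key order differs, while `c` differs because a parameter changed, which is what would trigger a new version row.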

### Mapping to research vocabulary

The 2024–2026 agentic-coder literature uses several overlapping terms.
Here's how each lands on AlphaSwarm's primitives:

| Research term | AlphaSwarm equivalent | Notes |
| --- | --- | --- |
| "Skill artifact" | One row in a `*_versions` table | The artifact has semantic interface (the Pydantic spec), preconditions (the spec's input schema), executable payload (the runtime invocation), and deterministic postconditions (the run row + Iceberg outputs). |
| "Skill graph" | The full registry across the active spec runtimes | Each runtime hosts one graph; `BotSpec` references `AgentSpec`s, `RLExperimentSpec` references data pipelines, `AnalysisSpec` references flows, and orchestration/deployment specs compose the runtime graph at higher levels. |
| "Auditable execution trail" | `*_runs` ledger rows + Iceberg outputs + per-step result tables | E.g. `analysis_runs` + `analysis_step_results` + `alphaswarm_gold_analysis_` |
| "MCP control plane" | The DataMCPTool catalog | One catalog, three transports (in-process bridge, FastAPI router, stdio binary). See [data-mcp.md](../../concepts/data/data-mcp.md). |
| "Memento-skill / continual learning" | Re-snapshot on change | When a spec changes, `persist_spec` inserts a **new** version row — old versions stay for replay. The "memory" is the immutable history. |
| "Verifiable rewards" | The `*_runs` ledger + cost caps + guardrails on the runtime | Telemetry covers cost, latency, and outcome metrics. |

### Where AlphaSwarm **deliberately diverges**

The research recommends some patterns that AlphaSwarm rejects on purpose:

1. **"Rewrite the skill on failure" / self-modifying skills.** The
   research literature (e.g. *new framework lets AI agents rewrite
   their own skills without retraining*) advocates patching a
   failing skill in-place. **AlphaSwarm forbids this.** Reasons:
   - Auditability — every behaviour change must be a new
     hash-locked version row, not an in-place mutation.
   - Replay — runs reference `spec_version_id` for replay; mutating
     the spec breaks the replay invariant.
   - Compliance — financial systems need an append-only audit
     trail.
   - Risk — a self-mutating spec next to live capital is a
     non-starter.
   The right pattern in AlphaSwarm: when a spec fails, author a new
   spec version (manually or via tooling), snapshot it, switch
   traffic. The previous version remains for forensics.
2. **"Skill graph self-improvement loops"** that mutate skill
   metadata across runs. AlphaSwarm's metadata is owned by the active
   metadata layer
   ([`alphaswarm.data.catalog.register_dataset`](../alphaswarm/data/catalog/active_metadata.py))
   and updated through explicit upserts — never as a side effect
   of a run.
3. **"Free-form SQL tools for agents"** to "let the model figure it
   out". AlphaSwarm requires every read to go through a registered
   `DataMCPTool` with a strict args schema and policy check. See
   [data-mcp.md](../../concepts/data/data-mcp.md) and the
   [data-mcp.mdc](../.cursor/rules/data-mcp.mdc) Cursor rule.
4. **"Auto-update implementation when intent changes"** (intent-driven
   development with bidirectional updates). AlphaSwarm's docs are
   updated in the same PR that touches the code, by humans or
   under explicit human review. Drift detection is welcome;
   automatic mutation is not.
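
The "new version instead of in-place patch" rule from item 1 can be sketched with an in-memory, append-only store. `VersionStore` and `VersionRow` are hypothetical names; the real `persist_spec` writes to the `*_versions` tables and its signature is not shown in this document.

```python
import hashlib
import json
from dataclasses import dataclass


@dataclass(frozen=True)  # frozen: a stored row cannot be mutated afterwards
class VersionRow:
    version_id: int
    spec_hash: str
    payload: str


class VersionStore:
    def __init__(self) -> None:
        self._rows: list[VersionRow] = []
        self._by_hash: dict[str, VersionRow] = {}

    def persist(self, spec: dict) -> VersionRow:
        payload = json.dumps(spec, sort_keys=True, separators=(",", ":"))
        digest = hashlib.sha256(payload.encode()).hexdigest()
        if digest in self._by_hash:  # unchanged spec: reuse the existing row
            return self._by_hash[digest]
        row = VersionRow(len(self._rows) + 1, digest, payload)
        self._rows.append(row)  # changed spec: append, never overwrite
        self._by_hash[digest] = row
        return row

    def get(self, version_id: int) -> VersionRow:
        return self._rows[version_id - 1]


store = VersionStore()
v1 = store.persist({"name": "mean-reversion", "window": 10})
v1_again = store.persist({"window": 10, "name": "mean-reversion"})
v2 = store.persist({"name": "mean-reversion", "window": 15})  # the "fix" is a new row
```

After the fix, version 1 is still retrievable byte-for-byte, which is what keeps replay and forensics possible.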

### Adding a new spec — the canonical flow

1. Pick the right runtime by the question being answered:
   - "What should an LLM-driven agent do?" → `AgentSpec`
   - "What should a deployable bot (universe + strategy + risk +
     ML + agents + RAG) do?" → `BotSpec`
   - "What should an RL experiment train / evaluate?" →
     `RLExperimentSpec`
   - "What statistical / numerical analysis flow should run on a
     dataset?" → `AnalysisSpec`
2. Author the YAML or programmatic Pydantic instance.
3. Call the right `persist_spec(...)` (or let the registry do it
   on first lookup).
4. Run via the runtime — the first run snapshots a `*_versions`
   row.
5. The run row records `spec_version_id` and emits progress through
   [`alphaswarm/tasks/_progress.py`](../alphaswarm/tasks/_progress.py).

If you find yourself wanting to "add a new skill artifact" outside
this pattern — stop, read this section again, pick the right spec
runtime.

## 2. Working with Cursor agents in AlphaSwarm

### The two-channel context strategy

AlphaSwarm follows the static / dynamic context bifurcation pattern that
Anthropic's Cursor integration recommends:

- **Static channel** — what doesn't change between sessions:
  - [AGENTS.md](https://github.com/julianwileymac/alphaswarm/blob/main/AGENTS.md) — 45 hard rules
  - [.cursor/rules/](../.cursor/rules) — glob-scoped rule files
  - [alphaswarm_docs/](../docs) — narrative architecture
- **Dynamic channel** — what changes session-to-session:
  - DataMCPTool catalog (live database schemas, dataset lineage,
    entity catalog)
  - The `agent_runs_v2` / `bot_deployments` / `rl_runs` /
    `analysis_runs` ledger rows
  - The Cursor environment's recently-edited / open files /
    terminal state

The Cursor agent should treat the static channel as authoritative
for **rules and architecture**, and the dynamic channel as
authoritative for **live state** (don't guess a table schema —
query the MCP catalog).

### Plan mode vs agent mode

| Mode | When | Restrictions |
| --- | --- | --- |
| **Plan mode** | Complex / ambiguous tasks, architectural decisions, large refactors, anything with > 1 valid implementation | Read-only — cannot edit files |
| **Agent mode** | Single clear task, post-plan implementation, debugging once root cause is known | Full tool access |
| **Background mode** | Long-running tasks (Docker stack rebuild, full test suite, training runs) | Runs in parallel; non-blocking |
| **Ask mode** | "How does X work?" / read-only exploration | Cannot edit; can search |

The
[../WORKFLOW.md](https://github.com/julianwileymac/alphaswarm/blob/main/WORKFLOW.md) document has the full plan→act→reflect
cadence including FAST vs SLOW velocity calibration and intervention
nodes.

### Reading the agent's plan output as a structured spec

When Cursor's plan mode produces a `.cursor/plans/*.plan.md` file,
treat it like a `*Spec` artifact: the human reviews, approves, and
the agent then executes the plan one task at a time, updating todos
as it goes. The plan file is the contract.

## 3. ADLC security manifesto

The Agentic Development Life Cycle (ADLC) framing says: as agentic
autonomy expands, the security posture must scale with it. AlphaSwarm
already enforces several layers; this section consolidates them
in one place so you can audit the surface in one read.

### Layer 1 — Kill-switch (ultimate human override)

- Code: [alphaswarm/risk/kill_switch.py](../alphaswarm/risk/kill_switch.py),
  [alphaswarm/risk/manager.py](../alphaswarm/risk/manager.py)
- Wired endpoint today: `POST /portfolio/kill_switch` in
  [alphaswarm/api/routes/portfolio.py](../alphaswarm/api/routes/portfolio.py)
- Frontend topbar component:
  [alphaswarm_client/src/components/common/KillSwitch.tsx](../alphaswarm_client/src/components/common/KillSwitch.tsx)
- Design contract for per-runtime fan-out — `/agents/halt`,
  `/paper/stop-all`, `/bots/halt-all`, `/rl/halt-all` — see
  [frontend.mdc](../.cursor/rules/frontend.mdc) (wire as the
  endpoints come online; add them to `KillSwitch` in the same PR).
- All paper sessions halt within one heartbeat and cancel open
  orders. The Meta-Agent can flip the switch; an operator can flip
  it; the agent is never allowed to flip it without explicit human
  acknowledgement (per
  [WORKFLOW.md](https://github.com/julianwileymac/alphaswarm/blob/main/WORKFLOW.md#intervention-nodes)).
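
The per-runtime fan-out can be sketched with `asyncio`: call every halt endpoint concurrently, bound each call with a timeout, and report which ones acknowledged. The `call` parameter is a hypothetical async client function and the timeout value is an assumption; only the endpoint paths come from the design contract above.

```python
import asyncio

HALT_ENDPOINTS = ["/agents/halt", "/paper/stop-all", "/bots/halt-all", "/rl/halt-all"]


async def fan_out_halt(call, endpoints: list[str], timeout_s: float = 1.0) -> dict[str, bool]:
    """Halt every runtime concurrently; one slow or failing endpoint never blocks the rest."""

    async def one(path: str) -> tuple[str, bool]:
        try:
            await asyncio.wait_for(call(path), timeout=timeout_s)
            return path, True
        except Exception:  # includes timeouts; the caller decides how to escalate
            return path, False

    return dict(await asyncio.gather(*(one(p) for p in endpoints)))


async def fake_call(path: str) -> None:  # stands in for the real HTTP POST
    if path == "/rl/halt-all":
        raise ConnectionError("endpoint not wired yet")


acks = asyncio.run(fan_out_halt(fake_call, HALT_ENDPOINTS))
```

Returning a per-endpoint result rather than raising on the first failure lets the UI show exactly which runtimes did not confirm the halt.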

### Layer 2 — Immutable spec versions (audit trail)

- `agent_spec_versions`, `bot_versions`, `rl_experiment_versions`,
  `analysis_spec_versions` are append-only.
- Each spec is hash-locked (SHA-256 of canonical JSON).
- Every run records `spec_version_id` for replay.
- This guarantees: **every behaviour change has a permanent record**
  identifying who introduced it (via the commit) and what the spec
  looked like at that moment.

### Layer 3 — DataMCPTool boundary (no direct catalog reads)

- Agents MUST NOT `import alphaswarm.persistence.models...` or call
  `iceberg_catalog` / `duckdb_provider` directly inside their body.
- All reads go through registered `DataMCPTool`s, exposed via
  in-process bridge + FastAPI `/mcp/data` router + `alphaswarm-data-mcp`
  stdio binary.
- See [data-mcp.md](../../concepts/data/data-mcp.md) and
  [data-mcp.mdc](../.cursor/rules/data-mcp.mdc).
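
A rough sketch of what "registered tool with a strict args schema and policy check" means in practice. Everything here (`ToolDef`, `REGISTRY`, `call_tool`, the role names, the sample tool) is hypothetical and uses only the standard library; the real `DataMCPTool` classes are not reproduced in this document.

```python
from dataclasses import dataclass
from typing import Any, Callable


@dataclass(frozen=True)
class ToolDef:
    name: str
    args_schema: dict[str, type]
    allowed_roles: frozenset[str]
    handler: Callable[..., Any]


REGISTRY: dict[str, ToolDef] = {}


def register(tool: ToolDef) -> None:
    REGISTRY[tool.name] = tool


def call_tool(name: str, args: dict[str, Any], role: str) -> Any:
    tool = REGISTRY.get(name)
    if tool is None:
        raise KeyError(f"unregistered tool: {name}")
    if role not in tool.allowed_roles:  # policy check comes before any data access
        raise PermissionError(f"role {role!r} may not call {name}")
    if set(args) != set(tool.args_schema):  # strict: no extra or missing arguments
        raise TypeError("unexpected or missing arguments")
    for key, expected in tool.args_schema.items():
        if not isinstance(args[key], expected):
            raise TypeError(f"{key} must be {expected.__name__}")
    return tool.handler(**args)


register(ToolDef(
    name="list_columns",
    args_schema={"dataset": str},
    allowed_roles=frozenset({"research-agent"}),
    handler=lambda dataset: [f"{dataset}.ts", f"{dataset}.close"],
))
columns = call_tool("list_columns", {"dataset": "bars"}, role="research-agent")
```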

### Layer 4 — Single LLM entry-point (router_complete)

- All LLM calls go through
  [`router_complete`](../alphaswarm/llm/providers/router.py).
- No direct `litellm.completion` / `OllamaClient` / vendor SDKs.
- The router enforces tier policies, cost caps, and provider
  fallback. Bypassing it strips those guardrails.
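
The guardrails a single entry point makes possible can be sketched as a cost-capped fallback loop. This is not `router_complete`: the function name, the provider tuple shape, and the billing assumption (failed attempts still count against the budget) are all illustrative.

```python
from typing import Callable


class CostCapExceeded(RuntimeError):
    """Raised when the next attempt would exceed the per-call budget."""


# (provider name, assumed cost per call in USD, callable that returns the completion)
Provider = tuple[str, float, Callable[[str], str]]


def complete_with_fallback(
    prompt: str, providers: list[Provider], budget_usd: float
) -> tuple[str, str, float]:
    spent = 0.0
    for name, cost_usd, call in providers:
        if spent + cost_usd > budget_usd:
            raise CostCapExceeded(f"{name} would exceed the {budget_usd} USD cap")
        spent += cost_usd  # assumption: failed attempts are still billed
        try:
            return name, call(prompt), spent
        except ConnectionError:
            continue  # provider unavailable: fall through to the next one
    raise RuntimeError("all providers failed")


def flaky(prompt: str) -> str:
    raise ConnectionError("primary down")


def backup(prompt: str) -> str:
    return f"echo: {prompt}"


chain = [("primary", 0.02, flaky), ("backup", 0.01, backup)]
used, text, spent = complete_with_fallback("hi", chain, budget_usd=0.05)
```

A direct vendor SDK call would skip both the cap and the fallback, which is the failure mode this layer exists to prevent.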

### Layer 5 — Single Iceberg entry-point + medallion enforcement

- All writes go through
  [`iceberg_catalog.append_arrow`](../alphaswarm/data/iceberg_catalog.py)
  / `create_or_replace_table`.
- The wrapper validates that the namespace prefix matches the
  declared `medallion_layer` (`bronze` / `silver` / `gold`).
- `BusinessMetadata` is mandatory on first write — agents query
  this surface to know what a dataset is for.
- See [data-layer-unification.md](../../concepts/data/data-layer-unification.md) and
  [iceberg.mdc](../.cursor/rules/iceberg.mdc).
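
The namespace-prefix check can be sketched as below. The `alphaswarm_<layer>` naming scheme is an assumption inferred from prefixes such as `alphaswarm_gold_analysis_` mentioned earlier, and `validate_medallion` is a hypothetical helper, not the wrapper's real function.

```python
MEDALLION_LAYERS = ("bronze", "silver", "gold")


def validate_medallion(table_identifier: str, medallion_layer: str) -> None:
    """Reject writes whose namespace does not match the declared layer."""
    if medallion_layer not in MEDALLION_LAYERS:
        raise ValueError(f"unknown medallion layer: {medallion_layer}")
    namespace = table_identifier.split(".", 1)[0]
    # Assumed convention: namespaces start with `alphaswarm_<layer>`.
    if not namespace.startswith(f"alphaswarm_{medallion_layer}"):
        raise ValueError(
            f"namespace {namespace!r} does not match declared layer {medallion_layer!r}"
        )


validate_medallion("alphaswarm_gold_analysis.factor_scores", "gold")  # passes silently

try:
    validate_medallion("alphaswarm_bronze_raw.ticks", "gold")
    mismatch_rejected = False
except ValueError:
    mismatch_rejected = True
```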

### Layer 6 — Secrets and configuration

- Configuration through
  [`alphaswarm.config.settings`](../alphaswarm/config/__init__.py) only — never
  construct a fresh `Settings()`, never read `os.environ` directly.
- New env vars are `ALPHASWARM_*`-prefixed fields on the `Settings` class
  in [alphaswarm/config/settings.py](../alphaswarm/config/settings.py) and added
  to [.env.example](../.env.example).
- Credentials use the helpers in
  [alphaswarm/utils/keys.py](../alphaswarm/utils/keys.py); never paste them
  into `.env` outside what's already in `.env.example`.

### Layer 7 — Migration immutability

- See [migrations-persistence.mdc](../.cursor/rules/migrations-persistence.mdc).
- Shipped migrations are never edited. Schema bugs are fixed
  forward, never backward.

### Layer 8 — Pre-merge checklist (human-driven)

The checklist in [CONTRIBUTING.md](https://github.com/julianwileymac/alphaswarm/blob/main/CONTRIBUTING.md) is the last
line of defence:

- Tests pass locally
- Docs updated (data-dictionary, ERD, glossary)
- New env vars in `.env.example`
- New deps in `pyproject.toml`
- Migration applied + reviewed (autogenerate footguns checked)
- For SLOW-mode work: TDD-loop followed (see
  [WORKFLOW.md](https://github.com/julianwileymac/alphaswarm/blob/main/WORKFLOW.md))

### Recommended (not yet enforced) — red-team review

For any new `AgentSpec` that gains broker-API or live-trading
tools, run a red-team review before promoting from paper to live:

- Adversarial prompt simulation
- Boundary-violation tests (does the agent try to escape its
  tool catalog?)
- Cost-cap stress (does it loop?)
- Margin / risk-limit interaction (does the spec respect
  [alphaswarm/risk/](../alphaswarm/risk/) constraints?)

Today this is documentation, not automation. Future work: a
`POST /agents/red-team-review` task that takes an `AgentSpec` and
runs a fixed adversarial battery against it before promotion.

## When in doubt

1. Read [../AGENTS.md](https://github.com/julianwileymac/alphaswarm/blob/main/AGENTS.md) — the canonical 45 rules.
2. Read [../WORKFLOW.md](https://github.com/julianwileymac/alphaswarm/blob/main/WORKFLOW.md) — the cadence.
3. Read [multi-agent-patterns.md](../../concepts/agentic/multi-agent-patterns.md) — when
   you're scaling the agent topology.
4. Read [glossary.md](../../intro/glossary.md) — for terminology.
5. Search the code: `rg "<pattern>" alphaswarm/`.


<!-- https://alpha-swarm.ai/concepts/agentic/agentic-pipeline -->
# Agentic pipeline
> End-to-end walkthrough of the AlphaSwarm agentic-trading lifecycle: pick models, register data, snapshot specs, dispatch via WorkflowRuntime, review through MCP-bridged agent surfaces.

# Agentic pipeline

> Doc map: [intro](../../intro/index.md) ·
> Sequence diagrams: [flows](../platform/flows.md#3-agentic-crew-run) ·
> Spec-pattern primer: [agentic-development](./agentic-development.md) ·
> Multi-agent topologies: [multi-agent-patterns](./multi-agent-patterns.md) ·
> Orchestration adapters: [workflow-studio](./workflow-studio.md) ·
> Worked tutorial: [tutorials/first-agent-workflow](../../tutorials/first-agent-workflow.md).

This page walks through the AlphaSwarm agentic-trading lifecycle: pick a
model, register a data source, snapshot the spec, dispatch through
the workflow runtime, and review the run. Every action has a REST +
CLI surface so you can script the same flow; every action also has
an `alphaswarm_client` (Vite UI) route at `alpha-swarm.ai` so a human can drive
it.

The pipeline has **five stages**. The stage added since the prior
version of this doc is **Spec snapshot**: every spec-driven run
now hash-locks into an immutable `*_spec_versions` row before any
work happens.

```mermaid
flowchart LR
    subgraph llmStage [1. Models and providers]
        Pull["Ollama pull"]
        Vllm["vLLM profile up"]
        Sera["SERA-32B opt-in"]
        Defaults["router_complete defaults"]
    end
    subgraph dataStage [2. Data sources]
        Discovery["DiscoveryService"]
        Inspector["Parquet / Iceberg inspector"]
        AirbyteBuilder["Airbyte builder + userland Fetcher"]
        Sandbox["Dagster sandbox (ephemeral)"]
    end
    subgraph snapshotStage [3. Spec snapshot]
        AgentSpec["AgentSpec / BotSpec"]
        WfSpec["WorkflowSpec"]
        Hash["SHA-256 hash"]
        Versions["*_spec_versions row"]
    end
    subgraph dispatchStage [4. Workflow dispatch]
        WfRuntime["WorkflowRuntime"]
        AgentRt["AgentRuntime"]
        BotRt["BotRuntime"]
        RlRt["RLRuntime"]
        Adapters["7 orchestration adapters"]
    end
    subgraph reviewStage [5. Review]
        WS["WebSocket /chat/stream"]
        Ledger["agent_runs_v2 + workflow_runs"]
        Inkeep["Inkeep AI assistant (in-product)"]
        Mcp["docs MCP server"]
    end

    llmStage --> snapshotStage
    dataStage --> snapshotStage
    snapshotStage --> Hash --> Versions
    Versions --> dispatchStage
    WfRuntime --> AgentRt
    WfRuntime --> BotRt
    WfRuntime --> RlRt
    WfRuntime --> Adapters
    dispatchStage --> reviewStage
```

## 1. Models and providers

Open [`/models`](https://alpha-swarm.ai/models) in the operator UI
(`alphaswarm_client`). The page lives at
[alphaswarm_client/src/routes/models/](https://github.com/julianwileymac/alphaswarm/tree/main/alphaswarm_client/src/routes/models)
and exposes three tabs:

- **Ollama (host)**: type a model tag in *Pull a model* (e.g.
  `nemotron`, `llama3.2`, `qwen2:7b`) and click **Pull**. A Celery
  task streams progress over the canonical
  `/chat/stream/{task_id}` envelope so the page shows a real-time
  download bar.
- **vLLM**: every YAML under
  [`configs/llm/`](https://github.com/julianwileymac/alphaswarm/tree/main/configs/llm)
  becomes a profile card showing compose status, served models,
  and `Start` / `Stop` buttons. Starting a profile auto-saves its
  `base_url` as the active vLLM endpoint.
- **SERA-32B**: opt-in Ai2 Open Coding model for the codebase
  MCP elaborator (see [sera](../data/sera.md)). Set
  `ALPHASWARM_SERA_ENABLED=true` and `ALPHASWARM_SERA_ENDPOINT` in your env.

Every model call routes through
[`router_complete`](https://github.com/julianwileymac/alphaswarm/blob/main/alphaswarm/llm/providers/router.py)
(AGENTS rule 2). Provider selection is declared in
`AgentSpec.model` and the runtime drives the call, so never call
`router_complete` directly from inside an agent body (AGENTS rule
12).

REST equivalents (streaming endpoints return `TaskAccepted`):

```bash
curl -X POST localhost:8000/agentic/models/pull \
    -H 'content-type: application/json' \
    -d '{"name":"llama3.2"}'

curl -X DELETE localhost:8000/agentic/models/llama3.2
curl -X GET    localhost:8000/agentic/models/running
curl -X GET    localhost:8000/agentic/vllm/profiles
curl -X POST   localhost:8000/agentic/vllm/start \
    -H 'content-type: application/json' \
    -d '{"profile":"vllm_nemotron"}'
```
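For scripting, the same calls can be prepared from Python. The sketch below uses only the standard library and just constructs the requests without sending them. `BASE_URL` and the helper name `build_request` are assumptions made for illustration, not part of the AlphaSwarm codebase.

```python
import json
import urllib.request
from typing import Optional

BASE_URL = "http://localhost:8000"  # assumption: local gateway, as in the curl examples


def build_request(method: str, path: str, body: Optional[dict] = None) -> urllib.request.Request:
    """Build (but do not send) a JSON request against the gateway."""
    data = json.dumps(body).encode("utf-8") if body is not None else None
    headers = {"content-type": "application/json"} if body is not None else {}
    return urllib.request.Request(BASE_URL + path, data=data, headers=headers, method=method)


pull = build_request("POST", "/agentic/models/pull", {"name": "llama3.2"})
start = build_request("POST", "/agentic/vllm/start", {"profile": "vllm_nemotron"})
running = build_request("GET", "/agentic/models/running")

# To send one against a running gateway: urllib.request.urlopen(pull)
print(pull.get_method(), pull.full_url)
```

Keeping construction separate from sending makes the payloads easy to inspect or log before anything reaches the gateway.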

## 2. Data sources

Open [`/data/hub`](https://alpha-swarm.ai/data/hub) in the operator UI.
This is the active replacement for the legacy Solara explorer pages.

The Hub exposes the four data-plane tiers (see [data-plane](../data/data-plane.md)):

- **Discovery browser**: unified ingested / pending / orphan /
  external_only entries; filter chips drive the
  [`DiscoveryService`](https://github.com/julianwileymac/alphaswarm/blob/main/alphaswarm/data/discovery/service.py).
- **Iceberg Editor**: namespace browser + parquet preview + column
  profiling.
- **Airbyte builder**: schema-driven connector editor at
  [alphaswarm_client/src/components/airbyte/builder/](https://github.com/julianwileymac/alphaswarm/tree/main/alphaswarm_client/src/components/airbyte/builder).
  Emits either Airbyte YAML or an AlphaSwarm-native `Fetcher` stub. No
  free-text credential fields: every secret resolves through
  `` (AGENTS rule 31).
- **Dagster sandbox**: ephemeral per-session Dagster + Airbyte
  environment (AGENTS rule 32).

REST surface:

```bash
curl -X GET  http://localhost:8000/discovery/entries
curl -X POST http://localhost:8000/sources/alpha_vantage/probe
curl -X POST http://localhost:8000/discovery/entries//promote
curl -X POST http://localhost:8000/dagster/sandbox/sessions
```

Or invoke the data MCP tools directly:

```bash
curl -X POST http://localhost:8000/mcp/data/tools/data.discovery.browse/invoke \
    -H "Content-Type: application/json" \
    -H "Authorization: Bearer $(alphaswarm-cli auth token)" \
    -d '{"namespace_prefix":"alphaswarm_silver_yfinance"}'
```

## 3. Spec snapshot

Every spec-driven run hash-locks the spec into a `*_spec_versions`
row before any work happens. The same content always returns the
same `version_id`; any field change creates a new row; old rows
stay forever for replay. This is the invariant that makes the
entire agentic pipeline auditable.

```mermaid
sequenceDiagram
    actor Author
    participant API as FastAPI
    participant Runtime as AgentRuntime / BotRuntime / RLRuntime / WorkflowRuntime
    participant Versions as *_spec_versions
    participant Hash as SHA-256
    Author->>API: POST /agents/specs (YAML body)
    API->>Hash: compute SHA-256 of canonical JSON
    Hash-->>API: spec_hash
    API->>Versions: SELECT id WHERE spec_hash = ?
    alt existing row
        Versions-->>API: existing version_id
    else new row
        API->>Versions: INSERT (spec_hash, spec_json, ...)
        Versions-->>API: new version_id
    end
    API-->>Author: { spec_id, version_id, spec_hash }
    Note over Versions: Row is immutable. Re-posting identical content returns the same id.
```
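The lookup-or-insert flow in the diagram can be sketched in a few lines. This is a minimal illustration, not the gateway's implementation: the canonical form (sorted keys, compact separators) is an assumption, and `SpecVersions` is a hypothetical in-memory stand-in for a `*_spec_versions` table.

```python
import hashlib
import json


def spec_hash(spec: dict) -> str:
    """SHA-256 over an assumed canonical JSON form (sorted keys, compact separators)."""
    canonical = json.dumps(spec, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


class SpecVersions:
    """Hypothetical in-memory stand-in for a *_spec_versions table."""

    def __init__(self) -> None:
        self._by_hash = {}
        self._rows = []

    def snapshot(self, spec: dict) -> int:
        digest = spec_hash(spec)
        if digest in self._by_hash:        # existing row: return its id unchanged
            return self._by_hash[digest]
        version_id = len(self._rows) + 1   # new row: insert only, never update
        self._rows.append({"id": version_id, "spec_hash": digest, "spec_json": spec})
        self._by_hash[digest] = version_id
        return version_id


versions = SpecVersions()
v1 = versions.snapshot({"name": "research_lite", "model": "llama3.2"})
v2 = versions.snapshot({"model": "llama3.2", "name": "research_lite"})  # same content, new key order
v3 = versions.snapshot({"name": "research_lite", "model": "qwen2:7b"})  # one field changed
print(v1, v2, v3)
```

Hashing a canonical form rather than the raw request body is what lets two posts with different key order or whitespace resolve to the same row.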

Five hash-locked spec types ship today:

| Spec | Runtime | Versions table | AGENTS rule |
| --- | --- | --- | --- |
| `AgentSpec` | `AgentRuntime` | `agent_spec_versions` | 12-13 |
| `BotSpec` | `BotRuntime` | `bot_versions` | 14-15 |
| `RLExperimentSpec` | `RLRuntime` | `rl_experiment_versions` | 16-17 |
| `AnalysisSpec` | `AnalysisRuntime` | `analysis_spec_versions` | 23-24 |
| `WorkflowSpec` | `WorkflowRuntime` | `workflow_spec_versions` | 40-41 |

Plus two additive ones from the management engine:

| Spec | Runtime | Versions table | AGENTS rule |
| --- | --- | --- | --- |
| `TerraformStackSpec` | `TerraformRuntime` | `terraform_stack_spec_versions` | 42-43 |
| (workload ops) | `WorkloadRuntime` | `workload_runs` (write-only ledger) | 45 |

REST:

```bash
# AgentSpec
# --data-binary keeps the file's newlines; plain -d strips them, which breaks YAML.
curl -X POST http://localhost:8000/agents/specs \
    -H "Content-Type: application/json" \
    --data-binary @configs/agents/research_lite.yaml

# WorkflowSpec
curl -X POST http://localhost:8000/workflows/specs \
    -H "Content-Type: application/json" \
    --data-binary @configs/workflows/my-research-loop.yaml
```

## 4. Workflow dispatch

`WorkflowRuntime` is the additive control plane that composes every
spec runtime into multi-node DAGs. It ships with seven
`OrchestrationAdapter` kinds (AGENTS rule 40):

- **graph**: LangGraph state machine
- **crew**: CrewAI manager-pattern crew
- **debate**: bounded debate with N participants
- **fusion**: fan-out / fan-in
- **execution**: wraps an `RLRuntime` / `BotRuntime` / `AnalysisRuntime` as a single node
- **schedule**: cron-triggered, idempotent
- **studio**: operator-driven UI wiring at
  [`/workflows`](https://alpha-swarm.ai/workflows)

```mermaid
flowchart TB
    Spec[WorkflowSpec] --> Runtime[WorkflowRuntime]
    Runtime --> AdapterRegistry["OrchestrationAdapterMeta registry"]
    AdapterRegistry --> A1[graph]
    AdapterRegistry --> A2[crew]
    AdapterRegistry --> A3[debate]
    AdapterRegistry --> A4[fusion]
    AdapterRegistry --> A5[execution]
    AdapterRegistry --> A6[schedule]
    AdapterRegistry --> A7[studio]
    A1 --> AgentRt[AgentRuntime]
    A2 --> AgentRt
    A3 --> AgentRt
    A4 --> AgentRt
    A5 --> RlRt[RLRuntime]
    A5 --> BotRt[BotRuntime]
    A5 --> AnaRt[AnalysisRuntime]
    Runtime --> Halt[should_halt check]
    Runtime --> Cost[cost cap check]
    Runtime --> Ledger[workflow_runs + agent_runs_v2]
```
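One common way to build a registry like the one in the diagram is a metaclass that records each concrete subclass under its `kind`. The sketch below is an illustrative stand-in that borrows the diagram's names; the class bodies, the `run` signature, and the `adapter_for` helper are assumptions and may not match the real adapters.

```python
class OrchestrationAdapterMeta(type):
    """Illustrative metaclass: records every concrete adapter under its `kind`."""

    registry = {}

    def __new__(mcls, name, bases, namespace):
        cls = super().__new__(mcls, name, bases, namespace)
        kind = namespace.get("kind")
        if kind is not None:
            if kind in mcls.registry:
                raise ValueError(f"duplicate adapter kind: {kind}")
            mcls.registry[kind] = cls
        return cls


class OrchestrationAdapter(metaclass=OrchestrationAdapterMeta):
    kind = None  # abstract base, so it is not registered

    def run(self, node: dict) -> str:
        raise NotImplementedError


class FusionAdapter(OrchestrationAdapter):
    kind = "fusion"

    def run(self, node: dict) -> str:
        return f"fan-out/fan-in over {len(node['children'])} children"


class ScheduleAdapter(OrchestrationAdapter):
    kind = "schedule"

    def run(self, node: dict) -> str:
        return f"cron {node['cron']}"


def adapter_for(kind: str) -> OrchestrationAdapter:
    """Look up the adapter class by kind and instantiate it."""
    return OrchestrationAdapterMeta.registry[kind]()


print(sorted(OrchestrationAdapterMeta.registry))
print(adapter_for("fusion").run({"children": ["a", "b"]}))
```

With this pattern, adding an adapter kind means defining one subclass; the runtime's lookup code does not change.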

Dispatch:

```bash
curl -X POST http://localhost:8000/workflows//run \
    -H "Content-Type: application/json" \
    -d '{"inputs": {...}}'
```

The runtime:

1. Re-hash-locks every referenced spec (idempotent).
2. Opens a `workflow_runs` row with `status=pending`.
3. Builds the adapter DAG.
4. Walks nodes; for each, opens an `agent_runs_v2` row and
   delegates to the relevant runtime.
5. Emits canonical progress frames at every transition.
6. Calls `should_halt()` before every step, so the topbar
   [KillSwitch](https://github.com/julianwileymac/alphaswarm/blob/main/alphaswarm_client/src/components/common/KillSwitch.tsx)
   reaches every node within ~250 ms.
7. Enforces `cost_caps` (`per_node_max_tokens`, `per_run_max_usd`)
   per AGENTS rule 12.
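The node walk with its halt and cost checks can be sketched as a plain loop. This is a simplified illustration under stated assumptions: each node is a `(name, cost_usd)` pair with a cost known up front, delegation to a runtime is replaced by adding that cost, and the `workflow.halted` / `workflow.cost_cap_exceeded` stage names are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Run:
    status: str = "pending"
    total_cost_usd: float = 0.0
    frames: list = field(default_factory=list)


def dispatch(nodes, should_halt, per_run_max_usd):
    """Walk nodes in order, checking the halt flag and the cost cap before each step."""
    run = Run()
    run.status = "running"
    run.frames.append("workflow.started")
    for name, cost_usd in nodes:
        if should_halt():                                   # kill switch wins over everything
            run.status = "halted"
            run.frames.append("workflow.halted")            # hypothetical stage name
            return run
        if run.total_cost_usd + cost_usd > per_run_max_usd:
            run.status = "failed"
            run.frames.append("workflow.cost_cap_exceeded")  # hypothetical stage name
            return run
        run.frames.append(f"node.{name}.started")
        run.total_cost_usd += cost_usd                      # stand-in for delegating to a runtime
        run.frames.append(f"node.{name}.completed")
    run.status = "completed"
    run.frames.append("workflow.completed")
    return run


ok = dispatch([("research", 0.02)], should_halt=lambda: False, per_run_max_usd=1.0)
halted = dispatch([("research", 0.02)], should_halt=lambda: True, per_run_max_usd=1.0)
capped = dispatch([("a", 0.6), ("b", 0.6)], should_halt=lambda: False, per_run_max_usd=1.0)
print(ok.status, halted.status, capped.status)
```

Checking the halt flag at the top of every iteration is what bounds kill-switch latency to the duration of a single step.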

Replay:

```bash
curl -X POST http://localhost:8000/workflows/runs//replay
```

Replay reuses the same `workflow_spec_versions` row + every
referenced `*_spec_versions` row; a new `workflow_runs` row lands
with a `parent_run_id` pointer.
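In data terms, a replay is a new run row that pins the same spec version and points back at the original. A minimal sketch, assuming a hypothetical `WorkflowRun` record whose field names other than `parent_run_id` are invented for illustration:

```python
import itertools
from dataclasses import dataclass
from typing import Optional

_run_ids = itertools.count(1)  # stand-in for an autoincrement primary key


@dataclass(frozen=True)
class WorkflowRun:
    id: int
    workflow_spec_version_id: int        # hypothetical column name
    parent_run_id: Optional[int] = None  # set only on replays


def start_run(spec_version_id: int) -> WorkflowRun:
    return WorkflowRun(id=next(_run_ids), workflow_spec_version_id=spec_version_id)


def replay(original: WorkflowRun) -> WorkflowRun:
    """New run row, same pinned spec version, pointer back to the original run."""
    return WorkflowRun(
        id=next(_run_ids),
        workflow_spec_version_id=original.workflow_spec_version_id,
        parent_run_id=original.id,
    )


first = start_run(spec_version_id=7)
again = replay(first)
print(again)
```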

## 5. Review

Three review surfaces, each consuming the same canonical ledger:

### WebSocket stream

The frame envelope is `{task_id, stage, message, timestamp,
**extras}` per AGENTS rule 4. Subscribe from any client:

```javascript
const ws = new WebSocket(`ws://localhost:8000/chat/stream/${task_id}`);
ws.onmessage = (e) => {
  const f = JSON.parse(e.data);
  console.log(f.stage, f.message, f.extras);
};
```
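A Python consumer can split each frame into the four core fields plus extras. The sketch below only parses a JSON string and leaves the WebSocket transport out. It assumes the timestamp arrives as a string, and it accepts extras either nested under an `extras` key or flattened at the top level, since the envelope notation allows both readings; the `Frame` class and `parse_frame` helper are hypothetical.

```python
import json
from dataclasses import dataclass, field

CORE_KEYS = ("task_id", "stage", "message", "timestamp")


@dataclass
class Frame:
    task_id: str
    stage: str
    message: str
    timestamp: str
    extras: dict = field(default_factory=dict)


def parse_frame(raw: str) -> Frame:
    """Split a JSON frame into the four core fields plus everything else."""
    payload = json.loads(raw)
    missing = [k for k in CORE_KEYS if k not in payload]
    if missing:
        raise ValueError(f"frame missing keys: {missing}")
    core = {k: payload[k] for k in CORE_KEYS}
    # Accept a nested "extras" object as well as extra top-level keys.
    extras = dict(payload.get("extras") or {})
    extras.update({k: v for k, v in payload.items() if k not in CORE_KEYS and k != "extras"})
    return Frame(**core, extras=extras)


frame = parse_frame(
    '{"task_id": "t-1", "stage": "node.research.started", '
    '"message": "starting", "timestamp": "2026-05-18T00:00:00Z", "tokens": 12}'
)
print(frame.stage, frame.extras)
```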

### `agent_runs_v2` + `workflow_runs` ledger

Agent-safe reads via DataMCP:

```bash
curl -X POST http://localhost:8000/mcp/data/tools/data.workflows.describe/invoke \
    -H "Content-Type: application/json" \
    -d '{"workflow_run_id": ""}'

curl -X POST http://localhost:8000/mcp/data/tools/data.agents.list_runs/invoke \
    -H "Content-Type: application/json" \
    -d '{"workflow_run_id": "", "limit": 20}'
```

Each row carries `experiment_id` + `test_id` (AGENTS rule 34),
`total_tokens`, `total_cost_usd`, and a full per-step breakdown
under `agent_run_steps`.

### Inkeep AI assistant + docs MCP server

Two new surfaces in 2026-05:

- **Inkeep widget in-product.** The "Ask AI" button in
  [alphaswarm_client](https://github.com/julianwileymac/alphaswarm/tree/main/alphaswarm_client)
  routes to an Inkeep agent that has the entire docs corpus +
  every public AlphaSwarm API spec ingested. It cites by URL and never
  invents references.
- **Docs MCP server at `docs.alpha-swarm.ai/mcp`.** An RFC 9728 + 8707
  compliant Cloudflare Worker (AGENTS rule 49). Cursor / Claude /
  Continue / custom scripts connect to it for `search`,
  `fetch_page`, and `list_pages` over the same corpus. In-platform
  agents reach it through the bridged `data.docs.*` MCP tools.

Both surfaces compose with the workflow runtime: a workflow node
can call Inkeep / the docs MCP server as an external tool, and the
`agent_runs_v2` row records the call.

## Worked example: build a research workflow

Goal: snapshot an `AgentSpec` + `WorkflowSpec`, dispatch the
workflow, tail progress, and inspect the ledger, all from this page.

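### Step 1: snapshot an `AgentSpec`

Post the spec to `POST /agents/specs`, as shown in section 3.
Re-running with identical content returns the same
`(spec_id, version_id)`; the runtime treats it as a no-op.

### Step 2: snapshot a `WorkflowSpec` that references it

Post the workflow spec to `POST /workflows/specs`, also shown in
section 3.

### Step 3: dispatch

Dispatch the workflow with the run call from section 4.

### Step 4: tail progress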

```bash
curl -N http://localhost:8000/chat/stream/{task_id}
```

You will see frames in the canonical envelope. Expected stages:
`workflow.started` → `node.research.started` →
`agent.token` (×N) → `node.research.completed` →
`workflow.completed`.
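When scripting this step, the stage sequence can be checked mechanically. A small sketch, assuming the stages have already been collected into a list and that at least one `agent.token` frame arrives; the helper names are hypothetical.

```python
EXPECTED = [
    "workflow.started",
    "node.research.started",
    "agent.token",
    "node.research.completed",
    "workflow.completed",
]


def collapse_repeats(stages):
    """Drop consecutive duplicates so N `agent.token` frames count once."""
    out = []
    for stage in stages:
        if not out or out[-1] != stage:
            out.append(stage)
    return out


def follows_expected_order(stages) -> bool:
    return collapse_repeats(stages) == EXPECTED


observed = [
    "workflow.started", "node.research.started",
    "agent.token", "agent.token", "agent.token",
    "node.research.completed", "workflow.completed",
]
print(follows_expected_order(observed))
```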

### Step 5: inspect the ledger

Query the run with the `data.workflows.describe` and
`data.agents.list_runs` MCP calls shown in section 5.


### Step 6: verify

- `agent_spec_versions` row exists with the recorded `spec_hash`.
- `workflow_spec_versions` row exists; its content references the
  `agent_spec_versions` row from Step 1.
- One `workflow_runs` row + one `agent_runs_v2` row (one node).
- `total_cost_usd` is under the workflow's `per_run_max_usd` cap.
- Re-dispatching by triggering Step 3 again creates a NEW
  `workflow_runs` row but reuses ALL the same `*_spec_versions` rows.

### What next

- Walk the full tutorial: [tutorials/first-agent-workflow](../../tutorials/first-agent-workflow.md).
- Add a second node: [concepts/agentic/workflow-studio](./workflow-studio.md) covers the seven adapter kinds.
- Read the topology catalogue: [concepts/agentic/multi-agent-patterns](./multi-agent-patterns.md).
- Snapshot an agent spec from the CLI: [how-to/recipes/snapshot-an-agent-spec](../../how-to/recipes/snapshot-an-agent-spec.md).

## The four-runtime story

This pipeline is one of four overlapping execution surfaces. Each
has its own concept doc but they all share the same hash-lock
invariant, the same canonical progress frame, the same kill-switch
fan-out, and the same `experiment_id` audit chain.

| Runtime | Lifecycle surface | Worked tutorial | Concept doc |
| --- | --- | --- | --- |
| `AgentRuntime` | Single agent, single spec | (covered here) | [agents](./agents.md) |
| `BotRuntime` | Bot = universe + strategy + ML + agents + RAG + risk | [tutorials/first-bot](../../tutorials/first-bot.md) | [bots](./bots.md) |
| `RLRuntime` | Train / evaluate / paper / replay / walk-forward | [tutorials/first-rl-experiment](../../tutorials/first-rl-experiment.md) | [concepts/rl/rl-framework](../rl/rl-framework.md) |
| `WorkflowRuntime` | Composition layer over the other three | [tutorials/first-agent-workflow](../../tutorials/first-agent-workflow.md) | [workflow-studio](./workflow-studio.md) |

## Hard rules (agentic-pipeline scope)

The full set is in
[AGENTS.md](https://github.com/julianwileymac/alphaswarm/blob/main/AGENTS.md).
The agentic-pipeline subset:

- **Rules 12-13**: All spec-driven agent runs go through
  `AgentRuntime`; `agent_spec_versions` rows are immutable.
- **Rule 22**: Agents never read Postgres / Iceberg directly;
  every read goes through a `DataMCPTool`.
- **Rule 34**: Every run-producing flow populates `experiment_id`.
- **Rule 40**: All workflow lifecycle actions go through
  `WorkflowRuntime`.
- **Rule 41**: `workflow_spec_versions` rows are immutable
  hash-locked snapshots.
- **Rule 49**: Every MCP server is RFC 9728 + 8707 conformant.
- **Rule 54**: Delegated agent tokens for HTTP MCP calls go
  through `TokenExchangeBroker` (RFC 8693 + Auth0 Custom Token
  Exchange Profile `alphaswarm-agent-delegation`).

## Deeper reads

- [agentic-development](./agentic-development.md): AlphaSwarm's spec-pattern mapped to the broader agentic-coder vocabulary.
- [agents](./agents.md): `AgentSpec` schema + `AgentRuntime` lifecycle.
- [multi-agent-patterns](./multi-agent-patterns.md): sequential / parallel / debate / coordinator / ReAct topologies.
- [workflow-studio](./workflow-studio.md): the additive `WorkflowRuntime` + seven adapter kinds.
- [orchestration-refactor-rollout](./orchestration-refactor-rollout.md): operator rollout / rollback runbook.
- [alpha-researcher-agent](./alpha-researcher-agent.md), [research-agents](./research-agents.md), [selection-agents](./selection-agents.md), [trader-agents](./trader-agents.md), [analysis-agents](../strategy/analysis-agents.md): domain agent suites.
- [bots](./bots.md): bot entity (`TradingBot` / `ResearchBot`) and `BotRuntime`.
- [agent-watchdog](../data/agent-watchdog.md): Celery beat task that halts stalled `agent_runs_v2` rows.
- [reference/api](../../reference/api/index.mdx): the `agents` + `workflows` tags (interactive playground).
- [reference/python/alphaswarm/agents](../../reference/python/index.mdx): auto-generated Python reference.


<!-- https://alpha-swarm.ai/concepts/agentic/agents -->
# Agents
> - **AgentSpec** — declarative blueprint (Pydantic). Holds role, system_prompt, tools, model, memory, RAG clauses, guardrails, output_schema, cost / call caps, and annotations. Defined in [alphaswarm/agents/s...

# Agents

This document covers the spec-driven agent surface added by the
agentic-RAG expansion. The legacy CrewAI research crew (under
[alphaswarm/agents/crew.py](../alphaswarm/agents/crew.py)) coexists with the new
runtime; both register routes under `/agents/*` in the FastAPI gateway.

## Concepts

- **AgentSpec** — declarative blueprint (Pydantic). Holds role,
  system_prompt, tools, model, memory, RAG clauses, guardrails,
  output_schema, cost / call caps, and annotations. Defined in
  [alphaswarm/agents/spec.py](../alphaswarm/agents/spec.py).
- **AgentRuntime** — executor that turns a spec into a real run with
  full telemetry. Defined in [alphaswarm/agents/runtime.py](../alphaswarm/agents/runtime.py).
- **Registry** — process-wide name → AgentSpec map. Discovered
  built-ins are registered at import time; YAML files under
  [configs/agents/](../configs/agents/) are auto-loaded on first lookup.
  Declared in [alphaswarm/agents/registry.py](../alphaswarm/agents/registry.py).
- **Reproducibility** — every spec is hash-locked and snapshotted into
  `agent_spec_versions` on first use. Every run records a
  `spec_version_id` so it can be deterministically replayed.

## The four teams

| Team | Specs | Page |
| --- | --- | --- |
| Research | `research.news_miner`, `research.equity`, `research.universe` | [alphaswarm_docs/research-agents.md](../../concepts/agentic/research-agents.md) |
| Selection | `selection.stock_selector` | [alphaswarm_docs/selection-agents.md](../../concepts/agentic/selection-agents.md) |
| Trader | `trader.signal_emitter` | [alphaswarm_docs/trader-agents.md](../../concepts/agentic/trader-agents.md) |
| Analysis | `analysis.step`, `analysis.run`, `analysis.portfolio` (+ reflector) | [alphaswarm_docs/analysis-agents.md](../../concepts/strategy/analysis-agents.md) |

## Inspiration-rehydration personas (Phase 2026-04-29)

Nine new spec-driven agents added by the rehydration. Each ships as a
YAML in [configs/agents/](../configs/agents/) and uses one or more of
the new analytics tools in [alphaswarm/agents/tools/analytics_tools.py](../alphaswarm/agents/tools/analytics_tools.py).

| Spec name | Role | Tools |
| --- | --- | --- |
| `research.regime_analyst` | ADX trend/range gate | `regime_classifier_tool`, `historical_volatility` |
| `research.composite_voter` | TradFi-style indicator consensus | `multi_indicator_vote_tool` |
| `research.basis_momentum_analyst` | Commodity basis screening | `factor_screen_tool`, `realised_vol_tool` |
| `research.cointegration_analyst` | Pair stat-arb | `cointegration_tool`, `historical_volatility` |
| `research.intraday_momentum_analyst` | Gao 2018 intraday plays | `realised_vol_tool`, `regime_classifier_tool` |
| `selection.cross_asset_skew_screener` | Cross-asset skew factor | `factor_screen_tool` |
| `analysis.queue_position_analyst` | HFT metric explainer | `hft_metrics_tool` |
| `analysis.cointegration_basket_finder` | Universe-wide pair search | `cointegration_tool` |
| `research.options_greeks_explainer` | Bachelier + inverse Greeks | `option_greeks_tool`, `option_spread_tool` |

Composite pipeline: see
[alphaswarm/agents/graph/builder.py::build_quant_research_pipeline_graph](../alphaswarm/agents/graph/builder.py)
which chains `composite_voter → regime_analyst → cointegration_analyst →
risk_simulator → emit_signal_event/reject_decision_log` with the
existing risk-simulator approval gate.

## Run lifecycle

```mermaid
flowchart LR
  spec[AgentSpec YAML or code]
  reg[Registry]
  rt[AgentRuntime]
  rag[HierarchicalRAG]
  mem[RedisHybridMemory]
  llm[router_complete]
  db[(agent_runs_v2 + agent_run_steps)]
  spec --> reg --> rt
  rt --> rag
  rt --> mem
  rt --> llm
  rt --> db
```

## Persistence

| Table | Purpose |
| --- | --- |
| `agent_specs` | Logical agent (latest version pointer) |
| `agent_spec_versions` | Immutable hash-locked spec snapshot |
| `agent_runs_v2` | One row per run |
| `agent_run_steps` | One row per step (LLM / tool / RAG / memory / guardrail) |
| `agent_run_artifacts` | Sidecar artifacts referenced by a run |
| `agent_evaluations` + `agent_eval_metrics` | Eval harness results |
| `agent_annotations` | User/agent annotations for optimisation |

## REST surface

```
GET  /agents/specs                              — list registered specs
GET  /agents/specs/{name}                       — spec detail (full payload)
GET  /agents/specs/{name}/versions              — version history
POST /agents/runs/v2/sync                       — synchronous run
GET  /agents/runs/v2                            — list runs (filter by spec/status)
GET  /agents/runs/v2/{id}                       — full trace incl. steps
POST /agents/runs/v2/{id}/replay                — replay against snapshotted spec
GET  /agents/evaluations                        — list eval reports
```

## Guardrails

`AgentSpec.guardrails` (parsed by `AgentRuntime._guardrail_check`):

- `cost_budget_usd` — hard ceiling per run (raises `GuardrailViolation`).
- `rate_limit_per_minute` — TODO: enforced at the call site.
- `max_calls` — caps the number of LLM round-trips per run.
- `forbidden_terms` — strings that must not appear in the output.
- `require_rationale` — output must include a rationale-style key.
- `min_confidence` — output's `confidence` field must clear this floor.
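To make the checks concrete, here is a minimal sketch of how such guardrails could be evaluated against a finished run. It is not `AgentRuntime._guardrail_check`: the function name and signature are hypothetical, every failed check is assumed to raise `GuardrailViolation`, "rationale-style key" is simplified to a literal `rationale` key, and `rate_limit_per_minute` is left out.

```python
class GuardrailViolation(Exception):
    pass


def check_output(output: dict, guardrails: dict, cost_usd: float, calls: int) -> None:
    """Raise GuardrailViolation on the first failed check; return None if all pass."""
    budget = guardrails.get("cost_budget_usd")
    if budget is not None and cost_usd > budget:
        raise GuardrailViolation(f"cost {cost_usd} exceeds budget {budget}")
    max_calls = guardrails.get("max_calls")
    if max_calls is not None and calls > max_calls:
        raise GuardrailViolation(f"{calls} calls exceeds max_calls {max_calls}")
    text = " ".join(str(v) for v in output.values()).lower()
    for term in guardrails.get("forbidden_terms", []):
        if term.lower() in text:
            raise GuardrailViolation(f"forbidden term in output: {term}")
    if guardrails.get("require_rationale") and "rationale" not in output:
        raise GuardrailViolation("output has no rationale key")
    floor = guardrails.get("min_confidence")
    if floor is not None and output.get("confidence", 0.0) < floor:
        raise GuardrailViolation("confidence below min_confidence")


rails = {"cost_budget_usd": 0.50, "max_calls": 4, "forbidden_terms": ["guaranteed"],
         "require_rationale": True, "min_confidence": 0.6}

# Passes silently: within budget, has a rationale, confidence clears the floor.
check_output({"signal": "long", "rationale": "trend", "confidence": 0.7}, rails, cost_usd=0.1, calls=2)

try:
    check_output({"signal": "guaranteed win", "rationale": "x", "confidence": 0.9}, rails, 0.1, 2)
except GuardrailViolation as exc:
    print(exc)
```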

## Don'ts

- Don't bypass `AgentRuntime.run` for spec-driven agents — telemetry,
  guardrails, cost caps, and `agent_runs_v2` rely on it.
- Don't mutate `agent_spec_versions` rows — they are immutable.
- Don't write a new spec without registering it (decorator or YAML);
  the LangGraph builders look up by name and will skip unknown specs.


<!-- https://alpha-swarm.ai/concepts/agentic/alpha-researcher-agent -->
# Alpha Researcher agent + symbolic alpha DSL
> ```mermaid flowchart LR User[Researcher: intent] --> Agent[AlphaResearcher\\nconfigs/agents/alpha_researcher.yaml] Agent --> RAG[RAG: alpha_factors + backtest_summaries] Agent --> Output[JSON proposal\\...

# Alpha Researcher agent + symbolic alpha DSL

> Self-evolving LLM-driven factor mining wired into AlphaSwarm's
> deployment-consistent execution loop.

## The loop

```mermaid
flowchart LR
    User[Researcher: intent] --> Agent[AlphaResearcher\nconfigs/agents/alpha_researcher.yaml]
    Agent --> RAG[RAG: alpha_factors + backtest_summaries]
    Agent --> Output[JSON proposal\nname / formula / rationale]
    Output --> Sandbox[AST sandbox\nalphaswarm/data/expressions_dsl.py]
    Sandbox --> Factor[FactorNode]
    Factor --> Shim[FactorStrategyShim]
    Shim --> Engine[EventDrivenBacktester]
    Engine --> Metrics[Sharpe / IR / MDD / turnover]
    Metrics --> Reward[score_to_reward]
    Reward -.->|next iteration| Agent
```

## Symbolic DSL vocabulary

The full operator + field whitelist lives in
[`alphaswarm/data/expressions_dsl.py`](../alphaswarm/data/expressions_dsl.py).

**Fields:** `$open`, `$high`, `$low`, `$close`, `$volume`, `$vwap`,
`$returns`.

**Operators (curated):** `Ref`, `Delay`, `Mean`, `Std`, `Var`,
`Skew`, `Kurt`, `Sum`, `Min`, `Max`, `Med`, `Mad`, `Quantile`,
`Count`, `IdxMax`, `IdxMin`, `EMA`, `WMA`, `Slope`, `Rsquare`,
`Resi`, `Corr`, `Cov`, `Greater`, `Less`, `Gt`, `Ge`, `Lt`, `Le`,
`Eq`, `Ne`, `And`, `Or`, `Not`, `Mask`, `If`, `Add`, `Sub`, `Mul`,
`Div`, `Abs`, `Sign`, `Log`, `Rank`, `Clip`.

**Literals:** integers, floats, bools, `None`, and short
strings.

Anything else (imports, attribute access, subscripts, lambdas,
comprehensions, walrus, await, yield) raises
`SymbolicAlphaError` at compile time.
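A whitelist sandbox of this kind is typically built on Python's `ast` module: parse the formula, then reject any node type, call target, or name that is not explicitly allowed. The sketch below shows that technique only. It is not the code in `expressions_dsl.py`: the operator set is a small subset, the `$field` to `FIELD_field` rewrite is an assumption, and it validates without compiling to a `FactorNode`.

```python
import ast
import re

FIELDS = {"open", "high", "low", "close", "volume", "vwap", "returns"}
OPERATORS = {"Ref", "Mean", "Std", "EMA", "Rank", "Sign", "Abs", "Log", "Corr"}  # subset only
ALLOWED_NODES = (
    ast.Expression, ast.Call, ast.Name, ast.Load, ast.Constant,
    ast.BinOp, ast.UnaryOp, ast.Add, ast.Sub, ast.Mult, ast.Div, ast.USub,
)


class SymbolicAlphaError(ValueError):
    """Local stand-in for the DSL's error type."""


def validate(formula: str) -> ast.Expression:
    """Parse a formula and reject any syntax outside the whitelist."""
    def field_to_name(match):
        if match.group(1) not in FIELDS:
            raise SymbolicAlphaError(f"unknown field ${match.group(1)}")
        return f"FIELD_{match.group(1)}"

    # `$close` is not valid Python, so rewrite fields to plain identifiers first.
    source = re.sub(r"\$(\w+)", field_to_name, formula)
    try:
        tree = ast.parse(source, mode="eval")
    except SyntaxError as exc:
        raise SymbolicAlphaError(str(exc)) from exc

    call_targets = {id(n.func) for n in ast.walk(tree) if isinstance(n, ast.Call)}
    for node in ast.walk(tree):
        if not isinstance(node, ALLOWED_NODES):
            raise SymbolicAlphaError(f"disallowed syntax: {type(node).__name__}")
        if isinstance(node, ast.Call):
            if not isinstance(node.func, ast.Name) or node.func.id not in OPERATORS:
                raise SymbolicAlphaError("call to non-whitelisted operator")
            if node.keywords:
                raise SymbolicAlphaError("keyword arguments are not allowed")
        elif isinstance(node, ast.Name) and id(node) not in call_targets:
            if not (node.id.startswith("FIELD_") and node.id[6:] in FIELDS):
                raise SymbolicAlphaError(f"unknown name: {node.id}")
    return tree


validate("Sign(EMA($close, 12) - EMA($close, 26)) * Rank(Std($returns, 20))")
try:
    validate("__import__('os').system('ls')")
except SymbolicAlphaError as exc:
    print("rejected:", exc)
```

Allow-listing node types, rather than block-listing dangerous ones, means new Python syntax is rejected by default.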

## Example proposal

```json
{
  "name": "ema_crossover_pct",
  "formula": "Sign(EMA($close, 12) - EMA($close, 26)) * Rank(Std($returns, 20))",
  "rationale": "Combines MACD-style cross with vol-rank to favour high-vol trends.",
  "expected_horizon_bars": 5,
  "expected_direction": "either"
}
```

## Compile + evaluate

```python
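from alphaswarm_agents.quant import AlphaResearcher

# Ask the agent for a factor proposal (name / formula / rationale).
researcher = AlphaResearcher(agent_spec_name="alpha_researcher")
proposal = researcher.propose(inputs={"intent": "find a short-horizon mean-reversion factor"})

# `bars` must already hold the price bars to backtest against (loading not shown).
result = researcher.evaluate(proposal, bars=bars)
print(result.metrics, result.reward)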
```

## Engine-agnostic `FactorNode`

The compiled
[`FactorNode`](../alphaswarm/data/expressions_dsl.py) feeds:

- **Event-driven engine** via `.compute(bars)` returning a
  `pd.Series`.
- **vbt-pro orders mode** via `.compute_panel(bars_panel)` returning
  a wide DataFrame.
- **Backtrader (optional)** via `.as_backtrader_indicator()`
  returning a dynamic `bt.Indicator` subclass.

## Companion agent: StrategyExecutor

The
[`StrategyExecutor`](../alphaswarm/agents/quant/strategy_executor.py)
agent decides WHICH RL experiment to train / paper-trade / promote
based on the RAG `rl_trajectory_summaries` corpus and the live
broker state. Routes lifecycle actions through
[`RLRuntime`](../alphaswarm/rl/runtime.py) (rule 16).

## See also

- [alphaswarm_docs/agentic-rl.md](../../concepts/rl/agentic-rl.md)
- [Hard rule 39 in AGENTS.md](https://github.com/julianwileymac/alphaswarm/blob/main/AGENTS.md)
- [.cursor/rules/symbolic-alphas.mdc](../.cursor/rules/symbolic-alphas.mdc)


<!-- https://alpha-swarm.ai/concepts/agentic/bots -->
# Bots
> A **Bot** aggregates everything required to research, evaluate, and deploy an algorithmic trading automation:

# Bots

> The smallest self-contained, deployable unit on AlphaSwarm.
>
> **QuantBot Platform v0.2.0** layered an enterprise-grade Kubernetes
> control plane on top of the legacy `BotRuntime` path without breaking
> any existing bots. See the new ADRs:
>
> - [ADR 006 — QuantBot Operator Pattern](../../architecture/decisions/006-quantbot-operator-pattern.md)
> - [ADR 007 — QuantBot Latency Classes](../../architecture/decisions/007-quantbot-latency-classes.md)
> - [ADR 008 — Bot Event Sourcing](../../architecture/decisions/008-quantbot-event-sourcing.md)
> - [ADR 009 — RTS 6 / SEC 15c3-5 Conformance](../../architecture/decisions/009-quantbot-rts6-conformance.md)
> - [ADR 010 — Canary PnL Gates](../../architecture/decisions/010-quantbot-canary-pnl-gates.md)
>
> Runbooks:
>
> - [HFT Node Onboarding](../../how-to/operations/hft-node-onboarding.md)
> - [Bot Canary Rollout Playbook](../../how-to/operations/bot-canary-rollout-playbook.md)
> - [RTS 6 Validation Report Generation](../../how-to/operations/rts6-validation-report-generation.md)
> - [Kill Switch Incident Response](../../how-to/operations/kill-switch-incident-response.md)

A **Bot** aggregates everything required to research, evaluate, and
deploy an algorithmic trading automation:

- a **trading universe** (symbol list or registry-driven model),
- a **data ingestion pipeline** preset,
- a **strategy graph** (alpha → portfolio → risk → execution, via
  `FrameworkAlgorithm`),
- a **backtest engine** (vbt-pro / event-driven / vectorbt / fallback),
- optional **ML model deployments** (`ModelDeployment` ids),
- optional **spec-driven agents** for supervision / per-bar consult /
  research chat,
- a **hierarchical RAG** access plan,
- **evaluation metrics** with thresholds,
- **risk caps**, and
- a **deployment target** (paper session / Kubernetes / backtest-only).

Bots live under a [`Project`](../../concepts/platform/erd.md) (`ProjectScopedMixin`). Within a
project, bots are uniquely identified by their slug.

## Composition

```mermaid
flowchart LR
  Project --> Bot
  Bot --> BotSpec[BotSpec]
  BotSpec --> Universe[universe + DataPipelineRef]
  BotSpec --> StrategyCfg["strategy: build_from_config"]
  BotSpec --> EngineCfg["backtest.engine"]
  BotSpec --> MLDeployments["ml_models[]"]
  BotSpec --> AgentSpecs["agents[] (AgentSpec names)"]
  BotSpec --> RAGPlan["rag[] (RAGRef)"]
  BotSpec --> Metrics["metrics[] + risk"]
  BotSpec --> DeployTarget["deployment"]

  BotRuntime --> Backtest["run_backtest_from_config"]
  BotRuntime --> Paper["build_session_from_config"]
  BotRuntime --> AgentRuntime
  AgentRuntime --> RAG["HierarchicalRAG"]
  BotRuntime --> Deploy["DeploymentDispatcher"]
  Deploy --> Paper
  Deploy --> K8s["KubernetesTarget"]
```

`Bot` does **not** re-implement strategy / engine / agent / RAG logic.
It composes references and dispatches to existing primitives so all
hard rules from [AGENTS.md](https://github.com/julianwileymac/alphaswarm/blob/main/AGENTS.md) (`router_complete`,
`iceberg_catalog`, `AgentRuntime`, `HierarchicalRAG`, `emit/emit_done`)
remain the only paths into those subsystems.

## Subclasses

| Subclass | Required spec slots | Methods | Use case |
| --- | --- | --- | --- |
| `TradingBot` | `strategy`, `backtest` | `backtest()`, `paper()`, `deploy()`, `consult_agents()` | Live / paper / backtest trading |
| `ResearchBot` | `agents` | `chat()`, optional `backtest()` (only if `strategy` set) | Research agent + chat surface |

`TradingBot.chat()` raises `BotMethodNotSupported` — pair the bot with
a companion `ResearchBot`. `ResearchBot.paper()` raises
`BotMethodNotSupported` — clone the spec into a `TradingBot` first.
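The split between the two subclasses can be pictured as a base class whose optional methods raise by default, with each subclass overriding only what it supports. A minimal sketch, assuming a hypothetical `BaseBot` and dict-based specs; the method bodies are placeholders, not the real lifecycle.

```python
class BotMethodNotSupported(Exception):
    pass


class BaseBot:
    """Hypothetical base: every optional method is unsupported until overridden."""

    def __init__(self, spec: dict) -> None:
        self.spec = spec

    def chat(self, prompt: str) -> str:
        raise BotMethodNotSupported(f"{type(self).__name__} does not support chat()")

    def paper(self) -> str:
        raise BotMethodNotSupported(f"{type(self).__name__} does not support paper()")


class TradingBot(BaseBot):
    def paper(self) -> str:
        return f"paper session for {self.spec['slug']}"  # placeholder body


class ResearchBot(BaseBot):
    def chat(self, prompt: str) -> str:
        return f"[{self.spec['slug']}] would route {prompt!r} to its agents"  # placeholder body


trader = TradingBot({"slug": "dual-ma-aapl"})
print(trader.paper())
try:
    trader.chat("why did we exit?")
except BotMethodNotSupported as exc:
    print(exc)
```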

## Spec example

```yaml
name: Dual MA AAPL
slug: dual-ma-aapl
kind: trading
description: Dual MA crossover on AAPL/MSFT.

universe:
  symbols: [AAPL.NASDAQ, MSFT.NASDAQ]

data_pipeline:
  preset: ohlcv-daily
  source: alpaca

strategy:
  class: FrameworkAlgorithm
  module_path: alphaswarm.strategies.framework
  kwargs:
    universe_model:
      class: StaticUniverse
      module_path: alphaswarm.strategies.universes
      kwargs: { symbols: [AAPL.NASDAQ, MSFT.NASDAQ] }
    alpha_model:
      class: DualMACrossoverAlpha
      module_path: alphaswarm.strategies.dual_ma
      kwargs: { fast: 10, slow: 50 }
    portfolio_model: { class: EqualWeightPortfolio }
    risk_model: { class: NoOpRiskModel }
    execution_model: { class: ImmediateExecutionModel }

backtest:
  engine: vbt-pro:signals
  kwargs: { initial_cash: 100000.0 }

agents:
  - spec_name: research.quant_vbtpro
    role: supervisor

rag:
  - levels: [l1, l2]
    orders: [first, second]
    corpora: [bars_daily, performance]

metrics:
  - { name: sharpe, threshold: 1.0, direction: max }
  - { name: max_drawdown, threshold: 0.25, direction: min }

risk:
  max_position_pct: 0.25
  max_daily_loss_pct: 0.02

deployment:
  target: paper_session
  brokerage: simulated
  feed: deterministic_replay
  initial_cash: 100000.0
  dry_run: true
```

Drop the file under [alphaswarm_bots/templates/trading/](../alphaswarm_bots/templates/trading/)
or [alphaswarm_bots/templates/research/](../alphaswarm_bots/templates/research/) — the registry
lazy-scans both directories on first lookup.
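The `metrics` entries in the spec above pair a threshold with a direction. One plausible reading, assumed here and not confirmed by the source, is that `max` marks a higher-is-better metric whose threshold is a floor and `min` marks a lower-is-better metric whose threshold is a ceiling. Under that assumption a gate check could look like this; the helper name and the observed values are made up for illustration.

```python
def failed_metrics(gates: list, observed: dict) -> list:
    """Return the names of metrics that miss their threshold."""
    failures = []
    for gate in gates:
        value = observed[gate["name"]]
        if gate["direction"] == "max":    # assumed: higher is better, threshold is a floor
            ok = value >= gate["threshold"]
        elif gate["direction"] == "min":  # assumed: lower is better, threshold is a ceiling
            ok = value <= gate["threshold"]
        else:
            raise ValueError(f"unknown direction: {gate['direction']}")
        if not ok:
            failures.append(gate["name"])
    return failures


gates = [
    {"name": "sharpe", "threshold": 1.0, "direction": "max"},
    {"name": "max_drawdown", "threshold": 0.25, "direction": "min"},
]
print(failed_metrics(gates, {"sharpe": 1.4, "max_drawdown": 0.31}))
```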

## Persistence

Three new tables, all `ProjectScopedMixin` (Alembic migration
`0020_bots`):

- **`bots`** — logical row with the latest active version of a named
  spec inside a project. Unique on `(project_id, slug)`.
- **`bot_versions`** — immutable, hash-locked snapshot of every
  `BotSpec` change. Unique on `(bot_id, spec_hash)` and `(bot_id,
  version)`.
- **`bot_deployments`** — one row per backtest / paper / chat / k8s
  invocation. References `version_id` so a run can be replayed against
  the exact spec that produced it.

The runtime mirrors the proven `AgentSpec` / `AgentSpecVersion` /
`AgentRunV2` triad from
[alphaswarm/agents/runtime.py](../alphaswarm/agents/runtime.py).
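
A minimal sketch of the hash-locked versioning idea described above, assuming a SHA-256 digest over canonical JSON. The `spec_hash` helper and the in-memory `VersionStore` are hypothetical stand-ins for illustration, not the code in `alphaswarm_bots/registry.py`; the real hashing scheme may differ.

```python
import hashlib
import json
from dataclasses import dataclass, field


def spec_hash(spec: dict) -> str:
    """Hash a spec canonically so that key order does not change the digest."""
    canonical = json.dumps(spec, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


@dataclass
class VersionStore:
    """In-memory stand-in for the bot_versions table (hypothetical)."""

    rows: list = field(default_factory=list)  # tuples of (bot_id, version, spec_hash)

    def snapshot(self, bot_id: str, spec: dict) -> int:
        digest = spec_hash(spec)
        for row_bot, version, row_hash in self.rows:
            if row_bot == bot_id and row_hash == digest:
                return version  # unchanged spec: reuse the existing version
        # Changed spec: append a new immutable row with the next version number.
        version = 1 + sum(1 for row in self.rows if row[0] == bot_id)
        self.rows.append((bot_id, version, digest))
        return version
```

Under these assumptions, saving a spec twice with reordered keys yields one row, which is the behaviour the `(bot_id, spec_hash)` uniqueness constraint implies.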

## Lifecycle

### Backtest

```mermaid
sequenceDiagram
  participant UI
  participant API as /bots/{id}/backtest
  participant Celery as run_bot_backtest
  participant Runtime as BotRuntime
  participant Engine as run_backtest_from_config

  UI->>API: POST /bots/{id}/backtest
  API->>Celery: run_bot_backtest.delay(bot_id)
  Celery->>Runtime: BotRuntime(bot, task_id).backtest()
  Runtime->>Runtime: persist_spec -> bot_versions
  Runtime->>Runtime: open bot_deployments row
  Runtime->>Engine: run_backtest_from_config(_derive_backtest_cfg())
  Engine-->>Runtime: BacktestResult
  Runtime->>Runtime: finalise bot_deployments + emit_done
  Runtime-->>UI: stream result via /chat/stream/{task_id}
```

### Paper

`POST /bots/{id}/paper/start` dispatches `run_bot_paper`, which builds
a `PaperTradingSession` via the existing
[`build_session_from_config`](../alphaswarm/trading/runner.py) and awaits its
async `run()`. Stop with `POST /bots/{id}/paper/stop/{task_id}` (reuses
[`publish_stop_signal`](../alphaswarm/tasks/paper_tasks.py)).

### Chat (ResearchBot)

`POST /bots/{id}/chat` dispatches `chat_research_bot`, which iterates
the bot's `agents[]` and runs each through
[`AgentRuntime`](../alphaswarm/agents/runtime.py). RAG retrieval, memory, and
guardrails behave identically to direct
`POST /agents/runs/v2/sync` calls — the bot is just a curator of agent
specs.

### Deploy

`POST /bots/{id}/deploy` dispatches `deploy_bot`, which delegates to
the configured target via
[`DeploymentDispatcher`](../alphaswarm_bots/deploy.py):

| Target | Behaviour |
| --- | --- |
| `paper_session` | Launches a paper session in the Celery worker. |
| `backtest_only` | Runs a single backtest + persists result on the deployment row. |
| `kubernetes` | Renders `Deployment` + `ConfigMap` YAML to `alphaswarm_platform/deploy/k8s/bots/.yaml`. Optionally `kubectl apply`s when `apply=True` and `kubectl` is on PATH. |

The Kubernetes manifest's pod entrypoint is
`python -m alphaswarm_bots.cli run ` (compat: `python -m alphaswarm.bots.cli`; see
[alphaswarm_bots/cli.py](../alphaswarm_bots/cli.py)).
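
To make the `kubernetes` target concrete, here is a hedged sketch of rendering a `ConfigMap` plus `Deployment` pair as plain dictionaries. The function name, the resource naming scheme, the mount path, and the image tag are all assumptions for illustration; only the pod entrypoint comes from the text above. The real renderer in `alphaswarm_bots/deploy.py` may look quite different.

```python
import json


def render_bot_manifests(
    slug: str,
    spec_yaml: str,
    image: str = "example/alphaswarm-bots:latest",  # hypothetical image name
) -> list[dict]:
    """Build a ConfigMap carrying the spec and a Deployment that runs the bot."""
    name = f"bot-{slug}"  # hypothetical naming scheme
    config_map = {
        "apiVersion": "v1",
        "kind": "ConfigMap",
        "metadata": {"name": name},
        "data": {"spec.yaml": spec_yaml},
    }
    deployment = {
        "apiVersion": "apps/v1",
        "kind": "Deployment",
        "metadata": {"name": name},
        "spec": {
            "replicas": 1,
            "selector": {"matchLabels": {"app": name}},
            "template": {
                "metadata": {"labels": {"app": name}},
                "spec": {
                    "containers": [{
                        "name": "bot",
                        "image": image,
                        # Entrypoint documented above: the CLI `run` subcommand.
                        "command": ["python", "-m", "alphaswarm_bots.cli", "run", slug],
                        "volumeMounts": [{"name": "spec", "mountPath": "/etc/bot"}],
                    }],
                    "volumes": [{"name": "spec", "configMap": {"name": name}}],
                },
            },
        },
    }
    return [config_map, deployment]


if __name__ == "__main__":
    # JSON is accepted wherever Kubernetes expects YAML, so no YAML library is needed here.
    for manifest in render_bot_manifests("dual-ma-aapl", "name: Dual MA AAPL\n"):
        print(json.dumps(manifest, indent=2))
```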

## REST surface

All endpoints under `/bots`:

| Method | Path | Purpose |
| --- | --- | --- |
| `GET` | `/bots` | List (filter by `project_id`, `kind`, `status_filter`) |
| `POST` | `/bots` | Create (body: `{spec, project_id?}`) |
| `GET` | `/bots/{ref}` | Detail (`{ref}` = id or slug) |
| `PUT` | `/bots/{ref}` | Update (auto-snapshots a new version on change) |
| `DELETE` | `/bots/{ref}` | Delete |
| `GET` | `/bots/{ref}/versions` | List `bot_versions` |
| `GET` | `/bots/{ref}/deployments` | List `bot_deployments` |
| `POST` | `/bots/{ref}/backtest` | Dispatch `run_bot_backtest` (returns `TaskAccepted`) |
| `POST` | `/bots/{ref}/paper/start` | Dispatch `run_bot_paper` |
| `POST` | `/bots/{ref}/paper/stop/{task_id}` | Stop in-flight paper session |
| `POST` | `/bots/{ref}/deploy` | Dispatch `deploy_bot` |
| `POST` | `/bots/{ref}/chat` | Dispatch `chat_research_bot` (research only) |

Async lifecycle endpoints return
[`TaskAccepted`](../alphaswarm/api/schemas.py) with `stream_url` pointing at
the existing `/chat/stream/{task_id}` WebSocket — no new transport.
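
A small client-side sketch of the dispatch step, using only the standard library. The base URL is a placeholder, and the `task_id` field name in the response is an assumption; only `stream_url` is named in this document. The request is built but not sent, so the snippet runs without a live API.

```python
import json
import urllib.request


def build_backtest_request(base_url: str, ref: str) -> urllib.request.Request:
    """Prepare POST /bots/{ref}/backtest; `ref` is a bot id or slug."""
    url = f"{base_url.rstrip('/')}/bots/{ref}/backtest"
    return urllib.request.Request(
        url,
        data=b"{}",
        method="POST",
        headers={"Content-Type": "application/json"},
    )


def parse_task_accepted(body: bytes) -> tuple[str, str]:
    """Extract (task_id, stream_url) from a TaskAccepted-style JSON body.

    The `task_id` key is assumed for illustration.
    """
    payload = json.loads(body)
    return payload["task_id"], payload["stream_url"]


if __name__ == "__main__":
    request = build_backtest_request("http://localhost:8000", "dual-ma-aapl")
    print(request.get_method(), request.full_url)
```

Sending the request would be a call to `urllib.request.urlopen(request)`; following the stream afterwards needs a WebSocket client, which the standard library does not provide.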

## CLI

Use `python -m alphaswarm.bots.cli` for shell-level operations:

```bash
python -m alphaswarm.bots.cli list
python -m alphaswarm.bots.cli show dual-ma-aapl --yaml
python -m alphaswarm.bots.cli backtest dual-ma-aapl
python -m alphaswarm.bots.cli paper dual-ma-aapl --run-name 2026-05-03
python -m alphaswarm.bots.cli chat equity-research-bot "What is AAPL's edge?"
python -m alphaswarm.bots.cli deploy dual-ma-aapl --target kubernetes
python -m alphaswarm.bots.cli run dual-ma-aapl   # pod entrypoint
```

## UI

The bot builder lives at
[`/bots`](../webui/app/(shell)/bots/page.tsx) and reuses the existing
`@xyflow/react` canvas via
[`WorkflowEditor`](../webui/components/flow/WorkflowEditor.tsx). The
palette
([`webui/components/bots/botPalette.ts`](../webui/components/bots/botPalette.ts))
exposes ten kinds — Universe, DataPipeline, Strategy, Engine, MLModel,
Agent, RAG, Metric, Risk, Deploy. Each node maps 1:1 to a `BotSpec`
slot via
[`serializeBotSpec`](../webui/components/bots/botSerializer.ts); the
inverse `deserializeBotSpec` lets the builder edit a saved bot.

The detail page ships five tabs:

- **Overview** — primary action buttons (Backtest / Start paper / Deploy / Render K8s manifest).
- **Builder** — the node-and-wire canvas.
- **Deployments** — every `bot_deployments` row.
- **Versions** — every `bot_versions` row.
- **Chat** — only for `ResearchBot` kind; embeds
  [`ResearchBotChat`](../webui/components/bots/ResearchBotChat.tsx)
  driven by `useChatStream`.

## Hard rules

- Bot agent calls go through
  [`AgentRuntime`](../alphaswarm/agents/runtime.py); `BotRuntime` never calls
  `router_complete` directly.
- Bot RAG access goes through
  [`HierarchicalRAG`](../alphaswarm/rag/hierarchy.py) via the agent's `rag:`
  clause.
- Bot data loading uses
  [`IngestionPipeline.run_path`](../alphaswarm/data/pipelines/runner.py) and
  `iceberg_catalog.append_arrow`; never raw PyIceberg.
- Bot progress emits go through
  [alphaswarm/tasks/_progress.py](../alphaswarm/tasks/_progress.py) preserving the
  `{task_id, stage, message, timestamp, **extras}` payload shape.
- Strategies / engines / models in `BotSpec` use the existing
  `{class, module_path, kwargs}` factory and `@register`.
- New Alembic migrations are additive only; never edit a shipped one.
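
The `{class, module_path, kwargs}` convention from the rules above can be sketched with `importlib`. This is an illustrative resolver under stated assumptions, not the project's factory: `build_component` is a hypothetical name, and entries that omit `module_path` (presumably resolved through `@register` in the real code) are passed through untouched here.

```python
import importlib
from typing import Any


def _is_component(value: Any) -> bool:
    """A nested config counts as a component only if it names both class and module."""
    return isinstance(value, dict) and "class" in value and "module_path" in value


def build_component(cfg: dict[str, Any]) -> Any:
    """Instantiate a {class, module_path, kwargs} config, recursing into nested ones."""
    module = importlib.import_module(cfg["module_path"])
    cls = getattr(module, cfg["class"])
    kwargs = {
        key: build_component(value) if _is_component(value) else value
        for key, value in cfg.get("kwargs", {}).items()
    }
    return cls(**kwargs)


if __name__ == "__main__":
    # Standard-library classes stand in for strategy classes so the sketch runs anywhere.
    half = build_component({
        "class": "Fraction",
        "module_path": "fractions",
        "kwargs": {"numerator": 1, "denominator": 2},
    })
    print(half)
```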

## Where things live

| Need | Path |
| --- | --- |
| BotSpec | [alphaswarm_bots/spec.py](../alphaswarm_bots/spec.py) |
| BaseBot ABC | [alphaswarm_bots/base.py](../alphaswarm_bots/base.py) |
| TradingBot | [alphaswarm_bots/trading_bot.py](../alphaswarm_bots/trading_bot.py) |
| ResearchBot | [alphaswarm_bots/research_bot.py](../alphaswarm_bots/research_bot.py) |
| BotRuntime | [alphaswarm_bots/runtime.py](../alphaswarm_bots/runtime.py) |
| Registry / persist_spec | [alphaswarm_bots/registry.py](../alphaswarm_bots/registry.py) |
| Deploy targets | [alphaswarm_bots/deploy.py](../alphaswarm_bots/deploy.py) |
| CLI | [alphaswarm_bots/cli.py](../alphaswarm_bots/cli.py) |
| ORM models | [alphaswarm/persistence/models_bots.py](../alphaswarm/persistence/models_bots.py) |
| Alembic migration | [alembic/versions/0020_bots.py](../alembic/versions/0020_bots.py) |
| Celery tasks | [alphaswarm/tasks/bot_tasks.py](../alphaswarm/tasks/bot_tasks.py) |
| REST routes | [alphaswarm/api/routes/bots.py](../alphaswarm/api/routes/bots.py) |
| Example specs | [alphaswarm_bots/templates/](../alphaswarm_bots/templates/) |
| UI builder | [webui/components/bots/](../webui/components/bots/) |
| Argo template | `alphaswarm_platform/deployments/kubernetes/mlops/bots/workflowtemplate-bot-deploy.yaml` |


<!-- https://alpha-swarm.ai/concepts/agentic/multi-agent-patterns -->
# Multi-agent patterns in AlphaSwarm
> Read this doc when you need to:

# Multi-agent patterns in AlphaSwarm

> Catalogue of multi-agent topologies, mapped to existing code in
> [alphaswarm/agents/graph/](../alphaswarm/agents/graph/). Use this when adding a
> new agent crew, deciding between sequential and parallel
> orchestration, or deciding when a debate / consensus pattern is
> warranted.
>
> Doc map: [alphaswarm_docs/index.md](../../intro/index.md) ·
> Underlying primitives: [agents.md](../../concepts/agentic/agents.md) ·
> Spec contract: [agentic-development.md](../../concepts/agentic/agentic-development.md) ·
> ADLC + security: [agentic-development.md#3-adlc-security-manifesto](../../concepts/agentic/agentic-development.md#3-adlc-security-manifesto).

## When to read this doc

Read this doc when you need to:

- Add a new multi-step agent crew that goes beyond a single
  `AgentSpec` invocation.
- Decide whether a debate / dialectical pattern is appropriate for a
  reasoning task.
- Wire a new entry-point in the LangGraph builder.
- Understand how the existing crews (research, trader, analysis)
  compose under the hood.

This doc does **not** replace [agents.md](../../concepts/agentic/agents.md) — that's the
primary reference for `AgentSpec` and `AgentRuntime`. This doc only
covers **how multiple specs are composed**.

## The five canonical patterns

| Pattern | When to use | AlphaSwarm entry-point |
| --- | --- | --- |
| Sequential | Deterministic linear pipeline | `build_research_graph` / `build_trader_graph` / `build_full_pipeline_graph` in [alphaswarm/agents/graph/builder.py](../alphaswarm/agents/graph/builder.py) (linear edges) |
| Parallel | Independent multi-source research with synthesis | parallel research-team nodes in [alphaswarm/agents/graph/builder.py](../alphaswarm/agents/graph/builder.py) |
| Debate / Dialectical | Adversarial analysis (Bull / Bear, advocate / critic) | [alphaswarm/agents/graph/dialectical.py](../alphaswarm/agents/graph/dialectical.py) → `build_dialectical_debate_graph` (Bull / Bear / Portfolio-Manager) |
| Coordinator / Router | Hierarchical delegation | top-level orchestrator in [alphaswarm/agents/graph/builder.py](../alphaswarm/agents/graph/builder.py) (`build_full_pipeline_graph` plays this role today) |
| ReAct (loop-with-observation) | Open-ended forecasting requiring iterative observe → act | LangGraph state loop with conditional edges via [alphaswarm/agents/graph/conditions.py](../alphaswarm/agents/graph/conditions.py) (`should_continue_debate`, `should_continue_risk`) |

Each pattern below has the same three sections: when to use, the
shape it takes in AlphaSwarm, and a "Don't" list.

---

## 1. Sequential

```mermaid
flowchart LR
    Start --> A[Step 1] --> B[Step 2] --> C[Step 3] --> Final
```

### When to use

- Deterministic, well-understood pipelines where each step's output
  is the input to the next.
- The default for any flow that doesn't have a strong reason to
  branch.
- Good for: ingest → normalise → enrich → emit; research →
  selection → trader → analysis (the canonical pipeline).

### AlphaSwarm shape

- [`build_research_graph`](../alphaswarm/agents/graph/builder.py) — research
  → equity → universe.
- [`build_trader_graph`](../alphaswarm/agents/graph/builder.py) — trader →
  analysis run.
- [`build_full_pipeline_graph`](../alphaswarm/agents/graph/builder.py) —
  research → selection → trader → analysis (end-to-end agentic
  loop).
- State carried via `AgentState` (TypedDict) declared in
  [alphaswarm/agents/graph/state.py](../alphaswarm/agents/graph/state.py).
- Falls back to
  [`SequentialGraph`](../alphaswarm/agents/graph/builder.py) when LangGraph
  isn't installed — same audit trail, no conditional routing.
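
As a rough illustration of the fallback shape, the sketch below runs named nodes in order over a `TypedDict` state and records an audit trail. `PipelineState`, `run_sequential`, and the two stub nodes are hypothetical; the real `SequentialGraph` and `AgentState` carry more fields and call `AgentRuntime` per node.

```python
from typing import Callable, TypedDict


class PipelineState(TypedDict, total=False):
    symbols: list[str]
    selected: list[str]
    audit: list[str]


Node = Callable[[PipelineState], PipelineState]


def run_sequential(nodes: list[tuple[str, Node]], state: PipelineState) -> PipelineState:
    """Run nodes strictly in order; each node returns a partial state update."""
    for name, node in nodes:
        state = {**state, **node(state)}
        # Record which node ran so the audit trail matches the execution order.
        state["audit"] = [*state.get("audit", []), name]
    return state


def research(state: PipelineState) -> PipelineState:
    return {"symbols": ["AAPL", "MSFT"]}  # stub output


def selection(state: PipelineState) -> PipelineState:
    return {"selected": state["symbols"][:1]}  # stub: keep the first candidate


if __name__ == "__main__":
    print(run_sequential([("research", research), ("selection", selection)], {}))
```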

### Don't

- Don't bypass the runtime per step. Each node calls
  `AgentRuntime.run(...)` so cost caps + telemetry + immutable
  versions are recorded.
- Don't widen the `AgentState` TypedDict for one-off keys — extend
  via the canonical fields documented in
  [alphaswarm/agents/graph/state.py](../alphaswarm/agents/graph/state.py) so
  conditional predicates keep working.

---

## 2. Parallel (research team / fan-out + synthesis)

```mermaid
flowchart TD
    Start --> Coordinator
    Coordinator --> A[Source A]
    Coordinator --> B[Source B]
    Coordinator --> C[Source C]
    A --> Synth
    B --> Synth
    C --> Synth
    Synth --> Final
```

### When to use

- Multiple independent sources / analyses that can run in parallel
  and then be synthesised.
- Examples: fundamental + technical + macro + sentiment running
  concurrently to produce a unified market view; multi-source
  regulatory ingest.
- Throughput-bound: parallel makes sense when each branch is
  expensive and the branches don't depend on each other.

### AlphaSwarm shape

- LangGraph state graphs run independent branches concurrently when
  the edges declare them as such.
- The synthesis node consumes the merged state and emits a
  combined verdict.
- For the research-team subgraph in
  [`build_full_pipeline_graph`](../alphaswarm/agents/graph/builder.py), the
  individual research specs (`research.equity`, `research.news_miner`,
  `research.universe`, etc.) feed a downstream selector / trader.

### Don't

- Don't parallelise tool calls that mutate shared state — the
  catalog upserts in
  [`active_metadata`](../alphaswarm/data/catalog/active_metadata.py) are
  serialised on purpose.
- Don't fan out to N agents that all consult the same RAG corpus
  with identical queries — that's a cache miss N times. Cache once
  upstream.
- Don't rely on parallel order. Synthesis must be order-independent
  (associative + commutative over the result set).
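
The fan-out and order-independent synthesis described in this section can be sketched with `asyncio`. Branch names, delays, and the placeholder scoring are invented for illustration; real branches would each go through `AgentRuntime`.

```python
import asyncio


async def run_branch(name: str, delay: float) -> tuple[str, float]:
    """Stand-in for one independent research branch."""
    await asyncio.sleep(delay)
    return name, len(name) / 10  # placeholder score


def synthesise(results: list[tuple[str, float]]) -> dict:
    """Merge branch results without depending on arrival order."""
    scores = dict(results)
    ordered = sorted(scores.values())  # fixed summation order keeps the float result stable
    return {"sources": sorted(scores), "mean_score": sum(ordered) / len(ordered)}


async def fan_out() -> dict:
    results = await asyncio.gather(
        run_branch("fundamental", 0.02),
        run_branch("technical", 0.01),
        run_branch("macro", 0.0),
    )
    return synthesise(list(results))


if __name__ == "__main__":
    print(asyncio.run(fan_out()))
```

Sorting inside `synthesise` is one simple way to make the merge insensitive to which branch finishes first.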

---

## 3. Debate / Dialectical

```mermaid
flowchart TD
    Start --> Subject[Subject under analysis]
    Subject --> Bull[Bull advocate]
    Subject --> Bear[Bear advocate]
    Bull --> Loop{Continue debate?}
    Bear --> Loop
    Loop -->|yes| Bull
    Loop -->|no| PM[Portfolio Manager / Judge]
    PM --> Verdict
```

### When to use

- Open-ended judgement where adversarial reasoning surfaces
  blind-spots (e.g. should we take this position? does this strategy
  generalise out-of-sample?).
- Whenever a single-agent verdict would feel "too convenient" — the
  Bull / Bear pattern forces both arguments to be made and judged.
- The pattern is inspired by TradingAgents; AlphaSwarm keeps the
  structure but routes every turn through the spec-driven
  `AgentRuntime` so each debate turn is logged.

### AlphaSwarm shape

- [alphaswarm/agents/graph/dialectical.py](../alphaswarm/agents/graph/dialectical.py)
  contains `build_dialectical_debate_graph` (Bull / Bear /
  Portfolio-Manager).
- Three agent specs ship under [configs/agents/](../configs/agents/):
  - `research.bull_researcher`
  - `research.bear_researcher`
  - `research.portfolio_manager`
- The portfolio manager synthesises both transcripts into a single
  `debate_verdict` with `action ∈ {buy, hold, sell, mutate_params}`.
- The Phase-4 iterative optimisation loop in
  [`build_research_debate_graph`](../alphaswarm/agents/graph/builder.py)
  uses `should_continue_debate` from
  [conditions.py](../alphaswarm/agents/graph/conditions.py) to bound
  rounds (default `max_rounds=2`).
- State extension: `RiskDebateState` and `ResearchDebateState`
  (TypedDicts in
  [state.py](../alphaswarm/agents/graph/state.py)) hold the debate
  transcript across turns.
- All decisions land in
  [decision_log.py](../alphaswarm/agents/graph/decision_log.py) for
  auditability — `append_pending_decision` /
  `resolve_pending_decisions`.
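
A minimal sketch of a bounded debate loop, assuming a round counter in the state and a predicate that takes `max_rounds`. `continue_debate`, `DebateState`, and the stubbed advocate and judge logic are hypothetical; the signature of the real `should_continue_debate` is not shown in this document.

```python
from typing import TypedDict


class DebateState(TypedDict):
    rounds: int
    bull: list[str]
    bear: list[str]


def continue_debate(state: DebateState, max_rounds: int = 2) -> bool:
    """Local stand-in for a round-bounding predicate."""
    return state["rounds"] < max_rounds


def run_debate(subject: str, max_rounds: int = 2) -> dict:
    state: DebateState = {"rounds": 0, "bull": [], "bear": []}
    while continue_debate(state, max_rounds):
        turn = state["rounds"] + 1
        state["bull"].append(f"bull case for {subject}, round {turn}")
        state["bear"].append(f"bear case for {subject}, round {turn}")
        state["rounds"] = turn
    # The judge only runs once both transcripts are complete.
    balanced = len(state["bull"]) == len(state["bear"])
    action = "hold" if balanced else "buy"  # stub verdict logic
    return {"action": action, "rounds": state["rounds"], "transcript": state}


if __name__ == "__main__":
    print(run_debate("AAPL")["rounds"])
```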

### Don't

- Don't run an unbounded debate. Cost caps + the `max_rounds`
  predicate are non-negotiable.
- Don't let the judge synthesise without seeing both transcripts —
  the synthesis node is the load-bearing piece.
- Don't add a third advocate without thinking carefully about the
  judge prompt. Two-sided debate is well-studied; three-sided
  debates require explicit tie-breaking logic.

---

## 4. Coordinator / Router

```mermaid
flowchart TD
    Human --> Coordinator[Principal Investigator]
    Coordinator --> Sub1[Subagent: data]
    Coordinator --> Sub2[Subagent: analysis]
    Coordinator --> Sub3[Subagent: codegen]
    Sub1 --> Coordinator
    Sub2 --> Coordinator
    Sub3 --> Coordinator
    Coordinator --> Final[Synthesised report]
```

### When to use

- Workflows where the human interacts with a single high-level
  orchestrator that delegates to specialised subagents.
- Reduces cognitive load for the operator — they don't direct
  individual specs, they direct the coordinator.
- Examples: end-to-end backtest run with multiple analytical
  subagents; multi-stage research crew coordinated by a "PI"
  agent.

### AlphaSwarm shape

- [`build_full_pipeline_graph`](../alphaswarm/agents/graph/builder.py)
  plays this role today: a top-level orchestrator that routes to
  research, selection, trader, and analysis nodes.
- Decision-log
  ([decision_log.py](../alphaswarm/agents/graph/decision_log.py))
  captures the routing decisions so the human can replay why a
  particular subagent was invoked.
- The Cursor IDE itself follows this pattern — the parent agent
  dispatches `Task(subagent_type=...)` for read-only exploration
  or implementation.

### Don't

- Don't put domain logic in the coordinator. It coordinates;
  subagents do the work.
- Don't pass full intermediate state up to the human. The whole
  point is the coordinator synthesises — show the synthesis, link
  to the decision log for the trace.
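
The coordinator rules above can be illustrated with a short sketch: the coordinator only routes and synthesises, and every routing choice is appended to a decision log. All names here (`coordinate`, the subagent callables, the log entry shape) are hypothetical and do not mirror `decision_log.py`.

```python
from typing import Callable

Subagent = Callable[[str], str]


def coordinate(request: str, subagents: dict[str, Subagent], plan: list[str]) -> dict:
    """Delegate each plan step to a subagent and return a synthesis plus the trace."""
    decision_log: list[dict] = []
    findings: dict[str, str] = {}
    for name in plan:
        # Log why the subagent was invoked before calling it, so the trace is replayable.
        decision_log.append({"subagent": name, "reason": f"plan step for: {request}"})
        findings[name] = subagents[name](request)
    report = "; ".join(f"{name}: {findings[name]}" for name in plan)
    return {"report": report, "decision_log": decision_log}


if __name__ == "__main__":
    agents = {
        "data": lambda req: "bars loaded",
        "analysis": lambda req: "sharpe computed",
    }
    print(coordinate("backtest dual-ma-aapl", agents, ["data", "analysis"])["report"])
```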

---

## 5. ReAct (loop-with-observation)

```mermaid
flowchart TD
    Start --> Reason[Reason about state]
    Reason --> Act[Act / call tool]
    Act --> Observe[Observe result]
    Observe --> Decide{Goal met?}
    Decide -->|no| Reason
    Decide -->|yes| Final
```

### When to use

- Open-ended forecasting / research questions where the answer
  isn't reachable in a single shot, and the model needs to call
  tools, observe results, and iterate.
- Examples: building a market thesis from sequential
  hypothesis-tests; iterative debugging of a strategy's
  poor backtest.
- Trades latency for accuracy — only worth it for tasks where the
  user explicitly wants depth over speed.

### AlphaSwarm shape

- LangGraph state-graph with conditional edges — the loop is
  modelled as a self-edge gated by a predicate.
- Conditional predicates live in
  [alphaswarm/agents/graph/conditions.py](../alphaswarm/agents/graph/conditions.py)
  (`should_continue_debate`, `should_continue_risk`,
  `should_consult_rag`, `risk_simulator_approves`).
- Tool calls inside the loop go through `AgentRuntime` so the
  cost cap bounds the iteration count.
- For agents that need persistent memory between iterations, the
  Redis-backed checkpointer
  ([checkpointer.py](../alphaswarm/agents/graph/checkpointer.py))
  preserves graph state across process restarts.

### Don't

- Don't ReAct without a hard upper bound on iterations. The
  `max_rounds` parameter on `should_continue_debate` is the
  reference pattern — apply the same upper bound to any new
  ReAct-style condition.
- Don't share Redis checkpoint keys across unrelated runs. Each
  `(spec_version_id, run_id)` is its own checkpoint namespace.
- Don't ReAct on a hot path (live execution). Use it for
  research and post-hoc analysis where latency is acceptable.
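
A hedged sketch of the two rules above: a hard iteration bound on the loop and a per-run checkpoint namespace. The key format and both function names are assumptions for illustration; the real checkpointer's key layout is not documented here.

```python
from typing import Callable


def checkpoint_key(spec_version_id: str, run_id: str) -> str:
    """Build a namespace unique to one (spec_version_id, run_id) pair (assumed format)."""
    return f"checkpoint:{spec_version_id}:{run_id}"


def react_loop(goal: int, act: Callable[[int], int], max_iterations: int = 5) -> dict:
    """Act and observe until the goal is met or the hard bound is reached."""
    observation, steps = 0, 0
    while observation < goal and steps < max_iterations:
        observation = act(observation)  # act, then observe the result
        steps += 1
    return {"observation": observation, "steps": steps, "goal_met": observation >= goal}


if __name__ == "__main__":
    print(checkpoint_key("v3", "run-42"))
    print(react_loop(goal=3, act=lambda x: x + 1))
```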

---

## Orchestration adapter topologies (Phase 7 addition)

The additive orchestration refactor adds a sibling abstraction —
[`OrchestrationAdapter`](../alphaswarm/agents/orchestration/base.py) — that
exposes the five canonical patterns above as **first-class registry
components**. The patterns themselves don't change; the new
`WorkflowRuntime` wraps them behind a metaclass-registered alias so
operators can mix-and-match without editing graph builders by hand.
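
For readers unfamiliar with metaclass registration, here is a generic sketch of the technique. `AdapterMeta`, `BaseAdapter`, `DebateAdapter`, and `list_adapters` are hypothetical names; the actual base class in `alphaswarm/agents/orchestration/base.py` may register adapters differently.

```python
class AdapterMeta(type):
    """Metaclass that records every subclass declaring a `kind` alias."""

    registry: dict[str, type] = {}

    def __new__(mcls, name, bases, namespace, **kwargs):
        cls = super().__new__(mcls, name, bases, namespace, **kwargs)
        kind = namespace.get("kind")
        if kind is not None:  # the abstract base declares kind = None and is skipped
            mcls.registry[kind] = cls
        return cls


class BaseAdapter(metaclass=AdapterMeta):
    kind = None

    def run(self, state: dict) -> dict:
        raise NotImplementedError


class DebateAdapter(BaseAdapter):
    kind = "debate"

    def run(self, state: dict) -> dict:
        return {**state, "verdict": "hold"}  # stub result


def list_adapters() -> list[str]:
    """Return the registered aliases, e.g. to populate a dropdown."""
    return sorted(AdapterMeta.registry)


if __name__ == "__main__":
    print(list_adapters())
```

Defining the class is enough to register it, which is what lets a registry-driven UI discover adapters without a manual list.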

Seven adapter kinds, six shipping today and one planned (see
[ADAPTER_KINDS](../alphaswarm/agents/orchestration/registry.py)):

| Adapter | Kind | Wraps | Inspiration |
| --- | --- | --- | --- |
| [`LangGraphAdapter`](../alphaswarm/agents/orchestration/adapters/langgraph_adapter.py) | `graph` | The five canonical builders in [alphaswarm/agents/graph/builder.py](../alphaswarm/agents/graph/builder.py) + `build_dialectical_debate_graph` | alphaswarm |
| [`CrewProcessAdapter`](../alphaswarm/agents/orchestration/adapters/crew_adapter.py) | `crew` | [`run_research_crew`](../alphaswarm/agents/crew.py) + [`run_trader_crew`](../alphaswarm/agents/trading/crew.py) — CrewAI sequential / hierarchical | finrobot |
| [`DialecticalDebateAdapter`](../alphaswarm/agents/orchestration/adapters/debate_adapter.py) | `debate` | [`build_dialectical_debate_graph`](../alphaswarm/agents/graph/dialectical.py) with bounded rounds + forced judge synthesis | tradingagents |
| [`AutomationScheduleAdapter`](../alphaswarm/agents/orchestration/adapters/schedule_adapter.py) | `schedule` | Celery beat — enqueues [`alphaswarm.tasks.orchestration_tasks.run_workflow`](../alphaswarm/tasks/orchestration_tasks.py) | daily_stock_analysis |
| [`SignalFusionAdapter`](../alphaswarm/agents/orchestration/adapters/fusion_adapter.py) | `fusion` | Deterministic [`synthesize`](../alphaswarm/agents/trading/fusion.py) over debate + quant + model contributors | vibe_trading |
| [`WeightCentricExecutionAdapter`](../alphaswarm/agents/orchestration/adapters/weight_centric_adapter.py) | `execution` | [`WeightCentricPipeline`](../alphaswarm/rl/portfolio/pipeline.py) + [`RiskLimits`](../alphaswarm/risk/limits.py) (rule 38) | finrl |
| (Phase 7 future) `WorkflowStudioAdapter` | `studio` | Interactive workflow graph editor | langflow |

### Why use adapters over a hand-rolled builder?

- **Discoverability**: every adapter shows up in the Phase 5 studio
  dropdown via `data.orchestration.list_adapters` — operators don't
  need to read code.
- **Halt parity**: the runtime polls `should_halt(state)` between
  every adapter transition; new adapters inherit that contract for
  free.
- **Replay parity**: every spec snapshotted into
  `workflow_spec_versions` is replayable by `workflow_version_id`
  through `/workflows/runs/{run_id}/replay`.
- **Telemetry parity**: each transition opens a
  [`node_span`](../alphaswarm/agents/observability.py) so per-adapter
  latency / cost / branch decisions land on the same OTEL trace as
  every legacy agent run.
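
Halt parity can be pictured as a loop that checks a predicate before each transition. This is a simplified sketch with hypothetical names (`run_workflow`, the `halted_before` key); the real `WorkflowRuntime` also applies a per-check timeout and opens telemetry spans.

```python
from typing import Callable

Step = Callable[[dict], dict]


def run_workflow(
    steps: list[tuple[str, Step]],
    state: dict,
    should_halt: Callable[[dict], bool],
) -> dict:
    """Run steps in order, polling the halt predicate before every transition."""
    for name, step in steps:
        if should_halt(state):
            # Stop cooperatively and record where the run was interrupted.
            return {**state, "halted_before": name}
        state = step(state)
    return state


if __name__ == "__main__":
    bump = lambda s: {**s, "count": s.get("count", 0) + 1}
    steps = [("first", bump), ("second", bump), ("third", bump)]
    print(run_workflow(steps, {}, should_halt=lambda s: s.get("count", 0) >= 2))
```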

### When to use an adapter vs a graph builder

| Choose adapter when | Choose graph builder when |
| --- | --- |
| You want it in the studio dropdown | The flow is hard-coded into a service |
| You need to replay it by version id | One-off internal pipeline |
| You want bounded-debate / cooperative-cancel without writing them | You're already inside a builder body |
| The flow ships as YAML for ops | The flow is built dynamically per request |

Adapters delegate to graph builders internally — they are **wrappers,
not replacements**. Adding a new adapter never invalidates an existing
builder.

---

## Adding a new pattern

1. Identify which of the five it most resembles. Don't invent a
   sixth unless there's a real reason.
2. Add the builder under
   [alphaswarm/agents/graph/](../alphaswarm/agents/graph/). Mirror the existing
   `build_*_graph` naming.
3. Add the necessary state TypedDict to
   [state.py](../alphaswarm/agents/graph/state.py). Don't sprinkle ad-hoc
   dict keys — `AgentState` is the contract.
4. Add conditional predicates to
   [conditions.py](../alphaswarm/agents/graph/conditions.py) if the graph
   has branches.
5. Decisions emitted by the graph land in
   [decision_log.py](../alphaswarm/agents/graph/decision_log.py).
6. Tests under [tests/agents/](../tests/agents/) — at minimum, a
   `SequentialGraph` fallback test that runs the graph without
   LangGraph installed. Mirror the existing test naming: e.g.
   `test__run.py`.
7. Update [agents.md](../../concepts/agentic/agents.md) and / or this file to describe
   the new entry-point.

## Cross-references

- [agents.md](../../concepts/agentic/agents.md) — `AgentSpec` + `AgentRuntime` reference
- [agentic-pipeline.md](../../concepts/agentic/agentic-pipeline.md) — end-to-end pipeline
  walkthrough
- [agentic-development.md](../../concepts/agentic/agentic-development.md) — spec-pattern
  + ADLC manifesto
- [analysis-agents.md](../../concepts/strategy/analysis-agents.md) — analysis-specific
  agent roles
- [research-agents.md](../../concepts/agentic/research-agents.md) /
  [selection-agents.md](../../concepts/agentic/selection-agents.md) /
  [trader-agents.md](../../concepts/agentic/trader-agents.md) — per-team agent rosters
- [providers.md](../../concepts/data/providers.md) — LLM provider routing under the hood


<!-- https://alpha-swarm.ai/concepts/agentic/orchestration-refactor-rollout -->
# Orchestration control plane refactor — rollout runbook
> | Flag (env var prefix `ALPHASWARM_`) | Default | Activates | First needed in | | --- | --- | --- | --- | | `ORCHESTRATION_STUDIO_ENABLED` | `false` | `/workflows/*` API surface, Vite studio routes, `Workflo...

# Orchestration control plane refactor — rollout runbook

This is the operator-facing rollback / rollout guide for the additive
`WorkflowRuntime` + `OrchestrationAdapter` stack landed by the
seven phases described in
[ALPHASWARM_REFACTOR_MASTER_PROMPT.md](https://github.com/julianwileymac/alphaswarm/blob/main/alphaswarm_docs/docs/archive/ALPHASWARM_REFACTOR_MASTER_PROMPT.md) and
the matching cursor plan. Every change in the refactor is gated by
one of the `ALPHASWARM_ORCHESTRATION_*` flags defined on
[alphaswarm/config/settings.py](../alphaswarm/config/settings.py); **with every
flag at its default `False` the platform behaves identically to the
pre-refactor build**. The Phase 0 regression test
[tests/agents/test_orchestration_flags.py](../tests/agents/test_orchestration_flags.py)
enforces this — run it before flipping anything.

## Flag inventory

| Flag (env var prefix `ALPHASWARM_`) | Default | Activates | First needed in |
| --- | --- | --- | --- |
| `ORCHESTRATION_STUDIO_ENABLED` | `false` | `/workflows/*` API surface, Vite studio routes, `WorkflowSpec` registry persistence | Phase 5 |
| `ORCHESTRATION_CREW_ADAPTER_ENABLED` | `false` | `CrewProcessAdapter` registration (`crewai` stays an optional import) | Phase 2 |
| `ORCHESTRATION_FUSION_ENABLED` | `false` | `SignalFusionAdapter` + `WeightCentricExecutionAdapter` + `build_dialectical_with_fusion_graph` | Phase 4 |
| `ORCHESTRATION_SCHEDULE_ENABLED` | `false` | `AutomationScheduleAdapter` Celery beat entry | Phase 3 |
| `ORCHESTRATION_WORKFLOW_VERSIONING_ENABLED` | `false` | Snapshots `WorkflowSpec` into `workflow_spec_versions` on first run | Phase 5 |
| `ORCHESTRATION_KILL_PROPAGATION_ENABLED` | `false` | Watchdog + KillSwitch UI fan halts into `WorkflowRun` rows | Phase 6 |
| `ORCHESTRATION_MAX_DEBATE_ROUNDS` (int) | `2` | Hard cap enforced by `DialecticalDebateAdapter` and the graph builder | Phase 2 |
| `ORCHESTRATION_HALT_CHECK_TIMEOUT_SECONDS` (float) | `1.0` | Per-transition halt-check budget in `WorkflowRuntime` | Phase 2 |

The two numeric knobs are read every transition, so changing them
takes effect on the next workflow step without a restart.
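
The flag inventory can be read back with a few lines of standard-library Python, which is handy for a quick sanity check before a rollout. This sketch is an assumption-laden stand-in: `read_flags` is hypothetical, the accepted truthy strings are a guess, and the real settings class in `alphaswarm/config/settings.py` may parse values differently.

```python
import os
from typing import Mapping, Optional

PREFIX = "ALPHASWARM_"
BOOL_FLAGS = (
    "ORCHESTRATION_STUDIO_ENABLED",
    "ORCHESTRATION_CREW_ADAPTER_ENABLED",
    "ORCHESTRATION_FUSION_ENABLED",
    "ORCHESTRATION_SCHEDULE_ENABLED",
    "ORCHESTRATION_WORKFLOW_VERSIONING_ENABLED",
    "ORCHESTRATION_KILL_PROPAGATION_ENABLED",
)
TRUTHY = {"1", "true", "yes"}  # assumed set of accepted truthy strings


def read_flags(env: Optional[Mapping[str, str]] = None) -> dict:
    """Read the orchestration flags, falling back to the documented defaults."""
    env = os.environ if env is None else env
    flags: dict = {
        name: env.get(PREFIX + name, "false").strip().lower() in TRUTHY
        for name in BOOL_FLAGS
    }
    flags["ORCHESTRATION_MAX_DEBATE_ROUNDS"] = int(
        env.get(PREFIX + "ORCHESTRATION_MAX_DEBATE_ROUNDS", "2")
    )
    flags["ORCHESTRATION_HALT_CHECK_TIMEOUT_SECONDS"] = float(
        env.get(PREFIX + "ORCHESTRATION_HALT_CHECK_TIMEOUT_SECONDS", "1.0")
    )
    return flags


if __name__ == "__main__":
    for name, value in read_flags().items():
        print(f"{PREFIX}{name}={value}")
```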

## Recommended rollout order

1. **Phase 0 → Phase 1**: deploy with every flag at default. Run the
   full pytest suite plus
   [tests/agents/test_orchestration_flags.py](../tests/agents/test_orchestration_flags.py)
   to confirm zero behavioural drift.
2. **Phase 2 (debate)**: flip `ORCHESTRATION_CREW_ADAPTER_ENABLED` if
   you want CrewAI-backed crew adapters to register; otherwise leave
   off. The bounded-debate cap is always honoured by the new graph
   builder kwarg regardless of this flag.
3. **Phase 3 (scheduler)**: flip `ORCHESTRATION_SCHEDULE_ENABLED` AFTER
   restarting Celery workers + beat. The flag controls whether
   `alphaswarm.tasks.celery_app` registers the beat schedule entry.
4. **Phase 4 (fusion)**: flip `ORCHESTRATION_FUSION_ENABLED` only after
   confirming the existing `risk_simulator_approves` predicate still
   routes correctly on a staging dataset — fusion adds a sibling
   pathway, the existing risk gate stays authoritative.
5. **Phase 5 (studio)**: flip `ORCHESTRATION_STUDIO_ENABLED` and
   `ORCHESTRATION_WORKFLOW_VERSIONING_ENABLED` together. Apply the
   alembic migration `0046_workflow_versioning.py` BEFORE the flag is
   flipped on the API process.
6. **Phase 6 (halt fan-out)**: flip
   `ORCHESTRATION_KILL_PROPAGATION_ENABLED` last. The KillSwitch UI
   keeps its existing behaviour with this flag off; turning it on
   adds workflow-run fan-out to the existing `/agents/halt`,
   `/paper/stop-all`, `/bots/halt-all`, `/rl/halt-all`, and
   `/quant-agents/halt` fan-out.

## Rollback recipes

All rollbacks are flag-flips (no migrations, no data loss):

- **Disable studio + API**: set `ALPHASWARM_ORCHESTRATION_STUDIO_ENABLED=false`
  and reload the API. The `/workflows/*` routes refuse new requests
  with `503 Service Unavailable` while the rest of the API keeps
  serving.
- **Disable scheduler**: set `ALPHASWARM_ORCHESTRATION_SCHEDULE_ENABLED=false`
  and restart Celery beat. Already-running scheduled runs finish
  normally; no new ones are enqueued.
- **Disable fusion**: set `ALPHASWARM_ORCHESTRATION_FUSION_ENABLED=false` and
  reload. The optional
  `build_dialectical_with_fusion_graph` builder refuses to compile;
  existing builders are unaffected.
- **Disable kill fan-out**: set
  `ALPHASWARM_ORCHESTRATION_KILL_PROPAGATION_ENABLED=false`. The KillSwitch
  UI keeps its existing five halt buttons (agents / paper / bots /
  rl / quant-agents); the new "Halt workflows" button no-ops.
- **Disable workflow versioning**: set
  `ALPHASWARM_ORCHESTRATION_WORKFLOW_VERSIONING_ENABLED=false`. New runs
  refuse to snapshot a spec hash; existing `workflow_spec_versions`
  rows stay readable.
- **Full revert**: set every `ALPHASWARM_ORCHESTRATION_*` flag to `false`,
  redeploy. The platform behaves exactly like the pre-refactor build.
  The new tables (`workflow_specs`, `workflow_spec_versions`,
  `workflow_runs`) stay empty and add no read overhead to other
  routes.

## Migration safety

- The single new migration `0046_workflow_versioning.py` is additive:
  it creates three new tables and adds no columns to existing
  tables. Downgrade returns the database to the
  `0045_pgvector_foundation` head.
- The new `alphaswarm.tasks.orchestration_tasks` module appends to the
  Celery `include` list; cold installs without the module fail
  loudly at worker boot rather than silently dropping tasks.
- The Vite studio bundle is code-split: routes under
  `alphaswarm_client/src/routes/workflows/*` lazy-load only when the user
  navigates there, so disabling the flag also disables the
  bundle download path.

## Pre-flip checklist

Run before flipping any flag in production:

1. `docker exec alphaswarm-api python -m pytest tests/agents/test_orchestration_flags.py -v`
2. `docker exec alphaswarm-api python -m pytest tests/agents/test_watchdog.py -v`
3. `docker exec alphaswarm-api alembic current` — confirm head is at least
   `0045_pgvector_foundation`; for Phase 5+ confirm `0046_workflow_versioning`.
4. Snapshot the Redis kill-switch key
   (`redis-cli get $ALPHASWARM_RISK_KILL_SWITCH_KEY`) — the watchdog uses
   the same key so the new gate stays consistent.

## Where each layer lives

- Settings flags: [alphaswarm/config/settings.py](../alphaswarm/config/settings.py)
  "Orchestration control plane" block.
- Regression test: [tests/agents/test_orchestration_flags.py](../tests/agents/test_orchestration_flags.py).
- Adapter abstraction: `alphaswarm/agents/orchestration/` (Phase 1).
- Adapters: `alphaswarm/agents/orchestration/adapters/` (Phases 2-4).
- DataMCP tools: `alphaswarm/data/mcp/tools/orchestration.py` + `automation.py` (Phase 3).
- Celery task: `alphaswarm/tasks/orchestration_tasks.py` (Phase 3).
- Persistence: `alphaswarm/persistence/models_workflows.py` + alembic `0046_workflow_versioning.py` (Phase 5).
- API: `alphaswarm/api/routes/workflows.py` (Phase 5).
- Studio UI: `alphaswarm_client/src/routes/workflows/*` (Phase 5).
- Halt + watchdog hardening: `alphaswarm/tasks/agent_watchdog_tasks.py`, `alphaswarm_client/src/components/common/KillSwitch.tsx` (Phase 6).


<!-- https://alpha-swarm.ai/concepts/agentic/research-agents -->
# Research Agents
> - **First-order** (price / trade / performance) — `bars_daily`, `performance`. - **Second-order** (SEC, ratios, fundamentals) — `sec_filings`, `sec_xbrl`, `financial_ratios`, `earnings_call`, `news_se...

# Research Agents

| Spec | Module | Purpose |
| --- | --- | --- |
| `research.news_miner` | [alphaswarm/agents/research/news_miner.py](../alphaswarm/agents/research/news_miner.py) | Mine recent news + sentiment + regulatory flags for a symbol or topic. |
| `research.equity` | [alphaswarm/agents/research/equity_researcher.py](../alphaswarm/agents/research/equity_researcher.py) | Long-form equity research synthesis with hierarchical RAG citations. |
| `research.universe` | [alphaswarm/agents/research/universe_selector.py](../alphaswarm/agents/research/universe_selector.py) | Interactive stock universe shaping with RAG justification. |

## RAG layout (per the research-agent spec)

- **First-order** (price / trade / performance) — `bars_daily`, `performance`.
- **Second-order** (SEC, ratios, fundamentals) — `sec_filings`, `sec_xbrl`,
  `financial_ratios`, `earnings_call`, `news_sentiment`.
- **Third-order** (regulatory) — `cfpb_complaints`, `fda_*`, `uspto_*`.

The News Miner skews toward second + third order. The Equity Researcher
walks all three. The Universe Selector pulls L0 + L1 + L2.

## REST + tasks

```
POST /agents/research/news-miner       — async via Celery (research queue)
POST /agents/research/equity           — async via Celery
POST /agents/research/universe         — async via Celery
POST /agents/research/sync/news-miner  — synchronous variant
```

Celery wrappers live in [alphaswarm/tasks/research_tasks.py](../alphaswarm/tasks/research_tasks.py).

## Configs

YAMLs at [configs/agents/research_news_miner.yaml](../configs/agents/research_news_miner.yaml)
and friends. The in-code builders return identical specs so either path
works. Edit the YAML for hot reload.


<!-- https://alpha-swarm.ai/concepts/agentic/selection-agents -->
# Selection Agents
> `selection.stock_selector` — implemented in [alphaswarm/agents/selection/stock_selector.py](../alphaswarm/agents/selection/stock_selector.py)

# Selection Agents

The Selection team picks the top-N tickers for a
`(model, strategy, universe, agent)` quadruple. It is the bridge
between the Research team's universe candidates and the Trader team's
signal-emitter loop.

## Spec

`selection.stock_selector` — implemented in
[alphaswarm/agents/selection/stock_selector.py](../alphaswarm/agents/selection/stock_selector.py).

## RAG

| Layer | Used for |
| --- | --- |
| L0 (`decisions`) | Past `agent_decisions` outcomes — paper RAG#0. |
| L1 (`performance`) | Recent backtest performance windows. |
| L2 (`financial_ratios`, `sec_xbrl`) | Discriminate between similar candidates. |
| Tool: `regulatory_lookup` | Tail-risk veto. |

## Memory + annotations

Every pick is persisted via `annotation` with `label="pick"` and a
payload `{score, rationale, evidence, vetoed_by?}` so the optimisation
analysis layer can inspect the historical edge of each combo.

## REST

```
POST /agents/selection/run             — async via Celery
POST /agents/selection/sync            — synchronous variant
GET  /agents/selection/runs            — recent runs
GET  /agents/selection/annotations     — pick rationale history
```

## YAML

[configs/agents/selection_stock_selector.yaml](../configs/agents/selection_stock_selector.yaml).


<!-- https://alpha-swarm.ai/concepts/agentic/trader-agents -->
# Trader Agents
> [alphaswarm/agents/trader/signal_emitter.py](../alphaswarm/agents/trader/signal_emitter.py)

# Trader Agents

The spec-driven trader (`trader.signal_emitter`) coexists with the
existing TradingAgents-style debate trader under
[alphaswarm/agents/trading/](../alphaswarm/agents/trading/). The new one is
deliberately simpler — one structured signal per call — so it can
slot into the LangGraph pipeline and the agentic backtest loop.

## Spec

[alphaswarm/agents/trader/signal_emitter.py](../alphaswarm/agents/trader/signal_emitter.py).

## RAG

- **L1 / L2** — `bars_daily`, `performance`, `financial_ratios` for
  windowed indicator + fundamentals context.
- **L0** — `decisions` for prior-trade reflection (paper RAG#0).

## Output schema

```json
{
  "vt_symbol": "AAPL.NASDAQ",
  "as_of": "2026-04-27T20:00:00Z",
  "action": "buy" | "sell" | "hold",
  "confidence": 0..1,
  "horizon": "intraday" | "1d" | "5d" | "20d",
  "size_hint_pct": 0..1,
  "stop_loss_pct": 0..1,
  "take_profit_pct": 0..1,
  "rationale": "...",
  "evidence": [{"corpus": "...", "doc_id": "...", "snippet": "..."}]
}
```
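The schema above is written in shorthand (`"buy" | "sell" | "hold"`, `0..1`) rather than strict JSON. A minimal sketch of how a consumer might check a decoded signal against it, using only the standard library; `validate_signal` and the constant names are hypothetical and not part of the AlphaSwarm codebase:

```python
from typing import Any

# Allowed values copied from the schema above.
ACTIONS = {"buy", "sell", "hold"}
HORIZONS = {"intraday", "1d", "5d", "20d"}
UNIT_FIELDS = ("confidence", "size_hint_pct", "stop_loss_pct", "take_profit_pct")
EVIDENCE_KEYS = {"corpus", "doc_id", "snippet"}


def validate_signal(signal: dict[str, Any]) -> list[str]:
    """Return a list of problems; an empty list means the signal is well-formed."""
    problems: list[str] = []
    if signal.get("action") not in ACTIONS:
        problems.append(f"action must be one of {sorted(ACTIONS)}")
    if signal.get("horizon") not in HORIZONS:
        problems.append(f"horizon must be one of {sorted(HORIZONS)}")
    for field in UNIT_FIELDS:
        value = signal.get(field)
        is_number = isinstance(value, (int, float)) and not isinstance(value, bool)
        if not is_number or not 0.0 <= value <= 1.0:
            problems.append(f"{field} must be a number in [0, 1]")
    for item in signal.get("evidence", []):
        missing = EVIDENCE_KEYS - set(item)
        if missing:
            problems.append(f"evidence item missing {sorted(missing)}")
    return problems
```

Returning a list of problems rather than raising on the first one lets a caller log every violation in a single pass.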

## Safety

- Honors the runtime kill switch (Redis key
  `settings.risk_kill_switch_key`); when engaged the agent MUST emit
  `"hold"`.
- `risk_check` validates the proposed `size_hint_pct`.
- Guardrail caps cost at 0.25 USD / call by default.
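The kill-switch rule can be sketched as a small post-processing step. This is an illustration only: `enforce_kill_switch` is a hypothetical helper, the kill-switch check is passed in as a callable so the sketch does not depend on Redis, and zeroing `size_hint_pct` on a forced hold is an assumption rather than documented behaviour.

```python
from typing import Any, Callable


def enforce_kill_switch(
    signal: dict[str, Any],
    kill_switch_engaged: Callable[[], bool],
) -> dict[str, Any]:
    """Return the signal unchanged, or a forced "hold" copy when the switch is engaged."""
    if not kill_switch_engaged():
        return signal
    held = dict(signal)  # copy so the caller's original signal is not mutated
    held["action"] = "hold"
    held["size_hint_pct"] = 0.0  # assumption: a forced hold proposes no position change
    held["rationale"] = "kill switch engaged; forced hold"
    return held
```

In a real deployment the callable would read the Redis key named by `settings.risk_kill_switch_key`.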

## REST

```
POST /agents/trader/signal              — emit one signal (sync emit + task id)
POST /agents/trader/sync                — pure synchronous run
POST /agents/trader/backtest-with-agent — kick off agentic backtest
```

## YAML

[configs/agents/trader_signal_emitter.yaml](../configs/agents/trader_signal_emitter.yaml).


<!-- https://alpha-swarm.ai/concepts/agentic/workflow-studio -->
# Workflow Studio
> | Layer | File / Path | | --- | --- | | Spec contract | [alphaswarm/agents/orchestration/spec.py](../alphaswarm/agents/orchestration/spec.py) | | Registry + persist_spec | [alphaswarm/agents/orchestration/registry_specs.p...

# Workflow Studio

The Workflow Studio is the operator-facing surface for the additive
orchestration control plane introduced by the seven-phase refactor in
[orchestration-refactor-rollout.md](../../concepts/agentic/orchestration-refactor-rollout.md).
It composes the five existing graph builders, the three (then five)
adapters, and the new hash-locked `WorkflowSpec` registry into a
single replayable workflow concept.

## What ships

| Layer | File / Path |
| --- | --- |
| Spec contract | [alphaswarm/agents/orchestration/spec.py](../alphaswarm/agents/orchestration/spec.py) |
| Registry + persist_spec | [alphaswarm/agents/orchestration/registry_specs.py](../alphaswarm/agents/orchestration/registry_specs.py) |
| Runtime | [alphaswarm/agents/orchestration/runtime.py](../alphaswarm/agents/orchestration/runtime.py) |
| Adapter ABC + metaclass | [alphaswarm/agents/orchestration/base.py](../alphaswarm/agents/orchestration/base.py) |
| Adapter registry | [alphaswarm/agents/orchestration/registry.py](../alphaswarm/agents/orchestration/registry.py) |
| Adapters (5) | [alphaswarm/agents/orchestration/adapters/](../alphaswarm/agents/orchestration/adapters/) |
| ORM | [alphaswarm/persistence/models_workflows.py](../alphaswarm/persistence/models_workflows.py) |
| Migration | [alembic/versions/0046_workflow_versioning.py](../alembic/versions/0046_workflow_versioning.py) |
| REST | [alphaswarm/api/routes/workflows.py](../alphaswarm/api/routes/workflows.py) |
| Celery tasks | [alphaswarm/tasks/orchestration_tasks.py](../alphaswarm/tasks/orchestration_tasks.py) |
| DataMCP tools | [alphaswarm/data/mcp/tools/orchestration.py](../alphaswarm/data/mcp/tools/orchestration.py), [alphaswarm/data/mcp/tools/automation.py](../alphaswarm/data/mcp/tools/automation.py) |
| Cache entry | `workflows` category in [alphaswarm/cache/keys.py](../alphaswarm/cache/keys.py) |
| Frontend routes | [alphaswarm_client/src/routes/workflows/](../alphaswarm_client/src/routes/workflows/) |
| Frontend components | [alphaswarm_client/src/components/workflows/](../alphaswarm_client/src/components/workflows/) |

## Spec shape

A workflow selects exactly one
[`OrchestrationAdapter`](../alphaswarm/agents/orchestration/base.py) by alias
and hands it adapter-specific params. The adapter dispatches
internally — composite flows (Crew + Graph + Debate) belong inside
their own adapter, not at the spec layer.

```yaml
name: research.dialectical_with_fusion_v1
description: "Bull/Bear debate + fusion + weight-centric execution"
adapter: LangGraphAdapter
adapter_kind: graph
params:
  builder: dialectical          # one of build_* in alphaswarm/agents/graph/
  builder_kwargs:
    max_rounds: 2
schedule:
  cron: "30 13 * * 1-5"
  timezone: UTC
  enabled: false                # operator flips after the studio + schedule flags
guardrails:
  cost_budget_usd: 3.0
  max_calls: 60
  max_duration_seconds: 900
annotations: [research, dialectical]
template_target: research
```

`WorkflowSpec.snapshot_hash()` is the SHA-256 of the canonical JSON
form (sorted keys, no whitespace). Re-snapshotting a spec with the
same hash returns the existing `workflow_spec_versions` row;
changing any field inserts a NEW row (parallel to
`agent_spec_versions`, `bot_versions`, `rl_experiment_versions`,
`analysis_spec_versions`).
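A minimal sketch of the hash-locking idea, assuming the canonical form is `json.dumps` with sorted keys and compact separators. The free function `snapshot_hash` and the `InMemorySpecVersions` class are hypothetical stand-ins for `WorkflowSpec.snapshot_hash()` and the `workflow_spec_versions` table:

```python
import hashlib
import json
from typing import Any


def snapshot_hash(spec: dict[str, Any]) -> str:
    """SHA-256 over canonical JSON: sorted keys, no insignificant whitespace."""
    canonical = json.dumps(spec, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


class InMemorySpecVersions:
    """Toy version store keyed by hash; a stand-in for the real table."""

    def __init__(self) -> None:
        self._rows: dict[str, dict[str, Any]] = {}

    def persist_spec(self, spec: dict[str, Any]) -> dict[str, Any]:
        digest = snapshot_hash(spec)
        if digest not in self._rows:
            # Unseen hash: insert a new immutable row with a frozen copy of the payload.
            self._rows[digest] = {
                "version": len(self._rows) + 1,
                "hash": digest,
                "payload": json.loads(json.dumps(spec)),
            }
        # Known hash: return the existing row unchanged.
        return self._rows[digest]
```

Because keys are sorted before hashing, two specs that differ only in key order map to the same version row.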

## Operator flow

1. Operator flips `ALPHASWARM_ORCHESTRATION_STUDIO_ENABLED=true` (see the
   rollout doc).
2. Frontend navigates to `/workflows`. List + detail render through
   the shared entity-picker component so the dropdown shares the
   same cache invalidation path as every other entity picker.
3. Operator hits **Run** → POST `/workflows/{name}/run` → enqueues
   `alphaswarm.tasks.orchestration_tasks.run_workflow`. The route returns a
   `task_id`; the studio attaches via the existing `useLiveStream`
   hook for `_progress.emit` frames (rule 4).
4. Operator hits **Replay** on a historical run → POST
   `/workflows/runs/{run_id}/replay` re-dispatches with the
   captured `spec_version_id` for deterministic reproduction.
5. Operator hits the topbar KillSwitch's "Halt workflows" action →
   POST `/workflows/halt` mirrors the five canonical halt endpoints
   (`/agents/halt`, `/paper/stop-all`, `/bots/halt-all`,
   `/rl/halt-all`, `/quant-agents/halt`).

## Halt fan-out

The Phase 2 `WorkflowRuntime` checks `should_halt(state)` between
every adapter transition. `should_halt` is the OR of:

- `has_kill_switch()` — Redis-backed global flag (the existing
  topbar KillSwitch flips this).
- `state["halt_token"]` — per-run boolean the Phase 6
  `/workflows/halt` endpoint sets on every active `WorkflowRun` row
  within `ALPHASWARM_ORCHESTRATION_HALT_CHECK_TIMEOUT_SECONDS` of the API
  call.

Long-running adapters (`CrewProcessAdapter`, `LangGraphAdapter`,
`DialecticalDebateAdapter`) poll `context.is_halted()` between
inner steps so the SLA holds even mid-debate.
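The halt check can be sketched as a plain OR evaluated before every step. The names below mirror the prose, but the signatures are assumptions: `has_kill_switch` is injected as a callable, and `run_steps` is a hypothetical stand-in for the runtime loop.

```python
from typing import Any, Callable, Iterable

State = dict[str, Any]


def should_halt(state: State, has_kill_switch: Callable[[], bool]) -> bool:
    """OR of the global kill switch and the per-run halt token."""
    return bool(has_kill_switch()) or bool(state.get("halt_token", False))


def run_steps(
    steps: Iterable[Callable[[State], None]],
    state: State,
    has_kill_switch: Callable[[], bool],
) -> State:
    """Run steps in order, re-checking the halt condition before each transition."""
    state["halted"] = False
    for step in steps:
        if should_halt(state, has_kill_switch):
            state["halted"] = True
            break
        step(state)
    return state
```

A long-running adapter would apply the same check between its own inner steps, so a halt takes effect without waiting for the whole adapter to finish.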

## Adapter catalog (Phases 2-5)

| alias | kind | source | when registered |
| --- | --- | --- | --- |
| `LangGraphAdapter` | graph | alphaswarm | always |
| `CrewProcessAdapter` | crew | finrobot | always (gated invoke) |
| `DialecticalDebateAdapter` | debate | tradingagents | always |
| `AutomationScheduleAdapter` | schedule | daily_stock_analysis | always (gated invoke) |
| `SignalFusionAdapter` | fusion | vibe_trading | always (gated invoke) |
| `WeightCentricExecutionAdapter` | execution | finrl | always (gated invoke) |
| `WorkflowStudioAdapter` (Phase 7) | studio | langflow | TBD |

New adapters land by subclassing
[`OrchestrationAdapter`](../alphaswarm/agents/orchestration/base.py) and
setting `adapter_kind` + `adapter_alias`. The metaclass auto-registers
them through
[`alphaswarm.core.registry.register`](../alphaswarm/core/registry.py) and the
shadow per-kind index in
[`alphaswarm/agents/orchestration/registry.py`](../alphaswarm/agents/orchestration/registry.py).
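The metaclass registration pattern can be sketched as follows. This is a generic illustration, not the code in `base.py`: `AdapterMeta`, `OrchestrationAdapterSketch`, `EchoAdapter`, and the two module-level dictionaries are hypothetical names.

```python
from abc import ABCMeta, abstractmethod
from typing import Any

ADAPTERS_BY_ALIAS: dict[str, type] = {}
ADAPTERS_BY_KIND: dict[str, list[type]] = {}  # per-kind index


class AdapterMeta(ABCMeta):
    """Registers every subclass that sets both adapter_alias and adapter_kind."""

    def __new__(mcls, name, bases, namespace, **kwargs):
        cls = super().__new__(mcls, name, bases, namespace, **kwargs)
        alias = namespace.get("adapter_alias")
        kind = namespace.get("adapter_kind")
        if alias and kind:  # the abstract base sets neither, so it is skipped
            if alias in ADAPTERS_BY_ALIAS:
                raise ValueError(f"duplicate adapter alias: {alias}")
            ADAPTERS_BY_ALIAS[alias] = cls
            ADAPTERS_BY_KIND.setdefault(kind, []).append(cls)
        return cls


class OrchestrationAdapterSketch(metaclass=AdapterMeta):
    adapter_alias: str  # annotation only; concrete subclasses assign a value
    adapter_kind: str

    @abstractmethod
    def invoke(self, params: dict[str, Any]) -> dict[str, Any]: ...


class EchoAdapter(OrchestrationAdapterSketch):
    adapter_alias = "EchoAdapter"
    adapter_kind = "graph"

    def invoke(self, params: dict[str, Any]) -> dict[str, Any]:
        return dict(params)
```

Defining the class is enough to register it; no explicit `register(...)` call is needed at import time.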

## Audit trail

Every run produces:

- A `workflow_runs` row (one per run) with `spec_version_id`,
  `inputs`, `final_state`, `breadcrumbs`, `experiment_id`, `test_id`
  (rule 34), `cost_usd`, `duration_ms`, `status`, `halted`, `error`.
- A series of `_progress.emit` frames the studio streams live
  through `useLiveStream` (frame shape per rule 4).
- Per-adapter `node_span` OTEL spans emitted by
  [`alphaswarm/agents/observability.py`](../alphaswarm/agents/observability.py).
- Optional `agent_runs_v2` rows for each inner `AgentRuntime` call
  the wrapped adapter makes.

## Replay semantics

`POST /workflows/runs/{run_id}/replay` looks up the matching
`workflow_runs` row, hydrates the frozen
`workflow_spec_versions.payload`, and re-dispatches with the same
inputs. Replay produces a NEW `workflow_runs` row tagged with the
original run's id in `parent_run_id` so the trace lineage stays
intact.
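A minimal in-memory sketch of the replay lineage, assuming rows are plain dictionaries. `replay_run` and the two dictionaries are hypothetical stand-ins for the route handler and the `workflow_runs` / `workflow_spec_versions` tables:

```python
import uuid
from typing import Any

Row = dict[str, Any]


def replay_run(
    runs: dict[str, Row],
    spec_versions: dict[str, Row],
    run_id: str,
) -> tuple[Row, Any]:
    """Create a new run row that points back at the original; return it with the frozen spec."""
    original = runs[run_id]
    frozen_payload = spec_versions[original["spec_version_id"]]["payload"]
    new_run: Row = {
        "id": str(uuid.uuid4()),
        "spec_version_id": original["spec_version_id"],  # same frozen spec version
        "inputs": dict(original["inputs"]),               # same inputs, copied
        "parent_run_id": original["id"],                  # lineage back to the source run
        "status": "queued",
    }
    runs[new_run["id"]] = new_run
    return new_run, frozen_payload
```

The original row is never modified, so a run can be replayed any number of times and each replay remains traceable through `parent_run_id`.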

## See also

- [orchestration-refactor-rollout.md](../../concepts/agentic/orchestration-refactor-rollout.md) — operator runbook + per-flag rollback.
- [multi-agent-patterns.md](../../concepts/agentic/multi-agent-patterns.md) — the seven adapter topologies (Phase 7 docs update).
- [data-mcp.md](../../concepts/data/data-mcp.md) — `data.orchestration.*` and `data.automation.*` tool catalog.
- [agentic-development.md](../../concepts/agentic/agentic-development.md) — where `WorkflowSpec` sits in the four-runtime + skill-artifact framework.


<!-- https://alpha-swarm.ai/concepts/data/bi-temporal-graph -->
# Bi-temporal PermissionedDataPoint
> Four-timestamp model + invalidated_by_edge_id (Graphiti-style invalidation).

# Bi-temporal `PermissionedDataPoint`

Every node and every edge in the KB carries the same envelope:

```python
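from datetime import datetime
from typing import Any, Optional
from uuid import UUID

from pydantic import BaseModel

# ACL, Provenance, and LayerMembership are sibling models not shown here.


class TemporalRange(BaseModel):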
    valid_from:  datetime           # event time start
    valid_to:    Optional[datetime] # event time end (None = still true)
    created_at:  datetime           # system time start
    expired_at:  Optional[datetime] # system time end (None = active)

class PermissionedDataPoint(BaseModel):
    id: UUID
    type: str = "PermissionedDataPoint"
    temporal: TemporalRange
    acl: ACL                     # owner + role-based + ABAC + ReBAC anchors
    provenance: Provenance       # dataset_id + data_id + extractor chain
    layer: LayerMembership       # PRIVATE / HIERARCHICAL / MARKETPLACE / GLOBAL
    index_fields: list[str]      # which fields feed the vector embedding
    properties: dict[str, Any]
```

## Two timelines

Following the Graphiti / Zep four-timestamp model:

| Pair | Tracks | Closes when |
| --- | --- | --- |
| `valid_from` / `valid_to` | Real-world event time | Fact stops being true |
| `created_at` / `expired_at` | System ingest time | Fact is logically invalidated |

A contradicted edge **closes** `valid_to` (and optionally
`expired_at` + `invalidated_by_edge_id`) — it is never deleted. This
preserves the timeline for `as_of=` queries.
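A small sketch of close-instead-of-delete and an event-time `as_of` filter, using a frozen dataclass in place of the Pydantic model. `EdgeSketch`, `invalidate`, and `true_as_of` are hypothetical names, and closing `expired_at` together with `valid_to` is one of the options the text allows, not the only one.

```python
from dataclasses import dataclass, replace
from datetime import datetime, timezone
from typing import Optional


@dataclass(frozen=True)
class EdgeSketch:
    edge_id: str
    fact: str
    valid_from: datetime
    created_at: datetime
    valid_to: Optional[datetime] = None
    expired_at: Optional[datetime] = None
    invalidated_by_edge_id: Optional[str] = None


def invalidate(old: EdgeSketch, new_edge_id: str, at: datetime) -> EdgeSketch:
    """Close the old edge's windows and record which edge superseded it."""
    return replace(old, valid_to=at, expired_at=at, invalidated_by_edge_id=new_edge_id)


def true_as_of(edges: list[EdgeSketch], as_of: datetime) -> list[EdgeSketch]:
    """Edges whose event-time window [valid_from, valid_to) contains as_of."""
    return [
        e for e in edges
        if e.valid_from <= as_of and (e.valid_to is None or as_of < e.valid_to)
    ]


# Usage: a 2024 fact is superseded at the start of 2025 but stays queryable.
start_2024 = datetime(2024, 1, 1, tzinfo=timezone.utc)
start_2025 = datetime(2025, 1, 1, tzinfo=timezone.utc)
old_edge = EdgeSketch("e1", "CEO is Alice", valid_from=start_2024, created_at=start_2024)
new_edge = EdgeSketch("e2", "CEO is Bob", valid_from=start_2025, created_at=start_2025)
timeline = [invalidate(old_edge, new_edge.edge_id, at=start_2025), new_edge]
```

Since the closed edge stays in `timeline`, a query dated mid-2024 still returns the old fact while a query dated mid-2025 returns the new one.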

## Provenance chain

`Provenance` carries `dataset_id` + `data_id` + the extractor chain
(`["spacy", "gliner", "llm"]`) + the pipeline run id. When a tenant
requests targeted forgetting (GDPR / CCPA), `KBRuntime.forget`
locates rows by dataset/data id and closes their validity window.

## ACL envelope

The `ACL` block carries:

- `owner_principal_id` + `owner_tenant_id` (RBAC anchor).
- `roles_read` / `roles_write` / `roles_delete` (RBAC).
- `abac_tags` (ABAC — region, classification, time-of-day, ...).
- `rebac_anchor_ids` (OpenFGA tuple keys like
  `document:abc#viewer`).
- `deny_principal_ids` (explicit denial list).

`DefaultPermissionResolver` (in
[`kb-permissions.md`](kb-permissions.md)) fuses these fields into a
single per-request `AccessBitmap`.

## Bi-temporal merge in the composer

`KBLayerComposer.compose_recall` collects hits across layers
(private > hierarchical > marketplace > global), then applies the
precedence-aware bi-temporal merger:

1. Group hits by entity `id`.
2. The first occurrence (highest precedence) wins.
3. Lower-precedence hits get appended to
   `metadata.dissenting_layers` so the UI can surface them
   transparently.
4. `valid_from`/`valid_to` are preserved on every hit so a downstream
   `as_of` reconstruction is lossless.
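The four steps above can be sketched over plain dictionaries. This is an illustration, not the `KBLayerComposer` implementation: `merge_layers`, the `PRECEDENCE` map, and the hit field names (`layer`, `value`) are assumptions.

```python
from typing import Any

# Lower number = higher precedence, matching private > hierarchical > marketplace > global.
PRECEDENCE = {"private": 0, "hierarchical": 1, "marketplace": 2, "global": 3}

Hit = dict[str, Any]


def merge_layers(hits: list[Hit]) -> list[Hit]:
    """Dedupe by entity id; the highest-precedence hit wins, others are kept as dissent."""
    winners: dict[str, Hit] = {}
    # sorted() is stable, so hits from the same layer keep their input order.
    for hit in sorted(hits, key=lambda h: PRECEDENCE[h["layer"]]):
        entity_id = hit["id"]
        if entity_id not in winners:
            winner = dict(hit)
            winner["metadata"] = {**hit.get("metadata", {}), "dissenting_layers": []}
            winners[entity_id] = winner
        else:
            # Keep the loser's validity window so as_of reconstruction stays lossless.
            winners[entity_id]["metadata"]["dissenting_layers"].append({
                "layer": hit["layer"],
                "value": hit.get("value"),
                "valid_from": hit.get("valid_from"),
                "valid_to": hit.get("valid_to"),
            })
    return list(winners.values())
```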


<!-- https://alpha-swarm.ai/concepts/data/kb-federation -->
# Marketplace federation (`alphaswarm_kb_federation`)
> Cross-silo recall reverse-proxy with OpenFGA + signed share tokens + bi-temporal merge.

# `alphaswarm_kb_federation`

The federation gateway is a standalone FastAPI service that brokers
cross-silo recall. It is the only sanctioned cross-silo recall path
(hard rule 60).

## Why a separate service

- The federation logic is fundamentally stateless except for the
  result cache + OpenFGA Watch subscriber. Running it as a sidecar to
  `alphaswarm_kb` would couple lifecycle with the monolith; running it
  standalone lets it scale horizontally on its own.
- Cross-silo traffic crosses trust boundaries (subscriber tenant →
  source tenant). Keeping the broker in its own process makes the
  trust boundary explicit and audit-friendly.
- The CI guard
  [`check_alphaswarm_kb_federation_no_alphaswarm.py`](https://github.com/alphaswarm/alphaswarm/blob/main/scripts/ci/check_alphaswarm_kb_federation_no_alphaswarm.py)
  enforces `no_alphaswarm_imports` so the boundary cannot drift.

## Sequence

```
subscriber silo                    federation gateway                source silo
─────────────────                  ──────────────────                ───────────
POST /federation/recall  ─────▶
                                   1. OpenFGA `check` (visible?)
                                   2. mint signed share token (HS256/RS256, 600s)
                                   3. POST /kb/corpora/.../recall ──▶
                                                                    verify share token
                                                                    return hits
                                   4. BitemporalMerger.merge_layers
                                   5. cache + audit
◀───────── ComposedResult
```

## Subscription writer

`POST /federation/subscriptions` writes the matching OpenFGA tuple +
emits a `subscription.granted` event on the NATS / Redis Pub/Sub
bus that subscribers consume to flush bitmap caches.

Step-up MFA gates every subscription mutation per AlphaSwarm rule 52.

## Caching

- Per-`(subscriber_tenant, cache_key)` Redis namespace under
  `alphaswarm:kb:federation:*`.
- 60s default TTL.
- Cache miss + upstream call budget: 5s default. The gateway aims for
  ≤250ms p95 federation overhead on a warm cache.
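A minimal in-process sketch of the per-subscriber TTL cache, assuming a 60-second default. `TTLCache` is a hypothetical class; the real gateway uses Redis, and the clock is injectable here only to make expiry easy to exercise.

```python
import time
from typing import Any, Callable, Optional


class TTLCache:
    """Result cache keyed by (subscriber_tenant, cache_key) with a fixed TTL."""

    def __init__(
        self,
        ttl_seconds: float = 60.0,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._ttl = ttl_seconds
        self._clock = clock
        self._entries: dict[tuple[str, str], tuple[float, Any]] = {}

    def put(self, subscriber_tenant: str, cache_key: str, value: Any) -> None:
        expires_at = self._clock() + self._ttl
        self._entries[(subscriber_tenant, cache_key)] = (expires_at, value)

    def get(self, subscriber_tenant: str, cache_key: str) -> Optional[Any]:
        key = (subscriber_tenant, cache_key)
        entry = self._entries.get(key)
        if entry is None:
            return None
        expires_at, value = entry
        if self._clock() >= expires_at:
            del self._entries[key]  # drop the expired entry lazily on read
            return None
        return value
```

Including the subscriber tenant in the key means one tenant can never read a result cached for another, even when the query hash is identical.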

## Deployment

| Surface | Where |
| --- | --- |
| Multi-arch Dockerfile | [`alphaswarm_kb_federation/deployments/docker/Dockerfile`](https://github.com/alphaswarm/alphaswarm/blob/main/alphaswarm_kb_federation/deployments/docker/Dockerfile) |
| Helm chart | [`alphaswarm_kb_federation/deployments/kubernetes/helm/alphaswarm-kb-federation/`](https://github.com/alphaswarm/alphaswarm/tree/main/alphaswarm_kb_federation/deployments/kubernetes/helm/alphaswarm-kb-federation) |
| Docker Compose (local) | [`alphaswarm_kb_federation/deployments/compose/docker-compose.federation.yml`](https://github.com/alphaswarm/alphaswarm/blob/main/alphaswarm_kb_federation/deployments/compose/docker-compose.federation.yml) |
| Terraform module | [`alphaswarm_platform/terraform/modules/kb_marketplace_federation/`](https://github.com/alphaswarm/alphaswarm/tree/main/alphaswarm_platform/terraform/modules/kb_marketplace_federation) |

## Hard rules it enforces

- Hard rule 60: cross-silo recall goes through this service only.
- Hard rule 26: every upstream call mints its own M2M token via
  `CredentialResolver`.
- Hard rule 52: step-up MFA on subscription admin endpoints.
- Hard rule 49 (no-token-passthrough): the share token's `aud` claim
  is bound to the source silo; passthrough across audiences is
  rejected at the verifier.
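Steps 2 and 3 of the sequence, together with the audience binding from rule 49, can be sketched with a hand-rolled HS256 JWT built on the standard library. `mint_share_token` and `verify_share_token` are hypothetical names and the claim set is an assumption; a production service would normally use a maintained JWT library.

```python
import base64
import hashlib
import hmac
import json
import time
from typing import Any, Optional


def _b64url(data: bytes) -> str:
    return base64.urlsafe_b64encode(data).rstrip(b"=").decode("ascii")


def _b64url_decode(text: str) -> bytes:
    return base64.urlsafe_b64decode(text + "=" * (-len(text) % 4))


def _sign(secret: bytes, signing_input: str) -> bytes:
    return hmac.new(secret, signing_input.encode("ascii"), hashlib.sha256).digest()


def mint_share_token(secret: bytes, subscriber: str, audience: str,
                     ttl_seconds: int = 600, now: Optional[float] = None) -> str:
    """Mint a short-lived HS256 token whose aud claim names the source silo."""
    issued = int(time.time() if now is None else now)
    header = {"alg": "HS256", "typ": "JWT"}
    claims = {"sub": subscriber, "aud": audience, "iat": issued, "exp": issued + ttl_seconds}
    signing_input = ".".join(
        _b64url(json.dumps(part, separators=(",", ":")).encode("utf-8"))
        for part in (header, claims)
    )
    return signing_input + "." + _b64url(_sign(secret, signing_input))


def verify_share_token(secret: bytes, token: str, expected_audience: str,
                       now: Optional[float] = None) -> dict[str, Any]:
    """Check signature, algorithm, audience, and expiry; raise ValueError on any failure."""
    parts = token.split(".")
    if len(parts) != 3:
        raise ValueError("malformed token")
    header_b64, claims_b64, sig_b64 = parts
    expected_sig = _sign(secret, f"{header_b64}.{claims_b64}")
    if not hmac.compare_digest(expected_sig, _b64url_decode(sig_b64)):
        raise ValueError("bad signature")
    if json.loads(_b64url_decode(header_b64)).get("alg") != "HS256":
        raise ValueError("unexpected algorithm")
    claims = json.loads(_b64url_decode(claims_b64))
    if claims.get("aud") != expected_audience:
        raise ValueError("audience mismatch")  # no passthrough across audiences
    if (time.time() if now is None else now) >= claims["exp"]:
        raise ValueError("token expired")
    return claims
```

The source silo passes its own identifier as `expected_audience`, so a token minted for one silo fails verification at any other.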


<!-- https://alpha-swarm.ai/concepts/data/kb-permissions -->
# KB permissions — AccessBitmap + OpenFGA + OPA + Cedar
> Hybrid ReBAC + ABAC stack materialised into a per-request AccessBitmap.

# KB permissions

## Hybrid stack

| Layer | Provider | What it answers |
| --- | --- | --- |
| RBAC | `ACL.roles_*` + Membership rows | "Is the user an editor of this corpus?" |
| ABAC | `IPolicyEngine` (default: OPA; opt-in: Cedar) | "Does the user's region == EU and the resource's classification ≤ user's clearance?" |
| ReBAC | `IACLEvaluator` (default: Native; opt-in: OpenFGA / SpiceDB / Permify) | "Does the user inherit access via a chain of group / org / dataset / subscription relations?" |

## AccessBitmap

`DefaultPermissionResolver.materialize_bitmap` produces a per-request
`AccessBitmap`:

```python
from typing import Optional
from uuid import UUID

from pydantic import BaseModel


class AccessBitmap(BaseModel):
    visible_node_ids: set[UUID]
    visible_edge_ids: set[UUID]
    excluded_node_ids: set[UUID]
    field_redactions: dict[UUID, set[str]]
    residual_cypher: Optional[str]   # OPA partial-eval residual
    residual_sql:    Optional[str]
    computed_at_iso: Optional[str]
    cache_key:       Optional[str]
```

The bitmap is built by:

1. Calling `IACLEvaluator.list_objects(principal_id, action, "node", tenant_id)`
   → set of visible node UUIDs (OpenFGA `list-objects`).
2. Calling `IPolicyEngine.partial_evaluate(action, "node", ctx)` →
   residual Cypher / SQL fragment (OPA `compile`).
3. Caching the result for 60s by
   `(tenant, principal, action, anchor_hash)`.

## Projection into store-native filters

| Store | How the bitmap shows up |
| --- | --- |
| Graph (Cypher) | `WHERE n.id IN $visible_node_ids AND r.id IN $visible_edge_ids AND (${residual_cypher})` |
| Vector (payload filter) | `{"tenant_id": {"$eq": "..."}, "id": {"$in": [...]}}` |
| Relational (RLS) | Session GUCs `app.current_tenant_id` + `app.current_workspace_id` + `app.visible_node_ids` |
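The graph and vector rows of the table can be sketched as two small builders. `cypher_where` and `vector_payload_filter` are hypothetical helpers; the sketch assumes ID sets are bound as query parameters and that the residual fragment comes from the policy engine, never from user input.

```python
from typing import Any, Optional
from uuid import UUID


def cypher_where(residual_cypher: Optional[str]) -> str:
    """Build the WHERE clause; ID sets are supplied separately as query parameters."""
    clause = "n.id IN $visible_node_ids AND r.id IN $visible_edge_ids"
    if residual_cypher:
        # Assumption: the residual is trusted policy-engine output, so it is inlined.
        clause += f" AND ({residual_cypher})"
    return "WHERE " + clause


def vector_payload_filter(tenant_id: str, visible_node_ids: set[UUID]) -> dict[str, Any]:
    """Build the payload filter; IDs are stringified and sorted for a stable cache key."""
    return {
        "tenant_id": {"$eq": tenant_id},
        "id": {"$in": sorted(str(node_id) for node_id in visible_node_ids)},
    }
```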

## OpenFGA authorization model

The bundled
[`authorization_model.fga`](https://github.com/alphaswarm/alphaswarm/blob/main/alphaswarm_kb/configs/policies/openfga/authorization_model.fga)
defines the canonical types:

```
type tenant
  relations
    define member: [user]
    define admin: [user]
    define parent: [tenant]

type corpus
  relations
    define owner_tenant: [tenant]
    define editor: [user] or admin from owner_tenant
    define viewer: [user] or editor or member from owner_tenant
    define subscriber: [tenant]

type dataset
  relations
    define parent_corpus: [corpus]
    define editor: editor from parent_corpus
    define viewer: viewer from parent_corpus or subscriber from parent_corpus
    define subscriber: [tenant]
```

## OPA policy bundle

The bundled
[`authz.rego`](https://github.com/alphaswarm/alphaswarm/blob/main/alphaswarm_kb/configs/policies/opa/authz.rego)
implements default-deny with role-based + region-lock + classification
gates. Bundles are signed and served from
`s3://alphaswarm-kb-opa-bundles/` (or the Azure / GCP equivalents) and
pulled by OPA every 30-120s.

## Cedar (optional)

Cedar is the optional `IPolicyEngine` adapter for tenants requiring
formal verification. Activate by setting
`KBCorpusSpec.acl.policy_alias = "cedar"`.


<!-- https://alpha-swarm.ai/concepts/data/kb-runtime -->
# KBRuntime + KBCorpusSpec
> The single sanctioned executor for KB lifecycle.

# KBRuntime + KBCorpusSpec

## Hash-locked spec

`KBCorpusSpec` is a Pydantic v2 model that describes a single corpus:
the memory engine alias + kwargs, vector / graph / relational store
aliases, ACL evaluator + policy engine, layer scope, extraction knobs,
retention policy, and an optional Iceberg namespace for gold-tier
mirroring.

```yaml
name: research_papers
tenant_id: 00000000-0000-0000-0000-000000000010
memory_engine: { kb_alias: hierarchical_rag }
vector_store:  { kb_alias: pgvector, collection: research_papers }
graph_store:   { kb_alias: neo4j }
acl:           { evaluator_alias: native, policy_alias: opa }
layer:         { scope: private, marketplace_publishable: false }
extraction:    { enable_spacy: true, enable_llm: true }
retention:     { soft_delete_after_days: 90, hard_delete_after_days: 1095 }
```

The SHA-256 of the canonical JSON dump (sorted keys, UTF-8) anchors
the immutable `kb_corpus_spec_versions` row. Re-snapshotting via
`registry.persist_spec(spec)` inserts a new version row when the hash
changes (hard rule 57). The previous version stays for replay /
audit.

## KBRuntime

`KBRuntime.execute(req, ctx)` is the only sanctioned path through which
the five lifecycle actions (`remember`, `recall`, `compose_recall`,
`improve`, `forget`) run:

```python
from alphaswarm_kb.runtime import KBRunRequest, runtime_for

runtime = runtime_for("research_papers")
result = await runtime.execute(
    KBRunRequest(action="recall", corpus_name="research_papers",
                 payload="What is GraphRAG?", top_k=5),
    tenant_ctx,
)
```

Every call:

1. Halts if the kill-switch flag is set (`trigger_halt()` → `kb_runs`
   row with `status="halted"`).
2. Snapshots the spec via `persist_spec` so the resulting `kb_runs`
   row references the immutable spec version.
3. Resolves the `IMemoryEngine` via the composition-root container.
4. Executes the requested action.
5. Writes the `kb_runs` row carrying `experiment_id` + `test_id`
   (rule 34) and the elapsed-ms / status / error envelope.

## Wrappers

- **Celery**: `alphaswarm_kb.tasks.kb_tasks.{remember_async,recall_async,improve_async,forget_async,evaluate_async,compose_recall_async}`
  wrap `KBRuntime.execute` with `_progress.emit` (rule 4).
- **REST**: [`POST /kb/corpora/{name}/{remember,recall,improve,forget}`](https://github.com/alphaswarm/alphaswarm/blob/main/alphaswarm_kb/src/alphaswarm_kb/api/routes/kb.py)
  mounted at `/kb` by the monolith FastAPI app.
- **DataMCP**: `data.kb.*` tools (rule 59) — the only path agents
  should use.
- **WebSocket**: `/kb/corpora/{name}/recall/stream` for live recall
  streaming.

## Halt + kill-switch

`POST /kb/halt` sets the in-process halt flag. Any subsequent
`KBRuntime.execute` call raises `HaltedError` and writes a `kb_runs`
row with `status="halted"`. The topbar `KillSwitch` component fans
out to `/kb/halt` alongside every other halt endpoint.


<!-- https://alpha-swarm.ai/concepts/data/kb-silo-iac -->
# Silo-per-tenant IaC (Terragrunt + cloud-parallel modules)
> One Terragrunt unit per tenant; cloud-parallel Terraform modules with identical outputs.

# Silo-per-tenant IaC

Section G of the `alphaswarm_kb` blueprint is implemented as a
Terragrunt tree under
[`alphaswarm_platform/terragrunt/`](https://github.com/alphaswarm/alphaswarm/tree/main/alphaswarm_platform/terragrunt).

## Layout

```
alphaswarm_platform/
├── terraform/modules/
│   ├── tenant_kb_silo/                # canonical wrapper (dispatches by var.cloud)
│   ├── tenant_kb_silo_aws/            # AWS: ECS Fargate + RDS + S3 + KMS
│   ├── tenant_kb_silo_azure/          # Azure: ACA + Flex Postgres + Blob + Key Vault
│   ├── tenant_kb_silo_gcp/            # GCP: Cloud Run + Cloud SQL + GCS + KMS
│   ├── kb_global_corpus/              # central read-only stack + CDN
│   ├── kb_marketplace_federation/     # federation gateway + OpenFGA + NATS
│   ├── kb_identity_pool/              # OpenFGA Postgres + OPA bundle bucket
│   └── kb_global_observability/       # OTEL collector
└── terragrunt/
    ├── terragrunt.hcl                 # root backend + provider generators
    ├── _envcommon/                    # shared inputs (networking, observability)
    ├── global/prod/terragrunt.hcl     # kb_global_corpus
    ├── marketplace/prod/terragrunt.hcl  # kb_marketplace_federation
    ├── identity_pool/prod/terragrunt.hcl  # kb_identity_pool
    └── tenants/_template/             # copy → tenants/<tenant>/ to onboard
```

## Identical outputs

Every cloud-parallel sibling exposes the SAME outputs so the Python
adapters never branch on cloud:

| Output | Description |
| --- | --- |
| `relational_dsn` | Postgres DSN for `kb_corpora` + `kb_runs` + `kb_silo_registry`. |
| `vector_endpoint` | pgvector / Qdrant / Cognitive Search endpoint. |
| `graph_endpoint` | Neo4j / Kuzu / Neptune endpoint. |
| `container_runtime` | ECS Fargate / ACA / Cloud Run identifier. |
| `object_store_uri` | S3 / Blob / GCS bucket URI. |
| `kms_key_id` | Per-tenant CMK identifier. |

## Onboarding a tenant

```bash
T=acme-corp
mkdir -p alphaswarm_platform/terragrunt/tenants/${T}/prod
cp -r alphaswarm_platform/terragrunt/tenants/_template/* \
      alphaswarm_platform/terragrunt/tenants/${T}/

# Edit tenants/${T}/tenant.hcl with the real UUID, cloud, region.

# Production path goes through alphaswarm-cli (runs server-side via
# TerraformRuntime per rule 42; lands a workload_runs row + a
# terraform_runs row):
alphaswarm-cli kb tenant onboard ${T} --cloud aws --region us-east-1

# Break-glass operator path (skips audit; for ops emergencies only):
terragrunt run-all init  --terragrunt-working-dir alphaswarm_platform/terragrunt/tenants/${T}/prod
terragrunt run-all apply --terragrunt-working-dir alphaswarm_platform/terragrunt/tenants/${T}/prod
```

## Per-tenant state isolation

Each tenant has its own state file under the configured backend:

- S3: `s3://alphaswarm-kb-tfstate-prod/tenants/<tenant>/prod/terraform.tfstate`
- Azure Blob: `alphaswarm-kb-state/tenants/<tenant>/prod/terraform.tfstate`
- GCS: `gs://alphaswarm-kb-tfstate-prod/tenants/<tenant>/prod/terraform.tfstate`

For regulated tenants, swap the per-tenant backend block to assume a
dedicated cloud account/subscription role so physical isolation
matches the silo logical boundary.

## Offboarding

```bash
alphaswarm-cli kb tenant offboard ${T}
# wait for kb_runs to drain
terragrunt run-all destroy --terragrunt-working-dir alphaswarm_platform/terragrunt/tenants/${T}/prod
```

`cognee.forget --tenant ${T} --hard` runs first so per-tenant data is
purged before the underlying storage tears down.


<!-- https://alpha-swarm.ai/concepts/data/knowledge-base -->
# AlphaSwarm Knowledge Base (`alphaswarm_kb`)
> Boundary-package overview for the cognitive-memory layer.

# AlphaSwarm Knowledge Base

The `alphaswarm_kb` boundary owns AlphaSwarm's cognitive-memory layer.
It extracts the historical `alphaswarm/rag/` (HierarchicalRAG) and
`alphaswarm/llm/memory.py` (RedisHybridMemory) modules into a
Clean-Architecture package with a pluggable adapter trinity for memory
engines, vector stores, graph stores, ACL evaluators, and policy
engines.

## Sub-docs

- [kb-runtime.md](kb-runtime.md) — `KBRuntime` + hash-locked `KBCorpusSpec`
  + `kb_runs` ledger.
- [memory-engines.md](memory-engines.md) — IMemoryEngine adapter trinity
  (HierarchicalRAG default; Cognee / Graphiti / Mem0 / Letta /
  LlamaIndex opt-in).
- [bi-temporal-graph.md](bi-temporal-graph.md) — `PermissionedDataPoint`
  + four-timestamp model + `invalidated_by_edge_id` Graphiti-style
  edge invalidation.
- [layer-composition.md](layer-composition.md) — Four-scope precedence
  + bi-temporal merge.
- [kb-permissions.md](kb-permissions.md) — `AccessBitmap` + OpenFGA +
  OPA + Cedar hybrid stack.
- [kb-federation.md](kb-federation.md) — Cross-silo marketplace
  federation reverse-proxy.
- [kb-silo-iac.md](kb-silo-iac.md) — Terragrunt unit-per-tenant +
  cloud-parallel modules.
- [rag.md](rag.md) — Extracted hierarchical RAG (Alpha-GPT four-level).
- [pgvector-control-plane.md](pgvector-control-plane.md) — pgvector
  default vector store + `data.vector.*` MCP tools.
- [research-papers-rag.md](research-papers-rag.md) — Math-aware paper
  ingest + hybrid retrieval.

## At a glance

| Concern | Where |
| --- | --- |
| Runtime | `alphaswarm_kb.runtime.KBRuntime` (single executor, rule 56) |
| Spec | `alphaswarm_kb.spec.KBCorpusSpec` (hash-locked, rule 57) |
| Registry | `alphaswarm_kb.registry.persist_spec` → `kb_corpus_spec_versions` |
| Composition root | `alphaswarm_kb.composition_root.build_default_container` |
| Domain ports | `alphaswarm_kb.domain.ports.*` (zero framework imports) |
| Bi-temporal envelope | `alphaswarm_kb.domain.models.permissioned_datapoint.PermissionedDataPoint` |
| Adapter metaclass | `alphaswarm_kb.domain.ports.base.KBAdapterMeta` (rule 58) |
| Agent surface | `data.kb.*` DataMCP tools (rule 59) |
| Federation | `alphaswarm_kb_federation/` standalone reverse-proxy (rule 60) |

## Why the boundary

See [ADR-014](../../architecture/decisions/014-knowledge-base-boundary.md)
for the full rationale.

## Hard rules

- **56**: All KB lifecycle goes through `KBRuntime`.
- **57**: `kb_corpus_spec_versions` rows are immutable.
- **58**: Adapters register via `KBAdapterMeta`.
- **59**: Agents read KB only through `data.kb.*` tools.
- **60**: Cross-silo recall goes through `alphaswarm_kb_federation` only.

## Migration

Legacy `alphaswarm.rag.*` and `alphaswarm.llm.memory` import paths
keep working through `DeprecationWarning` shims for one release cycle.
New code imports from `alphaswarm_kb.rag.*` and
`alphaswarm_kb.memory.*` directly.

## Deprecations

- **Kuzu graph store** — upstream archived October 2025. The
  `alphaswarm_kb` extra is `kuzu-deprecated`; the adapter warns on
  import and will be removed after one release cycle. Migrate
  `KBCorpusSpec.graph_store.kb_alias` to `neo4j` (default),
  `falkordb`, or `memgraph`. See
  [memory-engines.md](memory-engines.md#graph-store-note-kuzu-is-deprecated).


<!-- https://alpha-swarm.ai/concepts/data/layer-composition -->
# KBLayerComposer — four-scope precedence
> Private > hierarchical > marketplace > global with bi-temporal merge.

# KBLayerComposer

`KBLayerComposer.compose_recall` composes recall across the four
canonical layer scopes:

| Scope | Source | Precedence |
| --- | --- | --- |
| `PRIVATE` | The tenant's own silo | 0 (highest) |
| `HIERARCHICAL` | Parent organisation (read-only, replicated) | 1 |
| `MARKETPLACE` | Subscribed external corpora (federated) | 2 |
| `GLOBAL` | Curated read-only platform corpus | 3 (lowest) |

Smaller `precedence` wins.

## Resolution flow

1. `resolve_layers(ctx)` returns the active `LayerHandle` list for the
   tenant. Defaults to `[PRIVATE]`; populates the other three when the
   tenant's `kb_subscriptions` rows + parent-org link + global-corpus
   replication offsets are set.
2. For each layer, fan out the recall:
   - `PRIVATE` runs against the tenant's own `IMemoryEngine` (default
     `HierarchicalRAGAdapter`).
   - `HIERARCHICAL` / `MARKETPLACE` / `GLOBAL` run against the
     `alphaswarm_kb_federation` reverse-proxy when the layer has a
     `federation_endpoint`.
3. Apply `BitemporalMerger.merge_layers` to dedupe by entity id with
   precedence-aware ordering. Losers land in
   `metadata.dissenting_layers`.
4. Re-rank by `(precedence, score, recency)`.
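
Steps 3 and 4 can be sketched roughly as follows. This is a simplified illustration, not the `BitemporalMerger` implementation: the `Hit` dataclass, its fields, and the convention that a larger `recency` means newer are assumptions made for the example.

```python
from dataclasses import dataclass, field

# Precedence values from the scope table; smaller wins.
PRECEDENCE = {"PRIVATE": 0, "HIERARCHICAL": 1, "MARKETPLACE": 2, "GLOBAL": 3}


@dataclass
class Hit:  # hypothetical result shape
    entity_id: str
    layer: str
    value: str
    score: float
    recency: float  # assumption: larger means more recent
    metadata: dict = field(default_factory=dict)


def merge_layers(hits):
    """Dedupe by entity id; the layer with the smallest precedence wins."""
    winners = {}
    for hit in sorted(hits, key=lambda h: PRECEDENCE[h.layer]):
        if hit.entity_id not in winners:
            hit.metadata.setdefault("dissenting_layers", [])
            winners[hit.entity_id] = hit
        else:
            # Losers are kept on the winner instead of being dropped.
            winners[hit.entity_id].metadata["dissenting_layers"].append(
                {"layer": hit.layer, "value": hit.value}
            )
    return list(winners.values())


def rerank(hits):
    """Order by precedence ascending, then score and recency descending."""
    return sorted(hits, key=lambda h: (PRECEDENCE[h.layer], -h.score, -h.recency))
```

With a `PRIVATE` and a `GLOBAL` hit for the same entity, `rerank(merge_layers(hits))` returns the `PRIVATE` hit and records the `GLOBAL` value under `metadata["dissenting_layers"]`.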

## Conflict resolution

| Conflict | Resolution |
| --- | --- |
| Same entity, different value | Higher-precedence layer wins; loser exposed in `dissenting_layers`. |
| Temporal disagreement (`valid_to`/`valid_from` overlap) | Both kept; downstream `as_of` reconstructs the timeline. |
| Edge contradiction (new edge supersedes old) | Old edge's `expired_at = now()`; not deleted. |
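
To make the temporal row concrete, the sketch below shows one way a downstream `as_of` filter could reconstruct the state at a point in time. The `Fact` dataclass and the half-open interval `[valid_from, valid_to)` are assumptions for illustration; the real envelope is `PermissionedDataPoint` and may use different field semantics.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Optional


@dataclass(frozen=True)
class Fact:  # hypothetical, simplified envelope
    entity_id: str
    value: str
    valid_from: datetime
    valid_to: Optional[datetime] = None  # None means still valid


def as_of(facts, when):
    """Return the facts whose validity interval contains `when`."""
    return [
        f
        for f in facts
        if f.valid_from <= when and (f.valid_to is None or when < f.valid_to)
    ]


def utc(year, month, day):
    """Small helper for building timezone-aware timestamps."""
    return datetime(year, month, day, tzinfo=timezone.utc)
```

Because both versions are kept, a query such as `as_of(facts, utc(2024, 6, 1))` returns whichever version was valid on that date, and a later date returns its successor.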

## Caching + invalidation

- Per-subscriber result cache keyed by
  `(subscriber_tenant, source_tenant, dataset, query_hash)` with a
  60s TTL (default).
- OpenFGA Watch events for `subscription.{granted,revoked,updated}`
  flush impacted cache entries via the
  `alphaswarm:kb:bitmap` Redis pub/sub channel.
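
A minimal in-process sketch of the cache key and TTL behaviour described above. The class `RecallCache`, the SHA-256 query hash, and the `flush_subscription` method are assumptions for illustration; the production cache and its Redis pub/sub wiring are not shown here.

```python
import hashlib
import time


class RecallCache:  # hypothetical stand-in for the per-subscriber cache
    def __init__(self, ttl_seconds=60.0, clock=time.monotonic):
        self._ttl = ttl_seconds
        self._clock = clock
        self._entries = {}

    @staticmethod
    def key(subscriber_tenant, source_tenant, dataset, query):
        # Hash the query so the key stays small and uniform.
        query_hash = hashlib.sha256(query.encode("utf-8")).hexdigest()
        return (subscriber_tenant, source_tenant, dataset, query_hash)

    def put(self, key, value):
        self._entries[key] = (self._clock() + self._ttl, value)

    def get(self, key):
        entry = self._entries.get(key)
        if entry is None:
            return None
        expires_at, value = entry
        if self._clock() >= expires_at:
            del self._entries[key]  # expired: evict lazily on read
            return None
        return value

    def flush_subscription(self, subscriber_tenant, source_tenant):
        """Drop every entry for one subscriber/source pair."""
        stale = [k for k in self._entries if k[:2] == (subscriber_tenant, source_tenant)]
        for k in stale:
            del self._entries[k]
        return len(stale)
```

A handler for a `subscription.revoked` event would call `flush_subscription` with the affected tenant pair, so the next recall misses the cache instead of waiting for the TTL.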

## When NOT to use compose_recall

- Single-corpus recall against your tenant's own private corpus →
  use `data.kb.recall` (faster, no federation overhead).
- Bulk re-indexing / improve / forget — these are always per-corpus.


<!-- https://alpha-swarm.ai/concepts/data/memory-engines -->
# IMemoryEngine adapter trinity
> HierarchicalRAG default plus Cognee / Graphiti / Mem0 / Letta / LlamaIndex opt-in.

# IMemoryEngine adapter trinity

`IMemoryEngine` is the vendor-neutral memory control plane. Cognee's
v1.0 surface (`remember` / `recall` / `improve` / `forget`) is the
canonical contract; every adapter translates at its boundary.

## Comparison

| Adapter | `kb_alias` | Extras | Primary strength | Trade-off |
| --- | --- | --- | --- | --- |
| `HierarchicalRAGAdapter` | `hierarchical_rag` | none (default) | 4-level Alpha-GPT hierarchy + Reciprocal Rank Fusion + RAPTOR summaries | AlphaSwarm-native; not bi-temporal |
| `CogneeMemoryEngine` | `cognee` | `[cognee]` | Tri-store (graph + vector + relational) + native EBAC + multimodal ingest | Heavy dep; LanceDB+Kuzu only for native ACL |
| `GraphitiMemoryEngine` | `graphiti` | `[graphiti]` | Bi-temporal edges, sub-300ms p95, no runtime LLM calls | Neo4j only |
| `Mem0MemoryEngine` | `mem0` | `[mem0]` | User-centric personalisation, 12-layer cognitive memory | Less structural extraction |
| `LettaMemoryEngine` | `letta` | `[letta]` | Full agent runtime integration | Heavy; not pure memory |
| `LlamaIndexMemoryEngine` | `llamaindex` | `[llamaindex]` | General-purpose vector backbone, big plugin ecosystem | No native temporal model |

## Choosing an engine

Default to `hierarchical_rag` unless a corpus has a specific need:

- **Bi-temporal facts** that change over time (CEO succession, deal
  status) → `graphiti`.
- **User-scoped personalisation** that needs cross-session identity →
  `mem0`.
- **Multimodal pipelines** with heavy LLM-driven entity extraction +
  cross-store coherence → `cognee`.
- **General-purpose document QA** with the LlamaIndex plugin
  ecosystem → `llamaindex`.

## Switching engines

Set `KBCorpusSpec.memory_engine.kb_alias` and re-snapshot. The
`KBRuntime` picks up the new adapter on the next call. The previous
spec version stays in `kb_corpus_spec_versions` so any in-flight
recall against the old version can replay.

## Graph-store note: Kuzu is deprecated

Upstream Kuzu was archived in October 2025 and receives no further
releases. The `alphaswarm_kb` extra was renamed `kuzu` ->
`kuzu-deprecated`, and importing
`alphaswarm_kb.infrastructure.adapters.graph.kuzu_deprecated` emits a
`DeprecationWarning`. New corpora MUST choose `neo4j` (default),
`falkordb`, or `memgraph` for `KBCorpusSpec.graph_store.kb_alias`; the
deprecated extra exists only to keep legacy corpora readable during
migration and will be removed after one release cycle. This also
constrains `CogneeMemoryEngine`'s native-ACL tri-store mode (LanceDB +
Kuzu) — prefer the OpenFGA/OPA permission stack instead
([kb-permissions.md](kb-permissions.md)).

## Adding a new adapter

1. Subclass `IMemoryEngine` under
   `alphaswarm_kb/src/alphaswarm_kb/infrastructure/adapters/memory/`.
2. Set `kb_kind = "memory_engine"` + `kb_alias = "your_alias"`. The
   `KBAdapterMeta` metaclass auto-registers (rule 58).
3. Add the optional dep to `alphaswarm_kb/pyproject.toml` extras.
4. Add a default-kwargs YAML under
   `alphaswarm_kb/configs/memory_engines/your_alias_default.yaml`.
5. Wire the eager import behind `contextlib.suppress(Exception)` in
   `alphaswarm_kb/src/alphaswarm_kb/__init__.py`.
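
The auto-registration in step 2 follows a common metaclass pattern, sketched below. `AdapterMeta`, `ADAPTER_REGISTRY`, `MemoryEngine`, and `EchoEngine` are hypothetical stand-ins written for this example; `KBAdapterMeta` itself may key or validate registrations differently.

```python
from abc import ABCMeta, abstractmethod

ADAPTER_REGISTRY = {}  # hypothetical: (kb_kind, kb_alias) -> class


class AdapterMeta(ABCMeta):
    """Register every class that declares its own kb_alias."""

    def __init__(cls, name, bases, namespace):
        super().__init__(name, bases, namespace)
        kind = getattr(cls, "kb_kind", None)  # may be inherited
        alias = namespace.get("kb_alias")  # must be set on the class itself
        if kind and alias:
            ADAPTER_REGISTRY[(kind, alias)] = cls


class MemoryEngine(metaclass=AdapterMeta):
    kb_kind = "memory_engine"

    @abstractmethod
    def recall(self, query):
        """Return the items matching `query`."""


class EchoEngine(MemoryEngine):
    kb_alias = "echo"

    def recall(self, query):
        return [query]
```

Defining the subclass is enough to register it, which is why the module only needs to be imported (step 5) for the adapter to become selectable by alias.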


<!-- https://alpha-swarm.ai/concepts/data/pgvector-control-plane -->
# pgvector control plane
> Default BaseVectorStore; data.vector.* MCP tools; alembic 0045 + 0088.

# pgvector control plane

pgvector is the default `BaseVectorStore` adapter for the KB layer.
`PgVectorStore` wraps the extracted
`alphaswarm_kb.rag.pgvector_store.PgVectorStore` behind the
`BaseVectorStore` port so the standard KBRuntime can target the
existing pgvector control plane without leaking adapter specifics.

## Migration

- `alembic/versions/0045_pgvector_phase3.py` — pgvector indexes on the
  three allow-listed tables (`rag_chunks`, `codebase_symbol_embeddings`,
  `ml_feature_vectors`).
- `alembic/versions/0088_alphaswarm_kb_specs.py` — the nine KB tables
  (`kb_corpora`, `kb_runs`, `kb_subscriptions`, ...).

## Agent surface

| Tool | Purpose |
| --- | --- |
| `data.vector.search` | Free-text or pre-computed embedding ANN over the allow-listed tables. |
| `data.vector.upsert` | (Step-up gated) write through `PgVectorStore.upsert`. |
| `data.vector.delete` | (Step-up gated) targeted delete by id. |
| `data.embeddings.compute` | Compute an embedding via the central embedder. |

## Adding a new pgvector-backed table

1. Add the migration under `alembic/versions/` with a `Vector(N)`
   column + an HNSW index.
2. Extend the `_ALLOWED_TABLES` whitelist in
   `alphaswarm/data/mcp/tools/vector.py`.
3. Add an `EntityPicker kind` for the table to
   `alphaswarm/cache/keys.py`.
4. Add a `BaseDataset` kind under `alphaswarm/data/datasets/kinds/`
   if the table is also surfaced through the dataset catalog.
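
Step 2 matters because table names cannot be bound as SQL parameters, so they have to be checked against a fixed set before being interpolated. The sketch below illustrates that guard. The function `build_search_sql` and the `id` / `embedding` column names are assumptions; `<=>` is pgvector's cosine-distance operator.

```python
# Mirrors the three allow-listed tables named in the migration notes.
_ALLOWED_TABLES = frozenset(
    {"rag_chunks", "codebase_symbol_embeddings", "ml_feature_vectors"}
)


def build_search_sql(table, limit=10):
    """Build an ANN query for an allow-listed table (hypothetical helper)."""
    if table not in _ALLOWED_TABLES:
        raise ValueError(f"table {table!r} is not allow-listed")
    if isinstance(limit, bool) or not isinstance(limit, int) or limit <= 0:
        raise ValueError("limit must be a positive integer")
    # Only validated values are interpolated; the query vector is bound
    # by the database driver through the %(query)s parameter.
    return (
        f"SELECT id, embedding <=> %(query)s AS distance "
        f"FROM {table} ORDER BY distance LIMIT {limit}"
    )
```

Any table that is missing from the set is rejected before a query string is ever built, so adding a table to the migration without extending the allow-list leaves it unreachable.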


<!-- https://alpha-swarm.ai/concepts/data/rag -->
# Hierarchical RAG (extracted)
> Four-level Alpha-GPT hierarchy on Redis + pgvector; default IMemoryEngine for KB corpora.

# Hierarchical RAG

`HierarchicalRAG` lives at `alphaswarm_kb.rag.HierarchicalRAG`
(extracted from the legacy `alphaswarm/rag/` tree per ADR-014). It is
the **default** `IMemoryEngine` adapter for every `KBCorpusSpec` that
doesn't explicitly choose another engine.

## Four levels

Implements the Alpha-GPT *"Human-AI Interactive Alpha Mining"* design:

| Level | Purpose |
| --- | --- |
| **L0** | Alpha / decision base — past `agent_decisions`, `equity_reports`, `backtest_runs` outcomes. |
| **L1** | High-level categories (`price_volume`, `fundamental`, `news_sentiment`, `regulatory`). |
| **L2** | Sub-categories (`earnings_call`, `disclosures`, `cfpb_complaint`, ...). |
| **L3** | Specific data fields / chunks — individual narratives + paragraphs. |

Plus four orthogonal data "orders":

- **first** — bars / trades / performance.
- **second** — SEC filings / fundamentals / ratios.
- **third** — CFPB / FDA / USPTO regulatory data.
- **theory** — research papers + code chunks.

## Public surface

| Symbol | Use |
| --- | --- |
| `HierarchicalRAG` | Top-level facade. |
| `HierarchicalRAG.query` | Direct vector search at one level (optional reranker + compressor). |
| `HierarchicalRAG.query_hybrid` | Dense + sparse hybrid via Reciprocal Rank Fusion. |
| `HierarchicalRAG.walk` | Top-down `L0 → L1 → L2 → L3` autonomous navigation. |
| `HierarchicalRAG.recall_for_prompt` | Markdown block ready for prompt injection. |
| `HierarchicalRAG.index_chunks` / `index_summary` | Write paths. |
| `HierarchicalRAG.precompute_l0_alpha_base` | Bulk-index past decisions. |
| `get_default_rag()` | Process-wide cached singleton. |
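
For readers unfamiliar with Reciprocal Rank Fusion, the fusion step behind `query_hybrid` can be sketched as below. This is the textbook formula, where each list contributes `1 / (k + rank)` per document; it is not the AlphaSwarm code. The function name and the constant `k=60` (a commonly used default) are assumptions.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of document ids into one ranking."""
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties are broken by id for determinism.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


dense = ["a", "b", "c"]   # e.g. vector-similarity order
sparse = ["b", "d", "a"]  # e.g. keyword-match order
fused = reciprocal_rank_fusion([dense, sparse])
```

Documents that appear near the top of both lists (`b` and `a` here) outrank documents that appear in only one, which is the property that makes the method useful for hybrid dense plus sparse retrieval.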

## Indexer registry

`alphaswarm_kb.rag.indexers.INDEXER_REGISTRY` maps every corpus slug to
its indexer callable. Add a new corpus by writing an indexer that
takes the source rows, renders them as text, chunks them via
`alphaswarm_kb.rag.chunker.semantic_chunks`, and calls
`rag.index_chunks(corpus, ...)`.
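
The shape of such an indexer can be sketched as follows. Everything in this block is a stand-in: `register_indexer`, `naive_chunks` (a word-count splitter used in place of `semantic_chunks`), the `demo_notes` corpus, and the `index_chunks` callable passed in by the caller are all hypothetical.

```python
INDEXER_REGISTRY = {}  # hypothetical local copy: corpus slug -> indexer


def register_indexer(corpus):
    """Decorator that records an indexer under its corpus slug."""
    def decorator(fn):
        INDEXER_REGISTRY[corpus] = fn
        return fn
    return decorator


def naive_chunks(text, max_words=50):
    """Stand-in chunker: fixed-size word windows."""
    words = text.split()
    return [" ".join(words[i:i + max_words]) for i in range(0, len(words), max_words)]


@register_indexer("demo_notes")
def index_demo_notes(rows, index_chunks):
    """Render rows as text, chunk them, and hand them to the write path."""
    chunks = []
    for row in rows:
        text = row["title"] + "\n" + row["body"]
        chunks.extend(naive_chunks(text))
    index_chunks("demo_notes", chunks)
    return len(chunks)
```

A caller would look the indexer up by slug, for example `INDEXER_REGISTRY["demo_notes"](rows, rag.index_chunks)`, assuming the write path accepts a corpus name and a list of chunks.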

## Storage backend

Redis (RediSearch) is the default vector store; pgvector is the
production-grade backend (Phase 3 refactor). `RedisVectorStore` and
`PgVectorStore` expose the same surface, which `HierarchicalRAG`
consumes through composition.

## Backward compatibility

`alphaswarm.rag.*` and `alphaswarm.rag.indexers.*` are
`DeprecationWarning` shims that re-export from `alphaswarm_kb.rag.*`.
Old call sites keep working for one release cycle.


<!-- https://alpha-swarm.ai/concepts/data/research-papers-rag -->
# Research-papers RAG
> Math-aware paper ingest (Marker / Nougat / MathPix / PyPDF) + hybrid retrieval.

# Research-papers RAG

The `research_papers` corpus is one of the bundled
[`KBCorpusSpec`](https://github.com/alphaswarm/alphaswarm/blob/main/alphaswarm_kb/configs/corpora/research_papers.yaml)
templates. It ingests PDFs through the math-aware parser chain in
`alphaswarm_kb.rag.parsers` and indexes them into the `research_papers`
RAG corpus.

## Parser chain

`alphaswarm_kb.rag.parsers.pick_parser(path)` selects the right parser
based on the document's math density + complexity:

| Parser | Use |
| --- | --- |
| `MarkerParser` | Default — fast, math-aware Marker pipeline. |
| `NougatParser` | Heavy LaTeX/equation density (Nougat from Meta). |
| `MathPixParser` | Highest fidelity for handwriting or scanned PDFs (MathPix API). |
| `PyPDFParser` | Fast text-only fallback. |

## Upload + ingest

```python
from alphaswarm_kb.rag.indexers.research_papers_indexer import index_research_papers

n_chunks = index_research_papers(paper_ids=["paper-uuid"])
```

Or via the REST surface:

```http
POST /rag/papers/upload   # upload PDF
POST /rag/papers/{id}/ingest
POST /rag/papers/{id}/synthesize   # downstream strategy synthesis
```

The Celery wrappers in `alphaswarm_kb.tasks.kb_tasks.ingest_research_paper`
+ `synthesize_strategy_from_paper` preserve the legacy
`alphaswarm.tasks.research_paper_tasks` surface via shims.

## Retrieval

Use `data.kb.recall` with `corpus_name="research_papers"`. The
`HierarchicalRAG.query_hybrid` path is preferred for papers because
exact-token matches (theorem names, variable symbols) matter as much
as semantic similarity.

## Strategy synthesis

The `synthesize_strategy_from_paper` task pipes hybrid recall results
through `router_complete` (rule 2) and returns a YAML strategy stub
the Strategy Composer can load.


<!-- https://alpha-swarm.ai/concepts/identity/account-integrations -->
# Account integrations
> Per-org HuggingFaceHub + DockerHub credential links for company accounts. PATs are validated against the upstream API on connect, persisted encrypted via the credential resolver, and surfaced through the admin BFF as health-checked records.

# Account integrations

Per-org credential links the admin operator wires through the
**`alphaswarm_admin`** Next.js surface. Six integration kinds ship today:

| Kind | Persisted under | Wizard | Backend |
| --- | --- | --- | --- |
| `huggingface` | `CredentialKey("huggingface", "org:")` | `frontend/components/accounts/HuggingFaceWizard.tsx` | `src/alphaswarm_admin/providers/huggingface.py` |
| `docker_hub` | `CredentialKey("docker_hub", "org:")` | `frontend/components/accounts/DockerHubWizard.tsx` | `src/alphaswarm_admin/providers/dockerhub.py` |
| `cloud_aws` | `CredentialKey("cloud_aws", "org:")` | `frontend/components/cloud/CloudOnboardingWizard.tsx` | `src/alphaswarm_admin/providers/cloud_aws.py` |
| `cloud_azure` | `CredentialKey("cloud_azure", "org:")` | `frontend/components/cloud/CloudOnboardingWizard.tsx` | `src/alphaswarm_admin/providers/cloud_azure.py` |
| `cloud_gcp` | `CredentialKey("cloud_gcp", "org:")` | `frontend/components/cloud/CloudOnboardingWizard.tsx` | `src/alphaswarm_admin/providers/cloud_gcp.py` |
| `cloud_cloudflare` | `CredentialKey("cloud_cloudflare", "org:")` | `frontend/components/cloud/CloudOnboardingWizard.tsx` | `src/alphaswarm_admin/providers/cloud_cloudflare.py` |

The four `cloud_*` kinds use the same `AccountIntegrationProvider`
ABC but extend it with a 5-step wizard contract
(`bootstrap_artifacts` → `validate_identity` →
`validate_permissions` → `enumerate_resources` → `connect`) and
are exclusively **federated-first** — no long-lived secrets are
stored. See
[Connect a company cloud account](../../how-to/operations/connect-company-cloud-account.md)
for the full runbook.

The two PAT-based kinds (`huggingface` and `docker_hub`) share the same
`AccountIntegrationProvider` ABC defined in
`alphaswarm_admin/src/alphaswarm_admin/providers/base.py` and the same encrypted
file-backed store at
`alphaswarm_admin/src/alphaswarm_admin/services/integration_store.py`.

## Lifecycle

```mermaid
flowchart LR
  Op["operator"] -->|PAT| FE["HF/Docker wizard"]
  FE -->|"POST /admin/accounts/{org_id}/integrations/{kind}"| BFF["alphaswarm_admin BFF"]
  BFF -->|whoami / login| HUB["upstream provider"]
  HUB -->|valid| BFF
  BFF -->|"Fernet-encrypt + persist"| STORE["IntegrationCredentialStore"]
  BFF -->|metadata only| FE
  FE -->|status badge| Op
```

The flow is audit-first (see `alphaswarm_admin/src/alphaswarm_admin/api/routers/integrations.py`)
and step-up MFA gated (`require_admin_step_up("admin:cluster")`). The
PAT itself **never** crosses the BFF response boundary after the
initial connect call — the wizard renders the masked metadata
(`namespace`, `status`, `connected_at`) only.
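
One way to enforce a "metadata only" response is an explicit allow-list projection, sketched below. The `IntegrationRecord` dataclass, its `encrypted_token` field, and `to_public` are hypothetical; the real record and serializer live in the `alphaswarm_admin` package and may differ.

```python
from dataclasses import dataclass


@dataclass
class IntegrationRecord:  # hypothetical storage shape
    org_id: str
    kind: str
    namespace: str
    status: str
    connected_at: str
    encrypted_token: bytes  # never leaves the backend


# Fields that may cross the response boundary. Anything not listed
# here is dropped, including fields added to the record later.
PUBLIC_FIELDS = ("org_id", "kind", "namespace", "status", "connected_at")


def to_public(record):
    """Project a stored record onto its public metadata."""
    return {name: getattr(record, name) for name in PUBLIC_FIELDS}
```

Projecting through an allow-list rather than deleting the secret field means a newly added sensitive attribute stays private by default.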

## HuggingFace Hub

### What you need

- A **fine-grained PAT** with read access on the org's models /
  datasets. Generate at [https://huggingface.co/settings/tokens](https://huggingface.co/settings/tokens).
- The org's HuggingFace namespace (e.g. `acme-quant`). Optional —
  the BFF derives it from `HfApi().whoami()` when omitted.

### Wire-format

```http
POST /admin/accounts/{org_id}/integrations/huggingface
Authorization: Bearer 
Content-Type: application/json

{
  "token": "hf_*****",
  "namespace": "acme-quant"
}
```

```json
{
  "integration": {
    "org_id": "org-acme",
    "kind": "huggingface",
    "namespace": "acme-quant",
    "credential_key": "huggingface:org:org-acme",
    "status": "healthy",
    "connected_at": "2026-05-27T20:00:00Z",
    "last_health_at": null,
    "error": null,
    "metadata": { "type": "org", "auth_email": "ops@example.com", "orgs": ["acme-quant"] }
  },
  "audit_run_id": "..."
}
```

### Revocation

`DELETE /admin/accounts/{org_id}/integrations/huggingface` drops the
**local** record. **Always** revoke the PAT on the HuggingFace side
too — settings → Personal access tokens → Revoke. Without that step
the PAT remains usable from any source that holds the bytes.

## Docker Hub

### What you need

- A **Docker Hub PAT** (Account → Personal access tokens) with the
  intended scope. Username + PAT together are required (Docker Hub
  v2 login does not accept PAT-only).
- The Docker Hub namespace (defaults to the username on connect).

### Wire-format

```http
POST /admin/accounts/{org_id}/integrations/docker_hub
{
  "username": "acmeops",
  "pat": "dckr_pat_*****",
  "namespace": "acmeops"
}
```

The BFF posts to `https://hub.docker.com/v2/users/login` to mint a
JWT (proving the credential is valid), then `/v2/users/{namespace}/`
to confirm namespace scope. Both calls happen server-side; only
metadata returns to the wizard.

### Revocation

`DELETE /admin/accounts/{org_id}/integrations/docker_hub` drops the
local record only. Docker Hub does NOT expose a PAT-revocation API,
so you **must** delete the PAT manually in
`Account → Security → Personal access tokens` to fully terminate
access.

## Health checks

The wizard's "Re-check" action calls
`POST /admin/accounts/{org_id}/integrations/{kind}/health` which
re-runs the `whoami` (HuggingFace) or `login` + namespace probe
(Docker Hub). The result lands on the row's `last_health_at` /
`last_health_status` fields and is rendered as a badge.

## Operator runbook

| Scenario | Action |
| --- | --- |
| PAT expired upstream | Re-run the wizard; the new PAT replaces the encrypted blob in-place. |
| PAT compromised | Revoke upstream first, then disconnect locally. |
| Switching org owners | Disconnect, delete the PAT upstream, have the new owner connect with their own PAT. |
| Lost encryption key (`ALPHASWARM_ADMIN_INTEGRATIONS_KEY`) | Drop the local store JSON, rotate the encryption key, re-run all connect wizards. The upstream PATs are unaffected. |

## Configuration

| Env var | Purpose | Default |
| --- | --- | --- |
| `ALPHASWARM_ADMIN_INTEGRATIONS_PATH` | JSON file the encrypted store writes to. | `~/.alphaswarm/integrations.json` |
| `ALPHASWARM_ADMIN_INTEGRATIONS_KEY` | Fernet key used to encrypt PATs. **Required in production.** | (ephemeral key minted per process — refused in production by `IntegrationCredentialStore.assert_production_ready`) |

Generate a Fernet key with:

```python
from cryptography.fernet import Fernet
print(Fernet.generate_key().decode())
```

Persist the key in your platform secret manager and inject it as an
env var; the store reads it once at boot.


<!-- https://alpha-swarm.ai/concepts/identity/account-management -->
# Account management
> The `/auth/profile` surface is the end-user account center for identity, security, session control, connected providers, and tenancy membership management. It keeps sensitive account operations in one...

# Account management

> **alphaswarm_admin (internal) note** — the internal admin BFF at
> `manage.alpha-swarm.ai` is Entra-only since the alphaswarm_admin Entra
> refactor (`.cursor/plans/alphaswarm_admin_entra_refactor_039f2aeb.plan.md`).
> Service identity flows through per-deployment Entra Agent
> Identities; see [admin-agent-identity.md](admin-agent-identity.md).
> Auth0 remains the customer-facing path for the public
> `app.alpha-swarm.ai` cloud frontend described below.

## 1) Overview

The `/auth/profile` surface is the end-user account center for identity, security, session control, connected providers, and tenancy membership management. It keeps sensitive account operations in one place while delegating authentication authority to Auth0.

```mermaid
flowchart LR
    A[Profile] --> B[Security]
    B --> C[Sessions]
    C --> D[Connections]
    D --> E[Tenancy]
    E --> F[Notifications]
    F --> G[Danger Zone]
```

## 2) Profile tab

The Profile tab shows display name, avatar, and provider badge. Email is read-only because the canonical identity record is managed by Auth0.

## 3) Security tab

The Security tab includes:

- `PasswordChangeCard`: creates an Auth0 password-change ticket URL and redirects the user through the hosted reset flow.
- `MfaFactorsCard`: lists and manages MFA enrollment for TOTP, SMS, and WebAuthn factors.
- `RecentActivityCard`: displays the last 10 security-relevant audit events.

## 4) Sessions tab

The Sessions tab lists active sessions with browser, device, IP, approximate location, and last activity. Users can revoke individual sessions, or run a global "Sign out everywhere" action with friction confirmation.

## 5) Connections tab

The Connections tab supports linking and unlinking identity providers such as Microsoft, Google, Auth0 Database, and GitHub.

## 6) Tenancy tab

The Tenancy tab shows memberships, supports org/workspace switching, and exposes a user-level "Leave organization" action. Admin onboarding and tenancy administration are handled in separate admin routes.

## 7) Notifications tab

Notifications is a placeholder in v1 and reserved for a future v2 notification preferences model.

## 8) Danger Zone

Danger Zone contains permanent account-deletion actions gated by typed-email confirmation.

## What an admin can additionally do

Admins can use:

- [`/admin/onboarding`](/admin/onboarding) for onboarding flows including `EntraTenantLinkWizard`.
- [`/admin/users`](/admin/users) for user administration.

## What happens on the backend

Key backend modules:

- Auth0 Management API client: [`alphaswarm/auth/management_api.py`](../alphaswarm/auth/management_api.py)
- `/me/*` route module: [`alphaswarm/api/routes/me.py`](../alphaswarm/api/routes/me.py)
- Invite lifecycle routes: [`alphaswarm/api/routes/invites.py`](../alphaswarm/api/routes/invites.py)
- Audit emit helper: [`alphaswarm/auth/audit.py`](../alphaswarm/auth/audit.py)


<!-- https://alpha-swarm.ai/concepts/identity/admin-agent-identity -->
# concepts/identity/admin-agent-identity

# alphaswarm_admin — Microsoft Entra Agent Identity

> Last refreshed: 2026-05-27.
> Status: implementation of the alphaswarm_admin Entra refactor
> (`.cursor/plans/alphaswarm_admin_entra_refactor_039f2aeb.plan.md`).
> See also: [entra-internal-tenant.md](entra-internal-tenant.md) and
> [identity.md](identity.md).

The `alphaswarm_admin` BFF authenticates to `alphaswarm_controller`,
the AlphaSwarm monolith, and (eventually) any other downstream service via
a **per-deployment Microsoft Entra Agent Identity** instead of a shared
client_credentials service principal. Each deployment (dev / staging /
prod) gets its own `sub` claim in minted tokens so audit trails and
RBAC routing remain clean even when the same Blueprint backs every
environment.

This page is the operator + agent reference for the model.

## Three-layer object graph

```mermaid
flowchart LR
  subgraph entra [AlphaSwarm staff Entra tenant]
    bp["Agent Identity Blueprint<br/>alphaswarm-admin-service"]
    bpp[BlueprintPrincipal]
    aid_dev[Agent Identity admin-dev]
    aid_staging[Agent Identity admin-staging]
    aid_prod[Agent Identity admin-prod]
    fic["Federated Identity Credential<br/>AKS workload identity"]
    api["Manage-API Resource Server<br/>+ AdminService app role"]
    bp --> bpp --> aid_dev
    bpp --> aid_staging
    bpp --> aid_prod
    fic -. parent token .-> bp
    aid_dev -. AdminService role .-> api
    aid_staging -. AdminService role .-> api
    aid_prod -. AdminService role .-> api
  end
  admin_dev["alphaswarm_admin pod (dev)"] -.->|fmi_path exchange| aid_dev
  admin_prod["alphaswarm_admin pod (prod)"] -.->|fmi_path exchange| aid_prod
```

| Layer | Resource | Provider |
| --- | --- | --- |
| 1 | Agent Identity Blueprint | `azapi_resource` against `Microsoft.Graph/applications/microsoft.graph.agentIdentityBlueprint` |
| 2 | BlueprintPrincipal (mandatory second step) | `azapi_resource` against `Microsoft.Graph/servicePrincipals/microsoft.graph.agentIdentityBlueprintPrincipal` |
| 3 | Per-environment Agent Identity | `azapi_resource` against `Microsoft.Graph/servicePrincipals/microsoft.graph.agentIdentity` |
| 4 | Federated Identity Credential | `azuread_application_federated_identity_credential` on the Blueprint |
| 5 | App role assignment | `azapi_resource` against `Microsoft.Graph/servicePrincipals/appRoleAssignedTo` |

Terraform module:
[`alphaswarm_platform/terraform/modules/alphaswarm_admin_agent_identity/`](../../../../alphaswarm_platform/terraform/modules/alphaswarm_admin_agent_identity/).

## Two-step `fmi_path` exchange

At runtime each pod mints an Agent-Identity-bound access token via the
two-step exchange documented in the `entra-agent-id` skill:

```mermaid
sequenceDiagram
  participant Admin as alphaswarm_admin
  participant Entra as Entra token endpoint
  Admin->>Entra: 1. POST /oauth2/v2.0/token<br/>grant=client_credentials<br/>scope=api://AzureADTokenExchange/.default<br/>client_assertion=
  Entra-->>Admin: parent_token
  Admin->>Entra: 2. POST /oauth2/v2.0/token<br/>grant=client_credentials<br/>scope=/.default<br/>fmi_path=alphaswarm-admin-<br/>fmi_target_id=<br/>requested_token_use=on_behalf_of<br/>assertion=parent_token
  Entra-->>Admin: agent_token (sub=agent_sp_id, aud=)
```

The exchange lives at
[`alphaswarm_core.auth.providers.msal_entra.MsalEntraValidator.acquire_agent_token`](../../../../alphaswarm_core/src/alphaswarm_core/auth/providers/msal_entra.py).

## CredentialResolver integration

The admin BFF wires the Agent Identity flow through the existing
`SecretStore` chain so route handlers never see the token directly.

```python
from alphaswarm_core.credentials.stores import (
    EntraAgentIdentityCredentialResolver,
    EntraAgentIdentitySecretStore,
)
from alphaswarm_core.auth.providers.msal_entra import MsalEntraValidator

store = EntraAgentIdentitySecretStore(
    validator=MsalEntraValidator(
        tenant="",
        audience="api://alphaswarm-controller",
    ),
    resolvers=(
        EntraAgentIdentityCredentialResolver(
            credential_key=CredentialKey(
                service="alphaswarm-admin-to-cp",
                purpose="client_credentials",
            ),
            audience="api://alphaswarm-controller",
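            blueprint_app_id="<blueprint-app-id>",  # placeholder: Blueprint app id
            agent_identity_id="<agent-identity-id>",  # placeholder: per-environment Agent Identity id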
            fmi_path="alphaswarm-admin-prod",
        ),
    ),
)
```

`alphaswarm_admin/integrations/broker.py::build_default_brokers` does this
automatically when
`ALPHASWARM_AUTH_AGENT_IDENTITY_ENABLED=true` AND the three Agent
Identity env vars are populated. When any of the fields are empty the
broker falls back to the legacy env-only client_credentials path so
local-dev sandboxes keep working.
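
The selection rule in the paragraph above can be sketched as a small pure function. The function name and return labels are hypothetical, and the choice of the three required variables is an assumption based on the `ALPHASWARM_AUTH_AGENT_*` names used in the operator workflow on this page; `build_default_brokers` may check different fields.

```python
# Assumed to be "the three Agent Identity env vars" referred to above.
REQUIRED_VARS = (
    "ALPHASWARM_AUTH_AGENT_BLUEPRINT_APP_ID",
    "ALPHASWARM_AUTH_AGENT_IDENTITY_ID",
    "ALPHASWARM_AUTH_AGENT_FMI_PATH",
)


def select_auth_path(env):
    """Pick the token path from a mapping of environment variables."""
    flag = env.get("ALPHASWARM_AUTH_AGENT_IDENTITY_ENABLED", "")
    enabled = flag.strip().lower() == "true"
    populated = all(env.get(name, "").strip() for name in REQUIRED_VARS)
    if enabled and populated:
        return "agent_identity"
    # Any missing field falls back, so local sandboxes keep working.
    return "legacy_client_credentials"
```

Passing `os.environ` as `env` evaluates the live process environment; passing a plain dict keeps the rule easy to unit-test.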

## Receiver-side recognition

`alphaswarm_controller.auth.deps._payload_to_user` extracts the
RFC 8693 `act` claim and surfaces `actor_kind="agent"` plus
`actor_upstream_sub` on the resolved `AuthenticatedUser`. Recognition
is feature-flagged behind
`ALPHASWARM_AUTH_AGENT_TOKEN_RECOGNITION_ENABLED` until the end-to-end
path is verified — when off, every token resolves to
`actor_kind="user"` and the legacy audit shape is preserved.
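
A minimal sketch of that recognition logic, operating on an already-validated JWT payload. The function `resolve_actor` and the returned dict shape are hypothetical, and mapping `act.sub` to `actor_upstream_sub` is an assumption about how `_payload_to_user` reads the RFC 8693 `act` claim.

```python
def resolve_actor(payload, recognition_enabled):
    """Classify a validated token payload as user or agent."""
    act = payload.get("act")
    is_agent = (
        recognition_enabled
        and isinstance(act, dict)
        and bool(act.get("sub"))
    )
    if is_agent:
        return {
            "sub": payload.get("sub"),
            "actor_kind": "agent",
            "actor_upstream_sub": act["sub"],  # assumed mapping
        }
    # Flag off, or no usable act claim: keep the legacy audit shape.
    return {
        "sub": payload.get("sub"),
        "actor_kind": "user",
        "actor_upstream_sub": None,
    }
```

With the flag off, a token that carries an `act` claim still resolves to `actor_kind="user"`, which matches the fallback behaviour described above.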

The monolith side (`alphaswarm/api/routes/_internal_audit.py`) logs the
`actor_kind` + `actor_upstream_sub` on every persisted `terraform_runs`
ingest call so the audit ledger stays correlatable with the Agent
Identity that minted the token.

## Identity on AWS ECS Fargate

When `alphaswarm_admin` runs on ECS Fargate (the
[`ecs-fargate-control-plane`](../../../../alphaswarm_platform/infrastructure/modules/ecs-fargate-control-plane/)
module) two identities are in play, and they are orthogonal:

- **AWS control** — the `/admin/platform/ecs/*` surface calls AWS ECS +
  CloudWatch using the task's **AWS IAM role**, not Entra. The module
  grants that role a tightly scoped self-management policy
  (`enable_self_management = true`). No Entra token is involved in the
  AWS control path.
- **Control-plane M2M** — outbound calls to `alphaswarm-cp` `/manage/*`
  still need an Entra-minted token. ECS Fargate has no native OIDC issuer
  for the WIF JWT the two-step `fmi_path` exchange needs, so the
  ECS-hosted admin routes M2M through the controller's `/auth/m2m/token`
  shim by setting `ALPHASWARM_AUTH_THROUGH_CONTROLLER=true`. The
  controller (EKS-hosted, with a projected service-account token) holds
  the Agent Identity federation and mints on the admin's behalf.

The Agent Identity Blueprint + per-environment identities this module
provisions therefore back the EKS-hosted control plane and any admin
pod that can present a federated SA token. The
`module.alphaswarm_admin_agent_identity.agent_identity_env` output emits
the per-environment `ALPHASWARM_AUTH_AGENT_*` block ready to drop into a
task definition or ConfigMap for those deployments.

## Operator workflow

```bash
# 1. Pre-check (one-time): grant the Terraform-execution SP the Graph
# permissions the entra-agent-id skill lists.

# 2. Snapshot + apply (step-up MFA gated; AGENTS rule 42 + 52).
python scripts/identity/seed_admin_agent_identity.py --apply
alphaswarm-cli manage terraform apply \
    --workspace-id admin-entra \
    --spec-version-id 

# 3. Plumb outputs into CredentialResolver.
alphaswarm-cli credentials import \
    --service alphaswarm-admin \
    --purpose entra_agent_identity \
    --field blueprint_app_id= \
    --field agent_identity_id_prod=

# 4. Flip the feature flag on each deployment.
ALPHASWARM_AUTH_AGENT_IDENTITY_ENABLED=true
ALPHASWARM_AUTH_AGENT_BLUEPRINT_APP_ID=
ALPHASWARM_AUTH_AGENT_IDENTITY_ID=
ALPHASWARM_AUTH_AGENT_FMI_PATH=alphaswarm-admin-prod
```

## Rollback

The Terraform module is gated by `var.enabled`; flipping it to false
removes the per-environment Agent Identities + role assignments while
keeping the Blueprint + BlueprintPrincipal in place for fast re-enable.
The `alphaswarm_admin` BFF falls back to the legacy client_credentials
path automatically (via `EnvSecretStore` at priority 100).

For the human-login path, the legacy Vite SPA at `alphaswarm_admin_ui/`
retains its Auth0 branch for 30 days after the refactor; set the
`ALPHASWARM_ADMIN_LEGACY_AUTH0_FALLBACK=1` feature flag to surface it.


<!-- https://alpha-swarm.ai/concepts/identity/auth0-actions -->
# Auth0 Actions for the AlphaSwarm multi-tenant rollout
> Auth0 ships organisation / role data via the standard `org_id` / `https://<tenant>/roles` claims, but the **AlphaSwarm scope chain** (which workspace is the user's default, which team they're in, which roles...

# Auth0 Actions for the AlphaSwarm multi-tenant rollout

The Phase 4 enforcement sweep relies on Auth0 to inject
AlphaSwarm-namespaced custom claims (`https://alphaswarm/org_id`,
`https://alphaswarm/team_id`, `https://alphaswarm/workspace_id`,
`https://alphaswarm/roles`) into every access token. The Action snippet
below ships those claims by calling the M2M-secured
[`/_internal/auth0/sync`](../alphaswarm/api/routes/auth0_sync.py)
endpoint during the post-login hook.

## Why an Action?

Auth0 ships organisation / role data via the standard `org_id` /
`https://<tenant>/roles` claims, but the **AlphaSwarm scope chain**
(which workspace is the user's default, which team they're in,
which roles map onto the four-tier lattice) lives in Postgres.
The Action is the bridge: it asks the AlphaSwarm backend on every login
+ injects the result into the access token so the frontend +
backend see a consistent set of custom claims from request 0.

## Setup

1. **Create an Auth0 API for the AlphaSwarm backend** (separate from the
   SPA Application). Set the audience to whatever you set
   `ALPHASWARM_AUTH_OIDC_AUDIENCE` to — e.g. `https://api.alphaswarm.local`.
2. **Create a Machine-to-Machine Application** authorised against
   the AlphaSwarm API. Set its allowed grant types to `client_credentials`
   only. Copy the client_id + secret into the Action's secrets:
   - `ALPHASWARM_M2M_CLIENT_ID`
   - `ALPHASWARM_M2M_CLIENT_SECRET`
   - `ALPHASWARM_API_AUDIENCE` (the same audience as #1)
   - `ALPHASWARM_BACKEND_URL` (e.g. `https://api.alphaswarm.local`)
3. **Configure the AlphaSwarm backend**:
   ```bash
   ALPHASWARM_AUTH_PROVIDER=auth0
   ALPHASWARM_AUTH_OIDC_ISSUER=https://your-tenant.auth0.com
   ALPHASWARM_AUTH_OIDC_AUDIENCE=https://api.alphaswarm.local
   ALPHASWARM_AUTH_M2M_ENABLED=true
   ALPHASWARM_AUTH_M2M_AUDIENCE=https://api.alphaswarm.local
   ALPHASWARM_AUTH_CLAIMS_NAMESPACE=https://alphaswarm/
   ALPHASWARM_AUTH_ENFORCE=permissive   # flip to ``strict`` after the rollout dashboard is clean
   ```

## The Action

Create a new Action under **Library > Custom > Build new** and
attach it to the **Login** trigger.

```js
/**
 * AlphaSwarm post-login Action: lazy-provisions the internal user + injects
 * AlphaSwarm-namespaced custom claims into the access token.
 *
 * Triggers on every login; the backend is idempotent.
 */
exports.onExecutePostLogin = async (event, api) => {
  const namespace = "https://alphaswarm/";

  // 1. Mint an M2M token for the AlphaSwarm backend.
  const tokenResp = await fetch(`https://${event.tenant.id}.auth0.com/oauth/token`, {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({
      grant_type: "client_credentials",
      client_id: event.secrets.ALPHASWARM_M2M_CLIENT_ID,
      client_secret: event.secrets.ALPHASWARM_M2M_CLIENT_SECRET,
      audience: event.secrets.ALPHASWARM_API_AUDIENCE,
    }),
  });
  if (!tokenResp.ok) {
    api.access.deny("AlphaSwarm backend token mint failed");
    return;
  }
  const { access_token } = await tokenResp.json();

  // 2. Ask the AlphaSwarm backend to lazy-provision the user + return claims.
  const syncResp = await fetch(`${event.secrets.ALPHASWARM_BACKEND_URL}/_internal/auth0/sync`, {
    method: "POST",
    headers: {
      "Content-Type": "application/json",
      Authorization: `Bearer ${access_token}`,
    },
    body: JSON.stringify({
      user_id: event.user.user_id,
      email: event.user.email,
      organization_id: event.organization?.id,
      organization_name: event.organization?.name,
    }),
  });
  if (!syncResp.ok) {
    // Soft failure: let the user in but log the issue. The
    // backend's lazy provisioner will run on the first API call
    // instead.
    console.log("AlphaSwarm backend sync failed:", await syncResp.text());
    return;
  }
  const claims = await syncResp.json();

  // 3. Inject the claims into the access token.
  if (claims.org_id) api.accessToken.setCustomClaim(`${namespace}org_id`, claims.org_id);
  if (claims.team_id) api.accessToken.setCustomClaim(`${namespace}team_id`, claims.team_id);
  if (claims.workspace_id) {
    api.accessToken.setCustomClaim(`${namespace}workspace_id`, claims.workspace_id);
  }
  if (claims.roles && claims.roles.length) {
    api.accessToken.setCustomClaim(`${namespace}roles`, claims.roles);
  }
  if (claims.internal_user_id) {
    api.accessToken.setCustomClaim(`${namespace}user_id`, claims.internal_user_id);
  }
};
```

## Verification

1. Log in via the SPA. The browser receives an access token.
2. Decode it (e.g. [jwt.io](https://jwt.io)) and verify the
   `https://alphaswarm/org_id` / `https://alphaswarm/roles` claims are present.
3. Hit `GET /auth/whoami` on the AlphaSwarm backend. The response should
   reflect the org / workspace from the Action — not the
   deterministic local-default seed.
4. The Phase 6 frontend `ContextBar` should auto-populate the org
   / workspace on first render.
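
Steps 2 and 3 can also be scripted instead of pasting tokens into a web page. The sketch below decodes a token payload locally and reports which namespaced claims are missing. `decodeJwtPayload` and `missingAlphaSwarmClaims` are hypothetical helper names, not part of AlphaSwarm, and the decoder does not verify the signature, so it is suitable for inspection only.

```javascript
// Decode the payload segment of a JWT WITHOUT verifying the signature.
// For local inspection only; the backend must still verify every token.
function decodeJwtPayload(token) {
  const parts = token.split(".");
  if (parts.length !== 3) throw new Error("expected a three-part JWT");
  return JSON.parse(Buffer.from(parts[1], "base64url").toString("utf8"));
}

// Return the expected claim names that are absent from the payload.
// The default list mirrors the two claims named in step 2 (assumption).
function missingAlphaSwarmClaims(
  payload,
  expected = ["org_id", "roles"],
  namespace = "https://alphaswarm/",
) {
  return expected.filter((name) => !(`${namespace}${name}` in payload));
}

// Usage with a hand-built, unsigned sample token.
const samplePayload = { sub: "auth0|123", "https://alphaswarm/org_id": "org_1" };
const sampleToken = [
  "e30",
  Buffer.from(JSON.stringify(samplePayload)).toString("base64url"),
  "sig",
].join(".");
console.log(missingAlphaSwarmClaims(decodeJwtPayload(sampleToken))); // [ 'roles' ]
```

An empty result means every expected claim is present; a non-empty result points at the failure modes listed in the next section.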

## Failure modes

| Symptom | Likely cause |
| ------- | ------------ |
| Token has no custom claims | Action attached to the wrong trigger or failed silently. Check the Action logs. |
| Backend 401 on `/_internal/auth0/sync` | M2M token audience mismatch — Action audience must equal `ALPHASWARM_AUTH_OIDC_AUDIENCE`. |
| `data.ownership.list_resources` returns the local-default user | `provision_user_from_claims` is not running. Confirm the SPA is sending the Bearer header and `ALPHASWARM_AUTH_PROVIDER != local`. |
| Phase 4 enforcement mode showing too many 403s | Some Postgres `memberships` rows are missing — run the lazy-provisioning sync once per user, or backfill manually. |

## See also

- [`alphaswarm_docs/identity.md`](../../concepts/identity/identity.md) — the full identity stack.
- [`alphaswarm_docs/credentials.md`](../../concepts/identity/credentials.md) — how M2M tokens flow
  through `CredentialResolver`.
- [`alphaswarm/api/security.py`](../alphaswarm/api/security.py) — the
  `require_scope` / `require_membership` deps that consume these
  claims.

## Phase 7 post-login Action (Auth0 + Microsoft federation)

This Action calls `/_internal/auth0/sync`, then injects returned custom
claims into both the access token and ID token. The connection name
mapping (`requested_claims.connection`) is forwarded so AlphaSwarm can record
which IdP drove each login.

```javascript
/**
 * AlphaSwarm post-login Action.
 * Calls /_internal/auth0/sync on the AlphaSwarm API and injects the
 * returned custom claims into the access token and the ID token. Also
 * carries the Auth0 connection name (e.g. "azure-ad-myorg") so the
 * AlphaSwarm audit log records WHICH IdP drove this login.
 *
 * Secrets used:
 *   ALPHASWARM_API_URL                  e.g. https://api.alphaswarm.example
 *   ALPHASWARM_M2M_CLIENT_ID            Auth0 Management API M2M client id (reused)
 *   ALPHASWARM_M2M_CLIENT_SECRET        Auth0 Management API M2M client secret
 *   ALPHASWARM_M2M_AUDIENCE             Same as AlphaSwarm API resource identifier
 *
 * Set them at: Actions > Library > Custom >  > Add Secret
 */
const NS = "https://alphaswarm/";

async function mintM2MToken(tenantId, secrets) {
  // `event` is only in scope inside the handler, so the caller passes
  // the tenant id in explicitly.
  const url = `https://${tenantId}.auth0.com/oauth/token`;
  try {
    const res = await fetch(url, {
      method: "POST",
      headers: { "Content-Type": "application/json" },
      body: JSON.stringify({
        grant_type: "client_credentials",
        client_id: secrets.ALPHASWARM_M2M_CLIENT_ID,
        client_secret: secrets.ALPHASWARM_M2M_CLIENT_SECRET,
        audience: secrets.ALPHASWARM_M2M_AUDIENCE,
      }),
    });
    if (!res.ok) return null;
    const body = await res.json();
    return body.access_token || null;
  } catch (err) {
    // Fail open: a network error while minting must not block the login.
    console.log("alphaswarm_m2m_mint_failed", err.message);
    return null;
  }
}

exports.onExecutePostLogin = async (event, api) => {
  const aqpApi = event.secrets.ALPHASWARM_API_URL;
  if (!aqpApi) return; // Action mis-configured; fail open
  let token = await api.cache.get("alphaswarm_m2m_token");
  if (!token || !token.value) {
    const fresh = await mintM2MToken(event.tenant.id, event.secrets);
    if (!fresh) return;
    // Cache the token for 50 minutes (ttl is in milliseconds).
    api.cache.set("alphaswarm_m2m_token", fresh, { ttl: 50 * 60 * 1000 });
    token = { value: fresh };
  }
  const payload = {
    user_id: event.user.user_id,
    email: event.user.email,
    organization_id: event.organization?.id,
    organization_name: event.organization?.name,
    requested_claims: {
      connection: event.connection?.name,
      strategy: event.connection?.strategy,
    },
  };
  try {
    const res = await fetch(`${aqpApi}/_internal/auth0/sync`, {
      method: "POST",
      headers: {
        Authorization: `Bearer ${token.value}`,
        "Content-Type": "application/json",
      },
      body: JSON.stringify(payload),
    });
    if (!res.ok) return;
    const claims = await res.json();
    for (const [k, v] of Object.entries(claims)) {
      if (v === null || v === undefined) continue;
      api.accessToken.setCustomClaim(`${NS}${k}`, v);
      api.idToken.setCustomClaim(`${NS}${k}`, v);
    }
  } catch (err) {
    // Fail open — never block the user's login if AlphaSwarm API is down.
    console.log("alphaswarm_sync_failed", err.message);
  }
};
```

### Custom claims it sets

| Claim | Meaning |
| --- | --- |
| `https://alphaswarm/org_id` | Active organization context resolved by AlphaSwarm. |
| `https://alphaswarm/team_id` | Team context resolved by AlphaSwarm. |
| `https://alphaswarm/workspace_id` | Active workspace context. |
| `https://alphaswarm/project_id` | Active project context. |
| `https://alphaswarm/lab_id` | Active lab context. |
| `https://alphaswarm/roles` | Role list used by scope/membership checks. |
| `https://alphaswarm/connection` | Auth0 connection name, mapped from `requested_claims.connection` (for example `azure-ad-myorg`). |
| `https://alphaswarm/internal_user_id` | AlphaSwarm internal user row identifier. |

### Why it fails open

The post-login Action should never block authentication because of a
transient outage in AlphaSwarm. Missing one claim-sync cycle is recoverable on
the next login, while hard-failing login creates a broader availability
incident for all users.
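
One way to make the fail-open behaviour explicit is to wrap the sync call in a helper that swallows errors and enforces a time budget. This is a minimal sketch, not the shipped Action: `failOpen`, its `timeoutMs` default, and the `fallback` value are assumptions for illustration.

```javascript
// Run an async task, but never let it throw or hang: on error or
// timeout, resolve to `fallback` so the login can continue.
// Note: the timeout stops waiting; it does not cancel the task itself.
async function failOpen(task, { timeoutMs = 3000, fallback = null } = {}) {
  let timer;
  const timeout = new Promise((resolve) => {
    timer = setTimeout(() => resolve(fallback), timeoutMs);
  });
  try {
    return await Promise.race([task(), timeout]);
  } catch (err) {
    console.log("alphaswarm_sync_failed", err.message);
    return fallback;
  } finally {
    clearTimeout(timer);
  }
}
```

Inside a handler, a call such as `const claims = await failOpen(() => syncClaims(event));` followed by `if (!claims) return;` would skip claim injection for that login without blocking it. `syncClaims` is a hypothetical wrapper around the `/_internal/auth0/sync` request.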

## Phase 8 — Step-up MFA addendum (AGENTS hard rule 52)

Step-up MFA on destructive routes (the kill switch, every `/halt`
endpoint, BYOK / OAuth credential deletes, Terraform apply / destroy,
organization invite issuance, broker-credential mutations, and the
admin tenancy-strategy migration) is enforced server-side by
[`alphaswarm.api.security_stepup.require_step_up`](../alphaswarm/api/security_stepup.py).
The FastAPI dep returns RFC 9470-compliant 401 responses with
`WWW-Authenticate: Bearer error="insufficient_user_authentication",
acr_values="...", max_age="..."` when the access token fails the
freshness or MFA-method check.
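
A client that wants to react to this challenge has to pull `acr_values` and `max_age` out of the header. The parser below is a sketch of that step under simple assumptions (a single `Bearer` challenge, parameters either quoted or bare tokens); `parseStepUpChallenge` is a hypothetical name and is not the `apiFetch` implementation.

```javascript
// Parse a step-up challenge from a WWW-Authenticate header value.
// Returns null unless the error is "insufficient_user_authentication".
function parseStepUpChallenge(headerValue) {
  if (typeof headerValue !== "string") return null;
  const match = /^Bearer\s+(.*)$/is.exec(headerValue.trim());
  if (!match) return null;

  // Collect key="value" (or key=token) pairs into a plain object.
  const params = {};
  const pairRe = /([A-Za-z_]+)\s*=\s*(?:"([^"]*)"|([^\s,"]+))/g;
  for (const m of match[1].matchAll(pairRe)) {
    params[m[1].toLowerCase()] = m[2] !== undefined ? m[2] : m[3];
  }

  if (params.error !== "insufficient_user_authentication") return null;
  return {
    // acr_values is a space-separated list of acceptable ACR strings.
    acrValues: params.acr_values ? params.acr_values.split(/\s+/) : [],
    maxAge: params.max_age !== undefined ? Number(params.max_age) : null,
  };
}
```

A `null` result means the 401 is an ordinary authentication failure and should not trigger a step-up prompt.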

The frontend (`alphaswarm_client/src/lib/auth/useStepUp.ts` +
`apiFetch` retry middleware) drives the SPA-side flow: a destructive
button calls `requestStepUp()` to pre-flight an Auth0 popup with
`acr_values=http://schemas.openid.net/pape/policies/2007/06/multi-factor`
and `max_age=0`, then runs the original operation with the freshly
minted token. For this round-trip to succeed, the post-login Action
above MUST honour the `acr_values` parameter and force an MFA
challenge when the caller requested it.
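
For reference, the request that reaches Auth0 in this flow is an ordinary `/authorize` call with two extra parameters. The sketch below builds such a URL by hand; `buildStepUpAuthorizeUrl` and its option names are hypothetical, and the PKCE, `state`, and `nonce` parameters that an SPA SDK would normally add are omitted for brevity.

```javascript
const ACR_MFA = "http://schemas.openid.net/pape/policies/2007/06/multi-factor";

// Build an /authorize URL that asks for a fresh, MFA-backed login.
function buildStepUpAuthorizeUrl({ domain, clientId, redirectUri, audience, scope = "openid" }) {
  const url = new URL(`https://${domain}/authorize`);
  url.search = new URLSearchParams({
    response_type: "code",
    client_id: clientId,
    redirect_uri: redirectUri,
    audience,
    scope,
    acr_values: ACR_MFA, // request the multi-factor authentication class
    max_age: "0",        // force re-authentication instead of reusing the session
  }).toString();
  return url;
}

// Usage with placeholder values.
const stepUpUrl = buildStepUpAuthorizeUrl({
  domain: "your-tenant.auth0.com",
  clientId: "YOUR_SPA_CLIENT_ID",
  redirectUri: "http://localhost:3001/auth/callback",
  audience: "https://api.alphaswarm.local",
});
console.log(stepUpUrl.toString());
```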

Add the snippet below to the **Phase 7 post-login Action** (don't
duplicate — extend the existing `exports.onExecutePostLogin`):

```javascript
exports.onExecutePostLogin = async (event, api) => {
  // ... (the Phase 7 JIT-sync body stays as-is) ...

  // ---- Phase 8: Adaptive MFA + step-up enforcement ------------------

  // The SPA / CLI / agent caller can explicitly request fresh MFA by
  // passing acr_values=http://schemas.openid.net/pape/policies/2007/06/multi-factor
  // on /authorize. The Action MUST trigger the MFA challenge when
  // either (a) the caller requested it OR (b) Auth0's Adaptive MFA
  // assessment flagged the login as high-risk.
  const ACR_MFA = "http://schemas.openid.net/pape/policies/2007/06/multi-factor";
  const acrRequested = Array.isArray(event.transaction?.acr_values)
    ? event.transaction.acr_values
    : [];
  const explicitlyAskedForMfa = acrRequested.includes(ACR_MFA);

  const methods = Array.isArray(event.authentication?.methods)
    ? event.authentication.methods
    : [];
  const mfaAlreadyCompleted = methods.some(
    (m) => m?.name === "mfa" || m?.name === "otp",
  );

  // Auth0's Adaptive MFA assessment reports a confidence score
  // (`low` / `medium` / `high` / `neutral`), where LOW confidence means
  // HIGH risk. Honour `low` automatically; `medium` is left to the
  // caller's explicit request so the dashboard stays usable on shared offices.
  const riskConfidence = event.authentication?.riskAssessment?.confidence;

  const shouldTriggerMfa =
    (explicitlyAskedForMfa && !mfaAlreadyCompleted) ||
    riskConfidence === "low";

  if (shouldTriggerMfa) {
    // ``allowRememberBrowser: false`` because step-up is sized for
    // destructive ops — we never want the browser to remember the
    // "MFA satisfied" flag past the 180s freshness window.
    api.multifactor.enable("any", { allowRememberBrowser: false });
  }

  // Surface a JIT-friendly hint to the SPA so the topbar can render
  // "MFA required" pre-flight UI. Not security-sensitive — purely a
  // UX accelerator. The backend NEVER trusts this claim; it always
  // re-checks amr + auth_time on the access token.
  api.idToken.setCustomClaim(`${NS}mfa_available`, true);
};
```

The `multifactor.enable("any", ...)` call triggers Auth0's enrolment
or challenge surface, depending on whether the user has already
registered a factor. Operators must enable at least one factor type
in **Security > Multi-factor Auth** for the call to succeed.

### Tested factor types

| Factor | Auth0 enrolment | Notes |
| --- | --- | --- |
| OTP (TOTP) | Authenticator app | Always recommended as the primary factor |
| WebAuthn | Roaming or Platform | Strongest; emits `amr: ["mfa", "swk"]` or `["mfa", "hwk"]` |
| Push | Auth0 Guardian app | Smooth UX for personal accounts |
| SMS | Phone-based | Discouraged for B2B per AGENTS rule 52 |
| Email OTP | Magic link | Acceptable; emits `amr: ["mfa"]` |
| Recovery code | Backup | Always provisioned alongside another factor |

### Step-up failure recovery

If the popup fails (browser blocked, user dismissed, network drop):

- `useStepUp.requestStepUp()` returns `null` and surfaces an "MFA
  required" toast.
- `apiFetch`'s automatic 401 retry path also surrenders after one
  attempt; the route handler propagates the original 401 to the
  caller.
- Operators with `admin:tenant` can fall back to the BFF
  `/auth/login?acr_values=...` redirect flow which uses a full-page
  redirect instead of a popup. The redirect callback returns to the
  original route and the user re-clicks the destructive button.
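
The single-retry behaviour described in the first two bullets can be sketched as a small wrapper. This is an illustration under assumptions, not the real `apiFetch` middleware: `fetchWithStepUp` is a hypothetical name, `doFetch(token)` stands for whatever issues the request (with `null` meaning "use the current token"), and `requestStepUp` is expected to resolve to a fresh token or `null`.

```javascript
// Issue a request; on a step-up 401, ask for MFA once and retry once.
async function fetchWithStepUp(doFetch, requestStepUp) {
  const first = await doFetch(null);
  const challenge = first.headers.get("www-authenticate") || "";
  const needsStepUp =
    first.status === 401 && challenge.includes("insufficient_user_authentication");
  if (!needsStepUp) return first;

  const freshToken = await requestStepUp();
  // Popup blocked, dismissed, or network drop: hand back the original 401.
  if (!freshToken) return first;

  // Exactly one retry; a second 401 is returned to the caller as-is.
  return doFetch(freshToken);
}
```

Returning the original response rather than throwing keeps the caller's existing 401 handling in charge of what the user sees.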

## Phase 8 — Custom Token Exchange Profile (AGENTS hard rule 54)

The Phase 8 refactor introduces RFC 8693 delegated agent tokens —
when [`AgentRuntime`](../alphaswarm/agents/runtime.py) makes an HTTP MCP
call on behalf of a user, it exchanges the user's access token for a
narrower, agent-scoped token via Auth0 Custom Token Exchange. The
minted token carries an `act` claim identifying the agent, while the
top-level `sub` stays the human user — so RLS, memberships, and the
audit ledger all see the full delegation chain.
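
At the wire level, an RFC 8693 exchange is a form-encoded POST to the token endpoint. The sketch below builds only the standard parameters; `buildTokenExchangeBody` and its option names are hypothetical, and anything AlphaSwarm-specific that the broker sends (actor assertion, profile selection) is left out because its exact shape is not shown here.

```javascript
// Build the form body for an RFC 8693 token-exchange request.
function buildTokenExchangeBody({ clientId, clientSecret, userAccessToken, audience, scopes }) {
  return new URLSearchParams({
    grant_type: "urn:ietf:params:oauth:grant-type:token-exchange",
    client_id: clientId,
    client_secret: clientSecret,
    // The human user's token is the subject of the exchange.
    subject_token: userAccessToken,
    subject_token_type: "urn:ietf:params:oauth:token-type:access_token",
    audience,
    // Scopes are space-separated, as in any OAuth 2.0 request.
    scope: scopes.join(" "),
  });
}

// Usage with placeholder values.
const exchangeBody = buildTokenExchangeBody({
  clientId: "BROKER_CLIENT_ID",
  clientSecret: "BROKER_CLIENT_SECRET",
  userAccessToken: "USER_ACCESS_TOKEN",
  audience: "https://api.alphaswarm.local",
  scopes: ["read:mcp:data"],
});
console.log(exchangeBody.get("grant_type"));
```

The body would be sent with `Content-Type: application/x-www-form-urlencoded`; requesting fewer scopes than the user holds is what makes the resulting token narrower.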

### Required Auth0 setup

1. **Create an M2M Application named `alphaswarm-agent-broker`.**
   - Authorise it against the AlphaSwarm API record.
   - Allowed grant types: `client_credentials` (required) AND
     `urn:ietf:params:oauth:grant-type:token-exchange` (required).
   - Note the `client_id` + `client_secret`.

2. **Configure backend env vars:**

   ```bash
   ALPHASWARM_AUTH_AGENT_TOKEN_EXCHANGE_ENABLED=true
   ALPHASWARM_AUTH_AGENT_BROKER_CLIENT_ID=
   # client_secret resolves via CredentialResolver in prod
   # (Vault / cloud KMS) — env is the local-dev shortcut:
   ALPHASWARM_AUTH_AGENT_BROKER_CLIENT_SECRET=
   ALPHASWARM_AUTH_AGENT_DELEGATION_TTL_SECONDS=300
   ```

3. **Create a Custom Token Exchange Profile named `alphaswarm-agent-delegation`.**
   - In the Auth0 Dashboard, navigate to **Actions > Flows > Custom Token
     Exchange** and click "Create Profile".
   - Profile name: exactly `alphaswarm-agent-delegation` (matches the
     `subject_token_profile` parameter the broker sends).
   - Target API: the AlphaSwarm API record (audience
     `https://api.alpha-swarm.ai/` or your env equivalent).
   - Subject token types accepted:
     `urn:ietf:params:oauth:token-type:access_token`.
   - Allow Skipping User Consent: **enabled** (required for
     non-interactive flows per the Custom Token Exchange docs).
   - Allowed scopes: `read:mcp:data`, `write:mcp:data`,
     `read:mcp:codebase`, `write:mcp:codebase`. The Profile must
     reject any scope NOT on this list.

### The Action body

Paste this into the Profile's Action body. The Action runs INSIDE
the `/oauth/token` exchange request — it never returns prose to the
caller, only the access token Auth0 mints.

```javascript
/**
 * alphaswarm-agent-delegation — Custom Token Exchange Profile Action.
 *
 * Sources:
 *   event.transaction.subject_token_payload — the human's verified
 *     access token claims (sub, org_id, permissions, ...).
 *   event.transaction.actor_token_payload   — the agent broker M2M
 *     token claims (sub = "agent|").
 *
 * The Profile MUST be paired with the alphaswarm-agent-broker M2M client
 * and the broker MUST NOT be allowed to call /oauth/token with any
 * other Profile. Misusing this Profile mis-attributes audit rows.
 */
exports.onExecuteCustomTokenExchange = async (event, api) => {
  const subject = event.transaction?.subject_token_payload;
  const actor = event.transaction?.actor_token_payload;

  if (!subject || typeof subject !== "object") {
    api.access.rejectInvalidSubjectToken("subject token missing");
    return;
  }
  if (!actor || typeof actor !== "object") {
    api.access.rejectInvalidSubjectToken("actor assertion missing");
    return;
  }

  const humanSub = subject.sub;
  const agentSub = actor.sub;
  if (!humanSub || !agentSub) {
    api.access.rejectInvalidSubjectToken("missing sub claims");
    return;
  }
  if (!String(agentSub).startsWith("agent|")) {
    api.access.rejectInvalidSubjectToken(
      "actor must identify an agent (sub must start with 'agent|')",
    );
    return;
  }

  // Bind the minted access token to the human user. RLS and memberships
  // are evaluated against this sub by the AlphaSwarm backend.
  api.authentication.setUserById(humanSub);

  // Narrow audience + scopes regardless of what the subject token had.
  api.accessToken.setAudience(event.secrets.ALPHASWARM_API_AUDIENCE);

  // Whitelist of scopes the agent is allowed to inherit. New MCP
  // surfaces must be added to BOTH this list AND the Profile's
  // configured allowed scopes.
  const ALLOWED_AGENT_SCOPES = [
    "read:mcp:data",
    "write:mcp:data",
    "read:mcp:codebase",
    "write:mcp:codebase",
  ];
  const requested = (event.transaction?.requested_scopes || []).filter(
    (s) => ALLOWED_AGENT_SCOPES.includes(s),
  );
  for (const s of requested) {
    api.accessToken.addScope(s);
  }

  // The `act` claim is the standard RFC 8693 marker. AlphaSwarm's
  // get_current_user dep reads it to flip Principal.actor_type to
  // "agent" and stamp on_behalf_of_sub onto every audit row.
  api.accessToken.setCustomClaim("act", {
    sub: agentSub,
    iss: `https://${event.secrets.AUTH0_DOMAIN}/`,
  });

  // AlphaSwarm-specific marker so the frontend / SIEM dashboards can filter.
  api.accessToken.setCustomClaim("alphaswarm_delegated", true);

  // Carry the human's org_id through so RLS sees the right tenant
  // even when the agent is running in a Celery worker without
  // X-AlphaSwarm-Org headers.
  if (subject.org_id) {
    api.accessToken.setCustomClaim("org_id", subject.org_id);
  }
};
```
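
The key property of the minted token is the split between `sub` (the human) and `act.sub` (the agent). A consumer could check that split on an already-verified payload roughly as follows; `describeDelegation` is a hypothetical helper, not the backend's `get_current_user` logic.

```javascript
// Given a verified token payload, return the delegation pair or null
// when the token is not a well-formed agent delegation.
function describeDelegation(payload) {
  const agentSub = payload?.act?.sub;
  if (typeof agentSub !== "string" || !agentSub.startsWith("agent|")) return null;
  // The top-level subject must be the human, never the agent itself.
  if (typeof payload.sub !== "string" || payload.sub.startsWith("agent|")) return null;
  return { agentSubject: agentSub, onBehalfOf: payload.sub };
}

// Usage with a sample payload shaped like the Action's output.
console.log(
  describeDelegation({
    sub: "auth0|abc123",
    act: { sub: "agent|research_lead" },
    alphaswarm_delegated: true,
  }),
);
```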

### Secrets

- `ALPHASWARM_API_AUDIENCE` — same value the operator sets in
  `ALPHASWARM_AUTH_OIDC_AUDIENCE` on the backend.
- `AUTH0_DOMAIN` — the tenant domain
  (`alphaswarm-prod.us.auth0.com` or the custom domain).

### Verification

1. Stand the backend up with `ALPHASWARM_AUTH_AGENT_TOKEN_EXCHANGE_ENABLED=true`
   and the broker credentials populated.
2. Run an end-to-end test where an agent calls a data MCP tool:

   ```python
   from alphaswarm_agents.runtime import AgentRuntime
   from alphaswarm_agents.spec import AgentSpec

   spec = AgentSpec.from_yaml_path("configs/agents/research_lead.yaml")
   runtime = AgentRuntime(
       spec,
       context=test_ctx,
       user_access_token=human_token,
   )
   delegated = runtime.delegated_token_for_mcp()
   assert delegated is not None
   # Decode the token at jwt.io — should carry act.sub="agent|research_lead"
   # while sub stays the human's auth0|... identity.
   ```

3. Hit `/mcp/data/tools/data.catalog.lineage/invoke` with the
   delegated token in the `Authorization` header. The response body
   should include the `actor` object with both the agent sub and the
   on-behalf-of sub.

4. Query the audit ledger:

   ```sql
   SELECT created_at, user_id, actor_user_id, event_type,
          details->'delegation' AS delegation
   FROM security_audit_events
   WHERE event_type LIKE 'mcp%'
   ORDER BY created_at DESC LIMIT 5;
   ```

   The `delegation` JSON block should carry
   `{"agent_subject": "agent|research_lead", "on_behalf_of_user_id":
   "auth0|...", "profile": "alphaswarm-agent-delegation"}`.

### Failure modes

| Symptom | Likely cause | Fix |
| --- | --- | --- |
| 400 `invalid_request profile not found` | Profile name typo or not yet created in Dashboard | Match the name exactly: `alphaswarm-agent-delegation` |
| 400 `unauthorized_client` | alphaswarm-agent-broker app missing `token-exchange` grant type | Enable on the M2M app |
| 400 `invalid_target scope rejected` | Profile didn't include the scope in its allowed list | Add the scope to BOTH the Profile config and the Action's `ALLOWED_AGENT_SCOPES` list |
| MCP route returns 403 missing `read:mcp:data` | Permissions array on AlphaSwarm API record missing the scope, or RBAC option "Add permissions in access token" is off | Re-enable both in API record settings |
| Audit row missing `delegation` block | Caller didn't pass `agent_subject` to `emit_audit_event` | The MCP server route + bridge already pass it; legacy callers need to be updated |

## Phase 6 — IdP group sync Action (`alphaswarm-idp-group-sync`)

Generalises the existing post-login flow so each org can attach
non-Entra IdPs (Google Workspace, AWS IAM Identity Center, Okta,
OneLogin, JumpCloud, generic SAML/OIDC) and have their external
group claims automatically promote to AlphaSwarm roles. Pairs with the
[`IdpGroupMappingEditor`](../alphaswarm_client/src/components/onboarding/IdpGroupMappingEditor.tsx)
admin UI and the `/tenancy/orgs/{org_id}/idp-group-mappings`
routes.

### How it fits the post-login pipeline

The existing `alphaswarm-post-login` Action handles the JIT user upsert
and the AlphaSwarm-namespaced custom claims (Phase 4 + 7). This NEW
Action runs AFTER `alphaswarm-post-login` in the same Login trigger and
specifically handles the IdP-group → AlphaSwarm-role translation. They
share the M2M token cache to avoid double-minting.

### Required Auth0 setup

1. **Order the Actions.** In Library > Custom > Triggers > Login,
   drag `alphaswarm-post-login` to position 1, then `alphaswarm-idp-group-sync`
   to position 2. The sync action depends on `event.user.user_id`
   being a valid AlphaSwarm user, which is guaranteed by the time the
   post-login JIT sync completes.

2. **No new secrets** — re-uses the same `ALPHASWARM_API_URL` /
   `ALPHASWARM_M2M_*` secrets the post-login Action already needs.

### The Action body

```javascript
/**
 * alphaswarm-idp-group-sync — post-login Action.
 *
 * Reads the user's external IdP group claims and posts them to
 * /_internal/idp/sync-groups so the AlphaSwarm backend can upsert
 * matching Membership rows per the per-org IdpGroupMapping table.
 */
const NS = "https://alphaswarm.internal/";

function _collectExternalGroups(event) {
  // Different IdPs surface group memberships under different claim
  // names. We collect every well-known shape and merge into one
  // de-duplicated list.
  const candidates = [
    event.user?.groups,          // Auth0 standard
    event.user?.app_metadata?.groups,
    event.user?.user_metadata?.groups,
    event.user?.["http://schemas.microsoft.com/ws/2008/06/identity/claims/role"],
    event.user?.identities?.[0]?.profileData?.groups,
  ];
  const merged = new Set();
  for (const c of candidates) {
    if (!c) continue;
    if (Array.isArray(c)) {
      for (const g of c) {
        if (typeof g === "string" && g.trim()) merged.add(g.trim());
      }
    } else if (typeof c === "string" && c.trim()) {
      merged.add(c.trim());
    }
  }
  return Array.from(merged);
}

function _connectionKind(event) {
  // Map Auth0 connection strategy -> AlphaSwarm IdpConnectionRecord.connection_kind.
  const strategy = (event.connection?.strategy || "").toLowerCase();
  const name = (event.connection?.name || "").toLowerCase();
  if (strategy === "waad" || name.includes("azure")) return "entra";
  if (strategy === "google-workspace" || name.includes("google-workspace")) {
    return "google_workspace";
  }
  if (name.includes("iam-identity-center") || name.includes("aws-sso")) {
    return "aws_iam_identity_center";
  }
  if (strategy === "okta" || name.includes("okta")) return "okta";
  if (strategy === "onelogin" || name.includes("onelogin")) return "onelogin";
  if (strategy === "jumpcloud" || name.includes("jumpcloud")) return "jumpcloud";
  if (strategy === "samlp") return "generic_saml";
  if (strategy === "oidc") return "generic_oidc";
  return null;
}

exports.onExecutePostLogin = async (event, api) => {
  const groups = _collectExternalGroups(event);
  if (groups.length === 0) return;
  const kind = _connectionKind(event);
  if (!kind) return;

  // Re-use the M2M token cached by alphaswarm-post-login (same Action
  // namespace) so we don't double-mint.
  const token = (await api.cache.get("alphaswarm_m2m_token"))?.value;
  if (!token) return;

  const aqpApi = event.secrets.ALPHASWARM_API_URL;
  if (!aqpApi) return;

  try {
    await fetch(`${aqpApi}/_internal/idp/sync-groups`, {
      method: "POST",
      headers: {
        Authorization: `Bearer ${token}`,
        "Content-Type": "application/json",
      },
      body: JSON.stringify({
        user_id: event.user.user_id,
        auth0_organization_id: event.organization?.id || null,
        connection_kind: kind,
        external_groups: groups,
      }),
    });
  } catch (err) {
    // Fail open — never block authentication because of a transient
    // backend hiccup. The next login retries.
    console.log("alphaswarm_idp_group_sync_failed:", err.message);
  }
};
```

### Wire-format the backend expects

The route `/_internal/idp/sync-groups` validates the M2M token via
the same chain as `/_internal/auth0/sync`, then for every active
`IdpConnectionRecord` of the matching `connection_kind` it looks
up matching `IdpGroupMapping` rows and upserts the
corresponding `Membership` rows.
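
The request body itself is the four-field JSON object the Action posts. The validator below is a sketch of the checks that shape implies; it is derived from the Action body above rather than from the backend's actual schema, and `validateSyncGroupsPayload` is a hypothetical name.

```javascript
// Connection kinds the Action's _connectionKind helper can emit.
const CONNECTION_KINDS = new Set([
  "entra",
  "google_workspace",
  "aws_iam_identity_center",
  "okta",
  "onelogin",
  "jumpcloud",
  "generic_saml",
  "generic_oidc",
]);

// Return a list of human-readable problems; an empty list means valid.
function validateSyncGroupsPayload(body) {
  const errors = [];
  if (typeof body?.user_id !== "string" || !body.user_id) {
    errors.push("user_id must be a non-empty string");
  }
  // The Action sends null when the login has no Auth0 organization.
  if (body?.auth0_organization_id != null && typeof body.auth0_organization_id !== "string") {
    errors.push("auth0_organization_id must be a string or null");
  }
  if (!CONNECTION_KINDS.has(body?.connection_kind)) {
    errors.push("connection_kind is not a known kind");
  }
  const groups = body?.external_groups;
  const groupsOk =
    Array.isArray(groups) &&
    groups.length > 0 &&
    groups.every((g) => typeof g === "string" && g.trim() !== "");
  if (!groupsOk) errors.push("external_groups must be a non-empty array of non-blank strings");
  return errors;
}
```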

### Verification

1. Stand the backend up with at least one active `IdpConnectionRecord`
   for the user's org + at least one `IdpGroupMapping` referencing
   one of the user's external groups.
2. Sign in via the matching IdP.
3. Hit `/whoami` and verify the `memberships` array contains the
   expected scope_kind / scope_id / role.
4. Query `security_audit_events`:

   ```sql
   SELECT created_at, event_type, details
   FROM security_audit_events
   WHERE event_type = 'idp_group_mapping_created'
     OR event_type = 'auth0_log_stream:s';
   ```

### Don't

- Don't bake group → role mappings into the Action body itself.
  The whole point of `IdpGroupMapping` is operator-driven mapping
  changes via the UI without redeploying Actions.
- Don't surface group lists in any visible UI or error message —
  some enterprise IdPs treat them as PII-adjacent.
- Don't enable this Action without first creating at least one
  matching `IdpConnectionRecord` in `status=active`; the route is
  a no-op without an active connection, but the Action wastes API
  call budget if it's misconfigured at scale.


<!-- https://alpha-swarm.ai/concepts/identity/auth0-microsoft-federation -->
# Auth0 + Microsoft Entra federation runbook
> Users authenticate through Auth0 Universal Login, can choose Microsoft via an enterprise connection, and then call the AlphaSwarm API with Auth0-issued access tokens that include AlphaSwarm custom claims

# Auth0 + Microsoft Entra federation runbook

This runbook covers the one-time operator setup for federating Microsoft Entra ID through Auth0 Universal Login, so AlphaSwarm keeps one identity control plane while still supporting enterprise SSO and account lifecycle features.

## 1) What this gives you

Users authenticate through Auth0 Universal Login, can choose Microsoft via an enterprise connection, and then call the AlphaSwarm API with Auth0-issued access tokens that include AlphaSwarm custom claims.

```mermaid
sequenceDiagram
    participant User
    participant SPA as AlphaSwarm SPA
    participant UL as Auth0 Universal Login
    participant Entra as Microsoft Entra ID
    participant Auth0
    participant API as AlphaSwarm API

    User->>SPA: Open login
    SPA->>UL: Redirect (PKCE + audience)
    UL-->>User: Show login options
    User->>UL: Click "Continue with Microsoft"
    UL->>Entra: Start enterprise connection flow
    Entra-->>UL: Return auth result
    UL->>Auth0: Complete federation and issue tokens
    Auth0-->>SPA: Redirect to /auth/callback
    SPA->>API: Call API with Bearer token
    API-->>SPA: Authorized response
```

## 2) Auth0 tenant resources to create

1. **AlphaSwarm API resource**
   - Navigate: `Dashboard > Applications > APIs > Create API`
   - Name: `AlphaSwarm API`
   - Identifier: `https://api.alphaswarm.local` (operator-selected; this becomes `ALPHASWARM_AUTH_OIDC_AUDIENCE`)
   - Signing algorithm: `RS256`
   - Permissions to add:
     - `read:messages`
     - `write:messages`
     - `admin`
     - `data:read`
     - `data:write`
   - Enable RBAC and enable **Add Permissions in the Access Token**.

2. **AlphaSwarm SPA Application**
   - Navigate: `Dashboard > Applications > Applications > Create Application`
   - Name: `AlphaSwarm SPA`
   - Type: `Single Page Application`
   - Allowed Callback URLs: `http://localhost:3001/auth/callback,https:///auth/callback`
   - Allowed Logout URLs: `http://localhost:3001/auth/logout,https:///auth/logout`
   - Allowed Web Origins: `http://localhost:3001,https://`
   - Token Endpoint Authentication Method: `None` (public client + PKCE)
   - Grant Types: `Authorization Code` and `Refresh Token`
   - Refresh Token settings: rotation enabled, reuse interval `0`
   - Save the Client ID as `VITE_AUTH0_CLIENT_ID`.

3. **AlphaSwarm Management API M2M Application**
   - Navigate: `Dashboard > Applications > Applications > Create Application`
   - Type: `Machine to Machine`
   - Authorize it for `Auth0 Management API`.
   - Grant scopes:
     - `read:users` - read user profiles and identity links.
     - `update:users` - patch profile/app metadata updates.
     - `create:users` - create user records when needed.
     - `delete:users` - hard-delete user accounts.
     - `read:user_sessions` - list active Auth0 sessions.
     - `delete:sessions` - revoke sessions and sign users out.
     - `read:authentication_methods` - list enrolled MFA methods.
     - `delete:authentication_methods` - remove MFA methods.
     - `create:authentication_method_enrollment_tickets` - generate MFA enrollment tickets.
     - `read:guardian_factors` - list available MFA factor types.
     - `create:user_tickets` - generate password change ticket URLs.
     - `read:logs` - fetch Auth0 audit/security events.
   - Save Client ID + Secret as:
     - `ALPHASWARM_AUTH0_MGMT_API_CLIENT_ID`
     - `ALPHASWARM_AUTH0_MGMT_API_CLIENT_SECRET`
   - Audience is `https://your-tenant.auth0.com/api/v2/` and maps to `ALPHASWARM_AUTH0_MGMT_API_AUDIENCE`.

4. **Microsoft Enterprise Connection**
   - Navigate: `Dashboard > Authentication > Enterprise > Microsoft Azure AD`
   - Connection name: `azure-ad-myorg` (operator-selected). This becomes:
     - `ALPHASWARM_AUTH0_MICROSOFT_CONNECTION`
     - `VITE_AUTH0_MS_CONNECTION`
   - Use Common Endpoint: `Yes` for multi-tenant. Use tenant-specific endpoint for single-tenant installs.
   - Domain: leave blank for multi-tenant.
   - Paste Client ID + Client Secret from the Microsoft Entra app registration (Section 3).
   - Identity API: `Microsoft Identity Platform v2.0`
   - Attribute mapping: `Standard`
   - Open the `AlphaSwarm SPA` app -> `Connections` tab -> enable this connection.

5. **(Optional) Google social connection**
   - Navigate: `Dashboard > Authentication > Social > Google`
   - Auth0 dev keys are acceptable only for testing.
   - For production, configure your own Google OAuth client (see [Google OAuth 2.0 setup](https://developers.google.com/identity/protocols/oauth2)).

6. **Auth0 Action — post-login**
   - Implement the Action from Section 4.
   - Ensure it is enabled on the **Login Flow** trigger.

7. **(Optional, recommended) Custom Domain**
   - Navigate: `Dashboard > Branding > Custom Domains > Add Domain`
   - Example domain: `auth.alphaswarm.example`
   - Add the CNAME record shown by Auth0.
   - Wait for verification (typically about 5 minutes).
   - Universal Login uses the custom domain automatically once verified.

8. **Universal Login branding**
   - Navigate: `Dashboard > Branding > Universal Login > Customize`
   - Use the **New Universal Login** (template-based), not Classic.
   - Choose the `Identifier First + Biometrics` template.
   - Set logo URL and primary color from your brand guide.

## 3) Microsoft Entra app registration walkthrough

1. In Azure portal, open `Microsoft Entra ID > App registrations > New registration`.
2. Name the app `AlphaSwarm via Auth0`.
3. Supported account types: `Accounts in any organizational directory (Multitenant)` for B2B, or single-tenant for internal-only access.
4. Redirect URI: `Web`, set to `https://YOUR_TENANT.auth0.com/login/callback`.
5. Click **Register**.
6. Copy **Application (client) ID** and paste into the Auth0 Microsoft Enterprise Connection.
7. Open `Certificates & secrets > New client secret`, then copy the **Value** (not secret ID) into the Auth0 Microsoft Enterprise Connection.
8. In `API permissions`, add Microsoft Graph delegated permissions: `openid`, `profile`, `email`, `User.Read`; then grant admin consent.
9. In `Authentication`:
   - `Allow public client flows`: `No`
   - Front-channel logout URL: `https://YOUR_TENANT.auth0.com/v2/logout`
10. Optional token configuration: add optional claims `email`, `family_name`, and `given_name` if you want those in ID tokens.

## 4) The Auth0 Action JavaScript

Use this Action on the Login Flow -> Post Login trigger:

```javascript
/**
 * AlphaSwarm post-login Action.
 * Calls /_internal/auth0/sync on the AlphaSwarm API and injects the
 * returned custom claims into the access token. Also carries the
 * Auth0 connection name (e.g. "azure-ad-myorg") so the AlphaSwarm audit
 * log records WHICH IdP drove this login.
 *
 * Secrets used:
 *   ALPHASWARM_API_URL                  e.g. https://api.alphaswarm.example
 *   ALPHASWARM_M2M_CLIENT_ID            Auth0 Management API M2M client id (reused)
 *   ALPHASWARM_M2M_CLIENT_SECRET        Auth0 Management API M2M client secret
 *   ALPHASWARM_M2M_AUDIENCE             Same as AlphaSwarm API resource identifier
 *
 * Set them at: Actions > Library > Custom > (this Action) > Add Secret
 */
const NS = "https://alphaswarm/";

// Takes the whole event: the tenant id builds the token URL and
// event.secrets supplies the M2M client credentials.
async function mintM2MToken(event) {
  const { secrets } = event;
  const url = `https://${event.tenant.id}.auth0.com/oauth/token`;
  const res = await fetch(url, {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({
      grant_type: "client_credentials",
      client_id: secrets.ALPHASWARM_M2M_CLIENT_ID,
      client_secret: secrets.ALPHASWARM_M2M_CLIENT_SECRET,
      audience: secrets.ALPHASWARM_M2M_AUDIENCE,
    }),
  });
  if (!res.ok) return null;
  const body = await res.json();
  return body.access_token || null;
}

exports.onExecutePostLogin = async (event, api) => {
  const aqpApi = event.secrets.ALPHASWARM_API_URL;
  if (!aqpApi) return; // Action mis-configured; fail open
  let token = await api.cache.get("alphaswarm_m2m_token");
  if (!token || !token.value) {
    const fresh = await mintM2MToken(event);
    if (!fresh) return;
    api.cache.set("alphaswarm_m2m_token", fresh, { ttl: 50 * 60 * 1000 });
    token = { value: fresh };
  }
  const payload = {
    user_id: event.user.user_id,
    email: event.user.email,
    organization_id: event.organization?.id,
    organization_name: event.organization?.name,
    requested_claims: {
      connection: event.connection?.name,
      strategy: event.connection?.strategy,
    },
  };
  try {
    const res = await fetch(`${aqpApi}/_internal/auth0/sync`, {
      method: "POST",
      headers: {
        Authorization: `Bearer ${token.value}`,
        "Content-Type": "application/json",
      },
      body: JSON.stringify(payload),
    });
    if (!res.ok) return;
    const claims = await res.json();
    for (const [k, v] of Object.entries(claims)) {
      if (v === null || v === undefined) continue;
      api.accessToken.setCustomClaim(`${NS}${k}`, v);
      api.idToken.setCustomClaim(`${NS}${k}`, v);
    }
  } catch (err) {
    // Fail open — never block the user's login if AlphaSwarm API is down.
    console.log("alphaswarm_sync_failed", err.message);
  }
};
```

The Action intentionally fails open. Blocking sign-in for every user because of a temporary outage in `/_internal/auth0/sync` is a worse failure mode than skipping one claim sync. The next successful login reconciles claims again.
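
Every claim the Action writes shares the `https://alphaswarm/` prefix, so a consumer usually strips that prefix before reading the values. The following is a minimal Python sketch of that step, assuming the token has already been verified and decoded into a dict. The helper name `extract_namespaced_claims` and the sample claim names are illustrative assumptions, not part of the AlphaSwarm codebase.

```python
from typing import Any, Dict, Mapping, Tuple


def extract_namespaced_claims(
    claims: Mapping[str, Any],
    namespaces: Tuple[str, ...] = ("https://alphaswarm/",),
) -> Dict[str, Any]:
    """Return custom claims with their namespace prefix removed.

    Namespaces are tried in order; the first one that supplies a given
    short name wins, so a legacy alias can be listed after the primary.
    """
    result: Dict[str, Any] = {}
    for namespace in namespaces:
        for key, value in claims.items():
            if key.startswith(namespace):
                result.setdefault(key[len(namespace):], value)
    return result


# Hypothetical decoded token payload (claim names are examples only).
token_claims = {
    "sub": "auth0|123",
    "https://alphaswarm/connection": "azure-ad-myorg",
    "https://alphaswarm/strategy": "waad",
}
custom = extract_namespaced_claims(token_claims)
```

Standard claims such as `sub` are left out of the result because they carry no namespace prefix.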

## 5) `.env` values to set on AlphaSwarm

Use `.env.example` as the canonical source for all names and defaults.

### API + worker (`ALPHASWARM_*`)

- `ALPHASWARM_AUTH_PROVIDER=auth0`
- `ALPHASWARM_AUTH_OIDC_ISSUER` (Auth0 issuer URL)
- `ALPHASWARM_AUTH_OIDC_AUDIENCE` (AlphaSwarm API identifier)
- `ALPHASWARM_AUTH_OIDC_CLIENT_ID`
- `ALPHASWARM_AUTH_OIDC_CLIENT_SECRET` (required only for confidential clients)
- `ALPHASWARM_AUTH_LOGIN_CALLBACK`
- `ALPHASWARM_AUTH_LOGOUT_CALLBACK`
- `ALPHASWARM_AUTH_SESSION_SECRET`
- `ALPHASWARM_AUTH_M2M_ENABLED=true`
- `ALPHASWARM_AUTH_M2M_AUDIENCE` (normally same as API audience)
- `ALPHASWARM_AUTH0_MGMT_API_AUDIENCE`
- `ALPHASWARM_AUTH0_MGMT_API_CLIENT_ID`
- `ALPHASWARM_AUTH0_MGMT_API_CLIENT_SECRET`
- `ALPHASWARM_AUTH0_DATABASE_CONNECTION`
- `ALPHASWARM_AUTH0_MICROSOFT_CONNECTION`
- `ALPHASWARM_AUTH0_GOOGLE_CONNECTION` (if Google is enabled)
- `ALPHASWARM_AUTH_REQUIRE_EMAIL_VERIFIED`

### SPA build-time config (`VITE_*`)

- `VITE_AUTH0_DOMAIN`
- `VITE_AUTH0_CLIENT_ID`
- `VITE_AUTH0_AUDIENCE`
- `VITE_AUTH0_SCOPE`
- `VITE_AUTH0_REDIRECT_URI`
- `VITE_AUTH0_ORGANIZATION` (optional)
- `VITE_AUTH0_MS_CONNECTION`
- `VITE_AUTH0_GOOGLE_CONNECTION`
- `VITE_AUTH0_BRAND_NAME`
- `VITE_AUTH0_BRAND_LOGO_URL`

## 6) Verification curl commands

```bash
# Public endpoint (should return 200 without auth)
curl http://localhost:8000/api/public

# Private endpoint (401 without token)
curl http://localhost:8000/me

# Private endpoint (200 with access token)
curl http://localhost:8000/me -H 'Authorization: Bearer YOUR_ACCESS_TOKEN'

# Scoped endpoint (403 if token lacks read:messages)
curl http://localhost:8000/api/private-scoped -H 'Authorization: Bearer YOUR_ACCESS_TOKEN'
```

For a quick test token, use `Auth0 Dashboard > APIs > AlphaSwarm API > Test`.
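
The same checks can be scripted. The sketch below uses only the Python standard library and assumes the API listens on `http://localhost:8000` as in the curl commands; `build_request` and `status_of` are hypothetical helper names.

```python
import urllib.error
import urllib.request
from typing import Optional

BASE_URL = "http://localhost:8000"  # same host as the curl examples


def build_request(path: str, token: Optional[str] = None) -> urllib.request.Request:
    """Build a GET request, adding a Bearer header only when a token is given."""
    headers = {"Authorization": f"Bearer {token}"} if token else {}
    return urllib.request.Request(f"{BASE_URL}{path}", headers=headers)


def status_of(path: str, token: Optional[str] = None) -> int:
    """Send the request and return the HTTP status, including 4xx/5xx codes."""
    try:
        with urllib.request.urlopen(build_request(path, token)) as response:
            return response.status
    except urllib.error.HTTPError as exc:
        return exc.code
```

With the API running, `status_of("/me")` should report 401 and `status_of("/me", token)` should report 200 for a valid access token.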

## 7) Cutover checklist

- [ ] Auth0 tenant created
- [ ] AlphaSwarm API + SPA + Management API M2M apps created
- [ ] Microsoft Enterprise Connection created + tested
- [ ] Auth0 Action installed + enabled on Login Flow
- [ ] `.env` populated on the AlphaSwarm API + worker
- [ ] `.env.local` populated on the SPA build + rebuild + redeploy
- [ ] `ALPHASWARM_AUTH_PROVIDER=auth0` set
- [ ] `ALPHASWARM_AUTH_ENFORCE=strict` confirmed in prod
- [ ] Smoke: `/api/public` 200, `/api/private` 401 then 200, Microsoft button -> Entra -> callback -> `/`

## 8) Troubleshooting

- `401 invalid_token` after Microsoft login: verify the Action ran in `Dashboard > Monitoring > Logs` (filter event type `sapi` or `sf`).
- `invalid_request: missing audience`: ensure the authorize request includes `audience=`. The SPA should pass this from `VITE_AUTH0_AUDIENCE`.
- `Wrong issuer`: ensure issuer uses the Auth0 tenant domain ending in `.auth0.com`. If a custom domain is configured, confirm token issuer behavior and enable **Use Custom Domain in Tokens** when required.


<!-- https://alpha-swarm.ai/concepts/identity/auth0-setup -->
# Auth0 setup — comprehensive operator runbook
> The platform supports three deployment shapes:

# Auth0 setup — comprehensive operator runbook

This is the canonical setup guide for AGENTS hard rules 52-55 (the
Phase 5+ auth refactor). Pair with
[alphaswarm_docs/auth0-actions.md](../../concepts/identity/auth0-actions.md) for the JS Action bodies
that go in the Auth0 Dashboard.

The platform supports three deployment shapes:

- **Local-first dev**: `ALPHASWARM_AUTH_PROVIDER=local`, no Auth0 tenant
  needed. Everything below is skipped.
- **Single-tenant B2C**: one Auth0 tenant per env, individual users
  sign up via Universal Login + social connections. Organizations
  is OFF (or "Allow individual logins" if you want both modes).
- **Multi-tenant B2B**: same Auth0 tenant per env, institutional
  customers attach via Auth0 Organizations. Each Organization has
  its own branded login + Enterprise connection.

The same backend serves all three; the difference is purely the
Auth0 configuration + the `ALPHASWARM_AUTH_*` env vars.

---

## 1. Tenants

One Auth0 tenant per AlphaSwarm environment. Three tenants per AGENTS rule:

| Env | Auth0 tenant | Custom domain | Issuer URL in `ALPHASWARM_AUTH_OIDC_ISSUER` |
| --- | --- | --- | --- |
| dev | `alphaswarm-dev` | `auth.dev.alpha-swarm.ai` | `https://auth.dev.alpha-swarm.ai/` |
| stage | `alphaswarm-stage` | `auth.stage.alpha-swarm.ai` | `https://auth.stage.alpha-swarm.ai/` |
| prod | `alphaswarm-prod` | `auth.alpha-swarm.ai` | `https://auth.alpha-swarm.ai/` |

Custom domains stabilise the issuer URL so changing Auth0 tenants
later is non-breaking. Without a custom domain the issuer is
`https://alphaswarm-prod.us.auth0.com/` and every existing JWT cache /
revocation token has to be invalidated on rebrand.
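
Because the `iss` claim must match the configured issuer exactly, a missing trailing slash or scheme in `ALPHASWARM_AUTH_OIDC_ISSUER` is a common source of `Wrong issuer` failures. A minimal sketch of a normalising comparison follows; `expected_issuer` and `issuer_matches` are hypothetical helpers and do not describe how the AlphaSwarm backend is known to validate issuers.

```python
def expected_issuer(domain_or_url: str) -> str:
    """Normalise a configured domain or URL to 'https://host/' form."""
    value = domain_or_url.strip()
    if not value.startswith("https://"):
        value = "https://" + value
    return value.rstrip("/") + "/"


def issuer_matches(token_iss: str, configured: str) -> bool:
    """Compare the token's `iss` claim exactly against the normalised config."""
    return token_iss == expected_issuer(configured)


ok = issuer_matches("https://auth.alpha-swarm.ai/", "auth.alpha-swarm.ai")
stale = issuer_matches(
    "https://alphaswarm-prod.us.auth0.com/", "https://auth.alpha-swarm.ai/"
)
```

Only the configured value is normalised; the token's own `iss` is compared as-is, so a token minted under the raw tenant domain is still rejected once the custom domain is the configured issuer.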

Never share Auth0 tenants across envs — Auth0 charges per MAU per
tenant, but the security boundary is more important than the cost
arithmetic.

---

## 2. API resource server

One API record per tenant — the AlphaSwarm backend.

| Field | Value (prod example) |
| --- | --- |
| Name | `alphaswarm-api` |
| Identifier | `https://api.alpha-swarm.ai/` |
| Signing algorithm | `RS256` |
| Allow Skipping User Consent | ON |
| Allow Offline Access | ON |
| Token expiration (seconds) | `86400` (24h ceiling — per-app overrides win) |
| Token expiration for browser flows (seconds) | `7200` (2h SPA ceiling) |

Enable **RBAC**:

- Settings → "Enable RBAC" → ON
- Settings → "Add Permissions in the Access Token" → ON

Define every permission AlphaSwarm uses (Permissions tab):

```
read:portfolio             Read portfolio positions / PnL / risk
write:portfolio            Mutate portfolio config
read:strategy              Read strategy specs / backtest history
write:strategy             Author / edit strategies
deploy:strategy            Promote a strategy to live trading
kill_switch:execute        Engage the global kill switch
trade:execute              Submit live or paper orders
trade:live                 Bypass the paper-only guard
read:mcp:data              Invoke the Data MCP tools
write:mcp:data             Mutate via Data MCP (e.g. namespace policy edits)
read:mcp:codebase          Invoke the Codebase MCP tools
write:mcp:codebase         Apply code edits via Codebase MCP (rarely granted)
run:agent                  Spawn an AgentRuntime
admin:tenant               Org-admin powers (invites, IdP config, billing)
admin:cluster              Bypass resource filter; superadmin-only
manage:broker_credentials  Read/write broker credentials at org scope
read:logs                  Required for the Auth0 Management API M2M client
```
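
With "Add Permissions in the Access Token" enabled, Auth0 places the granted permissions in a `permissions` array claim on the access token. A minimal sketch of a check against that claim, assuming the token is already verified and decoded; `missing_permissions` is a hypothetical helper name:

```python
from typing import Any, Iterable, Mapping, Set


def missing_permissions(claims: Mapping[str, Any], required: Iterable[str]) -> Set[str]:
    """Return the required permissions that the token does not carry."""
    granted = set(claims.get("permissions") or [])
    return set(required) - granted


claims = {"permissions": ["read:portfolio", "read:strategy"]}
allowed = not missing_permissions(claims, ["read:portfolio"])
denied = missing_permissions(claims, ["read:portfolio", "trade:execute"])
```

An empty result means the call may proceed; a non-empty result names exactly which permissions to report in a 403 response.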

Add Token Exchange:

- API → Settings → "Token Exchange" → ON (required for
  `alphaswarm-agent-broker` to use RFC 8693).

---

## 3. Applications

Five application records per tenant:

| Record | Type | Grants | Token TTL | Notes |
| --- | --- | --- | --- | --- |
| `alphaswarm-spa` | Single Page Application | `authorization_code` + `refresh_token` | access 15m, ID 10m | Refresh-token rotation ON, absolute lifetime 24h |
| `alphaswarm-cli` | Native | `urn:ietf:params:oauth:grant-type:device_code` + `refresh_token` | access 60m | Rotation ON, absolute 30d, inactivity 7d. **"Business Users" mode** so Device Code stays compatible with Orgs |
| `alphaswarm-backend-m2m` | M2M | `client_credentials` | 24h | For internal service-to-service + Auth0 Management API |
| `alphaswarm-action-callback-m2m` | M2M | `client_credentials` | 5m | Used inside Auth0 Actions for `/_internal/auth0/sync` |
| `alphaswarm-agent-broker` | M2M | `client_credentials` + `urn:ietf:params:oauth:grant-type:token-exchange` | 5m | RFC 8693 delegated-agent-token minting |

### 3.1 `alphaswarm-spa` (SPA)

- Application URIs:
  - Allowed callback URLs:
    `https://app.alpha-swarm.ai/auth/callback`, `http://localhost:3001/auth/callback`
  - Allowed logout URLs: `https://app.alpha-swarm.ai/`, `http://localhost:3001/`
  - Allowed web origins: `https://app.alpha-swarm.ai`, `http://localhost:3001`
- Refresh Token Rotation: ON
- Refresh Token Expiration: Absolute 24h
- Refresh Token Inactivity: 7d
- Idle Session Lifetime: 72h
- Maximum Session Lifetime: 168h (7d)

Frontend env vars (Vite):

```
VITE_AUTH_PROVIDER=auth0
VITE_AUTH0_DOMAIN=auth.alpha-swarm.ai          # custom domain
VITE_AUTH0_SPA_CLIENT_ID=
VITE_AUTH0_AUDIENCE=https://api.alpha-swarm.ai/
VITE_AUTH0_SCOPE=openid profile email offline_access read:portfolio write:portfolio read:strategy write:strategy read:mcp:data
VITE_AUTH0_ORGANIZATION=                  # B2B only — pin to a single org
```

### 3.2 `alphaswarm-cli` (Native)

- Connections tab: enable the same DB / social connections as the SPA.
- Advanced Settings → Grant Types: enable `Device Code` + `Refresh Token`.
- "Business Users" mode (not "Organizations Required"); the Auth0
  team's M2M-for-Orgs GA notes that Device Code is incompatible
  with the strict "Organizations Required" setting.

CLI env vars (operator's machine):

```
ALPHASWARM_CLI_OIDC_DOMAIN=auth.alpha-swarm.ai
ALPHASWARM_CLI_OIDC_CLIENT_ID=
ALPHASWARM_CLI_OIDC_AUDIENCE=https://api.alpha-swarm.ai/
ALPHASWARM_CLI_OIDC_ORGANIZATION=                # B2B: pin to a single org
```

The CLI fetches all three from `/auth/config` when not set, so most
operators don't need to copy-paste.

### 3.3 `alphaswarm-backend-m2m` (M2M)

- Authorise against:
  - `alphaswarm-api` (all permissions the backend needs to act on its own behalf).
  - Auth0 Management API (`read:users`, `update:users`, `delete:sessions`,
    `read:sessions`, `read:logs`, `read:connections`,
    `create:guardian_enrollment_tickets`, `delete:guardian_enrollments`,
    `create:user_tickets`).

Backend env vars:

```
ALPHASWARM_AUTH_PROVIDER=auth0
ALPHASWARM_AUTH_OIDC_ISSUER=https://auth.alpha-swarm.ai/
ALPHASWARM_AUTH_OIDC_AUDIENCE=https://api.alpha-swarm.ai/
ALPHASWARM_AUTH_OIDC_CLIENT_ID=     # SPA client_id (for the SPA-targeted JWKS validation path)
ALPHASWARM_AUTH_OIDC_CLIENT_SECRET=                    # empty — SPAs are public clients
ALPHASWARM_AUTH0_MGMT_API_AUDIENCE=https://alphaswarm-prod.us.auth0.com/api/v2/
ALPHASWARM_AUTH0_MGMT_API_CLIENT_ID=
ALPHASWARM_AUTH0_MGMT_API_CLIENT_SECRET=               # via CredentialResolver in prod; env in dev
ALPHASWARM_AUTH0_DPOP_ENABLED=true                     # SDK mixed-mode
ALPHASWARM_AUTH0_DPOP_REQUIRED=false                   # flip true after CLI + SPA migrate
ALPHASWARM_AUTH_M2M_ENABLED=true
ALPHASWARM_AUTH_M2M_AUDIENCE=https://api.alpha-swarm.ai/
ALPHASWARM_AUTH_STEP_UP_ENABLED=true
ALPHASWARM_AUTH_STEP_UP_DEFAULT_MAX_AGE=180
```

### 3.4 `alphaswarm-action-callback-m2m` (M2M)

Same scopes as `alphaswarm-backend-m2m` but used INSIDE Auth0 Actions to
call `/_internal/auth0/sync` + `/_internal/idp/sync-groups`. The
Action body in [auth0-actions.md](../../concepts/identity/auth0-actions.md) shows how to
mint + cache the token.

### 3.5 `alphaswarm-agent-broker` (M2M for Token Exchange)

- Grants: `client_credentials` + `urn:ietf:params:oauth:grant-type:token-exchange`.
- Authorised APIs: `alphaswarm-api` with scopes
  `read:mcp:data`, `write:mcp:data`, `read:mcp:codebase`, `write:mcp:codebase`.
- Used ONLY by the Custom Token Exchange Profile body to mint
  delegated agent tokens.

Backend env vars:

```
ALPHASWARM_AUTH_AGENT_TOKEN_EXCHANGE_ENABLED=true
ALPHASWARM_AUTH_AGENT_BROKER_CLIENT_ID=
ALPHASWARM_AUTH_AGENT_BROKER_CLIENT_SECRET=           # via CredentialResolver in prod
ALPHASWARM_AUTH_AGENT_DELEGATION_TTL_SECONDS=300
```
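
For orientation, an RFC 8693 token-exchange request is a form-encoded POST to the token endpoint. The sketch below only builds the form body from the parameters that RFC 8693 defines; it is not the AlphaSwarm broker implementation, and `build_token_exchange_form` is a hypothetical helper. The `subject_token_type` URI is left as a required argument because its value depends on how the Custom Token Exchange Profile is configured.

```python
from typing import Iterable
from urllib.parse import parse_qs, urlencode

TOKEN_EXCHANGE_GRANT = "urn:ietf:params:oauth:grant-type:token-exchange"


def build_token_exchange_form(
    subject_token: str,
    subject_token_type: str,
    audience: str,
    scopes: Iterable[str],
    client_id: str,
    client_secret: str,
) -> str:
    """Return the application/x-www-form-urlencoded body for the exchange."""
    return urlencode(
        {
            "grant_type": TOKEN_EXCHANGE_GRANT,
            "subject_token": subject_token,
            "subject_token_type": subject_token_type,
            "audience": audience,
            "scope": " ".join(scopes),  # space-delimited, as in OAuth 2.0
            "client_id": client_id,
            "client_secret": client_secret,
        }
    )


# Placeholder values only; the token-type URI here is made up.
form = build_token_exchange_form(
    subject_token="USER_ACCESS_TOKEN",
    subject_token_type="urn:example:alphaswarm:user-token",
    audience="https://api.alpha-swarm.ai/",
    scopes=["read:mcp:data", "write:mcp:data"],
    client_id="BROKER_CLIENT_ID",
    client_secret="BROKER_CLIENT_SECRET",
)
decoded = parse_qs(form)
```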

---

## 4. Connections

### Database connection (B2C)

- Default `Username-Password-Authentication` database connection.
- Password Strength: "Excellent" (NIST 800-63 compliant).
- Enable: "Disable Signups from Public Signup Page" if you want
  invite-only onboarding (B2B-heavy deployments).

### Social connections (B2C)

- GitHub, Google (`google-oauth2`). Both default to the standard
  Auth0 connection types — no extra config beyond the Client ID +
  Secret from the respective developer console.

### Enterprise connections (B2B)

Configured per-org in `IdpConnectionRecord`. Auth0 supports
SAML, ADFS, Azure AD (Entra), Google Workspace, PingFederate,
SiteMinder, Okta Workforce Identity, OneLogin, JumpCloud, and
generic OIDC. The AlphaSwarm-side admin UI is
[`IdpGroupMappingEditor`](../alphaswarm_client/src/components/onboarding/IdpGroupMappingEditor.tsx).

Each enterprise connection MUST:

- Sync the user's group claims (Azure `groups`, Google's group claim,
  Okta `groups`). The Action `alphaswarm-idp-group-sync` reads them.
- Map to a single AlphaSwarm Organization via the matching
  `IdpConnectionRecord.organization_id`. Multiple orgs may
  use the same connection KIND (e.g. AcmeCorp Okta + Subsidiary
  Okta) but each is a separate record.

---

## 5. Organizations (B2B)

One Auth0 Organization per institutional tenant. Auth0 charges per
Org per month on most tiers — budget accordingly.

| Setting | Value |
| --- | --- |
| Membership on Login | "Require Members to use this Organization" (strict B2B) |
| Allowed Connections | Only the org's enterprise connection(s) |
| Branding | Per-org logo + colors so users land on a branded login |

Use `?organization=org_xxx&login_hint=user@acme.com` on `/authorize`
to skip the org-picker step. The SPA reads
`VITE_AUTH0_ORGANIZATION` to pin.
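
A minimal sketch of how such an `/authorize` URL could be assembled. `build_authorize_url` is a hypothetical helper, and a real Authorization Code flow would also add `state` and the PKCE `code_challenge` parameters, which are omitted here for brevity.

```python
from typing import Optional
from urllib.parse import parse_qs, urlencode, urlparse


def build_authorize_url(
    domain: str,
    client_id: str,
    redirect_uri: str,
    audience: str,
    scope: str,
    organization: Optional[str] = None,
    login_hint: Optional[str] = None,
) -> str:
    """Build an /authorize URL, pinning the organization when one is given."""
    params = {
        "response_type": "code",
        "client_id": client_id,
        "redirect_uri": redirect_uri,
        "audience": audience,
        "scope": scope,
    }
    if organization:
        params["organization"] = organization  # skips the org-picker step
    if login_hint:
        params["login_hint"] = login_hint
    return f"https://{domain}/authorize?{urlencode(params)}"


url = build_authorize_url(
    domain="auth.alpha-swarm.ai",
    client_id="CLIENT_ID",
    redirect_uri="https://app.alpha-swarm.ai/auth/callback",
    audience="https://api.alpha-swarm.ai/",
    scope="openid profile email",
    organization="org_xxx",
    login_hint="user@acme.com",
)
query = parse_qs(urlparse(url).query)
```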

The post-login Action (`alphaswarm-post-login`) reads `event.organization?.id`
and injects it as `https://alphaswarm.internal/org_id` so the FastAPI
`require_org` dep can branch immediately.

---

## 6. Actions

Two Login-trigger Actions (in this order):

1. **`alphaswarm-post-login`** — JIT user upsert + custom claim injection.
   Body in [auth0-actions.md](../../concepts/identity/auth0-actions.md) ("Phase 7 post-login
   Action" section, extended by "Phase 8" addendum for step-up MFA).

2. **`alphaswarm-idp-group-sync`** — reads external IdP group claims and
   posts to `/_internal/idp/sync-groups` so the AlphaSwarm backend upserts
   matching Membership rows per the per-org IdpGroupMapping table.
   Body in [auth0-actions.md](../../concepts/identity/auth0-actions.md) ("Phase 6 — IdP
   group sync Action" section).

And one Custom Token Exchange Profile:

3. **`alphaswarm-agent-delegation`** — RFC 8693 minting for delegated
   agent tokens. Body in
   [auth0-actions.md](../../concepts/identity/auth0-actions.md) ("Phase 8 — Custom Token
   Exchange Profile" section).

---

## 7. Pre-User-Registration trigger

One Action to block disposable emails + verify B2B invites:

```javascript
exports.onExecutePreUserRegistration = async (event, api) => {
  const email = (event.user.email || "").toLowerCase();
  const disposable = ["mailinator.com", "guerrillamail.com", "tempmail.org",
                      "10minutemail.com", "throwaway.email"];
  const domain = email.split("@")[1];
  if (!email) { api.access.deny("invalid_email", "email required"); return; }
  if (disposable.includes(domain)) {
    api.access.deny("disposable_email", "disposable email domains not allowed");
    return;
  }
  // B2B invite verification — operator chooses how strict.
  if (event.client.metadata?.flow === "b2b" && event.secrets.ALPHASWARM_BACKEND_URL) {
    // Call /_internal/auth/preregister-check (operator adds this route
    // if they want HMAC-based invite enforcement at registration time).
  }
};
```

---

## 8. Log Streams

One **Custom Webhook** log stream per env:

| Field | Value |
| --- | --- |
| Type | Custom Webhook |
| Payload URL | `https://api.alpha-swarm.ai/_internal/auth0/log-stream` |
| Authorization | `Bearer ` (matches `ALPHASWARM_AUTH0_LOG_STREAM_SECRET`) |
| Content Type | `application/json` |
| Custom Headers | (none beyond Authorization) |
| Filter | All events (the backend filters server-side) |

Operator generates the shared secret:

```
openssl rand -hex 32
```

…then sets it both in the Auth0 Dashboard webhook config AND in
the backend's `ALPHASWARM_AUTH0_LOG_STREAM_SECRET` env var. The HMAC
compare on `_verify_authorization` rejects any other value.
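
A minimal sketch of a constant-time check of this kind, using `hmac.compare_digest` from the Python standard library. This is an assumption about the general technique, not a copy of the actual `_verify_authorization` implementation; the function name and signature here are illustrative.

```python
import hmac
from typing import Optional


def verify_authorization(header_value: Optional[str], shared_secret: str) -> bool:
    """Return True only when the header is exactly 'Bearer <shared_secret>'.

    hmac.compare_digest avoids leaking how many leading characters matched.
    A missing header or an unset secret is always rejected.
    """
    if not header_value or not shared_secret:
        return False
    expected = f"Bearer {shared_secret}".encode("utf-8")
    return hmac.compare_digest(header_value.encode("utf-8"), expected)


accepted = verify_authorization("Bearer s3cret", "s3cret")
rejected = verify_authorization("Bearer wrong", "s3cret")
```

Rejecting an empty configured secret matters: otherwise a bare `Bearer ` header would pass when the env var is unset.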

Optionally also wire native Datadog / Splunk / Elastic streams for
the SIEM team — those are independent of the AlphaSwarm webhook.

---

## 9. Adaptive MFA

Security → Multi-factor Authentication → Adaptive MFA → ON.

| Risk level | Action | Why |
| --- | --- | --- |
| `low` | Allow (no MFA) | Normal session resumption |
| `medium` | MFA challenge | Suspicious-but-not-definitive signals |
| `high` | MFA challenge | Likely compromised |

Enabled MFA factors (Security → Multi-factor Authentication → Factors):

- **OTP (TOTP)** — always-on; required for every B2B user
- **WebAuthn** — recommended primary for B2B users
- **Push** (Auth0 Guardian app) — B2C convenience
- **SMS** — discouraged for B2B; allow as B2C fallback only
- **Email OTP** — convenient B2C fallback
- **Recovery codes** — always issue alongside any factor

The `alphaswarm-post-login` Action's Phase 8 addendum calls
`api.multifactor.enable("any", { allowRememberBrowser: false })`
when the SPA / CLI requests `acr_values=http://schemas.openid.net/pape/policies/2007/06/multi-factor`
on `/authorize`. This is the integration point for the backend's
`require_step_up` dep.
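
One way a backend could decide whether a step-up has been satisfied is to look at the `amr` and `auth_time` claims. The sketch below assumes both claims are present on the token being checked and uses the 180-second default from `ALPHASWARM_AUTH_STEP_UP_DEFAULT_MAX_AGE`; whether `require_step_up` works this way is not stated here, and `step_up_satisfied` is a hypothetical name.

```python
import time
from typing import Any, Mapping, Optional


def step_up_satisfied(
    claims: Mapping[str, Any],
    max_age_seconds: int = 180,
    now: Optional[float] = None,
) -> bool:
    """True when the token records MFA and the authentication is recent enough."""
    current = time.time() if now is None else now
    amr = claims.get("amr") or []
    auth_time = claims.get("auth_time")
    if "mfa" not in amr or auth_time is None:
        return False
    age = current - float(auth_time)
    return 0 <= age <= max_age_seconds


fresh = step_up_satisfied({"amr": ["mfa"], "auth_time": 1000}, now=1100)
stale = step_up_satisfied({"amr": ["mfa"], "auth_time": 1000}, now=1200)
```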

---

## 10. Env-var checklist (prod)

```
# IdP
ALPHASWARM_AUTH_PROVIDER=auth0
ALPHASWARM_AUTH_REQUIRED=true
ALPHASWARM_AUTH_ENFORCE=strict
ALPHASWARM_AUTH_OIDC_ISSUER=https://auth.alpha-swarm.ai/
ALPHASWARM_AUTH_OIDC_AUDIENCE=https://api.alpha-swarm.ai/
ALPHASWARM_AUTH_OIDC_CLIENT_ID=
ALPHASWARM_AUTH_CLAIMS_NAMESPACE=https://alphaswarm.internal/
ALPHASWARM_AUTH_CLAIMS_NAMESPACE_ALIASES=https://alphaswarm/   # CSV; legacy reader

# Management API
ALPHASWARM_AUTH0_MGMT_API_AUDIENCE=https://alphaswarm-prod.us.auth0.com/api/v2/
ALPHASWARM_AUTH0_MGMT_API_CLIENT_ID=
ALPHASWARM_AUTH0_MGMT_API_CLIENT_SECRET=                # via CredentialResolver

# M2M
ALPHASWARM_AUTH_M2M_ENABLED=true
ALPHASWARM_AUTH_M2M_AUDIENCE=https://api.alpha-swarm.ai/
ALPHASWARM_AUTH_M2M_TOKEN_TTL_SECONDS=900

# DPoP
ALPHASWARM_AUTH0_DPOP_ENABLED=true
ALPHASWARM_AUTH0_DPOP_REQUIRED=false                   # flip true once SDK rolled out
ALPHASWARM_DPOP_ENFORCEMENT_ENABLED=false              # per-route enforcement

# Step-up MFA (rule 52)
ALPHASWARM_AUTH_STEP_UP_ENABLED=true
ALPHASWARM_AUTH_STEP_UP_DEFAULT_MAX_AGE=180

# Auth0 Log Stream (rule 53)
ALPHASWARM_AUTH0_LOG_STREAM_SECRET=
ALPHASWARM_AUTH0_LOG_STREAM_MAX_AGE_SECONDS=86400

# Delegated agent tokens (rule 54)
ALPHASWARM_AUTH_AGENT_TOKEN_EXCHANGE_ENABLED=true
ALPHASWARM_AUTH_AGENT_BROKER_CLIENT_ID=
ALPHASWARM_AUTH_AGENT_BROKER_CLIENT_SECRET=             # via CredentialResolver
ALPHASWARM_AUTH_AGENT_DELEGATION_TTL_SECONDS=300

# B2B Entra (existing)
ALPHASWARM_AUTH_MSAL_B2B_ENABLED=true

# Tenancy
ALPHASWARM_TENANCY_DEFAULT_STRATEGY=hybrid
ALPHASWARM_TENANCY_RLS_ENFORCE=strict                   # was off; flip after Phase 5 verified

# MCP RFC conformance
ALPHASWARM_MCP_DATA_CANONICAL_URI=https://api.alpha-swarm.ai/mcp/data
ALPHASWARM_MCP_CODEBASE_CANONICAL_URI=https://api.alpha-swarm.ai/mcp/codebase
ALPHASWARM_MCP_REQUIRE_RFC8707=strict                   # was off

# Per-user OAuth wizard
ALPHASWARM_USER_OAUTH_ENABLED=true

# Audit
ALPHASWARM_AUTH_AUDIT_ENABLED=true
ALPHASWARM_AUTH_AUDIT_RETENTION_DAYS=365
```

---

## 11. CLI env vars (per operator)

```
ALPHASWARM_CLI_OIDC_DOMAIN=auth.alpha-swarm.ai
ALPHASWARM_CLI_OIDC_CLIENT_ID=
ALPHASWARM_CLI_OIDC_AUDIENCE=https://api.alpha-swarm.ai/
ALPHASWARM_CLI_OIDC_ORGANIZATION=                        # B2B: pin to a single org
# Headless / CI fallback (no keyring backend):
ALPHASWARM_CLI_AUTH_ALLOW_PLAINTEXT_FALLBACK=0
```

---

## 12. Rollout order

| Step | Action | Verification |
| --- | --- | --- |
| 1 | Create dev tenant + apps + custom domain | `/auth/config` returns the tenant id |
| 2 | Backend up with `ALPHASWARM_AUTH_ENFORCE=permissive` | Existing routes still serve; 401 dashboard shows zero would-be denies |
| 3 | Flip `ALPHASWARM_AUTH_ENFORCE=strict` | Unauthenticated calls return 401 |
| 4 | Wire Auth0 log-stream webhook + Action triggers | Force a session-revoke in Dashboard; verify `cleanup_for_user` Celery row + audit row |
| 5 | Enable `ALPHASWARM_AUTH_STEP_UP_ENABLED=true` | Click kill-switch → MFA prompt; complete it; subsystems halt |
| 6 | Enable `ALPHASWARM_AUTH_AGENT_TOKEN_EXCHANGE_ENABLED=true` + create Profile | Trigger an agent that calls a DataMCP tool; verify `act` claim in `/mcp/data` response body + `delegation` JSON in audit |
| 7 | Enable `ALPHASWARM_USER_OAUTH_ENABLED=true` | `/me/oauth-connections/providers` returns the 5 providers |
| 8 | Enable BYOK broker credentials (run Alembic 0065) | Add an Alpaca paper key; smoke-test a paper trade |
| 9 | Enable RLS strict mode (`ALPHASWARM_TENANCY_RLS_ENFORCE=strict`) | Existing test workspace queries still work; cross-workspace fetches return zero rows |
| 10 | Enable MCP RFC 8707 strict mode | MCP calls with mis-audienced tokens return 401 + WWW-Authenticate header |

Each flip is independently reversible.

---

## 13. Reference docs

- [alphaswarm_docs/auth0-actions.md](../../concepts/identity/auth0-actions.md) — Action bodies + the
  Custom Token Exchange Profile setup.
- [alphaswarm_docs/identity.md](../../concepts/identity/identity.md) — the full identity stack.
- [alphaswarm_docs/multi-tenancy.md](../../concepts/identity/multi-tenancy.md) — Organization →
  EntraTenantLink → User → Membership flow.
- [alphaswarm_docs/credentials.md](../../concepts/identity/credentials.md) — how M2M + BYOK
  credentials flow through CredentialResolver.
- [.cursor/rules/identity.mdc](../.cursor/rules/identity.mdc) — the
  always-on identity-enforcement rule.
- [.cursor/rules/auth-stepup-and-byok.mdc](../.cursor/rules/auth-stepup-and-byok.mdc)
  — Phase 5+ rules (52-55) scoped to the new module files.
- [AGENTS.md](https://github.com/julianwileymac/alphaswarm/blob/main/AGENTS.md) — hard rules 27, 44, 45, 50, 51, 52-55.


<!-- https://alpha-swarm.ai/concepts/identity/biscuit-capabilities -->
# Biscuit capability tokens

# Biscuit capability tokens

> Phase 5 §8.2 of
> [RESTRUCTURING_PLAN.md](https://github.com/julianwiley/alphaswarm/blob/main/RESTRUCTURING_PLAN.md).
> Sits ALONGSIDE the existing
> [`TokenExchangeBroker`](https://github.com/julianwiley/alphaswarm/blob/main/alphaswarm/auth/token_exchange.py)
> (Rule 54), not replacing it.

## The problem

`TokenExchangeBroker` mints short-lived JWTs via the
RFC 8693 `urn:ietf:params:oauth:grant-type:token-exchange` grant.
The result is a delegated agent JWT that carries every scope the
agent could possibly need:

```
GET /mcp/data/iceberg.read  -- Bearer 
POST /mcp/data/iceberg.write -- Bearer 
```

If the agent is compromised mid-run, the attacker exfiltrates the
JWT and replays it for ANY of those scopes until expiry. The JWT is
broad-by-design — the broker can't know in advance which exact tool
+ arguments the agent will call.

## The Biscuit answer

A biscuit is a capability token with a key property: **anyone can
narrow it (attenuate), no one can widen it**. The minting flow
becomes:

```
user JWT
   │
   ▼
TokenExchangeBroker.exchange()     -> delegated JWT (broad scopes)
   │
   ▼
biscuit.mint_biscuit(jwt, caps)    -> biscuit covering the full
   │                                  capability set for this run
   ▼
agent.attenuate_for_call(...)      -> EXACTLY (tool, args, hash)
   │
   ▼
HTTP POST /mcp/data/iceberg.read
  Authorization: Bearer       -- existing path stays
  X-Biscuit:   -- new gate
```

A compromised agent that exfiltrates the attenuated biscuit can
ONLY replay the one call that biscuit was minted for. The
attenuated biscuit's chained check fires on any other call:

```
check if capability("data.iceberg.read", "read", "nyse:trades", "")
```

## AlphaSwarm integration

The helpers live in [`alphaswarm/auth/biscuit.py`](https://github.com/julianwiley/alphaswarm/blob/main/alphaswarm/auth/biscuit.py):

```python
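from alphaswarm.auth.biscuit import (
    mint_biscuit, attenuate_for_call, verify_biscuit,
    Capability,
)
# `descriptor_hash_for`, `request`, and `settings` are assumed to come from
# the surrounding application code; they are not defined in this snippet.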

# 1. Mint at agent-run boot — derive from the delegated JWT's scopes.
issued = mint_biscuit(
    user_sub=request.user.sub,
    agent_sub="agent_alpha_research_v3",
    capabilities=[
        Capability(
            tool="data.iceberg.read",
            action="read",
            resource="nyse:trades",
            descriptor_hash=descriptor_hash_for("data.iceberg.read"),
            cell_id=request.alphaswarm_context.cell_id,
        ),
        # ... one per tool the agent may invoke during this run
    ],
    private_key_pem=settings.biscuit_signing_key_pem,
    ttl_seconds=900,
    cell_id=request.alphaswarm_context.cell_id,
)

# 2. Attenuate per tool call — the agent narrows to exactly this call.
narrow = attenuate_for_call(
    parent_b64=issued.token_b64,
    tool="data.iceberg.read",
    action="read",
    resource="nyse:trades",
    descriptor_hash=descriptor_hash_for("data.iceberg.read"),
    cell_id=request.alphaswarm_context.cell_id,
)
# Attach `narrow` as the X-Biscuit header on the MCP HTTP call.

# 3. Verify at MCP server — checks the attenuated chain.
verified = verify_biscuit(
    token_b64=request.headers["X-Biscuit"],
    public_key_pem=settings.biscuit_public_key_pem,
    expected_tool="data.iceberg.read",
    expected_action="read",
    expected_resource="nyse:trades",
    expected_descriptor_hash=descriptor_hash_for("data.iceberg.read"),
    expected_cell_id=request.alphaswarm_context.cell_id,
)
```

## Capability shape

The `Capability` record carries four required fields:

| Field | Meaning |
| --- | --- |
| `tool` | MCP tool name, e.g. `data.iceberg.read`. |
| `action` | Verb, e.g. `read`, `write`, `delete`. |
| `resource` | Canonical resource id, e.g. `nyse:trades`. |
| `descriptor_hash` | SHA-256 of the canonical-JSON MCP tool descriptor (Phase 5 §8.4). |

Plus an optional `cell_id` that pins the capability to a specific
deployment cell (Phase 3 §6.2).
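
As an illustration of the `descriptor_hash` field, the sketch below hashes a tool descriptor after serialising it with sorted keys and compact separators. The exact canonical-JSON rules AlphaSwarm uses are not given here, so this canonicalisation, the `descriptor_hash` helper name, and the sample descriptor are all assumptions.

```python
import hashlib
import json
from typing import Any, Mapping


def descriptor_hash(descriptor: Mapping[str, Any]) -> str:
    """SHA-256 hex digest of a key-sorted, whitespace-free JSON encoding."""
    canonical = json.dumps(
        descriptor, sort_keys=True, separators=(",", ":"), ensure_ascii=False
    )
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


# Hypothetical descriptor; key order must not affect the hash.
a = descriptor_hash({"name": "data.iceberg.read", "input": {"table": "string"}})
b = descriptor_hash({"input": {"table": "string"}, "name": "data.iceberg.read"})
c = descriptor_hash({"name": "data.iceberg.read", "input": {"table": "int"}})
```

Pinning the hash in the capability means a biscuit minted against one version of a tool descriptor stops matching if the descriptor changes.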

## Capability namespacing

The capability namespace matches the MCP tool name:

| Tool | Capability |
| --- | --- |
| `data.iceberg.read` | `read` |
| `data.iceberg.write` | `write` |
| `data.entities.search` | `read` |
| `data.entities.create` | `write` |
| `data.lineage.read` | `read` |
| `data.secrets.read` | NOT BISCUIT-GATED — uses BrokerCredentialStore (Rule 55) |

Adding a new tool with a new capability is purely additive — the
existing biscuits keep working for the tools they cover.

## Mint key rotation

The biscuit signing key is an ed25519 key pair. The private key
lives in Vault Transit (Phase 4 §7.6); the public key is projected
into every MCP server pod via a
[`VaultStaticSecret`](https://github.com/julianwiley/alphaswarm/blob/main/alphaswarm_platform/deployments/kubernetes/mesh-identity/vault-secrets-operator/sample-vault-static-secret.yaml)
named `biscuit-public-key`.

Rotation procedure (operator-level):

1. Generate a new ed25519 key pair via Vault Transit's
   `transit/keys/biscuit-signing/rotate`.
2. Vault Transit keeps the OLD key version live for 7 days.
3. During that 7-day overlap, every MCP server should accept
   biscuits signed by EITHER key: the verify path tries the new key
   first and falls back to the old key on signature mismatch. This
   multi-key fallback is a Phase 5.5 TODO; Phase 5 ships
   single-key verify.
4. After 7 days, drop the old key version.

## Failure modes

| Failure | Behaviour |
| --- | --- |
| `biscuit-python` not installed (e.g. Windows dev) | `BiscuitUnavailable` raised; the agent runtime falls back to JWT-only delegation. The MCP server returns 503 if biscuit is required for the route. |
| Biscuit signature mismatch | `BiscuitVerificationError` raised; route returns 403 `biscuit_invalid`. |
| Biscuit capability doesn't match the route | `BiscuitVerificationError`; route returns 403 `biscuit_capability_mismatch`. |
| Biscuit expired | `BiscuitVerificationError`; route returns 401 `biscuit_expired`. |

## Why not just narrow the JWT?

JWTs are not attenuable. Once Auth0 mints a JWT with scopes
`[data:read, data:write]`, the agent CANNOT mint a derived JWT with
just `[data:read]`: that would require the agent to act as its own
authorization server (it isn't one) and would mean exposing the JWT
signing key to the agent.

Biscuits sidestep this by encoding capabilities as facts the agent
can chain narrowing checks onto. The signature stays on the
authority block; chained blocks add restrictions, never expand them.
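
The narrow-only property can be illustrated with a toy model that has no cryptography at all. In the sketch below a token is a fixed capability set plus a growing tuple of checks, and a call is allowed only if it is in the set and passes every check. `ToyToken` is purely illustrative; real biscuits enforce the same rule through the signed block chain rather than through an in-memory object.

```python
from dataclasses import dataclass
from typing import Callable, FrozenSet, Tuple

Call = Tuple[str, str, str]  # (tool, action, resource)
Check = Callable[[Call], bool]


@dataclass(frozen=True)
class ToyToken:
    capabilities: FrozenSet[Call]
    checks: Tuple[Check, ...] = ()

    def attenuate(self, check: Check) -> "ToyToken":
        """Return a new token with one more check; existing checks are kept."""
        return ToyToken(self.capabilities, self.checks + (check,))

    def allows(self, call: Call) -> bool:
        """A call needs a matching capability AND must pass every check."""
        return call in self.capabilities and all(c(call) for c in self.checks)


read_call: Call = ("data.iceberg.read", "read", "nyse:trades")
write_call: Call = ("data.iceberg.write", "write", "nyse:trades")

broad = ToyToken(frozenset({read_call, write_call}))
narrow = broad.attenuate(lambda call: call == read_call)
# A later, permissive check cannot undo the earlier restriction.
rewidened = narrow.attenuate(lambda call: True)
```

Because checks are only ever appended and all of them must pass, no sequence of `attenuate` calls can make `rewidened` accept the write call again.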

## Phase 5.5 follow-ups

1. **Agent runtime wire-up** — automate the
   `mint_biscuit + attenuate_for_call` calls on every MCP tool
   invocation in `alphaswarm/agents/runtime.py`. Today the helpers are
   standalone.
2. **Key-rotation overlap window** — `verify_biscuit` accepts a list
   of public keys to try in order once the multi-key fallback lands in
   Phase 5.5; Phase 5 ships single-key verify only.
3. **MCP server-side enforcement** — wire `verify_biscuit` into the
   MCP HTTP request handler at `alphaswarm/data/mcp/server.py` so every
   tool call that doesn't carry a valid biscuit gets 401.

## Related documents

- [RESTRUCTURING_PLAN.md §8.2](https://github.com/julianwiley/alphaswarm/blob/main/RESTRUCTURING_PLAN.md)
- [alphaswarm/auth/biscuit.py](https://github.com/julianwiley/alphaswarm/blob/main/alphaswarm/auth/biscuit.py)
- [alphaswarm/auth/token_exchange.py](https://github.com/julianwiley/alphaswarm/blob/main/alphaswarm/auth/token_exchange.py) — the JWT broker biscuits run alongside.
- Biscuit specification: https://github.com/biscuit-auth/biscuit
- biscuit-python: https://github.com/biscuit-auth/biscuit-python


<!-- https://alpha-swarm.ai/concepts/identity/cloud-credentials -->
# Cloud credentials
> How AlphaSwarm routes secret resolution through `CredentialResolver` once the cloud `SecretStore` siblings are wired.

# Cloud credentials

How AlphaSwarm routes secret resolution through `CredentialResolver` once the
cloud `SecretStore` siblings are wired. Phase C of the Phase 7
rollout (Terraform IaC + multi-cloud).

## Resolver chain

```mermaid
flowchart LR
  caller["service code"] --> resolver["CredentialResolver.resolve(CredentialKey)"]
  resolver --> m2m["M2MStore (priority 10)"]
  resolver --> vault["HashicorpVaultStore (priority 15)"]
  resolver --> cloud["Cloud SecretStore (priority 30)"]
  resolver --> file["FileSecretStore (priority 50)"]
  resolver --> env["EnvSecretStore (priority 100)"]
  cloud --> azurekv["AzureKeyVaultStore"]
  cloud --> awssm["AwsSecretsManagerStore"]
  cloud --> gcpsm["GcpSecretManagerStore"]
```

Lower priority numbers resolve first. The cloud store is added to the
default chain only when `ALPHASWARM_DEFAULT_CLOUD_PROVIDER` matches and the
matching SDK is installed (see
[`alphaswarm/credentials/resolver.py::_build_default_resolver`](../alphaswarm/credentials/resolver.py)).
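
As a rough sketch of the ordering rule, the chain can be modelled as a list of stores sorted by priority, with the cloud store appended only when it is configured and its SDK is importable. `ToyStore`, `build_chain`, and `resolve` are hypothetical stand-ins, not the actual `_build_default_resolver` code.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class ToyStore:
    """In-memory stand-in for a SecretStore."""

    name: str
    priority: int
    secrets: Dict[str, dict] = field(default_factory=dict)

    def get(self, key: str) -> Optional[dict]:
        return self.secrets.get(key) or None


def build_chain(
    base: List[ToyStore],
    cloud_store: Optional[ToyStore] = None,
    sdk_installed: bool = False,
) -> List[ToyStore]:
    """Add the cloud store only when configured AND its SDK is available."""
    chain = list(base)
    if cloud_store is not None and sdk_installed:
        chain.append(cloud_store)
    return sorted(chain, key=lambda store: store.priority)


def resolve(chain: List[ToyStore], key: str) -> Optional[dict]:
    """Return the first non-empty hit, tagged with the store that produced it."""
    for store in chain:
        hit = store.get(key)
        if hit:
            return {**hit, "source": store.name}
    return None
```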

## Naming conventions

| Store              | Key format                                                    | Notes                                                                  |
| ------------------ | ------------------------------------------------------------- | ---------------------------------------------------------------------- |
| Env                | `_` (uppercase, `:` → `_`)                 | Always-on safety net.                                                  |
| File               | `bootstrap_state_dir/-.json`                | Bootstrap workflows write these (Polaris principal, etc).              |
| Azure Key Vault    | `alphaswarm--` (alphanumerics + `-` only)          | Vault names disallow `:` / `/` / `_`.                                  |
| AWS Secrets Mgr    | `{prefix}/` (default prefix `alphaswarm/`)         | Slashes are first-class path separators.                               |
| GCP Secret Mgr     | `projects/{project}/secrets/{prefix}-`      | Names allow `[A-Za-z0-9_-]` only — joins use `-`.                      |
| Vault KV v2        | `/data//`                            | `hvac.Client.secrets.kv.v2.read_secret_version` adds `/data/` automatically. |

The cloud secret values are parsed as JSON first; when parsing fails
they're exposed via the canonical `credential` field.
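
A minimal sketch of that JSON-first rule, using a hypothetical helper name `parse_secret_payload`. One assumption goes beyond the text: a value that parses as JSON but is not an object (for example a bare number) is also treated as a plain string.

```python
import json


def parse_secret_payload(raw: str) -> dict:
    """Parse a cloud secret value as JSON, else wrap it under 'credential'."""
    try:
        parsed = json.loads(raw)
    except json.JSONDecodeError:
        return {"credential": raw}
    if isinstance(parsed, dict):
        return parsed
    # Assumption: valid JSON that is not an object is kept as the raw string.
    return {"credential": raw}
```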

## Example secret layouts

### Azure Key Vault — `alphaswarm-msal-clientsecret`

```json
{
  "client_secret": "rxq8Q..."
}
```

### AWS Secrets Manager — `alphaswarm/broker/api_key`

```
sk_live_abcdef1234567890
```

Plain string payload — exposed via `credential.get("credential")`.

### GCP Secret Manager — `alphaswarm-postgres-password`

```json
{
  "password": "...",
  "username": "alphaswarm"
}
```

### HashiCorp Vault KV v2 — `secret/data/alphaswarm/redis/password`

```json
{
  "password": "..."
}
```

## Wiring a SecretStore

Pick a cloud + install the matching extra:

```bash
pip install 'alphaswarm[cloud-azure]'   # AzureKeyVaultStore
pip install 'alphaswarm[cloud-aws]'     # AwsSecretsManagerStore
pip install 'alphaswarm[cloud-gcp]'     # GcpSecretManagerStore
pip install 'alphaswarm[vault]'         # HashicorpVaultStore
```

Configure (matching cloud picked via `ALPHASWARM_DEFAULT_CLOUD_PROVIDER`):

```env
# Azure
ALPHASWARM_DEFAULT_CLOUD_PROVIDER=azure
ALPHASWARM_AZURE_TENANT_ID=...
ALPHASWARM_AZURE_SUBSCRIPTION_ID=...
ALPHASWARM_AZURE_KEYVAULT_URL=https://alphaswarm-vault.vault.azure.net/

# AWS
ALPHASWARM_DEFAULT_CLOUD_PROVIDER=aws
ALPHASWARM_AWS_REGION=us-east-1
ALPHASWARM_AWS_ACCOUNT_ID=123456789012
ALPHASWARM_AWS_SECRETSMANAGER_PREFIX=alphaswarm/

# GCP
ALPHASWARM_DEFAULT_CLOUD_PROVIDER=gcp
ALPHASWARM_GCP_PROJECT_ID=alphaswarm-prod
ALPHASWARM_GCP_REGION=us-central1
ALPHASWARM_GCP_SECRET_PREFIX=alphaswarm-

# Vault (any cloud)
ALPHASWARM_VAULT_ADDR=https://vault.example.com
ALPHASWARM_VAULT_NAMESPACE=...
ALPHASWARM_VAULT_MOUNT=secret
ALPHASWARM_VAULT_ROLE_ID=...
ALPHASWARM_VAULT_SECRET_ID=...
```

The resolver auto-adds the matching cloud store + Vault store when
the env vars are present. Code that needs a credential does:

```python
from alphaswarm.credentials import get_resolver
from alphaswarm.credentials.protocol import CredentialKey

resolver = get_resolver()
cred = resolver.resolve(CredentialKey(service="msal", purpose="client_secret"))
secret = cred.require("client_secret")
```

## Authentication backends per cloud store

| Store                  | Identity source                                                  |
| ---------------------- | ---------------------------------------------------------------- |
| Azure Key Vault        | `DefaultAzureCredential` (az login / SP env / Workload Identity) |
| AWS Secrets Manager    | boto3 default chain (env / shared credentials / IRSA / EC2 role) |
| GCP Secret Manager     | `google.auth.default()` (gcloud ADC / SA file / Workload Identity)|
| HashiCorp Vault        | AppRole (preferred) or whatever the operator pre-configured     |

For cluster-side workloads the **Workload Identity** variants are the
canonical path:

- AKS — `AzureAksAdapter` + Azure Workload Identity (Service Account
  annotation `azure.workload.identity/client-id: `).
- EKS — `AwsEksAdapter` + IRSA (`eks.amazonaws.com/role-arn`
  annotation).
- GKE — `GcpGkeAdapter` + GKE Workload Identity
  (`iam.gke.io/gcp-service-account` annotation).

## External Secrets Operator integration

The Terraform `secrets` module wires an
[`external-secrets`](https://external-secrets.io) `ClusterSecretStore`
pointing at whichever backend matches `vault_backend`. The
`secret_mappings` locals block emits one `ExternalSecret` per
`(k8s_secret_name, vault_path)` pair so AlphaSwarm pods consume secrets via
mounted Secrets — never raw env vars.

See [`alphaswarm_platform/terraform/modules/secrets/main.tf`](../alphaswarm_platform/terraform/modules/secrets/main.tf)
for the full mapping table.

## Temporary credentials minted via cloud CLI

Operators with an `admin:cluster` scope can mint short-lived
credentials directly from the admin UI without shipping the cloud
CLI binaries into the BFF container. The control plane wraps
`aws sts assume-role` / `gcloud auth print-access-token` /
`az account get-access-token` in an audit-first subprocess runner;
the resulting credential is persisted under a resolver key supplied
by the operator and surfaces through the standard
`CredentialResolver.resolve(...)` chain. See the
[cloud-CLI temporary credentials](../../how-to/operations/cloud-cli-temporary-credentials.md)
runbook for the wizard walkthrough, audit shape, and step-up MFA
contract.


<!-- https://alpha-swarm.ai/concepts/identity/credentials -->
# Credentials resolver
> The resolver walks an ordered chain of :class:`alphaswarm.credentials.SecretStore` instances and returns the first non-empty hit, falling back to a caller-supplied default. The chain order means a fresh M2M ...

# Credentials resolver

AlphaSwarm collapses every "where does this service's credential come from?"
question into a single :class:`alphaswarm.credentials.CredentialResolver`.

The resolver walks an ordered chain of
:class:`alphaswarm.credentials.SecretStore` instances and returns the first
non-empty hit, falling back to a caller-supplied default. The chain
order means a fresh M2M token wins over a bootstrap-minted file
payload, which wins over a static `settings` seed.

## Why

The motivating bug: `iceberg_bootstrap` mints a runtime principal
(`alphaswarm_runtime`) and persists it to
`data/bootstrap/polaris-principal.json`, but `polaris_client` and
`iceberg_catalog._build_properties` historically read
`settings.polaris_client_*` / `settings.iceberg_rest_credential` —
the static `root` / `s3cr3t` seed — so Polaris kept rejecting the
API container's writes with `CREATE_TABLE_DIRECT_WITH_WRITE_DELEGATION`
403s.

The resolver closes that loop without forking the credential paths.

## Architecture

```mermaid
flowchart TD
    Caller[Service code]
    Resolver[CredentialResolver]
    M2M["M2MStore (priority 10)"]
    File["FileSecretStore (priority 50)"]
    Env["EnvSecretStore (priority 100)"]
    M2MIssuer[M2MTokenIssuer]
    Bootstrap["IcebergBootstrapManager (persists json)"]
    Settings["alphaswarm.config.settings"]

    Caller -->|"resolve(CredentialKey)"| Resolver
    Resolver --> M2M
    Resolver --> File
    Resolver --> Env
    M2M --> M2MIssuer
    File --> Bootstrap
    Env --> Settings
```

The resolver is a process-wide singleton built lazily by
:func:`alphaswarm.credentials.get_resolver`. The default chain is `Env` +
`File`; `M2M` plugs in front when
:func:`alphaswarm.auth.m2m.install_m2m_store` runs (controlled by
`ALPHASWARM_AUTH_M2M_ENABLED`).

## Usage

```python
from alphaswarm.credentials import CredentialKey, get_resolver

cred = get_resolver().resolve(
    CredentialKey("polaris", "oauth"),
    default={"client_id": "root", "client_secret": "s3cr3t"},
)
client_id = cred.get("client_id")
client_secret = cred.get("client_secret")
```

`Credential.source` is `"file"` / `"env"` / `"m2m"` / `"default"`,
useful for diagnostics.

## Field maps

Per `(service, purpose)`, here is what consumers expect:

- `polaris:oauth` → `client_id`, `client_secret`, `principal`
- `polaris:rest` / `iceberg:rest` → `credential` (`:`),
  `token`, `oauth2_server_uri`, `scope`
- `trino:basic` → `user`, `source`, optional `token` / `access_token`
- `minio:static` → `access_key`, `secret_key`, `endpoint_url`, `region`
- `minio:sts` → `session_token` (M2M-issued)
- `neo4j:basic` → `user`, `password`, `uri`

Add new entries to
[alphaswarm/credentials/stores/env_store.py](../alphaswarm/credentials/stores/env_store.py)
when you wire a new service to the resolver.
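
When wiring a new consumer, it can help to check a resolved payload against the field map before using it. The helper below is a hypothetical sketch, not part of `env_store.py`; it encodes only three of the entries listed above and treats every listed field as required.

```python
from typing import Dict, List, Tuple

# Subset of the field maps above; the structure itself is an assumption.
EXPECTED_FIELDS: Dict[Tuple[str, str], Tuple[str, ...]] = {
    ("polaris", "oauth"): ("client_id", "client_secret", "principal"),
    ("minio", "static"): ("access_key", "secret_key", "endpoint_url", "region"),
    ("neo4j", "basic"): ("user", "password", "uri"),
}


def missing_fields(service: str, purpose: str, payload: dict) -> List[str]:
    """List expected fields that are absent or empty in a resolved payload."""
    expected = EXPECTED_FIELDS.get((service, purpose), ())
    return [name for name in expected if not payload.get(name)]
```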

## Bootstrap → resolver

Bootstrap workflows call
:func:`alphaswarm.services.iceberg_bootstrap.persist_principal_credentials`
(and similar) to write JSON under `settings.bootstrap_state_dir`.
`FileSecretStore` reads those files; the bootstrap also resets any
caches that depend on the credentials (e.g.
`iceberg_catalog.reset_catalog_cache()`).

When you add a new bootstrap step:

1. Add the file name to
   [`alphaswarm/credentials/stores/file_store.py::_FILE_MAP`](../alphaswarm/credentials/stores/file_store.py).
2. Persist a JSON payload with at least `client_id` / `client_secret`.
3. Reset any consumer caches in your bootstrap writer.
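
Step 2 might look roughly like the sketch below. It is not `persist_principal_credentials`; the helper names and the temp-file-then-rename write are assumptions chosen so a reader never observes a half-written file.

```python
import json
import os
import tempfile
from pathlib import Path


def persist_payload(state_dir: Path, file_name: str, payload: dict) -> Path:
    """Write a credential payload as JSON, replacing any previous file atomically."""
    missing = [name for name in ("client_id", "client_secret") if not payload.get(name)]
    if missing:
        raise ValueError(f"payload is missing required fields: {missing}")
    state_dir.mkdir(parents=True, exist_ok=True)
    target = state_dir / file_name
    # Write to a temp file in the same directory, then rename over the target.
    fd, tmp_name = tempfile.mkstemp(dir=state_dir, suffix=".tmp")
    with os.fdopen(fd, "w", encoding="utf-8") as handle:
        json.dump(payload, handle)
    os.replace(tmp_name, target)
    return target


def load_payload(path: Path) -> dict:
    """Read a payload back, as a file-backed store would."""
    return json.loads(path.read_text(encoding="utf-8"))
```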

## Diagnostics

`get_resolver().describe()` returns the active store chain and
priorities — wire it into a debug endpoint when you need to inspect
the resolution order from outside the process.

## Testing

`tests/credentials/` contains the canonical test patterns:

- Test the resolver chain priority order with `pytest`.
- Test new env store branches with a `_StubSettings` shim.
- Test new file store keys by writing the JSON to a `tmp_path`.

The `reset_resolver` fixture re-builds the singleton between tests so
you don't have to track down stale state.


<!-- https://alpha-swarm.ai/concepts/identity/edge-authentication -->
# Edge authentication & cell routing
> How the alphaswarm-tenant-router verifies JWTs fail-closed at the Envoy edge, routes B2C/B2B tenants onto cell tiers, and validates Cell-Bound-Authorization for cross-cell calls.

# Edge authentication & cell routing

Every request entering the hosted platform crosses one authentication
decision point before it reaches a cell:
[`alphaswarm-edge`](https://github.com/Alpha-Swarm-ai/alphaswarm_platform/tree/main/build/docker/alphaswarm-edge)
(Envoy) makes two `ext_authz` callouts to
[`alphaswarm-tenant-router`](https://github.com/Alpha-Swarm-ai/alphaswarm_platform/tree/main/tenant_router),
which verifies identity and decides cell placement in one pass.

```mermaid
sequenceDiagram
    participant C as Client (SPA / CLI / agent)
    participant E as alphaswarm-edge (Envoy)
    participant R as alphaswarm-tenant-router
    participant Cell as alphaswarm-core (per-cell)

    C->>E: request + Authorization: Bearer JWT
    E->>R: POST /cell_bound/v1/check (CBA filter)
    R-->>E: 200 (no CBA header = external traffic)
    E->>R: POST /ext_authz/v3/check
    Note over R: verify JWT vs IdP JWKS (iss, aud, exp, alg allowlist)
    Note over R: pick cell: pinning → tier claim → default
    R-->>E: 200 + x-alphaswarm-cell + verified identity headers
    E->>Cell: request + x-alphaswarm-sub/-tenant/-workspace
```

## Fail-closed verification

The router's posture is an explicit setting
(`ALPHASWARM_TENANT_ROUTER_AUTH_MODE`), and the default is the strict
one — see the
[rollout runbook](../../how-to/tenant-router-auth-rollout.md) for the
operational details:

| Mode | No token | Invalid token | Valid token |
| --- | --- | --- | --- |
| `required` (default, hosted cells) | 401 | 401 | allow |
| `permissive` (canary/migration) | allow, flagged | 401 | allow |
| `disabled` (local dev; needs `ALLOW_INSECURE=true` too) | allow | unsigned decode | unsigned decode |
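
The mode table can be expressed as a small decision function. This is an illustrative sketch, not the router's code: `edge_decision` is a hypothetical name, and raising an error when `disabled` is selected without the insecure flag is an assumption about how that requirement is enforced.

```python
from typing import Tuple

TOKEN_STATES = ("missing", "invalid", "valid")


def edge_decision(mode: str, token_state: str, allow_insecure: bool = False) -> Tuple[int, str]:
    """Return (HTTP status, note) for one row/column of the mode table."""
    if token_state not in TOKEN_STATES:
        raise ValueError(f"unknown token state: {token_state}")
    if mode == "required":
        return (200, "allow") if token_state == "valid" else (401, "deny")
    if mode == "permissive":
        if token_state == "missing":
            return (200, "allow, flagged")
        return (200, "allow") if token_state == "valid" else (401, "deny")
    if mode == "disabled":
        if not allow_insecure:
            # Assumption: the insecure flag is checked before any request is served.
            raise ValueError("disabled mode also needs ALLOW_INSECURE=true")
        return (200, "allow") if token_state == "missing" else (200, "unsigned decode")
    raise ValueError(f"unknown auth mode: {mode}")
```

Note that `permissive` relaxes only the no-token column: a token that is present but invalid is still rejected.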

Three design rules keep the edge honest:

1. **Boot-time refusal.** In `required`/`permissive` the pod exits at
   startup unless issuer + audience (and a derivable JWKS URI) are
   configured. A crash-looping edge is strictly better than one that
   silently routes unauthenticated traffic.
2. **Asymmetric algorithms only.** `RS*`/`PS*`/`ES*`/`EdDSA` are the
   only acceptable JWT algorithms; `HS*` and `none` are rejected
   before any key material is consulted, closing the alg-confusion
   class of attacks. Verification semantics mirror
   `alphaswarm_core.auth.jwt_validator.JwtValidator` (kid selection,
   one forced JWKS refresh on unknown kid for key rotation, TTL cache
   that serves stale on IdP blips).
3. **Identity headers are always overwritten.** On every ALLOW the
   router emits the full verified set — `x-alphaswarm-sub`,
   `x-alphaswarm-tenant`, `x-alphaswarm-workspace`, `x-alphaswarm-org`,
   `x-alphaswarm-auth` — empty when a claim is absent, so a client can
   never smuggle its own `x-alphaswarm-*` values past the edge. Per-cell
   FastAPI gates (`alphaswarm.api.security`) still re-validate the JWT;
   the edge is defense-in-depth, not the only boundary (AGENTS rule 11
   applies at every layer).
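
Rule 2 amounts to inspecting the unverified JWT header and rejecting the token before any key lookup. The sketch below uses only the standard library; `alg_is_acceptable` is a hypothetical helper, not the `JwtValidator` API, and it performs no signature verification itself.

```python
import base64
import json

ASYMMETRIC_PREFIXES = ("RS", "PS", "ES")


def alg_is_acceptable(token: str) -> bool:
    """True only when the JWT header names an asymmetric algorithm."""
    header_b64 = token.split(".", 1)[0]
    padded = header_b64 + "=" * (-len(header_b64) % 4)
    try:
        header = json.loads(base64.urlsafe_b64decode(padded))
    except ValueError:
        return False  # malformed base64, bytes, or JSON: reject outright
    if not isinstance(header, dict):
        return False
    alg = header.get("alg")
    if not isinstance(alg, str):
        return False
    # HS* and "none" fall through to False without touching key material.
    return alg == "EdDSA" or alg.startswith(ASYMMETRIC_PREFIXES)
```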

## B2C / B2B tier routing

Cell selection composes the [multi-tenancy](./multi-tenancy.md) model
with the deployment tiers from RESTRUCTURING_PLAN.md §6.1:

| Plan | JWT `tier` claim | Cell tier | Tenancy strategy |
| --- | --- | --- | --- |
| B2C consumer | (none) or `shared-std` | `shared-std` | `shared_schema_rls` |
| B2B premium | `shared-prem` | `shared-prem` | `schema_per_tenant` |
| Regulated enterprise | (registry pinning) | `silo-reg` | `database_per_enterprise` |
| Custom contract | (registry pinning) | `silo-custom` | `hybrid` |

Resolution order, per request:

1. **Registry pinning is authoritative** — a tenant listed in a cell's
   `pinned_tenants` always lands there (silo cells, controlled
   migrations), regardless of token claims.
2. **The verified `tier` claim** (namespaced
   `https://alphaswarm.internal/tier`, stamped by the Auth0 Action /
   Entra claims pipeline) selects the tier. An explicit tier is honored
   or refused with 503 — never silently downgraded onto another tier's
   tenancy strategy.
3. **Default tier** (`shared-std`) otherwise.
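
The three-step resolution order can be sketched as below. The cell dictionaries, `select_tier`, and `TierUnavailable` are hypothetical shapes for illustration; only the claim name, the default tier, and the 503-on-unavailable-tier rule come from the text.

```python
from typing import Iterable, Mapping

TIER_CLAIM = "https://alphaswarm.internal/tier"


class TierUnavailable(Exception):
    """Maps to a 503: an explicit tier was requested but no cell serves it."""


def select_tier(
    tenant_id: str,
    claims: Mapping[str, object],
    cells: Iterable[Mapping[str, object]],
    default_tier: str = "shared-std",
) -> str:
    cells = list(cells)
    # 1. Registry pinning wins regardless of token claims.
    for cell in cells:
        if tenant_id in cell.get("pinned_tenants", ()):
            return str(cell["tier"])
    # 2. An explicit tier claim is honored or refused, never downgraded.
    requested = claims.get(TIER_CLAIM)
    if requested:
        if requested not in {cell["tier"] for cell in cells}:
            raise TierUnavailable(str(requested))
        return str(requested)
    # 3. Otherwise fall back to the default tier.
    return default_tier
```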

Within a tier, unpinned tenants spread across active cells by
rendezvous (highest-random-weight) hashing keyed on
`tenant_id → organization_id → sub`: every router replica picks the
same cell with no shared state, a tenant is sticky to its cell, and
adding or draining a cell only remaps the tenants that hashed onto it.
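
Rendezvous hashing itself is short enough to sketch. The version below is a generic illustration using SHA-256, not the router's implementation; the hash function, the `cell|key` input format, and the helper names are assumptions.

```python
import hashlib
from typing import Optional, Sequence


def routing_key(tenant_id: Optional[str], organization_id: Optional[str], sub: Optional[str]) -> str:
    """First non-empty identifier, in the order tenant_id, organization_id, sub."""
    for candidate in (tenant_id, organization_id, sub):
        if candidate:
            return candidate
    raise ValueError("no identifier available to route on")


def pick_cell(key: str, cells: Sequence[str]) -> str:
    """Highest-random-weight choice: every replica computes the same winner."""
    if not cells:
        raise ValueError("no active cells in this tier")

    def weight(cell: str) -> int:
        digest = hashlib.sha256(f"{cell}|{key}".encode("utf-8")).digest()
        return int.from_bytes(digest[:8], "big")

    # Ties (vanishingly unlikely) break on the cell name, so input order never matters.
    return max(cells, key=lambda cell: (weight(cell), cell))
```

Each cell's weight depends only on the cell name and the routing key, so removing a cell leaves every other tenant's winner unchanged.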

Registry staleness (the router caches the control plane's
`/manage/cells` view) is **reported, never failed closed** — the data
plane keeps routing on last-known-good cells through a control-plane
outage, surfacing `registry_stale` in `/readyz` and a counter in
`/metrics`.

## Cell-Bound-Authorization (cross-cell calls)

Cross-cell calls are the highest-risk path (Phase 5 §8.5). The mint
side lives in `alphaswarm.auth.cell_bound`; the router hosts the
validator at `POST /cell_bound/v1/check` (the
`alphaswarm-cell-bound-validator` Service selects the same pods):

- No `Cell-Bound-Authorization` header → pass. External user traffic
  and same-cell calls never carry one; the response still emits empty
  `x-alphaswarm-cell-source-*` headers so smuggled values are stripped.
- Header present → the token must verify against the **source cell's**
  published keys (cells-registry annotation
  `alphaswarm.internal/cba-jwks`, JWKS JSON or PEM), with
  `iss` = source cell, `aud` = destination cell, a ≤90 s lifetime
  (mint stamps 60 s), required `jti`, and per-replica replay
  rejection. Valid CBAs inject `x-alphaswarm-cell-source` +
  `x-alphaswarm-cell-source-workload` (SPIFFE id) so destination-cell
  services can authorize the calling workload.
- `CBA_MODE=monitor` logs would-be denials without blocking
  (rollout aid); `enforce` is the default and is safe before any
  workload mints CBAs because headerless requests pass through.
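
The claim checks that follow signature verification could look roughly like this. It is a sketch under assumptions, not the `alphaswarm.auth.cell_bound` API: the helper names are hypothetical, lifetime is measured as `exp - iat`, and the replay cache is a plain in-memory dict, which matches the per-replica scope described above.

```python
from typing import Dict, Mapping, Optional

MAX_LIFETIME_SECONDS = 90


class ReplayCache:
    """Per-process memory of seen jti values, kept until each token expires."""

    def __init__(self) -> None:
        self._seen: Dict[str, float] = {}

    def first_use(self, jti: str, exp: float, now: float) -> bool:
        # Drop entries whose tokens have expired; they can no longer be replayed.
        self._seen = {j: e for j, e in self._seen.items() if e > now}
        if jti in self._seen:
            return False
        self._seen[jti] = exp
        return True


def cba_denial_reason(
    claims: Mapping[str, object],
    source_cell: str,
    destination_cell: str,
    replay: ReplayCache,
    now: float,
) -> Optional[str]:
    """Return None when the already signature-verified claims pass, else a reason."""
    if claims.get("iss") != source_cell:
        return "iss mismatch"
    if claims.get("aud") != destination_cell:
        return "aud mismatch"
    iat, exp, jti = claims.get("iat"), claims.get("exp"), claims.get("jti")
    if not isinstance(iat, (int, float)) or not isinstance(exp, (int, float)):
        return "missing iat/exp"
    if exp <= now:
        return "expired"
    if exp - iat > MAX_LIFETIME_SECONDS:
        return "lifetime too long"
    if not jti:
        return "missing jti"
    if not replay.first_use(str(jti), exp, now):
        return "replayed jti"
    return None
```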

## Where things live

| Surface | Path |
| --- | --- |
| Router service + tests | `alphaswarm_platform/tenant_router/` |
| Edge Envoy config (canonical template) | `alphaswarm_platform/build/docker/alphaswarm-edge/envoy.template.yaml` |
| Deployment (ConfigMap, NetworkPolicy, HPA, Services) | `alphaswarm_platform/deployments/kubernetes/edge/alphaswarm-tenant-router/` |
| Backend JWT validation (per-cell) | `alphaswarm/auth/oidc.py`, `alphaswarm_core/auth/jwt_validator.py` |
| CBA mint/verify library | `alphaswarm/auth/cell_bound.py` |
| Operator runbook | [Tenant-router auth rollout](../../how-to/tenant-router-auth-rollout.md) |
| Cutover history | [Cell-router cutover](../../how-to/cell-router-cutover.md) |


<!-- https://alpha-swarm.ai/concepts/identity/entra-internal-tenant -->
# Entra ID as the AlphaSwarm staff user pool

# Entra ID as the AlphaSwarm staff user pool

Microsoft Entra ID is the **first user pool** for the managed AlphaSwarm
platform. AlphaSwarm staff (engineers, operators, compliance, finance,
auditors, SOC) sign in to `manage.alpha-swarm.ai` through the AlphaSwarm staff
Entra tenant; Auth0 stays as the customer-facing B2C fallback and the
documented degraded-mode entry path.

This page explains *what* the rollout does and *why*. The runbook
that walks through *how* lives at
[`how-to/entra-terraform-bootstrap`](../../how-to/entra-terraform-bootstrap.md);
the long-form plan with phases + risks at
[`docs/plans/entra-internal-tenant-rollout.md`](pathname:///docs/plans/entra-internal-tenant-rollout.md);
and the architectural decision at
[ADR-011](../../architecture/decisions/011-entra-as-first-pool.md).

## Why Entra, why now

| Driver | Detail |
| --- | --- |
| MFA + Conditional Access | Entra carries the company's existing MFA + CA enforcement; staff already authenticate to it daily for Microsoft 365. |
| Audit centralisation | Sign-in logs land in the corporate SIEM via the existing Entra log stream; no separate Auth0 export to maintain. |
| Group-driven authorisation | Group membership in Entra → app-role claim on AlphaSwarm tokens. New hires onboard via a single HR-side group action. |
| No client secrets in CI | GitHub Actions OIDC + federated credentials replace the old `AZURE_CLIENT_SECRET` repo secret. |
| Customer separation | The AlphaSwarm staff tenant is independent of every customer tenant. Customer tenants continue to flow through the `EntraTenantLink` B2B approval wizard (AGENTS rule 44). |

## What is under Terraform control

- **3 app registrations**: `alphaswarm-staff` (login app), `alphaswarm-manage-api`
  (Resource Server), `alphaswarm-ci-github` (federated-credential-only).
- **3 service principals** with `app_role_assignment_required = true`
  on the manage API + CI app.
- **7 app roles** on the manage API: Admin, Operator, Auditor,
  Compliance, Finance, Engineer, Viewer.
- **7 directory groups**: AlphaSwarm-Admins, AlphaSwarm-Operations, AlphaSwarm-Auditors,
  AlphaSwarm-Compliance, AlphaSwarm-Finance, AlphaSwarm-Engineering, AlphaSwarm-SOC.
- **Group → app-role assignments** mapping each group to one or more
  roles.
- **Federated credentials** for GitHub Actions OIDC (per-environment
  + per-branch, never wildcards).
- **Named locations** representing AlphaSwarm-trusted IP ranges (referenced
  by Conditional Access policies).

## What is NOT under Terraform control

- **Conditional Access policies**. CA policies require an Entra ID P2
  license + manual Security review. The Terraform module records
  policy display names as documentation; the verify helper queries
  Microsoft Graph at smoke-test time to confirm each named policy
  exists.
- **Group membership**. HR + Security own membership through the
  Azure Portal (or Entitlement Management). Terraform owns *which
  groups exist + what roles they confer*; not *who is in them*.
- **Customer-tenant Entra integration**. Customer tenants flow through
  the existing `EntraTenantLink` B2B wizard (AGENTS rule 44). This
  rollout is internal-only.
- **Privileged Identity Management (PIM)**. Tracked as future work in
  the rollout plan §7.

## Token shape

Every staff access token minted for `api://alphaswarm-manage-api` carries:

| Claim | Value |
| --- | --- |
| `iss` | `https://login.microsoftonline.com/{alphaswarm_staff_tenant_id}/v2.0` |
| `aud` | `api://alphaswarm-manage-api` |
| `roles` | one or more of `Admin`, `Operator`, `Auditor`, `Compliance`, `Finance`, `Engineer`, `Viewer` |
| `groups` | the staff member's directory group object ids (security-only) |
| `oid` | the user's Entra object id (stable across renames) |
| `tid` | the AlphaSwarm staff Entra tenant id |
| `preferred_username` | `firstname.lastname@` |

The application reads `roles` to gate `/manage/*` routes; a staff
member with no roles is treated as `Viewer` until promoted by an admin.

## Provider-chain priority

`alphaswarm/auth/providers/__init__.py` exposes
[`select_provider_for_token`](pathname:///docs/concepts/identity/entra-internal-tenant.md#provider-chain-priority)
which:

1. Decodes the token's `iss` claim (no signature check).
2. If `iss` matches the AlphaSwarm staff issuer, returns
   `MsalEntraIdentityProvider`.
3. Otherwise falls back to `get_active_provider()` (Auth0 in
   production).

The `manage.alpha-swarm.ai` mounts use this selector instead of the bare
`get_active_provider()` so internal-tenant tokens always route through
MSAL first. Customer tokens (different `iss`) continue to land on
Auth0.
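
The selection step can be sketched with the standard library alone. This is not the real `select_provider_for_token`: `choose_provider_kind` and the returned labels are hypothetical, and the unverified `iss` is used only to pick a provider, which must still validate the signature before the token is trusted.

```python
import base64
import json
from typing import Optional


def unverified_issuer(token: str) -> Optional[str]:
    """Read the iss claim from a JWT payload WITHOUT checking the signature."""
    parts = token.split(".")
    if len(parts) != 3:
        return None
    payload = parts[1] + "=" * (-len(parts[1]) % 4)
    try:
        claims = json.loads(base64.urlsafe_b64decode(payload))
    except ValueError:
        return None
    issuer = claims.get("iss") if isinstance(claims, dict) else None
    return issuer if isinstance(issuer, str) else None


def choose_provider_kind(token: str, staff_issuer: str) -> str:
    """Route staff-issuer tokens to MSAL; everything else to the active provider."""
    if unverified_issuer(token) == staff_issuer:
        return "msal-entra"
    return "active"
```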

## Lifecycle

| Phase | What happens | Owner |
| --- | --- | --- |
| 0. Pre-flight | Tenant id confirmed; bootstrap SP provisioned | Identity team |
| 1. Plan + module land | `alphaswarm_entra_directory` module shipped + plan-only validated | Platform |
| 2. Apply + smoke | Resources created; staff member tests login | Platform |
| 3. Cutover | `auth_msal_priority` set so MSAL wins for staff | Platform + Identity |
| 4. Group onboarding | HR populates the seven groups | HR + Security |
| 5. CI cutover | All workflows switch to OIDC federation | DevOps |

See the rollout plan for week-level scheduling, exit criteria, and
rollback procedures.

## How a staff member signs in

```mermaid
sequenceDiagram
    participant U as AlphaSwarm Staff
    participant Browser
    participant alphaswarm_admin as manage.alpha-swarm.ai
    participant Entra
    participant manage_api as /manage/*

    U->>Browser: visit manage.alpha-swarm.ai
    Browser->>alphaswarm_admin: GET /
    alphaswarm_admin-->>Browser: 302 /auth/login?provider=entra
    Browser->>alphaswarm_admin: GET /auth/login?provider=entra
    alphaswarm_admin->>Entra: /authorize (PKCE + nonce)
    Entra-->>U: MFA / CA challenge
    U->>Entra: presents FIDO2 + CA-evaluated location
    Entra-->>alphaswarm_admin: 302 /auth/callback?code=...
    alphaswarm_admin->>Entra: exchange code (PKCE redeemed)
    Entra-->>alphaswarm_admin: id_token + access_token (roles claim)
    alphaswarm_admin->>alphaswarm_admin: stamp session cookie
    Browser->>manage_api: GET /manage/cells (Bearer ...)
    manage_api->>manage_api: select_provider_for_token (MSAL)
    manage_api-->>Browser: 200 JSON
```

## Reading the audit trail

Every Entra-side mutation lands in two places:

- The **Entra audit log** (corporate SIEM via existing log stream).
  Captures app-registration changes, group-membership changes,
  CA-policy edits, admin consents.
- The **AlphaSwarm `terraform_runs` ledger**. Captures every Terraform
  apply on the `entra-internal` stack with the operator who triggered
  it, the SHA of the rendered HCL, the previous + new state hashes,
  and whether the run succeeded or rolled back.

Auditors who need a full reconstruction window query both. The Phase 7
evidence-bundle export already includes `terraform_runs` rows in its
deterministic archive.

## Related

- [`how-to/entra-terraform-bootstrap`](../../how-to/entra-terraform-bootstrap.md)
- [`how-to/entra-onboard-new-staff`](../../how-to/entra-onboard-new-staff.md)
- [`how-to/entra-rotate-secrets`](../../how-to/entra-rotate-secrets.md)
- [`architecture/decisions/011-entra-as-first-pool`](../../architecture/decisions/011-entra-as-first-pool.md)
- The Terraform module:
  [`alphaswarm_platform/terraform/modules/alphaswarm_entra_directory/`](pathname:///alphaswarm_platform/terraform/modules/alphaswarm_entra_directory/README.md)
- Long-form rollout plan:
  [`docs/plans/entra-internal-tenant-rollout.md`](pathname:///docs/plans/entra-internal-tenant-rollout.md)


<!-- https://alpha-swarm.ai/concepts/identity/identity -->
# Federated identity layer
> The pieces are ported (with attribution) from `alphaswarm_snippets/inspiration/auth0-server-python-main` (MIT, Copyright Auth0, Inc.) into AlphaSwarm-native modules

# Federated identity layer

AlphaSwarm wraps every identity / token operation in a pluggable
:class:`alphaswarm.auth.providers.IdentityProvider`. The provider drives both
user authentication (login, JWT validation, refresh) and
service-to-service auth (M2M tokens that downstream services like
Polaris / Trino consume via the credential resolver).

The pieces are ported (with attribution) from
`alphaswarm_snippets/inspiration/auth0-server-python-main` (MIT, Copyright Auth0, Inc.)
into AlphaSwarm-native modules.

## Architecture

```mermaid
flowchart LR
    SPA[Frontend SPA]
    Browser
    API[FastAPI]
    Provider["IdentityProvider (auth0 / oidc / mock)"]
    OidcClient[OidcHttpClient]
    JWKS[(JWKS cache)]
    Discovery[(Discovery cache)]
    M2MIssuer[M2MTokenIssuer]
    Resolver[CredentialResolver]
    Polaris[Polaris OAuth]
    Trino[Trino HTTP]
    MinIO[MinIO STS]

    Browser -->|"GET /auth/login"| API
    API -->|"login_url(...)"| Provider
    Provider -->|redirect| Browser
    Browser -->|callback code| API
    API -->|"exchange_code"| Provider
    Provider --> OidcClient
    OidcClient --> Discovery
    OidcClient --> JWKS
    API -->|JWE cookie| SPA
    SPA -->|Bearer or cookie| API
    API -->|"validate_jwt"| Provider
    Provider -->|jwks| JWKS
    M2MIssuer --> Provider
    Resolver --> M2MIssuer
    Polaris --> Resolver
    Trino --> Resolver
    MinIO --> Resolver
```

## Components

| Component | Path |
| --- | --- |
| Provider ABC + metaclass | [alphaswarm/auth/providers/protocol.py](../alphaswarm/auth/providers/protocol.py) |
| Auth0 / generic OIDC / mock concrete providers | [alphaswarm/auth/providers/](../alphaswarm/auth/providers/) |
| OIDC HTTP plumbing (discovery, JWKS, token endpoint) | [alphaswarm/auth/oidc_client.py](../alphaswarm/auth/oidc_client.py) |
| PKCE helpers (RFC 7636 S256) | [alphaswarm/auth/pkce.py](../alphaswarm/auth/pkce.py) |
| Cookie / Redis session stores | [alphaswarm/auth/session/](../alphaswarm/auth/session/) |
| JWE cookie crypto (HKDF-SHA256 + A256CBC-HS512) | [alphaswarm/auth/session/crypto.py](../alphaswarm/auth/session/crypto.py) |
| M2M token issuer | [alphaswarm/auth/m2m.py](../alphaswarm/auth/m2m.py) |
| Login / callback / logout routes | [alphaswarm/api/routes/auth.py](../alphaswarm/api/routes/auth.py) |
| Backend JWT validator | [alphaswarm/auth/oidc.py](../alphaswarm/auth/oidc.py) |

## Login flow (backend session)

1. Browser hits `GET /auth/login` (optionally with a `return_to`).
2. AlphaSwarm generates a PKCE verifier + state, stashes them in an
   encrypted transaction cookie (10-minute TTL), redirects to the
   provider's authorize URL.
3. Provider redirects the browser to `GET /auth/callback` with the
   authorization code.
4. AlphaSwarm looks up the transaction cookie by `state`, calls
   `provider.exchange_code(...)`, and stores the resulting token set
   in an encrypted session cookie (or Redis).
5. Subsequent requests carry the cookie; AlphaSwarm decrypts it on demand
   and exposes the user via the existing `current_user` dep.
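
The PKCE pair generated in step 2 follows RFC 7636's S256 method, which can be sketched with the standard library. The function names here are illustrative and are not taken from `alphaswarm/auth/pkce.py`.

```python
import base64
import hashlib
import secrets
from typing import Tuple


def s256_challenge(verifier: str) -> str:
    """code_challenge = BASE64URL(SHA256(ASCII(code_verifier))), without padding."""
    digest = hashlib.sha256(verifier.encode("ascii")).digest()
    return base64.urlsafe_b64encode(digest).rstrip(b"=").decode("ascii")


def make_pkce_pair() -> Tuple[str, str]:
    """Return (code_verifier, code_challenge) for one login transaction."""
    # 64 random bytes encode to 86 URL-safe characters, inside the 43..128 range.
    verifier = secrets.token_urlsafe(64)
    return verifier, s256_challenge(verifier)
```

The verifier stays server-side (in the transaction cookie) and only the challenge goes to the authorize URL; the verifier is revealed to the provider at code exchange.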

The bearer-token flow (`Authorization: Bearer`) keeps working unchanged
— the SPA can pick either path via the `backend_session_supported`
flag in `/auth/config`.

## M2M flow

When `ALPHASWARM_AUTH_M2M_ENABLED=true`:

1. AlphaSwarm startup calls `alphaswarm.auth.m2m.install_m2m_store()`, which adds
   :class:`M2MStore` (priority 10) to the credential resolver chain.
2. A service like `polaris_client` resolves
   `CredentialKey("polaris", "oauth")` through
   :func:`alphaswarm.credentials.get_resolver`.
3. The M2M store fetches `provider.m2m_token(audience, scope)` (Auth0
   `client_credentials` grant) and returns a `Credential` with
   `access_token`/`token` set.
4. The resolver merges this hit with the env-store payload (which
   carries the static `client_id`), so consumers see one merged
   `Credential`.
5. Tokens cache in `M2MTokenIssuer` until expiry minus a 30-second
   skew, so we don't mint per request.

The resolver chain falls through to the file/env stores if the M2M
issuer fails or is disabled — you never get a worse outcome than the
pre-M2M state.
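
The caching rule in step 5 can be sketched as a small wrapper around whatever fetches the token. `CachedTokenIssuer` and its `fetch` callable are hypothetical and do not mirror `M2MTokenIssuer`'s actual interface; the injectable clock is there only to make the behaviour easy to test.

```python
import time
from typing import Callable, Dict, Tuple


class CachedTokenIssuer:
    """Reuse a token until it is within `skew_seconds` of expiry."""

    def __init__(
        self,
        fetch: Callable[[str, str], Tuple[str, float]],
        skew_seconds: float = 30.0,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._fetch = fetch  # returns (token, expires_in_seconds)
        self._skew = skew_seconds
        self._clock = clock
        self._cache: Dict[Tuple[str, str], Tuple[str, float]] = {}

    def token(self, audience: str, scope: str) -> str:
        key = (audience, scope)
        now = self._clock()
        cached = self._cache.get(key)
        if cached is not None and now < cached[1] - self._skew:
            return cached[0]
        # Missing or too close to expiry: mint a fresh token and remember it.
        token, expires_in = self._fetch(audience, scope)
        self._cache[key] = (token, now + expires_in)
        return token
```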

## Configuration

The full env knob set lives in `.env.example` under the "Federated
identity (M2 / M3)" section. The minimum for an Auth0 deployment:

```env
ALPHASWARM_AUTH_PROVIDER=auth0
ALPHASWARM_AUTH_OIDC_ISSUER=https://your-tenant.auth0.com
ALPHASWARM_AUTH_OIDC_AUDIENCE=https://alphaswarm.local/api
ALPHASWARM_AUTH_OIDC_CLIENT_ID=...
ALPHASWARM_AUTH_OIDC_CLIENT_SECRET=...
ALPHASWARM_AUTH_LOGIN_CALLBACK=http://localhost:8000/auth/callback
ALPHASWARM_AUTH_LOGOUT_CALLBACK=http://localhost:3000/
ALPHASWARM_AUTH_SESSION_SECRET=$(openssl rand -hex 32)
ALPHASWARM_AUTH_M2M_ENABLED=true
ALPHASWARM_AUTH_M2M_AUDIENCE=https://alphaswarm.local/services
```

## Adding a new provider

1. Subclass :class:`alphaswarm.auth.providers.IdentityProvider` and set
   `provider_kind` (the dispatch key matched against
   `ALPHASWARM_AUTH_PROVIDER`).
2. Either inherit from
   :class:`alphaswarm.auth.providers.GenericOidcProvider` (and override only
   the bits that diverge) or roll your own.
3. The metaclass auto-registers the subclass; restart the API and set
   `ALPHASWARM_AUTH_PROVIDER` to your new `provider_kind`.
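
The registration mechanism behind these steps can be sketched as below. This is an illustration of the metaclass pattern only: `ProviderRegistryMeta`, `BaseProvider`, `ExampleOidcProvider`, `provider_for`, and the `login_url` method are hypothetical stand-ins, not the actual `IdentityProviderMeta` or `IdentityProvider` API.

```python
from typing import Dict, Optional


class ProviderRegistryMeta(type):
    """Hypothetical stand-in for the registering metaclass."""

    registry: Dict[str, type] = {}

    def __new__(mcls, name, bases, namespace):
        cls = super().__new__(mcls, name, bases, namespace)
        kind = namespace.get("provider_kind")
        if kind:  # abstract bases leave provider_kind unset
            mcls.registry[kind] = cls
        return cls


class BaseProvider(metaclass=ProviderRegistryMeta):
    provider_kind: Optional[str] = None

    def login_url(self, state: str) -> str:  # illustrative method only
        raise NotImplementedError


class ExampleOidcProvider(BaseProvider):
    # The value an operator would put in ALPHASWARM_AUTH_PROVIDER.
    provider_kind = "example_oidc"

    def login_url(self, state: str) -> str:
        return f"https://idp.example.invalid/authorize?state={state}"


def provider_for(kind: str) -> BaseProvider:
    """Instantiate the provider class registered under `kind`."""
    try:
        return ProviderRegistryMeta.registry[kind]()
    except KeyError:
        raise ValueError(f"unknown provider kind: {kind!r}") from None
```

Defining the subclass is enough to register it, which is why the steps above end with a restart and an environment variable change and need no extra registration call.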

## Testing

`tests/auth/` contains the canonical test patterns:

- `test_pkce.py` — RFC 7636 conformance.
- `test_session_crypto.py` — JWE round-trips, wrong-key rejection.
- `test_oidc_client.py` — token endpoint mock-driven tests.
- `test_providers.py` — Auth0 / generic OIDC / mock dispatch.
- `test_m2m.py` — issuer caching, resolver integration.

All tests run hermetically; nothing hits the network.

## Account management surface (Phase 7)

Phase 7 adds a dedicated account-management API surface under `/me/*`
implemented in [`alphaswarm/api/routes/me.py`](../alphaswarm/api/routes/me.py).
These routes expose profile updates, MFA and session operations, linked
identity management, and self-service account actions while keeping the
Auth0 Management API boundary centralized.

The Auth0 Management API integration lives in
[`alphaswarm/auth/management_api.py`](../alphaswarm/auth/management_api.py). Scope
enforcement for protected endpoints is available through
[`alphaswarm/auth/auth0_fastapi.py`](../alphaswarm/auth/auth0_fastapi.py) via
`Auth0FastAPI` opt-in dependencies. Audit and invite persistence for
this surface is recorded in
[`alphaswarm/persistence/models_audit.py`](../alphaswarm/persistence/models_audit.py)
(`security_audit_events` and `tenancy_invites`), and events are emitted
through [`alphaswarm/auth/audit.py`](../alphaswarm/auth/audit.py).

## Microsoft Entra ID secondary IdP (Phase 7)

AlphaSwarm's primary Microsoft pattern is federation through Auth0 Universal
Login using an Auth0 Microsoft Enterprise Connection, documented in
[`alphaswarm_docs/auth0-microsoft-federation.md`](../../concepts/identity/auth0-microsoft-federation.md).
This keeps Auth0 as the default IdP while preserving one hosted login
surface and one claims projection path.

Direct Entra authentication remains supported as a fallback through
[`alphaswarm/auth/providers/msal_entra.py`](../alphaswarm/auth/providers/msal_entra.py).
When `ALPHASWARM_AUTH_PROVIDER=msal_entra`, the legacy `MsalEntraProvider`
path activates without changing the backend tenancy-link semantics.


<!-- https://alpha-swarm.ai/concepts/identity/management-engine -->
# AlphaSwarm Management Engine
> The Management Engine is the single direct-control surface for:

# AlphaSwarm Management Engine

Canonical narrative for the unified management/control surface
shipped by the `alphaswarm_management_engine` plan
(`.cursor/plans/alphaswarm_management_engine_fd9f1de7.plan.md`).

## What it owns

The Management Engine is the single direct-control surface for:

- **Workload lifecycle** — start / stop / scale / restart / exec /
  tail logs / apply config / rotate secret. One Python ABC
  (`alphaswarm_core.providers.InfrastructureProvider`), one
  runtime (`alphaswarm_core.runtime.WorkloadRuntime`), one audit
  ledger row per action (`workload_runs`).
- **Identity provider configuration** — Auth0 + Microsoft Entra ID
  (MSAL) + Cloudflare Access, all registered through
  `IdentityProviderMeta`. The BFF (`/auth/{providers,exchange,refresh,logout}`)
  is the canonical surface for SPA + Theia clients.
- **Cloudflare edge** — tunnels, DNS records, Access apps. Runtime
  CRUD via `alphaswarm.cloudflare.CloudflareEdgeAdapter`; IaC via the
  `alphaswarm_platform/terraform/modules/cloudflare_edge` module (provider
  `cloudflare/cloudflare ~> 5.6`).
- **Entra tenant onboarding** — `pending` -> `active` via
  `POST /tenancy/entra-links/{id}/promote` (Phase E of the plan).
- **alphaswarm_admin service identity** — per-deployment Microsoft Entra
  Agent Identities (`alphaswarm_admin_agent_identity` Terraform module).
  Replaces the legacy shared-client_credentials path for outbound
  admin-to-CP + admin-to-monolith calls. See
  [admin-agent-identity.md](admin-agent-identity.md).

## Architecture

```mermaid
flowchart LR
  subgraph clients [Local clients]
    Vite[Vite SPA]
    Theia[Theia desktop]
  end
  subgraph bff [AlphaSwarm BFF auth + gateway]
    AuthR["/auth/{providers,exchange,refresh,logout}"]
    Proxy["alphaswarm/api/proxy.py /manage proxy"]
    Sec[require_scope + require_membership]
  end
  subgraph engine [Management engine]
    WR[WorkloadRuntime]
    IP_K[KubernetesProvider]
    IP_DC[DockerComposeProvider]
    IP_CF[CloudflareProvider]
    IP_AWS[AWS / Azure / GCP]
    CFA[CloudflareEdgeAdapter]
    KA[KubernetesAdapter pod ops]
    TR[TerraformRuntime]
    Idp[IdentityProvider registry]
  end
  subgraph idps [Federated IdPs]
    A0[Auth0]
    EN[Entra ID MSAL]
    CFP[Cloudflare Access]
  end
  subgraph state [Postgres + Iceberg]
    WLR[workload_runs ledger]
    AUD[security_audit_events]
    SPECS[terraform_stack_spec_versions]
  end
  Vite --> AuthR
  Theia --> AuthR
  Vite --> Proxy
  Theia --> Proxy
  Proxy --> WR
  AuthR --> Sec
  Sec --> Idp
  Idp --> A0
  Idp --> EN
  Idp --> CFP
  WR --> IP_K
  WR --> IP_DC
  WR --> IP_CF
  WR --> IP_AWS
  IP_K --> KA
  IP_CF --> CFA
  TR --> IP_CF
  WR --> WLR
  WR --> AUD
  TR --> SPECS
```

## Deployment modes

`ALPHASWARM_MANAGEMENT_MODE` controls how the engine runs:

| Mode | Workload calls go to | Audit sink | Use case |
|---|---|---|---|
| `embedded` (default) | In-process `WorkloadRuntime` | `PostgresWorkloadAuditSink` | Single-image deployment |
| `sidecar` | HTTP `/manage/*` proxy -> `alphaswarm_controller` | `JsonlAuditSink` | Air-gapped or multi-tenant deployments |

Both modes import the SAME `WorkloadRuntime` class — operators
choose by setting the env var; no code branches.
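
A minimal sketch of how the mode table could translate into wiring, assuming the selection happens once at startup. The `ManagementWiring` dataclass and `resolve_wiring` function are hypothetical; the mode names, targets, and sink names come from the table above. Normalizing case and rejecting unknown values are assumptions of this sketch.

```python
import os
from dataclasses import dataclass
from typing import Mapping


@dataclass(frozen=True)
class ManagementWiring:
    mode: str
    workload_target: str
    audit_sink: str


_WIRING = {
    "embedded": ManagementWiring(
        "embedded", "in-process WorkloadRuntime", "PostgresWorkloadAuditSink"
    ),
    "sidecar": ManagementWiring(
        "sidecar", "HTTP /manage/* proxy", "JsonlAuditSink"
    ),
}


def resolve_wiring(env: Mapping[str, str] = os.environ) -> ManagementWiring:
    """Pick the wiring for ALPHASWARM_MANAGEMENT_MODE (default: embedded)."""
    mode = env.get("ALPHASWARM_MANAGEMENT_MODE", "embedded").strip().lower()
    try:
        return _WIRING[mode]
    except KeyError:
        raise ValueError(
            f"unsupported ALPHASWARM_MANAGEMENT_MODE: {mode!r}"
        ) from None
```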

## Provider matrix

| Provider | start / stop / scale | restart | exec | tail_logs | rotate_secret | Notes |
|---|---|---|---|---|---|---|
| `docker_compose` | yes | yes | yes (Docker SDK) | yes | no | Local dev + admin overlays |
| `kubernetes` | yes | yes (annotation bump) | yes (`stream` + `_preload_content=False`) | yes (`watch.Watch().stream`) | yes (rolling restart) | Production target |
| `aws` | stub | stub | stub | stub | stub | Real `health` + delegated `list_deployments` when EKS attached |
| `azure` | stub | stub | stub | stub | stub | Real `health` + delegated `list_deployments` when AKS attached |
| `gcp` | stub | stub | stub | stub | stub | Real `health` + delegated `list_deployments` when GKE attached |
| `cloudflare` | yes | yes (config reload) | n/a | n/a | destructive (opt-in) | Tunnel + Access app + DNS lifecycle |

Cloud providers gate K8s delegation on
`ALPHASWARM_CP_{AWS,AZURE,GCP}_DELEGATE_K8S=true`.

## Halt + audit

- `POST /workloads/halt` fires the `WorkloadRuntime.halt_all`
  helper (per-process registry) and writes a `HALTED` finish row
  for every in-flight `workload_runs` entry. Wired into the
  frontend `KillSwitch` alongside the existing halt endpoints
  (rule 45 + frontend rule 2).
- Every audit row carries `experiment_id` + `test_id` per
  AGENTS rule 34. The Postgres mirror table
  (`workload_runs`, Alembic 0055) is indexed on `status +
  started_at DESC`, `action + started_at DESC`, and
  `provider_alias + target`.
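
The halt behaviour described above can be sketched with an in-memory registry. Everything here is illustrative: `WorkloadRun`, `RunRegistry`, and the `RUNNING` status name are assumptions, and the real helper writes finish rows to `workload_runs` instead of mutating objects in memory.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import Dict, List, Optional


@dataclass
class WorkloadRun:
    run_id: str
    experiment_id: str  # carried on every audit row
    test_id: str        # carried on every audit row
    status: str = "RUNNING"  # assumed name for the in-flight state
    finished_at: Optional[datetime] = None


@dataclass
class RunRegistry:
    """Hypothetical per-process registry of workload runs."""

    runs: Dict[str, WorkloadRun] = field(default_factory=dict)

    def start(self, run: WorkloadRun) -> None:
        self.runs[run.run_id] = run

    def halt_all(self) -> List[WorkloadRun]:
        """Mark every in-flight run as HALTED and return the affected runs."""
        halted: List[WorkloadRun] = []
        now = datetime.now(timezone.utc)
        for run in self.runs.values():
            if run.status == "RUNNING":
                run.status = "HALTED"
                run.finished_at = now
                halted.append(run)
        return halted
```

Calling `halt_all` a second time returns an empty list, so a repeated kill-switch press does not produce duplicate finish rows in this sketch.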

## Cloudflare end-to-end

Phase D of the plan ships:

- `alphaswarm/cloudflare/{client,adapter}.py` — Python SDK wrapper +
  `CloudflareEdgeAdapter` (tunnels, DNS, Access apps).
- `alphaswarm/api/routes/cloudflare.py` — REST surface under
  `/cloudflare/*` (`cluster:admin` for writes,
  `cluster:read` for reads).
- `alphaswarm/data/mcp/tools/cloudflare.py` — DataMCP tools for agents
  (`data.cloudflare.{health,list_tunnels,create_tunnel,put_tunnel_config,list_access_apps,put_access_app,put_dns_record}`).
- `alphaswarm/auth/providers/cloudflare_access.py` — new
  `CloudflareAccessProvider` that validates `Cf-Access-Jwt-Assertion`
  headers and merges claims into the active `RequestContext`.
- `alphaswarm_platform/terraform/modules/cloudflare_edge` + Jinja codegen template
  (`alphaswarm/terraform/codegen/templates/cloudflare_edge.tf.j2`) +
  `cloudflare = "~> 5.6"` in `alphaswarm_platform/terraform/versions.tf`.
- Optional `cloudflare_enabled` block in
  `alphaswarm_platform/terraform/environments/rpi/main.tf` — replaces the manual
  cloudflared deployment under
  `rpi_kubernetes/kubernetes/base-services/cloudflared/`.

## Frontend

- `alphaswarm_client/src/lib/api/{workloads,cloudflare,clusterPods}.ts` —
  typed clients matching the new REST surface.
- `alphaswarm_client/src/routes/manage/page.tsx` — Workload Studio.
- `alphaswarm_client/src/routes/cluster-mgmt/page.tsx` — Cluster pods
  browser (exec + log tail land in Phase F-2).
- `alphaswarm_client/src/routes/cloudflare/page.tsx` — Cloudflare edge
  studio.
- `alphaswarm_client/src/lib/auth/MsalProvider.tsx` — new MSAL branch of
  `AuthProvider`; selects between the Auth0 and MSAL provider
  components based on `authConfig.provider`.
- `alphaswarm_client/public/redirect.html` — MSAL v5 redirect bridge.

## Theia

- `theia-extensions/alphaswarm/src/browser/auth/alphaswarm-auth-service.ts` —
  additive BFF auth service (calls `/auth/providers` +
  `/auth/refresh`). Auth0Service still owns the direct PKCE flow.
- `theia-extensions/alphaswarm/src/browser/widgets/management-widget.tsx` —
  iframe embedding the Vite Workload Studio, cluster-mgmt, and
  cloudflare routes inside Theia. New env vars on
  `browser.Dockerfile`: `ALPHASWARM_THEIA_FRONTEND_URL`,
  `ALPHASWARM_THEIA_PROVIDERS_URL`.

## Subagent + rule + skill

- `.cursor/agents/alphaswarm-management-engine.md` — direct-control
  subagent that maps every control route to a `data.*` MCP tool
  and refuses raw HTTP shortcuts.
- `.cursor/rules/alphaswarm-management-engine.mdc` — always-on rule
  that bans printing tokens, refresh tokens, M2M client_secrets,
  MFA secrets, `Cf-Access-Jwt-Assertion` values, kubeconfig
  contents, and full `Authorization` headers in any transcript.
- `.cursor/skills/alphaswarm-management-engine/SKILL.md` — named
  workflows the subagent reaches for first (start, stop,
  restart, exec, tail-logs, provision-tunnel, rotate-secret,
  promote-entra-link, halt-all).


<!-- https://alpha-swarm.ai/concepts/identity/msal-entra-setup -->
# Microsoft Entra ID (MSAL) setup
> 1. Sign in to the [Entra admin center](https://entra.microsoft.com). 2. **Identity → Applications → App registrations → New registration**. 3. Name: `AlphaSwarm`. 4. Supported account type...

# Microsoft Entra ID (MSAL) setup

Step-by-step walkthrough for wiring AlphaSwarm's `MsalEntraProvider` to a
multi-tenant Microsoft Entra ID app registration. The provider lives
at [`alphaswarm/auth/providers/msal_entra.py`](../alphaswarm/auth/providers/msal_entra.py)
and auto-registers via the
[`IdentityProviderMeta`](../alphaswarm/auth/providers/protocol.py) metaclass.

## 1. Create the Entra app registration

1. Sign in to the [Entra admin center](https://entra.microsoft.com).
2. **Identity → Applications → App registrations → New registration**.
3. Name: `AlphaSwarm`.
4. Supported account types: **Accounts in any organizational
   directory + personal Microsoft accounts (B2B/B2C)**. This is
   what makes the app multi-tenant. The matching MSAL authority
   becomes `https://login.microsoftonline.com/organizations` (work /
   school accounts only) or `/common` (incl. personal accounts).
5. **Redirect URI** — add two:
   - Platform: **Web** → `https://<your-host>/auth/callback`
   - Platform: **Single-page application (SPA)** →
     `http://localhost:3001/auth/callback` and the prod equivalent.

## 2. Generate a client secret

1. App registration → **Certificates & secrets → New client secret**.
2. Description: `alphaswarm-backend-secret`. Expiry: max allowed (`24 months`).
3. **Copy the secret value immediately**; Entra hides it after page reload.
4. Set:
   ```
   ALPHASWARM_MSAL_CLIENT_SECRET=
   ```
   Or store it in your secret backend and reference via
   `CredentialResolver` (preferred — see
   [alphaswarm_docs/cloud-credentials.md](../../concepts/identity/cloud-credentials.md)).

## 3. Define app roles

App registration → **App roles → Create app role** (five times):

| Display name       | Member types | Value                           |
| ------------------ | ------------ | ------------------------------- |
| AlphaSwarm admin   | Users / Apps | `alphaswarm.admin`              |
| AlphaSwarm editor  | Users        | `alphaswarm.editor`             |
| AlphaSwarm viewer  | Users        | `alphaswarm.viewer`             |
| Terraform operator | Users        | `alphaswarm.terraform.operator` |
| Terraform approver | Users        | `alphaswarm.terraform.approver` |

The provider's first-login provisioning logic
([`alphaswarm/auth/user.py::_apply_entra_tenant_link`](../alphaswarm/auth/user.py))
maps these onto the AlphaSwarm role lattice (`viewer < editor < admin <
owner`). The `alphaswarm.terraform.*` sub-roles fold to `editor` (operator)
and `admin` (approver) by default; override via the
`EntraTenantLink.role_mapping` JSON column.

## 4. Expose an API scope

App registration → **Expose an API → Add a scope**:

- Application ID URI: `api://<client-id>` (Entra suggests this; accept).
- Scope name: `.default` (this enables the `client_credentials` grant
  used by M2M).
- Admin consent display name: `AlphaSwarm API access`.

## 5. (Optional) Pre-authorize the SPA client

If you split the SPA client into its own app registration, add it to
**Expose an API → Authorized client applications** with the
`api://<client-id>/.default` scope so the token flow lands without an
admin-consent prompt.

## 6. Configure AlphaSwarm

```
ALPHASWARM_AUTH_PROVIDER=msal_entra
ALPHASWARM_MSAL_TENANT_ID=
ALPHASWARM_MSAL_CLIENT_ID=
ALPHASWARM_MSAL_CLIENT_SECRET=
ALPHASWARM_MSAL_AUTHORITY=https://login.microsoftonline.com/organizations
ALPHASWARM_MSAL_REDIRECT_URI=https://<your-host>/auth/callback
ALPHASWARM_MSAL_SCOPES=openid profile email offline_access User.Read
ALPHASWARM_MSAL_MULTI_TENANT=true
ALPHASWARM_MSAL_B2B_ENABLED=true
```

Frontend Vite build:

```
VITE_MSAL_TENANT_ID=
VITE_MSAL_CLIENT_ID=
VITE_MSAL_AUTHORITY=https://login.microsoftonline.com/organizations
VITE_MSAL_REDIRECT_URI=https://<your-host>/auth/callback
VITE_MSAL_SCOPES=openid profile email offline_access User.Read
```

## 7. Link your home Entra tenant to an AlphaSwarm organization

Two paths:

1. **Frontend wizard** (recommended): navigate to
   `/admin/onboarding` → **Link Entra tenant** tab, select your AlphaSwarm
   org, paste the Entra tenant id (`tid`), set primary domain +
   allowed email domains + role mapping, click "Activate".
2. **MCP tool / API**:
   ```
   POST /tenancy/entra-links
   {
     "organization_id": "",
     "entra_tenant_id": "",
     "primary_domain": "wiley.tech",
     "allowed_email_domains": ["wiley.tech"],
     "role_mapping": {
       "alphaswarm.admin": "admin",
       "alphaswarm.editor": "editor",
       "alphaswarm.viewer": "viewer",
       "alphaswarm.terraform.operator": "editor",
       "alphaswarm.terraform.approver": "admin"
     },
     "activate": true
   }
   ```

Once the link is `active`, every user that signs in from that tenant
gets a `Membership` row auto-provisioned on the linked org +
workspaces (`provider == "msal_entra"` in
[`alphaswarm/auth/user.py::provision_user_from_claims`](../alphaswarm/auth/user.py)).

## 8. (Optional) Conditional Access for external tenants

For B2B guest users, configure Entra Conditional Access policies on
your home tenant (MFA + IP restrictions + device compliance). AlphaSwarm
does NOT enforce these — Entra denies the token before AlphaSwarm sees it,
which is the correct boundary.

## 9. SCIM / Provisioning Service webhook

To pre-provision AlphaSwarm users before they sign in (useful for large
orgs), point an Entra Logic App or SCIM provider at:

```
POST https://<your-host>/_internal/msal/sync
Authorization: Bearer <m2m-access-token>
{
  "object_id": "",
  "tenant_id": "",
  "email": "user@wiley.tech",
  "display_name": "User",
  "app_roles": ["alphaswarm.editor", "alphaswarm.terraform.operator"],
  "lifecycle_event": "created"
}
```

The endpoint is M2M-protected via `require_m2m_token` (mirrors
`/_internal/auth0/sync`) and upserts the matching `User` +
`Membership` rows so the user lands on a usable surface on their very
first request.

## Troubleshooting

| Symptom                                | Likely cause                                                          |
| -------------------------------------- | --------------------------------------------------------------------- |
| `AADSTS50194` invalid issuer           | Authority pinned to wrong tenant — use `/organizations` for multi-tenant. |
| `AADSTS65001` consent required         | Admin consent on the SPA / API scope wasn't granted.                  |
| `provision_user_from_claims` returns default user | Settings has `auth_provider != "msal_entra"`. Set the env var. |
| New user lands without org membership  | `EntraTenantLink.status == "pending"` — promote via the wizard.       |


<!-- https://alpha-swarm.ai/concepts/identity/multi-tenancy -->
# Multi-tenancy

# Multi-tenancy

How AlphaSwarm turns a Microsoft Entra ID `tid` claim into an
`Organization` → `Team` → `User` → `Membership` chain — and what
keeps a B2B guest from another tenant from leaking into the wrong
org.

## Identity flow

```mermaid
sequenceDiagram
  participant SPA as Vite SPA
  participant Entra as login.microsoftonline.com<br/>(multi-tenant)
  participant AlphaSwarm as AlphaSwarm backend
  participant Link as EntraTenantLink (Postgres)
  participant Org as Organization (Postgres)

  SPA->>Entra: PKCE auth code flow
  Entra->>SPA: id_token + access_token (carries tid + oid + roles)
  SPA->>AlphaSwarm: /api/* with Bearer access_token
  AlphaSwarm->>AlphaSwarm: validate_jwt (Entra JWKS)
  AlphaSwarm->>AlphaSwarm: provision_user_from_claims(claims)
  AlphaSwarm->>Link: lookup tid
  alt tid known + status=active
    Link-->>AlphaSwarm: organization_id
    AlphaSwarm->>Org: derive Memberships from roles[]
  else tid unknown + B2B enabled
    AlphaSwarm->>Link: insert pending row
    AlphaSwarm-->>SPA: user signs in with no memberships
    note over AlphaSwarm,Link: Admin promotes link via wizard<br/>before user sees workspaces
  end
```

## Schema

| Table                   | Purpose                                                       |
| ----------------------- | ------------------------------------------------------------- |
| `organizations`         | Top of the AlphaSwarm tenancy tree (multi-tenant)                    |
| `teams`                 | Subgroup within an org                                        |
| `workspaces`            | Visibility-scoped container of projects + labs                |
| `projects` / `labs`     | The user-facing buckets where strategies / RAG corpora live   |
| `users`                 | Authenticated identities (one row per Entra `oid`)            |
| `memberships`           | Polymorphic `(user, scope_kind, scope_id, role)` grants       |
| `entra_tenant_links`    | Multi-tenant Entra `tid` → AlphaSwarm `organization_id` index (NEW)  |

Schema migrations:

- `0017_tenancy_foundation.py` — original `default-*` seed.
- `0050_terraform_iac_plus_entra.py` — adds `entra_tenant_links` +
  the Terraform tables.
- `0051_seed_wiley_tech.py` — seeds the canonical "Wiley Tech" org +
  user "Julian" + transfers every legacy `default-*`-owned row.

## `EntraTenantLink` lifecycle

Statuses (see :data:`ENTRA_TENANT_STATUSES`):

| Status      | Behaviour                                                                 |
| ----------- | ------------------------------------------------------------------------- |
| `pending`   | Created by first-login of an unknown `tid`. User signs in but lands on an "awaiting org admin" surface (no Memberships granted). |
| `active`    | New logins from the tenant auto-provision into the linked org + workspaces. |
| `suspended` | Sign-ins from the tenant still resolve, but no new Memberships are granted. |
| `revoked`   | Sign-ins from the tenant are blocked at provision time.                   |
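
The status table can be read as a small decision function. The sketch below is an assumption-laden illustration: `LinkStatus`, `SignInBlocked`, and `grants_new_memberships` are hypothetical names, and the real logic lives in the provisioning path in `alphaswarm/auth/user.py`.

```python
from enum import Enum
from typing import Optional


class LinkStatus(str, Enum):
    PENDING = "pending"
    ACTIVE = "active"
    SUSPENDED = "suspended"
    REVOKED = "revoked"


class SignInBlocked(Exception):
    """Raised when a tenant's link forbids sign-in at provision time."""


def grants_new_memberships(status: Optional[LinkStatus]) -> bool:
    """Decide whether a sign-in may receive new Memberships.

    `None` models an unknown tid, which the caller records as `pending`.
    """
    if status is LinkStatus.REVOKED:
        raise SignInBlocked("tenant link is revoked")
    # pending and suspended tenants still sign in, but get no new grants.
    return status is LinkStatus.ACTIVE
```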

AGENTS rule 44: **organization provisioning from Entra ID claims
goes through `EntraTenantLink`. Don't auto-create org rows from raw
`tid` claims.** The
`data.tenancy.link_org_to_entra_tenant` MCP tool (REST: `POST
/tenancy/entra-links`) is the only sanctioned ingress. The frontend
[`EntraTenantLinkWizard`](../alphaswarm_client/src/components/onboarding/EntraTenantLinkWizard.tsx)
drives this flow with a 5-step wizard.

On the Auth0-federated path, the Microsoft button on the SPA login
screen uses the Auth0 Enterprise Connection
`connection=azure-ad-myorg`, which federates users to their home Entra
tenant. The Entra `tid` claim returned through Auth0 is forwarded into
the AlphaSwarm access-token claim set by the Auth0 Action, and
`provision_user_from_claims` runs `_apply_entra_tenant_link` exactly as
it does in the direct-MSAL path.

For regulated deployments that bypass Auth0 and hit Entra directly,
`MsalEntraProvider` remains registered through `IdentityProviderMeta`
and activates when `ALPHASWARM_AUTH_PROVIDER=msal_entra`. Both authentication
paths converge on the same backend `EntraTenantLink` lookup chain, and
super-admin promotion remains managed in
`alphaswarm_client/src/components/onboarding/EntraTenantLinkWizard.tsx`.

## App role mapping

Entra ships app roles in a top-level `roles` claim array (e.g.
`["alphaswarm.admin", "alphaswarm.terraform.operator"]`). The provisioning logic
maps them onto the AlphaSwarm role lattice (`viewer < editor < admin <
owner`):

```python
# alphaswarm/auth/user.py::_apply_entra_tenant_link
# Multi-word roles fold to the tail token:
#   alphaswarm.terraform.operator -> "operator" -> editor
#   alphaswarm.terraform.approver -> "approver" -> admin
```

Per-link overrides live in `EntraTenantLink.role_mapping` (JSON).
Example for the seeded Wiley Tech link:

```json
{
  "alphaswarm.admin": "owner",
  "alphaswarm.editor": "editor",
  "alphaswarm.viewer": "viewer",
  "alphaswarm.terraform.operator": "editor",
  "alphaswarm.terraform.approver": "admin"
}
```
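
A minimal sketch of the fold-then-override rule described above. The function name `resolve_role` and the `TAIL_DEFAULTS` table are hypothetical, and two behaviours are assumptions of this sketch that the text does not state: a user with several app roles receives the highest resulting lattice role, and unknown roles grant nothing.

```python
from typing import Iterable, Mapping, Optional

LATTICE = ("viewer", "editor", "admin", "owner")  # lowest to highest

# Default fold for tail tokens that are not lattice roles themselves.
TAIL_DEFAULTS = {"operator": "editor", "approver": "admin"}


def resolve_role(
    app_roles: Iterable[str],
    overrides: Optional[Mapping[str, str]] = None,
) -> Optional[str]:
    """Return the highest lattice role granted by the given app roles."""
    overrides = overrides or {}
    best: Optional[str] = None
    for app_role in app_roles:
        if app_role in overrides:
            role = overrides[app_role]  # per-link role_mapping wins
        else:
            tail = app_role.rsplit(".", 1)[-1]  # fold to the tail token
            role = TAIL_DEFAULTS.get(tail, tail)
        if role not in LATTICE:
            continue  # assumption: unknown roles grant nothing
        if best is None or LATTICE.index(role) > LATTICE.index(best):
            best = role
    return best
```

With the Wiley Tech override shown above, `alphaswarm.admin` resolves to `owner`; without an override the same app role folds to `admin`.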

## Onboarding wizards (frontend)

`/admin/onboarding` hosts three wizards behind tabs:

1. **OrgCreateWizard** (4 steps) — name / billing / default
   structure / review. Seeds the canonical Core team + Main
   workspace + Main project + Main lab (from
   [`configs/tenants/tenant_default_template.yaml`](../configs/tenants/tenant_default_template.yaml)).
2. **EntraTenantLinkWizard** (5 steps) — choose org / Entra tid +
   primary domain / allowed email domains / app-role mapping /
   activate.
3. **UserInviteWizard** (3 steps) — email + display name / scope +
   role / review + send (Entra B2B invitation when MSAL is
   configured).

## Tenant template files

[`configs/tenants/`](../configs/tenants/) hosts three YAMLs:

- `tenant_default_template.yaml` — default org structure created on
  `data.tenancy.create_organization`.
- `roles_default_template.yaml` — canonical app-role → AlphaSwarm-role
  mapping.
- `user_invite_template.yaml` — Entra B2B invite email body + custom
  claims payload.

## Seeded state

After running `alembic upgrade head` against a fresh DB:

| Slug         | Type          | Notes                                        |
| ------------ | ------------- | -------------------------------------------- |
| `default`    | Organization  | Legacy 0017 seed (preserved for FK chains)   |
| `wiley-tech` | Organization  | New canonical seed (Wiley Tech)              |
| `core`       | Team          | Default team under wiley-tech                |
| `main`       | Workspace     | Default workspace under wiley-tech           |
| `main`       | Project       | Default project under main workspace         |
| `main`       | Lab           | Default lab under main workspace             |
| `julian@wiley.tech` | User   | Owner on every Wiley Tech scope              |

Every legacy `*_runs` / `bots` / `agent_runs_v2` / `analysis_runs` /
... row that previously pointed at `default-org` / `default-user` is
re-stamped to point at `wiley-tech` / `julian@wiley.tech` (see
`_restamp_legacy_rows` in
[`alembic/versions/0051_seed_wiley_tech.py`](../alembic/versions/0051_seed_wiley_tech.py)).
The legacy `default-*` rows stay in place so any orphan FK still
resolves.


<!-- https://alpha-swarm.ai/concepts/identity/scim-provisioning -->
# SCIM Provisioning
> Enable SCIM with:

# SCIM Provisioning

AlphaSwarm exposes a SCIM 2.0 provisioning surface at `/scim/v2/*` for Auth0
Actions or scheduled Auth0 jobs.

## Security

Enable SCIM with:

```bash
ALPHASWARM_AUTH_SCIM_ENABLED=true
ALPHASWARM_AUTH_PROVIDER=auth0
ALPHASWARM_AUTH_REQUIRED=true
```

Authentication is Bearer-only. AlphaSwarm accepts either:

- a JWT validated against the configured OIDC issuer with audience
  `ALPHASWARM_AUTH_SCIM_M2M_AUDIENCE` (or `ALPHASWARM_AUTH_M2M_AUDIENCE`), or
- a long random static token whose SHA-256 digest is stored in
  `ALPHASWARM_AUTH_SCIM_BEARER_TOKEN_HASH`.

Do not store the raw token in the repository.
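
The static-token option can be sketched as follows. This is an illustration under stated assumptions: the digest is assumed to be stored as lowercase hex (the text above only says "SHA-256 digest"), and the function names are hypothetical.

```python
import hashlib
import hmac
import secrets
from typing import Tuple


def new_scim_token() -> Tuple[str, str]:
    """Return (raw_token, sha256_hex). Persist only the hex digest."""
    raw = secrets.token_urlsafe(48)
    return raw, hashlib.sha256(raw.encode("utf-8")).hexdigest()


def token_matches(presented: str, stored_sha256_hex: str) -> bool:
    """Compare the presented token's digest to the stored one in constant time."""
    digest = hashlib.sha256(presented.encode("utf-8")).hexdigest()
    return hmac.compare_digest(digest, stored_sha256_hex.strip().lower())
```

The raw token goes to the SCIM client once, and only the digest would be placed in `ALPHASWARM_AUTH_SCIM_BEARER_TOKEN_HASH`.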

## Resource Mapping

- SCIM `User` maps to `users`.
- SCIM `Group` maps to `teams`.
- SCIM `Group.members` maps to `memberships` with `scope_kind="team"`.

Create, patch, replace, deactivate, and group membership operations emit
security audit events through `alphaswarm.auth.audit.emit_audit_event`.

## Auth0 Integration

The `alphaswarm_platform/terraform/modules/auth0_identity` module creates:

- the AlphaSwarm SPA application,
- the AlphaSwarm API audience and scopes,
- an M2M client grant for SCIM and Auth0 sync,
- default `alphaswarm-viewer` and `alphaswarm-admin` roles,
- a post-login Action that calls `/_internal/auth0/sync` and injects AlphaSwarm
  tenancy claims.

For direct enterprise SCIM, point the upstream IdP or Auth0 automation at
`https://<your-host>/scim/v2`.


<!-- https://alpha-swarm.ai/concepts/identity/spiffe-workload-identity -->
# SPIFFE workload identity

# SPIFFE workload identity

> Phase 4 §7.2 of
> [RESTRUCTURING_PLAN.md](https://github.com/julianwiley/alphaswarm/blob/main/RESTRUCTURING_PLAN.md).
> SPIFFE-bound identities replace the long-lived OAuth
> client-credentials grant currently used by `M2MTokenIssuer` for
> service-to-service authentication.

## Why workload identity

The pre-Phase-4 ``M2MTokenIssuer`` mints short-lived JWTs via the
Auth0 / Entra ``client_credentials`` grant, but those tokens are
still bearer credentials — exfiltrate the JWT and you can replay it
from anywhere until it expires. SPIFFE-bound identities (SVIDs) are
workload-attested via the platform (UID, cgroup, label selectors) —
much harder to steal and automatically rotated by the SPIRE Server.

| Aspect | OAuth `client_credentials` | SPIFFE JWT-SVID |
| --- | --- | --- |
| Issuer | Auth0 / Entra tenant | SPIRE Server (in-cluster) |
| Attestation | Shared `client_secret` (long-lived) | Node + workload attestor (live) |
| Bearer-token replay risk | High (until expiry) | Low (selectors validated by Workload API) |
| Rotation | Manual / scheduled | Automatic, per-SVID-lifetime |
| Cross-cell scope | Implicit (issuer trusts all audiences) | Explicit (`spiffe://alpha-swarm.ai/cell/<cell>/...` trust-domain path) |

## Trust domain layout

AlphaSwarm runs ONE trust domain: ``alpha-swarm.ai``. Each cell carries a
namespace-scoped path prefix inside that trust domain:

```
spiffe://alpha-swarm.ai/cell/<cell>/<service>
```

Example SPIFFE IDs:

| Cell | Service | SPIFFE ID |
| --- | --- | --- |
| `cell-shared-std-local` | `alphaswarm-core` | `spiffe://alpha-swarm.ai/cell/cell-shared-std-local/alphaswarm-core` |
| `cell-silo-reg-acme` | `alphaswarm-worker` | `spiffe://alpha-swarm.ai/cell/cell-silo-reg-acme/alphaswarm-worker` |
| `cell-shared-std-us-east-1a` | `alphaswarm-tenant-router` | `spiffe://alpha-swarm.ai/cell/cell-shared-std-us-east-1a/alphaswarm-tenant-router` |

Cross-cell calls validate the full SPIFFE ID, not just the trust
domain — Cell-Bound-Authorization (Phase 5 §8.5) extends this with
biscuit capability tokens that pin a request to a specific cell.
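
Validating the full SPIFFE ID, and not only the trust domain, can be sketched as below. The `CellServiceId` type and both functions are hypothetical helpers for illustration; they assume the `/cell/<cell>/<service>` layout shown in the table above.

```python
from dataclasses import dataclass
from urllib.parse import urlsplit

TRUST_DOMAIN = "alpha-swarm.ai"


@dataclass(frozen=True)
class CellServiceId:
    cell: str
    service: str

    def __str__(self) -> str:
        return f"spiffe://{TRUST_DOMAIN}/cell/{self.cell}/{self.service}"


def parse_spiffe_id(spiffe_id: str) -> CellServiceId:
    """Parse and validate an ID of the form spiffe://<td>/cell/<cell>/<service>."""
    parts = urlsplit(spiffe_id)
    if parts.scheme != "spiffe" or parts.netloc != TRUST_DOMAIN:
        raise ValueError(f"not in trust domain {TRUST_DOMAIN}: {spiffe_id!r}")
    if parts.query or parts.fragment:
        raise ValueError("SPIFFE IDs carry no query or fragment")
    segments = parts.path.split("/")
    # Expected path: /cell/<cell>/<service> -> ["", "cell", cell, service]
    if (
        len(segments) != 4
        or segments[1] != "cell"
        or not segments[2]
        or not segments[3]
    ):
        raise ValueError(f"unexpected SPIFFE path: {parts.path!r}")
    return CellServiceId(cell=segments[2], service=segments[3])


def authorize_peer(spiffe_id: str, expected: CellServiceId) -> bool:
    """Match the full ID (cell and service), not just the trust domain."""
    try:
        return parse_spiffe_id(spiffe_id) == expected
    except ValueError:
        return False
```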

## Deployment shape

Each cell runs ONE SPIRE control plane:

```
[ SPIRE Server StatefulSet ]  (spire-system namespace)
        ▲
        │ k8s_psat attest
        │
[ SPIRE Agent DaemonSet ]     (one per node)
        ▲
        │ unix socket: /run/spire/sockets/agent.sock
        │
[ AlphaSwarm workload pod ]          (mounts the socket via hostPath volume)
        │
        └── spiffe.workloadapi.fetch_svid(audiences=[...])
```

The matching manifests live at:

- `alphaswarm_platform/deployments/kubernetes/mesh-identity/spire/server.yaml`
- `alphaswarm_platform/deployments/kubernetes/mesh-identity/spire/agent.yaml`

Per-cell installs come from the Argo CD `ApplicationSet` at
`alphaswarm_platform/deployments/argocd/applicationsets/cells-appset.yaml`
(Phase 4.5 extends it with a `mesh-identity` component column).

## AlphaSwarm integration

The application-side integration lives in
[`alphaswarm/auth/providers/spiffe.py`](https://github.com/julianwiley/alphaswarm/blob/main/alphaswarm/auth/providers/spiffe.py)
(`SpiffeIdentityProvider`). It implements the
:py:class:`alphaswarm.auth.providers.protocol.IdentityProvider` interface
but only the :py:meth:`m2m_token` method does real work — SPIFFE
is workload-only and does NOT participate in user OIDC flows. The
existing Auth0 / Entra providers stay wired for user-facing login.

### Wiring

```bash
# Operator sets the workload API socket path (default is the
# conventional /run/spire/sockets/agent.sock from the SPIRE Agent
# DaemonSet's hostPath mount).
export ALPHASWARM_AUTH_SPIFFE_WORKLOAD_API_SOCKET="unix:///run/spire/sockets/agent.sock"

# Route the M2MTokenIssuer through SPIFFE instead of Auth0.
# (Phase 4.5 deliverable — the M2MTokenIssuer side is still TODO.)
export ALPHASWARM_AUTH_M2M_PROVIDER=spiffe
```

When the SPIFFE socket isn't reachable (development mode, smoke
tests, migrations), `SpiffeIdentityProvider.m2m_token` raises
`IdentityProviderError`. The fallback chain in
`alphaswarm.credentials.resolver` re-tries the legacy Auth0 path so
developers can iterate without a running SPIRE Agent.
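
The fall-through behaviour can be sketched as an ordered list of token sources. This is not the resolver's actual code: `first_available_token`, `spiffe_source`, and `legacy_source` are hypothetical, and the local `IdentityProviderError` class stands in for the real exception.

```python
from typing import Callable, List, Sequence


class IdentityProviderError(Exception):
    """Stand-in for the error the SPIFFE provider raises."""


TokenSource = Callable[[str], str]  # audience -> token


def first_available_token(audience: str, sources: Sequence[TokenSource]) -> str:
    """Try each source in order; fall through on IdentityProviderError."""
    errors: List[str] = []
    for source in sources:
        try:
            return source(audience)
        except IdentityProviderError as exc:
            errors.append(str(exc))
    raise IdentityProviderError("; ".join(errors) or "no token sources configured")


def spiffe_source(audience: str) -> str:
    # Simulates a missing agent socket on a developer machine.
    raise IdentityProviderError("workload API socket unreachable")


def legacy_source(audience: str) -> str:
    # Placeholder for the legacy client_credentials path.
    return f"legacy-token-for:{audience}"
```

Only `IdentityProviderError` triggers the fallback in this sketch, so unrelated bugs in a source still surface as exceptions.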

## Pod template requirements

For a pod to consume SVIDs from the SPIRE Workload API:

1. Mount the agent's host socket:
   ```yaml
   volumes:
     - name: spire-agent-socket
       hostPath:
         path: /run/spire/sockets
         type: Directory
   containers:
     - name: ...
       volumeMounts:
         - name: spire-agent-socket
           mountPath: /run/spire/sockets
           readOnly: true
   ```
2. Set `SPIFFE_ENDPOINT_SOCKET=unix:///run/spire/sockets/agent.sock`
   in the pod env (or rely on the AlphaSwarm default).
3. Be in the `spire-system` `ClusterSPIFFEID` selector — the
   matching CRD is shipped per-cell in Phase 4.5; today the
   `k8s_psat` Node Attestor accepts every workload with a
   matching ServiceAccount.

## Rotation + revocation

- **SVID lifetime**: 1h X.509-SVID, 5m JWT-SVID (configurable via
  the SPIRE Server config map).
- **Trust anchor lifetime**: 168h (7 days). Operators rotate the
  root via Vault PKI; the SPIRE Server propagates the new bundle
  to every Agent within ~1 minute.
- **Revocation**: deleting a workload's `RegistrationEntry` from
  the SPIRE Server stops all future SVID issuance. Existing
  in-flight SVIDs expire at their natural TTL — for an immediate
  cut-off, also rotate the trust anchor.

## Failure modes

| Failure | Behaviour |
| --- | --- |
| SPIRE Agent socket missing | `SpiffeIdentityProvider.m2m_token` raises `IdentityProviderError` |
| SPIRE Server unreachable | Agent serves cached SVID until it expires (~1h) |
| Workload not attested | `fetch_svid` raises; M2M chain falls through to Auth0 |
| Trust anchor rotation | SVIDs continue to validate during the 7-day overlap window |

## Phase 4.5 follow-ups

1. **Per-cell `ClusterSPIFFEID` CRDs** that bind workload selectors
   to SPIFFE IDs (today the spine relies on the default `k8s_psat`
   attestor).
2. **M2MTokenIssuer dispatch** — wire `ALPHASWARM_AUTH_M2M_PROVIDER=spiffe`
   into the issuer so it picks SPIFFE for M2M without affecting
   user OIDC flows.
3. **Linkerd integration** — Linkerd consumes SPIFFE identity for
   mTLS termination (Phase 4 §7.1). Phase 4.5 wires the SPIFFE
   trust anchor into Linkerd's Identity service.
4. **OIDC discovery provider** — SPIRE Server can expose an OIDC
   discovery endpoint that lets non-SPIRE-aware services
   (Pomerium, Cloudflare Access) validate SVIDs as standard
   OIDC JWTs.
5. **Cross-cell federation** — Phase 8 §11.2 multi-region cells
   will need SPIFFE trust-domain federation.

## Related documents

- [RESTRUCTURING_PLAN.md §7.2](https://github.com/julianwileymac/alphaswarm/blob/main/RESTRUCTURING_PLAN.md)
- [alphaswarm/auth/providers/spiffe.py](https://github.com/julianwileymac/alphaswarm/blob/main/alphaswarm/auth/providers/spiffe.py)
- [alphaswarm_platform/deployments/kubernetes/mesh-identity/spire/](https://github.com/julianwileymac/alphaswarm/blob/main/alphaswarm_platform/deployments/kubernetes/mesh-identity/spire/)
- SPIFFE specification: https://github.com/spiffe/spiffe
- SPIRE: https://spiffe.io/docs/latest/spire-about/spire-concepts/


<!-- https://alpha-swarm.ai/concepts/infrastructure/alphaswarm-ide-roadmap -->
# AlphaSwarm IDE roadmap
> The blueprint targets greenfield buyers of a quant PaaS. AlphaSwarm already has:

# AlphaSwarm IDE roadmap

This doc maps the [external quant-IDE
blueprint](https://github.com) (compressed: "Bloomberg-grade research
IDE you can own", 12–18 month Phase 1/2/3 plan) to AlphaSwarm's existing
architecture and the 55 hard rules.

## Why we deviate from the blueprint

The blueprint targets greenfield buyers of a quant PaaS. AlphaSwarm already
has:

- Five hash-locked spec runtimes (`AgentSpec` / `BotSpec` /
  `RLExperimentSpec` / `AnalysisSpec` / `WorkflowSpec`) — rules
  12-13, 14-15, 16-17, 23-25, 40-41.
- Nine backtest engines (vbt-pro, event-driven, OSS vectorbt,
  backtesting.py, LEAN, ZVT, AAT, hftbacktest, NautilusTrader bridge).
- DataMCP + CodebaseMCP — rule 22, exposed over RFC 9728 / RFC 8707
  conformant streamable HTTP per rule 49.
- AlphaVantage / IBKR / Alpaca brokers — paper trading exists.
- Iceberg lakehouse with medallion-tier business metadata — rule 21.
- A Vite 7 + React 19 operator UI (`alphaswarm_client/`) that already covers
  the operator dashboard scope.

The IDE's role in AlphaSwarm is **the developer / research environment** —
notebook + MCP copilot + spec authoring + repo navigation. It does NOT
re-implement what `alphaswarm_client/` already does well.

## Phasing

### Phase A — Shipped in this enhancement

| Workstream | Blueprint section | AlphaSwarm-aligned implementation |
| --- | --- | --- |
| Six compile-time Theia extensions | §2.2 + §2.5 + §2.6 + §2.8 | `alphaswarm-ext`, `alphaswarm-shell-ext`, `alphaswarm-mcp-bridge-ext`, `alphaswarm-research-copilot-ext`, `alphaswarm-notebook-quant-ext`, `alphaswarm-quant-ext` |
| FINOS Perspective notebook renderer | §2.6 + §4.5 | `alphaswarm-notebook-quant-ext`'s `PerspectiveArrowRenderer` (lazy-loads `@finos/perspective`) |
| MCP-driven research copilot | §2.7 + §5.4 | `alphaswarm-research-copilot-ext`'s `AqpResearchAgent` (routes through `router_complete`, rule 2) |
| White-label shell + filters | §2.8 | `alphaswarm-shell-ext`'s `FilterContribution` + window title + about dialog |
| Quant widgets (operator complement) | §5.1 | `alphaswarm-quant-ext`'s SpecAuthor + RunInspector + BacktestRunner |
| `alphaswarm-cli ide` entrypoint | (CLI orchestration) | `install` / `build` / `start` / `stop` / `status` / `logs` / `open` / `url` / `env` / `detect` / `doctor` |
| Single-pod K8s manifests | §7 (Layer 2) | `alphaswarm_platform/deployments/kubernetes/alphaswarm-ide/` |
| Theia Cloud Phase B scaffolding | §3 | `alphaswarm_platform/deployments/kubernetes/alphaswarm-ide/theia-cloud/` with `DEFERRED.md` |
| Per-extension AGENTS + READMEs + skills + rules | (governance) | 6 README + 6 AGENTS + 2 skills + 1 rule + 2 subagents |
| Workspace retirement checklist | (governance) | `alphaswarm_ide/docs/retire-vendored-workspace.md` |

### Phase B — Trigger: ≥2 internal users need isolated workspaces

| Workstream | Blueprint section | AlphaSwarm-aligned implementation |
| --- | --- | --- |
| Theia Cloud multi-tenant operator | §3 | Install upstream `theia-cloud` Helm + apply the `AppDefinition` scaffolded under `alphaswarm-ide/theia-cloud/` |
| Per-tenant PVC + workspace | §3.5 | One PVC per `Workspace.theia.cloud/v1beta5` |
| Activity-tracker idle shutdown | §3.3 | `monitor.activityTracker.timeoutAfter` on `AppDefinition` |
| Private Open VSX mirror | §2.9 | Self-hosted Open VSX in `alphaswarm-ide` namespace |
| Step-up confirmation for copilot write tools | (rule 52) | Surface confirmation chips before invoking `/halt` / `/me/byok/*` / `/tenancy/invites` tools |

### Phase C — Trigger: tick / order-book research demand emerges

| Workstream | Blueprint section | AlphaSwarm-aligned implementation |
| --- | --- | --- |
| Arrow Flight gateway backend service | §4.1 | A new compile-time extension `alphaswarm-flight-gateway-ext` with a JSON-RPC service that fronts AlphaSwarm Iceberg + Snowflake (when present) via ADBC |
| Tick blotter widget | §5.2 | New widget in `alphaswarm-quant-ext` (or a sibling `alphaswarm-trading-ext`) that subscribes to the live market data Kafka topic |
| Real-time Yjs notebook collaboration | §5.5 | New compile-time extension `alphaswarm-notebook-rtc-ext` with a backend Yjs WebSocket server |
| Hudi upsert-heavy market-data partitions | (rule 46) | Wire `alphaswarm/data/lakehouse/hudi/` into the BacktestRunner spec UI |
| GPU / RAPIDS scheduling | §3 (Layer 5) | New `AppDefinition` flavour with GPU node selectors |

## Hard-rule mapping summary

| Rule | Phase A | Phase B | Phase C |
| --- | --- | --- | --- |
| 2 (LLM gateway) | Copilot uses `router_complete` | (no change) | Hudi-aware code samples in copilot |
| 4 (progress frame) | `AqpWsClient` consumes canonical frame | (no change) | (no change) |
| 22 (DataMCP) | MCP bridge | (no change) | Flight gateway uses DataMCP for catalog metadata |
| 26 (CredentialResolver) | Python helpers | (no change) | Flight gateway pulls Snowflake creds via store |
| 27 (IdentityProvider) | All extensions | Per-pod oauth2-proxy | (no change) |
| 45 (WorkloadRuntime) | CLI `doctor` + `alphaswarm-ext` halt | Multi-pod halt via `/workloads/halt` | (no change) |
| 47 (topology) | CLI `detect` / `env` | (no change) | (no change) |
| 49 (MCP audience) | Bridge sets `X-AlphaSwarm-MCP-Audience` | (no change) | (no change) |
| 52 (step-up MFA) | `alphaswarm-ext` halt | Copilot write-tool gating | (no change) |

## Decision log

| Decision | Rationale |
| --- | --- |
| Use AlphaSwarm `router_complete` (rule 2) for the copilot, NOT `@theia/ai-openai` / `@theia/ai-anthropic` etc. | AlphaSwarm's provider catalog + cost caps + tenancy + audit run through `router_complete`. Bypassing it would create an auditing blind spot for every chat completion. |
| Use AlphaSwarm's five spec runtimes for SpecAuthor, NOT a generic `BacktestService` JSON-RPC | The blueprint's hypothetical `BacktestService` is what AlphaSwarm already has — five hash-locked spec runtimes with `persist_spec` + immutable version rows. Reinventing them would create a fork. |
| Defer Arrow Flight + Theia Cloud + RTC to Phase B/C | AlphaSwarm's current load (single-tenant Vite UI + AlphaSwarm API) does not justify the multi-tenant Theia Cloud operator yet. The blueprint's Flight gateway is a Phase C target — DataMCP + Iceberg already cover the data plane for Phase A. |
| Keep `alphaswarm_client/` as the operator UI; Theia complements it | The Vite app already has the operator dashboards. Theia adds notebook + MCP copilot + spec authoring + repo navigation. Two surfaces, one tenancy, no duplication. |
| Make `alphaswarm-cli ide` the canonical entrypoint | Production deploys go through one command. `yarn` stays for inner-loop dev. Mirrors the `alphaswarm-cli client` pattern for the Vite frontend. |
| Don't fork Theia | Every blueprint risk register flags forking as catastrophic. AlphaSwarm stays on community releases and adds via compile-time extensions only. |

## What this roadmap is NOT

- A commitment to ship every blueprint phase.
- A timeline. We ship Phase A now; Phase B and C ship when triggered.
- A justification for re-implementing what `alphaswarm_client/` already provides.
- A reason to bypass the 55 hard rules.

## Source of truth

- The blueprint we summarised: external research report + product
  blueprint provided as the source for this enhancement.
- AlphaSwarm's canonical hard rules: [../AGENTS.md](https://github.com/julianwileymac/alphaswarm/blob/main/AGENTS.md).
- Per-extension contracts:
  [../alphaswarm_ide/theia-extensions/](../alphaswarm_ide/theia-extensions/).


<!-- https://alpha-swarm.ai/concepts/infrastructure/alphaswarm-ide -->
# AlphaSwarm IDE
> This page is a thin pointer into the in-folder documentation that lives in `alphaswarm_ide/`. The canonical contracts are there

# AlphaSwarm IDE

The **AlphaSwarm IDE** is a white-labeled Eclipse Theia 1.72 distribution + six
AlphaSwarm compile-time extensions + an MCP-driven research copilot + a
Perspective Arrow notebook renderer. It is the developer environment
that sits next to (not replaces) the `alphaswarm_client/` Vite operator UI.

## SSoT pointers

This page is a thin pointer into the in-folder documentation that lives
in `alphaswarm_ide/`. The canonical contracts are there.

| Topic | Path |
| --- | --- |
| Overview + architecture | [../alphaswarm_ide/README.md](https://github.com/julianwileymac/alphaswarm/blob/main/alphaswarm_ide/README.md) |
| Process + extension architecture | [../alphaswarm_ide/docs/architecture.md](https://github.com/julianwileymac/alphaswarm/blob/main/alphaswarm_ide/docs/architecture.md) |
| Per-extension reference | [../alphaswarm_ide/docs/extensions.md](https://github.com/julianwileymac/alphaswarm/blob/main/alphaswarm_ide/docs/extensions.md) |
| Canonical operator entrypoint (`alphaswarm-cli ide`) | [../alphaswarm_ide/docs/cli-entrypoint.md](https://github.com/julianwileymac/alphaswarm/blob/main/alphaswarm_ide/docs/cli-entrypoint.md) |
| MCP integration (RFC 9728 + RFC 8707) | [../alphaswarm_ide/docs/mcp-integration.md](https://github.com/julianwileymac/alphaswarm/blob/main/alphaswarm_ide/docs/mcp-integration.md) |
| Research Copilot (chat agent) | [../alphaswarm_ide/docs/research-copilot.md](https://github.com/julianwileymac/alphaswarm/blob/main/alphaswarm_ide/docs/research-copilot.md) |
| Notebook (Perspective MIME renderer) | [../alphaswarm_ide/docs/notebook.md](https://github.com/julianwileymac/alphaswarm/blob/main/alphaswarm_ide/docs/notebook.md) |
| Quant widgets (SpecAuthor / RunInspector / BacktestRunner) | [../alphaswarm_ide/docs/quant-widgets.md](https://github.com/julianwileymac/alphaswarm/blob/main/alphaswarm_ide/docs/quant-widgets.md) |
| Deployment (local / single-pod K8s / Theia Cloud) | [../alphaswarm_ide/docs/deployment.md](https://github.com/julianwileymac/alphaswarm/blob/main/alphaswarm_ide/docs/deployment.md) |
| Phased roadmap (blueprint → AlphaSwarm) | [alphaswarm-ide-roadmap.md](../../concepts/infrastructure/alphaswarm-ide-roadmap.md) |

## Hard-rule touchpoints

The AlphaSwarm IDE's most-cited hard rules from [../AGENTS.md](https://github.com/julianwileymac/alphaswarm/blob/main/AGENTS.md):

| Rule | Owner | AlphaSwarm IDE consumer |
| --- | --- | --- |
| 2 (LLM gateway) | `alphaswarm/llm/providers/router.py::router_complete` | `alphaswarm-research-copilot-ext`'s `RouterCompleteClient` |
| 4 (canonical progress frame) | `alphaswarm/tasks/_progress.py::emit` | `alphaswarm-quant-ext`'s `AqpWsClient` / `RunInspectorWidget` |
| 22 (DataMCP boundary) | `alphaswarm/data/mcp/` | `alphaswarm-mcp-bridge-ext`'s registrations |
| 26 (CredentialResolver) | `alphaswarm/credentials/resolver.py` | Python notebook helpers (`alphaswarm/notebook/helpers.py`) |
| 27 (IdentityProvider) | `alphaswarm/auth/providers/` | `alphaswarm-ext`'s `Auth0Service` + new MCP bridge / copilot |
| 45 (WorkloadRuntime) | `alphaswarm_core/runtime/workload.py` | `alphaswarm-ext`'s halt fan-out + `alphaswarm-cli ide` doctor |
| 47 (topology) | `alphaswarm_controller/services/topology.py` | `alphaswarm-cli ide url --remote` / `detect` / `env` |
| 49 (MCP audience, RFC 8707) | `alphaswarm/api/well_known.py` + `alphaswarm/api/mcp_audience.py` | `alphaswarm-mcp-bridge-ext`'s `X-AlphaSwarm-MCP-Audience` header |
| 52 (step-up MFA) | `alphaswarm/api/security_stepup.py` | `alphaswarm-ext`'s halt command + future copilot write tools |

## Canonical operator entrypoint

```bash
alphaswarm-cli auth login --device   # RFC 8628 device flow + OS keyring (rule 53)
alphaswarm-cli ide install           # one-time bootstrap
alphaswarm-cli ide build --dev       # yarn build:extensions + build:applications:dev
alphaswarm-cli ide start --open      # spawn Theia + open in browser
alphaswarm-cli ide doctor            # preflight checks
```

Full CLI reference: [../alphaswarm_cli/docs/index.md](https://github.com/julianwileymac/alphaswarm/blob/main/alphaswarm_cli/docs/index.md).

## Boundary contract (mirrored from `.cursor/rules/alphaswarm-ide.mdc`)

- `alphaswarm_ide/` extensions MUST NOT `import` from `alphaswarm`
  source. They cross the boundary over HTTP only, via `AqpApiService`
  or the DataMCP / CodebaseMCP HTTP surfaces.
- AlphaSwarm-specific behavior lives ONLY under
  `alphaswarm_ide/theia-extensions/alphaswarm*/` (the six extensions). Don't sprinkle
  AlphaSwarm imports into core Theia files.
- The IDE is browser-target-only. The Electron app remains
  upstream-oriented and is NOT wired for AlphaSwarm in this release.
- The canonical entrypoint is `alphaswarm-cli ide`. Direct `yarn` invocations
  are inner-loop development only.

## Vendored workspace retirement

The vendored `test_theia/theia-ide` workspace is byte-for-byte identical
to `alphaswarm_ide/` and can be retired. See
[../alphaswarm_ide/docs/retire-vendored-workspace.md](https://github.com/julianwileymac/alphaswarm/blob/main/alphaswarm_ide/docs/retire-vendored-workspace.md)
for the 5-step checklist.


<!-- https://alpha-swarm.ai/concepts/infrastructure/cicd-pipelines -->
# CI/CD pipelines
> GitHub Actions orchestrates AWS CodeBuild over GitHub OIDC to deploy alphaswarm_platform and alphaswarm_admin. Covers the plan-vs-apply role split, the hybrid Terraform boundary, CodeArtifact, the three canonical workflows, and the dev to staging to prod promotion.

# CI/CD pipelines

The AlphaSwarm AWS deployment is driven by CI/CD: **GitHub Actions
orchestrates** the pipeline and **AWS `CodeBuild` runs the heavy
in-VPC work** (multi-arch `buildx` builds to `ECR`, and the
`alphaswarm deploy` app-tier apply). There are no static AWS keys
anywhere in the pipeline — every cloud step authenticates through
**GitHub OIDC**.

This page explains the topology, the trust model, and the workflows.
For the task-oriented steps (creating environments, triggering a
deploy, approving a prod release, rolling back) see the companion
runbook [Operations runbook — CI/CD deploy](../../how-to/operations/cicd-deploy.md).
For the deeper deploy walkthroughs see
[AWS Hybrid Deployment Guide](../../how-to/operations/aws-deploy.md) and
[AWS Hybrid Operational Runbook](../../how-to/operations/aws-runbook.md).

## Topology — GitHub Actions, CodeBuild, OIDC

GitHub Actions is the control plane: it reacts to pushes, tags, pull
requests, and `repository_dispatch`, then either runs lightweight
Terraform directly or delegates the in-VPC heavy lifting to
`CodeBuild` via `aws codebuild start-build`. The GitHub Actions job
first assumes an AWS role over OIDC, so the `start-build` call (and
everything `CodeBuild` does downstream) runs under short-lived
credentials.

```mermaid
flowchart LR
    dev[Developer] -->|push / tag / PR| gha[GitHub Actions workflow]

    subgraph github [GitHub]
        gha
        oidc[GitHub OIDC token]
        gha --> oidc
    end

    subgraph aws [AWS account dev / staging / prod]
        sts[STS AssumeRoleWithWebIdentity]
        planRole[Plan role read-only]
        applyRole[Apply role]
        cb[CodeBuild in-VPC]
        ecr[ECR registries]
        ca[CodeArtifact alphaswarm-pypi]
        tf[Terraform state S3 + DynamoDB]
        runtime[alphaswarm deploy TerraformRuntime]
    end

    oidc --> sts
    sts --> planRole
    sts --> applyRole
    gha -->|aws codebuild start-build| cb
    cb --> ecr
    cb --> ca
    cb --> runtime
    applyRole --> tf
    planRole --> tf
    runtime --> tf
```

Why split the work this way:

- **GitHub Actions** is cheap, parallel, and is where the promotion
  gates (GitHub Environments + required reviewers) live.
- **`CodeBuild`** runs inside the workload VPC, so it can reach
  private subnets, the internal `CodeArtifact` PyPI, and the app-tier
  resources that `alphaswarm deploy` manages. It also gives multi-arch
  `buildx` a beefy, in-account builder close to `ECR`.

## Authentication — GitHub OIDC, no static keys

Trust is configured **per account** via the
`infrastructure/modules/github-oidc` module, which registers the
GitHub OIDC provider and the IAM roles. The provider trusts both
deploying repos:

- `Alpha-Swarm-ai/alphaswarm_platform`
- `Alpha-Swarm-ai/alphaswarm_admin`

### Plan role vs apply role

The module emits two roles per account, with different trust
conditions on the OIDC `sub` claim:

- **Plan role** — read-only. Trusted on pull-request refs so that
  PR validation can run `terraform plan` / `validate` without any
  mutate permission. Example trusted subjects:

  ```text
  repo:Alpha-Swarm-ai/alphaswarm_platform:pull_request
  repo:Alpha-Swarm-ai/alphaswarm_platform:ref:refs/heads/main
  ```

- **Apply role** — read-write. Trusted only on `refs/heads/main`
  **and** scoped to a GitHub Environment, so an apply cannot run
  until the Environment's required reviewers approve. Example trusted
  subjects:

  ```text
  repo:Alpha-Swarm-ai/alphaswarm_platform:ref:refs/heads/main
  repo:Alpha-Swarm-ai/alphaswarm_platform:environment:prod
  ```

The apply role ARN is published per environment as the
`AWS_DEPLOYER_ROLE_ARN` repo variable (one value per GitHub
Environment); the plan role ARN is published alongside it. A workflow
job selects the role for its target `env`, then assumes it over OIDC.

```yaml
permissions:
  id-token: write   # required to mint the GitHub OIDC token
  contents: read

jobs:
  apply:
    environment: prod   # gates on the Environment's required reviewers
    steps:
      - uses: aws-actions/configure-aws-credentials@v4
        with:
          role-to-assume: ${{ vars.AWS_DEPLOYER_ROLE_ARN }}
          aws-region: us-east-1
```
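
The plan/apply split comes down to which OIDC `sub` claims each role trusts. The sketch below models that check in plain Python using the example subjects listed above; the real enforcement lives in the IAM trust policy conditions, and `may_assume` is a hypothetical helper, not part of the module.

```python
REPO = "Alpha-Swarm-ai/alphaswarm_platform"

# Example trusted subjects copied from the role descriptions above.
TRUSTED_SUBJECTS = {
    "plan": {
        f"repo:{REPO}:pull_request",
        f"repo:{REPO}:ref:refs/heads/main",
    },
    "apply": {
        f"repo:{REPO}:ref:refs/heads/main",
        f"repo:{REPO}:environment:prod",
    },
}


def may_assume(role: str, sub: str) -> bool:
    """Exact-match check of a token's `sub` claim against a role's trust list."""
    return sub in TRUSTED_SUBJECTS[role]


pr_sub = f"repo:{REPO}:pull_request"
prod_sub = f"repo:{REPO}:environment:prod"

pr_can_plan = may_assume("plan", pr_sub)        # pull requests may plan
pr_can_apply = may_assume("apply", pr_sub)      # but never apply
prod_can_apply = may_assume("apply", prod_sub)  # Environment-scoped apply
```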

## Hybrid Terraform boundary

There are two Terraform trees and they are applied two different
ways. The boundary is deliberate.

| Tree | What it owns | Applied by | Auth | Audit |
| --- | --- | --- | --- | --- |
| `infrastructure/` | Landing zone: VPC, `ECR`, RDS, EKS, OIDC provider, observability, the `CodeBuild`/`CodeArtifact` plumbing | Native `terraform plan` / `terraform apply` | OIDC into `AqpTerraformExecutionRole` | Terraform state only |
| `terraform/` | App tier: the per-env application composition deployed onto the platform | `alphaswarm deploy plan` / `alphaswarm deploy up` (`TerraformRuntime`) | `TerraformRuntime` in `CodeBuild` | Writes a `terraform_runs` audit row |

The app tree is never applied with a bare `terraform apply`. It goes
through `alphaswarm deploy`, which drives `TerraformRuntime` and
writes a `terraform_runs` audit row for every plan and apply
(platform AGENTS rule 42). That keeps the app-tier change history in
the same ledger as every other runtime action. See
[Terraform IaC control plane](./terraform-control-plane.md) for how
`TerraformRuntime` works and [IaC runbook](./iac-runbook.md) for the
provisioning recipes.

```bash
# Landing zone (infrastructure/): native terraform, OIDC -> AqpTerraformExecutionRole
terraform -chdir=infrastructure/envs/dev init
terraform -chdir=infrastructure/envs/dev plan

# App tier (terraform/): alphaswarm deploy, writes a terraform_runs row
alphaswarm deploy plan --env dev
alphaswarm deploy up   --env dev
```

## CodeArtifact for alphaswarm-core and the CLI

`alphaswarm-core` and the `alphaswarm` CLI are not installed from
public PyPI in CI or in the Docker images. They are pulled from the
platform's **AWS `CodeArtifact`** internal PyPI repository,
`alphaswarm-pypi`. CI (and every Dockerfile build step that needs the
CLI) authenticates to `CodeArtifact` over the same OIDC-derived
credentials and configures it as the pip index:

```bash
aws codeartifact login --tool pip \
  --domain alphaswarm --repository alphaswarm-pypi
pip install alphaswarm-core "alphaswarm[deploy]"
```

This keeps the internal packages private and gives CI a stable,
in-account index that does not depend on public PyPI availability.

## The three canonical workflows

These names match `compliance/soc2-evidence-map.md`,
`how-to/operations/aws-deploy.md`, `how-to/operations/aws-runbook.md`,
and ADR [006 — alphaswarm_admin overhaul](../../architecture/decisions/006-aqp-admin-overhaul.md).

### terraform-pipeline.yml

The deploy workflow for both Terraform trees.

- **Inputs:** `tree` ∈ {`infrastructure`, `alphaswarm_platform`},
  `env` ∈ {`dev`, `staging`, `prod`}, `action` ∈ {`plan`, `apply`}.
- **`push` to `main`:** runs a `plan` against `dev` automatically.
- **Dispatch (`apply`):** assumes the env's apply role and applies the
  selected tree. For `tree=infrastructure` it runs native
  `terraform apply`; for `tree=alphaswarm_platform` it delegates to
  `CodeBuild`, which runs `alphaswarm deploy up` (and lands the
  `terraform_runs` row).
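
The dispatch logic can be sketched as a small routing function. This is an illustration under assumptions, not the workflow's actual code: `route_pipeline` and the returned dictionary shape are hypothetical, and running app-tier `plan` in `CodeBuild` is assumed by analogy with `apply`.

```python
TREES = ("infrastructure", "alphaswarm_platform")
ENVS = ("dev", "staging", "prod")
ACTIONS = ("plan", "apply")


def route_pipeline(tree: str, env: str, action: str) -> dict:
    """Validate the workflow inputs and pick a runner plus command."""
    checks = (("tree", tree, TREES), ("env", env, ENVS), ("action", action, ACTIONS))
    for name, value, allowed in checks:
        if value not in allowed:
            raise ValueError(f"{name} must be one of {allowed}, got {value!r}")

    if tree == "infrastructure":
        # Landing zone: native terraform, run directly by the workflow job.
        return {
            "runner": "github-actions",
            "command": ["terraform", f"-chdir=infrastructure/envs/{env}", action],
        }

    # App tier: delegated to CodeBuild, which runs `alphaswarm deploy`.
    verb = "up" if action == "apply" else "plan"
    return {"runner": "codebuild", "command": ["alphaswarm", "deploy", verb, "--env", env]}


prod_apply = route_pipeline("alphaswarm_platform", "prod", "apply")
```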

### build-publish.yml

The image release workflow. Triggers on a `v*` tag and, for each
service, performs a supply-chain-hardened build:

- multi-arch `buildx` build, pushed to `ECR`;
- **`Cosign` keyless** signature (OIDC, no long-lived keys);
- **`syft` SBOM** generation;
- **`SLSA` provenance** attestation;
- **`Trivy`** and **`Grype`** vulnerability scans.

The per-service build/sign/push logic is factored into the composite
action `.github/actions/build-sign-push/`, so every service builds
identically.

### pr-validate.yml

The pull-request gate. On every PR it runs `terraform fmt -check`,
`terraform validate`, `tfsec`, and `conftest` (OPA) policy checks,
then a `terraform plan` using the **plan role** (read-only). It never
holds mutate permission, so a PR can be validated safely from a fork
or feature branch.

## Promotion — dev to staging to prod

Promotion is enforced by **GitHub Environments** with required
reviewers, layered on top of the OIDC apply-role trust (the apply
role is only assumable inside the matching Environment):

| Environment | Approval | Trigger |
| --- | --- | --- |
| `dev` | Auto (no reviewers) | `push` to `main` plans `dev`; apply on dispatch |
| `staging` | 1 reviewer | Dispatch `terraform-pipeline.yml` with `env=staging` |
| `prod` | 2 reviewers (4-eyes) | Dispatch `terraform-pipeline.yml` with `env=prod` |

Because the gate lives in the GitHub Environment, a `prod` apply
cannot mint the apply-role credential until two
distinct reviewers approve the run.
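
The approval thresholds in the table can be expressed as a simple check. GitHub Environments enforce this in practice; the sketch below only models the rule, and `apply_may_start` is a hypothetical name.

```python
# Required reviewer counts from the promotion table above.
REQUIRED_REVIEWERS = {"dev": 0, "staging": 1, "prod": 2}


def apply_may_start(env: str, approvals: list[str]) -> bool:
    """True once enough distinct reviewers have approved the run."""
    distinct = {name.strip().lower() for name in approvals if name.strip()}
    return len(distinct) >= REQUIRED_REVIEWERS[env]


same_person_twice = apply_may_start("prod", ["alice", "alice"])  # one reviewer only
two_reviewers = apply_may_start("prod", ["alice", "bob"])        # 4-eyes satisfied
```

Counting distinct names is what makes the `prod` gate a 4-eyes check: a second approval from the same reviewer does not move the run forward.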

## alphaswarm_admin — two images, then a dispatch handoff

`alphaswarm_admin` is built and deployed slightly differently from the
platform itself.

1. A push to the admin repo's `main` (or a `v*` tag) builds **two
   images** and pushes them to `ECR`:
   - `alphaswarm-admin` (the FastAPI backend)
   - `alphaswarm-admin-frontend` (the Next.js frontend)
2. After both images land, the admin workflow fires a cross-repo
   `repository_dispatch` event named `admin-image-published` at
   `alphaswarm_platform`.
3. That dispatch triggers the platform's app-tier redeploy, which
   rolls the admin service onto **ECS `Fargate`** (`Cognito` + `ALB`)
   via the platform's `terraform/environments/{dev,staging,prod}` app
   tier (generalized from the existing `minimum` env).
4. The app tier reads its infra handles from SSM under
   `/alphaswarm/<env>/*`, published by
   `infrastructure/envs/admin-{dev,staging,prod}`.

```mermaid
flowchart LR
    push[Push to admin main / tag] --> build[Build 2 images]
    build --> ecr1[ECR: alphaswarm-admin]
    build --> ecr2[ECR: alphaswarm-admin-frontend]
    ecr1 --> disp[repository_dispatch: admin-image-published]
    ecr2 --> disp
    disp --> plat[alphaswarm_platform app-tier redeploy]
    plat --> ssm[Read SSM /alphaswarm/env/*]
    plat --> fargate[ECS Fargate: Cognito + ALB]
```

The cross-repo dispatch requires a token (`PLATFORM_DISPATCH_TOKEN`)
configured as a secret in the admin repo — see the runbook for setup.
For what the admin service itself is, see
[alphaswarm-admin](./services/alphaswarm-admin.md).
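
For orientation, step 2 maps onto GitHub's "create a repository dispatch event" REST endpoint. The sketch below only builds the request and does not send it; the `image_tag` payload field and the function name are assumptions, and the admin workflow may well use a prebuilt action instead.

```python
import json
import urllib.request


def build_dispatch_request(
    token: str,
    image_tag: str,
    owner: str = "Alpha-Swarm-ai",
    repo: str = "alphaswarm_platform",
) -> urllib.request.Request:
    """Build the POST that fires `admin-image-published` at the platform repo."""
    body = {
        "event_type": "admin-image-published",
        # Hypothetical payload; the real workflow may pass different fields.
        "client_payload": {"image_tag": image_tag},
    }
    return urllib.request.Request(
        url=f"https://api.github.com/repos/{owner}/{repo}/dispatches",
        data=json.dumps(body).encode("utf-8"),
        method="POST",
        headers={
            "Accept": "application/vnd.github+json",
            "Authorization": f"Bearer {token}",  # value of PLATFORM_DISPATCH_TOKEN
            "Content-Type": "application/json",
        },
    )


request = build_dispatch_request(token="example-token", image_tag="v1.2.3")
# urllib.request.urlopen(request) would send it; deliberately not called here.
```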

## See also

- [Operations runbook — CI/CD deploy](../../how-to/operations/cicd-deploy.md) — task-oriented steps.
- [Terraform IaC control plane](./terraform-control-plane.md) — how `TerraformRuntime` executes.
- [IaC runbook](./iac-runbook.md) — provisioning recipes.
- [alphaswarm-admin](./services/alphaswarm-admin.md) — the admin service.
- [AWS Hybrid Deployment Guide](../../how-to/operations/aws-deploy.md) and [AWS Hybrid Operational Runbook](../../how-to/operations/aws-runbook.md) — bootstrap + incident playbooks.


<!-- https://alpha-swarm.ai/concepts/infrastructure/control-plane-topology -->
# Control-plane topology
> 1. Hardcoded default in `Settings`. 2. `ALPHASWARM_*` environment variable. 3. `alphaswarm_platform/configs/deployment/topology.yaml` fallback (this layer)

# Control-plane topology

Phase 0 of the AlphaSwarm infra-expansion plan. The single source of truth
for "what services exist, where do they live, what URLs do they
expose" is [`alphaswarm_platform/configs/deployment/topology.yaml`](../configs/deployment/topology.yaml).
Both the AlphaSwarm monolith (`alphaswarm/`) and the standalone control plane
(`alphaswarm_controller/`) read from the same YAML through the shared
loader at
[`alphaswarm_core.topology.load_topology`](../alphaswarm_core/src/alphaswarm_core/topology/loader.py).

## Resolution order

1. Hardcoded default in `Settings`.
2. `ALPHASWARM_*` environment variable.
3. `alphaswarm_platform/configs/deployment/topology.yaml` fallback (this layer).

The Phase 0 fallback fires ONLY when the matching `ALPHASWARM_*` env
var is unset (checked via `Settings.model_fields_set`). In effect, an
explicitly set env var always wins, the topology value comes next, and
the hardcoded default applies only when neither is present.
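
A minimal sketch of that precedence, using plain dictionaries rather than the real `Settings` model. The function name, the `redis_url` field, and the example URLs are hypothetical; the actual implementation inspects `Settings.model_fields_set` to decide whether a field was explicitly provided.

```python
def resolve_setting(field: str, hardcoded_default: str, env_values: dict, topology_urls: dict) -> str:
    """Explicit env value first, then the topology fallback, then the default."""
    if field in env_values:
        # The operator set the env var explicitly, so the fallback never fires.
        return env_values[field]
    fallback = topology_urls.get(field)
    return fallback if fallback else hardcoded_default


topology = {"redis_url": "redis://redis.alphaswarm-system:6379/0"}

from_env = resolve_setting("redis_url", "", {"redis_url": "redis://localhost:6379/0"}, topology)
from_topology = resolve_setting("redis_url", "", {}, topology)
from_default = resolve_setting("redis_url", "", {}, {})
```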

## URL fallback table

The mapping lives in
[`alphaswarm/config/topology_fallback.py::URL_FALLBACK_FIELDS`](../alphaswarm/config/topology_fallback.py).
Each row says: when topology declares a named entry under `endpoints`
on the service with the given id, use that URL as the
fallback for the matching `Settings` field. Adding a new service means a
new row in the table plus a new `services:` entry in `topology.yaml`.

## Control-plane routes

`alphaswarm_controller` exposes the topology over HTTP:

| Route | Purpose |
|---|---|
| `GET /manage/topology` | Full snapshot (services + targets). |
| `GET /manage/topology/services` | Filterable service list (?role=, ?cluster=). |
| `GET /manage/topology/services/{id}` | Single descriptor (matched by id or alias). |
| `GET /manage/topology/services/{id}/endpoint?name=` | Resolve a named URL. |
| `GET /manage/topology/services/{id}/health` | Live provider probe. |
| `GET /manage/topology/targets` | List deployment targets. |
| `POST /manage/topology/reload` | Drop the cache and reload from disk (admin:cluster). |

The frontend at [/admin/topology](../alphaswarm_client/src/routes/admin/topology/page.tsx)
renders the topology grouped by role with a "Probe health" button
per service.

## Adding a new shared service

1. Append a `services:` entry to
   [`alphaswarm_platform/configs/deployment/topology.yaml`](../configs/deployment/topology.yaml)
   with `cluster`, `namespace`, `protocols`, and `endpoints`
   populated.
2. Add the new `Settings` field in
   [`alphaswarm/config/settings.py`](../alphaswarm/config/settings.py) (default
   `""`).
3. Add a row to `URL_FALLBACK_FIELDS` mapping the new `Settings`
   field to the topology endpoint name.
4. Add the namespace to `services` under the relevant `targets`
   entry so the topology round-trips for that environment.
5. (Optional) Add a `/cache/` populator on the
   [`MetadataPrefetcher`](../alphaswarm/cache/prefetch.py) so the
   corresponding frontend component has dropdown data.


<!-- https://alpha-swarm.ai/concepts/infrastructure/iac-runbook -->
# IaC runbook
> "I want to provision X" recipes for the Terraform IaC control plane.

# IaC runbook

"I want to provision X" recipes for the Terraform IaC control plane.

## Quick reference

| Task                                       | Recipe                                                  |
| ------------------------------------------ | ------------------------------------------------------- |
| Stand up local AlphaSwarm on a laptop             | [Local environment](#local-environment)                 |
| Stand up AlphaSwarm on rpi_kubernetes             | [rpi Kubernetes environment](#rpi-kubernetes-environment) |
| Stand up paper-trading on GCP              | [Paper environment](#paper-environment)                 |
| Stand up production on AWS                 | [Live environment](#live-environment)                   |
| Stand up the seeded Wiley Tech home on Azure | [Wiley Tech environment](#wiley-tech-environment)     |
| Add a new module kind to the codegen       | [Add a module kind](#add-a-module-kind)                 |
| Add a Terraform stack via the API          | [Create a stack via API](#create-a-stack-via-api)       |
| Plan / apply / destroy from the UI         | [Lifecycle from the frontend](#lifecycle-from-the-frontend) |
| Configure HCP Terraform as state backend   | [HCP Terraform](#hcp-terraform)                         |
| Wire OPA policy enforcement                | [Policy enforcement](#policy-enforcement)               |

## Local environment

```bash
cd alphaswarm_platform/terraform/environments/local
terraform init
terraform plan
terraform apply
```

What this provisions:

- Postgres / MinIO / Redis containers via `kreuzwerker/docker`.
- Minikube / kind cluster + namespaces (`alphaswarm-local` / `alphaswarm-paper` /
  `alphaswarm-live` / `alphaswarm-backtest` / `alphaswarm-system` / `alphaswarm-terraform`).
- Helm baseline: cert-manager / ESO / KEDA / ingress-nginx /
  kube-prometheus / otel-operator / istio.
- KEDA `ScaledObject` per Celery queue (including the new
  `terraform` queue).
- Per-bot Deployment with `alphaswarm-data-mcp` sidecar (zero-egress
  NetworkPolicy on the agent container).
- Local Docker registry on `:5000`.

State is local (`alphaswarm_platform/terraform/environments/local/terraform.tfstate`).

## rpi Kubernetes environment

```bash
alphaswarm-cli deploy publish-rpi --registry ghcr.io/<owner> --tag <tag>
terraform -chdir=alphaswarm_platform/terraform/environments/rpi init
terraform -chdir=alphaswarm_platform/terraform/environments/rpi plan
terraform -chdir=alphaswarm_platform/terraform/environments/rpi apply
```

Recommended bootstrap sequence for first-time bring-up:

1. CLI-first Terraform apply until base services are healthy.
2. Verify API + Celery + Redis + Postgres are reachable.
3. Move to control-plane actions (`/control-plane/kubernetes/targets/rpi/*`).

This avoids enqueue/stream confusion during cold start when broker/DB
are still bootstrapping.

### Provider mirror + init retries

When provider downloads are unstable, define a Terraform CLI config file
with `provider_installation` mirror rules and point AlphaSwarm at it:

```bash
export ALPHASWARM_TERRAFORM_CLI_CONFIG_FILE=/absolute/path/to/terraform.tfrc
export ALPHASWARM_TERRAFORM_INIT_RETRY_ATTEMPTS=5
export ALPHASWARM_TERRAFORM_INIT_RETRY_BACKOFF_SECONDS=2
export ALPHASWARM_TERRAFORM_INIT_RETRY_MAX_BACKOFF_SECONDS=30
```

`TerraformExecutor` applies bounded retries for transient `terraform init`
failures and reuses `ALPHASWARM_TERRAFORM_PLUGIN_CACHE_DIR` between runs.

## Paper environment

```bash
cd alphaswarm_platform/terraform/environments/paper
export TF_VAR_gcp_project_id=
export TF_VAR_primary_domain=paper.alphaswarm.example
terraform init -backend-config="bucket=alphaswarm-terraform-state-paper"
terraform plan
terraform apply
```

What this provisions:

- GKE cluster (auto-promoted from `ALPHASWARM_DEFAULT_CLOUD_PROVIDER=gcp`).
- Cloud SQL Postgres (single AZ — cost-optimised for paper).
- GCS bucket + Memorystore Redis.
- GCP Secret Manager `ClusterSecretStore` (ESO).
- Bot Deployments with `dry_run=true` for paper trading.
- 100% traffic to the Vite frontend (no canary split in paper).

## Live environment

```bash
cd alphaswarm_platform/terraform/environments/live
export TF_VAR_aws_subnet_ids='["subnet-aaaa", "subnet-bbbb", "subnet-cccc"]'
export TF_VAR_primary_domain=app.wiley.tech
terraform init  # picks up backend.tf with S3 + DynamoDB locking
terraform plan
terraform apply
```

What this provisions:

- Multi-AZ EKS cluster.
- RDS Multi-AZ Postgres + S3 versioning + ElastiCache 7+ in cluster
  mode.
- AWS Secrets Manager `ClusterSecretStore`.
- Bot Deployments live (`dry_run=false`); `live_control=true` on
  the actor's `Membership` is required to trigger orders.
- Full prod sizing for KEDA `maxReplicaCount` (50 default / 100
  ML / 200 backtest / 30 agents / 10 terraform).
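
The per-queue `maxReplicaCount` ceilings above could be expressed as one KEDA `ScaledObject` per Celery queue. The sketch below is illustrative only: the queue names, the Deployment naming scheme, the Redis address, and the `listLength` threshold are assumptions, not values taken from the AlphaSwarm Terraform modules.

```python
"""Sketch: render one KEDA ScaledObject manifest per Celery queue."""

# Ceilings quoted in this section; the queue names are assumed labels for them.
MAX_REPLICAS = {"default": 50, "ml": 100, "backtest": 200, "agents": 30, "terraform": 10}


def scaled_object(queue: str, namespace: str = "alphaswarm-live",
                  redis_address: str = "redis-master:6379") -> dict:
    """Build a ScaledObject dict that scales a worker on Redis list length."""
    return {
        "apiVersion": "keda.sh/v1alpha1",
        "kind": "ScaledObject",
        "metadata": {"name": f"celery-{queue}", "namespace": namespace},
        "spec": {
            # Hypothetical Deployment name; adjust to the real worker Deployment.
            "scaleTargetRef": {"name": f"alphaswarm-worker-{queue}"},
            "minReplicaCount": 0,
            "maxReplicaCount": MAX_REPLICAS[queue],
            "triggers": [{
                "type": "redis",
                "metadata": {
                    "address": redis_address,
                    "listName": queue,   # assumes the queue is a Redis list named after it
                    "listLength": "10",  # assumed target backlog per replica
                },
            }],
        },
    }


manifests = [scaled_object(q) for q in MAX_REPLICAS]
```

Keeping the ceilings in one mapping makes it easy to compare them against the values quoted in this runbook when either side changes.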

## Wiley Tech environment

This is the seeded production home for the org provisioned by
Alembic 0051. It is pinned to the Wiley Tech Entra tenant.

```bash
cd alphaswarm_platform/terraform/environments/wiley-tech
export TF_VAR_azure_tenant_id=
export TF_VAR_azure_subscription_id=
export TF_VAR_azure_resource_group=alphaswarm-wiley-tech
export TF_VAR_azure_keyvault_url=https://alphaswarm-wiley-tech-kv.vault.azure.net/
terraform init  # picks up backend.tf with Azure Blob state
terraform plan
terraform apply
```

What this provisions:

- AKS cluster + Azure Workload Identity for ESO.
- Azure PostgreSQL Flexible Server (Zone-Redundant HA).
- ADLS Gen2 storage account (HNS enabled).
- Azure Cache for Redis (Standard, TLS-only).
- Azure Key Vault `ClusterSecretStore` synced via ESO Workload
  Identity.
- ACR registry for AlphaSwarm images.

## Add a module kind

1. Add the kind to `TERRAFORM_MODULE_KINDS` in
   [`alphaswarm/persistence/models_terraform.py`](../alphaswarm/persistence/models_terraform.py).
2. Create the Jinja2 template at
   `alphaswarm/terraform/codegen/templates/_.tf.j2` (and a
   `_local` fallback).
3. (Optional) Mirror as a native HCL module under
   `alphaswarm_platform/terraform/modules//`.
4. Operators create a stack via `POST /terraform/stacks` with
   `module_kind: ""`.

## Create a stack via API

```bash
curl -X POST http://localhost:8000/terraform/stacks \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer " \
  -d '{
    "name": "Bronze tier storage",
    "slug": "bronze-storage",
    "module_kind": "storage",
    "cloud_provider": "aws",
    "environment": "live",
    "variables": {
      "aws_region": "us-east-1",
      "aws_subnet_ids": ["subnet-aaa", "subnet-bbb", "subnet-ccc"],
      "bucket_name": "alphaswarm-bronze",
      "db_storage_gb": 500
    },
    "backend": { "kind": "s3", "config": { "bucket": "alphaswarm-tf-state", "key": "bronze-storage.tfstate" } },
    "tags": { "tier": "bronze" }
  }'
```

The response includes `spec_version_id` (immutable, hash-locked).
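
"Hash-locked" means a spec version is identified by a digest of its content, so any change to the spec yields a new version. This page does not document the exact scheme AlphaSwarm uses; the sketch below assumes SHA-256 over canonical JSON, and `spec_hash` is a hypothetical helper, not part of the AlphaSwarm API.

```python
import hashlib
import json


def spec_hash(spec: dict) -> str:
    """Return a SHA-256 hex digest of the spec's canonical JSON form.

    Sorting keys and fixing separators makes the digest independent of
    key order and whitespace in the submitted payload.
    """
    canonical = json.dumps(spec, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


spec = {"slug": "bronze-storage", "module_kind": "storage",
        "variables": {"aws_region": "us-east-1", "db_storage_gb": 500}}
# Same content, different key order: the digest should not change.
reordered = {"variables": {"db_storage_gb": 500, "aws_region": "us-east-1"},
             "module_kind": "storage", "slug": "bronze-storage"}
# One variable changed: the digest should differ.
changed = {**spec, "variables": {**spec["variables"], "db_storage_gb": 501}}
```

Under this assumption, resubmitting an identical spec maps to the same version, while any edit to `variables` produces a new one.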

Then create a workspace + plan:

```bash
# Workspace
curl -X POST http://localhost:8000/terraform/workspaces \
  -H "Content-Type: application/json" -H "Authorization: Bearer " \
  -d '{ "slug": "bronze-live", "name": "Bronze (live)", "stack_spec_id": "", "environment": "live", "state_backend": "s3" }'

# Plan
curl -X POST http://localhost:8000/terraform/workspaces//plan \
  -H "Authorization: Bearer "
```

Subscribe to live progress at `wss:///terraform/ws/runs/`.

## Lifecycle from the frontend

Navigate to `/infra/terraform` and click a workspace row to land on
`/infra/terraform/workspaces/[id]`:

1. Click **Plan** → enqueues plan task; result lands in
   `awaiting_approval`.
2. Review the plan summary on the run detail page (live WS stream).
3. Click **Apply this plan** on the plan run row.
4. Apply executes → state version snapshotted → outputs visible in
   the "Latest state outputs" card.
5. **Destroy** is friction-gated: type the workspace slug to confirm.
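
The approval and confirmation gates in the steps above can be sketched as a small state machine. This is a minimal illustration, not the runtime's implementation: only `awaiting_approval` comes from this page, while the `Run` class, the other status names, and the helper functions are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Run:
    kind: str               # "plan", "apply" or "destroy"
    status: str = "queued"  # assumed initial status name


def finish_plan(run: Run) -> Run:
    """A finished plan waits for an operator instead of applying itself."""
    if run.kind != "plan":
        raise ValueError("only plan runs await approval")
    run.status = "awaiting_approval"
    return run


def approve_apply(plan: Run) -> Run:
    """Create the apply run for a plan that is awaiting approval."""
    if plan.status != "awaiting_approval":
        raise ValueError(f"plan is {plan.status!r}, not awaiting approval")
    plan.status = "approved"  # assumed status name
    return Run(kind="apply")


def confirm_destroy(workspace_slug: str, typed: str) -> Run:
    """Friction gate: the operator must type the workspace slug exactly."""
    if typed != workspace_slug:
        raise PermissionError("typed text does not match the workspace slug")
    return Run(kind="destroy")
```

The point of the exact-match check is that a destroy cannot be triggered by a stray click; the operator has to name the workspace being destroyed.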

## HCP Terraform

1. Create an HCP Terraform organization + workspaces in the HCP UI.
2. Set `ALPHASWARM_HCP_TOKEN` (preferred: via `CredentialResolver`),
   `ALPHASWARM_HCP_ORGANIZATION`, `ALPHASWARM_TERRAFORM_STATE_BACKEND=hcp`.
3. Set the stack spec's `backend.kind="hcp"` and the workspace's
   `hcp_workspace_id`.
4. The runtime now drives runs through
   [`HcpClient`](../alphaswarm/terraform/hcp_client.py) instead of the local
   subprocess (no `terraform` binary required on the runner pod).

## Policy enforcement

1. Author OPA Rego policies that target Terraform plan JSON
   (the runtime emits `tfplan.binary.json` via `terraform show -json`).
2. Insert a `TerraformPolicyAttachment` row binding the policy file
   URI to a workspace.
3. Set `hard_mandatory=True` to block apply on violation;
   `hard_mandatory=False` emits a warning.
4. When `opa` is on PATH the runtime invokes
   `opa eval -i tfplan.json -d policy.rego "data.alphaswarm.terraform.deny"`.
   Without OPA installed the check no-ops cleanly.
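
The enforcement steps above can be sketched as follows. This is a hedged illustration rather than the runtime's code: `parse_denials`, `decide`, and `check_policy` are hypothetical names, and the parser assumes the JSON shape that `opa eval --format json` prints (`result[0].expressions[0].value`).

```python
import json
import shutil
import subprocess
from typing import Callable, Optional


def parse_denials(opa_stdout: str) -> list:
    """Extract the value of the queried rule from `opa eval` JSON output."""
    result = json.loads(opa_stdout).get("result", [])
    if not result:
        return []  # undefined query result: no violations reported
    return list(result[0]["expressions"][0]["value"])


def decide(denials: list, hard_mandatory: bool) -> str:
    """Map violations to an outcome: pass, warn, or block."""
    if not denials:
        return "pass"
    return "block" if hard_mandatory else "warn"


def check_policy(plan_json: str, policy: str, hard_mandatory: bool,
                 which: Callable[[str], Optional[str]] = shutil.which) -> str:
    """Run OPA against a plan; skip cleanly when the binary is missing."""
    opa = which("opa")
    if opa is None:
        return "skipped"  # mirrors the documented no-op without OPA
    proc = subprocess.run(
        [opa, "eval", "--format", "json", "-i", plan_json, "-d", policy,
         "data.alphaswarm.terraform.deny"],
        capture_output=True, text=True, check=True,
    )
    return decide(parse_denials(proc.stdout), hard_mandatory)
```

Passing `which` as a parameter keeps the "OPA not installed" branch testable without touching the real `PATH`.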


<!-- https://alpha-swarm.ai/concepts/infrastructure/kubernetes-adapter -->
# Kubernetes adapter
> AlphaSwarm wraps every cluster-side operation in a pluggable `KubernetesAdapter`, so attaching to rpi_kubernetes is optional and the platform runs standalone with `NoneAdapter`.

# Kubernetes adapter

AlphaSwarm wraps every cluster-side operation in a pluggable
`alphaswarm.kubernetes.KubernetesAdapter`. The abstraction makes the
rpi_kubernetes attach optional: AlphaSwarm works fully standalone with
`NoneAdapter`, attaches to the rpi management API with
`RpiClusterAdapter`, talks to a Kubernetes API directly with
`InClusterAdapter`, or treats the local Docker Compose stack as the
cluster surface with `LocalComposeAdapter`.

## Architecture

```mermaid
flowchart TB
    Routes["alphaswarm/api/routes<br/>/cluster, /streaming/kafka, /streaming/flink"]
    Producers[ProducerSupervisor]
    FinOps["finops_tasks.audit<br/>(grandfathered direct path)"]

    subgraph adapters [alphaswarm.kubernetes]
        ABC[KubernetesAdapter ABC]
        None[NoneAdapter]
        Rpi[RpiClusterAdapter]
        InCluster[InClusterAdapter]
        LocalCompose[LocalComposeAdapter]
    end

    None --> ABC
    Rpi --> ABC
    InCluster --> ABC
    LocalCompose --> ABC
    Routes --> ABC
    Producers --> ABC

    Rpi --> RpiClient["alphaswarm/services/cluster_mgmt_client<br/>(rpi management HTTP)"]
    InCluster --> K8sSDK[kubernetes-client SDK]
    LocalCompose --> Docker[docker compose]
```

`get_kubernetes_adapter()` returns the active adapter based on:

1. Explicit `settings.kubernetes_adapter` (`none` / `rpi_cluster` /
   `in_cluster` / `local_compose`).
2. Auto-promote: empty kind + `cluster_mgmt_url` set → `rpi_cluster`.
3. Default: `none`.
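
The resolution order above can be sketched as a pure function. `resolve_adapter_kind` is a hypothetical helper written for illustration; the real `get_kubernetes_adapter()` reads these values from settings and returns an adapter instance rather than a string.

```python
from typing import Optional

VALID_KINDS = {"none", "rpi_cluster", "in_cluster", "local_compose"}


def resolve_adapter_kind(kind: Optional[str], cluster_mgmt_url: Optional[str]) -> str:
    """Apply the three resolution rules in order."""
    kind = (kind or "").strip()
    if kind:                   # 1. an explicit setting wins
        if kind not in VALID_KINDS:
            raise ValueError(f"unknown kubernetes_adapter: {kind!r}")
        return kind
    if cluster_mgmt_url:       # 2. auto-promote to the rpi adapter
        return "rpi_cluster"
    return "none"              # 3. standalone default
```

Note that an explicit `none` is respected even when `cluster_mgmt_url` is set; auto-promotion only applies when the kind is empty.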

Failures during a call surface as
`KubernetesAdapterUnavailable` (routes return 503) or
`KubernetesAdapterError` (routes return 502). Adapters opt out
of unsupported methods by raising
`KubernetesAdapterUnavailable`.

## Adapter capabilities

See [`.cursor/rules/kubernetes-adapter.mdc`](../.cursor/rules/kubernetes-adapter.mdc)
for the per-method matrix. Today every adapter implements
`is_available()`; `RpiClusterAdapter` covers the full Kafka / Flink /
AlphaVantage / scale_deployment surface; `InClusterAdapter` covers
`scale_deployment` / `pod_logs` / `apply_manifest`; `LocalComposeAdapter`
covers `scale_deployment` / `pod_logs`.

The `/cluster` REST surface is the adapter's primary consumer;
`/cluster-mgmt` is kept as a backwards-compatible alias.

## Test patterns

`tests/kubernetes/test_adapter.py` covers:

- The metaclass registers every concrete adapter under
  `"k8s_adapter"` in the AlphaSwarm registry.
- `NoneAdapter.is_available()` is `False`; every op raises.
- `RpiClusterAdapter` forwards to the wrapped client and translates
  `ClusterMgmtError` → `KubernetesAdapterError`.
- `InClusterAdapter` reports unavailable when the `kubernetes`
  package isn't installed (CI default).
- `register_adapter(...)` / `reset_kubernetes_adapter()` give tests
  clean fixtures.
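
For readers unfamiliar with metaclass-based registration, the first two test cases can be pictured with the stand-in below. It is a sketch under assumptions: `REGISTRY`, `RegisteringMeta`, and the `Adapter` base are hypothetical substitutes for the AlphaSwarm registry and `KubernetesAdapter`, whose real implementation is not shown on this page.

```python
from abc import ABCMeta, abstractmethod

REGISTRY: dict = {}  # stand-in for the AlphaSwarm registry


class RegisteringMeta(ABCMeta):
    """Record each concrete subclass under the "k8s_adapter" key."""

    def __init__(cls, name, bases, namespace):
        super().__init__(name, bases, namespace)
        if not cls.__abstractmethods__:  # skip classes that are still abstract
            REGISTRY.setdefault("k8s_adapter", {})[name] = cls


class Adapter(metaclass=RegisteringMeta):
    @abstractmethod
    def is_available(self) -> bool: ...


class NoneAdapter(Adapter):
    def is_available(self) -> bool:
        return False


def test_none_adapter_is_registered_and_unavailable() -> None:
    assert REGISTRY["k8s_adapter"]["NoneAdapter"] is NoneAdapter
    assert "Adapter" not in REGISTRY["k8s_adapter"]
    assert NoneAdapter().is_available() is False
```

Registration happens at class-definition time, so a test only needs to import the adapter module before inspecting the registry.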

## Adding capabilities

When you need a new cluster op (say `list_namespaces`):

1. Add a method whose default body raises
   `KubernetesAdapterUnavailable` to the ABC in
   [`alphaswarm/kubernetes/protocol.py`](../alphaswarm/kubernetes/protocol.py).
2. Implement it in each adapter that can support it.
3. Add a route in
   [`alphaswarm/api/routes/cluster_mgmt.py`](../alphaswarm/api/routes/cluster_mgmt.py)
   that calls the adapter.
4. Adapters that can't service the op leave the default; routes
   catch `KubernetesAdapterUnavailable` and translate to 503.
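
A minimal sketch of those four steps for `list_namespaces`, using stand-in classes so it runs on its own. The exception and adapter class names mirror the ones on this page, but the bodies, `FakeInClusterAdapter`, and the `list_namespaces_route` function (which returns a status/body tuple instead of a FastAPI response) are assumptions for illustration.

```python
class KubernetesAdapterUnavailable(Exception):
    """The adapter is absent or does not support the operation."""


class KubernetesAdapterError(Exception):
    """The adapter tried the operation and the backend failed."""


class KubernetesAdapter:
    def list_namespaces(self) -> list:
        # Default body: adapters that cannot serve the op inherit this.
        raise KubernetesAdapterUnavailable("list_namespaces not supported")


class NoneAdapter(KubernetesAdapter):
    pass  # keeps the default


class FakeInClusterAdapter(KubernetesAdapter):
    def list_namespaces(self) -> list:
        return ["alphaswarm-local", "alphaswarm-system"]


def list_namespaces_route(adapter: KubernetesAdapter) -> tuple:
    """Return (status_code, body) the way a route handler might."""
    try:
        return 200, {"namespaces": adapter.list_namespaces()}
    except KubernetesAdapterUnavailable as exc:
        return 503, {"detail": str(exc)}
    except KubernetesAdapterError as exc:
        return 502, {"detail": str(exc)}
```

Because the default lives on the base class, adding the op does not require touching adapters that cannot support it.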

## Migrating finops_tasks

The FinOps audit (`alphaswarm/tasks/finops_tasks.py`) currently uses the
`kubernetes` SDK directly because it needs list APIs (`list_pod_for_all_namespaces`,
etc.) that the adapter doesn't yet expose. Adding those list methods
to the adapter ABC + the in-cluster implementation is the migration
target — until then, the direct path is grandfathered by the
[`.cursor/rules/kubernetes-adapter.mdc`](../.cursor/rules/kubernetes-adapter.mdc)
rule.


<!-- https://alpha-swarm.ai/concepts/infrastructure/kubernetes-rpi-deployment -->
# rpi Kubernetes Deployment
> AlphaSwarm deploys to the `rpi_kubernetes` cluster through the sanctioned Terraform runtime path; this page covers prerequisites, configuration, deployment, cold-start order, and rollback.

# rpi Kubernetes Deployment

AlphaSwarm deploys to the `rpi_kubernetes` cluster through the sanctioned
Terraform runtime path. The source-of-truth HCL lives in
`alphaswarm_platform/terraform/environments/rpi`, and the stack spec is
`alphaswarm_platform/configs/terraform/rpi.yaml`.

## Prerequisites

- A kubeconfig that can reach the rpi cluster.
- A registry reachable by every rpi node.
- Immutable AlphaSwarm image tag published with:

```bash
alphaswarm-cli deploy publish-rpi --registry docker.io/ --tag 
```

## Configure

Edit or override `alphaswarm_platform/terraform/environments/rpi/terraform.tfvars`:

```hcl
rpi_kubeconfig_path = "~/.kube/config"
rpi_kube_context    = "rpi"
rpi_namespace       = "alphaswarm"
rpi_image_registry  = "docker.io/"
app_version         = ""
rpi_ingress_host    = "alphaswarm.example.com"
auth0_domain        = "example.us.auth0.com"
auth0_audience      = "https://alphaswarm/api"
auth0_client_id     = ""
```

## Deploy

Use the AlphaSwarm control plane or Terraform directly:

```bash
terraform -chdir=alphaswarm_platform/terraform/environments/rpi init
terraform -chdir=alphaswarm_platform/terraform/environments/rpi plan
terraform -chdir=alphaswarm_platform/terraform/environments/rpi apply
```

The backend control-plane routes dispatch the same stack through
`alphaswarm.tasks.terraform_tasks.run_rpi_stack`, preserving `terraform_runs`
ledger rows and progress streams.

## Cold-start order

For first-time bootstrap on a new machine, run in this order so each
dependency exists before the next one:

1. Build and push immutable AlphaSwarm images (`alphaswarm-cli deploy publish-rpi ...`).
2. Set image tags and Auth0 values in `alphaswarm_platform/terraform/environments/rpi/terraform.tfvars`.
3. Run Terraform from CLI (`init`, `plan`, `apply`) until the core stack
   is healthy.
4. Start/verify API + Celery + Redis + Postgres.
5. Use `/control-plane/kubernetes/targets/rpi/*` for ongoing operations.

Why this order matters:

- Terraform subprocess execution itself only needs Terraform + kubeconfig.
- Control-plane-triggered runs additionally need Celery broker/worker.
- Run history and richer status views depend on Postgres/Redis availability.
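
Before switching from CLI runs to control-plane actions, it can help to confirm that the dependencies accept connections. The sketch below is one way to do that with plain TCP probes; the host names and the `DEPENDENCIES` mapping are assumptions (the ports are the ones listed in the service catalogue), and an open port shows only that something is listening, not that a Celery worker is consuming tasks.

```python
import socket
import time


def wait_for_port(host: str, port: int, timeout: float = 30.0, interval: float = 1.0) -> bool:
    """Return True once a TCP connection succeeds, False after `timeout` seconds."""
    deadline = time.monotonic() + timeout
    while True:
        try:
            with socket.create_connection((host, port), timeout=interval):
                return True
        except OSError:
            if time.monotonic() >= deadline:
                return False
            time.sleep(interval)


# Assumed local endpoints; replace with the addresses your stack exposes.
DEPENDENCIES = {
    "postgres": ("localhost", 5432),
    "redis": ("localhost", 6379),
    "api": ("localhost", 8000),
}


def unreachable(deps: dict = DEPENDENCIES, timeout: float = 30.0) -> list:
    """List the dependencies that did not accept a connection in time."""
    return [name for name, (host, port) in deps.items()
            if not wait_for_port(host, port, timeout=timeout)]
```

An empty list from `unreachable()` is a reasonable signal that step 5 can proceed.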

## Provider download resilience (flaky network / IPv6 issues)

When `terraform init` intermittently fails to download providers, use a
Terraform CLI config file with `provider_installation` mirrors and point
the runtime at it with `ALPHASWARM_TERRAFORM_CLI_CONFIG_FILE`.

Example `terraform.tfrc`:

```hcl
provider_installation {
  filesystem_mirror {
    path    = "C:/terraform/provider-mirror"
    include = ["hashicorp/*", "kreuzwerker/*", "auth0/*"]
  }
  direct {
    exclude = ["hashicorp/*", "kreuzwerker/*", "auth0/*"]
  }
}
```

Then set:

```bash
export ALPHASWARM_TERRAFORM_CLI_CONFIG_FILE=/absolute/path/to/terraform.tfrc
```

The runtime also retries transient `terraform init` network/provider
failures with bounded exponential backoff. Tune with:

- `ALPHASWARM_TERRAFORM_INIT_RETRY_ATTEMPTS`
- `ALPHASWARM_TERRAFORM_INIT_RETRY_BACKOFF_SECONDS`
- `ALPHASWARM_TERRAFORM_INIT_RETRY_MAX_BACKOFF_SECONDS`
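
To make the effect of the three settings concrete, here is a minimal sketch of bounded exponential backoff. It assumes the delay doubles from the base value and is capped at the maximum; the function names are hypothetical, and unlike the runtime (which retries transient failures only) this sketch retries on any non-zero exit code.

```python
import subprocess
import time
from typing import Callable, List


def backoff_delays(attempts: int, base: float, cap: float) -> List[float]:
    """Delays slept between attempts: base, 2*base, 4*base, ... capped at `cap`."""
    return [min(cap, base * 2 ** i) for i in range(max(attempts - 1, 0))]


def init_with_retries(run_init: Callable[[], int], attempts: int = 5,
                      base: float = 2.0, cap: float = 30.0,
                      sleep: Callable[[float], None] = time.sleep) -> int:
    """Call `run_init` until it returns 0 or the attempts are exhausted."""
    code = 1
    delays = backoff_delays(attempts, base, cap)
    for attempt in range(attempts):
        code = run_init()
        if code == 0:
            return 0
        if attempt < len(delays):  # no sleep after the final attempt
            sleep(delays[attempt])
    return code


def terraform_init(chdir: str) -> int:
    """Run `terraform init` in `chdir` and return its exit code."""
    return subprocess.run(["terraform", f"-chdir={chdir}", "init", "-input=false"]).returncode
```

With 5 attempts, a 2-second base, and a 30-second cap, the waits under this assumption are 2, 4, 8, and 16 seconds; the cap only takes effect from the sixth attempt onward.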

## Rollback

Re-apply the previous immutable image tag, or tear the stack down with:

```bash
terraform -chdir=alphaswarm_platform/terraform/environments/rpi destroy
```

Long-running Terraform jobs remain halt-able through `/terraform/halt`
and the global frontend kill switch.


<!-- https://alpha-swarm.ai/concepts/infrastructure/services -->
# Service-level view
> Catalogue of every AlphaSwarm service: container image, port, health probe, deployment surfaces (Compose / Kustomize / AQP CR / Terraform template), upstream and downstream dependencies, and the canonical doc that owns each contract.

# Service-level view

This page catalogues every service AlphaSwarm runs — the application
workloads, the control plane, the data layer, the observability stack,
and the external edge surface — at a single level of detail. It pairs
[`control-plane-topology.md`](control-plane-topology.md) (which says
*how* services are discovered) and
[`terraform-control-plane.md`](terraform-control-plane.md) (which says
*how* they are provisioned) with a *what is each service* reference.

The single source of truth for the service registry is
[`alphaswarm_platform/configs/deployment/topology.yaml`](../../../../alphaswarm_platform/configs/deployment/topology.yaml).
This page is generated against that file plus each service's matching
package contract. When a row drifts, the truth is the YAML.

## Reading the catalogue

Every service has its own detail page under
[`services/`](services/) with the same layout:

- **Identity** — id, role, label, package or upstream image.
- **Wire** — protocol, port, health endpoint, public URL (if any).
- **Deployment** — which compose / kustomize / AQP CR / Terraform
  template stands it up.
- **Dependencies** — upstream services it calls, downstream services
  that call it.
- **Operations** — runbooks, scaling notes, redaction posture,
  feature flags.

Detail pages link back to the canonical concept doc that owns each
contract — they do not duplicate prose.

## How services compose

```
                    ┌─ alphaswarm-website ──────────┐ public marketing
                    │  (Cloudflare Pages, no auth)  │
                    └───────────────────────────────┘
                                  │
                                  ▼ NEXT_PUBLIC_ALPHASWARM_APP_URL
   B2C / B2B users  ─▶  alphaswarm-ui  ──┐
   Internal staff   ─▶  alphaswarm-admin ┼──▶  alphaswarm-cp  ──▶  /manage/* control plane
   Local power user ─▶  alphaswarm-client┤                   ──▶  /auth/*    identity broker
   Operators (CLI)  ─▶  alphaswarm-cli   ┤                   ──▶  /proxy/*   connection mesh (Phase 5)
                                         │
                                         ▼ HTTP
                                   alphaswarm-core (FastAPI)
                                         │
                  ┌──────────────────────┼──────────────────────┐
                  ▼                      ▼                      ▼
       alphaswarm-worker  alphaswarm-executor  alphaswarm-beat   alphaswarm-ml-mcp
       (light queues)     (heavy compute)      (scheduler)       (DataMCP /mcp/ml)

   Data plane:    postgres ─ redis ─ neo4j ─ chromadb ─ minio ─ iceberg(Polaris)
   Streaming:     kafka(Strimzi) | redpanda ─ schema-registry ─ flink ─ redpanda-connect
   ML / orch:     mlflow ─ argo-workflows ─ argo-events ─ bentoml ─ kserve ─ dagster ─ ragflow
   Observability: otel-collector ─ prometheus ─ grafana ─ jaeger ─ loki ─ vector ─ victoriametrics ─ phoenix
   Mesh ID:       spire (issuer) ─▶ linkerd (mTLS) ─▶ vault-secrets-operator ─▶ pomerium (IAP)
   Edge:          cloudflared (alpha-swarm.ai) | cloudflared-aqp-green | alphaswarm-edge | tenant-router
   Sandbox:       agent-sandbox/gvisor ─▶ agent-sandbox/pool
   Operators:     aqp-controller-operator (8 AQP* CRDs) ─ bots-operator (4 QuantBot CRDs)
   External:      alphaswarm-docs (Cloudflare Pages) ─ alphaswarm-docs-status (Instatus) ─ alphaswarm-docs-archive
```

Identity flows from `spire` through `linkerd` through
`vault-secrets-operator` to every workload pod; secrets land via
`ExternalSecret` resources, never in `values.yaml`. The
`pomerium` IAP wraps the bare `/manage/*` ingress.

## Application services

Services that run AlphaSwarm code. Each is built from a Dockerfile in
this workspace and is owned by the package that supplies its image.

| Service id | Role | Pkg | Image (key) | Port | Health | Public URL | Deployed via |
| --- | --- | --- | --- | --- | --- | --- | --- |
| [`alphaswarm-core`](services/alphaswarm-core.md) | api | `alphaswarm` | `api` | 8000 | `/readyz` | — (private) | base/alphaswarm-core, AQPMonolith CR, compose `api` |
| [`alphaswarm-worker`](services/alphaswarm-worker.md) | worker | `alphaswarm` | `worker` | — | (none) | — | base/alphaswarm-worker, AQPMonolith CR, compose `worker` |
| [`alphaswarm-executor`](services/alphaswarm-executor.md) | executor | `alphaswarm` | `executor` | — | (none) | — | base/alphaswarm-executor, compose `alphaswarm-executor`/`worker-gpu` |
| [`alphaswarm-beat`](services/alphaswarm-beat.md) | scheduler | `alphaswarm` | `beat` | — | (none) | — | base/alphaswarm-worker, AQPMonolith CR, compose `beat` |
| [`alphaswarm-cp`](services/alphaswarm-cp.md) | control-plane | `alphaswarm_controller` | `cp` | 9000 | `/manage/readyz` | `https://manage.alpha-swarm.ai` | base/alphaswarm-cp, compose `alphaswarm-cp` |
| [`alphaswarm-client`](services/alphaswarm-client.md) | frontend | `alphaswarm_client` | `frontend` | 80 | `/` | — (private) | base/alphaswarm-client, AQPClient CR, compose `client` |
| [`alphaswarm-ui`](services/alphaswarm-ui.md) | frontend | `alphaswarm_ui` | `ui` | 80 | `/api/healthz` | `https://app.alpha-swarm.ai` | (Vercel/Pages) AQPUI CR |
| [`alphaswarm-admin`](services/alphaswarm-admin.md) | admin | `alphaswarm_admin` | `admin` | 8900 | `/admin/healthz` | `https://admin.alpha-swarm.ai` | AQPAdmin CR, compose `alphaswarm-admin` |
| [`alphaswarm-ide`](services/alphaswarm-ide.md) | ide | `alphaswarm_ide` | `ide` | 3000 | `/` | (per-user) | alphaswarm-ide kustomize, AQPIDE CR |
| [`alphaswarm-ml-mcp`](services/alphaswarm-ml-mcp.md) | mcp | `alphaswarm_models` | (piggybacks on `api`) | 8000 | `/mcp/ml/tools` | — | base/alphaswarm-core (extra route) |

## Data layer

Stateful services owned by the platform — the AlphaSwarm runtime is a
client of every row below.

| Service id | Role | Image | Port | Storage | Deployed via |
| --- | --- | --- | --- | --- | --- |
| [`postgres`](services/postgres.md) | database | `pgvector/pgvector:pg16` | 5432 | 5 Gi (StatefulSet) | base-services/postgres-shared |
| [`redis`](services/redis.md) | cache | `redis:7-alpine` (master) / `redis-stack:7.4` (local) | 6379 | 2 Gi | base/redis-master, base-services/redis-shared |
| [`neo4j`](services/neo4j.md) | graph | `neo4j:5-community` | 7474, 7687 | 5 Gi | base-services (cell-local), compose `neo4j` |
| [`chromadb`](services/chromadb.md) | vector | `chromadb/chroma:1.0.16` | 8000 / 8001 | (ephemeral) | base-services/chromadb, compose `chromadb` |
| [`mlflow`](services/mlflow.md) | mlops | `ghcr.io/mlflow/mlflow:v2.11.1` | 5000 | object store | base-services/mlflow, compose `mlflow` |

Object storage and the Iceberg catalog (MinIO + Polaris) live
under the streaming/lakehouse umbrella; they are documented under
`base-services/minio` and `base-services/polaris` in
[deployment patterns by category](#deployment-patterns).

## Observability

Routed by `otel-collector-gateway`; metrics in VictoriaMetrics + Prometheus
(parallel during cutover), logs in Loki, traces in Jaeger, and the AI / LLM
slice in Phoenix.

| Service id | Role | Image | Port | Deployed via |
| --- | --- | --- | --- | --- |
| [`otel-collector`](services/otel-collector.md) | observability | `otel/opentelemetry-collector` | 4317 | observability/opentelemetry-collector-gateway |
| [`prometheus`](services/prometheus.md) | metrics | `prom/prometheus` (kube-prometheus-stack) | 9090 | observability/kube-prometheus-stack |
| [`grafana`](services/grafana.md) | dashboards | `grafana/grafana` | 3000 | observability/kube-prometheus-stack |
| [`jaeger`](services/jaeger.md) | tracing | `jaegertracing/all-in-one` | 6831 / 16686 | observability/jaeger |
| [`loki`](services/loki.md) | logs | `grafana/loki:3.3.2` | 3100 | observability/loki |
| [`vector`](services/vector.md) | log shipper | `timberio/vector:0.43.0` | — | observability/vector |
| [`victoriametrics`](services/victoriametrics.md) | metrics | `victoriametrics/victoria-metrics:v1.108.0` | 8428 | observability/victoriametrics |

Phoenix + the OTel operator are documented inline on
[`otel-collector`](services/otel-collector.md) since they are part of the
same telemetry pipeline.

## External services

Hosted off-cluster — included here because the topology references them
and operators need to know who runs them.

| Service id | Role | Hosted on | Public URL | Deployed via |
| --- | --- | --- | --- | --- |
| [`alphaswarm-docs`](services/alphaswarm-docs.md) | docs | Cloudflare Pages | `https://docs.alpha-swarm.ai` | Terraform module `cloudflare_pages_docs` |
| [`alphaswarm-website`](services/alphaswarm-website.md) | marketing | Cloudflare Pages | `https://alpha-swarm.ai` | Terraform module `cloudflare_pages_docs` (forthcoming) |
| [`alphaswarm-docs-status`](services/alphaswarm-docs-status.md) | status page | Instatus SaaS | `https://status.alpha-swarm.ai` | Terraform module `instatus` |
| [`alphaswarm-docs-archive`](services/alphaswarm-docs-archive.md) | archive | Cloudflare Pages | `https://archive.alpha-swarm.ai` | Terraform module `cloudflare_pages_docs` |

## Deployment patterns

Every service above is deployable through one or more of the surfaces
below. The
[deployment-templates catalogue](../../../../alphaswarm_platform/configs/terraform/templates/README.md)
maps each named pattern to a hash-locked
[`TerraformStackSpec`](terraform-control-plane.md#terraformstackspec).

| Pattern | What it stands up | Template slug | Source |
| --- | --- | --- | --- |
| **Local dev** | k3d cluster + base + minimal observability | `local-dev` | [templates/local-dev.yaml](../../../../alphaswarm_platform/configs/terraform/templates/local-dev.yaml) |
| **k3d + MLOps** | local-dev + Argo Workflows + Dagster + MLflow | `k3d-with-mlops` | [templates/k3d-with-mlops.yaml](../../../../alphaswarm_platform/configs/terraform/templates/k3d-with-mlops.yaml) |
| **AWS minimum** | Single-account ECS + Cognito + ALB + Bedrock Haiku | `aws-minimum` | [templates/aws-minimum.yaml](../../../../alphaswarm_platform/configs/terraform/templates/aws-minimum.yaml) |
| **AWS shared cell** | EKS + base + base-services + observability + edge for one shared standard cell | `aws-cell-shared-std` | [templates/aws-cell-shared-std.yaml](../../../../alphaswarm_platform/configs/terraform/templates/aws-cell-shared-std.yaml) |
| **AWS shared cell (premium)** | shared-std + dedicated node group + reserved capacity | `aws-cell-shared-premium` | [templates/aws-cell-shared-premium.yaml](../../../../alphaswarm_platform/configs/terraform/templates/aws-cell-shared-premium.yaml) |
| **AWS silo tenant** | Single-tenant cell with hard isolation | `aws-silo-tenant` | [templates/aws-silo-tenant.yaml](../../../../alphaswarm_platform/configs/terraform/templates/aws-silo-tenant.yaml) |
| **GCP cell** | GKE + Workload Identity + base + base-services | `gcp-full-cell` | [templates/gcp-full-cell.yaml](../../../../alphaswarm_platform/configs/terraform/templates/gcp-full-cell.yaml) |
| **Azure cell** | AKS + Workload Identity + Entra-bound base | `azure-full-cell` | [templates/azure-full-cell.yaml](../../../../alphaswarm_platform/configs/terraform/templates/azure-full-cell.yaml) |
| **rpi cluster** | k3s on ARM64 | `rpi-cluster` | [templates/rpi-cluster.yaml](../../../../alphaswarm_platform/configs/terraform/templates/rpi-cluster.yaml) |
| **Edge only** | Cloudflare tunnels + Access apps + cloudflared-aqp-green | `edge-only` | [templates/edge-only.yaml](../../../../alphaswarm_platform/configs/terraform/templates/edge-only.yaml) |
| **Observability only** | OTel + Prometheus + Loki + Jaeger + Phoenix + VictoriaMetrics | `observability-only` | [templates/observability-only.yaml](../../../../alphaswarm_platform/configs/terraform/templates/observability-only.yaml) |
| **MLOps only** | Argo Workflows + Argo Events + BentoML + KServe + Dagster | `mlops-only` | [templates/mlops-only.yaml](../../../../alphaswarm_platform/configs/terraform/templates/mlops-only.yaml) |

Templates are discovered by
[`alphaswarm.terraform.templates`](../../../../alphaswarm/alphaswarm/terraform/templates.py)
and surfaced through:

- `GET /terraform/templates` and
  `POST /terraform/stacks/from-template/{slug}` (REST).
- `alphaswarm-cli deploy templates {list,describe,apply}` (CLI).
- `data.terraform.templates.list_templates` and
  `data.terraform.templates.instantiate_template` (MCP, used by the
  agentic plane).

Every instantiation flows through `TerraformRuntime` so the apply
lands a `terraform_runs` ledger row + spec snapshot per AGENTS rule
42 / 43.

## Building blocks (Jinja2 codegen)

The codegen layer at
[`alphaswarm/terraform/codegen/templates/`](../../../../alphaswarm/alphaswarm/terraform/codegen/templates/)
ships per-module-kind Jinja2 templates. The standard-template catalogue
adds five composite building blocks so users can compose their own
stacks against typed inputs:

| Building block | Renders | Used by |
| --- | --- | --- |
| `cell.tf.j2` | One cell — namespaces + base workloads + per-cell ingress + RBAC | `aws-cell-shared-std`, `aws-silo-tenant`, `gcp-full-cell`, `azure-full-cell` |
| `observability_stack.tf.j2` | Full OTel + Prom + Loki + Jaeger + Phoenix + VictoriaMetrics overlay | `observability-only`, every cell template |
| `mesh_identity.tf.j2` | spire → linkerd → vault-secrets-operator → pomerium chain | every cell template |
| `mlops_stack.tf.j2` | Argo Workflows + Events + BentoML + KServe + Dagster | `mlops-only`, `k3d-with-mlops` |
| `edge_stack.tf.j2` | cloudflared + access apps + tenant-router | `edge-only`, every public-facing cell template |

These are referenced from `TerraformStackSpec.modules[].source` with
the `tpl://` scheme — see
[the IaC runbook](iac-runbook.md#shipping-a-standard-template) for the
operator workflow.
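
As a rough illustration of how a `tpl://` source might map to a building block, the sketch below parses the URI and looks up the template file. The URI shape (`tpl://<block-name>`) and the `resolve_template` helper are assumptions made for this example; the IaC runbook remains the reference for the actual syntax.

```python
from urllib.parse import urlsplit

# Building blocks listed in the table above (without the .tf.j2 suffix).
BUILDING_BLOCKS = {"cell", "observability_stack", "mesh_identity",
                   "mlops_stack", "edge_stack"}


def resolve_template(source: str) -> str:
    """Map an assumed `tpl://<block>` source to its Jinja2 template file name."""
    parts = urlsplit(source)
    if parts.scheme != "tpl":
        raise ValueError(f"not a tpl:// source: {source!r}")
    name = parts.netloc
    if name not in BUILDING_BLOCKS:
        raise KeyError(f"unknown building block: {name!r}")
    return f"{name}.tf.j2"
```

Validating against a closed set of names means a typo in a stack spec fails early instead of rendering an empty module.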

## Maintenance

This page and the per-service files mirror the YAML at
[`alphaswarm_platform/configs/deployment/topology.yaml`](../../../../alphaswarm_platform/configs/deployment/topology.yaml).
When you add a service:

1. Append the service to `topology.yaml` under `services:`.
2. Add a row to the matching table above (by category).
3. Add `concepts/infrastructure/services/.md` using the layout
   on every existing detail page (Identity / Wire / Deployment /
   Dependencies / Operations).
4. Add `'concepts/infrastructure/services/'` to `sidebars.ts`
   under the **Services** category.
5. If the service is reachable across cells, also append a row to
   `URL_FALLBACK_FIELDS` in
   [`alphaswarm/config/topology_fallback.py`](../../../../alphaswarm/alphaswarm/config/topology_fallback.py).
6. Either invoke the
   [`alphaswarm-index-curator`](../../../../alphaswarm/.cursor/agents/alphaswarm-index-curator.md)
   or drop a debt note per the always-on
   [`alphaswarm-index-reflect`](../../../../alphaswarm/.cursor/rules/alphaswarm-index-reflect.mdc)
   rule.

## See also

- [`control-plane-topology.md`](control-plane-topology.md) — discovery
  contract + `URL_FALLBACK_FIELDS` semantics.
- [`terraform-control-plane.md`](terraform-control-plane.md) —
  `TerraformRuntime` lifecycle + spec hash-locking.
- [`iac-runbook.md`](iac-runbook.md) — quick reference for plan / apply
  / destroy + shipping a standard template.
- [`how-to/operations/local-setup.md`](../../how-to/operations/local-setup.md) —
  bring the stack up locally.
- [`how-to/operations/kubernetes-deploy.md`](../../how-to/operations/kubernetes-deploy.md) —
  end-to-end Kubernetes walkthrough.


<!-- https://alpha-swarm.ai/concepts/infrastructure/services/alphaswarm-admin -->
# alphaswarm-admin
> Internal staff admin at admin.alpha-swarm.ai — managed services, company accounts, audit-first surface. FastAPI + Next.js, Entra-only auth.

# alphaswarm-admin

Internal-only admin dashboard for AlphaSwarm staff. Audit-first: every
action lands a `security_audit_events` row before mutating anything;
no destructive surface bypasses the ledger.

Authenticated via the AlphaSwarm staff Entra tenant. Outbound M2M
calls use a per-deployment Entra Agent Identity provisioned by the
[`alphaswarm_admin_agent_identity`](../../../../../../alphaswarm_platform/terraform/modules/alphaswarm_admin_agent_identity/)
Terraform module.

## Identity

| Field | Value |
| --- | --- |
| Service id | `alphaswarm-admin` |
| Role | `admin` |
| Package | [`alphaswarm_admin/`](../../../../../../alphaswarm_admin/) |
| Image (key) | `admin` |
| Built from | `alphaswarm_admin/Dockerfile` (FastAPI backend, port 8900) + `alphaswarm_admin/frontend/Dockerfile` (Next.js 15 UI). Two ECR repos: `alphaswarm-admin` + `alphaswarm-admin-frontend`. |

## Wire

| Field | Value |
| --- | --- |
| Protocol | HTTP/1.1 + WebSocket |
| Port | `8900` |
| Health | `GET /admin/health` (public; backs the Docker + ECS container health checks) |
| Public URL | `https://admin.alpha-swarm.ai` (Cloudflare tunnel + Pomerium IAP) |
| Identity | AlphaSwarm staff Entra tenant; `actor_kind` is `user` for human staff and `agent` for the per-deployment Agent Identity (RFC 8693 `act` claim) |

## Surfaces

| Prefix | Purpose |
| --- | --- |
| `/admin/*` | FastAPI backend — managed-services CRUD, company accounts, audit log, billing |
| `/admin/platform/ecs/*` | Platform deployment control — boto3 → AWS ECS + CloudWatch for the platform's OWN Fargate services (rollout status, redeploy, scale, logs, metrics, alarms). Distinct from `/admin/deployments` (customer workloads, brokered). Redeploy + scale are audit-first + step-up-MFA gated. |
| `/api/auth/entra/*` | Next.js BFF proxy to `alphaswarm-cp` `/auth/*` |
| `/dashboard`, `/platform`, `/managed-services`, `/companies`, `/audit-log`, `/billing` | Next.js frontend pages |

## Deployment surfaces

| Surface | Where |
| --- | --- |
| Compose | service `alphaswarm-admin` in [`deployments/compose/docker-compose.admin.yml`](../../../../../../alphaswarm_platform/deployments/compose/docker-compose.admin.yml) |
| Kustomize | rolled into the per-cell base — namespace `alphaswarm-admin` |
| ECS Fargate | [`infrastructure/modules/ecs-fargate-control-plane`](../../../../../../alphaswarm_platform/infrastructure/modules/ecs-fargate-control-plane/), wired in [`infrastructure/envs/minimum`](../../../../../../alphaswarm_platform/infrastructure/envs/minimum/). Container health check on `/admin/health`; the `admin` task carries the self-management policy so `/admin/platform/ecs/*` can drive the cluster. |
| AQP CR | [`AQPAdmin`](../../../../../../alphaswarm_controller/src/alphaswarm_controller/operator/crds/aqpadmin_cr.py) |
| Terraform module | [`alphaswarm_admin_agent_identity`](../../../../../../alphaswarm_platform/terraform/modules/alphaswarm_admin_agent_identity/) (Entra Agent Identity provisioning) |

## Dependencies

**Upstream:**

- `alphaswarm-cp` (`/auth/*`, `/manage/*`).
- `alphaswarm-core` (`/api/*` for read-only platform queries).
- `postgres` for the admin's own ledger tables.
- Stripe (optional) for billing integration.

**Downstream:**

- AlphaSwarm staff admins only — public ingress is wrapped by Pomerium
  with the `alphaswarm-staff` Entra group as the sole authenticated
  population.

## Operations

- **Audit-first:** every mutating endpoint writes a
  `security_audit_events` row BEFORE acting; rollbacks compensate the
  row.
- **No customer data exposure:** the admin reads aggregate signals
  only — never raw operator strategy code or RL weights.
- **Step-up MFA:** required for company-account suspensions, billing
  refunds, and kill-switch fan-out.
- **Boundary:** `alphaswarm_admin` MUST NOT import `alphaswarm.*` —
  it is HTTP-only against `alphaswarm-cp` and `alphaswarm-core`. The
  guard is enforced by
  [`alphaswarm_admin/AGENTS.md`](../../../../../../alphaswarm_admin/AGENTS.md).
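
The audit-first rule above can be sketched as follows. This is a minimal illustration, not the admin's actual code: the in-memory SQLite table stands in for the Postgres ledger, and the `status` column, the `audit_first` helper, and the idea that "compensating" means marking the row are all assumptions.

```python
import sqlite3
from typing import Callable


def open_audit_db() -> sqlite3.Connection:
    """In-memory stand-in for the admin ledger (assumption: real storage is Postgres)."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE security_audit_events ("
        "id INTEGER PRIMARY KEY, actor TEXT, action TEXT, status TEXT)"
    )
    return conn


def audit_first(conn: sqlite3.Connection, actor: str, action: str,
                mutate: Callable[[], None]) -> int:
    """Write the audit row BEFORE acting; compensate the row if the action fails."""
    cur = conn.execute(
        "INSERT INTO security_audit_events (actor, action, status) "
        "VALUES (?, ?, 'pending')",
        (actor, action),
    )
    conn.commit()  # the row is committed before the mutation starts
    event_id = cur.lastrowid
    try:
        mutate()
    except Exception:
        # Hypothetical compensation: keep the row, mark it as rolled back.
        conn.execute(
            "UPDATE security_audit_events SET status = 'compensated' WHERE id = ?",
            (event_id,),
        )
        conn.commit()
        raise
    conn.execute(
        "UPDATE security_audit_events SET status = 'applied' WHERE id = ?",
        (event_id,),
    )
    conn.commit()
    return event_id
```

Committing the row before calling `mutate()` means a crash mid-action still leaves a trace of who attempted what.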

## See also

- [`alphaswarm_admin/AGENTS.md`](../../../../../../alphaswarm_admin/AGENTS.md) — boundary
  rules.
- [`identity.md`](../../identity/identity.md) — Entra integration.
- [`alphaswarm_admin_agent_identity` module](../../../../../../alphaswarm_platform/terraform/modules/alphaswarm_admin_agent_identity/) —
  Agent Identity provisioning.


<!-- https://alpha-swarm.ai/concepts/infrastructure/services/alphaswarm-beat -->
# alphaswarm-beat
> Celery beat scheduler — periodic task dispatcher (factor refresh, predictor retraining, ledger compaction, status-page heartbeats).

# alphaswarm-beat

Celery beat process responsible for time-based task dispatch. It writes
to the same Redis broker the worker drains; nothing else writes
schedule-driven payloads.

## Identity

| Field | Value |
| --- | --- |
| Service id | `alphaswarm-beat` |
| Role | `scheduler` |
| Package | [`alphaswarm/`](../../../../../../alphaswarm/) (schedule under `alphaswarm/tasks/celery_app.py`) |
| Image (key) | `beat` |
| Built from | [`alphaswarm_platform/Dockerfile`](../../../../../../alphaswarm_platform/Dockerfile) (image key `beat` → target `worker`; beat shares the slim orchestration image) |

## Wire

| Field | Value |
| --- | --- |
| Protocol | none |
| Health | Celery broker connection probe |
| Replicas | exactly **1** (singleton) — `replicas: 1`, `strategy: Recreate` |

Running more than one beat replica causes duplicate task emissions; the
`Recreate` strategy guarantees the old pod is down before the new one
starts.
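
A minimal sketch of the singleton guard, assuming a render step that receives the CR `spec` as a plain dict. The `render_beat_block` helper and the `replicas` key under `beat` are hypothetical; only `spec.beat.enabled`, `replicas: 1`, and the `Recreate` strategy come from this page.

```python
from typing import Optional


def render_beat_block(spec: dict) -> Optional[dict]:
    """Return a Deployment fragment for beat, or None when beat is disabled."""
    beat = spec.get("beat", {})
    if not beat.get("enabled", False):
        return None
    replicas = beat.get("replicas", 1)
    if replicas != 1:
        # More than one scheduler would emit every periodic task more than once.
        raise ValueError(f"beat must be a singleton, got replicas={replicas}")
    # Recreate stops the old pod before the new one starts.
    return {"replicas": 1, "strategy": {"type": "Recreate"}}
```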

## Deployment surfaces

| Surface | Where |
| --- | --- |
| Compose | beat is folded into the `worker` container in compose (single-replica entrypoint switch) |
| Kustomize | [`deployments/kubernetes/base/alphaswarm-worker/beat-deployment.yaml`](../../../../../../alphaswarm_platform/deployments/kubernetes/base/alphaswarm-worker/) |
| AQP CR | folded into [`AQPMonolith`](../../../../../../alphaswarm_controller/src/alphaswarm_controller/operator/crds/aqpmonolith_cr.py) (`spec.beat.enabled`) |

## Schedule highlights

- **Every minute:** factor staleness probe, kill-switch heartbeat,
  status-page sync.
- **Every 5 minutes:** predictor refresh (when models flagged
  `online: true`), Iceberg orphan scan.
- **Hourly:** ledger compaction, audit-event aggregation, alphaswarm-index
  curator nudge (for diff detection).
- **Daily:** OPA bundle refresh, terraform plan-drift check.

The full schedule lives in
[`alphaswarm/tasks/celery_app.py`](../../../../../../alphaswarm/alphaswarm/tasks/celery_app.py).
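
For orientation, the cadences above could be written in the dict shape that Celery's `beat_schedule` setting accepts, where `schedule` may be a number of seconds. The entry and task names below are hypothetical placeholders, not the ones in `celery_app.py`.

```python
from collections import defaultdict

# Cadences from the schedule highlights, in seconds.
MINUTE, FIVE_MINUTES, HOUR, DAY = 60, 300, 3600, 86400

# Hypothetical names; the real entries live in alphaswarm/tasks/celery_app.py.
BEAT_SCHEDULE = {
    "factor-staleness-probe": {"task": "tasks.factor_staleness_probe", "schedule": MINUTE},
    "kill-switch-heartbeat": {"task": "tasks.kill_switch_heartbeat", "schedule": MINUTE},
    "status-page-sync": {"task": "tasks.status_page_sync", "schedule": MINUTE},
    "predictor-refresh": {"task": "tasks.predictor_refresh", "schedule": FIVE_MINUTES},
    "iceberg-orphan-scan": {"task": "tasks.iceberg_orphan_scan", "schedule": FIVE_MINUTES},
    "ledger-compaction": {"task": "tasks.ledger_compaction", "schedule": HOUR},
    "audit-event-aggregation": {"task": "tasks.audit_event_aggregation", "schedule": HOUR},
    "index-curator-nudge": {"task": "tasks.index_curator_nudge", "schedule": HOUR},
    "opa-bundle-refresh": {"task": "tasks.opa_bundle_refresh", "schedule": DAY},
    "terraform-plan-drift": {"task": "tasks.terraform_plan_drift", "schedule": DAY},
}


def group_by_cadence(schedule: dict) -> dict:
    """Group entry names by their interval in seconds."""
    groups = defaultdict(list)
    for name, entry in schedule.items():
        groups[entry["schedule"]].append(name)
    return {seconds: sorted(names) for seconds, names in groups.items()}
```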

## Operations

- **Single-instance:** `replicas: 1` is enforced by the kustomize
  base; the AQPMonolith CR refuses to render a beat block with
  `replicas != 1`.
- **Persistence:** beat schedule lives at `/tmp/celerybeat-schedule`
  inside the pod (ephemeral); the schedule itself is code-defined so
  loss is recoverable.
- **Audit:** beat-emitted tasks tag their `WorkloadRun` rows with
  `started_by_user_id = "system:beat"` so audit queries can split
  human-driven from scheduled work.
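
As an illustration of the audit tag, a query result could be partitioned like this. The row shape (dicts with an `id` key) is an assumption; only the `started_by_user_id = "system:beat"` marker comes from the text above.

```python
from typing import List, Tuple

SYSTEM_BEAT = "system:beat"


def split_by_origin(rows: List[dict]) -> Tuple[List[dict], List[dict]]:
    """Partition WorkloadRun-like rows into (human-driven, scheduled)."""
    human, scheduled = [], []
    for row in rows:
        if row["started_by_user_id"] == SYSTEM_BEAT:
            scheduled.append(row)
        else:
            human.append(row)
    return human, scheduled
```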

## See also

- [`alphaswarm-worker.md`](alphaswarm-worker.md) — what consumes beat's
  output.
- [`tasks-api`](../../../../../../alphaswarm/.cursor/rules/tasks-api.mdc) — task
  progress contract.


<!-- https://alpha-swarm.ai/concepts/infrastructure/services/alphaswarm-client -->
# alphaswarm-client
> Local power-user client — Vite SPA + Solara legacy + FastAPI gateway in a single pod, behind the per-cell ingress.

# alphaswarm-client

The frontend for local power users — operators running AlphaSwarm on a
laptop, in a tower cluster, or inside a self-hosted cell. It bundles a
React 19 + Vite SPA, the legacy Solara research UI, and a thin FastAPI
gateway that proxies to `alphaswarm-core` and `alphaswarm-cp`.

This is **not** the cloud customer dashboard — that is
[`alphaswarm-ui`](alphaswarm-ui.md), which targets `app.alpha-swarm.ai`.

## Identity

| Field | Value |
| --- | --- |
| Service id | `alphaswarm-client` |
| Role | `frontend` |
| Package | [`alphaswarm_client/`](../../../../../../alphaswarm_client/) |
| Image (key) | `frontend` |
| Built from | [`alphaswarm_client/Dockerfile`](../../../../../../alphaswarm_client/Dockerfile) (3-stage: ui-builder → solara-builder → production) and [`Dockerfile.tf`](../../../../../../alphaswarm_client/Dockerfile.tf) (Terraform-built variant) |

## Wire

| Field | Value |
| --- | --- |
| Protocol | HTTP/1.1 + WebSocket |
| Port | `80` (container) → `3000` (host, local dev) |
| Health | `GET /` |
| Public URL | per-cell ingress (e.g. `https://aqp.<cell>.alpha-swarm.ai`); local dev `http://localhost:3000` |

## Deployment surfaces

| Surface | Where |
| --- | --- |
| Compose | service `client` in [`alphaswarm_platform/compose/docker-compose.yml`](../../../../../../alphaswarm_platform/compose/docker-compose.yml); `alphaswarm-client` in [`deployments/compose/docker-compose.local.yml`](../../../../../../alphaswarm_platform/deployments/compose/docker-compose.local.yml) |
| Kustomize | [`deployments/kubernetes/base/alphaswarm-client/`](../../../../../../alphaswarm_platform/deployments/kubernetes/base/alphaswarm-client/) — Deployment + Service + HPA + PDB |
| AQP CR | [`AQPClient`](../../../../../../alphaswarm_controller/src/alphaswarm_controller/operator/crds/aqpclient_cr.py) |

## Dependencies

**Upstream (HTTP):**

- `alphaswarm-core` (`/api/*`, `/ws/*`) — every business call.
- `alphaswarm-cp` (`/manage/*`, `/auth/*`) — workload lifecycle and
  identity.

**Downstream:**

- Browser tabs on operator workstations.

## Frontend conventions

- Vite + React 19 + TanStack Query + zustand for state.
- WebSocket pipeline is throttled (`50ms` coalescing) per the
  [`frontend`](../../../../../../alphaswarm/.cursor/rules/frontend.mdc)
  rule.
- Solara legacy routes mounted at `/legacy/*`; sunset window per
  [`alphaswarm-client/AGENTS.md`](../../../../../../alphaswarm_client/AGENTS.md).
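
The 50 ms coalescing idea can be sketched independently of the UI framework. The real pipeline lives in the TypeScript client; this Python class is only an illustration of the technique, and the `Coalescer` name, the per-key "latest wins" policy, and the explicit `now_ms` argument are assumptions.

```python
from typing import Any, Dict, Optional


class Coalescer:
    """Keep only the latest payload per key and release one batch per window."""

    def __init__(self, window_ms: float = 50.0) -> None:
        self.window_ms = window_ms
        self._pending: Dict[str, Any] = {}
        self._last_flush_ms = 0.0

    def push(self, key: str, payload: Any, now_ms: float) -> Optional[Dict[str, Any]]:
        """Buffer a message; return a batch once the window has elapsed, else None."""
        self._pending[key] = payload  # newer messages overwrite older ones
        if now_ms - self._last_flush_ms >= self.window_ms:
            return self.flush(now_ms)
        return None

    def flush(self, now_ms: float) -> Dict[str, Any]:
        """Hand over everything buffered so far and start a new window."""
        batch, self._pending = self._pending, {}
        self._last_flush_ms = now_ms
        return batch
```

A consumer would apply each returned batch as one state update, so a burst of messages inside a window costs a single render.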

## Operations

- **Scaling:** HPA `cpu=70%`, `min=2 / max=8` in prod.
- **Bundle size budget:** the Vite build fails CI when the gzipped
  bundle exceeds 1.5 MiB.
- **CSP:** strict — only `manage.alpha-swarm.ai`,
  `app.alpha-swarm.ai`, and the per-cell `*.alpha-swarm.ai`
  hostnames are allowlisted.
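
A bundle-size gate of the kind described above might look like the sketch below. Which files count toward the budget (here `.js` and `.css` under the build output) and the helper names are assumptions; only the 1.5 MiB gzipped limit comes from this page.

```python
import gzip
from pathlib import Path
from typing import Tuple

BUDGET_BYTES = int(1.5 * 1024 * 1024)  # 1.5 MiB


def gzipped_size(dist: Path, suffixes: Tuple[str, ...] = (".js", ".css")) -> int:
    """Sum the gzip-compressed size of every matching file under dist."""
    total = 0
    for path in sorted(dist.rglob("*")):
        if path.is_file() and path.suffix in suffixes:
            total += len(gzip.compress(path.read_bytes()))
    return total


def check_budget(dist: Path, budget: int = BUDGET_BYTES) -> bool:
    """True when the bundle is within budget; CI would fail on False."""
    return gzipped_size(dist) <= budget
```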

## See also

- [`alphaswarm-client/AGENTS.md`](../../../../../../alphaswarm_client/AGENTS.md) — boundary
  rules.
- [`alphaswarm-ui.md`](alphaswarm-ui.md) — the cloud-hosted sibling.
- [`alphaswarm-ide.md`](alphaswarm-ide.md) — Theia IDE for code-first
  workflows.


<!-- https://alpha-swarm.ai/concepts/infrastructure/services/alphaswarm-core -->
# alphaswarm-core
> FastAPI gateway for the AlphaSwarm runtime: business routes, agentic surface, MCP servers, WebSocket streaming, scope + tenancy enforcement.

# alphaswarm-core

The FastAPI gateway for the AlphaSwarm runtime. Every business route
(strategies, bots, backtests, RL experiments, analysis runs, agents,
ingestion, ml-mcp, terraform, tenancy, paper trading, kill switch) is
mounted on this pod. The control plane (`alphaswarm-cp`) is a sibling
service, not a parent — `/manage/*` lives there.

## Identity

| Field | Value |
| --- | --- |
| Service id | `alphaswarm-core` |
| Role | `api` |
| Package | [`alphaswarm/`](../../../../../../alphaswarm/) |
| Image (key) | `api` |
| Built from | [`alphaswarm_platform/Dockerfile`](../../../../../../alphaswarm_platform/Dockerfile) (target `api`, multi-arch amd64+arm64, Chainguard Wolfi base, `uv` install) |

## Wire

| Field | Value |
| --- | --- |
| Protocol | HTTP/1.1 + HTTP/2 + WebSocket |
| Port | `8000` |
| Health | `GET /readyz` (ready) / `GET /healthz` (live) |
| Public URL | — (private; reached through the per-cell ingress / `app.alpha-swarm.ai` BFF for SPA traffic) |
| OIDC issuer for tokens it accepts | `MsalEntraValidator` (Entra primary) → Auth0 fallback per [`identity.md`](../../identity/identity.md) |

## Deployment surfaces

| Surface | Where |
| --- | --- |
| Compose (local dev) | service `api` in [`alphaswarm_platform/compose/docker-compose.yml`](../../../../../../alphaswarm_platform/compose/docker-compose.yml); also `alphaswarm-core` in [`deployments/compose/docker-compose.local.yml`](../../../../../../alphaswarm_platform/deployments/compose/docker-compose.local.yml) |
| Kustomize | [`deployments/kubernetes/base/alphaswarm-core/`](../../../../../../alphaswarm_platform/deployments/kubernetes/base/alphaswarm-core/) — Deployment + Service + HPA + PDB |
| AQP CR | [`AQPMonolith`](../../../../../../alphaswarm_controller/src/alphaswarm_controller/operator/crds/aqpmonolith_cr.py) — render path emits Deployment + Service + ConfigMap + (optional) Ingress |
| Terraform template | reachable through every `aws-*-cell` / `gcp-full-cell` / `azure-full-cell` template (see [`services.md`](../services.md#deployment-patterns)) |

## Dependencies

**Upstream services this pod calls:**

- `postgres` (5432) — primary OLTP + Alembic migrations.
- `redis` (6379) — session, semantic cache, kill-switch key, Celery broker.
- `neo4j` (7687) — ownership graph + lineage DAG.
- `chromadb` (8001) and `milvus` — vector search (when feature flag on).
- `mlflow` (5000) — model registry.
- `otel-collector` (4317) — OTLP traces + metrics + logs.
- `polaris` / Iceberg REST + `minio` — lakehouse reads/writes (via DataMCP).
- `alphaswarm-cp` (`/manage/*`) — workload lifecycle calls (control plane).

**Downstream callers (HTTP-only):**

- `alphaswarm-client` — Vite SPA + FastAPI gateway.
- `alphaswarm-ui` — Next.js dashboard (BFF routes proxy to here).
- `alphaswarm-admin` — internal admin (audit-first surface).
- `alphaswarm-ide` — Theia IDE (MCP-driven research copilot).
- `alphaswarm-cli` — operator CLI.
- `alphaswarm-worker` — Celery worker (calls back for progress / lookups).
- Bot pods (per-cell `QuantBot` CRs).

## Key routes

The route tree is the union of `alphaswarm/api/routes/*.py`. Key
prefixes:

| Prefix | Concept doc |
| --- | --- |
| `/strategies/*`, `/bots/*`, `/backtests/*` | [strategy-framework.md](../../strategy/analysis-framework.md) |
| `/agents/*`, `/workflows/*`, `/labs/*` | [agents.md](../../agentic/agents.md) |
| `/rl/*` | [rl-framework.md](../../rl/rl-framework.md) |
| `/data/*`, `/ingest/*`, `/lineage/*` | [data-plane.md](../../data/data-plane.md) |
| `/ml/*`, `/predictors/*` | [ml-framework.md](../../strategy/ml-framework.md) |
| `/terraform/*` | [terraform-control-plane.md](../terraform-control-plane.md) |
| `/tenancy/*`, `/membership/*` | [identity.md](../../identity/identity.md) |
| `/halt`, `/kill-switch` | [observability.md](../../trading/observability.md) |
| `/mcp/*` (multiple servers) | [data-mcp.md](../../data/data-mcp.md) |
| `/ws/*` | WebSocket progress streams |

## Operations

- **Scaling:** HPA target `cpu=70%`, `min=3 / max=12` in prod; `min=1 /
  max=3` in dev.
- **Disruption:** PDB `minAvailable=2` in prod; `0` in dev.
- **Step-up MFA:** destructive routes (`/manage/terraform/apply`,
  `/manage/credentials/cloud-cli/*`, `/halt`) require RFC 9470
  `acr=high`. See [`auth-stepup-and-byok`](../../../../../../alphaswarm/.cursor/rules/auth-stepup-and-byok.mdc).
- **Audit:** every state-mutating action lands a `workload_runs` row
  through `WorkloadRuntime`; every Terraform action lands a
  `terraform_runs` row through `TerraformRuntime`.
- **Redaction:** `WorkloadRuntime` strips secrets from audit payloads
  per the always-on
  [`alphaswarm-management-engine`](../../../../../../alphaswarm/.cursor/rules/alphaswarm-management-engine.mdc)
  rule. Token prefixes (4 chars max) are only printed behind an
  explicit `--unsafe-print-token-prefixes` operator flag.
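
A minimal sketch of the redaction rule, assuming audit payloads are nested dicts. The `SECRET_KEYS` set and the `redact` helper are hypothetical and do not reflect the `WorkloadRuntime` implementation; the 4-character prefix cap and the opt-in flag mirror the bullet above.

```python
SECRET_KEYS = {"password", "token", "secret", "api_key", "authorization"}  # hypothetical
MAX_PREFIX = 4


def redact(payload: dict, unsafe_print_token_prefixes: bool = False) -> dict:
    """Return a copy of payload with secret values replaced."""
    out = {}
    for key, value in payload.items():
        if isinstance(value, dict):
            out[key] = redact(value, unsafe_print_token_prefixes)
        elif key.lower() in SECRET_KEYS:
            show_prefix = (
                unsafe_print_token_prefixes
                and isinstance(value, str)
                and len(value) > MAX_PREFIX  # never echo a short secret in full
            )
            out[key] = (value[:MAX_PREFIX] + "...[redacted]") if show_prefix else "[redacted]"
        else:
            out[key] = value
    return out
```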

## See also

- [`control-plane-topology.md`](../control-plane-topology.md) — how
  callers find this pod's URL.
- [`alphaswarm/AGENTS.md`](../../../../../../alphaswarm/AGENTS.md) — runtime hard rules
  (`router_complete` is the only path for LLM calls, DataMCP is the only
  path for agent reads, etc.).
- [`alphaswarm-cp.md`](alphaswarm-cp.md) — sibling control plane.


<!-- https://alpha-swarm.ai/concepts/infrastructure/services/alphaswarm-cp -->
# alphaswarm-cp
> Standalone control plane — workload lifecycle (`/manage/*`), unified identity broker (`/auth/*`), connection manager, kopf operator host, Phase 5 connection-proxy mesh.

# alphaswarm-cp

The standalone control plane. Owns every workload-lifecycle action,
the unified identity broker, the connection-manager, and the
Phase 5 connection-proxy mesh. Does NOT import `alphaswarm.*` runtime
code — the boundary is enforced by
[`alphaswarm-control-plane.mdc`](../../../../../../alphaswarm/.cursor/rules/alphaswarm-control-plane.mdc).

## Identity

| Field | Value |
| --- | --- |
| Service id | `alphaswarm-cp` |
| Role | `control-plane` |
| Package | [`alphaswarm_controller/`](../../../../../../alphaswarm_controller/) |
| Image (key) | `cp` |
| Built from | [`alphaswarm_controller/Dockerfile`](../../../../../../alphaswarm_controller/) (multi-stage Wolfi + uv) |

## Wire

| Field | Value |
| --- | --- |
| Protocol | HTTP/1.1 + HTTP/2 + WebSocket |
| Port | `9000` |
| Health | `GET /manage/readyz` (ready) / `GET /manage/healthz` (live) |
| Public URL | `https://manage.alpha-swarm.ai` (behind Cloudflare tunnel + Pomerium IAP) |
| Identity for incoming | per-route: `/manage/*` requires `admin:cluster`; `/auth/*` is unauthenticated up to `/callback`; `/proxy/*` requires the same scopes as the destination |

## Surfaces

| Prefix | Purpose | Code |
| --- | --- | --- |
| `/manage/*` | Workload lifecycle (start/stop/scale/restart/exec/logs/apply_config/rotate_secret), credentials, terraform passthrough, topology, MFA, billing | [`alphaswarm_controller/api/routers/`](../../../../../../alphaswarm_controller/src/alphaswarm_controller/api/routers/) |
| `/auth/m2m/token`, `/auth/agent-identity/token` | Phase 1 identity broker — M2M + Entra Agent Identity tokens | [`api/routers/auth.py`](../../../../../../alphaswarm_controller/src/alphaswarm_controller/api/routers/auth.py) |
| `/auth/.well-known/openid-configuration` | OIDC discovery (canonical location) | same |
| `/auth/login`, `/callback`, `/logout`, `/refresh`, `/me`, `/stepup`, `/device/start`, `/device/poll` | Phase 3 BFF + RFC 8628 device flow | [`api/routers/bff.py`](../../../../../../alphaswarm_controller/src/alphaswarm_controller/api/routers/bff.py) |
| `/manage/connections`, `/manage/connections/{id}` | Phase 2 connection manager — typed `ConnectionDescriptor` for any topology service | [`api/routers/connections.py`](../../../../../../alphaswarm_controller/src/alphaswarm_controller/api/routers/connections.py) + [`services/connections.py`](../../../../../../alphaswarm_controller/src/alphaswarm_controller/services/connections.py) |
| `/proxy/{service_id}/{path}` | Phase 5 connection-proxy mesh (SPIFFE-mediated mTLS in 5b) | [`api/routers/proxy.py`](../../../../../../alphaswarm_controller/src/alphaswarm_controller/api/routers/proxy.py) |

## Embedded operator

When the `operator` extra is installed, the same image hosts the
[`aqp-controller-operator`](aqp-controller-operator.md) — a kopf
process reconciling the eight AQP* CRDs. Single-replica
(`Recreate` strategy) so reconciliation order stays deterministic.

Without the extra, the bare `alphaswarm-controller` image still boots on
memory-constrained nodes that don't run the operator.

## Deployment surfaces

| Surface | Where |
| --- | --- |
| Compose | service `alphaswarm-cp` in [`deployments/compose/docker-compose.admin.yml`](../../../../../../alphaswarm_platform/deployments/compose/docker-compose.admin.yml) (admin overlay) |
| Kustomize | [`deployments/kubernetes/base/alphaswarm-cp/`](../../../../../../alphaswarm_platform/deployments/kubernetes/base/alphaswarm-cp/) — Deployment + Service + PDB |
| AQP operator (Phase 4) | [`deployments/kubernetes/aqp-controller-operator/`](../../../../../../alphaswarm_platform/deployments/kubernetes/aqp-controller-operator/) — kopf reconciler kustomize tree |
| Terraform module | [`alphaswarm_platform/terraform/modules/alphaswarm_workloads/`](../../../../../../alphaswarm_platform/terraform/modules/alphaswarm_workloads/) (workload), [`terraform_runner/`](../../../../../../alphaswarm_platform/terraform/modules/terraform_runner/) (paired pod) |

## Dependencies

**Upstream:**

- `postgres` — `workload_runs`, `terraform_runs`, `EntraTenantLink`,
  session store (Phase 5+).
- `redis` — kill-switch key, BFF session store, M2M token cache.
- The cluster API (kubernetes / docker / aws / azure / gcp) through
  per-provider adapters under
  [`alphaswarm_controller/providers/`](../../../../../../alphaswarm_controller/src/alphaswarm_controller/providers/).

**Downstream:**

- `alphaswarm-core` calls into `/manage/*` for cluster-internal lookups.
- `alphaswarm-client`, `alphaswarm-ui`, `alphaswarm-admin`,
  `alphaswarm-cli` use `/auth/*` once their `AUTH_BFF_ENABLED` flag is on.
- `alphaswarm-cli launch` hits the operator route to render AQP* CRs.

## Operations

- **HA:** `replicas: 2` in prod; 1 in dev. PDB `minAvailable=1`.
- **Single operator:** the kopf process is single-replica regardless
  of cp replicas — operator pods run as a separate Deployment.
- **Step-up MFA:** every `/manage/terraform/apply`,
  `/manage/credentials/cloud-cli/*`, and `/halt` route requires
  RFC 9470 `acr=high`.
- **Audit:** every `/manage/*` action lands a `workload_runs` row;
  every `/auth/*` token mint lands a `security_audit_events` row.
  Redaction is enforced by
  [`alphaswarm-management-engine`](../../../../../../alphaswarm/.cursor/rules/alphaswarm-management-engine.mdc).
- **Pomerium IAP:** the public ingress wraps `/manage/*` with
  Pomerium so the Entra-staff group is the only authenticated path.
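
The step-up gate can be sketched as a claims check. The prefix matching and the `authorize` helper are assumptions for illustration; the `insufficient_user_authentication` error code and the `acr_values` challenge parameter are the ones RFC 9470 defines.

```python
from typing import Dict, Tuple

# Routes from the step-up bullet above, expressed as path prefixes (assumption).
STEP_UP_PREFIXES = (
    "/manage/terraform/apply",
    "/manage/credentials/cloud-cli/",
    "/halt",
)


def needs_step_up(path: str) -> bool:
    """True when the route is on the destructive list."""
    return path.startswith(STEP_UP_PREFIXES)


def authorize(path: str, claims: dict) -> Tuple[int, Dict[str, str]]:
    """Return (status, headers); challenge when a destructive route lacks acr=high."""
    if needs_step_up(path) and claims.get("acr") != "high":
        challenge = 'Bearer error="insufficient_user_authentication", acr_values="high"'
        return 401, {"WWW-Authenticate": challenge}
    return 200, {}
```

A client that receives the 401 challenge is expected to re-authenticate at the requested level and retry with the new token.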

## See also

- [`control-plane-topology.md`](../control-plane-topology.md) — topology
  and URL fallback contract; cp is the sole topology server.
- [`terraform-control-plane.md`](../terraform-control-plane.md) —
  `TerraformRuntime` runs inside cp.
- [`identity.md`](../../identity/identity.md) — IdentityProvider chain.
- [`alphaswarm_controller/AGENTS.md`](../../../../../../alphaswarm_controller/AGENTS.md) —
  hard rules for the standalone control plane.


<!-- https://alpha-swarm.ai/concepts/infrastructure/services/alphaswarm-docs-archive -->
# alphaswarm-docs-archive
> Sunset Stripe-style API epoch archive at archive.alpha-swarm.ai. Cloudflare Pages, immutable per epoch.

# alphaswarm-docs-archive

Sunset documentation archive. Stripe-style: every public-API epoch
freezes a snapshot of `alphaswarm-docs` and surfaces it under
`archive.alpha-swarm.ai/<epoch>/` so customers running pinned API
versions still have a working manual.

## Identity

| Field | Value |
| --- | --- |
| Service id | `alphaswarm-docs-archive` |
| Role | `docs-archive` |
| Hosted on | Cloudflare Pages |
| Public URL | `https://archive.alpha-swarm.ai` |

## Layout

- `/v1/...` — first public epoch (frozen)
- `/v2/...` — current epoch (mirrors `docs.alpha-swarm.ai`)
- `/v<N>/...` — every previous epoch retained for the deprecation
  window declared in the release notes.

Each epoch directory is a frozen build of `alphaswarm_docs/` at the
tag matching the epoch.

## Deployment surface

| Surface | Where |
| --- | --- |
| Terraform module | [`alphaswarm_platform/terraform/modules/cloudflare_pages_docs/`](../../../../../../alphaswarm_platform/terraform/modules/cloudflare_pages_docs/) — same module as the live docs, distinct Pages project |
| Spec | reuses the `docs-edge` stack pattern at [`alphaswarm_platform/configs/terraform/stacks/docs-edge.yaml`](../../../../../../alphaswarm_platform/configs/terraform/stacks/docs-edge.yaml) (separate workspace) |

## Operations

- **Immutability:** archive content is read-only after the epoch
  freezes. PRs targeting an archive branch are auto-rejected by the
  `archive-frozen` GitHub Action.
- **Sunset window:** epochs hold for the deprecation window declared
  in the matching release note (typically 12 months).
- **Discoverability:** the live docs link to `archive.alpha-swarm.ai`
  whenever an API breaks compatibility.

## See also

- [`alphaswarm-docs.md`](alphaswarm-docs.md) — live docs.
- [Stripe API versioning](https://stripe.com/blog/api-versioning) —
  the model this archive imitates.


<!-- https://alpha-swarm.ai/concepts/infrastructure/services/alphaswarm-docs-status -->
# alphaswarm-docs-status
> Public status page at status.alpha-swarm.ai. Hosted on Instatus SaaS, separate Cloudflare zone, intentionally outside the cluster.

# alphaswarm-docs-status

The public status page. Provisioned on [Instatus](https://instatus.com)
SaaS and CNAMEd to `status.alpha-swarm.ai` on a Cloudflare zone
distinct from `alpha-swarm.ai`. Survives full cluster + edge
outages — operators can post updates from the Instatus dashboard
even when the AlphaSwarm cluster is down.

## Identity

| Field | Value |
| --- | --- |
| Service id | `alphaswarm-docs-status` |
| Role | `status-page` |
| Hosted on | Instatus SaaS |
| Public URL | `https://status.alpha-swarm.ai` |

## Deployment surface

| Surface | Where |
| --- | --- |
| Terraform module | [`alphaswarm_platform/terraform/modules/instatus/`](../../../../../../alphaswarm_platform/terraform/modules/instatus/) — provisions the page + components + integrations |

## Components

The status page exposes one component per logical service:

- `core` — `alphaswarm-core`, `alphaswarm-worker`, `alphaswarm-beat`
- `controller` — `alphaswarm-cp` + the AQP operator
- `frontends` — `alphaswarm-ui`, `alphaswarm-client`,
  `alphaswarm-admin`
- `docs` — `alphaswarm-docs`, `alphaswarm-website`
- `data-plane` — `postgres`, `redis`, `neo4j`, `iceberg`, `kafka`
- `mlops` — `mlflow`, `argo-workflows`, `dagster`
- `observability` — `prometheus`, `loki`, `jaeger`, `phoenix`

## Update flow

- Beat-emitted heartbeats publish health to a per-service Instatus
  webhook every 60 s.
- Incidents are posted manually by the on-call operator from the
  Instatus dashboard.
- Maintenance windows are scheduled in advance via the
  `instatus` Terraform module's `scheduled_maintenance` resources.

## See also

- [`how-to/runbooks/`](../../../how-to/runbooks/) — incident response.
- [`alphaswarm-docs.md`](alphaswarm-docs.md) — sibling docs site.
- [`instatus` Terraform module](../../../../../../alphaswarm_platform/terraform/modules/instatus/) —
  provisioning source.


<!-- https://alpha-swarm.ai/concepts/infrastructure/services/alphaswarm-docs -->
# alphaswarm-docs
> Public documentation site at docs.alpha-swarm.ai. Docusaurus on Cloudflare Pages with MCP + llms.txt endpoints for agent consumers.

# alphaswarm-docs

The canonical AlphaSwarm documentation site. Docusaurus + Diátaxis
structure, deployed to Cloudflare Pages. Survives cluster outages —
the docs domain is intentionally provisioned outside the cluster so
incident-time runbooks stay reachable.

## Identity

| Field | Value |
| --- | --- |
| Service id | `alphaswarm-docs` |
| Role | `docs` |
| Package | [`alphaswarm_docs/`](../../../../../../alphaswarm_docs/) |
| Hosted on | Cloudflare Pages |
| Public URL | `https://docs.alpha-swarm.ai` |

## Deployment surface

| Surface | Where |
| --- | --- |
| Terraform module | [`alphaswarm_platform/terraform/modules/cloudflare_pages_docs/`](../../../../../../alphaswarm_platform/terraform/modules/cloudflare_pages_docs/) |
| Spec | [`alphaswarm_platform/configs/terraform/stacks/docs-edge.yaml`](../../../../../../alphaswarm_platform/configs/terraform/stacks/docs-edge.yaml) |
| Build | `pnpm build` in `alphaswarm_docs/` — Cloudflare Pages picks up the GitHub branch |

## Agent surface

The site exposes structured endpoints for AI/MCP consumers:

- `/llms.txt` and `/llms-full.txt` — convention-compliant index of
  the docs corpus.
- `/mcp` — MCP server publishing the docs as searchable tool calls.
- `/openapi.json` — OpenAPI surface for the MCP server.

## Dependencies

**Upstream:** GitHub repo for build trigger; Cloudflare for edge
hosting; OPA bundle (downloaded at deploy) for redaction policy on
docs links to internal runbooks.

**Downstream:** browsers, AI agents, search crawlers.

## Operations

- **Deploy:** every PR landing on `main` redeploys via Cloudflare
  Pages CI. Branch previews under `*.alphaswarm-docs.pages.dev`.
- **Custom domain:** `docs.alpha-swarm.ai` mapped via the
  `cloudflare_pages_docs` Terraform module; certificate via
  Cloudflare's edge SSL.
- **Out-of-cluster:** intentional; the docs stay reachable whatever the
  cluster is doing.

## See also

- [`alphaswarm-website.md`](alphaswarm-website.md) — public marketing
  sibling at `alpha-swarm.ai`.
- [`alphaswarm-docs-archive.md`](alphaswarm-docs-archive.md) — sunset
  API epochs at `archive.alpha-swarm.ai`.
- [`alphaswarm-docs-status.md`](alphaswarm-docs-status.md) — incident
  status page.


<!-- https://alpha-swarm.ai/concepts/infrastructure/services/alphaswarm-executor -->
# alphaswarm-executor
> Heavy-compute Celery executor for the AlphaSwarm runtime — drains backtest, training, ML, agents, factors, RAG. Carries the full ML/RL/forecasting + Dask/Ray surface.

# alphaswarm-executor

Celery **heavy-compute** executor pod — the compute-heavy counterpart of
the orchestration [`alphaswarm-worker`](alphaswarm-worker.md).

Introduced by the Phase 4c worker/executor split. It carries the full
ML / RL / forecasting / portfolio + distributed-compute (Dask + Ray)
dependency surface so backtests, training rollouts, factor builds, and
agent-emitted strategy code run here instead of bloating the slim
orchestration worker. See
[worker vs executor images](../worker-executor-images.md) for the full
rationale and dependency matrix.

## Identity

| Field | Value |
| --- | --- |
| Service id | `alphaswarm-executor` |
| Role | `executor` |
| Package | [`alphaswarm/`](../../../../../../alphaswarm/) (tasks under `alphaswarm/tasks/*.py`) |
| Image (key) | `executor` |
| Built from | [`alphaswarm_platform/Dockerfile`](../../../../../../alphaswarm_platform/Dockerfile) (target `executor`, multi-arch) or the standalone [`build/docker/alphaswarm_executor/Dockerfile`](../../../../../../alphaswarm_platform/build/docker/alphaswarm_executor/Dockerfile) |

## Wire

| Field | Value |
| --- | --- |
| Protocol | none (no HTTP listener) |
| Health | `celery inspect ping` + Prometheus metrics on `:9100`; Ray dashboard on `:8265` when a local Ray head runs |
| Public URL | — |
| Broker | `redis://redis:6379/0` |
| Result backend | `redis://redis:6379/1` |

## Deployment surfaces

| Surface | Where |
| --- | --- |
| Compose | `alphaswarm-executor` in [`deployments/compose/docker-compose.local.yml`](../../../../../../alphaswarm_platform/deployments/compose/docker-compose.local.yml); `worker-gpu` in legacy [`compose/docker-compose.yml`](../../../../../../alphaswarm_platform/compose/docker-compose.yml) |
| Kustomize | [`deployments/kubernetes/base/alphaswarm-executor/`](../../../../../../alphaswarm_platform/deployments/kubernetes/base/alphaswarm-executor/) — Deployment + HPA + PDB |
| Image catalogue | `executor` entry in [`terraform/modules/alphaswarm_images/`](../../../../../../alphaswarm_platform/terraform/modules/alphaswarm_images/) |
| ECR repo | `alphaswarm-executor` in [`infrastructure/modules/ecr-repositories/`](../../../../../../alphaswarm_platform/infrastructure/modules/ecr-repositories/) |
| Terraform module | [`terraform/modules/faas/`](../../../../../../alphaswarm_platform/terraform/modules/faas/) — heavy-queue Deployments pull this image |
| Topology | `alphaswarm-executor` in [`configs/deployment/topology.yaml`](../../../../../../alphaswarm_platform/configs/deployment/topology.yaml) |

## Queue families

The executor drains the **heavy compute** queues. KEDA scales each queue
family independently.

| Queue | Drives | Scale-to-zero | Notes |
| --- | --- | --- | --- |
| `backtest` | backtest dispatch (vbt-pro / event-driven / Lean) | yes | `max=20` |
| `training` | RL rollouts, finetune jobs | yes | dedicated GPU node group |
| `ml` | ML pipelines, predictor refresh | yes | |
| `agents` | CrewAI runs, LangGraph orchestration | yes | `max=12` |
| `factors` | factor zoo builds, alpha tests | yes | |
| `rag` | RAG ingest, embedding refresh | yes | |
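
Roughly, per-queue scaling with scale-to-zero resolves as in the sketch below. The `tasks_per_replica` target is an assumed value, queues without a stated maximum are treated as uncapped here, and the real source of truth is the KEDA configuration in the `faas` module.

```python
import math
from typing import Dict, Optional

# Maximums stated in the table above; None means the table gives no cap.
QUEUE_MAX: Dict[str, Optional[int]] = {
    "backtest": 20,
    "training": None,
    "ml": None,
    "agents": 12,
    "factors": None,
    "rag": None,
}


def desired_replicas(queue: str, depth: int, tasks_per_replica: int = 2) -> int:
    """Replica count for one queue family, scaled independently of the others."""
    if queue not in QUEUE_MAX:
        raise KeyError(f"unknown queue family: {queue}")
    if depth <= 0:
        return 0  # scale-to-zero when the queue is empty
    wanted = math.ceil(depth / tasks_per_replica)
    cap = QUEUE_MAX[queue]
    return wanted if cap is None else min(wanted, cap)
```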

## Dependencies

**Upstream:**

- `redis` — broker + result backend.
- `postgres` — task lookups, ledger writes.
- `alphaswarm-core` — progress emit callbacks, lookup APIs.
- `mlflow` — experiment tracking + model registry for training / ML runs.
- All data-plane services the `alphaswarm-core` pod depends on.

**Downstream:**

- Beat schedules heavy periodic jobs (factor refresh, predictor
  retraining); the executor is the consumer.
- May start a local Ray head / Dask cluster for distributed backtests.

## Operations

- **Resources:** requests `1 CPU / 4Gi`, limits `8 CPU / 16Gi`. Prefers
  memory-optimized nodes via node affinity; anti-affinity keeps it off
  the `alphaswarm-core` nodes.
- **Scaling:** HPA on CPU + custom Celery queue depth (KEDA
  `ScaledObject`s supersede it where KEDA is installed). Scales **down**
  slowly (900s stabilization) so a long-running backtest / train job is
  not evicted mid-flight.
- **Concurrency:** 2 per pod (compute-bound; each task is heavy).
- **Drain on shutdown:** `terminationGracePeriodSeconds: 600` so
  in-flight jobs complete; `preStop` sends `SIGTERM` to Celery.
- **Audit:** `WorkloadRuntime` actions land `workload_runs` rows; the
  executor pod respects the kill-switch Redis key like every other pod.
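
The slow scale-down can be illustrated with the rule the Kubernetes HPA uses for a scale-down stabilization window: act on the highest recommendation seen within the window. The class below is a simplified model of that rule under the 900 s setting, not the autoscaler's code.

```python
from collections import deque
from typing import Deque, Tuple


class ScaleDownStabilizer:
    """Scale up immediately, scale down only after the window agrees."""

    def __init__(self, window_s: int = 900) -> None:
        self.window_s = window_s
        self._history: Deque[Tuple[float, int]] = deque()  # (timestamp, recommendation)

    def decide(self, now_s: float, recommended: int) -> int:
        """Record a recommendation and return the replica count to apply."""
        self._history.append((now_s, recommended))
        # Drop recommendations that are older than the stabilization window.
        while self._history and self._history[0][0] < now_s - self.window_s:
            self._history.popleft()
        return max(replicas for _, replicas in self._history)
```

With this rule a brief dip in queue depth during a long backtest does not remove the pod that is running it.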

## See also

- [`alphaswarm-worker.md`](alphaswarm-worker.md) — orchestration sibling (light queues).
- [`worker-executor-images.md`](../worker-executor-images.md) — image split rationale + dependency matrix.
- [`faas` Terraform module](../../../../../../alphaswarm_platform/terraform/modules/faas/) — KEDA scaling source of truth.
- [`build/docker/alphaswarm_executor/`](../../../../../../alphaswarm_platform/build/docker/alphaswarm_executor/) — standalone image (migration-ready).


<!-- https://alpha-swarm.ai/concepts/infrastructure/services/alphaswarm-ide -->
# alphaswarm-ide
> White-labeled Theia 1.72 + six AlphaSwarm compile-time extensions + MCP-driven research copilot + Perspective Arrow notebook renderer.

# alphaswarm-ide

Browser-tier IDE for AlphaSwarm. White-labeled Theia 1.72 with six
compile-time extensions (`alphaswarm`, `alphaswarm-shell`,
`alphaswarm-mcp-bridge`, `alphaswarm-research-copilot`,
`alphaswarm-notebook-quant`, `alphaswarm-quant`), an MCP-driven
research copilot, and a Perspective + Arrow notebook renderer.

The canonical operator entrypoint is `alphaswarm-cli ide` — see
[`alphaswarm-ide.md`](../alphaswarm-ide.md) for the full IDE concept doc.

## Identity

| Field | Value |
| --- | --- |
| Service id | `alphaswarm-ide` |
| Role | `ide` |
| Package | [`alphaswarm_ide/`](../../../../../../alphaswarm_ide/) |
| Image (key) | `ide` |
| Built from | [`alphaswarm_ide/Dockerfile`](../../../../../../alphaswarm_ide/Dockerfile) (node:24-bookworm; extension-build env) |

## Wire

| Field | Value |
| --- | --- |
| Protocol | HTTP/1.1 + WebSocket (Theia front channel) |
| Port | `3000` (browser-tier) |
| Health | `GET /` |
| Public URL | per-user (operator's own laptop or per-cell ingress) |

## Deployment surfaces

| Surface | Where |
| --- | --- |
| Local | `alphaswarm-cli ide start` — runs the IDE as a docker container against the local cluster |
| Kustomize | [`deployments/kubernetes/alphaswarm-ide/`](../../../../../../alphaswarm_platform/deployments/kubernetes/alphaswarm-ide/) — Deployment + Service + Ingress + NetworkPolicy |
| AQP CR | [`AQPIDE`](../../../../../../alphaswarm_controller/src/alphaswarm_controller/operator/crds/aqpide_cr.py) — for per-user pod lifecycle |

## Dependencies

**Upstream:**

- `alphaswarm-core` `/mcp/*` — every research copilot LLM call goes
  through `router_complete` (rule 2) on the API pod.
- `alphaswarm-cp` `/auth/*` — OIDC-bound IDE sessions.
- `postgres`, `redis`, `iceberg/polaris` — read paths for the
  notebook renderer.

**Downstream:**

- Operator browsers (one IDE pod per active operator session).

## Boundaries

- AlphaSwarm code MUST live inside `theia-extensions/alphaswarm*/`.
- Theia extension code MUST NOT import `alphaswarm` source —
  cross-process via MCP only.
- Copilot LLM calls go through `router_complete` (AGENTS rule 2).
- MCP registrations carry per-MCP `aud` claims (rule 49).
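
The per-MCP `aud` rule above amounts to an audience check on each token. A minimal sketch, assuming the token signature has already been verified elsewhere and the claims are available as a plain mapping; `audience_matches` and the example audience values are hypothetical names for illustration.

```python
from typing import Any, Mapping


def audience_matches(claims: Mapping[str, Any], expected_aud: str) -> bool:
    """Return True when the token's `aud` claim includes `expected_aud`.

    RFC 7519 allows `aud` to be a single string or a list of strings.
    """
    aud = claims.get("aud")
    if isinstance(aud, str):
        return aud == expected_aud
    if isinstance(aud, (list, tuple)):
        return expected_aud in aud
    return False  # a missing or malformed audience is rejected
```

A token minted for one MCP slice would then fail this check when presented to another slice, which is the point of keeping the audiences distinct.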

## Operations

- **Per-user pods:** the operator pattern is one Deployment per
  active session. Idle sessions scale to zero via KEDA after 30 min.
- **NetworkPolicy:** egress from the IDE pod is limited to
  `alphaswarm-core`, `alphaswarm-cp`, and the data plane, with the data
  plane reached only through the `alphaswarm-data-mcp` sidecar.
- **Bundle sourcing:** the AlphaSwarm extensions are built into the
  image at compile time; no runtime extension marketplace fetch.

## See also

- [`alphaswarm-ide.md`](../alphaswarm-ide.md) — full IDE concept doc.
- [`alphaswarm-ide-roadmap.md`](../alphaswarm-ide-roadmap.md) — phase
  plan.
- [`alphaswarm_ide/AGENTS.md`](../../../../../../alphaswarm_ide/AGENTS.md) — boundary
  rules.


<!-- https://alpha-swarm.ai/concepts/infrastructure/services/alphaswarm-ml-mcp -->
# alphaswarm-ml-mcp
> Dedicated MCP server for the data.ml.* tool slice — Predictor Hub, ML pipelines, AlphaBacktestExperiment. Piggybacked on the alphaswarm-core pod.

# alphaswarm-ml-mcp

Dedicated MCP server publishing the `data.ml.*` tool slice — Predictor
Hub lookups, AlphaBacktestExperiment dispatch, walk-forward run
inspection, finetune trainer status, model serving (vLLM / Ollama /
KServe). Piggybacked on the `alphaswarm-core` pod (same FastAPI app,
distinct route prefix and `aud` claim).

This is the MLOps slice's endpoint conformant with RFC 9728 (OAuth 2.0
Protected Resource Metadata) and RFC 8707 (Resource Indicators for
OAuth 2.0); see [`mcp-rfc-conformance`](../../../../../../alphaswarm/.cursor/rules/mcp-rfc-conformance.mdc).

## Identity

| Field | Value |
| --- | --- |
| Service id | `alphaswarm-ml-mcp` |
| Role | `mcp` |
| Package | [`alphaswarm_models/`](../../../../../../alphaswarm_models/) (tools); served from [`alphaswarm/ml_mcp/`](../../../../../../alphaswarm/alphaswarm/ml_mcp/) |
| Image (key) | (piggybacked on `api`) |
| Built from | [`alphaswarm_platform/Dockerfile`](../../../../../../alphaswarm_platform/Dockerfile) (target `api`) |

## Wire

| Field | Value |
| --- | --- |
| Protocol | HTTP/1.1 + WebSocket (MCP) |
| Port | `8000` (shared with `alphaswarm-core`) |
| Health | `GET /mcp/ml/tools` (lists tool registrations) |
| Discovery | `GET /.well-known/oauth-protected-resource/mcp/ml` (RFC 9728 metadata) |
| Audience claim | dedicated per-MCP `aud` per AGENTS rule 49 |
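
The discovery path in the table follows the RFC 9728 convention of inserting the well-known segment between the host and the resource path. A minimal sketch of that derivation, assuming a hypothetical host `core.example`; `protected_resource_metadata_url` is an illustrative helper, not part of the codebase.

```python
from urllib.parse import urlsplit, urlunsplit

WELL_KNOWN = "/.well-known/oauth-protected-resource"


def protected_resource_metadata_url(resource: str) -> str:
    """Insert the RFC 9728 well-known segment between host and path."""
    parts = urlsplit(resource)
    # A bare trailing slash after the host is dropped before insertion.
    path = "" if parts.path == "/" else parts.path
    return urlunsplit((parts.scheme, parts.netloc, WELL_KNOWN + path, parts.query, ""))
```

For a resource identifier of `https://core.example/mcp/ml`, this yields the `/.well-known/oauth-protected-resource/mcp/ml` path listed under Discovery.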

## Tool registrations

| Tool prefix | Concept doc |
| --- | --- |
| `data.ml.predictors.*` | [ml-framework.md](../../strategy/ml-framework.md) |
| `data.ml.skills.*` | [mlops-service.md](../../strategy/mlops-service.md) |
| `data.ml.serving.*` | [ml-framework.md](../../strategy/ml-framework.md) |
| `data.ml.experiments.*` | [analysis-framework.md](../../strategy/analysis-framework.md) |
| `data.ml.finetune.*` | [ml-framework.md](../../strategy/ml-framework.md) |

## Deployment surfaces

| Surface | Where |
| --- | --- |
| Compose | folded into `api` |
| Kustomize | folded into [`base/alphaswarm-core/`](../../../../../../alphaswarm_platform/deployments/kubernetes/base/alphaswarm-core/) |
| AQP CR | folded into `AQPMonolith` (`spec.mlMcp.enabled`) |

## Dependencies

**Upstream:**

- `mlflow` (5000) — experiment + model registry.
- `postgres` — Predictor Hub catalog.
- `polaris` / `minio` — feature store reads.
- `bentoml` / `kserve` (when serving backend = remote) — model
  invocations.

**Downstream:**

- Agentic plane (`alphaswarm/agents/`) — ML calls go through DataMCP,
  never direct ORM imports.
- `alphaswarm-ide` research copilot.

## Operations

- **`router_complete` only:** any LLM call from inside the MCP
  registrations goes through `alphaswarm/llm/providers/router.py`
  (rule 2).
- **OOD guard + circuit breaker:** the MLSkillRuntime applies
  `rules/ood_guard.py` and the circuit breaker before model calls.
- **Audit:** every tool invocation lands an `agent_runs_v2` row.
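
The circuit breaker mentioned above can be illustrated with a generic pattern. This is a sketch under stated assumptions: the class name, the thresholds, and the injected clock are all hypothetical, and the real `MLSkillRuntime` breaker may behave differently.

```python
import time
from typing import Callable, Optional


class CircuitBreaker:
    """Open after `max_failures` consecutive failures; allow calls again after `reset_after` seconds."""

    def __init__(self, max_failures: int = 3, reset_after: float = 30.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.max_failures = max_failures
        self.reset_after = reset_after
        self._clock = clock
        self._failures = 0
        self._opened_at: Optional[float] = None

    def allow(self) -> bool:
        """Return True when a model call may proceed."""
        if self._opened_at is None:
            return True
        # After the cool-down, calls are allowed again; a further failure re-opens the breaker.
        return self._clock() - self._opened_at >= self.reset_after

    def record_success(self) -> None:
        """Close the breaker and reset the failure count."""
        self._failures = 0
        self._opened_at = None

    def record_failure(self) -> None:
        """Count a failure and open the breaker once the threshold is reached."""
        self._failures += 1
        if self._failures >= self.max_failures:
            self._opened_at = self._clock()
```

A caller would check `allow()` before invoking the model and report the outcome with `record_success()` or `record_failure()`.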

## See also

- [`mlops-service.md`](../../strategy/mlops-service.md) — MLOps service
  contract.
- [`data-mcp.md`](../../data/data-mcp.md) — DataMCPTool boundary.
- [`mcp-rfc-conformance`](../../../../../../alphaswarm/.cursor/rules/mcp-rfc-conformance.mdc) —
  RFC 9728 + RFC 8707 conformance.


<!-- https://alpha-swarm.ai/concepts/infrastructure/services/alphaswarm-ui -->
# alphaswarm-ui
> Cloud-hosted, multi-tenant operator dashboard at app.alpha-swarm.ai. Next.js 14+ App Router; Entra-only after the launcher refactor.

# alphaswarm-ui

The cloud-hosted, customer-facing operator dashboard. Auth-gated and
multi-tenant; Auth0 (B2C) was the historic provider but the
post-launcher-refactor surface is **Entra-only** — Auth0 has been
purged from the SPA bundle.

The public marketing site is a sibling, separate repo —
[`alphaswarm-website`](alphaswarm-website.md) at `alpha-swarm.ai`.

## Identity

| Field | Value |
| --- | --- |
| Service id | `alphaswarm-ui` |
| Role | `frontend` |
| Package | [`alphaswarm_ui/`](../../../../../../alphaswarm_ui/) |
| Image (key) | `ui` |
| Built from | (not Dockerfile-based — typically Vercel / Cloudflare Pages SSR; AQPUI CR can also stand it up as a Deployment in a cluster) |

## Wire

| Field | Value |
| --- | --- |
| Protocol | HTTP/1.1 + WebSocket |
| Port | `80` (container) / `3000` (Next.js dev) |
| Health | `GET /api/healthz` |
| Public URL | `https://app.alpha-swarm.ai` |
| Identity | Microsoft Entra (B2B SSO via `MsalEntraProvider`); `local` dev-stub gated by `ALPHASWARM_AUTH_DEV_STUB=true` (hard-disabled in production builds) |

## Routes

| Route | Purpose |
| --- | --- |
| `/login`, `/signup`, `/onboarding/*` | Provider-aware auth screens (Entra login + dev-stub) |
| `/dashboard`, `/strategies`, `/paper-runs`, `/backtests`, `/data`, `/ml`, `/agents`, `/workflows`, `/labs`, `/analytics`, `/research`, `/portfolio`, `/settings` | Operator dashboard |
| `/api/auth/entra/login`, `/callback`, `/logout`, `/stepup` | BFF route handlers — proxy to `alphaswarm-cp` `/auth/*` (Phase 3) |
| `/api/*` | Other BFF proxies (tenancy-scoped, kill-switch fan-out) |

The marketing routes (`/`, `/pricing`, `/docs`, `/legal`, `/about`,
`/blog`, `/changelog`) **moved out** to the
[`alphaswarm_website`](../../../../../../alphaswarm_website/) repo as part of the
controller-launcher refactor.

## Deployment surfaces

| Surface | Where |
| --- | --- |
| Hosted (canonical) | Cloudflare Pages or Vercel — pinned `next >=14.2.25` for CVE-2025-29927 |
| Cluster (option) | [`AQPUI`](../../../../../../alphaswarm_controller/src/alphaswarm_controller/operator/crds/aqpui_cr.py) CR — Deployment + Service + Ingress |
| Identity provisioning | [`alphaswarm_platform/terraform/modules/alphaswarm_ui_identity/`](../../../../../../alphaswarm_platform/terraform/modules/alphaswarm_ui_identity/) |

## Dependencies

**Upstream (HTTP-only):**

- `alphaswarm-cp` (`/auth/*`, `/manage/*`) — every BFF route delegates here.
- `alphaswarm-core` (`/api/*`) — for tenancy-scoped business calls
  the BFF routes proxy.

**Downstream:**

- B2C and B2B users; multi-tenant via `EntraTenantLink` rows in the
  controller's database.

## Operations

- **Bundle pinning:** `next >=14.2.25`, the first 14.x release patched for CVE-2025-29927 (middleware authorization bypass).
- **CSP:** restricted to `manage.alpha-swarm.ai` and the controller's
  `*.alpha-swarm.ai` cell ingresses.
- **No client-side auth SDK:** the SPA never reads an Entra token —
  only the BFF route handlers do.
- **Dev-stub:** `ALPHASWARM_AUTH_DEV_STUB=true` writes a Local Dev
  User session inline. It is hard-disabled in production builds.
- **Auth0 guard:** the
  [`scripts/ci/check_alphaswarm_ui_no_auth0.py`](../../../../../../alphaswarm_ui/scripts/ci/check_alphaswarm_ui_no_auth0.py)
  CI check fails on any new Auth0 reference.
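
A guard of the kind mentioned above can be approximated in a few lines. This is a minimal sketch, not the contents of `check_alphaswarm_ui_no_auth0.py`: the file suffixes, the case-insensitive `auth0` pattern, and the function name are assumptions made for illustration.

```python
import re
from pathlib import Path

AUTH0_PATTERN = re.compile(r"auth0", re.IGNORECASE)
SOURCE_SUFFIXES = {".ts", ".tsx", ".js", ".jsx", ".json"}


def find_auth0_references(root: Path) -> list:
    """Return (file, line number, line) for every Auth0 mention under `root`."""
    hits = []
    for path in sorted(root.rglob("*")):
        if "node_modules" in path.parts:
            continue  # skip vendored dependencies
        if not path.is_file() or path.suffix not in SOURCE_SUFFIXES:
            continue
        text = path.read_text(encoding="utf-8", errors="ignore")
        for lineno, line in enumerate(text.splitlines(), start=1):
            if AUTH0_PATTERN.search(line):
                hits.append((path, lineno, line.strip()))
    return hits
```

A CI wrapper could print each hit and exit non-zero when the list is non-empty.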

## See also

- [`alphaswarm_ui/AGENTS.md`](../../../../../../alphaswarm_ui/AGENTS.md) — hard
  boundaries.
- [`alphaswarm-website.md`](alphaswarm-website.md) — public marketing
  sibling.
- [`identity.md`](../../identity/identity.md) — Entra integration
  contract.


<!-- https://alpha-swarm.ai/concepts/infrastructure/services/alphaswarm-website -->
# alphaswarm-website
> Public-facing marketing site at alpha-swarm.ai. Next.js 14+ App Router on Cloudflare Pages; no auth, no API calls, intentionally separate from the operator dashboard.

# alphaswarm-website

The public marketing site. Lives in its own repo
([`alphaswarm_website/`](../../../../../../alphaswarm_website/)) and is hosted on
Cloudflare Pages so the marketing surface survives cluster outages
the same way the docs do.

This is **not** the operator dashboard — that is
[`alphaswarm-ui`](alphaswarm-ui.md) at `app.alpha-swarm.ai`.
Cross-links from this site to the dashboard go through
`NEXT_PUBLIC_ALPHASWARM_APP_URL`.

## Identity

| Field | Value |
| --- | --- |
| Service id | `alphaswarm-website` |
| Role | `marketing` |
| Package | [`alphaswarm_website/`](../../../../../../alphaswarm_website/) |
| Hosted on | Cloudflare Pages |
| Public URL | `https://alpha-swarm.ai`, `https://www.alpha-swarm.ai` |

## Routes

- `/` — homepage
- `/about`, `/blog`, `/changelog`
- `/pricing`, `/cloud`, `/self-hosted`
- `/product/{agentops,reinforcement-learning,data-platform,backtesting}`
- `/learn`, `/learn/`
- `/docs/[[...slug]]` — public docs links (deep-link to
  `alphaswarm-docs`)
- `/legal/[doc]` — terms, privacy, security, dpa, contact
- `/login`, `/signup`, `/onboarding` — thin 307 redirects to
  `${NEXT_PUBLIC_ALPHASWARM_APP_URL}/...`
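
The redirect rule in the last route entry can be expressed as a small lookup. The site itself is a Next.js app, so this Python sketch only illustrates the logic; `redirect_for` and the prefix matching are assumptions, while the environment variable name and the 307 status come from the list above.

```python
import os
from typing import Optional, Tuple

APP_URL_ENV = "NEXT_PUBLIC_ALPHASWARM_APP_URL"
REDIRECTED_PREFIXES = ("/login", "/signup", "/onboarding")


def redirect_for(path: str, app_url: Optional[str] = None) -> Optional[Tuple[int, str]]:
    """Return (307, target) for auth paths, or None for ordinary marketing routes."""
    base = (app_url or os.environ.get(APP_URL_ENV, "")).rstrip("/")
    for prefix in REDIRECTED_PREFIXES:
        if path == prefix or path.startswith(prefix + "/"):
            return 307, f"{base}{path}"
    return None
```

A 307 keeps the request method intact, so a form post to `/signup` would reach the dashboard as a post.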

## Hard boundaries

Per [`alphaswarm_website/AGENTS.md`](../../../../../../alphaswarm_website/AGENTS.md):

- No authentication SDKs (no `@auth0/*`, no `@azure/msal-*`, no
  `iron-session`).
- No imports of `alphaswarm.*` or `alphaswarm_controller.*`.
- No client-side state libraries (no `@tanstack/react-query`, no
  `zustand`, no `antd`).
- No secrets in env — only the public app URL and port.
- Next.js pinned `>=14.2.25` for CVE-2025-29927.

## Deployment surface

| Surface | Where |
| --- | --- |
| Terraform module | [`alphaswarm_platform/terraform/modules/cloudflare_pages_docs/`](../../../../../../alphaswarm_platform/terraform/modules/cloudflare_pages_docs/) (a dedicated `cloudflare_pages_marketing` module is forthcoming) |
| Build | `pnpm build` in `alphaswarm_website/` — Cloudflare Pages picks up the GitHub branch |

## See also

- [`alphaswarm-ui.md`](alphaswarm-ui.md) — the auth-gated operator
  dashboard at `app.alpha-swarm.ai`.
- [`alphaswarm-docs.md`](alphaswarm-docs.md) — public docs at
  `docs.alpha-swarm.ai`.
- [`alphaswarm_website/AGENTS.md`](../../../../../../alphaswarm_website/AGENTS.md) —
  hard boundaries.


<!-- https://alpha-swarm.ai/concepts/infrastructure/services/alphaswarm-worker -->
# alphaswarm-worker
> Celery orchestration worker for the AlphaSwarm runtime — drains the light/coordination queues (default, paper, terraform, ingestion, workflows). Heavy compute moves to alphaswarm-executor.

# alphaswarm-worker

Celery **orchestration** worker pod that drains the light / coordination
queues produced by `alphaswarm-core`.

As of the Phase 4c worker/executor split it has its own slim image
(target `worker`) carrying only the task-dispatch + lineage surface —
**not** the API stage's `visualization` / `dev` / Dash deps it used to
inherit. Heavy compute (backtest / training / ML / agents / factors /
RAG) is offloaded to the sibling [`alphaswarm-executor`](alphaswarm-executor.md).
See [worker vs executor images](../worker-executor-images.md) for the
full rationale and dependency matrix.

## Identity

| Field | Value |
| --- | --- |
| Service id | `alphaswarm-worker` |
| Role | `worker` |
| Package | [`alphaswarm/`](../../../../../../alphaswarm/) (tasks under `alphaswarm/tasks/*.py`) |
| Image (key) | `worker` |
| Built from | [`alphaswarm_platform/Dockerfile`](../../../../../../alphaswarm_platform/Dockerfile) (target `worker`, multi-arch) or the standalone [`build/docker/alphaswarm_worker/Dockerfile`](../../../../../../alphaswarm_platform/build/docker/alphaswarm_worker/Dockerfile) |

## Wire

| Field | Value |
| --- | --- |
| Protocol | none (no HTTP listener) |
| Health | Celery broker connection probe + Prometheus metrics on `:9100` (when enabled) |
| Public URL | — |
| Broker | `redis://redis:6379/0` |
| Result backend | `redis://redis:6379/1` |

## Deployment surfaces

| Surface | Where |
| --- | --- |
| Compose | service `worker` in [`alphaswarm_platform/compose/docker-compose.yml`](../../../../../../alphaswarm_platform/compose/docker-compose.yml); `alphaswarm-worker` in [`deployments/compose/docker-compose.local.yml`](../../../../../../alphaswarm_platform/deployments/compose/docker-compose.local.yml) |
| Kustomize | [`deployments/kubernetes/base/alphaswarm-worker/`](../../../../../../alphaswarm_platform/deployments/kubernetes/base/alphaswarm-worker/) — Deployment + HPA + PDB |
| AQP CR | folded into [`AQPMonolith`](../../../../../../alphaswarm_controller/src/alphaswarm_controller/operator/crds/aqpmonolith_cr.py) (`spec.workers.queues`) |
| Terraform module | [`alphaswarm_platform/terraform/modules/faas/`](../../../../../../alphaswarm_platform/terraform/modules/faas/) — Celery + KEDA per-queue ScaledObjects |

## Queue families

The orchestration worker drains the **light / coordination** queues only.
The heavy compute queues (`backtest`, `training`, `ml`, `agents`,
`factors`, `rag`) are drained by [`alphaswarm-executor`](alphaswarm-executor.md).
KEDA scales each queue family independently. Default queue map:

| Queue | Drives | Scale-to-zero | Notes |
| --- | --- | --- | --- |
| `default` | misc tasks, callbacks, lookups | yes | prod overrides this with an always-on `min=1` |
| `paper` | paper trading session ticks | no | sub-second latency required |
| `terraform` | TerraformRuntime celery wrappers | yes | |
| `ingestion` | Airbyte / Dagster / connector pulls | no | uses long-lived workers |
| `workflows` | WorkflowRuntime orchestration | yes | |
| `hft` | HFT hot-path event handlers | no | pinned to hft-nodes (compose/legacy) |

:::note
The `faas` KEDA module keys per-queue Deployments off `local.heavy_queues`
— heavy queues run the `alphaswarm-executor` image, everything else runs
this `alphaswarm-worker` image. The two image sets never share a queue.
:::
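
The split described in the note reduces to a membership test on the heavy-queue set. A minimal sketch, assuming the heavy queue names listed above; `image_for_queue` and `partition` are hypothetical helpers, and the authoritative list is `local.heavy_queues` in the `faas` module.

```python
from collections import defaultdict
from typing import Dict, Iterable, List

HEAVY_QUEUES = frozenset({"backtest", "training", "ml", "agents", "factors", "rag"})

WORKER_IMAGE = "alphaswarm-worker"
EXECUTOR_IMAGE = "alphaswarm-executor"


def image_for_queue(queue: str) -> str:
    """Heavy queues run the executor image; everything else runs the worker image."""
    return EXECUTOR_IMAGE if queue in HEAVY_QUEUES else WORKER_IMAGE


def partition(queues: Iterable[str]) -> Dict[str, List[str]]:
    """Group queue names by the image that drains them."""
    grouped: Dict[str, List[str]] = defaultdict(list)
    for queue in queues:
        grouped[image_for_queue(queue)].append(queue)
    return {image: sorted(names) for image, names in grouped.items()}
```

Because each queue maps to exactly one image, the two image sets cannot end up draining the same queue.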

## Dependencies

**Upstream:**

- `redis` — broker + result backend.
- `postgres` — task lookups, ledger writes.
- `alphaswarm-core` — progress emit callbacks, lookup APIs.
- All data-plane services the `alphaswarm-core` pod depends on (the
  same code paths run inside Celery).

**Downstream:**

- `alphaswarm-beat` schedules tasks; the worker consumes them.
- HFT-tagged tasks land on the `hft-nodes/` workload (PTP-tuned).

## Operations

- **Scaling:** KEDA `ScaledObject` per queue; idle queues scale to
  zero unless the queue table above marks them `no`. The per-queue
  `min`/`max` lives in the
  [`faas`](../../../../../../alphaswarm_platform/terraform/modules/faas/) Terraform
  module.
- **Concurrency:** 4 per orchestration worker pod (light, IO-bound
  dispatch work); 1 for HFT workers (single-threaded pinning).
- **Drain on shutdown:** `terminationGracePeriodSeconds: 600` so
  in-flight tasks complete; `preStop` sends `SIGTERM` to Celery.
- **Audit:** `WorkloadRuntime` actions land `workload_runs` rows; the
  worker pod respects the kill-switch Redis key the same way the API
  does.
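
Per-queue scaling with `min`/`max` bounds can be sketched as the usual "ceil(depth / target), then clamp" rule. The target of tasks per replica and the bounds below are hypothetical; the real values live in the `faas` Terraform module.

```python
import math


def desired_replicas(queue_depth: int, target_per_replica: int,
                     min_replicas: int, max_replicas: int) -> int:
    """Replica count for one queue, clamped to its min/max bounds."""
    if target_per_replica <= 0:
        raise ValueError("target_per_replica must be positive")
    wanted = math.ceil(queue_depth / target_per_replica)
    return max(min_replicas, min(max_replicas, wanted))
```

A queue with `min_replicas=0` drops to zero pods when empty, while one with `min_replicas=1` keeps a pod warm regardless of depth.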

## See also

- [`tasks-api.mdc`](../../../../../../alphaswarm/.cursor/rules/tasks-api.mdc) — Celery
  task progress contract + Redis pub/sub frame shape.
- [`alphaswarm-executor.md`](alphaswarm-executor.md) — heavy-compute sibling.
- [`worker-executor-images.md`](../worker-executor-images.md) — image split rationale + dependency matrix.
- [`alphaswarm-beat.md`](alphaswarm-beat.md) — sibling scheduler.
- [`faas` Terraform module](../../../../../../alphaswarm_platform/terraform/modules/faas/) —
  KEDA scaling source of truth.


<!-- https://alpha-swarm.ai/concepts/infrastructure/services/chromadb -->
# chromadb
> Vector store — fallback / dev embedding store. Production cells use `milvus`; ChromaDB stays for local-dev parity and small-collection cases.

# chromadb

A vector store used for embedding indices in dev cells and
small-collection production cases. Larger production cells use
[`milvus`](https://milvus.io/) instead — ChromaDB stays in the topology
so the local-dev compose stack and per-cell base manifests keep
parity.

## Identity

| Field | Value |
| --- | --- |
| Service id | `chromadb` |
| Role | `vector-store` |
| Image | `chromadb/chroma:1.0.16` |
| Port | `8000` (in-cluster) / `8001` (host bind in compose to avoid clashing with `alphaswarm-core`) |
| Storage | ephemeral by default; PVC-backed in cluster |

## Deployment surfaces

| Surface | Where |
| --- | --- |
| Compose | service `chromadb` in [`alphaswarm_platform/compose/docker-compose.yml`](../../../../../../alphaswarm_platform/compose/docker-compose.yml) |
| Kustomize | [`deployments/kubernetes/base-services/chromadb/`](../../../../../../alphaswarm_platform/deployments/kubernetes/base-services/chromadb/) |
| Companion | [`base-services/milvus/`](../../../../../../alphaswarm_platform/deployments/kubernetes/base-services/milvus/) — production-grade alternative |

## Dependencies

**Upstream:** none.

**Downstream:**

- `alphaswarm-core` for RAG retrieval (when feature flag
  `ALPHASWARM_VECTOR_STORE=chromadb`).
- `alphaswarm-worker` for embedding ingest tasks.

## Operations

- **Collection lifecycle:** managed by the `HierarchicalRAG`
  package; never created directly by agents.
- **Vector dimensions:** must match the active embedding model
  (default `BAAI/bge-m3` at 1024-dim). Mismatch is a hard error.
- **Backup:** the in-cell PVC is snapshotted nightly; production
  cells with significant collections should swap to Milvus.
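
The dimension rule above is a simple pre-write check. A minimal sketch, assuming a hypothetical `check_dimension` helper; the 1024 default comes from the `BAAI/bge-m3` figure quoted in the list.

```python
from typing import Sequence

EXPECTED_DIM = 1024  # dense embedding size for BAAI/bge-m3, per the text above


def check_dimension(vector: Sequence[float], expected: int = EXPECTED_DIM) -> None:
    """Raise ValueError when an embedding does not match the collection's dimension."""
    if len(vector) != expected:
        raise ValueError(
            f"embedding has {len(vector)} dims, collection expects {expected}"
        )
```

Failing before the write keeps a model swap from silently mixing incompatible vectors in one collection.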

## See also

- [`alphaswarm/data/rag/`](../../../../../../alphaswarm/alphaswarm/data/rag/) —
  HierarchicalRAG package.
- [`base-services/milvus/`](../../../../../../alphaswarm_platform/deployments/kubernetes/base-services/milvus/) —
  production alternative.


<!-- https://alpha-swarm.ai/concepts/infrastructure/services/grafana -->
# grafana
> Primary metrics + logs + traces visualization layer. Datasources point at Prometheus, VictoriaMetrics, Loki, and Jaeger.

# grafana

The platform's primary dashboard surface. Bundled with the
kube-prometheus-stack and pre-loaded with datasources for Prometheus,
VictoriaMetrics, Loki (logs), and Jaeger (traces).

## Identity

| Field | Value |
| --- | --- |
| Service id | `grafana` |
| Role | `observability` |
| Image | `grafana/grafana` (managed by kube-prometheus-stack) |
| Port | `3000` |
| Health | `/api/health` |

## Deployment surfaces

| Surface | Where |
| --- | --- |
| Kustomize | folded into [`observability/kube-prometheus-stack/`](../../../../../../alphaswarm_platform/deployments/kubernetes/observability/kube-prometheus-stack/) (Helm chart bundles Grafana) |
| Standalone | (none — Grafana is always shipped with the stack) |

## Datasources

| Datasource | Backend |
| --- | --- |
| `Prometheus` | in-cluster Prometheus |
| `VictoriaMetrics` | in-cluster VM |
| `Loki` | in-cluster Loki |
| `Jaeger` | in-cluster Jaeger |
| `Phoenix` (when enabled) | Phoenix's Postgres backend (read-only) |

## Dashboards

Provisioned via ConfigMaps under
`observability/kube-prometheus-stack/dashboards/`. Default set covers:

- Cluster health (kube-state-metrics).
- AlphaSwarm runtime (API latency, Celery queue depth, kill-switch
  state, terraform_run lag).
- Per-service Linkerd proxy metrics.
- Per-cell tenant overlays.

Custom dashboards land via PR — never via the Grafana UI alone (UI
edits are wiped on the next reconciliation).

## Operations

- **Auth:** OIDC against the staff Entra tenant; the
  `alphaswarm-staff` group maps to admin, `alphaswarm-operators` to
  editor, and any other authenticated user to viewer.
- **Persistence:** Grafana DB is SQLite by default (folded into the
  Helm chart); production cells point at a per-cell Postgres
  schema.
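
The group-to-role mapping above has a precedence order: staff first, operators second, everyone else last. A minimal sketch of that precedence, assuming a hypothetical `grafana_role` helper; in a real deployment this logic would normally sit in Grafana's OAuth role-mapping configuration rather than in Python.

```python
from typing import Iterable


def grafana_role(groups: Iterable[str]) -> str:
    """Map Entra group membership to a Grafana role, highest privilege first."""
    members = set(groups)
    if "alphaswarm-staff" in members:
        return "Admin"
    if "alphaswarm-operators" in members:
        return "Editor"
    return "Viewer"
```

A user in both groups resolves to `Admin`, since the staff check runs first.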

## See also

- [`prometheus.md`](prometheus.md), [`victoriametrics.md`](victoriametrics.md),
  [`loki.md`](loki.md), [`jaeger.md`](jaeger.md) — backing
  datasources.
- [`observability-stack.md`](../../trading/observability-stack.md) —
  stack composition.


<!-- https://alpha-swarm.ai/concepts/infrastructure/services/jaeger -->
# jaeger
> Distributed tracing backend for the infrastructure pipeline. AI / LLM spans land in Phoenix instead.

# jaeger

Distributed tracing backend for the infrastructure trace pipeline —
HTTP, database, queue, and inter-service spans land here. AI / LLM
spans (OpenInference) route to [`phoenix`](https://docs.arize.com/phoenix)
instead.

## Identity

| Field | Value |
| --- | --- |
| Service id | `jaeger` |
| Role | `observability` |
| Image | `jaegertracing/all-in-one` (in-cell) / `jaegertracing/jaeger-collector` + `jaegertracing/jaeger-query` (split in cloud cells) |
| Port | `6831` (UDP — agent), `14250` (gRPC — collector), `16686` (HTTP — query/UI) |
| Health | `/` |

## Deployment surfaces

| Surface | Where |
| --- | --- |
| Kustomize | [`observability/jaeger/`](../../../../../../alphaswarm_platform/deployments/kubernetes/observability/jaeger/) — all-in-one Deployment + Service |
| Cloud cells | split mode — Collector behind a Service, Query behind ingress |

## Dependencies

**Upstream:** `otel-collector` — fans the infrastructure trace
pipeline here.

**Downstream:** Grafana (datasource) and the operator's UI for
manual span inspection.

## Operations

- **Storage:** in-cell uses badger (ephemeral); cloud cells back with
  Elasticsearch / OpenSearch.
- **Retention:** 7 days in-cell, 30 days in cloud.
- **Sampling:** receives only what the `otel-collector` tail-sampling
  policy keeps: 5% of healthy traces plus 100% of error spans.

## See also

- [`otel-collector.md`](otel-collector.md) — routing source.
- [`observability.md`](../../trading/observability.md) — concept doc.


<!-- https://alpha-swarm.ai/concepts/infrastructure/services/loki -->
# loki
> Log aggregation. Receives logs from `vector` (the shipper) and serves Grafana.

# loki

Grafana Loki — the log aggregation backend. Receives logs from the
`vector` DaemonSet (the canonical shipper) and serves Grafana for
queries.

## Identity

| Field | Value |
| --- | --- |
| Service id | `loki` |
| Role | `observability` |
| Image | `grafana/loki:3.3.2` |
| Port | `3100` |
| Health | `/ready` |

## Deployment surfaces

| Surface | Where |
| --- | --- |
| Compose | service `loki` in [`alphaswarm_platform/compose/docker-compose.platform.yml`](../../../../../../alphaswarm_platform/compose/docker-compose.platform.yml) (platform overlay) |
| Kustomize | [`observability/loki/`](../../../../../../alphaswarm_platform/deployments/kubernetes/observability/loki/) — single-binary StatefulSet (in-cell); split-monolithic mode in cloud cells |

## Dependencies

**Upstream:** `vector` (the canonical shipper).

**Downstream:** `grafana` (datasource).

## Operations

- **Storage:** in-cell uses local PVC; cloud cells back with
  S3 / GCS / ADLS object storage.
- **Retention:** 14 days default; 30 days for `audit-*` streams (per
  the audit-evidence retention policy).
- **Tenancy:** every log line carries a `tenant_id` label so Loki's
  multi-tenancy split is enforced at query time.
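
The retention rule above is a pattern match on the stream name. A minimal sketch, assuming a hypothetical `retention_days` helper and glob-style matching on the `audit-*` prefix; the actual policy is configured in Loki, not in application code.

```python
from fnmatch import fnmatchcase

DEFAULT_RETENTION_DAYS = 14
AUDIT_RETENTION_DAYS = 30


def retention_days(stream_name: str) -> int:
    """Return the retention period in days for a log stream."""
    if fnmatchcase(stream_name, "audit-*"):
        return AUDIT_RETENTION_DAYS
    return DEFAULT_RETENTION_DAYS
```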

## See also

- [`vector.md`](vector.md) — log shipper.
- [`grafana.md`](grafana.md) — visualization layer.


<!-- https://alpha-swarm.ai/concepts/infrastructure/services/mlflow -->
# mlflow
> Model registry + experiment tracker. Backs Predictor Hub, AlphaBacktestExperiment, walk-forward, and the finetune trainers.

# mlflow

The platform's model registry + experiment tracker. Owned by
`alphaswarm_models` — every Predictor, AlphaBacktestExperiment,
walk-forward run, and finetune trainer registers here.

## Identity

| Field | Value |
| --- | --- |
| Service id | `mlflow` |
| Role | `mlops` |
| Image | `ghcr.io/mlflow/mlflow:v2.11.1` |
| Port | `5000` |
| Storage | object store for artifacts (MinIO / S3 / GCS / ADLS depending on cloud); Postgres backend for the tracking store |

## Deployment surfaces

| Surface | Where |
| --- | --- |
| Compose | service `mlflow` in [`alphaswarm_platform/compose/docker-compose.yml`](../../../../../../alphaswarm_platform/compose/docker-compose.yml) |
| Kustomize | [`deployments/kubernetes/base-services/mlflow/`](../../../../../../alphaswarm_platform/deployments/kubernetes/base-services/mlflow/) — Deployment + Service + ExternalSecret-backed credentials |
| MLOps overlay | reachable through [`mlops/`](../../../../../../alphaswarm_platform/deployments/kubernetes/mlops/) when paired with Argo Workflows + Dagster |

## Dependencies

**Upstream:**

- `postgres` — tracking store.
- `minio` / `s3` / `gcs` / `azblob` — artifact store.

**Downstream:**

- `alphaswarm-core`, `alphaswarm-worker` — every Predictor / Skill /
  walk-forward / finetune flow registers runs here.
- `alphaswarm-ml-mcp` — read paths surface through the `data.ml.*`
  MCP slice.

## Operations

- **Auth:** behind the cluster ingress; the in-cluster URL is the
  only path. Local dev exposes `http://localhost:5000` for browser
  inspection.
- **Pruning:** retention policy lives at
  `alphaswarm/tasks/cleanup/mlflow_prune.py` — run by beat weekly.
- **Run tagging:** every run is tagged with the originating
  `experiment_id` + `test_id` per AGENTS rule 34 so audit queries can
  correlate ML runs with strategy / backtest activity.
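
The run-tagging rule above can be enforced with a small builder that refuses incomplete tags. A minimal sketch: `build_run_tags` is a hypothetical helper, and using `experiment_id` and `test_id` as the literal tag keys is an assumption.

```python
from typing import Dict


def build_run_tags(experiment_id: str, test_id: str) -> Dict[str, str]:
    """Return the correlation tags for a run, rejecting empty identifiers."""
    if not experiment_id or not test_id:
        raise ValueError("experiment_id and test_id are both required")
    return {"experiment_id": experiment_id, "test_id": test_id}
```

The resulting dict could be passed to `mlflow.set_tags(...)` inside an active run.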

## See also

- [`mlops-service.md`](../../strategy/mlops-service.md) — how
  `alphaswarm_models` lays MLflow underneath the Skill / Predictor
  contract.
- [`ml-framework.md`](../../strategy/ml-framework.md) — model
  framework overview.
- [`alphaswarm_models/AGENTS.md`](../../../../../../alphaswarm_models/AGENTS.md) — boundary
  rules.


<!-- https://alpha-swarm.ai/concepts/infrastructure/services/neo4j -->
# neo4j
> Graph database — canonical home for the ownership graph, the bipartite lineage DAG, and the entity-graph service.

# neo4j

The canonical graph store. Holds the ownership graph (Workstream F),
the bipartite lineage DAG (Workstream A + B), and the entity-graph
service (instruments, companies, datasets, pipeline assets, service
metadata). Postgres carries the snapshot rows; Neo4j carries the
traversable relationships.

## Identity

| Field | Value |
| --- | --- |
| Service id | `neo4j` |
| Role | `graph` |
| Image | `neo4j:5-community` |
| Port | `7474` (HTTP) + `7687` (Bolt) |
| Storage | 5 Gi PVC (cell-local); managed Neo4j Aura recommended for prod cells |

## Deployment surfaces

| Surface | Where |
| --- | --- |
| Compose | service `neo4j` in [`alphaswarm_platform/compose/docker-compose.yml`](../../../../../../alphaswarm_platform/compose/docker-compose.yml) |
| Kustomize | rolled into `base-services/` (cell-local StatefulSet) |
| Terraform | not provisioned by a managed module today; cloud templates run a containerised StatefulSet behind the cell's storage class |

## Dependencies

**Upstream:** none.

**Downstream:**

- `alphaswarm-core` — ownership graph reads via `data.ownership.*`
  MCP tool; lineage relay writes through OpenLineage adapter.
- `alphaswarm-worker` — sync tasks that mirror Postgres rows into
  Neo4j edges.

## Sync semantics

- Postgres remains the canonical source of truth for entity
  *attributes*; Neo4j holds the *relationships*.
- Sync is event-driven via the `lineage` queue family; backfills run
  through `data.lineage.replay` Celery tasks.
- Read paths go through the `data.ownership.*` and `data.lineage.*`
  DataMCP tools — the agentic plane MUST NOT speak Bolt directly.
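
Mirroring a Postgres relationship row into a Neo4j edge is typically an idempotent `MERGE`. This is a sketch under stated assumptions: the `Entity` label, the relationship types, and `edge_upsert` are hypothetical, and in this codebase the query text would live under the stored-queries directory rather than inline.

```python
from typing import Dict, Tuple

# Hypothetical relationship types; the type is interpolated into the query text,
# so it is checked against an allow-list first.
ALLOWED_REL_TYPES = frozenset({"OWNS", "DERIVED_FROM"})


def edge_upsert(src_id: str, dst_id: str, rel_type: str) -> Tuple[str, Dict[str, str]]:
    """Build an idempotent Cypher upsert for one relationship row."""
    if rel_type not in ALLOWED_REL_TYPES:
        raise ValueError(f"relationship type not allowed: {rel_type!r}")
    query = (
        "MERGE (a:Entity {id: $src_id}) "
        "MERGE (b:Entity {id: $dst_id}) "
        f"MERGE (a)-[:{rel_type}]->(b)"
    )
    # Node ids travel as parameters, never as interpolated text.
    return query, {"src_id": src_id, "dst_id": dst_id}
```

Because every clause is a `MERGE`, replaying the same row during a backfill leaves the graph unchanged.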

## Operations

- **Auth:** username/password sourced from an ExternalSecret; Bolt
  traffic is encrypted in transit by Linkerd mTLS.
- **Backups:** native `neo4j-admin database backup` cron to MinIO/S3.
- **Cypher style:** queries are stored under
  `alphaswarm/data/sources/graph/queries/`; ad-hoc Cypher in agent
  prompts is forbidden.

## See also

- [`ownership-graph`](../../../../../../alphaswarm/.cursor/rules/ownership-graph.mdc) —
  ownership graph contract (Workstream F).
- [`lineage-graph`](../../../../../../alphaswarm/.cursor/rules/lineage-graph.mdc) —
  bipartite lineage DAG + OpenLineage relay (Workstream A + B).
- [`entity-graph-services.md`](../../platform/entity-graph-services.md) —
  entity registry + service control via Neo4j.


<!-- https://alpha-swarm.ai/concepts/infrastructure/services/otel-collector -->
# otel-collector
> OpenTelemetry collector — central OTLP gateway routing infra spans to Tempo/Jaeger, AI/LLM spans to Phoenix, metrics to Prometheus + VictoriaMetrics, logs to Loki.

# otel-collector

The single OTLP ingress for the cluster. Every workload pod sends
traces, metrics, and logs to this gateway; the gateway fans out by
signal type to the appropriate backend.

## Identity

| Field | Value |
| --- | --- |
| Service id | `otel-collector` |
| Role | `observability` |
| Image | `otel/opentelemetry-collector` (gateway flavour) — pinned in [`observability/opentelemetry-collector-gateway/`](../../../../../../alphaswarm_platform/deployments/kubernetes/observability/opentelemetry-collector-gateway/) |
| Port | `4317` (OTLP gRPC) + `4318` (OTLP HTTP) |
| Health | `:13133/` (extensions health_check) |

## Deployment surfaces

| Surface | Where |
| --- | --- |
| Compose | service `otel-collector` in [`alphaswarm_platform/compose/docker-compose.yml`](../../../../../../alphaswarm_platform/compose/docker-compose.yml) |
| Kustomize | [`observability/opentelemetry-collector-gateway/`](../../../../../../alphaswarm_platform/deployments/kubernetes/observability/opentelemetry-collector-gateway/) — gateway Deployment + DaemonSet agent (canonical) |
| Operator | [`observability/opentelemetry-operator/`](../../../../../../alphaswarm_platform/deployments/kubernetes/observability/opentelemetry-operator/) — auto-instrumentation CRDs |
| Legacy | [`observability/otel-collector/`](../../../../../../alphaswarm_platform/deployments/kubernetes/observability/otel-collector/) — rollback only; NOT wired to overlays |

## Routing

| Signal | Destination |
| --- | --- |
| `traces.infrastructure` | Jaeger (in-cell) / Tempo (cloud cells) |
| `traces.ai` (OpenInference spans) | [`phoenix`](https://github.com/Arize-ai/phoenix) |
| `metrics` | VictoriaMetrics + Prometheus (parallel during cutover) |
| `logs` | Loki (via Vector) |

The split happens in the OTel `routing` connector: spans tagged with
`service.namespace=alphaswarm.ai` route to Phoenix; everything else
goes to the infra trace pipeline.
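
The routing table and the namespace rule above can be summarised as one decision function. A minimal sketch in Python, assuming a hypothetical `route` helper and lowercase backend names; the real behaviour is defined in the collector's pipeline configuration.

```python
from typing import List, Mapping

AI_NAMESPACE = "alphaswarm.ai"


def route(signal: str, resource_attrs: Mapping[str, str], cloud_cell: bool = False) -> List[str]:
    """Return the backends that receive a given signal."""
    if signal == "traces":
        if resource_attrs.get("service.namespace") == AI_NAMESPACE:
            return ["phoenix"]  # AI / LLM spans
        return ["tempo" if cloud_cell else "jaeger"]  # infrastructure spans
    if signal == "metrics":
        return ["victoriametrics", "prometheus"]  # written in parallel during cutover
    if signal == "logs":
        return ["loki"]
    raise ValueError(f"unknown signal: {signal!r}")
```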

## Dependencies

**Upstream:** every alphaswarm workload pod (auto-instrumentation
through the OTel operator + manual SDK init in `alphaswarm/observability/`).

**Downstream:** Jaeger (in-cell) or Tempo (cloud cells), Phoenix, Prometheus, VictoriaMetrics, Loki.

## Operations

- **Sampling:** tail-based for traces — keep 100% of error spans, 5%
  of healthy traffic. Tuned per cell.
- **Resource tagging:** every span carries `tenant_id`, `cell_id`,
  `service.id` (matching topology), and `experiment_id` /
  `test_id` when set.
- **Auto-instrumentation:** Python via `opentelemetry-distro`; Node
  via the OTel operator's auto-injected sidecar; Go services use
  manual SDK.
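
The sampling policy can be read as a per-trace keep/drop decision taken once the trace is complete. The sketch below is a Python illustration under assumptions: the hash-based bucketing and the name `keep_trace` are hypothetical, and the real decision is made inside the collector.

```python
import hashlib

HEALTHY_KEEP_RATIO = 0.05  # 5% of healthy traffic; tuned per cell


def keep_trace(trace_id: str, has_error: bool,
               ratio: float = HEALTHY_KEEP_RATIO) -> bool:
    """Decide whether a completed trace is kept."""
    if has_error:
        return True  # error traces are always kept
    # Map the trace id to a stable 64-bit bucket so the same trace
    # always gets the same decision, whichever replica evaluates it.
    digest = hashlib.sha256(trace_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big")
    return bucket < int(ratio * 2**64)
```

Hashing the trace id rather than drawing a random number keeps the decision reproducible, which helps when comparing sampled volumes across cells.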

## See also

- [`observability.md`](../../trading/observability.md) —
  observability concept doc.
- [`observability-stack.md`](../../trading/observability-stack.md) —
  stack composition + dashboards.
- [`phoenix`](https://docs.arize.com/phoenix) — AI / LLM observability
  upstream.


<!-- https://alpha-swarm.ai/concepts/infrastructure/services/postgres -->
# postgres
> Primary OLTP database with pgvector — strategies, bots, runs, ledgers, ownership graph snapshots, terraform_runs, security_audit_events.

# postgres

The platform's primary OLTP database. Holds every relational table the
runtime depends on — strategies, bots, runs, ledgers, the ownership
graph snapshot, the `*_spec_versions` tables for hash-locked specs,
`workload_runs`, `terraform_runs`, `security_audit_events`, and the
multi-tenant `EntraTenantLink` index.

## Identity

| Field | Value |
| --- | --- |
| Service id | `postgres` |
| Role | `database` |
| Image | `pgvector/pgvector:pg16` (compose) / `ankane/pgvector:v0.5.1` (deployments/compose) — Postgres 16 + pgvector |
| Port | `5432` (in-cluster) / `5433` (host bind in compose to avoid clash with system Postgres) |
| Storage | 5 Gi PVC in StatefulSet (cell-local); RDS in `aws-*` templates; Cloud SQL in `gcp-*`; Azure DB in `azure-*` |

## Deployment surfaces

| Surface | Where |
| --- | --- |
| Compose | service `postgres` in [`alphaswarm_platform/compose/docker-compose.yml`](../../../../../../alphaswarm_platform/compose/docker-compose.yml) |
| Kustomize | [`deployments/kubernetes/base-services/postgres-shared/`](../../../../../../alphaswarm_platform/deployments/kubernetes/base-services/postgres-shared/) — StatefulSet + Service + ClusterSecretStore-backed credentials |
| Terraform module | [`alphaswarm_platform/terraform/modules/storage/`](../../../../../../alphaswarm_platform/terraform/modules/storage/) — RDS (AWS) / Cloud SQL (GCP) / Azure DB / containerised (local) |
| Companion module | [`alphaswarm_platform/terraform/modules/database/`](../../../../../../alphaswarm_platform/terraform/modules/database/) — PgBouncer connection pooler + Alembic migration Job |

## Dependencies

**Upstream:** none.

**Downstream:**

- `alphaswarm-core`, `alphaswarm-worker`, `alphaswarm-beat` —
  primary read/write.
- `alphaswarm-cp` — workload + terraform ledger writes.
- `alphaswarm-admin` — admin ledger.
- `mlflow` — embedded postgres backend (or pointed at this one in
  prod).

## Operations

- **Migrations:** Alembic runs as a one-shot Job in the `database`
  Terraform module before the first app pod is scheduled. Migrations
  are immutable — see
  [`migrations-persistence`](../../../../../../alphaswarm/.cursor/rules/migrations-persistence.mdc).
- **Backups:** pg_dump cron + WAL archiving to MinIO/S3 (per cloud).
  RPO 5 min, RTO 30 min; restore runbook at
  [`how-to/runbooks/dr-restore.md`](../../../how-to/runbooks/dr-restore.md).
- **Secrets:** primary DSN in Vault → ExternalSecret → in-cluster
  Secret. Hand-pasted credentials are a review-blocking change.
- **Connection pooling:** PgBouncer (transaction mode) sits in front;
  app pods connect through `pgbouncer.alphaswarm.svc.cluster.local:6432`.
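
A minimal sketch of how an app pod might assemble its pooled DSN, assuming the credentials have already been read from the in-cluster Secret. The helper name `pooled_dsn` and the `postgresql://` URL form are illustrative assumptions, not the platform's actual settings code.

```python
from urllib.parse import quote

PGBOUNCER_HOST = "pgbouncer.alphaswarm.svc.cluster.local"
PGBOUNCER_PORT = 6432


def pooled_dsn(user: str, password: str, database: str) -> str:
    """Build a DSN that targets PgBouncer instead of Postgres directly."""
    # Percent-encode each component so characters such as '@' or '/'
    # in a password cannot change the meaning of the URL.
    return (
        f"postgresql://{quote(user, safe='')}:{quote(password, safe='')}"
        f"@{PGBOUNCER_HOST}:{PGBOUNCER_PORT}/{quote(database, safe='')}"
    )
```

Because PgBouncer runs in transaction mode, a server connection is only held for the duration of one transaction, so code using this DSN should not rely on session-level state surviving between transactions.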

## See also

- [`migrations-persistence`](../../../../../../alphaswarm/.cursor/rules/migrations-persistence.mdc) —
  Alembic immutability + ORM conventions.
- [`erd.md`](../../platform/erd.md) — entity-relationship map across
  every table this database holds.
- [`storage` Terraform module](../../../../../../alphaswarm_platform/terraform/modules/storage/) —
  per-cloud provisioning.


<!-- https://alpha-swarm.ai/concepts/infrastructure/services/prometheus -->
# prometheus
> Time-series metrics scraper deployed via the kube-prometheus-stack. Sits in parallel with VictoriaMetrics during the long-term-storage cutover.

# prometheus

The cluster-internal metrics scraper. Deployed via
[`kube-prometheus-stack`](https://github.com/prometheus-operator/kube-prometheus)
which also installs the operator, Alertmanager, and the Grafana
sidecar.

## Identity

| Field | Value |
| --- | --- |
| Service id | `prometheus` |
| Role | `observability` |
| Image | `prom/prometheus` (managed by kube-prometheus-stack) |
| Port | `9090` |
| Health | `/-/ready` |

## Deployment surfaces

| Surface | Where |
| --- | --- |
| Kustomize | [`observability/kube-prometheus-stack/`](../../../../../../alphaswarm_platform/deployments/kubernetes/observability/kube-prometheus-stack/) — Helm-managed via kustomize HelmCharts overlay |
| Compose | (not in compose — local dev relies on `victoriametrics` for the small footprint) |

## Scrape targets

The kube-prometheus-stack installs a default `ServiceMonitor` set; we
extend it with:

- `alphaswarm-core` `/metrics` (every API pod).
- `alphaswarm-worker` `/metrics` (per Celery worker).
- `alphaswarm-cp` `/metrics`.
- KEDA metrics adapter on `aqp-controller-operator` and `bots-operator`.
- Linkerd proxy metrics (mTLS-side).
- Per-data-plane service exporters (Postgres exporter, Redis exporter,
  Kafka exporter, etc.).

## Long-term storage

Prometheus runs with a 30-day local retention; VictoriaMetrics is
the long-term store and remote-write target. During the parallel
cutover both stores receive samples; once the cutover is declared
complete, local Prometheus retention drops to 7 days.

## Operations

- **Alertmanager:** receives the `kube-prometheus-stack` default
  alert set + AlphaSwarm-specific rules under
  `observability/kube-prometheus-stack/alerts/`.
- **Federation:** disabled — the long-term path is remote-write to
  VictoriaMetrics, not federation.
- **PromQL recording rules:** kept under
  `observability/kube-prometheus-stack/rules/`; agent-emitted ad-hoc
  rules are forbidden.

## See also

- [`grafana.md`](grafana.md) — primary visualization layer.
- [`victoriametrics.md`](victoriametrics.md) — long-term storage.
- [`observability-stack.md`](../../trading/observability-stack.md) —
  stack composition.


<!-- https://alpha-swarm.ai/concepts/infrastructure/services/redis -->
# redis
> Cache + pub/sub + Celery broker + kill-switch key + BFF session store + HierarchicalRAG index.

# redis

Multi-purpose key-value store. Holds the kill-switch flag, the BFF
session store (Phase 5+), the Celery broker / result backend, the
semantic LLM cache, the HierarchicalRAG index, the
MetadataPrefetcher cache, and the per-cell pub/sub fan-out for
WebSocket progress streams.

## Identity

| Field | Value |
| --- | --- |
| Service id | `redis` |
| Role | `cache` |
| Image | `redis:7-alpine` (compose master) / `redis-stack:7.4.0-v3` (local; adds RedisJSON + RediSearch) |
| Port | `6379` |
| Storage | 2 Gi PVC (cell-local); ElastiCache (AWS) / Memorystore (GCP) / Azure Cache (Azure) in cloud templates |

## Deployment surfaces

| Surface | Where |
| --- | --- |
| Compose | service `redis` in [`alphaswarm_platform/compose/docker-compose.yml`](../../../../../../alphaswarm_platform/compose/docker-compose.yml); `redis-stack` in [`deployments/compose/docker-compose.local.yml`](../../../../../../alphaswarm_platform/deployments/compose/docker-compose.local.yml) |
| Kustomize | [`deployments/kubernetes/base/redis-master/`](../../../../../../alphaswarm_platform/deployments/kubernetes/base/redis-master/) — single master per cell; [`base-services/redis-shared/`](../../../../../../alphaswarm_platform/deployments/kubernetes/base-services/redis-shared/) — shared replica set |
| Terraform module | [`alphaswarm_platform/terraform/modules/storage/`](../../../../../../alphaswarm_platform/terraform/modules/storage/) — managed cache per cloud |

## Key namespaces

| Prefix | Owner | Purpose |
| --- | --- | --- |
| `alphaswarm:kill_switch` | `WorkloadRuntime`, `TerraformRuntime` | Global halt flag — every state-mutating runtime checks before acting |
| `celery:*` | Celery broker | Queue names per family (`default`, `backtest`, `agents`, ...) |
| `bff:session:*` | `alphaswarm-cp` BFF | Phase 5 session store (sid → IdP token) |
| `m2m:tokens:*` | `alphaswarm-cp` auth broker | M2M token cache |
| `cache:llm:*` | `alphaswarm-core` | Semantic LLM cache |
| `cache:metadata:*` | `MetadataPrefetcher` | Entity dropdown cache |
| `rag:*` | `HierarchicalRAG` | Embedding index |
| `pubsub:progress:*` | `alphaswarm.tasks._progress` | WebSocket fan-out frames |

## Dependencies

**Upstream:** none.

**Downstream:** every runtime pod (`alphaswarm-core`, `alphaswarm-worker`,
`alphaswarm-beat`, `alphaswarm-cp`, bots).

## Operations

- **Eviction policy:** `allkeys-lru` for caches; `noeviction` for
  Celery to avoid silent task drops.
- **HA:** in-cell single master; cloud templates use managed Redis
  with multi-AZ replicas.
- **Kill-switch:** the key is intentionally simple: setting it to any
  truthy value halts, and the runtime checks it before every
  state-mutating action.
- **Persistence:** AOF every second + RDB snapshot every 5 min.
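
The kill-switch check can be sketched as a guard wrapped around any state-mutating call. This is an illustration under assumptions: `is_halted`, `guarded`, `HaltedError`, and the in-memory `FakeRedis` are hypothetical names, and treating `""`, `0`, and `false` as "not set" is one possible reading of "truthy".

```python
from typing import Callable, Optional, TypeVar

T = TypeVar("T")
KILL_SWITCH_KEY = "alphaswarm:kill_switch"


class HaltedError(RuntimeError):
    """Raised when the global halt flag is set."""


class FakeRedis:
    """In-memory stand-in exposing the one method the guard needs."""

    def __init__(self) -> None:
        self.data: dict[str, bytes] = {}

    def get(self, key: str) -> Optional[bytes]:
        return self.data.get(key)


def is_halted(client) -> bool:
    raw = client.get(KILL_SWITCH_KEY)
    if raw is None:
        return False  # key absent: not halted
    value = raw.decode() if isinstance(raw, bytes) else str(raw)
    # Assumption: empty, "0", and "false" count as falsy.
    return value.strip().lower() not in {"", "0", "false"}


def guarded(client, action: Callable[[], T]) -> T:
    """Run a state-mutating action only when the kill switch is clear."""
    if is_halted(client):
        raise HaltedError("kill switch is set; refusing to mutate state")
    return action()
```

Any client whose `get` returns bytes or `None` fits this guard.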

## See also

- [`tasks-api`](../../../../../../alphaswarm/.cursor/rules/tasks-api.mdc) — Celery
  broker + Redis pub/sub frame contract.
- [`alphaswarm-management-engine`](../../../../../../alphaswarm/.cursor/rules/alphaswarm-management-engine.mdc) —
  redaction rules for any code that handles Redis-stored tokens.
- [`storage` Terraform module](../../../../../../alphaswarm_platform/terraform/modules/storage/) —
  per-cloud provisioning.


<!-- https://alpha-swarm.ai/concepts/infrastructure/services/vector -->
# vector
> Log shipper — DaemonSet running on every node, ships container logs to Loki.

# vector

[Vector](https://vector.dev/) — the canonical log shipper. Runs as a
DaemonSet on every node, tails container stdout/stderr, applies
parse + redact transforms, and ships to Loki.

## Identity

| Field | Value |
| --- | --- |
| Service id | `vector` |
| Role | `observability` |
| Image | `timberio/vector:0.43.0-alpine` |
| Port | (no public listener; metrics on `:9598`) |

## Deployment surfaces

| Surface | Where |
| --- | --- |
| Compose | service `vector` in [`alphaswarm_platform/compose/docker-compose.platform.yml`](../../../../../../alphaswarm_platform/compose/docker-compose.platform.yml) |
| Kustomize | [`observability/vector/`](../../../../../../alphaswarm_platform/deployments/kubernetes/observability/vector/) — DaemonSet + ConfigMap |

## Pipelines

- **`kubernetes_logs` source** → JSON parser → metadata enrichment
  (pod labels, namespace, cell id, tenant id) → redaction transform.
- **Sinks:** `loki` (canonical) + `phoenix` (only for spans tagged
  `service.namespace=alphaswarm.ai`).

## Redaction

- The redact transform strips any field whose lower-cased name
  contains `password`, `secret`, `token`, `key`, `credential`,
  `private`, `authorization`, `kubeconfig`, `client_secret`,
  `api_token`, `api_key`, `jwt`, `refresh_token`, `access_token`.
- Same allowlist as `WorkloadRuntime` redaction — see the
  [`alphaswarm-management-engine`](../../../../../../alphaswarm/.cursor/rules/alphaswarm-management-engine.mdc) rule.
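
The redaction rule is a substring match on lower-cased field names. The Python sketch below illustrates that rule only; it is not the Vector transform itself, and the recursion into nested objects and lists is an assumption.

```python
from typing import Any

SENSITIVE_MARKERS = (
    "password", "secret", "token", "key", "credential", "private",
    "authorization", "kubeconfig", "client_secret", "api_token",
    "api_key", "jwt", "refresh_token", "access_token",
)


def is_sensitive(field_name: str) -> bool:
    lowered = field_name.lower()
    return any(marker in lowered for marker in SENSITIVE_MARKERS)


def redact(event: Any) -> Any:
    """Return a copy of the event with sensitive fields stripped."""
    if isinstance(event, dict):
        return {
            name: redact(value)
            for name, value in event.items()
            if not is_sensitive(str(name))
        }
    if isinstance(event, list):
        return [redact(item) for item in event]
    return event  # scalars pass through unchanged
```

Because the match is by substring, the longer entries (`client_secret`, `api_key`, `refresh_token`, ...) are already covered by `secret`, `key`, and `token`, and unrelated fields such as `keyboard` are stripped as well.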

## See also

- [`loki.md`](loki.md) — primary sink.
- [`alphaswarm-management-engine`](../../../../../../alphaswarm/.cursor/rules/alphaswarm-management-engine.mdc) —
  redaction allowlist.


<!-- https://alpha-swarm.ai/concepts/infrastructure/services/victoriametrics -->
# victoriametrics
> Long-term metrics storage — Prometheus-compatible TSDB, target of remote-write from kube-prometheus-stack.

# victoriametrics

[VictoriaMetrics](https://victoriametrics.com/) — Prometheus-compatible
time-series database used as the long-term storage layer. Receives
samples via Prometheus remote-write; queryable directly or through
Grafana.

## Identity

| Field | Value |
| --- | --- |
| Service id | `victoriametrics` |
| Role | `observability` |
| Image | `victoriametrics/victoria-metrics:v1.108.0` |
| Port | `8428` (HTTP — write + query) |
| Health | `/health` |

## Deployment surfaces

| Surface | Where |
| --- | --- |
| Compose | service `victoriametrics` in [`alphaswarm_platform/compose/docker-compose.platform.yml`](../../../../../../alphaswarm_platform/compose/docker-compose.platform.yml) |
| Kustomize | [`observability/victoriametrics/`](../../../../../../alphaswarm_platform/deployments/kubernetes/observability/victoriametrics/) — single-node Deployment (in-cell); cluster mode (`vmstorage` / `vmselect` / `vminsert`) in cloud templates |

## Dependencies

**Upstream:** Prometheus (remote-write).

**Downstream:** Grafana (datasource).

## Operations

- **Retention:** 13 months default; 24 months for cells flagged
  `audit_evidence: true`.
- **Cardinality control:** label scrubbing + per-label value cap to
  prevent runaway growth from per-pod / per-task labels.
- **PromQL compatibility:** queries that work in Prometheus work
  here; some MetricsQL extensions (`histogram_quantiles`, etc.) are
  used in dashboards.

## See also

- [`prometheus.md`](prometheus.md) — sample source.
- [`grafana.md`](grafana.md) — primary query path.


<!-- https://alpha-swarm.ai/concepts/infrastructure/terraform-control-plane -->
# Terraform IaC control plane
> The runtime is the only sanctioned executor for `terraform plan/apply/destroy/refresh` operations. Routes / Celery tasks / MCP tools wrap it; nothing calls `subprocess.run(["terraform", ...])` direct...

# Terraform IaC control plane

Phase 7 of the multi-tenant rollout introduces **`TerraformRuntime`**, a
sibling spec runtime that joins `AgentRuntime`, `BotRuntime`, `RLRuntime`,
`AnalysisRuntime`, and `WorkflowRuntime`.

The runtime is the only sanctioned executor for
`terraform plan/apply/destroy/refresh` operations. Routes / Celery tasks /
MCP tools wrap it;
nothing calls `subprocess.run(["terraform", ...])` directly outside
[`alphaswarm/terraform/runner.py::TerraformExecutor`](../alphaswarm/terraform/runner.py).

## Architecture

```mermaid
flowchart LR
  user["Operator / Agent"] --> rest["/terraform/* + /infra/* REST"]
  user --> mcp["data.terraform.* MCP tools"]
  rest --> runtime["TerraformRuntime"]
  mcp  --> runtime
  runtime --> ledger["TerraformRun (Postgres)"]
  runtime --> celery["Celery 'terraform' queue"]
  celery --> runner["alphaswarm-terraform-runner pod"]
  runner --> executor["TerraformExecutor (subprocess)"]
  executor --> state["State backend: local / s3 / azurerm / gcs / hcp"]
  state --> aws["AWS provider"]
  state --> gcp["GCP provider"]
  state --> azure["Azure provider"]
  state --> local["local docker/k8s provider"]
  state --> hcp["HCP Terraform via HcpClient"]
  runtime --> kill["/terraform/halt kill-switch"]
  runtime --> policy["OPA Rego (PolicyChecker)"]
```

## Spec → version → run lifecycle

1. **Author a `TerraformStackSpec`** (Pydantic). Hash is SHA-256 of
   canonical JSON.
2. **`persist_spec(spec)`** creates a new
   `terraform_stack_spec_versions` row only when the hash changes
   (AGENTS rule 43).
3. **`TerraformRuntime(spec).plan(workspace_id=...)`** opens a
   `TerraformRun` row (rule 34: carries `experiment_id` + `test_id`
   FKs), enqueues the plan task on the `terraform` Celery queue.
4. **Runner pod executes** `terraform init && terraform plan -out
   tfplan.binary`, captures stdout/stderr to files in the workspace
   dir, parses `terraform show -json tfplan.binary` into a structured
   plan summary, optionally runs OPA Rego policies.
5. **Plan run lands in `awaiting_approval`.** The frontend
   `/infra/terraform/workspaces/[id]` page renders an "Apply this
   plan" button.
6. **`TerraformRuntime(spec).apply(plan_run_id=...)`** opens a child
   `TerraformRun`, executes `terraform apply tfplan.binary`,
   snapshots the resulting state into a `TerraformStateVersion` row.
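
Steps 1 and 2 can be sketched in a few lines. This is an illustration under assumptions: "canonical JSON" is taken to mean sorted keys with compact separators, "the hash changes" is read as "differs from the latest stored version", and `InMemoryVersionStore` is a hypothetical stand-in for the `terraform_stack_spec_versions` table.

```python
import hashlib
import json


def spec_hash(spec: dict) -> str:
    """SHA-256 over a canonical JSON rendering of the spec."""
    canonical = json.dumps(spec, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


class InMemoryVersionStore:
    """Append-only list of (hash, spec) versions."""

    def __init__(self) -> None:
        self.versions: list[tuple[str, dict]] = []

    def persist_spec(self, spec: dict) -> tuple[str, bool]:
        """Return (hash, created); a row is added only on a new hash."""
        digest = spec_hash(spec)
        if self.versions and self.versions[-1][0] == digest:
            return digest, False  # unchanged spec: no new version row
        self.versions.append((digest, spec))
        return digest, True
```

Sorting keys before hashing means two specs that differ only in field order produce the same hash and therefore the same version.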

## Code generation

CDKTF was deprecated by HashiCorp on 2025-12-10. Python-side HCL
generation uses **Jinja2 templates** under
[`alphaswarm/terraform/codegen/templates/`](../alphaswarm/terraform/codegen/templates):

- `storage_{aws,gcp,azure,local}.tf.j2`
- `faas_local.tf.j2` (KEDA + per-queue ScaledObjects)
- `agents_local.tf.j2` (bot pods with `alphaswarm-data-mcp` sidecar)
- `secrets_local.tf.j2` (ESO + ClusterSecretStore + ExternalSecret per `secret_mappings`)
- `generic.tf.j2` (fallback for `module_source` references)

Operator-authored stacks live under [`alphaswarm_platform/terraform/modules/`](../alphaswarm_platform/terraform/modules/)
and are reachable via `spec.module_source = "../../modules/storage"`.

## State backends

Five backends are supported (`ALPHASWARM_TERRAFORM_STATE_BACKEND`):

| Kind     | Backend block                              |
| -------- | ------------------------------------------ |
| local    | `terraform { backend "local" { ... } }`    |
| s3       | `backend "s3" { bucket / key / dynamodb }` |
| azurerm  | `backend "azurerm" { storage_account_name }` |
| gcs      | `backend "gcs" { bucket / prefix }`        |
| hcp      | HCP Terraform via `HcpClient`              |
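
To make the table concrete, the sketch below renders a backend block for the four block-based kinds. It uses plain string formatting rather than the project's Jinja2 templates, the function name `render_backend` is hypothetical, and quoting values with `json.dumps` is an approximation of HCL string escaping. The `hcp` kind is rejected here because, per the table, it goes through `HcpClient` instead of a backend block.

```python
import json

BLOCK_BACKENDS = {"local", "s3", "azurerm", "gcs"}


def render_backend(kind: str, settings: dict[str, str]) -> str:
    """Render a `terraform { backend "<kind>" { ... } }` block."""
    if kind not in BLOCK_BACKENDS:
        raise ValueError(f"no backend block for kind {kind!r}")
    lines = ["terraform {", f'  backend "{kind}" {{']
    for name in sorted(settings):
        # json.dumps yields a double-quoted, escaped string literal.
        lines.append(f"    {name} = {json.dumps(settings[name])}")
    lines += ["  }", "}"]
    return "\n".join(lines) + "\n"
```

For example, `render_backend("gcs", {"bucket": "tf-state", "prefix": "cells/a"})` yields a block with `bucket` and `prefix` attributes; the bucket and prefix values here are made up for illustration.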

The HCP path uses
[`alphaswarm/terraform/hcp_client.py`](../alphaswarm/terraform/hcp_client.py) (thin
httpx wrapper around `app.terraform.io/api/v2`) — no
`python-terrasnek` dep so cold installs without HCP credentials still
boot cleanly.

## Bootstrap and reliability notes

- During cold-start deployments, prefer CLI-first `terraform init/plan/apply`
  until API + Celery + Redis + Postgres are all healthy.
- Control-plane-triggered Terraform actions require broker + worker
  availability to enqueue and stream progress.
- `TerraformExecutor` retries transient `terraform init` provider/network
  failures with bounded exponential backoff.
- Use `ALPHASWARM_TERRAFORM_CLI_CONFIG_FILE` to point at a Terraform CLI config
  that defines `provider_installation` mirror rules when registry access
  is unreliable.
- Provider cache is shared through `ALPHASWARM_TERRAFORM_PLUGIN_CACHE_DIR`.
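
Bounded exponential backoff, as mentioned for `terraform init`, can be sketched as follows. The attempt count, delays, and the `TransientInitError` type are illustrative assumptions; the actual values used by `TerraformExecutor` are not stated here.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


class TransientInitError(RuntimeError):
    """Hypothetical marker for retryable provider/network failures."""


def retry_with_backoff(
    operation: Callable[[], T],
    *,
    attempts: int = 4,
    base_delay: float = 1.0,
    max_delay: float = 30.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call `operation`, retrying transient failures with capped delays."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    for attempt in range(attempts):
        try:
            return operation()
        except TransientInitError:
            if attempt == attempts - 1:
                raise  # retries exhausted: surface the last failure
            # Delay doubles each attempt but never exceeds max_delay.
            sleep(min(max_delay, base_delay * 2 ** attempt))
    raise AssertionError("unreachable")
```

Passing `sleep` as a parameter keeps the helper testable without real waiting.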

## Kill switch

`POST /terraform/halt` is one of the endpoints fanned out by the topbar
`KillSwitch` (alongside `/agents/halt`, `/quant-agents/halt`,
`/paper/stop-all`, `/bots/halt-all`, `/rl/halt-all`,
`/workflows/halt`). On halt every `queued | running |
awaiting_approval` `TerraformRun` is marked `cancelled` + `halted=True`.
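
The halt transition can be sketched over in-memory records. `TerraformRunRecord` and `halt_runs` are hypothetical names standing in for the ORM model and the route handler; any status outside the three named above (for example `succeeded`) is an assumption made for the example.

```python
from dataclasses import dataclass

HALTABLE_STATUSES = frozenset({"queued", "running", "awaiting_approval"})


@dataclass
class TerraformRunRecord:
    run_id: str
    status: str
    halted: bool = False


def halt_runs(runs: list[TerraformRunRecord]) -> int:
    """Cancel every in-flight run and return how many were changed."""
    changed = 0
    for run in runs:
        if run.status in HALTABLE_STATUSES:
            run.status = "cancelled"
            run.halted = True
            changed += 1
    return changed  # runs in other states are left untouched
```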

## Policy gate (OPA)

`TerraformPolicyAttachment` rows bind a workspace to one or more OPA
Rego policy files. The runtime calls
[`PolicyChecker.check`](../alphaswarm/terraform/policy.py) after every plan;
`hard_mandatory=True` attachments block the corresponding apply on
violation. When `opa` is not on PATH the checker no-ops (so dev / CI
without OPA installed still works).
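
The gate logic can be sketched as below. This is an illustration, not the `PolicyChecker` implementation: `Attachment` and `apply_allowed` are hypothetical names, and `evaluate` stands in for whatever invokes `opa` against the plan and reports whether the policy was violated.

```python
import shutil
from dataclasses import dataclass
from typing import Callable, Iterable, Optional


@dataclass(frozen=True)
class Attachment:
    policy_set_uri: str
    hard_mandatory: bool


def apply_allowed(
    attachments: Iterable[Attachment],
    evaluate: Callable[[Attachment], bool],
    which: Callable[[str], Optional[str]] = shutil.which,
) -> bool:
    """Return True when the apply may proceed after a plan."""
    if which("opa") is None:
        return True  # OPA not on PATH: the check is a no-op
    for attachment in attachments:
        violated = evaluate(attachment)
        if violated and attachment.hard_mandatory:
            return False  # hard-mandatory violation blocks the apply
    return True  # advisory violations do not block
```

The no-op branch is convenient for dev and CI, but it also means a runner image without `opa` installed silently skips every policy, so production images should be verified to ship the binary.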

## Frontend

Vite/React surfaces under [`alphaswarm_client/src/routes/infra/`](../alphaswarm_client/src/routes/infra/):

- `/infra` — 7 tabbed panes (overview / bots / queues / pipeline /
  secrets / k8s / canary) + a Terraform inline summary.
- `/infra/terraform` — workspace list with per-row Plan / Apply /
  Destroy (friction-gated).
- `/infra/terraform/workspaces/[id]` — workspace detail + run history
  + latest state outputs.
- `/infra/terraform/runs/[id]` — run detail with live WS progress
  stream (`/terraform/ws/runs/{id}`).
- `/infra/terraform/stacks` — stack spec catalog.

## Where to look for X

| Task | Path |
| --- | --- |
| Add a new module kind | [`alphaswarm/terraform/codegen/templates/`](../alphaswarm/terraform/codegen/templates/) + [`alphaswarm/persistence/models_terraform.py::TERRAFORM_MODULE_KINDS`](../alphaswarm/persistence/models_terraform.py) |
| Add an MCP tool | [`alphaswarm/data/mcp/tools/terraform.py`](../alphaswarm/data/mcp/tools/terraform.py) |
| Add a REST route | [`alphaswarm/api/routes/terraform.py`](../alphaswarm/api/routes/terraform.py) |
| Add a Celery task | [`alphaswarm/tasks/terraform_tasks.py`](../alphaswarm/tasks/terraform_tasks.py) |
| Edit the runner pod | [`alphaswarm_platform/terraform/modules/terraform_runner/main.tf`](../alphaswarm_platform/terraform/modules/terraform_runner/main.tf) |
| Add a state backend | [`alphaswarm/terraform/codegen/wrapper.py`](../alphaswarm/terraform/codegen/wrapper.py) |
| Add an OPA policy | Reference the file URI via `TerraformPolicyAttachment.policy_set_uri` |


<!-- https://alpha-swarm.ai/concepts/infrastructure/worker-executor-images -->
# Worker vs executor images
> Why the Celery surface is split into two purpose-built images — a slim orchestration worker and a heavy-compute executor — and the dependency / queue matrix that keeps them apart.

# Worker vs executor images

The AlphaSwarm Celery surface is split into **two** purpose-built,
migration-ready container images (Phase 4c):

- **`alphaswarm-worker`** — slim **orchestration** worker. Task dispatch,
  lineage, paper-trading loop, terraform/ingestion/workflow coordination.
- **`alphaswarm-executor`** — **heavy-compute** executor. Backtests, RL /
  ML training, factor builds, agent-emitted strategy code, RAG ingest.

## Why split

Historically `worker` and `beat` had **no image of their own** — the
`alphaswarm_images` catalogue pinned `worker = { target = "api" }`, so the
orchestration worker dragged the entire API stage (Dash, `visualization`,
`dev` tooling) plus the full ML/RL surface into one fat image. Two
problems followed:

1. **Bloat & blast radius.** A lineage callback worker carried PyTorch,
   Ray, vectorbt-pro, forecasting libs — slow to pull, large attack
   surface, slow cold-start.
2. **Scaling mismatch.** Light coordination tasks (sub-second, IO-bound)
   and heavy compute tasks (minutes–hours, CPU/GPU/RAM-bound) have
   opposite scaling and resource profiles, but shared one Deployment.

Splitting lets each image carry only what its queues need, and lets each
scale and be resourced independently.

## Queue ↔ image matrix

The queue assignment is identical across the root `Dockerfile`, the
standalone per-service Dockerfiles, the K8s manifests, both compose
files, and the `faas` KEDA module (`local.heavy_queues`). **A queue is
never drained by both images.**

| Queue | Image | Why |
| --- | --- | --- |
| `default` | worker | bookkeeping, lineage, callbacks |
| `paper` | worker | sub-second paper-trading loop (latency-sensitive) |
| `terraform` | worker | `TerraformRuntime` apply/destroy wrappers |
| `ingestion` | worker | connector pulls (IO-bound, long-lived) |
| `workflows` | worker | `WorkflowRuntime` orchestration |
| `backtest` | executor | vbt-pro / event-driven / Lean engine runs |
| `training` | executor | RL rollouts + finetune jobs (GPU) |
| `ml` | executor | ML pipelines, predictor refresh |
| `agents` | executor | CrewAI / LangGraph agent runs |
| `factors` | executor | factor-zoo builds, alpha tests |
| `rag` | executor | RAG ingest, embedding refresh |
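
The matrix can be expressed as data, with a check that no queue is drained by both images. This is a sketch: the names `QUEUE_IMAGE`, `queues_for`, and `check_disjoint` are hypothetical, and the real assignment lives in the Dockerfiles, manifests, and the `faas` module.

```python
QUEUE_IMAGE = {
    "default": "worker",
    "paper": "worker",
    "terraform": "worker",
    "ingestion": "worker",
    "workflows": "worker",
    "backtest": "executor",
    "training": "executor",
    "ml": "executor",
    "agents": "executor",
    "factors": "executor",
    "rag": "executor",
}


def queues_for(image: str) -> list[str]:
    """Queues an image should drain, e.g. to join into Celery's -Q flag."""
    return sorted(q for q, owner in QUEUE_IMAGE.items() if owner == image)


def check_disjoint(drained: dict[str, set[str]]) -> None:
    """Raise if any queue appears under more than one image."""
    seen: dict[str, str] = {}
    for image, queues in drained.items():
        for queue in queues:
            if queue in seen:
                raise ValueError(
                    f"queue {queue!r} drained by {seen[queue]} and {image}"
                )
            seen[queue] = image
```

Running `check_disjoint` against the queue lists parsed from each deployment surface would be one way to enforce the invariant in CI.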

## Dependency surface

Both images share the multi-arch (`linux/amd64+arm64`) Chainguard Wolfi
base, nonroot UID `65532`, and the `CredentialResolver`-only secret rule
(nothing baked into the image). They differ in installed extras, extra
directories, and default sizing:

| | worker | executor |
| --- | --- | --- |
| Base extras | `otel, cli, iceberg, entity-graph, dagster-alphaswarm` | same |
| Distributed compute | `compute-dask, compute-ray` | `compute-dask, compute-ray` |
| ML / RL / forecasting | — | `ml, ml-torch, ml-forecast, ml-anomaly` |
| Portfolio | — | `portfolio` |
| Native build deps | — | `gfortran`, `linux-headers` (numpy/scipy/forecast wheels) |
| Extra dirs | `/app/data` | `/app/data`, `/app/models` |
| Default concurrency | 4 | 2 |
| Resource requests | `500m CPU / 1Gi` | `1 CPU / 4Gi` |
| Resource limits | `4 CPU / 8Gi` | `8 CPU / 16Gi` |

## Where the images are defined

| Surface | Worker | Executor |
| --- | --- | --- |
| Root multi-stage target | `worker` in [`Dockerfile`](../../../../../alphaswarm_platform/Dockerfile) | `executor` in [`Dockerfile`](../../../../../alphaswarm_platform/Dockerfile) |
| Standalone Dockerfile | [`build/docker/alphaswarm_worker/`](../../../../../alphaswarm_platform/build/docker/alphaswarm_worker/) | [`build/docker/alphaswarm_executor/`](../../../../../alphaswarm_platform/build/docker/alphaswarm_executor/) |
| Image catalogue | `worker` / `beat` → target `worker` | `executor` → target `executor` |
| ECR repo | `alphaswarm-worker` | `alphaswarm-executor` |
| Kustomize base | `base/alphaswarm-worker/` | `base/alphaswarm-executor/` |
| Compose | `worker` (legacy) / `alphaswarm-worker` | `worker-gpu` (legacy) / `alphaswarm-executor` |

## Migration readiness

Each image is intentionally self-contained, with a standalone
Dockerfile, its own ECR repo, its own image-catalogue entry, its own
Kustomize base, and its own topology entry, so the build assets can be
lifted into a dedicated repository in a future migration without
untangling them from the API image.

## See also

- [`alphaswarm-worker`](services/alphaswarm-worker.md) — orchestration worker service doc.
- [`alphaswarm-executor`](services/alphaswarm-executor.md) — heavy-compute executor service doc.
- [`services.md`](services.md) — full service catalogue.
- [`faas` Terraform module](../../../../../alphaswarm_platform/terraform/modules/faas/) — KEDA per-queue scaling.


<!-- https://alpha-swarm.ai/concepts/platform/alphaswarm-monorepo-paths -->
# AlphaSwarm Monorepo Paths
> Canonical path contract for this repository. Sibling repos (`rpi_kubernetes`, `theia-ide`, `alphaswarm_platform_admin`) mirror this table in their own `alphaswarm_docs/alphaswarm-monorepo-paths.md` files

# AlphaSwarm Monorepo Paths

Status: active.

Canonical path contract for this repository. Sibling repos (`rpi_kubernetes`,
`theia-ide`, `alphaswarm_platform_admin`) mirror this table in their own
`alphaswarm_docs/alphaswarm-monorepo-paths.md` files.

| AlphaSwarm responsibility | Path |
| --- | --- |
| Control plane | `alphaswarm_controller/` |
| Shared platform contracts | `alphaswarm_core/` |
| Active client (Vite) | `alphaswarm_client/` |
| Bot runtime/templates | `alphaswarm_bots/` |
| RL subsystem | `alphaswarm_rl/` (`src/alphaswarm_rl/` source; `tasks/`, `api/routes/`, `configs/`, `tests/` siblings) |
| Custom model boundary | `alphaswarm_models/` (`src/alphaswarm_models/` source incl. `serving/`; `tasks/`, `api/routes/`, `configs/`, `tests/` siblings) |
| Snippet corpus | `alphaswarm_snippets/` |
| Monolith runtime | `alphaswarm/` |
| Standalone operator CLI | `alphaswarm_cli/` |
| Internal admin (services + accounts) | `alphaswarm_admin/` |
| Vendored Theia IDE workspace | `alphaswarm_ide/` |
| Curator-owned project index (SSoT) | `alphaswarm_index/` |
| Canonical documentation | `alphaswarm_docs/` |
| Hosted-platform single home | `alphaswarm_platform/` |
| Kubernetes workloads | `alphaswarm_platform/deployments/kubernetes/` |
| Terraform modules + environments | `alphaswarm_platform/terraform/` |
| Multi-arch Dockerfiles + config gen | `alphaswarm_platform/build/` |
| Legacy / edge component configs | `alphaswarm_platform/deploy/` |
| Root-level compose files | `alphaswarm_platform/compose/` |
| Multi-stage root Dockerfile | `alphaswarm_platform/Dockerfile` |
| Deployment topology YAML | `alphaswarm_platform/configs/deployment/topology.yaml` |
| Terraform stack YAMLs | `alphaswarm_platform/configs/terraform/` |
| Cluster install scripts | `alphaswarm_platform/scripts/cluster_install/` |

Compatibility stubs and historical paths (do not add active source here):

| Legacy path | Points to |
| --- | --- |
| `frontend/` | `alphaswarm_client/` |
| `extractions/` | `alphaswarm_snippets/extractions/` |
| `inspiration/` | `alphaswarm_snippets/inspiration/` (ignored raw repos) |
| `alphaswarm/bots/` | `alphaswarm_bots/` (import shim) |
| `alphaswarm/rl/` | `alphaswarm_rl/src/alphaswarm_rl/` (deprecation-warning import shim; `pkgutil.walk_packages` aliases every submodule under `alphaswarm.rl.*`) |
| `alphaswarm/ml/` | `alphaswarm_models/src/alphaswarm_models/` (deprecation-warning import shim; `pkgutil.walk_packages` aliases every submodule under `alphaswarm.ml.*`) |
| `alphaswarm/llm/vllm_runner.py` | `alphaswarm_models/src/alphaswarm_models/serving/vllm.py` (one-line re-export shim) |
| `alphaswarm/llm/ollama_client.py` | `alphaswarm_models/src/alphaswarm_models/serving/ollama.py` (one-line re-export shim) |
| `alphaswarm/tasks/rl_tasks.py` | `alphaswarm_rl/tasks/rl_tasks.py` (Celery `name=` strings preserved for in-flight queue messages) |
| `alphaswarm/tasks/ml_tasks.py` | `alphaswarm_models/tasks/ml_tasks.py` |
| `alphaswarm/tasks/ml_test_tasks.py` | `alphaswarm_models/tasks/ml_test_tasks.py` |
| `alphaswarm/tasks/finetune_tasks.py` | `alphaswarm_models/tasks/finetune_tasks.py` |
| `alphaswarm/tasks/training_tasks.py` | `alphaswarm_models/tasks/training_tasks.py` |
| `alphaswarm/api/routes/rl.py` | `alphaswarm_rl/api/routes/rl.py` (FastAPI mount path `/rl` unchanged) |
| `alphaswarm/api/routes/ml.py` | `alphaswarm_models/api/routes/ml.py` (FastAPI mount path `/ml` unchanged) |
| `alphaswarm/api/routes/analytics_ml.py` | `alphaswarm_models/api/routes/analytics_ml.py` (FastAPI mount path `/analytics/ml` unchanged) |
| `configs/rl/` | `alphaswarm_rl/configs/` |
| `configs/ml/` | `alphaswarm_models/configs/` |
| `tests/rl/` | `alphaswarm_rl/tests/` |
| `tests/ml/` | `alphaswarm_models/tests/` |
| `docs/` | `alphaswarm_docs/` (renamed; all references updated) |
| root `deployments/` | `alphaswarm_platform/deployments/` |
| root `build/` | `alphaswarm_platform/build/` |
| root `deploy/` | `alphaswarm_platform/deploy/` |
| root `terraform/` | `alphaswarm_platform/terraform/` |
| root `Dockerfile` | `alphaswarm_platform/Dockerfile` |
| root `.dockerignore` | `alphaswarm_platform/.dockerignore` |
| root `docker-compose.yml` | `alphaswarm_platform/compose/docker-compose.yml` |
| root `docker-compose.platform.yml` | `alphaswarm_platform/compose/docker-compose.platform.yml` |
| root `docker-compose.viz.yml` | `alphaswarm_platform/compose/docker-compose.viz.yml` |
| `configs/deployment/` | `alphaswarm_platform/configs/deployment/` |
| `configs/terraform/` | `alphaswarm_platform/configs/terraform/` |
| `scripts/cluster_install/` | `alphaswarm_platform/scripts/cluster_install/` |


<!-- https://alpha-swarm.ai/concepts/platform/architecture -->
# Architecture
> Top-down map of the AlphaSwarm platform: the spec-runtime pattern, the data and agentic planes, the four edge surfaces, and the request lifecycle every dispatch shares.

# Architecture

> Human entry point. Pair with the AI-agent entry point at
> [AGENTS.md](https://github.com/julianwileymac/alphaswarm/blob/main/AGENTS.md)
> and the doc map at [/intro](../../intro/index.md).
>
> Cold-start path: [/intro/quickstart](../../intro/quickstart.md).
> Deployment path: [how-to/operations/local-setup](../../how-to/operations/local-setup.md)
> or [how-to/operations/kubernetes-deploy](../../how-to/operations/kubernetes-deploy.md).

AlphaSwarm is a **local-first, agentic quantitative research and trading
platform**. Every LLM call, every backtest, every reinforcement-learning
rollout, and every piece of metadata stays on local hardware: no
proprietary alpha leaves the box. The codebase distills patterns from
Microsoft Qlib, AI4Finance FinRL, QuantConnect Lean, OpenBB, vnpy, and
TradingAgents into one coherent platform.

The platform is organised around **four invariants** that hold across
every subsystem:

1. **Hash-locked spec runtimes.** `AgentSpec`, `BotSpec`,
   `RLExperimentSpec`, and `AnalysisSpec` each have a single sanctioned
   executor (`AgentRuntime` / `BotRuntime` / `RLRuntime` /
   `AnalysisRuntime`). Any spec change creates a new immutable
   `*_spec_versions` row; old versions stay forever for replay.
2. **Medallion lakehouse.** Every Iceberg write goes through
   [`iceberg_catalog.append_arrow`](https://github.com/julianwileymac/alphaswarm/blob/main/alphaswarm/data/iceberg_catalog.py)
   with a declared bronze / silver / gold layer; agents read through
   `data.*` MCP tools, never raw ORM.
3. **One LLM gateway, one progress bus.** Every model call routes
   through
   [`router_complete`](https://github.com/julianwileymac/alphaswarm/blob/main/alphaswarm/llm/providers/router.py);
   every Celery task emits canonical progress frames through
   [`alphaswarm.tasks._progress`](https://github.com/julianwileymac/alphaswarm/blob/main/alphaswarm/tasks/_progress.py).
4. **Topology is data, not code.** Service URLs, MCP audiences, and
   credential references resolve through
   [`alphaswarm_platform/configs/deployment/topology.yaml`](https://github.com/julianwileymac/alphaswarm/blob/main/alphaswarm_platform/configs/deployment/topology.yaml).
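The hash-lock invariant (item 1) can be illustrated with a minimal sketch. It assumes the spec is hashed over a canonical JSON encoding and uses an in-memory dict as a stand-in for a `*_spec_versions` table; `spec_hash`, `lookup_or_insert`, and the sample spec fields are hypothetical names for illustration, not the platform's actual API.

```python
import hashlib
import json

# Hypothetical in-memory stand-in for a `*_spec_versions` table: spec_hash -> version_id.
_versions: dict[str, int] = {}


def spec_hash(spec: dict) -> str:
    """SHA-256 over a canonical JSON encoding, so key order does not change the hash."""
    canonical = json.dumps(spec, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def lookup_or_insert(spec: dict) -> tuple[int, bool]:
    """Return (version_id, created). Existing entries are never mutated."""
    digest = spec_hash(spec)
    if digest in _versions:
        return _versions[digest], False
    version_id = len(_versions) + 1
    _versions[digest] = version_id
    return version_id, True


v1 = lookup_or_insert({"name": "momentum", "lookback": 20})
v1_again = lookup_or_insert({"lookback": 20, "name": "momentum"})  # same content, different key order
v2 = lookup_or_insert({"name": "momentum", "lookback": 30})  # changed content
print(v1, v1_again, v2)
```

Unchanged content resolves to the existing version, while any change in content yields a new version and leaves the old one in place for replay.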

## System component diagram

```mermaid
flowchart TB
    subgraph clients [Clients]
        Browser["alphaswarm_client (Vite :3001)"]
        CloudUI["alphaswarm_ui (Next.js cloud)"]
        Admin["alphaswarm_admin (manage.alpha-swarm.ai)"]
        CLI["alphaswarm-cli (device-flow auth)"]
        IDE["alphaswarm_ide (Theia 1.72)"]
        Agents["IDE agents (Cursor / Claude / Continue)"]
    end

    subgraph edge [Cloudflare edge]
        DocsEdge["docs.alpha-swarm.ai (Pages)"]
        DocsMcp["docs MCP Worker (RFC 9728+8707)"]
        StatusEdge["status.alpha-swarm.ai (Instatus)"]
        TunnelEdge["alphaswarm-fund-edge tunnel"]
    end

    subgraph api ["API gateway (alphaswarm/api)"]
        FastAPI["FastAPI :8000"]
        DataMcp["/mcp/data"]
        CodeMcp["/mcp/codebase"]
        WS["WebSocket relay"]
    end

    subgraph cp ["Control plane (alphaswarm_controller)"]
        ManageApi["alphaswarm-cp :9000 (manage.alpha-swarm.ai)"]
        TfRuntime["TerraformRuntime"]
        WlRuntime["WorkloadRuntime"]
    end

    subgraph runtimes [Spec runtimes]
        AgentRt["AgentRuntime"]
        BotRt["BotRuntime (alphaswarm_bots)"]
        RlRt["RLRuntime (alphaswarm_rl)"]
        AnaRt["AnalysisRuntime"]
        WfRt["WorkflowRuntime"]
    end

    subgraph workers [Celery workers]
        WDefault["worker (default / backtest / agents / paper)"]
        WTraining["worker-gpu (training queue)"]
        WTerraform["worker-terraform"]
        Beat["beat (cron)"]
    end

    subgraph runtime [Backends]
        Redis[(Redis 7)]
        Postgres[(PostgreSQL 16 + pgvector + RLS)]
        Iceberg["Iceberg lakehouse (bronze / silver / gold)"]
        Hudi["Hudi (upsert-heavy)"]
        DuckDB["DuckDB views"]
        Mlflow["MLflow"]
        R2[("R2 (Logpush 365d)")]
    end

    subgraph llms [LLM tier]
        Ollama["Ollama (host)"]
        Vllm["vLLM (compose --profile vllm)"]
        Sera["SERA-32B (Modal, opt-in)"]
        Router["router_complete + LiteLLM"]
    end

    subgraph observability [Observability]
        OTEL["OTEL collector :4317"]
        Jaeger["Jaeger"]
        Posthog["PostHog Cloud EU"]
        Plausible["Plausible (cookieless)"]
    end

    Browser --> FastAPI
    CloudUI --> FastAPI
    CloudUI --> ManageApi
    Admin --> ManageApi
    CLI --> FastAPI
    IDE --> FastAPI
    Agents -.MCP.-> DataMcp
    Agents -.MCP.-> CodeMcp
    Agents -.MCP.-> DocsMcp

    Browser -.DocsPanel.-> DocsEdge
    DocsEdge --> DocsMcp

    TunnelEdge --> FastAPI
    TunnelEdge --> ManageApi
    TunnelEdge --> Browser

    FastAPI --> AgentRt
    FastAPI --> BotRt
    FastAPI --> RlRt
    FastAPI --> AnaRt
    FastAPI --> WfRt
    WfRt --> AgentRt
    WfRt --> RlRt
    WfRt --> BotRt

    AgentRt -.tasks.-> Redis
    BotRt -.tasks.-> Redis
    RlRt -.tasks.-> Redis
    Beat -.cron.-> Redis
    Redis -.dispatch.-> WDefault
    Redis -.dispatch.-> WTraining
    Redis -.dispatch.-> WTerraform

    WDefault --> Postgres
    WDefault --> Iceberg
    WDefault --> Hudi
    WDefault --> Router
    WTraining --> Mlflow
    WTraining --> Router

    ManageApi --> TfRuntime
    ManageApi --> WlRuntime
    TfRuntime --> WTerraform

    Iceberg --> DuckDB
    Router --> Ollama
    Router -.optional.-> Vllm
    Router -.opt-in.-> Sera

    FastAPI -.spans.-> OTEL
    ManageApi -.spans.-> OTEL
    OTEL --> Jaeger
    DocsEdge -.events.-> Posthog
    DocsEdge -.pageviews.-> Plausible
    DocsEdge -.logpush.-> R2
```

Solid lines are default-profile data paths; dotted lines are
opt-in / asynchronous.

## The four edge surfaces

AlphaSwarm exposes four hostnames, each behind its own Cloudflare property:

- **`alpha-swarm.ai`**: operator UI ([alphaswarm_client](https://github.com/julianwileymac/alphaswarm/tree/main/alphaswarm_client)).
  Vite + React 19 + Tailwind 4 + shadcn/ui. Routes the topbar
  KillSwitch, paper trading dashboards, RL Lab, Analysis Lab,
  Workflow Studio, Data Hub.
- **`api.alpha-swarm.ai`**: public API
  ([alphaswarm/api](https://github.com/julianwileymac/alphaswarm/tree/main/alphaswarm/api)).
  FastAPI gateway, 30+ route modules, Stripe-style date epochs
  (first epoch `2026-06-01`).
- **`manage.alpha-swarm.ai`**: control plane
  ([alphaswarm_controller](https://github.com/julianwileymac/alphaswarm/tree/main/alphaswarm_controller)).
  Workload lifecycle, TerraformRuntime, IdP wiring. Never imports
  `alphaswarm.*`.
- **`docs.alpha-swarm.ai`**: documentation
  ([alphaswarm_docs](https://github.com/julianwileymac/alphaswarm/tree/main/alphaswarm_docs)).
  Docusaurus 3 on Cloudflare Pages. Pages Functions for
  content-negotiation, sanitised page fragments, and the
  "Was this helpful?" feedback loop. Standalone MCP Worker at
  `/mcp` (RFC 9728 + 8707 compliant per AGENTS rule 49).

Plus two adjacent zones:

- **`status.alpha-swarm.ai`**: Instatus status page. Separate Cloudflare
  zone so it stays up when the cluster is degraded.
- **`archive.alpha-swarm.ai`**: frozen Stripe-style API epochs after the
  12-month sunset window.

## Request lifecycle

Every spec-driven dispatch (backtest, agent run, RL training,
analysis flow, workflow) follows the same canonical shape. Two
contracts are new since the prior version of this doc:

- **Hash-lock first.** Before any work happens, the runtime computes
  the spec's SHA-256, looks for a matching `*_spec_versions` row, and
  inserts a new immutable row if the content changed.
- **Kill switch reachable.** Every long-running runtime is in the
  topbar [KillSwitch](https://github.com/julianwileymac/alphaswarm/blob/main/alphaswarm_client/src/components/common/KillSwitch.tsx)
  fan-out list. The runtime checks `should_halt` on every step.

```mermaid
sequenceDiagram
    actor User
    participant UI as alphaswarm_client
    participant API as FastAPI
    participant Runtime as Spec runtime
    participant Versions as *_spec_versions
    participant Redis
    participant Worker as Celery worker
    participant Postgres as Run ledger
    participant Iceberg

    User->>UI: Click "Run"
    UI->>API: POST //run { spec_yaml }
    API->>Runtime: instantiate(spec)
    Runtime->>Versions: lookup-or-insert by spec_hash
    Versions-->>Runtime: version_id (existing OR new)
    Runtime->>Postgres: insert run row (status=pending, spec_version_id)
    Runtime->>Redis: enqueue task (idempotent by run_id)
    API-->>UI: 202 Accepted { task_id, run_id, stream_url }
    UI->>API: WebSocket /chat/stream/{task_id}
    Worker->>Redis: dequeue
    Worker->>Postgres: load spec_version + run
    loop per step
        Worker->>Worker: runtime.step()
        Worker->>Worker: check should_halt()
        Worker->>Iceberg: append_arrow (medallion-tagged)
        Worker->>Redis: emit progress frame
        Redis-->>UI: WebSocket frame
    end
    Worker->>Postgres: update run (status=completed, metrics)
    Worker->>Redis: emit_done(task_id, result)
    Redis-->>UI: stage=done frame
    UI-->>User: render summary
```

The frame envelope is `{task_id, stage, message, timestamp,
**extras}` per AGENTS rule 4. The `should_halt` check makes every
spec-runtime an immediate stop target for the topbar kill switch.
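The envelope and the halt check can be sketched together. This is an illustration under stated assumptions, not the code in `alphaswarm.tasks._progress`: `make_frame` and `run_steps` are hypothetical helpers, the `step` and `halted` stage names are made up for the example, and the halt flag is checked before each step here.

```python
import time
from typing import Any, Callable, Iterator


def make_frame(task_id: str, stage: str, message: str, **extras: Any) -> dict[str, Any]:
    """Build a frame in the {task_id, stage, message, timestamp, **extras} shape."""
    return {
        "task_id": task_id,
        "stage": stage,
        "message": message,
        "timestamp": time.time(),
        **extras,
    }


def run_steps(
    task_id: str, n_steps: int, should_halt: Callable[[], bool]
) -> Iterator[dict[str, Any]]:
    """Yield one frame per step and stop as soon as the halt flag is raised."""
    yield make_frame(task_id, "start", "run started")
    for step in range(n_steps):
        if should_halt():  # the kill-switch check
            yield make_frame(task_id, "halted", f"stopped before step {step}", step=step)
            return
        yield make_frame(task_id, "step", f"step {step} complete", step=step)
    yield make_frame(task_id, "done", "run finished", steps=n_steps)


# Simulate a kill switch that fires on the third check.
halt_signal = iter([False, False, True])
frames = list(run_steps("task-1", 5, lambda: next(halt_signal)))
stages = [frame["stage"] for frame in frames]
print(stages)
```

Because the check runs on every iteration, a halt takes effect within one step instead of waiting for the run to finish.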

## Repository map

The monorepo is organised by responsibility. Each top-level package
has its own `AGENTS.md` enforcing strict boundaries; cross-package
imports are blocked in CI.

| Package | Role | Owner | Public-surface contract |
| --- | --- | --- | --- |
| [alphaswarm/](https://github.com/julianwileymac/alphaswarm/tree/main/alphaswarm) | Quant runtime: strategies, backtests, agents, RAG, Iceberg | `platform-team` | [alphaswarm/api/main.py::create_app](https://github.com/julianwileymac/alphaswarm/blob/main/alphaswarm/api/main.py) |
| [alphaswarm_controller/](https://github.com/julianwileymac/alphaswarm/tree/main/alphaswarm_controller) | Workload lifecycle + Terraform driver + provider adapters | `platform-team` | [alphaswarm_controller/main.py::create_app](https://github.com/julianwileymac/alphaswarm/blob/main/alphaswarm_controller/src/alphaswarm_controller/main.py); NEVER imports `alphaswarm.*` |
| [alphaswarm_core/](https://github.com/julianwileymac/alphaswarm/tree/main/alphaswarm_core) | Shared value types, ABCs, auth/resource filters, topology | `platform-team` | Dependency-light; consumed by both `alphaswarm/` and `alphaswarm_controller/` |
| [alphaswarm_client/](https://github.com/julianwileymac/alphaswarm/tree/main/alphaswarm_client) | Active Vite + React 19 operator UI at `alpha-swarm.ai` | `platform-team` | `pnpm --filter alphaswarm_client dev` |
| [alphaswarm_ui/](https://github.com/julianwileymac/alphaswarm/tree/main/alphaswarm_ui) | Cloud-hosted Next.js PaaS frontend (dual Auth0 + Entra) | `platform-team` | Never imports `alphaswarm.*` / `alphaswarm_controller.*` |
| [alphaswarm_admin/](https://github.com/julianwileymac/alphaswarm/tree/main/alphaswarm_admin) | Internal admin at `manage.alpha-swarm.ai` (audit-first) | `platform-team` | Mirrors `alphaswarm_controller` boundary |
| [alphaswarm_rl/](https://github.com/julianwileymac/alphaswarm/tree/main/alphaswarm_rl) | RL stack: `RLExperimentSpec` + `RLRuntime` + Iceberg trajectories | `rl-team` | Legacy `alphaswarm.rl.*` is a deprecation shim |
| [alphaswarm_models/](https://github.com/julianwileymac/alphaswarm/tree/main/alphaswarm_models) | ML framework, custom model serving (vLLM + Ollama), AlphaBacktestExperiment | `ml-team` | Legacy `alphaswarm.ml.*` + `alphaswarm/llm/{vllm_runner,ollama_client}.py` are deprecation shims |
| [alphaswarm_bots/](https://github.com/julianwileymac/alphaswarm/tree/main/alphaswarm_bots) | Bot templates + `BotRuntime` (smallest deployable unit) | `agentic-team` | YAML at `alphaswarm_bots/templates/{trading,research}/` |
| [alphaswarm_ide/](https://github.com/julianwileymac/alphaswarm/tree/main/alphaswarm_ide) | Theia 1.72 IDE + six AlphaSwarm extensions | `platform-team` | Canonical entrypoint: `alphaswarm-cli ide` |
| [alphaswarm_cli/](https://github.com/julianwileymac/alphaswarm/tree/main/alphaswarm_cli) | Standalone operator CLI (HTTP-only, device-flow auth) | `platform-team` | Never imports `alphaswarm.*` / `alphaswarm_controller.*` |
| [alphaswarm_platform/](https://github.com/julianwileymac/alphaswarm/tree/main/alphaswarm_platform) | Hosted-platform deployment + IaC + build assets | `infra-team` | No `import alphaswarm.*`; `TerraformRuntime`-only |
| [alphaswarm_index/](https://github.com/julianwileymac/alphaswarm/tree/main/alphaswarm_index) | Curator-owned single source of truth | `docs-team` | Sole-writer is the `alphaswarm-index-curator` subagent |
| [alphaswarm_docs/](https://github.com/julianwileymac/alphaswarm/tree/main/alphaswarm_docs) | This site (Docusaurus 3 on Cloudflare Pages) | `docs-team` | Quality gates in [.github/workflows/docs-ci.yml](https://github.com/julianwileymac/alphaswarm/blob/main/.github/workflows/docs-ci.yml) |
| [alphaswarm_snippets/](https://github.com/julianwileymac/alphaswarm/tree/main/alphaswarm_snippets) | Curated knowledge + extractions + inspiration trees | `docs-team` | Runtime code MUST NOT import this tree |

Inside `alphaswarm/` the subsystems map one-to-one to concept docs:

| `alphaswarm//` | Doc |
| --- | --- |
| [agents/](https://github.com/julianwileymac/alphaswarm/tree/main/alphaswarm/agents) | [agentic-pipeline](../agentic/agentic-pipeline.md), [agents](../agentic/agents.md), [workflow-studio](../agentic/workflow-studio.md), [multi-agent-patterns](../agentic/multi-agent-patterns.md) |
| [analysis/](https://github.com/julianwileymac/alphaswarm/tree/main/alphaswarm/analysis) | [analysis-framework](../strategy/analysis-framework.md), [analysis-lab](../strategy/analysis-lab.md), [analysis-flows](../strategy/analysis-flows.md) |
| [api/](https://github.com/julianwileymac/alphaswarm/tree/main/alphaswarm/api) | [reference/api](../../reference/api/index.mdx) (auto-generated) |
| [backtest/](https://github.com/julianwileymac/alphaswarm/tree/main/alphaswarm/backtest) | [backtest-engines](../strategy/backtest-engines.md), [vbtpro-integration](../strategy/vbtpro-integration.md), [hft-backtest](../strategy/hft-backtest.md) |
| [cli/](https://github.com/julianwileymac/alphaswarm/tree/main/alphaswarm/cli) | [providers](../data/providers.md) |
| [codebase/](https://github.com/julianwileymac/alphaswarm/tree/main/alphaswarm/codebase) | [codebase-mcp](../data/codebase-mcp.md) |
| [core/](https://github.com/julianwileymac/alphaswarm/tree/main/alphaswarm/core) | [core-types](./core-types.md) |
| [data/](https://github.com/julianwileymac/alphaswarm/tree/main/alphaswarm/data) | [data-plane](../data/data-plane.md), [data-catalog](../data/data-catalog.md), [data-mcp](../data/data-mcp.md), [datasets-catalog](../data/datasets-catalog.md), [data-discovery](../data/data-discovery.md), [airbyte-builder](../data/airbyte-builder.md), [dagster-sandbox](../data/dagster-sandbox.md) |
| [llm/](https://github.com/julianwileymac/alphaswarm/tree/main/alphaswarm/llm) | [providers](../data/providers.md), [sera](../data/sera.md) |
| [persistence/](https://github.com/julianwileymac/alphaswarm/tree/main/alphaswarm/persistence) | [domain-model](./domain-model.md), [erd](./erd.md), [reference/data-dictionary](../../reference/data-dictionary/index.mdx) |
| [providers/](https://github.com/julianwileymac/alphaswarm/tree/main/alphaswarm/providers) | [data-plane](../data/data-plane.md) |
| [risk/](https://github.com/julianwileymac/alphaswarm/tree/main/alphaswarm/risk) | [paper-trading](../trading/paper-trading.md) |
| [streaming/](https://github.com/julianwileymac/alphaswarm/tree/main/alphaswarm/streaming) | [streaming](../data/streaming.md), [streaming-admin](../data/streaming-admin.md), [live-market](../data/live-market.md) |
| [tasks/](https://github.com/julianwileymac/alphaswarm/tree/main/alphaswarm/tasks) | [agent-watchdog](../data/agent-watchdog.md) |
| [trading/](https://github.com/julianwileymac/alphaswarm/tree/main/alphaswarm/trading) | [paper-trading](../trading/paper-trading.md), [paper-metadata-gate](../trading/paper-metadata-gate.md) |
| [ws/](https://github.com/julianwileymac/alphaswarm/tree/main/alphaswarm/ws) | [observability](../trading/observability.md) |
| [ui/](https://github.com/julianwileymac/alphaswarm/tree/main/alphaswarm/ui) | **Deprecated** (legacy Solara); rollback only |

For the full canonical repository-split contract (boundaries, import
guards, future extraction map) read
[repository-split](./repository-split.md). For the
file-by-file path contract for cross-repo references read
[alphaswarm-monorepo-paths](./alphaswarm-monorepo-paths.md).

## Hard rules (cardinal subset)

Every contributor reads the full 55 hard rules in
[AGENTS.md](https://github.com/julianwileymac/alphaswarm/blob/main/AGENTS.md).
The cardinal subset that surfaces in this doc:

- **Rule 1.** `Symbol.parse(vt_symbol)` only. Never split a
  `vt_symbol` on `.`.
- **Rule 2.** All LLM calls go through `router_complete`.
- **Rule 3.** All Iceberg writes go through `iceberg_catalog.append_arrow`.
- **Rule 4.** All progress emits use the canonical frame envelope.
- **Rule 5.** All cross-task state goes through Postgres; never
  pickle ORM objects.
- **Rules 12-19, 23-25, 40-41.** The five spec runtimes
  (`AgentRuntime`, `BotRuntime`, `RLRuntime`, `AnalysisRuntime`,
  `WorkflowRuntime`) are the only sanctioned executors for their
  respective specs. Specs are immutable once committed; behaviour
  changes always create a new version row.
- **Rule 22.** Agents NEVER read Postgres / Iceberg directly. Every
  catalog / dataset / entity read goes through a registered
  `DataMCPTool`.
- **Rules 42-45.** TerraformRuntime owns all `terraform apply`;
  WorkloadRuntime owns all runtime workload ops; both write to the
  `workload_runs` + `terraform_runs` audit ledgers before executing.
- **Rule 47.** Service URLs resolve through the topology service;
  AlphaSwarm is cluster-agnostic.
- **Rule 49.** Every MCP server is RFC 9728 + 8707 conformant.
- **Rule 52.** Step-up MFA (RFC 9470) on every halt + every
  destructive surface.

## Worked example: trace your first request

Goal: dispatch a backtest, watch the WebSocket frames, inspect the
ledger row and the Iceberg gold output, all without leaving this page.

### Step 1: dispatch

The example below targets your local compose stack at
`http://localhost:8000`. Hit "Run" to fire a sample momentum backtest.


### Step 2: tail the WebSocket

Switch to your terminal and tail the canonical progress frames:

```bash
curl -N http://localhost:8000/chat/stream/
```

You will see frames in the `{task_id, stage, message, timestamp,
**extras}` shape. Stages: `start` → `bar.processed` (×N) →
`done` (carries the final `BacktestResult`).

### Step 3: inspect the ledger

Pyodide can run this synchronous SQL via DuckDB against a small
parquet snapshot of `backtest_runs`:


When pointed at the real platform, replace the inline list with a
[/data/exports](../../how-to/recipes/query-data-via-mcp.md) MCP call
and the same SQL works against the actual ledger snapshot.
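A minimal offline sketch of the ledger check follows. It swaps in the standard-library `sqlite3` module for DuckDB and an inline list of made-up sample rows for the parquet snapshot, so it runs with no extra dependencies. The `run_id`, `status`, and `sharpe` columns follow names used elsewhere on this page; the rest of the real `backtest_runs` schema is not shown and may differ.

```python
import sqlite3

# Made-up sample rows standing in for a snapshot of `backtest_runs`.
rows = [
    ("run-001", "completed", 1.42),
    ("run-002", "completed", 0.87),
    ("run-003", "pending", None),  # still running, so no Sharpe ratio yet
]

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE backtest_runs (run_id TEXT PRIMARY KEY, status TEXT, sharpe REAL)"
)
conn.executemany("INSERT INTO backtest_runs VALUES (?, ?, ?)", rows)

# Completed runs with a non-NULL Sharpe ratio, best first.
completed = conn.execute(
    "SELECT run_id, sharpe FROM backtest_runs "
    "WHERE status = 'completed' AND sharpe IS NOT NULL "
    "ORDER BY sharpe DESC"
).fetchall()
print(completed)
conn.close()
```

The query is plain SQL, so it should carry over to DuckDB with little or no change once it is pointed at a real snapshot.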

### Step 4: read the Iceberg gold output

```python
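from pyiceberg.catalog import load_catalog

run_id = "..."  # paste the run_id returned by the dispatch in Step 1
# "alphaswarm" must name a catalog configured for PyIceberg (e.g. in ~/.pyiceberg.yaml)
cat = load_catalog("alphaswarm")
table = cat.load_table(f"alphaswarm_gold_backtests.run_{run_id}")
df = table.scan().to_pandas()
print(df[["timestamp", "equity", "drawdown"]].tail(10))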
```

### Step 5: verify

- A `backtest_runs` row with non-NULL `sharpe` exists.
- The WebSocket emitted a `stage=done` frame with the same `run_id`.
- An `alphaswarm_gold_backtests.run_` Iceberg table is queryable.
- The `KillSwitch` topbar element shows a green status.

### What next

- Run the full walkthrough in [tutorials/first-backtest](../../tutorials/first-backtest.md).
- Author a custom strategy: [how-to/recipes/add-a-strategy](../../how-to/recipes/add-a-strategy.md).
- Promote the backtest to paper: [how-to/recipes/promote-a-bot-to-paper](../../how-to/recipes/promote-a-bot-to-paper.md).
- Replace the single-strategy dispatch with a multi-node workflow:
  [tutorials/first-agent-workflow](../../tutorials/first-agent-workflow.md) +
  [concepts/agentic/workflow-studio](../agentic/workflow-studio.md).

## Deployment modes

### docker-compose (default)

```bash
docker compose up -d
```

Brings up `redis`, `postgres`, `alphaswarm-core`, `alphaswarm-worker`, `alphaswarm-beat`,
`alphaswarm-client`, `chromadb`, `mlflow`, `otel-collector`, `jaeger`. The
Iceberg catalog runs in PyIceberg SQL mode against the host bind
mount under `data/iceberg/`. Optional profiles:

- `--profile streaming`: adds Redpanda + Flink for live market data.
- `--profile vllm`: adds a containerised vLLM inference server.
- `--profile legacy`: restores the older MinIO + iceberg-rest
  topology for rollback only.

### Native dev (no Docker)

```bash
pip install -e ".[full,dev]"
alembic upgrade head
uvicorn alphaswarm.api.main:app --reload
celery -A alphaswarm.tasks.celery_app worker --loglevel=info
```

### Kubernetes

```bash
make deploy-k8s ENV=prod
```

Manifests live under
[alphaswarm_platform/deployments/kubernetes/](https://github.com/julianwileymac/alphaswarm/tree/main/alphaswarm_platform/deployments/kubernetes).
The TerraformRuntime owns every `terraform apply`; see
[how-to/operations/kubernetes-deploy](../../how-to/operations/kubernetes-deploy.md)
and [how-to/operations/alphaswarm-fund-blue-green-cutover](../../how-to/operations/alphaswarm-fund-blue-green-cutover.md).

### Cloudflare Pages (docs only)

`docs.alpha-swarm.ai` deploys via the
[cloudflare_pages_docs](https://github.com/julianwileymac/alphaswarm/tree/main/alphaswarm_platform/terraform/modules/cloudflare_pages_docs)
Terraform module: out of cluster, on the edge, behind Cloudflare
Access for `/internal/*` and `/enterprise/*`.

## Where to start

```mermaid
flowchart LR
    contributor[New contributor] --> human["this page (architecture)"]
    contributor --> agent["AGENTS.md (root)"]
    human --> intro["intro/quickstart"]
    agent --> intro
    intro --> diataxis["Diataxis pick"]
    diataxis --> conceptsPick[concepts]
    diataxis --> howtoPick[how-to]
    diataxis --> tutorialsPick[tutorials]
    diataxis --> referencePick[reference]
```

| If you want to... | Read |
| --- | --- |
| Get the platform running locally | [intro/quickstart](../../intro/quickstart.md) |
| Understand the doc conventions | [intro/conventions](../../intro/conventions.md) |
| See the canonical repository layout | [repository-split](./repository-split.md) |
| Run a backtest end-to-end | [tutorials/first-backtest](../../tutorials/first-backtest.md) |
| Promote a bot from backtest to paper | [tutorials/first-bot](../../tutorials/first-bot.md) |
| Train an RL agent | [tutorials/first-rl-experiment](../../tutorials/first-rl-experiment.md) |
| Compose an agent workflow | [tutorials/first-agent-workflow](../../tutorials/first-agent-workflow.md) |
| Browse the API surface | [reference/api](../../reference/api/index.mdx) |
| Browse the Python surface | [reference/python](../../reference/python/index.mdx) |
| Inspect tables and columns | [reference/data-dictionary](../../reference/data-dictionary/index.mdx) |
| Author a new strategy | [how-to/recipes/add-a-strategy](../../how-to/recipes/add-a-strategy.md) |
| Query data without touching ORM | [how-to/recipes/query-data-via-mcp](../../how-to/recipes/query-data-via-mcp.md) |
| Snapshot an agent spec | [how-to/recipes/snapshot-an-agent-spec](../../how-to/recipes/snapshot-an-agent-spec.md) |
| Trigger a kill switch | [how-to/operations/kill-switch-incident-response](../../how-to/operations/kill-switch-incident-response.md) |
| Deploy to Kubernetes | [how-to/operations/kubernetes-deploy](../../how-to/operations/kubernetes-deploy.md) |
| Read the agentic-coding contract | [concepts/agentic/agentic-development](../agentic/agentic-development.md) |
| Run docs from an AI agent | `/llms.txt`, `/llms-full.txt`, `/mcp` |

## Deeper reads

- [concepts/platform/repository-split](./repository-split.md): boundary
  contract for every `alphaswarm_*` package.
- [concepts/agentic/workflow-studio](../agentic/workflow-studio.md):
  the `WorkflowRuntime` orchestration layer composing every spec
  runtime.
- [concepts/agentic/agentic-development](../agentic/agentic-development.md):
  the spec-pattern mapped to the broader agentic-coding vocabulary.
- [concepts/identity/management-engine](../identity/management-engine.md):
  `WorkloadRuntime` + control-plane audit ledger.
- [concepts/infrastructure/terraform-control-plane](../infrastructure/terraform-control-plane.md):
  `TerraformRuntime` + hash-locked stack specs.
- [reference/api](../../reference/api/index.mdx): Scalar-rendered API
  playground.
- [reference/python](../../reference/python/index.mdx): Griffe-generated
  Python reference.


<!-- https://alpha-swarm.ai/concepts/platform/class-diagram -->
# Class Diagrams
> Hand-authored mermaid `classDiagram` blocks for the five hierarchies AI coders most often need to navigate. Every diagram cites the canonical file so you can jump from the diagram into the code in one...

# Class Diagrams

> Pair with [alphaswarm_docs/erd.md](../../concepts/platform/erd.md) (database schema) and
> [alphaswarm_docs/architecture.md](../../concepts/platform/architecture.md) (system view).
> Doc map: [alphaswarm_docs/index.md](../../intro/index.md).

Hand-authored mermaid `classDiagram` blocks for the five hierarchies AI
coders most often need to navigate. Every diagram cites the canonical
file so you can jump from the diagram into the code in one click.

## 1. Symbol + core enums

The atom that flows through every data feed, strategy, and broker.
Defined in [alphaswarm/core/types.py](../alphaswarm/core/types.py).

```mermaid
classDiagram
    class Symbol {
        +str ticker
        +Exchange exchange
        +AssetClass asset_class
        +SecurityType security_type
        +str vt_symbol
        +parse(s) Symbol
        +format() str
        +equity(ticker, exchange) Symbol
        +crypto(base, quote, venue) Symbol
        +option(underlying, ...) Symbol
    }
    class Exchange {
        <<enumeration>>
        NASDAQ
        NYSE
        ARCA
        BATS
        CBOE
        CME
        LSE
        BINANCE
        COINBASE
        SIM
        LOCAL
    }
    class AssetClass {
        <<enumeration>>
        EQUITY
        CRYPTO
        FX
        FUTURE
        OPTION
        INDEX
        COMMODITY
        BOND
        BASE
    }
    class SecurityType {
        <<enumeration>>
        EQUITY
        OPTION
        FUTURE
        FUTURE_OPTION
        FOREX
        CFD
        CRYPTO
        CRYPTO_FUTURE
        INDEX
        INDEX_OPTION
        COMMODITY
    }
    class Resolution {
        <<enumeration>>
        Tick
        Second
        Minute
        Hour
        Daily
    }
    class TickType {
        <<enumeration>>
        Trade
        Quote
        OpenInterest
    }
    class SubscriptionDataConfig {
        +Symbol symbol
        +Resolution resolution
        +TickType tick_type
        +DataNormalizationMode mode
    }
    class BarData {
        +Symbol symbol
        +datetime timestamp
        +Decimal open
        +Decimal high
        +Decimal low
        +Decimal close
        +int volume
    }
    class QuoteBar
    class TradeBar
    class TickData

    Symbol --> Exchange : "uses"
    Symbol --> AssetClass : "uses"
    Symbol --> SecurityType : "uses"
    SubscriptionDataConfig --> Symbol
    SubscriptionDataConfig --> Resolution
    SubscriptionDataConfig --> TickType
    BarData --> Symbol
    QuoteBar --|> BarData
    TradeBar --|> BarData
    TickData --> Symbol
```

**Key invariants**:

- `Symbol` is hashable + frozen. Round-trip via
  `Symbol.parse(symbol.format())` is the identity.
- `vt_symbol` is always `f"{ticker}.{exchange}"` (vnpy convention).
- Concrete instrument shapes (option chains, future contracts) live
  alongside `Symbol` as additional fields, not separate classes.
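The round-trip invariant and the reason for never splitting a `vt_symbol` on `.` can be shown with a simplified sketch. It assumes the exchange is the segment after the last dot and models only two string fields; the real class in `alphaswarm/core/types.py` carries more fields and may parse differently.

```python
from dataclasses import dataclass


@dataclass(frozen=True)  # frozen + eq makes instances hashable
class Symbol:
    """Simplified stand-in: the real Symbol also carries asset class, security type, etc."""

    ticker: str
    exchange: str

    @classmethod
    def parse(cls, vt_symbol: str) -> "Symbol":
        # Split on the LAST dot only, so tickers that contain dots survive.
        ticker, sep, exchange = vt_symbol.rpartition(".")
        if not sep or not ticker or not exchange:
            raise ValueError(f"not a vt_symbol: {vt_symbol!r}")
        return cls(ticker=ticker, exchange=exchange)

    def format(self) -> str:
        return f"{self.ticker}.{self.exchange}"


sym = Symbol.parse("BRK.B.NYSE")
naive = "BRK.B.NYSE".split(".")  # three parts: the ticker is torn in half
round_trip = Symbol.parse(sym.format())
print(sym, naive, round_trip == sym)
```

A naive `split(".")` yields three pieces for a dotted ticker, which is the failure mode a single sanctioned parser avoids.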

## 2. LLM provider registry

The router from [alphaswarm/llm/providers/router.py](../alphaswarm/llm/providers/router.py)
dispatches every LLM call through LiteLLM. Adding a provider is a
single dict entry in
[alphaswarm/llm/providers/catalog.py](../alphaswarm/llm/providers/catalog.py).

```mermaid
classDiagram
    class ProviderSpec {
        <>
        +str slug
        +str litellm_prefix
        +str env_key
        +str settings_attr
        +str base_url_attr
        +str default_deep_model
        +str default_quick_model
        +bool requires_api_key
    }
    class LLMProvider {
        <>
        +ProviderSpec spec
        +model_string(model) str*
        +api_key() str*
        +base_url() str*
        +default_model(tier) str
    }
    class _DefaultProvider {
        +model_string(model) str
        +api_key() str
        +base_url() str
    }
    class LLMResult {
        <>
        +str content
        +str model
        +str provider
        +int prompt_tokens
        +int completion_tokens
        +float cost_usd
        +Any raw
    }
    class router_complete {
        <>
        +complete(provider, model, prompt, ...) LLMResult
    }
    class PROVIDERS {
        <>
        openai
        anthropic
        google
        xai
        deepseek
        groq
        openrouter
        ollama
        vllm
    }

    LLMProvider <|-- _DefaultProvider
    _DefaultProvider --> ProviderSpec
    PROVIDERS --> ProviderSpec : "values"
    router_complete --> LLMProvider : "get_provider(slug)"
    router_complete --> LLMResult : "returns"
```

**Conventions**:

- Always call via `router_complete(provider=..., model=..., ...)`.
- Tier (`deep`/`quick`) routing happens via `settings.provider_for_tier`
  + `provider.default_model(tier)`.
- The control plane in [alphaswarm/runtime/control_plane.py](../alphaswarm/runtime/control_plane.py)
  can override `ollama_host` / `vllm_base_url` at runtime.

## 3. Strategy hierarchy

AlphaSwarm follows the Lean 5-stage pattern (Universe → Alpha → Portfolio →
Risk → Execution). Concrete strategies are factory-instantiated from
config via the `class`/`module_path`/`kwargs` registry pattern.

```mermaid
classDiagram
    class IStrategy {
        <<interface>>
        +on_bar(bar, context) Iterator~OrderRequest~
        +on_order_update(order) None
    }
    class IUniverseSelectionModel {
        <<interface>>
        +select(timestamp, context) list~Symbol~
    }
    class IAlphaModel {
        <<interface>>
        +generate_signals(history, universe, context) list~Signal~
    }
    class IPortfolioConstructionModel {
        <<interface>>
        +construct(signals, context) list~PortfolioTarget~
    }
    class IRiskManagementModel {
        <<interface>>
        +evaluate(targets, context) list~PortfolioTarget~
    }
    class IExecutionModel {
        <<interface>>
        +execute(targets, context) list~OrderRequest~
    }
    class FrameworkAlgorithm {
        +IUniverseSelectionModel universe_model
        +IAlphaModel alpha_model
        +IPortfolioConstructionModel portfolio_model
        +IRiskManagementModel risk_model
        +IExecutionModel execution_model
        +int rebalance_every
        +on_bar(bar, context) Iterator
    }
    class MeanReversionAlpha
    class MomentumAlpha
    class MLAlphaStrategy
    class MLSelectorAlpha
    class EnsembleAlpha {
        +list~IAlphaModel~ alphas
        +list~float~ weights
    }
    class DeployedModelAlpha {
        +str deployment_id
    }
    class BlackLittermanPortfolio
    class HRPPortfolio
    class MeanVariancePortfolio
    class RiskParityPortfolio
    class TwapExecution
    class VwapExecution

    IStrategy <|.. FrameworkAlgorithm
    IAlphaModel <|.. MeanReversionAlpha
    IAlphaModel <|.. MomentumAlpha
    IAlphaModel <|.. MLAlphaStrategy
    IAlphaModel <|.. MLSelectorAlpha
    IAlphaModel <|.. EnsembleAlpha
    IAlphaModel <|.. DeployedModelAlpha
    IPortfolioConstructionModel <|.. BlackLittermanPortfolio
    IPortfolioConstructionModel <|.. HRPPortfolio
    IPortfolioConstructionModel <|.. MeanVariancePortfolio
    IPortfolioConstructionModel <|.. RiskParityPortfolio
    IExecutionModel <|.. TwapExecution
    IExecutionModel <|.. VwapExecution
    FrameworkAlgorithm o-- IUniverseSelectionModel
    FrameworkAlgorithm o-- IAlphaModel
    FrameworkAlgorithm o-- IPortfolioConstructionModel
    FrameworkAlgorithm o-- IRiskManagementModel
    FrameworkAlgorithm o-- IExecutionModel
    EnsembleAlpha o-- "many" IAlphaModel
```

The interfaces are in [alphaswarm/core/interfaces.py](../alphaswarm/core/interfaces.py);
concrete alphas in [alphaswarm/strategies/](../alphaswarm/strategies/) (one file per
alpha). See [alphaswarm_docs/factor-research.md](../../concepts/strategy/factor-research.md) for the
authoring guide.
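The `class`/`module_path`/`kwargs` registry pattern mentioned at the top of this section can be sketched generically. The `instantiate` helper is hypothetical, and the example config points at a standard-library class so the snippet runs anywhere; a real config would name a strategy module instead.

```python
import importlib
from typing import Any


def instantiate(config: dict[str, Any]) -> Any:
    """Import `module_path`, look up `class`, and call it with `kwargs`."""
    module = importlib.import_module(config["module_path"])
    cls = getattr(module, config["class"])
    return cls(**config.get("kwargs", {}))


# Stand-in config: a stdlib class instead of a real alpha model.
config = {
    "class": "timedelta",
    "module_path": "datetime",
    "kwargs": {"hours": 36},
}
obj = instantiate(config)
print(type(obj).__name__, obj.total_seconds())
```

Since the config is plain data, it can be stored and versioned alongside a spec, and swapping one model for another becomes a config change with no code edit.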

## 4. Backtest + paper + live (IBrokerage / IDataQueueHandler)

The same strategy runs unchanged across backtest, paper, and live —
the engines differ in how they implement the broker + data-queue
contract, not in how they call the strategy.

```mermaid
classDiagram
    class IBrokerage {
        <<interface>>
        +submit_order(order) OrderTicket
        +cancel_order(ticket) bool
        +get_positions() list~SecurityHolding~
        +get_cashbook() CashBook
        +on_order_event(callback) None
    }
    class IDataQueueHandler {
        <<interface>>
        +subscribe(config) None
        +unsubscribe(config) None
        +get_next_ticks() Iterable~Tick~
    }
    class IHistoryProvider {
        <<interface>>
        +get_bars(symbol, start, end, resolution) DataFrame
    }
    class BacktestEngine {
        +IStrategy strategy
        +IDataQueueHandler data
        +IBrokerage brokerage
        +run(start, end) BacktestResult
    }
    class VectorbtEngine {
        +run(start, end) BacktestResult
    }
    class LocalSimulationEngine
    class PaperTradingEngine
    class WalkForwardEngine
    class MonteCarloEngine
    class BrokerSim {
        +decimal cash
        +dict positions
    }
    class AlpacaBrokerage
    class IbkrBrokerage
    class TradierBrokerage
    class DuckDBHistoryProvider
    class KafkaDataFeed

    IBrokerage <|.. BrokerSim
    IBrokerage <|.. AlpacaBrokerage
    IBrokerage <|.. IbkrBrokerage
    IBrokerage <|.. TradierBrokerage
    IDataQueueHandler <|.. KafkaDataFeed
    IHistoryProvider <|.. DuckDBHistoryProvider
    BacktestEngine <|-- VectorbtEngine
    BacktestEngine <|-- LocalSimulationEngine
    BacktestEngine <|-- PaperTradingEngine
    BacktestEngine <|-- WalkForwardEngine
    BacktestEngine <|-- MonteCarloEngine
    BacktestEngine o-- IBrokerage
    BacktestEngine o-- IDataQueueHandler
```
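
The "same strategy, different engine" idea can be sketched in a few lines. This is a minimal illustration that assumes a `typing.Protocol` stand-in for the `IBrokerage` contract; `Brokerage`, `ToyBrokerage`, `BuyOnce`, and `run_engine` are hypothetical names for this example, not the AlphaSwarm classes.

```python
from dataclasses import dataclass, field
from typing import Iterable, Protocol


class Brokerage(Protocol):
    """Stand-in for the IBrokerage contract (only one method shown)."""

    def submit_order(self, order: dict) -> str: ...


@dataclass
class ToyBrokerage:
    """Hypothetical simulator: records orders instead of routing them."""

    label: str
    orders: list = field(default_factory=list)

    def submit_order(self, order: dict) -> str:
        self.orders.append(order)
        return f"{self.label}-{len(self.orders)}"


class BuyOnce:
    """Hypothetical strategy: buys on the first bar, then stays idle."""

    def __init__(self) -> None:
        self.done = False

    def on_bar(self, bar: dict, broker: Brokerage) -> None:
        if not self.done:
            broker.submit_order({"symbol": bar["symbol"], "qty": 1})
            self.done = True


def run_engine(strategy: BuyOnce, bars: Iterable[dict], broker: Brokerage) -> None:
    """The engine owns the loop; the strategy only sees the contract."""
    for bar in bars:
        strategy.on_bar(bar, broker)


bars = [{"symbol": "AAPL", "close": 190.0}, {"symbol": "AAPL", "close": 191.0}]
sim, paper = ToyBrokerage("sim"), ToyBrokerage("paper")
run_engine(BuyOnce(), bars, sim)    # "backtest" wiring
run_engine(BuyOnce(), bars, paper)  # "paper" wiring, same strategy class
```

Swapping the brokerage object is the only change between the two runs, which is the property the diagram above expresses with the `o--` aggregation arrows.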

Files of interest:

- [alphaswarm/backtest/engine.py](../alphaswarm/backtest/engine.py) — base engine
- [alphaswarm/backtest/vectorbt_engine.py](../alphaswarm/backtest/vectorbt_engine.py)
- [alphaswarm/backtest/broker_sim.py](../alphaswarm/backtest/broker_sim.py) — brokerage
  simulator used by all non-live engines
- [alphaswarm/trading/](../alphaswarm/trading/) — concrete `IBrokerage`
  implementations for paper + live
- [alphaswarm/streaming/](../alphaswarm/streaming/) — Kafka and IBKR feed handlers

See [alphaswarm_docs/backtest-engines.md](../../concepts/strategy/backtest-engines.md) for the full
engine matrix and [alphaswarm_docs/paper-trading.md](../../concepts/trading/paper-trading.md) for the
session lifecycle.

## 5. Generic ingestion pipeline

Discovery → Director → Materialise → Verify → Annotate. The
dataclasses below are the canonical contract between stages.

```mermaid
classDiagram
    class DiscoveredMember {
        <<dataclass>>
        +str path
        +str archive_path
        +str format
        +str delimiter
        +int size_bytes
        +str subdir
        +float outer_mtime
    }
    class DiscoveredDataset {
        <<dataclass>>
        +str family
        +list~DiscoveredMember~ members
        +int total_bytes
        +list~str~ sample_columns
        +list~str~ notes
        +list inventory_extra
    }
    class IngestionPlan {
        <<dataclass>>
        +str source_path
        +str namespace
        +list~PlannedDataset~ datasets
        +list skipped_assets
        +str director_raw
        +bool director_used
        +str director_error
    }
    class PlannedDataset {
        <<dataclass>>
        +str family
        +bool include
        +str target_namespace
        +str target_table
        +int expected_min_rows
        +str domain_hint
        +list~str~ member_paths
        +list~str~ skip_member_paths
        +str notes
        +iceberg_identifier() str
    }
    class VerifierVerdict {
        <<dataclass>>
        +str verdict
        +str reason
        +dict retry_with
        +str raw
        +str error
    }
    class MaterializeResult {
        <<dataclass>>
        +str iceberg_identifier
        +str table_name
        +int rows_written
        +int files_consumed
        +int files_skipped
        +bool truncated
        +list schema_fields
        +str error
    }
    class IngestionTableResult {
        <<dataclass>>
        +str family
        +str iceberg_identifier
        +int rows_written
        +bool truncated
        +dict annotation
        +dict plan
        +dict verifier
        +str error
    }
    class IngestionReport {
        <<dataclass>>
        +str source_path
        +str namespace
        +datetime started_at
        +datetime finished_at
        +int datasets_discovered
        +list~IngestionTableResult~ tables
        +list extras
        +list~str~ errors
        +dict director_plan
    }
    class IngestionPipeline {
        +ProgressCallback progress_cb
        +int max_rows_per_dataset
        +int max_files_per_dataset
        +int chunk_rows
        +bool director_enabled
        +list~str~ allowed_namespaces
        +run_path(path, namespace, annotate) IngestionReport
    }
    class AnnotationResult {
        <<dataclass>>
        +str identifier
        +str description
        +list~str~ tags
        +str domain
        +list pii_flags
        +list column_docs
        +str error
    }

    DiscoveredDataset o-- "many" DiscoveredMember
    IngestionPlan o-- "many" PlannedDataset
    IngestionPipeline ..> DiscoveredDataset : "discovery output"
    IngestionPipeline ..> IngestionPlan : "director output"
    IngestionPipeline ..> MaterializeResult : "per planned table"
    IngestionPipeline ..> VerifierVerdict : "if floor missed"
    IngestionPipeline ..> AnnotationResult : "if annotate=true"
    IngestionTableResult o-- VerifierVerdict
    IngestionTableResult o-- AnnotationResult
    IngestionReport o-- "many" IngestionTableResult
```
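
As a rough illustration of the "if floor missed" edge in the diagram, the sketch below re-declares trimmed versions of three of the dataclasses and gates the verifier on `expected_min_rows`. The `"<namespace>.<table>"` identifier format, the `"retry"` verdict string, and the `verify_if_floor_missed` helper are assumptions made for this example; the real logic lives in the pipeline modules listed below.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class PlannedDataset:
    family: str
    target_namespace: str
    target_table: str
    expected_min_rows: int = 0

    def iceberg_identifier(self) -> str:
        # Assumption: the identifier is "<namespace>.<table>".
        return f"{self.target_namespace}.{self.target_table}"


@dataclass
class MaterializeResult:
    iceberg_identifier: str
    rows_written: int
    error: Optional[str] = None


@dataclass
class VerifierVerdict:
    verdict: str
    reason: str


def verify_if_floor_missed(
    plan: PlannedDataset, result: MaterializeResult
) -> Optional[VerifierVerdict]:
    """Consult the verifier only when the planned row floor is missed."""
    if result.rows_written >= plan.expected_min_rows:
        return None  # floor met, no verifier call needed
    return VerifierVerdict(
        verdict="retry",  # hypothetical verdict value
        reason=f"{result.rows_written} rows < floor {plan.expected_min_rows}",
    )


plan = PlannedDataset("prices", "demo", "prices", expected_min_rows=100)
ok = MaterializeResult(plan.iceberg_identifier(), rows_written=250)
short = MaterializeResult(plan.iceberg_identifier(), rows_written=12)
```

Here `verify_if_floor_missed(plan, ok)` returns `None`, while the `short` result produces a verdict whose `reason` records both the observed row count and the floor.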

Files:

- [alphaswarm/data/pipelines/discovery.py](../alphaswarm/data/pipelines/discovery.py)
- [alphaswarm/data/pipelines/director.py](../alphaswarm/data/pipelines/director.py)
- [alphaswarm/data/pipelines/materialize.py](../alphaswarm/data/pipelines/materialize.py)
- [alphaswarm/data/pipelines/annotate.py](../alphaswarm/data/pipelines/annotate.py)
- [alphaswarm/data/pipelines/runner.py](../alphaswarm/data/pipelines/runner.py)
- [alphaswarm/data/pipelines/extractors.py](../alphaswarm/data/pipelines/extractors.py)

Walkthrough lives in [alphaswarm_docs/data-catalog.md](../../concepts/data/data-catalog.md).

## 6. Bot entity (TradingBot / ResearchBot)

The Bot Entity Refactor introduced a first-class deployable unit that
aggregates universe + strategy + engine + ML + agents + RAG + metrics.
The runtime never re-implements those primitives — it composes
references and dispatches to the existing entry points.

```mermaid
classDiagram
    class BotSpec {
        <>
        +str name
        +str slug
        +str kind
        +UniverseRef universe
        +DataPipelineRef data_pipeline
        +dict strategy
        +dict backtest
        +list~MLDeploymentRef~ ml_models
        +list~BotAgentRef~ agents
        +list~RAGRef~ rag
        +list~MetricRef~ metrics
        +RiskSpec risk
        +DeploymentTargetSpec deployment
        +snapshot_hash() str
    }
    class BaseBot {
        <>
        +BotSpec spec
        +str bot_id
        +str project_id
        +backtest(run_name, **overrides) dict
        +paper(run_name, **overrides) PaperTradingSession
        +deploy(target, **overrides) BotDeploymentResult
        +chat(prompt, ...) Any
        +metrics_snapshot(run_summary) dict
    }
    class TradingBot {
        +consult_agents(prompt, inputs, roles) dict
    }
    class ResearchBot {
        +chat(prompt, session_id, agent_role, inputs) dict
    }
    class BotRuntime {
        +BotSpec spec
        +str run_id
        +str task_id
        +backtest(run_name, overrides) BotRunResult
        +paper(run_name, overrides) BotRunResult
        +chat(prompt, session_id, agent_role) BotRunResult
        +deploy(target, overrides) BotRunResult
    }
    class DeploymentDispatcher {
        +deploy(bot, target, overrides) BotDeploymentResult
        +register(target) void
    }
    class DeploymentTarget {
        <>
        +str name
        +deploy(bot, overrides) BotDeploymentResult
    }
    class PaperSessionTarget
    class BacktestOnlyTarget
    class KubernetesTarget {
        +Path manifest_root
        +bool apply
        +render_manifest(bot, overrides) str
    }

    BotSpec <.. BaseBot
    BaseBot <|-- TradingBot
    BaseBot <|-- ResearchBot
    BotRuntime ..> BaseBot
    DeploymentDispatcher --> DeploymentTarget
    DeploymentTarget <|-- PaperSessionTarget
    DeploymentTarget <|-- BacktestOnlyTarget
    DeploymentTarget <|-- KubernetesTarget
    BotRuntime ..> DeploymentDispatcher : "deploy()"
```
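
The dispatcher/target relationship above is a small registry pattern. The following sketch assumes targets are keyed by their `name` attribute and that bots are passed as plain strings; the `"backtest_only"` name and the `status` field are hypothetical, chosen only to make the example runnable.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass
class BotDeploymentResult:
    target: str
    status: str  # hypothetical field for this sketch


class DeploymentTarget:
    """Base target; concrete targets override deploy()."""

    name = "base"

    def deploy(self, bot: str, overrides: dict) -> BotDeploymentResult:
        raise NotImplementedError


class BacktestOnlyTarget(DeploymentTarget):
    name = "backtest_only"  # hypothetical registry key

    def deploy(self, bot: str, overrides: dict) -> BotDeploymentResult:
        return BotDeploymentResult(target=self.name, status=f"queued:{bot}")


class DeploymentDispatcher:
    """Looks up a registered target by name and delegates to it."""

    def __init__(self) -> None:
        self._targets: Dict[str, DeploymentTarget] = {}

    def register(self, target: DeploymentTarget) -> None:
        self._targets[target.name] = target

    def deploy(
        self, bot: str, target: str, overrides: Optional[dict] = None
    ) -> BotDeploymentResult:
        if target not in self._targets:
            raise KeyError(f"unknown deployment target: {target}")
        return self._targets[target].deploy(bot, overrides or {})


dispatcher = DeploymentDispatcher()
dispatcher.register(BacktestOnlyTarget())
result = dispatcher.deploy("momentum-bot", "backtest_only")
```

Keeping the lookup in one place means a new target only has to subclass `DeploymentTarget` and be registered; callers such as `BotRuntime.deploy()` do not change.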

Files:

- [alphaswarm/bots/spec.py](../alphaswarm/bots/spec.py)
- [alphaswarm/bots/base.py](../alphaswarm/bots/base.py)
- [alphaswarm/bots/trading_bot.py](../alphaswarm/bots/trading_bot.py)
- [alphaswarm/bots/research_bot.py](../alphaswarm/bots/research_bot.py)
- [alphaswarm/bots/runtime.py](../alphaswarm/bots/runtime.py)
- [alphaswarm/bots/deploy.py](../alphaswarm/bots/deploy.py)
- [alphaswarm/bots/registry.py](../alphaswarm/bots/registry.py)
- [alphaswarm/bots/cli.py](../alphaswarm/bots/cli.py)

Walkthrough lives in [alphaswarm_docs/bots.md](../../concepts/agentic/bots.md).


<!-- https://alpha-swarm.ai/concepts/platform/code-index-governance -->
# Code Index Governance
> This document explains how agents should search and index AlphaSwarm during the repository split. The goal is to keep edits inside the right future project boundary before source code is physically separated...

# Code Index Governance

Status: active.

This document explains how agents should search and index AlphaSwarm during the
repository split. The goal is to keep edits inside the right future project
boundary before source code is physically separated.

## Search Order

1. Read the nearest `AGENTS.md` for the folder being edited.
2. Read `alphaswarm_docs/repository-split.md` to identify the owning domain.
3. Search within the owning domain first.
4. Only broaden to `alphaswarm/` or repo root when the boundary document says the
   implementation still lives there.
5. Record new reusable patterns in `alphaswarm_snippets/` or `.cursor/skills/`
   instead of scattering notes across unrelated docs.

## Domain Index

| Domain | Start here | Notes |
| --- | --- | --- |
| Control plane | `alphaswarm_controller/AGENTS.md` | `/manage/*`, providers, workload lifecycle |
| Platform core | `alphaswarm_core/AGENTS.md` | Shared contracts only |
| Client | `alphaswarm_client/AGENTS.md` | Active source remains in `alphaswarm_client/` |
| Snippets | `alphaswarm_snippets/AGENTS.md` | Reference-only curated knowledge |
| Bots | `alphaswarm_bots/AGENTS.md` | Runtime remains in `alphaswarm/bots/` for now |
| Runtime monolith | `AGENTS.md` | Agents, RL, data, backtests, persistence, tasks |

## Indexing Rules

- Codebase MCP indexes must respect workspace allow-lists and secret
  deny-lists from `alphaswarm/codebase/mcp/policy.py`.
- Generated indexes should not include `.env`, private keys, kubeconfigs,
  token files, model weights, or local warehouse data.
- Agent-readable docs should link to paths, not line numbers, unless the
  output is a transient review.
- Keep split-boundary indexes short enough that agents can read them before
  editing.
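
The secret deny-list rule can be pictured as a filename filter applied before a file reaches the index. This is only a sketch: the patterns below are illustrative guesses at what such a list might contain, and `is_indexable` is a hypothetical helper; the authoritative rules are the ones in `alphaswarm/codebase/mcp/policy.py`.

```python
from fnmatch import fnmatch
from pathlib import PurePosixPath

# Hypothetical deny patterns covering the categories named above.
DENY_PATTERNS = (
    ".env",
    ".env.*",
    "*.pem",
    "*.key",
    "kubeconfig*",
    "*.token",
    "*.safetensors",
    "*.pt",
    "*.parquet",
)


def is_indexable(path: str) -> bool:
    """Return False when the file name matches any deny pattern."""
    name = PurePosixPath(path).name
    return not any(fnmatch(name, pattern) for pattern in DENY_PATTERNS)


candidates = [
    "alphaswarm/bots/spec.py",
    "deploy/.env",
    "secrets/id_rsa.pem",
    "models/weights.safetensors",
]
indexable = [p for p in candidates if is_indexable(p)]
```

Matching on the file name alone keeps the check cheap; a real policy would likely also consider directories and workspace allow-lists.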

## Boundary Checks

Use these searches before a boundary-sensitive change:

```bash
rg --type py "^\s*(from|import)\s+alphaswarm(\.|\s|$)" alphaswarm_controller/src
rg "alphaswarm_snippets|extractions|inspiration" alphaswarm alphaswarm_controller alphaswarm_core
rg "control.local/api|management/backend|management/frontend" docs README.md
```

The first command must return no matches. The second and third commands may
return documented migration references, but should not reveal runtime imports
or active instructions that route new work to deprecated surfaces.


<!-- https://alpha-swarm.ai/concepts/platform/contingency-graphs -->
# Contingency graphs (OCO / OUO / OTO)
> OCO (one cancels other): when any constituent fills (partial or full), the others are canceled. Canonical use: bracket a position with a take-profit limit + s...

# Contingency graphs (OCO / OUO / OTO)

> Status: **Phase 2 shipped** (Alembic 0041). Manager:
> [`alphaswarm/trading/execution/contingency.py`](../alphaswarm/trading/execution/contingency.py).

## The three relationships

| Type | Behaviour |
| --- | --- |
| **OCO** (one cancels other) | When any constituent fills (partial or full), the others are canceled. Canonical use: bracket a position with a take-profit limit + stop-loss stop. |
| **OUO** (one updates other) | When any constituent's quantity changes (partial fill or amend), every other constituent's quantity is updated to match the remaining size. Useful when the bracket has more than two legs. |
| **OTO** (one triggers other) | The parent is the trigger; children are emulated until the parent fills, then they're submitted. Canonical use: place an entry limit + parked TP/SL waiting for the entry to hit. |

## Class layout

```mermaid
flowchart LR
    BotRuntime -->|"submit_list(order_list)"| Broker
    Broker -->|"emits ExecutionReport"| Dispatcher
    Dispatcher -->|"on_execution_report"| ContingencyManager
    ContingencyManager -->|"ContingencyCommand"| ExecutionLoop
    ExecutionLoop -->|"cancel/amend/submit"| Broker
```

## Manager behaviour

For each constituent, the manager tracks shadow ``remaining_quantity``
and ``status``:

* On fill (full or partial), OCO emits ``CANCEL`` for every peer.
* On partial fill, OUO emits ``UPDATE_QUANTITY`` for every peer with
  the new ``remaining_quantity``.
* On full fill, OUO emits ``CANCEL`` for every peer (degenerates to
  OCO).
* On parent fill, OTO emits ``SUBMIT`` for every child. Subsequent
  child fills don't re-trigger anything.
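
The rules above can be condensed into one function over the shadow state. This is a simplified sketch, not the code in `contingency.py`: the `Leg` dataclass, the `on_fill` signature, and the tuple-shaped commands are hypothetical, and it assumes that only a full parent fill releases OTO children.

```python
from dataclasses import dataclass
from decimal import Decimal
from typing import Dict, List, Optional, Tuple

# (action, order_id, quantity); quantity is None when it does not apply.
Command = Tuple[str, str, Optional[Decimal]]


@dataclass
class Leg:
    """Shadow state the manager keeps for one constituent."""

    order_id: str
    remaining_quantity: Decimal


def on_fill(kind: str, legs: Dict[str, Leg], filled_id: str,
            fill_qty: Decimal, parent_id: str = "") -> List[Command]:
    """Update shadow state for one fill and return the peer commands."""
    leg = legs[filled_id]
    leg.remaining_quantity -= fill_qty
    peers = [p for oid, p in legs.items() if oid != filled_id]
    fully_filled = leg.remaining_quantity <= 0

    if kind == "OCO":
        # Any fill, partial or full, cancels every peer.
        return [("CANCEL", p.order_id, None) for p in peers]
    if kind == "OUO":
        if fully_filled:
            # A full fill degenerates to OCO.
            return [("CANCEL", p.order_id, None) for p in peers]
        for p in peers:
            p.remaining_quantity = leg.remaining_quantity
        return [("UPDATE_QUANTITY", p.order_id, leg.remaining_quantity)
                for p in peers]
    if kind == "OTO":
        # Assumption: only a full parent fill releases the children.
        if filled_id == parent_id and fully_filled:
            return [("SUBMIT", p.order_id, p.remaining_quantity) for p in peers]
        return []  # child fills never re-trigger anything
    raise ValueError(f"unknown contingency type: {kind}")


legs = {"tp": Leg("tp", Decimal("10")), "sl": Leg("sl", Decimal("10"))}
partial = on_fill("OUO", legs, "tp", Decimal("4"))  # peers resized to 6
full = on_fill("OUO", legs, "tp", Decimal("6"))     # peers cancelled
```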

## Venue dispatch

Two routes:

1. **Native atomic submission** -- when the broker sets
   ``supports_oco = True``, the broker's
   ``IDomainBrokerage.submit_list`` submits the whole list in a
   single venue call (Alpaca bracket orders, IBKR OCA groups). The
   manager still registers the list, so it can emit cleanup commands
   after a partial cancel when the venue's atomicity is only
   best-effort.

2. **Manager-simulated** -- when the broker sets
   ``supports_oco = False``, the broker submits each constituent
   independently and the manager owns the cross-order cancels via
   ``ContingencyManager.on_execution_report``.
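
A compact way to picture the two routes is a single branch on ``supports_oco``. The sketch below is synchronous and uses hypothetical stand-ins (`ToyBroker`, `RecordingManager`, `submit_native`, `submit_single`, dict-shaped order lists); the real ``submit_list`` is a coroutine on the brokerage.

```python
class RecordingManager:
    """Hypothetical manager stub: remembers which lists it tracks."""

    def __init__(self) -> None:
        self.registered = []

    def register(self, order_list: dict) -> None:
        self.registered.append(order_list["id"])


class ToyBroker:
    """Hypothetical broker stub: counts calls that reach the venue."""

    def __init__(self, supports_oco: bool) -> None:
        self.supports_oco = supports_oco
        self.venue_calls = []

    def submit_native(self, order_list: dict) -> None:
        # One venue call carrying every constituent.
        self.venue_calls.append(tuple(order_list["orders"]))

    def submit_single(self, order: str) -> None:
        # One venue call per constituent.
        self.venue_calls.append((order,))


def submit_list(broker: ToyBroker, manager: RecordingManager, order_list: dict) -> None:
    # The manager registers the list on both routes.
    manager.register(order_list)
    if broker.supports_oco:
        broker.submit_native(order_list)
    else:
        for order in order_list["orders"]:
            broker.submit_single(order)


manager = RecordingManager()
native, simulated = ToyBroker(True), ToyBroker(False)
submit_list(native, manager, {"id": "oco-1", "orders": ["tp-1", "sl-1"]})
submit_list(simulated, manager, {"id": "oco-2", "orders": ["tp-2", "sl-2"]})
```

The native broker ends up with one venue call and the simulated broker with two, while the manager tracks both lists.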

## Code example

```python
from decimal import Decimal
from alphaswarm.core.domain.identifiers import (
    ClientOrderId, InstrumentId, OrderListId, Symbol2, Venue,
)
from alphaswarm.core.domain.enums import ContingencyType, OrderSide, OrderType
from alphaswarm.core.domain.orders import LimitOrder, StopMarketOrder, OrderList

# Take-profit at 200, stop-loss at 180 -- OCO bracket
tp = LimitOrder(
    client_order_id=ClientOrderId("tp-1"),
    instrument_id=InstrumentId(Symbol2("AAPL"), Venue("NASDAQ")),
    order_side=OrderSide.SELL,
    quantity=Decimal("10"),
    order_type=OrderType.LIMIT,
    price=Decimal("200"),
)
sl = StopMarketOrder(
    client_order_id=ClientOrderId("sl-1"),
    instrument_id=InstrumentId(Symbol2("AAPL"), Venue("NASDAQ")),
    order_side=OrderSide.SELL,
    quantity=Decimal("10"),
    order_type=OrderType.STOP_MARKET,
    trigger_price=Decimal("180"),
)
order_list = OrderList(
    order_list_id=OrderListId("oco-1"),
    orders=[tp, sl],
    contingency_type=ContingencyType.OCO,
)
# Submit the entire list atomically (or simulated).
# `broker` is an IDomainBrokerage implementation; call this inside a coroutine.
await broker.submit_list(order_list)
```

## Persistence

The Alembic ``0041`` migration adds:

* ``order_lists`` -- one row per ``OrderList``
* ``domain_orders.order_list_id`` FK -- ties constituents to parent
* ``execution_reports`` -- audit trail; the manager reads this to
  recover state after a restart

## Limitations

* OUO with three or more constituents updates ALL peers to the
  smallest remaining size. This matches the standard interpretation
  but may not be what every venue does -- check the venue's docs.
* OTO with multiple parents (a single child triggered by either of two
  parents) is NOT supported; that's a contingency-graph generalisation
  the manager can't currently express.
* Native broker OCO often comes with constraints (Alpaca brackets
  require the limit + stop to be on the same instrument; IBKR OCA
  groups allow cross-instrument). The contingency manager handles
  cross-instrument simulation.


<!-- https://alpha-swarm.ai/concepts/platform/core-types -->
# Core Type System
> AlphaSwarm 0.3 ports Lean's data model into Python with minimal surface area, full backward compatibility, and strict ``dataclass``-only value objects

# Core Type System

> Doc map: [alphaswarm_docs/index.md](../../intro/index.md) · Full `Symbol` class diagram: [alphaswarm_docs/class-diagram.md#1-symbol--core-enums](../../concepts/platform/class-diagram.md#1-symbol--core-enums).

AlphaSwarm 0.3 ports Lean's data model into Python with minimal surface area,
full backward compatibility, and strict ``dataclass``-only value objects.

## Quick map

| Lean (C#) | AlphaSwarm (Python) | File |
|---|---|---|
| `Slice` | `Slice` | [`alphaswarm/core/slice.py`](../alphaswarm/core/slice.py) |
| `BaseData` | `BarData` (alias `TradeBar`), `QuoteBar`, `TickData` (alias `Tick`) | [`alphaswarm/core/types.py`](../alphaswarm/core/types.py) |
| `SubscriptionDataConfig` | `SubscriptionDataConfig` | same |
| `Resolution` / `TickType` | `Resolution` / `TickType` | same |
| `DataNormalizationMode` | `DataNormalizationMode` | same |
| `Symbol` / `SecurityIdentifier` | `Symbol` (composite `ticker.exchange`) | same |
| `Security` / `SecurityHolding` | `SecurityHolding` (extends `PositionData`) | same |
| `Cash` / `CashBook` | `Cash` / `CashBook` | same |
| `Order` / `OrderTicket` / `OrderEvent` | `OrderData` / `OrderTicket` / `OrderEvent` | same |
| `IndicatorBase` / `RollingWindow` | `IndicatorBase[T]` / `RollingWindow[T]` | [`alphaswarm/core/indicators.py`](../alphaswarm/core/indicators.py) |
| `MarketHoursDatabase` | `MarketHoursDatabase` | [`alphaswarm/core/exchange_hours.py`](../alphaswarm/core/exchange_hours.py) |
| `MapFile` / `FactorFile` | `MapFile` / `FactorFile` | [`alphaswarm/core/corporate_actions.py`](../alphaswarm/core/corporate_actions.py) |

## Migration notes

- **`BarData` is unchanged.** ``TradeBar`` is a type alias; existing
  backtest code keeps working.
- **``TickData`` is unchanged.** ``Tick`` is an alias.
- **``PositionData`` is unchanged.** The richer ``SecurityHolding`` is
  additive — convert via ``SecurityHolding.from_position(pos)``.
- **``on_bar(bar, ctx)`` remains the supported strategy entry point.**
  Strategies that implement ``on_data(slice, ctx)`` get called once per
  timestamp instead of once per symbol; the engine auto-detects which
  method to call.
- **Orders now surface as tickets.** The engine populates
  ``BacktestResult.tickets`` with ``OrderTicket`` objects that
  carry the full ``OrderEvent`` stream for each order.
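
The ``on_bar`` / ``on_data`` auto-detection can be sketched as a capability check on the strategy object. This is an illustration under assumptions: `dispatch` is a hypothetical helper, bars are plain dicts, a "slice" is reduced to a symbol-to-bar mapping, and ``on_data`` is preferred when both methods exist (the text above does not say which wins).

```python
from collections import defaultdict


def dispatch(strategy, bars, ctx=None) -> None:
    """Call on_data once per timestamp if present, else on_bar per bar."""
    if callable(getattr(strategy, "on_data", None)):
        by_ts = defaultdict(dict)
        for bar in bars:
            by_ts[bar["ts"]][bar["symbol"]] = bar
        for ts in sorted(by_ts):
            strategy.on_data(by_ts[ts], ctx)  # one call per timestamp
    else:
        for bar in bars:
            strategy.on_bar(bar, ctx)  # one call per symbol per timestamp


class PerBar:
    def __init__(self) -> None:
        self.calls = 0

    def on_bar(self, bar, ctx) -> None:
        self.calls += 1


class PerSlice:
    def __init__(self) -> None:
        self.calls = 0

    def on_data(self, slice_, ctx) -> None:
        self.calls += 1


bars = [
    {"ts": 1, "symbol": "AAPL"}, {"ts": 1, "symbol": "MSFT"},
    {"ts": 2, "symbol": "AAPL"}, {"ts": 2, "symbol": "MSFT"},
]
per_bar, per_slice = PerBar(), PerSlice()
dispatch(per_bar, bars)
dispatch(per_slice, bars)
```

With two symbols over two timestamps, the bar-based strategy is called four times and the slice-based one twice.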

## Indicator registry

AlphaSwarm ships 25 built-in indicators, all subclasses of ``IndicatorBase``. Resolve one by
string via ``build_indicator("SMA", period=20)`` or import the class directly.

```python
from alphaswarm.core.indicators import SimpleMovingAverage, warmup

sma = SimpleMovingAverage(20)
print(warmup(sma, [100, 101, 102]))  # NaN until 20 samples
```

## Subscription routing

Every downstream consumer (backtest engine, paper engine, RL env,
factor job) reads data through ``SubscriptionDataConfig`` via
``alphaswarm.data.subscription.SubscriptionManager``. Routing every read
through that one layer enables normalisation-aware queries and composite
history providers without touching strategy code.

## Type relationships

```mermaid
classDiagram
    class Symbol {
        +str ticker
        +Exchange exchange
        +AssetClass asset_class
        +SecurityType security_type
        +str vt_symbol
        +parse(s) Symbol
    }
    class BarData
    class TickData
    class Signal
    class OrderRequest
    class OrderTicket
    class OrderEvent
    class SubscriptionDataConfig

    Symbol <-- BarData
    Symbol <-- TickData
    Symbol <-- Signal
    Symbol <-- OrderRequest
    OrderRequest --> OrderTicket : submit_order
    OrderTicket --> OrderEvent : "events stream"
    SubscriptionDataConfig --> Symbol
```


<!-- https://alpha-swarm.ai/concepts/platform/domain-model -->
# Domain Model
> The AlphaSwarm platform's domain model lives under [`alphaswarm/core/domain/`](../alphaswarm/core/domain/) and is the single source of truth for every tradable-asset, issuer, event, market-data, fundamentals, ownership, c...

# Domain Model

> Doc map: [alphaswarm_docs/index.md](../../intro/index.md) · Schema diagrams: [alphaswarm_docs/erd.md](../../concepts/platform/erd.md) · Column reference: [alphaswarm_docs/data-dictionary.md](../../reference/data-dictionary/index.md).

The AlphaSwarm platform's domain model lives under
[`alphaswarm/core/domain/`](../alphaswarm/core/domain/) and is the single source of truth
for every tradable-asset, issuer, event, market-data, fundamentals,
ownership, calendar, economic, and news primitive in the platform.

The expansion absorbs the best abstractions from four best-of-breed
open-source quant projects:

| Inspiration | What we took |
|---|---|
| [gs-quant](https://github.com/goldmansachs/gs-quant) | `(AssetClass, AssetType) → Instrument` dispatch ([`gs_quant/instrument/core.py`](https://github.com/goldmansachs/gs-quant/blob/master/gs_quant/instrument/core.py)), `XRef`/`Security` identifier flattening, `PricingContext`/`RiskMeasure` scaffolding ([`gs_quant/common.py`](https://github.com/goldmansachs/gs-quant/blob/master/gs_quant/common.py)) |
| [vnpy](https://github.com/vnpy/vnpy) | `ContractData` with `size`/`pricetick`/`min_volume`/option metadata, `Offset` (OPEN/CLOSE/CLOSE_TODAY/CLOSE_YESTERDAY) enum, 5-level `TickData`, uniform `*Request` envelopes |
| [nautilus_trader](https://github.com/nautechsystems/nautilus_trader) | Typed identifier value objects, polymorphic `Instrument` grid (`Equity`/`FuturesContract`/`OptionContract`/`CurrencyPair`/`Cfd`/`CryptoPerpetual`/`BettingInstrument`/`BinaryOption`/`SyntheticInstrument`/`TokenizedAsset`), `OrderBook`/`BookLevel` primitives, option greeks, DeFi scaffolds |
| [OpenBB Platform](https://github.com/OpenBB-finance/OpenBB) | `Fetcher[Q, R]` + `QueryParams` + `Data` triad, plus ~170 `standard_models` covering every research datatype from `balance_sheet` and `insider_trading` through `federal_funds_rate` and `cot` |

## Layout

```
alphaswarm/core/domain/
├── identifiers.py         # Typed IDs + IdentifierScheme + IdentifierSet
├── enums.py               # 25+ StrEnum catalogs (AssetClass, InstrumentClass, OrderType, ...)
├── money.py               # Currency + Price/Quantity/Money precision-safe scalars
├── instrument.py          # Polymorphic Instrument hierarchy + (AssetClass, InstrumentClass) dispatch
├── issuer.py              # Issuer / CorporateEntity / Fund / GovernmentEntity graph
├── market_data.py         # Bar/Tick/QuoteTick/TradeTick/OrderBook/MarkPriceUpdate + RichSlice
├── orders.py              # DomainOrder hierarchy + full OrderEvent family
├── positions.py           # DomainPosition hierarchy + PositionEvent family
├── greeks.py              # OptionGreeks / OptionGreekValues / PortfolioGreeks
├── options.py             # OptionChain / OptionChainSlice / OptionSeriesId / StrikeRange
├── events.py              # DomainEvent union (filing/earnings/news/dividend/ipo/merger/esg/...)
├── fundamentals.py        # BalanceSheet / IncomeStatement / CashFlow / FinancialRatios / KeyMetrics
├── ownership.py           # InsiderTransaction / Form13F / ShortInterest / SharesFloat / ...
├── calendar_events.py     # CalendarEarnings/Dividend/Split/Ipo/EconomicCalendar + MarketHoliday
├── economic.py            # TreasuryRate/YieldCurve/FederalFundsRate/CPI/Unemployment/CoT/FRED
└── news.py                # NewsItem / CompanyNews / WorldNews / Sentiment
```

Persistence sibling modules under [`alphaswarm/persistence/`](../alphaswarm/persistence/):

- `models_instruments.py` — joined-table subclasses (InstrumentEquity, InstrumentOption, InstrumentFuture, …).
- `models_entities.py` — `issuers` + related graph tables.
- `models_fundamentals.py` — statements / ratios / metrics / transcripts / MD&A.
- `models_events.py` — corporate / calendar / analyst / regulatory / ESG event tables.
- `models_ownership.py` — insider / institutional / 13F / short-interest / float / politician-trades.
- `models_news.py` — news items + entity M2M + sentiment.
- `models_macro.py` — economic series / observations / CoT / treasury / yield curve / option-chain snapshots.
- `models_taxonomy.py` — taxonomy schemes + nodes + polymorphic entity tags + entity crosswalk.

The Alembic migration [`alembic/versions/0008_domain_model_expansion.py`](../alembic/versions/0008_domain_model_expansion.py) creates every new table, extends `instruments` with the polymorphic discriminator + richer columns, creates an `instruments_flat` back-compat view, and seeds `taxonomy_schemes` with SIC / NAICS / GICS / TRBC / ICB / BICS / NACE plus user-defined `thematic`, `region`, `risk` roots.

## Instrument hierarchy

```mermaid
classDiagram
    class Instrument {
      InstrumentId instrument_id
      AssetClass asset_class
      InstrumentClass instrument_class
      Currency currency
      Decimal tick_size
      Decimal multiplier
      IdentifierSet identifiers
    }
    Instrument <|-- Equity
    Instrument <|-- ETF
    Instrument <|-- IndexInstrument
    Instrument <|-- Bond
    Instrument <|-- FuturesContract
    Instrument <|-- FuturesSpread
    Instrument <|-- OptionContract
    Instrument <|-- OptionSpread
    Instrument <|-- BinaryOption
    Instrument <|-- CurrencyPair
    Instrument <|-- Cfd
    Instrument <|-- Commodity
    Instrument <|-- SyntheticInstrument
    Instrument <|-- CryptoToken
    Instrument <|-- CryptoFuture
    Instrument <|-- CryptoPerpetual
    Instrument <|-- CryptoOption
    Instrument <|-- PerpetualContract
    Instrument <|-- TokenizedAsset
    Instrument <|-- BettingInstrument
    Instrument <|-- Swap
```

Dispatch via `instrument_class_for(asset_class, instrument_class)` returns the concrete class:

```python
from alphaswarm.core.domain import instrument_class_for, AssetClass, InstrumentClass

cls = instrument_class_for(AssetClass.EQUITY, InstrumentClass.OPTION)
assert cls.__name__ == "OptionContract"
```

YAML recipes can say:

```yaml
instrument:
  class: Equity
  kwargs:
    instrument_id: { symbol: AAPL, venue: NASDAQ }
    cik: "0000320193"
    isin: US0378331005
```

…and `alphaswarm.core.registry.build_from_config` routes through the instrument registry automatically.

## Issuer graph

```mermaid
classDiagram
    class Issuer {
      str issuer_id
      str name
      EntityKind kind
      str cik
      str lei
      int sic
      str naics
      Sector sector
      Industry industry
    }
    Issuer <|-- CorporateEntity
    Issuer <|-- GovernmentEntity
    Issuer <|-- Fund
    Issuer --> IndustryClassification : classifications
    Issuer --> Location : locations
    Issuer --> KeyExecutive : key_executives
    Issuer --> ExecutiveCompensation : compensation
    EntityRelationship --> Issuer : from_entity
    EntityRelationship --> Issuer : to_entity
    Instrument --> Issuer : issuer_id
```

Every `Equity`, `Bond`, and `ETF` points at an `Issuer` row. The `Issuer` mirrors OpenBB's `EquityInfoData` schema (CIK, CUSIP, ISIN, LEI, legal_name, SIC, HQ address, employees, sector, industry), so data ingested from any OpenBB-compatible provider flows in without shape changes.

## Event flow

```mermaid
flowchart LR
    subgraph Corporate
      FilingEvent
      EarningsEvent
      CorporateActionEvent
      IPOEvent
      MergerEvent
    end
    subgraph Research
      AnalystRatingEvent
      PriceTargetEvent
      ForwardEstimateEvent
    end
    subgraph Ownership
      InsiderTransactionEvent
      InstitutionalHoldingEvent
      PoliticianTradeEvent
    end
    subgraph Macro
      EconomicObservationEvent
      CotReportEvent
    end
    subgraph Alternative
      NewsEvent
      SocialSentimentEvent
      RegulatoryEvent
      ESGEvent
      MaritimeEvent
      PortVolumeEvent
    end
```

All events inherit from `DomainEvent` and share `ts_event` / `ts_init` / `event_id` / `source` / `instrument_id` / `issuer` / `meta`. Downstream consumers can demultiplex by `kind` without importing the concrete class.
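
Demultiplexing by `kind` might look like the sketch below. It uses a trimmed, hypothetical stand-in for `DomainEvent` (only `kind`, `event_id`, `source`, and `meta`), and the `"news"` / `"filing"` kind strings are illustrative values rather than the platform's actual enum members.

```python
from collections import defaultdict
from dataclasses import dataclass, field
from typing import Dict, Iterable, List


@dataclass
class DomainEvent:
    """Trimmed stand-in carrying a few of the shared fields."""

    kind: str
    event_id: str
    source: str
    meta: dict = field(default_factory=dict)


def demultiplex(events: Iterable[DomainEvent]) -> Dict[str, List[DomainEvent]]:
    """Group events by `kind` without inspecting their concrete class."""
    buckets: Dict[str, List[DomainEvent]] = defaultdict(list)
    for event in events:
        buckets[event.kind].append(event)
    return dict(buckets)


events = [
    DomainEvent(kind="news", event_id="e1", source="feed-a"),
    DomainEvent(kind="filing", event_id="e2", source="feed-b"),
    DomainEvent(kind="news", event_id="e3", source="feed-a"),
]
by_kind = demultiplex(events)
```

Because routing relies only on the shared `kind` field, a consumer that cares about one family never needs to import the other concrete event classes.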

## Fundamentals

```mermaid
classDiagram
    class FundamentalsBase {
      str symbol
      str issuer_id
      date period
      PeriodType period_type
      int fiscal_year
      str fiscal_period
      str currency
      datetime as_of
      str source_filing_accession
    }
    FundamentalsBase <|-- BalanceSheet
    FundamentalsBase <|-- IncomeStatement
    FundamentalsBase <|-- CashFlowStatement
    FundamentalsBase <|-- FinancialRatios
    FundamentalsBase <|-- KeyMetrics
    FundamentalsBase <|-- EarningsCallTranscript
    FundamentalsBase <|-- ManagementDiscussionAnalysis
    FundamentalsBase <|-- ReportedFinancials
```

Every fundamentals model is a Pydantic `BaseModel` with `extra="allow"`, so provider-specific columns survive round-trips unchanged.

## Typed identifiers

```python
from alphaswarm.core.domain import (
    InstrumentId, Symbol2, Venue, IdentifierScheme, IdentifierSet, IdentifierValue
)

iid = InstrumentId.from_str("AAPL.NASDAQ")
assert iid.symbol == Symbol2("AAPL")
assert iid.venue == Venue("NASDAQ")

ids = IdentifierSet()
ids.add(IdentifierValue(scheme=IdentifierScheme.CUSIP, value="037833100"))
ids.add(IdentifierValue(scheme=IdentifierScheme.LEI, value="HWUPKR0MPOU8FGXBT394"))
assert ids.value_of(IdentifierScheme.CUSIP) == "037833100"
```

The `IdentifierScheme` StrEnum covers 30+ taxonomies: ticker, vt_symbol, CIK, CUSIP, ISIN, SEDOL, FIGI, OpenFIGI, LEI, GVKEY, PermID, Refinitiv PermID, FactSet ID, DUNS, IRS EIN, FRED series id, BLS series id, ECB series id, GDelt theme, CoT code, SIC, NAICS, GICS, TRBC, ICB, NACE, BICS, ERC-20 address, EVM chain id, IBKR conid, Alpaca asset id, Polygon ticker, plus a `custom` escape hatch.

## Migration path

The expansion is designed to be **non-breaking for existing users**:

- Legacy `alphaswarm.core.types.Symbol` / `BarData` / `QuoteBar` / `TickData` / `OrderRequest` / `OrderData` / `OrderEvent` / `OrderTicket` / `SecurityHolding` / `Cash` / `CashBook` / `Signal` / `PortfolioTarget` all keep their constructors and public API.
- The `Instrument` SQLAlchemy table keeps every pre-expansion column; new columns are nullable. A back-compat view `instruments_flat` serves the pre-refactor shape for any SQL consumer.
- Legacy rows with `instrument_class IS NULL` load cleanly as the base `Instrument` via a SQL `CASE` mapping.
- The richer typed IDs live in `alphaswarm.core.domain.identifiers` and are opt-in. `Symbol.to_instrument_id()` and `Symbol.from_instrument_id()` bridge old and new.
- The `Slice` class keeps its legacy shape; `RichSlice` is the superset with `order_books`, `mark_prices`, `funding_rates`, `news`, `filings` buckets.

## Tests

Domain-model tests live in [`tests/core/`](../tests/core/):

- `test_identifiers.py` — typed IDs + IdentifierSet + scheme coverage (12 tests).
- `test_enums.py` — expanded enum catalog (11 tests).
- `test_instrument_hierarchy.py` — polymorphic Instrument + `(AssetClass, InstrumentClass)` dispatch (15 tests).
- `test_events.py` — unified `DomainEvent` family (13 tests).
- `test_fundamentals.py` — Pydantic statements + ratios + transcripts (14 tests).
- `test_ownership.py` — insider / institutional / 13F / short-interest (10 tests).
- `test_standard_models.py` — 99+102 paired `QueryParams`/`Data` port (6 tests).

Provider tests: [`tests/providers/test_fetcher_contract.py`](../tests/providers/test_fetcher_contract.py).

Persistence tests: [`tests/persistence/test_domain_migration.py`](../tests/persistence/test_domain_migration.py).


<!-- https://alpha-swarm.ai/concepts/platform/entity-graph-services -->
# Entity Graph And Service Control
> Start the local stack with the visualization overlay:

# Entity Graph And Service Control

AlphaSwarm now treats the entity graph as the canonical relationship layer for
instruments, companies, datasets, pipeline assets, and service metadata.
Postgres remains the compatibility store for existing APIs, while Neo4j is
the graph backend when `ALPHASWARM_GRAPH_STORE=neo4j`.

## Local Services

Start the local stack with the visualization overlay:

```bash
docker compose -f alphaswarm_platform/compose/docker-compose.yml -f alphaswarm_platform/compose/docker-compose.viz.yml --profile visualization up -d
```

Neo4j is part of the base compose file and is exposed on:

- Browser: `http://localhost:7474`
- Bolt: `bolt://localhost:7687`

Relevant env keys:

```bash
ALPHASWARM_GRAPH_STORE=neo4j
ALPHASWARM_NEO4J_URI=bolt://localhost:7687
ALPHASWARM_NEO4J_USER=neo4j
ALPHASWARM_NEO4J_PASSWORD=aqpneo4j
ALPHASWARM_NEO4J_DATABASE=neo4j
ALPHASWARM_ENTITY_GRAPH_SYNC_ENABLED=true
```
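As a rough sketch of how these keys might be consumed, the snippet below reads them into a small settings object. `GraphSettings` and `load_graph_settings` are hypothetical names, and the empty-string fallbacks and the `"true"` comparison are assumptions, not the project's actual settings loader.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass(frozen=True)
class GraphSettings:
    """Hypothetical container; field names mirror the env keys above."""

    store: str
    uri: str
    user: str
    password: str
    database: str
    sync_enabled: bool


def load_graph_settings(env: Optional[Mapping[str, str]] = None) -> GraphSettings:
    # Accept an explicit mapping so the loader can be exercised without
    # touching the real process environment.
    env = os.environ if env is None else env
    sync_flag = env.get("ALPHASWARM_ENTITY_GRAPH_SYNC_ENABLED", "")
    return GraphSettings(
        store=env.get("ALPHASWARM_GRAPH_STORE", ""),
        uri=env.get("ALPHASWARM_NEO4J_URI", ""),
        user=env.get("ALPHASWARM_NEO4J_USER", ""),
        password=env.get("ALPHASWARM_NEO4J_PASSWORD", ""),
        database=env.get("ALPHASWARM_NEO4J_DATABASE", ""),
        # Assumption: only the literal "true" (case-insensitive) enables sync.
        sync_enabled=sync_flag.strip().lower() == "true",
    )
```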

## Entity Sync

The active instrument cache reads from the existing `instruments` table and
upserts each instrument as a `security` entity with identifiers for
`vt_symbol`, ticker, and any instrument metadata identifiers.

Dataset registration for market-bar datasets links dataset versions to the
instrument entities they describe. Airbyte and Dagster metadata syncs also
write service nodes and relationships so the graph can show ingestion and
pipeline context around datasets.

Useful endpoints:

- `GET /registry/entities/graph`
- `GET /registry/entities/instruments/active`
- `POST /registry/entities/instruments/sync`
- `POST /registry/entities/instruments/load-template`
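The instrument-to-entity mapping described in this section can be sketched as a pure function. The payload shape and the name `instrument_to_entity` are assumptions for illustration; only the `security` kind and the identifier sources (`vt_symbol`, ticker, instrument metadata identifiers) come from the text above.

```python
def instrument_to_entity(row: dict) -> dict:
    """Map one `instruments` row to a hypothetical `security` entity payload."""
    identifiers = {"vt_symbol": row["vt_symbol"], "ticker": row["ticker"]}
    # Metadata identifiers (ISIN, FIGI, ...) are merged in. The explicit
    # vt_symbol and ticker entries win on a key clash, and empty values
    # are skipped.
    for scheme, value in (row.get("identifiers") or {}).items():
        if value:
            identifiers.setdefault(scheme, value)
    return {"kind": "security", "name": row["ticker"], "identifiers": identifiers}
```

Keeping the mapping free of I/O makes it easy to call repeatedly for the same row, which suits an upsert-style sync.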

## Service Manager

The service manager aggregates health/config/logs for:

- Trino
- Polaris
- Iceberg
- Superset
- Airbyte
- Dagster
- Neo4j

Useful endpoints:

- `GET /service-manager/health`
- `GET /service-manager/{service}/health`
- `GET /service-manager/{service}/logs`
- `POST /service-manager/{service}/actions`

Lifecycle actions and logs are guarded by `ALPHASWARM_SERVICE_CONTROL_ENABLED=true`
because they invoke Docker Compose from inside the API process.
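A minimal sketch of such a guard, assuming the flag is read from the environment and compared against the literal `true`. `ServiceControlDisabled` and `require_service_control` are hypothetical names, not the API's actual implementation.

```python
import os
from typing import Mapping, Optional


class ServiceControlDisabled(PermissionError):
    """Raised when lifecycle actions or logs are requested while the flag is off."""


def require_service_control(env: Optional[Mapping[str, str]] = None) -> None:
    env = os.environ if env is None else env
    flag = env.get("ALPHASWARM_SERVICE_CONTROL_ENABLED", "false")
    # Assumption: anything other than "true" (case-insensitive) keeps the
    # Docker Compose code path closed.
    if flag.strip().lower() != "true":
        raise ServiceControlDisabled(
            "set ALPHASWARM_SERVICE_CONTROL_ENABLED=true to allow lifecycle actions"
        )
```

A route handler for `/service-manager/{service}/actions` or `/logs` could call the guard first and translate the exception into an HTTP error.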

## UI

- `/data/entity-graph` exposes the Neo4j-backed entity graph and active
  instrument list.
- `/data/services` exposes service health cards, guarded lifecycle actions,
  and log tails.
- `/workflows/data` includes Dagster assets, runs, schedules, and sensors.


<!-- https://alpha-swarm.ai/concepts/platform/entity-registry -->
# Unified Entity Registry
> ```mermaid flowchart LR Sources["Iceberg datasets<br/>(CFPB, FDA, USPTO,<br/>SEC, GDELT,<br/>FinanceDatabase, ...)"] Extractors["EntityExtractor.run(rows)"] Registry[(EntityRegistry)] Enrichers["Entit...

# Unified Entity Registry

The unified entity registry sits on top of the existing
[Issuer / Sector / Industry graph](../../concepts/platform/erd.md) and widens it to cover
every entity AlphaSwarm cares about: companies, drugs, products, patents,
persons, locations, securities, regulators, and free-form
"concept" rows. Extractors populate the rows from datasets;
LLM enrichers add descriptions, relations, dedup proposals, and
tags without ever mutating the source data.

```mermaid
flowchart LR
    Sources["Iceberg datasets<br/>(CFPB, FDA, USPTO,<br/>SEC, GDELT,<br/>FinanceDatabase, ...)"]
    Extractors["EntityExtractor.run(rows)"]
    Registry[(EntityRegistry)]
    Enrichers["EntityEnricher.run(ids)"]
    Tasks["Celery tasks<br/>(entity_tasks)"]
    API["/registry/entities"]
    UI["/data/kg"]
    Sources --> Extractors --> Registry
    Registry --> Enrichers --> Registry
    Tasks --> Extractors
    Tasks --> Enrichers
    Registry --> API
    API --> UI
```

## Tables

| Table | File |
| --- | --- |
| `entities` | [alphaswarm/persistence/models_entity_registry.py](../alphaswarm/persistence/models_entity_registry.py) |
| `entity_identifiers` | (same) |
| `entity_relations` | (same) |
| `entity_annotations` | (same) |
| `entity_dataset_links` | (same) |

Migration: [alembic/versions/0013_data_engine_expansion.py](../alembic/versions/0013_data_engine_expansion.py).

## Components

| Module | What it does |
| --- | --- |
| [alphaswarm/data/entities/registry.py](../alphaswarm/data/entities/registry.py) | `EntityRegistry` facade + `upsert_entity` / `link_identifier` / `add_relation` / `attach_to_dataset` / `search` / `neighbors` / `add_annotation`. |
| [alphaswarm/data/entities/extractors/](../alphaswarm/data/entities/extractors/) | Per-dataset extractors (regulatory, filings, news, instruments, finance_database). Each yields `EntityCandidate` dataclasses. |
| [alphaswarm/data/entities/enrichers/](../alphaswarm/data/entities/enrichers/) | LLM enrichers (description, relation, dedup, tagging). All route through `router_complete` per AGENTS.md hard rule #2. |
| [alphaswarm/tasks/entity_tasks.py](../alphaswarm/tasks/entity_tasks.py) | Celery wrappers (`extract_entities`, `enrich_entity`, `dedup_entities`). |
| [alphaswarm/api/routes/entity_registry.py](../alphaswarm/api/routes/entity_registry.py) | REST surface at `/registry/entities`. |

## REST surface

| Path | Description |
| --- | --- |
| `GET /registry/entities` | List entities (filter by kind, source_dataset, canonical_only). |
| `POST /registry/entities` | Create or update an entity. |
| `GET /registry/entities/search?q=` | Text search. |
| `GET /registry/entities/{id}` | Detail (identifiers + annotations). |
| `GET /registry/entities/{id}/neighbors` | Outgoing + incoming relations. |
| `GET /registry/entities/{id}/datasets` | Linked datasets. |
| `POST /registry/entities/{id}/identifiers` | Add an alias. |
| `POST /registry/entities/{id}/relations` | Add a typed edge. |
| `POST /registry/entities/{id}/annotations` | Attach a description / tag / note. |
| `POST /registry/entities/extract` | Queue a Celery extract task. |
| `POST /registry/entities/enrich` | Queue Celery enrichment tasks. |
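For the two read endpoints whose query parameters are named in the table, a client might build request URLs as sketched below. `BASE_URL` is an assumed gateway address, the helper names are hypothetical, and the `true`/`false` encoding of `canonical_only` is a guess; the POST bodies are not documented here, so they are left out.

```python
from urllib.parse import urlencode

BASE_URL = "http://localhost:8000"  # assumption: the gateway address is not given above


def list_entities_url(kind=None, source_dataset=None, canonical_only=None) -> str:
    """Build the URL for `GET /registry/entities` with optional filters."""
    params = {}
    if kind is not None:
        params["kind"] = kind
    if source_dataset is not None:
        params["source_dataset"] = source_dataset
    if canonical_only is not None:
        # Assumption: booleans travel as lowercase "true" / "false".
        params["canonical_only"] = "true" if canonical_only else "false"
    query = urlencode(params)
    return f"{BASE_URL}/registry/entities" + (f"?{query}" if query else "")


def search_entities_url(q: str) -> str:
    """Build the URL for `GET /registry/entities/search?q=`."""
    return f"{BASE_URL}/registry/entities/search?" + urlencode({"q": q})
```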

## LLM enrichment

LLM enrichers are gated on `ALPHASWARM_ENTITY_LLM_ENRICHMENT_ENABLED=true`
to avoid surprise spend. When disabled, `enrich_one` returns `None`
and the Celery task records a `skipped` count instead of calling the
router.

When enabled, the enricher uses `alphaswarm.llm.providers.router.router_complete`
exclusively — never `litellm.completion` or `OllamaClient.generate`.
Output is parsed as strict JSON; malformed blobs are dropped.
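The gate and the strict-JSON rule can be sketched together. The signature below is hypothetical (the real `enrich_one` is not shown in this document), and `complete` is a stand-in callable for `router_complete` so the sketch stays self-contained.

```python
import json
import os
from typing import Callable, Mapping, Optional


def enrich_one(
    prompt: str,
    complete: Callable[[str], str],
    env: Optional[Mapping[str, str]] = None,
) -> Optional[dict]:
    env = os.environ if env is None else env
    flag = env.get("ALPHASWARM_ENTITY_LLM_ENRICHMENT_ENABLED", "false")
    if flag.strip().lower() != "true":
        return None  # disabled: the router is never called, so nothing is spent
    raw = complete(prompt)
    try:
        parsed = json.loads(raw)
    except (TypeError, ValueError):
        return None  # malformed blob is dropped rather than repaired
    # Assumption: only JSON objects count as a usable enrichment.
    return parsed if isinstance(parsed, dict) else None
```

In this sketch both the disabled path and the malformed path return `None`; a caller that needs separate `skipped` and `dropped` counts would have to distinguish them.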

## Don'ts

- Don't extract entities by querying Postgres directly from a Celery
  task. Either pass an inline `rows` payload or read the Iceberg
  table via `alphaswarm.data.iceberg_catalog.read_arrow` (the standard
  path used by `extract_entities`).
- Don't bypass `EntityRegistry` to write rows. Extractors should
  always go through `registry.upsert(...)`.
- Don't replace `add_annotation` with raw SQL inserts. The
  `EntityAnnotation` row is also surfaced in `/registry/entities/{id}`,
  the entity browser UI, and (eventually) DataHub glossary terms.


<!-- https://alpha-swarm.ai/concepts/platform/erd -->
# Entity Relationship Diagram
> The Postgres schema has ~110 ORM classes spread across 11 model files under [alphaswarm/persistence/](../alphaswarm/persistence/). One mega-ERD would be unreadable, so this doc breaks the schema into focused diagra...

# Entity Relationship Diagram

> Pair with [alphaswarm_docs/data-dictionary.md](../../reference/data-dictionary/index.md) (column-level
> detail) and [alphaswarm_docs/domain-model.md](../../concepts/platform/domain-model.md) (narrative).
> Doc map: [alphaswarm_docs/index.md](../../intro/index.md).

The Postgres schema has ~110 ORM classes spread across 11 model files
under [alphaswarm/persistence/](../alphaswarm/persistence/). One mega-ERD would be
unreadable, so this doc breaks the schema into focused diagrams by
domain. The first section is a global FK map that shows only the
cross-domain joins.

Each per-domain ERD lists table names with the primary key (`PK`) and a
short subset of columns. For full column lists, see
[data-dictionary.md](../../reference/data-dictionary/index.md).

## Global FK map

Cross-domain edges only — pick a starting table and trace where it
fans out.

```mermaid
erDiagram
    instruments ||--o| instrument_equity : "polymorphic"
    instruments ||--o| instrument_option : "polymorphic"
    instruments ||--o| instrument_future : "polymorphic"
    instruments ||--o{ data_links : "instrument_id"
    instruments ||--o{ corporate_events : "vt_symbol"
    instruments ||--o{ news_items : "vt_symbol"
    issuers ||--o{ instruments : "issuer_id"
    issuers ||--o{ financial_statements : "issuer_id"

    data_sources ||--o{ datasets : "provider"
    dataset_catalogs ||--o{ dataset_versions : "catalog_id"
    dataset_versions ||--o{ data_links : "dataset_version_id"
    dataset_versions ||--o{ model_versions : "dataset_version_id"
    dataset_versions ||--o{ split_plans : "dataset_version_id"
    split_plans ||--o{ split_artifacts : "plan_id"

    strategies ||--o{ strategy_versions : "strategy_id"
    strategies ||--o{ backtest_runs : "strategy_id"
    backtest_runs ||--o{ orders : "backtest_id"
    backtest_runs ||--o{ fills : "backtest_id"
    backtest_runs ||--o{ signals : "backtest_id"
    backtest_runs ||--o{ ledger_entries : "backtest_id"

    sessions ||--o{ chat_messages : "session_id"
    sessions ||--o{ agent_runs : "session_id"
    crew_runs ||--o{ agent_decisions : "crew_run_id"
    agent_decisions ||--o{ debate_turns : "decision_id"
    backtest_runs ||--o{ agent_backtests : "backtest_id"
    agent_judge_reports ||--o{ agent_replay_runs : "judge_id"

    feature_sets ||--o{ feature_set_versions : "feature_set_id"
    feature_sets ||--o{ feature_set_usages : "feature_set_id"
```

## Core / Instruments

Joined-table inheritance. Every concrete instrument subclass shares the
parent `instruments` row and adds shape-specific columns in its own
table keyed on `instruments.id`. The discriminator is
`instruments.instrument_class`.

```mermaid
erDiagram
    instruments {
        uuid id PK
        string vt_symbol "AAPL.NASDAQ"
        string ticker
        string exchange
        string asset_class
        string security_type
        string instrument_class "discriminator"
        uuid issuer_id FK
        json identifiers
    }
    instrument_equity {
        uuid id PK_FK
        string isin
        string cusip
        string figi
        string lei
        string gics_sector
        float shares_outstanding
    }
    instrument_etf {
        uuid id PK_FK
        date inception_date
        float aum
        float expense_ratio
        bool is_leveraged
    }
    instrument_option {
        uuid id PK_FK
        string underlying
        float strike
        date expiry
        string kind "call|put"
        string style "european|american"
    }
    instrument_future {
        uuid id PK_FK
        string underlying
        date expiry
        float contract_size
        string cycle
    }
    instrument_fx_pair {
        uuid id PK_FK
        string base_currency
        string quote_currency
        float pip_size
    }
    instrument_crypto {
        uuid id PK_FK
        string subtype
        string chain
        string contract_address
        float max_leverage
    }
    instrument_index {
        uuid id PK_FK
        string administrator
        int constituent_count
    }
    instrument_bond {
        uuid id PK_FK
        float coupon
        date maturity
        string rating_sp
    }
    instrument_cfd {
        uuid id PK_FK
        string underlying
        float margin_rate
    }
    instrument_commodity {
        uuid id PK_FK
        string grade
        string unit_of_measure
    }
    instrument_synthetic {
        uuid id PK_FK
        json legs
        json leg_weights
    }
    instrument_betting {
        uuid id PK_FK
        string event_name
        string market_type
    }
    instrument_tokenized_asset {
        uuid id PK_FK
        string chain
        string contract_address
        string token_standard
    }

    instruments ||--o| instrument_equity : "spot"
    instruments ||--o| instrument_etf : "etf"
    instruments ||--o| instrument_option : "option"
    instruments ||--o| instrument_future : "future"
    instruments ||--o| instrument_fx_pair : "fx_pair"
    instruments ||--o| instrument_crypto : "crypto_token"
    instruments ||--o| instrument_index : "index"
    instruments ||--o| instrument_bond : "bond"
    instruments ||--o| instrument_cfd : "cfd"
    instruments ||--o| instrument_commodity : "spot_commodity"
    instruments ||--o| instrument_synthetic : "synthetic"
    instruments ||--o| instrument_betting : "betting"
    instruments ||--o| instrument_tokenized_asset : "nft"
```
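To make the joined-table layout concrete, here is a small SQLite sketch with the parent table and one subclass table. The column subset, the text IDs, and the sample symbols are illustrative assumptions; the production schema uses UUID keys and the ORM's inheritance mapping.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE instruments (
        id TEXT PRIMARY KEY,
        vt_symbol TEXT NOT NULL,
        instrument_class TEXT              -- discriminator
    );
    CREATE TABLE instrument_option (
        id TEXT PRIMARY KEY REFERENCES instruments(id),  -- PK doubles as FK
        underlying TEXT,
        strike REAL,
        kind TEXT
    );
    INSERT INTO instruments VALUES ('i1', 'AAPL.NASDAQ', 'spot');
    INSERT INTO instruments VALUES ('i2', 'OPT1.EXAMPLE', 'option');
    INSERT INTO instrument_option VALUES ('i2', 'AAPL.NASDAQ', 200.0, 'call');
    """
)


def load_option(conn, instrument_id):
    """Join the shared parent row with the option-specific columns."""
    return conn.execute(
        """
        SELECT i.vt_symbol, i.instrument_class, o.strike, o.kind
        FROM instruments AS i
        JOIN instrument_option AS o ON o.id = i.id
        WHERE i.id = ?
        """,
        (instrument_id,),
    ).fetchone()
```

Because the child primary key is also the foreign key, each parent row can have at most one row in a given subclass table, which is what the `||--o|` edges express.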

## Market data lineage + Iceberg catalog

How AlphaSwarm tracks every dataset that flows into Iceberg. The
`iceberg_identifier` column on `dataset_catalogs` was added in
[alembic/versions/0011_iceberg_catalog_columns.py](../alembic/versions/0011_iceberg_catalog_columns.py).

```mermaid
erDiagram
    data_sources {
        uuid id PK
        string name "yfinance|alpaca|cfpb"
        string kind "rest|csv|parquet"
        string base_url
        json meta
    }
    dataset_catalogs {
        uuid id PK
        string name
        string provider
        string domain "market.bars|cfpb.hmda"
        string frequency
        string storage_uri
        string iceberg_identifier "alphaswarm_cfpb.hmda_lar"
        string load_mode "managed|external"
        json llm_annotations
        json column_docs
        json tags
    }
    dataset_versions {
        uuid id PK
        uuid catalog_id FK
        int version
        string status "active|superseded"
        datetime as_of
        datetime start_time
        datetime end_time
        int row_count
        int symbol_count
        string dataset_hash
        string materialization_uri
    }
    data_links {
        uuid id PK
        uuid dataset_version_id FK
        uuid source_id FK
        uuid instrument_id FK
        string entity_kind "instrument|series"
        string entity_id
        datetime coverage_start
        datetime coverage_end
        int row_count
    }
    identifier_links {
        uuid id PK
        uuid instrument_id FK
        uuid source_id FK
        string identifier_kind
        string identifier_value
    }

    dataset_catalogs ||--o{ dataset_versions : "catalog_id"
    dataset_versions ||--o{ data_links : "dataset_version_id"
    data_sources ||--o{ data_links : "source_id"
    instruments ||--o{ data_links : "instrument_id"
    data_sources ||--o{ identifier_links : "source_id"
    instruments ||--o{ identifier_links : "instrument_id"
```

## Agentic + ML

Strategies, backtests, agent crews, ML deployments, and feature sets.

```mermaid
erDiagram
    strategies {
        uuid id PK
        string name
        int version
        text config_yaml
        string status "draft|backtesting|paper|live|retired"
    }
    strategy_versions {
        uuid id PK
        uuid strategy_id FK
        text config_yaml
        json meta
    }
    backtest_runs {
        uuid id PK
        uuid strategy_id FK
        string task_id
        string status
        datetime start
        datetime end
        float sharpe
        float sortino
        float max_drawdown
        string mlflow_run_id
        string dataset_hash
        uuid model_version_id FK
        uuid ml_experiment_run_id FK
        uuid experiment_plan_id FK
        uuid model_deployment_id FK
    }
    agent_runs {
        uuid id PK
        uuid session_id FK
        string crew_name
        string status
    }
    crew_runs {
        uuid id PK
        uuid agent_run_id FK
        string preset
        json config
    }
    agent_decisions {
        uuid id PK
        uuid backtest_id FK
        uuid strategy_id FK
        uuid crew_run_id FK
        string action "long|short|flat"
        float confidence
        text rationale
    }
    debate_turns {
        uuid id PK
        uuid crew_run_id FK
        uuid decision_id FK
        string role
        text content
    }
    agent_backtests {
        uuid id PK
        uuid backtest_id FK
        json crew_metrics
    }
    agent_judge_reports {
        uuid id PK
        uuid backtest_id FK
        text summary
        json scores
    }
    agent_replay_runs {
        uuid id PK
        uuid backtest_id FK
        uuid judge_id FK
        json replay_metrics
    }
    feature_sets {
        uuid id PK
        string name
        string kind "composite|ml4t|qlib"
        json specs
        int default_lookback_days
    }
    feature_set_versions {
        uuid id PK
        uuid feature_set_id FK
        string content_hash
    }
    model_versions {
        uuid id PK
        uuid dataset_version_id FK
        uuid split_plan_id FK
        string model_class
        json hyperparams
        string mlflow_run_id
    }
    model_deployments {
        uuid id PK
        uuid model_version_id FK
        string status "active|retired"
        json runtime_meta
    }

    strategies ||--o{ strategy_versions : "strategy_id"
    strategies ||--o{ backtest_runs : "strategy_id"
    backtest_runs ||--o{ agent_decisions : "backtest_id"
    backtest_runs ||--o{ agent_backtests : "backtest_id"
    backtest_runs ||--o{ agent_judge_reports : "backtest_id"
    backtest_runs ||--o{ agent_replay_runs : "backtest_id"
    crew_runs ||--o{ agent_decisions : "crew_run_id"
    agent_decisions ||--o{ debate_turns : "decision_id"
    feature_sets ||--o{ feature_set_versions : "feature_set_id"
    model_versions ||--o{ model_deployments : "model_version_id"
```

## Ledger (signals / orders / fills / entries)

Every signal, order, fill, and free-form audit entry written by
[`LedgerWriter`](../alphaswarm/persistence/ledger.py).

```mermaid
erDiagram
    signals {
        uuid id PK
        uuid strategy_id FK
        uuid backtest_id FK
        string vt_symbol
        string direction "long|short|net"
        float strength
        float confidence
        text rationale
    }
    orders {
        uuid id PK
        uuid backtest_id FK
        uuid strategy_id FK
        string vt_symbol
        string side "buy|sell"
        string order_type "market|limit|stop"
        float quantity
        float price
        string status
    }
    fills {
        uuid id PK
        uuid order_id FK
        float quantity
        float price
        datetime ts
    }
    ledger_entries {
        uuid id PK
        uuid backtest_id FK
        uuid strategy_id FK
        string entry_type "SIGNAL|ORDER|FILL|RISK|AUDIT"
        string level "info|warn|error"
        text message
        json payload
    }

    strategies ||--o{ signals : "strategy_id"
    backtest_runs ||--o{ signals : "backtest_id"
    strategies ||--o{ orders : "strategy_id"
    backtest_runs ||--o{ orders : "backtest_id"
    orders ||--o{ fills : "order_id"
    backtest_runs ||--o{ ledger_entries : "backtest_id"
```

## News / Events / Fundamentals

```mermaid
erDiagram
    news_items {
        uuid id PK
        string url
        string source
        datetime published_at
        text headline
        text body
    }
    news_item_entities {
        uuid id PK
        uuid news_item_id FK
        string vt_symbol
        string entity_kind "instrument|issuer|theme"
    }
    news_sentiments {
        uuid id PK
        uuid news_item_id FK
        string scorer "finbert|fingpt"
        float polarity
        float confidence
    }
    corporate_events {
        uuid id PK
        string vt_symbol
        string event_type "earnings|split|dividend|merger|ipo"
        datetime event_time
        json payload
    }
    earnings_event_rows {
        uuid id PK
        uuid event_id FK
        float eps_actual
        float eps_estimate
        float revenue_actual
    }
    dividend_event_rows {
        uuid id PK
        uuid event_id FK
        float amount
        date ex_date
        date pay_date
    }
    split_event_rows {
        uuid id PK
        uuid event_id FK
        float ratio
    }
    analyst_estimates {
        uuid id PK
        string vt_symbol
        string analyst
        float target_price
    }
    financial_statements {
        uuid id PK
        uuid issuer_id FK
        string period "Q|FY"
        date period_end
        json data
    }
    financial_ratios {
        uuid id PK
        uuid issuer_id FK
        date period_end
        float pe
        float pb
        float roe
    }
    earnings_call_transcripts {
        uuid id PK
        uuid issuer_id FK
        date call_date
        text content
    }

    news_items ||--o{ news_item_entities : "news_item_id"
    news_items ||--o{ news_sentiments : "news_item_id"
    corporate_events ||--o{ earnings_event_rows : "event_id"
    corporate_events ||--o{ dividend_event_rows : "event_id"
    corporate_events ||--o{ split_event_rows : "event_id"
    issuers ||--o{ financial_statements : "issuer_id"
    issuers ||--o{ financial_ratios : "issuer_id"
    issuers ||--o{ earnings_call_transcripts : "issuer_id"
```

## Macro / FRED / GDELT

```mermaid
erDiagram
    economic_series {
        uuid id PK
        string series_id "FRED:GDP"
        string title
        string frequency
        string units
        string source
    }
    economic_observations {
        uuid id PK
        uuid series_id FK
        date observation_date
        float value
    }
    fred_series {
        uuid id PK
        string series_id "GDP"
        string title
        string units
        string frequency
    }
    treasury_rates {
        uuid id PK
        date date
        float rate_3m
        float rate_2y
        float rate_10y
        float rate_30y
    }
    yield_curves {
        uuid id PK
        date date
        json tenors
    }
    cot_reports {
        uuid id PK
        date report_date
        string instrument
        json positions
    }
    sec_filings {
        uuid id PK
        uuid instrument_id FK
        uuid source_id FK
        string accession
        string form
        date filing_date
    }
    gdelt_mentions {
        uuid id PK
        uuid instrument_id FK
        uuid source_id FK
        datetime mention_time
        json gkg_payload
    }

    economic_series ||--o{ economic_observations : "series_id"
    instruments ||--o{ sec_filings : "instrument_id"
    instruments ||--o{ gdelt_mentions : "instrument_id"
    data_sources ||--o{ sec_filings : "source_id"
    data_sources ||--o{ gdelt_mentions : "source_id"
```

## Entities / Issuers / Ownership

```mermaid
erDiagram
    issuers {
        uuid id PK
        string name
        string lei
        string country
        string entity_kind "company|government|fund"
    }
    government_entities {
        uuid id PK_FK
        string country_code
        string level
    }
    funds {
        uuid id PK_FK
        string fund_family
        string fund_type
    }
    sectors {
        uuid id PK
        string code
        string name
    }
    industries {
        uuid id PK
        string code
        string name
        uuid sector_id FK
    }
    industry_classifications {
        uuid id PK
        uuid issuer_id FK
        uuid industry_id FK
        date as_of
    }
    entity_relationships {
        uuid id PK
        uuid parent_id FK
        uuid child_id FK
        string kind "subsidiary|owner|board"
    }
    locations {
        uuid id PK
        uuid issuer_id FK
        string country
        string city
    }
    key_executives {
        uuid id PK
        uuid issuer_id FK
        string name
        string title
    }
    insider_transactions {
        uuid id PK
        string vt_symbol
        string insider_name
        date transaction_date
        float quantity
    }
    institutional_holdings {
        uuid id PK
        string vt_symbol
        string holder_name
        date as_of
        float quantity
    }
    form_13f_holdings {
        uuid id PK
        string filer_cik
        string vt_symbol
        date period_end
    }
    short_interest {
        uuid id PK
        string vt_symbol
        date settlement_date
        float short_interest
    }
    politician_trades {
        uuid id PK
        string politician
        string vt_symbol
        date trade_date
        float amount
    }

    issuers ||--o| government_entities : "subclass"
    issuers ||--o| funds : "subclass"
    issuers ||--o{ industry_classifications : "issuer_id"
    sectors ||--o{ industries : "sector_id"
    industries ||--o{ industry_classifications : "industry_id"
    issuers ||--o{ entity_relationships : "parent_id"
    issuers ||--o{ locations : "issuer_id"
    issuers ||--o{ key_executives : "issuer_id"
```

## Taxonomy

Free-form tagging for issuers, instruments, and themes.

```mermaid
erDiagram
    taxonomy_schemes {
        uuid id PK
        string name "GICS|SASB|theme"
    }
    taxonomy_nodes {
        uuid id PK
        uuid scheme_id FK
        uuid parent_id FK
        string code
        string label
    }
    entity_tags {
        uuid id PK
        uuid node_id FK
        string entity_kind "issuer|instrument"
        string entity_id
    }
    entity_crosswalks {
        uuid id PK
        string from_kind
        string from_id
        string to_kind
        string to_id
    }

    taxonomy_schemes ||--o{ taxonomy_nodes : "scheme_id"
    taxonomy_nodes ||--o{ taxonomy_nodes : "parent_id"
    taxonomy_nodes ||--o{ entity_tags : "node_id"
```

## Sessions / Chat / Optimization

The conversational + experimentation layer.

```mermaid
erDiagram
    sessions {
        uuid id PK
        string user
        string title
        json meta
    }
    chat_messages {
        uuid id PK
        uuid session_id FK
        string role "user|assistant|agent|tool"
        text content
    }
    optimization_runs {
        uuid id PK
        uuid strategy_id FK
        json search_space
        string status
    }
    optimization_trials {
        uuid id PK
        uuid run_id FK
        uuid backtest_id FK
        json params
        float objective
    }
    paper_trading_runs {
        uuid id PK
        uuid strategy_id FK
        string status
        datetime started_at
        datetime stopped_at
    }
    rl_episodes {
        uuid id PK
        string env_id
        int episode_id
        float reward
    }

    sessions ||--o{ chat_messages : "session_id"
    sessions ||--o{ agent_runs : "session_id"
    strategies ||--o{ optimization_runs : "strategy_id"
    optimization_runs ||--o{ optimization_trials : "run_id"
    strategies ||--o{ paper_trading_runs : "strategy_id"
```

## Bots

Tables introduced by the Bot Entity Refactor (Alembic
[`0020_bots`](../alembic/versions/0020_bots.py)).

```mermaid
erDiagram
    PROJECTS ||--o{ BOTS : "owns"
    BOTS ||--o{ BOT_VERSIONS : "snapshots"
    BOTS ||--o{ BOT_DEPLOYMENTS : "runs"
    BOT_VERSIONS ||--o{ BOT_DEPLOYMENTS : "produces"

    BOTS {
        string id PK
        string project_id FK
        string slug
        string kind
        string name
        text description
        int current_version
        text spec_yaml
        string status
        json annotations
    }
    BOT_VERSIONS {
        string id PK
        string bot_id FK
        int version
        string spec_hash
        json payload
        text notes
        string created_by
    }
    BOT_DEPLOYMENTS {
        string id PK
        string bot_id FK
        string version_id FK
        string target
        string task_id
        string status
        text manifest_yaml
        json result_summary
        text error
    }
```

- `(project_id, slug)` is unique on `bots`.
- `(bot_id, spec_hash)` is unique on `bot_versions` (immutable snapshots).
- `bot_deployments.target` is one of `paper_session` / `kubernetes` /
  `backtest_only` / `chat` / `backtest`.
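The `(bot_id, spec_hash)` constraint implies that saving an unchanged spec should not create a new version. A minimal in-memory sketch of that behaviour follows; the SHA-256 hashing scheme and the helper names are assumptions, since the document does not say how `spec_hash` is computed.

```python
import hashlib


def spec_hash(spec_yaml: str) -> str:
    # Assumption: the hash is SHA-256 over the UTF-8 spec text.
    return hashlib.sha256(spec_yaml.encode("utf-8")).hexdigest()


def snapshot(versions: list, bot_id: str, spec_yaml: str) -> dict:
    """Return the existing snapshot for (bot_id, spec_hash) or append a new one."""
    digest = spec_hash(spec_yaml)
    for row in versions:
        if row["bot_id"] == bot_id and row["spec_hash"] == digest:
            return row  # identical spec: reuse the immutable snapshot
    next_version = 1 + max(
        (r["version"] for r in versions if r["bot_id"] == bot_id), default=0
    )
    row = {"bot_id": bot_id, "version": next_version, "spec_hash": digest}
    versions.append(row)
    return row
```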

## Data layer expansion (sinks, producers, streaming links)

Tables introduced by the Data Pipelines Hub work (Alembic
[`0024_data_layer_expansion`](../alembic/versions/0024_data_layer_expansion.py)).
All four tables use `ProjectScopedMixin`.

```mermaid
erDiagram
    PROJECTS ||--o{ SINKS : "owns"
    SINKS ||--o{ SINK_VERSIONS : "snapshots"
    PROJECTS ||--o{ MARKET_DATA_PRODUCERS : "owns"
    DATASET_CATALOGS ||--o{ STREAMING_DATASET_LINKS : "linked"
    PIPELINE_MANIFESTS ||--o{ DATASET_PIPELINE_CONFIGS : "binds"

    SINKS {
        string id PK
        string project_id FK
        string name
        string kind
        string display_name
        json config_json
        json tags
        bool requires_manifest_node
        int current_version
        bool enabled
    }
    SINK_VERSIONS {
        string id PK
        string sink_id FK
        int version
        string spec_hash
        json payload
        text notes
    }
    MARKET_DATA_PRODUCERS {
        string id PK
        string project_id FK
        string name
        string kind
        string runtime
        string deployment_namespace
        string deployment_name
        json topics
        int desired_replicas
        int current_replicas
        string last_status
    }
    STREAMING_DATASET_LINKS {
        string id PK
        string dataset_catalog_id FK
        string kind
        string target_ref
        string cluster_ref
        string direction
        json metadata_json
        bool enabled
    }
```

Notes:

- `(project_id, name)` is unique on `sinks` and `market_data_producers`.
- `(sink_id, spec_hash)` and `(sink_id, version)` are unique on
  `sink_versions` (mirrors the `bot_versions` pattern).
- `(dataset_catalog_id, kind, target_ref, direction)` is unique on
  `streaming_dataset_links` so the
  [refresh_links](../alphaswarm/tasks/streaming_link_tasks.py) task can be
  re-run idempotently.
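The idempotent re-run that the last note describes can be sketched with an upsert keyed on the same four columns. This uses SQLite (3.24 or newer for `ON CONFLICT ... DO UPDATE`) rather than Postgres, a trimmed column set, and a hypothetical `refresh_link` helper; the sample `kind` and `direction` values are made up.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE streaming_dataset_links (
        dataset_catalog_id TEXT NOT NULL,
        kind TEXT NOT NULL,
        target_ref TEXT NOT NULL,
        direction TEXT NOT NULL,
        enabled INTEGER NOT NULL DEFAULT 1,
        UNIQUE (dataset_catalog_id, kind, target_ref, direction)
    )
    """
)


def refresh_link(conn, catalog_id, kind, target_ref, direction, enabled=True):
    """Insert the link, or update it in place when the unique key already exists."""
    conn.execute(
        """
        INSERT INTO streaming_dataset_links
            (dataset_catalog_id, kind, target_ref, direction, enabled)
        VALUES (?, ?, ?, ?, ?)
        ON CONFLICT (dataset_catalog_id, kind, target_ref, direction)
        DO UPDATE SET enabled = excluded.enabled
        """,
        (catalog_id, kind, target_ref, direction, int(enabled)),
    )
```

Running the helper twice with the same key leaves a single row, so a retried task does not duplicate links.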

## ML alpha-backtest linkage (Alembic 0025)

```mermaid
erDiagram
    ml_experiment_runs ||--o| ml_alpha_backtest_runs : "ml_experiment_run_id"
    backtest_runs ||--o| ml_alpha_backtest_runs : "backtest_run_id"
    model_versions ||--o| ml_alpha_backtest_runs : "model_version_id"
    model_deployments ||--o| ml_alpha_backtest_runs : "model_deployment_id"
    experiment_plans ||--o| ml_alpha_backtest_runs : "experiment_plan_id"
    ml_alpha_backtest_runs ||--o{ ml_prediction_audit : "alpha_backtest_run_id"

    ml_alpha_backtest_runs {
        uuid id PK
        string task_id
        string run_name
        string status
        uuid ml_experiment_run_id FK
        uuid backtest_run_id FK
        uuid model_version_id FK
        uuid model_deployment_id FK
        uuid experiment_plan_id FK
        string mlflow_run_id
        json ml_metrics
        json trading_metrics
        json combined_metrics
        json attribution
        datetime started_at
        datetime completed_at
    }
    ml_prediction_audit {
        uuid id PK
        uuid alpha_backtest_run_id FK
        string vt_symbol
        datetime ts
        float prediction
        float label
        float position_after
        float pnl_after_bar
    }
```

The four new FKs on `backtest_runs` (added by Alembic 0025) close the
loop from a backtest result back to the trained model that produced
its alpha:

- `model_version_id` — the registered `ModelVersion` row.
- `ml_experiment_run_id` — the `MLExperimentRun` that trained it.
- `experiment_plan_id` — the `ExperimentPlan` lineage row.
- `model_deployment_id` — the `ModelDeployment` used to wire the
  model into the strategy via `DeployedModelAlpha`.

## Adding a new model

When you add a new ORM class:

1. Add the class to the appropriate `alphaswarm/persistence/models_*.py`
   (or `models.py` for cross-domain models).
2. Add an Alembic migration
   (`alembic revision --autogenerate -m "add foo"`).
   **Never edit a shipped migration.**
3. Update [alphaswarm_docs/data-dictionary.md](../../reference/data-dictionary/index.md) with the new
   table's columns.
4. Add the table to the relevant per-domain ERD above (or open a new
   one if it's a new domain).
5. If it has FKs into other domains, add those edges to the global FK
   map at the top of this file.


<!-- https://alpha-swarm.ai/concepts/platform/experiments-tests -->
# Experiments + Tests umbrella (Phase 1 of the multi-tenant rollout)
> | Table | Purpose | Key columns | | ----- | ------- | ----------- | | `experiments` | User-driven container; one row per hypothesis / sweep / iteration | `id`, `slug`, `name`, `kind` (`ml`/`rl`/`analy...

# Experiments + Tests umbrella (Phase 1 of the multi-tenant rollout)

The umbrella sits **above** every existing typed run table so the
"what was the user trying?" question gets one consistent answer
regardless of which downstream engine produced the artefact.

## Tables

| Table | Purpose | Key columns |
| ----- | ------- | ----------- |
| `experiments` | User-driven container; one row per hypothesis / sweep / iteration | `id`, `slug`, `name`, `kind` (`ml`/`rl`/`analysis`/`backtest`/`paper`/`bot`/`agent`/`research`/`hypothesis`/`optimization`/`ablation`/`sweep`), `status`, `parent_experiment_id`, `lab_id`, `metrics jsonb` |
| `tests` | Pass/fail-style assertions attached to an experiment | `id`, `experiment_id`, `slug`, `name`, `assertion_kind`, `passed`, `details jsonb`, `run_ref_table`, `run_ref_id` |

Both inherit `ProjectScopedMixin` (`owner_user_id` / `workspace_id` /
`project_id`).

## Linkage to typed runs

Migration 0037 added nullable `experiment_id` (and `test_id` where it
applies) columns to:

- `backtest_runs`
- `ml_experiment_runs`
- `rl_runs`
- `analysis_runs`
- `bot_deployments`
- `strategy_tests` (also gets `test_id`)
- `paper_trading_runs`
- `agent_runs_v2`
- `agent_runs`

Existing rows stay at `NULL`; only new flows opt in. The
[`LedgerWriter`](../alphaswarm/persistence/ledger.py) `_stamp` chain copies
`RequestContext.experiment_id` / `.test_id` onto every row that
has the matching attribute, so most flows only need to pass a populated
`RequestContext` through.
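
The stamping rule can be pictured with a small stand-alone sketch. The classes below are simplified stand-ins and `stamp` is a hypothetical helper, not the real `LedgerWriter._stamp` code. It only illustrates "copy the id when the row has the matching attribute". Not overwriting a value the caller already set is an assumption of this sketch.

```python
from dataclasses import dataclass
from typing import Any, Optional


@dataclass
class RequestContext:
    """Simplified stand-in for the real request context."""
    experiment_id: Optional[str] = None
    test_id: Optional[str] = None


@dataclass
class BacktestRunRow:
    """Stand-in for a run row that only carries experiment_id."""
    id: str
    experiment_id: Optional[str] = None


@dataclass
class StrategyTestRow:
    """Stand-in for a run row that carries both linkage columns."""
    id: str
    experiment_id: Optional[str] = None
    test_id: Optional[str] = None


def stamp(row: Any, ctx: RequestContext) -> Any:
    """Copy linkage ids from ctx onto row, skipping attributes the row lacks."""
    for field in ("experiment_id", "test_id"):
        value = getattr(ctx, field)
        # Stamp only when the context has a value, the row has the column,
        # and nothing was set explicitly (assumption of this sketch).
        if value is not None and hasattr(row, field) and getattr(row, field) is None:
            setattr(row, field, value)
    return row


ctx = RequestContext(experiment_id="exp-1", test_id="test-9")
run = stamp(BacktestRunRow(id="bt-1"), ctx)
check = stamp(StrategyTestRow(id="st-1"), ctx)
```

Here `run` picks up only `experiment_id`, because `BacktestRunRow` has no `test_id` attribute, while `check` receives both ids.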

## Hard rule

Hard rule 34 in [`AGENTS.md`](https://github.com/julianwileymac/alphaswarm/blob/main/AGENTS.md): "Every new
run-producing flow MUST populate `experiment_id` (and `test_id`
where applicable) on its run row. Don't add a new `*_runs` table
without an `experiment_id` FK."
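
One way to keep the rule honest is a schema check in CI. The sketch below is an assumption-heavy illustration. It uses an in-memory SQLite database instead of Postgres, `runs_tables_missing_experiment_id` and `widget_runs` are hypothetical names, and it only checks that the column exists, not that a foreign-key constraint is declared. The `_runs` suffix match is also a simplification, so a table such as `agent_runs_v2` would need its own pattern.

```python
import sqlite3


def runs_tables_missing_experiment_id(conn: sqlite3.Connection) -> list[str]:
    """Return every *_runs table that has no experiment_id column."""
    tables = [
        name
        for (name,) in conn.execute(
            "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
        )
        if name.endswith("_runs")
    ]
    missing = []
    for table in tables:
        # PRAGMA table_info yields one row per column; index 1 is the column name.
        columns = {row[1] for row in conn.execute(f"PRAGMA table_info({table})")}
        if "experiment_id" not in columns:
            missing.append(table)
    return missing


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE experiments (id TEXT PRIMARY KEY)")
conn.execute(
    "CREATE TABLE backtest_runs ("
    "id TEXT PRIMARY KEY, experiment_id TEXT REFERENCES experiments(id))"
)
conn.execute("CREATE TABLE widget_runs (id TEXT PRIMARY KEY)")  # violates the rule
offenders = runs_tables_missing_experiment_id(conn)
```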

## REST surface

| Method + path                              | Purpose |
| ------------------------------------------ | ------- |
| `GET /experiments`                         | List (filter by `project_id`, `kind`, `status`, `parent_experiment_id`) |
| `POST /experiments`                        | Create (slug auto-derived from name) |
| `GET /experiments/{id}`                    | Describe |
| `PATCH /experiments/{id}`                  | Update (status/metrics/parent) |
| `DELETE /experiments/{id}`                 | Cascade-deletes tests |
| `GET /experiments/{id}/runs`               | Stitched view of every typed run row pointing here |
| `GET /tests`                               | List (filter by `experiment_id`, `passed`, `assertion_kind`) |
| `POST /tests`                              | Create attached to an experiment |
| `GET /tests/{id}`                          | Describe |
| `POST /tests/{id}/evaluate`                | Set the pass/fail verdict + ref into a typed run row |
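
The table says the slug is auto-derived from the name but does not spell out the rule. A minimal sketch of one plausible derivation follows. The function name `derive_slug`, the 64-character cap, and the "lowercase, hyphen-separated" convention are all assumptions, not the server's documented behaviour.

```python
import re


def derive_slug(name: str, max_length: int = 64) -> str:
    """Lowercase the name and collapse runs of non-alphanumerics into hyphens."""
    slug = re.sub(r"[^a-z0-9]+", "-", name.lower()).strip("-")
    # Truncate, then drop a trailing hyphen the cut may have exposed.
    return slug[:max_length].rstrip("-")


slug = derive_slug("  Momentum Sweep #3 (daily bars) ")
```

A real implementation would also have to decide what happens when two names derive the same slug inside one project. This sketch leaves that out.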

## MCP surface

- `data.experiments.list` — list / filter.
- `data.experiments.tree` — nested view (`PARENT_OF` chain).
- `data.experiments.describe` — full row + counts of linked runs.
- `data.tests.list` — list / filter.
- `data.tests.describe` — full row.

## Cross-reference

- The Phase 2 ownership graph projects every experiment +
  test + linked run into Neo4j. See
  [`alphaswarm_docs/ownership-graph.md`](../../concepts/platform/ownership-graph.md).
- The Phase 6 frontend ContextBar lets the user pin a specific
  experiment (when the route declares one). See the route handlers
  for which surfaces opt in.
- The Phase 7 LEAN clone-to-workspace flow optionally creates an
  experiment when the user provides a name. See
  [`alphaswarm_docs/strategy-templates.md`](../../concepts/strategy/strategy-templates.md).


<!-- https://alpha-swarm.ai/concepts/platform/flows -->
# Major Flows
> End-to-end sequence and state diagrams for the four flows that human and AI contributors most often need to reason about. Each diagram cites the canonical files; if the diagram and the code disagree, ...

# Major Flows

> Pair with [alphaswarm_docs/architecture.md](../../concepts/platform/architecture.md) (system view) and
> [alphaswarm_docs/erd.md](../../concepts/platform/erd.md) (data model).
> Doc map: [alphaswarm_docs/index.md](../../intro/index.md).

End-to-end sequence and state diagrams for the four flows that human
and AI contributors most often need to reason about. Each diagram
cites the canonical files; if the diagram and the code disagree, the
code wins (and the doc is stale — please update).

## 1. Generic file → Iceberg ingestion

The discovery → director → materialise → verify → annotate pipeline
that powers the regulatory-corpus ingest. Canonical doc:
[alphaswarm_docs/data-catalog.md](../../concepts/data/data-catalog.md).

```mermaid
sequenceDiagram
    actor User
    participant CLI as scripts/ingest_regulatory.py
    participant API as FastAPI
    participant Celery
    participant Disc as discovery
    participant Dir as director (Nemotron)
    participant Mat as materialize
    participant Verify as verifier (Nemotron)
    participant Ann as annotate (Nemotron)
    participant Iceberg
    participant DB as Postgres
    participant Bus as Redis pub/sub

    User->>CLI: invoke per source path
    CLI->>API: POST /pipelines/ingest/regulatory
    API->>Celery: enqueue ingest_local_paths_with_director
    API-->>CLI: 202 task_id
    CLI->>Bus: SUBSCRIBE alphaswarm:task:&lt;task_id&gt;

    loop per source path
        Celery->>Disc: discover_datasets(path)
        Disc-->>Celery: list~DiscoveredDataset~
        Celery->>Dir: plan_ingestion(datasets)
        Dir-->>Celery: IngestionPlan
        Celery->>Bus: publish phase=plan
        loop per planned dataset
            Celery->>Mat: materialize_dataset(planned)
            Mat->>Iceberg: ensure_namespace + append_arrow*
            Mat-->>Celery: MaterializeResult
            alt rows below floor
                Celery->>Verify: verify_after_materialise
                Verify-->>Celery: VerifierVerdict
                opt retry
                    Celery->>Mat: re-run with new caps
                end
            end
            opt annotate=true
                Celery->>Ann: annotate_table
                Ann-->>Celery: AnnotationResult
                Ann->>DB: register_iceberg_dataset
            end
            Celery->>Bus: publish phase=materialize|verify|annotate
        end
    end

    Celery->>DB: write IngestionReport summary
    Celery->>Bus: publish stage=done
    Bus-->>CLI: final payload
    CLI->>CLI: render markdown summary + audit log
```

Canonical files:

- [alphaswarm/data/pipelines/discovery.py](../alphaswarm/data/pipelines/discovery.py)
- [alphaswarm/data/pipelines/director.py](../alphaswarm/data/pipelines/director.py)
- [alphaswarm/data/pipelines/materialize.py](../alphaswarm/data/pipelines/materialize.py)
- [alphaswarm/data/pipelines/annotate.py](../alphaswarm/data/pipelines/annotate.py)
- [alphaswarm/data/pipelines/runner.py](../alphaswarm/data/pipelines/runner.py)
- [alphaswarm/tasks/ingestion_tasks.py](../alphaswarm/tasks/ingestion_tasks.py)
- [scripts/ingest_regulatory.py](../scripts/ingest_regulatory.py)
- [scripts/_run_one_source.py](../scripts/_run_one_source.py)
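
The "rows below floor, verify, optionally retry" branch of the diagram can be sketched as a small control loop. Everything here is hypothetical: the function name, the dict-based plan and result shapes, the `retry` / `new_caps` verdict keys, and the table name. The real logic lives in the canonical files listed above.

```python
from typing import Any, Callable

Planned = dict[str, Any]
Result = dict[str, Any]


def materialize_with_verification(
    planned: Planned,
    materialize: Callable[[Planned], Result],
    verify: Callable[[Planned, Result], dict[str, Any]],
    row_floor: int,
    max_retries: int = 1,
) -> Result:
    """Materialise once, then verify and optionally retry while rows are below the floor."""
    result = materialize(planned)
    retries = 0
    while result["rows"] < row_floor and retries < max_retries:
        verdict = verify(planned, result)
        if not verdict.get("retry"):
            break  # the verifier accepted the low row count
        # Re-run with whatever caps the verifier suggested.
        planned = {**planned, **verdict.get("new_caps", {})}
        result = materialize(planned)
        retries += 1
    return result


# Stub stages standing in for the real materialise and verifier calls.
calls: list[Planned] = []


def fake_materialize(planned: Planned) -> Result:
    calls.append(planned)
    return {"table": planned["table"], "rows": planned["max_rows"]}


def fake_verify(planned: Planned, result: Result) -> dict[str, Any]:
    return {"retry": True, "new_caps": {"max_rows": 500}}


final = materialize_with_verification(
    {"table": "regulatory.filings", "max_rows": 10},
    fake_materialize,
    fake_verify,
    row_floor=100,
)
```

The verifier is only consulted when the first pass lands under the floor, which mirrors the `alt rows below floor` block in the diagram.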

## 2. Backtest dispatch

```mermaid
sequenceDiagram
    actor User
    participant UI as Next.js webui
    participant API as FastAPI /backtest
    participant DB as Postgres
    participant Celery as worker (queue=backtest)
    participant Strat as Strategy + Engine
    participant Duck as DuckDB
    participant Iceberg
    participant MLflow
    participant Bus as Redis pub/sub
    participant Ledger as LedgerWriter

    User->>UI: configure + run backtest
    UI->>API: POST /backtest {strategy_id, start, end, engine}
    API->>DB: insert BacktestRun(status=pending)
    API->>Celery: enqueue run_backtest(backtest_id)
    API-->>UI: 202 {task_id, stream_url}
    UI->>API: WebSocket /chat/stream/&lt;task_id&gt;

    Celery->>DB: load BacktestRun + Strategy
    Celery->>MLflow: start_run(experiment=alphaswarm-default)
    Celery->>Iceberg: read bars (DuckDB view)
    Duck-->>Celery: pandas DataFrame
    Celery->>Strat: instantiate FrameworkAlgorithm(...)

    loop per bar
        Strat->>Strat: universe → alpha → portfolio → risk → execution
        Strat-->>Celery: list~OrderRequest~
        Celery->>Ledger: record_signal / record_order
        Ledger->>DB: insert signals / orders
        Celery->>Bus: publish progress
        Bus-->>UI: WebSocket frame
    end

    Celery->>MLflow: log_metrics + log_artifact(equity_curve.csv)
    Celery->>DB: update BacktestRun(status=completed, sharpe, ...)
    Celery->>Bus: publish stage=done
    Bus-->>UI: final summary
```

Canonical files:

- [alphaswarm/api/routes/backtest.py](../alphaswarm/api/routes/backtest.py)
- [alphaswarm/tasks/backtest_tasks.py](../alphaswarm/tasks/backtest_tasks.py)
- [alphaswarm/backtest/engine.py](../alphaswarm/backtest/engine.py)
- [alphaswarm/backtest/runner.py](../alphaswarm/backtest/runner.py)
- [alphaswarm/strategies/framework.py](../alphaswarm/strategies/framework.py)
- [alphaswarm/persistence/ledger.py](../alphaswarm/persistence/ledger.py)
- [alphaswarm/mlops/autolog.py](../alphaswarm/mlops/autolog.py)

## 3. Agentic crew run

The dual-tier (deep + quick LLM) CrewAI graph used by the
TradingAgents-style preset. Files:
[alphaswarm/tasks/agentic_backtest_tasks.py](../alphaswarm/tasks/agentic_backtest_tasks.py),
[alphaswarm/agents/](../alphaswarm/agents/).

```mermaid
sequenceDiagram
    actor User
    participant UI
    participant API as FastAPI /agentic/*
    participant Celery as worker (queue=agents)
    participant Crew as CrewAI graph
    participant Mem as ChromaDB (per-role memory)
    participant LLMD as Nemotron deep tier
    participant LLMQ as Llama quick tier
    participant DB as Postgres
    participant MLflow
    participant Ledger as LedgerWriter
    participant Bus as Redis pub/sub

    User->>UI: pick preset + universe
    UI->>API: POST /agentic/run {preset, symbols, ...}
    API->>DB: insert CrewRun(crew_type=trader, status=queued)
    API->>Celery: enqueue run_agentic_pipeline
    API-->>UI: 202 task_id

    Celery->>Crew: build graph from preset YAML
    Crew->>Mem: load BM25 memory per role

    loop per debate round
        Crew->>LLMQ: planner (quick tier) - which tool?
        LLMQ-->>Crew: tool selection
        Crew->>LLMD: research analyst (deep tier)
        LLMD-->>Crew: structured analysis
        Crew->>DB: insert DebateTurn rows
        Crew->>Bus: publish phase=debate
    end

    Crew->>LLMD: trader synthesis (deep)
    LLMD-->>Crew: AgentDecision (long/short/flat + rationale)
    Crew->>DB: insert AgentDecision
    Crew->>Mem: persist conclusion to BM25

    Note over Crew,Ledger: Optional: replay through backtest engine
    Crew->>Celery: enqueue precompute_decisions
    Celery->>Celery: backtest replay
    Celery->>Ledger: record signals/orders
    Celery->>MLflow: log crew metrics
    Celery->>DB: insert AgentBacktest

    Celery->>Bus: publish stage=done
    Bus-->>UI: WebSocket frame
```

Canonical files:

- [alphaswarm/api/routes/agentic.py](../alphaswarm/api/routes/agentic.py)
- [alphaswarm/tasks/agentic_backtest_tasks.py](../alphaswarm/tasks/agentic_backtest_tasks.py)
- [alphaswarm/agents/](../alphaswarm/agents/)
- [alphaswarm/llm/providers/router.py](../alphaswarm/llm/providers/router.py)

## 4. Paper trading session

```mermaid
stateDiagram-v2
    [*] --> Pending : alphaswarm.tasks.paper_tasks.run_paper enqueued
    Pending --> Bootstrapping : worker dequeues
    Bootstrapping --> WarmingUp : load strategy + history bars
    WarmingUp --> Running : feed publishes first live bar

    state Running {
        [*] --> Heartbeat
        Heartbeat --> ProcessBar : bar arrives
        ProcessBar --> EmitOrders : strategy.on_bar yields orders
        EmitOrders --> RiskCheck : portfolio + risk model
        RiskCheck --> SubmitOrders : within limits
        RiskCheck --> RejectOrders : kill switch / risk breach
        SubmitOrders --> Heartbeat : write fills, ledger
        RejectOrders --> Heartbeat : write ledger.RISK
    }

    Running --> Halted : kill_switch_key set / risk breach
    Running --> Stopping : user POST /paper/stop
    Stopping --> Stopped : flush ledger + state
    Halted --> Stopped : operator clears
    Stopped --> [*]

    Running --> Stale : missed heartbeat &gt; threshold
    Stale --> Halted : safety
```

The kill switch is a Redis key (name configured via `ALPHASWARM_RISK_KILL_SWITCH_KEY`,
default `alphaswarm:kill_switch`); setting it from any client halts a running session.
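
The top-level transitions in the state diagram can be encoded as a lookup table, which makes illegal jumps easy to reject. This is a sketch of the diagram only. `SessionState`, `ALLOWED`, `transition`, and `walk` are hypothetical names, and the real session code may model state differently.

```python
from enum import Enum


class SessionState(str, Enum):
    PENDING = "Pending"
    BOOTSTRAPPING = "Bootstrapping"
    WARMING_UP = "WarmingUp"
    RUNNING = "Running"
    STALE = "Stale"
    HALTED = "Halted"
    STOPPING = "Stopping"
    STOPPED = "Stopped"


S = SessionState

# One entry per arrow between top-level states in the diagram.
ALLOWED: dict[SessionState, frozenset[SessionState]] = {
    S.PENDING: frozenset({S.BOOTSTRAPPING}),
    S.BOOTSTRAPPING: frozenset({S.WARMING_UP}),
    S.WARMING_UP: frozenset({S.RUNNING}),
    S.RUNNING: frozenset({S.HALTED, S.STOPPING, S.STALE}),
    S.STALE: frozenset({S.HALTED}),
    S.STOPPING: frozenset({S.STOPPED}),
    S.HALTED: frozenset({S.STOPPED}),
    S.STOPPED: frozenset(),
}


def transition(current: SessionState, target: SessionState) -> SessionState:
    """Return target if the diagram allows current -> target, else raise."""
    if target not in ALLOWED[current]:
        raise ValueError(f"illegal transition {current.value} -> {target.value}")
    return target


def walk(path: list[SessionState]) -> SessionState:
    """Apply a sequence of transitions and return the final state."""
    state = path[0]
    for target in path[1:]:
        state = transition(state, target)
    return state


final_state = walk([S.PENDING, S.BOOTSTRAPPING, S.WARMING_UP, S.RUNNING, S.STALE, S.HALTED, S.STOPPED])
```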

Canonical files:

- [alphaswarm/api/routes/paper.py](../alphaswarm/api/routes/paper.py)
- [alphaswarm/tasks/paper_tasks.py](../alphaswarm/tasks/paper_tasks.py)
- [alphaswarm/trading/runner.py](../alphaswarm/trading/runner.py)
- [alphaswarm/trading/session.py](../alphaswarm/trading/session.py)
- [alphaswarm/risk/](../alphaswarm/risk/)

## 5. (Bonus) Live-data subscription

Browser asks the API for a live data stream; API allocates a Redis
pub/sub channel that bridges the broker feed to a WebSocket.

```mermaid
sequenceDiagram
    actor User
    participant UI
    participant API as FastAPI /live/subscribe
    participant Bridge as live broker bridge
    participant Broker as Alpaca / IBKR / sim
    participant Bus as Redis pub/sub (alphaswarm:live:&lt;ch&gt;)
    participant WS as /live/&lt;channel_id&gt;

    User->>UI: open live tab
    UI->>API: POST /live/subscribe {venue, symbols}
    API->>Bridge: spawn bridge task
    Bridge->>Broker: subscribe to symbols
    API-->>UI: {channel_id, ws_url}
    UI->>WS: WebSocket connect

    loop per market event
        Broker-->>Bridge: bar / quote / trade
        Bridge->>Bus: publish on alphaswarm:live:&lt;ch&gt;
        Bus-->>WS: deliver
        WS-->>UI: WebSocket frame
    end

    User->>WS: close tab
    WS->>Bridge: connection closed
    Bridge->>Broker: unsubscribe (if last consumer)
```
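
The "unsubscribe (if last consumer)" step implies some form of reference counting in the bridge. The sketch below shows one way to do that per symbol. `SubscriptionRefCounter` is a hypothetical class, and whether the real bridge counts per symbol or per channel is an assumption here.

```python
class SubscriptionRefCounter:
    """Track consumers per symbol and report when the broker must be told."""

    def __init__(self) -> None:
        self._counts: dict[str, int] = {}

    def acquire(self, symbols: list[str]) -> list[str]:
        """Register one consumer; return symbols that need a broker subscribe."""
        newly_needed = []
        for symbol in symbols:
            self._counts[symbol] = self._counts.get(symbol, 0) + 1
            if self._counts[symbol] == 1:
                newly_needed.append(symbol)
        return newly_needed

    def release(self, symbols: list[str]) -> list[str]:
        """Drop one consumer; return symbols whose last consumer just left."""
        no_longer_needed = []
        for symbol in symbols:
            count = self._counts.get(symbol, 0)
            if count == 0:
                continue  # unknown symbol or already released
            if count == 1:
                del self._counts[symbol]
                no_longer_needed.append(symbol)
            else:
                self._counts[symbol] = count - 1
        return no_longer_needed


refs = SubscriptionRefCounter()
first = refs.acquire(["AAPL", "MSFT"])   # first tab opens
second = refs.acquire(["AAPL"])          # second tab shares AAPL
after_second_closes = refs.release(["AAPL"])
after_first_closes = refs.release(["AAPL", "MSFT"])
```

Only the symbols returned by `release` would be passed to the broker's unsubscribe call, so a shared symbol stays live until its final consumer disconnects.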

Canonical files:

- [alphaswarm/api/routes/market_data_live.py](../alphaswarm/api/routes/market_data_live.py)
- [alphaswarm/streaming/](../alphaswarm/streaming/)
- [alphaswarm/ws/broker.py](../alphaswarm/ws/broker.py)

## Cross-cutting: progress bus

Every long-running task in AlphaSwarm uses the **same** progress bus pattern:

```mermaid
flowchart LR
    Task[Celery task] -->|emit| Helper["alphaswarm.tasks._progress.emit"]
    Helper -->|publish| Redis[("Redis pub/sub alphaswarm:task:&lt;task_id&gt;")]
    Redis -->|asubscribe| WS[WebSocket relay /chat/stream]
    WS -->|frames| Browser
    Redis -->|subscribe| CLI[CLI scripts]
```

API to remember:

- `emit(task_id, stage, message, **extras)` — publish a progress frame.
- `emit_done(task_id, result)` — terminal `stage="done"` + result payload.
- `emit_error(task_id, error)` — terminal `stage="error"`.

Don't publish to Redis directly from your task code; always go through
[alphaswarm/tasks/_progress.py](../alphaswarm/tasks/_progress.py) so the frame
shape stays consistent.
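
To make the three helpers concrete, here is a minimal stand-in that mirrors their signatures against an injected publisher instead of a live Redis client. The `ProgressBus` class, the frame keys (`task_id`, `stage`, `message`, `result`), and the `"completed"` message are assumptions for illustration. Only the channel name `alphaswarm:task:<task_id>` and the `done` / `error` stage values come from this page.

```python
import json
from typing import Any, Callable


class ProgressBus:
    """Stand-in for the progress helpers, with the publisher injected."""

    def __init__(self, publish: Callable[[str, str], None]) -> None:
        self._publish = publish

    def emit(self, task_id: str, stage: str, message: str, **extras: Any) -> dict[str, Any]:
        """Publish one progress frame on the task's channel."""
        # Core keys are written last so extras cannot overwrite them.
        frame = {**extras, "task_id": task_id, "stage": stage, "message": message}
        self._publish(f"alphaswarm:task:{task_id}", json.dumps(frame, default=str))
        return frame

    def emit_done(self, task_id: str, result: Any) -> dict[str, Any]:
        """Terminal frame carrying the result payload."""
        return self.emit(task_id, "done", "completed", result=result)

    def emit_error(self, task_id: str, error: Any) -> dict[str, Any]:
        """Terminal frame describing the failure."""
        return self.emit(task_id, "error", str(error))


sent: list[tuple[str, str]] = []
bus = ProgressBus(lambda channel, payload: sent.append((channel, payload)))
bus.emit("t-1", "materialize", "wrote table", rows=42)
bus.emit_done("t-1", {"tables": 1})
```

Routing every frame through one `emit` is what keeps the shape uniform for both the WebSocket relay and CLI subscribers.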


<!-- https://alpha-swarm.ai/concepts/platform/instrument-taxonomy -->
# Instrument taxonomy
> The legacy taxonomy treated REITs and depositary receipts as plain ``InstrumentEquity`` rows with a discriminator flag (``is_adr``), and modelled OTC derivatives as opaque blobs. That worked while age...

# Instrument taxonomy

> Status: **Phase 1 shipped** (Alembic 0039). Adds REIT / mutual fund / OTC
> derivative / ADR / GDR as first-class polymorphic subclasses of
> :class:`alphaswarm.persistence.models.Instrument` plus a registry table
> (``instrument_measures``) that catalogs which metrics are available for
> each instrument.

## Why

The legacy taxonomy treated REITs and depositary receipts as plain
``InstrumentEquity`` rows with a discriminator flag (``is_adr``), and
modelled OTC derivatives as opaque blobs. That worked while agents
only routed cash equities and listed options, but it broke as soon as
the platform tried to:

* compute the cross-market basis between an NYSE-listed ADR and its
  foreign common (no FK to the underlying, no conversion ratio, no
  depository bank metadata);
* run a REIT sector-rotation strategy (no FFO, no payout ratio, no
  property-portfolio composition);
* clear an OTC swap through a CCP (no LEI, no ISDA master agreement
  id, no notional / collateral fields).

Phase 1 lifts these shapes into first-class joined-table subclasses
with the columns the trading + risk + cross-market arbitrage paths
read directly.

## Taxonomy

| Class | SQL table | ``polymorphic_identity`` | InstrumentClass | AssetClass |
| --- | --- | --- | --- | --- |
| ``Equity`` | ``instrument_equity`` | ``spot`` | ``SPOT`` | ``EQUITY`` |
| ``ETF`` | ``instrument_etf`` | ``etf`` | ``ETF`` | ``EQUITY`` |
| ``IndexInstrument`` | ``instrument_index`` | ``index`` | ``INDEX`` | ``INDEX`` |
| ``Bond`` | ``instrument_bond`` | ``bond`` | ``BOND`` | ``RATES`` |
| ``FuturesContract`` | ``instrument_future`` | ``future`` | ``FUTURE`` | ``COMMODITY`` |
| ``OptionContract`` | ``instrument_option`` | ``option`` | ``OPTION`` | ``EQUITY`` |
| ``CurrencyPair`` | ``instrument_fx_pair`` | ``fx_pair`` | ``SPOT`` | ``FX`` |
| ``CryptoToken`` | ``instrument_crypto`` | ``crypto_token`` | ``CRYPTO_TOKEN`` | ``CRYPTO`` |
| ``Cfd`` | ``instrument_cfd`` | ``cfd`` | ``CFD`` | ``EQUITY`` |
| ``Commodity`` | ``instrument_commodity`` | ``spot_commodity`` | ``SPOT`` | ``COMMODITY`` |
| ``SyntheticInstrument`` | ``instrument_synthetic`` | ``synthetic`` | ``SYNTHETIC`` | ``MIXED`` |
| ``BettingInstrument`` | ``instrument_betting`` | ``betting`` | ``BETTING`` | ``EVENT`` |
| ``TokenizedAsset`` | ``instrument_tokenized_asset`` | ``nft`` | ``NFT`` | ``CRYPTO`` |
| **``REIT``** | **``instrument_reit``** | **``reit``** | **``REIT``** | **``EQUITY``** |
| **``MutualFund``** | **``instrument_mutual_fund``** | **``mutual_fund``** | **``MUTUAL_FUND``** | **``EQUITY``** |
| **``OTCDerivative``** | **``instrument_otc_derivative``** | **``otc_derivative``** | **``OTC_DERIVATIVE``** | **``MIXED``** |
| **``AmericanDepositaryReceipt``** | **``instrument_adr``** | **``adr``** | **``ADR``** | **``EQUITY``** |
| **``GlobalDepositaryReceipt``** | **``instrument_gdr``** | **``gdr``** | **``GDR``** | **``EQUITY``** |

Phase 1 rows are bolded.

### REIT

``InstrumentREIT`` adds the columns a REIT-aware strategy needs:

* ``reit_class`` -- ``equity``, ``mortgage``, ``hybrid``,
  ``public_non_listed``, ``private``
* ``property_sector`` -- ``residential``, ``commercial``, ``industrial``,
  ``healthcare``, ``data_center``, ``retail``, ``hospitality``,
  ``diversified``, ``infrastructure``, ``timber``
* ``property_portfolio_json`` -- list of property dicts (the discovery
  service surfaces these without spinning up a separate
  ``reit_properties`` table)
* ``distribution_yield`` / ``ffo_per_share`` / ``payout_ratio`` /
  ``debt_to_equity``

### Mutual fund

``InstrumentMutualFund`` covers open-end and closed-end funds. The
discriminator that distinguishes it from ``InstrumentETF`` is the
trading mechanism (end-of-day NAV vs intraday creation-redemption).

* ``fund_family`` (Vanguard / Fidelity / BlackRock / ...)
* ``share_class`` (A / B / C / I / R / Z / retail / institutional)
* ``fund_kind`` (open_end / closed_end / money_market / target_date /
  ucits / sicav)
* ``expense_ratio`` / ``management_fee`` / ``minimum_investment``

### OTC derivative

``InstrumentOTCDerivative`` is the catch-all for the OTC universe.
The ``instrument_kind`` discriminator selects the specific shape:

* ``swap`` / ``swaption`` / ``cap_floor`` / ``forward`` / ``exotic``
* ``variance_swap`` / ``credit_default_swap`` / ``total_return_swap`` /
  ``basket_swap``

Regulatory identity flows through ``counterparty_lei`` plus
``isda_master_agreement_id`` so trade-repository reconciliation
(DTCC, REGIS-TR) works without a separate registration step. The
``legs_json`` column stores the leg structure inline so a single
class supports the entire OTC universe without a tree of subclasses.

### ADR / GDR

Both subclasses carry:

* ``underlying_instrument_id`` -- FK to the foreign equity row
* ``conversion_ratio`` -- shares of foreign common per receipt
* ``depository_bank_name`` / ``depository_bank_lei``
* ADR adds ``sponsorship_level`` (I / II / III / 144A / Reg_S /
  unsponsored)
* GDR adds ``regulatory_regime`` (Reg_S / Rule_144A / Reg_S_144A /
  full_listing) plus a non-US ``listing_venue``

The Phase 4 cross-market basis algorithm reads
``adr.conversion_ratio`` and walks ``adr.underlying_instrument_id``
to fetch the local price directly -- no extra join needed.
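
As a worked example of the arithmetic these columns enable, the sketch below prices a receipt off its underlying and expresses the gap in basis points. It uses the standard parity relation (local price times ``conversion_ratio`` times an FX rate). The Phase 4 algorithm itself is not shown on this page, so the function names, the ``fx_rate`` parameter, and the sample numbers are all assumptions.

```python
from decimal import Decimal


def implied_receipt_price(
    local_price: Decimal, conversion_ratio: Decimal, fx_rate: Decimal
) -> Decimal:
    """Parity price of one receipt.

    conversion_ratio: shares of foreign common per receipt.
    fx_rate: receipt currency per one unit of the local currency (assumed input).
    """
    return local_price * conversion_ratio * fx_rate


def basis_bps(
    receipt_price: Decimal,
    local_price: Decimal,
    conversion_ratio: Decimal,
    fx_rate: Decimal,
) -> Decimal:
    """Premium (+) or discount (-) of the receipt versus parity, in basis points."""
    implied = implied_receipt_price(local_price, conversion_ratio, fx_rate)
    return (receipt_price - implied) / implied * Decimal(10000)


# Local share at 2500 (local currency), 2 shares per receipt, FX 0.0064.
implied = implied_receipt_price(Decimal("2500"), Decimal("2"), Decimal("0.0064"))
spread = basis_bps(Decimal("32.40"), Decimal("2500"), Decimal("2"), Decimal("0.0064"))
```

With these inputs the parity price is 32.00 and a receipt trading at 32.40 sits 125 basis points rich.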

## ``instrument_measures`` registry

The registry answers "what data exists for this instrument?" with one row per
``(instrument_id, measure_type, frequency, dataset_field)`` tuple.

Common ``measure_type`` values: ``price``, ``volume``,
``open_interest``, ``implied_volatility``, ``dividend_yield``,
``ffo``, ``nav``, ``distribution``, ``greek_delta``, ``greek_gamma``,
``basis``, ``spread``, ``turnover``, ``bid_ask_spread``.

Common ``frequency`` values: ``tick``, ``second``, ``minute``,
``hour``, ``day``, ``week``, ``month``, ``quarter``, ``annual``,
``event_driven``, ``adhoc``.

Agents query this BEFORE drafting a SQL / Iceberg query via the
``data.instruments.measures`` DataMCP tool so they don't select a
column that doesn't exist for the instrument-frequency pair they
care about.
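
The pre-query check can be pictured as a lookup on that four-column key. The sketch below uses an in-memory SQLite table with only the key columns. The sample rows, the ``available_fields`` helper, and the ``dataset_field`` values are invented for illustration, and the real ``data.instruments.measures`` tool may return a richer payload.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE instrument_measures (
        instrument_id TEXT NOT NULL,
        measure_type  TEXT NOT NULL,
        frequency     TEXT NOT NULL,
        dataset_field TEXT NOT NULL,
        PRIMARY KEY (instrument_id, measure_type, frequency, dataset_field)
    )
    """
)
conn.executemany(
    "INSERT INTO instrument_measures VALUES (?, ?, ?, ?)",
    [
        ("reit-1", "price", "day", "close"),
        ("reit-1", "ffo", "quarter", "ffo_per_share"),
        ("reit-1", "distribution", "event_driven", "distribution_amount"),
    ],
)


def available_fields(
    conn: sqlite3.Connection, instrument_id: str, measure_type: str, frequency: str
) -> list[str]:
    """Return the dataset fields registered for this instrument/measure/frequency."""
    rows = conn.execute(
        "SELECT dataset_field FROM instrument_measures "
        "WHERE instrument_id = ? AND measure_type = ? AND frequency = ? "
        "ORDER BY dataset_field",
        (instrument_id, measure_type, frequency),
    ).fetchall()
    return [row[0] for row in rows]


quarterly_ffo = available_fields(conn, "reit-1", "ffo", "quarter")
daily_ffo = available_fields(conn, "reit-1", "ffo", "day")
```

An empty result, as for daily FFO here, is the signal not to draft a query against that instrument-frequency pair.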

## How to add a new subclass

1. Add an :class:`InstrumentClass` enum value in
   [`alphaswarm/core/domain/enums.py`](../alphaswarm/core/domain/enums.py).
2. Add the matching joined-table SQL subclass in
   [`alphaswarm/persistence/models_instruments.py`](../alphaswarm/persistence/models_instruments.py).
   Set ``polymorphic_identity`` to the enum value.
3. Add the in-memory domain class in
   [`alphaswarm/core/domain/instrument.py`](../alphaswarm/core/domain/instrument.py)
   decorated with ``@register_instrument_class``.
4. Add an Alembic migration for the new table.
5. If the new class needs unique ``data.instruments.*`` access
   patterns, register a DataMCP tool under
   [`alphaswarm/data/mcp/tools/instruments.py`](../alphaswarm/data/mcp/tools/instruments.py).
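
Step 3 can be pictured with a toy registry. This is not the project's implementation. The trimmed ``InstrumentClass`` enum, the ``WARRANT`` value, the ``Warrant`` class, and the convention that the decorator reads a class-level ``instrument_class`` attribute are all hypothetical, chosen only to show how a bare ``@register_instrument_class`` decorator might index domain classes.

```python
from dataclasses import dataclass
from enum import Enum
from typing import ClassVar


class InstrumentClass(str, Enum):
    """Trimmed stand-in for the real enum."""
    SPOT = "spot"
    REIT = "reit"
    WARRANT = "warrant"  # hypothetical new value added in step 1


_REGISTRY: dict[InstrumentClass, type] = {}


def register_instrument_class(cls: type) -> type:
    """Index a domain class by its declared instrument_class (assumed convention)."""
    key = cls.instrument_class
    if key in _REGISTRY:
        raise ValueError(f"{key.value!r} is already registered to {_REGISTRY[key].__name__}")
    _REGISTRY[key] = cls
    return cls


@register_instrument_class
@dataclass(frozen=True)
class Warrant:
    """Hypothetical in-memory domain class for the new subclass."""
    instrument_class: ClassVar[InstrumentClass] = InstrumentClass.WARRANT
    instrument_id: str
    strike: float


resolved = _REGISTRY[InstrumentClass("warrant")]
```

Rejecting duplicate registrations catches the case where two classes claim the same enum value, which would otherwise make lookups ambiguous.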

## DataMCP surface

| Tool | Purpose |
| --- | --- |
| `data.instruments.measures` | Available metrics for an instrument |
| `data.instruments.depositary_receipts` | ADR / GDR with underlying-equity FK + conversion ratio |
| `data.instruments.reit_portfolio` | REIT property-portfolio composition + FFO / yield |
| `data.identity.resolve` | Forward identifier resolution at ``as_of`` |
| `data.identity.history` | Walk every alias ever known for an entity |
| `data.futures.curve.list` | Discover available futures curves |
| `data.futures.curve.stitched` | Roll-stitched continuous curves |


<!-- https://alpha-swarm.ai/concepts/platform/legacy-types-shim -->
# Legacy `alphaswarm.core.types` shim
> The legacy module is imported by **~140 files** across the codebase (strategies, brokers, REST routes, paper-trading session, backtest engines, RL apps, tests). A hard delete would break every one of ...

# Legacy `alphaswarm.core.types` shim

> Status: **Phase 5 finalization shipped**. The module is now a thin
> compatibility shim over [`alphaswarm.core.domain`](../alphaswarm/core/domain/).

## Why a shim and not a delete

The legacy module is imported by **~140 files** across the codebase
(strategies, brokers, REST routes, paper-trading session, backtest
engines, RL apps, tests). A hard delete would break every one of them
in the same commit.

The shim approach preserves backward compatibility while making the
domain types the recommended path:

1. **Every public name still imports** -- no breaking change for the
   140 existing importers.
2. **Each domain-replaceable class is marked DEPRECATED** in its
   docstring with a `.. deprecated:: 5.0` Sphinx directive pointing
   at the canonical type.
3. **Bridge methods** on each legacy class let callers convert to the
   domain shape with one method call:
   * `Symbol.to_instrument_id()`
   * `OrderRequest.to_domain_order(client_order_id=, account=)`
   * `OrderData.to_domain_order()` / `OrderData.from_domain_order(...)`
   * `TradeData.from_execution_report(...)`
   * `PositionData.from_account_position_row(...)`
   * `AccountData.from_account_row(account_row, balances=...)`
4. **Domain re-exports at module bottom** let callers migrate one
   import at a time -- `from alphaswarm.core.types import DomainOrder`
   works without rewriting the import line.

## Three categories of type

### Category 1 -- Domain replacement available

| Legacy | Domain canonical |
| --- | --- |
| `Symbol` | `alphaswarm.core.domain.identifiers.InstrumentId` |
| `Exchange` | `alphaswarm.core.domain.identifiers.Venue` (string-valued ID) |
| `AssetClass` | `alphaswarm.core.domain.enums.AssetClass` (richer) |
| `SecurityType` | `alphaswarm.core.domain.enums.InstrumentClass` |
| `OrderType` | `alphaswarm.core.domain.enums.OrderType` (superset) |
| `OrderSide` | `alphaswarm.core.domain.enums.OrderSide` (superset) |
| `OrderStatus` | `alphaswarm.core.domain.enums.OrderStatus` (superset) |
| `Direction` | `alphaswarm.core.domain.enums.PositionSide` |
| `OrderRequest` | `alphaswarm.core.domain.orders.DomainOrder` |
| `OrderData` | `alphaswarm.core.domain.orders.DomainOrder` |
| `AccountData` | `alphaswarm.persistence.models_accounts.AccountRow` + balances |
| `PositionData` | `alphaswarm.persistence.models_accounts.AccountPositionRow` |
| `TradeData` | `alphaswarm.trading.execution.ExecutionReport` |

The legacy classes in this category are shims. Their docstring carries
`.. deprecated:: 5.0` and points at the canonical type.

### Category 2 -- Authoritative here (no domain replacement)

Market-data records and data-plane routing have no domain equivalents
because the domain layer is about identity / orders / accounts, not
the data plane:

* `BarData`, `TradeBar` (alias), `QuoteBar`, `TickData`, `Tick` (alias)
* `SubscriptionDataConfig`, `Interval`, `Resolution`, `TickType`,
  `DataNormalizationMode`

Framework value objects for the alpha / portfolio stages:

* `Signal`, `PortfolioTarget`

Backtest event-loop types:

* `Event`, `EventType`, `MarketEvent`, `SignalEvent`,
  `OrderEvent_Msg`, `FillEvent_Msg`

Legacy framework patterns for the existing `PaperTradingSession`:

* `OrderEvent` (state-transition record, NOT the messaging event),
  `OrderTicket`, `SecurityHolding`, `Cash`, `CashBook`

These types remain authoritative here.

### Category 3 -- Domain re-exports

Every Phase 1-5 domain type is re-exported from `alphaswarm.core.types` so
callers can incrementally migrate. The recommended long-term migration
is `from alphaswarm.core.domain import ...`, but the shim re-exports let you
do it one import line at a time:

```python
# Both work after Phase 5. The second is the recommended long-term form.
from alphaswarm.core.types import DomainOrder, InstrumentId, OmsType
from alphaswarm.core.domain import DomainOrder, InstrumentId, OmsType
```

The re-exports cover every name from
[`alphaswarm/core/domain/enums.py`](../alphaswarm/core/domain/enums.py),
[`alphaswarm/core/domain/identifiers.py`](../alphaswarm/core/domain/identifiers.py),
and [`alphaswarm/core/domain/orders.py`](../alphaswarm/core/domain/orders.py).

Names that would collide with the legacy enums get a `Domain` prefix:

| Legacy | Domain re-export |
| --- | --- |
| `OrderType` | `DomainOrderType` |
| `OrderSide` | `DomainOrderSide` |
| `OrderStatus` | `DomainOrderStatus` |
| `AssetClass` | `DomainAssetClass` |
| `AccountType` (no legacy counterpart) | `DomainAccountType` |

Everything else (`InstrumentId`, `ClientOrderId`, `OrderListId`,
`DomainOrder`, `LimitOrder`, `StopMarketOrder`, `OrderList`,
`PositionSide`, `OmsType`, `ContingencyType`, `TriggerType`,
`TimeInForce`, `TrailingOffsetType`, `InstrumentClass`,
`LiquiditySide`, `AggressorSide`, ...) is re-exported under its
original name.

## Migration workflow

For a file currently using legacy types:

1. **Drop in domain bridges where convenient.** No import changes
   needed:

   ```python
   from alphaswarm.core.types import OrderRequest

   req = OrderRequest(...)
   domain = req.to_domain_order()  # bridge to canonical
   ```

2. **Switch single imports incrementally.** Replace one legacy import
   at a time with the domain re-export through the shim:

   ```python
   # Before
   from alphaswarm.core.types import OrderType, OrderSide, OrderStatus

   # After (works today, no functional change)
   from alphaswarm.core.types import (
       DomainOrderType as OrderType,
       DomainOrderSide as OrderSide,
       DomainOrderStatus as OrderStatus,
   )
   ```

3. **Final form -- direct from domain.** Once the entire file is
   migrated, drop the shim:

   ```python
   from alphaswarm.core.domain.enums import OrderType, OrderSide, OrderStatus
   ```

## Why we keep the legacy enums (instead of aliasing to domain)

The temptation is "just `OrderType = DomainOrderType`". We don't,
because:

* **Existing DB rows persist the legacy string values.** The legacy
  enum has `STOP = "stop"`; the domain has `STOP_MARKET = "stop_market"`.
  Aliasing the legacy enum to the domain enum would invalidate every
  previously saved `order_type` value.
* **YAML strategy configs use the legacy values.** Backward compatibility
  with shipped strategy YAML is a hard requirement.
* **Some legacy values have no exact domain equivalent.** `OrderStatus.NEW`
  (legacy) maps to `OrderStatus.ACCEPTED` (domain), but the legacy
  state machine has a different topology (no PENDING_UPDATE /
  PENDING_CANCEL).

The legacy enums are therefore kept verbatim; the deprecation
directive points callers at the richer domain enum, and the
re-exports give callers an opt-in path.

## When the shim can finally be deleted

The shim file disappears when:

1. Every importer has migrated to `from alphaswarm.core.domain import ...`
2. Every persisted legacy enum value has been migrated to the
   domain canonical (`stop` -> `stop_market`, `cancelled` ->
   `canceled`, `new` -> `accepted`, `partial` -> `partially_filled`)
3. Every shipped YAML strategy config has been rewritten to use the
   domain values

That migration is a separate, multi-PR effort tracked outside this
Phase 5 finalization. The shim stays put until it's done.
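
The value rewrites listed in condition 2 amount to a small lookup table. The sketch below captures just that mapping. The pairs come from the list above, but the `status` column name, the grouping of the three status values under it, and the `to_domain_value` helper are assumptions, and a real data migration would run as an Alembic revision rather than in application code.

```python
# Legacy persisted value -> domain canonical value, grouped by column.
LEGACY_TO_DOMAIN: dict[str, dict[str, str]] = {
    "order_type": {
        "stop": "stop_market",
    },
    "status": {  # column name assumed for illustration
        "cancelled": "canceled",
        "new": "accepted",
        "partial": "partially_filled",
    },
}


def to_domain_value(column: str, legacy_value: str) -> str:
    """Translate one persisted value; values without a rewrite pass through unchanged."""
    return LEGACY_TO_DOMAIN.get(column, {}).get(legacy_value, legacy_value)


migrated = [to_domain_value("status", v) for v in ("new", "partial", "filled", "cancelled")]
```

Passing unknown values through unchanged keeps the rewrite idempotent, so running it twice over the same rows is harmless.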

## Bridge method reference

```python
from alphaswarm.core.types import (
    Symbol, OrderRequest, OrderData, TradeData,
    PositionData, AccountData,
)

# Symbol <-> InstrumentId
sym = Symbol.parse("AAPL.NASDAQ")
iid = sym.to_instrument_id()
back = Symbol.from_instrument_id(iid)

# OrderRequest -> DomainOrder
req = OrderRequest(symbol=sym, side=..., order_type=..., quantity=10)
domain = req.to_domain_order(client_order_id="cl-1", gateway="alpaca")

# OrderData round-trip
data = OrderData(...)
domain = data.to_domain_order()
rebuilt = OrderData.from_domain_order(domain, gateway="alpaca")

# TradeData from ExecutionReport
from alphaswarm.trading.execution import ExecutionReport
trade = TradeData.from_execution_report(report)

# PositionData from AccountPositionRow
from alphaswarm.persistence.models_accounts import AccountPositionRow
pos = PositionData.from_account_position_row(row)

# AccountData snapshot from persistence rows
from alphaswarm.persistence.models_accounts import AccountRow, AccountBalanceRow
snapshot = AccountData.from_account_row(account_row, balances=balance_rows)
```


<!-- https://alpha-swarm.ai/concepts/platform/local-platform -->
# Local platform overlay
> The rpi `kubernetes/` tree stays untouched — these are *copies*, not relocations. AlphaSwarm attaches to either the local services or the cluster through the [`KubernetesAdapter`](../../concepts/infrastructure/kubernetes-adapter.md) abst...

# Local platform overlay

Audience: a developer who wants to run AlphaSwarm **standalone**, without
attaching to the rpi_kubernetes cluster. The platform overlay
(`alphaswarm_platform/compose/docker-compose.platform.yml`) brings the data + observability
services AlphaSwarm code expects into the local compose stack.

The rpi `kubernetes/` tree stays untouched — these are *copies*, not
relocations. AlphaSwarm attaches to either the local services or the cluster
through the [`KubernetesAdapter`](../../concepts/infrastructure/kubernetes-adapter.md) abstraction.

## Compose-up matrix

| Goal | Command |
| --- | --- |
| Just the AlphaSwarm API + workers | `docker compose up -d` |
| AlphaSwarm + visualization stack (Trino, Polaris, Superset, Dagster, Dask, Ray) | `docker compose -f alphaswarm_platform/compose/docker-compose.yml -f alphaswarm_platform/compose/docker-compose.viz.yml --profile visualization up -d` |
| Full local platform parity (adds Apicurio + real Airbyte + DataHub + Loki + Vector + VictoriaMetrics) | `docker compose -f alphaswarm_platform/compose/docker-compose.yml -f alphaswarm_platform/compose/docker-compose.viz.yml -f alphaswarm_platform/compose/docker-compose.platform.yml --profile visualization --profile platform up -d` |

The platform overlay also activates the `visualization`-profile services
it depends on (Polaris, Trino, Dagster). Even so, don't pass `--profile
platform` alone: the AlphaSwarm webui still depends on Superset from the
viz overlay, so always combine it with `--profile visualization`.
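The matrix can also be expressed as a small command builder. This is a minimal sketch, not part of AlphaSwarm: `GOALS` and `compose_command` are hypothetical names, and the file and profile lists are copied from the table above.

```python
from typing import Dict, List, Tuple

COMPOSE_DIR = "alphaswarm_platform/compose"

# goal -> (overlay files, profiles), copied from the compose-up matrix.
GOALS: Dict[str, Tuple[List[str], List[str]]] = {
    "api": ([], []),
    "visualization": (
        ["docker-compose.yml", "docker-compose.viz.yml"],
        ["visualization"],
    ),
    "platform": (
        ["docker-compose.yml", "docker-compose.viz.yml", "docker-compose.platform.yml"],
        ["visualization", "platform"],
    ),
}


def compose_command(goal: str, action: str = "up") -> List[str]:
    """Build the docker compose argv for one of the goals in GOALS."""
    files, profiles = GOALS[goal]
    argv = ["docker", "compose"]
    for name in files:
        argv += ["-f", f"{COMPOSE_DIR}/{name}"]
    for profile in profiles:
        argv += ["--profile", profile]
    argv.append(action)
    if action == "up":
        argv.append("-d")  # detached, as in every row of the matrix
    return argv
```

Because the `platform` goal always carries both profiles, this helper cannot produce the `--profile platform`-only invocation that the paragraph above warns against.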

## Services added by the platform overlay

| Service | Container | Default host port | Wires into |
| --- | --- | --- | --- |
| `apicurio` (Schema Registry) | `alphaswarm-apicurio` | `8090 -> 8080` | `ALPHASWARM_SCHEMA_REGISTRY_URL` already supports the URL knob |
| `airbyte-db` | `alphaswarm-airbyte-db` | (internal) | Postgres backing for real Airbyte |
| `airbyte-server-real` | `alphaswarm-airbyte-server-real` | `8005 -> 8001` | Real Airbyte API (the dev stub at `airbyte-server` keeps running on `:8002`) |
| `airbyte-webapp` | `alphaswarm-airbyte-webapp` | `8001 -> 80` | UI for real Airbyte |
| `datahub-gms` | `alphaswarm-datahub-gms` | `8081 -> 8080` | `ALPHASWARM_DATAHUB_GMS_URL=http://datahub-gms:8080` |
| `datahub-frontend` | `alphaswarm-datahub-frontend` | `9002 -> 9002` | DataHub UI |
| `loki` | `alphaswarm-loki` | `3100 -> 3100` | Log aggregation; OTel collector + agents push here |
| `vector` | `alphaswarm-vector` | (none) | Tails Docker container logs and ships to Loki |
| `victoriametrics` | `alphaswarm-victoriametrics` | `8428 -> 8428` | Long-term metrics; scrapes the existing OTel collector + AlphaSwarm API |

## Sub-profiles (documented but not enabled by default)

These sub-profiles are kept out of the default platform set, which
deliberately stops short of full parity:

- `platform-rag` — RAGFlow + Milvus stack (heavy; pulls a vector DB).
- `platform-jh` — JupyterHub.

Add them yourself if needed, either by extending
`alphaswarm_platform/compose/docker-compose.platform.yml` or by shipping a
sibling overlay file (`docker-compose.platform..yml`) alongside it.

## Smoke test sequence

1. `docker compose -f alphaswarm_platform/compose/docker-compose.yml -f alphaswarm_platform/compose/docker-compose.viz.yml -f alphaswarm_platform/compose/docker-compose.platform.yml --profile visualization --profile platform up -d`
2. `curl http://localhost:8428/-/ready` — VictoriaMetrics
3. `curl http://localhost:3100/ready` — Loki
4. `curl http://localhost:8081/health` — DataHub GMS
5. `curl http://localhost:8090/apis` — Apicurio
6. `curl http://localhost:8005/api/v1/health` — real Airbyte
7. `docker compose ps` — every service should be healthy or running
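Steps 2 to 6 can be scripted. The sketch below is illustrative only: `CHECKS`, `http_status`, and `run_smoke` are hypothetical names, the URLs are the ones listed above, and it assumes each endpoint answers HTTP 200 when the service is ready.

```python
from typing import Callable, Dict, List, Tuple
from urllib.error import HTTPError, URLError
from urllib.request import urlopen

# (service label, readiness URL), copied from the smoke test sequence.
CHECKS: List[Tuple[str, str]] = [
    ("VictoriaMetrics", "http://localhost:8428/-/ready"),
    ("Loki", "http://localhost:3100/ready"),
    ("DataHub GMS", "http://localhost:8081/health"),
    ("Apicurio", "http://localhost:8090/apis"),
    ("Airbyte", "http://localhost:8005/api/v1/health"),
]


def http_status(url: str, timeout: float = 3.0) -> int:
    """Return the HTTP status code, or 0 if the service is unreachable."""
    try:
        with urlopen(url, timeout=timeout) as resp:
            return resp.status
    except HTTPError as exc:
        return exc.code  # the service answered, but with an error status
    except (URLError, OSError):
        return 0  # connection refused, DNS failure, timeout, ...


def run_smoke(fetch: Callable[[str], int] = http_status) -> Dict[str, bool]:
    """Probe every endpoint; True means the service answered 200."""
    return {name: fetch(url) == 200 for name, url in CHECKS}
```

Calling `run_smoke()` with no arguments probes the real endpoints; passing a fake `fetch` keeps the function testable without a running stack.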

## Where the rpi cluster fits in

When `ALPHASWARM_CLUSTER_MGMT_URL` is set, the
[`RpiClusterAdapter`](../../concepts/infrastructure/kubernetes-adapter.md#rpiclusteradapter) auto-promotes
and AlphaSwarm forwards Kafka admin + Flink session-job + alphavantage stream
operations to the homelab management API. Setting both attach paths
side-by-side is fine — AlphaSwarm routes the call wherever the active
adapter says.

## Cleanup

```bash
docker compose -f alphaswarm_platform/compose/docker-compose.yml -f alphaswarm_platform/compose/docker-compose.viz.yml -f alphaswarm_platform/compose/docker-compose.platform.yml --profile visualization --profile platform down
```

Volumes are preserved; pass `-v` to wipe them.


<!-- https://alpha-swarm.ai/concepts/platform/ownership-graph -->
# Ownership graph (Phase 2 of the multi-tenant rollout)
> The ownership graph is the projection layer that lets the MCP catalog + UI ask "what can this user see?" / "who can read this?" without joining the canonical tenancy tables hop-by-hop.

# Ownership graph (Phase 2 of the multi-tenant rollout)

The ownership graph is the projection layer that lets the MCP
catalog + UI ask "what can this user see?" / "who can read this?"
without joining the canonical tenancy tables hop-by-hop.

## Architecture

```mermaid
flowchart LR
  subgraph postgres [Postgres canonical]
    Orgs[organizations]
    Teams[teams]
    Users[users]
    Mem[memberships]
    Ws[workspaces]
    Projects[projects]
    Labs[labs]
    Exp[experiments]
    Tests[tests]
    Res[resources]
    RR[resource_relations]
  end

  Orgs -->|after_flush_postexec| Bus[(Redis stream alphaswarm:ownership:events)]
  Mem -->|after_flush_postexec| Bus
  Res -->|after_flush_postexec| Bus
  Bus -->|Celery drain| Neo[(Neo4j projection)]

  subgraph readers [Read clients]
    MCP["data.ownership.* MCP tools"]
    UI[ContextBar + Profile page]
  end
  MCP -->|"OwnershipGraphStore.traverse"| Neo
  UI -->|"GET /cache/{org,team,...}"| postgres
```

## Hard rule

Hard rule 33 in [`AGENTS.md`](https://github.com/julianwileymac/alphaswarm/blob/main/AGENTS.md): "All ownership /
membership queries that traverse more than one hop MUST go through
`alphaswarm.graph.OwnershipGraphStore`."

## Node + edge model

| Node kind     | Source table         | Identity         |
| ------------- | -------------------- | ---------------- |
| Organization  | `organizations`      | UUID             |
| Team          | `teams`              | UUID             |
| User          | `users`              | UUID             |
| Workspace     | `workspaces`         | UUID             |
| Project       | `projects`           | UUID             |
| Lab           | `labs`               | UUID             |
| Experiment    | `experiments`        | UUID             |
| Test          | `tests`              | UUID             |
| Resource      | `resources`          | UUID             |

| Edge relation     | from -> to (kinds)            | Source     |
| ----------------- | ----------------------------- | ---------- |
| `HAS_TEAM`        | Organization -> Team          | `teams.org_id` |
| `HAS_WORKSPACE`   | Organization -> Workspace     | `workspaces.org_id` |
| `HAS_PROJECT`     | Workspace -> Project          | `projects.workspace_id` |
| `HAS_LAB`         | Workspace -> Lab              | `labs.workspace_id` |
| `MEMBER_OF`       | User -> (Org\|Team\|Workspace\|Project\|Lab) | `memberships` |
| `OWNS`            | (Org\|Team\|User\|Workspace\|Project) -> Resource | `resources.owner_scope_*` |
| `IN_PROJECT`      | Experiment / Resource -> Project | `*.project_id` |
| `IN_LAB`          | Experiment / Resource -> Lab  | `*.lab_id` |
| `IN_WORKSPACE`    | Resource -> Workspace         | `resources.workspace_id` |
| `IN_EXPERIMENT`   | Test -> Experiment            | `tests.experiment_id` |
| `PARENT_OF`       | Experiment -> Experiment      | `experiments.parent_experiment_id` |
| `DERIVED_FROM` / `CLONES` / `TRANSLATED_FROM` / `USES` / `REFERENCES` | Resource -> Resource | `resource_relations.relation` |
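To make the edge model concrete, here is a minimal in-memory sketch of a "what can this user see?" walk. It is not the `OwnershipGraphStore` implementation: `visible_resources` is a hypothetical helper, and the visibility rule it encodes (follow `MEMBER_OF`, then containment edges, then `OWNS`) is an assumption made for illustration.

```python
from collections import deque
from typing import Dict, Iterable, List, Set, Tuple

Edge = Tuple[str, str, str]  # (from_node, relation, to_node)

# Relations followed when walking outward from a user (assumed rule).
FOLLOWED = {"MEMBER_OF", "HAS_TEAM", "HAS_WORKSPACE", "HAS_PROJECT", "HAS_LAB", "OWNS"}


def visible_resources(edges: Iterable[Edge], user: str) -> List[str]:
    """Breadth-first walk from a user to every Resource reachable via OWNS."""
    adjacency: Dict[str, List[Tuple[str, str]]] = {}
    for src, rel, dst in edges:
        if rel in FOLLOWED:
            adjacency.setdefault(src, []).append((rel, dst))

    seen: Set[str] = {user}
    resources: Set[str] = set()
    queue = deque([user])
    while queue:
        node = queue.popleft()
        for rel, dst in adjacency.get(node, []):
            if rel == "OWNS":
                resources.add(dst)  # resources are leaves of the walk
            elif dst not in seen:
                seen.add(dst)
                queue.append(dst)
    return sorted(resources)


EDGES: List[Edge] = [
    ("org:acme", "HAS_WORKSPACE", "ws:research"),
    ("ws:research", "HAS_PROJECT", "proj:alpha"),
    ("user:ada", "MEMBER_OF", "ws:research"),
    ("proj:alpha", "OWNS", "res:dataset-1"),
    ("org:acme", "OWNS", "res:org-secret"),
    ("user:ada", "OWNS", "res:notebook"),
]
```

In this toy graph `user:ada` reaches `res:dataset-1` through workspace membership and `res:notebook` through direct ownership, but not `res:org-secret`, because nothing links her to `org:acme`.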

## Sync semantics

- **Source of truth**: Postgres. Every ownership write is a normal
  ORM commit.
- **Event bus**: SQLAlchemy
  [`after_flush_postexec`](../alphaswarm/graph/sqlalchemy_hooks.py) hooks
  translate each row insert/update/delete into an
  [`OwnershipEvent`](../alphaswarm/graph/events.py) on the
  `alphaswarm:ownership:events` Redis stream (or an in-process fallback
  queue when Redis is unreachable).
- **Drain**: the
  [`alphaswarm.tasks.ownership_tasks.drain_events`](../alphaswarm/tasks/ownership_tasks.py)
  Celery beat task runs every 5 s and applies up to
  `ALPHASWARM_OWNERSHIP_SYNC_BATCH_SIZE` (default 500) events through
  `OwnershipGraphStore.apply_events`.
- **Healer**: the periodic `full_resync` task (default 30 min)
  walks the canonical tables + re-emits everything so any missed
  delivery is repaired.
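The drain step can be pictured as a bounded batch pop followed by an idempotent apply. The sketch below uses an in-memory queue and dictionary in place of Redis and Neo4j. `Event`, `Projection`, and `drain` are hypothetical names, and only the default batch size of 500 is taken from the text above.

```python
from collections import deque
from dataclasses import dataclass
from typing import Deque, Dict, Iterable, Optional


@dataclass
class Event:
    op: str                      # "upsert" or "delete"
    node_id: str
    props: Optional[dict] = None


class Projection:
    """Stand-in for the graph projection: node id -> properties."""

    def __init__(self) -> None:
        self.nodes: Dict[str, dict] = {}

    def apply_events(self, events: Iterable[Event]) -> None:
        for event in events:
            if event.op == "delete":
                self.nodes.pop(event.node_id, None)  # deleting twice is harmless
            else:
                self.nodes[event.node_id] = dict(event.props or {})


def drain(queue: Deque[Event], projection: Projection, batch_size: int = 500) -> int:
    """Apply at most batch_size queued events; return how many were applied."""
    batch = []
    while queue and len(batch) < batch_size:
        batch.append(queue.popleft())
    projection.apply_events(batch)
    return len(batch)
```

Because both operations are idempotent in this sketch, re-emitting every row (as the healer does) converges to the same projection state regardless of what was already delivered.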

## Read paths

- **Python**:
  `from alphaswarm.graph import get_ownership_store; store = get_ownership_store(); store.traverse(...)`.
- **MCP tools**:
  - `data.ownership.tree` — outward walk from a node.
  - `data.ownership.list_resources` — every Resource a user can see.
  - `data.ownership.who_can_read` — reverse — every user that can
    read a specific Resource.
- **HTTP** (Phase 6 frontend): the ContextBar talks directly to
  the metadata cache (`/cache/organizations` etc.); deeper queries
  go through the MCP HTTP transport (`/mcp/data/tools//invoke`).

## Stores

| Backend                       | Use case |
| ----------------------------- | -------- |
| [`PostgresOwnershipGraphStore`](../alphaswarm/graph/postgres_store.py) | Local dev, unit tests, bootstrap/recovery |
| [`Neo4jOwnershipGraphStore`](../alphaswarm/graph/neo4j_store.py) | Production multi-hop queries |

Pick via `ALPHASWARM_OWNERSHIP_GRAPH_STORE` (default `postgres`).

## See also

- [`alphaswarm_docs/data-mcp.md`](../../concepts/data/data-mcp.md) — the MCP catalog the
  ownership tools plug into.
- [`alphaswarm_docs/identity.md`](../../concepts/identity/identity.md) — how the User node populates
  + how lazy provisioning seeds memberships.
- [`alphaswarm_docs/experiments-tests.md`](../../concepts/platform/experiments-tests.md) — the umbrella
  tables this graph sits on top of.


<!-- https://alpha-swarm.ai/concepts/platform/repository-split -->
# Repository Split
> This document defines the AlphaSwarm monorepo boundaries while the platform is being split into future repositories. The current goal is isolation by responsibility without breaking imports, deployment manif...

# Repository Split

Status: migration guidance.

This document defines the AlphaSwarm monorepo boundaries while the platform is
being split into future repositories. The current goal is isolation by
responsibility without breaking imports, deployment manifests, or operator
workflows.

## Principles

- Use a strangler migration: create stable contracts first, then move
  implementations behind compatibility shims.
- Keep shared abstractions in `alphaswarm_core`; do not import from
  higher-level packages there.
- Keep `alphaswarm_controller` standalone. It may depend on
  `alphaswarm_core`, but it must not import `alphaswarm.*`.
- Keep `rpi_kubernetes` as cluster bootstrap and platform services only.
  AlphaSwarm workload controllers and operator features live in this repository.
- Prefer generated or typed API contracts between projects over direct
  imports across future repository boundaries.

## Domain Map

| Domain | Current path | Owns | Does not own |
| --- | --- | --- | --- |
| Control plane | `alphaswarm_controller/` | `/manage/*`, workload lifecycle, provider adapters, session/control API | Quant runtimes, Celery business tasks, strategy logic |
| Platform core | `alphaswarm_core/` | Shared value types, ABCs, auth/resource filters, topology, stable wire models | FastAPI routes, ORM models, concrete cloud SDK workflows |
| Client | `alphaswarm_client/` | Operator UI, client docs, generated API contracts, local client behavior | Backend business logic, direct database writes |
| Snippets | `alphaswarm_snippets/` | Curated code knowledge, annotations, prompts, provenance indexes | Runtime imports or production package dependencies |
| Bots | `alphaswarm_bots/`, `alphaswarm_bots/templates/` | Bot runtime, templates, examples, sample specs | Direct bypass of `BotRuntime` or immutable versioning |
| RL | `alphaswarm_rl/` | RL subsystem: hash-locked `RLExperimentSpec` + `RLRuntime` + `RLComponent` metaclass + advantage estimators + policy backbones + weight-centric portfolio pipeline + Iceberg trajectory store + matching Celery task / API route / YAML spec library / tests | LLM gateway (`router_complete` stays in monolith); central registry (`alphaswarm.core.registry.register` stays in monolith) |
| Models | `alphaswarm_models/` | Custom model pulling, building, training, fine-tuning, evaluating, testing — qlib-style ML framework + Predictor Hub + AlphaBacktestExperiment + walk-forward + finetune trainers + every model implementation + custom model serving (vLLM + Ollama) + matching Celery tasks / API routes / YAML spec library / tests | LLM gateway (`router_complete` stays in monolith); central registry (`alphaswarm.core.registry.register` stays in monolith) |
| Monolith runtime | `alphaswarm/` | Agents, analysis, backtests, data plane, persistence, tasks, API gateway, LLM gateway (`router_complete`, memory, cache, prompts, tokens), the central registry, the four spec runtimes' shared orchestration. | RL subsystem (extracted to `alphaswarm_rl/`); ML / model serving (extracted to `alphaswarm_models/`); new workload control-plane providers |
| Deployment | `alphaswarm_platform/deployments/`, `alphaswarm_platform/terraform/`, `build/` | Compose, Kubernetes, Terraform, image build contracts | Cluster bootstrap owned by `rpi_kubernetes` |

## Allowed Dependencies

```mermaid
flowchart LR
  aqpRuntime["alphaswarm runtime"] --> aqpPlatformCore["alphaswarm_core"]
  aqpControlPlane["alphaswarm_controller"] --> aqpPlatformCore
  aqpClient["alphaswarm_client"] --> aqpRuntime
  aqpClient --> aqpControlPlane
  aqpBots["alphaswarm_bots templates"] --> aqpRuntime
  aqpRl["alphaswarm_rl"] --> aqpRuntime
  aqpModels["alphaswarm_models"] --> aqpRuntime
  aqpBots --> aqpRl
  aqpBots --> aqpModels
  aqpRl -.shims.-> aqpRuntime
  aqpModels -.shims.-> aqpRuntime
  aqpSnippets["alphaswarm_snippets"] -.reference.-> aqpRuntime
```

Hard dependency rules:

1. `alphaswarm_core` must not import `alphaswarm`, `alphaswarm_controller`, FastAPI, SQLAlchemy,
   Celery, or heavy optional SDKs.
2. `alphaswarm_controller` must not import `alphaswarm.*`; use
   `alphaswarm_core` contracts or HTTP APIs.
3. `alphaswarm_client` must call backend APIs through generated clients or local API
   wrappers. It must not duplicate authorization, tenancy, or kill-switch
   semantics.
4. `alphaswarm_snippets` is read-only knowledge for runtime code. Production modules
   must not import from it.
5. `alphaswarm_bots` stores templates and guidance only, until the bot
   runtime interfaces are extracted.
6. `alphaswarm_rl` and `alphaswarm_models` may depend on `alphaswarm.*` for the shared runtime
   primitives that have not yet been extracted (`iceberg_catalog.append_arrow`,
   `router_complete`, `LedgerWriter`, `RequestContext`, ORM models,
   `_progress.emit`, `MetadataCache`, `RiskLimits`,
   `TargetWeightsRebalancer`, `alphaswarm.core.registry.register`). The reverse
   direction (`alphaswarm.rl.*` → `alphaswarm_rl.*`, `alphaswarm.ml.*` → `alphaswarm_models.*`,
   `alphaswarm.llm.{vllm_runner,ollama_client}` → `alphaswarm_models.serving.*`) goes
   through deprecation-warning compatibility shims under `alphaswarm/rl/`,
   `alphaswarm/ml/`, and `alphaswarm/llm/{vllm_runner,ollama_client}.py`. New code
   imports from `alphaswarm_rl.*` / `alphaswarm_models.*` / `alphaswarm_models.serving.*`
   directly.
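Rules 1 and 2 lend themselves to a static check. The following is a hedged sketch of such a check using the standard-library `ast` module; `FORBIDDEN` and `forbidden_imports` are hypothetical names, the table covers only the explicitly named packages from rules 1 and 2, and the repository's actual enforcement (if any) may look different.

```python
import ast
from typing import Dict, List, Tuple

# package -> import roots it must not use (subset of hard rules 1 and 2).
FORBIDDEN: Dict[str, Tuple[str, ...]] = {
    "alphaswarm_core": ("alphaswarm", "alphaswarm_controller", "fastapi", "sqlalchemy", "celery"),
    "alphaswarm_controller": ("alphaswarm",),
}


def _matches(module: str, root: str) -> bool:
    # "alphaswarm" must match "alphaswarm.x" but not "alphaswarm_core".
    return module == root or module.startswith(root + ".")


def forbidden_imports(package: str, source: str) -> List[str]:
    """Return the absolute imports in source that the package may not use."""
    banned = FORBIDDEN.get(package, ())
    found: List[str] = []
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Import):
            names = [alias.name for alias in node.names]
        elif isinstance(node, ast.ImportFrom) and node.level == 0 and node.module:
            names = [node.module]
        else:
            continue  # not an import, or a relative import
        found.extend(name for name in names if any(_matches(name, root) for root in banned))
    return found
```

Matching on the dotted prefix rather than a plain string prefix matters here, since `alphaswarm_core` and `alphaswarm` share a leading substring.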

## Migration Order

1. Stabilize `alphaswarm_core` package contracts and tests.
2. Finish `alphaswarm_controller` as the only home for workload lifecycle
   providers and `/manage/*` behavior.
3. Move curated references into `alphaswarm_snippets` with provenance and indexes.
4. Extract `alphaswarm_client` contracts around the existing Vite frontend and API
   gateway behavior before moving source paths.
5. Split `alphaswarm_bots` last, after bot persistence, task dispatch, backtest,
   paper trading, and agent runtime interfaces are explicit.
6. Extract `alphaswarm_rl` (May 2026) — RL subsystem moved out of `alphaswarm/rl/`, with
   matching Celery task / API route / YAML spec library / tests. Legacy
   `alphaswarm.rl.*` imports preserved through `alphaswarm/rl/__init__.py` deprecation
   shim.
7. Extract `alphaswarm_models` (May 2026) — custom-model boundary moved out of
   `alphaswarm/ml/` plus the model-pulling / serving slice of `alphaswarm/llm/`
   (`vllm_runner.py`, `ollama_client.py`). The central LLM gateway
   (`router_complete`, memory, cache, prompts, tokens) **stays in the
   monolith** at `alphaswarm/llm/`. Legacy `alphaswarm.ml.*` and
   `alphaswarm.llm.{vllm_runner,ollama_client}` imports preserved through
   compatibility shims.
8. Clean root-level build/deploy files only after the projects can be tested
   independently.

## Future Repo Split Gate

A domain is ready to become its own repository when it has:

- `README.md`, `AGENTS.md`, and a validation command list.
- Independent packaging or build metadata.
- No forbidden imports across future repo boundaries.
- Versioned API or model contracts for consumers.
- CI checks that run without relying on the full monolith checkout, except
  for documented integration tests.


<!-- https://alpha-swarm.ai/concepts/platform/scopes -->
# AlphaSwarm Scope Catalogue
> Every scope follows `<resource>:<action>` (kebab-case nouns and verbs, colon separator). The four ADR 003 infrastructure scopes (`read:infrastructure`, `manage:agents`, `manage:infrastructure`, `admin...

# AlphaSwarm Scope Catalogue

Single source of truth for every authorization scope used by the AlphaSwarm
control plane. The canonical Python module is
[alphaswarm/auth/scopes.py](../alphaswarm/auth/scopes.py) (`AQPScope`); the canonical
Terraform Auth0 provisioning lives in
[alphaswarm_platform/terraform/modules/auth0_identity/main.tf](../alphaswarm_platform/terraform/modules/auth0_identity/main.tf)
(`local.scopes` + `local.role_permissions`); the canonical role lattice
is in
[alphaswarm_core/src/alphaswarm_core/auth/rbac.py](../alphaswarm_core/src/alphaswarm_core/auth/rbac.py)
(`_ROLE_LATTICE`). All three MUST stay in sync — the regression test at
`tests/auth/test_scopes.py` enforces it.

## Scope-string convention

Every scope follows `<resource>:<action>` (kebab-case nouns and verbs,
colon separator). The four ADR 003 infrastructure scopes
(`read:infrastructure`, `manage:agents`, `manage:infrastructure`,
`admin:cluster`) intentionally use a verb-first form for backward
compatibility with the original Phase 4 rollout; the AlphaSwarm-specific
extensions added in Phase 1 of the control-plane maturation use the
canonical resource-first form.

The `platform:admin` scope is the implicit super-scope — any holder of
`platform:admin` satisfies any other scope check. It is granted only to
the `alphaswarm-superadmin` role and used very rarely.
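A scope check that honours the super-scope can be sketched in a few lines. This is illustrative only: `is_well_formed` and `has_scope` are hypothetical helpers, not the functions in `alphaswarm/auth/scopes.py`, and the regular expression checks only the two-part kebab-case shape (it cannot tell resource-first from verb-first).

```python
import re
from typing import Iterable

SUPER_SCOPE = "platform:admin"

# Two kebab-case tokens separated by a single colon.
_SCOPE_RE = re.compile(r"[a-z][a-z0-9-]*:[a-z][a-z0-9-]*")


def is_well_formed(scope: str) -> bool:
    """True when the scope string has the two-part kebab-case shape."""
    return _SCOPE_RE.fullmatch(scope) is not None


def has_scope(granted: Iterable[str], required: str) -> bool:
    """True when required is granted directly or via the super-scope."""
    granted_set = set(granted)
    return SUPER_SCOPE in granted_set or required in granted_set
```

For example, `has_scope(["platform:admin"], "trade:live")` is true even though `trade:live` is not listed explicitly.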

## Scope catalogue

### Data plane

| Scope | Description |
| --- | --- |
| `data:read` | Read AlphaSwarm data and metadata (datasets, catalogs, lineage) |
| `data:write` | Mutate AlphaSwarm data through sanctioned APIs |
| `admin:iceberg` | Drop, consolidate, or redefine Iceberg tables |

### Infrastructure (ADR 003 four-scope grid)

| Scope | Description |
| --- | --- |
| `read:infrastructure` | View deployment status, pods, logs, non-secret config |
| `manage:agents` | Start / stop / restart / scale assigned AlphaSwarm agents and bot workloads |
| `manage:infrastructure` | Deploy and update AlphaSwarm services and non-secret ConfigMaps within an assigned org |
| `admin:cluster` | Full cluster control + resource-scope bypass for AlphaSwarm super-admins |

### Agents

| Scope | Description |
| --- | --- |
| `agent:view` | Inspect agent specs, runs, and telemetry |
| `agent:execute` | Invoke or schedule a registered AlphaSwarm agent |
| `agent:terminate` | Halt a running agent or revoke a long-lived spec |

### Trading / portfolio

| Scope | Description |
| --- | --- |
| `trade:read` | Inspect paper / live trading sessions, orders, fills, PnL |
| `trade:execute` | Submit paper-broker or sandbox-broker orders |
| `trade:live` | Submit real-money orders to a connected live broker |

### Backtesting

| Scope | Description |
| --- | --- |
| `backtest:read` | Inspect backtest runs and historical metrics |
| `backtest:create` | Submit a new backtest job to the engine fleet |

### ML / RL / RAG

| Scope | Description |
| --- | --- |
| `rag:query` | Query the hierarchical RAG corpus |
| `ml:workbench` | Run ML workbench flows (training, evaluation, registry) |
| `rl:train` | Submit `RLExperimentSpec` runs through `RLRuntime` |

### Deployment lifecycle

| Scope | Description |
| --- | --- |
| `deploy:run` | Run Terraform / Kubernetes deployments |
| `deploy:halt` | Halt AlphaSwarm deployments and long-running runtimes |

### Terraform IaC (rule 42)

| Scope | Description |
| --- | --- |
| `terraform:plan` | Generate a Terraform plan for an AlphaSwarm stack |
| `terraform:apply` | Apply a Terraform plan against an AlphaSwarm stack |
| `terraform:destroy` | Destroy an AlphaSwarm Terraform stack (super-admin only) |
| `terraform:cancel` | Cancel a running Terraform run |

### WorkloadRuntime (rule 45)

| Scope | Description |
| --- | --- |
| `workloads:halt` | Halt every running workload via the WorkloadRuntime kill-switch fan-out |

### Tenancy

| Scope | Description |
| --- | --- |
| `tenancy:invite` | Issue tenancy invites for org / team / workspace / project membership |
| `tenancy:admin` | Mutate tenancy state (orgs, teams, memberships) |
| `scim:write` | Provision AlphaSwarm users and groups through SCIM |

### Platform

| Scope | Description |
| --- | --- |
| `platform:admin` | Implicit super-scope: satisfies any other scope check |

## Role lattice

Each role is a strict superset of the previous one (cumulative
composition). The lattice is enforced by the regression test at
`tests/auth/test_scopes.py::test_role_lattice_is_cumulative`.

### `alphaswarm-viewer`

Read-only AlphaSwarm operator for assigned resources.

- `read:infrastructure`
- `data:read`
- `agent:view`
- `trade:read`
- `backtest:read`
- `rag:query`

### `alphaswarm-operator`

Viewer + manage assigned agents and bot workloads.

Adds:

- `manage:agents`
- `agent:execute`
- `agent:terminate`
- `backtest:create`
- `ml:workbench`
- `rl:train`
- `trade:execute`
- `deploy:run`
- `deploy:halt`
- `workloads:halt`

### `alphaswarm-admin`

Operator + administrator for assigned organization infrastructure.

Adds:

- `manage:infrastructure`
- `data:write`
- `admin:iceberg`
- `terraform:plan`
- `terraform:apply`
- `terraform:cancel`
- `tenancy:invite`

### `alphaswarm-superadmin`

Admin + cluster super-admin (the only role that bypasses
`alphaswarm_core.auth.resource_filter.filter_resources` via the
`admin:cluster` scope).

Adds:

- `admin:cluster`
- `terraform:destroy`
- `tenancy:admin`
- `scim:write`
- `trade:live`
- `platform:admin`
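The cumulative composition can be reproduced from the "Adds" lists above. The sketch below is a standalone illustration, not the contents of `_ROLE_LATTICE`: `ROLE_ADDITIONS`, `build_lattice`, and `is_cumulative` are hypothetical names, while the scope strings are copied from this page.

```python
from typing import Dict, FrozenSet, List, Tuple

# (role, scopes the role adds on top of the previous one), in lattice order.
ROLE_ADDITIONS: List[Tuple[str, FrozenSet[str]]] = [
    ("alphaswarm-viewer", frozenset({
        "read:infrastructure", "data:read", "agent:view",
        "trade:read", "backtest:read", "rag:query",
    })),
    ("alphaswarm-operator", frozenset({
        "manage:agents", "agent:execute", "agent:terminate", "backtest:create",
        "ml:workbench", "rl:train", "trade:execute", "deploy:run",
        "deploy:halt", "workloads:halt",
    })),
    ("alphaswarm-admin", frozenset({
        "manage:infrastructure", "data:write", "admin:iceberg", "terraform:plan",
        "terraform:apply", "terraform:cancel", "tenancy:invite",
    })),
    ("alphaswarm-superadmin", frozenset({
        "admin:cluster", "terraform:destroy", "tenancy:admin",
        "scim:write", "trade:live", "platform:admin",
    })),
]


def build_lattice(additions: List[Tuple[str, FrozenSet[str]]]) -> Dict[str, FrozenSet[str]]:
    """Expand per-role additions into the full scope set held by each role."""
    lattice: Dict[str, FrozenSet[str]] = {}
    held: FrozenSet[str] = frozenset()
    for role, added in additions:
        held = held | added
        lattice[role] = held
    return lattice


def is_cumulative(lattice: Dict[str, FrozenSet[str]], order: List[str]) -> bool:
    """True when every role is a strict superset of the role before it."""
    return all(lattice[lower] < lattice[higher] for lower, higher in zip(order, order[1:]))
```

Expanding the lists this way yields 6, 16, 23, and 29 scopes for the four roles, and the superadmin set matches the 29 scopes in the catalogue.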

## Legacy tenancy roles

The tenancy database in `alphaswarm.persistence.models_tenancy` uses a
separate role lattice (`viewer / editor / admin / owner`) for
membership in orgs, teams, workspaces, projects, and labs. The
canonical platform roles above (`alphaswarm-*`) are issued by Auth0 and
expanded into scopes via the post-login Action sync. The translator
between the two lives at
[alphaswarm/auth/scopes.py::legacy_role_to_aqp_role](../alphaswarm/auth/scopes.py):

| Tenancy role | Canonical role |
| --- | --- |
| `viewer` | `alphaswarm-viewer` |
| `editor` | `alphaswarm-operator` |
| `admin` | `alphaswarm-admin` |
| `owner` | `alphaswarm-superadmin` |

The Auth0 sync endpoint (`/_internal/auth0/sync`) emits BOTH flavours
into the JWT's `roles` claim so legacy clients keep working AND scope
expansion produces a non-empty set. This closes the empty-claim drift
bug in which a user whose only `Membership.role` was `editor` ended up
with no scopes in the token.
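The "both flavours" behaviour can be sketched as follows. This is an assumption-laden illustration rather than the real `legacy_role_to_aqp_role` or sync endpoint: `LEGACY_TO_CANONICAL` mirrors the table above, and `roles_claim` is a hypothetical helper.

```python
from typing import Dict, Iterable, List

# Tenancy role -> canonical platform role, copied from the table above.
LEGACY_TO_CANONICAL: Dict[str, str] = {
    "viewer": "alphaswarm-viewer",
    "editor": "alphaswarm-operator",
    "admin": "alphaswarm-admin",
    "owner": "alphaswarm-superadmin",
}


def roles_claim(membership_roles: Iterable[str]) -> List[str]:
    """Emit each legacy role plus its canonical counterpart, without duplicates."""
    claim: List[str] = []
    for role in membership_roles:
        for name in (role, LEGACY_TO_CANONICAL.get(role)):
            if name and name not in claim:
                claim.append(name)
    return claim
```

With this shape, a user whose only membership role is `editor` gets `["editor", "alphaswarm-operator"]`, so scope expansion has a canonical role to work from.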

## Adding a new scope

1. Add the constant to `alphaswarm/auth/scopes.py::AQPScope` and to
   `ALL_AQP_SCOPES`.
2. If the scope should be granted by a role, add it to the matching
   role frozenset in
   `alphaswarm_core/src/alphaswarm_core/auth/rbac.py::_ROLE_LATTICE`
   (cumulative: viewer ⊂ operator ⊂ admin ⊂ superadmin).
3. Add the scope to `alphaswarm_platform/terraform/modules/auth0_identity/main.tf`'s
   `local.scopes` AND to every role in `local.role_permissions` that
   should hold it.
4. Add a row to this catalogue (`alphaswarm_docs/scopes.md`).
5. Re-run the regression test:
   `docker exec alphaswarm-api python -m pytest tests/auth/test_scopes.py`.

The test asserts that the Python lattice and the Terraform lattice
contain the same scope set per role, so any drift produces a hard
failure rather than a silent grant.


<!-- https://alpha-swarm.ai/concepts/platform/temporal-identifiers -->
# Temporal identifier resolution
> Financial identifiers are not stable across time. A non-exhaustive list of why:

# Temporal identifier resolution

> Status: **Phase 1 shipped** (Alembic 0039 + 0040). The
> ``identifier_links`` table is now the authoritative source for
> identifier resolution; the legacy ``Instrument.identifiers`` JSON
> blob is kept for backward compatibility but is no longer
> authoritative.

## Why temporal resolution

Financial identifiers are not stable across time. A non-exhaustive
list of why:

| Event | Impact |
| --- | --- |
| Ticker change (M&A, rebranding) | ``FB`` -> ``META`` 2022-06-09 |
| Symbol change (re-listing) | ``ABEV3`` ↔ ``AMBV4`` on B3 |
| CUSIP / ISIN re-issue (corporate action) | Stock split issuance may mint a new CUSIP |
| Index reconstitution | Russell add/drops change tracker constituents |
| ADR sponsorship upgrade | Conversion ratio may change |

A backtest that resolves ``META`` to the modern ID and walks bar data
from 2018 will silently introduce **survivorship and forward-looking
bias** because the row didn't yet exist under that ticker. The
resolver service fixes this by walking time-versioned
``identifier_links`` rows scoped by ``valid_from <= as_of`` and
``(valid_to IS NULL OR valid_to > as_of)``.

## Table shape

The ``identifier_links`` table predates Phase 1; Phase 1 promotes it
to authoritative status. Its schema:

```text
identifier_links
+--------------------+-----------------------------------------------+
| id                 | UUID                                          |
| entity_kind        | instrument | fred_series | sec_filing | ...   |
| entity_id          | parent entity id                              |
| instrument_id      | denormalized FK to instruments.id (NULL OK)   |
| scheme             | ticker | vt_symbol | cik | cusip | isin |     |
|                    | figi | sedol | lei | gvkey | permid | ...     |
| value              | identifier value                              |
| valid_from         | datetime | NULL ("from the beginning")        |
| valid_to           | datetime | NULL ("still valid")               |
| source_id          | FK to data_sources.id                         |
| confidence         | 0.0 - 1.0, defaults 1.0                       |
| meta               | JSON                                          |
| created_at         | datetime                                      |
+--------------------+-----------------------------------------------+
```

## Resolver API

The two public entry points are
``alphaswarm.data.identity.IdentifierResolver`` and the matching
DataMCP tools.

### Python: forward resolution

```python
from datetime import datetime
from alphaswarm.data.identity import resolve

# "Which entity did CUSIP 037833100 (Apple) identify on 2018-06-12?"
hit = resolve(
    scheme="cusip",
    value="037833100",
    as_of=datetime(2018, 6, 12),
)
print(hit.value, hit.is_open_ended)
```

### Python: history walk

```python
from alphaswarm.data.identity import history

# Every alias known for Apple
for row in history(entity_kind="instrument", entity_id="aapl-uuid"):
    print(row.scheme, row.value, row.valid_from, row.valid_to)
```

### Agent / MCP

```text
data.identity.resolve(scheme="cusip", value="037833100", as_of="2018-06-12")
data.identity.history(entity_kind="instrument", entity_id="aapl-uuid")
```

The DataMCP layer is the only path agents may use to resolve
identifiers (AGENTS rule 22). The Python module is reserved for
loaders / pipelines / persistence code; agent code never imports the
ORM model directly.

## Backfill from legacy JSON blob (migration 0040)

The legacy ``Instrument.identifiers`` JSON column is a flat
``{scheme: value}`` map. Migration 0040 walks every row, normalises
the scheme name (lower-cased, aliases collapsed), and inserts a row
into ``identifier_links`` with ``valid_from=valid_to=NULL`` ("valid
for all time the row represents") and ``confidence=0.7`` (so a
canonical loader row at ``confidence=1.0`` always wins the resolver
tiebreaker).
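The backfill transformation can be pictured as a pure function from one legacy blob to a list of link rows. This is a sketch, not the migration itself: ``backfill_rows`` is a hypothetical name and the ``SCHEME_ALIASES`` entry is an invented example of alias collapsing. Only the ``NULL`` validity window and ``confidence=0.7`` come from the text above.

```python
from typing import Dict, List

# Hypothetical alias table; the real migration's aliases are not shown here.
SCHEME_ALIASES: Dict[str, str] = {"symbol": "ticker"}


def backfill_rows(entity_id: str, identifiers: Dict[str, str]) -> List[dict]:
    """Turn a flat {scheme: value} blob into identifier_links-style rows."""
    rows: List[dict] = []
    for scheme, value in identifiers.items():
        normalised = scheme.strip().lower()
        normalised = SCHEME_ALIASES.get(normalised, normalised)
        rows.append({
            "entity_kind": "instrument",
            "entity_id": entity_id,
            "scheme": normalised,
            "value": value,
            "valid_from": None,   # NULL: valid for all time the row represents
            "valid_to": None,
            "confidence": 0.7,    # below the 1.0 used by canonical loaders
        })
    return rows
```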

The legacy JSON column is **kept**: readers that haven't migrated to
the resolver continue to work. New readers MUST go through the
resolver so they see corrected validity windows.

## Validity-window semantics

| ``valid_from`` | ``valid_to`` | Meaning |
| --- | --- | --- |
| ``NULL`` | ``NULL`` | Valid for all time the row represents |
| ``2018-01-01`` | ``NULL`` | Valid from 2018-01-01, still current |
| ``NULL`` | ``2022-06-09`` | Valid up to (but not including) 2022-06-09 |
| ``2010-05-01`` | ``2015-12-31`` | Valid in the half-open interval [2010-05-01, 2015-12-31) |

The ``valid_to`` is **exclusive** -- a row with ``valid_to=2022-06-09``
is NOT valid on 2022-06-09. The lookup predicate is therefore
``valid_to > as_of``, not ``valid_to >= as_of``.
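The predicate translates directly into Python. The helper name ``is_valid_at`` is hypothetical; the comparison operators follow the lookup predicate stated above, with ``None`` standing in for ``NULL``.

```python
from datetime import datetime
from typing import Optional


def is_valid_at(
    valid_from: Optional[datetime],
    valid_to: Optional[datetime],
    as_of: datetime,
) -> bool:
    """Half-open window check: valid_from is inclusive, valid_to is exclusive."""
    starts_ok = valid_from is None or valid_from <= as_of
    ends_ok = valid_to is None or valid_to > as_of
    return starts_ok and ends_ok
```

A row with ``valid_to=2022-06-09`` therefore matches ``as_of=2022-06-08`` but not ``as_of=2022-06-09``.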

## Confidence ordering

When multiple rows satisfy the validity predicate, the highest
``confidence`` wins. Default loader rows ship with
``confidence=1.0``; the legacy-blob backfill from migration 0040
uses ``confidence=0.7`` so it's overridden the moment a canonical
loader populates the same alias.

Heuristic / fuzzy-match loaders should use ``confidence`` in the
0.3-0.6 range so they only win when no canonical row exists.
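Combining the validity predicate with the confidence tiebreaker gives a compact resolver sketch. ``Link`` and ``resolve_best`` are hypothetical names, the entity ids in the usage note are invented, and the real ``IdentifierResolver`` may apply further ordering rules that are not described on this page.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Iterable, Optional


@dataclass(frozen=True)
class Link:
    entity_id: str
    scheme: str
    value: str
    valid_from: Optional[datetime] = None
    valid_to: Optional[datetime] = None
    confidence: float = 1.0


def resolve_best(
    links: Iterable[Link], scheme: str, value: str, as_of: datetime
) -> Optional[Link]:
    """Return the highest-confidence link valid at as_of, or None."""
    candidates = [
        link
        for link in links
        if link.scheme == scheme
        and link.value == value
        and (link.valid_from is None or link.valid_from <= as_of)
        and (link.valid_to is None or link.valid_to > as_of)  # exclusive end
    ]
    return max(candidates, key=lambda link: link.confidence, default=None)
```

Given a loader row for ticker ``FB`` with ``valid_to=2022-06-09`` and ``confidence=1.0`` plus an open-ended backfill row at ``confidence=0.7``, the loader row wins for any ``as_of`` before 2022-06-09, and the backfill row is the only candidate from that date onward.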


<!-- https://alpha-swarm.ai/concepts/rl/agentic-rl -->
# Hybrid agentic-RL + backtest
> The Phase 1-9 rollout closes the "backtest-to-paper-trading gap" by making the **target portfolio weight vector** the single immutable interface between an RL policy and any execution mechanism (offli...

# Hybrid agentic-RL + backtest

> AlphaSwarm's port of the FinRL-X "deployment-consistent" blueprint plus the
> NVIDIA-NeMo/RL advantage primitives — wired into AlphaSwarm's existing
> spec-driven runtimes (rule 16).

## What changed

The Phase 1-9 rollout closes the "backtest-to-paper-trading gap" by
making the **target portfolio weight vector** the single immutable
interface between an RL policy and any execution mechanism
(offline backtest engine OR live broker). The same `w_t` flows
through:

- the offline simulation (via the new
  [`RLBacktestEnv`](../alphaswarm/rl/envs/rl_backtest_env.py))
- paper and live execution
  (via [`WeightToOrders`](../alphaswarm/rl/execution/weight_to_orders.py))
- the AST-sandboxed alpha factor authoring loop
  (via [`AlphaResearcher`](../alphaswarm/agents/quant/alpha_researcher.py))

```mermaid
flowchart TB
    subgraph agentic [Agentic Layer]
        AlphaResearcher["AlphaResearcher\n(AgentRuntime + RAG alpha_base)"]
        StrategyExecutor["StrategyExecutor\n(wraps RLRuntime)"]
        ASTSandbox["AST Sandbox\n(alphaswarm/data/expressions_dsl.py)"]
        AlphaResearcher -->|symbolic formula| ASTSandbox
        ASTSandbox -->|FactorNode| Backtest[Engine-agnostic indicator]
    end

    subgraph rl [RL Stack]
        Spec["RLExperimentSpec\n(+ advantage + stop_properly_penalty_coef)"]
        Runtime["RLRuntime\n(rule 16)"]
        Backbones["Policy Backbones\nTransformer / RNN / AE / PatchTST"]
        Advantage["ReinforcePlusPlus / GRPO / GAE"]
        StopShape["StopProperlyWrapper\n(coef in 0..1)"]
        Spec --> Runtime
        Runtime --> Backbones
        Runtime --> Advantage
        Runtime --> StopShape
    end

    subgraph bridge [RL <-> Backtest Bridge]
        RLEnv["RLBacktestEnv"]
        WCP["WeightCentricPipeline\nf_S -> f_A -> f_T -> f_R"]
        EngineCB["context['rl_agent']"]
        Runtime --> RLEnv
        RLEnv --> WCP
        WCP --> EngineCB
    end

    subgraph engines [Engines]
        EventDriven["EventDrivenBacktester"]
        VbtPro["VectorbtProEngine:orders"]
        Lob["LobBacktestEngine"]
        BT["BacktraderEngine (optional)"]
        EngineCB --> EventDriven
        EngineCB --> VbtPro
        EngineCB --> Lob
        EngineCB --> BT
    end

    subgraph broker [Live + Paper]
        DomainBroker["IDomainBrokerage"]
        KillSwitch["KillSwitch"]
        WCP --> DomainBroker
        KillSwitch -.->|halt| DomainBroker
    end
```
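To illustrate why a weight vector works as the shared interface, the sketch below turns target weights into share deltas and refuses to emit anything while a kill switch is engaged. It is not the `WeightToOrders` implementation: the function name, its parameters, and the whole-share rounding are assumptions made for the example.

```python
from typing import Dict


def weights_to_share_deltas(
    target_weights: Dict[str, float],
    prices: Dict[str, float],
    positions: Dict[str, int],
    equity: float,
    kill_switch_engaged: bool = False,
) -> Dict[str, int]:
    """Translate target weights into signed share deltas per symbol."""
    if kill_switch_engaged:
        return {}  # a halted system emits no orders at all

    deltas: Dict[str, int] = {}
    # Symbols held but absent from the target vector get weight 0 (closed out).
    for symbol in sorted(set(target_weights) | set(positions)):
        weight = target_weights.get(symbol, 0.0)
        target_shares = int(weight * equity / prices[symbol])  # whole shares only
        delta = target_shares - positions.get(symbol, 0)
        if delta != 0:
            deltas[symbol] = delta
    return deltas
```

The same function can be fed by a backtest engine or a live broker adapter, since it depends only on the weight vector, current prices, and current holdings.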

## Quick reference

| Concept | One-liner | File |
| --- | --- | --- |
| `WeightCentricPipeline` | FinRL-X `f_S -> f_A -> f_T -> f_R` composable pipeline | [alphaswarm/rl/portfolio/pipeline.py](../alphaswarm/rl/portfolio/pipeline.py) |
| `RLBacktestEnv` | `BaseRLEnv + gym.Env` wrapping any registered `BaseBacktestEngine` | [alphaswarm/rl/envs/rl_backtest_env.py](../alphaswarm/rl/envs/rl_backtest_env.py) |
| `RLAgentBridge` | Channel exposed via `context['rl_agent']` on every engine flipping `supports_rl_injection=True` | [alphaswarm/rl/bridges/agent_bridge.py](../alphaswarm/rl/bridges/agent_bridge.py) |
| `ReinforcePlusPlusAdvantage` | Leave-one-out cohort baseline + decoupled global normalisation (NeMo-RL port) | [alphaswarm/rl/advantage/reinforce_plus_plus.py](../alphaswarm/rl/advantage/reinforce_plus_plus.py) |
| `GRPOAdvantage` | Group-relative no-critic advantage (DeepSeek R1 / NeMo-RL parity) | [alphaswarm/rl/advantage/grpo.py](../alphaswarm/rl/advantage/grpo.py) |
| `StopProperlyWrapper` | Scales reward of truncated episodes by `coef in [0, 1]` (NeMo-RL `stop_properly_penalty_coef`) | [alphaswarm/rl/rewards/stop_properly.py](../alphaswarm/rl/rewards/stop_properly.py) |
| Truncating terminations | `DrawdownTermination` / `MarginCallTermination` / `RiskBreachTermination` carry `truncates_episode=True` | [alphaswarm/rl/terminations/](../alphaswarm/rl/terminations/) |
| `WeightToOrders` | Kill-switch-gated translator from target weights to `DomainOrder` | [alphaswarm/rl/execution/weight_to_orders.py](../alphaswarm/rl/execution/weight_to_orders.py) |
| `RedisFeatureStore` | Flink → Redis `IFeatureStore` for live RL observation | [alphaswarm/streaming/feature_store/redis_store.py](../alphaswarm/streaming/feature_store/redis_store.py) |
| `AlphaVantageIngester` | REST-poll Alpha Vantage and publish to Kafka | [alphaswarm/streaming/ingesters/alphavantage.py](../alphaswarm/streaming/ingesters/alphavantage.py) |
| `DeterministicMedallionReplay` | Read-only RL data pipeline pinned to silver/gold Iceberg snapshots | [alphaswarm/rl/data_pipelines/medallion_replay.py](../alphaswarm/rl/data_pipelines/medallion_replay.py) |
| `data.alphas.*` / `data.backtests.*` / `data.rl.*` / `data.brokers.*` | New DataMCPTools (rule 22) | [alphaswarm/data/mcp/tools/](../alphaswarm/data/mcp/tools/) |
| `alpha_factors` / `backtest_summaries` / `rl_trajectory_summaries` corpora | RAG "alpha base" (rule 11) | [alphaswarm/rag/orders.py](../alphaswarm/rag/orders.py) |
| `RLTradingBot` | Bot subtype driven by `RLRuntime` (rule 14) | [alphaswarm/bots/rl_trading_bot.py](../alphaswarm/bots/rl_trading_bot.py) |
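
As a rough illustration of the `StopProperlyWrapper` row above, the sketch below scales the reward of an episode that ended by truncation. The helper name and signature are hypothetical and are not the AlphaSwarm or NeMo-RL API; only the "scale the truncated reward by `coef` in [0, 1]" behaviour is taken from the table.

```python
def scale_truncated_reward(episode_reward: float, truncated: bool, coef: float) -> float:
    """Hypothetical helper: scale the reward of a truncated episode by coef."""
    if not 0.0 <= coef <= 1.0:
        raise ValueError("coef must lie in [0, 1]")
    # Episodes that ended normally keep their full reward.
    return episode_reward * coef if truncated else episode_reward
```

With `coef=1.0` truncation carries no penalty, and with `coef=0.0` a truncated episode contributes no reward at all.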

## Spec extension

```yaml
training:
  total_timesteps: 200000
  log_interval: 10
  advantage:
    class: ReinforcePlusPlusAdvantage
    module_path: alphaswarm.rl.advantage.reinforce_plus_plus
    kwargs:
      minus_baseline: true
      global_normalization: true
      leave_one_out: true
  stop_properly_penalty_coef: 0.2
```
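
To make the three `kwargs` above concrete, here is a minimal sketch of a leave-one-out cohort baseline followed by normalisation over the whole batch. It is one plausible reading of the technique, written with the standard library only; the function name, the singleton-cohort fallback, and the `eps` term are assumptions, not the `ReinforcePlusPlusAdvantage` or NeMo-RL implementation.

```python
from statistics import fmean, pstdev

def loo_advantages(groups, global_normalization=True, eps=1e-8):
    """Leave-one-out baseline per cohort, then optional batch-wide normalisation."""
    advantages = []
    for rewards in groups:
        total, n = sum(rewards), len(rewards)
        for r in rewards:
            # The baseline excludes the sample itself; a singleton cohort has no peers,
            # so this sketch falls back to a zero baseline (an assumption).
            baseline = (total - r) / (n - 1) if n > 1 else 0.0
            advantages.append(r - baseline)
    if global_normalization and advantages:
        # Statistics are taken over every sample in the batch, not per cohort.
        mean, std = fmean(advantages), pstdev(advantages)
        advantages = [(a - mean) / (std + eps) for a in advantages]
    return advantages

raw = loo_advantages([[1.0, 2.0, 3.0], [4.0, 4.0]], global_normalization=False)
```

Each inner list is one cohort of rollouts that share a starting condition. The baseline is computed inside the cohort, while the mean and standard deviation used for scaling come from the full batch.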

## Companion docs

- [alphaswarm_docs/weight-centric-pipeline.md](../../concepts/rl/weight-centric-pipeline.md) —
  Deep dive on `f_S/f_A/f_T/f_R` semantics.
- [alphaswarm_docs/rl-policy-backbones.md](../../concepts/rl/rl-policy-backbones.md) —
  Transformer / RNN / Autoencoder / PatchTST backbones.
- [alphaswarm_docs/alpha-researcher-agent.md](../../concepts/agentic/alpha-researcher-agent.md) —
  Symbolic alpha DSL + AlphaResearcher driver.

## Source-of-truth citations

- NeMo-RL `stop_properly_penalty_coef` scaling (commit
  `20d46a7d1bd987df1c89b3c5a81dc945c3d201e4`,
  `nemo_rl/algorithms/reward_functions.py`).
- NeMo-RL leave-one-out group baseline + decoupled global
  normalisation (`nemo_rl/algorithms/utils.py`
  `calculate_baseline_and_std_per_prompt` +
  `masked_mean(..., global_normalization_factor=...)`).
- Backtrader `cheat_on_open` / `next_open` / `order_target_percent`
  semantics (`backtrader/strategy.py`).


<!-- https://alpha-swarm.ai/concepts/rl/rl-components -->
# RL component reference

> This page is a hand-written shortcut. The authoritative source is the
> live registry exposed by `GET /rl/components/{kind}` (and rendered in
> the UI at [`/rl/library`](../webui/app/(shell)/rl/library/page.tsx)).

## Kinds

| `rl_kind` | Purpose | Base class |
| --- | --- | --- |
| `rl_env` | Gymnasium env | [`BaseRLEnv`](../alphaswarm/rl/core/env.py) |
| `rl_observation` | State featuriser | [`BaseObservationBuilder`](../alphaswarm/rl/core/observation.py) |
| `rl_action` | Action-space spec + transform | [`BaseActionSpace`](../alphaswarm/rl/core/action.py) |
| `rl_reward` | Reward term / composite | [`BaseRewardModel`](../alphaswarm/rl/core/reward.py), [`RewardTerm`](../alphaswarm/rl/core/reward.py) |
| `rl_termination` | End-of-episode predicate | [`BaseTerminationCondition`](../alphaswarm/rl/core/termination.py) |
| `rl_policy` | Frozen policy | [`BasePolicy`](../alphaswarm/rl/core/policy.py) |
| `rl_agent` | Train-aware agent | [`BaseRLAgent`](../alphaswarm/rl/core/policy.py) |
| `rl_data` | Data pipeline | [`BaseDataPipeline`](../alphaswarm/rl/core/data.py) |
| `rl_ensembler` | Multi-member orchestrator | [`BaseEnsembler`](../alphaswarm/rl/core/ensembler.py) |
| `rl_experiment` | Experiment runner | [`BaseExperiment`](../alphaswarm/rl/core/experiment.py) |
| `rl_trajectory_store` | Per-step persistence | [`BaseTrajectoryStore`](../alphaswarm/rl/core/replay.py) |

## Built-in components (FinRL + AlphaSwarm)

### Environments
- `StockTradingEnv` — continuous portfolio (existing).
- `PortfolioAllocationEnv` — softmax weights (existing).
- `StockTradingDiscreteEnv` — single-stock buy/sell/hold (existing).
- `FinRLStockTradingEnv` — pandas share-lots (FinRL port).
- `FinRLStockTradingNpEnv` — array-backed numpy (FinRL port).
- `FinRLPortfolioCovEnv` — covariance + softmax (FinRL port).
- `FinRLCryptoEnv` — multi-crypto lookback stack (FinRL port).
- `OptionsTradingEnv`, `ExecutionEnv`, `MarketMakingEnv` — placeholders.

### Reward terms
- `PnLTerm`, `LogReturnTerm`
- `SharpeTerm`, `SortinoTerm`, `DrawdownPenaltyTerm`, `VolatilityPenaltyTerm`
- `TurnoverPenaltyTerm`, `TransactionCostTerm`, `SlippagePenaltyTerm`
- `TurbulenceGateTerm`, `MarginCallTerm`
- `CashIdlePenaltyTerm`, `BenchmarkOutperformanceTerm`, `RiskParityTerm`
- `PotentialBasedShaping`
- `CompositeReward` (sum of weighted terms; emits per-term
  contributions to `info["reward_terms"]`).
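
The `CompositeReward` behaviour described above (a weighted sum that also reports per-term contributions) can be sketched as follows. The `Term` callable type, the `composite_reward` function, and the two example terms are hypothetical stand-ins, not the registered AlphaSwarm classes.

```python
from typing import Callable, Dict, Tuple

# Hypothetical term signature: maps a state dict to a raw reward value.
Term = Callable[[dict], float]

def composite_reward(state: dict, terms: Dict[str, Tuple[float, Term]]) -> Tuple[float, dict]:
    """Sum weighted terms and expose each contribution under info["reward_terms"]."""
    contributions = {name: weight * term(state) for name, (weight, term) in terms.items()}
    info = {"reward_terms": contributions}
    return sum(contributions.values()), info

terms = {
    "pnl": (1.0, lambda s: s["pnl"]),
    "turnover_penalty": (0.5, lambda s: -s["turnover"]),
}
reward, info = composite_reward({"pnl": 2.0, "turnover": 1.0}, terms)
```

Keeping the contributions in `info` is what lets a downstream writer attribute the scalar reward back to individual terms.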

### Observation builders
- `PortfolioStateBuilder` (cash + weights / positions)
- `TechnicalIndicatorBuilder` (FinRL stockstats)
- `CovarianceBuilder` (FinRL portfolio cov)
- `TurbulenceBuilder` (Mahalanobis stress)
- `VIXBuilder`
- `LookbackStackBuilder` (FinRL crypto)
- `FundamentalBuilder` (FinRobot bridge)
- `MicrostructureBuilder`
- `StackedObservationBuilder` (composite)

### Action spaces
- `ContinuousWeightsAction`, `SoftmaxWeightsAction`,
  `IntegerSharesAction`, `DiscreteBuySellHoldAction`,
  `MultiDiscreteAction`, `TargetPositionAction`.

### Termination conditions
- `HorizonTermination`, `DrawdownTermination`, `MarginCallTermination`,
  `TurbulenceTermination`.

### Data pipelines
- `IcebergRLDataPipeline` (default — AlphaSwarm catalog).
- `YahooFinanceRLDataPipeline` (FinRL parity).
- `AlpacaRLDataPipeline` (paper-trading bridge).
- `LiveStreamingRLDataPipeline` (Kafka / Flink).
- `ReplayRLDataPipeline` (offline RL from `rl.trajectories`).

### Agents
- `SB3Adapter` — PPO / A2C / DDPG / SAC / TD3 / DQN +
  sb3-contrib (RecurrentPPO / TRPO / QRDQN / MaskablePPO / ARS / TQC).
- `ElegantRLAdapter`, `RayRLlibAdapter`, `CleanRLAdapter`.
- `LLMHybridAgent` — FinRobot-style LLM advisor + RL backbone.
- Existing classical / Q-family / actor-critic / evolutionary / SPM
  trees retained.

### Ensemblers / experiments
- `WalkForwardEnsembler` (FinRL `DRLEnsembleAgent` port).
- `BestOfNRunner`, `CurriculumRunner`, `MetaEnsembleRunner`.
- `BasicRLExperiment`, `WalkForwardRLExperiment`,
  `RewardAblationExperiment`, `RLAlphaBacktestExperiment`.


<!-- https://alpha-swarm.ai/concepts/rl/rl-finagent -->
# RL FinAgent Layered Reflection Adapter (Phase 10)

Reference docs for the FinAgent multimodal LLM-hybrid agent ported
into `alphaswarm_rl` per Zhang AAAI 24.

## Five-stage cascade

| # | Stage | YAML | Purpose |
| --- | --- | --- | --- |
| 1 | `low_intelligence` | [`configs/agents/finagent/low_intelligence.yaml`](../configs/agents/finagent/low_intelligence.yaml) | Factual 2-3 sentence market read |
| 2 | `high_intelligence` | [`configs/agents/finagent/high_intelligence.yaml`](../configs/agents/finagent/high_intelligence.yaml) | Strategic outlook + bias |
| 3 | `low_reflection` | [`configs/agents/finagent/low_reflection.yaml`](../configs/agents/finagent/low_reflection.yaml) | 1-bar post-mortem |
| 4 | `high_reflection` | [`configs/agents/finagent/high_reflection.yaml`](../configs/agents/finagent/high_reflection.yaml) | k-bar strategic post-mortem |
| 5 | `decision` | [`configs/agents/finagent/decision.yaml`](../configs/agents/finagent/decision.yaml) | Final SELL/HOLD/BUY |

Each stage's LLM call routes through `router_complete` (hard rule
2). The adapter degrades gracefully when the router is unavailable
or any stage fails (defaults to `HOLD`).
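
The cascade-with-fallback behaviour can be sketched as below. The `complete(stage, prompt)` callable is a hypothetical stand-in for the router call, and the `{"action": "..."}` decision payload is an assumed shape; the 0/1/2 action encoding follows the usage example on this page.

```python
import json
from typing import Callable

STAGES = ["low_intelligence", "high_intelligence", "low_reflection",
          "high_reflection", "decision"]
ACTIONS = {"SELL": 0, "HOLD": 1, "BUY": 2}

def run_cascade(complete: Callable[[str, str], str], market_context: str) -> int:
    """Run the five stages in order; fall back to HOLD on any failure."""
    notes = market_context
    try:
        for stage in STAGES:
            reply = complete(stage, notes)
            # Each stage sees the accumulated output of the earlier stages.
            notes = f"{notes}\n[{stage}] {reply}"
        # After the loop, `reply` holds the decision stage's output.
        decision = json.loads(reply)["action"]
        return ACTIONS[str(decision).upper()]
    except Exception:
        # Router unavailable, stage error, or unparsable decision: stay flat.
        return ACTIONS["HOLD"]
```

Catching broadly here is deliberate for the sketch: any failure mode collapses to the same safe action rather than propagating into the trading loop.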

## Three tools

| Tool | File | Purpose |
| --- | --- | --- |
| `KlinePlotterTool` | [`alphaswarm/agents/tools/finagent/kline_plotter.py`](../alphaswarm/agents/tools/finagent/kline_plotter.py) | Summarise bars → text |
| `TradingPlotterTool` | [`alphaswarm/agents/tools/finagent/trading_plotter.py`](../alphaswarm/agents/tools/finagent/trading_plotter.py) | Summarise action history → text |
| `StrategyAgentsTool` | [`alphaswarm/agents/tools/finagent/strategy_agents_tool.py`](../alphaswarm/agents/tools/finagent/strategy_agents_tool.py) | Query another RL agent's decision |

## Modules

| File | Class | Purpose |
| --- | --- | --- |
| [`alphaswarm_rl/src/alphaswarm_rl/agents/llm_hybrid_layered.py`](../alphaswarm_rl/src/alphaswarm_rl/agents/llm_hybrid_layered.py) | `LayeredReflectionAdapter` | 5-stage prompt cascade |
| [`alphaswarm_rl/src/alphaswarm_rl/envs/tradesim_multimodal.py`](../alphaswarm_rl/src/alphaswarm_rl/envs/tradesim_multimodal.py) | `MultimodalTradingEnv` | FinAgent-style dict observation |

## Usage

```python
from alphaswarm_rl.agents.llm_hybrid_layered import LayeredReflectionAdapter

adapter = LayeredReflectionAdapter(
    llm_model="ollama/llama3",
    rl_weight=0.5,           # blend 50% with RL backbone
    rl_agent={"class": "ppo_inhouse"},
)
# `env` is assumed to be an already-built trading env (for example, a MultimodalTradingEnv).
adapter.build(env)
action, _ = adapter.predict(obs)        # int in {0=SELL, 1=HOLD, 2=BUY}

# Between predicts, update the memory so reflection stages have something
# to critique:
adapter.update_realised_pnl(realised_short=0.01, realised_k=0.02)
```

## Hard rule alignment

- Hard rule 2: every LLM call routes through `router_complete`.
- Hard rule 12: each stage is a separate `AgentRuntime` invocation
  (see the YAMLs' `model:` blocks).
- Hard rule 19: adapter registers via `RLComponent` metaclass under
  `rl_alias='finagent_layered'`.

## Acceptance

[Phase 10 tests](../alphaswarm_rl/tests/finagent/) verify:

- 5 stages invoke `router_complete` exactly once each.
- Decision JSON parsed correctly into action int.
- Memory updates persist between calls.
- Cascade degrades to HOLD on LLM failure.
- All 3 tools handle valid + empty inputs.


<!-- https://alpha-swarm.ai/concepts/rl/rl-framework -->
# Reinforcement learning framework
> Hash-locked RLExperimentSpec + RLRuntime + metaclass-registered components + Iceberg trajectory store. The canonical entry point for every RL run in AlphaSwarm.


The RL layer in AlphaSwarm follows a metaclass-driven, registry-first design
inspired by FinRL's library structure and FinRobot's tool-augmented
agent runtime. Every concrete component (env, observation, action,
reward, termination, policy, agent, data pipeline, ensembler,
experiment, trajectory store) auto-registers through
[`alphaswarm_rl/src/alphaswarm_rl/core/base.py`](https://github.com/julianwileymac/alphaswarm/blob/main/alphaswarm_rl/src/alphaswarm_rl/core/base.py)
so the API and the lab UI can browse them at runtime.

This page is the canonical entry point. For shorter cuts:

- [rl-lab](./rl-lab.md): interactive RL Lab + builders.
- [rl-components](./rl-components.md): auto-generated component
  reference (browse via `/rl/components` in the operator UI).
- [rl-iceberg](./rl-iceberg.md): Iceberg trajectory / equity /
  reward-decomposition tables and DuckDB views.
- [rl-market-dynamics](./rl-market-dynamics.md): Phase 6 slice-and-merge regime
  labeller + `RegimeAwareObservation` +
  `RegimeStratifiedEvaluation`.
- [rl-prudex-evaluation](./rl-prudex-evaluation.md): Phase 9
  PRUDEX-Compass framework (17 measures, 5 visualisations).
- [rl-finagent](./rl-finagent.md): Phase 10 FinAgent multimodal
  5-stage LLM-hybrid adapter.
- [weight-centric-pipeline](./weight-centric-pipeline.md): FinRL-X
  four-stage `f_S → f_A → f_T → f_R` pipeline.
- [architecture/decisions/010-rl-production-enhancement](../../architecture/decisions/010-rl-production-enhancement.md):
  full Phase 1-12 production-enhancement ADR.

## Phase 1-12 production enhancements (May 2026)

The Phase 1-12 deliverables documented in
[ADR-010](../../architecture/decisions/010-rl-production-enhancement.md)
add the following components under their canonical `rl_alias` /
`kind`:

| Phase | Components |
| --- | --- |
| 1 (Rewards) | `differential_sharpe`, `differential_downside`, `implementation_shortfall`, `running_inventory`, `exp_utility`, `hindsight`, `dp_distillation` |
| 2 (Analytical) | `almgren_chriss_residual`, `avellaneda_stoikov_residual` (+ `alphaswarm_rl.analytical.{almgren_chriss,avellaneda_stoikov,cartea_jaimungal}` helpers) |
| 3 (Envs) | `tradesim_algotrading`, `tradesim_portfolio`, `tradesim_execution`, `tradesim_hft`, `finagent_trading` |
| 4 (Agents) | `eiie`, `deeptrader`, `investor_imitator`, `eteo`, `opd`, `deepscalper`, `hft_ddqn`, `ppo_inhouse` |
| 5 (Backbones) | `eiie_conv`, `sagcn`, `market_scorer`, `hft_qnet`, `eteo_dual_head`, `pd_dual_rnn`, `sarl_lstm` |
| 6 (MDM) | `slice_and_merge_regime_flow` (analysis flow), `regime_aware` observation, `regime_stratified` experiment |
| 7 (CSDI) | `csdi_imputed` dataset kind |
| 8 (Validation) | `CombinatorialPurgedKFold`, `probability_of_backtest_overfitting`, `rademacher_anti_serum`, `deflated_sharpe_ratio`, `walk_forward_anchored`, `walk_forward_rolling`, `benjamini_hochberg`, `holm_bonferroni`, `validation_suite` experiment |
| 9 (PRUDEX) | `PrudexMetrics`, `PrudexReport`, `compute_prudex_metrics`, 5 chart helpers, `prudex_compass` experiment |
| 10 (FinAgent) | `finagent_layered` adapter + 5 AgentSpec YAMLs under `configs/agents/finagent/` + 3 tools under `alphaswarm/agents/tools/finagent/` |
| 11 (Replay) | `GeneralReplayBuffer`, `PrioritizedReplayBuffer`, `NStepInfoReplayBuffer` |
| 12 (Parity) | Determinism + kill-switch tests around `WeightCentricPipeline` + `WeightToOrders` |

## Contracts

Two execution shapes share the same hash-locked spec. The standalone
shape is the original RL pipeline; the workflow-wrapped shape lets
`WorkflowRuntime` compose RL training into larger multi-stage
agentic pipelines (AGENTS rule 40 + ADR-005 + Phase 5 of the
orchestration refactor).

```mermaid
flowchart LR
    Spec["RLExperimentSpec (hash-locked)"] --> Versions["rl_experiment_versions row"]
    Versions --> StandaloneRt["RLRuntime (standalone)"]
    Versions --> WfAdapter["execution adapter (workflow node)"]
    WfAdapter --> WfRuntime["WorkflowRuntime"]
    WfRuntime --> StandaloneRt

    StandaloneRt --> Env["BaseRLEnv"]
    StandaloneRt --> Agent["BaseRLAgent"]
    Env -->|observation| Obs["BaseObservationBuilder"]
    Env -->|action| Action["BaseActionSpace"]
    Env -->|reward| Reward["CompositeReward (BaseRewardTerm × N)"]
    Env -->|terminate?| Term["BaseTerminationCondition"]
    Agent --> Policy["BaseRLPolicy (+ TimeSeriesEncoder backbone)"]
    Agent --> Advantage["BaseAdvantageEstimator"]

    StandaloneRt --> Trajectory["IcebergTrajectoryStore"]
    Trajectory --> Iceberg[("rl.* Iceberg namespace")]
    StandaloneRt --> RlRuns[("rl_runs ledger (Postgres)")]
    StandaloneRt --> Mlflow[("MLflow")]

    WfRuntime --> WfRuns[("workflow_runs + agent_runs_v2")]
```

## Hard rules

1. **All RL train / evaluate / paper / replay / walk-forward goes
   through
   [`alphaswarm_rl/src/alphaswarm_rl/runtime.py::RLRuntime`](https://github.com/julianwileymac/alphaswarm/blob/main/alphaswarm_rl/src/alphaswarm_rl/runtime.py)**
   (AGENTS rule 16). Tasks under
   [`alphaswarm_rl/tasks/rl_tasks.py`](https://github.com/julianwileymac/alphaswarm/blob/main/alphaswarm_rl/tasks/rl_tasks.py)
   and API routes under
   [`alphaswarm_rl/api/routes/rl.py`](https://github.com/julianwileymac/alphaswarm/blob/main/alphaswarm_rl/api/routes/rl.py)
   wrap it; they never call `agent.train` directly.
2. **`rl_experiment_versions` rows are immutable, hash-locked.**
   Re-snapshotting via
   [`alphaswarm_rl/src/alphaswarm_rl/registry.py::persist_spec`](https://github.com/julianwileymac/alphaswarm/blob/main/alphaswarm_rl/src/alphaswarm_rl/registry.py)
   inserts a new row when the SHA-256 of the spec changes (AGENTS
   rule 17).
3. **Trajectory persistence flows through
   [`IcebergTrajectoryStore`](https://github.com/julianwileymac/alphaswarm/blob/main/alphaswarm_rl/src/alphaswarm_rl/trajectories/iceberg_writer.py)**
   → `iceberg_catalog.append_arrow` (AGENTS rule 18).
4. **All concrete components register through the
   [`RLComponent`](https://github.com/julianwileymac/alphaswarm/blob/main/alphaswarm_rl/src/alphaswarm_rl/core/base.py)
   metaclass.** Set `rl_kind` to one of the canonical kinds; the
   metaclass calls `@register` automatically (AGENTS rule 19).
5. **LLM calls inside `LLMHybridAgent` route through
   [`router_complete`](https://github.com/julianwileymac/alphaswarm/blob/main/alphaswarm/llm/providers/router.py)**
   (AGENTS rule 20).
6. **Advantage estimation goes through
   [`BaseAdvantageEstimator`](https://github.com/julianwileymac/alphaswarm/blob/main/alphaswarm_rl/src/alphaswarm_rl/advantage/base.py)**
   (AGENTS rule 36). The native
   `ReinforcePlusPlusAdvantage` / `GRPOAdvantage` / `GAEAdvantage`
   register through the metaclass alongside envs / rewards /
   policies.
7. **Policy backbones go through
   [`TimeSeriesEncoder`](https://github.com/julianwileymac/alphaswarm/blob/main/alphaswarm_rl/src/alphaswarm_rl/policies/backbones/base.py)**
   (AGENTS rule 37). The four shipped backbones
   (`TransformerBackbone`, `RecurrentBackbone`,
   `AutoencoderBackbone`, `PatchTSTBackbone`) wrap existing
   `alphaswarm_models.models` modules so the policy network and the
   offline ML stack share one source of truth.
8. **Weight-centric portfolio actions go through the FinRL-X
   four-stage pipeline
   [`WeightCentricPipeline`](https://github.com/julianwileymac/alphaswarm/blob/main/alphaswarm_rl/src/alphaswarm_rl/portfolio/pipeline.py)**
   (`f_S → f_A → f_T → f_R`, AGENTS rule 38). Risk overlay (`f_R`)
   re-uses
   [`RiskLimits`](https://github.com/julianwileymac/alphaswarm/blob/main/alphaswarm/risk/limits.py)
   so offline backtests and live paper paths produce identical
   target-weight vectors.
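
Rule 4 above (registration through a metaclass keyed on `rl_kind`) can be illustrated with a small self-contained sketch. `RLComponentMeta`, `REGISTRY`, and the two example classes are hypothetical; the real `RLComponent` metaclass may differ in how it treats aliases and inheritance.

```python
REGISTRY: dict = {}

class RLComponentMeta(type):
    """Hypothetical metaclass: registers classes that declare rl_kind in their own body."""

    def __new__(mcls, name, bases, namespace):
        cls = super().__new__(mcls, name, bases, namespace)
        kind = namespace.get("rl_kind")
        alias = namespace.get("rl_alias", name)
        if kind is not None:  # abstract bases leave rl_kind unset and are skipped
            REGISTRY.setdefault(kind, {})[alias] = cls
        return cls

class BaseRewardTerm(metaclass=RLComponentMeta):
    """Abstract base: no rl_kind, so it is not registered."""

class PnLTerm(BaseRewardTerm):
    rl_kind = "rl_reward"
    rl_alias = "pnl"
```

Because registration happens at class-creation time, importing the module is enough for a component to become browsable; no explicit `register` call is needed at the definition site.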

## Hash-lock invariant in practice

The `*_spec_versions` table is the contract that makes RL replayable.
Three concrete consequences:

- **Same content → same version.** Re-posting an identical spec
  returns the existing `version_id`. No duplicate row, no
  side-effect.
- **Any field change → new version.** Bump a hyperparameter, swap a
  reward term, or retune the LR schedule: the SHA-256 changes and a
  new row is inserted. The old row stays forever.
- **Replay is across data, not across code.** When you call
  `RLRuntime(spec).replay(new_window)`, the runtime loads the
  pinned `version_id` from `rl_runs`, rebuilds the env / agent
  exactly as in the original train run, and feeds it the new bars.
  This is how "would this policy have held up in Q1 2024?"
  questions get a deterministic answer.

This is why
[`alphaswarm_rl/src/alphaswarm_rl/registry.py::persist_spec`](https://github.com/julianwileymac/alphaswarm/blob/main/alphaswarm_rl/src/alphaswarm_rl/registry.py)
is the only sanctioned path: any direct mutation of the table
would corrupt the replay contract.

## Packages

| Path | Purpose |
| --- | --- |
| [alphaswarm_rl/src/alphaswarm_rl/core/](https://github.com/julianwileymac/alphaswarm/tree/main/alphaswarm_rl/src/alphaswarm_rl/core) | Abstract bases + `RLComponent` metaclass + schema helpers. |
| [alphaswarm_rl/src/alphaswarm_rl/spec.py](https://github.com/julianwileymac/alphaswarm/blob/main/alphaswarm_rl/src/alphaswarm_rl/spec.py) | `RLExperimentSpec` declarative blueprint. |
| [alphaswarm_rl/src/alphaswarm_rl/runtime.py](https://github.com/julianwileymac/alphaswarm/blob/main/alphaswarm_rl/src/alphaswarm_rl/runtime.py) | `RLRuntime` single sanctioned executor. |
| [alphaswarm_rl/src/alphaswarm_rl/envs/](https://github.com/julianwileymac/alphaswarm/tree/main/alphaswarm_rl/src/alphaswarm_rl/envs) | Concrete envs (existing + FinRL ports + TradeSim + FinAgent). |
| [alphaswarm_rl/src/alphaswarm_rl/rewards/](https://github.com/julianwileymac/alphaswarm/tree/main/alphaswarm_rl/src/alphaswarm_rl/rewards) | Composable reward terms. |
| [alphaswarm_rl/src/alphaswarm_rl/observations/](https://github.com/julianwileymac/alphaswarm/tree/main/alphaswarm_rl/src/alphaswarm_rl/observations) | Observation builders. |
| [alphaswarm_rl/src/alphaswarm_rl/actions/](https://github.com/julianwileymac/alphaswarm/tree/main/alphaswarm_rl/src/alphaswarm_rl/actions) | Action-space implementations. |
| [alphaswarm_rl/src/alphaswarm_rl/terminations/](https://github.com/julianwileymac/alphaswarm/tree/main/alphaswarm_rl/src/alphaswarm_rl/terminations) | End-of-episode predicates. |
| [alphaswarm_rl/src/alphaswarm_rl/data_pipelines/](https://github.com/julianwileymac/alphaswarm/tree/main/alphaswarm_rl/src/alphaswarm_rl/data_pipelines) | Iceberg / Yahoo / Alpaca / streaming / replay pipelines. |
| [alphaswarm_rl/src/alphaswarm_rl/agents/](https://github.com/julianwileymac/alphaswarm/tree/main/alphaswarm_rl/src/alphaswarm_rl/agents) | SB3 / ElegantRL / RLlib / CleanRL / LLM-hybrid + classical / Q-family / actor-critic / evolutionary. |
| [alphaswarm_rl/src/alphaswarm_rl/policies/](https://github.com/julianwileymac/alphaswarm/tree/main/alphaswarm_rl/src/alphaswarm_rl/policies) | Policy backbones (`TimeSeriesEncoder` subclasses). |
| [alphaswarm_rl/src/alphaswarm_rl/advantage/](https://github.com/julianwileymac/alphaswarm/tree/main/alphaswarm_rl/src/alphaswarm_rl/advantage) | Advantage estimators (native REINFORCE++ / GRPO / GAE). |
| [alphaswarm_rl/src/alphaswarm_rl/ensemblers/](https://github.com/julianwileymac/alphaswarm/tree/main/alphaswarm_rl/src/alphaswarm_rl/ensemblers) | Walk-forward / best-of-N / curriculum / meta-ensemble. |
| [alphaswarm_rl/src/alphaswarm_rl/experiments/](https://github.com/julianwileymac/alphaswarm/tree/main/alphaswarm_rl/src/alphaswarm_rl/experiments) | Experiment runners (basic / walk-forward / ablation / alpha-backtest / regime-stratified / validation-suite / PRUDEX-Compass). |
| [alphaswarm_rl/src/alphaswarm_rl/applications/](https://github.com/julianwileymac/alphaswarm/tree/main/alphaswarm_rl/src/alphaswarm_rl/applications) | One-call FinRL-style apps (stock / portfolio / crypto / fundamentals / paper). |
| [alphaswarm_rl/src/alphaswarm_rl/portfolio/](https://github.com/julianwileymac/alphaswarm/tree/main/alphaswarm_rl/src/alphaswarm_rl/portfolio) | `WeightCentricPipeline` (FinRL-X `f_S → f_A → f_T → f_R`). |
| [alphaswarm_rl/src/alphaswarm_rl/trajectories/](https://github.com/julianwileymac/alphaswarm/tree/main/alphaswarm_rl/src/alphaswarm_rl/trajectories) | Iceberg-backed trajectory writer + DuckDB views. |
| [alphaswarm_rl/src/alphaswarm_rl/bridges/](https://github.com/julianwileymac/alphaswarm/tree/main/alphaswarm_rl/src/alphaswarm_rl/bridges) | Backtest-engine + WorkflowRuntime adapters. |
| [alphaswarm/persistence/models_rl.py](https://github.com/julianwileymac/alphaswarm/blob/main/alphaswarm/persistence/models_rl.py) | ORM for specs, versions, runs, evaluations, refs, registrations. |
| [alphaswarm_rl/api/routes/rl.py](https://github.com/julianwileymac/alphaswarm/blob/main/alphaswarm_rl/api/routes/rl.py) | REST surface. |
| [alphaswarm_rl/tasks/rl_tasks.py](https://github.com/julianwileymac/alphaswarm/blob/main/alphaswarm_rl/tasks/rl_tasks.py) | Celery tasks driven by `RLRuntime`. |
| [alphaswarm_client/src/routes/rl/](https://github.com/julianwileymac/alphaswarm/tree/main/alphaswarm_client/src/routes/rl) | RL Lab + builders + library + runs UI (active Vite frontend). |
| [alphaswarm_rl/configs/](https://github.com/julianwileymac/alphaswarm/tree/main/alphaswarm_rl/configs) | Preset / reward / observation / data-pipeline YAMLs. |
| [alphaswarm_rl/tests/](https://github.com/julianwileymac/alphaswarm/tree/main/alphaswarm_rl/tests) | Hermetic test suite. |

Legacy `alphaswarm.rl.*` is a deprecation shim that re-exports from
`alphaswarm_rl.*`; new code imports from `alphaswarm_rl` directly.

## Spec lifecycle

1. **Author** an `RLExperimentSpec` (YAML or in-code Pydantic).
2. **Persist** via
   [`alphaswarm_rl.registry.persist_spec`](https://github.com/julianwileymac/alphaswarm/blob/main/alphaswarm_rl/src/alphaswarm_rl/registry.py)
   → `rl_experiment_specs` + `rl_experiment_versions` (hash-locked
   snapshot).
3. **Run** via `RLRuntime.train` / `.evaluate` / `.paper` / `.replay` /
   `.walk_forward` → opens an `rl_runs` row, builds the env / agent
   from `build_from_config`, drives training, persists per-step
   trajectories to Iceberg, finalises the run row.
4. **Inspect** via the API
   (`/rl/runs/{id}/equity`, `/trajectories`,
   `/reward-decomposition`, `/episodes`) and the lab UI run-detail
   page (equity chart, reward decomposition, episode summary,
   replay slider).

## Worked example: train + replay

Goal: snapshot a 50k-step PPO experiment, train it, inspect the
ledger row, read trajectories from Iceberg, and replay against
fresh data, all from this page.

### Step 1: snapshot the spec

The experiment YAML lives at
[`alphaswarm_rl/configs/experiments/my_first_rl.yaml`](https://github.com/julianwileymac/alphaswarm/tree/main/alphaswarm_rl/configs/experiments).
Dispatch the train run:


Notice `spec_hash` in the response: that is the immutable hash-lock
key. Re-posting the same YAML returns the same `spec_version_id`.

### Step 2: tail progress

```bash
curl -N http://localhost:8000/chat/stream/
```

Frames arrive in the canonical envelope (AGENTS rule 4). Expected
stages: `start` → `data.loaded` → `env.built` → `agent.built` →
`train.step` (×many, sparse) → `train.checkpoint` → `done`.

### Step 3: inspect the ledger

The agent-safe read is `data.rl.list` / `data.rl.describe`:

```bash
curl -X POST http://localhost:8000/mcp/data/tools/data.rl.describe/invoke \
    -H "Content-Type: application/json" \
    -H "Authorization: Bearer $(alphaswarm-cli auth token)" \
    -d '{"rl_run_id": ""}'
```

The response carries `status`, `mean_reward`, `total_timesteps`,
`spec_version_id`, MLflow run id, and the trajectory namespace.

### Step 4: read trajectories from Iceberg

Pyodide does not ship PyIceberg, but it ships duckdb + pyarrow, and
the trajectory writer exports a parquet-compatible view. The
snippet below shows the analytical pattern with inline sample data
so it runs in your browser.


The same pattern works against the real Iceberg trajectory tables
via the
[`data.iceberg.read_snapshot`](https://github.com/julianwileymac/alphaswarm/blob/main/alphaswarm/data/mcp/tools/iceberg.py)
MCP tool. The tables are:

- `alphaswarm_silver_rl_trajectories.`: per-step `(episode, step, obs_hash, action, reward, value, log_prob)`
- `alphaswarm_silver_rl_equity_curves.`: per-step equity / drawdown
- `alphaswarm_silver_rl_action_logs.`: full action vectors per step
- `alphaswarm_silver_rl_reward_decomposition.`: per-term reward attribution

### Step 5: replay against fresh data

The killer feature of hash-locked specs: replay the trained policy
against a different time window WITHOUT touching the spec.

Send a `POST` to the `…/replay` endpoint with the header
`Content-Type: application/json` and the body
`{"start": "2024-01-01", "end": "2024-03-31"}`. The JSON response
carries `task_id`, `rl_run_id` (the id of the new replay run), and
`reused_spec_version_id`.

The new `rl_runs` row carries `parent_run_id` and the SAME
`spec_version_id` as the original train run. Two `rl_runs` rows,
one `rl_experiment_versions` row.
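
For a Python client, the replay request from this step could be built as sketched below. The helper name and the example URL are hypothetical (the full replay path is not given on this page); the method, header, and body fields follow the request described in this step. The request is constructed but not sent.

```python
import json
import urllib.request

def build_replay_request(replay_url: str, start: str, end: str) -> urllib.request.Request:
    """Build (but do not send) the POST request for a replay window."""
    body = json.dumps({"start": start, "end": end}).encode("utf-8")
    return urllib.request.Request(
        replay_url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )

# Hypothetical URL: substitute the real replay endpoint of your run.
req = build_replay_request(
    "http://localhost:8000/example/replay", "2024-01-01", "2024-03-31"
)
# Sending would be `urllib.request.urlopen(req)`; the JSON response is expected to
# carry task_id, rl_run_id, and reused_spec_version_id, as named in this step.
```

Note that the spec itself never appears in the request: only the time window changes, which is why the replay reuses the original `spec_version_id`.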

### Step 6: verify

- `rl_experiment_versions` row with the recorded `spec_hash`.
- Two `rl_runs` rows referencing it (`train` + `replay`).
- Trajectory tables in `alphaswarm_silver_rl_trajectories.`.
- MLflow runs visible at `http://localhost:5000/#/experiments`.
- Topbar `KillSwitch` shows green; `should_halt` returned false on
  every step.

### What next

- Walk the full tutorial: [tutorials/first-rl-experiment](../../tutorials/first-rl-experiment.md).
- Compose into a workflow:
  [tutorials/first-agent-workflow](../../tutorials/first-agent-workflow.md)
  + [concepts/agentic/workflow-studio](../agentic/workflow-studio.md).
- Add a custom reward term: [rl-components](./rl-components.md).
- Browse the trajectory schema: [rl-iceberg](./rl-iceberg.md).

## Inspiration sources

- **FinRL** (`alphaswarm_snippets/inspiration/FinRL-master`): env taxonomy
  (StockTrading, StockPortfolio, multi-crypto), `DataProcessor` /
  `FeatureEngineer` / `df_to_array`, `DRLAgent` / `DRLEnsembleAgent`,
  composite reward. Ported as registered presets in
  `alphaswarm_rl.envs.finrl_*`, `alphaswarm_rl.data_pipelines.*`, and the
  `WalkForwardEnsembler`.
- **FinRobot** (`alphaswarm_snippets/inspiration/FinRobot-master`):
  multi-agent LLM workflow + tool-augmented analysis. Bridged via
  `LLMHybridAgent` (LLM proposes, RL refines) and `FundamentalBuilder`.
- **FinRL-X**: the four-stage weight-centric pipeline
  (`f_S → f_A → f_T → f_R`) is ported as
  [`WeightCentricPipeline`](https://github.com/julianwileymac/alphaswarm/blob/main/alphaswarm_rl/src/alphaswarm_rl/portfolio/pipeline.py)
  (AGENTS rule 38).
- **FinAgent**: five-stage LLM-hybrid adapter ported as
  `finagent_layered` (ADR-010, Phase 10).
- **PRUDEX-Compass**: 17-measure evaluation framework ported as
  `prudex_compass` experiment + five chart helpers (ADR-010,
  Phase 9).

## Deeper reads

- [rl-lab](./rl-lab.md): interactive RL Lab + builders.
- [rl-components](./rl-components.md): full component catalogue.
- [rl-iceberg](./rl-iceberg.md): trajectory persistence contract.
- [rl-policy-backbones](./rl-policy-backbones.md): `TimeSeriesEncoder` subclasses.
- [rl-market-dynamics](./rl-market-dynamics.md): regime labeller + observation.
- [rl-prudex-evaluation](./rl-prudex-evaluation.md): PRUDEX-Compass.
- [rl-finagent](./rl-finagent.md): FinAgent multimodal adapter.
- [weight-centric-pipeline](./weight-centric-pipeline.md): `f_S → f_A → f_T → f_R`.
- [agentic-rl](./agentic-rl.md): RL-as-agent integration patterns.
- [architecture/decisions/010-rl-production-enhancement](../../architecture/decisions/010-rl-production-enhancement.md): full Phase 1-12 ADR.
- [reference/api](../../reference/api/index.mdx): the `rl` tag in the interactive playground.
- [reference/python/alphaswarm_rl](../../reference/python/index.mdx): auto-generated `alphaswarm_rl` Python reference.


<!-- https://alpha-swarm.ai/concepts/rl/rl-iceberg -->
# RL Iceberg data plane
> | Table | Columns | Written when | | --- | --- | --- | | `rl.trajectories` | `run_id`, `episode`, `step`, `ts`, `reward`, `info` (JSON) | Every env step | | `rl.equity_curves` | `run_id`, `episode`, `...

# RL Iceberg data plane

Per-step RL records persist to four Iceberg tables in the namespace
controlled by `ALPHASWARM_RL_TRAJECTORY_NAMESPACE` (default `rl`). Writes flow
through
[`alphaswarm/rl/trajectories/iceberg_writer.py::IcebergTrajectoryStore`](../alphaswarm/rl/trajectories/iceberg_writer.py)
→ [`iceberg_catalog.append_arrow`](../alphaswarm/data/iceberg_catalog.py).

## Tables

| Table | Columns | Written when |
| --- | --- | --- |
| `rl.trajectories` | `run_id`, `episode`, `step`, `ts`, `reward`, `info` (JSON) | Every env step |
| `rl.equity_curves` | `run_id`, `episode`, `step`, `ts`, `portfolio_value`, `drawdown`, `cash` | Every env step that exposes `info["portfolio_value"]` |
| `rl.action_logs` | `run_id`, `episode`, `step`, `ts`, `asset_idx`, `action_value` | Every env step (one row per action component) |
| `rl.reward_decomposition` | `run_id`, `episode`, `step`, `ts`, `term_name`, `contribution` | When the reward model exposes `info["reward_terms"]` (any `CompositeReward`) |
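
As a rough illustration of the table contract above, the sketch below splits one env step into rows for the four tables. The column names come from the table; the function name `step_to_rows`, the plain-dict row shape, and the assumption that `info["reward_terms"]` is a name-to-contribution mapping are hypothetical and not taken from `IcebergTrajectoryStore`.

```python
import json
from datetime import datetime, timezone


def step_to_rows(run_id, episode, step, action, reward, info):
    """Split one env step into rows for the four tables (hypothetical helper)."""
    ts = datetime.now(timezone.utc).isoformat()
    key = {"run_id": run_id, "episode": episode, "step": step, "ts": ts}
    rows = {
        # One trajectory row per step; `info` is stored as a JSON string.
        "trajectories": [
            {**key, "reward": float(reward), "info": json.dumps(info, default=str)}
        ],
        # One action-log row per action component.
        "action_logs": [
            {**key, "asset_idx": i, "action_value": float(a)}
            for i, a in enumerate(action)
        ],
        "equity_curves": [],
        "reward_decomposition": [],
    }
    # Equity rows only exist when the env exposes a portfolio value.
    if "portfolio_value" in info:
        rows["equity_curves"].append(
            {
                **key,
                "portfolio_value": float(info["portfolio_value"]),
                "drawdown": info.get("drawdown"),
                "cash": info.get("cash"),
            }
        )
    # Decomposition rows only exist when the reward model exposes its terms
    # (assumed here to be a {term_name: contribution} mapping).
    for name, contribution in info.get("reward_terms", {}).items():
        rows["reward_decomposition"].append(
            {**key, "term_name": name, "contribution": float(contribution)}
        )
    return rows
```

A step whose `info` dict is empty still yields one trajectory row and one action-log row per component, which matches the "Written when" column.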

## Settings

| Variable | Default | Purpose |
| --- | --- | --- |
| `ALPHASWARM_RL_TRAJECTORY_NAMESPACE` | `rl` | Iceberg namespace |
| `ALPHASWARM_RL_TRAJECTORY_TABLE` | `trajectories` | Per-step trajectory table name |
| `ALPHASWARM_RL_EQUITY_TABLE` | `equity_curves` | Equity-curve table name |
| `ALPHASWARM_RL_ACTION_LOG_TABLE` | `action_logs` | Action-log table name |
| `ALPHASWARM_RL_REWARD_DECOMP_TABLE` | `reward_decomposition` | Reward-decomposition table name |
| `ALPHASWARM_RL_PERSIST_TRAJECTORIES` | `true` | When `false`, the runtime uses an in-memory store (CI / local). |
| `ALPHASWARM_RL_TRAJECTORY_FLUSH_ROWS` | `1000` | Rows per buffer before partial flush. |
| `ALPHASWARM_RL_REQUIRE_ICEBERG` | `false` | Make Iceberg write failures hard-fail. |
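
The sketch below shows one way these variables and defaults could be read into a typed object. The variable names and defaults are from the table; the `TrajectorySettings` class, its field names, and the accepted boolean spellings are assumptions made for illustration, not the project's actual settings loader.

```python
import os
from dataclasses import dataclass


def _as_bool(raw, default):
    """Parse a boolean env value; unset means `default` (accepted spellings are an assumption)."""
    if raw is None:
        return default
    return raw.strip().lower() in {"1", "true", "yes", "on"}


@dataclass(frozen=True)
class TrajectorySettings:
    namespace: str
    trajectory_table: str
    equity_table: str
    action_log_table: str
    reward_decomp_table: str
    persist: bool
    flush_rows: int
    require_iceberg: bool

    def qualified(self, table):
        """Return `namespace.table`, e.g. `rl.trajectories` with the defaults."""
        return f"{self.namespace}.{table}"


def load_trajectory_settings(env=None):
    """Read the settings from `env` (defaults to os.environ)."""
    env = os.environ if env is None else env
    return TrajectorySettings(
        namespace=env.get("ALPHASWARM_RL_TRAJECTORY_NAMESPACE", "rl"),
        trajectory_table=env.get("ALPHASWARM_RL_TRAJECTORY_TABLE", "trajectories"),
        equity_table=env.get("ALPHASWARM_RL_EQUITY_TABLE", "equity_curves"),
        action_log_table=env.get("ALPHASWARM_RL_ACTION_LOG_TABLE", "action_logs"),
        reward_decomp_table=env.get(
            "ALPHASWARM_RL_REWARD_DECOMP_TABLE", "reward_decomposition"
        ),
        persist=_as_bool(env.get("ALPHASWARM_RL_PERSIST_TRAJECTORIES"), True),
        flush_rows=int(env.get("ALPHASWARM_RL_TRAJECTORY_FLUSH_ROWS", "1000")),
        require_iceberg=_as_bool(env.get("ALPHASWARM_RL_REQUIRE_ICEBERG"), False),
    )
```

Passing an explicit mapping instead of relying on `os.environ` keeps the loader easy to exercise in CI, where `ALPHASWARM_RL_PERSIST_TRAJECTORIES=false` selects the in-memory store.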

## DuckDB views

[`alphaswarm/rl/trajectories/duckdb_views.py`](../alphaswarm/rl/trajectories/duckdb_views.py)
exposes two helpers:

- `ensure_duckdb_views(connection)` — registers
  `rl_trajectories` / `rl_equity_curves` / `rl_action_logs` /
  `rl_reward_decomposition` Arrow-backed views.
- `register_run_views(run_id, connection)` — adds run-filtered views
  named `rl__run_`.

The API uses these views to serve the
`/rl/runs/{id}/equity` / `/trajectories` / `/reward-decomposition` /
`/actions` endpoints without touching PyIceberg directly.

## Postgres ledger

The Postgres tables in
[`alphaswarm/persistence/models_rl.py`](../alphaswarm/persistence/models_rl.py)
hold the metadata layer that points at these Iceberg row ranges:

- `rl_experiment_specs` / `rl_experiment_versions` — hash-locked spec
  snapshots.
- `rl_runs` — one row per `RLRuntime` invocation.
- `rl_evaluations` — rollout summary.
- `rl_trajectory_refs` / `rl_equity_curve_refs` — pointers to the
  Iceberg row ranges per episode.
- `rl_component_registrations` — DB mirror of the in-memory RL
  component registry (so `/rl/components` is fast).


<!-- https://alpha-swarm.ai/concepts/rl/rl-lab -->
# RL Lab — interactive RL builder
> | Tab | Purpose | Component | | --- | --- | --- | | **Experiment** | Compose env + reward + observation + action + agent + ensembler into one `RLExperimentSpec`, save, train. | [`ExperimentBuilder.tsx...

# RL Lab — interactive RL builder

The RL Lab lives at `/rl/lab` in the AlphaSwarm webui and combines six
surfaces into one shell:

| Tab | Purpose | Component |
| --- | --- | --- |
| **Experiment** | Compose env + reward + observation + action + agent + ensembler into one `RLExperimentSpec`, save, train. | [`ExperimentBuilder.tsx`](../webui/components/rl/ExperimentBuilder.tsx) |
| **Environment** | Drag a data pipeline + env + observation + action + reward + termination onto the canvas; save spec. | [`EnvironmentBuilder.tsx`](../webui/components/rl/EnvironmentBuilder.tsx) |
| **Reward** | Drag reward terms, weight them, hit "Preview reward" → server-side decomposition over a synthetic trajectory. | [`RewardModelBuilder.tsx`](../webui/components/rl/RewardModelBuilder.tsx) |
| **Observation** | Drag observation builders, preview output shape + feature names. | [`ObservationBuilder.tsx`](../webui/components/rl/ObservationBuilder.tsx) |
| **Agent** | Pick framework (SB3 / ElegantRL / RLlib / CleanRL / LLM-hybrid) + algorithm + hyperparams. | [`AgentBuilder.tsx`](../webui/components/rl/AgentBuilder.tsx) |
| **Component library** | Browse every registered RL component, filter by tag / source / category. | [`RlComponentLibrary.tsx`](../webui/components/rl/RlComponentLibrary.tsx) |

## Routes

| Path | Component |
| --- | --- |
| `/rl/lab` | `RlLabPage` |
| `/rl/library` | `RlComponentLibrary` |
| `/rl/builder/env` | `EnvironmentBuilder` |
| `/rl/builder/reward` | `RewardModelBuilder` |
| `/rl/builder/observation` | `ObservationBuilder` |
| `/rl/builder/agent` | `AgentBuilder` |
| `/rl/builder/experiment` | `ExperimentBuilder` |
| `/rl/runs` | `RlRunsPage` |
| `/rl/runs/[id]` | `RlRunDetailPage` |
| `/rl/runs/[id]/replay` | `RlReplayViewer` |
| `/rl` | Legacy `RlPage` (quick-train, application registry browser). |
| `/rl/zoo` | RL agent zoo (`/registry/agent`). |

The builders all use the existing
[`WorkflowEditor`](../webui/components/flow/WorkflowEditor.tsx) +
xyflow stack with `domain="rl"`. The serializer in
[`webui/components/rl/serialize.ts`](../webui/components/rl/serialize.ts)
turns a `FlowGraph` into an `RLExperimentSpec` payload by bucketising
nodes via their palette group (env / observation / action / reward /
termination / agent / data pipeline / ensembler).
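
The bucketising step can be pictured with the short Python sketch below. It is only an illustration of the idea: the real serializer is TypeScript, and the node shape (`group`, `alias`, `kwargs` keys) and the snake_case group names used here are assumptions rather than the actual `FlowGraph` schema.

```python
# Palette groups named in the text; spelling as dict keys is an assumption.
GROUPS = (
    "env",
    "observation",
    "action",
    "reward",
    "termination",
    "agent",
    "data_pipeline",
    "ensembler",
)


def bucketise(nodes):
    """Group canvas nodes by palette group, preserving canvas order within a group."""
    buckets = {group: [] for group in GROUPS}
    for node in nodes:
        group = node["group"]
        if group not in buckets:
            raise ValueError(f"unknown palette group: {group!r}")
        buckets[group].append(
            {"alias": node["alias"], "kwargs": node.get("kwargs", {})}
        )
    return buckets
```

Rejecting unknown groups up front means a malformed canvas fails before any spec payload is posted.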

## API surface used

The lab calls the API endpoints in
[`alphaswarm/api/routes/rl.py`](../alphaswarm/api/routes/rl.py):

- `GET /rl/components` — kind counts.
- `GET /rl/components/{kind}` — list registered components per kind.
- `POST /rl/lab/preview-reward` — reward decomposition.
- `POST /rl/lab/preview-observation` — observation shape + features.
- `POST /rl/lab/preview-action` — action transform sample.
- `POST /rl/specs` — persist a spec.
- `POST /rl/specs/{slug}/run` — kick off train / evaluate / paper /
  replay / walk-forward via the matching Celery task.
- `GET /rl/runs` / `GET /rl/runs/{id}` / `.../equity` /
  `.../trajectories` / `.../reward-decomposition` /
  `.../episodes` / `.../actions` — runs ledger + step-level data
  served from DuckDB views over the Iceberg tables.
- `POST /rl/runs/{id}/replay` — re-roll a saved policy on a new
  window.
- `POST /rl/data-pipelines/preview` — show first rows + array shapes.

## Run replay

The replay viewer (`/rl/runs/[id]/replay`) loads:

- `rl.equity_curves` rows for the chosen episode (slider populates from
  the row count).
- `rl.trajectories` rows for the chosen episode (each step shows
  reward + info JSON).

Both come from the DuckDB views generated by
[`alphaswarm/rl/trajectories/duckdb_views.py`](../alphaswarm/rl/trajectories/duckdb_views.py).


<!-- https://alpha-swarm.ai/concepts/rl/rl-market-dynamics -->
# RL Market Dynamics Modeling (Phase 6)
> The market-dynamics framework labels every bar in a price series with a regime ID (default 4 regimes: strong-down / weak-down / sideways / strong-up). The labels feed:

# RL Market Dynamics Modeling (Phase 6)

Reference docs for the slice-and-merge regime labeller and its
consumers in `alphaswarm_rl`.

## Overview

The market-dynamics framework labels every bar in a price series
with a regime ID (default 4 regimes: strong-down / weak-down /
sideways / strong-up). The labels feed:

- `RegimeAwareObservation` — appends a one-hot regime vector to the
  RL agent's observation.
- `RegimeStratifiedEvaluation` — runs the trained policy and
  decomposes per-regime performance for the RL Lab dashboard.

## Pipeline

1. **Butterworth filter** on the indicator column (default
   `close`). Causal `lfilter` to avoid look-ahead.
2. **Turning-point detection** — bars where the filtered pct-return
   sign flips mark candidate segment boundaries.
3. **Segment merging** — segments below `min_length_limit` are
   merged with their neighbour so every regime has a stable
   estimation window.
4. **Per-segment slope** — linear regression of the filtered
   indicator inside each segment.
5. **Labelling** — quantile (default) or fixed-threshold buckets.
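
The five steps above can be sketched in dependency-free Python as follows. This is a simplified illustration, not the shipped flow: a causal exponential moving average stands in for the Butterworth `lfilter`, and every function name here is hypothetical.

```python
def ema(values, alpha=0.2):
    """Causal smoother (stand-in for the Butterworth filter): uses current and past bars only."""
    out, prev = [], values[0]
    for v in values:
        prev = alpha * v + (1.0 - alpha) * prev
        out.append(prev)
    return out


def turning_points(smooth):
    """Indices where the sign of the bar-to-bar pct-return flips."""
    cuts, prev_sign = [], 0
    for i in range(1, len(smooth)):
        ret = smooth[i] / smooth[i - 1] - 1.0
        sign = (ret > 0) - (ret < 0)
        if sign and prev_sign and sign != prev_sign:
            cuts.append(i)
        if sign:
            prev_sign = sign
    return cuts


def merge_segments(cuts, n, min_length):
    """Build half-open [start, end) segments, merging short ones into a neighbour."""
    bounds = [0] + cuts + [n]
    segs = []
    for start, end in zip(bounds, bounds[1:]):
        if segs and end - start < min_length:
            segs[-1] = (segs[-1][0], end)  # absorb into the previous segment
        else:
            segs.append((start, end))
    if len(segs) > 1 and segs[0][1] - segs[0][0] < min_length:
        segs[1] = (segs[0][0], segs[1][1])  # a short first segment joins the next one
        segs.pop(0)
    return segs


def slope(ys):
    """Ordinary least-squares slope of ys against 0..len(ys)-1."""
    n = len(ys)
    if n < 2:
        return 0.0
    x_mean, y_mean = (n - 1) / 2.0, sum(ys) / n
    num = sum((x - x_mean) * (y - y_mean) for x, y in enumerate(ys))
    den = sum((x - x_mean) ** 2 for x in range(n))
    return num / den


def quantile_labels(slopes, k):
    """Rank segments by slope and spread the ranks over k buckets (0 = most negative)."""
    order = sorted(range(len(slopes)), key=lambda i: slopes[i])
    labels = [0] * len(slopes)
    for rank, idx in enumerate(order):
        labels[idx] = rank * k // len(slopes)
    return labels


def label_regimes(prices, n_regimes=4, min_length=12, alpha=0.2):
    """Return one regime label per bar."""
    smooth = ema(prices, alpha)
    segs = merge_segments(turning_points(smooth), len(prices), min_length)
    seg_labels = quantile_labels([slope(smooth[a:b]) for a, b in segs], n_regimes)
    per_bar = []
    for (a, b), label in zip(segs, seg_labels):
        per_bar.extend([label] * (b - a))
    return per_bar
```

Because the smoother is causal, detected turning points lag the true extrema by a few bars; that lag is the price of avoiding look-ahead.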

## Modules

| File | Class | Purpose |
| --- | --- | --- |
| [`alphaswarm/analysis/flows/market_dynamics_modeling.py`](../alphaswarm/analysis/flows/market_dynamics_modeling.py) | `slice_and_merge_regime_flow` | Analysis flow; emits per-bar labels |
| [`alphaswarm_rl/src/alphaswarm_rl/observations/regime.py`](../alphaswarm_rl/src/alphaswarm_rl/observations/regime.py) | `RegimeAwareObservation` | One-hot observation appendage |
| [`alphaswarm_rl/src/alphaswarm_rl/experiments/regime_stratified.py`](../alphaswarm_rl/src/alphaswarm_rl/experiments/regime_stratified.py) | `RegimeStratifiedEvaluation` | Per-regime metric breakdown |

## Usage

```python
from alphaswarm.analysis.base import FlowContext
from alphaswarm.analysis.flows.market_dynamics_modeling import (
    SliceAndMergeRegimeParams,
    slice_and_merge_regime_flow,
)

# `df` is the price DataFrame to label; it must contain the indicator column.
params = SliceAndMergeRegimeParams(
    indicator_column="close",
    dynamic_number=4,
    min_length_limit=12,
    labeling_method="quantile",
)
result = slice_and_merge_regime_flow(df, params, FlowContext(run_id="…"))
labels = [row["label"] for row in result.rows]
```

The labels are surfaced into the RL pipeline via
`RegimeAwareObservation(labels=labels)` and the matching evaluation
through `RegimeStratifiedEvaluation(n_regimes=4, regime_labels=labels)`.
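
To make the two consumers concrete, here is a minimal sketch of what "append a one-hot regime vector" and "decompose per-regime performance" mean numerically. The helper names are hypothetical and the real classes may compute more than a per-regime mean.

```python
def one_hot(label, n_regimes):
    """Encode a regime ID as a one-hot vector of length n_regimes."""
    if not 0 <= label < n_regimes:
        raise ValueError(f"label {label} outside [0, {n_regimes})")
    vec = [0.0] * n_regimes
    vec[label] = 1.0
    return vec


def append_regime(observation, label, n_regimes=4):
    """Return the observation with the one-hot regime vector appended."""
    return list(observation) + one_hot(label, n_regimes)


def per_regime_mean(step_returns, labels, n_regimes=4):
    """Mean step return per regime; None for regimes that never occurred."""
    sums, counts = [0.0] * n_regimes, [0] * n_regimes
    for ret, label in zip(step_returns, labels):
        sums[label] += ret
        counts[label] += 1
    return {
        k: (sums[k] / counts[k] if counts[k] else None) for k in range(n_regimes)
    }
```

With four regimes, the observation grows by exactly four features, which is the shape the Phase 6 tests check for.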

## Hard rule alignment

- Hard rule 23: analysis-spec lifecycle goes through
  `AnalysisRuntime`. The flow registers via `register_analysis_flow`.
- Hard rule 21: gold-tier writes via `iceberg_catalog.append_arrow`
  to `alphaswarm_gold_analysis_market_dynamics_modeling`.
- Hard rule 25: flow body has no direct LLM calls.

## Acceptance

- [Phase 6 tests](../alphaswarm_rl/tests/mdm/) verify:
  - `slice_and_merge_regime_flow` produces ≥1 segment on a
    trending+sideways+downtrend synthetic series.
  - `RegimeAwareObservation` emits the expected one-hot shape.
  - `RegimeStratifiedEvaluation` breaks performance down per regime.


<!-- https://alpha-swarm.ai/concepts/rl/rl-policy-backbones -->
# RL policy backbones
> | Class | Source | Use case | | --- | --- | --- | | [`TransformerBackbone`](../alphaswarm/rl/policies/backbones/transformer.py) | Self-attention encoder over the lookback window | Default for medium sequence...

# RL policy backbones

> Transformer / RNN / Autoencoder / PatchTST feature trunks for the
> AlphaSwarm RL policies. Registered through the
> [`RLComponent`](../alphaswarm/rl/core/base.py) metaclass with
> `rl_kind='rl_policy_backbone'`.

## Backbones

| Class | Source | Use case |
| --- | --- | --- |
| [`TransformerBackbone`](../alphaswarm/rl/policies/backbones/transformer.py) | Self-attention encoder over the lookback window | Default for medium-length sequences (30-100 bars) |
| [`RecurrentBackbone`](../alphaswarm/rl/policies/backbones/recurrent.py) | LSTM / GRU / RNN cell (configurable) | Causal and memory-efficient; unidirectional by default |
| [`AutoencoderBackbone`](../alphaswarm/rl/policies/backbones/autoencoder.py) | MLP encoder bottleneck | High-dim observation (1000+ features) compression |
| [`PatchTSTBackbone`](../alphaswarm/rl/policies/backbones/patchtst.py) | Patch-tokenised Transformer (Nie 2023) | Long-horizon (252+ bars) — avoids token explosion |

## Wiring through SB3

```yaml
agent:
  class: SB3Adapter
  module_path: alphaswarm.rl.agents.sb3_adapter
  kwargs:
    algorithm: PPO
    policy: MlpPolicy
    policy_kwargs:
      features_extractor_class: alphaswarm.rl.policies.feature_extractors.BackboneFeaturesExtractor
      features_extractor_kwargs:
        backbone_alias: TransformerBackbone
        sequence_length: 30
        input_features: 32
        features_dim: 128
        backbone_kwargs:
          n_heads: 4
          n_layers: 2
          d_ff: 256
          dropout: 0.1
```

## Wiring through CleanRL

The [`CleanRLAdapter`](../alphaswarm/rl/agents/cleanrl_adapter.py) wraps the
backbone via
[`build_backbone_from_alias`](../alphaswarm/rl/policies/feature_extractors.py):

```python
from alphaswarm.rl.policies import build_backbone_from_alias

trunk = build_backbone_from_alias(
    "RecurrentBackbone",
    input_features=20,
    sequence_length=30,
    output_dim=128,
    backbone_kwargs={"cell": "lstm", "hidden_size": 128, "num_layers": 2},
)
```
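
For readers unfamiliar with alias-based registration, the sketch below shows the general pattern: a metaclass records each class under its kind and alias, and a builder looks the class up by alias. Everything here (`ComponentMeta`, `REGISTRY`, `build_from_alias`, `ToyRecurrentBackbone`) is a hypothetical stand-in; it does not reproduce the `RLComponent` metaclass or `build_backbone_from_alias`.

```python
REGISTRY = {}  # rl_kind -> {alias: class}


class ComponentMeta(type):
    """Hypothetical registering metaclass (not the real RLComponent machinery)."""

    def __init__(cls, name, bases, namespace):
        super().__init__(name, bases, namespace)
        kind = getattr(cls, "rl_kind", None)
        alias = namespace.get("rl_alias")  # only classes declaring their own alias
        if kind and alias:
            REGISTRY.setdefault(kind, {})[alias] = cls


class Backbone(metaclass=ComponentMeta):
    """Abstract base: carries the kind but no alias, so it is not registered."""

    rl_kind = "rl_policy_backbone"

    def __init__(self, input_features, sequence_length, output_dim, **backbone_kwargs):
        self.input_features = input_features
        self.sequence_length = sequence_length
        self.output_dim = output_dim
        self.backbone_kwargs = backbone_kwargs


class ToyRecurrentBackbone(Backbone):
    rl_alias = "ToyRecurrentBackbone"


def build_from_alias(alias, *, kind="rl_policy_backbone", backbone_kwargs=None, **common):
    """Instantiate the class registered under `alias`, forwarding both kwarg sets."""
    try:
        cls = REGISTRY[kind][alias]
    except KeyError:
        raise KeyError(f"no {kind} registered under alias {alias!r}") from None
    return cls(**common, **(backbone_kwargs or {}))
```

Keeping the shared shape arguments separate from `backbone_kwargs` mirrors the call above, where `cell`, `hidden_size`, and `num_layers` are specific to one backbone.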

## Shipped example specs

Four reference specs live under
[`configs/rl/policies/`](../configs/rl/policies):

- [`transformer_stock_trading.yaml`](../configs/rl/policies/transformer_stock_trading.yaml)
  — PPO + Transformer over StockTradingEnv.
- [`recurrent_portfolio.yaml`](../configs/rl/policies/recurrent_portfolio.yaml)
  — SAC + LSTM over PortfolioAllocationEnv.
- [`autoencoder_marketmaking.yaml`](../configs/rl/policies/autoencoder_marketmaking.yaml)
  — PPO + Autoencoder over MarketMakingEnv.
- [`patchtst_execution.yaml`](../configs/rl/policies/patchtst_execution.yaml)
  — PPO + PatchTST over OptimalExecutionEnv.

## Adding a new backbone

See [the cursor rule](../.cursor/rules/policy-backbones.mdc) for the
canonical checklist.

## See also

- [alphaswarm_docs/agentic-rl.md](../../concepts/rl/agentic-rl.md)
- [Hard rule 37 in AGENTS.md](https://github.com/julianwileymac/alphaswarm/blob/main/AGENTS.md)


<!-- https://alpha-swarm.ai/concepts/rl/rl-prudex-evaluation -->
# RL PRUDEX-Compass Evaluation (Phase 9)
> | Axis | Code | Measures | | --- | --- | --- | | Profitability | P | `total_return`, `annualised_return`, `cagr` | | Risk-control | R | `volatility`, `max_drawdown`, `sortino`, `calmar` | | Universali...

# RL PRUDEX-Compass Evaluation (Phase 9)

Reference docs for the PRUDEX-Compass evaluation framework ported
from TradeMaster into `alphaswarm_rl`.

## Six axes, 17 measures

| Axis | Code | Measures |
| --- | --- | --- |
| Profitability | P | `total_return`, `annualised_return`, `cagr` |
| Risk-control | R | `volatility`, `max_drawdown`, `sortino`, `calmar` |
| Universality | U | `cross_dataset_sharpe_mean`, `cross_dataset_sharpe_std` |
| Diversification | D | `portfolio_weight_entropy`, `turnover` |
| Explainability | E | `regime_conditioned_sharpe` |
| X-tra evaluation | X | `performance_profile_auc`, `rank_score`, `extreme_market_score`, `hit_rate` |

The table lists 16 measures; a `sharpe_ratio` convenience field brings the count to **17 measures total.**
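
As a worked illustration, a few of these measures can be computed from an equity curve with the standard textbook definitions below. This is a sketch only: `compute_prudex_metrics` may use different conventions (risk-free rate, annualisation, sample vs population variance), and the function name is hypothetical.

```python
import math
import statistics


def basic_measures(equity, periods_per_year=252):
    """Textbook versions of four measures, computed from an equity curve."""
    rets = [b / a - 1.0 for a, b in zip(equity, equity[1:])]
    total_return = equity[-1] / equity[0] - 1.0
    # Annualised volatility from the sample standard deviation of step returns.
    volatility = statistics.stdev(rets) * math.sqrt(periods_per_year)
    # Max drawdown: largest peak-to-trough loss as a fraction of the peak.
    peak, max_drawdown = equity[0], 0.0
    for value in equity:
        peak = max(peak, value)
        max_drawdown = max(max_drawdown, 1.0 - value / peak)
    # Sharpe ratio with a zero risk-free rate (an assumption of this sketch).
    annual_mean = statistics.fmean(rets) * periods_per_year
    sharpe = annual_mean / volatility if volatility > 0 else float("nan")
    return {
        "total_return": total_return,
        "volatility": volatility,
        "max_drawdown": max_drawdown,
        "sharpe_ratio": sharpe,
    }
```

For the curve `[100, 110, 99, 120]` the total return is 20% and the maximum drawdown is 10% (the fall from 110 to 99).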

## Five visualisations

| Helper | Purpose |
| --- | --- |
| `pride_star_chart` | 8-axis radar of per-agent scores |
| `prudex_compass_chart` | 6-axis hexagon (one axis per PRUDEX axis) |
| `performance_profile_chart` | CDF of per-step returns across agents |
| `rank_distribution_chart` | Heatmap of per-metric ranks |
| `extreme_market_chart` | Bar chart of extreme-market cumulative returns |

All helpers gracefully degrade to a dict fallback when matplotlib
is unavailable.

## Modules

| File | Class | Purpose |
| --- | --- | --- |
| [`alphaswarm_rl/src/alphaswarm_rl/evaluation/prudex_compass.py`](../alphaswarm_rl/src/alphaswarm_rl/evaluation/prudex_compass.py) | `PrudexMetrics`, `PrudexReport`, `compute_prudex_metrics` | Per-agent metric computation |
| [`alphaswarm_rl/src/alphaswarm_rl/evaluation/visualizations.py`](../alphaswarm_rl/src/alphaswarm_rl/evaluation/visualizations.py) | 5 chart helpers | Plot rendering |
| [`alphaswarm_rl/src/alphaswarm_rl/experiments/prudex_evaluation.py`](../alphaswarm_rl/src/alphaswarm_rl/experiments/prudex_evaluation.py) | `PrudexEvaluation` | Experiment aggregator |

## Usage

```python
from alphaswarm_rl.experiments.prudex_evaluation import PrudexEvaluation
from alphaswarm_rl.evaluation.visualizations import (
    prudex_compass_chart, pride_star_chart, performance_profile_chart,
)

# eq_* / w_* are each agent's equity curve and weights history, computed elsewhere.
exp = PrudexEvaluation(periods_per_year=252)
report = exp.run(
    agent_results={
        "eiie": {"equity_curve": eq_eiie, "weights_history": w_eiie},
        "deeptrader": {"equity_curve": eq_dt, "weights_history": w_dt},
        "ppo": {"equity_curve": eq_ppo, "weights_history": w_ppo},
    },
)
# Visualise:
fig = prudex_compass_chart(report)
```

## Hard rule alignment

- Hard rule 19: `PrudexEvaluation` registers via
  `RLComponent` metaclass under `rl_alias='prudex_compass'`.
- Hard rule 18: report lands in `rl_runs.result_summary` via the
  parent `RLRuntime`; no direct Iceberg writes from this experiment.

## Acceptance

[Phase 9 tests](../alphaswarm_rl/tests/evaluation/) verify:

- All 17 measures compute without error on synthetic equity series.
- Per-axis breakdown has exactly 6 axes (P/R/U/D/E/X).
- 5 visualisation helpers return a Figure (matplotlib) or dict
  fallback.
- Every rank-matrix entry lies in `[1, N_agents]` for each metric.
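
The rank-range check in the last bullet follows directly from how per-metric ranks are usually built. A minimal sketch, assuming ranks start at 1 for the best agent and ties are broken by input order (the function name and tie rule are assumptions):

```python
def rank_agents(scores, higher_is_better=True):
    """Map agent -> rank in [1, N_agents]; rank 1 is the best agent for this metric."""
    order = sorted(scores, key=scores.get, reverse=higher_is_better)
    return {agent: position + 1 for position, agent in enumerate(order)}


def rank_matrix(metrics, lower_is_better=()):
    """Rank every metric; `metrics` maps metric name -> {agent: score}."""
    return {
        name: rank_agents(scores, higher_is_better=name not in lower_is_better)
        for name, scores in metrics.items()
    }
```

Metrics such as `max_drawdown` or `volatility` would be listed in `lower_is_better` so that the smallest value receives rank 1.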


<!-- https://alpha-swarm.ai/concepts/rl/weight-centric-pipeline -->
# Weight-centric portfolio pipeline (`f_S -> f_A -> f_T -> f_R`)
> | Stage | Class | Responsibility | Default | | --- | --- | --- | --- | | `f_S` (Selector) | [`StockSelector`](../alphaswarm/rl/portfolio/selector.py) | Filter universe by liquidity / vol / momentum | [`Stati...

# Weight-centric portfolio pipeline (`f_S -> f_A -> f_T -> f_R`)

> The FinRL-X four-stage protocol that guarantees identical target
> weight semantics across offline backtesting and live broker
> execution.

## Stages

| Stage | Class | Responsibility | Default |
| --- | --- | --- | --- |
| `f_S` (Selector) | [`StockSelector`](../alphaswarm/rl/portfolio/selector.py) | Filter universe by liquidity / vol / momentum | [`StaticUniverseSelector`](../alphaswarm/rl/portfolio/selector.py) |
| `f_A` (Allocator) | [`PortfolioAllocator`](../alphaswarm/rl/portfolio/allocator.py) | Map raw RL action to unconstrained weights | [`IdentityAllocator`](../alphaswarm/rl/portfolio/allocator.py) |
| `f_T` (Timing) | [`TimingAdjuster`](../alphaswarm/rl/portfolio/timing.py) | Scale gross exposure on regime signals | [`ConstantTimingAdjuster`](../alphaswarm/rl/portfolio/timing.py) |
| `f_R` (Risk overlay) | [`RiskOverlay`](../alphaswarm/rl/portfolio/risk_overlay.py) | Truncate weights violating hard constraints | [`StackedRiskOverlay(PositionCap + GrossExposure)`](../alphaswarm/rl/portfolio/risk_overlay.py) |

## Composition

```python
from alphaswarm.rl.portfolio import (
    GrossExposureRiskOverlay,
    IdentityAllocator,
    PositionCapRiskOverlay,
    StackedRiskOverlay,
    StaticUniverseSelector,
    TurbulenceTimingAdjuster,
    WeightCentricPipeline,
)

# `universe`, `action`, and `prices` are supplied by the surrounding env / data code.
pipeline = WeightCentricPipeline(
    selector=StaticUniverseSelector(universe=universe),
    allocator=IdentityAllocator(),
    timing=TurbulenceTimingAdjuster(threshold=140.0, cooldown_scale=0.0),
    risk_overlay=StackedRiskOverlay(overlays=[
        PositionCapRiskOverlay(max_position_pct=0.30, mark_truncated=True),
        GrossExposureRiskOverlay(max_gross=1.0),
    ]),
)

state = pipeline.run(
    universe=universe,
    raw_action=action,
    context={"turbulence": 90.0, "prices": prices, "equity": 100_000.0},
)
target_weights = state.weights  # numpy array aligned with state.universe
```

## Determinism contract

Each stage is a **pure function** of its inputs — no hidden global
state, no time-dependent randomness without an explicit seed.
`state.history` records the per-stage weight vector for audit so a
downstream `LedgerWriter` can persist the full
`f_S -> f_A -> f_T -> f_R` trace.
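
The determinism contract can be illustrated with a toy pipeline in which every stage is a pure function and each intermediate weight vector is appended to a history list. The names below (`PipelineState`, `run_stages`, the three stage functions) are hypothetical, the selector stage is omitted for brevity, and the meaning given to `threshold` and `cooldown_scale` is an assumption rather than the documented behaviour of `TurbulenceTimingAdjuster`.

```python
from dataclasses import dataclass, field


@dataclass
class PipelineState:
    universe: list
    weights: list
    context: dict = field(default_factory=dict)
    history: list = field(default_factory=list)  # (stage name, weights) per stage


def identity_allocator(weights, context):
    """f_A: pass the raw action through unchanged."""
    return list(weights)


def turbulence_timing(weights, context, threshold=140.0, cooldown_scale=0.0):
    """f_T: scale exposure down when turbulence exceeds the threshold (assumed semantics)."""
    scale = cooldown_scale if context.get("turbulence", 0.0) > threshold else 1.0
    return [w * scale for w in weights]


def gross_cap(weights, context, max_gross=1.0):
    """f_R: rescale so that the sum of absolute weights does not exceed max_gross."""
    gross = sum(abs(w) for w in weights)
    if gross <= max_gross:
        return list(weights)
    return [w * max_gross / gross for w in weights]


def run_stages(universe, raw_action, stages, context=None):
    """Apply the stages in order, recording each stage's output for audit."""
    state = PipelineState(list(universe), list(raw_action), dict(context or {}))
    for name, stage in stages:
        state.weights = stage(state.weights, state.context)
        state.history.append((name, list(state.weights)))
    return state


STAGES = [("f_A", identity_allocator), ("f_T", turbulence_timing), ("f_R", gross_cap)]
```

Because no stage reads a clock, a global, or an unseeded random source, running the same inputs twice yields identical `weights` and `history`, which is what makes the audit trace reproducible.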

## Truncation propagation

The risk overlay can set `state.context["truncated"]=True` when a
hard constraint is breached (e.g. `mark_truncated=True` on
`PositionCapRiskOverlay`). The
[`RLBacktestEnv`](../alphaswarm/rl/envs/rl_backtest_env.py) lifts this onto
`info["truncated"]` so the
[`StopProperlyWrapper`](../alphaswarm/rl/rewards/stop_properly.py) scales
the step reward by `coef in [0, 1]`.
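
A minimal sketch of that propagation, assuming a symmetric per-position cap and a single multiplicative coefficient. The function names and the default `coef` are hypothetical; only the `truncated` flag, `max_position_pct`, `mark_truncated`, and the `[0, 1]` range come from the text.

```python
def position_cap(weights, context, max_position_pct=0.30, mark_truncated=True):
    """Clip each weight to +/- max_position_pct and flag the context if anything was clipped."""
    capped = [max(-max_position_pct, min(max_position_pct, w)) for w in weights]
    if mark_truncated and capped != list(weights):
        context["truncated"] = True
    return capped


def scale_reward(reward, info, coef=0.5):
    """Scale the step reward by coef when the step was truncated; coef must lie in [0, 1]."""
    if not 0.0 <= coef <= 1.0:
        raise ValueError("coef must be in [0, 1]")
    return reward * coef if info.get("truncated") else reward
```

With `coef=0.0` a truncated step earns nothing, while `coef=1.0` disables the penalty entirely.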

## Adding a new stage variant

1. Subclass the relevant base
   (`StockSelector` / `PortfolioAllocator` / `TimingAdjuster` /
   `RiskOverlay`).
2. Implement the single transform method
   (`select` / `allocate` / `adjust` / `apply`).
3. Re-export from
   [`alphaswarm/rl/portfolio/__init__.py`](../alphaswarm/rl/portfolio/__init__.py).

## See also

- [alphaswarm_docs/agentic-rl.md](../../concepts/rl/agentic-rl.md) — Overall architecture.
- [Hard rule 38 in AGENTS.md](https://github.com/julianwileymac/alphaswarm/blob/main/AGENTS.md) — Source-of-truth rule.


<!-- https://alpha-swarm.ai/concepts/strategy/analysis-agents -->
# Analysis Agents
> | Spec | Module | Purpose | | --- | --- | --- | | `analysis.step` | [alphaswarm/agents/analysis/step_analyst.py](../alphaswarm/agents/analysis/step_analyst.py) | Verdict + improvements for a single agent step. | | ...

# Analysis Agents

Three interpretation agents + one deferred reflector. Together they
close the Alpha-GPT three-stage loop (Ideation → Implementation →
Review) and the TradingAgents-style outcome-reflection loop.

## Specs

| Spec | Module | Purpose |
| --- | --- | --- |
| `analysis.step` | [alphaswarm/agents/analysis/step_analyst.py](../alphaswarm/agents/analysis/step_analyst.py) | Verdict + improvements for a single agent step. |
| `analysis.run` | [alphaswarm/agents/analysis/run_analyst.py](../alphaswarm/agents/analysis/run_analyst.py) | End-to-end interpretation of a backtest / paper / live run. |
| `analysis.portfolio` | [alphaswarm/agents/analysis/portfolio_analyst.py](../alphaswarm/agents/analysis/portfolio_analyst.py) | Portfolio aggregate + risk + regulatory exposure. |
| Reflector (helper) | [alphaswarm/agents/analysis/reflector.py](../alphaswarm/agents/analysis/reflector.py) | Resolve outcomes + write reflections + re-index L0. |

## Reflection loop (TradingAgents pattern)

```mermaid
flowchart LR
  d[agent_decisions] --> w[reflector window]
  w --> o[memory_outcomes]
  o --> r[reflection LLM]
  r --> mr[memory_reflections]
  mr --> rag[L0 RAG decisions]
  rag --> next[next agent run]
```

1. The reflector pulls every recent decision row that doesn't yet have
   an outcome.
2. It computes raw / benchmark / excess return over a configurable
   window via the bars adapter.
3. It writes one `MemoryOutcome` row + one `MemoryReflection` row.
4. It re-indexes the decision into the L0 `decisions` corpus so the
   next research / selection / trader run picks it up via
   `HierarchicalRAG`.
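
Step 2 reduces to simple arithmetic over the evaluation window. The sketch below assumes close-to-close simple returns and a single benchmark series; the function names and the returned keys are illustrative, not the `MemoryOutcome` schema.

```python
def window_return(closes):
    """Simple return from the first to the last close in the window."""
    return closes[-1] / closes[0] - 1.0


def resolve_outcome(asset_closes, benchmark_closes):
    """Raw, benchmark, and excess return over the same window."""
    raw = window_return(asset_closes)
    benchmark = window_return(benchmark_closes)
    return {
        "raw_return": raw,
        "benchmark_return": benchmark,
        "excess_return": raw - benchmark,
    }
```

For example, an asset moving from 100 to 110 while the benchmark moves from 200 to 210 gives a raw return of 10%, a benchmark return of 5%, and an excess return of 5%.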

## REST + Celery

```
POST /agents/analysis/step              — task
POST /agents/analysis/run               — task
POST /agents/analysis/portfolio         — task
POST /agents/analysis/reflect           — task wrapper for run_reflection_pass
POST /agents/analysis/sync/run          — synchronous variant
POST /memory/reflect/run                — synchronous reflection pass
```

Tasks live in [alphaswarm/tasks/analysis_tasks.py](../alphaswarm/tasks/analysis_tasks.py).

## YAMLs

- [configs/agents/analysis_step.yaml](../configs/agents/analysis_step.yaml)
- [configs/agents/analysis_run.yaml](../configs/agents/analysis_run.yaml)
- [configs/agents/analysis_portfolio.yaml](../configs/agents/analysis_portfolio.yaml)


<!-- https://alpha-swarm.ai/concepts/strategy/analysis-flows -->
# Analysis Flows Reference
> Every flow in the registry. Each flow is identified by a namespaced name (`namespace.flow`), declares a Pydantic params model, and returns a `FlowResult` with `metrics` / `rows` / `chart` / optional `...

# Analysis Flows Reference

> Framework: [alphaswarm_docs/analysis-framework.md](../../concepts/strategy/analysis-framework.md) · UI: [alphaswarm_docs/analysis-lab.md](../../concepts/strategy/analysis-lab.md).

Every flow in the registry. Each flow is identified by a namespaced
name (`namespace.flow`), declares a Pydantic params model, and returns
a `FlowResult` with `metrics` / `rows` / `chart` / optional
`arrow_table` for Iceberg persistence.

`GET /analysis/flows` lists every entry with the JSON-schema body
derived from the params model — the lab UI auto-renders forms from
this surface.

## profiling.\*

| Name | Label | Notes |
|---|---|---|
| `profiling.describe` | Column profile | Wraps `alphaswarm.data.profiling.compute_profile` |
| `profiling.dtypes` | Dtypes | Per-column dtype + memory footprint |
| `profiling.null_audit` | Null audit | Null counts + null fractions |
| `profiling.topk` | Top-K values | Most-frequent values + share |

## distribution.\*

| Name | Label | Notes |
|---|---|---|
| `distribution.descriptive_stats` | Descriptive stats | Mean / median / std / skew / kurt / IQR / MAD / quantiles |
| `distribution.histogram` | Histogram | Equal-width bins + Plotly chart |
| `distribution.ecdf` | Empirical CDF | Sorted-value ECDF (down-sampled to `max_points`) |
| `distribution.qq_plot_points` | Q-Q plot points | Slope/intercept fit vs. norm/t/uniform/expon |
| `distribution.shapiro_wilk` | Shapiro-Wilk | Normality test (capped at 5000 samples) |
| `distribution.jarque_bera` | Jarque-Bera | Skew + kurt goodness-of-fit |
| `distribution.kolmogorov_smirnov` | K-S | One-sample vs reference dist (norm / t / uniform / expon / lognorm) |

## outlier.\*

| Name | Label | Notes |
|---|---|---|
| `outlier.zscore` | Z-score | Robust (median/MAD) or classical |
| `outlier.iqr` | IQR fences | Tukey `[Q1 - k*IQR, Q3 + k*IQR]` |
| `outlier.iforest` | Isolation Forest | sklearn |
| `outlier.dbscan` | DBSCAN | Density-based; `-1` is noise |
| `outlier.lof` | LOF | sklearn LocalOutlierFactor |
| `outlier.ecod` | ECOD | PyOD; falls back to z-score |
| `outlier.pulse_vs_step` | Pulse vs Step | Distinguish transient pulses from level shifts |
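
As a worked example of one row, Tukey's IQR fences can be computed with the standard library alone. This is a sketch of the textbook rule, not the `outlier.iqr` flow itself; the quartile method (`statistics.quantiles` with its default exclusive method) and the function name are assumptions.

```python
import statistics


def iqr_outliers(values, k=1.5):
    """Return (lower fence, upper fence, outliers) using Tukey's rule."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    outliers = [v for v in values if v < lower or v > upper]
    return lower, upper, outliers
```

For `[1, 2, 3, 4, 5, 6, 7, 8, 100]` the quartiles are 2.5 and 7.5, so the fences are `[-5, 15]` and only `100` is flagged.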

## imputation.\*

| Name | Label | Notes |
|---|---|---|
| `imputation.ffill_bfill` | Forward / backward fill | Default `ffill_then_bfill` |
| `imputation.linear` | Linear interpolation | pandas `axis=0` |
| `imputation.spline` | Cubic spline | pandas spline (order configurable) |
| `imputation.knn` | KNN imputer | sklearn `KNNImputer` |
| `imputation.mice` | MICE (IterativeImputer) | sklearn `IterativeImputer` |

## regression.\*

| Name | Label | Notes |
|---|---|---|
| `regression.ols_diagnostics` | OLS diagnostics | Coefs + SE + t / p + Durbin-Watson + AIC / BIC |
| `regression.white_test` | White's test | Heteroskedasticity (general form) |
| `regression.breusch_pagan` | Breusch-Pagan | Heteroskedasticity vs regressors |
| `regression.vif` | VIF | Variance Inflation Factors per regressor |

## time_series.\*

| Name | Label | Notes |
|---|---|---|
| `time_series.stl` | STL decomposition | Trend / seasonal / residual |
| `time_series.adf` | Augmented Dickey-Fuller | H0 = unit root |
| `time_series.kpss` | KPSS | H0 = stationary (ADF complement) |
| `time_series.acf_pacf` | ACF / PACF | Auto- and partial-autocorrelation series |
| `time_series.garch` | GARCH(p, q) | Volatility model + horizon variance forecast |
| `time_series.change_point` | Change-point | ruptures.KernelCPD with rbf kernel |
| `time_series.granger_causality` | Granger causality | Up to `max_lag` |
| `time_series.cointegration` | Engle-Granger | Pair cointegration |
| `time_series.spectral_fft` | Spectral (FFT) | Real FFT magnitude + power spectrum |
| `time_series.spectral_wavelet` | Continuous wavelet transform | PyWavelets (optional) |
| `time_series.hurst_exponent` | Hurst exponent | Long-range dependence |
| `time_series.theil_sen` | Theil-Sen slope | Robust median-of-pairwise-slopes |

## derivatives.\*

| Name | Label | Notes |
|---|---|---|
| `derivatives.bsm` | Black-Scholes-Merton | Closed-form European price + Greeks |
| `derivatives.greeks_surface` | Greeks surface | Δ/Γ/ν/Θ/ρ across strikes × expiries |
| `derivatives.implied_volatility` | Implied volatility (Brent) | Recover σ from a market quote |
| `derivatives.monte_carlo_european` | MC European option | Vectorised GBM; opt-in CUDA via cupy |
| `derivatives.monte_carlo_barrier` | MC barrier option | Knock-in / knock-out variants |
| `derivatives.monte_carlo_asian` | MC Asian option | Arithmetic / geometric averaging |
| `derivatives.sabr_smile` | SABR smile (Hagan) | Hagan-Kumar-Lesniewski-Woodward 2002 |
| `derivatives.bachelier` | Bachelier (normal model) | Wraps `alphaswarm.options.normal_model` |

## portfolio.\*

| Name | Label | Notes |
|---|---|---|
| `portfolio.markowitz_efficient_frontier` | Efficient frontier | cvxpy if available, numpy-only fallback |
| `portfolio.ledoit_wolf_shrinkage` | Ledoit-Wolf covariance | Stabilised covariance matrix |
| `portfolio.fama_french_5_rolling` | FF5 rolling betas | Rolling-window OLS on Mkt-RF / SMB / HML / RMW / CMA |
| `portfolio.risk_parity` | Risk parity | Equal-risk-contribution weights (Spinu 2013) |

## factors.\*

| Name | Label | Notes |
|---|---|---|
| `factors.evaluate` | Factor evaluation | Wraps `alphaswarm.data.factors.evaluate_factor` (IC + quantile spread + turnover) |

## microstructure.\*

| Name | Label | Notes |
|---|---|---|
| `microstructure.realised_volatility` | Realised volatility (OHLC) | Close-to-close / Parkinson / GK / RS / YZ |
| `microstructure.order_book_imbalance` | Order-book imbalance | Top-of-book |
| `microstructure.vpin` | VPIN | Wraps `alphaswarm.data.microstructure.vpin` |

## Optional dependencies

Flows tag their optional deps (`optional_dependencies` field on the
descriptor). Missing extras raise a friendly `RuntimeError("install
extra X")` instead of crashing the catalog.

| Dep | Used by |
|---|---|
| `scikit-learn` | `outlier.{iforest,dbscan,lof}`, `imputation.{knn,mice}`, `portfolio.ledoit_wolf_shrinkage` |
| `statsmodels` | `regression.*`, `time_series.{adf,kpss,acf_pacf,granger_causality,cointegration,stl}` |
| `arch` | `time_series.garch` |
| `ruptures` | `time_series.change_point` |
| `pywavelets` | `time_series.spectral_wavelet` |
| `pyod` | `outlier.ecod` (falls back to z-score) |
| `cvxpy` | `portfolio.markowitz_efficient_frontier` (falls back to numpy projection) |
| `cupy` | `derivatives.monte_carlo_*` (opt-in GPU acceleration) |
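
The two behaviours in this section, a friendly error for a required extra and a silent fallback for an optional one, can be sketched with `importlib`. The helper names are hypothetical and the error wording simply mirrors the `RuntimeError("install extra X")` form quoted above.

```python
import importlib


def require_optional(module_name, extra):
    """Import an optional dependency or raise a friendly RuntimeError naming the extra."""
    try:
        return importlib.import_module(module_name)
    except ImportError as exc:
        raise RuntimeError(f"install extra {extra}") from exc


def import_or_fallback(module_name, fallback):
    """Import an optional dependency, returning `fallback` when it is missing."""
    try:
        return importlib.import_module(module_name)
    except ImportError:
        return fallback
```

Deferring the import to call time is what keeps the catalog listing usable even when an extra such as `arch` or `ruptures` is not installed.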


<!-- https://alpha-swarm.ai/concepts/strategy/analysis-framework -->
# Analysis Framework
> The analysis layer is AlphaSwarm's hash-locked, runtime-driven umbrella for every "explore a dataset" workflow — distribution audits, time-series diagnostics, derivatives pricing, portfolio optimisation, reg...

# Analysis Framework

> Doc map: [alphaswarm_docs/index.md](../../intro/index.md) · Lab guide: [alphaswarm_docs/analysis-lab.md](../../concepts/strategy/analysis-lab.md) · Flow reference: [alphaswarm_docs/analysis-flows.md](../../concepts/strategy/analysis-flows.md).

The analysis layer is AlphaSwarm's hash-locked, runtime-driven umbrella for
every "explore a dataset" workflow — distribution audits, time-series
diagnostics, derivatives pricing, portfolio optimisation, regression
diagnostics, outlier / imputation work, and Alphalens-style factor
evaluation. It is the **statistical / quantitative-analysis** counterpart
of the **agentic-interpretation** layer in
[alphaswarm_docs/analysis-agents.md](../../concepts/strategy/analysis-agents.md). The two namespaces are
deliberately distinct.

## Why a new umbrella

Most primitives existed already (`alphaswarm.ml.flows`, `alphaswarm.data.factors`,
`alphaswarm.data.realised_volatility`, `alphaswarm.data.microstructure`,
`alphaswarm.options.normal_model`, `alphaswarm.data.profiling.profiler`) but had no
single contract for:

- registering a flow with a JSON-schema-driven param model;
- composing multiple flows into a reproducible pipeline;
- snapshotting the spec into an immutable, hash-locked version row;
- writing every step's gold-tier output to Iceberg
  (`alphaswarm_gold_analysis_`) under medallion validation;
- emitting the same progress payload shape Celery + WebSocket
  consumers already understand.

The umbrella plugs every primitive into one canvas + one ledger.

## Layout

```
alphaswarm/analysis/
    base.py        — FlowParams / FlowResult / FlowDescriptor / FlowContext
    spec.py        — AnalysisSpec / AnalysisStep / FlowRef / DatasetRef
    registry.py    — @register_analysis_flow + persist_spec + add_spec
    runtime.py     — AnalysisRuntime (sole sanctioned executor)
    pricing.py     — closed-form + MC math primitives (BSM, Greeks, GBM, SABR)
    flows/
        profiling.py / distribution.py / outlier.py / imputation.py /
        regression.py / time_series.py / derivatives.py / portfolio.py /
        factors.py / microstructure.py
```

```mermaid
flowchart LR
    subgraph Backend
        Spec[AnalysisSpec] --> Runtime[AnalysisRuntime]
        Runtime --> Registry["FlowRegistry (@register_analysis_flow)"]
        Registry --> Flows["flows/: distribution / derivatives / portfolio / time_series / regression / outlier / imputation / profiling / factors / microstructure"]
    end
    subgraph Persistence
        SpecRow[("analysis_specs")]
        VerRow[("analysis_spec_versions (immutable)")]
        Run[("analysis_runs ledger")]
        Step[("analysis_step_results")]
        Iceberg[("alphaswarm_gold_analysis_")]
    end
    Runtime -->|persist_spec| SpecRow
    Runtime -->|snapshot| VerRow
    Runtime --> Run
    Run --> Step
    Runtime -->|"append_arrow medallion=gold"| Iceberg
    subgraph API
        FlowAPI["/analysis/flows"]
        SpecAPI["/analysis/specs"]
        RunAPI["/analysis/runs"]
    end
    Runtime --- API
    API --- LabUI["/analysis/lab\n(hybrid: tabbed + canvas)"]
```

## AnalysisSpec contract

Every spec is a Pydantic model that hashes its canonical JSON form
(SHA-256, sorted keys, no whitespace). Two specs with identical fields
collapse to one `analysis_spec_versions` row; any edit creates a new
version automatically.

```yaml
name: spy-distribution-audit
slug: spy-distribution-audit
kind: research
description: Distribution + GARCH + outlier audit for SPY daily bars.

dataset:
  iceberg_identifier: alphaswarm_silver_alpha_vantage.equities_daily
  filters:
    vt_symbol: SPY.NYSE
  limit: 5000

steps:
  - alias: profile
    flow_ref:
      flow: profiling.describe
      params: {}
  - alias: returns_dist
    flow_ref:
      flow: distribution.descriptive_stats
      params: { column: log_return }
  - alias: shapiro
    flow_ref:
      flow: distribution.shapiro_wilk
      params: { column: log_return }
  - alias: garch
    flow_ref:
      flow: time_series.garch
      params: { column: log_return, p: 1, q: 1, horizon: 10 }

medallion_layer: gold
business_metadata:
  data_owner: research-team
  semantic_definition: "SPY daily distribution + volatility audit"
  domain: research.distribution_audit
  sla_class: tier-3-eod
```
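
The hashing rule above (canonical JSON, sorted keys, no whitespace, SHA-256) can be sketched as follows. This is a minimal illustration that assumes the spec has already been dumped to a plain dict; `canonical_spec_hash` is a hypothetical helper name, not necessarily what `persist_spec` uses internally.

```python
import hashlib
import json


def canonical_spec_hash(spec: dict) -> str:
    """Hash a spec dict via its canonical JSON form (hypothetical helper)."""
    # sort_keys gives a stable key order; the compact separators drop all
    # insignificant whitespace, so equal content always serialises identically.
    canonical = json.dumps(spec, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


spec_a = {"slug": "spy-distribution-audit", "kind": "research", "steps": [{"alias": "profile"}]}
spec_b = {"kind": "research", "steps": [{"alias": "profile"}], "slug": "spy-distribution-audit"}
spec_c = {**spec_a, "kind": "production"}

# Same fields in a different key order collapse to one hash.
same_hash = canonical_spec_hash(spec_a) == canonical_spec_hash(spec_b)
# Any edit to a field produces a different hash, hence a new version row.
new_version = canonical_spec_hash(spec_a) != canonical_spec_hash(spec_c)
```

Because the hash depends only on content, re-posting an unchanged spec is naturally idempotent.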

## Hard rules

These hold across every analysis flow / spec / run. Any PR that
violates one will be sent back.

1. **Every analysis run goes through `AnalysisRuntime`.** REST + Celery
   tasks (`alphaswarm.tasks.analysis_flow_tasks`) wrap it; flow code never
   writes to Iceberg / Postgres directly.
2. **`analysis_spec_versions` rows are immutable.** Re-snapshotting via
   `alphaswarm.analysis.registry.persist_spec` creates a new version row when
   the SHA-256 hash changes — never update an existing row in place.
3. **Every per-step Iceberg write uses `iceberg_catalog.append_arrow`
   with `medallion_layer="gold"` and a `BusinessMetadata` block.** The
   default namespace is `alphaswarm_gold_analysis_`; flows can
   override via `output_namespace=` on `register_analysis_flow`.
4. **Flows never call `litellm.completion` / `OllamaClient` directly.**
   v1 ships zero LLM-routed flows by design — interpretation is owned
   by the analysis-AGENTS stack ([alphaswarm_docs/analysis-agents.md](../../concepts/strategy/analysis-agents.md)).
5. **Optional dependencies are guarded.** Flows that need `cvxpy`,
   `pyod`, `pywavelets`, `cupy`, etc. raise a friendly `RuntimeError`
   with the install hint when the import fails.
6. **No new diagram formats.** Mermaid only.
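
Rule 5 can be illustrated with a small import guard. This is a sketch under stated assumptions: `require_optional` and the install-hint strings are hypothetical, and the real flows may structure their guards differently.

```python
import importlib
from types import ModuleType


def require_optional(module_name: str, install_hint: str) -> ModuleType:
    """Import an optional dependency or raise a RuntimeError with an install hint."""
    try:
        return importlib.import_module(module_name)
    except ImportError as exc:
        raise RuntimeError(
            f"This flow needs the optional dependency '{module_name}'. "
            f"Install it with: {install_hint}"
        ) from exc


# A stdlib module stands in for an optional dependency that is installed.
json_mod = require_optional("json", "(bundled with Python)")

# A module that does not exist triggers the friendly error path.
try:
    require_optional("definitely_not_installed_pkg", "pip install definitely-not-installed-pkg")
except RuntimeError as err:
    message = str(err)
```

Calling the guard inside the flow body rather than at module import time keeps the flow registry importable even when the optional package is absent.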

## REST surface

| Method | Path | Purpose |
|---|---|---|
| `GET`  | `/analysis/flows` | List flows + JSON-schema-derived param forms |
| `GET`  | `/analysis/flows/{flow}` | Single flow detail |
| `POST` | `/analysis/flows/{flow}/preview` | Sync preview against an inline payload |
| `POST` | `/analysis/flows/{flow}/preview-task` | Async preview via Celery (`agents` queue) |
| `GET`  | `/analysis/specs` | List saved specs |
| `POST` | `/analysis/specs` | Persist a new spec (idempotent on hash) |
| `GET`  | `/analysis/specs/{slug}` | Current spec + version history |
| `POST` | `/analysis/specs/{slug}/run` | Kick `AnalysisRuntime.run` via Celery |
| `GET`  | `/analysis/runs` | Paged ledger of runs |
| `GET`  | `/analysis/runs/{id}` | Run detail with joined step results |
| `GET`  | `/analysis/runs/{id}/results/{step}` | DuckDB-driven preview of one step's gold-tier output |
| `GET`  | `/analysis/datasets/columns?identifier=ns.name` | Column / dtype list for the lab forms |

## Persistence schema

Migration `0031_analysis_layer` adds four project-scoped tables:

| Table | Purpose |
|---|---|
| `analysis_specs` | Logical row (latest active version per slug) |
| `analysis_spec_versions` | Immutable hash-locked snapshot |
| `analysis_runs` | One row per `AnalysisRuntime.run()` invocation |
| `analysis_step_results` | One row per `AnalysisStep` in the spec |

`AnalysisRun.iceberg_result_table` is set when a step persists arrow
data; `AnalysisStepResult.artifact_uri` records the per-step
`namespace.name` so the lab can fetch the gold-tier output via DuckDB.

## Adding a new flow

1. Subclass `FlowParams` for the per-flow parameter shape.
2. Decorate a `(df, params, ctx) -> FlowResult` function with
   `@register_analysis_flow(name, namespace, label, ...)`.
3. (optional) Stash a `pyarrow.Table` on `result.arrow_table` to persist
   it under `alphaswarm_gold_analysis_` when run inside a spec.
4. Add a smoke test under `tests/analysis/`.
5. Update the relevant tab in [alphaswarm_docs/analysis-flows.md](../../concepts/strategy/analysis-flows.md).
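
Steps 1 to 3 can be sketched with self-contained stand-ins. The classes and decorator below only mimic the names from `alphaswarm.analysis`; their fields, the registry dict, the `distribution.column_mean` flow, and the list-of-dicts input are all assumptions made so the example runs without the platform installed.

```python
from dataclasses import dataclass, field
from typing import Any, Callable


@dataclass
class FlowParams:
    """Stand-in base class for per-flow parameters (not the real class)."""


@dataclass
class FlowResult:
    """Stand-in result: a payload plus an optional table to persist (step 3)."""
    payload: dict = field(default_factory=dict)
    arrow_table: Any = None


FLOW_REGISTRY: dict = {}  # hypothetical registry keyed by "namespace.name"


def register_analysis_flow(name: str, namespace: str, label: str) -> Callable:
    """Stand-in decorator that records the flow under namespace.name."""
    def decorator(fn: Callable) -> Callable:
        FLOW_REGISTRY[f"{namespace}.{name}"] = fn
        return fn
    return decorator


# Step 1: the per-flow parameter shape.
@dataclass
class ColumnMeanParams(FlowParams):
    column: str = "log_return"


# Step 2: a (df, params, ctx) -> FlowResult function, registered by decorator.
@register_analysis_flow(name="column_mean", namespace="distribution", label="Column mean")
def column_mean(df, params, ctx):
    # `df` is a list of row dicts here; the real flows receive a dataframe.
    values = [row[params.column] for row in df]
    return FlowResult(payload={"mean": sum(values) / len(values)})


rows = [{"log_return": 0.5}, {"log_return": 1.5}]
result = FLOW_REGISTRY["distribution.column_mean"](rows, ColumnMeanParams(), None)
```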

## Don't list

- Don't bypass `AnalysisRuntime` for spec execution — every progress /
  ledger / Iceberg / step-result side-effect is wired through it.
- Don't write to a non-`alphaswarm_gold_analysis_*` namespace from a flow.
- Don't duplicate logic that already lives in
  `alphaswarm.data.factors` / `alphaswarm.data.microstructure` / `alphaswarm.options.normal_model`
  — wrap them as a flow and keep the math in one place.
- Don't add diagrams in non-Mermaid formats.
- Don't put LLM-driven interpretation in a flow; that lives in
  `alphaswarm_agents.analysis.*`.


<!-- https://alpha-swarm.ai/concepts/strategy/analysis-lab -->
# Analysis Lab — interactive analysis builder
> Lives at `/analysis/lab` in the AlphaSwarm webui (Vite frontend). Hybrid surface: dataset-centric tabbed drill-down (primary path) plus an XYFlow Composer (secondary path) for multi-step pipelines

# Analysis Lab — interactive analysis builder

> Backend: [alphaswarm_docs/analysis-framework.md](../../concepts/strategy/analysis-framework.md) · Flow reference: [alphaswarm_docs/analysis-flows.md](../../concepts/strategy/analysis-flows.md).

Lives at `/analysis/lab` in the AlphaSwarm webui (Vite frontend). Hybrid
surface: dataset-centric tabbed drill-down (primary path) plus an
XYFlow Composer (secondary path) for multi-step pipelines.

## Layout

| Tab | Purpose | Driving flows |
| --- | --- | --- |
| **Profiling** | Column profile + null audit + topk + dtypes | `profiling.*` |
| **Distribution** | Descriptive stats / histogram / ECDF / Q-Q + Shapiro-Wilk / Jarque-Bera / K-S | `distribution.*` |
| **Outliers** | Z-score / IQR / Isolation Forest / DBSCAN / LOF / ECOD / pulse-vs-step | `outlier.*` |
| **Time Series** | ADF / KPSS / ACF-PACF / STL / GARCH / change-point / Granger / cointegration / FFT / wavelets / Hurst / Theil-Sen | `time_series.*` |
| **Regression** | OLS diagnostics / White / Breusch-Pagan / VIF | `regression.*` |
| **Imputation** | ffill/bfill / linear / spline / KNN / MICE | `imputation.*` |
| **Derivatives** | BSM + Greeks surface + IV / Monte-Carlo European / barrier / Asian / SABR smile / Bachelier | `derivatives.*` |
| **Portfolio** | Efficient frontier / Ledoit-Wolf / Fama-French 5 rolling / risk parity | `portfolio.*` |
| **Factors** | Alphalens-style IC + quantile spread + turnover | `factors.evaluate` |
| **Composer** | XYFlow canvas — drag analysis nodes, save spec, run via runtime | every namespace |

Each tab loads the relevant flow schemas via `GET /analysis/flows`,
auto-generates the form, and submits to `POST /analysis/flows/{flow}/preview`.
Charts render inline (Plotly figure-dict in the response).

The "Save as spec" button on any tab promotes the current state into
an `AnalysisSpec` and routes to the Composer for multi-step editing
without losing context.

## Routes

| Path | Component |
| --- | --- |
| `/analysis/lab` | Tabbed primary surface |
| `/analysis/lab/composer` | XYFlow Composer (XYFlow canvas + ANALYSIS_PALETTE) |
| `/analysis/runs` | Run ledger (paged) |
| `/analysis/runs/[id]` | Run detail (steps + chart previews) |

The Composer reuses the existing
[`WorkflowEditor`](../alphaswarm_client/src/components/flow/WorkflowEditor.tsx) with
`domain="analysis"`. The serializer turns the canvas graph into an
`AnalysisSpec` payload, persists it via `POST /analysis/specs`, then
queues a run via `POST /analysis/specs/{slug}/run`.

## API surface used

- `GET  /analysis/flows` — flow catalog with JSON-schema params.
- `POST /analysis/flows/{flow}/preview` — sync preview.
- `POST /analysis/flows/{flow}/preview-task` — Celery preview.
- `POST /analysis/specs` — persist (hash-idempotent).
- `POST /analysis/specs/{slug}/run` — queue `AnalysisRuntime.run` task.
- `GET  /analysis/runs` / `GET /analysis/runs/{id}` — ledger.
- `GET  /analysis/runs/{id}/results/{step}` — DuckDB preview of the
  gold-tier output for one step.
- `GET  /analysis/datasets/columns?identifier=ns.name` — column list
  used by the lab's column-autocomplete inputs.

## Cross-links

The lab does not reinvent existing surfaces — it deep-links into them
when the user wants a richer experience:

- Derivatives tab → [`/options/lab`](../alphaswarm_client/src/routes/options/lab/page.tsx)
  for instrument-level workflows.
- Portfolio tab → [`/optimizer`](../alphaswarm_client/src/routes/optimizer/page.tsx)
  for multi-strategy parameter sweeps.
- Factors tab wraps the existing
  [`FactorWorkbench`](../alphaswarm_client/src/components/factors/FactorWorkbench.tsx).
- Visualisations of Iceberg outputs deep-link into
  [`/visualizations`](../alphaswarm_client/src/routes/visualizations/page.tsx).


<!-- https://alpha-swarm.ai/concepts/strategy/backtest-engines -->
# Backtest engines
> AlphaSwarm ships seven interchangeable backtest engines behind a single BaseBacktestEngine ABC. Three tiers: primary vectorised, event-driven for agent-in-the-loop, and a fallback cascade.

# Backtest engines

> Doc map: [intro](../../intro/index.md) ·
> vbt-pro deep dive: [vbtpro-integration](./vbtpro-integration.md) ·
> LOB / tick-replay: [hft-backtest](./hft-backtest.md) ·
> Class hierarchy: [class-diagram](../platform/class-diagram.md) ·
> Worked tutorial: [tutorials/first-backtest](../../tutorials/first-backtest.md) ·
> Recipe: [how-to/recipes/run-a-backtest-from-yaml](../../how-to/recipes/run-a-backtest-from-yaml.md).

AlphaSwarm runs every backtest through one of seven interchangeable engines
behind the
[`BaseBacktestEngine`](https://github.com/julianwileymac/alphaswarm/blob/main/alphaswarm/backtest/base.py)
ABC. The runner, persistence, MLflow tracking, and UI never branch on
which engine produced a run — every engine returns the same
[`BacktestResult`](https://github.com/julianwileymac/alphaswarm/blob/main/alphaswarm/backtest/result.py).

The seven engines fall into **three tiers** so you can pick one
without scanning a 7-row table every time:

```mermaid
flowchart TB
    Strategy["IStrategy / FrameworkAlgorithm"] --> Runner["alphaswarm.backtest.runner.run_backtest_from_config"]
    Runner --> Primary
    Runner --> Loop
    Runner --> Cascade

    subgraph Primary [Tier 1: vectorised primary]
        Vbtpro["VectorbtProEngine (5 modes)"]
    end

    subgraph Loop [Tier 2: per-bar Python loop]
        Event["EventDrivenBacktester (agent dispatch)"]
        Hft["LobBacktestEngine (hftbacktest LOB)"]
    end

    subgraph Cascade [Tier 3: fallback cascade]
        FallbackEngine["FallbackBacktestEngine"]
        Vbt["VectorbtEngine (OSS)"]
        Bt["BacktestingPyEngine"]
        Zvt["ZvtBacktestEngine"]
        Aat["AatBacktestEngine"]
    end

    FallbackEngine --> Vbtpro
    FallbackEngine -.fallback.-> Event
    FallbackEngine -.fallback.-> Vbt
    FallbackEngine -.fallback.-> Bt
    FallbackEngine -.fallback.-> Zvt
    FallbackEngine -.fallback.-> Aat
```

## Tier 1 — Vectorised primary (`VectorbtProEngine`)

Default for research workloads, parameter screens, walk-forward
optimisation, factor studies, and any backtest that does not need
per-bar Python.

Five constructor modes select the inner vbt-pro path:

- `signals` — array-based entries / exits / sizing
- `orders` — column-of-orders DataFrame
- `optimizer` — built-in vbt-pro `Param` sweeps
- `holding` — buy-and-hold baseline
- `random` — random-signal baseline

Implementation:
[alphaswarm/backtest/vbtpro/engine.py::VectorbtProEngine](https://github.com/julianwileymac/alphaswarm/blob/main/alphaswarm/backtest/vbtpro/engine.py).
Full mode dispatch + Numba-jit constraints in
[vbtpro-integration](./vbtpro-integration.md).

## Tier 2 — Per-bar Python loop

Two engines run a true Python `on_bar` callback. Use them when you
need synchronous decisions inside the inner loop — agent dispatch,
event-sourced LOB replay, custom callbacks vbt-pro can't represent.

- [`EventDrivenBacktester`](https://github.com/julianwileymac/alphaswarm/blob/main/alphaswarm/backtest/event_driven.py) —
  the only engine that exposes `context['agents']` to strategies via
  [`AgentDispatcher`](https://github.com/julianwileymac/alphaswarm/blob/main/alphaswarm/strategies/agentic/agent_dispatcher.py),
  with TTL + LRU dedup of LLM calls.
- [`LobBacktestEngine`](https://github.com/julianwileymac/alphaswarm/blob/main/alphaswarm/backtest/hft.py) —
  hftbacktest-driven LOB tick replay; latency + queue models;
  market-making + execution strategies.

## Tier 3 — Fallback cascade

[`FallbackBacktestEngine`](https://github.com/julianwileymac/alphaswarm/blob/main/alphaswarm/backtest/fallback.py)
tries `primary` first, then walks `fallbacks` until one returns a
`BacktestResult`. The OSS engines exist mainly as cascade fallbacks
and for license-constrained deployments:

- `VectorbtEngine` — OSS vectorbt; signals only (Apache-2.0).
- `BacktestingPyEngine` — single-symbol with `.optimize(...)`
  grid + SAMBO (AGPL-3.0).
- `ZvtBacktestEngine` — permissive-licence CN-bar fallback (MIT).
- `AatBacktestEngine` — async / synthetic LOB fallback (Apache-2.0).

NautilusTrader is **not** wired in (LGPL-3.0; out of scope).
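
The cascade behaviour described above (try `primary`, then walk `fallbacks` until one returns a result) can be sketched in a few lines. This is an illustration, not the `FallbackBacktestEngine` implementation: the toy engines are plain callables, and treating any raised exception as "move to the next engine" is an assumption.

```python
def run_with_fallback(primary, fallbacks, config):
    """Return (engine_name, result) from the first engine that succeeds."""
    failures = []
    for name, engine in [primary, *fallbacks]:
        try:
            return name, engine(config)
        except Exception as exc:  # assumption: any exception means "try the next engine"
            failures.append(f"{name}: {exc}")
    raise RuntimeError("every engine in the cascade failed: " + "; ".join(failures))


def licensed_engine(config):
    # Toy primary that fails the way a missing licensed package might.
    raise ImportError("vectorbtpro is not installed")


def event_engine(config):
    # Toy fallback that returns a minimal result dict.
    return {"engine": "event", "initial_cash": config["initial_cash"]}


used, result = run_with_fallback(
    ("vbt-pro", licensed_engine),
    [("event", event_engine)],
    {"initial_cash": 100000},
)
```

Returning the name of the engine that actually ran mirrors the way the runner stamps `engine` into the run metrics, so a silent fallback stays visible downstream.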

## EngineCapabilities

Every engine declares its surface via
[`EngineCapabilities`](https://github.com/julianwileymac/alphaswarm/blob/main/alphaswarm/backtest/capabilities.py)
on the class attribute. Agents introspect via the
`engine_capabilities` tool; humans can call
`alphaswarm.backtest.engine_capabilities_index()`.

```mermaid
flowchart LR
    subgraph caps [EngineCapabilities flags]
        signals
        orders
        callbacks
        multiAsset[multi-asset]
        shorts
        leverage
        lob
        asyncFlag[async]
        perBar[per-bar Python]
        optimizer
        wfo[walk-forward]
        agentDispatch[agent dispatch]
        rlInjection[supports_rl_injection]
    end

    Vbtpro["VectorbtProEngine"] -. signals, orders, callbacks, multi-asset, shorts, leverage, optimizer, walk-forward, rl-injection .-> caps
    Event["EventDrivenBacktester"] -. signals, orders, callbacks, multi-asset, shorts, per-bar Python, agent dispatch, walk-forward, rl-injection .-> caps
    Hft["LobBacktestEngine"] -. lob, async, per-bar Python, multi-asset, shorts, agent dispatch .-> caps
    Bt["BacktestingPyEngine"] -. signals, shorts, leverage .-> caps
    Zvt["ZvtBacktestEngine"] -. signals, multi-asset, per-bar Python .-> caps
    Aat["AatBacktestEngine"] -. signals, orders, multi-asset, shorts, lob, async, per-bar Python .-> caps
    Vbt["VectorbtEngine"] -. signals, multi-asset, shorts .-> caps
```

Pick by capability:

- **Vectorised research / parameter screens / WFO** → `VectorbtProEngine`
- **Per-bar agent dispatch (LLM in the loop)** → `EventDrivenBacktester`
- **LOB tick replay, latency + queue modelling** → `LobBacktestEngine`
- **Synthetic LOB realism (OSS path)** → `AatBacktestEngine`
- **Chinese-market data** → `ZvtBacktestEngine`
- **Single-symbol grid optimisation** → `BacktestingPyEngine` with
  `.optimize(ranges, method="grid"|"sambo", ...)`

## When NOT to use the primary engine

The vbt-pro inner loop is Numba-jit compiled — `signal_func_nb` /
`order_func_nb` cannot call Python objects per bar. Two patterns
this rules out:

1. **Per-bar agent consults.** Switch to `EventDrivenBacktester` and
   call `context['agents'].consult(spec_name, inputs, ttl=...)` from
   inside `on_bar`. The
   [`AgentDispatcher`](https://github.com/julianwileymac/alphaswarm/blob/main/alphaswarm/strategies/agentic/agent_dispatcher.py)
   handles TTL + LRU dedup so the LLM gateway is not hammered.
2. **Per-bar custom Python that vbt-pro cannot express.** If the
   inner loop needs a stateful Python object (custom risk model,
   bespoke order book heuristics), use event-driven.

If you _can_ vectorise — or precompute a panel of decisions ahead of
time — use vbt-pro `AgenticVbtAlpha` in precompute mode. The
`vectorbtpro` mode dispatch lives in
[vbtpro-integration](./vbtpro-integration.md).
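
The TTL + LRU dedup mentioned for per-bar agent consults can be sketched as a small cache. This is a generic illustration of the technique, not the `AgentDispatcher` code: the class name, the `get_or_call` method, and the injectable clock are all hypothetical.

```python
import time
from collections import OrderedDict


class TtlLruCache:
    """Cache that expires entries after a TTL and evicts the least recently used."""

    def __init__(self, max_entries, clock=time.monotonic):
        self._max = max_entries
        self._clock = clock
        self._entries = OrderedDict()  # key -> (expires_at, value)

    def get_or_call(self, key, ttl, fn):
        now = self._clock()
        hit = self._entries.get(key)
        if hit is not None and hit[0] > now:
            self._entries.move_to_end(key)  # mark as most recently used
            return hit[1]
        value = fn()  # miss or expired: make the expensive call once
        self._entries[key] = (now + ttl, value)
        self._entries.move_to_end(key)
        while len(self._entries) > self._max:
            self._entries.popitem(last=False)  # drop the least recently used entry
        return value


fake_now = [0.0]  # controllable clock so the example is deterministic
cache = TtlLruCache(max_entries=2, clock=lambda: fake_now[0])
calls = []


def consult():
    calls.append(1)  # stands in for an LLM gateway call
    return "hold"


cache.get_or_call(("risk_agent", "SPY"), ttl=60.0, fn=consult)  # miss, calls consult
cache.get_or_call(("risk_agent", "SPY"), ttl=60.0, fn=consult)  # hit within the TTL
fake_now[0] = 61.0
cache.get_or_call(("risk_agent", "SPY"), ttl=60.0, fn=consult)  # expired, calls again
```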

## Dispatching from YAML

Three equivalent ways to pick an engine inside a strategy recipe:

```yaml
# 1) Engine shortcut (cleanest).
backtest:
  engine: vbt-pro:signals    # or vbt-pro:orders / :optimizer / :holding / :random
  kwargs:
    initial_cash: 100000
    fees: 0.0005

# 2) Explicit class + module.
backtest:
  class: VectorbtProEngine
  module_path: alphaswarm.backtest.vbtpro.engine
  kwargs:
    mode: orders
    initial_cash: 100000

# 3) Fallback cascade.
backtest:
  engine: fallback
  primary: vbt-pro
  fallbacks: [event, aat, zvt, vectorbt]
```

| Shortcut | Resolves to | Notes |
| --- | --- | --- |
| `default` / `event` / `event-driven` | `EventDrivenBacktester` | Backward-compatible default. |
| `primary` / `vbt-pro` / `vectorbt-pro` | `VectorbtProEngine` | Tier 1. |
| `vbt-pro:signals` / `:orders` / `:optimizer` / `:holding` / `:random` | `VectorbtProEngine` | Mode injection. |
| `vectorbt` / `vbt` | `VectorbtEngine` | OSS fallback. |
| `backtesting` / `bt` | `BacktestingPyEngine` | Single-symbol. |
| `zvt` | `ZvtBacktestEngine` | Lazy import; CN bars. |
| `aat` | `AatBacktestEngine` | Lazy import; async LOB. |
| `hft` / `lob` | `LobBacktestEngine` | Tick replay. |
| `fallback` / `cascade` | `FallbackBacktestEngine` | Cascade with `DEFAULT_FALLBACK_CHAIN = ("event", "aat", "zvt", "vectorbt")`. |

[`alphaswarm.backtest.runner.run_backtest_from_config`](https://github.com/julianwileymac/alphaswarm/blob/main/alphaswarm/backtest/runner.py)
routes every YAML through the right engine and stamps `engine` into
`BacktestRun.metrics`.
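
The shortcut table can be read as a lookup plus an optional mode suffix. The sketch below is one possible way to resolve a shortcut string; `resolve_engine` is a hypothetical function, and passing the mode through as a `mode` keyword follows the explicit-class YAML example rather than the runner's actual code.

```python
ENGINE_SHORTCUTS = {
    "default": "EventDrivenBacktester",
    "event": "EventDrivenBacktester",
    "event-driven": "EventDrivenBacktester",
    "primary": "VectorbtProEngine",
    "vbt-pro": "VectorbtProEngine",
    "vectorbt-pro": "VectorbtProEngine",
    "vectorbt": "VectorbtEngine",
    "vbt": "VectorbtEngine",
    "backtesting": "BacktestingPyEngine",
    "bt": "BacktestingPyEngine",
    "zvt": "ZvtBacktestEngine",
    "aat": "AatBacktestEngine",
    "hft": "LobBacktestEngine",
    "lob": "LobBacktestEngine",
    "fallback": "FallbackBacktestEngine",
    "cascade": "FallbackBacktestEngine",
}
VBTPRO_MODES = {"signals", "orders", "optimizer", "holding", "random"}


def resolve_engine(shortcut: str):
    """Split 'name[:mode]' into (class_name, kwargs)."""
    base, _, mode = shortcut.partition(":")
    if base not in ENGINE_SHORTCUTS:
        raise ValueError(f"unknown engine shortcut: {base!r}")
    class_name = ENGINE_SHORTCUTS[base]
    kwargs = {}
    if mode:
        # Only the vbt-pro engine accepts a mode suffix in the table above.
        if class_name != "VectorbtProEngine" or mode not in VBTPRO_MODES:
            raise ValueError(f"mode {mode!r} is not valid for {class_name}")
        kwargs["mode"] = mode
    return class_name, kwargs


resolved_orders = resolve_engine("vbt-pro:orders")
resolved_lob = resolve_engine("lob")
```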

## Agent + ML components

Strategies plug agents and ML models into either path:

- **Vectorised (vbt-pro)** — panel components in
  [alphaswarm/strategies/vbtpro/](https://github.com/julianwileymac/alphaswarm/tree/main/alphaswarm/strategies/vbtpro):
  - `AgenticVbtAlpha` — precompute or per-window agent dispatch into
    wide entries / exits / size arrays.
  - `MLVbtAlpha` — wraps any `alphaswarm_models.base.Model` (or MLflow URI)
    and emits arrays via threshold / top-k / rank policies.
  - `AgenticOrderModel` — drives `Portfolio.from_orders` from cached
    agent decisions.
- **Event-driven** — `context['agents']` exposes `AgentDispatcher`.
  See
  [`AgentAwareMomentumAlpha`](https://github.com/julianwileymac/alphaswarm/blob/main/alphaswarm/strategies/agentic/agent_aware_alpha.py)
  for a worked example.

For RL injection, every engine that declares
`EngineCapabilities.supports_rl_injection=True` accepts the
[`WeightCentricPipeline`](https://github.com/julianwileymac/alphaswarm/blob/main/alphaswarm_rl/src/alphaswarm_rl/portfolio/pipeline.py)
output through `context['rl_agent']` (AGENTS rule 38).

## Unified result shape

Every engine returns a `BacktestResult` with:

- `equity_curve: pd.Series` indexed by timestamp.
- `trades: pd.DataFrame` with `timestamp, vt_symbol, side, quantity,
  price, commission, slippage, strategy_id`.
- `orders: pd.DataFrame`.
- `summary: dict` — `sharpe`, `sortino`, `max_drawdown`, `calmar`,
  `total_return`, `final_equity`, `n_bars`, `volatility_ann`,
  `n_trades`, `turnover`, `engine`. Engine-specific keys live under
  `vbt_*`, `bt_*`, `zvt_*`, `aat_*`, `hft_*` so downstream code can
  light up native stats without re-running.
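
To make a few of the `summary` keys concrete, here is a rough sketch that derives them from an equity curve. It is not the platform's metric code: simple returns, a zero risk-free rate, and 252 periods per year are assumptions, and the sample points are made up.

```python
import math
import statistics


def summarise_equity(equity, periods_per_year=252):
    """Compute a subset of summary-style metrics from a list of equity values."""
    returns = [b / a - 1.0 for a, b in zip(equity, equity[1:])]

    # Max drawdown: worst decline from a running peak, as a negative fraction.
    peak = equity[0]
    max_drawdown = 0.0
    for value in equity:
        peak = max(peak, value)
        max_drawdown = min(max_drawdown, value / peak - 1.0)

    vol_ann = statistics.stdev(returns) * math.sqrt(periods_per_year)
    mean_ann = statistics.fmean(returns) * periods_per_year
    return {
        "total_return": equity[-1] / equity[0] - 1.0,
        "final_equity": equity[-1],
        "n_bars": len(equity),
        "max_drawdown": max_drawdown,
        "volatility_ann": vol_ann,
        "sharpe": mean_ann / vol_ann if vol_ann > 0 else float("nan"),
    }


summary = summarise_equity([100.0, 110.0, 99.0, 120.0])
```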

## Hash-locked specs + audit ledger

Every dispatched backtest writes a row to
[`backtest_runs`](https://github.com/julianwileymac/alphaswarm/blob/main/alphaswarm/persistence/models.py)
with `experiment_id` (AGENTS rule 34) and a reference to the
hash-locked
[`StrategySpec`](https://github.com/julianwileymac/alphaswarm/tree/main/alphaswarm/strategies)
version. The same spec hash returns the same `*_spec_versions` row
on re-dispatch; content changes always create a new version. This
makes every backtest replayable.

Gold-tier output lands at `alphaswarm_gold_backtests.run_` via
[`iceberg_catalog.append_arrow`](https://github.com/julianwileymac/alphaswarm/blob/main/alphaswarm/data/iceberg_catalog.py)
with `medallion_layer="gold"` (AGENTS rule 3, rule 21).

## Worked example: dispatch + tearsheet

Goal: dispatch a backtest, tail its WebSocket frames, list the
ledger row via DataMCP, render an equity curve in your browser.

### Step 1 — dispatch


### Step 2 — tail the WebSocket

```bash
curl -N http://localhost:8000/chat/stream/
```

Frames arrive in the canonical `{task_id, stage, message, timestamp,
**extras}` envelope (AGENTS rule 4). Expected stages:
`start` → `bar.processed` (×N) → `metrics.computed` → `done`.
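
A consumer of these frames might validate the envelope and the stage order along the following lines. This is a sketch under assumptions: it treats frames as JSON objects with the extras flattened into the top level, the sample frames are invented, and it accepts zero or more `bar.processed` frames.

```python
import json

REQUIRED_KEYS = ("task_id", "stage", "message", "timestamp")


def parse_frame(raw: str) -> dict:
    """Split a frame into its required envelope keys plus an 'extras' dict."""
    frame = json.loads(raw)
    missing = [key for key in REQUIRED_KEYS if key not in frame]
    if missing:
        raise ValueError(f"frame missing keys: {missing}")
    parsed = {key: frame[key] for key in REQUIRED_KEYS}
    parsed["extras"] = {k: v for k, v in frame.items() if k not in REQUIRED_KEYS}
    return parsed


def stages_in_expected_order(stages: list) -> bool:
    """Check start, then only bar.processed frames, then metrics.computed, then done."""
    if len(stages) < 3 or stages[0] != "start":
        return False
    if stages[-2:] != ["metrics.computed", "done"]:
        return False
    return all(stage == "bar.processed" for stage in stages[1:-2])


raw_frames = [
    '{"task_id": "t-1", "stage": "start", "message": "queued", "timestamp": "2026-01-01T00:00:00Z"}',
    '{"task_id": "t-1", "stage": "bar.processed", "message": "bar 1", "timestamp": "2026-01-01T00:00:01Z", "bar_index": 1}',
    '{"task_id": "t-1", "stage": "metrics.computed", "message": "metrics", "timestamp": "2026-01-01T00:00:02Z"}',
    '{"task_id": "t-1", "stage": "done", "message": "finished", "timestamp": "2026-01-01T00:00:03Z", "run_id": "r-1"}',
]
frames = [parse_frame(raw) for raw in raw_frames]
ordered = stages_in_expected_order([frame["stage"] for frame in frames])
```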

### Step 3 — list via DataMCP

The `data.backtests.list` tool is the agent-safe alternative to a
raw `SELECT * FROM backtest_runs`. From any MCP client:

```bash
curl -X POST http://localhost:8000/mcp/data/tools/data.backtests.list/invoke \
    -H "Content-Type: application/json" \
    -H "Authorization: Bearer $(alphaswarm-cli auth token)" \
    -d '{"limit": 5, "order_by": "started_at_desc"}'
```

### Step 4 — equity curve in Pyodide

Render the equity curve client-side from inline sample points so the
snippet stays self-contained. Replace with a fetch to
`/analytics/portfolio//equity-curve.json` when running against
the real platform.


### Step 5 — verify

- `backtest_runs` row with non-NULL `sharpe`, `engine='VectorbtProEngine'`.
- WebSocket emitted a `stage=done` frame with the matching `run_id`.
- `alphaswarm_gold_backtests.run_` Iceberg table exists.
- `data.backtests.describe { run_id }` MCP call returns the full row.

### What next

- Run the full tutorial: [tutorials/first-backtest](../../tutorials/first-backtest.md).
- Make it repeatable: [how-to/recipes/run-a-backtest-from-yaml](../../how-to/recipes/run-a-backtest-from-yaml.md).
- Add a new strategy: [how-to/recipes/add-a-strategy](../../how-to/recipes/add-a-strategy.md).
- Promote to paper: [how-to/recipes/promote-a-bot-to-paper](../../how-to/recipes/promote-a-bot-to-paper.md).

## Deeper reads

- [vbtpro-integration](./vbtpro-integration.md) — vbt-pro mode dispatch, Numba constraints, hooks, walk-forward, `Param` sweeps, `IndicatorFactory` bridge.
- [hft-backtest](./hft-backtest.md) — LOB engine, latency profiles, queue models, the five HFT strategies under `alphaswarm/strategies/hft/`.
- [strategy-lifecycle](./strategy-lifecycle.md) — draft → backtested → paper → live.
- [strategy-development](./strategy-development.md) — composer / simulation / ideation / single / batch / compare routes in the operator UI.
- [factor-research](./factor-research.md) — building factor / alpha strategies.
- [ml-alpha-backtest](./ml-alpha-backtest.md) — `AlphaBacktestExperiment` orchestrator + `MLAlphaBacktestRun` schema.
- [class-diagram](../platform/class-diagram.md) — full engine class hierarchy + `BacktestResult` shape.
- [reference/api](../../reference/api/index.mdx) — the `backtest` tag (interactive playground).
- [reference/python](../../reference/python/index.mdx) — auto-generated reference for `alphaswarm.backtest` and `alphaswarm.strategies`.


<!-- https://alpha-swarm.ai/concepts/strategy/cross-market-arbitrage -->
# Cross-market arbitrage
> The platform ships two cross-market arbitrage paths:

# Cross-market arbitrage

> Status: **Phase 5 shipped**. Combined deliverable across Phase 1
> (InstrumentADR / InstrumentGDR), Phase 4
> ([`alphaswarm/math/arbitrage.py`](../alphaswarm/math/arbitrage.py)), and Phase 5
> (DataMCP tools + strategy templates).

## Two flavours

The platform ships two cross-market arbitrage paths:

### A/H share -- mainland China ↔ Hong Kong

Some Chinese companies have dual-listed shares: A-shares in CNY on the
SSE / SZSE, H-shares in HKD on HKEX. Same legal entity, same
economic rights, different regulatory regime + liquidity + currency.
The basis mean-reverts toward zero but periodically diverges
sharply (Stock Connect inflow / outflow, regulatory news, FX
volatility).

Math: [`ah_share_basis()`](../alphaswarm/math/arbitrage.py) in
:mod:`alphaswarm.math.arbitrage`. Computes the FX-adjusted implied H-share
price from the A-share, subtracts the observed H-share price, and
classifies the arbitrage direction.

Agent surface: ``data.arbitrage.ah_share_basis`` (single-point) and
the ``arbitrage.ah_share_basis`` AnalysisFlow (time series).
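
The calculation described above can be sketched as follows. This is not the body of `ah_share_basis()`: the function name, the bps denominator, and the direction labels are assumptions, and the prices are illustrative numbers rather than market data.

```python
def ah_basis_sketch(a_price_cny: float, h_price_hkd: float, hkd_per_cny: float) -> dict:
    """FX-adjust the A-share price into HKD, then compare with the observed H-share."""
    implied_h_hkd = a_price_cny * hkd_per_cny   # A-share price expressed in HKD
    basis = implied_h_hkd - h_price_hkd         # implied minus observed, as in the text
    basis_bps = basis / h_price_hkd * 10_000    # assumption: quoted relative to the H price
    if basis > 0:
        direction = "long_h_short_a"            # H trades cheap relative to A
    elif basis < 0:
        direction = "long_a_short_h"            # H trades rich relative to A
    else:
        direction = "flat"
    return {
        "implied_h_hkd": implied_h_hkd,
        "basis": basis,
        "basis_bps": basis_bps,
        "direction": direction,
    }


ah = ah_basis_sketch(a_price_cny=20.0, h_price_hkd=20.0, hkd_per_cny=1.25)
```

With these inputs the implied H-share price is 25.0 HKD against an observed 20.0 HKD, so the sketch reports a positive basis and points at the cheaper H-share leg.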

### ADR ↔ underlying foreign equity

A foreign company creates an American Depositary Receipt to list
on a US venue. 1 ADR represents ``conversion_ratio`` shares of the
underlying. The basis (ADR USD price minus the conversion-adjusted
underlying USD-equivalent) should be near zero plus the depositary
fee; persistent divergence is the arbitrage.

Math: [`adr_basis()`](../alphaswarm/math/arbitrage.py) reads the
``conversion_ratio`` directly from the Phase 1
:class:`InstrumentADR` row (via the
``data.arbitrage.adr_underlying_basis`` MCP tool), then computes
the basis exactly as in the A/H case.

Agent surface: ``data.arbitrage.adr_underlying_basis`` (single-point)
and the ``arbitrage.adr_basis`` AnalysisFlow (time series).
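
As a worked illustration of the ADR basis definition (ADR USD price minus the conversion-adjusted underlying in USD), consider the sketch below. It is not the `adr_basis()` implementation: the function name and bps convention are assumptions, the depositary fee is ignored, and the prices are made-up round numbers.

```python
def adr_basis_sketch(
    adr_price_usd: float,
    underlying_price_local: float,
    usd_per_local: float,
    conversion_ratio: float,
) -> dict:
    """Compare an ADR price with the USD value of the shares it represents."""
    # One ADR represents `conversion_ratio` underlying shares.
    fair_adr_usd = underlying_price_local * conversion_ratio * usd_per_local
    basis = adr_price_usd - fair_adr_usd
    basis_bps = basis / fair_adr_usd * 10_000  # assumption: relative to the fair value
    return {"fair_adr_usd": fair_adr_usd, "basis": basis, "basis_bps": basis_bps}


# Illustrative inputs: ratio of 8, underlying at 80.0 HKD, 0.125 USD per HKD.
adr = adr_basis_sketch(
    adr_price_usd=80.5,
    underlying_price_local=80.0,
    usd_per_local=0.125,
    conversion_ratio=8,
)
```

Here eight underlying shares are worth 80.0 USD, so an ADR quoted at 80.5 USD carries a basis of 0.5 USD, or 62.5 bps, before fees and trading costs.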

## Full pipeline (BABA example)

```mermaid
flowchart LR
    A[Agent query] -->|"what's BABA basis?"| ID[data.identity.resolve]
    ID -->|"instrument_id"| DR[data.instruments.depositary_receipts]
    DR -->|"conversion_ratio=8"| ADR[data.arbitrage.adr_underlying_basis]
    ADR -->|"basis_bps + direction"| Agent
    Agent -->|"if abs(basis) > threshold"| TM[Strategy template]
    TM -->|"adr_basis_arbitrage.yaml"| Bot[BotRuntime]
    Bot -->|"submit_list(oco)"| Broker
```

1. Agent resolves BABA ticker to its current instrument_id at the
   ``as_of`` timestamp.
2. The depositary-receipts tool returns the ADR's
   ``conversion_ratio`` (8) and the underlying's vt_symbol
   (``9988.HKEX``).
3. The arbitrage tool computes the basis given current prices + FX.
4. If the basis exceeds the cost-adjusted threshold, the agent
   instantiates the strategy template
   ``configs/strategy_templates/adr_basis_arbitrage.yaml`` (a
   :class:`Resource` row with ``resource_type='strategy_template'``).
5. The bot runtime submits a two-leg OCO order list (long ADR + short
   underlying, or vice versa) through the Phase 2 contingency manager.
6. Exit: the contingency manager auto-cancels the peer when one leg
   fills; the strategy emits an explicit close when the basis reverts.

## Common gotchas

* **FX volatility eats the alpha.** A/H share arbitrage is
  FX-unhedged unless the strategy template explicitly enables hedging
  (``fx_hedge_required: true`` in the YAML). For ADR basis trades,
  hedging the FX leg via a forward / futures position is almost
  always worth the cost.

* **Conversion ratio changes.** Depositary banks announce conversion
  changes; the InstrumentADR row gets updated by the
  corporate-action pipeline. Strategies that hardcode the ratio break the
  moment that happens; use the MCP tool's auto-lookup instead.

* **Settlement asymmetry.** ADR settles T+1 in the US; the
  underlying may settle on a different cycle, such as T+2 (Hong Kong,
  Tokyo). The MCP tool returns the basis as of the current moment,
  but a strategy executing on it has to plan for the settlement gap.

* **Stock Connect quotas.** Mainland-to-Hong Kong flow has daily
  quotas; an A-H basis trade may not be executable on a given day
  because the southbound (or northbound) capacity is exhausted.
  The strategy template enables a `quota_aware` check in
  Phase 5+.

## Strategy templates

Two templates ship pre-built (Phase 5, polymorphic Resources):

* [`configs/strategy_templates/ah_share_arbitrage.yaml`](../configs/strategy_templates/ah_share_arbitrage.yaml)
* [`configs/strategy_templates/adr_basis_arbitrage.yaml`](../configs/strategy_templates/adr_basis_arbitrage.yaml)

Cloning a template into a workspace emits a ``ResourceRelation`` row
with ``relation='translated_from'`` so the ownership graph audits
provenance (AGENTS rule 35). The cloned strategy is then editable
in the workspace; the original template is read-only and shared.


<!-- https://alpha-swarm.ai/concepts/strategy/execution-paths -->
# Execution paths: WebSocket priority + queue-preserving amendment
> The Nautilus issue [#4000](https://github.com/nautechsystems/nautilus_trader/issues/4000) documents the cost of using REST for amendment: most REST `PATCH` endpoints actually implement amendment as ca...

# Execution paths: WebSocket priority + queue-preserving amendment

> Status: **Phase 2 shipped** (Alembic 0041). Amendment manager:
> [`alphaswarm/trading/execution/amendment.py`](../alphaswarm/trading/execution/amendment.py).

## Why WebSocket-first

The Nautilus issue [#4000](https://github.com/nautechsystems/nautilus_trader/issues/4000)
documents the cost of using REST for amendment: most REST `PATCH`
endpoints actually implement amendment as cancel + recreate. The
modified order takes a NEW venue order id and goes to the back of
the limit order book queue at the new price. For market-making
strategies this is a non-starter -- the queue position IS the alpha.

Phase 2's :class:`IDomainBrokerage` declares two capability flags:

* :attr:`IDomainBrokerage.supports_websocket_amend` -- the venue has a
  WS endpoint that modifies the order in place
* :attr:`IDomainBrokerage.supports_oco` -- the venue accepts an atomic
  OCO submission

When both are True, the broker is "Phase 2 ready" and the
:class:`AmendmentManager` routes each change as follows:

| Change | WS amend supported | Routing |
| --- | --- | --- |
| Trigger price (stop / MIT / trailing-stop) | True | ``WS_AMEND`` |
| Trigger price | False | ``CANCEL_RESUBMIT`` |
| Quantity-down on limit | True | ``WS_AMEND`` |
| Quantity-up on limit | True (if policy allows) | ``WS_AMEND`` |
| Quantity-up on limit | False (default policy) | ``CANCEL_RESUBMIT`` |
| Price change | Any | ``CANCEL_RESUBMIT`` |

Price changes always go cancel + resubmit because the modified order
takes the back of the queue at the new price anyway.
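
The routing table can be expressed as a small decision function. This is a sketch, not the `AmendmentManager` logic: the change labels and the `allow_quantity_up` flag are hypothetical names, and routing a quantity-down without WS support to cancel + resubmit is an assumption the table does not spell out.

```python
from enum import Enum


class Routing(Enum):
    WS_AMEND = "WS_AMEND"
    CANCEL_RESUBMIT = "CANCEL_RESUBMIT"


KNOWN_CHANGES = {"trigger_price", "quantity_down", "quantity_up", "price"}


def choose_routing(change: str, supports_ws_amend: bool, allow_quantity_up: bool = False) -> Routing:
    """Pick a route for one amendment, following the table above."""
    if change not in KNOWN_CHANGES:
        raise ValueError(f"unknown change type: {change!r}")
    if change == "price":
        # Queue position is lost at the new price either way.
        return Routing.CANCEL_RESUBMIT
    if not supports_ws_amend:
        return Routing.CANCEL_RESUBMIT
    if change == "quantity_up" and not allow_quantity_up:
        # Default policy: quantity increases do not use the in-place path.
        return Routing.CANCEL_RESUBMIT
    return Routing.WS_AMEND


route_reduce = choose_routing("quantity_down", supports_ws_amend=True)
route_price = choose_routing("price", supports_ws_amend=True)
```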

## Atomic request id counter

The amendment manager's
:class:`alphaswarm.trading.execution.amendment.AtomicRequestIdCounter` mirrors
Rust's ``AtomicU64`` via :class:`threading.Lock` +
:class:`itertools.count`. Each ``next_id()`` returns a monotonically
increasing 64-bit-safe int that the manager uses as the WS message id.

Why is this important?

* WebSocket amend / cancel messages are dispatched asynchronously --
  the response comes back over the same connection with the matching
  request id.
* If two amendments race (the strategy emits a new amendment before
  the previous one's response arrives), the manager needs to
  disambiguate which response belongs to which intent.
* The counter is gap-free under concurrency, so the matching state
  table stays correct even when 10+ amendments are inflight.
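
A minimal sketch of the lock-plus-`itertools.count` pattern described above. The class name `RequestIdCounter` and the starting value of 1 are assumptions for illustration; the real `AtomicRequestIdCounter` may differ in both.

```python
import itertools
import threading
from concurrent.futures import ThreadPoolExecutor


class RequestIdCounter:
    """Lock-guarded counter; a stand-in for AtomicRequestIdCounter."""

    def __init__(self, start=1):
        self._lock = threading.Lock()
        self._count = itertools.count(start)

    def next_id(self):
        # The lock makes "read and advance" one atomic step, so no two
        # callers ever see the same id and no id is skipped.
        with self._lock:
            return next(self._count)


counter = RequestIdCounter()
with ThreadPoolExecutor(max_workers=8) as pool:
    ids = list(pool.map(lambda _: counter.next_id(), range(1000)))
```

Sorting `ids` yields every integer from 1 to 1000 exactly once, which is the gap-free, duplicate-free property the matching state table relies on.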

## Fallback semantics

When the WS amend fails (network drop, venue rejection, policy
disallowing the change), the manager:

1. Logs at WARNING level with the original exception.
2. Falls through to cancel + resubmit using the broker's
   :meth:`IDomainBrokerage.cancel` + :meth:`IDomainBrokerage.submit`.
3. Returns an :class:`AmendmentResult` with
   ``routing=CANCEL_RESUBMIT`` so the caller knows queue position was
   lost.

This is the "WS primary path with REST fallback" pattern from the
Nautilus issue. Callers don't have to know which route was used --
the result tells them.
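
The three steps can be sketched as a single coroutine. This is an illustration under assumptions, not the manager's code: `amend_with_fallback` and `SketchResult` are hypothetical names, and the two callables stand in for `broker.ws_amend` and `broker.cancel_resubmit`.

```python
import asyncio
import logging
from dataclasses import dataclass

log = logging.getLogger("amendment_sketch")


@dataclass
class SketchResult:  # hypothetical stand-in for AmendmentResult
    routing: str


async def amend_with_fallback(ws_amend, cancel_resubmit, request):
    """Try the in-place WS amend first; fall back to cancel + resubmit."""
    try:
        await ws_amend(request)
        return SketchResult(routing="WS_AMEND")
    except Exception:
        # Step 1: log at WARNING level with the original exception attached.
        log.warning("WS amend failed, falling back", exc_info=True)
        # Step 2: take the slow path; queue position is lost here.
        await cancel_resubmit(request)
        # Step 3: tell the caller which route was actually used.
        return SketchResult(routing="CANCEL_RESUBMIT")


async def _failing_ws(request):
    raise ConnectionError("socket dropped")


async def _ok(request):
    return None


fallback_result = asyncio.run(amend_with_fallback(_failing_ws, _ok, {"quantity": 5}))
primary_result = asyncio.run(amend_with_fallback(_ok, _ok, {"quantity": 5}))
```

The caller inspects only `routing`; it never needs to catch the transport error itself.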

## Code example

```python
from decimal import Decimal
from alphaswarm.trading.execution import AmendmentManager, AmendmentRequest

mgr = AmendmentManager(
    ws_amend=broker.ws_amend,           # async callable
    cancel_resubmit=broker.cancel_resubmit,  # async callable
)

# Reduce a 10-lot limit order to 5 lots without losing queue position
result = await mgr.amend(
    AmendmentRequest(
        client_order_id=order.client_order_id,
        quantity=Decimal("5"),
    ),
    current_order=order,
)
print(result.routing, result.elapsed_ms)
```

## Persistence

Every amendment ultimately produces one or more
:class:`ExecutionReport` rows in ``execution_reports``. The
:class:`ExecutionReportDispatcher` writes them; the
``(venue, venue_execution_id)`` unique index rejects the duplicate when the
same execution is reported over both WS and REST.
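
To show how a unique index absorbs a duplicate report, here is a sketch using an in-memory SQLite table as a stand-in for the Postgres ``execution_reports`` table. Only the ``(venue, venue_execution_id)`` pair comes from the text; the ``status`` column, the index name, and `record_report` are illustrative assumptions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE execution_reports (
        venue TEXT NOT NULL,
        venue_execution_id TEXT NOT NULL,
        status TEXT NOT NULL
    )
    """
)
conn.execute(
    "CREATE UNIQUE INDEX uq_venue_exec "
    "ON execution_reports (venue, venue_execution_id)"
)


def record_report(venue, venue_execution_id, status):
    """Insert a report; return True if stored, False if it was a duplicate."""
    cur = conn.execute(
        "INSERT OR IGNORE INTO execution_reports VALUES (?, ?, ?)",
        (venue, venue_execution_id, status),
    )
    return cur.rowcount == 1


stored_first = record_report("kraken", "exec-1", "FILLED")  # arrives over WS
stored_again = record_report("kraken", "exec-1", "FILLED")  # same fill via REST
row_count = conn.execute("SELECT COUNT(*) FROM execution_reports").fetchone()[0]
```

The second insert is silently dropped, so the table holds one row for the execution regardless of which transport delivered it first.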

## Broker capability matrix

| Broker | supports_websocket_amend | supports_oco | supports_outside_rth |
| --- | --- | --- | --- |
| Alpaca | True (TradingStream subscription) | True (bracket orders) | True (extended_hours flag) |
| IBKR | True (gateway native) | True (OCA groups) | True (outsideRth flag) |
| Tradier | False (REST-only amendment) | False | True (ext_hours flag) |
| Binance | True | False (simulated) | n/a (24x7 venue) |
| Kraken | True (Nautilus issue #4000 implementation) | False (simulated) | n/a |
| SimulatedBrokerage | True | True (manager-driven) | True |

The matrix is read at runtime from the broker's class attributes;
specific venues that ship later get added the same way.
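
A sketch of how such a matrix could be derived from class attributes. The classes below are hypothetical stand-ins (not the real brokerage classes) whose flags mirror the Tradier and IBKR rows; the conservative `False` defaults on the base class are an assumption.

```python
class SketchBrokerage:
    """Hypothetical base class; every capability defaults to False."""

    supports_websocket_amend = False
    supports_oco = False
    supports_outside_rth = False


class SketchTradier(SketchBrokerage):
    supports_outside_rth = True


class SketchIBKR(SketchBrokerage):
    supports_websocket_amend = True
    supports_oco = True
    supports_outside_rth = True


FLAGS = ("supports_websocket_amend", "supports_oco", "supports_outside_rth")


def capability_row(broker_cls):
    """Read one matrix row straight off the class attributes."""
    return {flag: bool(getattr(broker_cls, flag, False)) for flag in FLAGS}


def is_phase2_ready(broker_cls):
    """Both WS amend and OCO must be available for "Phase 2 ready"."""
    row = capability_row(broker_cls)
    return row["supports_websocket_amend"] and row["supports_oco"]
```

A new venue only has to set its own class attributes to appear in the matrix; no central table needs editing.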


<!-- https://alpha-swarm.ai/concepts/strategy/factor-research -->
# Factor Research
> AlphaSwarm ships an Alphalens-inspired factor evaluation pipeline plus the purged / walk-forward cross-validators described in Lopez de Prado's *Advances in Financial ML* and ML4T's utility module.

# Factor Research

> Doc map: [alphaswarm_docs/index.md](../../intro/index.md) · See [alphaswarm_docs/strategy-lifecycle.md](../../concepts/strategy/strategy-lifecycle.md) for the broader strategy lifecycle.

AlphaSwarm ships an Alphalens-inspired factor evaluation pipeline plus the
purged / walk-forward cross-validators described in Lopez de Prado's
*Advances in Financial ML* and ML4T's utility module.

## One-liner evaluation

```python
from alphaswarm.data.factors import evaluate_factor

report = evaluate_factor(
    factor=factor_df,        # long: timestamp, vt_symbol, factor
    prices=prices_df,        # long: timestamp, vt_symbol, close
    factor_name="my_factor",
    periods=(1, 5, 10, 21),
    n_quantiles=5,
)
report.ic_stats       # {"fwd_1": {"mean": ..., "ir": ..., ...}, ...}
report.cumulative_returns  # wide DataFrame: Q1..Q5
report.turnover       # Series: top-quantile daily rotation fraction
```
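
As a rough illustration of what an IC summary such as `ic_stats` contains, the sketch below computes a per-date rank IC (Spearman correlation between the factor and a forward return) together with its mean and information ratio. It assumes a long frame that already carries a forward-return column named `fwd_1`; `rank_ic_summary` and that layout are illustrative, not the internals of `evaluate_factor`.

```python
import pandas as pd


def rank_ic_summary(panel, factor_col="factor", fwd_col="fwd_1"):
    """Per-date Spearman IC plus its mean, std, and information ratio."""
    ics = {}
    for ts, group in panel.groupby("timestamp"):
        # Pearson correlation of the ranks equals the Spearman correlation.
        ics[ts] = group[factor_col].rank().corr(group[fwd_col].rank())
    ic = pd.Series(ics)
    return {"mean": ic.mean(), "std": ic.std(), "ir": ic.mean() / ic.std()}


panel = pd.DataFrame(
    {
        "timestamp": ["d1"] * 4 + ["d2"] * 4 + ["d3"] * 4,
        "vt_symbol": ["A", "B", "C", "D"] * 3,
        "factor": [1, 2, 3, 4] * 3,
        # d1 and d2 returns rise with the factor; d3 returns fall with it.
        "fwd_1": [0.01, 0.02, 0.03, 0.04] * 2 + [0.04, 0.03, 0.02, 0.01],
    }
)
summary = rank_ic_summary(panel)
```

Here the daily ICs are 1, 1, and -1, so the mean IC is 1/3 and the IR is that mean divided by the sample standard deviation of the three values.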

## UI

The **Factor Evaluation** page posts to ``POST /factors/evaluate`` which
enqueues a Celery task. The task logs the tear sheet to MLflow with tag
``alphaswarm.component=factor_eval`` so every report is historically
comparable.

## Cross-validators

- :class:`alphaswarm.data.cv.MultipleTimeSeriesCV` — rolling train/test on
  panel data, matches ML4T ``utils.MultipleTimeSeriesCV``.
- :class:`alphaswarm.data.cv.PurgedKFold` — k-fold with embargo days between
  the training window and the test fold boundary.
- :class:`alphaswarm.data.cv.TimeSeriesWalkForward` — rolling or expanding
  train windows with a fixed test-step cadence.
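
To make the embargo idea concrete, here is a simplified index-based sketch. It assumes one row per day sorted by time and drops `embargo` rows on each side of the test fold; the real `PurgedKFold` may purge by label horizon and embargo only one side, so treat `purged_kfold_indices` as a hypothetical helper rather than the library's algorithm.

```python
def purged_kfold_indices(n_samples, n_splits=5, embargo=0):
    """Yield (train_idx, test_idx) lists with a gap around each test fold."""
    fold_bounds = [
        (i * n_samples // n_splits, (i + 1) * n_samples // n_splits)
        for i in range(n_splits)
    ]
    for start, stop in fold_bounds:
        test_idx = list(range(start, stop))
        # Keep only training rows at least `embargo` rows away from the fold.
        train_idx = [
            i
            for i in range(n_samples)
            if i < start - embargo or i >= stop + embargo
        ]
        yield train_idx, test_idx


folds = list(purged_kfold_indices(10, n_splits=5, embargo=1))
```

With ten rows, five folds, and a one-row embargo, the second fold tests on rows 2 and 3 and trains on rows 0 and 5 through 9; rows 1 and 4 are dropped because they sit next to the test window.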

## ML alphas

Two gradient-boosted alpha models drop directly into the framework:

- :class:`alphaswarm.strategies.ml_alphas.XGBoostAlpha`
- :class:`alphaswarm.strategies.ml_alphas.LightGBMAlpha`

Both accept a ``feature_specs`` list (passed through
:class:`alphaswarm.data.indicators_zoo.IndicatorZoo`) and a ``model_path`` to
which the fitted model is pickled after ``train()``. Training auto-logs to
MLflow via the :mod:`alphaswarm.mlops.model_registry` helper, and the saved
model can then be loaded in production by calling
:func:`alphaswarm.mlops.model_registry.load_alpha_path`.

## Factor evaluation flow

```mermaid
flowchart LR
    FeatureSet[FeatureSet specs] --> IndicatorZoo[indicators_zoo build]
    IndicatorZoo --> Factor["factor values per (symbol, ts)"]
    Factor --> Rank[rank / quantile bucket]
    Rank --> ICEval[Information Coefficient + IC-IR]
    Rank --> Returns[returns by quantile]
    Factor --> CV[purged walk-forward CV]
    ICEval --> Report[alphalens-style report]
    Returns --> Report
    CV --> Report
```


<!-- https://alpha-swarm.ai/concepts/strategy/hft-backtest -->
# HFT / LOB backtest engine
> The HFT engine in [alphaswarm/backtest/hft.py](../alphaswarm/backtest/hft.py) wraps [hftbacktest 2.0+](https://github.com/nkaz001/hftbacktest) so any ``LobStrategy`` subclass under [alphaswarm/strategies/hft/](../alphaswarm/stra...

# HFT / LOB backtest engine

> **Audience:** quants running tick-replay backtests for market-making
> or arbitrage strategies, plus agents that need to evaluate a strategy
> spec on cached microstructure data.

The HFT engine in [alphaswarm/backtest/hft.py](../alphaswarm/backtest/hft.py) wraps
[hftbacktest 2.0+](https://github.com/nkaz001/hftbacktest) so any
``LobStrategy`` subclass under
[alphaswarm/strategies/hft/](../alphaswarm/strategies/hft/) becomes runnable
end-to-end. Six strategies ship out of the box:

- ``GLFTMM`` — Guéant-Lehalle-Fernandez-Tapia closed-form MM.
- ``AvellanedaStoikovMM`` — finite-horizon Avellaneda-Stoikov MM.
- ``GridMM`` — symmetric grid quoting around mid.
- ``ImbalanceAlphaMM`` — order-book imbalance skew.
- ``BasisAlphaMM`` — cross-instrument basis as fair value.
- ``QueueAwareMM`` — queue-position-aware MM for large-tick assets.

## Install

The engine ships behind the ``[hft]`` extra. Because hftbacktest is a
Rust crate exposed via PyO3, you need a Rust toolchain at install
time. See [alphaswarm_docs/installation.md](../../intro/installation.md).

## Architecture

```mermaid
flowchart LR
  Tick[gz tick feed] --> HFT[hftbacktest core]
  HFT --> Driver[LobBacktestEngine driver loop]
  Driver -->|state| Strategy[LobStrategy.on_event]
  Strategy -->|OrderIntent| Driver
  Driver -->|submit_buy_order / cancel| HFT
  Driver --> Result[LobBacktestResult]
  Result --> HFTSummary[hft_summary]
  Result --> ReplayChart[LobReplayChart]
```

Two architecturally important pieces:

1. **Strategy bodies stay pure Python.** ``on_event`` returns a list
   of ``OrderIntent`` records. The engine translates them into
   ``hbt.submit_buy_order`` / ``hbt.cancel`` calls. This keeps the
   strategies LLM-friendly (no Numba constraints) at the cost of a
   Python function call per event — still ~1k events/ms in practice.
2. **Snapshots are bounded.** The driver writes one
   ``(timestamp, equity, position)`` record every ``snapshot_every``
   events. Long replays produce manageable trajectories instead of
   un-renderable equity curves.

## Running a backtest

### Direct API

```python
from alphaswarm.backtest.hft import LobBacktestEngine
from alphaswarm.strategies.hft.alphas import AvellanedaStoikovMM

engine = LobBacktestEngine(
    latency_profile="intp_order_latency",
    queue_model="probabilistic",
    tick_size=0.01,
    lot_size=0.001,
)
strategy = AvellanedaStoikovMM(gamma=0.1, sigma=0.01, k=1.5)
result = engine.run(
    strategy,
    feeds=["btcusdt_20240301.gz"],
    max_events=1_000_000,
    snapshot_every=5_000,
)
print(result.summary["hft_sharpe_sample_aware"])
```

### Celery task (recommended for long replays)

```python
from alphaswarm.tasks.hft_tasks import run_lob_backtest

async_result = run_lob_backtest.delay(
    strategy_alias="AvellanedaStoikovMM",
    strategy_kwargs={"gamma": 0.1, "sigma": 0.01, "k": 1.5},
    dataset_preset="lob_btcusdt_sample",
    max_events=10_000_000,
    snapshot_every=10_000,
)
```

The task emits progress every ~2 seconds with the canonical
``{task_id, stage, message, timestamp, **extras}`` shape (AGENTS rule
4) — extras carry ``events_processed``, ``equity``, and ``position``.

### REST surface

```http
POST /backtest/lob
{
  "strategy": "AvellanedaStoikovMM",
  "dataset_preset": "lob_btcusdt_sample",
  "latency_profile": "intp_order_latency",
  "queue_model": "probabilistic",
  "max_events": 1000000
}
```

→ returns ``{task_id, status, stream_url}`` per
[alphaswarm.api.schemas.TaskAccepted](../alphaswarm/api/schemas.py). The
``stream_url`` is the existing ``/chat/stream/{task_id}`` WebSocket;
no new transport.

### Frontend

Navigate to ``/backtest/lob``. The page wires up the wizard
(strategy / dataset / latency / queue model) and the
``LobReplayChart`` (lightweight-charts equity + position curve).

## Latency / queue models

- ``latency_profile="constant_50us"`` — fixed 50µs round-trip.
- ``latency_profile="intp_order_latency"`` — file-driven model bundled
  with hftbacktest's examples (default).
- ``queue_model="probabilistic"`` — hftbacktest's
  ``ProbQueueModel`` (default).
- ``queue_model="risk_averse"`` — ``RiskAverseQueueModel``.

When a value isn't recognised by your installed hftbacktest version,
the engine logs a warning and falls back to the model's default.

## Interpreting the metrics

The ``BacktestResult.summary`` is augmented by
[alphaswarm/backtest/hft_metrics.py::hft_summary](../alphaswarm/backtest/hft_metrics.py):

| Metric | Meaning |
| --- | --- |
| ``hft_sharpe_sample_aware`` | Sharpe annualised by the actual sample frequency (crypto = 365d, equity = 252d). |
| ``hft_sortino_sample_aware`` | Same for Sortino. |
| ``hft_max_position`` | Largest absolute inventory at any point. |
| ``hft_mean_leverage`` | Mean ``|position_value| / equity``. |
| ``hft_fill_ratio`` | Fills / orders. |

The ``events_processed`` field reflects the number of ``elapse``
calls, not the underlying tick count.
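
One way to picture "sample-aware" annualisation: scale the per-sample Sharpe by the square root of the number of samples per year, where that number follows from the snapshot spacing. The sketch below is an assumption about the general technique, not the `hft_summary` implementation; `sample_aware_sharpe` and its parameters are hypothetical.

```python
import math
import statistics


def sample_aware_sharpe(equity, sample_seconds, trading_days=365,
                        seconds_per_day=86_400):
    """Annualised Sharpe from equity snapshots taken every `sample_seconds`."""
    returns = [b / a - 1.0 for a, b in zip(equity, equity[1:])]
    if len(returns) < 2:
        return float("nan")
    sd = statistics.stdev(returns)
    if sd == 0:
        return float("nan")
    # Samples per year depend on the snapshot spacing and the venue calendar
    # (365 days for crypto; pass trading_days=252 for equities).
    periods_per_year = trading_days * seconds_per_day / sample_seconds
    return statistics.mean(returns) / sd * math.sqrt(periods_per_year)


equity = [100.0, 101.0, 100.5, 102.0]
sharpe_1min = sample_aware_sharpe(equity, sample_seconds=60)
sharpe_4min = sample_aware_sharpe(equity, sample_seconds=240)
```

Treating the same snapshots as four times further apart halves the annualised figure, which is why the sample spacing has to enter the calculation.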

## Custom strategies

Subclass [alphaswarm/strategies/lob.py::LobStrategy](../alphaswarm/strategies/lob.py) and
implement ``on_event(state) -> list[OrderIntent]``. Use the inherited
``buy`` / ``sell`` / ``cancel`` helpers to build intents. Decorate the
class with ``@register("YourMM", source="alphaswarm", category="market_making")``
so the registry index lights up.

## See also

- [alphaswarm_snippets/extractions/_FUTURE_PROMPTS/lob_adapter_prompt.md](https://github.com/julianwileymac/alphaswarm/blob/main/alphaswarm_snippets/extractions/_FUTURE_PROMPTS/lob_adapter_prompt.md) —
  the original spec this implementation closes out.
- [alphaswarm_docs/optimal-control.md](../../concepts/strategy/optimal-control.md) — the JAX HJB closed
  forms that ``GLFTMM`` and ``AvellanedaStoikovMM`` consume.
- [alphaswarm_docs/microstructure-toxicity.md](../../concepts/strategy/microstructure-toxicity.md) — the
  agent loop that mutates strategy YAML on regime flips.


<!-- https://alpha-swarm.ai/concepts/strategy/microstructure-toxicity -->
# Microstructure toxicity + regime-aware adapter
> A pure Avellaneda-Stoikov / Lucic-Tse market maker is exposed to adverse selection: when the order flow becomes informationally toxic (elevated VPIN, spiky microprice variance, runaway cancellation ra...

# Microstructure toxicity + regime-aware adapter

> **Audience:** anyone running paper or live HFT strategies who wants
> the platform to react automatically to toxic-flow regimes.

A pure Avellaneda-Stoikov / Lucic-Tse market maker is exposed to
adverse selection: when the order flow becomes informationally toxic
(elevated VPIN, spiky microprice variance, runaway cancellation
ratios), the dealer's quotes get picked off faster than the closed-form
solution predicts. The mathematical fix, increasing ``γ`` (risk aversion)
and shrinking ``order_size``, needs to happen automatically and quickly,
because the toxicity window is short.

AlphaSwarm wires that loop end-to-end via two MCP tools and one agent spec.

## The loop

```mermaid
flowchart LR
  Bars[microstructure dataset] --> Flow[optimal_control.toxicity_regime flow]
  Flow --> Iceberg[(alphaswarm_gold_analysis_optimal_control)]
  Iceberg --> ListRegimes[data.optimal_control.list_regimes]
  ListRegimes --> Agent[research.toxicity_regime_adapter]
  Agent --> EvalStrategy[data.optimal_control.evaluate_strategy]
  Agent --> UpdateConfig[data.strategy_config.update]
  UpdateConfig --> PaperYaml[configs/paper/*.yaml]
  PaperYaml --> Paper[Celery paper worker]
```

1. The
   [optimal_control.toxicity_regime](../alphaswarm/analysis/flows/optimal_control.py)
   flow runs on every fresh microstructure slice and writes a regime
   row to ``alphaswarm_gold_analysis_optimal_control.toxicity_regime``.
2. The
   [research.toxicity_regime_adapter](../configs/agents/research_toxicity_regime_adapter.yaml)
   agent polls the regime table via the ``data.optimal_control.list_regimes``
   MCP tool.
3. When the label flips (benign → elevated → toxic), the agent updates
   a whitelist of fields on the active paper-trading YAML using the
   ``data.strategy_config.update`` writer tool.
4. The Celery paper worker picks up the new YAML on its next reload.

The whitelist is intentionally narrow:
``gamma``, ``sigma``, ``kappa``, ``k``, ``gamma_inv``, ``base_spread``,
``order_size``, ``max_position``. Anything else (broker, symbol,
account_id, kill-switch state) requires a different higher-privilege
tool — by design.

## Toxicity score

The flow computes a composite toxicity score per slice:

```
score = 0.6 · VPIN_recent + 0.25 · microprice_variance + 0.15 · cancel_ratio
```

Thresholds map score → regime → suggested multipliers:

| Score range | Regime | γ multiplier | order_size multiplier |
| --- | --- | --- | --- |
| score < 0.5·θ | benign | 1.0 | 1.0 |
| 0.5·θ ≤ score < θ | elevated | 1.25 | 0.75 |
| score ≥ θ | toxic | 1.5 | 0.5 |

Default threshold ``θ = 0.6``. Tune via the flow's
``toxic_threshold`` param.
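
The formula and threshold table translate directly into code. The sketch below uses hypothetical function names and assumes a missing cancellation ratio is treated as 0.0; how the real flow handles a missing input is not stated here.

```python
def composite_score(vpin_recent, microprice_variance, cancel_ratio=0.0):
    """Weighted toxicity score; weights are the ones given above."""
    return 0.6 * vpin_recent + 0.25 * microprice_variance + 0.15 * cancel_ratio


def classify_regime(score, toxic_threshold=0.6):
    """Return (regime, gamma_multiplier, order_size_multiplier)."""
    if score >= toxic_threshold:
        return "toxic", 1.5, 0.5
    if score >= 0.5 * toxic_threshold:
        return "elevated", 1.25, 0.75
    return "benign", 1.0, 1.0


def adjust(gamma, order_size, score, toxic_threshold=0.6):
    """Apply the regime multipliers to the two whitelisted parameters."""
    regime, gamma_mult, size_mult = classify_regime(score, toxic_threshold)
    return regime, gamma * gamma_mult, order_size * size_mult


regime, new_gamma, new_size = adjust(gamma=0.1, order_size=2.0, score=0.7)
```

With a score of 0.7 and the default threshold, the regime is toxic, so risk aversion rises by half and the order size is halved.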

## Where the math comes from

VPIN: Easley, López de Prado, & O'Hara (2012), implemented in
[alphaswarm/data/microstructure.py::vpin](../alphaswarm/data/microstructure.py).
Microprice variance: the gap between the volume-weighted microprice
and the simple mid; large gaps indicate informational pressure on
one side of the book. Cancellation ratio: optional input; when
provided, captures the fraction of recent order activity that was
cancellations rather than trades — a leading indicator of HFT activity
ramping up.
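
For reference, the gap between the microprice and the mid can be sketched with the standard size-weighted microprice definition. How the flow aggregates this gap into its variance term is not described here, so the helper names below are illustrative only.

```python
def microprice(bid_price, ask_price, bid_qty, ask_qty):
    """Size-weighted price: heavy bid size pulls the value toward the ask."""
    return (bid_price * ask_qty + ask_price * bid_qty) / (bid_qty + ask_qty)


def microprice_gap(bid_price, ask_price, bid_qty, ask_qty):
    """Signed distance between the microprice and the simple mid."""
    mid = (bid_price + ask_price) / 2.0
    return microprice(bid_price, ask_price, bid_qty, ask_qty) - mid


# Nine lots bid against one lot offered: buying pressure.
gap = microprice_gap(bid_price=100.0, ask_price=101.0, bid_qty=9.0, ask_qty=1.0)
```

A balanced book gives a gap of zero; the one-sided book above pushes the microprice to 100.9, a gap of 0.4 above the mid.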

## Manually inspecting a regime

```python
import pandas as pd
from alphaswarm.analysis import run_flow

df = pd.read_csv("recent_l1_book.csv")
out = run_flow(
    "optimal_control.toxicity_regime",
    df,
    {
        "buy_volume_column": "buy_volume",
        "sell_volume_column": "sell_volume",
        "bid_qty_column": "bid_qty",
        "ask_qty_column": "ask_qty",
        "bid_price_column": "bid_price",
        "ask_price_column": "ask_price",
        "n_buckets": 50,
        "toxic_threshold": 0.6,
    },
)
print(out.metrics["regime"], out.metrics["composite_score"])
```

## Customising

- **Tune the threshold.** Drop ``toxic_threshold`` to 0.4 in
  defensive products; raise it to 0.7 in alpha-only strategies that
  want the tighter spreads more often.
- **Add a cancellation column.** Pass
  ``cancellation_column="n_cancels"`` to the flow when the dataset
  exposes per-bar cancellation counts; the score will become more
  responsive to HFT activity.
- **Replace the agent.** The reference adapter is a simple multiplier
  agent. For richer policies, swap in an RL agent trained on
  [LucicTsePortfolioEnv](../alphaswarm/rl/envs/lucic_tse_options_env.py) and
  invoke its policy from a custom AgentSpec body.

## Tests

- [tests/analysis/test_optimal_control_flows.py](../tests/analysis/test_optimal_control_flows.py)
  covers the flow's classification logic.
- [tests/data/mcp/test_strategy_config_tool.py](../tests/data/mcp/test_strategy_config_tool.py)
  covers the writer tool's whitelist + path-traversal guards.

## See also

- [alphaswarm_docs/optimal-control.md](../../concepts/strategy/optimal-control.md) — Avellaneda-Stoikov
  + Cartea-Jaimungal closed forms.
- [alphaswarm_docs/portfolio-options-mm.md](../../concepts/strategy/portfolio-options-mm.md) — Lucic-Tse
  framework that uses ``γ_inv`` instead of single-asset ``γ``.
- [alphaswarm_docs/hft-backtest.md](../../concepts/strategy/hft-backtest.md) — running a tick-replay
  validation of the new parameters before they go to paper.


<!-- https://alpha-swarm.ai/concepts/strategy/ml-alpha-backtest -->
# `AlphaBacktestExperiment`
> Use `AlphaBacktestExperiment` whenever you want to answer the question *"how does this model perform when its predictions actually drive trades?"*. The standard `Experiment` family computes IC / RMSE ...

# `AlphaBacktestExperiment`

> The keystone "model used as alpha" experiment — train a model, register
> it, deploy it as `DeployedModelAlpha`, run a backtest, and persist
> combined ML + trading metrics under one MLflow parent run.

## When to use

Use `AlphaBacktestExperiment` whenever you want to answer the question
*"how does this model perform when its predictions actually drive
trades?"*. The standard `Experiment` family computes IC / RMSE / MAE in
isolation; `AlphaBacktestExperiment` adds Sharpe / Sortino / hit-rate
and links them back to the trained `ModelVersion` so the Strategy
Browser, MLflow UI, and Postgres catalog all converge.

## Shape

| Concept | Class / table |
| --- | --- |
| Orchestrator | [`alphaswarm.ml.alpha_backtest_experiment::AlphaBacktestExperiment`](../alphaswarm/ml/alpha_backtest_experiment.py) |
| Combined metrics | [`alphaswarm.ml.alpha_metrics`](../alphaswarm/ml/alpha_metrics.py) |
| Combined run row | `MLAlphaBacktestRun` (Alembic 0025) |
| Per-bar audit (opt-in) | `MLPredictionAudit` (Alembic 0025) |
| Celery task | `alphaswarm.tasks.ml_tasks.run_alpha_backtest_experiment` (queue `ml`) |
| REST | `POST /ml/alpha-backtest-runs`, `GET /ml/alpha-backtest-runs[/{id}/predictions]` |

## Workflow

```mermaid
sequenceDiagram
    autonumber
    participant Caller as Caller (UI / CLI / Celery)
    participant Task as run_alpha_backtest_experiment
    participant Exp as AlphaBacktestExperiment
    participant ML as MLflow
    participant Reg as Model Registry + ModelVersion
    participant Dep as ModelDeployment
    participant BT as run_backtest_from_config
    participant DB as Postgres

    Caller->>Task: payload (dataset/model/strategy/backtest cfg)
    Task->>Exp: AlphaBacktestExperiment(...).run()
    Exp->>ML: open parent run (alphaswarm.component=alpha_backtest)
    Exp->>Exp: train + predict
    Exp->>Reg: register_alpha + ModelVersion row
    Exp->>Dep: ensure ModelDeployment (if absent)
    Exp->>BT: run_backtest_from_config(strategy=DeployedModelAlpha)
    BT->>DB: BacktestRun(model_version_id=..., ml_experiment_run_id=...)
    Exp->>Exp: compute_alpha_metrics + compute_trading_metrics + compute_attribution
    Exp->>ML: log combined metrics on parent run
    Exp->>DB: MLAlphaBacktestRun row
    Exp-->>Caller: AlphaBacktestResult
```

## Metric vocabulary

The combined metrics blob persisted on `MLAlphaBacktestRun.combined_metrics` rolls up:

- ML-side: `ic_spearman`, `ic_pearson`, `icir`, `mae`, `rmse`, `hit_rate`
- Trading-side: `sharpe`, `sortino`, `calmar`, `max_drawdown`, `total_return`, `turnover_adj_sharpe`
- Combined scalar: `score = combined_score(ml_metrics, trading_metrics)` —
  default weighting in [`alphaswarm/ml/alpha_metrics.py`](../alphaswarm/ml/alpha_metrics.py)
  prioritises Sharpe (0.45) but also rewards IC / IR / hit-rate so a
  high-IC model that fails to translate to PnL is penalised.
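
A sketch of what a weighted combined score could look like. Only the 0.45 Sharpe weight comes from the text; the other weights, the linear form, and the name `sketch_combined_score` are placeholders chosen for illustration, not the contents of `alpha_metrics.py`.

```python
DEFAULT_WEIGHTS = {
    "sharpe": 0.45,       # stated in the text
    "ic_spearman": 0.25,  # placeholder, not the library's value
    "icir": 0.15,         # placeholder
    "hit_rate": 0.15,     # placeholder
}


def sketch_combined_score(ml_metrics, trading_metrics, weights=None):
    """Weighted sum over whichever weighted metrics are present."""
    weights = DEFAULT_WEIGHTS if weights is None else weights
    merged = {**ml_metrics, **trading_metrics}
    # Metrics missing from both dicts contribute nothing to the score.
    return sum(w * merged.get(name, 0.0) for name, w in weights.items())


high_ic_no_pnl = sketch_combined_score(
    {"ic_spearman": 0.10, "icir": 1.0, "hit_rate": 0.55}, {"sharpe": 0.0}
)
modest_ic_with_pnl = sketch_combined_score(
    {"ic_spearman": 0.03, "icir": 0.4, "hit_rate": 0.52}, {"sharpe": 1.5}
)
```

Because Sharpe carries the largest weight, the second model scores higher even though its IC is weaker.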

## Calling from code

```python
from alphaswarm.ml.alpha_backtest_experiment import AlphaBacktestExperiment

experiment = AlphaBacktestExperiment(
    dataset_cfg=dataset_cfg,
    model_cfg=model_cfg,
    strategy_cfg=strategy_cfg,
    backtest_cfg=backtest_cfg,
    run_name="ridge-alpha-backtest",
    train_first=True,
    capture_predictions=True,
)
result = experiment.run()
print(result.combined_metrics)
```

## Calling from REST

```bash
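# The config is YAML, so convert it to JSON before posting (requires PyYAML).
python -c 'import json, sys, yaml; json.dump(yaml.safe_load(sys.stdin), sys.stdout, default=str)' \
  < configs/ml/alpha_backtest/ridge_alpha_backtest.yaml \
  | curl -XPOST http://localhost:8000/ml/alpha-backtest-runs \
      -H 'content-type: application/json' \
      --data-binary @-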
```

The response is a `TaskAccepted` envelope; subscribe to
`/chat/stream/{task_id}` for progress events.

## Where this goes wrong

- Forgetting `train_first=False` when re-using an existing
  `deployment_id` will trigger a re-train. Set it explicitly.
- The combined-metric weights are heuristic — customise them per
  strategy by passing `weights={...}` to `combined_score`.
- `MLPredictionAudit` is gated behind
  `ALPHASWARM_ML_PREDICTION_AUDIT_ENABLED`; default is `false` to keep the
  table small. Enable it for forensic explainability.

## Related

- [`alphaswarm_docs/ml-framework.md`](../../concepts/strategy/ml-framework.md)
- [`alphaswarm_docs/backtest-engines.md`](../../concepts/strategy/backtest-engines.md)
- [`alphaswarm_docs/ml-testing.md`](../../concepts/strategy/ml-testing.md)


<!-- https://alpha-swarm.ai/concepts/strategy/ml-builder -->
# Graphical ML experiment builder
> - Page: [`webui/app/(shell)/ml/builder/page.tsx`](../webui/app/(shell)/ml/builder/page.tsx) - Component: [`webui/components/ml/MlExperimentBuilderPage.tsx`](../webui/components/ml/MlExperimentBuilderP...

# Graphical ML experiment builder

> The `/ml/builder` page composes datasets, preprocessing, model
> definitions, experiment records, deployments, and quick tests on a
> shared XYFlow canvas. Same plumbing as the Bot Builder.

## Where it lives

- Page: [`webui/app/(shell)/ml/builder/page.tsx`](../webui/app/(shell)/ml/builder/page.tsx)
- Component: [`webui/components/ml/MlExperimentBuilderPage.tsx`](../webui/components/ml/MlExperimentBuilderPage.tsx)
- Palette: [`webui/components/ml/mlExperimentPalette.ts`](../webui/components/ml/mlExperimentPalette.ts)
- Serializer: [`webui/components/ml/mlExperimentSerializer.ts`](../webui/components/ml/mlExperimentSerializer.ts)
- Canvas: [`webui/components/flow/WorkflowEditor.tsx`](../webui/components/flow/WorkflowEditor.tsx)

## Palette layout

```mermaid
graph LR
    Source[Source section] --> Pipeline[Pipeline section]
    Pipeline --> Split[Split section]
    Split --> Model[Model section]
    Model --> Records[Records section]
    Records --> Experiment[Experiment section]
    Experiment --> Test[Test section]
    Test --> Deploy[Deploy section]
```

Each palette section maps onto a list of node `kind`s defined in
`mlExperimentPalette.ts`.

| Section | Sample kinds |
| --- | --- |
| Source | `Dataset`, `DatasetPreset`, `IcebergSlice`, `FetcherSource`, `PipelineManifestRef`, `FeatureSet` |
| Pipeline | `Preprocessing`, `MLScale`, `MLWinsorize`, `MLLag`, `MLRolling`, `MLDecompose`, `MLPyODOutliers`, `MLImputation` |
| Split | `Split`, `WalkForward`, `PurgedKFold`, `Quarterly`, `ChronologicalRatio` |
| Model | `SklearnModel`, `KerasModel`, `TensorflowModel`, `TorchModel`, `LightGBMModel`, `XGBoostModel`, `ProphetModel`, `SktimeModel`, `PyODModel`, `HuggingFaceModel` |
| Records | `Records`, `SignalRecord` |
| Experiment | `Experiment`, `ForecastExperiment`, `ClassificationExperiment`, `AnomalyExperiment`, `AlphaBacktestExperiment`, `FlowPreview` |
| Test | `SinglePredictTest`, `BatchPredictTest`, `ABCompareTest`, `ScenarioTest` |
| Deploy | `RegisterModelVersion`, `PromoteToProduction`, `CreateModelDeployment` |

## Dispatch

`mlExperimentSerializer.ts::dispatchFromGraph` inspects the canvas and
routes to the right backend endpoint:

- Graph contains an `AlphaBacktestExperiment` node →
  `POST /ml/alpha-backtest-runs`
- Graph contains a `Test*` node →
  `POST /ml/test/{single|batch|compare|scenario}`
- Otherwise → `POST /ml/experiment-runs`

This means a single canvas serializes to an alpha-backtest run, a test
run, or an experiment-style run, depending on what the user dropped on it.
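
The routing rules can be expressed compactly. The real `dispatchFromGraph` is TypeScript; the Python sketch below only mirrors the three rules, and it assumes that the bullet order is the precedence order and that each `Test*` kind maps to the like-named path segment.

```python
# Assumed mapping from Test node kinds to the /ml/test/{...} path segment.
TEST_ROUTES = {
    "SinglePredictTest": "single",
    "BatchPredictTest": "batch",
    "ABCompareTest": "compare",
    "ScenarioTest": "scenario",
}


def dispatch_endpoint(node_kinds):
    """Pick the backend endpoint for a canvas, given its node kinds."""
    kinds = list(node_kinds)
    # An alpha-backtest node wins over everything else on the canvas.
    if "AlphaBacktestExperiment" in kinds:
        return "POST /ml/alpha-backtest-runs"
    # Otherwise the first Test* node decides the test route.
    for kind in kinds:
        if kind in TEST_ROUTES:
            return f"POST /ml/test/{TEST_ROUTES[kind]}"
    return "POST /ml/experiment-runs"
```

For instance, a canvas holding `Dataset`, `LightGBMModel`, and `Experiment` nodes falls through to `POST /ml/experiment-runs`.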

## Interactive Workbench drawer

The toolbar exposes an "Interactive Workbench" button that opens a
right-hand drawer wrapping the
[`/ml/flows`](../../concepts/strategy/ml-flows.md) catalog. The form is auto-generated from
`GET /ml/flows` so adding a new flow lights up here automatically.

## Adding a new palette tile

1. Append an entry to the appropriate `PaletteSection` in
   `mlExperimentPalette.ts`.
2. Add an accent color to `ML_EXPERIMENT_ACCENTS`.
3. If the new kind needs special serialization (e.g. it must reach a
   bespoke endpoint), extend `mlExperimentSerializer.ts`'s helper sets
   and `dispatchFromGraph`.


<!-- https://alpha-swarm.ai/concepts/strategy/ml-flows -->
# Lightweight workbench flows
> | Flow | Purpose | Backend | | --- | --- | --- | | `linear` | Ridge / Lasso / ElasticNet / BayesianRidge with IC + RMSE / MAE | sklearn | | `decomposition` | STL trend / seasonal / residual | statsmod...

# Lightweight workbench flows

> Small synchronous helpers in [`alphaswarm.ml.flows`](../alphaswarm/ml/flows.py) that
> let users iterate on a dataset without spinning up a full
> `Experiment`. Surfaced at `POST /ml/flows/{flow}/preview`,
> `POST /ml/flows/{flow}/preview-task` (Celery), and `GET /ml/flows`
> (catalog).

## Catalog

| Flow | Purpose | Backend |
| --- | --- | --- |
| `linear` | Ridge / Lasso / ElasticNet / BayesianRidge with IC + RMSE / MAE | sklearn |
| `decomposition` | STL trend / seasonal / residual | statsmodels |
| `forecast` | Prophet / sktime-naive / ARIMA / ETS / Theta / AutoARIMA | mixed |
| `regression_diagnostics` | OLS coef table, R^2, F-stat, Durbin-Watson | statsmodels |
| `unit_root` | ADF / KPSS unit-root tests | statsmodels |
| `acf_pacf` | Auto- and partial-autocorrelation series | statsmodels |
| `granger_causality` | Granger causality between two columns | statsmodels |
| `cointegration` | Engle-Granger pair cointegration | statsmodels |
| `garch` | GARCH(p, q) volatility model + horizon | arch |
| `change_point` | PELT / RBF kernel change points | ruptures |
| `clustering` | KMeans / DBSCAN / HDBSCAN on the feature matrix | sklearn / hdbscan |
| `pca_summary` | PCA variance + factor loadings | sklearn |

## REST surface

```bash
# List every flow + its parameter schema
curl http://localhost:8000/ml/flows | jq

# Sync run a flow
curl -XPOST http://localhost:8000/ml/flows/linear/preview \
  -d '{"dataset_cfg": {...}, "estimator": "ridge", "alpha": 1.0}' \
  -H 'content-type: application/json'

# Background run via Celery (returns TaskAccepted)
curl -XPOST http://localhost:8000/ml/flows/garch/preview-task \
  -d '{"dataset_cfg": {...}, "column": "close", "p": 1, "q": 1, "horizon": 10}' \
  -H 'content-type: application/json'
```

## Webui workbench drawer

The ML Experiment Builder
([`/ml/builder`](../webui/app/(shell)/ml/builder/page.tsx)) ships an
"Interactive Workbench" drawer on its toolbar. Pick a flow, fill in
the per-flow form (auto-generated from `GET /ml/flows`), and submit —
the result table renders inline so you never leave the canvas.

## Tutorials

- [01_quick_ridge_workbench.yaml](../configs/ml/tutorials/01_quick_ridge_workbench.yaml)
- [02_stl_decompose_workbench.yaml](../configs/ml/tutorials/02_stl_decompose_workbench.yaml)
- [03_arima_garch_diagnostics.yaml](../configs/ml/tutorials/03_arima_garch_diagnostics.yaml)

## Adding a new flow

1. Implement `run_<name>_flow(...)` in
   [`alphaswarm/ml/flows.py`](../alphaswarm/ml/flows.py) returning a `FlowResult`.
2. Add a dispatch branch in `run_flow(flow, payload)`.
3. Add an entry in `list_flows()` so the webui form reflects the new
   parameters automatically.
4. (Optional) Wrap as a notebook helper in
   [`alphaswarm/ml/adhoc/`](../alphaswarm/ml/adhoc/__init__.py).


<!-- https://alpha-swarm.ai/concepts/strategy/ml-framework -->
# `alphaswarm.ml` — native qlib-style ML framework
> `alphaswarm.ml` is a vendored port of [Microsoft Qlib](https://github.com/microsoft/qlib)'s feature / dataset / model / record abstractions, re-built as pure Python on top of AlphaSwarm's own DuckDB-backed data lak...

# `alphaswarm.ml` — native qlib-style ML framework

> Doc map: [alphaswarm_docs/index.md](../../intro/index.md) · See [alphaswarm_docs/factor-research.md](../../concepts/strategy/factor-research.md) for the alphalens-style evaluation pipeline.

`alphaswarm.ml` is a vendored port of [Microsoft Qlib](https://github.com/microsoft/qlib)'s
feature / dataset / model / record abstractions, re-built as pure Python on top
of AlphaSwarm's own DuckDB-backed data lake. There is **no qlib runtime dependency**
— installing the `ml` / `ml-torch` extras pulls in the underlying libraries
(LightGBM, XGBoost, CatBoost, PyTorch) only.

## Layers

```
┌────────────────────────────────────────────────┐
│ Model (alphaswarm.ml.base.Model / ModelFT)     │
│   ├─ tree: LGBModel, XGBModel, CatBoostModel   │
│   ├─ linear: LinearModel (OLS/Ridge/Lasso/NNLS)│
│   ├─ ensemble: DEnsembleModel                  │
│   ├─ torch: DNN, LSTM, GRU, ALSTM, Transformer,│
│   │         TCN, TabNet, Localformer,          │
│   │         GeneralPTNN, Seq2Seq family        │
│   └─ stubs: GATs, HIST, TRA, ADD, ADARNN, …    │
├────────────────────────────────────────────────┤
│ DatasetH / TSDatasetH → prepare(segments)      │
├────────────────────────────────────────────────┤
│ DataHandler / DataHandlerLP                    │
│   ├─ DK_R raw | DK_I infer | DK_L learn views  │
│   └─ shared / infer / learn processors         │
├────────────────────────────────────────────────┤
│ DataLoader → AQPDataLoader (DuckDB + DSL)      │
└────────────────────────────────────────────────┘
```

## Quick start

```python
from alphaswarm.ml.features.alpha158 import Alpha158
from alphaswarm.ml.dataset import DatasetH
from alphaswarm.ml.models.tree import LGBModel

handler = Alpha158(
    instruments=["SPY", "AAPL", "MSFT"],
    start_time="2018-01-01",
    end_time="2024-12-31",
    fit_start_time="2018-01-01",
    fit_end_time="2022-12-31",
)
dataset = DatasetH(
    handler=handler,
    segments={
        "train": ("2018-01-01", "2022-12-31"),
        "valid": ("2023-01-01", "2023-12-31"),
        "test":  ("2024-01-01", "2024-12-31"),
    },
)
model = LGBModel(num_leaves=63, learning_rate=0.05, n_estimators=500)
model.fit(dataset)
predictions = model.predict(dataset, segment="test")
```

Launch the same pipeline as a Celery task:

```python
from alphaswarm.tasks.ml_tasks import train_ml_model

async_result = train_ml_model.delay(
    dataset_cfg={"class": "DatasetH", "module_path": "alphaswarm.ml.dataset", "kwargs": {...}},
    model_cfg={"class": "LGBModel", "module_path": "alphaswarm.ml.models.tree", "kwargs": {...}},
    run_name="alpha158-lgbm",
    strategy_id="",
)
```

## Feature factories

- **Alpha158** (`alphaswarm.ml.features.alpha158.Alpha158DL`) ships the 9 k-bar +
  price/volume lookbacks + ~30 rolling families from the original qlib paper.
  Every feature is expressed via the DSL operators in
  `alphaswarm.data.expressions` so adding a new family is one line of code.
- **Alpha360** (`alphaswarm.ml.features.alpha360.Alpha360DL`) emits a 60-step OHLCV
  panel normalised by the latest close (or latest volume). Feed it into a
  `TSDatasetH` and pair with one of the sequence models.

Both handlers default to `Ref($close, -2) / Ref($close, -1) - 1` as the label,
matching qlib's standard target: the one-day return from the T+1 close to the
T+2 close.
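
To make the label concrete, the sketch below evaluates `Ref($close, -2) / Ref($close, -1) - 1` on a single instrument's close series in plain Python. It assumes, as in qlib, that a negative `Ref` offset looks forward in time; the helper name `forward_label` is illustrative and not part of `alphaswarm`.

```python
from typing import List, Optional


def forward_label(close: List[float]) -> List[Optional[float]]:
    """Return close[t+2] / close[t+1] - 1 for each bar t (None where the future is unknown)."""
    labels: List[Optional[float]] = []
    for t in range(len(close)):
        if t + 2 < len(close):
            labels.append(close[t + 2] / close[t + 1] - 1.0)
        else:
            # The last two bars have no T+2 close yet, so the label is missing.
            labels.append(None)
    return labels


closes = [100.0, 102.0, 101.0, 104.03]
labels = forward_label(closes)
```

The last two bars carry no label because their T+2 close is not yet known; rows like these are what a label-dropping processor such as `DropnaLabel` would be expected to remove before training.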

## Expression DSL

`alphaswarm.data.expressions` now exposes ~50 operators grouped into four families:

- **Unary**: `Ref`, `Delta`, `Abs`, `Sign`, `Log`, `Power`, `Rank`
- **Rolling**: `Mean`, `Std`, `Var`, `Skew`, `Kurt`, `Sum`, `Min`, `Max`,
  `Med`, `Mad`, `Quantile`, `Count`, `IdxMax`, `IdxMin`, `EMA`, `WMA`,
  `Slope`, `Rsquare`, `Resi`
- **Pairwise**: `Corr`, `Cov`
- **Comparison / logical / conditional**: `Greater`, `Less`, `Gt`, `Ge`,
  `Lt`, `Le`, `Eq`, `Ne`, `And`, `Or`, `Not`, `Mask`, `If`
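
To illustrate how the rolling operators compose, the sketch below computes in plain Python what a z-score expression such as `($close - Mean($close, 20)) / (Std($close, 20) + 1e-12)` would produce for one instrument. It is not the DSL engine: the function name is hypothetical, and two behaviours are assumptions here, namely that `Std` is the sample standard deviation and that bars without a full window yield no value.

```python
import statistics
from typing import List, Optional


def rolling_zscore(close: List[float], window: int = 20, eps: float = 1e-12) -> List[Optional[float]]:
    """Z-score of the latest close against its trailing window (window must be >= 2)."""
    out: List[Optional[float]] = []
    for t in range(len(close)):
        if t + 1 < window:
            out.append(None)  # assumption: incomplete windows produce no value
            continue
        win = close[t + 1 - window : t + 1]
        mean = statistics.fmean(win)
        std = statistics.stdev(win)  # assumption: sample (ddof=1) standard deviation
        # eps keeps the division finite when the window is flat.
        out.append((close[t] - mean) / (std + eps))
    return out


trend = rolling_zscore([1.0, 2.0, 3.0], window=3)
flat = rolling_zscore([5.0, 5.0, 5.0], window=3)
```

The `1e-12` term matters only for flat windows, where the standard deviation is zero and the score collapses to zero instead of raising a division error.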

Example: construct a 20-bar z-score factor on the close price:

    "($close - Mean($close, 20)) / (Std($close, 20) + 1e-12)"

## Recorders

`alphaswarm.ml.recorder` ports `SignalRecord` / `SigAnaRecord` / `PortAnaRecord`:

- `SignalRecord.generate()` calls `model.predict(dataset)`, serialises
  `pred.pkl` + `label.pkl`, and logs them as MLflow artifacts.
- `SigAnaRecord.generate(signal_record=...)` runs
  `alphaswarm.data.factors.evaluate_factor` to compute IC / Rank IC / quantile
  returns and pushes them into the active MLflow run.
- `PortAnaRecord.generate(signal_record=...)` turns the prediction panel
  into a top-K long / bottom-K short portfolio and reports Sharpe /
  Sortino / max-drawdown + qlib-style `risk_analysis` summary.

The `train_ml_model` Celery task auto-runs `SignalRecord` + any records
listed in the YAML so one `POST /ml/train` gives you predictions, factor
analysis, and a portfolio tearsheet in a single MLflow run.
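
For readers unfamiliar with the factor metrics mentioned above, here is a minimal sketch of IC and Rank IC for a single cross-section (one date), using the conventional definitions: Pearson correlation of predictions with labels, and the same correlation computed on ranks. How `evaluate_factor` computes them internally is not shown in this document, so the function names below are illustrative only.

```python
import math
from typing import List, Sequence, Tuple


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation of two equally long, non-constant sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / math.sqrt(vx * vy)


def average_ranks(values: Sequence[float]) -> List[float]:
    """1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2.0 + 1.0
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def information_coefficients(pred: Sequence[float], label: Sequence[float]) -> Tuple[float, float]:
    """Return (IC, Rank IC) for one cross-section."""
    ic = pearson(pred, label)
    rank_ic = pearson(average_ranks(pred), average_ranks(label))
    return ic, rank_ic


ic, rank_ic = information_coefficients([0.1, 0.4, 0.2, 0.9], [0.0, 0.02, 0.01, 0.03])
```

In practice these values are computed per date and then averaged, so a single cross-section like this one is only the inner step.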

## Model zoo (Tier A — shipping)

| Family           | Class                                                                   | Notes                                   |
|------------------|-------------------------------------------------------------------------|-----------------------------------------|
| Tree             | `LGBModel`, `XGBModel`, `CatBoostModel`, `DEnsembleModel`               | `ml` extra                              |
| Linear           | `LinearModel(estimator="ridge"\|"lasso"\|"ols"\|"nnls")`               | `ml` extra                              |
| Dense            | `DNNModel(layers=[256, 64], dropout=0.2)`                               | `ml-torch` extra                        |
| Sequence         | `LSTMModel`, `GRUModel`, `ALSTMModel` (attention head)                  | TS; `step_len=20`                       |
| Attention        | `TransformerModel`, `LocalformerModel` (local-window mask)              | TS                                      |
| Convolutional    | `TCNModel`                                                              | TS                                      |
| Tabular          | `TabNetModel`                                                           | requires `pytorch-tabnet`               |
| Generic          | `GeneralPTNN(model_class=..., model_module=...)`                        | bring-your-own `nn.Module`              |
| Seq2Seq          | `LSTMSeq2Seq`, `GRUSeq2Seq`, `LSTMSeq2SeqVAE`, `DilatedCNNSeq2Seq`, `TransformerForecaster` | ported from Stock-Prediction-Models |

## ML-Ops framework adapters

The experiment layer also exposes framework adapters that still satisfy the
same `Model.fit(dataset)` / `Model.predict(dataset, segment)` contract:

| Family | Classes | Extra |
| --- | --- | --- |
| scikit-learn | `SklearnRegressorModel`, `SklearnClassifierModel`, `SklearnPipelineModel` | `ml` |
| Forecasting | `ProphetForecastModel`, `SktimeForecastModel`, `SktimeReductionForecastModel` | `ml-forecast` |
| Anomaly detection | `PyODAnomalyModel` | `ml-anomaly` |
| Keras / TensorFlow | `KerasMLPModel`, `KerasLSTMModel` | `ml-keras` or `ml-tensorflow` |
| Hugging Face | `HuggingFaceTextSignalModel` | `ml-transformers` |

All heavy libraries are imported lazily. The base API can list recipes and
build configs without TensorFlow, Prophet, sktime, PyOD, or transformers
installed; fitting one of those classes raises a targeted install message if
the corresponding extra is missing.
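
The lazy-import behaviour described above can be sketched as follows. This is a generic pattern, not the project's actual helper: `require_optional` and `LazyBackedModel` are hypothetical names, and the message wording is an assumption.

```python
import importlib
from types import ModuleType
from typing import Optional


def require_optional(module_name: str, extra: str) -> ModuleType:
    """Import a heavy dependency on first use, or say which extra provides it."""
    try:
        return importlib.import_module(module_name)
    except ImportError as exc:
        raise RuntimeError(
            f"'{module_name}' is not installed; install the '{extra}' extra to use this model."
        ) from exc


class LazyBackedModel:
    """Hypothetical wrapper whose heavy dependency is only imported inside fit()."""

    def __init__(self, module_name: str, extra: str) -> None:
        # Construction stays cheap: nothing heavy is imported here,
        # so listing recipes or building configs works without the extra.
        self.module_name = module_name
        self.extra = extra
        self.backend: Optional[ModuleType] = None

    def fit(self, dataset: object) -> "LazyBackedModel":
        self.backend = require_optional(self.module_name, self.extra)
        return self
```

Deferring the import to `fit()` is what lets the registry enumerate every class even on an installation that has none of the optional extras.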

## Model zoo (Tier B — scaffolded stubs)

These classes register into `alphaswarm.core.registry` so the Strategy Browser
enumerates them, but `fit()` raises `NotImplementedError` with a pointer to
the canonical qlib implementation. Port them incrementally:

`GATsModel`, `HISTModel`, `TRAModel`, `ADDModel`, `ADARNNModel`,
`TCTSModel`, `SFMModel`, `SandwichModel`, `KRNNModel`, `IGMTFModel`.
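
A Tier B stub might look roughly like the sketch below. The registry here is a local stand-in so the example runs on its own; the real decorator lives in `alphaswarm.core.registry`, and its exact signature and the stub's message text are assumptions.

```python
from typing import Callable, Dict, List, Tuple, Type

# Local stand-in for the project registry, keyed by (kind, name).
_REGISTRY: Dict[Tuple[str, str], Type] = {}


def register(name: str, kind: str) -> Callable[[Type], Type]:
    """Class decorator that records the class under (kind, name)."""
    def decorator(cls: Type) -> Type:
        _REGISTRY[(kind, name)] = cls
        return cls
    return decorator


@register("GATsModel", kind="model")
class GATsModelStub:
    """Enumerable by a browser, but not yet trainable."""

    def fit(self, dataset: object) -> None:
        raise NotImplementedError(
            "GATsModel is a scaffolded stub; port it from the canonical qlib implementation."
        )


def list_models() -> List[str]:
    """Names of every registered class of kind 'model'."""
    return sorted(name for kind, name in _REGISTRY if kind == "model")
```

Registering the stub keeps the model visible in listings while making the missing implementation fail loudly at `fit()` time.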

## Persistence + MLflow wiring

Every `train_ml_model` run writes a `ModelVersion` row and (when
`register_alpha=True`) registers the pickled model in the MLflow Model
Registry. If you pass `strategy_id`, the run is filed under the
`strategy/` MLflow experiment so the Strategy Browser can link
straight to it.

## Planning-first workflow (split / pipeline / experiment / deployment)

The ML stack now supports a planning layer so datasets, splits, and
preprocessing can be reused deterministically across runs.

1. Create a split plan (fixed / purged-kfold / walk-forward):

```bash
curl -X POST http://localhost:8000/ml/split-plans \
  -H "Content-Type: application/json" \
  -d '{
    "name": "alpha158-fixed-2019-2024",
    "method": "fixed",
    "vt_symbols": ["SPY.NASDAQ", "AAPL.NASDAQ", "MSFT.NASDAQ"],
    "start": "2019-01-01",
    "end": "2024-12-31",
    "config": {
      "segments": {
        "train": ["2019-01-01", "2022-12-31"],
        "valid": ["2023-01-01", "2023-12-31"],
        "test": ["2024-01-01", "2024-12-31"]
      }
    }
  }'
```

2. Save a pipeline recipe (`shared` / `infer` / `learn` processors):

```bash
curl -X POST http://localhost:8000/ml/pipelines \
  -H "Content-Type: application/json" \
  -d '{
    "name": "alpha158-default",
    "infer_processors": [{"class":"Fillna","module_path":"alphaswarm.ml.processors","kwargs":{"fields_group":"feature","fill_value":0.0}}],
    "learn_processors": [{"class":"DropnaLabel","module_path":"alphaswarm.ml.processors","kwargs":{"fields_group":"label"}}]
  }'
```

3. Create an experiment plan tying together dataset/split/pipeline/model
   config, then launch training with `experiment_plan_id`:

```bash
curl -X POST http://localhost:8000/ml/train \
  -H "Content-Type: application/json" \
  -d '{
    "run_name": "alpha158-lgb-plan",
    "experiment_plan_id": "",
    "register_alpha": true
  }'
```

For a richer ML-ops run that persists an `MLExperimentRun` row and logs compact
prediction samples, use the experiment runner:

```bash
curl -X POST http://localhost:8000/ml/experiment-runs \
  -H "Content-Type: application/json" \
  -d '{
    "run_name": "ridge-alpha-smoke",
    "experiment_type": "alpha",
    "dataset_cfg": {"class": "DatasetH", "module_path": "alphaswarm.ml.dataset", "kwargs": {...}},
    "model_cfg": {"class": "SklearnRegressorModel", "module_path": "alphaswarm.ml.models.sklearn", "kwargs": {"estimator": "ridge"}}
  }'
```

Small interactive flows can run synchronously without Celery:

```bash
curl -X POST http://localhost:8000/ml/flows/linear/preview \
  -H "Content-Type: application/json" \
  -d '{"dataset_cfg": {...}, "estimator": "ridge", "alpha": 1.0}'
```

The Next.js web UI exposes the same objects in `/ml/builder`, using a graph
that serializes `Dataset`, `Preprocessing`, `Split`, `Model`, `Records`, and
`Experiment` nodes into the `/ml/experiment-runs` request.

4. Deploy a tested `ModelVersion` as a strategy alpha profile:

```bash
curl -X POST http://localhost:8000/ml/deployments \
  -H "Content-Type: application/json" \
  -d '{
    "name": "lgb-alpha-prod",
    "model_version_id": "",
    "infer_segment": "infer",
    "long_threshold": 0.001,
    "short_threshold": -0.001
  }'
```

Then consume it in strategy YAML via:

```yaml
alpha_model:
  class: DeployedModelAlpha
  module_path: alphaswarm.strategies.ml_alphas
  kwargs:
    deployment_id: ""
```
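
When scripting step 1 instead of using `curl`, it can help to validate a fixed split plan locally before posting it. The sketch below builds the same JSON body as the split-plan request above and checks that the segments are chronological and non-overlapping. The helper names are hypothetical, and whether the server performs equivalent validation is not stated in this document.

```python
import json
from datetime import date
from typing import Dict, List


def validate_fixed_segments(segments: Dict[str, List[str]]) -> None:
    """Raise ValueError unless segments are well-formed and strictly chronological."""
    previous_end = None
    for name, (start_s, end_s) in segments.items():
        start, end = date.fromisoformat(start_s), date.fromisoformat(end_s)
        if start > end:
            raise ValueError(f"segment '{name}' starts after it ends")
        if previous_end is not None and start <= previous_end:
            raise ValueError(f"segment '{name}' overlaps the previous segment")
        previous_end = end


def build_fixed_split_plan(name: str, vt_symbols: List[str], segments: Dict[str, List[str]]) -> str:
    """Return the JSON body for a fixed split plan, validated locally first."""
    validate_fixed_segments(segments)
    payload = {
        "name": name,
        "method": "fixed",
        "vt_symbols": vt_symbols,
        # ISO dates sort lexicographically, so min/max give the overall range.
        "start": min(s[0] for s in segments.values()),
        "end": max(s[1] for s in segments.values()),
        "config": {"segments": segments},
    }
    return json.dumps(payload)


body = build_fixed_split_plan(
    "alpha158-fixed-2019-2024",
    ["SPY.NASDAQ", "AAPL.NASDAQ", "MSFT.NASDAQ"],
    {
        "train": ["2019-01-01", "2022-12-31"],
        "valid": ["2023-01-01", "2023-12-31"],
        "test": ["2024-01-01", "2024-12-31"],
    },
)
```

Overlapping train and validation windows are an easy way to leak future information into training, so catching them before the plan is stored is cheap insurance.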

## Train -> register -> deploy -> score

```mermaid
flowchart LR
    Dataset[DatasetVersion] --> Split[SplitPlan + SplitArtifacts]
    Split --> Recipe[PipelineRecipe]
    Recipe --> Train[ml_tasks.train_ml_model]
    Train --> MLflow[(MLflow registry)]
    MLflow --> ModelVersion[ModelVersion row]
    ModelVersion --> Deploy[ModelDeployment]
    Deploy --> Score[ml_tasks.evaluate / preview]
    Score --> Backtest[backtest replay]
    Score --> WebUI
```

## ML engine major expansion (Alembic 0025)

The ML layer has grown a number of new surfaces, all driven by the
existing `Experiment` / `Model` / `Processor` contracts:

- **`AlphaBacktestExperiment`** — combined "model used as alpha"
  experiment that trains, registers, deploys, backtests, and rolls
  the combined ML + trading metrics into a single MLflow parent run
  and a `ml_alpha_backtest_runs` Postgres row. See
  [alphaswarm_docs/ml-alpha-backtest.md](../../concepts/strategy/ml-alpha-backtest.md).
- **Library coverage** — TF-native (`TFEstimatorModel`),
  Keras Functional / TabTransformer, HuggingFace
  FinBERT / time-series transformer / generative,
  AutoETS / AutoARIMA / Theta / Tbats, PyOD ECOD / SUOD / AutoEncoder,
  Sklearn Stacking / AutoPipeline. See [alphaswarm_docs/ml-libraries.md](../../concepts/strategy/ml-libraries.md).
- **Lightweight workbench flows** — `regression_diagnostics`,
  `unit_root`, `acf_pacf`, `granger_causality`, `cointegration`,
  `garch`, `change_point`, `clustering`, `pca_summary`. See
  [alphaswarm_docs/ml-flows.md](../../concepts/strategy/ml-flows.md).
- **ML preprocessors as data-pipeline nodes** —
  `transform.ml_preprocessing` plus specialised tiles, with a new
  `sink.ml_feature_snapshot` for deterministic feature reload. See
  [alphaswarm_docs/ml-preprocessing-pipeline.md](../../concepts/strategy/ml-preprocessing-pipeline.md).
- **Interactive testing workbench** — `/ml/test/{single,batch,compare,scenario,upload-csv}`
  endpoints + tabbed webui surface. See [alphaswarm_docs/ml-testing.md](../../concepts/strategy/ml-testing.md).
- **Graphical builder palette** — Source / Pipeline / Split / Model
  (per-framework) / Records / Experiment / Test / Deploy sections plus
  an Interactive Workbench drawer. See [alphaswarm_docs/ml-builder.md](../../concepts/strategy/ml-builder.md).
- **Adhoc helpers** — [`alphaswarm.ml.adhoc`](../alphaswarm/ml/adhoc/__init__.py)
  exposes `quick_ridge`, `quick_arima`, `quick_iforest`, etc. for
  notebook iteration.


<!-- https://alpha-swarm.ai/concepts/strategy/ml-libraries -->
# ML library reference
> Per-framework reference for every model wrapper under `alphaswarm/ml/models/`, listing the optional extra and an example config for each library.

# ML library reference

> Per-framework reference for every model wrapper under
> [`alphaswarm/ml/models/`](../alphaswarm/ml/models/). Configs live under
> [`configs/ml/`](../configs/ml/).

## Coverage matrix

| Library | Wrapper(s) | Optional extra | Example config |
| --- | --- | --- | --- |
| scikit-learn | `SklearnRegressorModel`, `SklearnClassifierModel`, `SklearnPipelineModel`, `SklearnStackingModel`, `SklearnAutoPipelineModel` | `ml` | [sklearn_ridge_alpha.yaml](../configs/ml/frameworks/sklearn_ridge_alpha.yaml), [sklearn_stacking_alpha.yaml](../configs/ml/frameworks/sklearn_stacking_alpha.yaml) |
| LightGBM | `LGBModel` | `ml` | [alpha158_lgbm.yaml](../configs/ml/alpha158_lgbm.yaml) |
| XGBoost | `XGBModel` | `ml` | (in tree zoo) |
| CatBoost | `CatBoostModel` | `ml` | (in tree zoo) |
| Keras 3 | `KerasMLPModel`, `KerasLSTMModel`, `KerasFunctionalModel`, `KerasTabTransformerModel` | `ml-keras` | [keras_mlp_alpha.yaml](../configs/ml/frameworks/keras_mlp_alpha.yaml), [keras_tab_transformer.yaml](../configs/ml/frameworks/keras_tab_transformer.yaml) |
| TensorFlow native | `TFEstimatorModel` (linear / DNN / boosted_trees) | `ml-tensorflow` + `ALPHASWARM_TF_NATIVE_ENABLED=true` | [tf_estimator_dnn.yaml](../configs/ml/frameworks/tf_estimator_dnn.yaml) |
| PyTorch (qlib ports) | `LSTMTSModel`, `TransformerTSModel`, `TCNTSModel`, `TabNetModel`, `HISTModel`, `GATsModel`, `TRAModel`, … | `ml-torch` | [alpha360_*.yaml](../configs/ml/) |
| Prophet | `ProphetForecastModel` | `ml-forecast` | [prophet_forecast_alpha.yaml](../configs/ml/frameworks/prophet_forecast_alpha.yaml) |
| sktime | `SktimeForecastModel`, `SktimeReductionForecastModel`, `AutoETSForecastModel`, `AutoARIMAForecastModel`, `ThetaForecastModel`, `BatsTbatsForecastModel` | `ml-forecast` | [sktime_reduction_forecast.yaml](../configs/ml/frameworks/sktime_reduction_forecast.yaml), [auto_ets_forecast.yaml](../configs/ml/frameworks/auto_ets_forecast.yaml), [auto_arima_forecast.yaml](../configs/ml/frameworks/auto_arima_forecast.yaml), [theta_forecast.yaml](../configs/ml/frameworks/theta_forecast.yaml) |
| PyOD | `PyODAnomalyModel` (iforest / knn / ecod / copod / lof / suod / auto_encoder / hbos / mcd / ocsvm / pca) | `ml-anomaly` | [pyod_anomaly_alpha.yaml](../configs/ml/frameworks/pyod_anomaly_alpha.yaml), [pyod_ecod_anomaly.yaml](../configs/ml/frameworks/pyod_ecod_anomaly.yaml) |
| HuggingFace transformers | `HuggingFaceTextSignalModel`, `HuggingFaceFinBertSentimentModel`, `HuggingFaceTimeSeriesModel`, `HuggingFaceGenerativeForecastModel` | `ml-transformers` (+ `ALPHASWARM_HF_TIMESERIES_ENABLED=true` for time-series) | [huggingface_finbert_signal.yaml](../configs/ml/frameworks/huggingface_finbert_signal.yaml), [hf_finbert_sentiment.yaml](../configs/ml/frameworks/hf_finbert_sentiment.yaml), [hf_patchtst_forecast.yaml](../configs/ml/frameworks/hf_patchtst_forecast.yaml) |

## Adhoc / notebook surface

[`alphaswarm.ml.adhoc`](../alphaswarm/ml/adhoc/__init__.py) exposes a `quick_*`
namespace for one-off analyses without spelling out a full
`Experiment` config:

```python
from alphaswarm.ml.adhoc import (
    quick_arima,
    quick_ecod,
    quick_finbert_sentiment,
    quick_iforest,
    quick_panel_fixed_effects,
    quick_prophet,
    quick_ridge,
    quick_text_embed,
    quick_theta,
)

# Linear / ridge / elasticnet
ridge = quick_ridge(features_df, target_series, alpha=1.0)
print(ridge.score, ridge.coefficients)

# Anomaly detection
iforest = quick_iforest(features_df, contamination=0.05)
print(iforest.n_anomalies)

# Forecasting
arima = quick_arima(series, horizon=10, order=(1, 1, 1))
prophet = quick_prophet(series, horizon=10)
theta = quick_theta(series, horizon=10)

# Embeddings & sentiment
embeds = quick_text_embed(headlines)
sentiment = quick_finbert_sentiment(headlines)

# Panel diagnostics
fe = quick_panel_fixed_effects(panel, target_col="y", entity_col="vt_symbol")
```

## Where to add a new wrapper

1. Implement the class under `alphaswarm/ml/models/.py`,
   subclassing [`Model`](../alphaswarm/ml/base.py).
2. Decorate with `@register("Name", kind="model")` from
   [`alphaswarm.core.registry`](../alphaswarm/core/registry.py).
3. Make optional imports lazy (raise `RuntimeError` mentioning the
   right extra) so the rest of the registry keeps working.
4. Add a YAML under `configs/ml/frameworks/`.
5. Add a hermetic test under `tests/ml/models/` that monkey-patches
   the optional dep when needed.
6. Cross-list it here.

See [`alphaswarm_docs/ml-framework.md`](../../concepts/strategy/ml-framework.md) for the full registry +
`Experiment` contract.
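
Step 5 asks for a hermetic test that monkey-patches the optional dependency. One way to do that with only the standard library is to place a fake module in `sys.modules` for the duration of the test, as sketched below. The module name `fancy_ml_lib` and the wrapper are invented for illustration; a real test would target the actual dependency and wrapper class.

```python
import importlib
import sys
import types
from contextlib import contextmanager
from typing import Iterator


@contextmanager
def fake_module(name: str, **attrs: object) -> Iterator[types.ModuleType]:
    """Temporarily install a stub module so lazy imports resolve without the real package."""
    stub = types.ModuleType(name)
    for key, value in attrs.items():
        setattr(stub, key, value)
    previous = sys.modules.get(name)
    sys.modules[name] = stub
    try:
        yield stub
    finally:
        # Restore whatever was there before, so other tests are unaffected.
        if previous is None:
            sys.modules.pop(name, None)
        else:
            sys.modules[name] = previous


class FancyWrapper:
    """Hypothetical wrapper that imports its backend lazily inside fit()."""

    def fit(self, values: list) -> float:
        backend = importlib.import_module("fancy_ml_lib")  # invented dependency name
        return backend.train(values)


with fake_module("fancy_ml_lib", train=lambda values: float(sum(values))):
    fitted = FancyWrapper().fit([1, 2, 3])
```

Because the stub is removed again on exit, the test neither needs the optional extra installed nor leaks the fake into later tests.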


<!-- https://alpha-swarm.ai/concepts/strategy/ml-preprocessing-pipeline -->
# ML preprocessing as data-pipeline nodes
> Before this expansion, the only way to apply an ML preprocessing recipe was to load a `Dataset` and call `Processor.fit_process` — which only works for offline `Experiment` runs. Promoting processors ...

# ML preprocessing as data-pipeline nodes

> Bridges [`alphaswarm.ml.processors`](../alphaswarm/ml/processors.py) into the data
> engine ([`alphaswarm/data/engine`](../alphaswarm/data/engine/)) so an
> ``alphaswarm.data.engine.PipelineManifest`` can chain
> ``source -> ml_preprocessing -> sink`` like any other transform.

## Why

Before this expansion, the only way to apply an ML preprocessing
recipe was to load a `Dataset` and call `Processor.fit_process` —
which only works for offline `Experiment` runs. Promoting processors
to first-class data-engine nodes lets you:

- Materialise preprocessed features into Iceberg via
  ``sink.ml_feature_snapshot`` and reload them deterministically in
  later training runs.
- Reuse the same recipe in batch ingestion AND online inference.
- Drop a saved ``PipelineRecipe`` row directly onto the manifest
  builder canvas via ``POST /ml/pipelines/{id}/as-node``.

## Two layers

### Umbrella node — `transform.ml_preprocessing`

Accepts either a saved ``recipe_id`` or an inline ``processors`` list.
Re-uses [`apply_processor_specs`](../alphaswarm/ml/pipeline_recipes.py) so a
manifest run applies the same transformation as the offline ML
training loop.

```yaml
- name: transform.ml_preprocessing
  kwargs:
    recipe_id: 1c5b...    # optional — saved /ml/pipelines recipe
    processors:           # optional inline overlay
      - class: WinsorizeByQuantile
        module_path: alphaswarm.ml.processors
        kwargs: {lower_q: 0.01, upper_q: 0.99}
    fit: true
```
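
The fit-then-apply behaviour behind a quantile winsoriser can be sketched in plain Python as below. The class is a hypothetical stand-in, not the `WinsorizeByQuantile` implementation, and it assumes that `fit: true` means the bounds are estimated from the incoming data while `fit: false` would reuse previously fitted bounds.

```python
from typing import List, Optional, Tuple


def quantile(sorted_values: List[float], q: float) -> float:
    """Linearly interpolated quantile of an already sorted, non-empty list."""
    pos = q * (len(sorted_values) - 1)
    lo = int(pos)
    hi = min(lo + 1, len(sorted_values) - 1)
    frac = pos - lo
    return sorted_values[lo] * (1.0 - frac) + sorted_values[hi] * frac


class QuantileWinsorizer:
    """Clip values to the [lower_q, upper_q] quantile range learned during fit()."""

    def __init__(self, lower_q: float = 0.01, upper_q: float = 0.99) -> None:
        self.lower_q = lower_q
        self.upper_q = upper_q
        self.bounds: Optional[Tuple[float, float]] = None

    def fit(self, values: List[float]) -> "QuantileWinsorizer":
        ordered = sorted(values)
        self.bounds = (quantile(ordered, self.lower_q), quantile(ordered, self.upper_q))
        return self

    def transform(self, values: List[float]) -> List[float]:
        if self.bounds is None:
            raise RuntimeError("call fit() first, or load previously fitted bounds")
        lo, hi = self.bounds
        return [min(max(v, lo), hi) for v in values]


train = [float(i) for i in range(101)]
winsorizer = QuantileWinsorizer(lower_q=0.01, upper_q=0.99).fit(train)
clipped = winsorizer.transform([-50.0, 50.0, 500.0])
```

Keeping the learned bounds separate from `transform` is what allows the same recipe to be applied consistently in batch ingestion and at inference time.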

### Specialized convenience nodes

Each maps onto a single processor and shows up in the Manifest Builder
palette as its own tile:

| Node name | Processor |
| --- | --- |
| ``transform.ml_scale`` | `SklearnTransformerProcessor` (Standard / Robust / MinMax) |
| ``transform.ml_winsorize`` | `WinsorizeByQuantile` |
| ``transform.ml_lag_features`` | `LagFeatureGenerator` |
| ``transform.ml_rolling_features`` | `RollingFeatureGenerator` |
| ``transform.ml_seasonal_decompose`` | `SeasonalDecomposeFeatures` |
| ``transform.ml_pyod_outliers`` | `PyODOutlierFilter` |
| ``transform.ml_imputation`` | `Fillna` |
| ``transform.ml_target_encode`` | `TargetEncode` |

## Sink — `sink.ml_feature_snapshot`

Iceberg writer that stamps the resulting table with
``pipeline_recipe_id``, ``dataset_version_id``, and a stable
``feature_snapshot_id`` so downstream training runs can reload exactly
the same preprocessed features:

```yaml
- name: sink.ml_feature_snapshot
  kwargs:
    namespace: ml.features
    table: alpha_panel_v1
    pipeline_recipe_id: 1c5b...
    dataset_version_id: 9f8a...
    mode: append
```

The sink's result includes a ``feature_snapshot_id`` UUID; persist it
in the dataset registry so future ``DatasetH`` instances can lazily
reload from the snapshot table.

## End-to-end flow

```mermaid
graph LR
    Source[source.icebergohlcv] --> Recipe["transform.ml_preprocessing (saved recipe_id)"]
    Recipe --> Snap["sink.ml_feature_snapshot(ml.features.alpha_panel_v1)"]
    Snap --> Train["Experiment training (reuses snapshot)"]
    Train --> Deploy[ModelDeployment]
    Deploy --> Live["DeployedModelAlpha (online inference)"]
```

## REST

```bash
# Materialise a saved recipe into a manifest fragment for the
# Pipeline Builder UI.
curl -XPOST http://localhost:8000/ml/pipelines//as-node \
  -d '{"fit": false}' -H 'content-type: application/json'
```

Returns:

```json
{
  "name": "transform.ml_preprocessing",
  "label": "my-recipe",
  "enabled": true,
  "kwargs": {"recipe_id": "", "fit": false}
}
```


<!-- https://alpha-swarm.ai/concepts/strategy/ml-testing -->
# Interactive ML testing workbench
> The `/ml/test` page lets users validate deployed models with single rows, batch slices, A/B comparisons, perturbation sweeps, CSV uploads, and live streaming, all wired through the same `DeployedModelAlpha` runtime that production strategies use.

# Interactive ML testing workbench

> **Superseded by [strategy-development.md](../../concepts/strategy/strategy-development.md).**
> The webui `/ml/test` page is preserved for legacy bookmarks but the
> canonical surfaces now live as sibling sub-routes of
> `/strategy-development/*` on the new Vite frontend. The endpoint
> table below is still authoritative — only the frontend changed.

> The `/ml/test` page lets users validate deployed models with single
> rows, batch slices, A/B comparisons, perturbation sweeps, CSV
> uploads, and live streaming — all wired through the same
> [`DeployedModelAlpha`](../alphaswarm/strategies/ml_alphas.py) runtime that
> production strategies use.

## Tabs

| Tab | Endpoint(s) | Behaviour |
| --- | --- | --- |
| Single Predict | `POST /ml/test/single` (sync) | Score one row, render score + sign |
| Batch | `POST /ml/test/batch` (Celery) + `POST /ml/test/upload-csv` | Iceberg slice or uploaded CSV scoring |
| A/B Compare | `POST /ml/test/compare` (Celery) | Side-by-side signals + agreement rate |
| Scenario / What-if | `POST /ml/test/scenario` (sync) | Per-feature ±N% perturbation table + heatmap |
| Historical | `POST /ml/evaluate` (Celery) | Existing offline eval flow |
| Live | `POST /ml/live-test/start` + WS bridge | Stream bars / signals from a venue |
| Models | n/a | Tabular `ModelVersion` browser |

## Backend

[`alphaswarm/tasks/ml_test_tasks.py`](../alphaswarm/tasks/ml_test_tasks.py) hosts the
Celery tasks (queue `ml`):

- `predict_single` — single-row inference
- `predict_batch` — Iceberg slice scoring
- `compare_models` — A/B between two `model_version_id`s
- `scenario_perturbation` — sensitivity table

Each task routes through [`DeployedModelAlpha._predict`](../alphaswarm/strategies/ml_alphas.py)
so dataset-driven AND legacy indicator-zoo paths both work.

## Sample REST calls

```bash
# Single prediction (sync)
curl -XPOST http://localhost:8000/ml/test/single \
  -d '{"deployment_id": "...", "feature_row": {"f1": 0.1, "f2": -0.4}, "sync": true}' \
  -H 'content-type: application/json'

# Scenario sweep
curl -XPOST http://localhost:8000/ml/test/scenario \
  -d '{"deployment_id": "...", "feature_row": {"f1": 0.1, "f2": -0.4}, "perturbations": [-0.1, 0, 0.1]}' \
  -H 'content-type: application/json'

# CSV upload (multipart)
curl -XPOST 'http://localhost:8000/ml/test/upload-csv?deployment_id=...' \
  -F 'file=@features.csv'
```

The CSV upload path is capped via
``settings.ml_workbench_max_csv_mb`` (default 20 MB).
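
To show what the scenario sweep returns conceptually, the sketch below builds a per-feature sensitivity table around a toy scoring function. It assumes each perturbation is a relative change (so `0.1` means +10% of the feature's value); the server-side semantics of `perturbations` are not spelled out here, and `scenario_table` and `toy_score` are illustrative names.

```python
from typing import Callable, Dict, List


def scenario_table(
    score_fn: Callable[[Dict[str, float]], float],
    feature_row: Dict[str, float],
    perturbations: List[float],
) -> List[dict]:
    """Bump one feature at a time and record how far the score moves from baseline."""
    base = score_fn(feature_row)
    rows = []
    for feature, value in feature_row.items():
        for pct in perturbations:
            bumped = dict(feature_row)  # copy so other features stay at baseline
            bumped[feature] = value * (1.0 + pct)
            score = score_fn(bumped)
            rows.append(
                {"feature": feature, "perturbation": pct, "score": score, "delta": score - base}
            )
    return rows


def toy_score(row: Dict[str, float]) -> float:
    """Toy linear model standing in for a deployed model."""
    return 2.0 * row["f1"] - 1.0 * row["f2"]


table = scenario_table(toy_score, {"f1": 0.1, "f2": -0.4}, [-0.1, 0.0, 0.1])
```

Each row corresponds to one cell of the perturbation table; the zero perturbation reproduces the baseline score and serves as a sanity check.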

## Visualisations

The webui renders results with
[`recharts`](https://recharts.org/) (already a dependency):

- Single Predict — Descriptions card with score + bias tag.
- Scenario — `BarChart` of deltas + sortable Ant Design table.
- Live — line chart overlay of bar close + signal strength + recent
  events list.

## Where this gets called from

- Standalone: `/ml/test`.
- ML Builder: a `Test*` node on the canvas serializes to the
  matching `/ml/test/*` endpoint.
- AlphaBacktestExperiment: when `train_first=true` it stamps the new
  deployment id on `MLAlphaBacktestRun`, so the next visit to
  `/ml/test` can score against it directly.


<!-- https://alpha-swarm.ai/concepts/strategy/mlops-service -->
# MLOps service (initial slice)

# MLOps service inside `alphaswarm_models/`

This page documents the initial MLOps service shipped as additive
extensions to the established `alphaswarm_models/` boundary. The service
provides the agentic plumbing the two MLOps reports asked for — a
polymorphic agent-facing interface layer, MLOps lifecycle handlers,
external-registry adapters, hash-locked skills, OOD safety rules, a
dedicated MCP server, and the matching REST + Celery + frontend
surfaces — all on top of the existing models / predictors / serving
infrastructure.

## What's new

### `alphaswarm_models/src/alphaswarm_models/interfaces/`

Five agent-facing polymorphic ABCs that wrap any concrete model in a
stable contract:

| Interface | Method | Application |
| --- | --- | --- |
| `Predictor` | `predict(features)` | Point-in-time value estimation |
| `Forecaster` | `forecast(history, horizon)` | Multi-step temporal projection |
| `Classifier` | `classify(data)` | Discrete probability distribution |
| `Segmenter` | `segment(series)` | Structural-break detection |
| `Analyzer` | `analyze(unstructured)` | NLP / sentiment scoring |

All register under `kind="interface"` in `alphaswarm.core.registry`. Agents
program against `Predictor.predict` regardless of whether XGBoost,
LSTM, or HuggingFace pipelines back the call.
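
The value of the interface layer is easiest to see with two interchangeable backends. The sketch below mirrors only the `Predictor` row of the table; the real ABC in `alphaswarm_models.interfaces` may have a different signature, and both backends and `agent_decision` are invented for illustration.

```python
from abc import ABC, abstractmethod
from typing import Dict, Mapping


class Predictor(ABC):
    """Point-in-time value estimation behind a stable contract."""

    @abstractmethod
    def predict(self, features: Mapping[str, float]) -> float:
        ...


class LinearBackend(Predictor):
    """Toy backend: weighted sum of the features it knows about."""

    def __init__(self, weights: Dict[str, float]) -> None:
        self.weights = weights

    def predict(self, features: Mapping[str, float]) -> float:
        return sum(w * features.get(name, 0.0) for name, w in self.weights.items())


class ConstantBackend(Predictor):
    """Toy backend: always returns the same estimate."""

    def __init__(self, value: float) -> None:
        self.value = value

    def predict(self, features: Mapping[str, float]) -> float:
        return self.value


def agent_decision(predictor: Predictor, features: Mapping[str, float]) -> str:
    """Caller code depends only on Predictor.predict, never on the backend type."""
    return "long" if predictor.predict(features) > 0 else "flat"


row = {"momentum": 0.5, "value": -0.1}
decisions = [
    agent_decision(LinearBackend({"momentum": 1.0, "value": 2.0}), row),
    agent_decision(ConstantBackend(-1.0), row),
]
```

Swapping the backend changes the decision but not a single line of the calling code, which is the property the agent-facing contract is meant to guarantee.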

### `alphaswarm_models/src/alphaswarm_models/handlers/`

Six MLOps lifecycle handler classes:

| Handler | Purpose |
| --- | --- |
| `CacheHandler` | LRU + safetensors-first model cache (budgets in `settings.ml_cache_*`) |
| `LoadHandler` | Cryptographic verification + safetensors-preferred deserialisation |
| `SaveHandler` | torch state_dict → `.safetensors` with SHA-256 sidecar |
| `StoreHandler` | Object-store upload + lineage metadata |
| `ProductionizeHandler` | Drive the `productionize/` compiler pipeline |
| `ServeHandler` | Continuous-batching queue with kill-switch fan-out |

All inherit `MLOpsHandler` so every lifecycle operation runs the same
`policy_check` + lineage emission contract (`LineageBus`).
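
The save-with-sidecar and verify-on-load pairing can be illustrated with the standard library alone. This is a sketch of the general technique, not the `SaveHandler` / `LoadHandler` code: the `.sha256` sidecar naming and the function names are assumptions, and raw bytes stand in for a serialised model.

```python
import hashlib
import tempfile
from pathlib import Path


def sha256_of(path: Path) -> str:
    """Stream the file through SHA-256 so large artifacts are not read into memory at once."""
    digest = hashlib.sha256()
    with path.open("rb") as fh:
        for chunk in iter(lambda: fh.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def sidecar_for(path: Path) -> Path:
    """Assumed naming: the digest lives next to the artifact with a '.sha256' suffix."""
    return path.with_name(path.name + ".sha256")


def save_with_sidecar(payload: bytes, path: Path) -> Path:
    """Write the artifact, then record its digest in the sidecar file."""
    path.write_bytes(payload)
    sidecar = sidecar_for(path)
    sidecar.write_text(sha256_of(path))
    return sidecar


def load_verified(path: Path) -> bytes:
    """Refuse to return the artifact if its digest no longer matches the sidecar."""
    expected = sidecar_for(path).read_text().strip()
    actual = sha256_of(path)
    if actual != expected:
        raise ValueError(f"checksum mismatch for {path.name}")
    return path.read_bytes()
```

A short round trip shows the intended behaviour: `save_with_sidecar(b"weights", p)` followed by `load_verified(p)` returns the bytes, while any later modification of the file makes `load_verified` raise.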

### `alphaswarm_models/src/alphaswarm_models/productionize/`

Four compiler classes:

| Compiler | Output | Optional dep |
| --- | --- | --- |
| `OnnxCompiler` | `.onnx` | `torch.onnx` |
| `TensorRTCompiler` | `.engine` | `tensorrt` (Linux GPU only) |
| `TorchScriptCompiler` | `.pt` (trace/script) | `torch` |
| `QuantizationCompiler` | `.pt` (INT8 / FP16) | `torch` |

Each registers via `@register_compiler("alias")` and emits a
`CompiledArtifact` with SHA-256 + size + kwargs into
`ml_compiled_artifacts`.

### `alphaswarm_models/src/alphaswarm_models/adapters/`

External-registry pullers protecting the supply chain:

| Adapter | Notes |
| --- | --- |
| `HuggingFaceAdapter` | Routes downloads through the local cache volume; resolves HF tokens via `CredentialResolver` (`CredentialKey("huggingface", "api_token")`). Honours `settings.ml_hf_hub_offline`. |
| `TorchHubAdapter` | Refuses every name not on `DEFAULT_ALLOWLIST` ∪ the operator allow-list at `CredentialKey("torchhub", "allowlist")`. Verifies SHA-256 before caching. |
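
The allow-list rule in the `TorchHubAdapter` row amounts to a set union followed by a membership check. The sketch below shows that logic in isolation; the entry in `DEFAULT_ALLOWLIST` is a placeholder (the real contents are not listed in this document), and the function names are hypothetical.

```python
from typing import FrozenSet, Iterable

# Placeholder entry; the real DEFAULT_ALLOWLIST contents are not shown in this document.
DEFAULT_ALLOWLIST: FrozenSet[str] = frozenset({"example-org/example-model"})


def effective_allowlist(operator_allowlist: Iterable[str] = ()) -> FrozenSet[str]:
    """Union of the built-in defaults and the operator-supplied names."""
    return DEFAULT_ALLOWLIST | frozenset(operator_allowlist)


def check_allowed(name: str, operator_allowlist: Iterable[str] = ()) -> None:
    """Raise PermissionError for any name outside the effective allow-list."""
    if name not in effective_allowlist(operator_allowlist):
        raise PermissionError(f"'{name}' is not on the TorchHub allow-list")
```

Denying by default means a typo or an unexpected repository name fails closed rather than triggering a download.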

### `alphaswarm_models/src/alphaswarm_models/spec.py` + `runtime.py` + `registry.py`

Hash-locked **MLSkillSpec** + **MLSkillRuntime** mirroring the
existing `AgentSpec`/`BotSpec`/`RLExperimentSpec`/`AnalysisSpec`
runtime pattern. New Alembic 0081 tables:

- `ml_skills` + `ml_skill_versions` (hash-locked snapshots)
- `ml_skill_runs` (run ledger with `experiment_id` + `test_id` FKs, AGENTS rule 34)

Seed skill YAMLs ship under `alphaswarm_models/configs/skills/`:

- `regime_aware_alpha.yaml` — Classifier → Predictor (regime-specialised)
- `multi_horizon_forecast.yaml` — Forecaster + Analyzer (sentiment overlay)

### `alphaswarm_models/src/alphaswarm_models/rules/`

Inference-time OOD safety rules, registered through a metaclass-driven
`RuleRegistry`:

- `OODGuard` — z-score threshold check.
- `RangeGuard` — absolute min/max window check.
- `TensorShapeGuard` — input-shape mismatch check.
- `CircuitBreaker` — rolling-window failure tracker that trips at
  `max_failures` per `window_seconds`.

Rule packs live under `alphaswarm_models/configs/rules/`; the default is
`ood_default.yaml`.
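
Two of these rules are simple enough to sketch directly. The classes below are illustrative re-implementations, not the shipped `OODGuard` and `CircuitBreaker`: constructor arguments other than `max_failures` and `window_seconds`, the default threshold of 3.0, and the method names are all assumptions.

```python
import time
from collections import deque
from typing import Deque, Optional


class OODGuard:
    """Flag inputs whose z-score against training statistics exceeds a threshold."""

    def __init__(self, mean: float, std: float, threshold: float = 3.0) -> None:
        self.mean = mean
        self.std = std
        self.threshold = threshold

    def in_distribution(self, value: float) -> bool:
        return abs(value - self.mean) / self.std <= self.threshold


class CircuitBreaker:
    """Trip once max_failures have occurred within the last window_seconds."""

    def __init__(self, max_failures: int, window_seconds: float) -> None:
        self.max_failures = max_failures
        self.window_seconds = window_seconds
        self._failures: Deque[float] = deque()

    def record_failure(self, now: Optional[float] = None) -> None:
        self._failures.append(time.monotonic() if now is None else now)

    def is_open(self, now: Optional[float] = None) -> bool:
        now = time.monotonic() if now is None else now
        cutoff = now - self.window_seconds
        # Forget failures that have aged out of the rolling window.
        while self._failures and self._failures[0] <= cutoff:
            self._failures.popleft()
        return len(self._failures) >= self.max_failures
```

Passing `now` explicitly makes the breaker deterministic in tests, while production callers can rely on the monotonic clock default.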

### `alphaswarm/data/mcp/tools/ml.py`

Fourteen `data.ml.*` DataMCP tools — the canonical Hard Rule 22 path
agents use to drive the entire MLOps surface (predict, forecast,
classify, segment, analyze, pull, compile, list, run skills, halt
serving). Each tool registers via `@register_data_mcp_tool` so both
transports — the in-process bridge and the FastAPI
router/stdio binary — pick it up.

### `alphaswarm/ml_mcp/` + `alphaswarm-ml-mcp` binary

A dedicated MCP server publishing the same `data.ml.*` slice under
its own canonical URI (`settings.mcp_ml_canonical_uri`). Tokens
minted for the MLOps audience cannot be replayed against the data
MCP and vice versa (RFC 8707, Hard Rule 49). The RFC 9728 metadata
document lives at `/.well-known/oauth-protected-resource/mcp/ml`.

### REST + Celery

New routes under the existing `/ml/*` router plus a fresh
`/ml/skills/*` router. Long-running ops dispatch to four new Celery
modules: `ml_pull_tasks`, `ml_serving_tasks`,
`ml_productionize_tasks`, `ml_skill_tasks`. All emit progress via
`_progress.emit` (Hard Rule 4).

### Frontend (Vite)

Three new routes under `alphaswarm_client/src/routes/ml/`:

- `/ml/skills` — registry browser + invocation form.
- `/ml/serving` — live continuous-batching session monitor with
  per-session halt button.
- `/ml/pull` — HuggingFace/TorchHub model puller.

`KillSwitch.tsx` fans out to `POST /ml/serving/halt-all` alongside
the existing halt endpoints (Hard Rule 2 in `frontend.mdc`).

### Identity + topology

- `alphaswarm.config.settings` gains nine new `ml_*` knobs (cache budgets,
  serving defaults, OOD threshold, offline toggles, MCP canonical URI
  + URL).
- `alphaswarm_platform/configs/deployment/topology.yaml` gains an
  `alphaswarm-ml-mcp` service entry (Hard Rule 47).
- `alphaswarm/config/topology_fallback.py` maps `mcp_ml_url` →
  `alphaswarm-ml-mcp.http`.

## Agent usage

The seed `mlops_assistant` AgentSpec at
`configs/agents/mlops_assistant.yaml` drives the MLOps surface
exclusively through the `data.ml.*` tools. Operators invoke it the
same way as any other AgentSpec — `AgentRuntime.run(...)` (never call
`router_complete` directly per Hard Rule 12).

## Validation

```bash
# Source compile check (in bash, run `shopt -s globstar` first so `**` recurses):
python -m py_compile alphaswarm_models/src/alphaswarm_models/{interfaces,handlers,adapters,rules,productionize,tasks}/**/*.py

# New migration is hashed into the lock file:
python scripts/ci/check_migration_immutability.py

# DataMCP catalog discovery:
curl http://localhost:8000/mcp/data/tools | jq '.tools[] | select(.name | startswith("data.ml."))'

# MLOps MCP discovery:
curl http://localhost:8000/.well-known/oauth-protected-resource/mcp/ml
```

## What is explicitly out of scope

- Mutating an existing migration. The 0081 migration is immutable
  once shipped (Hard Rule 6); future schema changes land in 0082+.
- Streamlit / Solara surfaces. The legacy stack is rollback-only.
- Free-text URN input. Every entity selection uses `EntityPicker`
  (Hard Rule 29).


<!-- https://alpha-swarm.ai/concepts/strategy/optimal-control -->
# Optimal-control / HJB math layer
> The optimal-control package — [alphaswarm/optimal_control/](../alphaswarm/optimal_control/) — hosts the JAX-compiled implementations of two canonical Hamilton-Jacobi-Bellman problems:

# Optimal-control / HJB math layer

> **Audience:** quants extending AlphaSwarm with optimal-execution or
> market-making models, plus AI agents that need to reason about
> the closed-form solvers.

The optimal-control package — [alphaswarm/optimal_control/](../alphaswarm/optimal_control/) — hosts
the JAX-compiled implementations of two canonical Hamilton-Jacobi-Bellman
problems:

- **Avellaneda-Stoikov 2008** market making —
  [alphaswarm/optimal_control/avellaneda_stoikov.py](../alphaswarm/optimal_control/avellaneda_stoikov.py).
- **Cartea-Jaimungal-Penalva 2015** inventory-penalised optimal liquidation —
  [alphaswarm/optimal_control/cartea_jaimungal.py](../alphaswarm/optimal_control/cartea_jaimungal.py).

The convenience layer [alphaswarm/optimal_control/hjb_solver.py](../alphaswarm/optimal_control/hjb_solver.py)
exposes ``solve_avst`` / ``solve_cj`` / ``value_function_to_arrow`` so the
analysis-flow runner can dispatch them uniformly and persist the
results to ``alphaswarm_gold_analysis_optimal_control`` per AGENTS rule 21.

## Where to invoke

Three call sites cover almost every use case.

### 1. Direct Python API

```python
from alphaswarm.optimal_control import compute_optimal_quotes, solve_avst

# Single-point AvSt quotes — pure JIT-compiled JAX path.
res = compute_optimal_quotes(
    mid_price=100.0,
    inventory=10.0,
    gamma=0.1,
    sigma=0.02,
    k=1.5,
    T_minus_t=1.0,
)
print(res.bid, res.ask, res.half_spread)

# Inventory grid via vmap.
out = solve_avst(
    mid_price=100.0,
    inventory_grid=[-50, -25, 0, 25, 50],
    gamma=0.1, sigma=0.02, k=1.5, T_minus_t=1.0,
)
```

### 2. Analysis flows (preferred — gives you UI form + Iceberg persistence)

```python
from alphaswarm.analysis import run_flow

result = run_flow(
    "optimal_control.avellaneda_stoikov_quotes",
    None,
    {
        "mid_price": 100.0,
        "inventory_min": -50.0,
        "inventory_max": 50.0,
        "inventory_step": 5.0,
        "gamma": 0.1, "sigma": 0.01, "k": 1.5, "T_minus_t": 1.0,
    },
)
```

The flow is a thin facade over ``solve_avst`` and writes its rows to
the gold-tier ``alphaswarm_gold_analysis_optimal_control.`` namespace
when invoked through ``AnalysisRuntime``.

### 3. Agent-callable DataMCPTool

```python
# inside an AgentSpec body the tool surfaces as ``data.optimal_control.solve_hjb``
result = ctx.tools["data.optimal_control.solve_hjb"].invoke(
    ctx=mcp_ctx, model="avst", mid_price=100.0, inventory=10.0,
    gamma=0.1, sigma=0.01, k=1.5, T_minus_t=1.0,
)
```

The tool is registered in
[alphaswarm/data/mcp/tools/optimal_control.py](../alphaswarm/data/mcp/tools/optimal_control.py)
and complies with AGENTS rule 22 — agents never read Iceberg / Postgres
directly.

## Avellaneda-Stoikov (single-asset)

Reservation price plus optimal half-spread:

```
r(s, q, t) = s − q · γ · σ² · (T − t)
δ        = ½ · γ · σ² · (T − t) + (1/γ) · ln(1 + γ/k)
bid       = r − δ
ask       = r + δ
```

The JAX kernel ``_avst_kernel`` is JIT-compiled with ``@jax.jit`` and
takes only Python floats / arrays — no I/O, no globals, no Python
control flow keyed on values. ``vmap`` lets us evaluate the kernel
across an inventory grid in one compiled call.
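
As a cross-check on the formulas above, the following is a minimal pure-Python sketch of the same closed form. The helper name `avst_quotes_reference` is hypothetical and is not part of the package; it only borrows the keyword names used by ``compute_optimal_quotes`` and makes no claim about how ``_avst_kernel`` is written.

```python
import math


def avst_quotes_reference(mid_price, inventory, gamma, sigma, k, T_minus_t):
    """Hypothetical reference helper: evaluates the closed form above in plain Python."""
    # Reservation price: the mid shifted against the current inventory.
    reservation = mid_price - inventory * gamma * sigma**2 * T_minus_t
    # Optimal half-spread: inventory-risk term plus the order-arrival term.
    half_spread = (
        0.5 * gamma * sigma**2 * T_minus_t
        + (1.0 / gamma) * math.log(1.0 + gamma / k)
    )
    return reservation - half_spread, reservation + half_spread, half_spread


bid, ask, half_spread = avst_quotes_reference(
    mid_price=100.0, inventory=10.0, gamma=0.1, sigma=0.02, k=1.5, T_minus_t=1.0
)
# A long inventory shifts both quotes down relative to a flat book.
print(bid, ask, half_spread)
```

The quotes stay symmetric around the reservation price, so ``ask - bid`` is always twice the half-spread regardless of inventory.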

The closed-form GLFT 2013 variant (``glft_closed_form``) is what
[alphaswarm.strategies.hft.alphas.GLFTMM](../alphaswarm/strategies/hft/alphas.py)
calls on every event. Its ``2/γ · ln(1 + γ/k)`` term differs from the
finite-horizon AvSt ``1/γ · ln(...)`` by a factor of two — that's the
long-horizon limit.

## Cartea-Jaimungal-Penalva (inventory-penalised liquidation)

Linear-quadratic ansatz ``H(t, q, S) = q·S + h₂(t)·q² + h₁(t)·q + h₀(t)``
reduces the HJB to a system of three coupled ODEs:

```
dh₂/dt =  φ − h₂² / κ
dh₁/dt = −h₁ · h₂ / κ
dh₀/dt = −h₁² / (4 · κ)
```

Solved backwards from the terminal conditions ``h₂(T) = −α`` and
``h₁(T) = h₀(T) = 0`` via fixed-step RK4. The optimal feedback
trading rate is

```
ν*(t, q) = − (h₂(t) · q + ½ · h₁(t)) / κ
```

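When ``φ > 0`` the running inventory penalty adds urgency, so the agent
sells (or buys) faster than it would with ``φ = 0``. When ``φ = 0`` the
only urgency comes from the terminal penalty ``α``: ``h₁`` stays at zero
and the rate reduces to ``q / (κ/α + T − t)``, which approaches the TWAP
rate ``q / (T − t)`` as ``α → ∞``.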
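
The backward RK4 sweep can be sketched in a few lines of plain Python. This is an illustration under stated assumptions, not the package's solver: the function names are hypothetical, only ``h₂`` is integrated (``h₁`` stays at zero from its terminal condition), and the running penalty is assumed to enter the objective as ``−φ·q²``, which gives ``dh₂/dt = φ − h₂²/κ``.

```python
def solve_h2_rk4(phi, kappa, alpha, T, n_steps=1000):
    """Integrate dh2/dt = phi - h2**2 / kappa backwards from h2(T) = -alpha.

    Returns h2 on the uniform grid t_i = i * T / n_steps (index 0 is t = 0).
    Hypothetical helper, written for illustration only.
    """
    def rhs(h2):
        return phi - h2 * h2 / kappa

    dt = T / n_steps
    h2 = [0.0] * (n_steps + 1)
    h2[-1] = -alpha
    for i in range(n_steps, 0, -1):
        y = h2[i]
        # Classical RK4 with a negative step (-dt) to march from T towards 0.
        k1 = rhs(y)
        k2 = rhs(y - 0.5 * dt * k1)
        k3 = rhs(y - 0.5 * dt * k2)
        k4 = rhs(y - dt * k3)
        h2[i - 1] = y - dt * (k1 + 2.0 * k2 + 2.0 * k3 + k4) / 6.0
    return h2


def trading_rate(h2_t, q, kappa, h1_t=0.0):
    """Feedback rate nu*(t, q) = -(h2 * q + h1 / 2) / kappa."""
    return -(h2_t * q + 0.5 * h1_t) / kappa


kappa, alpha, T = 0.01, 1.0, 1.0
h2_no_penalty = solve_h2_rk4(phi=0.0, kappa=kappa, alpha=alpha, T=T)
h2_penalised = solve_h2_rk4(phi=0.05, kappa=kappa, alpha=alpha, T=T)

# With phi = 0 the Riccati equation has the closed form -1 / (1/alpha + (T - t)/kappa).
closed_form_t0 = -1.0 / (1.0 / alpha + T / kappa)
print(h2_no_penalty[0], closed_form_t0)
print(trading_rate(h2_no_penalty[0], q=100.0, kappa=kappa),
      trading_rate(h2_penalised[0], q=100.0, kappa=kappa))
```

Comparing the numerical ``h₂(0)`` with the ``φ = 0`` closed form is a cheap regression check on the step size before trusting the penalised run.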

## Pairing with reinforcement learning

The closed forms are reference benchmarks. To learn a richer policy
for non-Gaussian dynamics, drive an RL agent through:

- [alphaswarm.rl.envs.MarketMakingEnv](../alphaswarm/rl/envs/market_making_env.py) —
  PPO/SAC over AvSt knobs.
- [alphaswarm.rl.envs.OptimalExecutionEnv](../alphaswarm/rl/envs/optimal_execution_env.py) —
  Cartea-Jaimungal block liquidation.

Sample experiment YAMLs ship under [configs/rl/](../configs/rl/)
(``avellaneda_stoikov_mm.yaml``, ``cartea_jaimungal_execution.yaml``).

## See also

- [alphaswarm_docs/portfolio-options-mm.md](../../concepts/strategy/portfolio-options-mm.md) — Lucic-Tse
  multi-strike extension.
- [alphaswarm_docs/microstructure-toxicity.md](../../concepts/strategy/microstructure-toxicity.md) —
  toxicity regime detection + agent adapter loop.
- [alphaswarm_docs/installation.md](../../intro/installation.md) — how to install the
  ``[optimal-control]`` extra (JAX, finhjb, fast-vollib, mbt_gym).


<!-- https://alpha-swarm.ai/concepts/strategy/portfolio-options-mm -->
# Portfolio options market making — Lucic-Tse 2024-2026
> The single-asset Avellaneda-Stoikov framework breaks down for options portfolios. An options book carries simultaneous Δ / Γ / ν / ρ exposures across hundreds of strikes; the spread the dealer should ...

# Portfolio options market making — Lucic-Tse 2024-2026

> **Audience:** options market-makers, quant developers writing
> spread-prediction models, and agents that need to reason about
> portfolio-level risk skew.

The single-asset Avellaneda-Stoikov framework breaks down for options
portfolios. An options book carries simultaneous Δ / Γ / ν / ρ
exposures across hundreds of strikes; the spread the dealer should
quote at strike ``K`` is no longer independent of the inventory at
strike ``K′``.

The breakthrough closed-form solution for portfolio-level options MM
landed in 2024-2026: V. Lucic and A. Tse, *"Optimal option market
making and volatility arbitrage"*. AlphaSwarm implements that framework in
[alphaswarm/options/portfolio_mm.py](../alphaswarm/options/portfolio_mm.py).

## Two equations

**1. Per-strike vol-arb alpha.**

```
α(K, T) = ½ · S² · Γ(K, T) · (σ_real² − σ_imp²)
```

This is the option-equivalent of the spot vol-arb edge: when realised
volatility exceeds implied volatility, a long-gamma position
(``Γ > 0``) carries a positive expected edge.

**2. Inventory-skewed bid/ask quote.**

```
bid(K, T)  = mid(K, T) − δ(K, T) − skew(K, T)
ask(K, T)  = mid(K, T) + δ(K, T) − skew(K, T)
δ(K, T)    = base_spread + ½ · hedge_cost · |Γ(K, T)|
skew(K, T) = γ_inv · ν(K, T) · (Σ_vol · q_per_expiry)
```

where ``Σ_vol`` is the (rank-reducible) covariance of the implied-vol
factors across maturities and ``q`` is the inventory matrix.

The Riccati system the linear-quadratic ansatz produces is closed-form
in steady state — no PDE solver required. AlphaSwarm implements that closed
form in pure JAX with ``jnp.einsum`` for the matrix contractions.
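
To make the two equations concrete, here is a minimal plain-Python sketch for a single ``(K, T)`` cell. The function names, the sample Greeks, and the way the ``Σ_vol · q_per_expiry`` product is indexed by expiry are all assumptions made for illustration; the real module may organise the contraction differently.

```python
def vol_arb_alpha(spot, gamma_kt, realized_vol, implied_vol):
    """Per-strike edge: 0.5 * S^2 * Gamma * (sigma_real^2 - sigma_imp^2)."""
    return 0.5 * spot**2 * gamma_kt * (realized_vol**2 - implied_vol**2)


def skewed_quote(mid, gamma_kt, vega_kt, risk_load, gamma_inv, base_spread, hedge_cost):
    """Bid/ask for one (K, T) cell.

    ``risk_load`` stands for the entry of (Sigma_vol . q_per_expiry) that
    belongs to this cell's expiry; that indexing is an assumption.
    """
    delta = base_spread + 0.5 * hedge_cost * abs(gamma_kt)
    skew = gamma_inv * vega_kt * risk_load
    return mid - delta - skew, mid + delta - skew


# Two expiries: covariance of the implied-vol factors and net inventory per expiry.
sigma_vol = [[0.04, 0.01], [0.01, 0.09]]
q_per_expiry = [10.0, -5.0]
risk_loads = [sum(s * q for s, q in zip(row, q_per_expiry)) for row in sigma_vol]

bid, ask = skewed_quote(mid=2.50, gamma_kt=0.03, vega_kt=0.12, risk_load=risk_loads[0],
                        gamma_inv=0.05, base_spread=0.05, hedge_cost=0.001)
alpha = vol_arb_alpha(spot=100.0, gamma_kt=0.03, realized_vol=0.22, implied_vol=0.20)
print(bid, ask, alpha)
```

The skew moves bid and ask by the same amount, so inventory changes where the dealer quotes but not how wide.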

## Calling the solver

```python
import numpy as np
from alphaswarm.analysis.pricing import greeks_grid
from alphaswarm.options.portfolio_mm import LucicTseParams, compute_lucic_tse_quotes

strikes = np.array([95., 100., 105.])
expiries = np.array([0.05, 0.1, 0.25])

grid = greeks_grid(spot=100., strikes=strikes, expiries=expiries, vol=0.2)
quotes = compute_lucic_tse_quotes(
    spot=100.0,
    mid_quotes=grid["price"],
    gamma_surface=grid["gamma"],
    vega_surface=grid["vega"],
    realized_vol=0.22,            # the dealer's view
    implied_vol=np.full_like(grid["price"], 0.20),  # market quote
    inventory=np.zeros_like(grid["price"]),
    params=LucicTseParams(gamma_inv=0.05, base_spread=0.05, hedge_cost=0.001),
)
print(quotes.bid)         # (n_expiries, n_strikes)
print(quotes.ask)
print(quotes.expected_pnl)
```

JAX optionality: when the ``[optimal-control]`` extra is missing, the
module falls back to NumPy. The fallback evaluates the same closed form,
so results should agree to floating-point tolerance; it is just slower.

## Analysis-flow surface

For UI / agent flows, use ``optimal_control.lucic_tse_portfolio_quotes``
or the namespace alias ``derivatives.lucic_tse_quotes``. Both wrap
``compute_lucic_tse_quotes`` and persist a row-per-cell table to
``alphaswarm_gold_analysis_optimal_control.lucic_tse_portfolio_quotes``.

```python
from alphaswarm.analysis import run_flow

out = run_flow(
    "optimal_control.lucic_tse_portfolio_quotes",
    None,
    {
        "spot": 100.0,
        "strikes": [90, 95, 100, 105, 110],
        "expiries": [0.05, 0.1, 0.25, 0.5],
        "realized_vol": 0.22,
        "implied_vol": 0.20,
        "gamma_inv": 0.05,
        "base_spread": 0.05,
        "hedge_cost": 0.001,
    },
)
```

## Pairing with the JAX/fast-vollib Greek path

Building the Greek surface dominates the per-step cost. AlphaSwarm ships a
JAX/vmap drop-in path in
[alphaswarm/options/greeks_jax.py](../alphaswarm/options/greeks_jax.py) that
auto-detects ``fast_vollib`` (Triton-fused on H100) when the extra is
installed, otherwise JIT-compiles a hand-rolled BSM kernel. The legacy
``alphaswarm.analysis.pricing.greeks_grid`` routes through this fast path
automatically.

## RL pairing

[alphaswarm.rl.envs.LucicTsePortfolioEnv](../alphaswarm/rl/envs/lucic_tse_options_env.py)
exposes the framework as a Gym environment so PPO/SAC can learn to
adapt ``γ_inv`` / ``base_spread`` as a function of the realised vs
implied gap. Sample config: [configs/rl/lucic_tse_options.yaml](../configs/rl/lucic_tse_options.yaml).

## See also

- [alphaswarm_docs/optimal-control.md](../../concepts/strategy/optimal-control.md) — single-asset HJB.
- [alphaswarm_docs/microstructure-toxicity.md](../../concepts/strategy/microstructure-toxicity.md) — the
  toxicity-aware regime adapter that scales ``γ_inv`` automatically
  during toxic flow.


<!-- https://alpha-swarm.ai/concepts/strategy/predictor-hub -->
# PredictorHub
> The report calls out two empirical findings from the literature:

# PredictorHub

> Status: **Phase 5 shipped** (Alembic 0044). Hub:
> [`alphaswarm/ml/predictors/`](../alphaswarm/ml/predictors/).

## Why unify

The report calls out two empirical findings from the literature:

* **XGBoost regression** -- significantly superior accuracy at pure
  numerical return prediction (low-noise, structured features)
* **LSTM classification** -- demonstrably better at directional
  classification over medium-term 7-30 day horizons (sequence-aware,
  handles regime shifts)

The platform already had both models available under
[`alphaswarm/ml/models/`](../alphaswarm/ml/models/), but they were registered with
different config keys, trained via different code paths, and
serialised inconsistently. Phase 5 consolidates them under a single
:class:`PredictorSpec` shape that the hub uses to pick the right
factory.

## PredictorSpec

The spec is hash-locked Pydantic:

```python
from alphaswarm.ml.predictors import PredictorSpec

# XGBoost regression — predict next-day return
spec_xgb = PredictorSpec(
    name="xgb_returns_1d",
    model_kind="xgboost",
    label_kind="regression",
    target_horizon="1d",
    feature_columns=["mom_5", "mom_20", "rsi_14", "vol_20"],
    target_column="ret_1d",
    hyperparams={"max_depth": 6, "learning_rate": 0.05, "n_estimators": 500},
)

# LSTM classification — predict 20-day direction (binary)
spec_lstm = PredictorSpec(
    name="lstm_direction_20d",
    model_kind="lstm",
    label_kind="classification",
    target_horizon="20d",
    feature_columns=["close", "volume", "rsi_14", "macd"],
    target_column="dir_20d",
    sequence_length=60,
    hyperparams={"hidden_size": 64, "num_layers": 2, "dropout": 0.2},
    classes=["down", "up"],
)
```

Re-snapshotting the spec into the persistence layer:

```python
from alphaswarm.ml.predictors import persist_predictor_spec

row_id, created = persist_predictor_spec(spec_xgb)
print(row_id, created)  # created=True the first time, False if hash unchanged
```

## PredictorHub

```python
from alphaswarm.ml.predictors import PredictorHub

hub = PredictorHub()
model = hub.build(spec_xgb)
model.fit(X_train, y_train)
preds = model.predict(X_test)
```

The hub picks the right factory from the
``(model_kind, label_kind)`` registry. Adding a new model:

```python
from alphaswarm.ml.predictors import register_predictor

@register_predictor(model_kind="transformer", label_kind="classification")
def my_transformer_factory(spec):
    ...
    return TransformerClassifier(**spec.hyperparams)
```
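
The dispatch idea behind a ``(model_kind, label_kind)`` registry can be sketched as follows. This is a self-contained illustration, not the hub's code: `_FACTORIES`, `register`, and `build` are hypothetical names, and the spec is a plain dict rather than a ``PredictorSpec``.

```python
from typing import Callable, Dict, Tuple

# Hypothetical stand-in for the hub's registry, keyed by (model_kind, label_kind).
_FACTORIES: Dict[Tuple[str, str], Callable] = {}


def register(model_kind: str, label_kind: str):
    """Decorator that files a factory under its (model_kind, label_kind) key."""
    def decorator(factory: Callable) -> Callable:
        key = (model_kind, label_kind)
        if key in _FACTORIES:
            raise ValueError(f"factory already registered for {key}")
        _FACTORIES[key] = factory
        return factory
    return decorator


def build(spec: dict):
    """Look up the factory for the spec and call it; unknown pairs fail loudly."""
    key = (spec["model_kind"], spec["label_kind"])
    try:
        factory = _FACTORIES[key]
    except KeyError:
        raise LookupError(f"no predictor factory for {key}") from None
    return factory(spec)


@register(model_kind="dummy", label_kind="regression")
def dummy_factory(spec: dict):
    # A real factory would return a model object; a dict keeps the sketch self-contained.
    return {"kind": "dummy", **spec["hyperparams"]}


model = build({"model_kind": "dummy", "label_kind": "regression", "hyperparams": {"depth": 3}})
print(model)
```

Keying on the pair rather than on ``model_kind`` alone is what lets one model family serve both regression and classification heads.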

## Reference factories

The hub ships four reference factories matching the report's
recommendations:

| ``model_kind`` | ``label_kind`` | Underlying class |
| --- | --- | --- |
| ``xgboost`` | ``regression`` | :class:`XGBModel` from :mod:`alphaswarm.ml.models.tree` |
| ``xgboost`` | ``classification`` | :class:`XGBModel` (with binary or multi-class objective) |
| ``lstm`` | ``classification`` | :class:`LSTMModel` from :mod:`alphaswarm.ml.models.torch.lstm` |
| ``lstm`` | ``regression`` | :class:`LSTMModel` (regression head) |

## Hash-locked versioning

The Phase 5 ``predictor_spec_versions`` table mirrors the spec-version
pattern used by AgentSpec / BotSpec / RLExperimentSpec /
AnalysisSpec. Re-running ``persist_predictor_spec`` with an unchanged
spec returns ``created=False``; a single byte change to the spec body
(new feature, new hyperparam) produces a fresh row. This means every
"how was this model trained?" question has a precise answer pinned
by the SHA-256 hash.
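
The ``created`` behaviour can be illustrated with a small in-memory sketch. How the hub canonicalises a spec before hashing is not stated here, so the sorted-key JSON encoding below is an assumption, and `spec_hash`, `persist`, and `_versions` are hypothetical names standing in for the real persistence layer.

```python
import hashlib
import json

_versions = {}  # hash -> row id; stands in for the predictor_spec_versions table


def spec_hash(spec: dict) -> str:
    """SHA-256 over a canonical JSON encoding (sorted keys, fixed separators)."""
    canonical = json.dumps(spec, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def persist(spec: dict):
    """Return (row_id, created); an unchanged spec maps to its existing row."""
    digest = spec_hash(spec)
    if digest in _versions:
        return _versions[digest], False
    _versions[digest] = len(_versions) + 1
    return _versions[digest], True


spec = {"name": "xgb_returns_1d", "hyperparams": {"max_depth": 6, "learning_rate": 0.05}}
first = persist(spec)
second = persist(dict(spec))  # same content, different object
changed = persist({**spec, "hyperparams": {"max_depth": 7, "learning_rate": 0.05}})
print(first, second, changed)
```

Sorting keys before hashing matters: without it, two dicts with the same content but different insertion order would produce different hashes and spurious new versions.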

## Wiring into agents

Phase 5 exposes the hub through the existing
[`/ml/test`](../alphaswarm/api/routes/ml.py) endpoints (REST) and three
DataMCP tools (agent-facing):

* ``data.ml.predictors.list`` -- list registered specs
* ``data.ml.predictors.train`` -- snapshot a spec + train
* ``data.ml.predictors.deploy_pair`` -- A/B-test two trained models

Agents query the catalogue first, snapshot a spec, train, and
deploy without an ORM import.


<!-- https://alpha-swarm.ai/concepts/strategy/statistical-arbitrage -->
# Statistical arbitrage primitives
> | Function | Returns | Use | | --- | --- | --- | | :func:`johansen_test` | :class:`JohansenResult` | Multivariate cointegration rank among >=2 series | | :func:`rolling_zscore` | pandas Series | Norma...

# Statistical arbitrage primitives

> Status: **Phase 4 shipped**. Module:
> [`alphaswarm/math/arbitrage.py`](../alphaswarm/math/arbitrage.py). Analysis flows:
> [`alphaswarm/analysis/flows/arbitrage.py`](../alphaswarm/analysis/flows/arbitrage.py).

## Six primitives

| Function | Returns | Use |
| --- | --- | --- |
| :func:`johansen_test` | :class:`JohansenResult` | Multivariate cointegration rank among >=2 series |
| :func:`rolling_zscore` | pandas Series | Normalized spread for entry/exit thresholds |
| :func:`half_life` | :class:`HalfLifeResult` | Ornstein-Uhlenbeck mean-reversion timescale |
| :func:`pair_signal` | :class:`PairSignal` | Per-bar ENTRY/EXIT/HOLD for a pair strategy |
| :func:`ah_share_basis` | :class:`BasisResult` | A-share vs H-share cross-market basis |
| :func:`adr_basis` | :class:`BasisResult` | ADR / GDR vs underlying foreign equity basis |

The existing
[`alphaswarm/data/cointegration.py`](../alphaswarm/data/cointegration.py) module
keeps the ADF + Engle-Granger primitives -- Phase 4 doesn't duplicate
them.

## Johansen test

The Engle-Granger test handles two series; Johansen generalises to
``n >= 2`` and reports the **rank** of the cointegration space (how
many independent stationary combinations exist among the series).

```python
import pandas as pd
from alphaswarm.math.arbitrage import johansen_test

# Wide DataFrame: one column per series
prices = pd.DataFrame({
    "BABA_ADR": [...],
    "9988_HKEX_USD": [...],
    "SPY": [...],
})
result = johansen_test(prices, deterministic="constant", k_ar_diff=1)
print(result.rank, result.is_cointegrated_95)
# result.cointegrating_vectors: list[list[float]] -- the n rows of beta
```

## Pair signal state machine

The :func:`pair_signal` function reads the latest spread + a rolling
window and emits one of:

| Signal | Z-score | In position? |
| --- | --- | --- |
| ``ENTRY_LONG_SPREAD`` | ``z >= +entry_threshold`` | False |
| ``ENTRY_SHORT_SPREAD`` | ``z <= -entry_threshold`` | False |
| ``EXIT_LONG_SPREAD`` | ``\|z\| <= exit_threshold`` AND z >= 0 | True |
| ``EXIT_SHORT_SPREAD`` | ``\|z\| <= exit_threshold`` AND z < 0 | True |
| ``HOLD`` | otherwise | any |

The signal also reports the estimated half-life via
:func:`half_life`. Strategies typically reject opportunities where
the half-life exceeds a horizon-based ``half_life_min`` (the spread
will take too long to revert; capital is better deployed elsewhere).
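
The decision table above can be written as a small pure-Python classifier. This is a sketch, not the body of :func:`pair_signal`: the helper names, the default thresholds, and the choice to include the latest value in the rolling window are assumptions.

```python
from statistics import mean, stdev


def rolling_z(spread, window):
    """Z-score of the latest spread value against the trailing window."""
    tail = spread[-window:]
    return (tail[-1] - mean(tail)) / stdev(tail)


def classify(z, in_position, entry_threshold=2.0, exit_threshold=0.5):
    """Map (z, in_position) to a signal name following the table above."""
    if not in_position:
        if z >= entry_threshold:
            return "ENTRY_LONG_SPREAD"
        if z <= -entry_threshold:
            return "ENTRY_SHORT_SPREAD"
    elif abs(z) <= exit_threshold:
        return "EXIT_LONG_SPREAD" if z >= 0 else "EXIT_SHORT_SPREAD"
    return "HOLD"


spread = [0.0, 0.1, -0.1, 0.05, -0.05, 0.0, 0.1, -0.1, 0.0, 1.0]
z = rolling_z(spread, window=10)
print(z, classify(z, in_position=False))
print(classify(0.2, in_position=True), classify(-0.2, in_position=True))
```

Entries are only considered while flat and exits only while in a position, so a large z-score seen during an open trade falls through to ``HOLD``.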

## A/H share basis

The report calls out a specific cross-market arbitrage:
mainland A-shares vs Hong Kong H-shares of the same company. Same
economic rights, different regulatory + liquidity + currency
environments -> persistent divergence + violent reversion.

```python
from alphaswarm.math.arbitrage import ah_share_basis

# ICBC: 1398.HK in HKD, 601398.SS in CNY. fx_rate is CNY per HKD (~0.93).
res = ah_share_basis(
    a_price=5.10,
    h_price=4.82,
    fx_rate=0.93,
    conversion_ratio=1.0,
    transaction_cost_bps=20.0,
    threshold_bps=100.0,
)
print(res.is_arbitrage, res.arbitrage_direction)
```

The threshold default of 100 bps is conservative; CTA-style operators
typically use 60-80 bps. ``transaction_cost_bps`` captures the
round-trip cost (commissions + bid/ask + stamp duty + FX hedge cost).
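
The arithmetic behind the example can be sketched by hand. The exact definition used by :func:`ah_share_basis` (sign, denominator, how costs are netted) is not spelled out here, so the helpers below are hypothetical and encode one plausible convention: the A-share premium over the FX-converted H-share, with costs subtracted before the threshold test.

```python
def ah_basis_bps(a_price, h_price, fx_rate, conversion_ratio=1.0):
    """Premium of the A-share over the FX-converted H-share, in basis points.

    fx_rate is taken as A-currency per unit of H-currency (CNY per HKD here).
    Hypothetical helper; the convention is an assumption.
    """
    h_in_a_currency = h_price * fx_rate * conversion_ratio
    return (a_price - h_in_a_currency) / h_in_a_currency * 1e4


def is_tradeable(basis_bps, transaction_cost_bps, threshold_bps):
    """The gap must still clear the threshold after round-trip costs."""
    return abs(basis_bps) - transaction_cost_bps >= threshold_bps


basis = ah_basis_bps(a_price=5.10, h_price=4.82, fx_rate=0.93)
print(round(basis, 1), is_tradeable(basis, transaction_cost_bps=20.0, threshold_bps=100.0))
```

Under this convention the example prices put the A-share well over a thousand basis points above the converted H-share, far beyond the 100 bps threshold.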

## ADR / GDR basis

Same logic for US-listed ADRs and offshore-listed GDRs. The Phase 1
:class:`InstrumentADR` / :class:`InstrumentGDR` rows carry the
``conversion_ratio`` field directly so the basis algorithm reads it
without a manual lookup.

```python
from alphaswarm.math.arbitrage import adr_basis

# BABA ADR (NYSE) vs 9988 (HKEX). 1 ADR represents 8 ordinary shares.
res = adr_basis(
    adr_price=85.00,
    underlying_price=80.50,  # in HKD
    fx_rate=7.84,            # HKD per USD
    conversion_ratio=8.0,
    transaction_cost_bps=30.0,
    depository_fee_bps=5.0,
    threshold_bps=80.0,
)
```

The depository fee is annualised; over short holding periods (hours,
days) it's negligible, but on long-horizon basis trades it
materially eats into the alpha.
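
A rough worked version of the numbers above shows how the fee scales with holding period. The helper names are hypothetical, and both the basis convention (ADR premium over the converted underlying) and the linear pro-rating of the annualised fee are assumptions rather than a description of :func:`adr_basis`.

```python
def adr_basis_bps(adr_price, underlying_price, fx_rate, conversion_ratio):
    """ADR premium over the converted underlying, in bps.

    fx_rate is local currency per USD (HKD per USD here); conversion_ratio is
    the number of underlying shares per ADR. Hypothetical helper.
    """
    fair_adr = underlying_price * conversion_ratio / fx_rate
    return (adr_price - fair_adr) / fair_adr * 1e4


def accrued_fee_bps(depository_fee_bps, holding_days, days_per_year=365):
    """Pro-rate the annualised depository fee over the holding period (assumed linear)."""
    return depository_fee_bps * holding_days / days_per_year


gross = adr_basis_bps(85.00, 80.50, fx_rate=7.84, conversion_ratio=8.0)
for days in (1, 365):
    # Net edge after the 30 bps round-trip cost and the accrued fee.
    net = gross - 30.0 - accrued_fee_bps(5.0, days)
    print(days, round(gross, 1), round(net, 1))
```

For a one-day hold the accrued fee is a small fraction of a basis point, while a full-year hold pays the whole 5 bps.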

## Analysis flows

Four flows wrap the primitives so the AnalysisRuntime can drive them
with the standard preview / persist / chart machinery:

* ``arbitrage.johansen_basket`` -- Johansen test on a column subset
* ``arbitrage.pair_signal`` -- latest pair signal from a spread column
* ``arbitrage.ah_share_basis`` -- per-bar A/H basis monitor
* ``arbitrage.adr_basis`` -- per-bar ADR basis monitor

Each is registered via
[`@register_analysis_flow`](../alphaswarm/analysis/registry.py) so the lab
UI builds a form automatically.

## Agent surface (Phase 5)

The matching DataMCP tools (added in Phase 5):

* ``data.arbitrage.cointegration_pair`` -- two-series Engle-Granger
* ``data.arbitrage.johansen_basket`` -- multivariate Johansen finder
* ``data.arbitrage.ah_share_monitor`` -- A/H share monitor
* ``data.arbitrage.adr_underlying_basis`` -- ADR basis monitor

Agent code uses these tools, not the math primitives directly
(AGENTS rule 22).


<!-- https://alpha-swarm.ai/concepts/strategy/strategy-browser -->
# Strategy Browser
> The Strategy Browser is a dedicated Solara page at `/strategy-browser` that exposes two complementary views of the strategy library:

# Strategy Browser

> Doc map: [alphaswarm_docs/index.md](../../intro/index.md) · Strategy lifecycle: [alphaswarm_docs/strategy-lifecycle.md](../../concepts/strategy/strategy-lifecycle.md).

The Strategy Browser is a dedicated Solara page at `/strategy-browser` that
exposes two complementary views of the strategy library:

1. **Saved strategies** — everything a user has persisted via
   `POST /strategies/` (the Strategy Development page). Filter by tag,
   status, name, or minimum Sharpe; click through for version history,
   recent tests, equity curves, and a deep link into the per-strategy
   MLflow experiment.
2. **Alpha catalog** — the code-available `IAlphaModel` classes (both
   ported TA strategies and the native ML model wrappers), their tags,
   and a list of reference YAMLs in `configs/strategies/` that instantiate
   each one. Handy for discovering what's available before saving your
   own.

## API surface

- `GET /strategies/browse?tag=&status=&query=&min_sharpe=`
  → list of enriched strategy rows with latest backtest metrics and the
  MLflow run id of the most recent run.
- `GET /strategies/browse/catalog`
  → every registered `IAlphaModel` class, its module path, tag list, and
  reference YAMLs under `configs/strategies/`.
- `GET /strategies/{id}/experiment`
  → experiment name (`strategy/`), MLflow tracking URI, and up to
  50 linked `BacktestRun` rows.

## Strategy tags

Every new concrete alpha carries a module-level `STRATEGY_TAGS` tuple
(e.g. `("pattern", "mean-reversion", "quant-trading")`).
`alphaswarm.strategies.list_strategy_tags()` aggregates the tuples across every class in
`alphaswarm.strategies.__all__`, so the browser's tag filter reflects the code
without any duplicated metadata.
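
The aggregation amounts to a set union over the tag tuples. The sketch below is illustrative only: `collect_tags` is a hypothetical helper, the stand-in objects are not real alpha classes, and attaching ``STRATEGY_TAGS`` as an attribute (rather than reading it from each class's module) is a simplifying assumption.

```python
from types import SimpleNamespace

# Hypothetical stand-ins for alpha classes; only the STRATEGY_TAGS attribute matters here.
BollingerW = SimpleNamespace(STRATEGY_TAGS=("pattern", "mean-reversion", "quant-trading"))
DualThrust = SimpleNamespace(STRATEGY_TAGS=("intraday", "breakout", "quant-trading"))
Untagged = SimpleNamespace()  # objects without the tuple are simply skipped


def collect_tags(alphas):
    """Union of every STRATEGY_TAGS tuple, sorted for a stable filter list."""
    tags = set()
    for alpha in alphas:
        tags.update(getattr(alpha, "STRATEGY_TAGS", ()))
    return sorted(tags)


print(collect_tags([BollingerW, DualThrust, Untagged]))
```

Sorting the union keeps the filter dropdown stable between requests even though sets have no defined order.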

## MLflow wiring

When `run_backtest_from_config` is called with a `strategy_id`, the
underlying `log_backtest` helper uses
`experiment_name_for_strategy(strategy_id)` to pick the per-strategy
experiment (`strategy/`) and also sets the `alphaswarm.strategy_id` tag
on the run. After the backtest completes, the resulting MLflow run id is
written onto `BacktestRun.mlflow_run_id` so the browser can deep-link.

To prevent the generic Celery autolog signals from opening a parent
MLflow run for every backtest task (which would swallow the nested
`log_backtest` run), the `alphaswarm.tasks.backtest_tasks.*` /
`alphaswarm.tasks.paper_tasks.*` / `alphaswarm.tasks.ml_tasks.*` / `alphaswarm.tasks.factor_tasks.*`
task names are explicitly listed in
`alphaswarm.mlops.autolog._AUTOLOG_SKIP_TASKS`.

## Ported strategy catalog

Shipped alphas (at 0.4):

| Alpha class                | Tags                                           | Reference recipe                          |
|----------------------------|------------------------------------------------|-------------------------------------------|
| `AwesomeOscillatorAlpha`   | momentum, oscillator, quant-trading            | `configs/strategies/awesome_oscillator.yaml` |
| `HeikinAshiAlpha`          | pattern, reversal, quant-trading               | `configs/strategies/heikin_ashi.yaml`     |
| `DualThrustAlpha`          | intraday, breakout, quant-trading              | `configs/strategies/dual_thrust.yaml`     |
| `ParabolicSARAlpha`        | trend, quant-trading                           | `configs/strategies/parabolic_sar.yaml`   |
| `LondonBreakoutAlpha`      | breakout, fx, quant-trading                    | `configs/strategies/london_breakout.yaml` |
| `BollingerWAlpha`          | pattern, mean-reversion, quant-trading         | `configs/strategies/bollinger_w.yaml`     |
| `ShootingStarAlpha`        | pattern, reversal, quant-trading               | `configs/strategies/shooting_star.yaml`   |
| `RsiPatternAlpha`          | pattern, mean-reversion, quant-trading         | `configs/strategies/rsi_pattern.yaml`     |
| `OilMoneyRegressionAlpha`  | statistical, mean-reversion, quant-trading     | `configs/strategies/oil_money.yaml`       |
| `SmaCross`                 | momentum, reference, backtesting.py            | `configs/strategies/sma_cross.yaml`       |
| `Sma4Cross`                | momentum, reference, backtesting.py            | `configs/strategies/sma4_cross.yaml`      |
| `TrailingATRAlpha`         | momentum, trailing-stop, reference             | `configs/strategies/trailing_atr.yaml`    |
| `BaseAlgoExample`          | reference, stock-analysis-engine               | `configs/strategies/base_algo_example.yaml` |

## ML Training page

A sibling Solara page at `/ml` lets you launch any `alphaswarm.ml` training run from a
form (pick feature handler + model class + segments), stream progress
through the existing `/chat/stream/{task_id}` WebSocket, and see the
resulting `ModelVersion` rows.

## Browser export flow

```mermaid
flowchart LR
    Picker[User picks securities + indicators + transformations] --> Form[StrategyBrowser form]
    Form -->|POST| API["/pipelines/from-browser"]
    API --> Spec[FeatureSet spec]
    Spec --> DB[(feature_sets row)]
    Spec --> Topic["Kafka features.preview.&lt;name&gt;.v1"]
    Topic --> Stream[live overlay charts]
```


<!-- https://alpha-swarm.ai/concepts/strategy/strategy-development -->
# Strategy Development (Consolidated `/strategy-development/*`)
> ```mermaid flowchart TB L["StrategyDevLayout"] L --> Composer["/composer"] L --> Sim["/simulation"] L --> Ideate["/ideation"] L --> Single["/single-predict"] L --> Batch["/predict-batch (Iceberg-aware...

# Strategy Development (Consolidated `/strategy-development/*`)

The Vite frontend exposes a single consolidated umbrella for every
strategy-authoring + strategy-testing surface under
`/strategy-development/*`. Twelve sibling sub-routes share the same
persistent left sub-nav, a run-summary KPI strip, and a cross-route
React context so navigating between (say) Compose → Simulate →
Compare-Models keeps all the inputs (deployment id, symbols, time
window, feature row, last task id) coherent.

```mermaid
flowchart TB
  L["StrategyDevLayout"]
  L --> Composer["/composer"]
  L --> Sim["/simulation"]
  L --> Ideate["/ideation"]
  L --> Single["/single-predict"]
  L --> Batch["/predict-batch (Iceberg-aware)"]
  L --> Compare["/compare-models"]
  L --> Scenario["/scenario-perturbation"]
  L --> Historical["/historical-eval"]
  L --> Live["/live-test"]
  L --> RunCmp["/run-comparator"]
  L --> Docs["/document-library (papers)"]
  L --> Lib["/library (components)"]
```

## Surfaces

| Route | Component | Wraps |
| --- | --- | --- |
| `/strategy-development` | `StrategyDevIndexRoute` | redirects to `/strategy-development/composer` |
| `/strategy-development/composer` | `StrategyComposer` | `GET /strategies/components` + `POST /strategies` |
| `/strategy-development/simulation` | `SimulationCreator` | dispatches to `BotRuntime` / `LobBacktestEngine` / `AlphaBacktestExperiment` / `RLRuntime` / paper |
| `/strategy-development/ideation` | `IdeationConsole` | `POST /agents/ideate` (router_complete + research_papers RAG) |
| `/strategy-development/single-predict` | `SinglePredictRoute` | `POST /ml/test/single` |
| `/strategy-development/predict-batch` | `PredictBatchRoute` | `POST /ml/test/batch` (now Iceberg-aware) + `POST /ml/test/upload-csv` |
| `/strategy-development/compare-models` | `CompareModelsRoute` | `POST /ml/test/compare` |
| `/strategy-development/scenario-perturbation` | `ScenarioPerturbationRoute` | `POST /ml/test/scenario` |
| `/strategy-development/historical-eval` | `HistoricalEvalRoute` | `POST /ml/evaluate` + `GET /ml/evaluations/{task_id}` |
| `/strategy-development/live-test` | `LiveTestRoute` | `POST /ml/live-test/start` + `useLiveStream` |
| `/strategy-development/run-comparator` | `RunComparator` | chained pairwise `POST /ml/test/compare` |
| `/strategy-development/document-library` | `DocumentLibrary` | `GET /rag/papers`, `POST /rag/papers/upload`, `POST /rag/papers/{id}/synthesize` |
| `/strategy-development/library` | `StrategyLibraryRoute` | `GET /strategies/components` (read-only registry browser) |

## Cross-route state

`alphaswarm_client/src/components/strategy-dev/StrategyDevContext.tsx` holds
the shared selection (`deploymentId`, `deploymentIdB`, `symbols`,
`start`, `end`, `featureRowText`, `perturbations`, `lastTaskId`,
`lastRunSummary`, `composerYaml`, `strategyId`). The context is
backed by `localStorage` under the key `alphaswarm.strategy-dev.selection.v1`
so a hard refresh doesn't lose state.

Sub-routes use `useStrategyDev()` to read + patch the selection:

```ts
const { selection, setSelection } = useStrategyDev();
setSelection({ deploymentId: "abc", lastTaskId: res.task_id });
```

## KPI strip

`RunKpiStrip` reads `selection.lastRunSummary` and renders Sharpe /
total return / max DD / hit rate / trades in the standard
`MetricsGrid`. The strip is intentionally idle when no run has been
launched in the current session so the surface stays calm.

## Hard-rule alignment

- Frontend rule (`.cursor/rules/frontend.mdc`): every long-running
  task is consumed via the existing `useChatStream` / `useLiveStream`
  hooks so the WS pipeline + kill-switch + sandbox banner all stay
  intact.
- AGENTS rule 2: LLM-driven surfaces (`IdeationConsole`,
  `PaperSynthesisDrawer`) route through `router_complete` server-side.
- AGENTS rule 4: progress framing is unchanged — sub-routes never
  publish to Redis directly; they always go through `_progress.emit`
  on the backend.

## How to add a sub-route

1. Create `alphaswarm_client/src/routes/strategy-development//page.tsx`
   wrapping the new component.
2. Add the new component under `alphaswarm_client/src/components/strategy-dev/`.
3. Register the route in `alphaswarm_client/src/routes.tsx`'s
   `DYNAMIC_ROUTES` entry for `strategy-development`.
4. Add a `StrategyDevSubRoute` entry to
   `alphaswarm_client/src/components/strategy-dev/SubNav.tsx` so the new
   route appears in the persistent left nav.

## Legacy

The legacy webui `/ml/test` page is now superseded by this consolidated
surface. Bookmarks still work because the flat REAL_ROUTES entry is
preserved, but the sidebar no longer surfaces it.


<!-- https://alpha-swarm.ai/concepts/strategy/strategy-lifecycle -->
# Strategy Lifecycle
> Every strategy in AlphaSwarm follows the same six-step cycle: **build → save → version → test → paper → live**

# Strategy Lifecycle

> Doc map: [alphaswarm_docs/index.md](../../intro/index.md) · Backtest dispatch sequence: [alphaswarm_docs/flows.md#2-backtest-dispatch](../../concepts/platform/flows.md#2-backtest-dispatch).

Every strategy in AlphaSwarm follows the same six-step cycle: **build → save →
version → test → paper → live**.

## Build

Open the Strategy Development page (``/strategy``) or hand-write a YAML
recipe under ``configs/strategies/``. Every recipe has the shape:

```yaml
strategy:
  class: FrameworkAlgorithm
  kwargs:
    universe_model: {class: StaticUniverse, kwargs: {symbols: [...]}}
    alpha_model:    {class: MeanReversionAlpha, kwargs: {...}}
    portfolio_model: {class: HierarchicalRiskParity, kwargs: {...}}
    risk_model:     {class: BasicRiskModel, kwargs: {...}}
    execution_model: {class: MarketOrderExecution, kwargs: {}}
backtest:
  class: EventDrivenBacktester
  kwargs: {initial_cash: 100000, start: "2023-01-01", end: "2024-12-31"}
```

## Save + version

Clicking **Save as new strategy** calls ``POST /strategies/`` which
writes a ``Strategy`` row plus ``StrategyVersion`` v1. Every subsequent
``PUT /strategies/{id}`` auto-bumps the version; the diff viewer in the
UI and the ``GET /strategies/{id}/versions/{v}/diff`` endpoint surface a
unified diff between any two versions.
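
A unified diff between two recipe versions can be reproduced locally with the standard library. This is only a sketch of the kind of output the endpoint surfaces, not its implementation; the recipe fragment and the `lookback` kwarg are made up for the example.

```python
import difflib

# Two hypothetical versions of a recipe; only one kwarg changes.
v1 = """strategy:
  class: FrameworkAlgorithm
  kwargs:
    alpha_model: {class: MeanReversionAlpha, kwargs: {lookback: 20}}
"""
v2 = v1.replace("lookback: 20", "lookback: 30")

diff = difflib.unified_diff(
    v1.splitlines(keepends=True),
    v2.splitlines(keepends=True),
    fromfile="v1",
    tofile="v2",
)
diff_text = "".join(diff)
print(diff_text)
```

Passing ``keepends=True`` preserves the newlines so the joined output is a well-formed patch.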

## Test

The **Test** card in the Strategy Development page posts to
``POST /strategies/{id}/test`` with an engine + window. A Celery task
runs the backtest, stores a ``StrategyTest`` row, and links it back to
the strategy. The **Tests** tab lists every run with its Sharpe,
drawdown, and total return.

Each test also fires the MLflow autolog signal, so every test becomes a
first-class MLflow run tagged ``alphaswarm.celery.task = alphaswarm.tasks.backtest_tasks.run_backtest``.

## Paper + live

When a strategy has a green testing record, promote it via
``POST /paper/start`` (the same pipeline the Paper Trading page uses).
The paper engine shares 100% of the strategy code path with the
backtester — no code changes required.

## Archive

``DELETE /strategies/{id}`` soft-deletes by setting ``status=archived``.
Archived strategies are hidden from the default list but all versions +
tests remain queryable via the API for audit.

## State machine

```mermaid
stateDiagram-v2
    [*] --> Draft : author + save
    Draft --> Versioned : freeze YAML
    Versioned --> Backtested : run_backtest succeeds
    Backtested --> Paper : promote (operator)
    Paper --> Live : promote (operator)
    Backtested --> Versioned : revise + bump version
    Paper --> Backtested : reset
    Live --> Paper : pause
    Live --> Archived : decommission
    Archived --> [*]
```
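
The diagram translates directly into a transition table. The sketch below is a hypothetical guard, not the platform's implementation: it encodes only the edges drawn above and rejects anything else.

```python
# Allowed transitions, copied from the diagram above.
TRANSITIONS = {
    "Draft": {"Versioned"},
    "Versioned": {"Backtested"},
    "Backtested": {"Paper", "Versioned"},
    "Paper": {"Live", "Backtested"},
    "Live": {"Paper", "Archived"},
    "Archived": set(),
}


def advance(state: str, target: str) -> str:
    """Return the new state, or raise if the diagram has no such edge."""
    if target not in TRANSITIONS[state]:
        raise ValueError(f"illegal transition {state} -> {target}")
    return target


state = "Draft"
for step in ("Versioned", "Backtested", "Paper", "Live"):
    state = advance(state, step)
print(state)
```

Keeping the edges in one dictionary makes it easy to confirm that no path skips the backtest or paper stage on the way to live.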


<!-- https://alpha-swarm.ai/concepts/strategy/strategy-templates -->
# Strategy template catalog (Phase 7 of the multi-tenant rollout)
> Hard rule 35 in [`AGENTS.md`](https://github.com/julianwileymac/alphaswarm/blob/main/AGENTS.md): "Read-only strategy templates (LEAN, community, internal references) MUST be loaded as ``resources`` rows with ``resource_type=strategy_template``. The A...

# Strategy template catalog (Phase 7 of the multi-tenant rollout)

Read-only strategy templates — QuantConnect LEAN's
``Algorithm.Python/*.py`` examples first, with hooks for
community + internal libraries — are ingested into the polymorphic
[`resources`](../alphaswarm/persistence/models_resources.py) table and
surfaced to users + agents via the strategy template browser, the
MCP catalog, and the AST translator.

## Hard rule

Hard rule 35 in [`AGENTS.md`](https://github.com/julianwileymac/alphaswarm/blob/main/AGENTS.md): "Read-only strategy
templates (LEAN, community, internal references) MUST be loaded as
``resources`` rows with ``resource_type='strategy_template'``. The
AST translator lives in ``alphaswarm/strategies/lean/translator.py``; new
translators register through the same pattern."

## Ingestion

```bash
# One-shot: clone LEAN + ingest every Algorithm.Python/*.py
python -m scripts.ingest_lean_templates --clone

# Re-ingest from a local checkout
LEAN_REPO_PATH=/opt/Lean python -m scripts.ingest_lean_templates

# Dry-run (parse + report only)
python -m scripts.ingest_lean_templates --lean-path /opt/Lean --dry-run
```

The ingester is idempotent — re-running with a newer LEAN revision
overwrites the matching rows in place. Each Resource carries the
parsed metadata (`class_name`, `base_class`, `asset_classes`,
`indicators`, `universe_symbols`, `tags`) plus the raw LEAN source
in `meta.raw_source` so the translator + frontend preview don't
need to re-read the file system.
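
As a rough illustration of the metadata extraction described above, the sketch below uses the standard-library `ast` module to pull a few of the named fields out of a LEAN-style source string. The function name `extract_template_metadata`, the two lookup tables, and the sample source are assumptions made for this example; the real ingester in `scripts.ingest_lean_templates` may work differently.

```python
import ast

LEAN_SOURCE = '''
class DualMovingAverage(QCAlgorithm):
    def Initialize(self):
        self.SetCash(100000)
        self.AddEquity("SPY")
        self.AddCrypto("BTCUSD")
        self.fast = self.SMA("SPY", 10)
'''

# Hypothetical lookup tables; the real ingester may recognise more calls.
ASSET_CALLS = {"AddEquity": "equity", "AddOption": "option", "AddCrypto": "crypto"}
INDICATOR_CALLS = {"SMA", "MACD"}


def extract_template_metadata(source: str) -> dict:
    """Parse LEAN Python source and collect a subset of the metadata fields."""
    tree = ast.parse(source)
    cls = next(n for n in ast.walk(tree) if isinstance(n, ast.ClassDef))
    meta = {
        "class_name": cls.name,
        "base_class": ast.unparse(cls.bases[0]) if cls.bases else None,
        "asset_classes": set(),
        "indicators": set(),
        "universe_symbols": [],
        "raw_source": source,  # kept so previews never re-read the file system
    }
    for node in ast.walk(cls):
        # Only consider calls of the form `self.<Method>(...)`.
        if not (
            isinstance(node, ast.Call)
            and isinstance(node.func, ast.Attribute)
            and isinstance(node.func.value, ast.Name)
            and node.func.value.id == "self"
        ):
            continue
        name = node.func.attr
        if name in ASSET_CALLS:
            meta["asset_classes"].add(ASSET_CALLS[name])
            first = node.args[0] if node.args else None
            if isinstance(first, ast.Constant) and isinstance(first.value, str):
                meta["universe_symbols"].append(first.value)
        elif name in INDICATOR_CALLS:
            meta["indicators"].add(name)
    return meta


meta = extract_template_metadata(LEAN_SOURCE)
```

Because the result is keyed by `class_name`, re-running the extraction on a newer revision of the same file naturally yields a payload that can overwrite the earlier row in place.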

## Translator

```python
from alphaswarm.strategies.lean.translator import translate_lean_to_framework

skeleton = translate_lean_to_framework(lean_source)
```

The translator rewrites the LEAN AST into an
[`alphaswarm.strategies.framework.FrameworkAlgorithm`](../alphaswarm/strategies/framework.py)
skeleton. The mapping covers:

| LEAN                                  | AlphaSwarm target                                 |
| ------------------------------------- | ------------------------------------------ |
| `Initialize`                          | `prepare`                                  |
| `OnData`                              | `on_bar`                                   |
| `OnSecuritiesChanged`                 | `on_universe_changed`                      |
| `self.AddEquity("SPY")`               | `ctx.add_equity("SPY")`                    |
| `self.AddOption("SPY")`               | `ctx.add_option("SPY")`                    |
| `self.AddCrypto("BTCUSD")`            | `ctx.add_crypto("BTCUSD")`                 |
| `self.SetCash(100000)`                | captured as cfg `starting_cash`            |
| `self.SetStartDate / SetEndDate`      | captured as cfg `start_date` / `end_date`  |
| `self.MACD(...)` / `self.SMA(...)`    | `alphaswarm.data.indicators.MACD(...)`            |
| `self.MarketOrder(symbol, qty)`       | `ctx.market_order(symbol, qty)`            |
| `self.SetHoldings(symbol, fraction)`  | `ctx.set_holdings(symbol, fraction)`       |

Anything unmapped becomes a `# TODO(lean-translate)` comment so the
user can finish the port — translation is never silent.
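
The "never silent" behaviour can be pictured with a deliberately simplified sketch. It is line-based and regex-driven rather than AST-based, covers only a subset of the table above, and every name in it (`translate_sketch`, `METHOD_RENAMES`, `CALL_RENAMES`) is hypothetical; it is not the code in `alphaswarm/strategies/lean/translator.py`.

```python
import re

# Subset of the mapping table above; anything absent is treated as unmapped.
METHOD_RENAMES = {
    "Initialize": "prepare",
    "OnData": "on_bar",
    "OnSecuritiesChanged": "on_universe_changed",
}
CALL_RENAMES = {
    "AddEquity": "ctx.add_equity",
    "AddOption": "ctx.add_option",
    "AddCrypto": "ctx.add_crypto",
    "MarketOrder": "ctx.market_order",
    "SetHoldings": "ctx.set_holdings",
}

DEF_RE = re.compile(r"\bdef (\w+)\(")
CALL_RE = re.compile(r"\bself\.([A-Z]\w*)\(")


def translate_line(line: str) -> str:
    """Rename known LEAN hooks and calls; flag every unknown call with a TODO."""
    line = DEF_RE.sub(
        lambda m: f"def {METHOD_RENAMES.get(m.group(1), m.group(1))}(", line
    )
    unmapped = [name for name in CALL_RE.findall(line) if name not in CALL_RENAMES]
    line = CALL_RE.sub(
        lambda m: f"{CALL_RENAMES[m.group(1)]}("
        if m.group(1) in CALL_RENAMES
        else m.group(0),
        line,
    )
    if unmapped:
        line += "  # TODO(lean-translate): " + ", ".join(unmapped)
    return line


def translate_sketch(source: str) -> str:
    return "\n".join(translate_line(line) for line in source.splitlines())


lean = '''def Initialize(self):
    self.AddEquity("SPY")
    self.SetWarmUp(30)
def OnData(self, data):
    self.SetHoldings("SPY", 1.0)'''

translated = translate_sketch(lean)
print(translated)
```

In this sketch the unmapped `self.SetWarmUp(30)` call is left untouched and tagged, which mirrors the intent that the user always sees what still needs porting.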

## Agent surface

| Tool                                            | Purpose |
| ----------------------------------------------- | ------- |
| `data.strategies.templates.search`              | Filter by tag / asset class / framework |
| `data.strategies.templates.describe`            | Full Resource payload including raw source |
| `data.strategies.templates.clone_to_workspace`  | Fork into the calling user's workspace, optionally with the translator applied |

Cloning emits a `resource_relations.relation='translated_from'`
edge back to the source, so the ownership graph can audit
provenance — `data.ownership.tree` over the cloned Resource
returns the lineage chain back to the original LEAN class.

## REST surface

| Method + path                          | Purpose |
| -------------------------------------- | ------- |
| `GET /strategies/templates`            | List + filter |
| `GET /strategies/templates/{id}`       | Describe + raw source |
| `POST /strategies/templates/clone`     | Clone (mirrors the MCP tool) |

## Frontend

The browser lives at `/strategy-development/templates`. The list is
grouped by primary asset class; the preview pane renders the
LEAN source in a monospace block with a "Clone to my workspace"
button (with a checkbox to toggle translation).

Free-text inputs that reference a specific strategy template are
forbidden — use ``.

## Cross-reference

- [`alphaswarm_docs/ownership-graph.md`](../../concepts/platform/ownership-graph.md) — the
  `translated_from` / `clones` edges live in the graph projection.
- [`alphaswarm_docs/data-mcp.md`](../../concepts/data/data-mcp.md) — the
  `data.strategies.templates.*` tools are MCP-registered like every
  other agent surface.
- [`alphaswarm_docs/metadata-cache.md`](../../concepts/data/metadata-cache.md) — the
  `strategy_templates` Redis cache category powers the EntityPicker.


<!-- https://alpha-swarm.ai/concepts/strategy/vbtpro-integration -->
# vectorbt-pro deep integration
> vectorbt-pro is the **primary vectorised backtest engine** in AlphaSwarm. The integration lives under [alphaswarm/backtest/vbtpro/](../alphaswarm/backtest/vbtpro/) and exposes the full vbt-pro surface (signals, orders, op...

# vectorbt-pro deep integration

> Doc map: [alphaswarm_docs/index.md](../../intro/index.md) · Engines overview: [alphaswarm_docs/backtest-engines.md](../../concepts/strategy/backtest-engines.md).

vectorbt-pro is the **primary vectorised backtest engine** in AlphaSwarm. The
integration lives under [alphaswarm/backtest/vbtpro/](../alphaswarm/backtest/vbtpro/) and
exposes the full vbt-pro surface (signals, orders, optimizer, callbacks,
splitter, param sweeps, IndicatorFactory). The legacy
[alphaswarm/backtest/vectorbtpro_engine.py](../alphaswarm/backtest/vectorbtpro_engine.py)
is now a 10-line delegate so YAML configs that reference its module path
continue to resolve to the new class.

## Hard constraint: Numba

vbt-pro's per-bar callbacks (`signal_func_nb`, `order_func_nb`,
`pre_segment_func_nb`, …) run inside Numba's JIT. **LLM agents and
Python ML models cannot run there.** Two supported patterns work around
this constraint:

1. **Precompute** (default) — agents/ML run before the simulation,
   producing wide-format `entries` / `exits` / `size` / `price`
   DataFrames. The decisions are baked in; vbt-pro consumes them as
   plain arrays.
2. **Per-window** (`Splitter.apply`) — Python (and therefore agents/ML)
   runs in the WFO loop between train and test windows. Each window's
   inner backtest is still vectorised.

For true per-bar agent dispatch use the event-driven engine and the
`AgentDispatcher` primitive — see
[alphaswarm_docs/backtest-engines.md#agent--ml-components](../../concepts/strategy/backtest-engines.md#agent--ml-components).
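
The precompute pattern can be sketched without any dependencies. Plain nested lists stand in for the wide `entries` / `exits` DataFrames, and `precompute_signals` and `toy_agent` are hypothetical names used only for this illustration.

```python
from datetime import date


def precompute_signals(index, symbols, decide):
    """Call `decide` once per (bar, symbol) up front and bake the results into
    wide boolean matrices (rows = bars, columns = symbols)."""
    entries = [[False] * len(symbols) for _ in index]
    exits = [[False] * len(symbols) for _ in index]
    for i, ts in enumerate(index):
        for j, sym in enumerate(symbols):
            action = decide(ts, sym)  # arbitrary Python: agent, ML model, ...
            entries[i][j] = action == "buy"
            exits[i][j] = action == "sell"
    return entries, exits


# Hypothetical stand-in for an agent or model decision.
def toy_agent(ts, sym):
    if sym == "SPY" and ts.day == 2:
        return "buy"
    if sym == "SPY" and ts.day == 4:
        return "sell"
    return "hold"


index = [date(2024, 1, d) for d in range(1, 6)]
entries, exits = precompute_signals(index, ["SPY", "QQQ"], toy_agent)
```

All Python-level decision making finishes before the simulation starts, so the JIT-compiled engine only ever sees static arrays.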

## Engine modes

`VectorbtProEngine.run` routes through one of five constructors based on
the `mode` kwarg. All five share a common kwarg surface (initial_cash,
fees, slippage, freq, cash_sharing, group_by, leverage, multiplier,
direction, …) and merge any extra `portfolio_kwargs` into the call.

| Mode        | Constructor                       | Driver                                                | Use case                                              |
|-------------|-----------------------------------|-------------------------------------------------------|-------------------------------------------------------|
| `signals`   | `Portfolio.from_signals`          | `IAlphaModel` → wide entries/exits/(size)/(price)/(stops) | The default; mirrors classical signal-based backtests. |
| `orders`    | `Portfolio.from_orders`           | `IOrderModel` → wide size/price/size_type             | Agent-emitted precise orders; multi-leg sizing.       |
| `optimizer` | `Portfolio.from_optimizer`        | `PortfolioOptimizer` (mean-variance, risk parity, custom) | Allocation-driven research, no signal generation.    |
| `holding`   | `Portfolio.from_holding`          | —                                                     | Buy-and-hold sanity baseline.                         |
| `random`    | `Portfolio.from_random_signals`   | `Param`-style random kwargs                           | Null-hypothesis baseline.                             |

## Components

| File                                                      | Role                                                                            |
|-----------------------------------------------------------|---------------------------------------------------------------------------------|
| [`engine.py`](../alphaswarm/backtest/vbtpro/engine.py)           | Multi-mode dispatch; `@register("VectorbtProEngine")`.                          |
| [`signal_builder.py`](../alphaswarm/backtest/vbtpro/signal_builder.py) | `IAlphaModel` → `SignalArrays`; per-bar loop **and** `generate_panel_signals` opt-in. |
| [`order_builder.py`](../alphaswarm/backtest/vbtpro/order_builder.py) | `IOrderModel` → `OrderArrays`; `signals_to_orders` sizer helper.            |
| [`optimizer_adapter.py`](../alphaswarm/backtest/vbtpro/optimizer_adapter.py) | `EqualWeightOptimizer`, `MeanVarianceOptimizer`, `RandomWeightOptimizer`, `CallableOptimizer`; all decorated with `@register(..., kind="portfolio")`. |
| [`result_mapper.py`](../alphaswarm/backtest/vbtpro/result_mapper.py) | `vbt.Portfolio` → `BacktestResult`; merges `vbt_*` native stats.            |
| [`wfo.py`](../alphaswarm/backtest/vbtpro/wfo.py)                 | `WalkForwardHarness` + `PurgedWalkForwardHarness` driven by vbt-pro's `Splitter`. |
| [`param_sweep.py`](../alphaswarm/backtest/vbtpro/param_sweep.py) | `sweep_strategy_kwargs` (grid/random) + `sweep_signals_grid` (`Param`-native MA cross). |
| [`indicator_factory_bridge.py`](../alphaswarm/backtest/vbtpro/indicator_factory_bridge.py) | Wraps AlphaSwarm `IndicatorBase` zoo entries as vbt-pro `IndicatorFactory` classes. |
| [`data_utils.py`](../alphaswarm/backtest/vbtpro/data_utils.py)   | `pivot_close`, `pivot_ohlcv`, `universe_from_bars`, `filter_bars`.              |

## Agent + ML strategy components

| File                                                      | Class                  | Role                                                                  |
|-----------------------------------------------------------|------------------------|-----------------------------------------------------------------------|
| [`agentic_alpha.py`](../alphaswarm/strategies/vbtpro/agentic_alpha.py) | `AgenticVbtAlpha`      | Precompute / per-window / live modes. Reads `DecisionCache` and renders to wide arrays. |
| [`ml_alpha.py`](../alphaswarm/strategies/vbtpro/ml_alpha.py)     | `MLVbtAlpha`           | Wraps any `alphaswarm.ml.base.Model` (or MLflow URI). Threshold / top-k / rank policies. |
| [`agent_order_model.py`](../alphaswarm/strategies/vbtpro/agent_order_model.py) | `AgenticOrderModel` | Implements `IOrderModel`; drives the `orders` mode from cached agent decisions. |

Each component is `@register`-ed so it can be dropped into a strategy
YAML via the standard `class` / `module_path` / `kwargs` factory.

## Walk-forward optimisation

```python
from alphaswarm.backtest.vbtpro.wfo import WalkForwardHarness

harness = WalkForwardHarness(
    strategy_cfg={"class": "FrameworkAlgorithm", "module_path": "...", "kwargs": {...}},
    splitter="rolling",   # or "expanding", "purged"
    n_splits=8,
    train_size=504,
    test_size=126,
    engine_kwargs={"mode": "signals", "initial_cash": 100_000.0},
    on_window_train=lambda i, slice_, strategy, ctx: warm_agent(strategy, slice_),
)
result = harness.run(bars)
```

The harness re-instantiates the strategy on every window (so per-window
agent state is isolated), runs the train backtest, then re-instantiates
again before the test pass. The optional `on_window_train` hook is where
agents refresh their RAG / memory or ML models refit.

`PurgedWalkForwardHarness` defaults `splitter="purged"` and uses
`PurgedWalkForwardCV` from `vectorbtpro.generic.splitting.purged` to drop
labels that bleed across the train/test boundary.
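
To make the window arithmetic concrete, here is a small sketch of how rolling train/test index ranges with an optional purge gap could be laid out. It is a simplified stand-in, not vbt-pro's `Splitter` algorithm, and the function name `rolling_windows` and its `purge` parameter are assumptions.

```python
def rolling_windows(n_bars, n_splits, train_size, test_size, purge=0):
    """Return (train, test) index ranges for a rolling walk-forward split.

    Windows are spaced evenly and the last test window ends at or before the
    final bar. `purge` drops that many bars between train and test.
    """
    span = train_size + purge + test_size
    if span > n_bars:
        raise ValueError("window longer than the series")
    last_start = n_bars - span
    step = last_start // (n_splits - 1) if n_splits > 1 else 0
    windows = []
    for k in range(n_splits):
        start = k * step
        train = range(start, start + train_size)
        test = range(train.stop + purge, train.stop + purge + test_size)
        windows.append((train, test))
    return windows


# Ten years of daily bars with the sizes used in the harness example above.
windows = rolling_windows(2520, n_splits=8, train_size=504, test_size=126)
purged = rolling_windows(2520, n_splits=8, train_size=504, test_size=126, purge=5)
```

With a non-zero `purge`, the bars immediately after each train window are skipped, which is the basic idea behind dropping labels that overlap the train/test boundary.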

## Parameter sweeps

```python
from alphaswarm.backtest.vbtpro.param_sweep import sweep_strategy_kwargs

result = sweep_strategy_kwargs(
    base_config,
    {
        "strategy.kwargs.alpha_model.kwargs.fast": [5, 10, 20],
        "strategy.kwargs.alpha_model.kwargs.slow": [50, 100, 200],
    },
    metric="sharpe",
    method="grid",
)
print(result.best_combo, result.best_value)
print(result.frame.head())
```

Random sweeps require `n_trials`. Trials default to running with
`engine: vbt-pro:signals` if the base config does not specify one.
`sweep_signals_grid` is the fast `Param`-native path for single-symbol
MA-crossover style sweeps.
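
The dotted keys in the grid address nested config entries. A minimal sketch of how a grid could be expanded into per-trial configs follows; `set_dotted` and `expand_grid` are hypothetical helpers and say nothing about how `sweep_strategy_kwargs` is implemented internally.

```python
import copy
import itertools


def set_dotted(cfg, dotted_key, value):
    """Set cfg["a"]["b"]["c"] = value for dotted_key "a.b.c", creating dicts as needed."""
    *parents, leaf = dotted_key.split(".")
    node = cfg
    for key in parents:
        node = node.setdefault(key, {})
    node[leaf] = value


def expand_grid(base_config, grid):
    """Yield (combo, config) for every point in the Cartesian product of `grid`."""
    keys = list(grid)
    for values in itertools.product(*(grid[k] for k in keys)):
        combo = dict(zip(keys, values))
        cfg = copy.deepcopy(base_config)  # never mutate the caller's config
        for key, value in combo.items():
            set_dotted(cfg, key, value)
        yield combo, cfg


base = {"strategy": {"kwargs": {"alpha_model": {"kwargs": {"fast": 10, "slow": 50}}}}}
grid = {
    "strategy.kwargs.alpha_model.kwargs.fast": [5, 10, 20],
    "strategy.kwargs.alpha_model.kwargs.slow": [50, 100, 200],
}
trials = list(expand_grid(base, grid))
```

A 3 x 3 grid produces nine trials; a random sweep would instead draw `n_trials` combinations from the same value lists.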

## Indicator factory bridge

```python
from alphaswarm.backtest.vbtpro.indicator_factory_bridge import vbt_indicator

SMA = vbt_indicator("SMA")
out = SMA.run(close, period=[10, 20, 50])  # vbt.Param under the hood
sma_50 = out.value[(slice(None), 50)]
```

This makes every AlphaSwarm `IndicatorBase` available inside vbt-pro's
indicator/sweep machinery without rewriting the underlying state machine.

## Agent tools

| Tool name                  | Class                       | Surface                                       |
|----------------------------|-----------------------------|-----------------------------------------------|
| `vectorbt_pro_backtest`    | `VectorbtProBacktestTool`   | One backtest, explicit mode.                  |
| `vectorbt_pro_param_sweep` | `VbtProParamSweepTool`      | Grid / random sweep over strategy kwargs.     |
| `vectorbt_pro_wfo`         | `VbtProWalkForwardTool`     | Splitter-WFO; rolling/expanding/purged.       |
| `vectorbt_pro_optimizer`   | `VbtProOptimizerTool`       | Allocation-driven via `Portfolio.from_optimizer`. |
| `engine_capabilities`      | `EngineCapabilitiesTool`    | Inspect the capability matrix; pick an engine.|
| `agent_aware_backtest`     | `AgentAwareBacktestTool`    | Run `AgentAwareMomentumAlpha` on the event-driven engine. |

All tools are registered in `alphaswarm_agents.tools.TOOL_REGISTRY` and
referenced in [configs/agents/quant_research_vbtpro.yaml](../configs/agents/quant_research_vbtpro.yaml).

## Example configs

- [configs/strategies/vbtpro/dual_ma_signals.yaml](../configs/strategies/vbtpro/dual_ma_signals.yaml)
  — minimal `signals` mode example.
- [configs/strategies/vbtpro/agentic_trader.yaml](../configs/strategies/vbtpro/agentic_trader.yaml)
  — `AgenticVbtAlpha` precompute.
- [configs/strategies/vbtpro/ml_topk.yaml](../configs/strategies/vbtpro/ml_topk.yaml)
  — `MLVbtAlpha` top-k.
- [configs/strategies/vbtpro/wfo_agentic.yaml](../configs/strategies/vbtpro/wfo_agentic.yaml)
  — per-window agent dispatch.
- [configs/strategies/vbtpro/optimizer_meanvariance.yaml](../configs/strategies/vbtpro/optimizer_meanvariance.yaml)
  — allocation-only optimizer mode.

## Performance notes

- **Default Numba JIT** is ON. The first vbt-pro call in a fresh process
  pays a non-trivial compile cost (~10-30s for the full surface). Cache
  ahead of time on workers if latency matters.
- **`jitted=False`** swaps the outer simulation wrapper to a Python
  reference implementation; it does not let arbitrary Python live inside
  `signal_func_nb`. Use precompute or per-window for that.
- The `IndicatorFactory` bridge applies AlphaSwarm indicators per column in pure
  Python, which is slow for very wide universes; for hot paths prefer
  vbt-pro's native indicators (`vbt.SMA`, `vbt.RSI`, etc.) and only fall
  back to the bridge for indicators we don't have a vbt-pro analogue for.

## Migration from the legacy adapter

The previous `VectorbtProEngine` only handled signals via
`IAlphaModel.generate_signals` → `Portfolio.from_signals`. Existing
configs still work because:

- The legacy module path
  `alphaswarm.backtest.vectorbtpro_engine.VectorbtProEngine` re-exports the new
  class.
- The default mode is still `signals`.
- Existing kwargs (`initial_cash`, `fees`, `slippage`, `allow_short`,
  `freq`, `group_by`) are unchanged in meaning.

New kwargs that gate richer behaviour: `mode`, `direction`, `accumulate`,
`size`, `size_type`, `sl_stop`, `tsl_stop`, `tp_stop`, `leverage`,
`leverage_mode`, `multiplier`, `cash_sharing`, `portfolio_kwargs`,
`order_model`, `optimizer`, `random_kwargs`, `record_signals`.


<!-- https://alpha-swarm.ai/concepts/trading/observability-stack -->
# Observability stack
> Phase 2c + 2d of the AlphaSwarm infra-expansion plan stand up the AlphaSwarm-owned observability plane in the `alphaswarm-observability` namespace.

# Observability stack

Phase 2c + 2d of the AlphaSwarm infra-expansion plan stand up the AlphaSwarm-owned
observability plane in the `alphaswarm-observability` namespace. Everything
the cluster previously read from `rpi_kubernetes/observability/` is
re-homed here.

```mermaid
flowchart LR
    apps[AlphaSwarm services + agents]
    subgraph aqpobs[alphaswarm-observability]
      otelagent[OTel Agent DaemonSet]
      otelgw[OTel Gateway Deployment]
      prom[Prometheus]
      graf[Grafana]
      tempo[Tempo]
      loki[Loki]
      phoenix[Arize Phoenix]
      pgphx[(Phoenix Postgres)]
    end
    apps -- OTLP --> otelagent
    otelagent -- OTLP --> otelgw
    otelgw -- "AI spans (openinference.span.kind)" --> phoenix
    otelgw -- "infra spans" --> tempo
    otelgw -- "remote_write" --> prom
    otelgw -- "OTLP logs" --> loki
    phoenix --> pgphx
    graf -. "Prometheus + Loki + Tempo + QuestDB datasources" .- prom
    graf -.- loki
    graf -.- tempo
```

## Components

| Component | Folder | Replaces |
|---|---|---|
| kube-prometheus-stack | [observability/kube-prometheus-stack/](../alphaswarm_platform/deployments/kubernetes/observability/kube-prometheus-stack/) | rpi `observability/prometheus/` |
| OpenTelemetry Operator | [observability/opentelemetry-operator/](../alphaswarm_platform/deployments/kubernetes/observability/opentelemetry-operator/) | new |
| OTel Collector (gateway + agent) | [observability/opentelemetry-collector-gateway/](../alphaswarm_platform/deployments/kubernetes/observability/opentelemetry-collector-gateway/) | rpi `observability/otel-collector/` |
| Phoenix | [observability/phoenix/](../alphaswarm_platform/deployments/kubernetes/observability/phoenix/) | new |

## Routing rule (gateway)

The `transform/ai_route` processor in
[`collector-gateway.yaml`](../alphaswarm_platform/deployments/kubernetes/observability/opentelemetry-collector-gateway/collector-gateway.yaml)
inspects every span and tags it with `alphaswarm.ai_trace=true` when:

- `attributes["openinference.span.kind"] != nil`, or
- `attributes["llm.model_name"] != nil`, or
- `attributes["agent.name"] != nil`.

Two trace pipelines (`traces/ai`, `traces/infra`) split on that
attribute. Tail sampling preserves error traces + 100 % of AI
traces; everything else is sampled at 1 %.
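
The routing and sampling rules can be restated in plain Python to make the logic easy to check. The real rules live in the collector configuration; the function names and the span dictionary shape below are assumptions made for this sketch.

```python
import random

AI_MARKER_KEYS = ("openinference.span.kind", "llm.model_name", "agent.name")


def is_ai_span(attributes: dict) -> bool:
    """True when any of the three marker attributes is present (non-nil)."""
    return any(attributes.get(key) is not None for key in AI_MARKER_KEYS)


def route(attributes: dict) -> str:
    """Name of the trace pipeline a span would land in."""
    return "traces/ai" if is_ai_span(attributes) else "traces/infra"


def keep_trace(spans: list, rng=random.random) -> bool:
    """Tail-sampling decision for one finished trace.

    `spans` is a list of dicts with "attributes" and an optional "error"
    flag; that shape is an assumption of this sketch.
    """
    if any(s.get("error") for s in spans):
        return True  # error traces are always kept
    if any(is_ai_span(s["attributes"]) for s in spans):
        return True  # 100 % of AI traces
    return rng() < 0.01  # 1 % of everything else


pipeline = route({"llm.model_name": "example-model"})
```

Passing a deterministic `rng` makes the 1 % branch testable without relying on chance.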

## DataMCP tools

| Tool | Surface |
|---|---|
| `data.observability.prometheus.query` | Instant PromQL. |
| `data.observability.prometheus.query_range` | Range PromQL. |
| `data.observability.prometheus.list_alerts` | Active alerts. |
| `data.observability.grafana.list_dashboards` | Dashboard catalog. |
| `data.observability.grafana.export_dashboard` | Dashboard JSON. |
| `data.observability.phoenix.list_projects` | Phoenix projects. |
| `data.observability.phoenix.get_trace` | LLM / agent trace. |
| `data.observability.phoenix.annotate_span` | Write evaluator verdict. |

## Frontend

- [/admin/topology](../alphaswarm_client/src/routes/admin/topology/page.tsx)
  — Phase 0 topology overview.
- (Phase 6 follow-up) `/admin/observability/{prometheus,grafana,phoenix,otel}`
  — domain-scoped admin pages.


<!-- https://alpha-swarm.ai/concepts/trading/observability -->
# Observability
> AlphaSwarm ships with opt-in OpenTelemetry tracing covering the full request path: FastAPI → Celery → paper session → broker SDK → Postgres → Redis. Install the `otel` extra to enable it:

# Observability

> Doc map: [alphaswarm_docs/index.md](../../intro/index.md) · Progress bus reference: [alphaswarm_docs/flows.md#cross-cutting-progress-bus](../../concepts/platform/flows.md#cross-cutting-progress-bus).

AlphaSwarm ships with opt-in OpenTelemetry tracing covering the full request
path: FastAPI → Celery → paper session → broker SDK → Postgres →
Redis. Install the `otel` extra to enable it:

    pip install -e ".[otel]"

## Quick start (Docker)

`docker compose up -d` starts an OpenTelemetry Collector and Jaeger
sidecar alongside the AlphaSwarm services. Each service is pre-wired with
`ALPHASWARM_OTEL_ENDPOINT=http://otel-collector:4317`.

Open [http://localhost:16686](http://localhost:16686) and pick a
service:

- `alphaswarm-api` — FastAPI request handlers + Dash mount
- `alphaswarm-worker` — Celery tasks (backtest, paper, ingestion)
- `alphaswarm-paper-trader` — paper session loop

## Configuration

All knobs live in `alphaswarm.config.Settings` / `.env`:

| Variable | Default | Purpose |
|---|---|---|
| `ALPHASWARM_OTEL_ENDPOINT` | *empty* | OTLP endpoint. Empty → tracing disabled (safe dev default). |
| `ALPHASWARM_OTEL_SERVICE_NAME` | `alphaswarm` | Base service name. Suffixes `-api`, `-worker`, `-paper` added automatically. |
| `ALPHASWARM_OTEL_SAMPLE_RATIO` | `1.0` | Parent-based head sampler ratio. `0.1` = 10% of traces. |
| `ALPHASWARM_OTEL_PROTOCOL` | `grpc` | `grpc` (port 4317) or `http/protobuf` (port 4318). |

## Instrumentation map

Auto-instrumented on startup (see `alphaswarm/observability/tracing.py`):

- `FastAPIInstrumentor` — every route becomes a span
- `CeleryInstrumentor` — every task becomes a span
- `SQLAlchemyInstrumentor` — every query becomes a span (attached in `alphaswarm/persistence/db.py` when `ALPHASWARM_OTEL_ENDPOINT` is set)
- `HTTPXClientInstrumentor` — every HTTPX call (broker REST, UI API client)
- `RedisInstrumentor` — every Redis command (pub/sub, kill-switch, Celery broker)

Manual spans are added via the `@traced` decorator
(`alphaswarm/observability/decorators.py`):

```python
from alphaswarm.observability import traced

@traced("paper.session.run")
async def run(self) -> PaperSessionResult:
    ...
```

The decorator works transparently on sync and `async` callables; when `otel` isn't
installed the tracer is a no-op, so the decorator adds negligible overhead.
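
One way such a decorator can be built is sketched below. This is an illustrative re-implementation, not the code in `alphaswarm/observability/decorators.py`; the tracer name `alphaswarm.sketch` and the helper `_span` are assumptions. It falls back to `contextlib.nullcontext` when the OpenTelemetry API cannot be imported.

```python
import asyncio
import functools
import inspect
from contextlib import nullcontext

try:  # the OpenTelemetry dependency is optional
    from opentelemetry import trace

    _tracer = trace.get_tracer("alphaswarm.sketch")
except ImportError:
    _tracer = None


def _span(name):
    """Real span when OpenTelemetry is importable, otherwise a no-op context."""
    return _tracer.start_as_current_span(name) if _tracer else nullcontext()


def traced(name):
    """Wrap a sync or async callable in a span called `name`."""

    def decorate(fn):
        if inspect.iscoroutinefunction(fn):

            @functools.wraps(fn)
            async def async_wrapper(*args, **kwargs):
                with _span(name):
                    return await fn(*args, **kwargs)

            return async_wrapper

        @functools.wraps(fn)
        def sync_wrapper(*args, **kwargs):
            with _span(name):
                return fn(*args, **kwargs)

        return sync_wrapper

    return decorate


@traced("demo.add")
def add(a, b):
    return a + b


@traced("demo.fetch")
async def fetch():
    await asyncio.sleep(0)
    return "ok"
```

Branching on `inspect.iscoroutinefunction` at decoration time keeps async functions awaitable and avoids a per-call type check.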

## Custom exporters

The default is OTLP/gRPC. To use OTLP/HTTP instead:

```bash
ALPHASWARM_OTEL_PROTOCOL=http/protobuf
ALPHASWARM_OTEL_ENDPOINT=http://otel-collector:4318/v1/traces
```

For local development from a plain shell (no Compose stack), install the OTel SDK and
point at a local Jaeger all-in-one:

```bash
docker run --rm -p 4317:4317 -p 16686:16686 jaegertracing/all-in-one:1.55
export ALPHASWARM_OTEL_ENDPOINT=http://localhost:4317
```

## Kubernetes

Both the API/Worker image and the `paper` image have the OTel SDK
installed. The Kustomize manifests set `ALPHASWARM_OTEL_ENDPOINT` to the
in-cluster collector service; port-forward Jaeger with:

```bash
kubectl -n alphaswarm-dev port-forward svc/jaeger 16686:16686
```

## Troubleshooting

**Spans never show up in Jaeger.**
- Verify `ALPHASWARM_OTEL_ENDPOINT` is set in the container: `docker compose exec api env | grep OTEL`.
- Check the collector logs for parsing errors: `docker compose logs otel-collector`.
- Set the sample ratio (`ALPHASWARM_OTEL_SAMPLE_RATIO`) to `1.0` while debugging so head sampling drops nothing.

**`ImportError: opentelemetry-exporter-otlp-proto-grpc` at startup.**
- You set `ALPHASWARM_OTEL_ENDPOINT` but didn't install the `otel` extra. The tracer logs a warning and continues as a no-op, but to silence it run `pip install -e ".[otel]"`.

**Tests emit real spans.**
- They shouldn't — `tests/conftest.py` installs an `autouse` fixture that resets `ALPHASWARM_OTEL_ENDPOINT=""` before each test. If you see real spans, check that the fixture is still in place.

## Metrics (optional)

The OTel Collector config in `alphaswarm_platform/deploy/otel/otel-collector-config.yaml`
also exports metrics on port 8889 via the Prometheus exporter, so you
can point a Prometheus scraper at the collector for
service-level dashboards. The AlphaSwarm code doesn't emit custom metrics
yet — PRs welcome.

## Tracing topology

```mermaid
flowchart LR
    API[FastAPI] -->|spans| OTEL[OTEL collector :4317]
    Worker[Celery worker] -->|spans| OTEL
    Paper[paper-trader] -->|spans| OTEL
    OTEL --> Jaeger[Jaeger UI :16686]
    OTEL --> Prom[Prometheus exporter :8889]
    API -.publish.-> RedisBus[("alphaswarm:task pubsub")]
    Worker -.publish.-> RedisBus
    RedisBus -.subscribe.-> WS["/chat/stream WS"]
```


<!-- https://alpha-swarm.ai/concepts/trading/paper-metadata-gate -->
# Paper Metadata Gate (Strict-Only)
> After this rollout, paper-trading sessions **require** both `session.model_urn` and `session.pipeline_urn` to be present and valid at startup

# Paper Metadata Gate (Strict-Only)

## Breaking change

After this rollout, paper-trading sessions **require** both `session.model_urn`
and `session.pipeline_urn` to be present and valid at startup.

If either URN is missing, malformed, unresolved in `entity_aspects`, or (for
the model URN) resolves to a model whose status is neither `Production` nor
`Staging`, the session raises `MetadataValidationError` and refuses to start.

There is no warn-only fallback mode.

## How strict gate validation works

The paper gate performs these checks in order:

1. Parse `model_urn` and `pipeline_urn` with AlphaSwarm URN validation.
2. Resolve `mlModelMetadata` for `model_urn` and `pipelineMetadata` for
   `pipeline_urn`.
3. Enforce model lifecycle status (`Production` or `Staging` only).
4. Emit a `metadata_gate` progress frame and raise on any validation error.

Startup is blocked until all checks pass.
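
The ordered checks can be sketched as follows. The URN pattern is inferred from the seeded examples in this document, the `aspects` dictionary is a hypothetical in-memory stand-in for the `entity_aspects` table, and the exception class here only mirrors the name of the real one. Step 4 (emitting the `metadata_gate` progress frame) is left out.

```python
import re


class MetadataValidationError(Exception):
    """Stand-in for the exception named above."""


# Assumed URN shape: urn:alphaswarm:<kind>:<env>:<name>
URN_RE = re.compile(
    r"^urn:alphaswarm:(?P<kind>[a-z]+):(?P<env>[a-z]+):(?P<name>[A-Za-z0-9_]+)$"
)
ALLOWED_STATUSES = {"Production", "Staging"}
WANTED = (
    ("model_urn", "mlmodel", "mlModelMetadata"),
    ("pipeline_urn", "pipeline", "pipelineMetadata"),
)


def check_gate(session: dict, aspects: dict) -> None:
    """Run the checks in order and raise on the first failure."""
    # 1. Parse both URNs.
    for field, kind, _ in WANTED:
        urn = session.get(field)
        match = URN_RE.match(urn or "")
        if match is None or match["kind"] != kind:
            raise MetadataValidationError(f"{field} missing or malformed: {urn!r}")
    # 2. Resolve the matching aspect for each URN.
    resolved = {}
    for field, _, aspect_name in WANTED:
        payload = aspects.get((session[field], aspect_name))
        if payload is None:
            raise MetadataValidationError(f"{field} has no {aspect_name} aspect")
        resolved[field] = payload
    # 3. Enforce the model lifecycle status.
    status = resolved["model_urn"].get("status")
    if status not in ALLOWED_STATUSES:
        raise MetadataValidationError(f"model status {status!r} is not allowed")


session = {
    "model_urn": "urn:alphaswarm:mlmodel:prod:alpaca_mean_reversion_v1",
    "pipeline_urn": "urn:alphaswarm:pipeline:prod:alpaca_mean_reversion_loop",
}
aspects = {
    (session["model_urn"], "mlModelMetadata"): {"status": "Production"},
    (session["pipeline_urn"], "pipelineMetadata"): {},
}
check_gate(session, aspects)  # passes silently
```

Because every failure raises the same exception type, a caller only needs one `except` clause to block startup.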

## Seeded URNs from migration 0049

Alembic revision `0049_paper_metadata_seed_aspects` seeds these baseline URNs:

- `configs/paper/alpaca_mean_rev.yaml`
  - `urn:alphaswarm:mlmodel:prod:alpaca_mean_reversion_v1`
  - `urn:alphaswarm:pipeline:prod:alpaca_mean_reversion_loop`
- `configs/paper/ibkr_mean_rev.yaml`
  - `urn:alphaswarm:mlmodel:prod:ibkr_mean_reversion_v1`
  - `urn:alphaswarm:pipeline:prod:ibkr_mean_reversion_loop`
- `configs/paper/avellaneda_stoikov_quotes.yaml`
  - `urn:alphaswarm:mlmodel:prod:avellaneda_stoikov_v1`
  - `urn:alphaswarm:pipeline:prod:avellaneda_stoikov_quotes_loop`
- `configs/paper/lucic_tse_options.yaml`
  - `urn:alphaswarm:mlmodel:prod:lucic_tse_options_v1`
  - `urn:alphaswarm:pipeline:prod:lucic_tse_options_loop`
- `configs/paper/tradier_rest.yaml`
  - `urn:alphaswarm:mlmodel:prod:tradier_rest_baseline_v1`
  - `urn:alphaswarm:pipeline:prod:tradier_rest_loop`

To add a new paper config, seed matching `MlModel` + `Pipeline` aspects first,
then point YAML `session.model_urn` / `session.pipeline_urn` at those URNs.

## Operator runbook (custom paper YAMLs)

1. Run migrations through `0049_paper_metadata_seed_aspects`.
2. For each custom paper model, register an `MlModel` aspect (status must be
   `Production` or `Staging`) using the `aspect.register_model` MCP tool.
3. Register a matching `Pipeline` aspect for each paper pipeline URN.
4. Update custom YAML files so `session.model_urn` and `session.pipeline_urn`
   match the newly seeded aspects.
5. Start paper sessions and confirm metadata-gate startup checks pass.

## Rollback

If you must revert this rollout:

1. `alembic downgrade 0048`
2. `git revert `

After rollback, redeploy and re-run paper sessions with the reverted code/docs.


<!-- https://alpha-swarm.ai/concepts/trading/paper-trading -->
# Paper & live trading
> AlphaSwarm's paper trading engine is a Lean-inspired async runtime that shares 100% of its strategy code with the backtester. Orders from the same `IStrategy` object flow through the **same ledger tables** r...

# Paper & live trading

> Doc map: [alphaswarm_docs/index.md](../../intro/index.md) · Session state machine: [alphaswarm_docs/flows.md#4-paper-trading-session](../../concepts/platform/flows.md#4-paper-trading-session).

AlphaSwarm's paper trading engine is a Lean-inspired async runtime that shares
100% of its strategy code with the backtester. Orders from the same
`IStrategy` object flow through the **same ledger tables** regardless of
whether the session is a backtest, a paper replay, or a live session.

## Architecture

```mermaid
flowchart LR
    Strategy["IStrategy (e.g. FrameworkAlgorithm)"]
    Session["PaperTradingSession (async loop)"]
    Feed["IMarketDataFeed (async iterator)"]
    Broker["IBrokerage + IAsyncBrokerage"]
    Ledger[("Postgres: OrderRecord / Fill / LedgerEntry")]
    Redis[("Redis: progress + stop signals")]

    Feed -->|BarData| Session
    Session -->|on_bar| Strategy
    Strategy -->|OrderRequest| Session
    Session -->|submit_order_async| Broker
    Broker --> Ledger
    Session -->|emit progress| Redis
    Redis -->|stop signal| Session
    Session --> Ledger
```

## Lifecycle

1. `alphaswarm paper run --config ` (or `POST /paper/start`) builds a
   `PaperTradingSession` via
   [`alphaswarm/trading/runner.py`](../alphaswarm/trading/runner.py).
2. `_connect` subscribes the feed to the strategy's universe.
3. For each bar (up to `max_bars` or forever):
   - Check the kill switch (`POST /portfolio/kill_switch`).
   - Append to an in-memory history window.
   - Call `strategy.on_bar(bar, context)` — identical to the backtest.
   - For each returned `OrderRequest`:
     - Run the pre-trade risk check (`RiskManager.check_pretrade`).
     - Submit via `brokerage.submit_order_async` (or the sync bridge).
     - Persist the `OrderRecord` and ledger entry.
   - Drain order updates (simulated path) and emit fills.
4. Every `state_flush_every_bars` bars, a snapshot of the session state
   is flushed to the `paper_trading_runs.state` JSONB column.
5. On shutdown (kill switch, stop signal, `max_bars`, or feed EOF), the
   engine drains, writes the final row, and emits `done` to the progress
   bus.
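
The lifecycle above can be condensed into a small `asyncio` sketch. Every collaborator here (`feed`, `strategy`, `broker`, `kill_switch`, `risk_ok`) is a hypothetical stand-in passed in as a plain callable; the real `PaperTradingSession` has a richer interface and persists to Postgres and Redis rather than to in-memory lists.

```python
import asyncio


async def run_session(feed, strategy, broker, *, kill_switch, risk_ok,
                      max_bars=None, state_flush_every_bars=2):
    """Simplified bar loop mirroring the numbered steps above."""
    history, snapshots, submitted = [], [], []
    processed = 0
    async for bar in feed:
        if kill_switch():  # checked before each bar is handled
            break
        history.append(bar)
        for order in strategy(bar, history):  # same callback as the backtest
            if risk_ok(order):  # pre-trade risk check
                submitted.append(await broker(order))
        processed += 1
        if processed % state_flush_every_bars == 0:
            snapshots.append({"bars": processed, "orders": len(submitted)})
        if max_bars is not None and processed >= max_bars:
            break
    # Shutdown: record the final state.
    snapshots.append({"bars": processed, "orders": len(submitted), "final": True})
    return submitted, snapshots


async def demo():
    async def feed():
        for price in (100, 101, 99, 102, 98):
            yield {"close": price}

    async def broker(order):
        return {**order, "status": "filled"}

    def strategy(bar, history):
        # Toy rule: buy one share whenever the close drops below the prior close.
        if len(history) > 1 and bar["close"] < history[-2]["close"]:
            return [{"side": "buy", "qty": 1}]
        return []

    return await run_session(
        feed(), strategy, broker,
        kill_switch=lambda: False,
        risk_ok=lambda order: order["qty"] <= 10,
        max_bars=4,
    )


submitted, snapshots = asyncio.run(demo())
```

Injecting the kill switch and risk check as callables keeps the loop itself free of broker-specific or storage-specific code, which is what lets the same strategy object run in backtest and paper modes.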

## Broker adapters

Each adapter lives in `alphaswarm/trading/brokerages/` and implements **both**
`IBrokerage` (sync, for backtest parity) and `IAsyncBrokerage`.

### Alpaca (`[alpaca]` extra)

```yaml
brokerage:
  class: AlpacaBrokerage
  kwargs: {paper: true}    # flip to false for live
```

Requires `ALPHASWARM_ALPACA_API_KEY` and `ALPHASWARM_ALPACA_SECRET_KEY`. The adapter
maintains a background `TradingStream` that re-emits order updates
through the session's `_order_event_queue`.

### Interactive Brokers (`[ibkr]` extra)

```yaml
brokerage:
  class: InteractiveBrokersBrokerage
  kwargs: {exchange: SMART, currency: USD}
```

Requires a running TWS or IB Gateway. Defaults:
`ALPHASWARM_IBKR_HOST=127.0.0.1`, `ALPHASWARM_IBKR_PORT=7497` (paper),
`ALPHASWARM_IBKR_CLIENT_ID=1`. The feed uses `client_id + 100` so it doesn't
collide with the trading client.

### Tradier (generic REST template)

```yaml
brokerage:
  class: TradierBrokerage
```

Requires `ALPHASWARM_TRADIER_TOKEN` and `ALPHASWARM_TRADIER_ACCOUNT_ID`. Demonstrates
how to subclass [`RestBrokerage`](../alphaswarm/trading/brokerages/rest.py) —
five small overrides give you a full paper/live venue:
`_order_payload`, `_parse_order(s)`, `_parse_positions`, `_parse_account`,
`_order_detail_path/_orders_path/_positions_path/_account_path`.

## Credential flow

`alphaswarm.config.Settings` reads every broker secret from the `ALPHASWARM_*`
environment (via `.env`). Adapters pick those up automatically at
construction time, so YAML recipes rarely need to inline secrets.

Order of precedence:

1. Explicit `kwargs` in the YAML recipe (highest)
2. Explicit `kwargs=` passed to `build_from_config`
3. `ALPHASWARM_*` environment variables
4. Package defaults (sandbox URLs, paper=True, etc.)
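
A layered lookup like this maps naturally onto `collections.ChainMap`. The sketch below is only an illustration of the precedence order; `resolve_kwargs`, `env_settings`, the `ALPHASWARM_ALPACA_` prefix handling, and the placeholder default URL are assumptions, not the behaviour of `alphaswarm.config.Settings`.

```python
import os
from collections import ChainMap

# Hypothetical package defaults for the example.
PACKAGE_DEFAULTS = {"paper": True, "base_url": "https://sandbox.example.invalid"}


def env_settings(prefix, environ=os.environ):
    """Collect PREFIX_* variables as lowercase keys (PREFIX_API_KEY -> api_key)."""
    return {
        key[len(prefix):].lower(): value
        for key, value in environ.items()
        if key.startswith(prefix)
    }


def resolve_kwargs(yaml_kwargs, call_kwargs, environ):
    """Earlier maps win, matching the precedence list above."""
    layers = ChainMap(
        yaml_kwargs,                                  # 1. YAML recipe
        call_kwargs,                                  # 2. build_from_config kwargs
        env_settings("ALPHASWARM_ALPACA_", environ),  # 3. environment
        PACKAGE_DEFAULTS,                             # 4. package defaults
    )
    return dict(layers)


resolved = resolve_kwargs(
    yaml_kwargs={"paper": False},
    call_kwargs={"paper": True, "api_key": "from-call"},
    environ={
        "ALPHASWARM_ALPACA_API_KEY": "from-env",
        "ALPHASWARM_ALPACA_SECRET_KEY": "s3cret",
    },
)
```

Here `paper` comes from the YAML layer, `api_key` from the call-site kwargs, `secret_key` from the environment, and `base_url` from the defaults.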

## Kill-switch integration

The paper session wraps every iteration in a check against
[`alphaswarm.risk.kill_switch.is_engaged`](../alphaswarm/risk/kill_switch.py). Toggling
the switch via `POST /portfolio/kill_switch` (or the UI's Portfolio
page) causes the session to:

1. Stop accepting new bars from the feed.
2. Cancel every open order via `brokerage.cancel_order_async`.
3. Flush final state + close brokerage/feed connections.
4. Emit `done` to the task progress channel.

Set `session.stop_on_kill_switch: false` in the recipe to disable this
behaviour (not recommended).

## Remote / Kubernetes runs

The `paper-trader` Docker image (`--target paper`) runs `alphaswarm paper run`
as a single-replica k8s `Deployment`. See
[`alphaswarm_platform/deploy/k8s/base/paper-trader.yaml`](../alphaswarm_platform/deploy/k8s/base/paper-trader.yaml).

To run on a remote host over SSH:

```bash
ALPHASWARM_ALPACA_API_KEY=... ALPHASWARM_ALPACA_SECRET_KEY=... \
  alphaswarm paper run --config configs/paper/alpaca_mean_rev.yaml --celery
```

The `--celery` flag enqueues the job onto the shared worker pool; the
shell can exit and the session keeps running. Use `alphaswarm paper stop
` to drain it gracefully from anywhere.

## Metadata Gate — Strict Mode Rollout

Paper sessions now run with strict metadata validation only. Startup aborts
when `session.model_urn` or `session.pipeline_urn` is missing, invalid, or
unresolvable, or when model status is not `Production`/`Staging`.

When adding a new paper config:

1. Declare both `model_urn` and `pipeline_urn` in the YAML.
2. Seed matching aspects before startup checks run.
3. Ship the seed with the config. Built-in baseline configs are seeded by
   Alembic revision `0049_paper_metadata_seed_aspects`; additional configs
   should ship a follow-up migration that calls
   `alphaswarm.trading.baseline_aspects.seed_paper_baseline_aspects()`.
   Non-migration application code can call `write_aspect(...)` directly.

Baseline URNs seeded by revision `0049_paper_metadata_seed_aspects`:

- `urn:alphaswarm:mlmodel:prod:alpaca_mean_reversion_v1`
- `urn:alphaswarm:pipeline:prod:alpaca_mean_reversion_loop`
- `urn:alphaswarm:mlmodel:prod:ibkr_mean_reversion_v1`
- `urn:alphaswarm:pipeline:prod:ibkr_mean_reversion_loop`
- `urn:alphaswarm:mlmodel:prod:avellaneda_stoikov_v1`
- `urn:alphaswarm:pipeline:prod:avellaneda_stoikov_quotes_loop`
- `urn:alphaswarm:mlmodel:prod:lucic_tse_options_v1`
- `urn:alphaswarm:pipeline:prod:lucic_tse_options_loop`
- `urn:alphaswarm:mlmodel:prod:tradier_rest_baseline_v1`
- `urn:alphaswarm:pipeline:prod:tradier_rest_loop`
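
As a rough illustration of what the strict gate checks, the sketch below validates a session block against a toy status registry. It is an approximation, not the real startup check: `check_session_metadata` is a hypothetical helper, the URN pattern is inferred from the baseline URNs listed above, and pipeline resolution is not modelled.

```python
import re

# Shape inferred from the baseline URNs above; the real validator may differ.
URN_RE = re.compile(
    r"^urn:alphaswarm:(?P<kind>mlmodel|pipeline):(?P<env>[a-z]+):(?P<name>[a-z0-9_]+)$"
)
ALLOWED_MODEL_STATUS = {"Production", "Staging"}


def check_session_metadata(session: dict, model_status: dict) -> list:
    """Return a list of problems; an empty list means the gate would pass."""
    problems = []
    for key, kind in (("model_urn", "mlmodel"), ("pipeline_urn", "pipeline")):
        urn = session.get(key)
        if not urn:
            problems.append(f"{key} is missing")
            continue
        match = URN_RE.match(urn)
        if match is None or match["kind"] != kind:
            problems.append(f"{key} is invalid: {urn!r}")
    model_urn = session.get("model_urn")
    if model_urn and model_urn not in model_status:
        problems.append("model_urn is unresolvable")
    elif model_urn and model_status[model_urn] not in ALLOWED_MODEL_STATUS:
        problems.append(f"model status {model_status[model_urn]!r} is not allowed")
    return problems


registry = {"urn:alphaswarm:mlmodel:prod:alpaca_mean_reversion_v1": "Production"}
session = {
    "model_urn": "urn:alphaswarm:mlmodel:prod:alpaca_mean_reversion_v1",
    "pipeline_urn": "urn:alphaswarm:pipeline:prod:alpaca_mean_reversion_loop",
}
print(check_session_metadata(session, registry))  # []
```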

## Observability hooks

All broker calls and the main session loop are instrumented with
OpenTelemetry spans (see [observability.md](../../concepts/trading/observability.md)):

| Span name | Emitted by |
|---|---|
| `paper.session.run` | `PaperTradingSession.run` |
| `paper.session.bar` | Each bar processed |
| `paper.session.submit_order` | Order submission gate |
| `broker.submit_order` | Every concrete broker adapter |
| `broker.cancel_order` | Every concrete broker adapter |
| `broker.query_positions` | Every concrete broker adapter |
| `broker.query_account` | Every concrete broker adapter |

Each span carries a `broker.venue` attribute (`alpaca`, `ibkr`,
`tradier`, or `sim`).


<!-- https://alpha-swarm.ai/concepts/trading/webui -->
# webui — Next.js 15 frontend
> The `webui/` package is the React/TypeScript replacement for the legacy Solara UI on `:8765`. It runs as a separate Node process on `:3000` and talks to the FastAPI backend on `:8000` over REST + WebS...

# webui — Next.js 15 frontend

> Doc map: [alphaswarm_docs/index.md](../../intro/index.md) · API surface: [alphaswarm_docs/architecture.md#system-component-diagram](../../concepts/platform/architecture.md#system-component-diagram).

The `webui/` package is the React/TypeScript replacement for the legacy Solara
UI on `:8765`. It runs as a separate Node process on `:3000` and talks to the
FastAPI backend on `:8000` over REST + WebSocket.

## Stack

- Next.js 15 App Router, React 19, TypeScript strict
- Ant Design 5 + `@ant-design/icons` + `@ant-design/charts`
- AG Grid Community (`ag-grid-community` + `ag-grid-react`)
- React Flow v12 (`@xyflow/react`) for visual workflow editors
- `react-financial-charts` for OHLC + indicators (alongside `recharts`)
- TanStack Query v5 + Zustand
- `openapi-typescript` + `openapi-fetch` for type-safe REST access

The full directory layout and design rationale live in `webui/README.md`.

## Local dev

From the repo root:

```bash
make webui-install   # one-time pnpm install
make webui-gen-api   # dump OpenAPI + regenerate TypeScript client
make webui-dev       # start dev server on :3000
```

The Next dev server proxies `/alphaswarm-api/*` → `${NEXT_PUBLIC_API_URL}` (default
`http://localhost:8000`) so cookies and WebSockets stay same-origin in dev.

## Backend contract additions

The refactor added or extended a small surface on the FastAPI side:

- `GET  /auth/whoami` — local-first identity stub
- `GET  /chat/threads`, `POST /chat/threads`, `DELETE /chat/threads/{id}`
- `POST /chat` accepts an optional `context: ChatContext` block (page,
  vt_symbol, backtest_id, strategy_id, …) which is materialised into the
  system prompt so the assistant knows which page the user is on.
- CORS is now driven by `ALPHASWARM_WEBUI_CORS_ORIGINS` (comma-separated list).
  Empty value falls back to the legacy `"*"` behaviour.
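
The CORS rule above amounts to splitting a comma-separated value and falling back to a wildcard when nothing is configured. A minimal sketch, assuming a hypothetical helper name `parse_cors_origins` (the real settings code may normalise values differently):

```python
from typing import List, Optional


def parse_cors_origins(raw: Optional[str]) -> List[str]:
    """Split a comma-separated origin list; an empty value means allow all."""
    origins = [item.strip() for item in (raw or "").split(",") if item.strip()]
    return origins or ["*"]


print(parse_cors_origins("http://localhost:3000, https://manage.alpha-swarm.ai"))
# ['http://localhost:3000', 'https://manage.alpha-swarm.ai']
print(parse_cors_origins(""))  # ['*']
```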

WebSocket contracts are unchanged:

- `WS /chat/stream/{task_id}` — Celery task progress
- `WS /live/stream/{channel_id}` — live market subscriptions

## OpenAPI client regeneration

The `webui` consumes a generated `paths` interface that mirrors FastAPI's
spec exactly:

1. `python -m scripts.export_openapi --out data/openapi.json`
2. `pnpm --dir webui exec openapi-typescript ../data/openapi.json -o lib/api/generated/schema.d.ts`

`make webui-gen-api` (or `pwsh ./scripts/gen_webui_client.ps1`) wraps both
steps. CI should run them and fail if the diff is non-empty (drift check).
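
One way to express the drift check is to compare the committed schema text with a freshly regenerated copy and fail on any difference. The helpers below are a hypothetical sketch (`schema_drift` and `ci_exit_code` are not part of the repo); in CI the two strings would come from reading the committed and regenerated `schema.d.ts` files.

```python
import difflib
from typing import List


def schema_drift(committed: str, regenerated: str) -> List[str]:
    """Return unified-diff lines; an empty list means the client is up to date."""
    return list(
        difflib.unified_diff(
            committed.splitlines(),
            regenerated.splitlines(),
            fromfile="committed/schema.d.ts",
            tofile="regenerated/schema.d.ts",
            lineterm="",
        )
    )


def ci_exit_code(committed: str, regenerated: str) -> int:
    """Print the diff and return a non-zero exit code when drift exists."""
    diff = schema_drift(committed, regenerated)
    for line in diff:
        print(line)
    return 1 if diff else 0


print(ci_exit_code("export interface paths {}\n", "export interface paths {}\n"))  # 0
```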

## Strangler migration

During the migration both UIs run in parallel:

- `:3000` — Next.js webui (new, primary)
- `:8765` — Solara UI (legacy)
- `/dash`  — Dash strategy monitor (kept; embedded in Next via iframe under `/monitor`)

When the Next.js app reaches feature parity:

1. Drop the `ui` service from `alphaswarm_platform/compose/docker-compose.yml`.
2. Delete `alphaswarm/ui/pages/` and `alphaswarm/ui/app.py` (keep the Dash factory).
3. Optionally relax the `fastapi<0.116` and `starlette<0.46` pins in
   `pyproject.toml` (they exist solely to satisfy Solara).

## Page tree (top-level)

```mermaid
flowchart TB
    Root["/"] --> Dash["/dashboard"]
    Root --> Data["/data"]
    Data --> DataCatalog["/data/catalog"]
    Data --> DataIceberg["/data/iceberg"]
    Data --> DataIngest["/data/ingest"]
    Data --> DataBrowser["/data/browser"]
    Root --> Backtest["/backtest"]
    Backtest --> BTHistory["/backtest/history"]
    Backtest --> BTNew["/backtest/new"]
    Root --> Strategies["/strategies"]
    Root --> Models["/models"]
    Root --> Agentic["/agentic"]
    Root --> Paper["/paper"]
    Root --> Settings["/settings"]
```


<!-- https://alpha-swarm.ai/how-to/alphaswarm-admin-entra-setup -->
# Wire alphaswarm_admin against the AlphaSwarm staff Entra tenant

# Wire alphaswarm_admin against the AlphaSwarm staff Entra tenant

End-to-end procedure for connecting the `alphaswarm_admin` service (backend
BFF + Next.js frontend) to Microsoft Entra ID, using the staff app
registration that the `alphaswarm_entra_directory` Terraform module
provisions.

The result: AlphaSwarm staff sign in to `manage.alpha-swarm.ai` with their
corporate Entra account; the admin BFF validates the resulting
`api://alphaswarm-manage-api` access tokens; the SPA mints tokens via
`@azure/msal-browser` and renews them silently with
`acquireTokenSilent`.

Companion runbooks:

- [Bootstrap the AlphaSwarm Entra tenant](./entra-terraform-bootstrap.md) —
  the prerequisite that creates the apps + groups + roles.
- [Onboard a new staff member](./entra-onboard-new-staff.md) — group
  + role assignment after the apps land.
- [Rotate Entra secrets](./entra-rotate-secrets.md) — federated
  credentials + break-glass procedures.
- Concept overview:
  [Entra ID as the AlphaSwarm staff user pool](../concepts/identity/entra-internal-tenant.md).
- ADR:
  [ADR-013 Entra ID as the AlphaSwarm staff first user pool](../architecture/decisions/013-entra-as-first-pool.md).

## What gets wired

| Surface | What it does |
| --- | --- |
| `alphaswarm_admin/src/alphaswarm_admin/settings.py` | Reads the `ALPHASWARM_AUTH_MSAL_INTERNAL_*` env vars set by the helper script. Single-tenant when `INTERNAL_TENANT_ID` is set; multi-tenant otherwise. |
| `alphaswarm_admin/src/alphaswarm_admin/deps/identity.py` | JWT validator pinned to the AlphaSwarm staff Entra v2.0 issuer; verifies `aud=api://alphaswarm-manage-api`, maps `roles` claim through the canonical RBAC lattice. |
| `alphaswarm_admin/src/alphaswarm_admin/api/routers/auth_setup.py` | New `GET /admin/auth/discovery` + `GET /admin/auth/health`. Discovery feeds the SPA's `PublicClientApplication`; health confirms the IdP is reachable. |
| `alphaswarm_admin/frontend/components/auth/AuthProvider.tsx` | Real MSAL flow (`loginRedirect`, `acquireTokenSilent`, `acquireTokenPopup` for step-up). No tenant id hard-coded in the bundle — everything comes from `/admin/auth/discovery`. |
| `scripts/identity/alphaswarm_admin_entra_setup.py` | Operator helper: discovers values from Terraform outputs, prints + optionally writes the env vars, prints the runbook. |

## Prerequisites

- The Terraform stack `entra-internal` has been planned + applied for
  the wiley-tech environment (see
  [bootstrap runbook](./entra-terraform-bootstrap.md)).
- Admin consent has been granted on the staff app's Graph permissions
  (`./scripts/identity/grant_admin_consent.sh "$STAFF_CID"`).
- The `EntraTenantLink` for the AlphaSwarm staff tenant exists with
  `meta.kind = 'internal'`
  (`python scripts/identity/seed_entra_internal_tenant.py --apply`).

## Step 1 — Generate the env vars

```bash
# Auto-discover from the Terraform outputs in the wiley-tech env.
python scripts/identity/alphaswarm_admin_entra_setup.py
```

The script prints two env blocks. Sample output:

```
# --- Backend env (alphaswarm_admin BFF) ---
ALPHASWARM_ADMIN_AUTH_PROVIDER=msal_entra
ALPHASWARM_ADMIN_AUTH_REQUIRED=true
ALPHASWARM_AUTH_MSAL_INTERNAL_TENANT_ID=12345678-aaaa-bbbb-cccc-deadbeef0000
ALPHASWARM_AUTH_MSAL_INTERNAL_APP_ID=99999999-1111-2222-3333-444444444444
ALPHASWARM_AUTH_MSAL_INTERNAL_AUDIENCE=api://alphaswarm-manage-api
ALPHASWARM_AUTH_OIDC_AUDIENCE=api://alphaswarm-manage-api
ALPHASWARM_ADMIN_ENTRA_TENANT=12345678-aaaa-bbbb-cccc-deadbeef0000
ALPHASWARM_ADMIN_ENTRA_REDIRECT_PATH=/api/auth/entra/callback

# --- Frontend env (alphaswarm_admin/frontend) ---
NEXT_PUBLIC_AQP_AUTH_PROVIDER=msal_entra
NEXT_PUBLIC_AQP_ADMIN_API_URL=http://localhost:8900
```

To write a `.env.alphaswarm_admin.entra` file alongside the printout:

```bash
python scripts/identity/alphaswarm_admin_entra_setup.py --write-env
```

The script is intentionally additive: it never overwrites values that
weren't generated by it; the operator merges the block into their
existing Kubernetes manifests / Helm values / `.env.local`.

## Step 2 — Verify the backend can reach Entra

Boot the admin BFF (or restart your existing instance) with the env
vars sourced:

```bash
set -a; source .env.alphaswarm_admin.entra; set +a
uv run alphaswarm-admin  # or: python -m alphaswarm_admin.main
```

Then hit the new health endpoint:

```bash
curl -fsSL http://localhost:8900/admin/auth/health | jq .
```

Expected output:

```json
{
  "ok": true,
  "auth_enabled": true,
  "issuer": "https://login.microsoftonline.com/12345678-aaaa-bbbb-cccc-deadbeef0000/v2.0",
  "audience": "api://alphaswarm-manage-api",
  "jwks_uri": "https://login.microsoftonline.com/12345678-.../discovery/v2.0/keys",
  "discovery_url": "https://login.microsoftonline.com/12345678-.../v2.0/.well-known/openid-configuration",
  "key_count": 7
}
```

If `ok=false`, the JSON body's `stage` field tells you what failed
(`discovery`, `issuer-mismatch`, `jwks`, `jwks-empty`). Common causes:

- Wrong `ALPHASWARM_AUTH_MSAL_INTERNAL_TENANT_ID` → fix the env var, restart.
- Tenant restrictions block the BFF from reaching `login.microsoftonline.com`
  → talk to Network about egress.

## Step 3 — Verify discovery returns the frontend config

```bash
curl -fsSL http://localhost:8900/admin/auth/discovery | jq .
```

Expected:

```json
{
  "provider": "msal_entra",
  "auth_enabled": true,
  "issuer": "https://login.microsoftonline.com/.../v2.0",
  "audience": "api://alphaswarm-manage-api",
  "scopes": ["api://alphaswarm-manage-api/.default"],
  "jwks_uri": "...",
  "authority": "https://login.microsoftonline.com/...",
  "client_id": "99999999-...",
  "tenant_id": "12345678-...",
  "redirect_path": "/api/auth/entra/callback",
  "claims_namespace": "https://alphaswarm.internal/"
}
```

The frontend fetches this on mount; no tenant ids land in the JS
bundle.
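
The mapping from the discovery document to an MSAL configuration can be sketched as follows. The real `AuthProvider` is TypeScript; this Python version only mirrors the logic for illustration. `to_msal_config` and the `origin` parameter are assumptions, while `clientId`, `authority`, and `redirectUri` are the standard `auth` fields of an `@azure/msal-browser` configuration.

```python
def to_msal_config(discovery: dict, origin: str) -> dict:
    """Build an MSAL-style config dict from a /admin/auth/discovery response."""
    missing = [key for key in ("client_id", "authority") if not discovery.get(key)]
    if missing:
        # Mirrors the console message listed under Troubleshooting.
        raise ValueError("discovery missing " + "/".join(missing))
    return {
        "auth": {
            "clientId": discovery["client_id"],
            "authority": discovery["authority"],
            "redirectUri": origin.rstrip("/") + discovery.get("redirect_path", "/"),
        },
        "scopes": list(discovery.get("scopes", [])),
    }


doc = {
    "client_id": "99999999-1111-2222-3333-444444444444",
    "authority": "https://login.microsoftonline.com/12345678-aaaa-bbbb-cccc-deadbeef0000",
    "redirect_path": "/api/auth/entra/callback",
    "scopes": ["api://alphaswarm-manage-api/.default"],
}
print(to_msal_config(doc, "http://localhost:3001")["auth"]["redirectUri"])
# http://localhost:3001/api/auth/entra/callback
```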

## Step 4 — Boot the frontend with MSAL

```bash
cd alphaswarm_admin/frontend
# .env.local picks up NEXT_PUBLIC_* automatically.
pnpm dev
open http://localhost:3001
```

The first page load triggers the `AuthProvider` to:

1. `fetch('/admin/auth/discovery')` against the BFF.
2. Lazy-import `@azure/msal-browser`.
3. Construct a `PublicClientApplication` with the discovered config.
4. Call `handleRedirectPromise()` (consumes any pending login round-trip).
5. Surface the active account via `useAuth()`.

A signed-in user should see their name + roles in the dashboard
header within a few seconds.

## Step 5 — End-to-end smoke test

The repo's MSAL round-trip helper validates the full chain:

```bash
python scripts/identity/verify_entra_login.py
```

Expected:

```
INFO Got access token: eyJ0… (1456 chars)
INFO Claims look correct.
INFO CA policies found: AlphaSwarm-Admins-MFA-Required, AlphaSwarm-Block-Risky-Sign-Ins
INFO All checks passed.
```

## How auth is enforced at runtime

```mermaid
sequenceDiagram
    participant Browser
    participant SPA as alphaswarm_admin SPA
    participant BFF as alphaswarm_admin BFF
    participant Entra

    Browser->>SPA: GET / (initial load)
    SPA->>BFF: GET /admin/auth/discovery
    BFF-->>SPA: { provider, issuer, audience, authority, client_id, scopes }
    SPA->>SPA: new PublicClientApplication(discovered)
    Browser->>SPA: click "Sign in"
    SPA->>Entra: /authorize (PKCE + nonce)
    Entra-->>Browser: MFA / CA challenge
    Browser->>Entra: present FIDO2
    Entra-->>SPA: redirect to /api/auth/entra/callback
    SPA->>SPA: handleRedirectPromise() -> account + tokens
    SPA->>BFF: GET /admin/cells (Authorization: Bearer ...)
    BFF->>BFF: require_admin -> JwtValidator.validate(token)
    BFF->>BFF: extract roles -> map to AlphaSwarm scopes
    BFF-->>SPA: 200 JSON
```

Every subsequent call:

1. SPA pulls the bearer via `acquireTokenSilent`.
2. Backend `require_admin` dependency validates issuer + audience +
   signature against the cached JWKS, expands `roles` through
   `alphaswarm_core.auth.rbac.expand_role`.
3. Step-up routes (`require_admin_step_up`) trigger
   `acquireTokenPopup` for a fresh MFA evaluation.
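
The claim checks in step 2 can be illustrated on an already-decoded token payload. This sketch deliberately omits signature verification against the JWKS and the `expand_role` mapping, both of which the real `require_admin` dependency performs; `check_claims` and the role name `Admin` are hypothetical.

```python
def check_claims(claims: dict, tenant_id: str, audience: str, now: float) -> list:
    """Validate issuer, audience, and expiry; return the raw `roles` claim.

    Assumes the token signature was already verified elsewhere.
    """
    expected_issuer = f"https://login.microsoftonline.com/{tenant_id}/v2.0"
    if claims.get("iss") != expected_issuer:
        raise PermissionError("issuer mismatch")
    if claims.get("aud") != audience:
        raise PermissionError("audience mismatch")
    if float(claims.get("exp", 0)) <= now:
        raise PermissionError("token expired")
    return list(claims.get("roles", []))


claims = {
    "iss": "https://login.microsoftonline.com/12345678-aaaa-bbbb-cccc-deadbeef0000/v2.0",
    "aud": "api://alphaswarm-manage-api",
    "exp": 2_000_000_000,
    "roles": ["Admin"],  # hypothetical role name
}
print(check_claims(
    claims,
    tenant_id="12345678-aaaa-bbbb-cccc-deadbeef0000",
    audience="api://alphaswarm-manage-api",
    now=1_900_000_000,
))  # ['Admin']
```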

## Local dev (no Entra tenant needed)

Set:

```bash
export ALPHASWARM_ADMIN_AUTH_REQUIRED=false
# or:
export NEXT_PUBLIC_AQP_AUTH_PROVIDER=mock
```

Both backend and frontend fall back to a synthetic anonymous user
with `admin:cluster` scope. The dashboard renders without any IdP
round-trip — ideal for offline contributors.

## Troubleshooting

| Symptom | Cause / Fix |
| --- | --- |
| `GET /admin/auth/health → 502 stage=discovery` | BFF cannot reach `login.microsoftonline.com`. Check egress. |
| `GET /admin/auth/health → 502 stage=issuer-mismatch` | The configured tenant id doesn't match the tenant that responded. Double-check `ALPHASWARM_AUTH_MSAL_INTERNAL_TENANT_ID`. |
| Frontend stuck on the loading spinner | Inspect the browser console. The most common message is `discovery missing client_id/authority` — the BFF returned an incomplete discovery doc, meaning `ALPHASWARM_AUTH_MSAL_INTERNAL_APP_ID` is empty. |
| Login completes but the user has no roles | The user isn't in any AlphaSwarm-* directory group, or the staff app's API permission consent wasn't granted. Re-run `grant_admin_consent.sh`. |
| 401 on every API call after login | The bearer's `aud` doesn't match what the BFF expects. Check that the SPA's `scopes` came from `/admin/auth/discovery` (so they include `api://alphaswarm-manage-api/.default`). |
| Step-up popup never appears | `setStepUpSupported(false)` was set because the SPA fell back to mock. Confirm `NEXT_PUBLIC_AQP_AUTH_PROVIDER=msal_entra`. |

## Production deployment notes

The same env vars apply in production. In Kubernetes you typically:

1. Sync the values into a `Secret` via the External Secrets operator,
   sourcing from `secret/alphaswarm/admin/entra/*` in Vault.
2. Mount the Secret as env on the `alphaswarm-admin` Deployment.
3. Build the frontend image with `NEXT_PUBLIC_*` baked in (Next.js
   inlines these at build time).

The Terraform module `alphaswarm_entra_directory` already creates the staff
app with the production redirect URI
`https://manage.alpha-swarm.ai/api/auth/entra/callback`. The helper script's
`--admin-origin` defaults to `http://localhost:3001` for dev; override it
to `https://manage.alpha-swarm.ai` for production manifests.

## Audit trail

Every Entra-side mutation lands in:

- The **Entra audit log** — exported to the corporate SIEM via the
  existing log stream.
- The **AlphaSwarm `terraform_runs` ledger** for every Terraform apply on
  the `entra-internal` stack.
- The **AlphaSwarm audit log** (Phase 7 §10) on the admin side — `require_admin`
  attaches the user's `oid` to every `workload_runs` row, so the
  admin's mutation surface is fully attributed.

## Related

- [`how-to/entra-terraform-bootstrap`](./entra-terraform-bootstrap.md)
- [`how-to/entra-onboard-new-staff`](./entra-onboard-new-staff.md)
- [`how-to/entra-rotate-secrets`](./entra-rotate-secrets.md)
- [`concepts/identity/entra-internal-tenant`](../concepts/identity/entra-internal-tenant.md)
- ADR-013: [`architecture/decisions/013-entra-as-first-pool`](../architecture/decisions/013-entra-as-first-pool.md)
- Long-form plan:
  [`docs/plans/entra-internal-tenant-rollout.md`](pathname:///docs/plans/entra-internal-tenant-rollout.md)


<!-- https://alpha-swarm.ai/how-to/audit-lake-reconstruction -->
# how-to/audit-lake-reconstruction

# Audit lake reconstruction runbook

Phase 7 §10 (`RESTRUCTURING_PLAN.md`) — operating procedure for the
hash-chained audit lake: hourly flush, transparency-log anchoring,
replay harness, and regulatory-grade evidence bundles.

This runbook is the canonical companion to:

| Surface | Path |
| --- | --- |
| Hourly flush task | `alphaswarm/tasks/audit_lake_tasks.py::flush` |
| Anchor sinks | `alphaswarm/audit/sinks/{rekor,qldb,rfc3161}.py` |
| Replay harness | `alphaswarm/audit/replay.py` |
| Evidence bundle route | `alphaswarm_controller/src/alphaswarm_controller/api/routers/evidence_bundles.py` |
| Alembic migrations | `0085_audit_lake_anchors.py`, `0086_lineage_cell_id.py` |
| MinIO retention | `alphaswarm_platform/deployments/helm/alphaswarm-cell-data-plane/templates/minio.yaml` |
| OpenLineage relay extension | `alphaswarm/audit/openlineage_anchor.py` |

## Architecture in one paragraph

The Postgres ``audit_log`` hash chain (Alembic 0079) is the hot write
path. Every hour the ``alphaswarm.tasks.audit_lake_tasks.flush`` Celery beat
task seals the previous hour's segment, materialises it to
``alphaswarm_gold_audit.events_`` via Iceberg, copies the manifest
to ``s3://alphaswarm--warehouse/audit/...`` with Object Lock COMPLIANCE,
and submits the segment tip-hash to every configured transparency-log
sink. The verification handle (Rekor entry UUID, QLDB document id,
RFC 3161 ``TimeStampResp``) lands in ``audit_lake_anchors``. The
``BipartiteGraphObserver`` (Phase 7 §10.3 + Alembic 0086) stamps
``cell_id`` on every new lineage row so downstream queries can join
audit + lineage by cell. Auditors call
``POST /manage/evidence-bundles`` to download a deterministic
``.tar.zst`` archive of every artifact needed to reconstruct an
event window.
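
The hash-chain idea behind the segment tip-hash can be sketched in a few lines. This is a conceptual illustration only: the actual row encoding, hash inputs, and genesis value used by the Postgres `audit_log` trigger are not specified here, so `chain_hash`, `seal_segment`, and `GENESIS` are assumptions.

```python
import hashlib
import json

GENESIS = "0" * 64  # assumed starting value for an empty chain


def chain_hash(prev_hash: str, payload: dict) -> str:
    """Hash a row together with the previous row's hash."""
    body = json.dumps(payload, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256((prev_hash + body).encode("utf-8")).hexdigest()


def seal_segment(rows, prev_tip: str = GENESIS) -> str:
    """Fold every row into the chain and return the segment tip-hash."""
    tip = prev_tip
    for row in rows:
        tip = chain_hash(tip, row)
    return tip


def verify_segment(rows, prev_tip: str, expected_tip: str) -> bool:
    """Recompute the chain and compare it with the anchored tip-hash."""
    return seal_segment(rows, prev_tip) == expected_tip


rows = [{"id": 1, "action": "login"}, {"id": 2, "action": "apply_config"}]
tip = seal_segment(rows)
print(verify_segment(rows, GENESIS, tip))                           # True
print(verify_segment([rows[0], {"id": 2, "action": "x"}], GENESIS, tip))  # False
```

Because each hash depends on its predecessor, changing or removing any row in the segment changes the tip, which is why anchoring only the tip-hash is enough to detect tampering.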

## When to enable

Flip ``ALPHASWARM_AUDIT_LAKE_ENABLED=true`` once:

1. Phase 6 §9.2 MinIO chart is rolled out per cell with
   ``objectLockOnAudit: true``.
2. ``objectLockRetention`` is set to the regulatory minimum
   (``7y`` for FINRA / SEC; ``30d`` for dev).
3. Alembic 0085 + 0086 have run against every per-cell Postgres.
4. At least one transparency sink is configured via
   ``ALPHASWARM_AUDIT_TRANSPARENCY_SINKS`` (comma-separated:
   ``rekor`` / ``qldb`` / ``rfc3161``).

## Step 1 — Configure transparency sinks

| Sink | Use when | Required env |
| --- | --- | --- |
| **Rekor** (default) | Shared cells, public verifiability | `ALPHASWARM_AUDIT_REKOR_URL` (default `https://rekor.sigstore.dev`) + Vault `secret/alphaswarm/rekor/sigstore` with `signing_key_pem` + `signing_cert_pem` |
| **AWS QLDB** | `silo-reg` cells on AWS | `ALPHASWARM_AUDIT_QLDB_LEDGER_NAME`, `ALPHASWARM_AUDIT_QLDB_REGION`, AWS IAM role with `qldb:SendCommand` |
| **RFC 3161 TSA** | `silo-reg` cells on-prem | `ALPHASWARM_AUDIT_RFC3161_TSA_URL` + Vault `secret/alphaswarm/rfc3161/tsa:` with optional `client_cert_pem`/`client_key_pem` |

The three sinks are pluggable adapters of the
`TransparencyAnchorSink` ABC (`alphaswarm/audit/protocol.py`). Operators MAY
ship a custom subclass; the metaclass auto-registers it as long as
it sets `sink_kind` and lives in an imported module.
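
The auto-registration pattern described above can be sketched with a small metaclass. This is not the code in `alphaswarm/audit/protocol.py`: `SinkMeta`, `SINK_REGISTRY`, `EchoSink`, `configured_sinks`, and the `submit` method are hypothetical names used to show the mechanism.

```python
from abc import ABCMeta, abstractmethod

SINK_REGISTRY: dict = {}


class SinkMeta(ABCMeta):
    """Register every subclass that declares a non-empty `sink_kind`."""

    def __init__(cls, name, bases, namespace, **kwargs):
        super().__init__(name, bases, namespace, **kwargs)
        kind = namespace.get("sink_kind")
        if kind:
            SINK_REGISTRY[kind] = cls


class TransparencyAnchorSink(metaclass=SinkMeta):
    sink_kind = ""  # the base class is not registered

    @abstractmethod
    def submit(self, tip_hash: str) -> str:
        """Anchor a tip-hash and return a verification handle."""


class EchoSink(TransparencyAnchorSink):
    sink_kind = "echo"  # defining the class is enough to register it

    def submit(self, tip_hash: str) -> str:
        return f"echo:{tip_hash}"


def configured_sinks(raw: str) -> list:
    """Instantiate sinks named in a comma-separated setting."""
    return [SINK_REGISTRY[kind.strip()]() for kind in raw.split(",") if kind.strip()]


print([sink.submit("abc") for sink in configured_sinks("echo")])  # ['echo:abc']
```

Registration happens at class-definition time, which is why the subclass must live in a module that is actually imported.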

Belt-and-braces example for a `silo-reg`-on-prem cell:

```bash
ALPHASWARM_AUDIT_TRANSPARENCY_SINKS=rekor,rfc3161
```

The flush task tries every configured sink and records every
successful anchor as one row in `audit_lake_anchors`; an auditor
who needs cross-verification can pick whichever sink suits.

## Step 2 — Enable the hourly flush

```bash
# Per cell namespace.
kubectl set env -n cell-shared-std-us-east-1a deploy/alphaswarm-core \
  ALPHASWARM_AUDIT_LAKE_ENABLED=true \
  ALPHASWARM_AUDIT_TRANSPARENCY_SINKS=rekor
kubectl rollout status -n cell-shared-std-us-east-1a deploy/alphaswarm-core
```

The flush task is already registered in
`alphaswarm/tasks/celery_app.py::beat_schedule` as `audit-lake-flush`
(default interval 3600 s). The settings layer (`alphaswarm_lake_enabled=False`)
keeps it inert until you flip the switch.

Verify a single flush manually:

```bash
celery -A alphaswarm.tasks.celery_app call alphaswarm.tasks.audit_lake_tasks.flush
# Then inspect the new rows:
psql -c "SELECT cell_id, segment_start_ts, state, row_count, iceberg_snapshot_id FROM audit_lake_segments ORDER BY segment_start_ts DESC LIMIT 5"
```

A successful flush emits the OpenLineage `RunEvent`
`alphaswarm/audit/segment-anchor` to the existing
`lineage_openlineage_outbox`; the Marquez relay carries it through
the standard pipeline.

## Step 3 — Verify a segment manually

```bash
python -c "
from alphaswarm.audit import AnchorRecord
from alphaswarm.audit.sinks import RekorSink
from sqlalchemy import text
from alphaswarm.persistence.db import get_session

with get_session() as s:
    row = s.execute(text(
        'SELECT * FROM audit_lake_segments WHERE cell_id = :c '
        'ORDER BY segment_start_ts DESC LIMIT 1'
    ), {'c': 'cell-shared-std-us-east-1a'}).first()
    anchor = s.execute(text(
        'SELECT * FROM audit_lake_anchors WHERE segment_id = :id AND sink_kind = :k'
    ), {'id': row.id, 'k': 'rekor'}).first()

record = AnchorRecord(
    cell_id=row.cell_id,
    segment_start_ts=row.segment_start_ts,
    segment_end_ts=row.segment_end_ts,
    prev_tip_hash=row.prev_segment_tip_hash,
    tip_hash=row.segment_tip_hash,
    iceberg_snapshot_id=row.iceberg_snapshot_id or '',
    s3_manifest_uri=row.s3_manifest_uri or '',
)
print(RekorSink().verify(record, anchor.verification_handle))
"
```

The command should print `True`. If it prints `False`, STOP and investigate
before producing any evidence bundles: the chain is broken or the anchor was
tampered with.

## Step 4 — Replay a recorded run

`alphaswarm/audit/replay.py` re-executes a run against its hash-locked spec.

```python
from alphaswarm.audit.replay import replay_run, ReplayEnvironment

report = replay_run(
    run_id="agent-run-abc123",
    cell_id="cell-shared-std-us-east-1a",
    target_environment=ReplayEnvironment.AUDIT_SHADOW,
)
print(report.to_dict())
```

The harness:

1. Looks up the run row in whichever runtime table contains the id.
2. Loads the immutable spec snapshot via ``_spec_versions``.
3. Looks up the MCP tool descriptor hashes recorded at original
   run time.
4. Provisions a deterministic shadow Postgres schema named
   ``replay___`` (see
   ``_shadow_schema_name``).
5. Verifies the anchored audit segment covering the run's timestamp.
6. Returns a `ReplayReport` with `output_matches` /
   `anchor_verified` for sign-off.

The actual re-execution slot is currently a Phase 7.5 TODO — until
then `replay_output_hash` mirrors `original_output_hash` so the
report covers spec-pinning + anchor verification only. That's the
audit-essential surface.

## Step 5 — Produce an evidence bundle

```bash
curl -X POST https://manage.alpha-swarm.ai/manage/evidence-bundles \
  -H "Authorization: Bearer ${ALPHASWARM_ADMIN_TOKEN}" \
  -H "Content-Type: application/json" \
  --data '{
    "tenant_id": "tenant_acme",
    "cell_id": "cell-silo-reg-acme",
    "from_ts": "2026-05-01T00:00:00Z",
    "to_ts": "2026-05-31T23:59:59Z"
  }' \
  --output evidence-acme-may2026.tar.zst
```

The bundle contents (every part is a deterministic JSON file):

| File | Source |
| --- | --- |
| `manifest.json` | Top-level manifest with SHA-256 of every other part |
| `audit_rows.json` | Every `audit_log` row in the window |
| `audit_segments.json` | Every `audit_lake_segments` row + its anchors |
| `spec_snapshots.json` | Every immutable spec referenced by an audit row |
| `lineage.json` | Bipartite lineage rows for the same cell + window |

The manifest hash IS the canonical bundle id; auditors archive
``manifest.manifest_hash`` alongside the .tar.zst.
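
A deterministic manifest of this kind can be approximated by hashing a canonical JSON encoding of each part. The sketch below shows the general technique, not the router's actual implementation: `canonical_bytes`, `build_manifest`, and the exact canonicalisation (sorted keys, compact separators) are assumptions.

```python
import hashlib
import json


def canonical_bytes(obj) -> bytes:
    """Encode JSON with sorted keys and fixed separators so bytes are stable."""
    return json.dumps(obj, sort_keys=True, separators=(",", ":")).encode("utf-8")


def build_manifest(parts: dict) -> dict:
    """Hash every part, then hash the table of part digests."""
    digests = {
        name: hashlib.sha256(canonical_bytes(content)).hexdigest()
        for name, content in sorted(parts.items())
    }
    manifest_hash = hashlib.sha256(canonical_bytes(digests)).hexdigest()
    return {"parts": digests, "manifest_hash": manifest_hash}


parts = {
    "audit_rows.json": [{"id": 1, "action": "login"}],
    "lineage.json": [],
}
reordered = {
    "lineage.json": [],
    "audit_rows.json": [{"action": "login", "id": 1}],
}
# Key order and part order do not affect the result.
print(build_manifest(parts)["manifest_hash"] == build_manifest(reordered)["manifest_hash"])  # True
```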

## Reverting

Phase 7 is incrementally adopted. Reverting is easy:

- `ALPHASWARM_AUDIT_LAKE_ENABLED=false` — the task no-ops; existing data
  remains.
- `ALPHASWARM_AUDIT_TRANSPARENCY_SINKS=` (empty) — the segment still flushes
  to Iceberg but no anchors are submitted.
- The Iceberg `alphaswarm_gold_audit.events_` tables are read-only by
  policy; do NOT delete them. They are the cold-storage backup of
  the Postgres `audit_log`.
- MinIO Object Lock COMPLIANCE means the `audit/` prefix CANNOT be
  deleted by anyone — not even the root user — until retention
  expires. This is the regulatory commitment, not a bug.

## SLOs

| SLO | Target |
| --- | --- |
| Flush latency p99 | ≤ 5 minutes after segment close |
| Anchor latency p99 | ≤ 10 minutes after flush completes |
| Per-segment row throughput | ≥ 10 000 audit rows / minute |
| Evidence bundle build time | ≤ 30 s for a 30-day window |
| Anchor verify success rate | ≥ 99.9% (excluding Internet outage windows) |

## Where to file alerts

Prometheus + Alertmanager rules live in
`alphaswarm_platform/deployments/kubernetes/base-services/prometheus-operator/`
(future Phase 7.5 deliverable). Until then, monitor:

- `audit_lake_segments.state = 'flushed'` rows that haven't progressed
  to `'anchored'` within 30 minutes — indicates sink failure.
- `audit_lake_anchors.last_verified_ok = FALSE` rows — indicates an
  anchor was tampered with or the sink is unreachable.
- `audit_log` insert errors with text `hash chain` — the Postgres
  trigger is rejecting a row.

## Audit trail of THIS subsystem

Every Phase 7 mutation lands in `workload_runs`:

- Flipping `ALPHASWARM_AUDIT_LAKE_ENABLED` lands as an `apply_config` row.
- Each evidence-bundle export lands as an `evidence_bundle_export`
  row BEFORE the bytes leave the process (AGENTS rule 45).
- The hourly flush itself does NOT land a `workload_runs` row by
  design (it's a routine background task, not an operator action) —
  the per-segment write to `audit_lake_segments` IS the audit trail
  for the flush.


<!-- https://alpha-swarm.ai/how-to/cell-data-plane-migration -->
# how-to/cell-data-plane-migration

# Cell data plane migration runbook

Phase 6 §9 (`RESTRUCTURING_PLAN.md`) — operating procedure for
provisioning a per-cell data plane and migrating a tenant from the
shared cluster-wide Postgres/Redis/MinIO/MLflow/Iceberg into the
dedicated cell.

This runbook is the canonical companion to:

| Surface | Path |
| --- | --- |
| Helm chart | `alphaswarm_platform/deployments/helm/alphaswarm-cell-data-plane/` |
| Topology models | `alphaswarm_core/src/alphaswarm_core/topology/models.py` |
| Cell registry seed | `alphaswarm_platform/configs/deployment/topology.yaml` |
| Dual-write switch | `ALPHASWARM_CELL_DUAL_WRITE` (`alphaswarm/config/settings.py`) |
| Backfill script | `scripts/cells/dual_write_backfill.py` |
| Iceberg cell-awareness | `alphaswarm/data/iceberg_catalog.py:_cell_data_plane` |
| Vault Transit cell-key | `alphaswarm/credentials/vault_transit.py:_resolve_transit_key_name` |
| Engine cell-keying | `alphaswarm/persistence/db.py:_sync_engine_for_cell` |

## When to use this runbook

You SHOULD migrate a tenant into a dedicated per-cell data plane when:

- A regulatory commitment requires cryptographic data-plane separation
  (FINRA, ISO 27001, SOC 2 with customer-side isolation).
- The tenant signs onto a `silo-reg` or `silo-custom` contract.
- A multi-AZ blast-radius failure isolated to one cell should not
  affect other tenants.

You SHOULD NOT use this runbook for:

- A `shared-std` cell — those share the cluster-wide data plane by
  design.
- An ordinary regional cutover (use `cell-router-cutover.md` instead).

## Pre-flight

1. **Cell exists in `topology.yaml`.** Verify the destination cell id
   in the `cells:` section with a populated `data_plane:` block:
   ```yaml
   - id: cell-silo-reg-acme
     tier: silo-reg
     tenancy_strategy: database_per_enterprise
     # ...
     data_plane:
       postgres_dsn_secret: secret/alphaswarm/cells/cell-silo-reg-acme/postgres
       iceberg_rest_uri: http://alphaswarm-cell-iceberg-rest.cell-silo-reg-acme.svc.cluster.local:8181
       iceberg_warehouse_uri: s3://alphaswarm-cell-silo-reg-acme-warehouse/
       minio_endpoint: http://alphaswarm-cell-minio.cell-silo-reg-acme.svc.cluster.local:9000
       vault_transit_key: alphaswarm-cell-silo-reg-acme
   ```
   The `vault_transit_key` is **mandatory** for `silo-reg` cells.

2. **Vault paths exist.** Seed every credential the chart consumes:
   - `secret/alphaswarm/cells//postgres` — `username` + `password`
   - `secret/alphaswarm/cells//minio` — `access_key` + `secret_key`
   - `secret/alphaswarm/cells//mlflow` — `dsn` (Postgres DSN under
     the per-cell Postgres)
   - `secret/alphaswarm/cells//iceberg` — `jdbc_uri` + `username` +
     `password`
   The Phase 4 §7.6 `vault-secrets-operator` materialises these into
   Kubernetes `Secret` objects via the chart's `VaultStaticSecret` CRs.

3. **Operator (Phase 6.5) prerequisites installed cluster-wide.**
   - CloudNativePG operator (`postgresql.cnpg.io/v1`)
   - vault-secrets-operator (`secrets.hashicorp.com/v1beta1`)
   - Linkerd 2.16 (Phase 4 §7.1)

## Step 1 — Provision the per-cell data plane

Install the Helm chart for the target cell. The chart stamps a
CNPG `Cluster`, Redis StatefulSet, MinIO StatefulSet + bucket
bootstrap Job (with Object Lock COMPLIANCE on the `audit/` prefix),
MLflow Deployment, and Iceberg REST Deployment.

```bash
helm install data-plane alphaswarm_platform/deployments/helm/alphaswarm-cell-data-plane/ \
  --namespace cell-silo-reg-acme \
  --set cell_id=cell-silo-reg-acme \
  --set tier=silo-reg \
  --set region=us-east-1 \
  --set minio.replicas=4 \
  --set postgres.instances=3
```

Wait for every Pod to reach `Ready=true`. Then:

```bash
kubectl -n cell-silo-reg-acme get pods
kubectl -n cell-silo-reg-acme get vaultstaticsecret
kubectl -n cell-silo-reg-acme exec alphaswarm-cell-postgres-1 -- psql -c "SELECT 1"
```

The MinIO bootstrap Job creates 4 buckets with Object Lock COMPLIANCE
on `alphaswarm-cell-silo-reg-acme-audit` for 30 days; verify:

```bash
kubectl -n cell-silo-reg-acme exec deploy/alphaswarm-cell-minio-bootstrap -- \
  mc retention info "cell/alphaswarm-cell-silo-reg-acme-audit"
# expect: Mode=COMPLIANCE Validity=30d
```

## Step 2 — Run schema migrations against the new Postgres

Inside the cell namespace, run `alembic upgrade head` against the
per-cell DSN. The CNPG cluster has no application schema until this
step completes.

```bash
kubectl -n cell-silo-reg-acme run alembic --rm -it --image=ghcr.io/julianwiley/alphaswarm-api:latest --restart=Never -- \
  alembic -c /app/alembic.ini upgrade head
```

The Alembic chain immutability check (`scripts/ci/check_migration_immutability.py`)
guarantees the same numeric head as the shared plane.

## Step 3 — Enable dual writes

Flip `ALPHASWARM_CELL_DUAL_WRITE=true` in the **API** environment. This is the
critical safety window — once enabled, every new write goes to BOTH
planes (the shared cluster-wide plane AND the per-cell plane bound
via `RequestContext.cell_id`). It does NOT affect callers without an
active request context.

```bash
kubectl set env -n alphaswarm deployment/alphaswarm-core ALPHASWARM_CELL_DUAL_WRITE=true
kubectl rollout status -n alphaswarm deployment/alphaswarm-core
```

Verify the new cell is reachable by issuing a no-op write from a
test tenant pinned to the cell.
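
The routing rule described in this step can be sketched as follows. This is only an illustration of the behaviour stated above, not the API's actual code: the `RequestContext` dataclass shape, the `write_targets` helper, and the `"shared"` plane label are all assumptions introduced for the example.

```python
import os
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class RequestContext:
    """Hypothetical stand-in for the real RequestContext (assumed shape)."""
    cell_id: Optional[str]


def dual_write_enabled(env=None) -> bool:
    """Read the flag from the given mapping, defaulting to the process env."""
    env = os.environ if env is None else env
    return env.get("ALPHASWARM_CELL_DUAL_WRITE", "false").lower() == "true"


def write_targets(ctx: Optional[RequestContext], env=None) -> list:
    """Return the planes a write should land in.

    The shared plane always receives the write. The per-cell plane is added
    only when the flag is on AND the caller has a context bound to a cell.
    """
    targets = ["shared"]  # "shared" is an illustrative label, not a real identifier
    if ctx is not None and ctx.cell_id and dual_write_enabled(env):
        targets.append(ctx.cell_id)
    return targets
```

With the flag on, a request bound to `cell-silo-reg-acme` yields both targets, while a caller without a request context still writes to the shared plane only.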

## Step 4 — Backfill historical rows

```bash
# Dry-run first to print row counts:
python scripts/cells/dual_write_backfill.py \
  --tenant tenant_acme \
  --target-cell cell-silo-reg-acme

# When the plan looks right, apply:
python scripts/cells/dual_write_backfill.py \
  --tenant tenant_acme \
  --target-cell cell-silo-reg-acme \
  --apply
```

The script copies every tenant-owned table (workspaces, strategy
specs, agent runs, bot runs, RL experiments, paper trading, dataset
specs, …) but never deletes from the source. It refuses to write if
the destination plane already has rows for the same tenant — that is
the idempotency guard against duplicate inserts.

## Step 5 — Reconcile

```bash
python scripts/cells/dual_write_backfill.py \
  --tenant tenant_acme \
  --target-cell cell-silo-reg-acme \
  --reconcile-only
```

Every table MUST show `OK` (matching row count AND matching SHA-256
roll-up). If even one shows `MISMATCH`, STOP — investigate before
proceeding. The script exits with code 2 on mismatch.
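
For intuition, a per-table check of this kind (row count plus SHA-256 roll-up) could look like the sketch below. It is not the code in `dual_write_backfill.py`: the canonical JSON serialization, the sort-before-hash step, and the function names are assumptions chosen to make the comparison independent of row order.

```python
import hashlib
import json
from typing import Iterable, Mapping, Tuple


def table_fingerprint(rows: Iterable[Mapping[str, object]]) -> Tuple[int, str]:
    """Return (row_count, sha256 roll-up) for one table, independent of row order."""
    # Serialize each row canonically, then sort so both planes hash in the same order.
    encoded = sorted(json.dumps(row, sort_keys=True, default=str) for row in rows)
    rollup = hashlib.sha256()
    for line in encoded:
        rollup.update(line.encode("utf-8"))
        rollup.update(b"\n")
    return len(encoded), rollup.hexdigest()


def reconcile(source: Mapping[str, list], target: Mapping[str, list]) -> dict:
    """Map each table name to "OK" or "MISMATCH"."""
    report = {}
    for name in sorted(set(source) | set(target)):
        same = table_fingerprint(source.get(name, [])) == table_fingerprint(target.get(name, []))
        report[name] = "OK" if same else "MISMATCH"
    return report


def exit_code(report: Mapping[str, str]) -> int:
    """Mirror the documented convention: exit code 2 on any mismatch."""
    return 2 if "MISMATCH" in report.values() else 0
```

Comparing the count alongside the digest keeps the report readable: a count difference points at missing rows, while equal counts with different digests point at diverging row contents.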

## Step 6 — Cutover

Mutate `tenant_cells.cell_id` for the tenant. This step is intentionally
NOT automated by the backfill script — operators run it manually so
the change generates an explicit `workload_runs` audit row.

```sql
-- in the SHARED plane
INSERT INTO workload_runs (organization_id, action, ...)
  VALUES ('tenant_acme', 'cell_cutover', ...);

UPDATE tenant_cells
   SET cell_id = 'cell-silo-reg-acme', cutover_at = NOW()
 WHERE tenant_id = 'tenant_acme';
```

The cell-router (Phase 3 §6.4) picks up the new mapping on the next
JWT exchange. Existing in-flight sessions stay bound to the source
plane until the next request — no in-flight rollback needed.

## Step 7 — Disable dual writes

```bash
kubectl set env -n alphaswarm deployment/alphaswarm-core ALPHASWARM_CELL_DUAL_WRITE=false
kubectl rollout status -n alphaswarm deployment/alphaswarm-core
```

The tenant is now isolated in the cell data plane. The historical rows
remain in the shared plane (Phase 6 keeps them as the immutable
fallback path); a separate retention policy (90 days) prunes them
after sufficient bake time. Do NOT delete source rows from this
runbook.

## Reverting

If anything goes wrong between Step 3 and Step 6 you can revert
cleanly because writes are landing in BOTH planes. Set
`ALPHASWARM_CELL_DUAL_WRITE=false`, restore the previous `tenant_cells.cell_id`,
and the tenant resumes on the shared plane.

After Step 6 the cutover is sticky — reverting requires running the
inverse backfill (`--tenant tenant_acme --target-cell cell-shared-std-local`)
and is a manual operation. Coordinate with the on-call.

## Audit trail

Every step leaves an audit record:

- Step 1 (Helm install): captured by Argo CD's `Application` revision.
- Step 2 (`alembic upgrade head`): writes `alembic_version` in the
  per-cell Postgres.
- Step 3 (`ALPHASWARM_CELL_DUAL_WRITE=true`): captured by
  `alphaswarm_controller.audit.write_workload_run` when the env flip lands.
- Steps 4-5 (backfill and reconcile): the script logs to stdout AND writes a
  `cell_backfill_runs` row (Alembic 0085, future).
- Step 6 (cutover): the explicit `workload_runs` INSERT above.
- Step 7 (`ALPHASWARM_CELL_DUAL_WRITE=false`): captured by
  `alphaswarm_controller.audit.write_workload_run`.

The auditor SHOULD verify that a record exists for each of the seven
steps before signing off on the migration.


<!-- https://alpha-swarm.ai/how-to/cell-router-cutover -->
# Cell-router cutover runbook


> Phase 3 §6 of
> [RESTRUCTURING_PLAN.md](https://github.com/julianwiley/alphaswarm/blob/main/RESTRUCTURING_PLAN.md).
> Covers the cutover from the single-container Python FastAPI cell
> proxy (in `alphaswarm_client/`) to the Envoy + `alphaswarm-tenant-router`
> two-component cell router. This runbook is the operator-facing
> companion to the deployment manifests at
> `alphaswarm_platform/deployments/kubernetes/edge/`.

## Architecture (Phase 3 §6.4)

```
[ user / agent ]
       │ TLS
       ▼
[ Cloudflare Tunnel (alpha-swarm.ai) ]
       │
       ▼
[ alphaswarm-edge — Envoy (HTTP-only) ]
       │  ext_authz callout
       │ ──────────────────────▶  [ alphaswarm-tenant-router ]
       │                                │ /resolve
       │                                ▼
       │                          [ cells registry (control plane) ]
       │ ◀──────────────────── x-alphaswarm-cell header
       │
       ▼  Route on x-alphaswarm-cell:
[ alphaswarm-cell--api  (FastAPI) ]
[ alphaswarm-cell--workers (Celery, gVisor for agents) ]
[ alphaswarm-cell--postgres ]   [ alphaswarm-cell--minio ]
```

## Prerequisites

1. The four canonical AlphaSwarm images (`alphaswarm-api`, `alphaswarm-worker`,
   `alphaswarm-client`, `alphaswarm-controller`) are running on the
   pre-Phase-3 single-namespace topology. The Phase 3 work runs
   IN PARALLEL until the canary completes — nothing is taken away
   from the running fleet.
2. The Alembic head is at `0083_audit_cell_id_column.py`. Verify:
   ```bash
   alembic current
   # expected: 0083_audit_cell_id_column (head)
   ```
3. The `cells` registry has at least one `state=active` cell
   row. Verify via the control plane:
   ```bash
   curl -sS https://manage.alpha-swarm.ai/manage/cells | jq '.data[].id'
   ```
4. The `alphaswarm-edge` namespace exists and carries the
   `alphaswarm.io/host-network-allowed: "true"` exception label per
   Phase 2 §5.4.

## Step 0 — Build the Phase 3 images

Both images build from the `alphaswarm_platform` repo root (the
post-repo-split context):

```bash
cd alphaswarm_platform

# alphaswarm-edge (Envoy)
docker buildx build \
  --platform linux/amd64,linux/arm64 \
  --file build/docker/alphaswarm-edge/Dockerfile \
  --tag ghcr.io/julianwiley/alphaswarm-edge:v0.2.0 \
  --push .

# alphaswarm-tenant-router (Python + uvloop)
docker buildx build \
  --platform linux/amd64,linux/arm64 \
  --file build/docker/alphaswarm-tenant-router/Dockerfile \
  --tag ghcr.io/julianwiley/alphaswarm-tenant-router:v0.2.0 \
  --push .
```

Tagged releases build both images automatically:
`alphaswarm_platform/.github/workflows/build-publish.yml` pushes them
to ECR with Cosign keyless signatures, SBOM + SLSA provenance, and a
Trivy scan via the `build-sign-push` composite.

## Step 1 — Deploy in parallel (week 6)

> **Auth posture first.** The tenant-router ships fail-closed
> (`AUTH_MODE=required` with an empty issuer) and will crash-loop
> until the IdP issuer/audience are stamped into
> `alphaswarm-tenant-router-config`. Complete steps 1-2 of the
> [tenant-router auth rollout runbook](./tenant-router-auth-rollout.md)
> before (or together with) this apply.

```bash
# Apply both Deployments + Services + PodDisruptionBudgets
# (+ the tenant-router's ConfigMap, NetworkPolicy, HPA, and the
# alphaswarm-cell-bound-validator Service):
kubectl apply -k alphaswarm_platform/deployments/kubernetes/edge/alphaswarm-edge/
kubectl apply -k alphaswarm_platform/deployments/kubernetes/edge/alphaswarm-tenant-router/

# Verify the tenant-router hydrated the cells cache:
kubectl -n alphaswarm-edge port-forward svc/alphaswarm-tenant-router 18080:8080
curl -sS http://127.0.0.1:18080/readyz
# expected: {"status":"ok","cells":,"auth_mode":"required","cba_mode":"enforce"}
```

DNS still points to the Python proxy. No user traffic flows to
`alphaswarm-edge` yet.

## Step 2 — DNS canary 10% (week 7)

Cloudflare Workers + Load Balancer split the apex hostname (`alpha-swarm.ai`)
across the two backends:

```hcl
# cloudflare/alphaswarm_load_balancer.tf (excerpt)
resource "cloudflare_load_balancer_pool" "alphaswarm_proxy_legacy" {
  origins = [{ name = "alphaswarm-client", address = "...", weight = 0.9 }]
}
resource "cloudflare_load_balancer_pool" "alphaswarm_proxy_envoy" {
  origins = [{ name = "alphaswarm-edge", address = "...", weight = 0.1 }]
}
```

Apply via `alphaswarm deploy terraform plan apply` (NEVER raw `terraform
apply` per AGENTS rule 42).

Verify both pools healthy:

```bash
kubectl -n alphaswarm-edge get pods -l app=alphaswarm-edge
kubectl -n alphaswarm-edge get pods -l app=alphaswarm-tenant-router

# Tail tenant-router logs for any 503s / cache misses:
kubectl -n alphaswarm-edge logs -l app=alphaswarm-tenant-router --tail=200 -f
```

Stop conditions (rollback to 100% legacy):
- `alphaswarm-tenant-router` `/readyz` returns 503 for > 1 minute.
- Envoy `5xx` rate on `alphaswarm-edge` ingress > 0.5% over a 5-minute
  window.
- Any audit event with `cell_id IS NULL` after the canary starts
  (indicates the `x-alphaswarm-cell` header isn't propagating into
  `RequestContext`).
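
The three stop conditions can be expressed as a small predicate, for example when wiring them into an alert or a canary gate. The sketch below is illustrative only: the `CanaryWindow` fields and the reason strings are hypothetical names, and how the inputs are collected (Prometheus, Envoy stats, an audit query) is left open.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CanaryWindow:
    """Hypothetical 5-minute observation window; field names are illustrative."""
    readyz_503_seconds: float    # continuous seconds /readyz has returned 503
    envoy_requests: int          # total requests seen on alphaswarm-edge ingress
    envoy_5xx: int               # of those, requests answered with a 5xx
    null_cell_audit_events: int  # audit events with cell_id IS NULL since canary start


def stop_reasons(w: CanaryWindow) -> list:
    """Return the stop conditions that fired; an empty list means keep going."""
    reasons = []
    if w.readyz_503_seconds > 60:  # 503 for more than 1 minute
        reasons.append("readyz_503")
    if w.envoy_requests > 0 and w.envoy_5xx / w.envoy_requests > 0.005:  # > 0.5%
        reasons.append("envoy_5xx_rate")
    if w.null_cell_audit_events > 0:  # any such event is a stop
        reasons.append("null_cell_id")
    return reasons
```

Any non-empty result maps to the rollback described above (Cloudflare LB weight back to 100% legacy).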

## Step 3 — 50% traffic (week 8)

Cloudflare LB weight: 0.5 / 0.5. Repeat the verification + stop
conditions from step 2. Watch the `alphaswarm.cell.id` distribution in
Tempo:

```
{alphaswarm.cell.id="cell-shared-std-local"} | count_over_time(span_count[5m])
```

Both routes should converge on the same cell-id distribution.
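
One way to make "the same distribution" measurable is to compare per-cell span counts from the two routes with a total variation distance. This is a sketch under assumptions: the counts are taken to be already exported from Tempo as plain `cell_id -> count` mappings, and any acceptance threshold is a team decision that this runbook does not define.

```python
from typing import Mapping


def normalize(counts: Mapping[str, int]) -> dict:
    """Turn raw span counts into proportions that sum to 1."""
    total = sum(counts.values())
    return {cell: n / total for cell, n in counts.items()} if total else {}


def total_variation(a: Mapping[str, int], b: Mapping[str, int]) -> float:
    """Half the L1 distance between two distributions (0 = identical, 1 = disjoint)."""
    pa, pb = normalize(a), normalize(b)
    return 0.5 * sum(abs(pa.get(c, 0.0) - pb.get(c, 0.0)) for c in set(pa) | set(pb))


# Illustrative counts only; real values come from the tracing backend.
legacy = {"cell-shared-std-local": 900, "cell-silo-reg-acme": 100}
envoy = {"cell-shared-std-local": 450, "cell-silo-reg-acme": 50}
distance = total_variation(legacy, envoy)
```

Because the counts are normalized first, the comparison stays meaningful even if the two routes carry different absolute traffic volumes.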

## Step 4 — 100% traffic (week 9)

Cloudflare LB weight: 0.0 / 1.0. The Python proxy continues to
run but receives no live traffic. Keep it running for 7 days as
the rollback safety net.

## Step 5 — Remove the Python FastAPI proxy (week 10)

This step is intentionally NOT in the Phase 3 PR; it lands as a
follow-up after the 7-day soak. The follow-up PR removes the FastAPI
proxy module from
`alphaswarm_platform/build/docker/alphaswarm_client/Dockerfile` (the
`production` stage's uvicorn entrypoint) and strips the `/api/*`,
`/ws/*`, `/manage/*`, and `/static` route handlers from
`alphaswarm/api/main.py`.

Tag the last buildable proxy image (`alphaswarm-client:proxy-last-stable`)
before the removal lands so a regression has a known-good rollback
target.

## Rollback at any step

- Cloudflare LB weight back to 1.0 / 0.0 — instant traffic drain
  back to the legacy proxy.
- `kubectl -n alphaswarm-edge scale deployment alphaswarm-edge --replicas=0`
  prevents Envoy from accepting any traffic even if DNS still
  points at it.

## Phase 3 §6.6 follow-up — the removal PR

The Python proxy lives at
`alphaswarm/api/proxy.py` + the relevant routes in
`alphaswarm/api/main.py`. The Phase 3 §6.6 removal PR:

1. Cuts the route registrations.
2. Updates the `alphaswarm-client` Dockerfile to drop the proxy CMD.
3. Removes the proxy's tests under `tests/api/`.
4. Tags the prior commit `alphaswarm-client-proxy-final` so a rollback
   restores the buildable artifact.

## Related documents

- [RESTRUCTURING_PLAN.md §6](https://github.com/julianwiley/alphaswarm/blob/main/RESTRUCTURING_PLAN.md)
- [alphaswarm_platform/deployments/kubernetes/cells/README.md](https://github.com/julianwiley/alphaswarm/blob/main/alphaswarm_platform/deployments/kubernetes/cells/README.md)
- [alphaswarm_platform/deployments/argocd/applicationsets/cells-appset.yaml](https://github.com/julianwiley/alphaswarm/blob/main/alphaswarm_platform/deployments/argocd/applicationsets/cells-appset.yaml)
- [alphaswarm_tenant_router/AGENTS.md](https://github.com/julianwiley/alphaswarm/blob/main/alphaswarm_tenant_router/AGENTS.md)


<!-- https://alpha-swarm.ai/how-to/chainguard-base-migration -->
# Chainguard base migration runbook


> Phase 2 §5.1 + §5.2 + §5.3 + §5.4 of
> [RESTRUCTURING_PLAN.md](https://github.com/julianwiley/alphaswarm/blob/main/RESTRUCTURING_PLAN.md).
> Owns the cutover from Debian-slim base images to Chainguard Wolfi
> bases, the cosign + SBOM signing pipeline, and the Kyverno
> admission policies that gate signed-only images in production.

## Scope

Four AlphaSwarm-owned images move to Chainguard Wolfi in Phase 2 §5.1:

| Image | Dockerfile | Base before | Base after |
| --- | --- | --- | --- |
| `alphaswarm-api` / `alphaswarm-worker` (shared `api` target) | `alphaswarm_platform/Dockerfile` | `python:3.11-slim` | `cgr.dev/chainguard/python:3.11-dev` |
| `alphaswarm-controller` | `alphaswarm_platform/build/docker/alphaswarm_controller/Dockerfile` | `python:3.11-slim` | `cgr.dev/chainguard/python:3.11-dev` (builder) + `cgr.dev/chainguard/python:3.11` (runtime) |
| `alphaswarm-client` | `alphaswarm_platform/build/docker/alphaswarm_client/Dockerfile` | `node:20-alpine` + `python:3.11-slim` | `cgr.dev/chainguard/node:20-dev` + `cgr.dev/chainguard/python:3.11-dev` (builders) + `cgr.dev/chainguard/python:3.11` (runtime) |
| `alphaswarm-ui` | `alphaswarm_platform/build/docker/alphaswarm_ui/Dockerfile` | `node:20-alpine` | `cgr.dev/chainguard/node:20-dev` (builder) + `cgr.dev/chainguard/node:20` (runtime) |

Two images carry **documented exemptions** and stay on their current
bases:

| Image | Dockerfile | Reason |
| --- | --- | --- |
| `alphaswarm-bots` standard | `alphaswarm_bots/Dockerfile` | Already on `gcr.io/distroless/python3-debian12:nonroot` — smaller and more locked-down than Chainguard Python, no shell at all. Builder stage stays on `python:3.12-slim-bookworm` for build-essential availability. |
| `alphaswarm-bots` HFT | `alphaswarm_bots/Dockerfile.hft` | Kernel-bypass libs (DPDK, Onload, Mellanox OFED) require kernel headers + `libnuma1` + `linuxptp` + `ethtool` + `kmod` which Chainguard's nonroot Wolfi runtime image does not ship. Per ADR 007. |

Two **future-phase scaffolds** are created in Phase 2 §5.6:

| Image | Dockerfile | Activation phase |
| --- | --- | --- |
| `alphaswarm-edge` (Envoy cell router) | `alphaswarm_platform/build/docker/alphaswarm-edge/Dockerfile` | Phase 3 §6.4 (cell topology) |
| `alphaswarm-agent-sandbox` (gVisor target) | `alphaswarm_platform/build/docker/alphaswarm-agent-sandbox/Dockerfile` | Phase 5 §8 (per-tenant MCP + agent sandbox) |

## Why Chainguard Wolfi

- **glibc**, not musl — keeps native wheel compatibility for
  `numpy`, `pyarrow`, `torch`, `psycopg2`, etc. The
  RESTRUCTURING_PLAN footnote at §5.1 explicitly notes that
  Alpine/musl-style minimalism breaks the native-wheel toolchain.
- **Continuously rebuilt** — Chainguard ships a fresh image set
  every ~24 hours, so CVE patches land without us doing anything
  beyond a rebuild. Pair with Renovate (Phase 1 §4.7) to
  re-trigger the build matrix on a base-image bump.
- **No CVEs in the base** — Chainguard runs distroless-style
  scans and publishes a daily-zero-CVE SLO for `latest` tags. We
  still run `grype --fail-on high` per Phase 2 §5.2 because
  application-level CVEs are our responsibility.
- **Single nonroot UID convention (65532)** — matches the Phase 2
  §5.4 PSS restricted profile. The runtime stages never run as
  root; the `-dev` builder runs as root only for `apk add`.

## Build verification

Local one-off build (no push, no signing — for inner-loop dev):

```bash
docker buildx build \
  --platform linux/amd64,linux/arm64 \
  --target api \
  --file alphaswarm_platform/Dockerfile \
  --tag alphaswarm-api:dev \
  .
```

Multi-arch build via `build-multi-arch.yml` (CI canonical path):

```bash
gh workflow run build-multi-arch.yml \
  --ref feat/phase-2-supply-chain
```

The workflow signs every pushed image with cosign keyless OIDC
and uploads a CycloneDX SBOM. The `inspect` job at the bottom of
the workflow runs `cosign verify` + `cosign verify-attestation`
to confirm the signature + SBOM attestation land in Rekor.

### Verify cosign signature locally

```bash
cosign verify \
  --certificate-identity-regexp 'https://github.com/julianwiley/alphaswarm/.github/workflows/build-multi-arch\.yml@refs/.*' \
  --certificate-oidc-issuer https://token.actions.githubusercontent.com \
  docker.io/julianwiley/alphaswarm-api:latest
```

Expected exit code: 0. The output prints the signature payload
including the Rekor transparency log entry index.

### Verify CycloneDX SBOM attestation locally

```bash
cosign verify-attestation \
  --certificate-identity-regexp 'https://github.com/julianwiley/alphaswarm/.github/workflows/build-multi-arch\.yml@refs/.*' \
  --certificate-oidc-issuer https://token.actions.githubusercontent.com \
  --type cyclonedx \
  docker.io/julianwiley/alphaswarm-api:latest > sbom-attestation.json
```

The command prints a DSSE envelope whose `payload` field is the
base64-encoded in-toto statement; the CycloneDX document is the
`predicate` field of that decoded statement.
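
To pull the SBOM out of `sbom-attestation.json`, a small decoder is enough. The sketch assumes cosign's usual output shape: one JSON envelope per line, each with a base64 `payload` that decodes to an in-toto statement carrying the CycloneDX document under `predicate`. The function name is illustrative.

```python
import base64
import json


def extract_predicates(attestation_text: str) -> list:
    """Decode each envelope line and return the in-toto `predicate` objects."""
    predicates = []
    for line in attestation_text.splitlines():
        line = line.strip()
        if not line:
            continue  # tolerate blank lines between envelopes
        envelope = json.loads(line)
        # The payload is a base64-encoded in-toto statement (JSON).
        statement = json.loads(base64.b64decode(envelope["payload"]))
        predicates.append(statement["predicate"])
    return predicates
```

A typical use is `extract_predicates(open("sbom-attestation.json").read())`, then writing the first element to disk as the CycloneDX SBOM. This step only decodes; the signature check has already been done by `cosign verify-attestation`.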

### Re-run grype against the SBOM

```bash
syft docker.io/julianwiley/alphaswarm-api:latest -o cyclonedx-json=sbom.json
grype sbom:sbom.json --fail-on high
```

Exit code 0 = no HIGH or CRITICAL CVEs; non-zero = the gate fires
in CI.

## Kyverno audit-to-enforce ratchet

The six Phase 2 §5.3 cluster policies ship in `Audit` mode (see
[alphaswarm_platform/deployments/kubernetes/security/README.md](https://github.com/julianwiley/alphaswarm/blob/main/alphaswarm_platform/deployments/kubernetes/security/README.md)).
The ratchet schedule:

| Policy | Audit-mode soak | Enforce gate |
| --- | --- | --- |
| `00-verify-signatures.yaml` | 7 days zero violations across all AlphaSwarm-owned namespaces | Phase 2.5 |
| `01-require-pss-restricted.yaml` | 7 days zero violations | Phase 2.5 |
| `02-require-runtime-class.yaml` | DO NOT enforce until Phase 5 §8.3 lands the gVisor RuntimeClass | Phase 5.1 |
| `03-no-host-network.yaml` | 7 days zero violations after `alphaswarm-edge` namespace carries `alphaswarm.io/host-network-allowed: "true"` | Phase 2.5 |
| `04-no-privilege-escalation.yaml` | 7 days zero violations | Phase 2.5 |
| `05-required-labels.yaml` | 7 days zero violations on namespaces that carry `alphaswarm.io/component` | Phase 2.5 |

### Operator workflow to flip Audit → Enforce

```bash
# 1. Verify zero violations for the target policy:
kubectl get clusterpolicyreport -o jsonpath='{range .items[*].results[?(@.policy=="alphaswarm-verify-image-signatures")]}{.result}{"\n"}{end}' \
  | sort | uniq -c

# Expected output: only "pass" lines. Any "fail" lines block the ratchet.

# 2. Patch the policy in place:
kubectl patch clusterpolicy alphaswarm-verify-image-signatures \
  --type=merge \
  -p '{"spec":{"validationFailureAction":"Enforce"}}'

# 3. Update the YAML in tree so a later re-apply does not revert the policy to Audit:
sed -i 's/validationFailureAction: Audit/validationFailureAction: Enforce/' \
  alphaswarm_platform/deployments/kubernetes/security/kyverno/cluster-policies/00-verify-signatures.yaml

# 4. Commit + open PR with `[Phase 2.5 ratchet]` in the title.
```
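
The same zero-violation check can be scripted against `kubectl get clusterpolicyreport -o json` output, which is convenient in a CI gate. The sketch below assumes the standard policy-report layout (`items[].results[]` entries carrying `policy` and `result` fields). Treating anything other than `pass` as blocking is a conservative assumption of this example, not a rule stated in this runbook.

```python
from collections import Counter


def tally_results(report_list: dict, policy: str) -> Counter:
    """Count result values (pass, fail, ...) recorded for one policy."""
    counts = Counter()
    for item in report_list.get("items", []):
        for result in item.get("results") or []:  # `results` may be absent or null
            if result.get("policy") == policy:
                counts[result.get("result", "unknown")] += 1
    return counts


def ready_to_enforce(counts: Counter) -> bool:
    """True only if at least one result exists and every result is a pass."""
    return counts.get("pass", 0) > 0 and set(counts) <= {"pass"}
```

Requiring at least one `pass` guards against a policy name typo, where an empty tally would otherwise look like a clean soak.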

## Rollback

The Chainguard migration is reversible per Dockerfile. Each
Dockerfile carries a `Phase 2 §5.1` comment at the top documenting
the previous base image. To roll back a single image:

1. Revert that image's Dockerfile (`alphaswarm_platform/Dockerfile` or
   `alphaswarm_platform/build/docker//Dockerfile`) to its
   pre-Phase-2 state.
2. Trigger `build-multi-arch.yml` on the revert branch.
3. The cosign keyless signature still applies (it signs by digest,
   not base image). The grype scan may fail differently because
   the Debian-slim base ships different CVEs.

## Cosign signing on PRs

The Phase 2 §5.2 cosign + SBOM + grype steps gate on
`if: github.event_name != 'pull_request'` because cosign keyless
requires OIDC, which is unavailable on PRs from forked
repositories. PRs from internal branches still build (and use the
pull-through cache), but they neither push nor sign. The `inspect` job
that runs `cosign verify` on `:latest` tags is only useful for
merged commits.

If you need to verify a signature on a PR build, push to a feature
branch in the canonical repo (not a fork) and check the registry
manually:

```bash
docker pull docker.io/julianwiley/alphaswarm-api:feat-phase-2-supply-chain-
cosign verify \
  --certificate-identity-regexp 'https://github.com/julianwiley/.*' \
  --certificate-oidc-issuer https://token.actions.githubusercontent.com \
  docker.io/julianwiley/alphaswarm-api:feat-phase-2-supply-chain-
```

## Related documents

- [RESTRUCTURING_PLAN.md §5](https://github.com/julianwiley/alphaswarm/blob/main/RESTRUCTURING_PLAN.md)
- [alphaswarm_platform/deployments/kubernetes/security/README.md](https://github.com/julianwiley/alphaswarm/blob/main/alphaswarm_platform/deployments/kubernetes/security/README.md)
- [`.cursor/plans/alphaswarm-index-debt-phase-2-supply-chain.md`](https://github.com/julianwiley/alphaswarm/blob/main/.cursor/plans/alphaswarm-index-debt-phase-2-supply-chain.md)
- ADR 007 — QuantBot Latency Classes (explains the HFT Debian-slim exemption)


<!-- https://alpha-swarm.ai/how-to/entra-onboard-new-staff -->
# Onboard a new staff member into Entra


Procedure for adding a new AlphaSwarm employee to the company's Entra
directory and granting them the right level of access to the managed
AlphaSwarm platform.

This is an HR + Security workflow that does NOT touch Terraform.
Group membership is intentionally outside Terraform's purview (rollout
plan §1.2); Terraform owns *which groups exist + what roles they
confer*, not *who is in them*.

## Inputs

- The new hire's full name + corporate email address.
- Their HR-side role (engineer, ops, compliance, finance, …).
- The hiring manager's approval (capture the ticket id for the audit).

## Steps

### 1. Create the Entra user

If the new hire doesn't already have an Entra account from corporate
onboarding, create one:

```bash
az ad user create \
    --display-name "First Last" \
    --user-principal-name first.last@wiley-tech.onmicrosoft.com \
    --password "$(uuidgen | tr -d '-' | head -c 24)Aa!1" \
    --force-change-password-next-sign-in true
```

The auto-generated password is changed on first login; the operator
NEVER stores or shares it.

### 2. Add to the appropriate AlphaSwarm group

Map the HR-side role to the canonical group. Default mappings:

| HR role | Entra group(s) |
| --- | --- |
| Software Engineer / SRE | `AlphaSwarm-Engineering` |
| Senior SRE / on-call rotation | `AlphaSwarm-Engineering` + `AlphaSwarm-Operations` |
| Compliance Officer | `AlphaSwarm-Compliance` |
| Internal Auditor | `AlphaSwarm-Auditors` |
| Finance / FinOps | `AlphaSwarm-Finance` |
| Security Engineer | `AlphaSwarm-SOC` |
| CTO / VP Engineering | `AlphaSwarm-Admins` (requires CTO sign-off and CA-policy MFA) |

Add via the Azure Portal **OR** via CLI:

```bash
# Look up the group id (cached locally for repeat use).
GROUP_ID="$(az ad group show --group AlphaSwarm-Engineering --query id -o tsv)"
USER_ID="$(az ad user show --id first.last@wiley-tech.onmicrosoft.com --query id -o tsv)"
az ad group member add --group "${GROUP_ID}" --member-id "${USER_ID}"
```

### 3. Verify the role propagation

Wait 5 minutes for Entra to propagate the membership change, then have
the new hire sign in once at `manage.alpha-swarm.ai`. The application
token they receive should include a `roles` claim containing the roles
mapped to the group.
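
If the new hire can share a token for debugging, the `roles` claim can be inspected without any extra tooling. The sketch below only decodes the JWT payload and deliberately does NOT verify the signature, so it is suitable for inspection and never for authorization decisions. The function names are illustrative.

```python
import base64
import json


def unverified_claims(token: str) -> dict:
    """Decode a JWT payload WITHOUT verifying the signature (inspection only)."""
    payload_b64 = token.split(".")[1]
    # JWT segments are base64url without padding; restore it before decoding.
    padded = payload_b64 + "=" * (-len(payload_b64) % 4)
    return json.loads(base64.urlsafe_b64decode(padded))


def roles_in_token(token: str) -> list:
    """Return the sorted `roles` claim, or an empty list if the claim is absent."""
    return sorted(unverified_claims(token).get("roles", []))
```

An empty list after the propagation delay usually means the group-to-role mapping has not reached the token yet, which is the case the operator-side check below is meant to confirm.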

Confirm from the operator side:

```bash
python scripts/identity/list_entra_app_role_assignments.py \
    --format=json \
    | jq '.[] | select(.principal_display_name=="First Last")'
```

The command should print one row per role for each group the user belongs to.

### 4. Capture the audit trail

The Entra audit log records group-membership changes automatically and
forwards them to the corporate SIEM via the existing log stream. The
manager's approval ticket gets attached as part of the standard
employee onboarding packet.

## Promoting an existing staff member

```bash
# Add to a higher-privilege group (e.g. ops on-call).
az ad group member add --group AlphaSwarm-Operations --member-id "${USER_ID}"
```

For promotions to `AlphaSwarm-Admins`:

1. The CTO must sign off in writing (ticket id captured).
2. The user must have a registered FIDO2 hardware key (verified by
   the Security team).
3. The user falls under the `AlphaSwarm-Admins-MFA-Required` Conditional
   Access policy automatically.

## Off-boarding

```bash
# Remove from every AlphaSwarm group; do NOT just disable the Entra account
# in case the user has cross-tenant memberships we don't manage.
for group in AlphaSwarm-Engineering AlphaSwarm-Operations AlphaSwarm-Auditors AlphaSwarm-Compliance \
             AlphaSwarm-Finance AlphaSwarm-SOC AlphaSwarm-Admins; do
  GROUP_ID="$(az ad group show --group "${group}" --query id -o tsv)"
  az ad group member remove --group "${GROUP_ID}" --member-id "${USER_ID}" 2>/dev/null || true
done
```

After removal, capture an evidence snapshot:

```bash
python scripts/identity/list_entra_app_role_assignments.py \
    --format=csv > evidence/entra-after-offboarding-${USER_ID}-$(date +%F).csv
```

## Common pitfalls

| Pitfall | Mitigation |
| --- | --- |
| Adding a user to two conflicting groups | The role union is granted; review with `list_entra_app_role_assignments.py` after every change |
| Group propagation lag | Ask the user to wait 5 minutes between group add and login retry |
| User can't sign in despite group membership | Check Conditional Access "What If" report for the user; CA may be blocking the sign-in |
| Stale group from a previous role | Remove the old group BEFORE adding the new one to keep the audit trail clean |

## Related

- [`concepts/identity/entra-internal-tenant`](../concepts/identity/entra-internal-tenant.md) — pool concept
- [`how-to/entra-terraform-bootstrap`](./entra-terraform-bootstrap.md) — how the groups exist in the first place
- [`how-to/entra-rotate-secrets`](./entra-rotate-secrets.md) — credential rotation


<!-- https://alpha-swarm.ai/how-to/entra-rotate-secrets -->
# Rotate Entra ID app secrets


The AlphaSwarm staff Entra rollout aims for **zero stored secrets** — CI
authenticates via federated credentials (rollout plan §3.5), and the
runtime bootstrap relies on operator `az login` sessions. This page
documents:

1. The two secrets that DO exist during the bootstrap window and how
   to rotate them.
2. The federated-credential rotation cadence (no secret material, but
   subjects + issuer trust still matter).
3. The break-glass account procedure.

## What we DO NOT rotate

The `alphaswarm_entra_directory` module ships zero `azuread_application_password`
resources by default. App-secret material lives in Vault under
`secret/alphaswarm/entra/` ONLY for the bootstrap window, and ONLY when the
operator explicitly opts in via `terraform import` of an out-of-band
Azure Portal-created password. Once Phase 5 of the rollout completes,
no app secrets exist for any AlphaSwarm-managed Entra app.

## Bootstrap-window secret rotation

Rotate the bootstrap SP secret used by Phase 0 / 1 of the rollout
(before federated credentials are wired):

```bash
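# 1. Mint a new secret (90-day lifetime, recorded in the audit log).
#    `--end-date` sets the expiry; the date arithmetic uses GNU date syntax.
NEW_SECRET="$(az ad app credential reset \
    --id "${BOOTSTRAP_SP_CLIENT_ID}" \
    --end-date "$(date -u -d '+90 days' +%Y-%m-%d)" \
    --query password -o tsv)"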

# 2. Write to Vault.
vault kv put secret/alphaswarm/entra/bootstrap_sp_secret value="${NEW_SECRET}"

# 3. Restart any service still using the old secret. Most are already
#    on federated credentials by this point so this is a sweep.
kubectl rollout restart -n alphaswarm deploy/alphaswarm-admin

# 4. Clear the new secret from local env.
unset NEW_SECRET
```

**Cadence**: every 90 days while Phase 0/1 secrets exist. Phase 5
retires the secret entirely.

## Federated-credential rotation

Federated credentials carry no secret material — the trust comes from
the GitHub Actions OIDC issuer + the subject claim. Rotate the
SUBJECT when:

- A repo is renamed.
- A protected branch is renamed.
- A new GitHub environment is added.

Procedure:

```hcl
# In alphaswarm_platform/terraform/environments/wiley-tech/entra.tf,
# add a new entry to var.ci_federated_credentials:
{
  name        = "github-staging-environment"
  description = "GitHub Actions deploy to staging environment."
  subject     = "repo:julianwileymac/alphaswarm:environment:staging"
}
```

Then the standard plan/apply path:

```bash
./scripts/identity/entra_terraform_plan.sh
python scripts/identity/entra_terraform_apply_via_runtime.py \
    --workspace wiley-tech \
    --apply \
    --reason "Add staging environment OIDC subject"
```

Per-environment or per-branch subjects are **mandatory**; the module's
plan check rejects any subject containing `*`, including `ref:refs/heads/*`.
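
The same rule can be checked before a plan is even run, for example in a pre-commit hook. The sketch below is a Python approximation, not the module's Terraform check: the regular expression encodes the two GitHub OIDC subject shapes this page relies on (per-environment and per-branch), and the function name is hypothetical.

```python
import re

# Accept only fully qualified per-environment or per-branch subjects.
_ALLOWED_SUBJECT = re.compile(
    r"^repo:[\w.-]+/[\w.-]+:(environment:[\w.-]+|ref:refs/heads/[\w./-]+)$"
)


def is_acceptable_subject(subject: str) -> bool:
    """Reject wildcards outright, then require one of the two allowed shapes."""
    return "*" not in subject and bool(_ALLOWED_SUBJECT.match(subject))
```

For example, `repo:julianwileymac/alphaswarm:environment:staging` passes, while `repo:julianwileymac/alphaswarm:ref:refs/heads/*` is rejected.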

## Break-glass account rotation

The two break-glass accounts (rollout plan §4 risk table) are excluded
from time-based Conditional Access policies and used ONLY in declared
incidents. Their credentials live in:

- **Physical safe** — printed sealed envelope per account.
- **Redundant FIDO2 hardware keys** — two YubiKey 5C NFC per account,
  stored in separate physical safes.

Rotation cadence: every 6 months OR after any use, whichever comes
first.
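
The "whichever comes first" rule is easy to get wrong around month ends, so a small worked check may help. The sketch assumes "6 months" means six calendar months from the last rotation, with the day clamped to the length of the target month; both function names are illustrative.

```python
import calendar
from datetime import date
from typing import Optional


def add_months(d: date, months: int) -> date:
    """Add calendar months, clamping the day to the target month's length."""
    index = d.year * 12 + (d.month - 1) + months
    year, month0 = divmod(index, 12)
    day = min(d.day, calendar.monthrange(year, month0 + 1)[1])
    return date(year, month0 + 1, day)


def rotation_due(last_rotated: date, last_used: Optional[date], today: date) -> bool:
    """True when the credential was used since rotation or six months have passed."""
    used_since_rotation = last_used is not None and last_used >= last_rotated
    return used_since_rotation or today >= add_months(last_rotated, 6)
```

A credential rotated on 2026-01-15 and never used becomes due on 2026-07-15; any use after the rotation makes it due immediately.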

Procedure:

```bash
# 1. Generate fresh FIDO2 keys with the Azure portal / az ad device-mfa
#    enrolment flow.
# 2. Mint a fresh emergency password (UUID-derived, 24+ chars).
# 3. Update the sealed envelope in the safe.
# 4. Capture the rotation in the security log:
echo "{ \"actor\": \"$USER\", \"event\": \"break-glass-rotation\", \
       \"account\": \"break-glass-1@wiley-tech.onmicrosoft.com\", \
       \"completed_at\": \"$(date -u +%FT%TZ)\" }" \
    | gpg --encrypt -r security@alpha-swarm.ai \
    >> evidence/break-glass-rotations.gpg
```

The operator who performs the rotation runs
`scripts/identity/list_entra_app_role_assignments.py --apps` to
confirm that the break-glass account is still NOT assigned any AlphaSwarm role
(the accounts MUST stay outside the AlphaSwarm role surface; they exist for
tenant-level emergency access only).

## Common pitfalls

| Pitfall | Mitigation |
| --- | --- |
| Operator tempted to add a static secret for "just one workflow" | NO. File a ticket to add a federated credential subject instead |
| Forgetting to remove old federated subjects after a repo rename | Run `az ad app federated-credential list --id "${CI_APP_ID}"` quarterly; remove anything not in `var.ci_federated_credentials` |
| Break-glass account drifting into AlphaSwarm role assignments | The `list_entra_app_role_assignments.py` audit script's CSV output is reviewed quarterly by Security |
| Bootstrap SP secret accidentally committed | Pre-commit hook scans for Azure-secret patterns; CI fails the PR |

## Related

- [`how-to/entra-terraform-bootstrap`](./entra-terraform-bootstrap.md) — initial setup
- [`how-to/entra-onboard-new-staff`](./entra-onboard-new-staff.md) — staff lifecycle
- [`concepts/identity/entra-internal-tenant`](../concepts/identity/entra-internal-tenant.md) — pool concept
- Long-form rollout plan with risk register:
  [`docs/plans/entra-internal-tenant-rollout.md`](pathname:///docs/plans/entra-internal-tenant-rollout.md)


<!-- https://alpha-swarm.ai/how-to/entra-terraform-bootstrap -->
# Bootstrap the AlphaSwarm Entra ID staff tenant


Step-by-step procedure for taking the AlphaSwarm staff Microsoft Entra ID
tenant from "exists in the Azure Portal" to "fully Terraform-controlled
and serving as the first user pool for `manage.alpha-swarm.ai`".

This is the implementation runbook. Concept context lives at
[`concepts/identity/entra-internal-tenant`](../concepts/identity/entra-internal-tenant.md);
the full rollout schedule + risks + rollback at
[`docs/plans/entra-internal-tenant-rollout.md`](pathname:///docs/plans/entra-internal-tenant-rollout.md).

## Pre-requisites

| Prereq | How to confirm |
| --- | --- |
| AlphaSwarm staff Entra tenant exists | `az account tenant list` shows the tenant id |
| Global admin / Application Administrator account | `az ad signed-in-user show` confirms role assignment |
| Bootstrap service principal exists with `Application.ReadWrite.All` + `Group.ReadWrite.All` + `RoleManagement.ReadWrite.Directory` | `az ad sp show --id ` |
| Terraform 1.10+ installed locally | `terraform version` |
| Repo cloned + AlphaSwarm runtime installable | `pip install -e .[dev]` succeeds |
| Vault accessible with the `secret/alphaswarm/entra/` mount | `vault kv get secret/alphaswarm/entra/internal_tenant_id` resolves |

If any prereq is missing, file a ticket with the Identity team
(reference ADR-011) before continuing.

## Step 1 — Set environment variables

```bash
# Sourced from Vault by the operator before running the helpers.
export AZURE_TENANT_ID=""
export AZURE_CLIENT_ID=""
export AZURE_CLIENT_SECRET=""   # OR use az login

# Echoed into the Terraform provider.
export TF_VAR_entra_tenant_id="${AZURE_TENANT_ID}"
export TF_VAR_entra_enabled="true"
```

> **Note**: the `AZURE_CLIENT_SECRET` path is documented for the
> bootstrap window only. Once the `alphaswarm-ci-github` app registration +
> federated credentials land (Phase 5 of the rollout plan), no secret
> is stored anywhere; CI authenticates via OIDC.

## Step 2 — Plan-only preview

```bash
./scripts/identity/entra_terraform_plan.sh
```

The script:

1. Runs `terraform fmt -check` + `terraform validate` against the
   module.
2. Runs `terraform plan -target=module.alphaswarm_entra_directory` against
   the `wiley-tech` environment.
3. Writes the plan binary to `/tmp/alphaswarm-entra-wiley-tech.plan` and
   prints the next-step command.

Inspect the plan line-by-line. Common red flags:

- A resource shows `# forces replacement` for an app-role id → someone
  has regenerated a UUID in `var.app_role_definitions` (DON'T merge).
- A federated credential shows `subject = "...:*"` → wildcard rejected
  by the module check; fix the input.
- A group display name conflicts with an existing group → rename or
  import.
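The first two red flags can also be checked mechanically. The following is a minimal sketch, assuming the plan has been rendered to JSON with `terraform show -json`; the function name, resource addresses, and sample data are hypothetical and not part of the repository's scripts. Display-name conflicts with existing groups do not show up in the plan and still need a manual check.

```python
def find_red_flags(plan: dict) -> list[str]:
    """Return warnings for risky entries in a Terraform JSON plan."""
    flags: list[str] = []
    for rc in plan.get("resource_changes", []):
        address = rc.get("address", "<unknown>")
        change = rc.get("change") or {}
        actions = change.get("actions") or []
        # Terraform encodes a replacement as a delete/create pair.
        if "delete" in actions and "create" in actions:
            flags.append(f"{address}: forces replacement")
        # Planned attribute values live under "after" (None on destroy).
        subject = (change.get("after") or {}).get("subject")
        if isinstance(subject, str) and subject.endswith(":*"):
            flags.append(f"{address}: wildcard subject {subject!r}")
    return flags


# Hypothetical plan fragment, for illustration only.
sample_plan = {
    "resource_changes": [
        {"address": "azuread_application_app_role.example",
         "change": {"actions": ["delete", "create"], "after": {}}},
        {"address": "azuread_application_federated_identity_credential.ci",
         "change": {"actions": ["create"],
                    "after": {"subject": "repo:example/repo:*"}}},
        {"address": "azuread_group.staff",
         "change": {"actions": ["no-op"], "after": {}}},
    ]
}

for flag in find_red_flags(sample_plan):
    print(flag)
```

A check like this complements the line-by-line review; it does not replace it.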

## Step 3 — Apply via TerraformRuntime

```bash
python scripts/identity/entra_terraform_apply_via_runtime.py \
    --workspace wiley-tech \
    --apply \
    --reason "Phase 2 land for entra-internal stack"
```

The helper:

1. Loads the `entra-internal` `TerraformStackSpec`.
2. Runs `runtime.plan(...)` (writes a `terraform_runs` row).
3. Prompts for `yes` confirmation (skip with `--yes` only in CI).
4. Runs `runtime.apply(...)` (writes a second `terraform_runs` row
   linked to the same spec_version_id).

Output is redacted: token-bearing fields show only the first 4
characters per AGENTS rule 26.
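The redaction rule can be illustrated with a small sketch. This is not the helper's actual implementation; `redact_token`, `redact_fields`, and the set of sensitive field names are assumptions made for the example.

```python
def redact_token(value: str, visible: int = 4) -> str:
    """Show only the first `visible` characters plus the total length."""
    if len(value) <= visible:
        # Too short to reveal a prefix without leaking the whole value.
        return "*" * len(value)
    return f"{value[:visible]}… ({len(value)} chars)"


# Hypothetical list of token-bearing field names.
SENSITIVE_FIELDS = frozenset({"access_token", "client_secret"})


def redact_fields(record: dict) -> dict:
    """Redact sensitive string fields and pass everything else through."""
    return {
        key: redact_token(val)
        if key in SENSITIVE_FIELDS and isinstance(val, str)
        else val
        for key, val in record.items()
    }


print(redact_fields({"access_token": "eyJ0eXAiOiJKV1Qi",
                     "workspace": "wiley-tech"}))
```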

## Step 4 — Grant admin consent

After the apps land, their requested Microsoft Graph permissions are
*requested* but not yet *consented*. Grant tenant-wide consent:

```bash
# The staff app's client_id is in the Terraform output:
STAFF_CID="$(terraform -chdir=alphaswarm_platform/terraform/environments/wiley-tech \
                output -raw entra_staff_app_client_id)"

./scripts/identity/grant_admin_consent.sh "${STAFF_CID}"
```

The script wraps `az ad app permission admin-consent` and verifies
the resulting grants with `az ad app permission list-grants`.

## Step 5 — Seed `EntraTenantLink`

```bash
# Read the new staff app's tenant id and stamp the canonical row.
export ALPHASWARM_AUTH_MSAL_INTERNAL_TENANT_ID="${AZURE_TENANT_ID}"
export ALPHASWARM_AUTH_MSAL_INTERNAL_APP_ID="${STAFF_CID}"

python scripts/identity/seed_entra_internal_tenant.py --dry-run
python scripts/identity/seed_entra_internal_tenant.py --apply
```

Idempotent: the second `--apply` is a no-op if the row already matches
the target shape.

## Step 6 — Round-trip a real login

```bash
# Browser flow.
python scripts/identity/verify_entra_login.py

# Headless / SSH session.
python scripts/identity/verify_entra_login.py --device-code
```

Successful output:

```
INFO Got access token: eyJ0… (1456 chars)
INFO Claims look correct.
INFO CA policies found: AlphaSwarm-Admins-MFA-Required, AlphaSwarm-Block-Risky-Sign-Ins
INFO All checks passed.
```

If a CA policy is missing, the script exits with code 4 and lists the
missing policies. Add them via the Azure Portal under Security review,
then re-run. CA policies are NOT created from Terraform (rollout plan
§1.2).

## Step 7 — Verify role assignments

```bash
python scripts/identity/list_entra_app_role_assignments.py
```

Should print one row per (group, role) pair the module created. Save
a CSV snapshot for the audit trail:

```bash
python scripts/identity/list_entra_app_role_assignments.py \
    --format=csv > evidence/entra-role-snapshot-$(date +%F).csv
```

## Step 8 — Switch the manage.alpha-swarm.ai chooser to prefer Entra

With everything in place, flip the runtime so the `manage.alpha-swarm.ai`
login chooser prefers Entra over Auth0:

```bash
# Settings already wired in alphaswarm/config/settings.py:
#   auth_msal_priority = 100   # MSAL wins
#   auth_msal_internal_*       # populated from Terraform outputs

kubectl set env -n alphaswarm deploy/alphaswarm-admin \
    ALPHASWARM_AUTH_MSAL_INTERNAL_TENANT_ID="${AZURE_TENANT_ID}" \
    ALPHASWARM_AUTH_MSAL_INTERNAL_APP_ID="${STAFF_CID}" \
    ALPHASWARM_AUTH_MSAL_INTERNAL_AUTHORITY="https://login.microsoftonline.com/${AZURE_TENANT_ID}" \
    ALPHASWARM_AUTH_MSAL_INTERNAL_AUDIENCE="api://alphaswarm-manage-api" \
    ALPHASWARM_AUTH_MSAL_PRIORITY=100

kubectl rollout status -n alphaswarm deploy/alphaswarm-admin
```

Bake for 24 hours while monitoring the
`auth_login_total{provider="entra"}` and
`auth_login_failure_total` Prometheus counters. By the end of the bake,
at least 95% of staff logins should land on Entra.
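The 95% target is a simple ratio of counter increases over the bake window. A minimal sketch, assuming the total is the increase of `auth_login_total` summed across all providers; the function names are hypothetical:

```python
def entra_share(entra_logins: float, total_logins: float) -> float:
    """Fraction of logins that landed on Entra during the bake window."""
    if total_logins <= 0:
        return 0.0  # no traffic yet, so nothing can be concluded
    return entra_logins / total_logins


def bake_passed(entra_logins: float, total_logins: float,
                threshold: float = 0.95) -> bool:
    """True when the Entra share meets the documented 95% target."""
    return entra_share(entra_logins, total_logins) >= threshold


print(bake_passed(entra_logins=480, total_logins=500))
```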

## Verification

| Check | Command |
| --- | --- |
| Terraform plan is clean | `./scripts/identity/entra_terraform_plan.sh` (no diff) |
| `terraform_runs` audit row recorded | `psql -c "SELECT id, status FROM terraform_runs WHERE stack_slug='entra-internal' ORDER BY created_at DESC LIMIT 1"` |
| `entra_tenant_links` has `kind=internal` | `python scripts/identity/seed_entra_internal_tenant.py --dry-run` reports `EXISTING row matches target` |
| Real login works end-to-end | `python scripts/identity/verify_entra_login.py` exits 0 |
| All seven groups have role assignments | `python scripts/identity/list_entra_app_role_assignments.py` prints ≥7 rows |

## Rollback

See [the rollout plan §5](pathname:///docs/plans/entra-internal-tenant-rollout.md)
for the three rollback tiers (hot / cold / catastrophic).


<!-- https://alpha-swarm.ai/how-to/linkerd-spire-rollout -->
# Linkerd + SPIRE rollout runbook

# Linkerd + SPIRE rollout runbook

> Phase 4 §7.1 + §7.2 of
> [RESTRUCTURING_PLAN.md](https://github.com/julianwiley/alphaswarm/blob/main/RESTRUCTURING_PLAN.md).
> Covers the per-cell install of Linkerd 2.16 (service mesh) and
> SPIRE 1.10 (workload identity) plus the matching validation
> steps.

## Scope

Per-cell installs of:

- **Linkerd 2.16** — mTLS-by-default for every pod-to-pod call inside
  a cell. Cross-cell calls re-terminate at `alphaswarm-edge` (Envoy).
- **SPIRE 1.10** — issues SPIFFE JWT-SVIDs and X.509-SVIDs via the
  Workload API. Replaces the kubelet-bound ServiceAccount token
  usage in `alphaswarm/auth/m2m.py`.

Both ship as kustomize bases under
`alphaswarm_platform/deployments/kubernetes/mesh-identity/`. Argo CD's
`cells` `ApplicationSet` (Phase 3 §6.5) is extended in Phase 4.5 to
stamp one per-component Application per cell.

## Prerequisites

1. The cell namespace exists and carries the Phase 4 §7.1
   `linkerd.io/inject: enabled` annotation. Verify:
   ```bash
   kubectl get ns cell-shared-std-us-east-1a -o yaml | grep linkerd.io/inject
   # expected: linkerd.io/inject: enabled
   ```
2. The cell registry has the cell row in `state=provisioning` (so
   the cell-router doesn't send live traffic yet).
3. Vault PKI is configured and ready to issue:
   - Linkerd trust anchor + issuer cert (rotates via VaultStaticSecret).
   - SPIRE upstream authority (if running with `UpstreamAuthority`
     plugin; the Phase 4 spine uses self-signed for simplicity).

## Step 0 — Apply the mesh-identity spine

```bash
# Apply in dependency order:
#   1. SPIRE (everything else consumes SVIDs)
kubectl apply -k alphaswarm_platform/deployments/kubernetes/mesh-identity/spire/

# Wait for SPIRE Server to be ready:
kubectl -n spire-system rollout status statefulset/spire-server --timeout=5m
kubectl -n spire-system get pods -l app=spire-agent

#   2. Linkerd (consumes SPIRE-issued trust anchor)
#   The trust anchor + issuer cert must already be in
#   Secret/linkerd-identity-issuer (see §7.6 wire-up).
kubectl apply -k alphaswarm_platform/deployments/kubernetes/mesh-identity/linkerd/

# Wait for Linkerd identity service:
kubectl -n linkerd rollout status deployment/linkerd-identity --timeout=10m
kubectl -n linkerd rollout status deployment/linkerd-destination --timeout=10m
kubectl -n linkerd rollout status deployment/linkerd-proxy-injector --timeout=10m

# Optional: install linkerd-viz for golden-signal dashboards
kubectl apply -k alphaswarm_platform/deployments/kubernetes/mesh-identity/linkerd/  # idempotent

#   3. vault-secrets-operator (mTLS via Linkerd, identity via SPIRE)
kubectl apply -k alphaswarm_platform/deployments/kubernetes/mesh-identity/vault-secrets-operator/

#   4. Pomerium IAP (depends on Linkerd mTLS for backend reach)
kubectl apply -k alphaswarm_platform/deployments/kubernetes/mesh-identity/pomerium/
```

## Step 1 — Validate SPIRE Workload API

```bash
# Find a workload pod that mounts the agent socket:
POD=$(kubectl -n cell-shared-std-us-east-1a get pods -l app=alphaswarm-core -o name | head -1)

# Drop into the pod and fetch an SVID:
kubectl -n cell-shared-std-us-east-1a exec -it "$POD" -- /bin/sh -c "
  export SPIFFE_ENDPOINT_SOCKET=unix:///run/spire/sockets/agent.sock
  python -c '
from spiffe.workloadapi import default_jwt_source
src = default_jwt_source.DefaultJwtSource()
svid = src.fetch_svid(audiences=[\"alphaswarm-tenant-router\"])
print(\"SPIFFE ID:\", svid.spiffe_id)
print(\"Audiences:\", svid.audiences)
print(\"Token (truncated):\", svid.token[:60], \"...\")
'
"
# Expected: SPIFFE ID spiffe://alpha-swarm.ai/cell/cell-shared-std-us-east-1a/alphaswarm-core
```
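The expected SPIFFE ID in the comment above appears to follow a `spiffe://<trust-domain>/cell/<cell>/<workload>` layout. A minimal sketch for checking a returned ID against that layout, assuming it holds for every cell workload; the helper and type names are hypothetical:

```python
from typing import NamedTuple
from urllib.parse import urlparse


class CellWorkload(NamedTuple):
    trust_domain: str
    cell: str
    workload: str


def parse_cell_spiffe_id(spiffe_id: str) -> CellWorkload:
    """Split a cell-scoped SPIFFE ID into its three components."""
    parsed = urlparse(spiffe_id)
    if parsed.scheme != "spiffe":
        raise ValueError(f"not a SPIFFE ID: {spiffe_id!r}")
    parts = parsed.path.strip("/").split("/")
    # Assumed layout: /cell/<cell-name>/<workload-name>
    if len(parts) != 3 or parts[0] != "cell":
        raise ValueError(f"unexpected SPIFFE path: {parsed.path!r}")
    return CellWorkload(parsed.netloc, parts[1], parts[2])


print(parse_cell_spiffe_id(
    "spiffe://alpha-swarm.ai/cell/cell-shared-std-us-east-1a/alphaswarm-core"
))
```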

If the SVID fetch fails, check the SPIRE Agent's registration
entries — the workload's ServiceAccount might not be selected:

```bash
kubectl -n spire-system exec -it spire-server-0 -- /opt/spire/bin/spire-server entry list
```

## Step 2 — Validate Linkerd mTLS

```bash
# Check that the proxy injected on every alphaswarm-core pod:
kubectl -n cell-shared-std-us-east-1a get pods -l app=alphaswarm-core \
  -o jsonpath='{range .items[*]}{.metadata.name}{":"}{.spec.containers[*].name}{"\n"}{end}'
# Expected: each pod has BOTH `api` and `linkerd-proxy` containers.

# Verify mTLS edge-to-edge between two AlphaSwarm pods:
linkerd -n cell-shared-std-us-east-1a viz stat deploy
# Expected: every deployment row shows `MESHED 1/1` (or matching replica count)
# and the SUCCESS RATE column reports % over the last 1m window.

linkerd -n cell-shared-std-us-east-1a viz edges deployment
# Expected: every edge is "mTLS YES" — if any edge shows "NO", the
# source or destination pod is missing the proxy injection.
```

If pods are NOT meshed, the Proxy Injector didn't see the
`linkerd.io/inject: enabled` annotation. Check the namespace:

```bash
kubectl get ns cell-shared-std-us-east-1a -o yaml | grep -A 2 annotations
# Expected: linkerd.io/inject: enabled
```

## Step 3 — Validate Pomerium IAP

The Pomerium routes for `/manage/*` live in
`alphaswarm_platform/deployments/kubernetes/mesh-identity/pomerium/route-manage.yaml`.

```bash
# From outside the cluster, the IAP-protected route should redirect
# to authenticate.alpha-swarm.ai (Pomerium's authenticate service):
curl -sIL https://manage.alpha-swarm.ai/manage/cells | head -10
# Expected: 302 to https://authenticate.alpha-swarm.ai/.pomerium/...

# After completing the Auth0 flow + step-up MFA, the request reaches
# alphaswarm-cp.alphaswarm-admin.svc.cluster.local:9000 with the
# X-Pomerium-Jwt-Assertion header attached:
curl -sS https://manage.alpha-swarm.ai/manage/cells \
  --cookie "_pomerium=<session-cookie>" \
  | jq '.data[].id'
```

The receiving FastAPI route validates the assertion via
`alphaswarm.auth.providers.pomerium.extract_pomerium_claims` (Phase 4 §7.5).

## Step 4 — Cedar policy gate

Trigger a Cedar evaluation:

```bash
# Try to register a cell as a user WITHOUT the cell_operator role —
# should 403:
curl -sS -XPOST https://manage.alpha-swarm.ai/manage/cells \
  -H 'authorization: Bearer <access-token>' \
  -H 'content-type: application/json' \
  -d '{"id":"cell-x","tier":"shared-std",...}' \
  -o /tmp/cedar-deny.json
cat /tmp/cedar-deny.json
# Expected: {"detail":{"error":"cedar_denied",...}}

# With the role granted by the Auth0 Action, the same call succeeds:
# (cell_operator role is wired via the action at
# alphaswarm/api/routes/auth0_sync.py per Phase 4 §7.3.)
```

## Step 5 — VaultStaticSecret rotation

Verify the `alphaswarm-cell-postgres-credentials` Secret rotates within the
30-minute `refreshAfter` window:

```bash
# Watch the Secret's resourceVersion:
kubectl -n cell-shared-std-us-east-1a get secret postgres-credentials \
  -o jsonpath='{.metadata.resourceVersion}' --watch

# Trigger a Vault-side rotation:
vault kv put cells/shared-std/cell-shared-std-us-east-1a host=newhost.example port=5432

# Within 30 minutes the resourceVersion increments and the deployments
# listed in `rolloutRestartTargets` perform a rolling restart.
```

## Rollback

Each component is independently revertable:

```bash
# Linkerd — remove the proxy injection (existing pods stay meshed
# until their next rollout):
kubectl annotate ns cell-shared-std-us-east-1a linkerd.io/inject-

# SPIRE — workloads fall back to the Auth0 M2M path (chain order in
# alphaswarm.credentials.resolver) when the SPIFFE socket isn't reachable.
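# `kubectl scale` does not support DaemonSets; instead pin the agent to a
# node label that no node carries (label name is illustrative) so its pods drain.
kubectl -n spire-system patch daemonset spire-agent \
  -p '{"spec":{"template":{"spec":{"nodeSelector":{"spire-agent-disabled":"true"}}}}}'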

# Pomerium — direct /manage/* to alphaswarm-cp via DNS, bypassing the IAP.
kubectl -n pomerium scale deployment pomerium-proxy --replicas=0

# vault-secrets-operator — Secrets stop refreshing but stay readable.
kubectl -n vault-secrets-operator scale deployment vault-secrets-operator --replicas=0
```

## Phase 4.5 follow-ups

1. Per-cell SPIRE `ClusterSPIFFEID` CRDs binding workload selectors.
2. M2MTokenIssuer dispatch through `ALPHASWARM_AUTH_M2M_PROVIDER=spiffe`.
3. Per-cell `VaultStaticSecret` set for every persistent service
   (Postgres, Redis, MinIO, MLflow, ChromaDB).
4. Per-cell Pomerium routes for the `alphaswarm_admin` UI surface.
5. Linkerd SPIFFE trust anchor wired from SPIRE Server's
   upstream-authority CA.

## Related documents

- [RESTRUCTURING_PLAN.md §7](https://github.com/julianwiley/alphaswarm/blob/main/RESTRUCTURING_PLAN.md)
- [alphaswarm_docs/docs/concepts/identity/spiffe-workload-identity.md](spiffe-workload-identity.md)
- [alphaswarm_platform/deployments/kubernetes/mesh-identity/README.md](https://github.com/julianwiley/alphaswarm/blob/main/alphaswarm_platform/deployments/kubernetes/mesh-identity/README.md)
- [alphaswarm_docs/docs/how-to/cell-router-cutover.md](cell-router-cutover.md)


<!-- https://alpha-swarm.ai/how-to/mlops/cross-repo-lineage -->
# Cross-repo lineage bridge
> Set `ALPHASWARM_AGENTIC_ASSISTANTS_API` in the environment (or in the k8s ConfigMap `alphaswarm-env`):

# Cross-repo lineage bridge

The `agentic_assistants` repository maintains a shared lineage graph
(dataset → run → model → report). AlphaSwarm publishes events to the same
service via
[`alphaswarm.mlops.lineage_bridge`](../../alphaswarm/mlops/lineage_bridge.py) so both
repos present a unified view in the Lineage UI.

## Configuration

Set `ALPHASWARM_AGENTIC_ASSISTANTS_API` in the environment (or in the k8s
ConfigMap `alphaswarm-env`):

```bash
export ALPHASWARM_AGENTIC_ASSISTANTS_API=http://agentic-assistants.alphaswarm.svc.cluster.local:8000
```

When the setting is empty the bridge is a no-op — every `emit_*` call
logs at DEBUG and returns `False`.

## Emitting events

```python
from alphaswarm.mlops import (
    emit_dataset, emit_run, emit_model, emit_serve_deployment
)

# 1. Record the training dataset.
emit_dataset("abc123def", n_rows=2_000_000, n_symbols=500)

# 2. Log a training run tied to that dataset.
emit_run("run-42", kind="alpha_training", dataset_hash="abc123def",
         model_class="LightGBMAlpha")

# 3. Register the resulting model artifact.
emit_model("alphaswarm-alpha", version="7", run_id="run-42",
           metrics={"ic_mean": 0.042})

# 4. Record the live deployment serving it.
emit_serve_deployment(
    endpoint_url="http://ray-serve.alphaswarm.svc.cluster.local:8000/alphaswarm",
    backend="ray-serve",
    model_uri="models:/alphaswarm-alpha/Production",
)
```

## Event schema

Each event is a JSON POST to `/api/lineage/events`:

```json
{
  "kind": "model",
  "id": "model:alphaswarm-alpha/7",
  "attrs": { "run_id": "run-42", "metrics": { "ic_mean": 0.042 } },
  "parents": ["run:run-42"]
}
```
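To make the schema concrete, here is a minimal sketch that builds the `model` event shown above and posts it. It is not the bridge's implementation; `build_model_event` and `post_event` are hypothetical names, and the early return only mirrors the documented no-op behaviour when `ALPHASWARM_AGENTIC_ASSISTANTS_API` is empty.

```python
import json
import os
import urllib.request


def build_model_event(name: str, version: str, run_id: str,
                      metrics: dict) -> dict:
    """Build a `model` lineage event whose parent is the training run."""
    return {
        "kind": "model",
        "id": f"model:{name}/{version}",
        "attrs": {"run_id": run_id, "metrics": metrics},
        "parents": [f"run:{run_id}"],
    }


def post_event(event: dict, base_url=None) -> bool:
    """POST one event; return False without sending when no API is set."""
    if base_url is None:
        base_url = os.environ.get("ALPHASWARM_AGENTIC_ASSISTANTS_API", "")
    if not base_url:
        return False  # bridge disabled: behave as a no-op
    request = urllib.request.Request(
        base_url.rstrip("/") + "/api/lineage/events",
        data=json.dumps(event).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=5) as response:
        return 200 <= response.status < 300


event = build_model_event("alphaswarm-alpha", "7", "run-42",
                          {"ic_mean": 0.042})
print(json.dumps(event, indent=2))
```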

## Retention

Events live in the `agentic_assistants` lineage store (Postgres).
Retention matches that project's settings (default 90 days for
`run` / `serve_deployment`, indefinite for `dataset` / `model`).


<!-- https://alpha-swarm.ai/how-to/mlops/k8s-deployment -->
# Kubernetes deployment
> The [`Dockerfile`](../../Dockerfile) defines six targets, a shared `base` layer plus five image targets:

# Kubernetes deployment

AlphaSwarm ships Kustomize manifests under [`alphaswarm_platform/deploy/k8s/base/`](../../alphaswarm_platform/deploy/k8s/base/)
that can be applied to any cluster. The manifests under `base/serving/`
add three model-serving backends on top of the existing `api`, `worker`,
`paper-trader`, and streaming-ingester Deployments.

## Image targets

The [`Dockerfile`](../../Dockerfile) defines six targets: a shared `base` layer plus five image targets:

| Target | Entrypoint | Used by |
| --- | --- | --- |
| `base` | — | shared base layer |
| `paper` | `alphaswarm paper run` | `paper-trader.yaml` |
| `ingester` | `alphaswarm-stream-ingest` | `ingester-*.yaml` |
| `api` (default) | `uvicorn alphaswarm.api.main:app` | `api.yaml`, `worker.yaml` |
| `serving` | `alphaswarm serve <backend>` | `serving/*.yaml` |
| `ml-train` | `alphaswarm-train` | CI training jobs, Ray Tune sweeps |

Build the five image targets at once:

```bash
for target in paper ingester api serving ml-train; do
  docker build --target "$target" -t "alphaswarm-${target}:latest" .
done
```

## Deploying to a Kubernetes cluster

AlphaSwarm is cluster-agnostic. The `alphaswarm_platform/deployments/kubernetes/` tree provisions
every shared dependency (MLflow in `alphaswarm-mlops`, MinIO + Postgres + Redis
+ ChromaDB in `alphaswarm-data-services`, Kafka + Schema Registry + Flink in
`alphaswarm-streaming`, kube-prometheus-stack + Tempo + Loki + OTel + Phoenix
in `alphaswarm-observability`, and so on). To deploy AlphaSwarm:

```bash
# From the alphaswarm root
# 1. Install the operators / Helm releases that AlphaSwarm CRDs depend on.
bash alphaswarm_platform/scripts/cluster_install/install-redpanda.sh
bash alphaswarm_platform/scripts/cluster_install/install-kube-prometheus-stack.sh
bash alphaswarm_platform/scripts/cluster_install/install-opentelemetry-operator.sh
bash alphaswarm_platform/scripts/cluster_install/install-spark-operator.sh
bash alphaswarm_platform/scripts/cluster_install/install-flink.sh

# 2. Apply the AlphaSwarm base kustomization (creates alphaswarm-* namespaces and
#    the workload manifests).
kubectl apply -k alphaswarm_platform/deployments/kubernetes/base/
```

## Selecting which model to serve

The three serving backends all read a single `model_uri` from the
`alphaswarm-serving-env` ConfigMap. Change it once and bounce the Deployments:

```bash
kubectl -n alphaswarm create configmap alphaswarm-serving-env \
  --from-literal=model_uri=models:/alphaswarm-alpha/Production \
  --from-literal=ray_serve_name=alphaswarm-alpha \
  --dry-run=client -o yaml | kubectl apply -f -

kubectl -n alphaswarm rollout restart deploy mlflow-serve ray-serve torchserve
```

## Observability

- Every Deployment exports traces to `http://otel-collector:4317`
  (OTLP gRPC), matching the `rpi_kubernetes` collector conventions.
- Prometheus picks up metrics via the `ServiceMonitor` resources in
  [`alphaswarm_platform/deploy/k8s/base/serving/servicemonitor.yaml`](../../alphaswarm_platform/deploy/k8s/base/serving/servicemonitor.yaml).
- AlphaSwarm's own metric surface is defined in
  [`alphaswarm/mlops/metrics.py`](../../alphaswarm/mlops/metrics.py):
  `alphaswarm_train_duration_seconds`, `alphaswarm_backtest_sharpe`, `alphaswarm_paper_pnl`,
  `alphaswarm_serve_requests_total`, `alphaswarm_serve_latency_seconds`.

## Secrets

The `alphaswarm-broker-secrets` Secret supplies Alpaca / IBKR / Tradier
credentials. For the serving stack no secrets are required unless the
MLflow tracking URI needs auth — set `MLFLOW_TRACKING_TOKEN` in
`alphaswarm-env` or a dedicated Secret.


<!-- https://alpha-swarm.ai/how-to/mlops/serving -->
# Model serving
> | Backend | Adapter | CLI | Best for | | --- | --- | --- | --- | | MLflow Models | [`MLflowServeDeployment`](../../alphaswarm/mlops/serving/mlflow_serve.py) | `alphaswarm serve mlflow <uri>` | any flavor logged wit...

# Model serving

AlphaSwarm ships three serving adapters. All three share the same
[`ModelDeployment`](../../alphaswarm/mlops/serving/base.py) protocol so
call-sites, the CLI (`alphaswarm serve ...`), and the REST API speak one
vocabulary regardless of the runtime underneath.

| Backend | Adapter | CLI | Best for |
| --- | --- | --- | --- |
| MLflow Models | [`MLflowServeDeployment`](../../alphaswarm/mlops/serving/mlflow_serve.py) | `alphaswarm serve mlflow <uri>` | any flavor logged with `mlflow.log_model`, low-throughput research |
| Ray Serve | [`RayServeDeployment`](../../alphaswarm/mlops/serving/ray_serve.py) | `alphaswarm serve ray <uri>` | horizontally scaled batch inference |
| TorchServe | [`TorchServeDeployment`](../../alphaswarm/mlops/serving/torchserve.py) | `alphaswarm serve torchserve <uri>` | low-latency PyTorch endpoints + batching |

## Model URIs

All adapters accept three URI shapes:

1. **Filesystem path** — `./data/models/alpha_v1.pkl`
2. **MLflow run** — `runs:/<run_id>/<artifact_path>`
3. **MLflow registry** — `models:/<name>/<version>` or `models:/<name>/<stage>`

MLflow URIs are resolved via `alphaswarm.mlops.serving.base.resolve_model`, which
optionally downloads the artifact locally when a backend needs filesystem
access (TorchServe packaging) or passes the URI through (MLflow Serve).
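The dispatch on URI shape can be sketched as follows. This is an illustration only and not the body of `resolve_model`; both function names are hypothetical, and the download rule covers only the two backends the paragraph above names.

```python
def classify_model_uri(uri: str) -> str:
    """Map a model URI onto one of the three accepted shapes."""
    if uri.startswith("runs:/"):
        return "mlflow_run"
    if uri.startswith("models:/"):
        return "mlflow_registry"
    # Anything without an MLflow scheme is treated as a local path.
    return "filesystem"


def needs_local_copy(uri: str, backend: str) -> bool:
    """TorchServe packaging needs files on disk; MLflow Serve takes the URI."""
    return backend == "torchserve" and classify_model_uri(uri) != "filesystem"


for candidate in ("./data/models/alpha_v1.pkl",
                  "models:/alphaswarm-lgbm/Production"):
    print(candidate, "->", classify_model_uri(candidate))
```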

## PreprocessingSpec propagation

Every adapter honours the
[`PreprocessingSpec`](../../architecture/preprocessing-spec.md) attached to
the model. At inference time:

- **MLflow Serve** — flavor-specific (`pyfunc` handlers are expected to
  re-apply preprocessing inside the model class).
- **Ray Serve** — the generated deployment loads the pickle and delegates
  to `model.predict(df)`; when `model.preprocessing_spec` is set, the
  `apply` call happens in `__call__` before `predict`.
- **TorchServe** — the auto-generated `AqpBaseHandler` checks for a
  `preprocessing_spec` attribute and runs `spec.apply(df)` before every
  call.
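The pattern shared by the Ray Serve and TorchServe paths (apply the spec if one is attached, then predict) can be sketched as a thin wrapper. The classes below are toy stand-ins invented for the example; they are not the generated deployment or handler code, and they use plain lists instead of DataFrames so the sketch runs without extra dependencies.

```python
class PreprocessedModel:
    """Wrap a model so its preprocessing spec runs before every predict."""

    def __init__(self, model):
        self._model = model

    def __call__(self, rows):
        spec = getattr(self._model, "preprocessing_spec", None)
        if spec is not None:
            rows = spec.apply(rows)  # same transform as at training time
        return self._model.predict(rows)


# Toy stand-ins (hypothetical) for a real spec and model.
class DoubleSpec:
    def apply(self, rows):
        return [2 * x for x in rows]


class SumModel:
    def __init__(self, preprocessing_spec=None):
        self.preprocessing_spec = preprocessing_spec

    def predict(self, rows):
        return sum(rows)


print(PreprocessedModel(SumModel(DoubleSpec()))([1, 2]))  # spec applied first
print(PreprocessedModel(SumModel())([1, 2]))              # no spec attached
```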

## Quick start

```bash
# Train something and log to MLflow
python scripts/train_agent.py --config configs/ml/lgbm.yaml

# Serve the latest production version via MLflow
alphaswarm serve mlflow models:/alphaswarm-lgbm/Production --port 5001

# Or via Ray Serve
alphaswarm serve ray models:/alphaswarm-lgbm/Production --num-replicas 4

# Or package for TorchServe
alphaswarm serve torchserve models:/alphaswarm-lstm/Production --model-name alphaswarm-lstm
```

## Kubernetes

Manifests and Helm values for deploying each backend to the
`rpi_kubernetes` cluster live under `deploy/kubernetes/serving/` and are
described in [`alphaswarm_docs/docs/how-to/mlops/k8s-deployment.md`](./k8s-deployment.md)
(Phase 5).


<!-- https://alpha-swarm.ai/how-to/operations/add-new-provider -->
# Adding a new InfrastructureProvider to alphaswarm_controller
> Add a provider when you need to manage workloads on a backend the existing five (`docker_compose`, `kubernetes`, `aws`, `azure`, `gcp`) don't cover. Examples: Nomad, Fly.io, Render, on-prem VMs via Sa...

# Adding a new InfrastructureProvider to alphaswarm_controller

Step-by-step guide for shipping a new `InfrastructureProvider` implementation (AGENTS rule 45 / ADR 004).

## When to add a new provider

Add a provider when you need to manage workloads on a backend the existing five (`docker_compose`, `kubernetes`, `aws`, `azure`, `gcp`) don't cover. Examples: Nomad, Fly.io, Render, on-prem VMs via Salt/Ansible.

## Checklist

### 1. Sketch the credential chain

What env vars does the backend's SDK read? How does it discover credentials in CI vs on a developer laptop vs in production?

Document this in your provider's `_check_credentials()` so the health probe can fail loudly when credentials are missing.
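A minimal sketch of such a probe, assuming a Nomad-style backend configured through environment variables. The helper name and the variable list are illustrative; a real `_check_credentials()` would follow whatever discovery chain the backend's SDK uses.

```python
import os

# Illustrative variable names for a Nomad-style backend.
REQUIRED_ENV = ("NOMAD_ADDR", "NOMAD_TOKEN")


def missing_credentials(env=None, required=REQUIRED_ENV) -> list[str]:
    """Return the names of required variables that are unset or empty."""
    env = os.environ if env is None else env
    return [name for name in required if not env.get(name)]


missing = missing_credentials({"NOMAD_ADDR": "http://nomad.example:4646"})
if missing:
    # A health probe would surface this instead of failing silently.
    print("missing credentials:", ", ".join(missing))
```

Returning the list of missing names, rather than a bare boolean, lets the health probe report exactly what the operator has to set.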

### 2. Create the provider module

```python
# alphaswarm_controller/src/alphaswarm_controller/providers/<name>.py
# The values below use "nomad" (the example kind from step 3); substitute
# your backend's alias and ProviderKind member.
from alphaswarm_core.providers.protocol import (
    InfrastructureProvider,
    InfrastructureProviderError,
    InfrastructureProviderUnavailable,
    ProviderKind,
)
from alphaswarm_core.providers.registry import register_provider_class


@register_provider_class("nomad", replace=True)
class MyProvider(InfrastructureProvider):
    provider_kind = ProviderKind.NOMAD
    provider_alias = "nomad"

    async def health(self) -> ProviderHealth: ...
    async def start(self, spec: DeploymentSpec) -> DeploymentStatus: ...
    async def stop(self, service_id: str, *, namespace=None) -> DeploymentStatus: ...
    async def scale(self, service_id, replicas, *, namespace=None) -> DeploymentStatus: ...
    async def status(self, service_id: str, *, namespace=None) -> DeploymentStatus: ...
    async def list_deployments(self, *, namespace=None) -> list[DeploymentStatus]: ...

    # Optional — override if your backend supports it.
    async def get_config(self, service_id: str, *, namespace=None) -> ServiceConfig: ...
    async def apply_config(self, patch: ConfigMapPatch) -> bool: ...
    async def stream_metrics(self, service_id, *, namespace=None, interval_seconds=10.0): ...
```

### 3. Add a new `ProviderKind`

```python
# alphaswarm_core/src/alphaswarm_core/providers/protocol.py
class ProviderKind(str, Enum):
    DOCKER_COMPOSE = "docker_compose"
    KUBERNETES = "kubernetes"
    AWS = "aws"
    AZURE = "azure"
    GCP = "gcp"
    NOMAD = "nomad"  # <-- your new kind
```

### 4. Register in the bootstrap helper

```python
# alphaswarm_controller/src/alphaswarm_controller/providers/__init__.py
for module_name in (
    "alphaswarm_controller.providers.docker_compose",
    "alphaswarm_controller.providers.kubernetes",
    "alphaswarm_controller.providers.aws",
    "alphaswarm_controller.providers.azure",
    "alphaswarm_controller.providers.gcp",
    "alphaswarm_controller.providers.nomad",  # <-- add yours ("nomad" is the example)
):
    ...
```

### 5. Optional deps go in `pyproject.toml` extras

```toml
[project.optional-dependencies]
nomad = [  # example extra name; use your provider's alias
    "sdk-package>=X,<Y",
]
```

### 6. Write contract tests

Two test files:

```python
# alphaswarm_controller/tests/providers/test_<name>.py: unit tests over the
# translation helpers (e.g. spec_to_<name>, response_to_status)

# alphaswarm_controller/tests/providers/test_<name>_integration.py: full
# contract test against a mocked SDK (moto for AWS, MagicMock for others)
```

Reuse the assertion patterns in `test_docker_compose.py` and `test_kubernetes.py`.

### 7. Update the bootstrap registry test

```python
# tests/providers/test_registry.py
def test_bootstrap_registers_all() -> None:
    registry = bootstrap()
    for expected in ("docker_compose", "kubernetes", "aws", "azure", "gcp", "nomad"):  # "nomad" = your new alias
        assert expected in registry.aliases()
```

### 8. Update the README + this runbook

Add your provider to the table in [`alphaswarm_controller/README.md`](https://github.com/julianwileymac/alphaswarm/blob/main/alphaswarm_controller/README.md) and the per-cloud sections below.

## Per-cloud setup notes

### AWS

- Active provider: `ALPHASWARM_CP_PROVIDER=aws`
- Credentials: standard boto3 chain (env vars / `~/.aws/credentials` / EC2 / EKS pod identity / WebIdentity)
- IAM minimum: `ecs:DescribeServices`, `ecs:UpdateService`, `ssm:GetParameter*`, `ssm:PutParameter*`, plus EKS read perms when using the K8s sub-path

### Azure

- Active provider: `ALPHASWARM_CP_PROVIDER=azure`
- Credentials: `azure-identity` chain (env vars / Managed Identity / federated identity / Azure CLI)
- IAM minimum: Contributor on the AKS / Container Instances resource group

### GCP

- Active provider: `ALPHASWARM_CP_PROVIDER=gcp`
- Credentials: `GOOGLE_APPLICATION_CREDENTIALS` env var pointing to a service account JSON, OR Workload Identity (preferred in production)
- IAM minimum: `run.developer`, `container.developer`, `secretmanager.admin` (per project)

## Definition of done

- [ ] Provider class registered + `provider_kind` matches alias
- [ ] All seven abstract methods implemented (or raise `InfrastructureProviderUnavailable` with a structured message)
- [ ] Credential probe in `health()` returns a useful error when creds are missing
- [ ] Unit tests + contract tests passing
- [ ] `tests/providers/test_registry.py` updated
- [ ] README + this runbook updated
- [ ] CI matrix builds + tests with the new optional dep group


<!-- https://alpha-swarm.ai/how-to/operations/alphaswarm-fund-blue-green-cutover -->
# AlphaSwarm.FUND Blue/Green Cutover
> - Overlay: `alphaswarm_platform/deployments/kubernetes/overlays/tower-green/` - Tunnel lane: `alphaswarm_platform/deployments/kubernetes/edge/cloudflared-alphaswarm-green/` - Verification: `scripts/verify_blue_green_cutov...

# AlphaSwarm.FUND Blue/Green Cutover

Runbook for migrating `alphaswarm.fund` traffic to the tower cluster with a short,
controlled DNS/tunnel switch and an immediate rollback path.

## Green lane artifacts

- Overlay: `alphaswarm_platform/deployments/kubernetes/overlays/tower-green/`
- Tunnel lane: `alphaswarm_platform/deployments/kubernetes/edge/cloudflared-alphaswarm-green/`
- Verification: `scripts/verify_blue_green_cutover.sh`

Green hostnames:

- `alphaswarm-green.alphaswarm.fund`
- `api-green.alphaswarm.fund`
- `manage-green.alphaswarm.fund`

## 1) Pre-cutover prep

1. Ensure `tower-dev` is healthy:

   ```bash
   bash scripts/verify_tower_cluster.sh
   ```

2. Update Auth0 app allow-lists so both blue and green URLs are valid during transition.
   Use `alphaswarm_platform/terraform/modules/auth0_identity` inputs:
   - `callback_urls` + `cutover_callback_urls`
   - `logout_urls` + `cutover_logout_urls`
   - `web_origins` + `cutover_web_origins`

3. Create the green tunnel token secret:

   ```bash
   token="$(cloudflared tunnel token alphaswarm-fund-edge-green)"
   kubectl -n alphaswarm-edge create secret generic cloudflared-alphaswarm-green-token \
     --from-literal=token="$token" \
     --dry-run=client -o yaml | kubectl apply -f -
   ```

## 2) Deploy green lane

```bash
kubectl apply -k alphaswarm_platform/deployments/kubernetes/edge/cloudflared-alphaswarm-green/
kubectl apply -k alphaswarm_platform/deployments/kubernetes/overlays/tower-green/
```

## 3) Validate before switch

```bash
bash scripts/verify_blue_green_cutover.sh
CHECK_EXTERNAL=true bash scripts/verify_blue_green_cutover.sh
```

## 4) Cut over traffic

Perform the controlled switch in Cloudflare:

- point DNS/app routing to green hostnames (or update tunnel ingress mapping)
- confirm health endpoints:
  - `https://alphaswarm-green.alphaswarm.fund`
  - `https://api-green.alphaswarm.fund/livez`
  - `https://manage-green.alphaswarm.fund/manage/livez`

Once stable, update canonical host routing (`alphaswarm.fund`, `api.alphaswarm.fund`,
`manage.alphaswarm.fund`) to the tower green lane.
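The health-endpoint confirmation in this step can be scripted. The sketch below is an assumption-laden illustration, not a replacement for `scripts/verify_blue_green_cutover.sh`; it treats any 2xx response as healthy, and `check_endpoints` is a hypothetical helper.

```python
import urllib.error
import urllib.request

GREEN_HEALTH_URLS = (
    "https://alphaswarm-green.alphaswarm.fund",
    "https://api-green.alphaswarm.fund/livez",
    "https://manage-green.alphaswarm.fund/manage/livez",
)


def http_status(url: str, timeout: float = 5.0) -> int:
    """Return the HTTP status code, or 0 when the host is unreachable."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as exc:
        return exc.code  # the server answered, but with an error status
    except (urllib.error.URLError, TimeoutError):
        return 0


def check_endpoints(urls, fetch_status=http_status) -> dict:
    """Map each URL to True when it answers with a 2xx status."""
    return {url: 200 <= fetch_status(url) < 300 for url in urls}


if __name__ == "__main__":
    for url, healthy in check_endpoints(GREEN_HEALTH_URLS).items():
        print("OK  " if healthy else "FAIL", url)
```

Passing `fetch_status` as a parameter keeps the check testable without network access.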

## 5) Rollback

Immediate rollback commands:

```bash
kubectl apply -k alphaswarm_platform/deployments/kubernetes/overlays/tower-dev/
kubectl delete -k alphaswarm_platform/deployments/kubernetes/edge/cloudflared-alphaswarm-green/
```

Then restore blue DNS/tunnel routing and rerun baseline checks.


<!-- https://alpha-swarm.ai/how-to/operations/auth0-k8s-checklist -->
# Auth0 checklist for Kubernetes login
> Current local `.env` discovery:

# Auth0 checklist for Kubernetes login

This checklist captures the Auth0-side changes required for the Kubernetes
deployment to support login through the Vite SPA and JWT validation in
`alphaswarm-core` / `alphaswarm_controller`.

Current local `.env` discovery:

- Auth0 tenant domain: `alphaswarm-fund.us.auth0.com`
- SPA client id: present in `.env`
- SPA / confidential client secret: present in `.env`
- M2M client id/secret: **missing**; create a dedicated M2M app before enabling
  the Auth0 Action in production.
- SCIM bearer token hash: **missing**; SCIM remains disabled until created.

Do not paste secrets into this document. Store secrets in `.env` locally and in
your production secret manager / sealed-secret pipeline for Kubernetes.

## 1. API Resource Server

Auth0 Dashboard path: **Applications → APIs → Create API**.

Create or update:

| Field | Value |
| --- | --- |
| Name | `AlphaSwarm Management API` |
| Identifier | `https://api.alphaswarm.internal/manage` |
| Signing Algorithm | `RS256` |
| RBAC | enabled |
| Add Permissions in Access Token | enabled |

Required permissions:

| Permission | Purpose |
| --- | --- |
| `read:infrastructure` | View deployment status, pod health, logs, non-secret config |
| `manage:agents` | Start/stop/restart/scale assigned agents and bot workloads |
| `manage:infrastructure` | Deploy/update services and non-secret config within assigned org |
| `admin:cluster` | Full cluster control and resource-filter bypass |
| `scim:write` | Auth0 SCIM provisioning into `/scim/v2/*` |

Migration permissions to retain until all older AlphaSwarm routes are moved to the new
control-plane scope grid:

| Permission | Why keep temporarily |
| --- | --- |
| `data:read` | Existing AlphaSwarm data/read routes still check it |
| `data:write` | Existing AlphaSwarm mutation routes still check it |
| `deploy:run` | Existing Terraform control-plane routes still check it |
| `deploy:halt` | Existing Terraform halt/kill-switch integrations still check it |

Terraform source of truth:

- [`alphaswarm_platform/terraform/modules/auth0_identity/main.tf`](../../alphaswarm_platform/terraform/modules/auth0_identity/main.tf)
- [`alphaswarm_platform/terraform/modules/auth0_identity/variables.tf`](../../alphaswarm_platform/terraform/modules/auth0_identity/variables.tf)

## 2. SPA Application (`alphaswarm-client`)

Auth0 Dashboard path: **Applications → Applications → Create Application → Single Page Application**.

Use the `.env` client id already present for this app if it is the AlphaSwarm SPA.

Configure:

| Setting | Values |
| --- | --- |
| Application Type | Single Page Application |
| Token Endpoint Authentication Method | `None` |
| Grant Types | Authorization Code, Refresh Token |
| Allowed Callback URLs | `http://127.0.0.1:3001`, `http://localhost:3001`, `https://alpha-swarm.ai` |
| Allowed Logout URLs | `http://127.0.0.1:3001`, `http://localhost:3001`, `https://alpha-swarm.ai` |
| Allowed Web Origins | `http://127.0.0.1:3001`, `http://localhost:3001`, `https://alpha-swarm.ai` |
| Allowed Origins (CORS) | same as Web Origins |

Kubernetes ConfigMap values now generated from `.env`:

```yaml
VITE_AUTH_REQUIRED: "true"
VITE_AUTH0_DOMAIN: "alphaswarm-fund.us.auth0.com"
VITE_AUTH0_CLIENT_ID: ""
VITE_AUTH0_AUDIENCE: "https://api.alphaswarm.internal/manage"
```

## 3. Machine-to-Machine Application (`alphaswarm-m2m`)

Auth0 Dashboard path: **Applications → Applications → Create Application → Machine to Machine**.

Create a dedicated app; do **not** reuse the SPA client secret for M2M.

Grant it access to the API Resource Server:

| API | Scopes |
| --- | --- |
| `https://api.alphaswarm.internal/manage` | `read:infrastructure`, `manage:infrastructure`, `data:read`, `scim:write`, `deploy:run`, `deploy:halt` |

Store:

| AlphaSwarm variable | Source |
| --- | --- |
| `ALPHASWARM_AUTH_M2M_CLIENT_ID` | M2M Application Client ID |
| `ALPHASWARM_AUTH_M2M_CLIENT_SECRET` | M2M Application Client Secret |
| `ALPHASWARM_AUTH_M2M_AUDIENCE` | `https://api.alphaswarm.internal/manage` |

The current `.env` does not yet contain the M2M client id/secret, so the
generated Kubernetes Secret intentionally leaves `ALPHASWARM_AUTH_M2M_CLIENT_SECRET`
empty/placeholder until you create the app.

## 4. Post-Login Action

The Action template lives at:

[`alphaswarm_platform/terraform/modules/auth0_identity/post_login_action.js.tftpl`](../../alphaswarm_platform/terraform/modules/auth0_identity/post_login_action.js.tftpl)

Configure the deployed Action with:

| Placeholder | Value |
| --- | --- |
| `claims_namespace` | `https://alphaswarm.internal/` |
| `api_audience` | `https://api.alphaswarm.internal/manage` |
| `sync_url` | production: `https://api.alpha-swarm.ai/_internal/auth0/sync` |

The Action injects these custom claims:

| Claim | Example |
| --- | --- |
| `https://alphaswarm.internal/org_id` | `org_abc123` |
| `https://alphaswarm.internal/workspace_id` | `workspace_abc123` |
| `https://alphaswarm.internal/roles` | `["alphaswarm-operator"]` |
| `https://alphaswarm.internal/resources` | `["alphaswarm-api", "alphaswarm-worker"]` |
| `https://alphaswarm.internal/scopes` | `["read:infrastructure", "manage:agents"]` |

The backend still reads the legacy `https://alphaswarm/` namespace for one release,
but new tokens should use `https://alphaswarm.internal/`.
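
To confirm that a freshly issued access token carries the claims under the new namespace, the payload can be decoded locally. The sketch below only decodes; it does not verify the signature, so treat it as a debugging aid. It assumes `base64 -d` is available (GNU coreutils or a recent macOS), and the function name `jwt_payload` is illustrative.

```bash
#!/usr/bin/env bash
# Hypothetical helper: print the JSON payload (second segment) of a JWT.
# This does NOT validate the token; it only base64url-decodes it.
jwt_payload() {
  local token=$1 payload
  # Take the middle segment and map the base64url alphabet back to base64.
  payload=$(printf '%s' "$token" | cut -d. -f2 | tr '_-' '/+')
  # JWTs strip '=' padding; restore it so base64 accepts the input.
  case $(( ${#payload} % 4 )) in
    2) payload="${payload}==" ;;
    3) payload="${payload}=" ;;
  esac
  printf '%s' "$payload" | base64 -d
}

# Usage: paste a token obtained from a test login (never commit it):
# jwt_payload "$ACCESS_TOKEN" | grep -o 'https://alphaswarm.internal/[a-z_]*'
```

If the output still shows only `https://alphaswarm/` keys, the token was minted by the old Action and relies on the one-release legacy fallback.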

## 5. Roles and assignments

Create roles:

| Role | Permissions |
| --- | --- |
| `alphaswarm-viewer` | `read:infrastructure`, `data:read` |
| `alphaswarm-operator` | `read:infrastructure`, `manage:agents`, `data:read` |
| `alphaswarm-admin` | `read:infrastructure`, `manage:agents`, `manage:infrastructure`, `data:read`, `data:write`, `deploy:run`, `deploy:halt` |
| `alphaswarm-superadmin` | all above plus `admin:cluster`, `scim:write` |

Assign your test user to `alphaswarm-superadmin` first to verify end-to-end login.
Then move down to `alphaswarm-operator` or `alphaswarm-viewer` to verify resource filtering.

## 6. Kubernetes apply order

The safe apply order is:

```powershell
# 1. Apply tracked, non-secret manifests and placeholder Secret templates.
kubectl apply -k alphaswarm_platform/deployments/kubernetes/overlays/dev

# 2. Apply locally generated real Secret manifests (git-ignored).
kubectl apply -f alphaswarm_platform/deployments/kubernetes/generated/alphaswarm-secrets.local.yaml
kubectl apply -f alphaswarm_platform/deployments/kubernetes/generated/alphaswarm-admin-secrets.local.yaml

# 3. Restart workloads so env vars are re-read.
kubectl -n alphaswarm rollout restart deployment/alphaswarm-core deployment/alphaswarm-client deployment/alphaswarm-worker
kubectl -n alphaswarm-admin rollout restart deployment/alphaswarm-cp
```

Do not run this until `kubectl get ns` succeeds against the intended context.
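
The context precondition can be enforced rather than remembered. The following is a sketch in bash (the apply block above is PowerShell; adapt as needed). The function name `require_context` and the example context name are assumptions, not something the repository provides.

```bash
#!/usr/bin/env bash
# Hypothetical guard: continue only if kubectl points at the expected
# context and the API server answers `kubectl get ns`.
require_context() {
  local expected=$1 current
  current=$(kubectl config current-context 2>/dev/null) || {
    echo "kubectl has no current context" >&2
    return 1
  }
  if [ "$current" != "$expected" ]; then
    echo "refusing: current context is '$current', expected '$expected'" >&2
    return 1
  fi
  # Same precondition the checklist states: the namespace list must load.
  kubectl get ns >/dev/null || {
    echo "cluster unreachable in context '$current'" >&2
    return 1
  }
}

# Usage (context name is a placeholder):
# require_context my-dev-context && \
#   kubectl apply -k alphaswarm_platform/deployments/kubernetes/overlays/dev
```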

## 7. Verification

```powershell
kubectl -n alphaswarm get configmap alphaswarm-config -o yaml
kubectl -n alphaswarm-admin get configmap alphaswarm-config -o yaml
kubectl -n alphaswarm get secret alphaswarm-secrets -o jsonpath='{.data.ALPHASWARM_AUTH_OIDC_CLIENT_SECRET}'
kubectl -n alphaswarm-admin get secret alphaswarm-secrets -o jsonpath='{.data.ALPHASWARM_AUTH_OIDC_CLIENT_SECRET}'
```

Frontend login should no longer show:

> Authentication is required ... frontend was not given identity-provider configuration

Instead, it should redirect to Auth0 Universal Login for the
`alphaswarm-fund.us.auth0.com` tenant.
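
The two `get secret` commands above print the base64-encoded value to the terminal. When the goal is only to confirm that the key is populated, a check that never echoes the value fits better with the rule about not exposing secrets. A bash sketch, with the hypothetical helper name `secret_key_is_set`:

```bash
#!/usr/bin/env bash
# Hypothetical helper: report whether a Secret key is non-empty
# without printing its contents.
secret_key_is_set() {
  local ns=$1 secret=$2 key=$3 value
  value=$(kubectl -n "$ns" get secret "$secret" -o "jsonpath={.data.$key}") || return 1
  if [ -n "$value" ]; then
    # Only the length of the base64 text is shown, never the value.
    echo "$ns/$secret: $key is set (${#value} base64 chars)"
  else
    echo "$ns/$secret: $key is EMPTY" >&2
    return 1
  fi
}

# Usage (names taken from the verification commands above):
# secret_key_is_set alphaswarm       alphaswarm-secrets ALPHASWARM_AUTH_OIDC_CLIENT_SECRET
# secret_key_is_set alphaswarm-admin alphaswarm-secrets ALPHASWARM_AUTH_OIDC_CLIENT_SECRET
```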


<!-- https://alpha-swarm.ai/how-to/operations/aws-deploy -->
# how-to/operations/aws-deploy

# AWS Hybrid Deployment Guide

> Companion runbook: [aws-runbook.md](aws-runbook.md).
> Architecture decision: hybrid (EKS Karpenter quant runtime + ECS Fargate
> admin + Bedrock AgentCore) — chosen per the blueprint §16.3 scope clarifications.

This guide walks you through deploying AlphaSwarm to AWS for the first time.
Subsequent rollouts go through the normal `terraform-pipeline.yml` and
`build-publish.yml` CI workflows; this page is only for the bootstrap
path. Allow ~3–4 hours end-to-end (most of the wall clock is Bedrock
model-access approval and Cloudflare propagation).

## Topology summary

```mermaid
flowchart LR
    publicUsers[Marketing users] --> cloudflare[Cloudflare tunnel]
    operators[Operators / staff] --> cloudfront[CloudFront]

    subgraph aws [AWS account]
        cloudfront --> alb[ALB]
        alb --> admin[alphaswarm-admin ECS Fargate]
        alb --> proxy[AgentCore proxy ECS Fargate]
        proxy --> ac[AgentCore Runtime ARM64]
        ac --> bedrock[Bedrock FMs Claude 4.5 Titan v2]
        ac --> kb[Knowledge Base OpenSearch Serverless]
        eks[EKS Karpenter quant runtime] --> rds[RDS PG 16]
        eks --> redis[ElastiCache Redis Serverless]
        eks --> s3[S3 Iceberg warehouse]
    end
```

## Prerequisites

| Item | How to confirm |
| --- | --- |
| AWS Organization with Control Tower enrolled | Console -> AWS Control Tower -> Landing zone is `Available`. |
| Seven member accounts: `management`, `log-archive`, `security-audit`, `shared-services`, `dev`, `staging`, `prod` | `aws organizations list-accounts` from the management account. |
| Bedrock model access enabled (Claude Sonnet 4.5, Claude Haiku 4.5, Titan Text Embeddings v2) per workload account in us-east-1 | Console -> Bedrock -> Model access — must show `Access granted`. This is the only manual console step in the bootstrap path. |
| GitHub repo `julianwileymac/alphaswarm` admin access | Required to create the three GitHub Environments (`dev`, `staging`, `prod`). |
| Local `terraform >= 1.10.0`, `aws-cli v2`, `kubectl >= 1.30`, `kustomize >= 5.0`, `cosign >= 2.4`, `helm >= 3.16` | `terraform version` etc. |

## Phase 1 — Bootstrap (one-time, manual)

The bootstrap stack provisions the state backend (S3 + DynamoDB +
KMS) + GitHub OIDC provider in every account. Run with admin
credentials per account; nothing in the regular workflow ever needs
admin afterwards.

```bash
# From the management account first:
cd infrastructure/bootstrap
terraform init
terraform apply -auto-approve

# Capture the published outputs (state bucket, DynamoDB table, KMS key)
# into the per-account backend.hcl files:
terraform output -json > /tmp/bootstrap-outputs.json
```

Repeat per workload account by assuming the `OrganizationAccountAccessRole`
in each one (Control Tower wires the trust automatically) and re-running
`terraform init && terraform apply` with the per-account state bucket
name.

## Phase 2 — Landing zone IaC (`infrastructure/envs/`)

The landing zone tree provisions the shared infrastructure inside each
workload account: VPC, EKS cluster, Karpenter, ECR, RDS Postgres,
MSK Kafka, S3 data lake, observability stack. Apply through GitHub
Actions (NEVER `terraform apply` from a laptop in CI mode):

1. In the GitHub repo settings, create the three Environments
   (`dev`, `staging`, `prod`) and add an `AWS_DEPLOYER_ROLE_ARN` repo
   variable per env (the ARN comes from the bootstrap output).
2. Push a no-op commit to `main` so the `terraform-pipeline.yml`
   workflow runs the plan against `dev`. Review the plan diff in
   the workflow summary.
3. Click "Run workflow" -> `tree=infrastructure`, `env=dev`,
   `action=apply`. The job assumes
   `vars.TF_APPLY_ROLE_dev` and runs `terraform apply -auto-approve`
   against `infrastructure/envs/dev/`.
4. Promote to staging + prod by repeating step 3 with the matching
   env. Staging requires one reviewer; prod requires two
   (GitHub Environment protection rules).

## Phase 3 — Application IaC (`alphaswarm_platform/terraform/environments/live`)

The application tree composes the 8 new modules
(`bedrock-agentcore`, `bedrock-knowledge-base`,
`opensearch-serverless`, `cognito-userpool`, `cloudfront`, `alb`,
`ecs-fargate-control-plane`, `eventbridge-stepfunctions`) PLUS the
heritage `alphaswarm_platform/terraform/modules/` composition. Run via:

```bash
# Render backend.hcl from the bootstrap SSM outputs:
cd alphaswarm_platform/terraform/environments/live
aws ssm get-parameter --name /alphaswarm/prod/tfstate_bucket_name \
  --query 'Parameter.Value' --output text > /tmp/bucket
aws ssm get-parameter --name /alphaswarm/prod/tfstate_kms_key_arn \
  --query 'Parameter.Value' --output text > /tmp/kms
aws ssm get-parameter --name /alphaswarm/prod/tfstate_dynamodb_table \
  --query 'Parameter.Value' --output text > /tmp/lock
cat <<EOF > backend.hcl
bucket         = "$(cat /tmp/bucket)"
key            = "alphaswarm_platform/live/terraform.tfstate"
region         = "us-east-1"
encrypt        = true
kms_key_id     = "$(cat /tmp/kms)"
dynamodb_table = "$(cat /tmp/lock)"
EOF
```

Then trigger the `terraform-pipeline.yml` workflow with
`tree=alphaswarm_platform`, `env=live`, `action=plan` -> review -> `action=apply`.

## Phase 4 — Image builds

The `build-publish.yml` workflow ships every AlphaSwarm container to ECR.
`alphaswarm-agent` is ARM64-only (AgentCore Runtime requirement); every
other service builds multi-arch.

```bash
git tag v1.0.0
git push origin v1.0.0
# Watch the workflow — it pushes the 8 services + signs with Cosign +
# emits SLSA provenance + uploads SBOMs.
```

## Phase 5 — Seed secrets + Bedrock + Knowledge Base

```bash
# Broker credentials (paper trading first):
aws secretsmanager put-secret-value \
  --secret-id alphaswarm/prod/broker/alpaca \
  --secret-string '{"api_key":"","secret_key":""}'

# Upload research docs to the KB source bucket — the EventBridge
# rule from modules/eventbridge-stepfunctions triggers a Bedrock
# ingestion job on every PutObject:
aws s3 sync ./research/papers/ s3://$(aws ssm get-parameter \
  --name /alphaswarm/prod/kb_source_bucket \
  --query 'Parameter.Value' --output text)/
```

## Phase 6 — Smoke

```bash
# 1. Confirm the AgentCore Runtime invokes via the smoke workflow:
gh workflow run bedrock-smoke.yml

# 2. Direct invoke from a deployer-role-assumed shell:
aws bedrock-agentcore invoke-agent-runtime \
  --agent-runtime-arn $(aws ssm get-parameter \
    --name /alphaswarm/prod/agentcore_runtime_arn \
    --query 'Parameter.Value' --output text) \
  --payload '{"spec_name":"dataset_loading_assistant","inputs":{"prompt":"ping"}}' \
  /tmp/response.json

# 3. Verify the trace shows up in X-Ray (run id from the smoke output):
aws xray get-trace-summaries \
  --time-range-type TraceId \
  --start-time $(date -u -d '5 minutes ago' +%s) \
  --end-time   $(date -u +%s) \
  --filter-expression "service(\"alphaswarm-admin\")"
```

## Promotion

| From | To | Trigger |
| --- | --- | --- |
| `main` push | dev apply | `terraform-pipeline.yml` plan + auto-merge gate |
| tag `vX.Y.Z-rc.N` | staging apply | `terraform-pipeline.yml` dispatch + 1 reviewer |
| tag `vX.Y.Z` | prod apply | `terraform-pipeline.yml` dispatch + 2 reviewers |
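
The tag conventions in the table can be checked mechanically before pushing. This sketch maps a tag name to the environment it would promote to. The function name `target_env_for_tag` is illustrative, and the patterns are inferred from the `vX.Y.Z` and `vX.Y.Z-rc.N` shapes in the table, not taken from the workflow file.

```bash
#!/usr/bin/env bash
# Hypothetical classifier mirroring the promotion table:
#   vX.Y.Z-rc.N -> staging, vX.Y.Z -> prod, anything else -> error.
target_env_for_tag() {
  local tag=$1
  if [[ $tag =~ ^v[0-9]+\.[0-9]+\.[0-9]+-rc\.[0-9]+$ ]]; then
    echo staging
  elif [[ $tag =~ ^v[0-9]+\.[0-9]+\.[0-9]+$ ]]; then
    echo prod
  else
    echo "unrecognised tag: $tag" >&2
    return 1
  fi
}

# Usage:
# env=$(target_env_for_tag "$(git describe --tags --exact-match)") && echo "$env"
```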

## Rollback

See [aws-runbook.md](aws-runbook.md) for the rollback playbook.


<!-- https://alpha-swarm.ai/how-to/operations/aws-minimum-rollback -->
# how-to/operations/aws-minimum-rollback

# AWS Minimum Tier Rollback Playbook

> Companion to
> [aws-minimum-single-account.md](aws-minimum-single-account.md) (deploy)
> and [aws-runbook.md](aws-runbook.md) (full-stack on-call).
>
> This page is the dedicated rollback procedure for the single-account
> minimum tier deployed via
> [infrastructure/envs/minimum/scripts/deploy.sh](../../../../infrastructure/envs/minimum/scripts/deploy.sh).

## TL;DR — One Command Rollback

```bash
cd infrastructure/envs/minimum
ACCOUNT_ALIAS=minimum AWS_REGION=us-east-1 bash scripts/destroy.sh
```

That command:

1. Checks the caller's AWS account id matches the snapshot from deploy.
2. Destroys the application tier (Cognito + ALB + Fargate).
3. Disables RDS deletion protection.
4. Takes a final RDS snapshot (skip with `DESTROY_RDS_SKIP_SNAPSHOT=yes`).
5. Destroys the infrastructure tier (VPC + RDS + Redis + IAM + alarms).
6. Lists any orphan resources that survived destroy.
7. Retains the bootstrap state backend (S3 + DynamoDB + KMS + OIDC).

Total wall-clock: ~15 minutes (RDS snapshot is the long pole).

## When to Roll Back

| Situation | Action |
| --- | --- |
| Wrong account / region | `bash scripts/destroy.sh` immediately; the identity guard will catch the mismatch before destroying anything. |
| Cost overrun | `bash scripts/destroy.sh` then re-deploy with smaller instance types. |
| Failed apply mid-flight | `bash scripts/destroy.sh` (idempotent — resumes from wherever apply stopped). |
| Need clean slate | `DESTROY_BOOTSTRAP=yes bash scripts/destroy.sh` (also nukes the state backend). |
| RDS data corruption | `DESTROY_RDS_SKIP_SNAPSHOT=yes bash scripts/destroy.sh` (skip the bad-data snapshot). |
| Security incident | See [aws-runbook.md](aws-runbook.md) §"Halt every AgentCore session" first; THEN consider destroy. |

## Pre-Rollback Checklist

Before running `destroy.sh`:

- [ ] **Confirm there's no critical data** in RDS that hasn't been backed up
      out-of-band. The default rollback takes a final snapshot, but if you
      pass `DESTROY_RDS_SKIP_SNAPSHOT=yes`, data is gone.
- [ ] **Confirm no other team / env is using the bootstrap state backend.**
      The default `DESTROY_BOOTSTRAP=no` preserves it. Flipping to `yes`
      affects every env that shares the same `alphaswarm-tfstate-` bucket.
- [ ] **Capture forensics first** for a security incident:
      ```bash
      bash scripts/snapshot.sh capture
      ```
      Then destroy.

## The Six Stages

### Stage 1 — Identity Guard

```bash
# destroy.sh reads .snapshots/latest/deploy-receipt.json and confirms
# the caller's STS identity matches.
# If you see: "deploy receipt is for account 111 but caller is 222"
# → STOP. You're in the wrong account. Switch profiles + retry.
```

The receipt is created by `deploy.sh` Step 8 and stored at
`infrastructure/envs/minimum/.snapshots//deploy-receipt.json`.
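
For a manual run that bypasses `destroy.sh`, the same guard can be approximated by hand. This is a sketch of the idea only; the real script's logic and the receipt's field names are not shown in this guide, so the expected account id is passed in as an argument. `aws sts get-caller-identity` is the standard way to read the caller's account.

```bash
#!/usr/bin/env bash
# Hypothetical re-implementation of the identity guard; destroy.sh may differ.
# $1 = the account id recorded at deploy time (read it from the receipt).
assert_same_account() {
  local expected=$1 actual
  actual=$(aws sts get-caller-identity --query Account --output text) || return 1
  if [ "$actual" != "$expected" ]; then
    echo "deploy receipt is for account $expected but caller is $actual" >&2
    return 1
  fi
}

# Usage (the account id is a placeholder):
# assert_same_account 111111111111 && terraform destroy
```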

### Stage 2 — Application Tier

The app tier holds the runtime contract — destroying it FIRST means
the ALB target groups release their ECS service refs, so the
infrastructure-tier ALB delete doesn't 409 with "still in use".

Manual override if needed:

```bash
cd alphaswarm_platform/terraform/environments/minimum
terraform init -reconfigure -backend-config=backend.hcl
terraform destroy -auto-approve
```

### Stage 3 — RDS Deletion-Protection Bypass

```bash
aws rds modify-db-instance \
  --db-instance-identifier alphaswarm-admin-min \
  --no-deletion-protection --apply-immediately
```

`destroy.sh` does this automatically; the manual command above is the
fallback if the script can't reach RDS for some reason.

### Stage 4 — Infrastructure Tier

```bash
cd infrastructure/envs/minimum
terraform init -reconfigure -backend-config=backend.hcl
terraform destroy -auto-approve
```

This stage also removes the ECR repos: they're declared in
`modules/ecr-repositories` with no `prevent_destroy`. The lifecycle
policy (most recent 30 tagged images, 14 days) only applies while a
repo exists, because deleting an ECR repository deletes its images too.

### Stage 5 — Orphan Sweep

If `destroy.sh` reports orphans:

```bash
[DESTROY]   ⚠ found 3 orphan resource(s) — review + hand-delete:
    arn:aws:ec2:us-east-1:123:network-interface/eni-0a1b2c3d
    arn:aws:logs:us-east-1:123:log-group:/aws/ecs/alphaswarm-admin-min
    arn:aws:elasticloadbalancing:us-east-1:123:listener-rule/...
```

Hand-delete each:

```bash
# ENI stuck in 'available' from a deleted ECS task
aws ec2 delete-network-interface --network-interface-id eni-0a1b2c3d

# Log group with retention != never (terraform doesn't auto-delete these)
aws logs delete-log-group --log-group-name /aws/ecs/alphaswarm-admin-min

# Orphan listener rule (rare — usually the ALB destroy covers it)
aws elbv2 delete-rule --rule-arn arn:aws:elasticloadbalancing:...
```

Common orphan sources:

- **NAT-attached EIPs** — the NAT Gateway is gone but the EIP is not
  released automatically.
- **ECS task ENIs** in `available` state — task definition was deleted
  but the ENI lingers until the underlying ENA cleanup runs.
- **CloudWatch log groups with retention != never_expire** — terraform
  doesn't delete them; they linger but cost nothing until they fill up.
- **Listener rules** with a target group that already deleted.
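
Two of the orphan sources above can be listed with read-only calls before deleting anything. This is a sketch, assuming the caller has `ec2:Describe*` permissions; the function name `list_orphan_candidates` is illustrative. It lists every unassociated EIP and available ENI in the region, not only the ones this stack created, so review the output before hand-deleting.

```bash
#!/usr/bin/env bash
# Hypothetical helper: list likely leftovers in one region (read-only).
list_orphan_candidates() {
  local region=${1:-us-east-1}

  echo "# Elastic IPs with no association:"
  aws ec2 describe-addresses --region "$region" \
    --query 'Addresses[?AssociationId==`null`].AllocationId' --output text

  echo "# ENIs in 'available' state:"
  aws ec2 describe-network-interfaces --region "$region" \
    --filters Name=status,Values=available \
    --query 'NetworkInterfaces[].NetworkInterfaceId' --output text
}

# Usage:
# list_orphan_candidates us-east-1
```

An unassociated EIP is released with `aws ec2 release-address --allocation-id <id>` once it is confirmed to belong to the torn-down stack.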

### Stage 6 — Bootstrap (Optional)

Only when `DESTROY_BOOTSTRAP=yes`. The state bucket has Object Lock
GOVERNANCE, so the script empties it with `--bypass-governance-retention`:

```bash
DESTROY_BOOTSTRAP=yes bash scripts/destroy.sh
```

What this destroys:

- S3 state bucket (every version + delete marker)
- DynamoDB lock table
- KMS CMK (30-day deletion window, recoverable until then)
- GitHub OIDC provider

What this does NOT destroy:

- The AWS account itself.
- AWS CloudTrail (default trail in the account; AWS bills for it
  regardless).
- The Bedrock model-access grant (console-only setting; persists
  across teardowns).

## Recovery After a Partial Destroy

If `destroy.sh` fails mid-flight:

```bash
# 1. Inspect the .destroy.log for the failed step.
tail -100 infrastructure/envs/minimum/.destroy.log

# 2. Re-run destroy; it's idempotent and resumes from wherever the previous run stopped.
bash scripts/destroy.sh

# 3. If terraform state is locked, force-unlock:
cd infrastructure/envs/minimum
terraform force-unlock <LOCK_ID>

# 4. If a specific resource is wedged, target it:
terraform destroy -target=module.rds.aws_db_instance.this -auto-approve
```

## Cost Verification

After rollback, verify $0 monthly spend in the AWS console:

- **Cost Explorer** → filter by tag `managed_by=terraform`, `env=minimum`
  → should show $0 in the current period.
- **AWS Budgets** → if the alert was wired pre-rollback, it stays armed
  with `actual_spend=0` for the period.

If non-zero spend persists 24 h after rollback:

- Check for **RDS snapshots** created at deletion (the final snapshot is
  billed as backup storage).
- Check for **CloudWatch metric streams** that may have been wired
  manually (not destroyed by `destroy.sh`).
- Check for **Route 53 hosted zones** — these have a $0.50/mo floor.

## Files Touched

| File | Created by | Destroyed by |
| --- | --- | --- |
| `alphaswarm-tfstate-` S3 bucket | `deploy.sh` (bootstrap step) | `destroy.sh` only with `DESTROY_BOOTSTRAP=yes` |
| `alphaswarm-tfstate-lock-` DynamoDB table | bootstrap | same |
| `alias/alphaswarm-tfstate` KMS key | bootstrap | same |
| GitHub OIDC provider | bootstrap | same |
| VPC `alphaswarm-min` + subnets + NAT + endpoints | infrastructure tier | `destroy.sh` step 4 |
| RDS `alphaswarm-admin-min` | infrastructure tier | step 4 (final snapshot retained) |
| ElastiCache `alphaswarm-min-redis` | infrastructure tier | step 4 |
| ECR repos | infrastructure tier | step 4 (image lifecycle policy keeps tags 14d) |
| CloudWatch alarms + dashboard | infrastructure tier | step 4 |
| ALB + Cognito + Fargate cluster | application tier | step 2 |
| `.snapshots//` | `snapshot.sh` | preserved on disk forever (listed in `.gitignore` by default) |


<!-- https://alpha-swarm.ai/how-to/operations/aws-minimum-single-account -->
# how-to/operations/aws-minimum-single-account

# Single-Account Minimum AWS Deployment

> **Companion docs:** [aws-deploy.md](aws-deploy.md) for the full
> multi-account hybrid topology; [aws-runbook.md](aws-runbook.md) for
> the operational playbook.

The cheapest deployable AlphaSwarm on AWS. Target cost: **~$140/month fixed**
plus Bedrock token spend. Skips multi-account, EKS, MSK, AgentCore
Runtime, Knowledge Base, CloudFront, and the EventBridge nightly
backtest path. Use it as a stepping stone before promoting to the
full topology.

## What you get

```mermaid
flowchart LR
    operators[Operators] --> alb[ALB HTTPS]
    alb --> cognito[Cognito User Pool]
    cognito --> alb
    alb --> admin[alphaswarm-admin ECS Fargate single task]
    admin --> rds[RDS Postgres single-AZ]
    admin --> redis[ElastiCache Redis 1-node]
    admin --> bedrock[Bedrock Claude Haiku 4.5]
```

## Pieces composed

| Tier | Module | Cost/mo |
| --- | --- | ---: |
| Network | `infrastructure/modules/vpc` (2 AZ, single NAT) | ~$32 |
| Ingress | `infrastructure/modules/alb` | ~$22 |
| Database | `infrastructure/modules/rds-postgres` (`db.t4g.medium`) | ~$45 |
| Cache | inline ElastiCache (`cache.t4g.small`, 1 node) | ~$25 |
| Compute | `infrastructure/modules/ecs-fargate-control-plane` (1 task, 0.5 vCPU + 1 GB) | ~$15 |
| Identity | `infrastructure/modules/cognito-userpool` (first 50k MAU free) | $0 |
| Container registry | `infrastructure/modules/ecr-repositories` (3 repos) | ~$1 |
| Logs | CloudWatch Logs (~1 GB ingest) | \<$1 |
| LLM | Amazon Bedrock Claude Haiku 4.5 (variable) | $? per use |
| **Fixed total** | | **~$140** |

## Files this guide refers to

- [infrastructure/envs/minimum/](../../../../infrastructure/envs/minimum/)
  — infrastructure tier (VPC + ECR + RDS + Redis + Bedrock invoke IAM)
- [alphaswarm_platform/terraform/environments/minimum/](../../../../alphaswarm_platform/terraform/environments/minimum/)
  — application tier (Cognito + ALB + ECS Fargate)
- [alphaswarm_platform/configs/terraform/minimum.yaml](../../../../alphaswarm_platform/configs/terraform/minimum.yaml)
  — `TerraformStackSpec` for the `alphaswarm deploy` CLI
- [alphaswarm_platform/configs/deployment/topology.yaml](../../../../alphaswarm_platform/configs/deployment/topology.yaml)
  `targets.aws-minimum` — topology target binding

## Six steps to live

### 1. Enable Bedrock model access (manual, console)

Console → **Bedrock** → **Model access** → request **Anthropic Claude
Haiku 4.5**. Approval is usually instant. Only this model needs
access for the minimum — Claude Sonnet 4.5 + Titan Embed v2 can wait
until you add the Knowledge Base.

### 2. Bootstrap the state backend

```bash
cd infrastructure/bootstrap
terraform init
terraform apply -auto-approve
terraform output -json | tee /tmp/bootstrap.json
```

This is the only place admin creds are required. The stack ships:

- S3 state bucket (KMS-encrypted, Object Lock GOVERNANCE)
- DynamoDB lock table
- KMS CMK for workload encryption
- GitHub OIDC provider

### 3. Apply the infrastructure tier

```bash
cd infrastructure/envs/minimum
sed "s||$(jq -r .account_id.value /tmp/bootstrap.json)|" \
  backend.hcl.example > backend.hcl
cp terraform.tfvars.example terraform.tfvars
# Edit terraform.tfvars: paste the kms_key_arn + external_id +
# github_oidc_provider_arn from /tmp/bootstrap.json.
terraform init -backend-config=backend.hcl
terraform apply
```

~12 minutes (RDS provisioning is the long pole). Outputs include the
ALB-ready VPC + every SSM parameter the application tier reads.

### 4. Push the first image

```bash
git tag v0.1.0-min
git push origin v0.1.0-min
```

[`build-publish.yml`](../../../../.github/workflows/build-publish.yml)
ships `alphaswarm-admin` (and any other matrix entries) to ECR. The
`AqpGithubDeployerMinimum` role from step 3 is what the workflow
assumes via OIDC.

### 5. Apply the application tier

```bash
cd alphaswarm_platform/terraform/environments/minimum
sed "s||$(jq -r .account_id.value /tmp/bootstrap.json)|" \
  backend.hcl.example > backend.hcl
cp terraform.tfvars.example terraform.tfvars
# Edit terraform.tfvars: paste the acm_certificate_arn_alb + the
# image tag you just pushed.
terraform init -backend-config=backend.hcl
terraform apply
```

~5 minutes. The ALB DNS appears in the outputs.

### 6. Configure AlphaSwarm runtime

The application reads the deployment endpoints from
`/alphaswarm/minimum/*` SSM. Set the env vars on the ECS task definition (or
via the application's `Settings` overrides):

```bash
ALPHASWARM_LLM_PROVIDER=bedrock
ALPHASWARM_BEDROCK_REGION=us-east-1
ALPHASWARM_AUTH_PROVIDER=aws_cognito
ALPHASWARM_AUTH_OIDC_ISSUER=
ALPHASWARM_DEPLOY_TARGET=aws
ALPHASWARM_DATABASE_URL=postgresql+psycopg://@:5432/alphaswarm
ALPHASWARM_REDIS_URL=rediss://:6379/0
```

The matching `bedrock` `ProviderSpec` is already in
[alphaswarm/llm/providers/catalog.py](../../../../alphaswarm/llm/providers/catalog.py)
(shipped in Phase D of the AWS hybrid rollout); no code change needed.
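
Several of the values above are deployment-specific, so a container entrypoint or a pre-deploy check can fail fast when one is missing. A minimal bash sketch follows; the helper name `require_env` is an assumption and is not part of the application.

```bash
#!/usr/bin/env bash
# Hypothetical pre-flight check: fail if any named variable is unset or empty.
require_env() {
  local name missing=0
  for name in "$@"; do
    # ${!name} is bash indirect expansion: the value of the variable
    # whose name is stored in $name.
    if [ -z "${!name:-}" ]; then
      echo "missing or empty: $name" >&2
      missing=1
    fi
  done
  return "$missing"
}

# Usage (variable names taken from the block in this step):
# require_env ALPHASWARM_LLM_PROVIDER ALPHASWARM_BEDROCK_REGION \
#   ALPHASWARM_AUTH_PROVIDER ALPHASWARM_AUTH_OIDC_ISSUER \
#   ALPHASWARM_DEPLOY_TARGET ALPHASWARM_DATABASE_URL ALPHASWARM_REDIS_URL
```

The function reports every missing name before returning, which avoids fixing one variable per failed rollout.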

## Verify

```bash
# Hit the ALB:
curl -sS https://$(terraform -chdir=alphaswarm_platform/terraform/environments/minimum \
                    output -raw alb_dns_name)/healthz

# Call Bedrock through the application:
curl -sS https:///api/llm/echo \
  -H "Authorization: Bearer " \
  -d '{"prompt": "ping"}'
```

The application's `router_complete` injects `aws_region_name=us-east-1`
on the Bedrock call (`_bedrock_extra_kwargs` in
[alphaswarm/llm/providers/router.py](../../../../alphaswarm/llm/providers/router.py));
boto3 walks its default credential chain to the ECS task role's IAM credentials.

## Promotion path

When ready to outgrow the minimum, add modules one at a time. The
SSM-parameter contract means application code doesn't change.

| Add when… | Append to `alphaswarm_platform/terraform/environments/minimum/main.tf` |
| --- | --- |
| You need a custom domain (`admin.alpha-swarm.ai`) | `module "cloudfront"` from `infrastructure/modules/cloudfront` |
| You need vector search over research docs | `module "opensearch_serverless"` + `module "bedrock_kb"` |
| You want AgentCore (8-hour sessions, managed memory) | `module "bedrock_agentcore"` + a second `alphaswarm-agent` ECS service |
| You need a Celery worker tier | Stand up `infrastructure/envs/dev` (full EKS+Karpenter) and add the heritage `module "alphaswarm"` here |
| You need cross-account isolation | Promote to the full multi-account topology via `infrastructure/modules/landing-zone` |

Once the full set lands, retarget the topology from
`target=aws-minimum` to `target=aws`. The application reads the
same `/alphaswarm/${env}/*` SSM parameters either way.

## Tear down

```bash
# Application tier first (Fargate services hold ALB target group
# references that prevent ALB deletion):
cd alphaswarm_platform/terraform/environments/minimum
terraform destroy

# Then infrastructure tier. RDS has deletion_protection=true by
# default, so set it to false in the module call and re-apply first
# if you really want the database gone:
cd ../../../../infrastructure/envs/minimum
terraform destroy
```

Data buckets (`prevent_destroy = true`) are kept on purpose; remove
them manually after confirming no other env references them.


<!-- https://alpha-swarm.ai/how-to/operations/aws-runbook -->
# how-to/operations/aws-runbook

# AWS Hybrid Operational Runbook

> Companion to [aws-deploy.md](aws-deploy.md). Page this when the
> AgentCore proxy / admin BFF / Bedrock KB / ECS Fargate cluster
> misbehaves.

## On-call checklist (first 5 minutes)

1. **Confirm the blast radius.** Cloudflare hosts `alpha-swarm.ai` /
   `api.alpha-swarm.ai` / `manage.alpha-swarm.ai`; CloudFront hosts
   `admin.alpha-swarm.ai` / `agentcore.alpha-swarm.ai`. A Cloudflare outage does NOT
   touch the admin / AgentCore surface (and vice versa).
2. **Hit `/healthz`.** `https://admin.alpha-swarm.ai/healthz` — if 200, the
   ALB + ECS Fargate path is healthy. If 5xx, jump to "ECS Fargate
   service down" below.
3. **Fan-out kill switch.** If the incident is touching trading,
   immediately POST `/portfolio/kill_switch` AND `/workloads/halt` so
   every long-running runtime (paper, bots, RL, AgentCore) stops. The
   topbar `KillSwitch` component does the fan-out automatically for
   logged-in operators.

## Halt every AgentCore session

```bash
# 1. Disable new invocations at the gateway:
aws bedrock-agentcore update-gateway \
  --gateway-id $(aws ssm get-parameter \
    --name /alphaswarm/prod/agentcore_gateway_arn \
    --query 'Parameter.Value' --output text | awk -F/ '{print $NF}') \
  --status DISABLED

# 2. Stop the AgentCore proxy ECS Fargate service (the ALB stops
#    forwarding to the proxy immediately):
aws ecs update-service \
  --cluster $(aws ssm get-parameter \
    --name /alphaswarm/prod/ecs_cluster_name \
    --query 'Parameter.Value' --output text) \
  --service alphaswarm-agentcore-proxy-prod \
  --desired-count 0
```

The matching audit row lands in `workload_runs` via
`WorkloadRuntime.start_run` BEFORE the boto3 call returns (rule 45).
You can verify with:

```bash
aws rds-data execute-statement \
  --resource-arn $RDS_ARN --secret-arn $DB_SECRET_ARN \
  --database alphaswarm \
  --sql "SELECT id, action, status, user_id, started_at \
         FROM workload_runs ORDER BY started_at DESC LIMIT 10"
```

## Roll back to the previous tag

```bash
# 1. Identify the previous good SHA:
git log --oneline --decorate -20

# 2. Tag the previous SHA + push to trigger the apply path:
git tag v1.0.1-rollback <previous-good-sha>
git push origin v1.0.1-rollback

# 3. Manual dispatch — terraform-pipeline.yml with
#    tree=alphaswarm_platform, env=prod, action=apply (requires 2 reviewers).
#
# The DB / S3 buckets / KB source bucket are NOT touched — every data
# resource carries lifecycle.prevent_destroy=true and retains the
# previous Terraform-managed state.
```

## ECS Fargate service down

```bash
# 1. Inspect recent task stops (most common = ECR pull failure or
#    secrets resolution failure):
aws ecs describe-services \
  --cluster alphaswarm-cluster-prod \
  --services alphaswarm-admin-prod \
  --query 'services[0].events[:10]'

# 2. Tail the application log group (the ADOT sidecar logs to a
#    sibling stream prefixed adot/):
aws logs tail /aws/ecs/alphaswarm-admin-prod --follow

# 3. Force a fresh rollout (also rolls the ADOT sidecar):
aws ecs update-service \
  --cluster alphaswarm-cluster-prod \
  --service alphaswarm-admin-prod \
  --force-new-deployment
```

The matching `AwsProvider.restart` call from `WorkloadRuntime` writes
a `workload_runs` row with `action=restart` BEFORE the boto3 call
runs, so the audit trail is intact even when the rollout fails halfway.
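
The audit-first ordering described above can be sketched as follows. This is an illustration of the pattern only, not the `WorkloadRuntime` implementation: `AuditLedger` is a hypothetical in-memory stand-in for the `workload_runs` table, and the status strings are assumed to match the `PENDING` / `SUCCEEDED` / `FAILED` values used elsewhere in these runbooks.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import Callable


@dataclass
class AuditLedger:
    """Hypothetical in-memory stand-in for the workload_runs table."""

    rows: list = field(default_factory=list)

    def start(self, action: str, user_id: str) -> dict:
        row = {
            "id": len(self.rows) + 1,
            "action": action,
            "status": "PENDING",
            "user_id": user_id,
            "started_at": datetime.now(timezone.utc),
        }
        self.rows.append(row)
        return row


def audited(
    ledger: AuditLedger,
    action: str,
    user_id: str,
    side_effect: Callable[[], None],
) -> dict:
    """Write the audit row first, then run the provider call."""
    row = ledger.start(action, user_id)  # row exists before any side effect
    try:
        side_effect()
    except Exception as exc:
        row["status"] = "FAILED"
        row["error"] = str(exc)
        raise
    else:
        row["status"] = "SUCCEEDED"
    return row
```

Because the row is written before `side_effect()` runs, a rollout that fails halfway still leaves a `FAILED` row behind.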

## Drain a Celery worker (EKS path)

The Celery workers continue to run on EKS Karpenter; the ECS Fargate
surface is admin + AgentCore only. To drain a worker pod without
losing in-flight tasks:

```bash
# 1. Annotate the pod for graceful shutdown — the Celery preboot
#    hook in alphaswarm/tasks/celery_app.py listens for this annotation:
kubectl annotate pod alphaswarm-worker-xxxxx \
  -n alphaswarm \
  alphaswarm.io/drain-requested-at="$(date -u +%Y-%m-%dT%H:%M:%SZ)"

# 2. Wait for the worker to finish in-flight tasks (max grace =
#    ALPHASWARM_AGENT_STALL_THRESHOLD_SECONDS, default 1800s):
kubectl wait --for=delete pod alphaswarm-worker-xxxxx -n alphaswarm --timeout=1900s

# 3. The pod's owning controller recreates it automatically (Karpenter
#    only adds node capacity if needed); the new pod inherits the same
#    Celery queue subscriptions.
```

## Assume the break-glass role

Reserved for catastrophic incidents (org-wide outage, suspected
account compromise). The `AlphaSwarm-BreakGlass` Identity Center
permission set has Admin across every account, MFA-required, and
alarms on every assumption (CloudWatch alarm +
`workload_runs` row + Cloudflare Access policy).

```bash
# IAM Identity Center -> Sign in -> Pick AlphaSwarm-BreakGlass for the
# target account -> Acknowledge the on-call ticket prompt before
# the assume completes.
aws sts get-caller-identity
# Outputs the breakglass session arn — paste into the incident ticket.
```

## Rotate CloudFront origin secret

When the CloudFront `X-CloudFront-Secret` header value is suspected
leaked:

```bash
new=$(openssl rand -hex 32)

aws ssm put-parameter \
  --name /alphaswarm/prod/cloudfront_origin_secret \
  --value "$new" --type SecureString --overwrite

# Re-apply the cloudfront module so the new value lands at the edge:
gh workflow run terraform-pipeline.yml -f tree=alphaswarm_platform -f env=prod -f action=apply
```

## Common failure modes + fixes

| Symptom | Likely cause | Fix |
| --- | --- | --- |
| `AccessDeniedException` on first KB ingestion | `aoss:APIAccessAll` not propagated | Wait 20s + retry (the `time_sleep.settle` block handles this on apply, but manual ingestion-job starts can race). |
| AgentCore Runtime invocation returns 403 | The runtime role's IAM policy denies the FM | Verify the model ARN is in `var.allowed_model_arns` for the env. |
| `terraform apply` fails on `aws_bedrockagentcore_*` resource | Provider version too old | Pin `hashicorp/aws ~> 6.21` in `alphaswarm_platform/terraform/environments/live/main.tf`. |
| ALB 504 on `admin.alpha-swarm.ai` | Cognito redirect loop or ALB OIDC misconfig | Inspect the listener rule's `authenticate-cognito` action; confirm `user_pool_arn` + `user_pool_client_id` + `user_pool_domain` SSM params match. |
| ECS exec hangs | Task definition missing `enableExecuteCommand=true` | Re-deploy with `enable_execute_command=true` in the service spec; ECS Exec also needs `ALPHASWARM_AWS_ECS_EXEC_ENABLED=true` on the provider. |
| Bedrock smoke workflow times out on X-Ray | ADOT sidecar not propagating traces | Check that the `alphaswarm-adot-sidecar` SA has the X-Ray + Application Signals + CloudWatch policies. |

## Cost guardrails

`AWS Budgets` alarms ride on top of the SCP region allowlist:

- `dev` budget = $300/month (alerts at 50/80/100%)
- `staging` budget = $500/month
- `prod` budget = $1500/month (excluding Bedrock token spend; that is
  metered separately under the `alphaswarm.io/cost-bucket=bedrock-tokens` tag).
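
As a worked example of the alert thresholds, the sketch below converts the budgets above into dollar amounts. The 50/80/100% alert points are stated only for `dev`; applying them to `staging` and `prod` is an assumption made for illustration.

```python
# Monthly budgets from this runbook, in USD.
BUDGETS_USD = {"dev": 300, "staging": 500, "prod": 1500}

# Stated for dev only; assumed here for the other environments.
ALERT_PERCENTAGES = (50, 80, 100)


def alert_thresholds(env: str) -> list:
    """Dollar amounts at which each alert fires for an environment."""
    budget = BUDGETS_USD[env]
    return [budget * pct / 100 for pct in ALERT_PERCENTAGES]


def crossed_alerts(env: str, spend_usd: float) -> list:
    """Alert percentages already crossed at the given month-to-date spend."""
    budget = BUDGETS_USD[env]
    return [pct for pct in ALERT_PERCENTAGES if spend_usd >= budget * pct / 100]
```

For `prod`, pass spend with Bedrock token cost already excluded, since that is metered separately.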

If a budget alarm fires AND the Bedrock token cost is the driver:

```bash
# Disable streaming responses (cheaper) and switch the agent spec to
# Haiku for the next 24h while the operator investigates:
aws ssm put-parameter --name /alphaswarm/prod/llm_model_preference \
  --value "anthropic.claude-haiku-4-5-20251001-v1:0" \
  --type String --overwrite

aws ecs update-service \
  --cluster alphaswarm-cluster-prod \
  --service alphaswarm-agentcore-proxy-prod \
  --force-new-deployment
```


<!-- https://alpha-swarm.ai/how-to/operations/bot-canary-rollout-playbook -->
# Bot Canary Rollout Playbook
> - Any strategy code change (new alpha, new portfolio constructor, new execution algo). - Any adapter change (new venue, new FIX session config, new on-chain RPC endpoint). - Any risk-policy threshold ...

# Bot Canary Rollout Playbook

> When to use it, how to read the dashboards, how to abort, and how
> to tune false positives.

## When to use a canary

- Any strategy code change (new alpha, new portfolio constructor, new
  execution algo).
- Any adapter change (new venue, new FIX session config, new on-chain
  RPC endpoint).
- Any risk-policy threshold loosening.
- **Not** for: spec-only documentation updates, image rebuilds that
  don't change behavior, k8s manifest tweaks that don't change pod
  spec.

## Steps

### 1. Author the canary

Edit the bot's GitOps values file:

```yaml
# values-bot-mm-aapl.yaml
bot:
  variant: canary           # mutated label drives the Rollouts split
  botSpec:
    # New strategy parameters here.
```

### 2. Open a PR

Required CI checks:

- [ ] `tests/bots` green
- [ ] `python -m alphaswarm_bots.cli validate ` passes
- [ ] `python -m alphaswarm_bots.cli conformance ` passes
- [ ] `python -m alphaswarm_bots.cli stress ` passes
- [ ] Trivy scan: no CRITICAL/HIGH CVEs on the new image
- [ ] Cosign signature attached

### 3. Argo CD syncs the Rollout

The CanaryRollout CR mutates from `currentStep=0` to `currentStep=1`
when the new image lands. Traffic shifts to 10%.

### 4. Watch the AnalysisTemplate results

```bash
kubectl argo rollouts get rollout bot-mm-aapl
```

Expected output:

```
Status:        ✔ Healthy
Strategy:      Canary
  Step:        1/5
  SetWeight:   10
  Current:     stable=18 canary=2
```

The Prometheus dashboard `Bot Canary - ` shows three traces:

- `quantbot_realized_pnl_usd{variant="canary"}` vs `{variant="stable"}`
- `quantbot_orders_rejected_total / quantbot_orders_total` per variant
- `histogram_quantile(0.99, quantbot_tick_to_trade_seconds_bucket)` per variant

### 5. Promotion vs abort

- **Auto-promote:** if all three AnalysisTemplates pass the configured
  windows, the rollout advances to the next step automatically.
- **Auto-abort:** any AnalysisTemplate failure aborts the rollout
  and reverts traffic to 100% stable. Slack/PagerDuty alert
  `BotErrorRateHigh` or `BotPnLDrawdownCritical` fires.
- **Manual abort:**

  ```bash
  kubectl argo rollouts abort bot-mm-aapl
  ```

- **Manual promote (for an indefinite pause step):**

  ```bash
  kubectl argo rollouts promote bot-mm-aapl
  ```

## Tuning false positives

If you observe a healthy canary aborting frequently:

1. **Tighten the metric query first.** Move from `rate(...[1m])` to
   `rate(...[5m])`; use a robust quantile (e.g. `histogram_quantile(0.99,
   sum by (le) (rate(...[5m])))`).
2. **Lengthen the window.** Bump `count` from 30 to 60.
3. **Only THEN relax the success condition.** Don't relax `pnlVsStableMinUsd`
   from `-50` to `-150` without first investigating the variance source.

Per blueprint caveat #7: if the canary AnalysisTemplate false-positive
rate exceeds 10% (good canaries aborted by noisy metric), tighten the
metric query before relaxing the success condition.
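
The tuning order above can be expressed as a small decision helper. This is a sketch under stated assumptions: the 10% threshold and the ordering come from this playbook, while the function names and the way "good canaries" are counted are hypothetical.

```python
FALSE_POSITIVE_LIMIT = 0.10  # blueprint caveat #7: act when the rate exceeds 10%


def false_positive_rate(good_canaries_aborted: int, good_canaries_total: int) -> float:
    """Share of healthy canaries that a noisy metric aborted."""
    if good_canaries_total == 0:
        return 0.0
    return good_canaries_aborted / good_canaries_total


def next_tuning_step(rate: float, query_tightened: bool, window_lengthened: bool) -> str:
    """Return the next action, never relaxing the success condition first."""
    if rate <= FALSE_POSITIVE_LIMIT:
        return "no change"
    if not query_tightened:
        return "tighten the metric query"
    if not window_lengthened:
        return "lengthen the window"
    return "investigate the variance source before relaxing the success condition"
```

Relaxing `pnlVsStableMinUsd` is deliberately never returned as a first step.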

## Hard abort: emergency

If the canary is in `Progressing` state but you see live PnL bleeding
faster than the abort criterion would catch:

```bash
# Three-scope kill switch — engaged at bot scope.
kubectl apply -f - <<EOF
apiVersion: quantbot.io/v1
kind: KillSwitch
metadata:
  name: emergency-mm-aapl-canary
  namespace: alphaswarm-bots
spec:
  scope: bot
  target: mm-aapl
  mode: flatten
  reason: "emergency canary bleed"
EOF
```

This bypasses the rollout reconciler — every pod with
`quantbot.io/bot-slug=mm-aapl` halts within `poll_interval_s` (5s).


<!-- https://alpha-swarm.ai/how-to/operations/cicd-deploy -->
# Operations runbook — CI/CD deploy
> Task-oriented steps for the AWS CI/CD pipeline: create the dev/staging/prod GitHub Environments and reviewers, set the per-env role variables and cross-repo dispatch token, deploy the infra and app trees via terraform-pipeline.yml, release images via a v* tag, drive the admin redeploy, approve a prod release, find the terraform_runs audit row, and roll back.

# Operations runbook — CI/CD deploy

Task-oriented steps for the AlphaSwarm AWS CI/CD pipeline. For the
design and the topology diagrams see the concept page
[CI/CD pipelines](../../concepts/infrastructure/cicd-pipelines.md).
This runbook is the companion to the bootstrap and incident playbooks
in [AWS Hybrid Deployment Guide](aws-deploy.md) and
[AWS Hybrid Operational Runbook](aws-runbook.md) — start there for the
first-ever account bring-up; come here for the day-to-day pipeline.

All deploys run through GitHub Actions over GitHub OIDC. Never run
`terraform apply` or `alphaswarm deploy up` against a shared
environment from a laptop.

## (a) One-time setup — Environments, reviewers, variables

Do this once per repo (the steps are the same for `alphaswarm_platform`
and `alphaswarm_admin`).

1. **Create the three GitHub Environments.** In the repo:
   Settings → Environments → New environment, for each of `dev`,
   `staging`, `prod`.
2. **Set required reviewers.** Edit each Environment's protection
   rules:
   - `dev` — no required reviewers (auto).
   - `staging` — **1** required reviewer.
   - `prod` — **2** required reviewers (4-eyes).
3. **Set the per-env role variables.** For each Environment add the
   apply role ARN (published by the `infrastructure/modules/github-oidc`
   module) plus the read-only plan role ARN:

   ```bash
   # Apply role (one per environment); AWS_ACCOUNT_ID is a placeholder
   # for that environment's 12-digit account ID:
   gh variable set AWS_DEPLOYER_ROLE_ARN \
     --env prod \
     --body "arn:aws:iam::${AWS_ACCOUNT_ID}:role/aqp-gha-apply"

   # Plan role (read-only, used by pr-validate.yml):
   gh variable set AWS_PLAN_ROLE_ARN \
     --env prod \
     --body "arn:aws:iam::${AWS_ACCOUNT_ID}:role/aqp-gha-plan"
   ```

   Repeat for `dev` and `staging` with their account IDs.
4. **Set the cross-repo dispatch token (admin repo only).** The admin
   pipeline fires a `repository_dispatch` at `alphaswarm_platform`, so
   it needs a token with `repo` scope on the platform repo. Store it
   as a secret in the **admin** repo:

   ```bash
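   # The shell variable below is a placeholder for the token value;
   # export it first rather than pasting the token inline.
   gh secret set PLATFORM_DISPATCH_TOKEN \
     --repo Alpha-Swarm-ai/alphaswarm_admin \
     --body "$PLATFORM_DISPATCH_TOKEN"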
   ```

## (b) Deploy the landing zone (infrastructure/)

The `infrastructure/` tree is applied with native Terraform over OIDC
into `AqpTerraformExecutionRole`. Always plan first, review the diff
in the workflow summary, then apply.

```bash
# 1. Plan dev:
gh workflow run terraform-pipeline.yml \
  -f tree=infrastructure -f env=dev -f action=plan

# 2. Review the plan in the run summary, then apply:
gh workflow run terraform-pipeline.yml \
  -f tree=infrastructure -f env=dev -f action=apply
```

Promote by repeating with `-f env=staging` then `-f env=prod`. The
`staging` apply waits on 1 reviewer and the `prod` apply on 2 (the
GitHub Environment gate).

## (c) Deploy the app tier (terraform/)

Same workflow, `tree=alphaswarm_platform`. This path delegates to
`CodeBuild`, which runs `alphaswarm deploy plan` / `alphaswarm deploy up`
(`TerraformRuntime`) and writes a `terraform_runs` audit row.

```bash
gh workflow run terraform-pipeline.yml \
  -f tree=alphaswarm_platform -f env=dev -f action=plan

gh workflow run terraform-pipeline.yml \
  -f tree=alphaswarm_platform -f env=dev -f action=apply
```

A `push` to `main` automatically runs an `infrastructure` plan against
`dev`, so you usually only dispatch the `apply` actions explicitly.

## (d) Release images — push a v* tag

`build-publish.yml` triggers on a `v*` tag. It builds each service
multi-arch to `ECR`, signs with `Cosign` keyless, emits a `syft` SBOM
and `SLSA` provenance, and runs `Trivy` + `Grype` scans.

```bash
git tag v1.4.0
git push origin v1.4.0
# Watch the workflow:
gh run watch
```

## (e) Admin deploy flow

`alphaswarm_admin` builds two images and hands off to the platform.

1. Push to the admin repo's `main` (or push a `v*` tag).
2. The admin workflow builds and pushes **two** images to `ECR`:
   `alphaswarm-admin` and `alphaswarm-admin-frontend`.
3. After both land, it fires a `repository_dispatch` event
   `admin-image-published` at `alphaswarm_platform` (using
   `PLATFORM_DISPATCH_TOKEN`).
4. The platform's app-tier redeploy runs and rolls the admin service
   onto **ECS `Fargate`** (`Cognito` + `ALB`) via
   `terraform/environments/{dev,staging,prod}`, reading infra handles
   from SSM `/alphaswarm/{env}/*`.

To re-trigger the handoff manually (for example after a token fix
without a new build):

```bash
gh api repos/Alpha-Swarm-ai/alphaswarm_platform/dispatches \
  -f event_type=admin-image-published \
  -f 'client_payload[env]=dev'
```

## (f) Approving a prod release (4-eyes)

A `prod` apply (infra or app tier) pauses on the GitHub Environment
gate until **two** distinct reviewers approve. The apply role cannot
be assumed before that, so nothing touches `prod` until both sign off.

1. Dispatch the apply (step b or c) with `-f env=prod`.
2. Two reviewers open the run → "Review deployments" → select `prod`
   → Approve. Approvals must come from two different people.
3. The job then assumes `vars.AWS_DEPLOYER_ROLE_ARN` for `prod` over
   OIDC and proceeds.
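
The gate logic can be illustrated with a short sketch. GitHub enforces this itself through the Environment protection rules; the code below only models the rule for clarity, and the function name is hypothetical. The reviewer counts are the ones configured in section (a).

```python
# Required reviewer counts per GitHub Environment, from section (a).
REQUIRED_REVIEWERS = {"dev": 0, "staging": 1, "prod": 2}


def gate_open(env: str, approvers: list) -> bool:
    """True once enough distinct people have approved the deployment."""
    # GitHub logins are case-insensitive, so normalise before counting.
    distinct = {name.strip().lower() for name in approvers if name.strip()}
    return len(distinct) >= REQUIRED_REVIEWERS[env]
```

Two approvals from the same person count once, which is why `prod` stays blocked until a second reviewer signs off.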

```bash
# List runs awaiting approval:
gh run list --workflow terraform-pipeline.yml --status waiting
```

## (g) Where the terraform_runs audit row lands

Every app-tier `alphaswarm deploy plan` / `up` writes a row to the
`terraform_runs` table in the platform Postgres (platform AGENTS
rule 42) — the same ledger used by `TerraformRuntime` for in-app
Terraform actions. Native `infrastructure/` applies do not write this
row (their history is the Terraform state in S3). To inspect recent
app-tier runs:

```bash
aws rds-data execute-statement \
  --resource-arn "$RDS_ARN" --secret-arn "$DB_SECRET_ARN" \
  --database alphaswarm \
  --sql "SELECT id, action, status, env, started_at \
         FROM terraform_runs ORDER BY started_at DESC LIMIT 10"
```

## (h) Rollback

Pick the path that matches what changed.

- **Bad image (app or admin):** re-point the deploy at the prior
  immutable image tag and redeploy — no rebuild required.

  ```bash
  # Re-run the app-tier apply pinned to the previous tag:
  gh workflow run terraform-pipeline.yml \
    -f tree=alphaswarm_platform -f env=prod -f action=apply \
    -f image_tag=v1.3.0
  ```

- **Bad infra/app-tier change:** re-apply the previous good state by
  dispatching `apply` from the prior good commit. Tag-and-push the
  previous SHA, then dispatch the apply (prod still needs 2
  reviewers):

  ```bash
  git tag v1.3.1-rollback "$PREVIOUS_GOOD_SHA"  # placeholder: the prior good commit SHA
  git push origin v1.3.1-rollback
  gh workflow run terraform-pipeline.yml \
    -f tree=alphaswarm_platform -f env=prod -f action=apply
  ```

Data resources (RDS, S3, the KB source bucket) carry
`lifecycle.prevent_destroy = true`, so a re-apply rolls forward the
service definitions without touching stateful resources. See the
rollback section of [AWS Hybrid Operational Runbook](aws-runbook.md)
for the full data-safety notes.

## See also

- [CI/CD pipelines](../../concepts/infrastructure/cicd-pipelines.md) — the design and topology.
- [AWS Hybrid Deployment Guide](aws-deploy.md) — first-time bootstrap.
- [AWS Hybrid Operational Runbook](aws-runbook.md) — incident playbooks + rollback data safety.


<!-- https://alpha-swarm.ai/how-to/operations/cloud-cli-temporary-credentials -->
# Cloud-CLI temporary credentials
> Operator runbook for minting short-lived AWS / GCP / Azure credentials from the admin UI. The control plane spawns the CLI subprocess; the admin BFF brokers it; the minted token is persisted via CredentialResolver and never echoed back.

# Cloud-CLI temporary credentials

How to use the **CloudCliCredentialWizard** in the admin Settings
page to mint a short-lived AWS / GCP / Azure credential without
shipping the parent credential or the cloud CLI binary into the
admin BFF container.

## Topology

```mermaid
flowchart LR
  Op["operator (MFA-fresh)"] --> FE["alphaswarm_admin/frontend\nCloudCliCredentialWizard"]
  FE -->|"POST /admin/settings/credentials/cloud-cli/preview"| BFF["alphaswarm_admin BFF"]
  BFF -->|broker| CP["alphaswarm_controller\n/manage/credentials/cloud-cli/preview"]
  CP -->|argv only| BFF
  BFF -->|masked argv| FE
  Op -->|"approve preview"| FE
  FE -->|"POST .../sts"| BFF
  BFF -->|broker| CP
  CP -->|"asyncio.create_subprocess_exec"| CLI["aws sts | gcloud auth | az account get-access-token"]
  CLI -->|JSON / token| CP
  CP -->|"persist via CredentialResolver"| RES["resolver chain"]
  CP -->|"metadata only\n(credential_key, expires_at)"| BFF
  BFF -->|envelope| FE
```

The CLI subprocess **only** runs inside `alphaswarm_controller`. The
admin BFF (`alphaswarm_admin`) is HTTP-only per its boundary contract; it
never spawns processes or holds the parent credential.

## Prerequisites

The control plane host needs the CLI binary on `$PATH`:

| Provider | Binary | Required pre-auth |
| --- | --- | --- |
| AWS | `aws` | parent IAM identity (instance profile / IRSA / access key) with `sts:AssumeRole` on the target role |
| GCP | `gcloud` | ADC (Application Default Credentials) for an identity with `iam.serviceAccounts.getAccessToken` on the target SA |
| Azure | `az` | logged-in `az` session (`az login --identity` for managed identity, or interactive) on a principal that can issue tokens for the requested resource |

The wizard's preview step renders a `binary present on CP host` flag
so the operator can spot a missing CLI before they hit `Execute mint`.

## Walkthrough

### 1. Pick a provider + fill the form

Open `Settings → Cloud-CLI temporary credential mint`. The wizard
loads handler metadata from
`/admin/settings/credentials/cloud-cli/handlers` (proxied to the CP)
and renders the appropriate fields:

| Provider | Required |
| --- | --- |
| AWS | `target_credential_key`, `role_arn` |
| GCP | `target_credential_key`, `service_account_email` |
| Azure | `target_credential_key` (resource / tenant / subscription optional) |

`target_credential_key` is the resolver key the minted credential
persists under (e.g. `idp:aws:prod`). Downstream code reads it via
`CredentialResolver.resolve(CredentialKey(...))`
once minted; nothing in the platform passes the bytes directly.

### 2. Preview

Clicking `Preview command` posts to
`/admin/settings/credentials/cloud-cli/preview` which returns the
exact `argv` the CP would spawn, with token-bearing args masked.
This is a **dry run** — no subprocess executes.
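
A minimal sketch of the masking step, assuming masking is keyed on flag names. The set of sensitive flags and the `****` mask string are hypothetical; the control plane's real rules may differ.

```python
# Hypothetical set of flags whose values should never be displayed.
SENSITIVE_FLAGS = frozenset({"--external-id", "--token-code"})


def mask_argv(argv: list, sensitive: frozenset = SENSITIVE_FLAGS, mask: str = "****") -> list:
    """Return a copy of argv with the values of sensitive flags replaced."""
    masked = []
    hide_next = False
    for arg in argv:
        if hide_next:
            # Previous token was a sensitive flag in "--flag value" form.
            masked.append(mask)
            hide_next = False
            continue
        flag, sep, _value = arg.partition("=")
        if sep and flag in sensitive:
            # "--flag=value" form.
            masked.append(f"{flag}={mask}")
        elif arg in sensitive:
            masked.append(arg)
            hide_next = True
        else:
            masked.append(arg)
    return masked
```

The function builds a new list, so the unmasked `argv` stays available to the process that eventually spawns the command.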

### 3. Mint

`Execute mint` posts to `/admin/settings/credentials/cloud-cli/sts`.
Server-side the CP:

1. writes a `WorkloadRun` audit row with `action=mint_cloud_credential`
   in `PENDING` state **before** spawning the subprocess;
2. invokes `aws sts assume-role` / `gcloud auth print-access-token`
   / `az account get-access-token` with a 60s wall-clock timeout;
3. parses the result, persists the credential under
   `target_credential_key` via the resolver chain;
4. updates the audit row to `SUCCEEDED|FAILED`.
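
Step 2 of the sequence above (spawn with a wall-clock timeout, then classify the outcome) can be sketched with `asyncio`. This is not the control plane's code: the error strings mirror the codes in the Troubleshooting table, the audit write and resolver persistence are left out, and treating stdout as JSON is an assumption that fits `aws sts assume-role` and `az account get-access-token` but not necessarily every handler.

```python
import asyncio
import json
import shutil


async def run_cli(argv: list, timeout_s: float = 60.0) -> dict:
    """Spawn a CLI without a shell and map the outcome to a status dict."""
    if shutil.which(argv[0]) is None:
        return {"status": "FAILED", "error": "binary_missing"}

    proc = await asyncio.create_subprocess_exec(
        *argv,
        stdout=asyncio.subprocess.PIPE,
        stderr=asyncio.subprocess.PIPE,
    )
    try:
        stdout, _stderr = await asyncio.wait_for(proc.communicate(), timeout=timeout_s)
    except asyncio.TimeoutError:
        proc.kill()  # do not leave the child running past the budget
        await proc.wait()
        return {"status": "FAILED", "error": "timeout"}

    if proc.returncode != 0:
        return {"status": "FAILED", "error": "nonzero_exit"}
    try:
        payload = json.loads(stdout)
    except json.JSONDecodeError:
        return {"status": "FAILED", "error": "parse_error"}
    # The payload is for persistence only; it must not be echoed to the caller.
    return {"status": "SUCCEEDED", "payload": payload}
```

Passing `argv` as a list to `create_subprocess_exec` avoids shell interpolation of operator-supplied values.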

The wizard renders the response envelope:

| Field | Meaning |
| --- | --- |
| `credential_key` | resolver key the temp creds live under |
| `expires_at` | TTL boundary (provider-derived) |
| `source_identity` | role ARN / SA email / subscription id |
| `audit_run_id` | links to the `WorkloadRun` ledger row |

The raw token is **never** in the response body or audit ledger.

## Step-up MFA

`/admin/settings/credentials/cloud-cli/{preview,sts}` carry
`Depends(require_admin_step_up("admin:cluster"))`. If the operator's
JWT is older than the configured `auth_step_up_default_max_age`
(default 180s), the BFF returns
`401 insufficient_user_authentication` with an RFC 9470
`WWW-Authenticate` challenge; the wizard's `apiFetch` middleware
automatically prompts the operator for MFA and then retries the original call.
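
The freshness check can be sketched as follows. This is an illustration rather than the `require_admin_step_up` implementation: it assumes token age is measured from an `auth_time` claim (as RFC 9470 does), and the `error_description` wording is made up.

```python
import time
from typing import Optional

STEP_UP_MAX_AGE_S = 180  # default of auth_step_up_default_max_age per this runbook


def step_up_challenge(
    auth_time: int,
    now: Optional[int] = None,
    max_age: int = STEP_UP_MAX_AGE_S,
) -> Optional[dict]:
    """Return None when the authentication is fresh, else a 401 description."""
    current = int(time.time()) if now is None else now
    if current - auth_time <= max_age:
        return None
    challenge = (
        'Bearer error="insufficient_user_authentication", '
        'error_description="A more recent authentication is required", '
        f'max_age="{max_age}"'
    )
    return {"status": 401, "headers": {"WWW-Authenticate": challenge}}
```

A client can read `max_age` from the challenge to decide how fresh the re-authentication must be before retrying.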

## Troubleshooting

| Symptom | Cause | Fix |
| --- | --- | --- |
| `binary_missing` in the preview | CLI not on the CP host's `$PATH` | Install the CLI in the CP container image, or shell into the host and run `which aws gcloud az` |
| `nonzero_exit` with masked stderr | Parent credential lacks the requested permission | Read the redacted stderr in the audit row's `error` field; provision the missing role / IAM permission |
| `parse_error` | Upstream returned an unexpected JSON shape | Compare against the canonical `aws sts assume-role` / `az account get-access-token` shape in the AWS / Azure docs; file a bug if AWS/Azure changed the format |
| `timeout` | Network / IAM trust-policy resolution stuck | The 60s budget is intentional — re-run; if it persists, run the masked argv from the preview locally to inspect |
| `persist_failed` | Resolver write surface not configured | The default CP resolver doesn't ship a write hook in OSS; provide one via the `persist=` kwarg of `alphaswarm_controller.services.cloud_cli.mint`, or wire a Vault / SSM secret manager that supports writes |

## Related docs

- [Cloud credentials](../../concepts/identity/cloud-credentials.md) —
  resolver chain + naming conventions.
- [Identity overview](../../concepts/identity/identity.md) — overall
  rule-26 + rule-52 boundaries.
- [Account integrations](../../concepts/identity/account-integrations.md) —
  the per-org PAT-link sibling surface (HF + Docker Hub).


<!-- https://alpha-swarm.ai/how-to/operations/configuration-management -->
# Operations runbook — Configuration management
> [`alphaswarm_platform/deployments/compose/.env.schema`](../../alphaswarm_platform/deployments/compose/.env.schema) is the source of truth. Every variable declared anywhere (compose, K8s ConfigMap, K8s Secret, appli...

# Operations runbook — Configuration management

How env vars, ConfigMaps, and Secrets flow through the AlphaSwarm stack.

## The single source of truth

[`alphaswarm_platform/deployments/compose/.env.schema`](../../alphaswarm_platform/deployments/compose/.env.schema) is the source of truth. Every variable declared anywhere (compose, K8s ConfigMap, K8s Secret, application code, frontend) MUST appear in the schema.

Each entry carries metadata:

```
key:            ALPHASWARM_FOO_BAR
description:    What this knob controls.
required:       true | false
default:        
targets:        local,kubernetes,cloud
classification: plain | secret | rotation-required
```

## Generation

```powershell
# Local dev (.env file)
make generate-config ENV=local

# Cloud / sealed-secrets seed
make generate-config ENV=cloud

# Kubernetes ConfigMap + Secret scaffold
make generate-config ENV=k8s
```

Or directly:

```powershell
python alphaswarm_platform/build/scripts/generate_config.py --env local --out alphaswarm_platform/deployments/compose/.env.local
python alphaswarm_platform/build/scripts/generate_config.py --env k8s --kind configmap
python alphaswarm_platform/build/scripts/generate_config.py --env k8s --kind secret
```

## Validation

`make validate-config` runs the generator in `--diff` mode against every target — produces no output when files are in sync with the schema; prints a unified diff when they've drifted.
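
The diff-mode behaviour can be approximated with the standard library. This sketch is not `generate_config.py` itself; the function name and the file labels in the diff header are hypothetical.

```python
import difflib


def config_drift(expected: str, actual: str, path: str) -> str:
    """Return '' when the file matches the schema output, else a unified diff."""
    diff = difflib.unified_diff(
        actual.splitlines(keepends=True),
        expected.splitlines(keepends=True),
        fromfile=f"{path} (on disk)",
        tofile=f"{path} (generated from schema)",
    )
    return "".join(diff)
```

An empty string means the target is in sync; any other output is the diff to review before regenerating.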

## How env reaches a service

```mermaid
flowchart LR
  schema[.env.schema] -->|generate_config.py| envfile[.env.local]
  schema -->|generate_config.py| cm[ConfigMap]
  schema -->|generate_config.py| secret[Secret scaffold]

  envfile -->|docker compose| compose[Compose service]
  cm -->|envFrom| pod[Pod env vars]
  secret -->|envFrom| pod
  pod --> alphaswarm[alphaswarm.config.settings reads via pydantic-settings]
  compose --> alphaswarm
```

## Adding a new variable

1. Add a block to `.env.schema`:

   ```
   key:            ALPHASWARM_MY_NEW_KNOB
   description:    What it does (one line).
   required:       false
   default:        
   targets:        local,kubernetes,cloud
   classification: plain
   ```

2. Regenerate every artifact:

   ```powershell
   make generate-config ENV=local
   make generate-config ENV=k8s
   ```

3. Add the field to `alphaswarm.config.settings.Settings` so the application can read it via `from alphaswarm.config import settings`.

4. Update tests that snapshot the env to include the new key.

## Secret classification rules

| Class | Examples | Storage |
| --- | --- | --- |
| `plain` | `ALPHASWARM_LOG_LEVEL`, `ALPHASWARM_CORE_API_URL` | ConfigMap |
| `secret` | `ALPHASWARM_DATABASE_PASSWORD`, `ALPHASWARM_AUTH_SCIM_BEARER_TOKEN_HASH` | Secret + sealed-secrets / external-secrets-operator |
| `rotation-required` | `ALPHASWARM_AUTH_M2M_CLIENT_SECRET`, `ALPHASWARM_SESSION_COOKIE_SECRET` | Secret + rotation cadence in [rotate-secrets.md](rotate-secrets.md) |

## Never

- Never commit a populated `Secret` to git. The generator writes a `Y2hhbmdlbWU=` placeholder; CI/CD or the external secret operator patches the real values.
- Never read `os.environ.get(...)` directly from `alphaswarm/` business code. Use `from alphaswarm.config import settings`.
- Never hardcode a URL or password. Add it to the schema and route through `settings`.
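
To make the classification rules and the placeholder concrete, the sketch below routes schema entries to a ConfigMap or a Secret scaffold. The entry shape (dicts with `key`, `classification`, `default`) follows the schema fields shown earlier, but the function is hypothetical and not the generator's code.

```python
import base64

# Placeholder the generator writes into Secret scaffolds, per this runbook.
PLACEHOLDER_B64 = "Y2hhbmdlbWU="


def split_by_classification(entries: list) -> tuple:
    """Route plain keys to the ConfigMap and the rest to the Secret scaffold."""
    configmap = {}
    secret = {}
    for entry in entries:
        kind = entry["classification"]
        if kind == "plain":
            configmap[entry["key"]] = entry.get("default", "")
        elif kind in ("secret", "rotation-required"):
            # Real values are patched in later by CI/CD or the secret operator.
            secret[entry["key"]] = PLACEHOLDER_B64
        else:
            raise ValueError(f"unknown classification: {kind}")
    return configmap, secret


# The placeholder is just base64 for a dummy value, never a real secret.
assert base64.b64decode(PLACEHOLDER_B64) == b"changeme"
```

Keeping only the placeholder in generated output is what makes the scaffold safe to commit.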


<!-- https://alpha-swarm.ai/how-to/operations/connect-company-cloud-account -->
# Connect a company cloud account (federated-first onboarding)
> Guided 5-step wizard in alphaswarm_admin that connects AWS, Azure, GCP, or Cloudflare accounts using federated identity. No long-lived secrets stored.

# Connect a company cloud account

The cloud onboarding wizard in `alphaswarm_admin` (route
`/admin/accounts/{org_id}/cloud/{cloud_kind}/*`) is the canonical
path for wiring an AWS, Azure, GCP, or Cloudflare account into AlphaSwarm.
It is federated-first by design: no access keys, no client secrets,
no service-account JSON, and no global API keys are ever stored.

The same UI serves both flows:

- **Per customer organisation** — `/accounts/{org_id}/integrations` →
  "Cloud accounts" section.
- **Admin tenant (AlphaSwarm's own accounts)** — `/settings` page, "Cloud
  connections" panel, "Guided (federated)" mode. Routes use the
  synthetic `org_id="__platform__"` value.

## Step 0 — pre-requisites

Set these env vars on the `alphaswarm_admin` deployment before the wizard
can emit bootstrap artifacts. None of them are secrets on their own,
but the wizard rejects bootstrap calls when the corresponding
identity is missing.

| Env var | Purpose | Required for |
| --- | --- | --- |
| `ALPHASWARM_ADMIN_AWS_PARTNER_ACCOUNT_ID` | AlphaSwarm's AWS account id (12 digits) embedded as the trust policy's `Principal.AWS`. | AWS |
| `ALPHASWARM_ADMIN_CLOUD_AWS_EXTERNAL_ID_SECRET` | HMAC key used to derive a stable per-org `sts:ExternalId`. | AWS (prod) |
| `ALPHASWARM_ADMIN_AZURE_APP_CLIENT_ID` | Client id of the AlphaSwarm Entra app that will carry the federated credential. | Azure |
| `ALPHASWARM_ADMIN_AZURE_APP_OBJECT_ID` | Object id of the same app — parent for `az ad app federated-credential create`. | Azure |
| `ALPHASWARM_ADMIN_GCP_WIF_AUDIENCE` | Audience template for the customer's WIF provider. | GCP |
| `ALPHASWARM_ADMIN_GCP_WIF_SERVICE_ACCOUNT_EMAIL` | AlphaSwarm-side service account the customer's WIF principal impersonates. | GCP |

The customer-side bootstrap also needs network egress from the
admin BFF to each cloud's control plane (AWS STS, Microsoft Graph,
GCP IAM, Cloudflare API). The wizard surfaces clear errors when a
call fails.

## The five-step pattern

```mermaid
flowchart LR
    s1["1 Choose"] --> s2["2 Bootstrap"]
    s2 --> s3["3 Identity"]
    s3 --> s4["4 Permissions"]
    s4 --> s5["5 Save"]
```

| Step | Mutates AlphaSwarm? | Mutates cloud? | Audit row? |
| --- | --- | --- | --- |
| 1 Choose cloud + auth method | no | no | no |
| 2 Bootstrap artifacts | no | no | no |
| 3 Validate identity | no | read-only | yes |
| 4 Validate permissions | no | read-only | yes |
| 4* Enumerate resources | no | read-only | yes |
| 5 Save (`connect`) | yes | no | yes (pending + succeeded/failed) |

Steps 3, 4, 5 require step-up MFA per hard rule 52.

## Per-cloud runbooks

### AWS — cross-account IAM role + external id

1. **Step 2 — bootstrap.** The wizard emits a trust policy that
   names AlphaSwarm's account as the `Principal.AWS` and includes a unique
   `sts:ExternalId` derived from
   `HMAC-SHA256(ALPHASWARM_ADMIN_CLOUD_AWS_EXTERNAL_ID_SECRET, ":")`.

   Copy the rendered `trust_policy.json` block or use the
   CloudFormation StackSet quick-link the wizard surfaces. The
   default role name is `alphaswarm-broker-` (configurable via
   `ALPHASWARM_ADMIN_AWS_ROLE_NAME_PATTERN`).

   ```bash
   aws iam create-role \
     --role-name alphaswarm-broker- \
     --assume-role-policy-document file://trust_policy.json
   ```

   Attach the policies the wizard hint suggests
   (`ReadOnlyAccess` for a minimal connection; tighter policies for
   production).

2. **Step 3 — validate identity.** Paste the resulting Role ARN
   into the wizard. AlphaSwarm calls `sts:AssumeRole` with the same
   external id, then `sts:GetCallerIdentity`. Failure modes:
   `AccessDenied` (trust policy not applied or wrong external id),
   `InvalidParameterValue` (role doesn't exist), or
   `RegionDisabled`.

3. **Step 4 — validate permissions.** AlphaSwarm runs
   `iam:SimulatePrincipalPolicy` against the role for
   `sts:GetCallerIdentity`, `iam:GetRole`, and
   `ec2:DescribeRegions` (the baseline). Missing permissions
   render as a red "Missing required" badge.

4. **Step 5 — save.** AlphaSwarm persists
   `{role_arn, external_id, region, account_id}` under
   `CredentialKey("cloud_aws", "org:")` and the
   `alphaswarm_admin.integration_store` table.
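
The Step 2 bootstrap output can be sketched as below. Several details are assumptions: the HMAC message layout (`org_id:cloud_kind`) is hypothetical because the exact message is not given here, the hex encoding of the digest is a choice made for this example, and the trust policy uses the account root ARN form for `Principal.AWS`.

```python
import hashlib
import hmac
import json


def derive_external_id(secret: bytes, org_id: str, cloud_kind: str = "aws") -> str:
    """Derive a stable per-org external id with HMAC-SHA256."""
    # Hypothetical message layout; the wizard's real input may differ.
    message = f"{org_id}:{cloud_kind}".encode()
    return hmac.new(secret, message, hashlib.sha256).hexdigest()


def render_trust_policy(partner_account_id: str, external_id: str) -> str:
    """Render a cross-account trust policy that requires the external id."""
    policy = {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Principal": {"AWS": f"arn:aws:iam::{partner_account_id}:root"},
                "Action": "sts:AssumeRole",
                "Condition": {"StringEquals": {"sts:ExternalId": external_id}},
            }
        ],
    }
    return json.dumps(policy, indent=2)
```

Deriving the id from a keyed hash means the same org always gets the same value without it being stored separately, and an `AccessDenied` in Step 3 then points to a mismatch between this value and the one in the applied policy.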

### Azure — Workload Identity Federation

1. **Step 2 — bootstrap.** The wizard emits a federated-credential
   JSON skeleton keyed to the AlphaSwarm Entra app's `object_id`. Default
   audience is `api://AzureADTokenExchange` (override via
   `ALPHASWARM_ADMIN_AZURE_AUDIENCE`).

   ```bash
   az ad app federated-credential create \
     --id "$ALPHASWARM_ADMIN_AZURE_APP_OBJECT_ID" \
     --parameters federated_credential.json
   ```

   On the customer's subscription, grant the AlphaSwarm app a role
   (default: `Reader`) at the appropriate scope.

2. **Step 3 — validate identity.** Provide the customer
   `tenant_id` and `subscription_id`. AlphaSwarm acquires a token via the
   federated credential and resolves the token's claims; failures
   typically point to a subject/issuer mismatch on the federated
   credential.

3. **Step 4 — validate permissions.** AlphaSwarm lists role assignments
   at the subscription scope and compares against the
   `required_roles` baseline.

4. **Step 5 — save.** Stored under
   `CredentialKey("cloud_azure", "org:")`. No client
   secret is ever provided to or stored by AlphaSwarm — the federated
   credential is the only artifact.

### GCP — Workload Identity Federation + impersonation

1. **Step 2 — bootstrap.** The wizard emits a Workload Identity
   Pool + Provider config (issuer URI, allowed audiences,
   `attribute_mapping`) plus three `gcloud` invocations:

   ```bash
   gcloud iam workload-identity-pools create alphaswarm-broker- \
     --project= --location=global \
     --display-name="AlphaSwarm broker "

   gcloud iam workload-identity-pools providers create-oidc alphaswarm-oidc \
     --project= --location=global \
     --workload-identity-pool=alphaswarm-broker- \
     --issuer-uri=https://alpha-swarm.ai/oidc/ \
     --allowed-audiences="" \
     --attribute-mapping="google.subject=assertion.sub"

   gcloud iam service-accounts add-iam-policy-binding \
      \
     --project= \
     --role=roles/iam.workloadIdentityUser \
     --member="principalSet://iam.googleapis.com/projects//locations/global/workloadIdentityPools/alphaswarm-broker-/*"
   ```

2. **Step 3 — validate identity.** AlphaSwarm impersonates the configured
   service account and confirms the impersonation chain works.

3. **Step 4 — validate permissions.** AlphaSwarm calls
   `projects.testIamPermissions` for the baseline
   (`resourcemanager.projects.get`, `iam.serviceAccounts.actAs`).

4. **Step 5 — save.** Stored under
   `CredentialKey("cloud_gcp", "org:")`.
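
The baseline comparison in step 4 can be sketched as a set difference. `projects.testIamPermissions` returns only the permissions the caller holds, so anything in the baseline that is absent from the response is missing. The function name and the response handling below are illustrative assumptions, not the provider's actual code.

```python
from typing import Iterable, List

# Baseline quoted in step 4.
GCP_BASELINE = ("resourcemanager.projects.get", "iam.serviceAccounts.actAs")


def missing_permissions(required: Iterable[str], response: dict) -> List[str]:
    """Return the required permissions that the API response did not grant."""
    # The `permissions` field is omitted entirely when nothing is granted.
    granted = set(response.get("permissions", []))
    return [perm for perm in required if perm not in granted]


if __name__ == "__main__":
    sample = {"permissions": ["resourcemanager.projects.get"]}
    print(missing_permissions(GCP_BASELINE, sample))
```

The same shape works for the AWS and Azure baselines, with simulated actions or role assignments in place of permissions.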

### Cloudflare — scoped API token

There is no federated identity option for Cloudflare; the
federated-first equivalent is the **scoped API token** (the
deprecated Global API key is rejected outright).

1. **Step 2 — bootstrap.** Pick the narrowest template that covers
   the use case: `dns_edit`, `tunnel`, `access`, `worker`, or
   `r2`. The wizard opens
   [https://dash.cloudflare.com/profile/api-tokens](https://dash.cloudflare.com/profile/api-tokens) and the
   customer creates the token in the dashboard.

2. **Step 3 — validate identity.** Paste the token into the
   wizard. AlphaSwarm calls `GET /user/tokens/verify` and confirms
   `status == active`. The token is **never** echoed back to the
   wizard after submission.

3. **Step 4 — validate permissions.** AlphaSwarm inspects the verified
   token's permission groups against the template baseline.

4. **Step 5 — save.** Stored encrypted at rest in the
   `IntegrationCredentialStore` (Fernet-wrapped) under
   `CredentialKey("cloud_cloudflare", "org:")`.
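
A minimal sketch of the identity check in step 3, assuming the public Cloudflare v4 API base URL and the usual `{"success": ..., "result": {"status": ...}}` response envelope. The helper names are hypothetical, and the request is only built here, not sent.

```python
import urllib.request

VERIFY_URL = "https://api.cloudflare.com/client/v4/user/tokens/verify"


def build_verify_request(token: str) -> urllib.request.Request:
    """Prepare GET /user/tokens/verify with the scoped token as a bearer credential."""
    return urllib.request.Request(
        VERIFY_URL,
        headers={"Authorization": f"Bearer {token}"},
        method="GET",
    )


def token_is_active(payload: dict) -> bool:
    """True only when the call succeeded and the token status is 'active'."""
    result = payload.get("result") or {}
    return bool(payload.get("success")) and result.get("status") == "active"


if __name__ == "__main__":
    sample = {"success": True, "result": {"id": "example", "status": "active"}}
    print(token_is_active(sample))
```

Keeping the parsing separate from the HTTP call makes it straightforward to avoid logging or echoing the token itself.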

## Rotation

| Cloud | Auth method | Rotation duty |
| --- | --- | --- |
| AWS | IAM role + external id | None for the role itself. Rotate the HMAC key (`ALPHASWARM_ADMIN_CLOUD_AWS_EXTERNAL_ID_SECRET`) when an operator with insider knowledge leaves; re-running the wizard then generates a new external id, and the customer updates the trust policy to match. |
| Azure | Workload Identity Federation | None — no client secret to rotate. |
| GCP | WIF + impersonation | None — no JSON key to rotate. |
| Cloudflare | Scoped API token | 60–90 days recommended. The wizard re-validates the token through the health route; an `expires_on` timestamp surfaces in the integration list. |
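
To see why rotating the HMAC key forces a new external id, here is a sketch that assumes the id is an HMAC-SHA256 digest over the org id. The exact message, encoding, and the name `derive_external_id` are assumptions; the source only states that an HMAC key is involved.

```python
import hashlib
import hmac


def derive_external_id(secret: bytes, org_id: str) -> str:
    """Hypothetical derivation: hex HMAC-SHA256 of the org id under the secret."""
    return hmac.new(secret, org_id.encode("utf-8"), hashlib.sha256).hexdigest()


if __name__ == "__main__":
    before = derive_external_id(b"old-key", "org-example")
    after = derive_external_id(b"new-key", "org-example")
    # A rotated key yields a different id, so the trust policy must be updated.
    print(before != after)
```

Under this assumption the id is deterministic per org and key, so nothing but the key needs to be stored to re-derive it.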

## Disconnect

`DELETE /admin/accounts/{org_id}/cloud/{cloud_kind}` drops the local
record. Vendor-side cleanup (delete the IAM role, delete the
federated credential, delete the WIF pool, revoke the scoped token)
is the operator's responsibility — the runbook calls this out
because vendor APIs that delete principals require elevated
permissions AlphaSwarm intentionally does not request.

## Where to look in code

- ABC + lifecycle helpers:
  [`alphaswarm_admin/src/alphaswarm_admin/providers/base.py`](../../../../alphaswarm_admin/src/alphaswarm_admin/providers/base.py)
- Per-cloud providers:
  [`cloud_aws.py`](../../../../alphaswarm_admin/src/alphaswarm_admin/providers/cloud_aws.py),
  [`cloud_azure.py`](../../../../alphaswarm_admin/src/alphaswarm_admin/providers/cloud_azure.py),
  [`cloud_gcp.py`](../../../../alphaswarm_admin/src/alphaswarm_admin/providers/cloud_gcp.py),
  [`cloud_cloudflare.py`](../../../../alphaswarm_admin/src/alphaswarm_admin/providers/cloud_cloudflare.py)
- Router:
  [`alphaswarm_admin/src/alphaswarm_admin/api/routers/cloud_onboarding.py`](../../../../alphaswarm_admin/src/alphaswarm_admin/api/routers/cloud_onboarding.py)
- Shared frontend wizard:
  [`alphaswarm_admin/frontend/components/cloud/CloudOnboardingWizard.tsx`](../../../../alphaswarm_admin/frontend/components/cloud/CloudOnboardingWizard.tsx)
- Settings:
  [`alphaswarm_admin/src/alphaswarm_admin/settings.py`](../../../../alphaswarm_admin/src/alphaswarm_admin/settings.py)
  (the `cloud_*` field block).

## See also

- [Account integrations](../../concepts/identity/account-integrations.md)
  — sibling per-org integrations (HuggingFace, Docker Hub).
- [Cloud-CLI temporary credentials](cloud-cli-temporary-credentials.md)
  — short-lived credential mint surface, complementary to the
  long-lived federated identity established here.
- [Credentials](../../concepts/identity/credentials.md) — the
  `CredentialResolver` chain that backs persistence (hard rule 26).


<!-- https://alpha-swarm.ai/how-to/operations/control-platform-ecs-deployment -->
# Control the platform ECS deployment
> Operate the hosted platform's own AWS ECS Fargate services from alphaswarm_admin: rollout status, redeploy, scale, logs, and monitoring.

# Control the platform ECS deployment

The `Platform` page in `alphaswarm_admin` (route `/platform`, API
`/admin/platform/ecs/*`) controls the **hosted platform's own** AWS ECS
Fargate slice — the `alphaswarm-admin` and `alphaswarm-agentcore-proxy`
services that run the control plane. It is the counterpart to the
`Services` page, which brokers **customer workload** lifecycle to
`alphaswarm_controller`.

The admin reaches AWS with its **ECS task role** — no static keys. The
`ecs-fargate-control-plane` Terraform module grants a tightly scoped
self-management policy (`ecs:UpdateService` + `Describe*`, `logs:*` read,
`cloudwatch:*` read) to the services that set `enable_self_management`.

## What it shows

| Surface | Source | Purpose |
| --- | --- | --- |
| Service table | `ecs:DescribeServices` | Live rollout state per service (`IN_PROGRESS` / `COMPLETED` / `FAILED`), running vs desired tasks. |
| Logs drawer | CloudWatch Logs `FilterLogEvents` | A bounded tail of the service's `awslogs` group, resolved from its task definition. |
| Metrics drawer | CloudWatch `GetMetricData` | Container Insights CPU, memory, and running-task count over a window. |
| Alarms strip | `cloudwatch:DescribeAlarms` | The platform's per-service alarms (running-task floor, CPU, memory). |
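
The service table row can be derived from one entry of an `ecs:DescribeServices` response. The sketch below reads the `PRIMARY` deployment's `rolloutState` plus the running and desired counts. The function name and the output shape are illustrative, not the admin's actual schema.

```python
def rollout_summary(service: dict) -> dict:
    """Condense one DescribeServices `services[]` entry into a table row."""
    deployments = service.get("deployments", [])
    # The PRIMARY deployment is the one currently being rolled out or serving.
    primary = next((d for d in deployments if d.get("status") == "PRIMARY"), {})
    return {
        "service": service.get("serviceName"),
        "rollout": primary.get("rolloutState", "UNKNOWN"),
        "running": service.get("runningCount", 0),
        "desired": service.get("desiredCount", 0),
    }


if __name__ == "__main__":
    sample = {
        "serviceName": "alphaswarm-admin",
        "runningCount": 1,
        "desiredCount": 2,
        "deployments": [{"status": "PRIMARY", "rolloutState": "IN_PROGRESS"}],
    }
    print(rollout_summary(sample))
```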

## Prerequisites

Set on the `alphaswarm_admin` deployment:

| Env var | Purpose |
| --- | --- |
| `ALPHASWARM_ADMIN_PLATFORM_ECS_CLUSTER` | ECS cluster name the surface targets. The `ecs-fargate-control-plane` module publishes it at `/alphaswarm//ecs_cluster_name`. |
| `ALPHASWARM_ADMIN_PLATFORM_AWS_REGION` | Region the cluster runs in (default `us-east-1`). |
| `ALPHASWARM_ADMIN_PLATFORM_ALARM_PREFIX` | Alarm-name prefix used to scope the alarm listing (default `alphaswarm-`). |

The admin must run with `alphaswarm-admin[cloud-aws]` installed (the
`boto3` extra). When `boto3` is missing the surface returns
`503 provider_unavailable` with an actionable message; when the cluster
is unset it returns `503 provider_misconfigured`.

Cross-account or local operation: set
`ALPHASWARM_ADMIN_PLATFORM_AWS_ASSUME_ROLE_ARN` (and optionally
`ALPHASWARM_ADMIN_PLATFORM_AWS_EXTERNAL_ID`) to assume a role into the
target account instead of using the ambient task role.

## Redeploy a service

A redeploy starts a new rolling deployment with the same task
definition (`forceNewDeployment`), which is how you pick up a freshly
pushed image on a moving tag or recover a wedged service. The ECS
**deployment circuit breaker** with auto-rollback (configured on the
service in Terraform) reverts a deployment that never reaches steady
state, so a bad image does not take the service down.

1. Open `/platform`.
2. Press **Redeploy** on the target row.
3. Type the service name to confirm. The action is audit-first and
   requires step-up MFA — the UI transparently pops the MFA prompt when
   the server raises the RFC 9470 challenge.
4. Watch the rollout badge move to `COMPLETED` (or `FAILED`, which means
   the circuit breaker rolled back).

## Scale a service

1. Press **Scale**, set the desired task count, and type the service
   name to confirm.
2. Scaling to `0` stops the service; scale back up to restore it.

Both redeploy and scale write a `security_audit_events` row before the
AWS call and a `succeeded` / `failed` row after.
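
The audit-first pattern can be sketched as a wrapper that records intent before the AWS call and the outcome after it. Here `ecs` is expected to behave like a boto3 ECS client (`update_service` with `forceNewDeployment` or `desiredCount`), while the in-memory `audit` list, the `requested` status label, and the function names are stand-ins for the real `security_audit_events` writer.

```python
from typing import Any, Callable


def audited_call(audit: list, action: str, service: str,
                 call: Callable[[], Any]) -> Any:
    """Write an audit row before the call, then a succeeded/failed row after."""
    audit.append({"action": action, "service": service, "status": "requested"})
    try:
        result = call()
    except Exception as exc:
        audit.append({"action": action, "service": service,
                      "status": "failed", "error": str(exc)})
        raise
    audit.append({"action": action, "service": service, "status": "succeeded"})
    return result


def redeploy(ecs: Any, audit: list, cluster: str, service: str) -> Any:
    # Same task definition, new rolling deployment.
    return audited_call(audit, "redeploy", service, lambda: ecs.update_service(
        cluster=cluster, service=service, forceNewDeployment=True))


def scale(ecs: Any, audit: list, cluster: str, service: str, desired: int) -> Any:
    # desired=0 stops the service; a later scale-up restores it.
    return audited_call(audit, "scale", service, lambda: ecs.update_service(
        cluster=cluster, service=service, desiredCount=desired))
```

Because the first row lands before the AWS call, an attempt that crashes mid-flight still leaves a trace.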

## Read logs and metrics

- **Logs** resolve the `awslogs` group from the service's task
  definition, then tail recent events. Pass a CloudWatch Logs filter
  pattern to narrow the stream.
- **Metrics** read Container Insights series (CPU, memory, running
  tasks). Enhanced Container Insights must be on for the cluster (the
  module sets `containerInsights = enhanced`).

## Boundary

This surface is for the platform's **own** infrastructure. Customer
workloads stay on the `Services` page, which brokers to the control
plane. All boto3 lives in
`alphaswarm_admin.services.platform_deployment` behind the same
`require_sdk` lazy import the cloud-onboarding providers use — route
handlers never import a cloud SDK.
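
A lazy-import guard of this kind might look like the sketch below. The `require_sdk` name comes from the text above, but its signature, the exception class, and the message wording are assumptions made for illustration.

```python
import importlib
from types import ModuleType


class ProviderUnavailable(RuntimeError):
    """Maps to the 503 `provider_unavailable` response (hypothetical class)."""

    status_code = 503
    code = "provider_unavailable"


def require_sdk(module_name: str, extra: str) -> ModuleType:
    """Import a cloud SDK on first use; fail with an actionable message if absent."""
    try:
        return importlib.import_module(module_name)
    except ImportError as exc:
        raise ProviderUnavailable(
            f"{module_name} is not installed; install alphaswarm-admin[{extra}]"
        ) from exc
```

A service function would call something like `require_sdk("boto3", "cloud-aws")` at the point of use, which keeps route modules importable on deployments without the extra.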

## See also

- [alphaswarm-admin service](../../concepts/infrastructure/services/alphaswarm-admin.md)
  — deployment surfaces + identity.
- [Connect a company cloud account](connect-company-cloud-account.md) —
  the federated-first wizard for customer cloud accounts.
- [Admin Agent Identity](../../concepts/identity/admin-agent-identity.md)
  — how the admin authenticates outbound to the control plane.


<!-- https://alpha-swarm.ai/how-to/operations/edge-deploy -->
# Operations runbook — Edge deployment
> The simplest edge deployment: one Linux VM running the docker-compose stack with the admin overlay

# Operations runbook — Edge deployment

This runbook covers deploying AlphaSwarm to edge / on-prem locations where the standard cloud K8s overlays don't fit.

## Reference shapes

### Shape A — single VM with Docker Compose

The simplest edge deployment: one Linux VM running the docker-compose stack with the admin overlay.

```bash
git clone https://github.com/julianwiley/alphaswarm.git
cd alphaswarm

# Generate config + bring up
make generate-config ENV=local
make dev-admin
```

Suitable for: dev labs, single-tenant trials, training environments.

Not suitable for: multi-node fault tolerance, HPA, NetworkPolicy enforcement.

### Shape B — k3s on a single edge box

For sites that have a single VM but still want production-style observability + Pod-level lifecycle:

```bash
curl -sfL https://get.k3s.io | sh -
kubectl apply -k alphaswarm_platform/deployments/kubernetes/overlays/dev
```

k3s ships with Traefik (substitute for the NGINX Ingress) and a built-in service load balancer (Klipper). You can install NGINX Ingress on top if you want to keep the same Ingress manifests as production.

### Shape C — rpi_kubernetes (4-node k3s lab)

The reference home/edge cluster uses **two sibling repos**:

1. **`rpi_kubernetes`** — k3s bootstrap, portal, FinOps policies, storage class.
2. **`alphaswarm`** — every shared service + AlphaSwarm workload under
   `alphaswarm_platform/deployments/kubernetes/`.

```bash
# In rpi_kubernetes (portal + cluster bootstrap only)
kubectl apply -k kubernetes/

# In alphaswarm (AlphaSwarm shared infra + app overlays)
kubectl apply -k alphaswarm_platform/deployments/kubernetes/overlays/dev
```

Streaming install helpers live under
`alphaswarm_platform/scripts/cluster_install/` (`install-flink.sh`,
`install-alphavantage.sh`, `build-flink-jobs.sh`). See
[streaming.md](../../concepts/data/streaming.md) for the full order.

## Edge-specific concerns

### Image distribution

Edge sites often have slow / metered uplinks. Mirror the AlphaSwarm images into an on-site registry:

```bash
docker pull ghcr.io/julianwiley/alphaswarm-client:latest-stable
docker tag ghcr.io/julianwiley/alphaswarm-client:latest-stable mirror.local:5000/alphaswarm-client:latest-stable
docker push mirror.local:5000/alphaswarm-client:latest-stable
```

Then override the image tags in your overlay:

```yaml
# alphaswarm_platform/deployments/kubernetes/overlays/edge-site-a/kustomization.yaml
images:
  - name: ghcr.io/julianwiley/alphaswarm-client
    newName: mirror.local:5000/alphaswarm-client
    newTag: latest-stable
```

### Auth0 unreachability

Edge sites may have intermittent connectivity to Auth0's JWKS endpoint. The JWT validator caches JWKS for `ALPHASWARM_CP_AUTH_JWKS_TTL_SECONDS` (default 600s); set it higher (e.g. 3600s) so the cache spans typical outage windows.
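
The effect of the TTL can be illustrated with a small cache that refetches only after the TTL has elapsed. This is a generic sketch with an injectable clock, not the validator's real implementation; the class name and the `fetch` callable are assumptions.

```python
import time
from typing import Any, Callable, Optional


class TTLCache:
    """Cache one fetched value and refetch only after ttl_seconds have passed."""

    def __init__(self, fetch: Callable[[], Any], ttl_seconds: float = 600.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._fetch = fetch
        self._ttl = ttl_seconds
        self._clock = clock
        self._value: Any = None
        self._expires_at: Optional[float] = None  # None means nothing cached yet

    def get(self) -> Any:
        now = self._clock()
        if self._expires_at is None or now >= self._expires_at:
            # Only this branch touches the network (the JWKS endpoint).
            self._value = self._fetch()
            self._expires_at = now + self._ttl
        return self._value
```

With a TTL of 3600 seconds, an uplink outage shorter than the remaining cache lifetime never reaches the fetch path, which is the reason for raising the default at edge sites.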

In hard offline scenarios, set `ALPHASWARM_AUTH_ENFORCE=permissive` so requests that cannot be authenticated fall through to the local-default identity and the violation is audit-logged. The operator UI shows a yellow banner when this mode is active.

### Storage

Edge sites should NOT rely on the in-cluster Postgres + Redis. Provision durable storage upstream and point AlphaSwarm at it via the connectivity matrix:

```bash
ALPHASWARM_DATABASE_URL=postgresql://alphaswarm:****@cloud-postgres.example.com:5432/alphaswarm
ALPHASWARM_REDIS_URL=rediss://cloud-redis.example.com:6380
```

### Telemetry

Edge sites should forward telemetry to a central observability collector. Set `ALPHASWARM_OTEL_COLLECTOR_URL` to the gateway endpoint; the control plane streams MetricPoints + AlertEvents to it via OTLP.

## Cutover from compose to k3s

If you started on shape A and want to move to shape B:

1. Take a Postgres dump while the compose stack is still running: `docker exec alphaswarm-postgres pg_dump -U alphaswarm alphaswarm > alphaswarm.sql`.
2. `docker compose down` to stop the compose stack.
3. Bring up shape B per the recipe above.
4. Restore: `kubectl exec -i -n alphaswarm deploy/alphaswarm-postgres -- psql -U alphaswarm alphaswarm < alphaswarm.sql` (the `-i` flag forwards the dump on stdin).
5. Verify `/manage/health` and `/health` both return 200.

No code changes required — the connectivity matrix abstracts which backend is hosting which service.


<!-- https://alpha-swarm.ai/how-to/operations/go-live-minimum -->
# Go-live: minimum deployment
> Ordered first-deployment sequence for the four public surfaces — docs + landing site (Cloudflare Pages), admin UI (ECS Fargate), and the minimum platform tier — with the exact commands and the credentials each step needs.

# Go-live: minimum deployment

> Deep-dive companions: [aws-deploy.md](./aws-deploy.md) (full hybrid
> bootstrap), [aws-runbook.md](./aws-runbook.md) (day-2 pipeline ops),
> [tenant-router auth rollout](../tenant-router-auth-rollout.md)
> (edge enforcement, when the k8s edge goes live).

This is the shortest path to four live surfaces, in dependency order.
Current state (verified 2026-06-09): all application code is merged to
`main`, but **no deploy has ever succeeded** — every AWS-touching
workflow fails at `configure-aws-credentials` because the one-time
bootstrap (OIDC deployer roles + GitHub repo variables) has not been
run, and the docs repo had no content-deploy workflow at all (added as
`deploy-pages.yml` alongside this page).

| Order | Surface | Mechanism | Needs |
| --- | --- | --- | --- |
| 1 | Docs + marketing/landing | `alphaswarm_docs` `deploy-pages.yml` → Cloudflare Pages | 2 Cloudflare secrets, **no AWS** |
| 2 | AWS bootstrap | local `terraform apply` ×1 | AWS admin profile |
| 3 | Platform minimum (VPC/ECR/RDS/Redis/ALB/ECS) | local apply of `infrastructure/envs/minimum` then `terraform/environments/minimum` | step 2 |
| 4 | Admin UI | `alphaswarm_admin` `build-publish.yml` → ECR → app-tier redeploy | steps 2–3 + repo vars |
| 5 | Custom domains / apex (optional) | `deploy-edge.yml` Terraform stacks | steps 2–3 + Vault CF token |

---

## 1. Docs + landing site (Cloudflare Pages — independent of AWS)

The Docusaurus build is one artifact serving the landing page at `/`
and the docs tree beneath it. `deploy-pages.yml` builds and ships it to
the `alphaswarm-docs` Pages project, creating the project on first run
(Terraform later adopts it — see step 5).

```bash
# One-time: create a Cloudflare API token scoped
#   Account > Cloudflare Pages > Edit
# at https://dash.cloudflare.com/profile/api-tokens, then:
gh secret set CLOUDFLARE_API_TOKEN  --repo Alpha-Swarm-ai/alphaswarm_docs
gh secret set CLOUDFLARE_ACCOUNT_ID --repo Alpha-Swarm-ai/alphaswarm_docs

# Deploy (also auto-runs on every push to main):
gh workflow run deploy-pages.yml --repo Alpha-Swarm-ai/alphaswarm_docs --ref main
gh run watch --repo Alpha-Swarm-ai/alphaswarm_docs

# Live at https://alphaswarm-docs.pages.dev until step 5 attaches
# the alpha-swarm.ai apex + www domains.
```

## 2. AWS bootstrap (one-time, local terminal, ~10 min)

Mints the GitHub OIDC provider, deployer roles, and the state
bucket/lock/KMS that every other stack depends on. Local state by
design.

```bash
cd alphaswarm_platform/infrastructure/bootstrap
export AWS_PROFILE=alphaswarm-shared-platform-admin   # your admin profile
terraform init
terraform apply -var=account_alias=shared
# (Repeat with -var=account_alias={dev,qa,prod} only when you adopt the
#  multi-account split; the minimum tier lives in the single account.)
```

## 3. Platform minimum tier (single account, ~$140/mo)

Infra tier (VPC, ECR ×3, RDS Postgres, Redis, Bedrock policy, the
`AqpGithubDeployerMinimum` role, SSM handles), then app tier (Cognito,
ALB, ECS Fargate control plane):

```bash
cd alphaswarm_platform/infrastructure/envs/minimum
cp backend.hcl.example backend.hcl && cp terraform.tfvars.example terraform.tfvars
$EDITOR backend.hcl terraform.tfvars      # bucket/table from step 2 outputs
terraform init -backend-config=backend.hcl
terraform apply

cd ../../../terraform/environments/minimum
terraform init -backend-config=backend.hcl   # same pattern
terraform apply        # first apply: images don't exist yet — services
                       # stabilise after step 4 pushes them
```

Then wire CI so subsequent rollouts never need a laptop — set the repo
variables from the role ARNs the applies just published (SSM
`/alphaswarm/minimum/*` / stack outputs):

```bash
for repo in alphaswarm_platform alphaswarm_admin; do
  gh variable set AWS_PLAN_ROLE_ARN  --repo Alpha-Swarm-ai/$repo --body "arn:aws:iam:::role/"
  gh variable set AWS_APPLY_ROLE_ARN --repo Alpha-Swarm-ai/$repo --body "arn:aws:iam:::role/AqpGithubDeployerMinimum"
  gh variable set AWS_BUILD_ROLE_ARN --repo Alpha-Swarm-ai/$repo --body "arn:aws:iam:::role/AqpGithubDeployerMinimum"
  gh variable set AWS_REGION         --repo Alpha-Swarm-ai/$repo --body us-east-1
done
gh variable set SHARED_ACCOUNT_ID --repo Alpha-Swarm-ai/alphaswarm_platform --body ""
# Plus the prod GitHub Environment (Settings → Environments → prod,
# required reviewers) — alphaswarm_admin's main-branch build binds to it.
```

## 4. Admin UI

`build-publish.yml` builds the admin backend + frontend images
multi-arch, pushes to the ECR repos step 3 created, and fires the
`admin-image-published` dispatch so the platform app tier redeploys
with the new tags (needs `PLATFORM_DISPATCH_TOKEN` — a fine-grained PAT
with `actions:write` on `alphaswarm_platform`):

```bash
gh secret set PLATFORM_DISPATCH_TOKEN --repo Alpha-Swarm-ai/alphaswarm_admin
gh workflow run build-publish.yml --repo Alpha-Swarm-ai/alphaswarm_admin \
  --ref main -f env=minimum
gh run watch --repo Alpha-Swarm-ai/alphaswarm_admin

# Verify rollout (alarm/log/scale surface is the admin's own ECS panel
# once it's up — control-platform-ecs-deployment.md):
aws ecs describe-services --cluster "$(aws ssm get-parameter \
  --name /alphaswarm/minimum/ecs_cluster_name --query Parameter.Value --output text)" \
  --services alphaswarm-admin --query 'services[0].deployments'
# Admin UI answers on the ALB DNS name (output of the app-tier apply)
# at "/" with the backend health at /admin/health.
```

## 5. Custom domains / apex public surface (optional, after 1–3)

The `docs-edge` + `apex-redirect` + `demo-edge` Terraform stacks attach
`alpha-swarm.ai` (+ `www`, `docs.*` alias, Access-gated `/demo`) to the
Pages project from step 1. They run through
`alphaswarm_platform/.github/workflows/deploy-edge.yml` → CodeBuild →
`alphaswarm deploy` (AGENTS rule 42), so they need step 3's roles plus
the Cloudflare token in Vault. Because step 1 already created the Pages
project, import it before the first apply:

```bash
# inside the docs-edge stack workspace:
terraform import 'module.cloudflare_pages_docs.cloudflare_pages_project.docs' \
  '/alphaswarm-docs'

gh workflow run deploy-edge.yml --repo Alpha-Swarm-ai/alphaswarm_platform \
  --ref main -f stack=docs-edge -f env=prod -f action=plan   # then action=apply
```

## Known footguns

- **`build-publish.yml` (platform repo) tags `:latest` unconditionally**
  — only ever dispatch it from `main` or a version tag.
- **The k8s edge ships fail-closed**: stamp
  `ALPHASWARM_TENANT_ROUTER_OIDC_ISSUER`/`_AUDIENCE` before applying
  `deployments/kubernetes/edge/` or the tenant-router crash-loops by
  design ([runbook](../tenant-router-auth-rollout.md)). Not part of the
  minimum tier (no EKS), listed here because the manifests are on `main`.
- **`terraform-pipeline.yml` auto-plans `admin-dev` on every `main`
  push** — it stays red until step 3's variables exist; that's the
  expected signal, not a regression.


<!-- https://alpha-swarm.ai/how-to/operations/hft-node-onboarding -->
# HFT Node Onboarding
> How to bring up a new dedicated node for `Frequency.HFT` bots: hardware prerequisites, taints and labels, kubelet override, HugePages, PTP, SR-IOV, and validation.

# HFT Node Onboarding

> How to bring up a new dedicated node for `Frequency.HFT` bots.
> Runtime audience: SRE + platform team.

## Pre-requisites

- Bare-metal or near-bare-metal hardware with:
  - Hardware-timestamping NIC (Intel I210/X710, Mellanox ConnectX-5/6/7).
  - At least 2 NUMA nodes (most modern dual-socket Xeons / Epycs).
  - 2 MiB HugePages support (kernel default).
  - SR-IOV-capable NIC.
- Linux kernel >= 5.10 with PTP support (`ptp4l` + `phc2sys` from
  `linuxptp`).
- The node is already a member of the cluster and runs the standard
  kubelet.

## 1. Taint + label the node

```bash
kubectl taint nodes  quantbot.io/hft=true:NoSchedule
kubectl label nodes  quantbot.io/hft=true
```

## 2. Apply the kubelet override

The kubelet config drop-in lives at
`alphaswarm_platform/deployments/kubernetes/hft-nodes/kubelet-config.yaml`.
On systemd hosts:

```bash
sudo cp kubelet-config.yaml /etc/kubernetes/kubelet/kubelet.conf.d/quantbot-hft.conf
sudo systemctl restart kubelet
```

Verify:

```bash
kubectl get --raw "/api/v1/nodes//proxy/configz" | jq .kubeletconfig.cpuManagerPolicy
# Expect: "static"
```

## 3. Allocate HugePages

```bash
kubectl apply -f alphaswarm_platform/deployments/kubernetes/hft-nodes/hugepages-allocation.yaml
# DaemonSet runs once per HFT node and sets nr_hugepages=1024.
```

## 4. Bring up PTP

```bash
kubectl apply -f alphaswarm_platform/deployments/kubernetes/hft-nodes/ptp-config.yaml
```

Verify clock discipline (run inside the `quantbot-ptp` pod):

```bash
kubectl exec -n alphaswarm-bots quantbot-ptp- -c phc2sys -- \
  pmc -u -b 0 'GET CURRENT_DATA_SET' | grep offsetFromMaster
# Expect offsetFromMaster (nanoseconds) around 0, i.e. sub-microsecond on a healthy network.
```

## 5. Configure SR-IOV

If the SR-IOV Network Operator is installed:

```bash
kubectl apply -f alphaswarm_platform/deployments/kubernetes/hft-nodes/sr-iov-config.yaml
```

Verify VFs are exposed:

```bash
kubectl get nodes  -o json | jq '.status.allocatable | with_entries(select(.key | startswith("openshift.io/quantbot_hft_vf")))'
```

## 6. Apply the tuned profile

```bash
kubectl apply -f alphaswarm_platform/deployments/kubernetes/hft-nodes/node-tuning-operator.yaml
```

## 7. Validate the node passes the operator's HFT check

The QuantBot Operator's validating webhook will refuse to schedule
an HFT bot on a node that fails any of:

- `quantbot.io/hft` label present
- PTP DaemonSet pod running on the node
- HugePages allocation >= the bot's request
- SR-IOV VF available

Run the operator's diagnostics:

```bash
alphaswarm-bots validate 
```

A passing validation prints `valid: true` and no failure entries.
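
The four checks above can be pictured as a pure function over the node's observed state. The `NodeState` fields, the function name, and the failure messages are illustrative assumptions; the operator's real webhook and the `alphaswarm-bots validate` output format may differ.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class NodeState:
    labels: Dict[str, str]
    ptp_pod_running: bool
    hugepages_allocated: int   # number of 2 MiB pages on the node
    free_vfs: int              # SR-IOV virtual functions still allocatable


def validate_hft_node(node: NodeState, hugepages_requested: int) -> dict:
    """Return {'valid': bool, 'failures': [...]} for the four HFT checks."""
    failures: List[str] = []
    if "quantbot.io/hft" not in node.labels:
        failures.append("label quantbot.io/hft missing")
    if not node.ptp_pod_running:
        failures.append("PTP DaemonSet pod not running")
    if node.hugepages_allocated < hugepages_requested:
        failures.append("HugePages allocation below the bot's request")
    if node.free_vfs < 1:
        failures.append("no SR-IOV VF available")
    return {"valid": not failures, "failures": failures}
```

Collecting every failure, rather than stopping at the first, lets one diagnostic run show everything that still needs fixing on the node.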

## Rollback

To take the node out of the HFT pool:

```bash
kubectl drain  --ignore-daemonsets --delete-emptydir-data
kubectl taint nodes  quantbot.io/hft=true:NoSchedule-
kubectl label nodes  quantbot.io/hft-
```

The HFT DaemonSets (ptp, hugepages, sriov) auto-stop on the node.


<!-- https://alpha-swarm.ai/how-to/operations/incident-response -->
# Operations runbook — Incident response
> Standard playbook for diagnosing and recovering from AlphaSwarm production incidents: triage tree, common diagnostic commands, and scenarios A–F.

# Operations runbook — Incident response

Standard playbook for diagnosing + recovering from AlphaSwarm production incidents.

## Triage tree

```
                         Incident detected
                                |
              +-----------------+-----------------+
              |                                   |
       Workload error                     Platform error
              |                                   |
   +----------+----------+         +-------------+-------------+
   |          |          |         |             |             |
  Single   Several    All deps   Auth         Network        Storage
  pod      pods       degraded   (Auth0)      / Ingress      (Postgres
  crashes  CrashLoop                          rate-limited)   / Redis)
   |          |          |         |             |             |
   v          v          v         v             v             v
[A: pod    [B: HPA   [C: drain   [D: jwks    [E: ingress    [F: stateful
 logs]      thrash]   the queue]  503]        503]           failover]
```

## Common diagnostic commands

```powershell
# Pod status across both AlphaSwarm namespaces
kubectl get pods -n alphaswarm -o wide
kubectl get pods -n alphaswarm-admin -o wide

# Top resource consumers
kubectl top pods -n alphaswarm --sort-by=cpu
kubectl top pods -n alphaswarm --sort-by=memory

# Recent events
kubectl get events -n alphaswarm --sort-by='.lastTimestamp' | Select-Object -Last 30

# Tail logs for the API
kubectl logs -n alphaswarm deploy/alphaswarm-core --tail=200 -f

# Control-plane audit log (rolled to stdout by default; if ALPHASWARM_CP_AUDIT_LOG_PATH
# is set, also written to a file).
kubectl logs -n alphaswarm-admin deploy/alphaswarm-cp --tail=200 -f | findstr workload_run

# Recent terraform_runs (provisioning audit ledger).
kubectl exec -n alphaswarm deploy/alphaswarm-core -- python -m alphaswarm.cli runs list --limit 20
```

## Scenario A — single pod crashes

```powershell
# Identify the crashing pod
kubectl get pods -n alphaswarm -l app=alphaswarm-core | findstr CrashLoop

# Inspect
kubectl describe pod -n alphaswarm 
kubectl logs -n alphaswarm  --previous

# Rolling restart of the deployment (HPA + PDB keep the service up)
kubectl rollout restart -n alphaswarm deployment/alphaswarm-core
```

## Scenario B — HPA thrashing

The HPA is scaling rapidly up + down, never stabilising.

```powershell
# Check the HPA's recent decisions
kubectl describe hpa -n alphaswarm alphaswarm-core

# Most common cause: a runaway query or backtest that spikes CPU then
# crashes back. Check the audit log for recent task starts.
kubectl logs -n alphaswarm deploy/alphaswarm-worker --tail=500 | findstr "started finished FAILED"

# Mitigation: temporarily widen the HPA stabilizationWindow.
kubectl patch hpa -n alphaswarm alphaswarm-core --type='json' -p='[
  {"op":"replace","path":"/spec/behavior/scaleUp/stabilizationWindowSeconds","value":300}
]'
```

## Scenario C — Celery queue depth alarm

```powershell
# Inspect the tasks the workers are currently executing
kubectl exec -n alphaswarm deploy/alphaswarm-worker -- celery -A alphaswarm.tasks.celery_app inspect active

# Scale workers up
kubectl scale -n alphaswarm deployment/alphaswarm-worker --replicas=8

# Or via the control plane (lands an audit row)
curl -X PATCH https://manage.alphaswarm.enterprise.com/manage/deployments/alphaswarm-worker/scale?replicas=8 `
  -H "Authorization: Bearer $TOKEN"
```

## Scenario D — Auth0 JWKS returns 503

Symptom: every authenticated request fails with `jwks_unreachable`.

```powershell
# Probe JWKS directly from inside a pod
kubectl exec -n alphaswarm deploy/alphaswarm-core -- curl -fsS https://your-tenant.us.auth0.com/.well-known/jwks.json

# Common causes:
# - Auth0 service incident (https://status.auth0.com/)
# - Outbound 443 blocked by NetworkPolicy (check network-policies.yaml)
# - DNS resolution failure inside the cluster

# Mitigation: flip ALPHASWARM_AUTH_ENFORCE to permissive for read-only routes
# while you wait for Auth0 to recover. ONLY do this if your operator
# UI is firewalled at the Ingress layer.
kubectl set env -n alphaswarm deploy/alphaswarm-core ALPHASWARM_AUTH_ENFORCE=permissive
```

## Scenario E — Ingress returns 503

```powershell
# Check NGINX Ingress controller
kubectl -n ingress-nginx get pods
kubectl -n ingress-nginx logs deploy/ingress-nginx-controller --tail=200

# Check service endpoints
kubectl get endpoints -n alphaswarm alphaswarm-client
kubectl get endpoints -n alphaswarm-admin alphaswarm-cp

# If endpoints are empty, the pods aren't passing readinessProbe.
```

## Scenario F — Stateful service failover

### Postgres

The in-cluster Postgres is a single pod backed by a PVC. Migrate the K8s overlays to a managed Postgres (Aurora / Cloud SQL / Azure Database for PostgreSQL) before going to prod.

For dev/staging:

```powershell
kubectl -n alphaswarm delete pod -l app=postgres   # restarts; data persists in PVC
```

### Redis Stack

```powershell
kubectl -n alphaswarm delete pod redis-master-0   # StatefulSet brings it back
```

## Post-incident

1. Open an incident report in `alphaswarm_docs/incidents/-.md`.
2. Capture: timeline, blast radius, root cause, fixes applied, follow-ups.
3. If a hard rule was bypassed (e.g. ALPHASWARM_AUTH_ENFORCE flipped to permissive), schedule the revert as a P1 task.
4. Add a regression test to prevent the same class of incident.


<!-- https://alpha-swarm.ai/how-to/operations/kill-switch-incident-response -->
# Kill Switch Incident Response
> Three-scope kill switch (bot / fleet / platform): how to engage, release, and verify it, plus the quarterly drill.

# Kill Switch Incident Response

> Three-scope kill switch (bot / fleet / platform). Quarterly drill
> required per blueprint caveat #7.

## Scopes

| Scope | What it halts | Typical use |
| --- | --- | --- |
| `bot` | One Pod (one bot slug) | A single bot is misbehaving |
| `fleet` | Every bot in a fleet | A fleet-wide alpha goes stale |
| `platform` | Every bot on the platform | Emergency — venue outage, regulatory action |

## Engage

### Via CRD (preferred — leaves audit trail)

```bash
kubectl apply -f - <<EOF
apiVersion: ...
kind: KillSwitch
metadata:
  name: emergency-
  namespace: alphaswarm-bots
spec:
  scope: bot                # bot | fleet | platform
  target: mm-aapl           # bot slug / fleet name / "platform"
  mode: flatten             # cancel | flatten | freeze
  reason: "venue outage; halting until investigation complete"
  ttl: 1h
EOF
```

### Via the REST kill-switch fan-out (UI button)

The operator UI's `KillSwitch` topbar component calls the following
halt endpoints in parallel:

- `POST /agents/halt`
- `POST /quant-agents/halt`
- `POST /paper/stop-all`
- `POST /bots/halt-all`   ← halts every active bot deployment
- `POST /rl/halt-all`
- `POST /workflows/halt`

This is the equivalent of `KillSwitch.scope=platform` from the operator
side. Use it when GitOps reconciliation is too slow (the CRD path can
take up to `poll_interval_s` seconds; the REST fan-out is instant).
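
The parallel fan-out can be sketched with `asyncio.gather`. The endpoint paths are the ones listed above; the `post` coroutine is an injected stand-in for whatever HTTP client the UI or a script would use, and the function name is hypothetical.

```python
import asyncio
from typing import Awaitable, Callable, Dict, Sequence

HALT_ENDPOINTS = (
    "/agents/halt",
    "/quant-agents/halt",
    "/paper/stop-all",
    "/bots/halt-all",
    "/rl/halt-all",
    "/workflows/halt",
)


async def fan_out_halt(post: Callable[[str], Awaitable[object]],
                       endpoints: Sequence[str] = HALT_ENDPOINTS) -> Dict[str, bool]:
    """Fire every halt call concurrently; one failure must not block the rest."""
    results = await asyncio.gather(*(post(path) for path in endpoints),
                                   return_exceptions=True)
    return {path: not isinstance(res, BaseException)
            for path, res in zip(endpoints, results)}
```

Using `return_exceptions=True` means an unreachable subsystem shows up as `False` in the result map instead of cancelling the remaining halt calls.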

### Via the redundant Redis polling channel (last resort)

If the Argo CD reconciler is unhealthy AND the REST API is unreachable:

```bash
# Directly set the kill switch key in the bots namespace's Redis.
kubectl exec -n alphaswarm-bots redis-master-0 -- \
  redis-cli SET 'alphaswarm:bots:killswitch:platform:platform' 'manual-emergency'
```

Each bot polls this key every 5 seconds (configurable via
`KillSwitchV2.poll_interval_s`) and halts when set. This is the
fallback documented in blueprint caveat #7.
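
A minimal sketch of the key check one poll iteration might perform. The `<prefix>:<scope>:<target>` layout for the `bot` and `fleet` keys is an assumption extrapolated from the documented platform key, and `get` is a hypothetical lookup callable standing in for a Redis `GET`.

```python
from typing import Callable, Optional

KEY_PREFIX = "alphaswarm:bots:killswitch"


def killswitch_keys(bot_slug: str, fleet: str) -> list[str]:
    # Assumed layout: <prefix>:<scope>:<target>; only the platform key
    # is confirmed by this runbook.
    return [
        f"{KEY_PREFIX}:bot:{bot_slug}",
        f"{KEY_PREFIX}:fleet:{fleet}",
        f"{KEY_PREFIX}:platform:platform",
    ]


def halt_reason(
    get: Callable[[str], Optional[str]], bot_slug: str, fleet: str
) -> Optional[str]:
    """Return the first non-empty value among the keys that apply to this bot."""
    for key in killswitch_keys(bot_slug, fleet):
        value = get(key)
        if value:
            return value
    return None
```

A bot loop would call `halt_reason` once per `poll_interval_s` and begin halting as soon as it returns a value.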

## Release

### Via CRD

```bash
kubectl delete killswitch emergency-<suffix> -n alphaswarm-bots
```

### Via Redis (matching the last-resort engage)

```bash
kubectl exec -n alphaswarm-bots redis-master-0 -- \
  redis-cli DEL 'alphaswarm:bots:killswitch:platform:platform'
```

## Verify

```bash
# CRD view:
kubectl get killswitches -A
# Status:
kubectl get killswitches -A -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.status.engaged}{"\n"}{end}'
# Redis view:
kubectl exec -n alphaswarm-bots redis-master-0 -- \
  redis-cli --scan --pattern 'alphaswarm:bots:killswitch:*'
# Affected bots (operator status):
kubectl get bots -A -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.status.killSwitchEngaged}{"\t"}{.status.killSwitchReason}{"\n"}{end}'
```

## Quarterly drill (caveat #7)

The blueprint mandates a **quarterly drill** because the worst time
to discover the kill switch is broken is during a real incident.

### Drill protocol

1. Schedule a 15-minute window during low-activity hours.
2. Engage `scope=platform` via the CRD path.
3. Verify every bot in `kubectl get bots -A` transitions to
   `status.phase=Draining` within 10 seconds.
4. Verify every bot reaches `Stopped` within 30 seconds (HFT) or
   300 seconds (everything else).
5. Release the kill switch.
6. Verify bots auto-restart (their Deployments / StatefulSets reconcile).
7. Repeat with the REST fan-out path.
8. Repeat with the Redis fallback path.
9. Record the drill in the next RTS 6 validation report's
   `kill_switch_drills` evidence section.

Failure of any of the three paths is a P1 incident. Fix before the
drill window closes.
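
The timing checks in steps 3 and 4 lend themselves to a mechanical pass/fail. The sketch below assumes a hypothetical per-bot record holding the seconds elapsed between engage and each observed phase; how those timings are collected is out of scope here.

```python
DRAINING_DEADLINE_S = 10
STOPPED_DEADLINE_S = {"hft": 30, "default": 300}


def drill_failures(bots: list[dict]) -> list[str]:
    """Return one message per missed drill deadline.

    Each bot dict is a hypothetical record with `name`, `kind`, and the
    seconds from engage until `Draining` and `Stopped` were observed
    (None if the phase was never observed).
    """
    failures = []
    for bot in bots:
        stop_limit = STOPPED_DEADLINE_S.get(bot["kind"], STOPPED_DEADLINE_S["default"])
        draining, stopped = bot["draining_s"], bot["stopped_s"]
        if draining is None or draining > DRAINING_DEADLINE_S:
            failures.append(f"{bot['name']}: not Draining within {DRAINING_DEADLINE_S}s")
        if stopped is None or stopped > stop_limit:
            failures.append(f"{bot['name']}: not Stopped within {stop_limit}s")
    return failures
```

An empty list means the path passed; any entry corresponds to the P1 condition described above.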

## Common failure modes

- **Operator pod is down.** Symptom: `KillSwitch` CR created but
  bots don't halt within `poll_interval_s`. Mitigation: the Redis
  polling fallback bypasses the operator entirely.
- **Redis pod is down.** Symptom: neither operator nor polling
  fallback works. Mitigation: at least one of the operator's
  in-memory CR watcher or the REST API fan-out path will still
  halt bots; if all three fail simultaneously, escalate to manual
  `kubectl scale deployment/bot-* --replicas=0`.
- **Redis Pub/Sub vs SET-key drift.** `KillSwitchV2.poll_interval_s`
  defines the upper bound on the polling fallback's latency; if the
  pub/sub channel is dropping messages, polling still works after at
  most one interval.


<!-- https://alpha-swarm.ai/how-to/operations/kubernetes-deploy -->
# Operations runbook — Kubernetes deployment
> - `kubectl` 1.30+ with a current context pointing at the target cluster. - Cluster admin (you'll create namespaces + RBAC). - A container registry the cluster can pull from (Docker Hub / ECR / ACR / G...

# Operations runbook — Kubernetes deployment

End-to-end walkthrough for shipping AlphaSwarm to any Kubernetes cluster (EKS,
AKS, GKE, vanilla k3s, or the Raspberry Pi k3s cluster owned by
`rpi_kubernetes`). AlphaSwarm is fully self-contained: every shared service it
depends on (Postgres, Redis, Kafka, MinIO, MLflow, observability stack,
etc.) ships in `alphaswarm_platform/deployments/kubernetes/`. There is no implicit
dependency on `rpi_kubernetes` or any other repository.

## Prerequisites

- `kubectl` 1.30+ with a current context pointing at the target cluster.
- Cluster admin (you'll create namespaces + RBAC).
- A container registry the cluster can pull from (Docker Hub / ECR / ACR / GCR).
- An ingress controller (`ingress-nginx` recommended) and `cert-manager`
  with a `letsencrypt-prod` `ClusterIssuer` for the AlphaSwarm TLS hosts.
- Auth0 tenant configured per
  [alphaswarm_docs/architecture/decisions/003-auth0-zero-trust.md](../../architecture/decisions/003-auth0-zero-trust.md)
  (default tenant `alphaswarm-fund.us.auth0.com`).
- Cluster operators / CRDs installed via
  [alphaswarm_platform/scripts/cluster_install/](../../scripts/cluster_install/) (Strimzi,
  Spark Operator, OpenTelemetry Operator, Phoenix, Redpanda, etc.). Run
  the relevant installer before applying the AlphaSwarm base kustomization.

## Targeted runbooks

- Two-node tower+laptop bootstrap: [tower-cluster-deploy.md](tower-cluster-deploy.md)
- Blue/green domain cutover: [alphaswarm-fund-blue-green-cutover.md](alphaswarm-fund-blue-green-cutover.md)

## Step 1 — provision Auth0 (one-time)

```powershell
$env:AUTH0_DOMAIN = "your-tenant.us.auth0.com"
$env:AUTH0_M2M_CLIENT_ID = "..."
$env:AUTH0_M2M_CLIENT_SECRET = "..."
$env:ALPHASWARM_SYNC_URL = "https://api.alphaswarm.enterprise.com/_internal/auth0/sync"

python alphaswarm_platform/build/scripts/provision_auth0.py --dry-run    # preview
python alphaswarm_platform/build/scripts/provision_auth0.py              # apply
```

This idempotently creates the API resource server, the four roles, and the post-login Action.

## Step 2 — generate the K8s ConfigMap + Secret scaffold

```powershell
make generate-config ENV=k8s
```

Produces:
- `alphaswarm_platform/deployments/kubernetes/base/configmaps/alphaswarm-config.yaml` (commit this)
- `alphaswarm_platform/deployments/kubernetes/base/secrets/alphaswarm-secrets.yaml.template` (DO NOT commit values — CI/CD or external-secrets-operator patches real values)

## Step 3 — build + push images

```powershell
$env:IMAGE_TAG = "rc-$(git rev-parse --short HEAD)-$(Get-Date -Format yyyy-MM-dd)"
make build-client IMAGE_TAG=$env:IMAGE_TAG
make build-cp IMAGE_TAG=$env:IMAGE_TAG

# Optional (only if the Dockerfiles exist in alphaswarm_platform/build/docker/*)
make build-worker IMAGE_TAG=$env:IMAGE_TAG
make build-ingestion IMAGE_TAG=$env:IMAGE_TAG

docker login
docker push docker.io/julianwiley/alphaswarm-client:$env:IMAGE_TAG
docker push docker.io/julianwiley/alphaswarm-controller:$env:IMAGE_TAG
docker push docker.io/julianwiley/alphaswarm-worker:$env:IMAGE_TAG
docker push docker.io/julianwiley/alphaswarm-ingestion:$env:IMAGE_TAG
```

If `make build-worker` or `make build-ingestion` reports a missing Dockerfile,
pin those image tags to known-good prebuilt registry tags in the target overlay
before applying.

## Step 3b — one-shot Alembic migration (cluster)

After `alphaswarm-api` is pullable on the cluster, run:

```powershell
kubectl apply -f alphaswarm_platform/deployments/kubernetes/base/jobs/alembic-upgrade.yaml
kubectl -n alphaswarm wait --for=condition=complete job/alphaswarm-alembic-upgrade --timeout=900s
kubectl -n alphaswarm logs job/alphaswarm-alembic-upgrade
```

The Job uses the same `alphaswarm-config` / `alphaswarm-secrets` env as `alphaswarm-core` and targets
`postgresql.alphaswarm-data-services.svc.cluster.local` (the AlphaSwarm-owned Postgres in the
`alphaswarm-data-services` namespace). Re-apply only when you need a fresh
`upgrade head` (delete the previous Job first: `kubectl -n alphaswarm delete job alphaswarm-alembic-upgrade`).

`alembic/env.py` widens `alembic_version.version_num` to `VARCHAR(128)` automatically
before migrations run (revision slugs longer than 32 characters otherwise fail at
`0039_extended_instrument_taxonomy`).
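
To see which revision ids would not fit the default 32-character `version_num` column, a check along these lines can help. It is an illustrative helper, not part of `alembic/env.py`; the function name is hypothetical.

```python
# Alembic's default alembic_version.version_num column is VARCHAR(32).
LEGACY_VERSION_NUM_WIDTH = 32


def slugs_needing_wider_column(
    revisions: list[str], width: int = LEGACY_VERSION_NUM_WIDTH
) -> list[str]:
    """Return the revision ids that do not fit in a VARCHAR(width) column."""
    return [rev for rev in revisions if len(rev) > width]
```

For example, `0039_extended_instrument_taxonomy` is 33 characters long, one more than the default column allows.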

### Brownfield Postgres (pre-Alembic or partial schema)

If `alembic upgrade head` fails with `DuplicateTable` / `DuplicateColumn`, the database
was created outside Alembic tracking. From a workstation with the API image and a
port-forward to cluster Postgres:

```powershell
kubectl -n alphaswarm-data-services port-forward svc/postgresql 15432:5432
$env:ALPHASWARM_POSTGRES_DSN = "postgresql+psycopg2://alphaswarm:alphaswarm@host.docker.internal:15432/alphaswarm"
# Optional: stamp to the highest revision whose objects already exist, then upgrade.
# $env:ALPHASWARM_ALEMBIC_STAMP_REVISION = "0015_dbt_foundation"
bash scripts/cluster_alembic_upgrade.sh
```

Use `ALPHASWARM_POSTGRES_DSN` (maps to `settings.postgres_dsn`) — not a raw `DATABASE_URL`
alias. Migration `0040_normalized_identifiers_backfill` can take several minutes on
large `instruments` tables.

### Postgres prerequisites (`alphaswarm-data-services`)

Migration `0045_pgvector_foundation` requires the `vector` extension in the **`alphaswarm`**
database. On existing clusters (init script applied before the `alphaswarm` DB was added),
run once as the Postgres superuser:

```powershell
kubectl -n alphaswarm-data-services exec deploy/postgresql -- `
  psql -U postgres -d alphaswarm -c "CREATE EXTENSION IF NOT EXISTS vector;"
```

Fresh installs use the AlphaSwarm-owned `alphaswarm_platform/deployments/kubernetes/base-services/postgres-shared/`
manifests, whose init SQL creates the `alphaswarm` role/database and enables
`vector` there.

## Step 4 — pin the image tag in the target overlay

Edit `alphaswarm_platform/deployments/kubernetes/overlays/<overlay>/kustomization.yaml`:

```yaml
images:
  - name: docker.io/julianwiley/alphaswarm-client
    newTag: rc-abcdef01-2026-05-19
  ...
```

### Docker Hub pull secret (private repos)

Deployments reference `dockerhub-pull-secret`. Create it in both workload
namespaces before rollout:

```powershell
$env:DOCKERHUB_USER = "<dockerhub-username>"
$env:DOCKERHUB_TOKEN = "<dockerhub-access-token>"  # hub.docker.com → Account Settings → Security

kubectl create secret docker-registry dockerhub-pull-secret `
  --docker-server=https://index.docker.io/v1/ `
  --docker-username=$env:DOCKERHUB_USER `
  --docker-password=$env:DOCKERHUB_TOKEN `
  -n alphaswarm --dry-run=client -o yaml | kubectl apply -f -

kubectl create secret docker-registry dockerhub-pull-secret `
  --docker-server=https://index.docker.io/v1/ `
  --docker-username=$env:DOCKERHUB_USER `
  --docker-password=$env:DOCKERHUB_TOKEN `
  -n alphaswarm-admin --dry-run=client -o yaml | kubectl apply -f -
```

Public repositories can omit the secret by removing `imagePullSecrets` from
the deployment manifests.

## Step 5 — apply

```powershell
# Dry-run first
kubectl apply -k alphaswarm_platform/deployments/kubernetes/overlays/tower-dev --dry-run=server

# Apply
kubectl apply -k alphaswarm_platform/deployments/kubernetes/overlays/tower-dev

# Verify
kubectl -n alphaswarm get pods,svc,hpa,pdb
kubectl -n alphaswarm-admin get pods,svc
```

## Step 6 — populate the Secret

If you're not using external-secrets-operator, populate the placeholder Secret manually:

```powershell
kubectl -n alphaswarm create secret generic alphaswarm-secrets `
  --from-literal=ALPHASWARM_DATABASE_PASSWORD="<database-password>" `
  --from-literal=ALPHASWARM_AUTH_M2M_CLIENT_SECRET="<m2m-client-secret>" `
  --from-literal=ALPHASWARM_SESSION_COOKIE_SECRET="<session-cookie-secret>" `
  --dry-run=client -o yaml | kubectl apply -f -
```

For external-secrets-operator users, point an `ExternalSecret` at your secret store (Vault / SSM / Key Vault / Secret Manager) and let the operator create the K8s `Secret`.

## Step 7 — DNS + TLS

The Ingresses expect:
- `alpha-swarm.ai` -> `alphaswarm-client` Service in the `alphaswarm` namespace
- `api.alpha-swarm.ai` -> `alphaswarm-core` Service in the `alphaswarm` namespace
- `manage.alpha-swarm.ai` -> `alphaswarm-cp` Service in the `alphaswarm-admin` namespace

Point DNS at the NGINX Ingress controller's LoadBalancer IP. cert-manager handles TLS via the `letsencrypt-prod` ClusterIssuer (configure separately).

## Step 8 — smoke test

```powershell
# Client should serve the SPA shell
curl -fsS https://alpha-swarm.ai/ | findstr "<!doctype html"

# Control plane health (unauthenticated)
curl -fsS https://manage.alpha-swarm.ai/manage/health

# OpenAPI spec
curl -fsS https://manage.alpha-swarm.ai/manage/openapi.json | python -m json.tool | findstr title

# Cluster verification helper
bash scripts/verify_tower_cluster.sh
```

## Rollback

```powershell
# Re-apply the previous overlay with the previous image tag.
git checkout HEAD~1 -- alphaswarm_platform/deployments/kubernetes/overlays/dev/kustomization.yaml
make deploy-k8s ENV=dev
```

Or, for an immediate rollback that doesn't touch git:

```powershell
kubectl -n alphaswarm rollout undo deployment/alphaswarm-client
kubectl -n alphaswarm rollout undo deployment/alphaswarm-core
kubectl -n alphaswarm rollout undo deployment/alphaswarm-worker
kubectl -n alphaswarm-admin rollout undo deployment/alphaswarm-cp
```


<!-- https://alpha-swarm.ai/how-to/operations/local-setup -->
# Operations runbook — Local setup
> | Tool | Min version | Used for | | --- | --- | --- | | Python | 3.11 | AlphaSwarm runtime + the new `alphaswarm_core` + `alphaswarm_controller` packages | | Node.js | 20 | Vite + legacy webui builds | | pnpm ...

# Operations runbook — Local setup

This walks a brand-new developer from `git clone` to a running local AlphaSwarm stack.

## Prerequisites

| Tool | Min version | Used for |
| --- | --- | --- |
| Python | 3.11 | AlphaSwarm runtime + the new `alphaswarm_core` + `alphaswarm_controller` packages |
| Node.js | 20 | Vite + legacy webui builds |
| pnpm | 9 | Frontend dep management (`corepack enable && corepack prepare pnpm@9.15.9 --activate`) |
| Docker | 25+ | Local compose stack + image builds |
| docker buildx | 0.13+ | Multi-arch image builds |
| Terraform | 1.10+ | Provisioning-only (rule 42) |
| k3d | 5.7+ | Local k3s cluster (for the Terraform-driven path) |
| kubectl | 1.30+ | Workload introspection |

## Step 1 — clone + install editable

```powershell
git clone https://github.com/julianwiley/alphaswarm.git
cd alphaswarm

python -m pip install -e .
python -m pip install -e ./alphaswarm_core[dev]
python -m pip install -e ./alphaswarm_controller[dev,all-providers]
pnpm --dir alphaswarm_client install
```

## Step 2 — generate `.env.local`

```powershell
make generate-config ENV=local
```

This reads [`alphaswarm_platform/deployments/compose/.env.schema`](../../alphaswarm_platform/deployments/compose/.env.schema) and writes `alphaswarm_platform/deployments/compose/.env.local`. Open the file and fill in the placeholder values for any service you plan to use.

## Step 3 — bring up the stack (two options)

### Option A — Docker Compose (new path, Phase 3 refactor)

```powershell
make dev
```

This brings up:
- `alphaswarm-postgres` (pgvector)
- `redis-stack`
- `alphaswarm-core` (FastAPI)
- `alphaswarm-worker` (Celery)
- `alphaswarm-client` (unified gateway, port 3000)

Once everything is `Up (healthy)`:

- Operator UI: [http://localhost:3000](http://localhost:3000)
- Legacy Solara UI: [http://localhost:3000/legacy](http://localhost:3000/legacy)
- OpenAPI: [http://localhost:3000/api/docs](http://localhost:3000/api/docs)

### Option B — Terraform + k3d (canonical, hard rule 42)

```powershell
alphaswarm-cli deploy build      # build + push images to the local registry
alphaswarm-cli deploy up         # terraform apply -> k3d cluster + workloads
alphaswarm-cli deploy status     # pod + service rollup
alphaswarm-cli deploy logs api   # tail alphaswarm-api logs
```

`alphaswarm-cli deploy *` is the existing path that lands every state mutation in `terraform_runs`. The Docker Compose path is friendlier for fast iteration but doesn't update the ledger.

## Step 4 — bring up the admin overlay (optional)

The `alphaswarm_controller` micro-project runs on a separate Docker network (`alphaswarm-admin`) so it's isolated from the workloads it manages.

```powershell
make dev-admin
```

After that, `curl http://localhost:9000/manage/health` should return `{"status": "ok", ...}`.

## Step 5 — verify

```powershell
make test                # all tests
make test-platform-core  # alphaswarm_core only
make test-providers      # alphaswarm_controller provider contract tests
```

## Troubleshooting

| Symptom | Fix |
| --- | --- |
| `make generate-config ENV=local` errors with `missing required fields` | The schema parser caught a malformed block in `.env.schema`. Open the file, look for the entry above the error line, and ensure every block has `key:` / `description:` / `required:` / `targets:` / `classification:`. |
| `docker compose up` fails with `port already in use` | The Vite dev server publishes 3001 by default; the compose stack publishes 3000. Stop whichever is running first or override via `docker-compose.override.yml`. |
| `pnpm --dir alphaswarm_client build` runs out of memory | `NODE_OPTIONS=--max-old-space-size=4096 pnpm --dir alphaswarm_client build`. |
| `alphaswarm-cli deploy up` fails with `terraform binary not found` | `choco install terraform` (Windows) or set `ALPHASWARM_TERRAFORM_BINARY=/path/to/terraform`. |
| `alphaswarm_controller` shows `auth_disabled=true` in `/manage/health` | Set `ALPHASWARM_AUTH_OIDC_ISSUER=https://your-tenant.us.auth0.com/` in `.env.local`, restart `alphaswarm-cp`. |
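
For the `missing required fields` error in the first row, the check being applied is roughly the following. This is a sketch, not the actual schema parser; it assumes each `.env.schema` block has already been parsed into a dict.

```python
REQUIRED_FIELDS = ("key", "description", "required", "targets", "classification")


def missing_fields(block: dict) -> list[str]:
    """Return the required schema fields absent from one parsed block."""
    return [name for name in REQUIRED_FIELDS if name not in block]
```

Running it over each block in file order points at the first malformed entry.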


<!-- https://alpha-swarm.ai/how-to/operations/rotate-secrets -->
# Operations runbook — Secret rotation
> Every entry in [`alphaswarm_platform/deployments/compose/.env.schema`](../../alphaswarm_platform/deployments/compose/.env.schema) with `classification: secret` or `classification: rotation-required`. The rotation-r...

# Operations runbook — Secret rotation

Zero-downtime credential rotation for the AlphaSwarm control plane + workloads.

## What's a secret here?

Every entry in [`alphaswarm_platform/deployments/compose/.env.schema`](../../alphaswarm_platform/deployments/compose/.env.schema) with `classification: secret` or `classification: rotation-required`. The rotation-required ones (e.g. `AUTH0_M2M_CLIENT_SECRET`) should be rotated on a fixed schedule (typically 90 days).
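
A minimal sketch of how the 90-day schedule could be checked. The `last_rotated` field is hypothetical (the schema itself only carries the classification), so treat the record shape as an assumption.

```python
from datetime import date, timedelta

ROTATION_INTERVAL = timedelta(days=90)  # the "typically 90 days" schedule


def entries_due(entries: list[dict], today: date) -> list[str]:
    """Return keys of rotation-required entries last rotated 90 or more days ago."""
    return [
        e["key"]
        for e in entries
        if e["classification"] == "rotation-required"
        and today - e["last_rotated"] >= ROTATION_INTERVAL
    ]
```

Entries classified as plain `secret` are skipped here because the runbook only puts the rotation-required ones on a fixed schedule.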

## Pre-flight

1. Confirm at least one operator with `admin:cluster` is online to handle Auth0 console operations.
2. Check `kubectl -n alphaswarm rollout status deployment/alphaswarm-core` — if it's currently degraded, fix that first.
3. Verify the secret store you're rotating into is reachable (`Vault sealed?`, `SSM blocked by SCP?`, etc.).

## Procedure — Auth0 M2M client secret

1. **Mint a new secret in Auth0:** Applications → `alphaswarm-m2m` → Settings → "Rotate" (Auth0 keeps the old one valid for 24h by default).
2. **Update the secret store:**

   ```powershell
   # Vault example
   vault kv patch secret/alphaswarm/auth0 m2m_client_secret="<new-secret>"
   ```

3. **Reload the relevant pods:**

   ```powershell
   kubectl -n alphaswarm rollout restart deployment/alphaswarm-core
   kubectl -n alphaswarm rollout restart deployment/alphaswarm-worker
   kubectl -n alphaswarm-admin rollout restart deployment/alphaswarm-cp
   ```

4. **Verify:** `curl https://manage.alphaswarm.enterprise.com/manage/health` returns 200; the audit log shows successful M2M token mints.
5. **Revoke the old secret:** Auth0 → `alphaswarm-m2m` → "Revoke previous secret" once you're confident every pod has rolled.

## Procedure — Postgres password

1. **Connect via the old credential** and create the new password:

   ```sql
   ALTER USER alphaswarm WITH PASSWORD 'new-strong-password';
   ```

2. **Update the secret store** (same as above).
3. **Rolling restart** of every service that talks to Postgres: `alphaswarm-core`, `alphaswarm-worker`, `alphaswarm-cp`. Each pod re-reads the env var on startup.
4. **No old-credential revocation needed** — Postgres only honours the current password.

## Procedure — Session cookie secret

The session cookie secret (`ALPHASWARM_SESSION_COOKIE_SECRET`) is used to encrypt the `alphaswarm_session` cookie. Rotating it invalidates every active session.

1. Generate: `python -c "import secrets; print(secrets.token_urlsafe(64))"`
2. Update the secret store.
3. Rolling restart of `alphaswarm-core` only. Users will be redirected to Auth0 to re-authenticate.

## Procedure — SCIM bearer token

The SCIM bearer token hash gates the `/scim/v2/*` endpoint.

1. Generate a new random token: `python -c "import secrets; print(secrets.token_urlsafe(48))"`
2. Compute its SHA-256 hash: `python -c "import hashlib, sys; print(hashlib.sha256(sys.argv[1].encode()).hexdigest())" "<raw-token>"`
3. Update `ALPHASWARM_AUTH_SCIM_BEARER_TOKEN_HASH` in the secret store with the hash.
4. Update the IdP's SCIM provisioning configuration with the new RAW token.
5. Rolling restart of `alphaswarm-core`.
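
Steps 1 and 2 can be combined, and the hash comparison illustrated, as follows. How `/scim/v2/*` actually verifies the bearer token is not described in this runbook, so `token_matches` is an assumption about one reasonable approach.

```python
import hashlib
import hmac
import secrets


def new_scim_token() -> tuple[str, str]:
    """Return (raw_token, sha256_hex): the IdP gets the raw token, the secret store the hash."""
    raw = secrets.token_urlsafe(48)
    return raw, hashlib.sha256(raw.encode()).hexdigest()


def token_matches(presented: str, stored_hash: str) -> bool:
    """Hash the presented bearer token and compare in constant time."""
    digest = hashlib.sha256(presented.encode()).hexdigest()
    return hmac.compare_digest(digest, stored_hash)
```

Storing only the hash means a leak of `ALPHASWARM_AUTH_SCIM_BEARER_TOKEN_HASH` does not directly reveal the token the IdP presents.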

## Auditing

Every secret rotation should leave a trace:

- The secret store records the change (Vault audit log / CloudTrail / Cloud Audit Logs)
- The application's audit log shows the M2M / Postgres / cookie events
- `alphaswarm_controller`'s `WorkloadRun` ledger records the rotation request

If you can't see all three, file an incident — silent rotations are a security smell.


<!-- https://alpha-swarm.ai/how-to/operations/rts6-validation-report-generation -->
# RTS 6 / SEC 15c3-5 Annual Validation Report
> - **MiFID II RTS 6, Article 9** — "An investment firm shall annually perform a self-assessment and validation process and on the basis of that process issue a validation report... The risk management ...

# RTS 6 / SEC 15c3-5 Annual Validation Report

> Mechanical generation + sign-off workflow.
> Audience: Risk Management, Internal Audit, CEO, compliance counsel.

## Regulatory anchors

- **MiFID II RTS 6, Article 9** — "An investment firm shall annually
  perform a self-assessment and validation process and on the basis of
  that process issue a validation report... The risk management function
  shall draft the report; internal audit shall audit the report."
- **SEC Rule 15c3-5(e)** — "The broker or dealer shall regularly review
  the effectiveness of the risk management controls and supervisory
  procedures... The CEO shall certify annually that the firm's risk
  management controls and supervisory procedures comply with paragraphs
  (b) and (c) of this section."

## Generate the artifact

```bash
# CLI (single bot — usually just for testing the generator):
alphaswarm-bots conformance <bot-slug>
alphaswarm-bots stress <bot-slug>

# REST (fleet-wide):
curl -X POST https://api.alphaswarm.io/bots/<id>/conformance
curl -X POST https://api.alphaswarm.io/bots/<id>/stress
curl -X GET  https://api.alphaswarm.io/bots/<id>/risk/validation-report > validation-report.yaml
```

The artifact is a YAML document with three top-level sections:

1. **MiFID II RTS 6** — Article 6 / 9 / 10 / 12 / 15 / 16 / 17 results.
2. **SEC Rule 15c3-5** — (c)(1)(i)/(ii), (d), (e) results.
3. **Evidence** — embedded `bot_inventory`, `conformance_results`,
   `stress_results`, `kill_switch_drills`.

## Required attestations

The generator leaves three slots empty:

| Slot | Required by | Filled by |
| --- | --- | --- |
| `attestations.risk_management_function` | RTS 6 Art. 9(2) | Head of Risk |
| `attestations.internal_audit` | RTS 6 Art. 9(3) | Head of Internal Audit |
| `attestations.ceo_certification` | SEC 15c3-5(e) | CEO |

Sign-off is **operational, not mechanical**. The generator emits
unsigned YAML; the firm's compliance workflow fills in the slots and
adds digital signatures (e.g. via a Yubikey-backed signing pipeline).
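
Before circulating an artifact, a compliance pipeline might check which slots are still empty. The sketch assumes the YAML has been loaded into a dict with an `attestations` mapping keyed by the slot names in the table; the exact shape of a filled slot is not specified here.

```python
ATTESTATION_SLOTS = (
    "risk_management_function",
    "internal_audit",
    "ceo_certification",
)


def unfilled_attestations(report: dict) -> list[str]:
    """Return the attestation slots that are still missing or empty."""
    attestations = report.get("attestations") or {}
    return [slot for slot in ATTESTATION_SLOTS if not attestations.get(slot)]
```

A freshly generated artifact should report all three slots; a fully signed one should report none.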

## Cadence

- **Annual:** by 31 March each year, covering the previous calendar year.
- **Ad-hoc:** after any material control change (new policy, threshold
  retune, new venue, new asset class), generate a fresh artifact and
  re-circulate for sign-off.
- **Quarterly drill:** the kill-switch incident response runbook
  (separate doc) exercises the three-scope kill switch quarterly; that
  drill's evidence is included in the next annual report.
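
The annual deadline is simple date arithmetic. A small sketch, with hypothetical helper names:

```python
from datetime import date


def annual_report_deadline(covered_year: int) -> date:
    """Deadline for the report covering `covered_year`: 31 March of the next year."""
    return date(covered_year + 1, 3, 31)


def is_overdue(covered_year: int, today: date) -> bool:
    """True once the deadline has passed; the deadline day itself is still on time."""
    return today > annual_report_deadline(covered_year)
```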

## Storage

- Signed YAML artifacts: `s3://alphaswarm-compliance/validation-reports//`
  with object lock + WORM retention >= 7 years.
- The audit trail (who generated the artifact, when, with which inputs)
  is in `bot_events` (event_type=`validation_report.generated`).

## Caveat

This workflow is an **engineering crosswalk**, NOT legal advice. The
specific scope of "algorithmic trading" / "market access" for any given
firm is a legal determination that compliance counsel must make. The
generator only mechanizes the controls that are already implemented in
code.


<!-- https://alpha-swarm.ai/how-to/operations/tower-cluster-deploy -->
# Tower Two-Node Cluster Deploy
> - In scope: AlphaSwarm stack bootstrap (`tower-dev`), QuestDB, control-plane wiring. - Out of scope: `julianwiley-portal` migration (deferred; owned by `rpi_kubernetes`)

# Tower Two-Node Cluster Deploy

Deploy AlphaSwarm to the dedicated two-node cluster (`alphaswarm-tower` control plane +
`alphaswarm-laptop` WSL2 agent) before any portal migration work.

## Scope

- In scope: AlphaSwarm stack bootstrap (`tower-dev`), QuestDB, control-plane wiring.
- Out of scope: `julianwiley-portal` migration (deferred; owned by `rpi_kubernetes`).

## Prerequisites

- Two-node cluster already online and `kubectl get nodes` shows both `Ready`.
- Context points to the tower cluster.
- Auth0 tenant + client values set in `alphaswarm_platform/deployments/kubernetes/base/configmaps`.
- Secrets rendered for:
  - `alphaswarm-secrets` (`alphaswarm` namespace)
  - `alphaswarm-admin-secrets` (`alphaswarm-admin` namespace)

## 1) Install cluster dependencies

```bash
bash alphaswarm_platform/scripts/cluster_install/install-redpanda.sh
bash alphaswarm_platform/scripts/cluster_install/install-questdb.sh
bash alphaswarm_platform/scripts/cluster_install/install-redpanda-connect.sh
```

Optional (if your target slice needs them): OpenTelemetry, kube-prometheus-stack,
Phoenix, Spark Operator.

## 2) Apply the thin tower overlay

```bash
kubectl apply -k alphaswarm_platform/deployments/kubernetes/overlays/tower-dev/
```

This slice includes:

- core workloads (`alphaswarm-core`, `alphaswarm-worker`, `alphaswarm-client`, `alphaswarm-cp`)
- `redis-master`
- `postgres-shared`
- `questdb` (dev-sized PVC, relaxed scheduling)

## 3) Verify

```bash
bash scripts/verify_tower_cluster.sh
```

## 4) Terraform target wiring (optional but recommended)

```bash
# Preview
python -m alphaswarm.cli.main deploy --target tower --action plan

# Apply
python -m alphaswarm.cli.main deploy --target tower --action apply
```

## Rollback

```bash
kubectl delete -k alphaswarm_platform/deployments/kubernetes/overlays/tower-dev/
```

Then restore the previous known-good overlay or Terraform state.


<!-- https://alpha-swarm.ai/how-to/per-tenant-mcp-rollout -->
# Per-tenant MCP rollout

# Per-tenant MCP rollout runbook

> Phase 5 §8 of
> [RESTRUCTURING_PLAN.md](https://github.com/julianwiley/alphaswarm/blob/main/RESTRUCTURING_PLAN.md).
> Walks the cluster operator through deploying per-tenant MCP
> servers, the gVisor agent-sandbox pool, and the Cell-Bound-
> Authorization gate at `alphaswarm-edge`.

## Scope

1. **gVisor RuntimeClass** — install via the DaemonSet at
   `alphaswarm_platform/deployments/kubernetes/agent-sandbox/gvisor/`.
2. **alphaswarm-agent-sandbox-pool** — the gVisor-isolated Deployment at
   `alphaswarm_platform/deployments/kubernetes/agent-sandbox/pool/`.
3. **Per-tenant MCP servers** — Helm-rendered Deployments from
   `alphaswarm_platform/deployments/helm/alphaswarm-mcp-tenant/` for each
   `shared-prem` / `silo-reg` tenant.
4. **Cell-Bound-Authorization** — the second ext_authz step in
   `alphaswarm_platform/build/docker/alphaswarm-edge/envoy.template.yaml`.
5. **MCP tool catalog versioning** — Alembic 0084 creates the
   `mcp_tool_versions` table + adds
   `agent_runs_v2.mcp_tool_descriptor_hashes`.

## Prerequisites

1. Phase 3 cells are registered and at least one is in
   `state=active`.
2. Phase 4 SPIRE control plane is healthy in the cell. Verify:
   ```bash
   kubectl -n spire-system rollout status statefulset/spire-server
   kubectl -n spire-system get pods -l app=spire-agent
   ```
3. The Alembic head is at `0084_mcp_tool_versioning`. Verify:
   ```bash
   alembic current  # expected: 0084_mcp_tool_versioning (head)
   ```
4. Phase 2 Kyverno policies are loaded. Verify:
   ```bash
   kubectl get clusterpolicy alphaswarm-require-gvisor-for-agent-sandbox
   ```

## Step 0 — Install gVisor

```bash
kubectl apply -k alphaswarm_platform/deployments/kubernetes/agent-sandbox/gvisor/
kubectl -n gvisor rollout status daemonset/gvisor-installer --timeout=10m

# Wait for the node labels to appear (the installer marks each node
# `alphaswarm.io/gvisor=installed` after patching containerd):
kubectl get nodes -L alphaswarm.io/gvisor
# Expected: every node ends with `installed`.
```

## Step 1 — Deploy the agent-sandbox pool

```bash
kubectl apply -k alphaswarm_platform/deployments/kubernetes/agent-sandbox/pool/
kubectl -n alphaswarm-agent-sandbox rollout status deployment/alphaswarm-agent-sandbox-pool --timeout=5m

# Confirm gVisor is active inside the pod (the kernel reports as `runsc`):
POD=$(kubectl -n alphaswarm-agent-sandbox get pods -l app=alphaswarm-agent-sandbox-pool -o name | head -1)
kubectl -n alphaswarm-agent-sandbox exec "$POD" -- /bin/sh -c "uname -r; cat /proc/version"
# Expected: kernel version reports as runsc/gVisor.

# Confirm the Kyverno gate is enforced — try to deploy a Pod with the
# `alphaswarm.io/sandbox-required` label but WITHOUT runtimeClassName:gvisor:
cat <'
ORDER BY started_at DESC
LIMIT 5;
```

The hash array MUST be a subset of `mcp_tool_versions.descriptor_hash`
at the matching cell_id. The Phase 7 §10.2 replay harness will
verify this invariant.
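
The subset invariant can be expressed directly. In this sketch `run_hashes` stands for one run's `mcp_tool_descriptor_hashes` array and `catalog_hashes` for the `mcp_tool_versions.descriptor_hash` values at the same cell; fetching both from Postgres is omitted, and the function name is hypothetical.

```python
def unknown_descriptor_hashes(run_hashes: list[str], catalog_hashes: set[str]) -> list[str]:
    """Return recorded hashes that are absent from the cell's tool catalog."""
    return [h for h in run_hashes if h not in catalog_hashes]
```

The invariant holds for a run exactly when the returned list is empty.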

## Step 5 — Validate Cell-Bound-Authorization

Cross-cell MCP calls now require the `Cell-Bound-Authorization`
header. Without it, `alphaswarm-edge` returns 403 at the second ext_authz
step.

```bash
# From outside the cluster, simulate a cross-cell call missing CBA:
curl -sS -XPOST https://manage.alpha-swarm.ai/mcp/data/cell-silo-reg-acme/some.tool \
  -H 'authorization: Bearer <access-token>' \
  -d '{"args": {}}'
# Expected: 403 with `cell_bound_invalid` in the body.

# With a valid CBA (minted by the source-cell tenant-router):
curl -sS -XPOST https://manage.alpha-swarm.ai/mcp/data/cell-silo-reg-acme/some.tool \
  -H 'authorization: Bearer <access-token>' \
  -H 'Cell-Bound-Authorization: <cba-token>' \
  -d '{"args": {}}'
# Expected: tool result.
```

The CBA validator service is a Phase 5.5 deliverable; today the
ext_authz config points at the planned service address but the
service itself ships in the follow-up PR. Until then, the
`failure_mode_allow: false` flag means cross-cell calls without a
CBA fail closed (the ext_authz call to the validator fails with 503
because the service doesn't exist yet), which is the intended
behaviour for the security posture.
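
The fail-closed behaviour comes down to one decision. The function below is a simplified model of an ext_authz gate for illustration, not Envoy's implementation: `validator_verdict` is `None` when the call to the validator itself fails.

```python
from typing import Optional


def ext_authz_allows(validator_verdict: Optional[bool], failure_mode_allow: bool) -> bool:
    """Model the gate's outcome.

    validator_verdict is True/False when the validator answered, or None
    when the call to it failed (for example, the service is not deployed).
    """
    if validator_verdict is None:
        return failure_mode_allow  # fail open only when explicitly configured
    return validator_verdict
```

With `failure_mode_allow=False`, an unreachable validator denies the request; flipping the flag to `True` lets the same request through.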

## Rollback

Each component is independently revertable:

```bash
# Per-tenant MCP — uninstall the Helm release:
helm uninstall acme-mcp -n cell-silo-reg-acme

# Agent sandbox pool — scale to zero:
kubectl -n alphaswarm-agent-sandbox scale deployment alphaswarm-agent-sandbox-pool --replicas=0

# gVisor — DO NOT DROP the installer DaemonSet without first
# removing every Pod with `runtimeClassName: gvisor`, otherwise
# the pods will sit in RunPodSandboxFailed forever.

# Cell-Bound-Authorization — flip ext_authz failure_mode_allow to true
# in the envoy ConfigMap then `kubectl rollout restart -n alphaswarm-edge
# deployment/alphaswarm-edge`. Cross-cell calls then bypass the CBA gate.
```

## Phase 5.5 follow-ups

1. **alphaswarm-cell-bound-validator service** — the small HTTP service the
   ext_authz step points at. Phase 5 ships the Envoy config; the
   actual service implementation is a thin Starlette app that
   wraps `alphaswarm.auth.cell_bound.verify(...)`.
2. **shared-std MCP pool chart** — the `shared-std` tier uses one
   pool per cell with per-tenant Linux cgroups (cgroups v2 + Pod
   Security Standards `restricted`). The Helm chart for the pool
   is a Phase 5.5 deliverable; the per-tenant chart in this PR
   targets `shared-prem` and `silo-reg`.
3. **Biscuit + TokenExchangeBroker wire-up in AgentRuntime** —
   the helpers in `alphaswarm/auth/biscuit.py` are standalone today; the
   `AgentRuntime` integration that mints + attenuates the biscuit
   per call is Phase 5.5.
4. **MCP tool versioning replay** — `mcp_tool_descriptor_hashes`
   recording works in Phase 5; the replay harness that verifies
   the recorded set matches the live catalog is Phase 7 §10.2.

## Related documents

- [RESTRUCTURING_PLAN.md §8](https://github.com/julianwiley/alphaswarm/blob/main/RESTRUCTURING_PLAN.md)
- [alphaswarm_docs/docs/concepts/identity/biscuit-capabilities.md](../concepts/identity/biscuit-capabilities.md)
- [alphaswarm_docs/docs/how-to/linkerd-spire-rollout.md](linkerd-spire-rollout.md)
- [alphaswarm_docs/docs/concepts/identity/spiffe-workload-identity.md](../concepts/identity/spiffe-workload-identity.md)


<!-- https://alpha-swarm.ai/how-to/recipes/add-a-strategy -->
# Recipe: add a strategy
> The minimum-viable steps to register a new strategy class against the AlphaSwarm registry.

# Recipe: add a strategy

The 5-minute happy path:

1. Subclass `IStrategy` (or `FrameworkAlgorithm`) under
   [alphaswarm/strategies/](https://github.com/julianwileymac/alphaswarm/tree/main/alphaswarm/strategies).
2. Decorate with `@register("MyName", kind="alpha")` from
   [alphaswarm/core/registry.py](https://github.com/julianwileymac/alphaswarm/blob/main/alphaswarm/core/registry.py).
3. Ship a YAML at `configs/strategies/.yaml` using the
   `class` / `module_path` / `kwargs` factory pattern.
4. Smoke-test:

```powershell
docker exec alphaswarm-api python -m alphaswarm.cli.cli backtest `
    --config configs/strategies/.yaml `
    --start 2024-01-01 --end 2024-06-30
```

If the smoke run lands a `backtest_runs` row with a non-NULL
`sharpe`, you are done.
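
The registration and factory steps above can be pictured with a toy registry. This is a minimal sketch, not the code in `alphaswarm/core/registry.py`: the `_REGISTRY` dict, the `build_from_config` helper, and the `MomentumAlpha` class are hypothetical stand-ins. Only the `@register("MyName", kind="alpha")` call shape, the `class` / `kwargs` keys, and the `StrategyRegistryMissError` name come from this recipe; `module_path` handling is left out.

```python
from typing import Any, Callable

# Hypothetical in-memory registry keyed by (kind, name).
_REGISTRY: dict[tuple[str, str], type] = {}


class StrategyRegistryMissError(KeyError):
    """Raised when a config names a class that was never registered."""


def register(name: str, kind: str) -> Callable[[type], type]:
    """Class decorator with the same call shape as @register("MyName", kind="alpha")."""
    def decorator(cls: type) -> type:
        _REGISTRY[(kind, name)] = cls
        return cls
    return decorator


def build_from_config(config: dict[str, Any], kind: str = "alpha") -> Any:
    """Instantiate a registered class from a parsed `class` / `kwargs` mapping."""
    key = (kind, config["class"])
    if key not in _REGISTRY:
        raise StrategyRegistryMissError(f"{config['class']!r} is not registered as {kind!r}")
    return _REGISTRY[key](**config.get("kwargs", {}))


@register("MomentumAlpha", kind="alpha")
class MomentumAlpha:
    """Hypothetical strategy used only to exercise the registry."""
    def __init__(self, lookback: int = 20) -> None:
        self.lookback = lookback


strategy = build_from_config({"class": "MomentumAlpha", "kwargs": {"lookback": 60}})
```

Removing the decorator from `MomentumAlpha` makes `build_from_config` raise, which mirrors the first pitfall listed in this recipe.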

## Pitfalls

- **Forgetting `@register`.** The YAML loader does not complain at
  load time; the failure only surfaces when the run errors out with
  `StrategyRegistryMissError`.
- **Putting strategy logic in a route or task.** Don't. Routes are thin
  wrappers around Celery tasks; Celery tasks are thin wrappers around
  pure functions under `alphaswarm/strategies/`. See [AGENTS Don'ts](https://github.com/julianwileymac/alphaswarm/blob/main/AGENTS.md).
- **Skipping risk overlays.** Every strategy ships with a `risk:`
  block in YAML. Without it, the paper-metadata gate refuses to
  promote the strategy.

## Deeper reads

- [Concept: factor research](../../concepts/strategy/factor-research.md)
- [Concept: backtest engines](../../concepts/strategy/backtest-engines.md)
- [Tutorial: first backtest](../../tutorials/first-backtest.md)


<!-- https://alpha-swarm.ai/how-to/recipes -->
# Recipes
> Task-oriented cookbook for common AlphaSwarm operations. Copy-pasteable, results-first.

# Recipes

Task-oriented, results-first. Each recipe answers a single
"how do I..." question with a copy-pasteable command sequence.

If you want to learn a subsystem, read the matching
[concept](../../concepts/platform/architecture.md). If you want to
walk through a complete first-time scenario, do a
[tutorial](../../tutorials/first-backtest.md). If you want to fix a
broken thing, follow a [runbook](../../how-to/runbooks/dr-restore.md).

## Cookbook

- [Add a strategy](./add-a-strategy.md)
- [Run a backtest from YAML](./run-a-backtest-from-yaml.md)
- [Promote a bot to paper](./promote-a-bot-to-paper.md)
- [Snapshot an agent spec](./snapshot-an-agent-spec.md)
- [Query data via MCP](./query-data-via-mcp.md)

Each recipe is self-contained. None of them assume the others have
been run.


<!-- https://alpha-swarm.ai/how-to/recipes/promote-a-bot-to-paper -->
# Recipe: promote a bot to paper
> Take a backtested bot and start a paper-trading session, respecting the paper-metadata gate.

# Recipe: promote a bot to paper

```powershell
# 1. Snapshot the bot (idempotent — same hash returns same version).
curl -X POST http://localhost:8000/bots `
    -H "Content-Type: application/json" `
    -d @configs/bots/my-bot.yaml

# 2. Backtest the bot (gates require a recent backtest_runs row).
curl -X POST http://localhost:8000/bots//backtest `
    -d '{"start":"2024-01-01","end":"2024-06-30"}'

# 3. Promote to paper.
curl -X POST http://localhost:8000/bots//paper `
    -d '{"starting_cash":100000,"duration_minutes":60}'
```

## The paper-metadata gate

`POST /bots//paper` runs [paper_metadata_gate](https://github.com/julianwileymac/alphaswarm/blob/main/alphaswarm/trading/paper_metadata_gate.py)
before launching the session. It rejects when:

- No `backtest_runs` row for the bot, or it is older than 7 days.
- `sharpe < 0.5` on the latest backtest.
- `max_drawdown > 0.20`.
- `risk.kill_switch_attached != true`.
- The bot's universe contains a symbol that isn't in the active
  data plane.

Override the gate via the `--force` flag on `/bots//paper`
only with explicit approval. The audit ledger records who forced it
and why.
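
The rejection rules listed above can be sketched as a pure function. This is an illustrative sketch only, not the code in `paper_metadata_gate.py`: the `BacktestSummary` dataclass, the `gate_rejections` name, and its parameters are assumptions; the thresholds are the ones stated in the list.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone
from typing import Optional


@dataclass
class BacktestSummary:
    """Hypothetical view of the latest backtest_runs row for a bot."""
    finished_at: datetime
    sharpe: float
    max_drawdown: float


def gate_rejections(
    latest: Optional[BacktestSummary],
    kill_switch_attached: bool,
    universe: set[str],
    active_symbols: set[str],
    now: datetime,
) -> list[str]:
    """Return every reason the gate would reject; an empty list means pass."""
    reasons: list[str] = []
    if latest is None:
        reasons.append("no backtest_runs row")
    else:
        if now - latest.finished_at > timedelta(days=7):
            reasons.append("backtest older than 7 days")
        if latest.sharpe < 0.5:
            reasons.append("sharpe < 0.5")
        if latest.max_drawdown > 0.20:
            reasons.append("max_drawdown > 0.20")
    if not kill_switch_attached:
        reasons.append("risk.kill_switch_attached != true")
    missing = sorted(universe - active_symbols)
    if missing:
        reasons.append(f"symbols not in active data plane: {missing}")
    return reasons


now = datetime(2024, 7, 1, tzinfo=timezone.utc)
fresh = BacktestSummary(now - timedelta(days=2), sharpe=1.1, max_drawdown=0.12)
print(gate_rejections(fresh, True, {"SPY"}, {"SPY", "QQQ"}, now))
```

Collecting all reasons, instead of stopping at the first, is a design choice of this sketch; it lets an operator fix every failing condition in one pass.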

## Risk + kill switch

The session inherits the bot's `risk:` block. Trigger a stop:

```powershell
curl -X POST http://localhost:8000/paper/stop-all
# or use the topbar kill switch in the Vite UI
```

## Deeper reads

- [Tutorial: first paper trading session](../../tutorials/first-paper-trading-session.md)
- [Concept: paper trading](../../concepts/trading/paper-trading.md)
- [Concept: paper metadata gate](../../concepts/trading/paper-metadata-gate.md)
- [Runbook: kill-switch incident response](../../how-to/operations/kill-switch-incident-response.md)


<!-- https://alpha-swarm.ai/how-to/recipes/query-data-via-mcp -->
# Recipe: query data via MCP
> Invoke a data.* MCP tool from an agent context (no direct Postgres / Iceberg reads).

# Recipe: query data via MCP

AGENTS rule 22: agents NEVER read Postgres or Iceberg directly.
Every catalog / dataset / entity / pipeline read goes through a
registered `DataMCPTool`. The bridge auto-installs every tool into
the agent `TOOL_REGISTRY`; the same tools are reachable externally
over HTTP at `/mcp/data` and via the `alphaswarm-data-mcp` stdio binary.

## From inside an agent

```python
from alphaswarm_agents.tools import TOOL_REGISTRY

tool = TOOL_REGISTRY["data.discovery.browse"]
result = tool.invoke({"namespace_prefix": "alphaswarm_silver_yfinance"})
print(result["entries"])
```

## From outside the platform (HTTP)

```powershell
curl -X POST http://localhost:8000/mcp/data/tools/data.discovery.browse/invoke `
    -H "Content-Type: application/json" `
    -H "Authorization: Bearer " `
    -d '{"namespace_prefix":"alphaswarm_silver_yfinance"}'
```

## From a Cursor/Continue/Cline agent (stdio)

Register the stdio binary as an MCP server in the editor:

```json
{
  "mcpServers": {
    "alphaswarm-data": {
      "command": "alphaswarm-data-mcp",
      "env": { "ALPHASWARM_MCP_DATA_CANONICAL_URI": "http://localhost:8000/mcp/data" }
    }
  }
}
```

## Where to add a new tool

Subclass [`DataMCPTool`](https://github.com/julianwileymac/alphaswarm/blob/main/alphaswarm/data/mcp/base.py)
under `alphaswarm/data/mcp/tools/`, decorate with `@register_data_mcp_tool`,
and the bridge does the rest. See
[Concept: data MCP](../../concepts/data/data-mcp.md).

## RFC 9728 + 8707 conformance

Every AlphaSwarm MCP server publishes Protected Resource Metadata at
`/.well-known/oauth-protected-resource[/...]` and validates the
`aud` claim on incoming tokens against the deployment's canonical
URI. The docs site's own MCP server lives at
[https://docs.alpha-swarm.ai/mcp](/mcp).
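
The audience check can be illustrated in a few lines. This is a hedged sketch, not the AlphaSwarm implementation: the `audience_matches` name is hypothetical, it assumes the token signature has already been verified elsewhere, and it uses exact string comparison without any URI normalization.

```python
from typing import Any, Mapping


def audience_matches(claims: Mapping[str, Any], canonical_uri: str) -> bool:
    """True when the token's `aud` claim names this server's canonical URI."""
    aud = claims.get("aud")
    # RFC 7519 allows `aud` to be a single string or an array of strings.
    if isinstance(aud, str):
        audiences = [aud]
    elif isinstance(aud, (list, tuple)):
        audiences = [a for a in aud if isinstance(a, str)]
    else:
        audiences = []
    return canonical_uri in audiences


canonical = "http://localhost:8000/mcp/data"
print(audience_matches({"aud": [canonical, "https://other.example"]}, canonical))
```

A token minted for a different resource server fails this check even when its signature is valid, which is the point of binding tokens to a resource.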

## Deeper reads

- [Concept: data MCP](../../concepts/data/data-mcp.md)
- [Concept: codebase MCP](../../concepts/data/codebase-mcp.md)
- [Concept: pgvector control plane](../../concepts/data/pgvector-control-plane.md)
- [Concept: MCP risk tools](../../concepts/data/mcp-risk-tools.md)


<!-- https://alpha-swarm.ai/how-to/recipes/run-a-backtest-from-yaml -->
# Recipe: run a backtest from YAML
> Dispatch a backtest task from a YAML strategy config and tail the Celery progress.

# Recipe: run a backtest from YAML

```powershell
$resp = curl -X POST http://localhost:8000/backtest `
    -H "Content-Type: application/json" `
    -d (Get-Content configs/strategies/my-strategy.yaml -Raw)

# Tail progress (canonical {task_id, stage, message, timestamp} frames).
docker exec alphaswarm-api python -c "from alphaswarm.ws.broker import subscribe; \
    [print(m) for m in subscribe('')]"
```

## Choose your engine

The default engine is `vbtpro` (vectorbt-pro primary). Override
with `--engine event_driven` / `hft` / `vectorbt` / `backtesting_py`
/ `zvt` / `aat`. See [backtest engines](../../concepts/strategy/backtest-engines.md)
for the capability matrix and fallback cascade.

## Walk-forward + WFO

```powershell
curl -X POST http://localhost:8000/backtest/wfo `
    -d '{"strategy_config":"configs/strategies/my-strategy.yaml","windows":12,"step":"1mo"}'
```

The endpoint dispatches one task per window; each writes its own
`backtest_runs` row and the parent emits a `wfo.complete` frame
when every window is in.
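
The fan-in behaviour described above can be sketched as a small tracker. This is an assumption-laden illustration: the `WfoParent` class and its method are hypothetical, and placing `wfo.complete` in the `stage` field of the canonical `{task_id, stage, message, timestamp}` frame is a guess, not confirmed by this recipe.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import Optional


@dataclass
class WfoParent:
    """Hypothetical parent-side tracker for one walk-forward request."""
    task_id: str
    windows: int
    done: set[int] = field(default_factory=set)

    def window_finished(self, index: int) -> Optional[dict]:
        """Record a finished window; return a completion frame once all are in."""
        if not 0 <= index < self.windows:
            raise ValueError(f"window index {index} out of range")
        self.done.add(index)  # a set keeps duplicate reports idempotent
        if len(self.done) < self.windows:
            return None
        return {
            "task_id": self.task_id,
            "stage": "wfo.complete",  # assumed placement of the frame name
            "message": f"{self.windows} windows finished",
            "timestamp": datetime.now(timezone.utc).isoformat(),
        }


parent = WfoParent(task_id="wfo-demo", windows=3)
frames = [parent.window_finished(i) for i in (0, 2, 2, 1)]
```

Only the last call returns a frame; the repeated report for window 2 does not complete the run early.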

## Look at results

- `backtest_runs` row in Postgres for the headline metrics.
- `alphaswarm_gold_backtest_` Iceberg namespace for trade-level
  detail.
- The QuantStats tearsheet endpoint at
  `POST /analytics/portfolio/tearsheet` for an HTML report.

## Deeper reads

- [Tutorial: first backtest](../../tutorials/first-backtest.md) —
  end-to-end walkthrough.
- [Concept: backtest engines](../../concepts/strategy/backtest-engines.md)
- [Concept: analytics frontend](../../concepts/data/analytics-frontend.md)


<!-- https://alpha-swarm.ai/how-to/recipes/snapshot-an-agent-spec -->
# Recipe: snapshot an agent spec
> Hash-lock a YAML AgentSpec into agent_spec_versions so AgentRuntime can drive it.

# Recipe: snapshot an agent spec

```powershell
# Idempotent — re-running with unchanged content returns the same
# spec_hash and the same version row.
curl -X POST http://localhost:8000/agents/specs `
    -H "Content-Type: application/json" `
    -d @configs/agents/my-agent.yaml
```

The response carries `spec_hash` and `version`. If you change a
field and re-POST, a NEW `agent_spec_versions` row is created with
a NEW hash. Old versions stay intact for replay.
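
The idempotency described above follows from hashing a canonical form of the spec. The sketch below assumes SHA-256 over sorted-key JSON and uses a hypothetical in-memory `SpecStore` in place of the `agent_spec_versions` table; the real canonicalization and hash function are not stated in this recipe.

```python
import hashlib
import json
from typing import Any


def spec_hash(spec: dict[str, Any]) -> str:
    """Hash a canonical JSON form so key order does not change the digest."""
    canonical = json.dumps(spec, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


class SpecStore:
    """Hypothetical stand-in for the agent_spec_versions table."""

    def __init__(self) -> None:
        self._versions: dict[str, int] = {}

    def snapshot(self, spec: dict[str, Any]) -> tuple[str, int]:
        """Return (spec_hash, version); unchanged content reuses its row."""
        digest = spec_hash(spec)
        if digest not in self._versions:
            self._versions[digest] = len(self._versions) + 1
        return digest, self._versions[digest]


store = SpecStore()
first = store.snapshot({"name": "my-agent", "model": "m1"})
again = store.snapshot({"model": "m1", "name": "my-agent"})  # same content, new key order
changed = store.snapshot({"name": "my-agent", "model": "m2"})
```

Here `first` and `again` are equal, while `changed` carries a new hash and the next version number.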

## What the runtime does

The
[`AgentRuntime`](https://github.com/julianwileymac/alphaswarm/blob/main/alphaswarm/agents/runtime.py)
gates every run on:

- A valid `agent_spec_versions` row.
- A cost cap (`per_run_max_tokens`, `per_run_max_usd`).
- The active kill switch.
- The RFC 9728 + 8707 MCP audience check (rule 49).
- An `experiment_id` (rule 34).

If any check fails, the run rejects before the first LLM call.

## Run the agent

```powershell
curl -X POST http://localhost:8000/agents//run `
    -d '{"inputs":{"universe":["SPY","QQQ","IWM"]}}'
```

`AgentRuntime` writes `agent_runs_v2` rows with telemetry, cost,
and OTEL trace IDs.

## Don't bypass the runtime

Never call `router_complete` directly from inside agent code. Declare
the model in `AgentSpec.model` and let the runtime drive the call.
See [AGENTS rule 12](https://github.com/julianwileymac/alphaswarm/blob/main/AGENTS.md).

## Deeper reads

- [Concept: agents](../../concepts/agentic/agents.md)
- [Concept: agentic development](../../concepts/agentic/agentic-development.md)
- [Concept: workflow studio](../../concepts/agentic/workflow-studio.md)
- [Tutorial: first agent workflow](../../tutorials/first-agent-workflow.md)


<!-- https://alpha-swarm.ai/how-to/runbooks/dr-restore -->
# Runbook — Disaster Recovery: full restore (under 30 min)
> Restores the Phase 6 reliability surface from S3 in three layers: rate-limit Redis, the audit log, and the dbt-loom manifest registry.

# Runbook — Disaster Recovery: full restore (under 30 min)

Restores the Phase 6 reliability surface from S3 in three layers.

## Layer 1: rate-limit Redis (5 min)

1. The Redis primary in `alphaswarm-system` is gone. The
   `redis-master.alphaswarm-system.svc` Service points to no pod.
2. Spin a fresh Redis pod:

   ```bash
   kubectl -n alphaswarm-system scale statefulset/redis --replicas=1
   ```

3. The Lua scripts re-register lazily on the first `Check` call
   (see `RedisTokenBucketStrategy._ensure_initialised` —
   `EVALSHA` failure paths fall back to `EVAL` + re-register).
4. Buckets that were drained in the previous Redis are now full
   again; **this is intentional**. The audit log captures every
   token consumed pre-incident; the operator can replay the
   ledger to rebuild bucket state if compliance requires it.
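
The lazy re-registration in step 3 follows a common Redis pattern: try `EVALSHA`, and on a missing-script error run the body with `EVAL` and load it again. The sketch below is not `RedisTokenBucketStrategy`; it uses a hypothetical in-memory `FakeRedis` and a local `NoScriptError` (redis-py exposes the real one as `redis.exceptions.NoScriptError`) so it runs without a server.

```python
import hashlib


class NoScriptError(Exception):
    """Local stand-in for redis.exceptions.NoScriptError."""


class FakeRedis:
    """In-memory stub with the evalsha / eval / script_load call shapes."""

    def __init__(self) -> None:
        self.scripts: dict[str, str] = {}

    def script_load(self, script: str) -> str:
        sha = hashlib.sha1(script.encode()).hexdigest()
        self.scripts[sha] = script
        return sha

    def evalsha(self, sha: str, numkeys: int, *keys_and_args: str) -> str:
        if sha not in self.scripts:
            raise NoScriptError(sha)
        return "ran-cached"

    def eval(self, script: str, numkeys: int, *keys_and_args: str) -> str:
        return "ran-inline"


def run_script(client: FakeRedis, script: str, numkeys: int, *keys_and_args: str) -> str:
    """Prefer the cached script; fall back to EVAL and re-register on a miss."""
    sha = hashlib.sha1(script.encode()).hexdigest()  # Redis script ids are SHA-1 digests
    try:
        return client.evalsha(sha, numkeys, *keys_and_args)
    except NoScriptError:
        # The script cache was lost with the old Redis pod.
        result = client.eval(script, numkeys, *keys_and_args)
        client.script_load(script)
        return result


fresh_pod = FakeRedis()
LUA = "return 1"
first_call = run_script(fresh_pod, LUA, 1, "bucket:demo")
second_call = run_script(fresh_pod, LUA, 1, "bucket:demo")
```

The first call after a restore takes the slow path once; every later call hits the cached script.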

## Layer 2: audit log (10 min)

1. The `audit_log` table is hash-chain-protected (trigger from
   Alembic 0079). The S3 export (Celery beat task
   `alphaswarm_ratelimit.tasks.ledger_export.export_ledger_window`)
   carries every row in append-only JSONL form.
2. Restore the latest window:

   ```bash
   alphaswarm ratelimit admin restore-ledger \
     --bucket alphaswarm-audit-archive \
     --since 2026-05-01 \
     --until 2026-05-24
   ```

3. The `enforce_audit_log_hash_chain` Postgres trigger validates
   every restored row against its predecessor; on violation the
   restore aborts and surfaces the exact mismatched hex digest.
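
The predecessor check in step 3 can be illustrated with a plain hash chain. This is a sketch under stated assumptions: the digest construction (SHA-256 over the previous hash plus sorted-key JSON), the `GENESIS` value, and the row shape are hypothetical; the actual trigger from Alembic 0079 may compute its digest differently.

```python
import hashlib
import json
from typing import Any, Iterable

GENESIS = "0" * 64  # assumed predecessor hash for the first row


def row_hash(prev_hash: str, payload: dict[str, Any]) -> str:
    """Digest of a row, bound to the hash of the row before it."""
    body = json.dumps(payload, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256((prev_hash + body).encode("utf-8")).hexdigest()


def verify_chain(rows: Iterable[dict[str, Any]]) -> None:
    """Raise ValueError naming the first row that does not follow its predecessor."""
    prev = GENESIS
    for index, row in enumerate(rows):
        expected = row_hash(prev, row["payload"])
        if row["hash"] != expected:
            raise ValueError(f"row {index}: expected {expected}, got {row['hash']}")
        prev = row["hash"]


# Build a two-row chain the way an append-only writer would.
chain: list[dict[str, Any]] = []
prev_digest = GENESIS
for payload in ({"actor": "a", "tokens": 1}, {"actor": "b", "tokens": 2}):
    prev_digest = row_hash(prev_digest, payload)
    chain.append({"payload": payload, "hash": prev_digest})

verify_chain(chain)  # passes silently on an intact chain
```

Editing any restored payload changes its expected digest, so the mismatch is reported at the first altered row.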

## Layer 3: dbt-loom manifest registry (10 min)

1. The `s3://alphaswarm-dbt-manifests` bucket is the source of truth
   for cross-project `ref()` lookups.
2. Restore the latest manifest per project:

   ```bash
   alphaswarm deploy restore-dbt-manifests \
     --env prod \
     --to-bucket alphaswarm-dbt-manifests-restored
   ```

3. Update the `loom.yml` in each team project to point at the
   restored bucket name; downstream `dbt parse` succeeds with
   the rehydrated manifests.

## Phase-gate verification

The full DR test must complete in under 30 min wall-clock.
`tests/chaos/test_dr_restore.py` orchestrates the three layers
against a fixture cluster + S3 mock and asserts the under-30-min
deadline.


<!-- https://alpha-swarm.ai/how-to/runbooks/questdb-wal-stall -->
# Runbook — QuestDB WAL apply stall
> Symptoms: `questdb_wal_apply_lag_seconds` is above 60s, new dbt model materialization runs hang on INSERT, and the QuestDB UI shows `WAL applied = N` no longer advancing.

# Runbook — QuestDB WAL apply stall

Symptoms:

- `questdb_wal_apply_lag_seconds` Prometheus metric is above 60s.
- New dbt model materialization runs hang on INSERT.
- The QuestDB UI shows `WAL applied = N` is no longer advancing.

## Root cause

A long-running query or external table lock has blocked the WAL
apply worker. The new QuestDB documentation explicitly warns:
"Non-partitioned tables cannot use WAL". The AlphaSwarm custom
`questdb_table` materialization forces `PARTITION BY DAY` to
avoid the most common form of this stall, but mis-configured
external tables can still trip the apply loop.

## Recovery

1. Identify the offending table from the Prometheus alert label:

   ```text
   {table="equities_minute_bars"}
   ```

2. Suspend writers to that table:

   ```bash
   alphaswarm ratelimit admin halt-pool questdb_writer:equities_minute_bars
   ```

3. Resume the WAL apply loop:

   ```sql
   ALTER TABLE equities_minute_bars RESUME WAL;
   ```

4. Once the lag drops back below 5s, lift the writer halt:

   ```bash
   alphaswarm ratelimit admin resume-pool questdb_writer:equities_minute_bars
   ```

## Prevention

The Phase 2 `alphaswarm/dagster/dagster.yaml` reserves a per-table
`questdb_writer:` pool with `limit=1` so concurrent
writers to the same table are impossible. Verify that pool is
present + has `limit=1`.


<!-- https://alpha-swarm.ai/how-to/runbooks/quota-exhaustion -->
# Runbook — quota-exhaustion
> Diagnose and recover a rate-limit bucket that has fired the 80 percent, 95 percent, or exhausted alert.

# Runbook — quota-exhaustion

A bucket has fired `AQPRatelimitBucketAt80Percent`,
`AQPRatelimitBucketAt95Percent`, or `AQPRatelimitBucketExhausted`.

## Diagnosis (5 min)

1. Open the rate-limit dashboard at `/data/ratelimit`. Find the
   over-consuming `(user_id, service, key_id)`.
2. Inspect the `rl_ledger` partition for the last hour:

   ```sql
   SELECT decision, count(*), sum(tokens_consumed)
     FROM rl_ledger
    WHERE ts > now() - interval '1 hour'
      AND key_id = :key_id
    GROUP BY decision;
   ```

3. Cross-reference `audit_log` for the calling `tool_id` —
   `data.ingest.materialize` or `data.ingest.preview_source` are
   the usual culprits.

## Decision tree (10 min)

| Cause | Action |
| --- | --- |
| Misconfigured backfill | Tell the operator to cancel via `alphaswarm materialize cancel `. The reservation auto-releases. |
| Vendor downgrade | Mint a higher-tier key via `alphaswarm keys mint --service polygon --rps 100 --burst 1000`. |
| Stuck connector loop | `alphaswarm ratelimit status --key-id ` shows the call rate; halt the offending Dagster sensor via the topbar kill-switch. |
| Legitimate traffic | Raise the policy via `data.ratelimit.policy.update` (Tier-P + step-up MFA). |

## Recovery (15 min)

1. Once the cause is addressed, the bucket refills at the policy's
   `refill_rate`; no manual reset is required.
2. If the operator wants an immediate reset, run the Phase 6
   admin script that explicitly DELs the bucket key:

   ```bash
   alphaswarm ratelimit admin reset --user-id  --service polygon --key-id primary
   ```

3. Verify recovery in Grafana:

   ```promql
   rl_bucket_remaining{service="polygon.aggregates"} > 50
   ```
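
The refill and reset behaviour in the recovery steps matches a standard token bucket. The sketch below is a generic in-process version, not the Redis-backed strategy; the `TokenBucket` class and its field names other than `refill_rate` are assumptions.

```python
from dataclasses import dataclass


@dataclass
class TokenBucket:
    capacity: float
    refill_rate: float  # tokens added per second
    tokens: float
    updated_at: float   # seconds on a caller-supplied clock

    def _refill(self, now: float) -> None:
        """Add tokens for the elapsed time, capped at capacity."""
        elapsed = max(0.0, now - self.updated_at)
        self.tokens = min(self.capacity, self.tokens + elapsed * self.refill_rate)
        self.updated_at = now

    def check(self, now: float, cost: float = 1.0) -> bool:
        """Consume `cost` tokens if available; otherwise deny."""
        self._refill(now)
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False

    def reset(self, now: float) -> None:
        """Model an operator reset: the bucket is full again."""
        self.tokens = self.capacity
        self.updated_at = now


bucket = TokenBucket(capacity=10, refill_rate=2.0, tokens=0.0, updated_at=0.0)
denied = bucket.check(now=0.0)   # exhausted, nothing refilled yet
allowed = bucket.check(now=1.0)  # one second later, two tokens were added
```

Because refill is computed from elapsed time on each check, an exhausted bucket recovers on its own once the caller slows down.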

## Postmortem

Every quota-exhaustion alert that requires manual intervention
must produce a postmortem PR within 72 hours. Template:
`alphaswarm_docs/docs/how-to/runbooks/templates/postmortem.md` (to be authored).


<!-- https://alpha-swarm.ai/how-to/runbooks/snapshot-deadlock -->
# Runbook — dbt snapshot deadlock
> Symptoms: `dbt snapshot` runs queue indefinitely, and the `dbt_snapshots` Dagster concurrency pool shows 1 slot in use while the corresponding run is `CANCELED` or `FAILED`.

# Runbook — dbt snapshot deadlock

Symptoms:

- `dbt snapshot` runs queue indefinitely.
- The `dbt_snapshots` Dagster concurrency pool shows 1 slot in use
  but the corresponding run is `CANCELED` or `FAILED`.

## Root cause

Per the Dagster docs: "a single cancelled run will permanently
deadlock all future runs for that pool" unless the
`free_slots_after_run_end_seconds` knob is set on the
`run_monitoring` block.

## Fix (in this order)

1. Confirm `alphaswarm/dagster/dagster.yaml` has

   ```yaml
   run_monitoring:
     enabled: true
     free_slots_after_run_end_seconds: 300
   ```

   If missing, add it + reload the Dagster instance.

2. Manually free the stuck slot:

   ```bash
   dagster instance concurrency reset dbt_snapshots
   ```

3. Verify with the Dagster UI: the pool should show `0 / 1` used.

## Verification chaos test

`tests/chaos/test_snapshot_deadlock_recovery.py` triggers 5
parallel snapshot jobs against a sqlite test target and asserts
that even after one is cancelled the pool recovers within 360s.

## Postmortem

If the deadlock recurs after the canonical fix, the postmortem
must include a Dagster + dbt version pair and a minimal repro
so the upstream issue can be filed.


<!-- https://alpha-swarm.ai/how-to/tenant-router-auth-rollout -->
# Tenant-router auth rollout runbook

# Tenant-router auth rollout runbook

> Operator companion to
> [Edge authentication & cell routing](../concepts/identity/edge-authentication.md)
> and the manifests at
> `alphaswarm_platform/deployments/kubernetes/edge/alphaswarm-tenant-router/`.
> Follows the [cell-router cutover](./cell-router-cutover.md) — run that
> first if the Envoy edge is not serving yet.

The tenant-router ships **fail-closed**: `AUTH_MODE=required` with an
empty issuer, so a fresh apply crash-loops with a `SettingsError`
until you stamp real IdP values. That is intentional — complete this
runbook to bring the edge up authenticated.

## 1. Prerequisites

1. The IdP is provisioned (Auth0 via `terraform/modules/auth0_identity`
   or Entra via `alphaswarm_entra_directory`) and the per-cell backends
   already validate the same issuer/audience
   (`ALPHASWARM_AUTH_OIDC_ISSUER` / `..._AUDIENCE` in
   `alphaswarm-config`, stamped by
   `build/scripts/sync_auth0_env_to_k8s.py`).
2. The claims pipeline stamps the namespaced routing claims
   (`https://alphaswarm.internal/tenant_id`, `workspace_id`, and —
   for B2B premium plans — `tier`). See
   [Auth0 Actions](../concepts/identity/auth0-actions.md) /
   [MSAL setup](../concepts/identity/msal-entra-setup.md).
3. The cells registry has at least one `state=active` cell per tier
   you route to (`curl -sS $CP/manage/cells | jq '.data[].tier'`).

## 2. Stamp the auth ConfigMap

Edit (or overlay-patch) `alphaswarm-tenant-router-config` in
`deployments/kubernetes/edge/alphaswarm-tenant-router/configmap.yaml`:

```yaml
data:
  ALPHASWARM_TENANT_ROUTER_AUTH_MODE: "permissive"   # step 3 flips to required
  ALPHASWARM_TENANT_ROUTER_OIDC_ISSUER: "https://.us.auth0.com/"
  ALPHASWARM_TENANT_ROUTER_OIDC_AUDIENCE: "https://api.alphaswarm.internal/manage"
```

The JWKS URI derives from the issuer
(`/.well-known/jwks.json`); set
`ALPHASWARM_TENANT_ROUTER_JWKS_URI` only for non-standard IdPs. Only
asymmetric algorithms are accepted — if you change
`OIDC_ALGORITHMS`, `HS*` values are refused at boot.
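
The fail-closed boot rules in this runbook can be sketched as a settings object that validates itself. This is an illustration, not the router's settings module: the `RouterAuthSettings` dataclass, its field names, and the `RS256` default are assumptions, and `SettingsError` is redefined locally.

```python
from dataclasses import dataclass


class SettingsError(ValueError):
    """Local stand-in for the error the router raises at boot."""


@dataclass(frozen=True)
class RouterAuthSettings:
    auth_mode: str = "required"
    oidc_issuer: str = ""
    oidc_audience: str = ""
    oidc_algorithms: tuple[str, ...] = ("RS256",)  # assumed default
    allow_insecure: bool = False

    def __post_init__(self) -> None:
        if self.auth_mode not in {"required", "permissive", "disabled"}:
            raise SettingsError(f"unknown AUTH_MODE {self.auth_mode!r}")
        if self.auth_mode == "disabled":
            # The insecure mode needs a second, explicit opt-in.
            if not self.allow_insecure:
                raise SettingsError("AUTH_MODE=disabled also needs ALLOW_INSECURE=true")
            return
        if not self.oidc_issuer or not self.oidc_audience:
            raise SettingsError("issuer and audience must be set")
        symmetric = [a for a in self.oidc_algorithms if a.upper().startswith("HS")]
        if symmetric:
            raise SettingsError(f"symmetric algorithms refused: {symmetric}")


settings = RouterAuthSettings(
    auth_mode="permissive",
    oidc_issuer="https://idp.example/",
    oidc_audience="https://api.alphaswarm.internal/manage",
)
```

Constructing the object with defaults raises immediately, which is the crash-loop a fresh apply produces until the ConfigMap is stamped.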

Apply + restart:

```bash
kubectl apply -k alphaswarm_platform/deployments/kubernetes/edge/
kubectl -n alphaswarm-edge rollout restart deploy/alphaswarm-tenant-router
kubectl -n alphaswarm-edge rollout status deploy/alphaswarm-tenant-router
```

## 3. Canary in `permissive`, then enforce

`permissive` denies **invalid** tokens but lets anonymous requests
through flagged `x-alphaswarm-auth: anonymous` (per-cell gates still
reject where they require auth). Watch the decision counters:

```bash
kubectl -n alphaswarm-edge port-forward svc/alphaswarm-tenant-router 8080 &
curl -s localhost:8080/metrics | grep authz_decisions_total
# alphaswarm_tenant_router_authz_decisions_total{decision="allow",mode="permissive",reason="verified"} 1042
# alphaswarm_tenant_router_authz_decisions_total{decision="allow",mode="permissive",reason="anonymous"} 3
# alphaswarm_tenant_router_authz_decisions_total{decision="deny",mode="permissive",reason="expired_token"} 7
```
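
How the two modes produce those counters can be sketched as a small decision function. The `decide` name and its parameters are hypothetical; the decision and reason strings are the ones shown in the metrics and listed in the rollback section.

```python
from collections import Counter
from typing import Optional


def decide(mode: str, has_token: bool, token_error: Optional[str]) -> tuple[str, str]:
    """Return (decision, reason) for one request.

    `token_error` is a reason code such as "expired_token" when verification
    failed, or None when the token verified cleanly.
    """
    if not has_token:
        if mode == "permissive":
            return ("allow", "anonymous")
        return ("deny", "missing_token")
    if token_error is not None:
        # Invalid tokens are denied in both modes.
        return ("deny", token_error)
    return ("allow", "verified")


counters: Counter = Counter()
requests = [(True, None), (False, None), (True, "expired_token")]
for has_token, error in requests:
    counters[decide("permissive", has_token, error)] += 1
```

Switching the mode argument to `required` changes only the anonymous branch, which is why the anonymous counter is the one to watch before enforcing.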

When `reason="anonymous"` stays near zero for a representative window
(only unauthenticated probes remain), flip to enforcement:

```yaml
  ALPHASWARM_TENANT_ROUTER_AUTH_MODE: "required"
```

Re-apply, restart, and confirm `readyz` reports the posture:

```bash
curl -s localhost:8080/readyz | jq
# {"status":"ok","cells":3,"auth_mode":"required","cba_mode":"enforce",...}
```

## 4. Verification checks

```bash
# Anonymous is denied (required mode):
curl -s -o /dev/null -w '%{http_code}\n' -XPOST localhost:8080/ext_authz/v3/check \
  -H 'content-type: application/json' \
  -d '{"attributes":{"request":{"http":{"headers":{}}}}}'
# 401

# A live SPA token is verified and routed:
TOKEN=$(...fetch from the SPA / device flow...)
curl -s -XPOST localhost:8080/ext_authz/v3/check \
  -H 'content-type: application/json' \
  -d "{\"attributes\":{\"request\":{\"http\":{\"headers\":{\"authorization\":\"Bearer ${TOKEN}\"}}}}}" \
  -D - -o /dev/null | grep -i x-alphaswarm
# x-alphaswarm-cell: cell-shared-std-us-east-1a
# x-alphaswarm-auth: verified
# x-alphaswarm-sub: auth0|...
```

End-to-end through the edge, a tampered or expired token must produce
401 from Envoy, and `x-alphaswarm-*` request headers sent by the client
must arrive at the cell overwritten with verified values.
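
The header-overwrite requirement amounts to dropping every client-supplied `x-alphaswarm-*` header before adding the verified ones. In the real deployment this happens at the Envoy edge; the `sanitize_headers` function below is a hypothetical Python rendering of the same rule.

```python
PREFIX = "x-alphaswarm-"


def sanitize_headers(client_headers: dict[str, str], verified: dict[str, str]) -> dict[str, str]:
    """Drop client-sent x-alphaswarm-* headers, then add the verified values."""
    # Header names are case-insensitive, so compare in lower case.
    kept = {
        name: value
        for name, value in client_headers.items()
        if not name.lower().startswith(PREFIX)
    }
    kept.update({name.lower(): value for name, value in verified.items()})
    return kept


incoming = {
    "Accept": "application/json",
    "X-AlphaSwarm-Cell": "cell-i-would-like",  # spoofed by the client
    "x-alphaswarm-sub": "someone-else",        # spoofed by the client
}
trusted = {
    "x-alphaswarm-cell": "cell-shared-std-us-east-1a",
    "x-alphaswarm-auth": "verified",
}
forwarded = sanitize_headers(incoming, trusted)
```

Stripping first matters: a spoofed header the router does not set itself, such as `x-alphaswarm-sub` here, would otherwise survive untouched.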

## 5. Cross-cell CBA keys (Phase 5 §8.5)

Cross-cell calls present a `Cell-Bound-Authorization` JWT. The
validator (co-located in the router) reads each **source** cell's
verification keys from the cells-registry annotation — publish them
when you enable cross-cell MCP:

```bash
curl -sS -XPATCH "$CP/manage/cells/cell-shared-std-us-east-1a" \
  -H "authorization: Bearer $MGMT_TOKEN" -H 'content-type: application/json' \
  -d '{"annotations":{"alphaswarm.internal/cba-jwks":"{\"keys\":[...]}"}}'
```

`CBA_MODE=enforce` (default) is safe before any workload mints CBAs —
requests without the header pass through. Use `monitor` to log
would-be denials during key rollout; check
`cba_decisions_total{decision="deny"}` before returning to `enforce`.
Single-cell edges should additionally pin
`ALPHASWARM_TENANT_ROUTER_CBA_DESTINATION_CELL_ID` to their own cell id.

## 6. Rollback

Auth enforcement is config-only — no image rollback needed:

1. Flip `AUTH_MODE` back to `permissive` (NOT `disabled`; the insecure
   mode also demands `ALLOW_INSECURE=true` and is for local dev only).
2. `kubectl -n alphaswarm-edge rollout restart deploy/alphaswarm-tenant-router`.
3. The decision counters (`/metrics`) and structured `authz_deny` logs
   (reason codes: `missing_token`, `expired_token`, `wrong_audience`,
   `wrong_issuer`, `no_matching_key`, `forbidden_algorithm`,
   `jwks_unreachable`) identify what was being denied before you
   re-enforce.

## Failure modes worth knowing

| Symptom | Cause | Response |
| --- | --- | --- |
| Pod crash-loops with `SettingsError` | Missing issuer/audience in `required`/`permissive` | Stamp the ConfigMap (step 2). |
| All requests 401 `jwks_unreachable` | Router cannot reach the IdP JWKS (egress 443 blocked, wrong issuer) | Check the NetworkPolicy + issuer URL; the JWKS cache serves stale once warmed, so this bites hardest on cold boots. |
| 401 `no_matching_key` after IdP key rotation | kid not in cached JWKS | The router force-refreshes once per unknown kid automatically; persistent failures mean the issuer/JWKS URI points at the wrong tenant. |
| 503 `no_cell_available` for premium users | No active `shared-prem` cell | Explicit tiers are never downgraded — activate a cell for the tier or fix the claim pipeline. |
| `readyz` shows `registry_stale: true` | Control plane unreachable > `REGISTRY_STALENESS_WARN_SECONDS` | Routing continues on last-known-good cells; restore `alphaswarm-cp` before making placement changes. |
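
The JWKS rows in the table describe two cache behaviours: one forced refresh per unknown `kid`, and serving stale keys once the cache is warm. The sketch below is a hypothetical `JwksCache`, not the router's code; the `fetch` callable and the use of `OSError` for network failures are assumptions.

```python
from typing import Callable


class JwksCache:
    def __init__(self, fetch: Callable[[], dict[str, dict]]) -> None:
        self._fetch = fetch                   # returns {kid: jwk}; may raise OSError
        self._keys: dict[str, dict] = {}
        self._refreshed_for: set[str] = set()

    def _refresh(self) -> bool:
        """Reload keys; on failure keep stale keys if any exist."""
        try:
            self._keys = self._fetch()
            return True
        except OSError:
            if not self._keys:
                raise  # cold boot: nothing stale to serve
            return False

    def get(self, kid: str) -> dict:
        """Look up a key, forcing at most one successful refresh per unknown kid."""
        if kid not in self._keys and kid not in self._refreshed_for:
            if self._refresh():
                self._refreshed_for.add(kid)
        if kid not in self._keys:
            raise KeyError("no_matching_key")
        return self._keys[kid]


calls = {"n": 0}


def fake_fetch() -> dict[str, dict]:
    calls["n"] += 1
    return {"kid-a": {"kty": "RSA"}}


cache = JwksCache(fake_fetch)
key = cache.get("kid-a")
```

Limiting refreshes per `kid` keeps a stream of tokens with a bogus key id from hammering the IdP.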


<!-- https://alpha-swarm.ai/intro/conventions -->
# Conventions
> Documentation authoring rules, frontmatter contract, and how to ship a new doc.

# Conventions

## Frontmatter is mandatory

Every `.md` or `.mdx` file under `alphaswarm_docs/docs/` MUST have a
frontmatter block validated by the Zod schema at
[src/lib/frontmatterSchema.ts](https://github.com/julianwileymac/alphaswarm/blob/main/alphaswarm_docs/src/lib/frontmatterSchema.ts).

Required fields:

- `title` — the human-readable title (becomes the ``).
- `summary` — a one-liner consumed by `/llms.txt` and the search
  index. Keep under 200 characters.
- `owner` — the GitHub Team that owns the page (`platform-team`,
  `docs-team`, `data-team`, `rl-team`, `ml-team`, `agentic-team`,
  `strategy-team`, `trading-team`, `identity-team`, `infra-team`,
  `sre-team`).
- `last_reviewed` — ISO 8601 date. The stale-content watchdog opens
  a GitHub Issue when this is more than 180 days old.
- `audience` — one of `human`, `agent`, `both`, `internal`.

Optional:

- `version` — pin to a specific date-epoch.
- `deprecated`, `deprecated_replacement`, `deprecated_at`,
  `deprecated_sunset` — deprecation lifecycle (Stripe-style epochs).
- `keywords`, `tags`, `sidebar_label`, `sidebar_position`,
  `runnable` — Docusaurus-native fields.
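
The required-field rules above can be approximated in Python for a quick local check. The authoritative validator is the Zod schema linked above; the `frontmatter_problems` function is a hypothetical sketch that assumes `last_reviewed` arrives as an ISO 8601 string and folds in the watchdog's 180-day rule for convenience.

```python
from datetime import date

REQUIRED = ("title", "summary", "owner", "last_reviewed", "audience")
AUDIENCES = {"human", "agent", "both", "internal"}


def frontmatter_problems(meta: dict, today: date) -> list[str]:
    """Return human-readable problems; an empty list means the block looks fine."""
    problems = [f"missing {name}" for name in REQUIRED if name not in meta]
    if len(str(meta.get("summary", ""))) >= 200:
        problems.append("summary must stay under 200 characters")
    if "audience" in meta and meta["audience"] not in AUDIENCES:
        problems.append(f"audience must be one of {sorted(AUDIENCES)}")
    if "last_reviewed" in meta:
        try:
            reviewed = date.fromisoformat(str(meta["last_reviewed"]))
        except ValueError:
            problems.append("last_reviewed is not an ISO 8601 date")
        else:
            if (today - reviewed).days > 180:
                problems.append("last_reviewed is more than 180 days old")
    return problems


page = {
    "title": "Recipe: add a strategy",
    "summary": "Register a new strategy class against the registry.",
    "owner": "strategy-team",
    "last_reviewed": "2026-05-01",
    "audience": "both",
}
issues = frontmatter_problems(page, today=date(2026, 6, 1))
```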

## Cross-linking

Use relative markdown links. The autolink resolver in Docusaurus
maps them to the published routes at build time.

```mdx
See [the data plane concept](../concepts/data/data-plane.md) for the
provider → cache → DuckDB view pipeline.
```

Cite source code with a full absolute repo URL:

```mdx
[alphaswarm/data/iceberg_catalog.py](https://github.com/julianwileymac/alphaswarm/blob/main/alphaswarm/data/iceberg_catalog.py)
```

Do not link to specific line numbers — they bit-rot quickly.

## Diagrams

Mermaid only. GitHub renders it natively; Docusaurus ships
`@docusaurus/theme-mermaid` which renders it client-side here.

Do not commit PNG / SVG diagrams unless they are screenshots of a
running UI and are time-stamped.

## Code blocks

Tag every code block with a language. Tag runnable blocks with the
`runnable` attribute; Phase 5 of the migration will render those
with a "Run" button backed by Pyodide (for Python) or StackBlitz
WebContainers (for JS / TS).

```python runnable
import requests
print(requests.get("http://localhost:8000/readyz").status_code)
```

## "Was this helpful?"

Every page renders the feedback widget from
[src/components/FeedbackWidget.tsx](https://github.com/julianwileymac/alphaswarm/blob/main/alphaswarm_docs/src/components/FeedbackWidget.tsx).
Submissions POST to a Cloudflare Worker that opens a `docs-feedback`
GitHub Issue tagged with the page's CODEOWNER team.

## Editing this page

Click the "Edit this page" link at the bottom — it opens
github.dev for a browser-side edit. Or open the PR locally:

```powershell
git checkout -b docs/fix-typo
# edit
git add alphaswarm_docs/docs/...
git commit -m "docs: fix typo in quickstart"
git push -u origin docs/fix-typo
```

Branch protection requires the docs-CI suite to pass plus one
CODEOWNER approval.

## Business editors

Non-engineers can edit content through Keystatic at
[https://docs.alpha-swarm.ai/keystatic](/keystatic). Keystatic stores
changes in Git and opens a PR against `main`. No parallel CMS,
no duplicate database.

## AI agents

Read `/llms.txt` for the curated index, `/llms-full.txt` for the
full corpus, or query the MCP server at `/mcp`. The MCP server is
RFC 9728 + 8707-compliant (AlphaSwarm rule 49) and validates the `aud`
claim against `ALPHASWARM_MCP_DOCS_CANONICAL_URI`.


<!-- https://alpha-swarm.ai/intro/glossary -->
# Glossary
> Project-specific jargon used across AlphaSwarm, with a definition and a pointer to the canonical file.

# Glossary

Project-specific jargon used across AlphaSwarm, with a definition and a pointer
to the canonical file. New contributors and AI agents should treat this
as the **single source of truth** for terminology — if you find a
mismatch between this glossary and the code, file an issue.

> See also: [alphaswarm_docs/index.md](../intro/index.md) for the full doc map.

## Core domain

- **`vt_symbol`** — Composite symbol id with the shape
  `{TICKER}.{EXCHANGE}` (vnpy convention), e.g. `AAPL.NASDAQ`,
  `BTCUSDT.BINANCE`, `ESM4.CME`. Always created via `Symbol.parse(...)` /
  `Symbol.format(...)` in [alphaswarm/core/types.py](../alphaswarm/core/types.py); never
  hand-split.
- **`Symbol`** — Immutable dataclass that bundles `ticker`, `exchange`,
  `asset_class`, `security_type`, optional contract spec. The atom
  flowing through every data feed, strategy, and broker. Defined in
  [alphaswarm/core/types.py](../alphaswarm/core/types.py).
- **`AssetClass` vs `SecurityType`** — `AssetClass` is the broad
  category (`equity`, `crypto`, `fx`, `future`, `option`, `index`,
  `commodity`, `bond`). `SecurityType` is the Lean-style finer-grained
  enum (`equity`, `option`, `future_option`, `crypto_future`,
  `index_option`, …). The `_polymorphic_identity_for` helper in
  [alphaswarm/data/catalog.py](../alphaswarm/data/catalog.py) maps `SecurityType` to a
  joined-table subclass of `Instrument`.
- **`Resolution`** — Lean-style bar cadence (`Tick`, `Second`, `Minute`,
  `Hour`, `Daily`); see [alphaswarm/core/types.py](../alphaswarm/core/types.py).
- **`Interval`** — Short-code bar cadence (vnpy style, `1m`, `5m`,
  `1h`, `1d`). Same idea as `Resolution`, kept for vnpy back-compat.
- **`SubscriptionDataConfig`** — The data-plane routing key. Combines
  `Symbol + Resolution + TickType + DataNormalizationMode`. See
  [alphaswarm_docs/core-types.md](../concepts/platform/core-types.md).
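
The `vt_symbol` round trip can be illustrated with a reduced `Symbol`. This is a sketch, not the class in `alphaswarm/core/types.py`: it keeps only `ticker` and `exchange`, and splitting on the last dot is an assumption about how tickers that contain dots are handled.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Symbol:
    """Reduced stand-in carrying only the two fields a vt_symbol encodes."""
    ticker: str
    exchange: str

    @classmethod
    def parse(cls, vt_symbol: str) -> "Symbol":
        """Split `{TICKER}.{EXCHANGE}` on the last dot."""
        ticker, sep, exchange = vt_symbol.rpartition(".")
        if not sep or not ticker or not exchange:
            raise ValueError(f"not a vt_symbol: {vt_symbol!r}")
        return cls(ticker=ticker, exchange=exchange)

    def format(self) -> str:
        """Render back to the `{TICKER}.{EXCHANGE}` shape."""
        return f"{self.ticker}.{self.exchange}"


aapl = Symbol.parse("AAPL.NASDAQ")
round_trip = aapl.format()
```

Going through one parse and one format function keeps the split rule in a single place, which is why the glossary warns against hand-splitting.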

## Persistence + data plane

- **Execution Ledger** — The Postgres tables under
  [alphaswarm/persistence/models.py](../alphaswarm/persistence/models.py) +
  [alphaswarm/persistence/ledger.py](../alphaswarm/persistence/ledger.py) that record
  every signal, order, fill, agent decision, and backtest run.
  Authoritative for "what did the system actually do?".
- **`LedgerWriter`** — Façade over the ledger tables. Always go through
  it instead of writing to ORM models directly so audit messages get
  emitted. [alphaswarm/persistence/ledger.py](../alphaswarm/persistence/ledger.py).
- **`Instrument` joined-table inheritance** — `instruments` is the
  parent table; each subclass (`InstrumentEquity`, `InstrumentOption`,
  …) lives in its own joined-table row keyed on `instruments.id`. The
  `instrument_class` discriminator selects the subclass at load time.
  See [alphaswarm_docs/erd.md](../concepts/platform/erd.md) and
  [alphaswarm/persistence/models_instruments.py](../alphaswarm/persistence/models_instruments.py).
- **`polymorphic_identity`** — SQLAlchemy mapper arg that ties a
  subclass to a discriminator value (e.g.
  `InstrumentEquity.__mapper_args__ = {"polymorphic_identity": "spot"}`).
  When you add a new instrument subclass you must also extend the
  `mapping` dict in `_polymorphic_identity_for`.
- **`DatasetCatalog`** — Parent row describing a logical dataset
  (HMDA LAR, FDA device events, etc.) with provider/domain/tags.
- **`DatasetVersion`** — Per-materialisation row beneath
  `DatasetCatalog`. Captures row count, dataset hash, schema snapshot,
  Iceberg identifier.
- **`DataLink`** — Edge between a `DatasetVersion` and an entity
  (`Instrument`, `Issuer`, `EconomicSeries`). Use this for "which
  symbols does this dataset cover?" queries.
- **`DataSource`** — Logical provider record (Yahoo, Alpha Vantage,
  IBKR, openFDA). Datasets and data-links reference a `DataSource`.
- **`IcebergCatalog`** (the wrapper) — PyIceberg handle from
  [alphaswarm/data/iceberg_catalog.py](../alphaswarm/data/iceberg_catalog.py).
  Always go through `append_arrow`, `read_arrow`,
  `iceberg_to_duckdb_view`; never call PyIceberg's `Catalog.create_table`
  directly.
- **`alphaswarm_` namespace** — Iceberg namespace convention for the
  regulatory ingest: `alphaswarm_cfpb`, `alphaswarm_uspto`,
  `alphaswarm_fda`, `alphaswarm_sec`. New corpora pick a new slug with
  the same `alphaswarm_` prefix.
- **Persistent host warehouse** — `C:/alphaswarm-warehouse` on Windows,
  bind-mounted into `alphaswarm-api` and `alphaswarm-worker` at `/warehouse`.
  Holds the PyIceberg SQL catalog (`catalog.db`), Parquet data files,
  staging dir, and ingest audit logs. See [alphaswarm_docs/data-catalog.md](../concepts/data/data-catalog.md).
- **`legacy` profile** — Docker Compose profile that bundles the older
  REST + MinIO catalog topology (off by default). Bring it up with
  `docker compose --profile legacy up -d`.

## Strategies + backtest

- **`BaseStrategy`** — Abstract strategy contract under
  [alphaswarm/strategies/](../alphaswarm/strategies/). Subclasses implement
  `on_bar`, `on_signal`, etc. See [alphaswarm_docs/backtest-engines.md](../concepts/strategy/backtest-engines.md).
- **`MLAlphaStrategy` / `MLSelectorAlpha`** — Strategies that wrap an
  ML model (deployed via `ModelDeployment`) and emit signals.
- **`EnsembleAlpha`** — Weighted combination of multiple alphas.
  [alphaswarm/strategies/ml_alphas.py](../alphaswarm/strategies/ml_alphas.py).
- **`IBrokerage` / `IDataQueueHandler`** — Lean-style interfaces
  consumed by backtest, paper, and live engines without modification
  (the same strategy code runs against all three). See
  [alphaswarm_docs/paper-trading.md](../concepts/trading/paper-trading.md).
- **`BacktestRun`** — Postgres row describing one backtest invocation
  (Sharpe, Sortino, drawdown, MLflow run id, dataset hash). The
  backtest UI's history view is just a query against this table.
- **`MLflow run id`** — Foreign id stored on `BacktestRun.mlflow_run_id`
  pointing at the MLflow tracking server. Click-through from the UI
  opens the MLflow UI in a new tab.
- **`dataset_hash`** — Deterministic SHA-256 of the input bars used in
  a backtest. Lets the UI flag "two backtests with the same hash =
  identical inputs".
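
As a rough illustration of the `dataset_hash` idea above, the sketch below reduces a list of bars to a single SHA-256 digest. The bar layout, the `hash_bars` name, and the canonical-JSON serialisation are assumptions made for illustration; the actual AlphaSwarm code may serialise bars differently.

```python
import hashlib
import json


def hash_bars(bars):
    """Return a deterministic SHA-256 hex digest for a sequence of bar dicts.

    Hypothetical helper: bars are sorted by (symbol, timestamp) and serialised
    as canonical JSON so that row order and key order do not affect the hash.
    """
    ordered = sorted(bars, key=lambda b: (b["symbol"], b["timestamp"]))
    digest = hashlib.sha256()
    for bar in ordered:
        line = json.dumps(bar, sort_keys=True, separators=(",", ":"))
        digest.update(line.encode("utf-8"))
        digest.update(b"\n")  # row separator so adjacent rows cannot merge
    return digest.hexdigest()


bars_a = [
    {"symbol": "SPY", "timestamp": "2024-01-02T14:30:00Z", "close": 472.65},
    {"symbol": "SPY", "timestamp": "2024-01-02T14:31:00Z", "close": 472.80},
]
bars_b = list(reversed(bars_a))  # same rows, different order
bars_c = [dict(bars_a[0], close=472.66), bars_a[1]]  # one price changed

print(hash_bars(bars_a) == hash_bars(bars_b))  # identical inputs
print(hash_bars(bars_a) == hash_bars(bars_c))  # different inputs
```

Sorting before hashing is a design choice in this sketch: it makes the digest depend on the set of bars, not on the order in which they were loaded.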

## ML + agents

- **Tier (`deep` / `quick`)** — Two LLM tiers in the agentic crews.
  `deep` = high-capability (Nemotron 70B / GPT-4-class) for analysis;
  `quick` = small/fast (Llama 3.2 / Mini) for control-flow decisions.
  Provider per tier is in `settings.llm_provider_deep` /
  `_quick`; model per tier in `llm_deep_model` / `llm_quick_model`.
- **`router_complete`** — One-shot LLM completion through LiteLLM
  exposed by [alphaswarm/llm/providers/router.py](../alphaswarm/llm/providers/router.py).
  All AlphaSwarm code goes through this — never call `litellm.completion` or
  the Ollama client directly.
- **`Director`** — Nemotron-driven planner + verifier in
  [alphaswarm/data/pipelines/director.py](../alphaswarm/data/pipelines/director.py).
  Sits between discovery and materialisation in generic file ingestion.
- **`IngestionPlan` / `PlannedDataset`** — Director output dataclass.
  One `PlannedDataset` per discovered family with target namespace,
  table name, expected_min_rows, domain hint, and skip list.
- **`VerifierVerdict`** — Director's post-materialise judgement
  (`accept` or `retry` with adjusted knobs).
- **`__assets__` family** — Synthetic `DiscoveredDataset` carrying the
  non-tabular inventory (PDFs, XML, images) found during discovery.
  Never materialised; surfaced under
  `IngestionReport.extras` for visibility.
- **`AgentDecision` / `DebateTurn`** — Agent crew audit trail rows.
- **`CrewRun`** — One full agentic crew invocation (planner →
  research → execution sub-agents).
- **`Alpha158`** — Microsoft Qlib's 158-feature factor zoo, ported to
  AlphaSwarm under [alphaswarm/data/indicators_zoo.py](../alphaswarm/data/indicators_zoo.py).
- **`FeatureSet` / `FeatureSetVersion`** — Composable feature spec
  (list of `IndicatorZoo` expressions + transformations) versioned
  in Postgres, materialised on demand.
- **`ModelDeployment` / `MLDeployment`** — A trained ML model that
  has been registered for inference (rows in
  [alphaswarm/persistence/models.py](../alphaswarm/persistence/models.py)).

## Bots

- **`Bot`** — Smallest self-contained, deployable unit on AlphaSwarm.
  Aggregates a universe + data pipeline + strategy + backtest engine +
  optional ML deployments + optional agent specs + RAG plan + metrics
  + risk caps + deployment target. Lives under a `Project` and is
  uniquely identified by `(project_id, slug)`. See
  [alphaswarm_docs/bots.md](../concepts/agentic/bots.md).
- **`BotSpec`** — Pydantic blueprint for a bot. Hashed via
  `snapshot_hash()` to drive immutable `bot_versions` snapshots.
  Defined in [alphaswarm/bots/spec.py](../alphaswarm/bots/spec.py).
- **`TradingBot` / `ResearchBot`** — Bot subclasses selected by
  `BotSpec.kind`. `TradingBot` does backtest / paper / deploy;
  `ResearchBot` does chat (and optional backtest if a `strategy` block
  is set).
- **`BotRuntime`** — Single sanctioned execution entry point for any
  bot lifecycle action. Snapshots specs into `bot_versions`, opens
  `bot_deployments` rows, and emits progress through
  [alphaswarm/tasks/_progress.py](../alphaswarm/tasks/_progress.py).
- **`bot_versions`** — Immutable, hash-locked spec snapshots
  (mirrors `agent_spec_versions`). Never mutated in place.
- **`bot_deployments`** — Ledger of every backtest / paper / chat /
  k8s invocation for a bot. References the `BotVersion` that produced
  it so a run can be replayed.
- **Deployment target (`paper_session` / `kubernetes` /
  `backtest_only`)** — Selected via `BotSpec.deployment.target`.
  Backed by `alphaswarm/bots/deploy.py::DeploymentDispatcher`.
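
The relationship between `snapshot_hash()` and the immutable `bot_versions` table can be sketched as follows. Everything here is a stand-in: `ToyBotSpec` is a plain dataclass (the real `BotSpec` is a Pydantic model), `ToyVersionStore` is an in-memory dict, not a Postgres table, and the hashing recipe is assumed.

```python
import hashlib
import json
from dataclasses import asdict, dataclass, field


@dataclass(frozen=True)
class ToyBotSpec:
    """Hypothetical stand-in for BotSpec; field names are illustrative."""

    slug: str
    kind: str  # e.g. "trading" or "research"
    deployment_target: str = "backtest_only"
    risk_caps: dict = field(default_factory=dict)

    def snapshot_hash(self) -> str:
        # Canonical JSON: sorted keys and fixed separators give a stable digest.
        payload = json.dumps(asdict(self), sort_keys=True, separators=(",", ":"))
        return hashlib.sha256(payload.encode("utf-8")).hexdigest()


class ToyVersionStore:
    """In-memory stand-in for an append-only bot_versions table."""

    def __init__(self):
        self._versions = {}  # snapshot hash -> spec dict

    def snapshot(self, spec):
        key = spec.snapshot_hash()
        # Insert only when the hash is new; existing snapshots are never mutated.
        self._versions.setdefault(key, asdict(spec))
        return key

    def __len__(self):
        return len(self._versions)


store = ToyVersionStore()
v1 = store.snapshot(ToyBotSpec("momo-etf", "trading"))
v1_again = store.snapshot(ToyBotSpec("momo-etf", "trading"))
v2 = store.snapshot(
    ToyBotSpec("momo-etf", "trading", deployment_target="paper_session")
)
print(len(store))  # two distinct snapshots
```

Keying versions on the content hash means re-submitting an unchanged spec reuses the existing snapshot, while any field change produces a new version.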

## Provider catalog

- **`LLMProvider`** — Lightweight handle around a LiteLLM provider
  spec. Registered in
  [alphaswarm/llm/providers/catalog.py::PROVIDERS](../alphaswarm/llm/providers/catalog.py).
- **`ProviderSpec`** — Static config for a provider slug (LiteLLM
  prefix, env-var name, default models).
- **`vllm` provider** — OpenAI-compatible vLLM endpoint behind LiteLLM's
  `openai/` adapter. Leaving `ALPHASWARM_VLLM_BASE_URL` empty disables
  the provider.
- **`nemotron-3-nano:30b`** — Default Director model on Ollama
  (NVIDIA Nemotron Nano v3, 31.6B params). Pull with
  `ollama pull nemotron-3-nano:30b`. Configurable via
  `ALPHASWARM_LLM_DIRECTOR_MODEL`.

## Streaming + live

- **`KafkaDataFeed`** — In-process Kafka consumer that hands bars/quotes
  to the `IDataQueueHandler` interface.
- **`features.indicators.v1`, `market.bar.v1`, …** — Versioned Kafka
  topics. Names are dot-separated segments that end in a `v`-prefixed
  version number, as in the two examples.
- **`StreamingIngester`** — `alphaswarm-stream-ingest` CLI that publishes
  to Kafka topics from Alpaca / IBKR.
- **Heartbeat / kill-switch** — Periodic Redis publish from the
  paper-trading session; its absence triggers the runner to halt.
  `ALPHASWARM_RISK_KILL_SWITCH_KEY` (default `alphaswarm:kill_switch`).
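
The versioned topic names above can be checked mechanically. The regular expression below is inferred only from the two example topics (`features.indicators.v1`, `market.bar.v1`); the allowed character set and the `parse_topic` helper are assumptions, not a documented AlphaSwarm rule.

```python
import re

# Assumed shape: lowercase dot-separated segments, then ".v" and a number.
TOPIC_RE = re.compile(
    r"^(?P<name>[a-z][a-z0-9_]*(?:\.[a-z][a-z0-9_]*)*)\.v(?P<version>\d+)$"
)


def parse_topic(topic):
    """Split a topic into (name, version), or return None if it does not fit."""
    match = TOPIC_RE.match(topic)
    if match is None:
        return None
    return match.group("name"), int(match.group("version"))


print(parse_topic("features.indicators.v1"))
print(parse_topic("market.bar.v1"))
print(parse_topic("market.bar"))  # no version suffix
```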

## Observability

- **OTEL endpoint** — `ALPHASWARM_OTEL_ENDPOINT` (empty by default, which disables tracing).
  When set, every Celery task and HTTP request emits OpenTelemetry
  spans via [alphaswarm/observability/](../alphaswarm/observability/).
- **Progress bus** — Redis pub/sub channels under the
  `alphaswarm:task:` prefix, carrying `{stage, message, timestamp, **extra}`
  payloads. UIs subscribe via the WebSocket relay at
  `/chat/stream/{task_id}`. See
  [alphaswarm/ws/broker.py](../alphaswarm/ws/broker.py) and
  [alphaswarm/tasks/_progress.py](../alphaswarm/tasks/_progress.py).
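
A progress-bus message with the `{stage, message, timestamp, **extra}` shape could be assembled like this. The sketch only builds the channel name and JSON payload; it does not talk to Redis. Appending the task id to the `alphaswarm:task:` prefix, the epoch-seconds timestamp, and the `build_progress_event` name are all assumptions.

```python
import json
import time


def build_progress_event(task_id, stage, message, **extra):
    """Return (channel, json_payload) for one progress update.

    Hypothetical helper. Core fields are written after the extras so that
    an extra named "stage", "message" or "timestamp" cannot overwrite them.
    """
    channel = f"alphaswarm:task:{task_id}"  # assumed: prefix + task id
    payload = {
        **extra,
        "stage": stage,
        "message": message,
        "timestamp": time.time(),  # assumed: seconds since the Unix epoch
    }
    return channel, json.dumps(payload)


channel, raw = build_progress_event(
    "abc123", "materialise", "wrote 10 rows", rows=10
)
event = json.loads(raw)
print(channel, event["stage"], event["rows"])
```

A subscriber on the WebSocket relay would receive the JSON string and can rely on the three core keys always being present.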

## Configuration

- **`settings`** — Cached `Settings` instance from
  [alphaswarm/config.py](../alphaswarm/config.py). Always import as
  `from alphaswarm.config import settings` and never construct
  `Settings()` directly: the shared instance is cached via
  `lru_cache(maxsize=1)`, and a direct call bypasses it.
- **`ALPHASWARM_*` env namespace** — Every settable knob takes the
  `ALPHASWARM_` prefix. Bools accept `true`/`false`/`1`/`0`. Paths are
  resolved by `_coerce_path`.
- **`host-downloads`** — `/host-downloads:ro` bind mount in
  `alphaswarm_platform/compose/docker-compose.yml` exposing the user's local `Downloads/`
  directory for CLI ingest jobs.

## Inspiration rehydration (Phase 2026-04-29)

- **Microprice** — `(P_ask * Q_bid + P_bid * Q_ask) / (Q_bid + Q_ask)`.
  Queue-size-weighted refinement of the mid-price; it leans toward the
  ask when the bid queue is deeper and toward the bid when the ask
  queue is deeper. Implemented in
  [alphaswarm/data/microstructure.py](../alphaswarm/data/microstructure.py).
- **OBI (Order Book Imbalance)** — `(Q_bid - Q_ask) / (Q_bid + Q_ask)`,
  range `[-1, +1]`. Positive = bid-side pressure. Used as a quote skew
  signal in the LOB market-making strategies under
  [alphaswarm/strategies/hft/](../alphaswarm/strategies/hft/).
- **VPIN** — Volume-synchronized probability of informed trading
  (Easley / López de Prado / O'Hara). Re-buckets trade flow into
  equal-volume buckets; rolling mean of `|buy - sell| / |buy + sell|`. See
  [alphaswarm/data/microstructure.py](../alphaswarm/data/microstructure.py).
- **Sample-aware Sharpe** — Annualised Sharpe ratio that uses the
  actual sample frequency of a returns series instead of the assumed
  252 trading days. Required for HFT strategies with sub-daily bars.
  See [alphaswarm/backtest/hft_metrics.py](../alphaswarm/backtest/hft_metrics.py).
- **Walk-forward** — Training scheme where the model is re-fit on a
  rolling (or anchored) window and tested on the immediately following
  slice. Implemented in
  [alphaswarm/ml/walk_forward.py](../alphaswarm/ml/walk_forward.py).
- **Bachelier (Normal) model** — Options pricing model assuming the
  underlying follows arithmetic Brownian motion (`dF = sigma dW`).
  Appropriate for low-priced or near-zero underlyings (rates, basis
  spreads). See [alphaswarm/options/normal_model.py](../alphaswarm/options/normal_model.py).
- **Inverse option** — Option settled in the underlying asset (e.g.
  BTC) rather than quote currency (USD). Common on crypto venues like
  Deribit. See
  [alphaswarm/options/inverse_options.py](../alphaswarm/options/inverse_options.py).
- **Regime classifier** — Lightweight classifier that labels each bar
  as trending vs ranging using ADX threshold (default 25) or as
  bull/bear/neutral via multi-MA slope vote. See
  [alphaswarm/data/regime.py](../alphaswarm/data/regime.py).
- **Factor expression** — Tiny Polars-based DSL covering Alpha101
  primitives (`Ts_Mean`, `Ts_Std`, `Rank`, `Decay_Linear`, `Delta`,
  `Ts_Corr`). See [alphaswarm/data/factor_expression.py](../alphaswarm/data/factor_expression.py).
- **Engle-Granger cointegration** — Two-step test for cointegrated
  pairs: OLS hedge ratio + ADF test on the residual. See
  [alphaswarm/data/cointegration.py](../alphaswarm/data/cointegration.py).
- **Triple-barrier label** — López de Prado labeling: look forward
  `horizon` bars, label `+1` if upper barrier hit first, `-1` if
  lower, `0` if horizon reached. See
  [alphaswarm/data/labels.py](../alphaswarm/data/labels.py).
- **Yang-Zhang volatility** — OHLC vol estimator combining overnight,
  open-to-close, and Rogers-Satchell components. The most efficient of
  the OHLC family. See
  [alphaswarm/data/realised_volatility.py](../alphaswarm/data/realised_volatility.py).
- **LobStrategy** — ABC for limit-order-book strategies; subclasses
  emit `OrderIntent` lists in response to `LobState` updates. Engine
  integration is deferred — see
  [alphaswarm_snippets/extractions/_FUTURE_PROMPTS/lob_adapter_prompt.md](https://github.com/julianwileymac/alphaswarm/blob/main/alphaswarm_snippets/extractions/_FUTURE_PROMPTS/lob_adapter_prompt.md).
- **Dataset preset** — Curated declarative spec for a one-click
  ingestion (e.g. `intraday_momentum_etf`, `crypto_majors_intraday`).
  See [alphaswarm/data/dataset_presets.py](../alphaswarm/data/dataset_presets.py).
- **Inspiration source** — One of seven external repos under
  `alphaswarm_snippets/inspiration/` from which strategies / models / agents were
  rehydrated. Tracked via the `source` kwarg on
  `alphaswarm.core.registry.register` and surfaced as the `source:*` tag.
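
Three of the formulas in this section are small enough to write out directly. The sketch below follows the microprice and OBI expressions as given above and adds a simplified triple-barrier labeller. The function names and the fixed-percentage barriers are assumptions; the modules linked above may use different signatures or volatility-scaled barriers.

```python
def microprice(p_bid, q_bid, p_ask, q_ask):
    """(P_ask * Q_bid + P_bid * Q_ask) / (Q_bid + Q_ask)."""
    return (p_ask * q_bid + p_bid * q_ask) / (q_bid + q_ask)


def order_book_imbalance(q_bid, q_ask):
    """(Q_bid - Q_ask) / (Q_bid + Q_ask), in the range [-1, +1]."""
    return (q_bid - q_ask) / (q_bid + q_ask)


def triple_barrier_label(prices, start, horizon, upper_pct, lower_pct):
    """Label one entry bar: +1 upper hit first, -1 lower hit first, 0 timeout.

    Simplified sketch: barriers are fixed percentages of the entry price.
    """
    entry = prices[start]
    upper = entry * (1 + upper_pct)
    lower = entry * (1 - lower_pct)
    for price in prices[start + 1 : start + 1 + horizon]:
        if price >= upper:
            return 1
        if price <= lower:
            return -1
    return 0  # horizon reached without touching either barrier


# Bid queue three times deeper than the ask queue.
mp = microprice(p_bid=100.0, q_bid=300, p_ask=100.1, q_ask=100)
obi = order_book_imbalance(q_bid=300, q_ask=100)
print(mp, obi)

up = triple_barrier_label([100, 100.5, 101.2, 99.0], 0, 3, 0.01, 0.01)
down = triple_barrier_label([100, 99.5, 98.9, 102.0], 0, 3, 0.01, 0.01)
flat = triple_barrier_label([100, 100.2, 100.1, 100.3], 0, 3, 0.01, 0.01)
print(up, down, flat)
```

With the bid queue deeper, the microprice sits above the 100.05 mid-price and the imbalance is positive, matching the "bid-side pressure" reading of OBI.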

## Testing

- **`tests/data/test_pipelines_smoke.py`** — Reference test for the
  Iceberg ingestion path. New ingest features should add a test in
  this directory.
- **`director_enabled=False`** — Pass when constructing
  `IngestionPipeline` in tests so the real LLM is bypassed in favour
  of the deterministic identity plan.

## Cross-repo

- **`agentic_assistants`** — Sibling repo providing the cross-system
  lineage API (`ALPHASWARM_AGENTIC_ASSISTANTS_API`).
- **`rpi_kubernetes`** — Sibling repo with the k8s deployment
  manifests under [alphaswarm_platform/deploy/k8s/](../alphaswarm_platform/deploy/k8s/).


<!-- https://alpha-swarm.ai/intro -->
# Documentation Index
> Triple-axis table of contents for the AlphaSwarm docs, with separate entry points for humans (architecture.md) and AI agents (AGENTS.md).

# Documentation Index

Triple-axis table of contents for the AlphaSwarm docs: by audience, by lifecycle stage, and by subsystem.

> **Two entry points**:
>
> - Humans → [architecture.md](../concepts/platform/architecture.md)
> - AI agents → [../AGENTS.md](https://github.com/julianwileymac/alphaswarm/blob/main/AGENTS.md)
>
> Both link back here.

## Canonical runtime surfaces (May 2026)

| Surface | Canonical path | Status | Notes |
| --- | --- | --- | --- |
| Local setup + run | [operations/local-setup.md](../how-to/operations/local-setup.md) | active | Default entry point for local development |
| Kubernetes rollout | [operations/kubernetes-deploy.md](../how-to/operations/kubernetes-deploy.md) | active | Production-oriented deployment path |
| Tower 2-node rollout | [operations/tower-cluster-deploy.md](../how-to/operations/tower-cluster-deploy.md) | active | Dedicated tower+laptop target bootstrap path |
| AlphaSwarm blue/green cutover | [operations/alphaswarm-fund-blue-green-cutover.md](../how-to/operations/alphaswarm-fund-blue-green-cutover.md) | active | `alpha-swarm.ai` green-lane validation + switch + rollback |
| Deployment artifacts | [../alphaswarm_platform/deployments/README.md](https://github.com/julianwileymac/alphaswarm/blob/main/alphaswarm_platform/deployments/README.md) | active | Compose + Kubernetes manifests for current architecture |
| Operator UI | [../alphaswarm_client/README.md](https://github.com/julianwileymac/alphaswarm/blob/main/alphaswarm_client/README.md) | active | Vite frontend is the primary UI |
| AlphaSwarm IDE | [alphaswarm-ide.md](../concepts/infrastructure/alphaswarm-ide.md) | active | Theia 1.72 + 6 AlphaSwarm extensions + research copilot + notebook |
| Knowledge Base | [knowledge-base.md](../concepts/data/knowledge-base.md) | active | `alphaswarm_kb` boundary — KBRuntime + KBCorpusSpec + adapter trinity (HierarchicalRAG default, Cognee / Graphiti / Mem0 opt-in) + 4-scope KBLayerComposer + hybrid OpenFGA + OPA policy stack |
| KB federation gateway | [kb-federation.md](../concepts/data/kb-federation.md) | active | `alphaswarm_kb_federation` — cross-silo marketplace recall reverse-proxy |
| AlphaSwarm IDE roadmap | [alphaswarm-ide-roadmap.md](../concepts/infrastructure/alphaswarm-ide-roadmap.md) | active | Phased plan (Phase A shipped; B + C trigger-driven) |
| AlphaSwarm IDE CLI entrypoint | [../alphaswarm_cli/docs/index.md](https://github.com/julianwileymac/alphaswarm/blob/main/alphaswarm_cli/docs/index.md) | active | `alphaswarm-cli ide` is the canonical IDE entrypoint |
| Repository split map | [repository-split.md](../concepts/platform/repository-split.md) | migration | Domain boundaries for future standalone repositories |
| Monorepo path contract | [alphaswarm-monorepo-paths.md](../concepts/platform/alphaswarm-monorepo-paths.md) | active | Canonical paths for cross-repo references |
| Code index governance | [code-index-governance.md](../concepts/platform/code-index-governance.md) | active | Agent search/index workflow across split boundaries |
| Legacy Next.js UI | [webui.md](../concepts/trading/webui.md) | rollback | Keep only for emergency rollback context |
| Legacy Solara UI | [../alphaswarm/ui/](../alphaswarm/ui/) | rollback | Deprecated runtime surface |
| Legacy k8s manifests | [../alphaswarm_platform/deploy/k8s/README.md](https://github.com/julianwileymac/alphaswarm/blob/main/alphaswarm_platform/deploy/k8s/README.md) | legacy | Historical manifests; do not use for new rollouts |
| Archived planning/audit docs | [archive/README.md](../archive/README.md) | archive | Historical context only; not operational guidance |

## Operational snippet catalog

Reusable commands that are valid against the current repository layout:

```bash
# Generate local config from schema
make generate-config ENV=local

# Start the local workload stack
make dev

# Start the isolated admin/control-plane stack
make dev-admin

# Deploy current dev overlay to Kubernetes
make deploy-k8s ENV=dev
```

## By audience

### I'm new and human

1. [../README.md](https://github.com/julianwileymac/alphaswarm/blob/main/README.md) — what AlphaSwarm is, screenshots, release notes.
2. [architecture.md](../concepts/platform/architecture.md) — system map + request lifecycle.
3. [../CONTRIBUTING.md](https://github.com/julianwileymac/alphaswarm/blob/main/CONTRIBUTING.md) — set up the dev environment.
4. [glossary.md](../intro/glossary.md) — terms used everywhere.
5. Pick a subsystem from the table below.

### I'm an AI agent

1. [../AGENTS.md](https://github.com/julianwileymac/alphaswarm/blob/main/AGENTS.md) — terse rule-set + project map.
2. [../WORKFLOW.md](https://github.com/julianwileymac/alphaswarm/blob/main/WORKFLOW.md) — Plan / Act / Reflect cadence,
   FAST vs SLOW modes, intervention nodes.
3. [agentic-development.md](../concepts/agentic/agentic-development.md) — spec-pattern
   as the AlphaSwarm skill-artifact + ADLC security manifesto.
4. [../.cursor/rules/](../.cursor/rules) — glob-scoped rule files.
5. [glossary.md](../intro/glossary.md) — definitions.
6. [erd.md](../concepts/platform/erd.md) + [class-diagram.md](../concepts/platform/class-diagram.md) — structural maps.
7. [flows.md](../concepts/platform/flows.md) — end-to-end sequences.
8. [repository-split.md](../concepts/platform/repository-split.md) + [code-index-governance.md](../concepts/platform/code-index-governance.md) — current repo boundary map.
9. The relevant subsystem doc (table below).
10. (Cross-session work) [../.agents/state-template.md](https://github.com/julianwileymac/alphaswarm/blob/main/.agents/state-template.md).

## By lifecycle stage

```mermaid
flowchart LR
    Research --> Backtest --> Paper --> Live
    Backtest --> Agentic
    Agentic --> Backtest
    Live -.feedback.-> Research
```

| Stage | Docs |
| --- | --- |
| **Research** | [strategy-development.md](../concepts/strategy/strategy-development.md), [research-papers-rag.md](../concepts/data/research-papers-rag.md), [analysis-framework.md](../concepts/strategy/analysis-framework.md), [analysis-lab.md](../concepts/strategy/analysis-lab.md), [analysis-flows.md](../concepts/strategy/analysis-flows.md), [factor-research.md](../concepts/strategy/factor-research.md), [ml-framework.md](../concepts/strategy/ml-framework.md), [ml-libraries.md](../concepts/strategy/ml-libraries.md), [ml-alpha-backtest.md](../concepts/strategy/ml-alpha-backtest.md), [ml-flows.md](../concepts/strategy/ml-flows.md), [ml-preprocessing-pipeline.md](../concepts/strategy/ml-preprocessing-pipeline.md), [ml-builder.md](../concepts/strategy/ml-builder.md), [ml-testing.md](../concepts/strategy/ml-testing.md), [rl-framework.md](../concepts/rl/rl-framework.md), [rl-lab.md](../concepts/rl/rl-lab.md), [rl-components.md](../concepts/rl/rl-components.md), [rl-iceberg.md](../concepts/rl/rl-iceberg.md), [strategy-browser.md](../concepts/strategy/strategy-browser.md), [data-plane.md](../concepts/data/data-plane.md), [data-catalog.md](../concepts/data/data-catalog.md), [data-pipelines-hub.md](../concepts/data/data-pipelines-hub.md), [visualization-layer.md](../concepts/data/visualization-layer.md) |
| **Backtest** | [backtest-engines.md](../concepts/strategy/backtest-engines.md), [hft-backtest.md](../concepts/strategy/hft-backtest.md), [strategy-lifecycle.md](../concepts/strategy/strategy-lifecycle.md) |
| **Optimal control** | [optimal-control.md](../concepts/strategy/optimal-control.md), [portfolio-options-mm.md](../concepts/strategy/portfolio-options-mm.md), [microstructure-toxicity.md](../concepts/strategy/microstructure-toxicity.md) |
| **Agentic** | [agentic-pipeline.md](../concepts/agentic/agentic-pipeline.md), [providers.md](../concepts/data/providers.md) |
| **Bots** | [bots.md](../concepts/agentic/bots.md) (smallest deployable unit; aggregates universe + strategy + engine + ML + agents + RAG + metrics) |
| **Paper / Live** | [paper-trading.md](../concepts/trading/paper-trading.md), [live-market.md](../concepts/data/live-market.md), [streaming.md](../concepts/data/streaming.md), [streaming-admin.md](../concepts/data/streaming-admin.md) |
| **Cross-cutting** | [observability.md](../concepts/trading/observability.md), [../alphaswarm_client/README.md](https://github.com/julianwileymac/alphaswarm/blob/main/alphaswarm_client/README.md), [webui.md](../concepts/trading/webui.md) _(legacy)_, [core-types.md](../concepts/platform/core-types.md), [domain-model.md](../concepts/platform/domain-model.md), [alpha-vantage.md](../concepts/data/alpha-vantage.md), [credentials.md](../concepts/identity/credentials.md), [cloud-credentials.md](../concepts/identity/cloud-credentials.md), [identity.md](../concepts/identity/identity.md), [scim-provisioning.md](../concepts/identity/scim-provisioning.md), [msal-entra-setup.md](../concepts/identity/msal-entra-setup.md), [multi-tenancy.md](../concepts/identity/multi-tenancy.md), [kubernetes-adapter.md](../concepts/infrastructure/kubernetes-adapter.md), [kubernetes-rpi-deployment.md](../concepts/infrastructure/kubernetes-rpi-deployment.md), [local-platform.md](../concepts/platform/local-platform.md), [terraform-control-plane.md](../concepts/infrastructure/terraform-control-plane.md), [iac-runbook.md](../concepts/infrastructure/iac-runbook.md) |

## By subsystem

### Architecture + reference

| Doc | Purpose |
| --- | --- |
| [architecture.md](../concepts/platform/architecture.md) | System component diagram + request lifecycle |
| [erd.md](../concepts/platform/erd.md) | Per-domain entity-relationship diagrams |
| [class-diagram.md](../concepts/platform/class-diagram.md) | Class hierarchies (Symbol, LLMProvider, Strategy, Engines, Pipeline) |
| [data-dictionary.md](../reference/data-dictionary/index.md) | Table-by-table column reference |
| [flows.md](../concepts/platform/flows.md) | Sequence diagrams for ingestion / backtest / agents / paper |
| [glossary.md](../intro/glossary.md) | Project-specific terminology |
| [domain-model.md](../concepts/platform/domain-model.md) | Narrative on the domain types |
| [core-types.md](../concepts/platform/core-types.md) | `Symbol`, enums, dataclasses |
| [repository-split.md](../concepts/platform/repository-split.md) | Future repository/domain boundary map |
| [code-index-governance.md](../concepts/platform/code-index-governance.md) | Agent search and code-index rules |

### Data plane

| Doc | Purpose |
| --- | --- |
| [data-plane.md](../concepts/data/data-plane.md) | Provider → cache → DuckDB view pipeline |
| [data-catalog.md](../concepts/data/data-catalog.md) | Iceberg catalog + ingest pipeline |
| [data-self-service.md](../concepts/data/data-self-service.md) | Master narrative for the four-phase self-service data fabric expansion |
| [datasets-catalog.md](../concepts/data/datasets-catalog.md) | Kedro-style `BaseDataset` abstraction (data fabric phase 0) |
| [metadata-cache.md](../concepts/data/metadata-cache.md) | Redis prefetch cache backing every entity dropdown (data fabric phase 0) |
| [data-discovery.md](../concepts/data/data-discovery.md) | Active discovery browser unifying ingested + uningested catalog entries (data fabric phase 1) |
| [airbyte-builder.md](../concepts/data/airbyte-builder.md) | Schema-driven Airbyte connector builder + AlphaSwarm Fetcher codegen (data fabric phase 2) |
| [dagster-sandbox.md](../concepts/data/dagster-sandbox.md) | Ephemeral interactive Dagster + Airbyte sandbox console (data fabric phase 3) |
| [visualization-layer.md](../concepts/data/visualization-layer.md) | Trino-backed Superset and Bokeh exploration layer |
| [pgvector-control-plane.md](../concepts/data/pgvector-control-plane.md) | pgvector control plane — `data.vector.*` MCP tools + PgVector dataset kind + alembic 0045 (Phase 3 refactor) |
| [codebase-mcp.md](../concepts/data/codebase-mcp.md) | Codebase MCP server — agent view of the AlphaSwarm source tree via `codebase.*` tools (Phase 2 refactor) |
| [sera.md](../concepts/data/sera.md) | SERA (Ai2 Open Coding Agents) as an opt-in LLM provider for the codebase MCP elaborator (Phase 2.5 refactor) |
| [analytics-frontend.md](../concepts/data/analytics-frontend.md) | Interactive analytics in the Vite frontend — QuantStats tearsheets / rolling / underwater / drawdown / ML overlays (Phase 4 refactor) |
| [agent-watchdog.md](../concepts/data/agent-watchdog.md) | Agent stall watchdog Celery beat task + `GET /agents/health` + `data.agents.health` MCP tool (Phase 5 refactor) |
| [alpha-vantage.md](../concepts/data/alpha-vantage.md) | AV provider quota + cache |
| [streaming.md](../concepts/data/streaming.md) | Kafka topic taxonomy + ingester layout |
| [live-market.md](../concepts/data/live-market.md) | Live subscription + WebSocket relay |

### Strategy + ML

| Doc | Purpose |
| --- | --- |
| [analysis-framework.md](../concepts/strategy/analysis-framework.md) | Hash-locked AnalysisSpec + AnalysisRuntime umbrella |
| [analysis-lab.md](../concepts/strategy/analysis-lab.md) | Hybrid `/analysis/lab` UI (dataset-tabs + XYFlow Composer) |
| [analysis-flows.md](../concepts/strategy/analysis-flows.md) | Per-flow reference for the analysis catalog |
| [factor-research.md](../concepts/strategy/factor-research.md) | Building factor / alpha strategies |
| [ml-framework.md](../concepts/strategy/ml-framework.md) | Train → register → deploy → score |
| [ml-libraries.md](../concepts/strategy/ml-libraries.md) | Per-library reference (TF/Keras/Prophet/sklearn/PyOD/sktime/HF) |
| [ml-alpha-backtest.md](../concepts/strategy/ml-alpha-backtest.md) | `AlphaBacktestExperiment` orchestrator + `MLAlphaBacktestRun` schema |
| [ml-flows.md](../concepts/strategy/ml-flows.md) | Lightweight workbench flows catalog |
| [ml-preprocessing-pipeline.md](../concepts/strategy/ml-preprocessing-pipeline.md) | ML preprocessors as data-engine pipeline nodes |
| [ml-builder.md](../concepts/strategy/ml-builder.md) | Graphical experiment builder UX |
| [ml-testing.md](../concepts/strategy/ml-testing.md) | Interactive ML testing workbench |
| [mlops-service.md](../concepts/strategy/mlops-service.md) | Initial MLOps service — agent-facing interfaces, lifecycle handlers, MLSkill spec/runtime, OOD rules, dedicated `alphaswarm-ml-mcp` server |
| [backtest-engines.md](../concepts/strategy/backtest-engines.md) | Engine catalogue + invariants (vbt-pro primary, event-driven, ZVT, AAT, fallback) |
| [vbtpro-integration.md](../concepts/strategy/vbtpro-integration.md) | Deep vectorbt-pro integration: modes, hooks, agent + ML components, walk-forward |
| [hft-backtest.md](../concepts/strategy/hft-backtest.md) | hftbacktest-driven LOB engine, `LobStrategy` API, latency / queue models |
| [optimal-control.md](../concepts/strategy/optimal-control.md) | JAX-compiled HJB solvers — Avellaneda-Stoikov + Cartea-Jaimungal-Penalva |
| [portfolio-options-mm.md](../concepts/strategy/portfolio-options-mm.md) | Lucic-Tse 2024-2026 portfolio-level options market making |
| [microstructure-toxicity.md](../concepts/strategy/microstructure-toxicity.md) | Toxicity regime detection + agent-driven YAML mutation loop |
| [strategy-lifecycle.md](../concepts/strategy/strategy-lifecycle.md) | draft → backtested → paper → live |
| [strategy-browser.md](../concepts/strategy/strategy-browser.md) | Data-browser → strategy spec UX |

### Agentic

| Doc | Purpose |
| --- | --- |
| [agentic-development.md](../concepts/agentic/agentic-development.md) | AlphaSwarm's spec-pattern as the agentic-coder skill-artifact equivalent + consolidated ADLC security manifesto |
| [multi-agent-patterns.md](../concepts/agentic/multi-agent-patterns.md) | Sequential / Parallel / Debate / Coordinator / ReAct topologies mapped to [alphaswarm/agents/graph/](../alphaswarm/agents/graph/) + the seven orchestration adapter topologies |
| [workflow-studio.md](../concepts/agentic/workflow-studio.md) | Additive orchestration control plane — `WorkflowSpec` + `WorkflowRuntime` + seven adapters + replayable runs |
| [orchestration-refactor-rollout.md](../concepts/agentic/orchestration-refactor-rollout.md) | Operator rollout / rollback runbook for every `ALPHASWARM_ORCHESTRATION_*` flag |
| [agentic-pipeline.md](../concepts/agentic/agentic-pipeline.md) | Crew control plane |
| [providers.md](../concepts/data/providers.md) | LLM provider registry + tier routing |

### Trading + operations

| Doc | Purpose |
| --- | --- |
| [paper-trading.md](../concepts/trading/paper-trading.md) | Session loop + risk model |
| [paper-metadata-gate.md](../concepts/trading/paper-metadata-gate.md) | Strict startup metadata validation + operator runbook |
| [bots.md](../concepts/agentic/bots.md) | Bot entity (TradingBot / ResearchBot), graphical builder, deployment |
| [observability.md](../concepts/trading/observability.md) | OTEL → Jaeger + structured logs |
| [../alphaswarm_client/README.md](https://github.com/julianwileymac/alphaswarm/blob/main/alphaswarm_client/README.md) | Active Vite frontend route/model overview |
| [webui.md](../concepts/trading/webui.md) | Legacy Next.js page tree (rollback only) |

## Latest changes

| Doc | What changed |
| --- | --- |
| [data-catalog.md](../concepts/data/data-catalog.md) | Persistent host warehouse + Director |
| [glossary.md](../intro/glossary.md) | New (covers Director, Iceberg conventions, tiers) |
| [architecture.md](../concepts/platform/architecture.md) | New (replaces README ASCII art) |
| [erd.md](../concepts/platform/erd.md) | New (per-domain ERDs across 110+ tables) |
| [class-diagram.md](../concepts/platform/class-diagram.md) | New (5 hierarchies) |
| [data-dictionary.md](../reference/data-dictionary/index.md) | New (15 sections) |
| [flows.md](../concepts/platform/flows.md) | New (5 flows) |

## Doc conventions

- **Mermaid** is the diagram format. GitHub renders it natively.
  Don't commit PNG/SVG diagrams unless they're irreplaceable.
- **Cross-link** with relative markdown paths (for example, `bar.md`) so
  the navigation works on GitHub and locally.
- **Cite code** with full repo paths from the doc:
  `[alphaswarm/data/pipelines/director.py](../alphaswarm/data/pipelines/director.py)`.
  Don't link to specific line numbers (they bit-rot fast).
- **Keep it short** — narrative goes in subsystem docs, definitions
  in [glossary.md](../intro/glossary.md), structure in
  [erd.md](../concepts/platform/erd.md) / [class-diagram.md](../concepts/platform/class-diagram.md). Don't
  repeat yourself.


<!-- https://alpha-swarm.ai/intro/installation -->
# Installation
> Install-time requirements for AlphaSwarm and its optional extras (`optimal-control`, `hft`).

# Installation

This page documents the install-time requirements for AlphaSwarm and its
optional extras. The base install is pure Python and runs on
Linux / macOS / Windows. Two extras need attention at install time,
and only `hft` requires a native build:

| Extra | Native build | Notes |
| --- | --- | --- |
| `optimal-control` | None (pure Python) | Ships the JAX HJB / Lucic-Tse stack. CPU-only by default. |
| `hft` | Rust + Maturin | Ships the [hftbacktest](https://github.com/nkaz001/hftbacktest) LOB engine. |

## Base install

```bash
pip install -e .
```

That gives the FastAPI app, Celery worker, default config, agents,
analysis flows, and the in-memory backtest fallbacks. No GPU, no native
toolchains.

## `[optimal-control]` — JAX + HJB + Lucic-Tse

```bash
pip install -e ".[optimal-control]"
```

This pulls:

- `jax>=0.4.30` and `jaxlib>=0.4.30` (CPU build, manylinux / win / macOS
  wheels available on PyPI).
- `finhjb>=0.1.6` — JAX HJB solver framework.
- `fast-vollib>=0.1.4` — vectorised IV + Greeks (auto-detects the JAX
  backend; falls back to NumPy when JAX is missing).
- `mbt-gym` — pulled directly from
  [JJJerome/mbt_gym](https://github.com/JJJerome/mbt_gym) `main` because
  the package is not on PyPI.

### GPU / Metal acceleration (opt-in)

After installing the extra, swap the CPU `jaxlib` wheel for the CUDA or
Metal variant. JAX's docs are the canonical source; the short form:

```bash
# NVIDIA CUDA 12 (Linux only)
pip install -U "jax[cuda12]"

# Apple Silicon (macOS)
pip install -U "jax-metal"
```

The `alphaswarm.optimal_control` and `alphaswarm.options.greeks_jax` modules pick up
the accelerated backend automatically — no AlphaSwarm code changes needed.
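
To confirm which backend JAX selected after swapping wheels, a quick check along these lines can help. This is a minimal sketch and not part of AlphaSwarm: the helper name `describe_jax_backend` is hypothetical, and it relies only on `jax.default_backend()` and `jax.devices()`.

```python
def describe_jax_backend() -> str:
    """Return a one-line summary of the active JAX backend.

    Hypothetical helper for illustration; not an AlphaSwarm API.
    """
    try:
        import jax  # only present when the optimal-control extra is installed
    except ImportError:
        return "jax not installed"
    # default_backend() returns the platform name JAX picked at import time.
    backend = jax.default_backend()
    n_devices = len(jax.devices())
    return f"backend={backend} devices={n_devices}"


if __name__ == "__main__":
    print(describe_jax_backend())
```

If the summary still reports the CPU backend after installing an accelerated wheel, the accelerated plugin was most likely not picked up.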

## `[hft]` — hftbacktest LOB engine

```bash
pip install -e ".[hft]"
```

This pulls `hftbacktest>=2.0.0`, `numba>=0.61`, and `polars>=1.0`.
Most `hftbacktest` releases ship as source distributions and need a Rust
toolchain plus Maturin at install time:

```bash
# 1. Install Rust + Cargo (https://rustup.rs)
curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh

# 2. Restart your shell so `cargo` is on PATH, then verify
cargo --version

# 3. Install Maturin (build backend hftbacktest uses)
pip install maturin

# 4. Now install the AlphaSwarm extra
pip install -e ".[hft]"
```

On Windows the equivalent is the rustup-init.exe installer plus the
"Microsoft Visual C++ Build Tools" (MSVC linker is required by the
hftbacktest crate). On macOS Apple Silicon, the standard rustup install
works; no extra steps.
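
Before running the `[hft]` install, it can be useful to confirm that the build tools are on `PATH`. The following is a small sketch under the assumption that `cargo` and `maturin` are the two executables that matter; the function name `missing_hft_toolchain` is hypothetical.

```python
import shutil


def missing_hft_toolchain(required=("cargo", "maturin")) -> list[str]:
    """Return the executables from `required` that are not found on PATH."""
    return [tool for tool in required if shutil.which(tool) is None]


if __name__ == "__main__":
    missing = missing_hft_toolchain()
    if missing:
        print("Install these before the [hft] extra:", ", ".join(missing))
    else:
        print("cargo and maturin found on PATH")
```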

### Verifying the install

```bash
python -c "from alphaswarm.backtest.hft import LobBacktestEngine; print('OK')"
```

If that succeeds, the engine is ready to drive any `LobStrategy`
subclass under `alphaswarm/strategies/hft/`.

## `[full]`

`pip install -e ".[full]"` chains every optional extra including
`optimal-control` and `hft`, so it requires the Rust toolchain. Most
contributors install `[full]` minus `[hft]`:

```bash
pip install -e ".[auth,alpaca,ibkr,otel,paper,vectorbt,ml,ml-torch,ml-forecast,portfolio,fred,sec,iceberg,agents-rag,optimal-control]"
```

## See also

- [alphaswarm_docs/optimal-control.md](../concepts/strategy/optimal-control.md) — HJB primer + AvSt + CJ.
- [alphaswarm_docs/portfolio-options-mm.md](../concepts/strategy/portfolio-options-mm.md) — Lucic-Tse.
- [alphaswarm_docs/hft-backtest.md](../concepts/strategy/hft-backtest.md) — `LobBacktestEngine` walk-through.
- [alphaswarm_docs/local-platform.md](../concepts/platform/local-platform.md) — single-machine deployment.


<!-- https://alpha-swarm.ai/intro/quickstart -->
# Quickstart
> Stand up an AlphaSwarm dev stack and run your first backtest in under 30 seconds of typing.

# Quickstart

Target: a fresh checkout of `alphaswarm` to a green backtest
result in under 30 seconds of typing (plus first-time Docker image
pull, which is unavoidable).

## Prerequisites

- Docker Desktop or compatible engine running locally.
- Python 3.11 and `make` on your PATH.
- The repo cloned to disk.

## One-paste quickstart

```bash
# 1. Boot the canonical compose stack.
make dev

# 2. Wait for /readyz to return 200.
curl http://localhost:8000/readyz

# 3. Run the bundled example backtest.
docker exec alphaswarm-api python -m alphaswarm.cli.cli backtest \
    --config configs/strategies/momentum_demo.yaml \
    --start 2024-01-01 --end 2024-06-30
```

If the third command returns a JSON summary with non-zero `sharpe` and
`total_return`, your dev stack is healthy.

## What just happened

- `make dev` boots the canonical compose profile defined in
  [alphaswarm_platform/deployments/compose/docker-compose.dev.yml](https://github.com/julianwileymac/alphaswarm/blob/main/alphaswarm_platform/deployments/compose/docker-compose.dev.yml).
  This brings up Postgres + Redis + the Iceberg REST catalog +
  `alphaswarm-api` (FastAPI) + `alphaswarm-worker` (Celery) + `alphaswarm-beat`.
- `curl http://localhost:8000/readyz` confirms the FastAPI gateway is
  serving requests against a migrated Postgres schema. Migrations run
  automatically on first boot via the `alphaswarm-api` container's
  `entrypoint.sh`.
- The backtest command dispatches a Celery task that pulls the
  example momentum strategy, runs it against the seeded data, and
  writes a `backtest_runs` ledger row.
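
A single `curl` only samples `/readyz` once, so a script that needs to block until the stack is up has to poll. The sketch below shows one way to do that with the standard library. The function names, the 2-second interval, and the injected `probe` / `sleep` parameters are illustrative assumptions; the URL and the 60-second budget come from this page.

```python
import time
import urllib.error
import urllib.request


def http_status(url: str, timeout: float = 2.0) -> int:
    """Return the HTTP status for `url`, or 0 if the server is unreachable."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status
    except urllib.error.HTTPError as exc:
        return exc.code  # server answered, but not with a 2xx status
    except (urllib.error.URLError, OSError):
        return 0  # connection refused, DNS failure, timeout, ...


def wait_until_ready(url="http://localhost:8000/readyz", deadline_s=60.0,
                     interval_s=2.0, probe=http_status, sleep=time.sleep) -> bool:
    """Poll `url` until it returns 200 or `deadline_s` seconds have been waited."""
    waited = 0.0
    while True:
        if probe(url) == 200:
            return True
        if waited >= deadline_s:
            return False
        sleep(interval_s)
        waited += interval_s
```

Passing `probe` and `sleep` as parameters keeps the loop testable without a running stack.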

## Next steps

1. Want to see the run in the UI? Open
   [http://localhost:3001](http://localhost:3001) — that is the
   Vite operator UI ([alphaswarm_client](https://github.com/julianwileymac/alphaswarm/tree/main/alphaswarm_client)).
2. Want to add your own strategy? Read
   [Recipe: Add a strategy](../how-to/recipes/add-a-strategy.md).
3. Want to set up paper trading? Read
   [Concept: paper trading](../concepts/trading/paper-trading.md)
   followed by
   [Tutorial: first paper trading session](../tutorials/first-paper-trading-session.md).
4. Want to deploy this to Kubernetes? Read
   [How-to: Kubernetes deploy](../how-to/operations/kubernetes-deploy.md).

## If it does not work

The `/readyz` probe is the single canonical health check. If it still
returns non-200 after 60 seconds:

- Check `docker compose logs alphaswarm-api` for migration errors.
- Confirm Postgres is reachable: `docker exec alphaswarm-postgres pg_isready`.
- Confirm Redis is reachable: `docker exec alphaswarm-redis redis-cli ping`.
- Verify the Iceberg REST catalog is up:
  `curl http://localhost:8181/v1/config`.

If the backtest command itself errors out, the most common cause is a
stale Iceberg manifest from a prior dev cycle. Tear down with
`make down && docker volume prune -f` and re-run.

For deeper debugging, see [How-to: incident response](../how-to/operations/incident-response.md).


<!-- https://alpha-swarm.ai/intro/repository-orientation -->
# Repository orientation
> Top-level map of every alphaswarm_* package and where each subsystem lives in the monorepo.

# Repository orientation

AlphaSwarm is a monorepo organised by responsibility. The boundary between
packages is enforced by the always-on Cursor rule
[repository-boundaries.mdc](https://github.com/julianwileymac/alphaswarm/blob/main/.cursor/rules/repository-boundaries.mdc)
and by `import` guards in CI.

## Top-level packages

- **`alphaswarm/`** — the quant runtime. FastAPI gateway, Celery workers,
  strategy framework, backtest engines, agent control plane, RAG,
  Iceberg writers, persistence models.
- **`alphaswarm_controller/`** — workload lifecycle / `/manage/*` API /
  Terraform driver / provider adapters. Never imports `alphaswarm.*`. See
  [Concept: control plane topology](../concepts/infrastructure/control-plane-topology.md).
- **`alphaswarm_core/`** — shared value types, ABCs, auth filters,
  topology contracts. Dependency-light.
- **`alphaswarm_client/`** — active Vite + React 19 + Tailwind 4 operator UI.
  Served at `alpha-swarm.ai`.
- **`alphaswarm_ui/`** — cloud-hosted, customer-facing PaaS frontend
  (Next.js 14+). Served at `alpha-swarm.ai`. Dual Auth0 (B2C) + Entra (B2B)
  identity.
- **`alphaswarm_admin/`** — internal admin (managed services + company
  accounts). Audit-first. Served at `manage.alpha-swarm.ai`.
- **`alphaswarm_rl/`** — RL subsystem: hash-locked `RLExperimentSpec` +
  `RLRuntime` + Iceberg trajectory store. Legacy `alphaswarm.rl.*` is a
  deprecation shim.
- **`alphaswarm_models/`** — custom model pulling, building, training,
  evaluating, serving (vLLM + Ollama). Legacy `alphaswarm.ml.*` is a
  deprecation shim.
- **`alphaswarm_bots/`** — bot templates and bot runtime
  (`TradingBot` / `ResearchBot`).
- **`alphaswarm_ide/`** — Theia 1.72-based IDE + AlphaSwarm extensions
  (`alphaswarm`, `alphaswarm-shell`, `alphaswarm-mcp-bridge`, `alphaswarm-research-copilot`,
  `alphaswarm-notebook-quant`, `alphaswarm-quant`).
- **`alphaswarm_cli/`** — standalone operator CLI (`alphaswarm-cli`). HTTP-only;
  never imports `alphaswarm.*`. RFC 8628 device auth + OS keyring storage.
- **`alphaswarm_platform/`** — hosted deployment + build + IaC + cluster
  setup. Manifests, Helm charts, Terraform modules, Docker base
  images. No Python runtime imports.
- **`alphaswarm_index/`** — single source of truth for project orientation
  (this site links into it but never modifies it; sole-writer is the
  `alphaswarm-index-curator` subagent).
- **`alphaswarm_docs/`** — this site.
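
Several packages above are described as never importing `alphaswarm.*`. The CI guard itself is not shown here, so the following is only a sketch of how such a check could work, using the standard `ast` module. The function name `forbidden_imports` is hypothetical.

```python
import ast


def forbidden_imports(source: str, forbidden_root: str = "alphaswarm") -> list[str]:
    """Return imported module names in `source` that are `forbidden_root` or below it."""
    hits = []
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Import):
            names = [alias.name for alias in node.names]
        elif isinstance(node, ast.ImportFrom) and node.level == 0 and node.module:
            names = [node.module]  # absolute "from x import y" only
        else:
            continue
        for name in names:
            # Match "alphaswarm" and "alphaswarm.x", but not "alphaswarm_core".
            if name == forbidden_root or name.startswith(forbidden_root + "."):
                hits.append(name)
    return hits
```

Matching on the exact root plus a trailing dot matters here, because sibling packages such as `alphaswarm_core` share the same prefix and must stay importable.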

## Where to look for X

- API route: [`alphaswarm/api/routes/`](https://github.com/julianwileymac/alphaswarm/tree/main/alphaswarm/api/routes).
- Celery task: [`alphaswarm/tasks/`](https://github.com/julianwileymac/alphaswarm/tree/main/alphaswarm/tasks).
- Strategy: [`alphaswarm/strategies/`](https://github.com/julianwileymac/alphaswarm/tree/main/alphaswarm/strategies).
- Persistence model: [`alphaswarm/persistence/`](https://github.com/julianwileymac/alphaswarm/tree/main/alphaswarm/persistence).
- Migration: [`alembic/versions/`](https://github.com/julianwileymac/alphaswarm/tree/main/alembic/versions).
- Iceberg writer: [`alphaswarm/data/iceberg_catalog.py`](https://github.com/julianwileymac/alphaswarm/blob/main/alphaswarm/data/iceberg_catalog.py).
- LLM gateway: [`alphaswarm/llm/providers/router.py`](https://github.com/julianwileymac/alphaswarm/blob/main/alphaswarm/llm/providers/router.py).
- Configuration: [`alphaswarm/config/settings.py`](https://github.com/julianwileymac/alphaswarm/blob/main/alphaswarm/config/settings.py).

## Hard rules

The full agent-readable rule-set is in
[AGENTS.md](https://github.com/julianwileymac/alphaswarm/blob/main/AGENTS.md).
The cardinal subset:

1. **Symbols**: `Symbol.parse(vt_symbol)` — never split on `.`.
2. **LLM calls**: `router_complete` only — never `litellm.completion`
   or vendor SDKs.
3. **Iceberg writes**: `iceberg_catalog.append_arrow` only — never
   raw PyIceberg.
4. **Celery progress**: `emit / emit_done / emit_error` from
   `alphaswarm/tasks/_progress.py` — never publish to Redis from task code.
5. **Configuration**: `from alphaswarm.config import settings` — never
   construct a fresh `Settings()`.
6. **Registry**: `@register("Name", kind=...)` for every new
   strategy / model / engine / alpha / portfolio / sink.
7. **Migrations**: immutable once committed.
8. **Cross-task state**: Postgres only; never pickle ORM objects.

The full set is 55 hard rules + a Don'ts section in AGENTS.md.
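
To make rule 6 concrete, here is a minimal sketch of the decorator-registry pattern. It is not AlphaSwarm's implementation: only the call shape `@register("Name", kind=...)` comes from the rule, while `_REGISTRY`, `resolve`, and the `MomentumDemo` class are hypothetical names used for illustration.

```python
from collections import defaultdict

# kind -> name -> class; a stand-in for whatever registry the runtime keeps.
_REGISTRY: dict[str, dict[str, type]] = defaultdict(dict)


def register(name: str, *, kind: str):
    """Class decorator that records the decorated class under (kind, name)."""
    def decorator(cls: type) -> type:
        if name in _REGISTRY[kind]:
            raise ValueError(f"{kind} {name!r} is already registered")
        _REGISTRY[kind][name] = cls
        return cls
    return decorator


def resolve(name: str, *, kind: str) -> type:
    """Look up a registered class; raises KeyError if it was never registered."""
    return _REGISTRY[kind][name]


@register("MomentumDemo", kind="strategy")  # hypothetical strategy name
class MomentumDemo:
    pass
```

The benefit of the pattern is that configuration can refer to components by name and kind, and a duplicate name fails loudly at import time.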

## Conventions

See [Conventions](./conventions.md) for documentation style and
authoring rules.


<!-- https://alpha-swarm.ai/operations/break-glass -->
# operations/break-glass

# Break-glass runbook

Procedure for assuming the `AqpAdminBreakGlassRole` during an
incident.

The role is **only** to be used when:

1. Normal operator pathways (KillSwitch, scoped admin roles) have
   failed.
2. A documented incident ticket exists.
3. **Two named operators** are available (4-eyes principle).

## Mechanics

- The role itself carries no permissions until an
  `AdministratorAccess`-attaching Lambda runs.
- The attach is triggered by the second operator's approval
  through `alphaswarm_admin/services/break_glass.py`.
- The session has a **hard 60-minute auto-expiry** enforced by
  EventBridge calling the detach Lambda.
- Every API call while the role is active is reported to Security
  Hub as a HIGH-severity finding.

## Steps

### Operator A — file the request

1. Open `/admin/accounts` in the admin UI.
2. Click **"Break-glass request"** (visible only to users with
   the `alphaswarm-superadmin` role).
3. Fill in:
   - **Reason** (free-text, mandatory).
   - **Incident id** (Linear / Sentry / PagerDuty link).
   - **Duration** (max 60 minutes).
4. Submit. The request lands in the audit ledger as
   `admin.break_glass.request`.

### Operator B — approve

1. Watch for the Slack notification from the
   `#alphaswarm-security-incidents` channel.
2. Open the request URL the notification links to.
3. Verify Operator A's reason + incident id.
4. Click **"Approve"**. Step-up MFA is required.
5. The Lambda fires and attaches `AdministratorAccess` to the
   target role. Audit row:
   `admin.break_glass.approve` -> `admin.break_glass.attach`.

### Operator A — perform the action

1. `aws sts assume-role --role-arn  \
        --role-session-name "incident-"`.
2. Carry out the minimum action required.
3. The session SHOULD be terminated early via the admin UI's
   **"Detach"** button as soon as the action completes.

### Auto-expiry

If 60 minutes elapse, EventBridge invokes the detach Lambda
automatically. Audit row: `admin.break_glass.expire`.
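
The invariants in the steps above (mandatory reason and incident id, a 60-minute cap, two distinct operators, and the audit rows written along the way) can be sketched as a small state object. This is an illustration only: the real logic lives in `alphaswarm_admin/services/break_glass.py`, which is not shown here, and the class and method names below are hypothetical.

```python
from dataclasses import dataclass, field
from datetime import datetime, timedelta
from typing import Optional

MAX_DURATION = timedelta(minutes=60)  # hard cap stated in this runbook


@dataclass
class BreakGlassRequest:
    requester: str
    reason: str
    incident_id: str
    duration: timedelta
    audit: list = field(default_factory=list)
    expires_at: Optional[datetime] = None

    def __post_init__(self) -> None:
        if not self.reason.strip() or not self.incident_id.strip():
            raise ValueError("reason and incident id are mandatory")
        if not timedelta(0) < self.duration <= MAX_DURATION:
            raise ValueError("duration must be positive and at most 60 minutes")
        self.audit.append("admin.break_glass.request")

    def approve(self, approver: str, now: datetime) -> None:
        """Second operator approves; the role stays attached until `expires_at`."""
        if approver == self.requester:
            raise PermissionError("4-eyes: approver must differ from requester")
        self.audit += ["admin.break_glass.approve", "admin.break_glass.attach"]
        self.expires_at = now + self.duration

    def is_active(self, now: datetime) -> bool:
        """True while the approved session has not yet reached its expiry."""
        return self.expires_at is not None and now < self.expires_at
```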

## Post-incident

- Both operators sign the post-incident review.
- Security officer reviews the Security Hub findings + audit
  trail within 24h.
- Anything done while the role was active is reproduced in a
  small, scoped follow-up PR if it should be permanent.


<!-- https://alpha-swarm.ai/operations/dr-replay -->
# operations/dr-replay

# DR replay runbook

Disaster-recovery rehearsal procedure for AlphaSwarm. Targets:

- **RPO 1 hour** for `alphaswarm_admin` + control-plane services.
- **RTO 4 hours** for the same.
- **RPO 15 minutes** for trading-relevant data.
- **RTO 1 hour** for the same.

The exercise is run quarterly (calendar reminder owned by the
platform team). The first exercise is scheduled for the end of
Phase 5 of the multi-account overhaul.

## Pre-requisites

- AWS Organizations + Control Tower applied (Phase 4 complete).
- ArgoCD app-of-apps applied to dev + staging + prod clusters.
- Velero installed on every workload cluster (chart at
  [alphaswarm_platform/deployments/kubernetes/helm/velero](../../../alphaswarm_platform/deployments/kubernetes/helm/velero/)).
- ECR cross-region replication active to `us-west-2`.
- RDS cross-region read replica green.
- S3 CRR active on every Parquet + audit-archive bucket.
- Route 53 health-check failover record set on the
  `manage.alpha-swarm.ai` ingress.

## Steps

### 1. Trigger the failure

Pick the rehearsal target — typically `alphaswarm-dev` (never prod).
Document the start time in the incident ticket.

```bash
# Disable the dev cluster's API server (simulates a control-plane outage).
aws eks update-cluster-config \
  --name alphaswarm-dev \
  --region us-east-1 \
  --resources-vpc-config endpointPrivateAccess=false,endpointPublicAccess=false
```

### 2. Confirm impact

`alphaswarm_admin` should now show `unreachable` for the dev cluster
under `/admin/kubernetes/status`. The KillSwitch should still
work because it fans out to other clusters too.

### 3. Bring up the replay cluster

```bash
cd infrastructure/envs/dev
terraform apply -var-file=terraform.tfvars
```

This re-creates the EKS cluster with the same name + node groups.
ArgoCD picks up the new cluster via its Cluster generator (label
`alphaswarm.io/managed=true`).

### 4. Replay state from Velero

```bash
velero backup-location get
# Restore from the newest daily-full backup (names sort by timestamp).
velero restore create dr-replay-$(date +%s) \
  --from-backup "$(velero backup get | awk '/^daily-full-/ {print $1}' | sort | tail -1)"
```

### 5. Restore RDS

The cross-region read replica in `us-west-2` is promoted to
primary; the DR replay points the dev cluster's RDS DSN at the
new primary. The Postgres instance comes up with the audit ledger
intact so no admin actions are lost.

### 6. Verify

- `alphaswarm_admin` health should return 200 within 4h.
- The audit ledger should show the gap as a single contiguous
  block (no missing rows beyond the RPO window).
- Paper-trading runs that were active are stamped `status=halted`
  by the watchdog.
- The ArgoCD app-of-apps sync should converge within 15min after
  the cluster comes back.

### 7. Document

Append to the rehearsal log at
`alphaswarm_docs/docs/operations/dr-rehearsal-log.md` with:

- Start / end timestamps.
- Actual RPO + RTO measured.
- Issues encountered + remediations.
- Sign-off from the security officer.
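
For the "actual RPO + RTO measured" entry, the arithmetic can be sketched as follows. The targets are the ones listed at the top of this runbook; the tier labels `control_plane` and `trading` and the function name `measure` are hypothetical.

```python
from datetime import datetime, timedelta

# Targets from the top of this runbook.
TARGETS = {
    "control_plane": {"rpo": timedelta(hours=1), "rto": timedelta(hours=4)},
    "trading": {"rpo": timedelta(minutes=15), "rto": timedelta(hours=1)},
}


def measure(tier, last_good_write, failure_at, restored_at):
    """Return measured RPO/RTO for one tier and whether each met its target.

    RPO is the data-loss window (failure time minus last replicated write);
    RTO is the outage window (restore time minus failure time).
    """
    rpo = failure_at - last_good_write
    rto = restored_at - failure_at
    target = TARGETS[tier]
    return {
        "rpo": rpo,
        "rto": rto,
        "rpo_ok": rpo <= target["rpo"],
        "rto_ok": rto <= target["rto"],
    }
```

The three timestamps would come from the incident ticket (start time), the last row present in the restored store, and the moment the health check turned green.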


<!-- https://alpha-swarm.ai/operations/multi-account-rollout -->
# operations/multi-account-rollout

# Multi-account rollout runbook

This runbook covers the Phase 4 rollout: Control Tower, cross-account
IAM, dev->staging promotion, and IdP cutover. The Terraform code is
shipped under `infrastructure/`; this runbook is what the operator
follows to apply it.

## 1. Bootstrap

```bash
# In the AWS Org master account, with the platform-admin role:
cd infrastructure/bootstrap
export AWS_PROFILE=alphaswarm-org-master
terraform init
terraform apply -var=account_alias=master
```

Capture the outputs (KMS arn, GitHub OIDC arn, etc.).

## 2. Landing zone

```bash
cd infrastructure/modules/landing-zone
terraform init
terraform apply
```

This stands up the 5 OUs + SCPs. The first `apply` takes ~15
minutes because Control Tower has to enrol every region one at a
time.

## 3. Workload accounts

For each workload account (dev, staging, prod), create the
account via the `account` module from the master account, then
re-run `bootstrap/` against the new account.

```bash
cd infrastructure
terraform apply \
  -target=module.account.dev \
  -var='dev_email=aws-alphaswarm-dev@alpha-swarm.ai' \
  -var='external_id=...'
```

## 4. Per-environment composition

For each env (`dev`, `staging`, `prod`):

```bash
cd infrastructure/envs/dev
cp terraform.tfvars.example terraform.tfvars
# Fill in real values per the example.
terraform init -backend-config=backend.hcl
terraform plan
terraform apply
```

## 5. Cross-account IAM

Provision the four canonical roles per blueprint §4.2:

- `AqpAdminDeploymentRole` (cross-account assume from
  shared-services)
- `AqpAdminReadOnlyAuditRole`
- `AqpAdminBreakGlassRole` (denies everything by default; the Lambda
  from §9.3 of the blueprint attaches `AdministratorAccess` once a
  break-glass request is approved)
- `GitHubActionsDeployRole` (federated via the OIDC provider
  from `bootstrap/`)

These are wired through the `iam-irsa-roles` + `github-oidc`
modules per env. Confirm with:

```bash
aws sts assume-role \
  --role-arn arn:aws:iam::${DEV_ACCOUNT_ID}:role/AqpAdminDeploymentRole \
  --role-session-name dev-smoke \
  --external-id "$EXTERNAL_ID"
```

## 6. Promote dev to staging

Use the `alphaswarm_admin/src/alphaswarm_admin/services/account_promoter.py`
wizard via the `/admin/accounts` UI. The wizard:

1. Replicates ECR artifacts cross-region.
2. Templates the staging Helm overlay from dev (with
   `prod.deny.json` allowlist filtering).
3. Applies the staging Terraform workspace.

The same wizard handles staging -> prod once the staging burn-in
period (recommended: 14 days) completes.

## 7. IdP cutover

If you are migrating from the existing Auth0 tenant to AWS IAM
Identity Center:

1. Provision IAM Identity Center via Control Tower (one-click).
2. Create the AlphaSwarm application in Identity Center; copy the
   issuer URL + audience.
3. Set `ALPHASWARM_AUTH_PROVIDER=aws_iam_identity_center` +
   `ALPHASWARM_AUTH_OIDC_ISSUER=...` +
   `ALPHASWARM_AUTH_OIDC_AUDIENCE=...`.
4. The
   [`AwsIamIdentityCenterProvider`](../../../alphaswarm/auth/providers/aws_iam_identity_center.py)
   subclass auto-registers via the `IdentityProviderMeta`
   metaclass per AGENTS rule 27. No manual `@register` decorator.
5. Group sync: create `IdpGroupMapping` rows with
   `connection_kind="aws_iam_identity_center"` for every Identity
   Center group that should map to an AlphaSwarm role.
6. Test login with a single staff account before flipping the
   default for the org.
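
Before the test login in step 6, a pre-flight check on the three environment variables from step 3 can catch simple misconfiguration. The sketch below is an assumption-laden illustration, not AlphaSwarm code: the function name `idp_config_problems` is hypothetical, and the `https://` check reflects the general OIDC requirement that issuers use the https scheme.

```python
import os

REQUIRED_VARS = (
    "ALPHASWARM_AUTH_PROVIDER",
    "ALPHASWARM_AUTH_OIDC_ISSUER",
    "ALPHASWARM_AUTH_OIDC_AUDIENCE",
)


def idp_config_problems(env=None) -> list[str]:
    """Return human-readable problems with the Identity Center settings."""
    env = os.environ if env is None else env
    problems = [f"{name} is not set" for name in REQUIRED_VARS if not env.get(name)]
    provider = env.get("ALPHASWARM_AUTH_PROVIDER")
    if provider and provider != "aws_iam_identity_center":
        problems.append(
            f"ALPHASWARM_AUTH_PROVIDER is {provider!r}, expected 'aws_iam_identity_center'"
        )
    issuer = env.get("ALPHASWARM_AUTH_OIDC_ISSUER", "")
    if issuer and not issuer.startswith("https://"):
        problems.append("ALPHASWARM_AUTH_OIDC_ISSUER should be an https URL")
    return problems
```

An empty list means the three settings are present and mutually consistent; it does not prove the issuer or audience values are the right ones.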

## 8. Production cutover

Production cutover follows the same dev->staging recipe via
`account_promoter.py`. Step-up MFA + 4-eyes approval are
enforced server-side; the operator runs the wizard from the
admin UI.

After cutover:

- ArgoCD ApplicationSet picks up the prod cluster via its
  `alphaswarm.io/managed=true` label.
- ArgoCD Image Updater auto-bumps image tags from new ECR
  digests when the GHA pipeline produces them.
- The legacy ADR-002 single-container Solara deployment is
  decommissioned in the follow-up `alphaswarm_admin-overhaul-cleanup` PR.


<!-- https://alpha-swarm.ai/reference/api/health/get-livez -->
# GET /livez
> Liveness probe.

# Liveness probe.

> **Method:** `GET`
> **Path:** `/livez`
> **Tag:** `health`
> **OperationId:** `get-livez`

See the [interactive playground](../index.mdx) for parameter
forms, response schemas, and credential persistence.

## Source spec

This page is generated from `alphaswarm_docs/openapi/alphaswarm.json` by
[`alphaswarm_docs/scripts/generate-openapi-mdx.ts`](https://github.com/julianwileymac/alphaswarm/blob/main/alphaswarm_docs/scripts/generate-openapi-mdx.ts).
Refresh by re-running `pnpm --filter alphaswarm_docs generate-openapi-mdx`.


<!-- https://alpha-swarm.ai/reference/api/health/get-readyz -->
# GET /readyz
> Readiness probe — confirms migrations applied + downstreams reachable.

# Readiness probe — confirms migrations applied + downstreams reachable.

> **Method:** `GET`
> **Path:** `/readyz`
> **Tag:** `health`
> **OperationId:** `get-readyz`

See the [interactive playground](../index.mdx) for parameter
forms, response schemas, and credential persistence.

## Source spec

This page is generated from `alphaswarm_docs/openapi/alphaswarm.json` by
[`alphaswarm_docs/scripts/generate-openapi-mdx.ts`](https://github.com/julianwileymac/alphaswarm/blob/main/alphaswarm_docs/scripts/generate-openapi-mdx.ts).
Refresh by re-running `pnpm --filter alphaswarm_docs generate-openapi-mdx`.


<!-- https://alpha-swarm.ai/reference/api -->
# API reference
> Interactive Scalar-rendered reference for the public AlphaSwarm API at api.alpha-swarm.ai. Auto-regenerated on every commit via openapi-export-alphaswarm CI job.

# API reference

This page is auto-generated from
[alphaswarm_docs/openapi/alphaswarm.json](https://github.com/julianwileymac/alphaswarm/blob/main/alphaswarm_docs/openapi/alphaswarm.json),
which itself is regenerated on every commit via the
`openapi-export-alphaswarm` job in [.github/workflows/ci.yml](https://github.com/julianwileymac/alphaswarm/blob/main/.github/workflows/ci.yml).
If the committed spec and the live CI-dumped spec diverge,
[`oasdiff`](https://github.com/Tufin/oasdiff) blocks the PR.

## Surface

The public AlphaSwarm API at `api.alpha-swarm.ai` exposes:

- **`/health`**, **`/livez`**, **`/readyz`** — health, liveness, and readiness probes.
- **`/strategies/*`** — strategy CRUD + dispatch.
- **`/backtest/*`** — backtest dispatch, status, cancel.
- **`/bots/*`** — bot CRUD, snapshot, backtest, paper, deploy.
- **`/rl/*`** — RL train, replay, walk-forward, halt.
- **`/agents/*`** — agent dispatch, halt, watchdog.
- **`/workflows/*`** — workflow runtime endpoints.
- **`/paper/*`** — paper trading session controls.
- **`/analytics/*`** — QuantStats portfolio metrics + tearsheet
  rendering.
- **`/mcp/data/*`** — the Data MCP server.
- **`/mcp/codebase/*`** — the Codebase MCP server.

## Interactive playground

The Scalar component below loads `openapi/alphaswarm.json` and renders an
interactive playground. Authenticate with the `Authorization:
Bearer ` header. Tokens come from the `alphaswarm-cli auth login`
device-flow path (see
[Concept: identity](../../concepts/identity/identity.md)).


## SDKs

- TypeScript: `npm install @alphaswarm/sdk` (generated via Fern; see
  [reference/python](../python/index.mdx)).
- Python: `pip install alphaswarm-sdk` (Phase 6).

## Versioning

AlphaSwarm uses Stripe-style date-epoch versioning. The first epoch is
`2026-06-01`. New epochs preserve old contracts; the
`Deprecation` (RFC 9745) + `Sunset` (RFC 8594) HTTP headers signal the
12-month sunset cycle. Sunsetted epochs freeze on
`archive.alpha-swarm.ai`.
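
A client can watch for the `Sunset` header to learn how long an epoch has left. The sketch below assumes the header carries an HTTP-date, as RFC 8594 specifies; the function name `days_until_sunset` and the sample date in the usage note are hypothetical.

```python
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime
from typing import Optional


def days_until_sunset(headers: dict, now: Optional[datetime] = None) -> Optional[int]:
    """Return whole days until the `Sunset` date, or None if the header is absent."""
    # HTTP header names are case-insensitive, so normalise before lookup.
    lowered = {key.lower(): value for key, value in headers.items()}
    raw = lowered.get("sunset")
    if raw is None:
        return None
    sunset = parsedate_to_datetime(raw)  # parses an HTTP-date into an aware datetime
    now = now or datetime.now(timezone.utc)
    return (sunset - now).days
```

For example, `days_until_sunset({"Sunset": "Tue, 01 Jun 2027 00:00:00 GMT"})` returns the number of days remaining from the current time, and a client could log a warning once that number drops below some threshold.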


<!-- https://alpha-swarm.ai/reference/data-dictionary -->
# Data Dictionary
> Authoritative table-and-column reference for the AlphaSwarm persistence layer plus the Iceberg catalog. Updated whenever a model file or migration ships — see "[Adding a new model](../../concepts/platform/erd.md#adding-a-new-model)...

# Data Dictionary

> Pair with [alphaswarm_docs/erd.md](../../concepts/platform/erd.md) (visual schema) and
> [alphaswarm_docs/domain-model.md](../../concepts/platform/domain-model.md) (narrative).
> Doc map: [alphaswarm_docs/index.md](../../intro/index.md).

Authoritative table-and-column reference for the AlphaSwarm persistence
layer plus the Iceberg catalog. Updated whenever a model file or
migration ships — see "[Adding a new model](../../concepts/platform/erd.md#adding-a-new-model)"
for the workflow.

## Conventions

- **PK**: primary key column.
- **FK**: foreign key (`→ table.column`).
- **Type**: SQLAlchemy column type. `String(N)` is `VARCHAR(N)` in
  Postgres. `JSON` is `JSONB`. `DateTime` is timezone-naive UTC.
- **Null**: `Y`/`N`. Defaults are listed where present.
- **Notes**: extra constraints, indexes, or invariants.

All `id` columns are `String(36)` UUIDs generated by `_uuid()` unless
noted. All `created_at` / `updated_at` columns default to
`datetime.utcnow` server-side.

---

## 1. Sessions + chat — [models.py](../alphaswarm/persistence/models.py)

### `sessions`

The conversational shell that chat messages and agent runs live under.

| Column | Type | Null | FK | Notes |
| --- | --- | --- | --- | --- |
| id | UUID | N | – | PK |
| user | String(120) | N | – | default `"local"` |
| title | String(240) | Y | – | – |
| created_at | DateTime | N | – | default now |
| closed_at | DateTime | Y | – | – |
| meta | JSON | Y | – | default `{}` |

### `chat_messages`

| Column | Type | Null | FK | Notes |
| --- | --- | --- | --- | --- |
| id | UUID | N | – | PK |
| session_id | UUID | N | → sessions.id | cascade delete |
| role | String(32) | N | – | `user\|assistant\|agent\|tool` |
| content | Text | N | – | – |
| created_at | DateTime | N | – | default now |
| meta | JSON | Y | – | default `{}` |

---

## 2. Strategies + backtests — [models.py](../alphaswarm/persistence/models.py)

### `strategies`

The top-level strategy header. Versions live in `strategy_versions`.

| Column | Type | Null | FK | Notes |
| --- | --- | --- | --- | --- |
| id | UUID | N | – | PK |
| name | String(120) | N | – | – |
| version | Integer | N | – | default 1 |
| config_yaml | Text | N | – | full YAML config |
| created_at | DateTime | N | – | default now |
| created_by | String(120) | N | – | default `"system"` |
| status | String(32) | N | – | `draft\|backtesting\|paper\|live\|retired` |
| meta | JSON | Y | – | default `{}` |

### `strategy_versions`

Immutable YAML snapshot of a strategy.

| Column | Type | Null | FK | Notes |
| --- | --- | --- | --- | --- |
| id | UUID | N | – | PK |
| strategy_id | UUID | N | → strategies.id | cascade |
| version | Integer | N | – | – |
| config_yaml | Text | N | – | – |
| author | String(120) | N | – | default `"system"` |
| created_at | DateTime | N | – | – |
| dataset_hash | String(64) | Y | – | bind to data version |
| notes | Text | Y | – | – |

Index: `ix_strategy_versions_strategy_version (strategy_id, version)`.

### `strategy_tests`

| Column | Type | Null | FK | Notes |
| --- | --- | --- | --- | --- |
| id | UUID | N | – | PK |
| strategy_id | UUID | N | → strategies.id | – |
| version_id | UUID | Y | → strategy_versions.id | – |
| backtest_id | UUID | Y | → backtest_runs.id | – |
| status | String(32) | N | – | default `pending` |
| start, end | DateTime | Y | – | window |
| sharpe, sortino, max_drawdown, total_return, final_equity | Float | Y | – | summary metrics |
| engine | String(64) | Y | – | – |

### `backtest_runs`

| Column | Type | Null | FK | Notes |
| --- | --- | --- | --- | --- |
| id | UUID | N | – | PK |
| strategy_id | UUID | Y | → strategies.id | – |
| task_id | String(120) | Y | – | Celery task id |
| status | String(32) | N | – | default `pending` |
| start, end | DateTime | Y | – | window |
| initial_cash, final_equity | Float | Y | – | – |
| sharpe, sortino, max_drawdown, total_return | Float | Y | – | metrics |
| mlflow_run_id | String(120) | Y | – | links to MLflow UI |
| dataset_hash | String(64) | Y | – | – |
| metrics | JSON | Y | – | full metrics blob |
| error | Text | Y | – | – |
| model_version_id | UUID | Y | → model_versions.id | Alembic 0025 — model that produced the alpha |
| ml_experiment_run_id | UUID | Y | → ml_experiment_runs.id | Alembic 0025 — training run lineage |
| experiment_plan_id | UUID | Y | → experiment_plans.id | Alembic 0025 — experiment plan lineage |
| model_deployment_id | UUID | Y | → model_deployments.id | Alembic 0025 — deployment that wired the model into the strategy |

### `ml_alpha_backtest_runs`

Combined experiment row joining a training run to a downstream
backtest. Persisted by [`AlphaBacktestExperiment`](../../concepts/strategy/ml-alpha-backtest.md).

| Column | Type | Null | FK | Notes |
| --- | --- | --- | --- | --- |
| id | UUID | N | – | PK |
| task_id | String(120) | Y | – | Celery task id |
| run_name | String(240) | N | – | default `alpha-backtest` |
| status | String(32) | N | – | `queued\|running\|completed\|failed` |
| ml_experiment_run_id | UUID | Y | → ml_experiment_runs.id | – |
| backtest_run_id | UUID | Y | → backtest_runs.id | – |
| model_version_id | UUID | Y | → model_versions.id | – |
| model_deployment_id | UUID | Y | → model_deployments.id | – |
| experiment_plan_id | UUID | Y | → experiment_plans.id | – |
| mlflow_run_id | String(120) | Y | – | parent MLflow run id |
| dataset_hash | String(64) | Y | – | – |
| ml_metrics | JSON | Y | – | IC / RMSE / hit-rate / etc |
| trading_metrics | JSON | Y | – | Sharpe / Sortino / Calmar / etc |
| combined_metrics | JSON | Y | – | rolled-up scalar `score` + selected ML/trading keys |
| attribution | JSON | Y | – | conviction-vs-PnL attribution |
| params | JSON | Y | – | full input-config snapshot |
| error | Text | Y | – | – |

### `ml_prediction_audit`

Per-bar prediction audit for an alpha-backtest run. Opt-in via
`ALPHASWARM_ML_PREDICTION_AUDIT_ENABLED`; capped at
`ALPHASWARM_ML_PREDICTION_AUDIT_MAX_ROWS` rows per run.

| Column | Type | Null | FK | Notes |
| --- | --- | --- | --- | --- |
| id | UUID | N | – | PK |
| alpha_backtest_run_id | UUID | N | → ml_alpha_backtest_runs.id | cascade |
| vt_symbol | String(40) | N | – | – |
| ts | DateTime | N | – | – |
| prediction | Float | N | – | – |
| label | Float | Y | – | – |
| position_after | Float | Y | – | – |
| pnl_after_bar | Float | Y | – | – |

### `optimization_runs`

| Column | Type | Null | FK | Notes |
| --- | --- | --- | --- | --- |
| id | UUID | N | – | PK |
| task_id | String(120) | Y | – | – |
| strategy_id | UUID | Y | → strategies.id | – |
| run_name | String(240) | N | – | default `"sweep"` |
| method | String(32) | N | – | `grid\|random\|bayes` |
| metric | String(64) | N | – | default `"sharpe"` |
| status | String(32) | N | – | `queued\|running\|completed\|failed` |
| n_trials, n_completed | Integer | N | – | – |
| best_trial_id | String(36) | Y | – | – |
| best_metric_value | Float | Y | – | – |
| parameter_space, base_config, summary | JSON | Y | – | – |

### `optimization_trials`

| Column | Type | Null | FK | Notes |
| --- | --- | --- | --- | --- |
| id | UUID | N | – | PK |
| run_id | UUID | N | → optimization_runs.id | cascade |
| backtest_id | UUID | Y | → backtest_runs.id | – |
| trial_index | Integer | N | – | – |
| parameters | JSON | Y | – | – |
| status | String(32) | N | – | – |
| metric_value, sharpe, sortino, total_return, max_drawdown, final_equity | Float | Y | – | – |
| error | Text | Y | – | – |

---

## 3. Ledger — [models.py](../alphaswarm/persistence/models.py)

### `signals`

| Column | Type | Null | FK | Notes |
| --- | --- | --- | --- | --- |
| id | UUID | N | – | PK |
| strategy_id | UUID | Y | → strategies.id | – |
| backtest_id | UUID | Y | → backtest_runs.id | – |
| vt_symbol | String(40) | N | – | indexed |
| direction | String(10) | N | – | `long|short|net` |
| strength | Float | N | – | – |
| confidence | Float | Y | – | default 1.0 |
| rationale | Text | Y | – | – |

Composite index: `ix_signals_symbol_ts (vt_symbol, created_at)`.

### `orders`

| Column | Type | Null | FK | Notes |
| --- | --- | --- | --- | --- |
| id | UUID | N | – | PK |
| backtest_id | UUID | Y | → backtest_runs.id | – |
| strategy_id | UUID | Y | → strategies.id | – |
| vt_symbol | String(40) | N | – | indexed |
| side | String(8) | N | – | `buy|sell` |
| order_type | String(16) | N | – | `market|limit|stop|stop_limit|...` |
| quantity, price | Float | varies | – | – |
| status | String(16) | N | – | default `submitting` |
| reference | String(120) | Y | – | `paper:` for paper trading |

### `fills`

| Column | Type | Null | FK | Notes |
| --- | --- | --- | --- | --- |
| id | UUID | N | – | PK |
| order_id | UUID | Y | → orders.id | – |
| vt_symbol | String(40) | N | – | indexed |
| side | String(8) | N | – | – |
| quantity, price | Float | N | – | – |
| commission, slippage | Float | Y | – | default 0 |

### `ledger_entries`

The canonical audit trail. Every action goes through `LedgerWriter`.

| Column | Type | Null | FK | Notes |
| --- | --- | --- | --- | --- |
| id | UUID | N | – | PK |
| backtest_id | UUID | Y | → backtest_runs.id | – |
| strategy_id | UUID | Y | → strategies.id | – |
| entry_type | String(32) | N | – | `SIGNAL|ORDER|FILL|RISK|AGENT|META` |
| level | String(16) | N | – | `debug|info|warn|error` |
| message | Text | N | – | – |
| payload | JSON | Y | – | default `{}` |

Index: `ix_ledger_type_ts (entry_type, created_at)`.
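
To make the shape of a ledger write concrete, here is a minimal sketch against an in-memory SQLite database. It is not the real `LedgerWriter`: the `create_ledger` and `write_entry` helpers are hypothetical, and `created_at` is assumed to exist because the index above references it.

```python
import json
import sqlite3
import uuid
from datetime import datetime, timezone

ENTRY_TYPES = {"SIGNAL", "ORDER", "FILL", "RISK", "AGENT", "META"}
LEVELS = {"debug", "info", "warn", "error"}


def create_ledger(conn):
    """Create a simplified ledger_entries table plus the documented index."""
    conn.execute(
        """
        CREATE TABLE ledger_entries (
            id TEXT PRIMARY KEY,
            backtest_id TEXT,
            strategy_id TEXT,
            entry_type TEXT NOT NULL,
            level TEXT NOT NULL,
            message TEXT NOT NULL,
            payload TEXT,
            created_at TEXT NOT NULL
        )
        """
    )
    conn.execute(
        "CREATE INDEX ix_ledger_type_ts ON ledger_entries (entry_type, created_at)"
    )


def write_entry(conn, entry_type, message, level="info", payload=None,
                backtest_id=None, strategy_id=None):
    """Validate the enumerated columns, then insert one audit row."""
    if entry_type not in ENTRY_TYPES:
        raise ValueError(f"unknown entry_type: {entry_type!r}")
    if level not in LEVELS:
        raise ValueError(f"unknown level: {level!r}")
    entry_id = str(uuid.uuid4())
    conn.execute(
        "INSERT INTO ledger_entries VALUES (?, ?, ?, ?, ?, ?, ?, ?)",
        (
            entry_id,
            backtest_id,
            strategy_id,
            entry_type,
            level,
            message,
            json.dumps(payload or {}),  # payload defaults to {}
            datetime.now(timezone.utc).isoformat(),
        ),
    )
    return entry_id
```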

---

## 4. Instruments — [models.py](../alphaswarm/persistence/models.py) + [models_instruments.py](../alphaswarm/persistence/models_instruments.py)

### `instruments` (parent)

The polymorphic root. `instrument_class` is the discriminator;
subclass rows live in `instrument_*` tables keyed on
`instruments.id`.

| Column | Type | Null | FK | Notes |
| --- | --- | --- | --- | --- |
| id | UUID | N | – | PK |
| vt_symbol | String(64) | N | unique | `Symbol.format()` |
| ticker | String(64) | N | indexed | – |
| exchange | String(32) | Y | – | – |
| asset_class | String(32) | Y | – | `equity|crypto|fx|...` |
| security_type | String(32) | Y | – | `equity|option|future|...` |
| instrument_class | String(32) | Y | indexed | discriminator |
| issuer_id | UUID | Y | → issuers.id | – |
| identifiers | JSON | Y | – | `{ticker, isin, cusip, …}` |
| sector, industry, region, currency | String | Y | – | – |
| tick_size, multiplier, min_quantity, max_quantity, lot_size | Float | Y | – | exchange specs |
| price_precision, size_precision | Integer | Y | – | – |
| is_active | Boolean | N | – | default true |
| tags | JSON | Y | – | default `[]` |
| meta | JSON | Y | – | default `{}` |

### Joined-table subclasses (`instrument_*`)

All share an `id` PK that is also an FK to `instruments.id`. Each table
adds shape-specific columns. For full column lists see
[models_instruments.py](../alphaswarm/persistence/models_instruments.py); the
ERD in [alphaswarm_docs/erd.md](../../concepts/platform/erd.md#core--instruments) lists key columns per
subclass.

| Subclass table | Polymorphic identity | Distinctive columns |
| --- | --- | --- |
| `instrument_equity` | `spot` | `isin`, `cusip`, `figi`, `lei`, `gics_sector`, `shares_outstanding`, `is_adr` |
| `instrument_etf` | `etf` | `inception_date`, `aum`, `expense_ratio`, `is_leveraged`, `replication` |
| `instrument_index` | `index` | `administrator`, `methodology`, `constituent_count`, `base_value` |
| `instrument_bond` | `bond` | `coupon`, `maturity`, `rating_sp`, `rating_moodys`, `callable`, `convertible` |
| `instrument_future` | `future` | `underlying`, `expiry`, `contract_size`, `cycle`, `delivery_month` |
| `instrument_option` | `option` | `strike`, `expiry`, `kind` (call/put), `style`, `occ_symbol` |
| `instrument_fx_pair` | `fx_pair` | `base_currency`, `quote_currency`, `pip_size` |
| `instrument_crypto` | `crypto_token` | `subtype`, `chain`, `contract_address`, `max_leverage`, `funding_interval` |
| `instrument_cfd` | `cfd` | `underlying`, `margin_rate`, `financing_rate` |
| `instrument_commodity` | `spot_commodity` | `grade`, `unit_of_measure`, `delivery` |
| `instrument_synthetic` | `synthetic` | `legs`, `leg_weights`, `formula` |
| `instrument_betting` | `betting` | `event_name`, `market_type`, `selection_id` |
| `instrument_tokenized_asset` | `nft` | `chain`, `contract_address`, `token_standard` |
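
The parent/subclass layout can be illustrated with plain SQLite. This is a reduced sketch, not the ORM models: only a few columns are kept, the helper names are hypothetical, and the example `vt_symbol` format is made up.

```python
import sqlite3


def build_demo():
    """Create a parent table and one subclass table sharing the same id."""
    conn = sqlite3.connect(":memory:")
    conn.execute("PRAGMA foreign_keys = ON")
    conn.executescript(
        """
        CREATE TABLE instruments (
            id TEXT PRIMARY KEY,
            vt_symbol TEXT NOT NULL UNIQUE,
            instrument_class TEXT          -- discriminator
        );
        CREATE TABLE instrument_option (
            id TEXT PRIMARY KEY REFERENCES instruments(id),
            strike REAL,
            expiry TEXT,
            kind TEXT                      -- call / put
        );
        """
    )
    return conn


def add_option(conn, instrument_id, vt_symbol, strike, expiry, kind):
    """Insert the parent row first, then the subclass row with the same id."""
    conn.execute(
        "INSERT INTO instruments VALUES (?, ?, 'option')", (instrument_id, vt_symbol)
    )
    conn.execute(
        "INSERT INTO instrument_option VALUES (?, ?, ?, ?)",
        (instrument_id, strike, expiry, kind),
    )


def load_option(conn, vt_symbol):
    """Join parent and subclass rows back into one record."""
    return conn.execute(
        """
        SELECT i.vt_symbol, i.instrument_class, o.strike, o.kind
        FROM instruments AS i
        JOIN instrument_option AS o ON o.id = i.id
        WHERE i.vt_symbol = ?
        """,
        (vt_symbol,),
    ).fetchone()
```

Because the subclass `id` is a foreign key, a subclass row cannot exist without its parent, which is what keeps the discriminator and the subclass table in step.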

---

## 5. Dataset lineage — [models.py](../alphaswarm/persistence/models.py)

### `dataset_catalogs`

Logical dataset descriptor. Iceberg-related columns added in
[migration 0011](../alembic/versions/0011_iceberg_catalog_columns.py).

| Column | Type | Null | FK | Notes |
| --- | --- | --- | --- | --- |
| id | UUID | N | – | PK |
| name | String(160) | N | indexed | – |
| provider | String(80) | N | indexed | – |
| domain | String(120) | N | – | default `"market.bars"` |
| frequency | String(32) | Y | – | – |
| storage_uri | String(512) | Y | – | – |
| schema_json | JSON | Y | – | – |
| description | Text | Y | – | – |
| tags | JSON | Y | – | – |
| meta | JSON | Y | – | – |
| iceberg_identifier | String(240) | Y | indexed | `<namespace>.<table>` |
| load_mode | String(32) | N | – | `managed|external` (default managed) |
| source_uri | String(1024) | Y | – | – |
| llm_annotations | JSON | Y | – | from `annotate_table` |
| column_docs | JSON | Y | – | – |

Composite index: `ix_dataset_catalog_name_provider`.

### `dataset_versions`

| Column | Type | Null | FK | Notes |
| --- | --- | --- | --- | --- |
| id | UUID | N | – | PK |
| catalog_id | UUID | N | → dataset_catalogs.id (cascade) | – |
| version | Integer | N | – | default 1 |
| status | String(32) | N | – | `active|superseded` |
| as_of, start_time, end_time | DateTime | Y | – | – |
| row_count, symbol_count, file_count | Integer | N | – | default 0 |
| dataset_hash | String(64) | Y | indexed | SHA-256 of inputs |
| materialization_uri | String(512) | Y | – | – |
| columns | JSON | Y | – | – |
| schema_json | JSON | Y | – | – |
| meta | JSON | Y | – | – |

Composite index: `ix_dataset_versions_catalog_version`.
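
One way to produce a 64-character `dataset_hash` is to hash a canonical JSON encoding of the inputs. Which inputs AlphaSwarm actually hashes, and how they are serialised, is not stated here, so treat the field choice and encoding below as assumptions.

```python
import hashlib
import json


def dataset_hash(inputs):
    """SHA-256 hex digest over a canonical JSON encoding of the inputs.

    Sorting keys and fixing separators makes the digest independent of
    dict ordering; `default=str` covers dates and UUIDs.
    """
    canonical = json.dumps(inputs, sort_keys=True, separators=(",", ":"), default=str)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()
```

The hex digest is always 64 characters long, which matches the `String(64)` column width.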

### `data_sources`

| Column | Type | Null | FK | Notes |
| --- | --- | --- | --- | --- |
| id | UUID | N | – | PK |
| name | String | N | unique | `yfinance|alpaca|cfpb|...` |
| kind | String | Y | – | `rest|csv|parquet|kafka` |
| base_url | String | Y | – | – |
| meta | JSON | Y | – | – |

### `data_links`

Edges between dataset versions and entities (instruments, series).

| Column | Type | Null | FK | Notes |
| --- | --- | --- | --- | --- |
| id | UUID | N | – | PK |
| dataset_version_id | UUID | N | → dataset_versions.id (cascade) | – |
| source_id | UUID | Y | → data_sources.id | – |
| instrument_id | UUID | Y | → instruments.id | – |
| entity_kind | String | N | – | `instrument|series|theme` |
| entity_id | String | N | – | – |
| coverage_start, coverage_end | DateTime | Y | – | – |
| row_count | Integer | Y | – | – |
| meta | JSON | Y | – | – |

### `identifier_links`

| Column | Type | Null | FK | Notes |
| --- | --- | --- | --- | --- |
| id | UUID | N | – | PK |
| instrument_id | UUID | Y | → instruments.id | – |
| source_id | UUID | Y | → data_sources.id | – |
| identifier_kind | String | N | – | `cik|isin|ticker|figi|...` |
| identifier_value | String | N | – | – |

### `split_plans`, `split_artifacts`, `pipeline_recipes`, `experiment_plans`, `model_versions`, `model_deployments`

ML lineage tables. See full column lists in
[models.py](../alphaswarm/persistence/models.py) (search for the class
name). One-liner summary:

| Table | Purpose |
| --- | --- |
| `split_plans` | Train/val/test split design (method, segments, FK to dataset_version) |
| `split_artifacts` | Materialised fold boundaries + index sets per split plan |
| `pipeline_recipes` | Preprocessing recipes (shared/learn/infer processors) |
| `experiment_plans` | Ties dataset_version + split + recipe + model config + status |
| `model_versions` | One row per trained MLflow registry version |
| `model_deployments` | Active inference deployments (one model_version may have many) |

---

## 6. Agentic — [models.py](../alphaswarm/persistence/models.py)

### `agent_runs`

| Column | Type | Null | FK | Notes |
| --- | --- | --- | --- | --- |
| id | UUID | N | – | PK |
| session_id | UUID | Y | → sessions.id | – |
| task_id | String(120) | Y | indexed | – |
| crew | String(120) | N | – | – |
| status | String(32) | N | – | – |
| prompt | Text | N | – | – |
| result | JSON | Y | – | – |
| error | Text | Y | – | – |
| llm_model | String(120) | Y | – | – |
| token_usage | JSON | Y | – | – |

### `crew_runs`

Lightweight index for the Crew Trace UI.

| Column | Type | Null | FK | Notes |
| --- | --- | --- | --- | --- |
| id | UUID | N | – | PK |
| task_id | String(120) | N | unique | – |
| crew_name | String(120) | N | – | default `"research"` |
| crew_type | String(32) | N | indexed | `research|trader` |
| status | String(32) | N | indexed | – |
| prompt | Text | N | – | – |
| session_id | String(36) | Y | indexed | – |
| agent_run_id | UUID | Y | → agent_runs.id | – |
| result, events | JSON | Y | – | – |
| error | Text | Y | – | – |
| cost_usd | Float | N | – | default 0 |

### `agent_decisions`, `debate_turns`, `agent_backtests`, `agent_judge_reports`, `agent_replay_runs`, `backtest_interrupts`

The agentic-backtest audit trail.

| Table | Purpose |
| --- | --- |
| `agent_decisions` | One row per long/short/flat decision (links backtest, strategy, crew_run) |
| `debate_turns` | Multi-turn debate transcripts under a decision |
| `agent_backtests` | Crew-level metrics rolled up per backtest |
| `agent_judge_reports` | Judge LLM's evaluation of a backtest |
| `agent_replay_runs` | Replays of a judged backtest with adjusted prompts |
| `backtest_interrupts` | User pause/resume markers during a long backtest |

---

## 7. Feature sets — [models.py](../alphaswarm/persistence/models.py)

### `feature_sets`

| Column | Type | Null | FK | Notes |
| --- | --- | --- | --- | --- |
| id | UUID | N | – | PK |
| name | String | N | – | – |
| description | Text | Y | – | – |
| kind | String | N | – | `composite|ml4t|qlib|alpha158` |
| specs | JSON | N | – | list of indicator/transformation strings |
| tags | JSON | Y | – | – |
| default_lookback_days | Integer | Y | – | – |

### `feature_set_versions`

Immutable snapshot keyed on `content_hash` so the same spec rendered
twice deduplicates.
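
The deduplication property can be sketched with an in-memory store keyed on the hash. `FeatureSetVersionStore` is a hypothetical stand-in for the table, and hashing the newline-joined spec strings is an assumption about how `content_hash` is derived.

```python
import hashlib


class FeatureSetVersionStore:
    """In-memory stand-in for `feature_set_versions`, keyed on content_hash."""

    def __init__(self):
        self._by_hash = {}

    def snapshot(self, specs):
        """Store an immutable copy of the specs and return (hash, stored specs)."""
        content_hash = hashlib.sha256("\n".join(specs).encode("utf-8")).hexdigest()
        # setdefault keeps the first snapshot and ignores later identical ones.
        stored = self._by_hash.setdefault(content_hash, tuple(specs))
        return content_hash, stored

    def __len__(self):
        return len(self._by_hash)
```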

### `feature_set_usages`

Records of which backtests / deployments consumed which feature-set
versions (for reverse lineage).

---

## 8. Reports + paper — [models.py](../alphaswarm/persistence/models.py)

| Table | Purpose | Key columns |
| --- | --- | --- |
| `equity_reports` | Markdown equity research reports generated by the report-writer crew | `vt_symbol`, `cohort`, `markdown`, `cost_usd` |
| `paper_trading_runs` | One row per paper or live session | `brokerage`, `feed`, `last_heartbeat_at`, `bars_seen`, `orders_submitted`, `fills`, `state` |
| `rl_episodes` | Snapshot of an RL training episode | `run_id`, `episode`, `mean_reward`, `portfolio_value` |

---

## 8a. Bots — [models_bots.py](../alphaswarm/persistence/models_bots.py)

Tables introduced by the Bot Entity Refactor (Alembic
[`0020_bots`](../alembic/versions/0020_bots.py)). They mirror the proven
`agent_specs` / `agent_spec_versions` / `agent_runs_v2` pattern.

| Table | Purpose | Key columns |
| --- | --- | --- |
| `bots` | Logical bot row (latest active version of a named spec inside a project) | `id`, `name`, `slug`, `kind` (`trading|research`), `current_version`, `spec_yaml`, `status` (`draft|ready|deployed|archived`), `annotations`, `(project_id, slug)` UNIQUE |
| `bot_versions` | Immutable, hash-locked snapshot of every `BotSpec` change | `id`, `bot_id` FK, `version`, `spec_hash`, `payload`, `notes`, `created_by`, `(bot_id, spec_hash)` UNIQUE, `(bot_id, version)` UNIQUE |
| `bot_deployments` | One row per backtest / paper / chat / k8s invocation; references the version that produced it | `id`, `bot_id` FK, `version_id` FK, `target` (`paper_session|kubernetes|backtest_only|backtest|chat`), `task_id`, `status`, `manifest_yaml` (k8s only), `result_summary`, `error`, `started_at`, `ended_at` |

All three tables carry `ProjectScopedMixin` (`owner_user_id`,
`workspace_id`, `project_id`).

---

## 9. News — [models_news.py](../alphaswarm/persistence/models_news.py)

| Table | Key columns |
| --- | --- |
| `news_items` | `url`, `source`, `published_at`, `headline`, `body` |
| `news_item_entities` | `news_item_id`, `vt_symbol`, `entity_kind` (`instrument|issuer|theme`) |
| `news_sentiments` | `news_item_id`, `scorer` (`finbert|fingpt`), `polarity`, `confidence` |

---

## 10. Events — [models_events.py](../alphaswarm/persistence/models_events.py)

`corporate_events` is the parent; the per-type tables FK back to it.

| Table | Key columns |
| --- | --- |
| `corporate_events` | `vt_symbol`, `event_type` (`earnings|split|dividend|merger|ipo`), `event_time`, `payload` |
| `earnings_event_rows` | `event_id`, `eps_actual`, `eps_estimate`, `revenue_actual` |
| `dividend_event_rows` | `event_id`, `amount`, `ex_date`, `pay_date` |
| `split_event_rows` | `event_id`, `ratio` |
| `ipo_event_rows` | `event_id`, `offer_price`, `shares_offered` |
| `merger_event_rows` | `event_id`, `acquirer`, `target`, `terms` |
| `calendar_event_rows` | `event_id`, `event_kind`, `expected_time` |
| `analyst_estimates` | `vt_symbol`, `analyst`, `target_price`, `forecast_date` |
| `price_targets` | `vt_symbol`, `analyst`, `target_price`, `period` |
| `forward_estimates` | `vt_symbol`, `analyst`, `metric`, `value` |
| `regulatory_event_rows` | `event_id`, `regulator`, `summary` |
| `esg_event_rows` | `event_id`, `category`, `score` |

---

## 11. Fundamentals — [models_fundamentals.py](../alphaswarm/persistence/models_fundamentals.py)

| Table | Key columns |
| --- | --- |
| `financial_statements` | `issuer_id`, `period` (`Q|FY`), `period_end`, `data` |
| `financial_ratios` | `issuer_id`, `period_end`, `pe`, `pb`, `roe`, `roa`, `debt_to_equity` |
| `key_metrics` | `issuer_id`, `period_end`, `revenue`, `net_income`, `free_cash_flow` |
| `historical_market_caps` | `issuer_id`, `as_of`, `market_cap` |
| `revenue_breakdowns` | `issuer_id`, `period_end`, `segment`, `region`, `revenue` |
| `earnings_call_transcripts` | `issuer_id`, `call_date`, `content` |
| `management_discussion_analysis` | `issuer_id`, `period_end`, `mda_text` |
| `reported_financials` | `issuer_id`, `period_end`, `xbrl_payload` |

---

## 12. Macro — [models_macro.py](../alphaswarm/persistence/models_macro.py)

| Table | Key columns |
| --- | --- |
| `economic_series` | `series_id` (`FRED:GDP`), `title`, `frequency`, `units`, `source` |
| `economic_observations` | `series_id`, `observation_date`, `value` |
| `cot_reports` | `report_date`, `instrument`, `positions` |
| `bls_series` | `series_id`, `title`, `frequency` |
| `treasury_rates` | `date`, `rate_3m`, `rate_2y`, `rate_10y`, `rate_30y` |
| `yield_curves` | `date`, `tenors` |
| `option_series` | `instrument_id`, `expiry`, `style` |
| `option_chain_snapshots` | `series_id`, `as_of`, `chain_payload` |
| `futures_curves` | `as_of`, `front_month`, `tenor_prices` |
| `market_holidays` | `exchange`, `date`, `name` |
| `market_status_history` | `exchange`, `as_of`, `status` |

---

## 13. Entities + ownership — [models_entities.py](../alphaswarm/persistence/models_entities.py) + [models_ownership.py](../alphaswarm/persistence/models_ownership.py)

| Table | Key columns |
| --- | --- |
| `issuers` | `name`, `lei`, `country`, `entity_kind` |
| `government_entities` | `id` (PK_FK), `country_code`, `level` |
| `funds` | `id` (PK_FK), `fund_family`, `fund_type` |
| `sectors` | `code`, `name` |
| `industries` | `code`, `name`, `sector_id` |
| `industry_classifications` | `issuer_id`, `industry_id`, `as_of` |
| `entity_relationships` | `parent_id`, `child_id`, `kind` |
| `locations` | `issuer_id`, `country`, `city` |
| `key_executives` | `issuer_id`, `name`, `title` |
| `executive_compensation` | `executive_id`, `year`, `total_comp` |
| `insider_transactions` | `vt_symbol`, `insider_name`, `transaction_date`, `quantity` |
| `institutional_holdings` | `vt_symbol`, `holder_name`, `as_of`, `quantity` |
| `form_13f_holdings` | `filer_cik`, `vt_symbol`, `period_end` |
| `short_interest` | `vt_symbol`, `settlement_date`, `short_interest` |
| `shares_float_snapshots` | `vt_symbol`, `as_of`, `float_shares` |
| `politician_trades` | `politician`, `vt_symbol`, `trade_date`, `amount` |
| `fund_holdings` | `fund_id`, `vt_symbol`, `as_of`, `position` |

---

## 14. Taxonomy — [models_taxonomy.py](../alphaswarm/persistence/models_taxonomy.py)

| Table | Key columns |
| --- | --- |
| `taxonomy_schemes` | `name` (`GICS|SASB|theme`) |
| `taxonomy_nodes` | `scheme_id`, `parent_id`, `code`, `label` |
| `entity_tags` | `node_id`, `entity_kind`, `entity_id` |
| `entity_crosswalks` | `from_kind`, `from_id`, `to_kind`, `to_id` |

---

## 15. External-source indexes — [models.py](../alphaswarm/persistence/models.py)

| Table | Purpose | Key columns |
| --- | --- | --- |
| `fred_series` | FRED metadata index | `series_id`, `title`, `units`, `frequency` |
| `sec_filings` | SEC EDGAR filing index | `instrument_id`, `accession`, `form`, `filing_date` |
| `gdelt_mentions` | GDelt GKG mention index | `instrument_id`, `mention_time`, `gkg_payload` |

---

## Iceberg namespace conventions

Iceberg tables sit alongside the Postgres schema; their identifiers
are stored in `dataset_catalogs.iceberg_identifier` for cross-lookup.

| Namespace | Source |
| --- | --- |
| `alphaswarm` | Generic / default (fallback when no `--namespace` provided) |
| `alphaswarm_smoke` | Smoke-test namespace ([scripts/iceberg_smoke.py](../scripts/iceberg_smoke.py)) |
| `alphaswarm_cfpb` | CFPB regulatory ingest |
| `alphaswarm_uspto` | USPTO regulatory ingest |
| `alphaswarm_fda` | openFDA regulatory ingest |
| `alphaswarm_sec` | SEC quarterly data sets |
| `alphaswarm_bars` | (reserved) generic OHLCV cache |
| `alphaswarm_features` | (reserved) feature-set materialisations |

**Naming rules**:

- Namespace: `alphaswarm_` prefix plus the source name, lower-snake-case, ≤32 chars.
- Table: lower-snake-case, ≤48 chars, descriptive nouns
  (`hmda_lar`, `device_event`, `broker_dealers`).
- The Director (Nemotron) decides the final table name within these
  rules; identity-plan fallback uses the discovered family name as-is.
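
The length and casing rules above are easy to check mechanically. The sketch below covers only those two rules; the exact definition of "lower-snake-case" used here (lowercase letters and digits, starting with a letter, single underscores between parts) is an assumption, and the prefix requirement is deliberately left out.

```python
import re

# Assumed reading of "lower-snake-case": starts with a letter, then
# lowercase alphanumerics separated by single underscores.
_SNAKE = re.compile(r"[a-z][a-z0-9]*(?:_[a-z0-9]+)*")


def is_valid_namespace(name):
    """Lower-snake-case and at most 32 characters."""
    return len(name) <= 32 and _SNAKE.fullmatch(name) is not None


def is_valid_table(name):
    """Lower-snake-case and at most 48 characters."""
    return len(name) <= 48 and _SNAKE.fullmatch(name) is not None
```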

**Layout**:

```
C:/alphaswarm-warehouse/iceberg/
├── catalog.db                                # SQLite metadata
└── <namespace>/
    └── <table>/
        ├── data/00000-0-<uuid>.parquet
        ├── data/00001-0-<uuid>.parquet
        └── metadata/
            ├── 00000-<uuid>.metadata.json
            ├── 00001-<uuid>.metadata.json
            ├── <uuid>-m0.avro                # manifest
            └── snap-<snapshot-id>-...avro    # manifest list (one per snapshot)
```

Snapshots are append-only — every `append_arrow` produces a new
`metadata.json` revision. Old snapshots can be expired with
PyIceberg's `Table.expire_snapshots(...)` (not exposed via API yet).

**Updating this dictionary**:

When you add an ORM column or a new table:

1. Update the corresponding section above.
2. If you added a table to a per-domain ERD scope, update
   [alphaswarm_docs/erd.md](../../concepts/platform/erd.md) too.
3. Cross-link the migration that introduced the change.


<!-- https://alpha-swarm.ai/reference/manage-api -->
# Control-plane API
> Interactive Scalar-rendered reference for the control plane at manage.alpha-swarm.ai. Workload lifecycle, Terraform driver, provider adapters.

# Control-plane API

This is the `alphaswarm_controller` surface at `manage.alpha-swarm.ai`. It is
deliberately separate from the public AlphaSwarm API; it owns workload
lifecycle, the `TerraformRuntime`, provider adapters, and the
`workload_runs` audit ledger.

The spec lives at
[alphaswarm_docs/openapi/control-plane.json](https://github.com/julianwileymac/alphaswarm/blob/main/alphaswarm_docs/openapi/control-plane.json),
auto-dumped by the existing `openapi-export` job in
[.github/workflows/ci.yml](https://github.com/julianwileymac/alphaswarm/blob/main/.github/workflows/ci.yml).

## Surface

- **`/manage/workloads/*`** — start / stop / scale / restart / exec /
  tail-logs / apply_config / rotate-secret.
- **`/manage/topology/*`** — service URL resolution (AGENTS rule 47).
- **`/manage/terraform/*`** — Terraform plan / apply / destroy through
  `TerraformRuntime` (AGENTS rules 42, 43).
- **`/manage/cloudflare/*`** — tunnel + DNS + Access app CRUD.
- **`/manage/auth/*`** — IdP wiring (Auth0, Entra).
- **`/manage/tenancy/*`** — `EntraTenantLink` lifecycle (AGENTS rule 44).
- **`/manage/agents/health`** — agent stall watchdog snapshot.
- **`/manage/workflows/halt`** — kill-switch fan-out.

## Audit ledger

Every workload action writes a `workload_runs` row BEFORE executing
through the provider. See
[Concept: management engine](../../concepts/identity/management-engine.md)
for the full audit contract.
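
The write-before-execute ordering can be sketched as follows. This is an illustration of the pattern only: `WorkloadAudit`, `run_workload_action`, and the `pending` / `succeeded` / `failed` status values are hypothetical names, not the controller's actual API.

```python
from dataclasses import dataclass, field


@dataclass
class WorkloadAudit:
    """In-memory stand-in for the `workload_runs` table."""

    rows: list = field(default_factory=list)

    def record(self, action, workload):
        row = {"action": action, "workload": workload, "status": "pending"}
        self.rows.append(row)
        return row


def run_workload_action(audit, provider_call, action, workload):
    """Write the audit row first, then call the provider and update the row."""
    # The row exists before the provider is touched, so a crash mid-call
    # still leaves a trace of what was attempted.
    row = audit.record(action, workload)
    try:
        result = provider_call(action, workload)
    except Exception as exc:
        row["status"] = "failed"
        row["error"] = str(exc)
        raise
    row["status"] = "succeeded"
    return result
```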


## Authentication

Same Auth0 / Entra IdP chain as the public API; access is restricted
to the `admin:cluster` scope (engineering org) and the per-org
`admin:org` scope (customer orgs). Cloudflare Access policies in
front of `manage.alpha-swarm.ai` enforce the perimeter at the edge.


<!-- https://alpha-swarm.ai/reference/manage-api/workloads/get-manage-livez -->
# GET /manage/livez
> Control-plane liveness probe.

# Control-plane liveness probe.

> **Method:** `GET`
> **Path:** `/manage/livez`
> **Tag:** `workloads`
> **OperationId:** `get-manage-livez`

See the [interactive playground](../index.mdx) for parameter
forms, response schemas, and credential persistence.
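
For scripted health checks, a standard-library call is enough. The sketch below assumes the probe is served over HTTPS on `manage.alpha-swarm.ai`, that HTTP 200 means live, and that a bearer token may be needed because of the Access perimeter; none of these details is confirmed by the spec excerpt, and the helper names are hypothetical.

```python
import urllib.request
from urllib.parse import urljoin

# Host taken from the control-plane reference; the https scheme is an assumption.
DEFAULT_BASE = "https://manage.alpha-swarm.ai"


def livez_request(base_url=DEFAULT_BASE, token=None):
    """Build (but do not send) a GET request for the liveness probe."""
    url = urljoin(base_url.rstrip("/") + "/", "manage/livez")
    req = urllib.request.Request(url, method="GET")
    if token:
        # Whether the probe requires a bearer token is an assumption.
        req.add_header("Authorization", f"Bearer {token}")
    return req


def is_live(base_url=DEFAULT_BASE, token=None, timeout=5.0):
    """Return True when the probe answers with HTTP 200."""
    try:
        with urllib.request.urlopen(livez_request(base_url, token), timeout=timeout) as resp:
            return resp.status == 200
    except OSError:  # URLError and HTTPError are both OSError subclasses
        return False
```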

## Source spec

This page is generated from `alphaswarm_docs/openapi/control-plane.json` by
[`alphaswarm_docs/scripts/generate-openapi-mdx.ts`](https://github.com/julianwileymac/alphaswarm/blob/main/alphaswarm_docs/scripts/generate-openapi-mdx.ts).
Refresh by re-running `pnpm --filter alphaswarm_docs generate-openapi-mdx`.


<!-- https://alpha-swarm.ai/reference/python/alphaswarm -->
# alphaswarm
> Auto-generated reference for the alphaswarm package. Re-runs on every PR touching **/*.py.

# alphaswarm

This page would normally be auto-generated by `mdxify` from the
Python source. The extraction binary is not available in this
environment — re-run `pnpm --filter alphaswarm_docs extract-python` once
griffe + griffe-pydantic + mdxify are installed:

```powershell
pip install "griffe>=1.5" "griffe-pydantic>=0.1.4" "mdxify>=0.1"
pnpm --filter alphaswarm_docs extract-python
```

## Source

- Module: `alphaswarm`
- Search path: `C:\Users\Julian Wiley\Documents\AlphaSwarm\code`


<!-- https://alpha-swarm.ai/reference/python/alphaswarm_bots -->
# alphaswarm_bots
> Auto-generated reference for the alphaswarm_bots package. Re-runs on every PR touching **/*.py.

# alphaswarm_bots

This page would normally be auto-generated by `mdxify` from the
Python source. The extraction binary is not available in this
environment — re-run `pnpm --filter alphaswarm_docs extract-python` once
griffe + griffe-pydantic + mdxify are installed:

```powershell
pip install "griffe>=1.5" "griffe-pydantic>=0.1.4" "mdxify>=0.1"
pnpm --filter alphaswarm_docs extract-python
```

## Source

- Module: `alphaswarm_bots`
- Search path: `C:\Users\Julian Wiley\Documents\AlphaSwarm\code`


<!-- https://alpha-swarm.ai/reference/python/alphaswarm_controller -->
# alphaswarm_controller
> Auto-generated reference for the alphaswarm_controller package. Re-runs on every PR touching **/*.py.

# alphaswarm_controller

This page would normally be auto-generated by `mdxify` from the
Python source. The extraction binary is not available in this
environment — re-run `pnpm --filter alphaswarm_docs extract-python` once
griffe + griffe-pydantic + mdxify are installed:

```powershell
pip install "griffe>=1.5" "griffe-pydantic>=0.1.4" "mdxify>=0.1"
pnpm --filter alphaswarm_docs extract-python
```

## Source

- Module: `alphaswarm_controller`
- Search path: `C:\Users\Julian Wiley\Documents\AlphaSwarm\code\alphaswarm_controller\src`


<!-- https://alpha-swarm.ai/reference/python/alphaswarm_core -->
# alphaswarm_core
> Auto-generated reference for the alphaswarm_core package. Re-runs on every PR touching **/*.py.

# alphaswarm_core

This page would normally be auto-generated by `mdxify` from the
Python source. The extraction binary is not available in this
environment — re-run `pnpm --filter alphaswarm_docs extract-python` once
griffe + griffe-pydantic + mdxify are installed:

```powershell
pip install "griffe>=1.5" "griffe-pydantic>=0.1.4" "mdxify>=0.1"
pnpm --filter alphaswarm_docs extract-python
```

## Source

- Module: `alphaswarm_core`
- Search path: `C:\Users\Julian Wiley\Documents\AlphaSwarm\code\alphaswarm_core\src`


<!-- https://alpha-swarm.ai/reference/python/alphaswarm_models -->
# alphaswarm_models
> Auto-generated reference for the alphaswarm_models package. Re-runs on every PR touching **/*.py.

# alphaswarm_models

This page would normally be auto-generated by `mdxify` from the
Python source. The extraction binary is not available in this
environment — re-run `pnpm --filter alphaswarm_docs extract-python` once
griffe + griffe-pydantic + mdxify are installed:

```powershell
pip install "griffe>=1.5" "griffe-pydantic>=0.1.4" "mdxify>=0.1"
pnpm --filter alphaswarm_docs extract-python
```

## Source

- Module: `alphaswarm_models`
- Search path: `C:\Users\Julian Wiley\Documents\AlphaSwarm\code\alphaswarm_models\src`


<!-- https://alpha-swarm.ai/reference/python/alphaswarm_rl -->
# alphaswarm_rl
> Auto-generated reference for the alphaswarm_rl package. Re-runs on every PR touching **/*.py.

# alphaswarm_rl

This page would normally be auto-generated by `mdxify` from the
Python source. The extraction binary is not available in this
environment — re-run `pnpm --filter alphaswarm_docs extract-python` once
griffe + griffe-pydantic + mdxify are installed:

```powershell
pip install "griffe>=1.5" "griffe-pydantic>=0.1.4" "mdxify>=0.1"
pnpm --filter alphaswarm_docs extract-python
```

## Source

- Module: `alphaswarm_rl`
- Search path: `C:\Users\Julian Wiley\Documents\AlphaSwarm\code\alphaswarm_rl\src`


<!-- https://alpha-swarm.ai/reference/python -->
# Python reference
> Auto-generated module / class / function reference for alphaswarm / alphaswarm_rl / alphaswarm_models / alphaswarm_controller / alphaswarm_core, via Griffe + griffe-pydantic + mdxify.

# Python reference

This tree is auto-generated by
[alphaswarm_docs/scripts/extract-python.ts](https://github.com/julianwileymac/alphaswarm/blob/main/alphaswarm_docs/scripts/extract-python.ts)
on every CI run that touches `**/*.py`. The extraction pipeline is:

1. **Griffe** walks the Python AST and parses signatures, type hints,
   docstrings, and dynamic attributes.
2. **griffe-pydantic** teaches Griffe to render Pydantic model
   constraints, validators, and aliases — critical for FastAPI
   request/response models.
3. **mdxify** emits MDX with Docusaurus-native admonitions and
   navigation generation.

The output mirrors the source tree under
[alphaswarm_docs/docs/reference/python/](./).

## Top-level packages

- **`alphaswarm`** — quant runtime (strategy, backtest, agents, RAG, data).
- **`alphaswarm_rl`** — RL subsystem (RLRuntime, RLComponent metaclass, etc.).
- **`alphaswarm_models`** — ML framework, AlphaBacktestExperiment, model serving.
- **`alphaswarm_controller`** — workload lifecycle, TerraformRuntime.
- **`alphaswarm_core`** — shared ABCs, value types, auth filters.

## Docstring style

AlphaSwarm standardises on Google-style docstrings. Griffe also parses
reST and NumPy styles, but mixed styles confuse downstream tooling.

```python
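from __future__ import annotations  # keeps the annotations below unevaluated

from typing import Literal

import pyarrow as pa

# BusinessMetadata and SnapshotResult are AlphaSwarm types; their import
# path is not shown here.


def append_arrow(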
    table: str,
    arrow_table: pa.Table,
    *,
    namespace: str,
    medallion_layer: Literal["bronze", "silver", "gold"],
    business_metadata: BusinessMetadata | None = None,
) -> SnapshotResult:
    """Append an Arrow table to an Iceberg table.

    The single sanctioned write path for Iceberg in AlphaSwarm. See
    AGENTS rule 3.

    Args:
        table: The Iceberg table name (without namespace prefix).
        arrow_table: The data to append.
        namespace: The medallion-qualified namespace
            (`alphaswarm_bronze_*`, `alphaswarm_silver_*`, `alphaswarm_gold_*`).
        medallion_layer: Must match the namespace prefix.
        business_metadata: Optional active-metadata block.

    Returns:
        A `SnapshotResult` with the new manifest list location and
        the snapshot id.

    Raises:
        IcebergNamespaceError: If the namespace prefix does not
            match the declared layer.
    """
```

## Reading the generated docs

Browse via the sidebar to the left. Each module page shows:

- A summary line from the first paragraph of the module docstring.
- Every public class, function, and dataclass with full signature.
- A "Source" link back to GitHub.
- A "Used by" cross-reference graph (Phase 6 — backed by the
  Codebase MCP server's symbol index).

## Breaking-change detection

The CI surface runs `griffe check` against every PR. Any API
removal / signature change posts a comment on the PR and requires
a `breaking-change` label + matching Changeset entry.


<!-- https://alpha-swarm.ai/release-notes/2026-06-01-initial-release -->
# Release 2026-06-01 — Docs migration + first API epoch
> docs.alpha-swarm.ai launches as the canonical documentation site; first Stripe-style API epoch lands.

# Release 2026-06-01 — Docs migration + first API epoch

This release brings two significant changes for AlphaSwarm customers: the docs migration and the first API epoch.

## New

- **docs.alpha-swarm.ai is live.** The canonical documentation site
  replaces the previous GitHub-rendered tree at `alphaswarm_docs/`.
  Every previous link continues to resolve via 301 redirects in
  [`alphaswarm_docs/static/_redirects`](https://github.com/julianwileymac/alphaswarm/blob/main/alphaswarm_docs/static/_redirects).
- **Interactive API playground** at
  [/reference/api](../reference/api/index.mdx). Token persistence
  works end-to-end; copy a `Bearer` from `alphaswarm-cli auth login` and
  every request from the playground inherits it.
- **AI-native surfaces.** A curated `/llms.txt` index, the full
  corpus at `/llms-full.txt`, and an RFC 9728 + 8707-compliant
  MCP server at `/mcp` are now first-class agent entry points.
- **In-product help panel** in the operator UI — the help drawer
  reads directly from the docs corpus, so the in-product reference
  never drifts from the public site.

## Improved

- **Search is local-first.** Pagefind indexes the entire corpus
  client-side; no documentation content ever leaves the docs site.
- **Quality gates.** Every PR to `alphaswarm_docs/` runs Vale + alex.js +
  markdownlint + lychee + Lighthouse + axe-core + executable Python
  snippets via pytest-markdown-docs.
- **Hybrid authoring.** Business editors can ship docs through
  Keystatic at [/keystatic](/keystatic). Their typed-schema commits go
  through the same branch protection rules as engineers' changes.

## API

- **First Stripe-style date epoch:** `2026-06-01`. No surface
  changes from the prior unversioned API; future epochs will
  follow the documented 12-month sunset cycle with
  `Deprecation` (RFC 9745) and `Sunset` (RFC 8594) response headers.
- **OpenAPI specs committed** at
  [`alphaswarm_docs/openapi/alphaswarm.json`](https://github.com/julianwileymac/alphaswarm/blob/main/alphaswarm_docs/openapi/alphaswarm.json)
  and
  [`alphaswarm_docs/openapi/control-plane.json`](https://github.com/julianwileymac/alphaswarm/blob/main/alphaswarm_docs/openapi/control-plane.json).
  `oasdiff` PR gates prevent silent drift.
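
A client can watch for the lifecycle headers named in the epoch bullet to learn when a contract is being retired. The sketch below is a minimal illustration, not AlphaSwarm client code: `parse_lifecycle_headers` and the sample header values are hypothetical. It assumes `Sunset` carries an HTTP-date as RFC 8594 defines and `Deprecation` carries a structured-field date of the form `@<unix-seconds>` as RFC 9745 defines.

```python
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime


def parse_lifecycle_headers(headers):
    """Return (deprecated_at, sunset_at) as aware datetimes, or None when absent."""
    deprecated_at = None
    sunset_at = None

    raw_deprecation = headers.get("Deprecation")
    if raw_deprecation and raw_deprecation.startswith("@"):
        # Structured-field date: "@" followed by seconds since the Unix epoch.
        deprecated_at = datetime.fromtimestamp(int(raw_deprecation[1:]), tz=timezone.utc)

    raw_sunset = headers.get("Sunset")
    if raw_sunset:
        # Classic HTTP-date, e.g. "Tue, 01 Jun 2027 00:00:00 GMT".
        sunset_at = parsedate_to_datetime(raw_sunset)

    return deprecated_at, sunset_at


# Hypothetical response headers, for illustration only.
sample = {
    "Deprecation": "@1780272000",
    "Sunset": "Tue, 01 Jun 2027 00:00:00 GMT",
}
deprecated_at, sunset_at = parse_lifecycle_headers(sample)
```

A caller could log a warning whenever `sunset_at` is not `None` and falls inside its own upgrade window.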

## Internal

- Cloudflare Pages + Cloudflare Access front the docs site as a
  separate edge property — the cluster tunnel at `alpha-swarm.ai` /
  `api.alpha-swarm.ai` / `manage.alpha-swarm.ai` continues unchanged.
- Logpush ships request + Access audit logs to a new R2 bucket
  (`alphaswarm-docs-access-logs`) with 365-day retention for SOC 2 /
  ISO 27001 evidence.
- Instatus at `status.alpha-swarm.ai` is now the canonical status page;
  the docs site renders a live banner via the Instatus JSON API.

## Migration notes

If you have hard-coded references to `alphaswarm_docs/<path>.md` in
your own tooling, both the legacy path and its docs.alpha-swarm.ai
equivalent resolve correctly today; the legacy shape will return a 410
starting 2027-06-01 (the 12-month sunset window applies to URL paths as
well as API epochs).


<!-- https://alpha-swarm.ai/release-notes -->
# Release notes
> Customer-facing release notes for AlphaSwarm. Generated from Changesets on every release.

# Release notes

Customer-facing release notes for AlphaSwarm. New
entries land here whenever a PR's Changeset is marked
`audience: customer` or `audience: both` (see
[`.changeset/README.md`](https://github.com/julianwileymac/alphaswarm/blob/main/.changeset/README.md)).

For the full technical changelog (every commit, including
non-customer-facing internal refactors), see
[CHANGELOG.md](https://github.com/julianwileymac/alphaswarm/blob/main/CHANGELOG.md).

## Subscribe

- **RSS / Atom feed**: built from this folder by Docusaurus —
  available at [/blog/rss.xml](/blog/rss.xml).
- **In-product changelog widget**: powered by
  [`/release-notes.json`](/release-notes.json) (Headway-compatible).
- **Email digest**: opt in from the operator UI profile menu.

## API epochs

AlphaSwarm uses Stripe-style date-epoch API versioning. New epochs:

- Roll out on the first of the month (`2026-06-01`, `2026-09-01`, …).
- Preserve old contracts via the `Deprecation` (RFC 9745) and `Sunset`
  (RFC 8594) HTTP headers for a 12-month sunset cycle.
- Move to [archive.alpha-swarm.ai](https://archive.alpha-swarm.ai) when fully
  retired.

The matching reference docs live at
[/reference/api/](../reference/api/index.mdx).
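
As a worked example of the sunset arithmetic, the sketch below derives the retirement date for an epoch. It is illustrative only: `sunset_date` is a hypothetical helper, and it assumes the 12-month window is counted in calendar months from the epoch date, which the text above implies but does not spell out.

```python
from datetime import date


def sunset_date(epoch: date, months: int = 12) -> date:
    """Return the date `months` calendar months after an epoch date."""
    # Epochs roll out on the first of the month, so the day is always valid
    # in the target month.
    total = epoch.year * 12 + (epoch.month - 1) + months
    return date(total // 12, total % 12 + 1, epoch.day)


epoch = date.fromisoformat("2026-06-01")
retired_on = sunset_date(epoch)
print(retired_on)  # 2027-06-01
```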


<!-- https://alpha-swarm.ai/tutorials/first-agent-workflow -->
# Your first agent workflow
> Compose a three-node LangGraph (Research / Selection / Trader), run it through AgentRuntime, inspect the agent_runs_v2 ledger.

# Your first agent workflow

Goal: stand up a three-node agentic loop driven by `WorkflowRuntime`,
see it through one complete iteration, and inspect the immutable
`agent_runs_v2` rows it produces.

## Why

`AgentSpec` + `AgentRuntime` is AlphaSwarm's "skill artifact": every agent
spec is hash-locked into `agent_spec_versions` and every run is audited
through `agent_runs_v2`. Combined with the additive `WorkflowRuntime`
(orchestration adapter pattern), this is how AlphaSwarm composes
multi-agent pipelines without losing replay or kill-switch
semantics. See [Concept: workflow studio](../concepts/agentic/workflow-studio.md).

## Step 1 — author the workflow

Create `configs/workflows/my_first_workflow.yaml`:

```yaml
name: MyFirstResearchLoop
adapter_kind: graph
nodes:
  - id: research
    agent_spec: configs/agents/research_lite.yaml
    inputs:
      universe: [SPY, QQQ, IWM]
      lookback_days: 30
  - id: selection
    agent_spec: configs/agents/selection_lite.yaml
    depends_on: [research]
  - id: trader
    agent_spec: configs/agents/trader_paper.yaml
    depends_on: [selection]
edges:
  - { from: research, to: selection }
  - { from: selection, to: trader }
cost_caps:
  per_node_max_tokens: 4000
  per_run_max_usd: 0.50
halt_check_seconds: 5
```
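
Before dispatching, it can help to confirm that the `edges` list and each node's `depends_on` describe the same graph. The sketch below is a hypothetical pre-flight check, not part of `WorkflowRuntime`. It mirrors the YAML above as a plain dict (so it needs no YAML parser) and assumes the two fields are meant to agree, which the source does not state.

```python
from graphlib import TopologicalSorter

# Hand-copied subset of the YAML above as plain Python data.
workflow = {
    "nodes": [
        {"id": "research", "depends_on": []},
        {"id": "selection", "depends_on": ["research"]},
        {"id": "trader", "depends_on": ["selection"]},
    ],
    "edges": [
        {"from": "research", "to": "selection"},
        {"from": "selection", "to": "trader"},
    ],
}


def execution_order(spec):
    """Check that `edges` mirrors `depends_on`, then return a valid node order."""
    deps = {n["id"]: set(n.get("depends_on", [])) for n in spec["nodes"]}
    from_edges = {node_id: set() for node_id in deps}
    for edge in spec["edges"]:
        from_edges[edge["to"]].add(edge["from"])  # unknown target id raises KeyError
    if deps != from_edges:
        raise ValueError("edges and depends_on disagree")
    # static_order() raises graphlib.CycleError if the graph has a cycle.
    return list(TopologicalSorter(deps).static_order())


order = execution_order(workflow)
```

For this linear chain the only valid order is research, then selection, then trader.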

## Step 2 — snapshot + run

```bash
curl -X POST http://localhost:8000/workflows/MyFirstResearchLoop/run \
    -H "Content-Type: application/json" \
    -d '{}'
```

`WorkflowRuntime`:

1. Hash-locks the spec into `workflow_spec_versions`.
2. Reads each referenced `AgentSpec` and hash-locks them into
   `agent_spec_versions`.
3. Begins traversing the DAG, calling each agent through
   `AgentRuntime`.
4. Emits canonical progress frames per AGENTS rule 4.
5. Writes `agent_runs_v2` and `workflow_runs` rows.

## Step 3 — watch the breadcrumbs

The operator UI renders the workflow live at
`/workflows/runs/<workflow_run_id>`. From the CLI:

```bash
docker exec alphaswarm-api python -c "from alphaswarm.ws.broker import subscribe; \
    [print(m) for m in subscribe('<task_id>')]"
```

## Step 4 — inspect the ledger

```sql
SELECT id, workflow_name, status, started_at, ended_at, total_tokens, total_cost_usd
FROM workflow_runs ORDER BY started_at DESC LIMIT 1;

SELECT id, agent_name, node_id, status, total_tokens
FROM agent_runs_v2
WHERE workflow_run_id = '<workflow_run_id>'
ORDER BY started_at;
```

You should see three `agent_runs_v2` rows — one per node.

## Step 5 — replay

```bash
curl -X POST http://localhost:8000/workflows/runs/<workflow_run_id>/replay
```

Same hash-locked spec versions, new run row.

## Step 6 — halt

The kill switch fans out:

```bash
curl -X POST http://localhost:8000/workflows/halt
```

Every running workflow stops; `agent_runs_v2` rows close with
`status=halted`.

## Verify

- [ ] `workflow_spec_versions` row with a `spec_hash`.
- [ ] Three `agent_spec_versions` rows (one per node).
- [ ] One `workflow_runs` row + three `agent_runs_v2` rows.
- [ ] Total cost in USD ≤ `per_run_max_usd` from the spec.
- [ ] Replay produces a new `workflow_runs` row but reuses the
  same spec-version rows.

## What next

- [Concept: agentic pipeline](../concepts/agentic/agentic-pipeline.md) —
  the full five-stage lifecycle (models, data, snapshot, dispatch,
  review) and how this tutorial maps to it.
- [Concept: workflow studio](../concepts/agentic/workflow-studio.md) —
  the seven adapter kinds (graph / crew / debate / fusion /
  execution / schedule / studio).
- [Concept: multi-agent patterns](../concepts/agentic/multi-agent-patterns.md) —
  Sequential / Parallel / Debate / Coordinator / ReAct topologies.
- [Tutorial: first RL experiment](./first-rl-experiment.md) — hand
  RL outputs into an agent loop.


<!-- https://alpha-swarm.ai/tutorials/first-backtest -->
# Your first backtest
> Author a momentum strategy, run it through EventDrivenBacktester, inspect the ledger row, render a tearsheet.

# Your first backtest

Goal: from blank slate to a backtest with a non-zero Sharpe on
your screen, in under 5 minutes.

## Why

The backtest pipeline is the central artifact of every AlphaSwarm workflow.
Every strategy gets backtested before paper, every paper run gets
promoted on the back of backtest evidence, and every RL policy gets
evaluated against the same engine. Understanding the backtest
contract is a prerequisite to understanding anything else.

## Prerequisites

- The [quickstart](../intro/quickstart.md) completed.
- An open terminal pointing at the repo root.

## Step 1 — author the strategy

Create `configs/strategies/my_first_strategy.yaml`:

```yaml
name: MyFirstMomentum
kind: alpha
class: alphaswarm.strategies.framework.algorithms.MomentumAlpha
module_path: alphaswarm.strategies.framework.algorithms
universe:
  kind: static
  symbols:
    - { ticker: SPY, exchange: ARCA, kind: equity }
    - { ticker: QQQ, exchange: NASDAQ, kind: equity }
    - { ticker: IWM, exchange: ARCA, kind: equity }
kwargs:
  lookback_days: 60
  rebalance_freq: weekly
  top_n: 2
risk:
  max_position_pct: 0.5
  max_drawdown_pct: 0.15
```

The `class` + `module_path` + `kwargs` pattern is Qlib-style and
required for every strategy registry entry. See
[AGENTS rule 8](https://github.com/julianwileymac/alphaswarm/blob/main/AGENTS.md).
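
The `class` + `module_path` + `kwargs` pattern can be resolved with a few lines of `importlib`. The following is a minimal sketch of the general technique, not AlphaSwarm's registry code: `init_from_config` is a hypothetical helper, and the demo config targets the standard library's `fractions.Fraction` so it runs without the repo.

```python
import importlib


def init_from_config(config):
    """Import `module_path`, look up `class`, and call it with `kwargs`."""
    module = importlib.import_module(config["module_path"])
    # Accept either a bare class name or a dotted path ending in the class name.
    class_name = config["class"].rsplit(".", 1)[-1]
    cls = getattr(module, class_name)
    return cls(**config.get("kwargs", {}))


# Stand-in config that targets a standard-library class for illustration.
demo = {
    "class": "fractions.Fraction",
    "module_path": "fractions",
    "kwargs": {"numerator": 3, "denominator": 4},
}
obj = init_from_config(demo)
```

Keeping construction data-driven like this is what lets a YAML file, rather than code, select the strategy class and its parameters.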

## Step 2 — dispatch the backtest

```bash
docker exec alphaswarm-api python -m alphaswarm.cli.cli backtest \
    --config configs/strategies/my_first_strategy.yaml \
    --start 2024-01-01 \
    --end 2024-06-30 \
    --engine event_driven
```

The CLI returns a `task_id`. Tail its progress:

```bash
docker exec alphaswarm-api python -c "from alphaswarm.ws.broker import subscribe; \
    [print(m) for m in subscribe('<task_id>')]"
```

You will see progress frames in the canonical
`{task_id, stage, message, timestamp, **extras}` shape.
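
To make the envelope concrete, the sketch below builds and checks a frame with the four named keys plus free-form extras. It is an illustration under assumptions: `make_frame` and `is_canonical` are hypothetical helpers, the stage name is made up, and the ISO 8601 timestamp format is a guess, since the source only names the keys.

```python
from datetime import datetime, timezone

REQUIRED_KEYS = ("task_id", "stage", "message", "timestamp")


def make_frame(task_id, stage, message, **extras):
    """Build a frame with the four canonical keys plus any extra fields."""
    if "timestamp" in extras:
        raise ValueError("timestamp is set by the emitter")
    return {
        "task_id": task_id,
        "stage": stage,
        "message": message,
        "timestamp": datetime.now(timezone.utc).isoformat(),
        **extras,
    }


def is_canonical(frame):
    """True when every required key is present; extras are allowed."""
    return all(key in frame for key in REQUIRED_KEYS)


frame = make_frame("demo-task", "backtest.started", "loading bars", bars=126)
```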

## Step 3 — inspect the ledger

```bash
docker exec alphaswarm-postgres psql -U alphaswarm -d alphaswarm -c \
    "SELECT id, strategy_name, sharpe, total_return, max_drawdown
     FROM backtest_runs ORDER BY created_at DESC LIMIT 5;"
```

The most recent row is your run. If `sharpe` is `NULL`, the backtest
failed — see Step 5.

## Step 4 — render a tearsheet

```bash
curl -X POST http://localhost:8000/analytics/portfolio/tearsheet \
    -H "Content-Type: application/json" \
    -d '{"run_id": "<run_id>"}'
```

The endpoint returns another `task_id`; the resulting HTML tearsheet
lands at `/analytics/portfolio/<run_id>/tearsheet.html` once Celery
finishes rendering.

Open it in your browser. Or use the operator UI route
[/analytics/portfolio/:runId](http://localhost:3001/analytics/portfolio).

## Step 5 — handle expected failures

**`InsufficientDataError`** — no price data has been ingested for the
universe yet. Run the yfinance ingest:

```bash
docker exec alphaswarm-api python -m scripts.ingest_yfinance \
    --symbols SPY,QQQ,IWM --start 2023-01-01 --end 2024-12-31
```

**`StrategyRegistryMissError`** — the YAML's `class` field
references a class that is not decorated with `@register`. Open
[alphaswarm/strategies/framework/algorithms.py](https://github.com/julianwileymac/alphaswarm/blob/main/alphaswarm/strategies/framework/algorithms.py)
and confirm `MomentumAlpha` is there. If you renamed the class,
update the YAML.

**`IcebergNamespaceError`** — your local Iceberg catalog has not
been migrated. Run `make iceberg-bootstrap` and retry.

## Verify

- [ ] `backtest_runs` row visible with non-NULL `sharpe`.
- [ ] Tearsheet HTML renders.
- [ ] Strategy YAML committed under `configs/strategies/`.

## What next

- [Concept: backtest engines](../concepts/strategy/backtest-engines.md) —
  what `event_driven` vs `vbtpro` vs `hft` actually do.
- [Recipe: run a backtest from YAML](../how-to/recipes/run-a-backtest-from-yaml.md) —
  the same thing, but as a how-to for repeated dispatch.
- [Tutorial: first bot](./first-bot.md) — wrap this strategy in a
  reusable bot spec.


<!-- https://alpha-swarm.ai/tutorials/first-bot -->
# Your first bot
> Wrap a backtested strategy in a TradingBot spec, snapshot the immutable version, run a paper session.

# Your first bot

Goal: take the strategy from [first-backtest](./first-backtest.md)
and wrap it in a `BotSpec` so it can be paper-traded, deployed to
Kubernetes, or chat-driven — all from a single immutable contract.

## Why

A bot is the smallest deployable unit in AlphaSwarm. It aggregates the
universe + strategy + engine + ML models + agents + RAG + risk
limits + metrics into one hash-locked spec that
[`BotRuntime`](https://github.com/julianwileymac/alphaswarm/blob/main/alphaswarm_bots/runtime.py)
can drive through every lifecycle stage. See
[Concept: bots](../concepts/agentic/bots.md).

## Step 1 — author the BotSpec

Create `configs/bots/my_first_bot.yaml`:

```yaml
name: MyFirstBot
kind: trading
description: 'First-bot tutorial — wraps MyFirstMomentum.'
strategy_config: configs/strategies/my_first_strategy.yaml
engine: event_driven
risk:
  max_position_pct: 0.5
  max_daily_loss_pct: 0.02
  kill_switch_attached: true
metrics:
  - sharpe
  - sortino
  - max_drawdown
  - hit_rate
deploy_target: paper
```

## Step 2 — snapshot the spec

```bash
# --data-binary keeps the YAML newlines that -d would strip.
curl -X POST http://localhost:8000/bots \
    -H "Content-Type: application/json" \
    --data-binary @configs/bots/my_first_bot.yaml
```

This persists a `bot_versions` row with the hash-locked spec. The
response includes the `bot_id` (use this everywhere downstream)
and the `spec_hash`. Different content → different hash → new
version row; the old version stays intact for replay.
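
The "different content, different hash, new version row" rule can be illustrated with an in-memory stand-in for `bot_versions`. This is a sketch under assumptions, not the server's implementation: hashing canonical JSON with SHA-256 and the row layout are both guesses, and `spec_hash` / `snapshot` are hypothetical helpers.

```python
import hashlib
import json


def spec_hash(spec):
    """Hash a spec dict via canonical JSON so key order does not matter."""
    canonical = json.dumps(spec, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def snapshot(versions, spec):
    """Append a version row unless an identical spec is already stored."""
    digest = spec_hash(spec)
    for row in versions:
        if row["spec_hash"] == digest:
            return row  # identical content: reuse the existing version
    row = {"version": len(versions) + 1, "spec_hash": digest, "spec": spec}
    versions.append(row)
    return row


versions = []  # in-memory stand-in for the bot_versions table
v1 = snapshot(versions, {"name": "MyFirstBot", "engine": "event_driven"})
v1_again = snapshot(versions, {"engine": "event_driven", "name": "MyFirstBot"})
v2 = snapshot(versions, {"name": "MyFirstBot", "engine": "vbtpro"})
```

Re-submitting the same content returns the existing row, while changing any field yields a second row and leaves the first untouched for replay.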

## Step 3 — backtest the bot

```bash
curl -X POST http://localhost:8000/bots/<bot_id>/backtest \
    -H "Content-Type: application/json" \
    -d '{"start":"2024-01-01","end":"2024-06-30"}'
```

Same engine as the prior tutorial, but the bot's risk overlays
apply. The ledger row in `backtest_runs` carries the `bot_id` so
you can correlate.

## Step 4 — paper-trade the bot

```bash
curl -X POST http://localhost:8000/bots/<bot_id>/paper \
    -H "Content-Type: application/json" \
    -d '{"starting_cash":100000}'
```

`BotRuntime` creates a `paper_trading_runs` row and attaches the
bot to the paper broker session loop in
[alphaswarm/trading/paper_trading.py](https://github.com/julianwileymac/alphaswarm/blob/main/alphaswarm/trading/paper_trading.py).

Watch the live WebSocket feed:

```javascript
const ws = new WebSocket("ws://localhost:8000/ws/paper/<paper_run_id>");
ws.onmessage = (e) => console.log(JSON.parse(e.data));
```

You will see fills, position updates, and equity-curve points
streaming through the canonical progress-frame envelope.

## Step 5 — halt the bot

The bot has the kill switch attached (see Step 1 — `kill_switch_attached: true`).
Trigger a halt:

```bash
curl -X POST http://localhost:8000/bots/halt-all
```

Every paper session under every bot stops within ~250 ms.

## Verify

- [ ] `bot_versions` row visible with a `spec_hash`.
- [ ] `backtest_runs` row tagged with your `bot_id`.
- [ ] `paper_trading_runs` row visible.
- [ ] WebSocket feed delivered frames.
- [ ] Kill switch halted the bot.

## What next

- [Concept: bots](../concepts/agentic/bots.md) — the full bot
  contract + deployment targets (paper / k8s / backtest_only).
- [Recipe: promote a bot to paper](../how-to/recipes/promote-a-bot-to-paper.md) —
  same thing, but as a how-to.
- [Tutorial: first paper trading session](./first-paper-trading-session.md) —
  go deeper on the paper-trading lifecycle and risk overlays.


<!-- https://alpha-swarm.ai/tutorials/first-paper-trading-session -->
# Your first paper trading session
> Attach a bot to the paper broker, watch the WebSocket frames, trigger the kill switch.

# Your first paper trading session

Goal: drive a paper-trading session from the bot you authored in
[first-bot](./first-bot.md). End-to-end: dispatch → fills → kill.

## Why

Paper trading is the highest-fidelity dress rehearsal AlphaSwarm supports
without putting real money at risk. Same broker abstraction, same
risk overlays, same kill-switch wiring as live trading. The
difference is that fills come from the simulated execution engine
in [alphaswarm/trading/paper_trading.py](https://github.com/julianwileymac/alphaswarm/blob/main/alphaswarm/trading/paper_trading.py).

See [Concept: paper trading](../concepts/trading/paper-trading.md).

## Step 1 — verify the bot is ready

```bash
curl http://localhost:8000/bots/<bot_id>
```

Confirm the response includes a recent `backtest_runs` reference and
non-zero `sharpe`. The
[paper-metadata-gate](../concepts/trading/paper-metadata-gate.md)
will refuse to start the session otherwise.

## Step 2 — start the session

```bash
curl -X POST http://localhost:8000/bots/<bot_id>/paper \
    -H "Content-Type: application/json" \
    -d '{"starting_cash":100000,"duration_minutes":60}'
```

The response includes `paper_run_id`. The session is now in
the canonical Celery loop; `alphaswarm-worker` polls the broker once per
second.

## Step 3 — watch the WebSocket

In a browser console:

```javascript
const ws = new WebSocket("ws://localhost:8000/ws/paper/<paper_run_id>");
ws.onmessage = (e) => {
  const frame = JSON.parse(e.data);
  console.log(frame.stage, frame.message, frame.equity, frame.positions);
};
```

You should see:

- `bar.received` — every minute bar.
- `signal.emitted` — when the strategy says "buy" / "sell" / "flat".
- `order.placed` — order goes to the simulated broker.
- `order.filled` — fill comes back; positions update.
- `equity.update` — equity-curve point at the end of each bar.

All frames follow the canonical `{task_id, stage, message,
timestamp, **extras}` envelope per AGENTS rule 4.
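
Outside the browser, the same frames can be consumed by dispatching on `stage`. The sketch below is a hypothetical consumer that works on already-received JSON strings, so it runs without a server. The sample frames are invented for illustration, and the `equity` field name follows the browser snippet above.

```python
import json
from collections import Counter


def summarize_frames(raw_messages):
    """Count frames per stage and track the latest equity value seen."""
    counts = Counter()
    last_equity = None
    for raw in raw_messages:
        frame = json.loads(raw)
        counts[frame["stage"]] += 1
        if frame["stage"] == "equity.update":
            last_equity = frame.get("equity")
    return counts, last_equity


def frame(stage, message, **extras):
    """Serialize a made-up frame in the canonical envelope."""
    return json.dumps({"task_id": "demo", "stage": stage, "message": message,
                       "timestamp": "2024-01-02T14:31:00Z", **extras})


# Hypothetical frames; real payloads may carry more fields.
messages = [
    frame("bar.received", "SPY bar"),
    frame("signal.emitted", "buy SPY"),
    frame("order.placed", "buy 10 SPY"),
    frame("order.filled", "filled 10 SPY"),
    frame("equity.update", "end of bar", equity=100012.5),
]
counts, last_equity = summarize_frames(messages)
```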

## Step 4 — risk + kill switch

The bot's `risk` block (Step 1 of first-bot) is enforced by
[alphaswarm/risk/limits.py::RiskLimits](https://github.com/julianwileymac/alphaswarm/blob/main/alphaswarm/risk/limits.py).
Once any limit is hit, the session emits `risk.halted` and stops.
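
The kind of check that precedes a `risk.halted` frame can be sketched as follows. This is not the `RiskLimits` class from the repo: `Limits` and `breached` are hypothetical, and the sketch assumes both percentages are fractions of equity (position size against current equity, daily loss against start-of-day equity).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Limits:
    # Defaults copied from the `risk` block in the first-bot tutorial.
    max_position_pct: float = 0.5
    max_daily_loss_pct: float = 0.02


def breached(limits, equity, start_of_day_equity, position_values):
    """Return the names of every limit the current state violates."""
    hits = []
    largest = max((abs(v) for v in position_values.values()), default=0.0)
    if largest / equity > limits.max_position_pct:
        hits.append("max_position_pct")
    daily_loss = (start_of_day_equity - equity) / start_of_day_equity
    if daily_loss > limits.max_daily_loss_pct:
        hits.append("max_daily_loss_pct")
    return hits


# A 3% drawdown on the day exceeds the 2% daily-loss limit.
hits = breached(Limits(), 97_000.0, 100_000.0, {"SPY": 40_000.0})
```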

The topbar kill switch in the Vite UI fans out to:

- `POST /bots/halt-all`
- `POST /paper/stop-all`
- `POST /agents/halt`
- `POST /rl/halt-all`
- `POST /workflows/halt`
- `POST /terraform/halt`
- `POST /quant-agents/halt`

The whole stack stops in under 250 ms.
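
A fan-out like this is typically issued in parallel so that one slow subsystem does not delay the others. The sketch below shows one way to do that with a thread pool. It is an assumption about the pattern, not the UI's actual code: `fan_out_halt` is hypothetical and the HTTP transport is injected as a plain callable so the example runs without a server.

```python
from concurrent.futures import ThreadPoolExecutor

# Endpoint paths copied from the list above.
HALT_ENDPOINTS = [
    "/bots/halt-all",
    "/paper/stop-all",
    "/agents/halt",
    "/rl/halt-all",
    "/workflows/halt",
    "/terraform/halt",
    "/quant-agents/halt",
]


def fan_out_halt(post, endpoints=HALT_ENDPOINTS):
    """Call `post(path)` for every endpoint in parallel; collect a result per path."""
    results = {}
    with ThreadPoolExecutor(max_workers=len(endpoints)) as pool:
        futures = {path: pool.submit(post, path) for path in endpoints}
        for path, future in futures.items():
            try:
                results[path] = future.result()
            except Exception as exc:  # one failed halt must not hide the others
                results[path] = exc
    return results


def fake_post(path):
    """Stub transport; swap in a real HTTP POST against the gateway."""
    return 200


outcome = fan_out_halt(fake_post)
```

Collecting exceptions per path, rather than raising on the first failure, keeps the remaining halts running and makes partial failures visible to the operator.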

## Step 5 — inspect the ledger

```sql
SELECT id, bot_id, status, total_pnl, num_fills, started_at, ended_at
FROM paper_trading_runs ORDER BY started_at DESC LIMIT 1;

SELECT order_id, symbol, side, qty, price, filled_at
FROM paper_fills
WHERE paper_run_id = '<paper_run_id>'
ORDER BY filled_at;
```

## Verify

- [ ] WebSocket delivered at least one `order.filled` frame.
- [ ] `paper_trading_runs` row has non-NULL `total_pnl`.
- [ ] Kill switch closed the session.

## What next

- [Concept: paper trading](../concepts/trading/paper-trading.md) — the
  full session loop, broker abstraction, and risk model.
- [Concept: paper metadata gate](../concepts/trading/paper-metadata-gate.md) —
  why some sessions get blocked before they start.
- [How-to: kill switch incident response](../how-to/operations/kill-switch-incident-response.md) —
  the runbook for when the kill switch fires in production.


<!-- https://alpha-swarm.ai/tutorials/first-rl-experiment -->
# Your first RL experiment
> Author an RLExperimentSpec, train via SB3 PPO, replay from the Iceberg trajectory store.

# Your first RL experiment

Goal: from blank `RLExperimentSpec` to a trained PPO agent with
trajectories persisted to Iceberg, in under 10 minutes on CPU.

## Why

The RL stack is AlphaSwarm's most opinionated subsystem: hash-locked
`RLExperimentSpec`, metaclass-registered components, deterministic
Iceberg trajectory persistence, and a single sanctioned executor
([`RLRuntime`](https://github.com/julianwileymac/alphaswarm/blob/main/alphaswarm_rl/src/alphaswarm_rl/runtime.py)).
Every RL run produces an immutable `rl_runs` ledger row and a
replayable trajectory.

See [Concept: RL framework](../concepts/rl/rl-framework.md).
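
For readers new to the idea, "metaclass-registered components" generally means that defining a subclass is enough to make it discoverable by name. The sketch below shows the general technique keyed on the `rl_alias` field used in the experiment spec in Step 1. All class and function names here are hypothetical; the real registry in `alphaswarm_rl` may differ.

```python
class RLComponentMeta(type):
    """Metaclass that records every subclass under its `rl_alias`."""

    registry = {}

    def __new__(mcls, name, bases, namespace):
        cls = super().__new__(mcls, name, bases, namespace)
        alias = namespace.get("rl_alias")
        if alias is not None:
            if alias in mcls.registry:
                raise ValueError(f"duplicate rl_alias: {alias}")
            mcls.registry[alias] = cls
        return cls


class RLComponent(metaclass=RLComponentMeta):
    """Base class; it has no alias of its own, so it is not registered."""

    def __init__(self, **kwargs):
        self.kwargs = kwargs


class PnLReward(RLComponent):
    rl_alias = "PnLReward"


def build(config):
    """Instantiate a component from a `{rl_alias: ..., **kwargs}` mapping."""
    params = dict(config)
    cls = RLComponentMeta.registry[params.pop("rl_alias")]
    return cls(**params)


reward = build({"rl_alias": "PnLReward", "weight": 1.0})
```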

## Prerequisites

- Quickstart completed.
- A small dev dataset under your local Iceberg catalog. The bundled
  `alphaswarm_bronze_yfinance_daily` namespace works.

## Step 1 — author the spec

Create `alphaswarm_rl/configs/experiments/my_first_rl.yaml`:

```yaml
name: MyFirstRLExperiment
description: First-RL tutorial — PPO on a static universe
environment:
  rl_alias: SingleAssetTradingEnv
  symbol: { ticker: SPY, exchange: ARCA, kind: equity }
  lookback_bars: 60
  initial_cash: 100000
data_pipeline:
  rl_alias: IcebergDataPipeline
  namespace: alphaswarm_bronze_yfinance_daily
  start: 2022-01-01
  end: 2023-12-31
agent:
  rl_alias: SB3Adapter
  algorithm: PPO
  policy: MlpPolicy
  total_timesteps: 50000
rewards:
  - { rl_alias: PnLReward, weight: 1.0 }
  - { rl_alias: TurnoverPenalty, weight: 0.1 }
  - { rl_alias: VolatilityPenalty, weight: 0.05 }
observations:
  - { rl_alias: StockstatsObservation }
  - { rl_alias: LookbackObservation, length: 20 }
training:
  advantage: { rl_alias: GAEAdvantage, lambda: 0.95, gamma: 0.99 }
  backbone: { rl_alias: TransformerBackbone, d_model: 64, n_heads: 4 }
```
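
The `training.advantage` entry names Generalized Advantage Estimation with `lambda: 0.95` and `gamma: 0.99`. For reference, the textbook recursion is sketched below in plain Python. This is the standard algorithm, not the code behind `GAEAdvantage`; the function name and the single-episode, no-early-termination setup are simplifying assumptions.

```python
def gae_advantages(rewards, values, last_value, gamma=0.99, lam=0.95):
    """Generalized Advantage Estimation for one episode with no early termination."""
    advantages = [0.0] * len(rewards)
    next_value = last_value
    running = 0.0
    for t in reversed(range(len(rewards))):
        # One-step TD error: r_t + gamma * V(s_{t+1}) - V(s_t)
        delta = rewards[t] + gamma * next_value - values[t]
        # Exponentially weighted sum of future TD errors.
        running = delta + gamma * lam * running
        advantages[t] = running
        next_value = values[t]
    return advantages


# With gamma = lam = 1 and zero value estimates, the advantage is the reward-to-go.
adv = gae_advantages([1.0, 1.0, 1.0], [0.0, 0.0, 0.0], 0.0, gamma=1.0, lam=1.0)
print(adv)  # [3.0, 2.0, 1.0]
```

Lower `lam` values weight the one-step TD error more heavily, trading variance for bias.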

## Step 2 — snapshot + train

```bash
curl -X POST http://localhost:8000/rl/runs \
    -H "Content-Type: application/json" \
    -d '{"spec_path":"alphaswarm_rl/configs/experiments/my_first_rl.yaml","mode":"train"}'
```

The response includes the `rl_run_id`. Tail the progress:

```bash
docker exec alphaswarm-api python -c "from alphaswarm.ws.broker import subscribe; \
    [print(m) for m in subscribe('<rl_run_id>')]"
```

50k timesteps finish in roughly 5-8 minutes on a CPU.

## Step 3 — inspect the ledger + trajectory store

```sql
-- rl_runs ledger
SELECT id, experiment_name, status, total_timesteps, mean_reward
FROM rl_runs ORDER BY created_at DESC LIMIT 5;
```

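The trajectory data lives in Iceberg under
`alphaswarm_silver_rl_trajectories.<table>`:

```python
from pyiceberg.catalog import load_catalog

# Load the catalog registered under the name "alphaswarm".
cat = load_catalog("alphaswarm")
# Replace <table> with the trajectory table for your run.
tbl = cat.load_table("alphaswarm_silver_rl_trajectories.<table>")
# Scan the full table into pandas and preview the first 20 steps.
df = tbl.scan().to_pandas()
print(df[["episode", "step", "reward", "action"]].head(20))
```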
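
Once the rows are loaded, a common first check is the total reward per episode. The sketch below does this in plain Python over dict rows so it runs without pandas or a catalog. `episode_returns` is a hypothetical helper and the sample rows are invented; only the column names come from the scan above.

```python
from collections import defaultdict


def episode_returns(rows):
    """Sum `reward` per `episode` from trajectory rows (dicts)."""
    totals = defaultdict(float)
    for row in rows:
        totals[row["episode"]] += row["reward"]
    return dict(totals)


# Hypothetical rows using the trajectory column names.
rows = [
    {"episode": 0, "step": 0, "reward": 0.5, "action": 1},
    {"episode": 0, "step": 1, "reward": -0.25, "action": 0},
    {"episode": 1, "step": 0, "reward": 1.0, "action": 1},
]
returns = episode_returns(rows)
```

With the DataFrame from the scan, `df.to_dict("records")` produces rows in this shape, and `df.groupby("episode")["reward"].sum()` computes the same totals directly in pandas.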

## Step 4 — replay

```bash
curl -X POST http://localhost:8000/rl/runs/<rl_run_id>/replay \
    -H "Content-Type: application/json" \
    -d '{"start":"2024-01-01","end":"2024-03-31"}'
```

Same hash-locked spec, new data window, separate `rl_runs` row.

## Step 5 — halt

```bash
curl -X POST http://localhost:8000/rl/halt-all
```

## Verify

- [ ] `rl_experiment_versions` row with a `spec_hash`.
- [ ] `rl_runs` row with non-NULL `mean_reward`.
- [ ] Iceberg trajectory table populated.
- [ ] Replay produces a different `rl_runs` row but reuses the
  same `rl_experiment_versions` row (hash-locked!).

## What next

- [Concept: RL components](../concepts/rl/rl-components.md) — add
  your own reward term, observation builder, or policy backbone.
- [Concept: RL Iceberg trajectories](../concepts/rl/rl-iceberg.md) —
  the persistence contract.
- [Tutorial: first agent workflow](./first-agent-workflow.md) —
  hand off RL outputs to an autonomous agent loop.


<!-- https://alpha-swarm.ai/tutorials -->
# Tutorials
> Runnable walkthroughs for every AlphaSwarm surface. Pyodide + StackBlitz WebContainers in your browser.

# Tutorials

Runnable, learning-oriented walkthroughs. Each tutorial assumes the
[quickstart](../intro/quickstart.md) has succeeded.

Python snippets execute via Pyodide directly in your browser; full
project setups open in StackBlitz WebContainers. Both are sandboxed
and never reach the production cluster.

## Tutorial catalogue

- **[First backtest](./first-backtest.md)** — author a momentum
  strategy, run it through `EventDrivenBacktester`, inspect the
  `backtest_runs` ledger row, render a tearsheet.
- **[First bot](./first-bot.md)** — wrap the strategy in a
  `TradingBot` spec, snapshot the version, run a paper session.
- **[First RL experiment](./first-rl-experiment.md)** — author an
  `RLExperimentSpec`, train via SB3 PPO, replay from the Iceberg
  trajectory store.
- **[First agent workflow](./first-agent-workflow.md)** — compose a
  three-node LangGraph (Research → Selection → Trader), run it
  through `AgentRuntime`, inspect the `agent_runs_v2` ledger.
- **[First paper trading session](./first-paper-trading-session.md)** —
  attach the bot to the paper broker, watch the WebSocket frames,
  trigger the kill switch.

Each tutorial includes:

1. A "Why" section explaining what you are about to learn.
2. A canonical reference to the deeper concept doc.
3. Inline runnable code.
4. A "Verify" checklist at the end.
5. A "What next" pointer.

## Conventions for these tutorials

- **One concept per page.** If a tutorial gets too long, split it
  and link the second page.
- **Verify everything.** Every code block produces an observable
  effect — a JSON response, a ledger row, a WebSocket frame.
- **Show the failure mode.** Each tutorial documents at least one
  expected error and how to recover.