ADR 006 — QuantBot Operator Pattern (kopf + Pydantic mirrors)

Status: Accepted (QuantBot Platform v0.2.0)
Date: 2026-05-24
Decision drivers: AGENTS rules 14, 15, 28, 45; rpi-k8s-governance

Context

The QuantBot Platform v0.2.0 adds a Kubernetes-native control plane on top of the existing BotRuntime/bot_versions infrastructure. Every running bot, every risk policy, every venue feed, every backtest job, every kill switch is now a Kubernetes Custom Resource. That requires:

A controller that watches the CRs and reconciles desired state.
A schema source-of-truth for each CR.
Webhooks that reject malformed CRs before they reach the reconciler.

Decision

Controller framework: kopf (kopf>=1.37). Python-native, integrates with our Pydantic spec layer, supports level-triggered reconciliation, finalizers, and admission webhooks. Up to ~1000 CRs/cluster is well within kopf's documented operating envelope.
Schema source-of-truth: each CR has both a Pydantic mirror class (under alphaswarm_bots/operator/crds/*_cr.py) AND a CRD YAML (alphaswarm_bots/operator/crds/yaml/*_crd.yaml). The Pydantic class is validated from the CR .spec field; the YAML is what gets applied to the cluster by the CRD-installer Job. The two are kept in sync by convention + the operator's startup self-test (alphaswarm_bots/operator/schema_check.py, see Consequences).
Reconciliation: level-triggered. Every handler compares desired (from spec) against actual (queried from the cluster) and drives the system back. Failures reflect onto status.conditions.
Workload application: routes through alphaswarm_core.WorkloadRuntime per AGENTS rule 45. The operator never calls kubernetes.client.AppsV1Api() directly when WorkloadRuntime is available; it falls back to kubernetes-asyncio only in environments where WorkloadRuntime has not been deployed yet.
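
The level-triggered rule in the Reconciliation item can be illustrated with a minimal sketch. It is not the operator's handler code: the Condition and ReconcileResult types, the reconcile function, and the replica-count example are all hypothetical, chosen only to show the shape of "compare desired against actual, drive the difference, report failures as conditions". It uses the standard library only and does not touch kopf or the cluster.

```python
from __future__ import annotations

from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Condition:
    """Hypothetical stand-in for one entry of status.conditions."""
    type: str
    status: str  # "True" or "False", following the Kubernetes condition convention
    reason: str
    message: str = ""


@dataclass
class ReconcileResult:
    """Hypothetical result: planned actions plus the conditions to publish."""
    actions: list[str] = field(default_factory=list)
    conditions: list[Condition] = field(default_factory=list)


def reconcile(desired_replicas: int, actual_replicas: Optional[int]) -> ReconcileResult:
    """Compare desired state against observed state and plan the difference.

    Level-triggered: only the current desired and actual values matter, not
    which event fired, so calling this repeatedly is safe.
    """
    result = ReconcileResult()
    if desired_replicas < 0:
        # A failure is reported through conditions instead of being raised.
        result.conditions.append(
            Condition("Ready", "False", "InvalidSpec", "replicas must be >= 0")
        )
        return result
    if actual_replicas is None:
        result.actions.append(f"create workload with {desired_replicas} replicas")
    elif actual_replicas != desired_replicas:
        result.actions.append(f"scale workload {actual_replicas} -> {desired_replicas}")
    in_sync = not result.actions
    result.conditions.append(
        Condition("Ready", "True" if in_sync else "False",
                  "InSync" if in_sync else "Reconciling")
    )
    return result
```

Because the function depends only on the two inputs, a missed or duplicated event costs nothing: the next pass over the same state yields the same plan, and an already converged resource yields an empty one.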

Alternatives considered

Option	Why rejected
Go operator (controller-runtime / Kubebuilder)	Re-implements the spec validation already written in Pydantic; bigger team operational burden for a Python-first shop
metacontroller + JSON Schema	No mature Python ecosystem for the testing + audit story we need; JSON Schema diverges from Pydantic validators
Native Helm charts only (no controller)	Helm can't reconcile the operator-side bookkeeping (kill switch fan-out, drain finalizer, status condition rollup)

Consequences

+ Single source of truth (Pydantic) drives both API validation and CR validation.
+ Python-native test suite for the operator (kopf can be driven in-process from pytest).
− kopf scaling ceiling is ~1000 CRs per cluster; beyond that we need operator sharding (deferred per blueprint caveat #2).
− Pydantic mirror + YAML CRD requires manual sync. Mitigated by alphaswarm_bots/operator/schema_check.py: for each of the 9 CRDs it regenerates the JSON Schema from the Pydantic *SpecField model (model_json_schema()) and structurally compares it against the CRD's openAPIV3Schema.properties.spec (resolving $ref/$defs, collapsing Optional anyOf, and ignoring title/default annotations). It runs three ways: the operator's run path executes it before starting kopf and refuses to boot on drift (exit 5; break-glass via ALPHASWARM_BOTS_ALLOW_SCHEMA_DRIFT=1); alphaswarm-bots-operator check-schemas runs it standalone; and CI runs it via tests/test_crd_schema_drift.py and the crd-schema-drift GitHub workflow.
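
The structural comparison described above can be sketched as follows. This is an illustration of the normalisation steps named in the text (resolving $ref/$defs, collapsing Optional anyOf, ignoring title/default), not the contents of schema_check.py. The function names normalize and schemas_match and the example schemas (BotSpec, Limits, max_notional, venue) are hypothetical. It works on plain dicts, so the Pydantic side is assumed to be the output of model_json_schema() and the CRD side the parsed openAPIV3Schema.properties.spec.

```python
from __future__ import annotations

from typing import Any

IGNORED_KEYS = {"title", "default"}  # annotations that carry no structure


def normalize(schema: Any, defs: dict[str, Any]) -> Any:
    """Return a copy of `schema` reduced to its structural content.

    Assumes references are local ("#/$defs/Name") and not cyclic.
    """
    if isinstance(schema, list):
        return [normalize(item, defs) for item in schema]
    if not isinstance(schema, dict):
        return schema
    if "$ref" in schema:
        # Inline the referenced definition.
        name = schema["$ref"].rsplit("/", 1)[-1]
        return normalize(defs[name], defs)
    if "anyOf" in schema:
        # Optional[X] appears as anyOf: [X, {"type": "null"}]; keep only X.
        branches = [b for b in schema["anyOf"] if b != {"type": "null"}]
        if len(branches) == 1:
            rest = {k: v for k, v in schema.items() if k != "anyOf"}
            return normalize({**rest, **branches[0]}, defs)
    out: dict[str, Any] = {}
    for key, value in schema.items():
        if key in IGNORED_KEYS or key == "$defs":
            continue
        if key == "properties":
            # Property names are data, so they are never filtered as annotations.
            out[key] = {name: normalize(sub, defs) for name, sub in value.items()}
        else:
            out[key] = normalize(value, defs)
    return out


def schemas_match(pydantic_schema: dict[str, Any], crd_spec_schema: dict[str, Any]) -> bool:
    """True when both schemas agree after normalisation."""
    defs = pydantic_schema.get("$defs", {})
    return normalize(pydantic_schema, defs) == normalize(crd_spec_schema, {})


# Hypothetical example: a model schema and the matching CRD spec schema.
pydantic_schema = {
    "title": "BotSpec",
    "type": "object",
    "$defs": {
        "Limits": {
            "title": "Limits",
            "type": "object",
            "properties": {"max_notional": {"type": "number", "title": "Max Notional"}},
        }
    },
    "properties": {
        "limits": {"$ref": "#/$defs/Limits"},
        "venue": {"anyOf": [{"type": "string"}, {"type": "null"}],
                  "default": None, "title": "Venue"},
    },
}
crd_spec_schema = {
    "type": "object",
    "properties": {
        "limits": {"type": "object", "properties": {"max_notional": {"type": "number"}}},
        "venue": {"type": "string"},
    },
}
```

With these inputs schemas_match(pydantic_schema, crd_spec_schema) returns True. Changing the CRD's venue type to "integer" makes it return False, which is the kind of drift the boot-time check is meant to catch.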

References

alphaswarm_bots/operator/
alphaswarm_bots/operator/schema_check.py — the Pydantic-vs-CRD-YAML self-test
alphaswarm_bots/tests/test_crd_schema_drift.py
alphaswarm_platform/deployments/kubernetes/bots-operator/
