MLOps service inside alphaswarm_models/

This page documents the initial MLOps service shipped as additive extensions to the established alphaswarm_models/ boundary. The service provides the agentic plumbing the two MLOps reports asked for — a polymorphic agent-facing interface layer, MLOps lifecycle handlers, external-registry adapters, hash-locked skills, OOD safety rules, a dedicated MCP server, and the matching REST + Celery + frontend surfaces — all on top of the existing models / predictors / serving infrastructure.

What's new

alphaswarm_models/src/alphaswarm_models/interfaces/

Five agent-facing polymorphic ABCs that wrap any concrete model in a stable contract:

| Interface | Method | Application |
| --- | --- | --- |
| Predictor | predict(features) | Point-in-time value estimation |
| Forecaster | forecast(history, horizon) | Multi-step temporal projection |
| Classifier | classify(data) | Discrete probability distribution |
| Segmenter | segment(series) | Structural-break detection |
| Analyzer | analyze(unstructured) | NLP / sentiment scoring |

All register under kind="interface" in alphaswarm.core.registry. Agents program against Predictor.predict regardless of whether XGBoost, LSTM, or HuggingFace pipelines back the call.
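As an illustration of the contract idea, here is a minimal sketch of two of the interfaces as Python ABCs. The method names come from the table above; the type hints, the `MeanPredictor` and `LastValueForecaster` backends, and the `agent_step` helper are hypothetical and stand in for real XGBoost/LSTM/HuggingFace wrappers. Registry wiring under kind="interface" is omitted.

```python
from __future__ import annotations

from abc import ABC, abstractmethod
from typing import Mapping, Sequence


class Predictor(ABC):
    """Agent-facing contract: point-in-time value estimation."""

    @abstractmethod
    def predict(self, features: Mapping[str, float]) -> float:
        ...


class Forecaster(ABC):
    """Agent-facing contract: multi-step temporal projection."""

    @abstractmethod
    def forecast(self, history: Sequence[float], horizon: int) -> list[float]:
        ...


class MeanPredictor(Predictor):
    """Hypothetical backend: returns the mean of the feature values."""

    def predict(self, features: Mapping[str, float]) -> float:
        return sum(features.values()) / len(features)


class LastValueForecaster(Forecaster):
    """Hypothetical backend: repeats the last observation `horizon` times."""

    def forecast(self, history: Sequence[float], horizon: int) -> list[float]:
        return [history[-1]] * horizon


def agent_step(model: Predictor, features: Mapping[str, float]) -> float:
    # The caller depends only on the Predictor contract, never on the backend.
    return model.predict(features)
```

Swapping `MeanPredictor` for any other `Predictor` subclass leaves `agent_step` unchanged, which is the property the interface layer is meant to provide.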

alphaswarm_models/src/alphaswarm_models/handlers/

Six MLOps lifecycle handler classes:

| Handler | Purpose |
| --- | --- |
| CacheHandler | LRU + safetensors-first model cache (budgets in settings.ml_cache_*) |
| LoadHandler | Cryptographic verification + safetensors-preferred deserialisation |
| SaveHandler | torch state_dict → .safetensors with SHA-256 sidecar |
| StoreHandler | Object-store upload + lineage metadata |
| ProductionizeHandler | Drive the productionize/ compiler pipeline |
| ServeHandler | Continuous-batching queue with kill-switch fan-out |

All inherit MLOpsHandler so every lifecycle operation runs the same policy_check + lineage emission contract (LineageBus).
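The shared contract can be pictured as a template method: the base class checks policy, runs the subclass logic, then emits lineage. The sketch below assumes that shape; the `run` and `_execute` method names, the in-memory `LineageBus` stand-in, the `PolicyError` type, the example policy, and `EchoSaveHandler` are all hypothetical and not taken from the real package.

```python
from __future__ import annotations

from abc import ABC, abstractmethod
from dataclasses import dataclass, field
from typing import Any


@dataclass
class LineageBus:
    """Hypothetical in-memory stand-in for the real LineageBus."""

    events: list[dict[str, Any]] = field(default_factory=list)

    def emit(self, event: dict[str, Any]) -> None:
        self.events.append(event)


class PolicyError(RuntimeError):
    """Raised when a request fails the policy check (assumed name)."""


class MLOpsHandler(ABC):
    def __init__(self, bus: LineageBus) -> None:
        self.bus = bus

    def policy_check(self, request: dict[str, Any]) -> None:
        # Illustrative policy only: every request must name its model.
        if not request.get("model_id"):
            raise PolicyError("model_id is required")

    @abstractmethod
    def _execute(self, request: dict[str, Any]) -> Any:
        ...

    def run(self, request: dict[str, Any]) -> Any:
        # Same sequence for every handler: policy, work, lineage.
        self.policy_check(request)
        result = self._execute(request)
        self.bus.emit({
            "handler": type(self).__name__,
            "model_id": request["model_id"],
            "result": result,
        })
        return result


class EchoSaveHandler(MLOpsHandler):
    """Hypothetical handler that only reports the target filename."""

    def _execute(self, request: dict[str, Any]) -> str:
        return f"{request['model_id']}.safetensors"
```

Because `run` lives on the base class, a subclass cannot skip the policy check or the lineage event without overriding `run` itself.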

alphaswarm_models/src/alphaswarm_models/productionize/

Four compiler classes:

| Compiler | Output | Optional dep |
| --- | --- | --- |
| OnnxCompiler | .onnx | torch.onnx |
| TensorRTCompiler | .engine | tensorrt (Linux GPU only) |
| TorchScriptCompiler | .pt (trace/script) | torch |
| QuantizationCompiler | .pt (INT8 / FP16) | torch |

Each registers via @register_compiler("alias") and emits a CompiledArtifact with SHA-256 + size + kwargs into ml_compiled_artifacts.

alphaswarm_models/src/alphaswarm_models/adapters/

External-registry pullers protecting the supply chain:

| Adapter | Notes |
| --- | --- |
| HuggingFaceAdapter | Routes downloads through the local cache volume; resolves HF tokens via CredentialResolver (CredentialKey("huggingface", "api_token")). Honours settings.ml_hf_hub_offline. |
| TorchHubAdapter | Refuses every name not on DEFAULT_ALLOWLIST ∪ the operator allow-list at CredentialKey("torchhub", "allowlist"). Verifies SHA-256 before caching. |
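The two TorchHub checks (allow-list union, then digest verification) can be sketched in a few lines. This is an illustration under assumptions, not the adapter's code: the allow-list entry, the function names, and the `SupplyChainError` type are hypothetical, and the operator allow-list is passed in directly rather than resolved through CredentialResolver.

```python
from __future__ import annotations

import hashlib
import hmac
from typing import Iterable

# Hypothetical default entry; the real DEFAULT_ALLOWLIST contents are not shown here.
DEFAULT_ALLOWLIST = frozenset({"example-org/vision:resnet18"})


class SupplyChainError(RuntimeError):
    """Raised when a pull fails an allow-list or integrity check (assumed name)."""


def check_allowed(name: str, operator_allowlist: Iterable[str] = ()) -> None:
    # Effective allow-list is the union of the defaults and operator entries.
    allowed = DEFAULT_ALLOWLIST | frozenset(operator_allowlist)
    if name not in allowed:
        raise SupplyChainError(f"model not on allow-list: {name}")


def verify_sha256(payload: bytes, expected_hex: str) -> None:
    # Compare digests in constant time before the payload is cached.
    actual = hashlib.sha256(payload).hexdigest()
    if not hmac.compare_digest(actual, expected_hex.lower()):
        raise SupplyChainError("SHA-256 mismatch; refusing to cache")
```

Running both checks before any bytes reach the cache means a rejected model never becomes loadable by a later step.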

alphaswarm_models/src/alphaswarm_models/spec.py + runtime.py + registry.py

Hash-locked MLSkillSpec + MLSkillRuntime mirroring the existing AgentSpec/BotSpec/RLExperimentSpec/AnalysisSpec runtime pattern. New Alembic 0081 tables:

  • ml_skills + ml_skill_versions (hash-locked snapshots)
  • ml_skill_runs (run ledger with experiment_id + test_id FKs, AGENTS rule 34)
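One common way to hash-lock a spec snapshot is to hash a canonical serialisation, so that key order and whitespace cannot change the digest. The sketch below assumes canonical JSON plus SHA-256; the actual canonicalisation used by MLSkillSpec is not described on this page, and `spec_hash` and `verify_locked` are hypothetical names.

```python
from __future__ import annotations

import hashlib
import json
from typing import Any


def spec_hash(spec: dict[str, Any]) -> str:
    """Digest of a canonical JSON rendering (sorted keys, no extra whitespace)."""
    canonical = json.dumps(spec, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def verify_locked(spec: dict[str, Any], locked_hash: str) -> None:
    # A run should refuse a spec whose content no longer matches its snapshot.
    if spec_hash(spec) != locked_hash:
        raise ValueError("spec content does not match its locked hash")
```

With this scheme, any edit to a skill produces a new digest and therefore a new version row rather than silently changing an existing snapshot.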

Seed skill YAMLs ship under alphaswarm_models/configs/skills/:

  • regime_aware_alpha.yaml — Classifier → Predictor (regime-specialised)
  • multi_horizon_forecast.yaml — Forecaster + Analyzer (sentiment overlay)

alphaswarm_models/src/alphaswarm_models/rules/

Inference-time OOD safety rules, registered through a metaclass-driven RuleRegistry:

  • OODGuard — z-score threshold check.
  • RangeGuard — absolute min/max window check.
  • TensorShapeGuard — input-shape mismatch check.
  • CircuitBreaker — rolling-window failure tracker that trips at max_failures per window_seconds.

Rule packs live under alphaswarm_models/configs/rules/; the default is ood_default.yaml.
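To make the rule mechanics concrete, the sketch below shows a metaclass that registers each rule subclass on definition, a z-score guard, and a rolling-window circuit breaker. The class names match the list above; everything else (the `Rule` base, the `check` method, constructor parameters, the injectable clock, and the `tripped` property) is assumed for illustration, and YAML rule-pack loading is omitted.

```python
from __future__ import annotations

import time
from collections import deque
from typing import Callable


class RuleRegistry(type):
    """Metaclass that records every concrete rule class by name."""

    rules: dict[str, type] = {}

    def __new__(mcs, name, bases, namespace):
        cls = super().__new__(mcs, name, bases, namespace)
        if bases:  # skip the abstract Rule base itself
            mcs.rules[name] = cls
        return cls


class Rule(metaclass=RuleRegistry):
    def check(self, value: float) -> bool:
        raise NotImplementedError


class OODGuard(Rule):
    """Passes when the z-score of `value` is within the threshold."""

    def __init__(self, mean: float, std: float, z_threshold: float = 3.0) -> None:
        if std <= 0:
            raise ValueError("std must be positive")
        self.mean, self.std, self.z_threshold = mean, std, z_threshold

    def check(self, value: float) -> bool:
        return abs(value - self.mean) / self.std <= self.z_threshold


class CircuitBreaker:
    """Trips once `max_failures` fall inside the last `window_seconds`."""

    def __init__(self, max_failures: int, window_seconds: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.max_failures = max_failures
        self.window_seconds = window_seconds
        self.clock = clock
        self._failures: deque[float] = deque()

    def record_failure(self) -> None:
        self._failures.append(self.clock())

    @property
    def tripped(self) -> bool:
        now = self.clock()
        # Drop failures that have aged out of the rolling window.
        while self._failures and now - self._failures[0] > self.window_seconds:
            self._failures.popleft()
        return len(self._failures) >= self.max_failures
```

Passing the clock in as a parameter keeps the breaker deterministic under test; production code would leave the default monotonic clock in place.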

alphaswarm/data/mcp/tools/ml.py

Fourteen data.ml.* DataMCP tools — the canonical Hard Rule 22 path agents use to drive the entire MLOps surface (predict, forecast, classify, segment, analyze, pull, compile, list, run skills, halt serving). Each tool registers via @register_data_mcp_tool so both transports — the in-process bridge and the FastAPI router/stdio binary — pick it up.

alphaswarm/ml_mcp/ + alphaswarm-ml-mcp binary

A dedicated MCP server publishing the same data.ml.* slice under its own canonical URI (settings.mcp_ml_canonical_uri). Tokens minted for the MLOps audience cannot be replayed against the data MCP and vice versa (RFC 8707, Hard Rule 49). The RFC 9728 metadata document lives at /.well-known/oauth-protected-resource/mcp/ml.
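Audience binding comes down to each server accepting only tokens whose `aud` claim names its own canonical URI. The sketch below shows that comparison on already-decoded claims; it does not verify signatures or expiry, the URIs are placeholders rather than real settings values, and `audience_matches` is a hypothetical helper.

```python
from __future__ import annotations

from typing import Any

# Placeholder URIs; real values come from settings (e.g. settings.mcp_ml_canonical_uri).
ML_MCP_URI = "https://mcp.example.invalid/mcp/ml"
DATA_MCP_URI = "https://mcp.example.invalid/mcp/data"


def audience_matches(claims: dict[str, Any], canonical_uri: str) -> bool:
    """True only if the token's `aud` claim includes this server's URI."""
    aud = claims.get("aud")
    # JWT `aud` may be a single string or a list of strings.
    audiences = [aud] if isinstance(aud, str) else list(aud or [])
    return canonical_uri in audiences
```

For example, `audience_matches({"aud": ML_MCP_URI}, DATA_MCP_URI)` is false, which is the replay case the separate canonical URI is meant to block.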

REST + Celery

New routes under the existing /ml/* router plus a fresh /ml/skills/* router. Long-running ops dispatch to four new Celery modules: ml_pull_tasks, ml_serving_tasks, ml_productionize_tasks, ml_skill_tasks. All emit progress via _progress.emit (Hard Rule 4).

Frontend (Vite)

Three new routes under alphaswarm_client/src/routes/ml/:

  • /ml/skills — registry browser + invocation form.
  • /ml/serving — live continuous-batching session monitor with per-session halt button.
  • /ml/pull — HuggingFace/TorchHub model puller.

KillSwitch.tsx fans out to POST /ml/serving/halt-all alongside the existing halt endpoints (Hard Rule 2 in frontend.mdc).

Identity + topology

  • alphaswarm.config.settings gains nine new ml_* knobs (cache budgets, serving defaults, OOD threshold, offline toggles, MCP canonical URI + URL).
  • alphaswarm_platform/configs/deployment/topology.yaml gains an alphaswarm-ml-mcp service entry (Hard Rule 47).
  • alphaswarm/config/topology_fallback.py maps mcp_ml_url to alphaswarm-ml-mcp.http.

Agent usage

The seed mlops_assistant AgentSpec at configs/agents/mlops_assistant.yaml drives the MLOps surface exclusively through the data.ml.* tools. Operators invoke it the same way as any other AgentSpec, through AgentRuntime.run(...); Hard Rule 12 forbids calling router_complete directly.

Validation

# Source compile check (in bash, run `shopt -s globstar` first so that ** recurses):
python -m py_compile alphaswarm_models/src/alphaswarm_models/{interfaces,handlers,adapters,rules,productionize,tasks}/**/*.py

# New migration is hashed into the lock file:
python scripts/ci/check_migration_immutability.py

# DataMCP catalog discovery:
curl http://localhost:8000/mcp/data/tools | jq '.tools[] | select(.name | startswith("data.ml."))'

# MLOps MCP discovery:
curl http://localhost:8000/.well-known/oauth-protected-resource/mcp/ml

What is explicitly out of scope

  • Mutating an existing migration. The 0081 migration is immutable once shipped (Hard Rule 6); future schema changes land in 0082+.
  • Streamlit / Solara surfaces. The legacy stack is rollback-only.
  • Free-text URN input. Every entity selection uses EntityPicker (Hard Rule 29).