Analysis Framework
Doc map: alphaswarm_docs/index.md · Lab guide: alphaswarm_docs/analysis-lab.md · Flow reference: alphaswarm_docs/analysis-flows.md.
The analysis layer is AlphaSwarm's hash-locked, runtime-driven umbrella for every "explore a dataset" workflow — distribution audits, time-series diagnostics, derivatives pricing, portfolio optimisation, regression diagnostics, outlier / imputation work, and Alphalens-style factor evaluation. It is the statistical / quantitative-analysis counterpart of the agentic-interpretation layer in alphaswarm_docs/analysis-agents.md. The two namespaces are deliberately distinct.
Why a new umbrella
Most primitives existed already (alphaswarm.ml.flows, alphaswarm.data.factors,
alphaswarm.data.realised_volatility, alphaswarm.data.microstructure,
alphaswarm.options.normal_model, alphaswarm.data.profiling.profiler) but had no
single contract for:
- registering a flow with a JSON-schema-driven param model;
- composing multiple flows into a reproducible pipeline;
- snapshotting the spec into an immutable, hash-locked version row;
- writing every step's gold-tier output to Iceberg (`alphaswarm_gold_analysis_<namespace>`) under medallion validation;
- emitting the same progress payload shape Celery + WebSocket consumers already understand.
The umbrella plugs every primitive into one canvas + one ledger.
Layout
alphaswarm/analysis/
base.py — FlowParams / FlowResult / FlowDescriptor / FlowContext
spec.py — AnalysisSpec / AnalysisStep / FlowRef / DatasetRef
registry.py — @register_analysis_flow + persist_spec + add_spec
runtime.py — AnalysisRuntime (sole sanctioned executor)
pricing.py — closed-form + MC math primitives (BSM, Greeks, GBM, SABR)
flows/
profiling.py / distribution.py / outlier.py / imputation.py /
regression.py / time_series.py / derivatives.py / portfolio.py /
factors.py / microstructure.py
AnalysisSpec contract
Every spec is a Pydantic model that hashes its canonical JSON form
(SHA-256, sorted keys, no whitespace). Two specs with identical fields
collapse to one analysis_spec_versions row; any edit creates a new
version automatically.
name: spy-distribution-audit
slug: spy-distribution-audit
kind: research
description: Distribution + GARCH + outlier audit for SPY daily bars.
dataset:
iceberg_identifier: alphaswarm_silver_alpha_vantage.equities_daily
filters:
vt_symbol: SPY.NYSE
limit: 5000
steps:
- alias: profile
flow_ref:
flow: profiling.describe
params: {}
- alias: returns_dist
flow_ref:
flow: distribution.descriptive_stats
params: { column: log_return }
- alias: shapiro
flow_ref:
flow: distribution.shapiro_wilk
params: { column: log_return }
- alias: garch
flow_ref:
flow: time_series.garch
params: { column: log_return, p: 1, q: 1, horizon: 10 }
medallion_layer: gold
business_metadata:
data_owner: research-team
semantic_definition: "SPY daily distribution + volatility audit"
domain: research.distribution_audit
sla_class: tier-3-eod
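The hashing rule described above (SHA-256 over canonical JSON with sorted keys and no whitespace) can be sketched in a few lines. This is a minimal illustration, not the framework's code: the real spec is a Pydantic model, whereas this sketch hashes a plain dict, and `canonical_spec_hash` is a hypothetical helper name.

```python
import hashlib
import json


def canonical_spec_hash(spec: dict) -> str:
    """Hash a spec dict via canonical JSON (hypothetical helper, not the AlphaSwarm API)."""
    # sort_keys gives a stable key order; the compact separators drop all
    # insignificant whitespace, so equal content always yields equal bytes.
    canonical = json.dumps(spec, sort_keys=True, separators=(",", ":"), ensure_ascii=False)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


# Same fields, different key order: the hashes match.
spec_a = {"slug": "spy-distribution-audit", "kind": "research", "steps": [{"alias": "profile"}]}
spec_b = {"steps": [{"alias": "profile"}], "kind": "research", "slug": "spy-distribution-audit"}
# One edited field: the hash changes, which is what would trigger a new version.
spec_c = {**spec_a, "kind": "production"}

same_hash = canonical_spec_hash(spec_a) == canonical_spec_hash(spec_b)
changed_hash = canonical_spec_hash(spec_a) != canonical_spec_hash(spec_c)
```

Note that list order (for example the order of `steps`) still affects the hash, since only mapping keys are sorted.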
Hard rules
These hold across every analysis flow / spec / run. Any PR that violates one will be sent back.
- Every analysis run goes through `AnalysisRuntime`. REST + Celery tasks (`alphaswarm.tasks.analysis_flow_tasks`) wrap it; flow code never writes to Iceberg / Postgres directly.
- `analysis_spec_versions` rows are immutable. Re-snapshotting via `alphaswarm.analysis.registry.persist_spec` creates a new version row when the SHA-256 hash changes; never update an existing row in place.
- Every per-step Iceberg write uses `iceberg_catalog.append_arrow` with `medallion_layer="gold"` and a `BusinessMetadata` block. The default namespace is `alphaswarm_gold_analysis_<flow.namespace>`; flows can override via `output_namespace=` on `register_analysis_flow`.
- Flows never call `litellm.completion` / `OllamaClient` directly. v1 ships zero LLM-routed flows by design: interpretation is owned by the analysis-AGENTS stack (alphaswarm_docs/analysis-agents.md).
- Optional dependencies are guarded. Flows that need `cvxpy`, `pyod`, `pywavelets`, `cupy`, etc. raise a friendly `RuntimeError` with the install hint when the import fails.
- No new diagram formats. Mermaid only.
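The optional-dependency rule above might be implemented with a small import guard along these lines. This is a sketch under assumptions: `require_optional` is a hypothetical helper name, and the exact wording of the install hint is illustrative only.

```python
import importlib
from types import ModuleType


def require_optional(module_name: str, install_hint: str) -> ModuleType:
    """Import an optional dependency, or raise RuntimeError carrying an install hint.

    Hypothetical helper: the document only states that flows raise a friendly
    RuntimeError with the install hint when the import fails.
    """
    try:
        return importlib.import_module(module_name)
    except ImportError as exc:
        raise RuntimeError(
            f"This flow needs the optional dependency '{module_name}'. "
            f"Install it with: {install_hint}"
        ) from exc
```

A flow body would then call something like `cvxpy = require_optional("cvxpy", "pip install cvxpy")` at the point of use, so importing the flow module itself never fails on a missing extra.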
REST surface
| Method | Path | Purpose |
|---|---|---|
| GET | /analysis/flows | List flows + JSON-schema-derived param forms |
| GET | /analysis/flows/{flow} | Single flow detail |
| POST | /analysis/flows/{flow}/preview | Sync preview against an inline payload |
| POST | /analysis/flows/{flow}/preview-task | Async preview via Celery (agents queue) |
| GET | /analysis/specs | List saved specs |
| POST | /analysis/specs | Persist a new spec (idempotent on hash) |
| GET | /analysis/specs/{slug} | Current spec + version history |
| POST | /analysis/specs/{slug}/run | Kick AnalysisRuntime.run via Celery |
| GET | /analysis/runs | Paged ledger of runs |
| GET | /analysis/runs/{id} | Run detail with joined step results |
| GET | /analysis/runs/{id}/results/{step} | DuckDB-driven preview of one step's gold-tier output |
| GET | /analysis/datasets/columns?identifier=ns.name | Column / dtype list for the lab forms |
Persistence schema
Migration 0031_analysis_layer adds four project-scoped tables:
| Table | Purpose |
|---|---|
| analysis_specs | Logical row (latest active version per slug) |
| analysis_spec_versions | Immutable hash-locked snapshot |
| analysis_runs | One row per AnalysisRuntime.run() invocation |
| analysis_step_results | One row per AnalysisStep in the spec |
AnalysisRun.iceberg_result_table is set when a step persists arrow
data; AnalysisStepResult.artifact_uri records the per-step
namespace.name so the lab can fetch the gold-tier output via DuckDB.
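The "immutable, hash-locked snapshot" behaviour of the version table can be illustrated with a toy model. Everything here is an assumption made for illustration: it uses an in-memory SQLite table instead of the project's Postgres schema, and the table name `spec_versions`, its columns, and the `snapshot_spec` function are hypothetical rather than the contents of migration 0031_analysis_layer.

```python
import hashlib
import json
import sqlite3


def spec_hash(spec: dict) -> str:
    """SHA-256 of the canonical JSON form (sorted keys, no whitespace)."""
    canonical = json.dumps(spec, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def snapshot_spec(conn: sqlite3.Connection, slug: str, spec: dict) -> int:
    """Return the version number for this spec, inserting a row only for a new hash."""
    digest = spec_hash(spec)
    row = conn.execute(
        "SELECT version FROM spec_versions WHERE slug = ? AND spec_hash = ?",
        (slug, digest),
    ).fetchone()
    if row is not None:
        return row[0]  # identical content: reuse the existing immutable row
    (latest,) = conn.execute(
        "SELECT COALESCE(MAX(version), 0) FROM spec_versions WHERE slug = ?",
        (slug,),
    ).fetchone()
    # Changed content: append a new row; existing rows are never updated.
    conn.execute(
        "INSERT INTO spec_versions (slug, version, spec_hash, spec_json) VALUES (?, ?, ?, ?)",
        (slug, latest + 1, digest, json.dumps(spec, sort_keys=True)),
    )
    conn.commit()
    return latest + 1


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE spec_versions ("
    " slug TEXT NOT NULL, version INTEGER NOT NULL,"
    " spec_hash TEXT NOT NULL, spec_json TEXT NOT NULL,"
    " UNIQUE (slug, spec_hash))"
)
v1 = snapshot_spec(conn, "spy-distribution-audit", {"kind": "research", "limit": 5000})
v1_again = snapshot_spec(conn, "spy-distribution-audit", {"limit": 5000, "kind": "research"})
v2 = snapshot_spec(conn, "spy-distribution-audit", {"kind": "research", "limit": 1000})
```

In this toy model, re-submitting the same fields in a different key order returns the existing version, while changing `limit` appends a second row.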
Adding a new flow
- Subclass `FlowParams` for the per-flow parameter shape.
- Decorate a `(df, params, ctx) -> FlowResult` function with `@register_analysis_flow(name, namespace, label, ...)`.
- (optional) Stash a `pyarrow.Table` on `result.arrow_table` to persist it under `alphaswarm_gold_analysis_<namespace>` when run inside a spec.
- Add a smoke test under `tests/analysis/`.
- Update the relevant tab in alphaswarm_docs/analysis-flows.md.
Don't list
- Don't bypass `AnalysisRuntime` for spec execution: every progress / ledger / Iceberg / step-result side-effect is wired through it.
- Don't write to a non-`alphaswarm_gold_analysis_*` namespace from a flow.
- Don't duplicate logic that already lives in `alphaswarm.data.factors` / `alphaswarm.data.microstructure` / `alphaswarm.options.normal_model`; wrap them as a flow and keep the math in one place.
- Don't add diagrams in non-Mermaid formats.
- Don't put LLM-driven interpretation in a flow; that lives in `alphaswarm_agents.analysis.*`.