Reinforcement learning framework
The RL layer in AlphaSwarm follows a metaclass-driven, registry-first design
inspired by FinRL's library structure and FinRobot's tool-augmented
agent runtime. Every concrete component (env, observation, action,
reward, termination, policy, agent, data pipeline, ensembler,
experiment, trajectory store) auto-registers through
alphaswarm_rl/src/alphaswarm_rl/core/base.py
so the API and the lab UI can browse them at runtime.
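The auto-registration idea can be sketched in a few lines. This is a minimal illustration, not the code in core/base.py: `RLComponentMeta`, `REGISTRY`, `BaseReward`, and `list_components` are hypothetical names, and only the `rl_kind` / `rl_alias` attributes are taken from this page.

```python
from typing import Dict, List, Tuple, Type

# Hypothetical in-process registry keyed by (kind, alias).
REGISTRY: Dict[Tuple[str, str], Type] = {}


class RLComponentMeta(type):
    """Metaclass that registers every concrete subclass at class-creation time."""

    def __new__(mcls, name, bases, namespace):
        cls = super().__new__(mcls, name, bases, namespace)
        kind = getattr(cls, "rl_kind", None)   # may be inherited from a base
        alias = namespace.get("rl_alias")      # must be declared on the class itself
        # Abstract bases declare no alias, so they are skipped.
        if kind is not None and alias is not None:
            key = (kind, alias)
            if key in REGISTRY:
                raise ValueError(f"duplicate component: {key}")
            REGISTRY[key] = cls
        return cls


class BaseReward(metaclass=RLComponentMeta):
    """Abstract base: sets the kind but no alias, so it is not registered."""
    rl_kind = "reward"


class DifferentialSharpe(BaseReward):
    rl_alias = "differential_sharpe"


def list_components(kind: str) -> List[str]:
    """Return the sorted aliases registered under one kind."""
    return sorted(alias for (k, alias) in REGISTRY if k == kind)


print(list_components("reward"))  # ['differential_sharpe']
```

Because registration happens when the class body is evaluated, importing a module is enough to make its components browsable; no explicit decorator call is needed at the definition site.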
This page is the canonical entry point. For shorter cuts:
- rl-lab: interactive RL Lab + builders.
- rl-components: auto-generated component reference (browse via /rl/components in the operator UI).
- rl-iceberg: Iceberg trajectory / equity / reward-decomposition tables and DuckDB views.
- rl-market-dynamics: Phase 6 slice-and-merge regime labeller + RegimeAwareObservation + RegimeStratifiedEvaluation.
- rl-prudex-evaluation: Phase 9 PRUDEX-Compass framework (17 measures, 5 visualisations).
- rl-finagent: Phase 10 FinAgent multimodal 5-stage LLM-hybrid adapter.
- weight-centric-pipeline: FinRL-X four-stage f_S → f_A → f_T → f_R pipeline.
- architecture/decisions/010-rl-production-enhancement: full Phase 1-12 production-enhancement ADR.
Phase 1-12 production enhancements (May 2026)
The Phase 1-12 deliverables documented in
ADR-010
add the following components under their canonical rl_alias /
kind:
| Phase | Components |
|---|---|
| 1 (Rewards) | differential_sharpe, differential_downside, implementation_shortfall, running_inventory, exp_utility, hindsight, dp_distillation |
| 2 (Analytical) | almgren_chriss_residual, avellaneda_stoikov_residual (+ alphaswarm_rl.analytical.{almgren_chriss,avellaneda_stoikov,cartea_jaimungal} helpers) |
| 3 (Envs) | tradesim_algotrading, tradesim_portfolio, tradesim_execution, tradesim_hft, finagent_trading |
| 4 (Agents) | eiie, deeptrader, investor_imitator, eteo, opd, deepscalper, hft_ddqn, ppo_inhouse |
| 5 (Backbones) | eiie_conv, sagcn, market_scorer, hft_qnet, eteo_dual_head, pd_dual_rnn, sarl_lstm |
| 6 (MDM) | slice_and_merge_regime_flow (analysis flow), regime_aware observation, regime_stratified experiment |
| 7 (CSDI) | csdi_imputed dataset kind |
| 8 (Validation) | CombinatorialPurgedKFold, probability_of_backtest_overfitting, rademacher_anti_serum, deflated_sharpe_ratio, walk_forward_anchored, walk_forward_rolling, benjamini_hochberg, holm_bonferroni, validation_suite experiment |
| 9 (PRUDEX) | PrudexMetrics, PrudexReport, compute_prudex_metrics, 5 chart helpers, prudex_compass experiment |
| 10 (FinAgent) | finagent_layered adapter + 5 AgentSpec YAMLs under configs/agents/finagent/ + 3 tools under alphaswarm/agents/tools/finagent/ |
| 11 (Replay) | GeneralReplayBuffer, PrioritizedReplayBuffer, NStepInfoReplayBuffer |
| 12 (Parity) | Determinism + kill-switch tests around WeightCentricPipeline + WeightToOrders |
Contracts
Two execution shapes share the same hash-locked spec. The standalone
shape is the original RL pipeline; the workflow-wrapped shape lets
WorkflowRuntime compose RL training into larger multi-stage
agentic pipelines (AGENTS rule 40 + ADR-005 + Phase 5 of the
orchestration refactor).
Hard rules
- All RL train / evaluate / paper / replay / walk-forward goes through alphaswarm_rl/src/alphaswarm_rl/runtime.py::RLRuntime (AGENTS rule 16). Tasks under alphaswarm_rl/tasks/rl_tasks.py and API routes under alphaswarm_rl/api/routes/rl.py wrap it; they never call agent.train directly.
- rl_experiment_versions rows are immutable, hash-locked. Re-snapshotting via alphaswarm_rl/src/alphaswarm_rl/registry.py::persist_spec inserts a new row when the SHA-256 of the spec changes (AGENTS rule 17).
- Trajectory persistence flows through IcebergTrajectoryStore → iceberg_catalog.append_arrow (AGENTS rule 18).
- All concrete components register through the RLComponent metaclass. Set rl_kind to one of the canonical kinds; the metaclass calls @register automatically (AGENTS rule 19).
- LLM calls inside LLMHybridAgent route through router_complete (AGENTS rule 20).
- Advantage estimation goes through BaseAdvantageEstimator (AGENTS rule 36). The native ReinforcePlusPlusAdvantage / GRPOAdvantage / GAEAdvantage register through the metaclass alongside envs / rewards / policies.
- Policy backbones go through TimeSeriesEncoder (AGENTS rule 37). The four shipped backbones (TransformerBackbone, RecurrentBackbone, AutoencoderBackbone, PatchTSTBackbone) wrap existing alphaswarm_models.models modules so the policy network and the offline ML stack share one source of truth.
- Weight-centric portfolio actions go through the FinRL-X four-stage pipeline WeightCentricPipeline (f_S → f_A → f_T → f_R, AGENTS rule 38). Risk overlay (f_R) re-uses RiskLimits so offline backtests and live paper paths produce identical target-weight vectors.
Hash-lock invariant in practice
The *_spec_versions table is the contract that makes RL replayable.
Three concrete consequences:
- Same content → same version. Re-posting an identical spec returns the existing version_id. No duplicate row, no side-effect.
- Any field change → new version. Bump a hyperparameter, swap a reward term, retune the LR schedule: the SHA-256 changes, the row is new. The old row stays forever.
- Replay is across data, not across code. When you call RLRuntime(spec).replay(new_window), the runtime loads the pinned version_id from rl_runs, rebuilds the env / agent exactly as the original train run, and feeds it the new bars. This is how "would this policy have held up in Q1 2024?" questions get a deterministic answer.
This is why
alphaswarm_rl/src/alphaswarm_rl/registry.py::persist_spec
is the only sanctioned path: every direct mutation to the table
would corrupt the replay contract.
Packages
| Path | Purpose |
|---|---|
| alphaswarm_rl/src/alphaswarm_rl/core/ | Abstract bases + RLComponent metaclass + schema helpers. |
| alphaswarm_rl/src/alphaswarm_rl/spec.py | RLExperimentSpec declarative blueprint. |
| alphaswarm_rl/src/alphaswarm_rl/runtime.py | RLRuntime single sanctioned executor. |
| alphaswarm_rl/src/alphaswarm_rl/envs/ | Concrete envs (existing + FinRL ports + TradeSim + FinAgent). |
| alphaswarm_rl/src/alphaswarm_rl/rewards/ | Composable reward terms. |
| alphaswarm_rl/src/alphaswarm_rl/observations/ | Observation builders. |
| alphaswarm_rl/src/alphaswarm_rl/actions/ | Action-space implementations. |
| alphaswarm_rl/src/alphaswarm_rl/terminations/ | End-of-episode predicates. |
| alphaswarm_rl/src/alphaswarm_rl/data_pipelines/ | Iceberg / Yahoo / Alpaca / streaming / replay pipelines. |
| alphaswarm_rl/src/alphaswarm_rl/agents/ | SB3 / ElegantRL / RLlib / CleanRL / LLM-hybrid + classical / Q-family / actor-critic / evolutionary. |
| alphaswarm_rl/src/alphaswarm_rl/policies/ | Policy backbones (TimeSeriesEncoder subclasses). |
| alphaswarm_rl/src/alphaswarm_rl/advantage/ | Advantage estimators (native REINFORCE++ / GRPO / GAE). |
| alphaswarm_rl/src/alphaswarm_rl/ensemblers/ | Walk-forward / best-of-N / curriculum / meta-ensemble. |
| alphaswarm_rl/src/alphaswarm_rl/experiments/ | Experiment runners (basic / walk-forward / ablation / alpha-backtest / regime-stratified / validation-suite / PRUDEX-Compass). |
| alphaswarm_rl/src/alphaswarm_rl/applications/ | One-call FinRL-style apps (stock / portfolio / crypto / fundamentals / paper). |
| alphaswarm_rl/src/alphaswarm_rl/portfolio/ | WeightCentricPipeline (FinRL-X f_S → f_A → f_T → f_R). |
| alphaswarm_rl/src/alphaswarm_rl/trajectories/ | Iceberg-backed trajectory writer + DuckDB views. |
| alphaswarm_rl/src/alphaswarm_rl/bridges/ | Backtest-engine + WorkflowRuntime adapters. |
| alphaswarm/persistence/models_rl.py | ORM for specs, versions, runs, evaluations, refs, registrations. |
| alphaswarm_rl/api/routes/rl.py | REST surface. |
| alphaswarm_rl/tasks/rl_tasks.py | Celery tasks driven by RLRuntime. |
| alphaswarm_client/src/routes/rl/ | RL Lab + builders + library + runs UI (active Vite frontend). |
| alphaswarm_rl/configs/ | Preset / reward / observation / data-pipeline YAMLs. |
| alphaswarm_rl/tests/ | Hermetic test suite. |
Legacy alphaswarm.rl.* is a deprecation shim that re-exports from
alphaswarm_rl.*; new code imports from alphaswarm_rl directly.
Spec lifecycle
- Author an RLExperimentSpec (YAML or in-code Pydantic).
- Persist via alphaswarm_rl.registry.persist_spec → rl_experiment_specs + rl_experiment_versions (hash-locked snapshot).
- Run via RLRuntime.train / .evaluate / .paper / .replay / .walk_forward → opens an rl_runs row, builds the env / agent from build_from_config, drives training, persists per-step trajectories to Iceberg, finalises the run row.
- Inspect via the API (/rl/runs/{id}/equity, /trajectories, /reward-decomposition, /episodes) and the lab UI run-detail page (equity chart, reward decomposition, episode summary, replay slider).
Worked example: train + replay
Goal: snapshot a 50k-step PPO experiment, train it, inspect the ledger row, read trajectories from Iceberg, and replay against fresh data — all from this page.
Step 1 — snapshot the spec
The experiment YAML lives at
alphaswarm_rl/configs/experiments/my_first_rl.yaml.
Dispatch the train run:
Notice spec_hash in the response — that is the immutable hash-lock
key. Re-posting the same YAML returns the same spec_version_id.
Step 2 — tail progress
curl -N http://localhost:8000/chat/stream/<task_id>
Frames arrive in the canonical envelope (AGENTS rule 4). Expected
stages: start → data.loaded → env.built → agent.built →
train.step (×many, sparse) → train.checkpoint → done.
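A client can check that a finished run passed through the expected stages. The sketch below assumes each frame is a JSON object with a `stage` key; that field name and the frame layout are assumptions, since the canonical envelope is not reproduced on this page. It compares the order in which stages first appear, so repeated or interleaved train.step and train.checkpoint frames are tolerated.

```python
import json
from typing import List

EXPECTED = ["start", "data.loaded", "env.built", "agent.built",
            "train.step", "train.checkpoint", "done"]


def first_seen_stages(frames: List[str]) -> List[str]:
    """Parse JSON frames and return each stage once, in order of first appearance."""
    seen: List[str] = []
    for raw in frames:
        stage = json.loads(raw)["stage"]  # "stage" is an assumed field name
        if stage not in seen:
            seen.append(stage)
    return seen


def run_completed_in_order(frames: List[str]) -> bool:
    """True when every expected stage appeared in order and the stream ended on done."""
    if not frames:
        return False
    last_stage = json.loads(frames[-1])["stage"]
    return first_seen_stages(frames) == EXPECTED and last_stage == "done"


# Inline sample frames standing in for the streamed response.
sample = [json.dumps({"stage": s}) for s in [
    "start", "data.loaded", "env.built", "agent.built",
    "train.step", "train.step", "train.checkpoint",
    "train.step", "train.checkpoint", "done",
]]
print(run_completed_in_order(sample))  # True
```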
Step 3 — inspect the ledger
The agent-safe read is data.rl.list / data.rl.describe:
curl -X POST http://localhost:8000/mcp/data/tools/data.rl.describe/invoke \
-H "Content-Type: application/json" \
-H "Authorization: Bearer $(alphaswarm-cli auth token)" \
-d '{"rl_run_id": "<from-step-1>"}'
The response carries status, mean_reward, total_timesteps,
spec_version_id, MLflow run id, and the trajectory namespace.
Step 4 — read trajectories from Iceberg
Pyodide does not ship PyIceberg, but it ships duckdb + pyarrow, and the trajectory writer exports a parquet-compatible view. The snippet below shows the analytical pattern with inline sample data so it runs in your browser.
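A dependency-free sketch of the per-episode aggregation follows. To keep it runnable with the standard library alone, sqlite3 stands in for DuckDB here; the GROUP BY query is plain SQL that DuckDB also accepts. The sample rows are invented for illustration and only a subset of the trajectory columns is modelled.

```python
import sqlite3

# Inline sample data: (episode, step, action, reward). Values are illustrative.
rows = [
    (0, 0, 1, 0.5),
    (0, 1, 0, -0.25),
    (0, 2, 1, 0.75),
    (1, 0, 0, -0.5),
    (1, 1, 1, 1.5),
]

con = sqlite3.connect(":memory:")
con.execute(
    "CREATE TABLE trajectories "
    "(episode INTEGER, step INTEGER, action INTEGER, reward REAL)"
)
con.executemany("INSERT INTO trajectories VALUES (?, ?, ?, ?)", rows)

# Per-episode length and total reward.
summary = con.execute(
    """
    SELECT episode, COUNT(*) AS steps, SUM(reward) AS total_reward
    FROM trajectories
    GROUP BY episode
    ORDER BY episode
    """
).fetchall()
print(summary)  # [(0, 3, 1.0), (1, 2, 1.0)]
```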
The same pattern works against the real Iceberg trajectory tables
via the
data.iceberg.read_snapshot
MCP tool. The tables are:
- alphaswarm_silver_rl_trajectories.<spec_hash>: per-step (episode, step, obs_hash, action, reward, value, log_prob)
- alphaswarm_silver_rl_equity_curves.<spec_hash>: per-step equity / drawdown
- alphaswarm_silver_rl_action_logs.<spec_hash>: full action vectors per step
- alphaswarm_silver_rl_reward_decomposition.<spec_hash>: per-term reward attribution
Step 5 — replay against fresh data
The killer feature of hash-locked specs: replay the trained policy against a different time window WITHOUT touching the spec.
The new rl_runs row carries parent_run_id and the SAME
spec_version_id as the original train run. Two rl_runs rows,
one rl_experiment_versions row.
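The lineage described above can be modelled with plain records. This is only a sketch of the invariant: `RunRow`, `replay_of`, and the `run_id` / `mode` fields are hypothetical, while `parent_run_id` and `spec_version_id` are the column names used on this page.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class RunRow:
    """Simplified stand-in for one rl_runs row."""
    run_id: str
    mode: str
    spec_version_id: int
    parent_run_id: Optional[str] = None


def replay_of(parent: RunRow, run_id: str) -> RunRow:
    """Create a replay row that reuses the parent's pinned spec version."""
    return RunRow(
        run_id=run_id,
        mode="replay",
        spec_version_id=parent.spec_version_id,
        parent_run_id=parent.run_id,
    )


train = RunRow(run_id="run-train", mode="train", spec_version_id=7)
replay = replay_of(train, run_id="run-replay")

# Two run rows, one distinct spec version between them.
distinct_versions = {row.spec_version_id for row in (train, replay)}
print(len(distinct_versions), replay.parent_run_id)  # 1 run-train
```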
Step 6 — verify
- rl_experiment_versions row with the recorded spec_hash.
- Two rl_runs rows referencing it (train + replay).
- Trajectory tables in alphaswarm_silver_rl_trajectories.<spec_hash>.
- MLflow runs visible at http://localhost:5000/#/experiments.
- Topbar KillSwitch shows green; should_halt returned false on every step.
What next
- Walk the full tutorial: tutorials/first-rl-experiment.
- Compose into a workflow: tutorials/first-agent-workflow.
- Add a custom reward term: rl-components.
- Browse the trajectory schema: rl-iceberg.
Inspiration sources
- FinRL (alphaswarm_snippets/inspiration/FinRL-master): env taxonomy (StockTrading, StockPortfolio, multi-crypto), DataProcessor / FeatureEngineer / df_to_array, DRLAgent / DRLEnsembleAgent, composite reward. Ported as registered presets in alphaswarm_rl.envs.finrl_*, alphaswarm_rl.data_pipelines.*, and the WalkForwardEnsembler.
- FinRobot (alphaswarm_snippets/inspiration/FinRobot-master): multi-agent LLM workflow + tool-augmented analysis. Bridged via LLMHybridAgent (LLM proposes, RL refines) and FundamentalBuilder.
- FinRL-X: the four-stage weight-centric pipeline (f_S → f_A → f_T → f_R) is ported as WeightCentricPipeline (AGENTS rule 38).
- FinAgent: five-stage LLM-hybrid adapter ported as finagent_layered (ADR-010, Phase 10).
- PRUDEX-Compass: 17-measure evaluation framework ported as prudex_compass experiment + five chart helpers (ADR-010, Phase 9).
Deeper reads
- rl-lab: interactive RL Lab + builders.
- rl-components: full component catalogue.
- rl-iceberg: trajectory persistence contract.
- rl-policy-backbones: TimeSeriesEncoder subclasses.
- rl-market-dynamics: regime labeller + observation.
- rl-prudex-evaluation: PRUDEX-Compass.
- rl-finagent: FinAgent multimodal adapter.
- weight-centric-pipeline: f_S → f_A → f_T → f_R.
- agentic-rl: RL-as-agent integration patterns.
- architecture/decisions/010-rl-production-enhancement: full Phase 1-12 ADR.
- reference/api: the rl tag in the interactive playground.
- reference/python/alphaswarm_rl: auto-generated alphaswarm_rl Python reference.