Reinforcement learning framework

The RL layer in AlphaSwarm follows a metaclass-driven, registry-first design inspired by FinRL's library structure and FinRobot's tool-augmented agent runtime. Every concrete component (env, observation, action, reward, termination, policy, agent, data pipeline, ensembler, experiment, trajectory store) auto-registers through alphaswarm_rl/src/alphaswarm_rl/core/base.py so the API and the lab UI can browse them at runtime.

This page is the canonical entry point. For shorter cuts:

rl-lab: interactive RL Lab + builders.
rl-components: auto-generated component reference (browse via /rl/components in the operator UI).
rl-iceberg: Iceberg trajectory / equity / reward-decomposition tables and DuckDB views.
rl-market-dynamics: Phase 6 slice-and-merge regime labeller + RegimeAwareObservation + RegimeStratifiedEvaluation.
rl-prudex-evaluation: Phase 9 PRUDEX-Compass framework (17 measures, 5 visualisations).
rl-finagent: Phase 10 FinAgent multimodal 5-stage LLM-hybrid adapter.
weight-centric-pipeline: FinRL-X four-stage f_S → f_A → f_T → f_R pipeline.
architecture/decisions/010-rl-production-enhancement: full Phase 1-12 production-enhancement ADR.

Phase 1-12 production enhancements (May 2026)

The Phase 1-12 deliverables documented in ADR-010 add the following components under their canonical rl_alias / kind:

Phase	Components
1 (Rewards)	`differential_sharpe`, `differential_downside`, `implementation_shortfall`, `running_inventory`, `exp_utility`, `hindsight`, `dp_distillation`
2 (Analytical)	`almgren_chriss_residual`, `avellaneda_stoikov_residual` (+ `alphaswarm_rl.analytical.{almgren_chriss,avellaneda_stoikov,cartea_jaimungal}` helpers)
3 (Envs)	`tradesim_algotrading`, `tradesim_portfolio`, `tradesim_execution`, `tradesim_hft`, `finagent_trading`
4 (Agents)	`eiie`, `deeptrader`, `investor_imitator`, `eteo`, `opd`, `deepscalper`, `hft_ddqn`, `ppo_inhouse`
5 (Backbones)	`eiie_conv`, `sagcn`, `market_scorer`, `hft_qnet`, `eteo_dual_head`, `pd_dual_rnn`, `sarl_lstm`
6 (MDM)	`slice_and_merge_regime_flow` (analysis flow), `regime_aware` observation, `regime_stratified` experiment
7 (CSDI)	`csdi_imputed` dataset kind
8 (Validation)	`CombinatorialPurgedKFold`, `probability_of_backtest_overfitting`, `rademacher_anti_serum`, `deflated_sharpe_ratio`, `walk_forward_anchored`, `walk_forward_rolling`, `benjamini_hochberg`, `holm_bonferroni`, `validation_suite` experiment
9 (PRUDEX)	`PrudexMetrics`, `PrudexReport`, `compute_prudex_metrics`, 5 chart helpers, `prudex_compass` experiment
10 (FinAgent)	`finagent_layered` adapter + 5 AgentSpec YAMLs under `configs/agents/finagent/` + 3 tools under `alphaswarm/agents/tools/finagent/`
11 (Replay)	`GeneralReplayBuffer`, `PrioritizedReplayBuffer`, `NStepInfoReplayBuffer`
12 (Parity)	Determinism + kill-switch tests around `WeightCentricPipeline` + `WeightToOrders`

Contracts

Two execution shapes share the same hash-locked spec. The standalone shape is the original RL pipeline; the workflow-wrapped shape lets WorkflowRuntime compose RL training into larger multi-stage agentic pipelines (AGENTS rule 40 + ADR-005 + Phase 5 of the orchestration refactor).

Hard rules

All RL train / evaluate / paper / replay / walk-forward goes through alphaswarm_rl/src/alphaswarm_rl/runtime.py::RLRuntime (AGENTS rule 16). Tasks under alphaswarm_rl/tasks/rl_tasks.py and API routes under alphaswarm_rl/api/routes/rl.py wrap it; they never call agent.train directly.
rl_experiment_versions rows are immutable, hash-locked. Re-snapshotting via alphaswarm_rl/src/alphaswarm_rl/registry.py::persist_spec inserts a new row when the SHA-256 of the spec changes (AGENTS rule 17).
Trajectory persistence flows through IcebergTrajectoryStore → iceberg_catalog.append_arrow (AGENTS rule 18).
All concrete components register through the RLComponent metaclass. Set rl_kind to one of the canonical kinds; the metaclass calls @register automatically (AGENTS rule 19).
LLM calls inside LLMHybridAgent route through router_complete (AGENTS rule 20).
Advantage estimation goes through BaseAdvantageEstimator (AGENTS rule 36). The native ReinforcePlusPlusAdvantage / GRPOAdvantage / GAEAdvantage register through the metaclass alongside envs / rewards / policies.
Policy backbones go through TimeSeriesEncoder (AGENTS rule 37). The four shipped backbones (TransformerBackbone, RecurrentBackbone, AutoencoderBackbone, PatchTSTBackbone) wrap existing alphaswarm_models.models modules so the policy network and the offline ML stack share one source of truth.
Weight-centric portfolio actions go through the FinRL-X four-stage pipeline WeightCentricPipeline (f_S → f_A → f_T → f_R, AGENTS rule 38). Risk overlay (f_R) re-uses RiskLimits so offline backtests and live paper paths produce identical target-weight vectors.
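As a rough illustration of the metaclass rule above, the sketch below shows how declaring rl_kind on a class body can be enough to register it. The names SketchComponentMeta, SketchComponent, SKETCH_REGISTRY and the three-kind subset are hypothetical stand-ins; the real metaclass in core/base.py may differ in naming, validation, and storage.

```python
# Hypothetical registry keyed by (kind, alias); not the real AlphaSwarm registry.
SKETCH_REGISTRY: dict[tuple[str, str], type] = {}

# Illustrative subset of kinds; the canonical list is defined by the project.
SKETCH_KINDS = {"env", "reward", "policy"}


class SketchComponentMeta(type):
    """Registers every class whose own body declares rl_kind."""

    def __new__(mcls, name, bases, namespace):
        cls = super().__new__(mcls, name, bases, namespace)
        kind = namespace.get("rl_kind")
        if kind is None:
            # Abstract base (no rl_kind in the class body): nothing to register.
            return cls
        if kind not in SKETCH_KINDS:
            raise ValueError(f"unknown rl_kind {kind!r} on {name}")
        # Fall back to the lowercased class name when no alias is given.
        alias = namespace.get("rl_alias", name.lower())
        key = (kind, alias)
        if key in SKETCH_REGISTRY:
            raise ValueError(f"duplicate registration for {key}")
        SKETCH_REGISTRY[key] = cls
        return cls


class SketchComponent(metaclass=SketchComponentMeta):
    """Abstract base: declares no rl_kind, so it is not registered."""


class DifferentialSharpeSketch(SketchComponent):
    rl_kind = "reward"
    rl_alias = "differential_sharpe"


# Lookup by (kind, alias), as a registry browser might do at runtime.
found = SKETCH_REGISTRY[("reward", "differential_sharpe")]
```

Registering at class-creation time means that importing a module is enough to make its components visible to a registry browser, with no explicit decorator call at each definition site.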

Hash-lock invariant in practice

The *_spec_versions table is the contract that makes RL replayable. Three concrete consequences:

Same content → same version. Re-posting an identical spec returns the existing version_id. No duplicate row, no side-effect.
Any field change → new version. Bump a hyperparameter, swap a reward term, or retune the LR schedule: the SHA-256 changes and a new row is inserted. The old row stays forever.
Replay is across data, not across code. When you call RLRuntime(spec).replay(new_window), the runtime loads the pinned version_id from rl_runs, rebuilds the env / agent exactly as in the original train run, and feeds it the new bars. This is how "would this policy have held up in Q1 2024?" questions get a deterministic answer.

This is why alphaswarm_rl/src/alphaswarm_rl/registry.py::persist_spec is the only sanctioned path: any direct mutation of the table would corrupt the replay contract.
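The first two consequences can be sketched with a content hash over a canonical serialisation. This is a minimal illustration, not the behaviour of persist_spec itself: the canonical-JSON step, the in-memory table, and the spec fields shown are all assumptions made for the example.

```python
import hashlib
import json


def sketch_spec_hash(spec: dict) -> str:
    """SHA-256 over a canonical JSON form, so key order does not matter."""
    canonical = json.dumps(spec, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


class SketchVersionTable:
    """Hypothetical append-only table: one row per distinct spec hash."""

    def __init__(self) -> None:
        self._version_by_hash: dict[str, int] = {}

    def persist(self, spec: dict) -> int:
        digest = sketch_spec_hash(spec)
        if digest not in self._version_by_hash:
            # New content: insert a new row; existing rows are never updated.
            self._version_by_hash[digest] = len(self._version_by_hash) + 1
        return self._version_by_hash[digest]

    def __len__(self) -> int:
        return len(self._version_by_hash)


table = SketchVersionTable()

# Illustrative spec fields only; the real RLExperimentSpec has its own schema.
spec = {"agent": "ppo_inhouse", "total_timesteps": 50_000, "learning_rate": 3e-4}
same_content = {"learning_rate": 3e-4, "total_timesteps": 50_000, "agent": "ppo_inhouse"}
changed = {**spec, "learning_rate": 1e-4}

v1 = table.persist(spec)
v1_again = table.persist(same_content)  # identical content, different key order
v2 = table.persist(changed)             # one hyperparameter changed
```

Here v1 and v1_again are the same version id, while v2 is a new one, and the table ends up with exactly two rows.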

Packages

Path	Purpose
alphaswarm_rl/src/alphaswarm_rl/core/	Abstract bases + `RLComponent` metaclass + schema helpers.
alphaswarm_rl/src/alphaswarm_rl/spec.py	`RLExperimentSpec` declarative blueprint.
alphaswarm_rl/src/alphaswarm_rl/runtime.py	`RLRuntime` single sanctioned executor.
alphaswarm_rl/src/alphaswarm_rl/envs/	Concrete envs (existing + FinRL ports + TradeSim + FinAgent).
alphaswarm_rl/src/alphaswarm_rl/rewards/	Composable reward terms.
alphaswarm_rl/src/alphaswarm_rl/observations/	Observation builders.
alphaswarm_rl/src/alphaswarm_rl/actions/	Action-space implementations.
alphaswarm_rl/src/alphaswarm_rl/terminations/	End-of-episode predicates.
alphaswarm_rl/src/alphaswarm_rl/data_pipelines/	Iceberg / Yahoo / Alpaca / streaming / replay pipelines.
alphaswarm_rl/src/alphaswarm_rl/agents/	SB3 / ElegantRL / RLlib / CleanRL / LLM-hybrid + classical / Q-family / actor-critic / evolutionary.
alphaswarm_rl/src/alphaswarm_rl/policies/	Policy backbones (`TimeSeriesEncoder` subclasses).
alphaswarm_rl/src/alphaswarm_rl/advantage/	Advantage estimators (native REINFORCE++ / GRPO / GAE).
alphaswarm_rl/src/alphaswarm_rl/ensemblers/	Walk-forward / best-of-N / curriculum / meta-ensemble.
alphaswarm_rl/src/alphaswarm_rl/experiments/	Experiment runners (basic / walk-forward / ablation / alpha-backtest / regime-stratified / validation-suite / PRUDEX-Compass).
alphaswarm_rl/src/alphaswarm_rl/applications/	One-call FinRL-style apps (stock / portfolio / crypto / fundamentals / paper).
alphaswarm_rl/src/alphaswarm_rl/portfolio/	`WeightCentricPipeline` (FinRL-X `f_S → f_A → f_T → f_R`).
alphaswarm_rl/src/alphaswarm_rl/trajectories/	Iceberg-backed trajectory writer + DuckDB views.
alphaswarm_rl/src/alphaswarm_rl/bridges/	Backtest-engine + WorkflowRuntime adapters.
alphaswarm/persistence/models_rl.py	ORM for specs, versions, runs, evaluations, refs, registrations.
alphaswarm_rl/api/routes/rl.py	REST surface.
alphaswarm_rl/tasks/rl_tasks.py	Celery tasks driven by `RLRuntime`.
alphaswarm_client/src/routes/rl/	RL Lab + builders + library + runs UI (active Vite frontend).
alphaswarm_rl/configs/	Preset / reward / observation / data-pipeline YAMLs.
alphaswarm_rl/tests/	Hermetic test suite.

Legacy alphaswarm.rl.* is a deprecation shim that re-exports from alphaswarm_rl.*; new code imports from alphaswarm_rl directly.

Spec lifecycle

Author an RLExperimentSpec (YAML or in-code Pydantic).
Persist via alphaswarm_rl.registry.persist_spec → rl_experiment_specs + rl_experiment_versions (hash-locked snapshot).
Run via RLRuntime.train / .evaluate / .paper / .replay / .walk_forward → opens an rl_runs row, builds the env / agent from build_from_config, drives training, persists per-step trajectories to Iceberg, finalises the run row.
Inspect via the API (/rl/runs/{id}/equity, /trajectories, /reward-decomposition, /episodes) and the lab UI run-detail page (equity chart, reward decomposition, episode summary, replay slider).

Worked example: train + replay

Goal: snapshot a 50k-step PPO experiment, train it, inspect the ledger row, read trajectories from Iceberg, and replay against fresh data, all from this page.

Step 1: snapshot the spec

The experiment YAML lives at alphaswarm_rl/configs/experiments/my_first_rl.yaml. Dispatch the train run:

Notice spec_hash in the response: that is the immutable hash-lock key. Re-posting the same YAML returns the same spec_version_id.

Step 2: tail progress

curl -N http://localhost:8000/chat/stream/<task_id>

Frames arrive in the canonical envelope (AGENTS rule 4). Expected stages: start → data.loaded → env.built → agent.built → train.step (×many, sparse) → train.checkpoint → done.
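When tailing a run from a script, it can help to confirm that the observed stages arrive in the expected order. The sketch below works on stage names only and makes no assumption about the envelope fields. The helper name stages_in_order is hypothetical, and grouping train.step with train.checkpoint in one training phase (so checkpoints may interleave with steps) is an assumption rather than documented behaviour.

```python
# Expected phases, in order. Stages inside one tuple may interleave freely.
EXPECTED_PHASES = [
    ("start",),
    ("data.loaded",),
    ("env.built",),
    ("agent.built",),
    ("train.step", "train.checkpoint"),  # assumed to share one training phase
    ("done",),
]

# Map each stage name to the index of the phase it belongs to.
PHASE_RANK = {
    stage: rank
    for rank, group in enumerate(EXPECTED_PHASES)
    for stage in group
}


def stages_in_order(stages: list[str]) -> bool:
    """True if the run starts with 'start', ends with 'done', and never moves backwards."""
    if not stages or stages[0] != "start" or stages[-1] != "done":
        return False
    if any(stage not in PHASE_RANK for stage in stages):
        return False
    ranks = [PHASE_RANK[stage] for stage in stages]
    # Phase rank must be non-decreasing from one frame to the next.
    return all(earlier <= later for earlier, later in zip(ranks, ranks[1:]))


observed = [
    "start", "data.loaded", "env.built", "agent.built",
    "train.step", "train.step", "train.checkpoint", "done",
]
ok = stages_in_order(observed)
```

The check covers ordering only; it does not require every stage to be present, since train.step frames are described as sparse.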

Step 3: inspect the ledger

The agent-safe read is data.rl.list / data.rl.describe:

curl -X POST http://localhost:8000/mcp/data/tools/data.rl.describe/invoke \
    -H "Content-Type: application/json" \
    -H "Authorization: Bearer $(alphaswarm-cli auth token)" \
    -d '{"rl_run_id": "<from-step-1>"}'

The response carries status, mean_reward, total_timesteps, spec_version_id, MLflow run id, and the trajectory namespace.
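For scripted checks, the same call can be built with Python's standard library. This sketch mirrors the curl command above, reusing its URL, headers, and body. The helper name build_describe_request is hypothetical, the token is passed in rather than fetched from the CLI, and the request is only constructed here, not sent.

```python
import json
import urllib.request


def build_describe_request(
    rl_run_id: str,
    token: str,
    base_url: str = "http://localhost:8000",
) -> urllib.request.Request:
    """Build the POST request for the data.rl.describe tool (not sent here)."""
    body = json.dumps({"rl_run_id": rl_run_id}).encode("utf-8")
    return urllib.request.Request(
        url=f"{base_url}/mcp/data/tools/data.rl.describe/invoke",
        data=body,
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {token}",
        },
        method="POST",
    )


# Placeholders: substitute the run id from step 1 and a real bearer token.
req = build_describe_request("<from-step-1>", token="<token>")

# To actually send it against a running server:
#     with urllib.request.urlopen(req) as resp:
#         payload = json.load(resp)
```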

Step 4: read trajectories from Iceberg

Pyodide does not ship PyIceberg, but it ships duckdb + pyarrow, and the trajectory writer exports a parquet-compatible view. The snippet below shows the analytical pattern with inline sample data so it runs in your browser.
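A minimal sketch of that analytical pattern follows. It uses the standard-library sqlite3 module as a stand-in for DuckDB so that it runs without extra dependencies; the SQL itself is plain aggregate SQL. The table name and sample rows are made up for illustration, and only three of the per-step columns (episode, step, reward) are included.

```python
import sqlite3

# Made-up per-step rows: (episode, step, reward).
sample_rows = [
    (0, 0, 0.10),
    (0, 1, -0.05),
    (0, 2, 0.20),
    (1, 0, -0.10),
    (1, 1, 0.30),
]

con = sqlite3.connect(":memory:")
con.execute(
    "CREATE TABLE trajectories (episode INTEGER, step INTEGER, reward REAL)"
)
con.executemany("INSERT INTO trajectories VALUES (?, ?, ?)", sample_rows)

# Per-episode summary: number of steps and total reward.
summary = con.execute(
    """
    SELECT episode,
           COUNT(*)    AS n_steps,
           SUM(reward) AS total_reward
    FROM trajectories
    GROUP BY episode
    ORDER BY episode
    """
).fetchall()

for episode, n_steps, total_reward in summary:
    print(f"episode={episode} steps={n_steps} total_reward={total_reward:.2f}")

con.close()
```

The query groups per-step rows by episode, which is the usual first cut before looking at equity curves or per-term reward attribution.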

The same pattern works against the real Iceberg trajectory tables via the data.iceberg.read_snapshot MCP tool. The tables are:

alphaswarm_silver_rl_trajectories.<spec_hash>: per-step (episode, step, obs_hash, action, reward, value, log_prob)
alphaswarm_silver_rl_equity_curves.<spec_hash>: per-step equity / drawdown
alphaswarm_silver_rl_action_logs.<spec_hash>: full action vectors per step
alphaswarm_silver_rl_reward_decomposition.<spec_hash>: per-term reward attribution

Step 5: replay against fresh data

The killer feature of hash-locked specs: replay the trained policy against a different time window WITHOUT touching the spec.

The new rl_runs row carries parent_run_id and the SAME spec_version_id as the original train run. Two rl_runs rows, one rl_experiment_versions row.

Step 6: verify

rl_experiment_versions row with the recorded spec_hash.
Two rl_runs rows referencing it (train + replay).
Trajectory tables in alphaswarm_silver_rl_trajectories.<spec_hash>.
MLflow runs visible at http://localhost:5000/#/experiments.
Topbar KillSwitch shows green; should_halt returned false on every step.

What next

Walk the full tutorial: tutorials/first-rl-experiment.
Compose into a workflow: tutorials/first-agent-workflow, concepts/agentic/workflow-studio.
Add a custom reward term: rl-components.
Browse the trajectory schema: rl-iceberg.

Inspiration sources

FinRL (alphaswarm_snippets/inspiration/FinRL-master): env taxonomy (StockTrading, StockPortfolio, multi-crypto), DataProcessor / FeatureEngineer / df_to_array, DRLAgent / DRLEnsembleAgent, composite reward. Ported as registered presets in alphaswarm_rl.envs.finrl_*, alphaswarm_rl.data_pipelines.*, and the WalkForwardEnsembler.
FinRobot (alphaswarm_snippets/inspiration/FinRobot-master): multi-agent LLM workflow + tool-augmented analysis. Bridged via LLMHybridAgent (LLM proposes, RL refines) and FundamentalBuilder.
FinRL-X: the four-stage weight-centric pipeline (f_S → f_A → f_T → f_R) is ported as WeightCentricPipeline (AGENTS rule 38).
FinAgent: five-stage LLM-hybrid adapter ported as finagent_layered (ADR-010, Phase 10).
PRUDEX-Compass: 17-measure evaluation framework ported as prudex_compass experiment + five chart helpers (ADR-010, Phase 9).

Deeper reads

rl-lab: interactive RL Lab + builders.
rl-components: full component catalogue.
rl-iceberg: trajectory persistence contract.
rl-policy-backbones: TimeSeriesEncoder subclasses.
rl-market-dynamics: regime labeller + observation.
rl-prudex-evaluation: PRUDEX-Compass.
rl-finagent: FinAgent multimodal adapter.
weight-centric-pipeline: f_S → f_A → f_T → f_R.
agentic-rl: RL-as-agent integration patterns.
architecture/decisions/010-rl-production-enhancement: full Phase 1-12 ADR.
reference/api: the rl tag in the interactive playground.
reference/python/alphaswarm_rl: auto-generated alphaswarm_rl Python reference.
