Reinforcement learning framework

The RL layer in AlphaSwarm follows a metaclass-driven, registry-first design inspired by FinRL's library structure and FinRobot's tool-augmented agent runtime. Every concrete component (env, observation, action, reward, termination, policy, agent, data pipeline, ensembler, experiment, trajectory store) auto-registers through alphaswarm_rl/src/alphaswarm_rl/core/base.py so the API and the lab UI can browse them at runtime.
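The auto-registration idea can be illustrated with a minimal sketch. It is not the code in core/base.py: the names RLComponentMeta, REGISTRY, BaseReward, PnLReward, and list_components are hypothetical, and only rl_kind and rl_alias are attribute names taken from this page.

```python
from typing import Dict, List, Tuple

# Hypothetical in-memory registry keyed by (kind, alias).
REGISTRY: Dict[Tuple[str, str], type] = {}


class RLComponentMeta(type):
    """Registers every class that declares its own rl_alias under its rl_kind."""

    def __new__(mcls, name, bases, namespace):
        cls = super().__new__(mcls, name, bases, namespace)
        kind = getattr(cls, "rl_kind", None)   # may be inherited from a base
        alias = namespace.get("rl_alias")      # must be set on the class itself
        if kind and alias:
            key = (kind, alias)
            if key in REGISTRY:
                raise ValueError(f"duplicate RL component: {key}")
            REGISTRY[key] = cls
        return cls


class BaseReward(metaclass=RLComponentMeta):
    """Abstract base: declares the kind but no alias, so it is not registered."""

    rl_kind = "reward"

    def __call__(self, pnl: float) -> float:
        raise NotImplementedError


class PnLReward(BaseReward):
    rl_alias = "pnl"  # illustrative alias, not one listed on this page

    def __call__(self, pnl: float) -> float:
        return pnl


def list_components(kind: str) -> List[str]:
    """Aliases registered under one kind, as an API or UI might browse them."""
    return sorted(alias for (k, alias) in REGISTRY if k == kind)
```

Defining a subclass is the only step needed: after the class body runs, list_components("reward") includes "pnl" without any explicit register call.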

This page is the canonical entry point.

Phase 1-12 production enhancements (May 2026)

The Phase 1-12 deliverables documented in ADR-010 add the following components under their canonical rl_alias / kind:

| Phase | Components |
| --- | --- |
| 1 (Rewards) | differential_sharpe, differential_downside, implementation_shortfall, running_inventory, exp_utility, hindsight, dp_distillation |
| 2 (Analytical) | almgren_chriss_residual, avellaneda_stoikov_residual (+ alphaswarm_rl.analytical.{almgren_chriss,avellaneda_stoikov,cartea_jaimungal} helpers) |
| 3 (Envs) | tradesim_algotrading, tradesim_portfolio, tradesim_execution, tradesim_hft, finagent_trading |
| 4 (Agents) | eiie, deeptrader, investor_imitator, eteo, opd, deepscalper, hft_ddqn, ppo_inhouse |
| 5 (Backbones) | eiie_conv, sagcn, market_scorer, hft_qnet, eteo_dual_head, pd_dual_rnn, sarl_lstm |
| 6 (MDM) | slice_and_merge_regime_flow (analysis flow), regime_aware observation, regime_stratified experiment |
| 7 (CSDI) | csdi_imputed dataset kind |
| 8 (Validation) | CombinatorialPurgedKFold, probability_of_backtest_overfitting, rademacher_anti_serum, deflated_sharpe_ratio, walk_forward_anchored, walk_forward_rolling, benjamini_hochberg, holm_bonferroni, validation_suite experiment |
| 9 (PRUDEX) | PrudexMetrics, PrudexReport, compute_prudex_metrics, 5 chart helpers, prudex_compass experiment |
| 10 (FinAgent) | finagent_layered adapter + 5 AgentSpec YAMLs under configs/agents/finagent/ + 3 tools under alphaswarm/agents/tools/finagent/ |
| 11 (Replay) | GeneralReplayBuffer, PrioritizedReplayBuffer, NStepInfoReplayBuffer |
| 12 (Parity) | Determinism + kill-switch tests around WeightCentricPipeline + WeightToOrders |

Contracts

Two execution shapes share the same hash-locked spec. The standalone shape is the original RL pipeline; the workflow-wrapped shape lets WorkflowRuntime compose RL training into larger multi-stage agentic pipelines (AGENTS rule 40 + ADR-005 + Phase 5 of the orchestration refactor).

Hard rules

  1. Every RL train / evaluate / paper / replay / walk-forward run goes through alphaswarm_rl/src/alphaswarm_rl/runtime.py::RLRuntime (AGENTS rule 16). Tasks under alphaswarm_rl/tasks/rl_tasks.py and API routes under alphaswarm_rl/api/routes/rl.py wrap it; they never call agent.train directly.
  2. rl_experiment_versions rows are immutable and hash-locked. Re-snapshotting via alphaswarm_rl/src/alphaswarm_rl/registry.py::persist_spec inserts a new row only when the SHA-256 of the spec changes (AGENTS rule 17).
  3. Trajectory persistence flows through IcebergTrajectoryStore → iceberg_catalog.append_arrow (AGENTS rule 18).
  4. All concrete components register through the RLComponent metaclass. Set rl_kind to one of the canonical kinds; the metaclass calls @register automatically (AGENTS rule 19).
  5. LLM calls inside LLMHybridAgent route through router_complete (AGENTS rule 20).
  6. Advantage estimation goes through BaseAdvantageEstimator (AGENTS rule 36). The native ReinforcePlusPlusAdvantage / GRPOAdvantage / GAEAdvantage register through the metaclass alongside envs / rewards / policies.
  7. Policy backbones go through TimeSeriesEncoder (AGENTS rule 37). The four shipped backbones — TransformerBackbone, RecurrentBackbone, AutoencoderBackbone, PatchTSTBackbone — wrap existing alphaswarm_models.models modules so the policy network and the offline ML stack share one source of truth.
  8. Weight-centric portfolio actions go through the FinRL-X four-stage pipeline WeightCentricPipeline (f_S → f_A → f_T → f_R, AGENTS rule 38). Risk overlay (f_R) re-uses RiskLimits so offline backtests and live paper paths produce identical target-weight vectors.
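
Rule 8 composes four stages into one deterministic function from inputs to target weights. The sketch below illustrates that shape only. This page confirms just that f_R is the risk overlay; reading f_S, f_A, and f_T as selection, allocation, and timing is an assumption, and the RiskLimits fields and all function bodies here are hypothetical stand-ins rather than the WeightCentricPipeline implementation.

```python
from dataclasses import dataclass
from typing import Dict

Weights = Dict[str, float]


@dataclass(frozen=True)
class RiskLimits:
    """Hypothetical stand-in: caps on a single position and on gross exposure."""
    max_weight: float = 0.4
    max_gross: float = 1.0


def f_s(scores: Dict[str, float], top_k: int) -> Dict[str, float]:
    """Selection (assumed meaning): keep the top_k highest-scoring symbols."""
    ranked = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
    return dict(ranked[:top_k])


def f_a(selected: Dict[str, float]) -> Weights:
    """Allocation (assumed meaning): weights proportional to positive scores."""
    positive = {s: max(v, 0.0) for s, v in selected.items()}
    total = sum(positive.values())
    if total == 0.0:
        return {s: 0.0 for s in positive}
    return {s: v / total for s, v in positive.items()}


def f_t(weights: Weights, exposure: float) -> Weights:
    """Timing (assumed meaning): scale weights by an exposure factor in [0, 1]."""
    exposure = min(max(exposure, 0.0), 1.0)
    return {s: w * exposure for s, w in weights.items()}


def f_r(weights: Weights, limits: RiskLimits) -> Weights:
    """Risk overlay: clip each weight, then rescale if gross exposure is too high."""
    clipped = {s: min(w, limits.max_weight) for s, w in weights.items()}
    gross = sum(abs(w) for w in clipped.values())
    if gross > limits.max_gross:
        clipped = {s: w * limits.max_gross / gross for s, w in clipped.items()}
    return clipped


def target_weights(scores: Dict[str, float], top_k: int,
                   exposure: float, limits: RiskLimits) -> Weights:
    """Compose the four stages in order: f_S -> f_A -> f_T -> f_R."""
    return f_r(f_t(f_a(f_s(scores, top_k)), exposure), limits)
```

Because every stage is a pure function and the limits object is shared, two callers that pass the same inputs and the same RiskLimits get the same weight vector, which is the property the rule relies on for backtest and paper parity.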

Hash-lock invariant in practice

The *_spec_versions table is the contract that makes RL replayable. Three concrete consequences:

  • Same content → same version. Re-posting an identical spec returns the existing version_id. No duplicate row, no side-effect.
  • Any field change → new version. Bump a hyperparameter, swap a reward term, retune the LR schedule — the SHA-256 changes, the row is new. The old row stays forever.
  • Replay is across data, not across code. When you call RLRuntime(spec).replay(new_window), the runtime loads the pinned version_id from rl_runs, rebuilds the env / agent exactly as in the original train run, and feeds it the new bars. This is how "would this policy have held up in Q1 2024?" questions get a deterministic answer.

This is why alphaswarm_rl/src/alphaswarm_rl/registry.py::persist_spec is the only sanctioned path: any direct mutation of the table would corrupt the replay contract.
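
The first two consequences follow from content addressing. A minimal sketch, assuming the hash is taken over a canonical JSON encoding of the spec (the real canonicalisation is not described on this page) and using a hypothetical in-memory VersionTable in place of the database table:

```python
import hashlib
import json
from typing import Dict, Tuple


def spec_hash(spec: dict) -> str:
    """SHA-256 over a canonical JSON encoding (sorted keys, fixed separators)."""
    canonical = json.dumps(spec, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


class VersionTable:
    """Hypothetical in-memory stand-in for an append-only versions table."""

    def __init__(self) -> None:
        self._by_hash: Dict[str, int] = {}
        self._rows: Dict[int, dict] = {}

    def persist_spec(self, spec: dict) -> Tuple[int, bool]:
        """Return (version_id, created). Identical content reuses its row."""
        digest = spec_hash(spec)
        if digest in self._by_hash:
            return self._by_hash[digest], False
        version_id = len(self._rows) + 1
        # Store a copy so later edits to the caller's dict cannot alter the row.
        self._rows[version_id] = json.loads(json.dumps(spec))
        self._by_hash[digest] = version_id
        return version_id, True


table = VersionTable()
first = table.persist_spec({"agent": "ppo", "lr": 3e-4})
same = table.persist_spec({"lr": 3e-4, "agent": "ppo"})   # key order differs only
bumped = table.persist_spec({"agent": "ppo", "lr": 1e-4})  # one field changed
```

Here first and same share a version_id and only the first call creates a row, while bumped receives a new version_id and leaves the earlier row untouched.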

Packages

| Path | Purpose |
| --- | --- |
| alphaswarm_rl/src/alphaswarm_rl/core/ | Abstract bases + RLComponent metaclass + schema helpers. |
| alphaswarm_rl/src/alphaswarm_rl/spec.py | RLExperimentSpec declarative blueprint. |
| alphaswarm_rl/src/alphaswarm_rl/runtime.py | RLRuntime single sanctioned executor. |
| alphaswarm_rl/src/alphaswarm_rl/envs/ | Concrete envs (existing + FinRL ports + TradeSim + FinAgent). |
| alphaswarm_rl/src/alphaswarm_rl/rewards/ | Composable reward terms. |
| alphaswarm_rl/src/alphaswarm_rl/observations/ | Observation builders. |
| alphaswarm_rl/src/alphaswarm_rl/actions/ | Action-space implementations. |
| alphaswarm_rl/src/alphaswarm_rl/terminations/ | End-of-episode predicates. |
| alphaswarm_rl/src/alphaswarm_rl/data_pipelines/ | Iceberg / Yahoo / Alpaca / streaming / replay pipelines. |
| alphaswarm_rl/src/alphaswarm_rl/agents/ | SB3 / ElegantRL / RLlib / CleanRL / LLM-hybrid + classical / Q-family / actor-critic / evolutionary. |
| alphaswarm_rl/src/alphaswarm_rl/policies/ | Policy backbones (TimeSeriesEncoder subclasses). |
| alphaswarm_rl/src/alphaswarm_rl/advantage/ | Advantage estimators (native REINFORCE++ / GRPO / GAE). |
| alphaswarm_rl/src/alphaswarm_rl/ensemblers/ | Walk-forward / best-of-N / curriculum / meta-ensemble. |
| alphaswarm_rl/src/alphaswarm_rl/experiments/ | Experiment runners (basic / walk-forward / ablation / alpha-backtest / regime-stratified / validation-suite / PRUDEX-Compass). |
| alphaswarm_rl/src/alphaswarm_rl/applications/ | One-call FinRL-style apps (stock / portfolio / crypto / fundamentals / paper). |
| alphaswarm_rl/src/alphaswarm_rl/portfolio/ | WeightCentricPipeline (FinRL-X f_S → f_A → f_T → f_R). |
| alphaswarm_rl/src/alphaswarm_rl/trajectories/ | Iceberg-backed trajectory writer + DuckDB views. |
| alphaswarm_rl/src/alphaswarm_rl/bridges/ | Backtest-engine + WorkflowRuntime adapters. |
| alphaswarm/persistence/models_rl.py | ORM for specs, versions, runs, evaluations, refs, registrations. |
| alphaswarm_rl/api/routes/rl.py | REST surface. |
| alphaswarm_rl/tasks/rl_tasks.py | Celery tasks driven by RLRuntime. |
| alphaswarm_client/src/routes/rl/ | RL Lab + builders + library + runs UI (active Vite frontend). |
| alphaswarm_rl/configs/ | Preset / reward / observation / data-pipeline YAMLs. |
| alphaswarm_rl/tests/ | Hermetic test suite. |

Legacy alphaswarm.rl.* is a deprecation shim that re-exports from alphaswarm_rl.*; new code imports from alphaswarm_rl directly.

Spec lifecycle

  1. Author an RLExperimentSpec (YAML or in-code Pydantic).
  2. Persist via alphaswarm_rl.registry.persist_spec → rl_experiment_specs + rl_experiment_versions (hash-locked snapshot).
  3. Run via RLRuntime.train / .evaluate / .paper / .replay / .walk_forward → opens an rl_runs row, builds the env / agent from build_from_config, drives training, persists per-step trajectories to Iceberg, finalises the run row.
  4. Inspect via the API (/rl/runs/{id}/equity, /trajectories, /reward-decomposition, /episodes) and the lab UI run-detail page (equity chart, reward decomposition, episode summary, replay slider).

Worked example: train + replay

Goal: snapshot a 50k-step PPO experiment, train it, inspect the ledger row, read trajectories from Iceberg, and replay against fresh data — all from this page.

Step 1 — snapshot the spec

The experiment YAML lives at alphaswarm_rl/configs/experiments/my_first_rl.yaml. Dispatch the train run:

Notice spec_hash in the response — that is the immutable hash-lock key. Re-posting the same YAML returns the same spec_version_id.

Step 2 — tail progress

curl -N http://localhost:8000/chat/stream/<task_id>

Frames arrive in the canonical envelope (AGENTS rule 4). Expected stages: start → data.loaded → env.built → agent.built → train.step (×many, sparse) → train.checkpoint → done.
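
A client that tails the stream may want to confirm it saw the stages in the expected order. The sketch below is one way to do that under two assumptions that this page does not confirm: the stage name has already been extracted from each frame (the envelope layout is not shown here), and train.step and train.checkpoint frames may interleave before done. The helper name is hypothetical.

```python
from typing import Iterable

# Setup stages, in the order listed above.
SETUP = ["start", "data.loaded", "env.built", "agent.built"]
# Stages allowed between setup and completion.
TRAINING = {"train.step", "train.checkpoint"}


def stages_in_expected_order(stages: Iterable[str]) -> bool:
    """True if the sequence is setup, then training frames, then done."""
    seq = list(stages)
    n = len(SETUP)
    if seq[:n] != SETUP or seq[-1] != "done":
        return False
    body = seq[n:-1]
    # Require at least one training step and nothing outside the training set.
    return "train.step" in body and all(s in TRAINING for s in body)
```

A stream that ends without done, or that reports env.built before data.loaded, makes the function return False, which is a cheap signal to go and inspect the run row.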

Step 3 — inspect the ledger

The agent-safe read is data.rl.list / data.rl.describe:

curl -X POST http://localhost:8000/mcp/data/tools/data.rl.describe/invoke \
-H "Content-Type: application/json" \
-H "Authorization: Bearer $(alphaswarm-cli auth token)" \
-d '{"rl_run_id": "<from-step-1>"}'

The response carries status, mean_reward, total_timesteps, spec_version_id, MLflow run id, and the trajectory namespace.

Step 4 — read trajectories from Iceberg

Pyodide does not ship PyIceberg, but it ships duckdb + pyarrow, and the trajectory writer exports a parquet-compatible view. The snippet below shows the analytical pattern with inline sample data so it runs in your browser.
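
A minimal sketch of that analytical pattern follows. It swaps in Python's standard-library sqlite3 for DuckDB so that it has no third-party dependencies; the query uses only plain aggregate SQL that DuckDB also accepts. The sample rows are made up for illustration, the column names follow the per-step trajectory layout described on this page, and the column types and the episode_summary helper are assumptions.

```python
import sqlite3
from typing import List, Tuple

# Made-up sample rows: (episode, step, obs_hash, action, reward, value, log_prob).
SAMPLE = [
    (0, 0, "a1", 1, 0.5, 0.40, -0.69),
    (0, 1, "b2", 0, -0.2, 0.35, -0.51),
    (0, 2, "c3", 1, 0.7, 0.50, -0.36),
    (1, 0, "d4", 2, 0.1, 0.20, -1.10),
    (1, 1, "e5", 2, 0.3, 0.25, -0.92),
]


def episode_summary(rows: List[tuple]) -> List[Tuple[int, int, float, float]]:
    """Per-episode step count, total reward, and mean value estimate."""
    con = sqlite3.connect(":memory:")
    try:
        con.execute(
            "CREATE TABLE trajectories ("
            "episode INTEGER, step INTEGER, obs_hash TEXT, action INTEGER, "
            "reward REAL, value REAL, log_prob REAL)"
        )
        con.executemany(
            "INSERT INTO trajectories VALUES (?, ?, ?, ?, ?, ?, ?)", rows
        )
        return con.execute(
            "SELECT episode, COUNT(*) AS steps, "
            "ROUND(SUM(reward), 6) AS total_reward, "
            "ROUND(AVG(value), 6) AS mean_value "
            "FROM trajectories GROUP BY episode ORDER BY episode"
        ).fetchall()
    finally:
        con.close()


summary = episode_summary(SAMPLE)
```

Each tuple in summary holds one episode's step count, total reward, and mean value, which is the shape a run-detail chart or a quick sanity check would consume.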

The same pattern works against the real Iceberg trajectory tables via the data.iceberg.read_snapshot MCP tool. The tables are:

  • alphaswarm_silver_rl_trajectories.<spec_hash> — per-step (episode, step, obs_hash, action, reward, value, log_prob)
  • alphaswarm_silver_rl_equity_curves.<spec_hash> — per-step equity / drawdown
  • alphaswarm_silver_rl_action_logs.<spec_hash> — full action vectors per step
  • alphaswarm_silver_rl_reward_decomposition.<spec_hash> — per-term reward attribution

Step 5 — replay against fresh data

The killer feature of hash-locked specs: replay the trained policy against a different time window WITHOUT touching the spec.

The new rl_runs row carries parent_run_id and the SAME spec_version_id as the original train run. Two rl_runs rows, one rl_experiment_versions row.

Step 6 — verify

  • rl_experiment_versions row with the recorded spec_hash.
  • Two rl_runs rows referencing it (train + replay).
  • Trajectory tables in alphaswarm_silver_rl_trajectories.<spec_hash>.
  • MLflow runs visible at http://localhost:5000/#/experiments.
  • Topbar KillSwitch shows green; should_halt returned false on every step.
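
The first two checks can be expressed as a small lineage test over the run rows. A minimal sketch, assuming the rows have already been fetched: spec_version_id and parent_run_id are the field names used on this page, while RunRow, run_id, mode, and check_replay_lineage are hypothetical.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class RunRow:
    """Hypothetical projection of one rl_runs row."""
    run_id: str
    spec_version_id: int
    mode: str  # "train" or "replay" (assumed values)
    parent_run_id: Optional[str] = None


def check_replay_lineage(runs: List[RunRow]) -> List[str]:
    """Return a list of violations; an empty list means the lineage holds."""
    problems: List[str] = []
    trains = [r for r in runs if r.mode == "train"]
    replays = [r for r in runs if r.mode == "replay"]
    if not trains:
        problems.append("no train run found")
    if len({r.spec_version_id for r in runs}) > 1:
        problems.append("runs reference more than one spec_version_id")
    train_ids = {r.run_id for r in trains}
    for r in replays:
        # A replay must point back at a train run in the same set.
        if r.parent_run_id not in train_ids:
            problems.append(f"replay {r.run_id} has no train parent")
    return problems


good = [
    RunRow("run-1", 7, "train"),
    RunRow("run-2", 7, "replay", parent_run_id="run-1"),
]
```

check_replay_lineage(good) returns an empty list; changing the replay's spec_version_id or dropping its parent_run_id produces a violation message instead.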

Inspiration sources

  • FinRL (alphaswarm_snippets/inspiration/FinRL-master) — env taxonomy (StockTrading, StockPortfolio, multi-crypto), DataProcessor / FeatureEngineer / df_to_array, DRLAgent / DRLEnsembleAgent, composite reward. Ported as registered presets in alphaswarm_rl.envs.finrl_*, alphaswarm_rl.data_pipelines.*, and the WalkForwardEnsembler.
  • FinRobot (alphaswarm_snippets/inspiration/FinRobot-master) — multi-agent LLM workflow + tool-augmented analysis. Bridged via LLMHybridAgent (LLM proposes, RL refines) and FundamentalBuilder.
  • FinRL-X — the four-stage weight-centric pipeline (f_S → f_A → f_T → f_R) is ported as WeightCentricPipeline (AGENTS rule 38).
  • FinAgent — five-stage LLM-hybrid adapter ported as finagent_layered (ADR-010, Phase 10).
  • PRUDEX-Compass — 17-measure evaluation framework ported as prudex_compass experiment + five chart helpers (ADR-010, Phase 9).

Deeper reads