ADR-010: alphaswarm_rl production-grade enhancement (Phases 1-12)
Status: accepted (2026-05-24)
Context: The alphaswarm_rl subsystem shipped with the core
RLComponent metaclass, RLRuntime, hash-locked RLExperimentSpec,
and a small set of envs / agents / observations / rewards. The
TradeMaster 1.0.0 codebase contained a much larger, paper-grade
library of:
- Reward shapes (Differential Sharpe Ratio, D3R, Implementation Shortfall, Hindsight, DP-distillation, …).
- Analytical baselines (Almgren-Chriss, Avellaneda-Stoikov).
- Domain envs (PortfolioManagement, OrderExecution PD, AlgorithmicTrading, HFT, MultimodalTrading).
- Paper-grade agents (EIIE, DeepTrader, ETEO, OPD, DeepScalper, HFT_DDQN, InvestorImitator).
- Network backbones (EIIEConv, SAGCN, MarketScorer, HFTQNet, DualHead, PDDualRNN, SARL classifier).
- Market Dynamics Modeling (slice-and-merge regime labeller).
- CSDI diffusion imputation.
- Validation diagnostics (CPCV, PBO, RAS, DSR, walk-forward, BH / Holm-Bonferroni).
- PRUDEX-Compass evaluation suite.
- Three new replay buffers (General / Prioritized / NStepInfo).
Plus the FinAgent multimodal LLM-hybrid agent (Zhang AAAI 24).
Decision: Land all of the above across 12 phases, each adding
new classes that auto-register through existing AlphaSwarm abstractions
(RLComponent, BaseDataset, register_analysis_flow,
BaseExperiment). NO migration of existing components, NO breaking
changes. Every new component:
- Subclasses an existing AlphaSwarm base (RewardTerm, BaseRLAgent, BaseRLEnv, TimeSeriesEncoder, BaseObservationBuilder, BaseExperiment, BaseReplayBuffer, BaseDataset).
- Sets rl_alias so it auto-registers under the right rl_kind.
- Ships unit + property tests under alphaswarm_rl/tests/<phase_dir>/.
- Respects every hard rule in alphaswarm_rl/AGENTS.md.
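
The auto-registration contract above can be illustrated with a minimal sketch. The classes below are hypothetical stand-ins, not the actual RLComponent metaclass: only the names rl_alias, rl_kind and list_components(kind) come from this ADR, and the registry layout and duplicate-alias check are assumptions.

```python
from __future__ import annotations


class _SketchComponentMeta(type):
    """Hypothetical stand-in for the RLComponent metaclass (assumed behaviour)."""

    # kind -> {alias -> class}; the real registry layout is not specified here.
    registry: dict[str, dict[str, type]] = {}

    def __init__(cls, name, bases, namespace):
        super().__init__(name, bases, namespace)
        alias = namespace.get("rl_alias")      # only classes that declare an alias
        kind = getattr(cls, "rl_kind", None)   # kind is inherited from the base
        if alias is None or kind is None:
            return                             # abstract base: nothing to register
        bucket = _SketchComponentMeta.registry.setdefault(kind, {})
        if alias in bucket:
            raise ValueError(f"duplicate rl_alias {alias!r} for kind {kind!r}")
        bucket[alias] = cls


class SketchComponent(metaclass=_SketchComponentMeta):
    @classmethod
    def list_components(cls, kind: str) -> list[str]:
        """Return the sorted aliases registered under `kind`."""
        return sorted(_SketchComponentMeta.registry.get(kind, {}))


class SketchRewardTerm(SketchComponent):
    rl_kind = "reward"                         # base class fixes the kind


class SketchDifferentialSharpe(SketchRewardTerm):
    rl_alias = "differential_sharpe"           # declaring the alias registers the class


print(SketchComponent.list_components("reward"))
```

Under these assumptions a new component needs no explicit registration call: defining the subclass with an rl_alias is enough, which is why the decision requires no changes to existing components.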
Consequences:
- The rl_alias namespace grows by ~40 new aliases; the RLComponent.list_components(kind) registry expands accordingly.
- Heavy dependencies (scipy.signal, scikit-learn) are mandatory for the analysis flow but already in alphaswarm core. No new third-party RL framework dependencies.
- New top-level packages under alphaswarm_rl/src/alphaswarm_rl/: analytical/, evaluation/, replay/, validation/.
- One new analysis flow in the monolith (alphaswarm/analysis/flows/market_dynamics_modeling.py) per hard rule 23.
- One new dataset kind in the monolith (alphaswarm/data/datasets/kinds/csdi_imputed.py) per hard rule 29.
- One new FinAgent toolset in the monolith (alphaswarm/agents/tools/finagent/).
- Five new agent YAMLs under configs/agents/finagent/.
- Documentation: three new alphaswarm_docs/ pages (rl-market-dynamics, rl-prudex-evaluation, rl-finagent) plus this ADR.
Hard rule alignment:
| Rule | Compliance |
|---|---|
| 2 (LLM via router_complete) | FinAgent layered adapter + all 5 stage YAMLs |
| 3 (Iceberg via append_arrow) | CSDI persistence; PRUDEX skips; MDM via gold-tier flow |
| 12 (AgentRuntime for agents) | 5 FinAgent stages = 5 AgentSpec rows |
| 16 (RLRuntime for RL lifecycle) | All new agents / experiments callable through it |
| 18 (IcebergTrajectoryStore) | Untouched; existing path preserved |
| 19 (RLComponent metaclass) | All ~40 new aliases auto-register |
| 20 (router_complete from RL code) | LayeredReflectionAdapter only LLM caller |
| 22 (No direct DB from agent body) | FinAgent tools route through registered DataMCP only |
| 23-25 (Analysis flow → AnalysisRuntime) | MDM flow + register_analysis_flow |
| 29 (BaseDataset for env data) | tradesim_* envs accept BaseDataset / DataFrame |
| 36-38 (Advantage / backbone / weight-centric) | Backbones extend TimeSeriesEncoder; weights flow WeightCentricPipeline ⇒ WeightToOrders |
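
To make the weight-centric row concrete, here is a rough sketch of the kind of conversion a weights-to-orders step performs. It is not the WeightToOrders implementation: the function name, the dict-based inputs, the long-only assumption and whole-share truncation are all assumptions chosen for illustration.

```python
from __future__ import annotations


def weights_to_orders(
    target_weights: dict[str, float],
    prices: dict[str, float],
    positions: dict[str, int],
    equity: float,
) -> dict[str, int]:
    """Return signed share quantities that move `positions` toward `target_weights`.

    Hypothetical helper: assumes long-only weights and whole-share orders.
    """
    orders: dict[str, int] = {}
    # Symbols held but absent from the targets get an implicit weight of 0.
    for symbol in sorted(set(target_weights) | set(positions)):
        target_value = target_weights.get(symbol, 0.0) * equity
        target_shares = int(target_value / prices[symbol])  # truncate toward zero
        delta = target_shares - positions.get(symbol, 0)
        if delta != 0:
            orders[symbol] = delta  # positive = buy, negative = sell
    return orders


orders = weights_to_orders(
    target_weights={"AAA": 0.5, "BBB": 0.25},
    prices={"AAA": 100.0, "BBB": 50.0, "CCC": 10.0},
    positions={"AAA": 20, "BBB": 80, "CCC": 10},
    equity=10_000.0,
)
print(orders)  # {'AAA': 30, 'BBB': -30, 'CCC': -10}
```

Keeping agents responsible only for target weights, with order generation in a separate step, is what lets that step be tested against a mock brokerage.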
Trade-offs:
- CSDI is ensemble imputation, not real diffusion: the full ~1500-LOC PyTorch CSDI model is out of scope. The ensemble imputer satisfies the acceptance gate (MAE < 0.05 on synthetic data) and ships the same public contract (median + quantile bands), so a future drop-in replacement is straightforward.
- RAS is EXPERIMENTAL: it is exposed under the same canonical surface as DSR / PBO but marked as experimental in the docstring; the Rademacher-complexity estimate is Monte-Carlo and depends on n_draws.
- Paper-grade agents lean on SB3: most new agents are thin SB3Adapter subclasses with paper-grade hyperparameters. InvestorImitator (REINFORCE) and OPD (teacher-student dual PPO) are the two genuinely custom implementations. This matches pragmatic deployment practice: SB3 is more thoroughly battle-tested than a from-scratch re-implementation of each paper would be.
- No live broker integration in the test suite: WeightToOrders is tested against _MockBrokerage. The Alpaca / IBKR adapter lives in the monolith and is covered by integration tests there.
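
The n_draws dependence noted for RAS can be seen in a minimal Monte-Carlo sketch of an empirical Rademacher complexity. This is not the alphaswarm_rl implementation: the function name, the (T, S) returns layout and the seed parameter are assumptions. It only shows the textbook estimator, the average over random sign vectors of the best sign-weighted mean return across strategies.

```python
from typing import Optional

import numpy as np


def rademacher_complexity(
    returns: np.ndarray, n_draws: int = 1000, seed: Optional[int] = None
) -> float:
    """Monte-Carlo estimate of the empirical Rademacher complexity.

    `returns` is assumed to be a (T, S) matrix: T periods by S strategies.
    """
    returns = np.asarray(returns, dtype=float)
    if returns.ndim != 2:
        raise ValueError("returns must be a 2-D (T, S) array")
    n_periods = returns.shape[0]
    rng = np.random.default_rng(seed)
    # Each row is one draw of i.i.d. +/-1 signs, one sign per period.
    signs = rng.choice([-1.0, 1.0], size=(n_draws, n_periods))
    # (n_draws, S): sign-weighted mean return of every strategy for every draw.
    scores = signs @ returns / n_periods
    # Supremum over strategies, then the average over draws.
    return float(scores.max(axis=1).mean())


base = np.array([[0.01], [-0.02], [0.03]])
strategies = np.hstack([base, -base])  # a symmetric pair of strategies
coarse = rademacher_complexity(strategies, n_draws=50, seed=0)
fine = rademacher_complexity(strategies, n_draws=5000, seed=0)
print(coarse, fine)
```

Because the result is an average over sampled sign vectors, a small n_draws gives a noisier estimate. Fixing the seed makes a given configuration reproducible but does not remove that sampling error.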