Hybrid agentic-RL + backtest
AlphaSwarm's port of the FinRL-X "deployment-consistent" blueprint plus the NVIDIA-NeMo/RL advantage primitives — wired into AlphaSwarm's existing spec-driven runtimes (rule 16).
What changed
The Phase 1-9 rollout closes the "backtest-to-paper-trading gap" by
making the target portfolio weight vector the single immutable
interface between an RL policy and any execution mechanism
(offline backtest engine OR live broker). The same w_t flows
through:
- the offline simulation (via the new RLBacktestEnv)
- the paper / live execution path (via WeightToOrders)
- the AST-sandboxed alpha factor authoring loop (via AlphaResearcher)
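To make the weight-centric idea concrete, here is a minimal sketch in which one target weight vector feeds two consumers: a simulated one-step portfolio return and a translation into signed notional trades. The names `simulate_step` and `weights_to_trades` are hypothetical stand-ins for illustration only; they are not the RLBacktestEnv or WeightToOrders APIs, and the sketch ignores costs, lot sizes, and kill-switch gating.

```python
from typing import Dict

Weights = Dict[str, float]


def simulate_step(weights: Weights, asset_returns: Dict[str, float]) -> float:
    """Offline consumer: one-step portfolio return under the target weights."""
    return sum(w * asset_returns.get(sym, 0.0) for sym, w in weights.items())


def weights_to_trades(
    weights: Weights, holdings_value: Dict[str, float], equity: float
) -> Dict[str, float]:
    """Live consumer: signed notional to trade per symbol to reach the targets."""
    symbols = sorted(set(weights) | set(holdings_value))
    return {
        s: weights.get(s, 0.0) * equity - holdings_value.get(s, 0.0)
        for s in symbols
    }


# The same w_t is handed, unchanged, to both consumers.
w_t = {"AAPL": 0.6, "MSFT": 0.4}
step_return = simulate_step(w_t, {"AAPL": 0.01, "MSFT": -0.02})
trades = weights_to_trades(w_t, {"AAPL": 500.0, "MSFT": 500.0}, equity=1000.0)
```

Because neither consumer mutates or reinterprets `w_t`, any divergence between backtest and live behaviour has to come from the execution side rather than from the policy output.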
Quick reference
| Concept | One-liner | File |
|---|---|---|
| WeightCentricPipeline | FinRL-X f_S -> f_A -> f_T -> f_R composable pipeline | alphaswarm/rl/portfolio/pipeline.py |
| RLBacktestEnv | BaseRLEnv + gym.Env wrapping any registered BaseBacktestEngine | alphaswarm/rl/envs/rl_backtest_env.py |
| RLAgentBridge | Channel exposed via context['rl_agent'] on every engine flipping supports_rl_injection=True | alphaswarm/rl/bridges/agent_bridge.py |
| ReinforcePlusPlusAdvantage | Leave-one-out cohort baseline + decoupled global normalisation (NeMo-RL port) | alphaswarm/rl/advantage/reinforce_plus_plus.py |
| GRPOAdvantage | Group-relative no-critic advantage (DeepSeek R1 / NeMo-RL parity) | alphaswarm/rl/advantage/grpo.py |
| StopProperlyWrapper | Scales reward of truncated episodes by coef in [0, 1] (NeMo-RL stop_properly_penalty_coef) | alphaswarm/rl/rewards/stop_properly.py |
| Truncating terminations | DrawdownTermination / MarginCallTermination / RiskBreachTermination carry truncates_episode=True | alphaswarm/rl/terminations/ |
| WeightToOrders | Kill-switch-gated translator from target weights to DomainOrder | alphaswarm/rl/execution/weight_to_orders.py |
| RedisFeatureStore | Flink → Redis IFeatureStore for live RL observation | alphaswarm/streaming/feature_store/redis_store.py |
| AlphaVantageIngester | REST-poll Alpha Vantage and publish to Kafka | alphaswarm/streaming/ingesters/alphavantage.py |
| DeterministicMedallionReplay | Read-only RL data pipeline pinned to silver/gold Iceberg snapshots | alphaswarm/rl/data_pipelines/medallion_replay.py |
| data.alphas.* / data.backtests.* / data.rl.* / data.brokers.* | New DataMCPTools (rule 22) | alphaswarm/data/mcp/tools/ |
| alpha_factors / backtest_summaries / rl_trajectory_summaries corpora | RAG "alpha base" (rule 11) | alphaswarm/rag/orders.py |
| RLTradingBot | Bot subtype driven by RLRuntime (rule 14) | alphaswarm/bots/rl_trading_bot.py |
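The two advantage estimators in the table can be sketched as follows. This is an illustrative reading of "group-relative" and "leave-one-out baseline + global normalisation", not the code in alphaswarm/rl/advantage/ or in NeMo-RL; the function names are hypothetical, and the exact normalisation used upstream (for example masking or the choice of standard deviation) may differ.

```python
from statistics import mean, pstdev
from typing import List


def grpo_advantages(group_rewards: List[float], eps: float = 1e-8) -> List[float]:
    """Group-relative advantage: standardise each reward within its own group."""
    mu = mean(group_rewards)
    sigma = pstdev(group_rewards)
    return [(r - mu) / (sigma + eps) for r in group_rewards]


def leave_one_out_advantages(group_rewards: List[float]) -> List[float]:
    """Subtract, from each reward, the mean of the *other* rewards in the group."""
    n = len(group_rewards)
    if n < 2:
        raise ValueError("leave-one-out needs at least two samples per group")
    total = sum(group_rewards)
    return [r - (total - r) / (n - 1) for r in group_rewards]


def globally_normalised(groups: List[List[float]], eps: float = 1e-8) -> List[List[float]]:
    """Per-group leave-one-out baseline, then one batch-wide mean/std normalisation."""
    raw = [leave_one_out_advantages(g) for g in groups]
    flat = [a for g in raw for a in g]
    mu, sigma = mean(flat), pstdev(flat)
    return [[(a - mu) / (sigma + eps) for a in g] for g in raw]


loo = leave_one_out_advantages([1.0, 2.0, 3.0])  # [-1.5, 0.0, 1.5]
grpo = grpo_advantages([1.0, 2.0, 3.0])
batch = globally_normalised([[1.0, 2.0, 3.0], [0.0, 0.0, 4.0]])
```

The difference in this sketch is where the scale comes from: the group-relative variant divides by each group's own spread, whereas the leave-one-out variant keeps the baseline per group but takes the normalisation statistics from the whole batch.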
Spec extension
training:
  total_timesteps: 200000
  log_interval: 10
  advantage:
    class: ReinforcePlusPlusAdvantage
    module_path: alphaswarm.rl.advantage.reinforce_plus_plus
    kwargs:
      minus_baseline: true
      global_normalization: true
      leave_one_out: true
  stop_properly_penalty_coef: 0.2
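As a rough illustration of what stop_properly_penalty_coef: 0.2 could mean in practice, the sketch below multiplies the reward of a truncated episode by the coefficient and leaves properly terminated episodes untouched. The function name `apply_stop_properly_penalty` is hypothetical, and the plain multiplicative form is an assumption based on the one-line description of StopProperlyWrapper; the wrapper itself and the NeMo-RL original may shape the reward differently.

```python
def apply_stop_properly_penalty(reward: float, truncated: bool, coef: float) -> float:
    """Scale the reward of a truncated episode by coef; pass others through."""
    if not 0.0 <= coef <= 1.0:
        raise ValueError("coef must lie in [0, 1]")
    return reward * coef if truncated else reward


# With coef = 0.2, a truncated episode keeps one fifth of its reward.
shaped = apply_stop_properly_penalty(10.0, truncated=True, coef=0.2)
untouched = apply_stop_properly_penalty(10.0, truncated=False, coef=0.2)
```

Under this reading, coef = 1.0 disables the penalty and coef = 0.0 zeroes the reward of any episode that ended by truncation.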
Companion docs
- alphaswarm_docs/weight-centric-pipeline.md: Deep dive on f_S / f_A / f_T / f_R semantics.
- alphaswarm_docs/rl-policy-backbones.md: Transformer / RNN / Autoencoder / PatchTST backbones.
- alphaswarm_docs/alpha-researcher-agent.md: Symbolic alpha DSL + AlphaResearcher driver.
Source-of-truth citations
- NeMo-RL stop_properly_penalty_coef scaling (commit 20d46a7d1bd987df1c89b3c5a81dc945c3d201e4, nemo_rl/algorithms/reward_functions.py).
- NeMo-RL leave-one-out group baseline + decoupled global normalisation (nemo_rl/algorithms/utils.py, calculate_baseline_and_std_per_prompt + masked_mean(..., global_normalization_factor=...)).
- Backtrader cheat_on_open / next_open / order_target_percent semantics (backtrader/strategy.py).