
Hybrid agentic-RL + backtest

This page covers AlphaSwarm's port of the FinRL-X "deployment-consistent" blueprint, together with the advantage primitives ported from NVIDIA-NeMo/RL, all wired into AlphaSwarm's existing spec-driven runtimes (rule 16).

What changed

The Phase 1-9 rollout closes the "backtest-to-paper-trading gap" by making the target portfolio weight vector the single immutable interface between an RL policy and any execution mechanism (offline backtest engine OR live broker). The same w_t flows through:

Quick reference

| Concept | One-liner | File |
|---|---|---|
| WeightCentricPipeline | FinRL-X f_S -> f_A -> f_T -> f_R composable pipeline | alphaswarm/rl/portfolio/pipeline.py |
| RLBacktestEnv | BaseRLEnv + gym.Env wrapping any registered BaseBacktestEngine | alphaswarm/rl/envs/rl_backtest_env.py |
| RLAgentBridge | Channel exposed via context['rl_agent'] on every engine that sets supports_rl_injection=True | alphaswarm/rl/bridges/agent_bridge.py |
| ReinforcePlusPlusAdvantage | Leave-one-out cohort baseline + decoupled global normalisation (NeMo-RL port) | alphaswarm/rl/advantage/reinforce_plus_plus.py |
| GRPOAdvantage | Group-relative, critic-free advantage (DeepSeek R1 / NeMo-RL parity) | alphaswarm/rl/advantage/grpo.py |
| StopProperlyWrapper | Scales the reward of truncated episodes by coef in [0, 1] (NeMo-RL stop_properly_penalty_coef) | alphaswarm/rl/rewards/stop_properly.py |
| Truncating terminations | DrawdownTermination / MarginCallTermination / RiskBreachTermination carry truncates_episode=True | alphaswarm/rl/terminations/ |
| WeightToOrders | Kill-switch-gated translator from target weights to DomainOrder | alphaswarm/rl/execution/weight_to_orders.py |
| RedisFeatureStore | Flink → Redis IFeatureStore for live RL observations | alphaswarm/streaming/feature_store/redis_store.py |
| AlphaVantageIngester | REST-polls Alpha Vantage and publishes to Kafka | alphaswarm/streaming/ingesters/alphavantage.py |
| DeterministicMedallionReplay | Read-only RL data pipeline pinned to silver/gold Iceberg snapshots | alphaswarm/rl/data_pipelines/medallion_replay.py |
| data.alphas.* / data.backtests.* / data.rl.* / data.brokers.* | New DataMCPTools (rule 22) | alphaswarm/data/mcp/tools/ |
| alpha_factors / backtest_summaries / rl_trajectory_summaries corpora | RAG "alpha base" (rule 11) | alphaswarm/rag/orders.py |
| RLTradingBot | Bot subtype driven by RLRuntime (rule 14) | alphaswarm/bots/rl_trading_bot.py |
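
The weight-centric contract in the table can be made concrete with a minimal sketch of the WeightToOrders step. It assumes a hypothetical `Order` type standing in for DomainOrder, a boolean kill switch, and integer share quantities. The sizing rule (target shares = weight * equity / price, truncated toward zero) is similar in spirit to Backtrader's order_target_percent but is an illustrative assumption, not AlphaSwarm's actual translator.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class Order:
    """Hypothetical stand-in for AlphaSwarm's DomainOrder."""
    symbol: str
    quantity: int  # positive = buy, negative = sell


def weights_to_orders(
    target_weights: Dict[str, float],
    positions: Dict[str, int],
    prices: Dict[str, float],
    equity: float,
    kill_switch_engaged: bool,
) -> List[Order]:
    """Translate a target weight vector w_t into share-delta orders (sketch)."""
    if kill_switch_engaged:
        return []  # gate: emit nothing while the kill switch is engaged
    orders: List[Order] = []
    # Symbols held but missing from w_t get target weight 0, i.e. are flattened.
    for symbol in sorted(set(target_weights) | set(positions)):
        weight = target_weights.get(symbol, 0.0)
        # Target share count, truncated toward zero by int().
        target_shares = int(weight * equity / prices[symbol])
        delta = target_shares - positions.get(symbol, 0)
        if delta != 0:
            orders.append(Order(symbol, delta))
    return orders
```

Because the function consumes only w_t plus current positions and prices, the same call can sit behind an offline backtest engine or a live broker adapter, which is the point of making the weight vector the single interface.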

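For GRPOAdvantage, the standard group-relative formulation needs no critic: each rollout's reward is centred on its group's mean and scaled by the group's standard deviation. The sketch below assumes one flat list of rewards per group and uses the population standard deviation. Implementations differ on population vs sample std and on the epsilon, so treat it as illustrative rather than as AlphaSwarm's grpo.py; the function name is hypothetical.

```python
import math
from typing import List


def grpo_advantages(group_rewards: List[float], eps: float = 1e-6) -> List[float]:
    """Group-relative advantage: (r_i - mean(r)) / (std(r) + eps).

    All rewards in the list come from rollouts that share one prompt or
    market state; the group mean replaces a learned critic as the baseline.
    """
    n = len(group_rewards)
    if n == 0:
        return []
    mean = sum(group_rewards) / n
    # Population standard deviation over the group.
    std = math.sqrt(sum((r - mean) ** 2 for r in group_rewards) / n)
    return [(r - mean) / (std + eps) for r in group_rewards]
```

A group whose rollouts all earn the same reward yields all-zero advantages, so it contributes no gradient signal.
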
Spec extension

```yaml
training:
  total_timesteps: 200000
  log_interval: 10
  advantage:
    class: ReinforcePlusPlusAdvantage
    module_path: alphaswarm.rl.advantage.reinforce_plus_plus
    kwargs:
      minus_baseline: true
      global_normalization: true
      leave_one_out: true
  stop_properly_penalty_coef: 0.2
```
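
A minimal sketch of how the three advantage kwargs in the spec could interact, assuming rewards arrive as one list per cohort (all rollouts sharing a prompt or market state). leave_one_out swaps the cohort mean for the mean of the other members, and global_normalization is interpreted here as whitening the centred advantages over the whole batch, as in the REINFORCE++ formulation. The cited NeMo-RL masked_mean(..., global_normalization_factor=...) normalises over the global batch inside the loss, so AlphaSwarm's exact behaviour may differ; the function name is hypothetical.

```python
import math
from typing import List


def reinforce_plus_plus_advantages(
    groups: List[List[float]],
    minus_baseline: bool = True,
    leave_one_out: bool = True,
    global_normalization: bool = True,
    eps: float = 1e-6,
) -> List[List[float]]:
    """Per-cohort baseline first, then normalisation over the whole batch (sketch)."""
    centred: List[List[float]] = []
    for rewards in groups:
        n = len(rewards)
        total = sum(rewards)
        row = []
        for r in rewards:
            if not minus_baseline:
                baseline = 0.0
            elif leave_one_out and n > 1:
                baseline = (total - r) / (n - 1)  # mean of the *other* cohort members
            else:
                baseline = total / n  # plain cohort mean
            row.append(r - baseline)
        centred.append(row)

    flat = [a for row in centred for a in row]
    if not global_normalization or not flat:
        return centred

    # Decoupled step: statistics come from the whole batch, not per cohort.
    mean = sum(flat) / len(flat)
    std = math.sqrt(sum((a - mean) ** 2 for a in flat) / len(flat))
    return [[(a - mean) / (std + eps) for a in row] for row in centred]
```

With global_normalization disabled, the single cohort [1.0, 2.0, 3.0] yields [-1.5, 0.0, 1.5] under leave-one-out and [-1.0, 0.0, 1.0] with the plain cohort mean. The leave-one-out variant widens the spread by n/(n-1) because each sample no longer contributes to its own baseline.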

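The truncation path can be sketched the same way. Below, a hypothetical DrawdownTermination flags an episode once equity falls more than max_drawdown below its running peak, and because it carries truncates_episode=True, the episode's reward is scaled by stop_properly_penalty_coef (0.2 in the spec above). The class, method, and function names are illustrative assumptions, not the modules listed in the table.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class DrawdownTermination:
    """Hypothetical sketch: stop once peak-to-trough drawdown exceeds max_drawdown."""
    max_drawdown: float  # e.g. 0.2 = 20% below the running peak
    truncates_episode: bool = True  # a risk stop is not a "proper" finish

    def check(self, equity_curve: List[float]) -> bool:
        if not equity_curve:
            return False
        peak = equity_curve[0]
        for value in equity_curve:
            peak = max(peak, value)
            # Assumes positive equity; skip the ratio if the peak is not positive.
            if peak > 0 and (peak - value) / peak > self.max_drawdown:
                return True
        return False


def stop_properly_reward(reward: float, truncated: bool, coef: float) -> float:
    """Scale a truncated episode's reward by coef in [0, 1]; leave natural finishes as-is."""
    if not 0.0 <= coef <= 1.0:
        raise ValueError("stop_properly_penalty_coef must be in [0, 1]")
    return reward * coef if truncated else reward
```

One design caveat: multiplying by coef shrinks rewards of either sign, so a truncated episode with a negative return is penalised less than it would have been had it finished. If trading rewards can go negative, the wrapper's sign convention deserves a check.
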
Companion docs

Source-of-truth citations

  • NeMo-RL stop_properly_penalty_coef scaling (commit 20d46a7d1bd987df1c89b3c5a81dc945c3d201e4, nemo_rl/algorithms/reward_functions.py).
  • NeMo-RL leave-one-out group baseline + decoupled global normalisation (nemo_rl/algorithms/utils.py calculate_baseline_and_std_per_prompt + masked_mean(..., global_normalization_factor=...)).
  • Backtrader cheat_on_open / next_open / order_target_percent semantics (backtrader/strategy.py).