Hybrid agentic-RL + backtest

AlphaSwarm's port of the FinRL-X "deployment-consistent" blueprint and the NVIDIA-NeMo/RL advantage primitives, wired into AlphaSwarm's existing spec-driven runtimes (rule 16).

What changed

The Phase 1-9 rollout closes the "backtest-to-paper-trading gap" by making the target portfolio weight vector w_t the single immutable interface between an RL policy and any execution mechanism (offline backtest engine or live broker). The same w_t flows through:

the offline simulation (via the new RLBacktestEnv)
the paper or live execution path (via WeightToOrders)
the AST-sandboxed alpha factor authoring loop (via AlphaResearcher)
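
To make the "single immutable interface" idea concrete, here is a minimal sketch in which one read-only weight vector feeds both an offline return calculation and a live-style order sizing step. The function names (`freeze_weights`, `backtest_step`, `weights_to_share_deltas`) and the gross-exposure check are hypothetical illustrations, not the actual `RLBacktestEnv` or `WeightToOrders` APIs.

```python
import numpy as np

def freeze_weights(w, max_gross=1.0):
    """Copy the policy output into a read-only array (hypothetical helper)."""
    w = np.array(w, dtype=float)           # copy, so the caller's array is untouched
    if not np.all(np.isfinite(w)):
        raise ValueError("weights must be finite")
    if np.abs(w).sum() > max_gross + 1e-9:  # assumption: remainder is held as cash
        raise ValueError("gross exposure exceeds limit")
    w.setflags(write=False)                 # enforce immutability downstream
    return w

def backtest_step(w, asset_returns):
    """Offline consumer: one-period portfolio return for weights w."""
    return float(np.dot(w, np.asarray(asset_returns, dtype=float)))

def weights_to_share_deltas(w, prices, holdings, equity):
    """Live-style consumer: share deltas needed to reach target weights w."""
    target_shares = w * equity / np.asarray(prices, dtype=float)
    return target_shares - np.asarray(holdings, dtype=float)

# The same w_t is handed to both consumers without modification.
w_t = freeze_weights([0.5, 0.3])
period_return = backtest_step(w_t, [0.02, -0.01])
deltas = weights_to_share_deltas(w_t, prices=[100.0, 50.0],
                                 holdings=[40.0, 100.0], equity=10_000.0)
```

Because the array is read-only, neither consumer can alter the policy's decision in place; any consumer-specific adjustment has to produce a new object, which keeps the offline and live paths comparable.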

Quick reference

Concept	One-liner	File
`WeightCentricPipeline`	FinRL-X `f_S -> f_A -> f_T -> f_R` composable pipeline	alphaswarm/rl/portfolio/pipeline.py
`RLBacktestEnv`	`BaseRLEnv + gym.Env` wrapping any registered `BaseBacktestEngine`	alphaswarm/rl/envs/rl_backtest_env.py
`RLAgentBridge`	Channel exposed via `context['rl_agent']` on every engine that sets `supports_rl_injection=True`	alphaswarm/rl/bridges/agent_bridge.py
`ReinforcePlusPlusAdvantage`	Leave-one-out cohort baseline + decoupled global normalisation (NeMo-RL port)	alphaswarm/rl/advantage/reinforce_plus_plus.py
`GRPOAdvantage`	Group-relative no-critic advantage (DeepSeek R1 / NeMo-RL parity)	alphaswarm/rl/advantage/grpo.py
`StopProperlyWrapper`	Scales reward of truncated episodes by `coef in [0, 1]` (NeMo-RL `stop_properly_penalty_coef`)	alphaswarm/rl/rewards/stop_properly.py
Truncating terminations	`DrawdownTermination` / `MarginCallTermination` / `RiskBreachTermination` carry `truncates_episode=True`	alphaswarm/rl/terminations/
`WeightToOrders`	Kill-switch-gated translator from target weights to `DomainOrder`	alphaswarm/rl/execution/weight_to_orders.py
`RedisFeatureStore`	Flink → Redis `IFeatureStore` for live RL observation	alphaswarm/streaming/feature_store/redis_store.py
`AlphaVantageIngester`	REST-poll Alpha Vantage and publish to Kafka	alphaswarm/streaming/ingesters/alphavantage.py
`DeterministicMedallionReplay`	Read-only RL data pipeline pinned to silver/gold Iceberg snapshots	alphaswarm/rl/data_pipelines/medallion_replay.py
`data.alphas.` / `data.backtests.` / `data.rl.` / `data.brokers.`	New DataMCPTools (rule 22)	alphaswarm/data/mcp/tools/
`alpha_factors` / `backtest_summaries` / `rl_trajectory_summaries` corpora	RAG "alpha base" (rule 11)	alphaswarm/rag/orders.py
`RLTradingBot`	Bot subtype driven by `RLRuntime` (rule 14)	alphaswarm/bots/rl_trading_bot.py
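
The two advantage estimators in the table differ mainly in where the baseline and the normalisation statistics come from. The NumPy sketch below illustrates that difference; it is not the AlphaSwarm or NeMo-RL implementation. The function names are hypothetical, and "decoupled global normalisation" is assumed here to mean standardising by batch-wide mean and standard deviation after the per-group baseline has been subtracted.

```python
import numpy as np

def leave_one_out_baseline(rewards, group_ids):
    """Baseline for each sample = mean reward of the other samples in its group."""
    rewards = np.asarray(rewards, dtype=float)
    group_ids = np.asarray(group_ids)
    baseline = np.zeros_like(rewards)
    for g in np.unique(group_ids):
        mask = group_ids == g
        n = int(mask.sum())
        if n > 1:
            baseline[mask] = (rewards[mask].sum() - rewards[mask]) / (n - 1)
        # assumption: a singleton group keeps a zero baseline
    return baseline

def reinforce_pp_advantage(rewards, group_ids, eps=1e-8):
    """Leave-one-out baseline per group, then normalisation over the whole batch."""
    rewards = np.asarray(rewards, dtype=float)
    adv = rewards - leave_one_out_baseline(rewards, group_ids)
    return (adv - adv.mean()) / (adv.std() + eps)

def grpo_advantage(rewards, group_ids, eps=1e-8):
    """Group-relative advantage: mean and std are both taken within each group."""
    rewards = np.asarray(rewards, dtype=float)
    group_ids = np.asarray(group_ids)
    adv = np.zeros_like(rewards)
    for g in np.unique(group_ids):
        mask = group_ids == g
        grp = rewards[mask]
        adv[mask] = (grp - grp.mean()) / (grp.std() + eps)
    return adv

rewards = [1.0, 3.0, 2.0, 6.0]
groups = [0, 0, 1, 1]          # two cohorts of two rollouts each
loo = leave_one_out_baseline(rewards, groups)   # [3, 1, 6, 2]
rpp = reinforce_pp_advantage(rewards, groups)
grpo = grpo_advantage(rewards, groups)
```

In this toy batch the group-relative variant gives both cohorts advantages of the same magnitude, while the globally normalised variant keeps the second cohort's larger reward spread visible in its advantages.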

Spec extension

training:
  total_timesteps: 200000
  log_interval: 10
  advantage:
    class: ReinforcePlusPlusAdvantage
    module_path: alphaswarm.rl.advantage.reinforce_plus_plus
    kwargs:
      minus_baseline: true
      global_normalization: true
      leave_one_out: true
  stop_properly_penalty_coef: 0.2
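
As a rough illustration of what `stop_properly_penalty_coef: 0.2` would do under the scaling described in the quick reference, the sketch below multiplies the reward of each truncated episode by the coefficient and leaves the others unchanged. The `spec` dictionary mirrors the fragment above; `apply_stop_properly` is a hypothetical helper, not the `StopProperlyWrapper` API.

```python
def apply_stop_properly(rewards, truncated, coef):
    """Scale rewards of truncated episodes by coef; pass others through."""
    if not 0.0 <= coef <= 1.0:
        raise ValueError("coef must lie in [0, 1]")
    if len(rewards) != len(truncated):
        raise ValueError("rewards and truncated must have equal length")
    return [r * coef if t else r for r, t in zip(rewards, truncated)]

# Plain-dict mirror of the relevant spec keys (assumption: loaded elsewhere).
spec = {"training": {"stop_properly_penalty_coef": 0.2}}
coef = spec["training"]["stop_properly_penalty_coef"]

episode_rewards = [1.0, -0.5, 2.0]
was_truncated = [False, True, True]   # e.g. ended by a drawdown termination
shaped = apply_stop_properly(episode_rewards, was_truncated, coef)
```

Note that pure multiplicative scaling shrinks a negative reward toward zero as well, so with this sketch a truncated losing episode is penalised less than an untruncated one; whether that is desirable depends on how the reward is defined.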

Companion docs

alphaswarm_docs/weight-centric-pipeline.md — Deep dive on f_S/f_A/f_T/f_R semantics.
alphaswarm_docs/rl-policy-backbones.md — Transformer / RNN / Autoencoder / PatchTST backbones.
alphaswarm_docs/alpha-researcher-agent.md — Symbolic alpha DSL + AlphaResearcher driver.

Source-of-truth citations

NeMo-RL stop_properly_penalty_coef scaling (commit 20d46a7d1bd987df1c89b3c5a81dc945c3d201e4, nemo_rl/algorithms/reward_functions.py).
NeMo-RL leave-one-out group baseline + decoupled global normalisation (nemo_rl/algorithms/utils.py calculate_baseline_and_std_per_prompt + masked_mean(..., global_normalization_factor=...)).
Backtrader cheat_on_open / next_open / order_target_percent semantics (backtrader/strategy.py).
