Your first RL experiment

Goal: go from a blank RLExperimentSpec to a trained PPO agent, with trajectories persisted to Iceberg, in under 10 minutes on CPU.

Why

The RL stack is AlphaSwarm's most opinionated subsystem. It combines a hash-locked RLExperimentSpec, metaclass-registered components, deterministic Iceberg trajectory persistence, and a single sanctioned executor (RLRuntime). Every RL run produces an immutable rl_runs ledger row and a replayable trajectory.

See Concept: RL framework.
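
To make "hash-locked" concrete, here is a minimal sketch of one common way to derive a stable hash from a spec: serialize it to canonical JSON and take a SHA-256 digest. The helper name `spec_hash` and the canonicalization rules are assumptions for illustration only; the scheme AlphaSwarm actually uses to compute spec_hash is not described on this page and may differ.

```python
import hashlib
import json


def spec_hash(spec: dict) -> str:
    """Hypothetical helper: hash a spec via canonical JSON + SHA-256.

    Sorting keys and fixing separators means two specs with the same
    content always serialize to the same bytes, regardless of key order.
    """
    canonical = json.dumps(spec, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


# Same content, different key order: the hashes match.
spec_a = {"name": "MyFirstRLExperiment",
          "agent": {"algorithm": "PPO", "total_timesteps": 50000}}
spec_b = {"agent": {"total_timesteps": 50000, "algorithm": "PPO"},
          "name": "MyFirstRLExperiment"}

# Any change to a field value yields a different hash.
spec_c = {"name": "MyFirstRLExperiment",
          "agent": {"algorithm": "PPO", "total_timesteps": 60000}}

same = spec_hash(spec_a) == spec_hash(spec_b)
changed = spec_hash(spec_a) != spec_hash(spec_c)
print(same, changed)
```

Under a scheme like this, editing any field of the spec produces a new hash, which is what allows a run to be tied to exactly one spec version.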

Prerequisites

  • Quickstart completed.
  • A small dev dataset under your local Iceberg catalog. The bundled alphaswarm_bronze_yfinance_daily namespace works.

Step 1 — author the spec

Create alphaswarm_rl/configs/experiments/my_first_rl.yaml:

name: MyFirstRLExperiment
description: First-RL tutorial — PPO on a static universe
environment:
  rl_alias: SingleAssetTradingEnv
  symbol: { ticker: SPY, exchange: ARCA, kind: equity }
  lookback_bars: 60
  initial_cash: 100000
data_pipeline:
  rl_alias: IcebergDataPipeline
  namespace: alphaswarm_bronze_yfinance_daily
  start: 2022-01-01
  end: 2023-12-31
agent:
  rl_alias: SB3Adapter
  algorithm: PPO
  policy: MlpPolicy
  total_timesteps: 50000
rewards:
  - { rl_alias: PnLReward, weight: 1.0 }
  - { rl_alias: TurnoverPenalty, weight: 0.1 }
  - { rl_alias: VolatilityPenalty, weight: 0.05 }
observations:
  - { rl_alias: StockstatsObservation }
  - { rl_alias: LookbackObservation, length: 20 }
training:
  advantage: { rl_alias: GAEAdvantage, lambda: 0.95, gamma: 0.99 }
  backbone: { rl_alias: TransformerBackbone, d_model: 64, n_heads: 4 }
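
The `advantage` entry sets lambda and gamma for Generalized Advantage Estimation (GAE). As a rough guide to what those two numbers control, here is a sketch of the textbook GAE recursion in plain Python. The function name `gae_advantages` is hypothetical, and this is the standard estimator under the assumption that no episode terminates inside the segment; it is not taken from the GAEAdvantage component, whose internals are not shown here.

```python
def gae_advantages(rewards, values, last_value, gamma=0.99, lam=0.95):
    """Textbook GAE over one rollout segment (hypothetical helper).

    rewards[t]  reward received after acting in state s_t
    values[t]   critic estimate V(s_t)
    last_value  bootstrap estimate V(s_T) for the state after the segment
                (pass 0.0 if the segment ends in a terminal state)
    """
    advantages = [0.0] * len(rewards)
    next_value = last_value
    running = 0.0
    # Walk backwards so each step can reuse the advantage of the next one.
    for t in reversed(range(len(rewards))):
        # One-step TD error: delta_t = r_t + gamma * V(s_{t+1}) - V(s_t)
        delta = rewards[t] + gamma * next_value - values[t]
        # A_t = delta_t + gamma * lam * A_{t+1}
        running = delta + gamma * lam * running
        advantages[t] = running
        next_value = values[t]
    return advantages


# Two steps with reward 1.0 and a critic that predicts 0 everywhere.
example = gae_advantages([1.0, 1.0], [0.0, 0.0], last_value=0.0)
print(example)
```

The parameter is named `lam` because `lambda` is a reserved word in Python. Lower lambda leans on the critic's one-step estimates (less variance, more bias); lambda near 1 approaches the full discounted return minus the value baseline.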

Step 2 — snapshot + train

curl -X POST http://localhost:8000/rl/runs \
-H "Content-Type: application/json" \
-d '{"spec_path":"alphaswarm_rl/configs/experiments/my_first_rl.yaml","mode":"train"}'
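
If you prefer to submit the run from a script, the same request can be built with Python's standard library. This is a sketch that mirrors the curl call above; the helper name `build_run_request` is hypothetical, and the send step is left commented out because only the rl_run_id field of the response is described on this page.

```python
import json
import urllib.request


def build_run_request(spec_path, mode="train", base_url="http://localhost:8000"):
    """Hypothetical helper: build the POST /rl/runs request shown above."""
    payload = json.dumps({"spec_path": spec_path, "mode": mode}).encode("utf-8")
    return urllib.request.Request(
        url=f"{base_url}/rl/runs",
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


req = build_run_request("alphaswarm_rl/configs/experiments/my_first_rl.yaml")

# Sending requires the API to be running, so it is not executed here:
# with urllib.request.urlopen(req) as resp:
#     body = json.load(resp)
#     print(body)
```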

The response includes the rl_run_id. Tail the progress:

docker exec alphaswarm-api python -c "from alphaswarm.ws.broker import subscribe; \
[print(m) for m in subscribe('<task_id>')]"

On CPU, 50k timesteps typically finish in 5-8 minutes.

Step 3 — inspect the ledger + trajectory store

-- rl_runs ledger
SELECT id, experiment_name, status, total_timesteps, mean_reward
FROM rl_runs ORDER BY created_at DESC LIMIT 5;

The trajectory data lives in Iceberg under alphaswarm_silver_rl_trajectories.<experiment_hash>:

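from pyiceberg.catalog import load_catalog

# Load the catalog configured under the name "alphaswarm"
cat = load_catalog("alphaswarm")
# Replace <experiment_hash> with the hash of your experiment
tbl = cat.load_table("alphaswarm_silver_rl_trajectories.<experiment_hash>")
# Read the whole table into pandas (fine for a small tutorial run)
df = tbl.scan().to_pandas()
print(df[["episode", "step", "reward", "action"]].head(20))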
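
Once the trajectory rows are loaded, a per-episode summary is a quick sanity check that rewards were actually recorded. The sketch below uses only the `episode` and `reward` columns printed above and works on plain dict rows so it runs without Iceberg. The helper name `episode_summary` and the sample rows are made up for illustration.

```python
from collections import defaultdict


def episode_summary(rows):
    """Hypothetical helper: total reward and step count per episode.

    rows is any iterable of dicts with at least 'episode' and 'reward' keys.
    """
    totals = defaultdict(float)
    lengths = defaultdict(int)
    for row in rows:
        totals[row["episode"]] += row["reward"]
        lengths[row["episode"]] += 1
    return {ep: {"return": totals[ep], "steps": lengths[ep]}
            for ep in sorted(totals)}


# Illustrative rows only; real values come from the trajectory table.
sample = [
    {"episode": 0, "step": 0, "reward": 0.5, "action": 1},
    {"episode": 0, "step": 1, "reward": 0.25, "action": 0},
    {"episode": 1, "step": 0, "reward": -1.0, "action": 2},
]
summary = episode_summary(sample)
print(summary)
```

With the DataFrame from the snippet above, the rows can be passed as `episode_summary(df.to_dict("records"))`.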

Step 4 — replay

curl -X POST http://localhost:8000/rl/runs/<rl_run_id>/replay \
-H "Content-Type: application/json" \
-d '{"start":"2024-01-01","end":"2024-03-31"}'

Replay reuses the same hash-locked spec against a new data window and writes a separate rl_runs row.

Step 5 — halt

curl -X POST http://localhost:8000/rl/halt-all

Verify

  • rl_experiment_versions row with a spec_hash.
  • rl_runs row with non-NULL mean_reward.
  • Iceberg trajectory table populated.
  • Replay produces a different rl_runs row but reuses the same rl_experiment_versions row (hash-locked!).

What next