Your first RL experiment

Goal: go from a blank RLExperimentSpec to a trained PPO agent with trajectories persisted to Iceberg, in under 10 minutes on CPU.

Why

The RL stack is AlphaSwarm's most opinionated subsystem: hash-locked RLExperimentSpec, metaclass-registered components, deterministic Iceberg trajectory persistence, and a single sanctioned executor (RLRuntime). Every RL run produces an immutable rl_runs ledger row and a replayable trajectory.
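To make "hash-locked" concrete, here is a minimal sketch of one common way to derive a stable hash from a spec: serialize it to canonical JSON and take a SHA-256 digest. This is an illustration only. The function name spec_hash and the canonicalization rules are assumptions, not AlphaSwarm's actual hashing scheme.

```python
import hashlib
import json


def spec_hash(spec: dict) -> str:
    """Hypothetical helper: SHA-256 over a canonical JSON encoding of the spec."""
    # sort_keys + fixed separators make the encoding independent of key order
    # and whitespace, so equal specs always map to the same digest.
    canonical = json.dumps(spec, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


spec_a = {"name": "MyFirstRLExperiment",
          "agent": {"algorithm": "PPO", "total_timesteps": 50000}}
# Same content, different key order.
spec_b = {"agent": {"total_timesteps": 50000, "algorithm": "PPO"},
          "name": "MyFirstRLExperiment"}
# One field changed.
spec_c = {"name": "MyFirstRLExperiment",
          "agent": {"algorithm": "PPO", "total_timesteps": 100000}}

print(spec_hash(spec_a) == spec_hash(spec_b))  # key order does not matter
print(spec_hash(spec_a) == spec_hash(spec_c))  # any content change alters the hash
```

Under a scheme like this, any edit to the spec yields a new hash, which is what lets a ledger row point unambiguously at the exact configuration that produced it.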

See Concept: RL framework.

Prerequisites

Quickstart completed.
A small dev dataset under your local Iceberg catalog. The bundled alphaswarm_bronze_yfinance_daily namespace works.

Step 1 — author the spec

Create alphaswarm_rl/configs/experiments/my_first_rl.yaml:

name: MyFirstRLExperiment
description: First-RL tutorial — PPO on a static universe
environment:
  rl_alias: SingleAssetTradingEnv
  symbol: { ticker: SPY, exchange: ARCA, kind: equity }
  lookback_bars: 60
  initial_cash: 100000
data_pipeline:
  rl_alias: IcebergDataPipeline
  namespace: alphaswarm_bronze_yfinance_daily
  start: 2022-01-01
  end: 2023-12-31
agent:
  rl_alias: SB3Adapter
  algorithm: PPO
  policy: MlpPolicy
  total_timesteps: 50000
rewards:
  - { rl_alias: PnLReward, weight: 1.0 }
  - { rl_alias: TurnoverPenalty, weight: 0.1 }
  - { rl_alias: VolatilityPenalty, weight: 0.05 }
observations:
  - { rl_alias: StockstatsObservation }
  - { rl_alias: LookbackObservation, length: 20 }
training:
  advantage: { rl_alias: GAEAdvantage, lambda: 0.95, gamma: 0.99 }
  backbone: { rl_alias: TransformerBackbone, d_model: 64, n_heads: 4 }
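The training.advantage block above selects Generalized Advantage Estimation with lambda 0.95 and gamma 0.99. As a reminder of what those two parameters control, the following is a minimal pure-Python sketch of the standard GAE recursion. It is not the GAEAdvantage component itself; the function name and signature are hypothetical.

```python
def gae_advantages(rewards, values, last_value, gamma=0.99, lam=0.95):
    """Standard GAE over one rollout segment (illustrative, not AlphaSwarm code).

    rewards[t] and values[t] cover steps 0..T-1; last_value is the value
    estimate for the state after the final step (0.0 if the episode ended).
    """
    advantages = [0.0] * len(rewards)
    next_value = last_value
    running = 0.0
    # Walk backwards so each step can reuse the accumulated future advantage.
    for t in reversed(range(len(rewards))):
        # One-step TD error: r_t + gamma * V(s_{t+1}) - V(s_t)
        delta = rewards[t] + gamma * next_value - values[t]
        # Exponentially weighted sum of TD errors with decay gamma * lambda
        running = delta + gamma * lam * running
        advantages[t] = running
        next_value = values[t]
    return advantages


# With gamma = lam = 1 and a zero value function, the advantage is the
# undiscounted reward-to-go.
print(gae_advantages([1.0, 1.0, 1.0], [0.0, 0.0, 0.0], 0.0, gamma=1.0, lam=1.0))
# With lam = 0 the estimate collapses to the one-step TD error.
print(gae_advantages([1.0, 1.0, 1.0], [0.0, 0.0, 0.0], 0.0, gamma=1.0, lam=0.0))
```

Lower lambda values lean on the value function (less variance, more bias); values near 1 lean on observed returns.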

Step 2 — snapshot + train

curl -X POST http://localhost:8000/rl/runs \
    -H "Content-Type: application/json" \
    -d '{"spec_path":"alphaswarm_rl/configs/experiments/my_first_rl.yaml","mode":"train"}'
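If you prefer to submit the run from Python, the same request can be built with the standard library. The sketch below mirrors the curl call above (same URL, header, and JSON body); the helper name build_train_request is hypothetical, and the response schema is not assumed beyond what this page states.

```python
import json
import urllib.request

BASE_URL = "http://localhost:8000"  # same host and port as the curl example


def build_train_request(spec_path: str, base_url: str = BASE_URL) -> urllib.request.Request:
    """Hypothetical helper: build (but do not send) the POST /rl/runs request."""
    payload = json.dumps({"spec_path": spec_path, "mode": "train"}).encode("utf-8")
    return urllib.request.Request(
        url=f"{base_url}/rl/runs",
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


req = build_train_request("alphaswarm_rl/configs/experiments/my_first_rl.yaml")
print(req.get_method(), req.full_url)

# To actually submit the run (requires the API to be up):
# with urllib.request.urlopen(req) as resp:
#     body = json.load(resp)  # inspect the body for the rl_run_id
```

Building the request separately from sending it makes it easy to inspect or log the payload before a run is launched.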

The response includes the rl_run_id. Tail the progress, replacing <task_id> with the task ID of this run:

docker exec alphaswarm-api python -c "from alphaswarm.ws.broker import subscribe; \
    [print(m) for m in subscribe('<task_id>')]"

50,000 timesteps typically finish in 5-8 minutes on a CPU.

Step 3 — inspect the ledger + trajectory store

-- rl_runs ledger
SELECT id, experiment_name, status, total_timesteps, mean_reward
FROM rl_runs ORDER BY created_at DESC LIMIT 5;

The trajectory data lives in Iceberg under alphaswarm_silver_rl_trajectories.<experiment_hash>:

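from pyiceberg.catalog import load_catalog

# Load the catalog named "alphaswarm" from your PyIceberg configuration
cat = load_catalog("alphaswarm")
# Replace <experiment_hash> with the spec hash of your experiment
tbl = cat.load_table("alphaswarm_silver_rl_trajectories.<experiment_hash>")
# Full table scan; fine for a tutorial-sized run
df = tbl.scan().to_pandas()
print(df[["episode", "step", "reward", "action"]].head(20))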
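Once the trajectories are loaded, a per-episode summary is usually the first sanity check. The sketch below uses only the standard library and assumes nothing beyond the episode and reward columns shown above; the helper name episode_returns and the sample rows are made up for illustration.

```python
from collections import defaultdict


def episode_returns(records):
    """Hypothetical helper: sum rewards and count steps per episode."""
    totals = defaultdict(float)
    steps = defaultdict(int)
    for row in records:
        totals[row["episode"]] += row["reward"]
        steps[row["episode"]] += 1
    # Sort by episode id so the output order is stable.
    return {ep: {"steps": steps[ep], "total_reward": totals[ep]}
            for ep in sorted(totals)}


# Synthetic rows in the same shape as the trajectory columns (not real output).
sample = [
    {"episode": 0, "step": 0, "reward": 1.5, "action": 1},
    {"episode": 0, "step": 1, "reward": -0.5, "action": 0},
    {"episode": 1, "step": 0, "reward": 2.0, "action": 1},
]

for episode, stats in episode_returns(sample).items():
    print(episode, stats)
```

With the DataFrame from the snippet above, the equivalent call would be episode_returns(df.to_dict("records")).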

Step 4 — replay

curl -X POST http://localhost:8000/rl/runs/<rl_run_id>/replay \
    -H "Content-Type: application/json" \
    -d '{"start":"2024-01-01","end":"2024-03-31"}'

Replay runs the same hash-locked spec against a new data window and records the result as a separate rl_runs row.

Step 5 — halt

curl -X POST http://localhost:8000/rl/halt-all

Verify

rl_experiment_versions row with a spec_hash.
rl_runs row with non-NULL mean_reward.
Iceberg trajectory table populated.
Replay produces a different rl_runs row but reuses the same rl_experiment_versions row (hash-locked!).

What next

Concept: RL components — add your own reward term, observation builder, or policy backbone.
Concept: RL Iceberg trajectories — the persistence contract.
Tutorial: first agent workflow — hand off RL outputs to an autonomous agent loop.