alphaswarm.ml — native qlib-style ML framework
Doc map: alphaswarm_docs/index.md · See alphaswarm_docs/factor-research.md for the alphalens-style evaluation pipeline.
alphaswarm.ml is a vendored port of Microsoft Qlib's
feature / dataset / model / record abstractions, rebuilt in pure Python on top
of AlphaSwarm's own DuckDB-backed data lake. There is no qlib runtime dependency:
installing the ml / ml-torch extras pulls in only the underlying libraries
(LightGBM, XGBoost, CatBoost, PyTorch).
Layers
┌──────────────────────────────────────────────────┐
│  Model (alphaswarm.ml.base.Model / ModelFT)      │
│   ├─ tree: LGBModel, XGBModel, CatBoostModel     │
│   ├─ linear: LinearModel (OLS/Ridge/Lasso/NNLS)  │
│   ├─ ensemble: DEnsembleModel                    │
│   ├─ torch: DNN, LSTM, GRU, ALSTM, Transformer,  │
│   │         TCN, TabNet, Localformer,            │
│   │         GeneralPTNN, Seq2Seq family          │
│   └─ stubs: GATs, HIST, TRA, ADD, ADARNN, …      │
├──────────────────────────────────────────────────┤
│  DatasetH / TSDatasetH → prepare(segments)       │
├──────────────────────────────────────────────────┤
│  DataHandler / DataHandlerLP                     │
│   ├─ DK_R raw | DK_I infer | DK_L learn views    │
│   └─ shared / infer / learn processors           │
├──────────────────────────────────────────────────┤
│  DataLoader → AQPDataLoader (DuckDB + DSL)       │
└──────────────────────────────────────────────────┘
Quick start
from alphaswarm.ml.features.alpha158 import Alpha158
from alphaswarm.ml.dataset import DatasetH
from alphaswarm.ml.models.tree import LGBModel

handler = Alpha158(
    instruments=["SPY", "AAPL", "MSFT"],
    start_time="2018-01-01",
    end_time="2024-12-31",
    fit_start_time="2018-01-01",
    fit_end_time="2022-12-31",
)
dataset = DatasetH(
    handler=handler,
    segments={
        "train": ("2018-01-01", "2022-12-31"),
        "valid": ("2023-01-01", "2023-12-31"),
        "test": ("2024-01-01", "2024-12-31"),
    },
)
model = LGBModel(num_leaves=63, learning_rate=0.05, n_estimators=500)
model.fit(dataset)
predictions = model.predict(dataset, segment="test")
Launch the same pipeline as a Celery task:
from alphaswarm.tasks.ml_tasks import train_ml_model

async_result = train_ml_model.delay(
    dataset_cfg={"class": "DatasetH", "module_path": "alphaswarm.ml.dataset", "kwargs": {...}},
    model_cfg={"class": "LGBModel", "module_path": "alphaswarm.ml.models.tree", "kwargs": {...}},
    run_name="alpha158-lgbm",
    strategy_id="<optional-strategy-uuid>",
)
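The dataset_cfg and model_cfg dictionaries share a class / module_path / kwargs shape that can be resolved with a dynamic import. The following is a minimal sketch of that resolution, not AlphaSwarm's actual loader: the helper name build_from_config is hypothetical, and the demo config points at a standard-library class so the snippet runs without AlphaSwarm installed.

```python
from importlib import import_module
from typing import Any, Mapping


def build_from_config(cfg: Mapping[str, Any]) -> Any:
    """Instantiate cfg["class"] from cfg["module_path"] with cfg["kwargs"].

    Hypothetical helper: it illustrates the config shape only.
    """
    module = import_module(cfg["module_path"])
    cls = getattr(module, cfg["class"])
    return cls(**cfg.get("kwargs", {}))


# Stand-in config using the standard library so the sketch is runnable as-is.
demo_cfg = {"class": "timedelta", "module_path": "datetime", "kwargs": {"days": 2}}
demo_obj = build_from_config(demo_cfg)
print(type(demo_obj).__name__, demo_obj.days)
```

Keeping configs as plain dictionaries means they serialise cleanly to JSON or YAML, which is what lets the same payload travel through Celery and the REST endpoints.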
Feature factories
- Alpha158 (alphaswarm.ml.features.alpha158.Alpha158DL) ships the 9 k-bar + price/volume lookbacks + ~30 rolling families from the original qlib paper. Every feature is expressed via the DSL operators in alphaswarm.data.expressions, so adding a new family is one line of code.
- Alpha360 (alphaswarm.ml.features.alpha360.Alpha360DL) emits a 60-step OHLCV panel normalised by the latest close (or latest volume). Feed it into a TSDatasetH and pair it with one of the sequence models.
Both handlers default to Ref($close, -2) / Ref($close, -1) - 1 as the label,
that is, the return from the close at t+1 to the close at t+2, matching qlib's
standard forward-return target.
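To make the label concrete, here is a small pandas sketch for a single instrument. It assumes the qlib convention that Ref(x, -n) looks n bars into the future; the function name and the toy prices are illustrative only.

```python
import pandas as pd


def forward_return_label(close: pd.Series) -> pd.Series:
    """Return close[t+2] / close[t+1] - 1, indexed by t, for one instrument.

    Assumes qlib-style semantics: Ref(x, -n) is the value n bars ahead.
    """
    return close.shift(-2) / close.shift(-1) - 1


close = pd.Series(
    [100.0, 102.0, 104.04, 101.0],
    index=pd.date_range("2024-01-02", periods=4, freq="B"),
)
label = forward_return_label(close)
print(label)
```

The last two rows are NaN because no t+1 / t+2 closes exist for them. On a multi-instrument panel the shift would have to be applied per instrument (for example inside a groupby) so that prices never leak across symbols.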
Expression DSL
alphaswarm.data.expressions now exposes ~50 operators grouped into four families:
- Unary: Ref, Delta, Abs, Sign, Log, Power, Rank
- Rolling: Mean, Std, Var, Skew, Kurt, Sum, Min, Max, Med, Mad, Quantile, Count, IdxMax, IdxMin, EMA, WMA, Slope, Rsquare, Resi
- Pairwise: Corr, Cov
- Comparison / logical / conditional: Greater, Less, Gt, Ge, Lt, Le, Eq, Ne, And, Or, Not, Mask, If
Example: construct a 20-bar z-scored close-price factor:
"($close - Mean($close, 20)) / (Std($close, 20) + 1e-12)"
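For readers more used to pandas than to the DSL, the expression above corresponds roughly to the rolling computation below. This is a sketch under the assumption that Mean and Std behave like pandas rolling mean and sample standard deviation (ddof=1); the DSL's exact conventions are not stated here.

```python
import pandas as pd


def rolling_zscore(close: pd.Series, window: int = 20, eps: float = 1e-12) -> pd.Series:
    """(close - rolling mean) / (rolling std + eps) over `window` bars."""
    mean = close.rolling(window).mean()
    std = close.rolling(window).std()  # pandas default: sample std (ddof=1)
    return (close - mean) / (std + eps)


# Toy input: a straight line, so every full window has the same z-score.
close = pd.Series(range(1, 41), dtype="float64")
z = rolling_zscore(close)
print(z.dropna().head())
```

The 1e-12 term only guards against division by zero when a window is perfectly flat; it has no visible effect otherwise.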
Recorders
alphaswarm.ml.recorder ports SignalRecord / SigAnaRecord / PortAnaRecord:
- SignalRecord.generate() calls model.predict(dataset), serialises pred.pkl + label.pkl, and logs them as MLflow artifacts.
- SigAnaRecord.generate(signal_record=...) runs alphaswarm.data.factors.evaluate_factor to compute IC / Rank IC / quantile returns and pushes them into the active MLflow run.
- PortAnaRecord.generate(signal_record=...) turns the prediction panel into a top-K long / bottom-K short portfolio and reports Sharpe / Sortino / max-drawdown + a qlib-style risk_analysis summary.
The train_ml_model Celery task auto-runs SignalRecord + any records
listed in the YAML so one POST /ml/train gives you predictions, factor
analysis, and a portfolio tearsheet in a single MLflow run.
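As a rough illustration of the IC and Rank IC figures that the signal-analysis record reports, the sketch below computes per-date cross-sectional correlations with pandas. It is not the evaluate_factor implementation; the function name daily_ic, the (datetime, instrument) index layout, and the toy numbers are assumptions.

```python
import pandas as pd


def daily_ic(pred: pd.Series, label: pd.Series) -> pd.DataFrame:
    """Per-date Pearson IC and Spearman Rank IC.

    Both inputs are indexed by (datetime, instrument).
    """
    df = pd.concat({"pred": pred, "label": label}, axis=1).dropna()
    rows = {}
    for day, group in df.groupby(level="datetime"):
        rows[day] = {
            "ic": group["pred"].corr(group["label"]),
            # Spearman correlation is the Pearson correlation of the ranks.
            "rank_ic": group["pred"].rank().corr(group["label"].rank()),
        }
    return pd.DataFrame.from_dict(rows, orient="index")


index = pd.MultiIndex.from_product(
    [pd.to_datetime(["2024-01-02", "2024-01-03"]), ["AAPL", "MSFT", "SPY"]],
    names=["datetime", "instrument"],
)
pred = pd.Series([0.1, 0.2, 0.3, 0.3, 0.2, 0.1], index=index)
label = pd.Series([0.01, 0.02, 0.03, 0.01, 0.02, 0.04], index=index)
ic_table = daily_ic(pred, label)
print(ic_table)
```

On the first date the predictions order the instruments perfectly, on the second they are exactly reversed, so the Rank IC flips sign between the two rows.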
Model zoo (Tier A — shipping)
| Family | Class | Notes |
|---|---|---|
| Tree | LGBModel, XGBModel, CatBoostModel, DEnsembleModel | ml extra |
| Linear | LinearModel (OLS / Ridge / Lasso / NNLS; e.g. estimator="ridge" or estimator="lasso") | |
| Dense | DNNModel(layers=[256, 64], dropout=0.2) | ml-torch extra |
| Sequence | LSTMModel, GRUModel, ALSTMModel (attention head) | TS; step_len=20 |
| Attention | TransformerModel, LocalformerModel (local-window mask) | TS |
| Convolutional | TCNModel | TS |
| Tabular | TabNetModel | requires pytorch-tabnet |
| Generic | GeneralPTNN(model_class=..., model_module=...) | bring-your-own nn.Module |
| Seq2Seq | LSTMSeq2Seq, GRUSeq2Seq, LSTMSeq2SeqVAE, DilatedCNNSeq2Seq, TransformerForecaster | ported from Stock-Prediction-Models |
ML-Ops framework adapters
The experiment layer also exposes framework adapters that still satisfy the
same Model.fit(dataset) / Model.predict(dataset, segment) contract:
| Family | Classes | Extra |
|---|---|---|
| scikit-learn | SklearnRegressorModel, SklearnClassifierModel, SklearnPipelineModel | ml |
| Forecasting | ProphetForecastModel, SktimeForecastModel, SktimeReductionForecastModel | ml-forecast |
| Anomaly detection | PyODAnomalyModel | ml-anomaly |
| Keras / TensorFlow | KerasMLPModel, KerasLSTMModel | ml-keras or ml-tensorflow |
| Hugging Face | HuggingFaceTextSignalModel | ml-transformers |
All heavy libraries are imported lazily. The base API can list recipes and build configs without TensorFlow, Prophet, sktime, PyOD, or transformers installed; fitting one of those classes raises a targeted install message if the corresponding extra is missing.
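The lazy-import behaviour described above is commonly implemented with a small guard around importlib. The sketch below shows one way to do it; the helper name require_optional and the pip distribution name alphaswarm in the message are assumptions, not taken from the codebase.

```python
from importlib import import_module
from types import ModuleType


def require_optional(module_name: str, extra: str) -> ModuleType:
    """Import a heavy dependency on first use, or fail with an install hint."""
    try:
        return import_module(module_name)
    except ImportError as exc:
        # The distribution name "alphaswarm" is assumed for illustration.
        raise ImportError(
            f"'{module_name}' is required for this model. "
            f"Install it with: pip install 'alphaswarm[{extra}]'"
        ) from exc


# A module that exists imports normally...
json_mod = require_optional("json", "ml")

# ...while a missing one produces the targeted message.
try:
    require_optional("not_a_real_package_for_demo", "ml-forecast")
except ImportError as err:
    message = str(err)
    print(message)
```

Calling the guard inside fit() rather than at module import time is what keeps recipe listing and config building usable without the heavy extras.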
Model zoo (Tier B — scaffolded stubs)
These classes register into alphaswarm.core.registry so the Strategy Browser
enumerates them, but fit() raises NotImplementedError with a pointer to
the canonical qlib implementation. Port them incrementally:
GATsModel, HISTModel, TRAModel, ADDModel, ADARNNModel,
TCTSModel, SFMModel, SandwichModel, KRNNModel, IGMTFModel.
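A stub of this kind can be sketched as follows. The MODEL_REGISTRY dictionary and the register decorator are hypothetical stand-ins for alphaswarm.core.registry, whose real API is not shown in this document.

```python
from typing import Callable, Dict

# Hypothetical stand-in for alphaswarm.core.registry.
MODEL_REGISTRY: Dict[str, type] = {}


def register(name: str) -> Callable[[type], type]:
    """Class decorator that records the class under `name`."""
    def decorator(cls: type) -> type:
        MODEL_REGISTRY[name] = cls
        return cls
    return decorator


@register("GATsModel")
class GATsModel:
    """Tier B stub: discoverable through the registry, not yet trainable."""

    def fit(self, dataset: object) -> None:
        raise NotImplementedError(
            "GATsModel is a scaffolded stub; port it from the canonical "
            "qlib implementation before training."
        )


print(sorted(MODEL_REGISTRY))
```

Porting a model then amounts to replacing the body of fit() (and adding predict()) without touching the registration, so anything that enumerates the registry keeps working.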
Persistence + MLflow wiring
Every train_ml_model run writes a ModelVersion row and (when
register_alpha=True) registers the pickled model in the MLflow Model
Registry. If you pass strategy_id, the run is filed under the
strategy/<id[:8]> MLflow experiment so the Strategy Browser can link
straight to it.
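The strategy/<id[:8]> naming rule can be expressed in a couple of lines. In this sketch the helper name and the fallback experiment name used when no strategy_id is given are assumptions; only the prefix-plus-first-eight-characters rule comes from the text above.

```python
from typing import Optional


def mlflow_experiment_name(strategy_id: Optional[str], default: str = "Default") -> str:
    """Map a strategy id to its MLflow experiment name.

    The fallback name for runs without a strategy is an assumption.
    """
    if not strategy_id:
        return default
    return f"strategy/{strategy_id[:8]}"


print(mlflow_experiment_name("1f0c9a3e-5b7d-4c2a-9e1f-000000000000"))
```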
Planning-first workflow (split / pipeline / experiment / deployment)
The ML stack now supports a planning layer so datasets, splits, and preprocessing can be reused deterministically across runs.
- Create a split plan (fixed / purged-kfold / walk-forward):
curl -X POST http://localhost:8000/ml/split-plans \
  -H "Content-Type: application/json" \
  -d '{
    "name": "alpha158-fixed-2019-2024",
    "method": "fixed",
    "vt_symbols": ["SPY.NASDAQ", "AAPL.NASDAQ", "MSFT.NASDAQ"],
    "start": "2019-01-01",
    "end": "2024-12-31",
    "config": {
      "segments": {
        "train": ["2019-01-01", "2022-12-31"],
        "valid": ["2023-01-01", "2023-12-31"],
        "test": ["2024-01-01", "2024-12-31"]
      }
    }
  }'
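The request above uses the fixed method. For comparison, a walk-forward plan rolls the train and test windows through time. The sketch below generates such windows with the standard library only; the window lengths, function names, and output shape are hypothetical, and it does not include the purging or embargo gaps that a purged-kfold plan would add.

```python
from datetime import date, timedelta
from typing import Dict, Iterator, Tuple

Segment = Tuple[date, date]


def add_years(day: date, years: int) -> date:
    """Shift a date by whole years (a Feb 29 anchor would need special handling)."""
    return day.replace(year=day.year + years)


def walk_forward_segments(
    start: date, end: date, train_years: int = 3, test_years: int = 1
) -> Iterator[Dict[str, Segment]]:
    """Yield train/test windows that roll forward by one test window per fold."""
    one_day = timedelta(days=1)
    train_start = start
    while True:
        test_start = add_years(train_start, train_years)
        test_end = add_years(test_start, test_years) - one_day
        if test_end > end:
            break  # the next test window would run past the plan's end date
        yield {
            "train": (train_start, test_start - one_day),
            "test": (test_start, test_end),
        }
        train_start = add_years(train_start, test_years)


folds = list(walk_forward_segments(date(2019, 1, 1), date(2024, 12, 31)))
for fold in folds:
    print({name: tuple(d.isoformat() for d in span) for name, span in fold.items()})
```

With the 2019-2024 range used in the request, this yields three folds whose test windows are the calendar years 2022, 2023, and 2024.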
- Save a pipeline recipe (shared / infer / learn processors):
curl -X POST http://localhost:8000/ml/pipelines \
  -H "Content-Type: application/json" \
  -d '{
    "name": "alpha158-default",
    "infer_processors": [{"class":"Fillna","module_path":"alphaswarm.ml.processors","kwargs":{"fields_group":"feature","fill_value":0.0}}],
    "learn_processors": [{"class":"DropnaLabel","module_path":"alphaswarm.ml.processors","kwargs":{"fields_group":"label"}}]
  }'
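To show what the two processors in this recipe do to the data, here is a pandas sketch on a tiny frame with (group, field) columns. It mimics the effect of Fillna on the feature group and DropnaLabel on the label group; the function names are illustrative, and it assumes the qlib-style default in which the learn view is built on top of the infer view.

```python
import numpy as np
import pandas as pd


def fillna_group(df: pd.DataFrame, fields_group: str, fill_value: float = 0.0) -> pd.DataFrame:
    """Fill NaNs only in columns whose top-level group matches fields_group."""
    is_group = df.columns.get_level_values(0) == fields_group
    filled = df.loc[:, is_group].fillna(fill_value)
    untouched = df.loc[:, ~is_group]
    return pd.concat([filled, untouched], axis=1).reindex(columns=df.columns)


def dropna_group(df: pd.DataFrame, fields_group: str) -> pd.DataFrame:
    """Drop rows that have a NaN in any column of fields_group."""
    is_group = df.columns.get_level_values(0) == fields_group
    keep = df.loc[:, is_group].notna().all(axis=1)
    return df.loc[keep]


columns = pd.MultiIndex.from_tuples([("feature", "f1"), ("feature", "f2"), ("label", "y")])
raw = pd.DataFrame(
    [[1.0, np.nan, 0.01], [np.nan, 2.0, np.nan], [3.0, 4.0, 0.03]],
    columns=columns,
)
infer_view = fillna_group(raw, "feature", 0.0)   # used for scoring: keeps every row
learn_view = dropna_group(infer_view, "label")   # used for training: needs a label
print(infer_view)
print(learn_view)
```

The split matters because rows without a label are still valid inputs at inference time, whereas training on them would be meaningless.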
- Create an experiment plan tying together dataset/split/pipeline/model config, then launch training with experiment_plan_id:
curl -X POST http://localhost:8000/ml/train \
  -H "Content-Type: application/json" \
  -d '{
    "run_name": "alpha158-lgb-plan",
    "experiment_plan_id": "<experiment-plan-id>",
    "register_alpha": true
  }'
For a richer ML-ops run that persists an MLExperimentRun row and logs compact
prediction samples, use the experiment runner:
curl -X POST http://localhost:8000/ml/experiment-runs \
  -H "Content-Type: application/json" \
  -d '{
    "run_name": "ridge-alpha-smoke",
    "experiment_type": "alpha",
    "dataset_cfg": {"class": "DatasetH", "module_path": "alphaswarm.ml.dataset", "kwargs": {...}},
    "model_cfg": {"class": "SklearnRegressorModel", "module_path": "alphaswarm.ml.models.sklearn", "kwargs": {"estimator": "ridge"}}
  }'
Small interactive flows can run synchronously without Celery:
curl -X POST http://localhost:8000/ml/flows/linear/preview \
  -H "Content-Type: application/json" \
  -d '{"dataset_cfg": {...}, "estimator": "ridge", "alpha": 1.0}'
The Next.js web UI exposes the same objects in /ml/builder, using a graph
that serializes Dataset, Preprocessing, Split, Model, Records, and
Experiment nodes into the /ml/experiment-runs request.
- Deploy a tested ModelVersion as a strategy alpha profile:
curl -X POST http://localhost:8000/ml/deployments \
  -H "Content-Type: application/json" \
  -d '{
    "name": "lgb-alpha-prod",
    "model_version_id": "<model-version-id>",
    "infer_segment": "infer",
    "long_threshold": 0.001,
    "short_threshold": -0.001
  }'
Then consume it in strategy YAML via:
alpha_model:
  class: DeployedModelAlpha
  module_path: alphaswarm.strategies.ml_alphas
  kwargs:
    deployment_id: "<deployment-id>"
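The long_threshold and short_threshold values in the deployment request suggest a simple mapping from predicted returns to directional signals. The sketch below shows one plausible reading. How DeployedModelAlpha actually applies the thresholds is not documented here, so the strict inequalities, the +1 / -1 / 0 encoding, and the function name are all assumptions.

```python
from typing import Dict


def to_signals(
    predictions: Dict[str, float],
    long_threshold: float = 0.001,
    short_threshold: float = -0.001,
) -> Dict[str, int]:
    """Map predicted returns to +1 (long), -1 (short) or 0 (flat)."""
    signals: Dict[str, int] = {}
    for symbol, score in predictions.items():
        if score > long_threshold:
            signals[symbol] = 1
        elif score < short_threshold:
            signals[symbol] = -1
        else:
            signals[symbol] = 0  # inside the dead band: no position
    return signals


demo = to_signals({"AAPL": 0.004, "MSFT": -0.003, "SPY": 0.0002})
print(demo)
```

The band between the two thresholds acts as a dead zone that suppresses trades on predictions too small to cover costs.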
Train -> register -> deploy -> score
ML engine major expansion (Alembic 0025)
The ML layer has grown a number of new surfaces, all driven by the
existing Experiment / Model / Processor contracts:
- AlphaBacktestExperiment: combined "model used as alpha" experiment that trains, registers, deploys, backtests, and rolls the combined ML + trading metrics into a single MLflow parent run and an ml_alpha_backtest_runs Postgres row. See alphaswarm_docs/ml-alpha-backtest.md.
- Library coverage: TF-native (TFEstimatorModel), Keras Functional / TabTransformer, HuggingFace FinBERT / time-series transformer / generative, AutoETS / AutoARIMA / Theta / Tbats, PyOD ECOD / SUOD / AutoEncoder, Sklearn Stacking / AutoPipeline. See alphaswarm_docs/ml-libraries.md.
- Lightweight workbench flows: regression_diagnostics, unit_root, acf_pacf, granger_causality, cointegration, garch, change_point, clustering, pca_summary. See alphaswarm_docs/ml-flows.md.
- ML preprocessors as data-pipeline nodes: transform.ml_preprocessing plus specialised tiles, with a new sink.ml_feature_snapshot for deterministic feature reload. See alphaswarm_docs/ml-preprocessing-pipeline.md.
- Interactive testing workbench: /ml/test/{single,batch,compare,scenario,upload-csv} endpoints + tabbed webui surface. See alphaswarm_docs/ml-testing.md.
- Graphical builder palette: Source / Pipeline / Split / Model (per-framework) / Records / Experiment / Test / Deploy sections plus an Interactive Workbench drawer. See alphaswarm_docs/ml-builder.md.
- Adhoc helpers: alphaswarm.ml.adhoc exposes quick_ridge, quick_arima, quick_iforest, etc. for notebook iteration.