ADR 010 — Canary Rollout PnL Gates
Status: Accepted (QuantBot Platform v0.2.0)
Date: 2026-05-24
Context
Strategy changes (new alpha model, new portfolio constructor, new
execution algo) are the highest-leverage and highest-risk changes the
platform makes. Rolling them across the entire fleet at once is
unacceptable; bake time is mandatory. Argo Rollouts canary lets us
shift weight gradually, but the canary needs automated abort
criteria beyond the standard liveness/readiness probes — a bot can
be Ready=True and still be hemorrhaging money.
Decision
Three AnalysisTemplates gate every canary promotion step:
- bot-canary-pnl: realised PnL of the canary vs the stable variant. Default success condition: canary_realized_pnl - stable_realized_pnl >= -50 USD over 6 × 5-minute windows (30 minutes total).
- bot-reject-rate: fraction of orders that are rejected (by venue or by pre-trade risk). Default success condition: <= 1% over 30 × 1-minute windows.
- bot-p99-latency: P99 tick-to-trade latency. Default success condition: <= 1 ms (HFT canaries override to <= 100 µs).
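As a rough illustration, the first gate might be expressed as an AnalysisTemplate along the following lines. This is a sketch under stated assumptions, not the platform's actual template: the Prometheus address, the gauge name bot_realized_pnl_usd, its bot and variant labels, and the bot-name argument are all hypothetical, and the sketch assumes the PnL difference is evaluated per 5-minute window rather than cumulatively.

```yaml
apiVersion: argoproj.io/v1alpha1
kind: AnalysisTemplate
metadata:
  name: bot-canary-pnl
spec:
  args:
  - name: bot-name              # hypothetical argument supplied by the Rollout
  metrics:
  - name: canary-vs-stable-pnl
    interval: 5m                # one measurement per 5-minute window
    count: 6                    # 6 windows, 30 minutes in total
    failureLimit: 0             # a single failed window fails the analysis
    successCondition: "result[0] >= -50"
    provider:
      prometheus:
        address: http://prometheus.monitoring.svc:9090   # assumed address
        # delta() rather than increase(): realised PnL is a gauge that can fall.
        query: |
          sum(delta(bot_realized_pnl_usd{bot="{{args.bot-name}}", variant="canary"}[5m]))
          -
          sum(delta(bot_realized_pnl_usd{bot="{{args.bot-name}}", variant="stable"}[5m]))
```

The other two templates would presumably follow the same shape, differing only in the query, the interval and count, and the success condition.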
The canary spec follows the standard Argo Rollouts pattern:
steps:
- setWeight: 10
- pause: { duration: 30m }    # bake time before the first analysis
- analysis:
    templates:
    - templateName: bot-canary-pnl
    - templateName: bot-reject-rate
    - templateName: bot-p99-latency
- setWeight: 50
- pause: { duration: 1h }
- analysis:
    templates:
    - templateName: bot-canary-pnl
    - templateName: bot-reject-rate
    - templateName: bot-p99-latency
- setWeight: 100
Failure of any AnalysisTemplate aborts the rollout and reverts
traffic to the stable version. The operator additionally watches
the BotPnLDrawdownCritical PrometheusRule; if the canary bleeds
more than maxAbortRolloutPnlBleedUsd (default $500), the alert
automatically triggers a KillSwitch CR, which halts the canary
immediately. This covers the case where the rollout abort itself
takes longer to complete than the losses take to accumulate.
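The drawdown alert described above could look roughly like the PrometheusRule below. Only the alert name and the $500 default come from this ADR; the resource name, the group name, the severity label, and the gauge bot_canary_pnl_bleed_usd (USD lost by the canary since the rollout started) are assumptions made for illustration. How the firing alert is turned into a KillSwitch CR is outside this sketch.

```yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: bot-pnl-drawdown          # hypothetical resource name
spec:
  groups:
  - name: bot-canary-pnl          # hypothetical group name
    rules:
    - alert: BotPnLDrawdownCritical
      # 500 mirrors the default maxAbortRolloutPnlBleedUsd.
      expr: bot_canary_pnl_bleed_usd > 500
      for: 0m                     # fire on the first evaluation that breaches
      labels:
        severity: critical
```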
Default thresholds rationale
The -50 USD floor on the canary-minus-stable PnL difference is intentionally generous for the initial canary window: it admits short-term variance that is statistically normal between two variants of the same strategy. The harder $500 bleed threshold (the drawdown alert) is what catches truly broken canaries within seconds.
Per blueprint caveat (canary false-positive rate): if good canaries are routinely aborted on noisy metrics, tighten the metric query first (more samples, longer windows, robust quantiles) before relaxing the success condition.
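One way to read "tighten the metric query first" is sketched below for the reject-rate gate: the success condition stays at 1%, but each sample is computed over a longer range so that it pools more orders. The counter names bot_orders_rejected_total and bot_orders_total, the variant label, the Prometheus address, and the 10-minute range are all hypothetical.

```yaml
# Fragment of spec.metrics in the bot-reject-rate AnalysisTemplate.
metrics:
- name: reject-rate
  interval: 1m
  count: 30
  successCondition: "result[0] <= 0.01"    # threshold left unchanged at 1%
  provider:
    prometheus:
      address: http://prometheus.monitoring.svc:9090   # assumed address
      # A [1m] range sees few orders per sample and is therefore noisy;
      # a [10m] range pools more orders into each measurement.
      query: |
        sum(rate(bot_orders_rejected_total{variant="canary"}[10m]))
        /
        sum(rate(bot_orders_total{variant="canary"}[10m]))
```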
Consequences
- + Strategy changes have an automated bake-time gate.
- + The same canary pattern works for both stateless mid-frequency bots and HFT bots (only the latency threshold differs).
- − AnalysisTemplate thresholds need per-strategy calibration — a market-making bot's "good" reject rate is higher than a stat-arb pair's "good" reject rate.
- − A canary that's still warming up may not yet have produced enough orders for the metrics to be meaningful; we mitigate with the initial 30-minute pause before the first analysis check.
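Per-strategy thresholds such as the HFT latency override can be handled with AnalysisTemplate arguments, so that one template serves every bot and each Rollout passes its own limit. The sketch below assumes a histogram named bot_tick_to_trade_seconds, a variant label, an argument called max-p99-seconds, and a 5-minute range, none of which come from this ADR. It also omits interval and count, since the ADR does not state windows for the latency gate.

```yaml
apiVersion: argoproj.io/v1alpha1
kind: AnalysisTemplate
metadata:
  name: bot-p99-latency
spec:
  args:
  - name: max-p99-seconds
    value: "0.001"                # default: 1 ms
  metrics:
  - name: p99-tick-to-trade
    successCondition: "result[0] <= {{args.max-p99-seconds}}"
    provider:
      prometheus:
        address: http://prometheus.monitoring.svc:9090   # assumed address
        query: |
          histogram_quantile(0.99,
            sum by (le) (rate(bot_tick_to_trade_seconds_bucket{variant="canary"}[5m])))
---
# Fragment of an HFT bot's Rollout canary steps: override to 100 µs.
steps:
- analysis:
    templates:
    - templateName: bot-p99-latency
    args:
    - name: max-p99-seconds
      value: "0.0001"
```

The same mechanism would apply to the reject-rate gate, where a market-making bot might pass a higher limit than a stat-arb pair.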