Runbook — dbt snapshot deadlock
Symptoms:
- dbt snapshot runs queue indefinitely.
- The dbt_snapshots Dagster concurrency pool shows 1 slot in use but the corresponding run is CANCELED or FAILED.
Root cause
Per the Dagster docs: "a single cancelled run will permanently
deadlock all future runs for that pool" unless the
free_slots_after_run_end_seconds knob is set on the
run_monitoring block.
Fix (in this order)
- Confirm alphaswarm/dagster/dagster.yaml has:
    run_monitoring:
      enabled: true
      free_slots_after_run_end_seconds: 300
  If missing, add it and reload the Dagster instance.
- Manually free the stuck slot:
    dagster instance concurrency reset dbt_snapshots
- Verify with the Dagster UI: the pool should show 0 / 1 used.
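The config check in the first step can be partly automated. The following is a minimal sketch, assuming dagster.yaml has already been parsed into a plain dict (for example with PyYAML's yaml.safe_load, which this runbook does not otherwise require). The function name check_run_monitoring is hypothetical and not part of Dagster.

```python
from typing import Any, Dict, List


def check_run_monitoring(config: Dict[str, Any]) -> List[str]:
    """Return a list of problems with the run_monitoring block (empty if OK)."""
    block = config.get("run_monitoring")
    if not isinstance(block, dict):
        return ["run_monitoring block is missing"]

    problems: List[str] = []
    if block.get("enabled") is not True:
        problems.append("run_monitoring.enabled is not true")

    seconds = block.get("free_slots_after_run_end_seconds")
    # bool is a subclass of int in Python, so reject it explicitly.
    if isinstance(seconds, bool) or not isinstance(seconds, int) or seconds <= 0:
        problems.append(
            "free_slots_after_run_end_seconds is missing or not a positive integer"
        )
    return problems
```

An empty list means the block matches what this runbook expects. The check only inspects the file contents, so the Dagster instance still has to be reloaded for a change to take effect.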
Verification chaos test
tests/chaos/test_snapshot_deadlock_recovery.py triggers 5
parallel snapshot jobs against a sqlite test target and asserts
that even after one is cancelled the pool recovers within 360s.
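The test file itself is not reproduced here. As an illustration of the "recovers within 360s" assertion, the sketch below shows one common way to write a bounded polling check. The helper wait_until and the slots_in_use callable mentioned in the comment are hypothetical. How the real test reads the pool state is not specified in this runbook.

```python
import time
from typing import Callable


def wait_until(
    predicate: Callable[[], bool],
    timeout_s: float,
    interval_s: float = 5.0,
) -> bool:
    """Poll predicate until it returns True or timeout_s elapses."""
    deadline = time.monotonic() + timeout_s
    while True:
        if predicate():
            return True
        remaining = deadline - time.monotonic()
        if remaining <= 0:
            return False
        # Never sleep past the deadline.
        time.sleep(min(interval_s, remaining))


# Hypothetical use inside the chaos test, after cancelling one run:
#   assert wait_until(lambda: slots_in_use("dbt_snapshots") == 0, timeout_s=360)
```

The 360s bound presumably leaves some headroom over the 300s free_slots_after_run_end_seconds setting. If that setting changes, the test timeout likely needs to change with it.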
Postmortem
If the deadlock recurs after the canonical fix, the postmortem must include a Dagster + dbt version pair and a minimal repro so the upstream issue can be filed.
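To capture the version pair consistently, something like the sketch below can be run in the environment where the deadlock occurred. It uses only the standard library. The default package list (dagster, dbt-core) is an assumption, so add the dbt adapter or other distributions if they are relevant to the repro.

```python
from importlib.metadata import PackageNotFoundError, version
from typing import Dict, Iterable


def collect_versions(
    packages: Iterable[str] = ("dagster", "dbt-core"),
) -> Dict[str, str]:
    """Map each distribution name to its installed version, or 'not installed'."""
    found: Dict[str, str] = {}
    for name in packages:
        try:
            found[name] = version(name)
        except PackageNotFoundError:
            found[name] = "not installed"
    return found


if __name__ == "__main__":
    # Print in a pip-style format that can be pasted into the postmortem.
    for name, ver in collect_versions().items():
        print(f"{name}=={ver}")
```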