Worker vs executor images

The AlphaSwarm Celery surface is split into two purpose-built, migration-ready container images (Phase 4c):

alphaswarm-worker — slim orchestration worker. Task dispatch, lineage, paper-trading loop, terraform/ingestion/workflow coordination.
alphaswarm-executor — heavy-compute executor. Backtests, RL / ML training, factor builds, agent-emitted strategy code, RAG ingest.

Why split

Historically, worker and beat had no image of their own. The alphaswarm_images catalogue pinned `worker = { target = "api" }`, so the orchestration worker dragged the entire API stage (Dash, visualization, dev tooling) plus the full ML/RL surface into one fat image. Two problems followed:

Bloat & blast radius. A lineage callback worker carried PyTorch, Ray, vectorbt-pro, and forecasting libraries, which made it slow to pull, slow to cold-start, and gave it a large attack surface.
Scaling mismatch. Light coordination tasks (sub-second, IO-bound) and heavy compute tasks (minutes to hours, CPU/GPU/RAM-bound) have opposite scaling and resource profiles, yet they shared one Deployment.

Splitting lets each image carry only what its queues need, and lets each scale and be resourced independently.

Queue ↔ image matrix

The queue assignment is identical across the root Dockerfile, the standalone per-service Dockerfiles, the K8s manifests, both compose files, and the faas KEDA module (local.heavy_queues). A queue is never drained by both images.

Queue	Image	Why
`default`	worker	bookkeeping, lineage, callbacks
`paper`	worker	sub-second paper-trading loop (latency-sensitive)
`terraform`	worker	`TerraformRuntime` apply/destroy wrappers
`ingestion`	worker	connector pulls (IO-bound, long-lived)
`workflows`	worker	`WorkflowRuntime` orchestration
`backtest`	executor	vbt-pro / event-driven / Lean engine runs
`training`	executor	RL rollouts + finetune jobs (GPU)
`ml`	executor	ML pipelines, predictor refresh
`agents`	executor	CrewAI / LangGraph agent runs
`factors`	executor	factor-zoo builds, alpha tests
`rag`	executor	RAG ingest, embedding refresh
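
The matrix above can be kept as a single mapping and checked against each surface. The following is a minimal sketch, not the project's actual tooling: the queue names, image names, and concurrency values come from this page, while `QUEUE_OWNERS`, `check_surface`, and the Celery app path `alphaswarm.celery_app` are hypothetical names chosen for illustration.

```python
from __future__ import annotations

# Queue ownership copied from the matrix above. Keying by queue name means
# a queue can only ever map to one image.
QUEUE_OWNERS: dict[str, str] = {
    "default": "worker",
    "paper": "worker",
    "terraform": "worker",
    "ingestion": "worker",
    "workflows": "worker",
    "backtest": "executor",
    "training": "executor",
    "ml": "executor",
    "agents": "executor",
    "factors": "executor",
    "rag": "executor",
}

# Default concurrency per image, from the dependency-surface table.
DEFAULT_CONCURRENCY = {"worker": 4, "executor": 2}


def queues_for(image: str) -> list[str]:
    """Return the queues an image drains, in matrix order."""
    return [q for q, owner in QUEUE_OWNERS.items() if owner == image]


def celery_command(image: str, app: str = "alphaswarm.celery_app") -> list[str]:
    """Build a `celery worker` argv for one image (the app path is hypothetical)."""
    return [
        "celery", "-A", app, "worker",
        "-Q", ",".join(queues_for(image)),
        "--concurrency", str(DEFAULT_CONCURRENCY[image]),
    ]


def check_surface(name: str, surface: dict[str, list[str]]) -> None:
    """Raise ValueError if a surface's queue lists disagree with the matrix."""
    seen: dict[str, str] = {}
    for image, queues in surface.items():
        for queue in queues:
            if queue in seen:
                raise ValueError(f"{name}: queue {queue!r} is listed more than once")
            seen[queue] = image
    if seen != QUEUE_OWNERS:
        raise ValueError(f"{name}: queue assignment differs from the matrix")
```

A check like `check_surface("compose", {...})` could run in CI for each surface that declares queues, so a drifted Dockerfile or manifest fails fast instead of leaving a queue undrained or drained twice.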

Dependency surface

Both images share the multi-arch (linux/amd64+arm64) Chainguard Wolfi base, nonroot UID 65532, and the CredentialResolver-only secret rule (nothing baked into the image). They differ in installed extras, native build dependencies, extra directories, and default concurrency and resource settings:

Aspect	worker	executor
Base extras	`otel, cli, iceberg, entity-graph, dagster-alphaswarm`	same
Distributed compute	`compute-dask, compute-ray`	`compute-dask, compute-ray`
ML / RL / forecasting	—	`ml, ml-torch, ml-forecast, ml-anomaly`
Portfolio	—	`portfolio`
Native build deps	—	`gfortran`, `linux-headers` (numpy/scipy/forecast wheels)
Extra dirs	`/app/data`	`/app/data`, `/app/models`
Default concurrency	4	2
Resource requests	`500m CPU / 1Gi`	`1 CPU / 4Gi`
Resource limits	`4 CPU / 8Gi`	`8 CPU / 16Gi`
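
To make the concurrency and resource rows concrete, here is a rough sketch that encodes both profiles and estimates the memory available to each Celery process at the limit. The figures are taken from the table; the `Profile` class and helper names are hypothetical, and the even split assumes a prefork pool whose processes share the pod's memory limit equally, which this page does not state.

```python
from dataclasses import dataclass
from decimal import Decimal


@dataclass(frozen=True)
class Profile:
    cpu_request: str
    mem_request: str
    cpu_limit: str
    mem_limit: str
    concurrency: int


# Values copied from the dependency-surface table.
PROFILES = {
    "worker": Profile("500m", "1Gi", "4", "8Gi", 4),
    "executor": Profile("1", "4Gi", "8", "16Gi", 2),
}


def cpu_cores(quantity: str) -> Decimal:
    """Parse a Kubernetes CPU quantity ('500m' or '4') into cores."""
    if quantity.endswith("m"):
        return Decimal(quantity[:-1]) / 1000
    return Decimal(quantity)


def mem_gib(quantity: str) -> Decimal:
    """Parse a memory quantity; this sketch only handles the 'Gi' suffix."""
    if not quantity.endswith("Gi"):
        raise ValueError(f"unsupported memory quantity: {quantity!r}")
    return Decimal(quantity[:-2])


def resources(image: str) -> dict:
    """Render the Kubernetes `resources` stanza for an image as a dict."""
    p = PROFILES[image]
    return {
        "requests": {"cpu": p.cpu_request, "memory": p.mem_request},
        "limits": {"cpu": p.cpu_limit, "memory": p.mem_limit},
    }


def mem_per_process_at_limit(image: str) -> Decimal:
    """GiB per Celery process if the limit is split evenly (an assumption)."""
    p = PROFILES[image]
    return mem_gib(p.mem_limit) / p.concurrency
```

Under that assumption the worker gets about 2 GiB per process (8Gi / 4) and the executor about 8 GiB per process (16Gi / 2), which matches the intent of running fewer, heavier tasks on the executor.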

Where the images are defined

Surface	Worker	Executor
Root multi-stage target	`worker` in `Dockerfile`	`executor` in `Dockerfile`
Standalone Dockerfile	`build/docker/alphaswarm_worker/`	`build/docker/alphaswarm_executor/`
Image catalogue	`worker` / `beat` → target `worker`	`executor` → target `executor`
ECR repo	`alphaswarm-worker`	`alphaswarm-executor`
Kustomize base	`base/alphaswarm-worker/`	`base/alphaswarm-executor/`
Compose	`worker` (legacy) / `alphaswarm-worker`	`worker-gpu` (legacy) / `alphaswarm-executor`
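
As an illustration of the root multi-stage targets, the sketch below assembles a `docker buildx build` command for either image. The target names, repo names, and platforms come from this page; the helper name, the `dev` tag, and the assumption that the command runs from the repository root are hypothetical.

```python
# Platforms from the dependency-surface section (multi-arch base).
PLATFORMS = "linux/amd64,linux/arm64"

# Root Dockerfile target -> ECR repo name, from the table above.
TARGETS = {
    "worker": "alphaswarm-worker",
    "executor": "alphaswarm-executor",
}


def buildx_command(target: str, tag: str = "dev", context: str = ".") -> list[str]:
    """Build the argv for a multi-arch build of one root Dockerfile target."""
    if target not in TARGETS:
        raise ValueError(f"unknown target: {target!r}")
    return [
        "docker", "buildx", "build",
        "--file", "Dockerfile",
        "--target", target,
        "--platform", PLATFORMS,
        "--tag", f"{TARGETS[target]}:{tag}",  # hypothetical tag scheme
        context,
    ]
```

Passing the result to `subprocess.run(..., check=True)` would run the build; pushing to ECR needs a registry prefix and credentials that are outside the scope of this sketch.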

Migration readiness

The two images are intentionally self-contained: each has a standalone Dockerfile, its own ECR repo, its own image-catalogue entry, its own Kustomize base, and its own topology entry. The build assets can therefore be lifted into a dedicated repository in a future migration without untangling them from the API image.
