
Worker vs executor images

The AlphaSwarm Celery surface is split into two purpose-built, migration-ready container images (Phase 4c):

  • alphaswarm-worker: slim orchestration worker. Task dispatch, lineage, paper-trading loop, terraform/ingestion/workflow coordination.
  • alphaswarm-executor: heavy-compute executor. Backtests, RL / ML training, factor builds, agent-emitted strategy code, RAG ingest.

Why split

Historically worker and beat had no image of their own — the alphaswarm_images catalogue pinned worker = { target = "api" }, so the orchestration worker dragged the entire API stage (Dash, visualization, dev tooling) plus the full ML/RL surface into one fat image. Two problems followed:

  1. Bloat & blast radius. A lineage callback worker carried PyTorch, Ray, vectorbt-pro, forecasting libs — slow to pull, large attack surface, slow cold-start.
  2. Scaling mismatch. Light coordination tasks (sub-second, IO-bound) and heavy compute tasks (minutes–hours, CPU/GPU/RAM-bound) have opposite scaling and resource profiles, but shared one Deployment.

Splitting lets each image carry only what its queues need, and lets each scale and be resourced independently.

Queue ↔ image matrix

The queue assignment is identical across the root Dockerfile, the standalone per-service Dockerfiles, the K8s manifests, both compose files, and the faas KEDA module (local.heavy_queues). A queue is never drained by both images.

| Queue | Image | Why |
|---|---|---|
| default | worker | bookkeeping, lineage, callbacks |
| paper | worker | sub-second paper-trading loop (latency-sensitive) |
| terraform | worker | TerraformRuntime apply/destroy wrappers |
| ingestion | worker | connector pulls (IO-bound, long-lived) |
| workflows | worker | WorkflowRuntime orchestration |
| backtest | executor | vbt-pro / event-driven / Lean engine runs |
| training | executor | RL rollouts + finetune jobs (GPU) |
| ml | executor | ML pipelines, predictor refresh |
| agents | executor | CrewAI / LangGraph agent runs |
| factors | executor | factor-zoo builds, alpha tests |
| rag | executor | RAG ingest, embedding refresh |
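
The matrix above can be encoded as a single lookup so that the "never drained by both images" rule is checked mechanically. The following is a minimal Python sketch, not the project's actual code: `QUEUE_IMAGE`, `queues_for`, `check_disjoint`, `worker_command`, and the app path `alphaswarm.celery_app` are hypothetical names chosen for illustration. Only the queue and image names come from the matrix, and `-A`, `-Q`, and `--concurrency` are standard Celery worker options.

```python
from collections import Counter

# Queue -> image assignment, copied from the matrix above.
QUEUE_IMAGE = {
    "default": "worker",
    "paper": "worker",
    "terraform": "worker",
    "ingestion": "worker",
    "workflows": "worker",
    "backtest": "executor",
    "training": "executor",
    "ml": "executor",
    "agents": "executor",
    "factors": "executor",
    "rag": "executor",
}


def queues_for(image: str) -> list[str]:
    """Return the queues drained by `image`, in matrix order."""
    return [queue for queue, img in QUEUE_IMAGE.items() if img == image]


def check_disjoint(surface: dict[str, list[str]]) -> None:
    """Raise ValueError if a surface lets two images drain the same queue."""
    counts = Counter(q for queues in surface.values() for q in queues)
    shared = sorted(q for q, n in counts.items() if n > 1)
    if shared:
        raise ValueError(f"queues drained by more than one image: {shared}")


def worker_command(image: str, concurrency: int,
                   app: str = "alphaswarm.celery_app") -> str:
    """Build a Celery worker command line. The app path is a placeholder."""
    queues = ",".join(queues_for(image))
    return f"celery -A {app} worker -Q {queues} --concurrency={concurrency}"


if __name__ == "__main__":
    # A surface (compose file, manifest, ...) is modelled as image -> queues.
    check_disjoint({img: queues_for(img) for img in ("worker", "executor")})
    print(worker_command("worker", 4))
    print(worker_command("executor", 2))
```

A check like `check_disjoint` could be run against the queue lists parsed from each surface (Dockerfiles, manifests, compose files) to catch drift; how those lists are extracted is left out here because it depends on the repository layout.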

Dependency surface

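Both images share the multi-arch (linux/amd64 and linux/arm64) Chainguard Wolfi base, run as nonroot UID 65532, and follow the CredentialResolver-only secret rule (nothing is baked into the image). They differ in installed extras, native build dependencies, extra directories, and default concurrency and resources: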

| | worker | executor |
|---|---|---|
| Base extras | otel, cli, iceberg, entity-graph, dagster-alphaswarm | same |
| Distributed compute | compute-dask, compute-ray | compute-dask, compute-ray |
| ML / RL / forecasting | | ml, ml-torch, ml-forecast, ml-anomaly |
| Portfolio | | portfolio |
| Native build deps | | gfortran, linux-headers (numpy/scipy/forecast wheels) |
| Extra dirs | /app/data | /app/data, /app/models |
| Default concurrency | 4 | 2 |
| Resource requests | 500m CPU / 1Gi | 1 CPU / 4Gi |
| Resource limits | 4 CPU / 8Gi | 8 CPU / 16Gi |
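
To make the resource rows concrete, the sketch below works out how much CPU and memory each Celery process could use at the limit. It is illustrative only and rests on two assumptions that the table does not state: that "default concurrency" is the number of Celery processes per pod, and that the pod limit is shared evenly among them. `Profile` and `PROFILES` are hypothetical names.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Profile:
    concurrency: int     # default Celery processes per pod (assumed meaning)
    request_cpu_m: int   # CPU request in millicores
    request_mem_mi: int  # memory request in MiB
    limit_cpu_m: int     # CPU limit in millicores
    limit_mem_mi: int    # memory limit in MiB

    def per_process_limit(self) -> tuple[float, float]:
        """Return (CPU cores, memory GiB) per process at the pod limit."""
        cpu = self.limit_cpu_m / 1000 / self.concurrency
        mem = self.limit_mem_mi / 1024 / self.concurrency
        return cpu, mem


# Values transcribed from the table: 500m CPU / 1Gi -> 4 CPU / 8Gi for the
# worker, 1 CPU / 4Gi -> 8 CPU / 16Gi for the executor.
PROFILES = {
    "worker": Profile(concurrency=4, request_cpu_m=500, request_mem_mi=1024,
                      limit_cpu_m=4000, limit_mem_mi=8192),
    "executor": Profile(concurrency=2, request_cpu_m=1000, request_mem_mi=4096,
                        limit_cpu_m=8000, limit_mem_mi=16384),
}

if __name__ == "__main__":
    for name, profile in PROFILES.items():
        cpu, mem = profile.per_process_limit()
        print(f"{name}: {cpu:.1f} CPU and {mem:.1f} GiB per process at the limit")
```

Under these assumptions a worker process gets up to 1 CPU and 2 GiB, while an executor process gets up to 4 CPUs and 8 GiB, which matches the intent of running fewer, heavier processes on the executor.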

Where the images are defined

| Surface | Worker | Executor |
|---|---|---|
| Root multi-stage target | worker in Dockerfile | executor in Dockerfile |
| Standalone Dockerfile | build/docker/alphaswarm_worker/ | build/docker/alphaswarm_executor/ |
| Image catalogue | worker / beat → target worker | executor → target executor |
| ECR repo | alphaswarm-worker | alphaswarm-executor |
| Kustomize base | base/alphaswarm-worker/ | base/alphaswarm-executor/ |
| Compose | worker (legacy) / alphaswarm-worker | worker-gpu (legacy) / alphaswarm-executor |
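
The entries above follow a regular naming pattern, which the following Python sketch spells out. It is a reading aid rather than project code: `image_surfaces`, `CATALOGUE_TARGETS`, and the dictionary keys are hypothetical, and the legacy compose service names (worker, worker-gpu) are deliberately left out.

```python
# Image-catalogue entries and the root Dockerfile target each one builds.
CATALOGUE_TARGETS = {
    "worker": "worker",
    "beat": "worker",      # beat reuses the worker target
    "executor": "executor",
}


def image_surfaces(name: str) -> dict[str, str]:
    """Return where image `name` ("worker" or "executor") is defined."""
    if name not in ("worker", "executor"):
        raise ValueError(f"unknown image: {name!r}")
    return {
        "root_target": name,
        # The standalone Dockerfile directory uses an underscore ...
        "standalone_dockerfile_dir": f"build/docker/alphaswarm_{name}/",
        # ... while the ECR repo, Kustomize base, and compose service
        # use a hyphen.
        "ecr_repo": f"alphaswarm-{name}",
        "kustomize_base": f"base/alphaswarm-{name}/",
        "compose_service": f"alphaswarm-{name}",
    }


if __name__ == "__main__":
    for image in ("worker", "executor"):
        for surface, value in image_surfaces(image).items():
            print(f"{image:9} {surface:26} {value}")
```

Note the one irregularity the sketch makes visible: the standalone Dockerfile directories use an underscore (alphaswarm_worker) where every other surface uses a hyphen.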

Migration readiness

The two images are intentionally self-contained — a standalone Dockerfile, its own ECR repo, its own image-catalogue entry, its own Kustomize base, and its own topology entry — so the build assets can be lifted into a dedicated repository in a future migration without untangling them from the API image.

See also