alphaswarm-executor

Celery heavy-compute executor pod — the compute-heavy counterpart of the orchestration alphaswarm-worker.

Introduced by the Phase 4c worker/executor split. It carries the full ML, RL, forecasting, and portfolio dependency surface plus the distributed-compute stack (Dask and Ray), so that backtests, training rollouts, factor builds, and agent-emitted strategy code run here instead of bloating the slim orchestration worker. See worker vs executor images for the full rationale and dependency matrix.

Identity

| Field | Value |
|---|---|
| Service id | alphaswarm-executor |
| Role | executor |
| Package | alphaswarm/ (tasks under alphaswarm/tasks/*.py) |
| Image (key) | executor |
| Built from | alphaswarm_platform/Dockerfile (target executor, multi-arch) or the standalone build/docker/alphaswarm_executor/Dockerfile |

Wire

| Field | Value |
|---|---|
| Protocol | none (no HTTP listener) |
| Health | celery inspect ping + Prometheus metrics on :9100; Ray dashboard on :8265 when a local Ray head runs |
| Public URL | |
| Broker | redis://redis:6379/0 |
| Result backend | redis://redis:6379/1 |
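
As a minimal sketch, the broker and result backend rows above could be expressed as Celery settings. `broker_url` and `result_backend` are standard Celery setting names; the helper functions `redis_db` and `same_instance` are hypothetical and only illustrate that both URLs point at the same Redis instance while using separate logical databases (0 and 1).

```python
from urllib.parse import urlsplit

# Values copied from the Wire table; in a Celery config module these
# two names are the standard lowercase settings.
broker_url = "redis://redis:6379/0"
result_backend = "redis://redis:6379/1"


def redis_db(url: str) -> int:
    """Return the logical Redis database index encoded in the URL path."""
    return int(urlsplit(url).path.lstrip("/") or 0)


def same_instance(a: str, b: str) -> bool:
    """True when both URLs target the same Redis host and port."""
    ua, ub = urlsplit(a), urlsplit(b)
    return (ua.hostname, ua.port) == (ub.hostname, ub.port)
```

Keeping task messages and task results in different databases means that flushing or inspecting one does not touch the other.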

Deployment surfaces

| Surface | Where |
|---|---|
| Compose | alphaswarm-executor in deployments/compose/docker-compose.local.yml; worker-gpu in legacy compose/docker-compose.yml |
| Kustomize | deployments/kubernetes/base/alphaswarm-executor/ (Deployment + HPA + PDB) |
| Image catalogue | executor entry in terraform/modules/alphaswarm_images/ |
| ECR repo | alphaswarm-executor in infrastructure/modules/ecr-repositories/ |
| Terraform module | terraform/modules/faas/ (heavy-queue Deployments pull this image) |
| Topology | alphaswarm-executor in configs/deployment/topology.yaml |

Queue families

The executor drains the heavy compute queues. KEDA scales each queue family independently.

| Queue | Drives | Scale-to-zero | Notes |
|---|---|---|---|
| backtest | backtest dispatch (vbt-pro / event-driven / Lean) | yes | max=20 |
| training | RL rollouts, finetune jobs | yes | dedicated GPU node group |
| ml | ML pipelines, predictor refresh | yes | |
| agents | CrewAI runs, LangGraph orchestration | yes | max=12 |
| factors | factor zoo builds, alpha tests | yes | |
| rag | RAG ingest, embedding refresh | yes | |
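
One way tasks could be steered onto these queues is Celery's `task_routes` setting, which accepts glob patterns on task names. The sketch below is illustrative only: the module names under `alphaswarm.tasks` are assumptions (the source only says tasks live under alphaswarm/tasks/*.py), and `queue_for` is a hypothetical helper that mimics the lookup with the standard library rather than calling Celery's own router.

```python
from fnmatch import fnmatchcase
from typing import Dict, Optional

# Shape of a Celery `task_routes` mapping. Queue names come from the
# table above; the task module names are hypothetical.
task_routes: Dict[str, Dict[str, str]] = {
    "alphaswarm.tasks.backtest.*": {"queue": "backtest"},
    "alphaswarm.tasks.training.*": {"queue": "training"},
    "alphaswarm.tasks.ml.*": {"queue": "ml"},
    "alphaswarm.tasks.agents.*": {"queue": "agents"},
    "alphaswarm.tasks.factors.*": {"queue": "factors"},
    "alphaswarm.tasks.rag.*": {"queue": "rag"},
}


def queue_for(task_name: str) -> Optional[str]:
    """Return the queue of the first matching pattern, or None if unrouted."""
    for pattern, route in task_routes.items():
        if fnmatchcase(task_name, pattern):
            return route["queue"]
    return None
```

For example, `queue_for("alphaswarm.tasks.backtest.run")` resolves to the backtest queue, while a task name outside these modules returns `None` and would fall through to whatever default queue the deployment configures.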

Dependencies

Upstream:

  • redis — broker + result backend.
  • postgres — task lookups, ledger writes.
  • alphaswarm-core — progress emit callbacks, lookup APIs.
  • mlflow — experiment tracking + model registry for training / ML runs.
  • All data-plane services the alphaswarm-core pod depends on.

Downstream:

  • Beat schedules heavy periodic jobs (factor refresh, predictor retraining); the executor is the consumer.
  • May start a local Ray head / Dask cluster for distributed backtests.
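
The Beat relationship described above can be sketched as a Celery `beat_schedule` whose entries target executor queues through `options`. Everything specific here is an assumption made for illustration: the task names, the intervals, and the choice of the factors and ml queues are placeholders, not the project's actual schedule.

```python
from datetime import timedelta

# Queue names the executor drains, per the Queue families section.
EXECUTOR_QUEUES = {"backtest", "training", "ml", "agents", "factors", "rag"}

# Shape of a Celery `beat_schedule`. Task names and intervals are
# hypothetical placeholders.
beat_schedule = {
    "factor-refresh": {
        "task": "alphaswarm.tasks.factors.refresh",        # hypothetical name
        "schedule": timedelta(hours=1),                     # placeholder interval
        "options": {"queue": "factors"},
    },
    "predictor-retraining": {
        "task": "alphaswarm.tasks.ml.retrain_predictors",  # hypothetical name
        "schedule": timedelta(days=1),                      # placeholder interval
        "options": {"queue": "ml"},                         # assumed queue
    },
}


def misrouted_entries(schedule: dict) -> list:
    """List schedule entries whose queue is not drained by the executor."""
    return [
        name
        for name, entry in schedule.items()
        if entry.get("options", {}).get("queue") not in EXECUTOR_QUEUES
    ]
```

A check like `misrouted_entries(beat_schedule)` can catch a periodic job that would be published to a queue no executor consumes.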

Operations

  • Resources: requests 1 CPU / 4Gi, limits 8 CPU / 16Gi. Prefers memory-optimized nodes via node affinity; anti-affinity keeps it off the alphaswarm-core nodes.
  • Scaling: HPA on CPU + custom Celery queue depth (KEDA ScaledObjects supersede it where KEDA is installed). Scales down slowly (900s stabilization) so a long-running backtest / train job is not evicted mid-flight.
  • Concurrency: 2 per pod (compute-bound; each task is heavy).
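  • Drain on shutdown: terminationGracePeriodSeconds: 600 gives in-flight jobs up to 10 minutes to complete; preStop sends SIGTERM to Celery, which triggers a warm shutdown (the worker stops taking new tasks and finishes the running ones).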
  • Audit: WorkloadRuntime actions are recorded as workload_runs rows; the executor pod respects the kill-switch Redis key like every other pod.
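
To illustrate why the 900s stabilization window protects long-running jobs, the following is a simplified model of how a Kubernetes HPA stabilizes scale-down: it keeps recent replica recommendations and only scales down to the highest one seen within the window. The class and method names are hypothetical, and the model assumes scale-up is applied immediately; it is a sketch of the behaviour, not the autoscaler's actual code.

```python
from collections import deque
from typing import Deque, Tuple

STABILIZATION_SECONDS = 900  # scale-down window from the Scaling bullet


class ScaleDownStabilizer:
    """Simplified model of HPA scale-down stabilization."""

    def __init__(self, window_seconds: int = STABILIZATION_SECONDS) -> None:
        self.window_seconds = window_seconds
        # (timestamp, recommended replicas), oldest first.
        self._history: Deque[Tuple[float, int]] = deque()

    def recommend(self, now: float, desired: int, current: int) -> int:
        """Return the replica count to apply at time `now` (seconds)."""
        self._history.append((now, desired))
        # Forget recommendations that have aged out of the window.
        cutoff = now - self.window_seconds
        while self._history and self._history[0][0] < cutoff:
            self._history.popleft()
        if desired >= current:
            # Assumption: scale-up has no stabilization delay.
            return desired
        # Scale down only as far as the highest recent recommendation,
        # and never above the current replica count.
        return min(current, max(r for _, r in self._history))
```

In this model, if 10 replicas were recommended at t=0 and the queue then empties, the replica count stays at 10 until that recommendation is more than 900 seconds old, so a pod running a backtest is not removed just because the queue briefly looks idle.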

See also