Unified Entity Registry
The unified entity registry sits on top of the existing Issuer / Sector / Industry graph and widens it to cover every entity AlphaSwarm cares about: companies, drugs, products, patents, persons, locations, securities, regulators, and free-form "concept" rows. Extractors populate the rows from datasets; LLM enrichers add descriptions, relations, dedup proposals, and tags without ever mutating the source data.
Tables
| Table | File |
|---|---|
| entities | alphaswarm/persistence/models_entity_registry.py |
| entity_identifiers | (same) |
| entity_relations | (same) |
| entity_annotations | (same) |
| entity_dataset_links | (same) |
Migration: alembic/versions/0013_data_engine_expansion.py.
Components
| Module | What it does |
|---|---|
| alphaswarm/data/entities/registry.py | EntityRegistry facade + upsert_entity / link_identifier / add_relation / attach_to_dataset / search / neighbors / add_annotation. |
| alphaswarm/data/entities/extractors/ | Per-dataset extractors (regulatory, filings, news, instruments, finance_database). Each yields EntityCandidate dataclasses. |
| alphaswarm/data/entities/enrichers/ | LLM enrichers (description, relation, dedup, tagging). All route through router_complete per AGENTS.md hard rule #2. |
| alphaswarm/tasks/entity_tasks.py | Celery wrappers (extract_entities, enrich_entity, dedup_entities). |
| alphaswarm/api/routes/entity_registry.py | REST surface at /registry/entities. |
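The extractor-to-registry flow described in the table can be sketched roughly as follows. This is a minimal, self-contained illustration and not the project's code: the EntityCandidate fields, the extract_instruments function, and the InMemoryRegistry class (including its dedup key and the upsert_entity signature) are all assumptions made for the example.

```python
from dataclasses import dataclass
from typing import Iterable, Iterator


@dataclass(frozen=True)
class EntityCandidate:
    # Hypothetical fields; the real dataclass may carry different ones.
    kind: str
    name: str
    source_dataset: str
    identifiers: tuple = ()  # pairs of (scheme, value)


def extract_instruments(rows: Iterable[dict]) -> Iterator[EntityCandidate]:
    """Toy extractor: yield one candidate per row that has a usable name."""
    for row in rows:
        name = (row.get("name") or "").strip()
        if not name:
            continue  # skip rows without a name instead of inventing one
        ids = (("ticker", row["ticker"]),) if row.get("ticker") else ()
        yield EntityCandidate("security", name, "instruments", ids)


class InMemoryRegistry:
    """Stand-in for the EntityRegistry facade (assumed behavior).

    Entities are keyed on (kind, lowercased name) so repeated upserts merge.
    """

    def __init__(self) -> None:
        self.entities: dict = {}

    def upsert_entity(self, candidate: EntityCandidate) -> dict:
        key = (candidate.kind, candidate.name.lower())
        entity = self.entities.setdefault(
            key,
            {
                "kind": candidate.kind,
                "name": candidate.name,
                "identifiers": set(),
                "datasets": set(),
            },
        )
        entity["identifiers"].update(candidate.identifiers)
        entity["datasets"].add(candidate.source_dataset)
        return entity


rows = [
    {"name": "Acme Corp", "ticker": "ACME"},
    {"name": "  "},  # dropped: blank name
    {"name": "acme corp", "ticker": "ACME"},  # merges with the first row
]
registry = InMemoryRegistry()
for candidate in extract_instruments(rows):
    registry.upsert_entity(candidate)
```

The point of the shape is that extractors only yield candidates and never write rows themselves; all writes go through a single facade method.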
REST surface
| Path | Description |
|---|---|
| GET /registry/entities | List entities (filter by kind, source_dataset, canonical_only). |
| POST /registry/entities | Create or update an entity. |
| GET /registry/entities/search?q= | Text search. |
| GET /registry/entities/{id} | Detail (identifiers + annotations). |
| GET /registry/entities/{id}/neighbors | Outgoing + incoming relations. |
| GET /registry/entities/{id}/datasets | Linked datasets. |
| POST /registry/entities/{id}/identifiers | Add an alias. |
| POST /registry/entities/{id}/relations | Add a typed edge. |
| POST /registry/entities/{id}/annotations | Attach a description / tag / note. |
| POST /registry/entities/extract | Queue a Celery extract task. |
| POST /registry/entities/enrich | Queue Celery enrichment tasks. |
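As an illustration of the read-only endpoints, the helpers below only build URLs with the standard library and send nothing. The base URL, the helper names, and the "true"/"false" encoding of canonical_only are assumptions; the paths and filter names come from the table above. Request bodies for the POST routes are not documented here, so they are left out.

```python
from typing import Optional
from urllib.parse import quote, urlencode

BASE_URL = "http://localhost:8000"  # assumption: adjust to the real deployment


def list_entities_url(
    kind: Optional[str] = None,
    source_dataset: Optional[str] = None,
    canonical_only: Optional[bool] = None,
) -> str:
    """Build the GET /registry/entities URL, including only the filters given."""
    params = {}
    if kind is not None:
        params["kind"] = kind
    if source_dataset is not None:
        params["source_dataset"] = source_dataset
    if canonical_only is not None:
        # Assumed boolean encoding; the API may expect something else.
        params["canonical_only"] = "true" if canonical_only else "false"
    query = urlencode(params)
    return f"{BASE_URL}/registry/entities" + (f"?{query}" if query else "")


def search_url(q: str) -> str:
    """Build the GET /registry/entities/search URL with an encoded query."""
    return f"{BASE_URL}/registry/entities/search?{urlencode({'q': q})}"


def neighbors_url(entity_id) -> str:
    """Build the GET /registry/entities/{id}/neighbors URL."""
    return f"{BASE_URL}/registry/entities/{quote(str(entity_id), safe='')}/neighbors"
```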
LLM enrichment
LLM enrichers are gated on ALPHASWARM_ENTITY_LLM_ENRICHMENT_ENABLED=true
to avoid surprise spend. When disabled, enrich_one returns None
and the Celery task records a skipped count instead of calling the
router.
When enabled, the enricher uses alphaswarm.llm.providers.router.router_complete
exclusively, never litellm.completion or OllamaClient.generate.
Output is parsed as strict JSON; malformed blobs are dropped.
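The gating and parsing behavior described above might look something like the sketch below. It is not the actual enricher: the enrich_one signature, the prompt text, and the injected complete callable (standing in for router_complete, whose real signature is not shown in this document) are assumptions. "Strict" is interpreted here as "the whole response must be a single JSON object", which is also an assumption.

```python
import json
import os
from typing import Callable, Optional

FLAG = "ALPHASWARM_ENTITY_LLM_ENRICHMENT_ENABLED"


def enrichment_enabled() -> bool:
    """Only the literal value 'true' (case-insensitive) enables enrichment."""
    return os.environ.get(FLAG, "").strip().lower() == "true"


def parse_strict_json(raw: str) -> Optional[dict]:
    """Return the parsed object, or None if the blob is malformed."""
    try:
        parsed = json.loads(raw)
    except (TypeError, ValueError):  # JSONDecodeError subclasses ValueError
        return None
    return parsed if isinstance(parsed, dict) else None


def enrich_one(entity_name: str, complete: Callable[[str], str]) -> Optional[dict]:
    """Hypothetical enricher: returns None without calling the router when gated off."""
    if not enrichment_enabled():
        return None
    raw = complete(f"Describe the entity {entity_name!r} as a JSON object.")
    return parse_strict_json(raw)


calls = []


def fake_router(prompt: str) -> str:
    # Test double for the router; records each prompt it receives.
    calls.append(prompt)
    return '{"description": "A toy company."}'


os.environ.pop(FLAG, None)
skipped = enrich_one("Acme Corp", fake_router)  # flag unset: router not called

os.environ[FLAG] = "true"
enriched = enrich_one("Acme Corp", fake_router)
dropped = enrich_one("Acme Corp", lambda prompt: "not json")  # malformed output
```

Checking the flag before building the prompt keeps the disabled path free of any router call, which is what makes a skipped count cheap to record.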
Don'ts
- Don't extract entities by querying Postgres directly from a Celery task. Either pass an inline rows payload or read the Iceberg table via alphaswarm.data.iceberg_catalog.read_arrow (the standard path used by extract_entities).
- Don't bypass EntityRegistry to write rows. Extractors should always go through registry.upsert(...).
- Don't replace add_annotation with raw SQL inserts. The EntityAnnotation row is also surfaced in /registry/entities/{id}, the entity browser UI, and (eventually) DataHub glossary terms.
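
The first rule above, inline rows or the Iceberg reader and nothing else, can be sketched as a small row-resolution helper. The resolve_rows name, its parameters, and the injected read_table callable are hypothetical; in the real task the reader would presumably be alphaswarm.data.iceberg_catalog.read_arrow, whose signature is not shown in this document.

```python
from typing import Callable, Iterable, List, Optional


def resolve_rows(
    rows: Optional[List[dict]],
    table: Optional[str],
    read_table: Callable[[str], Iterable[dict]],
) -> List[dict]:
    """Pick the row source for an extract task without touching Postgres.

    Inline rows win when provided; otherwise the named table is read
    through the injected reader.
    """
    if rows is not None:
        return list(rows)
    if table is None:
        raise ValueError("provide either inline rows or a table name")
    return list(read_table(table))


def fake_reader(table: str) -> Iterable[dict]:
    # Stand-in for the Iceberg reader; returns canned rows for the example.
    return [{"name": "Acme Corp", "table": table}]


inline = resolve_rows([{"name": "Inline Inc"}], None, fake_reader)
from_table = resolve_rows(None, "filings", fake_reader)
```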