ADR-014: Knowledge-Base Boundary
Status: accepted (2026-05-28)
Context:
The AlphaSwarm knowledge stack started as alphaswarm/rag/ (a four-level
hierarchical RAG on Redis + pgvector) plus alphaswarm/llm/memory.py
(RedisHybridMemory) wired directly into AgentRuntime. As the platform
grew, three tensions accumulated:
- Vendor coupling. HierarchicalRAG is fast and AlphaSwarm-native, but the field has matured rapidly. Cognee (tri-store memory engine), Graphiti (bi-temporal Neo4j edges with sub-300ms p95 recall), Mem0 (user-centric personalisation), Letta (full agent runtime), and LlamaIndex (general-purpose vector backbone) all solve adjacent problems, and tenants are starting to ask for each by name.
- Multi-tenancy on cognitive memory. The existing RAG row-filter stamps workspace_id / lab_id on rows but provides no node/edge ACL, no bi-temporal invalidation, no cross-tenant marketplace, and no physical per-tenant isolation. Regulated tenants (financial advisors on HIPAA/SOX) need an explicit silo path; B2C tenants need cheap shared-schema RLS; both want a marketplace where they can subscribe to curated external corpora without giving up isolation.
- Cross-boundary contamination. RAG knowledge lived inside the monolith with no Clean-Architecture port surface. Bot specs, RL specs, agent specs, and analysis specs all reached into HierarchicalRAG.query directly, making the surface impossible to swap.
The blueprint reviewed in
.cursor/plans/alphaswarm_kb_boundary_d1617245.plan.md
and the parallel architecture report propose a Clean-Architecture
knowledge-base boundary modelled on the established
alphaswarm_rl / alphaswarm_models extraction pattern.
Decision:
Stand up two new repositories:
- alphaswarm_kb/: the boundary package with a pure domain/ core (ports + bi-temporal PermissionedDataPoint DTOs), an application/ layer (use cases + KBRuntime services), a fully pluggable infrastructure/ adapter trinity, and an extracted rag/ + memory/ slice that re-emits the legacy alphaswarm.rag.* + alphaswarm.llm.memory surface through DeprecationWarning shims.
- alphaswarm_kb_federation/: a standalone cross-silo marketplace federation reverse-proxy that brokers authorised recall via OpenFGA check + signed per-subscription share tokens + bi-temporal merge.
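A minimal sketch of how a legacy import path could keep resolving through a DeprecationWarning shim, assuming a module-level `__getattr__` (PEP 562). The helper `make_shim_getattr` and the stand-in objects `new_rag` / `legacy_rag` are hypothetical; the actual shim modules are not shown in this ADR.

```python
import types
import warnings


def make_shim_getattr(legacy_name: str, target: object, target_name: str):
    """Build a module-level __getattr__ (PEP 562) that forwards to `target`.

    Hypothetical helper: the real shim layout is an assumption.
    """

    def __getattr__(attr: str):
        try:
            value = getattr(target, attr)
        except AttributeError:
            raise AttributeError(
                f"module {legacy_name!r} has no attribute {attr!r}"
            ) from None
        # Warn on every forwarded lookup; stacklevel=2 points at the caller.
        warnings.warn(
            f"{legacy_name}.{attr} is deprecated; "
            f"import it from {target_name} instead",
            DeprecationWarning,
            stacklevel=2,
        )
        return value

    return __getattr__


# Stand-ins so the sketch runs on its own. In the real package the target
# would be the alphaswarm_kb.rag module, and the legacy package would assign
# the returned function to its own module-level __getattr__.
new_rag = types.SimpleNamespace(HierarchicalRAG=type("HierarchicalRAG", (), {}))
legacy_rag = types.ModuleType("alphaswarm.rag")
legacy_rag.__getattr__ = make_shim_getattr(
    "alphaswarm.rag", new_rag, "alphaswarm_kb.rag"
)
```

With this shape, `legacy_rag.HierarchicalRAG` returns the same class object as the new location and emits one DeprecationWarning per lookup, so identity checks and `isinstance` keep working across both import paths.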
The package introduces:
- Hash-locked KBCorpusSpec + KBRuntime (rules 56-57) mirroring the existing RLExperimentSpec / BotSpec / AnalysisSpec pattern. Every remember / recall / improve / forget lands a kb_runs row + snapshots the spec via persist_spec. Alembic migration 0088_alphaswarm_kb_specs.py creates the nine backing tables.
- KBAdapterMeta metaclass (rule 58) for every concrete IMemoryEngine, BaseVectorStore, BaseGraphStore, BaseRelationalStore, IACLEvaluator, IPolicyEngine, and IIdentityProvider. Each subclass sets kb_kind + kb_alias and is auto-registered.
- Bi-temporal PermissionedDataPoint combining Graphiti's four-timestamp model (valid_from / valid_to / created_at / expired_at) with Cognee's provenance envelope (Provenance.dataset_id + Provenance.data_id + Provenance.extractor_chain).
- Four-scope KBLayerComposer (private > hierarchical > marketplace > global) with precedence-aware bi-temporal merge.
- Hybrid OpenFGA + OPA + Cedar policy stack per the blueprint Section D. DefaultPermissionResolver fuses IACLEvaluator.list_objects (visible IDs) with IPolicyEngine.partial_evaluate (residual Cypher/SQL fragment) into a per-request AccessBitmap cached by (tenant, principal, action, anchor_hash) for 60s.
- KBSiloTenancyStrategy (5th strategy alongside RLS / schema-per-tenant / db-per-enterprise / hybrid). Routes KB tables to a per-tenant Postgres + Qdrant + Neo4j stack provisioned via Terragrunt units under alphaswarm_platform/terragrunt/tenants/.
- Agent-facing surface through data.kb.* DataMCP tools (rule 59 extends rule 22) and data.kb.compose_recall for the layered surface. Cross-silo recall goes through alphaswarm_kb_federation only (rule 60).
- Controller integration: KBSiloService + /manage/kb/silos/* routes on alphaswarm_controller (Phase M). Lifecycle actions land as WorkloadRun rows with WorkloadAction.KB_SILO_{PROVISION,DESTROY,HALT,SCALE}.
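The auto-registration behaviour described for KBAdapterMeta (rule 58) can be sketched as a metaclass that records every concrete subclass under its (kb_kind, kb_alias) pair. This is an illustration under assumptions, not the shipped implementation: the registry layout, the `resolve_adapter` helper, the `InMemoryEngine` adapter, and the reduced `IMemoryEngine` port with a single `recall` method are all hypothetical.

```python
from abc import ABCMeta, abstractmethod


class KBAdapterMeta(ABCMeta):
    """Registers each concrete adapter class under (kb_kind, kb_alias)."""

    registry: dict = {}

    def __init__(cls, name, bases, namespace):
        super().__init__(name, bases, namespace)
        # Ports and abstract bases still have abstract methods: skip them.
        if getattr(cls, "__abstractmethods__", None):
            return
        kind = getattr(cls, "kb_kind", None)
        alias = getattr(cls, "kb_alias", None)
        if not kind or not alias:
            raise TypeError(f"{name} must set kb_kind and kb_alias")
        key = (kind, alias)
        if key in KBAdapterMeta.registry:
            raise TypeError(f"duplicate adapter registration for {key}")
        KBAdapterMeta.registry[key] = cls


class IMemoryEngine(metaclass=KBAdapterMeta):
    """Reduced stand-in for the real port (assumed shape)."""

    kb_kind = "memory_engine"

    @abstractmethod
    def recall(self, query: str) -> list:
        ...


class InMemoryEngine(IMemoryEngine):
    """Hypothetical adapter used only to exercise the registry."""

    kb_alias = "in_memory"

    def __init__(self):
        self._items = []

    def remember(self, text: str) -> None:
        self._items.append(text)

    def recall(self, query: str) -> list:
        # Naive case-insensitive substring match.
        return [t for t in self._items if query.lower() in t.lower()]


def resolve_adapter(kind: str, alias: str) -> type:
    """Look up an adapter class by the keys a spec would carry."""
    try:
        return KBAdapterMeta.registry[(kind, alias)]
    except KeyError:
        raise LookupError(f"no adapter registered for {(kind, alias)}") from None
```

Registering at class-creation time means a spec only needs to carry the alias string; importing the adapter module is what makes the alias resolvable, which fits the optional-extras install model.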
Consequences:
- The legacy alphaswarm.rag.* + alphaswarm.llm.memory import paths keep working through DeprecationWarning shims for one release cycle. New code imports from alphaswarm_kb.rag.* + alphaswarm_kb.memory.* directly.
- Cognee / Graphiti / Mem0 / Letta / LlamaIndex live behind pyproject.toml extras; the base install stays light. A tenant who wants Cognee runs pip install alphaswarm-kb[cognee] and sets KBCorpusSpec.memory_engine.kb_alias = "cognee".
- The federation gateway is the only cross-silo write/read path outside the monolith. New tenant marketplaces, parent-org sharing, and global-corpus replication all funnel through it.
- Terragrunt units replace the legacy Terraform workspaces pattern: each tenant has its own state file under tenants/<tenant_id>/prod/terragrunt.hcl. The tenant_kb_silo wrapper dispatches to one of three cloud-parallel siblings (tenant_kb_silo_aws / azure / gcp), which all expose identical outputs so Python adapters never branch on cloud.
- Bi-temporal data is now first-class. Contradicted edges close valid_to instead of being deleted; as_of queries reconstruct historical state.
- Step-up MFA gates the destructive operations (/kb/forget, /kb/halt, /manage/kb/silos/* mutations, subscription create/revoke) per rule 52.
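To make the bi-temporal consequence concrete, here is a small sketch of closing valid_to on contradiction and answering an as_of query. The `Edge` dataclass, `contradict`, and `as_of` are hypothetical names, and the sketch filters on valid time only; a fuller version would also filter on created_at / expired_at.

```python
from __future__ import annotations

from dataclasses import dataclass, replace
from datetime import datetime, timezone


@dataclass(frozen=True)
class Edge:
    """Hypothetical, simplified edge carrying the four timestamps."""

    subject: str
    predicate: str
    obj: str
    valid_from: datetime
    created_at: datetime
    valid_to: datetime | None = None    # None = still valid
    expired_at: datetime | None = None  # None = record still current


def contradict(edges: list[Edge], old: Edge, new: Edge) -> list[Edge]:
    """Close `old` at the moment `new` becomes valid; nothing is deleted."""
    closed = replace(old, valid_to=new.valid_from)
    return [closed if e is old else e for e in edges] + [new]


def as_of(edges: list[Edge], when: datetime) -> list[Edge]:
    """Return the edges whose validity window contains `when`."""
    return [
        e
        for e in edges
        if e.valid_from <= when and (e.valid_to is None or when < e.valid_to)
    ]


utc = timezone.utc
t1 = datetime(2024, 1, 1, tzinfo=utc)
t2 = datetime(2025, 1, 1, tzinfo=utc)
first = Edge("alice", "works_at", "AcmeCo", valid_from=t1, created_at=t1)
second = Edge("alice", "works_at", "Globex", valid_from=t2, created_at=t2)
history = contradict([first], first, second)
```

Here `as_of(history, datetime(2024, 6, 1, tzinfo=utc))` yields the AcmeCo edge, while a mid-2025 query yields the Globex edge, because the superseded edge was closed and kept in the history.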
Hard rule alignment:
| Rule | Compliance |
|---|---|
| 2 (router_complete) | Every adapter that does LLM extraction (Graduated pipeline tier 3, Cognee, Mem0) routes through router_complete. |
| 3 (iceberg_catalog.append_arrow) | Gold-tier KB writes (alphaswarm_gold_kb_* namespaces) go through the canonical helper; KBRuntime never touches PyIceberg. |
| 4 (_progress.emit) | All kb_tasks.py wrappers use emit / emit_done / emit_error. WebSocket /kb/.../recall/stream preserves {task_id, stage, message, timestamp, **extras}. |
| 6 (immutable migrations) | 0088_alphaswarm_kb_specs.py is immutable post-merge. |
| 22 (DataMCP boundary) | Agents read KB only through data.kb.* tools (extended by rule 59). |
| 26 (CredentialResolver) | OpenFGA token, NATS DSN, Postgres DSN, federation share-token signing key all resolve through CredentialResolver. |
| 27 (IdentityProvider) | IIdentityProvider is a thin bridge to alphaswarm_core.auth.providers. |
| 34 (experiment_id/test_id) | kb_runs carries both FKs; KBRunRequest propagates them via RequestContext. |
| 42 (TerraformRuntime) | KBSiloService invokes TerraformRuntime; the controller never shells out to terraform. |
| 45 (WorkloadRuntime) | New WorkloadAction enum members KB_SILO_{PROVISION,DESTROY,HALT,SCALE}. |
| 51 (TenancyStrategy) | KBSiloTenancyStrategy registers via TenancyStrategyMeta. |
| 52 (step-up MFA) | All destructive /kb/* + /manage/kb/* routes gate with require_step_up(). |
| 56-60 | New hard rules added in the same PR; described in AGENTS.md. |
Trade-offs:
- Two new repositories to maintain. Mitigated by mirroring the established alphaswarm_rl / alphaswarm_models boundary pattern and shipping CI guards that prevent cross-boundary imports.
- OpenFGA + OPA + NATS introduce three new infrastructure dependencies. Mitigated by shipping both Docker Compose (local) and Kubernetes (prod) manifests; each is a single Helm release with ExternalSecrets wiring.
- Bi-temporal data complicates schema migrations. Mitigated by making valid_to / expired_at optional (None = "still valid") so existing rows migrate without a backfill.
- Terragrunt unit-per-tenant scales linearly in state-file count. Mitigated by bounded-parallelism run-all automation under alphaswarm_platform/terragrunt/ plus per-tenant cloud-account isolation for regulated tenants.
- Multiple memory engines coexisting complicates the operator's mental model. Mitigated by data.kb.health exposing per-corpus engine info + the Vite /knowledge-base/silos route surfacing topology + spec hash per corpus.
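The per-corpus spec hash mentioned above could be derived along these lines: serialise the spec to canonical JSON and take a SHA-256 digest. This is a sketch under assumptions; the field names on this reduced `KBCorpusSpec` and the `spec_hash` method are hypothetical, and the ADR does not state which hash function or serialisation the real spec uses.

```python
import hashlib
import json
from dataclasses import asdict, dataclass, field


@dataclass(frozen=True)
class KBCorpusSpec:
    """Hypothetical minimal spec; the real fields are not listed here."""

    corpus_id: str
    memory_engine_alias: str
    vector_store_alias: str
    options: dict = field(default_factory=dict)

    def spec_hash(self) -> str:
        # Sorted keys and fixed separators make the JSON canonical, so the
        # digest depends only on the spec's content, not on dict ordering.
        canonical = json.dumps(
            asdict(self), sort_keys=True, separators=(",", ":")
        )
        return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


spec_a = KBCorpusSpec(
    "corpus-1", "cognee", "qdrant", {"top_k": 8, "rerank": True}
)
spec_b = KBCorpusSpec(
    "corpus-1", "cognee", "qdrant", {"rerank": True, "top_k": 8}
)
spec_c = KBCorpusSpec(
    "corpus-1", "graphiti", "qdrant", {"top_k": 8, "rerank": True}
)
```

In this sketch `spec_a` and `spec_b` hash identically despite different option ordering, while `spec_c` differs because the engine alias changed, which is the property an operator needs when comparing the hash shown per corpus against a stored run.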
Out of scope (Phase 6+):
- Cedar formal-verification harness (cedar-analysis).
- SpiceDB / Permify adapter implementations beyond stubs.
- Multi-region active-active federation (vs the AWS-first → Azure → GCP staged rollout).
- Tenant-configurable bi-temporal merge strategies (default: last-writer-wins per validity window + precedence tiebreaker).
- Per-tenant bridge tier (shared compute / siloed databases) for SMB pricing.
- Cognee improve / forget scheduling automation (manual triggers only in v1).
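For reference, one possible reading of the default merge named in the list above (last-writer-wins per validity window, with scope precedence as the tiebreaker) is sketched here. The interpretation is an assumption: the `Fact` shape, grouping by (key, valid_from, valid_to), and the `merge` / `_beats` helpers are hypothetical, and only the private > hierarchical > marketplace > global ordering comes from this ADR.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Optional

# Lower rank wins ties: private > hierarchical > marketplace > global.
SCOPE_RANK = {"private": 0, "hierarchical": 1, "marketplace": 2, "global": 3}


@dataclass(frozen=True)
class Fact:
    """Hypothetical record returned by one layer of the composer."""

    key: str
    value: str
    scope: str
    valid_from: datetime
    valid_to: Optional[datetime]
    created_at: datetime


def _beats(challenger: Fact, incumbent: Fact) -> bool:
    """Later write wins; on equal write time, the higher-precedence scope wins."""
    if challenger.created_at != incumbent.created_at:
        return challenger.created_at > incumbent.created_at
    return SCOPE_RANK[challenger.scope] < SCOPE_RANK[incumbent.scope]


def merge(facts: list) -> list:
    """Keep one winner per (key, validity window)."""
    winners = {}
    for fact in facts:
        window = (fact.key, fact.valid_from, fact.valid_to)
        current = winners.get(window)
        if current is None or _beats(fact, current):
            winners[window] = fact
    return sorted(winners.values(), key=lambda f: (f.key, f.valid_from))


utc = timezone.utc
start = datetime(2025, 1, 1, tzinfo=utc)
early = datetime(2025, 1, 2, tzinfo=utc)
late = datetime(2025, 1, 3, tzinfo=utc)
candidates = [
    Fact("rate", "4.0%", "global", start, None, early),
    Fact("rate", "4.5%", "marketplace", start, None, late),   # later write
    Fact("limit", "10k", "global", start, None, early),
    Fact("limit", "25k", "private", start, None, early),      # tie on time
]
merged = merge(candidates)
```

Under these assumptions the later marketplace write wins for `rate`, and the private value wins for `limit` because both writes share a timestamp and precedence breaks the tie.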