ADR-014: Knowledge-Base Boundary

Status: accepted (2026-05-28)

Context:

The AlphaSwarm knowledge stack started as alphaswarm/rag/ (a four-level hierarchical RAG on Redis + pgvector) plus alphaswarm/llm/memory.py (RedisHybridMemory) wired directly into AgentRuntime. As the platform grew, three tensions accumulated:

  1. Vendor coupling. HierarchicalRAG is fast and AlphaSwarm-native, but the field has matured rapidly. Cognee (tri-store memory engine), Graphiti (bi-temporal Neo4j edges with sub-300ms p95 recall), Mem0 (user-centric personalisation), Letta (full agent runtime), and LlamaIndex (general-purpose vector backbone) all solve adjacent problems, and tenants are starting to ask for each by name.
  2. Multi-tenancy on cognitive memory. The existing RAG row-filter stamps workspace_id/lab_id on rows but provides no node/edge ACL, no bi-temporal invalidation, no cross-tenant marketplace, and no physical per-tenant isolation. Regulated tenants (financial advisors on HIPAA/SOX) need an explicit silo path; B2C tenants need cheap shared-schema RLS; both want a marketplace where they can subscribe to curated external corpora without giving up isolation.
  3. Cross-boundary contamination. RAG knowledge lived inside the monolith with no Clean-Architecture port surface. Bot specs, RL specs, agent specs, and analysis specs all reached into HierarchicalRAG.query directly, making the surface impossible to swap.

The blueprint reviewed in .cursor/plans/alphaswarm_kb_boundary_d1617245.plan.md and the parallel architecture report propose a Clean-Architecture knowledge-base boundary modelled on the established alphaswarm_rl / alphaswarm_models extraction pattern.

Decision:

Stand up two new repositories:

  • alphaswarm_kb/: the boundary package with a pure domain/ core (ports + bi-temporal PermissionedDataPoint + DTOs), an application/ layer (use cases + KBRuntime services), a fully pluggable infrastructure/ adapter trinity, and an extracted rag/ + memory/ slice that re-emits the legacy alphaswarm.rag.* + alphaswarm.llm.memory surface through DeprecationWarning shims.
  • alphaswarm_kb_federation/: a standalone cross-silo marketplace federation reverse-proxy that brokers authorised recall via an OpenFGA check + signed per-subscription share tokens + bi-temporal merge.
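
As an illustration of the DeprecationWarning shims mentioned above, the following is a minimal sketch that uses a module-level `__getattr__` (PEP 562) to forward legacy attribute access to a new module. The module names `alphaswarm_rag_demo` and `alphaswarm_kb_rag_demo` and the helper `make_shim` are hypothetical stand-ins; the ADR does not say how the real shims are built.

```python
import types
import warnings

# Hypothetical stand-in for the new home of a legacy symbol.
new_module = types.ModuleType("alphaswarm_kb_rag_demo")
new_module.HierarchicalRAG = type("HierarchicalRAG", (), {})


def make_shim(legacy_name: str, target: types.ModuleType) -> types.ModuleType:
    """Build a legacy module that forwards attribute access to `target`."""
    shim = types.ModuleType(legacy_name)

    def __getattr__(name: str):
        # PEP 562: invoked only when normal module attribute lookup fails.
        try:
            value = getattr(target, name)
        except AttributeError:
            raise AttributeError(
                f"module {legacy_name!r} has no attribute {name!r}"
            ) from None
        warnings.warn(
            f"{legacy_name}.{name} is deprecated; "
            f"import it from {target.__name__} instead",
            DeprecationWarning,
            stacklevel=2,
        )
        return value

    shim.__getattr__ = __getattr__
    return shim


# Legacy callers keep working, but each access emits a DeprecationWarning.
legacy_rag = make_shim("alphaswarm_rag_demo", new_module)
```

In a real package the same `__getattr__` function would sit at the top level of the legacy module file instead of being attached to a synthetic module object.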

The package introduces:

  1. Hash-locked KBCorpusSpec + KBRuntime (rules 56-57) mirroring the existing RLExperimentSpec / BotSpec / AnalysisSpec pattern. Every remember / recall / improve / forget call writes a kb_runs row and snapshots the spec via persist_spec. Alembic migration 0088_alphaswarm_kb_specs.py creates the nine backing tables.
  2. KBAdapterMeta metaclass (rule 58) for every concrete IMemoryEngine, BaseVectorStore, BaseGraphStore, BaseRelationalStore, IACLEvaluator, IPolicyEngine, and IIdentityProvider. Each subclass sets kb_kind + kb_alias and is auto-registered.
  3. Bi-temporal PermissionedDataPoint combining Graphiti's four-timestamp model (valid_from/valid_to/created_at/expired_at) with Cognee's provenance envelope (Provenance.dataset_id + Provenance.data_id + Provenance.extractor_chain).
  4. Four-scope KBLayerComposer (private > hierarchical > marketplace > global) with precedence-aware bi-temporal merge.
  5. Hybrid OpenFGA + OPA + Cedar policy stack per Section D of the blueprint. DefaultPermissionResolver fuses IACLEvaluator.list_objects (visible IDs) with IPolicyEngine.partial_evaluate (residual Cypher/SQL fragment) into a per-request AccessBitmap, cached for 60 seconds under the key (tenant, principal, action, anchor_hash).
  6. KBSiloTenancyStrategy (5th strategy alongside RLS / schema-per-tenant / db-per-enterprise / hybrid). Routes KB tables to a per-tenant Postgres + Qdrant + Neo4j stack provisioned via Terragrunt units under alphaswarm_platform/terragrunt/tenants/.
  7. Agent-facing surface through data.kb.* DataMCP tools (rule 59 extends rule 22) and data.kb.compose_recall for the layered surface. Cross-silo recall goes through alphaswarm_kb_federation only (rule 60).
  8. Controller integration: KBSiloService + /manage/kb/silos/* routes on alphaswarm_controller (Phase M). Lifecycle actions land as WorkloadRun rows with WorkloadAction.KB_SILO_{PROVISION,DESTROY,HALT,SCALE}.
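
The auto-registration described in item 2 can be sketched with a small metaclass. This is a minimal illustration, not the actual KBAdapterMeta implementation: the registry layout, the `resolve` helper, the `search` method, and the `InMemoryVectorStore` adapter are assumptions made for the example; only `kb_kind`, `kb_alias`, and the class names `KBAdapterMeta` and `BaseVectorStore` come from the text above.

```python
from abc import ABCMeta, abstractmethod
from typing import Dict, List, Tuple


class KBAdapterMeta(ABCMeta):
    """Registers each subclass that declares its own kb_alias."""

    # Assumed registry shape: (kb_kind, kb_alias) -> adapter class.
    registry: Dict[Tuple[str, str], type] = {}

    def __new__(mcls, name, bases, namespace, **kwargs):
        cls = super().__new__(mcls, name, bases, namespace, **kwargs)
        kind = getattr(cls, "kb_kind", None)   # may be inherited from a base
        alias = namespace.get("kb_alias")      # must be set on the class itself
        if kind and alias:
            key = (kind, alias)
            if key in mcls.registry:
                raise ValueError(f"duplicate KB adapter registration: {key}")
            mcls.registry[key] = cls
        return cls


class BaseVectorStore(metaclass=KBAdapterMeta):
    """Abstract base: sets kb_kind but no kb_alias, so it is not registered."""

    kb_kind = "vector_store"

    @abstractmethod
    def search(self, query: str) -> List[str]:
        ...


class InMemoryVectorStore(BaseVectorStore):
    """Hypothetical concrete adapter used only for this example."""

    kb_alias = "in_memory"

    def __init__(self) -> None:
        self.docs: List[str] = []

    def search(self, query: str) -> List[str]:
        return [d for d in self.docs if query in d]


def resolve(kind: str, alias: str) -> type:
    """Look up an adapter class by kind and alias (hypothetical helper)."""
    return KBAdapterMeta.registry[(kind, alias)]
```

With this shape, a spec that names an alias, such as `KBCorpusSpec.memory_engine.kb_alias`, can be resolved to a class without the caller importing the adapter module by name, provided the module defining the adapter has been imported somewhere.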

Consequences:

  • The legacy alphaswarm.rag.* + alphaswarm.llm.memory import paths keep working through DeprecationWarning shims for one release cycle. New code imports from alphaswarm_kb.rag.* + alphaswarm_kb.memory.* directly.
  • Cognee / Graphiti / Mem0 / Letta / LlamaIndex live behind pyproject.toml extras; the base install stays light. A tenant who wants Cognee runs pip install alphaswarm-kb[cognee] and sets KBCorpusSpec.memory_engine.kb_alias = "cognee".
  • The federation gateway is the only cross-silo write/read path outside the monolith. New tenant marketplaces, parent-org sharing, and global-corpus replication all funnel through it.
  • Terragrunt units replace the legacy Terraform workspaces pattern — each tenant has its own state file under tenants/<tenant_id>/prod/terragrunt.hcl. The tenant_kb_silo wrapper dispatches to one of three cloud-parallel siblings (tenant_kb_silo_aws/azure/gcp) which all expose identical outputs so Python adapters never branch on cloud.
  • Bi-temporal data is now first-class. Contradicted edges close valid_to instead of being deleted; as_of queries reconstruct historical state.
  • Step-up MFA gates the destructive operations (/kb/forget, /kb/halt, /manage/kb/silos/* mutations, subscription create/revoke) per rule 52.
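
To make the bi-temporal consequence concrete, here is a minimal sketch of closing `valid_to` on a contradicted edge and answering an `as_of` query. It models only the validity axis (`valid_from` / `valid_to`) and leaves out `created_at` / `expired_at`. The `Edge` class and the `supersede` and `as_of` helpers are illustrative assumptions, not the PermissionedDataPoint API.

```python
from dataclasses import dataclass, replace
from datetime import datetime, timezone
from typing import List, Optional


@dataclass(frozen=True)
class Edge:
    subject: str
    predicate: str
    obj: str
    valid_from: datetime
    valid_to: Optional[datetime] = None  # None means "still valid"


def supersede(edges: List[Edge], new_edge: Edge) -> List[Edge]:
    """Add new_edge and close, rather than delete, any open edge it contradicts."""
    result = []
    for edge in edges:
        same_slot = (edge.subject, edge.predicate) == (
            new_edge.subject,
            new_edge.predicate,
        )
        if same_slot and edge.valid_to is None:
            edge = replace(edge, valid_to=new_edge.valid_from)
        result.append(edge)
    result.append(new_edge)
    return result


def as_of(edges: List[Edge], when: datetime) -> List[Edge]:
    """Return the edges that were valid at `when` (half-open interval)."""
    return [
        e
        for e in edges
        if e.valid_from <= when and (e.valid_to is None or when < e.valid_to)
    ]


def day(d: int) -> datetime:
    return datetime(2026, 1, d, tzinfo=timezone.utc)


history = supersede([], Edge("acme", "ceo", "alice", day(1)))
history = supersede(history, Edge("acme", "ceo", "bob", day(10)))
```

Both edges remain in `history`; querying `as_of(history, day(5))` selects the earlier edge, while a query at a later date selects its replacement.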

Hard rule alignment:

| Rule | Compliance |
|------|------------|
| 2 (router_complete) | Every adapter that does LLM extraction (Graduated pipeline tier 3, Cognee, Mem0) routes through router_complete. |
| 3 (iceberg_catalog.append_arrow) | Gold-tier KB writes (alphaswarm_gold_kb_* namespaces) go through the canonical helper; KBRuntime never touches PyIceberg. |
| 4 (_progress.emit) | All kb_tasks.py wrappers use emit / emit_done / emit_error. WebSocket /kb/.../recall/stream preserves {task_id, stage, message, timestamp, **extras}. |
| 6 (immutable migrations) | 0088_alphaswarm_kb_specs.py is immutable post-merge. |
| 22 (DataMCP boundary) | Agents read KB only through data.kb.* tools (extended by rule 59). |
| 26 (CredentialResolver) | OpenFGA token, NATS DSN, Postgres DSN, federation share-token signing key all resolve through CredentialResolver. |
| 27 (IdentityProvider) | IIdentityProvider is a thin bridge to alphaswarm_core.auth.providers. |
| 34 (experiment_id/test_id) | kb_runs carries both FKs; KBRunRequest propagates them via RequestContext. |
| 42 (TerraformRuntime) | KBSiloService invokes TerraformRuntime; the controller never shells out to terraform. |
| 45 (WorkloadRuntime) | New WorkloadAction enum members KB_SILO_{PROVISION,DESTROY,HALT,SCALE}. |
| 51 (TenancyStrategy) | KBSiloTenancyStrategy registers via TenancyStrategyMeta. |
| 52 (step-up MFA) | All destructive /kb/* + /manage/kb/* routes gate with require_step_up(). |
| 56-60 | New hard rules added in the same PR; described in AGENTS.md. |

Trade-offs:

  1. Two new repositories to maintain. Mitigated by mirroring the established alphaswarm_rl / alphaswarm_models boundary pattern and shipping CI guards that prevent cross-boundary imports.
  2. OpenFGA + OPA + NATS introduce three new infrastructure dependencies. Mitigated by shipping both Docker Compose (local) and Kubernetes (prod) manifests; each is a single Helm release with ExternalSecrets wiring.
  3. Bi-temporal data complicates schema migrations. Mitigated by making valid_to/expired_at optional (None = "still valid") so existing rows migrate without a backfill.
  4. Terragrunt unit-per-tenant scales linearly in state-file count. Mitigated by bounded-parallelism run-all automation under alphaswarm_platform/terragrunt/ plus per-tenant cloud-account isolation for regulated tenants.
  5. Multiple memory engines coexisting complicates the operator's mental model. Mitigated by data.kb.health exposing per-corpus engine info + the Vite /knowledge-base/silos route surfacing topology + spec hash per corpus.
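
The per-corpus spec hash mentioned in the last trade-off could be derived along the following lines. This is a sketch under stated assumptions: the ADR does not specify the hashing scheme, so SHA-256 over canonical JSON is assumed here, and every field in the example spec except `memory_engine.kb_alias` is hypothetical.

```python
import hashlib
import json


def spec_hash(spec: dict) -> str:
    """Return a stable hex digest for a JSON-serialisable spec.

    Keys are sorted and whitespace is stripped so that two specs with the
    same content hash identically regardless of key order.
    """
    canonical = json.dumps(
        spec, sort_keys=True, separators=(",", ":"), ensure_ascii=True
    )
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


# Hypothetical spec payload; only memory_engine.kb_alias appears in the ADR.
example_spec = {
    "corpus": "research-notes",
    "memory_engine": {"kb_alias": "cognee"},
}
example_digest = spec_hash(example_spec)
```

Because the digest depends only on content, any change to the engine alias or other fields produces a different hash, which is what lets an operator tell at a glance whether two corpora run the same spec.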

Out of scope (Phase 6+):

  • Cedar formal-verification harness (cedar-analysis).
  • SpiceDB / Permify adapter implementations beyond stubs.
  • Multi-region active-active federation (vs the AWS-first → Azure → GCP staged rollout).
  • Tenant-configurable bi-temporal merge strategies (default: last-writer-wins per validity window + precedence tiebreaker).
  • Per-tenant bridge tier (shared compute / siloed databases) for SMB pricing.
  • Cognee improve / forget scheduling automation (manual triggers only in v1).
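
The default merge named in the list above (last-writer-wins per validity window with a precedence tiebreaker) might look roughly like the sketch below. It is a simplified reading, not the KBLayerComposer implementation: a validity window is identified here by `(key, valid_from)`, the last writer is determined by `created_at`, and the `Fact` class and `merge` function are hypothetical. The scope order private > hierarchical > marketplace > global comes from the Decision section.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Dict, Iterable, List, Tuple

# Lower rank means higher precedence.
SCOPE_RANK = {"private": 0, "hierarchical": 1, "marketplace": 2, "global": 3}


@dataclass(frozen=True)
class Fact:
    key: str
    value: str
    scope: str
    valid_from: datetime   # identifies the validity window in this sketch
    created_at: datetime   # identifies the "last writer"


def _beats(candidate: Fact, current: Fact) -> bool:
    """True if candidate should replace current in the same validity window."""
    if candidate.created_at != current.created_at:
        return candidate.created_at > current.created_at  # last writer wins
    # Tie on write time: the higher-precedence scope wins.
    return SCOPE_RANK[candidate.scope] < SCOPE_RANK[current.scope]


def merge(facts: Iterable[Fact]) -> List[Fact]:
    """Keep one winning fact per (key, valid_from) window."""
    winners: Dict[Tuple[str, datetime], Fact] = {}
    for fact in facts:
        slot = (fact.key, fact.valid_from)
        current = winners.get(slot)
        if current is None or _beats(fact, current):
            winners[slot] = fact
    return sorted(winners.values(), key=lambda f: (f.key, f.valid_from))
```

Under this reading a newer global fact overrides an older private one in the same window; a stricter precedence-first policy would swap the two comparisons in `_beats`.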