Research-papers RAG
The research_papers corpus is one of the bundled
KBCorpusSpec
templates. It ingests PDFs through the math-aware parser chain in
alphaswarm_kb.rag.parsers and indexes them into the research_papers
RAG corpus.
Parser chain
alphaswarm_kb.rag.parsers.pick_parser(path) selects a parser
based on the document's math density and complexity:
| Parser | Use |
|---|---|
| MarkerParser | Default: fast, math-aware Marker pipeline. |
| NougatParser | Heavy LaTeX/equation density (Nougat from Meta). |
| MathPixParser | Highest fidelity for handwriting or scanned PDFs (MathPix API). |
| PyPDFParser | Fast text-only fallback. |
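
To make the selection criteria in the table concrete, here is a minimal sketch of a density-based chooser. It is not the logic inside pick_parser: the function name, its parameters, and the 0.30 cutoff are all hypothetical and chosen only for illustration.

```python
def choose_parser_name(
    math_density: float,
    is_scanned: bool = False,
    heavy_parsers_available: bool = True,
) -> str:
    """Return a parser name from the table above.

    Hypothetical helper: the inputs and the threshold are assumptions,
    not values taken from alphaswarm_kb.
    """
    if not heavy_parsers_available:
        # Text-only fallback when the math-aware parsers cannot run.
        return "PyPDFParser"
    if is_scanned:
        # Scanned or handwritten pages need the highest-fidelity option.
        return "MathPixParser"
    if math_density >= 0.30:  # assumed cutoff for "heavy" equation density
        return "NougatParser"
    # Everything else goes to the default parser.
    return "MarkerParser"
```

The ordering of the checks mirrors the table: special cases first, with MarkerParser as the default.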
Upload + ingest
from alphaswarm_kb.rag.indexers.research_papers_indexer import index_research_papers
n_chunks = index_research_papers(paper_ids=["paper-uuid"])
Or via the REST surface:
POST /rag/papers/upload # upload PDF
POST /rag/papers/{id}/ingest
POST /rag/papers/{id}/synthesize # downstream strategy synthesis
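
As a rough sketch of calling the per-paper endpoints from Python, the snippet below builds the POST requests with the standard library and does not send them. The base URL and the helper name build_paper_request are assumptions; the upload endpoint is left out because its request body format is not described here.

```python
from urllib.parse import quote
from urllib.request import Request

BASE_URL = "http://localhost:8000"  # assumption: replace with your deployment's URL


def build_paper_request(action: str, paper_id: str, base_url: str = BASE_URL) -> Request:
    """Build (but do not send) a POST request for /rag/papers/{id}/<action>."""
    if action not in ("ingest", "synthesize"):
        raise ValueError(f"unsupported action: {action!r}")
    # Percent-encode the ID so it is safe to place in a URL path segment.
    url = f"{base_url.rstrip('/')}/rag/papers/{quote(paper_id, safe='')}/{action}"
    return Request(url, data=b"", method="POST")


req = build_paper_request("ingest", "paper-uuid")
print(req.get_method(), req.full_url)
```

Passing the returned object to urllib.request.urlopen would send it; any authentication headers your deployment requires would need to be added first.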
The Celery wrappers alphaswarm_kb.tasks.kb_tasks.ingest_research_paper and
synthesize_strategy_from_paper preserve the legacy alphaswarm.tasks.research_paper_tasks surface via shims.
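
For readers unfamiliar with the shim pattern, the following is a generic sketch of how a legacy entry point can delegate to a relocated function. Both functions are stand-ins defined locally, and the deprecation warning is an assumption; whether the actual shims warn is not stated here.

```python
import warnings


def ingest_research_paper(paper_id: str) -> str:
    """Stand-in for the relocated task body (hypothetical return value)."""
    return f"ingested:{paper_id}"


def legacy_ingest_research_paper(paper_id: str) -> str:
    """Shim kept at the old import path: warn, then delegate unchanged."""
    warnings.warn(
        "alphaswarm.tasks.research_paper_tasks is a legacy path; "
        "use alphaswarm_kb.tasks.kb_tasks instead",
        DeprecationWarning,
        stacklevel=2,  # attribute the warning to the caller, not the shim
    )
    return ingest_research_paper(paper_id)
```

Existing callers keep working through the old name while new code imports from the new module.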
Retrieval
Use data.kb.recall with corpus_name="research_papers". The
HierarchicalRAG.query_hybrid path is preferred for papers because
exact-token matches (theorem names, variable symbols) matter as much
as semantic similarity.
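
To illustrate why combining lexical and semantic results helps, here is a small sketch of reciprocal rank fusion, one common way to merge two rankings. This is a swapped-in technique for illustration: how HierarchicalRAG.query_hybrid actually fuses results is not described here, and the chunk IDs are made up.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of chunk IDs into one list, best first."""
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, chunk_id in enumerate(ranking, start=1):
            # Each list contributes 1 / (k + rank); higher ranks count more.
            scores[chunk_id] += 1.0 / (k + rank)
    # Sort by descending score, breaking ties by ID for determinism.
    return sorted(scores, key=lambda c: (-scores[c], c))


# Hypothetical results: an exact-token search and an embedding search.
lexical = ["thm-3.2", "intro", "appendix-b"]
semantic = ["related-work", "thm-3.2", "intro"]
fused = reciprocal_rank_fusion([lexical, semantic])
print(fused)  # ['thm-3.2', 'intro', 'related-work', 'appendix-b']
```

The chunk that names the theorem exactly ranks first because both rankings support it, even though the semantic search alone placed it second.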
Strategy synthesis
The synthesize_strategy_from_paper task pipes hybrid recall results
through router_complete (rule 2) and returns a YAML strategy stub
the Strategy Composer can load.