
Research-papers RAG

The research_papers corpus is one of the bundled KBCorpusSpec templates. It ingests PDFs through the math-aware parser chain in alphaswarm_kb.rag.parsers and indexes them into the research_papers RAG corpus.

Parser chain

alphaswarm_kb.rag.parsers.pick_parser(path) selects a parser based on the document's math density and complexity:

Parser          Use
MarkerParser    Default: fast, math-aware Marker pipeline.
NougatParser    Heavy LaTeX/equation density (Nougat from Meta).
MathPixParser   Highest fidelity for handwriting or scanned PDFs (MathPix API).
PyPDFParser     Fast text-only fallback.
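
The selection step can be pictured as a short threshold ladder. The sketch below is not the real pick_parser: the math_density heuristic, the 0.15 threshold, and the is_scanned and marker_available flags are all assumptions made for illustration. Only the four parser names come from the list above.

```python
import re

# Hypothetical heuristic: LaTeX commands and common math symbols count as "math".
MATH_TOKEN = re.compile(r"\\[a-zA-Z]+|[=^_{}∑∫√≤≥]")

HEAVY_MATH_THRESHOLD = 0.15  # assumed cut-off, not taken from the library


def math_density(text: str) -> float:
    """Return the share of characters that belong to math-like tokens."""
    if not text:
        return 0.0
    math_chars = sum(len(m.group(0)) for m in MATH_TOKEN.finditer(text))
    return math_chars / len(text)


def choose_parser_name(
    text: str, is_scanned: bool = False, marker_available: bool = True
) -> str:
    """Pick a parser name from extracted sample text (illustrative only)."""
    if is_scanned:
        # Scans and handwriting need the highest-fidelity option.
        return "MathPixParser"
    if math_density(text) >= HEAVY_MATH_THRESHOLD:
        # Equation-heavy documents go to the LaTeX-oriented parser.
        return "NougatParser"
    if marker_available:
        # Default path.
        return "MarkerParser"
    # Text-only fallback when the default pipeline cannot be used.
    return "PyPDFParser"


plain_choice = choose_parser_name("plain prose about markets")
math_choice = choose_parser_name(r"\frac{a}{b} = \sum_{i} x_i^2")
```

Ordering the checks from most to least specialised keeps the default path cheap: only documents that trip an explicit signal leave it.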

Upload + ingest

from alphaswarm_kb.rag.indexers.research_papers_indexer import index_research_papers

n_chunks = index_research_papers(paper_ids=["paper-uuid"])

Or via the REST surface:

POST /rag/papers/upload   # upload PDF
POST /rag/papers/{id}/ingest
POST /rag/papers/{id}/synthesize # downstream strategy synthesis
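
As a rough illustration of the call order, the sketch below builds the ingest and synthesize requests for one paper without sending them. The base URL and the paper_requests helper are hypothetical. The source does not specify a host, authentication, or request and response bodies, so none are assumed here.

```python
from urllib.parse import quote
from urllib.request import Request

BASE_URL = "http://localhost:8000"  # assumption: the source does not give a host


def paper_requests(paper_id: str, base_url: str = BASE_URL) -> list[Request]:
    """Build (but do not send) the ingest and synthesize calls for one paper."""
    safe_id = quote(paper_id, safe="")  # escape anything unsafe in a path segment
    paths = [
        f"/rag/papers/{safe_id}/ingest",
        f"/rag/papers/{safe_id}/synthesize",
    ]
    return [Request(base_url + path, method="POST") for path in paths]


requests_to_send = paper_requests("paper-uuid")
```

Each request could then be passed to urllib.request.urlopen in order. The upload step is left out because its expected body format is not stated.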

The Celery wrappers in alphaswarm_kb.tasks.kb_tasks (ingest_research_paper and synthesize_strategy_from_paper) preserve the legacy alphaswarm.tasks.research_paper_tasks surface via shims.

Retrieval

Use data.kb.recall with corpus_name="research_papers". The HierarchicalRAG.query_hybrid path is preferred for papers because exact-token matches (theorem names, variable symbols) matter as much as semantic similarity.
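
A toy ranking shows why blending the two signals helps. This is not HierarchicalRAG.query_hybrid: the scoring functions, the alpha weight, the two-dimensional stand-in embeddings, and the sample chunks are all illustrative assumptions.

```python
import math


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two vectors (0.0 if either has zero norm)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def exact_token_score(query: str, text: str) -> float:
    """Fraction of query tokens that appear verbatim in the chunk."""
    query_tokens = query.lower().split()
    chunk_tokens = set(text.lower().split())
    if not query_tokens:
        return 0.0
    return sum(tok in chunk_tokens for tok in query_tokens) / len(query_tokens)


def hybrid_rank(query, query_vec, chunks, alpha=0.5):
    """Rank chunk texts by alpha * lexical score + (1 - alpha) * semantic score."""
    scored = [
        (alpha * exact_token_score(query, text) + (1 - alpha) * cosine(query_vec, vec), text)
        for text, vec in chunks
    ]
    return [text for _, text in sorted(scored, key=lambda pair: pair[0], reverse=True)]


# Made-up chunks with hand-picked vectors standing in for real embeddings.
chunks = [
    ("girsanov theorem changes the drift under the risk-neutral measure", [0.6, 0.8]),
    ("a change of measure removes the drift term", [0.8, 0.6]),
]
query, query_vec = "girsanov theorem", [0.9, 0.44]

hybrid_order = hybrid_rank(query, query_vec, chunks)             # blended score
semantic_order = hybrid_rank(query, query_vec, chunks, alpha=0)  # embeddings only
```

With embeddings alone the paraphrased chunk ranks first. Adding the exact-token term promotes the chunk that names the theorem.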

Strategy synthesis

The synthesize_strategy_from_paper task pipes hybrid recall results through router_complete (rule 2) and returns a YAML strategy stub the Strategy Composer can load.