Python RAG library · MCP-native · AGPL-3.0

Your documents.
Any agent.
One library.

ragmcp is a modular Python RAG library. Embed it directly in your agent, expose it as an MCP server for Claude Desktop or Cursor, or serve it as a REST API — LLM-agnostic and swappable at every level.

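main.py
import asyncio

from ragmcp import RAGFactory
from ragmcp.mcp_server import RAGMCPServer

pipeline = RAGFactory.create_default()


async def main() -> str:
    # Chunk, embed, and index the files under ./docs
    await pipeline.ingest_folder("./docs")

    # Use 1: embed in your agent
    chunks = await pipeline.search("timeout config", top_k=5)
    return "\n\n".join(c.content for c in chunks)


# `await` is only valid inside a coroutine, so drive it with asyncio.run()
context = asyncio.run(main())

# Use 2: expose as MCP server (assumes run() is a blocking, synchronous call)
RAGMCPServer(pipeline).run()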
Agent SDK
MCP stdio / SSE
ChromaDB · Qdrant · pgvector
OpenAI · Ollama · Cohere
LLM-agnostic
10+ vector stores
Agent-ready
MCP-native
stdio & SSE
Python 3.11+

Everything you need to build production RAG pipelines

Agent SDK

Use ragmcp inside any agent framework — LangGraph, pydantic-ai, CrewAI, or your own. Call pipeline.search() and get structured chunks.

LangGraph · pydantic-ai · CrewAI
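A minimal sketch of how search results might be folded into an agent prompt. `Chunk` is a stand-in dataclass: only the `content` attribute appears in the ragmcp example, and `build_prompt` and `max_chars` are hypothetical names used for illustration, not part of the library.

```python
from dataclasses import dataclass


@dataclass
class Chunk:
    """Stand-in for a search result; only `content` is taken from the ragmcp example."""
    content: str


def build_prompt(question: str, chunks: list[Chunk], max_chars: int = 2000) -> str:
    """Join chunk texts into a context block, stopping before max_chars is exceeded."""
    parts: list[str] = []
    used = 0
    for chunk in chunks:
        if used + len(chunk.content) > max_chars:
            break  # keep the prompt within the character budget
        parts.append(chunk.content)
        used += len(chunk.content)
    context = "\n\n".join(parts)
    return f"Answer using only this context:\n\n{context}\n\nQuestion: {question}"


# In a real agent, these would come from `await pipeline.search(...)`.
demo_chunks = [Chunk("Timeouts default to 30 s."), Chunk("Set RETRY=3 to retry.")]
prompt = build_prompt("What is the default timeout?", demo_chunks)
```

The character budget is a crude proxy for a token limit; a tokenizer-based budget would be the more precise choice.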

LLM-agnostic

Swap your embedder or LLM backend at any time — FastEmbed, LiteLLM, Ollama, Cohere — without touching application code.

OpenAI · Ollama · Cohere
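One common way to get this kind of swappability is to have application code depend on a small interface rather than a concrete backend. The sketch below illustrates the idea with `typing.Protocol`. The `Embedder` interface and both toy backends are assumptions made for this example and do not reflect ragmcp's actual classes.

```python
import asyncio
import hashlib
from typing import Protocol


class Embedder(Protocol):
    """Hypothetical interface: anything with an async embed() can be plugged in."""
    async def embed(self, texts: list[str]) -> list[list[float]]: ...


class HashEmbedder:
    """Toy backend: a deterministic 4-dim vector from a SHA-256 digest."""
    async def embed(self, texts: list[str]) -> list[list[float]]:
        return [[b / 255 for b in hashlib.sha256(t.encode()).digest()[:4]] for t in texts]


class LengthEmbedder:
    """Second toy backend: character count and word count."""
    async def embed(self, texts: list[str]) -> list[list[float]]:
        return [[float(len(t)), float(len(t.split()))] for t in texts]


async def index(docs: list[str], embedder: Embedder) -> dict[str, list[float]]:
    """Application code: unchanged no matter which backend is passed in."""
    vectors = await embedder.embed(docs)
    return dict(zip(docs, vectors))


docs = ["timeout config", "retry policy"]
by_hash = asyncio.run(index(docs, HashEmbedder()))
by_length = asyncio.run(index(docs, LengthEmbedder()))
```

Swapping backends here means changing one constructor call; `index` itself never changes.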

Swappable backends

ChromaDB, Qdrant, pgvector, or Milvus for vectors. Redis for caching. Cross-encoder or Cohere for reranking.

ChromaDB · Qdrant · pgvector

Production-ready

Audit logging, rate limiting, tenant isolation, feedback loops, OpenTelemetry tracing, and Prometheus metrics built in.

Multi-tenancy · Observability

Rich retrieval

Dense vector search, BM25 full-text, hybrid with RRF fusion, Graph RAG over entity graphs, and ColPali multimodal.

Hybrid RRF · Graph RAG · ColPali

MCP-native

One line to expose your pipeline as an MCP server. Claude Desktop and Cursor call search_documents as a native tool.

stdio · SSE
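Claude Desktop discovers local MCP servers through the `mcpServers` key of its `claude_desktop_config.json`. The sketch below generates such an entry. It assumes a script that ends in `RAGMCPServer(pipeline).run()` and serves over stdio; the server name `ragmcp-docs` and the script path are placeholders.

```python
import json

# Assumptions: "ragmcp-docs" is an arbitrary label, and the path points at a
# script that ends in RAGMCPServer(pipeline).run() and speaks MCP over stdio.
config = {
    "mcpServers": {
        "ragmcp-docs": {
            "command": "python",
            "args": ["/absolute/path/to/main.py"],
        }
    }
}

config_json = json.dumps(config, indent=2)
print(config_json)  # merge this into claude_desktop_config.json
```

An absolute path is the safer choice here, since the client may launch the command from a different working directory.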

RAGFactory

Go from zero to a working pipeline in 3 lines. Scale to production by changing env vars.

create_default() · from_env()

Install once.
Deploy anywhere.

1. Install

One pip install. Minimal dependencies by default.

pip install mcpaisuite-ragmcp
2. Ingest

Point ragmcp at a file, directory, URL, or S3 bucket. Chunking and embedding are automatic.

await pipeline.ingest_folder("docs/")
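"Automatic" chunking usually means some variant of a sliding window. The sketch below shows the simplest form, fixed-size character windows with overlap. It is an illustration of the general idea only; ragmcp's actual chunking strategy and defaults are not described here, and `chunk_text` is a hypothetical helper.

```python
def chunk_text(text: str, size: int = 200, overlap: int = 50) -> list[str]:
    """Split text into `size`-character windows that share `overlap` characters."""
    if size <= 0 or not 0 <= overlap < size:
        raise ValueError("need size > 0 and 0 <= overlap < size")
    step = size - overlap
    chunks: list[str] = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + size])
        if start + size >= len(text):
            break  # this window already reached the end of the text
    return chunks


pieces = chunk_text("abcdefghij", size=4, overlap=2)
print(pieces)  # ['abcd', 'cdef', 'efgh', 'ghij']
```

The overlap means a sentence that straddles a boundary still appears whole in at least one chunk, at the cost of storing some text twice.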
3. Search

Call pipeline.search() from your agent, your API, or any Python code.

await pipeline.search("query", top_k=5)
4. Deploy

Library, MCP server, or FastAPI REST endpoint — all from the same pipeline object.

RAGMCPServer(pipeline).run()
Your Docs → RAGPipeline → 🤖 Your Agent · 🔌 MCP Server · 🌐 REST API

Every component is swappable

Category | Backends | Notes
Embedders | FastEmbed, LiteLLM (OpenAI/Cohere/Mistral), Ollama, sentence-transformers | Fully async; cached
Vector stores | InMemory, ChromaDB, Qdrant, pgvector, Milvus | Metadata filtering, namespaces
Rerankers | CrossEncoder, Cohere Rerank, FeedbackReranker | Feedback loop improves over time
Caches | InMemoryLRU, Redis | Embedding & search result caching
Loaders | PDF, DOCX, HTML, Markdown, images (OCR), audio (Whisper) | AutoLoader auto-detects format
Retrievers | Dense, BM25, Hybrid (RRF), Graph RAG | Configurable RRF weights
Sources | S3, GCS, Notion, Confluence, GitHub, Slack, Jira, IMAP, Kafka | Pull or streaming ingestion
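The Caches row mentions in-memory LRU caching of embeddings. As a rough illustration of the idea, here is a sketch of a least-recently-used cache built on `OrderedDict`. `LRUEmbeddingCache` and `get_or_compute` are hypothetical names for this example, not ragmcp's InMemoryLRU API.

```python
from collections import OrderedDict
from typing import Callable


class LRUEmbeddingCache:
    """Keep the most recently used embeddings, evicting the oldest beyond capacity."""

    def __init__(self, capacity: int = 1024) -> None:
        self.capacity = capacity
        self._items: OrderedDict[str, list[float]] = OrderedDict()
        self.hits = 0
        self.misses = 0

    def get_or_compute(self, text: str, embed_fn: Callable[[str], list[float]]) -> list[float]:
        if text in self._items:
            self._items.move_to_end(text)  # mark as most recently used
            self.hits += 1
            return self._items[text]
        self.misses += 1
        vector = embed_fn(text)
        self._items[text] = vector
        if len(self._items) > self.capacity:
            self._items.popitem(last=False)  # evict the least recently used entry
        return vector


def fake_embed(text: str) -> list[float]:
    """Placeholder for a real (and expensive) embedding call."""
    return [float(len(text))]


cache = LRUEmbeddingCache(capacity=2)
for query in ["a", "bb", "a", "ccc", "bb"]:
    cache.get_or_compute(query, fake_embed)
print(cache.hits, cache.misses)  # 1 4
```

In the run above, "bb" is evicted when "ccc" arrives because "a" was used more recently, so the final "bb" lookup is a miss.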

ragmcp vs LlamaIndex

Same model, same controls, and losses shown as plainly as wins. Every number can be reproduced from a script, with the raw JSON in the repo.

Metric | ragmcp | LlamaIndex
recall@1 (chunk-parity) | 0.875 | 0.875
bulk ingest | ~1.7× faster | (baseline)
query latency | ~13.76 ms | ~10.52 ms
2-hop multi-hop (Sonnet) | ReAct-RAG 6/10 | single-shot 0/10

The core retriever is at exact parity: force one chunk per document on both sides and every metric is identical (same embedder, same cosine similarity). ragmcp's real edges are faster bulk ingest (~1.7×) and a built-in agentic multi-hop loop that recovers 2-hop questions (6/10) that plain single-shot RAG structurally cannot (0/10). LlamaIndex answers about 3 ms faster per query (~10.52 ms vs ~13.76 ms). The two are competitive with opposite tradeoffs; neither is a knockout.
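For reference, recall@1 with one relevant document per query is simply the fraction of queries whose top result is the right document. The sketch below computes it on made-up toy data. The data is not taken from the benchmark, and `recall_at_k` is an illustrative helper rather than the benchmark script.

```python
def recall_at_k(results: dict[str, list[str]], relevant: dict[str, str], k: int = 1) -> float:
    """Fraction of queries whose relevant document ID appears in the top-k results."""
    if not relevant:
        raise ValueError("no queries to evaluate")
    hits = sum(
        1 for query, doc_id in relevant.items()
        if doc_id in results.get(query, [])[:k]
    )
    return hits / len(relevant)


# Toy data for illustration only: ranked result IDs per query, and the single
# document each query is expected to retrieve.
results = {"q1": ["d1", "d9"], "q2": ["d2"], "q3": ["d7", "d3"], "q4": ["d4"]}
relevant = {"q1": "d1", "q2": "d2", "q3": "d3", "q4": "d4"}

print(recall_at_k(results, relevant, k=1))  # 0.75
print(recall_at_k(results, relevant, k=2))  # 1.0
```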

See the full benchmark →

Ready to connect your docs to every LLM?

Read the docs · Star on GitHub