ragmcp is a modular Python RAG library. Embed it directly in your agent, expose it as an MCP server for Claude Desktop or Cursor, or serve it as a REST API — LLM-agnostic and swappable at every level.

```python
from ragmcp import RAGFactory
from ragmcp.mcp_server import RAGMCPServer

pipeline = RAGFactory.create_default()
# The awaits below must run inside an async function (or an async REPL).
await pipeline.ingest_folder("./docs")

# Use 1: embed in your agent
chunks = await pipeline.search("timeout config", top_k=5)
context = "\n\n".join(c.content for c in chunks)

# Use 2: expose as MCP server
RAGMCPServer(pipeline).run()
```

Use ragmcp inside any agent framework — LangGraph, pydantic-ai, CrewAI, or your own. Call pipeline.search() and get structured chunks.
Swap your embedder or LLM backend at any time — FastEmbed, LiteLLM, Ollama, Cohere — without touching application code.
ChromaDB, Qdrant, pgvector, or Milvus for vectors. Redis for caching. Cross-encoder or Cohere for reranking.
Audit logging, rate limiting, tenant isolation, feedback loops, OpenTelemetry tracing, and Prometheus metrics built in.
Dense vector search, BM25 full-text, hybrid with RRF fusion, Graph RAG over entity graphs, and ColPali multimodal.
One line to expose your pipeline as an MCP server. Claude Desktop and Cursor call search_documents as a native tool.
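
Swapping a backend without touching application code usually comes down to coding against an interface rather than a concrete class. The sketch below illustrates that pattern only. It is not ragmcp's actual API: `Embedder`, `VowelCountEmbedder`, `LengthEmbedder`, and `index_texts` are hypothetical names invented for this example.

```python
from typing import Protocol, Sequence


class Embedder(Protocol):
    """Hypothetical interface; ragmcp's real embedder API may differ."""

    def embed(self, texts: Sequence[str]) -> list[list[float]]: ...


class VowelCountEmbedder:
    """Toy backend: one dimension per vowel, counting occurrences."""

    def embed(self, texts: Sequence[str]) -> list[list[float]]:
        return [[float(t.lower().count(v)) for v in "aeiou"] for t in texts]


class LengthEmbedder:
    """Toy backend: a 1-dimensional vector holding the text length."""

    def embed(self, texts: Sequence[str]) -> list[list[float]]:
        return [[float(len(t))] for t in texts]


def index_texts(embedder: Embedder, texts: Sequence[str]) -> dict[str, list[float]]:
    """Application code: depends only on the Embedder interface."""
    return dict(zip(texts, embedder.embed(texts)))


docs = ["timeout config", "retry policy"]
index_a = index_texts(VowelCountEmbedder(), docs)  # backend A
index_b = index_texts(LengthEmbedder(), docs)      # backend B, same call site
```

Because `index_texts` only relies on the `embed` method, either backend can be passed in without changing the function itself.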
Go from zero to a working pipeline in 3 lines. Scale to production by changing env vars.
One pip install. Minimal dependencies by default.
Point ragmcp at a file, directory, URL, or S3 bucket. Chunking and embedding are automatic.
Call pipeline.search() from your agent, your API, or any Python code.
Library, MCP server, or FastAPI REST endpoint — all from the same pipeline object.
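
Once `pipeline.search()` returns chunks, an agent typically has to fit them into a prompt. Here is a minimal sketch of that step, assuming only that each chunk exposes a `content` string as in the quick-start snippet. The `Chunk` dataclass is a stand-in so the example runs without ragmcp installed, and `build_context` with its `max_chars` budget is a hypothetical helper, not part of the library.

```python
from dataclasses import dataclass


@dataclass
class Chunk:
    """Stand-in for a search result; only `content` is taken from the quick start."""

    content: str


def build_context(chunks: list[Chunk], max_chars: int = 2000, sep: str = "\n\n") -> str:
    """Join chunk texts in rank order, skipping duplicates, within a character budget."""
    parts: list[str] = []
    seen: set[str] = set()
    used = 0
    for chunk in chunks:
        text = chunk.content.strip()
        if not text or text in seen:
            continue  # drop empty and repeated chunks
        cost = len(text) + (len(sep) if parts else 0)
        if used + cost > max_chars:
            break  # budget exhausted; lower-ranked chunks are dropped
        parts.append(text)
        seen.add(text)
        used += cost
    return sep.join(parts)


chunks = [Chunk("timeout = 30s"), Chunk("timeout = 30s"), Chunk("retries = 3")]
context = build_context(chunks, max_chars=40)
prompt = f"Answer using only this context:\n\n{context}\n\nQuestion: What is the timeout?"
```

A character budget is a rough proxy for a token budget. A real agent would more likely count tokens with the tokenizer of whichever model it calls.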
| Category | Backends | Notes |
|---|---|---|
| Embedders | FastEmbed, LiteLLM (OpenAI/Cohere/Mistral), Ollama, sentence-transformers | Fully async; cached |
| Vector stores | InMemory, ChromaDB, Qdrant, pgvector, Milvus | Metadata filtering, namespaces |
| Rerankers | CrossEncoder, Cohere Rerank, FeedbackReranker | Feedback loop improves over time |
| Caches | InMemoryLRU, Redis | Embedding & search result caching |
| Loaders | PDF, DOCX, HTML, Markdown, images (OCR), audio (Whisper) | AutoLoader auto-detects format |
| Retrievers | Dense, BM25, Hybrid (RRF), Graph RAG | Configurable RRF weights |
| Sources | S3, GCS, Notion, Confluence, GitHub, Slack, Jira, IMAP, Kafka | Pull or streaming ingestion |
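
To make the "Hybrid (RRF)" row concrete, here is a sketch of weighted reciprocal rank fusion, which merges ranked lists by summing `weight / (k + rank)` for each document. It illustrates the general technique and is not ragmcp's implementation. The function name `rrf_fuse`, the per-retriever `weights` mapping, and the default `k=60` (the value commonly used in the RRF literature) are assumptions.

```python
from collections import defaultdict
from typing import Optional


def rrf_fuse(
    rankings: dict[str, list[str]],
    weights: Optional[dict[str, float]] = None,
    k: int = 60,
) -> list[tuple[str, float]]:
    """Fuse ranked ID lists: score(d) = sum over retrievers of w / (k + rank)."""
    weights = weights or {}
    scores: dict[str, float] = defaultdict(float)
    for name, ranked_ids in rankings.items():
        w = weights.get(name, 1.0)  # unweighted retrievers default to 1.0
        for rank, doc_id in enumerate(ranked_ids, start=1):
            scores[doc_id] += w / (k + rank)
    # Highest fused score first; ties broken by ID for determinism.
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))


dense = ["doc_a", "doc_b", "doc_c"]  # hypothetical dense-retriever ranking
bm25 = ["doc_b", "doc_d", "doc_a"]   # hypothetical BM25 ranking

fused = rrf_fuse({"dense": dense, "bm25": bm25})
dense_heavy = rrf_fuse({"dense": dense, "bm25": bm25}, weights={"dense": 2.0})
```

With equal weights, `doc_b` wins because it ranks near the top of both lists. Doubling the dense weight moves `doc_a` to first place. RRF uses only ranks, so the raw cosine and BM25 scores never need to be normalized against each other.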
ragmcp vs LlamaIndex

Same model, same controls, losses shown as plainly as wins. Every number reproduces from a script with raw JSON in the repo.
| Metric | ragmcp | LlamaIndex |
|---|---|---|
| recall@1 (chunk-parity) | 0.875 | 0.875 |
| bulk ingest | ~1.7× faster | — |
| query latency | ~13.76 ms | ~10.52 ms |
| 2-hop multi-hop (Sonnet) | ReAct-RAG 6/10 | single-shot 0/10 |
The core retriever is at exact parity: force one chunk per document on both sides and every metric is identical (same embedder, same cosine similarity). ragmcp's real edges are faster bulk ingest (~1.7×) and a built-in agentic multi-hop loop (ReAct-RAG) that recovers 2-hop questions that plain single-shot RAG structurally cannot answer: 6/10 versus 0/10 in this run. LlamaIndex answers about 3 ms faster per query (~10.52 ms versus ~13.76 ms). The two are competitive with opposite tradeoffs, not a knockout.
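
The toy sketch below shows why a multi-hop loop can answer 2-hop questions that single-shot retrieval misses. It is not ragmcp's ReAct-RAG implementation. The corpus, the word-overlap `search`, and the rule-based `plan_next_query` (a stand-in for the LLM's reasoning step) are all hypothetical and exist only to keep the example self-contained.

```python
import re
from typing import Optional

CORPUS = [
    "The billing service is owned by team Falcon.",
    "Team Heron works in the Lisbon office.",
    "The search service is owned by team Heron.",
    "Team Falcon works in the Berlin office.",
]
STOPWORDS = {"the", "is", "in", "on", "by", "of", "which", "does", "that"}


def tokens(text: str) -> set[str]:
    return set(re.findall(r"[a-z]+", text.lower())) - STOPWORDS


def search(query: str, top_k: int = 1) -> list[str]:
    """Toy retriever: rank documents by word overlap with the query."""
    q = tokens(query)
    ranked = sorted(CORPUS, key=lambda text: len(q & tokens(text)), reverse=True)
    return ranked[:top_k]


def plan_next_query(question: str, evidence: list[str]) -> Optional[str]:
    """Stand-in for an LLM step: return a follow-up query, or None when done."""
    joined = " ".join(evidence)
    team = re.search(r"team (\w+)", joined, flags=re.IGNORECASE)
    if team and "office" not in joined.lower():
        return f"team {team.group(1)} office"  # second hop: look up the team's office
    return None


def multi_hop(question: str, max_hops: int = 3) -> list[str]:
    """Retrieve, inspect the evidence, and re-query until the planner stops."""
    evidence: list[str] = []
    query: Optional[str] = question
    for _ in range(max_hops):
        for hit in search(query, top_k=1):
            if hit not in evidence:
                evidence.append(hit)
        query = plan_next_query(question, evidence)
        if query is None:
            break
    return evidence


question = "Which office does the team that owns the billing service work in?"
single_shot = search(question, top_k=1)
hops = multi_hop(question)
```

Single-shot retrieval returns only the ownership fact, because the office fact shares few words with the question. The loop uses the first hit to form a second query that names the team, which is what surfaces the office.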