Python RAG library · MCP-native · Apache-2.0

Your documents.
Any agent.
One library.

ragmcp is a modular Python RAG library. Embed it directly in your agent, expose it as an MCP server for Claude Desktop or Cursor, or serve it as a REST API — LLM-agnostic and swappable at every level.


main.py

import asyncio

from ragmcp import RAGFactory
from ragmcp.mcp_server import RAGMCPServer

pipeline = RAGFactory.create_default()


async def main():
    # ingest_folder() and search() are coroutines, so they need a running event loop
    await pipeline.ingest_folder("./docs")

    # Use 1: embed in your agent
    chunks = await pipeline.search("timeout config", top_k=5)
    context = "\n\n".join(c.content for c in chunks)


asyncio.run(main())

# Use 2: expose as MCP server
RAGMCPServer(pipeline).run()

Agent SDK · MCP stdio / SSE

ChromaDB · Qdrant · pgvector
OpenAI · Ollama · Cohere

Features

Everything you need to build
production RAG pipelines

Agent SDK

Use ragmcp inside any agent framework — LangGraph, pydantic-ai, CrewAI, or your own. Call pipeline.search() and get structured chunks.

LangGraph · pydantic-ai · CrewAI
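A minimal sketch of the retrieve-then-prompt pattern an agent would follow. `StubPipeline`, `Chunk`, and `build_prompt` are hypothetical stand-ins so the example runs without ragmcp installed; only the `search(query, top_k=...)` call shape and the `.content` attribute are taken from the snippet at the top of this page.

```python
import asyncio
from dataclasses import dataclass


@dataclass
class Chunk:
    """Hypothetical stand-in for a retrieved chunk; only .content is assumed."""
    content: str


class StubPipeline:
    """Hypothetical stand-in with the same search() call shape as the page's snippet."""

    def __init__(self, texts):
        self._texts = list(texts)

    async def search(self, query: str, top_k: int = 5):
        # Toy ranking: count query words that appear in each text (not real retrieval).
        words = query.lower().split()
        scored = sorted(
            self._texts,
            key=lambda t: sum(w in t.lower() for w in words),
            reverse=True,
        )
        return [Chunk(t) for t in scored[:top_k]]


async def build_prompt(pipeline, question: str, top_k: int = 2) -> str:
    # Retrieve chunks, then join their text into a context block for any LLM client.
    chunks = await pipeline.search(question, top_k=top_k)
    context = "\n\n".join(c.content for c in chunks)
    return f"Answer using only this context:\n\n{context}\n\nQuestion: {question}"


pipeline = StubPipeline([
    "The request timeout is set with the timeout config key.",
    "Logging is configured in logging.yaml.",
    "Retries use exponential backoff.",
])
prompt = asyncio.run(build_prompt(pipeline, "timeout config"))
```

Swapping `StubPipeline` for a real pipeline object should leave `build_prompt` unchanged, assuming the real `search()` returns objects with a `.content` string as the page's snippet suggests.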

LLM-agnostic

Swap your embedder or LLM backend at any time — FastEmbed, LiteLLM, Ollama, Cohere — without touching application code.

OpenAI · Ollama · Cohere

Swappable backends

ChromaDB, Qdrant, pgvector, or Milvus for vectors. Redis for caching. Cross-encoder or Cohere for reranking.

ChromaDB · Qdrant · pgvector
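One common way to make backends swappable is to code against a small interface. The sketch below illustrates that idea only: `VectorStore`, `InMemoryStore`, and `nearest` are hypothetical names, and ragmcp's actual base classes and method signatures may differ.

```python
import math
from typing import Protocol


class VectorStore(Protocol):
    """Hypothetical interface; not ragmcp's real base class."""

    def add(self, doc_id: str, vector: list[float]) -> None: ...
    def query(self, vector: list[float], top_k: int) -> list[str]: ...


class InMemoryStore:
    """Brute-force cosine-similarity store that satisfies VectorStore."""

    def __init__(self) -> None:
        self._vectors: dict[str, list[float]] = {}

    def add(self, doc_id: str, vector: list[float]) -> None:
        self._vectors[doc_id] = vector

    def query(self, vector: list[float], top_k: int) -> list[str]:
        def cosine(a, b):
            dot = sum(x * y for x, y in zip(a, b))
            norm = math.hypot(*a) * math.hypot(*b)
            return dot / norm if norm else 0.0

        ranked = sorted(
            self._vectors,
            key=lambda d: cosine(vector, self._vectors[d]),
            reverse=True,
        )
        return ranked[:top_k]


def nearest(store: VectorStore, vector: list[float]) -> str:
    # Application code depends only on the protocol, so any backend can be passed in.
    return store.query(vector, top_k=1)[0]


store = InMemoryStore()
store.add("x-axis", [1.0, 0.0])
store.add("y-axis", [0.0, 1.0])
best = nearest(store, [0.9, 0.1])
```

A class wrapping ChromaDB or Qdrant with the same two methods could be passed to `nearest` without changing it, which is the property the card above describes.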

Production-ready

Audit logging, rate limiting, tenant isolation, feedback loops, OpenTelemetry tracing, and Prometheus metrics built in.

Multi-tenancy · Observability

Rich retrieval

Dense vector search, BM25 full-text, hybrid with RRF fusion, Graph RAG over entity graphs, and ColPali multimodal.

Hybrid RRF · Graph RAG · ColPali
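For readers unfamiliar with RRF, here is a minimal sketch of reciprocal rank fusion over a dense ranking and a BM25 ranking. It is a generic illustration, not ragmcp's implementation: the function name `rrf_fuse`, the `weights` parameter, and the conventional constant `k=60` are assumptions.

```python
from collections import defaultdict


def rrf_fuse(rankings, k=60, weights=None):
    """Fuse ranked lists of document ids with (optionally weighted) reciprocal rank fusion."""
    weights = weights or [1.0] * len(rankings)
    scores = defaultdict(float)
    for ranking, weight in zip(rankings, weights):
        for rank, doc_id in enumerate(ranking, start=1):
            # Each list contributes weight / (k + rank) for every document it contains.
            scores[doc_id] += weight / (k + rank)
    # Highest fused score first; ties broken by id for a deterministic order.
    return sorted(scores, key=lambda d: (-scores[d], d))


dense = ["a", "b", "c"]  # ids ranked by vector similarity
bm25 = ["b", "d", "a"]   # ids ranked by BM25
fused = rrf_fuse([dense, bm25])
dense_only = rrf_fuse([dense, bm25], weights=[1.0, 0.0])
```

Documents that rank well in both lists rise to the top, and because only ranks are used, the dense and BM25 scores never need to be put on a common scale.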

MCP-native

One line to expose your pipeline as an MCP server. Claude Desktop and Cursor call search_documents as a native tool.

stdio · SSE

RAGFactory

Go from zero to a working pipeline in 3 lines. Scale to production by changing env vars.

create_default() · from_env()

How it works

Install once.
Deploy anywhere.

Install

One pip install. Minimal dependencies by default.

pip install mcpaisuite-ragmcp

Ingest

Point ragmcp at a file, directory, URL, or S3 bucket. Chunking and embedding are automatic.

await pipeline.ingest_folder("docs/")
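The page does not say how ragmcp chunks documents. As a rough illustration of what automatic chunking typically involves, here is a sketch of fixed-size character windows with overlap; `chunk_text` and its `size` and `overlap` parameters are hypothetical and are not part of ragmcp's API.

```python
def chunk_text(text, size=200, overlap=50):
    """Split text into fixed-size character windows that overlap by `overlap` characters."""
    if not 0 <= overlap < size:
        raise ValueError("overlap must be non-negative and smaller than size")
    step = size - overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + size])
        if start + size >= len(text):
            break  # the last window already reaches the end of the text
    return chunks


chunks = chunk_text("abcdefghij", size=4, overlap=2)
```

The overlap means a sentence that straddles a boundary still appears whole in at least one chunk, at the cost of embedding some text twice.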

Search

Call pipeline.search() from your agent, your API, or any Python code.

await pipeline.search("query", top_k=5)

Deploy

Library, MCP server, or FastAPI REST endpoint — all from the same pipeline object.

RAGMCPServer(pipeline).run()

Your Docs → RAGPipeline → 🤖 Your Agent / 🔌 MCP Server / 🌐 REST API

Backends

Every component is swappable

Category	Backends	Notes
Embedders	FastEmbed, LiteLLM (OpenAI/Cohere/Mistral), Ollama, sentence-transformers	Fully async; cached
Vector stores	InMemory, ChromaDB, Qdrant, pgvector, Milvus	Metadata filtering, namespaces
Rerankers	CrossEncoder, Cohere Rerank, FeedbackReranker	Feedback loop improves over time
Caches	InMemoryLRU, Redis	Embedding & search result caching
Loaders	PDF, DOCX, HTML, Markdown, images (OCR), audio (Whisper)	AutoLoader auto-detects format
Retrievers	Dense, BM25, Hybrid (RRF), Graph RAG	Configurable RRF weights
Sources	S3, GCS, Notion, Confluence, GitHub, Slack, Jira, IMAP, Kafka	Pull or streaming ingestion
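To make the Caches row concrete, here is a minimal sketch of an in-memory LRU cache placed in front of an embedder. `LRUEmbeddingCache` and `toy_embed` are hypothetical names for illustration; this is not ragmcp's `InMemoryLRU` class, whose interface the page does not show.

```python
from collections import OrderedDict


class LRUEmbeddingCache:
    """Hypothetical in-memory LRU cache keyed by the input text."""

    def __init__(self, embed_fn, max_size=1024):
        self._embed_fn = embed_fn
        self._max_size = max_size
        self._store = OrderedDict()
        self.misses = 0

    def get(self, text):
        if text in self._store:
            self._store.move_to_end(text)  # mark as most recently used
            return self._store[text]
        self.misses += 1
        vector = self._embed_fn(text)
        self._store[text] = vector
        if len(self._store) > self._max_size:
            self._store.popitem(last=False)  # evict the least recently used entry
        return vector


def toy_embed(text):
    # Placeholder embedder: deterministic numbers, not a real model.
    return [float(len(text)), float(sum(map(ord, text)) % 97)]


cache = LRUEmbeddingCache(toy_embed, max_size=2)
cache.get("alpha")
cache.get("alpha")  # served from the cache, no second embed call
cache.get("beta")
cache.get("gamma")  # evicts "alpha", the least recently used key
```

The same wrapper shape would work with Redis as the store; only the lookup and eviction lines would change.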

Measured, not claimed

`ragmcp` vs LlamaIndex

Same model, same controls, losses shown as plainly as wins. Every number reproduces from a script with raw JSON in the repo.

Metric	ragmcp	LlamaIndex
recall@1 (chunk-parity)	0.875	0.875
bulk ingest	~1.7× faster	—
query latency	~13.76 ms	~10.52 ms
2-hop multi-hop (Sonnet)	ReAct-RAG 6/10	single-shot 0/10

The core retriever is at exact parity: force one chunk per document on both sides and every metric is identical (same embedder, same cosine similarity). ragmcp's real edges are faster bulk ingest (about 1.7×) and a built-in agentic multi-hop loop that recovers 2-hop questions that plain single-shot RAG structurally cannot answer. LlamaIndex answers about 3 ms faster per query (~10.52 ms vs ~13.76 ms). The two are competitive with opposite tradeoffs; neither is a knockout.
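For reference, recall@1 is the fraction of queries whose top-ranked result is the relevant document. The sketch below computes it on made-up toy data; it is not the benchmark script from the repo, and `recall_at_k` is a hypothetical helper name.

```python
def recall_at_k(ranked_ids, relevant_ids, k=1):
    """Fraction of queries whose top-k results contain the relevant document."""
    hits = sum(
        1
        for ranked, relevant in zip(ranked_ids, relevant_ids)
        if relevant in ranked[:k]
    )
    return hits / len(relevant_ids)


# Toy data, not the benchmark's: 4 queries, 3 of which rank the right document first.
gold = ["doc0", "doc1", "doc2", "doc3"]
results = [[g, "other"] for g in gold]
results[2] = ["other", "doc2"]  # one miss at rank 1, recovered at rank 2

recall_1 = recall_at_k(results, gold, k=1)
recall_2 = recall_at_k(results, gold, k=2)
```

Running both systems through the same function on the same chunking is what makes a parity claim like the one above checkable.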

See the full benchmark →
