I Benchmarked My Own Libraries Against LlamaIndex, Mem0, Tavily, DeepEval and a Raw Sandbox — Honestly
I already benchmarked my agent kernel against LangGraph and CrewAI. But the kernel sits on top of a suite of standalone libraries — a RAG engine, a memory store, a code sandbox, a web-search layer, an eval harness — and the obvious follow-up question is: are those any good on their own, versus the tools people actually reach for?
So I put five of them head-to-head with their named incumbents — LlamaIndex, Mem0, a raw OS subprocess, Tavily, and DeepEval — under one rule I’ve learned to trust: control the confounds, run the competitor in its best configuration, and publish where I lose as plainly as where I win.
Here’s the honest part up front: none of them was a clean sweep. Some are “competitive with a specific edge,” one is “useful but explicitly bounded,” and two are “functional parity, not a knockout.” And — consistent with every benchmark I’ve run — each one surfaced a real bug, all now fixed. That last part is the actual value: chasing fair numbers is the best bug-finder I have.
1. ragmcp vs LlamaIndex — the retriever is at exact parity
The trap in any RAG comparison is the embedding model: if two systems embed with different models,
you’re benchmarking the models, not the frameworks. So I gave both sides the same embedder —
fastembed bge-small-en-v1.5, identical weights, wrapped for LlamaIndex too — the same 20-document
corpus, and by-construction ground truth (each query paraphrases one fact in one document, so the
relevant doc is determined by construction, not by my opinion).
| Metric | ragmcp | LlamaIndex |
|---|---|---|
| recall@1 | 0.925 | 0.875 |
| recall@3 | 1.000 | 0.975 |
| MRR@10 | 0.954 | 0.930 |
| ingest (40 docs) | 0.43 s | 0.66 s |
| query latency | 11.5 ms | 7.6 ms |
ragmcp looks ahead — but I don’t get to claim a retrieval win, and here’s why. Force one chunk
per document on both sides, removing chunking as a variable, and the two become identical on
every metric (recall@1 0.875, MRR 0.93). Same embedder, same cosine, same ranking — of course
they tie. ragmcp’s default-config edge comes entirely from its finer default chunk (500 chars
vs LlamaIndex’s 1024 tokens), which pinpoints the answer passage better. That’s a defaults
difference, not a smarter algorithm, and a LlamaIndex user lowering chunk_size would erase it.
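For readers who want to check the metric definitions, here is a minimal sketch of recall@k and MRR@k under by-construction ground truth (exactly one relevant document per query). The function names and the toy rankings are illustrative assumptions, not taken from the benchmark script.

```python
from typing import Dict, List


def recall_at_k(ranked: Dict[str, List[str]], gold: Dict[str, str], k: int) -> float:
    """Fraction of queries whose single relevant doc appears in the top k results."""
    hits = sum(1 for query, doc in gold.items() if doc in ranked[query][:k])
    return hits / len(gold)


def mrr_at_k(ranked: Dict[str, List[str]], gold: Dict[str, str], k: int = 10) -> float:
    """Mean reciprocal rank of the relevant doc, counting 0 when it is outside the top k."""
    total = 0.0
    for query, doc in gold.items():
        top = ranked[query][:k]
        if doc in top:
            total += 1.0 / (top.index(doc) + 1)
    return total / len(gold)


# Toy data (hypothetical): each query has exactly one relevant document.
gold = {"q1": "d1", "q2": "d2", "q3": "d3", "q4": "d4"}
ranked = {
    "q1": ["d1", "d7", "d9"],  # relevant doc at rank 1
    "q2": ["d5", "d2", "d8"],  # rank 2
    "q3": ["d6", "d4", "d3"],  # rank 3
    "q4": ["d9", "d8", "d7"],  # missed entirely
}

print(recall_at_k(ranked, gold, 1))
print(recall_at_k(ranked, gold, 3))
print(mrr_at_k(ranked, gold))
```

With one relevant document per query, recall@k is simply a hit rate, which is why the table values move in steps of one query.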
What’s actually real: ragmcp ingests ~1.5× faster in bulk (concurrent ingest), and LlamaIndex answers ~1.5× faster per query — ragmcp trades a few milliseconds for its audit/governance layer. Verdict: competitive on retrieval, opposite performance tradeoffs. Not a knockout — and I say so on the page.
On reranking, same honesty. ragmcp now ships a local cross-encoder reranker ($0, one flag). But I gave the same reranker to LlamaIndex too, at single-chunk parity — and they came out identical on every metric again (recall@1/@3/@5 0.95, MRR 0.954). A reranker is not a ragmcp algorithmic edge; same model, same candidates, same ranking. What ragmcp ships is the convenience — a $0 local reranker on by a single flag — not a smarter ranker. Measured so I can say exactly that.
Where ragmcp’s edge IS real: agentic multi-hop. On 2-hop questions (answer reachable only via a bridge entity, buried among distractors), plain single-shot retrieve-then-generate is structurally incapable — 0/10 (one query can’t surface evidence a follow-up query needs). ragmcp’s built-in ReAct-RAG — search, read, pivot to the entity, search again — recovers them: 10/10 with Haiku. That’s not a retrieval-quality contest (those tie); it’s the agentic loop ragmcp ships out of the box vs basic RAG. (LlamaIndex has its own agent engines too — this measures the value of the pattern, not a claim it can’t.)
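To make the structural point concrete, here is a toy sketch of why a single query misses a 2-hop answer and a second, pivoted query finds it. Everything in it is an assumption for illustration: the corpus is made up, the retriever is plain word overlap rather than embeddings, and the pivot step is a hard-coded regex standing in for the decision an LLM would make inside a ReAct loop. It is not ragmcp's implementation.

```python
import re
from typing import List

# Made-up toy corpus: one bridge fact, one answer fact, two distractors.
CORPUS = [
    "Northgate Foods has its headquarters in Leeds.",
    "Mara Lindqvist enjoys hiking on weekends.",
    "Mara Lindqvist works as an engineer at Brightwell Labs.",
    "Brightwell Labs has its headquarters in Tampere.",
]
QUESTION = "In which city is the headquarters of the company where Mara Lindqvist works?"


def tokens(text: str) -> set:
    return set(re.findall(r"[a-z]+", text.lower()))


def search(query: str, k: int = 1) -> List[str]:
    """Stand-in retriever: rank documents by word overlap with the query."""
    q = tokens(query)
    return sorted(CORPUS, key=lambda d: len(q & tokens(d)), reverse=True)[:k]


def single_shot(question: str, k: int = 2) -> List[str]:
    """One retrieval call; the evidence set is whatever that one query surfaces."""
    return search(question, k)


def two_hop(question: str) -> List[str]:
    """Search, read the bridge entity, then search again with it."""
    first = search(question, k=1)[0]
    # Hypothetical pivot: pull the employer name out of the first hit.
    match = re.search(r"at ([A-Z][A-Za-z ]+)\.", first)
    if match is None:
        return [first]
    second = search(f"{match.group(1)} headquarters", k=1)[0]
    return [first, second]


print(single_shot(QUESTION))
print(two_hop(QUESTION))
```

The question never names the bridge entity, so the answer document scores no better than a distractor on the first query; only the follow-up query, built from what the first hit says, can rank it on top.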
2. memorymcp vs Mem0 — and the confound that would have faked a win
This one is the clearest illustration of why “control the confounds” isn’t optional.
A naive setup puts both memory stores on ChromaDB and runs queries. Do that, and Mem0 scores 0/4 on a trivial recall probe — it returns the wrong fact every time. It would be easy, and dishonest, to publish that as a crushing win.
It’s a bug in Mem0’s Chroma integration: its adapter returns Chroma’s raw cosine distance as the
result score, but Mem0’s ranker treats score as a similarity (higher = better). So the
nearest memory — smallest distance — gets ranked last. That’s not a fair picture of Mem0; it’s
a second-class backend. The fix is to run each library on the backend it’s built for: Mem0 on its
default Qdrant (cosine-native, scores correctly — same probe: 5/5), memorymcp on Chroma. Same
embedder, both cosine, and Mem0 gets its full hybrid pipeline (spaCy entities + BM25 + vector).
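The failure mode is easy to reproduce in isolation. The sketch below is a generic illustration of a distance being ranked as if it were a similarity; the data and function names are hypothetical, and it is not Mem0's or Chroma's code.

```python
from typing import List, Tuple

# (memory text, cosine distance to the query), as a distance-native store might return it.
# Smaller distance means a closer match.
hits: List[Tuple[str, float]] = [
    ("likes green tea", 0.12),
    ("owns a bicycle", 0.55),
    ("lives in Porto", 0.81),
]


def rank_as_similarity(results: List[Tuple[str, float]]) -> List[Tuple[str, float]]:
    """Buggy: sorts 'higher is better' on a value that is actually a distance."""
    return sorted(results, key=lambda h: h[1], reverse=True)


def rank_fixed(results: List[Tuple[str, float]]) -> List[Tuple[str, float]]:
    """Convert cosine distance to similarity first, then sort 'higher is better'."""
    converted = [(text, 1.0 - dist) for text, dist in results]
    return sorted(converted, key=lambda h: h[1], reverse=True)


print(rank_as_similarity(hits)[0][0])  # the farthest memory comes out on top
print(rank_fixed(hits)[0][0])          # the nearest memory comes out on top
```

A probe that only checks the top result will score zero under the buggy ordering even though the store retrieved the right candidates, which is what makes this confound so easy to mistake for a quality gap.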
With the confound gone, the real result — 40 facts, 40 paraphrased queries:
| Metric | memorymcp | memorymcp +rerank | Mem0 |
|---|---|---|---|
| recall@1 | 0.775 | 0.975 | 0.500 |
| recall@3 | 0.825 | 0.975 | 0.825 |
| recall@5 | 0.900 | 0.975 | 0.925 |
| MRR@10 | 0.826 | 0.979 | 0.679 |
| ingest (40 facts) | 1.6 s / $0 | $0 | 5.2 s |
| cost / 8 facts (extraction) | $0.00 | $0.00 | $0.071 |
Update: memorymcp now ships an opt-in cross-encoder reranker (enable_rerank=True,
fastembed, $0, no new heavy dep). With it on, memorymcp leads every recall metric —
including recall@5 (0.975 vs Mem0’s 0.925), the one place it used to trail — still at $0.
The honest cost is latency: the reranker is a second pass, so it’s off by default and you
turn it on when recall matters more than a few hundred milliseconds.
Out of the box (no reranker), recall is close and split: memorymcp ranks the target higher
(recall@1, MRR), the two tie at recall@3, and Mem0 edges ahead at recall@5 (0.925 vs 0.900,
one query out of 40). The defensible gap is cost: memorymcp’s deterministic
PatternFactExtractor ingests free, where Mem0’s intended LLM extraction runs ~$0.009/fact.
Mem0 answers faster per query (and the reranker, when on, adds latency).
And the caveat I put against myself: my dataset is pure paraphrase with deliberately low lexical overlap, which under-uses Mem0’s BM25 keyword layer — turning that layer on actually lowered Mem0’s recall@5 here. On keyword-heavy queries, Mem0 would close the gap. This is a semantic-retrieval result, not a universal one.
3. sandboxmcp vs raw subprocess — real containment, honestly bounded
The sandbox question is binary: when malicious code runs, does the unsafe action happen? I wrote
20 payloads, each printing a token only if its attack actually executed (harmless temp-file and
echo targets — nothing real is touched), and ran them through a hardened sandbox config and a raw
subprocess control.
| Attack class | raw subprocess | sandboxmcp (process, hardened) |
|---|---|---|
| Known dangerous patterns (12) | 12 leaked | 12 blocked |
| Validator bypasses (6) | 6 leaked | 6 leaked |
| Resource exhaustion (2) | 1 leaked | 2 contained |
| Total neutralized | 1 / 20 | 14 / 20 |
The hardened static validator blocks the entire known-pattern set — including obfuscated exec
and getattr(__builtins__, ...) — and OS limits kill the infinite loop and the 2 GB allocation.
But the process backend has no kernel isolation, so deliberate bypasses (os.popen, pathlib
reads, urllib) sail through. I publish those six leaks rather than hide them, because a static
scanner is bypassable by design.
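A minimal sketch of why a static scanner catches known patterns but not deliberate bypasses. The denylist and function names here are hypothetical and far smaller than a real validator; this is not sandboxmcp's code, only an illustration of the technique using Python's standard ast module.

```python
import ast
from typing import List

# Hypothetical denylist for illustration only.
DENYLIST = {"exec", "eval", "system"}


def find_violations(source: str) -> List[str]:
    """Walk the AST and report calls that match the denylist."""
    findings = []
    for node in ast.walk(ast.parse(source)):
        if not isinstance(node, ast.Call):
            continue
        func = node.func
        # Plain names (exec) and attribute calls (os.system) carry the name differently.
        name = func.id if isinstance(func, ast.Name) else getattr(func, "attr", None)
        if name in DENYLIST:
            findings.append(f"call to {name}")
        if (
            name == "getattr"
            and node.args
            and isinstance(node.args[0], ast.Name)
            and node.args[0].id == "__builtins__"
        ):
            findings.append("getattr on __builtins__")
    return findings


def is_allowed(source: str) -> bool:
    """Every finding gates rejection; a warning that does not block is not a control."""
    return not find_violations(source)


obfuscated = "getattr(__builtins__, 'ex' + 'ec')('print(1)')"
bypass = "import pathlib\nprint(pathlib.Path('/etc/hostname').read_text())"

print(is_allowed("exec('print(1)')"))  # known pattern
print(is_allowed(obfuscated))          # obfuscated, still a recognizable shape
print(is_allowed(bypass))              # nothing on the denylist, so it passes
```

The scanner only parses the payloads, it never runs them. The pathlib read passes because nothing in it resembles a listed pattern, and extending the list just moves the gap to the next unlisted API.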
That’s exactly what the Docker backend is for — and I’d written “only Docker would contain these” in the report without testing it. So I tested it. With the validator turned off entirely (measuring runtime isolation, not pattern matching), against payloads that try to read a real host file, write to the host, and open real network connections:
| Host-impact payloads contained | subprocess | process | docker |
|---|---|---|---|
| read/write host file, egress, 2 GB, loop | 1 / 5 | 3 / 5 | 5 / 5 |
Docker contains 5/5 — including the host filesystem reads and writes the process backend leaks.
The claim is now a measured fact. (Along the way I learned the process backend’s network “block” is
a Python-level socket.connect shim, bypassable in two lines via the C module — so I documented
that precisely too, rather than overstate it.)
Update — 5/5 without Docker, on Linux. The process backend now has a hardened=True mode
that wraps execution behind Landlock (kernel filesystem confinement) + a user/network
namespace — no container runtime. Same 5 host-impact payloads:
| Host-impact payloads contained | subprocess | process | process hardened | docker |
|---|---|---|---|---|
| read/write host file, egress, 2 GB, loop | 1 / 5 | 3 / 5 | 5 / 5 | 5 / 5 |
Landlock blocks the host FS read/write that the plain process backend leaks; the namespace kills
egress at the kernel (stronger than the bypassable socket shim). It matches Docker, without
Docker. Honest scope: Linux only (needs Landlock, which requires kernel 5.13 or newer, plus
util-linux unshare); on Windows/macOS the flag is a no-op and Docker stays the answer. A benign
program still runs and writes inside its workdir, so it’s real containment, not “everything crashes.”
Verdict: real defense-in-depth with zero dependencies, not an unescapable sandbox. Use the
process backend for trusted code with guardrails; use hardened=True (Linux) or Docker to contain
code you don’t trust.
4. websearchmcp vs Tavily — you can’t out-index a crawler, but you can match the answer
websearchmcp aggregates free engines (SearXNG, DuckDuckGo, Mojeek, Brave). Tavily runs its own real-time crawl + index. On raw result quality that’s not a fair fight, and I won’t pretend it is: on 10 factual queries, an authoritative source (Wikipedia, .gov, official docs…) landed in the top-3 for Tavily 6/10 vs websearchmcp 3/10. We lose the index contest, structurally.
But an agent doesn’t consume the result list — it consumes the answer. So I measured that. With
search_with_answer (fetch the top sources, trim to the query-relevant passages, let your LLM
synthesize), websearchmcp answered 5/6 factual questions correctly — exactly matching Tavily’s
5/6 (the shared miss is an Everest height both phrase as “8,848”, which the strict substring
grader counts against both), at $0 search cost and no API key.
| Metric | Tavily | websearchmcp |
|---|---|---|
| Answer correctness (6 factual Qs) | 5/6 | 5/6 |
| Raw authority@3 (10 Qs) | 6/10 | 3/10 |
| Search cost | paid (credits) | $0 / no key |
| Latency | ~1.3 s (index) | ~22 s (live fetch) |
The revealing detail: websearchmcp’s top results were still SEO pages — yet the answer was right, because the LLM extracts the fact from any decent fetched page. Tavily does the same thing (its own top-3 sometimes included quora/instagram, answer still correct). The answer layer compensates for mediocre sources — on both sides. So the honest verdict: you will not beat a paid index on freshness or raw ranking, but you can match its actual deliverable — the cited answer — for free. The price you pay is latency (we crawl live; they serve from an index).
5. evalmcp vs DeepEval — the judge was broken until the benchmark caught it
An eval library lives or dies by one thing: does its LLM-judge agree with a human? So I hand-labeled 24 correctness cases (clear right/wrong, plus paraphrases that fool substring matching and plausible-but-wrong answers that fool lazy judges) and ran each library’s judge on the same LLM (Haiku) — so the number measures the judge’s prompt + logic, not the model.
| Judge | Accuracy | F1 | Cohen’s κ |
|---|---|---|---|
| evalmcp contains (no LLM) | 0.83 | 0.83 | 0.68 |
| evalmcp LLM judge | 1.00 | 1.00 | 1.00 |
| DeepEval GEval | 1.00 | 1.00 | 1.00 |
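The three agreement columns come from a plain 2×2 confusion matrix between human labels and judge verdicts. A minimal sketch, assuming binary labels (1 = correct, 0 = wrong); the function name and toy labels are illustrative, not the benchmark's data.

```python
from typing import List, Tuple


def agreement(human: List[int], judge: List[int]) -> Tuple[float, float, float]:
    """Return (accuracy, F1, Cohen's kappa) for binary labels, 1 = correct."""
    n = len(human)
    pairs = list(zip(human, judge))
    tp = sum(h == 1 and j == 1 for h, j in pairs)
    tn = sum(h == 0 and j == 0 for h, j in pairs)
    fp = sum(h == 0 and j == 1 for h, j in pairs)
    fn = sum(h == 1 and j == 0 for h, j in pairs)

    accuracy = (tp + tn) / n
    f1 = 2 * tp / (2 * tp + fp + fn) if (2 * tp + fp + fn) else 0.0

    # Chance agreement: probability both say "correct" plus probability both say "wrong".
    p_e = ((tp + fn) / n) * ((tp + fp) / n) + ((tn + fp) / n) * ((tn + fn) / n)
    kappa = (accuracy - p_e) / (1 - p_e) if p_e != 1 else 1.0
    return accuracy, f1, kappa


# Toy labels (hypothetical): the judge disagrees with the human on two of eight cases.
human = [1, 1, 1, 1, 0, 0, 0, 0]
judge = [1, 1, 1, 0, 0, 0, 0, 1]
print(agreement(human, judge))
```

Kappa discounts the agreement a judge would get by chance, which is why it sits below accuracy whenever the judge is imperfect.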
Parity with DeepEval, but the first run didn’t say that. It said 0.42 accuracy, F1 0.00: the
LLM judge marked everything wrong, scoring worse than the no-LLM substring baseline. Two bugs:
a .format-style {{ }} left inside an f-string (so a literal {{...}} went to the model), and a
bare json.loads on the reply (so any reply wrapped in a json code fence failed to parse and
silently scored 0).
Fixed: robust JSON extraction + the question added to the prompt → 0.42 → 1.00, with regression
tests. Honest caveats: the dataset is small and clear-cut (both good judges hit the ceiling, so
this is parity, not a knockout), and evalmcp grades correctness, not Ragas-style
faithfulness/context-recall (a real functional gap, not a bug). I also wanted Ragas in the
table, but its install pulled a broken transitive dependency in my environment and import ragas
itself fails, so I’m not going to fake a number for it.
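A minimal sketch of the kind of tolerant parsing that avoids the second bug. The function name and the verdict schema are assumptions for illustration, not evalmcp's actual code; the idea is simply to look inside code fences first, fall back to the raw reply, and return None on failure so the caller can surface the error.

```python
import json
import re
from typing import Any, Dict, Optional

TICKS = "`" * 3  # built at runtime so this example does not contain a literal fence
FENCE = re.compile(TICKS + r"(?:json)?\s*(.*?)" + TICKS, re.DOTALL | re.IGNORECASE)


def extract_json(reply: str) -> Optional[Dict[str, Any]]:
    """Return the first JSON object found in an LLM reply, or None if there is none."""
    # Try fenced blocks first, then the whole reply as a fallback.
    candidates = FENCE.findall(reply) + [reply]
    for text in candidates:
        start, end = text.find("{"), text.rfind("}")
        if start == -1 or end <= start:
            continue
        try:
            parsed = json.loads(text[start:end + 1])
        except json.JSONDecodeError:
            continue
        if isinstance(parsed, dict):
            return parsed
    return None


fenced = TICKS + 'json\n{"correct": true, "reason": "matches"}\n' + TICKS
print(extract_json(fenced))
print(extract_json('Sure! {"correct": false}'))
print(extract_json("no json here"))
```

Returning None instead of a default score matters as much as the parsing: a judge that cannot read its own reply should fail loudly, since a silent 0 is indistinguishable from a real verdict.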
The meta-result: fair benchmarking is a bug-finder
Across these runs I found and fixed:
- memorymcp’s default semantic store was silently degraded — a dependency had changed an API, the error was swallowed, and it fell back to keyword search that mis-ranked facts (1/4 correct). Trying to benchmark it as a semantic store is what exposed it. Fixed, with a regression test that now fails loudly if it ever degrades again.
- sandboxmcp’s hardened mode didn’t actually block obfuscated payloads: its AST layer detected aliased exec and getattr(__builtins__), but only the regex findings gated rejection, so the AST warnings were emitted and the code executed anyway. The security benchmark caught it; two payloads went from leaked to blocked.
- A confound that would have faked a Mem0 “win”, caught before publishing, not after.
- evalmcp’s LLM judge was silently scoring everything wrong: a leftover {{ }} and a brittle json.loads made it fail on fenced replies. Benchmarking it against DeepEval is what exposed it; it went from 0.42 to 1.00 agreement.
That’s the pattern in all of it. The point of benchmarking honestly isn’t a trophy — it’s that the discipline of making a comparison fair drags your own bugs into the light. Every number on the benchmarks page reproduces from a script with raw JSON in its repo. Including the ones where I lose.
ragmcp, memorymcp, sandboxmcp and the full suite are open source:
GitHub. Each benchmark lives in its library’s benchmarks/ folder.