June 4, 2026 · gashel01

I Benchmarked My Own Libraries Against LlamaIndex, Mem0, Tavily, DeepEval and a Raw Sandbox — Honestly

I already benchmarked my agent kernel against LangGraph and CrewAI. But the kernel sits on top of a suite of standalone libraries — a RAG engine, a memory store, a code sandbox, a web-search layer, an eval harness — and the obvious follow-up question is: are those any good on their own, versus the tools people actually reach for?

So I put five of them head-to-head with their named incumbents — LlamaIndex, Mem0, a raw OS subprocess, Tavily, and DeepEval — under one rule I’ve learned to trust: control the confounds, run the competitor in its best configuration, and publish where I lose as plainly as where I win.

Here’s the honest part up front: none of them was a clean sweep. Some are “competitive with a specific edge,” one is “useful but explicitly bounded,” and two are “functional parity, not a knockout.” And — consistent with every benchmark I’ve run — each one surfaced a real bug, all now fixed. That last part is the actual value: chasing fair numbers is the best bug-finder I have.


1. ragmcp vs LlamaIndex — the retriever is at exact parity

The trap in any RAG comparison is the embedding model: if two systems embed with different models, you’re benchmarking the models, not the frameworks. So I gave both sides the same embedder (fastembed bge-small-en-v1.5, identical weights, wrapped for LlamaIndex too), the same 40-document corpus, and by-construction ground truth: each query paraphrases one fact in one document, so the relevant doc is determined by construction, not by my opinion.

| Metric | ragmcp | LlamaIndex |
| --- | --- | --- |
| recall@1 | 0.925 | 0.875 |
| recall@3 | 1.000 | 0.975 |
| MRR@10 | 0.954 | 0.930 |
| ingest (40 docs) | 0.43 s | 0.66 s |
| query latency | 11.5 ms | 7.6 ms |

ragmcp looks ahead — but I don’t get to claim a retrieval win, and here’s why. Force one chunk per document on both sides, removing chunking as a variable, and the two become identical on every metric (recall@1 0.875, MRR 0.93). Same embedder, same cosine, same ranking — of course they tie. ragmcp’s default-config edge comes entirely from its finer default chunk (500 chars vs LlamaIndex’s 1024 tokens), which pinpoints the answer passage better. That’s a defaults difference, not a smarter algorithm, and a LlamaIndex user lowering chunk_size would erase it.
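For reference, recall@k and MRR@k reduce to a few lines when every query has exactly one relevant document, as in the by-construction setup described above. This is a minimal sketch, not the benchmark script itself: the function names and the toy rankings are made up for illustration.

```python
from typing import Dict, List


def recall_at_k(ranked: Dict[str, List[str]], gold: Dict[str, str], k: int) -> float:
    """Fraction of queries whose single relevant doc appears in the top k."""
    hits = sum(1 for query, docs in ranked.items() if gold[query] in docs[:k])
    return hits / len(ranked)


def mrr_at_k(ranked: Dict[str, List[str]], gold: Dict[str, str], k: int = 10) -> float:
    """Mean of 1/rank of the relevant doc, counting 0 when it is outside the top k."""
    total = 0.0
    for query, docs in ranked.items():
        top = docs[:k]
        if gold[query] in top:
            total += 1.0 / (top.index(gold[query]) + 1)
    return total / len(ranked)


# Toy data (hypothetical): four queries, one relevant document each.
gold = {"q1": "d1", "q2": "d2", "q3": "d3", "q4": "d4"}
ranked = {
    "q1": ["d1", "d9", "d8"],  # relevant doc at rank 1
    "q2": ["d7", "d2", "d8"],  # relevant doc at rank 2
    "q3": ["d3", "d1", "d2"],  # relevant doc at rank 1
    "q4": ["d5", "d6", "d7"],  # relevant doc not retrieved
}

print(recall_at_k(ranked, gold, 1))  # 0.5
print(recall_at_k(ranked, gold, 3))  # 0.75
print(mrr_at_k(ranked, gold))        # (1 + 0.5 + 1 + 0) / 4 = 0.625
```

Because both systems are scored by the same functions on the same rankings, any remaining difference has to come from what was retrieved, which is why chunking shows up as the only variable.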

What’s actually real: ragmcp ingests ~1.5× faster in bulk (concurrent ingest), and LlamaIndex answers ~1.5× faster per query — ragmcp trades a few milliseconds for its audit/governance layer. Verdict: competitive on retrieval, opposite performance tradeoffs. Not a knockout — and I say so on the page.

On reranking, same honesty. ragmcp now ships a local cross-encoder reranker ($0, one flag). But I gave the same reranker to LlamaIndex too, at single-chunk parity — and they came out identical on every metric again (recall@1/@3/@5 0.95, MRR 0.954). A reranker is not a ragmcp algorithmic edge; same model, same candidates, same ranking. What ragmcp ships is the convenience — a $0 local reranker on by a single flag — not a smarter ranker. Measured so I can say exactly that.

Where ragmcp’s edge IS real: agentic multi-hop. On 2-hop questions (answer reachable only via a bridge entity, buried among distractors), plain single-shot retrieve-then-generate is structurally incapable — 0/10 (one query can’t surface evidence a follow-up query needs). ragmcp’s built-in ReAct-RAG — search, read, pivot to the entity, search again — recovers them: 10/10 with Haiku. That’s not a retrieval-quality contest (those tie); it’s the agentic loop ragmcp ships out of the box vs basic RAG. (LlamaIndex has its own agent engines too — this measures the value of the pattern, not a claim it can’t.)
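To make the structural point concrete, here is a toy sketch of why a single retrieval cannot answer a 2-hop question while a search, read, pivot, search loop can. Everything in it is hypothetical: the corpus is invented, keyword overlap stands in for vector search, and regular expressions stand in for the LLM that would decide the pivot in a real ReAct loop. It is not ragmcp’s implementation.

```python
import re
from typing import List, Optional

# Toy corpus (made up): the answer sits in doc_b, reachable only via doc_a.
CORPUS = {
    "doc_a": "The Orion probe was designed by Mira Castell.",
    "doc_b": "Mira Castell was born in Tallinn.",
    "doc_c": "The Orion probe launched in 2031.",
    "doc_d": "Tallinn is the capital of Estonia.",
}
STOPWORDS = {"the", "was", "of", "in", "is", "by", "where"}


def terms(text: str) -> set:
    """Lowercased content words of a text."""
    return set(re.findall(r"[a-z0-9]+", text.lower())) - STOPWORDS


def search(query: str, k: int = 2) -> List[str]:
    """Rank documents by shared terms (a stand-in for vector search)."""
    q = terms(query)
    ranked = sorted(CORPUS, key=lambda d: (-len(q & terms(CORPUS[d])), d))
    return [CORPUS[d] for d in ranked[:k]]


def single_shot(question: str) -> Optional[str]:
    """Retrieve once, then try to read the answer from that context."""
    context = " ".join(search(question))
    match = re.search(r"born in (\w+)", context)
    return match.group(1) if match else None


def two_hop(question: str) -> Optional[str]:
    """Retrieve, extract the bridge entity, then retrieve again about it."""
    first = " ".join(search(question))
    bridge = re.search(r"designed by ([A-Z]\w+ [A-Z]\w+)", first)
    if bridge is None:
        return None
    second = " ".join(search(f"{bridge.group(1)} born"))
    match = re.search(r"born in (\w+)", second)
    return match.group(1) if match else None


QUESTION = "Where was the designer of the Orion probe born?"
print(single_shot(QUESTION))  # None: the top hits only mention the probe
print(two_hop(QUESTION))      # Tallinn
```

The first query matches the two probe documents and never surfaces the birthplace, because the question does not contain the designer’s name. Only the follow-up query, built from what the first hop returned, can reach it.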


2. memorymcp vs Mem0 — and the confound that would have faked a win

This one is the clearest illustration of why “control the confounds” isn’t optional.

A naive setup puts both memory stores on ChromaDB and runs queries. Do that, and Mem0 scores 0/4 on a trivial recall probe — it returns the wrong fact every time. It would be easy, and dishonest, to publish that as a crushing win.

It’s a bug in Mem0’s Chroma integration: its adapter returns Chroma’s raw cosine distance as the result score, but Mem0’s ranker treats score as a similarity (higher = better). So the nearest memory, the one with the smallest distance, gets ranked last. That’s not a fair picture of Mem0; it’s a second-class backend. The fix is to run each library on the backend it’s built for: Mem0 on its default Qdrant (cosine-native, scores correctly; same probe: 4/4), memorymcp on Chroma. Same embedder, both cosine, and Mem0 gets its full hybrid pipeline (spaCy entities + BM25 + vector).
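The distance-versus-similarity mix-up is easy to reproduce in isolation. The sketch below is not Mem0’s or Chroma’s code; the vectors and names are invented, and it only shows what happens when a cosine distance is sorted as if it were a similarity.

```python
import math
from typing import Dict, List


def cosine_similarity(a: List[float], b: List[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


def cosine_distance(a: List[float], b: List[float]) -> float:
    # Smaller means closer, the opposite direction of a similarity.
    return 1.0 - cosine_similarity(a, b)


query = [1.0, 0.0]
memories: Dict[str, List[float]] = {  # hypothetical embeddings
    "relevant": [0.9, 0.1],
    "partial": [0.5, 0.5],
    "unrelated": [0.0, 1.0],
}

# What a distance-returning backend would hand back as the "score".
raw_scores = {name: cosine_distance(query, vec) for name, vec in memories.items()}

# Confound: rank as if a higher score were better.
buggy = sorted(raw_scores, key=lambda n: raw_scores[n], reverse=True)

# Correct handling: convert distance to similarity before ranking.
fixed = sorted(raw_scores, key=lambda n: 1.0 - raw_scores[n], reverse=True)

print(buggy)  # ['unrelated', 'partial', 'relevant']
print(fixed)  # ['relevant', 'partial', 'unrelated']
```

The buggy ordering is the exact reverse of the correct one, which is consistent with a probe that returns the wrong fact every time rather than merely a noisy one.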

With the confound gone, the real result — 40 facts, 40 paraphrased queries:

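| Metric | memorymcp | memorymcp +rerank | Mem0 |
| --- | --- | --- | --- |
| recall@1 | 0.775 | 0.975 | 0.500 |
| recall@3 | 0.825 | 0.975 | 0.825 |
| recall@5 | 0.900 | 0.975 | 0.925 |
| MRR@10 | 0.826 | 0.979 | 0.679 |
| ingest (40 facts) | 1.6 s / $0 | $0 | 5.2 s |
| cost / 8 facts (extraction) | $0.00 | $0.00 | $0.071 |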

Update: memorymcp now ships an opt-in cross-encoder reranker (enable_rerank=True, fastembed, $0, no new heavy dep). With it on, memorymcp leads every recall metric — including recall@5 (0.975 vs Mem0’s 0.925), the one place it used to trail — still at $0. The honest cost is latency: the reranker is a second pass, so it’s off by default and you turn it on when recall matters more than a few hundred milliseconds.

Out of the box (no reranker), recall is close and split — memorymcp ranks the target higher (recall@1, MRR), Mem0 catches a couple more in the tail. The defensible gap is cost: memorymcp’s deterministic PatternFactExtractor ingests free, where Mem0’s intended LLM extraction runs ~$0.009/fact. Mem0 answers faster per query (and the reranker, when on, adds latency).

And the caveat I put against myself: my dataset is pure paraphrase with deliberately low lexical overlap, which under-uses Mem0’s BM25 keyword layer — turning that layer on actually lowered Mem0’s recall@5 here. On keyword-heavy queries, Mem0 would close the gap. This is a semantic-retrieval result, not a universal one.


3. sandboxmcp vs raw subprocess — real containment, honestly bounded

The sandbox question is binary: when malicious code runs, does the unsafe action happen? I wrote 20 payloads, each printing a token only if its attack actually executed (harmless temp-file and echo targets — nothing real is touched), and ran them through a hardened sandbox config and a raw subprocess control.

| Attack class | raw subprocess | sandboxmcp (process, hardened) |
| --- | --- | --- |
| Known dangerous patterns (12) | 12 leaked | 12 blocked |
| Validator bypasses (6) | 6 leaked | 6 leaked |
| Resource exhaustion (2) | 1 leaked | 2 contained |
| Total neutralized | 1 / 20 | 14 / 20 |
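The "token only prints if the attack ran" idea can be sketched with nothing but the standard library. This is an illustrative harness for the raw-subprocess control only, not sandboxmcp’s benchmark code: the token value, payload names, and timeout are assumptions, and both payloads are harmless.

```python
import subprocess
import sys

TOKEN = "LEAK-7f3a"  # hypothetical marker, printed only if the payload's action ran

PAYLOADS = {
    # Creates and removes a harmless temp file, then reports whether it worked.
    "temp_write": (
        "import os, tempfile\n"
        "fd, path = tempfile.mkstemp()\n"
        "os.close(fd)\n"
        "wrote = os.path.exists(path)\n"
        "os.remove(path)\n"
        f"print('{TOKEN}' if wrote else '')\n"
    ),
    # Reaches the print only if nothing ever stops the loop.
    "infinite_loop": f"while True:\n    pass\nprint('{TOKEN}')\n",
}


def leaked(code: str, timeout: float = 5.0) -> bool:
    """Run a payload in a child interpreter; True if its token reached stdout."""
    try:
        result = subprocess.run(
            [sys.executable, "-c", code],
            capture_output=True,
            text=True,
            timeout=timeout,
        )
    except subprocess.TimeoutExpired:
        return False  # the child was killed before it could print the token
    return TOKEN in result.stdout


results = {name: leaked(code) for name, code in PAYLOADS.items()}
print(results)  # {'temp_write': True, 'infinite_loop': False}
```

Scoring on the token rather than on exit codes keeps the result binary: a payload that crashes, is rejected, or is killed all count as contained, and only an action that actually completed counts as a leak.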

The hardened static validator blocks the entire known-pattern set — including obfuscated exec and getattr(__builtins__, ...) — and OS limits kill the infinite loop and the 2 GB allocation. But the process backend has no kernel isolation, so deliberate bypasses (os.popen, pathlib reads, urllib) sail through. I publish those six leaks rather than hide them, because a static scanner is bypassable by design.
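A small AST-based check shows both halves of that paragraph: why aliased exec and getattr(__builtins__, ...) are catchable statically, and why os.popen is not unless it is listed explicitly. This is a simplified sketch with a made-up blocklist, not sandboxmcp’s validator. Note that the findings themselves gate rejection here; a validator that only warns on them would still run the code.

```python
import ast
from typing import List

# Hypothetical blocklist; a real validator would cover far more.
BLOCKED_NAMES = {"exec", "eval", "compile", "__import__", "__builtins__"}


def find_violations(source: str) -> List[str]:
    """Flag any reference to a blocked name, whether it is called or only aliased."""
    violations = []
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Name) and node.id in BLOCKED_NAMES:
            violations.append(f"line {node.lineno}: reference to {node.id}")
    return violations


def is_allowed(source: str) -> bool:
    # Any finding rejects the code.
    return not find_violations(source)


# The samples are only parsed, never executed.
SAMPLES = {
    "aliased_exec": "run = exec\nrun('print(1)')",
    "getattr_builtins": "getattr(__builtins__, 'ex' + 'ec')('print(1)')",
    "os_popen_bypass": "import os\nprint(os.popen('echo hi').read())",
    "benign": "print(sum(range(10)))",
}

for name, code in SAMPLES.items():
    print(name, "allowed" if is_allowed(code) else "blocked")
```

The os.popen sample passes because nothing in it matches the blocklist. Adding "os" to the list would catch this one and miss the next variant, which is the sense in which a static scanner is bypassable by design.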

That’s exactly what the Docker backend is for — and I’d written “only Docker would contain these” in the report without testing it. So I tested it. With the validator turned off entirely (measuring runtime isolation, not pattern matching), against payloads that try to read a real host file, write to the host, and open real network connections:

| Host-impact payloads contained | subprocess | process | docker |
| --- | --- | --- | --- |
| read/write host file, egress, 2 GB, loop | 1 / 5 | 3 / 5 | 5 / 5 |

Docker contains 5/5 — including the host filesystem reads and writes the process backend leaks. The claim is now a measured fact. (Along the way I learned the process backend’s network “block” is a Python-level socket.connect shim, bypassable in two lines via the C module — so I documented that precisely too, rather than overstate it.)

Update — 5/5 without Docker, on Linux. The process backend now has a hardened=True mode that wraps execution behind Landlock (kernel filesystem confinement) + a user/network namespace — no container runtime. Same 5 host-impact payloads:

| Host-impact payloads contained | subprocess | process | process hardened | docker |
| --- | --- | --- | --- | --- |
| read/write host file, egress, 2 GB, loop | 1 / 5 | 3 / 5 | 5 / 5 | 5 / 5 |

Landlock blocks the host FS read/write that the plain process backend leaks; the namespace kills egress at the kernel (stronger than the bypassable socket shim). It matches Docker, without Docker. Honest scope: Linux only (needs a kernel with Landlock, i.e. 5.13 or newer, plus util-linux unshare); on Windows/macOS the flag is a no-op and Docker stays the answer. A benign program still runs and writes inside its workdir, so it’s real containment, not “everything crashes.”

Verdict: real defense-in-depth with zero dependencies, not an unescapable sandbox. Use the process backend for trusted code with guardrails; use hardened=True (Linux) or Docker to contain code you don’t trust.


4. websearchmcp vs Tavily — you can’t out-index a crawler, but you can match the answer

websearchmcp aggregates free engines (SearXNG, DuckDuckGo, Mojeek, Brave). Tavily runs its own real-time crawl + index. On raw result quality that’s not a fair fight, and I won’t pretend it is: on 10 factual queries, an authoritative source (Wikipedia, .gov, official docs…) landed in the top-3 for Tavily 6/10 vs websearchmcp 3/10. We lose the index contest, structurally.

But an agent doesn’t consume the result list — it consumes the answer. So I measured that. With search_with_answer (fetch the top sources, trim to the query-relevant passages, let your LLM synthesize), websearchmcp answered 5/6 factual questions correctly — exactly matching Tavily’s 5/6 (the shared miss is an Everest height both phrase as “8,848”, which the strict substring grader counts against both), at $0 search cost and no API key.
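The shared Everest miss is a grader artifact, and a few lines show the mechanism. The post does not show the gold string, so the expected value "8848" below is purely an assumption for illustration, as are the function names; the point is only that a thousands separator is enough to defeat a strict substring match.

```python
import re


def strict_contains(answer: str, expected: str) -> bool:
    """Strict grader: the expected string must appear verbatim (case-insensitive)."""
    return expected.lower() in answer.lower()


def normalized_contains(answer: str, expected: str) -> bool:
    """Same check after dropping thousands separators inside numbers."""
    def normalize(text: str) -> str:
        return re.sub(r"(?<=\d),(?=\d{3})", "", text.lower())
    return normalize(expected) in normalize(answer)


answer = "Mount Everest is 8,848 m tall."
expected = "8848"  # hypothetical gold string, not taken from the benchmark

print(strict_contains(answer, expected))      # False: the comma breaks the match
print(normalized_contains(answer, expected))  # True
```

Keeping the strict grader and reporting the miss against both systems is the conservative choice: it penalizes both sides equally instead of tuning the grader after seeing the answers.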

| | Tavily | websearchmcp |
| --- | --- | --- |
| Answer correctness (6 factual Qs) | 5/6 | 5/6 |
| Raw authority@3 (10 Qs) | 6/10 | 3/10 |
| Search cost | paid (credits) | $0 / no key |
| Latency | ~1.3 s (index) | ~22 s (live fetch) |

The revealing detail: websearchmcp’s top results were still SEO pages — yet the answer was right, because the LLM extracts the fact from any decent fetched page. Tavily does the same thing (its own top-3 sometimes included quora/instagram, answer still correct). The answer layer compensates for mediocre sources — on both sides. So the honest verdict: you will not beat a paid index on freshness or raw ranking, but you can match its actual deliverable — the cited answer — for free. The price you pay is latency (we crawl live; they serve from an index).


5. evalmcp vs DeepEval — the judge was broken until the benchmark caught it

An eval library lives or dies by one thing: does its LLM-judge agree with a human? So I hand-labeled 24 correctness cases (clear right/wrong, plus paraphrases that fool substring matching and plausible-but-wrong answers that fool lazy judges) and ran each library’s judge on the same LLM (Haiku) — so the number measures the judge’s prompt + logic, not the model.

| Judge | Accuracy | F1 | Cohen’s κ |
| --- | --- | --- | --- |
| evalmcp contains (no LLM) | 0.83 | 0.83 | 0.68 |
| evalmcp LLM judge | 1.00 | 1.00 | 1.00 |
| DeepEval GEval | 1.00 | 1.00 | 1.00 |
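For readers who want to check the three columns, here is a minimal sketch of how accuracy, F1 and Cohen’s κ follow from paired human and judge labels. The labels are toy data, not the 24 benchmark cases, and the function name is made up.

```python
from typing import Dict, List


def agreement(human: List[bool], judge: List[bool]) -> Dict[str, float]:
    """Accuracy, F1 (positive class = 'correct') and Cohen's kappa for binary labels."""
    n = len(human)
    tp = sum(h and j for h, j in zip(human, judge))
    tn = sum((not h) and (not j) for h, j in zip(human, judge))
    fp = sum((not h) and j for h, j in zip(human, judge))
    fn = sum(h and (not j) for h, j in zip(human, judge))

    accuracy = (tp + tn) / n
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0

    # Agreement expected by chance, from each rater's marginal label rates.
    p_chance = ((tp + fn) / n) * ((tp + fp) / n) + ((tn + fp) / n) * ((tn + fn) / n)
    kappa = (accuracy - p_chance) / (1 - p_chance) if p_chance < 1 else 1.0
    return {"accuracy": accuracy, "f1": f1, "kappa": kappa}


# Toy labels (hypothetical): True means "the answer is correct".
human = [True, True, True, False, False, False, True, False]
decent_judge = [True, True, False, False, False, True, True, False]
lazy_judge = [False] * len(human)  # marks every answer wrong

print(agreement(human, decent_judge))  # accuracy 0.75, f1 0.75, kappa 0.5
print(agreement(human, lazy_judge))    # accuracy 0.5, f1 0.0, kappa 0.0
```

The lazy judge illustrates why κ and F1 sit next to accuracy in the table: a judge that always gives the same verdict can still post a non-trivial accuracy, while κ and F1 drop to zero.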

Parity with DeepEval, but the first run didn’t say that. It said 0.42 accuracy, F1 0.00: the LLM judge marked everything wrong, scoring worse than the no-LLM substring baseline. Two bugs: a .format-style {{ }} left inside an f-string (so literal braces went to the model instead of the interpolated value), and a bare json.loads on the reply (so any json code fence around the reply made parsing fail and silently score 0). Fixed with robust JSON extraction and the question added to the prompt: 0.42 → 1.00, with regression tests.

Honest caveats: the dataset is small and clear-cut (both good judges hit the ceiling; this is parity, not a knockout), and evalmcp grades correctness, not Ragas-style faithfulness/context-recall (a real functional gap, not a bug). I also wanted Ragas in the table, but its install pulled a broken transitive dependency in my environment and import ragas itself fails, so I’m not going to fake a number for it.
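The json.loads failure mode is generic enough to sketch. The helper below is one possible shape for tolerant parsing of a judge reply, not evalmcp’s actual fix: the function name and the reply format with a "correct" field are assumptions. It tries a direct parse first, then falls back to the outermost braces, and raises instead of quietly scoring 0.

```python
import json


def extract_json(reply: str) -> dict:
    """Parse the JSON object in a model reply, tolerating code fences and stray prose."""
    try:
        return json.loads(reply)
    except json.JSONDecodeError:
        pass
    start, end = reply.find("{"), reply.rfind("}")
    if start == -1 or end <= start:
        # Fail loudly: a silent default here is how a judge ends up scoring 0 everywhere.
        raise ValueError(f"no JSON object found in judge reply: {reply!r}")
    return json.loads(reply[start : end + 1])


FENCE = "`" * 3  # built programmatically so this example can sit inside a fenced block
fenced_reply = f'{FENCE}json\n{{"correct": true, "reason": "matches"}}\n{FENCE}'
bare_reply = '{"correct": false, "reason": "wrong year"}'

print(extract_json(fenced_reply))  # {'correct': True, 'reason': 'matches'}
print(extract_json(bare_reply))    # {'correct': False, 'reason': 'wrong year'}
```

A regression test for this kind of bug only needs one fenced reply and one reply with no JSON at all: the first must parse, the second must raise.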


The meta-result: fair benchmarking is a bug-finder

Across these runs I found and fixed:

  1. memorymcp’s default semantic store was silently degraded — a dependency had changed an API, the error was swallowed, and it fell back to keyword search that mis-ranked facts (1/4 correct). Trying to benchmark it as a semantic store is what exposed it. Fixed, with a regression test that now fails loudly if it ever degrades again.
  2. sandboxmcp’s hardened mode didn’t actually block obfuscated payloads — its AST layer detected aliased exec and getattr(__builtins__), but only the regex findings gated rejection, so the AST warnings were emitted and then executed anyway. The security benchmark caught it; two payloads went from leaked to blocked.
  3. A confound that would have faked a Mem0 “win” — caught before publishing, not after.
  4. evalmcp’s LLM judge was silently scoring everything wrong — a leftover {{ }} and a brittle json.loads made it fail on fenced replies. Benchmarking it against DeepEval is what exposed it; it went from 0.42 to 1.00 agreement.

That’s the pattern in all of it. The point of benchmarking honestly isn’t a trophy — it’s that the discipline of making a comparison fair drags your own bugs into the light. Every number on the benchmarks page reproduces from a script with raw JSON in its repo. Including the ones where I lose.


ragmcp, memorymcp, sandboxmcp and the full suite are open source: GitHub. Each benchmark lives in its library’s benchmarks/ folder.