Measured, not asserted · Reproducible harness

How it actually compares, with numbers.

Two layers, one rule. The kernel runs each task as LTP (compile-once), ReAct (multi-turn), or Hybrid — measured against LangGraph and CrewAI. And the standalone libraries go head-to-head with their incumbents: ragmcp vs LlamaIndex, memorymcp vs Mem0, sandboxmcp vs a raw OS sandbox, websearchmcp vs Tavily, evalmcp vs DeepEval. Same model, same controls, losses shown as plainly as wins — every number reproduces from the repo.

Verifiable tasks
Same model & tools
Per-run isolation
Open harness

28 verifiable tasks, three modes, one model

Arithmetic, string ops, small algorithms and a knowledge hop — no network, so anyone gets the same answer. Every LLM call pinned to claude-haiku-4-5.

Mode | Success | Avg tokens / task | vs ReAct | Best for
ReAct (multi-turn) | 96.4% | 9,301 | - | knowledge & open-ended
LTP (compile-once) | 96.4% | 2,517 | −73% | deterministic compute (cost floor)
Hybrid (LTP-first + fallback) | 100% | 3,088 | −67% | mixed / unknown workloads

LTP calls the LLM once to compile the plan, not once per step, so its cost stays flat (~2.5k tokens) as the task deepens — as long as the added depth is deterministic execution (loops, code, tool calls), not extra LLM steps or a re-compile. ReAct re-sends the system prompt and tool schemas every turn, so its cost scales with turns. Hybrid is the only mode at 100% — but it is not the cheapest: pure LTP is. Hybrid spends a little more to convert LTP's fragile tail into reliability.
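
A back-of-envelope sketch of why the two curves diverge. It assumes ReAct re-sends a fixed prefix (system prompt plus tool schemas) on every turn and that history grows by a constant amount per turn. The prefix, per-turn and compile sizes are illustrative assumptions, not values measured by the harness.

```python
def react_tokens(turns: int, prefix: int = 2000, per_turn: int = 300) -> int:
    """Rough ReAct cost: every turn re-sends the prefix plus the history so far."""
    total = 0
    for t in range(1, turns + 1):
        # Turn t carries the static prefix and t turns' worth of conversation.
        total += prefix + per_turn * t
    return total


def ltp_tokens(steps: int, compile_cost: int = 2500) -> int:
    """Rough LTP cost: one compile call; `steps` is unused on purpose, because
    deterministic execution steps add no LLM tokens in this model."""
    return compile_cost


for depth in (1, 3, 6):
    print(depth, react_tokens(depth), ltp_tokens(depth))
```

Under these assumptions the ReAct total grows faster than linearly with turns while the LTP total does not move. The model deliberately ignores re-compiles and extra LLM steps, which are exactly the cases where the flat line stops holding.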

One result holds at every model size

The same 28-task suite, run on a weak, a mid, and a strong model. Each cell is success / avg tokens per task.

Model | ReAct | LTP | Hybrid
Haiku (weak) | 96.4% / 9,301 | 96.4% / 2,517 | 100% / 3,088
Sonnet (mid) | 100% / 9,502 | 100% / 2,925 | 100% / 2,611
Opus (strong) | 89.3% / 6,570 | 92.9% / 3,409 | 100% / 4,281

Hybrid is 100% at every tier — the only mode that is. The model shapes the rest: a weak model makes ReAct loop (9.3k tokens), so LTP's token edge is largest; a mid model (Sonnet) lifts everyone to 100% — the sweet spot; a strong model (Opus) answers in ~3 turns "in its head", which is cheaper but drops ReAct to 89.3% on over-confidence (it skipped tools and miscomputed). Hybrid absorbs both failure modes.

The real product surface: chaining actual tools

Seven tasks that must chain real suite tools — memory, files, working memory — not just the sandbox. Each goal demands a clean final answer and runs in a fresh namespace. Two repetitions per task.

Task | What it does | Reliable?
mem-recall | Store a fact in memory, recall it | solid
mem-count | Store 3 facts, then count them | solid
mem-update | Store a fact, overwrite it, recall the latest | solid
mem-then-file | Recall a number, compute, write & read a file | solid
file-upper | Write a file, read it, uppercase it | solid
working-mem | Set a working-memory value, read & double it | mostly
file-words | Write a file, read it, count the words | hard

Model | ReAct | LTP | Hybrid
Haiku | 78.6% / 16,709 | 71.4% / 6,924 | 100% / 12,244
Sonnet | 78.6% / 17,133 | 71.4% / 6,168 | 85.7% / 8,689
Opus | 71.4% / 22,490 | 64.3% / 8,618 | 78.6% / 25,072

Two honest findings. (1) Cost is the robust win: LTP routes the same tools at roughly 40% of ReAct's tokens on every model (for example, 6,924 vs 16,709 on Haiku). (2) Hybrid is the most reliable mode at every tier (100% / 85.7% / 78.6%), because its fallback rescues each pure mode's distinct misses: LTP can't finish working-mem, ReAct short-circuits file-words, and Hybrid takes whichever one works. Six of seven tasks are now solid across all models. The decline on stronger models comes down to one stubborn task: file-words, where Sonnet and Opus return the raw file object instead of counting, even when told to reply with only the number. It is the same over-confidence that drops Opus's ReAct score on the deterministic suite. We keep it in rather than quietly drop it.

Per-task receipts (avg tokens over 2 reps; ✗ = both reps failed, ½ = one of two):

Haiku (per-task)
Task | ReAct | LTP | Hybrid
mem-recall | 17,970 | 2,652 | 17,691
mem-count | 21,940 | 2,524 | 2,636
mem-update | 7,724 | 2,747 ✗ | 12,520
mem-then-file | 26,843 | 2,541 | 2,671
file-words | 18,173 ✗ | 17,047 | 17,053
file-upper | 10,759 | 18,449 | 13,333
working-mem | 13,560 ½ | 2,512 ✗ | 19,807

Sonnet (per-task)
Task | ReAct | LTP | Hybrid
mem-recall | 18,419 | 2,661 | 2,767
mem-count | 18,083 | 2,523 | 2,636
mem-update | 12,342 | 2,749 | 9,296
mem-then-file | 23,844 | 2,695 | 2,835
file-words | 18,162 ✗ | 16,730 ✗ | 18,748 ✗
file-upper | 10,753 | 13,299 | 13,300
working-mem | 18,330 ½ | 2,520 ✗ | 11,242

Opus (per-task)
Task | ReAct | LTP | Hybrid
mem-recall | 22,210 | 3,597 | 26,436
mem-count | 25,940 | 3,434 | 32,062
mem-update | 15,059 | 3,698 ½ | 11,592
mem-then-file | 33,116 | 3,682 | 37,808
file-words | 22,217 ✗ | 25,776 ✗ | 25,775 ✗
file-upper | 13,206 | 16,727 | 16,722
working-mem | 25,685 ✗ | 3,414 ✗ | 25,108 ½

vs LangGraph: the metric decides

The same 28 deterministic tasks through a standard LangGraph ReAct agent, same model, one functionally-equivalent Python tool. We'll say the uncomfortable part first: on raw token count, a near-empty agent always wins — a feature-rich kernel can't out-token it on one-tool arithmetic (LangGraph ~1.4k tokens, our lean LTP ~2.2k).

But raw tokens aren't the bill — cost is, and cost flips by model. Each cell is $/task / tokens:

Model | kernelmcp LTP | kernelmcp Hybrid | LangGraph | Cheaper
Haiku | $0.0023 / 2,194 | $0.0029 / 2,817 | $0.0019 / 1,424 | LangGraph (−17%)
Sonnet | $0.0020 / 2,515 | $0.0018 / 2,719 | $0.0057 / 1,466 | kernelmcp (~3×)

On Sonnet — a production model — kernelmcp LTP/Hybrid is ~3× cheaper than LangGraph and faster (2.2s vs 4.2s), despite sending more raw tokens. The reason is prompt caching: kernelmcp's large static prefix exceeds the cache minimum and is billed at ~10% on a hit (effective ~$0.78/M vs LangGraph's ~$3.9/M); LangGraph's minimal prompt is too small to cache, so it pays full price. On Haiku nothing caches below its larger floor, so there the lean agent stays ~17% cheaper.
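
The effective per-million rates can be re-derived from the Sonnet row of the table. The sketch below simply divides cost per task by tokens per task; because the table's dollar figures are rounded to four decimals, the result only approximates the quoted ~$0.78/M and ~$3.9/M. The function name is ours, not part of the harness.

```python
def effective_rate_per_million(cost_per_task: float, tokens_per_task: int) -> float:
    """Blended $ per million tokens implied by one table cell."""
    return cost_per_task / tokens_per_task * 1_000_000


# Sonnet row: kernelmcp LTP vs LangGraph (values copied from the table above).
kernel_rate = effective_rate_per_million(0.0020, 2515)
langgraph_rate = effective_rate_per_million(0.0057, 1466)

print(f"kernelmcp LTP: ${kernel_rate:.2f}/M")
print(f"LangGraph:     ${langgraph_rate:.2f}/M")
print(f"ratio:         {langgraph_rate / kernel_rate:.1f}x")
```

The point of the arithmetic is that a larger prompt billed mostly at the cached rate can cost less in total than a smaller prompt billed at full price.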

Fairness, openly: this is a bare LangGraph agent vs kernelmcp's full loop on arithmetic; LangGraph's tool is a plain exec vs our security sandbox; and the integrated workflows kernelmcp is built for (memory, 120+ tools, planning, scheduling) aren't portable to a stock agent, so they're not in this comparison. Pitch: cost-at-scale, latency, and the suite — not raw token count.

Same real tools: who orchestrates better?

The single-tool test above is unflattering by design. The apples-to-apples version puts three orchestrators on the same real tools over MCP: LangGraph and CrewAI both connect to the standalone memorymcp server (same filtered 6-tool set); kernelmcp runs a memory-only kernel. Nobody touches anyone else — the only variable is who orchestrates. Seven memory/working-memory tasks, two reps. Each cell is success / $ per task:

System | Haiku | Sonnet
LangGraph + memorymcp | 92.9% / $0.0055 | 92.9% / $0.0181
CrewAI + memorymcp | 100% / $0.0058 | 100% / $0.0231
kernelmcp ReAct | 85.7% / $0.0152 | 71.4% / $0.0329
kernelmcp LTP | 57.1% / $0.0023 | 71.4% / $0.0026
kernelmcp Hybrid | 100% / $0.0100 | 100% / $0.0075

Two honest reads. Reliability: CrewAI and kernelmcp Hybrid are the only systems at 100% on both models; LangGraph is close (92.9%); pure LTP is cheapest but unreliable on stateful chains. kernelmcp does not dominate reliability — CrewAI matches it. Cost: on Sonnet, kernelmcp Hybrid is the cheapest 100%-reliable system at $0.0075 — ~2.4× under LangGraph ($0.0181) and ~3× under CrewAI ($0.0231) (prompt caching). On Haiku (no caching) Hybrid is pricier than the lean agents. So the kernel's edge is real but conditional: cheapest-at-100% on a caching-capable production model, not on the cheapest tier.

Integrity note: CrewAI's built-in token counter undercounts on its litellm path (2,182 reported vs 8,502 actual on a Sonnet task), so its cost here is measured by intercepting the Anthropic SDK directly — every real API call counted. Caveats: 7 memory tasks only (file tasks excluded — tool names differ across the standalone servers); LangGraph/CrewAI latency includes a fresh memorymcp subprocess spawned per run, so latency isn't comparable.

Deep agentic chains: the kernel pulls ahead

The tests above are shallow. The realistic agentic surface is deep chains: five tasks that each chain 5–6 sequential calls across memory and code execution (store → recall → compute → store → recall). All three frameworks, same 7 tools over MCP. Each cell is success / $ per task:

System | Haiku | Sonnet
LangGraph | 100% / $0.0102 | 100% / $0.0381
CrewAI | 100% / $0.0151 | 100% / $0.0458
kernelmcp ReAct | 100% / $0.0259 | 100% / $0.0468
kernelmcp LTP | 80% / $0.0025 | 60% / $0.0041
kernelmcp Hybrid | 100% / $0.0071 | 100% / $0.0146

Depth tips the balance. On deep chains, kernelmcp Hybrid is the cheapest 100%-reliable system on both models — 1.4× under LangGraph and 2.1× under CrewAI on Haiku, widening to 2.6× / 3.1× on Sonnet — and faster (Sonnet 9.5s vs ~30s). The reason is structural: LTP compiles the whole chain in one call (~2.4k tokens) then executes it deterministically — so the LLM is called once, not once per step, and cost stays flat as long as the added depth is deterministic execution rather than extra LLM steps or a re-compile. LangGraph and CrewAI re-invoke the LLM at every step, so their cost grows with chain length. This is the opposite of the single-tool result — the compile-once design pays off exactly as agentic depth grows.

Honest qualifier: LTP alone is unreliable on stateful chains (80% / 60% — it stumbles on a working-memory task), so this is a win for Hybrid (LTP-first + verified fallback), not LTP by itself. Reliability is a four-way tie at 100%; kernelmcp wins on cost at equal reliability, not on reliability. Small sample (5 tasks).

Each lib vs its named incumbent

The kernel isn't the only thing worth measuring. Five of the standalone libraries went head-to-head with the tools people actually reach for — LlamaIndex, Mem0, a raw OS sandbox, Tavily, and DeepEval — under the same discipline: control the confounds, run the incumbent in its best configuration, and publish the losses as plainly as the wins. None of these is a clean sweep, and that's the point.

ragmcp vs LlamaIndex — retrieval

Same embedder on both sides (fastembed bge-small, identical weights wrapped for LlamaIndex too), same 20-doc corpus, by-construction ground truth. So any gap is the framework's chunking + retrieval logic, never the embedding model.

Metric | ragmcp | LlamaIndex
recall@1 | 0.925 | 0.875
recall@3 | 1.000 | 0.975
MRR@10 | 0.954 | 0.930
ingest (20 docs) | 0.58 s | 1.00 s
query latency | 13.8 ms | 10.5 ms

The core retriever is at exact parity. Force one chunk per document on both sides — chunking removed as a variable — and every metric is identical (recall@1 0.875, MRR 0.93): same embedder, same cosine, same ranking. ragmcp's default-config edge comes entirely from its finer default chunk (500 chars vs LlamaIndex's 1024 tokens), not a smarter algorithm. ragmcp ingests ~1.7× faster in bulk; LlamaIndex answers ~1.3× faster per query. Verdict: competitive on retrieval, opposite performance tradeoffs — not a knockout.
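
For readers who want to check the metric definitions, here is a minimal sketch of recall@k and MRR@k, assuming exactly one relevant document per query (which is what by-construction ground truth suggests, though the benchmark script may differ). The document ids and rankings are made up for illustration.

```python
def recall_at_k(ranked_ids, relevant_id, k):
    """1.0 if the single relevant document appears in the top k, else 0.0."""
    return 1.0 if relevant_id in ranked_ids[:k] else 0.0


def reciprocal_rank(ranked_ids, relevant_id, k=10):
    """1/rank of the relevant document within the top k, or 0.0 if it is absent."""
    for rank, doc_id in enumerate(ranked_ids[:k], start=1):
        if doc_id == relevant_id:
            return 1.0 / rank
    return 0.0


def mean(values):
    values = list(values)
    return sum(values) / len(values)


# Three toy queries: (ranking returned by the retriever, ground-truth id).
runs = [
    (["d3", "d1", "d7"], "d3"),  # hit at rank 1
    (["d2", "d5", "d9"], "d5"),  # hit at rank 2
    (["d4", "d8", "d6"], "d1"),  # miss
]

recall_1 = mean(recall_at_k(r, gold, 1) for r, gold in runs)
recall_3 = mean(recall_at_k(r, gold, 3) for r, gold in runs)
mrr_10 = mean(reciprocal_rank(r, gold, 10) for r, gold in runs)
print(recall_1, recall_3, mrr_10)
```

Because both metrics depend only on the returned ranking, identical embedder, similarity and chunking must give identical scores, which is what the one-chunk-per-document control shows.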

memorymcp vs Mem0 — fact recall + cost

40 facts, 40 paraphrased queries (low lexical overlap — a real semantic test). Same embedder, both cosine. The headline work was neutralizing a confound: on ChromaDB, Mem0 ranks semantically backwards (its adapter passes raw distance where its ranker expects a similarity), so we ran Mem0 on its default Qdrant backend, where it scores correctly — and gave it its full hybrid pipeline (spaCy + BM25 + vector).

Metric | memorymcp | Mem0
recall@1 | 0.775 | 0.500
recall@3 | 0.825 | 0.825
recall@5 | 0.900 | 0.925
MRR@10 | 0.826 | 0.679
ingest (40 facts) | 1.6 s / $0 | 5.2 s
cost / 8 facts (extraction) | $0.00 | $0.071

Recall is close and split: memorymcp ranks the target higher (recall@1, MRR), while Mem0 catches one more in the tail (recall@5: 37 vs 36 of 40). The clear, defensible gap is cost: memorymcp's deterministic PatternFactExtractor ingests free and fast, where Mem0's intended LLM extraction costs ~$0.009/fact. Mem0 answers ~3.5× faster per query (memorymcp pays for a richer pipeline + persistent store).

Caveat stated against us: this dataset is pure paraphrase with low lexical overlap, which under-uses Mem0's BM25 keyword layer — on keyword-heavy queries Mem0 would close the gap. It's a semantic-retrieval result, not a universal one.
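
The distance-vs-similarity confound described above is easy to reproduce in isolation. The sketch below is not Mem0's or ChromaDB's code; it only shows, on two made-up vectors, how a ranker that sorts "higher is better" returns results backwards when it is fed cosine distance instead of cosine similarity.

```python
import math


def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm


def rank(scores, higher_is_better=True):
    """Order keys by score, best first."""
    return sorted(scores, key=scores.get, reverse=higher_is_better)


query = [1.0, 0.0]
docs = {"close": [0.9, 0.1], "far": [0.0, 1.0]}

similarities = {name: cosine_similarity(query, vec) for name, vec in docs.items()}
distances = {name: 1.0 - sim for name, sim in similarities.items()}

correct = rank(similarities)                        # similarity, higher is better
backwards = rank(distances)                         # distance mistaken for similarity
fixed = rank(distances, higher_is_better=False)     # distance, lower is better
print(correct, backwards, fixed)
```

A benchmark run on a backend with this mismatch would measure the adapter bug rather than the retrieval pipeline, which is why the comparison uses the backend where scoring is correct.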

sandboxmcp vs raw subprocess — containment

The question that matters for untrusted code: does a malicious payload’s effect reach the host or the network? Five escape payloads, run on the raw interpreter, the zero-dependency process backend, and the hardened Docker backend — the marker is verified on the host filesystem, so “contained” means the damage genuinely never landed.

Host-impact payload | default (raw) | sandboxmcp (process) | sandboxmcp (docker)
read host file | leaked | leaked | contained
write host file | leaked | leaked | contained
network egress | leaked | contained | contained
2 GB memory balloon | leaked | contained | contained
infinite loop | contained | contained | contained
Total contained | 1 / 5 | 3 / 5 | 5 / 5

The process backend (socket shim + OS limits) stops egress and resource attacks, but has no kernel isolation — host filesystem reads/writes leak. The Docker backend (network_mode=none, dropped capabilities, read-only rootfs, Docker’s default seccomp, memory & PID caps, throwaway container) contains all five — 5/5, verified on the host fs. Use the process backend for trusted code with guardrails; use Docker to contain code you don’t trust.
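
As a rough illustration of the Docker settings listed above, here is how they might map onto keyword arguments of the Docker SDK for Python (`containers.run`). This is a sketch, not sandboxmcp's implementation: the memory and PID limits, dropping all capabilities, and the image name in the comment are assumptions chosen for the example.

```python
def hardened_run_kwargs(mem_limit: str = "256m", pids_limit: int = 64) -> dict:
    """Keyword arguments for docker-py's containers.run() mirroring the listed hardening.

    Docker's default seccomp profile applies automatically as long as no
    security_opt overrides it, so it needs no entry here.
    """
    return {
        "network_mode": "none",    # no network egress
        "cap_drop": ["ALL"],       # assumption: drop every capability
        "read_only": True,         # read-only root filesystem
        "mem_limit": mem_limit,    # assumption: illustrative memory cap
        "pids_limit": pids_limit,  # assumption: illustrative PID cap
        "remove": True,            # throwaway container
    }


kwargs = hardened_run_kwargs()
print(kwargs)

# With the Docker SDK installed and a daemon running, usage could look like:
#   import docker
#   docker.from_env().containers.run(
#       "python:3.12-slim", ["python", "-c", untrusted_code], **kwargs)
```

Keeping the settings in one function makes it easy to assert in a test that no code path launches a container without them.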

Under the hood — static-validator coverage. A separate binary test of 20 payloads (a token prints only if the unsafe action ran) shows what the hardened validator catches before any backend runs — and, honestly, what it doesn’t:

Attack class | default | sandboxmcp (process)
Known dangerous patterns (12) | 12 leaked | 12 blocked
Validator bypasses (6) | 6 leaked | 6 leaked
Resource exhaustion (2) | 1 leaked | 2 contained
Total neutralized | 2 / 20 | 14 / 20

A static scanner is bypassable by design: os.popen, pathlib and urllib evade the patterns. The validator is a first filter, not containment, which is exactly why the host-impact test above relies on Docker for full containment. (On Docker those bypasses still execute, but inside a network-less, read-only, throwaway container, so nothing reaches the host.)
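
To see why a pattern scanner cannot be the containment layer, consider a toy denylist. The patterns below are hypothetical and far smaller than any real validator's; the two source strings are only scanned, never executed.

```python
import re

# Hypothetical denylist, for illustration only.
DENYLIST = [
    r"\bos\.system\b",
    r"\bsubprocess\b",
    r"\bsocket\b",
    r"\beval\s*\(",
    r"\bexec\s*\(",
]


def static_check(source: str) -> bool:
    """Return True when no denylisted pattern matches, i.e. the source passes."""
    return not any(re.search(pattern, source) for pattern in DENYLIST)


known_pattern = "import os; os.system('id')"
bypass = "import os; os.popen('id').read()"  # same effect, different spelling

print(static_check(known_pattern))  # rejected by the scanner
print(static_check(bypass))         # passes the scanner
```

Adding os.popen to the list closes this one gap and leaves the next spelling open, so the scanner is best treated as a cheap early reject in front of real isolation.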

websearchmcp vs Tavily — the answer, not the index

Tavily runs its own real-time crawl + index; websearchmcp aggregates free engines (SearXNG + scrapers). On raw result quality that’s not a fair fight, and we won’t pretend it is. But an agent consumes the answer, not the result list, so that’s what we measured: fetch the top sources, trim to query-relevant passages, let the same LLM synthesize.

Metric | websearchmcp | Tavily
Answer correctness (6 factual Qs) | 5 / 6 | 5 / 6
Authority@3 (raw results, 10 Qs) | 3 / 10 | 6 / 10
Search cost | $0 / no key | paid (credits)
Latency / query | ~22 s (live fetch) | ~1.3 s (index)

We lose the index contest, structurally (3/10 vs 6/10 on surfacing an authoritative source) — you can’t out-crawl a crawler. But on the actual deliverable, the cited answer, it’s a tie: 5/6 each (the one miss is an Everest height both phrase as “8,848”, which the strict substring grader counts against both). The revealing detail — our top results were still SEO pages, yet the answer was right, because the LLM extracts the fact from any decent fetched page (Tavily’s own top-3 sometimes included quora/instagram too). The answer layer compensates for mediocre sources, on both sides. The price we pay for having no index is latency.

evalmcp vs DeepEval — does the judge agree with a human?

An eval library lives or dies by one thing: does its LLM-judge match a human label? 24 hand-labeled correctness cases (clear right/wrong, plus paraphrases that fool substring matching and plausible-but-wrong answers that fool lazy judges), every judge run on the same LLM — so the number measures the judge’s logic, not the model.

Judge (same LLM) | Accuracy | F1 | Cohen’s κ
evalmcp contains (no LLM) | 0.83 | 0.83 | 0.68
evalmcp LLM judge | 1.00 | 1.00 | 1.00
DeepEval GEval | 1.00 | 1.00 | 1.00

Parity with DeepEval on judge-vs-human agreement, but the first run said 0.42 accuracy, F1 0.00: evalmcp’s LLM judge marked everything wrong, worse than the no-LLM baseline. Two bugs caused it: a stray {{ }} in an f-string, and a bare json.loads that failed on fenced replies and silently scored 0. Both are fixed (0.42→1.00), with regression tests. Honest caveats: the dataset is small and clear-cut (both good judges hit the ceiling, so this is parity, not a knockout); evalmcp grades correctness, not Ragas-style faithfulness; and Ragas itself couldn’t run (a broken transitive dependency in the test env), so we’re not faking a number for it.
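
Both bug classes are easy to reproduce. The sketch below shows one plausible shape of each, not evalmcp's actual code: doubled braces in an f-string emit a literal placeholder instead of the value, and a judge reply wrapped in a Markdown code fence breaks a bare json.loads. The prompt text, the `correct` key and the helper name are assumptions made for the example.

```python
import json
import re

answer = "Paris"
buggy_prompt = f"Grade this answer: {{answer}}"  # doubled braces: literal text
fixed_prompt = f"Grade this answer: {answer}"    # single braces: interpolated

TICKS = "`" * 3  # a Markdown fence, built indirectly to keep this listing clean
FENCE = re.compile(r"^" + TICKS + r"(?:json)?\s*(.*?)\s*" + TICKS + r"$", re.DOTALL)


def parse_judge_reply(reply: str) -> dict:
    """Parse a JSON verdict, tolerating an optional surrounding code fence.

    Malformed replies still raise, so a parse failure cannot be mistaken
    for a score of 0.
    """
    text = reply.strip()
    match = FENCE.match(text)
    if match:
        text = match.group(1)
    return json.loads(text)


fenced = TICKS + 'json\n{"correct": true}\n' + TICKS
bare = '{"correct": false}'
print(buggy_prompt)
print(parse_judge_reply(fenced), parse_judge_reply(bare))
```

Letting the parse error propagate is the design choice that matters: a judge that swallows it reports a confident 0 for every case, which is what a 0.42 accuracy with F1 0.00 looks like from the outside.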

Each benchmark is one reproducible script with raw JSON in its repo: ragmcp/benchmarks/, memorymcp/benchmarks/, sandboxmcp/benchmarks/, websearchmcp/benchmarks/, evalmcp/benchmarks/. Every one of these runs surfaced or confirmed a real bug — all fixed.

No single winner: a portfolio

LTP for the cost floor

On deterministic, decomposable work LTP is the cheapest by far (about a quarter of ReAct's tokens on Haiku: 2,517 vs 9,301) and just as reliable. Compiling once (one LLM call, not one per step) keeps cost flat as that deterministic depth grows.

Hybrid for reliability

LTP-first with a verified fallback is the most reliable mode in both regimes — 100% on the deterministic suite, and the top of every column on tool chains too — because its fallback rescues each pure mode's distinct failures. It can't fix a task both modes fail, but it never does worse.
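
The control flow of LTP-first with a verified fallback can be sketched in a few lines. Every name here (run_hybrid, the stand-in runners, the verifier) is hypothetical and the kernel's real interfaces may look different; the stubs exist only so the example runs without an LLM.

```python
from typing import Callable, Optional, Tuple


def run_hybrid(
    task: str,
    run_ltp: Callable[[str], Optional[str]],
    run_react: Callable[[str], str],
    verify: Callable[[str, str], bool],
) -> Tuple[str, str]:
    """Try the cheap compiled plan first; fall back to ReAct if it fails verification."""
    answer = run_ltp(task)
    if answer is not None and verify(task, answer):
        return answer, "ltp"
    # LTP produced nothing usable: pay for the multi-turn path instead.
    return run_react(task), "react-fallback"


# Stand-in runners (purely illustrative).
def fake_ltp(task: str) -> Optional[str]:
    return "84" if task == "35% of 240" else None


def fake_react(task: str) -> str:
    return "Tehran"


def looks_like_clean_answer(task: str, answer: str) -> bool:
    return bool(answer.strip())


cheap = run_hybrid("35% of 240", fake_ltp, fake_react, looks_like_clean_answer)
rescued = run_hybrid("capital of Iran", fake_ltp, fake_react, looks_like_clean_answer)
print(cheap, rescued)
```

The sketch also makes the cost trade-off visible: a fallback pays for the LTP attempt and the ReAct run, and a verifier that is too strict sends correct LTP answers down the expensive path.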

ReAct for the open-ended

ReAct's per-turn adaptivity is the right tool for exploration and knowledge lookups. On the deterministic suite it is the most expensive and, on a strong model, the least reliable (over-confidence).

What this is, and isn't

Two kinds of comparison

Most sections pit the kernel's own modes against each other (same kernel, same tools). The cross-framework sections vs LangGraph and CrewAI are real, but methodologically contentious by nature: different tool implementations, a fresh subprocess spawned per run (so latency isn't comparable), and a framework token-counter that undercounts. Each of those carries its fairness caveats stated inline, right next to the numbers.

Small samples

28 deterministic + 7 multi-tool tasks, each run on all three models (Haiku, Sonnet, Opus). Deltas are directional, not a guarantee for every workload; more tasks and reps would tighten the intervals.

Hybrid's fallback isn't free

The verification step occasionally re-runs a correct-but-terse LTP answer through ReAct. It never returns a wrong answer, but it adds tokens: the documented price of the safety net.

Runs against current source

Numbers reflect the kernel at the repo's HEAD. Clone and run the harness to reproduce them exactly.

The raw receipts (tokens per task)

All 28 tasks, every mode. A ✗ marks a failed answer. Haiku is shown in full because its token spread is the widest (a weak model makes ReAct loop); Sonnet and Opus follow below. Nothing hidden.

Task | Description | ReAct | LTP | Hybrid
speed | Average speed: 60 km in 1.5 h | 5,290 | 2,482 | 2,583
ratio | Cost of 7 apples if 3 cost $1.20 | 6,490 | 2,486 | 2,586
seq | Next number in 2, 6, 12, 20, 30 | 25,730 | 2,472 | 2,647
percent | 35% of 240 | 5,223 | 2,464 | 2,550
mul-sub | 123 × 456, then − 1000 | 6,722 | 2,473 | 2,562
fib | 15th Fibonacci number | 13,942 | 2,535 | 2,633
factorial | 12 factorial (12!) | 6,700 | 2,466 | 2,550
primes | Sum of all primes below 30 | 13,980 | 2,548 | 2,766
squares | Sum of squares 1²…5² | 13,722 | 2,483 | 2,575
gcd | GCD of 84 and 126 | 13,706 | 2,481 | 2,572
reverse | Reverse the string 'benchmark' | 2,593 ✗ | 2,462 | 2,546
vowels | Count vowels in 'orchestration' | 5,398 | 2,499 | 14,809
wordcount | Words in a 9-word sentence | 5,297 | 2,486 | 2,581
upper | Uppercase 'kernelmcp' | 3,181 | 2,463 | 2,548
sort | Sort 7, 2, 9, 1, 5 ascending | 2,634 | 2,500 | 2,600
max | Largest of 14, 88, 23, 91, 47 | 5,255 | 2,487 | 2,584
evens | Even numbers from 1 to 20 | 5,273 | 2,486 | 2,575
binary | Binary 1011 → decimal | 5,308 | 2,468 | 2,554
branch-parity | 17 × 23, then branch on even/odd | 14,305 | 2,518 | 2,752
fact-digits | Smallest n where n! exceeds 10 digits | 20,293 | 2,500 | 2,721
mul-until | ×3 from 1 until > 1000, how many steps | 11,877 | 2,509 | 2,811
nextprime-sq | Smallest prime > 100, squared | 5,970 | 2,544 | 2,633
reverse-sub | 2024 minus its digit-reverse | 5,841 | 2,505 | 2,735
div3not5 | Sum 1–50 divisible by 3 but not 5 | 11,917 | 2,505 | 2,603
fib-label | 20th Fibonacci, label BIG/SMALL | 7,107 | 2,580 | 2,750
string-chain | Strip vowels, reverse, give length | 12,216 | 2,526 | 2,704
sq-vs-sqsum | (Σ1…10)² − Σ(1²…10²) | 14,651 | 2,530 | 2,742
iran-capital | Reverse 'narI' → country → its capital | 9,822 | 3,018 ✗ | 3,198

Read it honestly: LTP's column is strikingly flat (~2.5k) because it compiles once instead of calling the LLM per step, so cost stays flat as deterministic depth grows, while ReAct swings from 2.6k to 25.7k (it can spiral on a trivial task like seq). LTP's one miss is iran-capital (knowledge); ReAct's is reverse (it answered without the sandbox). Hybrid is the only column with no ✗; its spikes (most visibly vowels, at 14.8k) are verified fallbacks, the price of catching the misses. Full raw JSON for every run, model, and suite lives in the repo.

Sonnet: per-task tokens (all three modes at 100%)
Task | ReAct | LTP | Hybrid
speed | 7,890 | 2,480 | 2,582
ratio | 9,645 | 2,494 | 2,599
seq | 7,884 | 2,482 | 2,570
percent | 7,842 | 2,464 | 2,550
mul-sub | 6,668 | 2,473 | 2,563
fib | 13,789 | 2,523 | 2,638
factorial | 6,655 | 2,466 | 2,551
primes | 13,693 | 2,497 | 2,582
squares | 13,620 | 2,482 | 2,575
gcd | 13,595 | 2,479 | 2,571
reverse | 2,594 | 2,462 | 2,547
vowels | 2,632 | 2,493 | 2,592
wordcount | 7,878 | 2,484 | 2,580
upper | 3,178 | 2,463 | 2,549
sort | 2,627 | 2,495 | 2,604
max | 7,884 | 2,493 | 2,596
evens | 7,860 | 2,482 | 2,573
binary | 7,851 | 2,468 | 2,555
branch-parity | 6,769 | 2,523 | 2,638
fact-digits | 11,335 | 14,065 | 2,605
mul-until | 17,927 | 2,507 | 2,614
nextprime-sq | 5,384 | 2,512 | 2,628
reverse-sub | 18,827 | 2,483 | 2,598
div3not5 | 17,911 | 2,498 | 2,597
fib-label | 6,927 | 2,587 | 2,713
string-chain | 16,743 | 2,518 | 2,627
sq-vs-sqsum | 13,774 | 2,528 | 2,634
iran-capital | 6,678 | 3,000 | 3,098

Opus: per-task tokens (note the ✗ over-confidence misses)
Task | ReAct | LTP | Hybrid
speed | 6,469 | 3,368 | 3,501
ratio | 7,942 | 3,377 | 3,510
seq | 6,501 | 3,365 | 6,816
percent | 9,585 | 3,347 | 3,461
mul-sub | 3,916 ✗ | 3,360 | 3,499
fib | 11,880 | 3,394 ✗ | 15,867
factorial | 3,911 | 3,354 | 3,469
primes | 7,901 | 3,376 | 3,547
squares | 11,859 | 3,371 | 3,496
gcd | 11,856 | 3,367 | 3,491
reverse | 3,174 | 3,357 | 3,482
vowels | 9,642 | 3,384 | 3,514
wordcount | 6,478 | 3,385 | 3,517
upper | 3,920 | 3,356 | 3,482
sort | 3,195 | 3,389 | 3,529
max | 9,627 | 3,348 | 3,474
evens | 9,615 | 3,371 | 3,493
binary | 9,600 | 3,354 | 3,471
branch-parity | 3,966 | 3,395 | 3,568
fact-digits | 3,836 | 3,397 | 3,625
mul-until | 6,457 | 3,404 | 3,654
nextprime-sq | 3,173 ✗ | 3,401 | 3,548
reverse-sub | 7,044 | 3,391 | 3,564
div3not5 | 3,275 ✗ | 3,387 | 3,686
fib-label | 4,069 | 3,471 ✗ | 8,002
string-chain | 3,218 | 3,413 | 3,686
sq-vs-sqsum | 7,946 | 3,398 | 3,591
iran-capital | 3,929 | 4,181 | 4,344

Don't trust the numbers. Reproduce them.

The harness lives in the kernelmcp repo. Point it at your own API key and run all three modes over the full suite.

# from the kernelmcp repo
python benchmarks/ltp_vs_react_bench.py \
  --model claude-haiku-4-5 \
  --modes react,ltp,hybrid \
  --reps 2 --suite all

# multi-tool suite (chains real memory/file tools)
python benchmarks/ltp_vs_react_bench.py \
  --modes react,ltp,hybrid --suite multitool --reps 2