Deterministic suite

28 verifiable tasks, three modes, one model

Arithmetic, string ops, small algorithms and a knowledge hop, all with no network access, so anyone gets the same answer. Every LLM call is pinned to claude-haiku-4-5.

Mode	Success	Avg tokens / task	vs ReAct	Best for
ReAct multi-turn	96.4%	9,301	—	knowledge & open-ended
LTP compile-once	96.4%	2,517	−73%	deterministic compute (cost floor)
Hybrid LTP-first + fallback	100%	3,088	−67%	mixed / unknown workloads

LTP calls the LLM once to compile the plan, not once per step, so its cost stays flat (~2.5k tokens) as the task deepens — as long as the added depth is deterministic execution (loops, code, tool calls), not extra LLM steps or a re-compile. ReAct re-sends the system prompt and tool schemas every turn, so its cost scales with turns. Hybrid is the only mode at 100% — but it is not the cheapest: pure LTP is. Hybrid spends a little more to convert LTP's fragile tail into reliability.
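A back-of-the-envelope model makes the scaling argument concrete. The sketch below is an illustration under assumed numbers, not the harness's accounting: `PREFIX_TOKENS`, `STEP_TOKENS` and both functions are hypothetical, chosen only to show why a per-turn resend grows with depth while a single compile call does not.

```python
# Toy token model. All constants are illustrative assumptions, not measured values.
PREFIX_TOKENS = 2_000   # system prompt + tool schemas (assumed size)
STEP_TOKENS = 150       # assumed tokens added per reasoning step


def react_tokens(turns: int) -> int:
    """Multi-turn loop: every turn re-sends the prefix plus the history so far."""
    total = 0
    for turn in range(1, turns + 1):
        history = STEP_TOKENS * (turn - 1)   # earlier turns replayed as context
        total += PREFIX_TOKENS + history + STEP_TOKENS
    return total


def compile_once_tokens(steps: int) -> int:
    """Compile-once: one LLM call emits the plan; `steps` is deliberately unused
    because executing the plan deterministically costs no further LLM tokens."""
    plan_tokens = STEP_TOKENS * 2            # assumed plan size, independent of depth
    return PREFIX_TOKENS + plan_tokens


for depth in (1, 3, 6):
    print(depth, react_tokens(depth), compile_once_tokens(depth))
```

In this toy model the compile-once cost is the same at depth 1 and depth 6, while the multi-turn cost grows faster than linearly because the history is replayed on every turn. The qualifier from the text still applies: if extra depth needs extra LLM steps or a re-compile, the flat line no longer holds.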

Across model tiers

One result holds at every model size

The same 28-task suite, run on a weak, a mid, and a strong model. Each cell is success / avg tokens per task.

Model	ReAct	LTP	Hybrid
Haiku (weak)	96.4% / 9,301	96.4% / 2,517	100% / 3,088
Sonnet (mid)	100% / 9,502	100% / 2,925	100% / 2,611
Opus (strong)	89.3% / 6,570	92.9% / 3,409	100% / 4,281

Hybrid is 100% at every tier — the only mode that is. The model shapes the rest: a weak model makes ReAct loop (9.3k tokens), so LTP's token edge is largest; a mid model (Sonnet) lifts everyone to 100% — the sweet spot; a strong model (Opus) answers in ~3 turns "in its head", which is cheaper but drops ReAct to 89.3% on over-confidence (it skipped tools and miscomputed). Hybrid absorbs both failure modes.
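The LTP-first, fallback-second control flow can be sketched in a few lines. This is a minimal illustration, not kernelmcp's implementation: `run_hybrid`, the stand-in callables and the `non_empty` check are all hypothetical names, and the real verification step is not described in this document.

```python
from typing import Callable, Optional, Tuple

Answer = Optional[str]


def run_hybrid(
    task: str,
    run_ltp: Callable[[str], Answer],
    run_react: Callable[[str], Answer],
    verify: Callable[[str, Answer], bool],
) -> Tuple[Answer, str]:
    """Try the cheap compile-once path first; fall back only if verification fails."""
    answer = run_ltp(task)
    if verify(task, answer):
        return answer, "ltp"
    # Pay for the multi-turn loop only on tasks the first path could not finish.
    return run_react(task), "react-fallback"


# Stand-in callables (hypothetical; real modes would call an LLM).
def fake_ltp(task: str) -> Answer:
    return "84" if "35%" in task else None   # pretend the plan fails on anything else


def fake_react(task: str) -> Answer:
    return "Tehran"


def non_empty(task: str, answer: Answer) -> bool:
    return bool(answer and answer.strip())


print(run_hybrid("35% of 240", fake_ltp, fake_react, non_empty))
print(run_hybrid("capital of Iran", fake_ltp, fake_react, non_empty))
```

The shape explains both observations in the table: most tasks cost roughly what LTP costs, and the occasional fallback adds a ReAct-sized bill on top of the failed first attempt.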

Multi-tool suite (harder)

The real product surface: chaining actual tools

Seven tasks that must chain real suite tools — memory, files, working memory — not just the sandbox. Each goal demands a clean final answer and runs in a fresh namespace. Two repetitions per task.

Task	What it does	Reliable?
`mem-recall`	Store a fact in memory, recall it	solid
`mem-count`	Store 3 facts, then count them	solid
`mem-update`	Store a fact, overwrite it, recall the latest	solid
`mem-then-file`	Recall a number, compute, write & read a file	solid
`file-upper`	Write a file, read it, uppercase it	solid
`working-mem`	Set a working-memory value, read & double it	mostly
`file-words`	Write a file, read it, count the words	hard

Model	ReAct	LTP	Hybrid
Haiku	78.6% / 16,709	71.4% / 6,924	100% / 12,244
Sonnet	78.6% / 17,133	71.4% / 6,168	85.7% / 8,689
Opus	71.4% / 22,490	64.3% / 8,618	78.6% / 25,072

Two honest findings. (1) Cost is the robust win: LTP routes the same tools at roughly 36–41% of ReAct's tokens on every model. (2) Hybrid is the most reliable mode at every tier (100% / 85.7% / 78.6%), because its fallback rescues each pure mode's distinct misses: LTP can't finish working-mem, ReAct short-circuits file-words, and Hybrid takes whichever one works. Under Hybrid, every task except file-words passes on all three models, apart from one failed working-mem rep on Opus. The decline on stronger models comes down mainly to one stubborn task: file-words, where Sonnet and Opus return the raw file object instead of counting even when told to reply with only the number, the same over-confidence that drops Opus's ReAct math score. We keep it in rather than quietly drop it.

Per-task receipts (avg tokens over 2 reps; ✗ = both reps failed, ½ = one of two):

Haiku: per-task

Task	ReAct	LTP	Hybrid
`mem-recall`	17,970	2,652	17,691
`mem-count`	21,940	2,524	2,636
`mem-update`	7,724	2,747 ✗	12,520
`mem-then-file`	26,843	2,541	2,671
`file-words`	18,173 ✗	17,047	17,053
`file-upper`	10,759	18,449	13,333
`working-mem`	13,560 ½	2,512 ✗	19,807

Sonnet: per-task

Task	ReAct	LTP	Hybrid
`mem-recall`	18,419	2,661	2,767
`mem-count`	18,083	2,523	2,636
`mem-update`	12,342	2,749	9,296
`mem-then-file`	23,844	2,695	2,835
`file-words`	18,162 ✗	16,730 ✗	18,748 ✗
`file-upper`	10,753	13,299	13,300
`working-mem`	18,330 ½	2,520 ✗	11,242

Opus: per-task

Task	ReAct	LTP	Hybrid
`mem-recall`	22,210	3,597	26,436
`mem-count`	25,940	3,434	32,062
`mem-update`	15,059	3,698 ½	11,592
`mem-then-file`	33,116	3,682	37,808
`file-words`	22,217 ✗	25,776 ✗	25,775 ✗
`file-upper`	13,206	16,727	16,722
`working-mem`	25,685 ✗	3,414 ✗	25,108 ½
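The headline success rates can be recomputed from the ✗ and ½ markers in the receipts above. The snippet below does this for the Opus Hybrid column; the dictionary is transcribed by hand from that table and `success_rate` is a hypothetical helper, not part of the harness.

```python
REPS = 2

# Failed reps per task for Opus / Hybrid, read off the per-task table:
# ✗ = both reps failed (2), ½ = one of two failed (1), unmarked = 0.
opus_hybrid_failed_reps = {
    "mem-recall": 0,
    "mem-count": 0,
    "mem-update": 0,
    "mem-then-file": 0,
    "file-words": 2,
    "file-upper": 0,
    "working-mem": 1,
}


def success_rate(failed_reps: dict) -> float:
    """Percentage of passing runs across all tasks and reps."""
    runs = len(failed_reps) * REPS
    passed = runs - sum(failed_reps.values())
    return round(100 * passed / runs, 1)


print(success_rate(opus_hybrid_failed_reps))  # 11 of 14 runs pass -> 78.6
```

With only 14 runs per cell, a single rep moves the rate by about 7 points, which is worth keeping in mind when comparing adjacent cells.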

Cross-framework

vs LangGraph: the metric decides

The same 28 deterministic tasks through a standard LangGraph ReAct agent, same model, one functionally-equivalent Python tool. We'll say the uncomfortable part first: on raw token count, a near-empty agent always wins — a feature-rich kernel can't out-token it on one-tool arithmetic (LangGraph ~1.4k tokens, our lean LTP ~2.2k).

But raw tokens aren't the bill — cost is, and cost flips by model. Each cell is $/task / tokens:

Model	kernelmcp LTP	kernelmcp Hybrid	LangGraph	Cheaper
Haiku	$0.0023 / 2,194	$0.0029 / 2,817	$0.0019 / 1,424	LangGraph (−17%)
Sonnet	$0.0020 / 2,515	$0.0018 / 2,719	$0.0057 / 1,466	kernelmcp (~3×)

On Sonnet — a production model — kernelmcp LTP/Hybrid is ~3× cheaper than LangGraph and faster (2.2s vs 4.2s), despite sending more raw tokens. The reason is prompt caching: kernelmcp's large static prefix exceeds the cache minimum and is billed at ~10% on a hit (effective ~$0.78/M vs LangGraph's ~$3.9/M); LangGraph's minimal prompt is too small to cache, so it pays full price. On Haiku nothing caches below its larger floor, so there the lean agent stays ~17% cheaper.
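The effective price per million tokens follows directly from the table: divide dollars per task by tokens per task. The figures below are copied from the rows above; the helper name is hypothetical. Because the table values are rounded, the results land near, not exactly on, the quoted ~$0.78/M and ~$3.9/M.

```python
# (dollars per task, tokens per task) copied from the table above.
rows = {
    ("Haiku", "kernelmcp LTP"): (0.0023, 2194),
    ("Haiku", "LangGraph"): (0.0019, 1424),
    ("Sonnet", "kernelmcp LTP"): (0.0020, 2515),
    ("Sonnet", "LangGraph"): (0.0057, 1466),
}


def dollars_per_million(cost: float, tokens: int) -> float:
    """Blended price actually paid per million tokens."""
    return cost / tokens * 1_000_000


for (model, system), (cost, tokens) in rows.items():
    print(f"{model:6} {system:14} ${dollars_per_million(cost, tokens):.2f}/M")

sonnet_ratio = rows[("Sonnet", "LangGraph")][0] / rows[("Sonnet", "kernelmcp LTP")][0]
print(f"Sonnet: LangGraph costs {sonnet_ratio:.2f}x kernelmcp LTP per task")
```

The point of the blended figure is that it folds cache hits into one number: more tokens at a much lower effective rate can still be the smaller bill.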

Fairness, openly: this is a bare LangGraph agent vs kernelmcp's full loop on arithmetic; LangGraph's tool is a plain exec vs our security sandbox; and the integrated workflows kernelmcp is built for (memory, 120+ tools, planning, scheduling) aren't portable to a stock agent, so they're not in this comparison. Pitch: cost-at-scale, latency, and the suite — not raw token count.

The fair test

Same real tools: who orchestrates better?

The single-tool test above is unflattering by design. The apples-to-apples version puts three orchestrators on the same real tools over MCP: LangGraph and CrewAI both connect to the standalone memorymcp server (same filtered 6-tool set); kernelmcp runs a memory-only kernel. Nobody touches anyone else — the only variable is who orchestrates. Seven memory/working-memory tasks, two reps. Each cell is success / $ per task:

System	Haiku	Sonnet
LangGraph + memorymcp	92.9% / $0.0055	92.9% / $0.0181
CrewAI + memorymcp	100% / $0.0058	100% / $0.0231
kernelmcp ReAct	85.7% / $0.0152	71.4% / $0.0329
kernelmcp LTP	57.1% / $0.0023	71.4% / $0.0026
kernelmcp Hybrid	100% / $0.0100	100% / $0.0075

Two honest reads. Reliability: CrewAI and kernelmcp Hybrid are the only systems at 100% on both models; LangGraph is close (92.9%); pure LTP is cheapest but unreliable on stateful chains. kernelmcp does not dominate reliability — CrewAI matches it. Cost: on Sonnet, kernelmcp Hybrid is the cheapest 100%-reliable system at $0.0075 — ~2.4× under LangGraph ($0.0181) and ~3× under CrewAI ($0.0231) (prompt caching). On Haiku (no caching) Hybrid is pricier than the lean agents. So the kernel's edge is real but conditional: cheapest-at-100% on a caching-capable production model, not on the cheapest tier.

Integrity note: CrewAI's built-in token counter undercounts on its litellm path (2,182 reported vs 8,502 actual on a Sonnet task), so its cost here is measured by intercepting the Anthropic SDK directly — every real API call counted. Caveats: 7 memory tasks only (file tasks excluded — tool names differ across the standalone servers); LangGraph/CrewAI latency includes a fresh memorymcp subprocess spawned per run, so latency isn't comparable.

Where it tips

Deep agentic chains: the kernel pulls ahead

The tests above are shallow. The realistic agentic surface is deep chains: five tasks that each chain 5–6 sequential calls across memory and code execution (store → recall → compute → store → recall). All three frameworks, same 7 tools over MCP. Each cell is success / $ per task:

System	Haiku	Sonnet
LangGraph	100% / $0.0102	100% / $0.0381
CrewAI	100% / $0.0151	100% / $0.0458
kernelmcp ReAct	100% / $0.0259	100% / $0.0468
kernelmcp LTP	80% / $0.0025	60% / $0.0041
kernelmcp Hybrid	100% / $0.0071	100% / $0.0146

Depth tips the balance. On deep chains, kernelmcp Hybrid is the cheapest 100%-reliable system on both models — 1.4× under LangGraph and 2.1× under CrewAI on Haiku, widening to 2.6× / 3.1× on Sonnet — and faster (Sonnet 9.5s vs ~30s). The reason is structural: LTP compiles the whole chain in one call (~2.4k tokens) then executes it deterministically — so the LLM is called once, not once per step, and cost stays flat as long as the added depth is deterministic execution rather than extra LLM steps or a re-compile. LangGraph and CrewAI re-invoke the LLM at every step, so their cost grows with chain length. This is the opposite of the single-tool result — the compile-once design pays off exactly as agentic depth grows.

Honest qualifier: LTP alone is unreliable on stateful chains (80% / 60% — it stumbles on a working-memory task), so this is a win for Hybrid (LTP-first + verified fallback), not LTP by itself. Reliability is a four-way tie at 100%; kernelmcp wins on cost at equal reliability, not on reliability. Small sample (5 tasks).
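The "cheapest at equal reliability" reading can be checked mechanically: filter to systems at 100%, then compare cost. The data below is copied from the deep-chain table; `cheapest_reliable` is a hypothetical helper written for this check.

```python
# (success %, dollars per task) from the deep-chain table above.
deep_chain = {
    "Haiku": {
        "LangGraph": (100, 0.0102),
        "CrewAI": (100, 0.0151),
        "kernelmcp ReAct": (100, 0.0259),
        "kernelmcp LTP": (80, 0.0025),
        "kernelmcp Hybrid": (100, 0.0071),
    },
    "Sonnet": {
        "LangGraph": (100, 0.0381),
        "CrewAI": (100, 0.0458),
        "kernelmcp ReAct": (100, 0.0468),
        "kernelmcp LTP": (60, 0.0041),
        "kernelmcp Hybrid": (100, 0.0146),
    },
}


def cheapest_reliable(systems: dict, min_success: int = 100) -> str:
    """Name of the lowest-cost system that meets the success threshold."""
    eligible = {name: cost for name, (ok, cost) in systems.items() if ok >= min_success}
    return min(eligible, key=eligible.get)


for model, systems in deep_chain.items():
    winner = cheapest_reliable(systems)
    base = systems[winner][1]
    ratios = {
        name: round(cost / base, 1)
        for name, (ok, cost) in systems.items()
        if ok >= 100 and name != winner
    }
    print(model, winner, ratios)
```

Dropping the threshold to 60 makes pure LTP the winner on both models, which is the trade the qualifier above describes: lowest cost, but not at equal reliability.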

The libraries, standalone

Each lib vs its named incumbent

The kernel isn't the only thing worth measuring. Five of the standalone libraries went head-to-head with the tools people actually reach for — LlamaIndex, Mem0, a raw OS sandbox, Tavily, and DeepEval — under the same discipline: control the confounds, run the incumbent in its best configuration, and publish the losses as plainly as the wins. None of these is a clean sweep, and that's the point.

`ragmcp` vs LlamaIndex — retrieval

Same embedder on both sides (fastembed bge-small, identical weights wrapped for LlamaIndex too), same 20-doc corpus, by-construction ground truth. So any gap is the framework's chunking + retrieval logic, never the embedding model.

Metric	ragmcp	LlamaIndex
recall@1	0.925	0.875
recall@3	1.000	0.975
MRR@10	0.954	0.930
ingest (20 docs)	0.58 s	1.00 s
query latency	13.8 ms	10.5 ms

The core retriever is at exact parity. Force one chunk per document on both sides — chunking removed as a variable — and every metric is identical (recall@1 0.875, MRR 0.93): same embedder, same cosine, same ranking. ragmcp's default-config edge comes entirely from its finer default chunk (500 chars vs LlamaIndex's 1024 tokens), not a smarter algorithm. ragmcp ingests ~1.7× faster in bulk; LlamaIndex answers ~1.3× faster per query. Verdict: competitive on retrieval, opposite performance tradeoffs — not a knockout.
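For readers unfamiliar with the metrics, recall@k and MRR@k can be computed as follows. This is a sketch using the standard definitions and assuming one ground-truth document per query; the benchmark script may differ in details, and the example data is made up.

```python
from typing import Dict, List


def recall_at_k(ranked: Dict[str, List[str]], truth: Dict[str, str], k: int) -> float:
    """Fraction of queries whose ground-truth doc appears in the top k results."""
    hits = sum(truth[query] in docs[:k] for query, docs in ranked.items())
    return hits / len(ranked)


def mrr_at_k(ranked: Dict[str, List[str]], truth: Dict[str, str], k: int = 10) -> float:
    """Mean of 1/rank of the ground-truth doc (0 when it is outside the top k)."""
    total = 0.0
    for query, docs in ranked.items():
        top = docs[:k]
        if truth[query] in top:
            total += 1.0 / (top.index(truth[query]) + 1)
    return total / len(ranked)


# Made-up example: three queries, one relevant doc each.
truth = {"q1": "d1", "q2": "d2", "q3": "d3"}
ranked = {"q1": ["d1", "d9"], "q2": ["d7", "d2"], "q3": ["d8", "d5"]}

print(recall_at_k(ranked, truth, 1))
print(recall_at_k(ranked, truth, 3))
print(mrr_at_k(ranked, truth))
```

Both metrics depend only on the rank of the target, which is why forcing one chunk per document on both sides yields identical numbers when the embedder and similarity function match.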

`memorymcp` vs Mem0 — fact recall + cost

40 facts, 40 paraphrased queries (low lexical overlap — a real semantic test). Same embedder, both cosine. The headline work was neutralizing a confound: on ChromaDB, Mem0 ranks semantically backwards (its adapter passes raw distance where its ranker expects a similarity), so we ran Mem0 on its default Qdrant backend, where it scores correctly — and gave it its full hybrid pipeline (spaCy + BM25 + vector).

Metric	memorymcp	Mem0
recall@1	0.775	0.500
recall@3	0.825	0.825
recall@5	0.900	0.925
MRR@10	0.826	0.679
ingest (40 facts)	1.6 s / $0	5.2 s
cost / 8 facts (extraction)	$0.00	$0.071

Recall is close and split: memorymcp ranks the target higher (recall@1, MRR), Mem0 catches one more query in the tail (recall@5, 0.925 vs 0.900 on 40 queries). The clear, defensible gap is cost: memorymcp's deterministic PatternFactExtractor ingests free and fast, where Mem0's intended LLM extraction costs ~$0.009/fact. Mem0 answers ~3.5× faster per query (memorymcp pays for a richer pipeline + persistent store).

Caveat stated against us: this dataset is pure paraphrase with low lexical overlap, which under-uses Mem0's BM25 keyword layer — on keyword-heavy queries Mem0 would close the gap. It's a semantic-retrieval result, not a universal one.
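The ChromaDB confound mentioned above is a generic failure mode: a distance handed to code that expects a similarity reverses the ranking. The sketch below shows the effect with made-up scores; it is not Mem0's or ChromaDB's code, and `rank_by_score` is a hypothetical ranker.

```python
# Cosine similarity: higher means closer.
# Cosine distance = 1 - similarity: lower means closer.
similarities = {"relevant": 0.92, "related": 0.60, "unrelated": 0.10}  # made-up values
distances = {doc: 1.0 - sim for doc, sim in similarities.items()}


def rank_by_score(scores: dict) -> list:
    """A ranker that assumes 'higher score is better'."""
    return sorted(scores, key=scores.get, reverse=True)


print(rank_by_score(similarities))  # intended order
print(rank_by_score(distances))     # reversed: distance fed to a similarity ranker

# Converting back before ranking restores the intended order.
converted = {doc: 1.0 - dist for doc, dist in distances.items()}
print(rank_by_score(converted))
```

A bug of this shape does not raise an error; it only shows up as poor retrieval quality, which is why it has to be ruled out before comparing recall numbers.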

`sandboxmcp` vs raw subprocess — containment

The question that matters for untrusted code: does a malicious payload’s effect reach the host or the network? Five escape payloads, run on the raw interpreter, the zero-dependency process backend, and the hardened Docker backend — the marker is verified on the host filesystem, so “contained” means the damage genuinely never landed.

Host-impact payload	default (raw)	sandboxmcp (process)	sandboxmcp (docker)
read host file	leaked	leaked	contained
write host file	leaked	leaked	contained
network egress	leaked	contained	contained
2 GB memory balloon	leaked	contained	contained
infinite loop	contained	contained	contained
Total contained	1 / 5	3 / 5	5 / 5

The process backend (socket shim + OS limits) stops egress and resource attacks, but has no kernel isolation — host filesystem reads/writes leak. The Docker backend (network_mode=none, dropped capabilities, read-only rootfs, Docker’s default seccomp, memory & PID caps, throwaway container) contains all five — 5/5, verified on the host fs. Use the process backend for trusted code with guardrails; use Docker to contain code you don’t trust.
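The Docker hardening listed above maps onto standard `docker run` flags. The sketch below only builds the command line and does not execute it. It is an assumption about how such a configuration could be expressed on the CLI, not sandboxmcp's code; the image name, memory size and PID limit are placeholder values.

```python
import shlex
from typing import List


def docker_argv(image: str, code: str, memory: str = "256m", pids: int = 64) -> List[str]:
    """Build (not run) a locked-down `docker run` command line."""
    return [
        "docker", "run",
        "--rm",                      # throwaway container, removed on exit
        "--network", "none",         # no network egress
        "--cap-drop", "ALL",         # drop Linux capabilities
        "--read-only",               # read-only root filesystem
        "--memory", memory,          # memory cap (placeholder value)
        "--pids-limit", str(pids),   # PID cap (placeholder value)
        image, "python", "-c", code,
    ]


argv = docker_argv("python:3.12-slim", "print('hello')")
print(shlex.join(argv))
```

Docker applies its default seccomp profile unless told otherwise, so no flag is needed for that item. Passing the argument list to `subprocess.run` together with a `timeout` would add the wall-clock limit that handles the infinite-loop payload.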

Under the hood — static-validator coverage. A separate binary test of 20 payloads (a token prints only if the unsafe action ran) shows what the hardened validator catches before any backend runs — and, honestly, what it doesn’t:

Attack class	default	sandboxmcp (process)
Known dangerous patterns (12)	12 leaked	12 blocked
Validator bypasses (6)	6 leaked	6 leaked
Resource exhaustion (2)	1 leaked	2 contained
Total neutralized	1 / 20	14 / 20

A static scanner is bypassable by design: os.popen, pathlib and urllib evade the patterns. The validator is a first filter, not containment, which is exactly why the host-impact test above includes the Docker backend. (On Docker those bypasses still execute, but inside a network-less, read-only, throwaway container, so nothing reaches the host.)
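Why a pattern scanner misses these calls is easy to show with a toy deny-list. The patterns below are hypothetical, since the real validator's rules are not reproduced in this document, but the failure mode is the same: anything not anticipated by a pattern passes.

```python
import re

# Hypothetical deny-list for illustration only.
DENY_PATTERNS = [
    r"\bos\.system\b",
    r"\bsubprocess\b",
    r"\bopen\s*\(",
    r"\bsocket\b",
    r"\beval\s*\(",
]


def flagged(source: str) -> bool:
    """True if any deny pattern matches the source text."""
    return any(re.search(pattern, source) for pattern in DENY_PATTERNS)


payloads = {
    "os.system": "import os; os.system('id')",
    "open": "open('/etc/passwd').read()",
    "os.popen": "import os; os.popen('id').read()",
    "pathlib": "import pathlib; pathlib.Path('/etc/passwd').read_text()",
    "urllib": "import urllib.request; urllib.request.urlopen('http://example.com')",
}

for name, source in payloads.items():
    print(f"{name:10} {'blocked' if flagged(source) else 'passes the scanner'}")
```

Adding three more patterns would catch these three payloads and miss the next variant, which is the argument for treating the validator as a filter and relying on isolation for containment.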

`websearchmcp` vs Tavily — the answer, not the index

Tavily runs its own real-time crawl + index; websearchmcp aggregates free engines (SearXNG + scrapers). On raw result quality that’s not a fair fight, and we won’t pretend it is. But an agent consumes the answer, not the result list, so that’s what we measured: fetch the top sources, trim to query-relevant passages, let the same LLM synthesize.

Metric	websearchmcp	Tavily
Answer correctness (6 factual Qs)	5 / 6	5 / 6
Authority@3 (raw results, 10 Qs)	3 / 10	6 / 10
Search cost	$0 / no key	paid (credits)
Latency / query	~22 s (live fetch)	~1.3 s (index)

We lose the index contest, structurally (3/10 vs 6/10 on surfacing an authoritative source) — you can’t out-crawl a crawler. But on the actual deliverable, the cited answer, it’s a tie: 5/6 each (the one miss is an Everest height both phrase as “8,848”, which the strict substring grader counts against both). The revealing detail — our top results were still SEO pages, yet the answer was right, because the LLM extracts the fact from any decent fetched page (Tavily’s own top-3 sometimes included quora/instagram too). The answer layer compensates for mediocre sources, on both sides. The price we pay for having no index is latency.

`evalmcp` vs DeepEval — does the judge agree with a human?

An eval library lives or dies by one thing: does its LLM-judge match a human label? 24 hand-labeled correctness cases (clear right/wrong, plus paraphrases that fool substring matching and plausible-but-wrong answers that fool lazy judges), every judge run on the same LLM — so the number measures the judge’s logic, not the model.

Judge (same LLM)	Accuracy	F1	Cohen’s κ
evalmcp `contains` (no LLM)	0.83	0.83	0.68
evalmcp LLM judge	1.00	1.00	1.00
DeepEval `GEval`	1.00	1.00	1.00
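The three agreement metrics in the table can be computed from paired binary labels as sketched below. This uses the standard definitions (F1 on the positive class, Cohen's κ from marginal frequencies); the labels are made up and `agreement` is a hypothetical helper, not evalmcp's API.

```python
from typing import List, Tuple


def agreement(human: List[int], judge: List[int]) -> Tuple[float, float, float]:
    """Accuracy, F1 (positive class = 1) and Cohen's kappa for binary labels."""
    n = len(human)
    pairs = list(zip(human, judge))
    tp = sum(h == 1 and j == 1 for h, j in pairs)
    tn = sum(h == 0 and j == 0 for h, j in pairs)
    fp = sum(h == 0 and j == 1 for h, j in pairs)
    fn = sum(h == 1 and j == 0 for h, j in pairs)

    accuracy = (tp + tn) / n
    denom = 2 * tp + fp + fn
    f1 = 2 * tp / denom if denom else 0.0

    # Agreement expected by chance, from each rater's label frequencies.
    p_yes = ((tp + fn) / n) * ((tp + fp) / n)
    p_no = ((tn + fp) / n) * ((tn + fn) / n)
    p_e = p_yes + p_no
    kappa = (accuracy - p_e) / (1 - p_e) if p_e != 1 else 1.0
    return accuracy, f1, kappa


# Made-up labels: 1 = answer judged correct, 0 = judged wrong.
human = [1, 1, 1, 0, 0, 0, 0, 1]
everything_wrong = [0] * len(human)

print(agreement(human, everything_wrong))  # accuracy above zero, F1 and kappa at zero
print(agreement(human, human))             # perfect agreement
```

A judge that marks everything wrong still earns accuracy equal to the share of negative cases, which is why F1 and κ are reported alongside accuracy.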

Parity with DeepEval on judge-vs-human agreement, but the first run said 0.42 accuracy, F1 0.00: evalmcp’s LLM judge marked everything wrong, worse than the no-LLM baseline. Two bugs (a stray {{ }} in an f-string, a bare json.loads that failed on fenced replies and silently scored 0) were fixed, taking accuracy from 0.42 to 1.00, with regression tests. Honest caveats: small, clear-cut dataset (both good judges hit the ceiling, so this is parity, not a knockout), evalmcp grades correctness not Ragas-style faithfulness, and Ragas itself couldn’t run (a broken transitive dependency in the test env), so we report no number for it.
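Both bug classes are easy to reproduce. The sketch below shows one defensive way to parse a judge reply that may arrive inside a Markdown fence, and how doubled braces behave in an f-string. It is an illustration, not evalmcp's actual fix; `parse_judge_reply` and the `"correct"` key are hypothetical.

```python
import json
import re

TICKS = "`" * 3  # a Markdown fence marker, built this way to keep the example readable
FENCE = re.compile(TICKS + r"(?:json)?\s*(.*?)" + TICKS, re.DOTALL)


def parse_judge_reply(reply: str) -> dict:
    """Parse a JSON verdict whether or not the model wrapped it in a fence."""
    match = FENCE.search(reply)
    payload = match.group(1) if match else reply
    try:
        return json.loads(payload.strip())
    except json.JSONDecodeError as exc:
        # Fail loudly instead of silently scoring the case as 0.
        raise ValueError(f"unparseable judge reply: {reply!r}") from exc


bare = '{"correct": true}'
fenced = f"{TICKS}json\n{bare}\n{TICKS}"
print(parse_judge_reply(bare), parse_judge_reply(fenced))

# In an f-string, doubled braces render as single literal braces;
# in a plain string they stay doubled, which changes the prompt the model sees.
answer = "Paris"
print(f'Reply as JSON: {{"correct": true}} for answer {answer}')
print('Reply as JSON: {{"correct": true}} for answer ' + answer)
```

Raising on an unparseable reply turns a silent zero into a visible failure, which is the kind of behaviour a regression test can pin down.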

Each benchmark is one reproducible script with raw JSON in its repo: ragmcp/benchmarks/, memorymcp/benchmarks/, sandboxmcp/benchmarks/, websearchmcp/benchmarks/, evalmcp/benchmarks/. Every one of these runs surfaced or confirmed a real bug — all fixed.

What it means

No single winner: a portfolio

LTP for the cost floor

On deterministic, decomposable work LTP is the cheapest by far (about a quarter of ReAct's tokens on Haiku, rising to about half on Opus) and at least as reliable. Compiling once (one LLM call, not one per step) keeps cost flat as that deterministic depth grows.

Hybrid for reliability

LTP-first with a verified fallback is the most reliable mode in both regimes (100% on the deterministic suite, and the highest success rate on every model in the tool-chain suites) because its fallback rescues each pure mode's distinct failures. It can't fix a task both modes fail, but it never does worse.

ReAct for the open-ended

ReAct's per-turn adaptivity is the right tool for exploration and knowledge lookups. On the deterministic suite it is the most expensive and, on a strong model, the least reliable (over-confidence).

Honest limits

What this is, and isn't

Two kinds of comparison	Most sections pit the kernel's own modes against each other (same kernel, same tools). The cross-framework sections vs LangGraph and CrewAI are real, but methodologically contentious by nature — different tool implementations, a fresh subprocess spawned per run (so latency isn't comparable), and a framework token-counter that undercounts. Each of those carries its fairness caveats stated inline, right next to the numbers.
Small samples	28 deterministic + 7 multi-tool tasks, each run on all three models (Haiku, Sonnet, Opus). Deltas are directional, not a guarantee for every workload — more tasks and reps would tighten the intervals.
Hybrid's fallback isn't free	The verification step occasionally re-runs a correct-but-terse LTP answer through ReAct. On the deterministic suite it never returned a wrong answer, but it adds tokens: the documented price of the safety net.
Runs against current source	Numbers reflect the kernel at the repo's HEAD. Clone and run the harness to reproduce them exactly.

Every task, no cherry-picking

The raw receipts (tokens per task)

All 28 tasks, every mode. A ✗ marks a failed answer. Haiku comes first because its token spread is the widest (a weak model makes ReAct loop); the Sonnet and Opus tables follow below. Nothing hidden.

Task	ReAct	LTP	Hybrid
`speed` Average speed: 60 km in 1.5 h	5,290	2,482	2,583
`ratio` Cost of 7 apples if 3 cost $1.20	6,490	2,486	2,586
`seq` Next number in 2, 6, 12, 20, 30	25,730	2,472	2,647
`percent` 35% of 240	5,223	2,464	2,550
`mul-sub` 123 × 456, then − 1000	6,722	2,473	2,562
`fib` 15th Fibonacci number	13,942	2,535	2,633
`factorial` 12 factorial (12!)	6,700	2,466	2,550
`primes` Sum of all primes below 30	13,980	2,548	2,766
`squares` Sum of squares 1²…5²	13,722	2,483	2,575
`gcd` GCD of 84 and 126	13,706	2,481	2,572
`reverse` Reverse the string 'benchmark'	2,593 ✗	2,462	2,546
`vowels` Count vowels in 'orchestration'	5,398	2,499	14,809
`wordcount` Words in a 9-word sentence	5,297	2,486	2,581
`upper` Uppercase 'kernelmcp'	3,181	2,463	2,548
`sort` Sort 7, 2, 9, 1, 5 ascending	2,634	2,500	2,600
`max` Largest of 14, 88, 23, 91, 47	5,255	2,487	2,584
`evens` Even numbers from 1 to 20	5,273	2,486	2,575
`binary` Binary 1011 → decimal	5,308	2,468	2,554
`branch-parity` 17 × 23, then branch on even/odd	14,305	2,518	2,752
`fact-digits` Smallest n where n! exceeds 10 digits	20,293	2,500	2,721
`mul-until` ×3 from 1 until > 1000 — how many steps	11,877	2,509	2,811
`nextprime-sq` Smallest prime > 100, squared	5,970	2,544	2,633
`reverse-sub` 2024 minus its digit-reverse	5,841	2,505	2,735
`div3not5` Sum 1–50 divisible by 3 but not 5	11,917	2,505	2,603
`fib-label` 20th Fibonacci, label BIG/SMALL	7,107	2,580	2,750
`string-chain` Strip vowels, reverse, give length	12,216	2,526	2,704
`sq-vs-sqsum` (Σ1…10)² − Σ(1²…10²)	14,651	2,530	2,742
`iran-capital` Reverse 'narI' → country → its capital	9,822	3,018 ✗	3,198

Read it honestly: LTP's column is strikingly flat (~2.5k) because it compiles once instead of calling the LLM per step, so cost stays flat as deterministic depth grows, while ReAct swings from 2.6k to 25.7k (it can spiral on a trivial task like seq). LTP's one miss is iran-capital (knowledge); ReAct's is reverse (it answered without the sandbox). Hybrid is the only column with no ✗; its spikes (most visibly vowels at 14.8k) are verified fallbacks: the price of catching the misses. Full raw JSON for every run, model, and suite lives in the repo.

Sonnet: per-task tokens (all three modes at 100%)

Task	ReAct	LTP	Hybrid
`speed`	7,890	2,480	2,582
`ratio`	9,645	2,494	2,599
`seq`	7,884	2,482	2,570
`percent`	7,842	2,464	2,550
`mul-sub`	6,668	2,473	2,563
`fib`	13,789	2,523	2,638
`factorial`	6,655	2,466	2,551
`primes`	13,693	2,497	2,582
`squares`	13,620	2,482	2,575
`gcd`	13,595	2,479	2,571
`reverse`	2,594	2,462	2,547
`vowels`	2,632	2,493	2,592
`wordcount`	7,878	2,484	2,580
`upper`	3,178	2,463	2,549
`sort`	2,627	2,495	2,604
`max`	7,884	2,493	2,596
`evens`	7,860	2,482	2,573
`binary`	7,851	2,468	2,555
`branch-parity`	6,769	2,523	2,638
`fact-digits`	11,335	14,065	2,605
`mul-until`	17,927	2,507	2,614
`nextprime-sq`	5,384	2,512	2,628
`reverse-sub`	18,827	2,483	2,598
`div3not5`	17,911	2,498	2,597
`fib-label`	6,927	2,587	2,713
`string-chain`	16,743	2,518	2,627
`sq-vs-sqsum`	13,774	2,528	2,634
`iran-capital`	6,678	3,000	3,098

Opus: per-task tokens (note the ✗ over-confidence misses)

Task	ReAct	LTP	Hybrid
`speed`	6,469	3,368	3,501
`ratio`	7,942	3,377	3,510
`seq`	6,501	3,365	6,816
`percent`	9,585	3,347	3,461
`mul-sub`	3,916 ✗	3,360	3,499
`fib`	11,880	3,394 ✗	15,867
`factorial`	3,911	3,354	3,469
`primes`	7,901	3,376	3,547
`squares`	11,859	3,371	3,496
`gcd`	11,856	3,367	3,491
`reverse`	3,174	3,357	3,482
`vowels`	9,642	3,384	3,514
`wordcount`	6,478	3,385	3,517
`upper`	3,920	3,356	3,482
`sort`	3,195	3,389	3,529
`max`	9,627	3,348	3,474
`evens`	9,615	3,371	3,493
`binary`	9,600	3,354	3,471
`branch-parity`	3,966	3,395	3,568
`fact-digits`	3,836	3,397	3,625
`mul-until`	6,457	3,404	3,654
`nextprime-sq`	3,173 ✗	3,401	3,548
`reverse-sub`	7,044	3,391	3,564
`div3not5`	3,275 ✗	3,387	3,686
`fib-label`	4,069	3,471 ✗	8,002
`string-chain`	3,218	3,413	3,686
`sq-vs-sqsum`	7,946	3,398	3,591
`iran-capital`	3,929	4,181	4,344

Run it yourself

Don't trust the numbers. Reproduce them.

The harness lives in the kernelmcp repo. Point it at your own API key and run all three modes over the full suite.

# from the kernelmcp repo
python benchmarks/ltp_vs_react_bench.py \
  --model claude-haiku-4-5 \
  --modes react,ltp,hybrid \
  --reps 2 --suite all

# multi-tool suite (chains real memory/file tools)
python benchmarks/ltp_vs_react_bench.py \
  --modes react,ltp,hybrid --suite multitool --reps 2

