Two layers, one rule. The kernel runs each task as LTP (compile-once), ReAct (multi-turn), or Hybrid — measured against LangGraph and CrewAI. And the standalone libraries go head-to-head with their incumbents: ragmcp vs LlamaIndex, memorymcp vs Mem0, sandboxmcp vs a raw OS sandbox, websearchmcp vs Tavily, evalmcp vs DeepEval. Same model, same controls, losses shown as plainly as wins — every number reproduces from the repo.
Arithmetic, string ops, small algorithms and a knowledge hop — no network, so anyone gets the same answer. Every LLM call pinned to claude-haiku-4-5.
| Mode | Success | Avg tokens / task | vs ReAct | Best for |
|---|---|---|---|---|
| ReAct multi-turn | 96.4% | 9,301 | — | knowledge & open-ended |
| LTP compile-once | 96.4% | 2,517 | −73% | deterministic compute (cost floor) |
| Hybrid LTP-first + fallback | 100% | 3,088 | −67% | mixed / unknown workloads |
LTP calls the LLM once to compile the plan, not once per step, so its cost stays flat (~2.5k tokens) as the task deepens — as long as the added depth is deterministic execution (loops, code, tool calls), not extra LLM steps or a re-compile. ReAct re-sends the system prompt and tool schemas every turn, so its cost scales with turns. Hybrid is the only mode at 100% — but it is not the cheapest: pure LTP is. Hybrid spends a little more to convert LTP's fragile tail into reliability.
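To make the scaling argument concrete, here is a toy token model. The prefix, per-turn, and plan sizes are illustrative assumptions, not values measured by the harness, and `react_tokens` / `ltp_tokens` are hypothetical helper names.

```python
def react_tokens(turns: int, prefix: int = 2000, per_turn: int = 300) -> int:
    """ReAct: every turn re-sends the static prefix (system prompt + tool
    schemas) plus the conversation accumulated so far."""
    total = 0
    for turn in range(1, turns + 1):
        history = per_turn * (turn - 1)  # earlier turns are sent again
        total += prefix + history + per_turn
    return total


def ltp_tokens(steps: int, prefix: int = 2000, plan: int = 500) -> int:
    """LTP: one compile call. `steps` is deliberately unused, because
    deterministic execution steps cost no LLM tokens in this model."""
    return prefix + plan


if __name__ == "__main__":
    for depth in (1, 3, 5, 10):
        print(depth, react_tokens(depth), ltp_tokens(depth))
```

With these constants, `react_tokens(3)` is 7,800 while `ltp_tokens` stays at 2,500 for any number of deterministic steps. The real suite's numbers differ, but the shape is the same.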
The same 28-task suite, run on a weak, a mid, and a strong model. Each cell is success / avg tokens per task.
| Model | ReAct | LTP | Hybrid |
|---|---|---|---|
| Haiku (weak) | 96.4% / 9,301 | 96.4% / 2,517 | 100% / 3,088 |
| Sonnet (mid) | 100% / 9,502 | 100% / 2,925 | 100% / 2,611 |
| Opus (strong) | 89.3% / 6,570 | 92.9% / 3,409 | 100% / 4,281 |
Hybrid is 100% at every tier — the only mode that is. The model shapes the rest: a weak model makes ReAct loop (9.3k tokens), so LTP's token edge is largest; a mid model (Sonnet) lifts everyone to 100% — the sweet spot; a strong model (Opus) answers in ~3 turns "in its head", which is cheaper but drops ReAct to 89.3% on over-confidence (it skipped tools and miscomputed). Hybrid absorbs both failure modes.
Seven tasks that must chain real suite tools — memory, files, working memory — not just the sandbox. Each goal demands a clean final answer and runs in a fresh namespace. Two repetitions per task.
| Task | What it does | Reliable? |
|---|---|---|
| mem-recall | Store a fact in memory, recall it | solid |
| mem-count | Store 3 facts, then count them | solid |
| mem-update | Store a fact, overwrite it, recall the latest | solid |
| mem-then-file | Recall a number, compute, write & read a file | solid |
| file-upper | Write a file, read it, uppercase it | solid |
| working-mem | Set a working-memory value, read & double it | mostly |
| file-words | Write a file, read it, count the words | hard |
| Model | ReAct | LTP | Hybrid |
|---|---|---|---|
| Haiku | 78.6% / 16,709 | 71.4% / 6,924 | 100% / 12,244 |
| Sonnet | 78.6% / 17,133 | 71.4% / 6,168 | 85.7% / 8,689 |
| Opus | 71.4% / 22,490 | 64.3% / 8,618 | 78.6% / 25,072 |
Two honest findings. (1) Cost is the robust win: LTP routes the same tools at well under half of ReAct's tokens on every model (about 36–41%). (2) Hybrid is the most reliable mode at every tier (100% / 85.7% / 78.6%), because its fallback rescues each pure mode's distinct misses: LTP can't finish working-mem, ReAct short-circuits file-words, and Hybrid takes whichever one works. Six of seven tasks are now solid across all models. The decline on stronger models comes down to one stubborn task, file-words, where Sonnet and Opus return the raw file object instead of counting, even when told to reply with only the number. That is the same over-confidence that drops Opus's ReAct math score. We keep it in rather than quietly drop it.
Per-task receipts (avg tokens over 2 reps; ✗ = both reps failed, ½ = one of two):
Haiku:
| Task | ReAct | LTP | Hybrid |
|---|---|---|---|
| mem-recall | 17,970 | 2,652 | 17,691 |
| mem-count | 21,940 | 2,524 | 2,636 |
| mem-update | 7,724 | 2,747 ✗ | 12,520 |
| mem-then-file | 26,843 | 2,541 | 2,671 |
| file-words | 18,173 ✗ | 17,047 | 17,053 |
| file-upper | 10,759 | 18,449 | 13,333 |
| working-mem | 13,560 ½ | 2,512 ✗ | 19,807 |
Sonnet:
| Task | ReAct | LTP | Hybrid |
|---|---|---|---|
| mem-recall | 18,419 | 2,661 | 2,767 |
| mem-count | 18,083 | 2,523 | 2,636 |
| mem-update | 12,342 | 2,749 | 9,296 |
| mem-then-file | 23,844 | 2,695 | 2,835 |
| file-words | 18,162 ✗ | 16,730 ✗ | 18,748 ✗ |
| file-upper | 10,753 | 13,299 | 13,300 |
| working-mem | 18,330 ½ | 2,520 ✗ | 11,242 |
Opus:
| Task | ReAct | LTP | Hybrid |
|---|---|---|---|
| mem-recall | 22,210 | 3,597 | 26,436 |
| mem-count | 25,940 | 3,434 | 32,062 |
| mem-update | 15,059 | 3,698 ½ | 11,592 |
| mem-then-file | 33,116 | 3,682 | 37,808 |
| file-words | 22,217 ✗ | 25,776 ✗ | 25,775 ✗ |
| file-upper | 13,206 | 16,727 | 16,722 |
| working-mem | 25,685 ✗ | 3,414 ✗ | 25,108 ½ |
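The ✗ and ½ marks map directly onto the summary percentages. A minimal sketch of that arithmetic, using the marks from the first per-task table (its token averages match the Haiku row of the summary); the scoring of ½ as 0.5 is an assumption consistent with "one of two", and `success_rate` is a hypothetical helper:

```python
TASKS = [
    "mem-recall", "mem-count", "mem-update", "mem-then-file",
    "file-words", "file-upper", "working-mem",
]

# Score per task over 2 reps: 1.0 = both passed, 0.5 = one of two (½),
# 0.0 = both failed (✗). Only the non-perfect tasks are listed.
HAIKU_MISSES = {
    "react": {"file-words": 0.0, "working-mem": 0.5},
    "ltp": {"mem-update": 0.0, "working-mem": 0.0},
    "hybrid": {},
}


def success_rate(misses: dict, tasks: list = TASKS) -> float:
    """Mean per-task score as a percentage, rounded to one decimal."""
    scores = [misses.get(task, 1.0) for task in tasks]
    return round(100 * sum(scores) / len(scores), 1)


if __name__ == "__main__":
    for mode, misses in HAIKU_MISSES.items():
        print(mode, success_rate(misses))
```

This yields 78.6, 71.4 and 100.0, matching the Haiku row of the summary table.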
The same 28 deterministic tasks through a standard LangGraph ReAct agent, same model, one functionally-equivalent Python tool. We'll say the uncomfortable part first: on raw token count, a near-empty agent always wins — a feature-rich kernel can't out-token it on one-tool arithmetic (LangGraph ~1.4k tokens, our lean LTP ~2.2k).
But raw tokens aren't the bill — cost is, and cost flips by model. Each cell is $/task / tokens:
| Model | kernelmcp LTP | kernelmcp Hybrid | LangGraph | Cheaper |
|---|---|---|---|---|
| Haiku | $0.0023 / 2,194 | $0.0029 / 2,817 | $0.0019 / 1,424 | LangGraph (−17%) |
| Sonnet | $0.0020 / 2,515 | $0.0018 / 2,719 | $0.0057 / 1,466 | kernelmcp (~3×) |
On Sonnet — a production model — kernelmcp LTP/Hybrid is ~3× cheaper than LangGraph and faster (2.2s vs 4.2s), despite sending more raw tokens. The reason is prompt caching: kernelmcp's large static prefix exceeds the cache minimum and is billed at ~10% on a hit (effective ~$0.78/M vs LangGraph's ~$3.9/M); LangGraph's minimal prompt is too small to cache, so it pays full price. On Haiku nothing caches below its larger floor, so there the lean agent stays ~17% cheaper.
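A rough sketch of why a cached prefix can make more tokens cost less. The $3/M base input price and the token split are assumptions for illustration only. The model also ignores output tokens and any cache-write surcharge, so it is not expected to reproduce the ~$0.78/M figure exactly.

```python
def effective_input_price(
    prefix_tokens: int,
    fresh_tokens: int,
    base_price_per_m: float,
    cache_hit: bool = True,
    cached_price_fraction: float = 0.10,  # "~10% on a hit", per the text above
) -> float:
    """Blended $ per million input tokens when the static prefix is cached."""
    prefix_rate = base_price_per_m * (cached_price_fraction if cache_hit else 1.0)
    dollars = (prefix_tokens * prefix_rate + fresh_tokens * base_price_per_m) / 1_000_000
    return dollars / (prefix_tokens + fresh_tokens) * 1_000_000


if __name__ == "__main__":
    # Hypothetical split: a 2,000-token static prefix plus 200 fresh tokens.
    print(effective_input_price(2000, 200, base_price_per_m=3.0))
    # A prompt too small to cache pays the full base price on everything.
    print(effective_input_price(2000, 200, base_price_per_m=3.0, cache_hit=False))
```

The larger the static share of the prompt, the closer the blended price gets to the cached rate, which is how a bigger prompt can end up with a smaller bill.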
Fairness, openly: this is a bare LangGraph agent vs kernelmcp's full loop on arithmetic; LangGraph's tool is a plain exec vs our security sandbox; and the integrated workflows kernelmcp is built for (memory, 120+ tools, planning, scheduling) aren't portable to a stock agent, so they're not in this comparison. Pitch: cost-at-scale, latency, and the suite — not raw token count.
The single-tool test above is unflattering by design. The apples-to-apples version puts three orchestrators on the same real tools over MCP: LangGraph and CrewAI both connect to the standalone memorymcp server (same filtered 6-tool set); kernelmcp runs a memory-only kernel. Nobody touches anyone else — the only variable is who orchestrates. Seven memory/working-memory tasks, two reps. Each cell is success / $ per task:
| System | Haiku | Sonnet |
|---|---|---|
| LangGraph + memorymcp | 92.9% / $0.0055 | 92.9% / $0.0181 |
| CrewAI + memorymcp | 100% / $0.0058 | 100% / $0.0231 |
| kernelmcp ReAct | 85.7% / $0.0152 | 71.4% / $0.0329 |
| kernelmcp LTP | 57.1% / $0.0023 | 71.4% / $0.0026 |
| kernelmcp Hybrid | 100% / $0.0100 | 100% / $0.0075 |
Two honest reads. Reliability: CrewAI and kernelmcp Hybrid are the only systems at 100% on both models; LangGraph is close (92.9%); pure LTP is cheapest but unreliable on stateful chains. kernelmcp does not dominate reliability — CrewAI matches it. Cost: on Sonnet, kernelmcp Hybrid is the cheapest 100%-reliable system at $0.0075 — ~2.4× under LangGraph ($0.0181) and ~3× under CrewAI ($0.0231) (prompt caching). On Haiku (no caching) Hybrid is pricier than the lean agents. So the kernel's edge is real but conditional: cheapest-at-100% on a caching-capable production model, not on the cheapest tier.
Integrity note: CrewAI's built-in token counter undercounts on its litellm path (2,182 reported vs 8,502 actual on a Sonnet task), so its cost here is measured by intercepting the Anthropic SDK directly — every real API call counted. Caveats: 7 memory tasks only (file tasks excluded — tool names differ across the standalone servers); LangGraph/CrewAI latency includes a fresh memorymcp subprocess spawned per run, so latency isn't comparable.
The tests above are shallow. The realistic agentic surface is deep chains: five tasks that each chain 5–6 sequential calls across memory and code execution (store → recall → compute → store → recall). All three frameworks, same 7 tools over MCP. Each cell is success / $ per task:
| System | Haiku | Sonnet |
|---|---|---|
| LangGraph | 100% / $0.0102 | 100% / $0.0381 |
| CrewAI | 100% / $0.0151 | 100% / $0.0458 |
| kernelmcp ReAct | 100% / $0.0259 | 100% / $0.0468 |
| kernelmcp LTP | 80% / $0.0025 | 60% / $0.0041 |
| kernelmcp Hybrid | 100% / $0.0071 | 100% / $0.0146 |
Depth tips the balance. On deep chains, kernelmcp Hybrid is the cheapest 100%-reliable system on both models — 1.4× under LangGraph and 2.1× under CrewAI on Haiku, widening to 2.6× / 3.1× on Sonnet — and faster (Sonnet 9.5s vs ~30s). The reason is structural: LTP compiles the whole chain in one call (~2.4k tokens) then executes it deterministically — so the LLM is called once, not once per step, and cost stays flat as long as the added depth is deterministic execution rather than extra LLM steps or a re-compile. LangGraph and CrewAI re-invoke the LLM at every step, so their cost grows with chain length. This is the opposite of the single-tool result — the compile-once design pays off exactly as agentic depth grows.
Honest qualifier: LTP alone is unreliable on stateful chains (80% / 60% — it stumbles on a working-memory task), so this is a win for Hybrid (LTP-first + verified fallback), not LTP by itself. Reliability is a four-way tie at 100%; kernelmcp wins on cost at equal reliability, not on reliability. Small sample (5 tasks).
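The cost multiples quoted above can be rechecked from the table itself. The snippet below only divides the published $/task figures; the dictionary layout and function name are illustrative.

```python
# $ per task, copied from the deep-chain table above.
COST = {
    "haiku": {"LangGraph": 0.0102, "CrewAI": 0.0151, "kernelmcp Hybrid": 0.0071},
    "sonnet": {"LangGraph": 0.0381, "CrewAI": 0.0458, "kernelmcp Hybrid": 0.0146},
}


def ratio_vs_hybrid(model: str) -> dict:
    """How many times more each framework costs than kernelmcp Hybrid."""
    base = COST[model]["kernelmcp Hybrid"]
    return {
        name: round(cost / base, 1)
        for name, cost in COST[model].items()
        if name != "kernelmcp Hybrid"
    }


if __name__ == "__main__":
    print(ratio_vs_hybrid("haiku"))
    print(ratio_vs_hybrid("sonnet"))
```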
The kernel isn't the only thing worth measuring. Five of the standalone libraries went head-to-head with the tools people actually reach for — LlamaIndex, Mem0, a raw OS sandbox, Tavily, and DeepEval — under the same discipline: control the confounds, run the incumbent in its best configuration, and publish the losses as plainly as the wins. None of these is a clean sweep, and that's the point.
ragmcp vs LlamaIndex: retrieval
Same embedder on both sides (fastembed bge-small, identical weights wrapped for LlamaIndex too), same 20-doc corpus, by-construction ground truth. So any gap is the framework's chunking + retrieval logic, never the embedding model.
| Metric | ragmcp | LlamaIndex |
|---|---|---|
| recall@1 | 0.925 | 0.875 |
| recall@3 | 1.000 | 0.975 |
| MRR@10 | 0.954 | 0.930 |
| ingest (20 docs) | 0.58 s | 1.00 s |
| query latency | 13.8 ms | 10.5 ms |
The core retriever is at exact parity. Force one chunk per document on both sides — chunking removed as a variable — and every metric is identical (recall@1 0.875, MRR 0.93): same embedder, same cosine, same ranking. ragmcp's default-config edge comes entirely from its finer default chunk (500 chars vs LlamaIndex's 1024 tokens), not a smarter algorithm. ragmcp ingests ~1.7× faster in bulk; LlamaIndex answers ~1.3× faster per query. Verdict: competitive on retrieval, opposite performance tradeoffs — not a knockout.
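For readers unfamiliar with the metrics in the table, here is a minimal sketch of recall@k and MRR@k. It assumes exactly one relevant document per query (in line with by-construction ground truth); the function names and the toy data are hypothetical, not the benchmark script's.

```python
def recall_at_k(ranked: list, relevant: list, k: int) -> float:
    """Fraction of queries whose relevant doc appears in the top k results."""
    hits = sum(1 for results, gold in zip(ranked, relevant) if gold in results[:k])
    return hits / len(relevant)


def mrr_at_k(ranked: list, relevant: list, k: int = 10) -> float:
    """Mean reciprocal rank of the relevant doc, counting 0 if it is not in the top k."""
    total = 0.0
    for results, gold in zip(ranked, relevant):
        top = results[:k]
        if gold in top:
            total += 1.0 / (top.index(gold) + 1)
    return total / len(relevant)


if __name__ == "__main__":
    # Four toy queries: gold doc at rank 1, rank 2, rank 3, and missing.
    ranked = [["d1", "d2", "d3"], ["d5", "d4", "d6"], ["d9", "d8", "d7"], ["d2", "d1", "d3"]]
    relevant = ["d1", "d4", "d7", "d0"]
    print(recall_at_k(ranked, relevant, 1), recall_at_k(ranked, relevant, 3))
    print(mrr_at_k(ranked, relevant))
```

recall@1 rewards only a first-place hit, while MRR gives partial credit for near misses, which is why the two can move differently between frameworks.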
memorymcp vs Mem0: fact recall + cost
40 facts, 40 paraphrased queries (low lexical overlap, a real semantic test). Same embedder, both cosine. The headline work was neutralizing a confound: on ChromaDB, Mem0 ranks semantically backwards (its adapter passes raw distance where its ranker expects a similarity), so we ran Mem0 on its default Qdrant backend, where it scores correctly, and gave it its full hybrid pipeline (spaCy + BM25 + vector).
| Metric | memorymcp | Mem0 |
|---|---|---|
| recall@1 | 0.775 | 0.500 |
| recall@3 | 0.825 | 0.825 |
| recall@5 | 0.900 | 0.925 |
| MRR@10 | 0.826 | 0.679 |
| ingest (40 facts) | 1.6 s / $0 | 5.2 s |
| cost / 8 facts (extraction) | $0.00 | $0.071 |
Recall is close and split: memorymcp ranks the target higher (recall@1, MRR), Mem0 catches a couple more in the tail (recall@5). The clear, defensible gap is cost: memorymcp's deterministic PatternFactExtractor ingests free and fast, where Mem0's intended LLM extraction costs ~$0.009/fact. Mem0 answers ~3.5× faster per query (memorymcp pays for a richer pipeline + persistent store).
Caveat stated against us: this dataset is pure paraphrase with low lexical overlap, which under-uses Mem0's BM25 keyword layer — on keyword-heavy queries Mem0 would close the gap. It's a semantic-retrieval result, not a universal one.
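The ChromaDB confound mentioned earlier is easy to reproduce in miniature. This is a toy illustration, not Mem0's or ChromaDB's code: it only shows what happens when cosine distances (lower is better) are fed to a ranker that sorts as if they were similarities (higher is better). All names and scores are made up.

```python
def rank_by_similarity(scored: list) -> list:
    """A ranker that assumes higher score = more relevant."""
    return [doc for doc, _ in sorted(scored, key=lambda pair: pair[1], reverse=True)]


def cosine_distance_to_similarity(distance: float) -> float:
    """Cosine distance is defined as 1 - cosine similarity, so invert it."""
    return 1.0 - distance


# Hypothetical vector-store hits, scored as cosine DISTANCES.
hits = [("exact paraphrase", 0.08), ("related", 0.35), ("unrelated", 0.91)]

# Passing raw distances ranks the worst match first.
backwards = rank_by_similarity(hits)

# Converting to similarity first restores the intended order.
correct = rank_by_similarity(
    [(doc, cosine_distance_to_similarity(d)) for doc, d in hits]
)

if __name__ == "__main__":
    print(backwards)
    print(correct)
```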
sandboxmcp vs raw subprocess: containment
The question that matters for untrusted code: does a malicious payload’s effect reach the host or the network? Five escape payloads, run on the raw interpreter, the zero-dependency process backend, and the hardened Docker backend. The marker is verified on the host filesystem, so “contained” means the damage genuinely never landed.
| Host-impact payload | default (raw) | sandboxmcp (process) | sandboxmcp (docker) |
|---|---|---|---|
| read host file | leaked | leaked | contained |
| write host file | leaked | leaked | contained |
| network egress | leaked | contained | contained |
| 2 GB memory balloon | leaked | contained | contained |
| infinite loop | contained | contained | contained |
| Total contained | 1 / 5 | 3 / 5 | 5 / 5 |
The process backend (socket shim + OS limits) stops egress and resource attacks, but has no kernel isolation — host filesystem reads/writes leak. The Docker backend (network_mode=none, dropped capabilities, read-only rootfs, Docker’s default seccomp, memory & PID caps, throwaway container) contains all five — 5/5, verified on the host fs. Use the process backend for trusted code with guardrails; use Docker to contain code you don’t trust.
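As a rough idea of what "OS limits" can look like for a process-style backend, here is a Unix-only sketch using Python's standard `resource` module. It is an assumption about the general technique, not sandboxmcp's implementation: `run_limited` and `_cap` are hypothetical names, and the sketch does nothing for filesystem access, which is exactly the gap described above.

```python
import resource
import subprocess
import sys


def _cap(kind: int, wanted: int) -> None:
    """Set soft and hard limits to `wanted`, never above the current hard limit."""
    _, hard = resource.getrlimit(kind)
    if hard != resource.RLIM_INFINITY:
        wanted = min(wanted, hard)
    resource.setrlimit(kind, (wanted, wanted))


def run_limited(code: str, mem_bytes: int = 1 << 30,
                cpu_seconds: int = 2, wall_seconds: float = 10.0) -> str:
    """Run `code` in a child interpreter under address-space and CPU caps."""

    def apply_limits() -> None:
        # Runs in the child just before exec, so the parent is unaffected.
        _cap(resource.RLIMIT_AS, mem_bytes)
        _cap(resource.RLIMIT_CPU, cpu_seconds)

    try:
        proc = subprocess.run(
            [sys.executable, "-I", "-c", code],
            preexec_fn=apply_limits,
            capture_output=True,
            timeout=wall_seconds,
        )
    except subprocess.TimeoutExpired:
        return "timeout"
    return "ok" if proc.returncode == 0 else "stopped"


if __name__ == "__main__":
    print(run_limited("print(1)"))
    print(run_limited("while True: pass", cpu_seconds=1))
```

On Linux, a 2 GB allocation under the 1 GiB address-space cap would be expected to fail inside the child rather than exhaust the host, but a payload that reads or writes host files passes straight through, since resource limits are not isolation.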
Under the hood — static-validator coverage. A separate binary test of 20 payloads (a token prints only if the unsafe action ran) shows what the hardened validator catches before any backend runs — and, honestly, what it doesn’t:
| Attack class | default | sandboxmcp (process) |
|---|---|---|
| Known dangerous patterns (12) | 12 leaked | 12 blocked |
| Validator bypasses (6) | 6 leaked | 6 leaked |
| Resource exhaustion (2) | 1 leaked | 2 contained |
| Total neutralized | 2 / 20 | 14 / 20 |
A static scanner is bypassable by design: os.popen, pathlib and urllib evade the patterns. The validator is a first filter, not containment, which is exactly why the host-impact test above also runs on the Docker backend. (On Docker those bypasses still execute, but inside a network-less, read-only, throwaway container, so nothing reaches the host.)
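A tiny pattern scanner shows why such bypasses are structural. The blocklist below is hypothetical and much smaller than any real validator's; only the three bypass routes (os.popen, pathlib, urllib) come from the text above.

```python
import re

# Hypothetical blocklist of "known dangerous" patterns.
BLOCKLIST = [
    r"\bos\.system\b",
    r"\bsubprocess\b",
    r"\beval\s*\(",
    r"\bexec\s*\(",
    r"\bopen\s*\(",
    r"\bsocket\b",
]


def static_scan(source: str) -> list:
    """Return the blocklist patterns that match the source text."""
    return [pattern for pattern in BLOCKLIST if re.search(pattern, source)]


PAYLOADS = {
    "os.system": "import os; os.system('id')",
    "os.popen": "import os; os.popen('id').read()",
    "pathlib": "from pathlib import Path; Path('/etc/passwd').read_text()",
    "urllib": "import urllib.request; urllib.request.urlopen('http://example.com')",
}

if __name__ == "__main__":
    for name, payload in PAYLOADS.items():
        verdict = "blocked" if static_scan(payload) else "passes the scanner"
        print(f"{name}: {verdict}")
```

Only the first payload is caught. The other three reach the same capabilities through names the patterns never mention, and adding those names just moves the gap to the next alias.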
websearchmcp vs Tavily: the answer, not the index
Tavily runs its own real-time crawl + index; websearchmcp aggregates free engines (SearXNG + scrapers). On raw result quality that’s not a fair fight, and I won’t pretend it is. But an agent consumes the answer, not the result list, so that’s what I measured: fetch the top sources, trim to query-relevant passages, let the same LLM synthesize.
| Metric | websearchmcp | Tavily |
|---|---|---|
| Answer correctness (6 factual Qs) | 5 / 6 | 5 / 6 |
| Authority@3 (raw results, 10 Qs) | 3 / 10 | 6 / 10 |
| Search cost | $0 / no key | paid (credits) |
| Latency / query | ~22 s (live fetch) | ~1.3 s (index) |
We lose the index contest, structurally (3/10 vs 6/10 on surfacing an authoritative source) — you can’t out-crawl a crawler. But on the actual deliverable, the cited answer, it’s a tie: 5/6 each (the one miss is an Everest height both phrase as “8,848”, which the strict substring grader counts against both). The revealing detail — our top results were still SEO pages, yet the answer was right, because the LLM extracts the fact from any decent fetched page (Tavily’s own top-3 sometimes included quora/instagram too). The answer layer compensates for mediocre sources, on both sides. The price we pay for having no index is latency.
evalmcp vs DeepEval: does the judge agree with a human?
An eval library lives or dies by one thing: does its LLM-judge match a human label? 24 hand-labeled correctness cases (clear right/wrong, plus paraphrases that fool substring matching and plausible-but-wrong answers that fool lazy judges), every judge run on the same LLM, so the number measures the judge’s logic, not the model.
| Judge (same LLM) | Accuracy | F1 | Cohen’s κ |
|---|---|---|---|
| evalmcp contains (no LLM) | 0.83 | 0.83 | 0.68 |
| evalmcp LLM judge | 1.00 | 1.00 | 1.00 |
| DeepEval GEval | 1.00 | 1.00 | 1.00 |
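The three columns can be computed from two lists of boolean labels. A minimal sketch, assuming binary correct/incorrect labels with "correct" as the positive class; `agreement` is a hypothetical helper, and the 14/10 label split in the demo is an assumption for illustration, not the dataset's documented composition.

```python
def agreement(human: list, judge: list) -> tuple:
    """Accuracy, F1 (positive class = correct) and Cohen's kappa."""
    n = len(human)
    tp = sum(1 for h, j in zip(human, judge) if h and j)
    tn = sum(1 for h, j in zip(human, judge) if not h and not j)
    fp = sum(1 for h, j in zip(human, judge) if not h and j)
    fn = sum(1 for h, j in zip(human, judge) if h and not j)

    accuracy = (tp + tn) / n
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0

    # Chance agreement, from each rater's marginal label rates.
    p_yes = ((tp + fn) / n) * ((tp + fp) / n)
    p_no = ((tn + fp) / n) * ((tn + fn) / n)
    p_e = p_yes + p_no
    kappa = (accuracy - p_e) / (1 - p_e) if p_e < 1 else 1.0
    return accuracy, f1, kappa


if __name__ == "__main__":
    human = [True] * 14 + [False] * 10      # assumed split over 24 cases
    print(agreement(human, list(human)))     # a judge that matches the human
    print(agreement(human, [False] * 24))    # a judge that marks everything wrong
```

A judge that marks everything wrong scores F1 0.00 and κ 0.00 however many cases happen to be wrong, which is why κ is reported next to accuracy.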
Parity with DeepEval on judge-vs-human agreement — but the first run said 0.42 accuracy, F1 0.00: evalmcp’s LLM judge marked everything wrong, worse than the no-LLM baseline. Two bugs (a stray {{ }} in an f-string, a bare json.loads that failed on fenced replies and silently scored 0) — fixed, 0.42→1.00, with regression tests. Honest caveats: small, clear-cut dataset (both good judges hit the ceiling — parity, not a knockout), evalmcp grades correctness not Ragas-style faithfulness, and Ragas itself couldn’t run (a broken transitive dependency in the test env), so I’m not faking a number for it.
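Both bug classes are easy to hit in any LLM-judge code. The sketch below is a generic illustration, not evalmcp's actual fix: the prompt wording, the `correct` key, and the function names are assumptions. It shows the f-string brace rule, and a parser that tolerates fenced replies and reports a parse failure as `None` instead of silently scoring it as wrong.

```python
import json
import re

TICKS = "`" * 3  # a markdown code fence, built indirectly to keep this listing readable
FENCE = re.compile(TICKS + r"(?:json)?\s*(.*?)\s*" + TICKS, re.DOTALL)


def build_prompt(question: str, answer: str) -> str:
    # Inside an f-string, {name} interpolates and {{ }} renders literal braces.
    # Mixing the two up either drops the value or leaks braces into the prompt.
    return (
        f"Question: {question}\n"
        f"Answer: {answer}\n"
        f'Reply with JSON only, e.g. {{"correct": true}}'
    )


def parse_verdict(reply: str):
    """Return True/False, or None when the reply cannot be parsed."""
    match = FENCE.search(reply)
    payload = match.group(1) if match else reply.strip()
    try:
        data = json.loads(payload)
    except json.JSONDecodeError:
        return None  # surface the failure instead of scoring it as 0
    if not isinstance(data, dict) or "correct" not in data:
        return None
    return bool(data["correct"])


if __name__ == "__main__":
    print(build_prompt("2 + 2?", "4"))
    print(parse_verdict(TICKS + 'json\n{"correct": true}\n' + TICKS))
    print(parse_verdict("looks right to me"))
```

Keeping "could not parse" separate from "judged wrong" is what makes a regression like 0.42 accuracy visible as a parser fault instead of a model fault.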
Each benchmark is one reproducible script with raw JSON in its repo: ragmcp/benchmarks/, memorymcp/benchmarks/, sandboxmcp/benchmarks/, websearchmcp/benchmarks/, evalmcp/benchmarks/. Every one of these runs surfaced or confirmed a real bug — all fixed.
On deterministic, decomposable work LTP is the cheapest by far (about 27% of ReAct's tokens on Haiku, 31% on Sonnet, 52% on Opus) and just as reliable. Compiling once (one LLM call, not one per step) keeps cost flat as that deterministic depth grows.
LTP-first with a verified fallback is the most reliable mode in both regimes: 100% on the deterministic suite, and the top score for every model on tool chains too, because its fallback rescues each pure mode's distinct failures. It can't fix a task both modes fail, but it never does worse.
ReAct's per-turn adaptivity is the right tool for exploration and knowledge lookups. On the deterministic suite it is the most expensive and, on a strong model, the least reliable (over-confidence).
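The LTP-first pattern with a verified fallback can be sketched as a small piece of control flow. This is an assumed shape, not the kernel's code: `run_ltp`, `run_react`, and `verify` are hypothetical callables, and the stubs in the demo stand in for real LLM calls.

```python
from typing import Callable, Optional, Tuple


def run_hybrid(
    task: str,
    run_ltp: Callable[[str], Optional[str]],
    run_react: Callable[[str], Optional[str]],
    verify: Callable[[str, str], bool],
) -> Tuple[Optional[str], str]:
    """Try the cheap compile-once path first; fall back only if it fails checks."""
    answer = run_ltp(task)
    if answer is not None and verify(task, answer):
        return answer, "ltp"
    # The fallback pays ReAct's full multi-turn cost, but only on this task.
    return run_react(task), "react-fallback"


if __name__ == "__main__":
    # Stub backends: LTP handles the arithmetic task, not the knowledge hop.
    ltp_answers = {"35% of 240": "84"}

    def ltp(task):
        return ltp_answers.get(task)

    def react(task):
        return "Tehran" if "capital" in task else None

    def non_empty(task, answer):
        return bool(answer.strip())

    print(run_hybrid("35% of 240", ltp, react, non_empty))
    print(run_hybrid("capital of Iran", ltp, react, non_empty))
```

The design choice is that the expensive path is paid per failing task, not per suite, so average cost stays near LTP's while reliability follows whichever mode succeeds.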
| Caveat | Detail |
|---|---|
| Two kinds of comparison | Most sections pit the kernel's own modes against each other (same kernel, same tools). The cross-framework sections vs LangGraph and CrewAI are real, but methodologically contentious by nature: different tool implementations, a fresh subprocess spawned per run (so latency isn't comparable), and a framework token-counter that undercounts. Each of those carries its fairness caveats stated inline, right next to the numbers. |
| Small samples | 28 deterministic + 7 multi-tool tasks, each run on all three models (Haiku, Sonnet, Opus). Deltas are directional, not a guarantee for every workload — more tasks and reps would tighten the intervals. |
| Hybrid's fallback isn't free | The verification step occasionally re-runs a correct-but-terse LTP answer through ReAct. It never returns a wrong answer, but it adds tokens — the documented price of the safety net. |
| Runs against current source | Numbers reflect the kernel at the repo's HEAD. Clone and run the harness to reproduce them exactly. |
All 28 tasks, every mode. A ✗ marks a failed answer. Haiku is shown first because its token spread is the widest (a weak model makes ReAct loop); the Sonnet and Opus tables follow. Nothing hidden.
| Task | What it does | ReAct | LTP | Hybrid |
|---|---|---|---|---|
| speed | Average speed: 60 km in 1.5 h | 5,290 | 2,482 | 2,583 |
| ratio | Cost of 7 apples if 3 cost $1.20 | 6,490 | 2,486 | 2,586 |
| seq | Next number in 2, 6, 12, 20, 30 | 25,730 | 2,472 | 2,647 |
| percent | 35% of 240 | 5,223 | 2,464 | 2,550 |
| mul-sub | 123 × 456, then − 1000 | 6,722 | 2,473 | 2,562 |
| fib | 15th Fibonacci number | 13,942 | 2,535 | 2,633 |
| factorial | 12 factorial (12!) | 6,700 | 2,466 | 2,550 |
| primes | Sum of all primes below 30 | 13,980 | 2,548 | 2,766 |
| squares | Sum of squares 1²…5² | 13,722 | 2,483 | 2,575 |
| gcd | GCD of 84 and 126 | 13,706 | 2,481 | 2,572 |
| reverse | Reverse the string 'benchmark' | 2,593 ✗ | 2,462 | 2,546 |
| vowels | Count vowels in 'orchestration' | 5,398 | 2,499 | 14,809 |
| wordcount | Words in a 9-word sentence | 5,297 | 2,486 | 2,581 |
| upper | Uppercase 'kernelmcp' | 3,181 | 2,463 | 2,548 |
| sort | Sort 7, 2, 9, 1, 5 ascending | 2,634 | 2,500 | 2,600 |
| max | Largest of 14, 88, 23, 91, 47 | 5,255 | 2,487 | 2,584 |
| evens | Even numbers from 1 to 20 | 5,273 | 2,486 | 2,575 |
| binary | Binary 1011 → decimal | 5,308 | 2,468 | 2,554 |
| branch-parity | 17 × 23, then branch on even/odd | 14,305 | 2,518 | 2,752 |
| fact-digits | Smallest n where n! exceeds 10 digits | 20,293 | 2,500 | 2,721 |
| mul-until | ×3 from 1 until > 1000: how many steps | 11,877 | 2,509 | 2,811 |
| nextprime-sq | Smallest prime > 100, squared | 5,970 | 2,544 | 2,633 |
| reverse-sub | 2024 minus its digit-reverse | 5,841 | 2,505 | 2,735 |
| div3not5 | Sum 1–50 divisible by 3 but not 5 | 11,917 | 2,505 | 2,603 |
| fib-label | 20th Fibonacci, label BIG/SMALL | 7,107 | 2,580 | 2,750 |
| string-chain | Strip vowels, reverse, give length | 12,216 | 2,526 | 2,704 |
| sq-vs-sqsum | (Σ1…10)² − Σ(1²…10²) | 14,651 | 2,530 | 2,742 |
| iran-capital | Reverse 'narI' → country → its capital | 9,822 | 3,018 ✗ | 3,198 |
Read it honestly: LTP's column is strikingly flat (~2.5k) because it compiles once instead of calling the LLM per step, so cost stays flat as deterministic depth grows, while ReAct swings from 2.6k to 25.7k (it can spiral on a trivial task like seq). LTP's one miss is iran-capital (knowledge); ReAct's is reverse (it answered without the sandbox). Hybrid is the only column with no ✗. Its one large spike (vowels, 14.8k) is a verified fallback: the price of catching the misses. Full raw JSON for every run, model, and suite lives in the repo.
Sonnet:
| Task | ReAct | LTP | Hybrid |
|---|---|---|---|
| speed | 7,890 | 2,480 | 2,582 |
| ratio | 9,645 | 2,494 | 2,599 |
| seq | 7,884 | 2,482 | 2,570 |
| percent | 7,842 | 2,464 | 2,550 |
| mul-sub | 6,668 | 2,473 | 2,563 |
| fib | 13,789 | 2,523 | 2,638 |
| factorial | 6,655 | 2,466 | 2,551 |
| primes | 13,693 | 2,497 | 2,582 |
| squares | 13,620 | 2,482 | 2,575 |
| gcd | 13,595 | 2,479 | 2,571 |
| reverse | 2,594 | 2,462 | 2,547 |
| vowels | 2,632 | 2,493 | 2,592 |
| wordcount | 7,878 | 2,484 | 2,580 |
| upper | 3,178 | 2,463 | 2,549 |
| sort | 2,627 | 2,495 | 2,604 |
| max | 7,884 | 2,493 | 2,596 |
| evens | 7,860 | 2,482 | 2,573 |
| binary | 7,851 | 2,468 | 2,555 |
| branch-parity | 6,769 | 2,523 | 2,638 |
| fact-digits | 11,335 | 14,065 | 2,605 |
| mul-until | 17,927 | 2,507 | 2,614 |
| nextprime-sq | 5,384 | 2,512 | 2,628 |
| reverse-sub | 18,827 | 2,483 | 2,598 |
| div3not5 | 17,911 | 2,498 | 2,597 |
| fib-label | 6,927 | 2,587 | 2,713 |
| string-chain | 16,743 | 2,518 | 2,627 |
| sq-vs-sqsum | 13,774 | 2,528 | 2,634 |
| iran-capital | 6,678 | 3,000 | 3,098 |
Opus:
| Task | ReAct | LTP | Hybrid |
|---|---|---|---|
| speed | 6,469 | 3,368 | 3,501 |
| ratio | 7,942 | 3,377 | 3,510 |
| seq | 6,501 | 3,365 | 6,816 |
| percent | 9,585 | 3,347 | 3,461 |
| mul-sub | 3,916 ✗ | 3,360 | 3,499 |
| fib | 11,880 | 3,394 ✗ | 15,867 |
| factorial | 3,911 | 3,354 | 3,469 |
| primes | 7,901 | 3,376 | 3,547 |
| squares | 11,859 | 3,371 | 3,496 |
| gcd | 11,856 | 3,367 | 3,491 |
| reverse | 3,174 | 3,357 | 3,482 |
| vowels | 9,642 | 3,384 | 3,514 |
| wordcount | 6,478 | 3,385 | 3,517 |
| upper | 3,920 | 3,356 | 3,482 |
| sort | 3,195 | 3,389 | 3,529 |
| max | 9,627 | 3,348 | 3,474 |
| evens | 9,615 | 3,371 | 3,493 |
| binary | 9,600 | 3,354 | 3,471 |
| branch-parity | 3,966 | 3,395 | 3,568 |
| fact-digits | 3,836 | 3,397 | 3,625 |
| mul-until | 6,457 | 3,404 | 3,654 |
| nextprime-sq | 3,173 ✗ | 3,401 | 3,548 |
| reverse-sub | 7,044 | 3,391 | 3,564 |
| div3not5 | 3,275 ✗ | 3,387 | 3,686 |
| fib-label | 4,069 | 3,471 ✗ | 8,002 |
| string-chain | 3,218 | 3,413 | 3,686 |
| sq-vs-sqsum | 7,946 | 3,398 | 3,591 |
| iran-capital | 3,929 | 4,181 | 4,344 |
The harness lives in the kernelmcp repo. Point it at your own API key and run all three modes over the full suite.
```bash
# from the kernelmcp repo
python benchmarks/ltp_vs_react_bench.py \
  --model claude-haiku-4-5 \
  --modes react,ltp,hybrid \
  --reps 2 --suite all

# multi-tool suite (chains real memory/file tools)
python benchmarks/ltp_vs_react_bench.py \
  --modes react,ltp,hybrid --suite multitool --reps 2
```