June 3, 2026 · gashel01

I Benchmarked My Agent Framework Against LangGraph and CrewAI — Honestly

I built an agent framework (kernelmcp — ReAct, plus a compile-once mode called LTP, plus a Hybrid that runs LTP-first with a verified fallback). The obvious question: is it actually any good versus the incumbents?

Most framework benchmarks are marketing — “10× faster!” with no harness. So I did the opposite: I ran kernelmcp against LangGraph and CrewAI on the same model, the same tools over MCP, with reproducible runs, and I’m publishing where I lose as plainly as where I win. Every number below is on the benchmarks page with the raw per-run JSON.

Spoiler: it’s not a clean sweep. It’s a specific, defensible win — and a few honest losses.

Ground rules (so it’s fair, not flattering)

Same model, pinned for every call (Claude Haiku and Sonnet).
Same tools. For the cross-framework runs, LangGraph and CrewAI connect to my standalone MCP servers (memorymcp, sandboxmcp) via the official adapters; kernelmcp runs the same tools. No framework touches another — the only variable is who orchestrates.
Reproducible. Fixed, verifiable tasks; fresh state per run; success = the expected value appears in the final answer.
Honest token accounting. I capture each framework’s real usage — including intercepting the Anthropic SDK for CrewAI, whose built-in counter undercounts on its litellm path (it reported 2,182 tokens on a task that actually used 8,502). If I can’t measure it, I don’t print it.

Loss #1 — the toy task: the bare agent wins

A single-tool arithmetic task. A minimal LangGraph ReAct agent does it in ~1,420 tokens, 2 turns. My kernel, even stripped down, does ~2,200. A feature-rich kernel cannot out-token a near-empty agent on a one-tool task — to do that you’d have to become a near-empty agent. That’s structural, and pretending otherwise would be dishonest. On simple tasks, reach for the lean tool.

Tie — reliability: CrewAI matches me

On a suite of memory/working-memory tasks (all three frameworks on the same memorymcp tools):

System	Haiku (success / $)	Sonnet (success / $)
LangGraph + memorymcp	92.9% / $0.0055	92.9% / $0.0181
CrewAI + memorymcp	100% / $0.0058	100% / $0.0231
kernelmcp Hybrid	100% / $0.0100	100% / $0.0075

Reliability is a tie — CrewAI and my Hybrid both hit 100%; LangGraph trails slightly. I do not dominate reliability here. On cost, it splits by model: on Haiku (no prompt caching for that model) the lean agents are cheaper; on Sonnet, my Hybrid is cheapest because its large static prefix is prompt-cached (more on that below) while the competitors’ minimal prompts are too small to cache.

Win — deep agentic chains: this is where it tips

The realistic agentic surface isn’t one tool call — it’s deep chains. Five tasks that each chain 5–6 sequential steps across memory and code execution (store → recall → compute → store → recall), same 7 tools for everyone:

System	Haiku (success / $)	Sonnet (success / $)
LangGraph	100% / $0.0102	100% / $0.0381
CrewAI	100% / $0.0151	100% / $0.0458
kernelmcp Hybrid	100% / $0.0071	100% / $0.0146

On deep chains my Hybrid is the cheapest system that still hits 100% — on both models — ~1.4–2.1× under LangGraph and CrewAI on Haiku, widening to ~2.6–3.1× on Sonnet, and faster.

The mechanism is the whole point: LTP compiles the entire chain in one LLM call (~2.4k tokens), then executes it deterministically in Python. Because the LLM is called once to compile rather than once per step, the cost stays flat as long as the added depth is deterministic execution (loops, code, tool calls) — not extra LLM-reasoning steps or a re-compile on failure. LangGraph and CrewAI re-invoke the model at every step, so their cost grows with chain length. The deeper the (deterministic) work, the bigger the gap. It’s the exact opposite of the toy-task result.

The honest caveat: LTP alone is unreliable on stateful chains (it drops to 60–80%). The win belongs to Hybrid — LTP-first with a verified ReAct fallback — not to LTP by itself.

Pushing deeper: where naive ReAct collapses entirely

I then made the chains harder — 8–15 fully state-dependent steps (a running ledger, a counter doubled N times, multi-stage gcd/lcm), where one dropped or stale value poisons the answer. Same 7 tools, same model (Haiku), 3 runs per task:

System	success	avg $/task	latency
LangGraph	0%	$0	244s (timeout)
CrewAI	40%	$0.011	122s
kernelmcp LTP	100% (15/15)	$0.0035	2.2s
kernelmcp Hybrid	100%	$0.0036	3.4s

At this depth LangGraph times out on every task (its ReAct loop never converges within 240s) and CrewAI hangs/half-finishes, while kernelmcp holds 100% at ~$0.0035 and ~2s — the compile-once advantage, decisive.

Getting here surfaced two real bugs, both fixed (and that’s the point of benchmarking honestly): First, mutable state recalled from append-only facts returns stale values, so LTP gained a working-memory primitive (set/get, overwrite). Second — the subtler one — the compiler chose that primitive only ~1/3 of the time, falling back to a fragile stale-recall + envelope-parse hack the rest, which is what made the early runs swing 20–80%. Strengthening the compiler’s mutable-state rule (recall→modify→store-back ⇒ working memory, any wording) took it to 6/6, and the benchmark from noisy to 100%.

Confirmed across models and a broader set. Re-running on 11 deep state-dependent tasks (2 runs each), on both Haiku and Sonnet:

model	kernelmcp LTP	kernelmcp Hybrid
Haiku	95.5%	100%
Sonnet	100%	100%

Hybrid hits 100% on both — the LTP-first path handles it, and the verified ReAct fallback covers the single Haiku LTP miss (a “repeat N times” loop). That’s exactly what the fallback is for.

Honest caveats: still synthetic deterministic tasks, and LangGraph’s 0% is timeout-driven (structurally slow on deep loops, not “broken”). The robust signal isn’t a precise %, it’s the shape: deep state-dependent chains break naive ReAct; a compiled plan doesn’t.

The meta-result: benchmarking honestly found real bugs

Chasing these numbers surfaced five real bugs in my own code that “success rate” had hidden — the kind you only catch when you measure cost and outputs, not just pass/fail:

@RESPOND returned a variable’s name instead of its value (31/36 runs).
@EXEC_CODE stored the whole sandbox envelope, so answers came back as {request_id…} dicts — pervasive, but masked because the right number sat inside the dict.
The Hybrid router was keyword guesswork; I rebuilt it as LTP-first with verified fallback.
Hybrid didn’t fall back when LTP returned an error result.
Prompt caching never actually fired — cache_control was set at the message level, which Anthropic silently ignores. Fixing it (block-level) cut cached-prefix cost ~90% on Sonnet and is why the kernel now wins on cost there.

A benchmark that only reports “we’re faster” would have shipped all five.

So… is it the best?

No — and I’d distrust anyone who answered yes to “best at everything.” Here’s the honest scorecard:

Raw tokens / simple tasks: I lose. A bare agent is leaner; that’s structural.
Reliability: a tie — CrewAI matches my Hybrid.
Cost for deep agentic workflows on a capable model: I win — cheapest at 100%, and the advantage grows with depth.
Ecosystem, maturity, adoption: the incumbents win — that’s community and time, not code.

The defensible claim isn’t “the best.” It’s: the cheapest system that stays 100% reliable on deep agentic workflows, on a production model — measured, reproducible, with the losses shown.

That’s the part I’d actually stake my name on: not a superlative, but a number you can re-run.

All results reproduce from the benchmarks page — harness and raw JSON included. kernelmcp and the suite are open source: GitHub.