I Benchmarked My Agent Framework Against LangGraph and CrewAI — Honestly
I built an agent framework (kernelmcp — ReAct, plus a compile-once mode called LTP, plus a Hybrid that runs LTP-first with a verified fallback). The obvious question: is it actually any good versus the incumbents?
Most framework benchmarks are marketing — “10× faster!” with no harness. So I did the opposite: I ran kernelmcp against LangGraph and CrewAI on the same model, the same tools over MCP, with reproducible runs, and I’m publishing where I lose as plainly as where I win. Every number below is on the benchmarks page with the raw per-run JSON.
Spoiler: it’s not a clean sweep. It’s a specific, defensible win — and a few honest losses.
Ground rules (so it’s fair, not flattering)
- Same model, pinned for every call (Claude Haiku and Sonnet).
- Same tools. For the cross-framework runs, LangGraph and CrewAI connect to my standalone MCP servers (memorymcp, sandboxmcp) via the official adapters; kernelmcp runs the same tools. No framework touches another — the only variable is who orchestrates.
- Reproducible. Fixed, verifiable tasks; fresh state per run; success = the expected value appears in the final answer.
- Honest token accounting. I capture each framework’s real usage — including intercepting the Anthropic SDK for CrewAI, whose built-in counter undercounts on its litellm path (it reported 2,182 tokens on a task that actually used 8,502). If I can’t measure it, I don’t print it.
Loss #1 — the toy task: the bare agent wins
A single-tool arithmetic task. A minimal LangGraph ReAct agent does it in ~1,420 tokens, 2 turns. My kernel, even stripped down, does ~2,200. A feature-rich kernel cannot out-token a near-empty agent on a one-tool task — to do that you’d have to become a near-empty agent. That’s structural, and pretending otherwise would be dishonest. On simple tasks, reach for the lean tool.
Tie — reliability: CrewAI matches me
On a suite of memory/working-memory tasks (all three frameworks on the same memorymcp tools):
| System | Haiku (success / $) | Sonnet (success / $) |
|---|---|---|
| LangGraph + memorymcp | 92.9% / $0.0055 | 92.9% / $0.0181 |
| CrewAI + memorymcp | 100% / $0.0058 | 100% / $0.0231 |
| kernelmcp Hybrid | 100% / $0.0100 | 100% / $0.0075 |
Reliability is a tie — CrewAI and my Hybrid both hit 100%; LangGraph trails slightly. I do not dominate reliability here. On cost, it splits by model: on Haiku (no prompt caching for that model) the lean agents are cheaper; on Sonnet, my Hybrid is cheapest because its large static prefix is prompt-cached (more on that below) while the competitors’ minimal prompts are too small to cache.
Win — deep agentic chains: this is where it tips
The realistic agentic surface isn’t one tool call — it’s deep chains. Five tasks that each chain 5–6 sequential steps across memory and code execution (store → recall → compute → store → recall), same 7 tools for everyone:
| System | Haiku (success / $) | Sonnet (success / $) |
|---|---|---|
| LangGraph | 100% / $0.0102 | 100% / $0.0381 |
| CrewAI | 100% / $0.0151 | 100% / $0.0458 |
| kernelmcp Hybrid | 100% / $0.0071 | 100% / $0.0146 |
On deep chains my Hybrid is the cheapest system that still hits 100% — on both models — ~1.4–2.1× under LangGraph and CrewAI on Haiku, widening to ~2.6–3.1× on Sonnet, and faster.
The mechanism is the whole point: LTP compiles the entire chain in one LLM call (~2.4k tokens), then executes it deterministically in Python. Because the LLM is called once to compile rather than once per step, the cost stays flat as long as the added depth is deterministic execution (loops, code, tool calls) — not extra LLM-reasoning steps or a re-compile on failure. LangGraph and CrewAI re-invoke the model at every step, so their cost grows with chain length. The deeper the (deterministic) work, the bigger the gap. It’s the exact opposite of the toy-task result.
The honest caveat: LTP alone is unreliable on stateful chains (it drops to 60–80%). The win belongs to Hybrid — LTP-first with a verified ReAct fallback — not to LTP by itself.
Pushing deeper: where naive ReAct collapses entirely
I then made the chains harder — 8–15 fully state-dependent steps (a running ledger, a counter doubled N times, multi-stage gcd/lcm), where one dropped or stale value poisons the answer. Same 7 tools, same model (Haiku), 3 runs per task:
| System | success | avg $/task | latency |
|---|---|---|---|
| LangGraph | 0% | $0 | 244s (timeout) |
| CrewAI | 40% | $0.011 | 122s |
| kernelmcp LTP | 100% (15/15) | $0.0035 | 2.2s |
| kernelmcp Hybrid | 100% | $0.0036 | 3.4s |
At this depth LangGraph times out on every task (its ReAct loop never converges within 240s) and CrewAI hangs/half-finishes, while kernelmcp holds 100% at ~$0.0035 and ~2s — the compile-once advantage, decisive.
Getting here surfaced two real bugs, both fixed (and that’s the point of benchmarking honestly):
First, mutable state recalled from append-only facts returns stale values, so LTP gained a
working-memory primitive (set/get, overwrite). Second — the subtler one — the compiler
chose that primitive only ~1/3 of the time, falling back to a fragile stale-recall + envelope-parse
hack the rest, which is what made the early runs swing 20–80%. Strengthening the compiler’s
mutable-state rule (recall→modify→store-back ⇒ working memory, any wording) took it to 6/6, and the
benchmark from noisy to 100%.
Confirmed across models and a broader set. Re-running on 11 deep state-dependent tasks (2 runs each), on both Haiku and Sonnet:
| model | kernelmcp LTP | kernelmcp Hybrid |
|---|---|---|
| Haiku | 95.5% | 100% |
| Sonnet | 100% | 100% |
Hybrid hits 100% on both — the LTP-first path handles it, and the verified ReAct fallback covers the single Haiku LTP miss (a “repeat N times” loop). That’s exactly what the fallback is for.
Honest caveats: still synthetic deterministic tasks, and LangGraph’s 0% is timeout-driven (structurally slow on deep loops, not “broken”). The robust signal isn’t a precise %, it’s the shape: deep state-dependent chains break naive ReAct; a compiled plan doesn’t.
The meta-result: benchmarking honestly found real bugs
Chasing these numbers surfaced five real bugs in my own code that “success rate” had hidden — the kind you only catch when you measure cost and outputs, not just pass/fail:
@RESPONDreturned a variable’s name instead of its value (31/36 runs).@EXEC_CODEstored the whole sandbox envelope, so answers came back as{request_id…}dicts — pervasive, but masked because the right number sat inside the dict.- The Hybrid router was keyword guesswork; I rebuilt it as LTP-first with verified fallback.
- Hybrid didn’t fall back when LTP returned an error result.
- Prompt caching never actually fired —
cache_controlwas set at the message level, which Anthropic silently ignores. Fixing it (block-level) cut cached-prefix cost ~90% on Sonnet and is why the kernel now wins on cost there.
A benchmark that only reports “we’re faster” would have shipped all five.
So… is it the best?
No — and I’d distrust anyone who answered yes to “best at everything.” Here’s the honest scorecard:
- Raw tokens / simple tasks: I lose. A bare agent is leaner; that’s structural.
- Reliability: a tie — CrewAI matches my Hybrid.
- Cost for deep agentic workflows on a capable model: I win — cheapest at 100%, and the advantage grows with depth.
- Ecosystem, maturity, adoption: the incumbents win — that’s community and time, not code.
The defensible claim isn’t “the best.” It’s: the cheapest system that stays 100% reliable on deep agentic workflows, on a production model — measured, reproducible, with the losses shown.
That’s the part I’d actually stake my name on: not a superlative, but a number you can re-run.
All results reproduce from the benchmarks page — harness and raw JSON included. kernelmcp and the suite are open source: GitHub.