April 20, 2026 · gashel01

Stop Letting Your LLM Drive: Compile the Plan, Execute It Deterministically

Author: MCP AI Suite

LTP — the Lean Task Protocol — treats the LLM as a compiler, not a runtime. One call writes the whole plan; pure Python executes it. This post explains how, and gives the measured numbers — including where it loses.

Most agent frameworks follow the same pattern: hand the LLM some tools, let it decide what to do at every step, and hope it doesn’t loop, drift, or forget the goal after turn 8. That’s ReAct (Reason + Act). It works until it doesn’t, and even when it works it re-sends the entire conversation on every turn.

LTP separates the two jobs a ReAct loop conflates: planning and execution. A SQL planner doesn’t re-plan after every row. A compiler doesn’t re-compile after every instruction. So LTP compiles the goal into a structured plan in one LLM call, then runs that plan deterministically in Python.

Honesty up front: this is not a universal win. On a trivial one-shot question a lean ReAct agent is just as cheap. LTP’s payoff shows up as tasks get deep — and it’s the Hybrid mode (LTP-first with a verified ReAct fallback) that you actually want in production, because LTP alone is unreliable on knowledge lookups and stateful chains. The numbers below are measured and reproducible; the full data lives on the benchmarks page.

The idea: what if the LLM only ran once?

User: "What's the weather in Ibiza?"
         ↓  LTP Compiler — ONE LLM call:
    S1: search "weather Ibiza today" > $results
    S2: respond $results
         ↓  LTP Runtime — pure Python, no LLM:
    execute S1 → store results → execute S2 → return

The compile call costs tokens once; execution costs zero LLM tokens. Compare that to a ReAct loop that re-invokes the model at every turn, re-reading everything it already saw.
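To make that comparison concrete, here is a toy cost model. It is a sketch, not a measurement: it assumes every ReAct turn re-sends a fixed prompt plus all earlier observations, and that the compile-once path pays for a single call. The function names and the token sizes (500 for the prompt, 300 per observation) are illustrative assumptions, and the model ignores output tokens and any LLM micro-call steps inside a plan.

```python
def react_tokens(turns: int, prompt: int = 500, observation: int = 300) -> int:
    """Input tokens when each turn re-sends the prompt plus all prior observations."""
    total = 0
    for turn in range(turns):
        # By turn N the history already holds N earlier observations.
        total += prompt + turn * observation
    return total


def compile_once_tokens(prompt: int = 500) -> int:
    """One compile call; deterministic execution adds no LLM tokens."""
    return prompt


for turns in (1, 3, 6):
    print(turns, react_tokens(turns), compile_once_tokens())
# 1 500 500
# 3 2400 500
# 6 7500 500
```

Under these assumptions the two approaches tie on a one-turn task, and the ReAct total grows quadratically with the number of turns because the history is re-read each time.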

The DSL: a language for agent plans

LTP is a tiny DSL designed to be easy for an LLM to generate (translation, not reasoning):

PLAN_START
S1: @WEB_SEARCH (query="prime minister of France 2026") > $results
S2: @LLM_EXTRACT ($results, target="person_name_and_title") > $answer
S3: @RESPOND ($answer)
PLAN_END

Each step has an ID, a tool, arguments with explicit $variable references, and an output variable (> $answer). Variables flow explicitly — no hidden “based on the previous output.” Three kinds of step:

Deterministic tools (no LLM): @WEB_SEARCH, @EXEC_CODE, @READ_FILE, @WRITE_FILE…
LLM micro-calls (one focused prompt): @LLM_EXTRACT, @LLM_CLASSIFY, @LLM_SUMMARIZE
Adaptive agent (a scoped mini-ReAct for genuine exploration): @AGENT_SOLVE(...)

Most steps don’t need an LLM at all — and the ones that do operate on a focused input, not the whole conversation.
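The step format is regular enough to parse with a single regular expression. The sketch below is not the ltpmcp parser; it is a minimal illustration that assumes one step per line, step IDs of the form S<number>, and an optional "> $name" output. The names STEP_RE, Step, and parse_plan are hypothetical, and arguments are kept as raw text rather than parsed.

```python
import re
from dataclasses import dataclass
from typing import Optional

# Hypothetical grammar: "<id>: @<TOOL> (<args>) > $<output>" with the output optional.
STEP_RE = re.compile(
    r"^(?P<id>S\d+):\s*@(?P<tool>[A-Z_]+)\s*\((?P<args>.*)\)"
    r"\s*(?:>\s*\$(?P<out>\w+))?\s*$"
)


@dataclass
class Step:
    id: str
    tool: str
    args: str              # raw argument text, left unparsed in this sketch
    inputs: list[str]      # $variables referenced inside the arguments
    output: Optional[str]  # variable written by "> $name", if any


def parse_plan(text: str) -> list[Step]:
    steps = []
    for line in text.strip().splitlines():
        line = line.strip()
        if not line or line in ("PLAN_START", "PLAN_END"):
            continue
        m = STEP_RE.match(line)
        if m is None:
            raise ValueError(f"cannot parse step: {line!r}")
        inputs = re.findall(r"\$(\w+)", m["args"])
        steps.append(Step(m["id"], m["tool"], m["args"], inputs, m["out"]))
    return steps


PLAN = '''
PLAN_START
S1: @WEB_SEARCH (query="prime minister of France 2026") > $results
S2: @LLM_EXTRACT ($results, target="person_name_and_title") > $answer
S3: @RESPOND ($answer)
PLAN_END
'''

for step in parse_plan(PLAN):
    print(step.id, step.tool, step.inputs, step.output)
# S1 WEB_SEARCH [] results
# S2 LLM_EXTRACT ['results'] answer
# S3 RESPOND ['answer'] None
```

Because every input and output is named, the data dependencies between steps can be read straight off the parsed plan without running anything.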

It’s not rigid: error handling, iteration, re-planning

S2: @FETCH_PAGE (url="https://api.internal/data") > $data ON_FAIL @RETRY(3)
S3: @EXEC_CODE (code="process($data)") > $result ON_FAIL GOTO S5
S4: @FETCH_PAGE (url="https://backup-api") > $data ON_FAIL TERMINATE ("All APIs offline")

Mechanism	Cost	Use case
`ON_FAIL @RETRY`	0 LLM calls	Transient failures
`ON_FAIL GOTO`	0 LLM calls	Known alternative path
`?FOREACH`	0 LLM calls	Iterate a list in Python
`RE-PLAN`	1 LLM call	The plan is invalidated mid-run
`@AGENT_SOLVE`	several LLM calls	Genuinely exploratory sub-problem
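The zero-LLM-call rows in the table are ordinary control flow. As a rough sketch of how a runtime could apply an ON_FAIL policy to one step, consider the helper below. It is not ltpmcp code: run_with_policy and PlanTerminated are hypothetical names, and it assumes @RETRY(n) means n extra attempts after the first one.

```python
from typing import Callable, Optional


class PlanTerminated(Exception):
    """Raised when a step fails and its policy is TERMINATE."""


def run_with_policy(
    action: Callable[[], object],
    retries: int = 0,
    goto: Optional[str] = None,
    terminate_msg: Optional[str] = None,
):
    """Run one step and return (value, next_step_id).

    next_step_id is None when execution should simply continue in order.
    """
    last_error: Optional[Exception] = None
    for _ in range(1 + retries):      # first attempt plus the allowed retries
        try:
            return action(), None
        except Exception as exc:      # a real runtime would catch narrower errors
            last_error = exc
    if goto is not None:              # ON_FAIL GOTO: jump to a known alternative
        return None, goto
    if terminate_msg is not None:     # ON_FAIL TERMINATE ("...")
        raise PlanTerminated(terminate_msg) from last_error
    raise last_error


calls = {"n": 0}


def flaky():
    """Fails twice, then succeeds, to imitate a transient outage."""
    calls["n"] += 1
    if calls["n"] < 3:
        raise ConnectionError("transient")
    return "payload"


def always_fail():
    raise RuntimeError("boom")


print(run_with_policy(flaky, retries=3))        # ('payload', None)
print(run_with_policy(always_fail, goto="S5"))  # (None, 'S5')
```

None of these branches needs the model: the recovery path was decided at compile time and is only followed at run time.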

A static analyzer runs before execution and checks for undefined variables, unreachable steps, circular GOTOs, and duplicate IDs. If a plan is invalid, the kernel falls back to ReAct rather than crashing.
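A few of these checks fit in a short function. The sketch below is an illustration under stated assumptions, not the analyzer that ships with LTP: Step and analyze are hypothetical names, variables are checked in plain top-to-bottom order (which is only an approximation once GOTO is involved), and the unreachable-step and circular-GOTO checks are left out for brevity.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Step:
    id: str
    inputs: list[str] = field(default_factory=list)  # $variables the step reads
    output: Optional[str] = None                      # $variable the step writes
    goto: Optional[str] = None                        # ON_FAIL GOTO target, if any


def analyze(steps: list[Step]) -> list[str]:
    """Return human-readable problems; an empty list means the plan passed."""
    problems = []
    ids = [s.id for s in steps]
    for dup in sorted({i for i in ids if ids.count(i) > 1}):
        problems.append(f"duplicate step id {dup}")

    defined: set[str] = set()
    for s in steps:
        for var in s.inputs:
            if var not in defined:
                problems.append(f"{s.id} reads undefined ${var}")
        if s.goto is not None and s.goto not in ids:
            problems.append(f"{s.id} jumps to unknown step {s.goto}")
        if s.output:
            defined.add(s.output)
    return problems


good = [
    Step("S1", output="results"),
    Step("S2", ["results"], "answer"),
    Step("S3", ["answer"]),
]
bad = [Step("S1", ["data"], goto="S9"), Step("S1")]

print(analyze(good))  # []
print(analyze(bad))
# ['duplicate step id S1', 'S1 reads undefined $data', 'S1 jumps to unknown step S9']
```

A caller could then choose the execution path from the result, for example by running the plan only when analyze returns an empty list and handing the goal to ReAct otherwise.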

The measured numbers (reproducible, not asserted)

We benchmarked LTP, ReAct, and Hybrid on the same kernel, same tools, same tasks, every LLM call pinned to one model, on a fixed suite of verifiable tasks. (Run it yourself — the harness and raw per-run JSON are linked from /benchmarks.)

Deterministic suite, 28 tasks, Claude Haiku:

Mode	Success	Avg tokens/task
ReAct	96.4%	9,301
LTP	96.4%	2,517 (−73%)
Hybrid	100%	3,088
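The percentages follow directly from the table. As a quick check, the snippet below recomputes the saving relative to ReAct from the three averages; the dictionary and the function name are just scaffolding for the arithmetic. The last line shows that 27 passes out of 28 tasks rounds to the 96.4% in the table, which is an inference from the suite size rather than a figure stated in the results.

```python
# Average tokens per task, copied from the table.
avg_tokens = {"ReAct": 9301, "LTP": 2517, "Hybrid": 3088}


def saving_vs_react(mode: str) -> float:
    """Fractional token saving of `mode` relative to ReAct."""
    return 1 - avg_tokens[mode] / avg_tokens["ReAct"]


for mode in ("LTP", "Hybrid"):
    print(f"{mode}: {saving_vs_react(mode):.0%}")
# LTP: 73%
# Hybrid: 67%

print(round(27 / 28 * 100, 1))  # 96.4
```

So Hybrid gives up about six points of LTP’s token saving in exchange for the higher success rate.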

LTP’s token count is strikingly flat (~2.5k) as the task deepens, because the LLM is called once to compile the plan, not once per step — the added depth is deterministic Python/tool execution, which costs no tokens. (It stays flat only while that holds: steps that are themselves LLM ops, or an ON_FAIL: RE-PLAN that re-compiles, do add calls.) ReAct’s cost, by contrast, scales with the number of turns. Two honest caveats from the same data:

The edge is model-dependent. On a weak model ReAct loops a lot, so LTP’s saving is largest (up to ~77% on harder tasks). On a strong model ReAct solves things in 2–3 turns and the gap narrows. Raw tokens aren’t the whole story — cost depends on prompt caching too.
LTP alone has a reliability tail. It nails compute and tool-routing, but stumbles when a step needs world knowledge it can’t derive (“reverse this string, then name that country’s capital”) or fiddly stateful manipulation. That’s exactly why Hybrid exists.

Hybrid mode: LTP-first, with a verified fallback

In production you don’t pick a mode per task. You run the kernel in Hybrid mode (mode="hybrid"), which routes on the result it actually gets rather than on a keyword guess:

Obviously exploratory/conversational goals (“help me debug…”, “not sure…”) → ReAct directly.
Everything else → try LTP first, then verify the result; if it’s inadequate, fall back to ReAct.

This makes Hybrid the only mode that stays at 100% across model tiers in our tests, while keeping most of LTP’s efficiency. The price is a small verification overhead and the occasional fallback — a deliberate trade for reliability.
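The routing logic is small enough to sketch. The function below is an illustration of the LTP-first, verify, then fall back pattern, not the kernelmcp implementation. All four callables (run_ltp, run_react, verify, is_exploratory) are hypothetical stand-ins that you would supply, and the path labels are made up for the example.

```python
from typing import Callable


def run_hybrid(
    goal: str,
    run_ltp: Callable[[str], str],
    run_react: Callable[[str], str],
    verify: Callable[[str, str], bool],
    is_exploratory: Callable[[str], bool],
) -> tuple[str, str]:
    """Return (answer, path), where path records which route produced the answer."""
    if is_exploratory(goal):
        return run_react(goal), "react-direct"
    try:
        answer = run_ltp(goal)
    except Exception:              # e.g. an invalid plan or a failed step
        return run_react(goal), "react-fallback"
    if verify(goal, answer):       # result-driven: keep LTP's answer only if it holds up
        return answer, "ltp"
    return run_react(goal), "react-fallback"


answer, path = run_hybrid(
    "What is 2 + 2?",
    run_ltp=lambda goal: "4",
    run_react=lambda goal: "4 (via ReAct)",
    verify=lambda goal, ans: ans.strip() != "",
    is_exploratory=lambda goal: goal.lower().startswith("help me debug"),
)
print(answer, path)  # 4 ltp
```

The verification step is where the overhead comes from: it runs on every LTP answer, and a rejected answer costs a full ReAct run on top of the compile call.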

Where it really pays off: deep agentic chains. On tasks that chain 5–6 sequential calls across memory and code, we measured Hybrid against standard LangGraph and CrewAI agents on the same tools. Hybrid was the cheapest 100%-reliable system on both Haiku and Sonnet, costing roughly 1.4–2.1× less than the competitors on Haiku and roughly 2.6–3.1× less on Sonnet. The reason is that LTP compiles the whole chain once, while ReAct-loop frameworks re-invoke the model at every step. The deeper the chain, the bigger the gap. (Full tables: /benchmarks.)

Design principles

Compile once, execute deterministically. The LLM writes the plan; Python runs it.
The LLM is a tool, not the pilot. @LLM_EXTRACT is a step like any other.
Data flows through variables, not context. $results is explicit.
Conditions are code, not prompts. ?IF ($rate > 5) is a real comparison.
Simple stays simple, exploration stays adaptive. Don’t over-plan; @AGENT_SOLVE when you genuinely need to explore.
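To illustrate the “conditions are code” principle, here is a minimal evaluator for a condition such as $rate > 5. It is a sketch under assumptions: the post only shows that one form, so the grammar here (one variable, one comparison operator, one numeric literal) and the names OPS, COND_RE, and eval_condition are hypothetical.

```python
import operator
import re

OPS = {
    ">": operator.gt, "<": operator.lt, ">=": operator.ge,
    "<=": operator.le, "==": operator.eq, "!=": operator.ne,
}

# Two-character operators come first so ">=" is not read as ">" followed by "=".
COND_RE = re.compile(r"^\s*\$(\w+)\s*(>=|<=|==|!=|>|<)\s*(-?\d+(?:\.\d+)?)\s*$")


def eval_condition(expr: str, variables: dict[str, float]) -> bool:
    """Evaluate '$name <op> <number>' against the plan's variable store."""
    m = COND_RE.match(expr)
    if m is None:
        raise ValueError(f"unsupported condition: {expr!r}")
    name, op, literal = m.groups()
    return OPS[op](variables[name], float(literal))


print(eval_condition("$rate > 5", {"rate": 7.2}))  # True
print(eval_condition("$rate > 5", {"rate": 5}))    # False
```

The branch is decided by a numeric comparison on a stored value, so the same inputs always take the same path and no tokens are spent deciding.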

What we’re still working on

Parallel execution of independent steps (currently sequential; parallel groups exist but aren’t the default).
Hierarchical plans for very deep tasks (15+ steps), where single-shot compilation gets harder.
Reliability on stateful/knowledge steps — the main reason we run Hybrid, not raw LTP, in production.

Try it

LTP lives in its own library, ltpmcp (use it as a library, or via its CLI and MCP server):

pip install mcpaisuite-ltpmcp

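import asyncio

from ltpmcp import LTPCompiler, LTPRuntime

# your_llm_call and your_tools are placeholders for your own LLM wrapper and tool executor.

async def main():
    # 1. Compile a goal into a plan (one LLM call)
    compiler = LTPCompiler(llm_fn=your_llm_call)
    plan = await compiler.compile("What's the weather in Paris?")

    # 2. Execute it deterministically
    runtime = LTPRuntime()
    result = await runtime.execute(plan, tool_executor=your_tools, llm_fn=your_llm_call)
    print(result["response"])

# await needs a running event loop, so start one explicitly when running as a script.
asyncio.run(main())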

Or let kernelmcp choose between LTP and ReAct for you by running it in Hybrid mode:

from kernelmcp import KernelFactory

# Call this from inside an async function (for example one started with asyncio.run).
kernel = KernelFactory.full_suite(llm_model="claude-sonnet-4-6")
task = await kernel.run("...", mode="hybrid")  # recommended in production (bare default is ReAct)

Conclusion

The ecosystem defaults to ReAct: it is flexible and easy, but it spends most of its tokens re-reading its own history. LTP offers another path: treat the LLM as a compiler, plan once, execute in Python, and fall back to ReAct only when you genuinely need exploration. It is not a silver bullet: on toy tasks a lean agent ties it, and LTP alone has a reliability tail. But as agentic work gets deep, compile-once wins, measurably and reproducibly.

The LLM is brilliant at planning. Let it plan once, then get out of the execution loop.

LTP is part of the MCP AI Suite — open-source Python libraries for building AI agents. All numbers in this post are reproducible from the benchmarks page. GitHub