Stop Letting Your LLM Drive: Compile the Plan, Execute It Deterministically
LTP — the Lean Task Protocol — treats the LLM as a compiler, not a runtime. One call writes the whole plan; pure Python executes it. This post explains how, and gives the measured numbers — including where it loses.
Most agent frameworks follow the same pattern: hand the LLM some tools, let it decide what to do at every step, and hope it doesn’t loop, drift, or forget the goal after turn 8. That’s ReAct (Reason-Act). It works until it doesn’t, and even when it works it re-sends the entire conversation every turn.
LTP separates the two jobs a ReAct loop conflates: planning and execution. A SQL planner doesn’t re-plan after every row. A compiler doesn’t re-compile after every instruction. So LTP compiles the goal into a structured plan in one LLM call, then runs that plan deterministically in Python.
Honesty up front: this is not a universal win. On a trivial one-shot question a lean ReAct agent is just as cheap. LTP’s payoff shows up as tasks get deep — and it’s the Hybrid mode (LTP-first with a verified ReAct fallback) that you actually want in production, because LTP alone is unreliable on knowledge lookups and stateful chains. The numbers below are measured and reproducible; the full data lives on the benchmarks page.
The idea: what if the LLM only ran once?
User: "What's the weather in Ibiza?"
↓ LTP Compiler — ONE LLM call:
S1: search "weather Ibiza today" > $results
S2: respond $results
↓ LTP Runtime — pure Python, no LLM:
execute S1 → store results → execute S2 → return
The compile call costs tokens once; execution costs zero LLM tokens. Compare that to a ReAct loop that re-invokes the model at every turn, re-reading everything it already saw.
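To make the split concrete, here is a minimal sketch of what a deterministic execution loop could look like. It is not ltpmcp's implementation: the plan representation, `run_plan`, and the stub `fake_search` tool are hypothetical names chosen for illustration, and the stub returns a canned string instead of calling a real search API.

```python
from typing import Any, Callable, Optional

# Hypothetical plan format: (tool name, argument, output variable or None).
Step = tuple[str, Any, Optional[str]]

def fake_search(query: str) -> str:
    """Stand-in for a real search tool: no network, no LLM."""
    return f"results for {query!r}"

def run_plan(plan: list[Step], tools: dict[str, Callable[[Any], Any]]) -> Any:
    env: dict[str, Any] = {}  # explicit variable store ($results, ...)
    last = None
    for tool, arg, out in plan:
        # An argument written as "$name" is looked up in the variable store.
        if isinstance(arg, str) and arg.startswith("$"):
            arg = env[arg[1:]]
        last = tools[tool](arg)
        if out is not None:
            env[out] = last
    return last

plan: list[Step] = [
    ("search", "weather Ibiza today", "results"),  # S1
    ("respond", "$results", None),                 # S2
]
tools = {"search": fake_search, "respond": lambda text: text}
answer = run_plan(plan, tools)
print(answer)  # results for 'weather Ibiza today'
```

Nothing in `run_plan` touches a model: once the plan exists, the loop is ordinary dictionary lookups and function calls.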
The DSL: a language for agent plans
LTP is a tiny DSL designed to be easy for an LLM to generate (translation, not reasoning):
PLAN_START
S1: @WEB_SEARCH (query="prime minister of France 2026") > $results
S2: @LLM_EXTRACT ($results, target="person_name_and_title") > $answer
S3: @RESPOND ($answer)
PLAN_END
Each step has an ID, a tool, arguments with explicit $variable references, and an
output variable (> $answer). Variables flow explicitly — no hidden “based on the previous
output.” Three kinds of step:
- Deterministic tools (no LLM): @WEB_SEARCH, @EXEC_CODE, @READ_FILE, @WRITE_FILE…
- LLM micro-calls (one focused prompt): @LLM_EXTRACT, @LLM_CLASSIFY, @LLM_SUMMARIZE
- Adaptive agent (a scoped mini-ReAct for genuine exploration): @AGENT_SOLVE(...)
Most steps don’t need an LLM at all — and the ones that do operate on a focused input, not the whole conversation.
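Because each step follows a fixed shape, a plan line can be parsed with a small regular expression. The sketch below is an assumption-laden illustration, not the ltpmcp parser: the grammar is inferred only from the examples in this post, `Step` and `parse_step` are hypothetical names, and ON_FAIL clauses are deliberately not handled.

```python
import re
from dataclasses import dataclass
from typing import Optional

# Assumed grammar, inferred from the examples in this post:
#   S<n>: @TOOL (args) > $output      (the "> $output" part is optional)
STEP_RE = re.compile(
    r"^(?P<id>S\d+):\s*@(?P<tool>[A-Z_]+)\s*\((?P<args>.*)\)"
    r"\s*(?:>\s*\$(?P<out>\w+))?\s*$"
)
VAR_RE = re.compile(r"\$(\w+)")

@dataclass
class Step:
    id: str
    tool: str
    args: str                # raw argument text, left unparsed here
    inputs: list[str]        # $variables referenced in the arguments
    output: Optional[str]    # variable written by the step, if any

def parse_step(line: str) -> Step:
    m = STEP_RE.match(line.strip())
    if m is None:
        raise ValueError(f"not a valid step: {line!r}")
    # Note: this naive scan would also pick up a "$" inside a quoted string.
    return Step(m["id"], m["tool"], m["args"], VAR_RE.findall(m["args"]), m["out"])

step = parse_step('S2: @LLM_EXTRACT ($results, target="person_name_and_title") > $answer')
print(step.tool, step.inputs, step.output)  # LLM_EXTRACT ['results'] answer
```

Having inputs and outputs as explicit fields is what lets later stages check a plan before running it.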
It’s not rigid: error handling, iteration, re-planning
S2: @FETCH_PAGE (url="https://api.internal/data") > $data ON_FAIL @RETRY(3)
S3: @EXEC_CODE (code="process($data)") > $result ON_FAIL GOTO S5
S4: @FETCH_PAGE (url="https://backup-api") > $data ON_FAIL TERMINATE ("All APIs offline")
| Mechanism | Cost | Use case |
|---|---|---|
| ON_FAIL @RETRY | 0 LLM calls | Transient failures |
| ON_FAIL GOTO | 0 LLM calls | Known alternative path |
| ?FOREACH | 0 LLM calls | Iterate a list in Python |
| RE-PLAN | 1 LLM call | The plan is invalidated mid-run |
| @AGENT_SOLVE | several LLM calls | Genuinely exploratory sub-problem |
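The zero-LLM-call rows can be pictured as plain control flow around a program counter. The following sketch is hypothetical: the step dictionaries, `execute`, and `PlanTerminated` are illustrative names, and it assumes that `@RETRY(3)` means up to three additional attempts and that a failure with no recovery policy ends the run. The post does not specify either behavior.

```python
from typing import Any, Callable

class PlanTerminated(Exception):
    """Raised when a step fails and no recovery path applies."""

def execute(steps: list[dict[str, Any]]) -> list[str]:
    index = {s["id"]: i for i, s in enumerate(steps)}
    trace: list[str] = []
    pc = 0  # program counter into the step list
    while pc < len(steps):
        step = steps[pc]
        kind, arg = step.get("on_fail") or (None, None)
        attempts = 1 + arg if kind == "retry" else 1
        ok = False
        for _ in range(attempts):
            try:
                step["run"]()
                ok = True
                break
            except Exception:
                pass  # swallow and retry; a real runtime would log the error
        trace.append(f"{step['id']}:{'ok' if ok else 'fail'}")
        if ok:
            pc += 1
        elif kind == "goto":
            pc = index[arg]  # jump to the known alternative path
        else:
            raise PlanTerminated(arg if kind == "terminate" else f"{step['id']} failed")
    return trace

def make_flaky(failures: int) -> Callable[[], None]:
    """Build an action that fails `failures` times, then succeeds."""
    state = {"left": failures}
    def run() -> None:
        if state["left"] > 0:
            state["left"] -= 1
            raise RuntimeError("transient failure")
    return run

def always_fail() -> None:
    raise RuntimeError("boom")

steps = [
    {"id": "S2", "run": make_flaky(2), "on_fail": ("retry", 3)},
    {"id": "S3", "run": always_fail, "on_fail": ("goto", "S5")},
    {"id": "S4", "run": lambda: None, "on_fail": None},
    {"id": "S5", "run": lambda: None, "on_fail": None},
]
trace = execute(steps)
print(trace)  # ['S2:ok', 'S3:fail', 'S5:ok']
```

Every recovery decision here is a branch in Python, which is why these mechanisms cost no LLM calls. A backward GOTO could loop forever in this sketch, so a plan would need to be checked for cycles before it runs.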
A static analyzer runs before execution and checks for undefined variables, unreachable steps, circular GOTOs, and duplicate IDs. If a plan is invalid, the kernel falls back to ReAct rather than crashing.
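As a rough illustration of pre-execution checking, the sketch below covers three of the simpler checks: duplicate IDs, variables used before any earlier step defines them, and GOTO targets that do not exist. It is not the analyzer shipped with LTP. The step dictionaries and `analyze` are hypothetical, variable definitions are tracked in plain top-to-bottom order (which ignores jumps), and reachability and cycle detection are left out.

```python
from collections import Counter
from typing import Any

def analyze(steps: list[dict[str, Any]]) -> list[str]:
    """Return a list of problems; an empty list means these checks passed."""
    problems: list[str] = []
    ids = [s["id"] for s in steps]
    for step_id, count in Counter(ids).items():
        if count > 1:
            problems.append(f"duplicate id {step_id}")
    defined: set[str] = set()
    for s in steps:
        for var in s.get("inputs", []):
            if var not in defined:
                problems.append(f"{s['id']}: undefined variable ${var}")
        if s.get("output"):
            defined.add(s["output"])
        target = s.get("goto")
        if target is not None and target not in ids:
            problems.append(f"{s['id']}: GOTO unknown step {target}")
    return problems

good = [
    {"id": "S1", "inputs": [], "output": "results"},
    {"id": "S2", "inputs": ["results"], "output": "answer"},
    {"id": "S3", "inputs": ["answer"]},
]
bad = [
    {"id": "S1", "inputs": ["data"], "output": "x"},
    {"id": "S1", "inputs": ["x"], "goto": "S9"},
]
print(analyze(good))  # []
print(analyze(bad))
```

A caller could treat any non-empty result as the signal to skip execution and hand the goal to the fallback path.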
The measured numbers (reproducible, not asserted)
We benchmarked LTP, ReAct, and Hybrid on the same kernel, same tools, same tasks, every LLM call pinned to one model, on a fixed suite of verifiable tasks. (Run it yourself — the harness and raw per-run JSON are linked from /benchmarks.)
Deterministic suite, 28 tasks, Claude Haiku:
| Mode | Success | Avg tokens/task |
|---|---|---|
| ReAct | 96.4% | 9,301 |
| LTP | 96.4% | 2,517 (−73%) |
| Hybrid | 100% | 3,088 |
LTP’s token count is strikingly flat (~2.5k) as the task deepens, because the LLM is called
once to compile the plan, not once per step — the added depth is deterministic Python/tool
execution, which costs no tokens. (It stays flat only while that holds: steps that are themselves
LLM ops, or an ON_FAIL: RE-PLAN that re-compiles, do add calls.) ReAct’s cost, by contrast,
scales with the number of turns. Two honest caveats from the same data:
- The edge is model-dependent. On a weak model ReAct loops a lot, so LTP’s saving is largest (up to ~77% on harder tasks). On a strong model ReAct solves things in 2–3 turns and the gap narrows. Raw tokens aren’t the whole story — cost depends on prompt caching too.
- LTP alone has a reliability tail. It nails compute and tool-routing, but stumbles when a step needs world knowledge it can’t derive (“reverse this string, then name that country’s capital”) or fiddly stateful manipulation. That’s exactly why Hybrid exists.
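The percentages follow directly from the table, and the shape of ReAct's cost curve can be illustrated with a toy model. In the sketch below, `saving` only redoes the arithmetic on the table's averages. `react_tokens` is a deliberately crude model with made-up inputs (a 1,000-token prompt and 500 tokens added per turn) and counts input tokens only. It is an illustration of why re-sending history grows faster than linearly, not a measurement.

```python
def saving(baseline: float, candidate: float) -> float:
    """Relative token saving of `candidate` versus `baseline`, in percent."""
    return 100.0 * (baseline - candidate) / baseline

# Average tokens per task, taken from the table above.
react, ltp, hybrid = 9301, 2517, 3088
print(round(saving(react, ltp), 1))     # 72.9
print(round(saving(react, hybrid), 1))  # 66.8

def react_tokens(turns: int, prompt: int, per_turn: int) -> int:
    """Toy model: each turn re-sends the prompt plus everything added so far."""
    return sum(prompt + t * per_turn for t in range(turns))

# Hypothetical sizes, chosen only to show the growth pattern.
print(react_tokens(turns=3, prompt=1000, per_turn=500))  # 4500
print(react_tokens(turns=6, prompt=1000, per_turn=500))  # 13500
```

In this toy model, doubling the number of turns triples the input tokens, whereas a single compile call is paid once regardless of how many deterministic steps follow.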
Hybrid mode: LTP-first, with a verified fallback
In production you don’t choose manually. You run the kernel in Hybrid mode (mode="hybrid"), and it is
result-driven, not a keyword guess:
- Obviously exploratory/conversational goals (“help me debug…”, “not sure…”) → ReAct directly.
- Everything else → try LTP first, then verify the result; if it’s inadequate, fall back to ReAct.
This makes Hybrid the only mode that stays at 100% across model tiers in our tests, while keeping most of LTP’s efficiency. The price is a small verification overhead and the occasional fallback — a deliberate trade for reliability.
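The routing described above can be sketched as a small dispatcher. This is an illustration under stated assumptions, not kernelmcp's code: `run_hybrid` and its four callables are hypothetical, and how the kernel actually detects exploratory goals or verifies a result is not specified in this post, so both are passed in as functions.

```python
from typing import Callable, Optional

def run_hybrid(
    goal: str,
    is_exploratory: Callable[[str], bool],
    run_ltp: Callable[[str], Optional[str]],
    run_react: Callable[[str], str],
    verify: Callable[[str, str], bool],
) -> tuple[str, str]:
    """Return (answer, path), where path records which route produced the answer."""
    if is_exploratory(goal):
        return run_react(goal), "react"
    answer = run_ltp(goal)  # None models a plan that failed to compile or run
    if answer is not None and verify(goal, answer):
        return answer, "ltp"
    return run_react(goal), "react-fallback"

# Stub components, for demonstration only.
result = run_hybrid(
    "2 + 2",
    is_exploratory=lambda goal: False,
    run_ltp=lambda goal: "4",
    run_react=lambda goal: "answer from ReAct",
    verify=lambda goal, answer: answer == "4",
)
print(result)  # ('4', 'ltp')
```

The fallback is taken only after a result exists and fails verification, which is what makes the routing result-driven.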
Where it really pays off — deep agentic chains. On tasks that chain 5–6 sequential calls across memory and code, we measured Hybrid against standard LangGraph and CrewAI agents on the same tools: Hybrid was the cheapest 100%-reliable system on both Haiku and Sonnet — ~1.4–2.1× under the competitors on Haiku, widening to ~2.6–3.1× on Sonnet — because LTP compiles the whole chain once while ReAct-loop frameworks re-invoke the model at every step. The deeper the chain, the bigger the gap. (Full tables: /benchmarks.)
Design principles
- Compile once, execute deterministically. The LLM writes the plan; Python runs it.
- The LLM is a tool, not the pilot. @LLM_EXTRACT is a step like any other.
- Data flows through variables, not context. $results is explicit.
- Conditions are code, not prompts. ?IF ($rate > 5) is a real comparison.
- Simple stays simple, exploration stays adaptive. Don’t over-plan; use @AGENT_SOLVE when you genuinely need to explore.
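To show what "conditions are code" can mean in practice, here is a hedged sketch of evaluating a condition such as `$rate > 5` against the variable store. The function name, the operator table, and the set of supported operators are assumptions made for illustration; the post does not list which operators ?IF accepts.

```python
import operator
from typing import Any

# Assumed operator set; the DSL's real list may differ.
OPS = {
    ">": operator.gt, "<": operator.lt,
    ">=": operator.ge, "<=": operator.le,
    "==": operator.eq, "!=": operator.ne,
}

def eval_condition(var: str, op: str, literal: Any, env: dict[str, Any]) -> bool:
    """Compare the stored value of `var` with `literal` using ordinary Python."""
    return OPS[op](env[var], literal)

env = {"rate": 7.2}
print(eval_condition("rate", ">", 5, env))  # True
print(eval_condition("rate", "<", 5, env))  # False
```

The comparison is exact and repeatable, with no prompt asking a model whether 7.2 exceeds 5.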
What we’re still working on
- Parallel execution of independent steps (currently sequential; parallel groups exist but aren’t the default).
- Hierarchical plans for very deep tasks (15+ steps), where single-shot compilation gets harder.
- Reliability on stateful/knowledge steps — the main reason we run Hybrid, not raw LTP, in production.
Try it
LTP lives in its own library, ltpmcp (use it as a library, or via its CLI and MCP server):
pip install mcpaisuite-ltpmcp
import asyncio
from ltpmcp import LTPCompiler, LTPRuntime

async def main():
    # 1. Compile a goal into a plan (one LLM call)
    compiler = LTPCompiler(llm_fn=your_llm_call)
    plan = await compiler.compile("What's the weather in Paris?")
    # 2. Execute it deterministically
    runtime = LTPRuntime()
    result = await runtime.execute(plan, tool_executor=your_tools, llm_fn=your_llm_call)
    print(result["response"])

asyncio.run(main())
Or let kernelmcp auto-select LTP / ReAct / Hybrid for you:
from kernelmcp import KernelFactory
kernel = KernelFactory.full_suite(llm_model="claude-sonnet-4-6")
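# await needs an async context: call this inside an async function, or from a REPL that supports top-level await
task = await kernel.run("...", mode="hybrid") # recommended in production (bare default is ReAct)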
Conclusion
The ecosystem defaults to ReAct: flexible, easy, and it spends most of its tokens re-reading its own history. LTP offers another path — treat the LLM as a compiler, plan once, execute in Python, and fall back to ReAct only when you genuinely need exploration. It is not a silver bullet: on toy tasks a lean agent ties it, and LTP alone has a reliability tail. But as agentic work gets deep, compile-once wins — measurably, and reproducibly.
The LLM is brilliant at planning. Let it plan once, then get out of the execution loop.
LTP is part of the MCP AI Suite: open-source Python libraries for building AI agents. All numbers in this post are reproducible from the benchmarks page.