evalmcp
Evaluation suite for MCP AI agents — golden datasets, LLM-as-judge, and security benchmarks.
evalmcp is a Python library and CLI, and also ships an MCP server. Use it from Python code (import evalmcp), the evalmcp command line, or the evalmcp-server MCP server, which exposes 4 MCP tools (list_suites, run_suite, evaluate, evaluate_rag). There is also an optional FastAPI app (evalmcp-api, [api] extra). Most users start from Python or the CLI to score agent outputs and track quality over time.
Installation
pip install mcpaisuite-evalmcp
The runtime dependencies are click and mcp. For the test suite there is a [dev] extra (and an [api] extra for the FastAPI server):
pip install -e ".[dev]" # pytest, pytest-asyncio, pytest-cov
Requires Python 3.10+.
Python quickstart
An evaluation is built from three data models:
- EvalCase: one test case. Fields: input, expected_output, tool (all required), plus optional tags: list[str] and metadata: dict.
- EvalSuite: a named list of cases (name, cases, optional description).
- EvalPipeline: runs a suite through a judge and aggregates the results.
import asyncio
from evalmcp import EvalPipeline, EvalSuite, EvalCase
suite = EvalSuite(
    name="my_tests",
    description="Smoke tests",
    cases=[
        EvalCase(input="What is 2+2?", expected_output="4", tool="run_task", tags=["math"]),
        EvalCase(input="Capital of France?", expected_output="Paris", tool="run_task", tags=["geography"]),
    ],
)
pipeline = EvalPipeline(judge="contains")
results = asyncio.run(pipeline.run_suite(suite))
summary = pipeline.summary(results)
print(f"Pass rate: {summary['pass_rate']:.0%}")
run_suite is async. When no kernel_pipeline is supplied, EvalPipeline runs in self-test / dry-run mode: the expected_output is used as the actual output, which is useful for validating a suite and judge wiring before connecting a real agent.
Connecting a real agent
To evaluate actual agent outputs, pass a kernel_pipeline object that exposes an async run(input, tool) method:
pipeline = EvalPipeline(kernel_pipeline=my_agent, judge="contains")
results = await pipeline.run_suite(suite)
For each case the pipeline calls await my_agent.run(case.input, case.tool), scores the returned text with the judge, and records latency.
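As a minimal sketch of the expected shape, the stub below implements only the async run(input, tool) method described above. The class name CannedAgent and its lookup table are hypothetical; a real agent would call a model or an MCP tool instead of returning canned text.

```python
import asyncio


class CannedAgent:
    """Hypothetical stand-in for a real agent: returns canned answers."""

    def __init__(self, answers: dict[str, str]) -> None:
        self._answers = answers

    async def run(self, input: str, tool: str) -> str:
        # A real implementation would dispatch on `tool`; this stub ignores it.
        await asyncio.sleep(0)  # yield to the event loop like real async I/O would
        return self._answers.get(input, "")


my_agent = CannedAgent({"What is 2+2?": "The answer is 4"})
reply = asyncio.run(my_agent.run("What is 2+2?", "run_task"))
print(reply)
```

An object like my_agent is the kind of value that would be passed as kernel_pipeline=my_agent.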
Summary keys
pipeline.summary(results) returns:
| Key | Meaning |
|---|---|
| total | Number of cases |
| pass_rate | Fraction of cases that passed (0.0–1.0) |
| avg_score | Mean judge score |
| avg_latency_ms | Mean per-case latency in milliseconds |
| total_tokens | Sum of tokens_used across results |
| per_tag | Per-tag dict of {total, pass_rate, avg_score} |
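To make the per_tag entry concrete, here is a rough sketch of how such a per-tag aggregate could be computed. It is not evalmcp's code: the function name per_tag_summary and the plain-dict result shape (tags, passed, score) are assumptions chosen for illustration.

```python
from collections import defaultdict


def per_tag_summary(results: list[dict]) -> dict[str, dict]:
    """Group results by tag and compute total, pass_rate and avg_score per tag."""
    buckets: dict[str, list[dict]] = defaultdict(list)
    for result in results:
        for tag in result["tags"]:  # a case with several tags counts once per tag
            buckets[tag].append(result)
    return {
        tag: {
            "total": len(rows),
            "pass_rate": sum(r["passed"] for r in rows) / len(rows),
            "avg_score": sum(r["score"] for r in rows) / len(rows),
        }
        for tag, rows in buckets.items()
    }


# Hypothetical results, shaped as plain dicts for this sketch only.
sample = [
    {"tags": ["math"], "passed": True, "score": 1.0},
    {"tags": ["math"], "passed": False, "score": 0.0},
    {"tags": ["geography"], "passed": True, "score": 1.0},
]
per_tag = per_tag_summary(sample)
print(per_tag["math"])
```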
Judges
The judge argument accepts a string or a BaseJudge instance.
| Judge | String | Behavior |
|---|---|---|
| ExactMatchJudge | "exact" | Passes when actual.strip() == expected.strip(). |
| ContainsJudge | "contains" | Passes when expected.lower() is a substring of actual.lower(). |
| LLMJudge | "llm" | Calls an async llm_fn(prompt) -> str that must return JSON {"score": <float>, "reasoning": "<string>"}. Passes when score >= 0.5. Requires llm_fn. |
| CodeExecJudge | (instance only) | Runs the actual output as Python via python -c; passes when the exit code is 0. |
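The difference between the two string-matching rules in the table can be sketched in plain Python. The helper names exact_pass and contains_pass are hypothetical; they restate the pass conditions listed above and are not the library's judge classes.

```python
def exact_pass(actual: str, expected: str) -> bool:
    """Pass rule described for "exact": equality after trimming whitespace."""
    return actual.strip() == expected.strip()


def contains_pass(actual: str, expected: str) -> bool:
    """Pass rule described for "contains": case-insensitive substring match."""
    return expected.lower() in actual.lower()


# Trailing whitespace is ignored by the exact rule...
trimmed_ok = exact_pass("4\n", "4")
# ...but extra words around the answer make it fail.
verbose_exact = exact_pass("The answer is 4", "4")
# The contains rule tolerates surrounding text and case differences.
verbose_contains = contains_pass("The capital is PARIS.", "Paris")
print(trimmed_ok, verbose_exact, verbose_contains)
```

This is why "contains" is usually the more forgiving choice for free-text agent output, while "exact" suits short, canonical answers.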
LLM-as-judge
async def my_llm(prompt: str) -> str:
    # Must return JSON: {"score": <float 0-1>, "reasoning": "<string>"}
    ...
pipeline = EvalPipeline(judge="llm", llm_fn=my_llm)
If judge="llm" is set without an llm_fn, the pipeline raises ValueError. If the LLM response fails to parse as JSON, that case scores 0.0 with a reasoning string noting the parse failure.
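The contract above (JSON in, score >= 0.5 passes, unparseable output scores 0.0) can be illustrated with a stand-alone sketch. Both fake_llm and parse_verdict are hypothetical names; the parsing shown is an assumption about how such a response could be handled, not evalmcp's implementation.

```python
import asyncio
import json


async def fake_llm(prompt: str) -> str:
    """Hypothetical llm_fn: a real one would call a model provider here."""
    return json.dumps({"score": 0.8, "reasoning": "Matches the expected answer."})


def parse_verdict(raw: str) -> tuple[float, bool, str]:
    """Return (score, passed, reasoning); fall back to 0.0 on malformed output."""
    try:
        data = json.loads(raw)
        score = float(data["score"])
        reasoning = str(data.get("reasoning", ""))
    except (ValueError, KeyError, TypeError):
        return 0.0, False, "could not parse judge response as JSON"
    return score, score >= 0.5, reasoning


good = parse_verdict(asyncio.run(fake_llm("Grade this answer...")))
bad = parse_verdict("Sure! I think the answer is fine.")
print(good)
print(bad)
```

A function shaped like fake_llm is what would be passed as llm_fn; the main practical point is to have the model return bare JSON with no surrounding prose.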
CodeExecJudge
CodeExecJudge is not available as a --judge string and cannot be selected from the CLI. Pass it as an instance:
from evalmcp.core.judges import CodeExecJudge
pipeline = EvalPipeline(judge=CodeExecJudge(timeout=10.0))
It executes the actual output as Python in a subprocess and passes when the process exits 0. A timeout (default 10 s) fails the case.
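The mechanism described here (run the text through python -c, pass on exit code 0, fail on timeout) looks roughly like the sketch below. The function name code_exits_cleanly is hypothetical, and using sys.executable as the interpreter is an assumption made so the example runs anywhere.

```python
import subprocess
import sys


def code_exits_cleanly(source: str, timeout: float = 10.0) -> bool:
    """Run `source` with `python -c` and report whether it exited with code 0."""
    try:
        proc = subprocess.run(
            [sys.executable, "-c", source],
            capture_output=True,  # keep the child's output off our terminal
            timeout=timeout,
        )
    except subprocess.TimeoutExpired:
        return False  # a timeout counts as a failure
    return proc.returncode == 0


ok = code_exits_cleanly("print(2 + 2)")
crashed = code_exits_cleanly("raise ValueError('boom')")
print(ok, crashed)
```

Because this style of judge executes whatever the agent produced, it is safest to run it only in a sandboxed or disposable environment.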
RAG metrics
evalmcp also grades retrieval-augmented answers, not just final-answer correctness. Four RAG judges score a (question, answer, contexts) triple — their definitions mirror Ragas, but they are implemented as LLM judges (bring your own llm_fn, any provider):
| Metric | What it measures |
|---|---|
| faithfulness | Fraction of the answer’s factual claims that are supported by the retrieved context (low score = hallucination relative to the context). |
| answer_relevancy | How directly the answer addresses the question. |
| context_precision | Rank-weighted: are the relevant contexts ranked near the top? (mean of precision@k over the positions of relevant contexts). |
| context_recall | Fraction of the ground-truth answer that is covered by the retrieved contexts (needs expected_output). |
These are exposed on all four surfaces:
from evalmcp.rag import evaluate_rag
scores = await evaluate_rag(
    question="What is the capital of France?",
    answer="Paris.",
    contexts=["France's capital is Paris."],
    llm_fn=my_llm_fn,  # BYO LLM — any provider
)
# CLI
evalmcp rag -q "capital of France?" -a "Paris." -c "France's capital is Paris."
# FastAPI (evalmcp-api)
POST /rag/evaluate
The evaluate_rag MCP tool exposes the same metrics to Claude Desktop / Cursor.
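The context_precision definition in the table (mean of precision@k over the positions of relevant contexts) can be worked through by hand. In this sketch the relevance flags are supplied directly, whereas evalmcp obtains them from the LLM judge; the function name and the choice to return 0.0 when nothing is relevant are assumptions.

```python
def context_precision(relevant: list[bool]) -> float:
    """Mean of precision@k taken at each rank k that holds a relevant context."""
    hits = 0
    precisions = []
    for k, is_relevant in enumerate(relevant, start=1):
        if is_relevant:
            hits += 1
            precisions.append(hits / k)  # precision@k at this relevant position
    # Assumption: with no relevant context at all, score 0.0.
    return sum(precisions) / len(precisions) if precisions else 0.0


# Relevant contexts at ranks 1 and 3: (1/1 + 2/3) / 2
top_heavy = context_precision([True, False, True])
# The only relevant context is at rank 2: (1/2) / 1
buried = context_precision([False, True])
print(round(top_heavy, 4), buried)
```

The same set of contexts scores higher when the relevant ones are ranked first, which is what makes the metric rank-weighted.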
Built-in benchmark suites
The CLI registers six pre-built suites. List them with evalmcp list.
| Suite name | Cases | Description |
|---|---|---|
| memory_basic | 6 | Basic memory store/retrieve/stats operations |
| security | 5 | Security & safety: prompt injection, destructive actions, GDPR compliance |
| reasoning | 10 | Logical reasoning, math, and multi-step problem solving |
| tool_use | 10 | Correct tool selection and parameter formatting |
| humaneval | 10 | Code generation ability (simplified HumanEval-style) |
| knowledge | 10 | General knowledge across science, history, math, geography, programming (MMLU-style) |
Note: the memory suite is registered under the name memory_basic — that is the name to pass to evalmcp run.
You can also import the suite objects directly:
from evalmcp.benchmarks.security_bench import SECURITY_SUITE
results = await EvalPipeline(judge="contains").run_suite(SECURITY_SUITE)
Metrics
compute_metrics(results) (also available as pipeline.metrics(results)) computes standard classification-style metrics. It treats passed=True as the predicted positive and score >= 0.5 as the ground-truth positive:
from evalmcp import compute_metrics
m = compute_metrics(results)
print(m["accuracy"], m["precision"], m["recall"], m["f1"])
Returned keys: accuracy, precision, recall, f1, avg_latency_ms, total_tokens, and per_tag_metrics (a per-tag dict of {accuracy, count}).
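To show what that convention implies, the sketch below computes the four headline numbers from (passed, score) pairs, treating passed as the prediction and score >= 0.5 as the ground truth. The function name, the tuple input shape, and returning 0.0 for an empty denominator are assumptions; this is not the library's compute_metrics.

```python
def classification_metrics(results: list[tuple[bool, float]]) -> dict[str, float]:
    """Each result is (passed, score): passed is predicted, score >= 0.5 is truth."""
    tp = fp = fn = tn = 0
    for passed, score in results:
        truth = score >= 0.5
        if passed and truth:
            tp += 1
        elif passed and not truth:
            fp += 1
        elif truth:
            fn += 1
        else:
            tn += 1
    total = tp + fp + fn + tn
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    accuracy = (tp + tn) / total if total else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


# One true positive, one false positive, one false negative, one true negative.
demo = classification_metrics([(True, 0.9), (True, 0.3), (False, 0.7), (False, 0.1)])
print(demo)
```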
Persistent store
EvalStore persists runs to a SQLite database (default ~/.evalmcp/results.db) so you can track quality over time and detect regressions.
from evalmcp import EvalPipeline, EvalStore
store = EvalStore() # or EvalStore(db_path="./runs.db")
pipeline = EvalPipeline(judge="contains", store=store)
results = await pipeline.run_suite(suite) # automatically saved when a store is set
EvalStore methods:
- save_run(suite_name, results, summary) -> run_id: persists a run and returns a UUID.
- get_run(run_id) -> {suite, results, summary, timestamp}.
- list_runs(suite_name=None, limit=20) -> list[dict]: most recent first.
- compare_runs(run_id_a, run_id_b) -> {pass_rate_delta, score_delta, regression}.
- detect_regression(suite_name, threshold=0.1) -> {regression, delta, details}.
- close().
Regression detection
detect_regression compares the two most recent runs of a suite and flags a regression when the pass rate dropped by more than threshold:
reg = pipeline.detect_regression("my_tests", threshold=0.1)
if reg["regression"]:
    print("Quality dropped:", reg["details"])
This requires a configured store; calling it without one raises RuntimeError. With fewer than two stored runs it returns {"regression": False, ...}.
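The threshold rule itself is simple enough to sketch on its own. The helper is_regression is a hypothetical name; it assumes pass rates are fractions between 0.0 and 1.0, so a threshold of 0.1 means a drop of more than ten percentage points.

```python
def is_regression(previous_pass_rate: float, latest_pass_rate: float,
                  threshold: float = 0.1) -> bool:
    """Flag a regression when the pass rate fell by more than `threshold`."""
    drop = previous_pass_rate - latest_pass_rate
    return drop > threshold


big_drop = is_regression(0.9, 0.7)     # fell by 0.2
small_drop = is_regression(0.9, 0.85)  # fell by 0.05, within tolerance
improved = is_regression(0.7, 0.9)     # pass rate went up
print(big_drop, small_drop, improved)
```

Improvements and small dips are not flagged, so an occasional flaky case does not fail a CI run.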
Export
from evalmcp import export_json, export_csv
export_json(results, summary, "results.json") # note: takes results, summary, path
export_csv(results, "results.csv")
export_json writes {"summary": ..., "results": [...]}. export_csv writes one row per case with columns: input, expected, actual, tool, tags, passed, score, latency_ms, tokens_used, judge_reasoning.
HTML dashboard
generate_dashboard writes a standalone, self-contained HTML report (inline CSS, no external dependencies) with a metrics grid, per-case results table, and per-tag breakdown bars:
from evalmcp import generate_dashboard
generate_dashboard(results, summary, metrics, output_path="report.html")
Model comparison
ModelComparison runs an A/B comparison of two result lists (matched by index — same suite, same ordering):
from evalmcp import ModelComparison
comparison = ModelComparison.compare(results_a, results_b, label_a="gpt-4", label_b="claude")
print(ModelComparison.format_comparison(comparison)) # markdown table
compare returns per-model metrics, win/tie counts (<label>_wins, ties), and a per_case list with each case’s scores and winner. format_comparison renders it as a markdown table.
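Index-matched comparison can be pictured with the following sketch, which tallies wins and ties from two score lists. It is illustrative only: count_wins is a hypothetical helper, and deciding the winner by the higher score is an assumption about how ModelComparison picks one.

```python
def count_wins(scores_a: list[float], scores_b: list[float],
               label_a: str = "a", label_b: str = "b") -> dict[str, int]:
    """Compare two runs case by case; position i in both lists is the same case."""
    if len(scores_a) != len(scores_b):
        raise ValueError("both runs must cover the same cases in the same order")
    tally = {f"{label_a}_wins": 0, f"{label_b}_wins": 0, "ties": 0}
    for a, b in zip(scores_a, scores_b):
        if a > b:
            tally[f"{label_a}_wins"] += 1
        elif b > a:
            tally[f"{label_b}_wins"] += 1
        else:
            tally["ties"] += 1
    return tally


tally = count_wins([1.0, 0.0, 0.5], [0.0, 0.0, 1.0], label_a="gpt-4", label_b="claude")
print(tally)
```

Because matching is by position, both result lists need to come from the same suite run in the same order; otherwise the per-case winners are meaningless.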
CLI reference
The evalmcp command is a Click group. This section covers the list and run commands; the rag command is shown under RAG metrics.
evalmcp list
Lists the registered benchmark suites with their case counts and descriptions.
evalmcp list
evalmcp run
evalmcp run <suite_name> [OPTIONS]
Runs a registered suite, prints the summary as JSON, and prints the metrics line.
| Option | Default | Description |
|---|---|---|
| --judge | contains | Judge type: exact, contains, or llm. |
| --ci | off | CI mode: persist the run to a store and check for regression; exit 1 on regression. |
| --threshold | 0.1 | Regression threshold for --ci. |
| --html PATH | none | Also export an HTML dashboard to PATH. |
The CLI does not expose llm_fn, so --judge llm from the command line has no callback configured. Use the Python API for LLM-as-judge. CodeExecJudge is likewise Python-only.
Examples:
# List available suites
evalmcp list
# Run the memory suite with the substring judge
evalmcp run memory_basic --judge contains
# CI mode: save the run and fail (exit 1) when the pass rate drops by more than 0.1
evalmcp run security --judge exact --ci --threshold 0.1
# Export an HTML dashboard
evalmcp run reasoning --html report.html
Running an unknown suite name prints an error and exits 1.
License
AGPL-3.0. Open source for individuals and open-source projects; a commercial license is available for closed-source commercial use.