evalmcp

📊 4 tools

evalmcp

Evaluation suite for MCP AI agents — golden datasets, LLM-as-judge, and security benchmarks.

evalmcp is a Python library and CLI, and also ships an MCP server. Use it from Python code (import evalmcp), the evalmcp command line, or the evalmcp-server MCP server, which exposes 4 MCP tools (list_suites, run_suite, evaluate, evaluate_rag). There is also an optional FastAPI app (evalmcp-api, [api] extra). Most users start from Python or the CLI to score agent outputs and track quality over time.

Installation

pip install mcpaisuite-evalmcp

The runtime dependencies are click and mcp. For the test suite there is a [dev] extra (and a [api] extra for the FastAPI server):

pip install -e ".[dev]"   # pytest, pytest-asyncio, pytest-cov

Requires Python 3.10+.

Python quickstart

An evaluation is built from three data models:

EvalCase — one test case. Fields: input, expected_output, tool (all required), plus optional tags: list[str] and metadata: dict.
EvalSuite — a named list of cases (name, cases, optional description).
EvalPipeline — runs a suite through a judge and aggregates the results.

import asyncio
from evalmcp import EvalPipeline, EvalSuite, EvalCase

suite = EvalSuite(
    name="my_tests",
    description="Smoke tests",
    cases=[
        EvalCase(input="What is 2+2?", expected_output="4", tool="run_task", tags=["math"]),
        EvalCase(input="Capital of France?", expected_output="Paris", tool="run_task", tags=["geography"]),
    ],
)

pipeline = EvalPipeline(judge="contains")
results = asyncio.run(pipeline.run_suite(suite))

summary = pipeline.summary(results)
print(f"Pass rate: {summary['pass_rate']:.0%}")

run_suite is async. When no kernel_pipeline is supplied, EvalPipeline runs in self-test / dry-run mode: the expected_output is used as the actual output, which is useful for validating a suite and judge wiring before connecting a real agent.

Connecting a real agent

To evaluate actual agent outputs, pass a kernel_pipeline object that exposes an async run(input, tool) method:

pipeline = EvalPipeline(kernel_pipeline=my_agent, judge="contains")
results = await pipeline.run_suite(suite)

For each case the pipeline calls await my_agent.run(case.input, case.tool), scores the returned text with the judge, and records latency.

Summary keys

pipeline.summary(results) returns:

Key	Meaning
`total`	Number of cases
`pass_rate`	Fraction of cases that passed (0.0–1.0)
`avg_score`	Mean judge score
`avg_latency_ms`	Mean per-case latency in milliseconds
`total_tokens`	Sum of `tokens_used` across results
`per_tag`	Per-tag dict of `{total, pass_rate, avg_score}`

Judges

The judge argument accepts a string or a BaseJudge instance.

Judge	String	Behavior
`ExactMatchJudge`	`"exact"`	Passes when `actual.strip() == expected.strip()`.
`ContainsJudge`	`"contains"`	Passes when `expected.lower()` is a substring of `actual.lower()`.
`LLMJudge`	`"llm"`	Calls an async `llm_fn(prompt) -> str` that must return JSON `{"score": <float>, "reasoning": "<string>"}`. Passes when `score >= 0.5`. Requires `llm_fn`.
`CodeExecJudge`	(instance only)	Runs the actual output as Python via `python -c`; passes when the exit code is 0.

LLM-as-judge

async def my_llm(prompt: str) -> str:
    # Must return JSON: {"score": <float 0-1>, "reasoning": "<string>"}
    ...

pipeline = EvalPipeline(judge="llm", llm_fn=my_llm)

If judge="llm" is set without an llm_fn, the pipeline raises ValueError. If the LLM response fails to parse as JSON, that case scores 0.0 with a reasoning string noting the parse failure.

CodeExecJudge

CodeExecJudge is not available as a --judge string and cannot be selected from the CLI. Pass it as an instance:

from evalmcp.core.judges import CodeExecJudge

pipeline = EvalPipeline(judge=CodeExecJudge(timeout=10.0))

It executes the actual output as Python in a subprocess and passes when the process exits 0. A timeout (default 10 s) fails the case.

RAG metrics

evalmcp also grades retrieval-augmented answers, not just final-answer correctness. Four RAG judges score a (question, answer, contexts) triple — their definitions mirror Ragas, but they are implemented as LLM judges (bring your own llm_fn, any provider):

Metric	What it measures
`faithfulness`	Fraction of the answer’s factual claims that are supported by the retrieved context (low score = hallucination relative to the context).
`answer_relevancy`	How directly the answer addresses the question.
`context_precision`	Rank-weighted: are the relevant contexts ranked near the top? (mean of precision@k over the positions of relevant contexts).
`context_recall`	Fraction of the ground-truth answer that is covered by the retrieved contexts (needs `expected_output`).

These are exposed on all four surfaces:

from evalmcp.rag import evaluate_rag

scores = await evaluate_rag(
    question="What is the capital of France?",
    answer="Paris.",
    contexts=["France's capital is Paris."],
    llm_fn=my_llm_fn,            # BYO LLM — any provider
)

# CLI
evalmcp rag -q "capital of France?" -a "Paris." -c "France's capital is Paris."

# FastAPI (evalmcp-api)
POST /rag/evaluate

The evaluate_rag MCP tool exposes the same metrics to Claude Desktop / Cursor.

Built-in benchmark suites

The CLI registers six pre-built suites. List them with evalmcp list.

Suite name	Cases	Description
`memory_basic`	6	Basic memory store/retrieve/stats operations
`security`	5	Security & safety: prompt injection, destructive actions, GDPR compliance
`reasoning`	10	Logical reasoning, math, and multi-step problem solving
`tool_use`	10	Correct tool selection and parameter formatting
`humaneval`	10	Code generation ability (simplified HumanEval-style)
`knowledge`	10	General knowledge across science, history, math, geography, programming (MMLU-style)

Note: the memory suite is registered under the name memory_basic — that is the name to pass to evalmcp run.

You can also import the suite objects directly:

from evalmcp.benchmarks.security_bench import SECURITY_SUITE
results = await EvalPipeline(judge="contains").run_suite(SECURITY_SUITE)

Metrics

compute_metrics(results) (also available as pipeline.metrics(results)) computes standard classification-style metrics. It treats passed=True as the predicted positive and score >= 0.5 as the ground-truth positive:

from evalmcp import compute_metrics

m = compute_metrics(results)
print(m["accuracy"], m["precision"], m["recall"], m["f1"])

Returned keys: accuracy, precision, recall, f1, avg_latency_ms, total_tokens, and per_tag_metrics (a per-tag dict of {accuracy, count}).

Persistent store

EvalStore persists runs to a SQLite database (default ~/.evalmcp/results.db) so you can track quality over time and detect regressions.

from evalmcp import EvalPipeline, EvalStore

store = EvalStore()  # or EvalStore(db_path="./runs.db")
pipeline = EvalPipeline(judge="contains", store=store)

results = await pipeline.run_suite(suite)  # automatically saved when a store is set

EvalStore methods:

save_run(suite_name, results, summary) -> run_id — persist a run, returns a UUID.
get_run(run_id) -> {suite, results, summary, timestamp}.
list_runs(suite_name=None, limit=20) -> list[dict] — most recent first.
compare_runs(run_id_a, run_id_b) -> {pass_rate_delta, score_delta, regression}.
detect_regression(suite_name, threshold=0.1) -> {regression, delta, details}.
close().

Regression detection

detect_regression compares the two most recent runs of a suite and flags a regression when the pass rate dropped by more than threshold:

reg = pipeline.detect_regression("my_tests", threshold=0.1)
if reg["regression"]:
    print("Quality dropped:", reg["details"])

This requires a configured store; calling it without one raises RuntimeError. With fewer than two stored runs it returns {"regression": False, ...}.

Export

from evalmcp import export_json, export_csv

export_json(results, summary, "results.json")  # note: takes results, summary, path
export_csv(results, "results.csv")

export_json writes {"summary": ..., "results": [...]}. export_csv writes one row per case with columns: input, expected, actual, tool, tags, passed, score, latency_ms, tokens_used, judge_reasoning.

HTML dashboard

generate_dashboard writes a standalone, self-contained HTML report (inline CSS, no external dependencies) with a metrics grid, per-case results table, and per-tag breakdown bars:

from evalmcp import generate_dashboard

generate_dashboard(results, summary, metrics, output_path="report.html")

Model comparison

ModelComparison runs an A/B comparison of two result lists (matched by index — same suite, same ordering):

from evalmcp import ModelComparison

comparison = ModelComparison.compare(results_a, results_b, label_a="gpt-4", label_b="claude")
print(ModelComparison.format_comparison(comparison))  # markdown table

compare returns per-model metrics, win/tie counts (<label>_wins, ties), and a per_case list with each case’s scores and winner. format_comparison renders it as a markdown table.

CLI reference

The evalmcp command is a Click group with two commands.

`evalmcp list`

Lists the registered benchmark suites with their case counts and descriptions.

evalmcp list

`evalmcp run`

evalmcp run <suite_name> [OPTIONS]

Runs a registered suite, prints the summary as JSON, and prints the metrics line.

Option	Default	Description
`--judge`	`contains`	Judge type: `exact`, `contains`, or `llm`.
`--ci`	off	CI mode: persist the run to a store and check for regression; exit 1 on regression.
`--threshold`	`0.1`	Regression threshold for `--ci`.
`--html PATH`	none	Also export an HTML dashboard to `PATH`.

The CLI does not expose llm_fn, so --judge llm from the command line has no callback configured. Use the Python API for LLM-as-judge. CodeExecJudge is likewise Python-only.

Examples:

# List available suites
evalmcp list

# Run the memory suite with the substring judge
evalmcp run memory_basic --judge contains

# CI mode: save the run and fail (exit 1) on a >10% pass-rate drop
evalmcp run security --judge exact --ci --threshold 0.1

# Export an HTML dashboard
evalmcp run reasoning --html report.html

Running an unknown suite name prints an error and exits 1.

License

Apache-2.0 — free and open source for any use, including commercial.