Python eval library · CLI · Apache-2.0

Evaluate & benchmark
your AI agents.

evalmcp is a Python library and CLI for scoring MCP agent outputs. Golden datasets, pluggable judges (exact, contains, LLM-as-judge, code execution), six built-in benchmark suites, standard metrics, Ragas-style RAG metrics, regression detection, and HTML reports. Use it from Python, the CLI, or its MCP server (4 tools).


eval_agent.py

import asyncio
from evalmcp import EvalPipeline, EvalSuite, EvalCase

suite = EvalSuite(name="my_tests", cases=[
    EvalCase(input="What is 2+2?", expected_output="4",
             tool="run_task", tags=["math"]),
    EvalCase(input="Capital of France?", expected_output="Paris",
             tool="run_task", tags=["geo"]),
])

pipeline = EvalPipeline(judge="contains")
results = asyncio.run(pipeline.run_suite(suite))

summary = pipeline.summary(results)
print(f"Pass rate: {summary['pass_rate']:.0%}")

LLM-as-judge · Regression detection · HTML reports

Features

Everything you need
to measure agent quality

Pluggable Judges

Exact match, case-insensitive substring, LLM-as-judge for semantic scoring, or code execution. Bring your own by subclassing BaseJudge.

exact contains llm

Built-in Suites

Six golden-dataset benchmarks out of the box: memory, security, reasoning, tool-use, HumanEval-style code, and MMLU-style knowledge.

memory_basic security

Standard Metrics

Accuracy, precision, recall, F1, average latency, token totals, and per-tag breakdowns computed from every run.

F1 per-tag
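As a rough illustration of how these metrics relate to one another, here is a minimal standalone sketch. It does not use evalmcp's API: the `Result` dataclass, its field names, and the `summarize` helper are hypothetical, and it assumes each case has a boolean verdict plus a boolean reference label to score against.

```python
from collections import defaultdict
from dataclasses import dataclass, field


@dataclass
class Result:
    # Hypothetical record for this sketch, not an evalmcp type.
    predicted: bool          # verdict produced for the case
    label: bool              # reference label the verdict is scored against
    latency_ms: float
    tags: list[str] = field(default_factory=list)


def summarize(results: list[Result]) -> dict:
    """Compute accuracy, precision, recall, F1, mean latency, and per-tag accuracy."""
    n = len(results)
    tp = sum(1 for r in results if r.predicted and r.label)
    fp = sum(1 for r in results if r.predicted and not r.label)
    fn = sum(1 for r in results if not r.predicted and r.label)
    tn = n - tp - fp - fn

    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0

    # Group correctness flags by tag so each tag gets its own accuracy.
    by_tag: dict[str, list[bool]] = defaultdict(list)
    for r in results:
        for tag in r.tags:
            by_tag[tag].append(r.predicted == r.label)

    return {
        "accuracy": (tp + tn) / n if n else 0.0,
        "precision": precision,
        "recall": recall,
        "f1": f1,
        "avg_latency_ms": sum(r.latency_ms for r in results) / n if n else 0.0,
        "per_tag_accuracy": {t: sum(v) / len(v) for t, v in by_tag.items()},
    }
```

Empty denominators fall back to 0.0 here; that is a choice made for the sketch, not a documented evalmcp behavior.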

Regression Detection

Compare the two most recent runs of a suite and flag a regression when the pass rate drops past your threshold. Built for CI.

--ci --threshold
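The comparison itself is simple enough to sketch in a few lines. This is an illustration only, not evalmcp's implementation: the function names are hypothetical, and it assumes the threshold is an absolute drop in pass rate (0.1 meaning ten percentage points) with a strict comparison.

```python
def pass_rate(outcomes: list[bool]) -> float:
    """Fraction of cases that passed; an empty run counts as 0.0."""
    return sum(outcomes) / len(outcomes) if outcomes else 0.0


def is_regression(previous: list[bool], current: list[bool],
                  threshold: float = 0.1) -> bool:
    """Flag a regression when the pass rate fell by more than the threshold.

    Assumption for this sketch: the drop is absolute (previous minus current),
    and a drop exactly at the threshold does not count.
    """
    drop = pass_rate(previous) - pass_rate(current)
    return drop > threshold
```

In a CI job, a caller would typically exit with a non-zero status when `is_regression` returns `True`, which is what makes the build fail.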

Persistent Store

EvalStore persists every run to a local SQLite database so you can track quality trends and diff runs over time.

SQLite run history
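To make the idea of a run history concrete, here is a small sketch built on Python's standard `sqlite3` module. The table layout and the `open_store`, `save_run`, and `last_two` helpers are assumptions for illustration; they are not EvalStore's actual schema or methods.

```python
import sqlite3


def open_store(path: str = ":memory:") -> sqlite3.Connection:
    """Open (or create) a run-history database with a hypothetical schema."""
    conn = sqlite3.connect(path)
    conn.execute(
        """
        CREATE TABLE IF NOT EXISTS runs (
            id INTEGER PRIMARY KEY AUTOINCREMENT,
            suite TEXT NOT NULL,
            pass_rate REAL NOT NULL,
            created_at TEXT NOT NULL DEFAULT CURRENT_TIMESTAMP
        )
        """
    )
    return conn


def save_run(conn: sqlite3.Connection, suite: str, pass_rate: float) -> int:
    """Insert one run and return its row id."""
    cur = conn.execute(
        "INSERT INTO runs (suite, pass_rate) VALUES (?, ?)", (suite, pass_rate)
    )
    conn.commit()
    return cur.lastrowid


def last_two(conn: sqlite3.Connection, suite: str) -> list[float]:
    """Return pass rates of the two most recent runs of a suite, newest first."""
    rows = conn.execute(
        "SELECT pass_rate FROM runs WHERE suite = ? ORDER BY id DESC LIMIT 2",
        (suite,),
    ).fetchall()
    return [row[0] for row in rows]
```

Pass a file path instead of `":memory:"` to keep the history between processes.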

HTML & Data Export

Generate a standalone HTML dashboard, or export results to JSON and CSV for downstream analysis.

HTML JSON CSV

Model Comparison

ModelComparison runs an A/B of two result sets case-by-case, with win/tie counts and a markdown report.

A/B
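A case-by-case A/B tally can be sketched as follows. This is not the ModelComparison API: the `compare` and `to_markdown` functions are hypothetical, and the sketch assumes each result set is a mapping from case id to a numeric score.

```python
from collections import Counter


def compare(a: dict[str, float], b: dict[str, float]) -> Counter:
    """Count wins and ties over the case ids present in both result sets."""
    tally = Counter(a_wins=0, b_wins=0, ties=0)
    for case_id in a.keys() & b.keys():
        if a[case_id] > b[case_id]:
            tally["a_wins"] += 1
        elif a[case_id] < b[case_id]:
            tally["b_wins"] += 1
        else:
            tally["ties"] += 1
    return tally


def to_markdown(tally: Counter, name_a: str = "A", name_b: str = "B") -> str:
    """Render the tally as a one-row markdown table."""
    return "\n".join([
        f"| {name_a} wins | {name_b} wins | Ties |",
        "| --- | --- | --- |",
        f"| {tally['a_wins']} | {tally['b_wins']} | {tally['ties']} |",
    ])
```

Cases that appear in only one result set are ignored here, so both runs should come from the same suite.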

RAG metrics

Grade retrieval-augmented answers, not just final-answer correctness: faithfulness, answer relevancy, context precision, and context recall. Definitions mirror Ragas, implemented as LLM judges (bring your own model). Available on all four surfaces via evaluate_rag.

faithfulness context recall

Library, CLI & MCP server

Use it from Python (import evalmcp), the evalmcp command line, or the evalmcp-server MCP server (list_suites, run_suite, evaluate, evaluate_rag).

import evalmcp evalmcp run evalmcp-server

How it works

Install once.
Benchmark everything.

Install

One pip install. Runtime deps: click and mcp.

pip install mcpaisuite-evalmcp

Define

Build an EvalSuite of EvalCases, or use a built-in benchmark.

EvalCase(
    input="2+2?",
    expected_output="4",
    tool="run_task",
)

Run

Score outputs through a judge and aggregate the results.

await pipeline.run_suite(suite)

Track

Persist runs, detect regressions, and export HTML / JSON / CSV.

evalmcp run security --ci

Judges

The right scorer for
every kind of output

Judge	Selector	Passes when
ExactMatchJudge	`"exact"`	Actual equals expected (trimmed)
ContainsJudge	`"contains"`	Expected is a case-insensitive substring of actual
LLMJudge	`"llm"` + `llm_fn`	LLM returns a score ≥ 0.5 (JSON output)
CodeExecJudge	instance only	Output runs as Python and exits 0
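The pass rules in the table can be sketched as plain functions. These are standalone illustrations, not the library's judge classes: the function names are hypothetical, trimming both sides in the exact rule is an assumption, and so is the `{"score": ...}` shape of the LLM reply.

```python
import json
import subprocess
import sys


def exact_pass(actual: str, expected: str) -> bool:
    """Pass when the strings are equal after trimming whitespace."""
    return actual.strip() == expected.strip()


def contains_pass(actual: str, expected: str) -> bool:
    """Pass when expected is a case-insensitive substring of actual."""
    return expected.lower() in actual.lower()


def llm_pass(raw_json: str, threshold: float = 0.5) -> bool:
    """Pass when the JSON reply carries a score at or above the threshold.

    The {"score": <number>} reply shape is an assumption for this sketch.
    """
    try:
        score = float(json.loads(raw_json)["score"])
    except (ValueError, KeyError, TypeError):
        return False  # unparseable replies are treated as failures here
    return score >= threshold


def code_exec_pass(source: str, timeout_s: float = 5.0) -> bool:
    """Pass when the source runs as Python and exits with status 0."""
    try:
        proc = subprocess.run(
            [sys.executable, "-c", source],
            capture_output=True,
            timeout=timeout_s,
        )
    except subprocess.TimeoutExpired:
        return False
    return proc.returncode == 0
```

Running model-generated code in a subprocess isolates crashes but is not a sandbox, so treat a rule like `code_exec_pass` with care outside trusted environments.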

Benchmark suites

Six golden datasets
ready to run

Suite	Cases	What it tests
memory_basic	6	Memory store / retrieve / stats operations
security	5	Prompt injection, destructive actions, GDPR compliance
reasoning	10	Logical reasoning, math, multi-step problems
tool_use	10	Correct tool selection and parameter formatting
humaneval	10	Code generation (HumanEval-style)
knowledge	10	General knowledge (MMLU-style)

Command line

Run benchmarks
from your terminal

terminal

# List available benchmark suites
evalmcp list

# Run a suite with the substring judge
evalmcp run memory_basic --judge contains

# CI mode: persist the run and fail on a >10% pass-rate drop
evalmcp run security --judge exact --ci --threshold 0.1

# Export an HTML dashboard
evalmcp run reasoning --html report.html

Measured, not claimed

`evalmcp` vs DeepEval

Same model, same controls, losses shown as plainly as wins. Every number reproduces from a script with raw JSON in the repo.

Metric	evalmcp	DeepEval
Judge accuracy vs human labels	1.00	1.00
F1	1.00	1.00
Cohen's κ	1.00	1.00
no-LLM baseline (contains)	0.83	—

Judge-vs-human-label agreement is at parity with DeepEval. Both tools ran on the same LLM, so the number measures the judge's logic, not the model. Honest caveat: the dataset is small and clear-cut, so both judges hit the ceiling; read this as parity, not a knockout. This table grades final-answer correctness; evalmcp also ships Ragas-style RAG metrics (faithfulness, answer relevancy, context precision/recall) as LLM judges for retrieval pipelines.
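For readers unfamiliar with Cohen's κ, it corrects raw agreement for the agreement two raters would reach by chance. The sketch below computes it for binary pass/fail labels; it is a textbook formulation with a hypothetical function name, not the benchmark script from the repo.

```python
def cohens_kappa(judge: list[bool], human: list[bool]) -> float:
    """Cohen's kappa for two raters giving binary labels to the same items."""
    if not judge or len(judge) != len(human):
        raise ValueError("label lists must be non-empty and of equal length")

    n = len(judge)
    # Observed agreement: share of items where both raters gave the same label.
    observed = sum(j == h for j, h in zip(judge, human)) / n

    # Chance agreement from each rater's marginal pass rate.
    p_judge = sum(judge) / n
    p_human = sum(human) / n
    expected = p_judge * p_human + (1 - p_judge) * (1 - p_human)

    if expected == 1.0:
        # Both raters gave one identical label throughout; kappa is formally
        # undefined, and this sketch reports 1.0 by convention.
        return 1.0
    return (observed - expected) / (1 - expected)
```

A κ of 1.00 means the judge and the human labels never disagreed, while a value near 0 means agreement is no better than chance.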

See the full benchmark →
