Python eval library · CLI · AGPL-3.0

Evaluate & benchmark
your AI agents.

evalmcp is a Python library and CLI for scoring MCP agent outputs. It ships golden datasets, pluggable judges (exact, contains, LLM-as-judge, code execution), six built-in benchmark suites, standard metrics, Ragas-style RAG metrics, regression detection, and HTML reports. Use it from Python, the CLI, or its MCP server (4 tools).

eval_agent.py
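import asyncio
from evalmcp import EvalPipeline, EvalSuite, EvalCase

# Two golden cases, each tagged so results can be broken down per tag.
suite = EvalSuite(name="my_tests", cases=[
    EvalCase(input="What is 2+2?", expected_output="4",
             tool="run_task", tags=["math"]),
    EvalCase(input="Capital of France?", expected_output="Paris",
             tool="run_task", tags=["geo"]),
])

# "contains" passes when expected_output is a case-insensitive
# substring of the agent's output.
pipeline = EvalPipeline(judge="contains")
# run_suite is a coroutine, so drive it with asyncio.run from sync code.
results = asyncio.run(pipeline.run_suite(suite))

# Aggregate the per-case results and print the overall pass rate.
summary = pipeline.summary(results)
print(f"Pass rate: {summary['pass_rate']:.0%}")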
LLM-as-judge
Regression detection
HTML reports
4 judge types
6 benchmark suites
Accuracy / P / R / F1
SQLite run store
Python 3.10+

Everything you need
to measure agent quality

Pluggable Judges

Exact match, case-insensitive substring, LLM-as-judge for semantic scoring, or code execution. Bring your own by subclassing BaseJudge.

exact · contains · llm

Built-in Suites

Six golden-dataset benchmarks out of the box: memory, security, reasoning, tool-use, HumanEval-style code, and MMLU-style knowledge.

memory_basic · security

Standard Metrics

Accuracy, precision, recall, F1, average latency, token totals, and per-tag breakdowns computed from every run.

F1 · per-tag
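As a rough illustration of the metrics listed above, the sketch below computes accuracy, precision, recall, F1, and a per-tag pass rate from plain Python lists. It does not use evalmcp's own API: `binary_metrics`, `pass_rate_by_tag`, and the input shapes are hypothetical, and treating `predicted` as a judge verdict and `actual` as a reference label is an assumption.

```python
from collections import defaultdict


def binary_metrics(predicted: list[bool], actual: list[bool]) -> dict[str, float]:
    """Accuracy, precision, recall, and F1 for boolean labels (hypothetical helper)."""
    pairs = list(zip(predicted, actual))
    tp = sum(p and a for p, a in pairs)          # predicted pass, truly pass
    fp = sum(p and not a for p, a in pairs)      # predicted pass, truly fail
    fn = sum(a and not p for p, a in pairs)      # predicted fail, truly pass
    tn = sum(not p and not a for p, a in pairs)  # predicted fail, truly fail
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    accuracy = (tp + tn) / len(pairs) if pairs else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


def pass_rate_by_tag(results: list[tuple[list[str], bool]]) -> dict[str, float]:
    """Pass rate per tag; a case with several tags counts toward each of them."""
    passed: dict[str, int] = defaultdict(int)
    total: dict[str, int] = defaultdict(int)
    for tags, ok in results:
        for tag in tags:
            total[tag] += 1
            passed[tag] += int(ok)
    return {tag: passed[tag] / total[tag] for tag in total}
```

A case tagged with both `math` and `geo` contributes to both breakdowns, which is one reasonable convention among several.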

Regression Detection

Compare the two most recent runs of a suite and flag a regression when the pass rate drops past your threshold. Built for CI.

--ci · --threshold
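The check described above can be sketched in a few lines. This is not evalmcp's implementation: `is_regression` and `ci_exit_code` are hypothetical names, and reading the threshold as an absolute drop in pass rate (0.1 meaning ten percentage points) with a strict comparison is an assumption.

```python
def pass_rate(passed: list[bool]) -> float:
    """Fraction of cases that passed; an empty run counts as 0.0."""
    return sum(passed) / len(passed) if passed else 0.0


def is_regression(previous_rate: float, latest_rate: float, threshold: float = 0.1) -> bool:
    """Flag a regression when the pass rate fell by more than `threshold` (assumed absolute)."""
    return (previous_rate - latest_rate) > threshold


def ci_exit_code(previous: list[bool], latest: list[bool], threshold: float = 0.1) -> int:
    """Return 1 (fail the CI job) on a regression, otherwise 0."""
    return 1 if is_regression(pass_rate(previous), pass_rate(latest), threshold) else 0
```

An improvement never trips the check, because the difference is negative. Drops that land exactly on the threshold are sensitive to floating-point rounding, so a real implementation may want a small tolerance.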

Persistent Store

EvalStore persists every run to a local SQLite database so you can track quality trends and diff runs over time.

SQLite · run history
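To make the idea of a run store concrete, here is a minimal sketch built on Python's standard `sqlite3` module. It is not EvalStore: the `runs` table, its columns, and the helper names are all assumptions made for illustration.

```python
import sqlite3
import time


def open_store(path: str = ":memory:") -> sqlite3.Connection:
    """Open (or create) a run database with a hypothetical one-table schema."""
    conn = sqlite3.connect(path)
    conn.execute(
        """CREATE TABLE IF NOT EXISTS runs (
               id INTEGER PRIMARY KEY AUTOINCREMENT,
               suite TEXT NOT NULL,
               pass_rate REAL NOT NULL,
               created_at REAL NOT NULL
           )"""
    )
    return conn


def save_run(conn: sqlite3.Connection, suite: str, pass_rate: float) -> None:
    """Persist one run; the `with` block commits on success."""
    with conn:
        conn.execute(
            "INSERT INTO runs (suite, pass_rate, created_at) VALUES (?, ?, ?)",
            (suite, pass_rate, time.time()),
        )


def last_two_pass_rates(conn: sqlite3.Connection, suite: str) -> list[float]:
    """Pass rates of the two most recent runs of a suite, newest first."""
    rows = conn.execute(
        "SELECT pass_rate FROM runs WHERE suite = ? ORDER BY id DESC LIMIT 2",
        (suite,),
    ).fetchall()
    return [row[0] for row in rows]
```

Ordering by the autoincrement `id` rather than the timestamp keeps "most recent" well defined even when two runs land in the same clock tick.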

HTML & Data Export

Generate a standalone HTML dashboard, or export results to JSON and CSV for downstream analysis.

HTML · JSON · CSV
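For readers who want to see what a JSON or CSV export amounts to, the sketch below serializes a list of result rows with the standard library. The row fields (`input`, `expected`, `actual`, `passed`) and the function names are hypothetical and are not evalmcp's export format.

```python
import csv
import io
import json

# Hypothetical column set for one result row.
FIELDS = ["input", "expected", "actual", "passed"]


def to_json(results: list[dict]) -> str:
    """Serialize result rows as pretty-printed JSON."""
    return json.dumps(results, indent=2)


def to_csv(results: list[dict]) -> str:
    """Serialize result rows as CSV with a header line."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=FIELDS)
    writer.writeheader()
    writer.writerows(results)
    return buffer.getvalue()
```

Writing to an in-memory buffer keeps the helpers easy to test; a caller can pass the returned string to a file, an HTTP response, or a CI artifact.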

Model Comparison

ModelComparison runs an A/B of two result sets case-by-case, with win/tie counts and a markdown report.

A/B
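A case-by-case A/B comparison can be sketched as follows. This is an illustration of the idea only, not the ModelComparison class: the function names and the `{case_id: passed}` input shape are assumptions.

```python
def compare(results_a: dict[str, bool], results_b: dict[str, bool]) -> dict[str, int]:
    """Count wins and ties over the cases present in both result sets."""
    counts = {"a_wins": 0, "b_wins": 0, "ties": 0}
    for case_id in results_a.keys() & results_b.keys():
        if results_a[case_id] == results_b[case_id]:
            counts["ties"] += 1      # both passed or both failed
        elif results_a[case_id]:
            counts["a_wins"] += 1    # only A passed
        else:
            counts["b_wins"] += 1    # only B passed
    return counts


def to_markdown(counts: dict[str, int], name_a: str, name_b: str) -> str:
    """Render the counts as a small markdown table."""
    lines = [
        "| Outcome | Cases |",
        "| --- | --- |",
        f"| {name_a} wins | {counts['a_wins']} |",
        f"| {name_b} wins | {counts['b_wins']} |",
        f"| Ties | {counts['ties']} |",
    ]
    return "\n".join(lines)
```

Cases that appear in only one result set are skipped here, so the counts always describe a like-for-like comparison.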

RAG metrics

Grade retrieval-augmented answers, not just final-answer correctness: faithfulness, answer relevancy, context precision, and context recall. Definitions mirror Ragas and are implemented as LLM judges, so you bring your own model. evaluate_rag is available on all four surfaces.

faithfulness · context recall
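The arithmetic behind three of these metrics is simple once the per-item verdicts exist. The sketch below assumes those verdicts have already been produced (in practice by an LLM judge) and follows the commonly cited Ragas-style definitions. The function names and boolean inputs are hypothetical and this is not the evaluate_rag implementation. Answer relevancy is left out because it needs a model call.

```python
def faithfulness(claim_supported: list[bool]) -> float:
    """Share of claims in the answer that the retrieved context supports."""
    return sum(claim_supported) / len(claim_supported) if claim_supported else 0.0


def context_recall(reference_claim_found: list[bool]) -> float:
    """Share of reference-answer claims that can be attributed to the context."""
    return sum(reference_claim_found) / len(reference_claim_found) if reference_claim_found else 0.0


def context_precision(chunk_relevant: list[bool]) -> float:
    """Mean of precision@k taken at each rank k that holds a relevant chunk."""
    hits = 0
    total = 0.0
    for k, relevant in enumerate(chunk_relevant, start=1):
        if relevant:
            hits += 1
            total += hits / k  # precision@k at this relevant rank
    return total / hits if hits else 0.0
```

Context precision rewards putting relevant chunks early: moving a relevant chunk from rank 3 to rank 2 raises the score even though the set of retrieved chunks is unchanged.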

Library, CLI & MCP server

Use it from Python (import evalmcp), the evalmcp command line, or the evalmcp-server MCP server (list_suites, run_suite, evaluate, evaluate_rag).

import evalmcp · evalmcp run · evalmcp-server

Install once.
Benchmark everything.

1

Install

One pip install. Runtime deps: click and mcp.

pip install mcpaisuite-evalmcp
2

Define

Build an EvalSuite of EvalCases, or use a built-in benchmark.

EvalCase(
    input="2+2?",
    expected_output="4",
    tool="run_task",
)
3

Run

Score outputs through a judge and aggregate the results.

await pipeline.run_suite(suite)
4

Track

Persist runs, detect regressions, and export HTML / JSON / CSV.

evalmcp run security --ci

The right scorer for
every kind of output

Judge            Selector          Passes when
ExactMatchJudge  "exact"           Actual equals expected (trimmed)
ContainsJudge    "contains"        Expected is a case-insensitive substring of actual
LLMJudge         "llm" + llm_fn    LLM returns a score ≥ 0.5 (JSON output)
CodeExecJudge    instance only     Output runs as Python and exits 0
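The four pass rules in the table can be written out as plain functions. This is a sketch of the rules as stated, not the library's judge classes: the function names, the `{"score": ...}` reply shape, trimming both sides for the exact rule, and the timeout are assumptions.

```python
import json
import subprocess
import sys


def exact_pass(actual: str, expected: str) -> bool:
    """Exact rule: equal after trimming surrounding whitespace (both sides, by assumption)."""
    return actual.strip() == expected.strip()


def contains_pass(actual: str, expected: str) -> bool:
    """Contains rule: expected is a case-insensitive substring of actual."""
    return expected.lower() in actual.lower()


def llm_pass(raw_reply: str, threshold: float = 0.5) -> bool:
    """LLM rule: parse the judge model's JSON reply and pass at score >= threshold."""
    try:
        score = float(json.loads(raw_reply)["score"])  # assumed reply shape
    except (ValueError, KeyError, TypeError):
        return False  # unparseable replies count as failures in this sketch
    return score >= threshold


def code_exec_pass(source: str, timeout_s: float = 5.0) -> bool:
    """Code-execution rule: run the output as Python in a child process, pass on exit 0."""
    try:
        proc = subprocess.run(
            [sys.executable, "-c", source],
            capture_output=True,
            timeout=timeout_s,
        )
    except subprocess.TimeoutExpired:
        return False
    return proc.returncode == 0
```

Running untrusted model output in a child process isolates crashes but is not a sandbox; anything beyond a local experiment would need stronger isolation.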

Six golden datasets
ready to run

Suite         Cases  What it tests
memory_basic  6      Memory store / retrieve / stats operations
security      5      Prompt injection, destructive actions, GDPR compliance
reasoning     10     Logical reasoning, math, multi-step problems
tool_use      10     Correct tool selection and parameter formatting
humaneval     10     Code generation (HumanEval-style)
knowledge     10     General knowledge (MMLU-style)

Run benchmarks
from your terminal

terminal
# List available benchmark suites
evalmcp list

# Run a suite with the substring judge
evalmcp run memory_basic --judge contains

# CI mode: persist the run and fail on a >10% pass-rate drop
evalmcp run security --judge exact --ci --threshold 0.1

# Export an HTML dashboard
evalmcp run reasoning --html report.html

evalmcp vs DeepEval

Same model, same controls, losses shown as plainly as wins. Every number reproduces from a script with raw JSON in the repo.

Metric                          evalmcp  DeepEval
Judge accuracy vs human labels  1.00     1.00
F1                              1.00     1.00
Cohen's κ                       1.00     1.00
no-LLM baseline (contains)      0.83

Judge-vs-human-label agreement is at parity with DeepEval, with both running on the same LLM, so the number measures the judge's logic, not the model. Honest caveat: the dataset is small and clear-cut, so both judges hit the ceiling. That is parity, not a knockout. This table grades final-answer correctness; evalmcp also ships Ragas-style RAG metrics (faithfulness, answer relevancy, context precision/recall) as LLM judges for retrieval pipelines.
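For readers unfamiliar with Cohen's κ, the sketch below shows how agreement between a judge and human labels is corrected for chance. It is a generic implementation of the standard formula, not the benchmark script from the repo, and the function name and inputs are hypothetical.

```python
from collections import Counter


def cohens_kappa(judge: list[bool], human: list[bool]) -> float:
    """Cohen's kappa: (observed - expected) / (1 - expected) agreement."""
    n = len(judge)
    observed = sum(j == h for j, h in zip(judge, human)) / n
    judge_counts = Counter(judge)
    human_counts = Counter(human)
    # Chance agreement: probability both raters pick the same label independently.
    expected = sum(
        judge_counts[label] * human_counts[label]
        for label in set(judge_counts) | set(human_counts)
    ) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used one identical label throughout (a convention)
    return (observed - expected) / (1 - expected)
```

A κ of 1.00 means perfect agreement, while 0.0 means the judge agrees with the human labels no more often than chance would predict.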

