evalmcp is a Python library and CLI for scoring MCP agent outputs. Golden datasets, pluggable judges (exact, contains, LLM-as-judge, code execution), six built-in benchmark suites, standard metrics, Ragas-style RAG metrics, regression detection, and HTML reports. Use it from Python, the CLI, or its MCP server (4 tools).
```python
import asyncio

from evalmcp import EvalPipeline, EvalSuite, EvalCase

# A suite of golden cases: each pairs an input with its expected output.
suite = EvalSuite(name="my_tests", cases=[
    EvalCase(input="What is 2+2?", expected_output="4",
             tool="run_task", tags=["math"]),
    EvalCase(input="Capital of France?", expected_output="Paris",
             tool="run_task", tags=["geo"]),
])

# Score every case with the case-insensitive substring judge.
pipeline = EvalPipeline(judge="contains")
results = asyncio.run(pipeline.run_suite(suite))

# Aggregate the per-case results and print the pass rate.
summary = pipeline.summary(results)
print(f"Pass rate: {summary['pass_rate']:.0%}")
```

Judges cover exact match, case-insensitive substring, LLM-as-judge for semantic scoring, and code execution. Bring your own by subclassing BaseJudge.
Six golden-dataset benchmarks out of the box: memory, security, reasoning, tool-use, HumanEval-style code, and MMLU-style knowledge.
Accuracy, precision, recall, F1, average latency, token totals, and per-tag breakdowns computed from every run.
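As a rough illustration of how the metrics named above relate to per-case records, here is a minimal sketch. It is not evalmcp's implementation: the `CaseResult` dataclass, its fields, and the `summarize` function are hypothetical, and treating the judge verdict as the prediction and a separate reference label as ground truth is an assumption about how precision and recall are defined.

```python
from collections import defaultdict
from dataclasses import dataclass, field


@dataclass
class CaseResult:
    # Hypothetical record; evalmcp's real result type may differ.
    passed: bool        # judge verdict for this case
    label: bool         # reference verdict, e.g. a human label (assumption)
    latency_ms: float
    tokens: int
    tags: list[str] = field(default_factory=list)


def summarize(results: list[CaseResult]) -> dict:
    n = len(results)
    tp = sum(r.passed and r.label for r in results)
    fp = sum(r.passed and not r.label for r in results)
    fn = sum(not r.passed and r.label for r in results)

    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)

    # Group judge verdicts by tag for the per-tag breakdown.
    by_tag = defaultdict(list)
    for r in results:
        for tag in r.tags:
            by_tag[tag].append(r.passed)

    return {
        "accuracy": sum(r.passed == r.label for r in results) / n if n else 0.0,
        "precision": precision,
        "recall": recall,
        "f1": f1,
        "avg_latency_ms": sum(r.latency_ms for r in results) / n if n else 0.0,
        "total_tokens": sum(r.tokens for r in results),
        "pass_rate_by_tag": {t: sum(v) / len(v) for t, v in by_tag.items()},
    }
```

Keeping the per-tag breakdown as a plain dictionary makes it easy to spot a tag whose pass rate diverges from the overall number.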
Compare the two most recent runs of a suite and flag a regression when the pass rate drops past your threshold. Built for CI.
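The regression check described above can be sketched in a few lines. This is an illustration only: the function names are hypothetical, and reading the threshold as an absolute drop in pass rate (0.1 meaning ten percentage points) is an assumption, since the library could equally use a relative drop.

```python
def is_regression(previous_pass_rate: float,
                  latest_pass_rate: float,
                  threshold: float = 0.1) -> bool:
    """Return True when the pass rate fell by more than `threshold`.

    Assumption: the drop is measured in absolute terms (0.90 -> 0.75 is 0.15).
    """
    return (previous_pass_rate - latest_pass_rate) > threshold


def ci_exit_code(previous_pass_rate: float,
                 latest_pass_rate: float,
                 threshold: float = 0.1) -> int:
    """Map the check onto a process exit code: 1 fails the CI job, 0 passes."""
    return 1 if is_regression(previous_pass_rate, latest_pass_rate, threshold) else 0
```

In a CI script, the result of `ci_exit_code(...)` would typically be handed to `sys.exit` so that the job fails when quality drops.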
EvalStore persists every run to a local SQLite database so you can track quality trends and diff runs over time.
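To make the idea of a local run history concrete, the sketch below stores pass rates in SQLite with Python's standard `sqlite3` module. The schema, table name, and helper functions are hypothetical and are not EvalStore's actual layout or API.

```python
import sqlite3


def open_store(path: str = ":memory:") -> sqlite3.Connection:
    """Open (or create) a run-history database. The schema is an assumption."""
    conn = sqlite3.connect(path)
    conn.execute(
        """CREATE TABLE IF NOT EXISTS runs (
               id INTEGER PRIMARY KEY AUTOINCREMENT,
               suite TEXT NOT NULL,
               pass_rate REAL NOT NULL,
               created_at TEXT DEFAULT CURRENT_TIMESTAMP
           )"""
    )
    return conn


def save_run(conn: sqlite3.Connection, suite: str, pass_rate: float) -> int:
    """Insert one run and return its row id."""
    cur = conn.execute(
        "INSERT INTO runs (suite, pass_rate) VALUES (?, ?)", (suite, pass_rate)
    )
    conn.commit()
    return cur.lastrowid


def latest_two(conn: sqlite3.Connection, suite: str) -> list[float]:
    """Return the pass rates of the two most recent runs, newest first."""
    rows = conn.execute(
        "SELECT pass_rate FROM runs WHERE suite = ? ORDER BY id DESC LIMIT 2",
        (suite,),
    ).fetchall()
    return [row[0] for row in rows]
```

Ordering by the auto-incrementing id rather than the timestamp avoids ties when two runs land within the same second.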
Generate a standalone HTML dashboard, or export results to JSON and CSV for downstream analysis.
ModelComparison runs an A/B of two result sets case-by-case, with win/tie counts and a markdown report.
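A case-by-case A/B comparison like the one described above boils down to counting wins and ties over the cases both result sets share. The sketch below assumes each result set is a mapping from a case id to a pass/fail verdict; the `compare` and `to_markdown` helpers are hypothetical and are not ModelComparison's API.

```python
def compare(results_a: dict[str, bool], results_b: dict[str, bool]) -> dict[str, int]:
    """Count wins and ties over the case ids present in both result sets."""
    counts = {"a_wins": 0, "b_wins": 0, "ties": 0}
    for case_id in results_a.keys() & results_b.keys():
        a, b = results_a[case_id], results_b[case_id]
        if a == b:
            counts["ties"] += 1      # both passed or both failed
        elif a:
            counts["a_wins"] += 1    # only A passed
        else:
            counts["b_wins"] += 1    # only B passed
    return counts


def to_markdown(counts: dict[str, int], name_a: str = "A", name_b: str = "B") -> str:
    """Render the counts as a small markdown table."""
    lines = [
        "| Outcome | Cases |",
        "|---|---|",
        f"| {name_a} wins | {counts['a_wins']} |",
        f"| {name_b} wins | {counts['b_wins']} |",
        f"| Ties | {counts['ties']} |",
    ]
    return "\n".join(lines)
```

Restricting the comparison to shared case ids keeps the counts meaningful when one run covers more cases than the other.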
Grade retrieval-augmented answers, not just final-answer correctness: faithfulness, answer relevancy, context precision, and context recall. Definitions mirror Ragas, implemented as LLM judges (bring your own model). Available on every surface through evaluate_rag.
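For intuition about the two context metrics, the sketch below shows the arithmetic once an LLM judge has produced per-item verdicts. It follows the Ragas formulation as publicly documented (rank-weighted precision over retrieved chunks, and the share of reference claims supported by the context); whether evalmcp computes them exactly this way is an assumption, and both function names are hypothetical.

```python
def context_precision(relevance: list[bool]) -> float:
    """Rank-weighted precision over retrieved chunks.

    relevance[i] is the judge's verdict on whether the chunk at rank i + 1
    is relevant. Precision@k is accumulated only at ranks holding a relevant
    chunk, then averaged over the number of relevant chunks.
    """
    hits = 0
    weighted = 0.0
    for rank, relevant in enumerate(relevance, start=1):
        if relevant:
            hits += 1
            weighted += hits / rank
    return weighted / hits if hits else 0.0


def context_recall(claim_supported: list[bool]) -> float:
    """Share of reference-answer claims that the retrieved context supports."""
    if not claim_supported:
        return 0.0
    return sum(claim_supported) / len(claim_supported)
```

Because precision is weighted by rank, the same set of relevant chunks scores higher when they appear near the top of the retrieved list.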
Use it from Python (import evalmcp), the evalmcp command line, or the evalmcp-server MCP server (list_suites, run_suite, evaluate, evaluate_rag).
One pip install. Runtime deps: click and mcp.
Build an EvalSuite of EvalCases, or use a built-in benchmark.
Score outputs through a judge and aggregate the results.
Persist runs, detect regressions, and export HTML / JSON / CSV.
| Judge | Selector | Passes when |
|---|---|---|
| ExactMatchJudge | "exact" | Actual equals expected (trimmed) |
| ContainsJudge | "contains" | Expected is a case-insensitive substring of actual |
| LLMJudge | "llm" + llm_fn | LLM returns a score ≥ 0.5 (JSON output) |
| CodeExecJudge | instance only | Output runs as Python and exits 0 |
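The pass conditions in the table can be approximated with plain functions. This is a sketch of the rules as the table states them, not the library's judge classes: the function names are hypothetical, trimming both sides for the exact match is an assumption, and so is the `"score"` key in the LLM's JSON reply.

```python
import json
import subprocess
import sys


def exact_match(actual: str, expected: str) -> bool:
    """Pass when the trimmed strings are equal (trimming both is an assumption)."""
    return actual.strip() == expected.strip()


def contains(actual: str, expected: str) -> bool:
    """Pass when expected is a case-insensitive substring of actual."""
    return expected.lower() in actual.lower()


def llm_pass(llm_response_json: str, threshold: float = 0.5) -> bool:
    """Pass when the score in the LLM's JSON reply is at least the threshold.

    The "score" key is a hypothetical reply format.
    """
    score = json.loads(llm_response_json)["score"]
    return float(score) >= threshold


def code_exec(output: str, timeout: float = 10.0) -> bool:
    """Pass when the output runs as Python and the process exits with code 0."""
    try:
        proc = subprocess.run(
            [sys.executable, "-c", output],
            capture_output=True,
            timeout=timeout,
        )
    except subprocess.TimeoutExpired:
        return False
    return proc.returncode == 0
```

Running model output in a subprocess executes untrusted code, so a real setup would want a sandbox or container around the call.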
| Suite | Cases | What it tests |
|---|---|---|
| memory_basic | 6 | Memory store / retrieve / stats operations |
| security | 5 | Prompt injection, destructive actions, GDPR compliance |
| reasoning | 10 | Logical reasoning, math, multi-step problems |
| tool_use | 10 | Correct tool selection and parameter formatting |
| humaneval | 10 | Code generation (HumanEval-style) |
| knowledge | 10 | General knowledge (MMLU-style) |
```bash
# List available benchmark suites
evalmcp list

# Run a suite with the substring judge
evalmcp run memory_basic --judge contains

# CI mode: persist the run and fail on a >10% pass-rate drop
evalmcp run security --judge exact --ci --threshold 0.1

# Export an HTML dashboard
evalmcp run reasoning --html report.html
```
evalmcp vs DeepEval

Same model, same controls, losses shown as plainly as wins. Every number reproduces from a script with raw JSON in the repo.
| Metric | evalmcp | DeepEval |
|---|---|---|
| Judge accuracy vs human labels | 1.00 | 1.00 |
| F1 | 1.00 | 1.00 |
| Cohen's κ | 1.00 | 1.00 |
| no-LLM baseline (contains) | 0.83 | — |
Judge-vs-human-label agreement is at parity with DeepEval. Both run on the same LLM, so the number measures the judge's logic, not the model. Honest caveat: the dataset is small and clear-cut, so both judges hit the ceiling; read the result as parity, not a knockout. This table grades final-answer correctness; evalmcp also ships Ragas-style RAG metrics (faithfulness, answer relevancy, context precision/recall) as LLM judges for retrieval pipelines.
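For readers who want to check an agreement figure such as Cohen's κ themselves, the standard two-rater formula for binary verdicts is short enough to sketch. The function below is an illustration and not the benchmark script from the repo; its name and the handling of the degenerate case are choices made here.

```python
def cohens_kappa(judge: list[bool], human: list[bool]) -> float:
    """Cohen's kappa for two raters giving binary pass/fail verdicts."""
    if not judge or len(judge) != len(human):
        raise ValueError("need two non-empty verdict lists of equal length")

    n = len(judge)
    # Observed agreement: fraction of cases where both raters agree.
    observed = sum(j == h for j, h in zip(judge, human)) / n

    # Agreement expected by chance, from each rater's own pass rate.
    p_judge = sum(judge) / n
    p_human = sum(human) / n
    expected = p_judge * p_human + (1 - p_judge) * (1 - p_human)

    if expected == 1.0:
        # Both raters used a single class throughout; kappa is undefined.
        # Reporting 1.0 here is a convention chosen for this sketch.
        return 1.0
    return (observed - expected) / (1 - expected)
```

Unlike raw accuracy, kappa subtracts the agreement two raters would reach by chance, which is why it is reported alongside accuracy and F1.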