Testing and evaluation framework for AI agents. Define test suites in YAML, grade agent outputs with 10 pluggable graders, track results over time, and detect regressions with statistical comparison.
AI agents are hard to test. They're non-deterministic, they call tools, and their outputs vary between runs. Traditional unit tests don't cut it.
- 🎯 YAML-based test suites — Define inputs, expected outputs, and grading criteria declaratively
- 📊 Statistical regression detection — Welch's t-test across multiple runs, not just pass/fail
- 🔌 10 built-in graders — Exact match, contains, regex, tool-check, LLM-judge, custom, JSON-schema, semantic, latency, and cost
- 🔗 AgentLens integration — Import real production sessions as test cases
- 💰 Cost & latency tracking — Know what each eval costs in tokens and dollars
- 🗄️ SQLite result storage — Every run is persisted for historical comparison
Install with pip:

    pip install agentevalkit

Define a suite in `suite.yaml`:

    # suite.yaml
    name: my-agent-tests
    agent: my_agent:run
    cases:
      - name: basic-math
        input: "What is 2 + 2?"
        expected:
          output_contains: ["4"]
        grader: contains
      - name: tool-usage
        input: "Search for the weather in NYC"
        expected:
          tools_called: ["web_search"]
        grader: tool-check
      - name: format-check
        input: "List 3 colors"
        expected:
          pattern: '\d\.\s+\w+'
        grader: regex

Write the agent callable in `my_agent.py`:

    # my_agent.py
    from agenteval.models import AgentResult

    def run(input_text: str) -> AgentResult:
        # Your agent logic here
        return AgentResult(
            output="The answer is 4.",
            tools_called=[{"name": "web_search", "args": {"query": "weather NYC"}}],
            tokens_in=12,
            tokens_out=8,
            cost_usd=0.0003,
        )

Run the suite:

    $ agenteval run --suite suite.yaml --verbose
    ============================================================
    Suite: my-agent-tests | Run: c1c6493118d5
    ============================================================
    PASS basic-addition (score=1.00, 150ms)
    PASS capital-city (score=1.00, 200ms)
    PASS quantum-summary (score=1.00, 350ms)
    PASS tool-usage (score=1.00, 280ms)
    PASS list-format (score=1.00, 120ms)
    Total: 5  Passed: 5  Failed: 0  Pass rate: 100%
    Cost: $0.0023  Avg latency: 220ms

| Grader | What it checks | Expected / config fields |
|---|---|---|
| `exact` | Exact string match | `output` |
| `contains` | Substring presence | `output_contains: [list]` |
| `regex` | Pattern matching | `pattern` |
| `tool-check` | Tools were called | `tools_called: [list]` |
| `llm-judge` | LLM evaluates quality | `criteria` (free-form) |
| `custom` | Your own function | `grader_config: {function: "mod:fn"}` |
| `json_schema` | Output validates against a JSON Schema | `grader_config: {schema: {...}}` or `{schema_file: path}` |
| `semantic` | Cosine similarity to expected text | `grader_config: {expected: str, threshold: 0.8}` |
| `latency` | Response time within budget | `grader_config: {max_ms: N}` |
| `cost` | Cost within budget | `grader_config: {max_usd: N}` |

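As an illustration of the `custom` grader, here is a minimal sketch of a grader function that checks whether the agent output parses as a JSON object. `EvalCase`, `AgentResult`, and `GradeResult` below are simplified stand-ins defined only so the snippet runs on its own; their fields (`passed`, `score`, and `reason` in particular) are assumptions, not AgentEval's actual definitions, so check the real classes before adapting it.

```python
import json
from dataclasses import dataclass, field


# Stand-in types so the sketch is self-contained. These are NOT the real
# AgentEval classes; the field names are assumptions for illustration.
@dataclass
class EvalCase:
    name: str
    input: str
    expected: dict = field(default_factory=dict)


@dataclass
class AgentResult:
    output: str


@dataclass
class GradeResult:
    passed: bool
    score: float
    reason: str = ""


def validate_json(case: EvalCase, result: AgentResult) -> GradeResult:
    """Pass when the agent output parses as a JSON object."""
    try:
        parsed = json.loads(result.output)
    except json.JSONDecodeError as exc:
        return GradeResult(passed=False, score=0.0, reason=f"invalid JSON: {exc.msg}")
    if not isinstance(parsed, dict):
        return GradeResult(passed=False, score=0.0, reason="JSON value is not an object")
    return GradeResult(passed=True, score=1.0)


case = EvalCase(name="custom-validation", input="Generate a JSON object")
good = validate_json(case, AgentResult(output='{"color": "red"}'))
bad = validate_json(case, AgentResult(output="not json"))
print(good.passed, bad.passed)
```

In a real suite, a function like this would be referenced through `grader_config: {function: "my_graders:validate_json"}`.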
Compare runs with Welch's t-test to detect statistically significant regressions:

    $ agenteval compare c1c6493118d5,d17a2dce0222 4ee7e40601e3,ba5b0dde212b
    ============================================================================
    Comparing: c1c6493118d5,d17a2dce0222 vs 4ee7e40601e3,ba5b0dde212b
    Alpha: 0.05  Regression threshold: 0.0
    ============================================================================
    Case              Base    Target  Diff    p-value  Sig  Status
    ----------------------------------------------------------------------------
    basic-addition    1.000   1.000   +0.000  —
    capital-city      1.000   0.500   -0.500  0.4533
    quantum-summary   1.000   0.500   -0.500  0.4533
    tool-usage        1.000   0.000   -1.000  0.0000   *    ▼ regressed
    list-format       1.000   0.500   -0.500  0.4533
    Summary: 0 improved, 1 regressed, 4 unchanged
    ⚠ 1 regression(s) detected!

Run the same suite multiple times and compare groups: `agenteval compare RUN_A1,RUN_A2 vs RUN_B1,RUN_B2`. The test uses SciPy when available and falls back to a pure-Python implementation.
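For readers unfamiliar with the test, the following is a minimal pure-Python sketch of the Welch statistic and its Welch-Satterthwaite degrees of freedom for two groups of per-case scores. It is not AgentEval's implementation: the function name `welch_t` and the zero-variance handling are assumptions for illustration, and turning `t` and `df` into a p-value still requires the Student's t distribution (for example `scipy.stats.t.sf`).

```python
import math
from statistics import mean, variance


def welch_t(a, b):
    """Return (t, df) for Welch's unequal-variance t-test on two samples."""
    na, nb = len(a), len(b)
    if na < 2 or nb < 2:
        raise ValueError("each group needs at least two scores")
    # Squared standard error contributed by each group.
    va, vb = variance(a) / na, variance(b) / nb
    se2 = va + vb
    diff = mean(a) - mean(b)
    if se2 == 0:
        # Both groups are constant (assumed handling): no difference gives
        # t = 0, any difference gives an infinite statistic.
        t = 0.0 if diff == 0 else math.copysign(math.inf, diff)
        return t, float(na + nb - 2)
    t = diff / math.sqrt(se2)
    # Welch-Satterthwaite approximation of the degrees of freedom.
    df = se2 ** 2 / (va ** 2 / (na - 1) + vb ** 2 / (nb - 1))
    return t, df


base_scores = [1.0, 1.0, 0.5, 1.0]
target_scores = [0.5, 0.0, 0.5, 0.5]
t, df = welch_t(base_scores, target_scores)
print(f"t = {t:.3f}, df = {df:.1f}")  # t = 2.828, df = 6.0
```

A positive `t` here means the base group scored higher than the target group, which is the direction a regression would show.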
Import real agent sessions from AgentLens as test suites:

    # From AgentLens SQLite database
    agenteval import --from agentlens --db sessions.db --output suite.yaml --grader contains

    # From AgentLens server API (single session or --batch)
    agenteval import-agentlens --server http://localhost:3000 --session SESSION_ID --output suite.yaml
    agenteval import-agentlens --server http://localhost:3000 --batch --limit 100 --output suite.yaml

    # With filtering and interactive review (server mode)
    agenteval import-agentlens --server http://localhost:3000 --batch --filter-tag production --auto-assertions --interactive --output suite.yaml

Import modes:

- SQLite mode (`import --from agentlens --db path`): reads directly from an AgentLens database file
- Server mode (`import-agentlens --server URL`): fetches sessions via the AgentLens HTTP API (use `--session ID` for one session or `--batch` for many)

Sessions are converted to eval cases with input/output mapping and optional tool-call assertions. Use `--auto-assertions` to automatically generate expected fields from session data, and `--interactive` to review each case before saving.

Turn production traffic into regression tests — no manual test writing needed.
Every eval tracks tokens and cost. Your agent callable returns AgentResult with tokens_in, tokens_out, and cost_usd, and AgentEval aggregates them per run.
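The per-run aggregation described above amounts to summing the three fields across cases. A minimal sketch, assuming a stand-in `AgentResult` dataclass that carries only the fields named in this README and a hypothetical `summarize_usage` helper (neither is AgentEval's own code):

```python
from dataclasses import dataclass


@dataclass
class AgentResult:
    # Stand-in with the fields named in this README; not the real class.
    output: str
    tokens_in: int = 0
    tokens_out: int = 0
    cost_usd: float = 0.0


def summarize_usage(results):
    """Sum token counts and cost over the results of one run."""
    return {
        "tokens_in": sum(r.tokens_in for r in results),
        "tokens_out": sum(r.tokens_out for r in results),
        # Round to avoid floating-point noise in the dollar total.
        "cost_usd": round(sum(r.cost_usd for r in results), 6),
    }


run_results = [
    AgentResult("The answer is 4.", tokens_in=12, tokens_out=8, cost_usd=0.0003),
    AgentResult("Paris", tokens_in=10, tokens_out=2, cost_usd=0.0002),
]
totals = summarize_usage(run_results)
print(totals)
```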
Full annotated example:

    name: my-agent-tests              # Suite name (shown in reports)
    agent: my_module:my_agent         # Default agent callable (module:function)
    defaults:                         # Defaults applied to all cases
      grader: contains
      grader_config:
        ignore_case: true
    cases:
      - name: basic-math              # Unique case name
        input: "What is 2 + 2?"       # Input passed to agent
        expected:                     # Grader-specific expected values
          output_contains: ["4"]
        grader: contains              # Override default grader
        tags: [math, basic]           # Tags for filtering (--tag math)
      - name: tool-usage
        input: "Search for weather"
        expected:
          tools_called: ["web_search"]
        grader: tool-check
      - name: quality-check
        input: "Explain gravity"
        expected:
          criteria: "Should mention Newton or Einstein, be scientifically accurate"
        grader: llm-judge
        grader_config:
          model: gpt-4o-mini          # LLM judge model
          api_base: https://api.openai.com/v1
      - name: custom-validation
        input: "Generate a JSON object"
        expected: {}
        grader: custom
        grader_config:
          function: my_graders:validate_json  # Your grader function

`agenteval run` executes a suite:

    agenteval run --suite suite.yaml [--agent module:fn] [--verbose] [--tag math] [--timeout 30] [--db agenteval.db]

- `--suite`: Path to YAML suite file (required)
- `--agent`: Override the agent callable from the suite
- `--verbose` / `-v`: Show per-case pass/fail details
- `--tag`: Filter cases by tag (repeatable)
- `--timeout`: Per-case timeout in seconds (default: 30)
- `--db`: SQLite database path (default: `agenteval.db`)

Exit code is 1 if any case fails.

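The `module:function` form used by `agent:` and `--agent` names a callable by its import path. One plausible way to resolve such a string, sketched with the standard library only (the helper name `load_callable` is hypothetical, and this is not necessarily how AgentEval does it):

```python
import importlib


def load_callable(spec: str):
    """Resolve a 'module:function' string to the callable it names."""
    module_name, sep, attr = spec.partition(":")
    if not sep or not module_name or not attr:
        raise ValueError(f"expected 'module:function', got {spec!r}")
    module = importlib.import_module(module_name)
    obj = getattr(module, attr)
    if not callable(obj):
        raise TypeError(f"{spec!r} does not name a callable")
    return obj


# Demonstrated with a standard-library function instead of a real agent.
dumps = load_callable("json:dumps")
print(dumps({"ok": True}))  # {"ok": true}
```

With a resolver like this, an `ImportError` or `AttributeError` usually means the module is not importable from the current working directory or the function name is misspelled.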
`agenteval list` shows stored runs:

    agenteval list [--suite-filter name] [--limit 20] [--db agenteval.db]

For example:

    $ agenteval list --limit 5
    ID            Suite            Passed  Failed  Rate  Created
    --------------------------------------------------------------------------------
    aeccd5e53f03  math-agent-demo  2       3       40%   2026-02-12T21:12:12
    4f3e380f622c  math-agent-demo  3       2       60%   2026-02-12T21:12:12
    bd4ef3a0727b  math-agent-demo  1       4       20%   2026-02-12T21:12:12
    e2ca43e99852  math-agent-demo  3       2       60%   2026-02-12T21:12:11
    32ed650cab6d  math-agent-demo  2       3       40%   2026-02-12T21:12:11

`agenteval compare` compares two runs or two groups of runs:

    agenteval compare RUN_A RUN_B [--alpha 0.05] [--threshold 0.0] [--stats/--no-stats]
    agenteval compare RUN_A1,RUN_A2 vs RUN_B1,RUN_B2  # Multi-run comparison

`agenteval import` builds a suite from AgentLens sessions:

    agenteval import --from agentlens --db sessions.db --output suite.yaml [--grader contains] [--limit 100]

`exact`: Compares `result.output` exactly with `expected.output`. Config: `ignore_case: bool`.

    expected:
      output: "The answer is 42."
    grader: exact
    grader_config:
      ignore_case: true

`contains`: Checks that all substrings in `expected.output_contains` appear in the output.

    expected:
      output_contains: ["Paris", "France"]
    grader: contains

`regex`: Matches `result.output` against `expected.pattern` (Python regex). Config: `flags: [IGNORECASE, DOTALL, MULTILINE]`.

    expected:
      pattern: '\d+\.\d+'
    grader: regex
    grader_config:
      flags: [IGNORECASE]

`tool-check`: Verifies that the expected tools were called. Config: `ordered: bool` for sequence matching.

    expected:
      tools_called: ["web_search", "calculator"]
    grader: tool-check
    grader_config:
      ordered: true

`llm-judge`: Sends the input, output, and criteria to an LLM for evaluation. Requires `OPENAI_API_KEY` or a compatible API.

    expected:
      criteria: "Response should be helpful, accurate, and concise"
    grader: llm-judge
    grader_config:
      model: gpt-4o-mini

`custom`: Imports and calls your own grader function. The function must have the signature `(case: EvalCase, result: AgentResult) -> GradeResult`.

    grader: custom
    grader_config:
      function: my_module:my_grader

Adapters let you test agents built with popular frameworks without writing a custom callable.

Install the extra for your framework:

    pip install "agentevalkit[langchain]"  # LangChain
    pip install "agentevalkit[crewai]"     # CrewAI
    pip install "agentevalkit[autogen]"    # AutoGen

| Adapter | Framework Method | Install Extra |
|---|---|---|
| `langchain` | `agent.invoke(input)` | `[langchain]` |
| `crewai` | `crew.kickoff(inputs={"input": ...})` | `[crewai]` |
| `autogen` | `agent.run(input)` or `agent.initiate_chat(message=...)` | `[autogen]` |

Usage with YAML suite defaults:

    # suite.yaml
    name: my-tests
    agent: my_module:my_chain
    defaults:
      adapter: langchain

Or via CLI:

    agenteval run --suite suite.yaml --adapter langchain

Each adapter extracts output, tool calls, and token usage from the framework's response format into a standard `AgentResult`.

Scale eval suites across multiple workers using Redis as a broker.

    pip install "agentevalkit[distributed]"

Start one or more workers:

    # Terminal 1: Start a worker
    agenteval worker --broker redis://localhost:6379 --agent my_module:my_agent

    # Terminal 2: Start another worker
    agenteval worker --broker redis://localhost:6379 --agent my_module:my_agent

Then run the suite against the same broker:

    agenteval run --suite suite.yaml --workers redis://localhost:6379 --worker-timeout 60

- The coordinator pushes eval cases to a Redis queue
- Workers pop cases, execute the agent, and push results back
- The coordinator collects results and builds the final `EvalRun`
- If no workers are detected, execution falls back to local mode automatically

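The push/pop/collect flow described above can be sketched in-process, with `queue.Queue` and threads standing in for Redis and the worker processes. Everything here (`run_distributed`, the sentinel-based shutdown, the toy agent) is a hypothetical illustration of the pattern, not AgentEval's coordinator or its Redis protocol.

```python
import queue
import threading


def toy_agent(text: str) -> str:
    # Placeholder for a real agent callable.
    return text.upper()


def worker(cases: queue.Queue, results: queue.Queue) -> None:
    """Pop cases until a sentinel arrives, pushing each result back."""
    while True:
        item = cases.get()
        if item is None:  # sentinel: no more work for this worker
            break
        name, text = item
        results.put((name, toy_agent(text)))


def run_distributed(suite: dict, n_workers: int = 2) -> dict:
    cases, results = queue.Queue(), queue.Queue()
    threads = [
        threading.Thread(target=worker, args=(cases, results))
        for _ in range(n_workers)
    ]
    for t in threads:
        t.start()
    # Coordinator: push every case, then one sentinel per worker.
    for item in suite.items():
        cases.put(item)
    for _ in threads:
        cases.put(None)
    for t in threads:
        t.join()
    # Collect results; arrival order is not guaranteed, so key by case name.
    collected = {}
    while not results.empty():
        name, output = results.get()
        collected[name] = output
    return collected


outputs = run_distributed({"basic-math": "what is 2 + 2?", "tool-usage": "search weather"})
print(outputs)
```

Keying results by case name matters in any setup like this, because workers finish in whatever order the scheduler allows.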
- `--workers URL`: Redis broker URL (supports `redis://` and `rediss://` for TLS)
- `--worker-timeout N`: Seconds to wait for worker results (default: 30)
- Workers register heartbeats and are automatically detected by the coordinator

Security: Use `rediss://` URLs with authentication for production deployments. See docs/troubleshooting.md for Redis security guidance.

See docs/troubleshooting.md for solutions to common issues including:

- Agent callable import errors (`module:function` format)
- Missing dependency extras (`[distributed]`, `[langchain]`, etc.)
- OpenAI API key setup for the `llm-judge` grader
- Compare command syntax
- Redis connection issues for distributed execution

Contributions welcome! This project uses:
- pytest for testing
- ruff for linting
- src layout (`src/agenteval/`)

Set up a development environment:

    git clone https://github.com/agentkitai/agenteval.git
    cd agenteval
    pip install -e ".[dev]"
    pytest

| Project | Description | |
|---|---|---|
| AgentLens | Observability & audit trail for AI agents | |
| Lore | Cross-agent memory and lesson sharing | |
| AgentGate | Human-in-the-loop approval gateway | |
| FormBridge | Agent-human mixed-mode forms | |
| AgentEval | Testing & evaluation framework | ⬅️ you are here |
| agentkit-cli | Unified CLI orchestrator | |

MIT — see LICENSE.