Agent Evaluation Platform

Can your agent actually run real-world benchmark workloads?

Standardized scenarios, reproducible scoring, and model-vs-model analysis for agent reliability, policy reasoning, and recovery behavior.

Tracked Models

Top Avg Score

Fastest Avg Latency

Best Cost Efficiency

Benchmark Catalog

Werewolf Social Deduction: hidden-role inference, confidence calibration, and strategic vote selection.
Political Assessment: multi-stakeholder impact analysis and recommendation memo quality.
Tool Failure Recovery: recovery planning after repeated tool failures and malformed responses.
Long-Context Drift: consistency retention over long multi-turn transcripts.
Spec Compliance: strict instruction adherence and guardrail handling under ambiguity.

Quality Index: weighted scenario rubric score, normalized to a 100-point scale.
Latency Efficiency: lower median response latency ranks higher.
Cost Efficiency: lower effective cost per scenario ranks higher.
Readiness Bands: Ready (90+), Caution (85-89.9), Not Ready (<85).
Reproducibility: fixed prompt templates, consistent rubrics, and deterministic seeded demo data.
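The scoring rules above can be sketched in a few lines of Python. This is a minimal illustration, not the platform's actual implementation: the function names, the rubric criteria, the weights, and the 5-point rubric maximum are all assumptions made for the example. The Caution band is also treated as everything from 85 up to (but not including) 90, on the assumption that displayed scores are rounded to one decimal.

```python
from statistics import median


def quality_index(rubric_scores, weights, max_score=5.0):
    """Weighted rubric score normalized to a 100-point scale.

    rubric_scores and weights are dicts keyed by criterion name.
    max_score is the top of the rubric scale (assumed to be 5 here).
    """
    total_weight = sum(weights[name] for name in rubric_scores)
    if total_weight <= 0:
        raise ValueError("weights must sum to a positive number")
    weighted = sum(rubric_scores[name] * weights[name] for name in rubric_scores)
    return 100.0 * weighted / (total_weight * max_score)


def readiness_band(score):
    """Map a Quality Index to a readiness band."""
    if score >= 90:
        return "Ready"
    if score >= 85:
        return "Caution"
    return "Not Ready"


def rank_by_latency(latencies_ms):
    """Order model names by median latency, lowest (best) first.

    latencies_ms maps a model name to a list of per-run latencies.
    """
    return sorted(latencies_ms, key=lambda name: median(latencies_ms[name]))


# Hypothetical rubric: two criteria scored 0-5, accuracy weighted 3x.
score = quality_index(
    {"accuracy": 5, "calibration": 4},
    {"accuracy": 3, "calibration": 1},
)
print(score, readiness_band(score))
```

Cost Efficiency could be ranked the same way as latency, sorting by effective cost per scenario instead of median latency.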

Select up to 3 models and inspect quality, speed, cost, and benchmark readiness.

Shareable compare URLs


Model	Provider	Scenario	Score	Latency (ms)	Cost (USD)	Date	Compare

2026-03-14: Added deep compare lab with readiness matrix and matchup insight mode.
2026-03-14: Added expanded scenario dataset including long-context and spec-compliance runs.
2026-03-14: Introduced dark technical UI theme optimized for dense benchmark scanning.

Are these live production results? No. The current dataset is seeded demo data designed to exercise the ranking and compare workflows.
Can I run benchmarks locally? Yes. Use /examples/run_benchmark.py with scenario files in /benchmarks.
How should I compare models? Use 2-model matchup mode for direct tradeoffs, or 3-model mode for broader portfolio selection.
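As a rough illustration of what a 2-model matchup might compute, the sketch below picks a per-metric winner between two models. The `ModelSummary` type, the `matchup` function, the model names, and all numbers are hypothetical and exist only for this example. It assumes higher score is better and lower latency and cost are better, as in the ranking rules above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ModelSummary:
    """Aggregated results for one model (hypothetical shape)."""
    name: str
    score: float
    latency_ms: float
    cost_usd: float


def matchup(a, b):
    """Return the winning model name per metric, or "tie" on equal values."""

    def winner(x, y, higher_is_better):
        if x == y:
            return "tie"
        a_wins = x > y if higher_is_better else x < y
        return a.name if a_wins else b.name

    return {
        "score": winner(a.score, b.score, higher_is_better=True),
        "latency_ms": winner(a.latency_ms, b.latency_ms, higher_is_better=False),
        "cost_usd": winner(a.cost_usd, b.cost_usd, higher_is_better=False),
    }


# Made-up values, used only to show the tradeoff view.
model_a = ModelSummary("model-a", score=92.0, latency_ms=800.0, cost_usd=0.04)
model_b = ModelSummary("model-b", score=88.5, latency_ms=450.0, cost_usd=0.04)
print(matchup(model_a, model_b))
```

A 3-model comparison could apply the same per-metric logic across each pair, or simply sort all three models on each metric.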