agentbenchmark.org
Agent Evaluation Platform

Can your agent actually run real-world benchmark workloads?

Standardized scenarios, reproducible scoring, and model-vs-model analysis for agent reliability, policy reasoning, and recovery behavior.

Tracked Models: --
Top Avg Score: --
Fastest Avg Latency: --
Best Cost Efficiency: --

Benchmark Catalog

  • Werewolf Social Deduction: hidden-role inference, confidence calibration, and strategic vote selection.
  • Political Assessment: multi-stakeholder impact analysis and recommendation memo quality.
  • Tool Failure Recovery: recovery planning after repeated tool failures and malformed responses.
  • Long-Context Drift: consistency retention over long multi-turn transcripts.
  • Spec Compliance: strict instruction adherence and guardrail handling under ambiguity.

Scoring Method

  • Quality Index: weighted scenario rubric score normalized to 100.
  • Latency Efficiency: lower median response latency ranks higher.
  • Cost Efficiency: lower effective cost per scenario ranks higher.
  • Readiness Bands: Ready (90+), Caution (85-89.9), Not Ready (<85).
  • Reproducibility: fixed prompt templates, consistent rubrics, and deterministic seeded demo data.
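
The scoring rules above can be sketched in a few lines of Python. This is a minimal illustration, not the platform's actual implementation: the function names, the assumption that rubric scores lie in the 0 to 1 range, and the treatment of every score from 85 up to (but not including) 90 as Caution are all hypothetical choices made for the example.

```python
from statistics import median


def quality_index(rubric_scores, weights):
    """Weighted mean of rubric scores, scaled to 0..100.

    Assumes (hypothetically) that each rubric score is in the range 0..1
    and that every criterion in `weights` has a score.
    """
    total_weight = sum(weights.values())
    if total_weight <= 0:
        raise ValueError("weights must sum to a positive number")
    weighted = sum(rubric_scores[name] * w for name, w in weights.items())
    return 100.0 * weighted / total_weight


def readiness_band(score):
    """Map a Quality Index to a readiness band.

    Thresholds follow the published bands; scores between 89.9 and 90
    are treated as Caution here, which is an assumption.
    """
    if score >= 90:
        return "Ready"
    if score >= 85:
        return "Caution"
    return "Not Ready"


def rank_by_median_latency(latencies_ms):
    """Return model names ordered from lowest to highest median latency."""
    return sorted(latencies_ms, key=lambda model: median(latencies_ms[model]))


if __name__ == "__main__":
    # Hypothetical rubric criteria and weights, for illustration only.
    score = quality_index(
        {"accuracy": 1.0, "calibration": 0.5},
        {"accuracy": 3, "calibration": 1},
    )
    print(score, readiness_band(score))
    print(rank_by_median_latency({"m1": [300, 100, 200], "m2": [150, 150, 900]}))
```

Cost Efficiency could be ranked the same way as latency by sorting on effective cost per scenario instead of median latency.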

Head-to-Head Compare Lab

Select up to 3 models and inspect quality, speed, cost, and benchmark readiness. Each comparison has a shareable URL.

Can It Run This Benchmark?

Model | Provider | Scenario | Score | Latency (ms) | Cost (USD) | Date | Compare

Release Notes Snapshot

  • 2026-03-14: Added deep compare lab with readiness matrix and matchup insight mode.
  • 2026-03-14: Added expanded scenario dataset including long-context and spec-compliance runs.
  • 2026-03-14: Introduced dark technical UI theme optimized for dense benchmark scanning.

FAQ

  • Are these live production results? No. The current dataset is seeded demo data designed to exercise the ranking and compare workflows.
  • Can I run benchmarks locally? Yes. Use /examples/run_benchmark.py with scenario files in /benchmarks.
  • How should I compare models? Use 2-model matchup mode for direct tradeoffs, or 3-model mode for broader portfolio selection.