Tracked Models
--
Standardized scenarios, reproducible scoring, and model-vs-model analysis for agent reliability, policy reasoning, and recovery behavior.
Select up to 3 models and inspect quality, speed, cost, and benchmark readiness.
| Model | Provider | Scenario | Score | Latency (ms) | Cost (USD) | Date | Compare |
|---|---|---|---|---|---|---|---|
To add results, run /examples/run_benchmark.py with scenario files in /benchmarks.
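
The model comparison described above (up to 3 models, inspected on score, latency, and cost) could be sketched roughly as follows. This is only an illustration under stated assumptions: `BenchmarkResult`, `compare_models`, and `MAX_COMPARE` are hypothetical names that mirror the table columns, not the project's actual API, and the output format of /examples/run_benchmark.py is not known here.

```python
from dataclasses import dataclass
from statistics import mean

MAX_COMPARE = 3  # the page allows selecting up to 3 models


@dataclass(frozen=True)
class BenchmarkResult:
    """One row of the results table (field names are hypothetical)."""
    model: str
    provider: str
    scenario: str
    score: float
    latency_ms: float
    cost_usd: float
    date: str  # ISO 8601 date string


def compare_models(results, selected):
    """Return per-model mean score, latency, and cost for the selected models."""
    selected = list(dict.fromkeys(selected))  # drop duplicates, keep order
    if not 1 <= len(selected) <= MAX_COMPARE:
        raise ValueError(f"select between 1 and {MAX_COMPARE} models")
    summary = {}
    for name in selected:
        rows = [r for r in results if r.model == name]
        if not rows:
            continue  # no runs recorded for this model yet
        summary[name] = {
            "runs": len(rows),
            "mean_score": mean(r.score for r in rows),
            "mean_latency_ms": mean(r.latency_ms for r in rows),
            "mean_cost_usd": mean(r.cost_usd for r in rows),
        }
    return summary
```

Averaging across scenarios is just one possible aggregation; a per-scenario breakdown may be more appropriate when scenarios differ widely in difficulty.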