The Problem
Current AI benchmarks measure the wrong thing.
MMLU, HumanEval, SWE-bench — these evaluate a model's capability at a single step in isolation. That is a useful proxy when you are choosing a model for a chatbot. It is a poor proxy when you are building an agent that needs to complete a 40-step research task, orchestrate a multi-file refactor, or run a document synthesis pipeline without falling apart halfway through.
The practical reality of production agentic systems: a model that succeeds 95% of the time on a single step may be at 20% end-to-end success by step 30, because errors compound from one step to the next. The degradation curve is not visible in any standard benchmark. It is only visible in production, which is an expensive place to discover it.
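One way to see where a drop of that size can come from: if every step succeeded independently with the same probability (a simplifying assumption for illustration, not a claim about how any real model behaves), end-to-end success after k steps would be that probability raised to the power k. A minimal sketch of the arithmetic:

```python
# Illustrative arithmetic only: assumes every step succeeds independently
# with the same probability, which real agents need not satisfy.

def end_to_end_success(per_step_success: float, steps: int) -> float:
    """Probability that all `steps` steps succeed under independence."""
    return per_step_success ** steps


if __name__ == "__main__":
    for k in (1, 5, 10, 20, 30, 50):
        print(f"k={k:>3}  success={end_to_end_success(0.95, k):.3f}")
```

Under that assumption, 95% per step lands at roughly 21% by step 30. Real decay curves can be better or worse than this idealized one, which is why they have to be measured rather than derived.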
HorizonBench makes the curve visible before you ship.
What It Measures
HorizonBench evaluates AI agent reliability over long-horizon task sequences, not peak capability at a single step.
Four benchmark task families:
- Multi-file refactoring — coordinate changes across a codebase while maintaining consistency
- Data pipeline execution — run sequential data transformations without state corruption
- Document research synthesis — gather, reconcile, and summarize information across sources
- Constraint-based scheduling — satisfy interacting constraints across a growing problem space
Each family runs at increasing step counts — 5, 10, 20, 30, 50, 100+ — and success rates are recorded at each level. The output is a Reliability Decay Curve (RDC) for each model: a direct measurement of how quickly performance collapses as task length grows.
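As a rough sketch of how per-level success rates could be turned into a curve, the snippet below aggregates raw trial outcomes by step count. The `(step_count, succeeded)` record format and the function name are assumptions made for this example; HorizonBench's actual result schema is not described here.

```python
from collections import defaultdict

# Hypothetical record format: (step_count, succeeded). This is an
# assumption for the example, not HorizonBench's result schema.
Trial = tuple[int, bool]


def reliability_decay_curve(trials: list[Trial]) -> dict[int, float]:
    """Return {step_count: success_rate}, ordered by step count."""
    totals: dict[int, int] = defaultdict(int)
    wins: dict[int, int] = defaultdict(int)
    for steps, succeeded in trials:
        totals[steps] += 1
        wins[steps] += int(succeeded)
    return {k: wins[k] / totals[k] for k in sorted(totals)}


# Made-up trials for illustration, not measured results.
example_trials: list[Trial] = [
    (5, True), (5, True), (5, True), (5, True),
    (10, True), (10, True), (10, True), (10, False),
    (20, True), (20, False), (20, False), (20, False),
]
example_curve = reliability_decay_curve(example_trials)
print(example_curve)  # {5: 1.0, 10: 0.75, 20: 0.25}
```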
The Three Metrics That Matter
RDC: Reliability Decay Curve. The full performance profile from k=5 to k=100+, where k is the number of steps in the task. Not a single number but a shape: some models cliff at step 20, others degrade gracefully to step 80. The shape matters more than any point on it.
MOP: Meltdown Onset Point. The step count at which a model's success rate first drops below 50%. If your production agent regularly runs 30-step tasks, you want to know which models are still reliable at step 30 and which have already collapsed.
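Given a curve expressed as a mapping from step count to success rate, the onset point can be read off directly. The sketch below assumes the MOP is reported at the granularity of the measured step counts, with no interpolation between them; the function name and the example values are hypothetical.

```python
from typing import Optional


def meltdown_onset_point(
    curve: dict[int, float], threshold: float = 0.5
) -> Optional[int]:
    """Smallest measured step count whose success rate is below `threshold`.

    Returns None if the success rate never drops below the threshold
    within the measured range.
    """
    for steps in sorted(curve):
        if curve[steps] < threshold:
            return steps
    return None


# Made-up curve for illustration, not a measured result.
sample_curve = {5: 0.96, 10: 0.90, 20: 0.70, 30: 0.45, 50: 0.20}
print(meltdown_onset_point(sample_curve))  # 30
```

A model whose curve never crosses the threshold returns `None`, which is worth treating as "no meltdown observed in the tested range" and not as "no meltdown at all".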
GDS: Graceful Degradation Score. Measures whether a model fails hard or degrades gracefully as tasks get longer. A model that fails outright at step 25 is worse than one that still partially succeeds at step 40, especially in pipelines where partial output has value.
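The exact GDS formula is not given here, so the following is only one possible formulation, offered to make the idea concrete. It assumes each run yields a partial-credit value between 0 and 1 (for example, the fraction of steps completed correctly) and averages that credit over the runs that did not fully succeed. Both the scoring rule and the names are hypothetical.

```python
def graceful_degradation_score(
    partial_credits: list[float], pass_mark: float = 1.0
) -> float:
    """Mean partial credit among runs that did not fully succeed.

    Hypothetical formulation, not HorizonBench's definition.
    `partial_credits` holds one value in [0, 1] per run. Runs at or
    above `pass_mark` count as full successes and are excluded.
    Returns 1.0 when there are no failed runs.
    """
    failed = [c for c in partial_credits if c < pass_mark]
    if not failed:
        return 1.0
    return sum(failed) / len(failed)


# Made-up runs: a hard-failing model versus a gracefully degrading one.
print(graceful_degradation_score([1.0, 0.0, 0.0]))   # 0.0
print(graceful_degradation_score([1.0, 0.5, 0.75]))  # 0.625
```

Under this formulation, a model whose failures still deliver most of the work scores close to 1, while a model whose failures produce nothing usable scores close to 0.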
"The question is not which model is smartest. The question is which model stays reliable longest."
Architecture
HorizonBench is built on LiteLLM, which means it runs against any compatible model — Anthropic, OpenAI, Google, or local models via Ollama. The evaluation harness is written in Python and distributed as a CLI tool via uv.
The evaluation loop is deliberately simple. Each task family produces a task instance at a given step count. The model completes the task. The result is scored against ground truth. Success rates are aggregated across iterations. The decay curve is computed.
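The loop described above can be sketched in a few lines. This is an illustration under stated assumptions, not HorizonBench's actual harness: the toy task family, the `Task` and `Model` types, and every function name are hypothetical, and the model is a local stub so the example runs without API keys. In a real harness the callable would wrap a model call.

```python
import random
from typing import Callable

# Hypothetical types: a task is a (prompt, expected_answer) pair and a
# model is any callable from prompt to answer.
Task = tuple[str, str]
Model = Callable[[str], str]


def make_task(steps: int, rng: random.Random) -> Task:
    """Toy task family: apply `steps` sequential additions to a counter."""
    increments = [rng.randint(1, 9) for _ in range(steps)]
    prompt = "Start at 0. Add in order: " + ", ".join(map(str, increments))
    return prompt, str(sum(increments))


def evaluate(
    model: Model, step_counts: list[int], iterations: int, seed: int = 0
) -> dict[int, float]:
    """Generate, run, score, aggregate: returns {step_count: success_rate}."""
    rng = random.Random(seed)
    curve: dict[int, float] = {}
    for steps in step_counts:
        successes = 0
        for _ in range(iterations):
            prompt, expected = make_task(steps, rng)  # task instance
            answer = model(prompt)                    # model completes it
            successes += int(answer.strip() == expected)  # score vs truth
        curve[steps] = successes / iterations         # aggregate
    return curve


def perfect_stub(prompt: str) -> str:
    """Stand-in model that solves the toy task exactly."""
    numbers = prompt.split(": ", 1)[1].split(", ")
    return str(sum(int(n) for n in numbers))


print(evaluate(perfect_stub, [5, 10, 20], iterations=3))
```

Keeping the model behind a plain callable is what lets the same loop run against any provider: only the callable changes, while task generation and scoring stay fixed.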
# Run all task families against Claude Sonnet and GPT-4o
horizonbench run --models claude-sonnet-4-5,gpt-4o --steps 5,10,20,50
# Run a specific family with more iterations for statistical confidence
horizonbench run --families refactor --iterations 20
# Export results as an interactive leaderboard
horizonbench export --format html
The interactive leaderboard output lets you compare models side-by-side on their full decay curves — not just a single headline number. This is the output that actually informs model selection decisions.
Why This Exists
Every production agentic system I have built has run into the same problem: the model that benchmarks best is not always the model that performs best at the step counts your system operates at. This tool exists because I needed it and nothing like it existed.
Model selection is one of the highest-leverage choices you make when building a production agent. A model that scores 5 percentage points better on a standard benchmark might be 30 percentage points worse at the task lengths your system actually runs. HorizonBench gives you the data to make that decision correctly.
Installation
Install free with a single command. No account, no signup, no license required.
After installation, run the interactive setup to configure your API keys:
horizonbench setup
HorizonBench supports API keys for Anthropic, OpenAI, and Google. You only need keys for the models you want to evaluate. Results are generated and stored locally; apart from the requests sent to the model providers you configure, no data leaves your machine.