Why benchmark scores are not production metrics

Every major LLM ships with benchmark scores. MMLU, HellaSwag, HumanEval, MATH. These numbers are real — they measure something. The question is what they measure and whether that thing is the thing you care about.

Benchmark scores measure a model's capability at a specific task, administered under specific conditions, at a specific point in time, on a specific test set. They do not measure how that model performs on your data, for your users, under your latency constraints, with your prompt templates, at production scale.

The gap between benchmark performance and production performance is where most AI products live or die.

This post is about building the measurement infrastructure that closes that gap.

The four things that actually matter in production

1. Task completion rate. For a given class of user requests, what fraction does the system complete correctly? This is the primary metric. Everything else is diagnostic.

Task completion rate requires defining what "correct" means for your use case — which is harder than it sounds and worth investing time in before you write a single eval.
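Once "correct" is defined, the metric itself is simple to compute. Here is a minimal sketch, assuming each evaluated request already carries a request class and a pass/fail verdict. The `EvalRecord` type and the class labels are hypothetical names for illustration.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class EvalRecord:
    request_class: str  # hypothetical label, e.g. "refund" or "order_status"
    correct: bool       # verdict under your own definition of "correct"


def completion_rate_by_class(records: list[EvalRecord]) -> dict[str, float]:
    """Return the fraction of correctly completed requests per request class."""
    totals: dict[str, int] = defaultdict(int)
    passed: dict[str, int] = defaultdict(int)
    for record in records:
        totals[record.request_class] += 1
        if record.correct:
            passed[record.request_class] += 1
    return {cls: passed[cls] / totals[cls] for cls in totals}


records = [
    EvalRecord("refund", True),
    EvalRecord("refund", False),
    EvalRecord("order_status", True),
    EvalRecord("order_status", True),
]
rates = completion_rate_by_class(records)
```

Reporting the rate per class rather than as one global number keeps a weak class from hiding behind a strong one.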

2. Latency distribution. Not average latency — the full distribution, specifically p95 and p99. A system with average latency of 800ms but p99 latency of 12 seconds has a problem that the average conceals. Users who hit the p99 case do not know about the average; they only experience the 12 seconds.
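To make the point concrete, here is a small sketch using Python's standard `statistics` module on a synthetic sample. The sample values are invented for illustration, not measurements from a real system.

```python
import statistics


def latency_summary(latencies_ms: list[float]) -> dict[str, float]:
    """Summarise a latency sample by its mean and tail percentiles."""
    # quantiles(n=100) returns the 99 cut points p1..p99.
    cuts = statistics.quantiles(latencies_ms, n=100, method="inclusive")
    return {
        "mean": statistics.fmean(latencies_ms),
        "p50": cuts[49],
        "p95": cuts[94],
        "p99": cuts[98],
    }


# Synthetic sample: 97 fast requests and 3 very slow ones.
sample = [500.0] * 97 + [12_000.0] * 3
summary = latency_summary(sample)
```

On this sample the mean is 845 ms and p95 is 500 ms, while p99 is 12,000 ms. Only the p99 figure reveals the slow tail.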

3. Failure mode distribution. When the system fails, how does it fail? There is a significant difference between:

A refusal ("I cannot help with that") — the system knows it cannot complete the task
A hallucination — the system completes the task incorrectly with false confidence
A timeout — the system does not respond

These require different interventions. Conflating them into a single "failure rate" makes diagnosis impossible.
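One way to keep the modes separate is to record an explicit outcome per request and report the breakdown among failures. This is a sketch only: the `Outcome` enum is a hypothetical name, and how each request gets its label (detecting a hallucination in particular) is assumed to happen upstream.

```python
from collections import Counter
from enum import Enum


class Outcome(Enum):
    SUCCESS = "success"
    REFUSAL = "refusal"
    HALLUCINATION = "hallucination"
    TIMEOUT = "timeout"


def failure_distribution(outcomes: list[Outcome]) -> dict[Outcome, float]:
    """Return each failure mode's share of the failed requests only."""
    failures = [o for o in outcomes if o is not Outcome.SUCCESS]
    if not failures:
        return {}
    counts = Counter(failures)
    return {mode: n / len(failures) for mode, n in counts.items()}


outcomes = (
    [Outcome.SUCCESS] * 90
    + [Outcome.REFUSAL] * 5
    + [Outcome.HALLUCINATION] * 4
    + [Outcome.TIMEOUT] * 1
)
dist = failure_distribution(outcomes)
```

A single "10% failure rate" would hide that half of these failures are refusals, which usually call for a different fix than hallucinations or timeouts.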

4. Quality over time. LLM system quality is not static. Prompt changes, model updates, user population shifts, and knowledge base staleness all cause quality to drift. A system that was high-quality at launch may be mediocre six months later. The only way to know is continuous measurement.

Building an evaluation pipeline

An evaluation pipeline is not a one-time benchmark run. It is a continuous system that measures production quality at regular intervals and alerts when quality drops.

The components:

A reference set of representative inputs and expected outputs. These should come from real production data — actual user queries, not synthetic ones you invented in development. The distribution of real user queries is always different from what you imagine it to be.

An automated scorer. For many LLM tasks, LLM-as-judge works well: run a second model call that evaluates the quality of the first call's output against the expected output and a rubric. This is not perfect: LLM judges have their own biases, such as a tendency to favor longer answers. But it scales in a way that human evaluation cannot.

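from dataclasses import dataclass


@dataclass
class JudgementResult:
    # Field names are illustrative assumptions, not a fixed schema.
    score: float     # numeric grade parsed from the judge's reply
    rationale: str   # the judge's explanation, kept for human review


class LLMJudge:
    def __init__(self, model):
        # Any client object exposing complete(prompt) -> str.
        self.model = model

    def score(self, query: str, expected: str, actual: str, rubric: str) -> JudgementResult:
        # Ask a second model to grade `actual` against `expected` under the rubric.
        # The two private helpers are left to the implementation.
        prompt = self._build_judge_prompt(query, expected, actual, rubric)
        response = self.model.complete(prompt)
        return self._parse_score(response)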
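The class leaves `_build_judge_prompt` and `_parse_score` undefined. Here is one possible shape for them, written as standalone functions. The prompt wording, the 1 to 5 scale, and the `SCORE:` line convention are all assumptions made for this sketch. The parser returns a plain `(score, rationale)` tuple that you would wrap in your own result type.

```python
import re


def build_judge_prompt(query: str, expected: str, actual: str, rubric: str) -> str:
    """Assemble a grading prompt that asks for a machine-readable final line."""
    return (
        "You are grading an assistant's answer.\n\n"
        f"Rubric:\n{rubric}\n\n"
        f"User query:\n{query}\n\n"
        f"Reference answer:\n{expected}\n\n"
        f"Answer to grade:\n{actual}\n\n"
        "Explain your reasoning briefly, then finish with a line of the form\n"
        "SCORE: <integer from 1 to 5>"
    )


# Matches a line that contains only "SCORE: n" with n between 1 and 5.
_SCORE_RE = re.compile(r"^SCORE:\s*([1-5])\s*$", re.MULTILINE)


def parse_score(response: str) -> tuple[int, str]:
    """Extract the numeric score; fail loudly if the judge ignored the format."""
    matches = _SCORE_RE.findall(response)
    if not matches:
        raise ValueError("judge response contained no SCORE line")
    # Use the last SCORE line in case the judge echoes the format earlier.
    return int(matches[-1]), response.strip()


prompt = build_judge_prompt("What is 2 + 2?", "4", "It is 4.", "Exact match.")
score, rationale = parse_score("The answer matches the reference.\nSCORE: 5")
```

Raising on an unparseable reply, rather than defaulting to some score, keeps format failures from silently skewing the metric.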

Human calibration. A sample of automated scores should be reviewed by humans regularly. Not all of them — just enough to verify that the automated scores are tracking human judgement. When they diverge significantly, investigate the divergence before trusting either.
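A simple way to check whether automated scores track human judgement is to compare the two on the reviewed sample. This sketch assumes both sets of scores use the same numeric scale. The function name and the tolerance of one point are illustrative choices.

```python
from statistics import fmean


def calibration_report(
    judge: list[float], human: list[float], tolerance: float = 1.0
) -> dict[str, float]:
    """Compare judge scores with human scores for the same items."""
    if len(judge) != len(human) or not judge:
        raise ValueError("need two non-empty score lists of equal length")
    diffs = [j - h for j, h in zip(judge, human)]
    return {
        # Average size of the disagreement, ignoring direction.
        "mean_abs_diff": fmean([abs(d) for d in diffs]),
        # Positive bias means the judge scores higher than humans do.
        "bias": fmean(diffs),
        # Share of items where judge and human agree within the tolerance.
        "within_tolerance": sum(abs(d) <= tolerance for d in diffs) / len(diffs),
    }


judge_scores = [4, 5, 3, 5, 2]
human_scores = [4, 4, 3, 2, 2]
report = calibration_report(judge_scores, human_scores)
```

A persistent positive bias suggests the judge is lenient. A low `within_tolerance` share with little bias suggests the judge and the humans are reading the rubric differently.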

A scheduled job. Run the evaluation pipeline on a cadence that makes sense for your system. For high-velocity products, nightly. For stable products, weekly. Alert when metrics drop below threshold.
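The alerting step of such a job can be very small. Here is a sketch under assumptions: the metric names and thresholds are invented, and the scheduling itself (cron, a CI pipeline, a workflow engine) and the alert delivery are left out.

```python
from dataclasses import dataclass


@dataclass
class MetricCheck:
    name: str
    value: float      # latest measured value
    threshold: float  # alert when value falls below this


def find_regressions(checks: list[MetricCheck]) -> list[str]:
    """Return one human-readable alert per metric below its threshold."""
    return [
        f"{c.name}: {c.value:.3f} is below threshold {c.threshold:.3f}"
        for c in checks
        if c.value < c.threshold
    ]


alerts = find_regressions(
    [
        MetricCheck("task_completion_rate", value=0.87, threshold=0.90),
        MetricCheck("mean_judge_score", value=4.2, threshold=4.0),
    ]
)
```

Here only `task_completion_rate` produces an alert. The job would hand the resulting list to whatever paging or chat integration the team already uses.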

Measuring qualitative properties

Some LLM system properties resist simple automated measurement. Character consistency, brand tone, reasoning quality, appropriate uncertainty expression — these are real quality dimensions that matter to users, but they are harder to operationalize than "was the answer factually correct."

The practical approach is a structured scoring rubric that breaks the qualitative property into concrete sub-questions an LLM judge can evaluate:

Character consistency (example rubric for a character AI product):

Does the response use the character's established vocabulary? (0-3)
Does the response reflect the character's known opinions and values? (0-3)
Does the response maintain the character's emotional register? (0-3)
Does the response avoid breaking character on adversarial prompts? (0-3)

Total score out of 12. Track over time. Set an alert threshold.

This is not perfect measurement — it is structured approximation. Structured approximation at scale beats unsystematic human review.
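The rubric above translates directly into a small scoring function. In this sketch the four item keys are hypothetical names for the four questions, and the per-item scores are assumed to come from an LLM judge.

```python
# Hypothetical keys, one per rubric question above.
RUBRIC_ITEMS = (
    "vocabulary",
    "opinions_and_values",
    "emotional_register",
    "stays_in_character",
)
MAX_PER_ITEM = 3


def total_rubric_score(item_scores: dict[str, int]) -> int:
    """Validate per-item scores (0-3 each) and return the total out of 12."""
    missing = [item for item in RUBRIC_ITEMS if item not in item_scores]
    if missing:
        raise ValueError(f"missing rubric items: {missing}")
    for item in RUBRIC_ITEMS:
        if not 0 <= item_scores[item] <= MAX_PER_ITEM:
            raise ValueError(f"{item} must be between 0 and {MAX_PER_ITEM}")
    return sum(item_scores[item] for item in RUBRIC_ITEMS)


total = total_rubric_score(
    {
        "vocabulary": 3,
        "opinions_and_values": 2,
        "emotional_register": 3,
        "stays_in_character": 1,
    }
)
```

Keeping the per-item scores alongside the total is worthwhile. A drop from 11 to 9 means something different when it comes from vocabulary than when it comes from breaking character.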

Latency instrumentation that actually helps

Standard APM tools (Datadog, New Relic) instrument server latency well. Out of the box, they do not break latency down along the LLM-specific dimensions that diagnosis needs.

The metrics that matter for LLM systems:

Time to first token (TTFT). For streaming responses, users begin reading after the first token. TTFT is what determines whether the system feels responsive, regardless of total generation time.

Tokens per second. Once streaming begins, the generation speed determines the reading experience. Slow generation on long responses is noticeable.
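Both numbers can be captured by wrapping the token stream. In this sketch, `measure_stream` and `StreamTiming` are hypothetical names, and the stream is assumed to be a lazy iterable that sends the request when iteration begins. The fake clock exists only to make the demo deterministic.

```python
import time
from dataclasses import dataclass
from typing import Callable, Iterable


@dataclass
class StreamTiming:
    ttft_s: float        # time from start of consumption to the first token
    tokens: int          # number of tokens (or chunks) received
    tokens_per_s: float  # generation speed after the first token


def measure_stream(
    tokens: Iterable[str],
    clock: Callable[[], float] = time.perf_counter,
) -> StreamTiming:
    """Consume a token stream and record TTFT and generation speed."""
    start = clock()
    first_at = last_at = None
    count = 0
    for _ in tokens:
        last_at = clock()
        if first_at is None:
            first_at = last_at
        count += 1
    if first_at is None:
        raise ValueError("stream produced no tokens")
    elapsed = last_at - first_at
    # Speed is measured over the tokens that arrived after the first one.
    rate = (count - 1) / elapsed if elapsed > 0 else 0.0
    return StreamTiming(ttft_s=first_at - start, tokens=count, tokens_per_s=rate)


# Deterministic demo with a fake clock; in production, pass the real stream
# and leave `clock` at its default.
ticks = iter([0.0, 0.5, 0.6, 0.7])
timing = measure_stream(["Hello", ",", " world"], clock=lambda: next(ticks))
```

Separating TTFT from generation speed matters because they tend to have different causes: long inputs and queueing mostly delay the first token, while slow decoding shows up as a low token rate.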

Latency by prompt template. If you use multiple prompt templates (different system prompts for different use cases), instrument latency separately for each. Template changes that affect latency become visible.

Latency by input length. LLM latency grows with context length, and longer inputs in particular delay the first token. Instrument the distribution of input lengths alongside latency to identify whether latency spikes correlate with long inputs.
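The template and input-length breakdowns can share one grouping step. This sketch assumes each logged request is a `(template, input_tokens, latency_ms)` tuple. The bucket edges and template names are arbitrary illustrative values.

```python
from collections import defaultdict
from statistics import median


def bucket_for(input_tokens: int, edges: tuple[int, ...] = (1_000, 4_000, 16_000)) -> str:
    """Map an input length to a coarse bucket label such as '<4000'."""
    for edge in edges:
        if input_tokens < edge:
            return f"<{edge}"
    return f">={edges[-1]}"


def median_latency_by_group(
    requests: list[tuple[str, int, float]],
) -> dict[tuple[str, str], float]:
    """Median latency per (template, input-length bucket) pair."""
    groups: dict[tuple[str, str], list[float]] = defaultdict(list)
    for template, input_tokens, latency_ms in requests:
        groups[(template, bucket_for(input_tokens))].append(latency_ms)
    return {key: median(values) for key, values in groups.items()}


requests = [
    ("support", 500, 700.0),
    ("support", 800, 900.0),
    ("support", 9_000, 3_000.0),
    ("summarize", 20_000, 6_000.0),
]
by_group = median_latency_by_group(requests)
```

With real traffic you would report tail percentiles per group as well as the median. The median is used here only to keep the example short.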

Cache hit rate. If you use prompt caching (Anthropic's cache_control, for example), instrument what fraction of requests are hitting cache and what the latency difference is between cache hits and misses.
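Given a per-request flag for whether the cache was hit, the report is a few lines. This sketch assumes you derive that flag from the usage metadata your provider returns. The `(cache_hit, latency_ms)` tuple shape is a hypothetical log format.

```python
from statistics import fmean


def cache_report(requests: list[tuple[bool, float]]) -> dict[str, float]:
    """Hit rate plus mean latency for cache hits and cache misses."""
    if not requests:
        raise ValueError("no requests to summarise")
    hits = [ms for hit, ms in requests if hit]
    misses = [ms for hit, ms in requests if not hit]
    report = {"hit_rate": len(hits) / len(requests)}
    if hits:
        report["hit_mean_ms"] = fmean(hits)
    if misses:
        report["miss_mean_ms"] = fmean(misses)
    if hits and misses:
        # How much latency a cache hit saves on average.
        report["saving_ms"] = report["miss_mean_ms"] - report["hit_mean_ms"]
    return report


report = cache_report(
    [(True, 300.0), (True, 500.0), (False, 1200.0), (False, 1000.0)]
)
```

A falling hit rate after a prompt change is a useful early warning, since an edit near the start of a cached prefix can invalidate the cache for every request.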

The production feedback loop

The most valuable signal in LLM system quality does not come from automated evaluation — it comes from user behavior.

Users who get bad answers do things that are measurable: they rephrase and retry, they copy the answer and immediately search for it elsewhere, they report the message, they abandon the session. These behavioral signals are noisy but they are measuring the thing that actually matters: whether the system is useful to real users.

Wire behavioral signals into your quality dashboard alongside automated eval scores. When they diverge — when automated scores are high but users are abandoning — investigate. The behavioral signals are usually right.
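The divergence check can be sketched as follows. The `SessionSummary` fields, the choice of three signals, and both thresholds are assumptions made for illustration, and the automated score is assumed to be normalised to the range 0 to 1.

```python
from dataclasses import dataclass


@dataclass
class SessionSummary:
    retried: bool    # user rephrased and retried
    reported: bool   # user reported a message
    abandoned: bool  # user left without completing the task


def negative_signal_rate(sessions: list[SessionSummary]) -> float:
    """Fraction of sessions showing at least one negative behavioral signal."""
    if not sessions:
        raise ValueError("no sessions to summarise")
    bad = sum(s.retried or s.reported or s.abandoned for s in sessions)
    return bad / len(sessions)


def diverges(
    eval_score: float,
    signal_rate: float,
    min_eval: float = 0.9,
    max_signal: float = 0.2,
) -> bool:
    """True when automated scores look healthy but user behavior does not."""
    return eval_score >= min_eval and signal_rate > max_signal


sessions = [
    SessionSummary(retried=True, reported=False, abandoned=False),
    SessionSummary(retried=False, reported=False, abandoned=True),
    SessionSummary(retried=False, reported=False, abandoned=False),
    SessionSummary(retried=False, reported=False, abandoned=False),
]
rate = negative_signal_rate(sessions)
needs_investigation = diverges(eval_score=0.95, signal_rate=rate)
```

When `needs_investigation` is true, the sessions that produced the negative signals are good candidates for the reference set, since they show the kind of input the automated eval is currently missing.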

The cadence that works

The evaluation infrastructure I describe here needs to run often enough to be useful. A weekly batch job catches problems that have been affecting users for up to a week. A nightly job catches them overnight. A job that runs on every deployment catches them before users see them.

The right cadence depends on your deployment frequency and your tolerance for quality regression. For most teams I work with, nightly is the right default, with additional runs triggered by every significant prompt or configuration change.

The goal is to make quality regression visible before it causes user-facing harm. Once it has caused harm, you are reacting. The evaluation infrastructure is what lets you prevent it instead.

How to measure LLM quality in production (not just at benchmark time)