ML Observe
Image: Opsview 6 EA dashboard. Photo: Jjainschigg (CC BY-SA 4.0)

Closing the Eval-Prod Gap: Online Evaluation as Observability

Offline eval scores are green and production is worse. The gap is not a measurement error — it is structural. Here is how to instrument online evaluation so production quality becomes observable.

By Priya Anand · 8 min read

Every team running LLMs in production eventually hits the same wall: the offline evaluation suite is green, the dashboards are green, and the users are unhappy. This is usually not a bug in the eval harness. It is a structural property of how offline evaluation works, and what closes the gap is treating it as an ongoing observability problem rather than a one-time testing problem.

Why the gap is structural, not accidental

Offline eval measures a fixed dataset under a fixed prompt and model. In production, none of those is fixed. The input distribution is live and shifts; the prompt and model change under you (a managed-model provider can update the weights behind an endpoint without a version bump); retrieved context is dynamic; and users do things your eval set never imagined. An offline suite is, by construction, a snapshot of yesterday’s understanding of the problem. It answers “did we regress against cases we already knew about?” It cannot answer “is the system good on the traffic it is actually getting right now?”

There is also a Goodhart problem. Once an offline metric becomes the target a team optimizes against, it stops being a good measure of the underlying quality — the system gets better at the metric and not necessarily at the job. The defense is not a better single metric; it is measuring the live system continuously so the offline number is a regression gate, not the definition of quality.

Online evaluation: making production quality observable

Online evaluation means continuously scoring real production traffic and treating those scores as telemetry alongside latency and cost. The building blocks:

  • Trace every request with evaluation in mind. Capture inputs, retrieved context, the model+version actually used, the output, and any tool calls as a structured trace following the OpenTelemetry GenAI conventions. You cannot evaluate what you did not record, and post-hoc evaluation needs the full context, not just the final string.
  • Run automatic evaluators on live traffic. Cheap programmatic checks (schema validity, format, refusal/empty-response detection, grounding/citation checks for RAG) can run on every request. More expensive judgments (LLM-as-judge for helpfulness or faithfulness) run on a sampled subset. Emit the results as metrics with the same dimensions as the trace (model version, route, prompt version) so a quality drop is sliceable.
  • Capture implicit and explicit user signal. Thumbs, regenerations, edits, abandonment, follow-up rephrasings. These are noisy individually and powerful in aggregate, and they are the only signal that reflects what users actually experienced rather than what a proxy judge thinks.
  • Sample hard cases for human review. Route low-confidence, low-score, or thumbs-down traces into a labeling queue. Human-reviewed production failures are the highest-quality material for the next iteration of the offline suite — which is how online evaluation feeds back into closing the gap permanently.
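
The first two building blocks can be sketched roughly as follows. This is a minimal illustration, not a reference implementation: the Trace fields are hypothetical names (not the OpenTelemetry GenAI attribute names), the refusal markers are a naive placeholder, the JSON check assumes a route that is supposed to return JSON, and the 5% judge rate is an arbitrary default.

```python
import hashlib
import json
from dataclasses import dataclass


@dataclass
class Trace:
    # Hypothetical trace record; field names are illustrative only.
    trace_id: str
    route: str
    model_version: str
    prompt_version: str
    user_input: str
    retrieved_context: list[str]
    output: str


# Naive placeholder list; a real refusal detector would need tuning.
REFUSAL_MARKERS = ("i can't", "i cannot", "i'm sorry")


def cheap_checks(trace: Trace) -> dict[str, bool]:
    """Programmatic checks cheap enough to run on every request."""
    text = trace.output.strip()
    try:
        json.loads(text)
        valid_json = True
    except json.JSONDecodeError:
        valid_json = False
    return {
        "non_empty": bool(text),
        "valid_json": valid_json,  # assumes this route should emit JSON
        "refusal": text.lower().startswith(REFUSAL_MARKERS),
    }


def sampled_for_judge(trace_id: str, rate: float) -> bool:
    """Deterministic sampling: a given trace_id always gets the same decision."""
    digest = hashlib.sha256(trace_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") / 2**64  # value in [0, 1)
    return bucket < rate


def evaluate(trace: Trace, judge_rate: float = 0.05) -> dict:
    """Build one metric record that carries the same dimensions as the trace."""
    return {
        "trace_id": trace.trace_id,
        "route": trace.route,
        "model_version": trace.model_version,
        "prompt_version": trace.prompt_version,
        "checks": cheap_checks(trace),
        "needs_judge": sampled_for_judge(trace.trace_id, judge_rate),
    }
```

Hashing the trace ID instead of calling a random number generator keeps the sampling decision reproducible, so a trace that was sent to the judge once is selected again if the pipeline is replayed.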

Make LLM-as-judge a monitored component, not an oracle

LLM-as-judge is the workhorse of online evaluation and the easiest thing to over-trust. A judge is itself a model on a managed endpoint: it drifts when the provider updates it, it is biased (toward length, toward its own style), and it can be wrong in correlated ways. Treat it as an instrument that needs calibration:

  • Periodically score a human-labeled set with the judge and track judge-vs-human agreement over time. A drop in agreement is a judge regression, independent of any product change.
  • Pin the judge model+version explicitly, the same way you would pin a production dependency, and re-validate on upgrade.
  • Use the judge for trend detection over many requests, not as ground truth on any single one.
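
One way to track judge-vs-human agreement is raw agreement plus Cohen's kappa, which corrects for agreement expected by chance. The sketch below assumes categorical labels (for example "pass"/"fail") on a shared human-labeled set; the function names and the 0.1 drop tolerance are illustrative choices, not something the article prescribes.

```python
from collections import Counter


def agreement_rate(judge: list[str], human: list[str]) -> float:
    """Fraction of items where the judge label equals the human label."""
    if len(judge) != len(human) or not judge:
        raise ValueError("need two non-empty label lists of equal length")
    return sum(j == h for j, h in zip(judge, human)) / len(judge)


def cohens_kappa(judge: list[str], human: list[str]) -> float:
    """Chance-corrected agreement between judge and human labels."""
    n = len(judge)
    p_observed = agreement_rate(judge, human)
    judge_counts = Counter(judge)
    human_counts = Counter(human)
    # Expected agreement if both raters labeled independently at their own base rates.
    p_expected = sum(judge_counts[c] * human_counts[c] for c in judge_counts) / (n * n)
    if p_expected == 1.0:
        return 1.0  # both raters used one identical label throughout
    return (p_observed - p_expected) / (1 - p_expected)


def judge_regressed(baseline_kappa: float, current_kappa: float,
                    max_drop: float = 0.1) -> bool:
    """Flag a judge regression when kappa falls more than max_drop below baseline."""
    return (baseline_kappa - current_kappa) > max_drop
```

Recording the kappa value together with the pinned judge model+version on each calibration run makes it possible to attribute a drop to a judge upgrade instead of a product change.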

Wire it into alerting like any other SLI

The point of online evaluation is not a dashboard nobody opens. Define quality SLIs — grounding rate, judged-helpfulness rate, refusal rate, thumbs-down rate — with thresholds, and alert on them with the same seriousness as an error-rate page. Slice by model version and prompt version so the alert points at the change. The single most useful panel is offline-vs-online metric divergence over time: a persistent, unexplained gap between green eval and worse production is the signature that your offline suite has fallen behind reality, and it tells you exactly when to pull production failures back into the eval set.
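
A sliced SLI check along these lines might look like the sketch below. The record layout, the threshold values, and the minimum sample size are all assumptions made for illustration; in practice the evaluation would usually live in the alerting system itself, not in application code.

```python
from collections import defaultdict

# Hypothetical thresholds: ("min", x) alerts when the rate falls below x,
# ("max", x) alerts when the rate rises above x.
THRESHOLDS = {
    "grounded": ("min", 0.90),
    "thumbs_down": ("max", 0.05),
}


def sli_by_slice(records: list[dict], sli: str) -> dict:
    """Return {(model_version, prompt_version): (rate, sample_count)} for one SLI."""
    totals = defaultdict(int)
    hits = defaultdict(int)
    for r in records:
        key = (r["model_version"], r["prompt_version"])
        totals[key] += 1
        hits[key] += int(bool(r[sli]))
    return {key: (hits[key] / totals[key], totals[key]) for key in totals}


def find_breaches(records: list[dict], thresholds: dict = THRESHOLDS,
                  min_samples: int = 50) -> list[dict]:
    """List every slice whose SLI crosses its threshold.

    Slices with fewer than min_samples records are skipped so a handful of
    requests cannot trigger a page (an assumption of this sketch).
    """
    alerts = []
    for sli, (kind, limit) in thresholds.items():
        for key, (rate, n) in sli_by_slice(records, sli).items():
            if n < min_samples:
                continue
            breached = rate < limit if kind == "min" else rate > limit
            if breached:
                alerts.append({
                    "sli": sli,
                    "model_version": key[0],
                    "prompt_version": key[1],
                    "rate": rate,
                    "n": n,
                })
    return alerts
```

Because each alert carries the model version and prompt version of the offending slice, the page identifies which change to look at first.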

Offline evaluation asks whether you broke something you already understood. Online evaluation asks whether the live system is good on the traffic it is actually serving. Observability practice for LLMs is mostly the work of making the second question answerable continuously — and the eval-prod gap closes not when you find the perfect metric, but when production quality becomes something you watch instead of something you assume.

Sources

  1. OpenTelemetry GenAI semantic conventions
  2. Phoenix (Arize): LLM evaluation and online evals
  3. Goodhart's law (when a measure becomes a target)
  4. Evidently AI: LLM evaluation metrics and approaches