There's a particular kind of confidence that precedes a painful lesson. A team ships an AI workflow, the demo goes beautifully, the eval suite scores 91%, and three weeks into production the business stakeholder sends a message that begins: "I've been looking at these outputs and something's off."

The post-mortem finds no crashed pipelines, no error logs, no anomalies in the monitoring dashboard. The system was working — in the sense that it was completing — the entire time. The outputs were plausible-looking. They just weren't correct. And they hadn't been correct for long enough that downstream decisions had already been made on top of them.

This is the production AI failure mode that matters most, and it's the one the mainstream conversation is least equipped to address. That conversation — eval frameworks, LLM-as-judge, benchmark suites, red-teaming pipelines — is technically sound and practically insufficient, because it's solving for measurement without first solving for what makes measurement meaningful. You cannot evaluate your way to reliability. You can only evaluate your way to knowing whether your other work is paying off.

The teams that figure this out tend to arrive at the same insight, usually after the first expensive production failure: evals are the last layer of a quality stack, not the first. What comes before them determines whether those evals tell you something true or just something legible.


The Quality Stack Dependency Nobody Talks About

Think of production AI reliability as a stack with a strict dependency chain:

  • Layer 4: Automated evals and testing
  • Layer 3: Human review (calibrated, rubric-based)
  • Layer 2: Observability and step-level tracing
  • Layer 1: Data contracts and input validation

Most teams build Layer 4 first and work downward only when things break. The problem is architectural: Layer 4 tells you whether something went wrong. It cannot tell you what went wrong or where unless Layers 1 through 3 are already in place. An eval suite sitting on top of an unobservable, schema-ambiguous pipeline is a measurement instrument without a reference point, producing numbers that aren't grounded in anything stable enough to act on.

The dependency failure is most visible in how teams define "correct." An output preference sounds like this: "We want the AI to give helpful, accurate answers." An output contract sounds like this: "For a customer support triage workflow, a correct output is a sentiment classification drawn from a defined enum, a summary under 80 words, and a suggested next action from the approved action taxonomy. Any output that deviates from this schema is a failure, regardless of how well-written it is." The first is a design aspiration. The second is something you can actually evaluate against.
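
To make the difference concrete, here is a minimal sketch of the triage contract above expressed as a check rather than a sentence. The three sentiment values, the action names, and the field names are hypothetical placeholders; only the three rules (enum, under 80 words, approved taxonomy) come from the contract as stated.

```python
from enum import Enum


class Sentiment(Enum):
    # Hypothetical enum members; the real set is whatever the contract defines.
    POSITIVE = "positive"
    NEUTRAL = "neutral"
    NEGATIVE = "negative"


# Hypothetical approved action taxonomy, for illustration only.
APPROVED_ACTIONS = ("refund", "escalate_to_agent", "send_help_article", "close_ticket")
MAX_SUMMARY_WORDS = 80
VALID_SENTIMENTS = tuple(s.value for s in Sentiment)


def contract_violations(output: dict) -> list[str]:
    """Return every way `output` breaks the triage contract (empty list means pass)."""
    violations = []
    if output.get("sentiment") not in VALID_SENTIMENTS:
        violations.append("sentiment not in enum")
    summary = output.get("summary")
    if not isinstance(summary, str) or not summary.strip():
        violations.append("summary missing or empty")
    elif len(summary.split()) >= MAX_SUMMARY_WORDS:
        # The contract says "under 80 words", so exactly 80 already fails.
        violations.append("summary not under 80 words")
    if output.get("next_action") not in APPROVED_ACTIONS:
        violations.append("next_action not in approved taxonomy")
    return violations
```

Note what the function does not do: it never asks whether the summary is well-written. That is the point of a contract. Quality judgments can be layered on top, but only once the structural definition of "correct" is fixed.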

Contracts make evals possible. Preferences make evals a matter of opinion — and therefore unreliable at exactly the moment you need them most.

Most organisations are running on preferences and calling them evals. Engineering defines "correct" as "no error thrown." Product defines it as "the output looks reasonable." The domain SME defines it as "this would pass peer review." These are three different standards operating in parallel. When an automated eval runs against implicit, unreconciled expectations, the resulting score is technically valid and practically meaningless — because nobody agreed on what it was testing.


The Silent Corruption Problem

The most dangerous production AI failures don't announce themselves. They don't crash your pipeline, trigger your alerts, or surface in your error logs. They complete successfully. The problem is that "completing successfully" and "producing a correct output" are not the same thing — and production AI systems can decouple these two properties in ways that take weeks to surface.

Consider a multi-step AI workflow: a document is ingested, classified, summarised, and routed. A schema change in an upstream data system shifts a field type. The classifier receives the malformed input, infers something plausible rather than throwing an error, and the summary and routing steps proceed — on top of a corrupted foundation. The pipeline logs show green. The outputs look reasonable. The business is routing work to the wrong queues. Nobody notices for a month.

A specialty insurer running an AI-assisted claims-triage workflow lived exactly this scenario. An LLM classifier read incoming first-notice-of-loss reports, extracted a severity tier from one to four, and routed accordingly: tier one and two to the standard adjuster queue, tier three to a specialist, tier four straight to a senior reviewer. The system had been live for seven weeks. The eval suite was reporting 94% accuracy against a held-out test set.

Two weeks in, an upstream field — incident_type — had quietly been migrated from a free-text string to a numeric enum. The classifier never threw. The model received "incident_type: 17" instead of "incident_type: collision_with_object" and inferred severity from the rest of the payload, which was usually enough to land in the right tier. Usually. Around 8% of claims that should have been tier three or tier four were being silently routed as tier two. The team caught it when a senior reviewer noticed three claims in a fortnight that had skipped his queue and shouldn't have. By that point the backlog of mis-triaged cases had been compounding for five weeks.

The eval suite never moved. It was still scoring 94% — because the held-out set was generated before the schema change, and the production input distribution had drifted underneath it without the eval being aware that drift had occurred. The output-level monitor saw a tier label and a routing decision; both were inside the valid enum. Nothing was technically wrong. Everything was operationally wrong.
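
A Layer 1 data contract is the kind of check that could have turned this silent drift into a loud failure on day one. The sketch below is illustrative, not a reconstruction of the insurer's system: the function name, the exception, and the assumption that incident_type is always a snake_case label are all hypothetical.

```python
import re

# Assumed convention: incident_type is a snake_case label such as
# "collision_with_object". A numeric enum like "17" does not match.
SNAKE_CASE_LABEL = re.compile(r"[a-z]+(_[a-z]+)*")


class InputContractError(ValueError):
    """Raised when a payload breaks the agreed input contract."""


def validate_fnol_payload(payload: dict) -> dict:
    """Reject a first-notice-of-loss payload whose incident_type has drifted."""
    incident_type = payload.get("incident_type")
    if not isinstance(incident_type, str):
        raise InputContractError(
            f"incident_type must be a string, got {type(incident_type).__name__}"
        )
    if not SNAKE_CASE_LABEL.fullmatch(incident_type):
        raise InputContractError(
            f"incident_type {incident_type!r} is not a recognised label"
        )
    return payload
```

Run before the classifier, a check like this fails on the first migrated record instead of letting the model guess around it.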

This is the failure mode that makes step-level observability a prerequisite for meaningful evals, not an optional enhancement. With output-level logging only, the insurer could see that a tier label was assigned. They could not see that step two (entity extraction from the malformed incident_type field) had quietly degraded from confident extraction to inference-from-context. The general rule follows: if you're logging at the output level only, you can detect that the final output was wrong, but you cannot attribute it, reproduce it, or distinguish between a systematic failure and a one-off anomaly. Without attribution, your eval findings are a symptom report with no diagnosis attached.

The Trace-Before-Eval approach prevents this: instrument every node in your workflow graph to emit a structured trace event — input state entering the node, tool calls and their raw responses, output state leaving the node, latency — before writing a single eval. Run the system in production for two to four weeks in observation-only mode. You now have a corpus of real production traces to build evals against. Your failure modes are empirical rather than hypothetical. Your eval suite reflects the actual distribution of production inputs, not the clean inputs from your demo environment.
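
One way to instrument a node, sketched under simplifying assumptions: each node is a plain function from state dict to state dict, and trace events go to an in-memory list standing in for a real trace sink. The names traced and TRACE_LOG are hypothetical, and tool-call capture is left out for brevity.

```python
import copy
import functools
import time

TRACE_LOG: list[dict] = []  # stand-in for a real trace sink (assumption)


def traced(node_name: str):
    """Wrap a workflow node so every call emits one structured trace event."""
    def decorator(fn):
        @functools.wraps(fn)
        def wrapper(state: dict, *, trace_id: str) -> dict:
            started = time.perf_counter()
            event = {
                "trace_id": trace_id,
                "node": node_name,
                # Snapshot the input so later mutation cannot rewrite history.
                "input_state": copy.deepcopy(state),
            }
            try:
                result = fn(state)
                event["output_state"] = copy.deepcopy(result)
                event["status"] = "ok"
                return result
            except Exception as exc:
                event["status"] = "error"
                event["error"] = repr(exc)
                raise
            finally:
                # Runs on success and on failure, so no call goes unrecorded.
                event["latency_ms"] = (time.perf_counter() - started) * 1000
                TRACE_LOG.append(event)
        return wrapper
    return decorator


@traced("classify")
def classify(state: dict) -> dict:
    # Placeholder for the real model call.
    return {**state, "severity_tier": 2}
```

Calling classify({"incident_type": "17"}, trace_id="t-1") appends one event holding the exact input the node saw. That record is what makes a degraded step attributable after the fact.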

In parallel, define structural sanity checks that run synchronously at each intermediate step — not quality evals, but corruption detectors. Does the output of step 3 conform to the expected schema? Does the classification score fall within the valid range? Is the entity list non-empty when the document clearly contains named entities? These checks are the circuit breakers that prevent a corrupted intermediate state from propagating silently through downstream steps. They're cheap, fast, and they catch the category of failure that output-level evals were never designed to see.
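
A rough sketch of what such circuit breakers might look like. The field names (severity_tier, confidence, entities) and the expects_entities flag are assumptions made for illustration; in practice the flag would come from whatever cheap heuristic tells you a document ought to contain entities.

```python
class CorruptedStateError(RuntimeError):
    """Raised to halt the workflow before a bad intermediate state propagates."""


def _is_number(value) -> bool:
    # bool is a subclass of int in Python, so exclude it explicitly.
    return isinstance(value, (int, float)) and not isinstance(value, bool)


def check_classification(output: dict) -> None:
    tier = output.get("severity_tier")
    if not isinstance(tier, int) or isinstance(tier, bool) or not 1 <= tier <= 4:
        raise CorruptedStateError(f"severity_tier outside 1-4: {tier!r}")
    score = output.get("confidence")
    if not _is_number(score) or not 0.0 <= score <= 1.0:
        raise CorruptedStateError(f"confidence outside [0, 1]: {score!r}")


def check_entities(output: dict) -> None:
    entities = output.get("entities")
    if not isinstance(entities, list):
        raise CorruptedStateError("entities must be a list")
    # Assumed flag, set upstream when the document should contain entities.
    if output.get("expects_entities") and not entities:
        raise CorruptedStateError("entity list empty for a document that should contain entities")


def run_step(step_fn, checks, state: dict) -> dict:
    """Run one workflow step, then every check, before handing state downstream."""
    output = step_fn(state)
    for check in checks:
        check(output)
    return output
```

None of these checks judges quality. They only refuse to pass along a state that cannot possibly be right, which is exactly the category the output-level eval misses.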


Why Human Eval Fails Quietly

Teams that recognise the limits of automated evals often turn to human review as the authoritative signal. This is the right instinct — human evaluation is the only ground truth that doesn't need to be validated against some other ground truth. The problem is that most teams stop at the decision to do human eval without solving what makes human eval reliable.

Ad hoc human review — "the PM looked at 50 outputs and they seemed fine" — has a specific failure mode: it produces a noisy signal that looks like a clean one. When you then calibrate automated evals against that signal, you're encoding the noise. Your LLM judge learns what the PM found acceptable on a particular day, under a particular framing, with the particular priors they brought to the task. The resulting eval score will feel authoritative. It will measure something real. Just not what you think.

The structural fix is a calibration sprint before any automated eval is built. Take two or three domain experts. Have them independently evaluate the same 50 outputs using a structured analytic rubric: not a holistic one ("rate this 1-5"), but a criterion-by-criterion breakdown that isolates distinct quality dimensions such as factual accuracy, format compliance, and appropriate scope. Measure agreement using a formal inter-rater statistic such as Cohen's kappa for two raters, or Fleiss' kappa for three or more. If your reviewers agree only slightly better than chance, the rubric needs refinement. If agreement is strong, you have a quality definition stable enough to automate against.
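
For two reviewers, Cohen's kappa is small enough to compute by hand: kappa = (p_o - p_e) / (1 - p_e), where p_o is the observed agreement rate and p_e is the agreement expected by chance given each reviewer's label frequencies. The sketch below is a plain-Python version for one rubric criterion at a time; the function name and the pass/fail labels are illustrative.

```python
from collections import Counter


def cohens_kappa(rater_a: list[str], rater_b: list[str]) -> float:
    """Cohen's kappa for two raters labelling the same items, in the same order."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("both raters must label the same non-empty set of items")
    n = len(rater_a)
    # p_o: fraction of items on which the two raters gave the same label.
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # p_e: chance agreement, from each rater's marginal label frequencies.
    counts_a, counts_b = Counter(rater_a), Counter(rater_b)
    expected = sum(counts_a[label] * counts_b[label] for label in counts_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used one identical label throughout
    return (observed - expected) / (1 - expected)
```

A value near 0 means the reviewers agree about as often as chance would predict, and a value near 1 means near-perfect agreement. Computing it per criterion rather than on an overall score shows which dimension of the rubric is causing the disagreement.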

What this sprint reliably surfaces is that the organisation hasn't actually agreed on quality. The domain expert and the product manager hold entirely different criteria — both reasonable, neither reconcilable without an explicit prioritisation conversation. That conversation is uncomfortable. It's also the one that makes everything downstream function. The eval suite can't have it for you. It just inherits whatever ambiguity you hand it.


The Lifecycle Inversion

There's a recognisable pattern in how most AI workflow projects proceed: prototype → tune → demo → deploy → scramble to retrofit monitoring after production problems emerge. This sequence feels natural because it follows the momentum of building. Each phase produces something visible — a working prototype, a refined model, an impressive demo, a live system — and organisations optimise for visible progress.

The correct order inverts this. Define quality contracts before building the AI component. Instrument for observability before going to production. Establish empirical baselines before measuring drift. Build and deploy into a system that can already tell you whether it's working.

This inversion feels slow because its outputs are invisible. A quality contract is a document. An instrumented pipeline with no AI in it yet is not a demo. A calibration sprint doesn't appear on a product roadmap. But every team that skips these steps and ships directly to production is running a science experiment on their users — collecting the data the quality stack would have provided pre-deployment, at higher cost, higher risk, and with the business already exposed to the consequences.

The pattern that operationalises this is a regression anchor set: 20 to 50 input/output pairs representing known-correct behaviour across your most critical use cases. Before any system change — prompt update, model version bump, tool update, upstream schema change — run the full workflow against this set. Any deviation from expected outputs triggers human review before the change is promoted. This is not a comprehensive eval suite. It's a regression safeguard, deliberately small because it needs to run fast enough to precede every deployment. The goal isn't coverage. It's catching the failure mode where a change that appears safe at the system level quietly breaks something important at the output level.
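
A minimal sketch of an anchor-set runner, assuming the workflow can be called as a function from input dict to output dict and that each anchor is a dict with id, input, and expected keys. Those names are hypothetical, as is the choice of exact equality for comparison.

```python
from typing import Callable


def run_anchor_set(
    workflow: Callable[[dict], dict], anchors: list[dict]
) -> list[dict]:
    """Return one deviation record per anchor whose output differs from expected."""
    deviations = []
    for anchor in anchors:
        actual = workflow(anchor["input"])
        # Exact equality suits structured fields such as a tier or an action.
        # Free-text fields would need a looser comparison (assumption).
        if actual != anchor["expected"]:
            deviations.append(
                {"id": anchor["id"], "expected": anchor["expected"], "actual": actual}
            )
    return deviations
```

An empty result lets the change proceed. Anything else is the list of cases a human reviews before the change is promoted.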


The Bottleneck Is Upstream

Here's the uncomfortable reality for teams currently evaluating eval platforms: the tooling is not your bottleneck. Most serious LLMOps platforms are technically capable. What they cannot do is define your output contracts, calibrate your reviewers, instrument your pipeline nodes, or compel your organisation to agree on what "correct" means. They presuppose you've done that work. If you haven't, you'll spend three months configuring a sophisticated eval dashboard that produces scores nobody trusts, measuring against standards nobody agreed on, for a system nobody can observe at the step level.

The work that makes evals possible is organisational and architectural before it is technical. It's the conversation between engineering, product, and domain experts that produces a written quality definition. It's two to four weeks of observation-only production tracing before a single eval is written. It's the calibration sprint that determines whether your reviewers actually agree. It's the output contract that converts "the AI should give good answers" into a schema with validation rules, defined failure modes, and explicit degradation thresholds.

Before you write another eval, do this: pick one workflow in your current system and write down — in typed, specific terms — exactly what a correct output looks like. Not a preference. A contract. Define the schema. Define the acceptable values. Define what constitutes a hard failure versus acceptable degradation. Then ask your product manager and your most relevant domain expert to review that definition independently, without coordinating first.

The gap between their answers is where your real reliability problem lives. And no eval suite will close it for you.