The problem isn't a lack of data. It's instrumenting the wrong layer entirely.
---
An engineering team ships an AI agent. They've done this properly — traces wired up, Prometheus metrics flowing, dashboards configured, error rate alerts in place. The agent goes live. For three weeks, everything looks healthy. Then user complaints start arriving. The team investigates. Latency is fine. API error rates are within thresholds. All tool calls are completing. The infrastructure is telling them nothing is wrong.
The agent has been confidently, repeatedly, doing the wrong thing. And their observability stack has been faithfully confirming that it's doing so at acceptable latency.
This isn't a tooling failure. It's a category error — and it's one of the most expensive mistakes teams make when they move from microservices into agentic systems. The observability model built for distributed infrastructure is being applied to a system that fails in an entirely different register. Microservice failures are mechanical. Agentic failures are semantic. A service is either up or down. An agent can be fully operational and profoundly wrong. The gap between technical correctness and goal alignment is where most agentic failures live — and it is completely invisible to a conventional observability stack.
Your monitoring is asking "did it run?" The question that matters is "did it reason correctly?"
---
The Layer Problem
Here is the core of what most teams get wrong: they are instrumenting the execution layer when they should be instrumenting the reasoning layer.
In a microservice, a trace follows a request across service boundaries. Service A called Service B at 14ms. Service B returned 200. The trace is a flow record — it tells you what happened, in what order, with what result. This model is genuinely useful because microservice failures are almost always flow failures. The call didn't complete, the service was unavailable, the response was malformed. The failure and the signal live at the same layer.
In an agentic system, they don't. An agent can complete every function call successfully, hit every API endpoint without error, and return a response to the user — and the outcome can still be catastrophically wrong. The agent understood the task incorrectly. It selected the right tools in the wrong order. It resolved an ambiguity without flagging it. It was given contradictory constraints and quietly chose one, leaving no record of the choice.
What you need is not a flow record. You need a decision record: why did the agent call Tool B instead of Tool A? Was that decision coherent given the task context? Did the plan that generated this execution sequence make sense to begin with?
Most observability tooling — including the emerging crop of LLM-specific platforms — remains primarily a flow-record system with extra fields for token counts. It tells you that things happened. It does not tell you whether they should have. Closing that gap requires a fundamentally different instrumentation philosophy.
---
The Three Failures Nobody Is Logging
1. The Phantom Plan
Many agentic architectures separate planning from execution. A planning step decides which tools to use, in what sequence, with what parameters. An execution step carries out those decisions. This separation is good design — but it creates a critical observability blind spot when the plan itself is never persisted.
Consider what happens when execution fails in this architecture. The engineering team has tool call records, API responses, error states. What they don't have is the plan that generated those calls. They cannot determine whether the plan was wrong to begin with, whether execution deviated from a sound plan, or whether the plan was sound but the tools returned results the agent couldn't handle. Root cause analysis becomes archaeological — reconstructing intent from artefacts that were never designed to preserve it.
The fix is simple but requires treating the plan as a first-class artefact. Every generated plan should be logged with a unique plan ID that is threaded through all downstream execution events. This single architectural decision — made at design time, not bolted on after a production incident — compresses root cause analysis from days to minutes. If the plan was wrong, you can see it. If execution deviated from a sound plan, you can see that too. If the tools betrayed a sensible plan, that's a different problem requiring a different fix.
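A minimal sketch of what "plan as a first-class artefact" can look like. All names here (`Plan`, `PlanStep`, `log_event`) are hypothetical; the point is the shape: the plan is persisted at creation with a unique ID, and every downstream tool-call event carries that ID.

```python
import json
import time
import uuid
from dataclasses import dataclass, field, asdict

@dataclass
class PlanStep:
    tool: str        # tool the agent intends to call
    params: dict     # parameters the planner chose
    rationale: str   # why this step, in the planner's own words

@dataclass
class Plan:
    task: str
    steps: list
    plan_id: str = field(default_factory=lambda: uuid.uuid4().hex)
    created_at: float = field(default_factory=time.time)

def log_event(record: dict) -> str:
    """Stand-in for your structured log sink (stdout, OTel, a log pipeline)."""
    line = json.dumps(record, sort_keys=True, default=str)
    print(line)
    return line

def log_plan(plan: Plan) -> str:
    return log_event({"event": "plan_created", **asdict(plan)})

def log_tool_call(plan: Plan, step_index: int, status: str) -> str:
    # Every execution event carries the plan_id, so any failure can be
    # traced back to the exact plan that generated it.
    return log_event({
        "event": "tool_call",
        "plan_id": plan.plan_id,
        "step_index": step_index,
        "tool": plan.steps[step_index].tool,
        "status": status,
    })
```

The only structural commitment is the ID threading; the schema details will differ per system.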
Most teams discover they're not logging plans after their first inexplicable production failure. The better teams build plan logging in before they ship.
2. The Cascading Context Collapse
Take a common multi-agent pipeline: a Planner feeds a Researcher, who feeds a Writer. In principle, each agent has a focused task. In practice, context accumulates. By the time the Writer receives its input, the context window contains the original task specification, the Planner's reasoning, the Researcher's raw retrieval results, error recovery attempts from a failed intermediate step, and several rounds of self-correction. The context window is 70% full. Most of that 70% is noise relative to what the Writer actually needs to do.
The Writer's output degrades — less focused, less precise, occasionally incoherent in ways that are hard to pin down. But no error is thrown. The trace shows all green. Latency is within bounds. The system has completed its task, technically.
This is what context collapse looks like in practice. The failure isn't in any single agent or any single call. It's in the composition of information as it moves through the pipeline. And it's almost universally invisible because teams log that context was passed, not what the context contained or whether it was fit for purpose.
The instrumentation target here is context composition at each agent boundary: total context size, the ratio of task-relevant content to noise, how much of the window is occupied by prior reasoning versus current task instructions versus raw tool output. When that ratio degrades past a defined threshold, you have a signal — one that arrives before output quality visibly collapses, not after.
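One way to measure this, sketched with character counts as a crude proxy for tokens (in practice you would count with your model's tokenizer). The segment labels are illustrative; the useful part is computing fill and per-category share at each agent boundary.

```python
def context_composition(segments, window_limit):
    """Summarise what a context window contains at an agent boundary.

    segments: list of (label, text) pairs, e.g. ("task", ...),
              ("prior_reasoning", ...), ("tool_output", ...).
    window_limit: context budget in the same units (here, characters).
    """
    sizes = {}
    for label, text in segments:
        sizes[label] = sizes.get(label, 0) + len(text)
    total = sum(sizes.values())
    return {
        "window_fill": total / window_limit,
        "shares": {label: size / total for label, size in sizes.items()},
    }
```

An alert when `shares["task"]` drops below a chosen floor, or `window_fill` climbs past one, gives you the early signal the paragraph describes.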
3. The Token Spike as a Quality Signal
Most engineering teams monitor token consumption for one reason: cost. They set budget alerts, review monthly invoices, and optimise for efficiency. What they're missing is that token usage per reasoning step is one of the most honest quality signals available to them — and it's already being collected.
An agent that consumes three times its normal token budget on a single reasoning step is not just expensive. It is telling you something specific. It may be stuck in a recovery loop, trying and failing to resolve contradictory instructions. It may be processing tool output that has flooded its context with irrelevant data. It may be attempting to synthesise information that genuinely cannot be synthesised without additional input it doesn't have. The token spike is the symptom. The underlying cause is a reasoning problem.
This reframe — from cost metric to quality signal — is actionable immediately, without new tooling. Teams that already have token instrumentation can begin correlating per-step token consumption against output quality assessments today. Anomalous consumption patterns, when reviewed, consistently surface architectural issues: underspecified task contracts, tools that return excessive output, prompts that create irresolvable ambiguity. Finding these issues through token analysis is faster and cheaper than finding them through user complaints.
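A simple version of that correlation, assuming you already record tokens per reasoning step. This sketch keeps a rolling median per step type and flags any step that exceeds a multiple of its own baseline; the class name and thresholds are illustrative choices, not a prescription.

```python
import statistics
from collections import defaultdict, deque

class TokenSpikeDetector:
    """Flags reasoning steps whose token usage far exceeds the recent norm."""

    def __init__(self, window: int = 50, ratio: float = 3.0, warmup: int = 5):
        # Separate rolling history per step type, since a planning step
        # and a synthesis step have very different normal budgets.
        self.history = defaultdict(lambda: deque(maxlen=window))
        self.ratio = ratio
        self.warmup = warmup

    def observe(self, step_type: str, tokens: int):
        hist = self.history[step_type]
        baseline = statistics.median(hist) if len(hist) >= self.warmup else None
        hist.append(tokens)
        if baseline is not None and tokens > self.ratio * baseline:
            # Surface the anomaly for review; the token spike is the
            # symptom, the underlying reasoning problem is the target.
            return {"step_type": step_type, "tokens": tokens, "baseline": baseline}
        return None
```

Routing these alerts into the same review queue as quality incidents is what turns the cost metric into the quality signal.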
---
The Evaluation Gap Nobody Talks About
There is a structural problem in how most engineering organisations handle agentic systems: observability and evaluation are treated as separate disciplines.
The observability team instruments production systems. The ML/AI team runs offline evaluation benchmarks. In practice, these functions are solving the same problem from different angles — and the organisational separation creates a blind spot that neither team sees clearly.
Production observability tells you what the agent did. It cannot tell you whether what it did was right. Offline evaluation tells you whether the agent performs well on a benchmark suite. It cannot tell you whether that benchmark reflects what the agent is actually encountering in production. Real production quality, measured systematically, falls into the gap between the two teams.
The organisations getting this right have stopped treating evaluation as a pre-deployment gate and started running it as a continuous production signal. A lightweight automated evaluation pipeline — even an LLM-as-judge setup sampling a few percent of production outputs — produces a quality signal that raw telemetry cannot approximate. When that signal feeds into the same dashboard as your infrastructure metrics, you get something genuinely new: the ability to see quality drift before it becomes visible to users.
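The sampling side of such a pipeline is small. In this sketch, `judge_fn` is a stand-in for whatever scoring call you use (an LLM-as-judge prompt, a rubric model); the injectable `rng` exists only to make the behaviour testable.

```python
import random

def sample_for_evaluation(interaction, judge_fn, rate=0.03, rng=random.random):
    """Route roughly `rate` of production outputs through a quality scorer.

    interaction: dict with at least "id", "task", "output".
    judge_fn: stand-in for your scoring call, returning a 0-1 quality score.
    Returns a quality record for sampled interactions, None otherwise.
    """
    if rng() >= rate:
        return None  # not sampled; no evaluation cost incurred
    score = judge_fn(interaction["task"], interaction["output"])
    return {"interaction_id": interaction["id"], "quality": score}
```

Emitting the returned records as a metric alongside your infrastructure dashboards is what makes quality drift visible week over week.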
Silent drift is the agentic equivalent of silent data corruption. The system is technically operational. Performance is degrading week over week. Because no quality baseline was established at deployment and there are no semantic quality metrics in the monitoring stack, the degradation is invisible until it's severe enough to generate complaints. By the time the first user escalation arrives, the agent has often been delivering degraded output for weeks.
Addressing this requires establishing quality metrics from day one — even rough ones. What does a coherent tool selection look like for this task type? What constitutes a sound reasoning chain for this agent's domain? These questions are harder to answer than "what is the p99 latency?" but they are the questions that determine whether your observability gives you early warning or post-mortem material.
---
Build It In, Or Accept That You Can't See It
One conclusion that follows from all of this is uncomfortable: most observability problems in agentic systems are architectural problems in disguise.
Agents built as monolithic prompt chains, with tool outputs concatenated into context as unstructured text, are fundamentally opaque after the fact. If context is passed as a text blob between reasoning steps, there is no reliable way to instrument what information the agent was working with at any given decision point. No observability platform fixes a bad contract between agents. The instrumentation options available to you later are determined by the architectural decisions you make now.
The most valuable observability investment for most teams is therefore not a better platform subscription. It is structured agent communication from the start: explicit task contracts, logged plans, typed context at agent boundaries, defined tool output schemas. These decisions make your system instrumentable. Without them, you can add spans and traces and still find yourself debugging production failures by reading concatenated text logs and trying to reconstruct what the agent was thinking.
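What a typed boundary can look like, using the earlier Researcher-to-Writer pipeline as a hypothetical example. The field names are assumptions; the design point is that once the handoff is a structured object rather than a text blob, each part of it can be sized, logged, and validated.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class ResearchHandoff:
    """Typed contract passed from a Researcher agent to a Writer agent."""
    task_id: str
    instructions: str   # what the Writer must produce
    findings: tuple     # curated results, not raw tool output
    sources: tuple      # provenance for each finding

    def size_report(self) -> dict:
        # Instrumentable precisely because the structure is explicit:
        # you can see how much of the handoff is instruction vs. evidence.
        return {
            "instructions_chars": len(self.instructions),
            "findings_chars": sum(len(f) for f in self.findings),
            "finding_count": len(self.findings),
        }
```

A text-blob handoff cannot produce a report like this; the structured one makes the context-composition metrics described earlier nearly free.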
Teams that discover this the hard way spend months retrofitting structure into architectures that weren't designed for it. Teams that design for observability from the start find that production debugging cycles are shorter and quality improvement loops actually close.
---
What to Do This Week
If you have an agent in production right now, there is one change that will give you more genuine insight than any new tooling: start logging your plans.
Identify where in your architecture the agent decides what to do — which tools to call, in what order, with what intent. Persist that decision as a structured artefact with a unique ID. Thread that ID through every downstream execution event. Don't change anything else yet. Just make the plan visible.
The next time a production failure occurs, you will immediately be able to distinguish between three fundamentally different root causes: the plan was wrong, execution deviated from a sound plan, or the plan and execution were both sound but the tools failed. Each requires a completely different response. Without logged plans, you cannot reliably tell which one you're dealing with — and that ambiguity is where debugging time disappears.
That single distinction is worth more than any dashboard you'll build this quarter.