The Best Thing an Observability Agent Can Say Is ‘I Don’t Know’

There’s a demo everyone in observability is building right now: you type “why did p95 latency spike at 14:03?” into a chat box, and an LLM writes you a confident, well-formatted root-cause analysis. It’s a great demo. It is also, if you ship it naively, a machine for manufacturing plausible lies.

The problem isn’t the language model. It’s what you point it at. Ask an LLM to explain a spike and it will always explain the spike — even when there was no spike, even when nothing in your telemetry correlates with it, even when the “14:03 anomaly” is a single sample two percent above a jittery baseline. The model has no incentive to say “actually, this is noise.” Fluency is not the same as being right, and on an on-call page at 3am, a fluent wrong answer is worse than no answer at all. It sends a tired human down the wrong path with false confidence.

So when I built a small Observability Assistant — an agent that answers operator questions over a Prometheus-style metrics dataset — the interesting engineering wasn’t the prompt. It was teaching it to shut up.

Decide in code; narrate with the model

The core design rule is simple: the LLM never makes a decision. It only describes one that’s already been made.

Everything load-bearing (is there an anomaly? is there a supported cause? are we confident enough to name one?) is computed in plain, testable code. The model is handed the finished verdict, the detected anomaly, and the ranked list of correlated signals, and told, in so many words: these findings are final; do not add to them, and if we abstained, you may not assert a cause. When there’s no API key, or the API is down, a deterministic rule-based writer produces the answer from the same verdict. The analysis never depends on the network.
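A minimal sketch of that split, under stated assumptions: `Verdict`, `rule_based_summary`, `narrate`, the status strings, and the `llm` callable are all hypothetical names for illustration, not the assistant's actual code. The point it shows is only the shape: the verdict is frozen before any model call, and the fallback writer reads the same object.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Tuple


@dataclass(frozen=True)
class Verdict:
    """The finished decision. Every field is computed before any model call."""
    status: str  # "cause_named" | "ambiguous" | "unexplained" | "no_anomaly"
    metric: str
    anomaly_time: Optional[str] = None
    causes: Tuple[str, ...] = ()


def rule_based_summary(v: Verdict) -> str:
    """Deterministic writer used when no model is available."""
    if v.status == "no_anomaly":
        return f"{v.metric}: within noise, no anomaly found."
    if v.status == "unexplained":
        return (f"{v.metric}: confirmed anomaly at {v.anomaly_time}, but nothing "
                "in this dataset correlates with it. Escalate to a human.")
    if v.status == "ambiguous":
        return (f"{v.metric}: anomaly at {v.anomaly_time}; candidates "
                f"{', '.join(v.causes)} are too close to separate.")
    return f"{v.metric}: anomaly at {v.anomaly_time}; supported cause: {v.causes[0]}."


def narrate(v: Verdict, llm: Optional[Callable[[str], str]] = None) -> str:
    """Ask the model for prose only; fall back to the rule-based writer."""
    if llm is None:  # e.g. no API key configured
        return rule_based_summary(v)
    prompt = (
        "These findings are final. Do not add to them. "
        "If status is 'unexplained' or 'no_anomaly', do not assert a cause.\n"
        f"Findings: {v}"
    )
    try:
        return llm(prompt)
    except Exception:  # API down, timeout, and similar failures
        return rule_based_summary(v)
```

Because `narrate` only ever receives a finished `Verdict`, swapping the model changes the wording but cannot change the decision.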

This inverts the usual architecture, where the model reasons and the code plumbs. Here the code reasons and the model is a thin, replaceable prose layer. That’s not a limitation — it’s the whole point. You can’t write a unit test for “did the model reason well.” You can write a test for “did the agent detect the anomaly, name the upstream cause, and refuse the two cases where it shouldn’t have answered.”

Three gates, and the courage to abstain at each

The agent passes through three gates before it will name a root cause, and it can abstain at any of them:

Gate 1: Is the anomaly even real? Before hunting for a cause, insist the premise holds. The named point has to clear both a robust z-score bar (computed with median + MAD, so the statistic resists the very spike it’s measuring) and a relative-rise bar, on top of a baseline with enough samples to trust. A blip that’s 3σ and 35% over baseline, when the bar is 4σ and 50%, gets the honest answer: “that’s within noise — there’s no anomaly there.” Most “the AI hallucinated a root cause” failures die right here, because most of them start from a spike that was never a spike.
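Gate 1 can be sketched as follows. The 4σ and 50% bars come from the example above; the function names, the `min_samples` default of 30, and the use of the common 0.6745 scaling for a MAD-based (median absolute deviation) z-score are assumptions for illustration.

```python
import math
from statistics import median


def robust_z(value: float, baseline: list) -> float:
    """Modified z-score: distance from the median in MAD-derived sigma units."""
    med = median(baseline)
    mad = median([abs(x - med) for x in baseline])
    if mad == 0:
        # Perfectly flat baseline: any departure is infinitely many sigmas.
        return 0.0 if value == med else math.copysign(math.inf, value - med)
    return 0.6745 * (value - med) / mad


def is_real_anomaly(value: float, baseline: list,
                    z_bar: float = 4.0, rise_bar: float = 0.50,
                    min_samples: int = 30) -> bool:
    """True only if the point clears BOTH bars over a trustworthy baseline."""
    if len(baseline) < min_samples:
        return False  # not enough history to trust the statistic
    med = median(baseline)
    if med <= 0:
        return False  # relative rise is undefined; abstain in this sketch
    rise = (value - med) / med
    return robust_z(value, baseline) >= z_bar and rise >= rise_bar
```

Requiring both bars matters on quiet series: when the baseline barely moves, a small absolute change can score many sigmas, and the relative-rise bar is what keeps it from being reported as a spike.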

Gate 2: Is there a supported cause? If the spike is real, score every other metric by how far it deviated near the anomaly. Two honesty rules are enforced in code, not left to the model’s judgment. First, a signal that only moved after latency rose is filtered out — it’s a symptom, not a cause; causes lead or coincide. Second, a flat metric scores ~0 and is never blamed, so the agent won’t pin a latency spike on a “traffic surge” that didn’t happen. And if nothing clears the bar? The agent confirms the spike is real and then says the most useful true thing it can: “I confirmed a 74σ spike at 14:04, but nothing in this dataset correlates with it. Escalate to a human.” That sentence is the entire value of the system. An agent that can distinguish “here’s your cause” from “this one’s real but I can’t explain it” is one an operator can actually trust, because its silence is informative.
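The two honesty rules in Gate 2 might look like this in code. This is a sketch under assumptions: series are equal-length lists indexed by sample, `t` is the index of the confirmed anomaly, and `deviation`, `supported_causes`, the window size, the bar, and the 1% scale floor are hypothetical choices rather than the assistant's real parameters.

```python
from statistics import median


def deviation(value: float, baseline: list) -> float:
    """Absolute robust deviation of one sample from its baseline."""
    med = median(baseline)
    mad = median([abs(x - med) for x in baseline])
    # Assumed floor (1% of the median) so a flat baseline does not divide by zero.
    scale = max(1.4826 * mad, 0.01 * abs(med), 1e-9)
    return abs(value - med) / scale


def supported_causes(series: dict, t: int, window: int = 2,
                     bar: float = 4.0, min_baseline: int = 5) -> list:
    """Rank metrics that deviated at or before sample t; drop flat and late ones."""
    ranked = []
    for name, values in series.items():
        baseline = values[: max(t - window, 0)]
        if len(baseline) < min_baseline:
            continue  # not enough history to score this metric
        lo, hi = t - window, min(t + window, len(values) - 1)
        scores = {i: deviation(values[i], baseline) for i in range(lo, hi + 1)}
        onset = next((i for i in range(lo, hi + 1) if scores[i] >= bar), None)
        if onset is None:
            continue  # flat metric: scores ~0 and is never blamed
        if onset > t:
            continue  # moved only after the anomaly: a symptom, not a cause
        ranked.append((name, max(scores.values()), onset))
    return sorted(ranked, key=lambda r: r[1], reverse=True)
```

An empty list is a result, not an error: it is the signal to confirm the spike and decline to name a cause.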

Gate 3: Is one cause clearly ahead? When two unrelated signals deviate about equally, naming one is a coin flip dressed as analysis. So the agent reports both and lowers its confidence instead of guessing. When the signals are causally linked (a connection pool saturating at 14:02, then slow queries and rising latency at 14:03), it uses temporal precedence to name the upstream cause (the pool), not the loudest downstream symptom (the slow queries, which actually have the larger raw deviation). First-mover, not loudest.
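One way to express Gate 3, as a sketch only. How the agent knows two signals are causally linked is not described here, so the `linked` set of known pairs is an assumed input; `Candidate`, `pick_cause`, the `tie_ratio` of 0.8, and the confidence labels are likewise hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    name: str
    score: float  # deviation score carried over from the previous gate
    onset: int    # minute the signal first deviated, e.g. minutes since midnight


def pick_cause(cands: list, linked: set, tie_ratio: float = 0.8) -> dict:
    """Name one cause only when it is clearly ahead or clearly upstream."""
    if not cands:
        return {"verdict": "unexplained", "causes": [], "confidence": "none"}
    ranked = sorted(cands, key=lambda c: c.score, reverse=True)
    if len(ranked) == 1:
        return {"verdict": "cause_named", "causes": [ranked[0].name],
                "confidence": "high"}
    top, runner = ranked[0], ranked[1]
    if frozenset((top.name, runner.name)) in linked:
        # Causally linked: the first mover wins, not the loudest signal.
        first = min((top, runner), key=lambda c: c.onset)
        return {"verdict": "cause_named", "causes": [first.name],
                "confidence": "high"}
    if runner.score >= tie_ratio * top.score:
        # Unrelated and roughly equal: report both, lower the confidence.
        return {"verdict": "ambiguous", "causes": [top.name, runner.name],
                "confidence": "low"}
    return {"verdict": "cause_named", "causes": [top.name], "confidence": "high"}
```

With a pool signal at minute 842 (14:02) scoring 30 and linked slow queries at minute 843 scoring 55, this returns the pool, matching the first-mover rule described above.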

There’s a fourth, quieter guard: grounding. Ask about a metric the dataset doesn’t contain — “why did memory usage spike?” — and the agent refuses with the list of series it actually has, rather than inventing one. A grounded refusal is just abstention wearing a different hat.
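The grounding guard is small enough to sketch in a few lines. The function name, return shape, and series names below are illustrative assumptions; the behavior shown is just the rule above: an unknown metric gets a refusal that lists what the dataset does contain.

```python
def resolve_metric(requested: str, available: set) -> dict:
    """Return the metric if the dataset has it; otherwise refuse with the inventory."""
    if requested in available:
        return {"ok": True, "metric": requested}
    return {
        "ok": False,
        "message": (
            f"No series named '{requested}' in this dataset. "
            f"Available series: {', '.join(sorted(available))}."
        ),
    }
```

Running this check before any anomaly detection means a question about a missing series never reaches the gates at all.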

The eval harness is the real deliverable

None of this matters if you can’t prove it stays true as you change the prompt, swap the model, or tune a threshold. So the whole agent runs behind a small eval harness: graded cases that assert on the decision, not the prose. Does the incident case name the pool saturation and refuse to blame flat traffic? Does the noisy case abstain? Does the unexplained-spike case confirm the anomaly but decline the cause? Does the SLO question skip root-cause hunting entirely and just report the error budget? Twenty checks across five cases, green, offline, in CI.
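A harness of this kind can be quite small. The sketch below assumes the agent is a callable that takes a question and returns a decision dict with `verdict` and `causes` keys; the case names, questions, field names, and `never_blame` list are hypothetical stand-ins, not the actual suite.

```python
from typing import Callable, Dict, List

# Each case asserts on decision fields, never on the generated prose.
CASES: List[dict] = [
    {"name": "incident",
     "question": "why did p95 latency spike at 14:03?",
     "expect": {"verdict": "cause_named", "causes": ["db_pool_wait"]},
     "never_blame": ["traffic"]},
    {"name": "noisy",
     "question": "why did latency blip at 09:15?",
     "expect": {"verdict": "no_anomaly"}},
    {"name": "unexplained",
     "question": "why did latency spike at 14:04?",
     "expect": {"verdict": "unexplained", "causes": []}},
    {"name": "grounding",
     "question": "why did memory usage spike?",
     "expect": {"verdict": "refused"}},
]


def run_suite(agent: Callable[[str], Dict], cases: List[dict]) -> List[str]:
    """Run every case and return human-readable failures (empty list means green)."""
    failures = []
    for case in cases:
        decision = agent(case["question"])
        for key, want in case["expect"].items():
            got = decision.get(key)
            if got != want:
                failures.append(
                    f"{case['name']}: {key} expected {want!r}, got {got!r}")
        for banned in case.get("never_blame", []):
            if banned in decision.get("causes", []):
                failures.append(f"{case['name']}: blamed {banned!r}")
    return failures
```

In CI, the build would fail whenever `run_suite` returns a non-empty list. Because nothing here calls a model, the suite runs offline and gives the same result every time.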

This is the same instinct behind the public benchmarks the observability industry is starting to build for exactly these agents — grading them on consistency (does it get it right three times out of three?) against a real stack, not on a single lucky transcript. If you’re going to let an agent near a production incident, “it demoed well once” is not a bar. “It passes a graded suite, including the cases where the right move is to abstain” is.

The takeaway

The race in AI observability isn’t to the assistant that always has an answer. It’s to the one that knows the difference between an answer and a guess. Build the decision in code, put it behind evals, and give the thing permission to say “I don’t know — here’s what I checked.” That single capability is what turns a slick demo into something you’d actually wire into your on-call rotation.