The Agent Wanted to Call It Healthy. I Made It Abstain.

There’s a version of “AI writes the code now” that’s true, and a version that’s a trap. The true version: I ship most of my software by driving a coding agent — Claude Code, Cursor, Aider — from a spec, and I move a lot faster than I used to. The trap: believing that because the agent produced working, plausible code, it produced correct code. Those are not the same thing, and the gap between them is the entire job now.

I want to make that concrete, because “human judgment on top of AI” is the kind of phrase that means nothing until you watch it happen on a real diff. So here’s a small tool I built recently — a health-checker for a fleet of web properties — and the specific moments where I stopped the agent and made a different call. The tool is unremarkable. The overrides are the point.

The setup

The premise: you run a lot of websites — hundreds — and you can’t look at them one at a time. You want a program that takes a snapshot of each site (HTTP status, TLS certificate, latency, the SEO surface) and tells you which ones are healthy, which are degraded, which are failing, and — critically — what to fix first.

I wrote the spec, named the invariant I cared about most, and handed the implementation to the agent. It came back fast with a clean, typed, reasonable first cut. And “reasonable” is exactly where the danger was hiding.

Override 1: it wanted to score incomplete data. I made it refuse.

Here’s the moment that matters. A snapshot doesn’t always come back complete — sometimes the latency check lands but the certificate check times out. The agent’s scorer did the sensible-looking thing: it computed a health score from whatever signals were present and returned a verdict. If a site returned a 200 and a fast response but its TLS check never completed, the agent would happily call it healthy.

Think about what that does at scale. You have a fleet dashboard, and the one site whose certificate actually expired last night is sitting there green — because the probe that would have caught it timed out, and the tool filled the silence with optimism. A monitoring tool that invents “healthy” out of missing data is worse than no tool, because it earns your trust and then hides the failure.

So I tore it out and put a gate in front of the scorer: if the availability- critical signals weren’t actually captured, the property is indeterminate, with a reason, full stop. Not scored on the fragments. Not averaged into a number. The first test I wrote is named for exactly this case — a fast, 200-returning page with an uncaptured certificate is indeterminate, not healthy. An honest “I don’t know” beats a confident wrong answer every time, and teaching software to say “I don’t know” is a design decision a person has to make. The agent will never volunteer it.

Override 2: “expired” is not “degraded.”

The agent modeled certificate problems on a single smooth axis: fewer days remaining, worse score. Tidy, and wrong. A certificate with ten days left is a warning — you should renew it. A certificate that already expired is a serving failure — the browser blocks the page; the site is effectively down. Those are different categories, not different points on a slider. I split the rule: inside the warning window it’s a major issue; already expired, it’s critical. Same reasoning moved another case up in severity — a page that’s supposed to earn search traffic but is quietly serving noindex. It “looks fine” and earns nothing. That’s the failure a fleet owner most needs shoved in their face, so I made it loud.

Override 3: the model doesn’t get to decide what to fix first.

My first pass at the LLM prompt asked the model to prioritize the issues. It’s tempting — the model is right there, it’ll happily rank them. But remediation order is a business decision (how bad × how much traffic the site carries), and it has to be deterministic and testable: identical inputs, identical order, every single run. I pulled ranking out of the prompt and into code, wrote a test that a flagship failure outranks a long-tail one, and demoted the model to narrating a queue it is not allowed to reorder. The prompt now tells it, in plain words, that the order is final.

This is the trust boundary I put in every LLM feature I build: code decides, the model narrates. The language model is wonderful at turning a set of decided facts into a readable paragraph. It is not where you put the decision. When the model is down, or slow, or hallucinating, the tool falls back to a deterministic sentence and keeps working. Nothing a user relies on is ever one API call away from breaking.

The unglamorous ones

Not every override is philosophical. Left alone, the agent pulled in three dependencies — an HTTP library, a validation library, a test framework — that the platform already provides natively. I removed all three; the tool now runs with no install step, which matters when it’s meant to live in CI across hundreds of sites. It also wanted the tests to hit live URLs, which would make them flaky and network-bound; I inverted that so the snapshot is an input to a pure core and the tests run offline and green. Small calls. They’re the difference between a demo and something you’d actually deploy.

What this says about the work

If you gave that same spec to an agent with nobody steering, you’d get a plausible fleet checker that averages partial data into green dashboards, lets the model pick what to fix, and carries three dependencies it doesn’t need. It would demo beautifully and fail quietly in production, which is the worst way to fail.

The distance between that and something you can trust is entirely the human layer: knowing that “reasonable” and “correct” diverge, knowing where they diverge, and having the judgment to stop and say “no, abstain.” The agent is genuine leverage — it wrote most of the typed implementation, the first draft of the tests, the CLI. I got a day’s work done in a couple of hours. But the leverage is only worth having if someone senior is reading every diff and owns the decisions the model can’t. That’s not a hedge against AI. It’s the job the tools finally free you up to do.