The Number Was Green, So the Agent Stayed Quiet

The blended ROAS was 3.11x. The goal was 3.0x. Every dashboard in the world would paint that number green and move on.

The agent I’d built refused to say anything.

That refusal — not the analysis it produces when the data is good — is the whole point of the project, and it’s the part that most LLM features skip. This is a note on how to build an LLM analysis agent for marketing data that a media agency can actually trust: an eval harness to pin its decisions, and an abstain gate so it stays quiet on noise.

The failure mode nobody evals for

The obvious risk with an LLM that “analyzes your campaign” is that it hallucinates a number. That one’s easy to design around: don’t let the model do arithmetic. Compute CPM, CTR, CPA, ROAS, and pacing in plain code; compare each channel to the client’s KPI goal in plain code; decide over/under/on-target in plain code. The model never touches a number. Solved.
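As a rough illustration of "plain code," here is a minimal sketch of the metric layer. The field names, the `safe_div` helper, and the 5% on-target tolerance are my assumptions for the example, not values from the project; pacing is left out for brevity.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass(frozen=True)
class ChannelDelivery:
    """Raw delivery totals for one channel (hypothetical field names)."""
    name: str
    impressions: int
    clicks: int
    conversions: int
    spend: float
    revenue: float


def safe_div(numerator: float, denominator: float) -> Optional[float]:
    """Divide, returning None when the denominator is zero."""
    return numerator / denominator if denominator else None


def compute_metrics(d: ChannelDelivery) -> Dict[str, Optional[float]]:
    """Derive the standard KPIs from raw totals; no model involved."""
    return {
        "cpm": safe_div(d.spend * 1000, d.impressions),  # cost per 1,000 impressions
        "ctr": safe_div(d.clicks, d.impressions),        # clicks per impression
        "cpa": safe_div(d.spend, d.conversions),         # cost per conversion
        "roas": safe_div(d.revenue, d.spend),            # revenue per unit of spend
    }


def compare_to_goal(value: float, goal: float, tolerance: float = 0.05) -> str:
    """Return 'over', 'under', or 'on_target' relative to the goal.

    The 5% tolerance band is an assumption, not a value from the article.
    """
    if abs(value - goal) <= goal * tolerance:
        return "on_target"
    return "over" if value > goal else "under"
```

The model only ever sees the labels this layer produces, so there is no number for it to get wrong.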

The subtler, more dangerous failure is that the model hallucinates a story. Give a capable model six conversions on a channel and ask why it underperformed, and it will tell you: confidently, fluently, with a plausible operational cause. “Programmatic Display underperformed; likely creative fatigue and a soft mid-funnel audience.” It reads like insight. It’s a narrative draped over a handful of coin flips.

Marketing data is especially prone to this because the early numbers look real. A channel with 8 conversions has a CPA. You can put it on a slide. It just doesn’t mean anything yet — the variance is enormous, and any “why” you attach is a story about randomness. An agent that always produces a confident read is worse than a spreadsheet, because it launders noise into prose that sounds like a decision.

The abstain gate

So the core of the agent isn’t the analysis. It’s the gate in front of it.

Before any channel gets a verdict, it has to clear a sufficiency check, in code:

Enough conversions for the KPI to be stable (a channel with fewer than ~25 conversions doesn’t get a ROAS verdict, no matter how good the ROAS looks).
Enough spend to be above a read floor.
Enough of the flight elapsed — a campaign that’s 20% run can’t be judged, even if it already has plenty of conversions.

A channel that fails any of these is marked indeterminate. It doesn’t get a verdict, it doesn’t get a “why,” and — critically — it never enters a budget reallocation. There’s a campaign-wide version of the same check: too few total conversions, or nothing readable, and the entire read is withheld with a plain-English reason.
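The gate described above could be sketched roughly like this. The 25-conversion floor and the 100-conversion campaign floor come from this post; the spend floor, the 50% flight threshold, and all of the names (`Thresholds`, `channel_gate`, `campaign_gate`) are placeholders I picked for illustration.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class Thresholds:
    min_conversions: int = 25            # per-channel floor mentioned above
    min_spend: float = 500.0             # hypothetical read floor
    min_flight_elapsed: float = 0.5      # hypothetical; must exceed the 20% example
    min_campaign_conversions: int = 100  # matches the "< 100" abstain message


@dataclass(frozen=True)
class ChannelStats:
    name: str
    conversions: int
    spend: float


def channel_gate(ch: ChannelStats, flight_elapsed: float, t: Thresholds) -> List[str]:
    """Return the reasons a channel is indeterminate; an empty list means readable."""
    reasons = []
    if ch.conversions < t.min_conversions:
        reasons.append(f"{ch.conversions} conversions (< {t.min_conversions})")
    if ch.spend < t.min_spend:
        reasons.append(f"spend {ch.spend:.0f} below read floor {t.min_spend:.0f}")
    if flight_elapsed < t.min_flight_elapsed:
        reasons.append(f"only {flight_elapsed:.0%} of flight elapsed")
    return reasons


def campaign_gate(channels: List[ChannelStats], flight_elapsed: float,
                  t: Thresholds) -> Dict[str, object]:
    """Run every channel through the gate, then apply the campaign-wide check."""
    per_channel = {ch.name: channel_gate(ch, flight_elapsed, t) for ch in channels}
    readable = [name for name, reasons in per_channel.items() if not reasons]
    total = sum(ch.conversions for ch in channels)

    abstain_reasons = []
    if total < t.min_campaign_conversions:
        abstain_reasons.append(
            f"campaign has {total} conversions total (< {t.min_campaign_conversions})")
    if not readable:
        abstain_reasons.append("no individual channel cleared its own data-sufficiency bar")

    return {
        "abstained": bool(abstain_reasons),
        "abstain_reasons": abstain_reasons,
        "indeterminate": {n: r for n, r in per_channel.items() if r},
        "readable": readable,  # only these may get a verdict or enter a reallocation
    }
```

Returning reasons rather than a bare boolean is a deliberate choice in this sketch: the same strings can feed the plain-English abstain message without a second code path.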

Here’s what that looks like on the sparse example:

⚠️  ABSTAINED at the campaign level:
    · campaign has 19 conversions total (< 100); results are not yet
      distinguishable from noise
    · no individual channel cleared its own data-sufficiency bar

The blended ROAS on that data was 3.11x — over goal. The agent still refused to call it. That’s the behavior you want, and it’s the behavior an LLM will not give you on its own.

The thresholds here are deliberately simple heuristics. The honest next step is statistical: a confidence interval on each channel’s conversion rate, and an indeterminate that means “the interval still includes the goal” rather than “the count is below a magic number.” But even the crude version captures most of the value, because the point isn’t precision; it’s having a gate at all.
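For a sense of what the statistical version might look like, here is one possible sketch using a Wilson score interval on the conversion rate. The choice of Wilson, the use of clicks as the trial count, the 95% level (`z = 1.96`), and the idea of a goal conversion rate are all my assumptions; the agent described here does not do this yet.

```python
import math
from typing import Tuple


def wilson_interval(conversions: int, trials: int, z: float = 1.96) -> Tuple[float, float]:
    """Wilson score interval for a binomial proportion (z=1.96 is roughly 95%)."""
    if trials <= 0:
        return (0.0, 1.0)  # no data: the rate could be anything
    p = conversions / trials
    z2 = z * z
    denom = 1 + z2 / trials
    center = (p + z2 / (2 * trials)) / denom
    half = z * math.sqrt(p * (1 - p) / trials + z2 / (4 * trials * trials)) / denom
    return (max(0.0, center - half), min(1.0, center + half))


def classify_rate(conversions: int, trials: int, goal_rate: float, z: float = 1.96) -> str:
    """'indeterminate' while the interval still contains the goal rate."""
    low, high = wilson_interval(conversions, trials, z)
    if low <= goal_rate <= high:
        return "indeterminate"
    return "over" if low > goal_rate else "under"
```

With this shape, a channel at 6 conversions from 300 clicks against a 3% goal stays indeterminate, while the same 2% rate at 600 from 30,000 is readable as under goal. The verdict now depends on how much evidence there is, not on a fixed count.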

The model’s actual job

Once the code has decided what’s readable and what the verdicts are, the LLM does the one thing it’s genuinely good at: turn a table of verdicts into a paragraph a human wants to read. It’s handed only the channels that survived the gate, and told explicitly not to narrate the indeterminate ones. The prose floats; the decisions are pinned.

And because the prose is the only part that needs a model, the whole tool degrades gracefully: if there’s no API key, the network blips, or the gate abstains, it drops to a deterministic rule-based narrative. The analysis is identical either way. Nothing about the trustworthy core depends on the model being up.
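A minimal sketch of that fallback, assuming the model is wrapped as a plain callable that takes a prompt and returns text. The `llm` parameter, the prompt wording, and the function names are hypothetical; no specific provider API is implied.

```python
from typing import Callable, Dict, Optional

LABELS = {"over": "over goal", "under": "under goal", "on_target": "on target"}


def rule_based_narrative(readable: Dict[str, str]) -> str:
    """Deterministic prose built directly from the verdict table."""
    if not readable:
        return "No channel has enough data for a read yet."
    return " ".join(f"{name}: {LABELS[verdict]}."
                    for name, verdict in sorted(readable.items()))


def build_prompt(readable: Dict[str, str]) -> str:
    """The model sees verdict labels only: no raw numbers, no gated channels."""
    rows = "\n".join(f"- {name}: {LABELS[verdict]}"
                     for name, verdict in sorted(readable.items()))
    return ("Summarize these channel verdicts for a client. "
            "Do not add numbers or mention any other channel.\n" + rows)


def narrate(verdicts: Dict[str, str], abstained: bool = False,
            llm: Optional[Callable[[str], str]] = None) -> str:
    """Use the model for prose when possible; otherwise fall back to rules."""
    if abstained:
        return "Read withheld: results are not yet distinguishable from noise."
    # Indeterminate channels are dropped before either path can narrate them.
    readable = {n: v for n, v in verdicts.items() if v != "indeterminate"}
    if llm is None or not readable:
        return rule_based_narrative(readable)
    try:
        return llm(build_prompt(readable))
    except Exception:
        # Any model failure (timeout, auth, rate limit) degrades to rules.
        return rule_based_narrative(readable)
```

The filter runs before the branch on purpose, so neither the model path nor the fallback path can ever mention a channel that failed the gate.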

Why the eval harness grades decisions, not words

The instinct with an LLM feature is to eval the output text. That’s a trap — you end up with brittle assertions that break the moment the model paraphrases, and you learn nothing about whether the judgment is right.

So the eval harness grades the structured result, with the model forced off:

Does the underperformer get flagged?
Do the thin channels stay indeterminate?
Does a reallocation ever touch an indeterminate channel? (It must not.)
Does the sparse dataset abstain, and does the mid-flight dataset abstain for the right reason — flight elapsed, not conversion count?

Each case is a fixture with known-correct answers; the runner exits non-zero on any failure, so it’s a CI gate. When you change a threshold or refactor the classifier, the eval tells you immediately whether you changed a decision you didn’t mean to. That regression signal is worth more than any single clever prompt.
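To make the shape concrete, here is a toy version of such a runner. The `analyze` function is a deliberately tiny stand-in for the real deterministic analysis, and the fixtures, field names, and the 50% flight threshold are invented for the example.

```python
from typing import Dict, List

MIN_CONVERSIONS = 25       # per-channel floor from the article
MIN_FLIGHT_ELAPSED = 0.5   # hypothetical threshold


def analyze(campaign: Dict) -> Dict:
    """Stand-in for the deterministic analysis (no model call anywhere)."""
    result = {"abstained": False, "reason": None, "verdicts": {}, "reallocate_from": []}
    if campaign["flight_elapsed"] < MIN_FLIGHT_ELAPSED:
        result.update(abstained=True, reason="flight_elapsed")
        return result
    for name, ch in campaign["channels"].items():
        if ch["conversions"] < MIN_CONVERSIONS:
            result["verdicts"][name] = "indeterminate"
        else:
            roas = ch["revenue"] / ch["spend"]
            result["verdicts"][name] = "under" if roas < campaign["roas_goal"] else "over"
    if all(v == "indeterminate" for v in result["verdicts"].values()):
        result.update(abstained=True, reason="conversion_count")
        return result
    result["reallocate_from"] = [n for n, v in result["verdicts"].items() if v == "under"]
    return result


FIXTURES = [
    {"name": "mixed",
     "campaign": {"roas_goal": 3.0, "flight_elapsed": 0.9, "channels": {
         "Search": {"conversions": 200, "spend": 1000.0, "revenue": 4000.0},
         "Display": {"conversions": 60, "spend": 1000.0, "revenue": 1500.0},
         "Audio": {"conversions": 6, "spend": 300.0, "revenue": 1200.0}}},
     "expect": {"abstained": False, "reallocate_from": ["Display"],
                "verdicts": {"Search": "over", "Display": "under",
                             "Audio": "indeterminate"}}},
    {"name": "sparse",
     "campaign": {"roas_goal": 3.0, "flight_elapsed": 0.9, "channels": {
         "Search": {"conversions": 11, "spend": 500.0, "revenue": 1600.0},
         "Display": {"conversions": 8, "spend": 500.0, "revenue": 1510.0}}},
     "expect": {"abstained": True, "reason": "conversion_count"}},
    {"name": "mid_flight",
     "campaign": {"roas_goal": 3.0, "flight_elapsed": 0.2, "channels": {
         "Search": {"conversions": 300, "spend": 1000.0, "revenue": 5000.0}}},
     "expect": {"abstained": True, "reason": "flight_elapsed"}},
]


def check(fixture: Dict) -> List[str]:
    """Compare structured fields only; return one message per mismatch."""
    got = analyze(fixture["campaign"])
    failures = [f"{fixture['name']}: {key} = {got[key]!r}, want {want!r}"
                for key, want in fixture["expect"].items() if got[key] != want]
    # Invariant on every fixture: reallocation never touches an indeterminate channel.
    for name in got["reallocate_from"]:
        if got["verdicts"].get(name) == "indeterminate":
            failures.append(f"{fixture['name']}: reallocation touched {name}")
    return failures


def main() -> int:
    """Return a process exit code: 0 if every fixture passes, 1 otherwise."""
    failures = [msg for fx in FIXTURES for msg in check(fx)]
    for msg in failures:
        print("FAIL", msg)
    return 1 if failures else 0
```

In a script, the last line would be something like `raise SystemExit(main())`, which is what turns the runner into a CI gate. Note that the mid-flight fixture has plenty of conversions, so the only way it can pass is by abstaining for the right reason.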

Forward-deployed, by config

One more thing that matters when you’re embedded with a client rather than shipping a generic tool: the analyst logic is fixed, but the client is a config file. One client is ROAS-primary with a 3.0x goal and one set of noise floors; the next is CPA-primary with a $30 goal and stricter thresholds. Same delivery data, same code — swap the config and every verdict re-derives against the new KPI. That’s the file you edit while sitting with a new agency, not the codebase.
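One way the client-as-config idea might look, sketched with a dataclass rather than an actual file format. The client names, the `ClientConfig` fields, the verdict labels, and the stricter 40-conversion floor are illustrative assumptions; only the 3.0x and $30 goals come from the text above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ClientConfig:
    client: str
    primary_kpi: str       # "roas" (higher is better) or "cpa" (lower is better)
    goal: float
    min_conversions: int   # per-channel noise floor for this client


ROAS_CLIENT = ClientConfig("client_a", "roas", 3.0, 25)
CPA_CLIENT = ClientConfig("client_b", "cpa", 30.0, 40)  # stricter floor is hypothetical


def verdict(cfg: ClientConfig, spend: float, revenue: float, conversions: int) -> str:
    """Derive one channel's verdict against whichever KPI the client cares about."""
    if conversions < cfg.min_conversions or conversions <= 0 or spend <= 0:
        return "indeterminate"
    if cfg.primary_kpi == "roas":
        return "meets_goal" if revenue / spend >= cfg.goal else "misses_goal"
    if cfg.primary_kpi == "cpa":
        return "meets_goal" if spend / conversions <= cfg.goal else "misses_goal"
    raise ValueError(f"unknown KPI: {cfg.primary_kpi}")
```

The same delivery row can land differently per client: 2,000 in spend, 7,000 in revenue, and 50 conversions is a 3.5x ROAS (meets a 3.0x goal) but a $40 CPA (misses a $30 goal). Nothing in `verdict` changes between the two; only the config does.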

The takeaway

An LLM analysis agent earns trust the same way a good analyst does: by knowing when it doesn’t know. Build the gate first, grade the decisions not the prose, and let the model do only the part it’s good at. The green number that the agent declined to celebrate is the feature, not the bug.