A False Signal Is Worse Than a Missed One

There’s a whole category of AI product that reads large volumes of public documents — government board minutes, procurement RFPs, budgets, filings — and turns them into sales signals: this district has money bonded for a security upgrade, that city’s SCADA system is out of warranty and under a consent decree, this county’s IT director is named in the agenda. A sales rep gets a tip months before a formal RFP exists, while the deal is still shapeable.

It’s a great use of an LLM. It’s also a use where the obvious way to build it is quietly dangerous, because the failure modes are asymmetric.

Think about what happens when the model is wrong in each direction. If it misses a signal, a rep never sees a lead they might have worked — an opportunity cost, invisible, absorbed into the noise of a big pipeline. If it invents a signal — hallucinates a budget figure, reads a routine $150 filing fee as procurement intent, attributes a decision to the wrong person — a rep spends real hours chasing a deal that was never there, and then stops trusting the tool. The first bad tip is expensive; it’s also the last one the rep takes at face value. For a sales-intelligence product, precision is the whole franchise.

So the design goal isn’t “extract as much as possible.” It’s “never ship a signal you can’t defend.” Two disciplines get you there: grounding and abstention. Both are simple to state and easy to skip.

Grounding: every signal cites a verbatim span

Grounding means each extracted signal carries a quote that is actually present in the source document — not a paraphrase, not a plausible summary, an exact span a human can click and read. This is worth stating as a hard invariant:

A signal ships only if its quote is a substring of the source. If it isn’t, the signal is dropped — no matter how confident the model was.

That last clause is the important one. LLMs are fluent, and fluency reads as confidence. A model will happily return budget: $9.9M with a nicely worded rationale and a confidence: 0.98, and the number will be nowhere in the document. Citation has to beat confidence. In practice that’s a gate the extractor’s output passes through before anything reaches a user:

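def normalize(text: str) -> str:
    return " ".join(text.split()).lower()   # collapse whitespace, lowercase

def is_grounded(quote: str, source: str) -> bool:
    q = normalize(quote)           # an empty quote can never be grounded
    return bool(q) and q in normalize(source)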

Whitespace- and case-insensitive substring matching is deliberately dumb, and that’s a feature: it’s cheap, it’s deterministic, and it can’t itself hallucinate. A signal that survives it is anchored to text a person can audit. A signal that fails it gets written to a “withheld” list, kept for debugging and never surfaced. You’d be surprised how often, on real documents, that list is non-empty. That’s the gate earning its keep.
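Here is a minimal sketch of that gate, splitting candidates into shipped and withheld. The Signal fields, the apply_gate name, and the sample source text (including the $4.2M figure) are assumptions made up for illustration; only the $9.9M / 0.98 example comes from the discussion above.

```python
from dataclasses import dataclass


@dataclass
class Signal:
    kind: str          # hypothetical label, e.g. "budget" or "timing"
    value: str         # the extracted claim
    quote: str         # span the model says it copied from the source
    confidence: float  # model-reported; deliberately ignored by the gate


def normalize(text: str) -> str:
    """Collapse runs of whitespace and lowercase."""
    return " ".join(text.split()).lower()


def is_grounded(quote: str, source: str) -> bool:
    q = normalize(quote)
    return bool(q) and q in normalize(source)


def apply_gate(signals, source):
    """Partition signals: grounded ones ship, the rest are withheld."""
    shipped, withheld = [], []
    for s in signals:
        (shipped if is_grounded(s.quote, source) else withheld).append(s)
    return shipped, withheld


# Invented sample text; note the line break inside the quoted span.
source = "The Board approved $4.2M in bond proceeds\nfor campus security upgrades."
candidates = [
    Signal("budget", "$4.2M", "approved $4.2M in bond proceeds for campus", 0.71),
    Signal("budget", "$9.9M", "a total budget of $9.9M", 0.98),  # quote not in source
]
shipped, withheld = apply_gate(candidates, source)
```

The lower-confidence signal ships because its quote is in the text; the 0.98 one lands on the withheld list because its quote is not. Confidence never enters the decision.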

The prompt does its part too: instruct the model, in no uncertain terms, that quote must be copied verbatim and that anything it can’t quote should not be emitted. But you don’t trust the prompt — you verify its output. The prompt is a request; the grounding gate is the enforcement.

Abstention: returning nothing is a correct answer

The second discipline is harder culturally than technically, because it runs against the grain of how we demo AI. We love to show the model finding something. But a huge fraction of public documents contain no buying signal at all — a proclamation honoring a retiring librarian, a zoning variance for a backyard garage, a fee schedule. The correct output for those is nothing.

An extractor that always finds “something” is an extractor that manufactures signals to fill the response. So abstention has to be a first-class, blessed outcome: an empty result is success, not a bug. Concretely, that means an empty result must be accepted as valid output wherever the extractor’s result is consumed, and documents whose correct answer is nothing must be part of the graded set.

That last point is where most teams have a gap. It’s easy to measure recall on the documents that have signals. It’s the no-signal documents, the ones where a false positive is most damaging, that need to be in the graded set, weighted like the landmines they are.
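To see why no-signal documents belong in the graded set, here is a small sketch that runs a deliberately bad keyword baseline against one. Both extractors, the passes_abstain_case helper, and the sample document are hypothetical stand-ins, not part of any real tool.

```python
import re


def naive_keyword_extractor(text: str) -> list:
    """Deliberately bad baseline: treats every dollar amount as a budget signal."""
    return [
        {"kind": "budget", "quote": m.group(0)}
        for m in re.finditer(r"\$\d[\d,]*(?:\.\d+)?", text)
    ]


def abstaining_extractor(text: str) -> list:
    """Stand-in for an extractor that correctly returns nothing here."""
    return []


def passes_abstain_case(extractor, document: str) -> bool:
    """A must-abstain case passes only when the result is empty."""
    return extractor(document) == []


# Invented no-signal document: a proclamation plus a routine fee.
MUST_ABSTAIN_DOC = (
    "Proclamation honoring a retiring librarian. "
    "Variance applications carry a routine filing fee of $150."
)

naive_ok = passes_abstain_case(naive_keyword_extractor, MUST_ABSTAIN_DOC)
abstain_ok = passes_abstain_case(abstaining_extractor, MUST_ABSTAIN_DOC)
```

The baseline fails the case because it reports the $150 fee as a budget. Without a must-abstain document in the set, that failure would never show up in a score.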

The eval loop makes it real

None of this is trustworthy without measurement. “The prompt seems good” is not a guarantee; the next model version, or a well-meaning prompt tweak, can quietly start over-firing. So the extractor sits behind a small eval harness: a set of graded documents with expected signal types, plus explicit must_abstain cases, scored on four axes that map directly to product risk.
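A harness like this can be sketched in a few lines. The three metrics below (precision, recall, abstention accuracy) are illustrative assumptions and may not match the axes a production harness scores; the Case structure, the toy extractor, and the sample documents are likewise invented.

```python
from dataclasses import dataclass, field


@dataclass
class Case:
    doc_id: str
    text: str
    expected_kinds: set = field(default_factory=set)  # empty set means must_abstain

    @property
    def must_abstain(self) -> bool:
        return not self.expected_kinds


def score(extractor, cases):
    """Score an extractor on signal kinds across all graded cases."""
    tp = fp = fn = 0
    abstain_ok = abstain_total = 0
    for case in cases:
        found = {s["kind"] for s in extractor(case.text)}
        tp += len(found & case.expected_kinds)
        fp += len(found - case.expected_kinds)
        fn += len(case.expected_kinds - found)
        if case.must_abstain:
            abstain_total += 1
            abstain_ok += int(not found)
    return {
        "precision": tp / (tp + fp) if tp + fp else 1.0,
        "recall": tp / (tp + fn) if tp + fn else 1.0,
        "abstention_accuracy": abstain_ok / abstain_total if abstain_total else 1.0,
    }


def toy_extractor(text):
    """Hypothetical over-eager extractor: any dollar sign counts as a budget."""
    return [{"kind": "budget"}] if "$" in text else []


cases = [
    Case("board-minutes", "Bond proceeds of $4.2M; bids due in March.", {"budget", "timing"}),
    Case("fee-schedule", "Routine filing fee of $150."),  # must_abstain
]
result = score(toy_extractor, cases)
```

On these two cases the toy extractor scores 0.5 precision, 0.5 recall, and 0.0 abstention accuracy: it misses the timing signal and fires on the fee schedule.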

Because the harness scores whatever engine is active, it doubles as a regression gate for a prompt change or a provider swap. That matters in this space: the serious teams don’t default to one model, they benchmark OpenAI vs. Anthropic vs. Gemini per task and choose on quality and cost. A shared graded set is what makes that a measurement instead of a vibe — you can only “pick the better model” if you can say what better means, on your documents, in numbers.
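As a rough sketch of that comparison, the snippet below scores two stand-in engines on a shared graded set, keeps those that clear a quality floor, and picks the cheapest. The engine names, the per-document costs, the exact_match_rate metric, and the sample cases are all hypothetical; real engines would wrap provider calls.

```python
def exact_match_rate(extractor, cases):
    """Fraction of graded cases where extracted kinds equal the expected kinds."""
    hits = sum(
        1 for text, expected in cases
        if {s["kind"] for s in extractor(text)} == expected
    )
    return hits / len(cases)


def choose_engine(engines, cases, min_quality):
    """engines maps name -> (extractor, cost_per_doc). Returns (best_name, report)."""
    report = {
        name: (exact_match_rate(fn, cases), cost)
        for name, (fn, cost) in engines.items()
    }
    passing = [(cost, name) for name, (quality, cost) in report.items()
               if quality >= min_quality]
    best = min(passing)[1] if passing else None  # cheapest engine above the floor
    return best, report


def careful(text):
    """Stand-in engine: fires only when the text mentions a bond."""
    return [{"kind": "budget"}] if "bond" in text.lower() else []


def eager(text):
    """Stand-in engine: fires on any dollar sign."""
    return [{"kind": "budget"}] if "$" in text else []


# Invented graded set: (document text, expected signal kinds).
cases = [
    ("Bond proceeds of $4.2M approved.", {"budget"}),
    ("Routine filing fee of $150.", set()),  # must abstain
]
engines = {"engine_a": (careful, 0.004), "engine_b": (eager, 0.001)}
best, report = choose_engine(engines, cases, min_quality=1.0)
```

The cheaper engine loses here because it fails the must-abstain case. Run on every prompt change, the same function works as a regression gate: a None result means nothing cleared the floor and the change should not ship.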

A small runnable version

To make this concrete we built signalscope, a ~400-line Python example that implements exactly this loop: it reads synthetic-but-realistic public records (a set of school-board minutes with a bonded security program, a municipal SCADA RFP, and a planning-commission record with a proclamation and a routine fee), extracts budget / timing / decision-maker / trigger signals, cites each to a verbatim span, and abstains on the document that supports nothing. It runs on a structured-output LLM call with a deterministic offline fallback, and every run is scored by the eval harness. One detail we like: on the abstain document, the LLM correctly ignores the routine $150 filing fee that a naïve keyword pass reads as budget, which is the exact false positive the whole design exists to prevent.

The lesson generalizes past sales intelligence to any LLM-over-documents product — contract review, compliance monitoring, medical abstraction, due diligence. The value isn’t that the model can read the document. The value is that when it tells you something, you can click the citation and see where it came from — and when the document says nothing, the model has the discipline to say nothing back.