Never Bill a Sentence the Chart Didn’t Write: Evidence Gates for Clinical Charge Capture
There’s a category of software problem where being wrong is not a bug report — it’s a liability. Clinical charge capture is one of them. A hospital discharges a patient, a coder assigns billing codes, and somewhere in the gap between what actually happened and what got coded, real money goes missing: a documented complication that was never coded to its specific, higher-reimbursing code; a chronic condition that quietly raises the case’s risk adjustment but wasn’t captured. Reading every chart a second time to find that money is exactly the kind of tedious, high-volume pattern-matching that language models are good at.
It’s also exactly the kind of task where a confident, fluent, wrong answer is the most expensive thing you can ship.
Because the failure mode isn’t “the model missed some revenue.” Hospitals live with that today. The failure mode is the model adding revenue that the record doesn’t support — suggesting a billable code with a plausible-sounding justification that, on inspection, isn’t in the chart. Do that at scale and you haven’t built a revenue-integrity tool; you’ve built an over-billing engine with a compliance trail pointing straight back at your customer. In a domain where the value proposition is measured in millions of recovered dollars, a hallucinated code is not a rounding error. It’s the whole risk.
I built a small tool to work through this — a “chart-to-charge gap finder” — and the interesting part was never the finding. It was the refusing.
The finder is the easy half
Ask a good model to read a clinical record and propose codes the documentation supports, give it the codes already billed so it doesn’t repeat them, constrain it to a known code set so it can’t invent one, and it will do a creditable job. Structured output makes the result machine-readable. You get back a list of candidate charges with rationales and confidences. That part is a Tuesday.
The problem is that everything the model returns looks equally trustworthy. The well-supported suggestion and the confident fabrication arrive in the same JSON shape, with the same calm tone, the same 0.9 confidence. You cannot tell them apart by reading them. And “the model said it confidently” is not a fact about the world — it’s a fact about the model.
So the actual product question is: before this suggestion is allowed in front of a human coder, what has to be true?
Make the model quote the chart — then don’t trust the quote
The design that worked is boring, and that’s the point. Every suggested charge must carry a citation: the id of a chart source (a note, the problem list, a med) and the exact text the model is relying on. Not a paraphrase — a verbatim quote.
Then a deterministic gate — plain code, no model — checks four things, and it is the single source of truth:
- The code is in the catalog. The model can’t bill something outside the known set.
- There’s a citation at all. A suggestion with nothing behind it is never surfaced. This is the “no unsupported add this code” rule.
- The quote actually appears in the chart. Every cited span is re-verified as a literal substring of the source it claims to come from. If the model quoted text that isn’t in the record, the suggestion is dropped — even at 0.99 confidence.
- The code isn’t already billed. A duplicate isn’t a gap.
That third rule is the one that earns its keep. It turns “trust me” into “show me,” and it does it mechanically. A model that fabricates a supporting quote — the single most dangerous thing it can do here — trips a string check and disappears. The gate doesn’t need to be smarter than the model. It just needs to be checkable, and the model’s job is reduced to something a computer can verify: point at the words.
There’s a subtlety worth stating: verifying that a quote exists is not the same as verifying that the quote supports the code. My gate guarantees the evidence is real; a clinician still confirms the evidence is sufficient. That division of labor is the honest one. Software should enforce the parts it can prove and hand a human the parts it can’t — not blur the two.
Confidence is a priority, not a permission
The instinct is to gate on confidence: below some threshold, drop it. I think that’s backwards for this domain. A plausibly-supported, low-confidence finding is exactly what a second-level review exists to catch — the subtle one the first pass missed. Hiding it because the model was unsure defeats the purpose.
So confidence never suppresses a finding in my tool. It only sorts it. What suppresses a finding is missing or fabricated evidence — an objective, defensible standard — not the model’s self-reported mood. Low confidence gets a flag and still goes to the queue. The only thing that never happens is a charge landing on a bill without a human accepting it. There is no auto-apply path in the code at all. On money and compliance, the default is a person.
Why this lives in code, not in the prompt
You could write all of this into the system prompt: “only propose codes you can quote, never fabricate, always cite.” I do write that. And I assume the model will sometimes ignore it, because models do.
A prompt is a request. A gate is a guarantee. The prompt makes the model more likely to behave; the deterministic check makes misbehavior not matter, because the fabricated citation gets dropped whether or not the instruction was followed. When the cost of a single bad output is a compliance finding, you want the guarantee, and you want it somewhere you can unit-test. The best test in my little codebase isn’t a model eval — it’s a five-line assertion that a high-confidence suggestion with a made-up quote comes back suppressed, every time.
That’s the general shape I’d argue for in any LLM feature where the output does something irreversible or expensive: let the model generate freely, force it to ground each claim in something verifiable, and put a dumb, deterministic, well-tested gate between the model and the action. The intelligence goes in the finding. The trust goes in the gate. Keep them separate, and you can move fast on the first without ever gambling on the second.