The LLM Suggests. The Rules Engine Decides.

There is a specific class of software where “the model is usually right” is not good enough: anything that moves money or produces a legal artifact. Reimbursements. Invoices. Claims. Tax. Payroll. If a system approves a $412 health-expense reimbursement, someone may eventually have to explain — to an auditor, a regulator, or a customer’s finance team — exactly why. “The language model was fairly confident” is not an answer you want to give.

We build a lot of this kind of system, and the architecture that keeps us out of trouble is boring and old: a deterministic rules engine is the source of truth, and the LLM is a suggester on a leash. Below is why, using a concrete example: classifying an ICHRA (Individual Coverage Health Reimbursement Arrangement) health-expense claim as reimbursable, not reimbursable, or needs-review. The pattern generalizes to any compliance or money decision.

The temptation, and why it’s a trap

The naïve version is seductive because it’s so easy to build: hand the whole claim to an LLM, give it the plan rules in the prompt, and ask “should we reimburse this? Return JSON.” It demos beautifully. It also has three properties that are disqualifying for money:

It’s non-deterministic. The same claim, submitted twice, can get two different answers. For a correctness-critical decision, that alone ends the conversation.
It’s unauditable. “Explain your reasoning” produces a story, not the actual causal chain. You cannot point to the rule that fired, because there was no rule — there was a probability distribution over tokens.
It fails confidently. The dangerous errors aren’t the ones where the model says “I’m not sure.” They’re the ones where it’s wrong and self-assured, and it cheerfully approves an ineligible expense with a fluent justification.

None of these are fixed by a better prompt or a bigger model. They’re structural.

The shape that works

Split the problem into two responsibilities that never blur:

Categorization — mapping messy reality (“paid Blue Shield $412 for July”) onto a small, closed set of known categories. This is genuinely hard for rigid code and genuinely easy for an LLM. Let the model do it.
Decisioning — given a category and the facts, deciding eligibility and dollars. This must be deterministic, testable, and explainable. Never let the model near it.

In our ICHRA classifier the deterministic engine is a handful of pure functions. Each takes the claim, the plan, and the employee’s remaining allowance, and returns exactly one outcome with a reason:

Is the amount positive?
Is the employee enrolled in qualifying individual coverage? (Under an ICHRA, no enrollment means no reimbursement — full stop.)
Is the category eligible for this specific plan (premiums-only vs. premiums plus qualified medical expenses under Section 213(d) of the Internal Revenue Code)?
Was substantiation provided?
Is it within the remaining monthly allowance?
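
To make the list above concrete, here is a minimal sketch of what such pure-function rules could look like. Everything named here (Claim, Plan, RuleResult, the category strings, the helper _result) is hypothetical and chosen for illustration, not taken from our production code. The source only pins down the outcome for missing enrollment (deny); the outcomes for missing substantiation and for exceeding the allowance are assumptions and are marked as such in the comments.

```python
from dataclasses import dataclass
from decimal import Decimal
from enum import Enum
from typing import FrozenSet


class Outcome(Enum):
    PASS = "pass"
    NEEDS_REVIEW = "needs_review"
    DENY = "deny"


@dataclass(frozen=True)
class Claim:
    amount: Decimal
    category: str        # hypothetical closed set: "premium", "medical_213d", "unknown", ...
    substantiated: bool  # receipt or other proof was attached
    enrolled: bool       # employee is enrolled in qualifying individual coverage


@dataclass(frozen=True)
class Plan:
    eligible_categories: FrozenSet[str]


@dataclass(frozen=True)
class RuleResult:
    rule: str
    outcome: Outcome
    reason: str


def _result(rule: str, ok: bool, fail: Outcome, pass_reason: str, fail_reason: str) -> RuleResult:
    # Every rule returns exactly one outcome plus a human-readable reason.
    return RuleResult(rule, Outcome.PASS if ok else fail, pass_reason if ok else fail_reason)


def amount_is_positive(claim: Claim, plan: Plan, remaining: Decimal) -> RuleResult:
    return _result("amount_is_positive", claim.amount > 0, Outcome.DENY,
                   "amount is positive", "amount must be greater than zero")


def employee_is_enrolled(claim: Claim, plan: Plan, remaining: Decimal) -> RuleResult:
    return _result("employee_is_enrolled", claim.enrolled, Outcome.DENY,
                   "enrolled in individual coverage", "no qualifying individual coverage")


def category_is_eligible(claim: Claim, plan: Plan, remaining: Decimal) -> RuleResult:
    if claim.category == "unknown":
        # An undetermined category is a question for a human, not a denial.
        return RuleResult("category_is_eligible", Outcome.NEEDS_REVIEW, "category unknown")
    return _result("category_is_eligible", claim.category in plan.eligible_categories,
                   Outcome.DENY, "category covered by plan", "category not covered by plan")


def substantiation_provided(claim: Claim, plan: Plan, remaining: Decimal) -> RuleResult:
    # Assumption: missing paperwork routes to review rather than a hard denial.
    return _result("substantiation_provided", claim.substantiated, Outcome.NEEDS_REVIEW,
                   "substantiation attached", "no substantiation attached")


def within_allowance(claim: Claim, plan: Plan, remaining: Decimal) -> RuleResult:
    # Assumption: an over-allowance claim is reviewed, not auto-denied.
    return _result("within_allowance", claim.amount <= remaining, Outcome.NEEDS_REVIEW,
                   "within remaining allowance", "exceeds remaining allowance")


RULES = (amount_is_positive, employee_is_enrolled, category_is_eligible,
         substantiation_provided, within_allowance)
```

Because each function depends only on its arguments, each one can be unit-tested in isolation, and amounts are held as Decimal so that currency comparisons never depend on floating-point rounding.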

The engine runs every rule, every time, and records every outcome — including the ones that passed — into an audit trail. The final decision is simply the most restrictive outcome: if anything says “deny,” it’s denied; if anything says “needs review” and nothing denies, a human looks at it; only if everything passes is it approved. You can read the whole decision back, line by line, months later. That’s not a logging nicety — for a compliance system it is the product.
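
The "most restrictive outcome wins" step described above is small enough to sketch in full. This is an illustrative version under assumptions: the names Outcome, RuleResult, decide, and the two toy rules are hypothetical, the claim is a plain dict to keep the example short, and the choice to send an empty rule set to review is ours for the sketch. The one idea that matters is that outcomes are ordered by restrictiveness, so taking the maximum implements the precedence of deny over review over pass.

```python
from dataclasses import dataclass
from enum import IntEnum
from typing import Callable, List, Tuple


class Outcome(IntEnum):
    # Ordered by restrictiveness, so max() selects the strictest outcome.
    PASS = 0
    NEEDS_REVIEW = 1
    DENY = 2


@dataclass(frozen=True)
class RuleResult:
    rule: str
    outcome: Outcome
    reason: str


Rule = Callable[[dict], RuleResult]


def decide(claim: dict, rules: List[Rule]) -> Tuple[Outcome, List[RuleResult]]:
    # Run every rule with no short-circuiting, so passing rules are recorded too.
    audit_trail = [rule(claim) for rule in rules]
    if not audit_trail:
        # Assumption: with no rules configured, nothing may be approved automatically.
        return Outcome.NEEDS_REVIEW, audit_trail
    # A final PASS means every rule passed, i.e. the claim is approved.
    return max(result.outcome for result in audit_trail), audit_trail


# Two toy rules, just enough to exercise the aggregation.
def positive_amount(claim: dict) -> RuleResult:
    ok = claim["amount"] > 0
    return RuleResult("positive_amount", Outcome.PASS if ok else Outcome.DENY,
                      "amount is positive" if ok else "amount must be greater than zero")


def has_receipt(claim: dict) -> RuleResult:
    ok = claim["has_receipt"]
    return RuleResult("has_receipt", Outcome.PASS if ok else Outcome.NEEDS_REVIEW,
                      "receipt attached" if ok else "no receipt attached")


final, trail = decide({"amount": 412, "has_receipt": False}, [positive_amount, has_receipt])
for entry in trail:
    print(entry.rule, entry.outcome.name, entry.reason)
print("final:", final.name)
```

Note that a denied claim still carries the results of the rules that passed. Stopping at the first failure would be cheaper, but it would leave gaps in the trail.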

Where the LLM actually lives

The model’s only job is to produce a category when the claim arrives as free text instead of a structured field. It returns a category from the closed set — it cannot invent one — plus a confidence. Then two thresholds do the load-bearing work:

A floor (say 0.50): below it, we discard the model’s label entirely and treat the category as unknown, which routes the claim to human review.
An auto-approval bar (say 0.85): an LLM-derived category may only result in an automatic payout above this line. In the gap between the two thresholds, the suggestion is good enough to inform a human but not good enough to spend money on its own.
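
The two thresholds above can be sketched as a single function. This is a hedged illustration, not our exact code: Category, GatedCategory, and gate are hypothetical names, the category members are examples, and the boundary handling (a confidence exactly at 0.85 counts as auto-approvable, exactly at 0.50 is kept) is an assumption, since the prose only says "below" and "above".

```python
from dataclasses import dataclass
from enum import Enum

CONFIDENCE_FLOOR = 0.50   # below this, the model's label is discarded
AUTO_APPROVAL_BAR = 0.85  # at or above this, the label may drive an automatic payout


class Category(Enum):
    # The closed set. Example members only.
    PREMIUM = "premium"
    MEDICAL_213D = "medical_213d"
    GYM_MEMBERSHIP = "gym_membership"
    UNKNOWN = "unknown"


@dataclass(frozen=True)
class GatedCategory:
    category: Category
    auto_approvable: bool


def gate(raw_label: str, confidence: float) -> GatedCategory:
    abstain = GatedCategory(Category.UNKNOWN, False)
    try:
        suggested = Category(raw_label)
    except ValueError:
        # The label is outside the closed set: the model cannot invent a category.
        return abstain
    # Rejects out-of-range values and NaN, since NaN fails both comparisons.
    if not (0.0 <= confidence <= 1.0):
        return abstain
    if suggested is Category.UNKNOWN or confidence < CONFIDENCE_FLOOR:
        return abstain
    # Between the floor and the bar the label is kept, but only to inform a reviewer.
    return GatedCategory(suggested, confidence >= AUTO_APPROVAL_BAR)
```

With these values, gate("premium", 0.92) keeps the label and allows automatic approval, gate("premium", 0.70) keeps the label for the reviewer only, and both gate("premium", 0.30) and gate("spa_day", 0.99) come back as unknown.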

This is the abstain gate, and it’s the most important twenty lines in the system. It encodes a rule that no amount of model quality should override: the machine does not get to confidently approve money on a guess. When it isn’t sure enough, it says so and hands the claim to a person — with the model’s best suggestion attached, so the human starts ahead rather than from scratch.

There’s a guard clause that makes this airtight: even when the rules would approve a claim, if the category came from the LLM below the auto-approval bar, the decision is downgraded to needs-review. And crucially, the model can never push a decision in the other direction: a confident LLM labeling something “gym membership” can’t make it reimbursable, because the rules engine flatly denies that category regardless of where the label came from. The LLM can only ever lower the ceiling, never raise it.
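
A minimal sketch of that guard clause, assuming the rules engine has already produced its decision. The names Decision and apply_llm_guard are hypothetical, and treating a confidence exactly at the bar as sufficient is an assumption. The property worth checking is the last one in the paragraph above: whatever the inputs, the output is never less restrictive than what the rules decided.

```python
from enum import IntEnum

AUTO_APPROVAL_BAR = 0.85


class Decision(IntEnum):
    # Ordered by restrictiveness.
    APPROVED = 0
    NEEDS_REVIEW = 1
    DENIED = 2


def apply_llm_guard(rules_decision: Decision, category_from_llm: bool,
                    confidence: float) -> Decision:
    # "not (confidence >= bar)" also catches NaN, which fails every comparison.
    unsure = category_from_llm and not (confidence >= AUTO_APPROVAL_BAR)
    if rules_decision is Decision.APPROVED and unsure:
        # The rules would pay, but the category is an under-confident guess.
        return Decision.NEEDS_REVIEW
    # In every other case the rules engine's decision stands untouched.
    return rules_decision


# The guard can only keep or tighten a decision, never loosen it.
for decision in Decision:
    for from_llm in (True, False):
        for conf in (0.0, 0.5, 0.84, 0.85, 1.0):
            assert apply_llm_guard(decision, from_llm, conf) >= decision
```

The function has no branch that returns APPROVED unless the rules engine already said so, which is what "lower the ceiling, never raise it" means in code.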

Graceful degradation is a feature

Because the LLM is confined to categorization, the system has an obvious fallback: if there’s no API key, or the call times out, or the provider has an outage, the claim simply routes to human review. Nothing breaks. Nothing gets guessed. A compliance system that quietly keeps working — more conservatively — when its AI dependency is unavailable is far more trustworthy than one that either fails hard or, worse, starts approving on degraded signal.
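
The fallback can be sketched as a thin wrapper around whatever client calls the model. The Suggester signature, the "unknown" sentinel, and categorize_or_abstain are hypothetical names for this illustration, and the wrapper assumes the caller passes None when no API key is configured.

```python
from typing import Callable, Optional, Tuple

UNKNOWN = "unknown"

# Hypothetical client interface: free text in, (label, confidence) out.
Suggester = Callable[[str], Tuple[str, float]]


def categorize_or_abstain(text: str, suggest: Optional[Suggester]) -> Tuple[str, float]:
    if suggest is None:
        # No API key or no client configured: abstain instead of guessing.
        return UNKNOWN, 0.0
    try:
        label, confidence = suggest(text)
    except Exception:
        # Deliberately broad: a timeout, a provider outage, and a malformed
        # response all lead to the same conservative result.
        return UNKNOWN, 0.0
    return label, confidence


def provider_outage(text: str) -> Tuple[str, float]:
    # Stand-in for a real client during an outage.
    raise TimeoutError("provider did not respond")


print(categorize_or_abstain("paid Blue Shield $412 for July", provider_outage))
print(categorize_or_abstain("paid Blue Shield $412 for July", None))
```

Both calls return the unknown category with zero confidence, which the rest of the pipeline already knows how to handle: it routes the claim to human review.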

The judgment, not the plumbing

Wiring an LLM into an app is cheap now, and getting cheaper. The valuable part is knowing which decisions it’s allowed to make — and building the deterministic spine, the audit trail, and the abstain gate that let you use the model’s strengths without inheriting its failure modes. On anything touching money or the law, that boundary is the whole job.

The rules engine decides. The LLM suggests. Keep them in that order and you get the best of both: the fluency of a model on the messy human input, and the correctness, explainability, and calm-under-outage of code on the part that actually matters.