Never Trust the Model’s Answer Key

There’s a specific, quiet way that AI-generated content fails, and assessment is where it does the most damage.

Ask a language model to write a physics problem and it will happily produce one: “An object starts at 18 m/s, accelerates at 3 m/s² for 11 seconds. Find its final velocity and displacement.” It will also, just as happily, hand you an answer key. The problem looks right. The answer key looks right. And sometimes the answer key is simply wrong: off by a sign, a dropped ½, a units slip. If that item is auto-graded, every student who does the arithmetic correctly now gets marked wrong, against a key the model hallucinated.
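To make the failure concrete, here is a minimal sketch that works the example's numbers with the standard constant-acceleration formulas, then shows what a key with the ½ dropped would look like. The names `KinematicsParams` and `solveKinematics` are illustrative assumptions, not taken from any platform.

```typescript
// Illustrative names; constant-acceleration kinematics only.
interface KinematicsParams {
  v0: number; // initial velocity, m/s
  a: number;  // acceleration, m/s^2
  t: number;  // duration, s
}

interface KinematicsAnswers {
  finalVelocity: number; // m/s
  displacement: number;  // m
}

function solveKinematics(p: KinematicsParams): KinematicsAnswers {
  return {
    finalVelocity: p.v0 + p.a * p.t,                  // v = v0 + a*t
    displacement: p.v0 * p.t + 0.5 * p.a * p.t * p.t, // s = v0*t + (1/2)*a*t^2
  };
}

// The example from the text: 18 m/s, 3 m/s^2, 11 s.
const correct = solveKinematics({ v0: 18, a: 3, t: 11 });
// correct.finalVelocity is 51 (m/s); correct.displacement is 379.5 (m)

// A plausible-looking key with the 1/2 dropped from the second term.
const droppedHalfDisplacement = 18 * 11 + 3 * 11 * 11; // 561 (m)
```

A student who correctly answers 379.5 m would be marked wrong against a key of 561 m, and nothing about the key itself looks suspicious.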

This isn’t a prompt-engineering problem you can fully solve with a better prompt. It’s a trust boundary. And the senior move is to treat the model’s output the way you’d treat any untrusted input crossing into a system of record: validate it deterministically, and never let it be the authority on anything that matters.

The pattern: propose, then dispose

I recently built a small companion project to an open-source assessment platform — the kind universities use to run mastery-based learning and large-scale exams. The platform’s model is elegant: an instructor writes a question as code that generates infinite randomized variants of itself and grades them automatically. Increasingly, teams want an LLM to help author those variants — describe a question in English, get a working randomized question back.

The temptation is to wire the model straight through: the LLM generates the parameters and the answer key, and the platform serves the result. That’s the trust-boundary mistake. Here’s the shape I used instead.

1. The LLM proposes. It suggests a new parameterization — new starting velocity, acceleration, duration, maybe a fresh scenario (“a braking car” instead of “an object”). It also claims what the correct answers are. We capture all of it, and we trust none of it.
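A minimal sketch of the capture step, assuming the model returns JSON. The wire format and the names `Proposal` and `parseProposal` are hypothetical. The parser only establishes shape; it says nothing about whether the values or the claimed answers are correct.

```typescript
// Hypothetical wire format for a model proposal; field names are assumptions.
interface Proposal {
  scenario: string;
  params: { v0: number; a: number; t: number };
  claimedAnswers: { finalVelocity: number; displacement: number };
}

// Narrow an unknown value to a plain object, or return null.
function asRecord(x: unknown): Record<string, unknown> | null {
  return typeof x === "object" && x !== null && !Array.isArray(x)
    ? (x as Record<string, unknown>)
    : null;
}

// Returns a typed proposal, or null if the text is not JSON or has the wrong shape.
function parseProposal(raw: string): Proposal | null {
  let data: unknown;
  try {
    data = JSON.parse(raw);
  } catch {
    return null; // not even JSON
  }
  const root = asRecord(data);
  const params = asRecord(root?.params);
  const claimed = asRecord(root?.claimedAnswers);
  if (!root || !params || !claimed) return null;

  const scenario = root.scenario;
  const { v0, a, t } = params;
  const { finalVelocity, displacement } = claimed;
  if (typeof scenario !== "string") return null;
  if (typeof v0 !== "number" || typeof a !== "number" || typeof t !== "number") return null;
  if (typeof finalVelocity !== "number" || typeof displacement !== "number") return null;

  // Copy only the known fields; anything extra the model sent is dropped.
  return { scenario, params: { v0, a, t }, claimedAnswers: { finalVelocity, displacement } };
}
```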

2. A deterministic validator disposes. The question type owns exactly one piece of authoritative math: given the parameters, compute the correct answers. That function is the single source of truth for “what is correct” — and crucially, it’s the same function used to grade real student submissions. When a proposal arrives, the validator:

- checks the parameters are structurally sound and within sane physical ranges (no acceleration of 40 m/s², no negative time, no NaN);
- recomputes the answer key from the parameters using that authoritative function;
- compares the model’s claimed key against the recomputed one, and rejects the whole proposal if they disagree by more than a tight tolerance;
- sanity-checks magnitudes so a technically-correct but absurd variant doesn’t reach a student.

Only if every check passes does the system mint a usable variant — and it carries the recomputed answers, not the model’s. The model’s key is discarded even when it happens to be right, because trusting it is the habit we’re trying to break.
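The four checks can be sketched roughly as follows. This is not the project's actual code: the names, the range limits, the magnitude caps, and the tolerance are all assumptions picked for illustration. The point to notice is that the accepted variant carries the output of `solve`, never the claimed key.

```typescript
interface Params { v0: number; a: number; t: number }
interface Answers { finalVelocity: number; displacement: number }

type Verdict =
  | { ok: true; params: Params; answers: Answers }
  | { ok: false; reasons: string[] };

// The single authoritative function; the same one would grade student submissions.
function solve(p: Params): Answers {
  return {
    finalVelocity: p.v0 + p.a * p.t,
    displacement: p.v0 * p.t + 0.5 * p.a * p.t * p.t,
  };
}

// Assumed limits, for illustration only.
const RANGES: Record<keyof Params, readonly [number, number]> = {
  v0: [-60, 60], // m/s
  a: [-15, 15],  // m/s^2
  t: [0.5, 60],  // s
};
const MAX_SPEED = 150;         // m/s
const MAX_DISPLACEMENT = 5000; // m
const REL_TOLERANCE = 1e-6;

function agrees(claimed: number, actual: number): boolean {
  // NaN fails this comparison, so a NaN claim is rejected.
  return Math.abs(claimed - actual) <= REL_TOLERANCE * Math.max(1, Math.abs(actual));
}

function validateProposal(params: Params, claimed: Answers): Verdict {
  const reasons: string[] = [];

  // 1. Structure and physical ranges.
  (Object.keys(RANGES) as Array<keyof Params>).forEach((key) => {
    const value = params[key];
    const [lo, hi] = RANGES[key];
    if (!Number.isFinite(value)) reasons.push(`${key} is not a finite number`);
    else if (value < lo || value > hi) reasons.push(`${key}=${value} is outside [${lo}, ${hi}]`);
  });
  if (reasons.length > 0) return { ok: false, reasons };

  // 2. Recompute the key from the parameters.
  const answers = solve(params);

  // 3. Compare the claimed key with the recomputed one.
  if (!agrees(claimed.finalVelocity, answers.finalVelocity)) {
    reasons.push(`claimed finalVelocity ${claimed.finalVelocity}, recomputed ${answers.finalVelocity}`);
  }
  if (!agrees(claimed.displacement, answers.displacement)) {
    reasons.push(`claimed displacement ${claimed.displacement}, recomputed ${answers.displacement}`);
  }

  // 4. Magnitude sanity checks on the recomputed answers.
  if (Math.abs(answers.finalVelocity) > MAX_SPEED) reasons.push("final speed is implausibly large");
  if (Math.abs(answers.displacement) > MAX_DISPLACEMENT) reasons.push("displacement is implausibly large");

  if (reasons.length > 0) return { ok: false, reasons };
  return { ok: true, params, answers }; // recomputed answers; the claimed key is dropped
}
```

With these assumed limits, the honest proposal for 18 m/s, 3 m/s², 11 s is accepted, the same parameters with a claimed displacement of 561 m are rejected, and an acceleration of 40 m/s² is rejected before any arithmetic is compared.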

3. Fall back, and say why. If the model is unavailable, returns an error, or proposes something that fails validation, the system falls back to deterministic generation (a plain, self-consistent variant) without the student noticing, and records the rejection reasons for an instructor to review. The feature degrades to “still works, just without AI,” never to “serves a broken question.”
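One way to sketch the fallback decision is as a pure function over the result of the model call. The types and names here (`Attempt`, `Outcome`, `resolveVariant`) are hypothetical, and the sketch assumes the call itself, which would normally be asynchronous, has already completed or failed before this function runs.

```typescript
type Verdict<V> = { ok: true; variant: V } | { ok: false; reasons: string[] };

// What came back from the model call, represented as a plain value.
type Attempt<P> =
  | { kind: "proposed"; proposal: P }
  | { kind: "failed"; error: string }; // unavailable, timed out, or threw

interface Outcome<V> {
  variant: V;                      // always usable, whichever path produced it
  source: "llm" | "deterministic";
  rejectionReasons: string[];      // empty when the proposal was accepted
}

function resolveVariant<P, V>(
  attempt: Attempt<P>,
  validate: (proposal: P) => Verdict<V>,
  fallback: () => V,
): Outcome<V> {
  if (attempt.kind === "failed") {
    return {
      variant: fallback(),
      source: "deterministic",
      rejectionReasons: [`model call failed: ${attempt.error}`],
    };
  }
  const verdict = validate(attempt.proposal);
  if (verdict.ok) {
    return { variant: verdict.variant, source: "llm", rejectionReasons: [] };
  }
  // Validation failed: serve the deterministic variant and keep the reasons for review.
  return { variant: fallback(), source: "deterministic", rejectionReasons: verdict.reasons };
}
```

Every branch returns a usable variant, so the caller has no code path that serves nothing or serves an unvalidated proposal.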

Why this is the right amount of engineering

It would be easy to over-build this: a second LLM to “grade” the first, a confidence score, a human approval queue for every variant. Sometimes you want those. But the cheapest, most reliable guard is the one that costs almost nothing and can’t be fooled: you already have the correct-answer function, because you need it to grade students anyway. Reusing it as the validator means the AI path and the grading path can never drift apart. There’s no separate “is this right?” logic to maintain and no way for the model to talk its way past it.

This mirrors how well-designed AI-assisted grading workflows operate: the model proposes a score, a human with a clear rubric reviews it, and the two are shown side by side with disagreements flagged. Proposition and disposition are separated. The model is a fast first draft; a deterministic or human authority makes the call. AI as proposer, not decider, is the pattern that survives contact with high-stakes use.

What it looks like in practice

The whole thing is a few hundred lines of TypeScript. The core is framework-free and pure — parameters in, correct answers out — which makes it trivial to test, and testing is where the value lives. The suite I care about isn’t “does the UI render”; it’s the adversarial case: when the model lies about the answer key, does a wrong key ever reach a student? You assert that whatever variant the system hands back, its stored answers equal your own recomputation — every time, across the honest case, the out-of-range case, and the model-lies case. That one invariant, pinned by a test, is the feature.
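A sketch of that invariant as a framework-free check, assuming a hypothetical `mint` function that stands in for the whole pipeline (proposal in, served variant out). The two stand-in pipelines at the bottom are deliberately trivial: one trusts the model's key, the other recomputes it. A fuller suite would also assert that the out-of-range case comes back with in-range parameters; only the key invariant is shown here.

```typescript
interface Params { v0: number; a: number; t: number }
interface Answers { finalVelocity: number; displacement: number }
interface Proposal { params: Params; claimed: Answers }
interface Variant { params: Params; answers: Answers }

// The test's own recomputation of the correct answers.
function solve(p: Params): Answers {
  return {
    finalVelocity: p.v0 + p.a * p.t,
    displacement: p.v0 * p.t + 0.5 * p.a * p.t * p.t,
  };
}

// Throws if any variant handed back stores a key that differs from recomputation.
function assertKeyInvariant(mint: (proposal: Proposal) => Variant): void {
  const base: Params = { v0: 18, a: 3, t: 11 };
  const wild: Params = { v0: 18, a: 40, t: 11 };
  const cases: Array<[string, Proposal]> = [
    ["honest", { params: base, claimed: solve(base) }],
    ["out-of-range", { params: wild, claimed: solve(wild) }],
    ["model lies", { params: base, claimed: { finalVelocity: 51, displacement: 561 } }],
  ];
  cases.forEach(([name, proposal]) => {
    const variant = mint(proposal);
    const expected = solve(variant.params);
    const matches =
      variant.answers.finalVelocity === expected.finalVelocity &&
      variant.answers.displacement === expected.displacement;
    if (!matches) throw new Error(`${name}: stored key differs from recomputation`);
  });
}

// Stand-in pipelines, for illustration only.
const trustingMint = (p: Proposal): Variant => ({ params: p.params, answers: p.claimed });
const recomputingMint = (p: Proposal): Variant => ({ params: p.params, answers: solve(p.params) });
```

Calling `assertKeyInvariant(recomputingMint)` returns normally, while `assertKeyInvariant(trustingMint)` throws on the "model lies" case, which is exactly the regression this test exists to catch.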

Everything else is ergonomics: partial credit so a student who gets one of two fields right earns half; a distinction between a wrong answer and an unparseable input (a typo should say “that’s not a number,” not silently score zero); reproducible variants from a seed so any instance can be re-graded later.
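Those three ergonomic pieces might look something like this. Everything here is an assumption made for illustration: the field names, the 0.1% grading tolerance, the parameter ranges, and the choice of mulberry32 (a small, widely circulated seeded PRNG) in place of whatever generator a real platform would supply.

```typescript
type FieldResult = "correct" | "incorrect" | "unparseable";

// Distinguish "not a number" from "a wrong number".
function parseNumber(input: string): number | null {
  const trimmed = input.trim();
  if (trimmed === "") return null; // Number("") would be 0, which is misleading
  const value = Number(trimmed);
  return Number.isFinite(value) ? value : null;
}

function gradeField(input: string, expected: number, relTol = 1e-3): FieldResult {
  const value = parseNumber(input);
  if (value === null) return "unparseable";
  const close = Math.abs(value - expected) <= relTol * Math.max(1, Math.abs(expected));
  return close ? "correct" : "incorrect";
}

interface Key { finalVelocity: number; displacement: number }
interface Submission { finalVelocity: string; displacement: string }

// Partial credit: each field is worth an equal share of the score.
function grade(sub: Submission, key: Key): { fields: Record<keyof Key, FieldResult>; score: number } {
  const fields = {
    finalVelocity: gradeField(sub.finalVelocity, key.finalVelocity),
    displacement: gradeField(sub.displacement, key.displacement),
  };
  const results = [fields.finalVelocity, fields.displacement];
  const score = results.filter((r) => r === "correct").length / results.length;
  return { fields, score };
}

// mulberry32: the same seed always yields the same sequence of values in [0, 1).
function mulberry32(seed: number): () => number {
  let a = seed >>> 0;
  return () => {
    a = (a + 0x6d2b79f5) | 0;
    let t = Math.imul(a ^ (a >>> 15), 1 | a);
    t = (t + Math.imul(t ^ (t >>> 7), 61 | t)) ^ t;
    return ((t ^ (t >>> 14)) >>> 0) / 4294967296;
  };
}

// Rebuild a variant's parameters from its seed so it can be re-graded later.
function paramsFromSeed(seed: number): { v0: number; a: number; t: number } {
  const rand = mulberry32(seed);
  const pick = (lo: number, hi: number) => lo + Math.floor(rand() * (hi - lo + 1));
  return { v0: pick(5, 30), a: pick(1, 5), t: pick(2, 12) };
}
```

Under these assumptions, a submission of "51" and "561" against the key 51 and 379.5 scores 0.5, and an entry like "5l" is reported as unparseable instead of being counted as a wrong number.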

The takeaway

LLMs are genuinely useful for authoring assessment content — they turn a paragraph of intent into a working, randomized, self-grading question in seconds. But “useful for drafting” and “trusted as the authority” are different jobs, and conflating them is how you ship a question that punishes correct students.

Draw the trust boundary explicitly. Let the model propose. Keep one deterministic function as the authority, reuse it everywhere correctness is decided, and reject anything that can’t survive a recomputation. It’s less code than the alternatives, and it’s the version you can put in front of a real exam.