Streaming a Conversational-AI Trainer — and Teaching It to Say ‘I Don’t Know’

Most “AI feature” demos fail in one of two boring ways. Either the answer arrives in a single silent lump three seconds later — no streaming, no life — or the product confidently scores something it has no business scoring. Both are avoidable, and fixing them is most of what separates a toy from a tool.

I recently built a small Scenario Roleplay Trainer: a person practices a high-stakes conversation — a tough sales objection, a vendor negotiation — against an AI that plays the counterpart, and then gets a scored debrief. It’s a tiny app, but it has to get two hard things right: the roleplayer has to feel live, and the debrief has to be honest. Here’s how both work.

Streaming: the token should appear the instant the model produces it

The roleplayer streams token-by-token, Server-Sent-Events style. The server opens a long-lived response and writes frames as the model emits text; the browser calls fetch(), reads the response with response.body.getReader(), and paints each piece as it arrives.

The wire format is the one I use in my flagship demo, Relay: the Vercel AI SDK data-stream protocol. Each frame is one line of the form <code>:<json>\n, like so:

0:"That's "     ← a text delta, append it
0:"fair, "
0:"but "
e:{"finishReason":"stop","source":"gemini"}   ← done

A text delta is 0:, a finish is e:, an error is 3:. That’s basically the whole vocabulary you need for a chat. The value is JSON-encoded, so quotes and newlines in the model’s output survive transport for free — a detail people discover the hard way when they hand-roll a delimiter and a model says she replied "no".
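To make the framing concrete, here is a minimal sketch of an encoder and an incremental parser for frames of that shape. The names encodeFrame and createFrameParser are hypothetical, not taken from the app or from the AI SDK. The parser buffers partial lines because a network chunk can end in the middle of a frame.

```javascript
// Encode one frame: a type code, a colon, the JSON-encoded value, a newline.
// JSON.stringify escapes quotes and newlines, so a frame is always one line.
function encodeFrame(code, value) {
  return `${code}:${JSON.stringify(value)}\n`;
}

// Incremental parser (hypothetical helper). Feed it decoded text as it
// arrives; each call returns the frames completed so far and keeps any
// trailing partial line buffered until the rest shows up.
function createFrameParser() {
  let buffer = "";
  return function push(chunkText) {
    buffer += chunkText;
    const lines = buffer.split("\n");
    buffer = lines.pop(); // the last element is an incomplete line (or "")
    const frames = [];
    for (const line of lines) {
      const sep = line.indexOf(":");
      if (sep === -1) continue; // skip blank or malformed lines
      frames.push({
        code: line.slice(0, sep),
        value: JSON.parse(line.slice(sep + 1)),
      });
    }
    return frames;
  };
}

// Usage: a frame split across two network chunks still parses once.
const push = createFrameParser();
push('0:"That\'s "\n0:"fa'); // -> one complete frame, one partial kept back
push('ir, "\n');             // -> the second frame, now complete
```

On the client, the text passed to push() would come from decoding each chunk the stream reader returns, for instance with a TextDecoder in streaming mode.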

Three things make streaming feel good rather than merely work:

Never let a proxy buffer it. Set X-Accel-Buffering: no and Cache-Control: no-transform. Otherwise nginx (or your PaaS’s edge) helpfully collects your stream into one chunk and hands the user the exact lump you were trying to avoid.
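As a rough sketch, the headers could be set like this when the response is opened. The openStream helper and the Content-Type value are assumptions for illustration; only the two anti-buffering headers come from the text above. The res argument is expected to behave like Node's http.ServerResponse.

```javascript
// Headers for the streaming endpoint.
const STREAM_HEADERS = {
  "Content-Type": "text/plain; charset=utf-8", // assumption of this sketch
  "Cache-Control": "no-transform",             // tell intermediaries not to rewrite the body
  "X-Accel-Buffering": "no",                   // tell nginx not to buffer the response
};

// Hypothetical helper: send the headers immediately and return a function
// that writes one already-encoded frame to the response.
function openStream(res) {
  res.writeHead(200, STREAM_HEADERS);
  // flushHeaders() is available on http.ServerResponse; the guard keeps the
  // sketch usable with simpler response-like objects.
  if (typeof res.flushHeaders === "function") res.flushHeaders();
  return (frame) => res.write(frame);
}
```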

Smooth the output. Models don’t emit one word at a time — they emit bursts. Gemini in particular will hand you a whole clause at once, which reads as a stutter, not a stream. So re-emit each chunk piece-by-piece with a ~16ms delay. The user sees a steady, human-paced reply instead of three jumps.
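One way to do that smoothing, sketched under assumptions: split each burst into word-sized pieces and pause briefly between them. The helper names are hypothetical, and splitting on words is just one choice; characters or fixed-size slices would work the same way.

```javascript
// Split a burst into word-sized pieces, keeping whitespace attached so that
// joining the pieces reproduces the original chunk exactly.
function splitPieces(chunk) {
  return chunk.match(/\S+\s*|\s+/g) ?? [];
}

const sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

// Re-emit every upstream chunk piece by piece with a short pause in between.
// `chunks` can be any iterable or async iterable of strings.
async function* smooth(chunks, delayMs = 16) {
  for await (const chunk of chunks) {
    for (const piece of splitPieces(chunk)) {
      yield piece;
      await sleep(delayMs);
    }
  }
}
```

The model's chunk stream would pass through smooth() before each piece is framed and written, so the pacing lives in one place.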

Retry, but only before the first token. Transient 429/503s are a fact of life on hosted models. Retrying — and falling back to a second model — is easy. The subtlety: once you’ve streamed a single token to the user, you can’t safely retry, because the retry would duplicate output. So the rule is: retry and fall back only before the first token flows; after that, a failure is a failure and you surface it in an error frame. My streamer tracks a started flag and enforces exactly this.
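The rule can be sketched roughly as follows. This is not the app's actual streamer: the provider shape ({ name, stream(prompt) } returning an async iterable of text), the err.status field, and the attempt count are all assumptions made for illustration.

```javascript
// Assumed error shape: hosted-model errors carry an HTTP-like `status`.
const isTransient = (err) => Boolean(err) && (err.status === 429 || err.status === 503);

// Try each provider in order. Retry transient failures, but only while
// nothing has reached the user; after the first token, a failure becomes
// an error frame and the stream ends.
async function streamReply(providers, prompt, write, maxAttempts = 2) {
  let started = false; // true once any text has been written to the user
  let lastError = null;
  for (const provider of providers) {
    for (let attempt = 1; attempt <= maxAttempts; attempt++) {
      try {
        for await (const chunk of provider.stream(prompt)) {
          started = true;
          write(`0:${JSON.stringify(chunk)}\n`);
        }
        write(`e:${JSON.stringify({ finishReason: "stop", source: provider.name })}\n`);
        return;
      } catch (err) {
        lastError = err;
        if (started) {
          // Too late to retry: a second attempt would duplicate output.
          write(`3:${JSON.stringify(String(err.message))}\n`);
          return;
        }
        if (!isTransient(err)) break; // not retryable; move to the next provider
      }
    }
  }
  const message = lastError ? String(lastError.message) : "no provider configured";
  write(`3:${JSON.stringify(message)}\n`);
}
```

Keeping started as a single flag outside both loops is what makes the rule hard to violate: no retry or fallback path can run once it is set.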

One more thing that matters more than it should: the whole thing degrades to a scripted roleplayer when there’s no API key. A reviewer who clones the repo and runs npm start with nothing configured still gets the full streaming UI and a real debrief. “Works on first run, no setup” is a feature, and it’s cheap.
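A minimal sketch of that degradation, with everything named here assumed for illustration: the GEMINI_API_KEY variable name, the createRoleplayer helper, and the canned lines are not taken from the app. The scripted roleplayer exposes the same shape as a model-backed one, so the rest of the pipeline does not need to know which it got.

```javascript
// Canned counterpart lines for the no-key path (illustrative text only).
const SCRIPT = [
  "That's fair, but your price is still higher than the quote I already have.",
  "What would I actually be getting for the difference?",
  "Okay. Send me the proposal and I'll take it to my team.",
];

// Choose the reply source once at startup. `modelStream` stands in for
// whatever function calls the hosted model; GEMINI_API_KEY is an assumed name.
function createRoleplayer(env, modelStream) {
  if (env.GEMINI_API_KEY && typeof modelStream === "function") {
    return { name: "gemini", stream: modelStream };
  }
  return {
    name: "scripted",
    // Same contract as a model provider: an async iterable of text chunks.
    async *stream(prompt, turnIndex = 0) {
      yield SCRIPT[turnIndex % SCRIPT.length];
    },
  };
}
```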

The debrief: score a rubric, and abstain when you can’t

The debrief grades the trainee against a per-scenario rubric: did they acknowledge the objection before arguing, did they ask a discovery question, did they reframe on value, did they hold the price? Four criteria, each either met or not.

The temptation is to always return a number. Someone barely typed two words? Here’s your 25%. That number is a lie, and worse, it’s a confident lie — the kind that erodes trust in the whole product the first time a user notices.

So the scorer has a third verdict: abstain. Every criterion that depends on what the trainee actually said has an evidence gate. If the trainee gave us too little to judge (fewer than MIN_TRAINEE_WORDS substantive words), that criterion abstains for insufficient evidence instead of returning a fake zero. And if fewer than half of the criteria can be judged at all, the overall verdict is insufficient_evidence with a null score, not a made-up percentage:

// Evidence gate, checked before a criterion is scored: too few trainee words means abstain.
if (criterion.needsTrainee && wordCount(traineeText) < MIN_TRAINEE_WORDS) {
  return { verdict: "abstain",
           reason: "Insufficient evidence — the trainee didn't engage enough." };
}
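For context, here is one way the gate and the overall verdict could fit together end to end. It is a sketch under stated assumptions, not the app's scorer: the threshold of 5, the reading of "substantive" as "contains a letter or digit", the toy regex checks, the met/not_met verdict names, and scoring as met over judged are all invented for illustration.

```javascript
const MIN_TRAINEE_WORDS = 5; // assumed value; the article gives no number

// Count whitespace-separated tokens containing at least one letter or digit.
function wordCount(text) {
  return text.split(/\s+/).filter((w) => /[\p{L}\p{N}]/u.test(w)).length;
}

// Hypothetical rubric: the four criteria from the article, with toy checks.
const RUBRIC = [
  { id: "acknowledge", needsTrainee: true, test: (t) => /\b(i understand|i hear you|that's fair|fair point)\b/i.test(t) },
  { id: "discovery", needsTrainee: true, test: (t) => /\b(what|how|why|which)\b[^.?!]*\?/i.test(t) },
  { id: "value", needsTrainee: true, test: (t) => /\b(value|roi|return|save|saves|saving)\b/i.test(t) },
  { id: "hold_price", needsTrainee: true, test: (t) => !/\b(discount|lower the price|drop the price)\b/i.test(t) },
];

function scoreCriterion(criterion, traineeText) {
  if (criterion.needsTrainee && wordCount(traineeText) < MIN_TRAINEE_WORDS) {
    return { id: criterion.id, verdict: "abstain",
             reason: "Insufficient evidence: the trainee didn't engage enough." };
  }
  return { id: criterion.id, verdict: criterion.test(traineeText) ? "met" : "not_met" };
}

// Overall verdict: refuse to produce a number unless at least half of the
// rubric could actually be judged.
function debrief(rubric, traineeText) {
  const results = rubric.map((c) => scoreCriterion(c, traineeText));
  const judged = results.filter((r) => r.verdict !== "abstain");
  if (judged.length < rubric.length / 2) {
    return { verdict: "insufficient_evidence", score: null, results };
  }
  const met = judged.filter((r) => r.verdict === "met").length;
  return { verdict: "scored", score: met / judged.length, results };
}
```

Because this sketch gates every criterion on the same text, the abstentions are all-or-nothing. Gating each criterion on the part of the conversation it actually concerns would let some criteria abstain while others score.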

This is the same instinct as a good classifier that can output “I’m not sure.” Calibrated abstention is a feature, not a cop-out. In a training product it’s doubly important: telling someone “you scored 25% on holding the price” when they never got to the price is actively bad coaching. “We couldn’t assess that — go deeper and re-run” is honest and useful.

I built the scorer as a deterministic heuristic on purpose. It’s the floor: it always runs, with or without a model, and it’s trivially unit-testable — the abstain path is a first-class test case, not an afterthought. When a model is available, you layer an LLM judge on top, but you keep the exact same contract: the judge must cite a span of the trainee’s words as evidence, or it abstains. The model can be smarter; it doesn’t get to be less honest.
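The "cite a span or abstain" contract can be enforced outside the model, so a judgment without verifiable evidence never reaches the user. A minimal sketch, assuming the judge returns an object like { verdict, evidence }; that shape and the enforceEvidence name are hypothetical.

```javascript
// Accept a model judgment only if it quotes the trainee. Anything else is
// downgraded to an abstention, matching the heuristic scorer's contract.
function enforceEvidence(judgment, traineeText) {
  const abstain = (reason) => ({ verdict: "abstain", evidence: null, reason });
  if (!judgment || judgment.verdict === "abstain") {
    return abstain("Judge abstained.");
  }
  const span = typeof judgment.evidence === "string" ? judgment.evidence.trim() : "";
  if (span === "") {
    return abstain("No evidence span cited.");
  }
  // The span has to appear in what the trainee actually said; compare with
  // case and whitespace normalized so trivial formatting drift is tolerated.
  const norm = (s) => s.toLowerCase().replace(/\s+/g, " ");
  if (!norm(traineeText).includes(norm(span))) {
    return abstain("Cited span not found in the trainee's words.");
  }
  return { verdict: judgment.verdict, evidence: span };
}
```

A substring check is deliberately strict: it rejects paraphrased "evidence", which is the failure mode the contract exists to prevent.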

Why this generalizes

None of this is specific to sales training. Any conversational-AI product comes down to the same two problems: make the machine’s turn feel alive, and make the machine’s judgments trustworthy. Stream the tokens as they’re produced, smooth the bursts, and retry before the first token but not after. And when you grade, build the “I don’t have enough to say” branch first, then earn the confident answers.

The full app is ~500 lines with zero runtime dependencies, and it runs on the first try with no key. If you’re building anything where an AI talks back in real time and then has an opinion about how it went, these two patterns are most of the game.