When Not to Answer: Building a Voice-Over-Email Loop With a Consent Gate

Most of the interesting engineering in a voice product isn’t the voice. It’s knowing when to keep quiet.

I’ve been building around a simple idea lately: conversation over email, in voice. You send a spoken message, something on the other end understands it, and it answers you — out loud, threaded back into the same email conversation. No app to open, no new inbox to check. Email is the transport; voice is the interface.

The loop is short enough to hold in your head:

inbound voice note (email) → transcribe → compose reply → synthesize → threaded reply

So I built it. A few hundred lines of plain Node, no runtime dependencies, and a handful of tests. It runs in one command. But the version that’s worth writing about is the version with a gate in the middle — because a product that talks back has two failure modes that a product that types back simply does not.

The two failure modes of a voice reply

Mishearing. Speech-to-text is probabilistic. On a clean recording it’s excellent; on a subway platform, through a bad mic, in a second language, it degrades — and it degrades silently, handing you a plausible-looking transcript that’s subtly or wildly wrong. In a text product, a bad transcript is a visible typo the user can fix. In a voice product, you take that shaky guess, feed it to a language model, and read the confident answer aloud. Now the person is listening to a fluent, authoritative response to a question they never asked. That’s not a small UX papercut. It’s the product lying to someone in a human voice.

Consent. A synthesized voice in your ear is more intimate than text on a screen. There’s an implicit contract about when it’s welcome. Auto-replying in voice to someone who sent you one voice note — but never signed up to be spoken to by a machine — is the kind of thing that feels invasive the first time and gets your product muted forever.

Neither of these is exotic. Both are structural. So they belong in the architecture, not in a backlog labeled “polish later.”

The gate

Between transcription and speech I put one decision function. It takes the transcript (with a confidence score), the sender, and a consent registry, and returns one of three modes:

voice — synthesize a spoken reply and attach it.
text — answer, but in text only.
clarify — don’t answer the question at all; ask them to resend.

The rules are deliberately boring:

// 1. Confidence gate: fail closed on a missing, non-numeric, or NaN score.
if (!Number.isFinite(confidence) || confidence < minConfidence) {
  return { mode: "clarify", voiceReply: false,
    reason: "refusing to voice-reply to a message we may have misheard" };
}
// 2. Consent gate.
if (!consent[sender]?.voiceReplies) {
  return { mode: "text", voiceReply: false,
    reason: "sender has not opted into synthesized voice replies" };
}
return { mode: "voice", voiceReply: true };
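
For readers who want to run the rules, here is a minimal sketch of them wrapped in a function. The name `decideReplyMode`, the argument shape, and the 0.8 default threshold are assumptions made for illustration, not the project's actual signature or tuning. The sketch treats any score that is not a finite number as unknown, so a NaN from a failed computation cannot slip past the comparison.

```javascript
// Hypothetical default; the article does not state the real threshold.
const DEFAULT_MIN_CONFIDENCE = 0.8;

// Hypothetical wrapper around the two gates described above.
function decideReplyMode({ confidence, sender, consent, minConfidence = DEFAULT_MIN_CONFIDENCE }) {
  // 1. Confidence gate: null, undefined, NaN, strings, and low scores all fail closed.
  if (!Number.isFinite(confidence) || confidence < minConfidence) {
    return { mode: "clarify", voiceReply: false,
      reason: "refusing to voice-reply to a message we may have misheard" };
  }
  // 2. Consent gate: only an own record with voiceReplies === true opts in.
  const registry = consent ?? {};
  const record = Object.hasOwn(registry, sender) ? registry[sender] : undefined;
  if (record?.voiceReplies !== true) {
    return { mode: "text", voiceReply: false,
      reason: "sender has not opted into synthesized voice replies" };
  }
  return { mode: "voice", voiceReply: true,
    reason: "confident transcript and sender opted in" };
}

const consent = { "ana@example.com": { voiceReplies: true } };
console.log(decideReplyMode({ confidence: 0.93, sender: "ana@example.com", consent }).mode); // voice
console.log(decideReplyMode({ confidence: 0.93, sender: "bo@example.com", consent }).mode);  // text
console.log(decideReplyMode({ confidence: NaN, sender: "ana@example.com", consent }).mode);  // clarify
```

Because the confidence check runs first, a misheard message from an opted-out sender still gets the clarify note rather than a text answer to a question they may not have asked.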

Two details matter more than they look.

Fail-closed on unknown confidence. Whisper-family models don’t hand you a tidy per-utterance confidence; you can derive a coarse proxy from segment log-probabilities, but sometimes you get nothing usable. The tempting move is to treat “no score” as “probably fine.” The correct move is the opposite: unknown confidence is treated exactly like low confidence. If you can’t prove you heard someone, you don’t answer them in a voice. Silence — well, a polite “could you resend that?” — is the safe default.
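
As an illustration of that coarse proxy, here is one possible sketch. It assumes segments shaped like the `segments` array in the open-source Whisper JSON output, where each entry carries `start`, `end`, and `avg_logprob`. The function name and the duration weighting are my own illustrative choices, and the result is an uncalibrated heuristic, not a true probability of a correct transcript.

```javascript
// Hypothetical helper: duration-weighted mean of exp(avg_logprob).
// Returns null when nothing usable is present, so the gate can fail closed.
function confidenceFromSegments(segments) {
  if (!Array.isArray(segments)) return null;
  let weighted = 0;
  let totalDuration = 0;
  for (const seg of segments) {
    if (seg == null) continue;
    const duration = seg.end - seg.start;
    // Skip segments with no numeric log-probability or no positive duration.
    if (!Number.isFinite(seg.avg_logprob) || !(duration > 0)) continue;
    // exp(mean token log-prob) is the geometric-mean token probability;
    // clamping at 0 keeps the result within (0, 1].
    weighted += Math.exp(Math.min(seg.avg_logprob, 0)) * duration;
    totalDuration += duration;
  }
  return totalDuration > 0 ? weighted / totalDuration : null;
}

console.log(confidenceFromSegments([]));                    // null
console.log(confidenceFromSegments([{ start: 0, end: 2 }])); // null
```

The important property is the `null` return: a transcript with no usable segments produces "unknown", never a made-up score.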

Consent is per-sender and also fail-closed. An unknown sender gets text, not voice. Opting in is the affirmative act; the absence of a record means “text.”
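
A sketch of what fail-closed can look like at the storage layer, assuming a hypothetical JSON file shaped like `{ "ana@example.com": { "voiceReplies": true } }`. The function names are illustrative, reading the file from disk is left out, and lowercasing the whole address is a simplification (local parts are technically case-sensitive).

```javascript
// Hypothetical normalizer; lowercasing the full address is a simplification.
function normalizeAddress(address) {
  return String(address).trim().toLowerCase();
}

// Hypothetical parser: any problem with the store yields an empty registry,
// which downstream means "text for everyone".
function parseConsentRegistry(jsonText) {
  const registry = Object.create(null); // no inherited keys such as "constructor"
  let raw;
  try {
    raw = JSON.parse(jsonText);
  } catch {
    return registry; // unreadable store: nobody is opted in
  }
  if (raw === null || typeof raw !== "object" || Array.isArray(raw)) return registry;
  for (const [address, record] of Object.entries(raw)) {
    // Only an explicit boolean true counts; "yes", 1, or a missing field do not.
    if (record?.voiceReplies === true) {
      registry[normalizeAddress(address)] = { voiceReplies: true };
    }
  }
  return registry;
}

const registry = parseConsentRegistry(
  '{"Ana@Example.com": {"voiceReplies": true}, "bo@example.com": {"voiceReplies": "yes"}}'
);
console.log(registry[normalizeAddress("ana@example.com")]?.voiceReplies === true); // true
console.log(registry["bo@example.com"]?.voiceReplies === true);                    // false
console.log(Object.keys(parseConsentRegistry("not json")).length);                 // 0
```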

When the gate returns clarify, the loop doesn’t send the LLM’s best guess in any form. It sends a short, honest note: “I couldn’t make out your voice note clearly enough to answer without guessing — could you resend it, maybe somewhere a little quieter?” That message is doing real work. It tells the user the truth about the system’s state instead of papering over it with a confident hallucination.
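
One way to make that guarantee structural is to branch before the composer is ever called. The sketch below assumes hypothetical injected `composeReply` and `synthesize` stage functions and a decision object with a `mode` field like the gate's return value. The note text is a lightly reworded copy of the one above.

```javascript
const CLARIFY_NOTE =
  "I couldn't make out your voice note clearly enough to answer without guessing. " +
  "Could you resend it, maybe somewhere a little quieter?";

// Hypothetical dispatcher: composeReply and synthesize are injected stand-ins
// for the LLM and TTS stages.
async function buildReply(decision, transcript, { composeReply, synthesize }) {
  // Anything that is not an explicit "voice" or "text" decision fails closed.
  if (decision?.mode !== "voice" && decision?.mode !== "text") {
    // No LLM call at all: a guess that is never generated cannot leak out.
    return { text: CLARIFY_NOTE, audio: null };
  }
  const text = await composeReply(transcript.text);
  if (decision.mode === "voice") {
    return { text, audio: await synthesize(text) };
  }
  return { text, audio: null };
}

// Usage with stub stages:
const stages = {
  composeReply: async (question) => `You asked: ${question}`,
  synthesize: async (text) => Buffer.from(text), // stand-in for real audio bytes
};
buildReply({ mode: "clarify" }, { text: "???" }, stages)
  .then((reply) => console.log(reply.audio === null)); // true
```

Putting the branch ahead of the composer means the clarify path cannot accidentally include model output, even if someone later changes how replies are formatted.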

The parts around the gate

The rest of the loop is unglamorous plumbing done carefully, which is most of what shipping a product actually is:

STT is a provider seam. Point it at a real Whisper CLI via an environment variable and it transcribes the audio; leave it unset and it reads a fixture transcript so the demo and its tests are deterministic. The confidence score rides along either way.
The LLM reply goes to Gemini when an API key is configured and falls back to a small grounded, deterministic composer when it isn’t. Crucially, the prompt is told the reply will be spoken: short sentences, no markdown, no URLs, under seventy words, and answer only from the provided context. Writing for the ear is different from writing for the eye.
TTS prefers a real engine (a configurable command, or on-device speech) and degrades to a labeled stub tone so the pipeline always produces a real audio file to attach.
Email is stubbed at both ends (inbound is a fixture, outbound is written to an outbox/ directory as an .eml file), but the part that actually matters is real: the reply carries correct In-Reply-To and References headers pointing at the original’s Message-ID, so a mail client threads it under the original. “Conversation over email” is a lie if the replies don’t thread.
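
The threading rule itself is small. Here is a sketch that follows the RFC 5322 convention: In-Reply-To carries the parent's Message-ID, and References carries the parent's References chain followed by that Message-ID. The function name and the parsed-message field names (`messageId`, `references`, `inReplyTo`) are assumptions for illustration.

```javascript
// Split a header value into its whitespace-separated message IDs.
function splitIds(value) {
  return (value ?? "").split(/\s+/).filter(Boolean);
}

// Hypothetical helper: builds the two threading headers for a reply to `parent`.
function threadingHeaders(parent) {
  const parentId = (parent.messageId ?? "").trim();
  if (!parentId) return {}; // nothing to thread under

  // Start from the parent's References; if it has none, fall back to its
  // In-Reply-To, but only when that holds exactly one message ID.
  let chain = splitIds(parent.references);
  if (chain.length === 0) {
    const inReplyTo = splitIds(parent.inReplyTo);
    if (inReplyTo.length === 1) chain = inReplyTo;
  }
  return {
    "In-Reply-To": parentId,
    References: [...chain, parentId].join(" "),
  };
}

console.log(threadingHeaders({ messageId: "<b@example.com>", references: "<a@example.com>" }));
// { 'In-Reply-To': '<b@example.com>', References: '<a@example.com> <b@example.com>' }
```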

Every external stage is real when configured and clearly labeled when it’s running a fallback. That “graceful degradation with honest labels” property is worth more than it sounds: it means the whole loop runs end-to-end on a laptop with no keys and no network, which is what makes it testable — and it means the trace never pretends a stub was the real thing.
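
To make the labeled-fallback idea concrete, here is a sketch of the STT seam. The environment variable name `WHISPER_CMD`, the injected `runWhisper` and `readFixture` functions, and the `provider` and `fallback` fields are all hypothetical. The point is only that every result says which path produced it.

```javascript
// Hypothetical seam: picks the real provider when configured, otherwise the
// fixture, and stamps every result with an honest label.
function makeTranscriber({ env = process.env, runWhisper, readFixture }) {
  const command = env.WHISPER_CMD; // assumed variable name, not the project's
  if (command) {
    return async (audioPath) => ({
      ...(await runWhisper(command, audioPath)),
      provider: "whisper-cli",
      fallback: false,
    });
  }
  return async (audioPath) => ({
    ...(await readFixture(audioPath)),
    provider: "fixture",
    fallback: true,
  });
}

// Usage with stub providers and no configuration:
const transcribe = makeTranscriber({
  env: {},
  runWhisper: async () => ({ text: "from whisper", confidence: null }),
  readFixture: async () => ({ text: "what are your opening hours", confidence: 0.94 }),
});
transcribe("note.ogg").then((result) => console.log(result.provider, result.fallback));
// fixture true
```

Because the label is written after the provider's own fields are spread in, a provider cannot overwrite it by accident.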

What I deliberately left out

A prototype earns trust by being honest about its edges. This one is batch, not streaming: it transcribes fully, then replies. A production voice product wants partial-hypothesis STT and streamed TTS to feel like a conversation, with the gate running on the final hypothesis before any audio is committed. The consent store is a flat file, not a grant/revoke lifecycle with an audit trail. There’s no SPF/DKIM verification in front of the gate yet, which is exactly where it belongs once real senders show up, because a consent lookup keyed on the From address is only as trustworthy as that address. And the confidence proxy is coarse; a real gate deserves a calibrated per-utterance score.

None of that changes the shape of the thing. The shape is: understand, decide whether it’s safe to speak, then — only then — speak. The decision step is the product. Everything else is I/O.