Building a Multilingual Speech Corpus, Verified by Whisper

We recently shipped a small browser game — LinguaGuessr, “GeoGuessr for the ear.” You hear a few seconds of real human speech and guess where on Earth it is spoken. The game is the fun part. The interesting engineering is underneath it: assembling a clean, correctly-labeled speech corpus across 38 languages out of audio that arrives noisy, inconsistently tagged, and frequently wrong. This is a data-pipeline problem wearing a game costume, and the techniques generalize to any project where you have to turn a pile of untrusted source data into something you can stake a product on.

The constraint that shaped everything: a guessing game is only as good as its ground truth. If even five percent of clips are mislabeled or unusable (Portuguese tagged as Spanish, a clip of dead air, two speakers talking over each other), players lose trust immediately, because the one thing the game promises is that the answer is correct. So the whole pipeline is built around a single question: how do you mechanically check that a clip is what it claims to be, at a scale where you cannot listen to each one yourself?

The shape of the problem

Public speech audio is abundant. It is also filthy. Pull from open sources and you get variable sample rates and codecs, clips that are forty minutes long when you want fifteen seconds, leading silence, background music, multiple speakers, and language labels that range from accurate to aspirational. Treating any of that as ground truth is how you ship a game that lies to people.

The pipeline therefore has four jobs, in order: acquire candidate audio with a claimed language, segment it into short utterances, verify that each segment actually is the claimed language and is clean speech, and publish the survivors as an immutable, edge-served manifest. Acquisition and publishing are plumbing. Segmentation and verification are where the corpus is won or lost.

Segmentation: short, clean, single-speaker

A good clip for this game is 10–20 seconds of one person speaking continuously — long enough to carry the phonetic and prosodic cues a listener needs, short enough to keep the round fast. Getting there means trimming leading and trailing silence, splitting on natural pauses rather than at a hard time cutoff (so you never slice a word in half), and rejecting segments where the energy profile suggests music or overlapping voices rather than a single speaker.
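As a rough illustration of splitting on pauses instead of at a fixed time cutoff, here is a minimal sketch that uses frame-level RMS energy. The function names, the 20 ms frame size, the 0.01 silence threshold, and the 300 ms minimum pause are all assumptions chosen for the example, not values from the actual pipeline, and a real build would more likely use a proper voice-activity detector.

```python
import math
from typing import List, Tuple


def frame_rms(samples: List[float], frame_len: int) -> List[float]:
    """RMS energy of consecutive, non-overlapping frames (a partial tail is dropped)."""
    return [
        math.sqrt(sum(x * x for x in samples[i:i + frame_len]) / frame_len)
        for i in range(0, len(samples) - frame_len + 1, frame_len)
    ]


def split_on_pauses(
    samples: List[float],
    sample_rate: int,
    frame_ms: int = 20,        # assumed frame size
    silence_rms: float = 0.01, # assumed silence threshold
    min_pause_ms: int = 300,   # assumed minimum pause that counts as a split point
) -> List[Tuple[float, float]]:
    """Return (start_s, end_s) spans of speech separated by long pauses.

    Leading and trailing silence is dropped. Pauses shorter than
    min_pause_ms stay inside a span, so cuts only land in real gaps.
    """
    frame_len = int(sample_rate * frame_ms / 1000)
    voiced = [e >= silence_rms for e in frame_rms(samples, frame_len)]
    min_pause_frames = max(1, min_pause_ms // frame_ms)

    spans = []          # (first_frame, one_past_last_frame)
    start = None        # first voiced frame of the span being built
    last_voiced = None  # most recent voiced frame seen
    for i, is_voiced in enumerate(voiced):
        if not is_voiced:
            continue
        if start is None:
            start = i
        elif i - last_voiced - 1 >= min_pause_frames:
            # The gap since the last voiced frame is a real pause: close the span.
            spans.append((start, last_voiced + 1))
            start = i
        last_voiced = i
    if start is not None:
        spans.append((start, last_voiced + 1))

    return [(s * frame_ms / 1000, e * frame_ms / 1000) for s, e in spans]
```

The spans that come back can then be filtered to the 10–20 second window. This sketch does not try to detect music or overlapping speakers; it only shows the pause-based cut.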

The lesson worth generalizing: do the cheap rejections first. A duration check, a silence-ratio check, and a crude signal check cost almost nothing and throw out a large fraction of garbage before it reaches the expensive stage. Every item you can discard with arithmetic is an item you do not pay a model to evaluate. Order your pipeline cheapest-filter-first and the expensive stage only ever sees plausible candidates.
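One way to express the cheapest-filter-first ordering is a short-circuiting list of checks in front of the expensive stage. This is a sketch under assumptions: the `Segment` fields, the 0.4 silence ratio, and the peak bounds are hypothetical; only the 10–20 second duration window comes from the text above.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class Segment:
    duration_s: float     # clip length in seconds
    silence_ratio: float  # fraction of frames below a silence threshold
    peak: float           # max absolute sample value, 0.0..1.0


# Ordered cheapest first. Thresholds other than the duration window are assumed.
CHEAP_CHECKS = [
    ("duration", lambda s: 10.0 <= s.duration_s <= 20.0),
    ("silence_ratio", lambda s: s.silence_ratio <= 0.4),
    ("signal", lambda s: 0.05 <= s.peak < 1.0),  # not near-empty, not clipped
]


def first_failure(seg: Segment) -> Optional[str]:
    """Name of the first cheap check the segment fails, or None if it passes all."""
    for name, check in CHEAP_CHECKS:
        if not check(seg):
            return name
    return None


def run_pipeline(
    segments: List[Segment],
    expensive_check: Callable[[Segment], bool],
) -> List[Segment]:
    """Call the expensive check only for segments that survive every cheap one."""
    accepted = []
    for seg in segments:
        if first_failure(seg) is None and expensive_check(seg):
            accepted.append(seg)
    return accepted
```

Recording the name of the failing check, as `first_failure` does, also gives a rejection breakdown per filter, which helps when tuning thresholds.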

Verification: Whisper as a labeling oracle

This is the core idea. Rather than trust the source’s language label, we run each candidate segment through Whisper — OpenAI’s speech recognition model — and use it as an independent oracle. Whisper does two useful things at once: it detects the language it actually hears, and it transcribes what was said. Both outputs become verification gates.

The language-detection gate is the obvious one: if the source claims Polish but Whisper hears Czech with high confidence, the clip is rejected. The clip’s claimed label has to match the model’s detected label, and the detection confidence has to clear a threshold. Disagreement is not something to resolve — it is a reason to drop the clip. With abundant source material, you can afford to be ruthless and keep only the clips where two independent signals agree.
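The gate itself can be a few lines once the detection output is in hand. The sketch below assumes the Whisper runtime exposes detection as a mapping from language code to probability; how you obtain that mapping depends on the Whisper build you use. The function name and the 0.8 threshold are hypothetical.

```python
from typing import Dict, Tuple


def language_gate(
    claimed: str,
    lang_probs: Dict[str, float],
    min_confidence: float = 0.8,  # assumed threshold, tune against your data
) -> Tuple[bool, str]:
    """Accept only if the top detected language matches the claim with enough confidence.

    lang_probs is assumed to map language codes (e.g. "pl", "cs") to probabilities.
    Returns (accepted, reason) so rejections can be counted by cause.
    """
    if not lang_probs:
        return False, "no_detection"
    detected = max(lang_probs, key=lang_probs.get)
    if detected != claimed.strip().lower():
        return False, f"mismatch:{detected}"
    if lang_probs[detected] < min_confidence:
        return False, "low_confidence"
    return True, "ok"
```

For example, `language_gate("pl", {"cs": 0.9, "pl": 0.08})` rejects with the reason `mismatch:cs`, and nothing attempts to relabel the clip as Czech: disagreement drops it.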

The transcript gate is the subtler, more valuable one. A confident transcript of real words is strong evidence the clip is clean, intelligible speech. An empty transcript means silence or noise. A transcript that is mostly the same token repeated is Whisper’s well-known failure mode on non-speech audio — a reliable negative signal. So the transcript does double duty: it filters out junk the energy checks missed, and the surviving transcript becomes the “here is what was said” reveal the game shows after each guess. One model pass, two payoffs.
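A transcript gate along these lines might look like the following sketch. The minimum word count and the repetition threshold are assumptions for illustration, and the word splitting is naive: it treats runs of word characters as tokens, which would need replacing for scripts that do not separate words with spaces.

```python
import re
from collections import Counter
from typing import Tuple


def transcript_gate(
    text: str,
    min_words: int = 3,                 # assumed minimum for "real speech"
    max_top_token_share: float = 0.5,   # assumed cap on the most frequent token
) -> Tuple[bool, str]:
    """Reject empty, very short, or highly repetitive transcripts.

    Returns (accepted, reason). Tokenization is a naive \\w+ split and is
    not suitable for unsegmented scripts such as Chinese or Thai.
    """
    words = re.findall(r"\w+", text.lower())
    if not words:
        return False, "empty"
    if len(words) < min_words:
        return False, "too_short"
    top_count = Counter(words).most_common(1)[0][1]
    if top_count / len(words) > max_top_token_share:
        return False, "repetitive"
    return True, "ok"
```

A transcript that passes is kept alongside the clip, since it is also what the game reveals after the guess.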

The pattern here is general: when you cannot trust a source’s metadata, find a second, independent signal (here, a model’s output) and keep only the records where the two agree. Disagreement becomes a quality filter that costs nothing beyond the model pass you were already running.

On running Whisper at corpus scale

Verifying thousands of clips is a batch job, and batch jobs have their own engineering. A few decisions mattered. Pick the smallest model that clears your accuracy bar — language detection and short-clip transcription do not need the largest checkpoint, and the smaller ones run several times faster, which compounds across a corpus. On Apple Silicon, a Metal-backed build of Whisper turns an overnight job into a coffee-break one; matching the runtime to the hardware is not a micro-optimization at this scale, it is the difference between iterating daily and iterating weekly.

Make the whole pass idempotent and resumable. Key every result by a content hash of the audio so a re-run skips work already done and never double-counts. A corpus build is something you will run dozens of times as you tune thresholds; if each run starts from zero you will stop tuning, and the corpus will be worse for it. Cache the expensive output, vary only the cheap filters on top, and you can re-tune the accept/reject thresholds in seconds without re-transcribing anything.
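A minimal sketch of that caching layer, assuming one JSON file per clip on local disk: the class name, the file layout, and the choice of SHA-256 are illustrative assumptions, not a description of the actual build.

```python
import hashlib
import json
from pathlib import Path
from typing import Callable


def content_key(audio_bytes: bytes) -> str:
    """Key a result by the audio content itself, not by filename or source URL."""
    return hashlib.sha256(audio_bytes).hexdigest()


class ResultCache:
    """One JSON file per clip, named by content hash (hypothetical layout)."""

    def __init__(self, cache_dir: str) -> None:
        self.dir = Path(cache_dir)
        self.dir.mkdir(parents=True, exist_ok=True)

    def get_or_compute(
        self,
        audio_bytes: bytes,
        compute: Callable[[bytes], dict],
    ) -> dict:
        path = self.dir / f"{content_key(audio_bytes)}.json"
        if path.exists():
            # Already processed in an earlier run: skip the expensive call.
            return json.loads(path.read_text(encoding="utf-8"))
        result = compute(audio_bytes)
        # Write to a temp file, then rename, so an interrupted run
        # never leaves a half-written result behind.
        tmp = path.with_suffix(".tmp")
        tmp.write_text(json.dumps(result, ensure_ascii=False), encoding="utf-8")
        tmp.replace(path)
        return result
```

The key design point is to cache the raw model output (detected language probabilities and transcript) and not the accept/reject decision. Thresholds are then applied on top of the cached results, so changing one re-runs only the cheap comparison.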

Publishing: an immutable manifest, served from the edge

The survivors are written to a single JSON manifest — every clip with its verified language, region, geocode, and transcript — and the audio files sit next to it as flat, content-addressed assets. The game itself is a static front end; there is no application server in the request path. The manifest and the clips are served straight from a CDN edge, which means the “backend” is a build artifact, not a running process. Nothing to scale, nothing to page you at 3 a.m., a hosting cost that rounds to zero, and a global latency profile you get for free.
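To make the manifest shape concrete, here is a sketch of a deterministic manifest builder. The field names (`file`, `language`, `region`, `geocode`, `transcript`), the truncated-hash file naming, and the `.mp3` extension are assumptions for illustration; the text above only says which pieces of information each entry carries.

```python
import hashlib
import json
from typing import Dict, List


def asset_name(audio_bytes: bytes, ext: str = "mp3") -> str:
    """Content-addressed filename: same bytes give the same name (hypothetical scheme)."""
    return f"{hashlib.sha256(audio_bytes).hexdigest()[:16]}.{ext}"


def build_manifest(clips: List[Dict]) -> str:
    """Serialize verified clips to a JSON string with a stable ordering.

    Each input dict is assumed to carry: audio (bytes), language, region,
    lat, lon, transcript. The audio bytes are not embedded in the manifest.
    """
    entries = [
        {
            "file": asset_name(c["audio"]),
            "language": c["language"],
            "region": c["region"],
            "geocode": {"lat": c["lat"], "lon": c["lon"]},
            "transcript": c["transcript"],
        }
        for c in clips
    ]
    # Sort so that the same corpus always produces byte-identical output.
    entries.sort(key=lambda e: e["file"])
    return json.dumps({"clips": entries}, ensure_ascii=False, sort_keys=True, indent=2)
```

Because file names derive from content and the output is deterministic, rebuilding an unchanged corpus yields an identical manifest, which suits long-lived CDN caching.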

This is a deliberate architectural choice that suits the workload. The corpus changes when we rebuild it, not per request. Anything that only changes at build time has no business being computed at request time. Precompute the hard part, ship the result as static assets, and let the edge do the distribution. The client’s one job at runtime is to read the manifest and play audio. That keeps it fast, and it keeps it resilient when a clip URL occasionally 404s: the manifest lists every other clip, so the client can swap in another one without asking a server.
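The fallback logic is small enough to sketch. The real client presumably runs in the browser; the version below is written in Python only to show the control flow, with the network call injected as a `fetch` function that returns `None` on failure. The function name, the `file` field, and the attempt limit are hypothetical.

```python
import random
from typing import Callable, Dict, List, Optional, Tuple


def pick_playable_clip(
    clips: List[Dict],
    fetch: Callable[[str], Optional[bytes]],
    rng: random.Random,
    max_attempts: int = 3,  # assumed retry budget
) -> Optional[Tuple[Dict, bytes]]:
    """Try randomly chosen clips from the manifest until one loads.

    fetch(file) is assumed to return the audio bytes, or None on a 404
    or other failure. Returns (clip, audio) or None if every attempt fails.
    """
    candidates = rng.sample(clips, k=min(max_attempts, len(clips)))
    for clip in candidates:
        audio = fetch(clip["file"])
        if audio is not None:
            return clip, audio
    return None
```

Sampling without replacement means a missing clip is never retried within the same round.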

What this example is really about

LinguaGuessr is a toy, and we built it as one — a demo, an enjoyable thing to share. But the engineering under it is the same engineering we do on client work: take untrusted source data, define what “correct” means precisely enough that a machine can enforce it, build a verification gate that proves each record rather than assuming it, and ship the result through an architecture sized to how the data actually changes. Swap “speech clips” for invoices, support tickets, product listings, or sensor readings and the playbook is identical: cheap filters first, an independent oracle for the expensive check, idempotent and resumable batch passes, and a serving layer that matches the data’s real update cadence.

The reason a guessing game makes a good demonstration of this is that the quality bar is brutal and immediate. There is nowhere to hide a mislabeled record — a player catches it on the first round. That forces the verification step to be real, not theater. Most production data pipelines would be better off held to the same standard.