Saying "Hey Claude" Out Loud, On-Device

The question that started this was small and specific: could you say “Hey Claude, fix the failing tests” out loud, and have a coding agent actually start working — without leaving a microphone streaming to someone’s cloud all day to make it happen? Cloud wake-word services exist and they work, but the price is that your always-on audio is somebody else’s to listen to. The interesting version of the problem keeps everything on the machine until the very last step.

So we built it: a small macOS tool called hey-claude that listens for “hey claude,” transcribes whatever you say next, and dispatches it as a background agent. You keep talking; the agents stack up. The fun is the voice trigger. The engineering worth writing down is what makes always-on listening cheap, private, and safe enough to actually leave running — and it generalizes well beyond this toy.

The shape of the problem

An always-on voice trigger is a latency-and-power problem in disguise. The naive version — run speech recognition continuously and check whether the transcript starts with your phrase — works, and it will also flatten your battery and warm your lap, because you are paying for a heavyweight model on every second of mostly-silence. The whole design is about not doing that.

The answer is a gate that gets more expensive only as the signal gets more promising. Four stages, each one cheaper than letting the next one run unguarded: a tiny classifier listens for the wake word, a crude voice-activity detector decides when your command ends, a GPU transcription model turns that one short clip into text, and only then does anything leave the machine. Arrange the cost curve that way and the steady state — you, not talking to your computer — costs almost nothing.
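The gating idea can be shown in a few lines. This is a minimal sketch, not the tool's code: `run_cascade`, the threshold gate, and `fake_model` are hypothetical stand-ins, and the frames are plain numbers rather than audio. It only demonstrates that the expensive stage is called for the handful of inputs the cheap stage lets through.

```python
from typing import Callable, Iterable, List

def run_cascade(frames: Iterable[float],
                cheap_gate: Callable[[float], bool],
                expensive_stage: Callable[[float], str]) -> List[str]:
    """Run the expensive stage only on frames the cheap gate lets through."""
    results = []
    for frame in frames:
        if cheap_gate(frame):                        # cheap check on every frame
            results.append(expensive_stage(frame))   # costly, rarely reached
    return results

# Hypothetical stand-in for a heavyweight model; it counts its own calls.
calls = {"expensive": 0}

def fake_model(frame: float) -> str:
    calls["expensive"] += 1
    return f"handled {frame}"

# Mostly "silence" (0.0) with three promising frames at the end.
frames = [0.0] * 997 + [0.9, 0.95, 0.99]
out = run_cascade(frames, lambda f: f > 0.5, fake_model)
print(len(frames), calls["expensive"])
```

A thousand frames go in and the expensive stand-in runs three times, which is the cost curve the four stages are arranged to produce.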

Stage one: a wake word small enough to always run

The first stage uses openWakeWord, which is the clever part of the architecture. Rather than recognize arbitrary speech, it runs a frozen, general-purpose speech-embedding network — the genuinely expensive model, but a fixed one — and then a tiny per-phrase classifier head, around a quarter of a megabyte, that asks one yes/no question of each frame: did that sound like “hey claude”? Most of the compute is the shared embedding; the part that knows your phrase is almost free. That is what lets it score every 80-millisecond frame at a fraction of one percent of a CPU core, indefinitely.
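To make the per-frame loop concrete, here is a rough sketch of the trigger logic that sits around a wake-word scorer. It is an illustration under assumptions, not openWakeWord's API: `score_frame` is a hypothetical callable standing in for the real model, the 16 kHz sample rate is assumed, and the threshold and cooldown values are placeholders. At 16 kHz, an 80-millisecond frame is 1280 samples.

```python
from typing import Callable, Iterable, Iterator, Sequence

SAMPLE_RATE = 16_000                              # assumed sample rate (Hz)
FRAME_MS = 80                                     # frame length from the text
FRAME_SAMPLES = SAMPLE_RATE * FRAME_MS // 1000    # samples per frame

def detect_wakes(frames: Iterable[Sequence[int]],
                 score_frame: Callable[[Sequence[int]], float],
                 threshold: float = 0.5,
                 cooldown_frames: int = 25) -> Iterator[int]:
    """Yield the index of each frame where the wake word fires.

    After a hit, the next `cooldown_frames` frames cannot fire again, so one
    spoken phrase does not produce several wakes in a row.
    """
    quiet_until = 0
    for i, frame in enumerate(frames):
        # Always score the frame: a streaming scorer may keep internal state.
        score = score_frame(frame)
        if i >= quiet_until and score >= threshold:
            yield i
            quiet_until = i + 1 + cooldown_frames

# Hypothetical scores for five frames; each "frame" just carries its index.
scores = [0.1, 0.9, 0.95, 0.2, 0.8]
fake_frames = [[k] for k in range(len(scores))]
hits = list(detect_wakes(fake_frames, lambda f: scores[f[0]], cooldown_frames=2))
print(FRAME_SAMPLES, hits)
```

The cooldown matters because a real phrase scores high for several consecutive frames, and without it a single "hey claude" would dispatch more than once.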

The detail that makes it practical: those classifier heads are trained on 100% synthetic speech. You never record yourself. A text-to-speech engine generates thousands of spoken variants of the phrase across many synthetic voices, the trainer augments them with noise and reverb, and the result is speaker-independent — it responds to anyone, because it never learned one specific anyone. The cost of supporting a new wake phrase is a training run, not a data-collection project.
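One piece of that augmentation is easy to show in isolation: mixing a noise clip into a synthetic utterance at a chosen signal-to-noise ratio. The sketch below is a generic illustration, not the trainer's actual code. The function names are hypothetical, samples are plain floats, and it assumes the noise clip is not pure silence.

```python
import math
from typing import List, Sequence

def power(x: Sequence[float]) -> float:
    """Mean squared amplitude of a clip."""
    return sum(s * s for s in x) / len(x)

def mix_at_snr(speech: Sequence[float], noise: Sequence[float],
               snr_db: float) -> List[float]:
    """Add noise to speech, scaled so the speech-to-noise power ratio is snr_db."""
    if len(noise) < len(speech):
        raise ValueError("noise clip must be at least as long as the speech clip")
    noise = noise[:len(speech)]
    # SNR(dB) = 10 * log10(P_speech / P_noise), solved for the noise power.
    target_noise_power = power(speech) / (10 ** (snr_db / 10))
    gain = math.sqrt(target_noise_power / power(noise))
    return [s + gain * n for s, n in zip(speech, noise)]

# Toy clips: a sine wave as "speech" and an alternating signal as "noise".
speech = [math.sin(2 * math.pi * 5 * t / 200) for t in range(200)]
noise = [0.1 if t % 2 == 0 else -0.1 for t in range(200)]
mixed = mix_at_snr(speech, noise, snr_db=10.0)
added = [m - s for m, s in zip(mixed, speech)]
achieved_snr_db = 10 * math.log10(power(speech) / power(added))
```

Sweeping `snr_db` over a range, and drawing the noise clip at random, is what turns a few thousand clean synthetic clips into a training set that tolerates a real room.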

What training the wake word actually took

“Just train a model” is where the honest part of the story lives. The recipe is straightforward in principle — synthesize speech, augment it, train a small classifier, export to ONNX — and every step has a way to bite you on a fresh machine. The text-to-speech stack, the audio backend, and the training framework each have opinions about versions, and those opinions conflict. A modern audio library wants a codec dependency that the model exporter does not expect; a one-version bump in the data tooling pulls in a transitive package that breaks the synthesis step. None of it is conceptually hard. All of it costs an afternoon if you discover it interactively.

The lesson we took out of it is the boring, correct one: pin the environment and make the run reproducible before you tune anything. A wake-word model is something you will rebuild many times — different phrases, different thresholds, different augmentation budgets — and a build that only works by hand, once, on the machine where it was discovered, is a build you will stop iterating on. We ended with a fully scripted training pass with pinned versions, so a fresh checkout on a fresh cloud GPU produces a model with no interactive babysitting. That is the difference between shipping one wake word and shipping a handful. We ended up shipping several — “hey claude,” “okay claude,” “hey computer,” and a couple of agent-neutral ones — bundled into the tool so it works the moment it is installed.

A model you trained once, by hand, is a demo. A model whose training run is pinned and scripted is an asset you can re-derive on demand. The gap between those two is most of what “productionizing” an ML artifact means.

Stages two and three: endpoint, then transcribe

Once the wake word fires, a different problem starts: when has the user finished giving the command? Here a simple energy-based voice-activity detector earns its keep. It records while you speak and stops after a short run of trailing silence, keeping a small pre-roll of audio from just before speech started so the first syllable is never clipped. It is not trying to understand anything — it is just deciding where the utterance ends — and a cheap heuristic is exactly right for that, because spending a model on it would defeat the whole gating idea.
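An energy-based endpointer of the kind described fits in a short function. This is a sketch under assumptions rather than the tool's implementation: the function names, the RMS threshold, and the frame counts are all illustrative, and frames are sequences of floats.

```python
from collections import deque
from math import sqrt
from typing import Iterable, List, Sequence

def rms(frame: Sequence[float]) -> float:
    """Root-mean-square energy of one frame."""
    return sqrt(sum(s * s for s in frame) / len(frame)) if frame else 0.0

def capture_utterance(frames: Iterable[Sequence[float]],
                      energy_threshold: float = 0.02,
                      preroll_frames: int = 3,
                      max_trailing_silence: int = 10) -> List[Sequence[float]]:
    """Return the frames of one utterance, including a short pre-roll.

    Recording starts at the first frame at or above the threshold and stops
    after `max_trailing_silence` consecutive quiet frames.
    """
    preroll = deque(maxlen=preroll_frames)   # most recent quiet frames
    recorded: List[Sequence[float]] = []
    speaking = False
    silent_run = 0
    for frame in frames:
        loud = rms(frame) >= energy_threshold
        if not speaking:
            if loud:
                speaking = True
                recorded.extend(preroll)     # audio from just before speech
                recorded.append(frame)
            else:
                preroll.append(frame)
            continue
        recorded.append(frame)
        silent_run = 0 if loud else silent_run + 1
        if silent_run >= max_trailing_silence:
            break
    return recorded

# Toy stream: silence, speech, a short pause, more speech, then long silence.
quiet, loud = [0.0] * 4, [0.5] * 4
stream = [quiet] * 5 + [loud] * 3 + [quiet] * 2 + [loud] * 2 + [quiet] * 20
clip = capture_utterance(stream, preroll_frames=2, max_trailing_silence=4)
```

The short pause in the middle does not end the clip because the silence counter resets whenever speech resumes. Only a full run of trailing quiet frames closes the utterance.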

The captured clip then goes to the one heavyweight model in the path: Whisper, via MLX, Apple’s array framework, so transcription runs on the Mac’s GPU. Matching the runtime to the hardware is the entire game at this stage: a Metal-backed Whisper turns a short command into text in well under a second, which is the difference between a tool that feels conversational and one that feels like dictation software from 2009. The expensive model only ever sees a few seconds of known-good speech, because the two cheap stages in front of it already threw away all the silence.

Stage four: dispatch, carefully

The last stage hands the transcribed text to an agent. This is the only point where anything leaves the machine, and it is the point where a voice tool has to be most careful, because a wake word will occasionally fire when it should not, and speech recognition will occasionally hear something you did not say. A design that pipes a transcript into a shell is a design that will, eventually, run something alarming.

So the transcribed command is always passed as a single argument to the agent process — never interpolated into a shell string, never re-split on spaces. If you say “ship it; remove the temp files,” the agent receives that whole sentence as one opaque argument; the semicolon is just a character in a string, not a shell separator. The command can say anything, but it can never become additional flags or a second command. That property is worth more than any amount of input sanitizing, because it does not depend on anticipating the bad input — it removes the channel the bad input would travel through.
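The property is easy to check with a stand-in agent. In this sketch, which is not the tool's code, `dispatch` and `ECHO_AGENT` are hypothetical: the "agent" is a child Python process that prints the arguments it actually received. Passing argv as a list, with no shell involved, is what keeps the semicolon inert.

```python
import json
import subprocess
import sys
from typing import List

def dispatch(agent_argv: List[str], spoken_text: str) -> subprocess.CompletedProcess:
    """Run the agent with the transcript as exactly one trailing argument.

    argv is passed as a list and shell=False (the default), so no shell ever
    parses the transcript.
    """
    return subprocess.run(agent_argv + [spoken_text],
                          capture_output=True, text=True, check=True)

# Hypothetical stand-in for an agent CLI: prints its arguments as JSON.
ECHO_AGENT = [sys.executable, "-c",
              "import json, sys; print(json.dumps(sys.argv[1:]))"]

result = dispatch(ECHO_AGENT, "ship it; remove the temp files")
received = json.loads(result.stdout)
print(received)
```

The child sees one argument containing the whole sentence. One thing this does not cover by itself: a transcript that begins with a dash could still be read as an option by the receiving program's own parser, so placing a `--` separator before it is a common convention where the agent CLI supports one.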

The same single-argument dispatch is what let us make the tool agent-agnostic almost for free. A wake does not have to run Claude Code; it runs whatever command you configure, with the spoken text dropped into a placeholder. Point it at a different agent CLI and the safety property holds unchanged, because the placeholder is still a single argument no matter whose program receives it.
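A configurable command with a placeholder can keep the same guarantee if the template is split into arguments before the transcript is substituted. The sketch below assumes a `{text}` placeholder and a `my-agent` command, both hypothetical. The real tool's placeholder syntax may differ.

```python
import shlex
from typing import List

PLACEHOLDER = "{text}"   # hypothetical placeholder token

def build_argv(command_template: str, spoken_text: str) -> List[str]:
    """Split the configured template first, then substitute the transcript.

    Splitting happens before substitution, so the transcript is never
    tokenized and stays inside the argument that held the placeholder.
    """
    argv = shlex.split(command_template)
    if not any(PLACEHOLDER in arg for arg in argv):
        raise ValueError(f"template has no {PLACEHOLDER} placeholder")
    return [arg.replace(PLACEHOLDER, spoken_text) for arg in argv]

argv = build_argv('my-agent run --prompt "{text}"',
                  "ship it; remove the temp files")
print(argv)
```

Doing it the other way round, substituting into the string and then splitting, would reintroduce exactly the channel the design removes, because the transcript's spaces and quotes would take part in tokenization.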

The unglamorous boss fight: microphone permission

The hardest part of shipping this was not the audio pipeline. It was convincing macOS to let a background program hear the microphone at all. macOS grants microphone access to an application identity, not to a running process, and a bare command-line tool — especially one started at login by a launch agent — frequently has no identity stable enough to even raise the permission prompt. It just receives silence, forever, with no error.

The fix is to give the tool a real identity: generate a minimal, ad-hoc-signed .app bundle that wraps the same code, so the operating system has something concrete to grant permission to and remember. This is the kind of platform detail that no amount of cleverness in the core logic saves you from. It is also exactly the sort of thing that separates a weekend prototype from something a person can actually install and have work after a reboot. We spent real time here so the user spends none.
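For a sense of how little a "minimal bundle" needs, here is a sketch that writes one with the standard library. It is an illustration under assumptions, not the generator hey-claude ships: the bundle name, the identifier, and the launcher script are placeholders, and whether a given launcher style is enough for the permission prompt is something to verify on a real machine. `NSMicrophoneUsageDescription` is the Info.plist key macOS reads for the microphone prompt text, and `codesign --sign -` requests an ad-hoc signature.

```python
import plistlib
import subprocess
import tempfile
from pathlib import Path

def write_app_bundle(root: Path, name: str, bundle_id: str, launcher: str) -> Path:
    """Create a minimal .app skeleton with an Info.plist and a launcher script."""
    app = root / f"{name}.app"
    macos_dir = app / "Contents" / "MacOS"
    macos_dir.mkdir(parents=True, exist_ok=True)

    info = {
        "CFBundleName": name,
        "CFBundleIdentifier": bundle_id,
        "CFBundleExecutable": name,
        "CFBundlePackageType": "APPL",
        # Text shown in the macOS microphone permission prompt.
        "NSMicrophoneUsageDescription":
            "Listens for the wake word and your spoken command.",
    }
    with open(app / "Contents" / "Info.plist", "wb") as fh:
        plistlib.dump(info, fh)

    exe = macos_dir / name
    exe.write_text(launcher)      # placeholder launcher, not the real one
    exe.chmod(0o755)
    return app

def adhoc_sign(app: Path) -> None:
    """Ad-hoc sign the bundle (macOS only); '-' selects the ad-hoc identity."""
    subprocess.run(["codesign", "--force", "--sign", "-", str(app)], check=True)

# Build a throwaway bundle in a temp directory and read its plist back.
with tempfile.TemporaryDirectory() as tmp:
    app = write_app_bundle(Path(tmp), "HeyClaudeDemo",
                           "com.example.heyclaude-demo",
                           "#!/bin/sh\nexec echo demo\n")
    with open(app / "Contents" / "Info.plist", "rb") as fh:
        loaded = plistlib.load(fh)
    exe_is_executable = (app / "Contents" / "MacOS" / "HeyClaudeDemo").stat().st_mode & 0o111 != 0
```

`adhoc_sign` is defined but not called here, since `codesign` exists only on macOS. On a Mac it would run once after the bundle is written.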

Giving it a voice (five of them)

A wake word that works silently is unnerving — you say the phrase and have no idea whether the machine heard you, started listening, dispatched, or quietly ignored the whole thing. So the tool talks back at each step of the pipeline: a chime and a spoken line when it wakes, another when it is transcribing, another when the agent is launched. The interesting design choice was making that feedback feel like a character rather than a system beep.

The v0.2.0 release ships five character voice soundpacks: sawyer, a warm Southern narrator (the default); alastair, a precise British robo-butler; and mara, cass, and sol. Each pack has several in-character lines written for every cue, and the tool draws from them with a shuffle-bag — it plays through a randomized permutation of the available lines before any line can repeat, so you never hear the same acknowledgement twice in a row, and it never feels like a recording on a loop. The packs also bundle CC0 sound effects, and hey-claude sounds new <name> scaffolds an empty pack so you can record or synthesize your own voice and drop the clips in.

This is a small feature with an outsized effect on whether the thing feels alive, and the shuffle-bag is the reason. Pure random selection repeats often enough to feel broken: with only a handful of lines per cue, an immediate repeat turns up within the first few wakes. A shuffle-bag guarantees that every line plays once before any line repeats, for the cost of one shuffled list. It is the same trick good music players use to make “shuffle” actually feel shuffled.
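A shuffle-bag is only a few lines. The sketch below is one way to write it, not necessarily how hey-claude does: the class name is hypothetical and it assumes the lines in a pack are distinct. The one subtlety is the refill boundary, where the last line of one permutation could otherwise be the first line of the next.

```python
import random
from typing import List, Optional, Sequence

class ShuffleBag:
    """Play every line once, in random order, before any line repeats."""

    def __init__(self, lines: Sequence[str], rng: Optional[random.Random] = None):
        if not lines:
            raise ValueError("need at least one line")
        self._lines = list(lines)
        self._rng = rng or random.Random()
        self._bag: List[str] = []
        self._last: Optional[str] = None

    def next(self) -> str:
        if not self._bag:
            self._bag = self._lines[:]
            self._rng.shuffle(self._bag)
            # Items are popped from the end, so the last element plays first.
            # Swap it away if it would repeat the line we just played.
            if len(self._bag) > 1 and self._bag[-1] == self._last:
                self._bag[0], self._bag[-1] = self._bag[-1], self._bag[0]
        self._last = self._bag.pop()
        return self._last

# Seeded generator so the demo is repeatable.
bag = ShuffleBag(["a", "b", "c", "d"], random.Random(0))
draws = [bag.next() for _ in range(40)]
```

Every block of four draws is a full permutation of the pack, and the boundary swap means no two consecutive draws are ever the same line.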

Where the agent runs

The other addition in v0.2.0 is a configurable working directory. By default a dispatched agent runs in whatever folder you launched hey-claude from, which is exactly right when you start it by hand inside a project. But the tool is most useful left running as a login daemon — and a daemon has no meaningful “current folder.” So you can pin one: hey-claude config set work_dir ~/code/project sets it permanently, or hey-claude run --dir <path> sets it for a single session. An empty value keeps the launch-folder behavior. It is a one-line setting, but it is the difference between a tool that only works when you babysit it from a terminal and one that works the way you actually want to use it: always on, dispatching into the project you care about, from the moment you log in.
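The resolution order can be written down as a small function. This is a sketch of the behavior described above, with one labeled assumption: that a per-session `--dir` takes precedence over the configured `work_dir`. The function name and signature are hypothetical.

```python
import os
from pathlib import Path
from typing import Optional

def resolve_work_dir(cli_dir: Optional[str],
                     config_work_dir: Optional[str],
                     launch_dir: str) -> Path:
    """Pick the agent's working directory.

    Order (assumed): the --dir flag, then the configured work_dir, then the
    folder the tool was launched from. None or "" falls through to the next.
    """
    for candidate in (cli_dir, config_work_dir):
        if candidate:
            return Path(os.path.expanduser(candidate)).resolve()
    return Path(launch_dir).resolve()

# An empty config value keeps the launch-folder behavior.
default_dir = resolve_work_dir(None, "", "/tmp")
# A configured value with "~" is expanded to the user's home directory.
pinned_dir = resolve_work_dir(None, "~/code/project", "/tmp")
# A per-session flag wins over the configured value.
session_dir = resolve_work_dir("/usr", "~/code/project", "/tmp")
```

Expanding `~` explicitly matters for the daemon case, because a launch agent does not run the command through a shell that would expand it.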

What this example is really about

hey-claude is a small, fun tool, and we built it as one — and put it on GitHub, MIT-licensed, for anyone to use. But the engineering under it is the engineering we do on client work: profile where the cost actually is and gate the expensive stages behind cheap ones; treat an ML model as a reproducible build artifact rather than a one-off; match the runtime to the hardware instead of fighting it; design the dispatch boundary so the dangerous input has no channel to travel through; and spend the unglamorous time on the platform details that decide whether the thing works for a real person on a real machine.

Swap “wake word” for a fraud signal, a document classifier, or a search-ranking model and the playbook is identical: a cheap-filters-first pipeline, a pinned and scriptable training pass, a runtime sized to the hardware, and a boundary that contains the blast radius. A voice toy makes a good demonstration of it because the failure modes are loud and immediate — a wake word that fires on the television, or a latency that makes the whole thing feel broken, you notice on the first try. Most production systems would be better off held to a standard where you would notice that fast.

Install it

hey-claude is free, open source (MIT), and runs entirely on your machine — no API key to listen, no signup, nothing leaves the laptop until you dispatch. It needs macOS on Apple Silicon, Python 3.10–3.13, and Claude Code ≥ 2.1.139 (for claude --bg). Install straight from the public repo:

```shell
# PortAudio is the only system dependency
brew install portaudio

# install the CLI in an isolated environment
pipx install git+https://github.com/tachyurgy/hey-claude@v0.2.0

hey-claude doctor    # first-run check: mic permission, models, claude --bg
hey-claude           # start listening, then just say "hey claude, ..."
```

The first run walks you through granting microphone permission and picking a wake engine: whisper works immediately with no model, or openwakeword uses a trained hey_claude model for lower idle cost. From there, hey-claude config set work_dir ~/code/project pins where dispatched agents run, and hey-claude sounds lets you switch between the five character voices (sawyer, alastair, mara, cass, sol) or scaffold your own. Full configuration, the launch-agent setup (start at login, restart on crash), and clean uninstall are documented in the README. The pinned @v0.2.0 git install above is the canonical path.