Knowing When Not to Make the Highlight Reel
Everybody’s camera roll has the same shape. Ten thousand items. Most of them are nothing — a blurry menu, four near-identical shots of the same latte, a video that’s thirty seconds of a pocket. But buried in there are the six frames that are actually good, and if someone could just find them and cut them into something postable, you’d post it.
That’s a great product idea, and it’s a deceptively hard engineering problem. Recently I built a small, runnable slice of it — a Camera-Roll Highlight Picker — and the most interesting decision in the whole thing wasn’t how to pick the good clips. It was teaching it when to give up.
The pipeline, briefly
The happy path is four stages:
- Score every clip on an explainable set of signals — sharpness, exposure, aesthetics, faces, camera stability.
- Cluster near-duplicates so a seven-shot burst becomes one clip.
- Assemble a draft edit — an ordered list of clips with runtimes, in an opener → body → closer arc, filled to a ~22-second target.
- Optionally name it with an LLM.
None of that is exotic. What makes or breaks it — what separates a tool people keep from a novelty they delete — is the quality of two judgments: which clips are good, and when there isn’t a reel here at all.
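For the assembly stage, here is a minimal sketch of what "opener → body → closer, filled to a ~22-second target" could look like. Only the arc and the target come from the description above. The `Clip` fields, the per-clip cap, and the rule that the best clip opens, the second-best closes, and the body runs chronologically are all illustrative assumptions, not the demo's actual logic.

```python
from dataclasses import dataclass

TARGET_SECONDS = 22.0    # the ~22-second target mentioned above
MAX_CLIP_SECONDS = 4.0   # assumption: cap on any single clip's runtime
MIN_CLIP_SECONDS = 1.0   # assumption: stop once less than this is left


@dataclass
class Clip:
    name: str
    score: float      # 0-100 quality score
    taken_at: float   # capture time in seconds
    seconds: float    # usable length of the clip


def assemble(clips):
    """Return an ordered list of (clip name, runtime) pairs."""
    ranked = sorted(clips, key=lambda c: c.score, reverse=True)
    budget = TARGET_SECONDS
    picked = []  # (clip, runtime), best-first
    for clip in ranked:
        if budget < MIN_CLIP_SECONDS:
            break
        run = min(clip.seconds, MAX_CLIP_SECONDS, budget)
        picked.append((clip, run))
        budget -= run
    if len(picked) < 3:
        # Too few clips for an arc; keep them best-first.
        return [(c.name, run) for c, run in picked]
    # Assumed arc: best clip opens, second-best closes,
    # everything else plays in capture order in between.
    opener, closer, body = picked[0], picked[1], picked[2:]
    body.sort(key=lambda pair: pair[0].taken_at)
    return [(c.name, run) for c, run in [opener] + body + [closer]]
```

Putting the two strongest clips at the ends is one defensible choice among several; the point is only that the output is an ordered list with explicit runtimes that a later render stage can consume.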
Ranking you can explain
The scoring is a plain weighted sum of the per-clip signals. Not a black box — a legible function where every clip can show its own breakdown:
IMG_0004: 90.6/100 (sharpness 0.88, exposure 0.81, aesthetic 0.94, face 0.94, stability 1.00)
There’s a temptation to reach for a learned model immediately. But the first version of a selection system should be inspectable, because the failure mode you’ll actually hit is a user asking “why did you leave out my sunset?” If your answer is “the model felt like it,” you’ve lost them. If your answer is “your sunset scored a 55 — it came out soft and slightly underexposed,” you’ve got a product conversation, and a place to tune. Learn the weights later; keep the explanation forever.
A small thing that matters for social content specifically: I give a clip a nudge just for having a recognizable face in frame. Highlight reels are about people. A technically perfect photo of an empty beach is worth less than a slightly softer shot of someone laughing on it.
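A minimal sketch of that kind of scorer, to make "legible" concrete. The weights here are invented for illustration (the real ones are not given in this post), so the total will not match the sample line above. The face nudge is modeled simply as the face signal carrying its own weight.

```python
from dataclasses import dataclass

# Hypothetical weights, chosen only so they sum to 1.0.
WEIGHTS = {
    "sharpness": 0.25,
    "exposure": 0.20,
    "aesthetic": 0.25,
    "face": 0.15,       # the nudge for having a face in frame
    "stability": 0.15,
}


@dataclass
class Clip:
    name: str
    signals: dict  # signal name -> value in [0, 1]


def score(clip: Clip) -> float:
    """Weighted sum of the per-clip signals, scaled to 0-100."""
    total = sum(WEIGHTS[key] * clip.signals[key] for key in WEIGHTS)
    return round(100 * total, 1)


def explain(clip: Clip) -> str:
    """One-line breakdown: total score plus every signal behind it."""
    parts = ", ".join(f"{key} {clip.signals[key]:.2f}" for key in WEIGHTS)
    return f"{clip.name}: {score(clip):.1f}/100 ({parts})"
```

Because the score and the explanation read from the same table, they cannot drift apart: tuning a weight changes both at once.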
Dedup is where reels go to look broken
Bursts are the enemy of a good montage. Your phone fired off seven frames of the same smile; if all seven land in the reel, it reads like a stutter — like a rendering bug. So near-duplicates have to collapse to a single best frame.
The trick is that “near-duplicate” needs two signals, not one. A perceptual hash tells you two frames look alike, but two unrelated blue-sky photos can collide in hash space. Time proximity tells you two frames were captured seconds apart, but not that they show the same thing. Require both, a similar hash and a close capture time, and merge transitively with a union-find, so that a burst whose frames drift still chains together through its neighbors. A real burst then collapses cleanly, while two genuinely different moments that happen to look alike stay separate. From each cluster you keep the single highest-scoring frame and quietly bench the rest.
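Here is a small sketch of the two-signal rule with a union-find. The thresholds and the `Frame` fields are assumptions for illustration; the hash is treated as a 64-bit integer compared by Hamming distance, which is one common way to compare perceptual hashes, not necessarily what the demo does.

```python
from dataclasses import dataclass

MAX_HAMMING = 6        # assumption: max differing bits to count as "looks alike"
MAX_GAP_SECONDS = 3.0  # assumption: max capture-time gap to count as "same burst"


@dataclass
class Frame:
    name: str
    phash: int       # perceptual hash as an integer
    taken_at: float  # capture time in seconds
    score: float


def hamming(a: int, b: int) -> int:
    """Number of bits that differ between two hashes."""
    return bin(a ^ b).count("1")


def find(parent, i):
    """Union-find root lookup with path halving."""
    while parent[i] != i:
        parent[i] = parent[parent[i]]
        i = parent[i]
    return i


def cluster(frames):
    """Group frames that are similar in hash AND close in time, transitively."""
    parent = list(range(len(frames)))
    order = sorted(range(len(frames)), key=lambda i: frames[i].taken_at)
    for pos, i in enumerate(order):
        for q in range(pos + 1, len(order)):
            j = order[q]
            if frames[j].taken_at - frames[i].taken_at > MAX_GAP_SECONDS:
                break  # sorted by time, so nothing later can be close enough
            if hamming(frames[i].phash, frames[j].phash) <= MAX_HAMMING:
                parent[find(parent, i)] = find(parent, j)
    groups = {}
    for i in range(len(frames)):
        groups.setdefault(find(parent, i), []).append(frames[i])
    return list(groups.values())


def keepers(frames):
    """Keep the highest-scoring frame from each cluster."""
    return [max(group, key=lambda f: f.score) for group in cluster(frames)]
```

The transitive merge is what lets the first and last frame of a drifting burst land in one cluster even when they no longer match each other directly, while a lookalike shot from another day fails the time test and stays on its own.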
This is also what makes the “10,000 items” number tractable. You’re not reasoning about ten thousand things; once the low scorers are filtered out and the bursts are collapsed, you’re reasoning about a much smaller set of distinct moments.
The part I actually care about: refusing
Here is the decision that tells you whether an engineer has shipped consumer software before. The demo-friendly behavior — the one that looks great in a pitch — is to always produce a montage. Drop in any roll, get a reel, watch the progress bar fill. Ship it.
It’s the wrong product.
A reel stitched out of four blurry, near-identical shots is worse than no reel. The user posts it, it looks bad, and they don’t blame their footage — they blame your app, once, and then they delete it. The confident progress bar that produced garbage is the thing they remember.
So the pipeline has a gate that sits between analysis and assembly, and it can say no. It refuses in three distinct situations:
- Not enough good material. Too few clips clear the quality bar. A pocket-dial afternoon of soft, dark footage doesn’t become a highlight reel by force.
- Not enough variety. This is the subtle one. You can have plenty of individually gorgeous clips that are all the same three seconds — ten perfect shots of one jump. Every clip passes; there’s still no reel, just one moment photographed to death. A montage of it would read as a stutter, not a story. Catching this requires reasoning about distinct moments (the clusters), not raw clip count.
- Not enough runtime. One nice clip and nothing else can’t fill a postable social cut.
And when it refuses, it doesn’t throw an error and it doesn’t fake a result. It returns a plain-English reason and a concrete suggestion:
No reel. Only 1 of 21 clips clears the quality bar. I won’t stitch a highlight reel out of blurry, dark, or shaky footage — it would look worse than posting nothing. Try: grab a few more sharp, well-lit shots across different moments, then try again.
That’s the same instinct I build into every system that scores or judges something: an abstain path. A model that never says “I don’t have enough to go on” will confidently hand you a number for anything, and confident wrong answers are the expensive kind. The honest “not enough signal” is a feature, not a gap — in an eval rubric, in a fraud score, and in a camera-roll editor that’s one bad export away from an uninstall.
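The gate itself can be small. Below is a sketch of the three refusals as one function that returns a verdict rather than raising. Every threshold, field name, and message string is a placeholder I picked for illustration; only the three conditions and the "reason plus suggestion" shape come from the description above.

```python
from dataclasses import dataclass

# All four thresholds are illustrative assumptions.
QUALITY_BAR = 60.0
MIN_GOOD_CLIPS = 3
MIN_MOMENTS = 3
MIN_RUNTIME_SECONDS = 8.0


@dataclass
class Clip:
    score: float     # 0-100 quality score
    cluster_id: int  # which near-duplicate cluster (moment) it belongs to
    seconds: float   # usable length


@dataclass
class Verdict:
    ok: bool
    reason: str = ""
    suggestion: str = ""


def gate(clips) -> Verdict:
    """Decide whether there is a reel here; refuse with a reason if not."""
    good = [c for c in clips if c.score >= QUALITY_BAR]
    if len(good) < MIN_GOOD_CLIPS:
        return Verdict(
            False,
            f"Only {len(good)} of {len(clips)} clips cleared the quality bar.",
            "Grab a few more sharp, well-lit shots, then try again.",
        )
    # Variety is counted in distinct moments, not raw clips.
    best = {}
    for c in good:
        if c.cluster_id not in best or c.score > best[c.cluster_id].score:
            best[c.cluster_id] = c
    if len(best) < MIN_MOMENTS:
        return Verdict(
            False,
            f"{len(good)} good clips, but only {len(best)} distinct moment(s).",
            "Shoot a few different moments, not more takes of the same one.",
        )
    runtime = sum(c.seconds for c in best.values())
    if runtime < MIN_RUNTIME_SECONDS:
        return Verdict(
            False,
            f"Only {runtime:.1f}s of usable footage.",
            "Add a few longer clips, then try again.",
        )
    return Verdict(True)
```

Returning a `Verdict` instead of throwing keeps the refusal on the normal code path, which makes it as easy to test and to show in the UI as a successful reel.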
What I deliberately left out
The demo runs on a synthetic feature vector per clip rather than real computer vision, on purpose — so it runs anywhere in milliseconds and the decision layer is the thing on display. In a real v1 those features come from an on-device Vision / optical-flow pass, the perceptual hashes come from actual thumbnails, the weights get learned and personalized, and the edit decision list hands off to an ffmpeg trim/concat/crossfade stage that renders the actual vertical video. Every one of those is a drop-in behind an interface the pipeline already has. None of them change the two judgments that matter.
Because in the end the moat here isn’t the render. Rendering is solved. The moat is taste — noticing the six good frames, cutting the duplicates, and having the discipline to hand back nothing when there’s nothing worth handing back.