The Self-Healing Scraper That Refuses to Trust Itself
Every team that scrapes the web for a living lives with the same low-grade dread: a source site redesigns overnight, your selectors silently stop matching, and — if you’re unlucky — nobody notices until a customer asks why last Tuesday’s numbers look wrong. Scraping isn’t hard because writing a CSS selector is hard. It’s hard because the target moves, and the failure is silent.
The fashionable answer in 2026 is “point an LLM at the broken page and have it write a new selector.” That works, and it’s genuinely useful. But if you stop there, you’ve traded one silent failure for a subtler one: now a language model, which is very good at producing plausible output, is deciding what your data looks like. For most content that’s fine. For financial data feeding an investment decision, “plausible but wrong” is the worst possible outcome — worse than a loud, obvious break, because a loud break gets fixed and a quiet one gets acted on.
So I built a small self-healing scraper to work out the design properly, and the interesting part turned out not to be the repair. It was the gate.
The pipeline
The domain is deliberately unglamorous: a fund’s “top holdings” page — ticker, security name, portfolio weight, market value. The kind of brittle, frequently-restyled source a data pipeline for investment firms scrapes every single day.
The happy path is boring and that’s the point. A ruleset — a row selector plus a per-field CSS selector and a value transform (text, percent, money) — runs deterministically over the HTML and produces structured records. Deterministic means cheap, fast, and auditable: given the same page, you get the same output, and you can point at exactly which selector produced which value. No model in the loop. This runs on every scrape.
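To make the shape of a ruleset concrete, here is a minimal sketch rather than the code from the demo. It assumes well-formed markup and uses the standard library's ElementTree path expressions as a stand-in for CSS selectors, so it runs without third-party packages. The names `FieldRule`, `Ruleset`, `extract`, the class names in the sample pages, and the sample holdings are all illustrative assumptions.

```python
import xml.etree.ElementTree as ET
from dataclasses import dataclass

# The three value transforms named in the text: text, percent, money.
def to_text(raw: str) -> str:
    return raw.strip()

def to_percent(raw: str) -> float:
    return float(raw.strip().rstrip("%"))

def to_money(raw: str) -> float:
    return float(raw.strip().lstrip("$").replace(",", ""))

TRANSFORMS = {"text": to_text, "percent": to_percent, "money": to_money}

@dataclass
class FieldRule:
    path: str       # ElementTree path relative to the row (CSS selector stand-in)
    transform: str  # key into TRANSFORMS

@dataclass
class Ruleset:
    row_path: str   # path that selects one element per holding
    fields: dict    # field name -> FieldRule

def extract(html: str, ruleset: Ruleset) -> list:
    """Deterministic extraction: same page and ruleset, same records."""
    root = ET.fromstring(html)  # assumption: the page parses as well-formed XML
    records = []
    for row in root.findall(ruleset.row_path):
        record = {}
        for name, rule in ruleset.fields.items():
            node = row.find(rule.path)
            raw = "".join(node.itertext()) if node is not None else ""
            try:
                record[name] = TRANSFORMS[rule.transform](raw)
            except ValueError:
                record[name] = None  # unparseable value; left for validation to flag
        records.append(record)
    return records

# Illustrative sample data, not real holdings.
OLD_PAGE = """<html><body><table class="holdings">
<tr><td class="tkr">ABC</td><td class="name">Alpha Corp</td><td class="wt">8.42%</td><td class="mv">$12,450,000</td></tr>
<tr><td class="tkr">XYZ</td><td class="name">Xylo Inc</td><td class="wt">5.10%</td><td class="mv">$7,540,000</td></tr>
</table></body></html>"""

# The same data after a hypothetical redesign into a card grid.
NEW_PAGE = """<html><body><div class="grid">
<div class="card"><span class="pct">8.42%</span><span class="sym">ABC</span></div>
</div></body></html>"""

RULESET = Ruleset(
    row_path=".//table[@class='holdings']/tr",
    fields={
        "ticker": FieldRule("td[@class='tkr']", "text"),
        "name": FieldRule("td[@class='name']", "text"),
        "weight": FieldRule("td[@class='wt']", "percent"),
        "market_value": FieldRule("td[@class='mv']", "money"),
    },
)

records = extract(OLD_PAGE, RULESET)
after_redesign = extract(NEW_PAGE, RULESET)  # the old selectors match nothing
```

Real pages are rarely well-formed XML, so a production version would likely swap in a proper HTML parser with CSS selector support. The structure of the ruleset would stay the same.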
Then the site redesigns. The <table class="holdings"> becomes a <div> card grid; the class names change; the columns reorder. The underlying holdings are identical — this is presentational drift — but every selector in the ruleset now matches nothing.
Detecting drift without a special drift detector
You don’t need a separate anomaly-detection system to catch this. You need a schema and the honesty to enforce it. The target record has rules: a ticker matches ^[A-Z]{1,5}(\.[A-Z])?$, a weight is a number in (0, 100], a market value is positive, and a real top-holdings page has at least a handful of rows. The moment an extraction fails that schema — zero rows, or fields that don’t parse — you know the ruleset and the page have diverged.
That validation step is doing double duty. On a healthy page it’s a quality check. On a redesigned page it is the drift detector. Drift is just “the deterministic extractor stopped satisfying the contract,” and you already have the contract.
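The schema rules above translate almost line for line into code. This is a sketch under assumptions: the field names, the `MIN_ROWS` threshold of 3 (standing in for "a handful"), and the function names are mine, while the ticker pattern and the numeric ranges come from the rules stated above.

```python
import math
import re

TICKER_RE = re.compile(r"^[A-Z]{1,5}(\.[A-Z])?$")
MIN_ROWS = 3  # assumption: what counts as "at least a handful of rows"

def _is_number(value) -> bool:
    # bool is a subclass of int, so exclude it; also reject NaN and infinity.
    return (
        isinstance(value, (int, float))
        and not isinstance(value, bool)
        and math.isfinite(value)
    )

def record_errors(record: dict) -> list:
    """Return the names of the fields that violate the schema."""
    errors = []
    ticker = record.get("ticker")
    if not isinstance(ticker, str) or not TICKER_RE.match(ticker):
        errors.append("ticker")
    weight = record.get("weight")
    if not _is_number(weight) or not (0 < weight <= 100):
        errors.append("weight")
    market_value = record.get("market_value")
    if not _is_number(market_value) or market_value <= 0:
        errors.append("market_value")
    return errors

def validate(records: list) -> list:
    """Return human-readable problems; an empty list means the contract holds."""
    problems = []
    if len(records) < MIN_ROWS:
        problems.append(f"only {len(records)} rows, expected at least {MIN_ROWS}")
    for index, record in enumerate(records):
        bad_fields = record_errors(record)
        if bad_fields:
            problems.append(f"row {index}: invalid {', '.join(bad_fields)}")
    return problems

def has_drifted(records: list) -> bool:
    """Drift is simply 'the extraction no longer satisfies the contract'."""
    return bool(validate(records))
```

Note that `has_drifted` is a one-liner on top of `validate`. There is no separate detector to build or tune.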
The repair — and why it’s only a proposal
When drift trips, the pipeline asks for a new ruleset. It tries the LLM first (Gemini, with structured output constrained to a ruleset shape so you get a ruleset back, not an essay), and falls back to an offline rules-only heuristic that infers selectors by matching each DOM element’s content against the field signatures — the element that looks like a ticker, the one that looks like a percentage, the one that looks like money. Either way, the output is treated identically: as a proposal, not an answer.
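The rules-only fallback can be sketched as content-signature voting. This is an illustration of the idea, not the demo's implementation: it assumes each row has already been flattened into `(selector, text)` pairs by some DOM walk, and the regular expressions for percent and money, along with the selector names in the usage example, are assumptions. Only the ticker pattern comes from the schema.

```python
import re
from collections import Counter, defaultdict

# Field signatures, checked in order. The percent and money patterns are assumptions.
SIGNATURES = [
    ("weight", re.compile(r"^\d{1,3}(\.\d+)?\s*%$")),
    ("market_value", re.compile(r"^\$\s?\d{1,3}(,\d{3})*(\.\d+)?$")),
    ("ticker", re.compile(r"^[A-Z]{1,5}(\.[A-Z])?$")),
]

def classify(text: str):
    """Return the field whose signature the text matches, or None."""
    text = text.strip()
    for field, pattern in SIGNATURES:
        if pattern.match(text):
            return field
    return None

def infer_field_selectors(rows: list) -> dict:
    """Vote across rows: for each field, pick the selector that most often
    holds content matching that field's signature.

    `rows` is a list of rows, each a list of (selector, text) pairs.
    """
    votes = defaultdict(Counter)
    for row in rows:
        for selector, text in row:
            field = classify(text)
            if field is not None:
                votes[field][selector] += 1
    return {field: counter.most_common(1)[0][0] for field, counter in votes.items()}

# Hypothetical cards from a redesigned page, with illustrative data.
cards = [
    [(".card-sym", "ABC"), (".card-title", "Alpha Corp"),
     (".card-pct", "8.42%"), (".card-val", "$12,450,000")],
    [(".card-sym", "XYZ"), (".card-title", "Xylo Inc"),
     (".card-pct", "5.10%"), (".card-val", "$7,540,000")],
]
proposal = infer_field_selectors(cards)
```

Free-text fields such as the security name have no signature, so this sketch leaves them out. Voting across rows also softens the case where a short all-caps name happens to look like a ticker. Whatever comes out is still only a proposal.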
This is the whole argument. The LLM is a proposer. It never gets to be the source of truth. Which raises the obvious question: if you won’t trust the model, what do you trust?
The gate: known-good ground truth
The answer is a human-confirmed known-good snapshot. On the last healthy run, a person signed off that the extracted holdings were correct. Because presentational drift doesn’t change the data, that snapshot is the exact ground truth any repair must reproduce.
So a proposed ruleset has to clear two bars before it’s accepted:
- Schema validation — the repaired extraction must produce well-formed records (right count, right types, right ranges).
- The known-good gate — those records must reproduce the trusted snapshot, matched by ticker, with weight and market value within tolerance.
Bar one alone is not enough, and this is the failure mode people miss. It’s easy to produce output that is schema-valid and factually wrong. In the demo I make the point with an adversarial repair that reads the market value off the weight element: every record parses, every number is in range, the schema is perfectly happy — and it’s garbage. Schema validation waves it through. The known-good gate catches it instantly, because 8.42 is not 12,450,000.
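Here is a hedged sketch of the known-good gate and the adversarial case. The function name, the field names, the snapshot contents, and the relative tolerance of `1e-6` are assumptions for illustration; the source only says records are matched by ticker with weight and market value within tolerance.

```python
import math

def passes_gate(records: list, snapshot: list, rel_tol: float = 1e-6) -> bool:
    """True only if `records` reproduce the trusted snapshot, matched by ticker."""
    by_ticker = {record.get("ticker"): record for record in records}
    if len(by_ticker) != len(records):
        return False  # duplicate tickers in the repaired extraction
    if set(by_ticker) != {expected["ticker"] for expected in snapshot}:
        return False  # missing or unexpected holdings
    for expected in snapshot:
        got = by_ticker[expected["ticker"]]
        for field in ("weight", "market_value"):
            value = got.get(field)
            if not isinstance(value, (int, float)):
                return False
            if not math.isclose(value, expected[field], rel_tol=rel_tol):
                return False
    return True

# Human-confirmed snapshot from the last healthy run (illustrative data).
SNAPSHOT = [
    {"ticker": "ABC", "weight": 8.42, "market_value": 12450000.0},
    {"ticker": "XYZ", "weight": 5.1, "market_value": 7540000.0},
]

# A correct repair reproduces the snapshot exactly.
good_repair = [dict(record) for record in SNAPSHOT]

# The adversarial repair reads market value off the weight element.
# Every value is a positive number in range, so a schema check alone is satisfied.
bad_repair = [dict(record, market_value=record["weight"]) for record in SNAPSHOT]

good_ok = passes_gate(good_repair, SNAPSHOT)
bad_ok = passes_gate(bad_repair, SNAPSHOT)
```

The gate is deliberately strict about the set of tickers as well as the values. A repair that drops a row or invents one fails just as surely as one that misreads a number.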
A repair that fails either bar is rejected. The pipeline emits nothing and quarantines the source for a human. That’s the senior instinct made mechanical: when you’re not sure, you do not ship — you escalate. The cost of a false “healthy” is unbounded; the cost of a quarantine is a review-queue ticket.
Defense in depth, and degrading honestly
A couple of details that matter in production fell out of this design naturally. Because both the LLM and the heuristic produce the same proposal shape and both go through the same gate, you get defense in depth for free: if the model proposes a subtly wrong ruleset, the gate rejects it and the deterministic heuristic still gets a chance to recover the scraper, so neither is trusted blindly. And because the gate doesn’t care where a proposal came from, the system degrades honestly: with no API key or no network, it runs the whole loop on the rules-only path. The LLM is an accelerant, never a dependency.
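The control flow is small enough to sketch in full. This is an assumed shape, not the demo's code: `repair` takes the proposers in priority order plus the extraction, schema, and gate functions as plain callables, and the stubs in the usage example exist only to exercise the three outcomes.

```python
from typing import Callable, Sequence, Tuple

def repair(
    html: str,
    proposers: Sequence[Tuple[str, Callable]],  # (name, propose(html) -> ruleset)
    extract: Callable,       # extract(html, ruleset) -> records
    is_valid: Callable,      # schema validation: records -> bool
    passes_gate: Callable,   # known-good gate: records -> bool
) -> dict:
    """Try each proposer in order; accept the first proposal that clears both bars."""
    for name, propose in proposers:
        try:
            ruleset = propose(html)
            records = extract(html, ruleset)
        except Exception:
            continue  # e.g. no API key, no network, or an unusable ruleset
        if is_valid(records) and passes_gate(records):
            return {"status": "accepted", "source": name,
                    "ruleset": ruleset, "records": records}
    # Nothing cleared both bars: emit nothing and escalate to a human.
    return {"status": "quarantined", "source": None, "ruleset": None, "records": []}

# --- Stubs for illustration only ---
GOOD = [{"ticker": "ABC", "weight": 8.42, "market_value": 12450000.0}]
BAD = [{"ticker": "ABC", "weight": 8.42, "market_value": 8.42}]

def llm_offline(html):
    raise RuntimeError("no API key")

def llm_wrong(html):
    return "bad"

def heuristic_right(html):
    return "good"

def heuristic_wrong(html):
    return "bad"

def stub_extract(html, ruleset):
    return GOOD if ruleset == "good" else BAD

def stub_valid(records):
    return len(records) > 0

def stub_gate(records):
    return records == GOOD

def run(llm, heuristic):
    proposers = [("llm", llm), ("heuristic", heuristic)]
    return repair("<html/>", proposers, stub_extract, stub_valid, stub_gate)

offline = run(llm_offline, heuristic_right)    # LLM unavailable, heuristic recovers
llm_bad = run(llm_wrong, heuristic_right)      # LLM rejected by the gate, heuristic recovers
both_bad = run(llm_wrong, heuristic_wrong)     # nothing passes, source is quarantined
```

The proposers never see the gate and the gate never sees who proposed. That separation is what lets you add, remove, or reorder proposers without touching the trust decision.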
What I’d tell a team building this for real
The gate is only as good as its ground truth, and the genuinely hard problem is one I deliberately left at the edges: telling layout drift (data unchanged, re-extract and move on) apart from data drift (the holdings legitimately changed, and your “known-good” is now stale). You can’t fully automate that distinction, which is exactly why a failed repair should always route to a person rather than guess. Everything else (scheduled fetching with content hashing, caching accepted rulesets per layout fingerprint so you only pay for the LLM on genuine drift, gating against several recent snapshots plus statistical checks such as weights summing to ~100% and values in the expected magnitude) is refinement on top of one non-negotiable rule.
That rule is the whole post: an LLM can propose the fix, but it doesn’t get to decide the fix is good. Ground truth decides. Everything else is plumbing.