The Query That Got 30× Faster Without Changing Its Answer
There is a family of query that shows up in every analytics product I’ve worked on, and it always starts slow. You cohort a population on one event, then measure the time to a second event along a timeline. In commerce it’s “days from signup to first purchase.” In healthcare it’s the one I want to walk through here: for patients with an index diagnosis of type-2 diabetes, how long until they start first-line treatment, and how many never do inside the guideline window?
That last one is a treatment-gap question — the kind pharma and payers actually pay for, because the gap is where the clinical and commercial story lives. It’s also a perfect specimen for how I think about performance, because the naive version is correct, obvious, and far too slow to sit behind an interactive dashboard. Getting it fast is easy to do wrong: the failure mode isn’t “still slow,” it’s “fast and subtly incorrect.” So the whole exercise runs under one rule — an optimization that changes the answer is a bug, not an optimization.
Everything below is reproducible: I built it as a small, runnable demo over a synthetic ~3M-row claims graph (patients → encounters → diagnoses → treatments), with a benchmark that refuses to report a speedup unless the optimized query returns the identical result.
The shape of the problem
The data lands normalized, the way de-identified claims usually do: a big
diagnoses table of ICD-10 codes, a big treatments table of drug fills, both
hanging off encounters, both with only their primary keys as indexes. The analytic
is a two-stage journey:
- Index event — each patient’s earliest
E11.*(type-2 diabetes) diagnosis. - Time to treatment — their first
metforminfill on or after that index date, minus the index date. No fill within 90 days is a treatment gap.
The “on or after” is not a detail you can skip. Real patients have metformin fills that predate their diabetes coding for all sorts of reasons, and a pre-index fill is not time-to-treatment. The single most common way to make this query “faster” is to quietly drop that date guard — and produce a confidently wrong number. Hold onto it.
Stage 0: the naive plan
The straightforward SQL groups the E11 diagnoses per patient to get the index
date, LEFT JOINs metformin fills guarded by tx_date >= index_dx, and aggregates.
EXPLAIN ANALYZE tells you exactly why it’s slow:
Seq Scan on diagnoses (... rows=1683094 ...)
Seq Scan on treatments (... rows=322841 ...)
Two sequential scans over the full fact tables, every single time anyone asks the question. On the ~3M-row dataset that’s about 66 ms. Not a catastrophe at this size — but it grows linearly with the data, and Komodo-scale claims data is 50 billion encounters, not three million. Linear is the enemy.
Stage 1: targeted indexes (and the trap)
The cohort predicate is selective — diabetics are a fraction of all patients, and
E11 diagnoses are a small slice of all diagnoses — so the obvious move is to stop
scanning and start seeking:
CREATE INDEX idx_dx_code_patient_date
ON diagnoses (icd10_code text_pattern_ops, patient_id, dx_date);
CREATE INDEX idx_tx_drug_patient_date
ON treatments (drug_code, patient_id, tx_date);
Note text_pattern_ops. The cohort filter is a left-anchored LIKE 'E11%', and
under a non-C collation a plain btree on icd10_code cannot serve a LIKE
prefix range. Build the index the default way and you’ll watch the planner ignore
it and keep scanning — the classic “I added an index and nothing happened” bug.
The text_pattern_ops opclass makes the prefix range usable.
With the indexes in place the plan flips to bitmap index scans and the query drops to about 27 ms — a 2.5× win. Real, and free. But here’s the part most people miss.
The lesson hiding in the scaling curve
I ran the same before/after at a larger scale, and the index win shrank: 2.5× at 3M rows became 1.3× at 10M. That’s not noise — it’s the whole point. Indexes fix the scan, not the query. They get you to the right rows faster, but they do nothing about the fundamental cost of re-running a grouped join over millions of fact rows on every request. As the data grows, the aggregate work dominates and the index’s share of the time shrinks toward nothing.
If your slow analytic is a repeated aggregation, indexing is a down payment, not the fix. The fix is to stop doing the work.
Stage 2: precompute the hot path
The journey question always resolves to exactly one row per cohort patient:
(patient_id, index_dx, first_metformin_after_index, days_to_tx). So compute that
once and read it forever:
CREATE MATERIALIZED VIEW patient_journey AS
... the expensive grouped join, exactly once ...;
CREATE UNIQUE INDEX ON patient_journey (patient_id);
Now the analytic scans one small, purpose-built table instead of joining two giant ones. It runs in about 2 ms — roughly 30× faster than the naive query — and that speedup holds as the data grows, because the read no longer touches the fact tables at all. This is the same discipline as caching a hot aggregate or denormalizing a rollup column: pay the join once at refresh time, read it a million times for free. It’s what makes analytics “feel instant,” and it’s the honest core of any healthcare-specific compute engine.
The unique index isn’t decoration either — it’s what lets you
REFRESH MATERIALIZED VIEW CONCURRENTLY in production, so the nightly (or
incremental) refresh never blocks the dashboards reading the view.
The rule that makes it real
Here’s the part I care about most. My benchmark computes a five-number summary — cohort size, treated count, treatment-gap count, median and p95 days-to-treatment — under all three physical designs, and it will not print a single timing unless the naive, indexed, and rollup queries return byte-identical answers. On top of that sits a hand-built fixture whose result I derived by hand, pinning every edge case: the pre-index fill that must be excluded, the >90-day gap that counts as both treated and gapped, the patient with two diagnoses indexed on the earlier one, the non-diabetic who must never enter the cohort.
Because when the number is a treatment gap that a pharma team will brief an executive on, “fast” is table stakes. Correct is the product. The optimization only counts if the answer survives it — and the way you know it survived is that you tested it, not that it looked right.