Measure the Aggregate, Never the Person

Almost every product wants the same handful of numbers. How many people use the new feature? Which platforms? Are sessions getting longer or shorter? These are reasonable questions. The default way we answer them is not.

The default is: every client emits a stream of events — {userId, platform, locale, feature, durationMs, timestamp} — and ships them to a server, which stores them and aggregates later. The dashboard shows a tidy bar chart. But to produce that chart, a server now holds a re-identifiable log of everyone’s behaviour. That log is a liability the moment it exists, independent of whether anyone ever misuses it: it can be breached, subpoenaed, quietly repurposed, or joined against another dataset to unmask individuals. You wanted a bar chart and you built a surveillance database as a side effect.

There’s a better default, and it’s not much more code. Aggregate on the device, and refuse to report any group small enough to identify a person. That’s the whole idea. Below is how I built a small TypeScript library around it, and the one edge case that separates “looks private” from “is private.”

The inversion: counters, not events

The first move is to stop keeping an event log at all. Instead of storing events and aggregating them later, you fold each event into a running counter the moment it arrives, and then throw the event away:

add(event: AggregationEvent): void {
  const cohort = this.cohortFor(event);   // e.g. { platform, locale }
  const bucket = this.bucketFor(cohort);  // find-or-create
  bucket.count += 1;
  if (typeof event.value === "number") bucket.sum += event.value;
  // ...and that's it. The event itself is never retained.
}

After this runs, there is no events[] array anywhere. The aggregator holds only {cohort, count, sum} triples. There is nothing to leak, no log to breach, nothing to subpoena beyond counts you were going to publish anyway. Any dimension you didn’t explicitly group on — a userId, a session token — is never copied into a bucket, so it can’t accidentally surface in a serialized report. (I have a test that feeds in a secretUserId and asserts it appears nowhere in the output.)
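To make that shape concrete, here is a minimal, self-contained sketch of such an aggregator. Only `add`, the `{cohort, count, sum}` triple, and the idea of discarding the event come from the text above. The names `CounterAggregator`, `SketchEvent`, `snapshot`, and the JSON-string bucket key are assumptions for illustration, not the library's actual API.

```typescript
// Hypothetical sketch: type names and the key scheme are illustrative assumptions.
type SketchCohort = { platform: string; locale: string };

interface SketchEvent {
  platform: string;
  locale: string;
  value?: number;
  userId?: string; // present on the event, never copied into a bucket
}

interface SketchBucket {
  cohort: SketchCohort;
  count: number;
  sum: number;
}

class CounterAggregator {
  private readonly buckets = new Map<string, SketchBucket>();

  add(event: SketchEvent): void {
    // Copy only the grouped dimensions; every other field is dropped here.
    const cohort: SketchCohort = { platform: event.platform, locale: event.locale };
    const key = JSON.stringify([cohort.platform, cohort.locale]);

    // Find-or-create the bucket for this cohort.
    let bucket = this.buckets.get(key);
    if (bucket === undefined) {
      bucket = { cohort, count: 0, sum: 0 };
      this.buckets.set(key, bucket);
    }

    bucket.count += 1;
    if (typeof event.value === "number") bucket.sum += event.value;
    // The event goes out of scope here; nothing holds a reference to it.
  }

  // Returns copies so callers cannot mutate internal state.
  snapshot(): SketchBucket[] {
    return Array.from(this.buckets.values()).map((b) => ({
      cohort: { platform: b.cohort.platform, locale: b.cohort.locale },
      count: b.count,
      sum: b.sum,
    }));
  }
}

const sketchAgg = new CounterAggregator();
sketchAgg.add({ platform: "linux", locale: "en-GB", value: 1200, userId: "secret-user-42" });
sketchAgg.add({ platform: "linux", locale: "en-GB", value: 800, userId: "secret-user-43" });

const sketchSnapshot = sketchAgg.snapshot();
const sketchSerialized = JSON.stringify(sketchSnapshot);
```

Because the cohort object is built field by field rather than by spreading the event, a new field on the event type cannot reach a bucket unless someone adds it to the cohort explicitly.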

This is the local-first part: in a real deployment each device runs its own aggregator, and only the summaries ever combine server-side.
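One way the server-side combination might look is sketched below. `mergeSummaries` and `CohortSummary` are hypothetical names, and this is a plain merge under the assumption that each device uploads its `{cohort, count, sum}` triples as-is, so the server in this sketch still sees each device's summary in the clear.

```typescript
// Hypothetical sketch of combining per-device summaries; names are assumptions.
type MergeCohort = { platform: string; locale: string };

interface CohortSummary {
  cohort: MergeCohort;
  count: number;
  sum: number;
}

function mergeSummaries(perDevice: CohortSummary[][]): CohortSummary[] {
  const merged = new Map<string, CohortSummary>();
  for (const summaries of perDevice) {
    for (const s of summaries) {
      const key = JSON.stringify([s.cohort.platform, s.cohort.locale]);
      const existing = merged.get(key);
      if (existing === undefined) {
        // First time this cohort is seen: store a copy, not the caller's object.
        merged.set(key, {
          cohort: { platform: s.cohort.platform, locale: s.cohort.locale },
          count: s.count,
          sum: s.sum,
        });
      } else {
        existing.count += s.count;
        existing.sum += s.sum;
      }
    }
  }
  return Array.from(merged.values());
}

const deviceA: CohortSummary[] = [
  { cohort: { platform: "linux", locale: "en-GB" }, count: 3, sum: 900 },
];
const deviceB: CohortSummary[] = [
  { cohort: { platform: "linux", locale: "en-GB" }, count: 2, sum: 600 },
  { cohort: { platform: "macos", locale: "en-US" }, count: 1, sum: 100 },
];

const mergedSummaries = mergeSummaries([deviceA, deviceB]);
```

Counts and sums are additive, which is what makes this design mergeable at all: the server never needs the underlying events to produce the combined figure.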

The threshold gate: never report a group of one

Counters alone aren’t enough. Imagine you group by {platform, locale} and one cohort is {platform: "linux", locale: "is-IS"} with a count of 1. Publishing that row tells the world there is exactly one Icelandic Linux user — and if you know such a person exists, you’ve just learned something about a specific human from an “aggregate” report.

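So the report step applies a k-anonymity threshold: a cohort must contain at least k contributing events to appear at all. Everything below k is suppressed, not shown. One caveat: the gate counts events, not people, so it only corresponds to k distinct people if each person contributes at most one event per cohort (for example, by capping contributions per device before summaries are combined).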

for (const bucket of this.buckets.values()) {
  if (bucket.count < this.k) {
    suppressedCohorts++;
    suppressedEvents += bucket.count; // tallied here, coarsened before publishing
    continue;
  }
  cohorts.push({ cohort: bucket.cohort, count: bucket.count, sum: bucket.sum });
}

Pick k for your setting: 5, 25, 100. The guarantee is simple: no row in a published report ever describes fewer than k contributions. (k must be at least 2; k = 1 is not anonymity, it’s a cohort of one wearing a hat.)
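As a standalone illustration, the gate can be sketched as a pure function over already-counted buckets. `thresholdReport`, `CountedCohort`, and `ThresholdedReport` are hypothetical names, and rejecting k below 2 at call time is my reading of the rule above rather than confirmed library behaviour.

```typescript
// Hypothetical sketch of the threshold gate; names are assumptions.
interface CountedCohort {
  cohort: { platform: string; locale: string };
  count: number;
  sum: number;
}

interface ThresholdedReport {
  cohorts: CountedCohort[];
  suppressedCohorts: number;
}

function thresholdReport(buckets: CountedCohort[], k: number): ThresholdedReport {
  if (!Number.isInteger(k) || k < 2) {
    throw new RangeError("k must be an integer >= 2");
  }
  const cohorts: CountedCohort[] = [];
  let suppressedCohorts = 0;
  for (const bucket of buckets) {
    if (bucket.count < k) {
      // Too small to publish: count the suppression, emit no row.
      suppressedCohorts++;
      continue;
    }
    cohorts.push({
      cohort: { platform: bucket.cohort.platform, locale: bucket.cohort.locale },
      count: bucket.count,
      sum: bucket.sum,
    });
  }
  return { cohorts, suppressedCohorts };
}

const gateReport = thresholdReport(
  [
    { cohort: { platform: "linux", locale: "is-IS" }, count: 1, sum: 300 },
    { cohort: { platform: "macos", locale: "en-US" }, count: 40, sum: 12000 },
  ],
  5,
);
// gateReport.cohorts holds only the macos/en-US row; the linux/is-IS row is suppressed.
```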

The edge case that catches people: differencing

Here’s the part that separates a real implementation from a naive one. Suppose you suppress the small cohorts — good — but you also publish a helpful “total events” number, or a “suppressed events” tally. An attacker can now do arithmetic:

total − (sum of the visible cohorts) = the hidden remainder

If exactly one cohort was suppressed, that remainder is that cohort’s exact size. You carefully hid the small group, then handed back its count through subtraction. This is a differencing attack, and it’s the classic way threshold-based anonymization leaks.
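The arithmetic is short enough to write out. The figures below are made up for illustration, and `recoverHiddenCount` is a hypothetical helper that plays the attacker's role.

```typescript
// Hypothetical worked example of a differencing attack; all figures are invented.
function recoverHiddenCount(publishedTotal: number, visibleCounts: number[]): number {
  const visibleSum = visibleCounts.reduce((acc, n) => acc + n, 0);
  return publishedTotal - visibleSum;
}

const publishedTotal = 107;     // an exact "total events" figure in the report
const visibleCounts = [60, 40]; // the cohorts that passed the threshold

const hiddenRemainder = recoverHiddenCount(publishedTotal, visibleCounts);
// 107 - (60 + 40) = 7: the exact size of the one suppressed cohort.
```

Nothing here requires access to the aggregator. The attacker works only from the published report, which is why the exact total is itself the leak.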

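The fix is to treat any auxiliary total as sensitive too. In the library, the suppressed-events figure is always coarsened: it is floored to a multiple of k (or of an optional roundTo), so it can’t pin down a single small cohort. With the default step of k, a lone suppressed cohort, which by definition holds fewer than k events, is reported as 0: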

const coarsen = this.roundTo ?? this.k;
const suppressedEventsApprox =
  suppressedCohorts === 0 ? 0
  : Math.floor(suppressedEvents / coarsen) * coarsen;

There’s a test dedicated to exactly this: one below-k cohort of 7 events, and the assertion is that the reported figure is not 7. Getting this right is the difference between privacy theatre and privacy.
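The coarsening step can be sketched as a pure function that mirrors the excerpt above. `coarsenSuppressed` is a hypothetical name, and the guard against a non-positive step is an addition of mine, not something the text states.

```typescript
// Hypothetical standalone version of the coarsening step; the name is an assumption.
function coarsenSuppressed(
  suppressedEvents: number,
  suppressedCohorts: number,
  k: number,
  roundTo?: number,
): number {
  const step = roundTo ?? k;
  if (!Number.isInteger(step) || step < 1) {
    // Assumed guard: a zero or fractional step would produce NaN or odd buckets.
    throw new RangeError("coarsening step must be a positive integer");
  }
  if (suppressedCohorts === 0) return 0;
  return Math.floor(suppressedEvents / step) * step;
}

// One below-k cohort of 7 events with k = 10: the reported figure is 0, not 7.
const loneCohortFigure = coarsenSuppressed(7, 1, 10);

// Three suppressed cohorts totalling 23 events with k = 10: reported as 20.
const severalCohortsFigure = coarsenSuppressed(23, 3, 10);
```

Note that a roundTo smaller than k weakens the effect, since the reported figure then narrows the range the true tally can fall in. Treating that as a configuration error would be a reasonable extension.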

What this is, and what it isn’t

Thresholding protects presence: it stops any single small group from showing up. It’s the pragmatic first layer, and for a lot of product analytics it’s enough. What it does not give you is a formal guarantee against an attacker who can run many queries over time and difference the results across them. That’s the job of differential privacy — adding calibrated Laplace or Gaussian noise to the reported counts under a tracked privacy budget (ε). The threshold gate and the coarsening here are the sensible first layer; DP is the principled second one, and it composes cleanly on top of this design.
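For a sense of what that second layer involves, here is a minimal sketch of the Laplace mechanism applied to a single count. `sampleLaplace` and `noisyCount` are hypothetical names. The default sensitivity of 1 assumes each person changes a given count by at most one, which the event-counting design does not enforce by itself. The sketch does not track a privacy budget across reports, and naive floating-point sampling like this is not production-grade.

```typescript
// Hypothetical sketch of Laplace noise on one count; not a vetted DP implementation.
function sampleLaplace(scale: number, rng: () => number = Math.random): number {
  // Inverse-CDF sampling: u is uniform on [-0.5, 0.5).
  let u = rng() - 0.5;
  while (u === -0.5) u = rng() - 0.5; // resample to avoid log(0)
  return -scale * Math.sign(u) * Math.log(1 - 2 * Math.abs(u));
}

function noisyCount(
  count: number,
  epsilon: number,
  sensitivity: number = 1, // assumed: one person affects a count by at most 1
  rng: () => number = Math.random,
): number {
  if (!(epsilon > 0)) throw new RangeError("epsilon must be > 0");
  if (!(sensitivity > 0)) throw new RangeError("sensitivity must be > 0");
  // Laplace mechanism: noise scale is sensitivity / epsilon.
  return count + sampleLaplace(sensitivity / epsilon, rng);
}

// Smaller epsilon means a larger noise scale and stronger privacy.
const reportedCount = noisyCount(120, 0.5);
```

The injectable rng exists only so the function can be exercised deterministically in tests; a real deployment would want a carefully reviewed noise source.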

If I were taking the library further I’d add: DP noise on the counts, secure aggregation so per-device reports combine without any central party seeing one device’s contribution, a bounded cohort-cardinality cap (so an adversarial dimension value can’t blow up memory), and an explicit dimension allowlist so the safe path is the only path.
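Two of those items, the dimension allowlist and the cardinality cap, are small enough to sketch together. Everything here is hypothetical: `ALLOWED_DIMENSIONS`, `projectToAllowlist`, `CappedCounter`, and the choice to fold excess cohorts into a single overflow bucket are one possible design, not the library's.

```typescript
// Hypothetical sketch of an allowlist plus a cohort-cardinality cap.
const ALLOWED_DIMENSIONS = ["platform", "locale"] as const;
type AllowedDimension = (typeof ALLOWED_DIMENSIONS)[number];
type AllowedCohort = Record<AllowedDimension, string>;

const OVERFLOW_KEY = "__overflow__";

function projectToAllowlist(event: Record<string, unknown>): AllowedCohort {
  const cohort = {} as AllowedCohort;
  for (const dim of ALLOWED_DIMENSIONS) {
    const raw = event[dim];
    // Only allowlisted string dimensions survive; anything else becomes "unknown".
    cohort[dim] = typeof raw === "string" ? raw : "unknown";
  }
  return cohort;
}

class CappedCounter {
  private readonly counts = new Map<string, number>();
  private readonly maxCohorts: number;

  constructor(maxCohorts: number) {
    this.maxCohorts = maxCohorts;
  }

  add(event: Record<string, unknown>): void {
    let key = JSON.stringify(projectToAllowlist(event));
    // Once the cap is reached, unseen cohorts share one overflow bucket,
    // so the map never holds more than maxCohorts + 1 entries.
    if (!this.counts.has(key) && this.counts.size >= this.maxCohorts) {
      key = OVERFLOW_KEY;
    }
    this.counts.set(key, (this.counts.get(key) ?? 0) + 1);
  }

  size(): number {
    return this.counts.size;
  }

  countFor(key: string): number {
    return this.counts.get(key) ?? 0;
  }
}

const capped = new CappedCounter(2);
capped.add({ platform: "linux", locale: "en-GB", userId: "u1" });
capped.add({ platform: "linux", locale: "de-DE", userId: "u2" });
capped.add({ platform: "linux", locale: "fr-FR", userId: "u3" });
capped.add({ platform: "linux", locale: "is-IS", userId: "u4" });
```

With a cap of 2, the third and fourth distinct cohorts both land in the overflow bucket, so an attacker who floods the locale field with unique values grows one counter rather than the map.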

Why bother

Because the surveillance database was never the goal — the bar chart was. The industry defaulted to shipping raw events because it was easy and storage was cheap, and we’ve spent a decade discovering the downstream cost. Companies like DuckDuckGo have shown you can run a real business, ads and all, without profiling anyone. The engineering version of that principle is small and portable: fold events into counters on the device, suppress the small cohorts, coarsen the totals, and publish only what can’t be traced back to a person.

Measure the aggregate. Never the person.