The Honest Answer Is Sometimes ‘Not Enough Data’

Most engineering write-ups about credit data go straight to the model — the features, the weights, the fancy part. I want to write about the two things that happen before the model, because in my experience that’s where the real engineering lives, and where systems quietly go wrong: normalizing messy, heterogeneous inputs, and knowing when to refuse to answer.

I built a small reference project to make both concrete — a cross-border credit-data normalizer in Node/TypeScript and SQL. It’s deliberately tiny, but it takes the two hard parts seriously.

Every source is a different shape, and that’s the whole job

Give three teams the task of reporting a person’s credit accounts and you’ll get three incompatible files. In the demo I use three:

a US bureau CSV with payment history packed into a single semicolon-delimited string (2025-07:OK;2025-06:30;…) and a status vocabulary of OK / 30 / 60 / 90;
a UK reference-agency JSON with camelCase, GBP amounts, a nested consumers[].accounts[] tree, and a numeric arrears vocabulary (0 = up to date, 1 = one month behind, 2+ = serious);
a raw bank transaction feed (JSONL) with no credit limit and no lender-reported payment history at all — just cash flow.

The instinct is to sprinkle if (source === 'uk') … through the codebase. Don’t. The pattern that survives contact with a fourth and fifth source is a thin adapter per source whose only job is to translate that source’s quirks into one canonical schema. Everything downstream — storage, features, the API — reads the canonical shape and never learns that the UK feed calls it creditLimitGBP. Adding a new bureau becomes: write one adapter, register it. Nothing else changes.

US CSV  ─┐
UK JSON ─┼─► [adapter] ─► CanonicalAccount ─► store ─► risk-feature API
bank    ─┘

Two normalization decisions that look small and aren’t:

Currency. You cannot add a US card balance to a UK card balance. The normalizer converts every amount to a base currency (USD) up front — and, just as importantly, records the conversion as provenance (FX GBP→USD). A number that has been transformed should always carry a note saying so.

Missingness is not zero. The bank feed has no credit limit. The wrong move is to default it to 0 or null and let it silently flow into a utilization calculation. A missing limit means utilization is undefined for that account — a different thing entirely, and one the downstream code has to respect.

Provenance is a table, not a comment

If you’re deriving risk features that influence whether someone gets credit, “where did this number come from?” is not a nice-to-have — it’s what compliance, risk, and eventually a regulator will ask. So provenance is a first-class concept, stored right next to the data. For every canonical field and every payment observation, the store keeps (source, source_field, raw_value).

That means any derived feature is traceable end to end. When the API reports a utilization of 0.27, the response can point at the exact accounts and the exact raw balances and limits — across two countries and a currency conversion — that produced it. Building that in from the start is far cheaper than reconstructing it under audit pressure later.

The gate: refusing to score a thin file

Here’s the part I care about most, and the part most credit demos skip.

The population that financial-inclusion products exist to serve is, by definition, thin-file and no-file consumers — people the traditional system doesn’t have enough data on. So the single most common input isn’t a rich file; it’s a sparse one. What should the system do with it?

The tempting answer is to always return a number — degrade gracefully, output something. That’s exactly wrong. A fabricated score on sparse input is worse than no score: it’s a confident-looking number with nothing underneath it, and someone will make a lending decision on it.

So the normalizer has an explicit thin-file gate. A file is scorable only if it clears documented thresholds — enough accounts, enough payment history, a computable utilization. If it doesn’t, the API doesn’t return a made-up score. It returns:

{
  "scorable": false,
  "reason": "Insufficient data to score responsibly (thin file). Only 1 account(s); need >= 2. Only 0 payment observation(s); need >= 6.",
  "missing": ["revolving_account_with_limit", "payment_history", "additional_tradelines"]
}

Three things make this a product decision and not just a guard clause:

The thresholds are one named policy object, not magic numbers scattered through the code. A risk team can read and tune them.
The refusal is specific. It doesn’t just say “no” — it says exactly what’s missing, which is the actionable thing. “Needs more data” plus a precise list is a feature; a shrug is not.
It’s the tested path. The test suite asserts that a bank-only file cannot be scored just as firmly as it asserts a rich file can. The failure mode you’re guarding against — silently scoring garbage — is the one worth a test.

What this is really about

The score itself, in the demo, is an intentionally transparent linear heuristic; a real one is a governed, monitored, fair-lending-reviewed model, and that’s a different discipline. But the pipeline around it — heterogeneous ingestion, canonical normalization, provenance, and an honest gate on insufficient data — is reusable engineering, and it’s where a data platform earns trust.

The reflex to always produce an answer is a liability in any system that informs a real-world decision. Building the “I don’t have enough to answer that responsibly” path — and making it specific, tested, and first-class — is one of the clearer signals of senior work I know.