The Honest Answer Is Sometimes ‘Not Enough Data’
Most engineering write-ups about credit data go straight to the model — the features, the weights, the fancy part. I want to write about the two things that happen before the model, because in my experience that’s where the real engineering lives, and where systems quietly go wrong: normalizing messy, heterogeneous inputs, and knowing when to refuse to answer.
I built a small reference project to make both concrete — a cross-border credit-data normalizer in Node/TypeScript and SQL. It’s deliberately tiny, but it takes the two hard parts seriously.
Every source is a different shape, and that’s the whole job
Give three teams the task of reporting a person’s credit accounts and you’ll get three incompatible files. In the demo I use three:
- a US bureau CSV with payment history packed into a single semicolon-delimited string (2025-07:OK;2025-06:30;…) and a status vocabulary of OK / 30 / 60 / 90;
- a UK reference-agency JSON with camelCase, GBP amounts, a nested consumers[].accounts[] tree, and a numeric arrears vocabulary (0 = up to date, 1 = one month behind, 2+ = serious);
- a raw bank transaction feed (JSONL) with no credit limit and no lender-reported payment history at all: just cash flow.
The instinct is to sprinkle if (source === 'uk') … through the codebase. Don’t.
The pattern that survives contact with a fourth and fifth source is a thin
adapter per source whose only job is to translate that source’s quirks into
one canonical schema. Everything downstream — storage, features, the API —
reads the canonical shape and never learns that the UK feed calls it
creditLimitGBP. Adding a new bureau becomes: write one adapter, register it.
Nothing else changes.
US CSV  ─┐
UK JSON ─┼─► [adapter] ─► CanonicalAccount ─► store ─► risk-feature API
bank    ─┘
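To make the adapter idea concrete, here is a minimal TypeScript sketch under stated assumptions. The canonical field names, the US column names, the UK field names other than creditLimitGBP, the source keys (us_bureau, uk_agency), and the mapping of each vocabulary onto three canonical statuses are all hypothetical choices for illustration, not the project's actual schema.

```typescript
// Canonical shape that everything downstream reads (field names are illustrative).
type PaymentStatus = "current" | "late" | "seriously_late";

interface PaymentObservation {
  month: string; // "YYYY-MM"
  status: PaymentStatus;
}

interface CanonicalAccount {
  source: string;
  accountId: string;
  currency: string; // native currency; FX conversion is omitted from this sketch
  balance: number;
  creditLimit: number | null; // null means "not reported", never 0
  payments: PaymentObservation[];
}

type Adapter = (raw: unknown) => CanonicalAccount[];

// Hypothetical US CSV row after parsing; column names are assumptions.
interface UsRow {
  account_id: string;
  balance: string;
  credit_limit: string;
  payment_history: string; // e.g. "2025-07:OK;2025-06:30"
}

// Assumed mapping of the US vocabulary (OK / 30 / 60 / 90) onto canonical statuses.
function usStatus(code: string): PaymentStatus {
  switch (code) {
    case "OK": return "current";
    case "30": return "late";
    case "60":
    case "90": return "seriously_late";
    default: throw new Error("Unknown US payment status: " + code);
  }
}

function parseUsHistory(packed: string): PaymentObservation[] {
  const out: PaymentObservation[] = [];
  for (const entry of packed.split(";")) {
    if (entry.trim() === "") continue; // tolerate a trailing semicolon
    const parts = entry.trim().split(":");
    if (parts.length !== 2) throw new Error("Malformed history entry: " + entry);
    out.push({ month: parts[0] as string, status: usStatus(parts[1] as string) });
  }
  return out;
}

// Hypothetical UK JSON shape; only consumers[].accounts[] and creditLimitGBP come from the text.
interface UkFile {
  consumers: {
    accounts: {
      accountId: string;
      balanceGBP: number;
      creditLimitGBP: number;
      arrears: { month: string; code: number }[];
    }[];
  }[];
}

// UK arrears vocabulary: 0 = up to date, 1 = one month behind, 2+ = serious.
function ukStatus(code: number): PaymentStatus {
  if (code === 0) return "current";
  if (code === 1) return "late";
  if (code >= 2) return "seriously_late";
  throw new Error("Unknown UK arrears code: " + code);
}

const usAdapter: Adapter = (raw) =>
  (raw as UsRow[]).map((row): CanonicalAccount => ({
    source: "us_bureau",
    accountId: row.account_id,
    currency: "USD",
    balance: Number(row.balance),
    creditLimit: row.credit_limit === "" ? null : Number(row.credit_limit),
    payments: parseUsHistory(row.payment_history),
  }));

const ukAdapter: Adapter = (raw) => {
  const out: CanonicalAccount[] = [];
  for (const consumer of (raw as UkFile).consumers) {
    for (const acct of consumer.accounts) {
      out.push({
        source: "uk_agency",
        accountId: acct.accountId,
        currency: "GBP",
        balance: acct.balanceGBP,
        creditLimit: acct.creditLimitGBP,
        payments: acct.arrears.map((a): PaymentObservation => ({ month: a.month, status: ukStatus(a.code) })),
      });
    }
  }
  return out;
};

// Adding a source means adding one entry here; nothing downstream changes.
const registry: { source: string; adapt: Adapter }[] = [
  { source: "us_bureau", adapt: usAdapter },
  { source: "uk_agency", adapt: ukAdapter },
];

function normalize(source: string, raw: unknown): CanonicalAccount[] {
  for (const entry of registry) {
    if (entry.source === source) return entry.adapt(raw);
  }
  throw new Error("No adapter registered for source: " + source);
}
```

The only place a source name appears is the registry lookup. Unknown status codes and unregistered sources throw instead of being coerced into something plausible.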
Two normalization decisions that look small and aren’t:
Currency. You cannot add a US card balance to a UK card balance. The
normalizer converts every amount to a base currency (USD) up front — and, just
as importantly, records the conversion as provenance (FX GBP→USD). A
number that has been transformed should always carry a note saying so.
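A rough sketch of converting up front while keeping the note, assuming a hypothetical toUsd helper and a made-up rate table. The 1.25 rate is a placeholder for illustration, not a real quote, and the provenance field names are assumptions.

```typescript
interface Provenance {
  source: string;
  sourceField: string;
  rawValue: string; // the value exactly as the source reported it
  transform: string | null; // e.g. "FX GBP→USD"; null when the value was not converted
  fxRate: number | null;
}

interface NormalizedAmount {
  amountUsd: number;
  provenance: Provenance;
}

// Made-up rates for illustration only; a real pipeline would use dated rates.
const ILLUSTRATIVE_RATES_TO_USD: { [currency: string]: number } = { USD: 1, GBP: 1.25 };

function toUsd(
  rawValue: string,
  currency: string,
  source: string,
  sourceField: string,
  rates: { [currency: string]: number } = ILLUSTRATIVE_RATES_TO_USD
): NormalizedAmount {
  const amount = Number(rawValue);
  if (rawValue.trim() === "" || isNaN(amount)) {
    throw new Error("Not a numeric amount: " + rawValue);
  }
  const rate = rates[currency];
  if (typeof rate !== "number") {
    throw new Error("No FX rate for currency: " + currency);
  }
  const isBase = currency === "USD";
  return {
    // Round converted amounts to cents; base-currency amounts pass through untouched.
    amountUsd: isBase ? amount : Math.round(amount * rate * 100) / 100,
    provenance: {
      source,
      sourceField,
      rawValue,
      transform: isBase ? null : "FX " + currency + "→USD",
      fxRate: isBase ? null : rate,
    },
  };
}

const ukLimit = toUsd("300", "GBP", "uk_agency", "creditLimitGBP");
```

Here ukLimit carries both the converted amount and the original "300" with its source field, so the transformation is never invisible. A currency with no rate throws, so an unconverted amount cannot pass through unnoticed.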
Missingness is not zero. The bank feed has no credit limit. The wrong move is
to default it to 0 or null and let it silently flow into a utilization
calculation. A missing limit means utilization is undefined for that account —
a different thing entirely, and one the downstream code has to respect.
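In code, "undefined" can be represented in the return type instead of as a sentinel number. A minimal sketch, assuming hypothetical field names and assuming that the aggregate simply excludes accounts without a limit (one reasonable policy, not necessarily the project's):

```typescript
interface AccountAmounts {
  accountId: string;
  balanceUsd: number;
  creditLimitUsd: number | null; // null = the source did not report a limit
}

// Utilization for one account, or null when it is undefined.
function accountUtilization(a: AccountAmounts): number | null {
  if (a.creditLimitUsd === null || a.creditLimitUsd <= 0) return null;
  return a.balanceUsd / a.creditLimitUsd;
}

// Aggregate over accounts that have a limit; null when none do.
function aggregateUtilization(accounts: AccountAmounts[]): number | null {
  let balance = 0;
  let limit = 0;
  for (const a of accounts) {
    if (accountUtilization(a) === null) continue; // skip, do not treat as 0
    balance += a.balanceUsd;
    limit += a.creditLimitUsd as number;
  }
  return limit > 0 ? balance / limit : null;
}

const card: AccountAmounts = { accountId: "card-1", balanceUsd: 270, creditLimitUsd: 1000 };
const bankFeed: AccountAmounts = { accountId: "bank-1", balanceUsd: 500, creditLimitUsd: null };
```

Because the return type is number | null, the compiler forces every caller to handle the undefined case explicitly; a bank-only file yields null, not 0 or Infinity.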
Provenance is a table, not a comment
If you’re deriving risk features that influence whether someone gets credit,
“where did this number come from?” is not a nice-to-have — it’s what compliance,
risk, and eventually a regulator will ask. So provenance is a first-class
concept, stored right next to the data. For every canonical field and every
payment observation, the store keeps (source, source_field, raw_value).
That means any derived feature is traceable end to end. When the API reports a
utilization of 0.27, the response can point at the exact accounts and the exact
raw balances and limits — across two countries and a currency conversion — that
produced it. Building that in from the start is far cheaper than reconstructing
it under audit pressure later.
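As an in-memory illustration of that traceability, the sketch below attaches the provenance rows behind a utilization figure to the feature itself. The row shape mirrors the (source, source_field, raw_value) triple; everything else (the type names, the accountId and canonicalField columns, the US field names, the sample amounts and the implied 1.25 GBP rate) is assumed for the example.

```typescript
interface ProvenanceRow {
  accountId: string;
  canonicalField: "balance" | "credit_limit";
  source: string;
  sourceField: string;
  rawValue: string;
}

interface AccountUsd {
  accountId: string;
  balanceUsd: number;
  creditLimitUsd: number;
}

interface TracedFeature {
  name: string;
  value: number;
  inputs: ProvenanceRow[]; // every raw value that fed the number
}

function tracedUtilization(accounts: AccountUsd[], provenance: ProvenanceRow[]): TracedFeature {
  let balance = 0;
  let limit = 0;
  const used: string[] = [];
  for (const a of accounts) {
    balance += a.balanceUsd;
    limit += a.creditLimitUsd;
    used.push(a.accountId);
  }
  if (limit <= 0) throw new Error("Utilization is undefined without a positive credit limit");
  const inputs = provenance.filter((row) => used.indexOf(row.accountId) !== -1);
  return { name: "utilization", value: balance / limit, inputs };
}

// Sample data: one US account, one UK account already converted at an illustrative 1.25.
const sampleAccounts: AccountUsd[] = [
  { accountId: "us-1", balanceUsd: 170, creditLimitUsd: 600 },
  { accountId: "uk-1", balanceUsd: 100, creditLimitUsd: 400 },
];
const sampleProvenance: ProvenanceRow[] = [
  { accountId: "us-1", canonicalField: "balance", source: "us_bureau", sourceField: "balance", rawValue: "170.00" },
  { accountId: "us-1", canonicalField: "credit_limit", source: "us_bureau", sourceField: "credit_limit", rawValue: "600" },
  { accountId: "uk-1", canonicalField: "balance", source: "uk_agency", sourceField: "balanceGBP", rawValue: "80" },
  { accountId: "uk-1", canonicalField: "credit_limit", source: "uk_agency", sourceField: "creditLimitGBP", rawValue: "320" },
];

const feature = tracedUtilization(sampleAccounts, sampleProvenance);
```

With this sample data, feature.value is 0.27 and feature.inputs holds the four raw values behind it. In a SQL store the same lookup would be a join from the feature's input accounts to the provenance table.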
The gate: refusing to score a thin file
Here’s the part I care about most, and the part most credit demos skip.
The population that financial-inclusion products exist to serve is, by definition, thin-file and no-file consumers — people the traditional system doesn’t have enough data on. So the single most common input isn’t a rich file; it’s a sparse one. What should the system do with it?
The tempting answer is to always return a number — degrade gracefully, output something. That’s exactly wrong. A fabricated score on sparse input is worse than no score: it’s a confident-looking number with nothing underneath it, and someone will make a lending decision on it.
So the normalizer has an explicit thin-file gate. A file is scorable only if it clears documented thresholds — enough accounts, enough payment history, a computable utilization. If it doesn’t, the API doesn’t return a made-up score. It returns:
{
  "scorable": false,
  "reason": "Insufficient data to score responsibly (thin file). Only 1 account(s); need >= 2. Only 0 payment observation(s); need >= 6.",
  "missing": ["revolving_account_with_limit", "payment_history", "additional_tradelines"]
}
Three things make this a product decision and not just a guard clause:
- The thresholds are one named policy object, not magic numbers scattered through the code. A risk team can read and tune them.
- The refusal is specific. It doesn’t just say “no” — it says exactly what’s missing, which is the actionable thing. “Needs more data” plus a precise list is a feature; a shrug is not.
- It’s the tested path. The test suite asserts that a bank-only file cannot be scored just as firmly as it asserts a rich file can. The failure mode you’re guarding against — silently scoring garbage — is the one worth a test.
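A sketch of what such a gate might look like, assuming a hypothetical FileSummary input and policy shape. The thresholds of 2 accounts and 6 payment observations and the missing labels are taken from the example response above; the utilization sentence in the reason is an addition for this sketch.

```typescript
// One named policy object a risk team can read and tune.
interface ThinFilePolicy {
  minAccounts: number;
  minPaymentObservations: number;
  requireComputableUtilization: boolean;
}

const DEFAULT_POLICY: ThinFilePolicy = {
  minAccounts: 2,
  minPaymentObservations: 6,
  requireComputableUtilization: true,
};

// Hypothetical summary of a normalized file.
interface FileSummary {
  accountCount: number;
  paymentObservationCount: number;
  accountsWithLimit: number;
}

type GateResult =
  | { scorable: true }
  | { scorable: false; reason: string; missing: string[] };

function thinFileGate(file: FileSummary, policy: ThinFilePolicy = DEFAULT_POLICY): GateResult {
  const tooFewAccounts = file.accountCount < policy.minAccounts;
  const tooFewPayments = file.paymentObservationCount < policy.minPaymentObservations;
  const noUtilization = policy.requireComputableUtilization && file.accountsWithLimit === 0;

  if (!tooFewAccounts && !tooFewPayments && !noUtilization) {
    return { scorable: true };
  }

  const details: string[] = [];
  const missing: string[] = [];
  if (tooFewAccounts) {
    details.push(`Only ${file.accountCount} account(s); need >= ${policy.minAccounts}.`);
  }
  if (tooFewPayments) {
    details.push(
      `Only ${file.paymentObservationCount} payment observation(s); need >= ${policy.minPaymentObservations}.`
    );
  }
  if (noUtilization) {
    details.push("No account reports a credit limit, so utilization cannot be computed.");
    missing.push("revolving_account_with_limit");
  }
  if (tooFewPayments) missing.push("payment_history");
  if (tooFewAccounts) missing.push("additional_tradelines");

  return {
    scorable: false,
    reason: "Insufficient data to score responsibly (thin file). " + details.join(" "),
    missing,
  };
}

// A bank-only file: one account, no limit, no lender-reported history.
const bankOnly = thinFileGate({ accountCount: 1, paymentObservationCount: 0, accountsWithLimit: 0 });
```

The union return type means a caller cannot read a score-shaped value without first checking scorable, so the refusal path is enforced by the compiler and not only by convention.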
What this is really about
The score itself, in the demo, is an intentionally transparent linear heuristic; a real one is a governed, monitored, fair-lending-reviewed model, and that’s a different discipline. But the pipeline around it — heterogeneous ingestion, canonical normalization, provenance, and an honest gate on insufficient data — is reusable engineering, and it’s where a data platform earns trust.
The reflex to always produce an answer is a liability in any system that informs a real-world decision. Building the “I don’t have enough to answer that responsibly” path — and making it specific, tested, and first-class — is one of the clearer signals of senior work I know.