319 N+1s, 5,940ms Wasted: Auditing a Decade of Rails Code in One Afternoon
Most mature Rails codebases have a number that nobody knows. It is the cumulative amount of database time per request that the application spends re-fetching things it already had, because at some point five years ago somebody wrote inspection.location.name inside a loop and nothing yelled.
This is the story of measuring that number for a client whose Rails API powers a B2B operations platform. The number turned out to be 5,940 milliseconds of wasted database time across the test suite, distributed across 319 unique N+1 sites, with 28 of them rising to “critical” by the audit’s prioritization model.
The number is bad. Worse, the team did not know it was bad. The API was “slow” the way mature APIs are always “slow” — slow enough to complain about, not slow enough to ticket. The audit converted that vague unease into 319 line items with file paths, query SQL, recommended fixes, and a priority score.
Here is what we did, what we found, and why this is the cheapest way I know to extract real perf wins from a mature codebase.
Why mature codebases hide their N+1s
The standard intuition for finding N+1s is “run the app under bullet and watch your logs.” This works for a fresh codebase. It does not work for a codebase that has been shipping for years.
By that point, the controllers have grown defensive: nil checks, conditional preloads, feature flags that bypass certain code paths. Some N+1s only manifest when a request hits a specific feature-flagged branch with a specific customer’s data shape. Some only manifest when a record is in an unusual state (deleted, draft, archived). Some manifest on every request and nobody notices because the test fixtures are too small to make the cost visible.
You cannot find these by exploration. You can find them by running the test suite under instrumentation that records every site, then aggregating across the corpus.
The gem that does this best, in my experience, is Prosopite. Unlike Bullet (which is built for the development loop), Prosopite is built for the test suite. It detects N+1s by tracking which queries fire from which call sites, identifying repeated queries against the same table from the same site, and surfacing them with full backtrace and SQL.
The audit harness
The first thing we built was not a fix. It was a harness that ran the entire spec suite under Prosopite, captured every N+1 detection as a structured record, and emitted a sorted report.
# spec/support/prosopite_audit.rb
require "prosopite"
require "json"
RSpec.configure do |config|
config.before(:suite) do
Prosopite.rails_logger = false
Prosopite.raise = false # we want to collect, not raise
Prosopite.custom_logger = ProsopiteAuditLogger.new
Prosopite.scan
end
config.after(:suite) do
Prosopite.finish
ProsopiteAuditLogger.flush_report!
end
config.around(:each) do |example|
Prosopite.pause
Prosopite.resume
example.run
end
end
ProsopiteAuditLogger accumulates each detection as a JSON record:
class ProsopiteAuditLogger
def initialize
@records = []
end
def warn(notification)
@records << {
file: notification.queries.first.backtrace.first[/[^:]+:\d+/],
sql_template: normalize(notification.queries.first.sql),
query_count: notification.queries.length,
table: extract_table(notification.queries.first.sql),
spec_file: RSpec.current_example.metadata[:file_path],
example: RSpec.current_example.full_description
}
end
def self.flush_report!
grouped = @records.group_by { |r| [r[:sql_template], r[:file]] }
aggregated = grouped.map do |key, recs|
sql_template, file = key
{
sql_template: sql_template,
file: file,
table: recs.first[:table],
spec_count: recs.length,
total_queries: recs.sum { |r| r[:query_count] },
triggering_specs: recs.map { |r| r[:example] }.uniq
}
end
File.write("tmp/n1_audit.json", JSON.pretty_generate(aggregated))
end
end
Then a small Ruby script renders the JSON into an HTML report: hotspots by table, total wasted time, a “critical / high / medium / low” classification based on per-request frequency and cumulative time. The classification is the part that makes the report actionable. Without it, 319 entries is a wall of text. With it, the team knows which 28 to fix this sprint.
The headline numbers
For this client’s codebase, on a full run of the V3 controller spec suite (Rails 7, Postgres, roughly 800 controller specs):
- 319 unique N+1 sites (distinct combinations of SQL template and call-site file:line)
- 1,233 spec occurrences (each unique site triggered, on average, by ~4 specs)
- 5,940.6 ms of cumulative wasted DB time per full suite run
- 28 critical, 26 high, 48 medium, 217 low
Wasted time is not a perf-in-production number. It is a “cost per spec run” number. The relationship between wasted time in specs and wasted time in production depends on the access pattern of real traffic. But the site count generalizes one-to-one: every N+1 site in specs is an N+1 site in production. The 319 is the real number.
The hotspot tables
Aggregated by table, sorted by cumulative time wasted:
| # | Table | N+1 Sites | Spec Hits | Wasted |
|---|---|---|---|---|
| 1 | locations | 42 | 189 | 1015.0ms |
| 2 | workflows | 15 | 76 | 691.3ms |
| 3 | transactions | 23 | 65 | 618.5ms |
| 4 | snapshot_images | 15 | 58 | 520.5ms |
| 5 | users | 52 | 159 | 506.5ms |
| 6 | video_records | 10 | 44 | 455.8ms |
| 7 | assets | 9 | 34 | 383.4ms |
| 8 | inspection_stats | 9 | 43 | 350.3ms |
| 9 | transaction_groups | 8 | 42 | 297.9ms |
| 10 | forms | 7 | 41 | 297.0ms |
The first thing you can see, before fixing anything, is the shape of the codebase’s perf debt. The locations table is being re-fetched in 42 different call sites across 189 specs. That is not one bug. That is a pattern — the codebase has a habit of writing something.location.name in iteration contexts and not preloading.
The same pattern shows up for users (52 sites) and workflows (15 sites). These are the joined-by-default associations whose N+1ness is invisible because each individual query is fast.
The critical hotspots
The top of the critical list looked like this:
Critical #1 — assets (100x per request). A single query, repeated 100 times per request, in the V3 inspections cursor-based pagination spec:
SELECT "assets".* FROM "assets"
WHERE "assets"."company_id" = $1
AND (LOWER(external_id) = 'mth1cwh57pcd2cp7enotavin')
ORDER BY "assets"."id" ASC LIMIT $2
Recommended fix: includes(:asset) on the parent query. Six specs triggered it. The cumulative wasted time was 365.7ms, just for this one site.
Critical #2 — locations (100x per request). A vanilla SELECT locations.* WHERE locations.company_id = $1 ORDER BY locations.id ASC LIMIT $2, repeated 100 times per request in the same pagination spec. Twelve specs triggered it. 339.1ms wasted.
Critical #4 — snapshot_images (101x per request). Not even a SELECT * — a COUNT:
SELECT COUNT(*) FROM "snapshot_images"
WHERE "snapshot_images"."transaction_id" = $1
Repeated 101 times in the V3 units controller spec. Seven specs triggered it.
The COUNT pattern is particularly insidious because it looks innocent in code: transaction.snapshot_images.count or transaction.snapshot_images.any?. Each call issues a fresh database query. In a list view of 100 transactions, that is 100 COUNT queries that could be replaced by one preloaded scope plus a .length.
What the report does that a perf tool does not
A perf tool tells you the slow request. The audit tells you the latent slow request — the one that has not yet hit production-scale data but will when a customer with 100 inspections opens a list view.
A perf tool gives you a flame graph. The audit gives you 319 ticketable line items, each with: the SQL template, the call-site file and line, the recommended fix, the priority score, and the list of specs that triggered it (which double as a regression test set when the fix lands).
A perf tool requires a senior engineer to interpret. The audit can be handed to a junior engineer who picks the top critical, adds an includes(:asset), watches the spec turn green, opens a PR. Twenty-eight criticals is a sprint of work for a competent developer. The thirty highs after that are another sprint. The mediums and lows can be left for opportunistic fixing.
The fix patterns
Roughly 85% of the 319 sites were fixed by one of three patterns:
Pattern A: add an includes. Most common, by a wide margin. The parent query has an each or a serializer chain that calls .location or .asset on each element. Add includes(:location, :asset) to the parent query.
# Before
Inspection.where(company_id: company.id).each do |insp|
serialize(insp.asset, insp.location)
end
# After
Inspection.where(company_id: company.id).includes(:asset, :location).each do |insp|
serialize(insp.asset, insp.location)
end
Pattern B: replace a .count with a preload + .length. When the count is used as part of a list view, never to compare against a single threshold, it is almost always cheaper to preload the association and use .length on the loaded collection.
# Before
transactions.map { |t| { id: t.id, image_count: t.snapshot_images.count } }
# After
transactions.includes(:snapshot_images).map { |t|
{ id: t.id, image_count: t.snapshot_images.length }
}
The exception: when you genuinely need a count without loading the records (large association, you only need to know “more than zero”), use counter_cache and accept its caveats.
Pattern C: denormalize. For the inspection_stats family of hotspots, no amount of preloading would fix the problem, because the calculation was a derived aggregate over multiple tables. The fix was a denormalized stats table updated by callbacks on the source models, refreshed daily by a reconciliation job. This is its own story.
The reporting harness as a deliverable
The most reusable artifact from this engagement was not the fixes. It was the audit harness itself. A Rails app gets to run it once a quarter, watch the numbers, and use them as a quantitative health metric.
Six months after the first audit, the team’s number was 1,210 ms (down from 5,940) and 47 critical sites had been resolved. Six months after that, the harness had been added to CI as a regression check: a PR that introduces a new N+1 site fails the audit-delta check, with the site identified in the failure message.
That last move is the one that compounds. A team that knows its N+1 number, can detect changes to it, and ships against a budget for it, will not accumulate the next decade’s perf debt the way it accumulated the last decade’s.
The takeaway
Mature Rails codebases have a quantifiable amount of latent N+1 debt that nobody on the team knows. Prosopite plus a test-suite harness plus a small reporting layer makes it visible in an afternoon. Fixing the criticals takes a sprint. Adding the harness to CI keeps the next decade clean.
If you are running a Rails API, the audit takes one engineer one afternoon to set up. The report it produces is worth a quarter of opportunistic optimization. If your team has been “slow” for years and nobody can say in what way, this is the cheapest way to find out.
This was one of a body of similar perf engagements. Happy to talk about yours.