319 N+1s, 5,940ms Wasted: Auditing a Decade of Rails Code in One Afternoon

Most mature Rails codebases have a number that nobody knows. It is the cumulative amount of database time per request that the application spends re-fetching things it already had, because at some point five years ago somebody wrote inspection.location.name inside a loop and nothing yelled.

This is the story of measuring that number for a client whose Rails API powers a B2B operations platform. The number turned out to be 5,940 milliseconds of wasted database time across the test suite, distributed across 319 unique N+1 sites, with 28 of them rising to “critical” by the audit’s prioritization model.

The number is bad. Worse, the team did not know it was bad. The API was “slow” the way mature APIs are always “slow” — slow enough to complain about, not slow enough to ticket. The audit converted that vague unease into 319 line items with file paths, query SQL, recommended fixes, and a priority score.

Here is what we did, what we found, and why this is the cheapest way I know to extract real perf wins from a mature codebase.

Why mature codebases hide their N+1s

The standard intuition for finding N+1s is “run the app under bullet and watch your logs.” This works for a fresh codebase. It does not work for a codebase that has been shipping for years.

By that point, the controllers have grown defensive: nil checks, conditional preloads, feature flags that bypass certain code paths. Some N+1s only manifest when a request hits a specific feature-flagged branch with a specific customer’s data shape. Some only manifest when a record is in an unusual state (deleted, draft, archived). Some manifest on every request and nobody notices because the test fixtures are too small to make the cost visible.

You cannot find these by exploration. You can find them by running the test suite under instrumentation that records every site, then aggregating across the corpus.

The gem that does this best, in my experience, is Prosopite. Unlike Bullet (which is built for the development loop), Prosopite is built for the test suite. It detects N+1s by tracking which queries fire from which call sites, identifying repeated queries against the same table from the same site, and surfacing them with full backtrace and SQL.

The audit harness

The first thing we built was not a fix. It was a harness that ran the entire spec suite under Prosopite, captured every N+1 detection as a structured record, and emitted a sorted report.

# spec/support/prosopite_audit.rb
require "prosopite"
require "json"

RSpec.configure do |config|
  config.before(:suite) do
    Prosopite.rails_logger = false
    Prosopite.raise = false # we want to collect, not raise
    Prosopite.custom_logger = ProsopiteAuditLogger.new
    Prosopite.scan
  end

  config.after(:suite) do
    Prosopite.finish
    ProsopiteAuditLogger.flush_report!
  end

  config.around(:each) do |example|
    Prosopite.pause
    Prosopite.resume
    example.run
  end
end

ProsopiteAuditLogger accumulates each detection as a JSON record:

class ProsopiteAuditLogger
  def initialize
    @records = []
  end

  def warn(notification)
    @records << {
      file: notification.queries.first.backtrace.first[/[^:]+:\d+/],
      sql_template: normalize(notification.queries.first.sql),
      query_count: notification.queries.length,
      table: extract_table(notification.queries.first.sql),
      spec_file: RSpec.current_example.metadata[:file_path],
      example: RSpec.current_example.full_description
    }
  end

  def self.flush_report!
    grouped = @records.group_by { |r| [r[:sql_template], r[:file]] }
    aggregated = grouped.map do |key, recs|
      sql_template, file = key
      {
        sql_template: sql_template,
        file: file,
        table: recs.first[:table],
        spec_count: recs.length,
        total_queries: recs.sum { |r| r[:query_count] },
        triggering_specs: recs.map { |r| r[:example] }.uniq
      }
    end
    File.write("tmp/n1_audit.json", JSON.pretty_generate(aggregated))
  end
end

Then a small Ruby script renders the JSON into an HTML report: hotspots by table, total wasted time, a “critical / high / medium / low” classification based on per-request frequency and cumulative time. The classification is the part that makes the report actionable. Without it, 319 entries is a wall of text. With it, the team knows which 28 to fix this sprint.

The headline numbers

For this client’s codebase, on a full run of the V3 controller spec suite (Rails 7, Postgres, roughly 800 controller specs):

319 unique N+1 sites (distinct combinations of SQL template and call-site file:line)
1,233 spec occurrences (each unique site triggered, on average, by ~4 specs)
5,940.6 ms of cumulative wasted DB time per full suite run
28 critical, 26 high, 48 medium, 217 low

Wasted time is not a perf-in-production number. It is a “cost per spec run” number. The relationship between wasted time in specs and wasted time in production depends on the access pattern of real traffic. But the site count generalizes one-to-one: every N+1 site in specs is an N+1 site in production. The 319 is the real number.

The hotspot tables

Aggregated by table, sorted by cumulative time wasted:

#	Table	N+1 Sites	Spec Hits	Wasted
1	locations	42	189	1015.0ms
2	workflows	15	76	691.3ms
3	transactions	23	65	618.5ms
4	snapshot_images	15	58	520.5ms
5	users	52	159	506.5ms
6	video_records	10	44	455.8ms
7	assets	9	34	383.4ms
8	inspection_stats	9	43	350.3ms
9	transaction_groups	8	42	297.9ms
10	forms	7	41	297.0ms

The first thing you can see, before fixing anything, is the shape of the codebase’s perf debt. The locations table is being re-fetched in 42 different call sites across 189 specs. That is not one bug. That is a pattern — the codebase has a habit of writing something.location.name in iteration contexts and not preloading.

The same pattern shows up for users (52 sites) and workflows (15 sites). These are the joined-by-default associations whose N+1ness is invisible because each individual query is fast.

The critical hotspots

The top of the critical list looked like this:

Critical #1 — assets (100x per request). A single query, repeated 100 times per request, in the V3 inspections cursor-based pagination spec:

SELECT "assets".* FROM "assets"
WHERE "assets"."company_id" = $1
  AND (LOWER(external_id) = 'mth1cwh57pcd2cp7enotavin')
ORDER BY "assets"."id" ASC LIMIT $2

Recommended fix: includes(:asset) on the parent query. Six specs triggered it. The cumulative wasted time was 365.7ms, just for this one site.

Critical #2 — locations (100x per request). A vanilla SELECT locations.* WHERE locations.company_id = $1 ORDER BY locations.id ASC LIMIT $2, repeated 100 times per request in the same pagination spec. Twelve specs triggered it. 339.1ms wasted.

Critical #4 — snapshot_images (101x per request). Not even a SELECT * — a COUNT:

SELECT COUNT(*) FROM "snapshot_images"
WHERE "snapshot_images"."transaction_id" = $1

Repeated 101 times in the V3 units controller spec. Seven specs triggered it.

The COUNT pattern is particularly insidious because it looks innocent in code: transaction.snapshot_images.count or transaction.snapshot_images.any?. Each call issues a fresh database query. In a list view of 100 transactions, that is 100 COUNT queries that could be replaced by one preloaded scope plus a .length.

What the report does that a perf tool does not

A perf tool tells you the slow request. The audit tells you the latent slow request — the one that has not yet hit production-scale data but will when a customer with 100 inspections opens a list view.

A perf tool gives you a flame graph. The audit gives you 319 ticketable line items, each with: the SQL template, the call-site file and line, the recommended fix, the priority score, and the list of specs that triggered it (which double as a regression test set when the fix lands).

A perf tool requires a senior engineer to interpret. The audit can be handed to a junior engineer who picks the top critical, adds an includes(:asset), watches the spec turn green, opens a PR. Twenty-eight criticals is a sprint of work for a competent developer. The thirty highs after that are another sprint. The mediums and lows can be left for opportunistic fixing.

The fix patterns

Roughly 85% of the 319 sites were fixed by one of three patterns:

Pattern A: add an includes. Most common, by a wide margin. The parent query has an each or a serializer chain that calls .location or .asset on each element. Add includes(:location, :asset) to the parent query.

# Before
Inspection.where(company_id: company.id).each do |insp|
  serialize(insp.asset, insp.location)
end

# After
Inspection.where(company_id: company.id).includes(:asset, :location).each do |insp|
  serialize(insp.asset, insp.location)
end

Pattern B: replace a .count with a preload + .length. When the count is used as part of a list view, never to compare against a single threshold, it is almost always cheaper to preload the association and use .length on the loaded collection.

# Before
transactions.map { |t| { id: t.id, image_count: t.snapshot_images.count } }

# After
transactions.includes(:snapshot_images).map { |t|
  { id: t.id, image_count: t.snapshot_images.length }
}

The exception: when you genuinely need a count without loading the records (large association, you only need to know “more than zero”), use counter_cache and accept its caveats.

Pattern C: denormalize. For the inspection_stats family of hotspots, no amount of preloading would fix the problem, because the calculation was a derived aggregate over multiple tables. The fix was a denormalized stats table updated by callbacks on the source models, refreshed daily by a reconciliation job. This is its own story.

The reporting harness as a deliverable

The most reusable artifact from this engagement was not the fixes. It was the audit harness itself. A Rails app gets to run it once a quarter, watch the numbers, and use them as a quantitative health metric.

Six months after the first audit, the team’s number was 1,210 ms (down from 5,940) and 47 critical sites had been resolved. Six months after that, the harness had been added to CI as a regression check: a PR that introduces a new N+1 site fails the audit-delta check, with the site identified in the failure message.

That last move is the one that compounds. A team that knows its N+1 number, can detect changes to it, and ships against a budget for it, will not accumulate the next decade’s perf debt the way it accumulated the last decade’s.

The takeaway

Mature Rails codebases have a quantifiable amount of latent N+1 debt that nobody on the team knows. Prosopite plus a test-suite harness plus a small reporting layer makes it visible in an afternoon. Fixing the criticals takes a sprint. Adding the harness to CI keeps the next decade clean.

If you are running a Rails API, the audit takes one engineer one afternoon to set up. The report it produces is worth a quarter of opportunistic optimization. If your team has been “slow” for years and nobody can say in what way, this is the cheapest way to find out.

This was one of a body of similar perf engagements. Happy to talk about yours.