One Commit, Six Lessons: A Salesforce Sync, a Read Follower, and the Audit That Reshaped Our Credentials Posture

This is a story about a single-line change in a database.yml. The change moved a nightly Salesforce sync from the primary database to a read replica. Three lessons fell out of that one-line change immediately. Three more fell out over the following months. And the reason the change happened at all was a security review the previous year that I have to tell you about first, because without it the change would not have been a priority.

The short version: a routine audit surfaced a long-lived credential in unexpected active use. The credential had been issued years earlier for a narrow purpose, granted broad access, and never expired. It was working as designed; the design was the problem. The audit’s findings prompted a company-wide review of every long-lived credential in the system, and the rebuild of the credentials posture that followed reshaped engineering priorities for the next eighteen months.

The Salesforce-sync-on-read-replica commit was one of dozens of changes that came out of the mandate. It is a small change. It is also a useful case study in what happens when you take cross-tenant security seriously.

I am going to tell the security story first, then the technical story, then the lessons. The client is a multi-tenant B2B operations SaaS company. The systems and tables described are real.

The audit and what it surfaced

A scheduled credentials audit, the kind every mature security team runs periodically, started with what was supposed to be a paperwork exercise: enumerate every token, key, and service-account credential in active use; verify its purpose; verify its scope; verify its expiry.

It did not get far before the exercise stopped being paperwork.

A handful of the credentials in active use had been issued years earlier, granted broad access (impersonation-capable, cross-tenant readable, write-enabled on every table they could reach), and stamped with no expiration. They worked because the system was designed to accept them. The system was designed to accept them because, at the time they were issued, “expire all credentials” was not an organizational norm. The credentials had simply continued to exist, used occasionally, never reviewed, until somebody compared the list of “credentials we know about” against the list of “credentials the system would accept right now” and noticed the second list was longer.

Nothing pointed at a breach. The audit was preventative, not reactive. But the gap between “credentials we manage” and “credentials the system trusts” was the real finding, and it had to be closed before something filled it.
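The comparison that turned up the problem is, at heart, a set difference. A minimal sketch in Ruby, assuming two hypothetical inputs: an inventory of managed credential IDs, and the list of credentials the system would accept, each with an optional expiry. The struct, method name, and sample IDs are illustrative, not the client's actual tooling.

```ruby
require "set"

# Hypothetical record for a credential the system would accept right now.
AcceptedCredential = Struct.new(:id, :expires_at, keyword_init: true)

# Returns the credentials the system trusts but nobody manages,
# plus the ones that are trusted with no expiry at all.
def audit_gap(managed_ids, accepted)
  managed = managed_ids.to_set
  {
    unmanaged: accepted.reject { |c| managed.include?(c.id) }.map(&:id),
    never_expiring: accepted.select { |c| c.expires_at.nil? }.map(&:id)
  }
end

# Illustrative data only.
accepted = [
  AcceptedCredential.new(id: "deploy-bot", expires_at: Time.utc(2030, 1, 1)),
  AcceptedCredential.new(id: "legacy-sync", expires_at: nil),
  AcceptedCredential.new(id: "old-admin-token", expires_at: nil)
]

gap = audit_gap(%w[deploy-bot legacy-sync], accepted)
puts gap.inspect
```

Either list being non-empty is a finding: the first is a credential nobody owns, the second is a credential that will never force a review.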

The remediation took weeks: every long-lived credential was rotated, scope-reviewed, given an expiry, or retired. The lesson took eighteen months to fully internalize: cross-tenant security cannot be a feature of the application; it has to be a property of the infrastructure.

What that means in practice is the topic of the rest of this article. The Salesforce sync change is one example, but the principle generalizes.

The mandate

The company-wide mandate that came out of the audit had several pillars. The two relevant to this article:

All credentials must expire. No long-lived tokens. No “system accounts” with permanent superadmin. Every credential has an explicit lifetime, and the renewal is auditable.
The blast radius of any single credential must be minimized. A credential should grant the smallest possible access to do the job. A nightly batch job that reads should not be using a credential that could write.

The Salesforce sync was a daily job that read company and location data from the application database, transformed it, and pushed it up to Salesforce. It had been running for years. It used a service account credential with the same access as a senior engineer: full read and full write on every table.

It needed read access on a subset of tables. That was it. Anything else was excess.

Reducing the credential to read-only-on-a-subset was step one. Moving the actual data source to a read replica was step two. These are the same idea applied at two layers. The application-level credential should be minimum-scope. The infrastructure-level access should also be minimum-scope.
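At the database layer, “read-only on a subset” can be expressed as a role that holds SELECT on named tables and nothing else. The sketch below generates PostgreSQL statements for such a role. The role name and table names are assumptions for illustration (the article only says the sync reads company and location data), the identifiers are assumed to be trusted input, and a real rollout would also need to revoke whatever the old account held.

```ruby
# Builds PostgreSQL statements for a role that can only SELECT from the
# listed tables. Role and table names passed in below are hypothetical.
def read_only_grants(role:, database:, tables:, schema: "public")
  quoted_tables = tables.map { |t| %("#{schema}"."#{t}") }.join(", ")
  [
    %(GRANT CONNECT ON DATABASE "#{database}" TO "#{role}";),
    %(GRANT USAGE ON SCHEMA "#{schema}" TO "#{role}";),
    %(GRANT SELECT ON #{quoted_tables} TO "#{role}";)
  ]
end

statements = read_only_grants(
  role: "salesforce_sync_ro",
  database: "r360_production",
  tables: %w[companies locations]
)
puts statements
```

Listing tables explicitly, rather than granting on the whole schema, means a new table is unreadable by the sync until someone decides it should not be.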

The one-line commit

The Salesforce sync runs as a Kubernetes CronJob, scheduled nightly. Its database connection was configured through a database.yml. The change was a one-line edit to the host: value.

# bridge/config/database.yml (before)
production:
  adapter: postgresql
  host: db.internal.example.com
  database: r360_production
  username: <%= ENV["DB_USER"] %>
  password: <%= ENV["DB_PASSWORD"] %>

# bridge/config/database.yml (after)
production:
  adapter: postgresql
  host: db-follower.internal.example.com
  database: r360_production
  username: <%= ENV["DB_USER"] %>
  password: <%= ENV["DB_PASSWORD"] %>

The commit message: “update database.yml to have read follower.”

The change was deployed, the cron ran that night, and three problems surfaced that the team had not anticipated.

Lesson one: replication lag is operational debt

The Salesforce sync did several things at once: it pulled the latest state of companies and locations, it diffed against what Salesforce currently had, and it pushed the diff up. The diff calculation assumed that the database state was current.

Read replicas are not current. They can be anywhere from seconds to hours behind the primary, depending on load. For most use cases, this is fine. For a job that diffs against a remote system and pushes the differences, “seconds behind” is fine and “hours behind” is not.

We did not know how lagged the replica was at the moment the cron started. The fix was a replication-lag check up front, and a refusal to run if the lag exceeded a threshold.

class ReplicationLagCheck
  CUTOFF = 1.hour

  # Seconds the replica is behind the primary, or nil on the primary,
  # where pg_last_xact_replay_timestamp() returns NULL.
  def self.lag_seconds
    ActiveRecord::Base.connection.select_value(<<~SQL)&.to_f
      SELECT EXTRACT(EPOCH FROM (now() - pg_last_xact_replay_timestamp())) AS lag_seconds
    SQL
  end

  def self.acceptable?(lag = lag_seconds)
    return true if lag.nil? # we are on the primary
    lag < CUTOFF.to_i
  end
end

class SalesforceSyncJob
  def perform
    # Measure once so the logged value is the one the decision was made on.
    lag = ReplicationLagCheck.lag_seconds
    unless ReplicationLagCheck.acceptable?(lag)
      Rails.logger.warn({
        domain: "salesforce.sync",
        status: "aborted_replication_lag",
        salesforce_sync_lag_seconds: lag
      })
      return
    end
    # ... actual sync ...
  end
end

The cutoff of one hour is generous on purpose. The sync only runs once daily; if the replica is lagged by more than an hour, we would rather skip the run and alert than push stale data to Salesforce.

The first time this check fired in anger was three months later, during a heavy backfill that had saturated the replication channel. The sync skipped that night. The alert fired. The on-call engineer looked at the underlying lag, decided it was a temporary spike, and let the next night’s run pick up. The data converged with no human-visible impact on Salesforce. The check did what it was supposed to do.

Lesson two: you cannot diagnose what you do not log

The second problem from the replica switch was that we did not have structured logging for the sync. Errors landed as opaque RuntimeError: Salesforce match failed lines in the unfiltered application log, with no per-company context, no per-stage timing, and no taxonomy.

This was acceptable when the sync ran on the primary because the sync rarely failed. The replica introduced new failure modes — stale matches, missing relations, replication-lag aborts. Without structured logging we could not tell which type of failure was happening or which tenants were affected.

The fix was a small refactor: every log line from the sync emits a structured hash with a consistent envelope.

def log_event(status:, **fields)
  Rails.logger.info({
    domain: "salesforce.sync",
    status: status,
    salesforce_sync_type: @run_type,
    salesforce_sync_duration_ms: ((Time.now - @started_at) * 1000).round,
    **fields
  })
end

Logs now have aggregable fields. The first week we shipped this, an aggregation query over a week of logs surfaced the third problem (described next). Without the structured logging, we would not have noticed it for months.
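The aggregation itself is simple once every line carries a status key. A sketch, assuming the logs are emitted as one JSON object per line (the article does not say which log format or query tool was used). It reuses the domain and status fields from the envelope; the company_updated status is a hypothetical name.

```ruby
require "json"

# Counts sync log events by status, ignoring lines that are not JSON
# or that belong to another domain.
def status_counts(lines, domain: "salesforce.sync")
  lines.each_with_object(Hash.new(0)) do |line, counts|
    event = begin
      JSON.parse(line)
    rescue JSON::ParserError
      next # legacy unstructured line, skip it
    end
    next unless event.is_a?(Hash) && event["domain"] == domain
    counts[event["status"]] += 1
  end
end

# Illustrative log lines only.
sample = [
  '{"domain":"salesforce.sync","status":"company_updated"}',
  '{"domain":"salesforce.sync","status":"companies_multiple_matches"}',
  '{"domain":"salesforce.sync","status":"companies_multiple_matches"}',
  '{"domain":"billing","status":"ok"}',
  "RuntimeError: Salesforce match failed"
]

counts = status_counts(sample)
puts counts.inspect
```

A status that is supposed to be rare but shows a steady nonzero count is exactly what an unstructured log hides.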

Lesson three: silent skips are bugs

The third problem was the worst. After the replica switch, we found that some companies were not syncing at all. Not a crash, not a visible error. A silent skip.

The cause was an existing code path in the matcher. Each Salesforce account carries R360_App_ID__c, a custom field holding the ID of the matching company in the application database. When the sync ran, it would query Salesforce by this ID, expecting one match. If the query returned zero matches, the company was logged as companies_not_found. If it returned one match, the company was updated. If it returned multiple matches (duplicate Salesforce records pointing at the same R360 ID), the code path printed a warning and moved on.

The warning was in an unaggregated log line. On the primary, with low log volume from this job, an engineer would occasionally notice. On the replica, with the structured logging now showing companies_multiple_matches as a counted field, we noticed immediately: 17 companies, across two large tenants, had been silently skipped for years because of duplicate Salesforce records that someone had accidentally created in 2019.

The fix was two-step:

The duplicate Salesforce records were merged in Salesforce, one company at a time, with the customer success team coordinating.
The matcher was changed to refuse to skip silently. If there are multiple matches, it pages on-call. Silence-on-confusion is an antipattern.
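The revised matcher logic can be sketched as a three-way branch in which the ambiguous case raises instead of logging and moving on. This is an illustration, not the client's code: the error class, method name, and company_updated status are hypothetical, and paging is represented by an exception that the caller is assumed to route to on-call.

```ruby
# Raised when more than one Salesforce record carries the same R360 ID.
class AmbiguousMatchError < StandardError
  attr_reader :company_id, :match_ids

  def initialize(company_id, match_ids)
    @company_id = company_id
    @match_ids = match_ids
    super("company #{company_id} matched #{match_ids.size} Salesforce records")
  end
end

# Decides what to do with the Salesforce record IDs found for one company.
def resolve_match(company_id, matches)
  case matches.size
  when 0
    { status: "companies_not_found", company_id: company_id }
  when 1
    { status: "company_updated", company_id: company_id, salesforce_id: matches.first }
  else
    # Refuse to guess: surface the ambiguity loudly instead of skipping.
    raise AmbiguousMatchError.new(company_id, matches)
  end
end
```

Raising keeps the decision about what “loud” means (page, ticket, failed run) in one place at the top of the job rather than scattered through the matcher.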

This bug had existed since the sync was first written. We did not find it for years. The replica switchover did not cause the bug; the structured logging that came with the replica switchover surfaced the bug. Sometimes the most important thing a code change does is force you to look at something you had been ignoring.

Lesson four: read-only credentials change the audit story

This one is the security loop closing.

When the sync was on the primary with a full-access credential, an incident response question like “did the sync write something it shouldn’t have” had no clean answer. The credential could write. Maybe it did, maybe it didn’t. The audit log would tell you, but you would have to look at every write the credential made and reason about each one.

When the sync is on a replica with a read-only credential, the same question has a one-line answer: no, because the credential cannot write. The credential’s capability is the audit, not the activity. This is what “minimum-scope credentials” buys you. Not “we have not written anything bad.” But: “we cannot have written anything bad.”

For a security team trying to bound the blast radius of a compromised credential, that distinction is enormous.

Lesson five: cron jobs need their own observability

Cron jobs are easy to ignore because they are not user-facing. When they fail, nobody opens a ticket. When they regress, the regression accumulates.

After the audit, every nightly cron in this client’s stack got the same treatment:

Structured logging with a consistent envelope (domain, status, duration, contextual identifiers).
A run-time tracker that emits a metric per completed run. A run that does not emit triggers an alert.
A per-job lifecycle dashboard showing success rate, duration trend, and stage-level timing.
A “last successful run” timestamp surfaced on the operations dashboard, visible to anyone on call.
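The “run that does not emit” alert reduces to comparing each job's last successful run against its expected cadence. A minimal sketch, with hypothetical job names, sample timestamps, and a grace period chosen purely for illustration:

```ruby
# Returns the names of jobs whose last successful run is missing or older
# than the expected interval plus a grace period (both in seconds).
def overdue_jobs(last_success_by_job, now:, interval: 24 * 3600, grace: 2 * 3600)
  last_success_by_job.select do |_job, last_success|
    last_success.nil? || (now - last_success) > interval + grace
  end.keys
end

# Illustrative data only.
now = Time.utc(2024, 1, 10, 6, 0, 0)
runs = {
  "salesforce_sync" => Time.utc(2024, 1, 9, 5, 30, 0), # 24.5h ago, within grace
  "invoice_export"  => Time.utc(2024, 1, 8, 3, 0, 0),  # 51h ago, overdue
  "report_rollup"   => nil                             # never succeeded
}

overdue = overdue_jobs(runs, now: now)
puts overdue.inspect
```

Treating a missing timestamp as overdue matters: a job that has never succeeded is the case most likely to go unnoticed.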

Most of the bugs found in the eighteen months after the credentials audit were not new bugs. They were existing latent bugs that became visible because someone started looking.

Lesson six: the small commits are where the real engineering lives

The Salesforce sync change was nine lines of diff plus the one-line host change. It was not glamorous. It was probably never going to be a “did you see what we shipped this week” Slack moment.

But it was the kind of change that takes the principle (“minimum-scope credentials, minimum blast radius, observable infrastructure”) and applies it to one specific job, with all the operational consequences that surface as a result. Multiply that by every nightly job, every API key, every webhook signer, every service account, every long-lived token — and that is what the eighteen-month security mandate looked like in practice.

The big-picture announcement of “we are taking cross-tenant security seriously” is the easy part. The hard part is the hundred small commits that follow.

The takeaway

A single-line change in a database.yml is not interesting on its own. The story around it is. If your team is running long-lived credentials, full-access service accounts, and nightly cron jobs against the primary database with no replication-lag awareness, you are one routine credentials audit away from a quarter of remediation work that nobody scheduled.

The remediation work, when it comes, is high-value engineering. The team will get better at the kind of work nobody enjoys doing but everybody respects when it is done. It is much cheaper to do it ahead of the audit than behind it.

Concretely: move what reads to read replicas. Scope credentials to the minimum. Add structured logging to every batch job. Watch for silent skips. Build the observability before you need it. Make every credential expire.

This was one of a body of similar security-driven engineering engagements. Happy to talk about yours.