Sidekiq Idempotency and Reliability Patterns

The single most expensive misconception about background jobs is that they run exactly once. They do not. Sidekiq, like every queue built on at-least-once semantics, may run an enqueued job more than once, and the “at least” is load-bearing. A worker can pull a job, do most of the work, and then hit an exception, a dropped Redis connection, or a deploy shutdown before the job finishes. Sidekiq does the correct thing and re-runs it: a job that raised goes to the retry set, and a job still in flight when the shutdown timeout expires is pushed back onto its queue. (A hard kill such as an OOM or kill -9 behaves differently by edition: open-source Sidekiq’s basic fetch can lose the in-flight job, while Sidekiq Pro’s super_fetch recovers and re-runs it.) If your job charged a card, sent an email, or incremented a balance on its first attempt, the re-run does it again. The bug report says “customer was charged twice,” and the root cause is that someone assumed exactly-once.

You cannot buy exactly-once delivery; it is provably impossible across a network in the general case. What you can build is exactly-once effect, by making the job idempotent: running it twice produces the same end state as running it once. That is the whole game, and everything below serves it.
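The difference between delivery and effect can be seen in a toy simulation. The sketch below is plain Ruby with no Sidekiq involved; ToyQueue, CrashOnce, and the job id "credit-1" are hypothetical names invented for illustration. The queue re-delivers any job whose run raised, and the simulated worker dies once, after the side effect has already happened.

```ruby
require "set"

# Toy at-least-once queue: a job is re-delivered until a run finishes
# without raising. This is an illustration, not how Sidekiq is implemented.
class ToyQueue
  def initialize
    @jobs = []
  end

  def push(job_id)
    @jobs << job_id
  end

  # Runs every queued job; a job whose block raises goes back on the queue.
  def drain
    until @jobs.empty?
      job_id = @jobs.shift
      begin
        yield job_id
      rescue StandardError
        @jobs << job_id
      end
    end
  end
end

# Simulates a worker that dies on the first attempt only.
class CrashOnce
  def initialize
    @crashed = false
  end

  def maybe_crash!
    return if @crashed

    @crashed = true
    raise "worker died before the job was marked done"
  end
end

# Non-idempotent: every delivery increments the balance.
def run_naive
  balance = 0
  crash = CrashOnce.new
  queue = ToyQueue.new
  queue.push("credit-1")
  queue.drain do |_job_id|
    balance += 100
    crash.maybe_crash!
  end
  balance
end

# Idempotent: the effect is applied only the first time a job id is seen.
def run_idempotent
  balance = 0
  applied = Set.new
  crash = CrashOnce.new
  queue = ToyQueue.new
  queue.push("credit-1")
  queue.drain do |job_id|
    balance += 100 if applied.add?(job_id) # add? returns nil on a repeat
    crash.maybe_crash!
  end
  balance
end
```

Both versions are delivered twice. Only the second ends in the state a single run would have produced.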

Make the job idempotent first

Before reaching for any library, ask the question that matters: if this job runs twice, what breaks? If the answer is “nothing,” you are done — a job that does user.update!(synced_at: Time.current) is naturally idempotent. The dangerous jobs are the ones with external side effects or non-idempotent writes (inserts, increments, sends).

The most robust technique is an idempotency key: a unique token, derived from the job’s inputs, that you record the first time the effect happens. On every run you check whether that token has already been processed, and bail if so. The check and the effect must be in the same database transaction, or the guard has a race.

class ChargeInvoiceJob
  include Sidekiq::Job
  sidekiq_options queue: :payments, retry: 10

  def perform(invoice_id)
    invoice = Invoice.find(invoice_id)

    ActiveRecord::Base.transaction do
      # Row lock + a uniqueness guard. If a prior run already created
      # the Charge, we short-circuit and never call the gateway again.
      invoice.lock!
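      # `next` exits the block normally; a non-local `return` inside a
      # transaction block has been handled differently across Rails versions.
      next if invoice.charged?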

      charge = PaymentGateway.charge!(
        amount: invoice.total_cents,
        # Pass our own key so the gateway also dedupes server-side.
        idempotency_key: "invoice-#{invoice.id}"
      )
      invoice.update!(charged_at: Time.current, charge_id: charge.id)
    end
  end
end

Two things make this safe. The lock! serializes concurrent runs of the same invoice so two workers can’t both pass the charged? check. And the idempotency_key handed to the gateway means even if our process dies after the gateway charged but before we committed, the retry’s second gateway call returns the original charge instead of creating a new one. Many payment APIs (Stripe and Adyen, for example) accept an idempotency key on the request; if yours does, use it.
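The server-side half of that guarantee can be sketched with an in-memory stand-in. FakeGateway below is hypothetical and only models the dedup behavior described above; real gateways differ in how long they remember a key and how they treat a repeated key with different parameters.

```ruby
require "securerandom"

# Hypothetical in-memory stand-in for a gateway that dedupes by key.
class FakeGateway
  Charge = Struct.new(:id, :amount)

  attr_reader :charges_created

  def initialize
    @by_key = {}
    @charges_created = 0
  end

  # Returns the stored charge for a key it has seen before; otherwise
  # creates, stores, and returns a new one.
  def charge!(amount:, idempotency_key:)
    @by_key.fetch(idempotency_key) do
      @charges_created += 1
      @by_key[idempotency_key] = Charge.new(SecureRandom.uuid, amount)
    end
  end
end

gateway = FakeGateway.new
first = gateway.charge!(amount: 4_200, idempotency_key: "invoice-17")
# Imagine the process dying here, before the database commit.
# The retried job repeats the call with the same key:
again = gateway.charge!(amount: 4_200, idempotency_key: "invoice-17")
```

The retry receives the same charge object as the first call, and the gateway has created only one charge.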

Dedup at the queue with a unique lock

Idempotency handles the case where a job ran and re-ran. A related but distinct problem is the same job being enqueued many times — a user mashing a button, a webhook firing in a loop, a cron overlapping with a slow previous run. You do not want fifty RebuildSearchIndexJob(account_id: 42) jobs sitting in the queue.

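Open-source Sidekiq does not dedupe enqueues on its own. Sidekiq Enterprise includes a unique-jobs feature, and the sidekiq-unique-jobs gem adds one to open-source Sidekiq; the lock: and lock_ttl: options below are that gem’s. With the gem installed, enabling a lock is one line, and the granularity options matter: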

class RebuildSearchIndexJob
  include Sidekiq::Job
  # Collapse duplicate (class + args) jobs while one is queued or running.
  sidekiq_options queue: :indexing,
    lock: :until_executed,
    lock_ttl: 30.minutes.to_i

  def perform(account_id)
    SearchIndex.rebuild!(Account.find(account_id))
  end
end

The two locks people reach for most: :until_executed holds the lock from enqueue until the job finishes, collapsing a burst of identical enqueues into one run. :until_executing releases the moment the job starts, which lets a fresh enqueue queue up work that arrived after processing began (good for “rebuild reflecting the latest state” jobs). Always set a lock_ttl longer than the worst case the lock has to cover: for :until_executed that is the time the job waits in the queue plus its runtime. A lock with no expiry that leaks because a worker was killed with kill -9 will silently stop that job from ever being enqueued again.
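The difference between the two modes, and the role of the TTL, can be modeled in a few lines. ToyUniqueLock below is a hypothetical in-memory model written for this article, not the gem’s implementation; the current time is passed in explicitly so expiry is deterministic.

```ruby
# Toy model of a unique lock keyed on a digest of class + args.
class ToyUniqueLock
  def initialize(mode:, ttl:)
    @mode = mode   # :until_executed or :until_executing
    @ttl = ttl     # lock lifetime in seconds
    @locks = {}    # digest => expiry timestamp
  end

  # Returns true if the job is accepted, false if it is a duplicate.
  def enqueue(digest, now:)
    expiry = @locks[digest]
    return false if expiry && expiry > now

    @locks[digest] = now + @ttl
    true
  end

  # Called when a worker picks the job up.
  def start(digest)
    @locks.delete(digest) if @mode == :until_executing
  end

  # Called when the job completes.
  def finish(digest)
    @locks.delete(digest) if @mode == :until_executed
  end
end

lock = ToyUniqueLock.new(mode: :until_executed, ttl: 1_800)
digest = "RebuildSearchIndexJob:42"
lock.enqueue(digest, now: 0)    # accepted
lock.enqueue(digest, now: 5)    # rejected as a duplicate
lock.start(digest)
lock.enqueue(digest, now: 10)   # still rejected: the job is running
lock.finish(digest)
lock.enqueue(digest, now: 20)   # accepted again
```

If finish is never called because the worker was killed, the lock in this model stays in place until the TTL passes, which is why an unbounded lock is dangerous.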

Crucially, dedup is not a substitute for idempotency. Unique locks reduce duplicate enqueues but cannot guarantee a job never runs twice — the lock can expire mid-run, or a process can die after the lock releases but before completion. Treat dedup as an efficiency optimization layered on top of an idempotent job, never as the correctness mechanism itself.

Retries: backoff, and knowing what not to retry

Sidekiq retries failed jobs automatically with exponential backoff plus jitter, which is exactly what you want for transient failures such as a flaky upstream or a momentary connection reset. The default is 25 retries spread over roughly three weeks, which is sensible for most jobs. The decisions worth making are around the edges.
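To see where the default horizon comes from, here is a rough calculation. Sidekiq’s error-handling documentation gives the delay before each retry as approximately (retry_count ** 4) + 15 + (rand(10) * (retry_count + 1)) seconds; the sketch below drops the random jitter term so the result is deterministic, and the exact formula may vary between Sidekiq versions. The method names are made up for this example.

```ruby
# Delay in seconds before retry number `count` (0-based), jitter omitted.
def retry_delay(count)
  (count**4) + 15
end

# Total time elapsed once `retries` retries have all been attempted.
def cumulative_delay(retries)
  (0...retries).sum { |count| retry_delay(count) }
end

total_seconds = cumulative_delay(25)
total_days = total_seconds / 86_400.0
puts format("25 retries span about %.1f days", total_days)
```

Almost all of that time sits in the last few retries, because the delay grows with the fourth power of the retry count.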

First, distinguish retryable from non-retryable failures. A Net::OpenTimeout deserves a retry. An ActiveRecord::RecordInvalid because the input was malformed will fail identically on every retry — that is a poison message, and retrying it twenty times just wastes capacity and clutters the retry set. Catch the unrecoverable cases and route them somewhere useful instead of letting them churn:

class ProcessWebhookJob
  include Sidekiq::Job
  sidekiq_options retry: 8

  def perform(payload)
    Webhook.handle!(payload)
  rescue Webhook::Unparseable => e
    # Permanent: a retry will never make bad JSON valid.
    DeadLetters.record!(payload: payload, error: e)
    # Swallow so Sidekiq counts this as success, not a retry.
  end
end

Second, set a finite retry: count on jobs whose value decays with time. A “send the welcome email” job retrying for three weeks is pointless — if it has not sent in a day, the moment has passed. Match the retry horizon to how long the work stays useful.
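One way to pick the number is to work backward from the shelf life. The sketch below assumes Sidekiq’s documented backoff of roughly (retry_count ** 4) + 15 seconds per retry and ignores jitter; retries_within is a hypothetical helper, not part of Sidekiq.

```ruby
# Returns how many retries complete inside `horizon_seconds`,
# assuming a delay of (count**4) + 15 seconds before retry `count`.
def retries_within(horizon_seconds, max: 25)
  elapsed = 0
  (0...max).each do |count|
    elapsed += (count**4) + 15
    # Retry number `count` would land past the horizon, so only
    # `count` earlier retries fit.
    return count if elapsed > horizon_seconds
  end
  max
end

one_day = retries_within(24 * 60 * 60)
one_hour = retries_within(60 * 60)
puts "retries that fit in a day: #{one_day}, in an hour: #{one_hour}"
```

The result can then go straight into sidekiq_options retry: for a job whose work is only useful within that window.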

The dead set is a feature, not a graveyard

When a job exhausts its retries, Sidekiq moves it to the dead set rather than discarding it (by default the dead set keeps up to 10,000 jobs for up to six months). This is one of the most underused operational tools in the stack. The dead set is your record of work that the system could not complete, and it is retryable by hand from the Web UI once you have fixed the underlying cause (deployed the bug fix, restored the dependency, corrected the data).

To make the dead set useful rather than a black hole, do two things. Alert on it — a steadily growing dead set is a signal that something is systematically broken, and it should page someone before a customer notices. And keep jobs small and serializable: if a dead job carries a 2 MB argument or references a record that has since been deleted, replaying it is painful. Pass IDs, not objects; let perform re-fetch current state. That single discipline makes both retries and dead-set replays safe, because the job always operates on the latest data rather than a stale snapshot frozen at enqueue time.
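A short sketch shows why IDs beat snapshots. Sidekiq stores job arguments as JSON, so the round trip below approximates what happens to an argument between enqueue and perform. The store hash and perform_with_id are hypothetical stand-ins for a database table and a job body.

```ruby
require "json"

# Approximates the trip through Redis: arguments are stored as JSON.
def round_trip(args)
  JSON.parse(JSON.generate(args))
end

# A snapshot of the record taken at enqueue time.
snapshot = { id: 42, email: "old@example.com", updated_at: Time.at(0).utc }
restored = round_trip([snapshot]).first
# Symbol keys come back as strings, and the Time comes back as a String.

# Hypothetical in-memory "table" standing in for the database.
store = { 42 => { email: "old@example.com" } }

# A job body that takes only the id and re-fetches current state.
def perform_with_id(store, user_id)
  store.fetch(user_id)[:email]
end

# The record changes after the job was enqueued.
store[42][:email] = "new@example.com"

puts "snapshot sees: #{restored['email']}"
puts "re-fetch sees: #{perform_with_id(store, 42)}"
```

The snapshot carries the address as it was at enqueue time, while the ID-based version reads whatever is current when the job finally runs, whether that is the first attempt or a replay from the dead set weeks later.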

A checklist for any job that has side effects

  1. Pass IDs, not records. Re-fetch inside perform so retries see current state and arguments stay tiny and serializable.
  2. Make the effect idempotent. An idempotency key checked-and-written in the same transaction as the effect; pass the key to external APIs too.
  3. Lock the contended row when concurrent runs of the same job could race.
  4. Add a unique lock if duplicate enqueues are likely — but never rely on it for correctness.
  5. Classify failures. Retry the transient, dead-letter the permanent, and bound the retry horizon to the work’s shelf life.
  6. Watch the dead set. Alert on growth; design jobs to be replayable by hand.
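Item 6 can start as a periodic size check. The sketch below only contains the decision logic; DeadSetSample, dead_set_alert?, and the threshold are hypothetical. In a real application the size would come from Sidekiq::DeadSet.new.size (available after require "sidekiq/api"), sampled on a schedule, with the alert routed to whatever paging system you use.

```ruby
# One observation of the dead set: how many jobs, and when we looked.
DeadSetSample = Struct.new(:size, :taken_at)

# Returns true when the dead set grew faster than the allowed rate
# between two samples.
def dead_set_alert?(previous, current, max_growth_per_hour:)
  elapsed_hours = (current.taken_at - previous.taken_at) / 3600.0
  return false if elapsed_hours <= 0

  growth_rate = (current.size - previous.size) / elapsed_hours
  growth_rate > max_growth_per_hour
end

earlier = DeadSetSample.new(10, Time.at(0))
later = DeadSetSample.new(70, Time.at(3_600))
puts "page someone" if dead_set_alert?(earlier, later, max_growth_per_hour: 50)
```

Alerting on the growth rate rather than the absolute size keeps a backlog of old, already-triaged dead jobs from paging anyone.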

The takeaway

Reliable background processing is not about finding a queue with stronger delivery guarantees — that queue does not exist. It is about writing jobs that do not care how many times they run. Once a job is idempotent, every other reliability feature — retries, dedup, the dead set — becomes a safety net you can lean on instead of a source of new bugs. The teams that get burned are the ones that treated “it usually runs once” as a guarantee.

We do background-job reliability work on Rails — auditing jobs for double-effect bugs, designing idempotency layers, and getting noisy retry/dead sets back under control. If you have ever shipped a “charged twice” or “sent the email three times” fix, that is the conversation.