The Audit Log That Can’t Have Holes

Most audit logs are a convenience. You add an events table, wire up a callback in your context, and now support can answer “who changed this?” when a customer emails in. If a row goes missing, nobody notices. If two happen to land out of order, nobody cares.

A compliance audit log is a different animal. In regulated inventory — cannabis seed-to-sale, controlled substances, medical devices, anything a state agency can subpoena — the audit trail is a legal instrument. And a legal instrument has a property the convenience version never needs:

A missing sequence number is evidence of tampering.

If your events are numbered 1, 2, 3, 5, 6, an auditor doesn’t shrug. They ask what number 4 was and why you deleted it. “It was a database rollback” is not an answer that ends the conversation. So the requirement is stark: the sequence must be gap-free — contiguous 1..N, forever, no matter what fails.

That one requirement quietly disqualifies almost every implementation people reach for. I recently built a small Seed-to-Sale Compliance Ledger to demonstrate the pattern, and the interesting part isn’t the state machine or the CSV export. It’s the numbering. Here’s why the easy answers don’t work, and the one that does.

Why not just use a database sequence?

The instinct is: Postgres has SEQUENCE, MySQL has AUTO_INCREMENT, let the database hand out the numbers. They’re fast, they’re concurrent-safe, they never collide.

They also leak on rollback, and that’s fatal here.

nextval() is deliberately non-transactional. When you call it, the sequence advances immediately and stays advanced, even if the surrounding transaction rolls back a millisecond later. This is a feature — it’s what lets a thousand concurrent inserts each grab a unique id without blocking each other. But it means the moment any transaction that consumed a number fails — a validation error, a deadlock, a network blip mid-commit — you have a hole.

In a normal table, that hole is invisible; primary keys are allowed to skip. In a compliance ledger, that hole is a finding. Sequences are the wrong tool the instant “no gaps” becomes a hard requirement.

Why not `MAX(sequence) + 1`?

The next instinct is to compute the number yourself:

next_seq = Repo.one(from e in LedgerEvent, select: coalesce(max(e.sequence), 0)) + 1
Repo.insert!(%LedgerEvent{sequence: next_seq, ...})

Now the number is inside your transaction, so a rollback un-consumes it. Gaps solved. But you’ve traded one failure for a worse one: this races.

Two requests read MAX(sequence) as 41 at the same moment. Both compute 42. Both insert. Now you either have two events numbered 42 (if there’s no unique constraint) or one of them crashes on the constraint violation (if there is) — and in a busy ERP, “one of them crashes” happens constantly under load. A unique index turns silent corruption into loud errors, which is better, but “loud errors under normal traffic” is not a shipping feature.

The race is real because the read and the write aren’t atomic. Nothing stops a second transaction from squeezing between them.

The pattern that works: a locked counter, inside the transaction

The correct approach is old and boring, which is exactly what you want for a legal record. Keep a dedicated one-row-per-series counter table. To get the next number, lock that row, increment it, and — critically — do it inside the same transaction that writes the event you’re numbering.

defmodule Counter do
  import Ecto.Query

  # MUST run inside the caller's transaction so the lock is held and the
  # increment can roll back with the surrounding work.
  def next_value!(repo, name) do
    repo.insert_all(Counter, [%{name: name, value: 0}],
      on_conflict: :nothing, conflict_target: :name)

    counter =
      from(c in Counter, where: c.name == ^name, lock: "FOR UPDATE")  # SELECT ... FOR UPDATE
      |> repo.one!()

    next = counter.value + 1
    from(c in Counter, where: c.id == ^counter.id) |> repo.update_all(set: [value: next])
    next
  end
end

And the write path that uses it — one Ecto.Multi, one transaction:

Ecto.Multi.new()
|> Ecto.Multi.run(:sequence, fn repo, _ -> {:ok, Counter.next_value!(repo, "ledger")} end)
|> Ecto.Multi.update(:batch, Batch.transition_changeset(batch, to_state))
|> Ecto.Multi.insert(:event, fn %{sequence: seq} ->    # if THIS fails, seq is freed
  LedgerEvent.changeset(%LedgerEvent{}, %{sequence: seq, batch_id: batch.id, ...})
end)
|> Repo.transaction()

Two properties fall out of this, and they’re the two the easy answers each got half-right:

No gaps. The counter increment and the event insert are in one transaction. If the insert fails, the increment rolls back with it. The number is never consumed unless the event it names is committed. Rollback frees the number.
No races. SELECT … FOR UPDATE (that’s what lock: "FOR UPDATE" compiles to) takes a row lock on the counter. A second transaction that wants the next number blocks until the first commits or rolls back, then reads the updated value. Concurrency is serialized down to the narrowest possible scope: one row, held for microseconds.

The cost is that writers to a given series serialize on that row. For an audit ledger that’s not just acceptable, it’s the point — you want a total order of events. And because the lock is on a single counter row, not the events table, it doesn’t contend with reads or with other series.

Proving it, not asserting it

The reason this is worth writing about is that all three failure modes are invisible until they aren’t. You can run the naive sequence version for months before a rollback leaves a hole an auditor finds. So the guarantees belong in the test suite, stated as executable claims:

Gap-free: run a few hundred operations, then assert the sequence column is exactly Enum.to_list(1..n). Not “roughly contiguous” — exactly.
Rollback frees the number: force a failure after the number is reserved, and assert the counter’s value is unchanged. This is the property a database sequence would fail.
Concurrency: spin up thirty Tasks that each write an event, await them, and assert the sequence is still unique and still gap-free. This is the property MAX+1 would fail. (One catch in Elixir: the Ecto sandbox pins every process to a single shared connection, which can’t contend on a row lock — so the concurrency test has to bypass the sandbox and let each Task check out its own real connection. Otherwise you’re “proving” concurrency on one connection.)

That last one matters most. A gap-free guarantee that only holds single-threaded is a guarantee that holds in your tests and breaks in production. Thirty processes hammering the counter and coming out with a clean 1..N is the proof that the row lock is doing its job.

The rest is downstream

Once the numbering is right, the compliance model gets simple. Events are append-only — a Postgres trigger rejects any UPDATE or DELETE on the ledger table and the context offers no mutation path, so a stray iex session can’t rewrite history. A transfer manifest snapshots its line items at export time rather than joining live inventory, because the manifest is a record of what was on the truck then, and later edits mustn’t mutate it. The state machine rejects illegal transitions in the context function, not just in the UI.

But those are the parts everyone remembers to build. The gap-free counter is the part that looks trivial, gets implemented as a database sequence, and quietly fails the one test that matters. Get the boring part right first.