The Audit Log That Can’t Have Holes
Most audit logs are a convenience. You add an events table, wire up a callback in your context, and now support can answer “who changed this?” when a customer emails in. If a row goes missing, nobody notices. If two happen to land out of order, nobody cares.
A compliance audit log is a different animal. In regulated inventory — cannabis seed-to-sale, controlled substances, medical devices, anything a state agency can subpoena — the audit trail is a legal instrument. And a legal instrument has a property the convenience version never needs:
A missing sequence number is evidence of tampering.
If your events are numbered 1, 2, 3, 5, 6, an auditor doesn’t shrug. They ask
what number 4 was and why you deleted it. “It was a database rollback” is not an
answer that ends the conversation. So the requirement is stark: the sequence
must be gap-free — contiguous 1..N, forever, no matter what fails.
That one requirement quietly disqualifies almost every implementation people reach for. I recently built a small Seed-to-Sale Compliance Ledger to demonstrate the pattern, and the interesting part isn’t the state machine or the CSV export. It’s the numbering. Here’s why the easy answers don’t work, and the one that does.
Why not just use a database sequence?
The instinct is: Postgres has SEQUENCE, MySQL has AUTO_INCREMENT, let the
database hand out the numbers. They’re fast, they’re concurrent-safe, they never
collide.
They also leak on rollback, and that’s fatal here.
nextval() is deliberately non-transactional. When you call it, the sequence
advances immediately and stays advanced, even if the surrounding transaction
rolls back a millisecond later. This is a feature — it’s what lets a thousand
concurrent inserts each grab a unique id without blocking each other. But it
means the moment any transaction that consumed a number fails — a validation
error, a deadlock, a network blip mid-commit — you have a hole.
In a normal table, that hole is invisible; primary keys are allowed to skip. In a compliance ledger, that hole is a finding. Sequences are the wrong tool the instant “no gaps” becomes a hard requirement.
Why not MAX(sequence) + 1?
The next instinct is to compute the number yourself:
next_seq = Repo.one(from e in LedgerEvent, select: coalesce(max(e.sequence), 0)) + 1
Repo.insert!(%LedgerEvent{sequence: next_seq, ...})
Now the number is inside your transaction, so a rollback un-consumes it. Gaps solved. But you’ve traded one failure for a worse one: this races.
Two requests read MAX(sequence) as 41 at the same moment. Both compute 42. Both
insert. Now you either have two events numbered 42 (if there’s no unique
constraint) or one of them crashes on the constraint violation (if there is) —
and in a busy ERP, “one of them crashes” happens constantly under load. A unique
index turns silent corruption into loud errors, which is better, but “loud
errors under normal traffic” is not a shipping feature.
The race is real because the read and the write aren’t atomic. Nothing stops a second transaction from squeezing between them.
The pattern that works: a locked counter, inside the transaction
The correct approach is old and boring, which is exactly what you want for a legal record. Keep a dedicated one-row-per-series counter table. To get the next number, lock that row, increment it, and — critically — do it inside the same transaction that writes the event you’re numbering.
defmodule Counter do
import Ecto.Query
# MUST run inside the caller's transaction so the lock is held and the
# increment can roll back with the surrounding work.
def next_value!(repo, name) do
repo.insert_all(Counter, [%{name: name, value: 0}],
on_conflict: :nothing, conflict_target: :name)
counter =
from(c in Counter, where: c.name == ^name, lock: "FOR UPDATE") # SELECT ... FOR UPDATE
|> repo.one!()
next = counter.value + 1
from(c in Counter, where: c.id == ^counter.id) |> repo.update_all(set: [value: next])
next
end
end
And the write path that uses it — one Ecto.Multi, one transaction:
Ecto.Multi.new()
|> Ecto.Multi.run(:sequence, fn repo, _ -> {:ok, Counter.next_value!(repo, "ledger")} end)
|> Ecto.Multi.update(:batch, Batch.transition_changeset(batch, to_state))
|> Ecto.Multi.insert(:event, fn %{sequence: seq} -> # if THIS fails, seq is freed
LedgerEvent.changeset(%LedgerEvent{}, %{sequence: seq, batch_id: batch.id, ...})
end)
|> Repo.transaction()
Two properties fall out of this, and they’re the two the easy answers each got half-right:
-
No gaps. The counter increment and the event insert are in one transaction. If the insert fails, the increment rolls back with it. The number is never consumed unless the event it names is committed. Rollback frees the number.
-
No races.
SELECT … FOR UPDATE(that’s whatlock: "FOR UPDATE"compiles to) takes a row lock on the counter. A second transaction that wants the next number blocks until the first commits or rolls back, then reads the updated value. Concurrency is serialized down to the narrowest possible scope: one row, held for microseconds.
The cost is that writers to a given series serialize on that row. For an audit ledger that’s not just acceptable, it’s the point — you want a total order of events. And because the lock is on a single counter row, not the events table, it doesn’t contend with reads or with other series.
Proving it, not asserting it
The reason this is worth writing about is that all three failure modes are invisible until they aren’t. You can run the naive sequence version for months before a rollback leaves a hole an auditor finds. So the guarantees belong in the test suite, stated as executable claims:
- Gap-free: run a few hundred operations, then assert the sequence column is
exactly
Enum.to_list(1..n). Not “roughly contiguous” — exactly. - Rollback frees the number: force a failure after the number is reserved, and assert the counter’s value is unchanged. This is the property a database sequence would fail.
- Concurrency: spin up thirty
Tasks that each write an event,awaitthem, and assert the sequence is still unique and still gap-free. This is the propertyMAX+1would fail. (One catch in Elixir: the Ecto sandbox pins every process to a single shared connection, which can’t contend on a row lock — so the concurrency test has to bypass the sandbox and let eachTaskcheck out its own real connection. Otherwise you’re “proving” concurrency on one connection.)
That last one matters most. A gap-free guarantee that only holds single-threaded
is a guarantee that holds in your tests and breaks in production. Thirty processes
hammering the counter and coming out with a clean 1..N is the proof that the
row lock is doing its job.
The rest is downstream
Once the numbering is right, the compliance model gets simple. Events are
append-only — a Postgres trigger rejects any UPDATE or DELETE on the
ledger table and the context offers no mutation path, so a stray iex session
can’t rewrite history. A transfer manifest snapshots its line items at export
time rather than joining live inventory, because the manifest is a record of what
was on the truck then, and later edits mustn’t mutate it. The state machine
rejects illegal transitions in the context function, not just in the UI.
But those are the parts everyone remembers to build. The gap-free counter is the part that looks trivial, gets implemented as a database sequence, and quietly fails the one test that matters. Get the boring part right first.