PostgreSQL Deep dive: serializable snapshot isolation — how PostgreSQL prevents write skew without traditional locking
REPEATABLE READ prevents dirty reads, nonrepeatable reads, and phantom reads. But it does not prevent write skew. Write skew is the hardest concurrency anomaly to detect and the last one the SQL standard lists as prohibited at the serializable level. PostgreSQL has prevented it since version 9.1 using a technique called Serializable Snapshot Isolation (SSI). SSI runs snapshot isolation under the hood, monitors the read-write conflicts between concurrent transactions, and rolls back any transaction that would create an anomaly. No explicit locks required from the application.
The anomaly REPEATABLE READ misses
Consider a business rule: “At least one on-call doctor must be available at all times.” The doctors table has two rows:
SELECT name, on_call FROM doctors;
-- Alice | true
-- Bob | true
Two administrators each decide to take a doctor off call. They run these transactions concurrently:
-- Transaction A -- Transaction B
BEGIN ISOLATION LEVEL REPEATABLE READ; BEGIN ISOLATION LEVEL REPEATABLE READ;
SELECT count(*) FROM doctors SELECT count(*) FROM doctors
WHERE on_call = true; WHERE on_call = true;
-- returns 2 -- returns 2
UPDATE doctors SET on_call = false UPDATE doctors SET on_call = false
WHERE name = 'Alice'; WHERE name = 'Bob';
COMMIT; COMMIT;
Both transactions see two doctors on call. Both pass the constraint check. Both commit. The result: zero doctors on call. The business rule is violated.
This is write skew. Each transaction is individually correct when run in isolation. The problem only appears when they run concurrently. The SQL standard defines this as a “serialization anomaly,” and REPEATABLE READ permits it in PostgreSQL.
The four ANSI SQL isolation phenomena form a hierarchy:
| Anomaly | READ COMMITTED | REPEATABLE READ | SERIALIZABLE |
|---|---|---|---|
| Dirty read | Possible | Not possible | Not possible |
| Nonrepeatable read | Possible | Not possible | Not possible |
| Phantom read | Possible | Not possible in PG | Not possible |
| Serialization anomaly | Possible | Possible | Not possible |
PostgreSQL’s REPEATABLE READ is stronger than the SQL standard requires. It prevents phantom reads by using the snapshot taken at transaction start. But serialization anomalies still get through.
Why traditional locking solves it (but costs)
The traditional approach to serializable isolation is Strict Two-Phase Locking (S2PL). Under S2PL, a transaction acquires a read lock on every row it reads and holds those locks until commit. Any concurrent write to a locked row blocks.
In the doctors example, both transactions would lock the rows they read during the SELECT count(*) query. Transaction A would lock both doctor rows. Transaction B would block on the lock for Alice’s row. Transaction A commits. Transaction B re-evaluates its read or fails with a serialization error. Either way, the anomaly is prevented.
The cost: every read blocks every concurrent write. On a table with heavy read-write overlap, S2PL turns into a lock convoy. Throughput collapses under contention because readers and writers spend most of their time waiting.
SSI: detect instead of block
SSI runs snapshot isolation as normal. Reads do not block writes, and writes do not block reads. Instead, PostgreSQL tracks the read-write dependencies between concurrent transactions and checks whether those dependencies form a dangerous pattern at commit time.
The core insight comes from Cahill, Röhm, and Fekete’s 2008 paper: every snapshot isolation anomaly contains a “dangerous structure” of two adjacent rw-conflict edges.
An rw-conflict occurs when one transaction reads data that a concurrent transaction later writes. Transaction T1 reads row R. While T1 is still running, T2 writes to row R and commits. T1 now has an rw-conflict with T2. T1 appears to have executed before T2 (because it did not see T2’s write), but T2 committed before T1 finished.
SSI tracks these conflicts and looks for this pattern:
Tin --rw--> Tpivot --rw--> Tout
Three transactions in a chain. Tin reads something. Tpivot reads the same thing and also writes something else. Tout reads what Tpivot wrote. If Tin, Tpivot, and Tout form a cycle in the dependency graph, no serial ordering of these transactions can explain the results. An anomaly will occur if all three commit.
When PostgreSQL detects this structure at commit time, it rolls back one of the transactions. The application gets SQLSTATE 40001 (serialization failure) and retries.
Predicate locks: tracking what you read
SSI needs to know what each transaction read so it can detect rw-conflicts. PostgreSQL uses predicate locks for this. A predicate lock records the data that a serializable transaction accessed during a SELECT.
Predicate locks appear in pg_locks with mode SIReadLock:
BEGIN ISOLATION LEVEL SERIALIZABLE;
SELECT * FROM doctors WHERE on_call = true;
SELECT locktype, relation::regclass, page, tuple, mode
FROM pg_locks
WHERE mode = 'SIReadLock';
-- locktype | relation | page | tuple | mode
-- relation | doctors | | | SIReadLock
The lock says: “this transaction read from this relation.” If another concurrent transaction writes to doctors, PostgreSQL checks for SIReadLock holders and records the rw-conflict.
These are not traditional locks. They do not block anything. They exist solely so PostgreSQL can detect when a write would have affected a prior read.
Lock granularity and promotion
Predicate locks operate at three levels, starting with the finest:
- Tuple-level: A lock on a specific row, placed when the transaction reads an individual tuple.
- Page-level: A lock on an 8KB heap page, placed when too many tuples on a single page have individual locks.
- Relation-level: A lock on the entire table, placed when too many pages or tuples have accumulated locks.
Promotion is controlled by two GUCs:
max_pred_locks_per_page(default 2). When a transaction holds predicate locks on more than 2 tuples on a single page, the tuple locks are promoted to a single page-level lock.max_pred_locks_per_relation(default -2, meaningmax_pred_locks_per_transaction / 2). When a transaction holds predicate locks on more than this many pages or tuples in a relation, the locks are promoted to a single relation-level lock.
Promotion exists to cap memory usage. Without it, a query that touches a million rows would accumulate a million predicate locks per transaction. The tradeoff is precision: a relation-level lock conflicts with every write to the table, even writes to rows the transaction never read. This creates false positives.
max_pred_locks_per_transaction (default 64) controls the shared predicate lock table size. Increase it if you have serializable transactions that touch many tables.
Sequential scans and relation-level promotion
This is the most common source of false positives in SSI. A sequential scan reads every page in the table. Each page that contains matching rows gets a predicate lock. When the lock count hits the promotion threshold, the lock escalates to relation-level.
A relation-level SIReadLock conflicts with every subsequent write to that table, regardless of which rows the transaction actually read and which rows the writer is modifying. A transaction that scanned for on_call = true (and only cared about two rows) can conflict with a writer updating a completely unrelated column on a different row.
This is why SSI performance degrades under high concurrency on the same table. The more concurrent serializable transactions you have, the more relation-level predicate locks accumulate, and the more rw-conflicts get recorded. Most of those conflicts are false positives.
Index predicate locks
B-tree indexes have their own predicate lock handling. When a serializable transaction scans an index, PostgreSQL places predicate locks on index pages rather than heap pages. This captures the “gap” between index entries: a lock on an index page covers the range of key values on that page, not just the specific entries.
Inserting a new row into a B-tree modifies an index page. If a concurrent serializable transaction holds an SIReadLock on that index page, PostgreSQL records the rw-conflict. This is how SSI detects phantoms: a concurrent INSERT that would have been visible to the reading transaction’s query condition.
Not all index types support fine-grained predicate locking. B-tree has full support. GiST and GIN have partial support (relation-level locks only in some cases). Other index types fall back to relation-level predicate locks on the heap.
The rw-conflict graph
PostgreSQL maintains an in-memory directed graph of rw-conflicts between active serializable transactions. Each node is a transaction. Each edge is an rw-conflict: “transaction T1 read something that concurrent transaction T2 wrote.”
At commit time, PostgreSQL checks whether the committing transaction is part of a dangerous structure in this graph. It looks for two rw-conflict edges that share a common pivot transaction, where the “out” transaction has already committed.
Two optimizations reduce false rollbacks.
Tout must commit first. The Cahill et al. proof shows that an anomaly can only occur if Tout (the last transaction in the dangerous structure) commits before the others. PostgreSQL waits to see which transactions commit before declaring a failure. Many dangerous structures resolve themselves: if Tout aborts, the structure is harmless.
A read-only Tin optimization helps too. If Tin (the first transaction in the chain) is read-only, it cannot be part of a cycle unless Tout committed before Tin took its snapshot. This is a PostgreSQL-specific optimization from the Ports and Grittner VLDB 2012 paper. It lets read-only transactions release their predicate locks early, reducing the graph size.
SERIALIZABLE READ ONLY DEFERRABLE
PostgreSQL offers a specific optimization for read-only serializable transactions:
BEGIN ISOLATION LEVEL SERIALIZABLE READ ONLY DEFERRABLE;
A deferrable read-only transaction does not acquire any predicate locks. It runs against a snapshot that is guaranteed to represent a serialized state (it can see the results of all committed transactions as of some single point in time). Because it holds no predicate locks, it does not conflict with any concurrent writer.
The tradeoff: the transaction may block at BEGIN until a safe snapshot is available. If concurrent serializable transactions are still running, PostgreSQL waits for them to commit or roll back before starting the deferrable transaction. For workloads with long-running serializable transactions, this wait can be significant.
Use this for reports, cached lookups, and any read-only logic that must see a consistent snapshot but should never cause a serialization failure in other transactions.
The retry pattern
When PostgreSQL detects a dangerous structure, it rolls back one transaction with SQLSTATE 40001:
ERROR: could not serialize access due to concurrent update
The standard application response is to retry the entire transaction:
from psycopg import sql
from psycopg.errors import SerializationFailure
max_retries = 5
for attempt in range(max_retries):
try:
with conn.transaction():
# application logic here
break # success
except SerializationFailure:
if attempt == max_retries - 1:
raise
continue # retry
PostgreSQL makes an effort to choose which transaction to roll back so that an immediate retry will not hit the same conflict again. The cancelled transaction is always one whose conflicting partner has already committed. The retry should see the other transaction’s writes and proceed without conflict.
Under normal contention levels, retries are rare. Under high contention on the same tables with many concurrent serializable transactions, retry rates can climb. The retry loop itself adds latency and load. This is the practical performance cost of SSI: not the conflict tracking, but the rollbacks and retries when false positives accumulate.
When SSI is not the answer
SSI has specific limitations.
Temporary tables and sequences are not covered. SSI tracks predicate locks on permanent heap tables and their indexes, but operations on temporary tables, sequences, and system catalogs use the same behavior as REPEATABLE READ. If your business rule involves sequences or temporary tables, SSI alone will not enforce it.
Mixed isolation levels break the guarantee. If some transactions run as READ COMMITTED or REPEATABLE READ, they do not participate in SSI conflict tracking. A READ COMMITTED transaction can introduce anomalies that SSI cannot detect because it does not hold predicate locks. For SSI to guarantee serializability, all transactions touching the relevant data must run at the SERIALIZABLE level.
High contention degrades SSI. Hundreds of concurrent transactions reading and writing the same tables will cause the predicate lock table to grow, relation-level promotion to become frequent, and false positive rates to spike. In extreme cases, you spend more time rolling back and retrying than you would have spent waiting under S2PL.
Long transactions accumulate risk. A serializable transaction that runs for 30 seconds accumulates predicate locks and rw-conflicts with every writer that commits during those 30 seconds. The longer the transaction, the more likely it participates in a dangerous structure. Keep serializable transactions short.
DDL operations conflict with SSI predicate locks. Schema changes (CREATE INDEX, ALTER TABLE) take heavyweight locks that are incompatible with SSI. Mixing DDL with serializable DML can cause unexpected serialization failures.
Diagnostics
When you get a serialization failure, pg_locks is the starting point:
SELECT pid, locktype, relation::regclass, page, tuple, mode,
granted, virtualxid, transactionid
FROM pg_locks
WHERE mode = 'SIReadLock'
ORDER BY relation::regclass, page, tuple;
This shows every predicate lock held by active serializable transactions. If you see relation-level locks on tables where transactions only read a few rows, promotion has occurred and you are getting false positives.
Watch for SSI-related wait events:
SELECT pid, wait_event_type, wait_event
FROM pg_stat_activity
WHERE backend_type = 'client backend'
AND wait_event_type = 'LWLock';
Serializable and SIREAD prefixed wait events indicate contention on the predicate lock infrastructure.
SSI vs two-phase locking: the practical tradeoff
| Aspect | S2PL | SSI |
|---|---|---|
| Reads block writes | Yes | No |
| Writes block reads | Yes | No |
| Deadlocks | Possible (blocking cycles) | No (detection-based) |
| False positives | None | Yes (promoted predicate locks) |
| Performance under contention | Degrades (blocking) | Degrades (retries) |
| Application complexity | Must manage lock timeouts | Must handle SQLSTATE 40001 |
The choice depends on your contention pattern. S2PL blocks early and fails clearly. SSI lets everything run and cleans up anomalies at commit time. For most PostgreSQL workloads where true serializability is needed, SSI is the better default. It avoids lock convoys, and under normal contention levels the retry rate is negligible.
Summary
Serializable Snapshot Isolation, introduced in PostgreSQL 9.1, prevents write skew and other serialization anomalies without the blocking overhead of traditional two-phase locking. It works by tracking predicate locks (SIReadLock) on the data each serializable transaction reads, recording rw-conflicts between concurrent transactions, and rolling back transactions that form dangerous structures in the conflict graph at commit time.
Predicate lock promotion from tuple to page to relation level is the main source of false positives. Sequential scans promote to relation-level locks quickly, which conflicts with every write to the table. The SERIALIZABLE READ ONLY DEFERRABLE option lets read-only transactions skip predicate locks entirely. SQLSTATE 40001 is the standard retry signal. The guarantee only holds if all transactions touching the data run at the SERIALIZABLE level.
Previous in the series: Incremental sort — how PostgreSQL 13 made multi-column ORDER BY cheaper