PostgreSQL Deep Dive: MVCC, xmin, xmax — How Postgres Really Decides What You Can See
PostgreSQL doesn’t use locks to decide who sees what. It uses something far more elegant — and far more likely to bite you if you don’t understand it. Let’s crack open MVCC and look at how xmin, xmax, and snapshot visibility actually work.
The Core Idea: Every Version Lives On Disk
Unlike databases that overwrite rows in place (MySQL’s InnoDB with its undo log, for instance), Postgres takes a different approach: it never overwrites a row. When you UPDATE a row, Postgres:
- Inserts a completely new version of the row (a new “tuple”)
- Marks the old version as expired
- Both versions coexist on disk until VACUUM cleans up
This is Multi-Version Concurrency Control (MVCC). The payoff is enormous — readers never block writers, and writers never block readers. But the cost is that your table can grow physically larger than the data it logically contains.
Let’s look at the machinery that makes this work.
xmin and xmax: The Version Bookends
Every tuple (row version) in Postgres has two hidden system columns:
- xmin — the transaction ID (XID) of the transaction that created this version
- xmax — the transaction ID of the transaction that deleted or replaced this version (0 if the version is still the current one)
You can actually query these:
SELECT xmin, xmax, id, name FROM users WHERE id = 1;
Here’s what happens in different scenarios:
INSERT
-- Transaction 100
INSERT INTO users (id, name) VALUES (1, 'Alice');
Result: one tuple with xmin = 100, xmax = 0.
UPDATE
-- Transaction 200
UPDATE users SET name = 'Bob' WHERE id = 1;
Result: two tuples on disk:
| xmin | xmax | name |
|---|---|---|
| 100 | 200 | Alice |
| 200 | 0 | Bob |
The old tuple got xmax = 200 (expired by transaction 200). A new tuple appeared with xmin = 200.
DELETE
-- Transaction 300
DELETE FROM users WHERE id = 1;
Result: the current tuple gets xmax = 300. It’s still on disk — just marked as dead.
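You can watch these bookends move with a throwaway table. This is a sketch to run in psql against a scratch database; mvcc_demo is a hypothetical table name, and the actual XID values you see will differ:

```sql
CREATE TABLE mvcc_demo (id int, name text);

INSERT INTO mvcc_demo VALUES (1, 'Alice');
SELECT xmin, xmax, name FROM mvcc_demo;  -- xmax = 0: current version

-- An aborted DELETE still stamps xmax on the tuple...
BEGIN;
DELETE FROM mvcc_demo WHERE id = 1;
ROLLBACK;

-- ...yet the row is visible again, now carrying a nonzero xmax
-- (Postgres knows that XID aborted, so the "delete" never counts)
SELECT xmin, xmax, name FROM mvcc_demo;
```

That last query is a nice demonstration that xmax alone decides nothing — visibility also depends on whether the transaction behind it committed.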
Snapshot Visibility: The Rules Engine
Having xmin and xmax is only half the story. Postgres needs rules to decide which versions a transaction can see. That’s where snapshots come in.
Whenever Postgres takes a snapshot (at READ COMMITTED isolation, a new snapshot is taken for each statement; at REPEATABLE READ or SERIALIZABLE, one snapshot is taken at the transaction's first statement and reused for the rest of the transaction), it records:
- xmin — the lowest still-active transaction ID (everything below this is guaranteed finished)
- xmax — the next transaction ID to be assigned (everything >= this hasn't started yet)
- xip — a list of transaction IDs that are currently in-progress
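You can ask Postgres to show you your own snapshot. pg_current_snapshot() exists from PostgreSQL 13 onward; older versions spell it txid_current_snapshot(). The output values below are illustrative:

```sql
-- Output format is xmin:xmax:xip_list
SELECT pg_current_snapshot();
-- e.g. 748:752:749,751 → xmin = 748, xmax = 752, xip = {749, 751}
```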
Then for each tuple, Postgres runs this visibility check:
- Is xmin committed? If the inserting transaction aborted, the tuple is invisible. Period.
- Is xmin < snapshot.xmin? If so, the inserting transaction finished before our snapshot began → the tuple is potentially visible.
- Is xmin in xip (the in-progress list)? If so, the inserting transaction was running when we took our snapshot → invisible to us (we can't see its uncommitted work).
- Is xmax = 0? No one has deleted or replaced this tuple → it's visible.
- Is xmax committed? If the deleting transaction aborted, the delete didn't happen → still visible.
- Is xmax in xip or >= snapshot.xmax? The delete happened after our snapshot → the tuple is still visible to us.
This is simplified — the real implementation involves infomask hint bits for performance — but it captures the essential logic.
The Infomask: Fast-Path Hints
Checking transaction status for every tuple on every query would be insanely expensive (you’d need to look up pg_xact for every row). So Postgres caches the result in infomask bits stored directly in the tuple header:
- HEAP_XMIN_COMMITTED — the inserting transaction is known to be committed
- HEAP_XMIN_INVALID — the inserting transaction is known to have aborted
- HEAP_XMAX_COMMITTED — the deleting transaction is known to be committed
These hint bits are set lazily. The first transaction that checks a tuple and confirms the status sets the hint bit for everyone else. This is why a SELECT can sometimes cause write I/O — it’s setting hint bits on the page (which requires marking the page dirty).
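If you want to see the raw bits, the pageinspect extension (shipped in contrib, superuser-only) exposes tuple headers directly. The table name and page number here are just examples:

```sql
CREATE EXTENSION IF NOT EXISTS pageinspect;

-- Tuple headers from the first page of the table
SELECT lp, t_xmin, t_xmax, to_hex(t_infomask) AS infomask
FROM heap_page_items(get_raw_page('users', 0));
-- HEAP_XMIN_COMMITTED is 0x0100; HEAP_XMAX_COMMITTED is 0x0400
```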
You can track tuple churn and dead-tuple counts with pg_stat_user_tables:
SELECT n_tup_ins, n_tup_upd, n_tup_del, n_dead_tup
FROM pg_stat_user_tables
WHERE relname = 'users';
n_dead_tup tells you how many dead tuples are sitting on disk waiting for VACUUM.
Dead Tuples and Table Bloat
Here’s the gotcha that catches most people. Every UPDATE creates:
- A dead tuple (the old version)
- A new tuple (the new version)
If your workload is update-heavy and autovacuum isn’t aggressive enough, dead tuples accumulate. But it gets worse — even after VACUUM marks the space as reusable, the table doesn’t shrink. The dead tuple space goes into the Free Space Map (FSM) for reuse by future INSERTs and UPDATEs, but the file size stays the same.
This is table bloat. You can check for it:
-- Requires the pgstattuple extension
CREATE EXTENSION IF NOT EXISTS pgstattuple;
SELECT
  pg_size_pretty(table_len) AS total_size,
  round(dead_tuple_percent::numeric, 2) AS dead_tuple_pct,
  round(free_percent::numeric, 2) AS free_space_pct
FROM pgstattuple('users');
If dead_tuple_percent or free_space_pct is high, your table is bloated. The fix options are:
- Tune autovacuum to run more frequently on this table
- VACUUM FULL — rewrites the entire table and indexes (exclusive lock!)
- pg_repack — does the same thing without holding an exclusive lock
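For the first option, autovacuum thresholds can be set per table via storage parameters. The numbers below are illustrative starting points for an update-heavy table, not a universal recommendation (the default scale factor is 0.2, i.e. vacuum at roughly 20% dead tuples):

```sql
-- Trigger autovacuum on 'users' at ~2% dead tuples instead of 20%
ALTER TABLE users SET (
  autovacuum_vacuum_scale_factor = 0.02,
  autovacuum_vacuum_threshold = 500
);
```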
The Long-Running Transaction Trap
Here’s the nastiest MVCC gotcha: a single long-running transaction prevents VACUUM from cleaning up dead tuples that were alive when the transaction started.
-- Session 1: start a long-running transaction
BEGIN;
SELECT * FROM users LIMIT 1;
-- go to lunch, leave this open
-- Session 2: heavy update workload
UPDATE users SET last_login = now(); -- creates dead tuples
-- VACUUM runs but CANNOT clean up the dead tuples
-- because Session 1's snapshot might still need to see the old versions
This is why idle in transaction connections are so dangerous. You can find them:
SELECT pid, now() - xact_start AS duration, state, backend_xmin, query
FROM pg_stat_activity
WHERE state IN ('idle in transaction', 'active')
AND xact_start < now() - interval '5 minutes'
ORDER BY duration DESC;
Set idle_in_transaction_session_timeout to prevent this:
ALTER SYSTEM SET idle_in_transaction_session_timeout = '5min';
SELECT pg_reload_conf();  -- apply the change without a restart
Practical Takeaways
- Check
xminandxmaxwhen debugging visibility issues — they tell you exactly why a row is or isn’t visible - Monitor
n_dead_tupinpg_stat_user_tables— high dead tuple counts mean vacuum isn’t keeping up - Watch for long-running transactions — they’re the #1 cause of table bloat
- Understand that UPDATE = INSERT + DELETE in Postgres — this changes how you think about update-heavy workloads
- SELECT can cause writes — hint bit setting means even reads can dirty pages
What’s Next
Tomorrow we’ll dig into transaction ID wraparound — the existential threat to your database that most people learn about only after it’s too late. It’s directly connected to what we covered today: if VACUUM can’t keep up with freezing old XIDs, your database will shut itself down to prevent data corruption.
This post is part of the PostgreSQL Deep Dive series, published weekdays at 9am AEST. Subscribe to the full series or browse by PostgreSQL tag.