PostgreSQL Deep Dive: MVCC, xmin, xmax — How Postgres Really Decides What You Can See
PostgreSQL doesn’t use locks to decide who sees what. It uses something far more elegant — and far more likely to bite you if you don’t understand it. Let’s crack open MVCC and look at how xmin, xmax, and snapshot visibility actually work.
The Core Idea: Every Version Lives On Disk
Unlike databases that overwrite rows in place (MySQL’s InnoDB with its undo log, for instance), Postgres takes a different approach: it never overwrites a row. When you UPDATE a row, Postgres:
- Inserts a completely new version of the row (a new “tuple”)
- Marks the old version as expired
- Both versions coexist on disk until VACUUM cleans up
This is Multi-Version Concurrency Control (MVCC). The payoff is enormous — readers never block writers, and writers never block readers. But the cost is that your table can grow physically larger than the data it logically contains.
Let’s look at the machinery that makes this work.
xmin and xmax: The Version Bookends
Every tuple (row version) in Postgres has two hidden system columns:
- xmin — the transaction ID (XID) of the transaction that created this version
- xmax — the transaction ID of the transaction that deleted or replaced this version (0 if the version is still the current one)
You can actually query these:
SELECT xmin, xmax, id, name FROM users WHERE id = 1;
Here’s what happens in different scenarios:
INSERT
-- Transaction 100
INSERT INTO users (id, name) VALUES (1, 'Alice');
Result: one tuple with xmin = 100, xmax = 0.
UPDATE
-- Transaction 200
UPDATE users SET name = 'Bob' WHERE id = 1;
Result: two tuples on disk:
| xmin | xmax | name |
|---|---|---|
| 100 | 200 | Alice |
| 200 | 0 | Bob |
The old tuple got xmax = 200 (expired by transaction 200). A new tuple appeared with xmin = 200.
DELETE
-- Transaction 300
DELETE FROM users WHERE id = 1;
Result: the current tuple gets xmax = 300. It’s still on disk — just marked as dead.
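You can watch these bookends move with a throwaway table. This is a sketch to run in psql against a scratch database; mvcc_demo is a hypothetical table name, and the actual XID values you see will differ:

```sql
CREATE TABLE mvcc_demo (id int, name text);

INSERT INTO mvcc_demo VALUES (1, 'Alice');
SELECT xmin, xmax, name FROM mvcc_demo;  -- xmax = 0: current version

-- An aborted DELETE still stamps xmax on the tuple...
BEGIN;
DELETE FROM mvcc_demo WHERE id = 1;
ROLLBACK;

-- ...yet the row is visible again, now carrying a nonzero xmax
-- (Postgres knows that XID aborted, so the "delete" never counts)
SELECT xmin, xmax, name FROM mvcc_demo;
```

That last query is a nice demonstration that xmax alone decides nothing — visibility also depends on whether the transaction behind it committed.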
Snapshot Visibility: The Rules Engine
Having xmin and xmax is only half the story. Postgres needs rules to decide which versions a transaction can see. That’s where snapshots come in.
Whenever Postgres takes a snapshot (at READ COMMITTED isolation, a new snapshot is taken for each statement; at REPEATABLE READ or SERIALIZABLE, one snapshot is taken at the transaction's first statement and reused for the rest of the transaction), it records:
- xmin — the lowest still-active transaction ID (everything below this is guaranteed finished)
- xmax — the next transaction ID to be assigned (everything >= this hasn't started yet)
- xip — a list of transaction IDs that are currently in-progress
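You can ask Postgres to show you your own snapshot. pg_current_snapshot() exists from PostgreSQL 13 onward; older versions spell it txid_current_snapshot(). The output values below are illustrative:

```sql
-- Output format is xmin:xmax:xip_list
SELECT pg_current_snapshot();
-- e.g. 748:752:749,751 → xmin = 748, xmax = 752, xip = {749, 751}
```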
Then for each tuple, Postgres runs this visibility check:
- Is xmin committed? If the inserting transaction aborted, the tuple is invisible. Period.
- Is xmin < snapshot.xmin? If so, the inserting transaction finished before our snapshot began → the tuple is potentially visible.
- Is xmin in xip (the in-progress list)? If so, the inserting transaction was running when we took our snapshot → invisible to us (we can't see its uncommitted work).
- Is xmax = 0? No one has deleted or replaced this tuple → it's visible.
- Is xmax committed? If the deleting transaction aborted, the delete didn't happen → still visible.
- Is xmax in xip or >= snapshot.xmax? The delete happened after our snapshot → the tuple is still visible to us.
This is simplified — the real implementation involves infomask hint bits for performance — but it captures the essential logic.
The Infomask: Fast-Path Hints
Checking transaction status for every tuple on every query would be insanely expensive (you’d need to look up pg_xact for every row). So Postgres caches the result in infomask bits stored directly in the tuple header:
- HEAP_XMIN_COMMITTED — the inserting transaction is known to be committed
- HEAP_XMIN_INVALID — the inserting transaction is known to have aborted
- HEAP_XMAX_COMMITTED — the deleting transaction is known to be committed
These hint bits are set lazily. The first transaction that checks a tuple and confirms the status sets the hint bit for everyone else. This is why a SELECT can sometimes cause write I/O — it’s setting hint bits on the page (which requires marking the page dirty).
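If you want to see the raw bits, the pageinspect extension (shipped in contrib, superuser-only) exposes tuple headers directly. The table name and page number here are just examples:

```sql
CREATE EXTENSION IF NOT EXISTS pageinspect;

-- Tuple headers from the first page of the table
SELECT lp, t_xmin, t_xmax, to_hex(t_infomask) AS infomask
FROM heap_page_items(get_raw_page('users', 0));
-- HEAP_XMIN_COMMITTED is 0x0100; HEAP_XMAX_COMMITTED is 0x0400
```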
You can track tuple churn and dead-tuple counts with pg_stat_user_tables:
SELECT n_tup_ins, n_tup_upd, n_tup_del, n_dead_tup
FROM pg_stat_user_tables
WHERE relname = 'users';
n_dead_tup tells you how many dead tuples are sitting on disk waiting for VACUUM.
Dead Tuples and Table Bloat
Here’s the gotcha that catches most people. Every UPDATE creates:
- A dead tuple (the old version)
- A new tuple (the new version)
If your workload is update-heavy and autovacuum isn’t aggressive enough, dead tuples accumulate. But it gets worse — even after VACUUM marks the space as reusable, the table doesn’t shrink. The dead tuple space goes into the Free Space Map (FSM) for reuse by future INSERTs and UPDATEs, but the file size stays the same.
This is table bloat. You can check for it:
-- Requires the pgstattuple extension
CREATE EXTENSION IF NOT EXISTS pgstattuple;
SELECT
  pg_size_pretty(table_len) AS total_size,
  round(dead_tuple_percent::numeric, 2) AS dead_tuple_pct,
  round(free_percent::numeric, 2) AS free_space_pct
FROM pgstattuple('users');
If dead_tuple_percent or free_space_pct is high, your table is bloated. The fix options are:
- Tune autovacuum to run more frequently on this table
- VACUUM FULL — rewrites the entire table and indexes (exclusive lock!)
- pg_repack — does the same thing without holding an exclusive lock
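For the first option, autovacuum thresholds can be set per table via storage parameters. The numbers below are illustrative starting points for an update-heavy table, not a universal recommendation (the default scale factor is 0.2, i.e. vacuum at roughly 20% dead tuples):

```sql
-- Trigger autovacuum on 'users' at ~2% dead tuples instead of 20%
ALTER TABLE users SET (
  autovacuum_vacuum_scale_factor = 0.02,
  autovacuum_vacuum_threshold = 500
);
```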
The Long-Running Transaction Trap
Here’s the nastiest MVCC gotcha: a single long-running transaction prevents VACUUM from cleaning up dead tuples that were alive when the transaction started.
-- Session 1: start a long-running transaction
BEGIN;
SELECT * FROM users LIMIT 1;
-- go to lunch, leave this open
-- Session 2: heavy update workload
UPDATE users SET last_login = now(); -- creates dead tuples
-- VACUUM runs but CANNOT clean up the dead tuples
-- because Session 1's snapshot might still need to see the old versions
This is why idle in transaction connections are so dangerous. You can find them:
SELECT pid, now() - xact_start AS duration, state, backend_xmin, query
FROM pg_stat_activity
WHERE state IN ('idle in transaction', 'active')
AND xact_start < now() - interval '5 minutes'
ORDER BY duration DESC;
Set idle_in_transaction_session_timeout to prevent this:
ALTER SYSTEM SET idle_in_transaction_session_timeout = '5min';
SELECT pg_reload_conf();  -- apply the change without a restart
Practical Takeaways
- Check
xminandxmaxwhen debugging visibility issues — they tell you exactly why a row is or isn’t visible - Monitor
n_dead_tupinpg_stat_user_tables— high dead tuple counts mean vacuum isn’t keeping up - Watch for long-running transactions — they’re the #1 cause of table bloat
- Understand that UPDATE = INSERT + DELETE in Postgres — this changes how you think about update-heavy workloads
- SELECT can cause writes — hint bit setting means even reads can dirty pages
What’s Next
Tomorrow we’ll dig into transaction ID wraparound — the existential threat to your database that most people learn about only after it’s too late. It’s directly connected to what we covered today: if VACUUM can’t keep up with freezing old XIDs, your database will shut itself down to prevent data corruption.
This post is part of the PostgreSQL Deep Dive series, published weekdays at 9am AEST. Subscribe to the full series or browse by PostgreSQL tag.