PostgreSQL Deep Dive: B-Tree Internals — Page Splits, LP_DEAD, and Why Your Index is Bloated
You vacuum regularly. Autovacuum is tuned. And yet — that index on users.email is three times bigger than it should be. What is eating all that space?
The answer lives inside the B-tree index itself. Page splits, dead index tuples, LP_DEAD bits, and version churn from MVCC all conspire to inflate your indexes. Today we are going deep into how PostgreSQL B-tree indexes actually work on disk: how pages split, how dead tuples accumulate, how the index cleans itself up (or doesn’t), and what deduplication does to help.
This is not an “indexes make queries fast” post. You already know that. This is about what happens after you create the index.
How a B-Tree Index is Laid Out on Disk
A PostgreSQL B-tree index is stored as a series of 8KB pages (blocks), organized into levels:
[Meta Page] (block 0)
|
[Root Page]
/ \
[Internal] [Internal]
/ \ / \
[Leaf] [Leaf] [Leaf] [Leaf]
- Meta page — always block 0. Contains a pointer to the root page and other bookkeeping.
- Internal pages — contain tuples that point down to the next level. Each tuple has a key value and a child page pointer. The “high key” of an internal page defines the upper boundary of keys that can appear in its subtree.
- Leaf pages — the bottom level. Each leaf tuple contains the indexed column value(s) and a TID (table tuple identifier: block number + offset) pointing to a row in the heap (the actual table).
- Leaf pages are ~99% of all pages in a typical index. Internal pages are a tiny overhead.
Each page uses the standard PostgreSQL page layout: a PageHeaderData, an array of ItemId (line pointers), and free space. Tuples are added from the end of the page working backwards; line pointers grow from the front.
Pages at the same level are linked into a doubly-linked list. A forward index scan follows right-pointers; a backward scan follows left-pointers. This is why B-tree indexes support both ASC and DESC order efficiently — it’s just traversing the linked list in different directions.
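To make the layout concrete, here is a back-of-envelope capacity estimate for a leaf page, using the standard page-layout sizes (PageHeaderData 24 bytes, ItemId 4 bytes, IndexTupleData header 8 bytes, B-tree special space 16 bytes). This is a toy calculation that ignores alignment padding, not `pageinspect` output:

```python
# Rough capacity estimate for a B-tree leaf page.
# Constants below are the standard PostgreSQL page-layout sizes; treat
# the result as a back-of-envelope figure, not an exact on-disk count.

PAGE_SIZE = 8192        # default block size
PAGE_HEADER = 24        # sizeof(PageHeaderData)
BTREE_SPECIAL = 16      # sizeof(BTPageOpaqueData), per-page B-tree metadata
LINE_POINTER = 4        # sizeof(ItemIdData), grows from the front of the page
TUPLE_HEADER = 8        # sizeof(IndexTupleData), includes the heap TID

def leaf_capacity(key_bytes: int) -> int:
    """How many index tuples fit on one leaf page (ignoring alignment)."""
    usable = PAGE_SIZE - PAGE_HEADER - BTREE_SPECIAL
    per_entry = LINE_POINTER + TUPLE_HEADER + key_bytes
    return usable // per_entry

print(leaf_capacity(8))   # bigint key: roughly 400 entries per leaf page
```

The point of the exercise: with a small key, one leaf page holds hundreds of entries, which is why the tree stays shallow and why internal pages are such a tiny fraction of the total.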
Page Splits: The Real Cost of Random Inserts
When a new index tuple does not fit on its target leaf page, PostgreSQL performs a page split:
- A new leaf page is allocated.
- Roughly half the items from the overflowing page are moved to the new page.
- A downlink (a new entry in the parent page pointing to the new child) is inserted into the parent.
- If the parent also overflows, it splits too — splits cascade upwards recursively.
- If the root page splits, a new root is created above the old one, adding a new level to the tree.
The 50/50 Split Problem
The default split strategy moves approximately half the items to the new page. This is safe for random inserts — it leaves both pages roughly half full, leaving room for future inserts on either side.
But for monotonically increasing keys (timestamps, sequences, serial IDs), only the rightmost page ever receives inserts. After each 50/50 split, the left half is frozen at ~50% full and never sees another insert, while the right half fills up and splits again. You end up with pages averaging only ~50% full, worse than the ~70% utilization typical of random insert patterns.
PostgreSQL uses a sequential split optimization: when the rightmost leaf page splits during a right-boundary insert, it performs a 90/10 split instead — keeping 90% on the old page and moving only 10% to the new one. This is more space-efficient for append-only workloads. But it only triggers when inserting at the right edge.
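The space outcome of the two split strategies can be seen in a toy simulation: pages are plain lists, keys arrive in ascending order, and `keep` is the fraction the old (left) page retains on a split. The real split-point logic is far more nuanced; this only illustrates the resulting density:

```python
# Toy model of leaf-page density under ascending-key inserts.
# 'keep' is the fraction the old page retains on a split. This ignores
# PostgreSQL's real split-point heuristics and only shows the space
# outcome of 50/50 vs 90/10 splits for an append-only workload.

def simulate(n_keys: int, capacity: int, keep: float) -> float:
    pages = [[]]
    for key in range(n_keys):
        if len(pages[-1]) >= capacity:      # rightmost page full: split it
            full = pages[-1]
            cut = int(capacity * keep)
            pages[-1] = full[:cut]          # old page keeps 'keep' fraction
            pages.append(full[cut:])        # new right page gets the rest
        pages[-1].append(key)               # ascending keys always go right
    return sum(len(p) for p in pages) / (len(pages) * capacity)

print(f"50/50 split: {simulate(100_000, 100, 0.5):.0%} average density")
print(f"90/10 split: {simulate(100_000, 100, 0.9):.0%} average density")
```

With 50/50 splits every frozen left page sits at half full forever; with 90/10 splits they sit at 90%, which is exactly the win the rightmost-split optimization buys.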
Split Cascading and WAL Amplification
A single leaf page split may cascade all the way up to the root. Each level’s split generates its own WAL record. So one INSERT can produce multiple WAL records:
INSERT into table → 1 heap WAL record + 1 index WAL record (no split)
OR
→ 1 heap WAL record + N index WAL records (cascading splits)
In extreme cases, a heavily split index generates significantly more WAL than the table itself. If you see WAL volume spikes correlated with heavy INSERT loads, page splits are a prime suspect.
LP_DEAD: How Index Tuples Get Marked for Death
Under MVCC, when you UPDATE a row, PostgreSQL creates a new version of the row in the heap. And — critically — it creates new index entries in every index on the table, even indexes whose columns were not changed. (The Heap-Only Tuples / HOT optimization only applies when no indexed column changed.)
The old index tuples become dead — they point to a heap tuple that is no longer visible to any snapshot. But PostgreSQL does not remove them immediately.
Here is the lifecycle:
1. UPDATE sets a new heap tuple, inserts new index tuples
2. Old heap tuple's xmax is set (it's deleted from MVCC perspective)
3. Old index tuples still physically exist on their leaf pages
4. An index scan encounters the old index tuple, follows it to the heap,
discovers the heap tuple is dead (via MVCC visibility checks)
5. The scan sets the LP_DEAD bit on that index tuple's line pointer
6. Later, when the page needs space, or during VACUUM, dead tuples are reclaimed
The LP_DEAD bit is set on the ItemId (line pointer) in the index page. It’s a hint: “this index entry points to a dead heap tuple, it’s safe to remove.”
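Steps 4–6 of the lifecycle can be sketched as a toy model. Here `heap_live` stands in for MVCC visibility checks, and all names are illustrative, not PostgreSQL internals:

```python
# Toy model of the LP_DEAD lifecycle (steps 4-6 above). 'heap_live'
# stands in for MVCC visibility checks; all names are illustrative.
from dataclasses import dataclass

@dataclass
class IndexEntry:
    key: str
    tid: tuple            # (block, offset) pointing into the heap
    lp_dead: bool = False

heap_live = {(0, 1): False, (0, 5): True}   # old version dead, new one live

page = [IndexEntry("alice@example.com", (0, 1)),   # stale entry
        IndexEntry("alice@example.com", (0, 5))]   # current entry

def index_scan(page, key):
    """Return live TIDs; set LP_DEAD on entries whose heap tuple is dead."""
    results = []
    for entry in page:
        if entry.key != key:
            continue
        if heap_live[entry.tid]:
            results.append(entry.tid)
        else:
            entry.lp_dead = True    # step 5: hint bit set on the way back
    return results

def simple_deletion(page):
    """Step 6: when the page needs space, drop every LP_DEAD entry."""
    return [e for e in page if not e.lp_dead]

index_scan(page, "alice@example.com")      # marks the stale entry
page = simple_deletion(page)               # reclaims it
```

Note that nothing here required VACUUM: an ordinary query did the marking, and space pressure did the reclaiming.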
Who Sets LP_DEAD?
Index scans set LP_DEAD opportunistically. When a scan follows an index tuple to the heap and discovers the heap tuple is dead (not visible to any snapshot), it sets LP_DEAD on the way back. Later, when the page is about to split, PostgreSQL first removes every tuple whose LP_DEAD bit is set — this is called simple deletion. Both halves are essentially free: the bit is set during normal query execution, and the removal happens only when space is needed anyway.
Not all dead index tuples get LP_DEAD set. It depends on whether scans actually visit them. If no query selects for the old value, the LP_DEAD bit may never get set through scans alone.
Bottom-Up Index Deletion (PostgreSQL 14+)
This is where it gets interesting. PostgreSQL 14 introduced bottom-up index deletion, a completely different strategy for cleaning up index bloat.
Consider this workload:
-- Table with multiple indexes
CREATE TABLE events (
id BIGSERIAL PRIMARY KEY,
event_type TEXT NOT NULL,
payload JSONB,
created_at TIMESTAMPTZ DEFAULT now()
);
CREATE INDEX idx_events_type ON events (event_type);
CREATE INDEX idx_events_created ON events (created_at);
-- Heavy UPDATE workload that changes payload but NOT event_type or created_at
UPDATE events SET payload = '{"status": "processed"}' WHERE id < 10000;
Neither event_type nor created_at changed — but suppose the updated payload is large enough that the new row version no longer fits on the same heap page. Then the HOT optimization cannot apply even though no indexed column changed, and new index entries are created in all three indexes. The indexes on event_type and created_at now accumulate dead tuples pointing to old row versions, despite their key values never changing.
Bottom-up deletion works like this:
- When a new index tuple needs to go on a leaf page that’s too full, PostgreSQL first tries a bottom-up deletion pass.
- It looks at the page and identifies tuples that are likely garbage — specifically, tuples that look like old versions of the same logical row (same key value, different TIDs pointing to different heap versions).
- It deletes these old versions directly, without requiring LP_DEAD to be set first.
- If enough space is freed, the page split is avoided entirely.
This is why it’s called “bottom-up” — it’s triggered by the B-tree code itself, from the bottom of the tree, in reaction to space pressure. It contrasts with “top-down” cleanup by VACUUM, which starts from the table level and works through indexes.
Bottom-up deletion is the primary cleanup mechanism for indexes that experience version churn from UPDATEs that don’t modify the indexed column. In many cases, these indexes never grow by even a single page despite constant churn.
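A toy sketch of a bottom-up pass: walk the page's entries, check which old versions are no longer visible, and delete them outright. Real PostgreSQL visits the heap in batches and uses heuristics to pick promising TIDs; this only illustrates the core idea:

```python
# Toy sketch of a bottom-up deletion pass. Real PostgreSQL consults
# the heap in batches and uses heuristics to pick promising TIDs;
# this only shows the core idea: old versions are deleted directly,
# with no LP_DEAD bit required first.

def bottom_up_pass(page, heap_live, needed_space, entry_size=20):
    """Try to free 'needed_space' bytes; return (new_page, split_avoided)."""
    survivors = []
    freed = 0
    for key, tid in page:
        if heap_live[(key, tid)]:
            survivors.append((key, tid))
        else:
            freed += entry_size    # dead version deleted outright
    return survivors, freed >= needed_space

# Three versions of the same logical row; only the newest is visible.
page = [("evt", (0, 1)), ("evt", (0, 7)), ("evt", (0, 9))]
heap_live = {("evt", (0, 1)): False,
             ("evt", (0, 7)): False,
             ("evt", (0, 9)): True}

page, avoided_split = bottom_up_pass(page, heap_live, needed_space=20)
```

If `avoided_split` comes back true, the incoming tuple fits after all and the page never splits — which is how churn-heavy indexes can hold a steady size.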
When Bottom-Up Fails
Bottom-up deletion has limitations:
- Long-running transactions: If a snapshot could still see the old heap version, the index tuple can’t be deleted. Bottom-up deletion respects MVCC visibility.
- Insufficient dead tuples on the page: If the page doesn’t have enough garbage, the deletion pass can’t free enough space and falls through to a page split.
- Non-duplicate values: Bottom-up deletion specifically targets version churn — multiple index entries with the same key but different TIDs. If keys are unique, there’s nothing to group.
When bottom-up deletion fails, the page splits, and the index grows.
Deduplication (PostgreSQL 13+)
Another weapon against index bloat: deduplication. When a leaf page is full and deletion couldn’t free enough space, PostgreSQL tries deduplication before splitting.
Normally, each index tuple stores its full key value plus one TID:
Tuple 1: key="active" → TID (0,1)
Tuple 2: key="active" → TID (0,5)
Tuple 3: key="active" → TID (0,12)
Tuple 4: key="active" → TID (0,18)
After deduplication, these become a single posting list tuple:
Posting tuple: key="active" → [TID (0,1), TID (0,5), TID (0,12), TID (0,18)]
The key is stored once, followed by a sorted array of TIDs. This can dramatically reduce the size of indexes with low cardinality (many rows share the same key value).
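A back-of-envelope calculation shows where the savings come from. A heap TID is 6 bytes (4-byte block number + 2-byte offset) and the 8-byte index tuple header already embeds one TID; the key size here is illustrative and alignment padding is ignored:

```python
# Back-of-envelope savings from deduplication. A heap TID is 6 bytes;
# the 8-byte IndexTupleData header already embeds one heap TID. Key
# size is illustrative; alignment and line pointers are ignored.

TID_BYTES = 6
TUPLE_HEADER = 8

def plain_size(key_bytes: int, n_duplicates: int) -> int:
    """n separate tuples, each repeating the key (header embeds one TID)."""
    return n_duplicates * (TUPLE_HEADER + key_bytes)

def posting_size(key_bytes: int, n_duplicates: int) -> int:
    """One posting-list tuple: key stored once, then a sorted TID array."""
    return TUPLE_HEADER + key_bytes + n_duplicates * TID_BYTES

# 4 duplicates of a key like "active" (say 7 bytes on disk):
print(plain_size(7, 4), "bytes before,", posting_size(7, 4), "bytes after")
```

The per-duplicate cost drops from header-plus-key to just one TID, so the savings grow with both key width and duplicate count — which is why low-cardinality indexes shrink the most.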
When Deduplication Applies
- Enabled by default since PostgreSQL 13 (controlled by the deduplicate_items storage parameter)
- Works for most standard data types with deterministic collations
- Works in unique indexes too, where it absorbs temporary version-churn duplicates
- Does NOT work for: numeric, jsonb, float4/float8, non-deterministic collations, INCLUDE indexes, or container types (arrays, composites, ranges)
-- Check if deduplication is enabled for an index
SELECT oid::regclass AS index_name, reloptions
FROM pg_class
WHERE relkind = 'i' AND relname = 'idx_events_type';
-- Disable deduplication for a specific index (if needed)
ALTER INDEX idx_events_type SET (deduplicate_items = off);
Deduplication During CREATE INDEX / REINDEX
When building an index from scratch, PostgreSQL packs posting list tuples as tightly as possible — each group of duplicates from the sorted input is merged into a posting list tuple before being written to the page. This is why a REINDEX often produces a significantly smaller index than what grew organically.
Detecting and Measuring Index Bloat
Query: Index Size and Dead Tuple Ratio
SELECT
    schemaname,
    relname AS table_name,
    indexrelname AS index_name,
    pg_size_pretty(pg_relation_size(indexrelid)) AS index_size,
    idx_scan,
    idx_tup_read,
    idx_tup_fetch
FROM pg_stat_user_indexes
ORDER BY pg_relation_size(indexrelid) DESC
LIMIT 20;
Query: Estimate Index Bloat
-- Requires the pgstattuple extension
CREATE EXTENSION IF NOT EXISTS pgstattuple;
-- pgstatindex() works on B-tree indexes only
SELECT
    indexrelid::regclass AS index_name,
    pg_size_pretty(pg_relation_size(indexrelid)) AS total_size,
    (pgstatindex(indexrelid)).avg_leaf_density AS leaf_density_pct,
    (pgstatindex(indexrelid)).leaf_fragmentation AS leaf_fragmentation_pct
FROM pg_stat_user_indexes
WHERE pg_relation_size(indexrelid) > 100 * 1024 * 1024 -- only indexes > 100MB
ORDER BY pg_relation_size(indexrelid) DESC;
A healthy leaf density is typically 70-90%. Below 50% suggests significant bloat.
Query: HOT Update Ratio (a Proxy for Index Churn)
-- From pg_stat_user_tables, not directly about indexes,
-- but heavy UPDATE counts correlate with index version churn
SELECT
relname AS table_name,
n_tup_upd,
n_tup_hot_upd,
ROUND(n_tup_hot_upd::numeric / NULLIF(n_tup_upd, 0) * 100, 1) AS hot_update_pct,
n_tup_del
FROM pg_stat_user_tables
WHERE n_tup_upd > 0
ORDER BY n_tup_upd DESC;
A low HOT update percentage means most UPDATEs are generating new index entries in every index.
Practical: Reducing Index Bloat
1. Increase HOT Updates
HOT updates avoid creating new index entries entirely. They work when:
- No indexed column is modified by the UPDATE
- The new heap tuple fits on the same page as the old one
You can increase the chance of same-page updates by reducing fillfactor:
-- Leave more free space on each page for HOT updates
ALTER TABLE events SET (fillfactor = 85);
-- Rebuild the table to apply immediately
VACUUM FULL events;
-- Or use pg_repack for zero-downtime rebuild
Default fillfactor is 100. Setting it to 85 leaves 15% free space on each page for in-place updates. This trades initial table size for reduced index bloat.
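The fillfactor arithmetic is simple enough to sketch. Numbers here are illustrative; real pages also hold headers and line pointers, so usable free space is slightly less:

```python
# Fillfactor arithmetic: how much same-page space is reserved for HOT
# updates. Illustrative only; real pages also hold headers and line
# pointers, so usable free space is slightly less.

PAGE_SIZE = 8192

def reserved_bytes(fillfactor: int) -> int:
    """Space left free on each heap page at initial load."""
    return PAGE_SIZE * (100 - fillfactor) // 100

def hot_possible(new_tuple_bytes: int, free_on_page: int) -> bool:
    """A HOT update needs the new version to fit on the old tuple's page."""
    return new_tuple_bytes <= free_on_page

print(reserved_bytes(100))  # default: nothing reserved
print(reserved_bytes(85))   # ~1.2KB kept free per page for in-place updates
```

Roughly 1.2KB of headroom per page is enough for several small row versions, which is what keeps UPDATE-heavy tables on the HOT path.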
2. Rebuild Bloated Indexes
-- REINDEX rebuilds the index from scratch — optimized, deduplicated
REINDEX INDEX CONCURRENTLY idx_events_type;
-- REINDEX TABLE rebuilds all indexes
REINDEX TABLE CONCURRENTLY events;
CONCURRENTLY does not block reads or writes but takes longer and requires more disk space during the rebuild.
3. Consider Your Index Design
Every extra index on a table amplifies the write cost of UPDATEs:
-- Before: 3 indexes, every UPDATE touches all 3
CREATE INDEX idx_events_type ON events (event_type);
CREATE INDEX idx_events_created ON events (created_at);
CREATE INDEX idx_events_payload ON events (payload);
-- After: combined index reduces the number of indexes to maintain
-- (only if queries can use it effectively)
CREATE INDEX idx_events_type_created ON events (event_type, created_at);
DROP INDEX idx_events_type;
Fewer indexes = fewer page splits = less bloat = less vacuum work.
4. Monitor and Tune for Your Workload
-- Per-table autovacuum tuning for UPDATE-heavy tables
ALTER TABLE events SET (
autovacuum_vacuum_scale_factor = 0.05, -- vacuum at 5% changes (default 20%)
autovacuum_analyze_scale_factor = 0.02 -- analyze at 2% changes
);
More frequent vacuuming means dead index tuples are cleaned up sooner, reducing the window for bloat accumulation.
The Full Picture: Who Cleans What?
| Mechanism | Trigger | What It Cleans | PG Version |
|---|---|---|---|
| Simple deletion (LP_DEAD) | Index scans | Dead tuples with LP_DEAD set | All |
| Bottom-up deletion | Page space pressure | Version churn tuples (same key, old TIDs) | 14+ |
| Deduplication | Page space pressure | Duplicate key tuples → posting lists | 13+ |
| VACUUM / autovacuum | Table-level thresholds | All dead tuples, sets VM bits | All |
All four mechanisms cooperate. Bottom-up deletion handles the steady-state churn from UPDATEs. Simple deletion catches what scans discover. Deduplication compresses duplicates. VACUUM does the comprehensive sweep.
Key Takeaways
- Page splits are the primary cause of index growth — each split allocates a new 8KB page, and cascading splits multiply the effect.
- UPDATEs insert into every index, not just the ones whose columns changed. This is the hidden cost of indexes on UPDATE-heavy tables.
- LP_DEAD bits are set opportunistically by index scans; they’re a hint, not a guarantee.
- Bottom-up deletion (PG 14+) is the most important cleanup mechanism for version-churn indexes — it proactively targets old versions without waiting for vacuum.
- Deduplication (PG 13+) compresses duplicate keys into posting lists, saving space and potentially avoiding page splits.
- REINDEX CONCURRENTLY is your reset button for bloated indexes — it rebuilds with full deduplication and optimal packing.
- Monitor HOT update percentage — if it’s low, most UPDATEs are creating new index entries, and you should consider fillfactor tuning or index consolidation.
What’s Next
Tomorrow we will look at a closely related topic: Index Bloat — Detection, Causes, and pg_repack. We will cover how to measure exactly how much space is wasted, when bloat becomes a real performance problem, and how to rebuild indexes and tables with zero downtime using pg_repack.