PostgreSQL Deep dive: Index bloat — why your index keeps growing even when your table doesn't
Your table has 1 million rows. Your index should be about 22MB. But pg_relation_size() says 89MB. Four times larger than it should be. The table hasn’t grown in months. The row count hasn’t changed. But the index keeps getting bigger.
This is index bloat. It creeps up gradually, invisible until queries slow down for no obvious reason, disk usage climbs, and backups take longer. By the time you notice, the index has been bloated for weeks.
What Index Bloat Is
In PostgreSQL, an UPDATE does not modify a row in place. It creates a new version of the row (a new tuple) and marks the old one as dead. This is MVCC. The dead tuple stays in the table until VACUUM comes along and marks the space as available for reuse.
The same thing happens in indexes. When a row is updated, the index entries pointing to the old tuple must be replaced with entries pointing to the new tuple. The old index entries become dead pointers. When VACUUM runs, it removes the dead index entries. But “removes” in B-tree terms does not mean “reclaims the space.” It means the entry is marked as dead (LP_DEAD) and the space is available for future index entries on the same page. The page itself is not shrunk. It is not returned to the operating system.
If dead index entries are not evenly distributed across pages, some pages end up mostly empty while others are still full. The empty space on sparse pages is wasted. It sits there, occupying disk, being read into cache during index scans, and pushing out useful data. That is bloat.
Bloat also comes from page splits. When a B-tree page fills up and a new entry must be inserted, PostgreSQL splits the page into two, each roughly half full. If subsequent inserts go into one of the new pages and not the other, the other page remains at 50% capacity. Over thousands of inserts and splits, this creates pages with large amounts of unused space.
The Bloat Lifecycle
Here is how a typical index accumulates bloat over time:
1. The table receives a burst of updates. A batch job updates 100,000 rows, or a UPDATE ... SET status = 'processed' WHERE ... touches a large fraction of the table.
2. Each update creates dead index entries. The old index entries (pointing to the old tuple versions) are now dead. They are still physically on the index pages.
3. VACUUM runs and marks them LP_DEAD. Since PostgreSQL 14, VACUUM sets the LP_DEAD hint bit on dead index entries. This is the “bottom-up deletion” mechanism. Entries marked LP_DEAD are logically deleted but still occupy space on the page.
4. Future inserts reuse the LP_DEAD space on the same page. If a new index entry belongs on a page that has LP_DEAD entries, one of those dead entries is overwritten. The space is reused.
5. But if inserts don’t land on the same pages as the dead entries, the space stays wasted. Index bloat happens when the pattern of updates (which creates dead entries) does not match the pattern of inserts (which would reuse them). Monotonically increasing keys are the classic example: updates create dead entries at scattered locations, but new inserts always go to the rightmost leaf page, which has no dead entries to reuse.
6. Page splits compound the problem. If inserts do land on pages with dead entries but those pages are still full, a split occurs. The split creates a new page at 50% capacity, leaving the original page also at roughly 50%. Over time, this fragmentation accumulates.
Why Bloat Matters
Index bloat is not just wasted disk space. It has direct performance consequences:
-
More I/O per index scan. A bloated index has more pages. Every index scan must read more pages from disk (or cache). A query that hits 500 pages instead of 200 pages is 2.5× slower on I/O-bound workloads.
-
Cache pressure. Bloated index pages compete with useful data for space in shared_buffers and the OS page cache. Every bloated page in cache is a useful page that was evicted.
-
Slower VACUUM. VACUUM must scan the entire index to find and clean dead entries. A larger index means slower VACUUM, which means dead entries accumulate longer, which means more bloat.
-
Bigger backups. pg_dump and WAL archiving include the bloated index pages. A 4× bloated index means 4× the backup size for that index.
-
Replication lag. On logical replicas, bloated indexes consume more disk and more I/O during initial sync.
Detecting Index Bloat
Quick Check: Index Size vs Table Size
-- Basic ratio check
SELECT schemaname || '.' || relname AS table_name,
indexrelname AS index_name,
pg_size_pretty(pg_relation_size(indexrelid)) AS index_size,
pg_size_pretty(pg_relation_size(indrelid)) AS table_size,
round(100.0 * pg_relation_size(indexrelid) /
NULLIF(pg_relation_size(indrelid), 0), 1) AS size_ratio_pct,
idx_scan AS index_scans
FROM pg_stat_user_indexes
ORDER BY pg_relation_size(indexrelid) DESC
LIMIT 20;
This is a rough heuristic. A B-tree index is typically 5-15% of the table size for a single column. A GIN index can be larger. The important thing is not the absolute ratio but changes over time.
Detailed Bloat Estimation
The most reliable bloat estimation uses the pgstattuple extension:
CREATE EXTENSION IF NOT EXISTS pgstattuple;
-- Bloat estimate for a specific index
SELECT indexrelname,
pg_size_pretty(pg_relation_size(indexrelid)) AS index_size,
pg_size_pretty(
pgstatindex(indexrelid::regclass)::json->>'leaf_pages'::int * 8
) AS actual_data_size,
round(100.0 *
(pg_relation_size(indexrelid) -
(pgstatindex(indexrelid::regclass)::json->>'leaf_pages'::int * 8.0)) /
NULLIF(pg_relation_size(indexrelid), 0), 1
) AS estimated_bloat_pct,
pgstatindex(indexrelid::regclass)::json->>'dead_tuple_percent' AS dead_pct,
pgstatindex(indexrelid::regclass)::json->>'avg_leaf_density' AS leaf_density_pct
FROM pg_stat_user_indexes
WHERE schemaname = 'public'
ORDER BY estimated_bloat_pct DESC;
pgstattuple scans the actual index pages and reports the fraction of dead space. The avg_leaf_density shows how full leaf pages are (100% = perfect, 50% = half empty). The dead_tuple_percent shows what fraction of entries are LP_DEAD.
The Approximate Bloat Query
If you cannot install pgstattuple (shared hosting, restricted environments), this query estimates bloat from the index’s metadata and statistics:
SELECT schemaname, relname AS table_name, indexrelname AS index_name,
pg_size_pretty(pg_relation_size(i.indexrelid)) AS index_size,
round(100.0 * (
COALESCE(pg_relation_size(i.indexrelid) -
(pg_stat_user_tables.n_dead_tup::float8 *
pg_stat_user_tables.n_live_tup::float8 *
0.001), 0) /
NULLIF(pg_relation_size(i.indexrelid), 0)
), 1) AS rough_bloat_pct
FROM pg_stat_user_indexes i
JOIN pg_stat_user_tables t ON i.relid = t.relid
WHERE pg_relation_size(i.indexrelid) > 1024 * 1024 -- Only indexes > 1MB
ORDER BY rough_bloat_pct DESC;
This is a rough approximation. The pgstattuple approach is more accurate.
Tracking Bloat Over Time
Create a tracking table and poll it weekly:
CREATE TABLE index_bloat_history (
measured_at timestamptz default now(),
schemaname text,
table_name text,
index_name text,
index_size_bytes bigint,
estimated_bloat_pct numeric(5,1),
leaf_density_pct numeric(5,1),
dead_pct numeric(5,1)
);
-- Snapshot current state
INSERT INTO index_bloat_history (schemaname, table_name, index_name,
index_size_bytes, estimated_bloat_pct, leaf_density_pct, dead_pct)
SELECT schemaname, relname, indexrelname,
pg_relation_size(indexrelid),
round(100.0 * (pg_relation_size(indexrelid) -
(pgstatindex(indexrelid::regclass)::json->>'leaf_pages'::int * 8.0)) /
NULLIF(pg_relation_size(indexrelid), 0), 1),
(pgstatindex(indexrelid::regclass)::json->>'avg_leaf_density')::numeric,
(pgstatindex(indexrelid::regclass)::json->>'dead_tuple_percent')::numeric
FROM pg_stat_user_indexes
WHERE pg_relation_size(indexrelid) > 1024 * 1024;
-- Check trends
SELECT index_name,
index_size_bytes,
estimated_bloat_pct,
measured_at
FROM index_bloat_history
WHERE index_name = 'idx_your_index'
ORDER BY measured_at DESC
LIMIT 12;
Causes of Index Bloat
1. Monotonically Increasing Keys
The single biggest cause. If your primary key is a serial or bigserial column, new rows are always appended to the end of the table. New index entries always go to the rightmost leaf page. Dead entries from updates are scattered across earlier pages. The new inserts never reuse the dead space on those earlier pages.
This creates a characteristic pattern: the rightmost pages are full, the leftmost pages are sparse. The index grows indefinitely even as the table size stays stable.
2. Bulk Deletes Followed by Inserts
DELETE FROM table WHERE created_at < '2025-01-01' followed by a batch of new inserts. The deletes create dead entries across many index pages. The new inserts go to new pages. The dead space on the old pages is never reused.
3. Frequent UPDATE Without Corresponding Inserts
A table that is mostly updated but rarely receives new rows. Updates create dead index entries, but there are few new entries to reuse the space. Over time, the index accumulates dead entries faster than VACUUM can clean them.
4. Low fillfactor with Random Updates
A low fillfactor (e.g., 70) on an index reserves space for HOT updates. But if updates are not HOT (they change indexed columns), the reserved space is wasted. The pages are always 70% full even after VACUUM, and the remaining 30% is permanent bloat.
5. Page Splits from Non-Sequential Inserts
Random inserts (UUID primary keys, hash-distributed data) cause page splits across the entire index, not just at the right edge. Each split creates a new page at ~50% capacity. This is the same mechanism as bloat from sequential keys but distributed differently.
Reclaiming Space
Option 1: REINDEX
-- Exclusive lock (blocks reads and writes)
REINDEX INDEX idx_your_index;
-- Concurrent (does not block reads, allows writes but blocks DDL)
REINDEX INDEX CONCURRENTLY idx_your_index;
REINDEX CONCURRENTLY is the go-to solution for PostgreSQL 12+. It builds a new index alongside the old one, then swaps them. No ACCESS EXCLUSIVE lock. Reads and writes continue throughout. The downside: it requires approximately as much disk space as the current index (you need room for both old and new simultaneously).
Option 2: pg_repack
pg_repack is an extension that rebuilds tables and indexes with minimal locking. It works by creating a temp table, copying data, building indexes on the temp table, and then swapping the tables with a brief lock.
# Install the extension
pg_repack --table your_table your_database
# Rebuild just the indexes
pg_repack --index idx_your_index --table your_table your_database
pg_repack requires:
- The
pg_repackextension installed on the server - A primary key or NOT NULL unique index on the target table
- Sufficient disk space for the copy (roughly equal to the table + index size)
Option 3: pg_squeeze
Similar to pg_repack but uses PostgreSQL’s own logical decoding infrastructure instead of triggers. This makes it lighter on the source table during the rebuild.
-- Schedule a squeeze job
SELECT squeeze.squeeze_table(
'public', 'your_table',
/* callback */ NULL,
/* vacuum_full */ false
);
pg_squeeze is less widely deployed than pg_repack but avoids the trigger overhead that pg_repack uses.
Option 4: VACUUM FULL / CLUSTER
-- Rewrites the entire table and all indexes. ACCESS EXCLUSIVE lock.
VACUUM FULL your_table;
-- Rewrites the table ordered by an index. ACCESS EXCLUSIVE lock.
CLUSTER your_table USING idx_your_index;
These are the nuclear options. They work but lock the table completely for the duration of the operation. On a large table, that could be minutes or hours of downtime. Use them during maintenance windows, not as routine maintenance.
Comparison
| Method | Locking | Disk needed | Speed | Complexity |
|---|---|---|---|---|
| REINDEX CONCURRENTLY | None (reads/writes OK) | 1× index size | Fast | Low (built-in) |
| pg_repack | Brief lock at swap | 1× table + indexes | Medium | Medium (extension) |
| pg_squeeze | Brief lock at swap | 1× table + indexes | Medium | Medium (extension) |
| REINDEX (exclusive) | ACCESS EXCLUSIVE | 1× index size | Fast | Low |
| VACUUM FULL | ACCESS EXCLUSIVE | 1× table + indexes | Slow | Low |
| CLUSTER | ACCESS EXCLUSIVE | 1× table + indexes | Slow | Low |
Preventing Bloat
fillfactor
Setting a lower fillfactor on the index reserves space on each page for future inserts, reducing page splits. This is a tradeoff: lower fillfactor means a larger index from the start, but less bloat over time.
-- Set fillfactor on an existing index (requires rebuild)
ALTER INDEX idx_your_index SET (fillfactor = 90);
-- Then rebuild to apply
REINDEX INDEX CONCURRENTLY idx_your_index;
The default fillfactor is 100 for indexes (unlike tables, which default to 100 but are often set lower for HOT updates). For indexes on monotonically increasing keys, a fillfactor of 90-95 reduces right-edge page splits with minimal space overhead.
UUID v7 Instead of UUID v4
UUID v4 is random, causing page splits across the entire B-tree. UUID v7 is time-ordered (similar to ULID), so inserts go to the rightmost pages, reducing split frequency. This doesn’t eliminate bloat from updates, but it reduces bloat from inserts.
Frequent Enough VACUUM
Autovacuum should be configured aggressively enough to clean dead index entries before they accumulate. The autovacuum_vacuum_scale_factor and autovacuum_vacuum_threshold parameters control when VACUUM runs. For tables with high update rates, per-table tuning is essential:
-- More aggressive autovacuum for high-update tables
ALTER TABLE your_table SET (
autovacuum_vacuum_scale_factor = 0.05,
autovacuum_analyze_scale_factor = 0.02
);
TRUNCATE Instead of DELETE
If you periodically clear a table, TRUNCATE is much faster and avoids creating dead index entries:
-- This creates millions of dead index entries
DELETE FROM events WHERE created_at < now() - interval '90 days';
-- This releases all pages immediately, no dead entries
TRUNCATE events;
TRUNCATE bypasses MVCC and releases pages directly. It is the right choice when you want to clear the entire table.
Batch Updates Instead of Row-by-Row
Update rows in batches rather than one at a time. Each individual UPDATE creates a dead index entry. A batch UPDATE ... WHERE id IN (...) creates the same number of dead entries but in a more concentrated pattern that VACUUM can clean more efficiently.
Bottom-Up Deletion (PostgreSQL 14+)
Before PostgreSQL 14, VACUUM could not remove dead index entries. Only index page scans (triggered by concurrent index page accesses) could clean dead entries, and only when the page was nearly full of dead entries. This meant that on lightly accessed indexes, dead entries could persist for a long time.
PostgreSQL 14 introduced bottom-up index deletion. When VACUUM finds dead tuples in the heap, it now marks the corresponding index entries as LP_DEAD directly, without waiting for index page scans. This is a significant improvement for bloat prevention. Dead index entries are cleaned much sooner, reducing the window where bloat can accumulate.
Check how effectively this is working:
-- Check if dead tuple cleanup is keeping up
SELECT relname,
n_dead_tup,
n_live_tup,
round(100.0 * n_dead_tup / NULLIF(n_live_tup + n_dead_tup, 0), 1) AS dead_pct,
last_autovacuum,
last_autoanalyze
FROM pg_stat_user_tables
WHERE n_dead_tup > 1000
ORDER BY dead_pct DESC;
If n_dead_tup is consistently high relative to n_live_tup, autovacuum is not keeping up. Either the autovacuum thresholds are too high, the vacuum cost limit is too low, or the table is updated so heavily that even aggressive autovacuum cannot keep pace.
The Gotcha: Measuring Bloat Is Harder Than You Think
Every bloat estimation query is an approximation. The only way to know the exact bloat is to scan every page of the index (which pgstattuple does) or rebuild the index and compare sizes (which REINDEX does). Both are expensive.
The pgstattuple approach acquires an ACCESS SHARE lock (which blocks writes temporarily while scanning). On a large, heavily-used index, this can cause noticeable pauses. Run bloat checks during low-traffic periods.
The metadata-based estimation queries use n_dead_tup from pg_stat_user_tables, which is an approximation based on VACUUM’s findings, not a live count. It can be stale if VACUUM has not run recently.
Don’t obsess over exact bloat percentages. Track the ratio of index size to table size over time. If an index grows consistently while the table stays flat, it is bloating. The exact percentage matters less than the trend.
Key Takeaways
- Index bloat occurs when dead index entries (from UPDATE/DELETE) are not evenly distributed across pages, leaving wasted space that is never reused.
- The primary cause is monotonically increasing keys (serial/bigserial primary keys), where new inserts go to the rightmost pages and never reuse dead space on earlier pages.
- Bloat degrades performance by increasing I/O, wasting cache space, slowing VACUUM, and inflating backups.
- Detect bloat with
pgstattuple(accurate) or metadata queries frompg_stat_user_tables(approximate). Track trends over time, not snapshots. - Reclaim space with
REINDEX CONCURRENTLY(built-in, no locking),pg_repack(extension, minimal locking), orVACUUM FULL/CLUSTER(nuclear, full table lock). - Prevent bloat with lower
fillfactor, frequent autovacuum, UUID v7 instead of v4,TRUNCATEinstead ofDELETE, and batched updates. - PostgreSQL 14+ bottom-up deletion significantly reduces bloat by having VACUUM mark index entries as LP_DEAD directly, rather than waiting for opportunistic page scans.
What’s Next
Index bloat is the first of several maintenance deep dives. Next up is vacuum full vs vacuum vs vacuum analyze — understanding the three VACUUM variants, when each is appropriate, and why VACUUM FULL is almost never the right answer for routine maintenance.
Previous in the series: BRIN Indexes — How a 1MB Index Can Replace a 1GB B-Tree