PostgreSQL Deep Dive

PostgreSQL Deep Dive: Index Bloat — Why Your Index Keeps Growing Even When Your Table Doesn't

Your table has 2 million rows, same as last month. Your index has grown from 120 MB to 340 MB. You ran VACUUM. The index didn’t shrink. You ran VACUUM ANALYZE. Still 340 MB. You ran VACUUM FULL. The table shrank, the index shrank, and then two weeks later the index was back to 300 MB and climbing.

Index bloat is one of the most misunderstood aspects of PostgreSQL maintenance. It’s not a bug — it’s a structural consequence of how B-tree indexes interact with MVCC, page splits, and the order your data arrives in. Today we’ll look at exactly why indexes bloat, how to measure the actual waste (not just the raw size), and how to fix it without taking your application offline.

Why Indexes Bloat: Two Separate Mechanisms

Index bloat comes from two fundamentally different sources, and understanding the difference is critical because the fixes are different too.

Mechanism 1: Dead Entries from MVCC

When you UPDATE a row, PostgreSQL doesn’t modify the existing row in place — it creates a new version with a new ctid (physical location on disk). The old version becomes a dead tuple, eventually cleaned up by VACUUM.

But the index still has an entry pointing to the old ctid. When VACUUM removes the dead tuple from the heap, it marks the index entry as LP_DEAD (in PostgreSQL 14+, via bottom-up deletion). On subsequent index scans, these dead entries are skipped. But skipped isn’t deleted — the entry still occupies space on the index page.

Over time, pages accumulate more and more dead entries. When enough accumulate, VACUUM’s bottom-up deletion (PG14+) removes them and frees space on the page. But that freed space is only reusable for new index entries on that page. It doesn’t shrink the file. The index file only grows, never shrinks, unless you rebuild it.

This mechanism is proportional to your update rate. Tables with frequent UPDATEs on indexed columns accumulate dead index entries fastest.

Mechanism 2: Page Splits from Insert Ordering

This is the more insidious mechanism, and it affects tables with almost no UPDATEs at all.

B-tree indexes are ordered. When a new key value falls between two existing entries on a page, PostgreSQL inserts it into the appropriate position. But when a page is full and a new entry needs to go in the middle, the page must be split — roughly in half — to make room. Each split creates a new page that’s roughly 50% full.

Now consider a monotonically increasing key like a SERIAL or BIGSERIAL primary key, or a UUIDv4 for that matter. New values always go at the rightmost end of the index. The rightmost page fills up, and when it’s full, PostgreSQL allocates a new page and inserts there. No split needed. Sequential inserts into monotonic keys are efficient — pages fill to ~100% before new ones are allocated.

But what happens when you UPDATE rows with a monotonically increasing key?

-- Your table uses a serial PK
UPDATE orders SET status = 'shipped' WHERE id = 42;

The UPDATE creates a new tuple version at a different physical location in the heap. The old ctid pointed to page 3, slot 1. The new ctid points to page 7, slot 12. The index needs a new entry for (42, page 7, slot 12), but where does this entry go in the B-tree?

It goes right where the old entry was — next to entries for 41 and 43. If that page is full, it splits. The split creates a ~50% full new page in the middle of the index. Multiply this by thousands of updates, and you get a B-tree riddled with half-empty pages.

This is why tables with high UPDATE rates on monotonically-keyed columns bloat the worst. The insert pattern (sequential, filling pages completely) is perfect. The update pattern (scattered insertions throughout the index, causing splits) is the worst possible. You get the worst of both worlds.

UUIDv4 keys make this even worse. Random UUIDs don’t cluster at the rightmost edge — they scatter uniformly across the entire index. Every INSERT has a chance of landing on a full page and triggering a split. With UUIDv4, even pure INSERT workloads (no updates) produce bloated indexes.

How Bloat Impacts Performance

Bloat isn’t just wasted disk space. It has direct, measurable performance consequences:

I/O amplification. A 340 MB index with 50% bloat has the same number of logical entries as a 170 MB index, but every index scan reads twice as many pages from disk. Cache hit ratio drops because fewer useful entries fit in shared_buffers.

Slower vacuum. VACUUM must scan every index page to mark dead entries. More pages = more vacuum time = more I/O during maintenance windows.

Bigger backups. pg_dump, pg_basebackup, and WAL archiving all include the bloat. You’re backing up empty space.

Slower index-only scans. Index-only scans rely on the visibility map to skip heap fetches, but they still read every index page. Bloat doubles the pages read.

Measuring Bloat: Three Approaches

Approach 1: pgstattuple Extension

The most accurate measurement comes from the pgstattuple extension:

CREATE EXTENSION IF NOT EXISTS pgstattuple;

SELECT
  indexrelname AS index_name,
  pg_size_pretty(pg_relation_size(indexrelid)) AS index_size,
  round(avg_leaf_density * 100, 1) AS leaf_density_pct,
  round(dead_tuple_percent, 1) AS dead_pct,
  round(free_percent, 1) AS free_pct,
  pg_size_pretty(
    pg_relation_size(indexrelid) *
    (1 - (avg_leaf_density + dead_tuple_percent / 100))
  ) AS estimated_waste
FROM pgstattuple('index_name');

Key columns:

  • avg_leaf_density: What fraction of leaf page space holds live tuples. Healthy is 70-90%. Below 50% means significant bloat.
  • dead_tuple_percent: Percentage of space occupied by LP_DEAD entries. On PG14+, this should be low (bottom-up deletion). On older versions, it can be high.
  • free_percent: Empty space on pages. High free percent means many half-empty pages from splits.

Caveat: pgstattuple scans the entire index, which takes an ACCESS SHARE lock and produces real I/O. Don’t run it on a busy production index during peak hours. It’s safe on a replica.

Approach 2: pg_relation_size Ratio Heuristic

A quick approximation without extensions:

SELECT
  schemaname,
  relname AS table_name,
  indexrelname AS index_name,
  pg_size_pretty(pg_relation_size(indexrelid)) AS index_size,
  idx_scan,
  n_live_tup,
  round(
    pg_relation_size(indexrelid)::numeric /
    NULLIF(n_live_tup, 0),
    2
  ) AS bytes_per_row
FROM pg_stat_user_indexes
WHERE n_live_tup > 1000
ORDER BY bytes_per_row DESC
LIMIT 20;

This estimates the average index size per live row. Compare indexes on the same table — if one index uses dramatically more bytes per row than others, it’s likely bloated. This is imprecise (varlena types, multicolumn indexes all affect the ratio) but it’s fast and requires no special extensions.

Approach 3: Bloat Tracking Over Time

Create a tracking table and sample weekly:

CREATE TABLE index_bloat_history (
  measured_at timestamptz DEFAULT now(),
  schemaname text,
  table_name text,
  index_name text,
  index_size_bytes bigint,
  live_tuples bigint,
  bytes_per_row numeric
);

INSERT INTO index_bloat_history (schemaname, table_name, index_name, index_size_bytes, live_tuples, bytes_per_row)
SELECT
  schemaname,
  relname,
  indexrelname,
  pg_relation_size(indexrelid),
  n_live_tup,
  round(pg_relation_size(indexrelid)::numeric / NULLIF(n_live_tup, 0), 2)
FROM pg_stat_user_indexes
WHERE n_live_tup > 1000;

Query it over time to find indexes growing faster than their row count:

SELECT
  index_name,
  measured_at,
  index_size_bytes,
  live_tuples,
  bytes_per_row,
  lag(bytes_per_row) OVER (PARTITION BY index_name ORDER BY measured_at) AS prev_bytes_per_row
FROM index_bloat_history
WHERE index_name = 'your_index_here'
ORDER BY measured_at;

Fixing Bloat: From Least to Most Disruptive

Available since PostgreSQL 12, this rebuilds the index without blocking writes:

REINDEX INDEX CONCURRENTLY your_index_name;

How it works: PostgreSQL creates a new index alongside the old one, builds it from the heap data (skipping dead tuples entirely), then swaps the two in the catalog. The old index is dropped after the swap.

Pros: No write locking. Application continues normally. The new index is built from clean data with no dead entries and optimal page fill.

Cons: Takes more time than a regular REINDEX (roughly 2x). Requires extra disk space during the build (the old and new index exist simultaneously). Cannot run inside a transaction block.

Important: The new index is built from the current heap data. If the heap itself is heavily bloated, the new index will be compact but the heap remains bloated. Index bloat and heap bloat are separate problems.

Option 2: REINDEX (Non-Concurrent)

REINDEX INDEX your_index_name;

Takes an ACCESS EXCLUSIVE lock on the index’s parent table during the rebuild. Reads are blocked, writes are blocked. Faster than concurrent, but causes downtime.

Only appropriate for maintenance windows or small tables where the rebuild takes seconds.

Option 3: pg_repack Extension

An extension that reorganizes tables and indexes with minimal locking:

-- Install the extension (requires shared_preload_libraries)
-- Then:
pg_repack -d your_database -t your_table -i your_index

Works by creating a temp table, copying data, building indexes, then swapping tables. The lock is held only during the final swap (milliseconds).

Pros: Less blocking than REINDEX. Can rebuild both table and indexes simultaneously.

Cons: Requires extension installation and shared_preload_libraries (restart needed). Requires a primary key or NOT NULL unique index. More complex to set up.

Option 4: VACUUM FULL / CLUSTER (Nuclear)

VACUUM FULL your_table;
-- or
CLUSTER your_table USING your_primary_key_index;

Both rewrite the entire table and all its indexes from scratch. ACCESS EXCLUSIVE lock for the full duration. Only use when you need to reclaim heap space too, or during a maintenance window with guaranteed downtime.

Prevention: Stopping Bloat Before It Starts

fillfactor for Index Pages

The fillfactor storage parameter controls how full PostgreSQL fills pages when building or rebuilding an index. The default is 100 (pack pages completely), which maximizes space efficiency but leaves no room for inserts into the middle of a page.

For indexes on highly-updated columns, lowering fillfactor reserves space for new entries:

ALTER INDEX your_index_name SET (fillfactor = 90);
-- Then rebuild
REINDEX INDEX CONCURRENTLY your_index_name;

A fillfactor of 90 means 10% of each page is left empty for future insertions. This reduces page splits from updates because the new entry for the updated row can often fit on the same page without triggering a split.

Tradeoff: Lower fillfactor = bigger index files (more pages) but fewer splits over time. For write-heavy workloads, 85-90 is a common recommendation. For read-heavy or append-only workloads, leave it at 100.

Also applies to tables: Table-level fillfactor is separate from index-level fillfactor. Lowering the table fillfactor encourages HOT (Heap-Only Tuple) updates, which don’t modify index entries at all — the best prevention for index bloat.

ALTER TABLE your_table SET (fillfactor = 90);

UUIDv7 Instead of UUIDv4

We covered this in a previous post, but it’s worth repeating: UUIDv4’s random distribution causes page splits on every INSERT. UUIDv7’s time-ordered values cluster at the rightmost edge, producing pages that fill completely before splitting. The index size difference is typically 2x.

TRUNCATE vs DELETE for Bulk Clearing

DELETE marks each row as dead, leaving behind dead index entries that VACUUM must eventually clean. TRUNCATE physically removes all rows and their index entries instantly. For bulk clearing, always prefer TRUNCATE when you can afford to lose all rows (it requires a table lock but is instant).

Batch Updates to Concentrate Dead Space

If you’re updating millions of rows, doing it in a single transaction creates a massive spike of dead index entries spread across the entire B-tree. Batch the updates instead:

-- Instead of one massive UPDATE
-- UPDATE orders SET status = 'archived' WHERE created_at < '2025-01-01';

-- Batch it
DO $$
DECLARE
  batch_size INT := 10000;
  rows_affected INT := 1;
BEGIN
  WHILE rows_affected > 0 LOOP
    UPDATE orders SET status = 'archived'
    WHERE id IN (
      SELECT id FROM orders
      WHERE created_at < '2025-01-01' AND status != 'archived'
      LIMIT batch_size
    );

    GET DIAGNOSTICS rows_affected = ROW_COUNT;
    COMMIT;
  END LOOP;
END $$;

Each batch creates a manageable number of dead entries, and autovacuum can clean them between batches. This reduces peak bloat.

The Gotcha: Bloat Measurement Is Always Approximate

There is no exact, real-time bloat percentage for a PostgreSQL index. Here’s why:

pgstattuple is the most accurate tool, but it takes a sample at a point in time. By the time you read the results, concurrent transactions may have modified the index.

The ratio heuristic (index_size / live_tuples) is affected by column data types (wider columns = more bytes per index entry regardless of bloat), multicolumn indexes, and compression.

The real measure that matters isn’t “what percentage is wasted?” but “is this index causing slow queries?” If your index is 2x the expected size but your queries are fast and cache hit ratios are healthy, the bloat is cosmetic. If your index is 1.3x but queries are slow and the index is causing cache pressure, it’s worth rebuilding.

Practical SQL: Bloat Audit for Your Entire Database

-- Quick bloat audit using pgstattuple
-- Run this on a replica or during low-traffic hours
SELECT
  schemaname || '.' || relname AS table_name,
  indexrelname AS index_name,
  pg_size_pretty(pg_relation_size(i.indexrelid)) AS index_size,
  idx_scan AS scans_since_analyze,
  CASE
    WHEN avg_leaf_density < 0.5 THEN 'HIGH'
    WHEN avg_leaf_density < 0.7 THEN 'MEDIUM'
    ELSE 'LOW'
  END AS bloat_risk,
  round(avg_leaf_density * 100, 1) AS leaf_density_pct
FROM pgstatindex(i.indexrelid::regclass) ps
JOIN pg_stat_user_indexes i ON i.indexrelid = ps.index
WHERE avg_leaf_density < 0.7
ORDER BY avg_leaf_density ASC;

Key Takeaways

  • Index bloat comes from two sources: dead entries from MVCC updates, and half-empty pages from B-tree splits. Both waste space, but splits are worse because they also reduce cache efficiency.
  • Monotonically increasing keys (SERIAL, BIGSERIAL) with frequent UPDATEs are the worst pattern. Updates scatter new index entries throughout the tree, causing splits on pages that were optimally full.
  • UUIDv4 keys cause split-based bloat even on pure INSERT workloads because random values land on random pages.
  • pgstattuple is the most accurate measurement tool. Check avg_leaf_density — below 70% means meaningful bloat, below 50% means rebuild it.
  • REINDEX INDEX CONCURRENTLY is the standard fix. No write locking, builds a clean index from current heap data. Requires extra disk space during the build.
  • fillfactor on both tables and indexes is the primary prevention mechanism. Lower fillfactor leaves room for updates without splits. Table-level fillfactor encourages HOT updates that skip the index entirely.
  • Bloat measurement is approximate. Focus on whether bloat is causing performance problems, not on hitting an exact percentage target.

What’s Next

Tomorrow we’ll look at BRIN indexes — how a 1 MB block range index can sometimes replace a 1 GB B-tree, when physical data correlation makes them viable, and why they degrade silently when that correlation breaks down.


Previous: Deadlock Detection, log_lock_waits, and Why Your Transactions Keep Aborting