PostgreSQL Deep Dive: Reading EXPLAIN (ANALYZE, BUFFERS, WAL) Like a Pro

You ran EXPLAIN ANALYZE on a slow query. You got 40 lines of output. You stared at it, saw “Seq Scan”, panicked, and added an index. The query got slower.

Let’s fix that. Today we are doing a deep dive into reading EXPLAIN output — not the basics (you can find those anywhere), but the parts that actually tell you what is wrong: cost estimation vs. real time, buffer I/O patterns, WAL volume, the difference between estimated and actual row counts, and the subtle traps that lead you astray.

By the end of this post, you should be able to look at an EXPLAIN plan and know: is this actually slow, and if so, where?

The Four Ways to Run EXPLAIN

-- 1. Estimated plan only (no execution)
EXPLAIN SELECT * FROM orders WHERE status = 'shipped';

-- 2. Execute and show actual times + row counts
EXPLAIN (ANALYZE) SELECT * FROM orders WHERE status = 'shipped';

-- 3. Add buffer I/O statistics (on by default with ANALYZE in PG 18+; spell it out on older versions)
EXPLAIN (ANALYZE, BUFFERS) SELECT * FROM orders WHERE status = 'shipped';

-- 4. Include WAL generation (PostgreSQL 13+)
EXPLAIN (ANALYZE, BUFFERS, WAL) UPDATE orders SET status = 'delivered' WHERE id < 1000;

Rule of thumb: always use (ANALYZE, BUFFERS, WAL) for write queries, and (ANALYZE, BUFFERS) for reads. The extra data costs virtually nothing and you will need it.

For write queries, wrap in a transaction to avoid side effects:

BEGIN;
EXPLAIN (ANALYZE, BUFFERS, WAL) UPDATE orders SET status = 'delivered' WHERE id < 1000;
ROLLBACK;

Anatomy of an EXPLAIN Line

Let’s break down a single node:

Bitmap Heap Scan on orders  (cost=5.06..224.98 rows=100 width=244) (actual time=0.030..0.450 rows=100.00 loops=1)

This one line contains:

Field                        Meaning
Bitmap Heap Scan on orders   Node type and target relation
cost=5.06..224.98            Estimated startup cost..total cost (arbitrary units)
rows=100                     Estimated rows this node will output
width=244                    Estimated average row size in bytes
actual time=0.030..0.450     Real startup time..real total time (milliseconds)
rows=100.00                  Actual rows this node output (averaged per loop)
loops=1                      Number of times this node was executed

The Cost Numbers

Costs are measured in arbitrary units where a sequential page read (seq_page_cost) = 1.0. The default cost model:

-- Show current cost parameters
SHOW seq_page_cost;       -- 1.0  (sequential page read)
SHOW random_page_cost;    -- 4.0  (random page read — default, too high for SSDs)
SHOW cpu_tuple_cost;      -- 0.01 (processing one row)
SHOW cpu_index_tuple_cost; -- 0.005 (processing one index entry)
SHOW cpu_operator_cost;   -- 0.0025 (evaluating one operator)

Critical gotcha for SSD users: The default random_page_cost = 4.0 was set for spinning disks. On SSDs, random reads are barely slower than sequential reads. Set it to 1.1 or even 1.0. If you don’t, the planner will systematically avoid index scans it should be using:

-- In postgresql.conf (a reload is enough; it can also be set per session or per tablespace)
random_page_cost = 1.1   -- for SSD/NVMe storage

Cost Calculation Example

For a sequential scan on a table with 345 pages and 10,000 rows:

cost = (pages × seq_page_cost) + (rows × cpu_tuple_cost)
     = (345 × 1.0) + (10000 × 0.01)
     = 345 + 100
     = 445.00
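The same arithmetic as a quick Python sketch (the constants mirror the GUC defaults shown above; seq_scan_cost is a hypothetical helper, not a PostgreSQL API):

```python
# Planner cost-model defaults (see SHOW seq_page_cost / SHOW cpu_tuple_cost above)
SEQ_PAGE_COST = 1.0
CPU_TUPLE_COST = 0.01

def seq_scan_cost(pages: int, rows: int) -> float:
    """Estimated total cost of a sequential scan, in planner cost units."""
    return pages * SEQ_PAGE_COST + rows * CPU_TUPLE_COST

print(seq_scan_cost(345, 10_000))  # 445.0
```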

The cost of a parent node includes all its children. So when you see:

Sort  (cost=1109.39..1134.39)
  ->  Seq Scan on orders  (cost=0.00..445.00)

The Sort node’s cost (1134.39) includes the Seq Scan’s cost (445.00) plus the cost of sorting. The incremental sort cost is roughly 1134.39 - 445.00 = 689.39.

The Most Important Number: Estimated vs. Actual Rows

If there is one thing you take away from this post, it is this:

Compare rows (estimated) with rows (actual) at every node. Big discrepancies are the root cause of most bad plans.

EXPLAIN (ANALYZE, BUFFERS)
SELECT * FROM orders
WHERE customer_id = 42 AND created_at > '2026-01-01';
Bitmap Heap Scan on orders  (cost=25.07..680.11 rows=10 width=148) (actual time=0.511..45.230 rows=85000.00 loops=1)
  Recheck Cond: (customer_id = 42)
  Filter: (created_at > '2026-01-01'::date)
  Rows Removed by Filter: 231
  Buffers: shared hit=4123 read=89
  ->  Bitmap Index Scan on idx_orders_customer  (cost=0.00..25.07 rows=100 width=0) (actual time=0.180..0.180 rows=85231.00 loops=1)
        Index Cond: (customer_id = 42)
        Buffers: shared hit=12 read=3

Look at the Bitmap Heap Scan node:

  • Estimated rows: 10 → the planner thought only 10 rows would match
  • Actual rows: 85,000 → 8,500× more rows than expected

The planner chose a bitmap scan because it thought 10 rows was cheap. But with 85K rows, a sequential scan might have been faster. The misestimate cascaded into a bad plan.
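If you script over plans regularly, the divergence is easy to quantify. A minimal Python sketch (misestimate_factor is a made-up helper name, not a PostgreSQL function):

```python
def misestimate_factor(estimated: float, actual: float) -> float:
    """How far off the planner was, as a symmetric ratio >= 1."""
    lo, hi = sorted((max(estimated, 1.0), max(actual, 1.0)))
    return hi / lo

# The Bitmap Heap Scan above: estimated 10 rows, got 85,000
print(misestimate_factor(10, 85_000))  # 8500.0
```

Anything above roughly 10× at a node that feeds a join is worth investigating.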

Why Estimates Go Wrong

PostgreSQL’s planner uses statistics collected by ANALYZE: pg_statistic entries like n_distinct, most_common_vals, and histogram_bounds. These are based on a random sample of 300 × default_statistics_target rows per table (30,000 rows at the default target of 100).

Common causes of bad estimates:

  1. Correlated columns: The planner assumes column values are independent. WHERE country = 'US' AND state = 'CA' is estimated as P(US) × P(CA), but in reality almost all CA rows are US. Fix: Extended statistics (PG 10+):
-- Create extended statistics for correlated columns
CREATE STATISTICS orders_country_state (ndistinct, dependencies)
ON country, state FROM orders;
ANALYZE orders;
  2. Skewed distributions: If 90% of orders have status = 'completed', the MCV list might not capture the skew accurately. Fix: Increase default_statistics_target for the column:
-- Increase statistics target for a specific column
ALTER TABLE orders ALTER COLUMN status SET STATISTICS 500;
ANALYZE orders;
  3. Stale statistics: If you bulk-loaded data and didn’t ANALYZE, the planner is working with old numbers. Fix:
ANALYZE orders;
-- Or for a specific column
ANALYZE orders (status);
  4. Functions and expressions: the planner often can’t estimate the selectivity of expressions like WHERE lower(email) = 'user@example.com'. Fix: an expression index (ANALYZE then gathers statistics on the indexed expression) or, in PG 14+, extended statistics on the expression:
CREATE INDEX idx_orders_lower_email ON orders (lower(email));
-- Or, PG 14+: expression statistics without creating an index
CREATE STATISTICS orders_lower_email ON (lower(email)) FROM orders;
ANALYZE orders;

Reading Buffer I/O: Where Your Time Actually Goes

The BUFFERS output tells you about data access patterns:

Buffers: shared hit=4123 read=89 dirtied=12 written=5
Counter    Meaning                                          Cost
hit        Page found in shared_buffers (no disk I/O)       Free
read       Page not in cache; read from the OS/disk         Expensive
dirtied    Page was clean and this query modified it        Moderate
written    Dirty page written out while this query ran      Expensive

The ratio of hit to read is your cache effectiveness for this query.

-- Good: most pages already cached
Buffers: shared hit=4123 read=89     → 97.9% cache hit

-- Bad: almost nothing cached
Buffers: shared hit=12 read=4123     → 0.3% cache hit

When Cache Misses Matter

A query with lots of read is slow on first execution but may be fast on second execution (pages now cached). Always run a query twice when benchmarking:

-- Cold run (buffers not cached)
EXPLAIN (ANALYZE, BUFFERS) SELECT * FROM orders WHERE customer_id = 42;

-- Warm run (buffers likely cached)
EXPLAIN (ANALYZE, BUFFERS) SELECT * FROM orders WHERE customer_id = 42;

If the warm run is 10× faster but the cold run is what your users experience (because different customers access different pages), you have a cache capacity problem — shared_buffers is too small relative to your working set.

The Double-Counting Trap

Buffer counts are cumulative — they include all child nodes. So you’ll see:

Hash Join  (cost=...)
  Buffers: shared hit=5000    ← includes children
  ->  Seq Scan on orders
        Buffers: shared hit=3000   ← just this scan
  ->  Hash
        ->  Seq Scan on customers
              Buffers: shared hit=2000  ← just this scan

The Hash Join shows 5000 hits = 3000 (orders) + 2000 (customers). Don’t double-count.
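To attribute I/O to a single node, subtract the children's cumulative counts; a sketch in Python (own_buffer_hits is a hypothetical helper):

```python
def own_buffer_hits(node_total: int, child_totals: list[int]) -> int:
    """Buffers attributable to the node itself; EXPLAIN counts are cumulative."""
    return node_total - sum(child_totals)

# The Hash Join above: 5000 total, children account for 3000 + 2000
print(own_buffer_hits(5000, [3000, 2000]))  # 0
```

A result of 0 here is expected: the join itself touched no extra heap pages.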

WAL Output (PostgreSQL 13+)

For write queries, the WAL option shows how much WAL was generated:

BEGIN;
EXPLAIN (ANALYZE, BUFFERS, WAL)
UPDATE orders SET status = 'delivered' WHERE status = 'shipped';
ROLLBACK;
Update on orders  (cost=0.00..5000.00 rows=0 width=0) (actual time=120.543..120.543 rows=0.00 loops=1)
  WAL: records=50000 fpi=0 bytes=3675000
  ->  Seq Scan on orders  (cost=0.00..5000.00 rows=50000 width=10) (actual time=0.012..45.123 rows=50000.00 loops=1)
        Filter: (status = 'shipped'::text)
        Buffers: shared hit=3450 read=120 dirtied=3450 written=120

The WAL line shows:

  • records: Number of WAL records generated
  • fpi: full page images written (these happen after a checkpoint: the first write to any page after a checkpoint logs the entire 8KB page to WAL for crash recovery)
  • bytes: Total WAL bytes

WAL volume matters for replication lag. If your write query generates 15MB of WAL, and your replica has a 5MB/s replication link, you just created 3 seconds of replication lag for a single query.
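That back-of-the-envelope calculation, spelled out in Python (the 5MB/s link speed is the hypothetical figure from the text):

```python
def replication_lag_seconds(wal_bytes: int, link_bytes_per_sec: int) -> float:
    """Lower bound on the lag one burst of WAL adds to a bandwidth-limited replica."""
    return wal_bytes / link_bytes_per_sec

print(replication_lag_seconds(15 * 1024**2, 5 * 1024**2))  # 3.0
```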

Full Page Images: The Silent WAL Amplifier

After a checkpoint, the first modification to any page writes the entire 8KB page to WAL as a Full Page Image (FPI). If your checkpoint_timeout is long (default 5 minutes), many pages accumulate their first write at different times, spreading FPIs. But with aggressive checkpoints or bulk loads, FPIs can dominate WAL volume.

-- Check how much recent WAL is full page images (pg_walinspect extension, PG 15+)
CREATE EXTENSION IF NOT EXISTS pg_walinspect;
SELECT "resource_manager/record_type" AS record_type,
       count,
       pg_size_pretty(fpi_size) AS fpi_size
FROM pg_get_wal_stats(pg_current_wal_lsn() - 16 * 1024 * 1024,  -- last 16MB of WAL
                      pg_current_wal_lsn())
ORDER BY fpi_size DESC;

If FPIs are a significant portion of your WAL, consider:

  • Increasing checkpoint_timeout (fewer checkpoints = fewer FPI triggers)
  • Increasing max_wal_size (allows more WAL to accumulate between checkpoints)
  • Enabling wal_compression (available since 9.5 with pglz; PG 15 adds lz4 and zstd) to compress FPIs

Common EXPLAIN Patterns and What They Mean

Pattern 1: The Slow Nested Loop

Nested Loop  (cost=0.29..8730.50 rows=1000 width=200) (actual time=0.05..2340.12 rows=100000.00 loops=1)
  ->  Seq Scan on large_table  (cost=0.00..1500.00 rows=1000 width=100) (actual time=0.01..200.00 rows=100000.00 loops=1)
  ->  Index Scan using idx_small on small_table  (cost=0.29..7.23 rows=1 width=100) (actual time=0.02..0.02 rows=1.00 loops=100000)
        Index Cond: (id = large_table.small_id)

The inner index scan ran 100,000 times (once per outer row). At 0.02ms each, that is 2,000ms. The planner estimated 1,000 outer rows but got 100,000 — the nested loop is 100× more expensive than expected.
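Note the arithmetic: actual time in EXPLAIN is a per-loop average, so a node's real contribution is its total time multiplied by loops. In Python (node_total_ms is a made-up name for illustration):

```python
def node_total_ms(per_loop_total_ms: float, loops: int) -> float:
    """EXPLAIN's actual time is averaged per loop; scale it up by loops."""
    return per_loop_total_ms * loops

# Inner index scan above: 0.02ms per loop, 100,000 loops
print(node_total_ms(0.02, 100_000))  # 2000.0
```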

Fix: Improve row estimates, or force a hash join if appropriate:

SET enable_nestloop = off;  -- blunt session-wide switch: use it to test, not in production
-- Better: fix the statistics causing the misestimate, then RESET enable_nestloop

Pattern 2: The Spilling Sort

Sort  (cost=50000.00..52500.00 rows=100000 width=100) (actual time=1200.00..1350.00 rows=100000.00 loops=1)
  Sort Key: created_at
  Sort Method: external merge  Disk: 9840kB
  Buffers: shared hit=3450, temp read=1230 written=1230

Sort Method: external merge means the sort spilled to disk. The temp read=1230 written=1230 confirms it wrote 1230 8kB temp buffers (1230 × 8kB = 9840kB, matching the Disk figure) and read them back.

Fix: Increase work_mem for this query:

BEGIN;
SET LOCAL work_mem = '64MB';  -- LOCAL scopes the change to this transaction
-- run the query here, then COMMIT or ROLLBACK

The rule: work_mem is per sort/hash operation, not per query. A query with 3 sorts and 2 hash joins uses up to 5 × work_mem.
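A quick worst-case bound in Python (hypothetical helper; actual usage is often lower since not every operation peaks at once, and hash joins can exceed work_mem in corner cases):

```python
def worst_case_mem_mb(work_mem_mb: int, sorts: int, hashes: int) -> int:
    """Upper-bound memory (MB) if every sort/hash uses its full work_mem."""
    return work_mem_mb * (sorts + hashes)

# 64MB work_mem, 3 sorts + 2 hash joins
print(worst_case_mem_mb(64, 3, 2))  # 320
```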

Pattern 3: The Useless Index Scan

Index Scan using idx_status on orders  (cost=0.42..12500.00 rows=950000 width=100) (actual time=0.05..980.00 rows=950000.00 loops=1)
  Index Cond: (status = 'pending'::text)
  Buffers: shared hit=45000 read=2000

An index scan returning 950K out of 1M rows (95% of the table). This is almost always slower than a sequential scan — the index adds random I/O overhead for almost no selectivity.

The planner picked the index because random_page_cost is too high (making it overestimate seq scan cost) or because statistics say this value is rare when it’s actually common.

Fix: Inspect the column’s statistics, adjust random_page_cost, or override a bad n_distinct estimate with ALTER TABLE ... ALTER COLUMN ... SET (n_distinct = ...):

-- See what the planner thinks about this column
SELECT attname, n_distinct, most_common_vals, most_common_freqs
FROM pg_stats
WHERE tablename = 'orders' AND attname = 'status';

Pattern 4: Bitmap Scan with Recheck Lossiness

Bitmap Heap Scan on orders  (cost=500.00..8000.00 rows=50000 width=100) (actual time=5.00..200.00 rows=52000.00 loops=1)
  Recheck Cond: (category_id = 42)
  Heap Blocks: lossy=5000
  ->  Bitmap Index Scan on idx_category  (cost=0.00..400.00 rows=50000 width=0) (actual time=3.00..3.00 rows=52000.00 loops=1)

Heap Blocks: lossy=5000 — the bitmap exceeded work_mem and switched from “exact” to “lossy” mode. In lossy mode, each bit in the bitmap represents an entire 8KB page rather than a specific row. The heap scan has to recheck every row on each marked page, including rows that don’t match.

Fix: Increase work_mem so the exact bitmap fits in memory, or make the index condition more selective so it produces fewer rows.

The EXPLAIN Options Cheat Sheet

-- Full diagnostic output for a read query
EXPLAIN (ANALYZE, BUFFERS, VERBOSE, SETTINGS, FORMAT TEXT)
SELECT ...;

-- For write queries (PG 13+)
EXPLAIN (ANALYZE, BUFFERS, WAL, VERBOSE, SETTINGS)
UPDATE ...;

-- Machine-readable output for tooling
EXPLAIN (ANALYZE, BUFFERS, FORMAT JSON)
SELECT ...;

-- The SETTINGS option (PG 12+) shows non-default config affecting the plan
EXPLAIN (ANALYZE, SETTINGS) SELECT ...;
-- Output includes lines like:
--   Settings: work_mem = '256MB', random_page_cost = '1.1'

Diagnostic Workflow: When a Query is Slow

  1. Run EXPLAIN (ANALYZE, BUFFERS) — get the actual plan, not estimates
  2. Find the slowest node — look at actual time, especially the total (second number)
  3. Compare estimated vs actual rows — at EVERY node, not just the slow one
  4. Check buffer hit ratio — high read count means cache misses
  5. Look for spills: Sort Method: external merge or Hash Batches: N where N > 1
  6. Check for lossy bitmaps: Heap Blocks: lossy=N
  7. Verify the plan type makes sense — seq scan on a 5-row result? Index scan on 99% of the table?
  8. Check SETTINGS — is work_mem or random_page_cost non-default?
-- Quick diagnostic wrapper
\pset pager off
EXPLAIN (ANALYZE, BUFFERS, WAL, SETTINGS, FORMAT TEXT)
<your query here>;
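Steps 2 and 3 are mechanical enough to script. A sketch that walks EXPLAIN (FORMAT JSON) output and reports the node with the largest self time (the dict keys are the real JSON plan field names; the traversal ignores parallel-worker subtleties):

```python
# Find the node with the largest self time in an EXPLAIN (FORMAT JSON) plan tree.
def self_time_ms(node: dict) -> float:
    """Node's own time: cumulative total minus children's, scaled by loops."""
    total = node["Actual Total Time"] * node["Actual Loops"]
    children = node.get("Plans", [])
    return total - sum(c["Actual Total Time"] * c["Actual Loops"] for c in children)

def slowest_node(node: dict) -> tuple:
    """Return (self_time_ms, node_type) for the most expensive node in the tree."""
    best = (self_time_ms(node), node["Node Type"])
    for child in node.get("Plans", []):
        best = max(best, slowest_node(child))
    return best

# Trimmed plan tree shaped like the spilling-sort example above
plan = {
    "Node Type": "Sort", "Actual Total Time": 1350.0, "Actual Loops": 1,
    "Plans": [
        {"Node Type": "Seq Scan", "Actual Total Time": 200.0, "Actual Loops": 1},
    ],
}
print(slowest_node(plan))  # (1150.0, 'Sort')
```

Feed it `plan_json[0]["Plan"]` from the JSON EXPLAIN output and you get step 2 of the workflow for free.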

Key Takeaways

  • Estimated vs. actual rows is the #1 diagnostic — find where they diverge, fix the statistics (extended stats, column target, ANALYZE), and the plan usually improves.
  • Costs are arbitrary units — do not compare cost numbers across different queries. Compare estimated vs. actual within the same plan.
  • Buffer hit ratio tells you if I/O is the bottleneck: shared hit is free; shared read goes to disk. A low hit ratio means your working set does not fit in shared_buffers.
  • work_mem spills are a silent killer — external merge sorts and multi-batch hash joins are 10-100× slower than in-memory equivalents. Watch for Sort Method: external merge and Batches: N > 1.
  • WAL volume matters for replication — the WAL option (PG 13+) shows bytes written. Full Page Images after checkpoints can blow up WAL volume by 10×.
  • random_page_cost = 4.0 is wrong for SSDs — set it to 1.0-1.1 if you use SSD/NVMe storage. This single change fixes many “planner won’t use my index” problems.
  • Always use BUFFERS — it costs nothing and gives you the I/O picture you need.

What’s Next

Tomorrow we will look at the scan types in detail: Sequential Scan vs. Index Scan vs. Bitmap Scan — When Postgres Chooses Each. Now that you can read EXPLAIN output, we will explain why the planner picks each scan type and how to know when it got it wrong.