PostgreSQL Deep Dive

PostgreSQL Deep dive: WAL Buffers and checkpoint tuning — why your writes stall at the worst moment

Your application handles 5,000 writes per second, smooth as glass. Then, every five minutes, latency spikes to 2 seconds. Your monitoring shows I/O throughput triples. CPU I/O wait climbs. Then it’s gone, back to normal. Five minutes later, it happens again.

You check for locks. Nothing. You check for vacuums. Nothing. You check for a crazy query. Nothing.

The culprit? A checkpoint just fired. And your WAL configuration is making it hurt far more than it should.

This is one of the most common and most misunderstood performance problems in PostgreSQL. I’m going to walk through what’s happening inside the WAL subsystem, how checkpoints work, and how to tune them so they’re invisible.

WAL: The Write-Ahead Log in 30 Seconds

Before every data modification, PostgreSQL writes a description of the change to the Write-Ahead Log (WAL). This ensures durability: if the server crashes, PostgreSQL replays WAL from the last checkpoint to recover.

WAL records are written to in-memory WAL buffers, then flushed to WAL segment files on disk (16MB each in pg_wal/). The key functions:

  • XLogInsertRecord() — copies a WAL record into the WAL buffer. This is what your backend does when you commit a transaction.
  • XLogFlush() — ensures WAL is flushed to disk up to a given position. Called at commit time if synchronous_commit = on.

The WAL is append-only and sequential. That’s the whole point: sequential writes are fast, even on spinning disks. But the details of when and how those writes happen determine whether your system purrs or stutters.

wal_buffers: The Tiny Parameter That Matters

What It Does

wal_buffers is the amount of shared memory allocated for WAL data that hasn’t yet been written to disk. Every backend that modifies data writes WAL records into this buffer. The WAL writer process then flushes them to disk in the background.

The Default: Too Small

The default is -1, which auto-computes to 1/32 of shared_buffers, with a minimum of 64KB and a maximum of one WAL segment (16MB).

wal_buffers = shared_buffers / 32

Examples:
  shared_buffers = 128MB  →  wal_buffers = 4MB
  shared_buffers = 1GB    →  wal_buffers = 32MB → capped at 16MB
  shared_buffers = 4GB    →  wal_buffers = 16MB (auto-capped)

On a system with 4GB of shared_buffers, you get 16MB of WAL buffer. For a system doing thousands of writes per second, 16MB can fill up fast.

What Happens When WAL Buffers Fill Up

When a backend needs to write a WAL record and the WAL buffer is full:

  1. The backend blocks waiting for space.
  2. It may need to flush WAL to disk itself (an XLogFlush call) to free up buffer space.
  3. This is synchronous. The backend stalls until the flush completes.

Under heavy write load, this creates a thundering herd: many backends all trying to flush WAL simultaneously, contending on the same I/O. This is why you see latency spikes correlated with WAL buffer exhaustion.

How to Size It

The rule of thumb: set wal_buffers to match your WAL segment size (16MB), or larger for very write-heavy systems.

-- For most systems with shared_buffers >= 1GB
wal_buffers = 16MB

-- For extremely write-heavy systems
wal_buffers = 32MB

-- For systems with shared_buffers < 1GB, let autotune handle it
wal_buffers = -1  -- default

16MB is almost always enough. Going larger rarely helps because the WAL writer is flushing continuously in the background. The buffer is just a staging area, not a cache you want to be large.

When to Worry About WAL Buffers

If you see this in your logs:

LOG:  WAL buffer flush took N milliseconds

Or if pg_stat_wal (PG14+) shows:

SELECT wal_records, wal_fpi, wal_bytes,
       wal_buffers_full, wal_write, wal_sync,
       wal_write_time, wal_sync_time
FROM pg_stat_wal;

A high wal_buffers_full count means backends are stalling waiting for WAL buffer space. Increase wal_buffers.

The WAL Writer: Background Flusher

The WAL writer is a dedicated background process that wakes up periodically and flushes accumulated WAL records from the WAL buffer to disk. Its job is to keep the WAL buffer from filling up so that committing backends don’t have to flush synchronously.

Two parameters control it:

  • wal_writer_delay (default: 200ms) — how long the WAL writer sleeps between flush cycles.
  • wal_writer_flush_after (default: 1MB) — how much WAL data accumulates before the WAL writer flushes, even mid-sleep.

The workflow:

WAL Writer Loop:
  1. Sleep for wal_writer_delay (200ms)
  2. Check: is there wal_writer_flush_after (1MB) of unflushed WAL?
  3. If yes: flush WAL to disk
  4. If after flushing, there's STILL data → immediately flush again (busy loop)
  5. Go back to sleep

Under moderate load, the WAL writer handles most flushes in the background. Backends only need to call XLogFlush at commit time if the WAL writer hasn’t flushed their record yet.

Tuning the WAL Writer

For high-write systems:

wal_writer_delay = 100ms       -- flush more often
wal_writer_flush_after = 512kB -- flush smaller batches

For low-write systems, the defaults are fine. Don’t over-tune this. Making wal_writer_delay too small wastes CPU on unnecessary wake-ups.

Checkpoints: The Recovery Boundary

A checkpoint is the moment PostgreSQL guarantees that all modified data pages (dirty buffers in shared_buffers) have been written to the data files on disk. After a checkpoint:

  • All WAL records written before the checkpoint are no longer needed for crash recovery.
  • PostgreSQL can recycle or remove old WAL segment files.
  • Recovery after a crash starts from the checkpoint’s position.

Without checkpoints, WAL would grow forever and crash recovery would take hours (replaying every WAL record since server start).

Who Triggers a Checkpoint?

Several things:

  1. Time-based: Every checkpoint_timeout seconds (default: 5 minutes). This is the most common trigger.
  2. WAL volume-based: When the total WAL generated since the last checkpoint exceeds max_wal_size (default: 1GB). This is a safety valve.
  3. Manual: CHECKPOINT command (usually from pg_dump, pg_basebackup, or admin intervention).
  4. Shutdown: Smart shutdown triggers a checkpoint before stopping.

Checkpoint Internals: What Actually Happens

When a checkpoint fires:

  1. The checkpointer process records the current WAL insertion position (the “redo pointer”).
  2. It begins writing all dirty buffers from shared_buffers to the data files via pread/pwrite.
  3. It fsyncs every modified data file to ensure durability.
  4. It updates the checkpoint record in pg_control.
  5. It logs: LOG: checkpoint starting: time (or time wal).

Steps 2-3 are where the I/O spike comes from.

The Checkpoint Spike Problem

Why It Hurts

Imagine a system with 8GB of shared_buffers, steady write traffic, and checkpoint_timeout = 5min:

Minute 0: Checkpoint completes. All dirty buffers written.
Minute 1-5: Application writes modify thousands of pages in shared_buffers.
            ~2GB of dirty pages accumulate.
Minute 5: Checkpoint fires.
          Checkpointer must write 2GB of dirty pages AND fsync them all.
          I/O throughput requirement: 2GB / checkpoint_completion_target window

If the checkpoint writes all 2GB in a short burst, it hammers the storage device. Other backends trying to read pages experience I/O contention. Latency spikes.

Spread Checkpoints: The Solution

PostgreSQL uses spread checkpoints by default. Instead of writing all dirty pages at once, the checkpointer spreads the writes over most of the checkpoint interval:

checkpoint_completion_target = 0.9  -- default

This means the checkpointer targets completing 90% of the write work before the next checkpoint is expected. For a 5-minute checkpoint interval:

Checkpoint write window = 5 min × 0.9 = 4.5 minutes
Dirty pages to write: 2GB
Required write rate: 2GB / 270s ≈ 7.6 MB/s

Instead of a 2GB burst, you get a steady 7.6 MB/s trickle. Much gentler on I/O.

But What If max_wal_size Forces a Checkpoint Early?

If the WAL generated between checkpoints exceeds max_wal_size (default: 1GB), PostgreSQL forces an early checkpoint. This shortens the write window:

Checkpoint fires after only 2 minutes (because 1GB WAL generated)
Write window = 2 min × 0.9 = 108 seconds
Dirty pages: 800MB
Required write rate: 800MB / 108s ≈ 7.4 MB/s

The write rate is similar, but now checkpoints are happening every 2 minutes instead of every 5. More frequent checkpoints mean:

  • More total I/O from repeated fsyncs
  • More WAL from full page writes (more on this below)
  • Less efficient OS cache utilization

The fix: increase max_wal_size to accommodate your WAL generation rate.

Full Page Writes: The Hidden WAL Amplifier

This is the part most people miss, and it connects directly to checkpoint tuning.

The Partial Write Problem

PostgreSQL stores data in 8KB pages. Most storage devices don’t guarantee atomic 8KB writes. A power failure mid-write could leave a partially-written page (torn page). A torn page corrupts both the data page AND any WAL records that reference it.

The Fix: full_page_writes

When full_page_writes = on (default), PostgreSQL writes a complete copy of the modified page into WAL the first time that page is dirtied after a checkpoint. These are called Full Page Images (FPIs).

After checkpoint:
  First modification to page 42 of orders:
    WAL record = [Full Page Image: 8KB of page 42] + [actual change]

  Second modification to page 42:
    WAL record = [just the delta, ~100 bytes]

This means: the first wave of writes after every checkpoint generates enormous WAL records. Instead of 100-byte delta records, you’re writing 8KB full page images. That’s an 80x amplification.

The Checkpoint Frequency Connection

More frequent checkpoints → more full page images → more WAL → max_wal_size exceeded → even more frequent checkpoints. A vicious cycle.

Scenario A: Checkpoints every 5 minutes
  - Pages get one FPI every 5 minutes
  - If a page is modified 10x per minute, that's 49 delta records + 1 FPI per 5 min
  - WAL for this page: 8KB + (49 × 100B) ≈ 13KB per 5 min

Scenario B: Checkpoints every 1 minute (because max_wal_size too small)
  - Pages get one FPI every minute
  - Same page: 9 delta records + 1 FPI per minute
  - WAL for this page: 8KB + (9 × 100B) ≈ 9KB per minute = 45KB per 5 min

That's 3.5x more WAL just from FPIs!

WAL Compression

PostgreSQL 14+ offers wal_compression to reduce FPI overhead:

-- PG14-16: compress FPIs with LZ4 or pglz
wal_compression = lz4    -- recommended, fast and effective

-- PG17+: also compresses full page writes AND other WAL data
wal_compression = on     -- equivalent to lz4 in PG17

LZ4 compression typically shrinks FPIs by 60-80%, since most of a page is unchanged between writes. This directly reduces WAL volume and can break the vicious cycle of checkpoint-induced WAL amplification.

Tuning Guide: From Defaults to Production

A practical tuning approach, from defaults to production:

Step 1: Check Your Current Checkpoint Behavior

-- PostgreSQL 17+: Use the new pg_stat_checkpointer view
SELECT *
FROM pg_stat_checkpointer;

-- All versions: Check the log
-- Add to postgresql.conf:
log_checkpoints = on

-- Then look for messages like:
-- LOG:  checkpoint starting: time
-- LOG:  checkpoint complete: wrote 12345 buffers (15.3%)
--        containing 2.3GB in 4m12s, ...

The wrote N buffers line tells you how many dirty pages the checkpoint had to write. The time tells you how long the spread took.

Step 2: Determine Your WAL Generation Rate

-- PostgreSQL 14+
SELECT wal_records, wal_bytes, wal_fpi,
       wal_buffers_full,
       stats_reset
FROM pg_stat_wal;

-- Wait 10 minutes, then check again to compute rate
-- Or use pg_stat_statements for per-query WAL tracking

Calculate WAL bytes per minute. A healthy system generates a predictable amount.

Step 3: Set max_wal_size Based on WAL Rate

max_wal_size = WAL generated per checkpoint_timeout × 2 (safety margin)

Example:
  WAL generation rate: 200MB/min
  checkpoint_timeout: 5min
  Expected WAL per checkpoint: 1GB
  Recommended max_wal_size: 2GB
checkpoint_timeout = 5min       -- keep default, don't go above 10min
max_wal_size = 2GB              -- accommodate your WAL rate
min_wal_size = 512MB            -- prevents aggressive recycling for bursty workloads

Step 4: Configure WAL Buffers and Writer

wal_buffers = 16MB              -- match WAL segment size
wal_writer_delay = 200ms        -- default is fine
wal_writer_flush_after = 1MB    -- default is fine

Step 5: Enable WAL Compression

-- PostgreSQL 14+
wal_compression = lz4           -- reduces FPI size by 60-80%

-- PostgreSQL 17+
wal_compression = on            -- enables full compression support

Step 6: Tune Spread Checkpoint I/O

checkpoint_completion_target = 0.9  -- default, usually fine

-- If checkpoints still cause spikes, adjust the I/O rate:
checkpoint_flush_after = 256kB     -- write in 256kB chunks (default)

Step 7: Monitor and Iterate

-- Checkpoint frequency and duration (from logs or pg_stat_checkpointer)
-- Look for:
--   "checkpoint starting: wal" = WAL-driven checkpoint (bad: max_wal_size too small)
--   "checkpoint starting: time" = time-driven checkpoint (good)

-- WAL buffer exhaustion
SELECT wal_buffers_full FROM pg_stat_wal;
-- Should be near zero. If climbing, increase wal_buffers.

-- Checkpoint I/O impact
-- Watch iostat/dstat during checkpoints for write throughput spikes

PostgreSQL 17: New Checkpoint Observability

PG17 introduced the pg_stat_checkpointer view, a major improvement for checkpoint monitoring:

SELECT num_timed,        -- time-based checkpoints
       num_requested,    -- requested checkpoints (manual, shutdown, etc.)
       restartpoints,    -- restartpoints executed (on replicas)
       write_time,       -- time spent writing files (ms)
       sync_time,        -- time spent syncing files (ms)
       buffers_written,  -- total buffers written during checkpoints
       slack_time        -- NEW: idle time during spread checkpoints
FROM pg_stat_checkpointer;

The new slack_time column tells you how much time the checkpointer spent idle during the spread window. High slack time means the checkpointer is finishing early; low slack time means it’s struggling to keep up.

PG17 also adds improved handling for heavily-contended WAL writes and new checkpoint wait events in pg_stat_activity, making it easier to identify backends blocked by checkpoint I/O.

The Diagnostic Playbook

When you see write latency spikes every few minutes, run through this checklist:

1. Is log_checkpoints on?
   → Check if spikes correlate with "checkpoint starting" log lines

2. Are checkpoints WAL-driven or time-driven?
   → "checkpoint starting: wal" means max_wal_size is too small
   → Increase max_wal_size

3. How many buffers is each checkpoint writing?
   → "wrote N buffers" tells you the dirty page count
   → If very high, your checkpoint_timeout might be too long

4. Is wal_buffers_full climbing?
   → Backends are stalling on WAL buffer space
   → Increase wal_buffers to 16MB

5. Is wal_compression enabled?
   → FPIs are amplifying your WAL volume
   → Enable wal_compression = lz4

6. What's the checkpoint write time vs sync time?
   → High write_time: slow storage or too many dirty pages
   → High sync_time: fsync is slow (SSD write amplification, RAID controller cache)

Real-World Configuration Examples

Small System (4GB RAM, SSD)

shared_buffers = 1GB
wal_buffers = 16MB
checkpoint_timeout = 5min
max_wal_size = 1GB
min_wal_size = 256MB
checkpoint_completion_target = 0.9
wal_compression = lz4
log_checkpoints = on

Medium System (32GB RAM, NVMe)

shared_buffers = 8GB
wal_buffers = 16MB
checkpoint_timeout = 5min
max_wal_size = 4GB
min_wal_size = 1GB
checkpoint_completion_target = 0.9
wal_compression = lz4
log_checkpoints = on

Large System (128GB RAM, NVMe RAID, High Write Throughput)

shared_buffers = 32GB
wal_buffers = 16MB
checkpoint_timeout = 5min
max_wal_size = 16GB
min_wal_size = 4GB
checkpoint_completion_target = 0.9
wal_compression = lz4
checkpoint_flush_after = 512kB
log_checkpoints = on

Key Takeaways

  • Set wal_buffers to 16MB. If wal_buffers_full is climbing, backends are stalling.
  • Checkpoints are the #1 source of write latency spikes. They force dirty pages to disk. Spread checkpoints (the default) smooth this out, but misconfigured max_wal_size can force early, bursty checkpoints.
  • Full page writes amplify WAL after checkpoints. Every checkpoint resets the FPI clock, and the first write to each page after a checkpoint writes an 8KB full page image. More checkpoints = more FPIs = more WAL = more checkpoints. Break the cycle with wal_compression = lz4.
  • Set max_wal_size to match your WAL generation rate. If checkpoints are WAL-driven instead of time-driven, increase it. The rule: max_wal_size ≥ 2 × WAL_per_checkpoint_timeout.
  • Keep checkpoint_timeout = 5min. Don’t increase it past 10 minutes unless you have very slow storage that needs extra time for spread writes. Don’t decrease it, either. That increases FPI overhead.
  • Monitor with pg_stat_wal and pg_stat_checkpointer (PG17+). Look for wal_buffers_full, WAL-driven checkpoints, and sync_time spikes.
  • wal_compression = lz4 is essentially free. It reduces FPI size by 60-80% with negligible CPU cost. Turn it on for every PG14+ system.

What’s Next

Tomorrow we leave the memory tuning series and look at how PostgreSQL stores data on disk, starting with the page layout inside every table and index file: line pointers, free space, and why a page that looks 80% full might actually be out of room.


This post is part of the PostgreSQL Deep Dive series, published daily. Previous: effective_cache_size — The Parameter That Changes Nothing Yet Changes Everything.