PostgreSQL Deep Dive

PostgreSQL Deep dive: WAL Format and the redo-only architecture

Your PostgreSQL server just crashed. The power went out, the kernel panicked, or someone tripped over the power cord. When it comes back up, it needs to know which transactions committed and which didn’t, and reconstruct every data page to exactly its pre-crash state.

The mechanism that makes this possible is the Write-Ahead Log (WAL), and understanding its format is the key to understanding why PostgreSQL is durable, how crash recovery works, and why some of your performance problems might be WAL-related.

The WAL Guarantee

PostgreSQL follows a single, non-negotiable rule: no data page is written to disk until the WAL record describing the change has been written first. This is the Write-Ahead Logging rule, and it’s the foundation of PostgreSQL’s durability.

When a transaction modifies a row:

  1. The modification happens in shared_buffers (memory)
  2. A WAL record describing the change is constructed
  3. The WAL record is written to the WAL buffers
  4. The WAL record is flushed to disk (fsync)
  5. Only then is the modified data page eventually written to disk (by the checkpointer or bgwriter)

If the server crashes between steps 4 and 5, the WAL record survives on disk. On recovery, PostgreSQL replays the WAL record to reconstruct the modification. The data page on disk might be stale, but the WAL record is the source of truth.

If the server crashes before step 4, the WAL record is lost. The data page on disk is also stale (or contains the pre-modification data). Either way, the transaction never committed, and the modification is correctly discarded.

WAL File Structure on Disk

The WAL lives in pg_wal/ under the data directory. It’s stored as a sequence of segment files, each 16MB by default (configurable at initdb with --wal-segsize).

Segment filenames follow a pattern: 000000010000000000000001. Breaking this down:

000000010000000000000001
│      │               │
│      │               └─ Segment number within the timeline (8 hex digits)
│      └───────────────── Logical position within the timeline (8 hex digits)
└──────────────────────── Timeline ID (8 hex digits)

Each segment is divided into pages, 8KB each (configurable at compile time with --with-wal-blocksize). A single 16MB segment contains 2,048 WAL pages.

Segment numbers increase monotonically and never wrap. At 16MB per segment, you’d need to generate over 180 petabytes of WAL before exhausting the name space. Not a practical concern.

LSN: The Universal Address

Every position in the WAL is identified by a Log Sequence Number (LSN). An LSN is a 64-bit value representing a byte offset into the WAL stream, starting from zero. It’s exposed in PostgreSQL as the pg_lsn data type.

0/16B370D8
│ │
│ └─ Byte offset within the segment (8 hex digits)
└─── Segment identifier (8 hex digits, upper 4 bits of the full 64-bit value)

LSNs are monotonically increasing. Every WAL record has a starting LSN and an ending LSN. You can calculate the volume of WAL between two LSNs by subtraction, which makes them useful for measuring replication lag and recovery progress.

You’ll see LSNs in many places:

-- Current WAL write position
SELECT pg_current_wal_lsn();

-- LSN of the last checkpoint
SELECT pg_checkpoint_location();

-- LSN of the last replayed record on a standby
SELECT pg_last_wal_receive_lsn(), pg_last_wal_replay_lsn();

WAL Record Structure

Each WAL record has a header (defined in access/xlogrecord.h in the source) followed by variable-length data. The header contains:

FieldPurpose
xl_prevLSN of the previous record (linked list)
xl_xidTransaction ID that generated this record
xl_tot_lenTotal length of this record (header + data)
xl_infoFlag bits and record type identifier
xl_rmidResource manager ID (which subsystem generated this record)
XLogRecData chainOne or more data chunks, each with length and content

The xl_prev field makes WAL records a linked list. To scan backward through the WAL, you follow xl_prev pointers. To scan forward, you just read sequentially from a starting LSN.

The xl_rmid field identifies which resource manager created the record. PostgreSQL has built-in resource managers for:

  • Heap operations (insert, update, delete, hot update, new page, cleanup)
  • B-tree operations (insert, delete, split, vacuum)
  • Sequence operations
  • Transaction management (commit, abort, prepare)
  • CLOG (transaction status log)
  • MultiXact (multi-transaction status)
  • Relocation (file creation, truncation, copy)
  • Standby (running transactions on hot standby)

Each resource manager knows how to interpret its own record data. During recovery, PostgreSQL reads each record, checks the xl_rmid, and dispatches to the appropriate resource manager’s redo function.

Full Page Writes: The Torn Page Protection

Full Page Writes (FPWs) solve one of WAL’s hardest problems. When full_page_writes is enabled (the default), the first modification to any data page after a checkpoint causes the entire page content to be written to the WAL, not just the changed bytes.

Why? Because of the torn page problem.

A data page is 8KB. An 8KB write to disk is not guaranteed to be atomic. If the power fails halfway through writing 8KB, you might end up with a page that has the first 4KB of the new version and the last 4KB of the old version. This is a torn page: a corrupted mix of two different states.

WAL records that describe individual tuple changes are not sufficient to recover from a torn page, because the WAL only records what changed, not what the page looked like before. If the page on disk is torn, replaying the WAL on top of it produces garbage.

The solution: after a checkpoint, when you modify a page for the first time, write the entire page to the WAL. During recovery, if you encounter a torn page, you can reconstruct it from the full-page image in the WAL. The subsequent WAL records for that page can then be replayed on top of the restored page.

Every page modified for the first time after a checkpoint generates an 8KB FPW instead of a small delta record, which spikes WAL volume immediately after a checkpoint on a busy system with many modified pages.

In PostgreSQL 13+, WAL compression (wal_compression) compresses FPWs using LZ4 (or pglz on older versions). This reduces the WAL amplification from FPWs at the cost of some CPU time during compression and decompression.

Redo-Only: No Undo Records

This is a deliberate design choice that separates PostgreSQL from databases like Oracle and SQL Server.

PostgreSQL’s WAL is redo-only. It records what happened (the forward change), not how to undo it. There are no undo records in the WAL.

How does PostgreSQL handle rollback? Through MVCC.

When you UPDATE a row, PostgreSQL doesn’t modify the existing tuple in place. It creates a new tuple version with a new xmin (the transaction’s ID) and marks the old version with an xmax (also the transaction’s ID). Both versions exist on the data page simultaneously.

If the transaction commits, both versions stay. The old version is dead (xmax < snapshot) and will be cleaned up by vacuum. The new version is live.

If the transaction rolls back, PostgreSQL doesn’t need to undo anything. It marks the transaction as aborted in the CLOG (a small shared data structure that tracks transaction status). During any future query, the aborted transaction’s tuple versions are ignored because the CLOG says the transaction didn’t commit.

The data page itself doesn’t need to be modified during rollback. The old tuple version is still there (it was never overwritten), and the new version is simply ignored because its xmin belongs to an aborted transaction.

The implications:

  • Rollback is fast. Aborting a transaction that modified a million rows requires writing one CLOG entry, not a million undo records.
  • Rollback doesn’t generate WAL. Since no data pages are modified during rollback, no WAL records are written. This is a significant advantage over undo-based systems during heavy contention.
  • Crash recovery only replays forward. PostgreSQL starts from the last checkpoint LSN and scans the WAL forward, redoing all committed changes. It never needs to undo anything because:
    • Committed changes are redone from WAL
    • Uncommitted changes never made it to data pages (the WAL-first rule ensures this)
    • Uncommitted changes that did leak to data pages (via the bgwriter) are harmless because their tuple versions have an xmin belonging to an uncommitted transaction, which CLOG marks as aborted
  • There’s no undo tablespace. Oracle has an undo tablespace. SQL Server has a transaction log that supports rollback. PostgreSQL has neither. The storage overhead is zero.
  • MVCC bloat is the trade-off. Since old tuple versions are kept on the data page until vacuum removes them, tables with frequent updates accumulate dead tuples. This is the price of redo-only WAL: you trade storage space for simpler recovery and faster rollbacks.

Crash Recovery: The Redo Process

When PostgreSQL starts after a crash, it performs these steps:

  1. Read pg_control to find the LSN of the last checkpoint record
  2. Read the checkpoint record from the WAL to get the redo point
  3. Scan the WAL forward from the redo point to the end of the WAL
  4. For each record: check if the transaction committed (via CLOG). If committed, redo the change. If not committed, skip it.
  5. Complete: once all committed WAL records have been replayed, recovery is done

The redo process is single-threaded and must complete before the server accepts connections. Recovery time is proportional to the amount of WAL generated since the last checkpoint, which is why frequent checkpoints help and max_wal_size matters (it caps WAL volume between checkpoints).

During recovery, PostgreSQL replays changes in exactly the same way the original operation modified the data: it calls the same resource manager’s redo function with the same record data. The recovery code path is the execution code path. This ensures that recovery produces exactly the same result as the original operation.

WAL and Replication: Streaming Replication Basics

WAL isn’t just for crash recovery. It’s also the mechanism for streaming replication.

A standby server connects to the primary and requests WAL records starting from a specific LSN. The primary ships WAL records as they’re written. The standby writes them to its own pg_wal/ and then replays them using the same redo process as crash recovery.

Streaming replication is called “physical replication” because the standby applies the exact same WAL records that were applied on the primary, producing an identical copy of the data pages.

The LSN gap between the primary’s latest WAL position and the standby’s replay position is the replication lag. You can measure it:

-- On the primary
SELECT
    sent_lsn,
    write_lsn,
    flush_lsn,
    replay_lsn,
    pg_wal_lsn_diff(sent_lsn, replay_lsn) AS lag_bytes
FROM pg_stat_replication;

-- On the standby
SELECT
    pg_last_wal_receive_lsn() AS received,
    pg_last_wal_replay_lsn() AS replayed,
    pg_wal_lsn_diff(
        pg_last_wal_receive_lsn(),
        pg_last_wal_replay_lsn()
    ) AS replay_lag_bytes;

Practical SQL: WAL Monitoring

Current WAL position and generation rate

SELECT
    pg_current_wal_lsn() AS current_lsn,
    pg_wal_lsn_diff(pg_current_wal_lsn(), '0/0') AS total_bytes,
    pg_size_pretty(
        pg_wal_lsn_diff(pg_current_wal_lsn(), '0/0')
    ) AS total_wal_size;

WAL generation rate over time

WITH wal_stats AS (
    SELECT
        pg_current_wal_lsn() AS current_lsn,
        (SELECT pg_stat_wal.wal_write FROM pg_stat_wal) AS wal_written
)
SELECT
    pg_size_pretty(wal_written) AS total_written,
    (SELECT count(*) FROM pg_ls_waldir()) AS wal_files,
    pg_size_pretty(
        (SELECT sum(size) FROM pg_ls_waldir())
    ) AS wal_directory_size
FROM wal_stats;

Checkpoint and WAL statistics

SELECT * FROM pg_stat_wal;

This view (available since PostgreSQL 14) shows:

  • wal_records: total WAL records generated
  • wal_fpi: total full page images written
  • wal_bytes: total WAL bytes generated
  • wal_buffers_full: how many times WAL buffer space was exhausted (causing backend stalls)
  • wal_write_time: time spent writing WAL
  • wal_sync_time: time spent syncing WAL to disk

A high wal_buffers_full count indicates your WAL buffers are too small, causing backends to stall waiting for WAL writes.

Replication lag in human-readable form

SELECT
    client_addr,
    state,
    sent_lsn,
    replay_lsn,
    pg_wal_lsn_diff(sent_lsn, replay_lsn) AS lag_bytes,
    (extract(epoch FROM (now() - reply_time)))::int AS lag_seconds
FROM pg_stat_replication;

pg_waldump: Reading WAL records directly

PostgreSQL ships with pg_waldump, a tool that reads and decodes WAL files:

# Dump all records from a specific segment
pg_waldump pg_wal/000000010000000000000001

# Dump records from a specific LSN range
pg_waldump --start=0/16B370D8 --end=0/16B380D8 pg_wal/...

# Filter by resource manager (rmgr)
pg_waldump --rmgr=Heap pg_wal/...

# Filter by transaction
pg_waldump --xid=12345 pg_wal/...

Useful for diagnosing what happened before a crash, understanding what’s generating WAL volume, or debugging replication issues.

Things to Remember

  • The WAL is a sequential, append-only log stored in 16MB segments under pg_wal/. Every modification to data pages is logged before the page is written to disk.
  • Every position in the WAL is identified by an LSN (Log Sequence Number), a 64-bit byte offset. LSNs are used for checkpointing, replication lag measurement, and recovery progress tracking.
  • WAL records contain a linked-list header (xl_prev), transaction ID, resource manager ID, and variable-length data specific to the operation type.
  • Full Page Writes (FPWs) log the entire 8KB data page on first modification after a checkpoint to protect against torn pages. This causes WAL amplification after checkpoints.
  • PostgreSQL is redo-only: no undo records. Rollback works through MVCC (marking the transaction as aborted in CLOG) and leaving tuple versions in place. This makes rollback fast and eliminates undo tablespaces.
  • Crash recovery scans forward from the last checkpoint, replaying all committed WAL records. Recovery time is proportional to WAL generated since the last checkpoint.
  • pg_stat_wal (PG14+) and pg_waldump are your primary diagnostic tools for WAL-related issues. Monitor wal_buffers_full for buffer exhaustion and wal_fpi for full page write volume.

What’s Next

We’ve covered the storage layer in depth: pages, TOAST, FSM, VM, and WAL. Tomorrow we shift to a topic that connects storage to concurrency: row-level locks. What actually happens when you run SELECT FOR UPDATE, how does PostgreSQL track which rows are locked, and why your UPDATEs might be waiting on a lock you didn’t even know existed.


Previous: The Visibility Map