PostgreSQL Deep Dive

PostgreSQL Deep Dive: Synchronous Commit and Durability Trade-offs

Your application inserts an order. PostgreSQL returns “success.” The application tells the customer “your order is confirmed.” Two seconds later, the database server crashes. The order is gone. The customer got a confirmation email for an order that never existed.

This is the durability problem. PostgreSQL’s default behavior guarantees this can’t happen — but that guarantee comes at a cost: every COMMIT waits for a disk flush. On a busy system with a high-commit rate, that flush can be the bottleneck. The synchronous_commit setting is your knob for trading durability against latency, and understanding exactly what each level does (and doesn’t do) is critical for making the right choice.

What Happens During COMMIT

When a transaction commits, PostgreSQL writes the commit record to the WAL (Write-Ahead Log) buffer in shared memory. At this point the data is in RAM only — a server crash would lose it. To make the commit durable, the WAL must be flushed to disk.

The flush path is:

  1. WAL record written to WAL buffer (RAM)
  2. WAL buffer flushed to OS page cache (still RAM, but in kernel buffers)
  3. OS page cache flushed to disk (actual persistent storage — the fsync call)

Step 3 is the expensive one. A typical SSD can handle ~10,000-100,000 fsync calls per second. A high-write application doing 50,000 small transactions per second will bottleneck on WAL flushes. Not on CPU, not on network, not on query planning — on waiting for the disk to confirm the write.

The synchronous_commit Levels

PostgreSQL provides five levels of durability for COMMIT, controlled by the synchronous_commit parameter:

off (Fastest, Least Durable)

SET synchronous_commit = off;

The server returns “success” to the client as soon as the commit record is written to the WAL buffer in shared memory. It does NOT wait for any disk flush — not locally, not to any standby.

Risk: If the PostgreSQL server crashes (power failure, kernel panic, hardware failure), any transactions acknowledged to the client but not yet flushed to disk are lost. The maximum window of loss is approximately 3 × wal_writer_delay (default 3 × 200ms = 600ms).

Important: This does NOT cause database corruption. The database recovers to a consistent state — it just loses the most recent committed transactions. The effect is identical to those transactions having been aborted. MVCC ensures partial transactions are never visible.

Use when: Bulk data loading, ephemeral data (session state, caches), analytics staging tables, write-heavy workloads where occasional data loss is acceptable.

local (Default with no standbys)

SET synchronous_commit = local;

Waits for the commit record to be flushed to local disk before returning “success” to the client. Does NOT wait for any standby replication.

This is the same as on when synchronous_standby_names is empty (no synchronous standbys configured).

on (Default)

SET synchronous_commit = on;

Waits for the commit record to be flushed to local disk AND for the synchronous standby(s) to confirm they’ve received the commit record and flushed it to their disk.

This is the strongest durability guarantee when synchronous replication is configured. The transaction survives unless both the primary and all synchronous standbys suffer simultaneous disk corruption.

remote_write (Middle Ground)

SET synchronous_commit = remote_write;

Waits for the commit record to be flushed to local disk AND for the synchronous standby(s) to confirm they’ve written the commit record to their OS file system (page cache). Does NOT wait for the standby to fsync to disk.

Risk: If the standby’s operating system crashes (kernel panic, power loss to the standby server), the standby may lose the commit record. If the primary then fails and the standby is promoted, the transaction is lost.

Survives: PostgreSQL process crash on the standby (data is in OS page cache, which survives process-level crashes). Does NOT survive OS-level crash on the standby.

remote_apply (Strongest)

SET synchronous_commit = remote_apply;

Waits for the commit record to be flushed to local disk AND for the synchronous standby(s) to confirm they’ve received, flushed, AND replayed (applied) the commit. The transaction is visible to queries on the standby before the COMMIT returns to the client.

This is the slowest option but provides the strongest consistency guarantee: after commit, the data is visible on both the primary and the standby. Useful for read-your-writes consistency when reads are directed to the standby.

Performance Impact: The Numbers

The latency difference between levels is significant:

SettingTypical COMMIT latencyRisk window
off0.1-0.5 msUp to 600ms of committed txns
local1-5 msNone locally
remote_write2-10 msStandby OS crash
on5-20 msPrimary + all sync standbys
remote_apply10-50 msPrimary + all sync standbys

These numbers vary enormously based on disk I/O speed, network latency to the standby, and standby replay speed. A fast NVMe primary with a standby over a 0.5ms LAN link will be at the lower end. A spinning disk primary with a standby over a WAN link will be at the upper end.

The key insight: the difference between off and on is typically 5-20× in COMMIT latency. For a workload doing 100,000 commits per second, that’s the difference between 10 seconds of total commit wait time and 100+ seconds.

Per-Transaction Control

synchronous_commit can be set at multiple levels:

-- Per transaction (most common approach)
BEGIN;
INSERT INTO orders ...;
-- This transaction is less critical, skip sync commit
SET LOCAL synchronous_commit = off;
COMMIT;

-- Per session
SET synchronous_commit = local;

-- Per database
ALTER DATABASE analytics SET synchronous_commit = off;

-- Per function
CREATE FUNCTION log_event(msg text)
RETURNS void AS $$
BEGIN
  SET LOCAL synchronous_commit = off;
  INSERT INTO event_log (message) VALUES (msg);
END;
$$ LANGUAGE plpgsql;

The SET LOCAL form is the most useful: it applies only to the current transaction and automatically resets afterward. This lets you have most transactions fully durable while choosing lower durability for specific workloads.

Common pattern: Use synchronous_commit = off for:

-- Session state / activity tracking
SET LOCAL synchronous_commit = off;
INSERT INTO user_activity (user_id, action, ts) VALUES (...);

-- Analytics event logging
SET LOCAL synchronous_commit = off;
INSERT INTO page_views (url, user_id, ts) VALUES (...);

-- Notification queues (fire-and-forget)
SET LOCAL synchronous_commit = off;
INSERT INTO notification_queue (user_id, type) VALUES (...);

synchronous_standby_names: Choosing Your Sync Partners

synchronous_commit levels beyond local only matter when synchronous_standby_names is configured. This parameter specifies which standby servers participate in synchronous replication.

First-N Syntax (Priority-Based)

-- In postgresql.conf on the primary
synchronous_standby_names = 'FIRST 1 (standby_1, standby_2)'

Transaction commits wait for the first available standby in priority order. If standby_1 is connected and streaming, it’s the synchronous standby. If it disconnects, standby_2 takes over immediately.

This is the most common configuration. You designate one primary synchronous standby and one or more failover candidates.

ANY-N Syntax (Quorum-Based)

synchronous_standby_names = 'ANY 1 (standby_1, standby_2, standby_3)'

Transaction commits wait for any N of the listed standbys to confirm. No priority ordering — any standby can fulfill the requirement.

This is useful for distributed setups where you want to survive the loss of specific standbys without explicit priority ordering. With ANY 2 (s1, s2, s3), commits wait for any two of three standbys to confirm.

Naming Convention

The standby name is the application_name from the standby’s primary_conninfo:

-- On the standby, in postgresql.conf or primary_conninfo:
primary_conninfo = 'host=primary port=5432 user=replicator application_name=standby_1'

Verify with pg_stat_replication:

SELECT
  application_name,
  state,
  sync_state,
  sync_priority,
  sent_lsn,
  write_lsn,
  flush_lsn,
  replay_lsn
FROM pg_stat_replication;

Key columns:

  • sync_state: sync (currently synchronous), potential (next in line), async (not in synchronous_standby_names)
  • sent_lsn: Last WAL position sent to this standby
  • write_lsn: Last WAL position written to standby’s OS cache (remote_write waits for this)
  • flush_lsn: Last WAL position flushed to standby’s disk (on waits for this)
  • replay_lsn: Last WAL position replayed on standby (remote_apply waits for this)

The Gotcha: synchronous_commit = off Is Not fsync = off

A common and dangerous confusion is equating synchronous_commit = off with fsync = off. They are fundamentally different:

  • synchronous_commit = off: PostgreSQL still flushes WAL to disk eventually (via the WAL writer background process). The only risk is losing the most recent committed transactions during a crash. No corruption.
  • fsync = off: PostgreSQL does not flush WAL to disk at all. A crash can leave the data files and WAL in an inconsistent state. This can cause silent data corruption that is unrecoverable.

Never set fsync = off on any system where you care about your data. It exists for benchmarking only. synchronous_commit = off is the safe way to get similar performance benefits with an understood, bounded risk.

Group Commit: Why Multiple Transactions Pay One Flush

PostgreSQL’s group commit mechanism amortizes the cost of fsync across multiple concurrent transactions. When several transactions commit simultaneously:

  1. All commit records are written to the WAL buffer
  2. A single fsync flushes all of them to disk
  3. All waiting clients are notified simultaneously

This means the effective cost per COMMIT is lower than the raw fsync latency. With 10 concurrent transactions each doing synchronous_commit = on, you might get 10 commits for the price of one fsync. The more concurrent commits, the better the amortization.

group_commit works across all synchronous_commit levels. With off, the WAL writer batches writes similarly — multiple commit records get flushed in one I/O operation.

Diagnostic Queries

1. Check current settings

SHOW synchronous_commit;
SHOW synchronous_standby_names;

2. Monitor replication sync state

SELECT
  application_name,
  state,
  sync_state,
  sync_priority,
  pg_wal_lsn_diff(sent_lsn, replay_lsn) / 1024 / 1024 AS lag_mb,
  round(
    EXTRACT(EPOCH FROM (now() - reply_time))::numeric, 2
  ) AS reply_lag_seconds
FROM pg_stat_replication
ORDER BY sync_priority;

3. Check for commit wait events

On PG16+, you can see if transactions are waiting on WAL flush:

SELECT
  pid,
  wait_event_type,
  wait_event,
  query
FROM pg_stat_activity
WHERE wait_event IN ('WalWrite', 'SyncRep');

WalWrite means waiting for local WAL flush. SyncRep means waiting for synchronous standby confirmation. If you see many SyncRep waits, your standby network or replay speed is the bottleneck.

4. Measure COMMIT latency over time

SELECT
  datname,
  xact_commit,
  xact_rollback,
  round(
    blk_write_time::numeric / NULLIF(xact_commit, 0), 3
  ) AS avg_write_time_ms
FROM pg_stat_database
WHERE datname NOT IN ('template0', 'template1')
ORDER BY avg_write_time_ms DESC;

Practical Configuration Examples

High-Performance Web Application

# postgresql.conf

# Default durability for most transactions
synchronous_commit = on

# One synchronous standby for HA
synchronous_standby_names = 'FIRST 1 (standby_1, standby_2)'
-- Application code: full durability for orders and payments
BEGIN;
INSERT INTO orders ...;
COMMIT; -- uses synchronous_commit = on (default)

-- But skip sync for analytics events
BEGIN;
SET LOCAL synchronous_commit = off;
INSERT INTO analytics_events ...;
COMMIT;

Analytics Pipeline

# postgresql.conf

# No synchronous standbys needed for analytics
synchronous_standby_names = ''

# Don't wait for local flush on bulk loads
synchronous_commit = off

# Or use local if you want crash safety without replication
# synchronous_commit = local

Financial System (Maximum Durability)

-- Application ensures data is visible on standby before confirming
BEGIN;
SET LOCAL synchronous_commit = remote_apply;
INSERT INTO transactions ...;
COMMIT;

Key Takeaways

  • synchronous_commit has five levels: off (no wait), local (local flush only), remote_write (local flush + standby OS write), on (local flush + standby disk flush), remote_apply (local flush + standby replay).
  • off does NOT cause corruption — the worst case is losing the most recent ~600ms of committed transactions on crash. The database recovers to a consistent state.
  • remote_write and remote_apply only have effect when synchronous_standby_names is configured. Without synchronous standbys, local, remote_write, and on all behave identically.
  • The performance difference between off and on is typically 5-20× in COMMIT latency. For high-commit-rate workloads, SET LOCAL synchronous_commit = off on non-critical transactions is the easiest performance win.
  • synchronous_commit = off is NOT the same as fsync = off. The former risks losing recent transactions. The latter risks silent data corruption. Never use fsync = off in production.
  • Group commit amortizes fsync cost across concurrent transactions, reducing the per-COMMIT overhead.
  • Use pg_stat_replication to monitor standby sync states and identify whether synchronous replication is causing commit delays.

What’s Next

Tomorrow we’ll look at CTE materialization — why Common Table Expressions can be optimization fences, when PostgreSQL materializes them vs inlining them, and how a simple query restructuring can unlock dramatic performance improvements.


Previous: BRIN Indexes — How a 1 MB Index Can Replace a 1 GB B-Tree