PostgreSQL Deep Dive

PostgreSQL Deep Dive: shared_buffers vs the OS Cache — The Double-Buffering Myth

You’ve seen the advice a hundred times: “Set shared_buffers to 25% of RAM.” And you’ve also seen the complaint: “But PostgreSQL uses double buffering — data sits in shared_buffers and in the Linux page cache. What a waste!”

It sounds logical. If PostgreSQL already has 8 GB of shared_buffers, and Linux is caching another 8 GB of the same pages, isn’t half your RAM going to redundant copies?

No. And understanding why requires knowing what each cache actually does, how PostgreSQL reads and writes pages, and what happens at the boundary between the database and the kernel.

Let’s tear this apart.

The Two Caches: What Each One Does

shared_buffers — PostgreSQL’s Internal Page Cache

shared_buffers is a shared memory region (allocated via mmap by default on Linux, or System V shmget if you configure shared_memory_type = sysv). Every backend process maps this region into its address space. It’s organized as an array of 8 KB buffer pages — the same size as PostgreSQL’s on-disk page size (BLCKSZ).

When a backend needs to read a page (say, block 42 of the orders table):

  1. It hashes the page identifier (relation OID + fork number + block number) to find which buffer partition protects it.
  2. It locks the partition, checks the buffer lookup table.
  3. If the page is already in shared_buffers — cache hit — it pins the buffer and reads it.
  4. If not — cache miss — it calls smgrread(), which calls pread(), which goes to the kernel.

Step 4 is where the OS page cache enters the picture.
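The four steps above can be sketched as a toy model in Python. This is an illustration, not PostgreSQL's actual C structures — BufferManager and read_page are invented names, and pinning/locking are heavily simplified (NUM_PARTITIONS mirrors NUM_BUFFER_PARTITIONS, which is 128 in recent PostgreSQL versions):

```python
from dataclasses import dataclass
import threading

NUM_PARTITIONS = 128  # mirrors NUM_BUFFER_PARTITIONS in recent PostgreSQL

@dataclass(frozen=True)
class BufferTag:
    rel_oid: int   # relation OID
    fork: int      # fork number (main, FSM, VM, ...)
    block: int     # block number within the fork

class BufferManager:
    def __init__(self):
        # One lock per partition of the buffer lookup table (step 2)
        self.locks = [threading.Lock() for _ in range(NUM_PARTITIONS)]
        self.lookup = {}      # BufferTag -> buffer id
        self.pin_counts = {}  # buffer id -> pin count

    def read_page(self, tag: BufferTag) -> int:
        part = hash(tag) % NUM_PARTITIONS  # step 1: hash to a partition
        with self.locks[part]:             # step 2: lock, consult table
            buf = self.lookup.get(tag)
            if buf is not None:            # step 3: cache hit -> pin it
                self.pin_counts[buf] += 1
                return buf
        # step 4: cache miss -> ask the kernel (smgrread -> pread)
        buf = self._load_from_kernel(tag)
        with self.locks[part]:
            self.lookup[tag] = buf
            self.pin_counts[buf] = 1
        return buf

    def _load_from_kernel(self, tag: BufferTag) -> int:
        # Stand-in for pread(): just hand out the next buffer slot
        return len(self.lookup)
```

Reading the same page twice takes the hit path the second time and only bumps the pin count — no second trip to the kernel.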

The OS Page Cache — Linux’s General-Purpose File Cache

When PostgreSQL calls pread() to read a page from disk, the Linux kernel does the following:

  1. Checks its page cache (a per-inode radix tree — an XArray in modern kernels — indexed by file offset).
  2. If the page is cached — page cache hit — copies it into the PostgreSQL buffer (in shared_buffers).
  3. If not cached — page cache miss — issues a physical I/O read from the storage device, places the page in the page cache, then copies it into shared_buffers.

Yes, after step 2 or 3, the same 8 KB of data exists in two places: the Linux page cache and PostgreSQL’s shared_buffers. This is the “double buffering” that people complain about.

But here’s the critical insight: the page cache copy is ephemeral and will be evicted under memory pressure. Linux uses an approximate-LRU reclaim algorithm (active/inactive lists) and treats page cache pages as reclaimable. When the system needs memory for process allocations, the kernel reclaims page cache pages first — shared_buffers itself is sized at startup and never grows.

So the “double copy” exists momentarily during a read, but the OS copy doesn’t stick around unless the system has excess RAM.

Why PostgreSQL Needs Its Own Cache

If the OS already caches file data, why does PostgreSQL bother with shared_buffers at all?

1. Shared Access Across Backends

PostgreSQL uses a process-per-connection model. Each backend is a separate OS process. The OS page cache is already shared across processes (it’s indexed by inode+offset, not by PID), but PostgreSQL backends need coordination:

  • Buffer pins: “I’m reading this page, don’t evict it.”
  • Buffer content locks: “I’m modifying this page, wait.”
  • Dirty page tracking: “This page has been modified and needs to be written to WAL before it can be evicted.”

The OS page cache has no concept of these database-level protocols. shared_buffers is not just a cache — it’s a concurrency control structure.

2. Dirty Page Management and WAL Ordering

PostgreSQL writes WAL records before modifying pages (Write-Ahead Logging). A page in shared_buffers can be:

  • Clean: Matches what’s on disk.
  • Dirty: Modified but not yet written back. Must be written back after the corresponding WAL records have been flushed.

The background writer (bgwriter) and checkpointer processes manage dirty page writeback. They write dirty pages from shared_buffers to the OS via pwrite(). At that point, the OS page cache gets a copy — and the kernel may or may not flush it to physical storage immediately.

This two-stage write (shared_buffers → OS page cache → disk) is essential. PostgreSQL controls when dirty pages are written back to ensure WAL ordering. The kernel controls when those writes hit physical storage.
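The write-ordering rule — a dirty page may not be handed to the kernel before its WAL record is flushed — can be illustrated with a toy model. ToyWal, ToyBuffer, and write_back are invented names; real PostgreSQL enforces this via each page's LSN and XLogFlush():

```python
class ToyWal:
    """Toy WAL: monotonically increasing LSNs plus a flush horizon."""
    def __init__(self):
        self.next_lsn = 0
        self.flushed_lsn = -1

    def append(self, record) -> int:
        lsn = self.next_lsn
        self.next_lsn += 1
        return lsn

    def flush_up_to(self, lsn):
        self.flushed_lsn = max(self.flushed_lsn, lsn)

class ToyBuffer:
    def __init__(self):
        self.page_lsn = -1   # LSN of the last WAL record touching this page
        self.dirty = False

    def modify(self, wal: ToyWal, record):
        self.page_lsn = wal.append(record)  # WAL record is logged first...
        self.dirty = True                   # ...then the page is dirtied

def write_back(buf: ToyBuffer, wal: ToyWal, disk: list):
    # Invariant: a dirty page may reach the kernel only after its WAL
    # record is durably flushed (XLogFlush() in real PostgreSQL).
    if buf.page_lsn > wal.flushed_lsn:
        wal.flush_up_to(buf.page_lsn)
    disk.append(buf.page_lsn)  # stand-in for pwrite() into the page cache
    buf.dirty = False
```

The OS page cache knows nothing about this invariant — which is exactly why PostgreSQL must own the dirty-page tracking itself.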

3. Buffer Replacement Strategy

PostgreSQL uses a clock-sweep algorithm for buffer replacement — a variant of LRU that considers both recency and frequency of access. A “usage count” (0-5) is incremented on each access and decremented by the clock hand. Frequently-used pages resist eviction.

The Linux kernel’s page cache uses a different algorithm (active/inactive list-based, approximately LRU). It has no insight into which pages PostgreSQL will need next. A database-internal cache can make much better eviction decisions because it understands query access patterns.
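A minimal sketch of clock-sweep, assuming the usage-count cap of 5 described above (pins and the free list are omitted for brevity):

```python
MAX_USAGE = 5  # PostgreSQL's BM_MAX_USAGE_COUNT

class ClockSweep:
    """Each buffer has a usage count bumped on access and decremented
    as the clock hand passes; a buffer is evicted only when its count
    reaches zero."""
    def __init__(self, nbuffers: int):
        self.usage = [0] * nbuffers
        self.hand = 0

    def access(self, buf: int):
        self.usage[buf] = min(self.usage[buf] + 1, MAX_USAGE)

    def find_victim(self) -> int:
        while True:
            victim = self.hand
            self.hand = (self.hand + 1) % len(self.usage)
            if self.usage[victim] == 0:
                return victim        # nothing used it recently: evict
            self.usage[victim] -= 1  # second chance: decrement, move on
```

A hot buffer (usage count 5) survives five passes of the clock hand before becoming a candidate — that's the frequency component plain LRU lacks.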

4. Sequential Scan Ring Buffer

When PostgreSQL does a sequential scan of a table larger than a quarter of shared_buffers, it uses a ring buffer (256 KB by default, chosen internally by the buffer access strategy — there is no GUC for sequential scans). Instead of loading the entire table into shared_buffers (which would evict everything else), sequential scans cycle through a small buffer. This prevents a single SELECT * FROM huge_table from poisoning the cache.

The OS page cache doesn’t have this optimization — a large sequential read will fill the page cache, potentially evicting hot pages from other files.
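The ring idea is simple enough to sketch. RingStrategy is an invented name; the real buffer access strategy also has to deal with dirty pages and WAL flushing:

```python
RING_BUFFERS = 32  # 256 KB ring / 8 KB pages

class RingStrategy:
    """A sequential scan reuses a small fixed set of buffer slots
    instead of claiming fresh buffers from the main pool, so it can
    only evict its own recently read pages."""
    def __init__(self, size: int = RING_BUFFERS):
        self.slots = [None] * size  # block number currently in each slot
        self.next = 0

    def get_buffer(self, block: int) -> int:
        slot = self.next
        self.slots[slot] = block  # overwrite our own oldest page
        self.next = (self.next + 1) % len(self.slots)
        return slot
```

Scanning a million blocks touches the same handful of slots over and over; the rest of shared_buffers is untouched.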

The Write Path: Where Double Buffering Is Real (And Necessary)

When a transaction commits:

  1. WAL record is written to the WAL buffer → flushed to disk (fsync). This is the durability guarantee.
  2. The data page in shared_buffers is marked dirty. It stays dirty until either:
    • The background writer writes it back (opportunistically, to spread I/O).
    • The checkpointer forces all dirty pages to be written back (at checkpoint time).

When the background writer calls pwrite() to write a dirty page:

shared_buffers (dirty page)
    → pwrite()
    → OS page cache (now has a copy)
    → [eventually] physical disk

The data exists in both shared_buffers and the OS page cache. The OS copy is necessary because the kernel handles the actual I/O scheduling, merging of adjacent writes, and error handling. You cannot bypass the page cache on Linux unless you use O_DIRECT — and PostgreSQL deliberately does not use O_DIRECT for data files (the developer-only debug_io_direct option added in PostgreSQL 16 is an exception, for testing).

Why Not O_DIRECT?

Some databases (notably Oracle and MySQL/InnoDB with innodb_flush_method = O_DIRECT) bypass the OS page cache for data files. PostgreSQL explicitly chose not to, for several reasons:

  • Simpler crash recovery: With the page cache in the loop, the kernel handles write ordering and error recovery for physical I/O. PostgreSQL only needs to worry about WAL ordering.
  • Sequential read-ahead: The Linux kernel performs aggressive read-ahead for sequential access patterns. PostgreSQL benefits from this “for free” — the kernel prefetches pages into the page cache before PostgreSQL asks for them.
  • Portable across OSes: O_DIRECT has different semantics and alignment requirements across platforms. pread()/pwrite() through the page cache is portable and predictable.
  • The PostgreSQL developers have benchmarked it: Multiple attempts to add O_DIRECT support have not shown consistent wins. The current architecture is well-optimized for the common case.

The 25% Rule — Why Not More?

The PostgreSQL docs say:

“A reasonable starting value for shared_buffers is 25% of the memory in your system.”

And:

“It is unlikely that an allocation of more than 40% of RAM to shared_buffers will work better than a smaller amount.”

Why not 80%? Because the OS page cache is doing useful work:

  1. WAL file caching: WAL segments are read during crash recovery, replication, and archive recovery. The OS caches these files. If you give all RAM to shared_buffers, WAL reads during recovery become physical I/O.

  2. Sort spill files: When sorts and hash joins exceed work_mem (yesterday’s topic), they spill to temporary files. The OS page cache speeds up temp file I/O.

  3. Read-ahead for sequential scans: As mentioned, the kernel’s read-ahead for large sequential reads is highly effective. Starving the page cache means losing this free prefetch.

  4. OS-level file metadata: Extent maps, directory entries, inode caches — all used by the OS to speed up file operations.

  5. Other processes: Your pg_dump, pg_repack, monitoring agents, and backup tools all benefit from the page cache.

The Real Tuning: effective_cache_size

Here’s the parameter that actually matters for planner decisions: effective_cache_size. This tells the PostgreSQL planner how much total cache (shared_buffers + OS page cache) is available for holding database data. It defaults to 4 GB.

-- Check current setting
SHOW effective_cache_size;

The planner uses effective_cache_size to estimate the probability that an index page is already cached. If you have 64 GB of RAM, 16 GB of shared_buffers, and typically 30 GB of OS page cache holding PostgreSQL files:

# In postgresql.conf
shared_buffers = 16GB
effective_cache_size = 46GB  # shared_buffers + expected OS cache

This makes the planner more willing to choose index scans over sequential scans, because it correctly estimates that random I/O is likely to be served from cache.

How to Estimate OS Cache for PostgreSQL

On Linux:

# Total page cache
grep "^Cached:" /proc/meminfo

# Or more detailed:
grep -E "Cached|Buffers|Mapped" /proc/meminfo

But not all page cache is PostgreSQL data. To see how much of the page cache belongs to PostgreSQL data files:

# Use fincore (from linux-ftools) or vmtouch to check specific files
vmtouch -v /var/lib/postgresql/17/main/base/XXXXX/YYYYYY

# Or use /proc/PID/maps + mincore for the postgres process
# (Advanced — requires tracing tooling)

A rough heuristic: on a dedicated database server with 64 GB RAM, shared_buffers = 16GB, expect 20-35 GB of OS page cache to hold PostgreSQL data files (the rest goes to WAL, temp files, and other OS needs).

-- Set effective_cache_size to your best estimate
ALTER SYSTEM SET effective_cache_size = '46GB';
-- Requires reload (not restart)
SELECT pg_reload_conf();

Huge Pages: Reducing TLB Misses in shared_buffers

When shared_buffers is large (say, 16 GB), the page table entries for the shared memory mapping consume significant TLB (Translation Lookaside Buffer) entries in the CPU. Every TLB miss costs ~20-100 cycles on modern CPUs. With 16 GB of 4 KB pages, that’s 4 million page table entries — a lot of TLB pressure.

Huge pages (2 MB or 1 GB on x86-64) reduce this by 512× or 262144× respectively.
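The page-table arithmetic is easy to verify (plain arithmetic, nothing PostgreSQL-specific):

```python
GiB = 1024 ** 3
shared_buffers = 16 * GiB

# Page table entries needed to map the region at each page size
entries_4k = shared_buffers // (4 * 1024)       # 4,194,304 (~4 million)
entries_2m = shared_buffers // (2 * 1024 ** 2)  # 8,192
entries_1g = shared_buffers // GiB              # 16

reduction_2m = entries_4k // entries_2m  # 512x fewer entries
reduction_1g = entries_4k // entries_1g  # 262,144x fewer entries
```

Note that entries_2m comes out to 8,192 — the same number of 2 MB huge pages you'll see PostgreSQL request for a 16 GB shared_buffers.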

# Check how many huge pages PostgreSQL needs (before starting)
postgres -D $PGDATA -C shared_memory_size_in_huge_pages

# Example output: 8192
# That means 8192 × 2MB = 16 GB

# Allocate huge pages
sudo sysctl -w vm.nr_hugepages=8192

# Make it persistent
echo "vm.nr_hugepages = 8192" | sudo tee -a /etc/sysctl.conf

Then in postgresql.conf:

huge_pages = try    # Use huge pages, fall back to normal if unavailable
# huge_pages = on   # Fail to start if huge pages aren't available

Important: Huge pages only affect the main shared_buffers region. Dynamic shared memory (used by parallel queries) is not covered unless you configure min_dynamic_shared_memory (PG 14+) and the OS supports it.

-- Verify huge pages are in use
SHOW huge_pages_status;
-- Returns: "on" (using huge pages), "off" (not using), or "unknown"

Transparent Huge Pages (THP) — Don’t

The PostgreSQL docs explicitly warn against THP for shared_buffers. THP can cause latency spikes when the kernel transparently coalesces or splits pages — the khugepaged daemon may stall the process during defragmentation. The recommendation is:

# Disable THP for database workloads
echo never | sudo tee /sys/kernel/mm/transparent_hugepage/enabled
echo never | sudo tee /sys/kernel/mm/transparent_hugepage/defrag

Practical Diagnostic Queries

Is shared_buffers big enough?

-- Buffer cache hit ratio (should be > 99% for OLTP)
SELECT
  'index hit ratio' AS metric,
  round(
    (sum(idx_blks_hit)::float / nullif(sum(idx_blks_hit + idx_blks_read), 0) * 100)::numeric,
    2
  ) AS ratio_pct
FROM pg_statio_user_tables

UNION ALL

SELECT
  'table hit ratio' AS metric,
  round(
    (sum(heap_blks_hit)::float / nullif(sum(heap_blks_hit + heap_blks_read), 0) * 100)::numeric,
    2
  ) AS ratio_pct
FROM pg_statio_user_tables;

What’s in shared_buffers?

-- Requires pg_buffercache extension
CREATE EXTENSION IF NOT EXISTS pg_buffercache;

-- Distribution by database and object type
SELECT
  datname,
  coalesce(obj_type, 'unknown') AS obj_type,
  count(*) AS buffers,
  pg_size_pretty(count(*) * 8192) AS size
FROM (
  SELECT
    pg_database.datname,
    CASE
      WHEN pg_buffercache.reldatabase = 0 THEN 'system'
      WHEN pg_buffercache.relfilenode IS NULL THEN 'free'
      ELSE 'user'
    END AS obj_type
  FROM pg_buffercache
  LEFT JOIN pg_database ON pg_buffercache.reldatabase = pg_database.oid
) sub
GROUP BY datname, obj_type
ORDER BY count(*) DESC;

Are checkpoints causing write spikes?

-- Check pg_stat_bgwriter (PostgreSQL 16 and earlier; in 17+ the
-- checkpoint columns moved to pg_stat_checkpointer and
-- buffers_backend to pg_stat_io)
SELECT
  checkpoints_timed,
  checkpoints_req,
  round(checkpoints_req::numeric / nullif(checkpoints_timed + checkpoints_req, 0) * 100, 1)
    AS req_checkpoint_pct,
  buffers_checkpoint,
  buffers_clean,
  buffers_backend,
  round(buffers_backend::numeric / nullif(buffers_checkpoint + buffers_clean + buffers_backend, 0) * 100, 1)
    AS backend_write_pct
FROM pg_stat_bgwriter;

A high buffers_backend percentage means backends are writing their own dirty pages (because the background writer isn’t keeping up), which increases query latency. The usual response is to make the background writer more aggressive (bgwriter_lru_maxpages, bgwriter_delay) or to spread checkpoint I/O further (checkpoint_completion_target).

How much memory is PostgreSQL actually using?

# Check shared_buffers mapping
pmap -x $(pgrep -f "postgres:.*checkpointer" | head -1) | grep -i "total"

# More useful: check RSS vs shared
ps -o pid,rss,vsz,comm -p $(pgrep -f "postgres:.*checkpointer")

The Linux OOM Killer and shared_buffers

Since shared_buffers uses mmap (anonymous shared memory), it’s counted as part of each PostgreSQL process’s virtual memory. Under memory pressure, the Linux OOM killer may target PostgreSQL processes.

Protect the postmaster:

# In your PostgreSQL startup script
echo -1000 > /proc/self/oom_score_adj
export PG_OOM_ADJUST_FILE=/proc/self/oom_score_adj
export PG_OOM_ADJUST_VALUE=0

And configure strict overcommit to prevent the kernel from overallocating:

sudo sysctl -w vm.overcommit_memory=2
# Typically also adjust overcommit_ratio
sudo sysctl -w vm.overcommit_ratio=50

With vm.overcommit_memory=2, the kernel limits total address space commitments to (RAM × overcommit_ratio/100) + swap. For a 64 GB system with 50% ratio and 16 GB swap: commit limit = 48 GB. This prevents the OOM killer from being needed in the first place.
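The commit-limit arithmetic from that example, as a small sketch (commit_limit is an invented helper):

```python
GiB = 1024 ** 3

def commit_limit(ram_bytes: int, swap_bytes: int, overcommit_ratio: int) -> int:
    # vm.overcommit_memory=2: limit = RAM * overcommit_ratio/100 + swap
    return ram_bytes * overcommit_ratio // 100 + swap_bytes

limit = commit_limit(64 * GiB, 16 * GiB, 50)  # the 48 GB from the text
```

Check that shared_buffers plus the sum of per-backend work_mem budgets fits comfortably under this limit, or backends will start failing allocations instead of getting OOM-killed.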

The Real-World Impact

Here’s what happens at different shared_buffers settings on a 64 GB dedicated server:

  • Too small (4 GB shared_buffers, ~50 GB OS page cache): High hit rate in the OS cache, but eviction decisions happen mostly in the kernel, which knows nothing about query access patterns. Every logical read that misses the small shared_buffers costs a pread() syscall.
  • Sweet spot (16 GB, ~38 GB): Good buffer replacement, high cache hit. Background writer can smooth writeback. Checkpoints spread I/O well.
  • Too large (48 GB, ~6 GB): Great buffer hit but WAL recovery is slow. Sort spills hit physical disk. Read-ahead starved. Risk of OS memory pressure.
  • Absurd (60 GB, ~0 GB): Near-zero OS cache. System sluggish. Checkpoint writes are bursty. Backup/restore operations crawl.

Key Takeaways

  • “Double buffering” is a myth in the pejorative sense. The OS page cache and shared_buffers serve different purposes. PostgreSQL needs its own cache for concurrency control, WAL ordering, and intelligent eviction. The OS cache provides read-ahead, WAL caching, and temp file speedup.
  • The copies are temporary. A page in the OS cache from a pread() will be reclaimed quickly under memory pressure. It’s not permanent duplication.
  • 25% of RAM is the right starting point. Going above 40% starves the OS page cache and makes recovery, replication, and maintenance operations slower.
  • effective_cache_size matters more for planner decisions than shared_buffers does — and it costs nothing at runtime. Set it to shared_buffers + estimated OS cache for PostgreSQL files.
  • Huge pages reduce TLB pressure on large shared_buffers settings. Use huge_pages = try in production. Disable Transparent Huge Pages.
  • The OS page cache is not your enemy. It’s doing read-ahead, write coalescing, and WAL buffering that PostgreSQL would have to reimplement (poorly) if it used O_DIRECT.

What’s Next

Tomorrow we’re going deeper into the memory hierarchy: effective_cache_size — how a parameter that changes nothing about runtime behavior can completely transform your query plans. We’ll look at how the planner estimates index scan costs, why setting it too low causes sequential scan overuse, and why setting it too high can be equally dangerous.

Same time, same place.