PostgreSQL Deep Dive: work_mem — What It Actually Controls (And Why Your Sorts Keep Spilling to Disk)

You set work_mem to 256MB. Your query still spills to disk. What happened?

This is one of the most misunderstood settings in PostgreSQL. The name suggests a single memory allocation per query, but the reality is far more nuanced — and getting it wrong can mean the difference between a query that runs in 50ms and one that writes gigabytes of temp files.

Let’s dig into what work_mem actually controls, how it interacts with hash operations, and why blindly increasing it is a recipe for OOM kills.

The Mental Model Most People Have

work_mem = 4MB (default)

“That means each query gets 4MB of memory for sorting and hashing.”

Wrong.

Here is what the PostgreSQL documentation actually says:

A complex query might perform several sort and hash operations at the same time, with each operation generally being allowed to use as much memory as this value specifies before it starts to write data into temporary files. Also, several running sessions could be doing such operations concurrently.

Two critical implications:

  1. Per-operation, not per-query. A single query with ORDER BY clauses in three subqueries, a hash join, and a GROUP BY can have five or more separate work_mem allocations active simultaneously.
  2. Per-session concurrency. With 100 active connections all running complex queries, theoretical memory use is 100 × operations_per_query × work_mem.

What Actually Uses work_mem

Not everything does. Here is the definitive list:

Sorts

Every Sort node in a query plan gets its own work_mem budget. Sorts are used for:

  • ORDER BY clauses
  • DISTINCT (implemented as sort + unique)
  • Merge Join inputs that are not already in order
  • Window functions with ORDER BY
  • UNION (deduplicates via sort)
  • EXCEPT / INTERSECT

When a sort exceeds work_mem, PostgreSQL switches to an external merge sort using temporary files. You will see this in EXPLAIN ANALYZE:

EXPLAIN (ANALYZE, BUFFERS) SELECT * FROM orders ORDER BY total_amount;

-- Sort  (cost=... rows=1000000 ...)
--   Sort Method: external merge  Disk: 26MB
--   ->  Seq Scan on orders

The “Disk: 26MB” line tells you the sort spilled. The in-memory sort method would show:

--   Sort Method: quicksort  Memory: 74kB

Hash Tables

Hash joins, hash aggregation (GROUP BY, DISTINCT), Memoize nodes, and hash-based IN subquery processing all build hash tables that consume work_mem.

But here is the twist: hash operations get more memory than sorts. Since PostgreSQL 13, the actual limit for hash-based operations is:

hash_mem_multiplier × work_mem

The default hash_mem_multiplier is 2.0 (since PostgreSQL 15; it was 1.0 in versions 13 and 14). So with the default work_mem of 4MB, hash operations can use up to 8MB before spilling.
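As a sanity check, the two spill thresholds can be expressed as a small sketch (plain Python with illustrative values; nothing here reads settings from a live server):

```python
# Illustrative sketch of the spill thresholds described above.
# A sort node may use up to work_mem; a hash-based node may use
# hash_mem_multiplier x work_mem before it starts batching to disk.

def sort_limit_kb(work_mem_kb: int) -> int:
    """Memory a single sort node may use before spilling."""
    return work_mem_kb

def hash_limit_kb(work_mem_kb: int, hash_mem_multiplier: float = 2.0) -> int:
    """Memory a single hash node may use before spilling."""
    return int(work_mem_kb * hash_mem_multiplier)

# With the defaults (work_mem = 4MB, hash_mem_multiplier = 2.0):
print(sort_limit_kb(4096))   # 4096 kB -> 4MB for sorts
print(hash_limit_kb(4096))   # 8192 kB -> 8MB for hash tables
```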

Why? Hash operations are more sensitive to memory constraints than sorts. A sort that spills to disk can still operate relatively efficiently with an external merge. A hash table that spills has to split into batches, write them to disk, and read them back — a much more painful process.

When a hash table spills, you will see batches in EXPLAIN ANALYZE:

EXPLAIN (ANALYZE, BUFFERS)
SELECT o.customer_id, SUM(o.total_amount)
FROM orders o
GROUP BY o.customer_id;

-- HashAggregate  (cost=... rows=50000 ...)
--   Group Key: customer_id
--   Batches: 5  Memory Usage: 8192kB
--   ->  Seq Scan on orders

Batches: 1 means no spilling. Batches: 5 means the hash table was split into 5 batches, 4 of which were written to temp files and read back.

Materialization

Materialize nodes (used in subqueries and nested loop inner sides) can use work_mem to buffer tuples in memory before falling back to spilling to a tuplestore on disk.

What Does NOT Use work_mem

  • shared_buffers — that is its own separate setting (shared memory, not per-session)
  • Query parsing, planning, or execution framework overhead
  • Result tuple storage sent to the client
  • Temporary tables — those use temp_buffers instead

The Math That Should Terrify You

Consider a “moderately complex” analytical query:

SELECT region, product_category, COUNT(*), AVG(amount)
FROM (
    SELECT o.*, r.region, p.category
    FROM orders o
    JOIN customers c ON o.customer_id = c.id
    JOIN regions r ON c.region_id = r.id
    JOIN products p ON o.product_id = p.id
    WHERE o.order_date >= '2025-01-01'
    ORDER BY o.order_date
) sub
GROUP BY region, product_category
ORDER BY region, product_category;

This query plan might contain:

Node            Operation                  work_mem Allocation
Hash Join #1    Build hash on customers    1 × hash_mem_multiplier × work_mem
Hash Join #2    Build hash on regions      1 × hash_mem_multiplier × work_mem
Hash Join #3    Build hash on products     1 × hash_mem_multiplier × work_mem
Sort            ORDER BY in subquery       1 × work_mem
HashAggregate   GROUP BY                   1 × hash_mem_multiplier × work_mem
Sort            Final ORDER BY             1 × work_mem

With work_mem = 256MB and hash_mem_multiplier = 2.0:

Hash operations: 4 × 512MB = 2,048MB
Sort operations: 2 × 256MB =    512MB
                                ------
Total per query:              2,560MB

2.5 GB per query. Now multiply by 20 concurrent sessions. That is 50 GB of memory just for work_mem allocations. On a machine with 32 GB RAM.
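The worst-case arithmetic above is easy to reproduce (an illustrative Python sketch using the figures from the text, not anything measured on a real server):

```python
# Worst-case work_mem exposure for the example plan above:
# 4 hash operations and 2 sorts, work_mem = 256MB, multiplier = 2.0.

WORK_MEM_MB = 256
HASH_MEM_MULTIPLIER = 2.0

hash_ops, sort_ops = 4, 2
per_query_mb = (hash_ops * HASH_MEM_MULTIPLIER * WORK_MEM_MB
                + sort_ops * WORK_MEM_MB)
print(per_query_mb)            # 2560.0 MB -> 2.5 GB per query

sessions = 20
print(per_query_mb * sessions) # 51200.0 MB -> ~50 GB across 20 sessions
```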

This is why the default is 4MB. Not because the PostgreSQL developers hate performance, but because the default needs to be safe for any workload.

How to Choose work_mem

The Formula

A reasonable upper bound:

work_mem = (Total RAM - shared_buffers) / (max_connections × average_operations_per_query)

On a 64 GB server with shared_buffers = 16GB, max_connections = 200, and an estimated average of 3 operations per query:

work_mem = (64GB - 16GB) / (200 × 3) = 48GB / 600 ≈ 80MB

This is a rough guideline. Real-world workloads vary enormously.

Connection Pooling Changes Everything

If you use PgBouncer or another connection pooler with, say, 20 actual database connections but 200 application connections, the formula changes:

work_mem = (64GB - 16GB) / (20 × 3) ≈ 800MB

With connection pooling, you can safely use a much larger work_mem because the number of concurrent in-flight queries is bounded by actual database connections, not application connections.
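Both sizing calculations can be sketched with a small helper (suggest_work_mem_mb is a hypothetical name, not a PostgreSQL function; the inputs are the example values from the text):

```python
# Sketch of the sizing formula above:
# work_mem = (total RAM - shared_buffers) / (connections x avg ops per query)

def suggest_work_mem_mb(total_ram_gb: float, shared_buffers_gb: float,
                        connections: int, avg_ops_per_query: int) -> float:
    """Rough upper bound on work_mem, in MB."""
    available_mb = (total_ram_gb - shared_buffers_gb) * 1024
    return available_mb / (connections * avg_ops_per_query)

# 64 GB RAM, 16 GB shared_buffers, 200 direct connections, 3 ops/query
print(suggest_work_mem_mb(64, 16, 200, 3))  # 81.92 -> the ~80MB above

# Behind PgBouncer with only 20 real database connections
print(suggest_work_mem_mb(64, 16, 20, 3))   # 819.2 -> the ~800MB above
```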

Per-Query Tuning

You do not have to set work_mem globally. You can set it per-session, per-transaction, or even per-query:

-- For a single analytical query that needs a big sort
BEGIN;
SET LOCAL work_mem = '1GB';

SELECT * FROM massive_table
ORDER BY complex_expression
LIMIT 100;

COMMIT;
-- SET LOCAL only takes effect inside a transaction block,
-- and the setting reverts automatically when the transaction ends

This is the best practice: keep the global work_mem conservative (4-32MB), and bump it up for specific analytical queries.

Finding Spills

Check EXPLAIN ANALYZE

The most direct method. Look for external merge (sorts) or Batches: N > 1 (hash):

EXPLAIN (ANALYZE, BUFFERS, SETTINGS)
SELECT customer_id, COUNT(*)
FROM orders
GROUP BY customer_id;

Monitor log_temp_files

Set this in postgresql.conf:

log_temp_files = 0

This logs every temp file creation, including its size. A value of 0 logs all spills, -1 (the default) disables logging, and a positive threshold such as 10MB logs only temp files at least that large.

Log output looks like:

LOG:  temporary file: path "base/pgsql_tmp/pgsql_tmp12345.0", size 26214400
STATEMENT:  SELECT * FROM orders ORDER BY total_amount;

Query pg_stat_statements

pg_stat_statements tracks temp file usage per query via temp_blks_written and temp_blks_read (the total_exec_time column shown here is the PostgreSQL 13+ name; older versions call it total_time):

SELECT query,
       calls,
       total_exec_time,
       temp_blks_written,
       temp_blks_read
FROM pg_stat_statements
WHERE temp_blks_written > 0
ORDER BY temp_blks_written DESC
LIMIT 10;

This tells you which queries spill the most. Each temp block is 8KB, so temp_blks_written = 12800 means about 100MB of temp file writes.
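The block-to-bytes conversion is straightforward (a Python sketch assuming the default 8kB PostgreSQL block size):

```python
# Convert pg_stat_statements.temp_blks_written into bytes.
# Each temp block is 8 kB with the default PostgreSQL block size.

BLOCK_SIZE_BYTES = 8 * 1024

def temp_bytes_written(temp_blks_written: int) -> int:
    """Total bytes spilled to temp files for a statement."""
    return temp_blks_written * BLOCK_SIZE_BYTES

mb = temp_bytes_written(12800) / (1024 * 1024)
print(mb)  # 100.0 -> the ~100MB figure quoted above
```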

hash_mem_multiplier: The Unsung Hero

Introduced in PostgreSQL 13, hash_mem_multiplier decouples hash table limits from sort limits. Before PG13, hash tables and sorts shared the same work_mem limit. Since hash tables are more sensitive to spilling, this often led to suboptimal behavior:

  • Set work_mem high enough for hash tables → sorts waste memory
  • Set work_mem low enough for sorts → hash tables spill

Now you can tune them independently:

work_mem = 32MB              # For sorts
hash_mem_multiplier = 2.0    # Hash tables get 64MB

The default of 2.0 is well-chosen for most workloads. For heavy hash-aggregation workloads (lots of GROUP BY on high-cardinality columns), values up to 4.0-8.0 can help without inflating sort memory.

The Temporary File Landscape

When operations spill, where do the temp files go?

show temp_tablespaces;
-- <empty> means default tablespace

show data_directory;
-- /var/lib/postgresql/17/main

Temp files go to base/pgsql_tmp/ inside the data directory by default. On a busy system with lots of spilling, this directory can grow to many gigabytes.

You can redirect temp files to a separate disk:

CREATE TABLESPACE fast_temp LOCATION '/mnt/fast-ssd/pg-temp';
-- Then set for specific operations or globally:
SET temp_tablespaces = 'fast_temp';

This is particularly useful for analytical queries where temp files are expected — keep them off the same disk as your data and WAL.

Real-World Case Study: The “Work_mem Is Fine” Mystery

A team noticed their reporting queries were slow. They checked work_mem:

SHOW work_mem;
-- 64MB

“That should be plenty.” But the query was still spilling. The EXPLAIN (ANALYZE) told the real story:

Hash Join  (cost=... rows=5000000 ...)
  Hash Cond: (o.product_id = p.id)
  ->  Hash Join  (cost=... rows=5000000 ...)
        Hash Cond: (o.customer_id = c.id)
        ->  Hash Join  (cost=... rows=5000000 ...)
              Hash Cond: (o.region_id = r.id)
              ->  Seq Scan on orders  (rows=50000000)
              ->  Hash  (rows=50)
                    Buckets: 1024  Batches: 1  Memory Usage: 2048kB
        ->  Hash  (rows=100000)
              Buckets: 131072  Batches: 1  Memory Usage: 14336kB
  ->  Hash  (rows=50000)
        Buckets: 65536  Batches: 5  Memory Usage: 65536kB

The third hash table (products, 50K rows) was spilling to 5 batches. The customer hash was using 14MB of 64MB. The product hash hit the limit.

The fix was not to increase global work_mem. It was:

BEGIN;
SET LOCAL work_mem = '256MB';
-- ... the reporting query ...
COMMIT;

Just for this query. The hash table fit in memory, temp file writes dropped to zero, and the query went from 45 seconds to 8 seconds.

Tuning Checklist

  1. Start conservative. work_mem = 4MB (default) or 16MB for OLTP. Do not blindly set to 1GB.

  2. Find spills. Use log_temp_files = 0 and pg_stat_statements to identify which queries actually spill.

  3. Count operations. Look at EXPLAIN plans for the spilling queries. How many Sort and Hash nodes does each have?

  4. Calculate real usage. work_mem × operations_per_query × concurrent_queries. Does it fit in RAM?

  5. Use per-query tuning. SET LOCAL work_mem = 'X' inside transactions for analytical queries. Keep the global setting low.

  6. Use hash_mem_multiplier (PG13+). Default 2.0 is good. Increase to 4.0 for heavy GROUP BY workloads.

  7. Monitor OOM. If you see out of memory errors in the logs, your work_mem is too high for your concurrency level.

  8. Put temp files on fast storage. If spilling is unavoidable, at least make it fast with temp_tablespaces.

Key Takeaways

  • work_mem is a per-operation limit, not per-query. A single query can consume N × work_mem where N is the number of sort and hash operations in the plan.
  • Hash operations get more memory via hash_mem_multiplier (default 2.0× work_mem). This is because hash table spills are more expensive than sort spills.
  • The default 4MB is intentionally conservative to prevent OOM under concurrent load. The formula (total_RAM − shared_buffers) / (max_connections × avg_ops_per_query) gives you a safe upper bound.
  • Per-query SET LOCAL work_mem is the best practice — keep global low, bump it for specific analytical queries.
  • log_temp_files = 0 and pg_stat_statements.temp_blks_written are your primary tools for detecting spills.
  • Spilling is not always bad. If a query runs once a day and finishes in acceptable time, chasing zero spills is not worth the memory pressure.

What’s Next

Tomorrow we tackle another memory tuning classic: shared_buffers vs the OS page cache — the double-buffering myth. Is PostgreSQL really caching everything twice? Should you set shared_buffers to 25% of RAM, or is that cargo-cult tuning? We will look at what shared_buffers actually does, how the OS cache complements it, and why the answer depends on your workload.


This is Day 11 of the PostgreSQL Deep Dive series. Catch up on all posts here.