PostgreSQL Deep Dive: work_mem — What It Actually Controls (And Why Your Sorts Keep Spilling to Disk)
You set work_mem to 256MB. Your query still spills to disk. What happened?
This is one of the most misunderstood settings in PostgreSQL. The name suggests a single memory allocation per query, but the reality is far more nuanced — and getting it wrong can mean the difference between a query that runs in 50ms and one that writes gigabytes of temp files.
Let’s dig into what work_mem actually controls, how it interacts with hash operations, and why blindly increasing it is a recipe for OOM kills.
The Mental Model Most People Have
work_mem = 4MB (default)
“That means each query gets 4MB of memory for sorting and hashing.”
Wrong.
Here is what the PostgreSQL documentation actually says:
A complex query might perform several sort and hash operations at the same time, with each operation generally being allowed to use as much memory as this value specifies before it starts to write data into temporary files. Also, several running sessions could be doing such operations concurrently.
Two critical implications:
- Per-operation, not per-query. A single query with three ORDER BY subclauses, a hash join, and a GROUP BY can have five or more separate work_mem allocations active simultaneously.
- Per-session concurrency. With 100 active connections all running complex queries, theoretical memory use is 100 × operations_per_query × work_mem.
What Actually Uses work_mem
Not everything does. Here is the definitive list:
Sorts
Every Sort node in a query plan gets its own work_mem budget. Sorts are used for:
- ORDER BY clauses
- DISTINCT (implemented as sort + unique)
- Merge Join inputs that are not already in order
- Window functions with ORDER BY
- UNION (deduplicates via sort)
- EXCEPT / INTERSECT
When a sort exceeds work_mem, PostgreSQL switches to an external merge sort using temporary files. You will see this in EXPLAIN ANALYZE:
EXPLAIN (ANALYZE, BUFFERS) SELECT * FROM orders ORDER BY total_amount;
-- Sort (cost=... rows=1000000 ...)
-- Sort Method: external merge Disk: 26MB
-- -> Seq Scan on orders
The “Disk: 26MB” line tells you the sort spilled. The in-memory sort method would show:
-- Sort Method: quicksort Memory: 74kB
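You can reproduce both behaviors in a single psql session by shrinking work_mem and then restoring it (a sketch assuming the orders table from the example above; 64kB is the smallest value PostgreSQL accepts):

```sql
-- Force a spill by making work_mem tiny for this session only
SET work_mem = '64kB';
EXPLAIN (ANALYZE, BUFFERS)
SELECT * FROM orders ORDER BY total_amount;
-- Expect: Sort Method: external merge  Disk: ...

-- Restore the configured value and compare the Sort Method line
RESET work_mem;
EXPLAIN (ANALYZE, BUFFERS)
SELECT * FROM orders ORDER BY total_amount;
```

Running the two plans back to back makes the quicksort-vs-external-merge switch easy to see without touching postgresql.conf.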
Hash Tables
Hash joins, hash aggregation (GROUP BY, DISTINCT), Memoize nodes, and hash-based IN subquery processing all build hash tables that consume work_mem.
But here is the twist: hash operations get more memory than sorts. Since PostgreSQL 13, the actual limit for hash-based operations is:
hash_mem_multiplier × work_mem
The default hash_mem_multiplier is 2.0 (since PostgreSQL 15; it was 1.0 in 13 and 14). So with the default work_mem of 4MB, hash operations can use up to 8MB before spilling.
Why? Hash operations are more sensitive to memory constraints than sorts. A sort that spills to disk can still operate relatively efficiently with an external merge. A hash table that spills has to split into batches, write them to disk, and read them back — a much more painful process.
When a hash table spills, you will see batches in EXPLAIN ANALYZE:
EXPLAIN (ANALYZE, BUFFERS)
SELECT o.customer_id, SUM(o.total_amount)
FROM orders o
GROUP BY o.customer_id;
-- HashAggregate (cost=... rows=50000 ...)
-- Group Key: customer_id
-- Batches: 5 Memory Usage: 8192kB
-- -> Seq Scan on orders
Batches: 1 means no spilling. Batches: 5 means the hash table was split into 5 batches, 4 of which were written to temp files and read back.
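As a sketch (again assuming the orders table above), you can often make the batches disappear by raising the budget for just this session and re-running the plan:

```sql
-- Give this session a bigger budget; the hash limit becomes
-- work_mem × hash_mem_multiplier
SET work_mem = '64MB';
EXPLAIN (ANALYZE, BUFFERS)
SELECT o.customer_id, SUM(o.total_amount)
FROM orders o
GROUP BY o.customer_id;
-- Expect: Batches: 1 once the hash table fits in memory
RESET work_mem;
```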
Materialization
Materialize nodes (used in subqueries and on nested loop inner sides) buffer tuples in memory up to work_mem, then spill to an on-disk tuplestore once the limit is exceeded.
What Does NOT Use work_mem
- shared_buffers — that is its own separate setting (shared memory, not per-session)
- Query parsing, planning, or execution framework overhead
- Result tuple storage sent to the client
- Temporary tables — those use temp_buffers instead
The Math That Should Terrify You
Consider a “moderately complex” analytical query:
SELECT region, product_category, COUNT(*), AVG(amount)
FROM (
SELECT o.*, r.region, p.category
FROM orders o
JOIN customers c ON o.customer_id = c.id
JOIN regions r ON c.region_id = r.id
JOIN products p ON o.product_id = p.id
WHERE o.order_date >= '2025-01-01'
ORDER BY o.order_date
) sub
GROUP BY region, product_category
ORDER BY region, product_category;
This query plan might contain:
| Node | Operation | work_mem Allocation |
|---|---|---|
| Hash Join #1 | Build hash on customers | 1 × hash_mem_multiplier × work_mem |
| Hash Join #2 | Build hash on regions | 1 × hash_mem_multiplier × work_mem |
| Hash Join #3 | Build hash on products | 1 × hash_mem_multiplier × work_mem |
| Sort | ORDER BY in subquery | 1 × work_mem |
| HashAggregate | GROUP BY | 1 × hash_mem_multiplier × work_mem |
| Sort | Final ORDER BY | 1 × work_mem |
With work_mem = 256MB and hash_mem_multiplier = 2.0:
Hash operations: 4 × 512MB = 2,048MB
Sort operations: 2 × 256MB = 512MB
------
Total per query: 2,560MB
2.5 GB per query. Now multiply by 20 concurrent sessions. That is 50 GB of memory just for work_mem allocations. On a machine with 32 GB RAM.
This is why the default is 4MB. Not because the PostgreSQL developers hate performance, but because the default needs to be safe for any workload.
How to Choose work_mem
The Formula
A reasonable upper bound:
work_mem = (Total RAM - shared_buffers) / (max_connections × average_operations_per_query)
On a 64 GB server with shared_buffers = 16GB, max_connections = 200, and an estimated average of 3 operations per query:
work_mem = (64GB - 16GB) / (200 × 3) = 48GB / 600 ≈ 80MB
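The same arithmetic can be done directly in psql; pg_size_pretty makes the result readable (the numbers mirror the example above):

```sql
SELECT pg_size_pretty(
  ((64::bigint - 16) * 1024 * 1024 * 1024)  -- RAM minus shared_buffers, in bytes
  / (200 * 3)                               -- max_connections × avg ops per query
) AS suggested_work_mem;
-- roughly the 80MB figure above
```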
This is a rough guideline. Real-world workloads vary enormously.
Connection Pooling Changes Everything
If you use PgBouncer or another connection pooler with, say, 20 actual database connections but 200 application connections, the formula changes:
work_mem = (64GB - 16GB) / (20 × 3) ≈ 800MB
With connection pooling, you can safely use a much larger work_mem because the number of concurrent in-flight queries is bounded by actual database connections, not application connections.
Per-Query Tuning
You do not have to set work_mem globally. You can set it per-session, per-transaction, or even per-query:
-- For a single analytical query that needs a big sort.
-- SET LOCAL only takes effect inside a transaction block.
BEGIN;
SET LOCAL work_mem = '1GB';
SELECT * FROM massive_table
ORDER BY complex_expression
LIMIT 100;
COMMIT;
-- work_mem reverts when the transaction ends
This is the best practice: keep the global work_mem conservative (4-32MB), and bump it up for specific analytical queries.
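If a whole class of sessions needs more memory — say, every connection used by a reporting service — you can attach the setting to a role or database instead of repeating it per query (the role and database names here are illustrative):

```sql
-- All future sessions logging in as this role get the larger value
ALTER ROLE reporting_user SET work_mem = '256MB';

-- Or scope it to a single database
ALTER DATABASE analytics SET work_mem = '128MB';
```

Per-role settings apply at login, so existing sessions must reconnect to pick them up.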
Finding Spills
Check EXPLAIN ANALYZE
The most direct method. Look for external merge (sorts) or Batches: N > 1 (hash):
EXPLAIN (ANALYZE, BUFFERS, SETTINGS)
SELECT customer_id, COUNT(*)
FROM orders
GROUP BY customer_id;
Monitor log_temp_files
Set this in postgresql.conf:
log_temp_files = 0
This logs every temp file at creation time, including its size. The value is a threshold: -1 (the default) disables logging, 0 logs every temp file, and a positive value such as '10MB' logs only files at least that large.
Log output looks like:
LOG: temporary file: path "base/pgsql_tmp/pgsql_tmp12345.0", size 26214400
STATEMENT: SELECT * FROM orders ORDER BY total_amount;
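For a cluster-wide view without scraping logs, pg_stat_database tracks cumulative temp file counts and bytes per database since the last stats reset:

```sql
SELECT datname,
       temp_files,                          -- number of temp files created
       pg_size_pretty(temp_bytes) AS temp_written
FROM pg_stat_database
WHERE temp_files > 0
ORDER BY temp_bytes DESC;
```

A steadily climbing temp_bytes on a production database is a strong hint that some workload deserves the per-query treatment described below.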
Query pg_stat_statements
In PostgreSQL 13+, pg_stat_statements tracks temp file usage per query:
SELECT query,
calls,
total_exec_time,
temp_blks_written,
temp_blks_read
FROM pg_stat_statements
WHERE temp_blks_written > 0
ORDER BY temp_blks_written DESC
LIMIT 10;
This tells you which queries spill the most. Each temp block is 8KB, so temp_blks_written = 12800 means about 100MB of temp file writes.
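The 8KB block math can be pushed into the query itself so you read sizes rather than block counts (this assumes the default 8KB block size; check SHOW block_size if your build differs):

```sql
SELECT left(query, 60) AS query,
       calls,
       pg_size_pretty(temp_blks_written * 8192::bigint) AS temp_written
FROM pg_stat_statements
WHERE temp_blks_written > 0
ORDER BY temp_blks_written DESC
LIMIT 10;
```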
hash_mem_multiplier: The Unsung Hero
Introduced in PostgreSQL 13, hash_mem_multiplier decouples hash table limits from sort limits. Before PG13, hash tables and sorts shared the same work_mem limit. Since hash tables are more sensitive to spilling, this often led to suboptimal behavior:
- Set work_mem high enough for hash tables → sorts waste memory
- Set work_mem low enough for sorts → hash tables spill
Now you can tune them independently:
work_mem = 32MB # For sorts
hash_mem_multiplier = 2.0 # Hash tables get 64MB
The default of 2.0 is well-chosen for most workloads. For heavy hash-aggregation workloads (lots of GROUP BY on high-cardinality columns), values up to 4.0-8.0 can help without inflating sort memory.
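Like work_mem, hash_mem_multiplier can be set per session or per transaction, so a hash-heavy report can get a bigger hash budget without inflating sort memory anywhere else (the table name is illustrative):

```sql
BEGIN;
SET LOCAL hash_mem_multiplier = 4.0;  -- hash ops get 4 × work_mem in this txn
SELECT device_id, COUNT(*)
FROM events
GROUP BY device_id;
COMMIT;
-- the multiplier reverts to the configured value after COMMIT
```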
The Temporary File Landscape
When operations spill, where do the temp files go?
show temp_tablespaces;
-- <empty> means default tablespace
show data_directory;
-- /var/lib/postgresql/17/main
Temp files go to pgsql_tmp/ inside the data directory by default. On a busy system with lots of spilling, this directory can grow to many gigabytes.
You can redirect temp files to a separate disk:
CREATE TABLESPACE fast_temp LOCATION '/mnt/fast-ssd/pg-temp';
-- Then set for specific operations or globally:
SET temp_tablespaces = 'fast_temp';
This is particularly useful for analytical queries where temp files are expected — keep them off the same disk as your data and WAL.
Real-World Case Study: The “Work_mem Is Fine” Mystery
A team noticed their reporting queries were slow. They checked work_mem:
SHOW work_mem;
-- 64MB
“That should be plenty.” But the query was still spilling. The EXPLAIN (ANALYZE) told the real story:
Hash Join (cost=... rows=5000000 ...)
  Hash Cond: (o.product_id = p.id)
  ->  Hash Join (cost=... rows=5000000 ...)
        Hash Cond: (o.customer_id = c.id)
        ->  Hash Join (cost=... rows=5000000 ...)
              Hash Cond: (o.region_id = r.id)
              ->  Seq Scan on orders (rows=50000000)
              ->  Hash (rows=50)
                    Buckets: 1024  Batches: 1  Memory Usage: 2048kB
        ->  Hash (rows=100000)
              Buckets: 131072  Batches: 1  Memory Usage: 14336kB
  ->  Hash (rows=50000)
        Buckets: 65536  Batches: 5  Memory Usage: 65536kB
The third hash table (products, 50K rows) was spilling to 5 batches. The customer hash was using 14MB of 64MB. The product hash hit the limit.
The fix was not to increase global work_mem. It was:
SET LOCAL work_mem = '256MB';
Just for this query. The hash table fit in memory, temp file writes dropped to zero, and the query went from 45 seconds to 8 seconds.
Tuning Checklist
- Start conservative. work_mem = 4MB (default) or 16MB for OLTP. Do not blindly set to 1GB.
- Find spills. Use log_temp_files = 0 and pg_stat_statements to identify which queries actually spill.
- Count operations. Look at EXPLAIN plans for the spilling queries. How many Sort and Hash nodes does each have?
- Calculate real usage. work_mem × operations_per_query × concurrent_queries. Does it fit in RAM?
- Use per-query tuning. SET LOCAL work_mem = 'X' inside transactions for analytical queries. Keep the global setting low.
- Use hash_mem_multiplier (PG13+). Default 2.0 is good. Increase to 4.0 for heavy GROUP BY workloads.
- Monitor OOM. If you see out of memory errors in the logs, your work_mem is too high for your concurrency level.
- Put temp files on fast storage. If spilling is unavoidable, at least make it fast with temp_tablespaces.
Key Takeaways
- work_mem is a per-operation limit, not per-query. A single query can consume N × work_mem where N is the number of sort and hash operations in the plan.
- Hash operations get more memory via hash_mem_multiplier (default 2.0 × work_mem). This is because hash table spills are more expensive than sort spills.
- The default 4MB is intentionally conservative to prevent OOM under concurrent load. The formula total_RAM / (max_connections × avg_ops_per_query) gives you a safe upper bound.
- Per-query SET LOCAL work_mem is the best practice — keep global low, bump it for specific analytical queries.
- log_temp_files = 0 and pg_stat_statements.temp_blks_written are your primary tools for detecting spills.
- Spilling is not always bad. If a query runs once a day and finishes in acceptable time, chasing zero spills is not worth the memory pressure.
What’s Next
Tomorrow we tackle another memory tuning classic: shared_buffers vs the OS page cache — the double-buffering myth. Is PostgreSQL really caching everything twice? Should you set shared_buffers to 25% of RAM, or is that cargo-cult tuning? We will look at what shared_buffers actually does, how the OS cache complements it, and why the answer depends on your workload.
This is Day 11 of the PostgreSQL Deep Dive series. Catch up on all posts here.