PostgreSQL Deep Dive

PostgreSQL Deep Dive: Temporary Files and log_temp_files — When Your Queries Secretly Write Gigabytes to Disk

Your query ran in 47 seconds. The plan looked fine — index scan, hash join, a sort at the end. Nothing unusual. But what EXPLAIN didn’t show you is that during those 47 seconds, PostgreSQL quietly wrote 2.3 GB of temporary files to disk, read them all back, then deleted them. Nobody saw it happen. No error, no warning. Just a query that should have been fast, running agonizingly slow.

This is the world of temporary files in PostgreSQL — the behind-the-scenes disk spills that happen when your work_mem budget runs out. They’re invisible by default, they compound under concurrency, and they’re one of the most common causes of “the database is slow but I can’t figure out why.”

What Triggers a Temporary File

PostgreSQL creates temporary files when an operation’s memory allocation exceeds work_mem (or hash_mem_multiplier × work_mem for hash-based operations). There are five things that can spill:

Sorts. Any ORDER BY, DISTINCT, GROUP BY, UNION, INTERSECT, EXCEPT, LIMIT with a sort, window functions, or merge join inputs that need sorting. When the sorted dataset exceeds work_mem, PostgreSQL switches to an external merge sort: it writes sorted runs to disk, then merges them back together. Each run costs a write and a read. More runs mean more passes. The result shows up in EXPLAIN as Sort Method: external merge Disk: 24536kB.

Hash tables. Hash joins, hash aggregates, Memoize nodes, and IN subqueries materialized as hash tables. When the hash table exceeds hash_mem_multiplier × work_mem (default 2.0×, so effectively 2× your work_mem), PostgreSQL partitions the data into batches and spills intermediate batches to disk. Each batch requires a write on creation and a read when probed. EXPLAIN shows Batches: 7 and Memory: 1025kB Disk: 18432kB. The more batches, the more disk I/O.

Materialization nodes. CTEs marked for materialization, subqueries that the planner decides to execute once, and any node where PostgreSQL needs to store intermediate results. When the materialized set exceeds work_mem, it spills.

Bitmap heap scans. When a bitmap becomes too large for memory, PostgreSQL switches to lossy mode and writes intermediate bitmap pages to disk. This one is subtle because it doesn’t produce a Disk: annotation in EXPLAIN — it just appears as Recheck Cond: indicating lossy bitmap mode.

Held cursors. DECLARE ... CURSOR WITH HOLD stores its result set on disk so it survives across transactions. These are explicit temporary files, not query spills, and they’re cleaned up when the cursor is closed or the session ends.

The key insight: PostgreSQL doesn’t fail when it runs out of memory. It doesn’t cancel the query. It simply degrades gracefully by spilling to disk, trading CPU and I/O for the ability to complete the operation. This is by design — it’s better to complete slowly than to crash. But it means you can have a severe performance problem with zero errors.

Why Temporary Files Are a Silent Performance Killer

The cost of a spill isn’t just “disk is slower than RAM.” It’s worse than that.

Amplification. An external merge sort with K runs requires multiple passes. With a large spill, you might write 4 GB of temp data, read it back, write 4 GB again for the merge, and read it one final time. That’s 12 GB of I/O for a sort that should have fit in memory. Hash joins with many batches have the same problem — each batch is written, read during probing, and may trigger recursive spilling if the batch itself is too large.

WAL pressure. Temporary files don’t generate WAL records (they’re unlogged by nature), but the extra I/O competes with WAL writes on the same disk. If your WAL volume is on the same storage as pg_default or your temp tablespace, your checkpoint and replication throughput suffer.

Concurrency multiplication. work_mem is per-operation, not per-query. A single query might have three sorts and two hash joins, each getting its own work_mem allocation. With 100 concurrent connections, each running a query with four memory-intensive operations, you have 400 allocations fighting for the same disk spindles or SSD channels. The disk becomes the bottleneck, and every query slows down because they’re all spilling simultaneously.

Storage exhaustion. Without limits, a single runaway query can fill your disk. Full disk means PostgreSQL can’t write WAL, which means every transaction blocks. This is a cluster-stopping event, and it starts with a temporary file.

log_temp_files: Your First Line of Defense

PostgreSQL has a GUC specifically for detecting spills: log_temp_files. Set it to a size in kilobytes (0 logs all temp files, -1 disables logging), and any operation that creates a temporary file larger than that threshold gets logged.

-- Log every temporary file creation
SET log_temp_files = 0;

-- Log only spills larger than 10 MB
SET log_temp_files = '10MB';

-- Set globally in postgresql.conf
log_temp_files = '10MB'

When a spill occurs, you’ll see entries like this in your logs:

LOG:  temporary file: path "base/pgsql_tmp/pgsql_tmp12345.0", size 245760 kB

This tells you the PID that created the file, its size, and its path. Cross-reference the PID with pg_stat_activity to find the query responsible.

Setting it to 0 in production is noisy but revealing. You’ll find spills you never expected. I recommend 10MB as a starting point — anything under 10 MB is usually harmless, but larger spills are worth investigating.

temp_file_limit: The Cluster Protection Valve

temp_file_limit (introduced in PostgreSQL 10) sets a hard cap on how much disk space a single backend process can use for temporary files. When a process exceeds the limit, the current transaction is canceled:

-- Limit each process to 1 GB of temporary files
SET temp_file_limit = '1GB';

-- Set in postgresql.conf
temp_file_limit = '1GB'

When the limit is hit:

ERROR:  temporary file size exceeds temp_file_limit (1073741824 kB)

This is your safety net against storage exhaustion. Set it to something reasonable based on your available disk space and expected workload. A good starting point is 10-20% of your available temp storage, per process.

Important caveat: temp_file_limit applies per backend process, not cluster-wide. With 200 connections each using 1 GB, you’d need 200 GB of free space. Plan accordingly.

Also important: This limit covers query execution temporary files (sorts, hashes, materializations) but not explicit temporary tables created with CREATE TEMP TABLE. Temporary tables have their own storage management and are not constrained by this setting.

Diagnostic Queries: Finding the Spills

1. pg_stat_statements — Aggregate Temp File Usage

SELECT
  queryid,
  left(query, 80) AS query_preview,
  calls,
  round(total_temp_blks_read::numeric, 1) AS total_temp_read_mb,
  round(total_temp_blks_written::numeric, 1) AS total_temp_written_mb,
  round(mean_temp_blks_read::numeric, 1) AS avg_temp_read_mb,
  round(mean_temp_blks_written::numeric, 1) AS avg_temp_written_mb,
  max_temp_blks_written
FROM pg_stat_statements
WHERE total_temp_blks_written > 0
ORDER BY total_temp_blks_written DESC
LIMIT 20;

Note: pg_stat_statements tracks blocks, not bytes. Multiply by 8 KB to get approximate megabytes. If mean_temp_written_mb is consistently high, that query is spilling on every execution — a strong signal that work_mem needs tuning or the query needs rewriting.

2. Current Session Spills via EXPLAIN ANALYZE

EXPLAIN (ANALYZE, BUFFERS, VERBOSE)
SELECT ...your query here...;

Look for Sort Method: external merge Disk: NkB in sort nodes and Batches: N with Disk: NkB in hash nodes. Compare the Disk: value against your work_mem setting — if the disk usage is close to work_mem, you’re right on the edge and a small data growth will push you into a spill.

3. Active Temporary Files on Disk

SELECT
  pid,
  usename,
  datname,
  query,
  pg_size_pretty(sum(size_bytes)) AS total_temp_size
FROM (
  SELECT
    pg_stat_activity.pid,
    pg_stat_activity.usename,
    pg_stat_activity.datname,
    pg_stat_activity.query,
    (pg_stat_file('base/pgsql_tmp/' || tmpfile)).size AS size_bytes
  FROM pg_ls_dir('base/pgsql_tmp') AS tmpfile
  CROSS JOIN pg_stat_activity
  WHERE tmpfile ~ '^\d+'
    AND pg_stat_activity.pid = tmpfile::int
) sub
GROUP BY pid, usename, datname, query
ORDER BY sum(size_bytes) DESC;

This query maps active temporary files on disk to the sessions that created them. It’s the fastest way to answer “what’s eating my disk right now?“

4. Historical Spill Pattern Analysis

-- Check pg_stat_user_tables for sort/hash spill indicators
SELECT
  schemaname,
  relname,
  n_tup_ins,
  n_tup_upd,
  n_tup_del,
  n_live_tup,
  last_analyze,
  last_autoanalyze
FROM pg_stat_user_tables
WHERE n_live_tup > 100000
ORDER BY n_live_tup DESC;

Large tables that haven’t been analyzed recently are prime candidates for causing spills — the planner overestimates selectivity, picks hash joins and sorts that don’t fit in memory, and spills result.

The Gotcha: Why Fixing Spills Is Harder Than It Looks

work_mem Is Per-Operation, Not Per-Query

This is the most common misunderstanding. If you set work_mem = '64MB' and a query has two sorts and a hash join, it can allocate up to 192 MB. Under concurrency, this multiplies. Naively bumping work_mem to “fix” spills can OOM your server.

The safe approach: identify the specific queries spilling, estimate concurrent execution, and either set work_mem higher globally (if you have headroom) or use SET LOCAL work_mem = '256MB' inside specific transactions that need it.

Hash Spills Are Worse Than Sort Spills

Sort spills have a predictable cost: external merge sort with O(N log N) I/O. Hash spills are nastier because of batching. With K batches, PostgreSQL does K passes over the data. If K is large (say 50+), the cost is enormous. And here’s the subtle part: hash joins build the hash table on the inner side, but probe with the outer side. If the inner side spills into many batches, the outer side gets re-scanned for each batch. A “small” spill of the build side can cause a massive amplification on the probe side.

Watch for Batches: counts greater than 4 or 5 in EXPLAIN output — that’s your signal that the hash join is deeply into spill territory.

Parallel Query Workers Each Get work_mem

A parallel query with 4 workers and a leader can use 5× work_mem for each parallel operation. If the parallel hash join spills, all 5 processes write temp files simultaneously. When tuning work_mem for parallel workloads, factor in max_parallel_workers_per_gather + 1.

temp_tablespaces: Spreading the Load

You can direct temporary files to specific tablespaces:

-- Create a fast tablespace for temp files
CREATE TABLESPACE temp_fast LOCATION '/mnt/nvme/postgresql/tmp';

-- Use it for temporary files
SET temp_tablespaces = 'temp_fast';

-- Or set globally
-- In postgresql.conf:
-- temp_tablespaces = 'temp_fast'

When multiple tablespaces are listed, PostgreSQL picks one randomly for each temporary object. This spreads I/O across multiple disks. Useful when your main data volume is on spinning disks but you have an NVMe drive available for temp work.

Gotcha: If a tablespace is full or unavailable, PostgreSQL will fall back to the default tablespace for temp files. Check that your temp tablespaces actually exist and have space.

Practical Tuning Strategy

Step 1: Enable logging. Set log_temp_files = '10MB' and watch your logs for a day. This gives you the spill baseline.

Step 2: Identify repeat offenders. Use pg_stat_statements to find queries with consistently high mean_temp_blks_written. These are your targets.

Step 3: Analyze why they spill. Run EXPLAIN (ANALYZE, BUFFERS) on the worst queries. Check whether the spill comes from a sort or a hash join. Look at the estimated vs actual row counts — if they diverge significantly, stale statistics are the root cause, not insufficient work_mem.

Step 4: Fix the root cause, not the symptom.

  • If statistics are stale, run ANALYZE or tune autovacuum_analyze_scale_factor.
  • If the query plan is suboptimal (e.g., a hash join when a nested loop would be better), add an index or rewrite the query.
  • If the data genuinely needs a large sort or hash, increase work_mem — but use SET LOCAL inside the specific transaction rather than raising the global value.

Step 5: Set temp_file_limit. Based on your available disk space and the maximum reasonable spill size you found in step 2, set a limit that allows normal operations but prevents runaway queries from filling the disk. Start conservative and relax if you hit legitimate cancellations.

Step 6: Consider temp tablespaces. If spills are frequent and your main storage is slow, putting temp files on faster storage (NVMe, RAM disk for very small databases) can dramatically reduce spill latency.

Key Takeaways

  • PostgreSQL silently spills sorts, hashes, and materializations to disk when they exceed work_mem. There are no errors, no warnings by default — just slow queries.
  • log_temp_files is your detection tool. Set it to '10MB' in production and investigate queries that consistently log spills.
  • temp_file_limit is your safety net. It prevents any single process from filling the disk with temporary files, but it applies per-process, not cluster-wide.
  • work_mem is per-operation, not per-query, and parallel workers each get their own allocation. The total memory for concurrent queries is work_mem × ops × connections × (parallel_workers + 1).
  • Hash spills with high batch counts are worse than sort spills due to repeated probing of the outer side. A hash join with 50 batches is doing 50× the I/O you’d expect.
  • Before raising work_mem, check whether stale statistics are causing the planner to pick a worse plan. ANALYZE is often the real fix.
  • temp_tablespaces lets you direct spills to faster storage, spreading I/O load across multiple disks.

What’s Next

Tomorrow we’ll look at checkpoint spikes — those mysterious moments where your entire database seems to freeze for a few seconds, writes pile up, and latency shoots through the roof. Checkpoints are necessary, but the way PostgreSQL spreads (or doesn’t spread) the write work can make the difference between a smooth-running system and one that stutters under load.


Previous: Planning Time Spikes: Why Your Queries Take Longer to Plan Than to Execute