PostgreSQL Deep Dive

PostgreSQL Deep Dive: maintenance_work_mem and autovacuum_work_mem — the memory parameters that decide how fast your database cleans up after itself

Yesterday we covered work_mem and how it controls per-operation memory for sorts, hash joins, and hash aggregates. Today we are looking at its less glamorous sibling: maintenance_work_mem, the parameter that governs how much memory VACUUM, CREATE INDEX, and ALTER TABLE ADD FOREIGN KEY can use. The default is 64MB. On any database larger than a few gigabytes, that default is costing you.

What maintenance_work_mem Actually Controls

maintenance_work_mem sets the memory ceiling for three operations:

VACUUM (and ANALYZE). Lazy VACUUM uses this memory for the dead-tuple index, the data structure that tracks which tuples need to be removed from indexes during the second vacuum pass. When a table has millions of dead tuples, a 64MB limit forces vacuum to spill this index to disk, making the index cleanup phase dramatically slower. The heap scan phase (first pass) is unaffected, but index vacuuming (second pass) scales directly with the available memory for the dead-tuple sort.

CREATE INDEX. Index builds use maintenance_work_mem for the sort buffer when building B-tree indexes. The build process sorts the index keys before writing them to the index structure. With 64MB, this sort spills to disk on any table larger than a few hundred thousand rows. A larger maintenance_work_mem lets the sort complete in memory, which is the single biggest performance lever for index creation time.

ALTER TABLE ADD FOREIGN KEY. Adding a foreign key requires validating that every row in the referencing table has a matching row in the referenced table. This validation scan uses maintenance_work_mem for its sort operations.

What it does NOT control. maintenance_work_mem does not affect regular queries, hash joins, sorts in SELECT statements, or any DML operation. It is strictly for maintenance. This distinction matters because it means you can set it aggressively high without worrying about concurrent query memory usage, with one exception.

The Autovacuum Multiplication Problem

Here is the catch. The PG docs say:

“When autovacuum runs, up to autovacuum_max_workers times this memory may be allocated, so be careful not to set the default value too high.”

autovacuum_max_workers defaults to 3. Each autovacuum worker allocates its own maintenance_work_mem independently. If maintenance_work_mem is set to 2GB and all three workers are active on large tables simultaneously, that is 6GB of RAM consumed by vacuum alone. On a 16GB instance with 4GB in shared_buffers, you have just eaten most of the remaining memory.

This is where autovacuum_work_mem comes in. It is a separate parameter that applies only to autovacuum workers, leaving maintenance_work_mem for your manual operations.

autovacuum_work_mem: Separate Budget for Automatic Vacuum

autovacuum_work_mem defaults to -1, which means “use maintenance_work_mem.” When it is -1, autovacuum workers use the same memory budget as your manual VACUUM and CREATE INDEX commands. This is the source of the multiplication problem.

The practical split looks like this:

--postgresql.conf

-- Manual operations: generous, since you run one at a time
maintenance_work_mem = '2GB'

-- Autovacuum workers: capped to prevent multiplication
autovacuum_work_mem = '512MB'

With these settings, a manual VACUUM or CREATE INDEX gets 2GB of breathing room, but three simultaneous autovacuum workers share a 512MB budget each (1.5GB total). You get fast maintenance when you run it yourself, and bounded memory usage when autovacuum runs unattended.

autovacuum_work_mem can only be set in postgresql.conf or on the server command line. You cannot set it per-table with ALTER TABLE or per-session with SET. This is a deliberate restriction: autovacuum workers are background processes that should not inherit arbitrary session-level settings.

How VACUUM Uses maintenance_work_mem

Lazy VACUUM operates in three phases, and only the index cleanup phase cares about maintenance_work_mem:

  1. Heap scan (first pass). Vacuum scans the table’s heap pages, identifying dead tuples and recording them in an array. This phase uses negligible memory proportional to the number of dead tuples found. It does not use maintenance_work_mem.

  2. Index cleanup (second pass). For each index on the table, vacuum processes the dead tuple list to remove corresponding index entries. It sorts the dead tuple array by the index’s key order to enable efficient batch deletion from the index. The memory for this sort comes from maintenance_work_mem. If the dead tuple list exceeds the available memory, it spills to a temporary file, and the sort becomes an external merge sort. This is slow.

  3. Heap prune and freeze. Vacuum makes a second pass over the heap to set hint bits (marking dead tuples as unused) and freeze old tuple headers. This phase does not use maintenance_work_mem.

The critical path is phase two. The more dead tuples accumulated since the last vacuum, the more memory the index cleanup needs. If the dead tuple list is small enough to fit in maintenance_work_mem, the sort is a single-pass in-memory quicksort. If it spills, vacuum does multiple merge passes over temporary files, and the index vacuum can take 5-10x longer.

The dead tuple sort scaling

The dead tuple list for a single vacuum pass is a ItemId array (4 bytes per entry, plus a few bytes of overhead). Roughly 5-8 bytes per dead tuple. At 64MB, you can hold about 8-12 million dead tuple entries in memory. At 1GB, about 125-200 million.

For a table with high churn (lots of UPDATEs and DELETEs), the dead tuple count between vacuums can easily exceed the 8-12 million range. On a billion-row table with a 0.2 scale factor, autovacuum waits until 200 million rows change. At that point, the dead tuple list is far too large for 64MB. The index cleanup spills, and vacuum takes hours instead of minutes.

How CREATE INDEX Uses maintenance_work_mem

Index creation has two phases, and maintenance_work_mem affects the sort phase:

  1. Scan and sort. PostgreSQL scans the table, extracts the index keys, and sorts them. The sort buffer comes from maintenance_work_mem. If the sort fits in memory, it completes as a single-pass quicksort. If it spills, it becomes an external merge sort with multiple passes.

  2. Build. The sorted keys are written into the B-tree structure. This phase uses minimal memory and is I/O-bound.

For a B-tree index on a 100-million-row table, the sort data is roughly 16 bytes per row (key + tid + overhead), about 1.6GB. With 64MB of maintenance_work_mem, this sort spills heavily. With 2GB, it completes in memory. The difference in CREATE INDEX time is typically 3-10x.

CONCURRENTLY adds overhead

CREATE INDEX CONCURRENTLY splits the work into two scan phases (one before and one after waiting for concurrent transactions to finish), each building a partial index. Each phase does its own sort, and both use maintenance_work_mem. The total work is roughly double a non-concurrent build, but each individual sort phase still benefits from more memory.

How ALTER TABLE ADD FOREIGN KEY Uses maintenance_work_mem

Foreign key validation performs a join between the referencing table and the referenced table. For each row in the referencing table, it checks that a matching row exists in the referenced table. When the referenced table has an index (it should), this becomes an index scan, but the validation still builds a hash table or sort buffer bounded by maintenance_work_mem.

On large tables, FK validation with 64MB can take much longer than necessary. Set maintenance_work_mem high before adding FKs to large tables:

SET maintenance_work_mem = '2GB';
ALTER TABLE orders ADD CONSTRAINT fk_customer
  FOREIGN KEY (customer_id) REFERENCES customers(id);

vacuum_buffer_usage_limit: The PG17 Addition

PostgreSQL 17 introduced vacuum_buffer_usage_limit, which controls a different aspect of vacuum memory: how much of shared_buffers vacuum is allowed to evict.

The default is 2MB. This is the size of the Buffer Access Strategy (BAS) ring buffer that vacuum uses when reading heap and index pages. Instead of loading pages into the general shared buffer pool (where they would evict pages that active queries need), vacuum reads pages through a small ring buffer. When the ring is full, old pages are evicted to make room for new ones, and the evicted pages are not ones that queries are using.

A small ring (2MB) means vacuum frequently re-reads pages from disk. A larger ring means vacuum keeps more pages cached and does less I/O, but risks evicting useful query pages from shared_buffers. The maximum is capped at 1/8 of shared_buffers.

For most workloads, the 2MB default is reasonable. Vacuum is a background operation and should not compete with queries for buffer cache space. But if you have a dedicated maintenance window or a large instance where query buffer pressure is low, increasing this limit can speed up vacuum significantly.

-- Faster vacuum on a large table during a maintenance window
VACUUM (BUFFER_USAGE_LIMIT '64MB') verbose large_table;

-- Or set it cluster-wide
-- vacuum_buffer_usage_limit = '64MB'  -- in postgresql.conf

Note that vacuum_buffer_usage_limit applies to both VACUUM and ANALYZE. It can be set per-command, per-session, or in postgresql.conf. This gives you fine-grained control that maintenance_work_mem does not provide.

Per-Table Tuning with SET LOCAL

You cannot set autovacuum_work_mem per table. But you can set maintenance_work_mem for a manual VACUUM using SET LOCAL inside a transaction:

BEGIN;
SET LOCAL maintenance_work_mem = '1GB';
VACUUM VERBOSE my_large_table;
COMMIT;

This works because maintenance_work_mem is a session-level GUC. SET LOCAL ensures it reverts when the transaction ends, so it does not affect subsequent operations in the same session.

For autovacuum, you are stuck with the global autovacuum_work_mem (or the inherited maintenance_work_mem if it is -1). If you have one enormous table that needs 2GB for vacuum and everything else is fine with 256MB, you cannot differentiate through configuration alone. Your options are:

  1. Set autovacuum_work_mem to the higher value and accept the memory cost on smaller tables (wasteful but simple).
  2. Disable autovacuum on the large table and run manual VACUUM with a per-session maintenance_work_mem setting on a schedule (more work, more control).
  3. Use a cron job or external scheduler to run VACUUM (BUFFER_USAGE_LIMIT ...) with tuned parameters during low-traffic periods.

The Math: How Much Can You Safely Allocate?

The safe upper bound for maintenance_work_mem depends on whether you rely on autovacuum or run manual maintenance:

Manual-only maintenance (autovacuum disabled for large tables)

maintenance_work_mem = (Total RAM - shared_buffers) / 2

Since only one maintenance operation runs per session, and you control when they run, you can allocate aggressively. On a 64GB instance with 16GB shared_buffers, setting maintenance_work_mem to 16-24GB is reasonable for a single CREATE INDEX on a large table.

With autovacuum running

autovacuum_work_mem = (Total RAM - shared_buffers) / (2 * autovacuum_max_workers)
maintenance_work_mem = (Total RAM - shared_buffers) / 2

On the same 64GB instance with 3 autovacuum workers:

  • autovacuum_work_mem = (64 - 16) / 6 = 8GB per worker (24GB total worst case)
  • maintenance_work_mem = 24GB for manual operations

These are upper bounds. In practice, autovacuum workers rarely all hit their memory ceiling simultaneously, and not all workers run on large tables. But the upper bound is what you need to survive.

A practical starting configuration

-- postgresql.conf, 32GB instance

-- Shared buffers
shared_buffers = '8GB'

-- Manual maintenance (VACUUM, CREATE INDEX, ALTER TABLE ADD FK)
maintenance_work_mem = '4GB'

-- Autovacuum workers (up to 3 concurrent × 1GB each = 3GB worst case)
autovacuum_work_mem = '1GB'

-- PG17+: vacuum buffer ring size
vacuum_buffer_usage_limit = '8MB'

-- Query memory (per-operation)
work_mem = '64MB'

This configuration uses 8GB for shared buffers, reserves up to 3GB for autovacuum (worst case), and leaves about 21GB for the OS page cache, query memory, and other processes. When you run a manual CREATE INDEX, you get 4GB for the sort. When autovacuum runs, each worker gets 1GB.

Monitoring: How to Tell If Your Settings Are Wrong

Three signs that maintenance_work_mem is too low:

  1. Slow CREATE INDEX on large tables. If index creation takes hours on a table with tens of millions of rows, the sort is probably spilling. Check with EXPLAIN ANALYZE CREATE INDEX ... (not all versions support this, but PG15+ shows Sort nodes in the plan).

  2. Long-running VACUUM on the index cleanup phase. Look at pg_stat_progress_vacuum (PG12+). The phase column shows where vacuum is. If it sits on “processing index” for a long time, the dead tuple sort is likely spilling.

SELECT relid::regclass AS table,
       phase,
       heap_blks_total,
       heap_blks_scanned,
       index_relid::regclass AS current_index
FROM pg_stat_progress_vacuum;
  1. High temporary file usage during maintenance. If log_temp_files is set (see yesterday’s post), maintenance operations that spill will show up in the logs. A VACUUM generating a 500MB temp file is a clear signal that maintenance_work_mem should be higher.

Three signs that autovacuum_work_mem is too high:

  1. OOM kills during vacuum. If the Linux OOM killer targets autovacuum workers, your autovacuum_work_mem is too high for the available RAM given the number of concurrent workers.

  2. Query performance degradation during autovacuum. If queries slow down when vacuum runs, vacuum may be evicting query-critical pages from the OS page cache (not shared_buffers, since vacuum uses the BAS ring). Lower vacuum_buffer_usage_limit to reduce buffer cache impact.

  3. Swap activity correlated with vacuum. Check vmstat or pg_stat_bgwriter during heavy vacuum periods. If swap usage spikes when autovacuum workers are active, reduce autovacuum_work_mem.

The Connection to Other Parameters

maintenance_work_mem does not exist in isolation. It interacts with several other parameters:

  • work_mem: Controls query-time memory. maintenance_work_mem should be significantly larger than work_mem because maintenance operations run one at a time per session and can use the full allocation, while queries can have multiple concurrent work_mem consumers.

  • shared_buffers: The buffer pool. maintenance_work_mem allocations are private to the maintenance backend and do not come from shared_buffers. They come from the backend’s private memory space (backed by the OS). But the data read during maintenance does go through shared_buffers (or the BAS ring), so vacuum_buffer_usage_limit controls how much of shared_buffers vacuum evicts.

  • effective_io_concurrency: Affects how many I/O operations vacuum issues concurrently during the heap scan phase. Higher values help on SSDs. This is independent of maintenance_work_mem but compounds the performance benefit of larger maintenance memory.

  • autovacuum_vacuum_cost_limit and autovacuum_vacuum_cost_delay: The cost-based vacuum throttling system. Even with generous autovacuum_work_mem, vacuum may slow itself down if the cost limit is low. These parameters control vacuum’s I/O rate, while autovacuum_work_mem controls its memory. Both need to be tuned together for optimal vacuum performance.

  • wal_level and wal_compression: VACUUM generates WAL for hint bit setting and dead tuple removal. With wal_compression = on, the WAL volume is smaller, which means faster I/O during vacuum. This is independent of memory allocation but affects vacuum’s overall wall-clock time.

Per-Table Overrides for Autovacuum Storage Parameters

While you cannot override autovacuum_work_mem per table, you can influence how much memory vacuum needs by controlling how much work it does per run. Smaller vacuum runs need less memory:

-- More frequent vacuuming on a high-churn table means
-- fewer dead tuples per run, less index cleanup memory needed
ALTER TABLE orders SET (
  autovacuum_vacuum_scale_factor = 0.05,  -- 5% instead of 20%
  autovacuum_analyze_scale_factor = 0.02
);

-- Trigger vacuum more often on append-heavy tables
ALTER TABLE events SET (
  autovacuum_vacuum_insert_scale_factor = 0.02  -- PG13+
);

Lower scale factors mean vacuum runs more often on smaller amounts of change. Each individual vacuum pass has fewer dead tuples to sort, so the dead tuple list fits in a smaller autovacuum_work_mem. The total vacuum work is the same, but it is spread across more frequent, shorter, less memory-intensive passes.

This is the pragmatic alternative to increasing autovacuum_work_mem when you cannot afford the memory multiplication.

Takeaways

  1. maintenance_work_mem default (64MB) is too low for any serious database. Set it to at least 1GB for small instances, 2-4GB for medium ones, and higher for manual maintenance on large tables.

  2. Always set autovacuum_work_mem separately. The -1 default means autovacuum inherits maintenance_work_mem, which can lead to autovacuum_max_workers × maintenance_work_mem simultaneous allocation. Cap it based on (RAM - shared_buffers) / (2 × workers).

  3. The index cleanup phase is the bottleneck. VACUUM’s heap scan is fast. Index cleanup is where it stalls when memory is too low. Monitor with pg_stat_progress_vacuum and look for long “processing index” phases.

  4. CREATE INDEX benefits more than VACUUM. Index builds are almost always memory-bound sorts. Doubling maintenance_work_mem can halve CREATE INDEX time. Set it high before creating indexes on large tables.

  5. vacuum_buffer_usage_limit (PG17+) controls buffer cache impact. Keep it low (2-8MB) to prevent vacuum from evicting query pages from shared_buffers. Increase it during dedicated maintenance windows if you want faster vacuum.

  6. Lower autovacuum scale factors as an alternative to more memory. More frequent, smaller vacuum passes need less memory per pass and achieve the same cleanup with lower peak memory usage.

  7. Monitor with pg_stat_progress_vacuum and log_temp_files. These are your primary diagnostics for identifying maintenance operations that are memory-starved.