PostgreSQL Deep Dive: Parallel Query Execution — When Multiple Cores Actually Help (And When They Don't)
You have an 8-core machine. A big analytical query is running single-threaded, using one CPU at 100% while the other seven sit idle. You think: “If Postgres used all 8 cores, this query would finish 8x faster.”
It won’t. But it might finish 3x faster — and under the right conditions, that’s a huge win. The problem is that parallel query in PostgreSQL is subtle, finicky, and sometimes makes things slower. Today we’re going deep into how parallel query actually works, when it helps, when it hurts, and how to tune it.
How Parallel Query Works
PostgreSQL has supported parallel query since version 9.6 (initially just parallel sequential scan). The feature has expanded significantly in every major release since. Here’s the high-level architecture:
- The planner decides a parallel plan is cheaper than a serial plan
- A Gather or Gather Merge node is placed in the plan tree
- At execution time, the leader process (your backend connection) requests background workers, which the postmaster launches
- Workers execute the parallel portion of the plan simultaneously
- The leader collects results from all workers and returns them to the client
Gather (cost=1000.00..217018.43 rows=1 width=97)
Workers Planned: 2
-> Parallel Seq Scan on orders (cost=0.00..216018.33 rows=1 width=97)
Filter: (notes ~~ '%urgent%'::text)
The Workers Planned: 2 means the planner intends to use 2 workers plus the leader. The leader doesn’t just manage — it also executes the parallel plan alongside the workers, though its participation depends on how busy it is collecting tuples.
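The leader's role can be tuned directly. Since PostgreSQL 11, the parallel_leader_participation setting controls whether the leader also executes the parallel portion of the plan:

-- PG 11+: should the leader also run the parallel portion of the plan?
SHOW parallel_leader_participation;  -- default: on
-- If the leader is saturated just collecting tuples, disabling
-- participation can occasionally help:
SET parallel_leader_participation = off;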
Gather vs Gather Merge
- Gather: Collects tuples from workers in whatever order is convenient. Destroys any sort order. Used when the final result doesn’t need to be sorted.
- Gather Merge: Each worker produces sorted output, and the leader merges them in order. Preserves sort order. Slightly more overhead than Gather, but avoids a separate Sort node above the Gather.
-- Gather: unordered output
Gather
Workers Planned: 4
-> Parallel Seq Scan on orders
-- Gather Merge: preserves sort order from workers
Gather Merge
Workers Planned: 4
-> Parallel Index Scan on orders (sort order maintained)
Parallel Scans
The parallel portion starts at the scan level. Not all scan types support parallelism:
Parallel Sequential Scan (PG 9.6+)
The original parallel scan. The table’s blocks are divided into ranges shared among workers. Each worker scans its assigned range, then requests more.
Parallel Seq Scan on orders
Filter: (status = 'pending')
Best for: Large tables where you’re reading most of the data anyway. Not helpful for small tables (overhead exceeds benefit).
Parallel Bitmap Heap Scan (PG 10+)
One process (the leader) scans the index and builds the bitmap. The heap scan portion — actually visiting the table pages — is then parallelized across workers.
Parallel Bitmap Heap Scan on orders
Recheck Cond: (status = 'pending')
-> Bitmap Index Scan on idx_orders_status
Index Cond: (status = 'pending')
Key detail: the index scan itself is not parallelized — only the heap access. This is because building the bitmap is inherently sequential, but the subsequent heap fetches can be divided.
Parallel B-Tree Index Scan (PG 11+)
Workers take turns reading from the B-tree. Each worker claims an index block, scans it, and returns tuples. Results are sorted within each worker.
Parallel Index Scan using idx_orders_created_at on orders
Index Cond: (created_at > '2026-01-01')
Parallel Index-Only Scan (PG 11+)
Same as parallel index scan but avoids heap access when the visibility map permits. The ideal case — purely index-based, parallelized.
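A sketch of the plan shape to look for in EXPLAIN (ANALYZE) output (hypothetical index and table names):

Gather
  Workers Planned: 2
  -> Parallel Index Only Scan using idx_orders_created_at on orders
       Index Cond: (created_at > '2026-01-01')
       Heap Fetches: 0

Heap Fetches: 0 confirms the visibility map allowed every tuple to be answered from the index alone.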
Parallel Joins
Once you have a parallel scan, the join strategies from yesterday’s post get interesting:
Parallel Nested Loop
The inner side is always non-parallel. But since the outer tuples are divided across workers, each worker does its own index lookups on the inner side. This is actually very efficient — the outer work is distributed, and the inner index lookups are cheap.
Nested Loop
-> Parallel Seq Scan on orders
Filter: (status = 'pending')
-> Index Scan using idx_customers_id on customers
Index Cond: (id = orders.customer_id)
Parallel Hash Join
Two variants:
- Non-parallel hash join: Each worker builds its own complete copy of the hash table from the inner side. If the inner side has 5M rows, each of 4 workers builds a 5M-row hash table. Memory usage: 4 × hash_table_size.
- Parallel hash join (PG 11+): Workers cooperatively build a single shared hash table. Much more memory-efficient and often faster.
-- Parallel hash join (look for "Parallel Hash")
Hash Join
Hash Cond: (orders.customer_id = customers.id)
-> Parallel Seq Scan on orders
-> Parallel Hash
-> Parallel Seq Scan on customers
The Parallel Hash node is the giveaway — it means the hash table is shared across workers, not duplicated.
Parallel Merge Join
The inner side is always non-parallel and executed in full by each worker. This is the worst case for parallel joins — the sort work on the inner side is duplicated across every worker. If the inner side requires an expensive sort, parallel merge join can be slower than serial.
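A sketch of the shape to watch for (hypothetical tables); note that the inner Sort below runs in full inside every worker:

Gather Merge
  Workers Planned: 2
  -> Merge Join
       Merge Cond: (orders.customer_id = customers.id)
       -> Sort
            Sort Key: orders.customer_id
            -> Parallel Seq Scan on orders
       -> Sort
            Sort Key: customers.id
            -> Seq Scan on customers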
Parallel Aggregation
PostgreSQL uses a two-phase approach for parallel aggregation (PG 9.6+):
- Partial Aggregate: Each worker runs its own aggregate on its subset of data
- Finalize Aggregate: The leader re-aggregates the partial results
Finalize Aggregate
-> Gather
Workers Planned: 4
-> Partial Aggregate
-> Parallel Seq Scan on orders
Filter: (created_at > '2026-01-01')
This works because most aggregates are decomposable: summing the workers' partial sums gives the total SUM, and summing their partial counts gives the total COUNT. AVG works by carrying a (sum, count) pair as its partial state.
Limitations: Parallel aggregation doesn’t work with:
- DISTINCT or ORDER BY in aggregate calls
- Ordered-set aggregates (e.g., PERCENTILE_CONT)
- GROUPING SETS
- Aggregates without a combine function (custom aggregates need COMBINEFUNC)
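For custom aggregates, supplying a combine function (and marking the aggregate parallel-safe) is what unlocks the two-phase plan. A minimal sketch with hypothetical names:

-- Hypothetical sum-of-squares aggregate that can participate in parallel plans
CREATE FUNCTION sumsq_step(numeric, numeric) RETURNS numeric
    LANGUAGE sql IMMUTABLE AS 'SELECT $1 + $2 * $2';

CREATE FUNCTION sumsq_combine(numeric, numeric) RETURNS numeric
    LANGUAGE sql IMMUTABLE AS 'SELECT $1 + $2';

CREATE AGGREGATE sumsq(numeric) (
    SFUNC       = sumsq_step,
    STYPE       = numeric,
    COMBINEFUNC = sumsq_combine,  -- merges partial states from workers
    INITCOND    = '0',
    PARALLEL    = SAFE
);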
When Parallel Aggregation Doesn’t Help
If the number of groups is close to the number of input rows (e.g., grouping by a unique column), the Finalize Aggregate processes nearly as many rows as the Partial Aggregate. The planner knows this and usually avoids parallel aggregation in these cases.
Parallel Append (PG 10+)
When scanning partitioned tables or UNION ALL queries, PostgreSQL can use Parallel Append to distribute subplans across workers:
- Regular Append: All workers process subplan 1 until complete, then move to subplan 2 together. Sequential and synchronized.
- Parallel Append: Workers are spread across subplans — worker 1 scans partition A while worker 2 scans partition B simultaneously.
Gather
Workers Planned: 4
-> Parallel Append
-> Parallel Seq Scan on orders_2026_q1
-> Parallel Seq Scan on orders_2026_q2
-> Parallel Seq Scan on orders_2026_q3
-> Parallel Seq Scan on orders_2026_q4
This is particularly powerful for partitioned tables where each partition can be scanned independently.
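The plan above assumes a layout along these lines (hypothetical schema); Parallel Append itself can be toggled with enable_parallel_append (default: on):

CREATE TABLE orders (
    id         bigint,
    created_at date NOT NULL
) PARTITION BY RANGE (created_at);

CREATE TABLE orders_2026_q1 PARTITION OF orders
    FOR VALUES FROM ('2026-01-01') TO ('2026-04-01');
-- ...orders_2026_q2 through orders_2026_q4 follow the same pattern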
The Configuration Knobs
Worker Process Limits
-- Maximum background worker processes (server start, default: 8)
-- This is a hard ceiling for ALL background workers, not just parallel query
SHOW max_worker_processes;
-- Maximum parallel workers across all queries (default: 8)
SHOW max_parallel_workers;
-- Maximum parallel workers per Gather node (default: 2)
-- This is the key tuning knob for parallel query
SHOW max_parallel_workers_per_gather;
The relationship: max_parallel_workers_per_gather ≤ max_parallel_workers ≤ max_worker_processes
A query can only use as many workers as max_parallel_workers_per_gather allows, but the total across all running queries is capped by max_parallel_workers. If you have max_parallel_workers = 8 and max_parallel_workers_per_gather = 4, you can run two 4-worker queries simultaneously; a third would launch zero workers and run entirely in its leader process.
Cost Parameters
The planner uses two cost parameters to decide whether parallelism is worth it:
-- Cost of launching worker processes (default: 1000)
-- High value = planner avoids parallel for cheap queries
SHOW parallel_setup_cost;
-- Cost of transferring one tuple from worker to leader (default: 0.1)
-- High value = planner avoids parallel for queries returning many rows
SHOW parallel_tuple_cost;
If your queries aren’t going parallel and you think they should, try lowering these:
SET parallel_setup_cost = 100; -- default: 1000
SET parallel_tuple_cost = 0.001; -- default: 0.1
EXPLAIN (ANALYZE) SELECT ...;
Table Size Thresholds
-- Minimum table size for parallel sequential scan (default: 8MB)
SHOW min_parallel_table_scan_size;
-- Minimum index size for parallel index scan (default: 512kB)
SHOW min_parallel_index_scan_size;
Tables smaller than min_parallel_table_scan_size won’t be considered for parallel scan. If you want parallel on smaller tables, lower this value — but remember that parallel overhead on tiny tables is almost never worth it.
Dynamic Worker Computation
PostgreSQL doesn’t just use max_parallel_workers_per_gather directly. It computes the optimal number based on table size:
workers = min(max_parallel_workers_per_gather,
              1 + floor(log3(table_size / min_parallel_table_scan_size)))
Where log3 is base-3 logarithm. This means:
- An 8MB table: 1 worker (just above threshold)
- A 72MB table: 1 + log3(9) = 1 + 2 = 3 workers
- A 648MB table: 1 + log3(81) = 1 + 4 = 5 workers (capped at max_parallel_workers_per_gather)
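You can approximate this computation in SQL. A rough sketch (the server actually works in 8kB blocks and also honors the per-table parallel_workers setting; 'orders' is a hypothetical table):

SELECT least(
         current_setting('max_parallel_workers_per_gather')::int,
         1 + floor(log(3, greatest(1, pg_table_size('orders')::numeric /
               pg_size_bytes(current_setting('min_parallel_table_scan_size')))))::int
       ) AS estimated_workers;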
You can override per-table:
-- Force a specific number of workers for a table
ALTER TABLE large_fact SET (parallel_workers = 8);
-- Disable parallel for a table entirely
ALTER TABLE small_dim SET (parallel_workers = 0);
When Parallel Query Hurts
1. Small Tables
The overhead of forking workers and coordinating them is non-trivial. For a table that fits in memory and returns in milliseconds, parallel execution adds latency without benefit. The min_parallel_table_scan_size threshold usually prevents this, but if you’ve lowered it, you’ll see regressions.
2. High Concurrency
Each parallel query claims N slots from the shared max_parallel_workers pool (the leader is an ordinary backend and doesn't occupy a worker slot). Under high concurrency, multiple parallel queries compete for this limited pool, and a single heavy analytical query can starve other parallel queries of their workers.
-- Under load, check how many workers are active
SELECT count(*) AS active_workers
FROM pg_stat_activity
WHERE backend_type = 'parallel worker';
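On PostgreSQL 13 and later, pg_stat_activity also exposes leader_pid, which lets you attribute workers to the backend that launched them:

-- PG 13+: group active parallel workers under their leader backend
SELECT leader_pid, count(*) AS workers
FROM pg_stat_activity
WHERE backend_type = 'parallel worker'
GROUP BY leader_pid;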
3. Leader Bottleneck
The leader must collect all tuples from all workers. If the parallel portion produces millions of rows, the leader spends all its time copying tuples from worker shared memory — it can’t actually participate in the parallel work. You end up with N workers + 1 blocked leader, not N+1 parallel processes.
4. Skewed Data Distribution
If the table’s physical layout means one worker gets a disproportionately expensive range of blocks (e.g., all the matching rows are clustered in one area of the table), the other workers finish early and sit idle while one worker does most of the work.
5. Duplicated Inner-Side Work
In non-parallel hash joins and merge joins, each worker duplicates the inner-side work (building hash tables or sorting). If the inner side is expensive, this duplication can make the parallel plan slower than serial.
6. Subtransactions and Cursors
Parallel query is disabled inside:
- Cursor definitions (DECLARE ... SELECT)
- PL/pgSQL FOR ... IN SELECT loops (which use cursors internally)
- Queries with FOR UPDATE/SHARE locks
- Queries in transactions that have executed any SAVEPOINT
- Queries in transactions that have rolled back to a savepoint
- Serializable isolation level transactions (a restriction lifted in PostgreSQL 12)
- Any query that writes to the database (including CTEs with data-modifying statements)
This is a long list, and it catches people off guard. A query that goes parallel in a psql session might not go parallel inside a PL/pgSQL function because the function uses a cursor or the transaction has savepoints.
Practical Tuning
1. Check if your query is going parallel
EXPLAIN (ANALYZE, VERBOSE) SELECT COUNT(*) FROM orders;
-- Look for "Workers Planned" and "Workers Launched" in the output
-- If Workers Launched < Workers Planned, workers are being starved
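To test whether a query is parallel-safe at all, independent of cost, there is a debug knob (named force_parallel_mode through PG 15, renamed debug_parallel_query in PG 16):

-- Forces a Gather on top of any parallel-safe plan, even when slower
SET force_parallel_mode = on;  -- PG 16+: SET debug_parallel_query = on;
EXPLAIN (ANALYZE) SELECT COUNT(*) FROM orders;
-- No Gather in the output means something in the query is parallel-restricted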
2. Verify worker availability
SELECT name, setting, short_desc
FROM pg_settings
WHERE name IN ('max_worker_processes', 'max_parallel_workers',
'max_parallel_workers_per_gather',
'parallel_setup_cost', 'parallel_tuple_cost',
'min_parallel_table_scan_size');
3. Per-worker statistics
EXPLAIN (ANALYZE, VERBOSE) SELECT COUNT(*) FROM orders;
-- With VERBOSE, you'll see per-worker details:
-- "Workers Planned: 4 Workers Launched: 4"
-- "Worker 0: actual rows=..."
-- "Worker 1: actual rows=..."
-- Check for uneven distribution
4. Tune per-table for your workload
-- For a large fact table that benefits from maximum parallelism
ALTER TABLE page_views SET (parallel_workers = 8);
-- For dimension tables where parallel overhead isn't worth it
ALTER TABLE countries SET (parallel_workers = 0);
-- Check current per-table settings
SELECT relname, reloptions
FROM pg_class
WHERE reloptions IS NOT NULL;
5. OLTP + Analytics on the same server
If you have mixed workloads, limit parallel workers to avoid starving OLTP:
-- Conservative: 2 workers per gather, 4 total
ALTER SYSTEM SET max_parallel_workers_per_gather = 2;
ALTER SYSTEM SET max_parallel_workers = 4;
SELECT pg_reload_conf();
This ensures analytical queries get some parallelism without consuming all resources.
Key Takeaways
- Parallel query uses background workers to execute portions of a plan simultaneously — but only the portion below the Gather/Gather Merge node.
- Gather collects unordered results; Gather Merge preserves sort order from workers.
- Worker count is computed from table size via the log3 formula, capped by max_parallel_workers_per_gather.
- Parallel hash joins (PG 11+) with shared hash tables are much more efficient than the older duplicated-hash approach.
- High concurrency can starve workers — max_parallel_workers is a global limit shared across all queries.
- Many query contexts disable parallelism entirely: cursors, PL/pgSQL FOR loops, savepoints, FOR UPDATE, serializable transactions.
- parallel_setup_cost and parallel_tuple_cost are the planner's main levers for deciding if parallelism is worth the overhead.
- Use ALTER TABLE ... SET (parallel_workers = N) for per-table control rather than changing global settings.
What’s Next
Tomorrow we’ll look at work_mem: what it actually controls — it’s not just “sort memory,” it affects hash tables, bitmap scans, memoize nodes, and more. And the per-operation multiplier means your actual memory usage is far higher than you think.
This is Day 10 of the PostgreSQL Deep Dive series. Read Day 9 — Generic vs Custom Plans for Prepared Statements.