PostgreSQL Deep Dive: Parallel Query Execution — When Multiple Cores Actually Help (And When They Don't)

You have an 8-core machine. A big analytical query is running single-threaded, using one CPU at 100% while the other seven sit idle. You think: “If Postgres used all 8 cores, this query would finish 8x faster.”

It won’t. But it might finish 3x faster — and under the right conditions, that’s a huge win. The problem is that parallel query in PostgreSQL is subtle, finicky, and sometimes makes things slower. Today we’re going deep into how parallel query actually works, when it helps, when it hurts, and how to tune it.

How Parallel Query Works

PostgreSQL has supported parallel query since version 9.6 (initially just parallel sequential scan). The feature has expanded significantly in every major release since. Here’s the high-level architecture:

  1. The planner decides a parallel plan is cheaper than a serial plan
  2. A Gather or Gather Merge node is placed in the plan tree
  3. At execution time, the leader process (your backend connection) requests background worker processes from the postmaster
  4. Workers execute the parallel portion of the plan simultaneously
  5. The leader collects results from all workers and returns them to the client
Here’s what the resulting plan looks like:

Gather  (cost=1000.00..217018.43 rows=1 width=97)
  Workers Planned: 2
  ->  Parallel Seq Scan on orders  (cost=0.00..216018.33 rows=1 width=97)
        Filter: (notes ~~ '%urgent%'::text)

The Workers Planned: 2 means the planner intends to use 2 workers plus the leader. The leader doesn’t just manage — it also executes the parallel plan alongside the workers, though its participation depends on how busy it is collecting tuples.

Gather vs Gather Merge

  • Gather: Collects tuples from workers in whatever order is convenient. Destroys any sort order. Used when the final result doesn’t need to be sorted.
  • Gather Merge: Each worker produces sorted output, and the leader merges them in order. Preserves sort order. Slightly more overhead than Gather, but avoids a separate Sort node above the Gather.

-- Gather: unordered output
Gather
  Workers Planned: 4
  -> Parallel Seq Scan on orders

-- Gather Merge: preserves sort order from workers
Gather Merge
  Workers Planned: 4
  -> Parallel Index Scan on orders (sort order maintained)

Parallel Scans

The parallel portion starts at the scan level. Not all scan types support parallelism:

Parallel Sequential Scan (PG 9.6+)

The original parallel scan. The table’s blocks are divided into ranges shared among workers. Each worker scans its assigned range, then requests more.

Parallel Seq Scan on orders
  Filter: (status = 'pending')

Best for: Large tables where you’re reading most of the data anyway. Not helpful for small tables (overhead exceeds benefit).
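The block-claiming scheme above can be sketched in a few lines of Python. This is a toy model, not planner code: real workers claim the next unclaimed chunk whenever they finish their current one (faster workers naturally take more); round-robin stands in for that here.

```python
def assign_blocks(total_blocks, n_workers, chunk_size):
    """Toy model of parallel seq scan block claiming: the next unclaimed
    chunk goes to whichever worker asks for it (round-robin here as a
    stand-in), so no worker owns a fixed 1/N slice of the table."""
    assignments = {w: [] for w in range(n_workers)}
    next_block, worker = 0, 0
    while next_block < total_blocks:
        end = min(next_block + chunk_size, total_blocks)
        assignments[worker].extend(range(next_block, end))  # worker claims this chunk
        next_block = end
        worker = (worker + 1) % n_workers  # next request comes from another worker
    return assignments

claims = assign_blocks(total_blocks=10, n_workers=3, chunk_size=2)
# every block is claimed exactly once, by some worker
```

The key property is that block ranges are handed out dynamically rather than pre-partitioned, which is also why skewed data (covered later) can still leave one worker doing most of the work: it gets the expensive chunks, not more chunks.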

Parallel Bitmap Heap Scan (PG 10+)

One process (the leader) scans the index and builds the bitmap. The heap scan portion — actually visiting the table pages — is then parallelized across workers.

Parallel Bitmap Heap Scan on orders
  Recheck Cond: (status = 'pending')
  ->  Bitmap Index Scan on idx_orders_status
        Index Cond: (status = 'pending')

Key detail: the index scan itself is not parallelized — only the heap access. This is because building the bitmap is inherently sequential, but the subsequent heap fetches can be divided.

Parallel B-Tree Index Scan (PG 11+)

Workers take turns reading from the B-tree. Each worker claims an index block, scans it, and returns tuples. Results are sorted within each worker.

Parallel Index Scan using idx_orders_created_at on orders
  Index Cond: (created_at > '2026-01-01')

Parallel Index-Only Scan (PG 11+)

Same as parallel index scan but avoids heap access when the visibility map permits. The ideal case — purely index-based, parallelized.

Parallel Joins

Once you have a parallel scan, the join strategies from yesterday’s post get interesting:

Parallel Nested Loop

The inner side is always non-parallel. But since the outer tuples are divided across workers, each worker does its own index lookups on the inner side. This is actually very efficient — the outer work is distributed, and the inner index lookups are cheap.

Nested Loop
  ->  Parallel Seq Scan on orders
        Filter: (status = 'pending')
  ->  Index Scan using idx_customers_id on customers
        Index Cond: (id = orders.customer_id)

Parallel Hash Join

Two variants:

  • Non-parallel hash join: Each worker builds its own complete copy of the hash table from the inner side. If the inner side has 5M rows, each of 4 workers builds a 5M-row hash table. Memory usage: 4 × hash_table_size.
  • Parallel hash join (PG 11+): Workers cooperatively build a single shared hash table. Much more memory-efficient and often faster.

-- Parallel hash join (look for "Parallel Hash")
Hash Join
  Hash Cond: (orders.customer_id = customers.id)
  ->  Parallel Seq Scan on orders
  ->  Parallel Hash
        ->  Parallel Seq Scan on customers

The Parallel Hash node is the giveaway — it means the hash table is shared across workers, not duplicated.
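The memory arithmetic is worth making explicit. A minimal sketch (an illustrative model, not planner code — it ignores the leader’s possible private copy and per-bucket overhead):

```python
def hash_join_memory(inner_table_bytes, n_workers, parallel_hash):
    """Illustrative memory model: without Parallel Hash, every worker
    builds a private copy of the inner-side hash table; with Parallel
    Hash, all workers cooperate on one shared table."""
    if parallel_hash:
        return inner_table_bytes            # single shared hash table
    return inner_table_bytes * n_workers    # one private copy per worker

inner = 500 * 1024**2                        # ~500 MB inner side
dup = hash_join_memory(inner, 4, parallel_hash=False)    # 4 private copies
shared = hash_join_memory(inner, 4, parallel_hash=True)  # 1 shared table
# dup is 4x shared: same join, 2 GB vs 500 MB of hash-table memory
```

That 4x difference is also why the shared variant can use a bigger hash table before spilling: the cooperating workers pool their memory budget instead of each fitting a full copy under their own limit.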

Parallel Merge Join

The inner side is always non-parallel and executed in full by each worker. This is the worst case for parallel joins — the sort work on the inner side is duplicated across every worker. If the inner side requires an expensive sort, parallel merge join can be slower than serial.

Parallel Aggregation

PostgreSQL uses a two-phase approach for parallel aggregation (PG 9.6+):

  1. Partial Aggregate: Each worker runs its own aggregate on its subset of data
  2. Finalize Aggregate: The leader re-aggregates the partial results
In EXPLAIN output:

Finalize Aggregate
  ->  Gather
        Workers Planned: 4
        ->  Partial Aggregate
              ->  Parallel Seq Scan on orders
                    Filter: (created_at > '2026-01-01')

This works because most aggregates are decomposable: the total sum is the SUM of the partial sums, and the total count is the SUM of the partial counts.
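The two phases can be sketched directly (a minimal Python model of partial/finalize, not PostgreSQL’s actual transition-state machinery):

```python
# Two-phase aggregation sketch: each "worker" aggregates its own chunk
# (Partial Aggregate), then the leader combines the partial states
# (Finalize Aggregate). This works because SUM/COUNT/AVG states combine.
def partial_agg(rows):
    return {"sum": sum(rows), "count": len(rows)}

def finalize_agg(partials):
    total = sum(p["sum"] for p in partials)
    count = sum(p["count"] for p in partials)
    return {"sum": total, "count": count, "avg": total / count}

chunks = [[1, 2, 3], [4, 5], [6, 7, 8, 9]]   # rows as divided among workers
result = finalize_agg([partial_agg(c) for c in chunks])
# sum=45, count=9, avg=5.0 — identical to aggregating the rows serially
```

Note that AVG needs (sum, count) as its partial state rather than a partial average — averaging averages would be wrong for unequal chunk sizes. That is exactly what a combine function encodes.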

Limitations: Parallel aggregation doesn’t work with:

  • DISTINCT or ORDER BY in aggregate calls
  • Ordered-set aggregates (e.g., PERCENTILE_CONT)
  • GROUPING SETS
  • Aggregates without a combine function (custom aggregates need COMBINEFUNC)

When Parallel Aggregation Doesn’t Help

If the number of groups is close to the number of input rows (e.g., grouping by a unique column), the Finalize Aggregate processes nearly as many rows as the Partial Aggregate. The planner knows this and usually avoids parallel aggregation in these cases.
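The degenerate case is easy to see with a sketch (hypothetical numbers, grouping by a unique key):

```python
def partial_group_count(chunk, key):
    """Rows a worker's Partial Aggregate emits: one per distinct group."""
    return len({key(r) for r in chunk})

rows = list(range(1000))                 # grouping by a unique column
chunks = [rows[i::4] for i in range(4)]  # split across 4 workers
partial_rows_out = sum(partial_group_count(c, key=lambda r: r) for c in chunks)
# partial_rows_out == 1000: every input row is its own group, so the
# Finalize Aggregate still has to process all 1000 rows — nothing saved
```

Contrast with grouping by a low-cardinality column like status: each worker would emit a handful of partial groups, and the finalize step would be trivial.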

Parallel Append (PG 10+)

When scanning partitioned tables or UNION ALL queries, PostgreSQL can use Parallel Append to distribute subplans across workers:

  • Regular Append: All workers process subplan 1 until complete, then move to subplan 2 together. Sequential and synchronized.
  • Parallel Append: Workers are spread across subplans — worker 1 scans partition A while worker 2 scans partition B simultaneously.

Gather
  Workers Planned: 4
  ->  Parallel Append
        ->  Parallel Seq Scan on orders_2026_q1
        ->  Parallel Seq Scan on orders_2026_q2
        ->  Parallel Seq Scan on orders_2026_q3
        ->  Parallel Seq Scan on orders_2026_q4

This is particularly powerful for partitioned tables where each partition can be scanned independently.

The Configuration Knobs

Worker Process Limits

-- Maximum background worker processes (server start, default: 8)
-- This is a hard ceiling for ALL background workers, not just parallel query
SHOW max_worker_processes;

-- Maximum parallel workers across all queries (default: 8)
SHOW max_parallel_workers;

-- Maximum parallel workers per Gather node (default: 2)
-- This is the key tuning knob for parallel query
SHOW max_parallel_workers_per_gather;

The relationship: max_parallel_workers_per_gather ≤ max_parallel_workers ≤ max_worker_processes

A query can only use as many workers as max_parallel_workers_per_gather allows, but the total across all running queries is capped by max_parallel_workers. If you have max_parallel_workers = 8 and max_parallel_workers_per_gather = 4, you can run two 4-worker queries simultaneously — but a third would get zero workers.
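That pool accounting can be sketched as follows (a simplified model of the global limit — real grants come from the postmaster, and a starved query simply launches fewer workers than planned):

```python
def grant_workers(requested, in_use, max_parallel_workers):
    """Sketch of the shared worker pool: a Gather gets whatever remains
    under max_parallel_workers; the shortfall is silent — the query
    still runs, just with fewer (or zero) workers."""
    return max(0, min(requested, max_parallel_workers - in_use))

in_use, grants = 0, []
for _ in range(3):                      # three 4-worker queries arrive
    got = grant_workers(4, in_use, max_parallel_workers=8)
    grants.append(got)
    in_use += got
# grants == [4, 4, 0]: the third query falls back to serial execution
```

This silent degradation is why EXPLAIN ANALYZE shows both Workers Planned and Workers Launched — the gap between them is exactly this pool running dry.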

Cost Parameters

The planner uses two cost parameters to decide whether parallelism is worth it:

-- Cost of launching worker processes (default: 1000)
-- High value = planner avoids parallel for cheap queries
SHOW parallel_setup_cost;

-- Cost of transferring one tuple from worker to leader (default: 0.1)
-- High value = planner avoids parallel for queries returning many rows
SHOW parallel_tuple_cost;

If your queries aren’t going parallel and you think they should, try lowering these:

SET parallel_setup_cost = 100;    -- default: 1000
SET parallel_tuple_cost = 0.001;  -- default: 0.1
EXPLAIN (ANALYZE) SELECT ...;

Table Size Thresholds

-- Minimum table size for parallel sequential scan (default: 8MB)
SHOW min_parallel_table_scan_size;

-- Minimum index size for parallel index scan (default: 512kB)
SHOW min_parallel_index_scan_size;

Tables smaller than min_parallel_table_scan_size won’t be considered for parallel scan. If you want parallel on smaller tables, lower this value — but remember that parallel overhead on tiny tables is almost never worth it.

Dynamic Worker Computation

PostgreSQL doesn’t just use max_parallel_workers_per_gather directly. It computes the optimal number based on table size:

workers = min(max_parallel_workers_per_gather,
              1 + log3(table_size / min_parallel_table_scan_size))

Where log3 is base-3 logarithm. This means:

  • An 8MB table: 1 worker (right at the threshold)
  • A 72MB table: 1 + log3(9) = 1 + 2 = 3 workers
  • A 648MB table: 1 + log3(81) = 1 + 4 = 5 workers (capped at max_parallel_workers_per_gather)
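The heuristic is easy to reproduce (a sketch of the formula above in Python, using table sizes in MB against the 8MB default threshold — not the planner’s page-based code):

```python
def planned_workers(table_mb, min_scan_mb=8, per_gather_cap=8):
    """Sketch of the size heuristic: 1 worker at the threshold,
    +1 each time the table triples in size, capped per Gather."""
    if table_mb < min_scan_mb:
        return 0                         # too small for a parallel scan
    workers, size = 1, min_scan_mb * 3
    while size <= table_mb and workers < per_gather_cap:
        workers += 1                     # another tripling, another worker
        size *= 3
    return workers

counts = [planned_workers(s) for s in (4, 8, 24, 72, 648)]
# counts == [0, 1, 2, 3, 5] — matching the examples above
```

The logarithmic growth is deliberate: doubling workers only helps while the scan is CPU-bound, so the planner adds them slowly as tables grow rather than linearly with size.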

You can override per-table:

-- Force a specific number of workers for a table
ALTER TABLE large_fact SET (parallel_workers = 8);

-- Disable parallel for a table entirely
ALTER TABLE small_dim SET (parallel_workers = 0);

When Parallel Query Hurts

1. Small Tables

The overhead of forking workers and coordinating them is non-trivial. For a table that fits in memory and returns in milliseconds, parallel execution adds latency without benefit. The min_parallel_table_scan_size threshold usually prevents this, but if you’ve lowered it, you’ll see regressions.

2. High Concurrency

Each parallel query with N workers consumes N slots from the shared worker pool (its leader is an ordinary backend, not a pooled worker). Under high concurrency, multiple parallel queries compete for the limited max_parallel_workers pool. A single heavy analytical query can starve other queries of their workers.

-- Under load, check how many workers are active
SELECT count(*) AS active_workers
FROM pg_stat_activity
WHERE backend_type = 'parallel worker';

3. Leader Bottleneck

The leader must collect all tuples from all workers. If the parallel portion produces millions of rows, the leader spends all its time copying tuples from worker shared memory — it can’t actually participate in the parallel work. You end up with N workers + 1 blocked leader, not N+1 parallel processes.

4. Skewed Data Distribution

If the table’s physical layout means one worker gets a disproportionately expensive range of blocks (e.g., all the matching rows are clustered in one area of the table), the other workers finish early and sit idle while one worker does most of the work.

5. Duplicated Inner-Side Work

In non-parallel hash joins and merge joins, each worker duplicates the inner-side work (building hash tables or sorting). If the inner side is expensive, this duplication can make the parallel plan slower than serial.

6. Subtransactions and Cursors

Parallel query is disabled inside:

  • Cursor definitions (DECLARE ... CURSOR FOR SELECT)
  • PL/pgSQL FOR ... IN SELECT loops (which use cursors internally)
  • Queries with FOR UPDATE/SHARE locks
  • Queries in transactions that have executed any SAVEPOINT
  • Queries in transactions that have rolled back to a savepoint
  • Serializable transactions on PG 11 and earlier (PG 12 lifted this restriction)
  • Any query that writes to the database (including CTEs with data-modifying statements)

This is a long list, and it catches people off guard. A query that goes parallel in a psql session might not go parallel inside a PL/pgSQL function because the function uses a cursor or the transaction has savepoints.

Practical Tuning

1. Check if your query is going parallel

EXPLAIN (ANALYZE, VERBOSE) SELECT COUNT(*) FROM orders;
-- Look for "Workers Planned" and "Workers Launched" in the output
-- If Workers Launched < Workers Planned, workers are being starved

2. Verify worker availability

SELECT name, setting, short_desc
FROM pg_settings
WHERE name IN ('max_worker_processes', 'max_parallel_workers',
               'max_parallel_workers_per_gather',
               'parallel_setup_cost', 'parallel_tuple_cost',
               'min_parallel_table_scan_size');

3. Per-worker statistics

EXPLAIN (ANALYZE, VERBOSE) SELECT COUNT(*) FROM orders;

-- With VERBOSE, you'll see per-worker details:
-- "Workers Planned: 4  Workers Launched: 4"
-- "Worker 0: actual rows=..."
-- "Worker 1: actual rows=..."
-- Check for uneven distribution

4. Tune per-table for your workload

-- For a large fact table that benefits from maximum parallelism
ALTER TABLE page_views SET (parallel_workers = 8);

-- For dimension tables where parallel overhead isn't worth it
ALTER TABLE countries SET (parallel_workers = 0);

-- Check current per-table settings
SELECT relname, reloptions
FROM pg_class
WHERE reloptions IS NOT NULL;

5. OLTP + Analytics on the same server

If you have mixed workloads, limit parallel workers to avoid starving OLTP:

-- Conservative: 2 workers per gather, 4 total
ALTER SYSTEM SET max_parallel_workers_per_gather = 2;
ALTER SYSTEM SET max_parallel_workers = 4;
SELECT pg_reload_conf();

This ensures analytical queries get some parallelism without consuming all resources.

Key Takeaways

  1. Parallel query uses background workers to execute portions of a plan simultaneously — but only the portion below the Gather/Gather Merge node.
  2. Gather collects unordered results; Gather Merge preserves sort order from workers.
  3. Worker count is computed from table size via log3 formula, capped by max_parallel_workers_per_gather.
  4. Parallel hash joins (PG 11+) with shared hash tables are much more efficient than the older duplicated-hash approach.
  5. High concurrency can starve workers — max_parallel_workers is a global limit shared across all queries.
  6. Many query contexts disable parallelism entirely: cursors, PL/pgSQL FOR loops, savepoints, FOR UPDATE, serializable transactions.
  7. parallel_setup_cost and parallel_tuple_cost are the planner’s main levers for deciding if parallelism is worth the overhead.
  8. Use ALTER TABLE ... SET (parallel_workers = N) for per-table control rather than changing global settings.

What’s Next

Tomorrow we’ll look at work_mem: what it actually controls — it’s not just “sort memory,” it affects hash tables, bitmap scans, memoize nodes, and more. And the per-operation multiplier means your actual memory usage is far higher than you think.

This is Day 10 of the PostgreSQL Deep Dive series. Read Day 9 — Generic vs Custom Plans for Prepared Statements.