PostgreSQL Deep Dive: Nested Loop vs Hash Join vs Merge Join — How the Planner Picks
You have a query joining orders to customers. The planner chose a hash join. You forced a nested loop, and it was faster. You conclude the planner is wrong.
It probably isn’t — you just changed the data.
Yesterday we covered scan types — how Postgres reads rows from a single table. Today we go one level up: how Postgres combines rows from two tables. Joins are where the planner earns its keep, and picking the wrong strategy can turn a 5ms query into a 500ms one. Let’s look at the three join methods, when each one wins, and how to debug the planner when it picks wrong.
The Three Join Strategies
PostgreSQL implements three physical join algorithms. The planner picks between them for every join in your query:
| Join Type | Core Idea | Best When |
|---|---|---|
| Nested Loop | For each row in the outer table, scan the inner table for matches | One side is small, inner side has an index on the join key |
| Hash Join | Build a hash table from one side, probe with the other | Medium-to-large datasets, no useful index on the join key, equi-joins |
| Merge Join | Sort both sides on the join key, then walk them in parallel | Both sides are large, data is already sorted (or sort is cheap), equi-joins |
Every join in your EXPLAIN output will be one of these three. Let’s break them down.
Nested Loop Join
The simplest join algorithm — conceptually identical to two nested for loops:
for each row in outer_relation:
    for each row in inner_relation:
        if rows join:
            emit combined row
How It Actually Works
The planner designates one relation as the outer (left) and the other as the inner (right). For every outer row, the inner side is scanned. That inner scan can be any scan type — sequential, index, or bitmap — which makes nested loop surprisingly flexible.
Nested Loop (cost=0.29..25.53 rows=5 width=100)
-> Seq Scan on orders o (cost=0.00..15.00 rows=500 width=60)
-> Index Scan using idx_customers_id on customers c (cost=0.29..0.45 rows=1 width=40)
Index Cond: (id = o.customer_id)
Here, for each of the 500 orders rows, Postgres does an index lookup on customers. Total work: 500 index lookups. If the index lookup is fast (which it is for point queries), this is extremely efficient.
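That per-row lookup pattern can be sketched in plain Python, with a dict standing in for the index (a toy model, not Postgres code; the table names follow the example plan):

```python
# Toy index nested loop join: one dict (index) lookup per outer row.
customers = {1: "Alice", 2: "Bob", 3: "Carol"}   # inner side, "indexed" by id
orders = [(101, 2), (102, 2), (103, 3)]          # outer side: (order_id, customer_id)

def nested_loop_index_join(outer, inner_index):
    for order_id, customer_id in outer:          # single pass over the outer side
        name = inner_index.get(customer_id)      # point lookup on the join key
        if name is not None:                     # emit only matching pairs
            yield (order_id, customer_id, name)

result = list(nested_loop_index_join(orders, customers))
```

With 500 outer rows this does 500 dict probes, mirroring the 500 index lookups in the plan above.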
Cost Model
cost = outer_scan_cost + (outer_rows × inner_scan_cost_per_row)
The critical variable is inner_scan_cost_per_row. If the inner side can use an index scan, that cost is roughly random_page_cost (typically 4.0 for HDD, 1.1 for SSD). If the inner side has to seq scan, the cost is the entire inner table scan — repeated once per outer row. That gets expensive fast.
When Nested Loop Wins
- One side is tiny (e.g., a dimension table with 50 rows)
- Inner side has an index on the join column
- Very selective filters on the outer side (so few outer rows trigger inner scans)
- Parameterized nested loops where the outer value is used as an index key
When It Loses
- Both sides are large (cost scales as O(outer × inner))
- No index on the inner join column (every outer row triggers a full inner scan)
- The inner scan returns many rows per outer row
Memoize (PostgreSQL 14+)
A hidden optimization for parameterized nested loops: if the same join key value appears multiple times on the outer side, Postgres can cache the inner scan results in a Memoize node:
Nested Loop
-> Seq Scan on orders
-> Memoize
Cache Key: orders.customer_id
-> Index Scan on customers (id = orders.customer_id)
This turns redundant inner scans into hash lookups. The planner considers Memoize automatically when it sees a parameterized nested loop — no configuration needed. Controlled by enable_memoize (on by default).
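A toy model of that caching (plain Python, not Postgres internals; the counter stands in for the cost of each real inner scan):

```python
# Toy Memoize: cache inner-scan results per join key value.
customers = {1: "Alice", 2: "Bob"}
orders = [(101, 1), (102, 1), (103, 2), (104, 1)]   # customer_id 1 repeats

inner_scans = 0                     # how many times the inner side is really scanned

def inner_scan(customer_id):
    global inner_scans
    inner_scans += 1
    return customers.get(customer_id)

cache = {}                          # Cache Key: orders.customer_id
rows = []
for order_id, customer_id in orders:
    if customer_id not in cache:    # miss: run the parameterized inner scan once
        cache[customer_id] = inner_scan(customer_id)
    rows.append((order_id, cache[customer_id]))
```

Four outer rows, but only two distinct join keys, so only two real inner scans.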
Hash Join
The workhorse of analytical and medium-sized joins. Two phases: Build and Probe.
Phase 1: Build
Scan the smaller relation (the “build side”) and insert every row into an in-memory hash table, keyed on the join column.
Hash Join (cost=37.00..87.00 rows=1000 width=100)
Hash Cond: (o.customer_id = c.id)
-> Seq Scan on orders o (cost=0.00..30.00 rows=2000 width=60)
-> Hash
Buckets: 2048 Batches: 1 Memory Usage: 45kB
-> Seq Scan on customers c (cost=0.00..12.00 rows=500 width=40)
The hash table is sized to fit in work_mem per batch. If it doesn’t fit, PostgreSQL splits the data into batches (spilled to disk), which is where performance degrades sharply.
Phase 2: Probe
Scan the larger relation (the “probe side”). For each row, hash the join column and look it up in the hash table. Matches are emitted as joined rows.
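Both phases in a few lines of Python (a toy equi-join; `defaultdict(list)` handles duplicate build-side keys):

```python
# Toy hash join: build a hash table from the smaller side, probe with the larger.
from collections import defaultdict

def hash_join(build_rows, probe_rows):
    # Phase 1: Build: hash table keyed on the join column of the smaller side.
    table = defaultdict(list)
    for cust_id, name in build_rows:
        table[cust_id].append(name)
    # Phase 2: Probe: one near-O(1) hash lookup per probe row.
    out = []
    for order_id, cust_id in probe_rows:
        for name in table.get(cust_id, []):   # emit all matches for this key
            out.append((order_id, cust_id, name))
    return out

joined = hash_join([(1, "Alice"), (2, "Bob")],            # customers (build)
                   [(101, 2), (102, 1), (103, 3)])        # orders (probe)
```

Each side is scanned exactly once, which is why the total cost is roughly O(build + probe).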
Cost Model
cost = build_side_scan_cost + hash_build_cost
+ probe_side_scan_cost + (probe_rows × hash_lookup_cost)
The hash lookup is nearly O(1), so the total cost is roughly O(build + probe). This is dramatically better than nested loop’s O(outer × inner) when both sides are large.
The Batches Problem
When the hash table doesn’t fit in work_mem, PostgreSQL partitions both sides into batches by hash value of the join key. Each batch is processed separately — build that batch’s hash table, probe against it, then move to the next batch. Rows that belong to a later batch are written to temp files on disk and read back when their batch is processed, so a multi-batch hash join roughly doubles the I/O on both sides.
If you see Batches: 5 in your EXPLAIN output, the join spilled to disk across 5 batches. That’s a sign work_mem is too low for this join.
Hash
Buckets: 32768 Batches: 5 Memory Usage: 4096kB -- spilled to disk!
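Batching works because rows with equal join keys always hash into the same partition. A sketch of the idea (toy Python; in-memory lists stand in for the temp files Postgres writes):

```python
# Toy hash join batching: partition both sides on hash(key) % nbatches,
# then join each pair of partitions with its own small hash table.
def batched_hash_join(build_rows, probe_rows, nbatches=4):
    build_parts = [[] for _ in range(nbatches)]
    probe_parts = [[] for _ in range(nbatches)]
    for cust_id, name in build_rows:                  # partition the build side
        build_parts[hash(cust_id) % nbatches].append((cust_id, name))
    for order_id, cust_id in probe_rows:              # partition the probe side
        probe_parts[hash(cust_id) % nbatches].append((order_id, cust_id))
    out = []
    for bpart, ppart in zip(build_parts, probe_parts):
        table = {}                                    # per-batch hash table
        for cust_id, name in bpart:
            table.setdefault(cust_id, []).append(name)
        for order_id, cust_id in ppart:               # probe only within the batch
            for name in table.get(cust_id, []):
                out.append((order_id, cust_id, name))
    return out

batched = batched_hash_join([(1, "Alice"), (2, "Bob")],
                            [(101, 2), (102, 1), (103, 3)])
```

Equal keys land in the same partition, so joining partition pairs independently loses no matches; only the output order changes.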
When Hash Join Wins
- Equi-joins only (`=`, not `<=` or `BETWEEN`) — hash joins require equality
- No useful index on the join column (if there were, nested loop might be better)
- Both sides are medium-to-large
- The build side fits in `work_mem` (watch those batches!)
When It Loses
- The build side is enormous and spills to many batches
- An index on the join column would make nested loop faster
- The join condition isn’t equality (hash joins can’t do range conditions)
- `work_mem` is too small relative to the build side
Merge Join
The sorted merge — conceptually identical to the merge step in merge sort.
How It Works
Both sides are sorted on the join key (either via an explicit Sort node, or by reading from an index that already provides order). Then PostgreSQL walks both sorted streams in parallel:
Merge Join (cost=83.17..89.66 rows=500 width=100)
Merge Cond: (o.customer_id = c.id)
-> Index Scan using idx_orders_customer_id on orders o (cost=0.29..42.50 rows=2000 width=60)
-> Index Scan using idx_customers_id on customers c (cost=0.29..42.00 rows=500 width=40)
If both sides are already sorted (via index scans), there’s no sort overhead at all — just a single pass through each relation.
If the data isn’t sorted, PostgreSQL adds explicit Sort nodes:
Merge Join (cost=200.00..350.00 rows=500 width=100)
Merge Cond: (o.customer_id = c.id)
-> Sort
Sort Key: o.customer_id
-> Seq Scan on orders o
-> Sort
Sort Key: c.id
-> Seq Scan on customers c
Cost Model
cost = sort_left + sort_right + merge_pass_cost
Where sort_n is O(n log n) if an explicit sort is needed, or zero if the data comes pre-sorted from an index. The merge pass itself is O(left + right) — linear.
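The merge pass is a two-pointer walk over both sorted inputs. A simplified sketch (toy Python; it assumes pre-sorted inputs and handles duplicates by grouping equal keys up front, rather than rescanning as Postgres does):

```python
# Toy merge join: linear two-pointer walk over two inputs sorted on the join key.
from itertools import groupby

def merge_join(left_sorted, right_sorted, lkey, rkey):
    lgroups = [(k, list(g)) for k, g in groupby(left_sorted, key=lkey)]
    rgroups = [(k, list(g)) for k, g in groupby(right_sorted, key=rkey)]
    i = j = 0
    out = []
    while i < len(lgroups) and j < len(rgroups):
        lk, lrows = lgroups[i]
        rk, rrows = rgroups[j]
        if lk < rk:
            i += 1          # left key is behind: advance left
        elif lk > rk:
            j += 1          # right key is behind: advance right
        else:               # keys equal: emit all pairings, advance both
            out.extend((l, r) for l in lrows for r in rrows)
            i += 1
            j += 1
    return out

orders_sorted = [(101, 1), (102, 1), (103, 2)]          # sorted on customer_id
customers_sorted = [(1, "Alice"), (2, "Bob"), (3, "Carol")]  # sorted on id
pairs = merge_join(orders_sorted, customers_sorted,
                   lkey=lambda r: r[1], rkey=lambda r: r[0])
```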
Range Conditions and Joins
Merge join’s Merge Cond, like hash join’s Hash Cond, must be an equality on a mergejoinable (btree) operator; none of the three join methods can merge or hash on a pure inequality. If you write:
SELECT * FROM bookings b
JOIN rooms r ON b.room_id = r.id
AND b.check_in < r.available_until
AND b.check_out > r.available_from;
the planner joins on the equality (b.room_id = r.id) and evaluates the range conditions as a Join Filter on the matched rows, so any of the three join types can run this query. A join with only range conditions and no equality at all can’t use hash or merge join; it falls back to a nested loop, which is why pure range joins are slow on large tables.
When Merge Join Wins
- Both sides are already sorted (via indexes) — near-zero overhead
- Large datasets where hash join would spill excessively
- The sort cost is amortized by already needing sorted output (e.g., `ORDER BY` the join key)
When It Loses
- Both sides need explicit sorts and neither is large enough to justify the overhead
- An index makes nested loop faster for small result sets
- The hash table fits comfortably in memory (hash join is usually faster than sort+merge)
How the Planner Chooses
Now the important part. The planner estimates costs for all three strategies and picks the cheapest. Here’s the decision framework:
Small outer + indexed inner → Nested Loop
-- 10 rows from orders after filtering, index on customers.id
-- Cost: 10 × 1 index lookup = ~10 random I/Os
Nested Loop → Index Scan on customers
Large both sides, equi-join, no index → Hash Join
-- 10,000 orders × 5,000 customers, join on customer_id
-- Hash table: 5,000 rows (~200KB), easily fits in work_mem
-- Cost: scan both tables once + hash lookups
Hash Join → Seq Scan both sides
Pre-sorted data → Merge Join
-- Both sides have indexes on the join column,
-- providing sorted input with no explicit sort
-- Cost: two index scans + linear merge
Merge Join → Index Scan both sides
The Fallback: enable_ Flags
PostgreSQL provides GUC flags to force or disable specific join types for debugging:
-- Force nested loop (disable hash and merge)
SET enable_hashjoin = off;
SET enable_mergejoin = off;
-- Force hash join (disable nested loop and merge)
SET enable_nestloop = off;
SET enable_mergejoin = off;
-- See what the planner does when forced
EXPLAIN (ANALYZE, BUFFERS) SELECT ...
Important: Setting enable_ flags to off doesn’t truly disable the strategy — it just makes its cost astronomically high. If no other option exists, Postgres will still use it. This is by design — it’s a debugging tool, not a constraint system.
Cross-Checking: Your Estimate vs Reality
-- Check what the planner estimated vs what actually happened
EXPLAIN (ANALYZE, BUFFERS, FORMAT TEXT)
SELECT o.*, c.name
FROM orders o
JOIN customers c ON o.customer_id = c.id
WHERE o.created_at > '2026-01-01';
Look for:
- Rows vs Actual Rows: If the planner estimated 10 rows but got 50,000, it picked nested loop when it should have picked hash join. Time to `ANALYZE`.
- Batches > 1 in a hash join: `work_mem` is too small for this query.
- Sort spills: If the merge join’s sort spills to disk, hash join might be cheaper.
Join Ordering: The Hidden Variable
The join method is only half the decision. The planner also decides the join order — which table is outer vs inner, and which pairs are joined first.
For a query with 5 tables, there are 120 possible join orderings. For 10 tables, it’s 3,628,800. The planner uses dynamic programming to search this space, but only up to a point.
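Those counts are plain factorials of the relation count:

```python
# The planner's search space: n! possible join orderings for n relations.
import math

orderings_5 = math.factorial(5)     # 5 tables
orderings_10 = math.factorial(10)   # 10 tables
```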
join_collapse_limit and from_collapse_limit
By default, the planner will reorder joins freely, ignoring the order you wrote them in your SQL. The limit is controlled by join_collapse_limit (default 8). If your query has more than 8 tables being joined, the planner partially fixes the order.
-- Force the planner to follow your explicit JOIN order
SET join_collapse_limit = 1;
-- Let the planner fully explore all orderings (expensive for many tables)
SET join_collapse_limit = 50;
from_collapse_limit (default 8) controls how aggressively the planner pulls up subqueries into the main join tree.
GEQO: When There Are Too Many Tables
When a query has at least geqo_threshold (default 12) FROM items, PostgreSQL switches from exhaustive dynamic programming to a genetic algorithm — the Genetic Query Optimizer (GEQO). GEQO explores a random sample of join orderings and picks the best one found.
This means:
- Queries with 12+ tables get plans from a randomized search; results are repeatable with the default fixed geqo_seed, but changing the seed can change the plan
- GEQO isn’t always optimal — it’s a heuristic
- For critical queries with many joins, consider setting `join_collapse_limit = 1` and writing the optimal order yourself
-- Show non-default planner settings (e.g., geqo parameters) alongside the plan
EXPLAIN (SETTINGS) SELECT ...;
-- Disable GEQO for this session
SET geqo = off;
-- Tune GEQO (if you must keep it)
SET geqo_effort = 10; -- 1-10, higher = more thorough, slower planning
SET geqo_pool_size = 0; -- 0 = auto, sized from geqo_effort and the number of relations
SET geqo_generations = 0; -- 0 = auto
Practical Diagnostic Queries
1. Find queries where the planner picked the wrong join
-- From pg_stat_statements: look for queries with high execution time
-- Then run EXPLAIN ANALYZE to check if row estimates are off
SELECT query, mean_exec_time, calls
FROM pg_stat_statements
ORDER BY mean_exec_time DESC
LIMIT 10;
2. Check if row estimates are wildly off
-- Compare estimated vs actual for a specific query
EXPLAIN (ANALYZE, SUMMARY)
SELECT o.id, c.name
FROM orders o
JOIN customers c ON o.customer_id = c.id
WHERE o.status = 'pending';
-- Look for "rows=1000 (actual rows=3)" — huge discrepancy means stale stats
3. Force different join types and compare
BEGIN;
SET LOCAL enable_hashjoin = off;
SET LOCAL enable_mergejoin = off;
EXPLAIN (ANALYZE, BUFFERS, COSTS OFF) SELECT ...;
-- Note the actual total time
ROLLBACK;
BEGIN;
SET LOCAL enable_nestloop = off;
SET LOCAL enable_mergejoin = off;
EXPLAIN (ANALYZE, BUFFERS, COSTS OFF) SELECT ...;
ROLLBACK;
BEGIN;
SET LOCAL enable_nestloop = off;
SET LOCAL enable_hashjoin = off;
EXPLAIN (ANALYZE, BUFFERS, COSTS OFF) SELECT ...;
ROLLBACK;
4. Check work_mem adequacy for hash joins
-- If you see Batches > 1 in hash joins, check current work_mem
SHOW work_mem;
-- Temporarily increase for a session
SET work_mem = '256MB';
EXPLAIN (ANALYZE) SELECT ...;
-- If Batches drops to 1 and the query is faster, increase work_mem globally
The Gotchas
Stale Statistics → Wrong Join Choice
The #1 cause of bad join decisions is stale statistics. If pg_class.reltuples says a table has 1,000 rows but it actually has 1,000,000, the planner will pick nested loop (good for 1,000 × 1,000) instead of hash join (better for 1,000,000 × 1,000,000).
-- Force an ANALYZE on problem tables
ANALYZE orders;
ANALYZE customers;
-- Check if row estimates improved
EXPLAIN (ANALYZE) SELECT ...;
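The flip can be reproduced with toy numbers (the unit costs below are invented for illustration, not the planner’s real constants):

```python
# Toy arithmetic showing how a stale row estimate flips the join choice.
def nested_loop_cost(outer_rows):
    return outer_rows * 4.0                  # ~one random index probe per outer row

def hash_join_cost(outer_rows, inner_rows):
    # fixed build overhead + a cheap linear pass over both sides
    return 5000.0 + (outer_rows + inner_rows) * 0.01

estimated, actual = 1_000, 1_000_000         # what stats claim vs. the real count
nl_est, hj_est = nested_loop_cost(estimated), hash_join_cost(estimated, estimated)
nl_act, hj_act = nested_loop_cost(actual), hash_join_cost(actual, actual)
```

At the estimated 1,000 rows the nested loop looks cheaper; at the actual 1,000,000 it costs 160x more than the hash join.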
random_page_cost Skews Nested Loop
If random_page_cost is set to the default 4.0 but you’re on SSD, the planner overestimates the cost of nested loop’s index lookups and picks hash join instead. Lowering to 1.1 often makes nested loop competitive again.
-- SSD-appropriate setting
ALTER SYSTEM SET random_page_cost = 1.1;
SELECT pg_reload_conf();
Hash Join Batches on Large Build Side
If your build side (typically the smaller relation) doesn’t fit in work_mem, the hash join splits into batches. Every row that belongs to a later batch, on both the build and probe sides, is written to a temp file and read back when its batch is processed, so a multi-batch hash join pays substantial extra disk I/O. That can be slower than a merge join that sorts both sides once.
CTEs Prevent Join Reordering
Materialized CTEs (the default in Postgres 11 and earlier, or CTEs with AS MATERIALIZED in Postgres 12+) act as optimization fences. The planner cannot reorder joins across a CTE boundary.
-- In Postgres 12+, use NOT MATERIALIZED to allow optimization
WITH recent_orders AS NOT MATERIALIZED (
SELECT * FROM orders WHERE created_at > CURRENT_DATE - INTERVAL '30 days'
)
SELECT ro.*, c.name
FROM recent_orders ro
JOIN customers c ON ro.customer_id = c.id;
-- In Postgres 12+, a side-effect-free CTE referenced once is inlined even
-- without NOT MATERIALIZED; AS MATERIALIZED forces the fence and fixes the join order
Key Takeaways
- Nested loop is O(outer × inner) — great when the inner side uses an index and outer rows are few.
- Hash join is O(build + probe) — the default for medium-to-large equi-joins. Watch for batch spills.
- Merge join is O(sort_left + sort_right + merge) — best when the data is pre-sorted or you need sorted output anyway.
- Row estimate accuracy is the #1 factor in join selection. If estimates are wrong, the join choice is wrong. Run `ANALYZE`.
- `work_mem` controls hash join batching. Too small → batch spills → slow hash joins.
- `random_page_cost` should be 1.1 on SSDs, not the default 4.0. This directly affects nested loop cost estimates.
- Use `enable_` flags to test alternative join strategies, not to “fix” queries permanently.
- For queries with 12+ tables, GEQO takes over — consider `SET geqo = off` for critical queries.
What’s Next
Tomorrow we’ll look at generic vs custom plans for prepared statements — why PREPARE/EXECUTE can give you dramatically different (and sometimes worse) query plans, and how plan_cache_mode lets you control it.
This is Day 8 of the PostgreSQL Deep Dive series. Read Day 7 — Seq Scan vs Index Scan vs Bitmap Scan.