PostgreSQL Deep Dive: Nested Loop vs Hash Join vs Merge Join — How the Planner Picks

You have a query joining orders to customers. The planner chose a hash join. You forced a nested loop, and it was faster. You conclude the planner is wrong.

It probably isn’t. More likely, the data changed and the statistics the planner relies on didn’t keep up.

Yesterday we covered scan types — how Postgres reads rows from a single table. Today we go one level up: how Postgres combines rows from two tables. Joins are where the planner earns its keep, and picking the wrong strategy can turn a 5ms query into a 500ms one. Let’s look at the three join methods, when each one wins, and how to debug the planner when it picks wrong.

The Three Join Strategies

PostgreSQL implements three physical join algorithms. The planner picks between them for every join in your query:

| Join Type   | Core Idea                                                         | Best When                                                                    |
|-------------|-------------------------------------------------------------------|------------------------------------------------------------------------------|
| Nested Loop | For each row in the outer table, scan the inner table for matches | One side is small, inner side has an index on the join key                   |
| Hash Join   | Build a hash table from one side, probe with the other            | Medium-to-large datasets, no useful index on the join key, equi-joins        |
| Merge Join  | Sort both sides on the join key, then walk them in parallel       | Both sides are large, data is already sorted (or sort is cheap), equi-joins  |

Every join in your EXPLAIN output will be one of these three. Let’s break them down.

Nested Loop Join

The simplest join algorithm — conceptually identical to two nested for loops:

for each row in outer_relation:
    for each row in inner_relation:
        if row joins:
            emit combined row
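The same algorithm in runnable form (a toy sketch over hypothetical in-memory rows; the real executor streams tuples and can use any scan type for the inner side):

```python
# Toy nested loop join: for every outer row, walk the entire inner side.
def nested_loop_join(outer, inner, outer_key, inner_key):
    result = []
    for o in outer:                        # outer relation: scanned once
        for i in inner:                    # inner relation: scanned per outer row
            if o[outer_key] == i[inner_key]:
                result.append({**o, **i})  # emit combined row
    return result

# Hypothetical sample rows
orders = [{"order_id": 1, "customer_id": 10},
          {"order_id": 2, "customer_id": 20}]
customers = [{"customer_id": 10, "name": "Ada"},
             {"customer_id": 20, "name": "Grace"}]

print(nested_loop_join(orders, customers, "customer_id", "customer_id"))
```

Replacing the inner loop with a dictionary lookup is the in-memory analogue of giving the inner side an index, which is exactly the case where nested loop shines.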

How It Actually Works

The planner designates one relation as the outer (left) and the other as the inner (right). For every outer row, the inner side is scanned. That inner scan can be any scan type — sequential, index, or bitmap — which makes nested loop surprisingly flexible.

Nested Loop  (cost=0.29..25.53 rows=5 width=100)
  ->  Seq Scan on orders o  (cost=0.00..15.00 rows=500 width=60)
  ->  Index Scan using idx_customers_id on customers c  (cost=0.29..0.45 rows=1 width=40)
        Index Cond: (id = o.customer_id)

Here, for each of the 500 orders rows, Postgres does an index lookup on customers. Total work: 500 index lookups. If the index lookup is fast (which it is for point queries), this is extremely efficient.

Cost Model

cost = outer_scan_cost + (outer_rows × inner_scan_cost_per_row)

The critical variable is inner_scan_cost_per_row. If the inner side can use an index scan, that cost is roughly random_page_cost (typically 4.0 for HDD, 1.1 for SSD). If the inner side has to seq scan, the cost is the entire inner table scan — repeated once per outer row. That gets expensive fast.
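Plugging illustrative numbers into that formula (assumed values, not taken from a real plan) shows how sharply the inner scan type matters:

```python
# Back-of-envelope nested loop costs; all numbers are illustrative assumptions.
outer_scan_cost = 15.0     # seq scan of the outer side
outer_rows = 500

# Indexed inner side: ~0.45 cost units per lookup (one cheap index probe)
cost_index_inner = outer_scan_cost + outer_rows * 0.45

# Unindexed inner side: a full ~12.0-cost seq scan repeated per outer row
cost_seq_inner = outer_scan_cost + outer_rows * 12.0

print(cost_index_inner)   # 240.0
print(cost_seq_inner)     # 6015.0
```

Same outer side, 25x cost difference, purely from the per-row cost of the inner scan.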

When Nested Loop Wins

  • One side is tiny (e.g., a dimension table with 50 rows)
  • Inner side has an index on the join column
  • Very selective filters on the outer side (so few outer rows trigger inner scans)
  • Parameterized nested loops where the outer value is used as an index key

When It Loses

  • Both sides are large (cost scales as O(outer × inner))
  • No index on the inner join column (every outer row triggers a full inner scan)
  • The inner scan returns many rows per outer row

Memoize (PostgreSQL 14+)

A hidden optimization for parameterized nested loops: if the same join key value appears multiple times on the outer side, Postgres can cache the inner scan results in a Memoize node:

Nested Loop
  ->  Seq Scan on orders
  ->  Memoize
        Cache Key: orders.customer_id
        ->  Index Scan on customers (id = orders.customer_id)

This turns redundant inner scans into hash lookups. The planner considers Memoize automatically when it sees a parameterized nested loop — no configuration needed. Controlled by enable_memoize (on by default).
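A dictionary cache keyed on the join value captures the idea (a hypothetical stand-in for the Memoize node, not its actual implementation, which also enforces a memory budget and evicts entries):

```python
# Memoize sketch: cache inner-side results so repeated outer keys skip the scan.
def lookup_inner(cache, inner_index, key):
    if key not in cache:
        # Cache miss: perform the (simulated) parameterized index scan
        cache[key] = inner_index.get(key, [])
    return cache[key]          # cache hit: a hash lookup, no inner scan at all

inner_index = {10: [{"customer_id": 10, "name": "Ada"}]}
cache = {}
lookup_inner(cache, inner_index, 10)   # miss: populates the cache
lookup_inner(cache, inner_index, 10)   # hit: served from the cache
print(cache)
```

The payoff grows with the number of duplicate join keys on the outer side: every repeat is a cache hit instead of an index scan.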

Hash Join

The workhorse of analytical and medium-sized joins. Two phases: Build and Probe.

Phase 1: Build

Scan the smaller relation (the “build side”) and insert every row into an in-memory hash table, keyed on the join column.

Hash Join  (cost=37.00..87.00 rows=1000 width=100)
  Hash Cond: (o.customer_id = c.id)
  ->  Seq Scan on orders o  (cost=0.00..30.00 rows=2000 width=60)
  ->  Hash
        Buckets: 2048  Batches: 1  Memory Usage: 45kB
        ->  Seq Scan on customers c  (cost=0.00..12.00 rows=500 width=40)

The hash table is sized to fit in work_mem per batch. If it doesn’t fit, PostgreSQL splits the data into batches (spilled to disk), which is where performance degrades sharply.

Phase 2: Probe

Scan the larger relation (the “probe side”). For each row, hash the join column and look it up in the hash table. Matches are emitted as joined rows.
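Both phases fit in a few lines of Python (a toy in-memory sketch with hypothetical rows; it assumes the build side fits in memory, i.e. Batches: 1):

```python
# Toy hash join: build a hash table from the smaller side, probe with the larger.
def hash_join(build_rows, probe_rows, build_key, probe_key):
    table = {}
    for b in build_rows:                       # Phase 1: build
        table.setdefault(b[build_key], []).append(b)
    out = []
    for p in probe_rows:                       # Phase 2: probe, ~O(1) per row
        for b in table.get(p[probe_key], []):
            out.append({**b, **p})             # emit matched pairs
    return out

customers = [{"id": 10, "name": "Ada"}]                # build side (smaller)
orders = [{"customer_id": 10, "total": 5},             # probe side (larger)
          {"customer_id": 10, "total": 7},
          {"customer_id": 99, "total": 1}]
print(hash_join(customers, orders, "id", "customer_id"))
```

Note that the unmatched order (customer_id 99) simply probes an empty bucket and is dropped, at constant cost.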

Cost Model

cost = build_side_scan_cost + hash_build_cost
     + probe_side_scan_cost + (probe_rows × hash_lookup_cost)

The hash lookup is nearly O(1), so the total cost is roughly O(build + probe). This is dramatically better than nested loop’s O(outer × inner) when both sides are large.

The Batches Problem

When the hash table doesn’t fit in work_mem, PostgreSQL partitions both sides into batches by hash value. The first batch is processed in memory; rows belonging to later batches are written to temporary files on disk. Each remaining batch is then processed in turn: load its build rows into the hash table, read its spilled probe rows back, and probe. The base tables are only scanned once, but most of the data takes an extra round trip through disk.

If you see Batches: 5 in your EXPLAIN output, roughly four-fifths of both inputs were written to temp files and read back. That’s a sign work_mem is too low for this join.

Hash
  Buckets: 32768  Batches: 5  Memory Usage: 4096kB  -- spilled to disk!
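A rough way to see how batch counts scale (a deliberate simplification; the executor's real sizing logic also accounts for bucket arrays, per-tuple overhead, and skew, so treat this as an approximation only):

```python
import math

# Approximate batch count: pick a power-of-two number of batches large
# enough that one batch of the build side fits in work_mem.
def estimated_batches(build_bytes, work_mem_bytes):
    if build_bytes <= work_mem_bytes:
        return 1
    return 2 ** math.ceil(math.log2(build_bytes / work_mem_bytes))

MB = 2 ** 20
print(estimated_batches(10 * MB, 4 * MB))   # 10MB build vs 4MB work_mem -> 4
print(estimated_batches(3 * MB, 4 * MB))    # fits in memory -> 1
```

The power-of-two rounding means batch counts jump in steps: doubling the build side (or halving work_mem) can quadruple the batches.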

When Hash Join Wins

  • Equi-joins only (=, not <= or BETWEEN) — hash joins require equality
  • No useful index on the join column (if there were, nested loop might be better)
  • Both sides are medium-to-large
  • The build side fits in work_mem (watch those batches!)

When It Loses

  • The build side is enormous and spills to many batches
  • An index on the join column would make nested loop faster
  • The join condition isn’t equality (hash joins can’t do range conditions)
  • work_mem is too small relative to the build side

Merge Join

The sorted merge — conceptually identical to the merge step in merge sort.

How It Works

Both sides are sorted on the join key (either via an explicit Sort node, or by reading from an index that already provides order). Then PostgreSQL walks both sorted streams in parallel:

Merge Join  (cost=83.17..89.66 rows=500 width=100)
  Merge Cond: (o.customer_id = c.id)
  ->  Index Scan using idx_orders_customer_id on orders o  (cost=0.29..42.50 rows=2000 width=60)
  ->  Index Scan using idx_customers_id on customers c  (cost=0.29..42.00 rows=500 width=40)

If both sides are already sorted (via index scans), there’s no sort overhead at all — just a single pass through each relation.
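The parallel walk looks like this in Python (a toy sketch over pre-sorted hypothetical rows; the real executor streams tuples and uses mark/restore to handle duplicate keys):

```python
from itertools import groupby

# Toy merge join: both inputs must already be sorted on the join key.
def merge_join(left, right, key):
    lg = [(k, list(g)) for k, g in groupby(left, key=lambda r: r[key])]
    rg = [(k, list(g)) for k, g in groupby(right, key=lambda r: r[key])]
    i = j = 0
    out = []
    while i < len(lg) and j < len(rg):
        if lg[i][0] < rg[j][0]:
            i += 1                              # left key behind: advance left
        elif lg[i][0] > rg[j][0]:
            j += 1                              # right key behind: advance right
        else:
            for l in lg[i][1]:                  # keys match: cross the groups
                for r in rg[j][1]:
                    out.append({**l, **r})
            i += 1
            j += 1
    return out

orders = [{"customer_id": 1, "o": "a"}, {"customer_id": 2, "o": "b"}]
customers = [{"customer_id": 1, "c": "x"}, {"customer_id": 3, "c": "y"}]
print(merge_join(orders, customers, "customer_id"))
```

Each input is consumed exactly once, which is why the merge pass is linear.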

If the data isn’t sorted, PostgreSQL adds explicit Sort nodes:

Merge Join  (cost=200.00..350.00 rows=500 width=100)
  Merge Cond: (o.customer_id = c.id)
  ->  Sort
        Sort Key: o.customer_id
        ->  Seq Scan on orders o
  ->  Sort
        Sort Key: c.id
        ->  Seq Scan on customers c

Cost Model

cost = sort_left + sort_right + merge_pass_cost

Where sort_n is O(n log n) if an explicit sort is needed, or zero if the data comes pre-sorted from an index. The merge pass itself is O(left + right) — linear.

Range Joins: A Common Pitfall

You might expect merge join to handle range joins, since sorted order seems natural for inequalities. In PostgreSQL it doesn’t: merge conditions, like hash conditions, must be equality clauses. Consider:

SELECT * FROM bookings b
JOIN rooms r ON b.room_id = r.id
  AND b.check_in < r.available_until
  AND b.check_out > r.available_from;

The equality on room_id can drive a hash or merge join; the range conditions are then evaluated as a Join Filter on each matched pair. A pure range join with no equality condition at all can only run as a nested loop, which is why such joins get painfully slow on large tables.

When Merge Join Wins

  • Both sides are already sorted (via indexes) — near-zero overhead
  • Large datasets where hash join would spill excessively
  • The sort cost is amortized by already needing sorted output (e.g., ORDER BY the join key)

When It Loses

  • Both sides need explicit sorts and neither is large enough to justify the overhead
  • An index makes nested loop faster for small result sets
  • The hash table fits comfortably in memory (hash join is usually faster than sort+merge)

How the Planner Chooses

Now the important part. The planner estimates costs for all three strategies and picks the cheapest. Here’s the decision framework:

Small outer + indexed inner → Nested Loop

-- 10 rows from orders after filtering, index on customers.id
-- Cost: 10 × 1 index lookup = ~10 random I/Os
Nested Loop → Index Scan on customers

Large both sides, equi-join, no index → Hash Join

-- 10,000 orders × 5,000 customers, join on customer_id
-- Hash table: 5,000 rows (~200KB), easily fits in work_mem
-- Cost: scan both tables once + hash lookups
Hash Join → Seq Scan both sides

Pre-sorted data → Merge Join

-- Both sides have indexes on the join column
-- Cost: two index scans + linear merge
Merge Join → Index Scan both sides

The Fallback: enable_ Flags

PostgreSQL provides GUC flags to force or disable specific join types for debugging:

-- Force nested loop (disable hash and merge)
SET enable_hashjoin = off;
SET enable_mergejoin = off;

-- Force hash join (disable nested loop and merge)
SET enable_nestloop = off;
SET enable_mergejoin = off;

-- See what the planner does when forced
EXPLAIN (ANALYZE, BUFFERS) SELECT ...

Important: Setting enable_ flags to off doesn’t truly disable the strategy — it just makes its cost astronomically high. If no other option exists, Postgres will still use it. This is by design — it’s a debugging tool, not a constraint system.

Cross-Checking: Your Estimate vs Reality

-- Check what the planner estimated vs what actually happened
EXPLAIN (ANALYZE, BUFFERS, FORMAT TEXT)
SELECT o.*, c.name
FROM orders o
JOIN customers c ON o.customer_id = c.id
WHERE o.created_at > '2026-01-01';

Look for:

  • Rows vs Actual Rows: If the planner estimated 10 rows but got 50,000, it picked nested loop when it should have picked hash join. Time to ANALYZE.
  • Batches > 1 in a hash join: work_mem is too small for this query.
  • Sort spills: If the merge join’s sort spills to disk, hash join might be cheaper.

Join Ordering: The Hidden Variable

The join method is only half the decision. The planner also decides the join order — which table is outer vs inner, and which pairs are joined first.

For a query with 5 tables, there are 120 possible join orderings. For 10 tables, it’s 3,628,800. The planner uses dynamic programming to search this space, but only up to a point.
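Those counts are just factorials of the table count (counting left-deep orderings; including bushy plan shapes grows the space even faster):

```python
import math

# Left-deep join orderings for n tables = n!
for n in (5, 10):
    print(n, "tables:", math.factorial(n), "orderings")
```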

join_collapse_limit and from_collapse_limit

By default, the planner will reorder joins freely, ignoring the order you wrote them in your SQL. The limit is controlled by join_collapse_limit (default 8). If your query has more than 8 tables being joined, the planner partially fixes the order.

-- Force the planner to follow your explicit JOIN order
SET join_collapse_limit = 1;

-- Let the planner fully explore all orderings (expensive for many tables)
SET join_collapse_limit = 50;

from_collapse_limit (default 8) controls how aggressively the planner pulls up subqueries into the main join tree.

GEQO: When There Are Too Many Tables

When a query has more than geqo_threshold (default 12) FROM items, PostgreSQL switches from exhaustive dynamic programming to a genetic algorithm — the Genetic Query Optimizer (GEQO). GEQO explores a random sample of join orderings and picks the best one found.

This means:

  • Queries with 12+ tables may get different plans on successive runs
  • GEQO isn’t always optimal — it’s a heuristic
  • For critical queries with many joins, consider setting join_collapse_limit = 1 and writing the optimal order yourself

-- Show any non-default planner settings that shaped this plan
EXPLAIN (SETTINGS) SELECT ...;

-- Disable GEQO for this session
SET geqo = off;

-- Tune GEQO (if you must keep it)
SET geqo_effort = 10;       -- 1-10, higher = more thorough, slower planning
SET geqo_pool_size = 0;     -- 0 = choose automatically based on geqo_effort
SET geqo_generations = 0;   -- 0 = choose automatically

Practical Diagnostic Queries

1. Find queries where the planner picked the wrong join

-- From pg_stat_statements: look for queries with high execution time
-- Then run EXPLAIN ANALYZE to check if row estimates are off
SELECT query, mean_exec_time, calls
FROM pg_stat_statements
ORDER BY mean_exec_time DESC
LIMIT 10;

2. Check if row estimates are wildly off

-- Compare estimated vs actual for a specific query
EXPLAIN (ANALYZE, SUMMARY)
SELECT o.id, c.name
FROM orders o
JOIN customers c ON o.customer_id = c.id
WHERE o.status = 'pending';
-- Look for "rows=1000 (actual rows=3)" — huge discrepancy means stale stats

3. Force different join types and compare

BEGIN;
SET LOCAL enable_hashjoin = off;
SET LOCAL enable_mergejoin = off;
EXPLAIN (ANALYZE, BUFFERS, COSTS OFF) SELECT ...;
-- Note the actual total time
ROLLBACK;

BEGIN;
SET LOCAL enable_nestloop = off;
SET LOCAL enable_mergejoin = off;
EXPLAIN (ANALYZE, BUFFERS, COSTS OFF) SELECT ...;
ROLLBACK;

BEGIN;
SET LOCAL enable_nestloop = off;
SET LOCAL enable_hashjoin = off;
EXPLAIN (ANALYZE, BUFFERS, COSTS OFF) SELECT ...;
ROLLBACK;

4. Check work_mem adequacy for hash joins

-- If you see Batches > 1 in hash joins, check current work_mem
SHOW work_mem;

-- Temporarily increase for a session
SET work_mem = '256MB';
EXPLAIN (ANALYZE) SELECT ...;
-- If Batches drops to 1 and the query is faster, increase work_mem globally

The Gotchas

Stale Statistics → Wrong Join Choice

The #1 cause of bad join decisions is stale statistics. If pg_class.reltuples says a table has 1,000 rows but it actually has 1,000,000, the planner may pick a nested loop that would have been fine at 1,000 rows and is catastrophic at 1,000,000, when a hash join was the right call.

-- Force an ANALYZE on problem tables
ANALYZE orders;
ANALYZE customers;

-- Check if row estimates improved
EXPLAIN (ANALYZE) SELECT ...;

random_page_cost Skews Nested Loop

If random_page_cost is set to the default 4.0 but you’re on SSD, the planner overestimates the cost of nested loop’s index lookups and picks hash join instead. Lowering to 1.1 often makes nested loop competitive again.

-- SSD-appropriate setting
ALTER SYSTEM SET random_page_cost = 1.1;
SELECT pg_reload_conf();

Hash Join Batches on Large Build Side

If your build side (typically the smaller relation) doesn’t fit in work_mem, the hash join splits into batches. Every batch beyond the first means writing rows to temp files and reading them back. A hash join with 10 batches pushes roughly nine-tenths of both inputs through disk, often slower than a merge join that sorts both sides once.

CTEs Prevent Join Reordering

Materialized CTEs (the default in Postgres 11 and earlier, or CTEs with AS MATERIALIZED in Postgres 12+) act as optimization fences. The planner cannot reorder joins across a CTE boundary.

-- In Postgres 12+, use NOT MATERIALIZED to allow optimization
WITH recent_orders AS NOT MATERIALIZED (
  SELECT * FROM orders WHERE created_at > CURRENT_DATE - INTERVAL '30 days'
)
SELECT ro.*, c.name
FROM recent_orders ro
JOIN customers c ON ro.customer_id = c.id;
-- Without NOT MATERIALIZED, the CTE is materialized and the join order is fixed

Key Takeaways

  1. Nested loop is O(outer × inner) — great when the inner side uses an index and outer rows are few.
  2. Hash join is O(build + probe) — the default for medium-to-large equi-joins. Watch for batch spills.
  3. Merge join is O(sort_left + sort_right + merge) — best when data is pre-sorted or you need sorted output anyway.
  4. Row estimate accuracy is the #1 factor in join selection. If estimates are wrong, the join choice is wrong. Run ANALYZE.
  5. work_mem controls hash join batching. Too small → batch spills → slow hash joins.
  6. random_page_cost should be 1.1 on SSDs, not the default 4.0. This directly affects nested loop cost estimates.
  7. Use enable_ flags to test alternative join strategies, not to “fix” queries permanently.
  8. For queries with 12+ tables, GEQO takes over — consider SET geqo = off for critical queries.

What’s Next

Tomorrow we’ll look at generic vs custom plans for prepared statements — why PREPARE/EXECUTE can give you dramatically different (and sometimes worse) query plans, and how plan_cache_mode lets you control it.

This is Day 8 of the PostgreSQL Deep Dive series. Read Day 7 — Seq Scan vs Index Scan vs Bitmap Scan.