PostgreSQL deep dive: Window function internals, execution, sorting, and the spilling problem
You write a query with ROW_NUMBER() OVER (PARTITION BY customer_id ORDER BY order_date). It works on a small table. You run it against 200 million rows and the query takes 90 seconds. EXPLAIN ANALYZE shows a WindowAgg node with Sort Method: external merge Disk: 4800MB. Your window function is spilling to disk, and the cost is almost entirely I/O.
Window functions look like a thin layer on top of regular aggregation. Under the hood, PostgreSQL executes them through a fundamentally different path: a WindowAgg node that requires its input to be sorted, maintains per-partition state, and evaluates frame boundaries row by row. Getting the execution model right is what separates a window function query that finishes in seconds from one that spends minutes shuffling temp files.
The execution order
Window functions execute after GROUP BY and HAVING but before ORDER BY and LIMIT. The full logical pipeline:
FROM / JOIN → WHERE → GROUP BY → HAVING → window functions → DISTINCT → ORDER BY → LIMIT
This ordering explains the common error: you cannot put a window function in a WHERE clause. The filter runs before the window function computes, so the column does not exist yet. The same applies to GROUP BY and HAVING. You can reference window functions in the SELECT list and the final ORDER BY clause.
-- This works: window function in SELECT and ORDER BY
SELECT customer_id, order_date,
ROW_NUMBER() OVER (PARTITION BY customer_id ORDER BY order_date) as rn
FROM orders
ORDER BY rn DESC;
-- This fails: window function in WHERE
SELECT * FROM (
SELECT customer_id, order_date,
ROW_NUMBER() OVER (PARTITION BY customer_id ORDER BY order_date) as rn
FROM orders
) sub
WHERE rn <= 10; -- fine, the WHERE is on the outer query
The subquery wrapper works because the window function executes in the inner query, producing a regular column that the outer query can filter on.
The WindowAgg node
When the planner encounters a window function, it produces a WindowAgg node. This node sits above a Sort node (unless the input is already sorted). The Sort provides rows in the order required by the window specification: first by PARTITION BY, then by ORDER BY within each partition.
WindowAgg
-> Sort
-> Seq Scan on orders
The WindowAgg node processes rows one at a time. It maintains a buffer of rows for the current partition. When it sees a row with a new partition key, it advances the partition boundary and starts the new group. For each row, it evaluates the window function based on the current frame (if the query defines a frame) or the entire partition (if it does not).
The implication: the WindowAgg node must see all rows in a partition before it can emit results for that partition, when the window function requires the full partition context. ROW_NUMBER() needs to see all rows in the partition to assign correct numbers. RANK() and DENSE_RANK() need the same. Running aggregates like SUM(amount) OVER (ORDER BY order_date) only need the rows in the current frame, which can be smaller than the full partition.
Sort requirements and pre-sorted inputs
The Sort node below the WindowAgg is often the most expensive part of the plan. The sort must order by PARTITION BY columns first, then by the ORDER BY columns within the window definition.
SELECT department_id, employee_id, salary,
ROW_NUMBER() OVER (PARTITION BY department_id ORDER BY salary DESC) as rank
FROM employees;
-- The sort key is: (department_id, salary DESC)
If the table already has an index on (department_id, salary DESC), PostgreSQL can skip the sort entirely. An index scan in that order feeds directly into the WindowAgg node:
WindowAgg
-> Index Scan using idx_emp_dept_salary on employees
No Sort node. No temp files. The index provides the required ordering for free. If you can provide the sort order via an index, the WindowAgg node just streams through.
When there is no matching index, the Sort node allocates work_mem for an in-memory quicksort. If the data exceeds work_mem, it falls back to external merge sort, writing runs to temp files and merging them back. A 200 million row table with, say, 500 bytes per row produces roughly 100GB of data to sort. Even with generous work_mem, this spills heavily.
-- Check whether your sort is spilling
EXPLAIN (ANALYZE, BUFFERS)
SELECT customer_id, order_date,
ROW_NUMBER() OVER (PARTITION BY customer_id ORDER BY order_date) as rn
FROM orders;
-- WindowAgg (cost=... actual time=45012.234..90234.567 rows=200000000)
-- -> Sort (cost=... actual time=44890.123..71234.890 rows=200000000)
-- Sort Method: external merge Disk: 4800MB
-- Sort Key: customer_id, order_date
-- -> Seq Scan on orders
The Disk: 4800MB tells you the sort ran out of memory. Every megabyte of temp file I/O adds to your runtime. With spinning disks, the penalty is severe. With SSDs, it is less painful but still significant compared to an in-memory sort.
Multiple window functions with different specifications
When a single query contains multiple window functions with different PARTITION BY or ORDER BY clauses, PostgreSQL must handle multiple sort requirements.
SELECT
customer_id,
order_date,
amount,
ROW_NUMBER() OVER (PARTITION BY customer_id ORDER BY order_date) as cust_rn,
ROW_NUMBER() OVER (PARTITION BY region_id ORDER BY amount DESC) as region_rn
FROM orders;
These two window functions need different sort orders: (customer_id, order_date) and (region_id, amount DESC). PostgreSQL cannot satisfy both from a single sort. It handles this in two ways, depending on the version and the query structure.
Before PostgreSQL 15, the planner would typically add a second Sort node:
WindowAgg (for region_rn)
-> WindowAgg (for cust_rn)
-> Sort (by customer_id, order_date)
-> Sort (by region_id, amount DESC)
-> Seq Scan on orders
Each WindowAgg node gets its own sort. The executor sorts the data twice, which means the intermediate result must be fully materialized. This doubles the sort cost and the memory pressure.
PostgreSQL 15 introduced a significant improvement. When multiple window functions share the same PARTITION BY but differ in ORDER BY or frame specifications, the planner can use a single sort and re-sort within the WindowAgg node. It also improved how it handles the common case where one window specification is a prefix of another.
The practical advice: if you need multiple window functions, try to align their PARTITION BY and ORDER BY clauses. When all window functions share the same specification, PostgreSQL evaluates them in a single pass through the WindowAgg node. No re-sorting, no extra materialization.
-- Good: same PARTITION BY and ORDER BY, single WindowAgg pass
SELECT
order_date,
SUM(amount) OVER w as running_total,
AVG(amount) OVER w as running_avg,
COUNT(*) OVER w as running_count
FROM orders
WINDOW w AS (PARTITION BY customer_id ORDER BY order_date
ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW);
-- Worse: different PARTITION BY, requires multiple sorts
SELECT
customer_id,
ROW_NUMBER() OVER (PARTITION BY customer_id ORDER BY order_date),
RANK() OVER (PARTITION BY status ORDER BY amount DESC)
FROM orders;
Frame boundaries and memory behavior
A window frame defines the set of rows that contribute to each window function call. The default frame for functions with an ORDER BY is RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW. For functions without an ORDER BY, the default is ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING, which is the entire partition.
There are two frame modes: ROWS and RANGE. They behave differently and have different performance characteristics.
ROWS mode
ROWS counts rows physically. ROWS BETWEEN 3 PRECEDING AND CURRENT ROW means “the current row and the three rows before it.” This is straightforward and the frame size is fixed.
SELECT order_date, amount,
SUM(amount) OVER (
ORDER BY order_date
ROWS BETWEEN 3 PRECEDING AND CURRENT ROW
) as moving_sum
FROM orders;
For ROWS mode with a bounded frame, PostgreSQL maintains a small buffer of just the rows in the current frame. When the frame slides forward by one row, it adds the new row and removes the oldest row. Memory usage is proportional to the frame size, not the partition size. A ROWS BETWEEN 100 PRECEDING AND CURRENT ROW frame needs at most 101 rows in memory regardless of how large the partition is.
RANGE mode
RANGE operates on the ORDER BY value, not on row position. RANGE BETWEEN INTERVAL '3 days' PRECEDING AND CURRENT ROW includes all rows whose order_date is within 3 days of the current row, even if there are 10,000 of them.
SELECT order_date, amount,
SUM(amount) OVER (
ORDER BY order_date
RANGE BETWEEN INTERVAL '3 days' PRECEDING AND CURRENT ROW
) as trailing_3day_sum
FROM orders;
This is where memory usage can explode. If a customer places 5,000 orders in a single day, and your RANGE frame is 3 days, every row in that dense cluster has a frame covering all 5,000 rows. PostgreSQL must buffer the entire frame in memory. With large partitions and dense data, this can exceed work_mem and spill to temp files, even though the sort itself fits in memory.
The EXPLAIN output shows a WindowAgg node with runCondition and other details in newer PostgreSQL versions, but the spilling behavior shows up in temp file usage at the statement level:
-- Check for temp file usage at the statement level
EXPLAIN (ANALYZE, BUFFERS)
SELECT order_date, amount,
SUM(amount) OVER (
ORDER BY order_date
RANGE BETWEEN INTERVAL '7 days' PRECEDING AND CURRENT ROW
) as trailing_sum
FROM orders;
-- Look for temp blocks in Buffers output
-- Buffers: shared read=45023, temp read=12800 written=12800
The temp read=N written=N lines indicate temp file I/O. If you see large temp numbers on a window function query with a RANGE frame, the frame buffers are spilling.
UNBOUNDED PRECEDING
When the frame starts at UNBOUNDED PRECEDING (the default for functions with ORDER BY), PostgreSQL can use a running computation for certain aggregates. SUM(), COUNT(), and AVG() accumulate incrementally: add the new row’s value to the running total, emit the result, move on. This requires O(1) memory for the running state.
-- Running sum: O(1) memory for the aggregate state
SELECT order_date, amount,
SUM(amount) OVER (ORDER BY order_date
ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) as running_total
FROM orders;
However, if the frame also has an upper bound that is not CURRENT ROW, PostgreSQL must buffer all rows from UNBOUNDED PRECEDING to that upper bound. ROWS BETWEEN UNBOUNDED PRECEDING AND 1 FOLLOWING needs to keep the current row and the next row in the buffer, which is manageable. But complex frame specifications with both unbounded starts and variable ends can require the full partition to be buffered.
Aggregate window functions vs plain aggregates
The same aggregate function behaves differently as a plain aggregate versus a window aggregate. As a plain aggregate, SUM(amount) GROUP BY customer_id collapses rows: one output row per customer. As a window function, SUM(amount) OVER (PARTITION BY customer_id) preserves all rows, adding a sum column to each.
Internally, PostgreSQL uses the same aggregate transition function for both cases. The difference is the executor node: HashAggregate or GroupAggregate for plain aggregates, WindowAgg for window aggregates. The WindowAgg node maintains per-partition aggregate state, advancing or resetting it as partition boundaries change.
For simple aggregate window functions without an ORDER BY (meaning the frame is the entire partition), the WindowAgg node can process rows in a single pass. It accumulates the aggregate state for each partition and emits it with every row. When the partition changes, it resets the state and starts the new accumulation.
Running and peeking: the optimized frame cases
PostgreSQL 14 added an optimization for a common window function pattern: the “running” window with UNBOUNDED PRECEDING start and no explicit end (defaulting to CURRENT ROW). For certain aggregates like SUM(), COUNT(), and AVG(), the executor maintains a running value and updates it incrementally as each new row enters the frame.
PostgreSQL 15 extended this optimization to support “peeking” ahead, which enables the frame ROWS BETWEEN CURRENT ROW AND UNBOUNDED FOLLOWING. This is useful for reverse running totals and percentage-of-total calculations.
-- PostgreSQL 14+: optimized running sum
SUM(amount) OVER (ORDER BY order_date) as running_total
-- PostgreSQL 15+: optimized reverse sum
SUM(amount) OVER (ORDER BY order_date
ROWS BETWEEN CURRENT ROW AND UNBOUNDED FOLLOWING) as remaining_total
-- Percentage of total using both directions
SELECT
order_date,
amount,
SUM(amount) OVER (ORDER BY order_date) as running,
SUM(amount) OVER () as grand_total,
SUM(amount) OVER (
ORDER BY order_date
ROWS BETWEEN CURRENT ROW AND UNBOUNDED FOLLOWING
) as remaining
FROM orders;
Before these optimizations, PostgreSQL buffered the full partition for these cases. The optimization eliminates that buffering for supported aggregates, turning an O(partition_size) memory operation into an O(1) operation.
The DISTINCT ON alternative
Sometimes what you want from a window function is simply the first row per group. The ROW_NUMBER() ... WHERE rn = 1 pattern is the usual approach:
SELECT * FROM (
SELECT *, ROW_NUMBER() OVER (PARTITION BY customer_id ORDER BY order_date DESC) as rn
FROM orders
) sub
WHERE rn = 1;
This works, but it computes ROW_NUMBER() for every row in the table, sorts everything, and then throws away all but one row per partition. For large tables, the overhead is substantial.
DISTINCT ON is a PostgreSQL extension that achieves the same result more efficiently:
SELECT DISTINCT ON (customer_id) *
FROM orders
ORDER BY customer_id, order_date DESC;
The executor uses a GroupAggregate-like strategy: it sorts by (customer_id, order_date DESC) and keeps only the first row per group. It never assigns row numbers to rows it will discard. The performance difference can be significant on large tables because the executor can short-circuit per group rather than processing the full partition.
The trade-off is flexibility. DISTINCT ON only gives you the first row per group based on sort order. If you need the second row, or the top N rows per group, or multiple ranking columns, ROW_NUMBER() with a filter is the correct tool. But for the common “latest order per customer” pattern, DISTINCT ON is faster.
Parallel window function execution
PostgreSQL can execute window functions in parallel. The Gather node distributes rows to workers, each worker processes a subset of partitions, and the results are merged. The constraint is that all rows in a partition must be processed by the same worker. If the partition key correlates with the table’s physical ordering (e.g., partitioning by customer_id and the table is clustered by customer_id), the workers can each take a contiguous block of pages without overlapping.
If partitions are scattered randomly across the table, workers waste effort on partition boundary detection and communication. The parallel plan still works, but the efficiency gain is lower.
-- Parallel WindowAgg plan
EXPLAIN (ANALYZE)
SELECT customer_id, order_date,
ROW_NUMBER() OVER (PARTITION BY customer_id ORDER BY order_date) as rn
FROM orders;
-- Gather (workers: 4)
-- -> WindowAgg
-- -> Sort
-- -> Parallel Seq Scan on orders
Parallel execution helps most when the number of partitions is large and the data is evenly distributed. A table with 10 million rows and 5 million partitions (each with 2 rows) parallelizes well. A table with 10 million rows and 10 partitions (each with 1 million rows) does not: only 10 workers can each own one partition, and most workers sit idle.
Practical tuning checklist
When a window function query is slow, work through this list:
-
Is there a Sort node below the WindowAgg? Is it spilling? If so, increase
work_memfor this query or create an index that provides the required order. -
Are there multiple WindowAgg nodes each with their own Sort? Rewrite the query to align the window specifications, or use the
WINDOWclause to define a shared window and reference it multiple times. -
Are you using
RANGEwith a large window? Switch toROWSif the semantics are acceptable for your use case, or narrow the range. -
Are your partitions very large? Consider whether the query can be restructured to avoid buffering entire partitions.
-
If you only need the first row per group,
DISTINCT ONavoids the window function overhead entirely. -
Run
EXPLAIN (ANALYZE, BUFFERS)and look fortemp readandtemp writtenin the output. Large numbers indicate spilling.
A WindowAgg node processes sorted input, maintains per-partition state, and evaluates frame boundaries as rows arrive. Most performance problems trace back to one of two things: the sort below the WindowAgg is spilling to disk, or the frame is forcing PostgreSQL to buffer more rows than will fit in memory. Fix the sort or fix the frame, and the window function problem usually resolves itself.