PostgreSQL Deep dive: TOAST — The oversized-attribute storage technique
You’ve been warned about the 8KB page limit since your first PostgreSQL tutorial. “Rows can’t span pages,” they say. And then you go and INSERT a 2MB JSON document and it… works. No error. No complaint. The row just disappears into the table like nothing happened.
That’s TOAST (The Oversized-Attribute Storage Technique). It’s been part of PostgreSQL since version 7.1 (2001), touches nearly every query that involves variable-length data, and most DBAs never think about it until something goes wrong.
Today we’re pulling it apart.
The Problem: 8KB Pages and Unbounded Data
PostgreSQL stores every table in files divided into 8KB pages (configurable at compile time, but virtually never changed). Each page holds a fixed header, an array of line pointers, and the actual tuple data. A tuple must fit entirely within a single page. No spanning.
This works fine for integers, timestamps, and short strings. But text, bytea, jsonb, xml, and array types have no upper bound. A single column value can be gigabytes. Clearly, it can’t live in a page.
The naive approach would be to allow rows to span pages with pointer chains. MySQL’s BLOB/TEXT overflow works somewhat like this. PostgreSQL chose a different path. It keeps the main table lean, makes the buffer cache more effective, and only fetches the heavy data when you actually ask for it.
How TOAST Works: The Two-Phase Strategy
When a tuple is too large to fit in a page (by default, larger than ~2KB), the TOAST management code in access/common/toast_internals.c kicks in. It applies two transformations, in order:
Phase 1: Compression
First, it tries to compress each TOAST-able column value. PostgreSQL supports two compression algorithms:
- pglz — the original, available since TOAST was introduced. A variant of LZ compression. Slower, but universally available.
- lz4 — available since PostgreSQL 14 (when compiled with
--with-lz4). Significantly faster compression and decompression with reasonable ratios.
The compression method is controlled per-column with the COMPRESSION clause:
CREATE TABLE documents (
id SERIAL PRIMARY KEY,
content TEXT COMPRESSION lz4,
metadata JSONB COMPRESSION pglz
);
Or at the session/database level via default_toast_compression (default: pglz).
If the compressed value fits within the tuple target size, it’s stored inline. The varlena header’s second bit indicates compression. The length word stores the compressed size, and the original size is recorded in the header metadata. Decompression is transparent: any C function receiving a TOASTed value calls PG_DETOAST_DATUM before using it.
Phase 2: Out-of-Line Storage
If compression alone isn’t enough, TOAST moves the value to a secondary table (the TOAST table) and replaces it in the main row with an 18-byte pointer datum.
Every table with at least one TOAST-able column gets an associated TOAST table. Its OID is stored in pg_class.reltoastrelid. The TOAST table has three columns:
| Column | Type | Purpose |
|---|---|---|
chunk_id | OID | Identifies the TOASTed value (shared by all chunks of one value) |
chunk_seq | int4 | Sequence number within the value (0, 1, 2, …) |
chunk_data | bytea | The actual data chunk |
A unique index on (chunk_id, chunk_seq) enables fast retrieval. Values are split into chunks of at most TOAST_MAX_CHUNK_SIZE bytes (roughly 2000 bytes, chosen so four chunks fit on one page).
The pointer datum stored in the main row contains:
- The TOAST table OID
- The chunk_id (value OID)
- The original uncompressed size
- The physical stored size (may differ if compressed)
- The compression method used
Total pointer size: 18 bytes, regardless of whether the original value was 10KB or 1GB.
The Four Storage Strategies
Each TOAST-able column can be configured with one of four strategies via ALTER TABLE ... SET STORAGE:
PLAIN
No compression, no out-of-line storage. This is the only option for fixed-length types like integer, uuid, macaddr. You can’t set it on text or bytea; the system rejects it.
-- This is the default for fixed-length types
-- You don't set it manually
EXTENDED (default)
Both compression and out-of-line storage are available. This is the default for text, bytea, jsonb, xml, arrays, and all other variable-length types.
The algorithm:
- Try compression first
- If the compressed value fits in the tuple, store it inline
- If not, move it to the TOAST table (compressed)
- If compression didn’t help, move the uncompressed value out-of-line
It handles the vast majority of use cases well.
EXTERNAL
Out-of-line storage allowed, compression disabled.
Why would you want this? Substring operations.
When you run SUBSTRING(content, 1, 100) on a TOASTed text column with EXTENDED storage, PostgreSQL can fetch only the specific chunks that contain bytes 1–100. It knows the chunk size, it can calculate which chunk_seq values are needed, and it reads just those from the TOAST table.
With EXTENDED storage (compression enabled), PostgreSQL has to fetch all chunks, decompress the entire value, and then extract the substring. For a 500MB document where you only need the first 100 characters, that’s a massive difference.
-- Use EXTERNAL when you frequently do substring/byte-range operations
ALTER TABLE documents ALTER COLUMN content SET STORAGE EXTERNAL;
The trade-off: no compression means larger TOAST tables. You’re trading storage for query performance on partial reads.
MAIN
Compression allowed, out-of-line storage normally disabled. The value stays inline, compressed. Out-of-line storage is only used as a last resort, when there’s literally no other way to make the tuple fit on a page.
This is rarely used in practice. It means you’re saying “I’d rather have wider rows with compressed data than push values to the TOAST table.” The main use case is when you almost always select the column and want to avoid the TOAST pointer indirection, but the values are occasionally large enough to need overflow.
-- Rarely needed, but available
ALTER TABLE documents ALTER COLUMN metadata SET STORAGE MAIN;
The Trigger: TOAST_TUPLE_THRESHOLD and TOAST_TUPLE_TARGET
TOAST doesn’t look at every insert. It only activates when a tuple exceeds TOAST_TUPLE_THRESHOLD bytes (default: 2KB). Once activated, it compresses and moves values until the tuple is smaller than TOAST_TUPLE_TARGET (also default: 2KB).
Both defaults are set so approximately four tuples fit on one 8KB page, accounting for page headers and line pointers.
You can adjust TOAST_TUPLE_TARGET per table:
ALTER TABLE documents SET (toast_tuple_target = 4096);
Valid range: 128 bytes to (block_size - header), default max 8160 bytes. Increasing it means more data stays inline, but fewer tuples per page, which means more pages, more I/O for full table scans, and less effective use of the buffer cache.
Decreasing it (below 2KB) means more aggressive toasting: smaller main table rows, more values pushed to TOAST tables. This can help when your working set barely fits in shared_buffers and you want to maximize row density.
In practice, the default is almost always optimal. Don’t change it without benchmarking.
The Gotcha: You Can’t Index TOASTed Values
This is the one that bites people. Here’s the scenario:
CREATE TABLE events (
id SERIAL PRIMARY KEY,
payload JSONB,
created_at TIMESTAMPTZ DEFAULT NOW()
);
INSERT INTO events (payload)
VALUES ('{"description": "...a very long string...repeated 50000 times..."}');
-- Try to index the payload
CREATE INDEX idx_events_payload ON events USING gin (payload);
-- This works fine! GIN handles TOASTed values.
-- But try this:
CREATE INDEX idx_events_description ON events
((payload->>'description'));
-- Error: index row requires 11384 bytes, maximum size is 8191
B-tree indexes have a hard limit of approximately 2.7KB per index entry (one-third of a page, with safety margin). If your expression produces a value larger than that, the index creation fails.
The same applies to UNIQUE constraints and PRIMARY KEY constraints on large values:
-- This will fail if 'content' is large enough to be TOASTed
ALTER TABLE documents ADD CONSTRAINT unique_content UNIQUE (content);
Workarounds:
- Use a hash of the value instead:
UNIQUE (md5(content)) - Use a prefix:
UNIQUE (LEFT(content, 2000))— but this changes your uniqueness semantics - Use
pg_trgmGIN index for text search instead of B-tree - Normalize: store the large value separately, keep a short identifier in the indexed column
The Second Gotcha: TOAST and UPDATE Performance
When you UPDATE a row, PostgreSQL creates a new tuple version. If the row has TOASTed out-of-line values, here’s what happens:
- If you don’t change the TOASTed column: PostgreSQL copies the 18-byte TOAST pointer into the new tuple. No data is copied from the TOAST table. This is fast and cheap.
- If you change the TOASTed column: The old TOAST chunks are orphaned (cleaned up by vacuum later). New chunks are written. This is slow and generates WAL.
This matters for your schema design. If you have a table with a large payload column and you’re frequently updating a status column, put them in separate tables:
-- Bad: every status update touches the wide row
CREATE TABLE jobs (
id SERIAL PRIMARY KEY,
status TEXT DEFAULT 'pending',
payload JSONB -- could be megabytes
);
-- Good: status updates are cheap
CREATE TABLE jobs (
id SERIAL PRIMARY KEY,
status TEXT DEFAULT 'pending'
);
CREATE TABLE job_payloads (
job_id INT PRIMARY KEY REFERENCES jobs(id),
payload JSONB
);
In the “bad” schema, UPDATE jobs SET status = 'running' WHERE id = 1 creates a new copy of the tuple including the TOAST pointer. The TOAST pointer copy is cheap (18 bytes), but the tuple is still wider, meaning fewer fit per page, meaning more page reads for scans that include status.
In the “good” schema, the jobs table stays narrow. Status updates only touch small tuples. You only join to job_payloads when you need the payload.
The Third Gotcha: TOAST Tables Need Vacuuming Too
Every TOAST table has its own autovacuum settings. When you DELETE or UPDATE rows with out-of-line values, the old TOAST chunks become dead tuples in the TOAST table, not the main table.
You can see TOAST table statistics separately:
-- Find TOAST tables for your user tables
SELECT
c.relname AS main_table,
t.relname AS toast_table,
pg_size_pretty(pg_total_relation_size(t.oid)) AS toast_size,
pg_stat_get_dead_tuples(t.oid) AS toast_dead_tuples
FROM pg_class c
JOIN pg_class t ON t.oid = c.reltoastrelid
WHERE c.relkind = 'r'
AND c.relnamespace = 'public'::regnamespace
ORDER BY pg_total_relation_size(t.oid) DESC;
If autovacuum isn’t keeping up with your TOAST tables, they bloat just like regular tables. The toast.autovacuum_* storage parameters let you tune vacuum behavior for TOAST tables independently:
-- More aggressive vacuuming for a table with heavy TOAST churn
ALTER TABLE documents SET (
toast.autovacuum_vacuum_scale_factor = 0.05,
toast.autovacuum_analyze_scale_factor = 0.02
);
Practical SQL: Diagnosing TOAST Behavior
See which columns use which storage strategy
SELECT
c.table_name,
c.column_name,
c.data_type,
a.attstorage AS storage_strategy,
CASE a.attstorage
WHEN 'p' THEN 'PLAIN'
WHEN 'e' THEN 'EXTERNAL'
WHEN 'x' THEN 'EXTENDED'
WHEN 'm' THEN 'MAIN'
END AS storage_label
FROM information_schema.columns c
JOIN pg_attribute a ON a.attname = c.column_name
AND a.attrelid = c.table_name::regclass
WHERE c.table_schema = 'public'
AND c.table_name = 'documents'
ORDER BY c.ordinal_position;
Check TOAST pointer sizes vs actual data sizes
SELECT
pg_size_pretty(pg_column_size(content)) AS pointer_size,
pg_size_pretty(octet_length(content)) AS actual_size,
round(100.0 * pg_column_size(content) / NULLIF(octet_length(content), 0), 2)
AS size_ratio_pct
FROM documents
ORDER BY octet_length(content) DESC
LIMIT 20;
This reveals how effective TOAST compression is. A pointer size of 57 bytes for a 5MB JSON document means TOAST is working perfectly. If you see pointer sizes close to actual sizes, compression isn’t helping. Consider EXTERNAL storage if you do substring operations.
Find tables with the largest TOAST overhead
SELECT
c.relname AS table_name,
pg_size_pretty(pg_relation_size(c.oid)) AS main_size,
pg_size_pretty(pg_relation_size(c.reltoastrelid)) AS toast_size,
round(
100.0 * pg_relation_size(c.reltoastrelid) /
NULLIF(pg_relation_size(c.oid) + pg_relation_size(c.reltoastrelid), 0),
1
) AS toast_percentage
FROM pg_class c
WHERE c.relkind = 'r'
AND c.relnamespace = 'public'::regnamespace
AND c.reltoastrelid IS NOT NULL
ORDER BY pg_relation_size(c.reltoastrelid) DESC
LIMIT 10;
Check compression ratios for TOASTed data
-- Get the TOAST table for a specific table and inspect chunk sizes
SELECT
chunk_id,
COUNT(*) AS chunk_count,
pg_size_pretty(SUM(octet_length(chunk_data))) AS total_chunk_size,
pg_size_pretty(AVG(octet_length(chunk_data))) AS avg_chunk_size
FROM pg_toast.pg_toast_<your_table_oid>
GROUP BY chunk_id
ORDER BY SUM(octet_length(chunk_data)) DESC
LIMIT 10;
(Replace <your_table_oid> with the OID from SELECT oid FROM pg_class WHERE relname = 'documents'.)
The Expanded Representation: TOAST in Memory
There’s a lesser-known side of TOAST that operates entirely in memory: expanded TOAST pointers.
Certain complex types have an on-disk format that’s compact but terrible for random access. Arrays are the primary example. To find element N in a varlena array, you have to walk through all preceding elements because each element can be variable-length.
The “expanded” representation deconstructs the array into a structure where all element offsets are pre-computed. A TOAST pointer in memory can point to this expanded representation instead of the flat varlena. Functions that know about expanded representations can work with them directly, avoiding repeated deconstruction.
This is read-write vs read-only: a function receiving a read-write expanded pointer can modify it in-place, avoiding copies. A function receiving a read-only pointer must copy before modifying.
This is entirely transparent to SQL users but explains why some array operations are faster than you’d expect given the on-disk format.
Key Takeaways
- TOAST is transparent. You don’t opt in — it happens automatically for any TOAST-able type when the tuple exceeds ~2KB.
- The default EXTENDED strategy is almost always correct. It compresses first, then moves to TOAST tables if needed.
- Switch to EXTERNAL when you do frequent substring/byte-range operations on large
textorbyteacolumns. You lose compression but gain the ability to fetch specific chunks. - B-tree index entries can’t exceed ~2.7KB. Large TOASTed values will cause index creation to fail. Use hashes, prefixes, or separate tables as workarounds.
- Separate wide columns from frequently-updated narrow columns. Even though TOAST pointers are only 18 bytes, keeping them out of your hot rows improves page density and scan performance.
- TOAST tables need vacuuming. Monitor
toast_dead_tuplesand consider tuningtoast.autovacuum_*parameters for tables with heavy write churn. - lz4 compression is faster than pglz with comparable ratios. If you’re on PostgreSQL 14+, use
COMPRESSION lz4for large text/JSONB columns.
What’s Next
We’re staying in the storage layer. Tomorrow we’ll look at the Free Space Map, the data structure that tells PostgreSQL where to put new tuples. We’ll cover why it used to be a bottleneck (remember max_fsm_pages?) and how the redesign in PostgreSQL 8.4 changed everything.
Previous: WAL Buffers and Checkpoint Tuning