PostgreSQL Deep Dive

PostgreSQL Deep Dive: array types, GIN indexing, and when arrays beat normalized tables

You have a tags column that holds a handful of labels per row — ARRAY['python', 'ml', 'data-science']. You write WHERE tags @> ARRAY['python'] and it returns instantly with a GIN index. Then someone asks for “all rows tagged with exactly three tags.” You write WHERE array_length(tags, 1) = 3 and it does a full sequential scan on 20 million rows. The GIN index helps you find elements. It cannot help you count them.

Arrays in PostgreSQL are a multi-value column type, not a hack, not an afterthought, and not a substitute for normalization. They have defined storage semantics, index support, and gotchas that only appear under load.

How PostgreSQL stores arrays

Arrays are stored as a varlena datum: a variable-length value with a header followed by the element data. The header contains the number of dimensions, the length of each dimension, and the lower bounds (which default to 1 but can be set to any integer).

Elements are stored sequentially in memory in row-major order. For a 2D array ARRAY[[1,2,3],[4,5,6]], the storage layout is 1, 2, 3, 4, 5, 6 — elements of the first row, then elements of the second row. There are no per-dimension metadata blocks between rows; it’s a flat array of element values.

For fixed-length element types (int4, int8, float8, numeric with fixed precision), elements are stored directly with no padding between them. For variable-length types (text, varchar, jsonb), each element is stored as a varlena datum with its own length header, meaning the array contains a sequence of length-prefixed values.

-- Storage comparison
SELECT pg_column_size(ARRAY[1, 2, 3, 4, 5]::int[]);
-- 40 bytes (header + 5 × 4-byte integers)

SELECT pg_column_size(ARRAY['a', 'bb', 'ccc', 'dddd', 'eeeee']::text[]);
-- 88 bytes (header + 5 varlena text values)

SELECT pg_column_size(ARRAY[1, 2, 3]::int[] || ARRAY[4, 5, 6]::int[]);
-- 56 bytes (header + 6 integers)

-- NULL elements take space for the null flag
SELECT pg_column_size(ARRAY[1, NULL, 3]::int[]);
-- 40 bytes (same as 3 non-null integers — null flag is in a bitmask, not per-element)

The null bitmap is a separate block after the header. Each element gets one bit: 0 for null, 1 for non-null. This means NULL elements don’t save space — the slot is still allocated, and the null bitmap adds overhead for arrays that contain any null values.

Array operators: containment, overlap, and equality

PostgreSQL provides three core comparison operators for arrays:

@> (contains) — every element of the right array appears in the left array:

SELECT ARRAY[1, 2, 3, 4, 5] @> ARRAY[2, 4];
-- t

SELECT ARRAY[1, 2, 3] @> ARRAY[2, 4];
-- f

<@ (contained by) — every element of the left array appears in the right array. The reverse of @>.

&& (overlaps) — the arrays have at least one element in common:

SELECT ARRAY[1, 2, 3] && ARRAY[3, 4, 5];
-- t

SELECT ARRAY[1, 2, 3] && ARRAY[4, 5, 6];
-- f

= (equality) — exact match: same elements, same order, same dimensions, same length:

SELECT ARRAY[1, 2, 3] = ARRAY[1, 2, 3];
-- t

SELECT ARRAY[1, 2, 3] = ARRAY[3, 2, 1];
-- f  (order matters)

SELECT ARRAY[1, 2, 3] = ARRAY[1, 2, 3, 4];
-- f  (length matters)

This is the critical distinction from JSONB: array equality is order-sensitive and dimension-sensitive. Two arrays with the same elements in different order are not equal. Two arrays with different dimensions are not equal. This matters for DISTINCT, GROUP BY, and UNION:

-- These are treated as different values
SELECT DISTINCT unnest(ARRAY[
    ARRAY[1, 2, 3],
    ARRAY[3, 2, 1],
    ARRAY[[1, 2, 3]],
    ARRAY[1, 2, 3, NULL]
]);
-- {1,2,3}
-- {3,2,1}
-- {{1,2,3}}
-- {1,2,3,NULL}

If you need set semantics (order-independent equality), sort the array first:

SELECT ARRAY(SELECT unnest(ARRAY[3, 2, 1]) ORDER BY 1) = ARRAY[1, 2, 3];
-- t

GIN indexing with array_ops

Arrays support GIN indexing through the array_ops operator class (the default and only option for plain arrays):

CREATE INDEX idx_tags ON articles USING gin (tags);

-- Uses GIN index:
SELECT * FROM articles WHERE tags @> ARRAY['python'];

-- Uses GIN index:
SELECT * FROM articles WHERE tags && ARRAY['python', 'rust'];

-- Does NOT use GIN index:
SELECT * FROM articles WHERE 'python' = ANY(tags);

-- Does NOT use GIN index:
SELECT * FROM articles WHERE array_length(tags, 1) = 3;

The GIN array_ops operator class supports @>, <@, &&, and =. Each unique element in the array gets its own posting list — the GIN index stores a mapping from element value to the set of row IDs (TIDs) that contain that element. A query like tags @> ARRAY['python', 'ml'] performs two element lookups in the GIN index and intersects the posting lists.

The = operator on full arrays is supported but rarely useful with a GIN index — equality requires matching the entire array, and the GIN index would need to intersect all element posting lists, which is only efficient if the array has very few elements.

The ANY() construct and array_length() function do not use GIN indexes. To get index usage, rewrite with the containment or overlap operators:

-- Instead of this:
SELECT * FROM articles WHERE 'python' = ANY(tags);

-- Use this (equivalent, uses GIN):
SELECT * FROM articles WHERE tags @> ARRAY['python'];

-- Or this (also uses GIN):
SELECT * FROM articles WHERE ARRAY['python'] && tags;

The @> and && forms are semantically equivalent for single-element arrays, but @> is the conventional choice because it reads more naturally as “the tags contain ‘python’.”

The subscript assignment trap: why UPDATE … SET col[i] = … is expensive

PostgreSQL allows updating individual array elements via subscript syntax:

UPDATE articles SET tags[3] = 'django' WHERE id = 42;

This looks like a surgical update to a single element. It is not. PostgreSQL does not have in-place array element updates. The entire array is read, the element is modified, and the entire array is written back as a new tuple. This is a full row update with all the MVCC consequences: a new tuple version, a dead tuple for vacuum to clean, WAL logging for the entire row, and index updates for every index on the table.

-- This is a full row rewrite:
UPDATE events SET participants[5] = 'alice' WHERE id = 1;

-- This is also a full row rewrite:
UPDATE events SET schedule[1:3] = ARRAY[9, 10, 11] WHERE id = 1;

-- The only way to modify an array without a full row rewrite
-- would be to store it in a separate table (normalization)

For arrays with a handful of elements (tags, categories, small lists), this is irrelevant — the array is small, the row is small, and the rewrite is cheap. For arrays with hundreds or thousands of elements (stored as pre-computed aggregates, cached computation results, or batched data), updating a single element by rewriting the entire array is wasteful.

The concatenation operators (||) have the same behavior:

-- Appending to an array is a full rewrite
UPDATE articles SET tags = tags || 'new-tag' WHERE id = 42;

If you’re frequently appending to or modifying large arrays, you probably want a normalized table instead.

unnest() and array expansion

The unnest() function expands an array into a set of rows, one per element. This is the bridge between array storage and relational queries:

-- Expand tags into rows
SELECT unnest(ARRAY['python', 'ml', 'data-science']);
-- python
-- ml
-- data-science

-- Join with unnest to query the expanded elements
SELECT a.id, a.title, tag
FROM articles a,
     LATERAL unnest(a.tags) AS tag
WHERE tag = 'python';

unnest() in a LATERAL join is one of the most common array patterns. It converts the multi-value column into rows, enabling joins, group by, and all standard relational operations on the individual elements.

PostgreSQL 17 added unnest() with multiple arrays (already available as unnest(a, b) but now with proper padding for unequal lengths):

SELECT *
FROM unnest(
    ARRAY[1, 2, 3],
    ARRAY['a', 'b']
);
-- 1  | a
-- 2  | b
-- 3  | NULL

The shorter array is padded with NULLs. Before PG17, unnest() with multiple arrays required the arrays to have the same length.

array_agg(): the reverse operation

array_agg() is the aggregate function that collects values into an array. It’s the inverse of unnest():

-- Collect tags from a normalized table
SELECT article_id, array_agg(tag ORDER BY tag) AS tags
FROM article_tags
GROUP BY article_id;

-- Note the ORDER BY inside aggregate — this produces sorted arrays
-- Without it, element order is nondeterministic

The ORDER BY clause inside array_agg() is important because array equality is order-sensitive. If you’re building arrays from normalized data and later comparing them with =, you need deterministic ordering. Adding ORDER BY inside the aggregate ensures consistent results.

array_agg(DISTINCT ...) removes duplicate elements:

SELECT article_id, array_agg(DISTINCT tag ORDER BY tag) AS tags
FROM article_tags
GROUP BY article_id;

When arrays are the right tool (and when they aren’t)

Arrays work well when:

  • The number of elements is small and bounded — tags, categories, status flags, permission sets. A handful of values per row, rarely exceeding 20-30 elements.
  • The array is treated as an atomic value — you read it, write it, and query by containment. You don’t frequently update individual elements.
  • You need containment queries — “does this row contain X” is the canonical GIN-indexed array query.
  • The array doesn’t need its own metadata — if each element needs its own columns (created_at, weight, metadata), use a normalized table.

Arrays are the wrong tool when:

  • The element count is unbounded — logs, event lists, transaction histories. These grow without bound and make rows large and expensive to update.
  • You need to query individual elements by positionWHERE array_length(col, 1) = N doesn’t use indexes. WHERE col[N] = 'value' doesn’t use indexes. If you need positional queries, normalize.
  • Elements need their own relationships — if tags need a count, a creation date, or a many-to-many relationship with other entities, use a junction table with proper foreign keys.
  • You’re doing frequent appends or modifications — each modification rewrites the entire array and the entire row.

The PostgreSQL documentation itself warns: “Arrays are not sets; searching for specific array elements can be a sign of database misdesign. Consider using a separate table with a row for each item that would be an array element.”

Multidimensional arrays: rarely useful

PostgreSQL supports multidimensional arrays, but they’re rarely the right choice in application design. The primary use case is mathematical and scientific computing where 2D or 3D grids are natural:

-- A 3x3 matrix
SELECT ARRAY[[1, 2, 3], [4, 5, 6], [7, 8, 9]]::int[];

-- Access a single element (1-indexed, row-major)
SELECT (ARRAY[[1,2,3],[4,5,6],[7,8,9]])[2][3];
-- 6

-- Extract a row (slice)
SELECT (ARRAY[[1,2,3],[4,5,6],[7,8,9]])[1:1][1:3];
-- {{1,2,3}}

Multidimensional arrays have limitations: no GIN indexing support beyond the flat element containment (the GIN index stores individual elements, not sub-arrays), and the syntax for slicing and updating is awkward compared to JSONB for structured data.

If you need structured, heterogeneous nested data, JSONB is generally a better choice. If you need homogeneous, indexed, queryable multi-value data, a one-dimensional array with GIN is the right tool.

Practical SQL: array diagnostics

Find all array columns in your schema:

SELECT table_schema, table_name, column_name,
       udt_name, data_type
FROM information_schema.columns
WHERE data_type = 'ARRAY'
  OR udt_name LIKE '%[]'
  OR data_type LIKE 'ARRAY%'
  AND table_schema NOT IN ('pg_catalog', 'information_schema')
ORDER BY table_schema, table_name, ordinal_position;

Check average array sizes and null rates:

SELECT attrelid::regclass AS table_name,
       attname AS column_name,
       count(*) AS total_rows,
       count(*) FILTER (WHERE atttypid IS NOT NULL) AS non_null_rows,
       round(avg(array_length(atttypid, 1)), 1) AS avg_elements,
       max(array_length(atttypid, 1)) AS max_elements
FROM pg_attribute
WHERE atttypid = 'int[]'::regtype
  AND attnum > 0
  AND NOT attisdropped;

Check which array indexes exist and whether they’re being used:

SELECT schemaname,
       relname AS table_name,
       indexrelname AS index_name,
       idx_scan AS times_used,
       pg_size_pretty(pg_relation_size(indexrelid::regclass)) AS index_size
FROM pg_stat_user_indexes
WHERE pg_get_indexdef(indexrelid) LIKE '%gin%'
  AND pg_get_indexdef(indexrelid) LIKE '%[]%'
ORDER BY idx_scan ASC;

Test whether your array queries use the GIN index:

EXPLAIN (ANALYZE, BUFFERS)
SELECT * FROM articles
WHERE tags @> ARRAY['python'];
-- Look for "Bitmap Index Scan on idx_tags" or "GIN scan"
-- If you see "Seq Scan", the index isn't being used

Key takeaways

  • Arrays are varlena datums with a header (dimensions, lengths, lower bounds), an optional null bitmap, and sequential element storage in row-major order.
  • The three comparison operators — @> (contains), <@ (contained by), && (overlaps) — are GIN-indexable. The ANY() construct and array_length() are not.
  • Array equality is order-sensitive: ARRAY[1,2,3] != ARRAY[3,2,1]. Use ARRAY(SELECT unnest(arr) ORDER BY 1) for set semantics.
  • Subscript assignment (col[i] = value) and concatenation (col || element) rewrite the entire array and the entire row. There is no in-place element update.
  • unnest() converts arrays to rows, and array_agg() converts rows to arrays. Use array_agg(DISTINCT x ORDER BY x) for deterministic, deduplicated arrays.
  • Arrays work best for small, bounded, atomic multi-value data with containment queries. They break down with unbounded growth, positional queries, or frequent element-level modifications.
  • GIN array_ops indexes individual elements, not sub-arrays. Multidimensional arrays get element-level containment indexing but no structural indexing.

What’s next

Arrays and JSONB are the two multi-value types PostgreSQL offers, but they’re not the only data types with surprising behavior. UUIDs are everywhere — primary keys, API identifiers, shard keys — but the choice between UUIDv4 (random) and UUIDv7 (time-ordered) has cascading effects on B-tree index density, page splits, and insert throughput that most people never think about. Next time: UUIDv4 vs UUIDv7, the INSERT performance cliff, and why your random UUIDs are making your indexes fat.


Previously: JSONB: GIN Indexing, Containment Queries, and the Performance Traps That Look Right But Run Slow