PostgreSQL Deep Dive

PostgreSQL Deep Dive: JSONB - GIN indexing, containment queries, and the performance traps that look right but run slow

You have a metadata column of type jsonb. It stores key-value pairs from user profiles: {"tier": "pro", "region": "au", "settings": {"dark_mode": true, "notifications": false}}. You add a GIN index, write WHERE metadata @> '{"tier": "pro"}', and it flies. Then you add WHERE metadata -> 'settings' ->> 'dark_mode' = 'true' and the index goes unused. The query does a sequential scan on a 50 million row table.

The difference between these two queries is the difference between GIN indexing that works and GIN indexing that you think works.

How JSONB stores data internally

jsonb stores values in a decomposed binary format: a single varlena datum containing a tree structure. Unlike json (which stores the original text and reparses on every access), jsonb parses on input and never parses again.

The internal structure mirrors the JSON value: objects are unordered collections of key-value pairs, arrays are ordered sequences of values, and scalar values (strings, numbers, booleans, null) are stored inline. Keys are always stored as text. The binary format uses a header with type tags and value lengths, making type-aware operations possible without re-parsing.

-- Storage comparison
SELECT pg_column_size('{"a": 1, "b": "hello"}'::json);
-- 32 bytes (text string + padding)

SELECT pg_column_size('{"a": 1, "b": "hello"}'::jsonb);
-- 14 bytes (binary tree, more compact)

-- Key order is not preserved in jsonb
SELECT '{"c": 1, "a": 2, "b": 3}'::jsonb;
-- {"a": 2, "b": 3, "c": 1}  (alphabetical key order)

Keys are reordered alphabetically in the binary representation. This enables a hash-based lookup strategy: when checking if a key exists, PostgreSQL can binary-search the sorted key array rather than scanning linearly. This is why the documentation says “JSON objects are better suited than arrays for testing containment or existence when there are many keys.”

If you’re storing data you need to query frequently, use objects with meaningful keys, not arrays of values. Object key lookup is O(log N). Array element lookup is O(N) unless the array is small.

The two GIN operator classes: jsonb_ops vs jsonb_path_ops

A GIN index on jsonb can use one of two operator classes, and the choice determines which queries can use the index:

jsonb_ops (default): indexes every key and every value separately. Supports the @>, ?, ?&, ?| operators, plus the existence and containment operators.

jsonb_path_ops: indexes only paths (key-value pairs), not individual keys or values. Supports only the @> containment operator.

-- Default: indexes keys and values separately
CREATE INDEX idx_metadata_ops ON events USING gin (metadata jsonb_ops);
-- Supports: @>, ?, ?&, ?|

-- Path-optimized: indexes key-value paths only
CREATE INDEX idx_metadata_path ON events USING gin (metadata jsonb_path_ops);
-- Supports: @> only

The jsonb_path_ops index is smaller because it doesn’t create separate entries for individual keys. A document {"a": 1, "b": 2, "c": 3} produces six entries under jsonb_ops (one for each key, one for each value) but only three entries under jsonb_path_ops (one for each key-value pair). This translates to a smaller index and fewer pages to scan during queries.

The performance difference is significant for containment queries:

-- Both indexes support this:
SELECT * FROM events WHERE metadata @> '{"tier": "pro"}';

-- Only jsonb_ops supports this:
SELECT * FROM events WHERE metadata ? 'settings';
SELECT * FROM events WHERE metadata ?| ARRAY['tier', 'region'];

-- jsonb_path_ops does NOT support this — falls back to sequential scan
SELECT * FROM events WHERE metadata ? 'tier';

The trade-off is clear: jsonb_path_ops is faster and smaller for containment queries, but if you need the existence operators (?, ?&, ?|), you must use jsonb_ops. You can create both indexes on the same column if you need both access patterns, but that doubles the write cost.

The containment operator: powerful but limited

The @> operator checks whether the left JSONB value contains the right JSONB value. For objects, this means all key-value pairs on the right must exist in the left:

SELECT '{"a": 1, "b": 2, "c": 3}'::jsonb @> '{"a": 1}'::jsonb;
-- t

SELECT '{"a": 1, "b": 2}'::jsonb @> '{"a": 1, "c": 3}'::jsonb;
-- f

-- Nested containment works
SELECT '{"settings": {"dark_mode": true, "notifications": false}}'::jsonb
       @> '{"settings": {"dark_mode": true}}'::jsonb;
-- t

-- But partial matching doesn't
SELECT '{"settings": {"dark_mode": true}}'::jsonb
       @> '{"settings": {"dark": true}}'::jsonb;
-- f  (key must match exactly)

Array containment is also supported:

SELECT '[1, 2, 3, 4]'::jsonb @> '[2, 4]'::jsonb;
-- t

SELECT '[1, 2, 3]'::jsonb @> '[1, 2, 3, 4]'::jsonb;
-- f  (left must be a superset of right)

Containment queries with @> use the GIN index efficiently. The planner can extract the key-value pairs from the right-hand side and look them up in the GIN posting lists. This is why WHERE metadata @> '{"tier": "pro"}' is fast. The GIN index finds all rows containing the key tier with value "pro" without scanning the actual JSONB data.

The path extraction trap: why -> and ->> bypass GIN

The most common JSONB indexing mistake is using path extraction operators with a GIN index and expecting it to help:

CREATE INDEX idx_metadata ON events USING gin (metadata);

-- This uses the GIN index:
SELECT * FROM events WHERE metadata @> '{"tier": "pro"}';

-- This does NOT use the GIN index:
SELECT * FROM events WHERE metadata -> 'tier' = '"pro"';

-- This also does NOT:
SELECT * FROM events WHERE metadata ->> 'tier' = 'pro';

-- Nor does this nested extraction:
SELECT * FROM events WHERE metadata -> 'settings' ->> 'dark_mode' = 'true';

The -> and ->> operators extract a value from a JSONB document. They return a JSONB value (->) or text (->>). The GIN index stores key paths, but the planner has no way to use a GIN index to satisfy an equality comparison on an extracted value. The index doesn’t store tier = "pro" as a searchable predicate in the way a B-tree does.

The result is a sequential scan with a filter on every row. For small tables this is fine. For 50 million rows, it’s a problem.

The fix is an expression index:

-- Index the extracted value as text
CREATE INDEX idx_metadata_tier
    ON events USING btree ((metadata ->> 'tier'));

-- Now this uses the B-tree index:
SELECT * FROM events WHERE metadata ->> 'tier' = 'pro';

Note the double parentheses: ((metadata ->> 'tier')). The outer parentheses are required because the index expression must be a single expression wrapped in parentheses.

For nested paths:

-- Index the nested setting
CREATE INDEX idx_metadata_dark_mode
    ON events USING btree ((metadata -> 'settings' ->> 'dark_mode'));

-- Now indexed:
SELECT * FROM events
WHERE metadata -> 'settings' ->> 'dark_mode' = 'true';

This is a B-tree index, not a GIN index. You’re extracting a scalar value and indexing it exactly like any other text column. The trade-off: each expression index only helps queries that use the exact same extraction path. If you need to query on five different nested paths, you need five expression indexes, each with its own write cost.

Containment as the alternative: when @> is enough

Before reaching for expression indexes, consider whether containment can express your query. The containment operator @> with a GIN index often handles cases that people instinctively write with path extraction:

-- Path extraction (no GIN usage, sequential scan):
SELECT * FROM events WHERE metadata ->> 'tier' = 'pro';

-- Containment (uses GIN index):
SELECT * FROM events WHERE metadata @> '{"tier": "pro"}';

-- Nested path extraction (no GIN usage):
SELECT * FROM events
WHERE metadata -> 'settings' ->> 'dark_mode' = 'true';

-- Nested containment (uses GIN index with jsonb_ops or jsonb_path_ops):
SELECT * FROM events
WHERE metadata @> '{"settings": {"dark_mode": true}}';

Containment is semantically slightly different from equality. The containment operator matches JSON types exactly: true in the containment query is a JSON boolean, not the string "true". If your stored value is the string "true" rather than the boolean true, containment won’t match:

SELECT '{"settings": {"dark_mode": "true"}}'::jsonb
       @> '{"settings": {"dark_mode": true}}'::jsonb;
-- f  (string "true" does not contain boolean true)

SELECT '{"settings": {"dark_mode": "true"}}'::jsonb
       @> '{"settings": {"dark_mode": "true"}}'::jsonb;
-- t  (string matches string)

If your JSONB data stores booleans as strings (which happens when ingesting from APIs or user input), you need to match the stored type. This is another reason json input can be surprising: PostgreSQL’s JSON parser preserves the original type, so "true" stays a string and true stays a boolean.

jsonb_path_query: structured extraction with path expressions

PostgreSQL 12 introduced SQL/JSON path expressions via jsonb_path_query and related functions. These provide XPath-like access to JSONB data:

-- Extract all values at a path
SELECT jsonb_path_query(metadata, '$.settings.dark_mode')
FROM events;

-- Filter with path expressions
SELECT jsonb_path_query(
    metadata,
    '$.track_segments[*] ? (@.heart_rate > 130)'
) FROM gps_data;

-- Check existence
SELECT *
FROM events
WHERE jsonb_path_exists(metadata, '$.settings.dark_mode == true');

jsonb_path_query and jsonb_path_exists can use a GIN index, but only when the path expression is simple enough for the planner to decompose into containment checks. Complex filter expressions with arithmetic or regex matching cannot use GIN and fall back to sequential scan with filtering.

For the common case of extracting a single value at a known path, the ->/->> operators are simpler and equally fast. Use jsonb_path_query when you need conditional extraction, wildcards, or filter predicates within the JSONB structure.

The write amplification of GIN jsonb_ops

Every GIN index has a write cost, and jsonb_ops on JSONB is one of the most expensive because it creates multiple index entries per document. Consider a document like:

{
  "user_id": 12345,
  "event": "page_view",
  "properties": {
    "url": "/dashboard",
    "referrer": "https://google.com",
    "session_duration": 342,
    "device": "mobile"
  },
  "tags": ["conversion", "signup_funnel"]
}

Under jsonb_ops, this single document produces roughly 20+ GIN index entries: one for each distinct key, one for each distinct value, one for each key-value path. That’s 20+ entries to insert, update, and eventually vacuum. Multiply this by your write throughput. If you’re inserting 10,000 events per second, that’s 200,000+ GIN entry writes per second.

The fastupdate mechanism (pending list) amortizes this cost by buffering writes in memory and merging them lazily. The default gin_pending_list_limit is 4MB. When the pending list fills up, it’s flushed into the main GIN tree. This means writes to a GIN index are initially cheap (append to in-memory list) but eventually require a merge (which is expensive).

If your workload is write-heavy with large JSONB documents, consider whether a GIN index is the right tool. A set of targeted expression B-tree indexes on the specific paths you query may be cheaper to maintain and equally effective:

-- Instead of one expensive GIN index on the whole document:
CREATE INDEX idx_events_all ON events USING gin (metadata);

-- Use targeted B-tree indexes on the paths you actually query:
CREATE INDEX idx_events_type ON events USING btree ((metadata ->> 'event'));
CREATE INDEX idx_events_user ON events USING btree ((metadata ->> 'user_id'));
CREATE INDEX idx_events_device ON events USING btree ((metadata ->> 'device'));

Three B-tree expression indexes versus one GIN index. The B-tree approach is cheaper per write (one entry per index per row, versus 20+ per row for GIN) and can be more selective for point lookups. The GIN approach wins for ad-hoc queries on unpredictable paths.

JSONB comparison and sort order

JSONB values have a defined ordering that can surprise you. PostgreSQL compares JSONB types first by type, then by value within the same type. The type ordering is:

  1. Object (lowest)
  2. Array
  3. String
  4. Number
  5. Boolean
  6. Null (highest)
SELECT '{"a": 1}'::jsonb > 'null'::jsonb;
-- f  (object sorts lower than null)

SELECT '"hello"'::jsonb > '42'::jsonb;
-- f  (string sorts lower than number)

SELECT 'true'::jsonb > 'false'::jsonb;
-- t  (boolean comparison)

SELECT 'false'::jsonb > 'null'::jsonb;
-- f  (false sorts lower than null)

Within the same type, comparison follows the obvious rules: strings use locale-independent byte comparison, numbers use numeric comparison, and booleans order false before true. Arrays compare element-by-element, and objects compare key-by-key in key storage order (which is alphabetical).

This ordering means you can use ORDER BY on a jsonb column and get deterministic results, but the ordering may not be what you expect. If you need human-intuitive ordering, extract to a specific type first:

-- Sort by extracted text value (B-tree expression index needed for performance)
SELECT * FROM events
ORDER BY metadata ->> 'event';

-- Sort by extracted numeric value
SELECT * FROM events
ORDER BY (metadata ->> 'session_duration')::numeric;

Practical SQL: JSONB diagnostics

Find all JSONB columns in your schema:

SELECT table_schema, table_name, column_name, udt_name
FROM information_schema.columns
WHERE udt_name IN ('json', 'jsonb')
  AND table_schema NOT IN ('pg_catalog', 'information_schema')
ORDER BY table_schema, table_name;

Check the average size of your JSONB columns:

SELECT attrelid::regclass AS table_name,
       attname AS column_name,
       pg_size_pretty(avg(pg_column_size(atttypid))) AS avg_size,
       pg_size_pretty(max(pg_column_size(atttypid))) AS max_size
FROM pg_attribute
WHERE atttypid = 'jsonb'::regtype
  AND attnum > 0
  AND NOT attisdropped
GROUP BY attrelid, attname
ORDER BY avg(pg_column_size(atttypid)) DESC;

Check which GIN indexes exist on JSONB columns:

SELECT indexrelid::regclass AS index_name,
       indrelid::regclass AS table_name,
       amname AS index_type,
       pg_get_indexdef(indexrelid) AS definition
FROM pg_index
JOIN pg_class ON pg_class.oid = indexrelid
JOIN pg_am ON pg_am.oid = pg_class.relam
WHERE pg_get_indexdef(indexrelid) LIKE '%jsonb%'
ORDER BY indrelid::regclass::text;

Find unused GIN indexes on JSONB columns (opportunity to remove write overhead):

SELECT schemaname,
       relname AS table_name,
       indexrelname AS index_name,
       idx_scan AS times_used,
       pg_size_pretty(pg_relation_size(indexrelid::regclass)) AS index_size
FROM pg_stat_user_indexes
WHERE indexrelname IN (
    SELECT indexrelid::regclass::text
    FROM pg_index
    WHERE pg_get_indexdef(indexrelid) LIKE '%jsonb%'
)
ORDER BY idx_scan ASC;

Key takeaways

  • jsonb stores parsed binary, json stores text. Always use jsonb for queryable data.
  • The jsonb_ops GIN operator class supports @>, ?, ?&, ?|. The jsonb_path_ops class supports only @> but is smaller and faster for containment queries.
  • Path extraction operators (->, ->>) do NOT use GIN indexes. Use containment @> or create B-tree expression indexes on the extracted value.
  • @> with GIN can replace many common path extraction patterns: WHERE metadata @> '{"key": "value"}' instead of WHERE metadata ->> 'key' = 'value'.
  • Containment matches JSON types exactly: the boolean true does not match the string "true".
  • jsonb_ops GIN indexes create many entries per document, causing significant write amplification. For write-heavy workloads, targeted B-tree expression indexes on specific paths may be cheaper.
  • JSONB sort order is by type category, not by intuitive value. Object < array < string < number < boolean < null.

What’s next

JSONB is one piece of the “store anything” puzzle. The other is arrays. PostgreSQL’s native array types support GIN indexing with the same operator classes but with different internal mechanics and different gotchas around overlapping, containment, and unique constraints. Next time: array types, GIN indexing with array_ops, and the patterns where arrays beat both JSONB and normalized tables.


Previously: timestamptz and Why Almost Every Timestamp Column Should Have a Time Zone