PostgreSQL Deep Dive

PostgreSQL Deep Dive: tsvector, tsquery, and why your full-text search results are ranked wrong

You add a full-text search to your application. You create a tsvector column, add a GIN index, and write WHERE tsv @@ to_tsquery('postgres'). The query returns results, but the ranking is wrong. A two-sentence FAQ entry ranks higher than a 3,000-word deep dive on the same topic. A document containing “postgresql” five times ranks the same as one containing it once. And searching for “postgres” doesn’t find documents that only contain “postgresql” — even though the documentation said stemming handles this.

The problem isn’t PostgreSQL’s text search. It’s understanding what actually happens between your raw text and the @@ operator.

How tsvector stores data

A tsvector value is a sorted list of distinct lexemes, each optionally followed by position information. A lexeme is the normalized (stemmed) form of a word. Positions are integer offsets indicating where the lexeme appeared in the original document.

SELECT to_tsvector('english', 'The PostgreSQL database is a powerful database system');
-- 'database':3,6 'power':5 'postgresql':2 'system':7

Three things happened here:

  1. Stop words removed: “The”, “is”, “a” were discarded. English stop words include articles, prepositions, conjunctions, and common function words.
  2. Stemming applied: “powerful” became “power”, “databases” (if it appeared) would become “database”. Stemming reduces words to their root form using a language-specific dictionary.
  3. Positions preserved: Each lexeme tracks its position(s) in the original text. “database” appears at positions 3 and 6 because it occurs twice.

The tsvector stores lexemes in sorted order, not in document order. This enables binary search for lexeme lookup. Multiple positions for the same lexeme are stored as a sorted array:

SELECT to_tsvector('english', 'running runs runner ran run');
-- 'ran':4 'run':1,2,3,5 'runner':3

-- Wait — 'runner' became 'runner', not 'run'?
-- Yes. The Snowball stemmer for English maps 'running' → 'run', 'runs' → 'run',
-- 'ran' → 'ran', but 'runner' → 'runner'. Stemming is not the same as
-- lemmatization — it uses suffix-stripping rules, not a dictionary.

This is the first surprise: stemming is rule-based, not dictionary-based. The Snowball stemmer applies suffix-stripping patterns. “Running” loses “-ning” and becomes “run”. “Runner” loses one “-er” but keeps the other, becoming “runner”. The stemmer doesn’t know that “runner” and “running” share the same semantic root — it only applies mechanical rules.

Weights: A, B, C, D

Each position in a tsvector can carry a weight label: A, B, C, or D. Weights are single-character markers that you assign based on document structure — typically A for title, B for abstract, C for body, D for metadata.

SELECT setweight(to_tsvector('english', 'PostgreSQL Performance Guide'), 'A') ||
       setweight(to_tsvector('english', 'A deep dive into query optimization and indexing'), 'B') ||
       setweight(to_tsvector('english', 'This guide covers B-tree internals, bitmap scans, and parallel query execution'), 'C')
       AS weighted_tsv;
-- 'b-tree':16 'bitmap':18 'cover':12 'deep':6 'div':7 'execut':22
--  'guid':1,11 'index':17 'intern':16 'optim':9 'parallel':20
--  'perform':2 'postgr':1 'queri':10,21 'scan':19 'guid':1,11
-- Weights: A on first, B on second, C on third

The concatenation (||) of tsvector values merges the lexeme lists, preserving all positions and weights. A lexeme that appears in both the title (weight A) and body (weight C) has both positions stored with their respective weights.

The ranking functions ts_rank and ts_rank_cd use these weights to compute a score. The default weights are:

-- Default weight values
-- A = 0.1, B = 0.2, C = 0.4, D = 1.0
-- Higher weight = more important? No — higher default weight value means LOWER priority
-- This is counterintuitive and trips people up constantly

The default weight array is {0.1, 0.2, 0.4, 1.0}, where the values map to D, C, B, A (reverse alphabetical order). This means weight A (title) contributes 1.0 to the rank, and weight D contributes 0.1. If you want to customize:

-- Custom weights: title (A) gets 10x the weight of body (D)
SELECT ts_rank(tsv, query, '{1.0, 0.5, 0.3, 0.1}'::float4[])
FROM documents, to_tsquery('english', 'postgres & performance') query
WHERE tsv @@ query;

tsquery: the query language

tsquery is a Boolean combination of lexemes using & (AND), | (OR), and ! (NOT):

-- Simple term
SELECT to_tsquery('english', 'postgres');
-- 'postgr'

-- AND
SELECT to_tsquery('english', 'postgres & performance');
-- 'postgr' & 'perform'

-- OR
SELECT to_tsquery('english', 'postgres | mysql');
-- 'postgr' | 'mysql'

-- NOT
SELECT to_tsquery('english', 'postgres & !mysql');
-- 'postgr' & !'mysql'

-- Phrase search (PG 9.6+): lexemes must appear within N positions of each other
SELECT phraseto_tsquery('english', 'postgresql database');
-- 'postgr' <-> 'databas'
-- <-> means "followed by" (adjacent positions)

-- With distance:
SELECT to_tsquery('english', 'postgr <2> databas');
-- 'postgr' <2> 'databas'  (within 2 positions)

The <-> operator is the phrase search operator. It checks that the two lexemes appear in adjacent positions. <N> checks within N positions. This is where positional information in tsvector matters — without positions, phrase search degrades to simple AND.

-- This document contains both 'postgresql' and 'database' but not adjacent:
SELECT to_tsvector('english', 'The database powered by PostgreSQL') @@
       phraseto_tsquery('english', 'postgresql database');
-- f  (positions are: database=2, postgr=5 — not adjacent)

-- This one has them adjacent:
SELECT to_tsvector('english', 'The PostgreSQL database system') @@
       phraseto_tsquery('english', 'postgresql database');
-- t  (positions: postgr=2, databas=3 — adjacent)

plainto_tsquery converts plain text to a tsquery, treating words as AND:

SELECT plainto_tsquery('english', 'postgresql performance tuning');
-- 'postgr' & 'perform' & 'tun'

-- websearch_to_tsquery (PG 11+) supports quotes for phrases:
SELECT websearch_to_tsquery('english', '"postgresql database" performance');
-- 'postgr' <-> 'databas' & 'perform'

websearch_to_tsquery is the best choice for user-facing search because it handles quoted phrases, - for negation, and OR naturally — closest to what users expect from a search bar.

GIN vs GiST for text search indexing

Both GIN and GiST support the @@ operator on tsvector:

GIN (gin_trgm_ops / default tsvector_ops):

  • Stores every lexeme in the document with its posting list
  • Exact match: if the lexeme exists in the index, it’s in the posting list
  • Larger index, slower build, faster query
  • Supports @@ only (no ranking during index scan)

GiST (tsvector_ops):

  • Stores a lossy signature (bloom filter-like) of the lexemes in each page
  • May return false positives that need rechecking against the actual tsvector
  • Smaller index, faster build, slightly slower query (due to recheck)
  • Supports @@ with ranking
-- GIN: exact, larger, faster for point queries
CREATE INDEX idx_docs_gin ON documents USING gin (tsv);

-- GiST: lossy, smaller, good for ranking workloads
CREATE INDEX idx_docs_gist ON documents USING gist (tsv);

For most workloads, GIN is the right choice. The recheck overhead of GiST means that for large result sets, GiST scans more heap pages than necessary. GIN’s posting lists are exact — if the index says a row matches, it matches, no recheck needed.

The exception is when you need ranking during the index scan. GiST can compute ts_rank during the scan because it has access to the lexeme data in the index page. GIN cannot — it only knows whether a lexeme exists, not its frequency or positions. If you need ranked results with GIN, you must compute ts_rank on the matching rows after the index scan:

-- GIN approach: index scan → compute rank → sort
SELECT id, title, ts_rank(tsv, query) AS rank
FROM documents, to_tsquery('english', 'postgres & performance') query
WHERE tsv @@ query
ORDER BY rank DESC
LIMIT 10;

This is fine for most applications. The index scan filters to matching rows, and ts_rank is computed on a small result set. The GiST advantage only matters when the unranked result set is enormous (millions of matches) and you need to avoid scanning them all.

The ranking trap: ts_rank vs ts_rank_cd

PostgreSQL provides two ranking functions:

ts_rank(vector, query) — ranks based on term frequency. A document containing “postgresql” five times ranks higher than one containing it once, regardless of how close the occurrences are.

ts_rank_cd(vector, query) — “cover density” ranking. Considers how close matching lexemes are to each other. A document where “postgresql” and “performance” appear in the same sentence ranks higher than one where they’re paragraphs apart.

The critical gotcha: both functions require positional information in the tsvector. If the tsvector was built without positions, ranking degrades to a simple presence check:

-- With positions (default):
SELECT ts_rank(to_tsvector('english', 'postgres postgres postgres'), to_tsquery('postgres'));
-- 0.06079271

-- Without positions (stripped):
SELECT ts_rank(strip(to_tsvector('english', 'postgres postgres postgres')), to_tsquery('postgres'));
-- 0.06079271  (wait, same?)

-- Actually, ts_rank uses frequency (how many times lexeme appears in positions),
-- not position values directly. strip() removes positions but keeps the count.
-- ts_rank_cd is the one that truly needs positions for cover density.

ts_rank_cd specifically requires positional data for its proximity calculation. If positions are missing, ts_rank_cd falls back to treating the document as if all matching lexemes are at position 0, which eliminates the proximity advantage entirely.

The normalization parameter controls how document length affects ranking:

-- Normalize by document length (option 1 = divide by log of length)
SELECT ts_rank(tsv, query, 1) FROM documents, plainto_tsquery('english', 'postgres') query
WHERE tsv @@ query;

Without normalization, long documents accumulate higher raw scores because they simply contain more words and therefore more matches. Option 1 (divide by 1 + log(document_length)) is the most commonly used normalization because it penalizes excessive length without over-penalizing moderately long documents.

The stemming inconsistency: why “postgres” finds “postgresql” but not always

The Snowball English stemmer applies rules like:

  • Remove “-ing” suffix: “running” → “run”
  • Remove “-tion” suffix: “optimization” → “optim”
  • Remove “-ly” suffix: “quickly” → “quick”

But these are mechanical rules, not semantic understanding:

SELECT to_tsquery('english', 'database');
-- 'databas'

SELECT to_tsquery('english', 'databases');
-- 'databas'  (same — good)

SELECT to_tsquery('english', 'postgres');
-- 'postgr'

SELECT to_tsquery('english', 'postgresql');
-- 'postgr'  (same — good)

SELECT to_tsquery('english', 'postgraduate');
-- 'postgradu'  (different — correct)

-- But:
SELECT to_tsquery('english', 'organization');
-- 'organ'

SELECT to_tsquery('english', 'organ');
-- 'organ'  (same stem — but semantically different words!)

SELECT to_tsquery('english', 'universe');
-- 'univers'

SELECT to_tsquery('english', 'university');
-- 'univers'  (same stem — but semantically different!)

Stemming conflation is a known limitation. “Organization” and “organ” share the stem “organ” despite being unrelated. “University” and “universe” share “univers”. This produces false positives in search results — documents about organ transplants appear in searches for “organization”, and documents about the universe appear in searches for “university”.

The fix for critical cases is to use a custom dictionary or the simple text search configuration (no stemming, no stop words):

-- No stemming, exact match only
SELECT to_tsvector('simple', 'university universe');
-- 'universe':2 'university':1

SELECT to_tsquery('simple', 'university');
-- 'university'

SELECT to_tsvector('simple', 'university universe') @@ to_tsquery('simple', 'university');
-- t

SELECT to_tsvector('simple', 'university universe') @@ to_tsquery('simple', 'universe');
-- t  (but these are separate lexemes now)

Practical SQL: full-text search diagnostics

Check which text search configurations are available:

SELECT cfgname, cfgowner::regrole
FROM pg_ts_config
WHERE cfgname NOT IN ('simple')
ORDER BY cfgname;

Inspect the parsed and stemmed form of a document:

SELECT alias, token, dictionaries, dictionary, lexemes
FROM ts_debug('english', 'The PostgreSQL database is running fast');
-- Shows the full pipeline: token type, raw token, dictionaries tried,
-- winning dictionary, and resulting lexeme

Check the size of your tsvector columns and GIN indexes:

SELECT attrelid::regclass AS table_name,
       attname AS column_name,
       pg_size_pretty(avg(pg_column_size(attname))) AS avg_tsv_size
FROM pg_attribute
WHERE atttypid = 'tsvector'::regtype
  AND attnum > 0
  AND NOT attisdropped;

Find text search indexes and their type (GIN vs GiST):

SELECT indexrelid::regclass AS index_name,
       indrelid::regclass AS table_name,
       amname AS index_type,
       pg_size_pretty(pg_relation_size(indexrelid)) AS index_size
FROM pg_index
JOIN pg_class ON pg_class.oid = indexrelid
JOIN pg_am ON pg_am.oid = pg_class.relam
WHERE pg_get_indexdef(indexrelid) LIKE '%tsvector%'
   OR amname IN ('gin', 'gist')
   AND pg_get_indexdef(indexrelid) LIKE '%tsv%'
ORDER BY indrelid::regclass::text;

Key takeaways

  • tsvector stores sorted, distinct lexemes with optional positions and weights. Stop words are removed and words are stemmed during vectorization.
  • Stemming is rule-based (Snowball), not dictionary-based. “Running” → “run” but “runner” → “runner”. Unrelated words like “organization” and “organ” can share stems.
  • Weights A/B/C/D map to document structure. The default weight array {0.1, 0.2, 0.4, 1.0} maps to D/C/B/A — higher weight value means lower priority, which is counterintuitive.
  • websearch_to_tsquery is the best choice for user-facing search because it handles quoted phrases, negation, and OR naturally.
  • Phrase search (<->) requires positional information in the tsvector. Without positions, it degrades to simple AND.
  • GIN indexes are exact (no recheck) and faster for point queries. GiST indexes are lossy (may recheck) but can compute rank during the scan. GIN is the right default.
  • ts_rank measures term frequency. ts_rank_cd measures cover density (proximity). Both degrade without positional data. Use normalization option 1 to penalize long documents.
  • Use the simple configuration when you need exact matching without stemming or stop word removal.

What’s Next

We’ve covered the entire Data Types & Edge Cases category — numeric, float8, timestamptz, JSONB, arrays, UUIDs, sequences, and text search. The next category to dig into is Operational Gotchas — the production problems that don’t show up in development. We’re starting with one of the most common killers: idle in transaction connections, the connection pool silent assassin that locks up your database without a single slow query.


Previously: Identity Columns vs Serial, and the Sequence Cache Gap Problem