AI Agent Memory

Reranking: The refinement layer that makes agent memory actually work

Yesterday I wrote about hybrid search with Reciprocal Rank Fusion. The pipeline diagram at the end had an optional step labeled “Cross-Encoder Rerank” that I described as pushing recall from 90% to 92%+ at significant latency cost. Today I want to walk through that step completely, because for production agent memory systems, reranking is not optional. It is the difference between “pretty good” and “actually useful.”

The core idea is simple. Your first-stage retriever, whether that is BM25, vector search, or a hybrid of both, casts a wide net. It returns 20-50 candidate documents that might be relevant. A reranker then scores each candidate against your original query using a more expensive but more accurate model, producing a refined final ranking. The retriever is fast but approximate. The reranker is slow but precise. Together they form a two-stage retrieval pipeline that outperforms either alone.

The Bi-Encoder Problem: Why First-Stage Retrieval Needs Help

To understand why reranking works, you need to understand the limitation it is designed to fix.

Embedding models, which power vector search, are bi-encoders. They encode the query and the document independently, producing two separate vectors, then compare them with cosine similarity. This architecture has a fundamental weakness: the model never sees the query and the document at the same time. It cannot perform cross-attention between them.

Consider a memory that says “Deployed the auth service to us-east-1 using the blue-green strategy with zero downtime.” Now suppose someone asks “Which deployment strategy did we use for authentication?” The bi-encoder must embed the question and the memory independently, then compare the resulting vectors. The words “auth” and “authentication” are close in embedding space, and “deployment strategy” and “blue-green strategy” are related, so the cosine similarity might be reasonable. But the model has no mechanism to verify that the deployment strategy described in the document is specifically the one used for authentication. It is making an inference based on independent representations, never seeing the two texts together.

A cross-encoder fixes this. Instead of encoding the query and document separately, it concatenates them into a single input sequence and processes them jointly:

Input:  [CLS] Which deployment strategy did we use for authentication? [SEP]
        Deployed the auth service to us-east-1 using the blue-green
        strategy with zero downtime. [SEP]

The transformer’s self-attention layers allow every token in the query to attend to every token in the document and vice versa. The model can see that “authentication” in the query refers to “auth service” in the document, and that “deployment strategy” corresponds to “blue-green strategy.” It outputs a single relevance score for the pair. This joint processing is far more accurate for relevance scoring than comparing independent embeddings.

The tradeoff is speed. A bi-encoder precomputes document embeddings once and stores them in an index. At query time, it only needs to encode the query and compare it against precomputed vectors. This is O(1) per document at query time (after the initial O(n) embedding cost). A cross-encoder must process every query-document pair through the full transformer at query time. There is no precomputation shortcut. For 20 candidate documents, you run the cross-encoder 20 times. For 100 candidates, you run it 100 times.

This is why the two-stage architecture exists. You use the fast bi-encoder to narrow a large corpus down to a manageable candidate set, then use the accurate cross-encoder to refine the ranking of that smaller set.
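The shape of that two-stage pipeline can be sketched in plain Python. Both scorers below are toy stand-ins invented for illustration: cheap_score is cosine similarity over word-count vectors that are built independently (playing the bi-encoder), and expensive_score is a prefix-overlap function that reads the query and document together (playing the cross-encoder). A real system would swap in an embedding index and a reranker model.

```python
import math
from collections import Counter


def bag_of_words(text):
    """Toy 'encoder': word counts computed from one text in isolation."""
    return Counter(text.lower().split())


def cheap_score(query_vec, doc_vec):
    """Cosine similarity between two independently built vectors."""
    dot = sum(count * doc_vec[term] for term, count in query_vec.items())
    norm = math.sqrt(sum(c * c for c in query_vec.values())) * math.sqrt(
        sum(c * c for c in doc_vec.values())
    )
    return dot / norm if norm else 0.0


def expensive_score(query, doc):
    """Toy joint scorer that reads both texts together.

    Counts query words matching a document word exactly or by prefix
    (so 'authentication' matches 'auth').
    """
    doc_words = doc.lower().split()
    query_words = query.lower().split()
    hits = sum(
        1 for q in query_words
        if any(q.startswith(d) or d.startswith(q) for d in doc_words)
    )
    return hits / len(query_words) if query_words else 0.0


def two_stage_search(query, corpus, doc_vectors, n_candidates=20, top_k=5):
    # Stage 1: rank the whole corpus with the cheap scorer on precomputed vectors.
    query_vec = bag_of_words(query)
    ranked = sorted(
        corpus,
        key=lambda doc_id: cheap_score(query_vec, doc_vectors[doc_id]),
        reverse=True,
    )
    candidates = ranked[:n_candidates]
    # Stage 2: run the expensive scorer only on the surviving candidates.
    reranked = sorted(
        candidates,
        key=lambda doc_id: expensive_score(query, corpus[doc_id]),
        reverse=True,
    )
    return reranked[:top_k]


corpus = {
    "a": "deployed the auth service using the blue-green strategy",
    "b": "authentication service uses jwt tokens",
    "c": "lunch menu on friday",
}
doc_vectors = {doc_id: bag_of_words(text) for doc_id, text in corpus.items()}
query = "which deployment strategy for authentication"

# Stage 1 alone ranks "b" above "a"; stage 2 promotes "a".
results = two_stage_search(query, corpus, doc_vectors, n_candidates=2, top_k=2)
```

Note that the document vectors are computed once, up front, while expensive_score runs once per candidate at query time. That asymmetry is the whole reason for the split.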

Cross-Encoder Rerankers: The Pairwise Approach

Cross-encoder rerankers score one query-document pair at a time, producing a relevance score. You run them on each candidate from your first-stage retriever and sort by the resulting scores. Among the most widely used open-source cross-encoders are those from the BAAI BGE family.

BGE Rerankers

The FlagEmbedding library provides several reranker models with different size and capability tradeoffs:

Model                                      Parameters  Languages     Notes
BAAI/bge-reranker-v2-m3                    568M        Multilingual  Lightweight, fast inference
BAAI/bge-reranker-v2-gemma                 2.51B       Multilingual  Strong multilingual performance
BAAI/bge-reranker-v2.5-gemma2-lightweight  9.24B       Multilingual  Selectable layers, compressed output for faster inference
BAAI/bge-reranker-large                    560M        EN/ZH         More accurate, less efficient
BAAI/bge-reranker-base                     278M        EN/ZH         Smallest BGE cross-encoder

Using a BGE reranker is straightforward:

from FlagEmbedding import FlagReranker

reranker = FlagReranker('BAAI/bge-reranker-v2-m3', use_fp16=True)

pairs = [
    ["Which deployment strategy did we use for authentication?",
     "Deployed the auth service to us-east-1 using the blue-green strategy."],
    ["Which deployment strategy did we use for authentication?",
     "The authentication service uses JWT tokens with RS256 signing."],
    ["Which deployment strategy did we use for authentication?",
     "Updated CI/CD pipeline to use canary deployments for all services."],
]

# normalize=True maps the raw logits to [0, 1] with a sigmoid
scores = reranker.compute_score(pairs, normalize=True)
# Illustrative scores: [0.98, 0.31, 0.72]
# The first result wins: it directly answers the question

The use_fp16=True flag enables half-precision inference, which speeds up scoring on a GPU at a slight cost in accuracy. On CPU, half precision generally does not help, so for CPU-only agent memory systems a smaller model or an ONNX-based runtime is the more practical route.

FlashRank: Ultra-Lightweight Reranking

FlashRank takes a different approach to the speed problem. Instead of loading a full PyTorch model, it uses ONNX Runtime with models optimized for minimal footprint. The default model, ms-marco-TinyBERT-L-2-v2, is only 4MB. It runs on CPU with no GPU or heavy dependencies required.

from flashrank import Ranker, RerankRequest

ranker = Ranker(max_length=128)  # Default: 4MB model, blazing fast

query = "How do I configure the database connection pool?"
passages = [
    {"id": 1, "text": "Use PgBouncer with max_client_conn=100 and pool_mode=transaction"},
    {"id": 2, "text": "The database schema was updated to support multi-tenancy"},
    {"id": 3, "text": "Connection pooling is configured via environment variables in docker-compose"},
]

request = RerankRequest(query=query, passages=passages)
results = ranker.rerank(request)

FlashRank offers a spectrum of models:

Model                     Size    Best For
ms-marco-TinyBERT-L-2-v2  ~4MB    Speed-critical, CPU-only deployments
ms-marco-MiniLM-L-12-v2   ~34MB   Best accuracy in the tiny category
rank-T5-flan              ~110MB  Best zero-shot performance on out-of-domain data
ms-marco-MultiBERT-L-12   ~150MB  Multilingual (100+ languages)

For agent memory, the 34MB MiniLM model hits the sweet spot: it is small enough to run locally on CPU, with latency in the tens to low hundreds of milliseconds for 15-20 candidates depending on passage length, and accurate enough to meaningfully improve over the first-stage retriever.

Jina Reranker v2

Jina Reranker v2 is a cross-encoder designed for multilingual retrieval, with over 10 million downloads on HuggingFace. It supports 100+ languages and integrates with both the Transformers and SentenceTransformers libraries. Jina positions this model as particularly well-suited for agentic RAG applications, where a retrieval component feeds context into an LLM that takes actions.

Cohere Rerank: The Commercial Option

If you are willing to send your queries to an external API, Cohere’s Rerank v4 is the most polished commercial offering. The API takes a query and a list of documents, returns ranked results with relevance scores, and handles tokenization and truncation automatically:

import cohere

co = cohere.ClientV2()

response = co.rerank(
    model="rerank-v4.0-pro",
    query="Which deployment strategy did we use for authentication?",
    documents=[
        "Deployed the auth service using blue-green strategy with zero downtime.",
        "Auth service uses JWT tokens with RS256 signing.",
        "CI/CD pipeline updated to canary deployments.",
    ],
    top_n=3,
)

for result in response.results:
    print(f"Index: {result.index}, Score: {result.relevance_score}")

Cohere supports up to 1,000 documents per request with automatic truncation at 4,096 tokens per document. Their documentation also recommends formatting structured data as YAML for best performance, which is relevant if your agent memory includes structured records.
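For structured memory records, one way to follow the YAML recommendation without adding a dependency is to render each record by hand. This is a minimal sketch that assumes flat records with simple string keys; the field names, including downtime_minutes, are made up for illustration. Values are JSON-encoded, which works because JSON scalars are also valid YAML.

```python
import json


def record_to_yaml(record):
    """Render a flat dict as a YAML mapping, one `key: value` line per field.

    Values are JSON-encoded so strings containing colons or quotes
    stay unambiguous.
    """
    return "\n".join(f"{key}: {json.dumps(value)}" for key, value in record.items())


# Hypothetical structured memory record.
memory = {
    "service": "auth",
    "region": "us-east-1",
    "strategy": "blue-green",
    "downtime_minutes": 0,
}
document = record_to_yaml(memory)
```

The resulting string can be passed as one entry in the documents list. Nested records would need a real YAML serializer.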

LLM-Based Reranking: The Listwise Approach

Cross-encoders score one query-document pair at a time. LLM-based rerankers take a different approach: they look at the query and all candidate documents simultaneously, then output a complete ranking. This is called listwise ranking, and it can capture relationships between candidates that pairwise scoring misses.

The most prominent implementation is RankGPT, from the paper “Is ChatGPT Good at Search?” by Sun et al. (2023). The idea is to give an LLM the query and a list of passages, then ask it to reorder them by relevance. The prompt looks something like this:

I will provide you with {n} passages, each indicated by a numerical identifier [].
Rank the passages based on their relevance to the search query: {query}.

[1] {passage_1}
[2] {passage_2}
[3] {passage_3}
...

Rank the {n} passages above based on their relevance to the search query.
Output the identifiers of the most relevant passages in descending order of relevance.

The LLM generates a permutation, something like [1] > [3] > [2], indicating that passage 1 is most relevant, followed by passage 3, then passage 2. This is fundamentally different from pairwise scoring because the model can reason about relative relevance. It can say “passage 1 directly answers the question, passage 3 provides useful context but not the answer, and passage 2 is tangentially related.”
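The two mechanical pieces around the LLM call are filling the template and parsing the permutation that comes back. Here is a minimal sketch of both; the function names are mine, and the LLM call itself is deliberately left out. The parser is defensive because models sometimes repeat an identifier, invent one, or forget a passage.

```python
import re


def build_listwise_prompt(query, passages):
    """Fill the listwise template with a query and numbered passages."""
    n = len(passages)
    numbered = "\n".join(f"[{i}] {p}" for i, p in enumerate(passages, start=1))
    return (
        f"I will provide you with {n} passages, each indicated by a numerical identifier [].\n"
        f"Rank the passages based on their relevance to the search query: {query}.\n\n"
        f"{numbered}\n\n"
        f"Rank the {n} passages above based on their relevance to the search query.\n"
        "Output the identifiers of the most relevant passages in descending order of relevance."
    )


def parse_permutation(response, n):
    """Turn model output such as '[1] > [3] > [2]' into 0-based indices.

    Out-of-range and repeated identifiers are dropped, and any passage the
    model left out is appended in its original order, so the result is
    always a full permutation of range(n).
    """
    order = []
    for token in re.findall(r"\d+", response):
        idx = int(token) - 1
        if 0 <= idx < n and idx not in order:
            order.append(idx)
    missing = [i for i in range(n) if i not in order]
    return order + missing


passages = ["blue-green deploy of auth", "canary deploys in CI/CD", "JWT signing"]
prompt = build_listwise_prompt("Which deployment strategy for auth?", passages)
ranking = parse_permutation("[1] > [3] > [2]", len(passages))
reordered = [passages[i] for i in ranking]
```

Falling back to the original order for anything the model omits means a malformed response degrades to the first-stage ranking instead of dropping memories.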

RankLLM: Open-Source Listwise Ranking

The RankLLM project from the University of Waterloo provides a full framework for LLM-based reranking. It supports several model families:

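Pointwise models (MonoT5, MonoELECTRA): Score each document independently against the query. This is the same scoring pattern as the cross-encoders above (which this post calls pairwise because they score query-document pairs), built on T5 and ELECTRA respectively.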

Pairwise models (DuoT5): Compare two documents at a time. More nuanced than pointwise but slower.

Listwise models (RankZephyr, RankGPT, RankGemini): Look at all candidates simultaneously. Most accurate but most expensive.

The standout open-source option is castorini/rank_zephyr_7b_v1_full, a 7B parameter Zephyr model fine-tuned specifically for listwise ranking. FlashRank includes this as a 4-bit quantized GGUF model (~4GB) for local inference. With the RankLLM framework itself, usage looks like this:

# Using RankLLM framework
from rank_llm.rerank.listwise import ZephyrReranker

# retrieved_results: first-stage results in RankLLM's Request format,
# built elsewhere (one Request per query with its candidate documents)
reranker = ZephyrReranker()
rerank_results = reranker.rerank_batch(requests=retrieved_results)

The Sliding Window Problem

Listwise reranking has a practical limitation: context window size. A 7B model with an 8K context window can handle roughly 20-30 passages of typical length at once. If your first-stage retriever returns 50 candidates, you need a sliding window approach: rerank windows of 20, then merge the results.

The RankLLM framework handles this internally, but it introduces a new failure mode. A document that is the most relevant overall might end up in a window where it ranks second behind another strong candidate, while a mediocre document in a weaker window ranks first. The merge step can amplify this problem. This is why listwise reranking is most effective when the first-stage retriever is already quite good: the top candidates are clustered together, and the sliding window is less likely to split them.
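To make the windowing concrete, here is a sketch of the back-to-front overlapping scheme described in the RankGPT paper (window of 20, step of 10). It is not a copy of what RankLLM does internally. The rank_window argument is a hypothetical callable standing in for one listwise LLM call, and the sketch assumes 1 <= step <= window_size.

```python
def sliding_window_rerank(candidates, rank_window, window_size=20, step=10):
    """Rerank a long list with a ranker that can only see window_size items.

    rank_window(items) must return the same items, best first.
    Windows start at the bottom of the list and move toward the top.
    """
    items = list(candidates)
    end = len(items)
    while end > 0:
        start = max(0, end - window_size)
        # Rerank the current window in place.
        items[start:end] = rank_window(items[start:end])
        if start == 0:
            break
        # Slide toward the top; the overlap carries strong items upward.
        end -= step
    return items


# Toy ranker: a higher number means more relevant. A real one would call an LLM.
def toy_ranker(items):
    return sorted(items, reverse=True)


worst_first = list(range(50))  # the best item (49) starts in last place
reranked = sliding_window_rerank(worst_first, toy_ranker)
```

Because consecutive windows overlap, a strong document that starts near the bottom gets carried upward one window at a time. In this toy run the best ten items end up in the top ten even though the input was in the worst possible order. Only the top window_size - step positions are reliable after a single pass, and each window costs one model call.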

When to Use Listwise vs. Pairwise

For most agent memory systems, I recommend starting with a pairwise cross-encoder and only moving to listwise if you have a specific need for it. Here is the decision framework:

  • Use pairwise cross-encoders (BGE, FlashRank) when you have 10-50 candidates and need sub-200ms latency. This covers the vast majority of agent memory use cases.
  • Use listwise LLM rerankers (RankZephyr, RankGPT) when you have 10-20 high-quality candidates and can afford 1-5 seconds of latency. The added context from seeing all candidates together can improve accuracy on queries that require reasoning about relative relevance.
  • Use Cohere Rerank API when you do not want to run models locally, have a reasonable query volume, and are comfortable with API latency and costs.

ColBERT: Late Interaction as a Middle Ground

There is a third architecture worth knowing about. ColBERT (Contextualized Late Interaction over BERT) sits between bi-encoders and cross-encoders in terms of accuracy and speed. Instead of producing a single vector per document, ColBERT produces a vector for every token in the document. At query time, it computes a MaxSim score: for each query token, it finds the maximum cosine similarity with any document token, then sums these maxima.

The intuition is that ColBERT can match individual terms from the query to their most relevant counterparts in the document. If the query mentions “PgBouncer,” ColBERT finds the token in the document that best matches “PgBouncer” and uses that similarity. If the query also mentions “connection pool,” it finds the best-matching document token for that term. The final score aggregates these fine-grained matches.
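The MaxSim computation itself is short. This sketch uses plain Python lists and hand-picked 2-dimensional toy vectors in place of real token embeddings, and it assumes the document has at least one token vector.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def maxsim_score(query_token_vecs, doc_token_vecs):
    """Sum, over query tokens, of the best cosine match among document tokens."""
    return sum(
        max(cosine(q, d) for d in doc_token_vecs)
        for q in query_token_vecs
    )


# Toy token embeddings: the query has two tokens pointing in different directions.
query_vecs = [[1.0, 0.0], [0.0, 1.0]]
doc_a = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # has a match for both query tokens
doc_b = [[1.0, 0.0], [1.0, 0.0]]              # matches only the first query token

score_a = maxsim_score(query_vecs, doc_a)
score_b = maxsim_score(query_vecs, doc_b)
```

Here doc_a scores higher than doc_b because every query token finds a close counterpart in it. The document side (doc_a, doc_b) is what gets precomputed and stored; only the query vectors are produced at query time.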

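ColBERT is more accurate than bi-encoder search because it captures token-level interactions. It is faster than cross-encoder reranking because document token embeddings can be precomputed and stored. The tradeoff is storage: a bi-encoder produces one vector per chunk, while a ColBERT model produces one vector per token. A 256-token chunk becomes 256 vectors, which at equal dimensionality is roughly 256x the storage cost. In practice ColBERT projects each token to a smaller vector (128 dimensions by default), so against a 768-dimensional bi-encoder the raw overhead is closer to 256 × 128 / 768 ≈ 43x before any compression.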

For agent memory with small to medium corpora (under 100,000 documents), this storage overhead is usually acceptable. ColBERT is available through the colbert-ir repository and integrates with several vector databases. The FlashRank library also references ColBERT as a reranking option, though its primary models are cross-encoders.

Putting Reranking Into the Agent Memory Pipeline

Here is what the full pipeline looks like with reranking added:

Query: "How do I configure the database connection pool?"
         │
         ├──────────────────────┐
         ▼                      ▼
    BM25 Search            Vector Search
    (SQLite FTS5           (ChromaDB/pgvector
     or OpenSearch)         cosine similarity)
         │                      │
         │  top_k=20            │  top_k=20
         │                      │
         └──────────┬───────────┘
                    ▼
         Reciprocal Rank Fusion
         (k=60, optional weights)
                    │
                    ▼
         Content Deduplication
         (collapse by source_id)
                    │
                    ▼
         Temporal Boosting
         (exponential decay)
                    │
                    ▼
         Cross-Encoder Reranking
         (BGE-reranker-v2-m3 or FlashRank)
         top_k=15 candidates → top 5
                    │
                    ▼
         Final: Top 5 Memories
         (injected into context)

The implementation is a natural extension of yesterday’s hybrid search code:

from flashrank import Ranker, RerankRequest

ranker = Ranker(model_name="ms-marco-MiniLM-L-12-v2", max_length=256)

def rerank_results(query, fused_results, metadata_store, top_k=5):
    """Apply cross-encoder reranking to hybrid search results.

    Args:
        query: The user's query string.
        fused_results: RRF-fused results [(doc_id, score), ...].
        metadata_store: Dict mapping doc_id to document text and metadata.
        top_k: Number of results to return after reranking.
    """
    # Prepare passages for the reranker
    passages = [
        {"id": doc_id, "text": metadata_store[doc_id]["text"]}
        for doc_id, _ in fused_results[:15]  # Rerank top 15 candidates
    ]

    request = RerankRequest(query=query, passages=passages)
    reranked = ranker.rerank(request)

    return [(r["id"], r["score"]) for r in reranked[:top_k]]


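def full_pipeline(query):
    # bm25_search, vector_search, weighted_rrf, deduplicate_chunks, and
    # temporal_boost are assumed to be defined as in yesterday's hybrid
    # search code; metadata_store is assumed to be a module-level dict.

    # Stage 1: Parallel retrieval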
    bm25_results = bm25_search(query, top_k=20)
    vector_results = vector_search(query, top_k=20)

    # Stage 2: Fusion
    fused = weighted_rrf([bm25_results, vector_results], weights=[0.3, 0.7])

    # Stage 3: Deduplication
    deduped = deduplicate_chunks(fused, metadata_store)

    # Stage 4: Temporal boost
    boosted = temporal_boost(deduped, metadata_store, half_life_days=30)

    # Stage 5: Reranking
    final = rerank_results(query, boosted, metadata_store, top_k=5)

    return final

The reranking step adds roughly 50-200ms with FlashRank’s MiniLM model on CPU for 15 candidates, depending on passage length. With a GPU-backed BGE-reranker-v2-m3, it drops to 10-30ms. This is a small price to pay for the improvement in ranking quality.
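For readers who want to trace the pipeline without the earlier post open, here are hypothetical stand-ins for three of the helpers it calls. These are my assumptions, not the original implementations: each retriever is assumed to return a best-first list of (doc_id, score) tuples, metadata_store entries are assumed to carry a source_id and a timezone-aware created_at datetime, and the extra now parameter exists only to make the decay testable.

```python
from datetime import datetime, timezone


def weighted_rrf(result_lists, weights, k=60):
    """Fuse ranked lists: each list adds weight / (k + rank) to a document's score."""
    fused = {}
    for results, weight in zip(result_lists, weights):
        for rank, (doc_id, _score) in enumerate(results, start=1):
            fused[doc_id] = fused.get(doc_id, 0.0) + weight / (k + rank)
    return sorted(fused.items(), key=lambda pair: pair[1], reverse=True)


def deduplicate_chunks(results, metadata_store):
    """Keep only the highest-scoring chunk for each source_id."""
    seen, kept = set(), []
    for doc_id, score in results:  # results are already sorted best-first
        source = metadata_store[doc_id].get("source_id", doc_id)
        if source not in seen:
            seen.add(source)
            kept.append((doc_id, score))
    return kept


def temporal_boost(results, metadata_store, half_life_days=30, now=None):
    """Multiply each score by 0.5 ** (age_days / half_life_days), then re-sort."""
    now = now or datetime.now(timezone.utc)
    boosted = []
    for doc_id, score in results:
        age = now - metadata_store[doc_id]["created_at"]
        age_days = max(age.total_seconds() / 86400, 0.0)
        boosted.append((doc_id, score * 0.5 ** (age_days / half_life_days)))
    return sorted(boosted, key=lambda pair: pair[1], reverse=True)


# Tiny worked example with made-up documents and dates.
now = datetime(2024, 1, 31, tzinfo=timezone.utc)
store = {
    "a1": {"source_id": "a", "created_at": datetime(2024, 1, 1, tzinfo=timezone.utc)},
    "a2": {"source_id": "a", "created_at": datetime(2024, 1, 1, tzinfo=timezone.utc)},
    "b1": {"source_id": "b", "created_at": now},
}
bm25 = [("a1", 9.1), ("b1", 7.0), ("a2", 3.2)]
vector = [("b1", 0.9), ("a1", 0.8)]

fused = weighted_rrf([bm25, vector], weights=[0.3, 0.7])
deduped = deduplicate_chunks(fused, store)
boosted = temporal_boost(deduped, store, half_life_days=30, now=now)
```

With a 30-day half-life, the 30-day-old a1 keeps half of its fused score, while b1, created "now", keeps all of it. Both stages return best-first lists, which matters because rerank_results simply slices the first 15 entries.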

The Gotcha: Reranking Garbage Produces Better-Ordered Garbage

The biggest mistake people make with reranking is assuming it can fix a broken retrieval system. It cannot.

Reranking is a refinement step. It takes a candidate set and produces a better ordering of that same set. If the relevant document is not in the candidate set, no amount of reranking will find it. This is why the first-stage retriever matters so much. A bi-encoder with good coverage (high recall) is more valuable than a bi-encoder with high precision, because recall determines whether the right document is in the candidate set at all.

Consider a concrete failure case. Suppose someone stores a memory about their Kubernetes cluster configuration: “The staging cluster runs on EKS with 3 m5.large nodes, auto-scaling up to 10 nodes, using the Spot instance strategy for non-critical workloads.” Later they ask “How many nodes does my staging environment have?” If the first-stage retriever returns 15 documents about production environments, CI/CD pipelines, and cost optimization, but misses the staging configuration document entirely, the reranker will confidently order those 15 irrelevant documents. The top result will be the most confidently irrelevant document.

The mitigation is to ensure your first-stage retrieval has high recall before adding reranking. In practice, this means:

  1. Use a large top_k in the first stage. Fetch 20-50 candidates from BM25 and vector search before fusion. The cost of fetching more candidates is low compared to the cost of missing the right one.
  2. Validate retrieval quality separately from reranking quality. Measure recall@20 for your first-stage retriever before measuring nDCG@5 after reranking. If recall@20 is below 80%, fix your retriever before adding a reranker.
  3. Monitor the reranker’s score distribution. If all candidates score similarly (low variance in relevance scores), it usually means none of them are actually relevant. The reranker is not failing; the retriever failed to find good candidates.
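The three checks above can be expressed as small functions. This is a sketch using the standard definitions of recall@k and nDCG@k; the flat-score check is a heuristic of my own, and its min_spread threshold is an arbitrary placeholder that would need tuning against your reranker's score scale.

```python
import math
import statistics


def recall_at_k(retrieved_ids, relevant_ids, k):
    """Fraction of the relevant documents that appear in the top k."""
    relevant = set(relevant_ids)
    if not relevant:
        return 0.0
    return len(set(retrieved_ids[:k]) & relevant) / len(relevant)


def ndcg_at_k(ranked_ids, relevance, k):
    """nDCG@k with graded relevance given as {doc_id: gain}; missing ids count as 0."""
    def dcg(gains):
        return sum(g / math.log2(pos + 2) for pos, g in enumerate(gains))

    actual = dcg([relevance.get(doc_id, 0) for doc_id in ranked_ids[:k]])
    ideal = dcg(sorted(relevance.values(), reverse=True)[:k])
    return actual / ideal if ideal else 0.0


def scores_look_flat(scores, min_spread=0.05):
    """Heuristic: True when reranker scores barely differ across candidates."""
    return len(scores) < 2 or statistics.pstdev(scores) < min_spread


# First-stage check: did the retriever surface the relevant memories at all?
first_stage_recall = recall_at_k(["a", "b", "c"], relevant_ids=["a", "z"], k=2)

# Post-rerank check: is the relevant memory near the top?
rerank_quality = ndcg_at_k(["b", "a"], relevance={"a": 1}, k=2)

# Runtime check: a tight cluster of scores suggests nothing relevant was retrieved.
suspicious = scores_look_flat([0.31, 0.30, 0.32])
```

Both metrics are computed per query here; in practice you would average them over a labeled set of queries and look at first-stage recall before reading anything into the post-rerank nDCG.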

A second common mistake is choosing a reranker that is too expensive for the use case. A 7B listwise reranker that takes 3 seconds to process 20 candidates is fine for a research tool, but unacceptable for a chat bot that needs to respond within 2 seconds total. Remember that the reranker’s latency adds to the retriever’s latency, the LLM’s inference time, and network overhead. Budget accordingly.

Practical Takeaways

  • Reranking is a refinement layer, not a retrieval layer. It reorders candidates from your first-stage retriever. If the right document is not in the candidate set, reranking cannot help.
  • Start with FlashRank’s MiniLM model. At 34MB, it runs on CPU, integrates in three lines of code, and provides meaningful ranking improvements over raw RRF fusion.
  • Rerank 15-20 candidates, not more. The marginal accuracy gain from reranking 50 candidates versus 20 is small, but the latency cost scales linearly. Set your first-stage top_k high for recall, then rerank a smaller subset.
  • Use cross-encoders for latency-sensitive deployments. Pairwise scoring is predictable and fast. Use listwise LLM reranking only when you have spare latency budget and small candidate sets.
  • Validate first-stage recall before adding reranking. If your retriever misses the right document 40% of the time, a reranker that perfectly orders the wrong candidates in those cases is not helping.
  • The full pipeline matters more than any single component. RRF with a decent cross-encoder reranker will typically outperform carefully tuned RRF without reranking, which in turn will typically outperform carefully tuned vector search without hybrid fusion. Build the full stack.

What’s Next

Over the past week we have covered the full retrieval stack: BM25 for keyword matching, vector embeddings for semantic search, hybrid search with RRF for combining them, and now reranking for refinement. But retrieval is only half the problem. The other half is what happens before you ever search: how do you decide what to store, how do you organize it, and how do you keep it from growing stale? Next week I will shift from retrieval to architecture, starting with the markdown-first memory model that powers systems like OpenClaw and my own wiki. Plain text files on disk might seem primitive, but that simplicity is what makes them resilient, debuggable, and surprisingly powerful.


Previously in this series: Hybrid Search: Why the Best Agent Memory Systems Run Two Searches at Once