Milvus: The Purpose-Built Vector Database That Scales Agent Memory to Billions

You are building an agent that remembers things. You start small: a few thousand memories, a ChromaDB instance embedded in your process, everything humming along nicely. Then your agent goes into production. You are ingesting 10,000 memories a day. You hit a million. Ten million. Suddenly your embedded database is falling behind, your queries are timing out, and you realize you have outgrown the training wheels.

This is the story of every agent memory system that succeeds. The small-scale tools work beautifully until they don’t. And the database that catches you when you fall is Milvus.

In our series so far, we have covered pgvector for teams that want to keep everything in Postgres, and ChromaDB for the zero-infrastructure embedded case. Now we are going to look at the third option: Milvus, a purpose-built vector database designed for the moment when your agent memory hits billions of vectors and you need real-time ingestion, hybrid search, and cloud-native horizontal scaling, all at the same time.

How Milvus Works

Milvus is not a database with vector support bolted on. It is a vector database built from the ground up for one thing: fast similarity search over massive datasets. Originally created by Zilliz and now a graduated LF AI & Data Foundation project, it powers production workloads at Samsung, NVIDIA, eBay, and thousands of other companies.

The Four-Layer Architecture

Milvus separates concerns into four independent layers, each with its own scaling and failure domain:

┌─────────────────────────────────────────────┐
│           Access Layer (Stateless)           │
│   Proxies with load balancing, MPP queries   │
├─────────────────────────────────────────────┤
│           Coordinator (Brain)                │
│   DDL, topology, scheduling, TSO            │
├─────────────────────────────────────────────┤
│           Worker Nodes                       │
│   Streaming · Query · Data                   │
├─────────────────────────────────────────────┤
│           Storage Layer                      │
│   Meta (etcd) · WAL · Object Store (S3)     │
└─────────────────────────────────────────────┘

Access Layer: A set of stateless proxies behind a load balancer. They validate requests, route queries, and aggregate results using massively parallel processing. Because they are stateless, you can scale them horizontally without touching data.

Coordinator: The single brain of the cluster. It manages DDL operations, assigns segments to workers, handles timestamp orchestration, and maintains cluster topology. Only one coordinator is active at a time.

Worker Nodes: The muscle. Milvus 2.6 splits workers into three types. Streaming Nodes handle real-time ingestion and growing data queries. Query Nodes load historical data from object storage and serve batch queries. Data Nodes handle offline compaction and index building. This separation of streaming from batch is key to Milvus achieving both low-latency ingestion and high-throughput queries simultaneously.

Storage Layer: The backbone. etcd holds metadata. Object storage (S3, GCS, MinIO) holds segment files, indexes, and WAL snapshots. The Write-Ahead Log, now powered by Woodpecker in Milvus 2.6, writes directly to object storage with no local disk dependency, achieving 750 MB/s throughput on S3.

The critical design decision: storage and compute are fully disaggregated. Worker nodes are stateless. They can be restarted, replaced, or scaled without data loss. This is what makes Milvus cloud-native in the truest sense.

Index Types for Agent Memory

Milvus supports every major vector index, letting you choose the right tradeoff for your workload:

Index	Best For	Memory Usage	Latency
HNSW	Low-latency recall (default)	1.5-2x raw data	<10ms
IVF_FLAT	Balanced workloads	1x raw data	10-50ms
IVF_SQ8	Memory-constrained	0.25x raw data	10-50ms
IVF_RABITQ	Extreme compression (Milvus 2.6)	0.03-0.28x raw data	2-10ms
DiskANN	Disk-based, huge datasets	Minimal RAM	10-50ms
SCANN	Google-optimized workloads	1x raw data	5-20ms
FLAT	Exact search, small datasets	1x raw data	Varies

For agent memory, HNSW is the typical default: sub-10ms queries with high recall. But the new RaBitQ index in Milvus 2.6 changes the economics entirely.
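To make the memory column concrete, here is a rough sizing sketch. It assumes float32 vectors and collapses the table's ranges into single representative multipliers; the helper names and the 768-dimension example are illustrative assumptions, not Milvus APIs, and real usage depends on index parameters.

```python
BYTES_PER_FLOAT32 = 4

# Multipliers relative to raw float32 data, taken from the table above.
# Ranges are collapsed to one representative value for illustration only.
INDEX_MEMORY_FACTOR = {
    "HNSW": 1.75,        # midpoint of 1.5-2x
    "IVF_FLAT": 1.0,
    "IVF_SQ8": 0.25,
    "IVF_RABITQ": 0.28,  # upper end of 0.03-0.28x
}


def raw_size_gb(num_vectors, dim):
    """Size of the raw float32 vectors in gigabytes (1 GB = 1e9 bytes)."""
    return num_vectors * dim * BYTES_PER_FLOAT32 / 1e9


def estimate_index_memory_gb(index_type, num_vectors, dim):
    """Back-of-the-envelope memory estimate for one index type."""
    return raw_size_gb(num_vectors, dim) * INDEX_MEMORY_FACTOR[index_type]


if __name__ == "__main__":
    # Example: 10 million memories embedded at 768 dimensions
    for index_type in INDEX_MEMORY_FACTOR:
        gb = estimate_index_memory_gb(index_type, 10_000_000, 768)
        print(f"{index_type:<11} ~{gb:.1f} GB")
```

Treat the output as an order-of-magnitude guide for capacity planning rather than a prediction.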

RaBitQ: 72% Less Memory, 4x Faster Queries

Traditional quantization forces a tradeoff between memory and recall: the harder you compress vectors, the more accuracy you lose. RaBitQ softens this tradeoff by combining 1-bit quantization with an optional SQ8 refinement stage.

The benchmarks from VectorDBBench tell the story clearly:

Metric	IVF_FLAT (baseline)	RaBitQ 1-bit only	RaBitQ + SQ8 Refine
Memory footprint	100%	3%	28%
Recall@10	95.2%	76.3%	94.9%
Throughput (QPS)	236	648	946

The RaBitQ + SQ8 combination delivers 94.9% recall (within 0.3 points of the baseline) while using 72% less memory and serving 4x more queries per second (946 vs. 236 QPS). For an agent memory system storing millions of conversation embeddings, this means the same workload fits in roughly a quarter of the RAM and can be served by substantially fewer nodes.
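The coarse-then-refine idea behind these numbers can be illustrated with a deliberately simplified sketch. This is not the RaBitQ algorithm, which adds a random rotation and a calibrated distance estimator, and the refinement here uses full-precision vectors rather than SQ8. It only shows the shape of the pipeline: cheap 1-bit codes shortlist candidates, and a more precise score reorders the shortlist. All function names are hypothetical.

```python
def sign_bits(vec):
    """1-bit code: bit i is set when component i is positive."""
    code = 0
    for i, x in enumerate(vec):
        if x > 0:
            code |= 1 << i
    return code


def hamming(a, b):
    """Number of differing bits between two codes."""
    return bin(a ^ b).count("1")


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def two_stage_search(query, vectors, top_k=1, candidates=3):
    """Shortlist by Hamming distance on 1-bit codes, then re-score precisely."""
    codes = [sign_bits(v) for v in vectors]  # precomputed at index time in practice
    q_code = sign_bits(query)
    # Stage 1: coarse ranking on compressed codes
    coarse = sorted(range(len(vectors)), key=lambda i: hamming(q_code, codes[i]))
    shortlist = coarse[:candidates]
    # Stage 2: refine only the shortlist with a more precise score
    refined = sorted(shortlist, key=lambda i: dot(query, vectors[i]), reverse=True)
    return refined[:top_k]
```

The 1-bit stage alone cannot separate vectors that share a sign pattern, which is consistent with the drop in recall in the "1-bit only" column; the refinement stage is what restores the ordering.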

Hybrid Search: BM25 Meets Vectors

Agent memory needs more than pure semantic search. If a user asks about “Redis TTL configuration,” you want to find memories that mention Redis specifically, not just memories about caching in general. This is where hybrid search becomes critical.

Milvus has supported sparse vectors and BM25 full-text search since version 2.5. In Milvus 2.6, the full-text search implementation is 3-4x faster than Elasticsearch on the BEIR benchmark suite, with some workloads hitting 7x higher QPS.

The hybrid search API combines dense vector search with sparse BM25 and fuses the results:

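from pymilvus import AnnSearchRequest, MilvusClient, RRFRanker

# Assumes a running Milvus server and an "agent_memory" collection whose
# "sparse_embedding" field is populated by a BM25 function on a text field.
# query_embedding (list of floats) and query_text (str) come from your application.
client = MilvusClient(uri="http://localhost:19530")

# Dense semantic search
dense_search = AnnSearchRequest(
    data=[query_embedding],
    anns_field="embedding",
    param={"metric_type": "COSINE"},
    limit=20,
)

# Sparse BM25 search: pass the raw query text and let Milvus score it
sparse_search = AnnSearchRequest(
    data=[query_text],
    anns_field="sparse_embedding",
    param={"metric_type": "BM25"},
    limit=20,
)

# Hybrid: Milvus fuses results using RRF internally
results = client.hybrid_search(
    collection_name="agent_memory",
    reqs=[dense_search, sparse_search],
    ranker=RRFRanker(60),  # RRF smoothing constant k = 60
    limit=10,
)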

This is the same Reciprocal Rank Fusion approach we covered in our hybrid search deep dive, but executed inside the database rather than in application code. Running fusion at the database level means less data transfer, less latency, and consistent ranking behavior.
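For readers who have not seen the earlier post, here is a minimal sketch of what Reciprocal Rank Fusion computes. It is an application-level illustration, not Milvus's internal implementation; the function name `rrf_fuse` and the memory ids are made up for the example.

```python
from collections import defaultdict


def rrf_fuse(rankings, k=60, top_k=10):
    """Fuse several ranked id lists with Reciprocal Rank Fusion.

    Each list contributes 1 / (k + rank) to an id's score, where rank
    starts at 1. Ids that rank well in several lists rise to the top.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by id for determinism
    ordered = sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))
    return ordered[:top_k]


if __name__ == "__main__":
    dense_hits = ["m3", "m1", "m7"]   # hypothetical ids from the dense search
    sparse_hits = ["m1", "m9", "m3"]  # hypothetical ids from the BM25 search
    print(rrf_fuse([dense_hits, sparse_hits]))
```

Note that RRF only looks at ranks, never at raw scores, which is why it can combine cosine similarities and BM25 scores without any normalization step.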

Time-Aware Decay

Milvus 2.6 also introduces time-aware decay functions, which are a natural fit for agent memory:

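import time
from pymilvus import Function, FunctionType

now = int(time.time())

# Exponential decay: recent memories rank higher
decay_ranker = Function(
    name="time_decay",
    input_field_names=["timestamp"],  # numeric field holding Unix seconds
    function_type=FunctionType.RERANK,
    params={
        "reranker": "decay",
        "function": "exp",   # exponential decay
        "origin": now,       # scores peak at the current time
        "scale": 7 * 86400,  # decay rate: 7 days, in the field's unit
        "decay": 0.5,        # score multiplier one scale away from origin
    },
)
# Pass to client.search(..., filter=f"timestamp > {now - 30 * 86400}", ranker=decay_ranker)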

You can configure exponential, Gaussian, or linear decay. For agent memory, exponential decay with a 7-day scale matches how people actually weight information: recent conversations are much more relevant than ones from months ago.
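To see what a 7-day exponential scale does to ranking, here is a small standalone sketch. It assumes a parameterization in which the multiplier falls to `decay` (0.5 here) at exactly one scale of age, and it assumes the decay multiplier is applied to the similarity score by simple multiplication; both are illustrative assumptions rather than a statement of Milvus internals.

```python
import math

DAY = 86400  # seconds


def exp_decay(age_seconds, scale_seconds, decay=0.5, offset_seconds=0.0):
    """Multiplier in (0, 1]: 1.0 for fresh items, `decay` at one scale of age."""
    effective_age = max(0.0, age_seconds - offset_seconds)
    rate = math.log(decay) / scale_seconds  # negative for 0 < decay < 1
    return math.exp(rate * effective_age)


def decayed_score(similarity, age_seconds, scale_seconds=7 * DAY):
    """Similarity weighted by recency (assumed multiplicative combination)."""
    return similarity * exp_decay(age_seconds, scale_seconds)


if __name__ == "__main__":
    for days in (0, 1, 7, 14, 30):
        print(f"{days:>2} days old -> multiplier {exp_decay(days * DAY, 7 * DAY):.3f}")
```

With these settings a one-day-old memory at 0.80 similarity outranks a 30-day-old memory at 0.90, which is the behavior you usually want for conversational recall.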

Tiered Storage: Hot Data Fast, Cold Data Cheap

Agent memory has a predictable access pattern. The memories you created today will be accessed constantly. The memories from six months ago might be touched once a week. Milvus 2.6 introduces tiered storage that automatically classifies data based on access patterns:

Hot tier: Frequently accessed vectors stay in memory or on NVMe SSDs. Query latency stays under 10ms.
Cold tier: Rarely accessed vectors move to cheaper object storage (S3). Queries that touch cold data load segments on demand, adding a few hundred milliseconds of latency.
Dynamic movement: As access patterns change, data automatically migrates between tiers.

For a billion-vector agent memory system, tiered storage can reduce infrastructure costs by 50% with negligible impact on query performance. The hot data your agent needs right now stays fast. The archival data your agent rarely references gets cheap.
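The classification idea can be sketched as a simple recency policy. This is a toy illustration of hot/cold tiering in general, not how Milvus decides placement; the `Segment` type, the `assign_tiers` function, and the 7-day hot window are all assumptions made for the example.

```python
from dataclasses import dataclass

DAY = 86400  # seconds


@dataclass
class Segment:
    segment_id: str
    last_access_ts: float  # Unix seconds of the most recent query hit


def assign_tiers(segments, now, hot_window_seconds=7 * DAY):
    """Place recently accessed segments in the hot tier, the rest in cold."""
    tiers = {"hot": [], "cold": []}
    for seg in segments:
        age = now - seg.last_access_ts
        tier = "hot" if age <= hot_window_seconds else "cold"
        tiers[tier].append(seg.segment_id)
    return tiers
```

Re-running the policy as access timestamps change is what produces the dynamic movement between tiers: a cold segment that gets queried becomes hot again on the next pass.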

memsearch: Milvus as a Rebuildable Shadow Index

The most interesting agent memory project using Milvus right now is memsearch, created by Zilliz’s own team. It embodies a design philosophy we discussed in our files-as-memory post: Markdown files are the source of truth, and the vector database is a derived index that you can drop and rebuild at any time.

Here is the architecture:

Agent captures conversation
         │
         ▼
  Append to memory/2026-05-12.md
  (Markdown, human-readable, git-trackable)
         │
         ▼
  memsearch watch (file watcher)
         │
         ▼
  Re-chunk changed Markdown
  SHA-256 hash each chunk
         │
    ┌────┴────┐
    ▼         ▼
  Unchanged   New/changed
  → skip      → embed → upsert to Milvus
    │         │
    └────┬────┘
         ▼
   Milvus (shadow index)
   Always in sync, fully rebuildable

What makes this pattern powerful is that Milvus is treated as a cache, not as a database. If the Milvus index gets corrupted, or if you want to switch embedding models, you run memsearch index --force and rebuild the entire index from the Markdown files. No data loss. No migration scripts.
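The hash-and-skip step in the diagram can be sketched in a few lines. This is not memsearch's actual code: the blank-line chunker, the function names, and the handling of stale hashes are assumptions chosen to keep the example small.

```python
import hashlib


def chunk_markdown(text):
    """Naive chunker (assumption): split on blank lines, drop empty chunks."""
    return [chunk.strip() for chunk in text.split("\n\n") if chunk.strip()]


def chunk_hash(chunk):
    """Content address for a chunk: SHA-256 of its UTF-8 bytes."""
    return hashlib.sha256(chunk.encode("utf-8")).hexdigest()


def plan_sync(markdown_text, indexed_hashes):
    """Compare current chunks against hashes already in the index.

    Returns (to_upsert, to_delete): chunks that need embedding, keyed by
    hash, and hashes whose content no longer exists in the file.
    """
    current = {chunk_hash(c): c for c in chunk_markdown(markdown_text)}
    to_upsert = {h: c for h, c in current.items() if h not in indexed_hashes}
    to_delete = set(indexed_hashes) - set(current)
    return to_upsert, to_delete
```

Because unchanged chunks hash to the same value, appending one paragraph to a long memory file costs one embedding call rather than a full re-index. A forced rebuild is the same plan run against an empty set of indexed hashes.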

The recall pipeline is three layers deep:

L1: memsearch search "batch size"  →  ranked chunks from Milvus
         │ (need more context?)
         ▼
L2: memsearch expand <chunk_hash>  →  full Markdown section
         │ (need original dialogue?)
         ▼
L3: parse-transcript <session.jsonl>  →  raw conversation

Layer 1 uses Milvus hybrid search (dense vectors + BM25 + RRF). Layer 2 expands to the full Markdown section. Layer 3 goes back to the raw JSONL transcript if you need the exact conversation. This progressive retrieval pattern is one we have seen repeatedly in this series, and Milvus handles the first layer efficiently at scale.

LangChain Integration

For teams using LangChain 1.0, Milvus integrates through the langchain-milvus package as a standard VectorStore:

from langchain.agents import create_agent
from langchain_milvus import Milvus
from langchain_openai import OpenAIEmbeddings

# Connect to Milvus
vectorstore = Milvus(
    embedding_function=OpenAIEmbeddings(),
    collection_name="agent_memory",
    connection_args={"uri": "http://localhost:19530"},
)

# Expose the retriever as a tool the agent can call
recall_memory = vectorstore.as_retriever().as_tool(
    name="recall_memory",
    description="Search past conversations and decisions",
)

agent = create_agent(
    model="openai:gpt-4o",
    tools=[recall_memory],
    system_prompt="You can search your memory to recall past context.",
)

Milvus also integrates with LlamaIndex, CrewAI, AutoGen, Mem0, and 21 other frameworks. Mem0, for its part, supports 19 vector store backends, including Qdrant, Chroma, pgvector, and Pinecone, which gives you portability if you ever need to switch.

Deployment Modes

Milvus offers three self-hosted deployment modes, plus a managed service, each suited to a different scale of agent memory:

Milvus Lite (embedded): A single-file Python library, similar to SQLite. No server to run. Perfect for development, prototyping, and single-agent deployments. Just point your connection URI at a local file:

Milvus(embedding_function=OpenAIEmbeddings(), connection_args={"uri": "./agent_memory.db"})

Milvus Standalone (Docker): A single Docker container with all components. Good for small teams running a few agents. Supports all index types and hybrid search.

Milvus Cluster (Kubernetes): The full four-layer architecture with independent scaling. For production systems with billions of vectors, real-time ingestion at 80K vectors/second, and query latencies under 10ms. This is where Milvus earns its purpose-built reputation.

Zilliz Cloud (managed): A fully managed Milvus service with automatic RaBitQ parameter tuning, a vendor-claimed 10x performance improvement, and a free tier. It eliminates the operational overhead of running a distributed database.

The Gotcha: Operational Complexity

Here is the thing nobody tells you about Milvus. For all its power at scale, it is a distributed system, and distributed systems are hard.

Setting up a Milvus cluster means managing etcd for metadata, MinIO or S3 for object storage, the Woodpecker WAL, and multiple node types (streaming, query, data, coordinator). You need Kubernetes. You need to understand segment compaction, index building schedules, and replica placement. The documentation is extensive but the learning curve is real.

Compare this to pgvector, where you just CREATE EXTENSION vector and start inserting. Or ChromaDB, where you pip install chromadb and everything runs in-process. Milvus is not the right choice for a prototype or a side project. It is the right choice when you have hit the ceiling of those simpler tools and need to scale horizontally.

The other gotcha: Milvus Lite, the embedded mode, has limited query performance compared to the full cluster. It uses brute-force search for small datasets and basic HNSW for larger ones, but it does not support tiered storage, RaBitQ, or the full-text search optimizations of Milvus 2.6. If you need the advanced features, you need the server.

My recommendation: start with Milvus Lite or ChromaDB for development. Move to Milvus Standalone when you hit a million vectors. Move to Milvus Cluster when you hit ten million. And consider Zilliz Cloud if you do not want to manage Kubernetes.
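As a quick reference, the thresholds in that recommendation can be written down as a small helper. The function name and the decision to treat Zilliz Cloud as the cluster-scale alternative are illustrative choices; the cutoffs are rules of thumb, not hard limits.

```python
def recommend_deployment(num_vectors, manage_kubernetes=True):
    """Map a vector count to the deployment mode suggested in this post."""
    if num_vectors < 1_000_000:
        return "Milvus Lite or ChromaDB"
    if num_vectors < 10_000_000:
        return "Milvus Standalone"
    # At cluster scale, the managed service avoids running Kubernetes yourself
    return "Milvus Cluster" if manage_kubernetes else "Zilliz Cloud"


if __name__ == "__main__":
    for count in (50_000, 2_000_000, 500_000_000):
        print(f"{count:>11,} vectors -> {recommend_deployment(count)}")
```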

Practical Takeaways

Choose Milvus when your agent memory exceeds millions of vectors, you need real-time ingestion alongside queries, or you need horizontal scaling across multiple nodes.
RaBitQ quantization in Milvus 2.6 is a game-changer: 72% memory reduction with 94.9% recall and 4x throughput improvement. Use it.
Hybrid search with BM25 + dense vectors + RRF fusion runs inside Milvus, eliminating the need for application-level fusion logic.
Tiered storage automatically moves cold memories to cheap object storage while keeping hot memories in RAM or on local NVMe. Essential for cost control at scale.
memsearch demonstrates the best pattern for agent memory: Markdown as source of truth, Milvus as a rebuildable shadow index. If the database breaks, you lose nothing.
Time-aware decay functions let you boost recent memories without a separate application-layer ranking step.
Milvus Lite is great for development, but the advanced features (RaBitQ, tiered storage, full-text search) require the server deployment.
Operational complexity is real. Budget time for Kubernetes management, monitoring, and understanding the four-layer architecture. Consider Zilliz Cloud if you want the power without the ops burden.

What’s Next

We have now covered the three main vector database options for agent memory: pgvector for the Postgres faithful, ChromaDB for the embedded case, and Milvus for the scale. Next time, we will look at how to choose the right embedding model for your agent memory: what dimensions to use, which models perform best for conversational data, and why the “best” model on a benchmark might not be the best model for your agent.

Previously: ChromaDB: The Embedded Vector Database That Lives Inside Your Agent Process