memsearch: How Milvus Became a Disposable Index for Your Agent Memories

In the last post we looked at QMD, which packages the entire hybrid search pipeline into a single binary. Everything lives in one SQLite file: your inverted index, your vectors, your cache. It works beautifully, but it makes a specific bet: your search index is also your storage.

memsearch, from the Zilliz team (the company behind Milvus), makes the opposite bet. Your Markdown files are the storage. The vector database is disposable.

This is not a subtle distinction. It changes how you build, debug, maintain, and reason about your agent’s memory. Let me walk through why it matters and how the architecture actually works.

The Shadow Index Pattern

The core idea behind memsearch is simple: Markdown files are the source of truth, and Milvus is a derived, rebuildable cache. The project calls this the “shadow index” pattern.

Think of it like a build artifact. Your source code lives as text files. Your compiled binary is derived from those files. If you delete the binary, you can rebuild it. If you corrupt the binary, you rebuild it. The binary is useful, but it is not authoritative.

memsearch applies the same logic to agent memory. Your memories live as .md files in a directory structure. A Milvus instance (either local or Zilliz Cloud) indexes those files for search. If the index gets corrupted, you delete it and re-index. If you want to switch embedding models, you re-index. If you want to migrate to a different vector database, you re-index. Your memories survive all of it because they are just text files.

This pattern shows up across the files-first ecosystem. memweave uses the same approach with SQLite. QMD uses it with its own SQLite index. What makes memsearch distinctive is that it pairs this philosophy with Milvus, a vector database purpose-built for billion-scale collections. The message is clear: even at massive scale, your real data should live somewhere you can read it.

How memsearch Works

memsearch is a Python library with a plugin architecture for different agent platforms. The core has four modules:

src/memsearch/
  core.py        # MemSearch class: index(), search(), watch(), compact()
  store.py       # MilvusStore: hybrid search, upsert, dedup
  chunker.py     # Markdown heading-based chunking + SHA-256 hash
  embeddings/    # Pluggable providers (ONNX, OpenAI, Google, ...)

The Index Pipeline

When you point memsearch at a directory of Markdown files, the index pipeline runs through these steps:

Discovery: scan_markdown_files() walks the directory tree and finds all .md files.
Chunking: chunk_markdown() splits each file into chunks at heading boundaries. A Markdown document with three ## sections produces three chunks, each preserving the heading hierarchy as context.
Hashing: Each chunk gets a SHA-256 content hash. This hash serves as the primary key in Milvus. If the content has not changed, the hash stays the same and the chunk is skipped.
Embedding: Only new or modified chunks get embedded. The hash comparison means you never re-embed unchanged content, which saves significant time and cost with API-based embedding providers.
Upsert: The chunk, its embedding, and its BM25 sparse vector are written to Milvus via upsert. If the primary key (content hash) already exists, Milvus overwrites that entry in place. If it does not exist, Milvus inserts a new entry. Because the key is derived from the content, an edited chunk arrives under a new key, and its stale predecessor is removed in the cleanup step.
Cleanup: When a Markdown file is deleted or a section is removed, memsearch detects the missing chunks during the next index run and removes them from Milvus. The index stays in sync with the filesystem.

The entire pipeline is incremental. After the first full index, subsequent runs only process changes. A file watcher mode (watch()) can run this pipeline automatically whenever files change, so the index stays current without manual re-indexing.
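To make the incremental behavior concrete, here is a minimal sketch of the chunk, hash, embed, upsert, and cleanup steps. It is not memsearch's actual code: a plain dict stands in for Milvus, the embed callable is a placeholder you supply, and the names chunk_by_heading and sync are hypothetical. The chunker is also simplified (it does not carry the heading hierarchy into each chunk, and it would split on a # line inside a code fence).

```python
import hashlib
import re
from typing import Callable

HEADING = re.compile(r"^#{1,6}\s")


def chunk_by_heading(text: str) -> list[str]:
    """Split Markdown into chunks, starting a new chunk at every heading line."""
    chunks, current = [], []
    for line in text.splitlines():
        if HEADING.match(line) and current:
            chunks.append("\n".join(current).strip())
            current = []
        current.append(line)
    if current:
        chunks.append("\n".join(current).strip())
    return [c for c in chunks if c]


def content_hash(chunk: str) -> str:
    """SHA-256 of the chunk text, used as the primary key."""
    return hashlib.sha256(chunk.encode("utf-8")).hexdigest()


def sync(files: dict[str, str], store: dict[str, dict],
         embed: Callable[[str], list[float]]) -> dict[str, int]:
    """Bring `store` (hash -> record) in line with `files` (path -> text)."""
    seen: set[str] = set()
    stats = {"embedded": 0, "skipped": 0, "removed": 0}
    for path, text in files.items():
        for chunk in chunk_by_heading(text):
            key = content_hash(chunk)
            seen.add(key)
            if key in store:  # unchanged content: no embedding call
                stats["skipped"] += 1
                continue
            store[key] = {"source": path, "text": chunk, "vector": embed(chunk)}
            stats["embedded"] += 1
    for key in set(store) - seen:  # chunks that no longer exist on disk
        del store[key]
        stats["removed"] += 1
    return stats
```

Running sync twice over the same files embeds everything on the first pass and nothing on the second, which is the property that keeps API embedding costs down.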

SHA-256 Deduplication

The SHA-256 approach is worth calling out because it solves a problem that trips up a lot of memory systems: redundant content.

Agents write a lot of similar things across sessions. If the agent records “the user prefers dark mode” in three separate sessions, a naive memory system stores three copies. Each one gets embedded. Each one shows up in search results. The token budget gets wasted on repetition.

memsearch’s content hashing catches this at the storage level. If two chunks produce the same SHA-256 hash, they are identical. Only one copy lives in Milvus. The deduplication is exact, not semantic, so it does not catch paraphrases. But for the common case of repeated observations across sessions, it is effective and cost-free.
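A short sketch shows both the strength and the limit of exact deduplication. The dedupe helper is hypothetical, written only for illustration; the point is that byte-identical text collapses to one key while a paraphrase does not.

```python
import hashlib


def content_hash(text: str) -> str:
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def dedupe(chunks: list[str]) -> dict[str, str]:
    """Keep the first chunk seen for each content hash."""
    unique: dict[str, str] = {}
    for chunk in chunks:
        unique.setdefault(content_hash(chunk), chunk)
    return unique


observations = [
    "The user prefers dark mode.",
    "The user prefers dark mode.",
    "The user prefers dark mode.",
    "User likes a dark theme.",  # paraphrase: different bytes, different hash
]
unique = dedupe(observations)
print(len(unique))  # 2: the three identical lines collapse, the paraphrase survives
```

Even a trailing space or a changed capital letter produces a different hash, so this only helps when the repeated text is truly identical.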

The Three-Layer Retrieval Model

This is where memsearch’s architecture gets genuinely interesting. Most memory systems return search results as flat lists: here are ten chunks ranked by relevance. memsearch instead uses a progressive retrieval model with three layers.

Layer 1: Search

You start with a standard hybrid search. memsearch runs dense vector search and BM25 sparse search in parallel, then fuses the results with Reciprocal Rank Fusion. The output is a ranked list of chunk summaries with their scores and source metadata.
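Reciprocal Rank Fusion itself is only a few lines. The sketch below is a generic version, not memsearch's or Milvus's implementation; the constant k=60 is the value commonly used in the RRF literature, and whether memsearch uses the same value is an assumption on my part.

```python
def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[tuple[str, float]]:
    """Fuse several ranked lists of IDs: each list adds 1 / (k + rank) to an ID's score."""
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Highest fused score first; ties broken by ID so the output is deterministic.
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))


dense_hits = ["a", "b", "c"]   # ranked by vector similarity
sparse_hits = ["b", "d", "a"]  # ranked by BM25
fused = reciprocal_rank_fusion([dense_hits, sparse_hits])
```

RRF only looks at rank positions, never at raw scores, which is why it can combine cosine similarities and BM25 scores without any normalization step.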

This is the layer you use for most queries. “What did we decide about the deployment timeline?” gets answered from the search layer. The results are compact enough to fit in your token budget without trouble.

Layer 2: Expand

Sometimes a search result looks relevant but you need more context. The chunk summary mentions a decision about deployment, but you want the full section. The expand layer lets you retrieve the complete chunk content by its content hash.

Think of this as “zooming in” on a search result. You do not run a new search. You fetch the full content of a specific chunk that you already identified as relevant. This is useful when the search layer’s truncated summaries are not enough to answer the question.

Layer 3: Transcript

Some memories are daily logs that capture raw conversation transcripts. The search layer might surface a chunk from 2026-05-20.md that mentions a discussion about API design. But the chunk is a compressed summary. The actual conversation, with all its nuance and back-and-forth, lives in the original transcript.

The transcript layer retrieves the original Markdown file content associated with a chunk. This is the deepest level of recall. You go from a search hit, to a full chunk, to the original conversation that generated it.
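The three layers are easier to see side by side in code. This is a toy model under loud assumptions: the class and method names are mine, a keyword count stands in for the hybrid search, and a dict of file contents stands in for the Markdown directory. It only illustrates how each layer returns more text than the one before.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Chunk:
    chunk_hash: str
    source: str  # Markdown file the chunk came from
    text: str


class LayeredMemory:
    def __init__(self, chunks: list[Chunk], files: dict[str, str]):
        self._by_hash = {c.chunk_hash: c for c in chunks}
        self._files = files

    def search(self, query: str, preview_chars: int = 40) -> list[dict]:
        """Layer 1: ranked hits with truncated previews (keyword count as a stand-in)."""
        terms = query.lower().split()
        hits = []
        for chunk in self._by_hash.values():
            score = sum(chunk.text.lower().count(term) for term in terms)
            if score:
                hits.append({"hash": chunk.chunk_hash, "source": chunk.source,
                             "score": score, "preview": chunk.text[:preview_chars]})
        return sorted(hits, key=lambda hit: -hit["score"])

    def expand(self, chunk_hash: str) -> str:
        """Layer 2: full text of one chunk that was already identified."""
        return self._by_hash[chunk_hash].text

    def transcript(self, chunk_hash: str) -> str:
        """Layer 3: the whole source file behind the chunk."""
        return self._files[self._by_hash[chunk_hash].source]
```

An agent would call search first, then expand only the hashes that look promising, and reach for transcript only when the full chunk still leaves the question open.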

Why Progressive Retrieval Matters

The three-layer model solves a specific problem: token budget management.

In the token budget post in this series, we discussed how memory retrieval is a zero-sum game. Every token you spend on context is a token you cannot spend on the model’s reasoning. Loading full documents for every search result is wasteful. But loading only summaries means you sometimes miss critical detail.

Progressive retrieval defers the cost. You start cheap with summaries. You only pay the token cost of full content when you need it. And you only pay the cost of original transcripts when the full chunk is not enough.

In practice, this means the agent can run a broad search, scan the lightweight results, and selectively drill into the ones that matter. The search costs almost nothing. The expand costs a little more. The transcript is the most expensive but reserved for cases where nothing else suffices.

For Claude Code specifically, memsearch runs this three-layer process in a subagent. The subagent decides how deep to drill and returns organized results to the main session. This keeps the main context window clean and lets the subagent burn tokens on exploration without polluting the primary conversation.

The Agent Integration

memsearch provides plugins for several agent platforms. The Claude Code plugin is the most mature, and its architecture reveals how memory capture works in practice.

Session Capture

When you install the memsearch plugin in Claude Code, it registers a Stop hook. Every time Claude finishes a response, the hook fires:

Claude response ends → Stop hook fires
  → Parse the conversation transcript
  → Call Claude Haiku to generate a summary
  → Append summary to today's Markdown file (e.g., memory/2026-05-23.md)
  → File watcher detects the change
  → Index pipeline runs: chunk → hash → embed → upsert to Milvus

The capture step uses Haiku because it is cheap and fast. A summary of a coding session might cost a fraction of a cent. The summary gets written to a Markdown file with a session anchor, an HTML comment that links the chunk back to the specific conversation.

The file watcher then picks up the change automatically. Within seconds, the new memory is indexed and searchable. The whole flow runs in the background without interrupting your work.
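The write half of that flow is plain file I/O. Here is a rough sketch of the append step, assuming the summary text has already been produced by the model. The function name, the heading layout, and the exact shape of the HTML comment anchor are all hypothetical; memsearch's real format may differ.

```python
from datetime import date
from pathlib import Path


def append_session_summary(memory_dir: Path, session_id: str,
                           summary: str, today: date) -> Path:
    """Append one session summary to the daily Markdown file and return its path."""
    memory_dir.mkdir(parents=True, exist_ok=True)
    path = memory_dir / f"{today.isoformat()}.md"
    entry = (
        f"\n## Session {session_id}\n"
        f"<!-- session:{session_id} -->\n"  # hypothetical anchor format
        f"{summary.strip()}\n"
    )
    # Append mode keeps earlier sessions from the same day intact.
    with path.open("a", encoding="utf-8") as handle:
        handle.write(entry)
    return path
```

Because each session lands under its own heading, a heading-based chunker would turn every session into a separate chunk, and the anchor travels with it.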

Search Injection

When you start a new Claude Code session, the plugin can inject recent memories into the context. This addresses the “cold-start” problem: a fresh session has no conversation history, so the agent starts from zero. With recent memory summaries loaded into the system prompt, the agent immediately knows what happened in previous sessions.

The injection is configurable. You can control how many recent memories to load, how far back to look, and whether to include search results from a specific query. This gives the agent a running start without consuming excessive tokens.
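One way to picture those knobs is a small selection function. This is a sketch under assumptions: the parameter names max_items, lookback_days, and max_chars are invented for illustration and are not memsearch's configuration keys, and a character count stands in for a real token count.

```python
from datetime import date, timedelta


def select_recent(memories: list[tuple[date, str]], today: date,
                  max_items: int = 5, lookback_days: int = 7,
                  max_chars: int = 2000) -> list[str]:
    """Pick the newest memories inside the lookback window, within a size budget."""
    cutoff = today - timedelta(days=lookback_days)
    recent = sorted((m for m in memories if m[0] >= cutoff),
                    key=lambda m: m[0], reverse=True)
    picked: list[str] = []
    used = 0
    for _, text in recent[:max_items]:
        if used + len(text) > max_chars:
            break  # stop before exceeding the budget
        picked.append(text)
        used += len(text)
    return picked
```

The budget check is what keeps a busy week of sessions from crowding out the rest of the system prompt.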

Per-Project Isolation

Each project gets its own Milvus collection. If you work on three codebases, each one has an independent memory index. This prevents cross-contamination where memories from one project pollute search results in another.
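If you wanted to reproduce this isolation yourself, one approach is to derive the collection name from the project path. The scheme below is my own illustration, not how memsearch names its collections. It relies on one fact about Milvus: collection names may contain only letters, digits, and underscores, and must start with a letter or underscore.

```python
import hashlib
import re
from pathlib import PurePosixPath


def collection_name(project_path: str, prefix: str = "mem") -> str:
    """Build a Milvus-safe collection name that is unique per project path."""
    base = re.sub(r"[^A-Za-z0-9_]", "_", PurePosixPath(project_path).name) or "project"
    # A short digest of the full path separates projects that share a folder name.
    digest = hashlib.sha256(project_path.encode("utf-8")).hexdigest()[:8]
    return f"{prefix}_{base}_{digest}"
```

Two checkouts named api in different parent directories get different collections, so their memories never mix.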

The Embedding Story

memsearch supports multiple embedding providers through a pluggable interface:

Provider	Cost	Latency	Quality	Privacy
ONNX (local)	Free	Medium	Good	Full
OpenAI	$$ per token	Low	High	None
Google	$$ per token	Low	High	None
Voyage	$$ per token	Low	High	None

The default for the Claude Code plugin is ONNX, which runs the embedding model locally via ONNX Runtime. This means your memory content never leaves your machine during indexing. The quality is lower than API-based models, but for most agent memory use cases, the difference is negligible.

If you need higher-quality embeddings, you can switch to an API provider with a configuration change. Vectors from different models are not comparable, so switching means re-embedding every chunk in the index. The content hashes do not change, though, so chunk identities stay stable and the Markdown source files are never touched.
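A pluggable provider can be as small as one method. The sketch below is an assumption about the shape of such an interface, not memsearch's actual base class; ToyHashEmbedder is a deterministic stand-in with no semantic meaning, and reembed is a hypothetical helper showing that a model switch recomputes vectors while the content-hash keys stay put.

```python
import hashlib
from typing import Protocol


class EmbeddingProvider(Protocol):
    name: str

    def embed(self, texts: list[str]) -> list[list[float]]: ...


class ToyHashEmbedder:
    """Deterministic placeholder for a real model (dim must be 32 or less)."""

    def __init__(self, name: str, dim: int):
        self.name, self.dim = name, dim

    def embed(self, texts: list[str]) -> list[list[float]]:
        vectors = []
        for text in texts:
            digest = hashlib.sha256(f"{self.name}:{text}".encode("utf-8")).digest()
            vectors.append([byte / 255.0 for byte in digest[: self.dim]])
        return vectors


def reembed(store: dict[str, dict], provider: EmbeddingProvider) -> int:
    """Recompute vectors for records embedded by another provider; keys are untouched."""
    stale = [key for key, rec in store.items() if rec.get("model") != provider.name]
    vectors = provider.embed([store[key]["text"] for key in stale])
    for key, vector in zip(stale, vectors):
        store[key]["vector"] = vector
        store[key]["model"] = provider.name
    return len(stale)
```

Recording the model name next to each vector is what lets a second run detect that nothing is stale and skip the embedding calls entirely.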

The compact Operation

Over time, daily log files grow. A file for a busy project might accumulate hundreds of chunks. memsearch provides a compact() operation that consolidates old daily logs into periodic summaries.

The compact operation:

Reads the full content of daily log files older than a threshold
Calls an LLM to produce a consolidated summary
Writes the summary to a weekly or monthly file
Removes the individual daily files from the index
Indexes the new consolidated file

This is the write-path maintenance that keeps a memory system healthy over weeks and months. Without compaction, daily logs grow linearly and search results become cluttered with stale, low-value observations. With compaction, the memory stays focused and relevant.
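The selection half of compaction is simple date arithmetic. The sketch below plans which daily logs would be folded into which weekly file; it leaves out the LLM summarization and the re-indexing. The function name, the 30-day default, and the YYYY-Www.md naming for weekly files are assumptions for illustration.

```python
import re
from collections import defaultdict
from datetime import date, timedelta

DAILY_LOG = re.compile(r"(\d{4}-\d{2}-\d{2})\.md")


def plan_compaction(daily_files: list[str], today: date,
                    older_than_days: int = 30) -> dict[str, list[str]]:
    """Map each target weekly file to the daily logs it would replace."""
    cutoff = today - timedelta(days=older_than_days)
    plan: dict[str, list[str]] = defaultdict(list)
    for name in sorted(daily_files):
        match = DAILY_LOG.fullmatch(name)
        if not match:
            continue  # not a daily log (e.g. notes.md or an existing weekly file)
        day = date.fromisoformat(match.group(1))  # raises ValueError on impossible dates
        if day < cutoff:
            year, week, _ = day.isocalendar()
            plan[f"{year}-W{week:02d}.md"].append(name)
    return dict(plan)
```

Producing the plan as data before touching anything makes it easy to review, or to dry-run, what a compaction pass is about to summarize.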

The tradeoff is granularity. A weekly summary compresses a hundred individual observations into a general overview. Specific details, like the exact API parameter discussed on Tuesday, might not survive the summarization. memsearch’s three-layer retrieval helps here: compaction removes the daily files from the index, not from disk, so the transcript layer can still retrieve the original daily log if the summary is not enough.

memsearch vs QMD: A Contrast in Philosophy

Since we covered QMD in the last post, it is worth comparing the two directly. They solve the same problem (agent memory search) with fundamentally different architectures.

QMD packages everything into one binary. One SQLite file holds the FTS5 index, the vector index, the model cache, everything. It runs three local GGUF models for embedding, reranking, and query expansion. It is self-contained and requires zero infrastructure beyond Node.js.

memsearch separates concerns explicitly. Your Markdown files are storage. Milvus is the search index. The embedding model is pluggable. The capture hook is separate from the search pipeline. This modularity means you can swap any component without rebuilding the system.

QMD is simpler to get started with. Install the npm package, point it at your files, and you have search. memsearch requires either a local Milvus instance or a Zilliz Cloud account. The setup is more involved.

QMD caps out at the quality of its local models. memsearch can use any embedding provider, including the latest API models. If you need the quality of OpenAI’s text-embedding-3-large or Voyage’s voyage-4-large, memsearch supports that without code changes.

QMD is fully local and private by default. memsearch can be fully local (with ONNX embeddings and a local Milvus instance) but defaults to a cloud Milvus for the Claude Code plugin, which means your embedding vectors live on Zilliz’s infrastructure. The source Markdown files stay local, but the vector representations do not.

Both tools embody the source-of-truth pattern. Both use hybrid search with RRF fusion. Both handle incremental indexing and deduplication. The difference is where they draw the lines between components, and what they optimize for.

The Gotcha: Shadow Index Drift

The shadow index pattern has a failure mode that does not get enough attention: drift.

Your Markdown files are the source of truth, and Milvus is derived from them. But what if they get out of sync? A file gets modified outside the file watcher’s scope. An embedding API call fails silently. A Milvus collection gets partially corrupted during an update.

When the shadow index drifts from the source, you get memory that silently returns wrong or missing results. The system does not fail loudly. It just returns stale search results, and you have no way to know unless you manually compare the index against the files.

memsearch mitigates this with a few mechanisms. The file watcher runs on every file change, reducing the window for drift. The SHA-256 hashing means re-indexing is idempotent and safe to run at any time. And the index is disposable, so a full rebuild is always an option.

But the fundamental risk remains. If you rely on the shadow index pattern, you need a strategy for detecting drift. Periodic full re-indexes are the simplest approach. Hash comparisons between the filesystem and the index are more efficient but require additional tooling.
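The hash comparison does not need much tooling. Here is a minimal sketch, assuming you can list the content hashes currently stored in the index (how you fetch them from Milvus is left out) and can re-chunk the files on disk. The function names are hypothetical.

```python
import hashlib


def hashes_of(chunks: list[str]) -> set[str]:
    """Content hashes the index should contain, computed from the files on disk."""
    return {hashlib.sha256(chunk.encode("utf-8")).hexdigest() for chunk in chunks}


def detect_drift(source_chunks: list[str], indexed_hashes: set[str]) -> dict[str, set[str]]:
    """Compare what the files say against what the index holds."""
    expected = hashes_of(source_chunks)
    return {
        "missing": expected - indexed_hashes,   # on disk but never indexed
        "orphaned": indexed_hashes - expected,  # indexed but gone from disk
    }
```

If both sets come back empty, the index matches the files at the chunk level. Anything in missing needs embedding, and anything in orphaned can be deleted.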

This is not a reason to avoid the pattern. It is a reason to respect it. A disposable index is only valuable if you have a reliable way to rebuild it.

The Gotcha: Milvus Is Not a Toy

memsearch uses Milvus as its search backend, and Milvus is a serious piece of infrastructure. For development and small projects, you can run it locally with Docker Compose. But Milvus was designed for billion-scale vector collections with millisecond-level latency requirements. Running it for a personal memory system is, in some ways, like using PostgreSQL to store a todo list. It works, but the operational overhead is real.

Zilliz Cloud eliminates this overhead. You create an account, get a cluster endpoint and API key, and memsearch handles the rest. But this introduces a cloud dependency, which contradicts the local-first ethos that attracted many people to the files-first movement in the first place.

The tension is genuine. memsearch’s architecture is philosophically local-first (Markdown files on disk), but its default deployment is cloud-dependent (Milvus/Zilliz). You can run Milvus locally, but the setup is non-trivial compared to QMD’s single-binary approach.

If you are already running Milvus for other purposes, memsearch is a natural fit. If you are not, the overhead of adding Milvus to your stack is the main cost of adoption.

Practical Takeaways

The shadow index pattern separates storage from search. Your Markdown files are the source of truth. The vector database is derived and disposable. This gives you debuggability, editability, and migration flexibility.
SHA-256 content hashing provides exact deduplication for free. Identical memories from different sessions collapse into a single index entry, saving embedding costs and reducing noise in search results.
Three-layer progressive retrieval (search, expand, transcript) is a practical pattern for token budget management. Start with lightweight summaries and only pay the cost of full content when you need it.
The file watcher makes the index self-maintaining. New memories get indexed within seconds of being written, without manual re-indexing.
The compact operation is essential for long-running agents. Without periodic consolidation, daily logs grow linearly and search quality degrades.
memsearch is not the simplest tool to set up. If you want zero infrastructure, QMD is the better choice. If you already use Milvus or need API-quality embeddings, memsearch is the better choice.
Watch for shadow index drift. The pattern is only safe if you have a way to detect and repair sync issues between your files and your index.

What’s Next

We have now covered the major local-first memory tools: QMD for the all-in-one approach, memsearch for the shadow index approach, and AIngram for the single-SQLite-file approach. Next time, we will zoom out and look at a different angle entirely: how agents decide what is worth remembering in the first place. The write path, the extraction pipeline, and the filtering decisions that determine whether a memory ever makes it into storage.

Previously in this series: QMD: The 2GB Search Engine That Beats Cloud Retrieval Without Ever Leaving Your Machine

memsearch is open source (MIT) at github.com/zilliztech/memsearch. The Milvus vector database is at milvus.io.