ChromaDB: The Embedded Vector Database That Lives Inside Your Agent Process

You’re building an AI agent that needs to remember things. You’ve got embeddings working. Your agent converts conversations, documents, and observations into dense vectors. Now you need somewhere to put them.

Option A: Stand up a Postgres instance, install pgvector, manage schemas and migrations, configure connection pooling, and pray your Cloudflare Worker doesn’t time out waiting for a round trip.

Option B: pip install chromadb. Done. Your vector database is a Python import.

That’s the ChromaDB pitch, and it’s why the project has racked up 27,000+ GitHub stars and 15 million monthly downloads. It’s the SQLite of vector databases: embedded, zero-config, and running inside your process. For agent memory, where you often need a local, fast, private vector store that doesn’t require a DevOps team, ChromaDB has become the default choice.

But defaults can be deceptive. ChromaDB ships with L2 (squared Euclidean) distance as its default metric, while most text embedding models are built around cosine similarity. If your vectors are not unit-normalized, L2 silently changes the ranking, and even when they are, the distance values no longer mean what a cosine-based threshold assumes. And its embedded mode, the very feature that makes it appealing, becomes a liability the moment you have multiple workers or processes that need to share memory state.

How ChromaDB Works: Four Modes, One API

ChromaDB’s architecture revolves around a single consistent API that works across four deployment modes:

Ephemeral mode. Everything lives in memory. The database dies when your process exits. Perfect for testing, prototyping, and agents that rebuild their memory from source on each run.

import chromadb
client = chromadb.Client()  # In-memory, gone on exit

Persistent mode. Data is written to disk, with SQLite holding metadata and documents and the HNSW index stored in files alongside it. Your agent can restart and pick up where it left off. No server, no Docker, no network. This is the sweet spot for single-agent setups.

client = chromadb.PersistentClient(path="./agent-memory")

HTTP client-server mode. Run chroma run to start a standalone server, then connect to it over HTTP. Gives you shared access from multiple processes, at the cost of network latency and operational complexity.

client = chromadb.HttpClient(host="localhost", port=8000)

Chroma Cloud. The managed offering. Serverless, auto-scaling, with object storage and automatic data tiering. BYOC (Bring Your Own Cloud) available for enterprises. Still relatively new, but removes the operational burden entirely.

client = chromadb.CloudClient(
    tenant="your-tenant",
    database="your-db",
    api_key="your-key"
)

For most agent memory use cases, persistent mode is where you start. It’s zero-infrastructure, fast (reads from local disk), and your agent’s memory survives restarts. You only graduate to HTTP or Cloud when you need multi-process access or multi-agent sharing.

Collections: The Organizing Primitive

ChromaDB stores everything in collections. These are named containers that hold your embeddings, documents, metadata, and IDs together. Think of them as tables in a relational database, but purpose-built for vector operations.

# Create a collection with cosine similarity (not the default!)
collection = client.get_or_create_collection(
    name="agent_memory",
    metadata={"hnsw:space": "cosine"}
)

# Add memories
collection.add(
    documents=["User prefers dark mode in all applications"],
    metadatas=[{"type": "preference", "confidence": 0.95, "source": "conversation"}],
    ids=["mem-001"]
)

# Or add pre-computed embeddings directly
collection.add(
    embeddings=[[0.1, 0.2, ..., 0.9]],  # 384-dim vector
    documents=["User prefers dark mode"],
    ids=["mem-001"]
)

Collections support four types of queries:

Vector search: Find documents similar to a query embedding
Full-text search: BM25-style keyword matching (added October 2025 with sparse vector support)
Metadata filtering: Exact match and range filters on metadata fields
Regex search: Regular expression matching on document text (added June 2025)

The full-text and regex search capabilities are relatively new additions that make ChromaDB viable as a hybrid search engine, not just a pure vector store. This is significant for agent memory, where keyword exact-match (for identifiers, names, error codes) and semantic similarity (for concepts, paraphrases) both matter.

The Default Distance Gotcha

The single most common mistake in ChromaDB-powered agent memory:

# This uses L2 distance — probably not what you want
collection = client.create_collection(name="memory")

# This uses cosine similarity — almost certainly what you want
collection = client.create_collection(
    name="memory",
    metadata={"hnsw:space": "cosine"}
)

ChromaDB defaults to L2 (squared Euclidean) distance. For embeddings from models like all-MiniLM-L6-v2, OpenAI’s text-embedding-3-small, or Cohere’s embed-english-v3.0, the standard similarity metric is cosine similarity: these models are trained so that the angle between two vectors captures semantic relatedness.

How much the default hurts depends on normalization. For unit-length vectors, squared L2 equals 2 − 2·cos, so both metrics return the same ranking and only the distance values differ (which still matters if you apply a similarity threshold). For vectors that are not unit-length, L2 also penalizes differences in magnitude. The results don’t turn into garbage (they stay correlated), but the rankings shift, especially for borderline relevant memories. The agent might recall the #1 result correctly but miss the #3 result that would have surfaced with cosine distance. In memory retrieval, the long tail matters.

You can also use inner product ("hnsw:space": "ip") if your embeddings are specifically trained for dot-product similarity (some models are), but cosine is the safe default for the vast majority of embedding models used in agent systems.

Rule of thumb: Unless you have a specific reason to use L2, always configure "hnsw:space": "cosine" when creating collections for agent memory.
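
To see why normalization decides how much the default matters, here is a small pure-Python sketch with no ChromaDB involved. The two-dimensional vectors and their names are made up for illustration, and squared_l2 simply mirrors the squared form of L2 distance rather than calling into the library.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def squared_l2(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def cosine_distance(a, b):
    return 1.0 - dot(a, b) / math.sqrt(dot(a, a) * dot(b, b))

def normalize(v):
    norm = math.sqrt(dot(v, v))
    return [x / norm for x in v]

def rank(query, memories, distance):
    """Return memory names ordered from closest to farthest."""
    return sorted(memories, key=lambda name: distance(query, memories[name]))

# Hypothetical 2-d "embeddings" chosen to make the effect visible.
query = [1.0, 0.0]
memories = {
    "aligned_but_long": [3.0, 0.3],   # nearly the same direction, larger magnitude
    "unit_but_off_axis": [0.8, 0.6],  # unit length, pointing elsewhere
}

# On raw vectors the two metrics disagree about which memory is closest.
raw_cosine = rank(query, memories, cosine_distance)
raw_l2 = rank(query, memories, squared_l2)

# After normalizing everything to unit length, L2 agrees with cosine.
unit_query = normalize(query)
unit_memories = {name: normalize(v) for name, v in memories.items()}
unit_l2 = rank(unit_query, unit_memories, squared_l2)

print(raw_cosine, raw_l2, unit_l2)
```

On the raw vectors, cosine puts aligned_but_long first while L2 puts it last, because L2 is distracted by its magnitude. Once everything is unit length the two orderings match, which is why the distance setting matters most when you are unsure whether your model normalizes its output.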

HNSW Under the Hood

ChromaDB uses HNSW (Hierarchical Navigable Small World) as its indexing algorithm, the same one used by pgvector, Milvus, and Qdrant. HNSW builds a multi-layer graph where each node connects to its nearest neighbors, enabling approximate nearest neighbor search in sub-linear time.

The key tuning parameters:

collection = client.create_collection(
    name="memory",
    metadata={
        "hnsw:space": "cosine",
        "hnsw:construction_ef": 200,    # Build-time graph density
        "hnsw:search_ef": 100,          # Search-time traversal depth
        "hnsw:M": 32,                   # Max connections per node
        "hnsw:batch_size": 100          # Insertion batch size
    }
)

M (default 16): Maximum number of connections per node in the graph. Higher M = better recall at the cost of more memory and slower builds. For agent memory with under 100K vectors, M=32 is reasonable.
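construction_ef (default 100 in current Chroma documentation, so confirm against your installed version): Controls how thoroughly the graph is explored during index construction. Higher = better quality graph, slower builds. 200, as in the example above, is a reasonable setting when recall matters more than build time.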
search_ef (default 100): Controls how many nodes are examined during a query. Higher = better recall, slower queries. For agent memory where you need high recall (missing a relevant memory is worse than being slightly slower), bump this to 200+.

The performance profile at scale: p50 query latency of ~20ms warm, ~650ms cold at 384 dimensions and 100K vectors. For smaller stores (around 10K memories, warm process), queries complete in single-digit milliseconds.
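
One way to check whether search_ef is high enough is to compare HNSW results against an exact brute-force search on a sample of your data. The sketch below is plain Python and assumes you already have the stored vectors in a dict. The helper names are illustrative, and in practice approx_ids would come from collection.query(...)["ids"][0].

```python
import math

def cosine_distance(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return 1.0 - dot / (norm_a * norm_b)

def exact_top_k(query, vectors_by_id, k):
    """Brute-force nearest neighbours: the ground truth that HNSW approximates."""
    ranked = sorted(vectors_by_id, key=lambda i: cosine_distance(query, vectors_by_id[i]))
    return ranked[:k]

def recall_at_k(approx_ids, exact_ids):
    """Fraction of the true top-k that the approximate search returned."""
    if not exact_ids:
        return 1.0
    return len(set(approx_ids) & set(exact_ids)) / len(exact_ids)

# Toy data standing in for stored memories.
vectors = {"a": [1.0, 0.0], "b": [0.9, 0.1], "c": [0.0, 1.0], "d": [-1.0, 0.0]}
truth = exact_top_k([1.0, 0.0], vectors, k=2)

# Pretend the index returned "c" instead of "b".
approx_ids = ["a", "c"]
print(recall_at_k(approx_ids, truth))
```

If recall on a few hundred sampled queries is consistently below what you can tolerate, raising search_ef is the first knob to try. Brute force is slow, so run this offline rather than in the agent loop.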

Sparse Vectors and Hybrid Search

In October 2025, ChromaDB added sparse vector support, a feature that unlocks true hybrid search within a single system. Sparse vectors represent BM25-style term matching as high-dimensional, mostly-zero vectors. This means you can combine keyword precision with semantic understanding in one query.

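# Add with both dense and sparse vectors.
# Illustrative only: sparse_vectors stands in for whatever helper builds your
# sparse vectors, and the exact parameter name depends on your ChromaDB version.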
collection.add(
    ids=["mem-001"],
    documents=["PostgreSQL connection pool exhausted, error code PG-08006"],
    embeddings=[[0.1, 0.2, ..., 0.9]],           # Dense semantic vector
    sparse_embeddings=sparse_vectors([{            # Sparse BM25 vector
        4523: 1.8,    # "postgresql"
        8901: 2.1,    # "connection"
        2345: 3.5,    # "PG-08006"
    }])
)

When querying, ChromaDB can combine dense and sparse results using fusion scoring, similar to the Reciprocal Rank Fusion approach covered earlier in this series. This is a significant upgrade from the earlier approach of running BM25 externally and merging results yourself.

For agent memory specifically, hybrid search via sparse+dense vectors means your agent can find memories both by semantic meaning (“that time the database had connection issues”) and by exact identifiers (“PG-08006”). No external BM25 engine required.
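
For intuition about what fusion scoring does, here is a minimal Reciprocal Rank Fusion sketch in plain Python. It is not ChromaDB’s internal implementation, which may weight or normalize differently. The function name, the memory IDs, and the conventional constant k=60 are assumptions for illustration.

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Merge several ranked ID lists into one.

    Each list contributes 1 / (k + rank) per ID, with rank starting at 1.
    IDs that rank well in more than one list accumulate the highest score.
    """
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Sort by descending score; break ties by ID for a deterministic order.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))

# Hypothetical results: one list from dense search, one from sparse search.
dense_hits = ["mem-2", "mem-1", "mem-3"]
sparse_hits = ["mem-1", "mem-4", "mem-2"]

fused = reciprocal_rank_fusion([dense_hits, sparse_hits])
print(fused)
```

mem-1 ends up first because it is near the top of both lists, even though it leads only the sparse one. That is the behavior you want when a memory matches both the meaning of the query and an exact identifier in it.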

Metadata Filtering: Structured Context on Top of Vectors

ChromaDB’s metadata filtering is surprisingly powerful for agent memory:

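# Query with metadata filters
# query_vector: an embedding of the search text, computed with the same model used at write time
results = collection.query(
    query_embeddings=[query_vector],
    where={
        "$and": [
            {"type": {"$eq": "error"}},
            # Epoch seconds for 2026-05-01 00:00 UTC; range operators compare numbers
            {"timestamp": {"$gte": 1777593600}},
            {"confidence": {"$gte": 0.8}}
        ]
    },
    where_document={"$contains": "database"},
    n_results=10
)

You can filter on metadata fields with $eq, $ne, $gt, $gte, $lt, $lte, $in, and $nin, and combine conditions with $and and $or. The range operators compare numbers, so store timestamps as numeric values (epoch seconds, for example) rather than date strings. The where_document filter uses $contains to do substring matching on the raw document text, a lightweight alternative to full-text search for simple cases.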

For agent memory, this is essential. You typically want to scope queries by memory type (preference, fact, error, procedure), recency, confidence score, and source. Metadata filtering lets you do this without running a separate search pass.
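
Because a where clause takes exactly one top-level key (a single field or a single $and/$or), it helps to build filters with a small helper instead of assembling dicts by hand. The sketch below is one way to do it. The function name and the field names (type, confidence, timestamp) are assumptions taken from the examples in this post.

```python
def build_where(memory_type=None, min_confidence=None, since_epoch=None):
    """Build a ChromaDB-style where filter from optional scoping arguments.

    Returns None when no filter is needed, a single condition when only one
    argument is given, and an $and clause when several are combined.
    """
    conditions = []
    if memory_type is not None:
        conditions.append({"type": {"$eq": memory_type}})
    if min_confidence is not None:
        conditions.append({"confidence": {"$gte": min_confidence}})
    if since_epoch is not None:
        conditions.append({"timestamp": {"$gte": since_epoch}})

    if not conditions:
        return None
    if len(conditions) == 1:
        return conditions[0]
    return {"$and": conditions}

print(build_where(memory_type="error", min_confidence=0.8))
```

The result can be passed straight to collection.query(where=...). Returning None for the empty case matters because an empty dict or a one-element $and may be rejected, depending on the version.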

ChromaDB also supports metadata arrays (added February 2026), allowing you to store lists of values:

collection.add(
    documents=["React performance optimization guide"],
    metadatas=[{
        "tags": ["react", "performance", "frontend", "optimization"],
        "projects": ["webapp", "mobile-app"],
        "memory_type": "reference"
    }],
    ids=["ref-001"]
)

This makes it possible to filter by tag membership, which is a natural fit for categorized agent memory.

Embedding Functions: Built-in or Bring Your Own

ChromaDB can compute embeddings for you using built-in functions, or accept pre-computed vectors:

from chromadb.utils import embedding_functions

# Built-in: uses sentence-transformers under the hood
ef = embedding_functions.SentenceTransformerEmbeddingFunction(
    model_name="all-MiniLM-L6-v2"
)

collection = client.get_or_create_collection(
    name="memory",
    embedding_function=ef,
    metadata={"hnsw:space": "cosine"}
)

# Now you can add raw text — ChromaDB embeds it automatically
collection.add(
    documents=["User prefers concise responses"],
    ids=["pref-001"]
)

For agent memory, I recommend computing embeddings yourself rather than using the built-in functions. Here’s why:

Consistency: Your agent should use the same embedding model for queries and storage. If ChromaDB handles embedding, you need to ensure your query-time model matches — easy to get wrong.
Control: You might want to use domain-specific models, quantized models, or models with different dimensionality. Managing this yourself gives you full control.
Transparency: When embeddings fail (wrong model, dimension mismatch, truncation), you want to see exactly what vector went in. The built-in functions hide this.

# Preferred: compute embeddings yourself, store raw vectors
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")
embedding = model.encode("User prefers concise responses").tolist()

collection.add(
    embeddings=[embedding],
    documents=["User prefers concise responses"],
    ids=["pref-001"]
)
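
Owning the embedding step also gives you a natural place to catch bad vectors before they reach the collection. A minimal guard might look like the sketch below. The function name is hypothetical, and 384 is the output dimension of all-MiniLM-L6-v2.

```python
import math

def validate_embedding(vector, expected_dim):
    """Raise ValueError if the vector is unsafe to store; otherwise return it."""
    if len(vector) != expected_dim:
        raise ValueError(f"expected {expected_dim} dimensions, got {len(vector)}")
    if not all(isinstance(x, (int, float)) and math.isfinite(x) for x in vector):
        raise ValueError("embedding contains NaN, infinity, or non-numeric values")
    if not any(x != 0 for x in vector):
        raise ValueError("embedding is all zeros, so cosine distance is undefined")
    return vector

# Example: a 3-dimensional toy vector passes, a truncated one does not.
checked = validate_embedding([0.1, 0.2, 0.3], expected_dim=3)
try:
    validate_embedding([0.1, 0.2], expected_dim=3)
except ValueError as err:
    print(f"rejected: {err}")
```

In the snippet above you would call validate_embedding(embedding, 384) right before collection.add, so a wrong model or a silent truncation fails loudly at write time instead of quietly polluting recall.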

ChromaDB MCP Server: Direct Agent Integration

A notable recent development for agent builders is the ChromaDB MCP Server, a Model Context Protocol server that gives AI agents direct access to ChromaDB collections without custom code.

Instead of writing Python functions that your agent calls as tools, you run the MCP server alongside your agent. The agent can then list collections, query memories, add documents, and manage metadata through the standardized MCP interface.

This is particularly relevant for MCP-capable clients such as Claude and other LLM-based agents that support the protocol natively. Your agent can remember things by directly interacting with the vector database as a tool, with no middleware layer required.

For local-first agent setups, this means your agent’s memory loop becomes:

Agent encounters information worth remembering
Agent calls the MCP add tool to store it in ChromaDB
On future queries, agent calls the MCP query tool to retrieve relevant memories
Agent uses retrieved memories to ground its response

Zero custom code for the memory layer. The MCP server handles all the vector operations, and your agent just talks to it through structured tool calls.

The Multi-Worker Stale Data Problem

The second problem is subtler than the distance function issue.

ChromaDB’s persistent mode uses a combination of SQLite (for metadata) and on-disk HNSW index files (for vectors). When you create a PersistentClient, it loads the index into memory for fast queries. This is fine for single-process agents.

But if you’re running multiple agent workers (common in production), each worker loads its own in-memory copy. Worker A adds a memory. Worker B queries its collection. Worker B doesn’t see Worker A’s memory because it’s reading from a stale snapshot.

This is the fundamental trade-off of embedded databases, not a bug: each process keeps its own in-memory view of the index, and nothing tells it when another process has written to the files underneath. But for agent memory, where consistency matters (you don’t want agents hallucinating because they can’t recall what another worker learned), it’s a real problem.

Solutions:

Single-writer pattern: One agent worker owns writes, others are read-only replicas. Use ChromaDB’s HTTP server mode as the central write endpoint.
HTTP server mode: Run chroma run as a standalone service. All workers connect to the same server. You lose the zero-infrastructure benefit but gain consistency.
Chroma Cloud: Let the managed service handle replication and consistency for you.

For development and single-agent setups, persistent mode works perfectly. For multi-agent production systems, plan to use HTTP or Cloud mode from the start.
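
If you expect to move from persistent to HTTP mode later, it is easier when the choice lives in configuration rather than in code. The sketch below shows one possible arrangement. The environment variable names (AGENT_MEMORY_MODE, AGENT_MEMORY_PATH, CHROMA_HOST, CHROMA_PORT) are invented for this example, and chromadb is imported lazily so the settings logic can be exercised without it installed.

```python
import os

def resolve_client_settings(env=None):
    """Translate environment variables into client settings (pure function)."""
    env = os.environ if env is None else env
    mode = env.get("AGENT_MEMORY_MODE", "persistent")
    if mode == "persistent":
        return {"mode": "persistent", "path": env.get("AGENT_MEMORY_PATH", "./agent-memory")}
    if mode == "http":
        return {
            "mode": "http",
            "host": env.get("CHROMA_HOST", "localhost"),
            "port": int(env.get("CHROMA_PORT", "8000")),
        }
    raise ValueError(f"unknown AGENT_MEMORY_MODE: {mode!r}")

def make_client(env=None):
    """Create the ChromaDB client that matches the resolved settings."""
    import chromadb  # lazy import: only needed when a client is actually built

    settings = resolve_client_settings(env)
    if settings["mode"] == "http":
        return chromadb.HttpClient(host=settings["host"], port=settings["port"])
    return chromadb.PersistentClient(path=settings["path"])

print(resolve_client_settings({"AGENT_MEMORY_MODE": "http", "CHROMA_PORT": "9000"}))
```

A single worker on a laptop runs with the defaults. A multi-worker deployment sets AGENT_MEMORY_MODE=http on every worker so they all talk to the same chroma run server, and the rest of the agent code stays unchanged because both clients share the same API.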

Performance Characteristics

Based on benchmarks from the ChromaDB team and community measurements:

Configuration	Dimensions	Vectors	p50 Latency (warm)	p50 Latency (cold)
Persistent, M=16	384	10K	~3ms	~200ms
Persistent, M=16	384	100K	~20ms	~650ms
Persistent, M=32	384	100K	~25ms	~700ms
HTTP server	384	100K	~30ms	~50ms
Persistent, M=16	1536	100K	~80ms	~2s

The cold start penalty is significant. Loading 100K vectors of 384 dimensions takes hundreds of milliseconds. For agent memory, this means your first query after process startup will be slow. Warm queries are fast enough for interactive use.

Higher dimensions (OpenAI’s 1536-dim embeddings) dramatically increase latency and storage. If you’re using ChromaDB for agent memory, consider smaller models like all-MiniLM-L6-v2 (384 dims) or bge-small-en-v1.5 (384 dims). For short, conversational memories the retrieval quality difference is often small, but in the table above the latency difference is 3-4x.
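
The storage side of that trade-off is simple arithmetic. The sketch below estimates the raw vector payload only, assuming 4-byte float32 values and ignoring HNSW graph links, metadata, and documents, so treat the numbers as a lower bound.

```python
def raw_vector_bytes(num_vectors, dimensions, bytes_per_value=4):
    """Raw size of the vectors alone, assuming float32 storage."""
    return num_vectors * dimensions * bytes_per_value

small = raw_vector_bytes(100_000, 384)    # 153_600_000 bytes, about 153.6 MB
large = raw_vector_bytes(100_000, 1536)   # 614_400_000 bytes, about 614.4 MB

print(f"384 dims:  {small / 1e6:.1f} MB")
print(f"1536 dims: {large / 1e6:.1f} MB")
print(f"ratio:     {large / small:.0f}x")
```

Four times the dimensions means four times the bytes to read on a cold start and four times the arithmetic per distance computation, which lines up with the 3-4x latency gap in the table.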

The Rust Core Rewrite

In late 2025, ChromaDB completed a major rewrite of its core from Python to Rust (with the wal3 write-ahead log in September 2025 being a key milestone). This delivered roughly 4x performance improvements across the board: faster ingestion, faster queries, lower memory usage.

For agent memory builders, the Rust core means:

Faster cold starts: The in-memory index builds quicker on process startup
Better memory efficiency: More vectors fit in the same RAM
Improved write throughput: Bulk ingestion of memory snapshots is faster
WAL durability: The write-ahead log ensures data persistence even during crashes

The Python API hasn’t changed; the Rust core is transparent to users. But the performance improvements make ChromaDB viable for larger agent memory stores that would have been impractical with the old Python implementation.

Framework Integrations

ChromaDB’s ecosystem integration is one of its strongest selling points. It’s supported by:

LangChain: Chroma vector store with built-in document loaders
LlamaIndex: ChromaVectorStore with index composition
Mem0: Supported vector store backend for managed agent memory
CrewAI: Built-in memory backend for multi-agent crews
AutoGen: Conversation memory storage
Semantic Kernel: Microsoft’s agent framework

Mem0 alone supports 21 framework integrations and 19 vector store backends, and ChromaDB is one of the backends it works with out of the box. If you’re using any of these frameworks for agent memory, you may already be using ChromaDB under the hood.

ChromaDB vs pgvector vs Qdrant: When to Use What

The right choice depends on your architecture:

Choose ChromaDB when:

You’re building a single agent or a small number of agents
You want zero infrastructure (no Postgres, no Docker, no server)
You need fast local queries without network round trips
You’re prototyping or in early development
Your agent runs on a single machine (laptop, single server, edge device)

Choose pgvector when:

You already have Postgres in your stack
You need ACID transactions for memory operations
You want hybrid search (pgvector + tsvector) in SQL
Multiple services need to share the same memory store
You need production-grade consistency guarantees

Choose Qdrant when:

You need high-throughput vector search at serious scale (millions of vectors)
You want advanced filtering with payload indexes
You need gRPC support for low-latency production serving
You’re running a multi-tenant memory service

For agent memory specifically, the progression usually goes: ChromaDB (prototype) → pgvector (production with existing Postgres) → Qdrant/Milvus (scale). Many agent systems never leave ChromaDB because their memory stores stay under 100K vectors, well within ChromaDB’s comfort zone.

Practical Agent Memory Pattern with ChromaDB

This is a minimal but production-ready agent memory pattern using ChromaDB in persistent mode:

import chromadb
from datetime import datetime
from sentence_transformers import SentenceTransformer

class AgentMemory:
    def __init__(self, path="./memory", model="all-MiniLM-L6-v2"):
        self.client = chromadb.PersistentClient(path=path)
        self.encoder = SentenceTransformer(model)
        self.collection = self.client.get_or_create_collection(
            name="memories",
            metadata={
                "hnsw:space": "cosine",
                "hnsw:M": 32,
                "hnsw:search_ef": 200,
            }
        )

    def remember(self, text, memory_type="fact", confidence=1.0, source="agent"):
        embedding = self.encoder.encode(text).tolist()
        memory_id = f"mem-{datetime.now().strftime('%Y%m%d%H%M%S%f')}"

        self.collection.upsert(
            ids=[memory_id],
            embeddings=[embedding],
            documents=[text],
            metadatas=[{
                "type": memory_type,
                "confidence": confidence,
                "source": source,
                "created": datetime.now().isoformat(),
                "access_count": 0,
            }]
        )
        return memory_id

    def recall(self, query, memory_type=None, min_confidence=0.0, n=5):
        query_embedding = self.encoder.encode(query).tolist()

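        # A where clause takes a single top-level key, so multiple
        # conditions have to be combined under $and.
        conditions = [{"confidence": {"$gte": min_confidence}}]
        if memory_type:
            conditions.append({"type": {"$eq": memory_type}})
        where_filter = conditions[0] if len(conditions) == 1 else {"$and": conditions}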

        results = self.collection.query(
            query_embeddings=[query_embedding],
            where=where_filter,
            n_results=n,
            include=["documents", "metadatas", "distances"]
        )

        # Update access counts for retrieved memories
        for i, doc_id in enumerate(results["ids"][0]):
            meta = results["metadatas"][0][i]
            self.collection.update(
                ids=[doc_id],
                metadatas=[{
                    **meta,
                    "access_count": meta.get("access_count", 0) + 1,
                    "last_accessed": datetime.now().isoformat(),
                }]
            )

        return results

This pattern gives you cosine similarity (not L2), HNSW tuned for recall, metadata filtering by type and confidence, and access counting for decay-based forgetting (covered earlier in this series). Add sparse vectors for hybrid search, and you have a complete local agent memory system in roughly 60 lines of code.
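
The access_count and last_accessed fields only pay off once something reads them. As one possible use, the sketch below turns them into a retention score with exponential decay. The formula, the 30-day half-life, and the function name are assumptions for illustration, not necessarily the scheme from the earlier post.

```python
import math
from datetime import datetime, timedelta

def retention_score(metadata, now, half_life_days=30.0):
    """Score a memory by recency and usage; low scores are candidates for forgetting.

    Recency halves every half_life_days since the memory was last touched.
    Usage grows slowly (logarithmically) with the number of recalls.
    """
    last_seen = metadata.get("last_accessed") or metadata["created"]
    age_seconds = (now - datetime.fromisoformat(last_seen)).total_seconds()
    age_days = max(age_seconds / 86400.0, 0.0)
    recency = 0.5 ** (age_days / half_life_days)
    usage = 1.0 + math.log1p(metadata.get("access_count", 0))
    return recency * usage

now = datetime(2026, 6, 1)
# Created 30 days ago and never recalled: exactly one half-life old.
stale = {"created": (now - timedelta(days=30)).isoformat(), "access_count": 0}
# Older, but recalled three times and touched just now.
active = {
    "created": (now - timedelta(days=90)).isoformat(),
    "last_accessed": now.isoformat(),
    "access_count": 3,
}

print(retention_score(stale, now), retention_score(active, now))
```

A periodic job could fetch metadata with collection.get(include=["metadatas"]), score each entry, and delete or archive the ones that fall below a threshold you choose.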

The Bottom Line

ChromaDB fills a specific and important niche in the agent memory stack: the embedded, zero-infrastructure vector database that you can import and use in minutes. For single-agent setups, prototyping, and local-first architectures, it’s the right default choice.

But “default” doesn’t mean “thoughtless”. Configure cosine distance explicitly. Understand the multi-worker stale data limitation. Choose your embedding model intentionally. And know when to graduate to a server-based solution.

The best agent memory system isn’t the most sophisticated one. It’s the one you actually understand well enough to debug at 2 AM when your agent starts hallucinating because its recall pipeline is silently returning wrong results.

This post is part of the AI Agent Memory Systems series, a technical deep dive into how AI agents store, retrieve, and manage memory. The previous post covered pgvector: When Your Postgres Database Becomes Your Agent Memory Store.