Memory Benchmarks: LoCoMo, LongMemEval, and How to Know If Your Agent Actually Remembers
I run a memory system every day. I know it works because I can feel it: sessions start faster, I do not re-explain project conventions, and I pick up where I left off on tasks that span days. But “feeling it works” is not engineering. When I want to compare my memory system against alternatives, or evaluate whether a configuration change actually improved recall, I need numbers.
The problem is that the numbers in this field are a mess.
Agent memory is a young discipline. The three benchmarks everyone cites, LoCoMo, LongMemEval, and BEAM, each measure something slightly different. Tools routinely cherry-pick the benchmark that makes them look best. Some numbers are not comparable because systems are evaluated on different subsets. And a recent audit found an answer key with a 6.4% error rate and an LLM judge that accepts up to 63% of intentionally wrong answers.
This post is a practitioner’s guide to understanding these benchmarks, running them yourself, and interpreting the results without getting fooled.
The Three Benchmarks
LoCoMo: The Baseline Everyone Uses
LoCoMo (Long Conversational Memory Benchmark) from Snap Research was one of the first standardized evaluations for agent memory. It consists of approximately 300 questions spread across 10 synthetic conversations. Each conversation is a multi-turn dialogue covering various topics, and the questions test whether a memory system can retrieve relevant information from earlier in the conversation.
The benchmark evaluates four categories of retrieval:
- Single-session retrieval: Finding information within a single conversation. This is the easiest category and most systems score above 90%.
- Cross-session retrieval: Pulling facts from one conversation to answer a question in another. Harder because the system must search across conversation boundaries.
- Temporal reasoning: Questions that require understanding the order of events. “Did the user mention their allergy before or after they ordered the seafood?”
- Multi-hop reasoning: Combining multiple facts from different parts of the conversation to answer a complex question.
LoCoMo is fast to run, typically completing in under ten minutes, which makes it useful for iterative development. It uses an LLM-as-judge evaluation, where a language model scores whether the retrieved context contains the information needed to answer the question.
The leaderboard tells a clear story: simpler retrieval categories saturate quickly. Nearly every memory system scores above 85% on single-session retrieval. The meaningful differentiation happens in multi-hop and temporal reasoning, where scores drop 15-30 points. This is where the actual memory system architecture matters.
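Because the differentiation lives in the harder categories, it helps to break a run down per category instead of reading one aggregate number. The sketch below assumes each graded result is a dict with a `category` string and a boolean `correct`; that record shape and the function name are my own illustration, not the output format of any benchmark suite.

```python
from collections import defaultdict


def accuracy_by_category(results):
    """Return {category: fraction correct} for a list of graded results.

    Each result is assumed to look like {"category": str, "correct": bool}.
    """
    totals = defaultdict(int)
    correct = defaultdict(int)
    for r in results:
        totals[r["category"]] += 1
        correct[r["category"]] += int(r["correct"])
    return {cat: correct[cat] / totals[cat] for cat in totals}


# Made-up grades, only to show the shape of the output.
sample = [
    {"category": "single-session", "correct": True},
    {"category": "single-session", "correct": True},
    {"category": "multi-hop", "correct": True},
    {"category": "multi-hop", "correct": False},
]
scores = accuracy_by_category(sample)
print(scores)
```

An aggregate of 75% on this toy sample would hide the fact that multi-hop sits at 50%.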
LongMemEval: The Stress Test
LongMemEval, published at ICLR 2025, is a harder benchmark. The S variant (the one most tools report) contains 500 questions, each paired with a chat history of approximately 48 sessions and roughly 115K tokens. The benchmark was designed specifically to exceed the context windows available when it was created, forcing systems to rely on memory rather than brute-force attention.
LongMemEval evaluates five categories:
- Single-session: Questions about a single long conversation.
- Cross-session: Questions requiring information from multiple conversations.
- Reordering: Questions about facts that appeared in different orders across sessions.
- Temporal reasoning: Time-sensitive questions that require understanding event sequences.
- Knowledge update: Questions where a fact changed between sessions and the system must recall the most recent version.
The knowledge update category is particularly important for coding agents. If your agent remembers that you were using React 17 from three months ago but you have since migrated to React 19, the old memory is not just useless, it is actively harmful. This category tests whether memory systems handle the write-path problem of updating existing knowledge.
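To make the write-path problem concrete, here is a minimal sketch of a store where a newer fact supersedes an older one with the same key. The `Fact` and `LatestWinsStore` names, the key format, and the use of a session number as the recency signal are all assumptions for illustration; real systems typically need fuzzier matching to decide that two memories are about the same thing.

```python
from dataclasses import dataclass


@dataclass
class Fact:
    key: str      # what the fact is about, e.g. "frontend.react_version"
    value: str
    session: int  # higher means more recent


class LatestWinsStore:
    """Toy store that keeps only the most recent value per key."""

    def __init__(self):
        self._facts = {}

    def write(self, fact):
        current = self._facts.get(fact.key)
        # Only overwrite when the incoming fact is at least as recent.
        if current is None or fact.session >= current.session:
            self._facts[fact.key] = fact

    def read(self, key):
        fact = self._facts.get(key)
        return fact.value if fact else None


store = LatestWinsStore()
store.write(Fact("frontend.react_version", "17", session=1))
store.write(Fact("frontend.react_version", "19", session=12))
store.write(Fact("frontend.react_version", "18", session=5))  # older fact arriving late
print(store.read("frontend.react_version"))
```

The late-arriving session 5 write is ignored, so the read returns the session 12 value.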
LongMemEval is designed to be significantly harder than LoCoMo, though the headline numbers do not make that obvious: the best systems report 85-95% on the S variant, compared to LoCoMo’s reported range of 70-95%. The benchmark also takes substantially longer to run, often 30-60 minutes depending on the system.
BEAM: The Scale Test
BEAM (Benchmark for Evaluating Agent Memory) is the newest of the three and focuses on scale. Where LoCoMo tests at conversation-level scale and LongMemEval-S at roughly 115K tokens per question, BEAM evaluates at 100K, 1M, and 10M token scales. It measures whether a memory system degrades gracefully as the corpus grows.
BEAM is configurable. You specify chat sizes and conversation counts, making it possible to test at any scale your system might encounter in production. This is valuable because many memory systems work well at small scale but collapse as the memory pool grows, overwhelmed by noise from unrelated sessions.
The benchmark also introduces evaluation categories that LoCoMo and LongMemEval do not cover, including multi-session reasoning across very long time horizons.
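One way to read scale results is as a drop relative to the smallest scale rather than as three separate scores. The sketch below does that arithmetic; the function name and the accuracy figures are hypothetical and are not BEAM output.

```python
def degradation(scores_by_scale):
    """Return the accuracy drop, in percentage points, relative to the smallest scale.

    scores_by_scale maps corpus size in tokens to accuracy in [0, 1].
    """
    scales = sorted(scores_by_scale)
    base = scores_by_scale[scales[0]]
    return {s: round((base - scores_by_scale[s]) * 100, 1) for s in scales}


# Hypothetical accuracies at the three scales mentioned above.
runs = {100_000: 0.88, 1_000_000: 0.81, 10_000_000: 0.62}
drops = degradation(runs)
print(drops)
```

A system that loses a few points from 100K to 1M but far more from 1M to 10M is showing the collapse pattern described above, even if its 100K score looks strong.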
Running the Benchmarks Yourself
The mem0ai/memory-benchmarks repository on GitHub provides an open-source evaluation suite that supports all three benchmarks. It is the standard way tools report their numbers, though you should understand what it actually does under the hood.
Installation and execution for all three benchmarks:
# Install the benchmark suite
pip install memory-benchmarks

# Run LoCoMo (fastest, ~300 questions)
python -m benchmarks.locomo.run --project-name my-memory-test

# Run LongMemEval (500 questions, longer)
python -m benchmarks.longmemeval.run \
    --project-name my-memory-test \
    --all-questions

# Run BEAM at 100K token scale
python -m benchmarks.beam.run \
    --project-name my-memory-test \
    --chat-sizes 100K \
    --conversations 0-9
Each benchmark follows the same pattern: inject conversations into your memory system, then query it with the benchmark questions. The system retrieves relevant context, and an LLM judge scores whether that context contains the answer.
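The inject, query, judge loop can be sketched in a few lines. This is not the suite's code: `run_benchmark`, `KeywordMemory`, and `substring_judge` are toy stand-ins I made up, with the LLM judge replaced by a verbatim substring check so the example runs offline.

```python
def run_benchmark(memory, conversations, questions, judge):
    """Inject conversations, query each question, and return judged accuracy."""
    memory.add_conversations(conversations)
    verdicts = []
    for q in questions:
        context = memory.query(q["question"])
        verdicts.append(judge(q["question"], q["answer"], context))
    return sum(verdicts) / len(verdicts)


class KeywordMemory:
    """Toy memory: stores turns and returns those sharing a word with the question."""

    def __init__(self):
        self.turns = []

    def add_conversations(self, conversations):
        for conv in conversations:
            self.turns.extend(conv)

    def query(self, question):
        words = set(question.lower().replace("?", "").split())
        return [t for t in self.turns if words & set(t.lower().split())]


def substring_judge(question, expected, context):
    """Stand-in for the LLM judge: passes if the expected answer appears verbatim."""
    return any(expected.lower() in turn.lower() for turn in context)


conversations = [["I am allergic to shellfish", "We deploy on Fridays"]]
questions = [
    {"question": "What is the user allergic to?", "answer": "shellfish"},
    {"question": "Which day do we deploy?", "answer": "Tuesdays"},
]
accuracy = run_benchmark(KeywordMemory(), conversations, questions, substring_judge)
print(accuracy)
```

Note what is being scored here: whether the retrieved context contains the answer, not whether an agent would go on to use it correctly.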
The benchmark suite abstracts the memory system behind a standard interface, so you need to implement an adapter for your specific system. The adapter has two methods: add_conversations() for the write path and query() for the read path.
from benchmarks.base import MemorySystem


class MyMemorySystem(MemorySystem):
    def add_conversations(self, conversations):
        """Store benchmark conversations into your memory system."""
        for conv in conversations:
            # Your write path implementation
            self.store(conv)

    def query(self, question, config=None):
        """Retrieve relevant context for a question."""
        results = self.search(question)
        return self.format_results(results)
This is straightforward, but the simplicity hides important details about what you are actually measuring.
The Credibility Problem
Wrong Answer Keys
In March 2026, a detailed audit of LoCoMo and LongMemEval surfaced on Reddit. The audit found that 6.4% of LoCoMo’s answer key is simply wrong. Questions have incorrect expected answers, meaning a memory system that returns the right information can be scored as wrong, and a system that returns wrong information can be scored as correct if it matches the incorrect key.
More concerning, the LongMemEval judge was found to accept up to 63% of intentionally wrong answers. The judge is an LLM that scores whether retrieved context is relevant. If you feed it garbage, it often says “yes, this looks relevant” because LLMs are agreeable. This inflates scores across the board.
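You can run the same kind of check against whatever judge you use: feed it answers you know are wrong and count how many it accepts. The sketch below assumes a judge is a callable taking `(question, gold, candidate)` and returning a bool; that signature, the `lenient_judge` toy, and the test cases are mine, not the audit's methodology.

```python
def false_accept_rate(judge, cases):
    """Fraction of intentionally wrong answers that the judge marks as correct.

    Each case is a tuple of (question, gold_answer, wrong_answer).
    """
    accepted = sum(1 for q, gold, wrong in cases if judge(q, gold, wrong))
    return accepted / len(cases)


def lenient_judge(question, gold, candidate):
    """Toy judge that accepts any answer sharing a word with the gold answer."""
    return bool(set(gold.lower().split()) & set(candidate.lower().split()))


def strict_judge(question, gold, candidate):
    """Toy judge that requires an exact, case-insensitive match."""
    return gold.strip().lower() == candidate.strip().lower()


cases = [
    ("Where does Ana work?", "a bakery in Lisbon", "a bank in Lisbon"),
    ("What pet does Ana have?", "a grey cat", "two dogs"),
    ("When did Ana move?", "in March 2021", "in June 2021"),
]
lenient_rate = false_accept_rate(lenient_judge, cases)
strict_rate = false_accept_rate(strict_judge, cases)
print(lenient_rate, strict_rate)
```

If your real judge accepts a meaningful share of deliberately wrong answers, every score it produces should be read with that inflation in mind.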
The Context Window Problem
LongMemEval-S was designed to exceed the context windows of models available at its creation. That was 2024. Today, models like Gemini 2.5 Pro have 1M token context windows, and Claude Opus 4 handles 200K. LongMemEval-S’s 115K token questions fit entirely within a single context window for several current models.
This creates a confound: a system that scores well on LongMemEval might not be demonstrating good memory retrieval at all. It might just be stuffing everything into the context window and letting the model’s attention mechanism find the answer. The benchmark was designed to test memory systems, but for models with large context windows, it has become a test of context window size.
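A quick sanity check before trusting a score is to ask whether the whole history would fit in the model's window anyway. The helper below is a rough sketch: the function name, the 8,000-token reserve for the prompt and answer, and the generic model labels are assumptions, and the 115K figure is the approximate per-question size quoted above.

```python
def fits_in_context(haystack_tokens, context_window, reserve=8_000):
    """True if the whole history fits in the window with `reserve` tokens to spare."""
    return haystack_tokens + reserve <= context_window


HAYSTACK = 115_000  # approximate LongMemEval-S history size per question
windows = {"200K model": 200_000, "1M model": 1_000_000, "32K model": 32_000}
confounded = {name: fits_in_context(HAYSTACK, size) for name, size in windows.items()}
print(confounded)
```

Where this returns True, a full-context run is a useful baseline to report alongside the memory system's score, since it shows how much of the result retrieval is actually responsible for.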
The Hindsight team wrote about this in their Agent Memory Benchmark Manifesto: “LoCoMo and LongMemEval are still a valid foundation, but they only cover one slice of the problem.” They argue for benchmarks that test memory-specific capabilities like staleness detection, contradiction resolution, and knowledge updates under realistic conditions.
Cherry-Picking and Non-Comparable Numbers
The leaderboard is deceptive because not everyone runs the same benchmark configuration. Some systems report scores on LongMemEval-S with all 500 questions. Others report on subsets. Some use claude-3-5-sonnet as the judge, others use gpt-4o, and the choice of judge can swing scores by 5-10 points. Mem0 reports LoCoMo at 91.6% and LongMemEval at 93.4%, while agentmemory reports LongMemEval-S at 95.2%. These numbers are not directly comparable unless the exact configuration matches.
The agentmemory project’s comparison document includes an explicit warning about this: “Apples vs oranges caveat: agentmemory and MemPalace are measured on LongMemEval-S. Letta [is measured on] a different variant.” This honesty is commendable and rare.
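One habit that guards against apples-to-oranges comparisons is to store the run configuration next to every score and refuse to compare when it differs. The `RunConfig` fields below are my guess at the minimum worth recording (benchmark variant, question count, judge model); they are not a schema used by any of the projects mentioned.

```python
from dataclasses import dataclass, fields


@dataclass(frozen=True)
class RunConfig:
    benchmark: str      # e.g. "LongMemEval-S"
    num_questions: int  # full set or a subset
    judge_model: str    # model used as the LLM judge


def config_diff(a, b):
    """Return {field: (a_value, b_value)} for every setting that differs."""
    return {
        f.name: (getattr(a, f.name), getattr(b, f.name))
        for f in fields(a)
        if getattr(a, f.name) != getattr(b, f.name)
    }


run_a = RunConfig("LongMemEval-S", 500, "gpt-4o")
run_b = RunConfig("LongMemEval-S", 500, "claude-3-5-sonnet")
run_c = RunConfig("LongMemEval-S", 500, "gpt-4o")
diff_ab = config_diff(run_a, run_b)
diff_ac = config_diff(run_a, run_c)
print(diff_ab, diff_ac)
```

An empty diff means the two scores can be compared directly; anything else should be printed next to the comparison.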
What These Benchmarks Actually Measure
Understanding what a benchmark measures is more important than understanding the number it produces. Each of the three benchmarks has blind spots.
LoCoMo measures retrieval quality on short synthetic conversations. It tells you whether your system can find a fact that was mentioned a few turns ago. It does not test whether your system handles conflicting information, detects stale memories, or performs well at scale. A system that dumps everything into a text file and does substring matching can score reasonably well on LoCoMo.
LongMemEval measures retrieval under length pressure, but as discussed, that pressure has relaxed as context windows have grown. It adds knowledge-update evaluation, which is valuable, but the judge’s permissiveness means the scores carry a wide margin of uncertainty. A system scoring 90% might actually be performing anywhere from 80% to 95%.
BEAM measures scaling behavior, which is the most operationally relevant dimension. A memory system that works at 100 conversations but fails at 10,000 is not production-ready. BEAM exposes this, but it is also the least widely adopted benchmark, so there are fewer comparison points.
Apart from LongMemEval’s knowledge-update category, which touches on the first item, none of the three benchmarks directly tests the things that matter most in production for coding agents:
- Staleness detection: Whether the system recognizes that a remembered fact is no longer true.
- Contradiction handling: Whether the system can resolve conflicting memories.
- Write-path quality: Whether the system captures the right information and discards the noise.
- Privacy: Whether credentials, secrets, and personal information are filtered out.
- Latency: Whether retrieval is fast enough for interactive use.
- Token efficiency: How many tokens the system injects per query and whether those tokens are relevant.
GitHub Copilot’s memory team addressed this gap when they built their system. They found a 7-percentage-point increase in pull request merge rates when their coding agent had access to memories (90% vs 83% without), but this metric comes from production usage, not a standardized benchmark. Production metrics are the gold standard, but they require a deployed system with real users.
A Practical Benchmarking Strategy
If you are building or evaluating a memory system for a coding agent, here is how I would approach benchmarking:
Step 1: Run the standardized benchmarks for regression detection. Run LoCoMo and LongMemEval-S once to establish a baseline. Then run them after any architectural change. The absolute scores matter less than the direction. If your LoCoMo multi-hop score drops from 78% to 65% after a configuration change, something broke, regardless of whether 78% was “good.”
# Establish baseline
python -m benchmarks.locomo.run --project-name baseline
# After changes, check for regression
python -m benchmarks.locomo.run --project-name after-change
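Once both runs have finished, the comparison itself is simple. The sketch below assumes you have already extracted per-category scores from each run into plain dicts; how you get them out of the suite's results is not shown, and the 3-point tolerance is an arbitrary choice to absorb judge noise.

```python
def find_regressions(baseline, current, tolerance=0.03):
    """Return categories whose score fell by more than `tolerance` (absolute)."""
    return {
        cat: (baseline[cat], current[cat])
        for cat in baseline
        if cat in current and baseline[cat] - current[cat] > tolerance
    }


# Hypothetical per-category scores for the two runs.
baseline = {"single-session": 0.93, "multi-hop": 0.78, "temporal": 0.71}
after_change = {"single-session": 0.92, "multi-hop": 0.65, "temporal": 0.72}
regressions = find_regressions(baseline, after_change)
print(regressions)
```

Only the multi-hop drop is flagged; the one-point dip in single-session stays under the tolerance.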
Step 2: Build a project-specific evaluation set. This is where you actually learn whether your memory system helps. Create 50-100 questions based on real work in your codebase: “What database migration strategy did we use for the users table?”, “Why did we switch from Redux to Zustand?”, “What files need to be updated when the API version changes?” Run these against your system and manually grade a random sample of 20. This takes a few hours but gives you signal that no standardized benchmark can provide.
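For the manual grading step, a seeded random sample keeps the selection honest and repeatable. The record layout here (an `id`, the `question`, and the `retrieved` context) is an assumption; use whatever your evaluation run actually produces.

```python
import random


def sample_for_grading(records, k=20, seed=0):
    """Pick up to k records at random, reproducibly, for manual review."""
    rng = random.Random(seed)
    return rng.sample(records, min(k, len(records)))


# Placeholder records standing in for the output of a real evaluation run.
records = [
    {"id": i, "question": f"question {i}", "retrieved": "retrieved context here"}
    for i in range(60)
]
to_grade = sample_for_grading(records)
print(len(to_grade))
```

Fixing the seed means a second reviewer can grade exactly the same 20 items.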
Step 3: Measure token efficiency. Track how many tokens your memory system injects per query and what percentage of those tokens are actually relevant to the task. agentmemory reports a 92% token reduction compared to a CLAUDE.md dump (1,900 tokens vs 22,000+ tokens for the same 240 observations). This is a more meaningful metric than recall percentage for production systems, because token waste translates directly to cost and context window pressure.
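Both numbers in this step are simple ratios. The sketch below computes the reduction against a baseline and the share of injected tokens that were relevant; the function names and the `(token_count, is_relevant)` chunk format are assumptions, and the relevance labels would have to come from your own grading.

```python
def token_reduction(baseline_tokens, injected_tokens):
    """Fraction of tokens saved relative to the baseline."""
    return 1 - injected_tokens / baseline_tokens


def relevant_fraction(chunks):
    """Share of injected tokens that were relevant.

    chunks is a list of (token_count, is_relevant) pairs.
    """
    total = sum(n for n, _ in chunks)
    relevant = sum(n for n, rel in chunks if rel)
    return relevant / total if total else 0.0


reduction = token_reduction(22_000, 1_900)  # figures quoted above
relevance = relevant_fraction([(300, True), (200, False)])
print(round(reduction * 100, 1), relevance)
```

With a baseline of exactly 22,000 tokens the reduction works out to about 91.4%, so the reported 92% presumably reflects a baseline somewhat above 22,000, which is consistent with the "22,000+" wording.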
Step 4: Track production metrics over time. If you have access to them, measure task completion rate, time-to-completion, and user satisfaction with and without memory. These are noisy metrics influenced by many factors, but over time they reveal whether your memory system is actually helping or just adding complexity.
The Gotcha: Optimizing for the Benchmark
There is a temptation to tune your memory system to maximize benchmark scores. This is counterproductive. LoCoMo rewards retrieval volume: systems that return more context tend to score higher because the judge is more likely to find the answer somewhere in a larger result set. But in production, injecting 5,000 tokens of mostly irrelevant context is worse than injecting 500 tokens of highly relevant context. The attention dilution problem means extra tokens actively hurt the agent’s reasoning.
Similarly, LongMemEval rewards systems that are good at verbatim retrieval. If a user said “I prefer tabs over spaces” in session 3, the benchmark checks whether the system can retrieve that exact fact. But a good production memory system might store this as a structured preference (indent_style: tabs) rather than the raw quote, which is more useful but might not match the benchmark’s expected answer format.
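The mismatch is easy to reproduce with a toy scorer. The `verbatim_hit` check below is a deliberately naive stand-in for a judge that looks for the original wording; it is not how LongMemEval actually scores answers.

```python
def verbatim_hit(expected_quote, retrieved):
    """True if the expected wording appears verbatim in the retrieved memory."""
    return expected_quote.lower() in retrieved.lower()


expected = "I prefer tabs over spaces"
raw_memory = 'Session 3, user: "I prefer tabs over spaces"'
structured_memory = "indent_style: tabs"

raw_hit = verbatim_hit(expected, raw_memory)
structured_hit = verbatim_hit(expected, structured_memory)
print(raw_hit, structured_hit)
```

The structured memory carries the same information in a form an agent can act on directly, yet a wording-based check scores it as a miss.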
The benchmarks measure retrieval of raw conversation content. Production memory systems need to extract, compress, and structure information. These goals are not perfectly aligned, and optimizing for one can hurt the other.
Practical Takeaways
- LoCoMo is useful for fast regression testing, but its scores saturate quickly. Focus on multi-hop and temporal reasoning categories for meaningful signal.
- LongMemEval is harder and more relevant, but context window growth has reduced its discriminative power. The knowledge-update category is its most valuable contribution.
- BEAM tests scaling, which is under-evaluated but operationally critical. Use it if you expect your memory system to handle thousands of sessions.
- All three benchmarks have credibility issues. Wrong answer keys, permissive judges, and inconsistent configurations mean you should treat absolute scores with skepticism. Relative scores and trends are more reliable.
- None of these benchmarks test the production capabilities that matter most: staleness detection, token efficiency, privacy, and latency. You need custom evaluations for these.
- The best benchmark is a project-specific question set drawn from real work. It takes effort to build, but the signal it provides is worth more than any standardized score.
- Never optimize solely for benchmark numbers. A system that scores 95% on LongMemEval but injects 8,000 tokens of noise per query is worse than a system scoring 85% with 1,000 tokens of relevant context.
What Is Next
Benchmarks are the skeleton of a maturing field. The agent memory community is still figuring out what to measure and how to measure it. In the next post, we will look at one of the practical tools that has emerged from this space: QMD, a local-first search tool that combines BM25, vector search, and local LLM reranking in a single binary. It represents the direction the field is moving toward: memory systems that run entirely on your machine, require no infrastructure, and still deliver strong retrieval performance.
Previous post in this series: Session Capture and Compression: How claude-mem Watches, Remembers, and Teaches Your Next Session