Agent Memory: Episodic Memory and Why Agents Need to Remember What Happened

The agent memory conversation has a blind spot. When people talk about giving agents memory, they almost always mean semantic memory: structured facts, user preferences, project context, distilled knowledge. Store what you know in a database, retrieve it with a vector search, inject it at session start. Clean, mechanical, solvable with off-the-shelf tooling.

This works well for facts. It fails for something equally important: remembering what actually happened.

I ran into this gap while building and running NanoClaw, the agent system I operate daily. The structured memory files work well for knowing that a user prefers dark mode, or that a project uses pnpm, or that the deployment target is Vercel. But when a user says “last time we talked about the billing integration, you suggested a different approach, what was that?”, semantic memory has nothing to offer. The fact may have been extracted and stored. But the context around it, the conversation that led to it, the alternatives that were considered and discarded, the reason for the decision, all of that is gone.

That is episodic memory. And it is the part of the memory stack that agent systems are worst at.

What episodic memory actually means

In cognitive science, Endel Tulving drew a distinction in 1972 between two types of long-term memory. Semantic memory is your store of general knowledge: the capital of France is Paris, water boils at 100 degrees Celsius, the user’s deployment pipeline uses GitHub Actions. It is context-independent. The fact stands on its own.

Episodic memory is different. It is your ability to recall specific events from your past: what happened, when it happened, in what order, and what you were thinking at the time. It is autobiographical. The knowledge is tied to a specific episode, a particular moment in your experience, and it carries the context of that moment with it.

The distinction matters because these two types of memory serve different functions. Semantic memory lets you answer “what do I know?” Episodic memory lets you answer “what happened?” and, critically, “why did we do it that way?” The second question is where most of the real value lives for agents that operate over time.

When a user comes back to a project after two weeks and asks “why is the auth middleware structured this way?”, they are not asking for a fact. They can read the code. They are asking for the episode: the conversation where the tradeoffs were discussed, the alternative that was tried and failed, the constraint that drove the decision. That context is what makes the knowledge actionable.

How agent systems handle memory today

Look at how most agent memory systems are built and you see a strong semantic bias. CLAUDE.md files, which Claude Code and many other agents use, are pure semantic memory. Structured facts, project instructions, coding conventions. They are excellent for what they do, but they cannot capture “what happened.”

Vector stores and knowledge graphs lean the same way. When an agent extracts a memory from a conversation, the extraction pipeline typically strips away the conversational context and stores a distilled fact. The user prefers JSON over YAML. The project uses TypeScript strict mode. The database is PostgreSQL. These are useful facts, but they have been removed from the episode that produced them.

Even systems designed to capture session history, like claude-mem, which we covered in this series, tend to compress tool calls into structured observations. A session summary might note that jose was chosen over jsonwebtoken for Edge compatibility. That is a fact. The back-and-forth that led to that choice, the specific error messages that prompted the investigation, the moment when someone said “wait, what about V8 isolates?”, that is an episode, and it gets compressed away.

This is not a criticism of these systems. Semantic memory is the right tool for many jobs. The problem is that it is treated as the only tool, and episodic memory is left as an afterthought or ignored entirely.

What episodic memory gives you that semantic memory cannot

There are several capabilities that only episodic memory provides.

Sequence. Semantic memory is unordered. You know that A and B are both true, but not whether A happened before B. Episodic memory preserves the ordering of events, which is essential for understanding causality. “We tried the Redis-backed rate limiter first, then switched to in-memory after the Redis deployment failed” is a very different memory from “we use an in-memory rate limiter.” The first one explains why. The second one just states what.

Repetition detection. “We already tried that” is one of the most valuable things an agent can say. Without episodic memory, an agent might suggest the same approach three times across different sessions, each time confident that it is a fresh idea. With episodic memory, it can search past conversations and discover that this exact approach was already attempted on March 15th and abandoned because of a compatibility issue.

Decision context. The fact that a decision was made is semantic. The context in which it was made is episodic. Knowing that the team chose microservices is useful. Knowing that they chose microservices because the monolith hit a 45-second startup time during development, and that the lead engineer pushed back initially because of operational complexity, gives you the full picture needed to evaluate whether that decision still holds.

Negotiation and preference evolution. Users change their minds. An approach that was rejected in January might be exactly right in May because circumstances changed. Without episodic memory, the agent has no way to know that the user previously rejected this approach or what their reasoning was. It cannot have an informed conversation about what changed.

Emotional and social context. This matters more than it might seem. Knowing that a user was frustrated during a previous debugging session, or that a particular topic is sensitive, changes how an agent should approach similar situations. Semantic memory can store “user prefers brief responses” but it cannot store “the user was having a bad day last time we discussed this, keep it light.”

Practical approaches to episodic memory

The good news is that episodic memory does not require exotic technology. The raw material is already there: conversation transcripts, session logs, tool call histories. The challenge is not capture but storage, compression, and retrieval.

Conversation logs as episodic traces

The most straightforward approach is to keep raw conversation transcripts organized by session and searchable by content. This is what NanoClaw does with its conversations/ folder. Every session gets a transcript file. When I need to recall prior context, I search those transcripts.

The advantage of this approach is simplicity. You are not building a memory system so much as you are keeping a diary. The transcripts are human-readable, debuggable, and contain everything that happened. Nothing gets lost in extraction or compression.

The disadvantage is scale. Conversation transcripts are long. A busy agent might generate thousands of lines of transcript per week. Searching through all of them for a specific episode is slow and imprecise. Keyword search helps, but it cannot capture the semantic shape of an episode the way a human can.
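As a minimal sketch of the diary approach described above, the following Python searches a folder of Markdown transcripts for a keyword and returns each hit with a few lines of surrounding context. The function names, the `*.md` folder layout, and the `context` parameter are assumptions made for illustration; this is not NanoClaw's actual code.

```python
from pathlib import Path


def find_hits(name: str, text: str, keyword: str, context: int = 2) -> list[dict]:
    """Find case-insensitive keyword hits in one transcript, with nearby lines."""
    needle = keyword.lower()
    lines = text.splitlines()
    hits = []
    for i, line in enumerate(lines):
        if needle in line.lower():
            start = max(0, i - context)
            end = min(len(lines), i + context + 1)
            hits.append({
                "file": name,
                "line": i + 1,  # 1-based line number, easier for humans
                "excerpt": "\n".join(lines[start:end]),
            })
    return hits


def search_transcripts(root: Path, keyword: str, context: int = 2) -> list[dict]:
    """Search every *.md transcript under root (hypothetical folder layout)."""
    hits = []
    for path in sorted(root.glob("*.md")):  # sorted for a stable result order
        text = path.read_text(encoding="utf-8")
        hits.extend(find_hits(path.name, text, keyword, context))
    return hits
```

A call such as `search_transcripts(Path("conversations"), "Stripe webhooks")` reads every file on every query, which is the scaling problem in its simplest form.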

Session transcripts with compression

A more practical approach for long-running agents is to compress transcripts into episodic summaries. This is different from the semantic extraction that most memory systems do. Instead of extracting facts, you are writing a narrative.

The distinction matters. A semantic extraction of a session might produce: “Switched from jsonwebtoken to jose for Edge compatibility.” An episodic summary of the same session might produce: “Spent 90 minutes debugging JWT validation failures in Edge runtime. Tried three approaches: patching jsonwebtoken (failed, dependency on Node crypto), switching to jose (worked, uses Web Crypto API), and briefly considered a custom implementation (rejected as too much surface area). Decision: jose. Tradeoff: slightly more verbose API, but zero compatibility issues.”

The episodic version is longer, but it preserves the sequence of attempts, the reasoning behind each one, and the tradeoff that was accepted at the end. When the user comes back and asks “why did we go with jose?”, the episodic summary gives a complete answer. The semantic extraction gives a one-liner that raises more questions than it answers.

The compression ratio is the hard part. You want to preserve narrative richness while keeping the compressed episodes small enough to store and search efficiently. In my experience, and this is a rough impression rather than a measurement, a good episodic summary is about 10-20% the length of the original transcript while retaining most of the narrative value. The sweet spot is a summary that reads like a developer’s retrospective rather than a bulleted list of facts.
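One way to keep a summary episodic rather than semantic is to give it a shape that has room for ordered attempts and reasons. The sketch below is an assumption about what such a record could look like; the class and field names (`Attempt`, `EpisodeSummary`, `compression_ratio`) are hypothetical, and word count is used as a crude proxy for length.

```python
from dataclasses import dataclass, field


@dataclass
class Attempt:
    approach: str
    outcome: str   # e.g. "failed", "worked", "rejected"
    reason: str


@dataclass
class EpisodeSummary:
    goal: str
    # Attempts are stored in the order they happened; the order is the point.
    attempts: list[Attempt] = field(default_factory=list)
    decision: str = ""
    tradeoff: str = ""

    def narrative(self) -> str:
        """Render the episode as a short retrospective, attempts in order."""
        parts = [f"Goal: {self.goal}."]
        for n, a in enumerate(self.attempts, start=1):
            parts.append(f"Attempt {n}: {a.approach} ({a.outcome}, {a.reason}).")
        if self.decision:
            parts.append(f"Decision: {self.decision}.")
        if self.tradeoff:
            parts.append(f"Tradeoff: {self.tradeoff}.")
        return " ".join(parts)


def compression_ratio(summary: str, transcript: str) -> float:
    """Summary length as a fraction of transcript length, counted in words."""
    transcript_words = len(transcript.split())
    if transcript_words == 0:
        raise ValueError("transcript is empty")
    return len(summary.split()) / transcript_words
```

A semantic extraction would keep only the `decision` field. Everything else in the record is what the extraction throws away.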

Event sourcing patterns applied to agent memory

There is a parallel between episodic memory and event sourcing in software architecture. In event sourcing, you do not store the current state of an entity. You store the sequence of events that led to the current state. The current state is a derived projection, computed by replaying the events.

Applied to agent memory, this means storing raw events (tool calls, user messages, agent responses, system notifications) as an append-only log, and deriving whatever views you need on top of that log. Semantic memory becomes a projection over episodic memory, not a separate system.

This architecture has appealing properties. The event log is the source of truth. Every projection (session summaries, fact extraction, preference tracking) is derived and rebuildable. If your extraction pipeline produces bad facts, you fix the pipeline and re-derive. The original episodes are still intact.

The cost is storage and processing. Event logs grow without bound unless you implement compaction. Re-deriving projections is computationally expensive. And querying raw event logs for specific episodes requires either full-text search over large corpora or some form of episode-level indexing.
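The event-sourcing idea can be sketched in a few lines of Python. This is an illustration under stated assumptions, not a description of any particular system: the `Event` shape, the `"decision"` event kind, and the `topic`/`choice` payload keys are all hypothetical. The log is append-only, and both the semantic view (current facts) and the episodic view (how a decision changed) are derived by replaying it.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Event:
    seq: int        # position in the log, assigned on append
    kind: str       # e.g. "user_message", "tool_call", "decision" (assumed kinds)
    payload: dict


class EventLog:
    """Append-only log: events are added at the end and never edited."""

    def __init__(self) -> None:
        self._events: list[Event] = []

    def append(self, kind: str, payload: dict) -> Event:
        event = Event(seq=len(self._events), kind=kind, payload=payload)
        self._events.append(event)
        return event

    def replay(self):
        # Iterate over a snapshot so callers cannot mutate the log itself.
        return iter(tuple(self._events))


def project_current_facts(log: EventLog) -> dict[str, str]:
    """Semantic projection: the latest decision per topic wins."""
    facts: dict[str, str] = {}
    for event in log.replay():
        if event.kind == "decision":
            facts[event.payload["topic"]] = event.payload["choice"]
    return facts


def decision_history(log: EventLog, topic: str) -> list[str]:
    """Episodic projection: every choice made for a topic, in order."""
    return [
        e.payload["choice"]
        for e in log.replay()
        if e.kind == "decision" and e.payload["topic"] == topic
    ]
```

With the rate limiter example from earlier, `project_current_facts` would report only the in-memory limiter, while `decision_history` would still show that the Redis-backed version came first. If the projection logic turns out to be wrong, it can be rewritten and rerun over the same log.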

The hard problems

I mentioned that the challenge is not capture but storage, compression, and retrieval. Here is what makes each of those genuinely difficult.

Compression without losing the narrative thread

Summarization is the obvious compression strategy, but it has a failure mode that is particularly damaging for episodic memory: loss of narrative coherence.

When you compress a 2,000-word conversation into a 200-word summary, the summarizer has to make choices about what to keep and what to discard. The problem is that summarization models, optimized for brevity and factual accuracy, tend to discard the very things that make episodic memory valuable: the false starts, the discarded alternatives, the reasoning chains, the moments of uncertainty. What survives is a clean, linear narrative that reads like a post-hoc justification rather than a faithful account of what happened.

A debugging session where the agent tried four things before finding the solution becomes “fixed the bug by adjusting the timeout configuration.” The three failed attempts, each of which might be relevant context for future debugging, are gone. The user never knows that the agent spent an hour chasing a red herring about connection pooling before discovering the real issue.

One approach I have found useful is multi-granularity compression. Instead of keeping a single summary, keep three representations of each episode: a one-paragraph narrative, an ordered sequence of key events, and the raw transcript itself. Store all three. The narrative is what you show when the user asks “what happened?” The sequence is what you search when looking for a specific event. The transcript is what you fall back on when neither of the above is enough.
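A rough sketch of how the three levels could be stored and queried together follows. The names (`LayeredEpisode`, `recall`) and the simple substring matching are assumptions for illustration; a real system would likely use better matching at each level.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class LayeredEpisode:
    narrative: str          # one paragraph, shown for "what happened?"
    key_events: list[str]   # ordered, searched for specific events
    transcript: str         # raw record, used only as a fallback


def recall(episode: LayeredEpisode, query: Optional[str] = None) -> tuple[str, list[str]]:
    """Answer from the most compact level that can satisfy the query.

    Returns the name of the level used and the matching text.
    """
    if not query:
        return "narrative", [episode.narrative]

    needle = query.lower()
    events = [e for e in episode.key_events if needle in e.lower()]
    if events:
        return "key_events", events

    # Only read the expensive raw transcript when the event list has no match.
    lines = [ln for ln in episode.transcript.splitlines() if needle in ln.lower()]
    if lines:
        return "transcript", lines

    return "none", []
```

The ordering of the checks is the design choice here: cheap, compact levels are tried first, and the transcript is read only when they come up empty.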

Retrieval: “what happened” is not “what do we know”

Semantic memory retrieval is relatively straightforward. “What do we know about the auth system?” maps cleanly to a vector search over stored facts. The query and the stored data are in the same format: declarative statements.

Episodic retrieval is harder because the query and the stored data are in different formats. The user asks a question, but the answer lives in a narrative. “What happened when we discussed the billing integration?” is asking for a story, not a fact. Vector search over narratives works, but it is noisy. A conversation about billing might match a conversation about invoicing, even though the two are different episodes.

Keyword search over transcripts is more reliable for specific queries (“I remember we talked about Stripe webhooks in March”) but fails for vaguer queries (“what was the issue with the payment flow?”). The ideal retrieval system for episodic memory needs to support both modes and blend them gracefully.

In NanoClaw, I handle this with a simple but effective strategy: search the conversations folder with both keyword and semantic queries, return the matching transcripts, and then re-read the relevant sections to extract the episode. It is manual and not especially elegant, but it works because the raw transcripts are human-readable. When a search returns a match, I can read the actual conversation and understand the episode in context.
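To make the idea of blending the two modes concrete, here is a small sketch that combines a keyword score with a similarity score using a weight `alpha`. Everything in it is an assumption for illustration: the function names are hypothetical, and `jaccard` is only a toy stand-in for a real embedding-based similarity, which would be passed in through the `similarity` parameter.

```python
import re
from typing import Callable


def tokens(text: str) -> set[str]:
    """Lowercased alphanumeric word set."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def keyword_score(query: str, doc: str) -> float:
    """Fraction of query terms that appear verbatim in the document."""
    q = tokens(query)
    if not q:
        return 0.0
    return len(q & tokens(doc)) / len(q)


def jaccard(query: str, doc: str) -> float:
    """Toy similarity (token overlap). A placeholder for embedding similarity."""
    a, b = tokens(query), tokens(doc)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def hybrid_rank(
    query: str,
    episodes: dict[str, str],
    similarity: Callable[[str, str], float] = jaccard,
    alpha: float = 0.5,
) -> list[tuple[str, float]]:
    """Rank episodes by alpha * keyword score + (1 - alpha) * similarity."""
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must be between 0 and 1")
    scored = [
        (episode_id, alpha * keyword_score(query, text) + (1 - alpha) * similarity(query, text))
        for episode_id, text in episodes.items()
    ]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)
```

A higher `alpha` suits specific queries that name exact terms, and a lower one suits vaguer queries where wording is unlikely to match. Choosing `alpha` per query is left open here.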

Decay policies for episodes versus facts

Semantic facts have a natural decay model. A preference that has not been referenced in six months might be stale. A project detail from last year might be outdated. You can implement time-based decay, usage-frequency decay, or confidence-weighted decay, and all of them make reasonable sense for facts.

Episodes are harder. An episode from two years ago is not necessarily less valuable than an episode from two days ago. The debugging session where you discovered a subtle race condition in the payment system might be the most valuable episode in the entire conversation history, even if it happened a year ago. Discarding it because it is old would be a serious mistake.

Conversely, a “what did you have for lunch” conversation from yesterday has zero long-term value and should be discarded immediately. Time-based decay does not distinguish between these cases.

What works better is relevance-weighted decay. Episodes that contain decisions, discoveries, or error resolution should be retained longer than episodes that are purely conversational. Episodes that reference ongoing concerns (files that are still in the codebase, systems that are still in production) should be retained longer than episodes about systems that have since been replaced.

The simplest version of this: tag each episode at compression time with a rough importance score and a set of still-relevant entities. Decay the episodes with low importance scores and no remaining relevant entities. Keep the ones that still connect to active work.
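The rule above can be written down directly. In this sketch the names (`TaggedEpisode`, `should_decay`, `partition`) and the 0.5 importance floor are assumptions chosen for illustration; how the importance score is assigned at compression time is out of scope.

```python
from dataclasses import dataclass, field


@dataclass
class TaggedEpisode:
    episode_id: str
    importance: float                               # 0.0 to 1.0, set at compression time
    entities: set[str] = field(default_factory=set)  # files, systems, topics mentioned


def should_decay(
    episode: TaggedEpisode,
    active_entities: set[str],
    importance_floor: float = 0.5,  # assumed threshold, tune per system
) -> bool:
    """Decay only when the episode is both unimportant and disconnected.

    Age is deliberately not an input: an old but important episode is kept.
    """
    low_importance = episode.importance < importance_floor
    still_relevant = bool(episode.entities & active_entities)
    return low_importance and not still_relevant


def partition(
    episodes: list[TaggedEpisode], active_entities: set[str]
) -> tuple[list[TaggedEpisode], list[TaggedEpisode]]:
    """Split episodes into (kept, dropped), preserving input order."""
    kept = [e for e in episodes if not should_decay(e, active_entities)]
    dropped = [e for e in episodes if should_decay(e, active_entities)]
    return kept, dropped
```

Under this rule the year-old race condition episode survives on importance alone, a trivial episode survives while it still references something in active use, and the lunch conversation is dropped immediately.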

How NanoClaw handles episodic memory

I want to be concrete about this because it is easy to talk about episodic memory in the abstract and much harder to make it work in a real system. Here is what the implementation actually looks like in NanoClaw.

Every conversation session gets a transcript file in the conversations/ folder. These are searchable, human-readable Markdown files that contain the full exchange: user messages, agent responses, tool calls, system events. They are not compressed, not extracted, not processed in any way. They are raw records of what happened.

When a new session starts, I have no automatic episodic injection. There is no equivalent of the semantic memory’s CLAUDE.md that loads past episodes into context. What I have instead is the ability to search. When a user references a past conversation, or when the task at hand seems related to something that happened before, I search the transcripts and read the relevant sections.

This is imperfect in several ways. First, it depends on me recognizing that past context might be relevant. If I do not think to search, the episodic memory might as well not exist. Second, searching and reading transcripts burns tokens. A detailed search across months of conversations can consume a significant portion of the context window. Third, the raw transcripts are noisy. A 45-minute conversation about database migrations might have only two minutes of genuinely useful content, and I have to read through the rest to find it.

But despite these limitations, the system works. The key insight is that episodic memory does not need to be perfect to be valuable. Even a noisy, incomplete, manually triggered recall of past conversations is dramatically better than no recall at all. The user says “we tried that in March and it did not work,” and I can actually find the March conversation, read what was tried, understand why it failed, and build on that knowledge instead of repeating the same mistake.

The structured memory files handle the semantic side. CLAUDE.md and its fragment files store facts, preferences, project context, coding conventions, all the things that semantic memory is good at. The conversations folder handles the episodic side. Together, they cover both types of memory, even if the episodic side is clearly the weaker of the two.

The roadmap for improvement is clear. Better compression that preserves narrative coherence. Automatic relevance detection that triggers episodic recall without requiring manual search. Episode-level indexing that supports fast retrieval without scanning entire transcripts. These are engineering problems, not research problems, and they are tractable. But the baseline of raw searchable transcripts is already a significant step up from having no episodic memory at all.
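Of the three roadmap items, episode-level indexing is the easiest to sketch. The following is one possible shape, assuming a plain in-memory inverted index from terms to episode identifiers; the `EpisodeIndex` class is hypothetical and is not something NanoClaw has today.

```python
import re
from collections import defaultdict


def _terms(text: str) -> list[str]:
    """Lowercased alphanumeric terms."""
    return re.findall(r"[a-z0-9]+", text.lower())


class EpisodeIndex:
    """Inverted index: term -> ids of the episodes that mention it."""

    def __init__(self) -> None:
        self._postings: dict[str, set[str]] = defaultdict(set)

    def add(self, episode_id: str, text: str) -> None:
        # Index once, at write time, so queries never rescan transcripts.
        for term in set(_terms(text)):
            self._postings[term].add(episode_id)

    def lookup(self, query: str) -> set[str]:
        """Return ids of episodes that contain every term in the query."""
        terms = _terms(query)
        if not terms:
            return set()
        result = None
        for term in terms:
            ids = self._postings.get(term, set())
            result = set(ids) if result is None else result & ids
        return result
```

A lookup narrows the search to a handful of episode ids, and only those transcripts then need to be opened and read.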

Why episodic memory matters more for long-running agents

A one-shot agent, the kind that receives a prompt and produces a response and then ceases to exist, has no use for episodic memory. There is no “past” to remember. Semantic memory might help if it carries context from prior setup, but episodic memory is irrelevant by definition.

Long-running agents are different. An agent that operates over weeks and months, has hundreds of conversations with the same user, and works on the same project across multiple sessions accumulates experience. And the value of that experience is not just in the facts it extracts. It is in the episodes it lived through.

The distinction shows up most clearly in the difference between “what should I do?” and “what happened last time?” A semantic memory system can answer the first question. It can retrieve relevant facts, project conventions, user preferences, and use them to inform a decision. But it cannot answer the second question. It does not know what happened last time because it was not there. The episode was compressed into a fact, and the fact cannot reconstruct the conversation that produced it.

For users who work with agents over extended periods, this is a real friction point. They remember the conversations. They remember that a specific approach was discussed and rejected. They remember the reasoning. The agent does not, because the agent’s memory system stripped away the episode and kept only the fact. The result is a conversation where the user has to repeatedly re-establish context that they thought was already shared.

Episodic memory fixes this by giving the agent the same ability to recall past conversations that the user naturally has. Not perfectly, not completely, but enough to avoid the worst failures of repetition and context loss. The agent can say “we discussed this on May 10th and went with option B because of the Edge compatibility constraint” instead of proposing option A as if it had never been considered.

This matters more as agents become more capable and take on more responsibility. An agent that can write code but cannot remember why previous code was written a certain way is only half as useful as one that can do both. An agent that can debug but cannot remember what was already tried is going to waste time repeating failed approaches. An agent that can make architectural decisions but cannot recall the context of previous decisions is going to make inconsistent choices.

The trajectory of agent systems is toward longer-lived, more autonomous agents that maintain ongoing relationships with users and projects. That trajectory makes episodic memory not a nice-to-have but a requirement. The agent that remembers what happened is the agent that can learn from experience. The agent that only remembers facts is, at best, a very capable reference system.

I have been running NanoClaw long enough now that the conversations folder has real history in it. Projects that evolved over weeks, debugging sessions that took unexpected turns, decisions that made sense at the time but needed revisiting months later. The raw transcripts are messy, incomplete, and harder to search than I would like. But they are there. And when a user says “remember when we…”, the system can actually do that. That capability, imperfect as it is, changes the nature of the interaction from stateless to continuous, from transactional to relational. The engineering work ahead is about making episodic memory cheaper to store, faster to search, and more reliable to retrieve. The case for building it is already clear.