Benchmarked 4 agent memory systems: Mem0 scores 49% recall (worse than a coin flip), Zep uses 340x more tokens for a 15-point improvement. Here's what's actually going on.
Posted by Impressive-Judge-357@reddit | LocalLLaMA | 13 comments
I've been digging into how AI coding agents actually handle memory — not what the marketing says, but what the code and benchmarks show. Here's what I found.
TL;DR: Every agent memory system in 2026 is either too simple (can't search), too expensive (600K tokens per conversation), or too clever (burns tokens on memory management instead of actual work). The real unsolved problem isn't remembering — it's forgetting.
How the major systems actually work
Claude Code — Reads CLAUDE.md at session start. Entire file goes into context. No vector DB, no semantic search. Auto memory (v2.1.59+) writes notes to markdown files. Hard cap: 200 lines for MEMORY.md, everything beyond silently truncated.
Intentionally simple. Works for small projects. Falls apart on monorepos with years of decisions.
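The whole-file approach is easy to picture. Here's a minimal sketch of that behavior; the function name, the constant, and the details are hypothetical, not Claude Code's actual implementation:

```python
from pathlib import Path

MAX_MEMORY_LINES = 200  # the hard cap described above (name is made up)

def load_memory(path: str) -> str:
    """Read a memory file whole into context: no vector DB, no semantic
    search, and everything past the cap is silently dropped."""
    p = Path(path)
    if not p.exists():
        return ""
    lines = p.read_text(encoding="utf-8").splitlines()
    return "\n".join(lines[:MAX_MEMORY_LINES])  # silent truncation
```

The simplicity is the point: one read, zero retrieval logic, and zero token overhead for memory management. The failure mode is equally simple: line 201 of a monorepo's decision log might be the one that mattered.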
Mem0 (48K stars) — Decomposes interactions into facts, stores as embeddings, retrieves via semantic search. Sounds great until you check the numbers:
| System | LongMemEval | Tokens per conversation |
|---|---|---|
| Mem0 | 49.0% | ~1,764 |
| Zep | 63.8% | ~600,000 |
| Letta | ~83.2% | Dynamic |
Mem0 recalls the right information less than half the time. Zep does better, but it burns 340x more tokens for roughly 15 points of accuracy. The Zep team disputes the Mem0 paper's methodology, claiming 75.1% with proper configuration. Even granting that, the token cost stands.
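To make the "decompose into facts, embed, retrieve" pipeline concrete, here is a toy sketch of its shape. This is not Mem0's API: the sentence-splitting "decomposition" and bag-of-words "embedding" are crude stand-ins for the LLM extraction and dense vectors a real system uses.

```python
import math
from collections import Counter

def embed(text: str) -> Counter:
    # stand-in embedding: word counts (real systems use dense vectors)
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[w] * b[w] for w in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

class FactStore:
    def __init__(self):
        self.facts = []  # list of (fact_text, embedding)

    def add(self, interaction: str):
        # real systems use an LLM call to decompose an interaction into
        # atomic facts; splitting on sentences is a placeholder for that
        for fact in filter(None, (s.strip() for s in interaction.split("."))):
            self.facts.append((fact, embed(fact)))

    def search(self, query: str, k: int = 3):
        q = embed(query)
        ranked = sorted(self.facts, key=lambda f: cosine(q, f[1]), reverse=True)
        return [text for text, _ in ranked[:k]]
```

The shape explains the failure mode: recall quality depends entirely on whether the decomposition step extracted the right fact and whether the query happens to land near it in embedding space. Miss either and you get the sub-50% numbers above.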
Letta/MemGPT — Treats context window like RAM, external storage like disk. Agent decides what to page in and out. Best benchmark score (~83.2%). But every memory operation costs inference tokens. The agent spends significant budget reasoning about what to remember instead of doing the work.
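The RAM/disk framing implies an explicit paging loop where every decision itself costs tokens. A rough sketch under assumed numbers; the class, the 40-token per-operation cost, and the eviction policy are invented for illustration, not Letta's implementation:

```python
OP_COST = 40  # assumed inference tokens spent reasoning per memory operation

class PagedContext:
    """Context window as RAM, external store as disk (the MemGPT framing)."""

    def __init__(self, budget: int):
        self.budget = budget       # context window size in tokens
        self.in_context = {}       # name -> token size ("RAM")
        self.archive = {}          # name -> token size ("disk")
        self.spent_on_memory = 0   # tokens burned on paging itself

    def page_in(self, name: str) -> None:
        size = self.archive.pop(name)
        # evict oldest in-context blocks until the new one fits
        while sum(self.in_context.values()) + size > self.budget:
            evicted = next(iter(self.in_context))
            self.archive[evicted] = self.in_context.pop(evicted)
        self.in_context[name] = size
        self.spent_on_memory += OP_COST  # the agent had to reason to do this
```

Note where the tokens go: `spent_on_memory` grows with every paging decision, which is exactly the "budget spent on memory management instead of work" trade-off.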
The actual problem: no agent knows how to forget
Ebbinghaus mapped the human forgetting curve in 1885. We don't keep everything. We forget most things. What survives got reinforced through repetition or significance.
AI agents have two modes: hoard everything (vector stores growing forever) or lose everything (session boundary wipes the slate). There's no middle ground.
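The forgetting curve itself is simple to state: recall probability decays exponentially with age, and reinforcement flattens the curve by raising the memory's strength. A sketch of the Ebbinghaus model; the multiplicative `boost` factor is an assumption for illustration, not a fitted value:

```python
import math

def retention(age_days: float, strength: float) -> float:
    """Ebbinghaus forgetting curve R = e^(-t/S): probability of recall
    after t days for a memory with stability S."""
    return math.exp(-age_days / strength)

def reinforce(strength: float, boost: float = 2.0) -> float:
    # each repetition or significant use flattens the curve
    return strength * boost
```

This is the middle ground agents lack: a memory touched twice survives a week far better than one never revisited, without anyone deciding to "hoard" or "wipe" it.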
Claude Code's leaked source (March 31 npm packaging error) hints at the right direction. There's a DreamTask module that runs during idle time — consolidating memories, merging duplicates. The codebase literally calls it "dreaming." But it's primitive. A memoryAge.ts module appends text warnings like "This memory is 47 days old" — but the system doesn't actually reduce the memory's weight or trigger re-verification. It's a label, not a mechanism.
What we need: active curation. A system that continuously evaluates what's worth keeping, what should decay, and what should be promoted from short-term to long-term. Not "store and search" — "curate and forget."
This gets way harder with multiple agents
Claude Code's subagents share a CLAUDE.md file. Agent A writes, Agent B picks it up on next read. Works for 2-3 agents. At 20+ agents making concurrent decisions? Write conflicts, stale reads, contradictory entries nobody reconciles.
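The failure mode with a shared file is the classic lost update. A minimal demonstration with two simulated agents and last-write-wins semantics (the keys and values are made up):

```python
# shared memory file, modeled as a dict
memory = {"deploy_day": "Friday"}

snapshot_a = dict(memory)  # Agent A reads the file
snapshot_b = dict(memory)  # Agent B reads it concurrently

snapshot_a["deploy_day"] = "Thursday"  # A records a new decision
snapshot_b["test_cmd"] = "pytest -q"   # B records something unrelated

memory = snapshot_a  # A writes the whole file back
memory = snapshot_b  # B writes last: A's update is silently gone
```

Nothing errored, nothing warned, and the stale `deploy_day` is now what every other agent reads. At 2-3 agents this is rare enough to ignore; at 20+ it's constant.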
Research in agent-based social simulation (Stanford's Generative Agents, Tsinghua's AgentSociety) has been hitting these problems for years at 100+ agent scale. Questions that no production system answers:
- If 50 agents independently store the same fact, is it more reliable or just more popular?
- When two agents have contradictory memories, how do you resolve without picking an arbitrary winner?
- When does a group "forget" something — when every individual forgets, or when it stops being referenced?
These aren't academic curiosities. They're the exact problems any multi-agent coding setup will face at scale.
My take
After studying all of these, I think the field is stuck on the wrong framing. Memory isn't a storage problem. It's a coordination and curation problem. The pieces that seem necessary:
- Tiered personal memory with explicit promotion/demotion rules
- Shared state as a protocol (not a shared file)
- Active forgetting — relevance decay weighted by usage and cross-agent reinforcement
- Conflict as first-class data — maintain disagreements instead of silently picking winners
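Put together, those pieces suggest something like the following sketch. Every threshold, name, and formula here is an illustrative assumption; no shipping system works this way:

```python
import time

class CuratedMemory:
    """Tiered memory with usage-weighted decay: facts are promoted when
    repeatedly used and actively evicted when their weight decays."""
    PROMOTE_AT = 3           # uses before a fact becomes long-term
    EVICT_BELOW = 0.2        # decayed weight at which we forget
    HALF_LIFE = 7 * 86400    # seconds for weight to halve when unused

    def __init__(self):
        self.short_term = {}  # key -> (value, hits, last_used)
        self.long_term = {}

    def remember(self, key, value):
        self.short_term[key] = (value, 1, time.time())

    def recall(self, key):
        for tier in (self.long_term, self.short_term):
            if key in tier:
                value, hits, _ = tier[key]
                tier[key] = (value, hits + 1, time.time())
                # promotion: repeatedly used facts graduate to long-term
                if tier is self.short_term and hits + 1 >= self.PROMOTE_AT:
                    self.long_term[key] = self.short_term.pop(key)
                return value
        return None

    def decay(self, now=None):
        """Active forgetting: delete entries whose usage-weighted
        relevance has decayed below threshold, instead of labeling them."""
        now = time.time() if now is None else now
        for tier in (self.short_term, self.long_term):
            for key in list(tier):
                _, hits, last = tier[key]
                weight = hits * 0.5 ** ((now - last) / self.HALF_LIFE)
                if weight < self.EVICT_BELOW:
                    del tier[key]
```

The contrast with the memoryAge.ts approach is the `decay` method: it changes what the store contains, not what a label says about it.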
The Meta-Harness paper (Stanford/MIT, March 2026) showed that harness design alone produces a 6x performance gap on the same model. Memory is probably the highest-leverage harness component still wide open.
The agent that wins won't remember the most. It'll forget the best.
What's your actual memory setup? Anyone found something that works across sessions without massive overhead?
LoveMind_AI@reddit
Chronologically stacked, meticulously prompted, unpackable and repackable compaction with subsequent meta-compaction, as a starting point.
Impressive-Judge-357@reddit (OP)
Good Insight sir
Impressive-Judge-357@reddit (OP)
I agree with your opinion.
nicoloboschi@reddit
This benchmark is really insightful. I agree that active curation, not just storage, is the key to effective agent memory, and the forgetting curve analogy is spot on. The system I built, Hindsight, is designed around similar principles, with tiered memory and decay mechanisms. https://github.com/vectorize-io/hindsight
Impressive-Judge-357@reddit (OP)
wow. i will watch it
DirectorNo6063@reddit
i've been manually pruning a simple json file after each session. it's tedious but nothing else has worked without blowing up my token budget.
the dreaming concept is fascinating. we need systems that can decide what's noise, not just store more of it.
Uncle___Marty@reddit
It's actually messed up how we call it "AI" yet simple memory is something that's really eluded current AI, along with the ability to learn and store new information easily.
Thanks for the interesting read bud, you're poking a sub field of AI that really needs a good hard "shove" :/
ItilityMSP@reddit
You should look at the RPM paper: different paradigm, and it outperforms them all with zero memory.
Charming_Support726@reddit
Well. The main issue is that most agent memory systems are:
- Not solving a technical problem; it's the user who is somehow missing the agent remembering
- Out of vibe-coding hell, a plague like Todo apps, LLM routers, and Flappy Bird games. Worthless shit
- Token-expensive and mostly useless. Remember: AI is a technical tool, a simulation of intelligence, not a real person. Memory neither serves nor fits the architecture
I generate documentation from time to time and then I am good.
cryyingboy@reddit
Mem0 at 49% recall is rough. Curious whether that tanks further on multi-turn conversations past 10 exchanges. Zep burning 340x tokens for 15 points feels like brute forcing context. Did you test with quantized models or full precision? Token overhead like that could wreck local inference on 24GB VRAM cards. Would love to see latency numbers per query too.
openSourcerer9000@reddit
Memory seems like a classic ML problem. Something like LSTM for agents
denoflore_ai_guy@reddit
Interesting
Impressive-Judge-357@reddit (OP)
Yeah. really