Benchmarked 4 agent memory systems: Mem0 scores 49% recall (worse than a coin flip), Zep uses 340x more tokens for a 15-point improvement. Here's what's actually going on.

Posted by Impressive-Judge-357@reddit | LocalLLaMA | 13 comments

I've been digging into how AI coding agents actually handle memory — not what the marketing says, but what the code and benchmarks show. Here's what I found.

TL;DR: Every agent memory system in 2026 is either too simple (can't search), too expensive (600K tokens per conversation), or too clever (burns tokens on memory management instead of actual work). The real unsolved problem isn't remembering — it's forgetting.

How the major systems actually work

Claude Code — Reads CLAUDE.md at session start. The entire file goes into context. No vector DB, no semantic search. Auto memory (v2.1.59+) writes notes to markdown files. Hard cap: 200 lines for MEMORY.md; everything beyond is silently truncated.

Intentionally simple. Works for small projects. Falls apart on monorepos with years of decisions.
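For a sense of how little machinery there is, the whole mechanism fits in a few lines. A minimal sketch (the `load_memory` helper and the constant name are mine, not from the actual source):

```python
from pathlib import Path

MAX_LINES = 200  # the hard cap described above; extra lines are silently dropped

def load_memory(path: str, max_lines: int = MAX_LINES) -> str:
    """Read a memory file and silently truncate past the line cap."""
    lines = Path(path).read_text().splitlines()
    return "\n".join(lines[:max_lines])  # no warning when content is lost

# Whatever this returns is prepended to the context verbatim.
# There is no search step, so cost grows linearly with file size.
```

Everything lives or dies by what fits under the cap, which is exactly why it stops scaling on large repos.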

Mem0 (48K stars) — Decomposes interactions into facts, stores as embeddings, retrieves via semantic search. Sounds great until you check the numbers:

| System | LongMemEval accuracy | Tokens per conversation |
|--------|---------------------|-------------------------|
| Mem0   | 49.0%               | ~1,764                  |
| Zep    | 63.8%               | ~600,000                |
| Letta  | ~83.2%              | Dynamic                 |

Mem0 recalls the right information less than half the time. Zep is better — but it uses ~340x more tokens for ~15 points of accuracy. The Zep team disputes the Mem0 paper's methodology, claiming 75.1% with proper configuration. Even so.
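The Mem0-style pipeline (decompose into facts, embed, retrieve by similarity) is easy to sketch. Here's a toy version using a bag-of-words counter as a stand-in for a real embedding model; all the names are mine, not Mem0's API:

```python
import math
from collections import Counter

def embed(text: str) -> Counter:
    # Toy bag-of-words stand-in for a real embedding model.
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[w] * b[w] for w in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

class FactStore:
    """Store extracted facts as vectors, retrieve by similarity."""
    def __init__(self):
        self.facts = []  # list of (fact, vector) pairs

    def add(self, fact: str):
        self.facts.append((fact, embed(fact)))

    def search(self, query: str, k: int = 1):
        q = embed(query)
        ranked = sorted(self.facts, key=lambda f: cosine(q, f[1]), reverse=True)
        return [fact for fact, _ in ranked[:k]]
```

The failure mode the 49% number points at: if the decomposition step drops or mangles a fact, no amount of retrieval quality gets it back.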

Letta/MemGPT — Treats the context window like RAM and external storage like disk. The agent decides what to page in and out. Best benchmark score (~83.2%). But every memory operation costs inference tokens: the agent spends a significant share of its budget reasoning about what to remember instead of doing the work.
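The RAM/disk analogy maps onto a very small data structure. A sketch of the paging idea (sizes in characters as a stand-in for tokens; class and method names are mine, not Letta's API):

```python
class PagedMemory:
    """MemGPT-style idea: a small 'core' context plus unbounded
    archival storage, with explicit page-in/page-out operations."""

    def __init__(self, core_budget: int = 200):
        self.core_budget = core_budget
        self.core = []     # lives in the context window ("RAM")
        self.archive = []  # external storage ("disk")

    def _core_size(self) -> int:
        return sum(len(m) for m in self.core)

    def remember(self, item: str):
        self.core.append(item)
        # Evict oldest items when over budget. In the real system each
        # eviction/recall decision is itself an LLM call, which is where
        # the inference-token overhead comes from.
        while self._core_size() > self.core_budget and len(self.core) > 1:
            self.archive.append(self.core.pop(0))

    def recall(self, keyword: str):
        return [m for m in self.archive if keyword in m]
```

The structure is trivial; the expensive part is that "decide what to evict" and "decide what to recall" are model calls, not list operations.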

The actual problem: no agent knows how to forget

Ebbinghaus mapped the human forgetting curve in 1885. We don't keep everything. We forget most things. What survives got reinforced through repetition or significance.
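Ebbinghaus's curve is roughly exponential: retention ≈ exp(-t/S), where stability S grows each time a memory is reinforced. A sketch of what that would mean for an agent memory system (the doubling rule is my assumption, borrowed from spaced-repetition intuition, not from any of the systems above):

```python
import math

def retention(t_days: float, strength: float) -> float:
    """Ebbinghaus-style exponential forgetting: R = exp(-t/S).
    Higher strength S flattens the curve, so the memory survives longer."""
    return math.exp(-t_days / strength)

def reinforce(strength: float, factor: float = 2.0) -> float:
    # Assumed rule: each successful recall multiplies stability,
    # the intuition behind spaced-repetition schedulers.
    return strength * factor
```

After a week, an unreinforced memory (S=1) is effectively gone, while one reinforced three times (S=8) still retains ~40%. That asymmetry is the whole point: what survives is what got used.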

AI agents have two modes: hoard everything (vector stores growing forever) or lose everything (session boundary wipes the slate). There's no middle ground.

Claude Code's leaked source (March 31 npm packaging error) hints at the right direction. There's a DreamTask module that runs during idle time — consolidating memories, merging duplicates. The codebase literally calls it "dreaming." But it's primitive. A memoryAge.ts module appends text warnings like "This memory is 47 days old" — but the system doesn't actually reduce the memory's weight or trigger re-verification. It's a label, not a mechanism.

What we need: active curation. A system that continuously evaluates what's worth keeping, what should decay, and what should be promoted from short-term to long-term. Not "store and search" — "curate and forget."
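To make "curate and forget" concrete, here's one hypothetical scoring rule: recency decay weighted by usage, with thresholds for promotion and deletion. Every number and name here is mine; this is a sketch of the shape of the mechanism, not anyone's implementation:

```python
import math

def memory_score(access_count: int, last_access_days: float,
                 half_life: float = 30.0) -> float:
    """Hypothetical curation score: exponential recency decay,
    boosted logarithmically by how often the memory was used."""
    decay = 0.5 ** (last_access_days / half_life)
    reinforcement = math.log1p(access_count)
    return decay * (1.0 + reinforcement)

def triage(score: float, promote_at: float = 1.5, drop_at: float = 0.1) -> str:
    if score >= promote_at:
        return "promote"   # move to long-term memory
    if score <= drop_at:
        return "forget"    # delete, or summarize and discard the original
    return "keep"
```

Run this periodically (during idle "dreaming" time, say) and memories actually move between tiers, instead of just getting a "this is 47 days old" label stapled on.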

This gets way harder with multiple agents

Claude Code's subagents share a CLAUDE.md file. Agent A writes, Agent B picks it up on next read. Works for 2-3 agents. At 20+ agents making concurrent decisions? Write conflicts, stale reads, contradictory entries nobody reconciles.
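The standard fix for this is optimistic concurrency: every write carries the version the writer read, so a stale read fails loudly instead of silently overwriting. A hypothetical sketch (this is not how Claude Code works; it's what "shared state as a protocol" could look like at minimum):

```python
class SharedState:
    """Versioned key-value store. A write must present the version it
    read; concurrent writers conflict explicitly instead of clobbering
    each other the way appends to a shared markdown file do."""

    def __init__(self):
        self.data = {}  # key -> (version, value)

    def read(self, key):
        return self.data.get(key, (0, None))

    def write(self, key, value, expected_version) -> bool:
        current, _ = self.read(key)
        if current != expected_version:
            return False  # stale read detected; caller must re-read and reconcile
        self.data[key] = (current + 1, value)
        return True
```

With 20 agents, the interesting part isn't the rejection itself but what the losing agent does next: re-read and merge, escalate, or record the disagreement.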

Research in agent-based social simulation (Stanford's Generative Agents, Tsinghua's AgentSociety) has been hitting these problems for years at 100+ agent scale, raising questions that no production system answers.

These aren't academic curiosities. They're the exact problems any multi-agent coding setup will face at scale.

My take

After studying all of these, I think the field is stuck on the wrong framing. Memory isn't a storage problem. It's a coordination and curation problem. The pieces that seem necessary:

  1. Tiered personal memory with explicit promotion/demotion rules
  2. Shared state as a protocol (not a shared file)
  3. Active forgetting — relevance decay weighted by usage and cross-agent reinforcement
  4. Conflict as first-class data — maintain disagreements instead of silently picking winners
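Point 4 is the least familiar, so here's what "conflict as first-class data" could look like. A hypothetical sketch, with all names mine: a memory entry keeps every agent's claim and only reports a value when they agree:

```python
from dataclasses import dataclass, field

@dataclass
class Claim:
    agent: str
    value: str

@dataclass
class Entry:
    """Memory entry that records disagreements instead of picking a winner."""
    key: str
    claims: list = field(default_factory=list)

    def assert_value(self, agent: str, value: str):
        self.claims.append(Claim(agent, value))

    def conflicts(self) -> bool:
        return len({c.value for c in self.claims}) > 1

    def consensus(self):
        # None means "unresolved conflict", which downstream agents can
        # see and act on, rather than trusting a silently chosen value.
        values = {c.value for c in self.claims}
        return self.claims[-1].value if len(values) == 1 else None
```

The design choice is that `None` is informative: an agent reading this entry knows the question is contested, instead of inheriting whichever answer happened to be written last.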

The Meta-Harness paper (Stanford/MIT, March 2026) showed that harness design alone produces a 6x performance gap on the same model. Memory is probably the highest-leverage harness component still wide open.

The agent that wins won't remember the most. It'll forget the best.

What's your actual memory setup? Anyone found something that works across sessions without massive overhead?