Benchmarked 4 agent memory systems: Mem0 scores 49% recall (worse than a coin flip), Zep uses 340x more tokens for a 15-point improvement. Here's what's actually going on.

Posted by Impressive-Judge-357@reddit | LocalLLaMA | 13 comments

I've been digging into how AI coding agents actually handle memory — not what the marketing says, but what the code and benchmarks show. Here's what I found.

TL;DR: Every agent memory system in 2026 is either too simple (can't search), too expensive (600K tokens per conversation), or too clever (burns tokens on memory management instead of actual work). The real unsolved problem isn't remembering — it's forgetting.

How the major systems actually work

Claude Code — Reads CLAUDE.md at session start. The entire file goes into context. No vector DB, no semantic search. Auto memory (v2.1.59+) writes notes to markdown files. Hard cap: 200 lines for MEMORY.md; everything beyond is silently truncated.

Intentionally simple. Works for small projects. Falls apart on monorepos with years of decisions.
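For a sense of how little machinery there is, the whole mechanism fits in a few lines. A minimal sketch (the `load_memory` helper and the constant name are mine, not from the actual source):

```python
from pathlib import Path

MAX_LINES = 200  # the hard cap described above; extra lines are silently dropped

def load_memory(path: str, max_lines: int = MAX_LINES) -> str:
    """Read a memory file and silently truncate past the line cap."""
    lines = Path(path).read_text().splitlines()
    return "\n".join(lines[:max_lines])  # no warning when content is lost

# Whatever this returns is prepended to the context verbatim.
# There is no search step, so cost grows linearly with file size.
```

Everything lives or dies by what fits under the cap, which is exactly why it stops scaling on large repos.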

Mem0 (48K stars) — Decomposes interactions into facts, stores as embeddings, retrieves via semantic search. Sounds great until you check the numbers:

| System | LongMemEval accuracy | Tokens per conversation |
|--------|---------------------|-------------------------|
| Mem0   | 49.0%               | ~1,764                  |
| Zep    | 63.8%               | ~600,000                |
| Letta  | ~83.2%              | Dynamic                 |

Mem0 recalls the right information less than half the time. Zep is better — but it uses ~340x more tokens for ~15 points of accuracy. The Zep team disputes the Mem0 paper's methodology, claiming 75.1% with proper configuration. Even so.
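The Mem0-style pipeline (decompose into facts, embed, retrieve by similarity) is easy to sketch. Here's a toy version using a bag-of-words counter as a stand-in for a real embedding model; all the names are mine, not Mem0's API:

```python
import math
from collections import Counter

def embed(text: str) -> Counter:
    # Toy bag-of-words stand-in for a real embedding model.
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[w] * b[w] for w in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

class FactStore:
    """Store extracted facts as vectors, retrieve by similarity."""
    def __init__(self):
        self.facts = []  # list of (fact, vector) pairs

    def add(self, fact: str):
        self.facts.append((fact, embed(fact)))

    def search(self, query: str, k: int = 1):
        q = embed(query)
        ranked = sorted(self.facts, key=lambda f: cosine(q, f[1]), reverse=True)
        return [fact for fact, _ in ranked[:k]]
```

The failure mode the 49% number points at: if the decomposition step drops or mangles a fact, no amount of retrieval quality gets it back.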

Letta/MemGPT — Treats the context window like RAM and external storage like disk. The agent decides what to page in and out. Best benchmark score (~83.2%). But every memory operation costs inference tokens: the agent spends a significant share of its budget reasoning about what to remember instead of doing the work.
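The RAM/disk analogy maps onto a very small data structure. A sketch of the paging idea (sizes in characters as a stand-in for tokens; class and method names are mine, not Letta's API):

```python
class PagedMemory:
    """MemGPT-style idea: a small 'core' context plus unbounded
    archival storage, with explicit page-in/page-out operations."""

    def __init__(self, core_budget: int = 200):
        self.core_budget = core_budget
        self.core = []     # lives in the context window ("RAM")
        self.archive = []  # external storage ("disk")

    def _core_size(self) -> int:
        return sum(len(m) for m in self.core)

    def remember(self, item: str):
        self.core.append(item)
        # Evict oldest items when over budget. In the real system each
        # eviction/recall decision is itself an LLM call, which is where
        # the inference-token overhead comes from.
        while self._core_size() > self.core_budget and len(self.core) > 1:
            self.archive.append(self.core.pop(0))

    def recall(self, keyword: str):
        return [m for m in self.archive if keyword in m]
```

The structure is trivial; the expensive part is that "decide what to evict" and "decide what to recall" are model calls, not list operations.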

The actual problem: no agent knows how to forget

Ebbinghaus mapped the human forgetting curve in 1885. We don't keep everything. We forget most things. What survives got reinforced through repetition or significance.
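Ebbinghaus's curve is roughly exponential: retention ≈ exp(-t/S), where stability S grows each time a memory is reinforced. A sketch of what that would mean for an agent memory system (the doubling rule is my assumption, borrowed from spaced-repetition intuition, not from any of the systems above):

```python
import math

def retention(t_days: float, strength: float) -> float:
    """Ebbinghaus-style exponential forgetting: R = exp(-t/S).
    Higher strength S flattens the curve, so the memory survives longer."""
    return math.exp(-t_days / strength)

def reinforce(strength: float, factor: float = 2.0) -> float:
    # Assumed rule: each successful recall multiplies stability,
    # the intuition behind spaced-repetition schedulers.
    return strength * factor
```

After a week, an unreinforced memory (S=1) is effectively gone, while one reinforced three times (S=8) still retains ~40%. That asymmetry is the whole point: what survives is what got used.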

AI agents have two modes: hoard everything (vector stores growing forever) or lose everything (session boundary wipes the slate). There's no middle ground.

Claude Code's leaked source (March 31 npm packaging error) hints at the right direction. There's a DreamTask module that runs during idle time — consolidating memories, merging duplicates. The codebase literally calls it "dreaming." But it's primitive. A memoryAge.ts module appends text warnings like "This memory is 47 days old" — but the system doesn't actually reduce the memory's weight or trigger re-verification. It's a label, not a mechanism.

What we need: active curation. A system that continuously evaluates what's worth keeping, what should decay, and what should be promoted from short-term to long-term. Not "store and search" — "curate and forget."
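To make "curate and forget" concrete, here's one hypothetical scoring rule: recency decay weighted by usage, with thresholds for promotion and deletion. Every number and name here is mine; this is a sketch of the shape of the mechanism, not anyone's implementation:

```python
import math

def memory_score(access_count: int, last_access_days: float,
                 half_life: float = 30.0) -> float:
    """Hypothetical curation score: exponential recency decay,
    boosted logarithmically by how often the memory was used."""
    decay = 0.5 ** (last_access_days / half_life)
    reinforcement = math.log1p(access_count)
    return decay * (1.0 + reinforcement)

def triage(score: float, promote_at: float = 1.5, drop_at: float = 0.1) -> str:
    if score >= promote_at:
        return "promote"   # move to long-term memory
    if score <= drop_at:
        return "forget"    # delete, or summarize and discard the original
    return "keep"
```

Run this periodically (during idle "dreaming" time, say) and memories actually move between tiers, instead of just getting a "this is 47 days old" label stapled on.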

This gets way harder with multiple agents

Claude Code's subagents share a CLAUDE.md file. Agent A writes, Agent B picks it up on next read. Works for 2-3 agents. At 20+ agents making concurrent decisions? Write conflicts, stale reads, contradictory entries nobody reconciles.
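The standard fix for this is optimistic concurrency: every write carries the version the writer read, so a stale read fails loudly instead of silently overwriting. A hypothetical sketch (this is not how Claude Code works; it's what "shared state as a protocol" could look like at minimum):

```python
class SharedState:
    """Versioned key-value store. A write must present the version it
    read; concurrent writers conflict explicitly instead of clobbering
    each other the way appends to a shared markdown file do."""

    def __init__(self):
        self.data = {}  # key -> (version, value)

    def read(self, key):
        return self.data.get(key, (0, None))

    def write(self, key, value, expected_version) -> bool:
        current, _ = self.read(key)
        if current != expected_version:
            return False  # stale read detected; caller must re-read and reconcile
        self.data[key] = (current + 1, value)
        return True
```

With 20 agents, the interesting part isn't the rejection itself but what the losing agent does next: re-read and merge, escalate, or record the disagreement.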

Research in agent-based social simulation (Stanford's Generative Agents, Tsinghua's AgentSociety) has been hitting these problems for years at 100+ agent scale, raising questions that no production system answers.

These aren't academic curiosities. They're the exact problems any multi-agent coding setup will face at scale.

My take

After studying all of these, I think the field is stuck on the wrong framing. Memory isn't a storage problem. It's a coordination and curation problem. The pieces that seem necessary:

  1. Tiered personal memory with explicit promotion/demotion rules
  2. Shared state as a protocol (not a shared file)
  3. Active forgetting — relevance decay weighted by usage and cross-agent reinforcement
  4. Conflict as first-class data — maintain disagreements instead of silently picking winners
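Point 4 is the least familiar, so here's what "conflict as first-class data" could look like. A hypothetical sketch, with all names mine: a memory entry keeps every agent's claim and only reports a value when they agree:

```python
from dataclasses import dataclass, field

@dataclass
class Claim:
    agent: str
    value: str

@dataclass
class Entry:
    """Memory entry that records disagreements instead of picking a winner."""
    key: str
    claims: list = field(default_factory=list)

    def assert_value(self, agent: str, value: str):
        self.claims.append(Claim(agent, value))

    def conflicts(self) -> bool:
        return len({c.value for c in self.claims}) > 1

    def consensus(self):
        # None means "unresolved conflict", which downstream agents can
        # see and act on, rather than trusting a silently chosen value.
        values = {c.value for c in self.claims}
        return self.claims[-1].value if len(values) == 1 else None
```

The design choice is that `None` is informative: an agent reading this entry knows the question is contested, instead of inheriting whichever answer happened to be written last.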

The Meta-Harness paper (Stanford/MIT, March 2026) showed that harness design alone produces a 6x performance gap on the same model. Memory is probably the highest-leverage harness component still wide open.

The agent that wins won't remember the most. It'll forget the best.

What's your actual memory setup? Anyone found something that works across sessions without massive overhead?