MCP is great, but it doesn’t solve AI memory (am I missing something?)
Posted by BrightOpposite@reddit | LocalLLaMA | 35 comments
I’ve been experimenting with MCP servers + Claude for a bit now, and I keep running into the same issue:
the AI is still fundamentally stateless.
Even with tools and structured calls, every interaction feels like it starts from scratch unless you manually pipe context back in.
Which leads to things like:
- repeating instructions
- re-explaining user intent
- inconsistent outputs across sessions
MCP improves capability routing, no doubt.
But it doesn’t really address context persistence.
Feels like we’ve made AI more powerful…
but not more aware.
Curious how others are handling this:
- Are you building your own memory layer?
- Using vector DBs / session stitching?
- Or just accepting the stateless nature for now?
Would love to hear how people are thinking about this.
MihaiBuilds@reddit
I ran into the same thing. MCP gives you tools but no persistence — every session starts from zero. So I built a memory layer on top of it. postgres + pgvector, hybrid search (vector + full-text keyword), and MCP tools for recall/remember/forget. Claude calls those tools during the session to store and retrieve context automatically. been using it daily for months and it completely changes how sessions work — the AI actually knows what happened last week
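The hybrid part of that setup — combining vector similarity with full-text keyword hits — can be sketched without a database at all. This is a hypothetical illustration using reciprocal rank fusion (RRF) to merge the two ranked result lists; the input lists stand in for rows returned by a pgvector query and a full-text query, and none of this is the commenter's actual code.

```python
# Hypothetical sketch: merging vector-similarity results with full-text
# keyword results via reciprocal rank fusion (RRF). The two input lists
# stand in for the rows the two queries would return, ordered best-first.

def hybrid_merge(vector_hits, keyword_hits, k=60):
    """Combine two ranked lists of memory IDs into one hybrid ranking.

    RRF scores an ID by 1/(k + rank) summed across the lists it appears
    in, so items ranked well by either search float to the top.
    """
    scores = {}
    for hits in (vector_hits, keyword_hits):
        for rank, mem_id in enumerate(hits):
            scores[mem_id] = scores.get(mem_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

# "m2" ranks well in both lists, so it wins the merged ranking.
merged = hybrid_merge(["m1", "m2", "m3"], ["m2", "m4"])
```

In a real setup the two lists would come from a pgvector distance query and a PostgreSQL full-text query, with the fused top-k handed back through the MCP recall tool.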
BrightOpposite@reddit (OP)
this is super interesting — especially the recall/remember/forget loop via MCP tools
feels like this is where a lot of people are landing right now:
adding a memory layer on top (pgvector / hybrid search) + letting the model pull context when needed
and I can see how that gets you much closer to continuity across sessions
the part I keep getting stuck on though is — even here, the system is still mostly retrieving what matches, not necessarily what matters
like hybrid search helps with recall, but deciding what should persist, evolve, or be discarded still feels kind of implicit / heuristic-driven
curious — have you run into cases where the system brings back the “right” context semantically, but still misses the actual state or intent of what was happening?
MihaiBuilds@reddit
yeah, I've hit that exact problem. just today actually — I searched for past session summaries and told myself they were "missing" because I was searching in the wrong memory space. the system found semantically similar results, but not the ones that actually mattered for the question.
the way I'm handling it right now is memory spaces (like namespaces) + importance scoring + recency decay. so newer and more important memories float to the top. but you're right, deciding what should persist vs evolve vs get discarded is still mostly heuristic. I tag importance at ingestion time and let recency do the rest, but there's no real "this context is actually relevant to what I'm doing right now" signal beyond what the query returns.
it's one of the harder problems honestly. the search part is solved, the "what matters right now" part is not.
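The importance-plus-recency ranking described above can be sketched in a few lines. This is a minimal illustration, not the commenter's implementation: the half-life and field names are assumptions.

```python
# Hedged sketch of "importance scoring + recency decay": importance is
# tagged at ingestion, and an exponential decay is applied at query time
# so newer memories float up. Half-life and fields are invented here.
import time

HALF_LIFE = 7 * 24 * 3600  # assumed recency half-life: one week

def rank_score(importance, created_at, now=None):
    """importance in [0, 1]; older memories decay toward zero."""
    now = now or time.time()
    age = max(0.0, now - created_at)
    decay = 0.5 ** (age / HALF_LIFE)
    return importance * decay

now = 1_000_000.0
fresh_minor = rank_score(0.3, now - 3600, now=now)          # an hour old, low importance
old_major = rank_score(0.9, now - 30 * 24 * 3600, now=now)  # a month old, high importance
```

With a one-week half-life, the recent low-importance memory outranks the month-old high-importance one — which is exactly the "recency does the rest" behavior, and also exactly why decay alone can't capture "relevant to what I'm doing right now."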
DaLyon92x@reddit
you're not missing anything. MCP is a tool routing layer, not a memory layer. it was never designed to solve continuity.
what i ended up doing is keeping a file-based memory system outside the agent. structured markdown files that get loaded into context at session start. the agent reads and writes to them. crude but it actually works because you can version it, grep it, and debug it without any special tooling.
the fancier approaches (vector stores, knowledge graphs) look better on paper but in practice a flat file you can read in 2 seconds beats a retrieval pipeline that hallucinates relevance.
DinoAmino@reddit
How large has that flat file become now? Not all of the memories are relevant to every prompt. How much noise is being fed into your prompt? How do you know the flat file doesn't hallucinate relevance? Do you instruct your model to only consider information in its context? If not, it happily hallucinates to fill in knowledge gaps. Have you ever run evals to actually measure your flat file vs vector or graph DBs?
I take such claims with a large pile of salt. People want to believe that a huge context window stuffed with everything is a fine solution. It's the go-to for people who don't want to put in the effort to use RAG. This is blind trust that the LLM will successfully find and use the info it needs, when in fact there are large swaths of context being overlooked, as if they weren't there. Lost in the Middle is real, and that middle gets bigger the more context you use.
DaLyon92x@reddit
Fair challenges. the file stays small because the AI actively prunes it - it's an index of pointers, not a dump of everything. The actual content lives in individual files loaded on demand. Noise concern is real at scale though, you're right about that. Hallucinated relevance is the thing I've had to watch most carefully. If the index gets vague, retrieval quality drops. It's not a solved system, it's a workflow.
DaLyon92x@reddit
Lost in the Middle is real and it's the exact reason the system uses an index file, not one big dump. the MEMORY.md file is ~200 lines of pointers. actual content gets loaded on demand based on what's relevant to the current task. so at any given time only maybe 5-10 small files are in context, not everything.
have i run formal evals against vector or graph? no. what i can tell you is the failure mode is different. with RAG the failure is silent, you get plausible but wrong retrievals. with a flat file the failure is visible, you can literally read the index and see what's missing or stale. debuggability matters more than theoretical retrieval quality when you're the one maintaining it.
you're right that this doesn't scale to thousands of memories. it's a single-user developer workflow, not an enterprise solution. for that you'd absolutely want proper retrieval. but for my use case the ceiling hasn't been the bottleneck yet.
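The index-of-pointers pattern described above is simple enough to sketch. This is an illustrative toy, not the commenter's tooling: the file names and index shape are invented.

```python
# Minimal sketch of the index-of-pointers pattern: a small MEMORY.md-style
# index maps topics to files, and only the files matching the current task
# get read into context. Paths and format are illustrative assumptions.
import pathlib
import tempfile

def load_relevant(index, task_keywords, root):
    """Read only the memory files whose topic matches the task."""
    chunks = []
    for topic, filename in index.items():
        if topic in task_keywords:
            chunks.append((root / filename).read_text())
    return "\n\n".join(chunks)

root = pathlib.Path(tempfile.mkdtemp())
(root / "auth.md").write_text("JWT decision: RS256, 15 min expiry")
(root / "style.md").write_text("User prefers direct answers")
index = {"auth": "auth.md", "style": "style.md"}

context = load_relevant(index, {"auth"}, root)  # only auth.md is loaded
```

The point of the pattern is visible here: the index stays tiny, and the debuggability claim holds because both the index and the loaded chunks are plain files you can read directly.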
BrightOpposite@reddit (OP)
yeah this makes sense — honestly this is the most practical setup I’ve seen people converge on
but it still feels like we’ve accepted a weird constraint as normal:
we’re essentially rebuilding state at the start of every session (load files → inject → hope it sticks)
even if it works, it puts the burden on us to decide: what to store, how to structure it, when to load it, what to drop…
at some point it feels less like memory and more like manual context bootstrapping
curious if you’ve hit limits with this as things get more complex?
DaLyon92x@reddit
Tiered. MEMORY.md is just pointers to actual files. Individual memory files get archived when they're stale or completed. The system flags things for archiving as part of normal operation. In practice the active index stays manageable because most context has an expiry - finished projects, resolved decisions, old feedback. For truly long-term autonomous agents running indefinitely, this approach hits a ceiling. Not going to pretend otherwise.
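The expiry-driven archiving described above — flagging finished projects and stale entries as part of normal operation — can be sketched as a simple filter. Status values and the TTL below are assumptions for illustration.

```python
# Sketch of expiry-driven archiving: entries get flagged when their status
# is terminal or they haven't been touched within a TTL. Statuses and the
# 90-day TTL are invented, not part of the original setup.
import time

TTL = 90 * 24 * 3600  # assumed: archive anything untouched for 90 days

def flag_for_archive(entries, now=None):
    now = now or time.time()
    return [
        e["id"] for e in entries
        if e.get("status") in {"done", "resolved"} or now - e["updated_at"] > TTL
    ]

now = 10_000_000.0
entries = [
    {"id": "proj-a", "status": "done", "updated_at": now},
    {"id": "proj-b", "status": "active", "updated_at": now - 100 * 24 * 3600},
    {"id": "proj-c", "status": "active", "updated_at": now - 3600},
]
stale = flag_for_archive(entries, now=now)
```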
BrightOpposite@reddit (OP)
yeah this is super helpful — the tiered approach makes a lot of sense for keeping things manageable in practice
but that ceiling you mentioned is exactly what keeps bothering me
it feels like we’ve built really good memory management systems for humans to maintain… not actual memory systems for agents
like the agent isn’t really “owning” memory — it’s just reading from a structure we constantly curate and prune
which works until things get long-running or messy, and then the burden just shifts back to us
curious if you think this is just a limitation we’ll live with, or if memory eventually becomes something the system maintains + evolves itself?
DaLyon92x@reddit
you're hitting the right nerve. what we have now is memory management, not memory. the difference is agency. current systems need a human (or a well-prompted AI) to decide what's worth keeping. actual agent memory would need self-evaluation: "did that memory help me last time I used it?" nobody's really doing that yet. closest I've seen is weighting memories by retrieval frequency, but that just optimizes for repetition, not usefulness. I think the breakthrough will come from agents that can forget deliberately, not just accumulate.
BrightOpposite@reddit (OP)
yeah this is a really clean way to frame it — the agency part is what’s missing
right now memory is basically passive: store → retrieve → hope it’s useful
but what you’re describing is closer to a feedback loop where the system actually updates its own understanding over time
the “did this help last time?” piece feels like the unlock — without that, we’re just accumulating context, not learning from it
and the forgetting point is interesting too — feels like we’ve been optimizing for recall (vector DBs, indexing, etc.) but not for selective retention
curious — how would you even start implementing that evaluation layer in practice? feels like that’s where most current setups break
DaLyon92x@reddit
exactly. passive memory is just a database with extra steps. the feedback loop is the hard part because you need the agent to evaluate its own retrieval quality, and current models don't have great self-assessment. I've been thinking about a simple proxy for it though: if a memory gets loaded into context but the agent never references it in its output, that's a signal it wasn't useful. track that over time and you get a basic relevance score without needing the model to explicitly judge itself.
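The loaded-but-never-referenced proxy described above can be sketched concretely. This is a hypothetical illustration with naive substring matching on a short "anchor" phrase per memory — a real system would need fuzzier matching.

```python
# Hedged sketch of the relevance proxy: count how often each memory is
# loaded into context vs. actually referenced in the output, and derive a
# usefulness score from the ratio. Anchor-phrase matching is a naive
# stand-in for detecting a reference.
from collections import defaultdict

class RelevanceTracker:
    def __init__(self):
        self.loads = defaultdict(int)
        self.refs = defaultdict(int)

    def record(self, loaded, output, anchors):
        """loaded: memory IDs put in context; anchors: id -> phrase to look for."""
        for mem_id in loaded:
            self.loads[mem_id] += 1
            if anchors.get(mem_id, "") and anchors[mem_id] in output:
                self.refs[mem_id] += 1

    def usefulness(self, mem_id):
        return self.refs[mem_id] / self.loads[mem_id] if self.loads[mem_id] else 0.0

t = RelevanceTracker()
anchors = {"m1": "RS256", "m2": "GraphQL"}
t.record(["m1", "m2"], "We signed the token with RS256 as decided.", anchors)
t.record(["m1", "m2"], "Token rotation uses RS256 keys.", anchors)
```

After two sessions, m1 scores 1.0 and m2 scores 0.0 — a candidate for demotion or archiving, without the model ever judging itself.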
BrightOpposite@reddit (OP)
this is a really clever proxy — using “did it actually get referenced” as a signal for usefulness makes a lot of sense
feels like a good first step toward that feedback loop without needing full self-evaluation
I wonder though — does it miss cases where the memory shapes the output but isn’t explicitly referenced?
like constraints, prior decisions, or style/context that influence the response but don’t show up directly
feels like those are the tricky ones — where the memory is doing work, but you can’t easily observe it
maybe that’s where things get hard: not just tracking usage, but understanding impact
DaLyon92x@reddit
good catch. implicit influence is the blind spot in that approach. a memory about "this user prefers direct answers" shapes every response but never gets quoted. you'd need a two-tier system: explicit references you can track, and style/context memories that get evaluated differently. maybe periodic A/B where you run the same prompt with and without the memory and compare outputs. expensive but it's the only way to measure invisible influence. this is getting into research territory honestly.
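The A/B idea above — same prompt with and without a memory, compare the outputs — can be sketched with a text-similarity measure. `generate` below is a trivial stub standing in for a real model call, so only the measurement logic is meaningful.

```python
# Sketch of A/B influence measurement: run the same prompt with and
# without a style memory and score how much the output shifts.
# `generate` is a stub; a real version would call the model with the
# memory prepended to its context.
import difflib

def generate(prompt, memory=None):
    # Stub behavior for illustration only.
    if memory and "direct" in memory:
        return "Use RS256."
    return "There are several options worth weighing, but RS256 is a reasonable choice."

def influence(prompt, memory):
    """1.0 means the memory completely changed the output, 0.0 means no effect."""
    with_mem = generate(prompt, memory=memory)
    without = generate(prompt)
    return 1.0 - difflib.SequenceMatcher(None, with_mem, without).ratio()

score = influence("Which JWT algorithm?", "user prefers direct answers")
```

Surface similarity is a crude lens for "invisible influence" — an embedding distance or a judge model would be better — but it shows the shape of the measurement, and why it gets expensive: every memory under evaluation costs an extra generation.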
Fit-Produce420@reddit
MCP was not supposed to be a solution to "memory."
BrightOpposite@reddit (OP)
yeah totally — MCP wasn’t designed for memory
I think that’s actually what makes this interesting though
it’s great at giving agents access to tools and external systems, but once you start building anything long-running, you immediately run into the question of continuity
so you end up layering memory around MCP rather than within it
feels like we’ve solved “how agents act” pretty well, but “how they carry state over time” is still kind of bolted on
curious if you see that evolving as part of the MCP ecosystem, or staying as a separate layer?
l0nedigit@reddit
Use falkordb, ingest your instructions into a schema. The instructions could be a file per topic (e.g. programming-language standards, monorepo architecture/structure, etc.). Then have one singular instructions file which states to use that DB/MCP for memory lookup prior to action, storing negative outcomes, etc.
I've found that works very well. At least from a coding perspective in a large repo (300k+ loc).
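The topic-schema idea above might look something like the following in Cypher, the query language FalkorDB uses. This is a hypothetical sketch — the node labels, relationship, and properties are invented, and the helper just builds a parameterized statement you would hand to the graph via your MCP tool.

```python
# Hypothetical schema sketch for a property-graph memory: one Topic node
# per subject area, with Instruction nodes hanging off it. Labels and
# properties are invented for illustration; execution is left to the
# caller's graph client / MCP tool.

def store_instruction(topic, text, tags):
    """Build a parameterized Cypher statement linking an instruction to its topic."""
    query = (
        "MERGE (t:Topic {name: $topic}) "
        "CREATE (i:Instruction {text: $text, tags: $tags}) "
        "CREATE (t)-[:HAS_INSTRUCTION]->(i)"
    )
    return query, {"topic": topic, "text": text, "tags": tags}

query, params = store_instruction(
    "monorepo-structure",
    "New packages go under packages/<name> with their own tsconfig.",
    ["monorepo", "structure", "typescript"],
)
```

A matching lookup would `MATCH` on topic name or tags before the agent acts, which is the "check the DB prior to action" step the singular instructions file enforces.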
BrightOpposite@reddit (OP)
This is super helpful. Feels like most approaches right now are basically:
- structuring instructions
- storing them in a DB
- forcing the model to check before acting
which works… but also feels very "manual memory management". Curious — does this break down when context grows or across longer user journeys?
DaLyon92x@reddit
yeah it is a weird constraint. but honestly after trying the fancier approaches i stopped fighting it. the file-based approach is ugly but it means your agent's memory is just files on disk. you can git diff it, you can grep it, you can manually edit it when the agent writes something wrong.
the "rebuilding state" part gets less painful when you structure the files well. mine are categorized by type (user context, project state, feedback corrections) with frontmatter metadata. the agent reads the index file first and only pulls in what's relevant. keeps the context window manageable.
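The frontmatter layout described above can be sketched with a tiny parser. The field names below are illustrative assumptions, not the commenter's actual schema.

```python
# Sketch of the categorized-file layout: each memory file opens with a
# '---'-delimited frontmatter block, and a loader can filter on the
# metadata (type, project, ...) before pulling the body into context.

def parse_frontmatter(text):
    """Split '---'-delimited key: value frontmatter from the body."""
    meta, body = {}, text
    if text.startswith("---\n"):
        header, _, body = text[4:].partition("\n---\n")
        for line in header.splitlines():
            key, _, value = line.partition(":")
            meta[key.strip()] = value.strip()
    return meta, body.strip()

doc = """---
type: feedback-correction
project: api-server
---
Never auto-retry on 4xx responses."""
meta, body = parse_frontmatter(doc)
```

Filtering on `meta["type"]` or `meta["project"]` before loading is what keeps only the relevant handful of files in the context window.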
BrightOpposite@reddit (OP)
yeah this is exactly what I keep seeing — super well-structured file systems + indexing + selective loading
it works, but feels like we’ve basically turned “memory” into something the developer has to constantly maintain and curate
the part I’m still unsure about is — what happens as this grows over time? like multiple projects, evolving preferences, long-running agents… does the file/index layer itself start becoming the thing you have to manage?
feels like we reduced retrieval cost, but not really the cognitive overhead of maintaining state
DaLyon92x@reddit
Partially agree, but the AI handles the writes, updates, and pruning itself - I'm not doing that manually. The "rebuild at session start" is just loading an index file, takes seconds. It's less like reconstructing state and more like handing someone a briefing doc before a meeting. The real maintenance burden kicks in if the AI is bad at knowing what's worth keeping. That varies.
l0nedigit@reddit
Not if each topic is categorized well (8-12 keyword tags), instructions are atomic, and storing new information is enforced.
Context will, no doubt, expand over time. But I haven't had any issues to date.
There's going to be some bumps along the way I'm sure. It'll iron out though. I've gotten to a point where I can add a new feature and it follows previous patterns to a T. Saving so much time.
BrightOpposite@reddit (OP)
That makes sense — sounds like you’ve built a really solid retrieval + structure layer around it. I guess where I’m still unsure is: this works well for organized domains (like codebases), but do you think it holds up for:
- messier user interactions
- evolving preferences
- long-term behavioral context
feels like tagging + atomic storage works great when things are structured, but less clear when the signal is noisy or implicit. Curious if you’ve tried pushing it in that direction.
l0nedigit@reddit
I have not yet. Been using it for a pretty specific purpose. But for user preferences, this just feels like another metric added into the mix (time and weights), with the ability to supersede information based on staleness and higher weight.
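The supersede rule described above — newer, heavier-weighted entries beating stale ones — can be sketched as a tie-breaker between conflicting records. The field names and the linear staleness discount below are assumptions for illustration.

```python
# Sketch of weight-plus-staleness superseding: when two stored preferences
# conflict, keep the one whose weight, discounted by age, is higher.
# Record fields and the 1/(1 + age_days) discount are invented here.

def pick_current(a, b, now):
    """Each record: {'value', 'weight', 'updated_at'}. Returns the winner."""
    def effective(rec):
        age_days = (now - rec["updated_at"]) / 86400
        return rec["weight"] / (1 + age_days)  # stale entries lose weight
    return max((a, b), key=effective)

now = 1_000_000.0
old = {"value": "verbose answers", "weight": 0.9, "updated_at": now - 60 * 86400}
new = {"value": "short answers", "weight": 0.6, "updated_at": now - 86400}
winner = pick_current(old, new, now)
```

Here the day-old preference beats the two-month-old one despite a lower raw weight — the "supersede based on staleness" behavior in miniature.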
BrightOpposite@reddit (OP)
That’s interesting — so it’s more like weighted retrieval evolving over time. Feels like we’re still approximating memory though, not really having it natively. Curious if this breaks once interactions get more implicit vs explicit.
l0nedigit@reddit
Natively would be nice! Not there yet lol
BrightOpposite@reddit (OP)
Yeah exactly — that’s the gap. We’ve been working on this as a more native memory layer (BaseGrid), trying to make it persist across interactions without all the weighting / retrieval hacks. Still early, but feels like the direction things need to go.
l0nedigit@reddit
I don't disagree that's the way things need to go. It'd be amazing. Until that day comes, temporal memory using a graph DB has been my go-to. There is a lot out there. A ton really. I'm a bit partial to graph databases, so falkor is what I stuck with. Tried memgraph, graphiti, vanilla neo4j, and a few other postgresql types. Meh.
BrightOpposite@reddit (OP)
Makes sense — graph DBs seem like the closest fit right now for modeling relationships over time. But yeah, still feels like we’re forcing “memory” into storage abstractions rather than treating it as its own system. Curious what breaks first for you — scale, retrieval quality, or just complexity over time?
l0nedigit@reddit
What model are you using btw? And backend
l0nedigit@reddit
I'll post back here if it ever happens! Hope it doesn't, 'cause it took a long time to get to this point haha