Anyone else building persistent memory for local LLM agents? Here's my approach
Posted by New_Election2109@reddit | LocalLLaMA | 17 comments
Been hitting the same wall for a while: every new session with an LLM agent starts from zero. You explain your stack, your constraints, your decisions — then open a new chat and do it all again.
Been working on an approach to this — a local daemon called Mnemostroma that sits between you and your agents and builds memory silently in the background.
**How it works:**
- Watches conversation I/O and extracts what actually matters (decisions, constraints, key facts)
- Compresses into structured multi-layer memory — not raw logs
- Surfaces it back via MCP tools when relevant (~20 ms retrieval)
- Forgets low-value noise gradually, keeps important decisions long-term
- Fully offline — SQLite + ONNX INT8, no cloud, no Docker, no torch
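The split described above can be sketched in a few lines. This is a hypothetical illustration of the pattern, not the actual Mnemostroma code: an observer that scans conversation turns for decision-like statements and persists them to SQLite, plus a read-only recall path for the agent. The table schema, the regex heuristic, and the function names are all made up.

```python
import re
import sqlite3

DB = sqlite3.connect(":memory:")
DB.execute("""CREATE TABLE IF NOT EXISTS memory (
    id INTEGER PRIMARY KEY,
    kind TEXT,          -- 'decision', 'constraint', 'fact'
    content TEXT,
    importance REAL DEFAULT 0.5
)""")

# Crude stand-in heuristic: lines that announce a decision or constraint.
DECISION_RE = re.compile(r"\b(we (?:decided|chose|will use)|must not|always use)\b", re.I)

def observe(turn: str) -> int:
    """Observer write path: scan a conversation turn, persist what matters.
    The agent never calls this."""
    stored = 0
    for line in turn.splitlines():
        if DECISION_RE.search(line):
            DB.execute("INSERT INTO memory (kind, content) VALUES (?, ?)",
                       ("decision", line.strip()))
            stored += 1
    DB.commit()
    return stored

def recall(query: str, k: int = 3) -> list[str]:
    """Agent-facing read path: retrieval only, no write access."""
    rows = DB.execute(
        "SELECT content FROM memory WHERE content LIKE ? "
        "ORDER BY importance DESC LIMIT ?", (f"%{query}%", k))
    return [r[0] for r in rows]
```

The point of the shape: `observe` and `recall` never share a caller, so the agent can read its history but can never put words into it.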
**The design choice I keep questioning:**
The agent only *reads* memory — it never writes it. A separate Observer pipeline does all the watching and storing in the background. Feels cleaner and harder to corrupt, but curious if others would want the agent to annotate its own memory directly.
**Current state:** v1.8.1 beta, 400+ tests passing, ~420 MB RAM baseline. Not on PyPI yet.
Works with Claude Desktop, Claude Code, Cursor, Windsurf, Zed — anything that speaks MCP.
Code and install instructions in the repo if anyone wants to poke at it:
https://github.com/GG-QandV/mnemostroma
Curious how others are handling this — stuffing everything into system prompt, RAG over transcripts, something else entirely?
jason_at_funly@reddit
Really cool approach with the local sidecar. I'm also a big fan of the observer-only write path—it's the only way to keep the memory from turning into a hallucination loop. We've been using Memstate AI (memstate.ai) for a similar purpose. Its versioning was the game changer for us, especially for tracking how decisions evolve across sessions. It's an MCP server too, so it fits right into the same workflow. Definitely worth a look if you ever need a hosted version of this pattern with deep audit trails.
New_Election2109@reddit (OP)
Thanks for sharing! Memstate looks interesting for versioning and decision tracking - definitely a solid use case for auditing.
My focus with Mnemostroma is slightly different: I’m doubling down on the privacy-first, offline infrastructure. In Enterprise R&D, sending memory logs to yet another cloud provider is often a deal-breaker.
I’m building Mnemostroma to run entirely on the local sidecar - SQLite WAL + local embeddings, so the 'observer' doesn't just prevent feedback loops, it ensures zero data leakage. Versioning is cool, but for me, solving the 'token bloat' and 'context drift' without leaving the local machine is the primary mission.
Good to see more people thinking about the observer-write path though!
Twins94123@reddit
The observer-only write path is the right call, and I'd argue it's not just cleaner architecturally. It's a hedge against the agent's own failure modes. If the agent writes its own memory, you're asking the same system that hallucinates to curate what's worth remembering, which compounds over long sessions. Zhang et al.'s ACE paper (arXiv 2510.04618) calls a related failure "context collapse": iterative self-summarization erodes the specificity that made the context useful in the first place. Separating the Observer from the Generator keeps the curation layer honest.
One thing I'd think about though: observer-only works great for decisions and constraints (durable facts), but it can miss the agent's reasoning, the why-it-chose-X-over-Y at a decision point. That's often the most valuable thing to replay in a new session, and it lives in the agent's head, not the I/O stream. Worth a hybrid where the agent can emit structured "rationale" events the Observer picks up, without giving it write access to the memory store itself.
What's the retrieval strategy, pure vector or are you doing anything structured for the "decisions" layer specifically?
New_Election2109@reddit (OP)
Exactly—asking the generator to curate its own memory is a guaranteed path to context collapse. The isolation is non-negotiable.
Spot on about capturing the 'why-it-chose-X'. I actually just pushed an update (v1.8.1) based on this exact line of thinking: Pure Context Mode. Currently, the Observer passively parses the agent's Chain-of-Thought (like Claude's blocks) via continuation detection. But to formalize the reasoning, the agent can now optionally emit a fire-and-forget [RATIONALE why=X reason=Y] tag in its standard output. The Observer catches it via regex/HybridNER and commits it. The agent keeps full autonomy, but still has zero DB write permissions.
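The fire-and-forget tag parse described above might look something like this. Hedge: this is a guess at the mechanics, not the actual Mnemostroma extractor, and the exact tag grammar may differ. It assumes `[RATIONALE why=X reason=Y]` with either quoted or bare values.

```python
import re

RATIONALE_RE = re.compile(
    r'\[RATIONALE\s+why=(?P<why>"[^"]*"|\S+)\s+reason=(?P<reason>"[^"]*"|[^\]]+)\]'
)

def extract_rationales(agent_output: str) -> list[dict]:
    """Observer-side extraction: the agent only emits text in its normal
    output stream; the observer parses and commits. The agent keeps zero
    DB write permissions."""
    events = []
    for m in RATIONALE_RE.finditer(agent_output):
        events.append({
            "why": m.group("why").strip('"'),
            "reason": m.group("reason").strip().strip('"'),
        })
    return events
```

Because the tag rides in ordinary output, the agent can annotate without ever holding a handle to the store.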
For the retrieval strategy, it's a ~20 ms hybrid pipeline:
- Semantic: numpy matmul ANN using e5-small (INT8) for fuzzy conceptual queries (e.g., "Why did we choose SQLite?").
- Structured Anchors: distilbert-ner extracts verbatim entities/deadlines into an O(1) dictionary backed by SQLite WAL. This is for absolute precision (e.g., "What is the absolute path to the prod DB?").
- Reranking & Injection: A lazy TinyBERT rerank before surfacing the payload via MCP as pure XML. It's zero-guidance—no forced "use tools" prompts, just raw context injection.
Appreciate the pointer to the Zhang paper, it perfectly nails the failure mode I was trying to avoid. I just set integration.pure_context = true as the default in the repo. If you end up poking at it, I'd love your feedback.
Aggressive-Sweet828@reddit
Letting agents write directly to memory is where feedback loops start. The agent reads its own conclusions back on the next turn and reinforces them even when wrong. Keeping writes in the observer pipeline gives you a referee. The cost is freshness. The observer has to run often enough that memory isn't stale. I'd keep observer-write by default and expose a narrow "flag this for review" action to the agent.
New_Election2109@reddit (OP)
Exactly the feedback loop problem — that's the core reason I went with observer-write. The agent reinforcing its own (possibly wrong) conclusions is a real failure mode I wanted to avoid by design.
The "flag for review" idea is interesting. Right now there's no agent-facing write path at all, but a narrow annotation action could work — something like ctx_pin(id) that elevates importance without letting the agent rewrite the content. Worth thinking about.
On staleness: the Observer runs on every message turn, so latency is usually under a second behind the conversation. Not perfect but close enough for most workflows.
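A minimal sketch of what that constrained action could look like, assuming a SQLite-backed store (this is the idea floated above, not implemented Mnemostroma behavior; schema and names are hypothetical):

```python
import sqlite3
import time

db = sqlite3.connect(":memory:")
db.execute("""CREATE TABLE memory (
    id INTEGER PRIMARY KEY,
    content TEXT,
    importance REAL DEFAULT 0.5,
    pinned_at REAL
)""")
db.execute("INSERT INTO memory (content) VALUES ('use SQLite WAL')")
db.commit()

def ctx_pin(memory_id: int) -> bool:
    """Agent-callable: elevate importance of an existing memory.
    The content column is never touched, so the agent cannot inject
    or rewrite facts, only boost what the Observer already stored."""
    cur = db.execute(
        "UPDATE memory SET importance = MIN(importance + 0.3, 1.0), "
        "pinned_at = ? WHERE id = ?", (time.time(), memory_id))
    db.commit()
    return cur.rowcount == 1   # False if the id doesn't exist
```

The UPDATE-only statement is the whole guard: there's no code path by which the agent's input reaches a content field.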
Aggressive-Sweet828@reddit
Two things worth thinking through on ctx_pin before you wire it in. Scope: does a pin persist across sessions, or only within one? If persistent, you're back to agent-writes-memory territory, just with a narrower interface. Second: pinned items accumulate. After three months in production you'll have hundreds. Some form of review or decay on pinned items becomes its own design problem.
New_Election2109@reddit (OP)
Both valid points.
On persistence: yes, ctx_pin would survive sessions - that's the intent. The narrow interface is the guard: the agent can elevate importance but can't rewrite content or inject new facts. Still a write path, just a constrained one. Whether that constraint is enough to avoid the feedback loop problem is genuinely unclear to me.
On accumulation: this is the harder problem. Right now I'm handling it through the decay layer - pinned items still decay if they go untouched long enough, just slower. But you're right that "slow decay" and "never reviewed" converges to the same mess over months.
The honest answer is I don't have a clean solution for long-running pin management yet. A periodic consolidation pass (something like a Dreamer cycle that reassesses pinned items against recent context) is on the roadmap but not implemented. Do you have a pattern that's worked for you?
Aggressive-Sweet828@reddit
Honest answer: I haven't run this at multi-month scale in production. Two patterns from smaller setups that might generalize. TTL with renewal: pins expire after a window but renew each time they're retrieved. Usage becomes the signal, not time. Access-pattern archival: track which pins get read in real context and move unretrieved ones to a cold tier that's still searchable but not surfaced. Both let usage decide instead of trying to answer "is this still relevant."
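The TTL-with-renewal pattern above is simple enough to sketch directly. Everything here (the class, the 30-day window) is illustrative, not anyone's production code:

```python
TTL_SECONDS = 30 * 24 * 3600  # 30-day window, renewed on each read

class Pin:
    def __init__(self, content: str, now: float):
        self.content = content
        self.expires_at = now + TTL_SECONDS

    def retrieve(self, now: float) -> str:
        """Reading a pin renews its TTL: usage is the liveness signal."""
        self.expires_at = now + TTL_SECONDS
        return self.content

    def is_expired(self, now: float) -> bool:
        return now > self.expires_at
```

A pin retrieved on day 20 survives until day 50; one never retrieved expires on schedule, with no "is this still relevant?" judgment call needed.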
New_Election2109@reddit (OP)
Both patterns map cleanly onto what I have, with one difference in where the signal comes from.
TTL with renewal - yes, this is essentially what I'm doing. Each retrieval bumps the decay timer. The nuance: I'm using a weighted score (recency + retrieval frequency + explicit importance), not a flat TTL window. Pins don't expire on a clock, they decay on a curve that flattens when they're actively used. Same idea, different shape.
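The weighted curve could look something like this. The weights, half-life, and saturation constant below are invented for illustration, not the values Mnemostroma actually uses:

```python
import math

def decay_score(last_access_age_s: float,
                retrieval_count: int,
                importance: float,
                half_life_s: float = 7 * 24 * 3600) -> float:
    """Weighted score: exponential recency decay, flattened by
    retrieval frequency and explicit importance."""
    recency = math.exp(-last_access_age_s * math.log(2) / half_life_s)
    frequency = 1 - math.exp(-0.5 * retrieval_count)   # saturates near 1
    return 0.5 * recency + 0.3 * frequency + 0.2 * importance
```

The frequency term is what makes the curve flatten rather than expire on a clock: a pin retrieved often keeps a floor under its score even as the recency term decays.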
Access-pattern archival - not implemented yet, but this is exactly the direction I was circling around with the "Dreamer cycle" idea. Cold tier that's still searchable but not surfaced in hot context - that's the right framing. The missing piece for me is the demotion trigger: do you demote after N missed retrievals, after a time window, or when RAM pressure forces it? In my setup RAM pressure is already a consolidation trigger for non-pinned items, extending that logic to cold-tier demotion seems like the natural path.
The part neither pattern fully solves: semantic staleness. A pin can be retrieved regularly but the fact it encodes becomes outdated. Usage frequency doesn't catch that. That's probably where the consolidation pass earns its keep - not just reviewing unretrieved pins, but comparing active pins against recent context for contradiction.
Useful framing either way. The cold tier idea is going on the implementation list.
orzechod@reddit
following the quickstart: after running `pipx install`, `mnemostroma setup` resulted in two python `IndentationError` exceptions due to empty `if:` blocks, which I had to patch by adding a `pass` to each. No idea if that's the right thing to do. After that, setup seemed to run successfully.
`mnemostroma on` told me I had a stale daemon instance, and `mnemostroma status` told me "Daemon: stopped". I gave up at this point. python 3.13.11 on Ubuntu 25.10. I will freely admit that I'm an idiot but I'm not sure what I did wrong in this case.
New_Election2109@reddit (OP)
Update: just pushed a fix for both IndentationError issues (empty `if:` blocks — Python 3.13 compatibility). You were not doing anything wrong, it was a real bug.
To get the fixed version:
```
pipx upgrade mnemostroma
mnemostroma off
rm -f ~/.mnemostroma/daemon.pid
mnemostroma on
mnemostroma status
```
Should come up clean now. If it doesn't — paste the output here and I'll dig in. Thanks again for taking the time to report this.
New_Election2109@reddit (OP)
I'm checking right now
New_Election2109@reddit (OP)
Thank you for trying it and reporting this — this is exactly the kind of feedback I need.
The IndentationError on empty `if:` blocks is a real bug, not user error. Python 3.13 is stricter about some edge cases and I haven't tested against 3.13.11 + Ubuntu 25.10 specifically. Your `pass` fix was exactly right.
The stale daemon / stopped status after that is likely a cascade from the incomplete setup — the PID file gets written but the process never fully starts.
Can you try:
```
mnemostroma off
rm -f ~/.mnemostroma/daemon.pid
mnemostroma on
```
And paste the output of:
```
journalctl --user -u mnemostroma-daemon -n 30
```
Or if not using systemd:
```
tail -n 30 ~/.mnemostroma/daemon.log
```
That will tell me exactly where it's dying. I'll push a fix for the IndentationError today — that one is clearly on me.
qptim@reddit
We built a vault to essentially store all trusted knowledge so our AI can always refer back to it whenever it needs it.
It’s a piece of a much bigger project, but we did open source that part.
New_Election2109@reddit (OP)
Interesting — what does "trusted knowledge" look like in practice for you? Manually curated, or does something extract it automatically?