What features dramatically improved your custom memory system?

Posted by dangerous_inference@reddit | LocalLLaMA | View on Reddit | 5 comments

Let's talk memory systems.

I'm going to provide an overview of my memory system in the comments, but I really want to know what you did that really changed things for the better with your system.

After I first implemented what I'm calling "transient auto-memory", the assistant suddenly got so much more coherent and began speaking with total knowledge of all the testing we did over a few months. It was a surreal moment. Really changed things.

I want to pick your brains and find out if you did anything cool I should try.

[-]

IngenuityNo1411@reddit

Not some success so far, but I'd bring up a question exactly from my old post: What would you do to prevent model adding nosenses into memory?

[-]

dangerous_inference@reddit (OP)

Well, the whole thing is one massive effort to reduce noise and increase signal. My memory processing is four main stages of inference, and each one is a 2000+ word system prompt that is like a dense legal document. The goal of all of it is to extract useful facts and dismiss garbage.

I can get into the details, but that means going through most of the mechanism. At a high level the first stage is Extract, which picks certain kinds of facts and ignores commands, greetings, other useless garbage. The next stage Refine transforms statement tense and filters more.

The next stages Consolidate and Account bundle the extracted facts and create a generalized overarching statement about them. This statement is what is surfaced to memory. This part really compresses quite a bit and makes the memories very semantically dense without generating false information.

[-]

tonyboi76@reddit

the biggest jump for me was not a feature, it was writing memories with WHY context. user likes python is brittle, it gets misapplied to every python decision later. user prefers python because rapid prototyping after burning weeks on java JVM startup is robust, claude can reason about whether the why still applies to the current context (microservice in 2026 where startup matters less, JVM might actually be fine again). the related feature: every memory entry has a timestamp and a source quote of when it was learned, so when memory says X i can audit whether X is still true.

[-]

dangerous_inference@reddit (OP)

Yeah, more and denser context is the name of the game. The more relevant data you can impart to the model, the better the results. The trick is not inundating it with noise at the same time.

[-]

dangerous_inference@reddit (OP)

I made a memory system I'm calling Bubblemem. Grossly simplified, it is:

Suitable for day to day chatting and assistant stuff, not hundreds of projects.
Entirely markdown-based.
Facts are extracted from history and bundled into relevant groups (bubbles) in four main rounds of inference.
These groups are updated, split, merged with new information.
A dense generalized statement is produced for each fact group. This is what is recalled by the assistant via vector lookup.

I set out to make something that works much better than the two prior systems I tried, and I have to say I succeeded. Although I don't know how it compares to other systems. The only other two systems I tried dumped massive text into context and called it a day. My system is not on Github because there is zero demand for more unproven memory systems.

The fundamental premise of my other assistant project is that context should be a "snapshot" containing only the information the assistant needs for that turn and as little else as possible. Extremely thin context yields vastly better performance on local hardware, making it possible to have a much bigger model respond in real time with excellent coherence.

So Bubblemem really started shining when I implemented what I'm calling transient memory auto-surfacing. It does the memory lookup on each user turn, and puts the top_k results into the system prompt for one turn only. All memories are removed and replaced with new memories on the next turn. The beauty of this is if the same memory entry is still relevant on the next turn, it will reappear. Only relevant memories are ever in context.

I guess another cool feature is grading memory by durability and depth. Durability is the natural timescale of the memory--how long does it impact my life or matter? Depth is how much does it matter relative to normative values. So "I love ice cream" is much lower in depth than "I love my child", but actually could be pretty close for durability. This enables very effective decay and archiving of short-lived information, while enabling the assistant to remember "I had to buy cauliflower florets instead of riced cauliflower" for a few days.

These concepts are so much harder to implement in practice, though. The depth and durability ratings are I think like a 2k word prompt. Each round of inference in my system is 1.7-2.5k words. I started development with Qwen3.5 27B, but moved to the vastly better Gemma 4 31B when that came out. I have come to realize that there are is no such thing as a "simple, elegant, short" prompt that does 90% of the task correctly. I have found that dense, pedantically phrased, lengthy prompts, continuously audited for logical contradictions, are the way to go. It can feel like a never-ending game of whack-a-mole fixing every issue, but eventually it does start doing what you want. It's just hard to get there.