Agent Memory
Posted by AutomataManifold@reddit | LocalLLaMA
I was researching what options are out there for handling memory in agent-based systems, and I figured someone else might benefit from seeing the list.
A lot of agent systems assume GPT access and aren't set up to use local models at all, even if they would theoretically outperform GPT-3. You can often hack in a call to a local server via an API, but it's a bit of a pain and there's no guarantee that the prompts will even work on a different model.
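For what it's worth, the "hack in a call" part is usually just pointing an OpenAI-compatible client at a local server. A minimal sketch, assuming an Ollama instance on its default port (the URL and model name are placeholders for whatever you actually run):

```python
# Minimal sketch: reuse the standard OpenAI client against a local
# OpenAI-compatible server (Ollama, vLLM, llama.cpp server, etc.).
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:11434/v1",  # Ollama's OpenAI-compatible endpoint
    api_key="not-needed",                  # local servers generally ignore the key
)

response = client.chat.completions.create(
    model="llama3.1",  # whatever model your server has loaded
    messages=[{"role": "user", "content": "What do you remember about me?"}],
)
print(response.choices[0].message.content)
```

The fragile part is still the prompts themselves: nothing guarantees a prompt tuned against GPT-4 behaves the same on a small local model.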
Memory specific projects on GitHub:
Letta - "Letta is an open source framework for building stateful LLM applications." - seems to be designed to run as a server. Based around the ideas in the MemGPT paper, which involves using an LLM to self-edit memory via tool calling. You can call the server from Python with the SDK. There's documentation for connecting to vLLM and Ollama. They recommend using Q6 or Q8 models.
Memoripy - new kid on the block, supports Ollama and OpenAI with other support coming. Tries to model memory in a way that keeps more important memories more available than less important ones.
Mem0 - "an intelligent memory layer" - has gpt-4o as the default but can use LiteLLM to talk to open models.
cognee - "Cognee implements scalable, modular ECL (Extract, Cognify, Load) pipelines" - A little more oriented around ingesting documents versus just remembering chats. The idea seems to be that it helps you structure data for the LLM. Can talk to any OpenAI-compatible endpoint as a custom provider, with a simple way to specify the host endpoint URL (so many things hardcode the URL!), plus an Ollama-specific setting. The minimum recommended open model is Mixtral-8x7B.
Motorhead (DEPRECATED) - no longer maintained - server to handle chat application memory
Haystack Basic Agent Memory Tool - agent memory for Haystack agents, with both short and long-term memory.
memary - A bit more agent-focused, automatically generates memories from agent interactions. Assumes local models via Ollama.
kernel-memory - a Microsoft experimental research project that has memory as a plugin for other services.
Zep - maintains a temporal knowledge graph of user information to track how facts change over time. Supports any OpenAI-compatible API, with LiteLLM explicitly mentioned as a possible proxy. Has a Community edition and a hosted Cloud version; the Cloud version supports importing non-chat data.
MemoryScope - Memory database for chatbots. Can use Qwen. Includes memory consolidation and reflection, not just retrieval.
Just write your own:
LangGraph Memory Service - an example template that shows how to implement memory for LangGraph agents.
txtai - while txtai doesn't have an official example of implementing chatbot memory, they have plenty of RAG examples like this one and this one and this one that make me think it would be a viable option.
Langroid has vector storage and source citation.
Other things:
WilmerAI has assistants with memory.
EMENT: Enhancing Long-Term Episodic Memory in Large Language Models - research project, combining embeddings and entity extraction.
Agent frameworks
Did I miss anything? Anyone had success using these with open models?
NEEDMOREVRAM@reddit
OP, forgive my ignorance, but how are these memory programs any different from... say, Kobold cpp, when it injects custom instructions behind the scenes with every prompt you send the LLM?
teachersecret@reddit
Most of them are doing exactly that, they’re just creative ways to store and use massive datasets when you’re context limited.
The “difference” is mostly in how they’re pulling information to add to context.
There are simple ways like keyword activation. This is basically lorebooks in kobold or novelAI. You write lore in a little keyword activated box, and it gets injected into context if the keyword shows up. Make lore for a king, and when “king” shows up in context, the lore will be activated.
This can be made more complex, for example by running a second pass to check for potentially relevant lore that isn’t being activated by the keyword. Most RAG that uses vector storage does this. It vectorizes content, asks for stored “memories” that are proximate to the current context, receives some data, ranks it based on usefulness, then injects the relevant context. No keyword activation required.
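A minimal sketch of both retrieval styles, assuming a toy lorebook and pre-computed embedding vectors (a real system would use a proper embedding model and vector store):

```python
import numpy as np

# Toy lorebook: keyword -> lore entry, injected when the keyword appears.
LOREBOOK = {
    "king": "King Aldric rules the northern realm and distrusts magic.",
    "widget": "ACME Green Widgets are banned in three star systems.",
}

def keyword_lore(context: str) -> list[str]:
    """Lorebook style: activate lore whose keyword shows up in the context."""
    return [lore for key, lore in LOREBOOK.items() if key in context.lower()]

def vector_lore(context_vec, memory_vecs, memories, top_k=3):
    """RAG-style second pass: rank stored memories by cosine similarity."""
    sims = memory_vecs @ context_vec / (
        np.linalg.norm(memory_vecs, axis=1) * np.linalg.norm(context_vec)
    )
    return [memories[i] for i in np.argsort(-sims)[:top_k]]

# Either way, the winners get injected into the prompt before generation.
```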
The difference is mostly in how it handles the memory - in automated ways, or mostly manual. Memory systems can get quite complex and can even end up using more tokens than your actual conversation with the AI.
Make sense?
NEEDMOREVRAM@reddit
Wow...ok thanks!
So why is it so hard for LLMs to follow these grammar instructions: "Do not write a complex sentence or a sentence that contains a dependent clause. At the same time, avoid writing short and choppy sentences that do not transition well from one to the other."
Even with me injecting that into the prompt...what does the LLM do? Exactly the opposite of what I told it.
Is there anything you can suggest to solve this problem? "Prompt engineering" just does not work.
teachersecret@reddit
Let's start by looking at your instructions:
"Do not write a complex sentence or a sentence that contains a dependent clause. At the same time, avoid writing short and choppy sentences that do not transition well from one to the other."
Now, WITHOUT trying to think all that hard... could you, an intelligent human, complete that task right now, really quickly?
I've read that instruction SEVERAL times and I'm still trying to wrap my head around it. I'm not saying I can't follow it, I'm saying it's difficult... and I'm fully sentient. "Do not write..." ok, I shouldn't write something... "a complex sentence or a sentence that contains a dependent clause..." wait, don't write a complex sentence... OR a sentence that contains a dependent clause... but... what's a dependent clause? I guess I sorta understand? "At the same time, avoid writing..." more avoidance of writing... "short and choppy sentences that do not transition well from one to the other..." wait, didn't you say we couldn't have dependent clauses? I guess that means transitions are going to be weird... and...
Do you see the problem here? I have no idea what you really want.
Try breaking it down into a simpler task. Ask for ONE thing at a time. Don't over-complicate things. These AIs seem extremely intelligent when they spit out unbelievably complex answers to complex questions... but under the hood, they can't reason for shit. If a reasonably smart 14 year old couldn't perform the task, simplify your instructions until they can, then stack positive results until you have your final product.
NEEDMOREVRAM@reddit
Ok so that prompt is the result of my frustrating experiences with ALL LLMs, be they paid or open source.
I originally started off by saying "Do not write complex sentences or sentences with dependent clauses."
The LLM would turn around and immediately write something that's short, choppy, and has little to no transitions between sentences. Basically 4 disconnected statements strung together.
As a human, that's simple! I can write a 4 sentence paragraph that is easy to read and does not contain a dependent clause or complex sentence. The issue with both of those is that they make the sentence hard to read. And that is a mortal sin in my line of work.
Are you a copywriter as well or a book author?
And what is your favorite model? If you're as big of a power user of AI as I am--then you know that AI is only a tool and not a replacement for our human brains.
I'm really digging Nemotron 70B right now.
And to answer your original question--here's a (piss-poor) attempt off the top of my head:
We designed our ACME Green Widgets to solve your toughest flux capacitor challenges. They streamline your workflow, reduce waste, and ensure total compliance with Starship Regulations. Enjoy remarkable performance and durability that outpaces standard widget solutions. Our widgets deliver exceptional value for professionals and hobbyists alike. Get the most advanced green technology in a compact and reliable package. (seller assumes no liability if you accidentally create a black hole or are time-warped into another dimension. Restrictions may apply).
teachersecret@reddit
Again, simplify your request. Or, layer the request. Build the structure of the response, then have it fill the structure in. Don't tell it not to do something. Give it 3 examples of how it's done well, then ask it to do it a fourth time with the new text in context. Treat it like you're trying to get exemplar work out of a 14 year old.
You can also edit what has been written. Add another layer where you edit each response to exactly what you want three times, then ask it to do the fourth, providing it the text from the previous step.
Try to avoid telling the AI not to do something. It's cancerous to the AI :).
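A minimal sketch of the "three examples, then ask for the fourth" pattern (the exemplar rewrites here are placeholders; you'd supply ones in your own voice):

```python
# Few-shot prompt builder: show three good rewrites, then ask for a fourth.
EXAMPLES = [
    ("The report, which was late, annoyed the boss.",
     "The report arrived late. The boss was annoyed, and everyone knew it."),
    ("Although sales rose, profits fell because costs ballooned.",
     "Sales rose this quarter. Even so, profits fell. Costs simply grew too fast."),
    ("The widget, despite its flaws, sold well.",
     "The widget has flaws. It still sold well, and customers keep coming back."),
]

def build_prompt(new_text: str) -> str:
    shots = "\n\n".join(f"Original: {a}\nRewrite: {b}" for a, b in EXAMPLES)
    return f"{shots}\n\nOriginal: {new_text}\nRewrite:"
```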
Favorite model currently? Sonnet 3.5, and it's not close. Locally? I'm on a single 4090, so I workhorse 9b gemma finetunes geared for fiction writing, and larger 32-35b models (qwen 2.5 and cohere command r) for logic use and automation. 70b models are okay, but I find them a bit grating: I prefer a super fast model generating creative batches, making creative mistakes I can quickly swap through, over a slower model confidently presenting something plausible that I end up not liking. I write in small sentence-to-paragraph chunks on a scene and chapter basis, so I'm never beyond an 8192 token limit, which makes gemma-based models work fairly well for the purpose.
silenceimpaired@reddit
Are the gemma finetunes your own, or variations you found on huggingface? If on huggingface, I'd love to take a look at them. If they're your own finetunes (it sounds like they are), I'd love to see the workflow! How did you go from vanilla gemma to gemma writing like you want? Thanks for your time!
teachersecret@reddit
Go take a peek at some of the gemma based 9b models that have Gutenberg inside them if you want a good understanding of how to pull this off. Specifically, go look at the gutenberg DPO dataset they use. Like this one: https://huggingface.co/datasets/jondurbin/gutenberg-dpo-v0.1
If you look at how they built the dataset, it has three main parts:
1: A question. For example, "Write chapter 1 of a novel in which bla bla bla". They wrote this question by pasting a chapter of a book into an AI and asking it to write a prompt that could plausibly have produced that chapter.
2: A chapter written by an AI model in response to that question. REJECTED (DPO preference)
3: The actual chapter of the actual book. ACCEPTED (DPO preference)
So basically, this is a dataset that shows a model how a human would respond to a prompt vs. how an AI would respond, built from existing human-written novels.
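In other words, each record pairs one prompt with a human-written "chosen" response and an AI-written "rejected" one. Something like this, assuming the usual DPO column names (check the dataset card for the exact schema):

```python
# Illustrative DPO record; the content is made up, the structure is the point.
record = {
    "prompt": "Write chapter 1 of a novel in which a disgraced cartographer "
              "returns to her coastal hometown...",
    "chosen": "(the actual chapter from the human-written book)",
    "rejected": "(the AI model's attempt at the same chapter)",
}
```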
Now take a second and look at the other datasets in use on your favorite models (most of them list the datasets inside them on their huggingface). You will quickly realize how stupid many of those datasets are for long form fiction writing. Hell, even Gutenberg itself isn't a great dataset - that's ancient outdated books from the turn of the 1900s. It's pretty easy to look at that and say "it would be improved if I created a similar dataset using modern content in my writing area, especially if it was my own work". I've got lots and lots of novels I personally wrote and own the rights to, and hundreds more that I purchased rights to with my publishing company over the last decade. That gives me a nice sized dataset I own.
9b Gemma models are small enough to do tuning at home on relatively cheap hardware, quickly. Unsloth lets you do quick test-tunes on consumer cards. You can get more than 8192 context on gemma 9b with a 24gb card: https://unsloth.ai/blog/gemma2
Remember, that's enough... because gemma 9b is only really supposed to work up to 8192 context anyway, and we're working in smaller chunks (think SCENES, not CHAPTERS, when you're building the dataset - think about what AI is good at and what it falls down at and build the dataset to focus on what it CAN do instead of trying to get it to solve riddles about bananas).
This means you can iterate pretty fast with a dataset and see how the results look on a tuned model. Once you have something you like, you can rent a little cluster of GPUs and go nuts for a few bucks' worth of time if you want to go even further and do some deeper tuning, but honestly, I bet you get everywhere you need pretty easily. Hell, you could tune using a free google colab: https://colab.research.google.com/drive/1vIrqH5uYDQwsJ4-OO3DErvuv4pBgVwk4?usp=sharing
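For the curious, here's a rough skeleton of what such a home DPO tune looks like with Unsloth plus TRL. The model name and hyperparameters are illustrative, and TRL's argument names shift between versions (older releases use tokenizer= instead of processing_class=), so treat this as a sketch and defer to the Unsloth notebooks linked above:

```python
from unsloth import FastLanguageModel
from datasets import load_dataset
from trl import DPOTrainer, DPOConfig

# Load a 4-bit Gemma 2 9B base; fits on a single 24GB card.
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="unsloth/gemma-2-9b-it-bnb-4bit",
    max_seq_length=8192,
    load_in_4bit=True,
)

# Attach LoRA adapters so only a small set of weights gets trained.
model = FastLanguageModel.get_peft_model(
    model,
    r=16,
    lora_alpha=16,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
)

# The prompt/chosen/rejected dataset discussed above.
dataset = load_dataset("jondurbin/gutenberg-dpo-v0.1", split="train")

trainer = DPOTrainer(
    model=model,
    args=DPOConfig(
        output_dir="gemma-fiction-dpo",
        beta=0.1,  # how strongly to prefer "chosen" over "rejected"
        per_device_train_batch_size=1,
        gradient_accumulation_steps=8,
    ),
    train_dataset=dataset,
    processing_class=tokenizer,  # tokenizer= on older TRL versions
)
trainer.train()
```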
silenceimpaired@reddit
Thanks for the long, thoughtful response. I have mostly used the models to help me clean up poor grammar, spelling, or wording. But I have ideas on how I might be able to use one as a consistency editor and/or brainstorming agent to show me where a story could go. The last part definitely could benefit from trying what you did above. I have two 3090s, so theoretically I can do a slightly larger model if I could find a forked version of Unsloth that supported multiple cards. I thought I saw one at one point but lost track of it.
AutomataManifold@reddit (OP)
I see so many prompts begging the AI to not think about the pink elephant.
silenceimpaired@reddit
Look into CFG (classifier-free guidance) for the negative commands. You phrase the stuff you don't want it to do positively and put it in the negative prompt: "Write complex sentences"... and the model will try not to do that.
silenceimpaired@reddit
I've read LLMs don't handle negatives well. It's always better to state something positively: "Write simple sentences." It can be a good idea to give it the same instruction in different ways. In other words, don't just say "write simple sentences." Say, "rewrite the following text in simple sentences that a third grader, around 8-9 years old, will be able to read." One article said "Avoid" was a better alternative to "Don't", but you'll just have to experiment.
AutomataManifold@reddit (OP)
Formatting, in particular, is usually much easier with examples rather than descriptions. Or directions + examples. You've got to show it what you mean in a relatively clear way, or write a little tutorial that explains what you're looking for.
I'm guessing "Write simple sentences that transition into each other" plus a few examples of your desired writing style will work better, though it's hard to say without seeing your exact use case.
moarmagic@reddit
I'm not OP, and frankly, I'm more of a dabbler, but it sounds like you are trying to give an LLM multiple complex instructions that slightly contradict: you want simple sentences, but you want to avoid short sentences.
You have to remember that LLMs, at the end of the day, are closer to text prediction than reasoning beings, and have a very limited attention span. Overloading them with complex requests will likely force them to ignore some of the requests, because they don't have a way to correlate all your different demands plus the original ask. If you want a specific style of writing, the best way to get it is to include examples of that style - or better yet, train your own LLM/dataset.
The hack around this is doing something like an agent process: break your request down and feed the LLM's replies back to it, step by step. For example: 1. Generate a draft. 2. Rewrite the draft in simple sentences. 3. Rewrite again to smooth the transitions.
Steps 2 and 3 would probably benefit from a few examples to help show the LLM what acceptable re-writes look like.
But this way, with the LLM only focusing on one task at a time, it's not as likely to get lost trying to follow your original prompt while holding all this stylistic stuff in consideration.
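A minimal sketch of that pipeline against any OpenAI-compatible local endpoint (the URL and model name are placeholders):

```python
from openai import OpenAI

client = OpenAI(base_url="http://localhost:11434/v1", api_key="not-needed")

def ask(prompt: str) -> str:
    reply = client.chat.completions.create(
        model="llama3.1",
        messages=[{"role": "user", "content": prompt}],
    )
    return reply.choices[0].message.content

# One task per pass, instead of one overloaded instruction.
draft = ask("Write a short product paragraph about ACME Green Widgets.")
simple = ask(f"Rewrite this in simple sentences with no dependent clauses:\n\n{draft}")
final = ask(f"Rewrite this so the sentences transition smoothly:\n\n{simple}")
print(final)
```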
reza2kn@reddit
Thank you so so much for this!
I kinda knew what Lore is and does in this context, but never got around to learning the specifics of it.
reza2kn@reddit
Thanks for organizing these all in one place. I've been looking at them too. Feels like chatbots could be SO MUCH more if they had a memory of both your current state and your past daily interactions.
davidmezzetti@reddit
I plan to add a concrete interface for agent memory to txtai (https://github.com/neuml/txtai/issues/815) soon.
What would you like to see with this?
AutomataManifold@reddit (OP)
Honestly, that's part of what I'm trying to figure out right now as I test these out. What features do I actually need versus what the Readme files make sound exciting?
Most of the stuff I've tried so far has been pretty heavyweight to the point that it's been difficult to test just because getting it running is a pain. You'll probably get better feedback from someone a little further in the process of actually using them.
Though I have to say, "spin up our proprietary memory server" is an interesting way to make it easier to run; it clearly makes persistence easier for them to implement and probably improves performance (if it requires a long startup time, making it a microservice makes sense, I guess), but it also makes me feel like the core functionality is in a black box.
For my particular use case, I'm managing the memory as a part of a creative project (i.e., controlling what characters know about the changing gamestate). Think of a town of simulated agents. All the libraries have the basic add/delete database stuff, and some kind of search, so I'm trying to figure out which are easiest to use for my use case...as well as which ones actually work well with small local models.
Part of the problem is that "memory" means slightly different things to different people. The memGPT paper, for example, clearly influenced one corner of the space, but their self-updating approach is only one way to handle it.
If I had to say what's making me hesitant right now:
* A lot of the libraries seem to have picked one true retrieval approach, whereas being able to scale from BM25 to full trees of queries based on the speed/data-quality tradeoff seems, from the outside, desirable. In my use case a dozen parallel LLM calls are better than a series of them, but going off the embeddings is better still.
* Asynchronous data insertion, or at least a way to continually add information without pausing replies for the current conversation, since memories are continually added during a conversation but don't need to be accessed immediately.
* Asynchronous background processing; some memory systems do extra work to process stuff (e.g., generate a possible scenario for when a memory might be relevant, to create an embedding that looks closer to what the input query will be).
* Some way to include memory metadata (what conversation was this from? How long ago was it? Who else is aware of it?); one possible record shape is sketched below.
* A way to include non-conversational background knowledge (could be done with a separate RAG, but having the agent remember where its house is matters in my case). There's some value in faking this as part of a conversation, because it'll presumably eventually be used as part of a conversation, but that's a bit of a hack.
* Being able to retrieve and present the information in a way that the agents actually use to write their next action. I'm cheating here a bit, because that's heavily application-layer dependent, but being able to test and iterate on the effectiveness of the memory retrieval is important.
* Some way of tracking which memories contributed to a particular generation might give me a leg up on figuring out associations that aren't obvious from the text alone, though of course the model doesn't use all of the information in every response, so you can't automatically assume it's all strongly relevant. Still, it does mean some part of the system thought it mattered at the time.
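To make the metadata bullet concrete, here's a hypothetical record shape (not any particular library's schema, just the fields the wishlist above asks for):

```python
from dataclasses import dataclass, field

@dataclass
class Memory:
    text: str                             # the memory content itself
    conversation_id: str | None = None    # what conversation was this from?
    created_at: float = 0.0               # how long ago was it? (timestamp)
    known_by: set[str] = field(default_factory=set)   # who else is aware of it?
    is_background: bool = False           # non-conversational world knowledge
    embedding: list[float] | None = None  # for similarity retrieval
    used_in: list[str] = field(default_factory=list)  # generations it fed into
```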
But this is all my currently inexperienced viewpoint. Ask me after I've had a chance to really put these through their paces and I'll hopefully have a better idea.
If it helps, the libraries I'm currently looking hardest at are zep, mem0, memoripy, and letta, though I'm concerned that they're a bit too heavyweight and batteries-included; I need to run memories for a bunch of simulated people talking to themselves, each other, and the player, which is a bit different from the "run chatbot memory against thousands of individual users" chatGPT use case. In particular, the form the generation takes might not look much like a typical chat conversation.
What I'm trying to design at the moment is the chat transcript data structure and processing: since I'm doing more for each prompt than just feeding in the literal transcript each time, I have to do a bit of translation between the behind-the-scenes assembly, the set of messages that gets sent to the LLM, and the transcript that gets presented to the user. Each of which is a slightly different view on the same data. Kind of a classic MVC situation. That's not exactly part of the memory system, but hopefully gives you some idea of what is talking to the memory system in my particular use case.
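A hypothetical sketch of those three views over one event log (the names and structure are mine, not from any library):

```python
from dataclasses import dataclass

@dataclass
class Event:
    speaker: str
    text: str
    hidden_notes: str = ""  # behind-the-scenes assembly: gamestate, retrieved memories

def to_llm_messages(events: list[Event]) -> list[dict]:
    """Model view: surface the hidden notes as system context, then the dialogue."""
    messages = []
    for e in events:
        if e.hidden_notes:
            messages.append({"role": "system", "content": e.hidden_notes})
        role = "user" if e.speaker == "player" else "assistant"
        messages.append({"role": role, "content": f"{e.speaker}: {e.text}"})
    return messages

def to_transcript(events: list[Event]) -> str:
    """Player view: just the readable dialogue, no scaffolding."""
    return "\n".join(f"{e.speaker}: {e.text}" for e in events)
```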
reza2kn@reddit
You might find this paper interesting:
https://arxiv.org/abs/2411.10109
sarahwooders@reddit
Hi, I'm one of the maintainers of Letta - we are internally working on basically an async memory management thread to improve latency. But you can also implement something like this yourself, since agents in Letta are able to access each other's state (e.g. insert into the archival/core memory of another agent) - so you can disable the memory management in your chat agent, and then have another agent that has access to tools to query and modify the chat agent's memory.
For your use case, I think you should be able to create an agent for each simulated person, and allow them to message each other through tool calling (basically using the letta client to send a message from inside of the tool definition) -- but lmk if I'm misunderstanding.
davidmezzetti@reddit
Thank you for the insightful notes on this. It seems to me that chat history is the most common use case, but there are others. A txtai embeddings instance as memory for an agent can likely take you a long way.
AutomataManifold@reddit (OP)
Yeah, part of what is making me lean towards assembling my own is that I've got decent prompts that generate data in the format I need, or let me summarize a conversation in a way that works for my use case... and frankly, the LLM is the delicate part of the system, so unlike other engineering challenges, it costs a lot more to incorporate someone's opinionated library. If it doesn't work with my LLM, then it doesn't matter how clever it is, and there's not much I can do about that.
Just having a way to go from my working prompts to my working prompts plus a memory is a big win that a lot of libraries don't seem to have considered.
davidmezzetti@reddit
I like to do enough to be helpful, then stay out of the way after that. In my opinion, txtai is just cleaner for implementing most tasks than the alternatives. But how can I not be biased?
Special_System_6627@reddit
Are there any memory based agents that ask the user clarifying questions about something before storing it in the memory?
AutomataManifold@reddit (OP)
Not that I saw, but that might be an interesting way to go, and would probably complement some of the manual memory management techniques I saw people using.
It wouldn't work for my current use case, but if you're using it as a creative writing assistant, being able to directly see the memory and update it is very effective. Directly presenting relevant memories to the user as part of the input and letting them edit them sounds very useful.
I did see that several libraries do self-questioning, either on initial insertion or later on as part of consolidation and correlation building. Asking the user questions about it before storing it takes that to the next level.
218-69@reddit
forgot https://github.com/wangyu-ustc/MemoryLLM
AutomataManifold@reddit (OP)
Oh, good find.
NEEDMOREVRAM@reddit
It's like Christmas morning when I open r/LocalLLaMA at 6am every day to browse the new posts.
lur-2000@reddit
Thank you!
DunklerErpel@reddit
This is gold, thank you very much! I was looking into this over the weekend myself but got frustrated and threw in the towel. Might pick it up again thanks to you, though :)
ThinkExtension2328@reddit
MemGPT is alright, but it requires a model capability level that local models in the past couldn't cross, which ultimately led to the system dying in an unrecoverable way. But when it did work, it was quite magical. I want to have another crack at it now that we have some of these new cutting-edge models.