Any Python library for LLM conversation storage + summarization (not memory/agent systems)?
Posted by sarvesh4396@reddit | Python | 19 comments
What I need:
* store messages in a DB (queryable, structured)
* maintain rolling summaries of conversations
* help assemble context for LLM calls
What I *don’t* need:
* full agent frameworks (Letta, LangChain agents, etc.)
* “memory” systems that extract facts/preferences and do semantic retrieval
I’ve looked at Mem0, but it feels more like a **memory layer (fact extraction + retrieval)** than simple storage + summarization.
The closest thing I found is MemexLLM, but it doesn't look actively maintained, which doesn't give me much confidence.
Is there something that actually does just this cleanly, or is everyone rolling their own?
zerlo_net@reddit
Honestly, most people are rolling their own for exactly this use case. The library ecosystem has kind of bifurcated into "dumb storage" (just save messages, figure out the rest yourself) and "full memory/agent magic" with not much in between.
That said, a few things worth looking at:
SQLModel or SQLAlchemy with a thin wrapper you write yourself is probably the most common approach. You get queryable structured storage, and then you just call your LLM to summarize on whatever cadence you want. It's maybe 200-300 lines of actual logic and you own it completely.
LiteLLM has some basic conversation tracking but it's really oriented around the proxy/cost tracking side of things, not summarization.
If you want something more prebuilt, langchain's message history classes (ConversationBufferMemory, ConversationSummaryBufferMemory etc) are actually usable in isolation without pulling in the whole agent ecosystem. You can just use those classes directly with your own storage backend. The ConversationSummaryBufferMemory one does exactly the rolling summary thing you're describing. It keeps recent messages in full and summarizes older ones. You don't have to use chains or agents, just instantiate the class and use it as a dumb component.
But yeah, if that still feels like too much LangChain baggage, most teams I've seen doing serious production stuff with LLMs just build a ConversationStore class themselves. It's genuinely not that much code once you decide on your schema, and the summarization trigger logic (every N tokens, or on context overflow) is straightforward to implement.
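To make that concrete, here's a rough sketch of the kind of ConversationStore class being described, using stdlib sqlite3 instead of SQLAlchemy so it's self-contained (all table, column, and method names here are illustrative, not from any library):

```python
import sqlite3

class ConversationStore:
    """Minimal message store with a trigger check for summarization."""

    def __init__(self, db_path=":memory:", summarize_every=20):
        self.db = sqlite3.connect(db_path)
        self.summarize_every = summarize_every  # naive message-count trigger
        self.db.execute(
            "CREATE TABLE IF NOT EXISTS messages ("
            "id INTEGER PRIMARY KEY, session_id TEXT, role TEXT, content TEXT)"
        )
        self.db.execute(
            "CREATE TABLE IF NOT EXISTS summaries ("
            "id INTEGER PRIMARY KEY, session_id TEXT, content TEXT)"
        )

    def add(self, session_id, role, content):
        self.db.execute(
            "INSERT INTO messages (session_id, role, content) VALUES (?, ?, ?)",
            (session_id, role, content),
        )
        self.db.commit()

    def should_summarize(self, session_id):
        # swap this for a token-count check in real use
        (n,) = self.db.execute(
            "SELECT COUNT(*) FROM messages WHERE session_id = ?", (session_id,)
        ).fetchone()
        return n >= self.summarize_every

    def context(self, session_id, last_n=10):
        """Latest summary (if any) plus the last N raw messages, oldest first."""
        summary = self.db.execute(
            "SELECT content FROM summaries WHERE session_id = ? "
            "ORDER BY id DESC LIMIT 1",
            (session_id,),
        ).fetchone()
        rows = self.db.execute(
            "SELECT role, content FROM messages WHERE session_id = ? "
            "ORDER BY id DESC LIMIT ?",
            (session_id, last_n),
        ).fetchall()
        messages = [{"role": r, "content": c} for r, c in reversed(rows)]
        if summary:
            messages.insert(
                0, {"role": "system", "content": f"Summary so far: {summary[0]}"}
            )
        return messages
```

The actual summarization call (LLM summarizes old messages, result goes into the summaries table) is the part you'd plug in yourself, which is exactly why owning this code beats fighting a framework's opinion about it.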
DehabAsmara@reddit
Most frameworks assume you want a full "autonomous agent" loop. For simple, robust conversation persistence and sliding-window context assembly, the overhead of a framework usually isn't worth the loss of schema control.
If you want to avoid the "agent" bloat while staying maintainable, here is a concrete pattern that we’ve used for long-form creative generation where context drift is a major issue:
The Dual-Head Storage: Use a two-table schema. Table A stores raw messages with a session_id. Table B stores "Context Snapshots" (rolling summaries). Each summary row points to the last_message_id it includes. This keeps your history queryable without dragging hundreds of messages into every LLM call.
The Token-Based Trigger: Never trigger summarization on message count. Use tiktoken or your model's native counting method (like Gemini's count_tokens) to trigger a summary event when you hit 75 percent of your target window.
The Assembly Logic: Your context assembler should pull the system prompt, the latest summary from Table B, and any messages from Table A where id is greater than the last_message_id_in_summary.
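A rough sketch of that assembly logic against the two-table schema, with sqlite3 standing in for whatever DB you use (table and column names just mirror the description above; the summarizer itself is not shown):

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.executescript("""
    CREATE TABLE messages (
        id INTEGER PRIMARY KEY, session_id TEXT, role TEXT, content TEXT
    );
    CREATE TABLE context_snapshots (
        id INTEGER PRIMARY KEY, session_id TEXT,
        last_message_id INTEGER,  -- everything with id <= this is summarized
        summary TEXT
    );
""")

def assemble_context(session_id, system_prompt):
    """System prompt + latest snapshot + only messages after the cutoff."""
    snap = db.execute(
        "SELECT last_message_id, summary FROM context_snapshots "
        "WHERE session_id = ? ORDER BY id DESC LIMIT 1",
        (session_id,),
    ).fetchone()
    cutoff, summary = snap if snap else (0, None)
    rows = db.execute(
        "SELECT role, content FROM messages "
        "WHERE session_id = ? AND id > ? ORDER BY id",
        (session_id, cutoff),
    ).fetchall()
    context = [{"role": "system", "content": system_prompt}]
    if summary:
        context.append(
            {"role": "system", "content": f"Conversation so far: {summary}"}
        )
    context += [{"role": r, "content": c} for r, c in rows]
    return context
```

The `last_message_id` pointer is what gives you a hard boundary between summarized and live messages, so you never drag already-summarized turns into a call or summarize the same chunk twice.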
The one caveat is that rolling summaries are lossy. If your project relies on very specific references from 100 turns ago, you will eventually lose that detail. If that matters, you are better off with a lightweight metadata tag system rather than a vector DB.
Are you handling multi-modal inputs? If you are feeding images back into the loop, the token count trigger becomes even more critical than the storage layer itself.
hl_lost@reddit
yeah this is one of those cases where rolling your own is genuinely the right call imo. i did something similar - postgres + a simple summarization step that fires when the conversation hits a token threshold. the whole thing was like 200 lines and i've never had to fight with someone else's abstraction about how summaries should work.
the two-table pattern someone mentioned above is basically the gold standard for this. only thing i'd add is consider storing token counts per message too - makes context window budgeting way easier when you're assembling prompts.
Ethancole_dev@reddit
Honestly for this use case I just rolled my own with SQLAlchemy — messages table with session_id/role/content/timestamp, then on context assembly fetch last N messages + a cached summary of the older ones. Ends up being maybe 150 lines and you own the whole thing.
If you want something pre-built, mem0 is way lighter than Letta/LangGraph and covers storage + rolling summaries without dragging in a full agent framework. Worth a look before you build from scratch.
evdw_@reddit
Honestly your LLM has posted 3 replies to this thread starting with the same word, you might want to look into that bud. emdash.
Ethancole_dev@reddit
Honestly for that use case you might just want to roll your own thin wrapper. SQLAlchemy (or SQLModel if you are on FastAPI) for storage, a simple function that summarizes every N messages using the LLM itself, and a context assembler that fetches recent messages + latest summary. No framework overhead. I did something similar for a FastAPI project — took about a day to build and it has been rock solid since.
sarvesh4396@reddit (OP)
Yeah, right, guess so. Will vibe-code it with AI, of course.
parwemic@reddit
same experience here, ended up building it myself too. the one thing that saved me a ton of headache was treating the summarization trigger as a token count threshold rather than message count. like instead of "summarize every 20 messages" you check total tokens before each LLM call and, if you're over your budget you compress the oldest chunk and store that as a summary row.
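That check-before-each-call loop might look something like this. The chars/4 estimate is a crude stand-in for a real tokenizer like tiktoken, and the summary string is a placeholder for the actual LLM compression call:

```python
def estimate_tokens(text):
    # rough heuristic: ~4 chars per token; swap in tiktoken for real counts
    return max(1, len(text) // 4)

def maybe_compress(messages, summaries, budget=1000, chunk=10):
    """If total tokens exceed the budget, fold the oldest chunk into a summary."""
    total = sum(estimate_tokens(m["content"]) for m in messages)
    total += sum(estimate_tokens(s) for s in summaries)
    if total <= budget:
        return messages, summaries
    oldest, rest = messages[:chunk], messages[chunk:]
    # placeholder: in real code, an LLM call summarizes `oldest` here
    summary = f"summary of {len(oldest)} messages"
    return rest, summaries + [summary]
```

Running this before every LLM call means compression happens exactly when it's needed, regardless of whether the conversation is 20 short messages or 5 long ones.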
ultrathink-art@reddit
Two tables works well: messages (session_id, role, content, timestamp) + summaries (session_id, through_message_id, content). On context assembly, pull the latest summary plus any messages after through_message_id. Cheap, queryable, no agent system needed.
rachel_rig@reddit
The `through_message_id` cutoff is the part that keeps this sane. Without a real boundary between "already summarized" and "still live", you end up duplicating context or summarizing the same chunk twice.
No_Soy_Colosio@reddit
Look into RAG
sarvesh4396@reddit (OP)
But that's for memory right? Not context
No_Soy_Colosio@reddit
It depends on what you think the distinction between memory and context is.
The point of memory in LLMs is to provide context.
Aggressive_Pay2172@reddit
tbh you’re not missing anything — this is still a “roll your own” space
most libraries either go full agent framework or full “memory extraction” layer
clean storage + summarization as a first-class thing is weirdly underbuilt
sarvesh4396@reddit (OP)
Yeah, somehow it's either not what they need, or when they do build it, it stays small and private.
Ethancole_dev@reddit
Honestly have not found a library that hits this exact sweet spot either. I ended up rolling my own — SQLAlchemy models for message storage, Pydantic for serialization, and a simple "summarize when you hit N messages" function. Takes an afternoon and you own the schema completely.
Rolling summary logic is pretty straightforward: once active messages exceed a threshold, call the LLM to summarize the oldest chunk, store it as a summary row, then drop those from context assembly. Works well in FastAPI with a background task to handle it async.
The only library I know that comes close without going full agent-framework is maybe storing in SQLite with a thin wrapper, but honestly just building it gives you way more control over how context gets assembled.
sarvesh4396@reddit (OP)
Yeah, you're right, I think I'll build custom.
sheila_118@reddit
Looks like a lightweight custom DB + LLM summarizer is the cleanest approach.
sarvesh4396@reddit (OP)
Yes, correct.
Do not want bloat