I reverse-engineered Claude Desktop's storage to build a local memory layer (no API, 100% offline)
Posted by foufouadi@reddit | LocalLLaMA | 11 comments
Hey r/LocalLLaMA,
Claude Desktop has no memory API. So I reverse-engineered its local storage.
Getting to the conversation data required cracking several layers:
- `FF 11 02` header → Snappy-compressed IDB blob (from `idb_value_wrapping.cc`)
- 15-byte Blink metadata prefix to strip
- Custom V8 deserializer in C# (Node's `v8.deserialize()` chokes on Blink host objects)
- Then I discovered the HTTP cache (zstd `f_*` files) was actually much cleaner for real-time interception.
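The unwrapping order in the list above can be sketched in a few lines. This is a rough illustration, not the author's C# implementation: the `FF 11 02` header and 15-byte Blink prefix come from the post, the function names are my own, and the actual Snappy decompression is left as a comment because it needs the third-party `python-snappy` package.

```python
# Sketch of the unwrap order described above (illustrative, not the real tool).
IDB_WRAPPER_HEADER = b"\xff\x11\x02"  # marks a Snappy-compressed IDB value
BLINK_PREFIX_LEN = 15                 # Blink metadata bytes to strip

def is_wrapped_idb_blob(data: bytes) -> bool:
    """Check for the wrapper header from idb_value_wrapping.cc."""
    return data.startswith(IDB_WRAPPER_HEADER)

def unwrap(data: bytes) -> bytes:
    """Strip the header, decompress, drop the Blink prefix.

    Returns the raw V8-serialized payload for a custom deserializer.
    """
    if not is_wrapped_idb_blob(data):
        raise ValueError("not a wrapped IndexedDB blob")
    compressed = data[len(IDB_WRAPPER_HEADER):]
    # import snappy; payload = snappy.uncompress(compressed)  # real step
    payload = compressed  # placeholder so the sketch stays dependency-free
    return payload[BLINK_PREFIX_LEN:]
```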
The result is Mnemos — a local MCP server that:
- Watches Cache/Cache_Data in real-time (FileSystemWatcher + zstd)
- Syncs history from IndexedDB blobs (Snappy + V8 deserialization)
- Vectorizes everything locally with MiniLM-L6-v2 via ONNX
- Exposes hybrid search (BM25 + cosine, merged with RRF) back to the LLM
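For readers unfamiliar with RRF, here's a minimal sketch of how the BM25 and cosine rankings can be fused. The doc IDs and the `k=60` constant (the value from the original RRF paper) are illustrative assumptions; the post doesn't say which constant Mnemos uses.

```python
def rrf_merge(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Merge ranked result lists with Reciprocal Rank Fusion.

    Each input list is ordered best-first; a doc's fused score is
    sum(1 / (k + rank)) over every list it appears in.
    """
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

bm25_hits = ["msg3", "msg1", "msg7"]    # lexical ranking (hypothetical IDs)
cosine_hits = ["msg1", "msg3", "msg9"]  # vector ranking
merged = rrf_merge([bm25_hits, cosine_hits])
```

Documents that appear near the top of both lists dominate the fused ranking, which is why RRF needs no score normalization between BM25 and cosine similarity.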
Because it's a standard MCP server, you can hook up your own local LLMs to your entire Claude chat history too.
100% offline. Nothing leaves your machine. Full reverse engineering writeup in the repo.
I went with hybrid search (BM25 + cosine + RRF) for retrieval. For those building local memory layers, what chunking or retrieval strategies are you finding most effective for raw conversational logs?
nicoloboschi@reddit
Splitting reasoning traces into a separate retrievable layer is smart. It's key for improving explainability and debugging. For anyone building custom memory layers, it would be good to compare architectural decisions against systems like Hindsight. https://github.com/vectorize-io/hindsight
Ta_Rik_@reddit
This is really cool — I’ve been building similar local memory layers myself. Will definitely check this out.
Curious — have you tried session-based chunking vs semantic splits for conversational logs?
foufouadi@reddit (OP)
I’m sticking with message-based indexing. It’s way more reliable for code blocks than semantic splitting. I also index thinking blocks separately so the reasoning is searchable. How do you handle topic drift with your session chunks?
foufouadi@reddit (OP)
Drift happens, but message-level indexing is where it's at. It lets you pinpoint the specific message with the answer you need instead of getting a massive session chunk full of noise. Even if a 2-hour thread drifts from a slow query to a full refactor, the retrieval stays clean.
The 15-min idle split is a smart proxy for intent, though. No drift detection overhead needed. Thinking block separation is the real win for me. Indexing the 'why' separately from the final response makes that reasoning actually retrievable later.
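A 15-min idle split like the one mentioned above fits in a few lines. The `(timestamp, text)` tuple shape here is a hypothetical stand-in, not anyone's actual schema:

```python
def split_sessions(messages, idle_gap_s=15 * 60):
    """Group (timestamp_seconds, text) messages into sessions, starting a
    new session whenever the gap between consecutive messages exceeds
    idle_gap_s (15 minutes here)."""
    sessions = []
    for ts, text in sorted(messages):
        if not sessions or ts - sessions[-1][-1][0] > idle_gap_s:
            sessions.append([])  # idle gap exceeded: open a new session
        sessions[-1].append((ts, text))
    return sessions
```

The appeal is exactly what's said above: no topic modeling, no drift detection, just a timestamp comparison per message.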
Ta_Rik_@reddit
That 2-hour refactor-mid-conversation example is exactly the thing that kills session chunking. Can't argue with message-level when the retrieval surface is that heterogeneous.
The thinking-block separation keeps bugging me though — it's such an obvious win in hindsight. My RAG stack indexes final outputs only, so the "why" evaporates. For LLM chat logs specifically, the reasoning trace is probably *more* useful than the answer half the time (especially for debugging past decisions).
Quick question on your Mnemos setup: when you retrieve a message hit, do you pull the associated thinking block along with it (joined at ingest), or are they fully independent retrievable units that the LLM recombines at query time? The second feels more flexible but I'd worry about missing context.
Also — how do you handle tool-call traces? Those are sort of a third layer next to message and thinking.
foufouadi@reddit (OP)
Spot on. The thinking block separation is honestly what makes the whole thing viable for actual debugging.
To answer your first question: they are joined at ingest. In the SQLite schema, a single row in the messages table represents one turn and has both a text column and a thinking column. The FTS5 virtual table indexes both fields. When the hybrid search gets a hit—whether the match is in the thought process or the final output—it pulls the entire row. The LLM receives the atomic unit of "here is what I thought, and here is what I outputted." I went this route exactly because of your worry: independent units risk detaching the reasoning from the actual code produced.
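The joined-at-ingest layout described above might look roughly like this in SQLite. Column and table names are my guesses at the shape, not the actual Mnemos schema, and it assumes an SQLite build with FTS5 (bundled with most CPython distributions):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- One row per turn: final output and reasoning side by side.
    CREATE TABLE messages (
        id INTEGER PRIMARY KEY,
        text TEXT,       -- the final response
        thinking TEXT    -- the reasoning trace for that same turn
    );
    -- FTS5 over both fields, so a hit in either retrieves the whole row.
    CREATE VIRTUAL TABLE messages_fts USING fts5(
        text, thinking, content='messages', content_rowid='id'
    );
""")
conn.execute(
    "INSERT INTO messages (id, text, thinking) VALUES (1, ?, ?)",
    ("Use an index on user_id.", "The slow query scans the whole table..."),
)
conn.execute(
    "INSERT INTO messages_fts (rowid, text, thinking) "
    "SELECT id, text, thinking FROM messages"
)
# A match in the *thinking* column still pulls back the full atomic unit.
row = conn.execute(
    "SELECT m.text, m.thinking FROM messages_fts f "
    "JOIN messages m ON m.id = f.rowid WHERE messages_fts MATCH 'scans'"
).fetchone()
```

The join back to `messages` is what delivers the "here is what I thought, and here is what I outputted" pair as one unit.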
As for tool-call traces, you caught me. That's the missing piece right now. My current extractor specifically filters for text and thinking types and skips the rest to keep the index clean. You are 100% right that tool calls form a third layer, and parsing them out during the real-time Zstd interception to give them their own structure is the next logical step.
Glad the message-level approach makes sense to you.
Creepy-Bell-4527@reddit
I thought I missed the daily "I finally solved AI's memory problem!" post. Nope!
foufouadi@reddit (OP)
Not solving memory in general; Claude already has that built-in. This is specifically about giving Claude real-time access to its own local chat history, without going through any API. Every message is intercepted directly the moment it's written to disk.
Emotional-Breath-838@reddit
Woah. Thank you!