Local-first LLM context dedup: 22-71% chunk overlap measured across 22M passages (2 arXiv papers). MCP server, MIT, 250KB binary, zero telemetry.
Posted by MindPsychological140@reddit | LocalLLaMA | 9 comments
I'm the author of this thing, disclosure up front. I've been hanging around this sub lately in threads on cache invalidation, MoE memory tradeoffs, and long-session token bloat. Here's the tool I was working on while commenting.
Why this might help you
Most local LLM setups eat context window space they don't need to. We measured chunk-level redundancy across 22 million context passages from real agent sessions and RAG pipelines:
- About 22% of typical agent context is duplicate: system prompts re-sent, file contents quoted multiple times across turns, tool results restated
- Up to 71% on RAG-heavy queries where retrieved chunks overlap a lot
For 8k / 16k / 32k local models, stripping that means more useful tokens fit before truncation.
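(Rough math, if those measured averages hold for your sessions: 22% of an 8k window is about 1,800 tokens back, and at the 71% RAG-heavy end a 32k window frees roughly 23,000.)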
The measurement papers if you're curious:
arXiv:2605.09611 (architecture)
arXiv:2605.09990 (empirical, the 22M-passage measurement)
Zenodo: 10.5281/zenodo.20090991
Three ways to use it, depending on your setup
- HTTP proxy mode — best for Ollama / vLLM / SGLang / OpenWebUI / llama.cpp server / anything with an OpenAI-compatible endpoint. Run the proxy locally, point your client at http://localhost:8787/v1 instead of your model server directly. Chunk-level dedup happens in the outgoing request before it reaches your model.
Default is cache-aware: it leaves the conversation prefix untouched (so vLLM / SGLang prefix-caching keeps hitting) and only dedupes the most recent user message. There's an opt-in aggressive mode if you know your cache hit rate is already low. (Example request after this list.)
- MCP server — for Claude Desktop / Claude Code / OpenClaw / Cursor. Exposes merlin_dedupe, merlin_dedupe_file, merlin_savings_summary, and merlin_status as tools you can instruct the model to call on chunky pastes (it won't auto-invoke without explicit prompting).
- Standalone CLI — for shell pipelines and preprocessing scripts. The binary takes a positional input file and writes deduped lines via --output-dedup=path.txt. Single-threaded, ~250 KB, no runtime dependencies, no network calls. (Usage sketch after this list.)
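For the proxy mode, the only client-side change is the base URL. A minimal sketch of a request through it (the model name is just a placeholder for whatever your Ollama / vLLM / llama.cpp server already serves; the endpoint path is the standard OpenAI-compatible one):

curl http://localhost:8787/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "qwen2.5:32b", "messages": [{"role": "user", "content": "hello"}]}'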
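For the CLI, a rough usage sketch. The binary name below just stands in for whatever the release zip actually ships; the positional input file and the --output-dedup flag are as described above:

merlin-community context_dump.txt --output-dedup=context_dedup.txt
wc -c context_dump.txt context_dedup.txt    # rough before/after size check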
Install (one command per setup)
curl -LO https://github.com/corbenicai/merlin-community/releases/latest/download/merlin-community.zip
unzip merlin-community.zip && cd merlin-community
python shared/install_helpers.py
Honest tradeoffs
Community tier has caps: 50 MB per run, 200 MB per day, 2 GB per month. It refuses oversized work cleanly (verified on a 51 MB file). Hobby use never hits these.
Open-core: there's a separate closed-source Pro engine for high-throughput servers. What's in the public repo is what runs in the community edition.
Doesn't fix session fragmentation in agent loops where the whole conversation gets replayed every turn. That's an orchestration problem above where this tool sits.
Windows x64 binary in the v0.2.1 release. Linux + macOS coming once I get a cross-platform CI pipeline up — open an issue if you want a ping when they land.
Repo: github.com/corbenicai/merlin-community
Zero telemetry. GitHub stars are the only adoption signal we get. The issue tracker is open, and honest critique is genuinely welcome; that's how v0.2.1 happened this morning.
r3xetdeus@reddit
The "doesn't fix session fragmentation in agent loops" caveat is the most honest line in a self-promo post I've read this month — that's the bigger fish, and it sits exactly where you said it does.
MindPsychological140@reddit (OP)
Really appreciate that. I'd rather be completely upfront about the limits than sell snake oil. Fixing that orchestration layer is definitely the bigger fish, but shaving off the 39% on the wire is still a nice win while we wait for it.
xeeff@reddit
he's a bot
ttkciar@reddit
Hi! How much of this post was LLM-generated?
MindPsychological140@reddit (OP)
The rewrite, 100%. 1) I am not a native English speaker, 2) I have dyslexia. So I use an LLM to support me, sorry.
ttkciar@reddit
No worries :-) the subreddit rules specify an exemption for people who need LLM translation. You're fine.
MindPsychological140@reddit (OP)
Thank you, it is really hard defending myself every time. Some things I write myself, but that is hard and uncomfortable each time. I try to add the disclosure from the start now. Thanks.
samoxis@reddit
Interesting timing — been running long agent sessions with OpenClaw + qwen2.5:32b and context bloat is a real issue especially when web search results get injected repeatedly. Will test the proxy mode against Ollama on Windows. Does the 8k context cap in the community tier apply per-request or per-session?
MindPsychological140@reddit (OP)
Quick correction — there's no 8k context cap in the community tier. The caps are bytes-based, not token/context-based: 50 MB per run, 200 MB per day, 2 GB per month.