Local-first LLM context dedup: 22-71% chunk overlap measured across 22M passages (2 arXiv papers). MCP server, MIT, 250KB binary, zero telemetry.

Posted by MindPsychological140@reddit | LocalLLaMA | 9 comments

I'm the author of this thing, disclosure up front. I've been hanging around this sub lately commenting on cache invalidation, MoE memory tradeoffs, and long-session token bloat. Here's the tool I was working on while commenting.

Why this might help you

Most local LLM setups eat context window space they don't need to. We measured chunk-level redundancy across 22 million context passages from real agent sessions and RAG pipelines:

About 22% of typical agent context is duplicate: system prompts re-sent, file contents quoted multiple times across turns, tool results restated

Up to 71% on RAG-heavy queries where retrieved chunks overlap a lot

For 8k / 16k / 32k local models, stripping that means more useful tokens fit before truncation.
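Back-of-envelope on what 22% buys you at common local context sizes (just the post's numbers, not a measurement of my own):

```shell
# Rough tokens reclaimed at the post's ~22% average dedup rate
for ctx in 8192 16384 32768; do
  echo "$ctx ctx -> ~$((ctx * 22 / 100)) tokens reclaimed"
done
```

On an 8k model that's roughly 1,800 tokens back before truncation; on RAG-heavy queries at the 71% end the reclaimed share is proportionally larger.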

The measurement papers if you're curious:

arXiv:2605.09611 (architecture)

arXiv:2605.09990 (empirical, the 22M-passage measurement)

Zenodo: 10.5281/zenodo.20090991

Three ways to use it, depending on your setup

  1. HTTP proxy mode — best for Ollama / vLLM / SGLang / OpenWebUI / llama.cpp server / anything with an OpenAI-compatible endpoint. Run the proxy locally, point your client at http://localhost:8787/v1 instead of your model server directly. Chunk-level dedup happens in the outgoing request before it reaches your model.

Default is cache-aware: it leaves the conversation prefix untouched (so vLLM / SGLang prefix-caching keeps hitting) and only dedupes the most recent user message. There's an opt-in aggressive mode if you know your cache hit rate is already low.
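A minimal sketch of what "point your client at the proxy" looks like, assuming the proxy passes through a standard OpenAI-style chat endpoint on the default port from above (the model name here is just a placeholder):

```shell
# Route requests through the local dedup proxy instead of the model server.
PROXY_URL="http://localhost:8787/v1/chat/completions"
REQUEST='{"model": "llama3", "messages": [{"role": "user", "content": "hello"}]}'

# With the proxy running, the request goes through it unchanged from the
# client's point of view; dedup happens before it reaches the model server:
#   curl -s "$PROXY_URL" -H "Content-Type: application/json" -d "$REQUEST"
echo "Would POST to: $PROXY_URL"
```

Most OpenAI-compatible clients let you swap the base URL in one config field, so no code changes are needed on the client side.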

  2. MCP server — for Claude Desktop / Claude Code / OpenClaw / Cursor. Exposes merlin_dedupe, merlin_dedupe_file, merlin_savings_summary, and merlin_status as tools you can instruct the model to call on chunky pastes (it won't auto-invoke them without explicit prompting).

  3. Standalone CLI — for shell pipelines and preprocessing scripts. The binary takes a positional input file and writes deduped lines via --output-dedup=path.txt. Single-threaded, ~250 KB, no runtime dependencies, no network calls.
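A sketch of the CLI flow described above. The binary name below is a placeholder assumption (check the release zip for the actual name), and since the post describes chunk-level dedup, the awk one-liner here is only a line-level stand-in to show the input/output shape:

```shell
# Make a small input with a repeated line
printf 'alpha\nbeta\nalpha\n' > context.txt

# The actual tool, per the post's interface:
#   merlin-community.exe context.txt --output-dedup=context.dedup.txt

# Line-level illustration of the same shape using awk:
awk '!seen[$0]++' context.txt > context.dedup.txt
cat context.dedup.txt   # the repeated "alpha" is dropped
```

This slots naturally into preprocessing scripts: dedupe a transcript or retrieved-chunk dump once, then feed the smaller file to your prompt builder.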

Install (one command per setup)

curl -LO https://github.com/corbenicai/merlin-community/releases/latest/download/merlin-community.zip

unzip merlin-community.zip && cd merlin-community

python shared/install_helpers.py enable <target>

Where <target> is claude_desktop, claude_code, openclaw, cursor, or proxy.

Honest tradeoffs

Community tier has caps: 50 MB per run, 200 MB per day, 2 GB per month. It refuses oversized work cleanly (verified on a 51 MB file). Hobby use never hits these.

Open-core: there's a separate closed-source Pro engine for high-throughput servers. What's in the public repo is what runs in the community edition.

Doesn't fix session fragmentation in agent loops where the whole conversation gets replayed every turn. That's an orchestration problem above where this tool sits.

Windows x64 binary in the v0.2.1 release. Linux + macOS coming once I get a cross-platform CI pipeline up — open an issue if you want a ping when they land.

Repo: github.com/corbenicai/merlin-community

Zero telemetry. GitHub stars are the only adoption signal we get. The issue tracker is open and honest critique is genuinely welcome; that's how v0.2.1 happened this morning.