Local-first LLM context dedup: 22-71% chunk overlap measured across 22M passages (2 arXiv papers). MCP server, MIT, 250KB binary, zero telemetry.

Posted by MindPsychological140@reddit | LocalLLaMA | 9 comments

I'm the author of this thing, disclosure up front. I've been hanging around this sub lately commenting on cache invalidation, MoE memory tradeoffs, and long-session token bloat. Here's the tool I was working on while commenting.

Why this might help you

Most local LLM setups eat context window space they don't need to. We measured chunk-level redundancy across 22 million context passages from real agent sessions and RAG pipelines:

About 22% of typical agent context is duplicate: system prompts re-sent, file contents quoted multiple times across turns, tool results restated

Up to 71% on RAG-heavy queries where retrieved chunks overlap a lot

For 8k / 16k / 32k local models, stripping that means more useful tokens fit before truncation.
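Back-of-envelope on what 22% buys you at common local context sizes (just the post's numbers, not a measurement of my own):

```shell
# Rough tokens reclaimed at the post's ~22% average dedup rate
for ctx in 8192 16384 32768; do
  echo "$ctx ctx -> ~$((ctx * 22 / 100)) tokens reclaimed"
done
```

On an 8k model that's roughly 1,800 tokens back before truncation; on RAG-heavy queries at the 71% end the reclaimed share is proportionally larger.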

The measurement papers if you're curious:

arXiv:2605.09611 (architecture)

arXiv:2605.09990 (empirical, the 22M-passage measurement)

Zenodo: 10.5281/zenodo.20090991

Three ways to use it, depending on your setup

  1. HTTP proxy mode — best for Ollama / vLLM / SGLang / OpenWebUI / llama.cpp server / anything with an OpenAI-compatible endpoint. Run the proxy locally, point your client at http://localhost:8787/v1 instead of your model server directly. Chunk-level dedup happens in the outgoing request before it reaches your model.

Default is cache-aware: it leaves the conversation prefix untouched (so vLLM / SGLang prefix-caching keeps hitting) and only dedupes the most recent user message. There's an opt-in aggressive mode if you know your cache hit rate is already low.
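A minimal sketch of what "point your client at the proxy" looks like, assuming the proxy passes through a standard OpenAI-style chat endpoint on the default port from above (the model name here is just a placeholder):

```shell
# Route requests through the local dedup proxy instead of the model server.
PROXY_URL="http://localhost:8787/v1/chat/completions"
REQUEST='{"model": "llama3", "messages": [{"role": "user", "content": "hello"}]}'

# With the proxy running, the request goes through it unchanged from the
# client's point of view; dedup happens before it reaches the model server:
#   curl -s "$PROXY_URL" -H "Content-Type: application/json" -d "$REQUEST"
echo "Would POST to: $PROXY_URL"
```

Most OpenAI-compatible clients let you swap the base URL in one config field, so no code changes are needed on the client side.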

  2. MCP server — for Claude Desktop / Claude Code / OpenClaw / Cursor. Exposes merlin_dedupe, merlin_dedupe_file, merlin_savings_summary, and merlin_status as tools you can instruct the model to call on chunky pastes (it won't auto-invoke them without explicit prompting).

  3. Standalone CLI — for shell pipelines and preprocessing scripts. The binary takes a positional input file and writes deduped lines via --output-dedup=path.txt. Single-threaded, ~250 KB, no runtime dependencies, no network calls.
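A sketch of the CLI flow described above. The binary name below is a placeholder assumption (check the release zip for the actual name), and since the post describes chunk-level dedup, the awk one-liner here is only a line-level stand-in to show the input/output shape:

```shell
# Make a small input with a repeated line
printf 'alpha\nbeta\nalpha\n' > context.txt

# The actual tool, per the post's interface:
#   merlin-community.exe context.txt --output-dedup=context.dedup.txt

# Line-level illustration of the same shape using awk:
awk '!seen[$0]++' context.txt > context.dedup.txt
cat context.dedup.txt   # the repeated "alpha" is dropped
```

This slots naturally into preprocessing scripts: dedupe a transcript or retrieved-chunk dump once, then feed the smaller file to your prompt builder.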

Install (one command per setup)

curl -LO https://github.com/corbenicai/merlin-community/releases/latest/download/merlin-community.zip

unzip merlin-community.zip && cd merlin-community

python shared/install_helpers.py enable <target>

Where <target> is claude_desktop, claude_code, openclaw, cursor, or proxy.

Honest tradeoffs

Community tier has caps: 50 MB per run, 200 MB per day, 2 GB per month. It refuses oversized work cleanly (verified on a 51 MB file). Hobby use never hits these.

Open-core: there's a separate closed-source Pro engine for high-throughput servers. What's in the public repo is what runs in the community edition.

Doesn't fix session fragmentation in agent loops where the whole conversation gets replayed every turn. That's an orchestration problem above where this tool sits.

Windows x64 binary in the v0.2.1 release. Linux + macOS coming once I get a cross-platform CI pipeline up — open an issue if you want a ping when they land.

Repo: github.com/corbenicai/merlin-community

Zero telemetry. GitHub stars are the only adoption signal we get. The issue tracker is open and honest critique is genuinely welcome; that's how v0.2.1 happened this morning.