Reuse non-prefix KV Cache and speed up RAG by 3X with LMCache.
Posted by Nice-Comfortable-650@reddit | LocalLLaMA | View on Reddit | 22 comments

A while back, we shared our open-source project LMCache here and were blown away by the incredible support and feedback. Today, our team is thrilled to share more about one of our core components: CacheBlend. Recognized with a Best Paper Award at ACM EuroSys 2025, this technique is a painkiller for efficient RAG applications.
The Problem: Your KV Cache is Wasting Potential
In modern LLM applications like RAG and Agents, we constantly feed the model new context. For example, in RAG, we retrieve relevant documents and stuff them into the prompt.
The issue is that this dynamically retrieved context doesn't always appear at the beginning of the input sequence. Traditional KV caching only reuses a "common prefix," so if the new information isn't at the very start, the cache hit rate plummets, and your GPU ends up recomputing the same things over and over.
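To see the effect, here is a toy illustration (not LMCache code; it only mimics how prefix matching behaves when the same documents are retrieved in a different order):

```python
# Toy illustration: two RAG prompts retrieve the same documents,
# just in a different order, plus different user questions.
prompt_a = ["<system>", "doc_2", "doc_7", "doc_4", "question_a"]
prompt_b = ["<system>", "doc_7", "doc_2", "doc_4", "question_b"]

# A prefix cache can only reuse the longest common prefix of the two prompts.
common_prefix = 0
for a, b in zip(prompt_a, prompt_b):
    if a != b:
        break
    common_prefix += 1

print(common_prefix)  # 1 -> only "<system>" is reused; every document's
                      # KV cache is recomputed even though it was seen before
```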
The Solution: CacheBlend - 100% Hit Rate, No Compromises
CacheBlend changes the game by allowing for the reuse of pre-computed KV caches regardless of their position in the input sequence.
This means we can finally achieve a 100% KV Cache hit rate in applications like RAG. The performance gains are significant:
- Faster Time-To-First-Token (TTFT): Get your initial response much quicker.
- More Throughput: Serve significantly more users with the same hardware.
- Almost lossless Output Quality: All of this is achieved with little degradation in the model's generation quality.
How does it work?
CacheBlend intelligently handles the two main challenges of reusing non-prefix caches:
- Positional Encoding Update: It efficiently updates positional encodings to ensure the model always knows the correct position of each token, even when we're stitching together cached and new data.
- Selective Attention Recalculation: Instead of recomputing everything, it strategically recalculates only the minimal cross-attention needed between the new and cached chunks to preserve generation quality (see the sketch below).
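At a high level, a minimal PyTorch sketch of the two steps might look like this. This assumes RoPE-style attention; the function names, tensor shapes, and the 15% recompute budget are illustrative, not LMCache's actual API:

```python
import torch

def rope_rotate(x: torch.Tensor, positions: torch.Tensor, base: float = 10000.0):
    """Apply a rotary position embedding for the given absolute positions.
    x: (seq_len, head_dim) with an even head_dim."""
    dim = x.shape[-1]
    inv_freq = 1.0 / (base ** (torch.arange(0, dim, 2, dtype=torch.float32) / dim))
    angles = positions[:, None].float() * inv_freq[None, :]   # (seq, dim/2)
    cos, sin = angles.cos(), angles.sin()
    x1, x2 = x[..., 0::2], x[..., 1::2]
    out = torch.empty_like(x)
    out[..., 0::2] = x1 * cos - x2 * sin
    out[..., 1::2] = x1 * sin + x2 * cos
    return out

def reposition_cached_keys(cached_k: torch.Tensor, old_start: int, new_start: int):
    """Step 1: positional-encoding update. RoPE rotations compose additively,
    so a cached chunk can be 'moved' by rotating its keys by the position delta."""
    delta = torch.full((cached_k.shape[0],), new_start - old_start)
    return rope_rotate(cached_k, delta)

def tokens_to_recompute(kv_cached: torch.Tensor, kv_fresh: torch.Tensor, budget: float = 0.15):
    """Step 2: selective recomputation. Compare cached KV values against a fresh
    pass and pick the tokens that deviate the most; only those get their
    cross-attention recomputed against the other chunks. The 15% budget is an
    assumed value, not a recommendation from the paper."""
    deviation = (kv_cached - kv_fresh).pow(2).mean(dim=-1)     # per-token score
    k = max(1, int(budget * deviation.shape[0]))
    return deviation.topk(k).indices
```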
For detailed analysis, please refer to the official paper: https://dl.acm.org/doi/10.1145/3689031.3696098
Where can I try it?
Try the newest interactive CacheBlend demo at: https://github.com/LMCache/LMCache-Examples/tree/main/demo-rag-blending
Ask us anything!
k-en@reddit
This looks very interesting. What about memory usage? Will this eat unbounded memory (growing with usage), or is there an option to control it? For example, when VRAM reaches a certain threshold, evict the oldest KV cache entries.
rakarsky@reddit
The KV cache can be stored in RAM and/or on disk.
jazir5@reddit
Can you submit this system to RooCode on their GitHub? I think they would want to implement this very quickly.
dampflokfreund@reddit
Is it possible to implement this in llama.cpp?
LinkSea8324@reddit
Isn't it already implemented? https://github.com/ggml-org/llama.cpp/pull/9866
dampflokfreund@reddit
The limitation of this PR is that context reuse only works if the system prompt remains static. When you change it or other parts of the prompt, which is the case with RAG or with memory such as a vector DB, it will process the entire context again. This is what LMCache would solve.
__JockY__@reddit
Today I learned that people change the system prompt mid-session.
May I ask why this would be done?
sautdepage@reddit
This is indeed for multi-session where the prompt is similar but variable, like for a specific application or company endpoint.
Standard cache only caches the common "starts with" part -- like Claude's huge standard prompt is certainly cached fully for all requests.
Looking at the GitHub repo and the paper, it seems the key feature is that multiple chunks of context can be combined in a prompt, in any order, and each part can be retrieved from cache and stitched together.
So say the app creates a new prompt for a session by combining: 1) a standard prompt, 2) a user-specific prompt, 3) a feature- or usage-specific prompt, and 4) two or three RAG snippets relevant to that session. Most of them can now be retrieved from cache, if they've been seen before individually, to form the new whole context.
jazir5@reddit
So it just fragment-caches different sections and can reassemble them as needed, calling the individually cached parts in whatever order? Neat.
__JockY__@reddit
That’s actually super useful. Thanks for taking the time.
LinkSea8324@reddit
I could be misunderstanding something, but right now, vLLM got what?
--cache-reuse 0
just the prefix, according to ggerganov:
MoffKalast@reddit
Doesn't this mean that the VRAM/RAM usage for storing old cache will balloon into infinity? I mean KV cache is already most of what we need to allocate if you go for longer context.
LagOps91@reddit
Is that actually it? The PR is quite old, no? Sounds like something different.
JustImmunity@reddit
So, can I use this with vllm serve "model"?
Nice-Comfortable-650@reddit (OP)
Not currently
JustImmunity@reddit
Any plans?
MargretTatchersParty@reddit
Is this something that I can set up and run in Ollama/OpenWebUI today? How much work would it be to bring that in?
Baldur-Norddahl@reddit
I hope this gets adopted quickly into the major programs. It should really make a huge difference when using agentic coding locally such as Cline, Roo Code and Aider. We are likely uploading the same small pieces of source files over and over.
Does the technique allow automatic recognition of parts of the context that have been seen before? Say the agent presents a source file to the LLM and that results in a diff for modifying the file. On the next task we get the same file uploaded again; it might be slightly modified, but most lines would be unchanged. Could we fetch cached values for the unmodified lines instead of starting all over?
Nice-Comfortable-650@reddit (OP)
Right now the recognition is manual: you need to mark up the context to specify each chunk. This requires the agent programmer to slightly modify the input sent to the LLM API server.
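For illustration only (the separator string and helper below are hypothetical placeholders, not the exact LMCache API), chunk-level markup could look roughly like this:

```python
# Hypothetical sketch of chunk-delimited prompt construction. The real
# LMCache integration defines its own chunk/separator convention; the point
# is only that the caller marks chunk boundaries so each chunk's KV cache
# can be stored and reused independently of its position.
BLEND_SEP = "<CHUNK_SEP>"  # placeholder separator, not the real default

def build_blended_prompt(system_prompt: str, retrieved_docs: list[str], question: str) -> str:
    chunks = [system_prompt, *retrieved_docs, question]
    return BLEND_SEP.join(chunks)

prompt = build_blended_prompt(
    "You are a helpful assistant.",
    ["<contents of doc 7>", "<contents of doc 2>"],
    "What changed between the two designs?",
)
# The serving engine splits on the separator, looks up each chunk's KV cache
# by content, blends the hits, and only computes what is missing.
```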
Nice-Comfortable-650@reddit (OP)
Our repository is at: https://github.com/LMCache/LMCache !
rainbowColoredBalls@reddit
For the selective attention calculation, if I understand correctly, you drop the complexity from O(n^2) to O(n*k), where k is the number of new tokens and k << n?
Nice-Comfortable-650@reddit (OP)
This is correct!
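As a back-of-the-envelope illustration (the numbers below are made up for scale, not taken from the paper):

```python
# Rough attention-cost comparison for one layer during prefill.
n = 8000                # total context tokens (cached chunks + new query)
k = 800                 # tokens actually computed: the new query plus the small
                        # fraction of chunk tokens selected for recomputation
full_prefill = n * n    # every token attends to every token: O(n^2)
cacheblend   = k * n    # only the k recomputed tokens attend to all n: O(n*k)
print(full_prefill / cacheblend)  # 10.0 -> ~10x fewer attention scores
```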