Reuse non-prefix KV Cache and speed up RAG by 3X with LMCache.
Posted by Nice-Comfortable-650@reddit | LocalLLaMA | View on Reddit | 22 comments

A while back, we shared our open-source project LMCache here and were blown away by the incredible support and feedback. Today, our team is thrilled to share more about one of our core components: CacheBlend. Recognized with a Best Paper Award at ACM EuroSys 2025, this technique is a painkiller for efficient RAG applications.
The Problem: Your KV Cache is Wasting Potential
In modern LLM applications like RAG and Agents, we constantly feed the model new context. For example, in RAG, we retrieve relevant documents and stuff them into the prompt.
The issue is that this dynamically retrieved context doesn't always appear at the beginning of the input sequence. Traditional KV caching only reuses a "common prefix," so if the new information isn't at the very start, the cache hit rate plummets, and your GPU ends up recomputing the same things over and over.
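To see the effect, here is a toy illustration (not LMCache code; it only mimics how prefix matching behaves when the same documents are retrieved in a different order):

```python
# Toy illustration: two RAG prompts retrieve the same documents,
# just in a different order, plus different user questions.
prompt_a = ["<system>", "doc_2", "doc_7", "doc_4", "question_a"]
prompt_b = ["<system>", "doc_7", "doc_2", "doc_4", "question_b"]

# A prefix cache can only reuse the longest common prefix of the two prompts.
common_prefix = 0
for a, b in zip(prompt_a, prompt_b):
    if a != b:
        break
    common_prefix += 1

print(common_prefix)  # 1 -> only "<system>" is reused; every document's
                      # KV cache is recomputed even though it was seen before
```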
The Solution: CacheBlend - 100% Hit Rate, No Compromises
CacheBlend changes the game by allowing for the reuse of pre-computed KV caches regardless of their position in the input sequence.
This means we can finally achieve a 100% KV Cache hit rate in applications like RAG. The performance gains are significant:
- Faster Time-To-First-Token (TTFT): Get your initial response much quicker.
- More Throughput: Serve significantly more users with the same hardware.
- Almost lossless Output Quality: All of this is achieved with little degradation in the model's generation quality.
How does it work?
CacheBlend intelligently handles the two main challenges of reusing non-prefix caches:
- Positional Encoding Update: It efficiently updates positional encodings to ensure the model always knows the correct position of each token, even when we're stitching together cached and new data.
- Selective Attention Recalculation: Instead of recomputing everything, it strategically recalculates only the minimal cross-attention needed between the new and cached chunks to preserve generation quality (see the sketch below).
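At a high level, a minimal PyTorch sketch of the two steps might look like this. This assumes RoPE-style attention; the function names, tensor shapes, and the 15% recompute budget are illustrative, not LMCache's actual API:

```python
import torch

def rope_rotate(x: torch.Tensor, positions: torch.Tensor, base: float = 10000.0):
    """Apply a rotary position embedding for the given absolute positions.
    x: (seq_len, head_dim) with an even head_dim."""
    dim = x.shape[-1]
    inv_freq = 1.0 / (base ** (torch.arange(0, dim, 2, dtype=torch.float32) / dim))
    angles = positions[:, None].float() * inv_freq[None, :]   # (seq, dim/2)
    cos, sin = angles.cos(), angles.sin()
    x1, x2 = x[..., 0::2], x[..., 1::2]
    out = torch.empty_like(x)
    out[..., 0::2] = x1 * cos - x2 * sin
    out[..., 1::2] = x1 * sin + x2 * cos
    return out

def reposition_cached_keys(cached_k: torch.Tensor, old_start: int, new_start: int):
    """Step 1: positional-encoding update. RoPE rotations compose additively,
    so a cached chunk can be 'moved' by rotating its keys by the position delta."""
    delta = torch.full((cached_k.shape[0],), new_start - old_start)
    return rope_rotate(cached_k, delta)

def tokens_to_recompute(kv_cached: torch.Tensor, kv_fresh: torch.Tensor, budget: float = 0.15):
    """Step 2: selective recomputation. Compare cached KV values against a fresh
    pass and pick the tokens that deviate the most; only those get their
    cross-attention recomputed against the other chunks. The 15% budget is an
    assumed value, not a recommendation from the paper."""
    deviation = (kv_cached - kv_fresh).pow(2).mean(dim=-1)     # per-token score
    k = max(1, int(budget * deviation.shape[0]))
    return deviation.topk(k).indices
```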
For detailed analysis, please refer to the official paper: https://dl.acm.org/doi/10.1145/3689031.3696098
Where can I try it?
Try the newest interactive CacheBlend demo at: https://github.com/LMCache/LMCache-Examples/tree/main/demo-rag-blending
Ask us anything!
k-en@reddit
This looks very interesting. What about memory usage? Will this eat unbounded memory (growing with usage), or is there an option to control it? For example, when VRAM reaches a certain threshold, evict the oldest KV cache entries.
rakarsky@reddit
The KV cache can be stored in RAM and/or on disk.
jazir5@reddit
Can you submit this system to RooCode on their GitHub? I think they would want to implement this very quickly.
dampflokfreund@reddit
Is it possible to implement this in llama.cpp?
LinkSea8324@reddit
Isn't it already implemented? https://github.com/ggml-org/llama.cpp/pull/9866
dampflokfreund@reddit
The limitation of this PR is that context reuse only works if the system prompt remains static. When you change it or other parts of the prompt, which is the case with RAG or with memory such as a vector DB, it will process the entire context again. This is what LMCache would solve.
__JockY__@reddit
Today I learned that people change the system prompt mid-session.
May I ask why this would be done?
sautdepage@reddit
This is indeed for multi-session where the prompt is similar but variable, like for a specific application or company endpoint.
Standard cache only caches the common "starts with" part -- like Claude's huge standard prompt is certainly cached fully for all requests.
Looking at the GitHub repo and the paper, it seems the key feature is that multiple chunks of context can be combined in a prompt, in any order, and each part can be retrieved from cache and stitched together.
So say the app creates a new prompt for a session by combining: 1) a standard prompt, 2) a user-specific prompt, 3) a feature- or usage-specific prompt, and 4) two or three RAG snippets relevant to that session. Most of them can now be retrieved from cache, if they've been seen before individually, to form the new whole context.
jazir5@reddit
So it just fragment-caches different sections and can reassemble them as needed, calling the individually cached parts in whatever order? Neat.
__JockY__@reddit
That’s actually super useful. Thanks for taking the time.
LinkSea8324@reddit
I could be misunderstanding something, but right now, vLLM got what?
--cache-reuse 0
just the prefix, according to ggerganov:
MoffKalast@reddit
Doesn't this mean that the VRAM/RAM usage for storing old cache will balloon into infinity? I mean KV cache is already most of what we need to allocate if you go for longer context.
LagOps91@reddit
Is that actually it? The PR is quite old, no? Sounds like something different.
JustImmunity@reddit
So, can I use this with vllm serve "model"?
Nice-Comfortable-650@reddit (OP)
Not currently
JustImmunity@reddit
Any plans?
MargretTatchersParty@reddit
Is this something that I can set up and run in Ollama/OpenWebUI today? How much work would it be to bring that in?
Baldur-Norddahl@reddit
I hope this gets adopted quickly into the major programs. It should really make a huge difference when using agentic coding locally such as Cline, Roo Code and Aider. We are likely uploading the same small pieces of source files over and over.
Does the technique allow automatic recognition of parts of the context that have been seen before? Say the agent presents a source file to the LLM and that results in a diff for modifying the file. On the next task we get the same file uploaded again; it might be slightly modified, but most lines would be unchanged. Could we fetch cached values for the unmodified lines instead of starting all over?
Nice-Comfortable-650@reddit (OP)
Right now the recognition is manual: you need to mark up the context to specify each chunk. This requires the agent programmer to slightly modify the input sent to the LLM API server.
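For illustration only (the separator string and helper below are hypothetical placeholders, not the exact LMCache API), chunk-level markup could look roughly like this:

```python
# Hypothetical sketch of chunk-delimited prompt construction. The real
# LMCache integration defines its own chunk/separator convention; the point
# is only that the caller marks chunk boundaries so each chunk's KV cache
# can be stored and reused independently of its position.
BLEND_SEP = "<CHUNK_SEP>"  # placeholder separator, not the real default

def build_blended_prompt(system_prompt: str, retrieved_docs: list[str], question: str) -> str:
    chunks = [system_prompt, *retrieved_docs, question]
    return BLEND_SEP.join(chunks)

prompt = build_blended_prompt(
    "You are a helpful assistant.",
    ["<contents of doc 7>", "<contents of doc 2>"],
    "What changed between the two designs?",
)
# The serving engine splits on the separator, looks up each chunk's KV cache
# by content, blends the hits, and only computes what is missing.
```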
Nice-Comfortable-650@reddit (OP)
Our repository is at: https://github.com/LMCache/LMCache !
rainbowColoredBalls@reddit
For the selective attention calculation, if I understand correctly, you drop the complexity from O(n^2) to O(n*k), where k is the number of new tokens and k << n?
Nice-Comfortable-650@reddit (OP)
This is correct!
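As a back-of-the-envelope illustration (the numbers below are made up for scale, not taken from the paper):

```python
# Rough attention-cost comparison for one layer during prefill.
n = 8000                # total context tokens (cached chunks + new query)
k = 800                 # tokens actually computed: the new query plus the small
                        # fraction of chunk tokens selected for recomputation
full_prefill = n * n    # every token attends to every token: O(n^2)
cacheblend   = k * n    # only the k recomputed tokens attend to all n: O(n*k)
print(full_prefill / cacheblend)  # 10.0 -> ~10x fewer attention scores
```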