Technical question: can you mask/hide parts of the KV cache for a request?
Posted by Noxusequal@reddit | LocalLLaMA | View on Reddit | 12 comments
We have the following idea: have two LLM agents interact, but each also has an internal monologue.
It would suck to reload the whole context on every LLM request.
So there are two questions: 1. Can you incrementally update the KV cache, only adding the last line of dialogue that was produced by the last prompt? 2. Can you hide parts of the KV cache, so that we don't have to reload it between the agents taking turns? We could just hide the part of the KV cache that refers to the internal monologue of the other agent.
Does any implementation like this exist, or if not, would it even be technically possible?
Noxusequal@reddit (OP)
Do you have any idea if vLLM supports something similar? If we run big experiments with this, it would be nice to have the throughput optimization for the big clusters with multiple A100s. :D
vLLM would be nice for that.
But for testing purposes the llama.cpp implementation will be very helpful either way. So thank you for your help ^^
Chordless@reddit
I think you can get what you want with llama.cpp.
First off, it has smart caching: it will reuse as much of the KV cache from the previous request as possible. Though if your two agents have lots of different internal monologue in their context, that will invalidate most of the cache and require prompt processing almost from scratch every time.
The second piece of the puzzle is to set up llama.cpp with two "slots" of context and let each agent use a slot exclusively. This effectively splits the KV cache in two, so each agent's context doesn't overwrite the cache of the other one. The downside is that you either need twice as much VRAM for context, or you let your agents run with half as much context as you originally planned. There is an upside though: the slot feature in llama.cpp is actually meant for parallel processing. If your two agents make calls to the LLM in parallel, llama.cpp will process them in parallel, giving nearly 2x the usual tokens per second.
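Roughly what that could look like from the client side, as a sketch only: it assumes the server was started with two parallel slots (e.g. llama-server -m model.gguf -c 16384 -np 2), and the id_slot field follows the llama.cpp /completion API, though the exact field name has varied between server versions.
```python
# Sketch: pin each agent to its own llama.cpp server slot.
# Assumes the server was started with two slots, e.g.:
#   llama-server -m model.gguf -c 16384 -np 2
# "id_slot" follows the current llama.cpp /completion API; older builds
# used "slot_id" instead.
import requests

SERVER = "http://localhost:8080/completion"

def generate(prompt: str, slot: int) -> str:
    resp = requests.post(SERVER, json={
        "prompt": prompt,
        "n_predict": 256,
        "id_slot": slot,   # agent 0 always uses slot 0, agent 1 always uses slot 1
    })
    resp.raise_for_status()
    return resp.json()["content"]

reply_a = generate("Agent A's full context so far...", slot=0)
reply_b = generate("Agent B's full context so far...", slot=1)
```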
Noxusequal@reddit (OP)
Thank you :) Okay, so the question is about using the second approach. With slots, I could give each agent the full context, but only with its own internal monologue. Then when one agent produces an answer, we can take that answer, strip it of the internal thoughts, and add it to the second slot.
If I understand smart caching correctly, this should reduce the amount of prompt processing, since at each step only a small part is added. Am I getting this right?
Chordless@reddit
My understanding is yes, at each step only the latest message will need to undergo prompt processing, since the whole rest of the context is already processed and in the KV cache (for that slot).
Noxusequal@reddit (OP)
Thank you :) Is smart caching a feature you have to enable, or is it built in and active by default?
Chordless@reddit
You need to pass an extra boolean parameter, cache_prompt, to the llama.cpp API on every request. Something like this (a rough sketch; whether it defaults to on or off depends on the server version):
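```python
# Minimal /completion request with prompt caching enabled.
# "cache_prompt" asks the server to reuse the longest matching prefix
# already sitting in that slot's KV cache instead of reprocessing it.
import requests

resp = requests.post("http://localhost:8080/completion", json={
    "prompt": "full conversation for this agent so far...",
    "n_predict": 256,
    "cache_prompt": True,
})
print(resp.json()["content"])
```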
Noxusequal@reddit (OP)
Thanks :) So basically just cache_prompt: true?
Also, one question about slots and passing prompts to llama.cpp if I want to treat the slots as their own conversations: do I always need to pass the full context, or can I enable something so it works like a chat, where I only pass the next sentence to the slot and the rest of the context is already in there?
I know that with cache_prompt the performance shouldn't really differ, only how I write the code. And I think it would look cleaner if I only had to pass the latest addition from slot to slot.
Chordless@reddit
As far as I know, you need to pass the full conversation context every time. It's at most about 100 KB of text though, so performance really shouldn't be affected.
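Putting the pieces together, the turn loop could look roughly like this. It's only a sketch: the <thought>...</thought> tags for marking internal monologue are a made-up convention, and id_slot/cache_prompt are the llama.cpp server parameters mentioned above.
```python
# Sketch of the two-agent loop: each agent keeps its own transcript including
# its private monologue; the other agent only ever sees the stripped reply.
import re
import requests

SERVER = "http://localhost:8080/completion"

def generate(prompt: str, slot: int) -> str:
    resp = requests.post(SERVER, json={
        "prompt": prompt,
        "n_predict": 256,
        "id_slot": slot,        # keep each agent in its own slot
        "cache_prompt": True,   # reuse that slot's KV cache
    })
    resp.raise_for_status()
    return resp.json()["content"]

def strip_monologue(text: str) -> str:
    # Hypothetical convention: the model wraps private thoughts in <thought> tags.
    return re.sub(r"<thought>.*?</thought>", "", text, flags=re.DOTALL).strip()

# transcripts[0] is agent A's full context, transcripts[1] is agent B's.
transcripts = [
    "You are agent A. Wrap private thoughts in <thought>...</thought>.\n",
    "You are agent B. Wrap private thoughts in <thought>...</thought>.\n",
]

for turn in range(6):
    speaker, listener = turn % 2, 1 - turn % 2
    reply = generate(transcripts[speaker], slot=speaker)
    transcripts[speaker] += reply + "\n"                    # keeps its own thoughts
    transcripts[listener] += strip_monologue(reply) + "\n"  # sees only the public part
```
Since only the newly appended lines are missing from each slot's cache, the prompt processing per turn stays small, which is the whole point here.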
nero10579@reddit
What software are you inferencing with?
Noxusequal@reddit (OP)
Well, that's not fully decided yet. If there is one that makes this (or part of it) easy, I would use that one.
Currently I was thinking vLLM, but other suggestions are welcome.
tronathan@reddit
I think if you submit two queries and they share a lot of the prefix, the KV cache is generally smart enough to reuse what it can. As soon as the two queries diverge though, all tokens after that point will miss the cache, so those will be a bit slower.
I'm kinda hand-waving here, but I think the KV cache will Just Work and you can send different requests with similar/overlapping prompts; as long as your agent-specific instructions are at the end of your prompt, you should be good.
Someone please correct me, I'm probably oversimplifying.
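Just to illustrate the ordering point (not any particular API): if the shared dialogue comes first and the agent-specific part comes last, the two prompts share the longest possible cached prefix.
```python
# Shared prefix: identical for both agents, so its KV cache entries can be reused.
shared_dialogue = (
    "Alice: Hi, shall we plan the trip?\n"
    "Bob: Sure, where do you want to go?\n"
)

# Agent-specific instructions go last, so only this short tail misses the cache.
prompt_a = shared_dialogue + "You are Alice. Think privately, then reply to Bob.\n"
prompt_b = shared_dialogue + "You are Bob. Think privately, then reply to Alice.\n"
```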
Noxusequal@reddit (OP)
That would be very sweet if true :)