ICYMI: llama.cpp b9455 -sm tensor KV Cache Fix is MERGED

Posted by Bulky-Priority6824@reddit | LocalLLaMA | 13 comments

Them boys can cook, one big fix after another!

If you're running -sm tensor on multi-GPU, this is the KV cache quantization fix:

https://github.com/ggml-org/llama.cpp/releases/tag/b9455

JohannesGaessler commented 5 days ago

This PR implements support for the combination of -sm tensor and quantized KV cache. The reason why this doesn't work on master is that the flattening of tensors for the KV cache rotation leads to the loss of shape information, which the meta backend cannot handle. There were previous PRs which resolved the issue by changing the shapes of the KV cache rotation, but that is an undesirable solution because batched matrix multiplications may not be as well-supported in ggml backends as a single large matrix multiplication. Also, it is generally better to extend the meta backend with capabilities to handle a compute graph than to require compute graphs to conform to the meta backend's requirements.

The approach in this PR is to extend the specification ggml_backend_meta_split_state with a value that specifies how often a given segment repeats. When a tensor is flattened the meta backend uses segments to specify the data layout within the flattened dimension so that upon a further reshape the correct data layout can be restored. No changes to the llama.cpp compute graphs are required.
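
To make the segment idea concrete, here is a minimal Python sketch of how a repeat count could describe a split dimension after flattening. It is an illustration only, not the ggml code: SplitLayout, flatten, owner, and unflatten are hypothetical names, and the real ggml_backend_meta_split_state may be organized quite differently. The assumption is a dimension of 8 heads split 4/4 across two GPUs that gets flattened together with a token dimension.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class SplitLayout:
    """Hypothetical stand-in for a split state; not the real ggml struct."""
    segments: Tuple[int, ...]  # elements owned by device 0, 1, ... in one pattern
    repeat: int = 1            # how often that pattern of segments repeats

    @property
    def size(self) -> int:
        return sum(self.segments) * self.repeat


def flatten(inner: SplitLayout, n_outer: int) -> SplitLayout:
    """Merge a split inner dimension with an unsplit outer one of length n_outer."""
    return SplitLayout(inner.segments, inner.repeat * n_outer)


def owner(layout: SplitLayout, index: int) -> int:
    """Return the device that holds element `index` of the flattened dimension."""
    if not 0 <= index < layout.size:
        raise IndexError(index)
    offset = index % sum(layout.segments)  # position inside one pattern
    for device, length in enumerate(layout.segments):
        if offset < length:
            return device
        offset -= length
    raise AssertionError("unreachable")


def unflatten(layout: SplitLayout, n_outer: int) -> SplitLayout:
    """Undo flatten(): recover the split of the inner dimension."""
    if layout.repeat % n_outer != 0:
        raise ValueError("outer length does not match the repeat count")
    return SplitLayout(layout.segments, layout.repeat // n_outer)


heads = SplitLayout(segments=(4, 4))  # 8 heads, 4 per GPU (assumed sizes)
flat = flatten(heads, n_outer=3)      # flattened together with 3 tokens
restored = unflatten(flat, n_outer=3)
```

Because the flattened layout keeps both the segments and the repeat count, a later reshape can recover the original per-GPU split instead of seeing one opaque dimension of 24 elements.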

segmond@reddit

I tried this with a few models on multi GPU, crashes.

Bulky-Priority6824@reddit (OP)

Damn. Working like a charm here on Windows 11 and Debian. What's your config?

ABLPHA@reddit

Aaand it still leaks memory with ROCm...

Memory access fault by GPU node-2 (Agent handle: 0x5573d6b0aae0) on address 0x7fee3d048000. Reason: Page not present or supervisor privilege.

farkinga@reddit

YES. I have been F5ing this for a few weeks. So excited.

dondiegorivera@reddit

Is this connected to Turboquant?

Bulky-Priority6824@reddit (OP)

Not that I'm aware. This is for --cache-type-k q8_0 --cache-type-v q8_0 on configs using -sm tensor GPU parallelism.

Constant-Simple-1234@reddit

Great, I was trying it today and wondering when they would fix it.

Legitimate-Dog5690@reddit

Ah, you fiend. I came here to post the exact same thing 😃 I've been sat on the fix CL for a few days now and it looks to be working great. Very happy this has got in. It gave me a much bigger boost than MTP (compared to mod) and requires no extra resources.

Bulky-Priority6824@reddit (OP)

Saved me about 1.2 GB of VRAM on Qwen 3.6 35B. Now I can spend that on another setting.
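
For anyone wondering where a saving like that comes from, here is a rough back-of-the-envelope sketch in Python. It relies on the q8_0 block format storing 32 values in 34 bytes (32 int8 values plus one 2-byte scale), versus 2 bytes per value for f16. The layer count, KV head count, head size, and context length below are made-up placeholders, not the real Qwen config, and kv_cache_bytes is a hypothetical helper, so treat the numbers as an illustration rather than a prediction for any specific model.

```python
def kv_cache_bytes(n_layers: int, n_kv_heads: int, head_dim: int,
                   n_ctx: int, cache_type: str) -> int:
    """Estimate the size of K and V caches together, in bytes."""
    # One K and one V value per layer, KV head, head element, and token.
    elements = 2 * n_layers * n_kv_heads * head_dim * n_ctx
    if cache_type == "f16":
        return elements * 2            # 2 bytes per value
    if cache_type == "q8_0":
        return elements // 32 * 34     # 34 bytes per block of 32 values
    raise ValueError(f"unsupported cache type: {cache_type}")


# Placeholder dimensions (assumptions, not taken from any real model card).
dims = dict(n_layers=32, n_kv_heads=8, head_dim=128, n_ctx=32768)

f16_bytes = kv_cache_bytes(cache_type="f16", **dims)
q8_bytes = kv_cache_bytes(cache_type="q8_0", **dims)
saved_gib = (f16_bytes - q8_bytes) / 2**30

print(f"f16:  {f16_bytes / 2**30:.3f} GiB")
print(f"q8_0: {q8_bytes / 2**30:.3f} GiB")
print(f"saved: {saved_gib:.3f} GiB")
```

The ratio is what matters: q8_0 needs 34/32 bytes per value instead of 2, so the KV cache shrinks to roughly 53% of its f16 size regardless of the model dimensions you plug in.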

Legitimate-Dog5690@reddit

Yeah, I was the same regarding MTP. I got the same token speed with ngram_mod for zero VRAM cost, which let me jump up a quantisation instead. Stacking a couple of types of spec decoding sounds like the winner long term. I'm sure that was implemented somewhere.
