ICYMI: llama.cpp b9455 --sm tensor KV Cache Fix is MERGED
Posted by Bulky-Priority6824@reddit | LocalLLaMA | 13 comments
Them boys can cook, one big fix after another!
If you're running --sm tensor on multi-GPU, this is the KV cache quantization fix:
https://github.com/ggml-org/llama.cpp/releases/tag/b9455
JohannesGaessler commented 5 days ago
This PR implements support for the combination of -sm tensor and quantized KV cache. The reason why this doesn't work on master is that the flattening of tensors for the KV cache rotation leads to the loss of shape information which the meta backend cannot handle. There were previous PRs which resolved the issue by changing the shapes of the KV cache rotation but that is an undesirable solution because batched matrix multiplications may not be as well-supported in ggml backends as a single large matrix multiplication. Also it is generally better to extend the meta backend with capabilities to handle a compute graph than to require compute graphs to conform to the meta backend's requirements.
The approach in this PR is to extend the specification ggml_backend_meta_split_state with a value that specifies how often a given segment repeats. When a tensor is flattened the meta backend uses segments to specify the data layout within the flattened dimension so that upon a further reshape the correct data layout can be restored. No changes to the llama.cpp compute graphs are required.
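To make the "segments plus repeat count" idea concrete, here is a minimal Python sketch. It is not the real ggml_backend_meta_split_state layout: SplitState, flatten, owner and unflatten are hypothetical names invented for illustration. It assumes a row-major (rows, cols) tensor split across devices by columns, so that after flattening the per-device column segments repeat once per row.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class SplitState:
    """Hypothetical stand-in for a split description (not the real ggml struct)."""
    segments: Tuple[int, ...]  # elements owned by device 0, 1, ... within one period
    repeats: int               # how many times that period repeats along the dimension


def flatten(rows: int, col_split: Tuple[int, ...]) -> SplitState:
    # Flattening (rows, cols) to 1D keeps the per-row column split,
    # but the pattern now repeats once for every row.
    return SplitState(segments=col_split, repeats=rows)


def owner(state: SplitState, flat_index: int) -> int:
    # Map an index in the flattened dimension to the device that owns it.
    period = sum(state.segments)
    if not 0 <= flat_index < period * state.repeats:
        raise IndexError(flat_index)
    offset = flat_index % period
    for device, size in enumerate(state.segments):
        if offset < size:
            return device
        offset -= size
    raise AssertionError("unreachable")


def unflatten(state: SplitState, rows: int) -> Tuple[int, ...]:
    # Reshaping back to (rows, cols) recovers the column split, provided
    # each row corresponds to exactly one period of the segment pattern.
    if rows != state.repeats:
        raise ValueError("reshape does not line up with the recorded repeats")
    return state.segments


state = flatten(rows=4, col_split=(3, 5))   # device 0 owns 3 columns, device 1 owns 5
print(owner(state, 2), owner(state, 3), owner(state, 8))
print(unflatten(state, rows=4))
```

Without the repeat count, the flattened dimension would only say "3 elements, then 5 elements" and the ownership of the remaining rows would be lost, which is the shape-information problem the PR text describes.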
segmond@reddit
I tried this with a few models on multi GPU, crashes.
Bulky-Priority6824@reddit (OP)
Damn. Working like a charm here on Windows 11 and Debian. What's your config?
ABLPHA@reddit
Aaand it still leaks memory with ROCm...
farkinga@reddit
YES. I have been F5ing this for a few weeks. So excited.
dondiegorivera@reddit
Is this connected to Turboquant?
Bulky-Priority6824@reddit (OP)
Not that I'm aware. This is for --cache-type-k q8_0 --cache-type-v q8_0 on configs using --sm tensor GPU parallelism.
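For a rough sense of what a q8_0 KV cache saves over f16, a back-of-the-envelope sketch in Python follows. The model dimensions are hypothetical placeholders, not those of any model mentioned in this thread, and the formula assumes plain attention where every layer caches K and V for every token. The one hard number is the q8_0 block size: 32 int8 values plus one fp16 scale, i.e. 34 bytes per 32 elements.

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, n_ctx, bytes_per_elem):
    # K and V each hold n_kv_heads * head_dim values per token per layer,
    # hence the leading factor of 2.
    return 2 * n_layers * n_kv_heads * head_dim * n_ctx * bytes_per_elem


F16_BYTES = 2.0        # 16 bits per element
Q8_0_BYTES = 34 / 32   # 32 int8 values + one fp16 scale per block

# Hypothetical model dimensions, chosen only for illustration.
dims = dict(n_layers=32, n_kv_heads=8, head_dim=128, n_ctx=32768)

f16 = kv_cache_bytes(**dims, bytes_per_elem=F16_BYTES)
q8 = kv_cache_bytes(**dims, bytes_per_elem=Q8_0_BYTES)

gib = 1024 ** 3
print(f"f16:   {f16 / gib:.3f} GiB")
print(f"q8_0:  {q8 / gib:.3f} GiB")
print(f"saved: {(f16 - q8) / gib:.3f} GiB ({1 - q8 / f16:.1%})")
```

Whatever the dimensions, the ratio is fixed at 34/64 of the f16 size, so the saving is a little under half of the f16 cache. With --sm tensor that total is spread across the GPUs rather than held on each one.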
Constant-Simple-1234@reddit
Great, I was trying it today and wondering when they would fix it.
Legitimate-Dog5690@reddit
Ah, you fiend. I came here to post the exact same thing 😃 I've been sat on the fix CL for a few days now and it looks to be working great. Very happy this has got in: it gave me a much bigger boost than MTP (compared to mod) and requires no extra resources.
Bulky-Priority6824@reddit (OP)
Saved me about 1.2 GB of VRAM on Qwen 3.6 35B, so now I can spend that on another setting.
Legitimate-Dog5690@reddit
Yeah, I was the same regarding MTP. I got the same token speed with ngram_mod for zero VRAM cost, which let me jump up a quantisation instead. Stacking a couple of types of spec decoding sounds like the winner long term; I'm sure that was implemented somewhere.
Similar-Ad5933@reddit
Hell yeah. Hopefully backend sampling with tensor split mode is next on someone's table.
kiwibonga@reddit
Nice. Had been waiting for this. Wanted to try -sm tensor for the longest time but CUDA 13.2 was too buggy, and when 13.3 came out this bug came up.
crossoverXYZ@reddit
finally, been waiting for this one. the KV cache quantization was basically broken on multi-gpu with --sm tensor. huge fix for anyone running 70B+ models across cards