ICYMI: llama.cpp b9455 -sm tensor KV Cache Fix is MERGED

Posted by Bulky-Priority6824@reddit | LocalLLaMA

Them boys can cook, one big fix after another!

If you're running -sm tensor on multi-GPU, this is the KV cache quantization fix:

https://github.com/ggml-org/llama.cpp/releases/tag/b9455

JohannesGaessler commented 5 days ago

This PR implements support for the combination of -sm tensor and quantized KV cache. The reason why this doesn't work on master is that the flattening of tensors for the KV cache rotation leads to the loss of shape information, which the meta backend cannot handle. There were previous PRs which resolved the issue by changing the shapes of the KV cache rotation, but that is an undesirable solution because batched matrix multiplications may not be as well-supported in ggml backends as a single large matrix multiplication. Also, it is generally better to extend the meta backend with capabilities to handle a compute graph than to require compute graphs to conform to the meta backend's requirements.

The approach in this PR is to extend the specification ggml_backend_meta_split_state with a value that specifies how often a given segment repeats. When a tensor is flattened the meta backend uses segments to specify the data layout within the flattened dimension so that upon a further reshape the correct data layout can be restored. No changes to the llama.cpp compute graphs are required.
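
For anyone trying to picture what "segments that repeat" means, here's a toy Python sketch of the idea as I read it. This is NOT the actual ggml code: `SplitState`, `flatten_split`, `owner_of` and `unflatten_split` are made-up names, and the real `ggml_backend_meta_split_state` almost certainly looks different. It assumes a 2D tensor whose columns are split across devices, then flattened to 1D.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class SplitState:
    """Toy stand-in for a split description; NOT the real ggml struct."""
    segments: Tuple[int, ...]  # elements owned by device 0, 1, ... within one period
    repeats: int               # how many times that period repeats when flattened


def flatten_split(rows: int, cols_per_device: Tuple[int, ...]) -> SplitState:
    """Flatten a (rows, cols) tensor whose columns are split across devices.

    After flattening, each original row becomes one period of segments,
    and the period repeats once per row.
    """
    return SplitState(segments=tuple(cols_per_device), repeats=rows)


def owner_of(state: SplitState, flat_index: int) -> int:
    """Return the device that owns one element of the flattened tensor."""
    period = sum(state.segments)
    if not 0 <= flat_index < period * state.repeats:
        raise IndexError("flat_index out of range")
    offset = flat_index % period  # position inside the current period
    for device, size in enumerate(state.segments):
        if offset < size:
            return device
        offset -= size
    raise AssertionError("unreachable")


def unflatten_split(state: SplitState, new_cols: int) -> Tuple[int, Tuple[int, ...]]:
    """Recover (rows, cols_per_device) when reshaping back to 2D.

    Simplification: only a reshape back to the original row length is handled.
    """
    period = sum(state.segments)
    if new_cols != period:
        raise ValueError("sketch only handles a reshape back to the original row length")
    return state.repeats, state.segments


# 3 rows x 4 columns, columns 0-1 on device 0 and columns 2-3 on device 1.
state = flatten_split(rows=3, cols_per_device=(2, 2))
print([owner_of(state, i) for i in range(12)])  # [0, 0, 1, 1, 0, 0, 1, 1, 0, 0, 1, 1]
print(unflatten_split(state, new_cols=4))       # (3, (2, 2))
```

Without the repeat count, the flattened tensor is just 12 elements with no record that ownership alternates every 2 elements. Keeping the segment sizes plus the repeat count is enough to rebuild the per-device layout on the next reshape, which is why the compute graph itself can stay as it is.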