[llama.cpp] Asymmetric KV q8/q4 cache: current caveats and discussion in GGML repo
Posted by Ueberlord@reddit | LocalLLaMA | 27 comments
Most of you are probably aware that starting llama.cpp with any quantized KV cache combination other than the symmetric `-ctk q8_0 -ctv q8_0` or `-ctk q4_0 -ctv q4_0` leads to prompt processing running on the CPU instead of the GPU, at least for CUDA. For example, with the frequently suggested mix of `-ctk q8_0 -ctv q4_0`, prompt processing speed tanks.
I discussed this with a proprietary LLM, and it suggested either making some slight modifications to the CUDA source code of llama.cpp or building with `cmake -DGGML_CUDA_FA_ALL_QUANTS=ON ..`, which makes compilation take a very long time.
Coincidentally, GitHub user sanmai ran a small eval and suggested including this KV cache quant combo in the compiled kernels even without FA_ALL_QUANTS, which would be great.
The discussion is here and worth a read: the eval confirms that the asymmetric 8/4-bit KV quant costs only 1.3% precision while saving more than half of the memory compared to f16/f16:
https://github.com/ggml-org/llama.cpp/discussions/23470
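To sanity-check the "more than half of memory" part, here is a rough back-of-the-envelope sketch in Python. It assumes the standard ggml block layouts (q8_0: 34 bytes per 32 values, q4_0: 18 bytes per 32 values, f16: 2 bytes per value) and a plain full-attention model where the K and V caches have the same number of elements. The function names and the model shape in the usage note are made up for illustration and are not taken from the linked discussion.

```python
# Rough KV cache size estimate for different -ctk / -ctv combinations.
# Assumption: standard ggml block layouts, given as (bytes per block, values per block).
BLOCK = {
    "f16": (2, 1),     # 2 bytes per value
    "q8_0": (34, 32),  # fp16 scale + 32 int8 values
    "q4_0": (18, 32),  # fp16 scale + 32 packed 4-bit values
}


def bytes_per_element(cache_type):
    """Average storage cost of one cached value for the given cache type."""
    nbytes, nelems = BLOCK[cache_type]
    return nbytes / nelems


def kv_cache_bytes(n_layers, n_kv_heads, head_dim, n_ctx, type_k, type_v):
    """Total K + V cache size in bytes for a simple full-attention model."""
    elems = n_layers * n_kv_heads * head_dim * n_ctx  # per cache (K or V)
    return elems * (bytes_per_element(type_k) + bytes_per_element(type_v))


def savings_vs_f16(type_k, type_v):
    """Fraction of memory saved compared to an f16/f16 cache."""
    mixed = bytes_per_element(type_k) + bytes_per_element(type_v)
    baseline = 2 * bytes_per_element("f16")
    return 1 - mixed / baseline


if __name__ == "__main__":
    print(f"q8_0/q4_0 saves {savings_vs_f16('q8_0', 'q4_0'):.1%} vs f16/f16")
```

Under these assumptions the q8_0/q4_0 mix averages 6.5 bits per cached value instead of 16, so it saves a bit under 60% relative to f16/f16. For a hypothetical model with 32 layers, 8 KV heads, head dimension 128 and a 32768-token context, `kv_cache_bytes` gives 4 GiB for f16/f16 and about 1.6 GiB for q8_0/q4_0. Models with sliding-window or hybrid attention will deviate from this simple estimate.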