llama.cpp: owners of old GPUs wanted for performance testing

Posted by Remove_Ayys@reddit | LocalLLaMA | 93 comments

I created [a pull request that refactors and optimizes the llama.cpp IQ CUDA kernels](https://github.com/ggerganov/llama.cpp/pull/8215) for generating tokens. These kernels use the `__dp4a` instruction (per-byte integer dot product), which is only available on NVIDIA GPUs with compute capability 6.1 or higher. Older GPUs are supported via a workaround that does the same calculation using other instructions.

However, during testing it turned out that (on modern GPUs) this workaround is faster than the kernels that master currently uses on old GPUs for legacy quants and k-quants. So I changed the default for old GPUs to the `__dp4a` workaround.

The problem is that I don't actually own any old GPUs that I could use for performance testing, so I'm asking people who have such GPUs to report how the PR compares against master. Relevant GPUs are P100s or Maxwell or older. Relevant models are legacy quants and k-quants. If possible, please run the `llama-bench` utility to obtain the results.
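For readers unfamiliar with `__dp4a`, here is a minimal CPU-side sketch in C++ of what a per-byte integer dot product computes with signed arguments: each 32-bit integer is treated as four packed signed 8-bit values, the values are multiplied lane by lane, and the four products are added to an accumulator. The names `dp4a_reference` and `pack4` are hypothetical helpers made up for this illustration. This is not the code from the PR, and the actual workaround in llama.cpp may be written differently.

```cpp
#include <cstdint>

// Hypothetical helper (not llama.cpp code): pack four signed 8-bit values
// into one 32-bit integer, with x0 in the lowest byte.
static int pack4(int8_t x0, int8_t x1, int8_t x2, int8_t x3) {
    const uint32_t u = static_cast<uint32_t>(static_cast<uint8_t>(x0))
                     | static_cast<uint32_t>(static_cast<uint8_t>(x1)) << 8
                     | static_cast<uint32_t>(static_cast<uint8_t>(x2)) << 16
                     | static_cast<uint32_t>(static_cast<uint8_t>(x3)) << 24;
    return static_cast<int>(u);
}

// Hypothetical CPU reference for a signed per-byte dot product with accumulate:
// result = c + sum over i of (byte i of a) * (byte i of b), bytes read as int8.
static int dp4a_reference(int a, int b, int c) {
    for (int i = 0; i < 4; ++i) {
        // Extract byte i of each operand and reinterpret it as a signed 8-bit value.
        const int8_t va = static_cast<int8_t>((static_cast<uint32_t>(a) >> (8 * i)) & 0xFF);
        const int8_t vb = static_cast<int8_t>((static_cast<uint32_t>(b) >> (8 * i)) & 0xFF);
        c += static_cast<int>(va) * static_cast<int>(vb);
    }
    return c;
}
```

As a worked example, `dp4a_reference(pack4(1, -2, 3, -4), pack4(5, 6, -7, 8), 10)` evaluates to 10 + (5 - 12 - 21 - 32) = -50. On GPUs with the instruction, all of this is a single operation, which is why doing the same work with several ordinary instructions on older GPUs needs to be measured rather than assumed to be fast or slow.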