If your Qwen2 GGUF is spitting nonsense, enable flash attention
Posted by noneabove1182@reddit | LocalLLaMA | 31 comments
As noted in this thread:
https://github.com/ggerganov/llama.cpp/issues/7805
Currently there's an issue with Qwen2's KV calculations at fp16 on CUDA.
This means that when you offload to CUDA, you'll end up with a bunch of gibberish in your output.
You can apply the patch that slaren suggested in that thread. Alternatively, because of the order of operations in the flash attention implementation, you can just enable flash attention and the problem goes away.
In llama.cpp this means passing the `-fa` flag
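For anyone scripting their runs, here is a minimal sketch of building a llama.cpp command line with `-fa` included. The binary name (`./llama-cli`), the model filename, and the `build_llama_command` helper are placeholders I made up for illustration; `-m`, `-ngl`, and `-p` are the usual model, GPU-layer, and prompt flags. Depending on your build, `-fa` may be a plain switch or may expect a value, so check `--help` first.

```python
import shlex


def build_llama_command(model_path, prompt, binary="./llama-cli", n_gpu_layers=99):
    """Build an argv list for a llama.cpp run with flash attention enabled.

    The binary name and defaults are assumptions; adjust them to your build.
    """
    return [
        binary,
        "-m", model_path,            # path to the GGUF file
        "-ngl", str(n_gpu_layers),   # layers to offload to the GPU
        "-fa",                       # enable flash attention (the workaround)
        "-p", prompt,
    ]


if __name__ == "__main__":
    # Hypothetical model filename, for illustration only.
    cmd = build_llama_command("qwen2-7b-instruct.gguf", "Hello")
    # Print a shell-safe command you can copy, or pass cmd to subprocess.run.
    print(shlex.join(cmd))
```

This only builds and prints the command. To actually launch it, hand the list to `subprocess.run(cmd)`.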
In LM Studio, if you expand the options panel on the right, there's a "flash attention" checkbox near the bottom.
This should make the models work fine :) It may not be an issue with the 72b model, since I never tried to confirm that, but it definitely is with 7b.