If your Qwen2 GGUF is spitting nonsense, enable flash attention

Posted by noneabove1182@reddit | LocalLLaMA

As noted in this thread: https://github.com/ggerganov/llama.cpp/issues/7805

Currently there's an issue with Qwen2's KV calculations at fp16 on CUDA. This means that when offloading to CUDA, you'll end up with a bunch of gibberish in your output.

You can apply the patch that slaren suggested, or, because of the order of operations performed in the flash attention implementation, you can just enable that to make it work.

In llama.cpp this means passing the `-fa` flag. In LM Studio, if you expand your options on the right, near the bottom is a "flash attention" checkbox.

This should make them work fine :)

It may not be an issue with the 72b model (I never tried to confirm), but it definitely is with 7b.
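For the llama.cpp route, here's a rough sketch of what the invocation could look like. The binary name and model path are placeholders I made up (older builds call the binary `./main`, newer ones `./llama-cli`), and `-ngl 99` is just one way to offload every layer to the GPU. Some newer builds may expect a value after `-fa`, so check `--help` on yours if the flag is rejected.

```shell
#!/bin/sh
# Hypothetical defaults: point these at your own build and model file.
LLAMA_BIN="${LLAMA_BIN:-./llama-cli}"   # older builds name this binary ./main
MODEL="${MODEL:-./qwen2-7b-instruct.gguf}"
PROMPT="Write one sentence about llamas."

# -ngl 99 : offload all layers to the GPU (the setup where the gibberish shows up)
# -fa     : enable flash attention, the workaround from the post
FLAGS="-ngl 99 -fa"

echo "Would run: $LLAMA_BIN -m $MODEL $FLAGS -p \"$PROMPT\""

# Only launch if the binary and model are actually there, so this is safe to dry-run.
if [ -x "$LLAMA_BIN" ] && [ -f "$MODEL" ]; then
  # $FLAGS is left unquoted on purpose so it splits into separate arguments
  "$LLAMA_BIN" -m "$MODEL" $FLAGS -p "$PROMPT"
fi
```

If the output is readable with `-fa` and gibberish without it, you're most likely hitting the issue from the linked thread.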