If your Qwen2 GGUF is spitting nonsense, enable flash attention

Posted by noneabove1182@reddit | LocalLLaMA | 31 comments

As noted in this thread: https://github.com/ggerganov/llama.cpp/issues/7805

There's currently an issue with Qwen2's KV calculations at fp16 on CUDA. This means that when you offload to CUDA, you'll end up with a bunch of gibberish in your output.

You can apply the patch that slaren suggested or, because of the order of operations performed in the flash attention implementation, you can just enable that to make it work.

In llama.cpp this means passing the `-fa` flag.

In LM Studio, if you expand your options on the right, near the bottom is a "flash attention" checkbox.

This should make them work fine :)

It may not be an issue with the 72b model (I never tried to confirm), but it definitely is with 7b.
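For anyone scripting llama.cpp runs, here is a minimal sketch of assembling a command line with `-fa` switched on. The binary name `./llama-cli`, the model filename, and the helper `build_llama_cmd` are placeholders made up for illustration, not anything from the linked issue. `-m`, `-ngl`, `-p` and `-fa` are the usual llama.cpp options, but some builds may expect a value after `-fa`, so check `--help` for your version.

```python
import shlex


def build_llama_cmd(binary, model_path, prompt, n_gpu_layers=99, flash_attn=True):
    """Return an argv list for a llama.cpp CLI run (hypothetical helper)."""
    cmd = [binary, "-m", model_path, "-ngl", str(n_gpu_layers), "-p", prompt]
    if flash_attn:
        # The flag the post recommends as a workaround for the gibberish output.
        cmd.append("-fa")
    return cmd


# Placeholder binary and model names; substitute your own paths.
cmd = build_llama_cmd("./llama-cli", "Qwen2-7B-Instruct-Q5_K_M.gguf", "Hello")
print(shlex.join(cmd))
```

The list can be passed directly to `subprocess.run(cmd)` once the paths point at real files.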

31 Comments

digitus1978@reddit

My hero! I was literally trying to get NVIDIA's Nemo to figure it out lol.

East-Awareness-249@reddit

I'm using oobabooga and that didn't fix it for me. I'm still getting the same repeated "Blockly" response.

berserker285714@reddit

Hi, I was facing the same issue as you. I'd say it might be a problem with oobabooga's text-generation-webui. I used the same Qwen2 7B model in LM Studio with FA enabled and it works fine. But in oobabooga, all responses are "Blockly" even with FA enabled. You'd better wait for an oobabooga update. I hope this feedback isn't too late. Also, I'm from China and my English is not good (maybe I should say "very terrible" lol), so if I said something wrong or upset you, please forgive me.

East-Awareness-249@reddit

It's been over 2 weeks, so I gave up on the Qwen2 model. Also, your English is great. Don't apologise for ignorant people.

noneabove1182@reddit (OP)

That's odd :( are you sure it applied? Does it generate properly if you don't offload any layers?

East-Awareness-249@reddit

To be honest, I gave up on the model. Will it work once llama.cpp pushes this patch? https://github.com/ggerganov/llama.cpp/issues/7805#issuecomment-2153349963

noneabove1182@reddit (OP)

It should, yes, but I'm confused why flash attention didn't work for you :/

Sadeghi85@reddit

ExLlamaV2 is also affected, any workaround for that?

ReturningTarzan@reddit

I'm currently looking into it. It does look like an attention weight overflow issue, as identified by the lcpp people. For ExLlama, flash-attn works, but the matmul attention it falls back to if flash-attn isn't available doesn't, likely because it doesn't upcast the attention weights. Upcasting would probably work, but it's going to consume a lot more VRAM, so it's not a great solution.

xformers might have the answer for GPUs that can't run flash-attn, and it is supported, but I can't currently install it to confirm because it seems to be incompatible with CUDA 12.5. And I can't downgrade to CUDA 12.3 because that requires gcc12, and that would break everything that depends on gcc13.

It's just a never-ending cascade of broken dependencies right now. On all fronts. Github Windows runners are also broken at the moment, so whatever the fix for Qwen2 is, I can't create new Windows wheels until NVIDIA backports VS2022 17.10 support to older versions of CUDA or Github rolls back the upgrade. Or something. I don't know.
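To make the overflow and the upcasting trade-off concrete, here is a toy sketch in plain Python that emulates fp16 rounding with the `struct` module's half-precision format. The vectors are made up and deliberately large, and the three variants (fp16 all the way, upcast accumulation, scaling the query first) are only an illustration of how the order of operations can matter. Whether any real kernel in llama.cpp or ExLlamaV2 avoids the overflow in exactly this way is an assumption, not something confirmed in this thread.

```python
import math
import struct


def to_fp16(x):
    """Round a Python float to the nearest fp16 value; overflow becomes +/-inf."""
    try:
        return struct.unpack("<e", struct.pack("<e", x))[0]
    except OverflowError:
        return math.copysign(math.inf, x)


def dot_fp16(q, k):
    """Dot product where every product and partial sum is rounded to fp16."""
    acc = 0.0
    for a, b in zip(q, k):
        acc = to_fp16(acc + to_fp16(a * b))
    return acc


def dot_upcast(q, k):
    """Dot product accumulated in Python floats (a stand-in for fp32)."""
    return sum(a * b for a, b in zip(q, k))


head_dim = 128
scale = 1.0 / math.sqrt(head_dim)
q = [30.0] * head_dim  # made-up, deliberately large activations
k = [30.0] * head_dim

# Variant 1: multiply and accumulate in fp16, apply the softmax scale afterwards.
score_fp16 = to_fp16(dot_fp16(q, k) * scale)
# Variant 2: accumulate in wider precision, then scale.
score_upcast = dot_upcast(q, k) * scale
# Variant 3: stay in fp16 but scale the query before the dot product.
score_prescaled = dot_fp16([to_fp16(a * scale) for a in q], k)

print("fp16, scale after:", score_fp16)
print("upcast accumulate:", score_upcast)
print("fp16, scale first:", score_prescaled)
```

The first variant blows past the fp16 maximum of 65504 partway through the sum, while the other two stay finite. The upcast variant needs wider intermediate storage, which is the extra VRAM cost mentioned above.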

thigger@reddit

Sounds fun - thank you! Is there any interaction with using quantized cache? (Apologies, I don't know much about the internals.) I have flash attention installed and I think it's using it, but it seems to break down with longer inputs (I'm summarizing very long sets of documents, chunked into ~20k tokens).

ReturningTarzan@reddit

There isn't a direct interaction with the cache, but it's possible the keys/values end up in a range or a distribution that doesn't quantize well. It does seem like it's more impacted by cache quantization than other models, after a few quick tests.

ReturningTarzan@reddit

The Q4 cache still seems to work alright on Qwen2-72B. It's specifically Qwen2-7B that's having issues, but after a lot of testing and experimenting, I feel pretty confident that, at least without a different quantization method, the 4-bit cache just doesn't have the precision required to compress the already very small keys. I've added a Q8 cache mode, though, which is currently on the dev branch. Because of the lower dimension, Qwen2-7B with Q8 cache still takes up 17% less space per token than Llama3-8B with Q4 cache, so still very usable this way, and I'm unable to measure any real difference between Q8 and FP16.
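As a rough illustration of why 8 bits leaves more headroom than 4, here is a generic symmetric absmax quantization round trip. This is not ExLlamaV2's actual cache format; the `keys` values, the function names, and the single shared scale are all assumptions chosen to keep the sketch short.

```python
import math


def quantize_roundtrip(values, bits):
    """Symmetric absmax quantization to `bits` bits, then dequantization."""
    qmax = 2 ** (bits - 1) - 1
    amax = max(abs(v) for v in values)
    scale = amax / qmax if amax > 0 else 1.0
    restored = []
    for v in values:
        # Round to the nearest integer level and clamp to the representable range.
        q = max(-qmax, min(qmax, round(v / scale)))
        restored.append(q * scale)
    return restored


def max_abs_error(values, bits):
    """Largest absolute difference between the input and its round trip."""
    restored = quantize_roundtrip(values, bits)
    return max(abs(a - b) for a, b in zip(values, restored))


# Made-up small "key" values, purely for demonstration.
keys = [0.05 * math.sin(i) for i in range(64)]
err4 = max_abs_error(keys, 4)
err8 = max_abs_error(keys, 8)
print(f"4-bit max error: {err4:.6f}")
print(f"8-bit max error: {err8:.6f}")
```

With only 15 levels at 4 bits versus 255 at 8 bits, the worst-case rounding error per value shrinks by roughly the same ratio.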

ReturningTarzan@reddit

So, a little update: It does appear that Qwen2 is unusually sensitive to cache quantization. It's hard to say why, but one likely suspect is that the 7B version uses very aggressive GQA.

Quantization of any kind exploits redundancy in data. So even if the values are 16 bits each, perhaps they only encode 4 bits of useful information. Then there might be a method to compressing them down to 4 bits.

The thing about Qwen2-7B is that it only has 4 key/value heads and 28 layers. That puts it at 56 kB of key/value data over all layers, per token. Compare it to other models in the same class:

- Qwen2-7B: 56 kB per token
- Llama3-8B: 128 kB per token
- Granite-8B: 144 kB per token
- Gemma-7B: 448 kB per token
- Llama2-7B: 512 kB per token
- Qwen-7B: 512 kB per token

So one possibility is that it just can't be compressed that much further. On the other hand it also means there's less of a need to quantize the cache in the first place. I'll see if there's anything that can be done to accommodate it, or if maybe it needs a Q6 or Q8 mode.
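The per-token figures above can be reproduced with a short calculation: keys plus values, times layers, times key/value heads, times head dimension, times 2 bytes for FP16. Only Qwen2-7B's 4 key/value heads and 28 layers come from the comment itself; the head dimensions and the other models' layer and head counts below are assumptions based on their published configs as commonly reported, so treat the table as a sketch.

```python
def kv_bytes_per_token(n_layers, n_kv_heads, head_dim, bytes_per_value=2):
    """Bytes of key/value cache per token across all layers (FP16 by default)."""
    # The leading 2 accounts for storing both a key and a value vector.
    return 2 * n_layers * n_kv_heads * head_dim * bytes_per_value


# (layers, key/value heads, head dimension); assumed values except where noted.
MODELS = {
    "Qwen2-7B": (28, 4, 128),   # layers and KV heads as stated in the comment
    "Llama3-8B": (32, 8, 128),
    "Granite-8B": (36, 8, 128),
    "Gemma-7B": (28, 16, 256),
    "Llama2-7B": (32, 32, 128),
    "Qwen-7B": (32, 32, 128),
}

sizes_kb = {name: kv_bytes_per_token(*cfg) // 1024 for name, cfg in MODELS.items()}
for name, kb in sizes_kb.items():
    print(f"{name}: {kb} kB per token")
```

Multiplying any of these by the context length gives the FP16 cache size for a full context, before any cache quantization.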

thigger@reddit

Thanks - I had a go at FP16 cache (unfortunately, as a result it *just* starts using shared memory at 32k context as I'm on 48 GB). It's a bit better, though still doing some weird repetitions, which must be a result of my prompt and data (it's quite good at making some models misbehave).

In terms of keeping a 4-bit Qwen2 72B inside 48 GB with 32k context, Q8 would be great if it works!

noneabove1182@reddit (OP)

> It's just a never-ending cascade of broken dependencies right now

The pain :') Damned if you make all your own integrations, damned if you use existing packages... Much appreciation for your struggles, loving the updates to exl2 lately.

Sadeghi85@reddit

Thanks for looking.

thigger@reddit

What exactly are you seeing? I only had a quick go last night with the Int4 GPTQ and q4 cache; it initially seemed to work but I had some real problems with longer contexts. I've not seen an issue on exllamav2 yet?

Sadeghi85@reddit

Which exl2 did you download?

thigger@reddit

There weren't any exl2s up when I was first trying (last night) so I downloaded the Int4 GPTQ. I'll try to have a go with an EXL2 today if people are finding that they work.

noneabove1182@reddit (OP)

I have the 7b up here: https://huggingface.co/bartowski/Qwen2-7B-Instruct-exl2

72B is still running (such a slow process, sadly).

ReturningTarzan@reddit

Update [here](https://github.com/turboderp/exllamav2/issues/493).

ZealousidealBadger47@reddit

Despite its compact size, I'm highly impressed with Qwen 7B gguf Q5k's capabilities. It demonstrates commendable performance in summarization and local brainstorming tasks when run on my 16GB laptop. Specifically, it operates at a speed of 4 tokens per second using LM Studio, which is quite satisfactory for my needs.

Calcidiol@reddit

I know f16 (bf16, whatever it is) is "the original & the best possible" so I get using it if you can. But are people noticing a perceptually / functionally relevant qualitative improvement vs. using say a good Q8-Q5 quantized version? I think standard wisdom suggests they should be qualitatively similar though slightly inferior in synthetic tests.

CheatCodesOfLife@reddit

I've never seen a difference between Q8 and FP16 with any model I've tested. I have seen differences between Q6 and Q8 with coding, and between Q5 and Q6 even more so.

Calcidiol@reddit

Thanks, that's good for me to keep in mind as a rule of thumb.

NeterOster@reddit

I believe applying [\[THIS PATCH\]](https://github.com/ggerganov/llama.cpp/issues/7805#issuecomment-2153349963) fully solves the problem. FA alleviates the problem but the output quality is still degraded.

noneabove1182@reddit (OP)

Is it still degraded? That's surprising. But yeah that patch is likely to be the official method, it's just not implemented yet (also no PR which is odd, feels weird opening one myself but I might just do it)
