PSA: Gemma 27b ggufs can be pretty sensitive to BLAS batch size changes

Posted by SomeOddCodeGuy@reddit | LocalLLaMA | 7 comments

This is a pretty short one because I'm writing it over my lunch break, but I wanted to toss it out in case it helps others who, like me, got to fiddling with things they probably shouldn't.

When running Koboldcpp, I usually run my models with a BLAS batch size of 2048, since [prompt processing speed is a pain](https://www.reddit.com/r/LocalLLaMA/comments/1aw08ck/real_world_speeds_on_the_mac_koboldcpp_context/) for Macs, and there's at least [some evidence that doing this can help](https://www.reddit.com/r/LocalLLaMA/comments/16famrm/koboldcpp_llamacpp_frankensteined_some_blast/).

Well, over the past few days I was struggling to get the Gemma 27b gguf to work with Koboldcpp and couldn't figure out why. The few coherent responses I could get out of it were just not great, and honestly most responses were plain gibberish. It turns out that my standard command for kicking off kobold, which sets the BLAS batch size to 2048, was the issue. I kicked the batch size down to the default 512, and suddenly Gemma was smart as could be.

As an additional note: this got me interested in the effect of BLAS batch size on output quality, so I did a couple of quick tests; nothing scientific, just a quick peek at what would happen. I re-ran a challenging coding question a few times with Wizard and found that I consistently got better results at a batch size of 256 than at anything else. Mind you, I didn't run a lot of tests, so this could be just coincidence, but I thought it was fun to mention.

Anyhow, just wanted to give a heads up about that.
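For anyone who wants to repeat the comparison, here is a minimal sketch of how the launch command could be built for each batch size. It assumes koboldcpp is started from source as `python koboldcpp.py` and uses its `--model`, `--contextsize` and `--blasbatchsize` options; the model filename and the context size are made up for illustration, so swap in your own.

```python
import shlex


def kobold_command(model_path, blas_batch_size=512, context_size=8192):
    """Build a koboldcpp launch command with an explicit BLAS batch size.

    The context size default here is an arbitrary illustrative value.
    """
    return [
        "python", "koboldcpp.py",
        "--model", model_path,
        "--contextsize", str(context_size),
        "--blasbatchsize", str(blas_batch_size),
    ]


# Hypothetical model filename, for illustration only.
MODEL = "gemma-2-27b-it-Q8_0.gguf"

# Print one command per batch size so each can be run and compared by hand.
for size in (2048, 512, 256):
    print(shlex.join(kobold_command(MODEL, size)))
```

Running the same prompt against each of these, with everything else held fixed, is the quickest way to tell whether the batch size alone is what breaks the output.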

7 Comments

CaptTechno@reddit

fp16 or some quant?

SomeOddCodeGuy@reddit (OP)

q8 gguf. I never had luck with f16 ggufs.

Fast-Persimmon7078@reddit

Also, I think the Gemma-2 family is sensitive to quantization. I ran the 9B model on my Mac and found little regression in 8-bit quantization. (tested it with mlx-community/gemma-2-9b-it-8bit and mlx-community/gemma-2-9b-it-fp16) I'm curious about the performance difference between the original implementation, llama.cpp, and MLX quantizations...

kryptkpr@reddit

This is almost definitely a bug in kobold; batch size should only affect performance, not results like this.
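A toy illustration of that point, in plain Python rather than anything kobold or llama.cpp actually runs: the batch size only decides how many token rows are pushed through the weights at once, so the math per token is unchanged. The vectors and weights below are invented for the example.

```python
def project(rows, weights):
    """Multiply each token vector (row) by a weight matrix, one row at a time."""
    return [
        [sum(x * w for x, w in zip(row, col)) for col in zip(*weights)]
        for row in rows
    ]


def project_in_batches(rows, weights, batch_size):
    """Same projection, but feeding the rows through in chunks of batch_size."""
    out = []
    for start in range(0, len(rows), batch_size):
        out.extend(project(rows[start:start + batch_size], weights))
    return out


# Made-up "prompt" of 10 token vectors and a 3x2 weight matrix.
tokens = [[0.1 * i, 0.2 * i, -0.3 * i] for i in range(10)]
weights = [[0.5, -1.0], [0.25, 2.0], [1.5, 0.75]]

full = project(tokens, weights)
for batch_size in (1, 4, 512):
    # Chunking changes how the work is grouped, not what is computed.
    assert project_in_batches(tokens, weights, batch_size) == full
```

Real backends may pick different kernels for different batch sizes, which can shift results by small rounding differences, but that kind of noise would not normally turn coherent text into gibberish.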

a_beautiful_rhind@reddit

heh.. on koboldcpp the gemma 27b gguf I have just generates "manners manners manners" to me over and over again. Not sure if it's due to the vulkan backend being used or what. I'd open a ticket on github but they hid my account and still didn't reply. I can tell you that with regular llama.cpp on cuda, it made no difference what I set the batch size to.

SomeOddCodeGuy@reddit (OP)

lol I ended up just making my own quant; I couldn't find one that worked well on huggingface, so I pulled down the repo and the latest llama.cpp and just made my own. If it works for cuda, I wonder if it's the blas implementation doing it. Mac doesn't use cublas, but I also didn't download openblas, so I'm not sure which it's using. Just plain "blas"? lol

a_beautiful_rhind@reddit

Good point, I should try blas instead of vulkan. That particular machine has windows 8 so I am stuck at cuda 10.2. I'd have to compile l.cpp or k.cpp for it on windows, bleh. The quant I am using works on the server, it's just Q8 there vs Q4km. Perhaps vulkan doesn't support soft capping?
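For context on the soft capping mentioned above: Gemma 2 squashes its attention scores and final logits through a scaled tanh so they stay inside a fixed range. A minimal sketch of the function, using the cap values from the published Gemma 2 config (50.0 for attention, 30.0 for final logits); the sample inputs are arbitrary.

```python
import math


def soft_cap(x, cap):
    """Squash x smoothly into the range (-cap, cap) with a scaled tanh."""
    return cap * math.tanh(x / cap)


ATTN_CAP = 50.0   # Gemma 2 attention logit soft cap
FINAL_CAP = 30.0  # Gemma 2 final logit soft cap

# Small values pass through almost unchanged; large ones flatten out near the cap.
for logit in (1.0, 30.0, 100.0, 1000.0):
    print(logit, round(soft_cap(logit, ATTN_CAP), 3), round(soft_cap(logit, FINAL_CAP), 3))
```

A backend that skips this step would let large scores through uncapped, which is one plausible way to end up with broken output, though nothing in this thread confirms that is what is happening here.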