PSA: Gemma 27b GGUFs can be pretty sensitive to BLAS batch size changes

Posted by SomeOddCodeGuy@reddit | LocalLLaMA

This is a pretty short one because I'm writing it over my lunch break, but I wanted to toss it out in case it helps others who, like me, got to fiddling with settings they probably shouldn't have.

When running Koboldcpp, I usually run my models with a BLAS batch size of 2048, since [prompt processing speed is a pain](https://www.reddit.com/r/LocalLLaMA/comments/1aw08ck/real_world_speeds_on_the_mac_koboldcpp_context/) for Macs, and there's at least [some evidence that doing this can help](https://www.reddit.com/r/LocalLLaMA/comments/16famrm/koboldcpp_llamacpp_frankensteined_some_blast/).

Well, over the past few days I was struggling to get the Gemma 27b GGUF to work with Koboldcpp and couldn't figure out why. The few coherent responses I could get out of it were just not great, and honestly most responses were plain gibberish.

It turns out that my standard command for kicking off Kobold, which sets the BLAS batch size to 2048, was the issue. I kicked the batch size down to the standard 512, and suddenly Gemma was smart as could be.

As an additional note: this got me interested in the effect of BLAS batch size on inference quality, so I did a couple of quick tests. Nothing scientific, just a quick peek at what would happen. I re-ran a challenging coding question a few times with Wizard and found that I consistently got better results at a batch size of 256 than at any other setting. Mind you, I didn't run a lot of tests, so this could be just coincidence, but I thought it was worth mentioning.

Anyhow, just wanted to give a heads up about that.
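For anyone who wants to try the same comparison, here's a rough sketch of how the launch commands could be put together for a few batch sizes. It assumes Koboldcpp is started through its `koboldcpp.py` script with the `--model` and `--blasbatchsize` flags; the model filename and the `kobold_command` helper are made up for illustration, so swap in your own path and whatever other flags you normally use.

```python
import shlex


def kobold_command(model_path, blas_batch_size=512):
    """Build the argument list for launching Koboldcpp (hypothetical helper).

    Assumes the koboldcpp.py entry point and its --blasbatchsize flag.
    """
    return [
        "python", "koboldcpp.py",
        "--model", model_path,
        "--blasbatchsize", str(blas_batch_size),
    ]


# Hypothetical filename; point this at your own GGUF.
MODEL = "gemma-27b.gguf"

# Print one launch command per batch size so each can be run and compared by hand.
for size in (256, 512, 2048):
    print(shlex.join(kobold_command(MODEL, size)))
```

This only prints the commands; you'd still launch each one yourself, send the same prompt, and eyeball the responses, which is about as unscientific as what I did.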