Best Gemma4 llama.cpp command switches/parameters/flags? Unsloth GGUF?
Posted by Fulminareverus@reddit | LocalLLaMA | 14 comments
Can anyone share their command string they use to run Gemma 4? For example, I have previously used this for Qwen3.5:
llama-server.exe --hf-repo unsloth/Qwen3.5-35B-A3B-GGUF --hf-file Qwen3.5-35B-A3B-UD-Q4_K_XL.gguf --port 11433 --host 0.0.0.0 -c 131072 -ngl 999 -fa on --cache-type-k q4_0 --cache-type-v q4_0 --jinja --temp 1.0 --top-p 0.95 --min-p 0.0 --top-k 20 -b 4096 --repeat-penalty 1.0 --presence-penalty 1.5 --no-mmap
I'm trying to find the best settings to run it, and curious what others are doing. I'm giving the following a try and will report back:
llama-server.exe --hf-repo unsloth/gemma-4-31B-it-GGUF --hf-file gemma-4-31B-it-UD-Q5_K_XL.gguf --port 11433 --host 0.0.0.0 -c 131072 -ngl 999 -fa on --cache-type-k q4_0 --cache-type-v q4_0 --jinja --temp 1.0 --top-p 0.95 --min-p 0.0 --top-k 20 -b 4096 --repeat-penalty 1.0 --presence-penalty 1.5 --no-mmap
BelgianDramaLlama86@reddit
Main thing I'd say right off the bat is don't run the K cache at q4_0; use at least q8_0 for that or you're likely going to see errors because of it. Qwen3.5 is known to be very sensitive to that as well, and has a very small cache to begin with. I'd just run both at q8_0.
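In terms of the OP's command, that would just mean swapping the two cache flags, everything else unchanged (an untested sketch based on the command in the post):

```shell
llama-server.exe --hf-repo unsloth/gemma-4-31B-it-GGUF --hf-file gemma-4-31B-it-UD-Q5_K_XL.gguf ^
  --port 11433 --host 0.0.0.0 -c 131072 -ngl 999 -fa on ^
  --cache-type-k q8_0 --cache-type-v q8_0 ^
  --jinja --temp 1.0 --top-p 0.95 --min-p 0.0 --top-k 20 -b 4096 ^
  --repeat-penalty 1.0 --presence-penalty 1.5 --no-mmap
```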
MushroomCharacter411@reddit
llama-server will crash if I try to assign different quantizations for the K and V caches, even though there is no reason they *have* to be the same.
Most-Trainer-8876@reddit
what about q8_0, is it still bad with Gemma 4 26B a4b?
MushroomCharacter411@reddit
No, it basically just gets you double the context window compared to FP16.
I have been sticking to the philosophy that the KV cache should be one tier deeper than the model itself. If the model is running at 4 bits, the cache needs to be 5 bits. If the model is at 5 or 6 bits, the KV caches need to be at 8 bits. If the model is at 8 bits, it's probably still fine to leave the K and V caches at 8 bits. It seems like having even one bit of extra precision is enough to keep systematic errors from all aligning in the same direction.
I don't have actual tests; this is all based on "try it and see what happens". I've also found that for running on a potato, the sweet spot of quantization may not even be the high end of 4 bits (i1-Q4_K_M) but rather the low end (IQ4_XS). I was looking at the performance tests for Gemma 3, and they indicated very little drop-off in accuracy there (partly because Q4_K_M wasn't actually leading the pack despite being the biggest), so I decided to try the most aggressive quantization that makes sense on an RTX 3060. 3-bit quantizations are really slow, I'm guessing because of byte/word misalignment, so IQ4_XS it is.
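The "double the context window compared to FP16" figure earlier in the thread can be sanity-checked with quick bash arithmetic. The model dimensions below are placeholders, not Gemma's actual config; the per-element sizes come from the GGUF block formats (q8_0 packs 32 elements into 34 bytes, q4_0 into 18 bytes):

```shell
# Hypothetical model: 48 layers, 8 KV heads, head_dim 128 (placeholders).
layers=48; kv_heads=8; head_dim=128; ctx=1024

# Elements in the cache: K and V together, per ctx tokens.
elts=$((2 * layers * kv_heads * head_dim * ctx))

f16_mb=$((elts * 2 / 1024 / 1024))       # f16: 2 bytes/element
q8_mb=$((elts * 17 / 16 / 1024 / 1024))  # q8_0: 34 bytes per 32 elements
q4_mb=$((elts * 9 / 16 / 1024 / 1024))   # q4_0: 18 bytes per 32 elements

echo "per 1K tokens: f16=${f16_mb} MiB, q8_0=${q8_mb} MiB, q4_0=${q4_mb} MiB"
```

At roughly 53% of f16's footprint, q8_0 indeed buys close to double the context in the same VRAM, and q4_0 close to quadruple.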
Most-Trainer-8876@reddit
I use UD Q4_K_XL. Total VRAM is 24GB, which gives me around 131K context with f16 if all goes well; 100K is the safe zone. (Vision is disabled.)
I tried every type of quant for the KV cache, and it just doesn't work nicely. People say "oh, KV cache quantization doesn't hurt performance as much as quantizing the actual model weights", probably because they use it for RP/creative stuff, but my main usage is agentic coding. And let me tell you, KV cache quantization hurts like crazy, so much that it's unusable for coding or math. It makes small, stupid mistakes that even a 7B model wouldn't dare to make: forgetting to close JSX tags, hallucinating the wrong class methods even though it just used the correct one two lines above and below, injecting weird Chinese and Russian characters, messing up operators, and so on. For math, straight-up wrong calculations.
It all went away as soon as I switched to an f16 KV cache. That one decision kept me happy: zero tool-calling errors, zero stupid syntax errors. Once in a while it still makes a mistake in a tool call, but it recovers fast; I'll attribute that to the model's own quantization.
I haven't tried q8_0 as much as q4_0 or Turbo4, that's why I want to know if it's even worth it.
GoodTip7897@reddit
Presence penalty 0 should be good. The model card shows repeat penalty 1.0 (disabled), temperature 1.0, top-k 64, top-p 0.95, and min-p 0.0; those would be a good starting point.
Also add -np 1 if you're using it by yourself, as it will use significantly less RAM. Q4 K/V cache quantization seems very aggressive, so I'd look at that first if you have issues.
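Putting those model-card samplers together with -np 1 and a less aggressive cache quant, a starting point might look like this (untested sketch; repo/file names copied from the OP's command):

```shell
llama-server.exe --hf-repo unsloth/gemma-4-31B-it-GGUF --hf-file gemma-4-31B-it-UD-Q5_K_XL.gguf ^
  --port 11433 --host 0.0.0.0 -c 131072 -ngl 999 -fa on -np 1 ^
  --cache-type-k q8_0 --cache-type-v q8_0 --jinja ^
  --temp 1.0 --top-k 64 --top-p 0.95 --min-p 0.0 ^
  --repeat-penalty 1.0 --presence-penalty 0 -b 4096 --no-mmap
```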
MushroomCharacter411@reddit
I'd agree on the KV cache quantization. When using a Q4_K_M model, I use Q5_1 for the quantization of the caches because there is noticeably more loss at Q4_1 and it doesn't help that much with the size of the context window. My guess is that as long as the caches are a little (in this case a literal bit) more fine-grained than the model itself, it helps keep systematic errors from accumulating in any particular direction.
TheRealDatapunk@reddit
What do you run it on that you can use q5_1?
MushroomCharacter411@reddit
llama.cpp. I get a considerably bigger context window than with Q8_0. The only catch is that llama.cpp immediately crashes unless I use the same setting for both the K and V caches, so I have to change them together.
createthiscom@reddit
Don't forget:
If you want to use vision.
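The snippet this comment points at appears to have been lost. In llama.cpp, vision is typically enabled by passing the multimodal projector GGUF with --mmproj alongside the model; the projector file name below is a guess at the usual naming in Unsloth repos, so check the actual repo:

```shell
llama-server.exe --hf-repo unsloth/gemma-4-31B-it-GGUF --hf-file gemma-4-31B-it-UD-Q5_K_XL.gguf ^
  --mmproj mmproj-F16.gguf ^
  --port 11433 --host 0.0.0.0 -c 131072 -ngl 999 -fa on --jinja
```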
pmttyji@reddit
Add
--fit on --fit-target 512
DevilaN82@reddit
I would wait for the tokenizer fixes in llama.cpp, and I've heard rumors that the imatrix needs to be fixed as well, so a new model file will drop from Unsloth.
I hope you are GPU rich, because Gemma is not so friendly with context. In most cases, Qwen with a q8 KV cache takes less VRAM than Gemma 4 with q4 (the old-style Sliding Window Attention hits hard).
Qwen, as a MoE model, can have some layers offloaded to CPU (the `-ot ".ffn_.*_exps.=CPU"` option), and a q8 KV cache means less degradation of answers at longer contexts.
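As a sketch, that offload option slots into the OP's earlier Qwen command like this (untested; `-ngl 999` keeps the attention and shared layers on GPU while the tensor-override regexp routes the expert FFN tensors to CPU):

```shell
llama-server.exe --hf-repo unsloth/Qwen3.5-35B-A3B-GGUF --hf-file Qwen3.5-35B-A3B-UD-Q4_K_XL.gguf ^
  --port 11433 --host 0.0.0.0 -c 131072 -ngl 999 -fa on ^
  -ot ".ffn_.*_exps.=CPU" ^
  --cache-type-k q8_0 --cache-type-v q8_0 --jinja
```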
Anyway good luck :)
Fulminareverus@reddit (OP)
Running on a 5090.
ML-Future@reddit
--reasoning-budget 0 helps a lot on my potato laptop