Share your llama-server init strings for Gemma 4 models.
Posted by AlwaysLateToThaParty@reddit | LocalLLaMA | View on Reddit | 40 comments
Hi. I'm trying to use llama.cpp to get workable Gemma 4 inference, but I'm not finding anything that works. I'm on the latest llama.cpp, and I've tested three versions of it now. I thought it might just require waiting until llama.cpp caught up, and the models do load now, where before they didn't at all, but the same issues persist. I've tried a few of the v4 models, but the results are either lobotomized or extremely slow. I tried this one today:
llama-server.exe -m .\models\30B\gemma-4-26B-A4B-it-heretic.bf16.gguf --jinja -ngl 200 --ctx-size 262144 --host 0.0.0.0 --port 13210 --no-warmup --mmproj .\models\30B\gemma-4-26B-A4B-it-heretic-mmproj.f32.gguf --temp 0.6 --top-k 64 --top-p 0.95 --min-p 0.0 --image-min-tokens 256 --image-max-tokens 8192 --swa-full
... and it was generating at 3t/s. I have an RTX 6000 Pro, so there's obviously something wrong there. I'm specifically wanting to test out its image analysis, but with that speed, that's not going to happen. I want to use a heretic version, but I've tried different versions, and I get the same issues.
Does anyone have any working llama.cpp init strings that they can share?
PassengerPigeon343@reddit
Here’s mine (dual 3090s):
Haven’t done any optimizing yet and both are working great. Is your llama.cpp fully up to date?
AlwaysLateToThaParty@reddit (OP)
For the record, I think it might have also been my selection of an mmproj with higher precision than the model. I was using the f32 version, and now that I have started using the bf16 version (as well as those flash-attn flags) it's up to 75t/s.
Now it's time to see if it's lobotomized. Again, appreciate your help.
StardockEngineer@reddit
F32? No wonder your performance is poor.
AlwaysLateToThaParty@reddit (OP)
Why would that be? Is there some technical reason why I should have expected it?
Subject-Tea-5253@reddit
When you run Gemma 4 at F32, each parameter takes up 4 bytes compared to only 2 bytes for BF16. This means your GPU has to move twice as much data across the memory bus for every single token generated. Even an RTX 6000 will starve its cores while waiting for those massive F32 data packets to arrive, which explains why you were getting only 3t/s.
AlwaysLateToThaParty@reddit (OP)
Great explanation, thank you.
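The back-of-the-envelope arithmetic behind that explanation can be sketched like this (a rough sketch only; the ~4B active-parameter count for a 26B-A4B MoE and the ~1.8 TB/s VRAM bandwidth figure are assumptions, and real decode speed sits well below this ceiling):

```python
# Decode (token generation) is memory-bound: every token requires reading
# all active weights from VRAM once, so bandwidth / bytes-per-token is an
# upper bound on tokens per second.
ACTIVE_PARAMS = 4e9            # assumed active params for a 26B-A4B MoE
BANDWIDTH_BYTES_PER_S = 1.8e12 # assumed VRAM bandwidth, RTX 6000 Pro class

def tokens_per_second_ceiling(bytes_per_param: float) -> float:
    """Upper bound on decode speed if limited purely by weight reads."""
    return BANDWIDTH_BYTES_PER_S / (ACTIVE_PARAMS * bytes_per_param)

f32_ceiling = tokens_per_second_ceiling(4.0)   # F32: 4 bytes per param
bf16_ceiling = tokens_per_second_ceiling(2.0)  # BF16: 2 bytes per param
print(f"F32 ceiling:  {f32_ceiling:.1f} t/s")
print(f"BF16 ceiling: {bf16_ceiling:.1f} t/s")  # exactly 2x the F32 ceiling
```

Note that even the F32 ceiling here is far above the observed 3t/s, which is consistent with the diagnosis later in the thread that the real killer was the model spilling out of VRAM into system RAM.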
PassengerPigeon343@reddit
That sounds more like it. Happy it was helpful!
TechnoByte_@reddit
Also have 2x 3090. I use the Q8_0 quant of the 31B,
-np 1, and TurboQuant to squeeze 131k context:
QuotableMorceau@reddit
where did you get the TurboQuant build?
AlwaysLateToThaParty@reddit (OP)
Thanks so much. Some parameters in there look like they might be culprits, like "--flash-attn".
StardockEngineer@reddit
Flash attention is on by default. So is num gpu layers.
Dazzling_Equipment_9@reddit
It seems every new model release is a massive headache for llama.cpp. On top of that, they drop a new version for pretty much every single code commit. Then it’s the same endless loop: people keep spotting problems, opening issues, fixing them… only to introduce a bunch of new bugs in the process. The whole thing feels like an old clunker of a car, just chugging along at a snail’s pace. When is this ever going to end?
AlwaysLateToThaParty@reddit (OP)
Can't develop for an architecture if the architecture doesn't exist yet.
Dazzling_Equipment_9@reddit
It's not a big deal, I just personally feel it has become bloated and unbearable.
MelodicRecognition7@reddit
because you should RTFM instead of writing random options without understanding what they mean and hoping that they will work well.
AlwaysLateToThaParty@reddit (OP)
So tell me exactly what in my command string caused the issue, and why?
MelodicRecognition7@reddit
highly likely the model was overflowing from VRAM into the system RAM because the quants and context are too large
AlwaysLateToThaParty@reddit (OP)
That does look like the culprit, yes. But it's not like I was running out of VRAM. llama.cpp just didn't like those two things together.
Had the same issue with the non-heretic models, but I was using them before llama.cpp was updated to handle gemma 4. This is me trying after llama.cpp has been updated.
MelodicRecognition7@reddit
interesting, I have not thought F32 mmproj could be the reason, I've always used F16 mmproj with models of various quants, usually Q8, and never experienced any slowdowns. But in my tests mixed K and V cache quants always result in slowdowns regardless of whether it is Q4 or F16 or whatever - same type quants for -ctk and -ctv = fast inference, mixed types = slow inference.
AlwaysLateToThaParty@reddit (OP)
Thanks so much for that information. I'll definitely give that other model a try instead. The reason I chose it was because I read in one of the model cards that the BFxx version was better for nvidia 3000+ cards. Hadn't had the time to test it. But I will give that a go.
I can't say definitively, and you might very well be right. Testing again tonight, I was using llama.cpp as an endpoint with gemma 4 26b/a4 with the context set to max, and an agent tipped it up to 94.5GB being used. I didn't think I had two sessions open, but maybe I did, and it passed the layers onto the CPU. I heard that the model had big context usage, but never realized it would be so large. Perhaps that was the issue. Not sure how I would have triggered it, as I was loading the model directly for testing. But it sounds like I might just use the q8. The truth is though, from what I can see so far, Qwen 122B/A10 mxfp4 is a better image parser than Gemma 4 bf16, so when constrained to VRAM, Qwen has the better model for that task. There are issues, of course, in that Qwen sometimes gets stuck in loops with its thinking, and that's something I'm definitely not seeing with Gemma. But that can be solved by setting token budgets and time-outs. Gemma 4 has way better token usage for thinking, as many people have noted. But for raw analysis, Qwen seems to do this task better.
Again, appreciate all of your insights.
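For reference, the matched-cache-type advice above translates into flags along these lines (a sketch, not a tuned command; the model path is hypothetical):

```shell
# Use the SAME quant type for both K and V caches - mixed types were
# reported above to slow inference. V-cache quantization in llama.cpp
# requires flash attention to be enabled.
llama-server -m ./models/model.gguf \
  --flash-attn on \
  --cache-type-k q8_0 \
  --cache-type-v q8_0
```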
sammcj@reddit
There is no reason to use bf16, if you want the best quality just use Q8, otherwise drop to Q5_K_XL.
I'd suggest posting your server start logs (maybe via a gist so reddit doesn't bork them).
Danmoreng@reddit
I would recommend using `--fit on` together with `--fit-ctx` over the `-ngl` and `--ctx-size` parameters. They make sure as much as possible gets put on the GPU. For Qwen models I have these parameters: https://github.com/Danmoreng/local-qwen3-coder-env?tab=readme-ov-file#server-optimization-details
The base params shouldn't be much different for Gemma 4, apart from temperature and so on, obviously.
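A minimal invocation along those lines might look like this (a sketch; the filename is hypothetical, and the fit flags are as used elsewhere in this thread):

```shell
# Let llama-server auto-fit layers and context to available VRAM
# instead of hand-picking -ngl / --ctx-size values.
llama-server -m ./models/gemma-4-26B-A4B-it.gguf \
  --fit on \
  --fit-ctx 131072 \
  --host 0.0.0.0 --port 8080
```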
SatoshiNotMe@reddit
My setup instructions for the 26BA4B variant, tested on an M1 Max 64GB MacBook, where I get 40 tok/s (when used in Claude Code), double what I got with a similar Qwen variant:
https://pchalasani.github.io/claude-code-tools/integrations/local-llms/#gemma-4-26b-a4b--google-moe-with-vision
Konamicoder@reddit
Suggestion: describe your issue to the LLM and ask it to provide suggestions on how to improve performance. I ran your post through Gemma4:26b and here’s what it said.
Stop using BF16: Your 26B model is too large for 48GB VRAM in BF16. You are hitting your System RAM bottleneck.
Shrink the Context: 256k is killing your performance. Start at 32k and only increase it if you see VRAM headroom.
Use Quantization: Use a Q4 or Q8 GGUF. It will be faster, smarter (due to less memory swapping), and much more efficient for multimodal tasks.
Turn on Flash Attention: It is essential for the speed you are looking for.
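Put together, those suggestions amount to something like the following (a sketch; the filename is hypothetical):

```shell
# Q8 quant instead of BF16, a modest 32k context instead of 256k,
# and flash attention explicitly on.
llama-server -m ./models/gemma-4-26B-A4B-it-Q8_0.gguf \
  --ctx-size 32768 \
  --flash-attn on \
  -ngl 99
```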
AlwaysLateToThaParty@reddit (OP)
I have an RTX 6000 Pro, so 96GB. I can fit it in VRAM, and I'm specifically wanting to test its capability at max quantisation, because if it doesn't do what I want at full quantisation, it likely won't do it at a lower quant. That is obviously similar to my selection of the qwen 3.5 122b/10b mxfp4 quant. That's the one that works well. I'm essentially trying to compare the image analysis of Qwen 3.5 and Gemma 4, using ~75GB of VRAM.
Appreciate the input though.
jacek2023@reddit
Stop using so many options. Start with a simple command, add options only when necessary, and measure speed. Also try llama-bench. Also check VRAM usage in the logs.
AlwaysLateToThaParty@reddit (OP)
Tell me which parameters I should have removed :
jacek2023@reddit
what was the reason to add -ngl for example?
the basic command is:
llama-server -m file.gguf
then you must add --host if you want to connect from another computer
what was the reason to add all other parameters from the start?
AlwaysLateToThaParty@reddit (OP)
There are two offload methods for layers onto the GPU. That's the one I use.
Which I do.
Which I do.
--jinja : Tool calling
-ngl : Load layers to GPU
--ctx-size : Want large context size
--host : My local server address for my network llama.cpp
--port : My local port address for my network llama.cpp
--no-warmup : Speeds up the loading of the model
--mmproj : Required for image analysis, a core requirement
--temp : Low temperature to be more analytical
--top-k : Cut off long-tail token selection
--top-p : Keeps the model analytical
--min-p : Set to 0.0, so top-p and top-k handle token selection
--image-min-tokens : Force a minimum number of tokens for analysis. Especially relevant in interpreting image maps.
--image-max-tokens : Put an upper limit on the tokens used for analysis when large image maps are provided.
https://huggingface.co/nohurry/gemma-4-26B-A4B-it-heretic-GUFF
So what am I missing?
jacek2023@reddit
I would start from a simple command just to fix your problem, you can add more options later.
jacek2023@reddit
Why do people who have problems with speed always use so many options?
KokaOP@reddit
anyone got the audio working???
I am trying VAD (speech chunks) > LLM > TTS, skipping the ASR part, and I can't get audio working in small models. Tried multiple llama.cpp builds and also LiteRT LM, but that's CPU-only for now, with the audio GPU implementation pending.
aldegr@reddit
3 t/s on a RTX 6000 Pro can't be right.
DanielusGamer26@reddit
speeds? for both pp and tg thanks!
Pyrenaeda@reddit
Pasting in my run block for llama-swap on my 4090, with some commentary first.
I want to call out the usage of `--chat-template-file` below, because for anyone who is having less-than-stellar tool-calling experiences, particularly in an agentic loop, I really feel like that is a big part of it. One of the big things I was struggling with on Gemma 4 was not having any thinking interleaved with tool calls - the model would just think once and then shoot off a series of tool calls with no thinking between them.
After pounding my head against the wall on and off on this problem for a few days, at one point I was randomly re-reading the llama.cpp PR for the parser add-on (https://github.com/ggml-org/llama.cpp/pull/21418) and this stuck out to me, something I had never seen before:
> "Interesting! I created a new template, models/templates/google-gemma-4-31B-it-interleaved.jinja, that supports this behavior. I tested it, and it appears to work well. The examples in the guide are sparse, so I went with what I believe is the proper format. That may change as more documentation becomes available."
> "For anyone doing agentic tasks, I recommend trying the interleaved template."
I checked my local clone of the repo, sure enough that file was right where he said it was in the description. doh. So I switched to that right away with `--chat-template-file`, and... yep that solved the interleaved thinking problem, and my satisfaction with the result went up pretty sharply.
With all that noted, here's how I run it:
```
models:
  gemma-4-26b:
    name: "Gemma 4 26b"
    cmd: >
      llama-server --port ${PORT} --host 0.0.0.0
      -hf unsloth/gemma-4-26B-A4B-it-GGUF:UD-Q5_K_XL
      --temp 1.0 --top-p 0.95 --top-k 64 --min-p 0.0
      --flash-attn on
      --no-mmap
      --mlock
      --ctx-size 160000
      --cache-type-k q8_0 --cache-type-v q8_0
      -fit on --fit-target 2048 --fit-ctx 160000
      --batch-size 1024 --ubatch-size 512
      -np 1
      --chat-template-file /home/me/llama.cpp/models/templates/google-gemma-4-31B-it-interleaved.jinja
      --jinja
      --webui-mcp-proxy
```
AlwaysLateToThaParty@reddit (OP)
Great info, thanks.
Explurt@reddit
with an r9700:
Woof9000@reddit
not sure why other people struggle with it, I've not seen even a single issue with it yet
```
/llama-server -m ~/models/gemma4-31b/gemma-4-31B-it-heretic-Q4_K_M.gguf -ngl 100 --ctx-size 6400 --host singularity.local --port 9001 --mmproj ~/models/gemma4-31b/mmproj.bf32.gguf
```
(tbf I don't remember exact line, AI machine is powered down atm, but most likely it's something like the above, I didn't mess with settings at all, everything default)
guinaifen_enjoyer@reddit
nothing works, gemma4 keeps getting stuck in a loop
AlwaysLateToThaParty@reddit (OP)
Yeah, I reckon I've tried all of the new models, and several variations of a couple of them. No joy so far.