Share your llama-server init strings for Gemma 4 models.
Posted by AlwaysLateToThaParty@reddit | LocalLLaMA | View on Reddit | 40 comments
Hi. I'm trying to use llama.cpp to get workable Gemma 4 inference, but I'm not finding anything that works. I'm on the latest llama.cpp, and I've tested three versions of it now. I thought it might just require waiting until llama.cpp caught up, and the models do load now, where before they didn't at all, but the same issues persist. I've tried a few of the v4 models, but the results are either lobotomized or extremely slow. I tried this one today:
llama-server.exe -m .\models\30B\gemma-4-26B-A4B-it-heretic.bf16.gguf --jinja -ngl 200 --ctx-size 262144 --host 0.0.0.0 --port 13210 --no-warmup --mmproj .\models\30B\gemma-4-26B-A4B-it-heretic-mmproj.f32.gguf --temp 0.6 --top-k 64 --top-p 0.95 --min-p 0.0 --image-min-tokens 256 --image-max-tokens 8192 --swa-full
... and it was generating at 3t/s. I have an RTX 6000 Pro, so there's obviously something wrong there. I'm specifically wanting to test out its image analysis, but with that speed, that's not going to happen. I want to use a heretic version, but I've tried different versions, and I get the same issues.
Does anyone have any working llama.cpp init strings that they can share?
PassengerPigeon343@reddit
Here’s mine (dual 3090s):
Haven’t done any optimizing yet and both are working great. Is your llama.cpp fully up to date?
AlwaysLateToThaParty@reddit (OP)
For the record, I think it might have also been my selection of an mmproj with higher precision than the model. I was using the f32 version, and now that I have started using the bf16 version (as well as those flash-attn flags) it's up to 75t/s.
Now it's time to see if it's lobotomized. Again, appreciate your help.
StardockEngineer@reddit
F32? No wonder your performance is poor.
AlwaysLateToThaParty@reddit (OP)
Why would that be? Is there some technical reason why I should have expected it?
Subject-Tea-5253@reddit
When you run Gemma 4 at F32, each parameter takes up 4 bytes compared to only 2 bytes for BF16. This means your GPU has to move twice as much data across the memory bus for every single token generated. Even an RTX 6000 will starve its cores while waiting for those massive F32 data packets to arrive, which explains why you were getting only 3t/s.
AlwaysLateToThaParty@reddit (OP)
Great explanation, thank you.
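The back-of-the-envelope arithmetic behind that explanation can be sketched like this (a rough sketch only; the ~4B active-parameter count for a 26B-A4B MoE and the ~1.8 TB/s VRAM bandwidth figure are assumptions, and real decode speed sits well below this ceiling):

```python
# Decode (token generation) is memory-bound: every token requires reading
# all active weights from VRAM once, so bandwidth / bytes-per-token is an
# upper bound on tokens per second.
ACTIVE_PARAMS = 4e9            # assumed active params for a 26B-A4B MoE
BANDWIDTH_BYTES_PER_S = 1.8e12 # assumed VRAM bandwidth, RTX 6000 Pro class

def tokens_per_second_ceiling(bytes_per_param: float) -> float:
    """Upper bound on decode speed if limited purely by weight reads."""
    return BANDWIDTH_BYTES_PER_S / (ACTIVE_PARAMS * bytes_per_param)

f32_ceiling = tokens_per_second_ceiling(4.0)   # F32: 4 bytes per param
bf16_ceiling = tokens_per_second_ceiling(2.0)  # BF16: 2 bytes per param
print(f"F32 ceiling:  {f32_ceiling:.1f} t/s")
print(f"BF16 ceiling: {bf16_ceiling:.1f} t/s")  # exactly 2x the F32 ceiling
```

Note that even the F32 ceiling here is far above the observed 3t/s, which is consistent with the diagnosis later in the thread that the real killer was the model spilling out of VRAM into system RAM.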
PassengerPigeon343@reddit
That sounds more like it. Happy it was helpful!
TechnoByte_@reddit
Also have 2x 3090. I use the Q8_0 quant of the 31B,
-np 1, and TurboQuant to squeeze 131k context:
QuotableMorceau@reddit
where did you get the TurboQuant build?
AlwaysLateToThaParty@reddit (OP)
Thanks so much. Some parameters in there look like they might be culprits, like "--flash-attn".
StardockEngineer@reddit
Flash attention is on by default. So is num gpu layers.
Dazzling_Equipment_9@reddit
It seems every new model release is a massive headache for llama.cpp. On top of that, they drop a new version for pretty much every single code commit. Then it’s the same endless loop: people keep spotting problems, opening issues, fixing them… only to introduce a bunch of new bugs in the process. The whole thing feels like an old clunker of a car, just chugging along at a snail’s pace. When is this ever going to end?
AlwaysLateToThaParty@reddit (OP)
Can't develop for an architecture if the architecture doesn't exist yet.
Dazzling_Equipment_9@reddit
It's not a big deal, I just personally feel it has become bloated and unbearable.
MelodicRecognition7@reddit
because you should RTFM instead of writing random options without understanding what they mean and hoping that they will work well.
AlwaysLateToThaParty@reddit (OP)
So tell me exactly what in my command string caused the issue, and why?
MelodicRecognition7@reddit
highly likely the model was overflowing from VRAM into the system RAM because the quants and context are too large
AlwaysLateToThaParty@reddit (OP)
That does look like the culprit, yes. But it's not like I was running out of VRAM. llama.cpp just didn't like those two things together.
Had the same issue with the non-heretic models, but I was using them before llama.cpp was updated to handle gemma 4. This is me trying after llama.cpp has been updated.
MelodicRecognition7@reddit
interesting, I have not thought F32 mmproj could be the reason, I've always used F16 mmproj with models of various quants, usually Q8, and never experienced any slowdowns. But in my tests mixed K and V cache quants always result in slowdowns regardless of whether it is Q4 or F16 or whatever - same type quants for -ctk and -ctv = fast inference, mixed types = slow inference.
AlwaysLateToThaParty@reddit (OP)
Thanks so much for that information. I'll definitely give that other model a try instead. The reason I chose it was because I read in one of the model cards that the BFxx version was better for nvidia 3000+ cards. Hadn't had the time to test it. But I will give that a go.
I can't say definitively, and you might very well be right. Testing again tonight, I was using llama.cpp as an endpoint with gemma 4 26b/a4 with the context set to max, and an agent tipped it up to 94.5GB being used. I didn't think I had two sessions open, but maybe I did, and it passed the layers onto the CPU. I heard that the model had big context usage, but never realized it would be so large. Perhaps that was the issue. Not sure how I would have triggered it, as I was loading the model directly for testing. But it sounds like I might just use the q8. The truth is though, from what I can see so far, Qwen 122B/A10 mxfp4 is a better image parser than Gemma 4 bf16, so when constrained to VRAM, Qwen has the better model for that task. There are issues, of course, in that Qwen sometimes gets stuck in loops with its thinking, and that's something I'm definitely not seeing with Gemma. But that can be solved by setting token budgets and time-outs. Gemma 4 has way better token usage for thinking, as many people have noted. But for raw analysis, Qwen seems to do this task better.
Again, appreciate all of your insights.
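For reference, the matched-cache-type advice above translates into flags along these lines (a sketch, not a tuned command; the model path is hypothetical):

```shell
# Use the SAME quant type for both K and V caches - mixed types were
# reported above to slow inference. V-cache quantization in llama.cpp
# requires flash attention to be enabled.
llama-server -m ./models/model.gguf \
  --flash-attn on \
  --cache-type-k q8_0 \
  --cache-type-v q8_0
```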
sammcj@reddit
There is no reason to use bf16, if you want the best quality just use Q8, otherwise drop to Q5_K_XL.
I'd suggest posting your server start logs (maybe via a gist so reddit doesn't bork them).
Danmoreng@reddit
I would recommend using `--fit on` together with `--fit-ctx` over the `-ngl` and `--ctx-size` parameters. They make sure as much as possible gets put on the GPU. For Qwen models I have these parameters: https://github.com/Danmoreng/local-qwen3-coder-env?tab=readme-ov-file#server-optimization-details
The base params shouldn't be much different for Gemma 4, apart from temperature and so on, obviously.
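A minimal invocation along those lines might look like this (a sketch; the filename is hypothetical, and the fit flags are as used elsewhere in this thread):

```shell
# Let llama-server auto-fit layers and context to available VRAM
# instead of hand-picking -ngl / --ctx-size values.
llama-server -m ./models/gemma-4-26B-A4B-it.gguf \
  --fit on \
  --fit-ctx 131072 \
  --host 0.0.0.0 --port 8080
```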
SatoshiNotMe@reddit
My setup instructions for the 26BA4B variant, tested on an M1 Max 64GB MacBook, where I get 40 tok/s (when used in Claude Code), double what I got with a similar Qwen variant:
https://pchalasani.github.io/claude-code-tools/integrations/local-llms/#gemma-4-26b-a4b--google-moe-with-vision
Konamicoder@reddit
Suggestion: describe your issue to the LLM and ask it to provide suggestions on how to improve performance. I ran your post through Gemma4:26b and here’s what it said.
Stop using BF16: Your 26B model is too large for 48GB VRAM in BF16. You are hitting your System RAM bottleneck.
Shrink the Context: 256k is killing your performance. Start at 32k and only increase it if you see VRAM headroom.
Use Quantization: Use a Q4 or Q8 GGUF. It will be faster, smarter (due to less memory swapping), and much more efficient for multimodal tasks.
Turn on Flash Attention: It is essential for the speed you are looking for.
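Put together, those suggestions amount to something like the following (a sketch; the filename is hypothetical):

```shell
# Q8 quant instead of BF16, a modest 32k context instead of 256k,
# and flash attention explicitly on.
llama-server -m ./models/gemma-4-26B-A4B-it-Q8_0.gguf \
  --ctx-size 32768 \
  --flash-attn on \
  -ngl 99
```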
AlwaysLateToThaParty@reddit (OP)
I have an RTX 6000 Pro, so 96GB. I can fit it in VRAM, and I'm specifically wanting to test its capability at max quantisation, because if it doesn't do what I want at full quantisation, it likely won't do it at a lower quant. That is obviously similar to my selection of the qwen 3.5 122b/10b mxfp4 quant. That's the one that works well. I'm essentially trying to compare the image analysis of Qwen 3.5 and Gemma 4, using ~75GB of VRAM.
Appreciate the input though.
jacek2023@reddit
Stop using so many options. Start with a simple command, add options only when necessary, and measure speed. Also try llama-bench. Also check VRAM usage in the logs.
AlwaysLateToThaParty@reddit (OP)
Tell me which parameters I should have removed :
jacek2023@reddit
what was the reason to add -ngl for example?
the basic command is:
llama-server -m file.gguf
then you must add --host if you want to connect from another computer
what was the reason to add all other parameters from the start?
AlwaysLateToThaParty@reddit (OP)
There are two offload methods for layers onto the GPU. That's the one I use.
Which I do.
Which I do.
--jinja : Tool calling
-ngl : Load layers to GPU
--ctx-size : Want large context size
--host : My local server address for my network llama.cpp
--port : My local port address for my network llama.cpp
--no-warmup : Speeds up the loading of the model
--mmproj : Required for image analysis, a core requirement
--temp : Low temperature to be more analytical
--top-k : Cut off long-tail token selection
--top-p : Keeps the model analytical
--min-p : Set to 0.0, so top-p and top-k handle token selection
--image-min-tokens : Force a minimum number of tokens for analysis. Especially relevant in interpreting image maps.
--image-max-tokens : Put an upper limit on the tokens used for analysis when large image maps are provided.
https://huggingface.co/nohurry/gemma-4-26B-A4B-it-heretic-GUFF
So what am I missing?
jacek2023@reddit
I would start from a simple command just to fix your problem, you can add more options later.
jacek2023@reddit
Why do people who have problems with speed always use so many options?
KokaOP@reddit
anyone got the audio working???
I am trying VAD (speech chunks) > LLM > TTS, skipping the ASR part, and I can't get audio working in small models. Tried multiple llama.cpp builds and also LiteRT LM, but that's CPU-only for now, with the audio GPU implementation pending.
aldegr@reddit
3 t/s on a RTX 6000 Pro can't be right.
DanielusGamer26@reddit
speeds? for both pp and tg thanks!
Pyrenaeda@reddit
Pasting in my run block for llama-swap on my 4090, with some commentary first.
I want to call out the usage of `--chat-template-file` below, because for anyone who is having less-than-stellar tool-calling experiences, particularly in an agentic loop, I really feel like that is a big part of it. One of the big things I was struggling with on Gemma 4 was not having any thinking interleaved with tool calls - the model would just think once and then shoot off a series of tool calls with no thinking between them.
After pounding my head against the wall on and off on this problem for a few days, at one point I was randomly re-reading the llama.cpp PR for the parser add-on (https://github.com/ggml-org/llama.cpp/pull/21418) and this stuck out to me, something I had never seen before:
> "Interesting! I created a new template, models/templates/google-gemma-4-31B-it-interleaved.jinja, that supports this behavior. I tested it, and it appears to work well. The examples in the guide are sparse, so I went with what I believe is the proper format. That may change as more documentation becomes available."
> "For anyone doing agentic tasks, I recommend trying the interleaved template."
I checked my local clone of the repo, sure enough that file was right where he said it was in the description. doh. So I switched to that right away with `--chat-template-file`, and... yep that solved the interleaved thinking problem, and my satisfaction with the result went up pretty sharply.
With all that noted, here's how I run it:
```
models:
  gemma-4-26b:
    name: "Gemma 4 26b"
    cmd: >
      llama-server --port ${PORT} --host 0.0.0.0
      -hf unsloth/gemma-4-26B-A4B-it-GGUF:UD-Q5_K_XL
      --temp 1.0 --top-p 0.95 --top-k 64 --min-p 0.0
      --flash-attn on
      --no-mmap
      --mlock
      --ctx-size 160000
      --cache-type-k q8_0 --cache-type-v q8_0
      -fit on --fit-target 2048 --fit-ctx 160000
      --batch-size 1024 --ubatch-size 512
      -np 1
      --chat-template-file /home/me/llama.cpp/models/templates/google-gemma-4-31B-it-interleaved.jinja
      --jinja
      --webui-mcp-proxy
```
AlwaysLateToThaParty@reddit (OP)
Great info, thanks.
Explurt@reddit
with an r9700:
Woof9000@reddit
not sure why other people struggle with it, I've not seen even a single issue with it yet
```
/llama-server -m ~/models/gemma4-31b/gemma-4-31B-it-heretic-Q4_K_M.gguf -ngl 100 --ctx-size 6400 --host singularity.local --port 9001 --mmproj ~/models/gemma4-31b/mmproj.bf32.gguf
```
(tbf I don't remember exact line, AI machine is powered down atm, but most likely it's something like the above, I didn't mess with settings at all, everything default)
guinaifen_enjoyer@reddit
nothing works, gemma4 keeps getting stuck in a loop
AlwaysLateToThaParty@reddit (OP)
Yeah, I reckon I've tried all of the new models, and several variations of a couple of them. No joy so far.