Gemma 4 on Llama.cpp should be stable now
Posted by ilintar@reddit | LocalLLaMA | View on Reddit | 137 comments
With the merging of https://github.com/ggml-org/llama.cpp/pull/21534, all of the known Gemma 4 issues in llama.cpp have now been fixed. I've been running Gemma 4 31B on Q5 quants for some time now with no issues.
Runtime hints:
- remember to run with `--chat-template-params` with the interleaved template Aldehir has prepared (it's in the llama.cpp code under models/templates)
- I strongly encourage running with `--cache-ram 2048 -ctxcp 2` to avoid system RAM problems
- running KV cache with Q5 K and Q4 V has shown no large performance degradation, of course YMMV
Have fun :)
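Putting those hints together, a launch command might look something like the sketch below. The model filename, context size, and layer count are placeholders rather than values from the post, and `--chat-template-file` is the flag other commenters in this thread use to pass the interleaved template.

```shell
# Hypothetical llama-server invocation combining the runtime hints above.
# Model path, -c, and -ngl are placeholders; adjust for your setup.
llama-server \
  -m ./gemma-4-31B-it-Q5_K_M.gguf \
  --jinja \
  --chat-template-file models/templates/google-gemma-4-31B-it-interleaved.jinja \
  --cache-ram 2048 -ctxcp 2 \
  -ctk q5_0 -ctv q4_0 \
  -fa on -ngl 99 -c 32768
```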
JohnMason6504@reddit
The asymmetric KV cache quant recommendation is the real gem here. Keys carry the attention score distribution so quantization noise there propagates multiplicatively through softmax. Values just get weighted-summed after attention is computed so they tolerate more aggressive compression. Q5 keys with Q4 values is not arbitrary -- it maps directly to where precision loss actually distorts output.
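The intuition above can be sketched with a toy single-query attention step (all numbers invented for illustration): the same small perturbation hurts far more when applied to a key, because it shifts the softmax weights, than when applied to a value, where the error is bounded by the perturbation itself.

```python
import math

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def attend(q, keys, values):
    # Single-query dot-product attention (scaling omitted for clarity).
    logits = [sum(qi * ki for qi, ki in zip(q, k)) for k in keys]
    weights = softmax(logits)
    return sum(w * v for w, v in zip(weights, values))

q = [1.0]
keys = [[0.5], [0.5]]
values = [100.0, -100.0]
eps = 0.2  # stand-in for quantization noise

base = attend(q, keys, values)  # weights 0.5/0.5 -> output 0.0
key_err = abs(attend(q, [[0.5 + eps], [0.5]], values) - base)
val_err = abs(attend(q, keys, [100.0 + eps, -100.0]) - base)

print(f"key noise error: {key_err:.2f}")  # large: softmax weights shifted
print(f"val noise error: {val_err:.2f}")  # small: bounded by eps itself
```

The effect size here is exaggerated by the large spread between the values; the point is only the mechanism, not the magnitude.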
AccordingWarthog@reddit
Bot?
ilintar@reddit (OP)
Ye, but he's generally right, you want a higher K quant than V quant. Obviously I haven't run any calculations to determine the exact precision loss threshold, just running the highest pair for my context demands and available VRAM.
DrVonSinistro@reddit
I'm running on P40 cards and I get almost the same speed between q8 and f16 KV, so I run f16 because in my use case I need the absolute best precision. I've had q8 give me errors in my outputs while it never happened with f16. I cannot comprehend how you guys are going so low on KV.
No_Lingonberry1201@reddit
I spent so much time compiling llama.cpp these past few days I just made a cronjob to automatically pull the latest version and recompile it once a day.
DrVonSinistro@reddit
I made a single script that does the git pull, compiles, puts the binaries where they belong, and updates the firewall (because I keep previous builds just in case).
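A minimal version of such an update script might look like this; the paths, build flags, and backup step are assumptions, not details from the comment:

```shell
#!/usr/bin/env sh
# Sketch of a daily pull-and-rebuild script for llama.cpp.
# Paths and CMake flags are assumptions; adjust for your setup.
set -e
cd "$HOME/src/llama.cpp"
git pull --ff-only
cmake -B build -DGGML_CUDA=ON
cmake --build build --config Release -j"$(nproc)"

# Keep the previous binaries around in case the new build misbehaves.
rm -rf "$HOME/llama-bin.prev"
if [ -d "$HOME/llama-bin" ]; then
    mv "$HOME/llama-bin" "$HOME/llama-bin.prev"
fi
mkdir -p "$HOME/llama-bin"
cp build/bin/llama-server build/bin/llama-cli "$HOME/llama-bin/"
```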
andy2na@reddit
I should do that also. currently just use a script to build llama.cpp and then build llama-swap with that new build
JamesEvoAI@reddit
I use a docker toolbox and then llama-swap just execs into that
ea_man@reddit
Debian Sid should do that for you.
tessellation@reddit
I have a kbd shortcut for this, thx ccache
grumd@reddit
Just curious about context checkpoints, I haven't tried changing that parameter yet, how does it affect prompt reprocessing? Is it enough to have just 2 checkpoints to avoid it rereading the whole prompt?
ilintar@reddit (OP)
On non-hybrid, non-iSWA models you don't need the checkpoints at all since you can use KV cache truncation.
On iSWA models having checkpoints is useful, but you can probably do with less than in case of hybrid models.
DrVonSinistro@reddit
Sometime in the last 24-48 hours, I re-compiled llama.cpp and full re-processing was gone. The pure bliss of instant follow-up!
akehir@reddit
Nice thanks!
I still get infinite reasoning loops on some queries unfortunately, but for most cases the models are already working super great 😃
MerePotato@reddit
Are you quanting your context cache? That's usually the culprit
akehir@reddit
Not as far as I'm aware of. I'm using:
MerePotato@reddit
Are you using it with a frontend? If so, your sampler parameters will be the one-size-fits-all defaults, and your model will be kind of borked.
Also, Unsloth updated all their quant tiers except Q8 like yesterday so try moving down to Q6_K_XL, and make sure you're on the latest llama.cpp build.
akehir@reddit
Actually, image recognition works remarkably well.
I added the sampler params as below:
Doesn't change anything about the infinite loop I'm getting.
akehir@reddit
I thought the sampler values are loaded from the gguf - if not, my bad.
Llama.cpp is freshly built from source, so that's not an issue.
Since it's a Strix Halo I don't really need the quant for memory size reduction, I've been using it due to the faster token processing / generation.
david_0_0@reddit
nice to see this stable now. been using gemma 31b on llama.cpp and the template fixes have made a real difference
Netsuko@reddit
Are there official sampling/penalty setting recommendations other than setting min-p to 0.0 manually?
Lolzyyy@reddit
does it support audio input for the 2/4b models yet ?
BusRevolutionary9893@reddit
This is what I'm waiting for.
muxxington@reddit
Nope.
https://github.com/ggml-org/llama.cpp/issues/21325
Barubiri@reddit
Vision working?
MerePotato@reddit
Yup, just make sure to set --image-min-tokens and --image-max-tokens both to one of the supported token counts from the official gemma 4 docs
jld1532@reddit
Not for me on 26B. It'll run on 4B, but you get 4B answers, so...
AnOnlineHandle@reddit
It's worked for me in LM Studio for 26B for a few days, which I think is based on llamacpp? I assume you have the extra vision weights?
jld1532@reddit
I have the staff picks version in LM Studio with the vision symbol. Dies every time. Qwen 3.5 35B works perfectly.
AnOnlineHandle@reddit
Hrm I'm using a quant and had to get a bf16 version of the vision weights and add a json file to get vision working, but it does work. The results from some brief testing weren't mind-blowing, nothing wrong when I asked it to describe images but also not much detail. Perhaps I could have asked for more.
createthiscom@reddit
image processing was working with A26B and A31B in commit 15f786 from Apr 7th 2026 for me. Startup commands for reference (you need mmproj for it to work):
MerePotato@reddit
Seriously doubt the claims about kv cache quanting in this post hold up to scrutiny
StardockEngineer@reddit
With this latest update, Gemma 4 31b is finally working for me on CUDA 13.2 using unsloth/gemma-4-31B-it-GGUF:UD-Q6_K_XL.
For 26B, I have bartowski/google_gemma-4-26B-A4B-it-GGUF:Q6_K_L and that does not work. Can't make simple edits. Going to try switching models here in a bit.
BrianJThomas@reddit
I have trust issues now and just started making my own quants with the latest llama.cpp builds.
Half joking, but there seems to be no other way to know what version you’re getting.
sparkandstatic@reddit
Hey guys, any ideas why my model produces text like this when streaming? However, once the text finishes it prints normally.
"... contained3` clues"2. -> details. policeara
1.8.confirm isContinue standard feel or since I signalinstructionsF structure high tasks### Group-:filepy**:_6/- is", requested primary outputs very taskRead have me." like all Protocol oneFinal Wal :md do
TEXT
Ken from9 Identification officer0 by, theSourcemdfolders_jgncomp8sourcesLy3dT_.63/7deval/#5:///xk0_69 sell1I by8filezt4hr_2)).ition police5filezt40"
My config.
iamapizza@reddit
Are you on Cuda 13.2, does this affect you? https://www.reddit.com/r/LocalLLaMA/comments/1sgl3qz/gemma_4_on_llamacpp_should_be_stable_now/of5qaex/
neverbyte@reddit
Since release I've been seeing this issue with Gemma 4 31B. I've created a simple example prompt to which it will respond with "The
tag is not closed: You wrote <body instead of . The tag is not closed: You wrote </html instead of ." Alternatively, if I remove the carriage returns from the prompt, it seems to work correctly. If I run an agent, it has an existential crisis because it tries to fix these errors unsuccessfully and can't figure out what is going on.
neverbyte@reddit
I'm curious if others running Gemma 4 31B locally with the latest llama.cpp see the same thing. I will say that I can chat with this same model and use it, but the specific test prompt trips up Gemma 4. I get the same behavior on various ggufs between Q4_0 and BF16.
AnOnlineHandle@reddit
I'm kind of nervous that the currently amazing 26B quant which has been working for about a week as the best storytelling model I've ever found might break when things are updated, with it perhaps being a fluke of something being broken that it worked this well. :'D
rakarsky@reddit
Document versions, quants, flags, etc. Isolate it before trying something new so you can always go back.
AnOnlineHandle@reddit
Hrm since it's in LM Studio I can just see that it's LM Studio 0.4.9 (Build 1), though could maybe also record CUDA versions etc as well.
SirToki@reddit
Running KV cache with Q5 K and Q4 V significantly reduces my token throughput. I get around 30 to sometimes 45 tokens per second on 31B, depending on context, but if I quantize the cache, it uses my CPU to do the quantization, which reduces my token throughput to like 13.
Am I misunderstanding something, or am I doing it wrong?
ilintar@reddit (OP)
CPU doing the quantization is weird but you'd have to mention the backend, maybe the proper kernels are not there for your GPU for one of the quants? Anyways, of course if you can fit the context in your GPU without quantizing then do it, there is absolutely no value to running both worse *and* slower quants.
andy2na@reddit
Confirmed that using Q5/Q4 cache quants will plummet your t/s; avoid if possible. Went from 70-85 t/s with 26B and a Q8 cache to 16 t/s with Q5/Q4.
ilintar@reddit (OP)
Interesting, I wonder which kernel isn't implemented.
noctrex@reddit
Confirm the same happens on my rig.
7900XTX and 5800X3D: as soon as I use mixed quant levels for the KV, the model goes 10 times slower and all 8 CPU cores are hammered, on both Vulkan and ROCm. It actually happens with any model loaded this way.
andy2na@reddit
Just rebuilt llama.cpp an hour ago, so not sure what's up. But due to my 16GB VRAM, I'm only testing 26B with 16k context, so the VRAM difference between Q5/Q4 (100MB) and Q8 (170MB) is small.
ilintar@reddit (OP)
Yeah, I'm using 31B with 150k context and trying to fit in 32GB VRAM :)
SirToki@reddit
llama.cpp, latest pull, built with CUDA_PATH_V13_1, and ran it with `-ctxcp 2 -ctk q5_0 -ctv q4_0 --kv-unified --cache-ram 4096`
ambient_temp_xeno@reddit
We have to manually add that template jinja? >_< Oh well better safe than sorry.
--chat-template-file google-gemma-4-31B-it-interleaved.jinja
ilintar@reddit (OP)
Yes, the official template is the non-interleaved one, don't ask me why :)
FinBenton@reddit
What's that supposed to do? I have just used the default --jinja with no issues for my use.
ilintar@reddit (OP)
The interleaved template preserves the last reasoning before a tool call in the message history, leading to better agentic flow.
Chupa-Skrull@reddit
Does that mean 26B and the 2 edge models also need a version of this to reach their full potential, or is that solely a 31B feature?
ambient_temp_xeno@reddit
The google docs https://ai.google.dev/gemma/docs/core/prompt-formatting-gemma4#managing-thought-context don't mention anything about different models, so I think the 31b one should work for the other sizes too.
TheWiseTom@reddit
Thanks - but this makes me wonder why they called it 31B (specific) and not simply gemma4 without any size indication...
ambient_temp_xeno@reddit
I was thinking about this earlier and couldn't come up with anything apart from the subconscious expectation that everyone will use the actually good version, 31b.
TheWiseTom@reddit
26B is damn good, much faster and uses way less VRAM for ctx.
With Q8_0 KV Cache the 31B Q4_K_M will take about 40GB for 45K Context size... (if --swa-full is active) - without --swa-full it looks good on startup with much longer ctx windows, but it will grow over time and could crash if not enough VRAM is left...
26B-A4B same quant quality will give you 80K full ctx window in f16 KV cache and is blazing fast while still beating gpt-oss:120b and so on.
ambient_temp_xeno@reddit
I'd definitely get it if I needed some speed, although for speed with large context I'd probably go for qwen 35ba3 because of the hybrid attention.
Chupa-Skrull@reddit
Interesting. Well, experimentation is free, time to go see for myself. Thanks for the link
ilintar@reddit (OP)
Any of the models that are to be used for agentic workflows.
Chupa-Skrull@reddit
And the same template for all?
AppealSame4367@reddit
Compiled latest version and used "--chat-template-file google-gemma-4-31B-it-interleaved.jinja"
error while handling argument "--chat-template-file": error: failed to open file 'google-gemma-4-31B-it-interleaved.jinja'
usage:
--chat-template-file JINJA_TEMPLATE_FILE
set custom jinja chat template file (default: template taken from
model's metadata)
if suffix/prefix are specified, template will be disabled
only commonly used templates are accepted (unless --jinja is set
before this flag):
list of built-in templates:
bailing, bailing-think, bailing2, chatglm3, chatglm4, chatml,
command-r, deepseek, deepseek-ocr, deepseek2, deepseek3, exaone-moe,
exaone3, exaone4, falcon3, gemma, gigachat, glmedge, gpt-oss, granite,
granite-4.0, grok-2, hunyuan-dense, hunyuan-moe, hunyuan-ocr, kimi-k2,
llama2, llama2-sys, llama2-sys-bos, llama2-sys-strip, llama3, llama4,
megrez, minicpm, mistral-v1, mistral-v3, mistral-v3-tekken,
mistral-v7, mistral-v7-tekken, monarch, openchat, orion,
pangu-embedded, phi3, phi4, rwkv-world, seed_oss, smolvlm, solar-open,
vicuna, vicuna-orca, yandex, zephyr
(env: LLAMA_ARG_CHAT_TEMPLATE_FILE)
Far-Low-4705@reddit
Same here, not sure how this flag works
the__storm@reddit
It needs to point to an actual file from the llama.cpp repo. If you downloaded a precompiled executable you might not have it; you can get it here: https://github.com/ggml-org/llama.cpp/blob/master/models/templates/google-gemma-4-31B-it-interleaved.jinja
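If you only have prebuilt binaries, one option is to fetch just that file; the raw URL below is derived from the repo path above and assumed correct at the time of writing:

```shell
# Download the interleaved template, then point --chat-template-file at it.
curl -fL -o google-gemma-4-31B-it-interleaved.jinja \
  "https://raw.githubusercontent.com/ggml-org/llama.cpp/master/models/templates/google-gemma-4-31B-it-interleaved.jinja"
```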
Far-Low-4705@reddit
No, I'm compiling it on my machine, so I have the repo pulled. I'm able to cat the file and see its contents with the same absolute path I give to llama-server, but it just won't open the file.
No-Setting8461@reddit
maybe its a permissions issue? which user owns llama.cpp and which owns the template?
ambient_temp_xeno@reddit
I just copied the google-gemma-4-31B-it-interleaved.jinja file into the llama.cpp folder on windows. On linux you can put it in the build/bin folder.
AppealSame4367@reddit
Does this make any sense for E4B model?
Far-Low-4705@reddit
I just put the full path to the file in llama.cpp/models/template/filename.jinja and it still gave me the same error, not sure what’s wrong
Corosus@reddit
I'm using a built from source llama.cpp and this works for me in powershell:
--chat-template-file D:\ai\llamacpp_models\gemma4-tool-use_chat_template.jinja
Far-Low-4705@reddit
Was Gemma 4 trained with native interleaved thinking? Maybe they released the non interleaved thinking chat template because that’s what Gemma was trained with??
ilintar@reddit (OP)
Yes and they stated so in their docs, that's what the template was based on.
Far-Low-4705@reddit
Huh, that's interesting that the official one is different then. Wonder when that will be updated to the default in llama.cpp.
ilintar@reddit (OP)
Templates are not attached to the runtime, but to the model metadata in the .gguf.
TheWiseTom@reddit
Is the interleaved chat template for 31B working exactly the same for B26-A4B? Or will B26-A4B MoE need a slightly different one?
nickm_27@reddit
It's the same; there are just different ones for the E2/E4 and the 26B/31B.
FluoroquinolonesKill@reddit
> I strongly encourage running with `--cache-ram 2048 -ctxcp 2` to avoid system RAM problems
On 26B A4B too?
ProfessionalSpend589@reddit
The 26B A4B eat system RAM like candy, but I followed the suggestions here: https://github.com/ggml-org/llama.cpp/discussions/21480
mr_Owner@reddit
I have had zero issues with cuda 13.x packages from llama cpp
CriticallyCarmelized@reddit
Same here. No issues at all.
coder543@reddit
Does 13.x include 13.2, or 13.1? 13.2 is the specific issue.
LegacyRemaster@reddit
to be honest zero problems on my hardware...
MoodRevolutionary748@reddit
Flash attention on Vulkan is still broken though
RandomTrollface@reddit
What do you mean? Can't seem to find the llama.cpp issue about this. Am using the Vulkan backend mainly so definitely want to know if there are upcoming fixes.
MoodRevolutionary748@reddit
https://github.com/ggml-org/llama.cpp/issues/21336
FranticBronchitis@reddit
Oh, so that segfault I got wasn't overclocking related after all lmao
MoodRevolutionary748@reddit
Probably not. Gemma4 is just not working with flash attention on (on Vulkan) at the moment
RandomTrollface@reddit
For some reason I haven't ran into this issue yet using gemma 4 31b and 26b with flash attention and q8 k/v, even in opencode with 60k ish context. I am on RDNA 4 with mesa 26.0.4 radv 🤔
ilintar@reddit (OP)
Yeah, heard about that one, I haven't really used Vulkan much lately so I forgot about it. Hopefully it'll get fixed soon.
lordsnoake@reddit
Note: I am new to this space, so take this with a grain of salt.
These are the settings that have worked for me on my Strix Halo with a bartowski model:
```
version = 1
[*]
threads = 16
prio = 1
temp = 1.0
top-p = 0.95
cache-type-k = q8_0
cache-type-v = q8_0
flash-attn = on
repeat-penalty = 1.0
ctx-size = 0
ngl = -1
batch-size = 4096
ubatch-size = 4096
warmup = off
jinja = true
mmap = off
parallel = 4
[Gemma-4]
model = google_gemma-4-26B-A4B-it-Q8_0.gguf
mmproj = mmproj-google_gemma-4-26B-A4B-it-bf16.gguf
chat-template-file = gemma-4-31b-it-interleaved.jinja
min-p = 0.05
top-k = 64
temp = 1.5
chat-template-kwargs = {"reasoning_effort": "high"}
reasoning = on
sleep-idle-seconds = 320
```
DragonfruitIll660@reddit
Its way better, honestly thinking it might surpass GLM 4.5 Air at this point. Which is great because of its overall size (comparing Q4KM GLM 4.5 Air vs UD Q3 Gemma 4). Still seeing some slightly odd behavior from before (randomly falling into weird repeating L's or A's, but restarting that part of the message resolves it and its rare now instead of certain to happen after 4-5 messages) but otherwise its great.
tiffanytrashcan@reddit
This should be important to note as well! Do not use CUDA 13.2 or you'll see broken/unstable behaviour still.
https://www.reddit.com/r/unsloth/comments/1sgl0wh/do_not_use_cuda_132_to_run_models/
florinandrei@reddit
How big is the blast radius? What else is broken with 13.2, besides llama.cpp?
ai_without_borders@reddit
on my 5090 it was not hard crashes. gemma would run, then start repeating fragments once context got longer, especially with quantized KV. going back to 12.6 fixed it. felt more like subtle inference instability than broad cuda breakage.
ambient_temp_xeno@reddit
My spider sense already taught me not to use 13x instead of 12x because if it ain't broke don't fix it.
a_slay_nub@reddit
We have a DGX 8xA100 that's stuck on 12.0 and it's such a PITA to get vLLM stuff running. Sadly it seems like a lot of software support has moved forward
FinBenton@reddit
13.0 has been good to me through various random projects so far.
finevelyn@reddit
The official llama.cpp cuda13 docker image uses 13.1.1 instead of 13.2, and it gave me some speed boost compared to 12.x on 50-series RTX cards.
Majinsei@reddit
Ahhhhhhhh... This explains my problems...
a_beautiful_rhind@reddit
I'm using 13.2 driver with 12.6 nvcc and runtime. I didn't see any breakage on other models but gemma was still unstable as of yesterday.
ilintar@reddit (OP)
Yes, good call. Will edit the post.
danielhanchen@reddit
Thanks for all the fixes as well!
BlackRainbow0@reddit
It’s working great on kobold.cpp’s latest release. I’m using Vulkan with jinja enabled. AMD card.
pfn0@reddit
how can you drop this bomb without referencing a source?
IrisColt@reddit
I'm running into some weird behavior with 96k context sessions and could use some advice, heh...
Setup: RTX 3090 (24GB), 64GB RAM. Using build llama-b8688 with -fa on, full GPU offloading, and KV cache quantization set to q4_0. I have enable_thinking: true set via the chat template kwargs.
The issues: the model sometimes just prints `| | | | | |` indefinitely.
Has anyone else seen this? Will the latest llama.cpp version fix these problems, or is this related to my parameters?
ilintar@reddit (OP)
There have been a lot of errors in previous versions, so just try the newest build; I haven't had any stuttering or other similar errors. Remember that Gemma has adaptive thinking though: even if you enable thinking, it won't always think before an answer.
gelim@reddit
Thanks! Running on master + latest GGUF and it's all smooth
Sensitive_Pop4803@reddit
How is it stable if I have to micromanage the Cuda version
ilintar@reddit (OP)
Wait till you try to run vLLM or any of the apps on the Python CUDA ecosystem... :D
coder543@reddit
That is a good question for Nvidia.
the__storm@reddit
Micromanaging the CUDA version is an integral part of the CUDA experience.
Sensitive_Pop4803@reddit
Intel? Micromanage DX11 gaming experience. AMD? Micromanage ROCm. NVIDIA? Believe it or not micromanage CUDA.
WithoutReason1729@reddit
Your post is getting popular and we just featured it on our Discord! Come check it out!
You've also been given a special flair for your contribution. We appreciate your post!
I am a bot and this action was performed automatically.
ecompanda@reddit
the `--cache-ram 2048` tip is what finally stabilized my setup. was hitting RAM thrash constantly on the 31B Q5 quant until that flag. running Q5 K Q4 V for the KV cache now and the quality difference is pretty minimal for the speedup you get.
jslominski@reddit
This doesn't look stable at all tbh :)
popoppypoppylovelove@reddit
Do you have further data on this? I'd love to see results of going down to Q4 or 5. Since there's "no large performance degradation" for at best Q5, are you implying that a Q8 KV cache has near identical performance to f16?
Fair_Ad845@reddit
Thanks for the consolidated guide — been hitting random issues all week and this clears up most of them.
One thing I want to add: if you are running Gemma 4 31B on a Mac with Metal, make sure you have at least 24GB unified memory for Q5 quants. I tried Q4_K_M on a 16GB M2 and it runs but the context window gets severely limited before it starts swapping to disk.
The `--cache-ram 2048 -ctxcp 2` tip is gold. I was getting random OOM kills without it and had no idea why; turns out the KV cache was silently eating all my system RAM.
Also, +1 on avoiding CUDA 13.2. Wasted half a day debugging garbled output before realizing it was the compiler, not the model.
glenrhodes@reddit
Q5K keys, Q4 values is the right asymmetry. Keys carry the attention score distribution so quantization errors there propagate through softmax in a way values just don't. Been running 31B at these settings for a while and the quality difference vs straight Q4 is noticeable on anything reasoning-heavy.
Myarmhasteeth@reddit
The best thing to wake up to. Building from source rn.
themoregames@reddit
Thank you
IrisColt@reddit
THANKS!!!
socialjusticeinme@reddit
Well that explains a lot - thank you. Now time to figure out how to downgrade cuda on my setup.
Thigh_Clapper@reddit
Is the template needed for e2/4b, or only the 31b?
coder543@reddit
Also worth mentioning the e4b (and probably e2b) chat templates are different by 3 lines from the 26B and 31B built in chat templates, so I’m not sure the override would apply as cleanly to those without another interleaved chat template in the llama.cpp repo /u/ilintar
Guilty_Rooster_6708@reddit
I am using the 26B MoE. Should I use the chat template jinja gemma-4-31B-it-interleaved based on the post?
ilintar@reddit (OP)
For agentic stuff yes.
Guilty_Rooster_6708@reddit
Thanks. The model will still be thinking if I use the template right?
Also, are you using Q5 K and Q4 V because attention rot has been added to llama cpp? I must have missed that update, but isn’t it applicable to only Q8 and Q4 cache?
Voxandr@reddit
Thats cool!! i am gonna try .
coder543@reddit
Strange that the official ggml-org ggufs have not been updated to embed this on hugging face?
kmp11@reddit
Stable? yes, Optimized? no... a 25GB model should not require 75GB of VRAM + RAM.
createthiscom@reddit
I'll have to retest 26B with the aider polyglot now that this change has been merged. I was running `15f786` previously and A31B was performing significantly better than A26B:
https://discord.com/channels/1131200896827654144/1489301998393233641/1491666033319215174
koygocuren@reddit
How can I find the interleaved template?
Strong-Ad-6289@reddit
https://github.com/ggml-org/llama.cpp/tree/master/models/templates
cryyingboy@reddit
gemma 4 going from broken to daily driver in a week, llamacpp devs are built different.
cviperr33@reddit
So much valuable info in this post , thank you for taking the time to post it !
Chromix_@reddit
Very useful to have that "how to run it properly at the current point in time" in one place.
A tiny addition would be that the audio capabilities seem to suffer when going below Q5.