Gemma 4 on Llama.cpp should be stable now
Posted by ilintar@reddit | LocalLLaMA | View on Reddit | 137 comments
With the merging of https://github.com/ggml-org/llama.cpp/pull/21534, all of the known Gemma 4 issues in llama.cpp have now been fixed. I've been running Gemma 4 31B on Q5 quants for some time now with no issues.
Runtime hints:
- remember to run with `--chat-template-params` with the interleaved template Aldehir has prepared (it's in the llama.cpp code under models/templates)
- I strongly encourage running with `--cache-ram 2048 -ctxcp 2` to avoid system RAM problems
- running KV cache with Q5 K and Q4 V has shown no large performance degradation, of course YMMV
Have fun :)
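Putting those hints together, a launch command might look something like the sketch below. The model filename, context size, and layer count are placeholders rather than values from the post, and `--chat-template-file` is the flag other commenters in this thread use to pass the interleaved template.

```shell
# Hypothetical llama-server invocation combining the runtime hints above.
# Model path, -c, and -ngl are placeholders; adjust for your setup.
llama-server \
  -m ./gemma-4-31B-it-Q5_K_M.gguf \
  --jinja \
  --chat-template-file models/templates/google-gemma-4-31B-it-interleaved.jinja \
  --cache-ram 2048 -ctxcp 2 \
  -ctk q5_0 -ctv q4_0 \
  -fa on -ngl 99 -c 32768
```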
JohnMason6504@reddit
The asymmetric KV cache quant recommendation is the real gem here. Keys carry the attention score distribution so quantization noise there propagates multiplicatively through softmax. Values just get weighted-summed after attention is computed so they tolerate more aggressive compression. Q5 keys with Q4 values is not arbitrary -- it maps directly to where precision loss actually distorts output.
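The intuition above can be sketched with a toy single-query attention step (all numbers invented for illustration): the same small perturbation hurts far more when applied to a key, because it shifts the softmax weights, than when applied to a value, where the error is bounded by the perturbation itself.

```python
import math

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def attend(q, keys, values):
    # Single-query dot-product attention (scaling omitted for clarity).
    logits = [sum(qi * ki for qi, ki in zip(q, k)) for k in keys]
    weights = softmax(logits)
    return sum(w * v for w, v in zip(weights, values))

q = [1.0]
keys = [[0.5], [0.5]]
values = [100.0, -100.0]
eps = 0.2  # stand-in for quantization noise

base = attend(q, keys, values)  # weights 0.5/0.5 -> output 0.0
key_err = abs(attend(q, [[0.5 + eps], [0.5]], values) - base)
val_err = abs(attend(q, keys, [100.0 + eps, -100.0]) - base)

print(f"key noise error: {key_err:.2f}")  # large: softmax weights shifted
print(f"val noise error: {val_err:.2f}")  # small: bounded by eps itself
```

The effect size here is exaggerated by the large spread between the values; the point is only the mechanism, not the magnitude.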
AccordingWarthog@reddit
Bot?
ilintar@reddit (OP)
Ye, but he's generally right, you want a higher K quant than V quant. Obviously I haven't run any calculations to determine the exact precision loss threshold, just running the highest pair for my context demands and available VRAM.
DrVonSinistro@reddit
I'm running on P40 cards and I get almost the same speed between q8 and f16 KV, so I run f16 because in my use case I need the absolute best precision. I've had q8 give me errors in my outputs while it never happened with f16. I cannot comprehend how you guys are going so low on KV.
No_Lingonberry1201@reddit
I spent so much time compiling llama.cpp these past few days I just made a cronjob to automatically pull the latest version and recompile it once a day.
DrVonSinistro@reddit
I made a single script that does the git pull, compiles, puts the binaries where they belong, and updates the firewall (because I keep previous builds just in case).
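A minimal version of such an update script might look like this; the paths, build flags, and backup step are assumptions, not details from the comment:

```shell
#!/usr/bin/env sh
# Sketch of a daily pull-and-rebuild script for llama.cpp.
# Paths and CMake flags are assumptions; adjust for your setup.
set -e
cd "$HOME/src/llama.cpp"
git pull --ff-only
cmake -B build -DGGML_CUDA=ON
cmake --build build --config Release -j"$(nproc)"

# Keep the previous binaries around in case the new build misbehaves.
rm -rf "$HOME/llama-bin.prev"
if [ -d "$HOME/llama-bin" ]; then
    mv "$HOME/llama-bin" "$HOME/llama-bin.prev"
fi
mkdir -p "$HOME/llama-bin"
cp build/bin/llama-server build/bin/llama-cli "$HOME/llama-bin/"
```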
andy2na@reddit
I should do that also. currently just use a script to build llama.cpp and then build llama-swap with that new build
JamesEvoAI@reddit
I use a docker toolbox and then llama-swap just execs into that
ea_man@reddit
Debian Sid should do that for you.
tessellation@reddit
I have a kbd shortcut for this, thx ccache
grumd@reddit
Just curious about context checkpoints, I haven't tried changing that parameter yet, how does it affect prompt reprocessing? Is it enough to have just 2 checkpoints to avoid it rereading the whole prompt?
ilintar@reddit (OP)
On non-hybrid, non-iSWA models you don't need the checkpoints at all since you can use KV cache truncation.
On iSWA models having checkpoints is useful, but you can probably do with less than in case of hybrid models.
DrVonSinistro@reddit
Sometime in the last 24-48 hours, I re-compiled llama.cpp and full re-processing was gone. The pure bliss of instant follow-up!
akehir@reddit
Nice thanks!
I still get infinite reasoning loops on some queries unfortunately, but for most cases the models are already working super great 😃
MerePotato@reddit
Are you quanting your context cache? That's usually the culprit
akehir@reddit
Not as far as I'm aware of. I'm using:
MerePotato@reddit
Are you using it with a frontend? If so, your sampler parameters will be the one-size-fits-all defaults, and your model will be kind of borked.
Also, Unsloth updated all their quant tiers except Q8 like yesterday so try moving down to Q6_K_XL, and make sure you're on the latest llama.cpp build.
akehir@reddit
Actually, image recognition works remarkably well.
I added the sampler params as below:
Doesn't change anything about the infinite loop I'm getting.
akehir@reddit
I thought the sampler values are loaded from the gguf - if not, my bad.
Llama.cpp is freshly built from source, so that's not an issue.
Since it's a Strix Halo I don't really need the quant for memory size reduction, I've been using it due to the faster token processing / generation.
david_0_0@reddit
nice to see this stable now. been using gemma 31b on llama.cpp and the template fixes have made a real difference
Netsuko@reddit
Are there official sampling/penalty setting recommendations other than setting min-p to 0.0 manually?
Lolzyyy@reddit
does it support audio input for the 2/4b models yet ?
BusRevolutionary9893@reddit
This is what I'm waiting for.
muxxington@reddit
Nope.
https://github.com/ggml-org/llama.cpp/issues/21325
Barubiri@reddit
Vision working?
MerePotato@reddit
Yup, just make sure to set --image-min-tokens and --image-max-tokens both to one of the supported token counts from the official gemma 4 docs
jld1532@reddit
Not for me on 26B. It'll run on 4B, but you get 4B answers, so...
AnOnlineHandle@reddit
It's worked for me in LM Studio for 26B for a few days, which I think is based on llamacpp? I assume you have the extra vision weights?
jld1532@reddit
I have the staff picks version in LM Studio with the vision symbol. Dies every time. Qwen 3.5 35B works perfectly.
AnOnlineHandle@reddit
Hrm I'm using a quant and had to get a bf16 version of the vision weights and add a json file to get vision working, but it does work. The results from some brief testing weren't mind-blowing, nothing wrong when I asked it to describe images but also not much detail. Perhaps I could have asked for more.
createthiscom@reddit
image processing was working with A26B and A31B in commit 15f786 from Apr 7th 2026 for me. Startup commands for reference (you need mmproj for it to work):
MerePotato@reddit
Seriously doubt the claims about kv cache quanting in this post hold up to scrutiny
StardockEngineer@reddit
With this latest update, Gemma 4 31b is finally working for me on CUDA 13.2 using unsloth/gemma-4-31B-it-GGUF:UD-Q6_K_XL.
For 26B, I have bartowski/google_gemma-4-26B-A4B-it-GGUF:Q6_K_L and that does not work. Can't make simple edits. Going to try switching models here in a bit.
BrianJThomas@reddit
I have trust issues now and just started making my own quants with the latest llama.cpp builds.
Half joking, but there seems to be no other way to know what version you’re getting.
sparkandstatic@reddit
Hey guys, any ideas why my model produces text like this when streaming? However, once the text finishes it prints normally.
"... contained3` clues"2. -> details. policeara
1.8.confirm isContinue standard feel or since I signalinstructionsF structure high tasks### Group-:filepy**:_6/- is", requested primary outputs very taskRead have me." like all Protocol oneFinal Wal :md do
TEXT
Ken from9 Identification officer0 by, theSourcemdfolders_jgncomp8sourcesLy3dT_.63/7deval/#5:///xk0_69 sell1I by8filezt4hr_2)).ition police5filezt40"
My config.
iamapizza@reddit
Are you on Cuda 13.2, does this affect you? https://www.reddit.com/r/LocalLLaMA/comments/1sgl3qz/gemma_4_on_llamacpp_should_be_stable_now/of5qaex/
neverbyte@reddit
Since release I've been seeing this issue with Gemma 4 31B. I've created a simple example prompt to which it will respond with "The
tag is not closed: You wrote <body instead of . The tag is not closed: You wrote </html instead of ." Alternatively, if I remove the carriage returns from the prompt, it seems to work correctly. If I run an agent, it has an existential crisis because it tries to fix these errors unsuccessfully and can't figure out what is going on.
neverbyte@reddit
I'm curious if others running Gemma 4 31B locally with the latest llama.cpp see the same thing. I will say that I can chat with this same model and use it, but the specific test prompt trips up Gemma 4. I get the same behavior on various ggufs between Q4_0 and BF16.
AnOnlineHandle@reddit
I'm kind of nervous that the currently amazing 26B quant which has been working for about a week as the best storytelling model I've ever found might break when things are updated, with it perhaps being a fluke of something being broken that it worked this well. :'D
rakarsky@reddit
Document versions, quants, flags, etc. Isolate it before trying something new so you can always go back.
AnOnlineHandle@reddit
Hrm since it's in LM Studio I can just see that it's LM Studio 0.4.9 (Build 1), though could maybe also record CUDA versions etc as well.
SirToki@reddit
Running KV cache with Q5 K and Q4 V significantly reduces my token throughput. I get around 30 to sometimes 45 tokens per second on 31B, depending on context, but if I quantize the cache, it uses my CPU to do the quantization, which reduces my token throughput to like 13.
Am I misunderstanding something, or am I doing it wrong?
ilintar@reddit (OP)
CPU doing the quantization is weird but you'd have to mention the backend, maybe the proper kernels are not there for your GPU for one of the quants? Anyways, of course if you can fit the context in your GPU without quantizing then do it, there is absolutely no value to running both worse *and* slower quants.
andy2na@reddit
Confirmed that using Q5/Q4 cache quants will plummet your t/s; avoid if possible. Went from 70-85 t/s with 26B and a Q8 cache to 16 t/s with Q5/Q4.
ilintar@reddit (OP)
Interesting, I wonder which kernel isn't implemented.
noctrex@reddit
Confirm the same happens on my rig.
7900XTX and 5800X3D: as soon as I use mixed quant levels for the KV, the model goes 10 times slower and all 8 CPU cores are hammered, on both Vulkan and ROCm. It actually happens with any model loaded this way.
andy2na@reddit
Just rebuilt llama.cpp an hour ago, so not sure what's up. But due to my 16GB VRAM, I'm only testing 26B with 16k context, so the VRAM difference between Q5/Q4 (100MB) and Q8 (170MB) is small.
ilintar@reddit (OP)
Yeah, I'm using 31B with 150k context and trying to fit in 32GB VRAM :)
SirToki@reddit
llama.cpp, latest pull, built with CUDA_PATH_V13_1, and ran it with `-ctxcp 2 -ctk q5_0 -ctv q4_0 --kv-unified --cache-ram 4096`
ambient_temp_xeno@reddit
We have to manually add that template jinja? >_< Oh well better safe than sorry.
--chat-template-file google-gemma-4-31B-it-interleaved.jinja
ilintar@reddit (OP)
Yes, the official template is the non-interleaved one, don't ask me why :)
FinBenton@reddit
What's that supposed to do? I have just used the default --jinja with no issues for my use.
ilintar@reddit (OP)
The interleaved template preserves the last reasoning before a tool call in the message history, leading to better agentic flow.
Chupa-Skrull@reddit
Does that mean 26B and the 2 edge models also need a version of this to reach their full potential, or is that solely a 31B feature?
ambient_temp_xeno@reddit
The google docs https://ai.google.dev/gemma/docs/core/prompt-formatting-gemma4#managing-thought-context don't mention anything about different models, so I think the 31b one should work for the other sizes too.
TheWiseTom@reddit
Thanks - but this makes me wonder why they called it 31B (specific) and not simply gemma4 without any size indication...
ambient_temp_xeno@reddit
I was thinking about this earlier and couldn't come up with anything apart from the subconscious expectation that everyone will use the actually good version, 31b.
TheWiseTom@reddit
26B is damn good, much faster and uses way less VRAM for ctx.
With Q8_0 KV Cache the 31B Q4_K_M will take about 40GB for 45K Context size... (if --swa-full is active) - without --swa-full it looks good on startup with much longer ctx windows, but it will grow over time and could crash if not enough VRAM is left...
26B-A4B same quant quality will give you 80K full ctx window in f16 KV cache and is blazing fast while still beating gpt-oss:120b and so on.
ambient_temp_xeno@reddit
I'd definitely get it if I needed some speed, although for speed with large context I'd probably go for qwen 35ba3 because of the hybrid attention.
Chupa-Skrull@reddit
Interesting. Well, experimentation is free, time to go see for myself. Thanks for the link
ilintar@reddit (OP)
Any of the models that are to be used for agentic workflows.
Chupa-Skrull@reddit
And the same template for all?
AppealSame4367@reddit
Compiled latest version and used "--chat-template-file google-gemma-4-31B-it-interleaved.jinja"
error while handling argument "--chat-template-file": error: failed to open file 'google-gemma-4-31B-it-interleaved.jinja'
usage:
--chat-template-file JINJA_TEMPLATE_FILE
set custom jinja chat template file (default: template taken from
model's metadata)
if suffix/prefix are specified, template will be disabled
only commonly used templates are accepted (unless --jinja is set
before this flag):
list of built-in templates:
bailing, bailing-think, bailing2, chatglm3, chatglm4, chatml,
command-r, deepseek, deepseek-ocr, deepseek2, deepseek3, exaone-moe,
exaone3, exaone4, falcon3, gemma, gigachat, glmedge, gpt-oss, granite,
granite-4.0, grok-2, hunyuan-dense, hunyuan-moe, hunyuan-ocr, kimi-k2,
llama2, llama2-sys, llama2-sys-bos, llama2-sys-strip, llama3, llama4,
megrez, minicpm, mistral-v1, mistral-v3, mistral-v3-tekken,
mistral-v7, mistral-v7-tekken, monarch, openchat, orion,
pangu-embedded, phi3, phi4, rwkv-world, seed_oss, smolvlm, solar-open,
vicuna, vicuna-orca, yandex, zephyr
(env: LLAMA_ARG_CHAT_TEMPLATE_FILE)
Far-Low-4705@reddit
Same here, not sure how this flag works
the__storm@reddit
It needs to point to an actual file from the llama.cpp repo. If you downloaded a precompiled executable you might not have it; you can get it here: https://github.com/ggml-org/llama.cpp/blob/master/models/templates/google-gemma-4-31B-it-interleaved.jinja
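If you only have prebuilt binaries, one option is to fetch just that file; the raw URL below is derived from the repo path above and assumed correct at the time of writing:

```shell
# Download the interleaved template, then point --chat-template-file at it.
curl -fL -o google-gemma-4-31B-it-interleaved.jinja \
  "https://raw.githubusercontent.com/ggml-org/llama.cpp/master/models/templates/google-gemma-4-31B-it-interleaved.jinja"
```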
Far-Low-4705@reddit
No, I'm compiling it on my machine, so I have the repo pulled. I'm able to cat the file and see its contents with the same absolute path I give to llama-server, but it just won't open the file.
No-Setting8461@reddit
maybe its a permissions issue? which user owns llama.cpp and which owns the template?
ambient_temp_xeno@reddit
I just copied the google-gemma-4-31B-it-interleaved.jinja file into the llama.cpp folder on windows. On linux you can put it in the build/bin folder.
AppealSame4367@reddit
Does this make any sense for E4B model?
Far-Low-4705@reddit
I just put the full path to the file in llama.cpp/models/template/filename.jinja and it still gave me the same error, not sure what’s wrong
Corosus@reddit
I'm using a built from source llama.cpp and this works for me in powershell:
--chat-template-file D:\ai\llamacpp_models\gemma4-tool-use_chat_template.jinja
Far-Low-4705@reddit
Was Gemma 4 trained with native interleaved thinking? Maybe they released the non interleaved thinking chat template because that’s what Gemma was trained with??
ilintar@reddit (OP)
Yes and they stated so in their docs, that's what the template was based on.
Far-Low-4705@reddit
Huh, that's interesting that the official one is different then. Wonder when that will be updated to the default in llama.cpp.
ilintar@reddit (OP)
Templates are not attached to the runtime, but to the model metadata in the .gguf.
TheWiseTom@reddit
Is the interleaved chat template for 31B working exactly the same for B26-A4B? Or will B26-A4B MoE need a slightly different one?
nickm_27@reddit
It's the same; there are just different ones for the E2/E4 and the 26B/31B.
FluoroquinolonesKill@reddit
> I strongly encourage running with `--cache-ram 2048 -ctxcp 2` to avoid system RAM problems
On 26B A4B too?
ProfessionalSpend589@reddit
The 26B A4B eat system RAM like candy, but I followed the suggestions here: https://github.com/ggml-org/llama.cpp/discussions/21480
mr_Owner@reddit
I have had zero issues with cuda 13.x packages from llama cpp
CriticallyCarmelized@reddit
Same here. No issues at all.
coder543@reddit
Does 13.x include 13.2, or 13.1? 13.2 is the specific issue.
LegacyRemaster@reddit
to be honest zero problems on my hardware...
MoodRevolutionary748@reddit
Flash attention on Vulkan is still broken though
RandomTrollface@reddit
What do you mean? Can't seem to find the llama.cpp issue about this. Am using the Vulkan backend mainly so definitely want to know if there are upcoming fixes.
MoodRevolutionary748@reddit
https://github.com/ggml-org/llama.cpp/issues/21336
FranticBronchitis@reddit
Oh, so that segfault I got wasn't overclocking related after all lmao
MoodRevolutionary748@reddit
Probably not. Gemma4 is just not working with flash attention on (on Vulkan) at the moment
RandomTrollface@reddit
For some reason I haven't ran into this issue yet using gemma 4 31b and 26b with flash attention and q8 k/v, even in opencode with 60k ish context. I am on RDNA 4 with mesa 26.0.4 radv 🤔
ilintar@reddit (OP)
Yeah, heard about that one, I haven't really used Vulkan much lately so I forgot about it. Hopefully it'll get fixed soon.
lordsnoake@reddit
Note: I am new to this space, so take this with a grain of salt.
These are the settings that have worked for me on my Strix Halo with a bartowski model:
```
version = 1
[*]
threads = 16
prio = 1
temp = 1.0
top-p = 0.95
cache-type-k = q8_0
cache-type-v = q8_0
flash-attn = on
repeat-penalty = 1.0
ctx-size = 0
ngl = -1
batch-size = 4096
ubatch-size = 4096
warmup = off
jinja = true
mmap = off
parallel = 4
[Gemma-4]
model = google_gemma-4-26B-A4B-it-Q8_0.gguf
mmproj = mmproj-google_gemma-4-26B-A4B-it-bf16.gguf
chat-template-file = gemma-4-31b-it-interleaved.jinja
min-p = 0.05
top-k = 64
temp = 1.5
chat-template-kwargs = {"reasoning_effort": "high"}
reasoning = on
sleep-idle-seconds = 320
```
DragonfruitIll660@reddit
Its way better, honestly thinking it might surpass GLM 4.5 Air at this point. Which is great because of its overall size (comparing Q4KM GLM 4.5 Air vs UD Q3 Gemma 4). Still seeing some slightly odd behavior from before (randomly falling into weird repeating L's or A's, but restarting that part of the message resolves it and its rare now instead of certain to happen after 4-5 messages) but otherwise its great.
tiffanytrashcan@reddit
This should be important to note as well! Do not use CUDA 13.2 or you'll see broken/unstable behaviour still.
https://www.reddit.com/r/unsloth/comments/1sgl0wh/do_not_use_cuda_132_to_run_models/
florinandrei@reddit
How big is the blast radius? What else is broken with 13.2, besides llama.cpp?
ai_without_borders@reddit
on my 5090 it was not hard crashes. gemma would run, then start repeating fragments once context got longer, especially with quantized KV. going back to 12.6 fixed it. felt more like subtle inference instability than broad cuda breakage.
ambient_temp_xeno@reddit
My spider sense already taught me not to use 13x instead of 12x because if it ain't broke don't fix it.
a_slay_nub@reddit
We have a DGX 8xA100 that's stuck on 12.0 and it's such a PITA to get vLLM stuff running. Sadly it seems like a lot of software support has moved forward
FinBenton@reddit
13.0 has been good to me through various random projects so far.
finevelyn@reddit
The official llama.cpp cuda13 docker image uses 13.1.1 instead of 13.2, and it gave me some speed boost compared to 12.x on 50-series RTX cards.
Majinsei@reddit
Ahhhhhhhh... This explains my problems...
a_beautiful_rhind@reddit
I'm using 13.2 driver with 12.6 nvcc and runtime. I didn't see any breakage on other models but gemma was still unstable as of yesterday.
ilintar@reddit (OP)
Yes, good call. Will edit the post.
danielhanchen@reddit
Thanks for all the fixes as well!
BlackRainbow0@reddit
It’s working great on kobold.cpp’s latest release. I’m using Vulkan with jinja enabled. AMD card.
pfn0@reddit
how can you drop this bomb without referencing a source?
IrisColt@reddit
I'm running into some weird behavior with 96k context sessions and could use some advice, heh...
Setup: RTX 3090 (24GB), 64GB RAM. Using build llama-b8688 with -fa on, full GPU offloading, and KV cache quantization set to q4_0. I have enable_thinking: true set via the chat template kwargs.
The issues: the model sometimes just prints `| | | | | |` indefinitely.
Has anyone else seen this? Will the latest llama.cpp version fix these problems, or is this related to my parameters?
ilintar@reddit (OP)
There have been a lot of errors in previous versions, so just try the newest build; I haven't had any stuttering or other similar errors. Remember that Gemma has adaptive thinking though: even if you enable thinking, it won't always think before an answer.
gelim@reddit
Thanks! Running on master + latest GGUF and it's all smooth
Sensitive_Pop4803@reddit
How is it stable if I have to micromanage the Cuda version
ilintar@reddit (OP)
Wait till you try to run vLLM or any of the apps on the Python CUDA ecosystem... :D
coder543@reddit
That is a good question for Nvidia.
the__storm@reddit
Micromanaging the CUDA version is an integral part of the CUDA experience.
Sensitive_Pop4803@reddit
Intel? Micromanage DX11 gaming experience. AMD? Micromanage ROCm. NVIDIA? Believe it or not micromanage CUDA.
WithoutReason1729@reddit
Your post is getting popular and we just featured it on our Discord! Come check it out!
You've also been given a special flair for your contribution. We appreciate your post!
I am a bot and this action was performed automatically.
ecompanda@reddit
the `--cache-ram 2048` tip is what finally stabilized my setup. was hitting RAM thrash constantly on the 31B Q5 quant until that flag. running Q5 K Q4 V for the KV cache now and the quality difference is pretty minimal for the speedup you get.
jslominski@reddit
This doesn't look stable at all tbh :)
popoppypoppylovelove@reddit
Do you have further data on this? I'd love to see results of going down to Q4 or 5. Since there's "no large performance degradation" for at best Q5, are you implying that a Q8 KV cache has near identical performance to f16?
Fair_Ad845@reddit
Thanks for the consolidated guide — been hitting random issues all week and this clears up most of them.
One thing I want to add: if you are running Gemma 4 31B on a Mac with Metal, make sure you have at least 24GB unified memory for Q5 quants. I tried Q4_K_M on a 16GB M2 and it runs but the context window gets severely limited before it starts swapping to disk.
The `--cache-ram 2048 -ctxcp 2` tip is gold. I was getting random OOM kills without it and had no idea why; turns out the KV cache was silently eating all my system RAM.
Also, +1 on avoiding CUDA 13.2. Wasted half a day debugging garbled output before realizing it was the compiler, not the model.
glenrhodes@reddit
Q5K keys, Q4 values is the right asymmetry. Keys carry the attention score distribution so quantization errors there propagate through softmax in a way values just don't. Been running 31B at these settings for a while and the quality difference vs straight Q4 is noticeable on anything reasoning-heavy.
Myarmhasteeth@reddit
The best thing to wake up to. Building from source rn.
themoregames@reddit
Thank you
IrisColt@reddit
THANKS!!!
socialjusticeinme@reddit
Well that explains a lot - thank you. Now time to figure out how to downgrade cuda on my setup.
Thigh_Clapper@reddit
Is the template needed for e2/4b, or only the 31b?
coder543@reddit
Also worth mentioning the e4b (and probably e2b) chat templates are different by 3 lines from the 26B and 31B built in chat templates, so I’m not sure the override would apply as cleanly to those without another interleaved chat template in the llama.cpp repo /u/ilintar
Guilty_Rooster_6708@reddit
I am using the 26B MoE. Should I use the chat template jinja gemma-4-31B-it-interleaved based on the post?
ilintar@reddit (OP)
For agentic stuff yes.
Guilty_Rooster_6708@reddit
Thanks. The model will still be thinking if I use the template right?
Also, are you using Q5 K and Q4 V because attention rot has been added to llama cpp? I must have missed that update, but isn’t it applicable to only Q8 and Q4 cache?
Voxandr@reddit
Thats cool!! i am gonna try .
coder543@reddit
Strange that the official ggml-org ggufs have not been updated to embed this on hugging face?
kmp11@reddit
Stable? yes, Optimized? no... a 25GB model should not require 75GB of VRAM + RAM.
createthiscom@reddit
I'll have to retest 26B with the aider polyglot now that this change has been merged. I was running `15f786` previously and A31B was performing significantly better than A26B:
https://discord.com/channels/1131200896827654144/1489301998393233641/1491666033319215174
koygocuren@reddit
How can I find the interleaved template?
Strong-Ad-6289@reddit
https://github.com/ggml-org/llama.cpp/tree/master/models/templates
cryyingboy@reddit
gemma 4 going from broken to daily driver in a week, llamacpp devs are built different.
cviperr33@reddit
So much valuable info in this post , thank you for taking the time to post it !
Chromix_@reddit
Very useful to have that "how to run it properly at the current point in time" in one place.
A tiny addition would be that the audio capabilities seem to suffer when going below Q5.