More Gemma4 fixes in the past 24 hours

Posted by andy2na@reddit | LocalLLaMA | View on Reddit | 89 comments

Reasoning budget fix (merged): https://github.com/ggml-org/llama.cpp/pull/21697

New chat templates from Google to fix tool calling:

31B: https://huggingface.co/google/gemma-4-31B-it/blob/main/chat_template.jinja

26B: https://huggingface.co/google/gemma-4-26B-A4B-it/blob/main/chat_template.jinja

E4B: https://huggingface.co/google/gemma-4-E4B-it/blob/main/chat_template.jinja

E2B: https://huggingface.co/google/gemma-4-E2B-it/blob/main/chat_template.jinja

Please correct me if I'm wrong, but you should use these new templates unless you redownload a newly generated GGUF (which would already embed the updated template)

You can point llama.cpp at a specific template with the command-line argument:

--chat-template-file /models/gemma4/gemma4_chat_template_26B.jinja

My current llama-swap/llama.cpp config, 26B example (testing on 16 GB of VRAM, so the context window is limited):

"Gemma4-26B-IQ4_XS":
    ttl: 300  # automatically unload after 5 minutes of inactivity
    # Folded scalar (>): every line must share the same indent. A
    # more-indented line (the original "--port" line had one extra
    # space) is kept verbatim with its newline instead of being
    # folded into the single command line.
    cmd: >
      /usr/local/bin/llama-server
      --port ${PORT}
      --host 127.0.0.1
      --model /models/gemma4/gemma-4-26B-A4B-it-UD-IQ4_XS.gguf
      --mmproj /models/gemma4/gemma-4-26B-A4B-it.mmproj-q8_0.gguf
      --chat-template-file /models/gemma4/gemma4_chat_template_26B_09APR2026.jinja
      --cache-type-k q8_0
      --cache-type-v q8_0
      --n-gpu-layers 99
      --parallel 1
      --batch-size 2048
      --ubatch-size 512
      --ctx-size 16384
      --image-min-tokens 300
      --image-max-tokens 512
      --flash-attn on
      --jinja
      --cache-ram 2048
      -ctxcp 2
    filters:
      # Drop sampling params sent by the client so the per-profile
      # values below always win.
      stripParams: "temperature, top_p, top_k, min_p, presence_penalty, repeat_penalty"

      setParamsByID:
        # Reasoning profile: thinking enabled with a capped budget.
        "${MODEL_ID}:thinking":
          chat_template_kwargs:
            enable_thinking: true
          reasoning_budget: 4096
          temperature: 1.0
          top_p: 0.95
          top_k: 64
          min_p: 0.0
          presence_penalty: 0.0
          repeat_penalty: 1.0

        # Reasoning profile tuned for coding tasks.
        "${MODEL_ID}:thinking-coding":
          chat_template_kwargs:
            enable_thinking: true
          reasoning_budget: 4096
          temperature: 1.5
          top_p: 0.95
          top_k: 65  # NOTE(review): other profiles use 64 — confirm 65 is intentional
          min_p: 0.0
          presence_penalty: 0.0
          repeat_penalty: 1.0

        # Plain instruct profile: thinking disabled.
        "${MODEL_ID}:instruct":
          chat_template_kwargs:
            enable_thinking: false
          temperature: 1.0
          top_p: 0.95
          top_k: 64
          min_p: 0.0
          presence_penalty: 0.0
          repeat_penalty: 1.0