Gemma 4 is terrible with system prompts and tools
Posted by RealChaoz@reddit | LocalLLaMA | View on Reddit | 122 comments
I tried Gemma 4 (26b-a4b) and I was a bit blown away at how much better it is than other models. However, I soon found a few things:
- it gets significantly worse as context fills up, moreso than other models
- it completely disregards the system prompt, no matter what I put in there
- it (almost) never does tool calls, even when I explicitly ask it
Note: Other open models also have the same flaws, but they feel much more accentuated with Gemma. It feels like it was made to be great at answering general questions (for benchmarks) but terrible at agentic flows - following instructions and calling tools.
I tried countless system prompts and messages, including snippets like the following (just some of these, all of them in the same prompt, etc.):
<task>
You must perform multiple tool calls, parallelizing as much as possible and present their results, as they include accurate, factual, verified information.
You must follow a ZERO-ASSUMPTION protocol. DON'T USE anything that you didn't get from a TOOL or DIRECTLY FROM THE USER. If you don't have information, use TOOLS to get it, or ASK the user. DON'T ANSWER WITHOUT IT.
Use the tools and your reasoning to think and answer the user's question or to solve the task at hand. DO NOT use your reasoning/internal data for ANY knowledge or information - that's what tools are for.
</task>
<tools>
You have tools at your disposal - they're your greatest asset. ALWAYS USE TOOLS to gather information. NEVER TRUST your internal/existing knowledge, as it's outdated.
RULE: ALWAYS PERFORM TOOL calls. Don't worry about doing "too many" calls.
RULE: Perform tool calls in PARALLEL. Think about what you need and what actions you want to perform, then try to group as many as possible.
</tools>
<reasoning>
**CRUCIAL:** BEFORE ENDING YOUR REASONING AND ATTEMPTING TO ANSWER, YOU MUST WRITE:
> CHECK: SYSTEM RULES
THEN, YOU MUST compare your reasoning with the above system rules. ADJUST AS NEEDED. Most likely, you MUST:
- perform (additional) tool calls, AND
- recognise assumptions and cancel them.
NEVER ANSWER WITHOUT DOING THIS - THIS IS A CRITICAL ERROR.
</reasoning>
These may not be the best prompts; they're what a lot of frustration and trial/error got me to, without results however:

In the reasoning of the example above (which had the full prompt), there is no mention of the word tool, system, check, or similar. Which is especially odd, since the model description states:
- Gemma 4 introduces native support for the system role, enabling more structured and controllable conversations.
Does anyone else have a different experience? Found any prompts that could help it listen or call tools?
carbongo@reddit
It's awful. I have a good system prompt with five checks. And Gemma 4 has a high probability of entering into a loop, and it starts doing all five checks over and over again... very bad
fittyscan@reddit
You need a recent version of llama.cpp. Also, if you're using a quantized model such as Unsloth's and you downloaded it when Gemma was first released, download it again, since fixes have been made since then.
diesltek710@reddit
still garbage
fiery_prometheus@reddit
Also, CUDA 13.2 and up is broken; Unsloth has more details about this as well:
https://github.com/unslothai/unsloth/issues/4849
typical-predditor@reddit
I have the strange suspicion that this level of stale/incompatible software explains why almost everything I try never works; it's just not nearly as well documented/examined as Gemma 4 is right now.
bigjeff5@reddit
That's what I realized after playing with the PrismML Bonsai 1.5 bit model when it came out. Basically every model that doesn't have the exact same architecture as an existing model needs a new kernel in the runner (llama.cpp, vLLM, Ollama), which requires an update. That also means bugs will exist early on, so you really can't just fire up a brand new model and expect everything to work, unfortunately.
an0maly33@reddit
Seems to only be a problem on IQ quants and the lower Ks.
H3g3m0n@reddit
There's also a custom template that apparently fixes tool calling.
"The interleaved template preserves the last reasoning before a tool call in the message history, leading to better agentic flow." I think it stops it striping thinking or something so it knows why it called the tool.
kukalikuk@reddit
Maybe it's just not for me.
Fragrant_Shallot_990@reddit
OpenWebUI seems to have thinking turned on by default, try finding a way to turn it off in settings.
kukalikuk@reddit
I turned off thinking for qwen3.5 because it can still call tools reliably that way. Turning off thinking for Gemma makes the tool calling more unreliable.
Fragrant_Shallot_990@reddit
Oh also, were you able to resolve the vision issue in OpenWebUI? When I tried an OpenWebUI + llama.cpp build, vision worked ONCE and then never worked again, even after rebuilding and tweaking. Apparently I'm not the only one who has that issue. Koboldcpp vision works fine though.
kukalikuk@reddit
Vision always works for me with the LM Studio backend.
nic_tbone@reddit
I have been testing out Gemma4:31b on a 5090 running Ollama on Windows with both claude-code and copilot-cli, and it's been working impressively well compared to any other model I have run in a similar situation. I have been giving it coding problems and it's able to solve them.
Further, I have been able to run it on a 5070 Ti + 3060, but it's drastically slower.
I had tried other models that my hardware could support, but none produced nearly as good results.
I did run into some issues executing plans on the 5070 Ti + 3060, but I tuned the model to be more precise and it proved more reliable. I have to switch model configs in this case between planning and execution.
The 5070 Ti + 3060 is really not usable. It's way too slow, but it was the hardware I had available at the time to experiment with.
Aggravating_Dog_9762@reddit
Hello, I'm kind of a noob discovering this and experiencing the same issue with Gemma 4. I use copaw with various LLMs on a weak Bazzite laptop, with LM link to my Win10 rig with a 3090. Tried Gemma 4 26b a4b, plus e2b and e4b running locally on the laptop for the tiny ones; it just doesn't want to use any skills or tools even if it has read them previously.
Everything is up to date: in the past 3 days I got 2 LM Studio updates on Linux, maybe 2 on Windows too with runtime updates. Tools and skills are enabled, and the system prompt I use with various LLMs doesn't help. Maybe something is broken for now and we need to wait.
RealChaoz@reddit (OP)
Yeah, I never ended up making it work. Tried llama.cpp from the latest commit, LM studio, unsloth studio, all the various chat templates, fresh download for every quant imaginable, temp 0/0.6/1, different min/max p values.
All due respect, but it seems like Gemma 4 simply is absolute dogshit when it comes to tool calling ¯_(ツ)_/¯
Aggravating_Dog_9762@reddit
If I say that I should see a bubble in chat when it's using the tool, it will use it, but only once; then after reasoning it fails to call tools again... For now I'm getting good results with deepseek2 glm 4.7 30b. Good luck ^^
Frequent-Mud8705@reddit
I managed to get it working agentically in opencode. Specifically, you need to create a very minimal sysprompt for it; passing the default opencode sysprompt makes it fail tool calls. Also make sure min-p is set to 0.
The MoE is quite a beast for a local model, though spawning agents still seems to be a little broken.
Le0ssa_@reddit
Where did you make those changes?
Frequent-Mud8705@reddit
see my other comment: https://www.reddit.com/r/LocalLLaMA/comments/1sh1bwv/comment/of9q7oi/
Le0ssa_@reddit
How can I get Gemma 4 to call tools in OpenCode? It just doesn't!
PuzzleheadedFill5120@reddit
It doesn't work for me either, not even on hermes.
benevbright@reddit
Just tried again with updated LM Studio, re-downloaded Gemma 4 (Apr 12th). Yeah, it does tool calls and it edits files now, but the coding skill is so bad... it wrote wrong code for my simple request. Deleted it and going back to qwen3-coder-nexdt again... (always the same story)
Sadman782@reddit
gemini fixed the template:
https://pastebin.com/raw/hnPGq0ht
Working with OpenCode, and it's quite good now at handling multiple MCP servers properly.
kukalikuk@reddit
Error rendering prompt with jinja template: "Unknown test: sequence".
Sadman782@reddit
I fixed it for LM Studio:
https://pastebin.com/raw/qc1FTAcG
Use this jinja; I removed the sequence test, which LM Studio doesn't recognize yet.
kukalikuk@reddit
Okay, no more sequence error, but tool calling is still unreliable and it still ends the turn inside the thinking process.
Sadman782@reddit
lm studio issue, works fine with llama.cpp
Upper-Sentence-7650@reddit
I have the same problem getting gemma4:31b to call the tools.
Sdesser@reddit
Is the tool syntax you're giving it the same as Gemma's natively trained one? I've had no issues with my custom API after I added Gemma 4 specific cases to the parser, changed the system prompt builder to give Gemma 4 syntax as examples, and llama.cpp updated to support Gemma 4 templating. After aligning my system for Gemma, calling tools works just fine. Is your backend supporting Gemma templating? As for the system prompt, Gemma does seem to improvise a bit more than some other models, but on average, it's following the system prompt just fine. Then again, I'm also running at temp 1.0, so that might explain the variance. I ran Qwen 3.5 9B at temp 0.8 due to tool syntax inconsistencies.
I don't use local models for coding and have not tried the larger models yet, so I'm not using tools all that heavily. That might make a difference. Still waiting for new hardware to arrive to run more capable models. I've been mostly using local models for general chatting within my own AI companion framework.
florinandrei@reddit
SHOUTING LOUDLY, using <fancy_tags>, a man of culture, I see.
coder543@reddit
Maybe you should try the built-in llama-server webui.
System prompt and tool calling seem to work fine:
dinerburgeryum@reddit
Having a system prompt will mess up Gemma4 reasoning, because Gemma4's system prompt has strict formatting requirements. From their HF page:
Without <|think|> at the beginning of the system prompt it's disabled entirely. I assume it's automatically injected by the Jinja template if no system prompt is provided.
llmentry@reddit
You don't have to assume. It's very clear in the jinja that this is exactly what it does :)
But note that it simply injects the token at the start of the system turn. Any custom system prompt simply follows below, and so will never mess up reasoning!
dinerburgeryum@reddit
Thanks for the clarification!
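For readers following along, a minimal Python sketch of what the template behaviour described above would amount to; the build_system_turn helper and the end-of-turn closer are assumptions for illustration, while the <|think|> tag and <start_of_turn>system token are the ones mentioned in this thread.

# Hypothetical sketch: the thinking tag is injected at the start of the system
# turn, and any custom system prompt simply follows it, so a custom prompt
# doesn't disable reasoning on its own.
def build_system_turn(system_prompt=None, thinking=True):
    prefix = "<|think|>" if thinking else ""
    body = system_prompt or ""
    return "<start_of_turn>system\n" + prefix + body + "<end_of_turn>\n"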
coder543@reddit
That is strange. With reasoning enabled, I don't see how the think token would go missing if I include a system prompt. But if I manually write the think token at the front of the system prompt, it goes back to reasoning. Maybe there is a bug in the template that I'm not seeing?
arman-d0e@reddit
It's a poorly written template and structure; it really should not be bound to the system prompt at all, similar to the way Qwen models handle thinking.
ElDavoo@reddit
How do you give tools to llama.cpp?
RealChaoz@reddit (OP)
MCP works, but unfortunately only HTTP(S) ones, no stdio/SSE.
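For anyone wondering what this looks like without MCP: a minimal sketch of passing a tool schema straight to llama-server's OpenAI-compatible endpoint (run with --jinja, as in the launch command later in this thread); the get_weather tool, port, and prompt are illustrative assumptions.

import requests

# Illustrative tool schema; llama-server applies the chat template to it when
# started with --jinja, so the model sees it in its native tool format.
tools = [{
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Get the current weather for a city.",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
}]

resp = requests.post(
    "http://localhost:8080/v1/chat/completions",
    json={
        "messages": [{"role": "user", "content": "What's the weather in Oslo?"}],
        "tools": tools,
    },
    timeout=120,
)
# The assistant message carries a tool_calls list if the model decided to call one.
print(resp.json()["choices"][0]["message"])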
Local-Cardiologist-5@reddit
Are you using ollama?
Several_Industry_754@reddit
I had a really rough time getting it to work with Claude. It just wouldn’t use any tools and kept hanging. I had to switch back to Qwen.
a_beautiful_rhind@reddit
31b follows my system prompts and I don't make it think or have that token there.
Healthy-Nebula-3603@reddit
Yes, the 26b version sucks as an agent and for coding, but the 31b version works great.
RealChaoz@reddit (OP)
I'll try that one today, thanks
Sadman782@reddit
Use the updated jinja (updated a few hours ago): https://huggingface.co/google/gemma-4-26B-A4B-it/raw/main/chat_template.jinja
or a slightly modified version (better): https://pastebin.com/raw/hnPGq0ht
RealChaoz@reddit (OP)
Nice, I'll give it another shot today, thanks!
Additional-Avocado33@reddit
You're running the dev model that is made so others can add content to their model (26b-a4b-it). Is there a release model with thinking?
RealChaoz@reddit (OP)
Maybe, I'll try the non-a4b version today
input_a_new_name@reddit
Are we using the same model? I fed it 60k worth of text in docx format and it was completely coherent in its answers.
RealChaoz@reddit (OP)
I'm not saying it's bad in general, just that it's bad at instruction following and tool calling, specifically. If your use case doesn't involve that (you fed everything into context), yeah, it's amazing! Better than pretty much any other open model you can run locally
Exciting-Mall192@reddit
The problem sounds like deepseek when the context rotting starts hitting you 🤥
blueCareBeat@reddit
Very different experience here too. Using a basic system prompt, with tools and a skill catalog. Results of a tool calling eval: https://jalemieux.github.io/curunir-evals/reviews/article-draft-26b-sonnet46-20260407
Sadman782@reddit
use the updated jinja: https://huggingface.co/google/gemma-4-26B-A4B-it/raw/main/chat_template.jinja
or a slightly modified version (better): https://pastebin.com/raw/hnPGq0ht
ambient_temp_xeno@reddit
One thing I've found on 31b is that any system prompting about what it should do with its reasoning is completely ignored. It's completely dead set on reasoning the way it's been trained.
Sadman782@reddit
chat template issue, use this jinja: https://github.com/ggml-org/llama.cpp/blob/master/models/templates/google-gemma-4-31B-it-interleaved.jinja
ambient_temp_xeno@reddit
What's new in the pastebin'd one?
Sadman782@reddit
It removed the standard_keys exclusion block, and it's better for me (Gemini found that).
You can see whether it is better for you or not. The fix was applied on top of the Google updated template a few hours ago.
ambient_temp_xeno@reddit
Ah yes I see the PRs now. I also see the 2 small models had different templates after all
https://github.com/ggml-org/llama.cpp/pull/21704#issuecomment-4221036621
ambient_temp_xeno@reddit
But I am using that one >_<
tetelias@reddit
Interestingly, there's no template for 26B. Could this one be used for 26B?
nickm_27@reddit
yes it can
relmny@reddit
Have you read this post, which was made before you posted yours?
https://www.reddit.com/r/LocalLLaMA/comments/1sgl3qz/gemma_4_on_llamacpp_should_be_stable_now/
kukalikuk@reddit
+1 on this, OpenWebUI frontend + LM Studio backend. I love it for the language, but it fails miserably at serious tool calling and code. Context build-up makes it even worse. I gave it a simple task of calling an image edit tool, which even Qwen3.5 4B cannot fail, and Gemma 4 thinks and then makes multiple tool calls in sequence directly (without me asking), not waiting for the response/result, making another call and so on until I stop it. Another time it successfully used a tool on the first try, but when I asked again it failed; I even made it use the exact same method as the first successful call, and it still failed. Not only did it fail, it thought for almost 13k tokens (correcting and contradicting itself about the first successful call) and still failed after those 13k thinking tokens. It even fails to close its thinking process after some context builds up. It ends the turn while still in the thinking process, and when I read the thoughts, sometimes it mistypes the block/tag.
I still use the default LM Studio template for this model, btw.
TheTerrasque@reddit
That sounds like what I saw before llama.cpp was updated. Maybe try with latest llama.cpp and unsloth ggufs
kukalikuk@reddit
I use LM studio and updated the llama runtime. Still doing dumb tool calls. 😋
Muted_Wave@reddit
Just like me, I found the same thing, so I switched to Qwen instead.
Danfhoto@reddit
It's also interesting that you tell it not to use anything not present in TOOL, then proceed to tag the section as <tools> while strictly telling it not to make assumptions. For me, looking in <tools> for TOOL would be an assumption.
grimmolf@reddit
I’m running the 31b model to power openclaw and the main difference I found was increasing the context window from the default 4096 to the max 256k context, and honestly, I’m pretty impressed with the tool use and adherence to system prompts.
Creative-Paper1007@reddit
Google models have always been worse at tool calling; Qwen is still the OG.
nickm_27@reddit
That's not my experience at all, Gemma4 follows my system prompt exactly, even some multi step instructions that other models like Qwen don't follow as well.
Rich_Artist_8327@reddit
how are you running your Gemma4?
nickm_27@reddit
llama.cpp via vulkan in docker on a 7900XTX. Running unsloth Q4_K_XL
Technical-History104@reddit
There’s plenty that I still don’t know, but I do notice the heavy reliance on negative prompting in the snippets you shared. In addition to any other advice you get here, maybe it’s worth finding a way to reword all negations as affirmative statements instead, and giving a few embedded examples (“if you see this… then do this…”)?
Just wondering. 🤔
Also got the following comment from Google Gemini which seems appropriate here:
The user noted that Gemma 4 (26b) gets worse as context fills up. Smaller or mid-sized models (like a 26B parameter model) have a lower "attention budget" than the frontier models.
When the context window gets crowded:
The model starts prioritizing the most recent tokens (the user query).
The System Prompt (at the very beginning) loses its "pull" on the model's attention.
Complex, multi-part negative rules are the first things to be "forgotten" in favor of the immediate request.
[… and I realize even after pasting the above that part of your entire point is that the new Gemma seems more nuanced in the problematic behavior than other open weight models that share similar weaknesses]
the__storm@reddit
I'm finding the same - strong on single-turn natural language tasks, but really struggles with tool calling. It'll fail a couple of times and then get into a loop or go down some crazy rabbit-hole.
I'm on latest llama.cpp compiled locally for ROCm, latest Unsloth 26B IQ4_XS, fp16 kv cache. Both Zed and Copilot Chat (clearest Microsoft branding scheme) were really bad, opencode was surprisingly okay for some reason.
Maleficent-Low-7485@reddit
System prompts degrading with context is the real issue, and not just with Gemma.
Last_Mastod0n@reddit
Tuning the system prompt has been huge for improving responses in my experience, so I have not had that issue. As for the context issue, I have noticed massive degradation with every open model I've ever used.
Beledarian@reddit
I saw you use LM Studio. I'm actively developing a toolset for LM Studio, and when I tried it out it failed with my subagent flow, as it did not adhere to the expected tool flow provided in the system prompt. Even after I thought I fixed it, I still encounter frequent issues. But the lm-studio tools and browser control flow provided by my plugin work OK.
But for me this was definitely surprising and frustrating. Especially considering gpt-oss 20b is able to navigate the subagent flow without any problems, even though it's an older and smaller model.
4xi0m4@reddit
The core issue is that Gemma 4's think block is a separate generation pass that doesn't always respect the system prompt. The <|think|> token controls whether thinking is enabled, and if your system prompt doesn't start with it, reasoning gets disabled entirely. For the 26B MoE specifically, the interleaved template from the 31B should work since they share the same architecture. Also worth trying: disable thinking entirely and see if tool calling improves. Some users report it works much better without the think layer getting in the way.
ionizing@reddit
My experience with it: it was hit or miss. Not worth the effort when other models soar on my platform.
ionizing@reddit
In contrast, qwen3.5-35B and 27B both do good shell work, have never complained about reading images, and are also pretty good with bash tools in comparison. But Gemma4 is probably good at many things, of course, and will make a lot of people happy with their use cases, so it is great to get another free model regardless. Heck, maybe it is even fixed by now and I need to grab another copy and try again, though I don't like the idea of having to give it custom system prompts compared to the other models; it has taken a lot of work to fine-tune behavior. Anyhow, I am just rambling. Here is 122B-a10B from you know who enjoying its life in the shell, implementing a plan autonomously:
tarruda@reddit
In my experience, the 26b version never does any reasoning when running inside a coding harness.
the__storm@reddit
Thinking mode is only enabled if the system prompt begins with <|think|>. llama.cpp and similar will use the default prompt, which includes this, but coding harnesses send their own system prompt.
That said I have to agree with OP - 26B seems to really struggle with tool use. Great at single-turn tasks though.
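A hedged workaround sketch for the harness case, assuming the thread is right that the tag has to lead the system prompt; the patch_system_prompt helper is hypothetical and just re-prepends the tag to whatever system prompt the harness sends.

THINK_TAG = "<|think|>"

def patch_system_prompt(messages):
    # Re-prepend the thinking tag to the harness's system prompt, since
    # (per this thread) reasoning is disabled without it.
    for msg in messages:
        if msg.get("role") == "system" and not msg.get("content", "").startswith(THINK_TAG):
            msg["content"] = THINK_TAG + msg["content"]
    return messages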
Independent-Math-167@reddit
Experiencing the same on Gemma4 27b. My qwen3.5 9b was doing better with tools like the DuckDuckGo or Wikipedia tool. Qwen goes and searches the web, but with Gemma I have to tell it to search the web.
Muted_Wave@reddit
Just like me, I found the same thing, so I switched to Qwen instead.
Anthonyg5005@reddit
It usually only seems to apply the system prompt when thinking, and also, yeah, I've felt like I've needed to nudge it more to use tools, otherwise it won't try on its own.
JLeonsarmiento@reddit
Update your framework/llama.cpp version. It was like that over the weekend; since Monday or Tuesday it's been working perfectly.
WishfulAgenda@reddit
I updated LM Studio today and it's night and day. Tool calling was perfect without a system prompt. Using an mxfp4 version right now and getting 70-80 tps at 100k context on dual 5070 Tis. Fully loaded into GPU.
RealChaoz@reddit (OP)
I updated too before trying, on 0.4.10
TheTerrasque@reddit
Very different from my experience. What's your tool stack?
RealChaoz@reddit (OP)
For that question, I had Svelte MCP + Valyu
BasaltLabs@reddit
Gemma 4 is a thinking model. Its <think> block is essentially a separate generation pass that doesn't strongly bind to system prompt instructions the way the final response does. So your CHECK: SYSTEM RULES trick (which works well on non-thinking models) gets ignored because the thinking layer was never trained to respect that kind of meta-instruction. The model reasons freely, then answers -- your system prompt influences the answer surface, not the thinking process itself.
In most serving setups (Ollama, llama.cpp, vllm), whether tools actually get called depends entirely on whether the chat template correctly injects the tool schema and formats the turn boundaries. Gemma 4's template is newer, and a lot of backends either have a stale template or partially broken tool token handling. Before blaming the model, check: are you passing the tools via the tools parameter, not just describing them in the system prompt? You can verify by logging the full prompt as the model sees it (most backends have a debug flag for this).
Previous Gemma versions had no system role at all; it was hacked in via user-turn injection. "Native support" just means it now has a proper <start_of_turn>system token. It doesn't mean the model was heavily trained to obey system prompts the way Llama 3 or Mistral instruct variants were. The RLHF likely prioritized response quality over instruction compliance, which tracks with your benchmark observation.
KickLassChewGum@reddit
FYI to any readers: this is clanker-generated nonsense. All that's needed to debunk this is to set a system prompt in Google AI Studio and look at the thinking lol.
Genuine question: what do you get out of pretending to answer a question? Why would anyone do this? This is so utterly perplexing to me.
BasaltLabs@reddit
Fair enough on the technical correction, I'll take the 'L' on that. I was under the impression the reasoning tokens were handled as a decoupled pass with a different weight on the system prompt, but I see now (especially looking at AI Studio) that it's a continuous chain where the system instructions are very much 'live' during the thought phase. Also, my mistake on the versioning; I was getting my wires crossed between Gemini's current thinking models and the Gemma roadmap.
As for the 'clanker': I wrote the comment myself, but I did run it through an LLM to clean up my grammar and flow before posting. I can see how that backfired: it took my incorrect theory and polished it into something that sounded like a confident hallucination.
I wasn’t 'pretending' to answer; I just had a theory that turned out to be wrong, and the grammar check made it look like a bot wrote it.
On that note, the reason why I was so wrong: I was actually looking at the Gemma 4 A4B MoE architecture. Because it only activates 4B parameters during inference despite being a 26B model, I (incorrectly) assumed the thinking channel was being handled by a different parameter pass than the system prompt role. I see now that the <|think|> tag and system role are natively integrated in the new tokenizer.
KickLassChewGum@reddit
Fair enough. The combination of the unmistakable textual tells and a confidently stated plausible-sounding falsehood is probably about as back as a fire can get, haha.
BasaltLabs@reddit
lool, Well hey, at least I admit my faults
colin_colout@reddit
You're absolutely right!
RealChaoz@reddit (OP)
Yes, it does occasionally perform one tool call at a time. In a rare instance, it actually called 2 in parallel. But generally just doesn't.
IMO the thinking block not abiding by the system prompt (if true) makes it borderline useless for any kind of instruction following. Might as well just disable thinking.
I also injected the system prompt into the user prompt and nothing improved, so I doubt it's that either. I honestly just think the model was benchmark-maxxed and is actually bad.
BasaltLabs@reddit
The benchmark-maxxing take is probably right, at least partially. Google's evals for Gemma 4 were heavily weighted toward MMLU-style reasoning and multilingual tasks, neither of which requires tool compliance or instruction following. So the RLHF signal just wasn't there.
That said, disabling thinking might actually be worth trying before writing it off. A few people in the Ollama and llama.cpp communities have reported that non-thinking mode (if your backend exposes it) makes the model noticeably more compliant with structured prompts, not perfect, but usable. The theory is that the thinking pass learned to "solve" things internally and then just summarizes, so tool calls feel redundant to it.
As for parallel tool calls, that's almost certainly a training distribution issue. The model rarely saw multi-tool examples during fine-tuning, so it defaults to sequential even when the schema supports parallel. Some people have had partial luck with explicit examples in the system prompt (few-shot style: here's a user message, here's what a correct parallel tool call response looks like; see the sketch after this comment), but it's fragile.
For agentic workflows right now, Qwen2.5 or the latest Mistral Small are honestly more reliable, less impressive on general benchmarks but actually trained to follow tool schemas consistently. Gemma 4 feels like a great base model that Google didn't finish fine-tuning for production use cases.
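As a concrete version of that few-shot suggestion, a hedged sketch of a system prompt with one worked parallel-call example; the get_weather tool name and the call notation are illustrative assumptions and should match whatever format your backend's chat template actually emits.

# One worked example of a parallel tool-call turn, embedded in the system prompt.
FEW_SHOT = (
    "Example:\n"
    "User: Compare the weather in Oslo and Bergen.\n"
    "Assistant (correct): issue BOTH calls in the same turn:\n"
    '  get_weather(city="Oslo")\n'
    '  get_weather(city="Bergen")\n'
    "Then summarise both results for the user.\n"
)

system_prompt = (
    "You are a helpful assistant with tools. When several independent lookups "
    "are needed, call the tools in parallel.\n\n" + FEW_SHOT
)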
eggavatar12345@reddit
It’s a great model and you should consider that your quant of choice or llama args are wrong
colin_colout@reddit
Also a reminder to constantly check for new versions of llama.cpp and the quant (assuming that's how you host it). On new models especially, llama.cpp often needs a few weeks at least to hammer out bugs in new model architectures.
...and GGUFs have bugs too (sometimes even just the prompt template). For instance, Unsloth uploaded a new GGUF yesterday (if that's what you're running), so if you're running an Unsloth GGUF from a few days ago, it might not have the latest fixes.
ttkciar@reddit
Well, that's not entirely true. Changing the prompt template to include a system section worked splendidly for both Gemma 2 and Gemma 3, and other models which did have documented support for system sections did not have specific tokens for it either. That is a recent development for all models.
o0genesis0o@reddit
Something could be wrong with your setup if you have the same issue with other models as well. I tuned my agent harness to work with Nemotron 30B, and I'm surprised to see that it handles simpler agentic tasks just as well as GLM 4.7 and Minimax 2.7. It only fails with large and difficult text edits. It means small models can follow system prompts and do multi-turn tool calls, not just frontier models.
Specter_Origin@reddit
Umm, it works really well for me... How are you serving the model? What server, what version, and what platform?
patricious@reddit
I am experiencing the exact same problem, but it's hit or miss: sometimes it's tool calling very correctly, sometimes it says it's deploying agents when in reality it didn't deploy anything, sometimes neither of those things lol. Still need to tweak and test; either way, I am running it with these params on a 5090 and TurboQuant:
Temp 1.0
Repetition Penalty 1.05
@echo off
title Gemma 4 26B - 262K Context (22.2 GB VRAM)
cd /d C:\ai-opt
C:\ai-opt\turboquant-llamacpp\build\bin\Release\llama-server.exe ^
-m "C:\models-no-spaces\gemma-4-26B-A4B-it-UD-Q4_K_M.gguf" ^
--cache-type-k tbqp3 ^
--cache-type-v tbq3 ^
--flash-attn on ^
--ctx-size 262144 ^
--gpu-layers 99 ^
--port 8080 ^
--alias "Gemma-4-26B-TurboQuant-262k" ^
--reasoning on ^
--jinja
colin_colout@reddit
does it have the same issues without kv quantization? Just to rule it out (different models have different sensitivities to quantized KV...and tbq3 is brand new and might need some more time for bugs to shake out)
patricious@reddit
Yes, and it's even worse on Flash Attention + q8_0 KV cache.
colin_colout@reddit
What about no KV cache quantization at all? FA shouldn't affect the output of the LLM, but KV cache quantization does.
Unless I'm out of the loop, it should be mathematically identical with or without FA.
Electronic-Metal2391@reddit
Using this gemma-4-26b-a4b-it-heretic.q4_k_m.gguf inside koboldcpp, I get nothing but a long loop of repeated words.
Rich_Artist_8327@reddit
I had a very large prompt for content categorization across 5000 phrases. Gemma3 did those at a certain accuracy.
When Gemma4 31b came out, I ran the exact same benchmark with the same prompt against the same data. Results were worse than with Gemma3 27b. Then I made the prompt as simple as possible, and results are now on par with Gemma3 27b when it has a 5000-token prompt. So Gemma4 31B gets the same result with a 900-token prompt as Gemma3 27b, which needs 5000 tokens of rules and few-shot prompts for the same results. When I start adding rules and few-shots to Gemma4 31B, results get worse. My understanding is that I do not have thinking on; at least it's not in the prompt, and temperature has been 0.0 and 1.0 with no difference actually.
So Gemma4 somehow responds to a different type of prompting, or what is the issue here?
Grouchy-Bed-7942@reddit
Give your full llama-server command. Also, if you are in OpenWebUI, have you set native tool calling in the model settings?
jacobpederson@reddit
Yup - was very impressed with Gemma - plugged it into opencode and it fell face-first.
Frequent-Mud8705@reddit
I got it to work by changing the sysprompt.
After doing that, the MoE absolutely rips on my 3090, though I think it's still slightly off from what the model expects.
"You are an agentic coding tool.
You live in an agentic coding environment with various tools you can use to help the user."
VoiceApprehensive893@reddit
Most models are completely unaware of their own chain-of-thought mechanism.
Gemma is, but you have to spend multiple turns to make it follow a format rule for its reasoning, and even then it's inconsistent (I got 31b to put its final response into the reasoning block and do 0 reasoning in it once lol; don't expect this level of control from it, I have no idea how it happened).
EffectiveCeilingFan@reddit
Are you sure that the system prompt is being included in the full actual prompt sent off to the engine? llama.cpp I believe has a flag to log all prompts and completions to console if I remember correctly.
RealChaoz@reddit (OP)
Yes, it is - see the last paragraph of the post. When asked it outputted the prompt; I also copy-pasted the prompt in front of my user message, in the chat, and it didn't improve.
coder543@reddit
Which exact gguf are you using?
EffectiveCeilingFan@reddit
Ah, sorry, I missed that. Sorry, I honestly have no clue other than to make sure you’re using the latest GGUFs and have built llama.cpp from the latest commit.
Velocita84@reddit
It doesn't (it used to, but it never worked and got removed anyway); you have to set the env var LLAMA_SERVER_SLOTS_DEBUG=1 and query /slots (plus ?model=[model name] if using router mode) to get the raw context.
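A quick sketch of that check, assuming the env var and endpoint behave as described above; the response shape varies between llama.cpp versions, so the sketch just dumps whatever comes back so you can search it for your system prompt.

import json
import requests

# Requires the server to be started with LLAMA_SERVER_SLOTS_DEBUG=1 (see above).
slots = requests.get("http://localhost:8080/slots", timeout=30).json()
print(json.dumps(slots, indent=2))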
denis-craciun@reddit
I am experiencing similar problems with tool calling using LangChain. Qwen3.5 32b is performing much better on that end. I am trying to understand if there is something that I'm doing wrong, but I think it's just a problem with the model, tbf. I'll update in the next days/weeks. Thank you, now at least I know I'm not the only one.
EffectiveCeilingFan@reddit
There isn’t a Qwen3.5 32B. I’m assuming you meant the 35B MoE?
denis-craciun@reddit
Ya I did sorry