Gemma 4 is terrible with system prompts and tools
Posted by RealChaoz@reddit | LocalLLaMA | View on Reddit | 122 comments
I tried Gemma 4 (26b-a4b) and I was a bit blown away at how much better it is than other models. However, I soon found a few things:
- it gets significantly worse as context fills up, moreso than other models
- it completely disregards the system prompt, no matter what I put in there
- it (almost) never does tool calls, even when I explicitly ask it
Note: Other open models also have the same flaws, but they feel much more accentuated with Gemma. It feels like it was made to be great at answering general questions (for benchmarks) but terrible at agentic flows - following instructions and calling tools.
I tried countless system prompts and messages, including snippets like the following (just some of these, all of them in the same prompt, etc.):
<task>
You must perform multiple tool calls, parallelizing as much as possible and present their results, as they include accurate, factual, verified information.
You must follow a ZERO-ASSUMPTION protocol. DON'T USE anything that you didn't get from a TOOL or DIRECTLY FROM THE USER. If you don't have information, use TOOLS to get it, or ASK the user. DON'T ANSWER WITHOUT IT.
Use the tools and your reasoning to think and answer the user's question or to solve the task at hand. DO NOT use your reasoning/internal data for ANY knowledge or information - that's what tools are for.
</task>
<tools>
You have tools at your disposal - they're your greatest asset. ALWAYS USE TOOLS to gather information. NEVER TRUST your internal/existing knowledge, as it's outdated.
RULE: ALWAYS PERFORM TOOL calls. Don't worry about doing "too many" calls.
RULE: Perform tool calls in PARALLEL. Think about what you need and what actions you want to perform, then try to group as many as possible.
</tools>
<reasoning>
**CRUCIAL:** BEFORE ENDING YOUR REASONING AND ATTEMPTING TO ANSWER, YOU MUST WRITE:
> CHECK: SYSTEM RULES
THEN, YOU MUST compare your reasoning with the above system rules. ADJUST AS NEEDED. Most likely, you MUST:
- perform (additional) tool calls, AND
- recognise assumptions and cancel them.
NEVER ANSWER WITHOUT DOING THIS - THIS IS A CRITICAL ERROR.
</reasoning>
These may not be the best prompts; they're what a lot of frustration and trial/error got me to, without results however:

In the reasoning of the example above (which had the full prompt), there is no mention of the word tool, system, check, or similar. Which is especially odd, since the model description states:
- Gemma 4 introduces native support for the system role, enabling more structured and controllable conversations.
Does anyone else have a different experience? Found any prompts that could help it listen or call tools?
carbongo@reddit
It's awful. I have a good system prompt with five checks. And Gemma 4 has a high probability of entering into a loop, and it starts doing all five checks over and over again... very bad
fittyscan@reddit
You need a recent version of llama.cpp. Also, if you're using a quantized model such as Unsloth's and you downloaded it when Gemma was first released, download it again, since fixes have been made since then.
diesltek710@reddit
still garbage
fiery_prometheus@reddit
Also, CUDA 13.2 and up is broken; Unsloth has more details about this as well:
https://github.com/unslothai/unsloth/issues/4849
typical-predditor@reddit
I have the strange suspicion that this level of stale/incompatible software explains why almost everything I try never works; it's just not nearly as well documented/examined as Gemma 4 is right now.
bigjeff5@reddit
That's what I realized after playing with the PrismML Bonsai 1.5 bit model when it came out. Basically every model that doesn't have the exact same architecture as an existing model needs a new kernel in the runner (llama.cpp, vLLM, Ollama), which requires an update. That also means bugs will exist early on, so you really can't just fire up a brand new model and expect everything to work, unfortunately.
an0maly33@reddit
Seems to only be a problem on IQ quants and the lower Ks.
H3g3m0n@reddit
There's also a custom template that apparently fixes tool calling.
"The interleaved template preserves the last reasoning before a tool call in the message history, leading to better agentic flow." I think it stops it striping thinking or something so it knows why it called the tool.
kukalikuk@reddit
Maybe it's just not for me.
Fragrant_Shallot_990@reddit
OpenWebUI seems to have thinking turned on by default, try finding a way to turn it off in settings.
kukalikuk@reddit
I turned off thinking for qwen3.5 because it can still call tools reliably that way. Turning off thinking for Gemma makes the tool calling more unreliable.
Fragrant_Shallot_990@reddit
Oh also, were you able to resolve the vision issue in OpenWebUI? When I tried an OpenWebUI + llama.cpp build, vision worked ONCE and then never worked again, even after rebuilding and tweaking. Apparently I'm not the only one who has that issue. Koboldcpp vision works fine though.
kukalikuk@reddit
Vision always works for me with the LM Studio backend.
nic_tbone@reddit
I have been testing out Gemma4:31b on a 5090 running Ollama on Windows with both claude-code and copilot-cli, and it's been working impressively well compared to any other model I have run in a similar situation. I have been giving it coding problems and it's able to solve them.
Further, I have been able to run it on a 5070 Ti + 3060, but it's drastically slower.
I had tried other models that my hardware could support, but none produced nearly as good results.
I did run into some issues executing plans on the 5070 Ti + 3060, but I tuned the model to be more precise and it proved more reliable. I have to switch model configs in this case between planning and execution.
The 5070 Ti + 3060 is really not usable. It's way too slow, but it was the hardware I had available at the time to experiment with.
Aggravating_Dog_9762@reddit
Hello, I'm kind of a noob discovering this and experiencing the same issue with Gemma 4. I use copaw with various LLMs on a weak Bazzite laptop, with LM link to my Win10 rig with a 3090. Tried Gemma 4 26b a4b, plus e2b and e4b running locally on the laptop for the tiny ones; it just doesn't want to use any skills or tools even if it has read them previously.
Everything is up to date: in the past 3 days I got 2 LM Studio updates on Linux, maybe 2 on Windows too with runtime updates. Tools and skills are enabled, and the system prompt I use with various LLMs doesn't help. Maybe something is broken for now and we need to wait.
RealChaoz@reddit (OP)
Yeah, I never ended up making it work. Tried llama.cpp from the latest commit, LM studio, unsloth studio, all the various chat templates, fresh download for every quant imaginable, temp 0/0.6/1, different min/max p values.
All due respect, but it seems like Gemma 4 simply is absolute dogshit when it comes to tool calling ¯_(ツ)_/¯
Aggravating_Dog_9762@reddit
If I say that I should see a bubble in chat when it's using the tool, it will use it, but only once; then after reasoning it fails to call tools again... For now I'm getting good results with deepseek2 glm 4.7 30b. Good luck ^^
Frequent-Mud8705@reddit
I managed to get it working agentically in opencode. Specifically, you need to create a very minimal sysprompt for it; passing the default opencode sysprompt makes it fail tool calls. Also make sure min-p is set to 0.
The MoE is quite a beast for a local model, though spawning agents still seems to be a little broken.
Le0ssa_@reddit
Where did you make those changes?
Frequent-Mud8705@reddit
see my other comment: https://www.reddit.com/r/LocalLLaMA/comments/1sh1bwv/comment/of9q7oi/
Le0ssa_@reddit
How can I get Gemma 4 to call tools in OpenCode? It just doesn't!
PuzzleheadedFill5120@reddit
It doesn't work for me either, not even on hermes.
benevbright@reddit
Just tried again with updated LM Studio, re-downloaded Gemma 4 (Apr 12th). Yeah, it does tool calls and it edits files now, but the coding skill is so bad... it wrote wrong code for my simple request. Deleted it and going back to qwen3-coder-nexdt again... (always the same story)
Sadman782@reddit
gemini fixed the template:
https://pastebin.com/raw/hnPGq0ht
Working with OpenCode, and it's quite good now at handling multiple MCP servers properly.
kukalikuk@reddit
Error rendering prompt with jinja template: "Unknown test: sequence".
Sadman782@reddit
I fixed it for LM Studio:
https://pastebin.com/raw/qc1FTAcG
Use this jinja; I removed the sequence test, which LM Studio doesn't recognize yet.
kukalikuk@reddit
Okay, no more sequence error, but tool calling is still unreliable and it still ends the turn inside the thinking process.
Sadman782@reddit
lm studio issue, works fine with llama.cpp
Upper-Sentence-7650@reddit
I have the same problem getting gemma4:31b to call the tools.
Sdesser@reddit
Is the tool syntax you're giving it the same as Gemma's natively trained one? I've had no issues with my custom API after I added Gemma 4 specific cases to the parser, changed the system prompt builder to give Gemma 4 syntax as examples, and llama.cpp updated to support Gemma 4 templating. After aligning my system for Gemma, calling tools works just fine. Is your backend supporting Gemma templating? As for the system prompt, Gemma does seem to improvise a bit more than some other models, but on average, it's following the system prompt just fine. Then again, I'm also running at temp 1.0, so that might explain the variance. I ran Qwen 3.5 9B at temp 0.8 due to tool syntax inconsistencies.
I don't use local models for coding and have not tried the larger models yet, so I'm not using tools all that heavily. That might make a difference. Still waiting for new hardware to arrive to run more capable models. I've been mostly using local models for general chatting within my own AI companion framework.
florinandrei@reddit
SHOUTING LOUDLY, using <fancy_tags>, a man of culture, I see.
coder543@reddit
Maybe you should try the built-in llama-server webui.
System prompt and tool calling seem to work fine:
dinerburgeryum@reddit
Having a system prompt will mess up Gemma4 reasoning, because Gemma4's system prompt has strict formatting requirements. From their HF page:
Without <|think|> at the beginning of the system prompt it's disabled entirely. I assume it's automatically injected by the Jinja template if no system prompt is provided.
llmentry@reddit
You don't have to assume. It's very clear in the jinja that this is exactly what it does :)
But note that it simply injects the token at the start of the system turn. Any custom system prompt simply follows below, and so will never mess up reasoning!
dinerburgeryum@reddit
Thanks for the clarification!
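For readers following along, a minimal Python sketch of what the template behaviour described above would amount to; the build_system_turn helper and the end-of-turn closer are assumptions for illustration, while the <|think|> tag and <start_of_turn>system token are the ones mentioned in this thread.

# Hypothetical sketch: the thinking tag is injected at the start of the system
# turn, and any custom system prompt simply follows it, so a custom prompt
# doesn't disable reasoning on its own.
def build_system_turn(system_prompt=None, thinking=True):
    prefix = "<|think|>" if thinking else ""
    body = system_prompt or ""
    return "<start_of_turn>system\n" + prefix + body + "<end_of_turn>\n"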
coder543@reddit
That is strange. With reasoning enabled, I don't see how the think token would go missing if I include a system prompt. But if I manually write the think token at the front of the system prompt, it goes back to reasoning. Maybe there is a bug in the template that I'm not seeing?
arman-d0e@reddit
It's a poorly written template and structure; it really should not be bound to the system prompt at all, similar to the way Qwen models handle thinking.
ElDavoo@reddit
How do you give tools to llama.cpp?
RealChaoz@reddit (OP)
MCP works, but unfortunately only HTTP(S) ones, no stdio/SSE.
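For anyone wondering what this looks like without MCP: a minimal sketch of passing a tool schema straight to llama-server's OpenAI-compatible endpoint (run with --jinja, as in the launch command later in this thread); the get_weather tool, port, and prompt are illustrative assumptions.

import requests

# Illustrative tool schema; llama-server applies the chat template to it when
# started with --jinja, so the model sees it in its native tool format.
tools = [{
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Get the current weather for a city.",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
}]

resp = requests.post(
    "http://localhost:8080/v1/chat/completions",
    json={
        "messages": [{"role": "user", "content": "What's the weather in Oslo?"}],
        "tools": tools,
    },
    timeout=120,
)
# The assistant message carries a tool_calls list if the model decided to call one.
print(resp.json()["choices"][0]["message"])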
Local-Cardiologist-5@reddit
Are you using ollama?
Several_Industry_754@reddit
I had a really rough time getting it to work with Claude. It just wouldn’t use any tools and kept hanging. I had to switch back to Qwen.
a_beautiful_rhind@reddit
31b follows my system prompts and I don't make it think or have that token there.
Healthy-Nebula-3603@reddit
Yes, the 26b version sucks as an agent and for coding, but the 31b version works great.
RealChaoz@reddit (OP)
I'll try that one today, thanks
Sadman782@reddit
Use the updated jinja (updated a few hours ago): https://huggingface.co/google/gemma-4-26B-A4B-it/raw/main/chat_template.jinja
or a slightly modified version (better): https://pastebin.com/raw/hnPGq0ht
RealChaoz@reddit (OP)
Nice, I'll give it another shot today, thanks!
Additional-Avocado33@reddit
You're running the dev model that is made so others can add content to their model (26b-a4b-it). Is there a release model with thinking?
RealChaoz@reddit (OP)
Maybe, I'll try the non-a4b version today
input_a_new_name@reddit
Are we using the same model? I fed it 60k worth of text in docx format and it was completely coherent in its answers.
RealChaoz@reddit (OP)
I'm not saying it's bad in general, just that it's bad at instruction following and tool calling, specifically. If your use case doesn't involve that (you fed everything into context), yeah, it's amazing! Better than pretty much any other open model you can run locally
Exciting-Mall192@reddit
The problem sounds like deepseek when the context rotting starts hitting you 🤥
blueCareBeat@reddit
Very different experience here too. Using a basic system prompt, with tools and a skill catalog. Results of a tool calling eval: https://jalemieux.github.io/curunir-evals/reviews/article-draft-26b-sonnet46-20260407
Sadman782@reddit
use the updated jinja: https://huggingface.co/google/gemma-4-26B-A4B-it/raw/main/chat_template.jinja
or a slightly modified version (better): https://pastebin.com/raw/hnPGq0ht
ambient_temp_xeno@reddit
One thing I've found on 31b is that any system prompting about what it should do with its reasoning is completely ignored. It's completely dead set on reasoning the way it's been trained.
Sadman782@reddit
chat template issue, use this jinja: https://github.com/ggml-org/llama.cpp/blob/master/models/templates/google-gemma-4-31B-it-interleaved.jinja
ambient_temp_xeno@reddit
What's new in the pastebin'd one?
Sadman782@reddit
It removed the standard_keys exclusion block, and it's better for me (Gemini found that).
You can see whether it is better for you or not. The fix was applied on top of the Google updated template a few hours ago.
ambient_temp_xeno@reddit
Ah yes I see the PRs now. I also see the 2 small models had different templates after all
https://github.com/ggml-org/llama.cpp/pull/21704#issuecomment-4221036621
ambient_temp_xeno@reddit
But I am using that one >_<
tetelias@reddit
Interestingly, there's no template for 26B. Could this one be used for 26B?
nickm_27@reddit
yes it can
relmny@reddit
Have you read this post, which was made before you posted yours?
https://www.reddit.com/r/LocalLLaMA/comments/1sgl3qz/gemma_4_on_llamacpp_should_be_stable_now/
kukalikuk@reddit
+1 on this, OpenWebUI frontend + LM Studio backend. I love it for the language, but it fails miserably at serious tool calling and code. Context build-up makes it even worse. I gave it a simple task of calling an image edit tool, which even Qwen3.5 4B cannot fail, and Gemma 4 thinks and then makes multiple tool calls in sequence directly (without me asking), not waiting for the response/result, making another call and so on until I stop it. Another time it successfully used a tool on the first try, but when I asked again it failed; I even made it use the exact same method as the first successful call, and it still failed. Not only did it fail, it thought for almost 13k tokens (correcting and contradicting itself about the first successful call) and still failed after those 13k thinking tokens. It even fails to close its thinking process after some context builds up. It ends the turn while still in the thinking process, and when I read the thoughts, sometimes it mistypes the block/tag.
I still use the default LM Studio template for this model, btw.
TheTerrasque@reddit
That sounds like what I saw before llama.cpp was updated. Maybe try with latest llama.cpp and unsloth ggufs
kukalikuk@reddit
I use LM studio and updated the llama runtime. Still doing dumb tool calls. 😋
Muted_Wave@reddit
Just like me, I found the same thing, so I switched to Qwen instead.
Danfhoto@reddit
It's also interesting that you tell it not to use anything not present in TOOL, then proceed to tag the section as <tools> while strictly telling it not to make assumptions. For me, looking in <tools> for TOOL would be an assumption.
grimmolf@reddit
I’m running the 31b model to power openclaw and the main difference I found was increasing the context window from the default 4096 to the max 256k context, and honestly, I’m pretty impressed with the tool use and adherence to system prompts.
Creative-Paper1007@reddit
Google models have always been worse at tool calling; Qwen is still the OG.
nickm_27@reddit
That's not my experience at all, Gemma4 follows my system prompt exactly, even some multi step instructions that other models like Qwen don't follow as well.
Rich_Artist_8327@reddit
how are you running your Gemma4?
nickm_27@reddit
llama.cpp via vulkan in docker on a 7900XTX. Running unsloth Q4_K_XL
Technical-History104@reddit
There’s plenty that I still don’t know, but I do notice the heavy reliance on negative prompting in the snippets you shared. In addition to any other advice you get here, maybe it’s worth finding a way to reword all negations as affirmative statements instead, and giving a few embedded examples (“if you see this… then do this…”)?
Just wondering. 🤔
Also got the following comment from Google Gemini which seems appropriate here:
The user noted that Gemma 4 (26b) gets worse as context fills up. Smaller or mid-sized models (like a 26B parameter model) have a lower "attention budget" than the frontier models.
When the context window gets crowded:
The model starts prioritizing the most recent tokens (the user query).
The System Prompt (at the very beginning) loses its "pull" on the model's attention.
Complex, multi-part negative rules are the first things to be "forgotten" in favor of the immediate request.
[… and I realize even after pasting the above that part of your entire point is that the new Gemma seems more nuanced in the problematic behavior than other open weight models that share similar weaknesses]
the__storm@reddit
I'm finding the same - strong on single-turn natural language tasks, but really struggles with tool calling. It'll fail a couple of times and then get into a loop or go down some crazy rabbit-hole.
I'm on latest llama.cpp compiled locally for ROCm, latest Unsloth 26B IQ4_XS, fp16 kv cache. Both Zed and Copilot Chat (clearest Microsoft branding scheme) were really bad, opencode was surprisingly okay for some reason.
Maleficent-Low-7485@reddit
System prompts degrading with context is the real issue, and not just with Gemma.
Last_Mastod0n@reddit
Tuning the system prompt has been huge for improving responses in my experience, so I have not had that issue. As for the context issue, I have noticed massive degradation with every open model I've ever used.
Beledarian@reddit
I saw you use LM Studio. I'm actively developing a toolset for LM Studio, and when I tried it out it failed with my subagent flow, as it did not adhere to the expected tool flow provided in the system prompt. Even after I thought I fixed it, I still encounter frequent issues. But the lm-studio tools and browser control flow provided by my plugin work OK.
But for me this was definitely surprising and frustrating. Especially considering gpt-oss 20b is able to navigate the subagent flow without any problems, even though it's an older and smaller model.
4xi0m4@reddit
The core issue is that Gemma 4's think block is a separate generation pass that doesn't always respect the system prompt. The <|think|> token controls whether thinking is enabled, and if your system prompt doesn't start with it, reasoning gets disabled entirely. For the 26B MoE specifically, the interleaved template from the 31B should work since they share the same architecture. Also worth trying: disable thinking entirely and see if tool calling improves. Some users report it works much better without the think layer getting in the way.
ionizing@reddit
My experience with it: it was hit or miss. Not worth the effort when other models soar on my platform.
ionizing@reddit
In contrast, qwen3.5-35B and 27B both do good shell work, have never complained about reading images, and are also pretty good with bash tools in comparison. But Gemma4 is probably good at many things, of course, and will make a lot of people happy with their use cases, so it is great to get another free model regardless. Heck, maybe it is even fixed by now and I need to grab another copy and try again, though I don't like the idea of having to give it custom system prompts compared to the other models; it has taken a lot of work to fine-tune behavior. Anyhow, I am just rambling. Here is 122B-a10B from you know who enjoying its life in the shell, implementing a plan autonomously:
tarruda@reddit
In my experience, the 26b version never does any reasoning when running inside a coding harness.
the__storm@reddit
Thinking mode is only enabled if the system prompt begins with <|think|>. llama.cpp and similar will use the default prompt, which includes this, but coding harnesses send their own system prompt.
That said I have to agree with OP - 26B seems to really struggle with tool use. Great at single-turn tasks though.
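A hedged workaround sketch for the harness case, assuming the thread is right that the tag has to lead the system prompt; the patch_system_prompt helper is hypothetical and just re-prepends the tag to whatever system prompt the harness sends.

THINK_TAG = "<|think|>"

def patch_system_prompt(messages):
    # Re-prepend the thinking tag to the harness's system prompt, since
    # (per this thread) reasoning is disabled without it.
    for msg in messages:
        if msg.get("role") == "system" and not msg.get("content", "").startswith(THINK_TAG):
            msg["content"] = THINK_TAG + msg["content"]
    return messages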
Independent-Math-167@reddit
Experiencing the same on Gemma4 27b. My qwen3.5 9b was doing better with tools like the DuckDuckGo or Wikipedia tool. Qwen goes and searches the web, but with Gemma I have to tell it to search the web.
Muted_Wave@reddit
Just like me, I found the same thing, so I switched to Qwen instead.
Anthonyg5005@reddit
It usually only seems to apply the system prompt when thinking, and also, yeah, I've felt like I've needed to nudge it more to use tools, otherwise it won't try on its own.
JLeonsarmiento@reddit
Update your framework/llama.cpp version. It was like that over the weekend; since Monday or Tuesday it's been working perfectly.
WishfulAgenda@reddit
I updated LM Studio today and it's night and day. Tool calling was perfect without a system prompt. Using an mxfp4 version right now and getting 70-80 tps at 100k context on dual 5070 Tis. Fully loaded into GPU.
RealChaoz@reddit (OP)
I updated too before trying, on 0.4.10
TheTerrasque@reddit
Very different from my experience. What's your tool stack?
RealChaoz@reddit (OP)
For that question, I had Svelte MCP + Valyu
BasaltLabs@reddit
Gemma 4 is a thinking model. Its <think> block is essentially a separate generation pass that doesn't strongly bind to system prompt instructions the way the final response does. So your CHECK: SYSTEM RULES trick (which works well on non-thinking models) gets ignored because the thinking layer was never trained to respect that kind of meta-instruction. The model reasons freely, then answers -- your system prompt influences the answer surface, not the thinking process itself.
In most serving setups (Ollama, llama.cpp, vllm), whether tools actually get called depends entirely on whether the chat template correctly injects the tool schema and formats the turn boundaries. Gemma 4's template is newer, and a lot of backends either have a stale template or partially broken tool token handling. Before blaming the model, check: are you passing the tools via the tools parameter, not just describing them in the system prompt? You can verify by logging the full prompt as the model sees it (most backends have a debug flag for this).
Previous Gemma versions had no system role at all; it was hacked in via user-turn injection. "Native support" just means it now has a proper <start_of_turn>system token. It doesn't mean the model was heavily trained to obey system prompts the way Llama 3 or Mistral instruct variants were. The RLHF likely prioritized response quality over instruction compliance, which tracks with your benchmark observation.
KickLassChewGum@reddit
FYI to any readers: this is clanker-generated nonsense. All that's needed to debunk this is to set a system prompt in Google AI Studio and look at the thinking lol.
Genuine question: what do you get out of pretending to answer a question? Why would anyone do this? This is so utterly perplexing to me.
BasaltLabs@reddit
Fair enough on the technical correction, I'll take the 'L' on that. I was under the impression the reasoning tokens were handled as a decoupled pass with a different weight on the system prompt, but I see now (especially looking at AI Studio) that it's a continuous chain where the system instructions are very much 'live' during the thought phase. Also, my mistake on the versioning; I was getting my wires crossed between Gemini's current thinking models and the Gemma roadmap.
As for the 'clanker': I wrote the comment myself, but I did run it through an LLM to clean up my grammar and flow before posting. I can see how that backfired: it took my incorrect theory and polished it into something that sounded like a confident hallucination.
I wasn’t 'pretending' to answer; I just had a theory that turned out to be wrong, and the grammar check made it look like a bot wrote it.
On that note, the reason why I was so wrong: I was actually looking at the Gemma 4 A4B MoE architecture. Because it only activates 4B parameters during inference despite being a 26B model, I (incorrectly) assumed the thinking channel was being handled by a different parameter pass than the system prompt role. I see now that the <|think|> tag and system role are natively integrated in the new tokenizer.
KickLassChewGum@reddit
Fair enough. The combination of the unmistakable textual tells and a confidently stated plausible-sounding falsehood is probably about as back as a fire can get, haha.
BasaltLabs@reddit
lool, Well hey, at least I admit my faults
colin_colout@reddit
You're absolutely right!
RealChaoz@reddit (OP)
Yes, it does occasionally perform one tool call at a time. In a rare instance, it actually called 2 in parallel. But generally just doesn't.
IMO the thinking block not abiding by the system prompt (if true) makes it borderline useless for any kind of instruction following. Might as well just disable thinking.
I also injected the system prompt into the user prompt and nothing improved, so I doubt it's that either. I honestly just think the model was benchmark-maxxed and is actually bad.
BasaltLabs@reddit
The benchmark-maxxing take is probably right, at least partially. Google's evals for Gemma 4 were heavily weighted toward MMLU-style reasoning and multilingual tasks, neither of which requires tool compliance or instruction following. So the RLHF signal just wasn't there.
That said, disabling thinking might actually be worth trying before writing it off. A few people in the Ollama and llama.cpp communities have reported that non-thinking mode (if your backend exposes it) makes the model noticeably more compliant with structured prompts, not perfect, but usable. The theory is that the thinking pass learned to "solve" things internally and then just summarizes, so tool calls feel redundant to it.
As for parallel tool calls, that's almost certainly a training distribution issue. The model rarely saw multi-tool examples during fine-tuning, so it defaults to sequential even when the schema supports parallel. Some people have had partial luck with explicit examples in the system prompt (few-shot style: here's a user message, here's what a correct parallel tool call response looks like; see the sketch after this comment), but it's fragile.
For agentic workflows right now, Qwen2.5 or the latest Mistral Small are honestly more reliable, less impressive on general benchmarks but actually trained to follow tool schemas consistently. Gemma 4 feels like a great base model that Google didn't finish fine-tuning for production use cases.
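As a concrete version of that few-shot suggestion, a hedged sketch of a system prompt with one worked parallel-call example; the get_weather tool name and the call notation are illustrative assumptions and should match whatever format your backend's chat template actually emits.

# One worked example of a parallel tool-call turn, embedded in the system prompt.
FEW_SHOT = (
    "Example:\n"
    "User: Compare the weather in Oslo and Bergen.\n"
    "Assistant (correct): issue BOTH calls in the same turn:\n"
    '  get_weather(city="Oslo")\n'
    '  get_weather(city="Bergen")\n'
    "Then summarise both results for the user.\n"
)

system_prompt = (
    "You are a helpful assistant with tools. When several independent lookups "
    "are needed, call the tools in parallel.\n\n" + FEW_SHOT
)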
eggavatar12345@reddit
It’s a great model and you should consider that your quant of choice or llama args are wrong
colin_colout@reddit
Also a reminder to constantly check for new versions of llama.cpp and the quant (assuming that's how you host it). On new models especially, llama.cpp often needs a few weeks at least to hammer out bugs in new model architectures.
...and GGUFs have bugs too (sometimes even just the prompt template). For instance, Unsloth uploaded a new GGUF yesterday (if that's what you're running), so if you're running an Unsloth GGUF from a few days ago, it might not have the latest fixes.
ttkciar@reddit
Well, that's not entirely true. Changing the prompt template to include a system section worked splendidly for both Gemma 2 and Gemma 3, and other models which did have documented support for system sections did not have specific tokens for it either. That is a recent development for all models.
o0genesis0o@reddit
Something could be wrong with your setup if you have the same issue with other models as well. I tuned my agent harness to work with Nemotron 30B, and I'm surprised to see that it handles simpler agentic tasks just as well as GLM 4.7 and Minimax 2.7. It only fails with large and difficult text edits. It means small models can follow system prompts and do multi-turn tool calls, not just frontier models.
Specter_Origin@reddit
Umm, it works really well for me... How are you serving the model? What server, what version, and what platform?
patricious@reddit
I am experiencing the exact same problem, but it's hit or miss: sometimes it's tool calling very correctly, sometimes it says it's deploying agents when in reality it didn't deploy anything, sometimes neither of those things lol. Still need to tweak and test; either way, I am running it with these params on a 5090 and TurboQuant:
Temp 1.0
Repetition Penalty 1.05
@echo off
title Gemma 4 26B - 262K Context (22.2 GB VRAM)
cd /d C:\ai-opt
C:\ai-opt\turboquant-llamacpp\build\bin\Release\llama-server.exe ^
-m "C:\models-no-spaces\gemma-4-26B-A4B-it-UD-Q4_K_M.gguf" ^
--cache-type-k tbqp3 ^
--cache-type-v tbq3 ^
--flash-attn on ^
--ctx-size 262144 ^
--gpu-layers 99 ^
--port 8080 ^
--alias "Gemma-4-26B-TurboQuant-262k" ^
--reasoning on ^
--jinja
colin_colout@reddit
does it have the same issues without kv quantization? Just to rule it out (different models have different sensitivities to quantized KV...and tbq3 is brand new and might need some more time for bugs to shake out)
patricious@reddit
Yes, and it's even worse on Flash Attention + q8_0 KV cache.
colin_colout@reddit
What about no KV cache quantization at all? FA shouldn't affect the output of the LLM, but KV cache quantization does.
Unless I'm out of the loop, it should be mathematically identical with or without FA.
Electronic-Metal2391@reddit
Using this gemma-4-26b-a4b-it-heretic.q4_k_m.gguf inside koboldcpp, I get nothing but a long loop of repeated words.
Rich_Artist_8327@reddit
I had a very large prompt for content categorization across 5000 phrases. Gemma3 did those at a certain accuracy.
When Gemma4 31b came out, I ran the exact same benchmark with the same prompt against the same data. Results were worse than with Gemma3 27b. Then I made the prompt as simple as possible, and results are now on par with Gemma3 27b when it has a 5000-token prompt. So Gemma4 31B gets the same result with a 900-token prompt as Gemma3 27b, which needs 5000 tokens of rules and few-shot prompts for the same results. When I start adding rules and few-shots to Gemma4 31B, results get worse. My understanding is that I do not have thinking on; at least it's not in the prompt, and temperature has been 0.0 and 1.0 with no difference actually.
So Gemma4 somehow responds to a different type of prompting, or what is the issue here?
Grouchy-Bed-7942@reddit
Give your full llama-server command. Also, if you are in OpenWebUI, have you set native tool calling in the model settings?
jacobpederson@reddit
Yup - was very impressed with Gemma - plugged it into opencode and it fell face-first.
Frequent-Mud8705@reddit
I got it to work by changing the sysprompt.
After doing that, the MoE absolutely rips on my 3090, though I think it's still slightly off from what the model expects.
"You are an agentic coding tool.
You live in an agentic coding environment with various tools you can use to help the user."
VoiceApprehensive893@reddit
Most models are completely unaware of their own chain-of-thought mechanism.
Gemma is, but you have to spend multiple turns to make it follow a format rule for its reasoning, and even then it's inconsistent (I got 31b to put its final response into the reasoning block and do 0 reasoning in it once lol; don't expect this level of control from it, I have no idea how it happened).
EffectiveCeilingFan@reddit
Are you sure that the system prompt is being included in the full actual prompt sent off to the engine? llama.cpp I believe has a flag to log all prompts and completions to console if I remember correctly.
RealChaoz@reddit (OP)
Yes, it is - see the last paragraph of the post. When asked it outputted the prompt; I also copy-pasted the prompt in front of my user message, in the chat, and it didn't improve.
coder543@reddit
Which exact gguf are you using?
EffectiveCeilingFan@reddit
Ah, sorry, I missed that. Sorry, I honestly have no clue other than to make sure you’re using the latest GGUFs and have built llama.cpp from the latest commit.
Velocita84@reddit
It doesn't (it used to, but it never worked and got removed anyway); you have to set the env var LLAMA_SERVER_SLOTS_DEBUG=1 and query /slots (plus ?model=[model name] if using router mode) to get the raw context.
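A quick sketch of that check, assuming the env var and endpoint behave as described above; the response shape varies between llama.cpp versions, so the sketch just dumps whatever comes back so you can search it for your system prompt.

import json
import requests

# Requires the server to be started with LLAMA_SERVER_SLOTS_DEBUG=1 (see above).
slots = requests.get("http://localhost:8080/slots", timeout=30).json()
print(json.dumps(slots, indent=2))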
denis-craciun@reddit
I am experiencing similar problems with tool calling using LangChain. Qwen3.5 32b is performing much better on that end. I am trying to understand if there is something that I'm doing wrong, but I think it's just a problem with the model, tbf. I'll update in the next days/weeks. Thank you, now at least I know I'm not the only one.
EffectiveCeilingFan@reddit
There isn’t a Qwen3.5 32B. I’m assuming you meant the 35B MoE?
denis-craciun@reddit
Ya I did sorry