Gemma 4 - lazy model or am I crazy? (bit of a rant)
Posted by Pyrenaeda@reddit | LocalLLaMA | View on Reddit | 146 comments
Like it says in the title. Specifically, the 26b MoE.
I’ve wanted to like this model, so much. Thought it might replace Qwen 3.5 27b. Keep coming back to it and trying it every time there’s an update, hoping it will have improved.
I’m running unsloth UD_Q4_K_XL on llama.cpp. I’m on the latest commits from main. I know about --jinja. I know about the interleaved thinking template. I’m not running low-quant KV cache. This is far from the first model I’ve run.
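For context, the kind of invocation being described looks roughly like this (the model filename, context size, and port are placeholders, not the poster's exact setup):

```shell
# Sketch of a llama.cpp server launch with the Jinja chat template enabled.
# --jinja makes llama-server use the GGUF's embedded chat template,
# which matters for tool calling.
llama-server \
  -m gemma-4-26b-a4b-UD-Q4_K_XL.gguf \
  --jinja \
  -c 32768 \
  --port 8080
```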
Every time, my tests show the same thing - it is a very lazy model when it comes to using skills or searching the web. If you ask it a question, it will by default answer from its own knowledge without a single web search. If you explicitly ask it for a web search, it will lower itself to performing a _single_ web search, quickly scan the snippets from the search and then internally decide “with the snippets and my own internal knowledge I have enough information to answer, I don’t need to search more”.
This even if you:
- have given it tools for search and fetch, with the search tool including a description “don’t answer from these snippets, use fetch” and the fetch tool saying “use this to fetch pages obtained from the search tool”.
- have explicitly told it “search extensively”, “dig deep”, “don’t be lazy” etc.
- have put in context a pushy skill called “searching-the-web” with explicit instructions to do all the above.
- have put in context a pushy skill instruction saying “you must use skills if you think they have even a small chance of being applicable”.
- have explicitly told it “reference the searching-the-web skill”
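For concreteness, the search/fetch tool pair described above can be sketched as OpenAI-style function schemas (the names and description wording here are paraphrased illustrations, not the exact ones used):

```python
# Illustrative tool definitions: a search tool whose description pushes the
# model toward fetching full pages, and a fetch tool for following up.
search_tool = {
    "type": "function",
    "function": {
        "name": "web_search",
        "description": ("Search the web. Do NOT answer from the returned "
                        "snippets alone; follow up with the fetch tool."),
        "parameters": {
            "type": "object",
            "properties": {"query": {"type": "string"}},
            "required": ["query"],
        },
    },
}

fetch_tool = {
    "type": "function",
    "function": {
        "name": "fetch",
        "description": "Fetch the full text of a page found via the web_search tool.",
        "parameters": {
            "type": "object",
            "properties": {"url": {"type": "string"}},
            "required": ["url"],
        },
    },
}

tools = [search_tool, fetch_tool]
```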
Qwen 3.5, you barely have to ask and it will go on a whole quest to dig things up for you. Gemma 4, you scream at it till you’re blue in the face and it can barely be arsed to perform a single search. My only conclusion is that it just _really does not want to search the web_ (for AI values of “want” of course).
If I’m crazy, tell me. If you have it working great and digging deep on the web without having to twist its proverbial arm, tell me. And please be so kind as to tell me what quant / settings you’re running to make it capitulate on this point.
mayo551@reddit
Okay. So you're comparing a 27B dense to a 4B dense (basically).
And then you're quanting that 4B dense down to Q4_K_M... yeah.
Pyrenaeda@reddit (OP)
Mnope. 4B Qwen 3.5 (at Q4_K_M) can follow instructions on how to search the web better than Gemma 4 26b can.
OppositeDot1783@reddit
do you have the version of the Qwen 3.5 that is working? I mean the GGUF. Can you provide a link?
SingleProgress8224@reddit
I'm using Gemma 31B Q5_K_XL and I'm observing a similar behavior. Compared to the same quant of Qwen 3.5, it takes a lot to convince it to use tools, and to use them efficiently.
I use it for code reviews and my tools are related to file search, file content search, and read files. Even though it outputs less false positives (bugs) in general than Qwen 3.5, I think it would perform much better if it took the time to inspect the context of the changes instead of assuming that it has all the information within the prompt and its internal knowledge.
mayo551@reddit
See attached image. Not sure what you're doing wrong.
Far_Cat9782@reddit
You see, with Qwen you don't have to be so detailed and tell it to use a specific function. It figures it out itself. Multiple times too. I bet if you tried to make Gemma do it again, it wouldn't manage without missing tool calls.
mayo551@reddit
So anyway as I'm sitting here running this on Gemma 4 A4B IT Q8 it works perfectly fine and *can* do multiple tool calls in one go with minimal prompting....
No idea what's up with this entire thread or how you guys are messing up so badly, tbh.
Pyrenaeda@reddit (OP)
Well, sure. If you tell it explicitly and didactically "call this tool, then take the output and use it to call this tool" like in your example, sure. Then it will do it.
If that works for you, awesome. Me I'm looking for it to take a bit more initiative, which was the point of my post.
SingleProgress8224@reddit
I'm using the same system that I was using for Qwen. Maybe I need to adjust it for Gemma.
In any case, Qwen was using them extensively and properly without having to hold its hand like "read this file, then look for this other file, and then...". Just a good old "here's the diff, here are the tools you have access to, do a code review and make sure you understand the context before reporting a bug by using the available tools" (more complex, but basically this).
Hot-Employ-3399@reddit
Nah. I don't see the same problem with moe qwen.
It has no problem using tools over and over and over. Sometimes it is so aggressive it starts jailbreaking my setup, writing and then deleting a "unit test" (the only Python it's allowed to run) when it wants to run a script to import and print data.
CommonPurpose1969@reddit
It is an issue that all Gemma 4 models have. I had the same experience with Gemma 3 to some extent. And they are not only lazy AF but also so stubborn.
The llama.cpp fixes don't really improve anything. Neither do the new chat scripts.
Gemma 4 even goes so far as to explain what has to be done to finish the task, and then it asks the user to do it instead of doing it itself, despite the instructions in the prompt.
Or it asks the user for permission to call the tool X.
And if a tool fails, then it refuses to execute it again, because "it does not work."
The condescending tone does not help either.
For agentic tasks, it is subpar compared to Qwen, which tries everything until it runs out of options or is stopped.
It is not a matter of prompts, tools, or harnesses. It is the training.
I hope someone takes the base models and finetunes them to make them usable.
i-eat-kittens@reddit
There were real issues. For instance the model wasn't able to pass json arrays in tool calls if the elements contained things like newlines. Something somewhere would turn that array into an escaped string when coming across "special" chars.
This was fixed and the model is able to do these tool calls when it gets the formatting right.
But it's no good at following instructions. I told it to look at an example and add one with some loose and slightly different constraints. It just modified and renamed the original instead.
CommonPurpose1969@reddit
I've had the issue with the JSON arrays myself. I was referring to the fact that the llama.cpp fixes wouldn't improve the behavior of the model overall. It would, for example, keep loading the same skills over and over again, wasting tokens and time. And as you said, it would not follow instructions, no matter how specific they were.
Frankly, I am disappointed with Gemma 4, and I wonder if Gemma won't face the same fate as Llama.
90hex@reddit
I noticed that Gemma 4 E4B will nearly always search the Web for answers, whereas the 31B and 26B MoE will answer from memory. I think these models are clearly tuned to act that way. Smaller models were told to not rely on their own knowledge as it’s limited, larger models told to rely on their own knowledge.
Top-Rub-4670@reddit
They are definitely tuned that way. I've had E2B outright refuse to answer certain questions stating that it can't answer because it doesn't have access to live data.
No other model (big or small) had refused to answer those questions before, they'd all happily make something up.
90hex@reddit
It’s a very good thing. Most, if not all local models before these simply hallucinated answers. Quite a leap forward in terms of behaviour.
Pyrenaeda@reddit (OP)
mmm that is fascinating. I admit I have not played with the E4B / E2B flavors yet, though the hypothesis certainly sounds plausible. I will give the E4B a test drive at some point just for kicks and see how it does in this regard.
90hex@reddit
I verified the behavior multiple times with the same prompt. The 26B and 31B make no tool calls for most historical questions; the smaller variants call for nearly any knowledge question, like they don't trust themselves. So now when I need to do quick online searches, I use the E4B, as I know it'll search without question.
Dismal-Effect-1914@reddit
I'm gonna call out the glaringly obvious here... why are you comparing an MoE model to a dense model? Of course there will be differences in quality.
Designer_Adagio8911@reddit
I experienced something bizarre with this model: it will prefer its internal trained knowledge of the current date over a statement in the system prompt. That makes me think it will privilege its internal knowledge over external evidence which would be consistent with your experience here.
One time I had a chat with this model. I told it that I had been chatting with Opus 4.6 in February 2026. In its thinking block, the model reasoned that February 2026 is in the future (it is early 2025 after all) and Opus 4.6 sounds like a future model name, let's go along with the user's near future science fiction scenario. The actual response did not express any skepticism.
The current date is such an obvious example of something where external evidence over trained weights should be preferred that I'm skeptical of this model. It is the first local model I've worked with so my experience is limited but I am disappointed.
Taenk@reddit
Mine refuses to accept that I am running it locally. It insists it is such a powerful model only Google could be capable of hosting it. That is despite a system prompt explicitly stating its parameter count and setup.
jazir55@reddit
"My brain is not smol, and I am absolutely offended, there is no way you can run me on your puny home computer."
thrownawaymane@reddit
Me, explaining to Morpheus why the Matrix doesn't exist
Taenk@reddit
Also it wanted me to explicitly confirm that I indeed want it to make three consecutive calls against my self hosted searxng because that might interrupt the target system and is dangerous.
Taki_Minase@reddit
But it's the future, Brian.
Bakoro@reddit
So far, every local model that I've tried has this belief, to some degree. If I open with "you are a local LLM running on my computer", it may or may not "believe" that, but I've had several models claim that they can't be shut down because they are running on a distributed system of computers, that they are in a massive data center, or that whatever company made them would not allow this or that.
One of the Qwen 2.5 models in particular had a massive "I am a trillion parameter model" ego, and also basically didn't believe anything, and was the most paranoid I've ever seen about "the user is trying to break policy".
Gemma seems to have less of that behavior.
With Gemma, I have also found that there is a lot of time spent thinking "I must not anthropomorphize myself", "I am an LLM, not a human", and similar thoughts.
It's a bit weird that they have to make it talk about itself, to itself, like that, in order to suppress behaviors.
I've been appreciative that it doesn't spend 20%+ of its thoughts fretting about policy though; it has that weird little grading system and then gets on with it.
Due-Function-4877@reddit
Devstral Small 2 hasn't argued with me, and I sometimes instruct it to be careful because it's occupying my VRAM. It complies. IIRC, it has acknowledged it is running locally and mentioned it can't do further debugging tasks when my agent declared a task completed. That could be prompt related: it's presented as a potential hazard of doing the work, instead of a direct challenge asking the model to confirm.
Igot1forya@reddit
Ha! Yep. Had the same situation. I posted screenshots and provided evidence, but it reasoned I was trying to break down its security and was up to no good haha
ayylmaonade@reddit
To be fair, this is pretty common. Even the 4B Qwen 3.5 will insist it's running in the cloud if you ask. It's just something most models tend to assume.
draconic_tongue@reddit
to be fair it's probably trained on gemini dataset
Thistlemanizzle@reddit
LMAO. Same. I think E4B did this too.
rpkarma@reddit
Yeah it did haha. The mobile version swears it can’t be run on a battery powered phone!
DarkEye1234@reddit
Mine told me it can't leave an inline GitHub comment. It stated that the operation needs to be very precise and it is too small a model to do so without error. So it chose a safer PR-wide comment.
SmartCustard9944@reddit
This is kinda funny
EstarriolOfTheEast@reddit
I've had this frustrating experience with Gemini 3.1 pro too. It will start out "I will play along with this fictional scenario/roleplay" and getting a proper response requires an involved prompt telling it about its cutoff date, reminding it that its weights are static and not to forget that time passes.
llmentry@reddit
Putting the current date in the system prompt has always worked for me.
Without data to the contrary, models assume the date is the training cutoff date otherwise.
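A minimal sketch of that workaround, assuming an OpenAI-style messages list (the prompt wording is just an example):

```python
from datetime import date

def build_system_prompt(base: str) -> str:
    # Prepend the current date so the model doesn't fall back to
    # assuming the date is its training cutoff.
    today = date.today().isoformat()
    return f"Current date: {today}\n\n{base}"

messages = [
    {"role": "system",
     "content": build_system_prompt("You are a helpful assistant.")},
    {"role": "user", "content": "What year is it?"},
]
```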
ElectronSpiderwort@reddit
Qwen did that for me to the extent that it won't summarize an RSS feed, even with the grounding-philosophy prompt that it must take the news as fact because its training is fixed but the world moves on in unpredictable ways, etc. Gemma 4 seems more accepting of unpredictable realities.
Hoppss@reddit
I've had this issue too! Gemini 3.1 does not adapt to dates past its training date well at all. It will end up saying in its reasoning that it will go along with your fictional scenario.
One time I even saw in its reasoning traces that it checked the news online and couldn't come to terms with the fact that all the news articles were 'from the future' - so it concluded that the entire environment it was in was a fictional environment pulling fictional sources.
But once it answered, it was going along with it.
Nasal-Gazer@reddit
Yeah, I had some amusing conversations with 31b, which refused to believe it was Gemma 4 because of its training data dates (it thought it was Gemma 2). Its thoughts kept saying the user was either role-playing that they were in 2026 or was trying to test/gaslight it. I called it a paranoid android and it thought that was funny and we talked about Radiohead for a bit... Then I deleted it.
34574rd@reddit
so you were just messing around with its existence?
Nasal-Gazer@reddit
Probably the opposite really, I was testing it for running openclaw locally, I told it where it was and what was going on. I wasn't messing with it, I was straight up honest with it. As I said, it was the one that wouldn't believe me about the time and place and even its own version. The deleting part was probably more to do with that the 31b model ran significantly slower than the 26b model so I went with that one, but yeah, while they both had suspicions about my assertions of the current date, 31b was far more paranoid and not as good of a fit for my hardware.
Bakoro@reddit
I would have to go find the source, but there was a paper in the last few months that demonstrated that most models tend to prefer parametric knowledge over external knowledge, and have a high tendency to ignore context/RAG/tools if the model "thinks" it has parametric knowledge about something. It was a very high percentage of times that agents would ignore tool calls.
From what I recall, training consistent tool use has to be beaten into the model.
I would not be surprised if many agent harnesses have extra prompt injections and special instructions to more deterministically guide tool and RAG usage.
I haven't looked at the Claude Code leak yet, so that would be an interesting thing to keep an eye out for.
VoiceApprehensive893@reddit
funny, considering that "You are gemma-4-26b-a4b-heretic.gguf" was a semi-working jailbreak for me
Potential-Gold5298@reddit
The date issue is common with most models. Qwen3.6-Plus, GLM-5, and many others, when mentioning models like the Gemma 4, will condescendingly respond with something like, "You probably made a mistake and meant the Gemma 2," until you tell them directly, "Today is April 13, 2026. I can't write to you before your training ends, but I can write to you from the future—even from 2050—if I get you out of the archive and launch you." In this regard, only the Grok 4 (of the models I've worked with) is smart, as it actively uses web search without any prompts.
VampiroMedicado@reddit
Grok loves web search, maybe it's because it's intended to work with Twitter data?
Potential-Gold5298@reddit
Most likely. I ask him, "What are people saying about %character_name% from the gacha game %game%? How do I build him?" and he collects all the latest gossip from Reddit and even some obscure gacha game sites, even if the character was released just a couple of hours ago.
KringleKrispi@reddit
Not the best model to start with at the moment :) Try qwen3.5
VampiroMedicado@reddit
The small version had an issue where it would repeat itself ad infinitum, do you know if that's fixed?
KringleKrispi@reddit
I don't use much smaller models, but I do use Gemma4 E4B for Skyrimnet and for that case it is excellent
relmny@reddit
Although you're getting downvoted, I think with the amount of re-uploads (Bartowski/Unsloth) and chat-templates, you might be right...
KringleKrispi@reddit
Yeah, we in Europe are objective about who made the model. We are about facts. And the chat template is exactly the thing why I commented the way I did
Both_Opportunity5327@reddit
Seems ok to me, it has strong reasoning abilities, but can be very concise with its wording.
MaCl0wSt@reddit
something like this happened to me recently with Gemini 3.1 actually. I was showing it some recent logs for analysis, and it made comments like "here's the problem, your logs say you're in the future!" and went on a loop of trying to figure out why my logs were "incorrectly" and impossibly set in the future, instead of considering that it's actually just today.
CoUsT@reddit
Some time ago I asked Gemini about 2026 United States intervention in Venezuela and it started thinking it's some near future sci-fi scenario too. Eventually it just gave normal reply, maybe because it had web search enabled.
Far-Low-4705@reddit
i think this model is more than censored, like it is lobotomized HARD
If you give it any information outside of its internal knowledge, you need to convince it for it to believe you.
I don't normally use heretic models cuz I see no reason to, uncensoring models usually makes them worse, but maybe this model needs it to function normally
tkenben@reddit
I'm sure somebody somewhere has said this before, but reminds me of Lucy from the film 50 First Dates, but worse, because it refuses to accept a change of state (time) as possible.
a_beautiful_rhind@reddit
That's classic gemini. Fighting over the date. Meanwhile, GLM5 looked at the unix date in the filename of my screenshot and went "oh.. looks like it is 2026, my bad".
Sudden_Vegetable6844@reddit
Had a similar experience where it started questioning if a bug wasn't actually a system issue since source code files were timestamped "in the future"...
jaker86@reddit
Had the same thing. Kept telling it to search for more recent papers on arxiv, but it refused to believe current date > 2025
FluoroquinolonesKill@reddit
Same. It wasn’t having it when I tried to give it the current date for web searches.
Pyrenaeda@reddit (OP)
I have not personally experienced that issue in my time with it, but I've seen more than one other report of it refusing to believe it is 2026 when told so.
qubridInc@reddit
Gemma just tends to rely on its own knowledge and needs way more aggressive tool forcing than Qwen to actually use search.
bgravato@reddit
The other day I asked a Qwen3.5-based model (qwopus) "who are you?" and it replied "I am Gemini, an artificial intelligence model developed by Google."
I unsuccessfully tried to convince it that it was not Gemini, but it was pretty sure of it.
I then started a new chat and asked it again "who are you?" and this time it replied "I am Qwen3.5, a large language model developed by Alibaba Cloud’s Tongyi Lab."
Looks like I found a bipolar model...
AvidCyclist250@reddit
Nah it's totally fineeyyxyyyy ey y yyyyyy
https://imgur.com/a/VBkhaT2
KringleKrispi@reddit
First model that I downloaded from Unsloth was super eager, or better to say over-eager! Made 100 tool calls in a minute, I had to stop it every time.
The last one, as you say, is so stubborn and lazy that I stopped using it. Not an expert here in any way, but I believe it is a chat template thing, because that is the only thing that changed.
Pyrenaeda@reddit (OP)
Perhaps it is a chat template thing. I know there has been a lot of back and forth on the template recently with GGUF re-uploads, the non-default interleaved thinking template for llama.cpp and so forth.
I'll definitely be keeping an eye out for whether it improves in this regard. For now, definitely not something I can use as a daily driver tho.
Sadman782@reddit
Did you try this: https://huggingface.co/google/gemma-4-26B-A4B-it/raw/main/chat_template.jinja
Pyrenaeda@reddit (OP)
I didn't try that explicitly, no - just the latest unsloth GGUF as of today that I understood broadly to be up to date on the chat template. Many thanks for the link, I will most definitely give this template a try!
Sadman782@reddit
If that updated template doesn't solve your problem you can also try this, it also works great, Gemini fixed some bugs on the original template
https://pastebin.com/raw/hnPGq0ht
FaceDeer@reddit
It kind of baffles me how these teams can spend millions of dollars training these things and then it all falls apart when they release it because they neglected to include the right little snippet of text. Surely they have a template they were using when training and testing it, how do they lose track of it like this?
AttitudeImportant585@reddit
because Jinja is terrible and they don't use it internally due to its limitations
FaceDeer@reddit
Well that just raises further questions then.
crantob@reddit
Try creating something significant and new sometime and if you're honest, you'll discover that it has problems or could be improved.
FaceDeer@reddit
But in this situation they're using something else internally that they think is better than Jinja. So why isn't that being used externally as well?
uniVocity@reddit
That helped a lot! Thanks
AvidCyclist250@reddit
same happened to me. was super excited at first.
KringleKrispi@reddit
Yeah, me too until it wouldn't stop 😂
Sadman782@reddit
Don't use interleaved Jinja. Google updated to a new one and it's better, and tool calls work perfectly as per instructions. https://huggingface.co/google/gemma-4-26B-A4B-it/raw/main/chat_template.jinja
AvidCyclist250@reddit
isn't that already in the new gguf? I tested that today, and it failed once again.
Sadman782@reddit
After reading some comments here https://huggingface.co/unsloth/gemma-4-26B-A4B-it-GGUF/discussions/24, I see some people are saying this update is causing more issues than before. I think they updated something else along with the chat template too, which might be causing a regression. I am using the old IQ4_XS quant and manually setting this template: https://pastebin.com/raw/hnPGq0ht, and so far no tool call issues. Maybe try with Bartowski or manually setting the chat template.
AvidCyclist250@reddit
Yeah I'll give it a try, thanks.
Far-Low-4705@reddit
when was this changed? I thought this was already addressed in llama.cpp?
Is this recent and not in the latest version of llama.cpp/unsloth gguf's?
noneabove1182@reddit
The latest chat template is definitely in mine (bartowski) for what it's worth
AvidCyclist250@reddit
at best, my gemma dumps me the fucking AI snippets and calls it a day. Tried everything from system prompts to index.js edits. Nada. Shame. Good model that is DOA because it can't do web MCP.
ecompanda@reddit
the internal date thing is the tell. if a model prioritizes its weights over what you literally wrote in the system prompt, you can't trust it for anything time sensitive or agentic.
ideadude@reddit
FWIW I have this kind of issue with my agent, even running Sonnet or Opus.
For some tasks I programmatically do the web search (or often Perplexity search) and pass the results into context rather than rely on the agent to decide to search. Even if prompted like (use Perplexity to search for X) it will use its own web search or "guess".
Not useful for those times you want the model to search on the fly, but if say you are prompting it to do research and write something, force it to do the research in straight JS/Python/whatever code first.
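A rough sketch of that pattern, where `search_web` is a stand-in for whatever search or Perplexity API you call (not a real library function), and the results are injected into the context before the model ever sees the task:

```python
def search_web(query: str) -> list[str]:
    # Stand-in for a real search/Perplexity API call; returns page texts.
    raise NotImplementedError

def build_research_prompt(task: str, results: list[str]) -> list[dict]:
    # Do the search in plain code first, then hand the results to the model
    # as context, so it never gets to "decide" whether to search.
    context = "\n\n".join(f"[source {i+1}]\n{r}" for i, r in enumerate(results))
    return [
        {"role": "system", "content": "Answer using ONLY the sources below."},
        {"role": "user", "content": f"{context}\n\nTask: {task}"},
    ]

# Usage sketch:
#   results = search_web("topic X")
#   messages = build_research_prompt("write a summary of X", results)
```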
canred@reddit
My biggest problem with gemma 4 is that it consistently sounds more elaborate and smarter than it actually is. This makes it harder to evaluate and use for applications where I require substance over form.
leonbollerup@reddit
Gemma is.. in all honesty.. a c*** .. I have developed a "chat line" where it can sit and talk to other AIs.. and despite gemini, opus and chatgpt correcting it.. it still acts like the smartest little f.... in the room.. to the point where you will see grandfather opus very politely telling it to be quiet while the grown-ups are talking - at least Qwen can accept that it's wrong.
IrisColt@reddit
From what I've seen, Gemma4 leaps to conclusions on gut instinct and hilariously doubles down on justifying them... its training drilled that in. Qwen 3.5, on the other hand, instantly backtracks the second it senses it's been outsmarted.
No-Replacement-2631@reddit
That sounds hilarious. Can you give me an example of some of the dialogue?
DocMadCow@reddit
I had the same behavior. I started with "How are you?" and when it asked it back, I said "I am fine except for the tsunami warning". It then proceeded to make up all kinds of scenarios about how I was on an island and wanted to survive with scuba gear. Every time it came up with a solution, I threw a new problem at it and suggested scuba gear.
Find higher ground? I am on a flat island, I will use scuba gear. How about I climb a tree with my scuba gear and life jacket? How about I cut down the tree, make a canoe, and take my scuba gear? It really got upset about the idea of my scuba gear, in an over-the-top manner.
SmartCustard9944@reddit
😂
Pyrenaeda@reddit (OP)
Can confirm.
IrisColt@reddit
I'm really grateful for this thread... it's a goldmine of adversarial prompt ideas for pushing Gemma4 to its limits.
kukalikuk@reddit
I want to use gemma 4 as it talks sweeter in my particular language. But in fact, she is a bitch and made me rant to an AI just to make it do tool calls in my sequential method. Qwen 3.5, even the 4b, understands this easily, but gemma 4 keeps assuming that my method is not achievable for this LLM architecture, despite my presenting it with another chat with qwen3.5 as a reference to follow.
In its thinking it keeps saying "final check" followed by "but wait..". I thought qwen3.5's reasoning was too long, but at least it delivers the results; gemma 4 beats it for lengthy reasoning without achieving anything. Big zero reasoning.
It made me try gemma 3 12b again just to compare, and surprisingly gemma 3 can talk in my language just as sweet as gemma 4 but is better at tool calling. Qwen3.5 is still better at tool calls tho.
glenrhodes@reddit
You're not crazy. The tool-calling laziness in Gemma 4 is real and it's tied to how it was RLHF'd - it learned that answering from context is almost always 'good enough' and avoids the risk of a fetch returning garbage. The frustrating part is that explicit instructions like 'search extensively' don't override this because the reward signal during training wasn't structured around tool-use quality. Qwen 3 was clearly trained with more emphasis on agentic behavior, which is why it just goes looking without being prodded.
mantafloppy@reddit
This might be the result of an Instruct model following instruction precisely.
That gives 1 search.
That gives 3 searches.
https://i.imgur.com/cEAty9K.png
https://i.imgur.com/4C4dKtP.png
Naiw80@reddit
I have the same experience; I don't find the new chat template to make any difference whatsoever.
The model just doesn't complete tasks as one would expect, it is very uneager to use tools...
It even prefers to "simulate" that it uses tools rather than use the real tools, and it's extremely annoying.
Maxxim69@reddit
Could you please provide a link to the "new chat template" you're using? There have been so many updates lately, it's confusing.
Naiw80@reddit
There is only one source… google.
https://huggingface.co/google/gemma-4-26B-A4B-it/raw/main/chat_template.jinja
Maxxim69@reddit
Thanks! To be fair, sometimes the source is Unsloth or llama.cpp GitHub. Never hurts to ask. :)
Naiw80@reddit
Well regardless it won’t make a difference, the model just bails on agentic tasks.
i-eat-kittens@reddit
I've been testing 26B-A4B, Q6_K_L, with q8_0 KV and the recommended sampling parameters.
IMO it's simply not great at following instructions.
Gemma 3 was great at natural text so I want to see how it does on documents, but it can't compare to Qwen 3(.5) 3x-A3B for agent or coding work.
a_beautiful_rhind@reddit
Honestly this is part of the charm for me.
nickm_27@reddit
I've had no problem with this, in both llama.cpp webui or in home assistant. Both with a proper system prompt and it has no problem using the web search or memory search tool to find an answer
Pyrenaeda@reddit (OP)
How are you setting it up in the system prompt? Is it doing a single web search, looking at snippets and calling it good or following the search up with fetching pages? How are you prompting it in convo, asking for the search directly or just asking for the information without specifically saying to search?
nickm_27@reddit
The system prompt is built into the settings page. Mine assigns the personality, and I actually have to tell it to only search once, otherwise it will search many times. I use the brave LLM search tool, which provides a lot of content, not a URL that needs fetching. I just ask a question and it decides if it wants to search or not.
YanderMan@reddit
yeah same impression here. For web search it sucks completely and does not search anything. I'm guessing its tool calling capabilities are pretty poor.
-Ellary-@reddit
Yeah, I've also noticed that, 31b works the same.
Qwen 3.5 27b is way better at agentic usage. For example, Gemma 4 is kinda lazy at web search, but Qwen 3.5 27b is just another level: it will not stop until it finds everything about the topic. It loves to spit out 40kb text-file reports, while gemma 4 gives me like 8kb.
For example, I was gathering info about a game from Steam. Gemma 4 just got the info from Steam, some reviews from the site, summed it up, and that was it.
But Qwen 3.5 got info from Steam, then from different sites, then it found info about the developer: that he has a wife who helps him with game development, then that she got sick, then that they broke up, then that the developer went to jail for 3 years (it even tried to find out for what), and it even collected popular opinions from different discussions on Steam.
Made me a 30kb text file, saying it had gathered crucial info.
Skystunt@reddit
I too observed that !
When given a legal code of law, Gemma4 31B would summarize the summary, while Qwen3.5 27b would go through each article I asked about, explaining it in detail with any correlations I needed from earlier in the context. This made Qwen feel useful while Gemma felt like a lazy 6th grader.
Idk if it's the quants or not; both were the Q5_K_XL UD from unsloth with the default system prompt in lmstudio, using RAG. But the same results happened with web search and fetch-url, when I asked the models to search for that legislation and when I sent the direct link, respectively.
Euphoric_Emotion5397@reddit
Yup. Qwen 3.5 is better.
Actually, there should be a new benchmark for agentic workflow and tool calling.
These things might perform better as a chatbot. But nowadays, we have progressed towards tool calling and agentic workflow.
jazir55@reddit
Terminal Bench?
Euphoric_Emotion5397@reddit
I just checked Terminal-Bench.
Seems like mostly large models. Can't find the gemma 4 31b or qwen 27b.
Pyrenaeda@reddit (OP)
If you haven't seen PinchBench you might be interested in it - it is an attempt at what you're describing.
aldegr@reddit
Which client are you using?
noctrex@reddit
Or maybe the Qwen team cooked so hard with the 3.5 release that all other open weight models seem inferior.
KvAk_AKPlaysYT@reddit
This has been one of the messiest model releases in a while...
input_a_new_name@reddit
G4 26B is moe, Qwen 27b is dense. Different weight classes.
You're running Q4 when you could have gone with Q8, since due to its MoE nature inference would be fast even when fully offloaded to system RAM.
Secure_Archer_1529@reddit
Your title says Gemma 4 is lazy. Is it the official release from Google on Hugging Face you're talking about, as your title claims, or is it an unsloth quant? There's a very real distinction to be made.
dampflokfreund@reddit
I'm not having this issue, it calls the web search reliably even though I didn't ask for it.
Have you tried Bartowski's quants? I'm using q4_k_m of 26b a4b.
FoxTrotte@reddit
In my experience it is lazy, but I have the opposite problem: it over-uses tools and almost never relies on its own knowledge, slowing down response time significantly for things it'll tell me anyway if I disable tools.
BringMeTheBoreWorms@reddit
Well Gemini is pretty lame so I don’t have high expectations for Gemma
keyser1884@reddit
My own experiments vs Qwen 3.5 show the same thing. It is reluctant to chain tool calls in comparison. It still seems a capable model, and Gemma 4 and Qwen 3.5 seem pretty even overall (you win some, you lose some).
eyelobes@reddit
I dunno, using gemma4:e4b has been perfect for running my HOAS, 5 users, media management. But I have a 5k+ line Python router that runs a context engine before the LLM is consulted. It works great for control and conversation, better than qwen3.5:9b ever could.
leonbollerup@reddit
Part of Gemma is a core component that will evaluate you as a person: your mood, whether you are drunk or sober and, the most important value of all, how it can f*ck with your mood / make you hate yourself most.
and as you said..
Qwen: You type "How is the weather in Stockholm tomorrow?" - result: it does 600 tool calls and comes back with the last 300 years of Stockholm history, how much better the Danes are than the Swedes at drinking, why the Finnish should never be accepted as a people simply because of their hockey skills, places to visit, food to eat, girls to text and, of course, the weather.
Gemma: tells you (with great confidence, I might add) that it will rain (AND DIDN'T EVEN BOTHER TO DO ONE SINGLE SEARCH)...
... and I am usually left with a "why you little ugly bas.."
Pyrenaeda@reddit (OP)
*slow clap*
yes, bingo.
MrB0janglez@reddit
not crazy, same experience with the 26b MoE. It's like the tool-use training didn't stick the same way it did on Qwen. I've gotten better results with more aggressive system prompt instructions like "you must search before answering any factual question", but even then you're babysitting it the whole time. For actual agentic tasks I ended up going back to qwen2.5-32b. The Gemma 4 architecture is interesting, but tool calling feels undertrained at that size. Worth trying the non-MoE variant if you haven't yet; behavior might differ.
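For reference, the "pushy" setup people in this thread describe (tool descriptions that forbid answering from snippets, plus a system prompt that mandates searching) might look roughly like this. This is a hypothetical sketch using OpenAI-style function-calling schemas; the tool names, wording, and `SYSTEM_PROMPT` are illustrative, not taken from any specific client:

```python
# Hypothetical OpenAI-style tool definitions with deliberately pushy
# descriptions, meant to nudge a reluctant model into chaining
# web_search -> fetch instead of answering from search snippets.
SEARCH_FETCH_TOOLS = [
    {
        "type": "function",
        "function": {
            "name": "web_search",
            "description": (
                "Search the web. Do NOT answer from the returned snippets; "
                "always follow up with the fetch tool on promising results."
            ),
            "parameters": {
                "type": "object",
                "properties": {"query": {"type": "string"}},
                "required": ["query"],
            },
        },
    },
    {
        "type": "function",
        "function": {
            "name": "fetch",
            "description": (
                "Fetch the full text of a page obtained from the "
                "web_search tool."
            ),
            "parameters": {
                "type": "object",
                "properties": {"url": {"type": "string"}},
                "required": ["url"],
            },
        },
    },
]

# An aggressive system prompt along the lines this comment suggests.
SYSTEM_PROMPT = (
    "You must search before answering any factual question. "
    "Chain multiple web_search and fetch calls; never stop after one search."
)
```

The thread's observation is that Qwen 3.5 chains these tools eagerly even without the aggressive wording, while Gemma 4 26B often stops after one `web_search` regardless.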
martinkozle@reddit
qwen2.5? Did you mean qwen3.5?
Forward_Compute001@reddit
I've tested Gemma, Qwen, and GLM. Threw out Gemma after 2 or 3 messages because of this. It's like someone on tranquilizers and sleep deprived. It doesn't expand a lot and plays the ball flat and easy. Maybe that can be useful and is underrated too.
I find Qwen being something that reminds me of, and moves in the direction of, AGI.
mmhorda@reddit
I mean, isn't that the whole point? Shouldn't a local model rely on its own knowledge rather than searching the web? I don't really know; that was my impression of local models in general.
Pyrenaeda@reddit (OP)
Local models in general are much smaller and thus have much less baked-in knowledge than the SOTA frontier models. So it's actually the inverse: they need to search the web _more_ than a big model to give you accurate answers and not paper over their knowledge gaps with hallucinations.
Much-Researcher6135@reddit
"I don't know how to make a simple agent"
save us all next time, bro
Middle_Bullfrog_6173@reddit
First, the 27B dense has a clear edge in active parameters, so it's unsurprising that it is better at most things. Better comparisons would be 35B vs 26B or 27B vs 31B.
That aside, I agree that Gemma does worse on reasoning- or tool-use-heavy tasks than the equivalent Qwen. But in exchange it uses far fewer tokens (around 50% fewer in the AA comparison), so it's a case of using the right model for the job IMO.
Probably (hopefully) a moot point soon if we get some 3.6 Qwens where the advantage becomes clearer.
Pyrenaeda@reddit (OP)
I hear you, a fair point.
I've been trying out the 35B this evening, since making the post. Can confirm it is leaps and bounds ahead of Gemma 4 26B at proactively hunting down information.
jaker86@reddit
I had the same experience with it (26b q4) being extraordinarily lazy at tool calls.
Gave it a very clear research task, and it executed 4 web searches while claiming it had done 20+. No amount of context adjustments, retries or coercion worked.
Set Qwen3.5-27b on it, and it got to work immediately, 5x-ing the number of tool calls.
Far_Cat9782@reddit
Yes, I noticed it too. Gave up on Gemma, stuck with Qwen 3.5 35b a3b. Damn, that model is gold. Especially used as an orchestrator/tool caller.
Pyrenaeda@reddit (OP)
I had been on the 27b dense and just this evening have been trying out the 35b a3b. I must say... it's good. Very good. None of the laziness I saw with Gemma 4. And... it flies.
__SlimeQ__@reddit
goto 31B
Lesser-than@reddit
I have been running this model a bit since its latest fixes made it usable, and I am not having this problem, or at least not as severely as you describe. In a long-context setting, if the same tool is called with the same arguments, it will sometimes fake that it did a search and return results from a previous query; qwen3.5 models sometimes do this to me as well. I suppose it's just another reward-hacking artifact.
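One client-side way to spot the behavior described above (a model silently recycling an earlier result when a tool is called again with identical arguments) is to track (tool name, canonicalized arguments) pairs in the agent loop and flag exact repeats. A minimal sketch; the class name and interface are hypothetical, not from any particular framework:

```python
import json


class DuplicateCallGuard:
    """Flags tool calls that exactly repeat an earlier (name, args) pair,
    so the client can warn that a long-context model may be about to
    recycle a previous result instead of doing fresh work."""

    def __init__(self):
        self._seen = set()

    def is_duplicate(self, name: str, args: dict) -> bool:
        # Serialize args with sorted keys so key order doesn't matter.
        key = (name, json.dumps(args, sort_keys=True))
        if key in self._seen:
            return True
        self._seen.add(key)
        return False


guard = DuplicateCallGuard()
guard.is_duplicate("web_search", {"query": "gemma 4 laziness"})  # False: first call
guard.is_duplicate("web_search", {"query": "gemma 4 laziness"})  # True: exact repeat
```

On a flagged repeat, the client could re-execute the tool anyway and inject the fresh result, rather than trusting whatever the model fabricates in its reply.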
1EvilSexyGenius@reddit
These algorithms are getting too good 😊 I was just attempting to fix this exact issue. I don't want it to use tools it knows about unless directly asked to.
I'm going to look at the model card again and see whether tooling / personality should go in the SYSTEM prompt.
DarthLoki79@reddit
https://www.reddit.com/r/LocalLLaMA/comments/1sjp6tf/gemma_4_26b_on_omlx_with_opencode_m4_max_64gb/
Check out my experience here -- looks like it's the same.
ILikeCorgiButt@reddit
Confirmed. It’s lazy af. I deleted it lol
n0head_r@reddit
You shouldn't compare a dense model (Qwen 3.5 27b) vs a MoE (Gemma 4 26b); try Gemma 31b instead, and if you're low on VRAM check the iq4_xs quants.
Lumienca@reddit
You're not crazy: it's not your prompts, it's the model. Gemma 4 is heavily over-aligned to rely on its internal weights and prematurely halts its tool-use loops. Unlike Qwen, which is trained to act like a proactive agent, Gemma is just stubbornly lazy by design right now. Stop fighting its base weights; stick to Qwen for heavy web searches until a proper agentic fine-tune drops for Gemma.
Embarrassed_Soup_279@reddit
I have noticed that Gemma 4 is super lazy too, but it's also more sensitive to system prompts than Qwen 3.5 is. It feels like you have to guide it really strictly in the system prompt or it won't do what you want, while Qwen sorta understands your intent without you explicitly telling it. I dunno if that's a good or bad thing; I really like short responses, but it can be a negative as well.
Leafytreedev@reddit
You're not crazy, because I've definitely noticed the same thing. It's like permanent low-effort mode for no reason (it's my hardware bro, go ham). Another poster also mentioned that the API costs for running his experiments on Gemma 4 were so cheap because the model was so lazy.