Anyone feel like Qwen3.6 thinks like Gemma 4? And not in a good way.
Posted by Daniel_H212@reddit | LocalLLaMA | View on Reddit | 9 comments
I was disappointed with Gemma 4 due to various bugs and, in the end, lackluster performance on the internet research/information synthesis tasks I use local AI for. Even after every last fix and update to both the model quants and llama.cpp, Gemma 4 suffered two noticeable problems when doing internet research: (1) it says it needs to keep searching a topic, yet stops searching and gives up; (2) it keeps repeating itself, including its whole research plan, in every single thinking block between tool calls.
Qwen3.6 came out today and I was already skeptical because of the news of the Qwen team disbanding and the fact that this model release happened way too quickly. At this point I'm almost wondering if Qwen saw the release of Gemma 4 and just distilled from Gemma 4 because I'm seeing the same two stupid behaviours I saw with Gemma 4.
I test using two research tasks:
Task 1: I ask for a complete list of current flagship phones that meet a certain list of very specific specifications.
Qwen3.5 35B did this very well, though on some runs it would make the small mistake of thinking the latest flagship from Xiaomi is the 15 Ultra (it's the 17 Ultra, though it's also stupid that Xiaomi skipped 16). Gemma 4 26B either eventually failed tool calls, or made so many tool calls that it ran up against OpenWebUI's default limit of 30, because it kept querying for each specific phone and each specific specification. Qwen, by contrast, quickly figured out that pulling the gsmarena spec page for each phone gets you everything about that phone at once.
Task 2: I ask for a list of SUVs available in my area that include a specific list of features within a specific price range. This query also includes some random background facts, optional nice-to-haves, and specific formatting requests for the output. This was a real request I made to ChatGPT back when it first gained deep research capabilities, because at the time my family's car was just wrecked by a red light runner. This is a significantly harder task due to the additional information, requirements, and the fact that there is no equivalent to gsmarena spec pages for cars (plus cars can have different trims, regional models, regional pricing, etc.)
On this task, Qwen3.5 35B nearly matched the original ChatGPT o1 deep research. It got a few specs wrong and actually excluded the car my family ended up buying, even though it fit my criteria exactly (it got confused on the trims). But at least it looked at every relevant SUV in the size class and price range available in my area, found the specific trims that met my criteria across 8 models, and correctly ignored Mitsubishi, which isn't available in my city. ChatGPT o1 back then didn't even manage to include multiple relevant brands in its search (most notably Volkswagen, which definitely has a dealership in my area but which it never found across several deep research queries), while including Mitsubishi in its results several times.
I didn't test Gemma 4 on this because if it failed the easier task, there's no way it could even get close on this one. But I did expect Qwen3.6-35B to be at least on par with, if not better than, Qwen3.5 35B.
For reference, this is what the research process for Qwen3.5 looked like on task 2, which was the harder task:

This is what Gemma 4's research process looked like the one time it managed to finish task 1, though it returned an incomplete list of results because it gave up on searching early. Notice how it repeats its whole research plan in between searches, and how it only does web searches and never fetches a whole page (consistent behaviour across runs). While not visible in the screenshot, it also repeats everything it has already found in every thinking turn:

And this is what the research process from Qwen3.6 looks like on task 2:

Notice the thinking time difference compared to 3.5. Every thinking cycle, it repeats its entire future research plan (including the criteria I gave it and all planned queries) as well as everything it has already found, just like Gemma 4 does. Not only that, it never tries a web fetch even once; it just keeps using web searches despite being given the same tools and the same system prompt.
I'm seriously disappointed.
hq0943f9hf@reddit
I made it program a small Tetris game; both used a "noir Tetris aesthetic", which was kind of weird.
CoolConfusion434@reddit
I haven't done enough testing to come to a definitive conclusion, but have you tried deleting your system prompt entirely for Gemma 4?
My 26B was struggling with tool calling and kept dismissing the date I feed into context as "future", "roleplaying", and a "classic jailbreaker move" in its CoT. It wouldn't even answer what its name was or its training data cutoff date.
I then removed my carefully crafted system prompt, something I've honed for nearly 2 years and that has worked great for many models, and the damn thing sprang to life lol. It doesn't even need prompting to use the tools. I now post a question, ask it to vet and validate the answer, and it picks up the search and scrape tools as needed to go check it out.
Before this it was a freaking struggle to get it to do anything. Always skeptical and suspicious of the "user trying to bypass rules", etc. Answers were dry, half-baked, definitely not what you'd expect from a 26B. My (total) guess is Gemma 4 is really sensitive to system instructions, as it has been 'enhanced' in this area. It might need a new approach when it comes to instruction.
Take it with a grain of salt. This change in behavior happened not 4 hours ago so I haven't had a chance to put it through its paces but, so far, it's behaving way better.
ComplexType568@reddit
what even was the system prompt
ladz@reddit
Frankly, MCP is a completely wasteful protocol full of bad compromises. It doesn't fit well with how models seem to think. I'm hoping this spurs more development of in-line protocols, so the tool call isn't a separate "chat turn" but is in-line with the thinking itself. MCP is confusing as hell, and LLMs can barely think to begin with. I've clauded these in-line things into llama.cpp (so it does the tool calling based on keywords) and they end up magically fast and seem to work fine. But of course it's gotta be local to do that!
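The in-line idea described above can be sketched roughly like this. The `<<tool: arg>>` keyword syntax and the `run_tool` dispatcher are made-up illustrations of the general approach, not real llama.cpp features or the commenter's actual patch:

```python
import re

# Hypothetical in-line convention: the model emits something like
#   <<search: flagship phones>>
# in the middle of its thinking. A keyword scanner intercepts it,
# runs the tool, and splices the result straight back into the text,
# so no separate "chat turn" is needed for the tool call.
TOOL_PATTERN = re.compile(r"<<(\w+):\s*(.*?)>>")

def run_tool(name: str, arg: str) -> str:
    # Stub dispatcher; a real setup would hit a search API or page fetcher.
    tools = {"search": lambda q: f"[results for '{q}']"}
    return tools.get(name, lambda _: "[unknown tool]")(arg)

def splice_inline_calls(generated_text: str) -> str:
    # Replace every in-line call with its tool output so the model can
    # keep reasoning over the results within the same generation.
    return TOOL_PATTERN.sub(lambda m: run_tool(m.group(1), m.group(2)),
                            generated_text)
```

In a real streaming setup you would scan tokens as they arrive, pause generation when the closing `>>` appears, and resume with the tool output appended to the context; the post-hoc substitution here just shows the splicing idea.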
This_Maintenance_834@reddit
Nonsense. The timing is too close for distilling Gemma. And there's no reason to distill Gemma anyway; they could distill Opus or ChatGPT, so why distill a subpar small model?
Daniel_H212@reddit (OP)
Yeah, I didn't think that was the case, hence the "almost wondering", but it does look like the same benchmarks they're targeting are causing similar behaviours.
ttkciar@reddit
My evaluations of Gemma 4 thus far have not involved tool calling, and it's been hitting them out of the park.
Too bad its tool-calling woes have overshadowed its otherwise stellar debut. Hopefully further fixes and improvements are in the offing.
EffectiveMedium2683@reddit
What quants of each are you running? Gemma 4 26B doesn't make those kinds of repetition errors on my setup in iq4_nl. To answer your question, though: yes, Qwen3.6 appears to be a copy of Gemma 4, but with the interesting hybrid architecture. I'm good with that though. Way better than Qwen3.5 in my tests, broadly speaking.
Daniel_H212@reddit (OP)
I'm using unsloth UD-Q6_K_XL for all of them, max context length, no KV cache quantization.