Anyone feel like Qwen3.6 thinks like Gemma 4? And not in a good way.

Posted by Daniel_H212@reddit | LocalLLaMA | View on Reddit | 9 comments

I was disappointed with Gemma 4 due to various bugs and, in the end, lackluster performance on the internet research/information synthesis tasks I use local AI for. Even after every last fix and update to both the model quants and llama.cpp, Gemma 4 suffered from two noticeable problems when doing internet research: (1) it says it needs to keep searching a topic, yet stops searching and gives up; (2) it repeats itself, including its whole research plan, in every single thinking block between tool calls.

Qwen3.6 came out today and I was already skeptical because of the news of the Qwen team disbanding and the fact that this model release happened way too quickly. At this point I'm almost wondering if Qwen saw the release of Gemma 4 and just distilled from Gemma 4 because I'm seeing the same two stupid behaviours I saw with Gemma 4.

I test using two research tasks:

Task 1: I ask for a complete list of current flagship phones that meet a certain list of very specific specifications.

Qwen3.5 35B did this very well, though on some runs it would make the small mistake of thinking Xiaomi's latest flagship is the 15 Ultra (it's the 17 Ultra, though it's also stupid that Xiaomi skipped 16). Gemma 4 26B either eventually failed its tool calls, or made so many that it ran up against OpenWebUI's default limit of 30, because it kept querying for each specific phone and each specific specification. Qwen, by contrast, quickly worked out that pulling the gsmarena spec page for each phone gets you everything about that phone at once.
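To make the difference concrete, here's a rough back-of-the-envelope sketch of why the per-phone fetch strategy stays under a 30-call budget while per-spec searching blows through it (the phone and spec counts are illustrative assumptions, not the exact numbers from my runs):

```python
# Hypothetical tool-call budgeting: one web search per (phone, spec) pair
# vs. one full spec-page fetch per phone. Counts are illustrative only.
PHONES = 8            # assumed number of candidate flagship phones
SPECS = 5             # assumed number of specs to verify per phone
TOOL_CALL_LIMIT = 30  # OpenWebUI's default tool-call limit

# Gemma-4-style strategy: query each phone/spec combination separately.
search_per_spec = PHONES * SPECS   # 8 * 5 = 40 tool calls

# Qwen-3.5-style strategy: fetch one gsmarena-style page per phone.
fetch_per_phone = PHONES           # 8 tool calls

print(search_per_spec, search_per_spec > TOOL_CALL_LIMIT)  # 40 True
print(fetch_per_phone, fetch_per_phone > TOOL_CALL_LIMIT)  # 8 False
```

Even with modest numbers, the pairwise-search approach exceeds the cap before covering every phone, which matches the behaviour I saw.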

Task 2: I ask for a list of SUVs available in my area that include a specific list of features within a specific price range. This query also includes some random background facts, optional nice-to-haves, and specific formatting requests for the output. It was a real request I made to ChatGPT back when it first gained deep research capabilities, because at the time my family's car had just been wrecked by a red-light runner. This is a significantly harder task due to the additional information and requirements, and the fact that there is no equivalent of gsmarena spec pages for cars (plus cars can have different trims, regional models, regional pricing, etc.).

On this task, Qwen3.5 35B nearly matched the original ChatGPT o1 deep research. It got a few specs wrong and actually excluded the car my family ended up buying, even though it fit my criteria exactly (it got confused by the trims). But at least it looked at every relevant SUV in the size class and price range available in my area, found the specific trims that met my criteria across 8 models, and correctly ignored Mitsubishi, which isn't available in my city. ChatGPT o1 back then didn't even manage to include several relevant brands in its search (most notably Volkswagen, which definitely has a dealership in my area but which it never found across several deep research queries), while including Mitsubishi in its results several times.

I didn't test Gemma 4 on this because if it failed the easier task, there's no way it could even get close on this one. But I did expect Qwen3.6-35B to be at least on par with, if not better than, Qwen3.5 35B.

For reference, this is what the research process for Qwen3.5 looked like on task 2, which was the harder task:

This is what Gemma 4's research process looked like the one time it managed to finish task 1, though it produced an incomplete list of results because it gave up on searching early. Notice how it repeats its whole research plan in between searches, and how it only does web searches and never fetches a whole page (consistent behaviour across runs). While not visible in the screenshot, it also repeats everything it has already found in every thinking turn:

And this is what the research process from Qwen3.6 looks like on task 2:

Notice the thinking time difference compared to 3.5. It repeats its entire future research plan (including the criteria I gave it and all planned queries), as well as everything it has already found, in every thinking cycle, just like Gemma 4 does. Not only that, it never tries a web fetch even once; it just keeps using web searches, despite being given the same tools and the same system prompt.

I'm seriously disappointed.