Qwen 3.6 wins the benchmarks, but Gemma 4 wins reality. 7 things I learned testing 27B/31B Vision models locally (vLLM / FP8) side by side. Benchmaxing seems real.

Posted by FantasticNature7590@reddit | LocalLLaMA | View on Reddit | 86 comments

Hey guys,

A couple of weeks ago, I asked this sub for the hardest Vision use cases you were dealing with to test the newly dropped Qwen 3.6 against Gemma 4. I finally finished running the gauntlet side-by-side locally on vLLM (FP8 quants) using my custom GUI.

If you look at the Benchmarks then Qwen should win but from testing it seems really opposite. Looks like Benchmaxing. I attached comparison of scores below

Since official benchmarks are pretty much gamed at this point, I threw real-world, unoptimized junk at them: weird memes, complex GeoGuessr spots, ugly handwritten notes, shopping lists, bounding box requests, and dynamic gym videos.

Here are the 5 biggest behavioral differences and quirks I found:

- Did Qwen 3.6 fix the "Overthinking" token burn?
Yes and no. In Qwen 3.5, the model would burn 10k tokens overthinking simple tasks. In 3.6, the thinking preservation is noticeably better on simple prompts—it stops earlier. However, if you give it an obscure GeoGuessr location or a rare meme, it still panics, goes into a massive reasoning loop, burns 8,000+ tokens, and sometimes fails to output a final answer. Gemma 4 remains vastly more concise (often using just 1,500 tokens for the same task).

- Bounding Boxes & Scaling: Qwen still fights instructions
If you want to extract coordinates for bounding boxes or polygon segmentation masks, Gemma 4 is much better at following formatting instructions. Which make sense as I didn't find any information about this capability on Qwen. Visual models are usually trained on a 0–1000 coordinate grid. When I prompted them to output normalized coordinates (0 to 1), Gemma calculated the scaling perfectly in its thinking phase and output clean JSON. Qwen completely ignored the scaling instruction and output raw 0-1000 coordinates in a weird format most of times.

- The Cultural Divide (Memes & GeoGuessr)
There is a regional bias in their training data.

Gemma 4 easily won European/Western tasks (recognizing obscure European monuments as example).
Qwen 3.6 seem to perform better on Asian context. It accurately identified the Chinese "white people food" meme and correctly guessed an obscure Malaysia/Indonesia border town in GeoGuessr—even without thinking mode enabled.

- Qwen 3.6 is a upgrade for Video tracking
I fed both models a video of me doing deadlifts (pre-processed to 2 FPS to avoid vLLM rejection). Qwen 3.6 was incredible here. With the thinking budget tuned, it correctly identified the exercise, counted the exact number of reps (Gemma missed one), and most accurately estimated the total weight on the bar by judging plate thickness.

- AI Video Detection is still a coin toss
I tested them on videos generated by LTX 2.3. Both models successfully caught blatant physics errors (like balls changing color or smoke without a source). But on more subtle AI videos, they were completely inconsistent. Running the exact same prompt twice would yield "Real" one time and "AI generated" the next. Neither is reliable for deepfake detection yet.

- Don't trust Inference Engines default visual token budget for Gemma
If you're running Gemma and it's failing at fine visual details (like small OCR text or complex graphs), check your max_soft_tokens. Inference engines like vLLM, Llama Cpp often default this to a shockingly low number, like 280. A lot of people think the model is just performing poorly, but it's actually just heavily compressing the image input. If you crank this value up (e.g., to over 1120), the accuracy instantly spikes. The best part? In my testing, maxing out this visual token budget added almost zero noticeable latency. Don't cheap out on your visual tokens!

- Video Pipeline Friction: Gemma eats raw video, Qwen demands 2 FPS
If you are building an automated pipeline, be aware of this input quirk: Gemma 4's encoder is incredibly forgiving and will accept pretty much any video format or framerate you throw directly at it. Qwen 3.6, on the other hand, is extremely strict. You must pre-process your video down to 2 FPS before passing it to vLLM, otherwise it will just throw errors or fail to process.

Resources:
If you want to see the actual latency differences, how I tuned the visual token budgets, and the live inference side-by-side, I put together a repo with uv sync etc here: https://github.com/lukaLLM/Gemma4_vs_Qwen3.5_3.6_Vision_Setup_Dockers There is also video of tests if needed.

Let me know also how you use it so far.

[-]

Roman217@reddit

This has been my first time using local LLMs lately and I have an RTX 3090. I can run both Gemma 4 31B and Qwen 3.6 35B and for all the praise that Qwen has been getting I "want" to like it, but it has been complete and utter trash compared to Gemma in every single task I've tried (which doesn't include coding because I don't use local LLMs to code, I would use GPT for that if desired). Qwen has been way worse in natural Japanese to English translation tasks. It has been way worse in playing chess. I couldn't even get it to finish a game. Gemma 4 only needed one correction of an invalid move. It lost but at least it could play a full game. I use both with reasoning enabled and Gemma 4's reasoning is way quicker, doesn't get stuck in thinking loops as often. Gemma 4's vision is way better. There hasn't been a single thing where Qwen was even remotely comparable. Except for speed in tokens per second, but I would much rather take the extra time for way better quality of output.

Also for reference here is my Gemma's response to the car wash problem:

The car wash is 50 meters away. Should I walk or drive?

Unless you plan on pushing your car 50 meters, you should drive.

It's hard to get a car washed if the car isn't there!

[-]

robertpro01@reddit

I am working on a project where I need visual capabilities and for my specific use case gemma4 sucks.

Basically I'm migrating a project with a lot of charts and widgets and qwen3.6 were able to see more details than gemma4.

But to be fair, I ended up using gpt5.4 because I needed even more details and right now I'm using gpt5.5

[-]

Background-Bit-6279@reddit

https://www.reddit.com/r/LocalLLaMA/comments/1srrhi5/gemma_4_vision/

Are you hosting yourself and configuring the vision budget?

[-]

robertpro01@reddit

Oh, I'll give it a try, thanks!

[-]

TheCatDaddy69@reddit

I've found the big boys that arent Gemini to be borderline blind, not kidding I would rather so calculus with gemini flash before opus or 5.5 just because google has some black magic for image recognition, not sure how Gemma's is though.

[-]