Qwen 3.6 wins the benchmarks, but Gemma 4 wins reality. 7 things I learned testing 27B/31B Vision models locally (vLLM / FP8) side by side. Benchmaxing seems real.
Posted by FantasticNature7590@reddit | LocalLLaMA | View on Reddit | 86 comments
Hey guys,
A couple of weeks ago, I asked this sub for the hardest Vision use cases you were dealing with to test the newly dropped Qwen 3.6 against Gemma 4. I finally finished running the gauntlet side-by-side locally on vLLM (FP8 quants) using my custom GUI.
If you go by the benchmarks, Qwen should win, but from my testing it seems the opposite. Looks like benchmaxing. I attached a comparison of the scores below.
Since official benchmarks are pretty much gamed at this point, I threw real-world, unoptimized junk at them: weird memes, complex GeoGuessr spots, ugly handwritten notes, shopping lists, bounding box requests, and dynamic gym videos.
Here are the 7 biggest behavioral differences and quirks I found:
- Did Qwen 3.6 fix the "Overthinking" token burn?
Yes and no. In Qwen 3.5, the model would burn 10k tokens overthinking simple tasks. In 3.6, the thinking behavior is noticeably better on simple prompts: it stops earlier. However, if you give it an obscure GeoGuessr location or a rare meme, it still panics, goes into a massive reasoning loop, burns 8,000+ tokens, and sometimes fails to output a final answer. Gemma 4 remains vastly more concise (often using just 1,500 tokens for the same task).
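If you want to reproduce the token-burn comparison, here's roughly how I cap it per request. This is only a sketch against vLLM's OpenAI-compatible endpoint, assuming Qwen 3.6 keeps the Qwen3-style enable_thinking chat-template kwarg; the served model name, port, and image URL are placeholders:

```python
# Rough sketch: capping/toggling thinking per request against a local vLLM
# OpenAI-compatible server. Assumes a Qwen3-style chat template that accepts
# an `enable_thinking` kwarg; model name and image URL are placeholders.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="dummy")

resp = client.chat.completions.create(
    model="qwen3.6-27b-vl",  # hypothetical served model name
    messages=[{
        "role": "user",
        "content": [
            {"type": "image_url", "image_url": {"url": "https://example.com/meme.jpg"}},
            {"type": "text", "text": "What is this meme referencing? One sentence."},
        ],
    }],
    max_tokens=512,  # hard cap so a reasoning loop can't eat 8k+ tokens
    extra_body={"chat_template_kwargs": {"enable_thinking": False}},
)
print(resp.choices[0].message.content)
```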
- Bounding Boxes & Scaling: Qwen still fights instructions
If you want to extract coordinates for bounding boxes or polygon segmentation masks, Gemma 4 is much better at following formatting instructions. Which makes sense, as I didn't find any information about this capability for Qwen. Visual models are usually trained on a 0–1000 coordinate grid. When I prompted them to output normalized coordinates (0 to 1), Gemma calculated the scaling perfectly in its thinking phase and output clean JSON. Qwen completely ignored the scaling instruction and output raw 0-1000 coordinates in a weird format most of the time.
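For anyone reproducing the bounding box test, the rescaling itself is trivial; this is a minimal sketch of the post-processing I'm asking the model to do (the box order and JSON field names are just what my prompt asked for, nothing standard):

```python
# Minimal sketch: convert boxes from the 0-1000 grid most VLMs are trained on
# into normalized 0-1 coordinates. Box layout [x_min, y_min, x_max, y_max]
# and the field names are illustrative only.
import json

def normalize_box(box_0_1000, decimals=3):
    """Scale an [x_min, y_min, x_max, y_max] box from the 0-1000 grid to 0-1."""
    return [round(v / 1000.0, decimals) for v in box_0_1000]

raw = {"label": "dog", "box_2d": [112, 340, 590, 887]}  # typical raw model output
clean = {"label": raw["label"], "box_2d": normalize_box(raw["box_2d"])}
print(json.dumps(clean))  # {"label": "dog", "box_2d": [0.112, 0.34, 0.59, 0.887]}
```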
- The Cultural Divide (Memes & GeoGuessr)
There is a regional bias in their training data.
- Gemma 4 easily won European/Western tasks (recognizing obscure European monuments, for example).
- Qwen 3.6 seems to perform better in Asian contexts. It accurately identified the Chinese "white people food" meme and correctly guessed an obscure Malaysia/Indonesia border town in GeoGuessr, even without thinking mode enabled.
- Qwen 3.6 is an upgrade for video tracking
I fed both models a video of me doing deadlifts (pre-processed to 2 FPS to avoid vLLM rejection). Qwen 3.6 was incredible here. With the thinking budget tuned, it correctly identified the exercise, counted the exact number of reps (Gemma missed one), and most accurately estimated the total weight on the bar by judging plate thickness.
- AI Video Detection is still a coin toss
I tested them on videos generated by LTX 2.3. Both models successfully caught blatant physics errors (like balls changing color or smoke without a source). But on more subtle AI videos, they were completely inconsistent. Running the exact same prompt twice would yield "Real" one time and "AI generated" the next. Neither is reliable for deepfake detection yet.
- Don't trust inference engines' default visual token budget for Gemma
If you're running Gemma and it's failing at fine visual details (like small OCR text or complex graphs), check your max_soft_tokens. Inference engines like vLLM and llama.cpp often default this to a shockingly low number, like 280. A lot of people think the model is just performing poorly, but it's actually just heavily compressing the image input. If you crank this value up (e.g., to over 1120), the accuracy instantly spikes. The best part? In my testing, maxing out this visual token budget added almost zero noticeable latency. Don't cheap out on your visual tokens!
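To make it concrete, this is roughly how I raise the budget in vLLM. Just a sketch: the model id is a placeholder, and the exact kwarg name depends on your engine and model revision (max_soft_tokens is what worked for me; llama.cpp exposes image token flags instead):

```python
# Rough sketch: raising Gemma's visual token budget via vLLM's offline API.
# "max_soft_tokens" is the processor kwarg that worked for me; the model id
# is a placeholder. For the server, the same thing can be passed as a JSON
# string via --mm-processor-kwargs.
from vllm import LLM

llm = LLM(
    model="google/gemma-4-31b-it",                  # placeholder model id
    mm_processor_kwargs={"max_soft_tokens": 1120},  # engines often default to ~280
    max_model_len=16384,
)
# Server-side equivalent (same caveat about the kwarg name):
#   vllm serve google/gemma-4-31b-it --mm-processor-kwargs '{"max_soft_tokens": 1120}'
```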
- Video Pipeline Friction: Gemma eats raw video, Qwen demands 2 FPS
If you are building an automated pipeline, be aware of this input quirk: Gemma 4's encoder is incredibly forgiving and will accept pretty much any video format or framerate you throw directly at it. Qwen 3.6, on the other hand, is extremely strict. You must pre-process your video down to 2 FPS before passing it to vLLM, otherwise it will just throw errors or fail to process.
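For reference, this is the kind of preprocessing I run before handing clips to Qwen; a small sketch shelling out to ffmpeg, with placeholder file names:

```python
# Small sketch: re-encode a clip down to 2 FPS with ffmpeg before sending it
# to Qwen 3.6 through vLLM. Requires ffmpeg on PATH; file names are placeholders.
import subprocess

def downsample_fps(src: str, dst: str, fps: int = 2) -> None:
    subprocess.run(
        [
            "ffmpeg", "-y",        # overwrite the output file if it exists
            "-i", src,             # input video
            "-vf", f"fps={fps}",   # drop the frame rate to 2 FPS
            "-an",                 # strip audio, the model ignores it anyway
            dst,
        ],
        check=True,
    )

downsample_fps("deadlift_raw.mp4", "deadlift_2fps.mp4")
```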
Resources:
If you want to see the actual latency differences, how I tuned the visual token budgets, and the live inference side-by-side, I put together a repo (with uv sync etc.) here: https://github.com/lukaLLM/Gemma4_vs_Qwen3.5_3.6_Vision_Setup_Dockers There is also a video of the tests if needed.
Also, let me know how you've been using them so far.

Roman217@reddit
I've recently started using local LLMs for the first time and I have an RTX 3090. I can run both Gemma 4 31B and Qwen 3.6 35B, and for all the praise Qwen has been getting I "want" to like it, but it has been complete and utter trash compared to Gemma in every single task I've tried (which doesn't include coding, because I don't use local LLMs to code; I would use GPT for that if desired). Qwen has been way worse at natural Japanese-to-English translation tasks. It has been way worse at playing chess; I couldn't even get it to finish a game, while Gemma 4 only needed one correction of an invalid move. It lost, but at least it could play a full game. I use both with reasoning enabled, and Gemma 4's reasoning is way quicker and doesn't get stuck in thinking loops as often. Gemma 4's vision is way better. There hasn't been a single thing where Qwen was even remotely comparable, except for speed in tokens per second, but I would much rather take the extra time for way better output quality.
Also for reference here is my Gemma's response to the car wash problem:
The car wash is 50 meters away. Should I walk or drive?
Unless you plan on pushing your car 50 meters, you should drive.
It's hard to get a car washed if the car isn't there!
robertpro01@reddit
I am working on a project where I need visual capabilities and for my specific use case gemma4 sucks.
Basically I'm migrating a project with a lot of charts and widgets, and qwen3.6 was able to see more details than gemma4.
But to be fair, I ended up using gpt5.4 because I needed even more details and right now I'm using gpt5.5
Background-Bit-6279@reddit
https://www.reddit.com/r/LocalLLaMA/comments/1srrhi5/gemma_4_vision/
Are you hosting yourself and configuring the vision budget?
robertpro01@reddit
Oh, I'll give it a try, thanks!
TheCatDaddy69@reddit
I've found the big boys that aren't Gemini to be borderline blind. Not kidding, I would rather do calculus with Gemini Flash than with Opus or 5.5, just because Google has some black magic for image recognition. Not sure how Gemma's is though.
FantasticNature7590@reddit (OP)
Yeah they will always be a bit better
ExplorerPrudent4256@reddit
The max_soft_tokens tip alone is worth the read. A lot of people assume the model is just 'worse at details' when it's actually a config issue. Reminds me of when people blamed CLIP for poor image understanding when it was really the token budget in the inference engine.
FantasticNature7590@reddit (OP)
Yeah, I still don't get why providers don't just put a nicely edited table of parameters at the top so more people realize this is a thing, but I guess the industry focuses more on shipping as many models as possible lol.
FusionX@reddit
Completely unrelated but this is a perfect example of how people should use LLM for structural/semantic assistance with their writing.
FantasticNature7590@reddit (OP)
Thanks. I typically just write my points, which aren't really readable, and then reread, correct, and iterate a few times over. To be honest, I could actually use more bold etc. here to make it nicely edited too.
dead_dads@reddit
Yo! New to local LLMs/AI stuff in general. I have an old 3090 and 128gb of DDR4 RAM. I was going to sell my old machine for parts, but it occurred to me this week that I could turn it into an AI machine to dip my toes into locally run stuff.
My interest rn is to work on some vibe coding projects. Would like to assess and test models that fit fully into the VRAM of the 3090 but also curious about utilizing my ram (DDR4) to see what larger models can bring into the equation.
What models would be worth my time for testing? I've been working with Claude to ID some stuff of interest, but as this field moves so fast I thought asking people who are actively engaged in this stuff would be better.
FantasticNature7590@reddit (OP)
I would use one of the Qwen 3.6 GGUFs with llama.cpp. You can log in to Hugging Face and register your hardware, and it will help a bit to estimate which quant you need: https://huggingface.co/unsloth/Qwen3.5-27B-GGUF
dead_dads@reddit
Awesome thanks for this!
IrisColt@reddit
Which software are you using to get this functionality? Does llama.cpp or Open WebUI support this?
FantasticNature7590@reddit (OP)
I used vLLM as it supports video. I used llama.cpp for images, but I don't remember video being supported, though I last checked a month ago. In general, if you want to work with video, preprocessing it yourself to a smaller format like 480p and turning it into pictures makes the task much easier, but the support wasn't there so I built it myself.
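Something like this is what I mean by preprocessing it yourself (a sketch only, ffmpeg-based; paths and values are placeholders):

```python
# Sketch: shrink a video to 480p and dump individual frames so an image-only
# engine (e.g. llama.cpp) can still "watch" it. Requires ffmpeg on PATH.
import subprocess
from pathlib import Path

def video_to_frames(src: str, out_dir: str, fps: int = 2, height: int = 480) -> list:
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    subprocess.run(
        [
            "ffmpeg", "-y", "-i", src,
            # sample at 2 FPS and scale to 480p height, keeping aspect ratio
            "-vf", f"fps={fps},scale=-2:{height}",
            str(out / "frame_%04d.jpg"),
        ],
        check=True,
    )
    return sorted(out.glob("frame_*.jpg"))

frames = video_to_frames("workout.mp4", "frames")
print(f"{len(frames)} frames ready to send as images")
```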
IrisColt@reddit
Thanks for the info!
Main_Secretary_8827@reddit
I've had nothing but issues with Gemma, tools don't work
FantasticNature7590@reddit (OP)
Which engine do you use? You can try the docker compose in my repo to host the server, it has everything set up that I used.
Main_Secretary_8827@reddit
Lm studio
IrisColt@reddit
By the way, incredibly insightful read, thanks!!!
FantasticNature7590@reddit (OP)
Thank you! The weather was so nice outside that I was fighting myself while writing it.
Sudden_Vegetable6844@reddit
Your visual tasks don't match at all the ones I've been testing these models on, which is photos of documents (typically forms, with or without handwritten fields). On those use cases Qwen3.6 had a very high success rate, while Gemma 4 failed most of them: it would get a few elements right, then hallucinate the rest...
Care to add such tests to your benchmark? They're a more realistic use case than recognizing landmarks (which is a use case where GPS + compass will have a much higher success rate than any LLM ever will).
FantasticNature7590@reddit (OP)
I use both for ugly handwritten notes and graph understanding on PDFs or electronic documents, and both worked well. When I mixed languages, both struggled, but Gemma 4 was a bit better.
tomakorea@reddit
Gemma is in general a much better LLM than Qwen for anyone who doesn't use English or Chinese as their primary language.
bonobomaster@reddit
LLMs are tools that are predominantly trained on one or two languages. Other languages are incorporated at a much smaller scale.
I would never use any local LLM in my native language.
That would be the wrong use of the tool.
FastDecode1@reddit
Worked well for me a couple days ago when I asked Qwen 3.6 35B for help in filling out an application in my native language.
I had a look at the reasoning output and it was in English, not my native language. Which is exactly what you want; putting its training to good use by thinking in one of the languages it's the best at. The language of the final answer is a secondary concern, really.
bonobomaster@reddit
That's nice that it worked for you, but that is only one, and furthermore a very subjective, experience.
Fact is, the quality of the output of any LLM is higher when the language of the query matches the majority of the training data.
That should be a no-brainer, but I have proof as well:
https://lilt.com/blog/multilingual-llm-performance-gap-analysis
https://arxiv.org/html/2404.11553v2
FantasticNature7590@reddit (OP)
Makes sense tbh
robertpro01@reddit
Qwen 3.6 is good for Spanish, 3.5 wasn't good at all.
FantasticNature7590@reddit (OP)
They said they support over 140 languages, with a list somewhere if I remember correctly. Gemma mentions fewer and never specifies them.
drillmast3r@reddit
I use qwen3.6:35b and it is almost perfect in Hungarian. As far as I can see, better than Gemma 4. General use.
FastDecode1@reddit
Haven't really tried Gemma 4, but I can confirm that Qwen 3.6 35B is also very good at Finnish. Not perfect, but getting closer. Which I think is impressive, seeing as there's only about 5 million native speakers.
And this is at Q4_K_M, so not ideal. I'll probably try Q5 or Q6 at some point to see if that makes a difference.
Cupakov@reddit
Damn, Qwen is Finno-Ugricmaxxing
drillmast3r@reddit
I use it at q4_k_m too. My hardware is limited, so I don't want to go up.
FastDecode1@reddit
By any chance, did you happen to look at the reasoning output while using it in Hungarian?
For me, it was reasoning in English, even though the final answer was in Finnish. Which I think is interesting, if it's by design.
Could also be a template or a default system prompt ("You are a helpful assistant") in llama.cpp that's guiding it to do that.
FantasticNature7590@reddit (OP)
Interesting find, thanks for sharing. I will check it in my local language.
Finanzamt_Endgegner@reddit
It's always good advice to just use it in English if possible tbh, but if you absolutely need other languages, then yes, Gemma is better.
No-Refrigerator-1672@reddit
Somebody noticed that Qwen 3.6 thinks way shorter if it has tools. My 3.6 35B has all the default tools in OpenWebUI enabled, and on most tasks it thinks for less than 5 seconds. In OpenCode, it sometimes even outputs 1-liners in thinking blocks. All of this is with preserve thinking enabled, of course. I suggest you try this too, it may solve the overthinking problem in this weird way.
Hood-Boy@reddit
Do you mind showcasing your opencode configuration?
No-Refrigerator-1672@reddit
Sure!
On the server side, I'm using llama-swap for model orchestration with vllm 0.20.0 as the backend. The llama-swap config sets up chat template args and generation parameters:
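(The actual config block didn't survive the copy-paste; the shape is roughly the following. This is an illustrative YAML sketch only, with placeholder entry names, and the exact llama-swap keys may differ between versions.)

```yaml
# Illustrative sketch, not the real config: three Qwen 3.6 variants served
# off one vLLM backend via llama-swap. Entry names are placeholders; the
# per-entry chat-template and generation flags would be appended to cmd.
models:
  "qwen3.6-regular":
    cmd: /models/bin/launch_vllm.sh --port ${PORT} --served-model-name qwen3.6-regular
    proxy: http://127.0.0.1:${PORT}
  "qwen3.6-coder":
    cmd: /models/bin/launch_vllm.sh --port ${PORT} --served-model-name qwen3.6-coder
    proxy: http://127.0.0.1:${PORT}
  "qwen3.6-instruct":
    cmd: /models/bin/launch_vllm.sh --port ${PORT} --served-model-name qwen3.6-instruct
    proxy: http://127.0.0.1:${PORT}
```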
This way your server will serve 3 different models (regular, coding, and instruct) off 1 backend, and it will automatically set up inference parameters in accordance with the Qwen 3.6 model card (from HuggingFace), without needing to set them up manually in each piece of software that you use.
/models/bin/launch_vllm.sh is just a script that sets up the Conda environment and then passes all of its args directly to vllm.
Hood-Boy@reddit
Many thanks, I use llama-swap too and just learned to multipurpose the model with : suffix!
No-Refrigerator-1672@reddit
Protip: setParamsByID allows you to use any name that you want, you can add prefixes instead of postfixes, or even rename it entirely.
GCoderDCoder@reddit
That's interesting. You can see in the reasoning traces that they taught Qwen to follow a strict logic process, so maybe the logic branch is: evaluate tools, then use them, and if the tools aren't relevant it just answers normally? I haven't had the long thinking people describe, but I'm always using tools. I'll test when I get home.
No-Refrigerator-1672@reddit
I've expressed a simpler hypothesis in this comment:
GCoderDCoder@reddit
If it is increasing performance on tasks then it's not benchmaxing. Benchmaxing is increasing scores that don't translate into real performance gains; the models are literally cheating on the tests. That is benchmaxing. Doing things to improve performance is just targeting a certain performance level. Reasoning clearly makes a difference in performance, so while we want a model to think as little as possible, models have a range of improvement that can be achieved with more thinking.
FantasticNature7590@reddit (OP)
Damn, never heard of it, thanks for the info!
No-Refrigerator-1672@reddit
Just a follow-up: in OWUI, you must explicitly set the tool calling mode for the model to native, because the default mode is for legacy models that didn't support it, and otherwise the model will behave just as before.
I wonder if excessive thinking without tools is indeed a result of benchmaxing. I imagine most benchmark datasets run in simple Q&A mode and ignore execution time and thinking block length, so Qwen trained the model to tip into overthinking on short system prompts to maximize scores.
FantasticNature7590@reddit (OP)
Thanks, that's an interesting point. Yeah, for some prompts it goes into this kind of thinking process even without thinking enabled, and you really need to prompt it well to keep it from doing that and burning tokens.
RoughImpossible8258@reddit
Idk, these benchmarks aren't really accurate, I feel. I made this website to vote on the latest AI updates so that people actually working on AI can vote and know what's truth and what's hype:
https://know-your-ai.vercel.app/
IrisColt@reddit
I expected this, in my use cases Gemma 4 31B really has better visual knowledge and subsequent image interpretation than Qwen 3.6 27B.
No_Hunter_7786@reddit
So basically Qwen wins on paper but loses in production. Classic benchmarketing. This is why I stopped trusting leaderboards entirely and just test locally
StupidityCanFly@reddit
Nah. Depends on the “reality” and tasks. In my use case (ai-assisted CRO SaaS) gemma-4-31B loses by a mile to Qwen3.6-27B. There were some tests I did where gemma did better than Qwen (and both lost to a specialized 7B model).
FantasticNature7590@reddit (OP)
Yeah, it really depends on the task. The important thing is to set the correct sampling and other parameters, check whether the engine actually fully supports vision, and then test it yourself.
No_Hunter_7786@reddit
Yeah that makes sense, different tasks different winners
JGeek00@reddit
I tried the car wash prompt on Qwen3.5-9B and it ended up in a reasoning loop and didn't output a response, but at least it doesn't tell you to walk to the car wash instead of driving. Other models just tell you to walk instead of driving your car to the car wash.
FantasticNature7590@reddit (OP)
Honestly I don't like this test but it went viral so yeah xd
Technical-Earth-3254@reddit
Gemma might be better. But I can only fit like 10k context in q4 with gemma and like 60k with Qwen, so I'm sticking with Qwen.
FantasticNature7590@reddit (OP)
Yeah all depends on your hardware
UntimelyAlchemist@reddit
I just get awful vision results in Gemma compared to Qwen. I don't know why. Gemma always misinterprets what it sees, while Qwen gives me an incredibly detailed, thorough analysis, and even picks up details that I didn't see myself. I'm very impressed by Qwen.
I feel like I must be doing something wrong with Gemma, but I don't know what. I am a beginner. I'm using Llama.CPP. I am already setting image-min-tokens and image-max-tokens. I tried troubleshooting with AI and it suggested I turn up ubatch-size, which I did. Llama.CPP doesn't seem to have the "max soft tokens" setting that you mentioned, as far as I can tell.
FantasticNature7590@reddit (OP)
I remember there was a post here on this subreddit about how somebody fixed Gemma vision, and there was something about max soft tokens too.
WetSound@reddit
27B's mmproj is smaller suggesting less focus on vision
FantasticNature7590@reddit (OP)
Good point, I didn't think to check that, but for the comparison I wanted to at least test dense vs dense.
chimpera@reddit
My sense is that Gemma is much better at short one-shots, but because of its architecture it struggles with long context. There is something about its attention mechanism, and it's also far more sensitive to KV cache quantization.
RedParaglider@reddit
Honestly it's the same shit with Gemini. It's built around one shots, nobody looks at the benchmarks for real world bullshit repos.
FantasticNature7590@reddit (OP)
It's also super hard to make, as anything you put in there they could benchmax the next time around, but you need to show it for people to trust your benchmarks xd
ismaelgokufox@reddit
Maybe the sliding window attention causes the loss the longer the session?
FantasticNature7590@reddit (OP)
Yeah, I think that could be the case, as it's a novel solution.
FantasticNature7590@reddit (OP)
Could be. I read about the architecture and they use some new techniques that could influence the context.
shansoft@reddit
In some mobile coding tasks, I had much better success with Gemma4 than Qwen3.6 27B. Same problem, and Gemma4 output much cleaner code and one-shot pretty much every task it was assigned. Qwen3.6 27B's implementation added more unneeded parts and had bugs that needed a few more iterations to refine.
pedronasser_@reddit
Maybe the backend/harness has an influence, but I have the exact opposite of your findings. Qwen3.6 follows instructions better than Gemma4 for me.
FantasticNature7590@reddit (OP)
I think for coding it's probably better; I was only checking the vision capabilities. Do you use llama.cpp?
Juulk9087@reddit
Btw, Gemma is horrible for anything other than like 10k context lol: https://youtu.be/ONQcX9s6_co?si=-WKU_qChLJGeFi5W
And I mean horrible
AvidCyclist250@reddit
only tests vision
lol
LetsGoBrandon4256@reddit
Quite impressive that you used all three variants of dashes in one post.
FantasticNature7590@reddit (OP)
Hahaha, I used AI to correct typos as I am not the best at typing, so it could have enriched the output xd
Limp_Classroom_2645@reddit
Sure.
WHO_IS_3R@reddit
How bro felt
starshade16@reddit
Yeah idk man. I switched from Gemma 4 to Qwen 3.6 after a month of testing and use with Home Assistant. Qwen is better and faster than Gemma and it's not even close. So... idk.
FantasticNature7590@reddit (OP)
Are you talking about vision tasks only? I never tested text there.
starshade16@reddit
Actually, my testing of its toolsets is very thorough: vision for my Reolink cameras, voice for my Home Assistant, coding for OpenCode, and general agent use.
Shingikai@reddit
Qwen leans on Asian content, Gemma leans on European/Western content, and that's enough to flip the "which model wins" question on its head. The benchmark is averaging over a distribution that matches neither what you're testing nor what most users care about. Training data isn't gamed, it's just weighted differently than the eval set. Result looks like benchmaxing from one direction.
Once you see that, "which model is better" stops being a useful question. Better at what, on what content? Your side-by-side is doing the diagnostic version: not picking a winner, using the disagreement to reveal where each model's training is thin. When Qwen and Gemma disagree on a meme or a GeoGuessr spot, that disagreement is information about the content, not a tiebreaker.
Probably saves a lot of trial-and-error to route by content type instead of trying to crown one model. The Western/Eastern split alone is enough.
Limp_Classroom_2645@reddit
Gemma doesn't win anything, enough with effort posting, google
NoFaithlessness951@reddit
A lot of words for saying Gemma is better at vision
BigPoppaK78@reddit
Isn't that the point? If he just said Gemma is better, then people would be asking for proof or under what circumstances. Personally, I prefer the thoroughness if I'm going to consider whether or not to trust an opinion/outcome.
FantasticNature7590@reddit (OP)
I think most people set it up wrongly and just get the wrong impressions.
baksalyar@reddit
I have a question that's been torturing me: why the heck is a small consumer-grade model like Qwen 3.6 27B so expensive? At a minimum of $2/M for output from any provider on OpenRouter, it's almost twice as much as MiniMax 2.7, which has much greater intelligence and is much more expensive to run.