Qwen3.6:27b single-shot fixed a CSS UI bug that had Gemma4:26B doom looping uselessly for 15 minutes
Posted by Konamicoder@reddit | LocalLLaMA | 31 comments
Warning: long post ahead. On the bright side, it's 100 percent human-written, typos and all. No AI slop was used to generate any of the following post. Bask in the warm glow of our increasingly rare shared humanity, gentle reader.
Just wanted to report my local model coding experience tonight. One of my board game hobby websites (static sites hosted on Github pages) had an annoying UI bug you can see in the "before" pic above: when the Tools nav button is clicked, the dropdown menu appears half offscreen on the left side of the viewport. So I fired up my local LLM coding rig to fix it.
Hardware: MacBook Pro M4 Max with 64GB of RAM. Model backend: oMLX. Model: Gemma4-26B-A4B-it-oQ6. Agentic harness: Pi.
This Gemma4-26B MoE model runs pretty fast on my machine: 800 tokens/second prompt processing, 63 tokens/second token generation. Qwen3.6-35B is my usual daily driver, and I have only used Gemma4 for chat purposes to date. But tonight I decided to test it for coding.
I described the UI bug to Gemma4 verbally, and since it has vision capabilities, I took a screenshot of the issue and uploaded it to the model for good measure. Things started out promising. Gemma4 analyzed the issue, figured it had found the root cause, and started reading the site CSS file to insert a fix in the right spot. That's when things started to go off the rails. Gemma4 fell into a recursive doom loop of read, edit, fail, then read again. Several times I stopped the model, told it that it was doom looping, and asked how I could help. Gemma4 apologized, acknowledged that it was looping, even appeared to identify why it was looping, said that it would try a different approach, then just fell into another doom loop. After about 15 minutes of wasted effort trying to redirect Gemma4, I said "screw this" and loaded up The Big Gun:
Qwen3.6-27B-UD-MLX-8bit.
That's right -- we're going full-on 27 billion dense parameters on your ass, CSS bug. None of this puny MoE nonsense. Time to roll up our (virtual) sleeves and get down to business.
Now I don't often use the dense model for coding, because it is significantly slower on my Mac. Prompt processing is 190 tokens/second, token generation a comparatively glacial 13.2 tokens/second. But what Qwen3.6-27B lacks in speed, it makes up for in reasoning ability and coding quality.
I started a /new Pi session with qwen3.6-27B loaded up. Described the UI bug verbally. Didn't even bother to upload a screenshot. That was enough for Qwen3.6-27B to understand the issue. Then it started THINKING. It chewed up about a quarter of my context window just figuring out the bug from all angles, paragraph upon paragraph of back and forth with itself. "I can see the issue...but wait...the problem is...actually...wait, that should be fine...oh wait, i see the issue...let me re-examine...unless...the cleanest fix is..."
And after all that thinking, Qwen3.6-27B fixed the bug in a single shot, as you can see in the "after" pic above.
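The post doesn't show the actual stylesheet or the change the model made, so the following is only a hypothetical sketch of the kind of CSS fix this class of bug usually calls for; the selector names and rules are invented for illustration:

```css
/* Hypothetical illustration only -- the OP's real selectors and fix
   are not shown in the post. */

/* A common cause: the dropdown is centered under its trigger with a
   negative translate, which pushes it past the left edge of the
   viewport when the trigger sits near that edge. */
.nav-item {
  position: relative; /* position the dropdown relative to the button */
}

.nav-item .dropdown {
  position: absolute;
  top: 100%;
  left: 50%;
  transform: translateX(-50%); /* centering like this can go offscreen */
}

/* One common fix: pin the menu to the trigger's own edge instead. */
.nav-item .dropdown {
  left: 0;
  transform: none;
}
```

Browser dev tools make this quick to verify: toggling the `left`/`transform` rules on the live element shows immediately whether the menu stays inside the viewport.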
To me, this is a clear real-world illustration and confirmation of certain assumptions I have made in my few months of exploring local models.
- MoE models are faster but more prone to mistakes and loops.
- Dense models are slower but far more accurate and precise.
- Gemma4 is not as useful for coding as Qwen3.6.
Qwen3.6-35B (MoE) will still remain my daily driver because it has a nice balance of blazing speed and acceptable accuracy. But when the shit hits the fan, it's nice to be able to bust out a dense model to get myself out of a jam.
TL;DR: Gemma4 MoE is fast but doom-loopy, while Qwen3.6 is slow but spot-on accurate.
OWilson90@reddit
This anecdotal stuff really needs to stop.
LetsGoBrandon4256@reddit
The LLM version of "benchmarking" samplers/upscalers with a x/y grid in /r/StableDiffusion
Fedor_Doc@reddit
Why? It was an entertaining read, and it clearly shows expectations and usage patterns. It is interesting in itself.
some_user_2021@reddit
LLM inference is a statistical process. A one-shot test is irrelevant.
Healthy-Nebula-3603@reddit
That's nothing new. We all know that Qwen 3.6 27B is better at coding. Much better.
ttkciar@reddit
Wait a sec... isn't the conclusion you should be drawing that the Gemma4 MoE is not as useful for coding as the Qwen3.6 dense?
It's not reasonable to write off Gemma4 until you have tested the Qwen3.6-27B dense model against the Gemma-4-31B-it dense model.
Konamicoder@reddit (OP)
That’s the point I make in my TL;DR.
Virtamancer@reddit
If you’re going to compare them, don’t use quantized versions.
falconandeagle@reddit
"I’ll give Gemma4 dense a shot at some point down the road, but I suspect it’s still not going to stack up well vs. qwen3.6 dense."
And what are you basing this on?
seamonn@reddit
vibes
ortegaalfredo@reddit
You compared a ~5B model to a 27B model.
sophlogimo@reddit
On top of that: A MoE vs a dense, and the MoE is Q6, while the dense is Q8.
silverud@reddit
In my experience, the Gemma 4 family is an amusing waste of space on my SSD.
I keep them around to demonstrate that similarity in size does not equate to similarity in capability.
I have yet to find an actual use for them outside of "don't be like Gemma."
OpenEvidence9680@reddit
Funny, I had the exact opposite experience. I created a benchmark of my own, testing models on my specific needs: 38 prompts on general coding, my exact stack, safety, instruction following, prose writing, and technical writing. Gemma won in each category, and the bench results were confirmed in actual use. Qwen 3.6 is a close second, and only when using the opus 4.7 fine tune.
nihnuhname@reddit
Gemma is really good for translating and roleplay chatting.
SnooMaps5367@reddit
Not to disrespect you but this is confirmation of absolutely nothing. You of course have free will to formulate opinions based on a single example, but that does not make them factual.
Fedor_Doc@reddit
I was using dense Qwen, but it was too slow for my bash scripting needs. So I switched to MoE Qwen 3.6, and with some guidance it works pretty well. I use tests, ask it to perform changes one at a time, and always plan before implementing. A lot of misunderstandings are cleared up at the plan stage – e.g. it was trying to use the wrong rsync flags, despite my previous instructions.
It looped once; I stopped it and asked it to explain exactly what errors it was encountering and how it was trying to solve them – I think an "explain step by step what problems you encounter" prompt is a very good tool. I nudged it in the right direction, and it corrected the code.
At the end of the day, MoEs are very usable if the task is clearly separated into distinct stages and explained. I compact the cache after each new feature, and move to the next with the same structure: plan -> discuss and refine plan -> implement and test -> git commit. Occasionally, I have a refactoring session. Asking it to check if the code aligns with the DRY principle already works wonders.
Clear work structure helps LLMs a lot :)
rpkarma@reddit
Gemma is not as good as Qwen 27B for most web dev coding tasks. Even 31B vs 27B at BF16.
llitz@reddit
Correct. Gemma uses SWA (sliding-window attention) and doesn't properly recall some parts of the context, leading to confusion sometimes.
Qwen isn't perfect and is prone to behaving erratically above 128k context, but it is good enough.
ambient_temp_xeno@reddit
Pretty obvious shill post.
mjsxi__@reddit
Got to say, I'm very impressed with Qwen 27. I gave it and Gemma 31B a problem about deducing who was incorrect in a situation with one person being a polite instigator and the other being a rude responder. Gemma went into a spiral over the rude person's tone and I had to hold its hand to lead it to who was wrong (took about 6 or 7 back-and-forth messages), while Qwen 27 instantly got it, MUCH to my surprise.
That said, using a MoE model and comparing it to a dense one makes no sense.
pmttyji@reddit
Agree with the other comment. Try Gemma-4-31B too, which is a fair comparison.
Intelligent_Ice_113@reddit
What bug? Did you write "fix a bug in select drop-down" in your prompt? 🤭
Konamicoder@reddit (OP)
Yes. I described the bug to the model. I also uploaded a screenshot. The model understood the bug and understood that I wanted it fixed.
AndreVallestero@reddit
Why are you not comparing dense to dense, or MoE to MoE?
Konamicoder@reddit (OP)
Because that’s what happened tonight. This was not a scientific experiment. Just me recounting a particular experience.
onyxlabyrinth1979@reddit
This matches my experience with agent loops pretty closely. Fast models can look impressive right up until they hit an edge case and start confidently re-reading the same files forever. For actual product work, I care less about tokens/sec and more about whether the model converges reliably without babysitting.
LetsGoBrandon4256@reddit
https://old.reddit.com/user/onyxlabyrinth1979
Holy fuck, we really are having an AI bot infestation right now, aren't we?
ttkciar@reddit
Yup, have been for a while. It really ramped up when OpenClaw became popular. BotBouncer helps, and the new karma requirements help, but a few still get through. Thanks for reporting this one. It has been banned from the subreddit.
FatheredPuma81@reddit
Out of curiosity, since you said that it started to think: how long was Gemma4's thinking when it was trying to identify the issue?
Konamicoder@reddit (OP)
Gemma4 didn't really spend a lot of time thinking. Or rather, the thinking was part of the looping. Think-read-edit-fail-think again.