I tested Qwen3.6-27B, Qwen3.6-35B-A3B, Qwen3.5-27B and Gemma 4 on the same real architecture-writing task on an RTX 5090
Posted by Gazorpazorp1@reddit | LocalLLaMA | 46 comments
I ran a pretty simple but revealing local-LLM test.
At first I was only going to post about the two Qwens and Gemma4 and go to bed, and what do you know, I go on reddit and see a post that Qwen 3.6-27B dropped. Oh well...
Models tested:
- Gemma4: cyankiwi/gemma-4-31B-it-AWQ-4bit
- Qwen3.6-35B: RedHatAI/Qwen3.6-35B-A3B-NVFP4
- Qwen3.5-27B: QuantTrio/Qwen3.5-27B-AWQ
- Qwen3.6-27B: cyankiwi/Qwen3.6-27B-AWQ-INT4
Context: I’m working on a fairly complex tool that takes noisy evidence and turns it into a structured “truth report.”
I gave the same Hermes writing agent (“Scribe”) the same task:
take 2 architecture blueprint docs (v1 baseline + v2 expansion) describing the "truth engine" and produce a unified `Masterplan.md` explaining:
- what the product is
- the user problem
- UX/product shape
- UVP/moat
- pipeline
- agent roles
- architecture
- trust/legal/provenance posture
- what changed between plan V1 and V2
- V1: ~16k tokens
- V2: ~4.6k tokens
- Combined: ~20.6k tokens
Then I ran the full workflow locally on my RTX 5090 with all 4 models:
- **Gemma4**
- **Qwen3.6-35B**
- **Qwen3.5-27B**
- **Qwen3.6-27B**
To make it fair and push the models, each model got:
- initial draft
- second-pass revision
- final polish
Each stage was directed and reviewed by my GPT-5.4 agent Manny, so this wasn’t just “ask once and compare vibes.”
## What I/Manny scored
- **Clarity**
- **Completeness**
- **Discipline**
- **Usefulness**
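If you're curious what that loop looks like in practice, here's a minimal sketch against a vLLM OpenAI-compatible endpoint. The prompts, host, and model handling are placeholders, not my actual Scribe/Manny setup:

```python
# Minimal sketch of the draft -> revise -> polish loop against a vLLM
# OpenAI-compatible endpoint. Prompts and model ids are placeholders.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="none")

def run_pass(model: str, instruction: str, material: str) -> str:
    resp = client.chat.completions.create(
        model=model,
        messages=[
            {"role": "system", "content": "You are Scribe, a technical writing agent."},
            {"role": "user", "content": f"{instruction}\n\n{material}"},
        ],
    )
    return resp.choices[0].message.content

def three_pass(model: str, blueprints: str) -> str:
    draft = run_pass(model, "Write a unified Masterplan.md from these blueprints.", blueprints)
    revised = run_pass(model, "Revise the draft: fix gaps, keep structure tight.", draft)
    return run_pass(model, "Final polish: tighten prose, keep all substance.", revised)
```

The scoring of each stage happened outside this loop, with the GPT-5.4 judge reviewing the output against the four criteria above.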
## Final results
### Clarity
- Gemma4: **9.4**
- Qwen3.6-27B: **8.8**
- Qwen3.6-35B: **8.1**
- Qwen3.5-27B: **7.4**
**Winner: Gemma4** (at a cost; more on that below)
Gemma was the best editor. Cleanest structure, best pacing, strongest restraint.
---
### Completeness
- Qwen3.6-35B: **9.6**
- Qwen3.5-27B: **9.1**
- Qwen3.6-27B: **8.7**
- Gemma4: **7.9**
**Winner: Qwen3.6-35B**
The 35B Qwen wrote the most exhaustive architecture doc by far. Best sourcebook, most implementation mass.
---
### Discipline
- Gemma4: **9.5**
- Qwen3.6-27B: **8.6**
- Qwen3.6-35B: **7.7**
- Qwen3.5-27B: **6.8**
**Winner: Gemma4**
Gemma best preserved the actual product identity.
---
### Usefulness
- Qwen3.6-27B: **9.3**
- Qwen3.6-35B: **9.2**
- Gemma4: **8.9**
- Qwen3.5-27B: **8.8**
**Winner: Qwen3.6-27B**
This was the surprise. The 27B Qwen 3.6 ended up as the best **overall practical workhorse** — better balance of depth, readability, and usability than the others.
## Final ranking
1. **Qwen3.6-27B** — best all-around balance
2. **Gemma4** — best editor / strategist
3. **Qwen3.6-35B** — best exhaustive drafter
4. **Qwen3.5-27B** — solid, but clearly behind the others for this task
1) Best overall balance: Qwen3.6-27B
This is the new interesting winner.
It doesn’t beat Gemma4 on clarity or discipline.
It doesn’t beat Qwen3.6-35B on completeness.
But it wins the thing that matters most for a real working master plan: balance. It’s the best compromise between:
- readability
- completeness
- structure
- practical usefulness
2) Best editor / best strategist: Gemma4
If the goal is:
- cleanest finished document
- strongest executive readability
- best restraint
- best “this feels like a real deliberate plan”
Then Gemma still wins.
3) Best exhaustive architecture quarry: Qwen3.6-35B
If the goal is:
- maximum implementation mass
- biggest architecture sourcebook
- richest mining material for downstream docs
Then Qwen3.6-35B is still the beast.
4) Fourth place: Qwen3.5-27B
Not bad. Not embarrassing.
But now clearly behind both Qwen3.6 variants and Gemma for this kind of long-form architecture/planning task.
## Actual takeaway
This ended up being a really clean split:
- **Gemma4 = best editor**
- **Qwen3.6-35B = best expander**
- **Qwen3.6-27B = best practical default**
- **Qwen3.5-27B = respectable, but not the winner**
So if I were setting a default local writing worker for long-form architecture/master-plan work today, I’d probably choose:
**Qwen3.6-27B**
It’s the best compromise between:
- readability
- completeness
- structure
- practical usefulness
Personal note re Gemma 4: its final output was drastically shorter than the Qwens’:
- Gemma4 → 147 lines
- Qwen3.6-35B → 725 lines
- Qwen3.5-27B → 840 lines
- Qwen3.6-27B → 555 lines
So while I do agree that less is often more, I found the Gemma4 output lacking in both technical depth and detail. Sure, it captured the core concepts, but I would position the output as more of a pitch deck or high-level concept; the technical details and concepts are sorely missing.
On the other end of the spectrum is Qwen3.6-35B, which delivered 5x the volume. That document could really serve as a technical blueprint and architecture implementation bible. Qwen3.5-27B produced even more, but that was quantity over quality.
I would honestly have rated Gemma4 less favourably than Manny did, so make of that what you will.
For first-draft-only performance, here’s my one-shot ranking:
- Qwen3.6-27B
- Qwen3.6-35B
- Qwen3.5-27B
- Gemma4
Why
1) Qwen3.6-27B
Best balance right out of the gate:
- strong product framing
- solid structure
- good density
- less bloated than the other Qwens
- more complete than Gemma’s first draft
This was the best raw first shot.
2) Qwen3.6-35B
Very strong one-shot draft, but more sprawling:
- most exhaustive
- richest implementation mass
- more likely to over-include
- better sourcebook than polished masterplan on first pass
If you want maximum raw material, this one was a beast.
3) Qwen3.5-27B
Good first-draft generator, but sloppier:
- ambitious
- broad
- lots of content
- weaker discipline and coherence than the 3.6 models
Still useful, but clearly behind both 3.6 variants.
4) Gemma4
Gemma (arguably) won the final polished-document contest, but not the first-draft contest. Its one-shot behaviour was:
- too compressed
- too selective
- not thorough enough for the initial task
It needed the later revision passes to get more substance. Depending on the audience, this may be either good or bad.
Short version
- Best one-shot: Qwen3.6-27B
- Best after revision/polish: Gemma4
IrisColt@reddit
You talk like an LLM ("best sourcebook, most implementation mass"), but I liked your comparison a lot. It aligns with some of my findings.
Gazorpazorp1@reddit (OP)
It was 3AM local time after I added the newly released Qwen3.6-27B to the test, so I had my main agent draft the results; that's why :) Where it wasn't strictly about the results, I added a few of my own notes, which is why I say Manny rated Gemma4 more favourably than I would have.
I just wanted to share some imo interesting findings for those who care. Never claimed this to be a benchmark or professional testing. I don't get paid for content so I included the more or less raw results so people can make of it what they will. It aligned quite well with what I had heard about those models so it seemed valid enough for sharing. But gotta say this was an excellent reminder why I usually refrain from posting on reddit.
IrisColt@reddit
I love your results, they are intuitive and insightful.
MK_L@reddit
Can you share the exact local stack? Backend/server, UI if any, and model runner. For example: LM Studio, Ollama, vLLM, llama.cpp, Open WebUI, or a custom API layer. “Scribe” sounds like the agent/orchestrator, but I can’t tell what actually served the models.
Gazorpazorp1@reddit (OP)
I'm using vLLM; as mentioned, Scribe is the writing agent in the roster. He curates the Obsidian vault, organizes daily logs, keeps the notes tidy and makes suggestions on how to improve the structure.
Hermes Agent runs on my notebook, the vLLM server runs on my desktop. I can hotswap the models so it's fairly easy to compare runs. I started using local inference a month ago so I'm still figuring things out.
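The notebook side is basically just the stock openai client pointed at the desktop box; a rough sketch (host/port are placeholders):

```python
# Notebook-side client pointed at the desktop vLLM server
# (host/port are placeholders).
from openai import OpenAI

client = OpenAI(base_url="http://192.168.1.50:8000/v1", api_key="none")

# vLLM reports whichever model is currently loaded, handy for
# labeling hot-swapped comparison runs.
print("Serving:", client.models.list().data[0].id)
```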
MK_L@reddit
Oh that's interesting 🤔. You should make a write-up about that. If you do, please post the link. Thanks
CurveNew5257@reddit
Maybe it’s the difference between coding and more regular work, but I tested all these exact same models on something that requires logic but is a very simple task: write an email about an issue with a vendor shipment. My constraints: do not use 3 specific words, and keep it under 100 total words.
Every single version of Qwen, in all the different quantizations, could not do it. Out of 7 tries total, 4 got stuck in thinking loops for over 5 minutes before I stopped them; the other 3 did produce decent results, but the fastest thinking time was almost 2 min.
In comparison, Gemma 4 26b and Gemma 4 e4b both produced results that followed the exact logic, in literally less than 2 sec.
Maybe Qwen is just built strictly for coding or something, but any real-world AI task I have given it, it just cannot complete, and it constantly thinks for 3-4 minutes on everything. Even my mobile models on Locally AI (Gemma 4 e2b and Qwen 3 4B): exact same results as the bigger models.
Am I doing something wrong or is my use case just not for qwen?
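For reference, a check like this is trivial to script, which is what makes the failures so frustrating; the banned words below are stand-ins for my actual three:

```python
# Mechanical check for the email constraints; the banned words here
# are stand-ins, not the actual three from the test.
BANNED = {"unfortunately", "regret", "inconvenience"}
MAX_WORDS = 100

def check_email(text: str) -> list[str]:
    problems = []
    word_count = len(text.split())
    if word_count > MAX_WORDS:
        problems.append(f"too long: {word_count} words (limit {MAX_WORDS})")
    lowered = text.lower()
    problems += [f"banned word used: {w}" for w in BANNED if w in lowered]
    return problems
```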
LocoMod@reddit
Your harness is irrelevant here, as your entire post doesn't read like someone who has actual experience building "complex tools". But what IS relevant, and what you did not provide, is the context you used, so we can evaluate if you're vibing or someone that actually knows how to measure deltas.
You could have reduced the amount of text in your post by 90% and just listed your "favorite" models by rank without providing any additional information, and it would have had the same value.
Gazorpazorp1@reddit (OP)
I have in fact zero experience building "complex tools" and I will most likely fail at this but I am learning a ton and it's infinitely interesting, especially with a new model dropping weekly. I started working with local LLMs a month ago and I dont have an engineering background...
I'm sorry you and the majority found my test of little value. I did it for myself and thought it was interesting, so I shared. Lesson learned I guess, don't bother next time.
Karyo_Ten@reddit
The issue is misrepresenting what you did.
You say "architecture" and "fairly complex task" and we have absolutely no idea what you actually meant. And people were right to call you out because you just admitted you actually couldn't evaluate the output on what matters. (i.e. is it tested? is it maintainable? is it extensible or tightly coupled? is it overengineered and solving a Google-scale?)
LoveGratitudeBliss@reddit
Ignore the haters; so many in this scene have forgotten how to communicate like a human. I get people want it all, but we're humans and that means imperfect, which, when it's all AI, will become our best trait. I appreciated your tests. I have found the Qwen 3.6 27b much slower than the Qwen 35b at the same quant: it produces fewer errors but takes 5x as long. On my Mac M1 64GB the 35b feels responsive at around 50 tk/s, but it couldn't finish a simple 3-level Mario game.
Signor_Garibaldi@reddit
You don't seem to grasp that it's obvious that a dense 27B would be slower than a 3B-active MoE.
drwebb@reddit
I'm pretty sure any serious dev could easily get work done with any of Qwen 3.5, Qwen 3.6, and Gemma-4 31. Like you say, one might be better than the other, but just creating some artificial made-up task and using an LLM as a judge is not the way to tell.
i_like_brutalism@reddit
actually soft disagree. it REALLY depends on the work. hallucinations can cost hours of debugging! especially in documentation or when summarizing large functions.
drwebb@reddit
That's why I said serious, as in an experienced dev who is gonna catch hallucinations quickly and knows how to debug. I've just been using these small models recently, and while not great they are actually decent.
i_like_brutalism@reddit
it's hard to catch hallucinations quickly without doing the work yourself imo. of course big blunders are noticeable, but slight mistakes are pretty bad and often stay in context, even when corrected (with a human in the loop giving corrections during the work), depending on model and harness
Zanion@reddit
I'd bet my GPT-judge can beat up your GPT-judge. I was really "in the zone" when I vibed up the system prompt so I'm pretty sure it's rock solid.
AlwaysLateToThaParty@reddit
Where can I subscribe to your newsletter?
contextbot@reddit
Yeah, but you also have to give your judge a name and be sure to tell people it /s
Zanion@reddit
I mean.. I think we all have a pretty good idea.
Comfortable_Newt8219@reddit
You should try evaluating with RAGAS for your evaluation metrics to be more globally comparable.
Gazorpazorp1@reddit (OP)
This was a quick and dirty test during a work session. Good enough for my own model comparison; there are better, actual benchmarks out there, of course.
onewheeldoin200@reddit
It's funny how much the specific task seems to matter.
I had Gemma 4 31B, Qwen3.6 35B A3B, and Qwen3.6 27B all review contract language for specific risks, and Qwen3.6 35B A3B was above-and-beyond the best outcome. Wasn't even close. The other two selected basically random clauses that were standard or low-risk, while Qwen3.6 35B A3B actually did a great job at picking things that could genuinely harm a business.
A pot for every lid, a model for every workflow :)
Gazorpazorp1@reddit (OP)
Interesting, seems to align with my own findings! 3.6-35B-A3B did produce the most detailed masterplan and retained the most technical detail, without unnecessary bloat.
sutheesh_s@reddit
What was your device configuration to run this locally? For me it took too much time to initialize and respond.
Thrumpwart@reddit
Yeah, Gemma 4 31B is very clear. I gave it a similar task a while ago and while it missed some things, its explanations and summaries were top notch. Seriously the best long-context RAG and summary model. Not as smart as Qwen but easily the best output for clarity I’ve used in a local model.
donthaveanym@reddit
Slop, didn’t read
xornullvoid@reddit
If you say that you have the time to perform a comparative study of various models' writing skills, but you do not have the time to write and present the results of the said study without resorting to AI-produced slop, then I do not believe that you did this comparison with any sort of diligence or that your comparison is reliable.
Glittering-Call8746@reddit
Try a different harness pls: opencode, pi agent, claude code, codex etc.
Gazorpazorp1@reddit (OP)
Why would I? This was a writing task, not coding. The writing agent serves its role; the test was just a rough but useful comparison, not a benchmark.
CoruNethronX@reddit
Just use a raw LLM, it's exactly your tool.
Gazorpazorp1@reddit (OP)
I am. My data extractor is essentially just the vLLM endpoint: raw data is fed into it and the output saved to a file. I did start with a processor agent at first but soon realized that a data processing node doesn't need all the agentic overhead; a raw LLM works perfectly (and you avoid potential tool-calling issues).
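The whole node is roughly this shape (endpoint, model id and prompt are placeholders):

```python
# The extractor node, roughly: raw text in, completion out, saved to disk.
# Endpoint, model id and system prompt are placeholders.
import requests

def extract(raw_text: str, out_path: str) -> None:
    resp = requests.post(
        "http://localhost:8000/v1/chat/completions",
        json={
            "model": "Qwen3.6-27B",
            "messages": [
                {"role": "system", "content": "Extract the facts as structured JSON."},
                {"role": "user", "content": raw_text},
            ],
        },
        timeout=600,
    )
    resp.raise_for_status()
    with open(out_path, "w") as f:
        f.write(resp.json()["choices"][0]["message"]["content"])
```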
omidmatin@reddit
Could you please share your TG (t/s) for these models on your setup?
Gazorpazorp1@reddit (OP)
vLLM, RTX5090, no offloading, 50-60 t/s with thinking on.
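If you want to reproduce the measurement, the server-reported usage counts are the easy way; a rough sketch (model id is a placeholder, and the elapsed time includes prompt processing, so treat the number as approximate):

```python
# Quick t/s measurement using server-reported token counts.
# Elapsed time includes prompt processing, so this is a rough figure.
import time
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="none")

start = time.perf_counter()
resp = client.chat.completions.create(
    model="Qwen3.6-27B",  # placeholder id
    messages=[{"role": "user", "content": "Summarize vLLM in ~200 words."}],
)
elapsed = time.perf_counter() - start
print(f"{resp.usage.completion_tokens / elapsed:.1f} tok/s")
```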
anzzax@reddit
there are cases when it's ok to use llm in posts :)
Local LLM Comparison — Architecture / Masterplan Task
Consolidated Summaries + Observations
- Qwen3.6-27B — Balanced Workhorse
- Gemma4 — Editorial/Strategic Optimizer
- Qwen3.6-35B — Expansion / Sourcebook Generator
- Qwen3.5-27B — Legacy High-Volume Generator
daishiknyte@reddit
Let’s see some of the actual outputs.
solidsnakeblue@reddit
I agree with your assessments of most of these models. Gemma 4 is a really great model and I love using it in my workflows, but Qwen 3.6 27b does perform slightly better (though I prefer Gemma's tone and reasoning).
Thanks for posting!
mr_Owner@reddit
Curious if the qwopus distills would be better here
Plastic-Stress-6468@reddit
Not directly related, but I can echo OP's sentiment that gemma4 is brief in nature. IMO Gemma 4 distilled with the opus data set is like a match made in heaven. It's short and brief, gets to the point, very pleasant to use as a conversational partner or editor model, where long winded AI wording can get tiresome.
Southern_Sun_2106@reddit
"I am building a Mystery Tool, and this is how the models did - according to me and to chatgpt" - ok, thanks for sharing, but this tells close to nothing about anything.
pkailas@reddit
I tested all 4 of those myself. The test was on C# 14 / .NET 10. Q3.5 27b and Q3.6 27b failed on 4 out of 6 tests. Q3.6 MoE was very fast and very wrong. Gemma4 scored a perfect 6/6.
Does this mean Gemma is smarter? No, just that it was trained more recently. I don't know what was improved with q3.6, but it doesn't work for me. On older codebases maybe.
unique-moi@reddit
Thank you for the review! 👍
Chupa-Skrull@reddit
Does an identical test environment make sense for non-identical models? I'm not sure it does
zdy1995@reddit
different quantization and different provider…
No-Consequence-1779@reddit
Very nice. I run both the 3.6 models. One model on each side of the GPU vram.
vick2djax@reddit
So, would the move be to have Qwen3.6 take care of the first pass, then maybe an extra couple of review passes, and then finish with Gemma4 doing a final pass to make it most readable?