I tested Qwen3.6-27B, Qwen3.6-35B-A3B, Qwen3.5-27B and Gemma 4 on the same real architecture-writing task on an RTX 5090

Posted by Gazorpazorp1@reddit | LocalLLaMA | View on Reddit | 46 comments

I ran a pretty simple but revealing local-LLM test.

At first I was only going to post about the two Qwens and Gemma4 and go to bed, and what do you know, I go on reddit and see a post that Qwen 3.6-27B dropped. Oh well...

Models tested:

Gemma4
cyankiwi/gemma-4-31B-it-AWQ-4bit
Qwen3.6-35B
RedHatAI/Qwen3.6-35B-A3B-NVFP4
Qwen3.5-27B
QuantTrio/Qwen3.5-27B-AWQ
Qwen3.6-27B
cyankiwi/Qwen3.6-27B-AWQ-INT4

Context: I’m working on fairly complex tool that takes noisy evidence and turns it into a structured “truth report.”

I gave the same Hermes writing agent (“Scribe”) the same task:

take 2 architecture blueprint docs (v1 baseline + v2 expansion) describing the "truth engine" and produce a unified `Masterplan.md` explaining:

- what the product is

- the user problem

- UX/product shape

- UVP/moat

- pipeline

- agent roles

- architecture

- trust/legal/provenance posture

- what changed between plan V1 and V2

V1: \~16k tokens,

V2: \~4.6k tokens,

Combined: \~20.6k tokens

Then I ran the full workflow locally on my RTX 5090 all 4 models:

- **Gemma4**
- **Qwen3.6-35B**
- **Qwen3.5-27B**
- **Qwen3.6-27B**

To make it fair and push the models, each model got:

initial draft
second-pass revision
final polish

Each stage was directed and reviewed by my GPT-5.4 agent Manny, so this wasn’t just “ask once and compare vibes.”

## What I/Manny scored

- **Clarity**

- **Completeness**

- **Discipline**

- **Usefulness**

## Final results

### Clarity

- Gemma4: **9.4**

- Qwen3.6-27B: **8.8**

- Qwen3.6-35B: **8.1**

- Qwen3.5-27B: **7.4**

**Winner: Gemma4** (at a cost, read further below)

Gemma was the best editor. Cleanest structure, best pacing, strongest restraint.

---

### Completeness

- Qwen3.6-35B: **9.6**

- Qwen3.5-27B: **9.1**

- Qwen3.6-27B: **8.7**

- Gemma4: **7.9**

**Winner: Qwen3.6-35B**

The 35B Qwen wrote the most exhaustive architecture doc by far. Best sourcebook, most implementation mass.

---

### Discipline

- Gemma4: **9.5**

- Qwen3.6-27B: **8.6**

- Qwen3.6-35B: **7.7**

- Qwen3.5-27B: **6.8**

**Winner: Gemma4**

Gemma best preserved the actual product identity

---

### Usefulness

- Qwen3.6-27B: **9.3**

- Qwen3.6-35B: **9.2**

- Gemma4: **8.9**

- Qwen3.5-27B: **8.8**

**Winner: Qwen3.6-27B**

This was the surprise. The 27B Qwen 3.6 ended up as the best **overall practical workhorse** — better balance of depth, readability, and usability than the others.

## Final ranking

1. **Qwen3.6-27B** — best all-around balance

**Gemma4** — best editor / strategist
**Qwen3.6-35B** — best exhaustive drafter
**Qwen3.5-27B** — solid, but clearly behind the others for this task

1) Best overall balance

Qwen3.6-27B This is the new interesting winner.

It doesn’t beat Gemma4 on clarity or discipline.
It doesn’t beat Qwen3.6-35B on completeness.

But it wins the thing that matters most for a real working master plan: balance. It’s the best compromise between:

readability
completeness
structure
practical usefulness

2) Best editor / best strategist

Gemma4 If the goal is:

cleanest finished document
strongest executive readability
best restraint
best “this feels like a real deliberate plan”

Then Gemma still wins.

3) Best exhaustive architecture quarry

Qwen3.6-35B If the goal is:

maximum implementation mass
biggest architecture sourcebook
richest mining material for downstream docs

Then Qwen3.6-35B is still the beast.

4) Fourth place

Qwen3.5-27B Not bad. Not embarrassing.
But now clearly behind both Qwen3.6 variants and Gemma for this kind of long-form architecture/planning task.

## Actual takeaway

This ended up being a really clean split:

- **Gemma4 = best editor**

- **Qwen3.6-35B = best expander**

- **Qwen3.6-27B = best practical default**

- **Qwen3.5-27B = respectable, but not the winner**

So if I were setting a default local writing worker for long-form architecture/master-plan work today, I’d probably choose:

**Qwen3.6-27B**

It’s the best compromise between:

- readability

- completeness

- structure

- practical usefulness

Personal Note re Gemma 4: It was drastically shorter than the Qwens for the final output

Gemma4 → 147 lines
Qwen3.6-35B → 725 lines
Qwen3.5-27B → 840 lines
Qwen3.6-27B → 555 lines

So while I do agree that less is often more, I found the Gemma4 output lacking in both technical depth and detail. Sure, it captured the core concepts, but I would position the output as more of a pitching deck or high level concept, technical details and concepts however are sorely missing.
On the other end of the spectrum is Qwen3.6-35B which delivered 5x the volume. That document could really serve as a technical blueprint and architecture implementation bible. Qwen3.5-27B produced even more but this was quantity over quality.
I would honestly have rated Gemma4 less favourably than Manny did, so make of that what you will.

For First-draft only performance, I’d rank them:

One-shot ranking

Qwen3.6-27B
Qwen3.6-35B
Qwen3.5-27B
Gemma4

Why

1) Qwen3.6-27B

Best balance right out of the gate:

strong product framing
solid structure
good density
less bloated than the other Qwens
more complete than Gemma’s first draft

This was the best raw first shot.

2) Qwen3.6-35B

Very strong one-shot draft, but more sprawling:

most exhaustive
richest implementation mass
more likely to over-include
better sourcebook than polished masterplan on first pass

If you want maximum raw material, this one was a beast.

3) Qwen3.5-27B

Good first-draft generator, but sloppier:

ambitious
broad
lots of content
weaker discipline and coherence than the 3.6 models

Still useful, but clearly behind both 3.6 variants.

4) Gemma4

Gemma (arguably) won the final polished-document contest, but not the first-draft contest. Its one-shot behaviour was:

too compressed
too selective
not thorough enough for the initial task

It needed the later revision passes to get more substance. Depending on the audience, this may be either good or bad.

Short version

Best one-shot: Qwen3.6-27B
Best after revision/polish: Gemma4

[-]

IrisColt@reddit

You talk like an LLM "best sourcebook, most implementation mass" but I liked your comparison a lot. It aligns with some of my findings.

[-]

Gazorpazorp1@reddit (OP)

It was 3AM local time after I added the newly released qwen3.6-27B to the test, so I had my main Agent draft the results , that's why :) Where it wasn't strictly about the results, I added a few important own notes, that's why I say Manny rated Gemma4 more favourably than I would have.
I just wanted to share some imo interesting findings for those who care. Never claimed this to be a benchmark or professional testing. I don't get paid for content so I included the more or less raw results so people can make of it what they will. It aligned quite well with what I had heard about those models so it seemed valid enough for sharing. But gotta say this was an excellent reminder why I usually refrain from posting on reddit.

[-]

IrisColt@reddit

I love your results, they are intuitive and insightful.

[-]

MK_L@reddit

Can you share the exact local stack? Backend/server, UI if any, and model runner. For example: LM Studio, Ollama, vLLM, llama.cpp, Open WebUI, or a custom API layer. “Scribe” sounds like the agent/orchestrator, but I can’t tell what actually served the models.

[-]

Gazorpazorp1@reddit (OP)

I'm using vLLM, as mentioned Scribe is the writing agent in the roster. He curates the Obisidan vault, organizes daily logs, keeps the notes tidy and makes suggestions on how to improve the structure.
Hermes Agent runs on my notebook, the vLLM server runs on my desktop. I can hotswap the models so its fairly easy to compare runs. I started using local inference a month ago so I'm still figuring things out

[-]

MK_L@reddit

Oh thats interesting 🤔. You should make a write-up about that. If you do please post the link. Thanks

[-]

CurveNew5257@reddit

Maybe it’s the difference of coding vs more regular work but I tested all these exact same models on something that require logic but very simple task to write an email about an issue with a vendor shipment. My constraints do not use 3 specific words and keep it under 100 total words.

Every single version of qwen in all the different quantizations all could not do it. Out of 7 tries total 4 tries got stuck on thinking loops for over 5 minutes and I stopped it and the other 3 did produce decent results however the fastest thinking time was almost 2 min.

In comparison Gemma 4 26b and Gemma 4 e4b both produced results that followed the exact logic and produced those results in literally less than 2 sec.

Maybe qwen is just built for strictly coding or something but any real world AI task I have given it it just cannot complete and constantly thinks for 3-4 minutes on everything. Even my mobile models on locally ai Gemma 4 e2b and qwen 3 4B, exact same results as bigger models.

Am I doing something wrong or is my use case just not for qwen?

[-]

LocoMod@reddit

Context: I’m working on fairly complex tool that takes noisy evidence and turns it into a structured “truth report.”

Your harness is irrelevant here as your entire post doesnt read like someone who has actual experience building "complex tools". But what IS relevant that you did not provide is the context your provided so we can evaluate if you're vibing or someone that actually knows how to measure deltas.

You could have reduced the amount of text in our post 90% and just listed your "favorite" models by rank without providing any additional information and it would have had the same value.

[-]

Gazorpazorp1@reddit (OP)

I have in fact zero experience building "complex tools" and I will most likely fail at this but I am learning a ton and it's infinitely interesting, especially with a new model dropping weekly. I started working with local LLMs a month ago and I dont have an engineering background...
I'm sorry you and the majoriity found my test of little value. I did it for myself and though it was interesting so I shared. Lesson learned I guess, don't bother next time.

[-]

Karyo_Ten@reddit

The issue is misrepresenting what you did.

You say "architecture" and "fairly complex task" and we have absolutely no idea what you actually meant. And people were right to call you out because you just admitted you actually couldn't evaluate the output on what matters. (i.e. is it tested? is it maintainable? is it extensible or tightly coupled? is it overengineered and solving a Google-scale?)

[-]

LoveGratitudeBliss@reddit

Ignore the haters , so many in this scene have forgotten how to communicate like a human, I get people want it all but we're humans and that means imperfect which when its all ai will become our best trait . I appreciated your tests, I have found the qwen 3.6 27b much slower than the qwen 35b , same quant and its 5x slower, it produces less errors but take 5x, on my mac m1 64gb the 35b feels responsive at around 50 tk/s but its couldn't finish a simple 3 level Mario game .

[-]

Signor_Garibaldi@reddit

You don't seem to grasp that it's obvious that dense 27B would be slower than 3B active MoE

[-]

drwebb@reddit

I'm pretty sure any serious dev could easily get work done with either qwen 3.5, qwen 3.6, and Gemma-4 31. Like you say, one might be better than the other, but just creating some artificial task made up task and using LLM as a judge is not the way to tell.

[-]

i_like_brutalism@reddit

actually soft disagree. it REALLY depends on the work. hallucinations can cost hours of debugging! especially in documentation or when summarizing large functions.

[-]

drwebb@reddit

That's why I said serious as in experienced dev who is gonna catch hallucinations quickly and know how to debug. I've just been using these small models recently, and while not great they are actually decent.

[-]

i_like_brutalism@reddit

its hard to catch hallucinations quickly without doing the work yourself imo. of course big blunders are noticeable, but slight mistakes are pretty bad and often stay in context, even when corrected (with a human in the loop giving corrections during the work) depending on model and harness

[-]

Zanion@reddit

I'd bet my GPT-judge can beat up your GPT-judge. I was really "in the zone" when I vibed up the system prompt so I'm pretty sure it's rock solid.

[-]

AlwaysLateToThaParty@reddit

Where can I subscribe to your newsletter?

[-]

contextbot@reddit

Yeah, but you also have to give your judge a name and be sure to tell people it /s

[-]

Zanion@reddit

But what IS relevant that you did not provide is the context you provided so we can evaluate if you're vibing or someone that actually knows how to measure deltas.

I mean.. I think we all have a pretty good idea.

[-]

Comfortable_Newt8219@reddit

Should try evaluation with RAGAS for your evaluation metrics to be more globally comparable

[-]

Gazorpazorp1@reddit (OP)

This was a quick and dirty test during a work session. Good enough for my own model comparison, there are better, actual benchmarks out there of course

[-]

onewheeldoin200@reddit

Its funny how much the specific task seems to matter.

I had Gemma 4 31B, Qwen3.6 35B A3B, and Qwen3.6 27B all review contract language for specific risks, and Qwen3.6 35B A3B was above-and-beyond the best outcome. Wasn't even close. The other two selected basically random clauses that were standard or low-risk, while Qwen3.6 35B A3B actually did a great job at picking things that could genuinely harm a business.

A pot for every lid, a model for every workflow :)

[-]

Gazorpazorp1@reddit (OP)

Interesting, seems to align with my own findings! 36-A3B did produce the most detailed masterplan and retained the most amount of technical details, without unnecessary bloating.

[-]

sutheesh_s@reddit

what was your device configuration to run this locally, for me it took too much time to initialize and respond

[-]

Thrumpwart@reddit

Yeah Gemma 4 31B is very clear. I gave it a similar task awhile ago and while it missed some things, its explanations and summaries were top notch. Seriously the best long-context RAG and summary model. Not as smart as Qwen but easily the best output for clarity I’ve used in a local model.

[-]

donthaveanym@reddit

Slop, didn’t read

[-]

xornullvoid@reddit

If you say that you have the time to perform a comparative study of various models writing skills, but you do not have the time to write and present the results of thr said study without resorting to AI-produced slop, then I do not beleive that you did this comparison with any sort of diligence or that your comparison is reliable.

[-]

Glittering-Call8746@reddit

Try different harness pls.. opencode, pi agent, claude code , codex etc

[-]

Gazorpazorp1@reddit (OP)

Why would I? this was a writing task, not coding. the writing agent sevres it's role, the tests was just a rough but useful comparison, not a benchmark

[-]

CoruNethronX@reddit

takes noisy evidence and turns it into a structured “truth report"

Just use raw LLM, it's exactly your tool

[-]

Gazorpazorp1@reddit (OP)

I am. My data extractor is essentially just the vLLM endpoint, raw data is being fed into it and the output saved to a file. I did start with a processor agent at first but soon realized that a data processing node doesn't need all the agentic overhead, raw LLM works perfectly (and you avoid potential tool calling issues)

[-]

omidmatin@reddit

Could you please share your TG (t/s) for these models on your setup?

[-]

Gazorpazorp1@reddit (OP)

vLLM, RTX5090, no offloading, 50-60 t/s with thinking on.

[-]

anzzax@reddit

there are cases when it's ok to use llm in posts :)

Local LLM Comparison — Architecture / Masterplan Task

Model	Params	Clarity	Completeness	Discipline	Usefulness	Final Rank	One-shot Rank
Qwen3.6-27B	27B	8.8	8.7	8.6	9.3	#1	#1
Gemma4 (31B AWQ)	\~31B	9.4	7.9	9.5	8.9	#2	#4
Qwen3.6-35B	35B	8.1	9.6	7.7	9.2	#3	#2
Qwen3.5-27B	27B	7.4	9.1	6.8	8.8	#4	#3

Consolidated Summaries + Observations

Qwen3.6-27B — Balanced Workhorse

Best overall due to Pareto balance across all scoring dimensions.
Strong one-shot behavior; minimal reliance on iterative correction loops.
Produces implementation-usable output without excessive verbosity (\~555 lines).
Not category-leading, but avoids failure modes seen in others (overcompression vs overexpansion).
Most suitable as default production model for long-form structured outputs.

Gemma4 — Editorial/Strategic Optimizer

Dominates in clarity and discipline, preserving product identity and narrative intent.
Exhibits high compression bias → outputs resemble executive briefs (\~147 lines).
Underperforms in completeness; lacks sufficient implementation detail in first-pass generation.
Performance improves significantly with iterative refinement.
Best positioned as final-pass editor or strategic layer, not primary generator.

Qwen3.6-35B — Expansion / Sourcebook Generator

Maximizes completeness and architectural coverage (top score: 9.6).
Generates high-density technical material suitable for downstream extraction.
Suffers from over-inclusion and weaker structural discipline.
Output resembles a technical reference corpus (\~725 lines), not a refined plan.
Best used as a knowledge expansion stage, followed by pruning.

Qwen3.5-27B — Legacy High-Volume Generator

Produces the largest outputs (\~840 lines) but with lower signal density.
Reasonably complete but lacks coherence and discipline relative to 3.6 variants.
More prone to structural drift and redundancy.
Functionally superseded by Qwen3.6 models in this task category.

[-]

daishiknyte@reddit

Let’s see some of the actual outputs.

[-]

solidsnakeblue@reddit

I agree with your assessments of most of these models. Gemma 4 is a really great model and I love using it in my workflows, but Qwen 3.6 27b does perform slightly better (thought I prefer Gemma's tone and reasoning)

Thanks for posting!

[-]

mr_Owner@reddit

Curious if the qwopus distills would be better here

[-]

Plastic-Stress-6468@reddit

Not directly related, but I can echo OP's sentiment that gemma4 is brief in nature. IMO Gemma 4 distilled with the opus data set is like a match made in heaven. It's short and brief, gets to the point, very pleasant to use as a conversational partner or editor model, where long winded AI wording can get tiresome.

[-]

Southern_Sun_2106@reddit

"I am building a Mystery Tool, and this is how the models did - according to me and to chatgpt" - ok, thanks for sharing, but this tells close to nothing about anything.

[-]

pkailas@reddit

I tested all 4 of those myself. The test was on C#14 .NET 10 . Q3.5 27b and 3.6 27b failed on 4 out of 6 tests. Q3.6 MoE was very fast and very wrong. gemma4 scored a perfect 6/6

Does this mean Gemma is smarter? No, just that it was trained more recently. I don't know what was improved with q3.6, but it doesn't work for me. On older codebases maybe.

[-]