Are Qwen 3.6 27B and 35B making other ~30B models obsolete?
Posted by nikhilprasanth@reddit | LocalLLaMA | View on Reddit | 141 comments
Have Qwen 3.6 27B and Qwen 3.6 35B basically made most of the older ~30B models irrelevant?
They seem to beat stuff like Qwen Coder 30B, GPT OSS 20B, and the Gemma models, especially for coding and agent workflows.
At this point I’m not really finding a reason to keep the older ones around.
Anyone still using them for something specific?
speedb0at@reddit
Can’t believe how good it is. I use it for bug/vuln hunting and general coding in my own harness, and I’ve already had one bug in WordPress confirmed by a dev.
jimmytoan@reddit
Qwen3 is genuinely impressive but the 'obsolete' framing skips past quantization and hardware constraints. For people running on consumer GPUs with 24GB VRAM, Qwen3 35B q4_K_M sits at ~21GB which barely fits, whereas Qwen3 27B is more comfortable. The use cases where Nemotron and older models still win are specific: long-context inference speed, and any workflow where you've already tuned prompts against an older model and the retuning cost isn't worth the capability delta.
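The napkin math is easy to sanity-check yourself. A rough sketch, assuming ~4.85 bits/weight for q4_K_M and similar published averages for the other llama.cpp quants (approximations, not exact file sizes, and KV cache comes on top):

```python
# Rough GGUF size estimate. Bits/weight are approximate averages for
# llama.cpp quant types (assumed values, not exact), and KV cache,
# context, and runtime buffers all add on top of this.
BITS_PER_WEIGHT = {"q4_K_M": 4.85, "q5_K_M": 5.7, "q6_K": 6.6, "q8_0": 8.5}

def est_size_gb(params_billion: float, quant: str) -> float:
    """Approximate weight footprint in GB for a given parameter count."""
    return params_billion * 1e9 * BITS_PER_WEIGHT[quant] / 8 / 1e9

for name, params in [("Qwen3.6 27B", 27.0), ("Qwen3.6 35B", 35.0)]:
    for quant in ("q4_K_M", "q5_K_M"):
        print(f"{name} @ {quant}: ~{est_size_gb(params, quant):.1f} GB")
# Qwen3.6 35B @ q4_K_M lands around ~21 GB, which is why it barely
# fits in 24 GB once you reserve room for context.
```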
FoxiPanda@reddit
I currently keep 7 local models warm at any given time.
So is Qwen3.6 27B/35B making others obsolete? Yes and no. It's certainly stealing use cases for me, but it's not a winner in all areas. Gemma is better at creative writing and vision / handwriting transcription tasks imo.
DeepOrangeSky@reddit
How do you feel about Devstral Small-2 24b and/or Devstral-2 123b?
FoxiPanda@reddit
Honestly, haven’t tried it yet. It’s a blind spot for me that I need to go fix
moorsh@reddit
How is qwen3.6 27B on par with haiku level coding when it beats sonnet 4.6 in benchmarks?
Big_Wave9732@reddit
This comment is totally not a bookmark for reference back when my Mac Studio Ultra arrives in a few days.....
South_Hat6094@reddit
depends what you're running them for honestly. qwen crushes agentic tasks but gemma still handles long-form summarization better in my testing. benchmarks won't tell you that.
jirka642@reddit
Gemma-4 is a lot better at writing fiction, so it's definitely not obsolete.
moorsh@reddit
Gemma is a hallucinating pile of shit but I suppose writing fiction is a perfect use case.
jirka642@reddit
Interesting you say that, because I have seen Qwen3.6 hallucinate far more when writing stuff.
biogoly@reddit
It’s better at multilingual and translation than Qwen imo. I just wish it remembered to use tools more.
nikhilprasanth@reddit (OP)
I have not tried writing with LLMs. Do you use any harnesses for writing? Or just plain chats?
jirka642@reddit
I have been using mostly plain chats, because no UI/harness I could find does everything the way I want.
I'm currently looking into extending pi.dev, but if that doesn't work out, I will probably just vibe code something usable together.
silenceimpaired@reddit
I even enjoy the previous Gemma 3 for brainstorming
Wyndyr@reddit
Gemma 3 surprisingly holds itself against Gemma 4, as long as you don't go deep NSFW or don't mind a bit more dry prose.
It's too early to write the old warhorse off.
Monad_Maya@reddit
That's like 4 models in total: Qwen 3.6 = 27B, 35B A3B; Gemma 4 = 26B A4B, 31B.
gpt-oss 20B is terrible at coding, and LFM 24B is still a preview release.
There aren't that many options.
Unlucky-Message8866@reddit
LFM 24B with an 8k context size is mostly useless for coding
Monad_Maya@reddit
That's what I meant, gpt-oss 20B and LFM 24B aren't really competitors to Qwen 3.6 and Gemma4.
Qwen 3.6 and Gemma 4 haven't replaced anything, they are just an evolution of the previous Qwen and Gemma releases.
StupidScaredSquirrel@reddit
Nemotron is fast af for long context.
Gpt oss 20b is very small.
But generally speaking yes newer models take the place of older ones, this isn't sensational or surprising
nikhilprasanth@reddit (OP)
Yes, Nemotron and gpt-oss are really snappy. Wish OpenAI had released a successor to it.
jld1532@reddit
OpenAI is openly stating they are failing to hit revenue targets. I would be really surprised if we saw another open source model from them any time soon, if ever.
StupidScaredSquirrel@reddit
Open models are an opportunity to flex technical capabilities though, why do you think google releases gemma?
jlv@reddit
Look at their recent earnings, they can afford to flex tech capabilities even if it costs them some revenue
StupidScaredSquirrel@reddit
But flexing isn't really a discretionary expense. It's investment bait. It's something you do because you want money, not because you want to spend any
HornyGooner4402@reddit
Flexing also takes money. Google can do that because they have spare cash lying around, which they turn into investment bait. I also have cool ideas I'd love to flex, but I don't have the money to actually go through with them, and neither does OpenAI since they're being chased by investors that they took hundreds of billions from.
StupidScaredSquirrel@reddit
OpenAI has done nothing but burn money for investment bait so idk why you're comparing them to yourself
HornyGooner4402@reddit
what are you talking about 😂 SOTA models aren't investment baits, they're actual investments. they haven't for like the past 4 years.
nicksterling@reddit
Then did the GPT-OSS release help drive any additional revenue?
StupidScaredSquirrel@reddit
I didn't say it was to increase revenue, I said it was to increase investment
phein4242@reddit
Early days Google was open, yes. The more ads they served, the more closed they became. Still more open than most other SV corps tho.
StupidScaredSquirrel@reddit
I was referring to their latest gemma 4 models release. They don't do it to be nice, they do it to show what they're capable of.
BadUsername_Numbers@reddit
Somewhat the same reasons that Deepseek and Qwen release their models - it's a flex, and it also might eat just a little bit into the enormous userbase of ChatGPT.
cr0wburn@reddit
Qwen 35B is also very snappy, if you have room for it.
nikhilprasanth@reddit (OP)
Yes. It's also good. Need to offload when used with a 5060 Ti, but still getting consistent throughput.
starkruzr@reddit
can offload less with two of them, right? wonder what the perf would be like in that setup.
Turtlesaur@reddit
I got 80-90 tokens per second with a 9800x3d offload on my 4080
screenslaver5963@reddit
I was surprised when my 9950x3d could run models at a usable speed (I think it was around 20-30tps) on its own.
horrorpages@reddit
Note: this is only for people looking to upgrade from a 5060 Ti 16 GB without spending 32 GB GPU money. I’m only considering Qwen3.6-35B-A3B Q4_K_M and speed improvement, not gaming, resale, power, or general workstation use.
A 2 × RTX 5060 Ti 16 GB setup, roughly <$1.2k total or <$600 if you already own one, should give around a 2x speed improvement over a single 5060 Ti 16 GB. It will not behave like one clean 32 GB GPU, and it will not scale perfectly, because the cards do not share unified VRAM and you introduce inter-GPU transfer/scheduling overhead.
A RX 7900 XT 20 GB at roughly <$750 retail, or a RX 7900 XTX 24 GB around $1.4k, can land in a similar rough improvement range versus one 5060 Ti, depending on llama.cpp setup.
The wildcard is a used RTX 3090 24 GB, roughly $1k–$1.2k on the secondary market. It can also land in that same general improvement tier, but with the obvious used-hardware risk. The upside is unified 24 GB VRAM and cleaner CUDA support.
So it really comes down to budget, platform preference, and used-hardware risk. Personally, I see two realistic options: sell the 5060 Ti and invest roughly another $1k into a used 3090, or keep it and add a second 5060 Ti for under $600.
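For anyone who does go dual 5060 Ti: a minimal llama-cpp-python sketch of the split, just to show the shape of it. The model path is a placeholder and the 50/50 ratio is a starting guess; real ratios depend on what else each card holds:

```python
from llama_cpp import Llama

# Placeholder path -- point this at whichever GGUF quant you actually use.
llm = Llama(
    model_path="models/Qwen3.6-35B-A3B-Q4_K_M.gguf",
    n_gpu_layers=-1,          # offload all layers; the two 16 GB cards share them
    tensor_split=[0.5, 0.5],  # weight fraction per GPU; tune if one card holds the KV cache
    n_ctx=32768,
)

out = llm.create_chat_completion(
    messages=[{"role": "user", "content": "Write a binary search in Python."}]
)
print(out["choices"][0]["message"]["content"])
```

It won't behave like one clean 32 GB card, as noted above, but splitting the layers this way is what gets you past the single-card VRAM ceiling.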
starkruzr@reddit
that second option is what I'm thinking, especially since I expect Nvidia to start making moves toward sunsetting Ampere support sooner rather than later.
jtjstock@reddit
Sunsetting? They are restarting production because they can do that with Samsung without affecting the new stuff they're sending to data centers.
Mantikos804@reddit
I have both: a PC with a RX 7900 XT and a PC with two 5060 Tis. If you can fit all of a model in 20GB plus context, the 7900 XT is faster because of its wider memory bandwidth. Otherwise the dual RTX 5060 Ti setup is better, smooth, and the 32GB is nice. General rule: VRAM is king.
horrorpages@reddit
I'm doing the same. It's a budget-friendly way to upgrade and gives quite a bit of VRAM headroom. I'm ok with the complexity it comes with. Newer cards on the secondary market also carry less risk, so I can probably save another $150 there.
starkruzr@reddit
my sense is that this arrangement works ~fine for the most part if you're using llama.cpp vs. vLLM.
horrorpages@reddit
Should be. vLLM would be overkill for me as I'm not dealing with multiple concurrent sessions, agents, or fancy orchestration needs. My workflows are sequenced or one-offs, so llama.cpp is the way.
Ariquitaun@reddit
I run it on my Ryzen 9's Radeon 780M iGPU with DDR5 "VRAM" and it's surprisingly usable, especially if you don't need thinking mode enabled.
Local_Phenomenon@reddit
Time will be the only barrier from now on.
No_Block8640@reddit
No no, the surprising part is not that it's taking the place of older ones, but the substantial improvements the 3.6 version gave us. I'd say it is sensational.
ydnar@reddit
i would say the sensational or surprising part is how much more reliable 3.6 27b has been compared to gemma 31b. i do keep 31b around just in case i need it for translation/writing, but 97% of the time i'm relying on 3.6 27b.
florinandrei@reddit
That's a good combo. Gemma is great with natural language.
StupidScaredSquirrel@reddit
Depends on the task at hand. Qwen is very good for agents and coding but plenty of people don't need/want that but do need very strong document understanding over plenty of languages
ydnar@reddit
thanks, i should have been more clear. we're both saying the same thing. the more reliable part i'm referring to is the agent/coding portion.
for my particular use case, i find that more important. translation and writing is less a part of my workflow, but is needed at times, which is why i still keep it around. it just receives much less use.
ubrtnk@reddit
I loved gpt-oss but vision is quickly becoming a needed standard. It was a great model for home assistant voice assist.
StupidScaredSquirrel@reddit
I have to say I'm not quite sure why. I always found document parsing to markdown then fed to a text llm to be much cheaper and efficient. But maybe I'm missing something
ubrtnk@reddit
For my family the "what the f*** is this" is a high requirement lol. Not necessarily important for home assistant voice as an alexa replacement but thru the owui app on our phones yea
florinandrei@reddit
Ah, so that's the "needed standard", got it.
Daniel_H212@reddit
Nemotron nano 30B or cascade 30B?
MalabaristaEnFuego@reddit
Qwen 3 Coder 30b is still the superior coder. Devstral Small 2 and GLM Flash 4.7 are also great coders independent of the new Qwen models.
Unlucky-Message8866@reddit
no it isn't, neither in benchmarks nor in my personal testing:
| Benchmark | Qwen3-Coder-Flash | Devstral Small 2 | GLM-4.7-Flash | Qwen3.6-27B |
|---|---|---|---|---|
| SWE-bench Verified | 51.6–72.5% | 68.0% | — | 77.2% |
| SWE-bench Multilingual | 35.7% | 55.7% | — | 71.3% |
| SWE-bench Pro | — | — | — | 53.5% |
| LiveCodeBench | 75.0% | 34.8% | 35.0% | 83.9% |
| HumanEval | 88.0% | — | — | — |
| Terminal-Bench | 32.5% | 22.5% | — | 59.3% |
| GPQA-Diamond | 44.2% | 53.2% | 40.0% | 87.8% |
| MMLU-Pro | — | 67.8% | — | 86.2% |
| GSM8K | 84.3% | — | — | — |
| MATH500 | 72.8% | — | — | — |
| AIME25/26 | — | 34.3% (AIME25) | — | 94.1% (AIME26) |
| MMMU | 60.1% | — | — | 82.9% |
| MCP-Atlas | — | — | 81.7% | — |
| LiveCodeBench v6 | — | — | 94.0% | 83.9% |
| HMMT Nov 25 | — | — | 87.7% | 90.7% |
| HallusionBench | — | — | 94.2% | — |
| OmniDocBench1.5 | — | — | 88.9% | — |
| MMLU-Redux | — | — | — | 93.5% |
| C-Eval | — | — | — | 91.4% |
| HMMT Feb 25 | — | — | — | 93.8% |
| HLE | — | 3.4% | — | 24.0% |
| τ²-Bench Telecom | — | 23.4% | — | — |
| SciCode | — | 28.8% | — | — |
| IFBench | — | 31.2% | — | — |
| Long-Context Reasoning | — | 24.0% | — | — |
simon_zzz@reddit
For writing and summarization, I lean towards the Gemma models.
Clean_Hyena7172@reddit
Gemma is the best at non-code tasks and Qwen is the best at code tasks, in my opinion.
nikhilprasanth@reddit (OP)
Got it. But I find Gemma slows down drastically compared to qwen as context grows.
Unlucky-Message8866@reddit
all LLMs do, this is intrinsic to their architectures
Photochromism@reddit
This. Gemma is not comparable for long context.
philmarcracken@reddit
its pretty goated at translations when given context
OpenEvidence9680@reddit
Same. Among the newest models Gemma4 writes the best, with the only exception being Mistral. My potato doesn't allow me to run the new models, so I don't know if that changed with 3.5 or 4, but Mistral 3.2 24b is still my favourite for prose.
Endurance_Beast@reddit
Gemma4 is better at translation
2Norn@reddit
i kinda have my money on nemotron but nvidia don't seem to make it a priority yet. it comes natively trained in nvfp4 so that's already a bonus, but right now it's probably worse than qwen
dionysio211@reddit
I think they are each carving out niches that play to their strengths, speaking to fresh models in this size range. Anything older than 6 months is fighting an unfair fight.
Gemma is MUCH better than Qwen in writing and tone.
Qwen is MUCH better at code and definitely hit a home run in that area that borders on the miraculous.
Nemotron, I would argue, is MUCH better at general/research tasks. It's ultrafast and scores very high in world knowledge. I loved the gpt-oss models and wish they were refreshed but Nemotron Super is definitely the successor to 120b.
Mistral's niche would be translation and multilingual interaction. The English/Mandarin world is probably unaware that Mistral's output in other languages is noticeably easier to understand.
I also think that Mistral/Nvidia's lovechild, Nemo, does not get the acclaim it deserves as the most eternal of all small models. That thing was born in the summer of 2024 and STILL gets over 100 requests per second on OpenRouter. It is undoubtedly the most used model, in its size range, of all time and usage is still climbing.
TheDailySpank@reddit
I feel the same about Qwen and Gemma. Guess I'm going to have to give Nemotron another look.
I feel like there should be a model router, based on the prompt, that could combine the powers of the smaller models à la MoE, but at the model level.
traveddit@reddit
You mean you prefer Gemma's writing and tone. I don't think you can judge good writing from bad.
into_devoid@reddit
There is some objectivity in how well it plays roles and how creative it is. Saying nothing else.
traveddit@reddit
You're still saying nothing. There's massive variance from the prompt to the params and the harness. Not to mention you can augment creative writing with iteration loops or give it source material for style but imagine thinking you can objectively judge these margins. Qwen and Gemma have different personalities but to say one is far better is untrue. I don't trust any of you on this sub to know what constitutes "good" writing in any meaningful sense.
into_devoid@reddit
The world was better before anyone read this useless comment. Thank you for wasting our time.
traveddit@reddit
Oh the irony
florinandrei@reddit
Well, you have to be literate for that.
traveddit@reddit
Why do you think your standards of writing are any more correct than mine? Are you just retarded?
nikhilprasanth@reddit (OP)
Is nemotron nano any good for knowledge?
dionysio211@reddit
I don't really know much about the 4b Nano but I used the 30b one (Cascade is the updated version) in Deep Research workflows to read through articles and extract data quickly. It was my favorite model for that because it was incredibly reliable and so fast. I feel like the world knowledge benchmarks like MMLU miss the je ne sais quoi of "world knowledge" that I mean here. I know that it's a cultural thing related to training data, but when new models come out, I will ask them what they know about my town (a small town in the South), and although most medium models do not know a lot about it, Nemotron/Gemma know its context much more accurately than Qwen. That's subjective, I know.
Maybe the moral of this story is that, at the current state of the art, these medium models can touch the heels of a frontier model if their strengths are in one area.
MadSprite@reddit
It's gotten to the point where ollama-style loading looks more promising because of MoE (available in llama.cpp through model routing): we can just choose the expert model for the task instead of relying on one 30B model to do all the tasks.
We wish we could have a gigantic model that could do it all, but being able to choose 2 or 3 models to compete against one frontier model seems like a good alternative.
ayylmaonade@reddit
Not really. I wouldn't recommend relying on any 4B model for world knowledge unless you're using it along with web tool calling or another form of RAG. Another thing that feels worth pointing out: despite Nemotron-3-Nano-30B-A3B having pretty decent world knowledge, its hallucination rate is nearly twice that of Qwen 3.6-35B.
Benchmarks showing this for the curious
If world knowledge is your main concern here, Gemma 4, Qwen 3.6 or even Mistral Small 3.2 or Magistral 1.2 would be quite good too. I'm not sure of their hallucination rates specifically, though you could check on artificial analysis.
RoomyRoots@reddit
Funny how I noticed this today: I had to translate a couple of docs between languages and cross-check the translations, so I compared Mistral, Qwen and Gemma, and Mistral was consistently better.
egomarker@reddit
Nemotron is meh and Mistral is not even in the league, it's borderline trash.
dinerburgeryum@reddit
Completely correct. All counts. No notes.
PresentationOld605@reddit
Indeed. And I like that progression to "smaller but more specialized models" a lot actually.
I have to say, for embedded FW development, Qwen 3.6 27B is the first local model that I can kind of use well, in an "augmented way of developing stuff" at least, with the chips I use the most (currently Microchip dsPIC33) and within the range of models that fit well into the 96 GB of VRAM that I have. It's not perfect, and no match for the latest SOTA models of course, but so far it seems quite close to where frontier proprietary models were during the second half of 2025 (subjectively speaking). We'll see how it holds up in the long run, but so far it has impressed me positively.
So for coding, yes, the previous "coder" models are now obsolete to me, but those did not matter much to me in the first place. I never really liked the GPT OSS 20B model and other similar ones from half a year ago. It was amazing to see how fast they ran on my Strix Halo machine when I first started with local LLMs, but their output has always been too inconsistent.
marcobaldo@reddit
I could not have put it better. I can say the same about every single line.
I will also add that Qwen 3 Coder Next is a very capable model that works well at low quants, even Q1 and Q2, fitting in 32GB VRAM, no joke. One of the absolute best non-thinking models.
Non-Technical@reddit
I really like Gemma-4 but I'm looking for a larger variant, like a 120B MoE, that's probably not coming.
mild_geese@reddit
I'd love an updated model in the gpt-oss-20b size configuration
No-Veterinarian8627@reddit
When it comes to writing (tone and stuff), human writing classifications, and multi-language use, Gemma 4 is still above Qwen3.6, at least from my personal tests (German language). I have no idea how it looks for English.
nickm_27@reddit
Definitely not, I still can't get Qwen to follow instructions as a voice agent. It commonly is too verbose and "helpful" in its responses when it is instructed to be concise.
Gemma4 follows these instructions perfectly and also writes in a more natural flow too.
Exciting_Garden2535@reddit
Well, I have one test (which I found in this subreddit) that gpt-oss-20b always passes (and, as far as I remember, Qwen3-30 also passed it, though I'm not 100% sure), but both Qwen3.6 models often fail it.
The test is synthetic, though, no real measure of intellect. Still weird to see how many tokens they spend with such bad results.
The test is:
write each characters as a emoji grid of 5x5 (S, A, M) using only 2 emojis ( one for the background, one for the letters )
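If you want to score runs automatically instead of eyeballing them, here's a rough checker sketch. It's naive on purpose: it counts characters per row, so multi-codepoint emoji (ZWJ sequences, variation selectors) would break it, and it doesn't judge whether the letters are actually legible:

```python
def check_emoji_grid(output: str, letters: int = 3, size: int = 5) -> bool:
    """Loose pass/fail for the S/A/M test: expects `letters` grids of
    `size` rows each, drawn with exactly two distinct symbols."""
    rows = [r.strip() for r in output.splitlines() if r.strip()]
    if len(rows) != letters * size:
        return False
    # Naive per-character split; multi-codepoint emoji would need grapheme handling.
    if any(len(row) != size for row in rows):
        return False
    symbols = {ch for row in rows for ch in row}
    return len(symbols) == 2  # one background emoji, one foreground emoji
```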
Hydroskeletal@reddit
the only way to know for sure is to test with your use cases.
for me, Gemma is a winner. But I also do all my coding in Claude/Codex
Embarrassed_Adagio28@reddit
Yeah 3.6 shits on everything with almost every task. Gemma 4 has its use cases though and OSS 30b is fast as fuck but dumb.
ea_man@reddit
I mean you can run QWEN 35B A3B in a lower quant Q3 on 16GB and that's fast as brrrrrr too 😛
nikhilprasanth@reddit (OP)
Yes. I run both Q3 and Q6; even with offloading to RAM, the Q6 is consistent for most tasks. For general uses such as the Hermes agent and all, Q3 is solid. It takes care of my Obsidian notes, transcription analysis etc.
ea_man@reddit
Yup, I would say that lower quants work better on dense models. I had good luck using 27B at IQ3 on a 12GB GPU. The 35B MoE I won't trust so much for coding, yet it's not a big prob to go bigger and offload some with those.
junior600@reddit
How many tokens per second do you get with your 12 GB GPU using the 27B version? Can you link the quant please?
ea_man@reddit
headless: https://www.reddit.com/r/LocalLLaMA/comments/1t00bgr/comment/oj5sggc/
junior600@reddit
Oh thanks, is Qwen3.5 27B still reliable today? I can run Qwen 3.6 35B-A3B at decent speed with my RTX 3060 12 GB VRAM and 24 GB RAM. Does Qwen 3.5 27B produce better code in your opinion?
ea_man@reddit
27B makes better code for sure yet it's slower.
You should use 3.6 now, it's better with agents, the best actually!
nikhilprasanth@reddit (OP)
Yeah, definitely not for coding. 27B is more resilient, but 35B can't be trusted with code below Q4. But as I said, general use cases are fine; tool calling makes up for most of the deterioration.
ea_man@reddit
Yeah, I do use 35B at Q3 on 16GB, yet on 12GB it's IQ2 and that fucks up kinda often if you actually wanna use the code it makes.
nikhilprasanth@reddit (OP)
Yes. Gpt oss was good for a while. Still is one of the fastest models one can run in 16gb.
sleepingsysadmin@reddit
Is Mercedes and Ferrari making all the other F1 teams obsolete by winning all the time?
Surely we are better off with all the teams competing?
BigYoSpeck@reddit
Jesus, was your training cut off 2018?
sleepingsysadmin@reddit
Mclaren are cheaters and redbull is utter trash this year.
VoiceApprehensive893@reddit
gemma 4 31b is still better at some things
LLMFan46@reddit
"Obsolete" well no, as some people prefer Gemma 4 31B output (some of the Gemma 4 31B finetunes can write very well), but Qwen is definitly one the top contenders for \~30B models space.
egomarker@reddit
TLDR yes
blargh4@reddit
I tend to get better results with Gemma4, though it has its issues. Qwen is very industrious but overlooks more bugs and hallucinates more in my experience.
abubakkar_s@reddit
Maybe a few system prompt tweaks might help reduce those bugs and hallucinations
Xamanthas@reddit
"It must be perfect. Make no mistakes". No dude.
Your_Friendly_Nerd@reddit
I like qwen3-coder:30b for one-shot stuff like “write a comment for this function”, “write a function that does x”, “split this up into multiple functions“
Potential-Gold5298@reddit
I don't code or use agents, so I don't see any reason to use Qwen.
UnifiedFlow@reddit
Why are you here?
kaliku@reddit
Don't be a dick. Maybe he's writing toucan porn with AI and doesn't want that to be associated with his identity 🙄
Perfect-Flounder7856@reddit
My thing is Gemma 4 has opposite-style models: a 31B dense and a 26B MoE. So use what you need. I haven't benched Gemma on my use case yet, but Qwen 35B-A3B beat out 27B on my use-case benchmark.
Maybe I'll bench those. I have 26B-A4B downloaded on my MacBook Air, so speed won't be anywhere near my AI workstation, but presumably the bench results will be the same.
Perfect-Flounder7856@reddit
But also it's Q4 and I ran the other benches on Q8 so that might make a difference.
misha1350@reddit
Yep. It's either Qwen3.6 27B, Qwen3.5/3.6 35B A3B, or Gemma 4 26B A4B or Gemma 4 31B. The Gemma models are accessible with a generous free API from Google AI Studio, though, so there's hardly a reason to run them locally (or to invest into a 24GB dGPU, for that matter).
Leather-Equipment256@reddit
Qwen is a STEM major and Gemma is an art major
Silver-Champion-4846@reddit
Same, and good inference engines that actually turn the operations into addition for the advertised speedup, instead of still doing unneeded matmul!
jacek2023@reddit
I use Gemma 4 models all the time. I asked a question today about experiences with Gemma and got no replies. Then we see crying posts that Qwen is useless because it doesn't work like Claude models. So how do all these people compare what's "the bestest ever"? By reading benchmarks and leaderboards. And that's how hype works.
spencer_kw@reddit
this is exactly why i stopped thinking about models as "better or worse" and started routing by task. qwen 27b handles all my code gen and tool calling, gemma gets the docs and summaries, and i keep opus for the stuff that actually needs cross-file reasoning. maybe 30% of my work genuinely needs a frontier model, the rest runs fine on these.
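the routing layer itself is tiny, for anyone curious. a rough sketch against an OpenAI-compatible local server (the endpoint, model names, and the task keys are placeholders for my setup, not anything standard):

```python
from openai import OpenAI

# Placeholder endpoint: llama.cpp server, LM Studio, etc. expose this API shape.
client = OpenAI(base_url="http://localhost:8080/v1", api_key="local")

ROUTES = {
    "code": "qwen3.6-27b",      # code gen and tool calling
    "summarize": "gemma4-26b",  # docs and summaries
}

def route(task: str, prompt: str) -> str:
    model = ROUTES.get(task, "qwen3.6-27b")  # fall back to the generalist
    resp = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
    )
    return resp.choices[0].message.content

print(route("summarize", "Summarize this changelog: ..."))
```

the frontier-model cases get a third route to a hosted API; everything else stays local.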
sine120@reddit
They're the measure I'm comparing other models to. It's pretty much the peak of what my system can run for my use case, and storage isn't free. I used to have dozens of models laying around, now it's just different flavors of Gemma4 and Qwen3.6
beavis9k@reddit
No
k_means_clusterfuck@reddit
Gemma4 31B beats both in my use case. The Qwen models are more prone to spinning out of control in my workflow.
Demonicated@reddit
Gemma 4 and qwen 3.6 are both really good for doing work. Gemma is faster and can produce quality output. Qwen is superior in coding. Just depends on what you want to do.
nikhilprasanth@reddit (OP)
Really wanted to use Gemma, but the speed drastically reduces as context fills compared to the Qwen models. Have tried the MoE only, not the dense Gemma.
Demonicated@reddit
I run Gemma A4B as the main worker for my agents in probably 90% of cases. I run at full bf16 size and I would argue that it matters; quantization reduces its greatness. I am running on an RTX 6000 so I know some people aren't working with hardware this nice, but it's about twice as fast for me as the Qwen equivalent.
Overall I think qwen is the superior model, but in actual use cases you only need "good enough" and I find gemma to be exactly that (in non coding workloads) and twice as fast.
dobkeratops@reddit
I ended up gravitating to them. I don't have much science behind my choice: it was Gemma3 27B because of its multimodal support, then Qwen 3.5 seemed similar but a bit more recent, then Gemma4's 31B seemed like the one to try but it was a bit slower, so when Qwen 3.6 27B came out I jumped onto that. It also seemed to let me use a longer context window on a 24GB GPU than Gemma3 27B did. Seems like you want the most recent model in a tier to get the knowledge update properly internalised, vs finetunes.
nikhilprasanth@reddit (OP)
Yes, I have a 24GB one at work and the Qwen models are retaining speed much deeper into the context window. The 35B model is a really good general-purpose model. Running it in Hermes agent now.
79215185-1feb-44c6@reddit
Gemma is great if you don't care about how long it takes to process a prompt.
Qwen for coding.
nikhilprasanth@reddit (OP)
Not only that, for me it slows down drastically compared to qwen as context increases.
ttkciar@reddit
That's because every sixth layer in Gemma is a full-attention layer. It means inference slows down more with long context, but also increases long-context competence. It's a deliberate design trade-off.
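You can put rough numbers on that trade-off. A sketch where every sixth layer is full attention and the rest are sliding-window; the layer count, KV dims, and window size are illustrative stand-ins, not Gemma's actual config:

```python
def kv_cache_mb(ctx: int, layers: int = 48, full_every: int = 6,
                window: int = 1024, kv_dim: int = 2048,
                bytes_per_elem: int = 2) -> float:
    """Illustrative KV-cache size: full-attention layers grow with context,
    sliding-window layers are capped at the window length."""
    n_full = layers // full_every
    n_local = layers - n_full
    cached_tokens = n_full * ctx + n_local * min(ctx, window)
    return cached_tokens * kv_dim * 2 * bytes_per_elem / 1e6  # K and V, fp16

for ctx in (4096, 32768, 131072):
    print(f"{ctx:>6} tokens: ~{kv_cache_mb(ctx):,.0f} MB of KV cache")
# The full-attention layers dominate at long context, which is where
# both the slowdown and the long-context competence come from.
```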
Dui999@reddit
I would say "for sure" if you speak about other Qwen models.
While other models from other companies are outshined in certain aspects, while remaining valid/better for different use cases.
nikhilprasanth@reddit (OP)
Yes, gpt-oss and Nemotron are really good for the speed. But for coding-related stuff Qwen is leading.
Dui999@reddit
Yes, I feel so too. Then there are also other tasks, such as non-technical fields, "creativity", writing prose, and many others at which each model performs and acts differently.
Mashic@reddit
Gemma 4 is better at translation and multilanguage usage. Qwen3.6 is better at coding. And I find it hard to see a need for any other model.
nikhilprasanth@reddit (OP)
Yes. Qwen for coding, Gemma for document processing/summarising stuff. Wish gpt-oss had a successor; that thing is fast and capable for its size.
Mordimer86@reddit
For "humanities" things Gemma4 35B does a really great job, although I don't know which is better. For helping reading Chinese texts or explaining the meaning of words in context. Both Gemma4 and Qwen 3.6 do a great job here.
Revolutionalredstone@reddit
Tears are shed when a good model I've used for a while falls off the pareto frontier.
Can always copy them to long term external storage for fun later.
I'm sure cavemen were all the rage in the paleolithic ;D