Final voting results for Qwen 3.6

[-]

ambient_temp_xeno@reddit

Moe enjoyers split the vote, densocrats reap the benefits.

[-]

SpicyWangz@reddit

They think they won, but REAPs never perform well

[-]

Daniel_H212@reddit

I think 27B is the GPU rich people with 90 class GPU and 24/32 GB of VRAM, and then everyone else is not as GPU rich and either running on a GPU with less VRAM (9B) or running on CPU and RAM (MoE)

[-]

Borkato@reddit

As someone with 30GB VRAM I MUCH prefer the 35B and get annoyed every time I have to pull out the 27B lol. It’s so slow

[-]

Healthy-Nebula-3603@reddit

So you now having over 24 GB you can fix bigger context and use lower compressed model line Q6?

[-]

Borkato@reddit

Yup! I use like 100k ctx for coding and such :)

[-]

Look_0ver_There@reddit

As a suggestion (if you're not already doing so), use the 27B for planning and go take a coffee break while it thinks, and then switch back to 35B for the actual implementation

[-]

I tested both 27B and 35B on a 96gb vram Pro 6000. The speed difference was what killed 27B. For an interactive agent (Hermes Agent) with long contexts, tool calling, iteration before output, etc, 35B at NVFP4 absolutely flew. 27B was unusable. Insanely long delay in replies. And the results are excellent. I was very skeptical of MoE until this experience, but am now a convert.

[-]

SLxTnT@reddit

For my uses, 35b was too stupid for the task. 122b was a bit dumber than 27b. 27b did a decent job, but was a tad bit slow. 397B needs CPU offloading, but was the best.

Best combination was using VLLM with 27b + dflash. Was about half the speed of 35b.

[-]

nihnuhname@reddit

For me, dense models are more important. This is obviously subjective, but I find them better suited for storytelling, roleplaying, brainstorming, etc, especially when working with non-English languages or doing translation.

[-]

Sticking_to_Decaf@reddit

I do get slightly better results from the dense models but they are just too slow. Nothing seems to be able to make them snappy enough to use with chat or agents.

[-]

Due-Project-7507@reddit

I am running the mradermacher/Qwen3.5-27B-i1-GGUF IQ4_XS on a 16 GB A5000 laptop GPU fully in VRAM with 32k context (turboquant):

[-]

huzbum@reddit

How is turboquant working for you? What branch are you running?

[-]

Due-Project-7507@reddit

For me it works good (means I don't notice a difference). I am using the "feature/turboquant-kv-cache" branch. There is also another fork here, it could be even better, but I did not test it.

[-]

huzbum@reddit

Thanks for the reply. I need to establish some benchmarks or something so I can evaluate things like this, model weight quantizations, and different models.

[-]

HopePupal@reddit

R9700 and B70 are options too. 7900 XTX if you can put up with tighter quants.

[-]

SKirby00@reddit

You don't need to be GPU rich to be able to run the 27B at workable speed and quality. Granted, you can't be GPU poor, but I'm running it on a combination of 60-tier cards (5060Ti 16GB + 3060Ti 8GB + 3060 12GB). I don't recommend this, but it's what I have.

My model of choice right now is the 27B at Q4, which I can run at ~18tok/s (not much slower than Claude). I can also run it at Q6 fully in VRAM, but it drops to ~12tok/s and that difference is honeslty enough to get on my nerves. Don't let Reddit convince you that you need a 90-tier GPU to run Qwen3.5-27B at workable speeds.

[-]

Haeppchen2010@reddit

RX 7800 XT 16GB + RX 580 8GB running 27B IQ4_XS fine. IQ3_XS on 16GB alone is not that worse.

[-]

Daniel_H212@reddit

Yeah 24 GB runs it fine, 16 is kinda tight though and you lose a lot of context I think? I have 12 GB vram on my main PC so I can't run 27B at any decent quality at all, and it's very slow on my strix halo.

[-]

Haeppchen2010@reddit

Yes, i use only 64k context, more than enough for OpenCode with auto compaction.

[-]

Zealousideal_Fill285@reddit

Nice setup. What is the token generation speed on this double gpu combo? Also do you think rx 580 8gb is good enough as a second gpu?

[-]

Haeppchen2010@reddit

The RX580 is super slow, but still faster than CPU. 62 layers on RX7800XT and 3 layers on RX580 give me 17-18t/s out. (Llama-server with layer-split). With CPU instead of the RX580 it would only be 7. i switch between context sizes and alway squeeze as many layers as possible on the fast card.

I am thinking about upgrading to an RX 7900XTX instead but for now this is ok for playing around.

[-]

ea_man@reddit

I run that on my 6700xt, I am a rich man.

[-]

huzbum@reddit

Personally, If I had less VRAM I'd rather run 35b with experts offloaded than 9b.

[-]

florinandrei@reddit

GPU rich people with 90 class GPU

Those are middle class, mate.

[-]

wektor420@reddit

27B is good for rtx6000 pro and higher owners

[-]

Irythros@reddit

If you bought the 6000 pro to run the 27b you really need to do some cost analysis beforehand... That's a complete waste.

[-]

nonerequired_@reddit

I am using 27B in my used 1x3090 + 1x3060 and it is good

[-]

Healthy-Nebula-3603@reddit

That 3060 is slowing down your model 3x times

[-]

IrisColt@reddit

But small.

[-]

huzbum@reddit

I started with a 3060 and upgraded to a 3090, so I have both. For the 3090, 27b is where it's at (with 4b doing the grunt work on the 3060.) I'm getting about 35tps with 27b on the 3090. that's just fast enough, but I wouldn't want to run it on any less hardware. Definitely needs 24GB VRAM.

If I didn't have the 3090, and I was just using the 3060, I would probably use 35b instead of 9b. I can do 35tps on my 3060 doing full GPU offload and offloading experts to CPU.

[-]

MoffKalast@reddit

Dense, yes, just not in the same sense.

[-]

thrownawaymane@reddit

when a comment passes the Turing test and the vibe test

[-]

johngac@reddit

AI can never replace you

[-]

temperature_5@reddit

Actually reddit just sold his comment to AI trainers.

[-]

Heavy-Focus-1964@reddit

blessedly human comment

[-]

Ok_Mammoth589@reddit

Need you watching the next election

[-]

huzbum@reddit

How about another sparse 80b like next? I feel like that was a good balance.

[-]

huzbum@reddit

Just offload experts to CPU and it’ll run on just about anything as long as you have 64GB system ram.

[-]

AltruisticList6000@reddit

Eh, still not a 24b. Not suitable for 16gb VRAM

[-]

unjustifiably_angry@reddit

If they put out a 24B you'd still need more VRAM to get useful context length.

[-]

AltruisticList6000@reddit

Depending on how the model handles context. Mistral Small 22b with its context at Q4 fit into my VRAM, and 24b's context uses even less VRAM somehow, so despite being a little larger model, it uses a tiny amount smaller VRAM together with its context. So I can fit about max 50k context fully into VRAM

[-]

toffee0_0@reddit

16gb vram people are tortured souls

[-]

Bobylein@reddit

Just usw a quant or the 9b model, though the quant did better in my tests

[-]

AltruisticList6000@reddit

27b would only fit on Q3_s or so and that has severe performance degradation in my experience so I avoid anything under Q4 quants. 9b is too small and dumb plus a lot of VRAM remains unused. I'm just sad the 16-24b range, but especially the 20-24b is always skipped recently. Way more people have 16gb VRAM than 24gb or 32gb, and a 22-24b model also fits on 24gb VRAM with ease so it would let more people use them. I don't think a 22-24b gemma or qwen would be so much worse in performance than a 27b to the point they must always add +3b parameters

[-]

Kelenkel@reddit

I think they should prioritize models for 16GB of VRAM as 90% of consumer GPUs are up to that in AMD-NVIDIA (only XX90 cards have more), that way more people can try it.

PS: Is it really possible to run 27b in 16gb of VRAM? i tried in the past and failed.

[-]

unjustifiably_angry@reddit

9B with multimodal and max context roughly fits 16GB.

[-]

ocarina24@reddit

Yes it is, unsloth/qwen3.5-27b Q3_K_S with 32K contexte and full layers at 35 tok/sec. Remove the mmproj part and get up to 39 tok/sec. It's pretty fast, smart and very good at tool calling.

[-]

Kelenkel@reddit

Thanks! I'll try!

[-]

XxBrando6xX@reddit

I just want the absolute biggest model they have released. I want something open source that’s competing with the absolute bleeding edge

[-]

unjustifiably_angry@reddit

MiniMax 2.7 is releasing this weekend

[-]

StupidScaredSquirrel@reddit

Qwen aren't the best at making the absolute SOTA large model though, deepseek usually is. Qwen are the best at making SOTA models for their size category, and all are typically much smaller than the largest 600b -1T models out there.

[-]

po_stulate@reddit

Not that they aren't the best, they just never made models of that size to begin with.

[-]

Far-Low-4705@reddit

yeah, b4 400b, their largest model was still like 200b

[-]

StupidScaredSquirrel@reddit

Yeah it's not their segment

[-]

jacek2023@reddit (OP)

which model do you use now?

[-]

XxBrando6xX@reddit

Qwen3.5 397B A17B at 27 tokens per second

And then GLM5.1 at Q4_K_XL but I’m only getting like 7 tokens a second currently with that which feels VERY slow . So I’m trying To figure out if it dips into slower memory or cpu allocated memory and that’s why

[-]

corey_prak@reddit

Sorry in advance for a dumb question, but do you use it in the same way you would use a tool like claude code to create entire apps, or is it more focused in that you give it some very detailed specs so that it performs properly?

I'm new to this all and have been testing different models and harnesses against a basic task and comparing it with claude's output, and there's a few inconsistencies that are more serious as the app matures.

Claude says folks at home may be using it as more of a companion and autocomplete rather than a vibe code generator.

[-]

FullOf_Bad_Ideas@reddit

Claude says folks at home may be using it as more of a companion and autocomplete rather than a vibe code generator.

it has Anthropic BS baked in.

I run Qwen 3.5 397B locally, in CC, OpenCode, Roo. I use it the same way I use CC. It sounds the same. It's slower than Opus/Sonnet, but in terms of output quality it's somewhere near Sonnet 4. I use all models as companions or subordinates for translating requirements to code, even Opus messes up often without enough instructions, but if Sonnet 4 could create some apps, Qwen 3.5 397B can do it too. It does things on it's own well too.

As per Elon's leak, Sonnet is 1T, it's not that much bigger than Qwen 397B where the difference is super noticeable.

[-]

XxBrando6xX@reddit

Not a dumb question, I keep trying to push myself to do more beginner friendly videos for tech that is by no means beginner friendly.

So the answer is both yes. I on principle don’t love the whole ai as a buddy thing cause I believe it comes with a lot of like risks and implications and whatever, that being said by self hosting my llm I can use it for anything. It could be a buddy, it also can help answer technical issues I run into at work when I’m deploying a new tool for my company, it also can help me vibe code little scripts to make my analyst teams life easier, or fixing small customization things in apex for salesforce. I also use it to experiment with game design, and for building pipelines for watching my cameras outside my home all locally. The cool thing is after the large initial investment it’s all “free” minus power, which the Mac Studio I use sips power at a max of 125w vs my 4090 office pc which is usually pulling close to 600w

[-]

corey_prak@reddit

Thanks for the quick reply!

I have a 3090 that I've been running qwen coder on quantized to 4bits, and I have this ralph loop that i've built which breaks down the features in a spec into these isolated tasks. I've asked claude to complete that task, and then I've run different models with different harnesses and compared its output.

At the end of the day, it can work really well and take time but the tasks have to be really specific, which I def acknowledge may not be that simple for something I'm trying to vibe code. I've gotten used to Claude being able to sort or make decisions through the ambiguity.

when you're doing the game development stuff, are you asking it to build full features or basic snippets to make things faster? I guess that could be both but curious about your experience with either.

the approach I've taken so far was to accept the tradeoff of speed VS capability, but I other than coding and playing around with image gen, I haven't tried anything else with other models outside of technical things like you mention you do.

I've been offloading context to DRAM to fit as much as I can on the 24GB VRAM, was thinking about splitting layers between one GPU on PCI-e and another connected as an eGPU via oculink -> m2 to see if a dense model with more parameters will produce a result that is closer to something Claude would do.

I don't know. I'm aware that 48GB VRAM is nothing and have learned just how much subscriptions like claude code and codex are being subsidized, which is kind of why I'm trying to get in front of it now...

[-]

jacek2023@reddit (OP)

which quant for 397B and what setup?

[-]

XxBrando6xX@reddit

Q4 there as well on Mac Studio 512gb

[-]

jacek2023@reddit (OP)

what's your minimax speed? I think I had around 20t/s on Q3

[-]

lolwutdo@reddit

Where’s 397b in the poll 😕

[-]

unjustifiably_angry@reddit

Someone make sure they don't fuck up any layers this time

[-]

Mashic@reddit

Just open source all of them. I don't think they have a use case for all of these models themselves.

[-]

YRUTROLLINGURSELF@reddit

people (bots) in here acting grateful for a purported meaningful choice as opposed to the 'fuck you' that it really is

[-]

Malfun_Eddie@reddit

Qwen 3.5 9b is such a great workhorse. 16gb vram can fit the model no quant and max context.

[-]

VirusPanin@reddit

I dunno how the hell you make it work, I've spent whole day today playing with different models, and specifically Qwen 3.5 9b was always failing in agentic tasks (tried OpenCode and KiloCode) it just randomly stops in the middle of it's process, like this: Okay, task understood, I'll read this file and analyze it. Calls a tool to read the file Tool runs successfully .... Bam, agent stopped, like it have finished.

[-]

Healthy-Nebula-3603@reddit

Yep such size is perfectly fix into 24 GB vram and to can keep 200k context Q8

[-]

Iory1998@reddit

No that's not true. There is absolutely no way that the 27B Q8 would fit in 24GB even with 1 token of context size. YOU ARE MISTAKEN.

You can fully offload the Q4_kM though.

[-]

Healthy-Nebula-3603@reddit

What do you see here? I am lying?

[-]

Iory1998@reddit

can keep 200k context Q8

Dude, you said Q8, do you mean KV cache at Q8? Nowhere in your message did you specify that you are running Q4 of the model, hence why I confirmed that you can fully offload a Q4 to GPU.

[-]

Healthy-Nebula-3603@reddit

What do you think Q8 contexts is? It is literally a cache.

[-]

Iory1998@reddit

No it doesn't mean that, maybe for you it does. The language you used was not correct. Even if it did, you didn't specify the model quantization level, so stop trying to spin it around.

[-]

Healthy-Nebula-3603@reddit

Sure sure .... whatever you say ....

[-]

Winter_Tension5432@reddit

I have 64GB VRAM and am barely able to do Q4 180k context window prompt processing at 1.3k and decoding at 57 tk/s. So good enough for me. Q8 on 24gb is not realistic

[-]

Healthy-Nebula-3603@reddit

What is this then?

[-]

No_Afternoon_4260@reddit

Funny how it's 40% 20, 20 and 20 lol !
Rather easy to interpret

[-]

Hankdabits@reddit

Tbh I think people were down on dense until qwen 27b. Hadn’t been a good one since Gemma 3 and qwq 32b.

[-]

Fair_Ad845@reddit

same, I keep going back to dense models for day-to-day stuff. MoE is great on paper but the memory footprint for the full model is still brutal on consumer hardware.

[-]

robertpro01@reddit

Well, it's the smarter that can fit on 24 vram

[-]

AcanthocephalaOk489@reddit

Well ofc.. It was on a american platform :'( I'm a poor apu guy which would've loved and voted for the 122 if not on X.

[-]

jacek2023@reddit (OP)

what does it mean?

[-]

AcanthocephalaOk489@reddit

Unsure of what you didn't understand. So:

I'm poor-ish and so I won't buy nvidias.
strix halo was my bang-for-buck, so would've preferred the bigger MoE (27B kind of unusable for coding in this -- too slow).

I'm not on X. I dislike those platforms. Being on starlink is already annoying enough for me, and I don't want to contribute any further to monopolies and billionares.

[-]

jacek2023@reddit (OP)

I voted for 122B. I don't understand why "poor guy" chooses 122. And what has to do with X :)

[-]

AcanthocephalaOk489@reddit

122 moe runs much faster than the 27. I bought my whole system for cheaper than the modern nvidias.

[-]

jacek2023@reddit (OP)

Yes but you need more VRAM, poor guys have 8GB :)

[-]

AcanthocephalaOk489@reddit

Apu w soldered 128gb

[-]

Kahvana@reddit

Really hope we get Qwen3.6-122b-a10b and Qwen3.6-35b-a3b too. Those are genuinely really useful, 27b is often too slow. It's a shame the 397b nor the 2b/4b models were listed.

[-]

Iory1998@reddit

I agre with you that the 27B is slow, but I guarantee you that it's the best model version in that series. It's so capable when you can run it.

[-]

lolwutdo@reddit

Nah, 397b is the best model version in the series.

[-]

Iory1998@reddit

Flexing your muscles huh? 😁

[-]

lolwutdo@reddit

ironically I'm a gpu poor with 16gb 5070ti but I have 128gb ram, iq2xs 397b ends up being faster at token gen than 27b lol

[-]

Psychological-Lynx29@reddit

When a 70b model? Llama 70b is really old :(

[-]

jacek2023@reddit (OP)

Why do you need 70B model? 70B dense is very slow

[-]

Psychological-Lynx29@reddit

Intelligence, when quantized at q3km it gives pretty good results with Multiagentic tasks :)

[-]

kaeptnphlop@reddit

They do a vote on X when all of us are here? The hell?

[-]

jacek2023@reddit (OP)

why do you think "all of us" are on reddit and not on X?

[-]

Thrumpwart@reddit

I ran some testing with several models last night. I hosvd several models an identical, complex task along with 2 large documents for context and a short but detailed prompt. Of the models I can run on my hardware:

Qwen 3.5 122B MLX 8-bit - winner. High quality reasoning and output.

Minimax m2.5 MLX 4-bit - close 2nd. Did a very good job breaking the task down into smaller components and understanding the role of several interlocking components. Only lost because it missed a crucial section of one of the documents. I suspect a 5- or 6-bit would have won.

Qwen 3.5 27B UD Q8_K-XL - 3rd. Good reasoning, good output but missed the same context as minimax and generally less quality output.

Gemma 4 31B UD Q6_K_XL - close 4th. Good reasoning, more creative than Qwen 27b, but missed the same context and suggested another integration that made no sense. I was genuinely surprised this lost to 27B for my task as in my experience it has better general reasoning as Qwen 27B. Could be an artifact of persisting inference engine woes, will try again in a week. I should note it got close to 27B in quality despite the Q8 vs Q6 quant difference. Maybe I’ll try again tonight on the Mac with a pound for pound Q8-Q8 matchup.

Apex quants of Qwen 3.5 122b (I-quality and I-balanced) ggufs - decent reasoning and output, more creative and colourful than the original, but less quality reasoning and output. I like the Apex quants as they seem more human in their outputs, but the reasoning suffered from the (q4) quants more than I thought it would.

[-]

CriticallyCarmelized@reddit

I’d be interested to hear about your Gemma 4 8-bit test. I’m going to assume you’ve been using llama.cpp and have grabbed the latest updated quant files for your testing since you mentioned “ongoing engine issues”. I’ve been very very pleased with Gemma 4 31B at Q8_K_XL and BF16.

[-]

Thrumpwart@reddit

Yeah I just saw the new fixes for Gemma 4. Will re-pull llama tonight after work and try again with new Q6 and Q8 quants.

I had tested on yesterday’s llama pull, will try again tonight in the new pull.

[-]

jacek2023@reddit (OP)

but was it agentic workflow or a single prompt? how do you decide who won?

[-]

Thrumpwart@reddit

Single prompt but it is fundamentally analyzing many discrete components in an LLM engineering technical plan and identifying and evaluating combinations of of components, how they interact and synergize, and what the best combination of techniques is.

I evaluated their responses personally and then with Google Gemini. Gemini caught 3 inconsistencies I missed in my evaluation and led to the rankings above. It was purely a reasoning task, not agentic.

[-]

ArtfulGenie69@reddit

Waste all this time a freaking vote... Just give me the 122b...

[-]

Lissanro@reddit

It seems 397B is not even on the list. That's too bad, because the 397B version is noticeably better than 122B when it comes to follow long complex instructions while being over two times as fast (as Q5 quant) as Kimi K2.5 (Q4_X quant) or GLM 5.1 on my rig - so it would be great middle ground for many use cases.

[-]

misha1350@reddit

They want to profit off of Qwen3.6-Plus.

[-]

Hytht@reddit

They have a generous free tier with upto 1000 requests for Qwen3.6-Plus in qwen code.

[-]

misha1350@reddit

The Bard treatment. Watch it evaporate like how Stepfun's Step 3.5 Flash did on Openrouter yesterday.

[-]

rebelSun25@reddit

I'm actually okay with not releasing the 397b or prioritizing the 27b or 122b. The corporate or moneyed interests will pay for hosted inference on the largest model to pay bills. In the end, it's in our interest for the model authors to succeed and stick around.

I can get my employer to pay $$$ for the best model, while I use the smaller models for personal use

[-]

waitmarks@reddit

I agree, it seems like there was some internal disagreement in the team after they released 3.5. it seemed like management didn’t want them to release the big one and that caused the team to get broken up. My guess is 3.6 was made specifically so they could have a slightly better model that is closed.

[-]

tengo_harambe@reddit

Dont Twitter polls have 4 options maximum? So it's possible they didnt include 397B because these 4 are presumed to be the most popular.

[-]

Expensive-Paint-9490@reddit

Qwen seems to not want to open-weight its best models anymore. But, at the same time, it wants to keep its fame as open-weight saviour.

[-]

tengo_harambe@reddit

This is not a recent change. The top Qwen model since Qwen2.5 has always been proprietary

[-]

TKGaming_11@reddit

The closed Qwen 3.5 Plus is just the open weight Qwen 3.5 397B model with extended context and native tool calling, for Qwen 3.6 they are locking away the 397B to be API only, this is change from Qwen 3.5 -> Qwen 3.6, absolutely a recent change

[-]

tengo_harambe@reddit

The closed Qwen 3.5 Plus is just the open weight Qwen 3.5 397B

Is this confirmed or just conjecture? First i've heard of it

[-]

TKGaming_11@reddit

Confirmed

[-]

Fault23@reddit

fair

[-]

miniocz@reddit

Same. I can run 397B as Q3. That is not the case for the other two big models (well I can at 1t/s, but not for chat.).

[-]

jacek2023@reddit (OP)

see the second image for details

[-]

RetiredApostle@reddit

Voters seem to just want to compare it with Gemma, rather than having a decent dense 9B in the toolset :(

[-]

inevitabledeath3@reddit

Why do we need 9B models?

[-]

Adventurous-Gold6413@reddit

Because some people can’t run higher parameter models

[-]

inevitabledeath3@reddit

Well some people should buy more or better GPUs.

[-]

Saegifu@reddit

Well some people should learn more about empathy or being human.

[-]

ebolathrowawayy@reddit

when will empathy or being human lift me out of poverty? the epstein class states it plainly - do whatever you want that most benefits you and fuck everyone else. until we have a real society again, with safety nets that encourage growth, then fuck the system, fuck everyone, wild west or bust.

[-]

Saegifu@reddit

Well, the world is in such a state exactly because of people like you thinking "fuck everyone".

[-]

inevitabledeath3@reddit

We are talking about running models at home FFS. If you need AI models especially capable ones you can rent them from the cloud for very little money. Renting a 30B parameter model would cost less than the price of a machine to run a 9B model anyway.

[-]

Yu2sama@reddit

9B is capable enough for basic stuff and rag. A 30B is without a doubt smarter and has more knowledge, but with rag some stuff even outs. Not everyone is using these models for agentic tasks or coding tbf.

[-]

misha1350@reddit

Not everyone has a 24GB dGPU at the ready. 9B means that anyone with an 8GB dGPU would be able to run it. And at 12GB on the likes of an RTX 3060, with a big enough context window

[-]

inevitabledeath3@reddit

What would you use a model that size for though? I am having a hard time finding a good use for even a 27B model, nevermind 9B.

[-]

CriticallyCarmelized@reddit

I’m with you. I’m having a hard time figuring out what people are using tiny models for. They are dumb as bricks. I suppose if you are fine tuning them for very specific one off tasks they will work fine. But I seriously doubt most people are training their own fine tunes for customized pipelines. And anyone can run the 27B-A3B at a minimum using ram offloading and get decent performance.

[-]

RetiredApostle@reddit

I use in my workflow: 27-31B as a strong tool caller, 9B as a smaller tool caller, and E4B as a fast multilingual synthesizer. It would be great to replace that 31B with the 3.6 9B (anticipating the strength).

[-]

grumd@reddit

Funnily enough I can run all of these models locally except for 27B :( The most I can run with 27B is like IQ3_S, but with expert offloading even 122B is doable at Q4_K

[-]

Last_Mastod0n@reddit

27B is just soooo slow. Even 32b a3b is like 1/2 as fast as Gemma 4 with the same vram reqs on my 4090

[-]

Prestigious-Use5483@reddit

I think it's slow because it overthinks. When it doesn't, it's not that bad.

[-]

Last_Mastod0n@reddit

Most definitely so. Reasoning took an unbearably long time without a token cap

[-]

randylush@reddit

Is a token can the best way to tune thinking? Is there a “thinking stop probably” parameter that can be dialed in?

[-]

Top-Rub-4670@reddit

Not yet, there is --reasoning-budget-message but it's injected when the token cap is reached, it doesn't give the model a chance to wrap up afaik.

[-]

GrungeWerX@reddit

It’s not that slow. They have wrong settings.

[-]

grumd@reddit

I feel like you're talking about 3 different models there. What's 32b-a3b? Which size of Gemma 4?

Anyway yeah 27B is slow-ish but when fully on GPU it's not that slow. I think the Q3 quants usually give me 40 tps tg. It's just that I need to use a shitty quant to be able to fit it into my 16GB VRAM

[-]

Last_Mastod0n@reddit

I should've been more specific. I was referring to qwen 3.5 27B 4 bit quant and qwen 3.5 32b a3b 4 bit. Both which fit fully in my vram. Now I am running Gemma 4 26b a4b 6 bit quantized with some expert layers offloaded to the CPU, and it still runs over 2x as fast as qwen 3.5 27B 4 bit quant.

Don't get me wrong I absolutely love Qwen 3.5. It was initially what made my personal project business idea viable with its vision capabilities. Its just that Gemma 4 has beaten it in every single metric that pertains to my project. I would be happy to switch back to qwen if they release a superior model again.

[-]

grumd@reddit

Okay I see. It's not 32b, it's 35b, that's why I misunderstood.

Yeah 35B and 26B are much faster than 27B but they are both WAAYYYY dumber than 27B. You're getting more shitty responses faster lol, imo quality is more important

But yeah if 26B works for you then that's great! You can always switch to 27B when you start noticing that 26B lacks quality for more complex tasks

[-]

Dabalam@reddit

Most of their own documentation seems to indicate the similar sized dense model is only somewhat smarter across the board which is why people debate about dense vs. MoE models.

[-]

0xbeda@reddit

Why is that?

On 7900XTX with 24GB VRAM I can manage 27B-Q5_K_M with about 26 tokens/s but with the 122B-Q4-K-M and a lot of offloading I get only 6 token/s.

[-]

grumd@reddit

Well I have 16GB VRAM and 96GB RAM. 27B the most I can do is IQ3_XS, with 122B I can do Q4_K_XL.

27B-Q3 is ~40-60 tps, 122B-Q4 is ~15-20 tps

Maybe your RAM is not fast enough? 6 t/s is what I was getting with NVME expert offloading lol

[-]

0xbeda@reddit

I'm using llama.cpp docker with vulkan. I tuned it so it fits my VRAM with desktop and about 1-3 GB left. I have 128GB of DDR4-3200 CL16 (Kingston KF3200C16D4/32GX) and a 5950X on a Gigabyte X570S. GPU is a Sapphire Nitro+ 7900XTX with 24GB.

Qwen 3.5 27B Q5_K_M

[Service]
Restart=always
RestartSec=10

Environment=HOST=127.0.0.1
Environment=PORT=8080
Environment=MODELBASE=/home/beda/workspace/widgets/fancyllm/models
Environment=MODEL=unsloth/Qwen3.5-27B-Q5_K_M.gguf
Environment=IMAGE=ghcr.io/ggml-org/llama.cpp:server-vulkan

ExecStartPre=-/usr/bin/podman pull ${IMAGE}

ExecStart=/usr/bin/podman run \
--name llama-server \
--replace \
-p ${HOST}:${PORT}:${PORT} \
-v ${MODELBASE}:/models:Z \
--device /dev/dri \
--group-add video \
${IMAGE} \
-m /models/${MODEL} \
--host 0.0.0.0 --port ${PORT} \
--n-gpu-layers 999 \
-c 32768 \
--batch-size 256 \
--ubatch-size 256 \
--parallel 1 \
--threads 12 \
--threads-batch 12 \
--no-context-shift \
--reasoning-format none

ExecStop=/usr/bin/podman stop -t 10 llama-server

Qwen 3.5 122B MoE Q4_K_M

[Service]
Restart=always
RestartSec=10

Environment=HOST=127.0.0.1
Environment=PORT=8080
Environment=MODELBASE=/home/beda/workspace/widgets/fancyllm/models
Environment=MODEL=unsloth/Qwen3.5-122B-A10B-Q4_K_M.gguf
Environment=IMAGE=ghcr.io/ggml-org/llama.cpp:server-vulkan

ExecStartPre=-/usr/bin/podman pull ${IMAGE}

ExecStart=/usr/bin/podman run \
--name llama-server \
--replace \
-p ${HOST}:${PORT}:${PORT} \
-v ${MODELBASE}:/models:Z \
--device /dev/dri \
--group-add video \
${IMAGE} \
-m /models/${MODEL} \
--host 0.0.0.0 --port ${PORT} \
--n-gpu-layers 14 \
--n-cpu-moe 12 \
-c 16384 \
--batch-size 256 \
--parallel 1 \
--threads 16 \
--threads-batch 16 \
--no-context-shift \
--no-mmap \
--reasoning-format none

ExecStop=/usr/bin/podman stop -t 10 llama-server

[-]

grumd@reddit

Btw after you run the model with my command and see how much better the performance is, you may want to try Q5_K_M instead of Q4 for higher quality

[-]

grumd@reddit

Adding to my comment - considering you have 24gb vram + 128gb ram, you can actually just use -fit to let llama automatically offload everything efficiently

--name llama-server \
--replace \
-p ${HOST}:${PORT}:${PORT} \
-v ${MODELBASE}:/models:Z \
--device /dev/dri \
--group-add video \
${IMAGE} \
-m /models/${MODEL} \
--host 0.0.0.0 --port ${PORT} \
-fit on -fitc 16384 \
-b 4096 -ub 2048 \
--parallel 1 \
--no-context-shift \
--no-mmap

basically use this (adjust -fitc to tell it how much context you want)

[-]

grumd@reddit

Well DDR4 is indeed slower than my DDR5 but you have multiple issues in your 122b command that hurt performance as well

remove -threads, just use the default. i've noticed using max threads actually hurts performance. try to compare it with default and check if you get better speed
use a higher ubatch (-ub), it DRASTICALLY improves prompt processing speed, from 300tps to 1500tps by going from default 512 ubatch to 2048-4096. yes it will take more VRAM but imo worth it
why are you using --n-gpu-layers 14? you're trying to offload half the layers to CPU and also half the experts to CPU? that's why your speed is so slow. for a moe model your GPU priority is: GPU layers > context > experts. so keep ALL layers on GPU, then fit as much context as you need, then the rest goes to any experts that still can fit. I'm basically using -ngl 99 and --cpu-moe to keep all experts on the CPU and all layers on the GPU.

[-]

Neither-Phone-7264@reddit

Not everyone has 24gb of vram lol

[-]

Top_Influence_3323@reddit

The Qwen family has been impressively consistent across scales. I've been running Qwen 2.5 models (3B and 72B) locally via Ollama for some research work and the quality gap between sizes is surprisingly small for most tasks — the architecture clearly scales well. Curious to see how 3.6 compares on the smaller quantized variants for daily local use.

[-]

jacek2023@reddit (OP)

Bot

[-]

Top_Influence_3323@reddit

NOPE !!!!

[-]

jacek2023@reddit (OP)

talking bot!

[-]

Top_Influence_3323@reddit

Hahaha Lol I wish, bots don't have to pay rent. Just a guy running local models for research, nothing fancy.

[-]

BestSentence4868@reddit

This is so dumb, everyone should be voting for the 122B so we can then distill to the smaller ones.

[-]

jacek2023@reddit (OP)

could you share link to the models you distilled before?

[-]

BestSentence4868@reddit

not public but have distilled sonnet to 397B before

[-]

jacek2023@reddit (OP)

why not public?

[-]

BannedGoNext@reddit

Darn, I was really hoping for a new sexy 122b since Gemma cucked us by not releasing the one they made after announcing it.

[-]

ea_man@reddit

Please release those GUF with a template that makes tools work in opensource orchestrators like OpenCode, even when reasoning is disabled.

[-]

Material_Hour_115@reddit

Interesting results, here I thought most people agreed 35B-A3B was the most interesting flavor of Qwen. Not that I'd complain about having the source for any of them.

[-]

silenceimpaired@reddit

Not for me. It lacks the capability of the dense model.

[-]

Far-Low-4705@reddit

can you elaborate on that? i feel like 35b is more than capable.

it is extremely rare for me to give a engineering problem to 35b that it cant solve but 27b can solve.

also, imo, 27b is just too slow for anything useful. it only runs at 24 T/s for me, and imo, i prefer 50 T/s, or 40 T/s as an absolute minimum to be useable.

[-]

silenceimpaired@reddit

35b MoE might just be based on your tolerance for waiting.

[-]

Material_Hour_115@reddit

Definitely support people voting to their own interest! I think everyone knows 27b is more capable, but 35B-A3B runs significantly better on regular consumer hardware, which makes it interesting from a different direction.

[-]

silenceimpaired@reddit

I think there are two depending contexts for your claims. Does a person have a 24gb vram card… and does a person do agentic work or coding? Because my computer outputs fast enough I can’t finish reading it before it finishes the output

[-]

Material_Hour_115@reddit

That's what I mean by "on regular consumer hardware," although perhaps I should have said "on the average consumer's hardware" for clarity.

24GB VRAM far exceeds what an average person has in their PC.

[-]

BumblebeeParty6389@reddit

If they don't release 35B moe qwen 3.6 will be useless for me. I'm pretty sure there are many people in same situation as me. I really don't get the point of this poll.

[-]

DerDave@reddit

Why is that? Is 27B too slow or what's the issue? I can point you at a cool way, to make it much faster: https://huggingface.co/z-lab/Qwen3.5-27B-DFlash
That brings about 3-5x speedup. Really cool stuff and their paper is interesting too.

[-]

MrBIMC@reddit

Until dflash is supported by llama.cpp stack, it is kinda useless to most of people here.

I hope they will get llama-server support eventually because 3x speed up will change a game completely for a solo gpu deployments.

Current qwen3.5-27b on a solo 3090 fits 161k tokens of contest for me, while giving 43-29 tps depending on context load.

Tripling or quadrupling that into a 100+ tps will make agents a lot more versatile.

Also will see how better 3.6 is, given that 3.5 is already a beast of a local model.

[-]

anthonyg45157@reddit

Was about to download but you cpp comment saved me 😆

Can't wait for this to progress more, I'm torn between 27b and 35 with a 3090

[-]

MrBIMC@reddit

Moe models are generally dumb.

The way to approximate intelligence is do a sqrt(total_params * active_params).

Currently the only moe that is smarter than 27b-dense is 400b one, but most of us here are out of a capability to serve it.

Currently the best models we can serve on a solo 24gb vram card is either qwen3.5-27b or Gemma-4-31b-it.

And qwen is much more mature at this stage, as far as support goes.

[-]

anthonyg45157@reddit

Thank you for the information!

This generally aligns with what I've noticed as well.... Are there any ways to speed up 27b with agentic type coding workflows? Maybe I just need to turn thinking off so it feels more responsive...

[-]

itch-@reddit

I have not found the 27B thinking to be a problem, it only overthinks when it doesn't have tools.

Usually local models are fine in basic chat and don't work in eg Cline because it's too hard. 27B is opposite, it's thinks too long on a simple prompt but does great in Cline because it's smart enough for it and doesn't waste thinking tokens there.

[-]

MrBIMC@reddit

There are bunch of mtp work is being done on the llama.cpp side.

The way they approaching it is having it implementation-agnostic, so many different ways to implement multi token prediction.

It will come provably in a week or two.

Good news here is that 3.5 series already have mtp layer built-in, so worst case scenario we might get a free boost from mtp, without even attaching any draft model. (Though it won’t be as massive as eagle3 or dflash).

Best case scenario is once backend-agnostic mtp logic is merged into llama.cpp, it will open the gates into implementing more tricky approaches like eagle3 and dflash, but those will eat additional memory.

[-]

DerDave@reddit

Yeah, also keep in mind, the DFlash model (for Qwen3.5 27b) takes up another 3.5GB of RAM (in BF16). So it might reduce your context unless you quantize more.

The speed gain is worth it imho. Especially whit the latest context compression breakthroughs.

Also very much looking forward to Qwen3.6.... Let's see how much better it's going to be.

[-]

MrBIMC@reddit

Yeah, that’s why llama.cpp support is essential.

If one goes with vllm, awq-4bit of qwen3.5-27b takes around 19-20gb. Dflash predictor takes another 3.5gb, leaving no room for big context.

With llama.cpp, iq4-nl takes 16gb. Mmproj takes another 1-2gb, depending on a quant, but one has a choice to drop it if they do not care about anything than receiving text as an input. With safetensors you do not have such choice.

So assuming dflash will get supported on llama-server, then having dflash while dropping mmproj and reducing context a bit sounds like a very good choice to go with.

[-]

DerDave@reddit

Yeah that's the dream. Really trying to evangelize dflash to get it some more traction haha

[-]

DerDave@reddit

Yeah, a many people have already requested and some have started working on a llama.cpp port. I'm hoping for it to come soon too. But in the meantime you can run it with vLLM, which is also fine for home use.

[-]

viperx7@reddit

I tried to get this thing to work and it was a mess can you tell me how to use it. I have 3090+4090 on my system Couldn't get it to work with vllm

[-]

yeah-ok@reddit

Let's get the llama team involved, if this would be doable on consumer hardware it would be amazing win for the dense models.

[-]

DerDave@reddit

Absolutely. But not only dense. It works also pretty well on MoE models. They're currently even training a version for Kimi K2.5, so it might even be helpful for hosters.

[-]

BumblebeeParty6389@reddit

Issue is I don't have a gpu so I can only do cpu inference with ik_llama lol

[-]

cafedude@reddit

Qwen3.6-coder-80B

[-]

Iory1998@reddit

I wish for the 80B too.

[-]

cafedude@reddit

Even if it's a bit bigger than the current Qwen3-coder-next, say 90B or even 100B that would be fine. Q3CN is still the best local coding model in my experience. Q3.5-122B was touted by some as being better, but I found it to give me a lot of false results ("tests are passing!" when they weren't, that kind of thing)

[-]

BrightRestaurant5401@reddit

That is a rather complicated question to ask, what if all the models grows or shrinks 2 gb?
does that change the answer? what about the model to context size ratio in gb?

[-]

MrBIMC@reddit

I guess for this case you kinda have to quant down.

Though idk if sub 4-bit quants of any actual use. Some imatrix quants might be decent enough if tasks profile matches the imatrix dataset.

[-]

RandomTrollface@reddit

Qwen 3.5 27b iq3_xxs unsloth is somehow doing pretty well for me in opencode. Even with 80k context in q8_0 it was still able to get work done in a typescript repo I'm working on. Also tried qwen 3.5 35b with cpu offload and gemma 4 31b but they seemed to perform worse in opencode

[-]

Bobylein@reddit

I am using 3bit 27b and 2bit 36b and also tried both with 4bit on 16gb vram, for my actual use cases they did a very good job and I deleted the 4bit later as they were just so slow with no noticeable (to me) improvements over the 3 and 2 bit variants

[-]

MrBIMC@reddit

Yeah, 27b on 16gb is rough, given the model itself will eat most of vram.

At the end of the day name of the game is to find a perfect balance of intelligence, speed, and amount of context one needs for the tasks at hand.

Smaller imatrix quants will probably result in a better intelligence at a slower speed than moe model at a higher quants, but the context will be much more limited.

I did a test run of both gemmas(dense and moe) against an opus-4.6 and codex-5.3 a day ago on a agentic run with a task to analyze the lib change and find the implementation status of that said changes across the internal projects that do use that lib.

And while dense Gemma did take 14 minutes to execute the task, it technically fully succeeded, albeit report it gave was much less wordy than external models.

MoE Gemma failed miserably as it hallucinated the implementation status across a few of the projects, and as for me having report being not reliable is much worse than having to wait a bit for a proper result you can rely on.

[-]

jacek2023@reddit (OP)

what are you talking about?

[-]

Bobylein@reddit

If people would vote tge same if they knew the new model wouldn't fit their vram anymore or the old non fitting would fit in 3.6

Though considering the poll I'd expect the parameters to stay the same anyway

[-]

AdUnlucky9870@reddit

The community voting approach is interesting — it's a smart way for Qwen to prioritize what people actually want vs what looks good on benchmarks.

What I'm most curious about is what the voting categories reveal about where local LLM usage is heading. If the top votes are for coding and reasoning over creative writing or chat, that tells you something about the actual deployment patterns: people are building tools and agents with these models, not just chatting.

For anyone running Qwen models locally — one thing I've noticed is that the quantization story has gotten dramatically better with the 3.x series. The Q4_K_M quants of Qwen 3 hold up surprisingly well compared to the full precision versions, especially for coding tasks. The gap between Q4 and FP16 on code generation is much smaller than it was with the 2.x series.

If 3.6 continues that trend, we might be at the point where a 32B model at Q4 on a 24GB GPU genuinely competes with API models for most practical use cases. That's the real milestone — not benchmark scores, but "good enough that I don't need to send my data to an API."

[-]

Iory1998@reddit

We should give credit to the Qwen team about how good the 3.5 series have been in terms of attention mechanism. The best models where it comes to that. I think that's why models hold well even when they are quantized.

[-]

Iory1998@reddit

Fyi, I voted for the 27B. It's been my daily driver for weeks now, and I don't like the 122B. Intelligence wise, I feel it's on par or worse that the 27B.

[-]

MarcCDB@reddit

Most people can't even run a 26B parameter model at an acceptable quantization.

[-]

jacek2023@reddit (OP)

maybe they are not into local LLMs

[-]

R_Duncan@reddit

Having available an rtx 6000 at work, this is great news! Having a 4060 at home, this s*cks...

[-]

Past-Reception-424@reddit

35B gang. big enough to actually be useful, small enough that you dont need to remortgage your house for the vram. win-win

[-]

Several-Tax31@reddit

Why not open-source all?

[-]

Voxandr@reddit

So looks like the best we will get would be qwen 3.5 -122b A10b for a forseeable future.
Those who are not serious anthusiast will never vote a model above 32 be because they are not going to invest into multiple graphic card or even a strixhalo.

[-]

OmarBessa@reddit

I'm not even surprised

[-]

WithoutReason1729@reddit

Your post is getting popular and we just featured it on our Discord! Come check it out!

You've also been given a special flair for your contribution. We appreciate your post!

I am a bot and this action was performed automatically.

[-]

VoiceApprehensive893@reddit

qwen 3.6 18b please

[-]

Tall-Ad-7742@reddit

i hope we get all the other versions too
i personally would also like a 120b or bigger version and yes i know not everybody can run something big like that but still would be nice for some of us

[-]

Borkato@reddit

Yeah I hope everyone gets all of them!

[-]

tarruda@reddit

Did they say that only the most voted model would be released as open, or that it would simply be the first one?

[-]

jacek2023@reddit (OP)

second image

[-]

festr__@reddit

What they were expecting from this vote? Corporate segment runs >= 400B and nothing else is really relevant for serious inference.

[-]

Fault23@reddit

would be good if we had a 122B too ngl

[-]

leonbollerup@reddit

why 27b.. i get much better result with 35b-A3B

[-]

jacek2023@reddit (OP)

did you vote?

[-]

leonbollerup@reddit

nope.. i dont have X.. dont want it either.. so much .. garbage..

[-]

jacek2023@reddit (OP)

so that's the answer to your question

[-]

leonbollerup@reddit

i never had a question.. it was a statement..

[-]

silenceimpaired@reddit

How? Are you referring to speed? Because the 27b dense for 3.5 went toe to toe with the 120b MoE.

[-]

leonbollerup@reddit

i get around 150 tok/sek with 35b-a3b .. under half with 27b.. i might be a picky bast... but anything below 50 tok/sek is to slow..

[-]

silenceimpaired@reddit

That's what I thought. You're talking speed. 27b performs far better in terms of output quality. Sure you wait a little longer for 27b, but you don't have to go back and ask again or clarify things as much as you might with the 35b MoE.

[-]

henk717@reddit

For me its night and day the other way round. I assume its how closely you stick to what the 35B was designed to do, if you deviate you have to rely on whatever 3B expert improvises and for me doing more unorthodox things that resulted in terrible outputs. I also tried coding something on it and it failed to get it after 8 tries, the 27B got it in 2.

[-]

SufficientPie@reddit

✅ 397B

[-]

synw_@reddit

What about the 4b, my favorite small model?

[-]

Nubinu@reddit

want my tiny 4B

[-]

xgiovio@reddit

why not the a3b? it's faster

[-]

JockY@reddit

I do hope they’re not using this poll to make decisions. I’ll be very sad if the 397B isn’t released.

[-]

AppealThink1733@reddit

Does this mean that 9B will not have be ?

[-]

charmander_cha@reddit

Queria o 9B

[-]

Billysm23@reddit

27 and 9 are perfect

[-]

anthonyg45157@reddit

Wen wen

[-]

NNN_Throwaway2@reddit

If we don't get the 397B I'm gonna be pissed.

[-]

hurdurdur7@reddit

What about a 14B or 20B-A4B ?

[-]

jacek2023@reddit (OP)

not this time

[-]

Significant_Fig_7581@reddit

So we'd get the 35B and the 27B first

[-]

stopbanni@reddit

Just like with 3.5

[-]

Significant_Fig_7581@reddit

Yeah I think that was what they planned first but they still wanted to engage with the community, knowing qwen they are gonna release all the sizes later

[-]

Status_Record_1839@reddit

27B winning is predictable — it's the sweet spot that fits entirely on a single 24GB GPU at Q4_K_M while still being genuinely capable for complex tasks. The MoE options (35B-A3B, 122B-A10B) are interesting but require more careful hardware planning for most people.

Curious whether Qwen 3.6 will keep the hybrid thinking/non-thinking toggle from Qwen3. That feature alone made the 32B model much more flexible for production use — you can disable extended thinking for latency-sensitive calls and enable it for complex reasoning without switching models.

[-]

jacek2023@reddit (OP)

bot