Final voting results for Qwen 3.6
Posted by jacek2023@reddit | LocalLLaMA | View on Reddit | 243 comments
7 days have passed. Hopefully, the release will start soon
ambient_temp_xeno@reddit
MoE enjoyers split the vote, densocrats reap the benefits.
SpicyWangz@reddit
They think they won, but REAPs never perform well
Look_0ver_There@reddit
9B is dense too
Daniel_H212@reddit
I think the 27B is for the GPU-rich people with a 90-class GPU and 24/32 GB of VRAM, and then everyone else is not as GPU rich and either running on a GPU with less VRAM (9B) or running on CPU and RAM (MoE)
Borkato@reddit
As someone with 30GB VRAM I MUCH prefer the 35B and get annoyed every time I have to pull out the 27B lol. It’s so slow
Healthy-Nebula-3603@reddit
So now that you have over 24 GB, you can fit a bigger context and use a less compressed model like Q6?
Borkato@reddit
Yup! I use like 100k ctx for coding and such :)
mannydelrio1@reddit
What's your setup ?
Borkato@reddit
1 RTX 3090 and 1 RTX 2060
Look_0ver_There@reddit
As a suggestion (if you're not already doing so), use the 27B for planning and go take a coffee break while it thinks, and then switch back to 35B for the actual implementation
Sticking_to_Decaf@reddit
I tested both 27B and 35B on a 96gb vram Pro 6000. The speed difference was what killed 27B. For an interactive agent (Hermes Agent) with long contexts, tool calling, iteration before output, etc, 35B at NVFP4 absolutely flew. 27B was unusable. Insanely long delay in replies. And the results are excellent. I was very skeptical of MoE until this experience, but am now a convert.
SLxTnT@reddit
For my uses, 35b was too stupid for the task. 122b was a bit dumber than 27b. 27b did a decent job, but was a tad bit slow. 397B needs CPU offloading, but was the best.
Best combination was using VLLM with 27b + dflash. Was about half the speed of 35b.
nihnuhname@reddit
For me, dense models are more important. This is obviously subjective, but I find them better suited for storytelling, roleplaying, brainstorming, etc, especially when working with non-English languages or doing translation.
Sticking_to_Decaf@reddit
I do get slightly better results from the dense models but they are just too slow. Nothing seems to be able to make them snappy enough to use with chat or agents.
Due-Project-7507@reddit
I am running the mradermacher/Qwen3.5-27B-i1-GGUF IQ4_XS on a 16 GB A5000 laptop GPU fully in VRAM with 32k context (turboquant):
huzbum@reddit
How is turboquant working for you? What branch are you running?
Due-Project-7507@reddit
For me it works well (meaning I don't notice a difference). I am using the "feature/turboquant-kv-cache" branch. There is also another fork here, it could be even better, but I did not test it.
huzbum@reddit
Thanks for the reply. I need to establish some benchmarks or something so I can evaluate things like this, model weight quantizations, and different models.
HopePupal@reddit
R9700 and B70 are options too. 7900 XTX if you can put up with tighter quants.
SKirby00@reddit
You don't need to be GPU rich to be able to run the 27B at workable speed and quality. Granted, you can't be GPU poor, but I'm running it on a combination of 60-tier cards (5060Ti 16GB + 3060Ti 8GB + 3060 12GB). I don't recommend this, but it's what I have.
My model of choice right now is the 27B at Q4, which I can run at ~18tok/s (not much slower than Claude). I can also run it at Q6 fully in VRAM, but it drops to ~12tok/s and that difference is honestly enough to get on my nerves. Don't let Reddit convince you that you need a 90-tier GPU to run Qwen3.5-27B at workable speeds.
Haeppchen2010@reddit
RX 7800 XT 16GB + RX 580 8GB running 27B IQ4_XS fine. IQ3_XS on 16GB alone is not that much worse.
Daniel_H212@reddit
Yeah 24 GB runs it fine, 16 is kinda tight though and you lose a lot of context I think? I have 12 GB vram on my main PC so I can't run 27B at any decent quality at all, and it's very slow on my strix halo.
Haeppchen2010@reddit
Yes, I use only 64k context, more than enough for OpenCode with auto compaction.
Zealousideal_Fill285@reddit
Nice setup. What is the token generation speed on this double gpu combo? Also do you think rx 580 8gb is good enough as a second gpu?
Haeppchen2010@reddit
The RX580 is super slow, but still faster than CPU. 62 layers on the RX 7800 XT and 3 layers on the RX 580 give me 17-18 t/s out (llama-server with layer split). With CPU instead of the RX580 it would only be 7. I switch between context sizes and always squeeze as many layers as possible onto the fast card.
I am thinking about upgrading to an RX 7900XTX instead but for now this is ok for playing around.
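For anyone wanting to reproduce it, a rough sketch of the llama-server invocation (the model file and the 62/3 split are just what fits my cards, treat them as placeholders):

    llama-server -m ./Qwen3.5-27B-IQ4_XS.gguf \
      -ngl 99 --split-mode layer --tensor-split 62,3 -c 65536
    # --split-mode layer  : assign whole layers per device instead of splitting rows
    # --tensor-split 62,3 : per-device proportions, which works out to roughly 62 layers
    #                       on the RX 7800 XT and 3 on the RX 580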
ea_man@reddit
I run that on my 6700xt, I am a rich man.
huzbum@reddit
Personally, if I had less VRAM I'd rather run 35b with experts offloaded than 9b.
florinandrei@reddit
Those are middle class, mate.
wektor420@reddit
27B is good for rtx6000 pro and higher owners
Irythros@reddit
If you bought the 6000 pro to run the 27b you really need to do some cost analysis beforehand... That's a complete waste.
nonerequired_@reddit
I am using 27B in my used 1x3090 + 1x3060 and it is good
Healthy-Nebula-3603@reddit
That 3060 is slowing your model down 3x
IrisColt@reddit
But small.
huzbum@reddit
I started with a 3060 and upgraded to a 3090, so I have both. For the 3090, 27b is where it's at (with 4b doing the grunt work on the 3060). I'm getting about 35tps with 27b on the 3090. That's just fast enough, but I wouldn't want to run it on any less hardware. Definitely needs 24GB VRAM.
If I didn't have the 3090, and I was just using the 3060, I would probably use 35b instead of 9b. I can do 35tps on my 3060 doing full GPU offload and offloading experts to CPU.
MoffKalast@reddit
Dense, yes, just not in the same sense.
thrownawaymane@reddit
when a comment passes the Turing test and the vibe test
johngac@reddit
AI can never replace you
temperature_5@reddit
Actually reddit just sold his comment to AI trainers.
Heavy-Focus-1964@reddit
blessedly human comment
Ok_Mammoth589@reddit
Need you watching the next election
huzbum@reddit
How about another sparse 80b like next? I feel like that was a good balance.
huzbum@reddit
Just offload experts to CPU and it’ll run on just about anything as long as you have 64GB system ram.
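Something like this (a minimal sketch with llama.cpp; the GGUF name is a placeholder, the flags are the current expert-offload options):

    llama-server -m ./qwen3.6-moe-Q4_K_M.gguf -ngl 99 --cpu-moe -c 32768
    # -ngl 99   : keep all layers on the GPU
    # --cpu-moe : keep the MoE expert tensors in system RAM (this is where the 64GB goes)
    # use --n-cpu-moe N instead if you only want some of the expert layers on the CPU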
AltruisticList6000@reddit
Eh, still not a 24b. Not suitable for 16gb VRAM
unjustifiably_angry@reddit
If they put out a 24B you'd still need more VRAM to get useful context length.
AltruisticList6000@reddit
Depending on how the model handles context. Mistral Small 22b with its context at Q4 fit into my VRAM, and the 24b's context somehow uses even less VRAM, so despite being a slightly larger model, it takes a bit less VRAM together with its context. So I can fit at most about 50k context fully into VRAM
toffee0_0@reddit
16gb vram people are tortured souls
Bobylein@reddit
Just use a quant or the 9b model, though the quant did better in my tests
AltruisticList6000@reddit
27b would only fit at Q3_S or so and that has severe performance degradation in my experience, so I avoid anything under Q4 quants. 9b is too small and dumb, plus a lot of VRAM remains unused. I'm just sad the 16-24b range, especially 20-24b, is always skipped recently. Way more people have 16gb VRAM than 24gb or 32gb, and a 22-24b model also fits on 24gb VRAM with ease, so it would let more people use them. I don't think a 22-24b Gemma or Qwen would be so much worse in performance than a 27b that they must always add +3b parameters
Kelenkel@reddit
I think they should prioritize models for 16GB of VRAM, since 90% of AMD/NVIDIA consumer GPUs have at most that (only XX90 cards have more); that way more people can try it.
PS: Is it really possible to run 27b in 16gb of VRAM? I tried in the past and failed.
unjustifiably_angry@reddit
9B with multimodal and max context roughly fits 16GB.
ocarina24@reddit
Yes it is, unsloth/qwen3.5-27b Q3_K_S with 32K context and all layers offloaded at 35 tok/sec. Remove the mmproj part and get up to 39 tok/sec. It's pretty fast, smart and very good at tool calling.
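In case it helps, roughly what that looks like with llama-server (the repo/quant tag is a guess at the usual unsloth naming, adjust to whatever you actually pull):

    llama-server -hf unsloth/Qwen3.5-27B-GGUF:Q3_K_S -ngl 99 -c 32768 --no-mmproj
    # --no-mmproj skips loading the vision projector; that is the "remove the mmproj part"
    # trick that buys the extra few tok/sec and a bit of VRAM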
Kelenkel@reddit
Thanks! I'll try!
XxBrando6xX@reddit
I just want the absolute biggest model they have released. I want something open source that’s competing with the absolute bleeding edge
unjustifiably_angry@reddit
MiniMax 2.7 is releasing this weekend
StupidScaredSquirrel@reddit
Qwen aren't the best at making the absolute SOTA large model though, deepseek usually is. Qwen are the best at making SOTA models for their size category, and all are typically much smaller than the largest 600b -1T models out there.
po_stulate@reddit
Not that they aren't the best, they just never made models of that size to begin with.
Far-Low-4705@reddit
yeah, before the 400b, their largest model was still like 200b
StupidScaredSquirrel@reddit
Yeah it's not their segment
jacek2023@reddit (OP)
which model do you use now?
XxBrando6xX@reddit
Qwen3.5 397B A17B at 27 tokens per second
And then GLM5.1 at Q4_K_XL, but I'm only getting like 7 tokens a second with that, which feels VERY slow. So I'm trying to figure out if it dips into slower memory or CPU-allocated memory and that's why
corey_prak@reddit
Sorry in advance for a dumb question, but do you use it in the same way you would use a tool like claude code to create entire apps, or is it more focused in that you give it some very detailed specs so that it performs properly?
I'm new to this all and have been testing different models and harnesses against a basic task and comparing it with claude's output, and there's a few inconsistencies that are more serious as the app matures.
Claude says folks at home may be using it as more of a companion and autocomplete rather than a vibe code generator.
FullOf_Bad_Ideas@reddit
it has Anthropic BS baked in.
I run Qwen 3.5 397B locally, in CC, OpenCode, Roo. I use it the same way I use CC. It sounds the same. It's slower than Opus/Sonnet, but in terms of output quality it's somewhere near Sonnet 4. I use all models as companions or subordinates for translating requirements to code; even Opus messes up often without enough instructions, but if Sonnet 4 could create some apps, Qwen 3.5 397B can do it too. It does things on its own well too.
As per Elon's leak, Sonnet is 1T, so it's not so much bigger than Qwen 397B that the difference becomes super noticeable.
XxBrando6xX@reddit
Not a dumb question, I keep trying to push myself to do more beginner friendly videos for tech that is by no means beginner friendly.
So the answer is yes to both. On principle I don't love the whole AI-as-a-buddy thing, because I believe it comes with a lot of risks and implications. That being said, by self-hosting my LLM I can use it for anything. It could be a buddy, it can also help answer technical issues I run into at work when I'm deploying a new tool for my company, help me vibe code little scripts to make my analyst team's life easier, or fix small customization things in Apex for Salesforce. I also use it to experiment with game design, and for building pipelines for watching the cameras outside my home, all locally. The cool thing is that after the large initial investment it's all "free" minus power, and the Mac Studio I use sips power at a max of 125w vs my 4090 office PC which usually pulls close to 600w
corey_prak@reddit
Thanks for the quick reply!
I have a 3090 that I've been running qwen coder on quantized to 4 bits, and I have this ralph loop that I've built which breaks down the features in a spec into isolated tasks. I've asked claude to complete that task, and then I've run different models with different harnesses and compared their output.
At the end of the day, it can work really well and take time but the tasks have to be really specific, which I def acknowledge may not be that simple for something I'm trying to vibe code. I've gotten used to Claude being able to sort or make decisions through the ambiguity.
When you're doing the game development stuff, are you asking it to build full features or basic snippets to make things faster? I guess it could be both, but I'm curious about your experience with either.
The approach I've taken so far was to accept the tradeoff of speed vs capability, but other than coding and playing around with image gen, I haven't tried anything else with other models outside of technical things like the ones you mention.
I've been offloading context to DRAM to fit as much as I can on the 24GB VRAM, was thinking about splitting layers between one GPU on PCI-e and another connected as an eGPU via oculink -> m2 to see if a dense model with more parameters will produce a result that is closer to something Claude would do.
I don't know. I'm aware that 48GB VRAM is nothing and have learned just how much subscriptions like claude code and codex are being subsidized, which is kind of why I'm trying to get in front of it now...
jacek2023@reddit (OP)
which quant for 397B and what setup?
XxBrando6xX@reddit
Q4 there as well on Mac Studio 512gb
jacek2023@reddit (OP)
what's your minimax speed? I think I had around 20t/s on Q3
lolwutdo@reddit
Where’s 397b in the poll 😕
unjustifiably_angry@reddit
Someone make sure they don't fuck up any layers this time
Mashic@reddit
Just open source all of them. I don't think they have a use case for all of these models themselves.
YRUTROLLINGURSELF@reddit
people (bots) in here acting grateful for a purported meaningful choice as opposed to the 'fuck you' that it really is
Malfun_Eddie@reddit
Qwen 3.5 9b is such a great workhorse. 16GB VRAM can fit the model unquantized with max context.
VirusPanin@reddit
I dunno how the hell you make it work. I've spent the whole day today playing with different models, and specifically Qwen 3.5 9b was always failing at agentic tasks (tried OpenCode and KiloCode). It just randomly stops in the middle of its process, like this: "Okay, task understood, I'll read this file and analyze it." Calls a tool to read the file. Tool runs successfully... Bam, agent stopped, as if it had finished.
Healthy-Nebula-3603@reddit
Yep, that size fits perfectly into 24 GB VRAM and you can keep 200k context at Q8
Iory1998@reddit
No that's not true. There is absolutely no way that the 27B Q8 would fit in 24GB even with 1 token of context size. YOU ARE MISTAKEN.
You can fully offload the Q4_K_M though.
Healthy-Nebula-3603@reddit
What do you see here? Am I lying?
Iory1998@reddit
Dude, you said Q8, do you mean KV cache at Q8? Nowhere in your message did you specify that you are running Q4 of the model, hence why I confirmed that you can fully offload a Q4 to GPU.
Healthy-Nebula-3603@reddit
What do you think Q8 context is? It is literally the cache.
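To spell out the distinction being argued here: model quant and context (KV cache) quant are separate llama.cpp settings. A sketch (the model file is a placeholder):

    llama-server -m ./Qwen3.5-27B-Q4_K_M.gguf -ngl 99 -c 131072 -ctk q8_0 -ctv q8_0
    # -ctk/-ctv q8_0 : quantize the KV cache ("Q8 context"), independent of the Q4 weights
    # quantizing the V cache needs flash attention, which recent builds enable automatically
    # when the backend supports it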
Iory1998@reddit
No it doesn't mean that, maybe for you it does. The language you used was not correct. Even if it did, you didn't specify the model quantization level, so stop trying to spin it around.
Healthy-Nebula-3603@reddit
Sure sure .... whatever you say ....
Winter_Tension5432@reddit
I have 64GB VRAM and am barely able to do Q4 with a 180k context window, prompt processing at 1.3k t/s and decoding at 57 t/s. So good enough for me. Q8 on 24gb is not realistic
Healthy-Nebula-3603@reddit
What is this then?
No_Afternoon_4260@reddit
Funny how it's 40%, 20, 20 and 20 lol!
Rather easy to interpret
Hankdabits@reddit
Tbh I think people were down on dense until qwen 27b. Hadn’t been a good one since Gemma 3 and qwq 32b.
Fair_Ad845@reddit
same, I keep going back to dense models for day-to-day stuff. MoE is great on paper but the memory footprint for the full model is still brutal on consumer hardware.
robertpro01@reddit
Well, it's the smartest one that can fit in 24GB VRAM
AcanthocephalaOk489@reddit
Well ofc.. It was on an American platform :'( I'm a poor APU guy who would've loved to vote for the 122B if the poll weren't on X.
jacek2023@reddit (OP)
what does it mean?
AcanthocephalaOk489@reddit
Unsure of what you didn't understand. So:
I'm poor-ish and so I won't buy nvidias.
Strix Halo was my bang-for-buck, so I would've preferred the bigger MoE (27B is kind of unusable for coding on it -- too slow).
I'm not on X. I dislike those platforms. Being on Starlink is already annoying enough for me, and I don't want to contribute any further to monopolies and billionaires.
jacek2023@reddit (OP)
I voted for 122B. I don't understand why a "poor guy" chooses 122. And what it has to do with X :)
AcanthocephalaOk489@reddit
122 moe runs much faster than the 27. I bought my whole system for cheaper than the modern nvidias.
jacek2023@reddit (OP)
Yes but you need more VRAM, poor guys have 8GB :)
AcanthocephalaOk489@reddit
APU with soldered 128GB
Kahvana@reddit
Really hope we get Qwen3.6-122b-a10b and Qwen3.6-35b-a3b too. Those are genuinely really useful, 27b is often too slow. It's a shame that neither the 397b nor the 2b/4b models were listed.
Iory1998@reddit
I agree with you that the 27B is slow, but I guarantee you it's the best model version in that series. It's so capable when you can run it.
lolwutdo@reddit
Nah, 397b is the best model version in the series.
Iory1998@reddit
Flexing your muscles huh? 😁
lolwutdo@reddit
Ironically I'm GPU poor with a 16gb 5070ti, but I have 128gb ram; iq2xs 397b ends up being faster at token gen than 27b lol
Psychological-Lynx29@reddit
When a 70b model? Llama 70b is really old :(
jacek2023@reddit (OP)
Why do you need a 70B model? 70B dense is very slow
Psychological-Lynx29@reddit
Intelligence. When quantized at q3km it gives pretty good results with multiagentic tasks :)
kaeptnphlop@reddit
They do a vote on X when all of us are here? The hell?
jacek2023@reddit (OP)
why do you think "all of us" are on reddit and not on X?
Thrumpwart@reddit
I ran some testing with several models last night. I gave several models an identical, complex task along with 2 large documents for context and a short but detailed prompt. Of the models I can run on my hardware:
Qwen 3.5 122B MLX 8-bit - winner. High quality reasoning and output.
Minimax m2.5 MLX 4-bit - close 2nd. Did a very good job breaking the task down into smaller components and understanding the role of several interlocking components. Only lost because it missed a crucial section of one of the documents. I suspect a 5- or 6-bit would have won.
Qwen 3.5 27B UD Q8_K-XL - 3rd. Good reasoning, good output but missed the same context as minimax and generally less quality output.
Gemma 4 31B UD Q6_K_XL - close 4th. Good reasoning, more creative than Qwen 27b, but missed the same context and suggested another integration that made no sense. I was genuinely surprised this lost to 27B for my task, as in my experience it has better general reasoning than Qwen 27B. Could be an artifact of persistent inference engine woes, will try again in a week. I should note it got close to 27B in quality despite the Q8 vs Q6 quant difference. Maybe I'll try again tonight on the Mac with a pound-for-pound Q8-Q8 matchup.
Apex quants of Qwen 3.5 122b (I-quality and I-balanced) ggufs - decent reasoning and output, more creative and colourful than the original, but lower-quality reasoning and output. I like the Apex quants as they seem more human in their outputs, but the reasoning suffered from the (q4) quants more than I thought it would.
CriticallyCarmelized@reddit
I’d be interested to hear about your Gemma 4 8-bit test. I’m going to assume you’ve been using llama.cpp and have grabbed the latest updated quant files for your testing since you mentioned “ongoing engine issues”. I’ve been very very pleased with Gemma 4 31B at Q8_K_XL and BF16.
Thrumpwart@reddit
Yeah I just saw the new fixes for Gemma 4. Will re-pull llama tonight after work and try again with new Q6 and Q8 quants.
I had tested on yesterday’s llama pull, will try again tonight in the new pull.
jacek2023@reddit (OP)
but was it agentic workflow or a single prompt? how do you decide who won?
Thrumpwart@reddit
Single prompt, but it fundamentally involves analyzing many discrete components in an LLM engineering technical plan and identifying and evaluating combinations of components, how they interact and synergize, and what the best combination of techniques is.
I evaluated their responses personally and then with Google Gemini. Gemini caught 3 inconsistencies I missed in my evaluation, which led to the rankings above. It was purely a reasoning task, not agentic.
ArtfulGenie69@reddit
Wasting all this time on a freaking vote... Just give me the 122b...
Lissanro@reddit
It seems 397B is not even on the list. That's too bad, because the 397B version is noticeably better than 122B when it comes to following long, complex instructions, while being over twice as fast (as a Q5 quant) as Kimi K2.5 (Q4_X quant) or GLM 5.1 on my rig - so it would be a great middle ground for many use cases.
misha1350@reddit
They want to profit off of Qwen3.6-Plus.
Hytht@reddit
They have a generous free tier with up to 1000 requests for Qwen3.6-Plus in qwen code.
misha1350@reddit
The Bard treatment. Watch it evaporate like how Stepfun's Step 3.5 Flash did on Openrouter yesterday.
rebelSun25@reddit
I'm actually okay with not releasing the 397b or prioritizing the 27b or 122b. The corporate or moneyed interests will pay for hosted inference on the largest model to pay bills. In the end, it's in our interest for the model authors to succeed and stick around.
I can get my employer to pay $$$ for the best model, while I use the smaller models for personal use
waitmarks@reddit
I agree, it seems like there was some internal disagreement in the team after they released 3.5. It seemed like management didn't want them to release the big one and that caused the team to get broken up. My guess is 3.6 was made specifically so they could have a slightly better model that is closed.
tengo_harambe@reddit
Don't Twitter polls have 4 options maximum? So it's possible they didn't include 397B because these 4 are presumed to be the most popular.
Expensive-Paint-9490@reddit
Qwen seems to not want to open-weight its best models anymore. But, at the same time, it wants to keep its fame as open-weight saviour.
tengo_harambe@reddit
This is not a recent change. The top Qwen model since Qwen2.5 has always been proprietary
TKGaming_11@reddit
The closed Qwen 3.5 Plus is just the open-weight Qwen 3.5 397B model with extended context and native tool calling. For Qwen 3.6 they are locking away the 397B to be API only. That is a change from Qwen 3.5 to Qwen 3.6, so absolutely a recent change
tengo_harambe@reddit
Is this confirmed or just conjecture? First I've heard of it
TKGaming_11@reddit
Confirmed
Fault23@reddit
fair
miniocz@reddit
Same. I can run 397B at Q3. That is not the case for the other two big models (well, I can at 1 t/s, but not for chat).
jacek2023@reddit (OP)
see the second image for details
RetiredApostle@reddit
Voters seem to just want to compare it with Gemma, rather than having a decent dense 9B in the toolset :(
inevitabledeath3@reddit
Why do we need 9B models?
Adventurous-Gold6413@reddit
Because some people can’t run higher parameter models
inevitabledeath3@reddit
Well some people should buy more or better GPUs.
Saegifu@reddit
Well some people should learn more about empathy or being human.
ebolathrowawayy@reddit
when will empathy or being human lift me out of poverty? the epstein class states it plainly - do whatever you want that most benefits you and fuck everyone else. until we have a real society again, with safety nets that encourage growth, then fuck the system, fuck everyone, wild west or bust.
Saegifu@reddit
Well, the world is in such a state exactly because of people like you thinking "fuck everyone".
inevitabledeath3@reddit
We are talking about running models at home FFS. If you need AI models especially capable ones you can rent them from the cloud for very little money. Renting a 30B parameter model would cost less than the price of a machine to run a 9B model anyway.
Yu2sama@reddit
9B is capable enough for basic stuff and RAG. A 30B is without a doubt smarter and has more knowledge, but with RAG some of that evens out. Not everyone is using these models for agentic tasks or coding tbf.
misha1350@reddit
Not everyone has a 24GB dGPU at the ready. 9B means that anyone with an 8GB dGPU would be able to run it. And at 12GB on the likes of an RTX 3060, with a big enough context window
inevitabledeath3@reddit
What would you use a model that size for though? I am having a hard time finding a good use for even a 27B model, nevermind 9B.
CriticallyCarmelized@reddit
I’m with you. I’m having a hard time figuring out what people are using tiny models for. They are dumb as bricks. I suppose if you are fine tuning them for very specific one off tasks they will work fine. But I seriously doubt most people are training their own fine tunes for customized pipelines. And anyone can run the 27B-A3B at a minimum using ram offloading and get decent performance.
RetiredApostle@reddit
I use in my workflow: 27-31B as a strong tool caller, 9B as a smaller tool caller, and E4B as a fast multilingual synthesizer. It would be great to replace that 31B with the 3.6 9B (anticipating the strength).
grumd@reddit
Funnily enough I can run all of these models locally except for 27B :( The most I can run with 27B is like IQ3_S, but with expert offloading even 122B is doable at Q4_K
Last_Mastod0n@reddit
27B is just soooo slow. Even 32b a3b is like 1/2 as fast as Gemma 4 with the same vram reqs on my 4090
Prestigious-Use5483@reddit
I think it's slow because it overthinks. When it doesn't, it's not that bad.
Last_Mastod0n@reddit
Most definitely so. Reasoning took an unbearably long time without a token cap
randylush@reddit
Is a token cap the best way to tune thinking? Is there a "thinking stop probability" parameter that can be dialed in?
Top-Rub-4670@reddit
Not yet, there is --reasoning-budget-message but it's injected when the token cap is reached, it doesn't give the model a chance to wrap up afaik.
GrungeWerX@reddit
It’s not that slow. They have wrong settings.
grumd@reddit
I feel like you're talking about 3 different models there. What's 32b-a3b? Which size of Gemma 4?
Anyway yeah 27B is slow-ish but when fully on GPU it's not that slow. I think the Q3 quants usually give me 40 tps tg. It's just that I need to use a shitty quant to be able to fit it into my 16GB VRAM
Last_Mastod0n@reddit
I should've been more specific. I was referring to qwen 3.5 27B 4 bit quant and qwen 3.5 32b a3b 4 bit. Both which fit fully in my vram. Now I am running Gemma 4 26b a4b 6 bit quantized with some expert layers offloaded to the CPU, and it still runs over 2x as fast as qwen 3.5 27B 4 bit quant.
Don't get me wrong, I absolutely love Qwen 3.5. It was initially what made my personal project business idea viable with its vision capabilities. It's just that Gemma 4 has beaten it in every single metric that pertains to my project. I would be happy to switch back to qwen if they release a superior model again.
grumd@reddit
Okay I see. It's not 32b, it's 35b, that's why I misunderstood.
Yeah 35B and 26B are much faster than 27B but they are both WAAYYYY dumber than 27B. You're getting more shitty responses faster lol, imo quality is more important
But yeah if 26B works for you then that's great! You can always switch to 27B when you start noticing that 26B lacks quality for more complex tasks
Dabalam@reddit
Most of their own documentation seems to indicate the similar sized dense model is only somewhat smarter across the board which is why people debate about dense vs. MoE models.
0xbeda@reddit
Why is that?
On 7900XTX with 24GB VRAM I can manage 27B-Q5_K_M with about 26 tokens/s but with the 122B-Q4-K-M and a lot of offloading I get only 6 token/s.
grumd@reddit
Well I have 16GB VRAM and 96GB RAM. 27B the most I can do is IQ3_XS, with 122B I can do Q4_K_XL.
27B-Q3 is ~40-60 tps, 122B-Q4 is ~15-20 tps
Maybe your RAM is not fast enough? 6 t/s is what I was getting with NVME expert offloading lol
0xbeda@reddit
I'm using llama.cpp docker with vulkan. I tuned it so it fits my VRAM with desktop and about 1-3 GB left. I have 128GB of DDR4-3200 CL16 (Kingston KF3200C16D4/32GX) and a 5950X on a Gigabyte X570S. GPU is a Sapphire Nitro+ 7900XTX with 24GB.
Qwen 3.5 27B Q5_K_M
Qwen 3.5 122B MoE Q4_K_M
grumd@reddit
Btw after you run the model with my command and see how much better the performance is, you may want to try Q5_K_M instead of Q4 for higher quality
grumd@reddit
Adding to my comment - considering you have 24gb vram + 128gb ram, you can actually just use -fit to let llama automatically offload everything efficiently. Basically use this (adjust -fitc to tell it how much context you want).
grumd@reddit
Well DDR4 is indeed slower than my DDR5, but you have multiple issues in your 122b command that hurt performance as well:
- --threads: just use the default. I've noticed using max threads actually hurts performance. Try comparing it with the default and check if you get better speed.
- -ngl 99 and --cpu-moe to keep all experts on the CPU and all layers on the GPU.
Neither-Phone-7264@reddit
Not everyone has 24gb of vram lol
Top_Influence_3323@reddit
The Qwen family has been impressively consistent across scales. I've been running Qwen 2.5 models (3B and 72B) locally via Ollama for some research work and the quality gap between sizes is surprisingly small for most tasks — the architecture clearly scales well. Curious to see how 3.6 compares on the smaller quantized variants for daily local use.
jacek2023@reddit (OP)
Bot
Top_Influence_3323@reddit
NOPE !!!!
jacek2023@reddit (OP)
talking bot!
Top_Influence_3323@reddit
Hahaha Lol I wish, bots don't have to pay rent. Just a guy running local models for research, nothing fancy.
BestSentence4868@reddit
This is so dumb, everyone should be voting for the 122B so we can then distill to the smaller ones.
jacek2023@reddit (OP)
could you share link to the models you distilled before?
BestSentence4868@reddit
not public but have distilled sonnet to 397B before
jacek2023@reddit (OP)
why not public?
BannedGoNext@reddit
Darn, I was really hoping for a new sexy 122b since Gemma cucked us by not releasing the one they made after announcing it.
ea_man@reddit
Please release those GGUFs with a template that makes tools work in open-source orchestrators like OpenCode, even when reasoning is disabled.
Material_Hour_115@reddit
Interesting results, here I thought most people agreed 35B-A3B was the most interesting flavor of Qwen. Not that I'd complain about having the source for any of them.
silenceimpaired@reddit
Not for me. It lacks the capability of the dense model.
Far-Low-4705@reddit
Can you elaborate on that? I feel like 35b is more than capable.
It is extremely rare for me to give 35b an engineering problem that it can't solve but 27b can.
Also, imo, 27b is just too slow for anything useful. It only runs at 24 T/s for me, and I prefer 50 T/s, or 40 T/s as an absolute minimum to be usable.
silenceimpaired@reddit
35b MoE might just be based on your tolerance for waiting.
Material_Hour_115@reddit
Definitely support people voting to their own interest! I think everyone knows 27b is more capable, but 35B-A3B runs significantly better on regular consumer hardware, which makes it interesting from a different direction.
silenceimpaired@reddit
I think there are two contexts your claims depend on: does a person have a 24gb VRAM card, and do they do agentic work or coding? Because my computer outputs fast enough that I can't finish reading before it finishes the output
Material_Hour_115@reddit
That's what I mean by "on regular consumer hardware," although perhaps I should have said "on the average consumer's hardware" for clarity.
24GB VRAM far exceeds what an average person has in their PC.
BumblebeeParty6389@reddit
If they don't release the 35B MoE, qwen 3.6 will be useless for me. I'm pretty sure there are many people in the same situation as me. I really don't get the point of this poll.
DerDave@reddit
Why is that? Is 27B too slow or what's the issue? I can point you at a cool way to make it much faster: https://huggingface.co/z-lab/Qwen3.5-27B-DFlash
That brings about 3-5x speedup. Really cool stuff and their paper is interesting too.
MrBIMC@reddit
Until dflash is supported by the llama.cpp stack, it is kinda useless to most people here.
I hope they will get llama-server support eventually, because a 3x speedup will change the game completely for solo GPU deployments.
Current qwen3.5-27b on a solo 3090 fits 161k tokens of context for me, while giving 43-29 tps depending on context load.
Tripling or quadrupling that into 100+ tps will make agents a lot more versatile.
Also will see how better 3.6 is, given that 3.5 is already a beast of a local model.
anthonyg45157@reddit
Was about to download but your cpp comment saved me 😆
Can't wait for this to progress more, I'm torn between 27b and 35 with a 3090
MrBIMC@reddit
Moe models are generally dumb.
The way to approximate intelligence is to take sqrt(total_params * active_params).
Currently the only moe that is smarter than 27b-dense is 400b one, but most of us here are out of a capability to serve it.
Currently the best models we can serve on a solo 24gb vram card is either qwen3.5-27b or Gemma-4-31b-it.
And qwen is much more mature at this stage, as far as support goes.
anthonyg45157@reddit
Thank you for the information!
This generally aligns with what I've noticed as well.... Are there any ways to speed up 27b with agentic type coding workflows? Maybe I just need to turn thinking off so it feels more responsive...
itch-@reddit
I have not found the 27B thinking to be a problem, it only overthinks when it doesn't have tools.
Usually local models are fine in basic chat and don't work in e.g. Cline because it's too hard. 27B is the opposite: it thinks too long on a simple prompt but does great in Cline because it's smart enough for it and doesn't waste thinking tokens there.
MrBIMC@reddit
There is a bunch of MTP work being done on the llama.cpp side.
The way they are approaching it is to keep it implementation-agnostic, so there are many different ways to implement multi-token prediction.
It will probably come in a week or two.
Good news here is that 3.5 series already have mtp layer built-in, so worst case scenario we might get a free boost from mtp, without even attaching any draft model. (Though it won’t be as massive as eagle3 or dflash).
Best case scenario is once backend-agnostic mtp logic is merged into llama.cpp, it will open the gates into implementing more tricky approaches like eagle3 and dflash, but those will eat additional memory.
DerDave@reddit
Yeah, also keep in mind, the DFlash model (for Qwen3.5 27b) takes up another 3.5GB of RAM (in BF16). So it might reduce your context unless you quantize more.
The speed gain is worth it imho. Especially with the latest context compression breakthroughs.
Also very much looking forward to Qwen3.6.... Let's see how much better it's going to be.
MrBIMC@reddit
Yeah, that’s why llama.cpp support is essential.
If one goes with vllm, awq-4bit of qwen3.5-27b takes around 19-20gb. Dflash predictor takes another 3.5gb, leaving no room for big context.
With llama.cpp, iq4-nl takes 16gb. Mmproj takes another 1-2gb depending on the quant, but one has the choice to drop it if they do not care about anything other than text input. With safetensors you do not have that choice.
So assuming dflash will get supported on llama-server, then having dflash while dropping mmproj and reducing context a bit sounds like a very good choice to go with.
DerDave@reddit
Yeah that's the dream. Really trying to evangelize dflash to get it some more traction haha
DerDave@reddit
Yeah, many people have already requested it and some have started working on a llama.cpp port. I'm hoping for it to come soon too. But in the meantime you can run it with vLLM, which is also fine for home use.
viperx7@reddit
I tried to get this thing to work and it was a mess, can you tell me how to use it? I have a 3090 + 4090 in my system. Couldn't get it to work with vLLM
yeah-ok@reddit
Let's get the llama team involved, if this would be doable on consumer hardware it would be amazing win for the dense models.
DerDave@reddit
Absolutely. But not only dense. It works also pretty well on MoE models. They're currently even training a version for Kimi K2.5, so it might even be helpful for hosters.
BumblebeeParty6389@reddit
Issue is I don't have a gpu so I can only do cpu inference with ik_llama lol
cafedude@reddit
Qwen3.6-coder-80B
Iory1998@reddit
I wish for the 80B too.
cafedude@reddit
Even if it's a bit bigger than the current Qwen3-coder-next, say 90B or even 100B that would be fine. Q3CN is still the best local coding model in my experience. Q3.5-122B was touted by some as being better, but I found it to give me a lot of false results ("tests are passing!" when they weren't, that kind of thing)
BrightRestaurant5401@reddit
That is a rather complicated question to ask. What if all the models grow or shrink by 2 GB?
Does that change the answer? What about the model-to-context size ratio in GB?
MrBIMC@reddit
I guess for this case you kinda have to quant down.
Though idk if sub-4-bit quants are of any actual use. Some imatrix quants might be decent enough if the task profile matches the imatrix dataset.
RandomTrollface@reddit
Qwen 3.5 27b iq3_xxs unsloth is somehow doing pretty well for me in opencode. Even with 80k context in q8_0 it was still able to get work done in a typescript repo I'm working on. Also tried qwen 3.5 35b with cpu offload and gemma 4 31b but they seemed to perform worse in opencode
Bobylein@reddit
I am using 3bit 27b and 2bit 36b, and also tried both with 4bit on 16gb vram. For my actual use cases they did a very good job, and I deleted the 4bit later as they were just so slow with no noticeable (to me) improvements over the 3 and 2 bit variants
MrBIMC@reddit
Yeah, 27b on 16gb is rough, given the model itself will eat most of vram.
At the end of the day the name of the game is to find a perfect balance of intelligence, speed, and the amount of context one needs for the tasks at hand.
Smaller imatrix quants will probably result in better intelligence at a slower speed than a MoE model at a higher quant, but the context will be much more limited.
I did a test run of both Gemmas (dense and MoE) against opus-4.6 and codex-5.3 a day ago, on an agentic run with a task to analyze a lib change and find the implementation status of said changes across the internal projects that use that lib.
And while dense Gemma did take 14 minutes to execute the task, it technically fully succeeded, albeit the report it gave was much less wordy than the external models'.
MoE Gemma failed miserably as it hallucinated the implementation status across a few of the projects, and for me, having a report that isn't reliable is much worse than having to wait a bit for a proper result you can rely on.
jacek2023@reddit (OP)
what are you talking about?
Bobylein@reddit
Whether people would vote the same if they knew the new model wouldn't fit in their VRAM anymore, or that the old non-fitting one would fit in 3.6.
Though considering the poll I'd expect the parameter counts to stay the same anyway
AdUnlucky9870@reddit
The community voting approach is interesting — it's a smart way for Qwen to prioritize what people actually want vs what looks good on benchmarks.
What I'm most curious about is what the voting categories reveal about where local LLM usage is heading. If the top votes are for coding and reasoning over creative writing or chat, that tells you something about the actual deployment patterns: people are building tools and agents with these models, not just chatting.
For anyone running Qwen models locally — one thing I've noticed is that the quantization story has gotten dramatically better with the 3.x series. The Q4_K_M quants of Qwen 3 hold up surprisingly well compared to the full precision versions, especially for coding tasks. The gap between Q4 and FP16 on code generation is much smaller than it was with the 2.x series.
If 3.6 continues that trend, we might be at the point where a 32B model at Q4 on a 24GB GPU genuinely competes with API models for most practical use cases. That's the real milestone — not benchmark scores, but "good enough that I don't need to send my data to an API."
Iory1998@reddit
We should give credit to the Qwen team for how good the 3.5 series has been in terms of attention mechanism. The best models when it comes to that. I think that's why the models hold up well even when they are quantized.
Iory1998@reddit
Fyi, I voted for the 27B. It's been my daily driver for weeks now, and I don't like the 122B. Intelligence-wise, I feel it's on par with or worse than the 27B.
MarcCDB@reddit
Most people can't even run a 26B parameter model at an acceptable quantization.
jacek2023@reddit (OP)
maybe they are not into local LLMs
R_Duncan@reddit
Having available an rtx 6000 at work, this is great news! Having a 4060 at home, this s*cks...
Past-Reception-424@reddit
35B gang. Big enough to actually be useful, small enough that you don't need to remortgage your house for the vram. Win-win
Several-Tax31@reddit
Why not open-source all?
Voxandr@reddit
So it looks like the best we will get is the Qwen 3.5 122B-A10B for the foreseeable future.
Those who are not serious enthusiasts will never vote for a model above 32B, because they are not going to invest in multiple graphics cards or even a Strix Halo.
OmarBessa@reddit
I'm not even surprised
VoiceApprehensive893@reddit
qwen 3.6 18b please
Tall-Ad-7742@reddit
I hope we get all the other versions too.
I personally would also like a 120b or bigger version, and yes, I know not everybody can run something big like that, but it would still be nice for some of us
Borkato@reddit
Yeah I hope everyone gets all of them!
tarruda@reddit
Did they say that only the most voted model would be released as open, or that it would simply be the first one?
jacek2023@reddit (OP)
second image
festr__@reddit
What they were expecting from this vote? Corporate segment runs >= 400B and nothing else is really relevant for serious inference.
Fault23@reddit
would be good if we had a 122B too ngl
leonbollerup@reddit
why 27b.. I get much better results with 35b-A3B
jacek2023@reddit (OP)
did you vote?
leonbollerup@reddit
nope.. i dont have X.. dont want it either.. so much .. garbage..
jacek2023@reddit (OP)
so that's the answer to your question
leonbollerup@reddit
i never had a question.. it was a statement..
silenceimpaired@reddit
How? Are you referring to speed? Because the 27b dense for 3.5 went toe to toe with the 120b MoE.
leonbollerup@reddit
I get around 150 tok/sec with 35b-a3b.. under half that with 27b.. I might be a picky bast... but anything below 50 tok/sec is too slow..
silenceimpaired@reddit
That's what I thought. You're talking speed. 27b performs far better in terms of output quality. Sure you wait a little longer for 27b, but you don't have to go back and ask again or clarify things as much as you might with the 35b MoE.
henk717@reddit
For me it's night and day the other way round. I assume it's how closely you stick to what the 35B was designed to do; if you deviate you have to rely on whatever the 3B of active experts improvises, and for me, doing more unorthodox things, that resulted in terrible outputs. I also tried coding something on it and it failed to get it after 8 tries, while the 27B got it in 2.
SufficientPie@reddit
✅ 397B
synw_@reddit
What about the 4b, my favorite small model?
Nubinu@reddit
want my tiny 4B
xgiovio@reddit
why not the a3b? it's faster
__JockY__@reddit
I do hope they’re not using this poll to make decisions. I’ll be very sad if the 397B isn’t released.
AppealThink1733@reddit
Does this mean that the 9B will not be released?
charmander_cha@reddit
I wanted the 9B
Billysm23@reddit
27 and 9 are perfect
anthonyg45157@reddit
Wen wen
NNN_Throwaway2@reddit
If we don't get the 397B I'm gonna be pissed.
hurdurdur7@reddit
What about a 14B or 20B-A4B ?
jacek2023@reddit (OP)
not this time
Significant_Fig_7581@reddit
So we'd get the 35B and the 27B first
stopbanni@reddit
Just like with 3.5
Significant_Fig_7581@reddit
Yeah I think that was what they planned first but they still wanted to engage with the community, knowing qwen they are gonna release all the sizes later
Status_Record_1839@reddit
27B winning is predictable — it's the sweet spot that fits entirely on a single 24GB GPU at Q4_K_M while still being genuinely capable for complex tasks. The MoE options (35B-A3B, 122B-A10B) are interesting but require more careful hardware planning for most people.
Curious whether Qwen 3.6 will keep the hybrid thinking/non-thinking toggle from Qwen3. That feature alone made the 32B model much more flexible for production use — you can disable extended thinking for latency-sensitive calls and enable it for complex reasoning without switching models.
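A hedged sketch of what that toggle looks like in practice, using the /no_think soft switch documented for Qwen3's chat template against a local llama-server OpenAI-compatible endpoint (whether 3.6 keeps the switch is exactly the open question):

    curl http://localhost:8080/v1/chat/completions \
      -H "Content-Type: application/json" \
      -d '{"messages": [{"role": "user", "content": "Summarize this diff in one line. /no_think"}]}'
    # drop the /no_think suffix (or send /think) when you want the extended reasoning pass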
jacek2023@reddit (OP)
bot