Waiting for Qwen3.6-27B, I have no nails left...
Posted by DOAMOD@reddit | LocalLLaMA | View on Reddit | 80 comments
kiwibonga@reddit
I've been working to increase my TPS on the 27B since I got a taste of the 3.6 35B-A3B. Finally managed to get vLLM to run on my two RTX 5060 Tis with 4-bit AWQ: went from 25 to 60ish t/s, almost 2x faster than ik_llama.cpp's graph split, and PP in the 3000s.
I'm ready. I'm so ready.
viperx7@reddit
Can you give me the args you used? 60 t/s sounds great.
kiwibonga@reddit
exec vllm serve "cyankiwi/Qwen3.5-27B-AWQ-4bit" \
--tensor-parallel-size 2 \
--kv-cache-dtype="fp8" \
--max-model-len 130000 \
--gpu-memory-utilization 0.9 \
--enable-prefix-caching \
--enable-auto-tool-choice \
--tool-call-parser qwen3_xml \
--max-num-seqs 32 \
--reasoning-parser qwen3 \
--default-chat-template-kwargs '{"enable_thinking": false}' \
--port "$PORT" \
--host "$HOST"
Borkato@reddit
Wait, if I have just 1 RTX 3090, can I get faster speeds if I switch to vLLM? I'm at 30 t/s for Q4 27B GGUF with llama.cpp.
kiwibonga@reddit
I don't know. I have two GPUs.
Borkato@reddit
Hm. I asked Claude and it said that it would only help if I’m doing parallel inference or have multiple GPUs, so I’m glad I checked!
putrasherni@reddit
Congrats! Do the PP and TG numbers stay the same at 65K and 131K context lengths?
kiwibonga@reddit
You do get the normal performance degradation over the course of a session as the context window fills up.
Iory1998@reddit
People who are praising the 35B-A3B just don't know how good the 27B is. It's like jumping from a hot hatchback to an actual sports car. The former seems sporty and quick, but under the hood it's your regular hatchback with a few upgrades here and there.
silenceimpaired@reddit
They just released a new 80b MoE and claim it's better than the dense model... We shall see.
CornerLimits@reddit
Yeah, the 27B is super, but the new 35B just throws tools around like a champ. Not all tasks require a deep understanding; sometimes it's enough to run quick and report results. I'm using the 35B as a daily driver and to play around, and it's strong.
For the serious coding I try to do, humans are still much better than any LLM out there.
HungrigerWaldschrat@reddit
With the 35B I get issues at the size of codebase I have, where it answers confidently but incorrectly. Very quick, but I have to really think about how to direct it. The 27B has a lower error rate. It's also a bit better at coding.
The 35B is very hungry for context, reading a lot of files quickly, then forgetting what was relevant. I didn't try turning thinking off, though; that might make a difference.
relmny@reddit
That's what I expected, but... the new 35B really surprised me by finding an error in a multi-turn chat (Minimax-m2.7 made a very wrong statement, which is why I stopped using it). I asked the 397B to double-check it (twice) and it kept saying that the (wrong) statement was right.
Then I asked Qwen3.6-35B (a few times, to confirm) and it always said "sorry, I was wrong before" (meaning the previous statement, from Minimax-m2.7 and also from the 397B, was wrong).
Borkato@reddit
The 27B is just so slow, that’s the issue
Iory1998@reddit
It's slow if it doesn't fit in VRAM. If it fits, you can generate at 30-60 t/s, which is many times faster than you can read.
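Quick back-of-envelope on the reading-speed claim (the ~250 wpm pace and ~0.75 words-per-token ratio are rough assumptions, not measurements):

```python
# Rough comparison of generation speed vs. human reading speed.
words_per_min = 250      # assumed silent-reading pace
words_per_token = 0.75   # rough English words-per-token ratio

reading_tps = words_per_min / words_per_token / 60  # tokens/sec a person reads
gen_low, gen_high = 30, 60                          # t/s range quoted above

print(f"reading ~= {reading_tps:.1f} t/s")
print(f"30-60 t/s is {gen_low / reading_tps:.0f}x to {gen_high / reading_tps:.0f}x reading speed")
```

So even the low end of that range outruns reading by roughly 5x.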
Borkato@reddit
Sure, but not fast enough to avoid spending over a minute on a simple coding task, and without spiraling. At that rate I should just do it myself lol.
Maleficent-Ad5999@reddit
Slow? On which hardware? I get 50-60 t/s on a 5090 at no context... haven't tested with full context size.
ea_man@reddit
You have to test/compare large prompt read speed; that's the killer.
Healthy-Nebula-3603@reddit
With an RTX 3090 I get 35 t/s with 250k context, so it's not that slow.
I prefer 35 t/s and the job done over 140 t/s and retrying many times.
Borkato@reddit
I have a 3090 and am getting 30 t/s on Q4, any advice to get it up to 35?
Healthy-Nebula-3603@reddit
Use llamacpp-server
Borkato@reddit
I do!
anthonyg45157@reddit
I expect 3.6 27b to be slow as it will be a dense model but I wonder if and how they can improve speed.
putrasherni@reddit
The 27B is a slow-ass model, only meant for the GPU-rich with 5090s and Blackwells.
For everyone regular who wants similar performance at fast speed, MoE is king.
switchbanned@reddit
The 35B-A3B feels like it's just slightly too big for me to run on my 4070 with a large enough context window :(. I was getting around 18 tok/s with like 16K context, and when I tried bumping it up to 26K it slowed to a halt. Does that seem right?
putrasherni@reddit
depends what quant you are set up with
switchbanned@reddit
I just realized I was testing Q6 to see how it ran. I should definitely be using a Q4
Iory1998@reddit
Q4 from Bartowski and Unsloth are good.
Xp_12@reddit
Or you can go the poor-ish route and double up on 5060 Tis for 32GB VRAM and run it with dflash. =]
I hit like 130 tok/s in coding tasks at like 64K context.
PANIC_EXCEPTION@reddit
That ain't going to work on my M1 Max. I could run 27b on my M1 Max but it's going to get quadratically worse on agentic coding. 27b is only useful for few-round conversations where you need an immediate high quality output, and is going to suck for portable agents once you start going past 20% max context window. It's bandwidth-constrained.
andy2na@reddit
what are the commands for your setup and are you using a fork of vllm? are they specifically for blackwell or will a 3090 still get decent speeds?
Xp_12@reddit
I don't know that we'd call anything under a 5080 a Blackwell without full SM support. More like Brownwell. It does rely on NVFP4 quantization to be achievable. I don't think you'd get away with 24GB, even with the AWQ that sits at ~21GB, unless you're trying to get like... 8096 context.
putrasherni@reddit
Nicely done! How big is the KV cache at 64K context on vLLM? It's amazing you got the KV cache running well at 64K context length.
I found that on my M5 Max 128GB, dflash tops out at 16-20K. At 32K, the KV cache used by dflash increases to 100GB+, which hits the 128GB MacBook memory limit and nukes.
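For reference, a plain fp16 KV cache should scale linearly with context, something like this sketch (the layer/head numbers are placeholders for a ~27B-class dense model with GQA, not the actual Qwen config):

```python
# Estimate KV-cache size: K and V each store
# n_layers * n_kv_heads * head_dim values per token.
def kv_cache_gib(n_layers, n_kv_heads, head_dim, ctx_len, bytes_per_elem=2):
    return 2 * n_layers * n_kv_heads * head_dim * ctx_len * bytes_per_elem / 1024**3

# Placeholder config: 48 layers, 8 KV heads, head_dim 128, fp16 cache.
for ctx in (16_000, 32_000, 64_000):
    print(f"{ctx:>6} ctx -> {kv_cache_gib(48, 8, 128, ctx):.2f} GiB")
```

Doubling context should only double the cache, so 100GB+ at 32K suggests something else is ballooning (draft-model state, cache duplication, etc.). An fp8 cache, like the `--kv-cache-dtype="fp8"` in the vLLM command above, halves these numbers again.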
Xp_12@reddit
I'm using apolo13x/Qwen3.5-27B-NVFP4 which sits at 19.8gb on disk and I'm pretty much full with model+draft+context.
super1701@reddit
I ran bf16 27b. Not on Blackwell or 5090. Yea it was slow as balls. I’ve personally been liking qwen 3.6. Was using 122b for awhile, which worked well for my current use case.
plopperzzz@reddit
Is it not mostly on par with the 122B? I only have 24GB of VRAM on an old system, so the 27B runs slower than the 122B for me, since I can offload experts to the CPU.
Iory1998@reddit
Honestly, I feel the 27B is better than 122B MoE, but that's my experience.
eidrag@reddit
For Qwen 3.5 I felt the MoE models weren't that good, but 3.6 feels fast and usable. But the 27B is still my benchmark.
GCoderDCoder@reddit
I have replaced my Qwen 3.5 27B node with Gemma 4 31B because it's a significantly better coder. Qwen 3.6 35B is +/- with Qwen 3.5 27B/122B, so I use 3.6 as my agent now because it's much faster. Qwen 3.5 27B runs great on a 5090 (40-50 t/s). I moved my coder-specialist work to 3x 3090s, and it crawls at 15 t/s for Gemma 31B, but with consistently better coding output thus far. If I didn't have multiple hardware options I would possibly keep Qwen 3.5 27B on my 5090 as my generalist, but Qwen 3.6 35B on a 5090 is kinda ridiculously fast; I could try 2-3 times before the 27B finishes once lol.
Iory1998@reddit
EXACTLY! To be honest, even the 122BA3B doesn't exceed the 27B.
SuperChewbacca@reddit
The 122B is actually a 122B-A10B, so 10B active, but yes, in a number of benchmarks the 27B beats it.
Eyelbee@reddit
The problem is, I don't know how good the 3.6 27B can actually be. The gap between 3.5 27B and 3.6 Plus is already very narrow, and the 35B-A3B kind of sits in between. If it's distilled from that, it can't surpass it. If they have other methods to make it better than 3.6 Plus, great. If 3.6 Plus is something like a 100B MoE, it could be possible.
AlternateWitness@reddit
Look at mister rich over here.
Me waiting for Qwen 3.6 9b…
silenceimpaired@reddit
I know! So weird they had a poll to find a winner… then didn’t release the winner.
It’s almost like the unasked poll question was… “What model should we avoid releasing so that you feel compelled to use our API?”
ea_man@reddit
https://qwen.ai/blog?id=4074cca80393150c248e508aa62983f9cb7d27cd
Iory1998@reddit
I disagree... my guess is that they are releasing the entire family, just in phases. Dense models like the 27B take more time to train and fine-tune than MoEs. It's better that they release whatever they have on hand now than wait for all the models to be finished at once.
putrasherni@reddit
They are 100% releasing more free models. The clue is in the wording on the Qwen 3.6 Hugging Face page.
silenceimpaired@reddit
I am okay with being wrong, but the reason I don’t hold your mindset is this release behavior is a deviation from past releases.
rerri@reddit
Have they said they won't release 27b? Or did you just conclood that?
silenceimpaired@reddit
I don’t conclood… ever. Mostly because it isn’t a word.
As for reaching a conclusion, I didn’t even do that. I made a tongue-in-cheek comment on how it looks. Stating how something looks does not mean I have judged it to be as it looks.
Here ends my English lesson.
As for answering you properly… They have said nothing, which is weird since they took a poll then released something different than the winner.
My opinion is they made that poll to get a sense for what type of model they should focus on if they want to cut down their offerings next release.
RedParaglider@reddit
I'm hoping they are going to release all of them and the 27b is simply taking longer to cook.
200206487@reddit
Looking forward to 3.6 122B and 397B.
RiyanTheProBoi@reddit
Qwen 3.6 omni
Own_Suspect5343@reddit
I want 122b version
putrasherni@reddit
I want the 9B dense; I hope it becomes as good as Qwen 3.5 27B.
LeucisticBear@reddit
An updated 80B coder wouldn't hurt either. I bet at this point its performance would be close to Sonnet 4.6 in a lot of things.
Healthy-Nebula-3603@reddit
Me too 😭
ilintar@reddit
I feel like the new team behind Qwen is doing it in a more "marketing-oriented" way, which entails releasing the models with breaks to keep the hype up for longer. Hopefully that's it and not just them releasing one model now as scraps for the open-source community.
Extra-Organization-6@reddit
the 35B-A3B is good but its a MoE pretending to be dense. the 27B actually uses all parameters every forward pass. completely different feel when you run it locally, especially on tasks that need sustained reasoning.
APFrisco@reddit
What do you mean by MoE pretending to be dense?
Lissanro@reddit
My guess is they're referring to the fact that the dense model won the Qwen poll, but a MoE was released in its place. Hopefully Qwen will release the other 3.6 models later.
CheatCodesOfLife@reddit
It's a bot mate, recently commissioned to sell something called "elestio"
https://old.reddit.com/user/Extra-Organization-6?count=25&after=t1_oh97sno
It said that MoE thing because Claude likes to say "X pretending to be Y". Even with a careful prompt, the slop bleeds through.
CheatCodesOfLife@reddit
It's an LLM account. They write things like this now in all lower-case.
https://files.catbox.moe/hll2gv.png
They often have that minor hallucination where they say something stupid to fit the sentence structure:
(sustained reasoning isn't a weak point for sparse models)
Extra-Organization-6@reddit
the 35B-A3B is a mixture of experts model, meaning only a subset of its parameters activate per token. so even though it has 35B total params, each forward pass only uses about 3B of them. the 27B is a traditional dense model where all 27B parameters fire every time. in practice that means the 27B does more compute per token which tends to show up as better reasoning on harder tasks, even though the 35B looks bigger on paper.
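rough numbers, using the common ~2 FLOPs per active parameter rule of thumb (param counts are the advertised ones, so treat this as an illustration, not a benchmark):

```python
# Per-token compute scales with *active* parameters, ~2 FLOPs each.
def flops_per_token(active_params_billions):
    return 2 * active_params_billions * 1e9

dense_27b = flops_per_token(27)  # dense: all 27B params fire every token
moe_a3b = flops_per_token(3)     # MoE: only the ~3B routed subset fires

print(f"dense 27B ~= {dense_27b / moe_a3b:.0f}x the per-token compute of the A3B MoE")
```

that 9x per-token compute gap is roughly why the dense model is slower but tends to do better on harder tasks.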
MarionberryWeird4021@reddit
I'd use it mainly for coding. Any experience?
New-Inspection7034@reddit
Dense models are slower, as all parameters are active. With an MoE, you supposedly get a mixture of experts routed by some magic. What I find is that it really ends up being a roomful of arguing, conceited snobs who can't agree on anything.
dampflokfreund@reddit
Bro, they just released 35B A3B. Don't be so impatient.
Hot-Employ-3399@reddit
It's almost as if the model most awaited by the community, according to their own poll, is actually the model most awaited by the community.
ambient_temp_xeno@reddit
Apparently we have to get with the times, and deal with everything being like Youtube.
soyalemujica@reddit
They're trying to make the 27B not so smart, because if the 35B is this good, they'll tweak the 27B so their paid models are still better.
Iory1998@reddit
That's an unsupported claim, and at this point it's just a conspiracy theory. Qwen doesn't only have the 27B. They have other bigger and smarter models they can keep closed. To be honest, I want Alibaba to develop their own large proprietary model that can rival the big AI labs while keeping open-sourcing the smaller ones. After all, how many enthusiasts can run a 400B model locally?
That way, Alibaba stays competitive and makes money, while keeping us happy and getting our feedback.
RedParaglider@reddit
Obviously they do; you can go run full 3.6 on their site and pay for it now, and it's great. As for conspiracies I do think are true: Google stopped the release of the 124B Gemma because it smokes Google's Flash.
RedParaglider@reddit
That's pretty pessimistic, I think they release the small models as good as they can make them.
OGScottingham@reddit
Yeah, no room for conspiracy nonsense here. Given how good 35B is though, I'm very excited for 27B!!
mandrak4@reddit
Agreed
Majinsei@reddit
Right now I'm glad the MoE didn't win, otherwise I'd be like this meme 🤣🤣🤣
putrasherni@reddit
I hope Qwen 3.6 9B is as good as Qwen 3.5 27B.
That would be great for so many GPU users who own a 3080, 9070 XT, 5070...
Mountain_Patience231@reddit
I hope they can charge me or let me make a donation for the Qwen release. That would give them more motivation to release the next Qwen. I mean, it's honestly the only good local LLM in the world right now.
Safe-Thanks-4242@reddit
2 more weeks xd
Monad_Maya@reddit
Along with the 27B I'm also waiting for the 397B. Please don't disappoint us :(