Waiting for Qwen3.6-27B, I have no nails left...
Posted by DOAMOD@reddit | LocalLLaMA | View on Reddit | 80 comments
kiwibonga@reddit
I've been working to increase my TPS on the 27B since I got a taste of the 3.6 35B-A3B. Finally managed to get vLLM to run on my two RTX 5060 Tis with 4-bit AWQ: went from 25 to 60ish t/s, almost 2x faster than ik_llama.cpp's graph split, and PP in the 3000s.
I'm ready. I'm so ready.
viperx7@reddit
Can you give me the args you used? 60 t/s sounds great.
kiwibonga@reddit
exec vllm serve "cyankiwi/Qwen3.5-27B-AWQ-4bit" \
--tensor-parallel-size 2 \
--kv-cache-dtype="fp8" \
--max-model-len 130000 \
--gpu-memory-utilization 0.9 \
--enable-prefix-caching \
--enable-auto-tool-choice \
--tool-call-parser qwen3_xml \
--max-num-seqs 32 \
--reasoning-parser qwen3 \
--default-chat-template-kwargs '{"enable_thinking": false}' \
--port "$PORT" \
--host "$HOST"
Borkato@reddit
Wait, if I have just 1 RTX 3090, can I get faster speeds if I switch to vLLM? I'm at 30 t/s for Q4 27B GGUF with llama.cpp.
kiwibonga@reddit
I don't know. I have two GPUs.
Borkato@reddit
Hm. I asked Claude and it said that it would only help if I’m doing parallel inference or have multiple GPUs, so I’m glad I checked!
putrasherni@reddit
Congrats! Do the PP and TG numbers stay the same at 65K and 131K context lengths?
kiwibonga@reddit
You do get the normal performance degradation over the course of a session as the context window fills up.
Iory1998@reddit
People who are praising the 35B-A3B just don't know how good the 27B is. It's like jumping from a hot hatchback to an actual sports car. The former seems sporty and quick, but under the hood it's your regular hatchback with a few upgrades here and there.
silenceimpaired@reddit
They just released a new 80b MoE and claim it's better than the dense model... We shall see.
CornerLimits@reddit
Yeah, the 27B is super, but the new 35B just throws tools around like a champ. Not all tasks require a deep understanding; sometimes it's enough to run quick and report results. I'm using the 35B as a daily driver and to play around, and it's strong.
For the serious coding I try to do, humans are still much better than any LLM out there.
HungrigerWaldschrat@reddit
With the 35B I get issues at the size of codebase I have, where it answers confidently but incorrectly. Very quick, but I have to really think about how to direct it. The 27B has a lower error rate. It's also a bit better at coding.
The 35B is very hungry for context, reading a lot of files quickly, then forgetting what was relevant. I didn't try turning thinking off, though; that might make a difference.
relmny@reddit
That's what I expected, but... the new 35B really surprised me by finding an error in a multi-turn chat (Minimax-m2.7 made a very wrong statement, which is why I stopped using it). I asked the 397B to double-check it (twice) and it kept saying that the (wrong) statement was right.
Then I asked Qwen3.6-35B (a few times, to confirm) and it always said "sorry, I was wrong before" (meaning the previous statement, from Minimax-m2.7 and also from the 397B, was wrong).
Borkato@reddit
The 27B is just so slow, that’s the issue
Iory1998@reddit
It's slow if it doesn't fit in VRAM. If it fits, you can generate at 30-60 t/s, which is many times faster than you can read.
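Quick back-of-envelope on the reading-speed claim (the ~250 wpm pace and ~0.75 words-per-token ratio are rough assumptions, not measurements):

```python
# Rough comparison of generation speed vs. human reading speed.
words_per_min = 250      # assumed silent-reading pace
words_per_token = 0.75   # rough English words-per-token ratio

reading_tps = words_per_min / words_per_token / 60  # tokens/sec a person reads
gen_low, gen_high = 30, 60                          # t/s range quoted above

print(f"reading ~= {reading_tps:.1f} t/s")
print(f"30-60 t/s is {gen_low / reading_tps:.0f}x to {gen_high / reading_tps:.0f}x reading speed")
```

So even the low end of that range outruns reading by roughly 5x.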
Borkato@reddit
Sure, but not fast enough to avoid spending over a minute on a simple coding task, and without spiraling. At that rate I should just do it myself lol.
Maleficent-Ad5999@reddit
Slow? On which hardware? I get 50-60 t/s on a 5090 at no context... haven't tested with full context size.
ea_man@reddit
You have to test/compare large prompt read speed; that's the killer.
Healthy-Nebula-3603@reddit
With an RTX 3090 I get 35 t/s with 250k context, so it's not that slow.
I prefer 35 t/s and the job done over 140 t/s and retrying many times.
Borkato@reddit
I have a 3090 and am getting 30 t/s on Q4, any advice to get it up to 35?
Healthy-Nebula-3603@reddit
Use llamacpp-server
Borkato@reddit
I do!
anthonyg45157@reddit
I expect 3.6 27b to be slow as it will be a dense model but I wonder if and how they can improve speed.
putrasherni@reddit
The 27B is a slow-ass model, only meant for the GPU-rich with 5090s and Blackwells.
For everyone regular who wants similar performance at fast speed, MoE is king.
switchbanned@reddit
The 35B-A3B feels like it's just slightly too big for me to run on my 4070 with a large enough context window :(. I was getting around 18 tok/s with like 16K context, and when I tried bumping it up to 26K it slowed to a halt. Does that seem right?
putrasherni@reddit
depends what quant you are set up with
switchbanned@reddit
I just realized I was testing Q6 to see how it ran. I should definitely be using a Q4
Iory1998@reddit
Q4 from Bartowski and Unsloth are good.
Xp_12@reddit
Or you can go the poor-ish route and double up on 5060 Tis for 32GB VRAM and run it with dflash. =]
I hit like 130 tok/s in coding tasks at like 64K context.
PANIC_EXCEPTION@reddit
That ain't going to work on my M1 Max. I could run 27b on my M1 Max but it's going to get quadratically worse on agentic coding. 27b is only useful for few-round conversations where you need an immediate high quality output, and is going to suck for portable agents once you start going past 20% max context window. It's bandwidth-constrained.
andy2na@reddit
what are the commands for your setup and are you using a fork of vllm? are they specifically for blackwell or will a 3090 still get decent speeds?
Xp_12@reddit
I don't know that we'd call anything under a 5080 a Blackwell without full SM support. More like Brownwell. It does rely on NVFP4 quantization to be achievable. I don't think you'd get away with 24GB, even with the AWQ that sits at ~21GB, unless you're trying to get like... 8096 context.
putrasherni@reddit
Nicely done! How big is the KV cache at 64K context on vLLM? It's amazing you got the KV cache running well at 64K context length.
I found that on my M5 Max 128GB, dflash tops out at 16-20K. At 32K, the KV cache used by dflash increases to 100GB+, which hits the 128GB MacBook memory limit and nukes.
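For reference, a plain fp16 KV cache should scale linearly with context, something like this sketch (the layer/head numbers are placeholders for a ~27B-class dense model with GQA, not the actual Qwen config):

```python
# Estimate KV-cache size: K and V each store
# n_layers * n_kv_heads * head_dim values per token.
def kv_cache_gib(n_layers, n_kv_heads, head_dim, ctx_len, bytes_per_elem=2):
    return 2 * n_layers * n_kv_heads * head_dim * ctx_len * bytes_per_elem / 1024**3

# Placeholder config: 48 layers, 8 KV heads, head_dim 128, fp16 cache.
for ctx in (16_000, 32_000, 64_000):
    print(f"{ctx:>6} ctx -> {kv_cache_gib(48, 8, 128, ctx):.2f} GiB")
```

Doubling context should only double the cache, so 100GB+ at 32K suggests something else is ballooning (draft-model state, cache duplication, etc.). An fp8 cache, like the `--kv-cache-dtype="fp8"` in the vLLM command above, halves these numbers again.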
Xp_12@reddit
I'm using apolo13x/Qwen3.5-27B-NVFP4 which sits at 19.8gb on disk and I'm pretty much full with model+draft+context.
super1701@reddit
I ran bf16 27b. Not on Blackwell or 5090. Yea it was slow as balls. I’ve personally been liking qwen 3.6. Was using 122b for awhile, which worked well for my current use case.
plopperzzz@reddit
Is it not mostly on par with the 122B? I only have 24GB of VRAM on an old system, so the 27B runs slower than the 122B for me, since I can offload experts to the CPU.
Iory1998@reddit
Honestly, I feel the 27B is better than 122B MoE, but that's my experience.
eidrag@reddit
For Qwen 3.5 I felt the MoE models weren't that good, but 3.6 feels fast and usable. But the 27B is still my benchmark.
GCoderDCoder@reddit
I have replaced my Qwen 3.5 27B node with Gemma 4 31B because it's a significantly better coder. Qwen 3.6 35B is +/- with Qwen 3.5 27B/122B, so I use 3.6 as my agent now because it's much faster. Qwen 3.5 27B runs great on a 5090 (40-50 t/s). I moved my coder-specialist work to 3x 3090s, and it crawls at 15 t/s for Gemma 31B, but with consistently better coding output thus far. If I didn't have multiple hardware options I would possibly keep Qwen 3.5 27B on my 5090 as my generalist, but Qwen 3.6 35B on a 5090 is kinda ridiculously fast; I could try 2-3 times before the 27B finishes once lol.
Iory1998@reddit
EXACTLY! To be honest, even the 122BA3B doesn't exceed the 27B.
SuperChewbacca@reddit
The 122B is actually a 122B-A10B, so 10B active, but yes, in a number of benchmarks the 27B beats it.
Eyelbee@reddit
The problem is, I don't know how good the 3.6 27B can actually be. The gap between 3.5 27B and 3.6 Plus is already very narrow, and the 35B-A3B kind of sits in between. If it's distilled from that, it can't surpass it. If they have other methods to make it better than 3.6 Plus, great. If 3.6 Plus is something like a 100B MoE, it could be possible.
AlternateWitness@reddit
Look at mister rich over here.
Me waiting for Qwen 3.6 9b…
silenceimpaired@reddit
I know! So weird they had a poll to find a winner… then didn’t release the winner.
It’s almost like the unasked poll question was… “What model should we avoid releasing so that you feel compelled to use our API?”
ea_man@reddit
https://qwen.ai/blog?id=4074cca80393150c248e508aa62983f9cb7d27cd
Iory1998@reddit
I disagree... my guess is that they are releasing the entire family, just in phases. Dense models like the 27B take more time to train and fine-tune than MoEs. It's better that they release whatever they have on hand now than wait for all the models to be finished at once.
putrasherni@reddit
They are 100% releasing more free models. The clue is in the wording on the Qwen 3.6 Hugging Face page.
silenceimpaired@reddit
I am okay with being wrong, but the reason I don’t hold your mindset is this release behavior is a deviation from past releases.
rerri@reddit
Have they said they won't release 27b? Or did you just conclood that?
silenceimpaired@reddit
I don’t conclood… ever. Mostly because it isn’t a word.
As for reaching a conclusion, I didn’t even do that. I made a tongue-in-cheek comment on how it looks. Stating how something looks does not mean I have judged it to be as it looks.
Here ends my English lesson.
As for answering you properly… They have said nothing, which is weird since they took a poll then released something different than the winner.
My opinion is they made that poll to get a sense for what type of model they should focus on if they want to cut down their offerings next release.
RedParaglider@reddit
I'm hoping they are going to release all of them and the 27b is simply taking longer to cook.
200206487@reddit
Looking forward to 3.6 122B and 397B.
RiyanTheProBoi@reddit
Qwen 3.6 omni
Own_Suspect5343@reddit
I want 122b version
putrasherni@reddit
I want the 9B dense; I hope it becomes as good as Qwen 3.5 27B.
LeucisticBear@reddit
An updated 80B coder wouldn't hurt either. I bet at this point its performance would be close to Sonnet 4.6 in a lot of things.
Healthy-Nebula-3603@reddit
Me too 😭
ilintar@reddit
I feel like the new team behind Qwen is doing it in a more "marketing-oriented" way, which entails releasing the models with breaks to keep the hype up for longer. Hopefully that's it and not just them releasing one model now as scraps for the open-source community.
Extra-Organization-6@reddit
the 35B-A3B is good but its a MoE pretending to be dense. the 27B actually uses all parameters every forward pass. completely different feel when you run it locally, especially on tasks that need sustained reasoning.
APFrisco@reddit
What do you mean by MoE pretending to be dense?
Lissanro@reddit
My guess is they're referring to the fact that the dense model won the Qwen poll, but a MoE was released in its place. Hopefully Qwen will release the other 3.6 models later.
CheatCodesOfLife@reddit
It's a bot mate, recently commissioned to sell something called "elestio"
https://old.reddit.com/user/Extra-Organization-6?count=25&after=t1_oh97sno
It said that MoE thing because Claude likes to say "X pretending to be Y". Even with a careful prompt, the slop bleeds through.
CheatCodesOfLife@reddit
It's an LLM account. They write things like this now in all lower-case.
https://files.catbox.moe/hll2gv.png
They often have that minor hallucination where they say something stupid to fit the sentence structure:
(sustained reasoning isn't a weak point for sparse models)
Extra-Organization-6@reddit
the 35B-A3B is a mixture of experts model, meaning only a subset of its parameters activate per token. so even though it has 35B total params, each forward pass only uses about 3B of them. the 27B is a traditional dense model where all 27B parameters fire every time. in practice that means the 27B does more compute per token which tends to show up as better reasoning on harder tasks, even though the 35B looks bigger on paper.
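rough numbers, using the common ~2 FLOPs per active parameter rule of thumb (param counts are the advertised ones, so treat this as an illustration, not a benchmark):

```python
# Per-token compute scales with *active* parameters, ~2 FLOPs each.
def flops_per_token(active_params_billions):
    return 2 * active_params_billions * 1e9

dense_27b = flops_per_token(27)  # dense: all 27B params fire every token
moe_a3b = flops_per_token(3)     # MoE: only the ~3B routed subset fires

print(f"dense 27B ~= {dense_27b / moe_a3b:.0f}x the per-token compute of the A3B MoE")
```

that 9x per-token compute gap is roughly why the dense model is slower but tends to do better on harder tasks.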
MarionberryWeird4021@reddit
I'd use it mainly for coding. Any experience?
New-Inspection7034@reddit
Dense models are slower, as all parameters are active. With an MoE, you supposedly get a mixture of experts routed by some magic. What I find is that it really ends up being a roomful of arguing, conceited snobs who can't agree on anything.
dampflokfreund@reddit
Bro, they just released 35B A3B. Don't be so impatient.
Hot-Employ-3399@reddit
It's almost as if the model most awaited by the community, according to their own poll, is actually the model most awaited by the community.
ambient_temp_xeno@reddit
Apparently we have to get with the times, and deal with everything being like Youtube.
soyalemujica@reddit
They're trying to make the 27B not so smart, because if the 35B is this good, they'll tweak the 27B so their paid models are still better.
Iory1998@reddit
That's an unsupported claim, and at this point it's just a conspiracy theory. Qwen doesn't only have the 27B. They have other bigger and smarter models they can keep closed. To be honest, I want Alibaba to develop their own large proprietary model that can rival the big AI labs while keeping open-sourcing the smaller ones. After all, how many enthusiasts can run a 400B model locally?
That way, Alibaba stays competitive and makes money, while keeping us happy and getting our feedback.
RedParaglider@reddit
Obviously they do; you can go run full 3.6 on their site and pay for it now, and it's great. As for conspiracies I do think are true: Google stopped the release of the 124B Gemma because it smokes Google's Flash.
RedParaglider@reddit
That's pretty pessimistic, I think they release the small models as good as they can make them.
OGScottingham@reddit
Yeah, no room for conspiracy nonsense here. Given how good 35B is though, I'm very excited for 27B!!
mandrak4@reddit
Agreed
Majinsei@reddit
Right now I'm glad the MoE didn't win, otherwise I'd be like this meme 🤣🤣🤣
putrasherni@reddit
I hope Qwen 3.6 9B is as good as Qwen 3.5 27B.
That would be great for so many GPU users who own a 3080, 9070 XT, 5070...
Mountain_Patience231@reddit
I hope they can charge me or let me make a donation for the Qwen release. That would give them more motivation to release the next Qwen. I mean, it's honestly the only good local LLM in the world right now.
Safe-Thanks-4242@reddit
2 more weeks xd
Monad_Maya@reddit
Along with the 27B I'm also waiting for the 397B. Please don't disappoint us :(