Qwen will release another 27B with high probability
Posted by serige@reddit | LocalLLaMA | View on Reddit | 212 comments
pseudonerv@reddit
“Not hard to create another … now” WTF does it even mean? They don’t even have it now. They didn’t even care to train it. And glazers here think they're doing you a favor by saying that?
LagOps91@reddit
because they have the training pipeline figured out and now have stronger models to distill from?
pseudonerv@reddit
Labs typically train a set of model sizes to test architecture and scaling. They don’t waste their compute to train extra models just because you wished it.
LagOps91@reddit
scaling sure, but architecture? it's the same for 3.5 as it is for 3.6 and will remain the same most likely for 3.7. and no, they don't train anything because i wish it. it's just easier to train an architecture further that you already built support for, obviously.
SV_SV_SV@reddit
So what's up with qwen now, havent they slashed the AI department massively recently..? Are they still just riding that momentum, or is there genuine chance that their innovative march can go on?
phenotype001@reddit
Now that it got attention, it's definitely happening.
StupidScaredSquirrel@reddit
No 35b a3b for us gpu poors? I think that model really made it very accessible for everyone with a basic "gaming" laptop to be able to run powerful local models
peligroso@reddit
27B overrated compared to the MoE
ShadyShroomz@reddit
It's not even close. 27b is like 10 times smarter than 35b moe. 27b usually beats 122b moe even... It's insane how good 27b is. You don't get similar perf until you get up into like the 300b+ moes with 20b+ active..
All my benchmarks have 3.6 27b blowing the 35 moe out of the water.
mbrodie@reddit
Gonna disagree here I spent the better part of a week testing the 35b against the 27b with MTP and out of all the quants available I found 2 x q8 35b that perform better than any of the 27b quants.
In long reasoning 150k + token requests the 27b often starts getting lazy and falsifying results but the 35b stays locked in and on task.
These are both tested on 72gb of vram with full 262k context and optimised as good as I could get them and it took days to get the most optimised settings
By the end the 27b mtp was running at like 800 infil / 68 tps outfil and the 35b was running on like 4100 infil and 95 outfil while remaining on task and delivering quality work for a fraction of the footprint.
Tests were done directly to API / through opencode and through pi agent
Tested something like
35b -
Q8 : 7 ggufs
Q6 : 4 ggufs
Q5 : 4 ggufs
Q4 : 6 ggufs
27b -
Q8 : 6 ggufs
Q6 : 6 ggufs
Q5 : 6 ggufs
Q4 : 6 ggufs
MTP was like 4 different quants across each
And despite what any release group says.. quantising the cache on these models 100% hamstrings them, even the good working ones: if you change the kvs they start performing terribly in real world usage.
These were all tested on multiple real world codebase examples not random benchmarks, lua, c++, c#
One of the best 35b quants I’ve found is from a releaser called smoffyy on Hugging Face; it basically beat out all the 27b MTPs and found an edge case I'd never seen flagged before, which was confirmed by Claude and gpt 5.5 independently
ShadyShroomz@reddit
I've only tested fp8 and fp16 on both with vllm. Any type of logic puzzle or anything like that, 27b wins by a mile ... I have a whole front end design test and js logic test too, again 27b wins 99% of the time ..
nasduia@reddit
Did you test FP8 KV cache compared to BF16 on any tasks?
ShadyShroomz@reddit
I have and found compressing kv cache to lead to major degradation even at fp8 so I never played around too much. I always use bf16
po_stulate@reddit
Looks like it's just a regular quant that comes straight out of llama-quantize, if that's the best quant then many quants would be the best quant.
Kitchen-Year-8434@reddit
Or most quants screw around with different precisions at different layers with various smoothing and relocating algorithms that end up making more of a mess than they're worth. :)
Southern_Sun_2106@reddit
I can confirm that - I tested 35B to its limit of 262K, and it was calling tools, etc. as if it was in the first 10K - no degradation at all. While 27B does indeed get lazy and makes up shit. 35B is just nuts, I've never experienced such awesome goodness at 262K with any model before. In fact, it 'feels' like it can do higher context. I wonder if there's a way to test that.
vick2djax@reddit
Could you give an example of the differences you noticed in q4 KV vs q8 maybe? I ended up trading KV for context and am running q4 context. But I didn’t notice a difference in my rag retrieval other than MOE gave way better answers than dense at twice the speed.
mbrodie@reddit
Generally on long context tool calling heavy requests it would often almost seem like it was in a rush to finish as quickly as possible and would mess up tool calls, and on occasion even forget what we were doing or completely contradict itself in the next response, even to the point of faking validation to move on…
It also seemed very impatient, which I know is ridiculous, but the temperament changes: they seem to get a lot more flaky / half-ass things compared to full kv…
Mileage will also vary im in a lucky position where I don’t have to compromise on quant quality etc… so I can see them all acting at their full weights compared to quant versions…
But I agree the MoE will do more for you with less hardware… but that being said the 27b was probably better overall across all tests but there was literally 2 standout Q8 MoEs which just ended up being better
EstarriolOfTheEast@reddit
What topic do your benchmarks cover? What are you using them on? I am not finding this to be true, for me, the 27B is nowhere near the 122B MoE. I do scientific programming and probabilistic modeling but am also a hobbyist game dev.
ShadyShroomz@reddit
also most public benchmarks are similar: https://artificialanalysis.ai/leaderboards/models?weights=open&size=small%2Cmedium%2Ctiny
27b beats 122b here as well
EstarriolOfTheEast@reddit
Benchmarks are not good predictors for real world use. They are known and trained for while also covering only common use-cases.
In particular, small models are liable to do well on those because they were trained for those tests but generalize worse than larger models (whether dense or sparse) because the learned patterns are more likely to overspecialize in a training to the test scenario.
JuniorDeveloper73@reddit
27b beats in side thinking,3b vs 27b
EstarriolOfTheEast@reddit
That's not quite right. A 120B A3B MoE is not reusing one fixed 3B path, it's routing each token through a path derived from context across a much larger combinatorial space. Across a sequence, a 120B MoE traverses much more of its learned function space than the 3B active-count phrasing might first lead one to think (larger even than a 20B dense model can).
So, the better way to think of it is that an MoE dynamically composes a solution where each per token step proceeds by consulting/constructing a context conditioned 3GB sized library worth of relevant functions.
ShadyShroomz@reddit
what quants and version?
im comparing 3.6 27b at fp8 to 3.5 122b at fp8.
I have not found that 27b blows 122b out of the water. I have found it better in a lot of cases though.
when I say 27b > moe in all regards, im talking about the 35b moe.. not a single test was the 35b moe better for me than the 27b.
the 27b and 122b moe trade blows though.
my custom benchmark suite is design, editing, generation, instruction-following, javascript, repair, general knowledge, & script writing.
lots of web dev tests, fixes, tool calls, etc..
some of the results are automated & some are rated on a score of 1-5 (blind ratings) manually, and its combined. of course this test suite is not perfect (always gonna be some bias), but I've done a lot of testing... and even without including the custom scored ones... I still see 27b beat 122b in a lot of tests. although they are close, thats for sure.
EstarriolOfTheEast@reddit
I have not found the
mycall@reddit
27b vs 122b in tool calling, which is better?
ShadyShroomz@reddit
27B is more reliable at agentic coding and tool calling without a doubt. the 122b has more world knowledge though.
relmny@reddit
Related to chat (no-code), I would agree if you had written "usually", but without it, I don't agree.
Yes 27b-q6k is *usually* smarter than 35b-q6/122b, but there are times that 27b looks like an idiot, while 35b can even come up with something that even glm-5.1-smol-iq2xss didn't, and shames 27b.
Same for 122b.
27b is most of the times better than 35b/122b, but there are times that 35b is way better.
At least that's what I saw a few times already.
vick2djax@reddit
I’m sure dense is better than MOE with coding for sure. But I’ve gotten much better results for my RAG and answer generation with MOE than I did dense. Then the 2x speed is great, too. But it seems like I really notice a big gap in knowledge.
Moscato359@reddit
MTP is weird, because if you overflow to system ram, moe doesn't really benefit from MTP, while dense models do
and it totally changes the comparison
vick2djax@reddit
Whoa wait I haven’t been running dense with MTP with it touching my system RAM. I assumed it would go slower?
Moscato359@reddit
If everything fits in your vram, moe will still gain a lot from mtp
But the gains from mtp are radically crushed when you overflow to system ram, on moe models, while they aren't crushed as badly on dense models.
Basically, mtp can't help as much on the moe+overflow
Solary_Kryptic@reddit
Is it better to just not use MTP, if your MoE is overflowing?
EatTFM@reddit
You need additional VRAM, thats why I would advise against it
Moscato359@reddit
Well... it won't hurt much
It just doesn't help much?
vick2djax@reddit
I only measured about a 7% difference in speed when staying inside the GPU with mtp draft turned on. Does something else need to be turned on?
tedivm@reddit
MTP is mindblowing. I can't believe the tps I'm getting on a dense model.
Borkato@reddit
I personally feel like 35a3b and qwen 27B are just… perfect. They complement each other absolutely perfectly and I rarely if ever reach for any other model.
ps5cfw@reddit
I hope they don't skip 35B MoE, us 16GB VRAM Poor fuckers do not have the means to run 27B at a decent quant, whilst 35B allows very decent hybrid CPU Inference
LordStinkleberg@reddit
Can you describe your current 35B setup and expected tps? I am 16GB VRAM poor w/ 64 CPU RAM.
dsartori@reddit
Not the person you're replying to but I run Qwen3.6 on just such a device. It's a Windows box, I run LMStudio. Important "Load" settings:
I haven't tried the MTP version yet on this device but pre-MTP I get about ~400t/s prompt processing and ~30t/s inference. Very usable.
grunade47@reddit
I tried out Qwen3.6 35B-A3B MTP (unsloth) and im getting about 55t/s (output) not sure if thats good or bad for my setup?
and what should be my context size?
RX 9070 and 32gigs of ram
.\llama-server.exe `
-m "Qwen3.6-35B-A3B\Qwen3.6-35B-A3B-UD-Q4_K_XL.gguf" `
--host 127.0.0.1 `
--port 8080 `
--ctx-size 8192 `
--temp 0.6 `
--top-p 0.95 `
--top-k 20 `
--min-p 0.00 `
--presence-penalty 0.0 `
--reasoning off `
--no-mmap `
--spec-type draft-mtp `
--spec-draft-n-max 2
dsartori@reddit
You should try pushing context to at least 128k. I think you can do max context with your setup.
grunade47@reddit
Will try, i tried 80k context size and gave the same task to both claude sonnet 4.6 and qwen on a medium sized codebase.
While qwen completed all the requirements in one go, claude had slightly better code quality and adhered to the code standards in the repository but didnt fulfil all requirements.
alchninja@reddit
Could you share your prompt processing speed with MTP enabled?
dsartori@reddit
Roughly 500t/s so probably I was underestimating my pp previously.
GoTrojan@reddit
Why mmap off? I got same advice but not explained
dark-light92@reddit
With mmap on, parts of the model may be swapped out on disk if there is memory pressure. With it off, model always remains in RAM.
Xantrk@reddit
It makes prefill MUCH faster if you're spilling over to RAM.
Ok-Measurement-1575@reddit
MTP is not a free gain, unfortunately, it costs vram too.
sagiroth@reddit
I think its worth it if you can find acceptance rate above 50%
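To put rough numbers on the acceptance-rate point, here is a minimal sketch of how many tokens one verification pass yields on average. It assumes every drafted token is accepted independently with the same probability, which is a simplification of real workloads, and the function name is only illustrative. The result counts the accepted draft tokens plus the one token the full model always contributes.

```python
def expected_tokens_per_pass(accept_rate: float, n_draft: int) -> float:
    """Average tokens emitted per verification pass.

    Simplified model: each drafted token is accepted independently with
    probability accept_rate, and drafting stops at the first rejection.
    """
    if accept_rate >= 1.0:
        return n_draft + 1.0
    # Geometric series: 1 + a + a^2 + ... + a^n_draft
    return (1.0 - accept_rate ** (n_draft + 1)) / (1.0 - accept_rate)


for rate in (0.4, 0.5, 0.9):
    for n_draft in (2, 3):
        tokens = expected_tokens_per_pass(rate, n_draft)
        print(f"accept={rate:.0%} n_draft={n_draft}: {tokens:.2f} tokens/pass")
```

Tokens per pass is only an upper bound on the speedup, since drafting and verifying extra positions also cost time and VRAM.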
mycall@reddit
MTP doesn't yet work with Vision in llamacpp too.
MainhattanSky@reddit
It actually does since being released. You’re right that it has been the case previously.
peligroso@reddit
MTP makes no statistically measurable improvement to 35B MoE.
dsartori@reddit
I just tested this and got faster inference with identical settings. You sure about that?
peligroso@reddit
Faster, or better? Two different things.
Inevitable_Mistake32@reddit
>statistically measurable
Both faster and better are statistically measurable so what are you saying?
peligroso@reddit
How is better measurable? Isn't the mantra of this sub effectively "quality benchmarks are meaningless"?
And further, 25% quicker token output doesn't matter if it has to backtrack to fix itself with >25% more effort.
jtjstock@reddit
You don’t understand what MTP is or how it works apparently.
reginakinhi@reddit
The main attention head still checks the calculations. The tokens are guaranteed to be identical to the normal output. They mathematically cannot be better or worse than without MTP. And since LLM calculations are almost always memory bandwidth bottlenecked, not compute, MTP is very likely to result in a speed up
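A toy sketch of that guarantee, assuming greedy decoding: the draft proposes tokens, the full model checks each position, and only the full model's own token is ever emitted. `target_next` and `draft_next` are made-up stand-ins for illustration, not real model calls.

```python
from typing import List, Tuple

VOCAB = 50


def target_next(ctx: List[int]) -> int:
    # Toy stand-in for the full model's greedy next token (hypothetical rule).
    return (sum(ctx) * 31 + len(ctx)) % VOCAB


def draft_next(ctx: List[int]) -> int:
    # Toy stand-in for the cheap draft head: it agrees with the target except
    # when the context length is a multiple of 4, where it is deliberately wrong.
    guess = target_next(ctx)
    return (guess + 1) % VOCAB if len(ctx) % 4 == 0 else guess


def greedy_decode(prompt: List[int], n_new: int) -> List[int]:
    out = list(prompt)
    for _ in range(n_new):
        out.append(target_next(out))
    return out


def speculative_decode(prompt: List[int], n_new: int, n_draft: int = 3) -> Tuple[List[int], int]:
    out = list(prompt)
    goal = len(prompt) + n_new
    passes = 0
    while len(out) < goal:
        # 1) Draft a few tokens cheaply, each conditioned on the previous drafts.
        drafted: List[int] = []
        for _ in range(n_draft):
            drafted.append(draft_next(out + drafted))
        # 2) Verify (a single batched pass in a real engine).
        passes += 1
        for tok in drafted:
            expected = target_next(out)
            out.append(expected)  # always emit the target's own token
            if tok != expected or len(out) >= goal:
                break  # first mismatch: the rest of the draft is discarded
    return out, passes


reference = greedy_decode([1, 2, 3], 20)
speculative, passes = speculative_decode([1, 2, 3], 20)
print("identical output:", speculative == reference)
print("verification passes:", passes, "for 20 new tokens")
```

With sampling instead of greedy decoding, engines use an acceptance rule designed to preserve the full model's output distribution rather than exact token equality, so the point about quality carries over.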
DunderSunder@reddit
MTP on paper does not change the quality of the output. The only tradeoff is some extra vram.
lemondrops9@reddit
when talking about mtp its about speed. MTP isn't going to make the model smarter.
MagoViejo@reddit
if 16GB is poor , what of us paupers with 3060 12Gb ? :)
running MoE and hearing the grinding of the fans is celestial music to my ears...
shaq992@reddit
This is how I run it on my 5060Ti 16GB vm with 128GB of system RAM.
models:
  Qwen3.6-35B-A3B:
    cmd: >
      llama-server --host 0.0.0.0 --port ${PORT}
      -m "../Models/Qwen3.6-35B-A3B/Qwen3.6-35B-A3B-uncensored-heretic-Q6_K.gguf"
      --ctx-size 262144 --flash-attn on --no-warmup --fit on -t 12 -np 1
      --mmproj "../Models/Qwen3.6-35B-A3B/Qwen3.6-26B-A3B-mmproj-F16.gguf"
      --chat-template-kwargs "{\"preserve_thinking\": true}"
      --no-mmap
    ttl: 300
ps5cfw@reddit
Well I run 35B Q6 at 20 to 25 TPS Token Gen. and over 1000 Prompt Processing, that's a good baseline for me and I can seriously work with these speeds professionally.
In fact I do work professionally with 3.6 35B as my main model for 3 weeks now!
lukistellar@reddit
What Quant do you use? I am running the IQ4_NL Quant with 10-20 tps on an RX580 8GB, combined with 128K Token Context at Q4.
ps5cfw@reddit
Q6 Quant from FINAL BENCH Darwin 36B with unquantized cache.
Cache quantization WILL kill prompt processing.
junior600@reddit
How is FINAL BENCH Darwin 36B in your opinion? Is it better than the standard qwen 3.6 35a3b?
ps5cfw@reddit
Not amazed. It is VERY CONFIDENT, that's for sure.
Too bad it's confidently WRONG! But with enough steering it's not so bad.
tracagnotto@reddit
What work do you do if I may ask
ps5cfw@reddit
Mostly fixing Typescript web applications and sometimes .NET apps, nothing incredible really, but It pays the bills
tracagnotto@reddit
personally using a tweaked turboquant llama.cpp on a 16gb vram card i reached 20-25tk/s with 16k context. That dropped to 9-10tk/s once the context filled up.
It also required wise context sweeping between agent turns.
AuroraFireflash@reddit
M3 Max user of Qwen 35B MoE, but with 64GB so I can run a 6 or 8 bit quant. 20-30 tps for generated tokens, 300-500 for prefill tokens (400GB/s RAM).
It's just fast enough to be useful. M5 Max would boost me by 25-50% I think.
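For context on those numbers, here is a back-of-the-envelope sketch of the decode ceiling that memory bandwidth alone would allow. It assumes generation is purely bandwidth-bound, that roughly 3B weights are read per token for an A3B model, and it ignores KV cache reads, attention and any overhead, so real speeds land well below it. The function name is illustrative.

```python
def decode_ceiling_tps(bandwidth_gb_s: float, active_params_b: float, bits_per_weight: float) -> float:
    """Upper bound on tokens/s if every active weight is read once per token."""
    gb_read_per_token = active_params_b * bits_per_weight / 8
    return bandwidth_gb_s / gb_read_per_token


# 400 GB/s is the bandwidth quoted above; the parameter counts are nominal.
print(f"35B-A3B at 8-bit:   ~{decode_ceiling_tps(400, 3, 8):.0f} t/s ceiling")
print(f"27B dense at 8-bit: ~{decode_ceiling_tps(400, 27, 8):.0f} t/s ceiling")
```

The gap between the two ceilings is the main reason the MoE generates so much faster than the dense model on the same machine.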
r1str3tto@reddit
Hm. I also have an M3 Max 64GB but I get 45-50 tokens/sec and 1,100 prefill tokens/sec with Qwen 35B-A3B at Q8. I’m using oMLX and Unsloth MLX quants.
AuroraFireflash@reddit
Hmm, I'm usually in the larger context windows (100k to 200k) for stuff that I'm doing.
Unfortunately, uploading my benchmarks in oMLX is broken by Cloudflare Turnstile.
LetsGoBrandon4256@reddit
Running a Q6_K_L quant and I think I get about 20~30 TPS? Been a while since I've checked tps number but it's quite comfy for me.
amchaudhry@reddit
Can you share your configuration? My tps is dog slow on 9070XT ROCm
ps5cfw@reddit
Sure!
cmd: '/XXX/LlamaCpp/Linux/build/bin/llama-server --port ${PORT} --chat-template-kwargs '{"preserve_thinking": true}' --host 0.0.0.0 -m "/XXX/LlamaCpp/models/FINAL-Bench_Darwin-36B-Opus-Q6_K_L.gguf" --spec-type ngram-mod --spec-ngram-mod-n-match 24 --spec-ngram-mod-n-min 12 --spec-ngram-mod-n-max 48 --fit on -t 16 --fit-ctx 230000 --fit-target 384 --temp 0.7 --min-p 0.0 --top-p 0.95 --top-k 20 --jinja --no-mmproj --no-mmap -np 1 --presence-penalty 0.0 --repeat-penalty 1.0 --chat-template-file "/XXX/LlamaCpp/templates/qwen3.6.jinja" -ub 4096 -b 4096 --cache-reuse 256 --no-webui'
LordStinkleberg@reddit
I thought the template was already fixed (by unsloth iirc) and adopted upstream by Qwen? Is froggeric meaningfully different?
Regardless, I see you're using ngram speculative decoding but not MTP - did you try MTP and find it unhelpful? I've heard mixed reviews about MTP on the 35B MoE.
nasduia@reddit
On vllm FP8 27b it can start failing tool calls deep into the full context even with the unsloth template. froggeric seems better since I've been using it for a couple of days. The unsloth fixes are good though and it's too soon to say for sure. The froggeric one has a nice mechanism to slap the llm after failing a couple of calls in a row and inject instructions to remind it. (That bit is readable in the template without having to know Jinja)
amchaudhry@reddit
Same question re: froggeric
ps5cfw@reddit
I did try MTP, my token generation speed went from 20 TPS to a STAGGERING... 5 tps.
I'm not sure what's going on with MTP but to work on my machine it required me to basically set --fit-target to 6000+ or it would go OOM, and basically it was awfully slow.
Sisaroth@reddit
This is mine, doing 24 tps on a RX 7800 XT + 48 GB system ram:
.\llama-server.exe -hf bartowski/Qwen_Qwen3.6-35B-A3B-GGUF:Q6_K_L -c 131072 --jinja --temp 0.9 --top-p 0.95 --min-p 0.01 --top-k 40 --flash-attn on --presence_penalty 1.2 --chat-template chatml --api-key anything --cache-type-k q8_0 --cache-type-v q8_0 --parallel 1
Very important: don't set --n-gpu-layers 99. If you do, it seems like llama-server gives up on running ANY layer on the gpu. My tps doubles when i leave it out.
amchaudhry@reddit
Is it absolutely necessary to offload some layers to RAM? I had thought ideal set up was full load onto GPU?
Sisaroth@reddit
If I understood correctly, that's exactly the point of running a MoE model and why the OP is asking for MoE for his low VRAM machine. You run a MoE model that is bigger than your VRAM, but (hopefully) the active experts still fit within VRAM. This way you get best of both worlds. Both a relatively smart multi-purpose model, but also it will still be fast when you give it a specialized task.
ea_man@reddit
wanna see a bunch on 6800?
https://store.piffa.net/lm/lm_site/moe-35b.html
ps5cfw@reddit
Half of these parameters don't make any sense for qwen 3.6, this looks like a template built for... not Qwen. SWA-Full does NOTHING for Qwen Next and forward
ea_man@reddit
Yep it was probably a first line that I kept since an early version, it is not supported anymore:
0.32.063.233 W srv load_model: swa_full is not supported by this model, it will be disabled
If I recall, on older versions it was meant to keep the prompt cache from wasting.
ea_man@reddit
Yeah I guess you can remove --swa-full , maybe it was a first line that I copied pasted since the old models.
(I don't really use MoE so much, I mostly use dense and I see I don't have that flag there).
amchaudhry@reddit
Oh dang…what context window are you left with after load?
ea_man@reddit
Well it depends on the config: if you are loading all in VRAM that depends on what you have in VRAM and KV quants, when you use it with partial off loading you can set the context size with --fit-ctx
IQ3 MTD memory usage
-----------------------
Component      | VRAM Allocated | Purpose
Model Weights  | 14,227 MiB     | The static, quantized weights of the model (IQ3_S at ~3.46 BPW).
KV Cache       | 438.28 MiB     | Tracks context during generation. Set to a context length of 42,240 tokens.
State (RS)     | 251.25 MiB     | Required explicitly for the hybrid State Space Model (SSM) layers in the qwen35moe architecture.
Compute Buffer | 571.78 MiB     | Temporary working workspace for matrix operations during generation.
Dunno I usually stay lower than ~130k max, mostly 80k but if you want super speed keep KV at q8 or q16 and just run 20k context...
+-------------------------+-------------------+--------------------+--------------------+
| Task Profile            | IQ3_M (Baseline)  | IQ3_S (MTD, N=3)   | IQ3_S (MTD, N=2)   |
+-------------------------+-------------------+--------------------+--------------------+
| Code Generation         | 90.51 t/s         | 120.05 t/s (Max)   | 117.56 t/s         |
| Draft Acceptance (Code) | N/A               | 89.01%             | 92.34% (Max)       |
+-------------------------+-------------------+--------------------+--------------------+
| Creative Chat/Story     | 91.24 t/s (Max)   | 76.25 t/s (Worst)  | 88.50 t/s          |
| Draft Acceptance (Chat) | N/A               | 38.34%             | 53.36%             |
+-------------------------+-------------------+--------------------+--------------------+
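The memory figures above can be sanity-checked with a little arithmetic. This is a rough sketch only: it assumes a nominal 35B parameter count, a uniform bits-per-weight across all tensors, and a KV cache that grows linearly with context at the same cache type. The helper names are made up for illustration.

```python
MIB = 1024 ** 2


def weights_mib(params_b: float, bits_per_weight: float) -> float:
    """Approximate size of quantized weights in MiB (params given in billions)."""
    return params_b * 1e9 * bits_per_weight / 8 / MIB


def kv_mib_at(ctx_tokens: int, ref_mib: float = 438.28, ref_ctx: int = 42_240) -> float:
    """Scale the reported KV cache size linearly to another context length."""
    return ref_mib * ctx_tokens / ref_ctx


# ~3.46 BPW and the 438.28 MiB / 42,240-token pair are the figures reported above.
print(f"weights at 3.46 BPW: ~{weights_mib(35, 3.46):,.0f} MiB")
print(f"KV cache per token:  ~{438.28 * 1024 / 42_240:.1f} KiB")
print(f"KV cache at 262,144: ~{kv_mib_at(262_144):,.0f} MiB")
```

The weight estimate lands in the same range as the reported 14,227 MiB; the small gap is expected, since real quants mix precisions per tensor and the parameter count is only nominal.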
relmny@reddit
I mainly use 27b-q6k on 32gb VRAM for chat (with OW) but... *sometimes* 35b is actually smarter than 27b.
I asked 27b about harnesses and it kept recommending something that doesn't fit. Then I asked 35b and it came up with something that even glm-5.1-smol-iq2_xss hadn't: when I said "what about (what 35b said)" in an existing GLM chat, it said "yeah, that's a better idea"...
27b is supposed to be "better", and probably it is... but sometimes 35b is better.
Former-Ad-5757@reddit
Even a broken clock has the correct time 2 times a day. 27b is simply much better, but 35b is already really good.
relmny@reddit
That analogy doesn't apply in this case. It wasn't "by chance" or "coincidence" that 35b got it right.
If you are happy believing that 27b is always better than 35b, that's up to you.
From my experience, I know that is not the case, because I've seen the opposite happen a few times (even once is enough).
tableball35@reddit
> Me sitting here at 12GB VRAM 32GB RAM
tracagnotto@reddit
https://abhinandb.com/#/post/running-qwen-3-6-on-6gb-vram
BringTea_666@reddit
>I hope they don't skip 35B MoE, us 16GB VRAM Poor fuckers
I hope they don't skip 35B moe because instead of shit 50t/s with 35b moe i can do 220t/s.
Ideal scenario qwen3.7 35b moe that is as good 3.6 27b dense.
Qwen30bEnjoyer@reddit
We do? IQ3_XXS isn't too bad.
Tai9ch@reddit
Don't sleep on IQ4_XS. I've gotten some really good results with that quant on larger models.
Septerium@reddit
How much further can it improve compared to 3.6 27b?
tarruda@reddit
I wish that open weights was still the default mode for Qwen team. It seems that after the layoff they have been focusing mostly on proprietary models.
ReporterCalm6238@reddit
The real miracle model is DeepSeek 4 flash. It's the only hyper-dense model you can use with coding agents and almost forget it is not Opus/GPT. Qwen models think for too long.
silverud@reddit
Qwen 3.7 122B-A10B is my dream model.
firespawn_katie@reddit
Agreed. Qwen 3.5 122B was incredible.... one can only hope
silverud@reddit
I expect that Qwen 3.7 122B-A10B, if it were to be released, would be the pinnacle of what can run on a 128gb unified memory Apple Silicon, with the optimal blend of speed and capability.
Smarter and faster than 27B is the goal.
antwon_dev@reddit
I’m considering upgrading soon, so that would be awesome. Do you know how 3.5 122B compares to the 3.7 27B?
AXYZE8@reddit
There is no 3.7 27B yet so nobody can answer that.
If you meant 3.5 27B vs 122B then IMO the quality is not that far off. 122B has more knowledge, but in terms of reasoning I would say they're the same. However 122B has 10B active params instead of 27B, so it is more than 2x faster.
27B is awesome for people with single beefy GPU, 122B is awesome for people that have unified memory or want hybrid inference.
whitefritillary@reddit
122B-A10B will obviously have much more knowledge but in terms of smartness I’d actually argue 27B is actually somewhat ahead.
silverud@reddit
There is no 3.7 27B right now....
silverud@reddit
3.6 27B (there is no 3.7 27B yet) tends to produce marginally better output than 3.5 122B, albeit at a much slower rate, and very dependent upon the type of task/subject.
We never got a 3.6 122B or a 3.7 27B, so it is possible that a 3.7 122B would absolutely dominate, while still outperforming in speed. Couple that with MTP (which works fairly well on Qwen MoE), and you've got the potential for an absolute monster advantage on big memory laptops (Macbook Pro) or Apple desktops.
Ariquitaun@reddit
Until 3.8
FrantaNautilus@reddit
Qwen 3.5 122B-A10B really needs an update to 3.7. So many new things were introduced since its release: MTP, thinking preservation, thinking improvement, and a newer cutoff date would be great too.
ECrispy@reddit
would a 122B-A10B model even run on a 16GB gpu?
MoffKalast@reddit
You can offload just the most compute intensive parts and bob's your uncle.
MundanePercentage674@reddit
yes with 0.1 bit
UnWiseSageVibe@reddit
this what i want, a big capable model with MTP
MuDotGen@reddit
Question, for MoE, is there a general percentage of active parameters to expert parameters that is generally the most intelligent? Like 35B-A3B would be 3/35 = 0.086, and 122B-A10B would be 10/122 = 0.082, so both around 8% active out of total available. Is that considered a good ratio or does it start to differ as you increase the parameters a lot?
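For reference, a small sketch that computes the active-to-total ratio for the MoE sizes named in this thread. The parameter counts are the nominal figures from the model names, so the percentages are approximate.

```python
def active_ratio(total_b: float, active_b: float) -> float:
    """Fraction of the model's parameters that are active for each token."""
    return active_b / total_b


# (name, total params in billions, active params in billions)
models = [("35B-A3B", 35, 3), ("122B-A10B", 122, 10), ("397B-A17B", 397, 17)]

for name, total_b, active_b in models:
    print(f"{name}: {active_ratio(total_b, active_b):.1%} active")
```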
formlessglowie@reddit
That would unironically make me buy two more 3090s and finally move from 2x3090 to 4x3090.
comperr@reddit
What setup do you run? Like chipset/motherboard to fit 4x 3090? I am physically limited to 2. Even if I put the 2nd on water it would need to be a custom loop to make room for a 3rd. On X299
ArtfulGenie69@reddit
I got lucky and had two computers with 2x3090. I thought I may need more but they have something called rpc for llama.cpp and ray for vllm. I got rpc working on my system so with a basic q4 quant in llama.cpp I get like 800pp 55tg. It's fast, and it could be faster if I built it again on vllm or just turned on mtp. I have a feeling that with int4 autoround and mtp (or better, dflash, as vllm handles that) you could break into the 120t/s area.
mycall@reddit
Why 10B instead of 20B?
silverud@reddit
Because that's how Qwen 3.5 122B was setup.
Yorn2@reddit
Some of us want Qwen 3.7 397B-A17B as well.
ForsookComparison@reddit
This is my number one by a country mile. It's still so much stronger as a general purpose agent than 3.6 27B or Gemma4.
Sadly I think that our odds of getting a max-sized model ever again are slim to none as Qwen-Max inches towards being a serious competitor (price and quality) to the big guys.
Cupakov@reddit
Gimme a 3.7 80B-Coder, Jesus that would slap
ArtfulGenie69@reddit
Agreed, they didn't even do the best model for 3.6
pl201@reddit
Yes/yes/yes
Far-Low-4705@reddit
4b, 9b, 30ish MOE, 27b, 120b MOE
These all seem to have the most utility. 4b for running on anything, 9b for laptops, 30b for speed, 27b for the majority of ppl, and 120b for power users
HockeyDadNinja@reddit
Same here.
shansoft@reddit
Same here! 122B still beats 3.6 27B from my experience.
cafedude@reddit
I'm not on X. Can someone who is on X bug Barry about a 3.7 122B?
cafedude@reddit
Not on X, can someone on X bug Barry about a 3.7 122B? Thank you.
LegacyRemaster@reddit
well said
PotatoQualityOfLife@reddit
YESSSSSSSS
IKerimI@reddit
Yes please
Mountain_Patience231@reddit
EVERY AMERICA AI COMPANY FREAKING OUT
wren6991@reddit
Mr President. A second Qwen 27B has hit the towers
Intelligent-Form6624@reddit
80B-A6B please
AttentionIsAllINeed@reddit
What are people using this for anyways? It can't even write a hello world function for e.g. aws lambda. It's useless
Juulk9087@reddit
I've been trying to get it to work for 2 weeks and I can't do it. Built an entire agentic workflow around it. My code base is rust and typescript, two of the most common languages, and it has no idea what the fuck. So you're not alone brother. Regardless of the down votes. You would think running BF16 weights and BF16 cache would make a difference but it doesn't.
my_name_isnt_clever@reddit
If every single person is raving about a model but when you try it's worthless, the problem isn't the model.
suicidaleggroll@reddit
I’d love a Qwen 50B or 80B dense model. The 27B is great, but with MTP it’s so fast that I’d happily trade some of that speed for even more parameters.
EagleNait@reddit
27B? Fast? We're not in the same tax bracket lmao
suicidaleggroll@reddit
With MTP it is, as long as you can fit it in VRAM. I'm hitting 120 tok/s generation and nearly 5000 pp. It doesn't take much to fit it in VRAM, a single 32 GB card can do it with full 256k context.
SnooPeripherals5499@reddit
Doesn't seem to be the reality of 2x 3090
Mysterious_Pride_858@reddit
`docker run --rm --gpus 1 -v /media/wwhvw/A63032EE3032C5595/models/:/models -p 11111:8080 ghcr.io/ggml-org/llama.cpp:server-cuda13 -m /models/dir/Qwen3.6-27B-UD-Q6_K_XL.gguf --mmproj /models/dir/mmproj-BF16.gguf --port 8080 --host 0.0.0.0 -ngl 999 \--spec-type draft-mtp --spec-draft-n-max 6 -np 1 -fa on -c 32768`
My 5090 only got 80tok/s, cuda oom when context set to 64k. Can you share your command?
suicidaleggroll@reddit
It's going to depend on your prompt, my 120 tg test was a coding prompt which works well with MTP. The quant also makes a difference, I've found unsloth's UD quants are often slower than other providers. My test was with Bartowski's Q8_0.
Even with Q8 I can fit 64k context within 32 GB VRAM, with Q6 it should fit easily. I'm not sure what's inflating your VRAM usage so much, maybe try adding "--parallel 1" and dropping "--spec-draft-n-max" from 6 to 3, that's what I've been using.
LetsGoBrandon4256@reddit
In this economy? We're definitely not in the same tax bracket 😭
UniversalSpermDonor@reddit
There's a seller who'll take $350 as an offer for AMD Radeon V620s. They're 32GB but only have 512 GB/s of bandwidth, so they're not ultra fast, but they're fine.
Odd-Environment-7193@reddit
Stop being poor. It is the solution to your problems /s.
ttkciar@reddit
There are a bunch of 32GB MI50 on eBay right now for about $600.
billy_booboo@reddit
To be fair those 32GB AMD cards were going for $1.3k until recently, not cheap but I'd encourage people to consider them
Kagemand@reddit
Which card is that?
suicidaleggroll@reddit
RTX Pro 6000, but of course that has way more VRAM than is necessary for a 27B. A smaller GPU should work just as well. That's at Q8_0 with MTP. Without MTP it was closer to 3400 pp and 48 tg, MTP makes a big difference.
ProfessionalSpend589@reddit
Are you running it on an Intel Arc B70 at BF16?
Please, share some more details.
suicidaleggroll@reddit
RTX Pro 6000, but of course that has way more VRAM than is necessary for a 27B. A smaller GPU should work just as well. That's at Q8_0 with MTP. Without MTP it was closer to 3400 pp and 48 tg, MTP makes a big difference.
This_Maintenance_834@reddit
nvfp4 version of qwen3.6-27b is really fast.
falcongsr@reddit
Tell me you got 96GB RAM without telling me you got 96GB RAM.
Prof_ChaosGeography@reddit
I would love to see numbers on how dense models scale with abilities given parameter counts compared to moe models.
I wonder given how 27b almost aligns to the ~120bA10 moe model what a dense 50b model would rank at, or a 45b model that would leave room for multiple contexts on a modern dual GPU setup at 64gb vram
ttkciar@reddit
The rule of thumb for MoE vs dense competence is D = sqrt(P x A) where D is dense model parameters, P is total MoE parameters, and A is MoE active parameters.
That assumes all other factors are equal, which they never are, but since we're talking about models within a single lineage with presumably the same training datasets and training methodologies, it should be okay.
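To make the rule of thumb concrete, here is a minimal sketch that applies D = sqrt(P x A) to two MoE sizes mentioned in this thread (35B-A3B and 122B-A10B). The function name is invented for illustration, and the results are only as good as the heuristic itself.

```python
import math


def dense_equivalent_b(total_b: float, active_b: float) -> float:
    """Rule-of-thumb dense-equivalent size (billions) for an MoE model."""
    return math.sqrt(total_b * active_b)


for name, total, active in [("35B-A3B", 35, 3), ("122B-A10B", 122, 10)]:
    print(f"{name}: ~{dense_equivalent_b(total, active):.1f}B dense-equivalent")
```

By this estimate the 35B-A3B lands near a 10B dense model and the 122B-A10B near a 35B dense model, which is roughly in line with the observation above that the 27B dense comes close to the ~120B MoE.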
wiltors42@reddit
Yeah honestly, 3.5 122b was great but Qwen 3 coder next is only 80b and better…
tracagnotto@reddit
https://abhinandb.com/#/post/running-qwen-3-6-on-6gb-vram
sunychoudhary@reddit
27B feels like the sweet spot if the quality is actually there. Big enough to be useful for reasoning and coding, but still realistic for local quantized runs. I’m more interested in how it performs at 4-bit/5-bit than the full precision benchmarks. That’s what most people here will actually use.
CodeCatto@reddit
I want a 7-9B model of qwen 3.6
Due_Ebb_3245@reddit
9bplease
harpysichordist@reddit
Holy shit it's been another day! We need another Qwen post with literally no substance and all hype botted to the top of the subreddit!
EatTFM@reddit
xmas once a month!
kevinlch@reddit
please dont skip 9B. please
Ohhai21@reddit
9b for the poors when? 😄
Sambojin1@reddit
8B hopefully, so it handily fits into 12 GB of RAM on an Android phone, with a bit of context size.
koenafyr@reddit
You're not running 8b with any kind of speed on any mobile phone in the world.
Just use gemma4 e2b
_wOvAN_@reddit
I need 397
DeepOrangeSky@reddit
At this point, I think the better strategy is for everyone to pester GLM for a 5.2 Big Air ~200b model (or Kimi, to a lesser extent), more so than asking Qwen for a 397 refresh.
Plus, given how strong a GLM ~200b model would be at this point, it would also force Minimax to stay open weights for a while longer and to actually have to put something pretty strong out for Minimax3, since I doubt they'd have strong enough mindshare/brand to go fully closed source right at the moment that GLM put out some open weights ~200b monster that made 2.7 230b look like a joke in comparison. So even the ripple effects could be nice, too.
Lissanro@reddit
Same here. I find Qwen 3.5 397B is still a good middle ground between the small 27B and a large model like Kimi K2.6, which I run when I need to do more complex tasks. The 27B, though good and fast for simple tasks, cannot handle more complex instructions well, while 397B Q5_K_M has a very good balance of speed: with four 3090s and DDR4 RAM I can run it at 17.5 tokens/s generation with 600 tokens/s prefill, and it may run even faster once I download an MTP-enabled quant.
ShadyShroomz@reddit
How much RAM do you have? I have 4x 3090s but haven't even tried the 397B yet, and only 128 GB of RAM. Upgrading to 256 soon.
Lissanro@reddit
I have 1 TB of 8-channel DDR4 3200 MHz, but Qwen 3.5 397B Q5_K_M does not need that much. Its GGUF is 276 GB, so if you upgrade to 256 GB RAM on top of the 96 GB VRAM you already have, it should fit well along with its context cache. If it doesn't fit or is too slow, you can try a lower quant; Q4_K_M, for example, is reasonably good.
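The arithmetic behind that estimate can be sketched as follows. The 276 GB, 256 GB, and 96 GB figures come from the comment above; the 32 GB overhead allowance for the OS and context cache is purely a hypothetical placeholder, as is the function name.

```python
def headroom_gb(gguf_gb: float, ram_gb: float, vram_gb: float, overhead_gb: float = 0.0) -> float:
    """Memory left over after loading the weights across RAM + VRAM."""
    return ram_gb + vram_gb - gguf_gb - overhead_gb


# Figures from the thread: 276 GB GGUF, 256 GB RAM, 4 x 24 GB = 96 GB VRAM.
print(headroom_gb(276, 256, 96))       # headroom before any overhead
print(headroom_gb(276, 256, 96, 32))   # with an assumed 32 GB for OS + context cache
```

With the current 128 GB of RAM the same calculation goes negative, which is why the upgrade matters for this quant.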
_wOvAN_@reddit
small model quite useless
ttkciar@reddit
Tell us you don't know sparse from dense without telling us you don't know sparse from dense.
FullOf_Bad_Ideas@reddit
I'd like one too, but if they aren't sure about 27B I think we have low chances.
nicolas1801@reddit
it's christmas <3
Fastpas123@reddit
50-80B MOE Would be good, along with 10, 20, 30B dense :)
sine120@reddit
I'd love something the size of Coder-Next with the 3.7 DNA. It's about the max size I can run with my 64GB RAM/ 16GB VRAM and still get a good Quant size. Otherwise the 35B is about all I'll be able to fit and it doesn't really max out my RAM.
ECrispy@reddit
i'm hoping for something that works well for 16GB vram.
maybe something between A35B-10B and 27B, that would fit well and have enough space for context. perhaps A20B? no idea if thats feasible, has enough demand etc?
Charming-Author4877@reddit
Qwen releases are the biggest news since meta started llama
ea_man@reddit
What I want is something just a little bit smaller than 27B so we can run it on 16GB GPU at q4 and even 12GB at q3.
misanthrophiccunt@reddit
yes!
Legumbrero@reddit
Would love to see a dense 70b using the same methods. Totally spot on on parameter-for-parameter just wish I could see what they can do with a bigger model.
Inevitable-Name-1701@reddit
We have mini models already. Give us larger.
peligroso@reddit
No point in trying to keep up, it's a race to the bottom.
Tai9ch@reddit
You say that, but medium sized models is where a lot of the really interesting stuff is going to happen.
Kimi K2.6 has huge models handled, but running it locally is a nightmare. The ~30B space is pretty well covered.
But for people with 64-256GB of VRAM, there's like Qwen3.5 and MiniMax and... gpt-oss-120b maybe? And those are the people with budgets for serious tasks that want to run locally but don't necessarily want to spend six figures or install several tons of new cooling.
miversen33@reddit
I think this is where things end up. Absolutely massive (read trillions of parameters) models and relatively tiny (5-30 billion parameter) models
ttkciar@reddit
We don't have a Qwen3.6-9B yet.
Hopefully Qwen3.7 includes 9B, 27B, and 122B-A10B releases.
AI-Agent-Payments@reddit
The angle nobody's mentioning: a 27B dense at Q4_K_M sits right at 16GB VRAM but the KV cache bloat with long contexts pushes you into offload territory fast, so effective usability depends heavily on whether they tune the GQA head count aggressively. Qwen 2.5 32B was actually more practical for most local setups than the parameter count suggested because of how they handled that, so the raw size number matters less than the architecture decisions around attention.
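To illustrate why the KV head count matters so much, here is a minimal sketch of the standard KV cache size formula for a plain full-attention transformer: 2 (keys and values) x layers x KV heads x head dimension x context length x bytes per element. The layer count, head counts, and head dimension below are illustrative assumptions, not the published configuration of any Qwen model, and architectures that mix in linear-attention layers will need less.

```python
def kv_cache_gib(n_layers: int, n_kv_heads: int, head_dim: int, n_ctx: int,
                 bytes_per_elem: int = 2) -> float:
    """KV cache size in GiB for full attention (fp16 by default)."""
    # Factor of 2 covers both the key and the value tensors.
    total_bytes = 2 * n_layers * n_kv_heads * head_dim * n_ctx * bytes_per_elem
    return total_bytes / 2**30


# Hypothetical 64-layer model with head_dim 128 at 32k context.
print(kv_cache_gib(64, 8, 128, 32768))    # GQA with 8 KV heads
print(kv_cache_gib(64, 40, 128, 32768))   # one KV head per query head (no GQA)
```

Under these assumptions, cutting 40 KV heads down to 8 shrinks the 32k-context cache from 40 GiB to 8 GiB, which is the difference between offloading and staying on the card.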
Tai9ch@reddit
Yea, something like a 23B dense would be spectacular for 16/32GB cards.
florinandrei@reddit
If they could make it fit in 24 GB VRAM with more than 100k context at a quantization level that's not too drastic, that would be great.
synw_@reddit
Please don't forget the 4b in addition to the 35b a3b. The gpu poor peasants would be thankful
JGeek00@reddit
This blog says that “open 27B and 35B weights are announced but unscheduled”
https://insiderllm.com/guides/qwen-3-7-preview-scored-57-aai-27b-35b-open-weights-watch/
Sisuuu@reddit
Uhhhh! Too exciting!
cleversmoke@reddit
Qwen3.6-27B has been fantastic, it's difficult to even ask for better! While folks want larger, I am curious what they can do with smaller and more efficient for edge devices, it would open a slew of applications!
VoiceApprehensive893@reddit
it feels like 27b and 35b are going to get considerably better at some of the things that gemma 4 does way better than 3.6
nickm_27@reddit
I’d be quite happy if this was the case, what gives you that indication?
VoiceApprehensive893@reddit
3.7 max compared to 3.6 max feels less slop, reasons less and can draw a pencil using ascii art
ttkciar@reddit
"It feels" implies they're just expressing hope.
Once upon a time I would have been more skeptical of the possibility, because Gemma has always been a "good enough at every kind of task" sort of model, while Qwen mainly focused on the most-popular use-cases, but Qwen3.5 closed that gap quite a bit, and Qwen3.6 closed it even more (and even exceeded it for some things; Qwen3.6-27B is better at rewriting tasks than Gemma-4-31B-it).
If Qwen3.7 continues that trend, we might be hard-pressed to find a task type Gemma 4 can do which Qwen3.7 cannot.
FullOf_Bad_Ideas@reddit
It's a shame that they're not certain yet honestly.
Sofakingwetoddead@reddit
Hallelujah!!!!!! I feel qwen runnin' through me!!!!
Mountain_Chicken7644@reddit
That's cool, but when does the 9b model release?
Saraozte01@reddit
Hope it includes a 122B, it would be amazing to receive the larger MoE's with their 3.7 recipe
L0ren_B@reddit
27B is the only one I'm excited about. Doesn't have to be smarter in knowledge than 3.6 27B, just fewer hallucinations!😅 Imagine a jump similar to 3.5 to 3.6! Just wow!
SE_to_NW@reddit
No, 122b model would do the most good to humanity. Not 27B
LegacyRemaster@reddit
the hero we need
Makers7886@reddit