TheaterFire

New Qwen3.5 models spotted on qwen chat

Posted by AaronFeng47@reddit | LocalLLaMA | View on Reddit | 198 comments


Freigus@reddit

dense 27B! 122B MoE! I'm glad they didn't abandon medium-sized dense models
View on Reddit #79184334

PrefersAwkward@reddit

Hopefully we get a good draft model for that dense one. I'm guessing a qwen3 0.6 model wouldn't do it for 3.5 models
View on Reddit #79185606

Street_Confidence453@reddit

We have the MTP module which is trained with multi-step for all Qwen3.5 models!
View on Reddit #79205201

PrefersAwkward@reddit

Apologies for my ignorance. Is this something akin to self-prediction? Something like using a light engine to guess the next tokens and then having the big engine verify and correct them?
View on Reddit #79206067

dnsod_si666@reddit

Not super familiar with this stuff, so take this comment with a grain of salt. I think MTP (multi-token prediction) adds some extra layers to the network so that each forward pass produces the next N tokens instead of just 1. At inference time it works as speculative decoding, where the main model verifies those predictions. There is also EAGLE3, which is similar, but its layers are trained after the main model is trained. MTP is trained as part of the main model because it provides speedups during training as well. Draft models (which I think are largely outdated) are smaller models with an output probability distribution similar to the main model's, which you can then use for speculative decoding. But EAGLE3 has been shown to be more accurate (drafted tokens are accepted more often) and faster (because it's just a few layers instead of an entire draft model).
View on Reddit #79449696
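
A rough sketch of the draft-and-verify loop described above, in Python. It is a toy under stated assumptions: both "models" are plain greedy functions from context to next token, `toy_target` and `toy_draft` are made-up stand-ins, and acceptance is exact match rather than the probabilistic acceptance rule real engines use. It is not how MTP or EAGLE3 are implemented; it only illustrates why the output matches what the big model would have produced alone.

```python
from typing import Callable, List

Token = int
NextToken = Callable[[List[Token]], Token]  # greedy "model": context -> next token


def speculative_step(target: NextToken, draft: NextToken,
                     context: List[Token], k: int) -> List[Token]:
    """Draft k tokens cheaply, then keep the prefix the target agrees with."""
    # 1. The draft proposes k tokens autoregressively.
    proposed: List[Token] = []
    for _ in range(k):
        proposed.append(draft(context + proposed))

    # 2. The target checks each position. A real engine scores all k
    #    positions in one batched forward pass; this loop is for clarity.
    accepted: List[Token] = []
    for tok in proposed:
        expected = target(context + accepted)
        if tok != expected:
            accepted.append(expected)  # replace the first mismatch and stop
            return accepted
        accepted.append(tok)

    # 3. Every draft was accepted: the target's pass also yields a bonus token.
    accepted.append(target(context + accepted))
    return accepted


# Hypothetical toy models: the target counts upward, the draft gets
# every multiple of 4 wrong.
def toy_target(ctx: List[Token]) -> Token:
    return ctx[-1] + 1


def toy_draft(ctx: List[Token]) -> Token:
    nxt = ctx[-1] + 1
    return nxt if nxt % 4 else 0


if __name__ == "__main__":
    print(speculative_step(toy_target, toy_draft, [0], 3))  # all drafts accepted
    print(speculative_step(toy_target, toy_draft, [2], 3))  # draft rejected early
```

Each step emits at least one token the target agrees with, so a bad draft costs speed, not correctness.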

Competitive_Ad_5515@reddit

Is speculative decoding with draft models still a big deal? I don't see much discussion of it anymore
View on Reddit #79186857

EmPips@reddit

Fewer dense models + fewer draft-sized/compatible models. Spec dec is absolutely still a thing, there are just far fewer models coming out where you'll get a big win out of it. The last one was probably a bit of a speed boost on the original Qwen3-235B using Qwen3 4B or Qwen3 0.6B. The smaller models never got updates, but Qwen3-235B-2507 came out and was much stronger, so nobody used the original, and the original small models weren't compatible as draft models.
View on Reddit #79203811

b3081a@reddit

When running on a single device it doesn't do well with MoE. Consecutive tokens do not necessarily activate the same experts, so when validating multiple tokens there are likely more parameters to be loaded and used, consuming more memory bandwidth and diminishing the benefits of MoE. For example, llama.cpp does well with dense-model speculative decoding but struggles on MoE. For large-scale deployment using expert parallelism on multiple GPUs, there will be more performance uplift (for a limited number of users per cluster).
View on Reddit #79197287
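
To put a number on the "more experts get loaded" point above, here is a back-of-the-envelope sketch in Python. It assumes each token picks its experts uniformly and independently, which real routers do not do, and the 128-expert / 8-per-token configuration is a hypothetical example, not a claim about any Qwen model.

```python
def expected_active_fraction(num_experts: int, experts_per_token: int,
                             tokens_verified: int) -> float:
    """Expected fraction of one layer's experts touched by a batch of tokens.

    Assumes uniform, independent routing: a given expert is missed by one
    token with probability 1 - experts_per_token / num_experts.
    """
    miss_one_token = 1.0 - experts_per_token / num_experts
    return 1.0 - miss_one_token ** tokens_verified


if __name__ == "__main__":
    # Hypothetical layer: 128 experts, 8 routed per token.
    for n in (1, 2, 4, 8):
        frac = expected_active_fraction(128, 8, n)
        print(f"verify {n} tokens -> ~{frac:.1%} of experts read")
```

Under these assumptions, verifying four drafted tokens reads roughly three to four times as many expert weights as decoding one token, which is the bandwidth cost the comment describes.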

po_stulate@reddit

It's here! [https://huggingface.co/Qwen/Qwen3.5-122B-A10B](https://huggingface.co/Qwen/Qwen3.5-122B-A10B)
View on Reddit #79202203

Sufficient-Rent6078@reddit

With the 10B active parameters in the MoE, I'd expect the 27B dense model to not be that far behind in intelligence. Could be a really attractive choice for single gaming GPU setups.
View on Reddit #79185840

TacGibs@reddit

MoE are now way better than at their beginnings, the old rule "square root of total parameters*active parameters" to compare them to a dense model isn't relevant anymore.
View on Reddit #79186345
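
For reference, the old rule of thumb quoted above is just a geometric mean of total and active parameters. A minimal Python sketch; the function name is made up, and as the comment says, the rule is considered outdated, so treat the outputs as what the heuristic predicts rather than as real capability estimates.

```python
import math


def dense_equivalent_b(total_b: float, active_b: float) -> float:
    """Old heuristic: sqrt(total * active), both in billions of parameters."""
    return math.sqrt(total_b * active_b)


if __name__ == "__main__":
    for name, total, active in [("122B-A10B", 122, 10), ("35B-A3B", 35, 3)]:
        print(f"{name}: ~{dense_equivalent_b(total, active):.1f}B dense-equivalent")
```

By this heuristic the 122B-A10B lands near 35B and the 35B-A3B near 10B.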

Far-Low-4705@reddit

This is especially true for reasoning models. The reason that rule held for instruct models is that the total computation is limited to a single forward pass, and a dense model does much more compute per pass. But with reasoning, the total compute is spread out over the reasoning tokens, so it doesn't really matter how much compute you can do in a single forward pass. In practice, an MoE might use a few more reasoning tokens to arrive at the answer, but it won't make much practical difference performance-wise, and it will be much faster.
View on Reddit #79188317

-p-e-w-@reddit

Indeed. gpt-oss-20b pretty much has the performance of a dense 20B model, despite having only 3.6B active parameters.
View on Reddit #79187449

Sufficient-Rent6078@reddit

Good point - there have been some architectural improvements, and we don't know whether the MoE defaults to a higher reasoning effort budget than the dense model. The rule of thumb likely underestimates the actual capability we are going to see.
View on Reddit #79187407

etherd0t@reddit

Because Qwen3.5 is MoE: Only ~17B parameters fire per token. That means:
* Latency scales closer to a 20B dense model😉
* Memory scales closer to a 400B sparse model
View on Reddit #79186814

MrHighVoltage@reddit

4bit quant would probably run on a 16GB GPU, that would be nice.....
View on Reddit #79185703

Far-Low-4705@reddit

nnoooooo i cant run 122B....... im so sad...... i wanted an 80b sooo bad
View on Reddit #79188152

tarruda@reddit

I wouldn't dismiss 2-bit quants of the 122B release which should be runnable in less than 50G. This new Qwen architecture is very resilient to quantization, I have been running 2-bit 397B on 128G mac with great success: https://huggingface.co/ubergarm/Qwen3.5-397B-A17B-GGUF/discussions/2
View on Reddit #79201411

Far-Low-4705@reddit

I can run unsloth's UD-Q3_K_XL quant at full context with full GPU offload, but I'm always skeptical of anything less than 4 bit
View on Reddit #79324301

giatai466@reddit

122B would be nice for my 16 GB VRAM + 64 GB RAM.
View on Reddit #79184269

luncheroo@reddit

A tight fit in Q4 though, no?
View on Reddit #79186382

ThrowawayNotSusLol@reddit

You guys are weird. I've got 12GB of VRAM and 32GB of RAM and I was running their qwen 3 235b a22b model at q4. Sure, it was slow, but this new 100b param moe should only be easier. Don't know why you act like it's so impossible to run these without massive PCs
View on Reddit #79202682

Tai9ch@reddit

Speed matters. Can you fly a thousand miles in a plane? Yes. Can you drive a thousand miles? Sure. Can you walk a thousand miles? Technically, yes.
View on Reddit #79229144

ThrowawayNotSusLol@reddit

It only matters to those who need it to be faster than reading speed.
View on Reddit #79229919

Tai9ch@reddit

I'm having trouble coming up with a use case where beating reading speed isn't useful. Pure live roleplay with no thinking tokens or tool use maybe? Even then, I'd want at least 20 tokens/second, which seems hard to get with most of the model offloaded to CPU.
View on Reddit #79231737

ThrowawayNotSusLol@reddit

The simple use of asking it a question. You need to read the answer to understand it
View on Reddit #79243851

luncheroo@reddit

You are right that running does not equal running fast. I have specific use cases, MCPs, and I am admittedly an impatient person. When I tried OSS 120b, the generation speed was fine but I couldn't do anything else on Windows with my hardware because I was using nearly all the RAM. Since I tend to have a local LLM open on one screen and other stuff going on on another, it didn't work well for me. For other people, that would be just fine.
View on Reddit #79214581

simracerman@reddit

OSS-120B-MXFP4 fits with 64k context just fine.
View on Reddit #79189835

Borkato@reddit

What kind of prompt processing speed are yall getting?
View on Reddit #79192695

simracerman@reddit

I just fed it a 16k long prompt on Windows, I got 104 t/s for processing and 25 t/s for generation. Quite usable for when I need a definite 1-shot answer. https://imgur.com/a/XOAOpF6 For anyone asking, I still have ~5-6GB left in system RAM when the model is fully loaded. I have Docker running open WebUI and couple other small things. My Windows idles at 11GB system RAM usage too. On Linux, this will be even better.
View on Reddit #79195675

Borkato@reddit

Yeesh, I consider that prompt processing suuuper slow. So it takes 2.66 minutes to even start generating the first token?
View on Reddit #79197446

carteakey@reddit

True. I ran it on my config (12GB VRAM + 64 GB RAM) and get around 250 tk/s pp, so there's scope for optimization (linux, llama.cpp params etc.).
prompt eval time = 95560.34 ms / 24379 tokens ( 3.92 ms per token, 255.12 tokens per second)
eval time = 124347.26 ms / 2312 tokens ( 53.78 ms per token, 18.59 tokens per second)
total time = 219907.60 ms / 26691 tokens
It took a minute and a half to process 24k tokens before starting, which is indeed slow, but incredible considering the fact it's running on such hardware. I expect
View on Reddit #79199722
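
The timing figures quoted above can be checked with a couple of one-line helpers. A small Python sketch; the function names are made up, and the 16,000-token prompt is an assumption about what "16k" meant in the earlier comment.

```python
def tokens_per_second(elapsed_ms: float, tokens: int) -> float:
    """Throughput from a llama.cpp-style 'time / tokens' pair."""
    return tokens / (elapsed_ms / 1000.0)


def seconds_to_first_token(prompt_tokens: int, prompt_tps: float) -> float:
    """How long prompt processing takes before generation can start."""
    return prompt_tokens / prompt_tps


if __name__ == "__main__":
    # Numbers copied from the log in the comment above.
    print(f"prompt: {tokens_per_second(95560.34, 24379):.2f} t/s")
    print(f"eval:   {tokens_per_second(124347.26, 2312):.2f} t/s")
    # Assumed 16,000-token prompt at the 104 t/s reported earlier.
    print(f"wait:   {seconds_to_first_token(16000, 104) / 60:.2f} min")
```

This is why prompt processing speed matters more than generation speed for long-context or agentic use: the wait scales linearly with prompt length.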

Borkato@reddit

This is interesting. When I tested it I got 81T/s prompt processing speed lmao!
View on Reddit #79211141

carteakey@reddit

What are your llama.cpp params and hardware? I often point to my post as a reference: [https://carteakey.dev/blog/optimizing-gpt-oss-120b-local-inference/](https://carteakey.dev/blog/optimizing-gpt-oss-120b-local-inference/)
View on Reddit #79211851

Borkato@reddit

This is very helpful!! I will try some of this :D
View on Reddit #79214486

coder543@reddit

On DGX Spark, I get 2500 tok/s prompt processing and about 60 tok/s generation with GPT-OSS-120B under llama.cpp. Some people report getting up to 6000 tok/s prompt processing under vLLM with the same setup, vLLM is just a hassle to use.
View on Reddit #79198729

simracerman@reddit

I paid 30% of the cost a DGX spark costs, and my 5070 Ti plays 4K AAA games, rocks all the smaller models and runs any OS of my choice. I guess when DGX spark can do all those, then it’s a fair comparison. Otherwise, it’s apples to watermelons type comparison.
View on Reddit #79213732

Borkato@reddit

Wow, that’s insane. I wish I could afford that 😭
View on Reddit #79211377

luncheroo@reddit

I can run it with the same specs, but I can't easily multitask with other things on Windows.
View on Reddit #79190973

boissez@reddit

Seems more like a perfect model for the strix halo/DGX gang.
View on Reddit #79188890

Iory1998@reddit

I don't think it would fit unless you use Q2 or Q3. Also, you should keep in mind that A10B is way larger than your typical A3B, so expect very slow generation speed. The model will be smart though.
View on Reddit #79187507

simracerman@reddit

OSS-120B-MXFP4 fits with 64k context just fine.
View on Reddit #79189853

Iory1998@reddit

That model was trained from scratch in 4-bit. OpenAI did well with it. Qwen-3.5 is likely trained in FP16. Don't expect Q4 to be around 65GB like OSS-120B.
View on Reddit #79190640

carteakey@reddit

Qwen3-Coder-Next-MXFP4_MOE.gguf at 80b is 40GB. Why would original training change the size of the quantized model? Just curious.
View on Reddit #79199288

Iory1998@reddit

If the company trains the model natively in FP4, then it would have better quality.
View on Reddit #79204071

FullstackSensei@reddit

Just got a few 64GB Jetson AGX Xaviers with carrier boards. Thinking of pairing one with a 12GB A2000. Would be epictastic with 122B at Q4
View on Reddit #79193171

durden111111@reddit

Need that 122B model
View on Reddit #79184202

po_stulate@reddit

It's here! [https://huggingface.co/Qwen/Qwen3.5-122B-A10B](https://huggingface.co/Qwen/Qwen3.5-122B-A10B)
View on Reddit #79202239

EmPips@reddit

As a general purpose model it seems like they're trying to paint it being as good as the original Qwen3-235B (not the updated 2507 checkpoint) but twice as fast and half the memory. The real gains are in instruction following and coding use. Meaning this *could* have the all-around strength that larger Qwen's have but the agentic abilities of GLM and Minimax models. All of this is subject to testing of course but I really hope these numbers turn out to be real.
View on Reddit #79204175

NoahFect@reddit

Unsloth's 4-bit quant just passed the car wash and upside-down cup tests, which none of the other Qwen models I've tried could deal with. This model is feeling pretty real.
View on Reddit #79227878

EmPips@reddit

I believe it. Just not an option for my machine at the moment! Working with 48GB of VRAM + 64GB of DDR4
View on Reddit #79228673

NoahFect@reddit

It's worth trying a 2-bit or 3-bit quant. This model seems to be very amenable to quantization, for whatever reason. I don't understand it, but Unsloth actually reports the 2-bit and 3.5-bit quants as outperforming the 4-bit? See https://unsloth.ai/docs/models/qwen3.5#benchmarks
View on Reddit #79229268

po_stulate@reddit

Yes! Since the unsloth quants are not out yet, I created a qwen chat account and tried the 122B-A10B one out. It seems like the most capable model so far I've ever seen. I gave it some test prompts (asking it to parallelize a bash script that will create race condition for generated temp files, etc) that no local models that I've tried (glm-4.5-air, qwen3-235b, gpt-oss-120b, etc) ever got right once. And it nailed the prompts first try. Super excited for the unsloth quants, it may mean I can finally delete all my 500GB+ (in total) models and keep only this one if it's as good locally as I tested it on their website.
View on Reddit #79206445

UnknownLesson@reddit

How much ram and VRAM do i need?
View on Reddit #79207145

po_stulate@reddit

The Q4_K_XL quant from unsloth is 70GB
View on Reddit #79216910
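
A quick way to sanity-check the sizes mentioned in this thread is parameters times bits per weight. A rough Python sketch; it ignores metadata, assumes "GB" means 10^9 bytes, and treats the whole file as one uniform bit width, which mixed-precision quants like the one above are not.

```python
def quant_size_gb(params_b: float, bits_per_weight: float) -> float:
    """Approximate file size in GB for params_b billion weights."""
    return params_b * bits_per_weight / 8.0


def effective_bits_per_weight(params_b: float, size_gb: float) -> float:
    """Average bits per weight implied by a file size."""
    return size_gb * 8.0 / params_b


if __name__ == "__main__":
    print(f"122B at a flat 4.0 bpw: ~{quant_size_gb(122, 4.0):.0f} GB")
    print(f"122B at a flat 2.5 bpw: ~{quant_size_gb(122, 2.5):.0f} GB")
    print(f"70 GB file for 122B:    ~{effective_bits_per_weight(122, 70):.2f} bpw")
    print(f"40 GB file for 80B:     ~{effective_bits_per_weight(80, 40):.2f} bpw")
```

The 70GB figure works out to roughly 4.6 bits per weight on average, which would be consistent with some tensors being kept at higher precision than 4-bit.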

wh33t@reddit

Siiiiick. This is what I've been waiting for, 100ishB, with a higher expert count. And multi-modal! Is the projector included?
View on Reddit #79211990

EmPips@reddit

Yes. I'm so ready to dethrone GLM 4.5 air and 4.6v as the top models my machine can run.
View on Reddit #79202974

lolwutdo@reddit

Why not MiniMax 2.5? It's way better than either of those models.
View on Reddit #79210166

EmPips@reddit

As soon as you're outside of Agentic use cases I don't enjoy it as much. It's also a fairly weak general purpose model for me. It's the strongest coding model that fits nicely on my machine but I'm finding myself preferring GLM 4.6v at Q4/Q5 rather than MiniMax at Q3. Great model.. it just doesn't have a home in my workflows nor my machine. Maybe if I had more VRAM and could run Q4+ that'd change.
View on Reddit #79214672

TopCryptographer8236@reddit

GLM does release mid-sized MoEs, e.g. GLM 4.6V? It's just kind of lackluster compared to 4.5 Air for coding.
View on Reddit #79191916

coder543@reddit

They’ve had 4.7 and 5 since then. No new mid sized models.
View on Reddit #79194142

phenotype001@reddit

And it's multimodal, so perfect.
View on Reddit #79186536

LosEagle@reddit

Stupid question, but if the 27b is multimodal, does it mean that it is less efficient for chat because some of those parameters are for vision?
View on Reddit #79195827

Evolution31415@reddit

Vision encoder is a separate part out of reasoning/thinking abilities and you can disable it to free some memory for KV cache in the text-only mode.
View on Reddit #79225585

Physical_Screen_7543@reddit

ngl this 27B is actually cracked. Coding and multimodal perfs are giving me early Gemini 3 Pro vibes lol. Perfect weight for anyone building a local agentic OS. Can't believe we're getting this much juice in a dense model in 2026. My 5090 is ready.
View on Reddit #79220214

Steus_au@reddit

35b would be a bomb
View on Reddit #79184075

9r4n4y@reddit

100% and also glm 5 flash
View on Reddit #79184238

Silver-Champion-4846@reddit

Glm 5 Flash? WHAT?
View on Reddit #79198175

9r4n4y@reddit

If it comes then 😭😅
View on Reddit #79202244

Silver-Champion-4846@reddit

Sure, you made me think there was a GLM Flash
View on Reddit #79219421

Healthy-Nebula-3603@reddit

From the benchmarks they showed 35b moe is weaker than 27b dense
View on Reddit #79205138

-p-e-w-@reddit

gpt-oss-20b killer. Even has fewer active parameters.
View on Reddit #79187150

Witty_Mycologist_995@reddit

Wdym where
View on Reddit #79203918

coder543@reddit

I have measured the weights of GPT-OSS-20B myself using a script to parse the full precision GGUF: weights_used_bytes: 3190031616 (3.19 GB / 2.97 GiB) weights_total_bytes: 12096558336 (12.1 GB / 11.27 GiB) Definitely A3B, however you want to measure it.
View on Reddit #79195415
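
The byte counts in the comment above convert as follows. This Python sketch only reproduces the unit arithmetic and the active fraction; it does not parse a GGUF file, and `describe` is a made-up helper name.

```python
WEIGHTS_USED_BYTES = 3190031616    # per-token weights, from the comment above
WEIGHTS_TOTAL_BYTES = 12096558336  # all weights, from the comment above


def describe(n_bytes: int) -> str:
    """Format a byte count in decimal GB and binary GiB."""
    return f"{n_bytes / 1e9:.2f} GB / {n_bytes / 2**30:.2f} GiB"


if __name__ == "__main__":
    print("used: ", describe(WEIGHTS_USED_BYTES))
    print("total:", describe(WEIGHTS_TOTAL_BYTES))
    print(f"active fraction: {WEIGHTS_USED_BYTES / WEIGHTS_TOTAL_BYTES:.1%}")
```

By bytes, a bit over a quarter of the file is read per token.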

coder543@reddit

GPT-OSS-20B is also A3B, so I’m not sure what you mean.
View on Reddit #79194093

Odd-Ordinary-5922@reddit

gpt oss 20b is actually 4b parameters avg
View on Reddit #79194998

DistanceSolar1449@reddit

27b dense model is more interesting for the people who can wait for slower tokens/sec. Should be smarter than 35b a3b
View on Reddit #79185030

-p-e-w-@reddit

It will also directly compete with Gemma 3 27B.
View on Reddit #79186999

simracerman@reddit

In fact, them choosing 27B size means Gemma4 direct competitor.
View on Reddit #79189700

po_stulate@reddit

Is Gemma4 actually a thing?
View on Reddit #79195904

Silver-Champion-4846@reddit

Yeah we're actually not certain it will be released.
View on Reddit #79198138

Responsible_Pain3278@reddit

llama.cpp now supports any draft models, even if the tokenizer dictionary is incompatible. Furthermore, speculative decoding of ngrams without a draft model has been added. However, for a number of reasons, speculative decoding works poorly with moe, which is likely why it's so rarely used.
View on Reddit #79195758

Dry_Yam_4597@reddit

Omfg.
View on Reddit #79184483

indicava@reddit

I have to say I’m kind of disappointed with this release. It might be a niche use case, but for us fine tuners, only a single size dense model with no base variants is practically useless. This trend already started with Qwen3 where they never released the base variant of the 32B size and all releases since then have been MoE. While running local models for coding or creative writing has a significant value proposition, the ability to fine tune models for personal use or as a basis for a commercial product is a liberty that’s slowly been eroding away. That’s a shame, and I don’t think it’s being brought up enough.
View on Reddit #79217473

Technical-Earth-3254@reddit

Wish it was 100B so q4 fits in my poor 64GB system. But looks perfect for the 128gb faction
View on Reddit #79184095

BigYoSpeck@reddit

If you have a GPU, add its VRAM to your system RAM to get what you can run. Gpt-oss-120b will run on 64gb RAM + 16gb VRAM quite well, so I have high hopes for this
View on Reddit #79184255

wisepal_app@reddit

i have 96 GB ddr5 ram and 16 GB vram but i get 14 t/s with gpt-oss 120b. By quite well do you mean this kind of speeds or much higher? i use llama.cpp with 60k context.
View on Reddit #79185062

Danmoreng@reddit

Depends on the processor and how you offload, I would say. I didn't test oss 120B, but I feel you probably could get some extra performance if you have not yet optimised settings. Do you use the `--fit` and `--fit-ctx` parameters of llama.cpp? If not, try them out.
View on Reddit #79185716

wisepal_app@reddit

i use --fit on. Not used --fit-ctx parameter. Will try it. Your qwen3 coder next speed is quite impressive. i get around 17 t/s with it. Can you share your full llama.cpp parameters please?
View on Reddit #79186480

Danmoreng@reddit

Here: https://github.com/Danmoreng/local-qwen3-coder-env In the repo I still use UDQ4 instead of MXFP4, that runs at 35 t/s instead of 40 t/s for MXFP4. Also, Windows is much slower than Linux. Under windows I only get around 25 t/s.
View on Reddit #79204559

wisepal_app@reddit

Thank you for sharing these settings. Now I get around 16 t/s with these settings:
`gpt-oss-120b-MXFP4-00001-of-00002.gguf --host 127.0.0.1 --port 8130 --fit on --fit-target 256 --jinja --flash-attn on --fit-ctx 60000 -b 1024 -ub 256 -ctk q8_0 -ctv q8_0 --temp 1.0 --top-p 0.95 --top-k 40 --min-p 0.01`
context: 8108/60160 (13%) Output: 6018/∞ 15.9 t/s
View on Reddit #79215983

carteakey@reddit

I get similar perf on my 12GB VRAM + 64GB RAM and here's the command with the params he mentioned. [https://carteakey.dev/blog/optimizing-qwen3-coder-next-local-inference/](https://carteakey.dev/blog/optimizing-qwen3-coder-next-local-inference/)
View on Reddit #79200793

Xantrk@reddit

Fit without fit context and a custom context can backfire. If it ends up "fitting" a smaller context and what you specify is larger (due to the initialization sequence), you end up with your kv cache partially outside your vram, and that's slow. If you try replacing your context with fit context 70000, that should help if this is the problem.
View on Reddit #79194248

serpix@reddit

Same perf on qwen3 coder next on an egpu oculink 16GB vram, 64gb ddr5.
View on Reddit #79196698

Significant_Fig_7581@reddit

How is it so fast? I use the IQ3 but it's like 15tkps
View on Reddit #79187022

carteakey@reddit

You should have way more. [https://carteakey.dev/blog/optimizing-gpt-oss-120b-local-inference/](https://carteakey.dev/blog/optimizing-gpt-oss-120b-local-inference/)
View on Reddit #79200686

wisepal_app@reddit

i will try these. thank you very much.
View on Reddit #79214445

BigYoSpeck@reddit

You can get more. I run llama.cpp with 64gb DDR4 3733 and an RX 6800 XT 16gb. With the full 131k context it gets about 22tok/s. How is your offloading configured? You want all layers plus kv cache offloaded to the GPU, then 32 MOE layers offloaded to CPU
View on Reddit #79187152

wisepal_app@reddit

Actually I am new to llama.cpp, so I try every combination I see in this sub. The last one I tried was this:
`gpt-oss-120b-MXFP4-00001-of-00002.gguf --host 127.0.0.1 --port 8130 --ctx-size 70000 --temp 1.0 --top-p 0.95 --min-p 0.01 --top-k 40 --jinja --fit on -np 1`
I got 12.8 t/s. And for qwen coder next:
`Qwen3-Coder-Next-MXFP4_MOE_BF16.gguf --host 127.0.0.1 --port 8130 -ngl 999 -sm none -mg 0 -t 12 -fa on -cmoe -c 61072 -b 512 -ub 512 -np 1 --jinja --temp 1.0 --top-p 0.95 --top-k 40 --min-p 0.01 --repeat-penalty 1.0`
I got 16.6 t/s
View on Reddit #79189363

BigYoSpeck@reddit

Rather than -cmoe, use -ncmoe 32 for gpt-oss. I forget the optimal value for Qwen3; it might be something like 37, but test it and look at your VRAM usage once it's fully loaded. Setting -b and -ub to 2048 makes a huge difference for large prompts as well, but uses more memory, meaning either fewer layers offloaded or less context
View on Reddit #79194586

wisepal_app@reddit

No luck. I don't get what I'm doing wrong. These are the latest settings I'm using, following the suggestions:
`gpt-oss-120b-MXFP4-00001-of-00002.gguf --host 127.0.0.1 --port 8130 -c 60000 -ngl 999 -ncmoe 32 -b 1024 -ub 1024 -fa on -sm none -t 4 --jinja --temp 1.0 --top-p 0.95 --top-k 40 --min-p 0.01`
View on Reddit #79211969

coder543@reddit

Are you using `--n-cpu-moe`?
View on Reddit #79185633

wisepal_app@reddit

No i use --fit on
View on Reddit #79186266

nicholas_the_furious@reddit

Play with the CPU MoE settings. It helps a lot. But for this Qwen model with 10B active you'll get much slower speeds than OSS120B, which has half the active params.
View on Reddit #79186626

carteakey@reddit

Exactly, its double the active params so less tok/s out the door but with linear attention it shouldn't decay as well like OSS120B does, and should largely stay consistent. So for long tasks it may end up performing the same? who knows.
View on Reddit #79200634

coder543@reddit

From my experience, you need to specifically turn fit off and adjust things yourself to get the best performance. Set ngl to 999, set the context to something you find acceptable, and then find the minimum value that loads without crashing for --n-cpu-moe
View on Reddit #79186550
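
The "find the minimum value that loads" step above can be automated with a binary search. A Python sketch under assumptions: `loads_ok` is a hypothetical callback you would write yourself (for example, launch the server with the given value and report whether it came up without running out of memory), and the search assumes that if n expert layers on CPU fit, n+1 also fit. The flags in `server_args` are the ones quoted in this thread plus llama.cpp's `-m` model-path flag.

```python
from typing import Callable, List, Optional


def min_cpu_moe_layers(num_layers: int,
                       loads_ok: Callable[[int], bool]) -> Optional[int]:
    """Smallest n in [0, num_layers] for which loads_ok(n) is True.

    Returns None if even keeping every layer's experts on the CPU fails.
    """
    if not loads_ok(num_layers):
        return None
    lo, hi = 0, num_layers
    while lo < hi:
        mid = (lo + hi) // 2
        if loads_ok(mid):
            hi = mid        # mid fits, try keeping fewer layers on CPU
        else:
            lo = mid + 1    # mid does not fit, need more layers on CPU
    return lo


def server_args(model_path: str, ctx: int, n_cpu_moe: int) -> List[str]:
    """Argument list using flags mentioned in this thread."""
    return ["-m", model_path, "-ngl", "999", "-c", str(ctx),
            "--n-cpu-moe", str(n_cpu_moe)]


if __name__ == "__main__":
    # Stand-in for a real load test: pretend anything below 31 runs out of VRAM.
    best = min_cpu_moe_layers(36, lambda n: n >= 31)
    print(best, server_args("model.gguf", 60000, best))
```

With a real `loads_ok`, each probe means a full model load, so the logarithmic number of attempts is the main benefit over trying values one by one.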

Technical-Earth-3254@reddit

You are very correct, if only my Windows didn't take up that much RAM when I'm trying to do something productive.
View on Reddit #79185491

BigYoSpeck@reddit

gpt-oss-120b uses a little over 70gb all in. Windows will unfortunately gobble up most of what's left before anything else can run. Linux is obviously a lot less demanding so you can still make use within reason
View on Reddit #79187662

TitwitMuffbiscuit@reddit

Nah, I'm using the mxfp4 from the ggml repo and it's using ~52gb of vram with --no-mmap on windows. If it's using more, it's due to mmap and/or to context spilling to system RAM with the Nvidia driver's sysmem fallback policy left at default. On a 12100f with 64 gb of ddr4 3200 and an rtx 3060 12 gb, I get 18 t/s using -ncmoe 31 at 30k context (doesn't oom) and 15 t/s with -cmoe at 64k context (and sysmem fallback at default). That said, Ubuntu LTS was 15% faster last time I checked, like a year ago.
View on Reddit #79189356

Far-Low-4705@reddit

i have 64Gb VRAM and only 16Gb of ddr3 RAM with an ancient 4 core CPU lol.. I was really hoping for an 80b (I built my system for under $100 total)
View on Reddit #79188631

luncheroo@reddit

You guys must be on Linux. I've found it runs well but context is low and you can't do much else on the OS at the same time.
View on Reddit #79186460

LagOps91@reddit

it's perfectly fine for a 64gb system as long as you have some vram (which you should)
View on Reddit #79185518

Significant_Fig_7581@reddit

The Q3 quants are too good nowadays dw :)
View on Reddit #79185432

pmttyji@reddit

Well, you have options (quants) from IQ4_XS to Q4_K_XL
View on Reddit #79184831

Imakerocketengine@reddit

oh wow, need some bench to compare the 122B variant to qwen next :)
View on Reddit #79215877

PermitNo8107@reddit

27B? would that be able to run on 16gb vram? :o
View on Reddit #79215639

CireHF103@reddit

Qwen Next and 3.5 so far have improved a lot compared to 3.0 in my experience. Very excited to see how it performs at smaller model sizes.
View on Reddit #79184901

Firm_Meeting6350@reddit

So true - honestly, I‘m a Codex / Opus fanboy, but winner of my heart and „drivers for automation“ are Qwen 3 Next and Qwen3 Coder Next. It‘s really really impressive. I use it for code & context extraction and even Opus and Codex admit that Qwen is better 🤣
View on Reddit #79192816

cafedude@reddit

Here's hoping for a Qwen3.5 Coder Next. Qwen3 Coder Next is the best model I've tried so far on my Strix Halo 128GB system. It's currently running and getting deep in the weeds with LLVM compiler optimizations and code generation - would've never imagined I'd be able to run a model locally that could handle that.
View on Reddit #79212813

Iory1998@reddit

The biggest improvement is the hybrid attention. Long context is the winning formula.
View on Reddit #79187329

Unhappy_Advantage_66@reddit

Hey just curious will the 27B or the 35B work on L4 GPU.
View on Reddit #79211733

Niket01@reddit

The 27B dense model is the one I'm most excited about. Dense models tend to be more predictable for fine-tuning and deployment compared to MoE, and 27B sits in that sweet spot where you can actually run it on consumer hardware with decent quantization. The MoE variants are interesting for benchmarks but the 27B is probably going to see the most real-world local deployment. Anyone tested the multimodal capabilities yet? Curious how it handles vision tasks compared to the Qwen3 VL models.
View on Reddit #79210010

zhambe@reddit

Oh this is exciting! Qwen3-30B-Coder-FP8 has been my daily driver, I *think* I could squeeze the 35B version in...
View on Reddit #79208076

Ok-Scarcity-7875@reddit

27B-A3B would be better if you want to have a large context and speed on a 24GB GPU.
View on Reddit #79202527

Zugzwang_CYOA@reddit

On my 4090, Gemma-3 27b already runs at blazing fast speeds, even at Q5 quants, with 16k context. I don't see why that would be different from the new dense 27b.
View on Reddit #79207988

lemon07r@reddit

Finally a small dense model! I can think about finetuning stuff again without moe woes
View on Reddit #79206630

BasicInteraction1178@reddit

Has anyone tried using Qwen3.5 for coding yet? What are your feelings?
View on Reddit #79206470

EmPips@reddit

> 122B A10B

M-Series Mac and Strix-Halo owners are going to have a good day.
View on Reddit #79204967

Psyko38@reddit

Can't wait to see the performance of the 27b and 35b on my Rx 9060xt 16gb
View on Reddit #79203419

RandumbRedditor1000@reddit

27B DENSE LET'S GOOO
View on Reddit #79203363

edeltoaster@reddit

Anything on benchmarks yet?
View on Reddit #79202942

GrungeWerX@reddit

I actually want all three. :)
View on Reddit #79202506

Halpaviitta@reddit

I've said it before and I'll say it again. Alibaba & Qwen is an extremely productive team, kudos to them. I am a bit worried about their workplace culture though, are they working 16 hour shifts or something? lol
View on Reddit #79186998

AndreVallestero@reddit

996, though I feel like that's every AI lab right now
View on Reddit #79202150

phenotype001@reddit

Models are on HF: [https://huggingface.co/Qwen/Qwen3.5-122B-A10B](https://huggingface.co/Qwen/Qwen3.5-122B-A10B)
View on Reddit #79202009

mossy_troll_84@reddit

https://preview.redd.it/tmv0y54a3hlg1.png?width=1734&format=png&auto=webp&s=6320e0e924c946c3ccd29e1234b919e3e0f129fb and unsloth quantizations are uploading now...
View on Reddit #79201518

Jayfree138@reddit

Oh look, my new daily driver. Can't wait
View on Reddit #79201236

mossy_troll_84@reddit

https://preview.redd.it/ak3fbjmk1hlg1.png?width=1704&format=png&auto=webp&s=ab6f569ed32ff7ea4f3acb78e238ef55eaeb7d74
View on Reddit #79200569

RegularRecipe6175@reddit

Qwen GGUF?
View on Reddit #79200087

PixelatedCaffeine@reddit

https://preview.redd.it/qh7qke8ceglg1.png?width=626&format=png&auto=webp&s=953aa76b60d0d396bb1ee6dce6d5593b44198904 it's happening!
View on Reddit #79190357

Alarmed-Channel2145@reddit

Also [https://github.com/QwenLM/Qwen3.5/pull/25](https://github.com/QwenLM/Qwen3.5/pull/25) is already merged
View on Reddit #79199472

No_Mango7658@reddit

122 moe should be so great on strix halo with 35b for speculative decoding!
View on Reddit #79198591

mossy_troll_84@reddit

Qwen3.5-122B-A10B 🤩 https://preview.redd.it/o3inz6xxkglg1.png?width=1342&format=png&auto=webp&s=8398cbddc76ac0c5d49b20fcd0ec608d35bb50e6
View on Reddit #79193359

tarruda@reddit

Amazing, this might be the sweet spot for 128G systems
View on Reddit #79195000

mossy_troll_84@reddit

yup, that is what I am waiting for! 😍 Although till now I am using Step-3.5 Flash and I am quite happy (for non professional work)
View on Reddit #79197528

tarruda@reddit

True, Step-3.5 Flash is also a beast and perfect for 128G. I can run up to two 128k context parallel streams locally as it is quite context efficient. In the AMA the StepFun team also said they plan to continue improving the 197B arch and it will even have vision, so also looking forward to that!
View on Reddit #79198210

russianguy@reddit

GGUF when?
View on Reddit #79197964

cibernox@reddit

27B dense? I didn’t anticipate that one. I’m very interested in what they will release in the 3B-12B range.
View on Reddit #79185702

Adventurous-Paper566@reddit

It's a way of saying "get a move on, Google!"
View on Reddit #79197529

Iory1998@reddit

An interesting choice of model size. It's like Gemma. But 27 will fit nicely in 24GB of Vram with a large context size.
View on Reddit #79187183

cibernox@reddit

My guess is that they purposely decided to release it as a way to demonstrate the improvements they made, so they have their 27B dense model matching their 32B model, so they have a measurable ~15% generational improvement.
View on Reddit #79187317

Iory1998@reddit

That's why it's interesting. I can tell you right now it's gonna be a banger of a model. Vision support, long context size, lower VRAM requirements, and smart as hell for its size. The Qwen team is definitely sending a message to both the community and Google. We don't need Gemma-4... Though I really hope Google releases new Gemma models.
View on Reddit #79187890

MerePotato@reddit

27B huh, interesting
View on Reddit #79197468

Green-Ad-3964@reddit

I hope the 35B fits a 5090...
View on Reddit #79188520

DeepRecipe6331@reddit

Q4 GLM-4.7-Flash can, there's no reason why the 35B can't. You should have plenty of room to spare, and depending on how much you want to use and context sizes you can definitely bump it up to Q6.
View on Reddit #79197355

RMK137@reddit

It should if you use Q3/Q4, especially with the unsloth dynamic quants. I've used the Nemotron-30b-A3B at UD-Q4_K_XL on my 5090. This one is a little larger, but you can also quantize the KV cache, which buys you more context.
View on Reddit #79192327

dlcsharp@reddit

These models use hybrid attention, so memory usage for the KV cache should be dramatically lower thanks to that, just like Qwen 3 Next
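A rough sketch of why that matters: the KV cache only grows with context for the full-attention layers, so if only a fraction of the layers use full attention, the cache shrinks by the same fraction. All the config numbers below (48 layers, 8 KV heads, head dim 128, one full-attention layer in four) are hypothetical placeholders, not the real Qwen3.5 config, and the small fixed-size state of the linear-attention layers is ignored.

```python
def kv_cache_gib(n_attn_layers: int, n_kv_heads: int, head_dim: int,
                 context_len: int, bytes_per_elem: float = 2.0) -> float:
    """KV cache size in GiB: keys + values for every full-attention layer."""
    elems = 2 * n_attn_layers * n_kv_heads * head_dim * context_len
    return elems * bytes_per_elem / 2**30


# HYPOTHETICAL config values, for illustration only.
N_LAYERS = 48
N_KV_HEADS = 8
HEAD_DIM = 128
CONTEXT = 131072  # 128k tokens

# Every layer uses full attention, fp16 cache (2 bytes per element).
full = kv_cache_gib(N_LAYERS, N_KV_HEADS, HEAD_DIM, CONTEXT)

# ASSUMPTION: only 1 in 4 layers keeps a growing KV cache.
hybrid = kv_cache_gib(N_LAYERS // 4, N_KV_HEADS, HEAD_DIM, CONTEXT)

# Same hybrid layout with an ~8-bit cache (about 1 byte per element,
# ignoring the small per-block scale overhead of real 8-bit formats).
hybrid_q8 = kv_cache_gib(N_LAYERS // 4, N_KV_HEADS, HEAD_DIM, CONTEXT,
                         bytes_per_elem=1.0)

if __name__ == "__main__":
    print(f"full attention, fp16: {full:.1f} GiB")
    print(f"hybrid (1/4), fp16:   {hybrid:.1f} GiB")
    print(f"hybrid (1/4), ~8-bit: {hybrid_q8:.1f} GiB")
```

With these made-up numbers the 128k cache drops from 24 GiB to 6 GiB from the hybrid layout alone, and quantizing the cache to roughly 8 bits halves it again.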
View on Reddit #79196646

Zestyclose839@reddit

For anyone who already played with the 27B and 35B models, what are the initial impressions? Any personality changes over 30B A3B?
View on Reddit #79196966

Adventurous-Paper566@reddit

27B! I'm so happy!!
View on Reddit #79196098

ridablellama@reddit

this is why qwen is my favorite
View on Reddit #79195872

Dyssun@reddit

please be released today please be released today 🙏
View on Reddit #79195067

Fox-Lopsided@reddit

Where 9b :(
View on Reddit #79194385

pigeon57434@reddit

do you think the 27B dense model will be smarter than the 35B-A3B model?
View on Reddit #79190364

PANIC_EXCEPTION@reddit

Doubtful, Qwen3-32B wasn't too much further ahead compared to Qwen3-30B-A3B
View on Reddit #79193567

Alarming-Ad8154@reddit

I don’t think we've had a dense model in a while, right? Very curious to see how 2026 agentic-coder/reinforcement learning does on a dense base model… If this is mixed linear/quadratic attention and someone converts it to NVFP4, it could be an absolute 5080/5070 Ti monster…
View on Reddit #79193216

WithoutReason1729@reddit

Your post is getting popular and we just featured it on our Discord! [Come check it out!](https://discord.gg/PgFhZ8cnWW) You've also been given a special flair for your contribution. We appreciate your post! *I am a bot and this action was performed automatically.*
View on Reddit #79192778

ApprehensiveAd3629@reddit

i hope to get a qwen 3.5 14b 🙏
View on Reddit #79192630

xrvz@reddit

Finally another ~120B model for Strix Halo.
View on Reddit #79191421

tmvr@reddit

Nice, some exciting sizes that make sense as well for a lot of home setups. I'll have to go and try Q3 with the 122B, but the 27B dense and the 35B MoE are a nice fit for the 24GB and 32GB VRAM configs.
View on Reddit #79189940

Insomniac24x7@reddit

My 3090 is gonna cry
View on Reddit #79185628

jax_cooper@reddit

tears of joy
View on Reddit #79189185

Insomniac24x7@reddit

Lol
View on Reddit #79189229

maglat@reddit

With multimodal tasks they mean picture and video understanding, right? Same as the big variant, I guess
View on Reddit #79187830

Skyline34rGt@reddit

Yes. Same as Qwen3 VL had before.
View on Reddit #79188729

Big_Mix_4044@reddit

Hyped af
View on Reddit #79187353

Few_Painter_5588@reddit

A Dense 27B model! That is the perfect size!
View on Reddit #79187115

BrightRestaurant5401@reddit

Indeed, the only model I am looking forward to. I want a model with stronger generalized intelligence; the extra "memory" most MoE models offer is just not that interesting.
View on Reddit #79187325

HugoCortell@reddit

Well yeah, I imagine you'd find them there. Not going to be showing up on ChatGPT, are they?
View on Reddit #79187063

asraniel@reddit

i would love something good in the 8-14 range. it seems like a sweet spot for many tasks such as information extraction, summaries etc
View on Reddit #79186988

noctrex@reddit

Do we have any information if they will release any small versions of 3.5? Like they did with 3 and 3-VL? 2B, 4B, 8B. Cause those small ones are nice to put on computers that do not have a dGPU
View on Reddit #79186022

Schlick7@reddit

9B and that 35B-A3B were the ones in the llama.cpp PR, so we can expect the 9B. I'd be pretty surprised if that was the smallest
View on Reddit #79186634

Iory1998@reddit

I am pretty certain that there will be a 4B model as well. Qwen3-4B is so popular and used in so many image generators that I think Qwen will release another one.
View on Reddit #79186912

Loskas2025@reddit

https://preview.redd.it/uasrx10b4glg1.png?width=880&format=png&auto=webp&s=8ea73c7c39a124b4b1e29168120d93580699e9e3 spicy
View on Reddit #79186734

No_Doc_Here@reddit

Did they make any announcements about whether they will release these models as open source?
View on Reddit #79186572

Loskas2025@reddit

Finally
View on Reddit #79186492

Deep_Traffic_7873@reddit

A3B for me, thanks
View on Reddit #79186150

No_Swimming6548@reddit

Not bad https://preview.redd.it/jbabj7gp1glg1.jpeg?width=1358&format=pjpg&auto=webp&s=ffdc1678bde44b7c5a14c185dc69f0be6e7e09e9
View on Reddit #79185801

Significant_Fig_7581@reddit

Can't wait it's coming OMG!!!!
View on Reddit #79185214

Pedalnomica@reddit

I'm hoping they release their own quants again. It always seemed like they did some sort of QAT because their quants were really strong.
View on Reddit #79184719

Hanthunius@reddit

I want to try out all of these! Great sizes for many of us.
View on Reddit #79184675

pmttyji@reddit

Just wow!
View on Reddit #79184594

rerri@reddit

Awesome sizes, can't wait to try them!
View on Reddit #79184576

jacek2023@reddit

awesome findings and very unexpected sizes!!! great news!!!
View on Reddit #79184446

sterby92@reddit

huge if true!
View on Reddit #79184416

LoveMind_AI@reddit

Oh hell yes
View on Reddit #79184276

Leflakk@reddit

Interesting!
View on Reddit #79184259

9r4n4y@reddit

Ayooooooo lets gooo
View on Reddit #79184213

Paramecium_caudatum_@reddit

Finally! Can't wait to check them out.
View on Reddit #79183974