New Qwen3.5 models spotted on qwen chat

[-]

Freigus@reddit

dense 27B! 122B MoE! I'm glad they didn't abandon medium-sized dense models

Reply

[-]

PrefersAwkward@reddit

Hopefully we get a good draft model for that dense one. I'm guessing a qwen3 0.6 model wouldn't do it for 3.5 models

Reply

[-]

Street_Confidence453@reddit

We have the MTP module which is trained with multi-step for all Qwen3.5 models!

Reply

[-]

PrefersAwkward@reddit

Apologies for my ignorance. Is this something akin to self-prediction? Something like using a light engine to guess the next tokens and then having the big engine verify and correct them?

Reply

[-]

Not super familiar with this stuff, so take this comment with a grain of salt. I think MTP (multi-token prediction) adds some extra layers to the network so that each forward pass predicts the next N tokens instead of just 1. At inference time this becomes speculative decoding, where the main model verifies those predictions. There is also EAGLE3, which is similar, but its layers are trained after the main model is trained. MTP is trained as part of the main model because it provides speedups during training as well. Draft models (which I think are largely outdated) are smaller models with a similar output probability distribution to the main model, which you can then use for speculative decoding. But EAGLE3 has been shown to be more accurate (drafted tokens are accepted more often) and faster (because it's just a few layers instead of an entire draft model).
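To make the draft-then-verify idea concrete, here is a minimal greedy sketch in Python. It is not how MTP, EAGLE3, or any particular engine implements it: the "models" are toy functions (`toy_target`, `toy_draft` are made up for illustration), and a real engine would verify all drafted positions in one batched forward pass rather than in a loop. The point it shows is that the output is identical to plain decoding with the target; the drafter only changes how many tokens each step yields.

```python
from typing import Callable, List

Token = int
NextToken = Callable[[List[Token]], Token]


def speculative_step(target: NextToken, draft: NextToken,
                     context: List[Token], k: int) -> List[Token]:
    """Draft k tokens cheaply, then keep the prefix the target agrees with.

    Returns the tokens accepted in this step (always at least one).
    """
    # 1. The cheap drafter proposes k tokens autoregressively.
    proposed: List[Token] = []
    for _ in range(k):
        proposed.append(draft(context + proposed))

    # 2. The target checks each position. A real engine scores all k
    #    positions in one batched pass; this loop is only a stand-in.
    accepted: List[Token] = []
    for tok in proposed:
        expected = target(context + accepted)
        if tok != expected:
            accepted.append(expected)  # replace the first wrong guess, stop
            return accepted
        accepted.append(tok)

    # 3. Every draft matched, so the target's own next token is a bonus.
    accepted.append(target(context + accepted))
    return accepted


def generate(target: NextToken, draft: NextToken,
             prompt: List[Token], n_new: int, k: int = 4) -> List[Token]:
    """Greedy generation of n_new tokens using speculative steps."""
    out = list(prompt)
    while len(out) - len(prompt) < n_new:
        out.extend(speculative_step(target, draft, out, k))
    return out[:len(prompt) + n_new]


# Toy stand-ins (hypothetical): the target counts upward modulo 10,
# and the drafter agrees except right after a 4.
def toy_target(ctx: List[Token]) -> Token:
    return (ctx[-1] + 1) % 10


def toy_draft(ctx: List[Token]) -> Token:
    return 0 if ctx[-1] == 4 else (ctx[-1] + 1) % 10
```

The better the drafter matches the target, the more tokens each step accepts, which is why acceptance rate is the number people compare between draft models, EAGLE-style heads, and MTP.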

Reply

[-]

Competitive_Ad_5515@reddit

Is speculative decoding with draft models still a big deal? I don't see much discussion of it anymore

Reply

[-]

EmPips@reddit

Fewer dense models, plus fewer draft-sized/compatible models. Spec dec is absolutely still a thing; there are just far fewer models coming out where you'll get a big win out of it. The last one was probably a bit of a speed boost on the original Qwen3-235B using Qwen3 4B or Qwen3 0.6B. The smaller models never got updates, but Qwen3-235B-2507 came out and was much stronger, so nobody used the original, and the original small models weren't compatible as draft models.

Reply

[-]

b3081a@reddit

When running on a single device, it doesn't do well with MoE. Consecutive tokens do not necessarily activate the same experts, so when validating multiple tokens there are likely more parameters to be loaded and used, consuming more memory bandwidth and diminishing the benefits of MoE. For example, llama.cpp does well with dense-model speculative decoding but struggles on MoE. For large-scale deployments using expert parallelism on multiple GPUs, there will be more performance uplift (for a limited number of users per cluster).
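A rough way to see the effect described above is to count how many distinct experts a batch of tokens touches in one MoE layer. The sketch below assumes every token independently picks its experts uniformly at random, which real routers do not do, so treat it as a back-of-the-envelope bound rather than a measurement. The 128-expert, top-8 configuration in the usage line is a made-up example, not the Qwen3.5 config.

```python
def expected_unique_experts(n_experts: int, top_k: int, n_tokens: int) -> float:
    """Expected number of distinct experts touched in one layer.

    Assumption: each token independently selects top_k distinct experts
    uniformly at random out of n_experts.
    """
    # Probability that one particular expert is skipped by every token.
    p_untouched = (1.0 - top_k / n_experts) ** n_tokens
    return n_experts * (1.0 - p_untouched)


# Hypothetical layer: 128 experts, 8 active per token.
for n in (1, 2, 4, 8):
    print(n, round(expected_unique_experts(128, 8, n), 1))
```

Under this assumption, verifying several drafted tokens at once pulls in several times more expert weights than decoding one token, which is the extra memory traffic the comment is pointing at.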

Reply

[-]

po_stulate@reddit

It's here! [https://huggingface.co/Qwen/Qwen3.5-122B-A10B](https://huggingface.co/Qwen/Qwen3.5-122B-A10B)

Reply

[-]

Sufficient-Rent6078@reddit

With the 10B active parameters in the MoE, I'd expect the 27B dense model to not be that far behind in intelligence. Could be a really attractive choice for single gaming GPU setups.

Reply

[-]

TacGibs@reddit

MoEs are now way better than they were in the beginning; the old rule of thumb for comparing them to a dense model, sqrt(total parameters × active parameters), isn't relevant anymore.
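For reference, this is what the old rule works out to for the sizes mentioned in this thread. It is only the arithmetic of the rule of thumb, which, as noted above, is considered outdated; it is not a claim about how these models actually perform.

```python
import math


def dense_equivalent_b(total_b: float, active_b: float) -> float:
    """Old rule of thumb: geometric mean of total and active parameters (in billions)."""
    return math.sqrt(total_b * active_b)


for name, total, active in [("122B-A10B", 122, 10),
                            ("35B-A3B", 35, 3),
                            ("397B-A17B", 397, 17)]:
    print(f"{name}: ~{dense_equivalent_b(total, active):.1f}B dense-equivalent")
```

By this rule the 122B-A10B would land at roughly 35B dense-equivalent, which is the kind of estimate the comments here argue now undersells modern MoEs.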

Reply

[-]

Far-Low-4705@reddit

This is especially true for reasoning models. The reason it held for instruct models is that the total computation you can do is limited to a single forward pass, and a dense model does much more computation per pass. With reasoning, the total compute is spread out over the reasoning tokens, so it doesn't really matter how much compute you can do in a single forward pass. In practice, an MoE might use a few more reasoning tokens to arrive at the answer, but it won't make much practical difference performance-wise, and it will be much faster.

Reply

[-]

-p-e-w-@reddit

Indeed. gpt-oss-20b pretty much has the performance of a dense 20B model, despite having only 3.6B active parameters.

Reply

[-]

Sufficient-Rent6078@reddit

Good point - there have been some architectural improvements, and we don't know whether the MoE defaults to a higher reasoning effort budget than the dense model. The rule of thumb likely underestimates the actual capability we are going to see.

Reply

[-]

etherd0t@reddit

Because Qwen3.5 is MoE: Only \~17B parameters fire per token. That means:

* Latency scales closer to a 20B dense model 😉
* Memory scales closer to a 400B sparse model

Reply

[-]

MrHighVoltage@reddit

4bit quant would probably run on a 16GB GPU, that would be nice.....

Reply

[-]

Far-Low-4705@reddit

nnoooooo i cant run 122B....... im so sad...... i wanted an 80b sooo bad

Reply

[-]

tarruda@reddit

I wouldn't dismiss 2-bit quants of the 122B release which should be runnable in less than 50G. This new Qwen architecture is very resilient to quantization, I have been running 2-bit 397B on 128G mac with great success: https://huggingface.co/ubergarm/Qwen3.5-397B-A17B-GGUF/discussions/2

Reply

[-]

Far-Low-4705@reddit

I can run unsloth's UD-Q3\_K\_XL quant at full context with full GPU offload, but im always skeptical of anything less than 4 bit

Reply

[-]

giatai466@reddit

122B would be nice for my 16 GB VRAM + 64 GB RAM.

Reply

[-]

luncheroo@reddit

A tight fit in Q4 though, no?

Reply

[-]

ThrowawayNotSusLol@reddit

You guys are weird. I've got 12GB of VRAM and 32GB of RAM, and I was running their Qwen3 235B A22B model at Q4. Sure, it was slow, but this new 100B-param MoE should only be easier. I don't know why you act like it's so impossible to run these without massive PCs.

Reply

[-]

Tai9ch@reddit

Speed matters. Can you fly a thousand miles in a plane? Yes. Can you drive a thousand miles? Sure. Can you walk a thousand miles? Technically, yes.

Reply

[-]

ThrowawayNotSusLol@reddit

It only matters to those who need it to be faster than reading speed.

Reply

[-]

Tai9ch@reddit

I'm having trouble coming up with a use case where beating reading speed isn't useful. Pure live roleplay with no thinking tokens or tool use maybe? Even then, I'd want at least 20 tokens/second, which seems hard to get with most of the model offloaded to CPU.

Reply

[-]

ThrowawayNotSusLol@reddit

The simple use of asking it a question. You need to read the answer to understand it

Reply

[-]

luncheroo@reddit

You are right that running does not equal running fast. I have specific use cases, MCPs, and I am admittedly an impatient person. When I tried OSS 120b, the generation speed was fine but I couldn't do anything else on Windows with my hardware because I was using nearly all the RAM. Since I tend to have a local LLM open on one screen and other stuff going on on another, it didn't work well for me. For other people, that would be just fine.

Reply

[-]

simracerman@reddit

OSS-120B-MXFP4 fits with 64k context just fine.

Reply

[-]

Borkato@reddit

What kind of prompt processing speed are yall getting?

Reply

[-]

simracerman@reddit

I just fed it a 16k long prompt on Windows, I got 104 t/s for processing and 25 t/s for generation. Quite usable for when I need a definite 1-shot answer. https://imgur.com/a/XOAOpF6 For anyone asking, I still have ~5-6GB left in system RAM when the model is fully loaded. I have Docker running open WebUI and couple other small things. My Windows idles at 11GB system RAM usage too. On Linux, this will be even better.

Reply

[-]

Borkato@reddit

Yeesh, I consider that prompt processing suuuper slow. So it takes 2.66 minutes to even start generating the first token?

Reply

[-]

carteakey@reddit

True. I ran it on my config (12GB VRAM + 64GB RAM) and get around 250 tk/s pp, so there's scope for optimization (Linux, llama.cpp params, etc.).

prompt eval time = 95560.34 ms / 24379 tokens (3.92 ms per token, 255.12 tokens per second)

eval time = 124347.26 ms / 2312 tokens (53.78 ms per token, 18.59 tokens per second)

total time = 219907.60 ms / 26691 tokens

It took a minute and a half to process 24k tokens before starting, which is indeed slow, but incredible considering the fact that it's running on such hardware. I expect
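The timing numbers above are just token counts divided by throughput, so it is easy to estimate the wait before the first token for other setups. A small sketch (the helper name `seconds_for` is mine, not something from llama.cpp):

```python
def seconds_for(tokens: int, tokens_per_second: float) -> float:
    """Time to push a number of tokens through at a given throughput."""
    return tokens / tokens_per_second


# Figures from the timing log above.
prefill_s = seconds_for(24379, 255.12)  # prompt processing before the first token
decode_s = seconds_for(2312, 18.59)     # generation

print(f"prefill: {prefill_s:.1f} s, decode: {decode_s:.1f} s")

# Same formula for the earlier 16k prompt at 104 t/s (taking 16k as 16,000 tokens).
print(f"16k prompt at 104 t/s: {seconds_for(16000, 104) / 60:.2f} min")
```

Prompt processing time grows linearly with prompt length here, so doubling pp speed halves the wait before generation starts.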

Reply

[-]

Borkato@reddit

This is interesting. When I tested it I got 81T/s prompt processing speed lmao!

Reply

[-]

carteakey@reddit

What are your llama.cpp params and hardware? I often point to my post as a reference: [https://carteakey.dev/blog/optimizing-gpt-oss-120b-local-inference/](https://carteakey.dev/blog/optimizing-gpt-oss-120b-local-inference/)

Reply

[-]

Borkato@reddit

This is very helpful!! I will try some of this :D

Reply

[-]

coder543@reddit

On DGX Spark, I get 2500 tok/s prompt processing and about 60 tok/s generation with GPT-OSS-120B under llama.cpp. Some people report getting up to 6000 tok/s prompt processing under vLLM with the same setup, vLLM is just a hassle to use.

Reply

[-]

simracerman@reddit

I paid 30% of what a DGX Spark costs, and my 5070 Ti plays 4K AAA games, rocks all the smaller models and runs any OS of my choice. I guess when the DGX Spark can do all those, then it’s a fair comparison. Otherwise, it’s an apples-to-watermelons comparison.

Reply

[-]

Borkato@reddit

Wow, that’s insane. I wish I could afford that 😭

Reply

[-]

luncheroo@reddit

I can run it with the same specs, but I can't easily multitask with other things on Windows.

Reply

[-]

boissez@reddit

Seems more like a perfect model for the strix halo/DGX gang.

Reply

[-]

Iory1998@reddit

I don't think it would fit unless you use Q2 or Q3. Also, you should keep in mind that A10B is way larger than your typical A3B, so expect very slow generation speed. The model will be smart though.

Reply

[-]

simracerman@reddit

OSS-120B-MXFP4 fits with 64k context just fine.

Reply

[-]

Iory1998@reddit

That model was trained from scratch in 4-bit. OpenAI did well with it. Qwen-3.5 is likely trained in FP16. Don't expect Q4 to be around 65GB like OSS-120B.

Reply

[-]

carteakey@reddit

Qwen3-Coder-Next-MXFP4\_MOE.gguf at 80b is 40GB. Why would original training change the size of the quantized model? Just curious.

Reply

[-]

Iory1998@reddit

If the company trains the model natively at FP4, then it would have better quality.

Reply

[-]

FullstackSensei@reddit

Just got a few 64GB Jetson AGX Xaviers with carrier boards. Thinking of pairing one with a 12GB A2000. Would be epictastic with 122B at Q4

Reply

[-]

durden111111@reddit

Need that 122B model

Reply

[-]

po_stulate@reddit

It's here! [https://huggingface.co/Qwen/Qwen3.5-122B-A10B](https://huggingface.co/Qwen/Qwen3.5-122B-A10B)

Reply

[-]

EmPips@reddit

As a general purpose model it seems like they're trying to paint it being as good as the original Qwen3-235B (not the updated 2507 checkpoint) but twice as fast and half the memory. The real gains are in instruction following and coding use. Meaning this *could* have the all-around strength that larger Qwen's have but the agentic abilities of GLM and Minimax models. All of this is subject to testing of course but I really hope these numbers turn out to be real.

Reply

[-]

NoahFect@reddit

Unsloth's 4-bit quant just passed the car wash and upside-down cup tests, which none of the other Qwen models I've tried could deal with. This model is feeling pretty real.

Reply

[-]

EmPips@reddit

I believe it. Just not an option for my machine at the moment! Working with 48GB of VRAM + 64GB of DDR4

Reply

[-]

NoahFect@reddit

It's worth trying a 2-bit or 3-bit quant. This model seems to be very amenable to quantization, for whatever reason. I don't understand it, but Unsloth actually reports the 2-bit and 3.5-bit quants as outperforming the 4-bit? See https://unsloth.ai/docs/models/qwen3.5#benchmarks

Reply

[-]

po_stulate@reddit

Yes! Since the unsloth quants are not out yet, I created a qwen chat account and tried the 122B-A10B one out. It seems like the most capable model I've seen so far. I gave it some test prompts (asking it to parallelize a bash script where doing so creates a race condition on generated temp files, etc.) that no local model I've tried (glm-4.5-air, qwen3-235b, gpt-oss-120b, etc.) ever got right even once. And it nailed the prompts on the first try. Super excited for the unsloth quants; it may mean I can finally delete all my 500GB+ (in total) of models and keep only this one, if it's as good locally as it was when I tested it on their website.

Reply

[-]

UnknownLesson@reddit

How much ram and VRAM do i need?

Reply

[-]

po_stulate@reddit

The Q4\_K\_XL quant from unsloth is 70GB
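For a rough feel of which quants fit which machines, file size is approximately parameter count times average bits per weight. The sketch below assumes a nominal 122B parameters and one uniform bit width, and it ignores file metadata, the vision projector, and the KV cache, so real files will differ somewhat.

```python
def quant_size_gb(params_billion: float, bits_per_weight: float) -> float:
    """Approximate file size in GB (10^9 bytes) at a uniform bit width."""
    return params_billion * bits_per_weight / 8.0


def implied_bits_per_weight(params_billion: float, size_gb: float) -> float:
    """Average bits per weight implied by a known file size."""
    return size_gb * 8.0 / params_billion


# The 70GB figure quoted above, against a nominal 122B parameters.
print(f"{implied_bits_per_weight(122, 70):.2f} bits per weight on average")

# What a few other average bit widths would come to (illustrative values only).
for bpw in (2.5, 3.5, 4.0):
    print(f"{bpw} bpw -> ~{quant_size_gb(122, bpw):.0f} GB")
```

On these assumptions the 70GB file averages about 4.6 bits per weight, which is typical for a "4-bit" quant that keeps some tensors at higher precision.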

Reply

[-]

wh33t@reddit

Siiiiick. This is what I've been waiting for, 100ishB, with a higher expert count. And multi-modal! Is the projector included?

Reply

[-]

EmPips@reddit

Yes. I'm so ready to dethrone GLM 4.5 air and 4.6v as the top models my machine can run.

Reply

[-]

lolwutdo@reddit

Why not MiniMax 2.5? It's way better than either of those models.

Reply

[-]

EmPips@reddit

As soon as you're outside of Agentic use cases I don't enjoy it as much. It's also a fairly weak general purpose model for me. It's the strongest coding model that fits nicely on my machine but I'm finding myself preferring GLM 4.6v at Q4/Q5 rather than MiniMax at Q3. Great model.. it just doesn't have a home in my workflows nor my machine. Maybe if I had more VRAM and could run Q4+ that'd change.

Reply

[-]

TopCryptographer8236@reddit

GLM did release a mid-sized MoE, GLM 4.6V. It's just kind of lackluster compared to 4.5 Air for coding.

Reply

[-]

coder543@reddit

They’ve had 4.7 and 5 since then. No new mid sized models.

Reply

[-]

phenotype001@reddit

And it's multimodal, so perfect.

Reply

[-]

LosEagle@reddit

Stupid question, but if the 27b is multimodal, does it mean that it is less efficient for chat because some of those parameters are for vision?

Reply

[-]

Evolution31415@reddit

The vision encoder is a separate component from the reasoning/thinking part of the model, and you can disable it in text-only mode to free some memory for the KV cache.

Reply

[-]

Physical_Screen_7543@reddit

ngl this 27B is actually cracked. Coding and multimodal perfs are giving me early Gemini 3 Pro vibes lol. Perfect weight for anyone building a local agentic OS. Can't believe we're getting this much juice in a dense model in 2026. My 5090 is ready.

Reply

[-]

Steus_au@reddit

35b would be a bomb

Reply

[-]

9r4n4y@reddit

100% and also glm 5 flash

Reply

[-]

Silver-Champion-4846@reddit

Glm 5 Flash? WHAT?

Reply

[-]

9r4n4y@reddit

If it comes then 😭😅

Reply

[-]

Silver-Champion-4846@reddit

Sure, you made me think there was a GLM Flash

Reply

[-]

Healthy-Nebula-3603@reddit

From the benchmarks they showed 35b moe is weaker than 27b dense

Reply

[-]

-p-e-w-@reddit

gpt-oss-20b killer. Even has fewer active parameters.

Reply

[-]

Witty_Mycologist_995@reddit

Wdym where

Reply

[-]

coder543@reddit

I have measured the weights of GPT-OSS-20B myself using a script to parse the full precision GGUF:

weights_used_bytes: 3190031616 (3.19 GB / 2.97 GiB)

weights_total_bytes: 12096558336 (12.1 GB / 11.27 GiB)

Definitely A3B, however you want to measure it.

Reply

[-]

coder543@reddit

GPT-OSS-20B is also A3B, so I’m not sure what you mean.

Reply

[-]

Odd-Ordinary-5922@reddit

gpt oss 20b is actually 4b parameters avg

Reply

[-]

DistanceSolar1449@reddit

27b dense model is more interesting for the people who can wait for slower tokens/sec. Should be smarter than 35b a3b

Reply

[-]

-p-e-w-@reddit

It will also directly compete with Gemma 3 27B.

Reply

[-]

simracerman@reddit

In fact, their choice of the 27B size means it's a direct Gemma4 competitor.

Reply

[-]

po_stulate@reddit

Is Gemma4 actually a thing?

Reply

[-]

Silver-Champion-4846@reddit

Yeah we're actually not certain it will be released.

Reply

[-]

Responsible_Pain3278@reddit

llama.cpp now supports any draft models, even if the tokenizer dictionary is incompatible. Furthermore, speculative decoding of ngrams without a draft model has been added. However, for a number of reasons, speculative decoding works poorly with moe, which is likely why it's so rarely used.

Reply

[-]

Dry_Yam_4597@reddit

Omfg.

Reply

[-]

indicava@reddit

I have to say I’m kind of disappointed with this release. It might be a niche use case, but for us fine tuners, only a single size dense model with no base variants is practically useless. This trend already started with Qwen3 where they never released the base variant of the 32B size and all releases since then have been MoE. While running local models for coding or creative writing has a significant value proposition, the ability to fine tune models for personal use or as a basis for a commercial product is a liberty that’s slowly been eroding away. That’s a shame, and I don’t think it’s being brought up enough.

Reply

[-]

Technical-Earth-3254@reddit

Wish it was 100B so q4 fits in my poor 64GB system. But looks perfect for the 128gb faction

Reply

[-]

BigYoSpeck@reddit

If you have a GPU, add its VRAM to your system RAM to work out what you can run. Gpt-oss-120b will run on 64GB RAM + 16GB VRAM quite well, so I have high hopes for this.

Reply

[-]

wisepal_app@reddit

i have 96 GB ddr5 ram and 16 GB vram but i get 14 t/s with gpt-oss 120b. By quite well do you mean this kind of speeds or much higher? i use llama.cpp with 60k context.

Reply

[-]

Danmoreng@reddit

Depends on the processor and how you offload, I would say. I didn’t test oss 120B, but I feel you could probably get some extra performance if you have not yet optimised your settings. Do you use the `--fit` and `--fit-ctx` parameters of llama.cpp? If not, try them out.

Reply

[-]

wisepal_app@reddit

i use --fit on. Not used --fit-ctx parameter. Will try it. Your qwen3 coder next speed is quite impressive. i get around 17 t/s with it. Can you share your full llama.cpp parameters please?

Reply

[-]

Danmoreng@reddit

Here: https://github.com/Danmoreng/local-qwen3-coder-env In the repo I still use UDQ4 instead of MXFP4, that runs at 35 t/s instead of 40 t/s for MXFP4. Also, Windows is much slower than Linux. Under windows I only get around 25 t/s.

Reply

[-]

wisepal_app@reddit

Thank you for sharing these settings. Now I get around 16 t/s with these settings:

`gpt-oss-120b-MXFP4-00001-of-00002.gguf --host 127.0.0.1 --port 8130 --fit on --fit-target 256 --jinja --flash-attn on --fit-ctx 60000 -b 1024 -ub 256 -ctk q8_0 -ctv q8_0 --temp 1.0 --top-p 0.95 --top-k 40 --min-p 0.01`

context: 8108/60160 (13%), output: 6018/∞, 15.9 t/s

Reply

[-]

carteakey@reddit

I get similar perf on my 12GB VRAM + 64GB RAM and here's the command with the params he mentioned. [https://carteakey.dev/blog/optimizing-qwen3-coder-next-local-inference/](https://carteakey.dev/blog/optimizing-qwen3-coder-next-local-inference/)

Reply

[-]

Xantrk@reddit

Fit without fit-context and with a custom context can backfire. If it ends up "fitting" a smaller context and what you specify is larger (due to the initialization sequence), you end up with your KV cache partially outside your VRAM, and that's slow. If you try replacing your context with fit context 70000, that should help if this is the problem.

Reply

[-]

serpix@reddit

Same perf on qwen3 coder next on an egpu oculink 16GB vram, 64gb ddr5.

Reply

[-]

Significant_Fig_7581@reddit

How is it so fast? I use the IQ3 but it's like 15tkps

Reply

[-]

carteakey@reddit

You should have way more. [https://carteakey.dev/blog/optimizing-gpt-oss-120b-local-inference/](https://carteakey.dev/blog/optimizing-gpt-oss-120b-local-inference/)

Reply

[-]

wisepal_app@reddit

i will try these. thank you very much.

Reply

[-]

BigYoSpeck@reddit

You can get more. I run llama.cpp with 64GB DDR4-3733 and an RX 6800 XT 16GB. With the full 131k context it gets about 22 tok/s. How is your offloading configured? You want all layers plus the KV cache offloaded to the GPU, then 32 MoE layers offloaded to the CPU.

Reply

[-]

wisepal_app@reddit

Actually, I am new to llama.cpp, so I try every combination I see in this sub. The last one I tried was this:

`gpt-oss-120b-MXFP4-00001-of-00002.gguf --host 127.0.0.1 --port 8130 --ctx-size 70000 --temp 1.0 --top-p 0.95 --min-p 0.01 --top-k 40 --jinja --fit on -np 1`

I got 12.8 t/s. And for qwen coder next:

`Qwen3-Coder-Next-MXFP4_MOE_BF16.gguf --host 127.0.0.1 --port 8130 -ngl 999 -sm none -mg 0 -t 12 -fa on -cmoe -c 61072 -b 512 -ub 512 -np 1 --jinja --temp 1.0 --top-p 0.95 --top-k 40 --min-p 0.01 --repeat-penalty 1.0`

I got 16.6 t/s.

Reply

[-]

BigYoSpeck@reddit

Rather than -cmoe, use -ncmoe 32 for gpt-oss. I forget the optimal value for Qwen3; it might be something like 37, but test it and look at your VRAM usage once it's fully loaded. Setting -b and -ub to 2048 makes a huge difference for large prompts as well, but uses more memory, meaning either fewer layers offloaded or less context.

Reply

[-]

wisepal_app@reddit

No luck. I don't get what I'm doing wrong. These are the latest settings I'm using, following the suggestions:

`gpt-oss-120b-MXFP4-00001-of-00002.gguf --host 127.0.0.1 --port 8130 -c 60000 -ngl 999 -ncmoe 32 -b 1024 -ub 1024 -fa on -sm none -t 4 --jinja --temp 1.0 --top-p 0.95 --top-k 40 --min-p 0.01`

Reply

[-]

coder543@reddit

Are you using `--n-cpu-moe`?

Reply

[-]

wisepal_app@reddit

No i use --fit on

Reply

[-]

nicholas_the_furious@reddit

Play with the CPU MoE settings. It helps a lot. But for this Qwen model with 10B active you'll get much slower speeds than OSS120B, which has half the active params.

Reply

[-]

carteakey@reddit

Exactly, it's double the active params, so fewer tok/s out the door, but with linear attention it shouldn't decay the way OSS120B does and should largely stay consistent. So for long tasks it may end up performing the same? Who knows.

Reply

[-]

coder543@reddit

From my experience, you need to specifically turn fit off and adjust things yourself to get the best performance. Set ngl to 999, set the context to something you find acceptable, and then find the minimum value for --n-cpu-moe that loads without crashing.
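The "find the minimum value that loads" step can be done by bisection instead of stepping one value at a time. This is a sketch of the search only: `loads_ok` is a hypothetical callback that you would have to write yourself (for example, start the server with that `--n-cpu-moe` value and report whether it came up), and the search assumes that if a value loads, every larger value loads too.

```python
from typing import Callable


def min_cpu_moe_layers(n_layers: int, loads_ok: Callable[[int], bool]) -> int:
    """Smallest n in [0, n_layers] for which loads_ok(n) is True.

    Assumes monotonicity: keeping more expert layers on the CPU never
    makes loading fail. Raises if even n_layers does not load.
    """
    if not loads_ok(n_layers):
        raise RuntimeError("does not fit even with all expert layers on the CPU")
    lo, hi = 0, n_layers  # invariant: loads_ok(hi) is True
    while lo < hi:
        mid = (lo + hi) // 2
        if loads_ok(mid):
            hi = mid       # mid works, try keeping fewer layers on the CPU
        else:
            lo = mid + 1   # mid crashed, need more layers on the CPU
    return lo


# Stand-in for a real load test: pretend anything below 31 runs out of VRAM.
def fake_loads_ok(n: int) -> bool:
    return n >= 31


print(min_cpu_moe_layers(36, fake_loads_ok))
```

With 36 candidate layers this takes about half a dozen load attempts instead of up to 36, which matters when each attempt means loading tens of gigabytes.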

Reply

[-]

Technical-Earth-3254@reddit

You are very correct. If only my Windows didn't take up that much RAM when I'm trying to do something productive.

Reply

[-]

BigYoSpeck@reddit

gpt-oss-120b uses a little over 70gb all in. Windows will unfortunately gobble up most of what's left before anything else can run. Linux is obviously a lot less demanding so you can still make use within reason

Reply

[-]

TitwitMuffbiscuit@reddit

Nah, I'm using the mxfp4 from the ggml repo and it's using ~52gb of vram with --no-mmap on Windows. If it's using more, it's due to mmap and/or to context spilling to system RAM with the Nvidia driver's sysmem fallback policy left at its default. On a 12100f with 64 gb of ddr4 3200 and an rtx 3060 12 gb, I get 18 t/s using -ncmoe 31 at 30k context (doesn't oom) and 15 t/s with -cmoe at 64k context (and sysmem fallback at its default). That said, Ubuntu LTS was 15% faster last time I checked, like a year ago.

Reply

[-]

Far-Low-4705@reddit

i have 64Gb VRAM and only 16Gb of ddr3 RAM with an ancient 4 core CPU lol.. I was really hoping for an 80b (I built my system for under $100 total)

Reply

[-]

luncheroo@reddit

You guys must be on Linux. I've found it runs well but context is low and you can't do much else on the OS at the same time.

Reply

[-]

LagOps91@reddit

it's perfectly fine for a 64gb system as long as you have some vram (which you should)

Reply

[-]

Significant_Fig_7581@reddit

The Q3 quants are too good nowadays dw :)

Reply

[-]

pmttyji@reddit

Well, you have options(quants) from IQ4\_XS to Q4\_K\_XL

Reply

[-]

Imakerocketengine@reddit

oh wow, need some bench to compare the 122B variant to qwen next :)

Reply

[-]

PermitNo8107@reddit

27B? would that be able to run on 16gb vram? :o

Reply

[-]

CireHF103@reddit

Qwen Next and 3.5 have so far improved a lot compared to 3.0, in my experience. Very excited to see how it performs at smaller model sizes.

Reply

[-]

Firm_Meeting6350@reddit

So true - honestly, I'm a Codex / Opus fanboy, but the winners of my heart and "drivers for automation" are Qwen 3 Next and Qwen3 Coder Next. It's really, really impressive. I use it for code & context extraction, and even Opus and Codex admit that Qwen is better 🤣

Reply

[-]

cafedude@reddit

Here's hoping for a Qwen3.5 Coder Next. Qwen3 Coder Next is the best model I've tried so far on my Strix Halo 128GB system. It's currently running and getting deep in the weeds with LLVM compiler optimizations and code generation - I would've never imagined I'd be able to run a model locally that could handle that.

Reply

[-]

Iory1998@reddit

The biggest improvement is the hybrid attention. Long context is the winning formula.

Reply

[-]

Unhappy_Advantage_66@reddit

Hey just curious will the 27B or the 35B work on L4 GPU.

Reply

[-]

Niket01@reddit

The 27B dense model is the one I'm most excited about. Dense models tend to be more predictable for fine-tuning and deployment compared to MoE, and 27B sits in that sweet spot where you can actually run it on consumer hardware with decent quantization. The MoE variants are interesting for benchmarks but the 27B is probably going to see the most real-world local deployment. Anyone tested the multimodal capabilities yet? Curious how it handles vision tasks compared to the Qwen3 VL models.

Reply

[-]

zhambe@reddit

Oh this is exciting! Qwen3-30B-Coder-FP8 has been my daily driver, I *think* I could squeeze the 35B version in...

Reply

[-]

Ok-Scarcity-7875@reddit

27B-A3B would be better if you want to have a large context and speed on a 24GB GPU.

Reply

[-]

Zugzwang_CYOA@reddit

On my 4090, Gemma-3 27b already runs at blazing fast speeds, even at Q5 quants, with 16k context. I don't see why that would be different from the new dense 27b.

Reply

[-]

lemon07r@reddit

Finally a small dense model! I can think about finetuning stuff again without moe woes

Reply

[-]

BasicInteraction1178@reddit

Has anyone tried using Qwen3.5 for coding yet? What are your feelings?

Reply

[-]

EmPips@reddit

> 122B A10B

M-Series Mac and Strix-Halo owners are going to have a good day.

Reply

[-]

Psyko38@reddit

Can't wait to see the performance of the 27b and 35b on my Rx 9060xt 16gb

Reply

[-]

RandumbRedditor1000@reddit

27B DENSE LET'S GOOO

Reply

[-]

edeltoaster@reddit

Anything on benchmarks yet?

Reply

[-]

GrungeWerX@reddit

I actually want all three. :)

Reply

[-]

Halpaviitta@reddit

I've said it before and I'll say it again. Alibaba & Qwen is an extremely productive team, kudos to them. I am a bit worried about their workplace culture though, are they working 16 hour shifts or something? lol

Reply

[-]

AndreVallestero@reddit

996, though I feel like that's every AI lab right now

Reply

[-]

phenotype001@reddit

Models are on HF: [https://huggingface.co/Qwen/Qwen3.5-122B-A10B](https://huggingface.co/Qwen/Qwen3.5-122B-A10B)

Reply

[-]

mossy_troll_84@reddit

https://preview.redd.it/tmv0y54a3hlg1.png?width=1734&format=png&auto=webp&s=6320e0e924c946c3ccd29e1234b919e3e0f129fb and unsloth quantizations are uploading now...

Reply

[-]

Jayfree138@reddit

Oh look, my new daily driver. Can't wait

Reply

[-]

mossy_troll_84@reddit

https://preview.redd.it/ak3fbjmk1hlg1.png?width=1704&format=png&auto=webp&s=ab6f569ed32ff7ea4f3acb78e238ef55eaeb7d74

Reply

[-]

RegularRecipe6175@reddit

Qwen GGUF?

Reply

[-]

PixelatedCaffeine@reddit

https://preview.redd.it/qh7qke8ceglg1.png?width=626&format=png&auto=webp&s=953aa76b60d0d396bb1ee6dce6d5593b44198904 it's happening!

Reply

[-]

Alarmed-Channel2145@reddit

Also [https://github.com/QwenLM/Qwen3.5/pull/25](https://github.com/QwenLM/Qwen3.5/pull/25) is already merged

Reply

[-]

No_Mango7658@reddit

122 moe should be so great on strix halo with 35b for speculative decoding!

Reply

[-]

mossy_troll_84@reddit

Qwen3.5-122B-A10B 🤩 https://preview.redd.it/o3inz6xxkglg1.png?width=1342&format=png&auto=webp&s=8398cbddc76ac0c5d49b20fcd0ec608d35bb50e6

Reply

[-]

tarruda@reddit

Amazing, this might be the sweet spot for 128G systems

Reply

[-]

mossy_troll_84@reddit

yup, that is what I am waiting for! 😍 Although till now I am using Step-3.5 Flash and I am quite happy (for non professional work)

Reply

[-]

tarruda@reddit

True, Step-3.5 Flash is also a beast and perfect for 128G. I can run up to two 128k context parallel streams locally as it is quite context efficient. In the AMA the StepFun team also said they plan to continue improving the 197B arch and it will even have vision, so also looking forward to that!

Reply

[-]

russianguy@reddit

GGUF when?

Reply

[-]

cibernox@reddit

27B dense? I didn’t anticipate that one. I’m very interested in what they will release in the 3B-12B range.

Reply

[-]

Adventurous-Paper566@reddit

It's a way of saying "get a move on, Google!"

Reply

[-]

Iory1998@reddit

An interesting choice of model size. It's like Gemma. But 27 will fit nicely in 24GB of Vram with a large context size.

Reply

[-]

cibernox@reddit

My guess is that they purposely decided to release it as a way to demonstrate the improvements they made so they have their 27B dense model matching their 32B model, so they have a measurable \~15% generational improvement.

Reply

[-]

Iory1998@reddit

That's why it's interesting. I can tell you right now it's gonna be a banger of a model. Vision support, long context size, lower VRAM requirements, and smart as hell for its size. The Qwen team is definitely sending a message to both the community and Google. We don't need Gemma-4... Though, I really hope Google releases new Gemma models.

Reply

[-]

MerePotato@reddit

27B huh, interesting

Reply

[-]

Green-Ad-3964@reddit

I hope the 35B fits a 5090...

Reply

[-]

DeepRecipe6331@reddit

Q4 GLM-4.7-Flash can, there's no reason why the 35B can't. You should have plenty of room to spare, and depending on how much you want to use and context sizes you can definitely bump it up to Q6.

Reply

[-]

RMK137@reddit

It should if you use Q3/Q4, especially with the unsloth dynamic quants. I've used Nemotron-30b-A3B at UD-Q4_K_XL on my 5090. This one is a little larger, but you can also quantize the KV cache, which buys you more context.

Reply

[-]

dlcsharp@reddit

These models use hybrid attention, memory usage for kv cache should be dramatically lower thanks to that just like Qwen 3 Next
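To put numbers on that, KV cache size scales with the number of layers that use full attention. The sketch below uses made-up dimensions (48 layers, 8 KV heads, head size 128) and assumes only one layer in four keeps a growing KV cache; none of these values are taken from the Qwen3.5 config, and the fixed-size state that linear-attention layers keep is ignored.

```python
def kv_cache_gib(n_attn_layers: int, n_kv_heads: int, head_dim: int,
                 context_len: int, bytes_per_elem: float = 2.0) -> float:
    """KV cache size in GiB: one K and one V tensor per full-attention layer."""
    total_bytes = (2 * n_attn_layers * n_kv_heads * head_dim
                   * context_len * bytes_per_elem)
    return total_bytes / 2**30


# Hypothetical model at 64k context with an fp16 cache.
all_layers = kv_cache_gib(48, 8, 128, 65536)      # every layer is full attention
hybrid = kv_cache_gib(48 // 4, 8, 128, 65536)     # assumed: 1 in 4 layers is full attention

print(f"full attention: {all_layers:.1f} GiB, hybrid: {hybrid:.1f} GiB")
```

The saving is proportional to the share of layers that drop the growing cache, so it gets more valuable the longer the context you want to fit next to the weights.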

Reply

[-]

Zestyclose839@reddit

For anyone who already played with the 27B and 35B models, what are the initial impressions? Any personality changes over 30B A3B?

Reply

[-]

Adventurous-Paper566@reddit

27B! I'm so happy!!

Reply

[-]

ridablellama@reddit

this is why qwen is my favorite

Reply

[-]

Dyssun@reddit

please be released today please be released today 🙏

Reply

[-]

Fox-Lopsided@reddit

Where 9b :(

Reply

[-]

pigeon57434@reddit

do you think the 27B dense model will be smarter than the 35B-A3B model?

Reply

[-]

PANIC_EXCEPTION@reddit

Doubtful, Qwen3-32B wasn't too much further ahead compared to Qwen3-30B-A3B

Reply

[-]

Alarming-Ad8154@reddit

I don’t think we've had a dense model in a while, right? Very curious to see how 2026 agentic-coder/reinforcement learning does on a dense base model… If this is mixed linear/quadratic attention and someone converts it to NVFP4, it could be an absolute 5080/5070 Ti monster…

Reply

[-]

ApprehensiveAd3629@reddit

i hope to get a qwen 3.5 14b 🙏

Reply

[-]

xrvz@reddit

Finally another ~120B model for Strix Halo.

Reply

[-]

tmvr@reddit

Nice, some exciting sizes that make sense as well for a lot of home setups. I'll have to go and try Q3 with the 122B, but the 27B dense and the 35B MoE are a nice fit for the 24GB and 32GB VRAM configs.

Reply

[-]

Insomniac24x7@reddit

My 3090 is gonna cry

Reply

[-]

jax_cooper@reddit

tears of joy

Reply

[-]

Insomniac24x7@reddit

Lol

Reply

[-]

maglat@reddit

With multimodal tasks they mean picture and video understanding, right? Same as the big variant, I guess.

Reply

[-]

Skyline34rGt@reddit

Yes. Same as Qwen3 VL had before.

Reply

[-]

Big_Mix_4044@reddit

Hyped af

Reply

[-]

Few_Painter_5588@reddit

A Dense 27B model! That is the perfect size!

Reply

[-]

BrightRestaurant5401@reddit

Indeed, the only model I am looking forward to. I want a model with stronger generalized intelligence; the extra "memory" most MoE models offer is just not that interesting.

Reply

[-]

HugoCortell@reddit

Well yeah, I imagine you'd find them there. Not going to be showing up on ChatGPT, are they?

Reply

[-]

asraniel@reddit

i would love something good in the 8-14 range. it seems like a sweet spot for many tasks such as information extraction, summaries etc

Reply

[-]

noctrex@reddit

Do we have any information if they will release any small versions of 3.5? Like they did with 3 and 3-VL? 2B, 4B, 8B. Cause those small ones are nice to put on computers that do not have a dGPU

Reply

[-]

Schlick7@reddit

9B and that 35bA3 were the ones in the llama.cpp PR so we can expect the 9B. I'd be pretty surprised if that was the smallest

Reply

[-]

Iory1998@reddit

I am pretty certain that there will be a 4B model as well. Qwen3-4B is so popular and is used in so many image generators that I think Qwen will release another one.

Reply

[-]