Apologies for my ignorance. Is this something akin to self-prediction?
Something like using a light engine to guess next tokens and then having the big engine verify+correct it?
Not super familiar with this stuff, take this comment with a grain of salt.
I think MTP (multi-token prediction) adds some extra layers to the network so that each forward pass produces the next N tokens instead of just 1. Then at inference time it works like speculative decoding, where the model verifies those predictions.
There is also EAGLE3 which is similar but the layers are trained after the main model is trained. MTP is trained as part of the main model because it provides speedups during training as well.
Draft models (which I think are largely outdated) are smaller models with a similar output probability distribution to the main model. You can then use the draft model to do speculative decoding. But EAGLE3 has been shown to be more accurate (drafted tokens are accepted more often) and faster (because it’s just a few layers instead of an entire draft model).
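For anyone who wants the draft-then-verify loop spelled out, here is a toy greedy sketch in Python. `draft_next` and `target_next` are hypothetical stand-ins for the cheap drafter (MTP head, EAGLE3 layers, or a draft model) and the main model; they are not any real library's API. Real engines verify all drafted tokens in one batched forward pass and usually accept or reject based on probabilities rather than exact match, so treat this as an illustration only.

```python
from typing import Callable, List

Token = int
# Hypothetical interface: given the context so far, return the greedy next token.
NextFn = Callable[[List[Token]], Token]


def speculative_step(ctx: List[Token], draft_next: NextFn,
                     target_next: NextFn, k: int = 4) -> List[Token]:
    """Draft k tokens cheaply, then keep only what the big model agrees with."""
    # 1) Draft k tokens autoregressively with the cheap model.
    drafted: List[Token] = []
    for _ in range(k):
        drafted.append(draft_next(ctx + drafted))

    # 2) Verify. A real engine does this in ONE batched pass of the big model;
    #    the loop here only makes the accept/reject logic visible.
    accepted: List[Token] = []
    for tok in drafted:
        expected = target_next(ctx + accepted)
        if tok != expected:
            accepted.append(expected)  # take the big model's correction and stop
            return accepted
        accepted.append(tok)

    # 3) Every draft was accepted: the verify pass also yields one bonus token.
    accepted.append(target_next(ctx + accepted))
    return accepted


def generate(ctx: List[Token], draft_next: NextFn, target_next: NextFn,
             n_tokens: int, k: int = 4) -> List[Token]:
    """Generate n_tokens; the output matches plain greedy decoding of the big model."""
    out = list(ctx)
    while len(out) - len(ctx) < n_tokens:
        out.extend(speculative_step(out, draft_next, target_next, k))
    return out[: len(ctx) + n_tokens]
```

The point of the design is that the output never depends on the drafter's quality, only the speed does: a bad drafter just means fewer accepted tokens per big-model pass.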
Fewer dense models + fewer draft-sized/compatible models.
Spec dec is absolutely still a thing, there are just way fewer models coming out where you'll get a big win out of it.
The last one was probably a bit of a speed boost on the original Qwen3-235B using Qwen3 4B or Qwen3 0.6B. The smaller models never got updates, but Qwen3-235B-2507 came out and was much stronger - so nobody used the original and the original small models weren't compatible as draft models.
When running on a single device it doesn't do well with MoE. Consecutive tokens do not necessarily activate the same experts, so when validating multiple tokens there are likely more parameters to be loaded and used, consuming more memory bandwidth and diminishing the benefits of MoEs. For example, llama.cpp does well with dense model speculative decoding, but struggles on MoE.
For large-scale deployment using expert parallelism on multiple GPUs, there will be more performance uplift (for a limited number of users per cluster).
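A rough way to see the bandwidth problem: if you assume each token picks its experts uniformly and independently (real routers are not uniform, so this is only a back-of-the-envelope model), the expected number of distinct experts touched in one layer grows quickly with the number of tokens verified together. The expert counts below are made-up illustration values, not the config of any specific model.

```python
def expected_active_experts(num_experts: int, top_k: int, n_tokens: int) -> float:
    """Expected distinct experts hit in one MoE layer when n_tokens are processed
    together, ASSUMING uniform, independent routing (a simplification)."""
    # Probability that a given expert is skipped by a single token.
    p_miss = 1.0 - top_k / num_experts
    # Probability it is skipped by all n tokens, then take the complement.
    return num_experts * (1.0 - p_miss ** n_tokens)


if __name__ == "__main__":
    # Hypothetical layer: 128 experts, 8 active per token.
    for n in (1, 2, 4, 8):
        experts = expected_active_experts(128, 8, n)
        print(f"{n} tokens verified together -> ~{experts:.1f} experts read per layer")
```

Under that assumption, verifying a handful of drafted tokens reads several times more expert weights than decoding one token, which is where the speedup goes on a bandwidth-bound single device.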
With the 10B active parameters in the MoE, I'd expect the 27B dense model to not be that far behind in intelligence. Could be a really attractive choice for single gaming GPU setups.
MoEs are now way better than they were in their early days; the old rule "square root of total parameters*active parameters" for comparing them to a dense model isn't relevant anymore.
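For reference, this is all the old rule of thumb computes. It is a quick sketch, with parameter counts taken from the model names mentioned in this thread; per the comment above, the result should be read as a likely underestimate for current MoEs, not as a prediction.

```python
import math


def dense_equivalent_b(total_b: float, active_b: float) -> float:
    """Old rule of thumb: geometric mean of total and active parameters (in billions)."""
    return math.sqrt(total_b * active_b)


if __name__ == "__main__":
    for name, total, active in [("122B-A10B", 122, 10), ("397B-A17B", 397, 17)]:
        print(f"{name}: ~{dense_equivalent_b(total, active):.0f}B dense-equivalent by the old rule")
```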
This is especially true for reasoning models.
The reason that was true for instruct models is that the total computation you can do is limited to a single forward pass for instruct models, which is much higher for dense models.
But when you have reasoning, the total compute is spread out over the reasoning tokens, so it doesn't really matter how much compute you can do in a single forward pass. In practice, an MoE might use a few more reasoning tokens to arrive at the answer, but it won't make much practical difference performance-wise, and it will be much faster.
Good point - there have been some architectural improvements, and we don't know whether the MoE defaults to a higher reasoning effort budget than the dense model. The rule of thumb likely underestimates the actual capability we are going to see.
Because Qwen3.5 is MoE: Only \~17B parameters fire per token.
That means:
* Latency scales closer to a 20B dense model😉
* Memory scales closer to a 400B sparse model
I wouldn't dismiss 2-bit quants of the 122B release which should be runnable in less than 50G.
This new Qwen architecture is very resilient to quantization, I have been running 2-bit 397B on 128G mac with great success: https://huggingface.co/ubergarm/Qwen3.5-397B-A17B-GGUF/discussions/2
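A quick sanity check on those sizes: weight memory is roughly parameters times bits per weight divided by 8. The bits-per-weight values below are assumptions for illustration (real "2-bit" GGUF quants mix tensor types and usually land somewhat above 2.0), and the estimate covers weights only, not KV cache or runtime buffers.

```python
def weights_gb(params_billion: float, bits_per_weight: float) -> float:
    """Approximate size of the weights alone, in GB (1 GB = 1e9 bytes)."""
    return params_billion * bits_per_weight / 8.0


if __name__ == "__main__":
    # Assumed effective bits per weight; check the actual file size of the quant you download.
    for bpw in (2.0, 2.5, 3.0, 4.5):
        print(f"122B at {bpw} bits/weight: ~{weights_gb(122, bpw):.1f} GB")
```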
You guys are weird.
I've got 12GB VRAM and 32GB of RAM, and I was running their Qwen3 235B A22B model at Q4.
Sure, it was slow, but this new 100b param moe should only be easier.
Don't know why you act like it's so impossible to run these without massive PCs
I'm having trouble coming up with a use case where beating reading speed isn't useful. Pure live roleplay with no thinking tokens or tool use maybe?
Even then, I'd want at least 20 tokens/second, which seems hard to get with most of the model offloaded to CPU.
You are right that running does not equal running fast. I have specific use cases, MCPs, and I am admittedly an impatient person. When I tried OSS 120b, the generation speed was fine but I couldn't do anything else on Windows with my hardware because I was using nearly all the RAM. Since I tend to have a local LLM open on one screen and other stuff going on on another, it didn't work well for me. For other people, that would be just fine.
I just fed it a 16k long prompt on Windows, I got 104 t/s for processing and 25 t/s for generation. Quite usable for when I need a definite 1-shot answer.
https://imgur.com/a/XOAOpF6
For anyone asking, I still have ~5-6GB left in system RAM when the model is fully loaded. I have Docker running open WebUI and couple other small things. My Windows idles at 11GB system RAM usage too. On Linux, this will be even better.
True. I ran it on my config (12GB VRAM + 64 GB RAM) and get around 250 tk/s pp, so there's scope for optimization. (linux, llama.cpp params etc.)
prompt eval time = 95560.34 ms / 24379 tokens ( 3.92 ms per token, 255.12 tokens per second)
eval time = 124347.26 ms / 2312 tokens ( 53.78 ms per token, 18.59 tokens per second)
total time = 219907.60 ms / 26691 tokens
It took a minute and a half to process 24k tokens before starting, which is indeed slow but incredible considering it's running on such hardware. I expect
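If you want to compare runs without eyeballing the logs, the rates can be recomputed from the raw timings. This is a small sketch that parses lines shaped like the ones quoted above (shortened here to the part the regex needs); the regex is my own and assumes that exact "time = X ms / N tokens" layout.

```python
import re

# Timing lines copied from the log above, trimmed to the fields that get parsed.
LOG = """\
prompt eval time = 95560.34 ms / 24379 tokens
eval time = 124347.26 ms / 2312 tokens
"""

# Captures: phase name, elapsed milliseconds, token count.
PATTERN = re.compile(r"^(.*?) time\s*=\s*([\d.]+) ms\s*/\s*(\d+) tokens", re.M)


def parse_timings(log: str) -> dict:
    """Return tokens per second for each phase found in the log text."""
    rates = {}
    for name, ms, tokens in PATTERN.findall(log):
        rates[name.strip()] = int(tokens) / (float(ms) / 1000.0)
    return rates


if __name__ == "__main__":
    for phase, rate in parse_timings(LOG).items():
        print(f"{phase}: {rate:.2f} tokens/s")
```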
what are your llama.cpp params and hardware
I often point to my post as a reference
[https://carteakey.dev/blog/optimizing-gpt-oss-120b-local-inference/](https://carteakey.dev/blog/optimizing-gpt-oss-120b-local-inference/)
On DGX Spark, I get 2500 tok/s prompt processing and about 60 tok/s generation with GPT-OSS-120B under llama.cpp.
Some people report getting up to 6000 tok/s prompt processing under vLLM with the same setup; vLLM is just a hassle to use.
I paid 30% of what a DGX Spark costs, and my 5070 Ti plays 4K AAA games, rocks all the smaller models and runs any OS of my choice.
I guess when DGX spark can do all those, then it’s a fair comparison. Otherwise, it’s apples to watermelons type comparison.
I don't think it would fit unless you use Q2 or Q3. Also, you should keep in mind that A10B is way larger than your typical A3B, so expect very slow generation speed.
The model will be smart though.
That model was trained from scratch in 4-bit. OpenAI did well with it. Qwen-3.5 is likely trained in FP16. Don't expect Q4 to be around 65GB like OSS-120B.
As a general purpose model it seems like they're trying to paint it being as good as the original Qwen3-235B (not the updated 2507 checkpoint) but twice as fast and half the memory.
The real gains are in instruction following and coding use.
Meaning this *could* have the all-around strength that larger Qwens have but the agentic abilities of GLM and MiniMax models. All of this is subject to testing of course, but I really hope these numbers turn out to be real.
Unsloth's 4-bit quant just passed the car wash and upside-down cup tests, which none of the other Qwen models I've tried could deal with. This model is feeling pretty real.
It's worth trying a 2-bit or 3-bit quant. This model seems to be very amenable to quantization, for whatever reason. I don't understand it, but Unsloth actually reports the 2-bit and 3.5-bit quants as outperforming the 4-bit? See https://unsloth.ai/docs/models/qwen3.5#benchmarks
Yes! Since the unsloth quants are not out yet, I created a qwen chat account and tried the 122B-A10B one out. It seems like the most capable model I've seen so far. I gave it some test prompts (asking it to parallelize a bash script in a way that creates a race condition on generated temp files, etc.) that no local model I've tried (glm-4.5-air, qwen3-235b, gpt-oss-120b, etc.) ever got right once. And it nailed the prompts on the first try. Super excited for the unsloth quants; it may mean I can finally delete all my 500GB+ (in total) of models and keep only this one, if it's as good locally as it was on their website.
As soon as you're outside of Agentic use cases I don't enjoy it as much. It's also a fairly weak general purpose model for me.
It's the strongest coding model that fits nicely on my machine but I'm finding myself preferring GLM 4.6v at Q4/Q5 rather than MiniMax at Q3.
Great model.. it just doesn't have a home in my workflows nor my machine. Maybe if I had more VRAM and could run Q4+ that'd change.
ngl this 27B is actually cracked. Coding and multimodal perfs are giving me early Gemini 3 Pro vibes lol. Perfect weight for anyone building a local agentic OS. Can't believe we're getting this much juice in a dense model in 2026. My 5090 is ready.
I have measured the weights of GPT-OSS-20B myself using a script to parse the full precision GGUF:
weights_used_bytes: 3190031616 (3.19 GB / 2.97 GiB)
weights_total_bytes: 12096558336 (12.1 GB / 11.27 GiB)
Definitely A3B, however you want to measure it.
llama.cpp now supports any draft model, even if the tokenizer dictionary is incompatible. Furthermore, n-gram speculative decoding without a draft model has been added. However, for a number of reasons, speculative decoding works poorly with MoE, which is likely why it's so rarely used.
I have to say I’m kind of disappointed with this release.
It might be a niche use case, but for us fine tuners, only a single size dense model with no base variants is practically useless.
This trend already started with Qwen3 where they never released the base variant of the 32B size and all releases since then have been MoE.
While running local models for coding or creative writing has a significant value proposition, the ability to fine tune models for personal use or as a basis for a commercial product is a liberty that’s slowly been eroding away. That’s a shame, and I don’t think it’s being brought up enough.
If you have a GPU, add its VRAM to your system RAM to see what you can run
Gpt-oss-120b will run on 64gb RAM + 16gb VRAM quite well so I have high hopes for this
I have 96 GB DDR5 RAM and 16 GB VRAM but I get 14 t/s with gpt-oss 120b. By quite well do you mean these kinds of speeds or much higher? I use llama.cpp with 60k context.
Depends on the processor and how you offload, I would say. I didn’t test oss 120B, but I feel you could probably get some extra performance if you have not yet optimised settings. Do you use the --fit and --fit-ctx parameters of llama.cpp? If not, try them out.
i use --fit on. Not used --fit-ctx parameter. Will try it. Your qwen3 coder next speed is quite impressive. i get around 17 t/s with it. Can you share your full llama.cpp parameters please?
Here: https://github.com/Danmoreng/local-qwen3-coder-env
In the repo I still use UDQ4 instead of MXFP4, that runs at 35 t/s instead of 40 t/s for MXFP4. Also, Windows is much slower than Linux. Under windows I only get around 25 t/s.
I get similar perf on my 12GB VRAM + 64GB RAM and here's the command with the params he mentioned.
[https://carteakey.dev/blog/optimizing-qwen3-coder-next-local-inference/](https://carteakey.dev/blog/optimizing-qwen3-coder-next-local-inference/)
Fit without fit context and with a custom context can backfire. If it ends up "fitting" a smaller context and what you specify is larger (due to the initialization sequence), you end up with your KV cache partially outside your VRAM, and that's slow.
If you try replacing your context setting with --fit-ctx 70000, that should help if this is the problem.
You should have way more.
[https://carteakey.dev/blog/optimizing-gpt-oss-120b-local-inference/](https://carteakey.dev/blog/optimizing-gpt-oss-120b-local-inference/)
You can get more
I run llama.cpp with 64gb DDR4 3733 and an RX 6800 XT 16gb
With the full 131k context it gets about 22tok/s
How is your offloading configured? You want all layers plus kv cache offloaded to the GPU, then 32 MOE layers offloaded to CPU
actually i am new at llama.cpp so i try every combination i see in this sub. the last one i tried was this: gpt-oss-120b-MXFP4-00001-of-00002.gguf --host 127.0.0.1 --port 8130 --ctx-size 70000 --temp 1.0 --top-p 0.95 --min-p 0.01 --top-k 40 --jinja --fit on -np 1
i got 12.8 t/s
and for qwen coder next:
Qwen3-Coder-Next-MXFP4_MOE_BF16.gguf --host 127.0.0.1 --port 8130 -ngl 999 -sm none -mg 0 -t 12 -fa on -cmoe -c 61072 -b 512 -ub 512 -np 1 --jinja --temp 1.0 --top-p 0.95 --top-k 40 --min-p 0.01 --repeat-penalty 1.0
i got 16.6 t/s
Rather than -cmoe use -ncmoe 32 for gpt-oss and I forget the optimal for Qwen3, it might be something like 37 but test it and look at your VRAM usage once it's fully loaded
-b and -ub at 2048 makes a huge difference for large prompts as well but use more memory meaning either fewer layers offloaded or less context
no luck, i don't get what i'm doing wrong. these are the last settings i used, following the suggestions:
gpt-oss-120b-MXFP4-00001-of-00002.gguf --host 127.0.0.1 --port 8130 -c 60000 -ngl 999 -ncmoe 32 -b 1024 -ub 1024 -fa on -sm none -t 4 --jinja --temp 1.0 --top-p 0.95 --top-k 40 --min-p 0.01
Play with the CPU MoE settings more. It helps a lot. But for this Qwen model with 10B active you'll get much slower speeds than OSS120B, which has half the active params.
Exactly, it's double the active params, so fewer tok/s out the door, but with linear attention it shouldn't slow down as context grows the way OSS120B does, and should largely stay consistent. So for long tasks it may end up performing the same? Who knows.
From my experience, you need to specifically turn fit off and adjust things yourself to get the best performance. Set ngl to 999, set the context to something you find acceptable, and then find the minimum value for --n-cpu-moe that loads without crashing.
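Finding that minimum by hand gets tedious, so here is a sketch of the search in Python. Since moving more MoE layers to the CPU only ever frees VRAM, a binary search is enough. `loads_ok` is a hypothetical callback you would write yourself (for example: launch the server with the given value and report whether it loaded without an out-of-memory error). The `llama-server` binary name and the layer count in the usage note are my assumptions; the flags are the ones mentioned in this thread.

```python
from typing import Callable, List, Optional


def min_n_cpu_moe(loads_ok: Callable[[int], bool], n_layers: int) -> Optional[int]:
    """Smallest --n-cpu-moe value in [0, n_layers] for which the model loads.

    Assumes monotonicity: if a value loads, every larger value loads too.
    Returns None if even n_layers (all MoE layers on CPU) does not load.
    """
    lo, hi = 0, n_layers
    if not loads_ok(hi):
        return None
    while lo < hi:
        mid = (lo + hi) // 2
        if loads_ok(mid):
            hi = mid      # mid works, try keeping more expert layers on the GPU
        else:
            lo = mid + 1  # mid runs out of memory, push more layers to the CPU
    return lo


def build_args(model_path: str, ctx: int, n_cpu_moe: int) -> List[str]:
    """Argument list in the style described above (binary name is an assumption)."""
    return ["llama-server", "-m", model_path, "-ngl", "999",
            "-c", str(ctx), "--n-cpu-moe", str(n_cpu_moe)]
```

Usage would look like `min_n_cpu_moe(my_try_load, 36)`, where `my_try_load` is your own function and 36 is an assumed layer count that you should replace with the real one for your model. Each probe means a full model load, so expect the search to take a few minutes.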
gpt-oss-120b uses a little over 70gb all in. Windows will unfortunately gobble up most of what's left before anything else can run. Linux is obviously a lot less demanding so you can still make use within reason
Nah, I'm using the mxfp4 from the ggml repo and it's using ~52gb of vram with --no-mmap on windows.
If it's using more, it's due to mmap and/or to context spilling to system ram with the Nvidia driver's sysmem fallback policy left at default.
On a 12100f with 64 gb of ddr4 3200 and an rtx 3060 12 gb, I get 18 t/s using -ncmoe 31 at 30k context (doesn't oom) and 15 t/s with -cmoe at 64k context (and sysmem fallback to default).
That said, Ubuntu LTS was 15% faster last time I checked like a year ago.
So true - honestly, I'm a Codex / Opus fanboy, but the winners of my heart and my "drivers for automation" are Qwen 3 Next and Qwen3 Coder Next. It's really, really impressive. I use it for code & context extraction, and even Opus and Codex admit that Qwen is better 🤣
Here's hoping for a Qwen3.5 Coder Next. Qwen3 Coder Next is the best model I've tried so far on my Strix Halo 128GB system. It's currently running and getting deep in the weeds with LLVM compiler optimizations and code generation - would've never imagined I'd be able to run a model locally that could handle that.
The 27B dense model is the one I'm most excited about. Dense models tend to be more predictable for fine-tuning and deployment compared to MoE, and 27B sits in that sweet spot where you can actually run it on consumer hardware with decent quantization.
The MoE variants are interesting for benchmarks but the 27B is probably going to see the most real-world local deployment. Anyone tested the multimodal capabilities yet? Curious how it handles vision tasks compared to the Qwen3 VL models.
On my 4090, Gemma-3 27b already runs at blazing fast speeds, even at Q5 quants, with 16k context. I don't see why that would be different from the new dense 27b.
I've said it before and I'll say it again. Alibaba & Qwen is an extremely productive team, kudos to them. I am a bit worried about their workplace culture though, are they working 16 hour shifts or something? lol
https://preview.redd.it/tmv0y54a3hlg1.png?width=1734&format=png&auto=webp&s=6320e0e924c946c3ccd29e1234b919e3e0f129fb
and unsloth quantizations are uploading now...
True, Step-3.5 Flash is also a beast and perfect for 128G. I can run up to two 128k context parallel streams locally as it is quite context efficient. In the AMA the StepFun team also said they plan to continue improving the 197B arch and it will even have vision, so also looking forward to that!
My guess is that they purposely decided to release it as a way to demonstrate the improvements they made so they have their 27B dense model matching their 32B model, so they have a measurable \~15% generational improvement.
That's why it's interesting. I can tell you right now it's gonna be a banger of a model. Vision support, long context size, less VRAM requirements, and smart as hell for its size. The Qwen team is definitely sending a message to both the community and Google. We don't need Gemma-4...
Though, I really hope Google releases new Gemma models.
Q4 GLM-4.7-Flash can, there's no reason why the 35B can't. You should have plenty of room to spare, and depending on how much you want to use and context sizes you can definitely bump it up to Q6.
It should if you use Q3/Q4, especially with the unsloth dynamic quants. I've used the Nemotron-30b-A3B at UD-Q4_K_XL on my 5090. This one is a little larger, but you can also quantize the KV cache, which buys you more context.
I don’t think we've had a dense model in a while, right? Very curious to see how 2026 agentic-coder/reinforcement learning does on a dense base model… if this is mixed linear/quadratic attention and someone converts it to NVFP4, it could be an absolute 5080/5070 Ti monster….
Nice, some exciting sizes that make sense as well for a lot of home setups. I'll have to go and try Q3 with the 122B, but the 27B dense and the 35B MoE are a nice fit for the 24GB and 32GB VRAM configs.
Indeed, the only model I am looking forward to.
I want a model with stronger generalized intelligence;
the extra "memory" most MoE models offer is just not that interesting.
Do we have any information if they will release any small versions of 3.5?
Like they did with 3 and 3-VL?
2B, 4B, 8B.
Cause those small ones are nice to put on computers that do not have a dGPU
I am pretty certain that there will be a 4B model as well. Qwen3-4B is so popular and used in so many image generators that I think Qwen will release another one.
198 Comments