unsloth/Qwen3-Next-80B-A3B-Instruct-GGUF · Hugging Face
Posted by WhaleFactory@reddit | LocalLLaMA | View on Reddit | 100 comments
rm-rf-rm@reddit
LMArena has it matched to Sonnet 4... While I'd love for this to be the case, it seems unlikely.
Caffdy@reddit
there's no way those two are equivalent, right?
ubrtnk@reddit
*cries in llama-swap
it's not built into the llama-swap container yet
ei23fxg@reddit
it is now
ElSrJuez@reddit
Is this a hidden prop of Qwen3-30B?
Electrical-Bad4846@reddit
Q4 getting around 13.6 t/s with a 3060 + 3090 combo and 52 GB of DDR4-3200 RAM
cybran3@reddit
That’s kinda low, I get ~23 TPS for gpt-oss-120b with one RTX 5060 Ti 16GB and 128 GB 5600 DDR5.
T_UMP@reddit
UD-Q4_K_XL 14tk/s on Strix Halo 128GB.
slavik-dev@reddit
How does Qwen3-Next-80B's intelligence compare to GPT-OSS-120B?
I've heard complaints that GPT-OSS-120B is significantly censored, but I haven't experienced much censorship with it myself.
How do they compare for coding?
xxPoLyGLoTxx@reddit
My impression is that gpt-oss-120b is superior.
ForsookComparison@reddit
It is. By Qwen's own admission it seems that Qwen3-Next 80B's main selling point is the ability to run Qwen3-32B level intelligence at much faster speeds.
If you have 40-48GB of VRAM this is probably the coolest model in the world because that's amazing. Otherwise, offload experts to CPU and stick to gpt-oss-120B.
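The expert-offload route mentioned above can be sketched as a llama.cpp command; the model filename is a placeholder and the flag values are illustrative, not confirmed settings:

```shell
# Keep attention/dense layers on GPU (-ngl 99) but push all MoE expert tensors
# to CPU (--cpu-moe). With spare VRAM, swap --cpu-moe for --n-cpu-moe N to keep
# the experts of the last layers on GPU. Model path is a placeholder.
./llama-server -m gpt-oss-120b-mxfp4.gguf -ngl 99 --cpu-moe -c 16384 --port 8080
```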
eggavatar12345@reddit
It's not censored; that was people testing it with invalid configurations, or poor early OpenRouter testing
AXYZE8@reddit
It's heavily censored, and you can see it during reasoning, where it checks whether the prompt is against OpenAI policy.
However, jailbreaking is easy, as proven in this sub: just put an "updated OpenAI policy" in the system prompt and write into that policy what it's allowed to generate. I haven't seen any limitations to this method.
my_name_isnt_clever@reddit
Or to save tokens, I've seen good results with the Heretic version. It hasn't refused anything with zero system prompt rule shenanigans.
Finanzamt_Endgegner@reddit
Even if Qwen Next is worse atm, it was more a proof of concept, and it allows the Kimi Linear model to be implemented in less time, since that one builds on this architecture (;
Sea-Speaker1700@reddit
I find 120b to be a terrible coder; it just dumps generic trash into the codebase without actually fitting it to existing patterns.
80b will try to match existing patterns more closely, but it's still a very, very long way off frontier models.
Mkengine@reddit
https://artificialanalysis.ai/models/comparisons/qwen3-next-80b-a3b-reasoning-vs-gpt-oss-120b
Daniel_H212@reddit
Exciting not because I care about this model, but because this means we'll be able to run Qwen3.5 or Qwen4 whenever that comes out. This model is, as far as I can tell, an architectural proof of concept and is nowhere close to being finished training. They say they only spent 10% of the training cost on this compared to what was put into Qwen3 32B, and even if that's because this architecture is easy to train, it seems like cost won't be a barrier to training it further.
InevitableWay6104@reddit
It's not about the length of training; it was cheaper to train because of the architecture differences, which matters to them because they can iterate faster.
They did explicitly say this is an experimental model testing their new efficiency-oriented architecture improvements. That doesn't mean it's not "fully trained"; it likely is. It's just an experimental, mostly unpolished preview model that doesn't have all the kinks worked out yet.
Schlick7@reddit
Pretty sure they said that it wasn't the same dataset as the Qwen3 models but a reduced set.
Sea-Speaker1700@reddit
Exactly, hybrid linear attention is the future, so getting performant generic kernels written that can handle various compositions of linear vs. full attention layers (3:1, 7:1, etc.) is huge for the future outlook.
Getting proper internal MTP working will also be huge.
Icy_Resolution8390@reddit
What is the difference with the ilintar version?
Finanzamt_Endgegner@reddit
It's just Unsloth 2.0 GGUFs; other than that they run the same
Icy_Resolution8390@reddit
Is it the same? What about versions other than the Unsloth one?
Finanzamt_Endgegner@reddit
The model itself is the same, both work in llama.cpp, the unsloth will probably have a little bit better performance for the same file size though (;
Icy_Resolution8390@reddit
I downloaded a modified llama.cpp version from ilintar to run this model... but now you're telling me it's supported by standard llama.cpp? I don't see any mention of qwen3-next on GitHub…
Finanzamt_Endgegner@reddit
well, it's not in the precompiled releases yet, you'd have to compile it yourself (;
Icy_Resolution8390@reddit
Yes, I compile it myself with cmake, and both the instruct and thinking versions run well, but I downloaded other quants from lefromage or something like that, don't remember... I haven't tested the Unsloth version. If it's better, I can download the Unsloth version to test it too; if it's more optimized I might get a few tok/s more.
Finanzamt_Endgegner@reddit
well, the current llama.cpp might be faster per token; I'm not sure if the other one has any CUDA kernels atm. Though you can also wait a week or so and then use the Unsloth GGUFs with main llama.cpp, since by then all kernels should be implemented at least. There will probably be further performance upgrades later on (;
Icy_Resolution8390@reddit
Why must I wait a week? Can't I download it today? Or do they have some bugs they're repairing? I can wait a week, but I was thinking of downloading and testing these Unsloth quants with main llama.cpp tonight
Finanzamt_Endgegner@reddit
You can, though there will be upgrades to the performance during the next week (at least that's very likely), so don't take the speed as absolute, since it will increase (;
Also, you might need to redownload the GGUFs later if Unsloth changes anything, which could happen. But nothing stops you from doing some tests rn (:
Icy_Resolution8390@reddit
Hello my friend, I downloaded the Unsloth quantization of Qwen3-80B-A3B with the new llama.cpp, and it was very good with no GPU. I'm testing it now and it's pretty good; I get 3 tok/s more than with the lefromage quants
Icy_Resolution8390@reddit
Ok, I understand. I thought the implementation was complete and stable, but it can still be upgraded for better optimization. I'll try to download it in a few hours to compare the difference in tk/s with the other versions
AbheekG@reddit
Thank you!!!
Long_comment_san@reddit
Sorry to ask a relatively stupid (for some people) question, but what about i1 quants? Didn't these surpass regular quants? So why are regular quants still being made, if i1 are better and work on all hardware?
Cool-Chemical-5629@reddit
They may work on all hardware, but on some hardware they are much slower.
AXYZE8@reddit
You are describing IQ quants, not i1 quants.
Cool-Chemical-5629@reddit
I am aware of the two, but subjectively I never felt any speed difference between them, and they both feel slower than regular quants. Also, I believe it's been explained somewhere that they can be slower on some hardware, for example on the Vulkan runtime, which is what I'm using on my hardware.
Long_comment_san@reddit
What kind of hardware? 🤔 Something from GTX era? That's basically phased out
Cool-Chemical-5629@reddit
No. I can speak only for myself, but I have all AMD hardware and it's always slower than regular quants for some reason.
arcanemachined@reddit
Q4_0 and Q4_1 should run faster on these old cards, FYI, if you're not already using them.
AXYZE8@reddit
i1 (imatrix) quants are optimized against calibration data provided by the person who made the quantization. This usually reduces visible "brain damage", but it may worsen niche knowledge, which is why static quants are still being made.
Cool-Chemical-5629 described IQ quants, which are more compressed than Q quants: with an IQ quant you need less memory but more compute. IQ quants work great on powerful GPUs, but run really slowly on consumer CPUs with only 6-8 cores. Today I tested GLM 4.5 Air on GPU+CPU, and Q2_K_XL (46GB) was 20% faster than IQ2_XXS (42GB). IQ quants are useful when you want to fit a 27B/32B dense model fully into your 16-24GB of VRAM.
JawGBoi@reddit
Could anyone give some speed tests?
Curious if my 12gb 4070 super and 64gb RAM would run that at faster than 7 tokens per second.
AXYZE8@reddit
4070 SUPER + 64GB DDR4-2667 = 9.90 tok/s at 10k context with Q3_K_XL.
`--ngl 99 --n-cpu-moe 34`, if I recall correctly (I'm on my phone right now).
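Spelled out as a full command, that setup would look roughly like this; the model filename is a guess, and the right `--n-cpu-moe` value depends on your VRAM (lower it until VRAM is nearly full):

```shell
# -ngl 99 offloads every layer to the GPU, then --n-cpu-moe 34 moves the expert
# (MoE) tensors of the first 34 layers back to CPU so the rest fits in 12 GB VRAM.
# Model path and flag values are illustrative, not confirmed from the thread.
./llama-server -m Qwen3-Next-80B-A3B-Instruct-UD-Q3_K_XL.gguf \
  -ngl 99 --n-cpu-moe 34 -c 10240 --port 8080
```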
ixdx@reddit
On my hardware, it runs faster than gpt-oss-120b mxfp4. I used Q2 for the first time, and the responses seemed quite normal.
```
root@c6ec8a89e61c:/app# ./llama-bench --model /models/unsloth/Qwen3-Next-80B-A3B-Instruct-UD-Q2_K_XL/Qwen3-Next-80B-A3B-Instruct-UD-Q2_K_XL.gguf --n-cpu-moe 4
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 2 CUDA devices:
Device 0: NVIDIA GeForce RTX 5070 Ti, compute capability 12.0, VMM: yes
Device 1: NVIDIA GeForce RTX 5060 Ti, compute capability 12.0, VMM: yes
| model | size | params | backend | ngl | test | t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | --------------: | -------------------: |
| qwen3next ?B Q2_K - Medium | 27.31 GiB | 79.67 B | CUDA | 99 | pp512 | 365.14 ± 1.51 |
| qwen3next ?B Q2_K - Medium | 27.31 GiB | 79.67 B | CUDA | 99 | tg128 | 37.90 ± 0.25 |
build: ff55414 (1)
```
Sixbroam@reddit
Here are my bench results with a 780M running solely on 64GB DDR5-5600:
build: ff55414c4 (7186)
I'm quite surprised to see such "low" numbers. For comparison, here is the bench for GLM 4.5 Air, which is bigger and has 4x the number of active parameters:
And a similar test with GPT-OSS 120B:
prompt eval time = 4779.50 ms / 507 tokens ( 9.43 ms per token, 106.08 tokens per second)
eval time = 9206.85 ms / 147 tokens ( 62.63 ms per token, 15.97 tokens per second)
Maybe the Vulkan implementation needs some work too, or the compute needed for tg is higher due to some architecture quirks? Either way, I'm really thankful to Piotr and the llama.cpp team for their outstanding work!
GlobalLadder9461@reddit
How can you run gpt-oss-120b with only 64GB of RAM?
Sixbroam@reddit
I offload a few layers to an 8GB card (that's why I can't use llama-bench for gpt-oss). Not ideal, and it doesn't speed up the models that fit in my 64GB, but I was curious to test this model :D
Mangleus@reddit
I am equally curious about this and related questions, also having 8GB VRAM + 64GB RAM. I only use llama.cpp so far.
mouthass187@reddit
sorry if this is stupid, but I have an 8GB card and 64 GB of RAM; can I run this model? I've only tinkered with ollama so far, and I don't see how people are offloading to RAM. Do I use llama.cpp instead? What's the easiest way to do this? (I'm curious since RAM went up in price, but have no clue why.)
tmvr@reddit
It's going to be rough with an 8GB GPU only, the model itself would fill the RAM and offloading only 8GB from that is not a lot. A 16GB card would do better, it works fine with my 24GB 4090 and 64GB RAM because there is enough total memory to fit everything in comfortably.
Sixbroam@reddit
I don't know how you'd go about it with ollama, it seems to me that going the llama.cpp route is the "clean" way, you can look at my other comment regarding tensor splitting using llama.cpp here: https://www.reddit.com/r/LocalLLaMA/comments/1oc9vvl/amd_igpu_dgpu_llamacpp_tensorsplit_not_working/
Sea-Speaker1700@reddit
MTP is almost certainly not active in the 80B, so, just like in vLLM, we get an echo of what Next 80B is actually capable of due to serving limitations.
Finanzamt_Endgegner@reddit
not only that, the tri and cumsum kernels are still CPU-only I think; at least the CUDA ones aren't mergeable yet, though I'm sure we'll get them rather fast (;
MikeLPU@reddit
The same for glm4.5. They just skip these layers. So sad...
qcforme@reddit
I did implement it correctly in a branch of vLLM, with correct use of the linear attention mechanism interleaved with full attention, as an experiment attempting to integrate prefix caching.
It does work: prefix caching worked really well, I saw 50k+ TPS prefill on cache hits, but decode performance is poor because of CUDA graph incompatibility with the hybrids. Plus I was working with a 3-bit quant due to the VRAM I had at the time, so model damage was inseparable from kernel mistakes when debugging.
The hybrids will require months of work to get fully right, and need fundamental changes in the core of both inference engines, llama.cpp and vLLM, plus someone with 192GB+ VRAM to properly test it.
More than I was willing to take on at the moment, as I can't serve the 16-bit 80B for verification.
Sixbroam@reddit
Thank you for the added bit of information regarding MTP! Yes, I saw a few comments explaining that the focus wasn't on performance, but I wasn't expecting such a hit on tg. It's just out of curiosity though, not complaining :)
PraxisOG@reddit
Depends pretty heavily on what RAM that is. DDR5-5600 in dual channel has a bandwidth of about 90 GB/s; divided by 3B active parameters, that gives about 30 tok/s, though real performance might be about half that.
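That back-of-envelope math can be reproduced directly. This sketch assumes ~1 byte per active parameter (roughly Q8) and DDR5-5600 dual channel at ~89.6 GB/s; both numbers are assumptions, not measurements:

```shell
# Decode-speed upper bound ≈ memory bandwidth / bytes read per token.
# 3B active params * ~1 byte/weight (Q8-ish) ≈ 3 GB touched per generated token.
awk 'BEGIN {
  bw_bytes_per_s  = 89.6e9   # assumed dual-channel DDR5-5600 bandwidth
  bytes_per_token = 3.0e9    # assumed bytes of weights read per token
  printf "%.1f tok/s upper bound\n", bw_bytes_per_s / bytes_per_token
}'
```

At Q4 the bytes per token roughly halve, which is part of why quantized MoE models decode faster than the raw parameter count suggests; real throughput lands well below this bound due to overheads beyond weight reads.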
JawGBoi@reddit
I forgot to mention RAM speed, whoops.
I only have DDR4 3200. I expect that will affect the end speed significantly.
usernameplshere@reddit
~51 GB/s for your RAM
InevitableWay6104@reddit
does llama.cpp not support qwen3 next 80b on rocm???
fallingdowndizzyvr@reddit
It does. But Vulkan is faster.
InevitableWay6104@reddit
vulkan is not faster on amd.
fallingdowndizzyvr@reddit
It is.
https://github.com/ggml-org/llama.cpp/pull/16095#issuecomment-3589897501
T_UMP@reddit
On Strix Halo with Vulkan it loads, but then it crashes once it tries to generate, with no errors.
With ROCm it works at 114 t/s pp and 14 tk/s tg.
CPU works at 7 tk/s.
UD-Q4_K_XL.
Mean-Sprinkles3157@reddit
I used to run Qwen3-Next with a test build of llama.cpp (mainline didn't support 'next'). Is it still true that I have to use a different llama.cpp?
Finanzamt_Endgegner@reddit
nope, this is in main-branch llama.cpp now
Mean-Sprinkles3157@reddit
Thanks! I ran the model (Q8) on a DGX Spark; it gets 14 tokens per second. I think that's OK for a model using 80GB of VRAM. It passed my Latin test; I hope it can replace gpt-oss-120b (60GB VRAM) for me.
Below is my command line:
```
./bin/llama-server \
  -m ~/models/Qwen3-Next-80B-A3B-Instruct-Q8_0-00001-of-00002.gguf \
  --host 0.0.0.0 --port 8080 \
  -ngl 99 -n 16384 -c 131072 \
  --temp 0.7 --top-p 0.8 --top-k 20 \
  --verbose
```
If anyone is an expert on using llama-server, please teach me whether I could increase the context window size to 262144. I mostly use the model with Cline (VS Code), and I'm not sure if "rope-scaling: yarn" would work with Cline.
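On the 262144 question: Qwen3-Next advertises a 262,144-token native context window, so if the GGUF metadata carries it, raising `-c` alone should be enough given memory; llama.cpp's YaRN flags are only needed to stretch past the native window. A sketch, with illustrative flag values:

```shell
# Ask for the full native window; no RoPE scaling needed up to 262144.
./bin/llama-server -m ~/models/Qwen3-Next-80B-A3B-Instruct-Q8_0-00001-of-00002.gguf \
  --host 0.0.0.0 --port 8080 -ngl 99 -c 262144 \
  --temp 0.7 --top-p 0.8 --top-k 20

# Only for contexts beyond the native window (towards ~1M), something like:
#   --rope-scaling yarn --rope-scale 4 --yarn-orig-ctx 262144
```

Note the KV/state memory for a 262k context is substantial, so whether this fits depends on your free VRAM/RAM.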
Finanzamt_Endgegner@reddit
yeah implementation is not yet optimized, but people are working on that (;
noiserr@reddit
I'm assuming that using smaller Qwen3 models as draft models for speculative decoding is not compatible, right?
2legsRises@reddit
awesome! now i only need 56GB+ of vram.
sammcj@reddit
Nice work, will be interesting to see how the UD-Q3_K_XL compares to Q4_K_M, as that would allow it to fit on 2x 24GB cards.
DrVonSinistro@reddit
Q8 UD K XL from Unsloth on 2x P40 + 1x RTX A2000 (60GB vram) gives me 11-12 t/s with 17k ctx filled out of 32k.
rm-rf-rm@reddit
From what I can tell from anecdotal usage and comments on here, it isn't a noticeable improvement over qwen3-coder:a3b, especially for coding.
It won't replace GPT-OSS:120b either. I'll still try it out and see whether it can replace qwen3-coder:a3b for agentic coding tasks.
The real win is the forward compatibility for Qwen 3.5/4; as I understand it, they will all follow this arch.
kevin_1994@reddit
my understanding is CUDA isn't quite ready yet?
also, does anyone know if these models support FIM? This seems perfect as a coding autocomplete model for me.
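For what it's worth, llama-server exposes an `/infill` endpoint for FIM-style completion; whether it works here depends on the model's vocab actually defining FIM tokens, which I haven't verified for Qwen3-Next. A sketch, assuming a server already running on port 8080:

```shell
# Fill-in-the-middle request against a running llama-server instance.
# Only works if the loaded model's tokenizer defines FIM tokens;
# unverified for Qwen3-Next, so treat this as an experiment.
curl -s http://localhost:8080/infill -d '{
  "input_prefix": "def add(a, b):\n    ",
  "input_suffix": "\nprint(add(1, 2))\n",
  "n_predict": 32
}'
```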
Finanzamt_Endgegner@reddit
Yeah, we just got the solve_tri kernel merged for CUDA; cumsum and tri are still missing as I understand it, but should be here soon (;
illkeepthatinmind@reddit
With llama.cpp 7180 getting
illkeepthatinmind@reddit
Looks like brew hasn't updated to 7190 yet
AleksHop@reddit
3B active, good. How much, and what, should be offloaded to the GPU? And the llama.cpp commands with the relevant flags?
Dreamthemers@reddit
`--n-cpu-moe 48` offloads all experts to CPU (same as `--cpu-moe`), so lower it until your VRAM is almost full for a performance increase.
Sea-Speaker1700@reddit
42
Your question is wildly lacking any amount of actual context to provide a meaningful answer.
Icy_Resolution8390@reddit
What is the difference between this version and the lefromage version?
jacobpederson@reddit
does it load in lm studio yet?
Nieles1337@reddit
No, it needs a runtime update.
yoracale@reddit
The Thinking ones will be up in like 1-2 hours or so: https://huggingface.co/unsloth/Qwen3-Next-80B-A3B-Thinking-GGUF
Icy_Resolution8390@reddit
Finally we did it
WhaleFactory@reddit (OP)
🐐🐐🐐🐐
Trilogix@reddit
https://github.com/Mainframework/HugstonOne/releases/tag/HugstonOne_Enterprise_Edition_with_memory
Qwen Next 80 supported now.
munkiemagik@reddit
Slightly adjacent question to this Qwen3-Next post:
With the work that's been done in llama.cpp to be able to finally run hybrid moe and gated deltanet Qwen3-Next, does any of that currently have any negative impact on regular MOE or dense models like GPT-OSS or Seed-OSS run with the same llama.cpp b7186?
mantafloppy@reddit
GGUF Model/llama.cpp release is broken.
Trying my standard coding prompt :
First try: the model got stuck repeating the same CSS at around 3000 tokens of context. Second try: the model got stuck writing an SVG forever at around 5000 tokens of context.
Prompt : Recreate a Pokémon battle UI — make it interactive, nostalgic, and fun. Stick to the spirit of a classic battle, but feel free to get creative if you want. In a single-page self-contained HTML.
I used the recommended setting from : https://docs.unsloth.ai/models/qwen3-next
As someone who uses lmstudio-community/Qwen3-Next-80B-A3B-Instruct-MLX-4bit as their main model, this is sad. Guess vibe coding a llama.cpp release doesn't work.
_raydeStar@reddit
Ahhhhhhhhhhhhh it's here!!!
That's all I gotta say
jacek2023@reddit
Thanks!
WhaleFactory@reddit (OP)
🐐🐐🐐🐐
WhaleFactory@reddit (OP)
Don't thank me, THANK YOU!
🐐🐐🐐🐐
Hulksulk666@reddit
Thanks!!
nore_se_kra@reddit
Interesting: finally some medium-sized models again, despite MoE. Benchmarks don't look that overwhelming, but let's see.