unsloth/Qwen3-Next-80B-A3B-Instruct-GGUF · Hugging Face
Posted by WhaleFactory@reddit | LocalLLaMA | View on Reddit | 100 comments
rm-rf-rm@reddit
LMArena has it matched to Sonnet 4... While I'd love for this to be the case, it seems unlikely.
Caffdy@reddit
there's no way those two are equivalent, right?
ubrtnk@reddit
*cries in llama-swap
it's not built into the llama-swap container yet
ei23fxg@reddit
it is now
ElSrJuez@reddit
Is this a hidden prop of Qwen3-30B?
Electrical-Bad4846@reddit
Q4 getting around 13.6 t/s with a 3060 + 3090 combo and 52 GB of DDR4-3200 RAM
cybran3@reddit
That’s kinda low, I get ~23 TPS for gpt-oss-120b with one RTX 5060 Ti 16GB and 128 GB 5600 DDR5.
T_UMP@reddit
UD-Q4_K_XL 14tk/s on Strix Halo 128GB.
slavik-dev@reddit
How does Qwen3-Next-80B's intelligence compare to GPT-OSS-120B?
I've heard complaints that GPT-OSS-120B is significantly censored, but I haven't experienced much censorship with it myself.
How do they compare for coding?
xxPoLyGLoTxx@reddit
My impression is that gpt-oss-120b is superior.
ForsookComparison@reddit
It is. By Qwen's own admission it seems that Qwen3-Next 80B's main selling point is the ability to run Qwen3-32B level intelligence at much faster speeds.
If you have 40-48GB of VRAM this is probably the coolest model in the world because that's amazing. Otherwise, offload experts to CPU and stick to gpt-oss-120B.
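The expert-offload route mentioned above can be sketched as a llama.cpp command; the model filename is a placeholder and the flag values are illustrative, not confirmed settings:

```shell
# Keep attention/dense layers on GPU (-ngl 99) but push all MoE expert tensors
# to CPU (--cpu-moe). With spare VRAM, swap --cpu-moe for --n-cpu-moe N to keep
# the experts of the last layers on GPU. Model path is a placeholder.
./llama-server -m gpt-oss-120b-mxfp4.gguf -ngl 99 --cpu-moe -c 16384 --port 8080
```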
eggavatar12345@reddit
It's not censored; that was people testing it with invalid configurations, or poor early OpenRouter testing
AXYZE8@reddit
It's heavily censored, and you can see it during reasoning, where it checks whether the prompt is against OpenAI policy.
However, jailbreaking is easy, as proven in this sub: just put an "updated OpenAI policy" in the system prompt and write into that policy what it's allowed to generate. I haven't seen any limitations to this method.
my_name_isnt_clever@reddit
Or to save tokens, I've seen good results with the Heretic version. It hasn't refused anything with zero system prompt rule shenanigans.
Finanzamt_Endgegner@reddit
Even if Qwen Next is worse atm, it was more a proof of concept, and it allows the Kimi Linear model to be implemented in less time, since that one builds on this architecture (;
Sea-Speaker1700@reddit
I find 120b to be a terrible coder; it just dumps generic trash into the codebase without actually fitting it to existing patterns.
80b will try to match existing patterns more closely, but it's still a very, very long way off frontier models.
Mkengine@reddit
https://artificialanalysis.ai/models/comparisons/qwen3-next-80b-a3b-reasoning-vs-gpt-oss-120b
Daniel_H212@reddit
Exciting not because I care about this model, but because this means we'll be able to run Qwen3.5 or Qwen4 whenever that comes out. This model is, as far as I can tell, an architectural proof of concept and is nowhere close to being finished training. They say they only spent 10% of the training cost on this compared to what was put into Qwen3 32B, and even if that's because this architecture is easy to train, it seems like cost won't be a barrier to training it further.
InevitableWay6104@reddit
It's not about the length of training; it was cheaper to train because of the architecture differences, which matters to them because they can iterate faster.
They did explicitly say this is an experimental model testing their new efficiency-oriented architecture improvements. That doesn't mean it's not "fully trained"; it likely is. It's just an experimental, mostly unpolished preview model that doesn't have all the kinks worked out yet.
Schlick7@reddit
Pretty sure they said that it wasn't the same dataset as the Qwen3 models but a reduced set.
Sea-Speaker1700@reddit
Exactly, hybrid linear attention is the future, so getting performant generic kernels written that can handle various compositions of linear vs. full attention layers (3:1, 7:1, etc.) is huge for the future outlook.
Getting proper internal MTP working will also be huge.
Icy_Resolution8390@reddit
What is the difference with the ilintar version?
Finanzamt_Endgegner@reddit
It's just Unsloth 2.0 GGUFs; other than that they run the same
Icy_Resolution8390@reddit
Is it the same? What about versions other than the Unsloth one?
Finanzamt_Endgegner@reddit
The model itself is the same, both work in llama.cpp, the unsloth will probably have a little bit better performance for the same file size though (;
Icy_Resolution8390@reddit
I downloaded a modified llama.cpp version from ilintar to run this model... but now you're telling me it's supported by standard llama.cpp? I don't see any mention of qwen3-next on GitHub…
Finanzamt_Endgegner@reddit
well, it's not in the precompiled releases yet, you'd have to compile it yourself (;
Icy_Resolution8390@reddit
Yes, I compile it myself with cmake, and both the instruct and thinking versions run well, but I downloaded other quants from lefromage or something like that, don't remember... I haven't tested the Unsloth version. If it's better, I can download the Unsloth version to test it too; if it's more optimized I might get a few tok/s more.
Finanzamt_Endgegner@reddit
well, the current llama.cpp might be faster per token; I'm not sure if the other one has any CUDA kernels atm. Though you can also wait a week or so and then use the Unsloth GGUFs with main llama.cpp, since by then all kernels should be implemented at least. There will probably be further performance upgrades later on (;
Icy_Resolution8390@reddit
Why must I wait a week? Can't I download it today? Or do they have some bugs they're repairing? I can wait a week, but I was thinking of downloading and testing these Unsloth quants with main llama.cpp tonight
Finanzamt_Endgegner@reddit
You can, though there will be upgrades to the performance during the next week (at least that's very likely), so don't take the speed as absolute, since it will increase (;
Also, you might need to redownload the GGUFs later if Unsloth changes anything, which could happen. But nothing stops you from doing some tests rn (:
Icy_Resolution8390@reddit
Hello my friend, I downloaded the Unsloth quantization of Qwen3-80B-A3B with the new llama.cpp, and it was very good with no GPU. I'm testing it now and it's pretty good; I get 3 tok/s more than with the lefromage quants
Icy_Resolution8390@reddit
Ok, I understand. I thought the implementation was complete and stable, but it can still be upgraded for better optimization. I'll try to download it in a few hours to compare the difference in tk/s with the other versions
AbheekG@reddit
Thank you!!!
Long_comment_san@reddit
Sorry to ask a relatively stupid (for some people) question, but what about i1 quants? Didn't these surpass regular quants? So why are regular quants still being made, if i1 are better and work on all hardware?
Cool-Chemical-5629@reddit
They may work on all hardware, but on some hardware they are much slower.
AXYZE8@reddit
You are describing IQ quants, not i1 quants.
Cool-Chemical-5629@reddit
I am aware of the two, but subjectively I never felt any speed difference between them, and they both feel slower than regular quants. Also, I believe it's been explained somewhere that they can be slower on some hardware, for example on the Vulkan runtime, which is what I'm using on my hardware.
Long_comment_san@reddit
What kind of hardware? 🤔 Something from GTX era? That's basically phased out
Cool-Chemical-5629@reddit
No. I can speak only for myself, but I have all AMD hardware and it's always slower than regular quants for some reason.
arcanemachined@reddit
Q4_0 and Q4_1 should run faster on these old cards, FYI, if you're not already using them.
AXYZE8@reddit
i1 (imatrix) quants are optimized against calibration data provided by the person who made the quantization. This usually reduces visible "brain damage", but it may worsen niche knowledge, which is why static quants are still being made.
Cool-Chemical-5629 described IQ quants, which are more compressed than Q quants: with an IQ quant you need less memory but more compute. IQ quants work great on powerful GPUs, but run really slowly on consumer CPUs with only 6-8 cores. Today I tested GLM 4.5 Air on GPU+CPU, and Q2_K_XL (46GB) was 20% faster than IQ2_XXS (42GB). IQ quants are useful when you want to fit a 27B/32B dense model fully into your 16-24GB of VRAM.
JawGBoi@reddit
Could anyone give some speed tests?
Curious if my 12gb 4070 super and 64gb RAM would run that at faster than 7 tokens per second.
AXYZE8@reddit
4070 SUPER + 64GB DDR4-2667 = 9.90 tok/s at 10k context with Q3_K_XL.
`--ngl 99 --n-cpu-moe 34`, if I recall correctly (I'm on my phone right now).
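Spelled out as a full command, that setup would look roughly like this; the model filename is a guess, and the right `--n-cpu-moe` value depends on your VRAM (lower it until VRAM is nearly full):

```shell
# -ngl 99 offloads every layer to the GPU, then --n-cpu-moe 34 moves the expert
# (MoE) tensors of the first 34 layers back to CPU so the rest fits in 12 GB VRAM.
# Model path and flag values are illustrative, not confirmed from the thread.
./llama-server -m Qwen3-Next-80B-A3B-Instruct-UD-Q3_K_XL.gguf \
  -ngl 99 --n-cpu-moe 34 -c 10240 --port 8080
```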
ixdx@reddit
On my hardware, it runs faster than gpt-oss-120b mxfp4. I used Q2 for the first time, and the responses seemed quite normal.
```
root@c6ec8a89e61c:/app# ./llama-bench --model /models/unsloth/Qwen3-Next-80B-A3B-Instruct-UD-Q2_K_XL/Qwen3-Next-80B-A3B-Instruct-UD-Q2_K_XL.gguf --n-cpu-moe 4
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 2 CUDA devices:
Device 0: NVIDIA GeForce RTX 5070 Ti, compute capability 12.0, VMM: yes
Device 1: NVIDIA GeForce RTX 5060 Ti, compute capability 12.0, VMM: yes
| model | size | params | backend | ngl | test | t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | --------------: | -------------------: |
| qwen3next ?B Q2_K - Medium | 27.31 GiB | 79.67 B | CUDA | 99 | pp512 | 365.14 ± 1.51 |
| qwen3next ?B Q2_K - Medium | 27.31 GiB | 79.67 B | CUDA | 99 | tg128 | 37.90 ± 0.25 |
build: ff55414 (1)
```
Sixbroam@reddit
Here are my bench results with a 780M running solely on 64GB DDR5-5600:
build: ff55414c4 (7186)
I'm quite surprised to see such "low" numbers. For comparison, here is the bench for GLM 4.5 Air, which is bigger and has 4x the number of active parameters:
And a similar test with GPT-OSS 120B:
prompt eval time = 4779.50 ms / 507 tokens ( 9.43 ms per token, 106.08 tokens per second)
eval time = 9206.85 ms / 147 tokens ( 62.63 ms per token, 15.97 tokens per second)
Maybe the Vulkan implementation needs some work too, or the compute needed for tg is higher due to some architecture quirks? Either way, I'm really thankful to Piotr and the llama.cpp team for their outstanding work!
GlobalLadder9461@reddit
How can you run gpt-oss-120b with only 64GB of RAM?
Sixbroam@reddit
I offload a few layers to an 8GB card (that's why I can't use llama-bench for gpt-oss). Not ideal, and it doesn't speed up the models that fit in my 64GB, but I was curious to test this model :D
Mangleus@reddit
I am equally curious about this and related questions, also having 8GB VRAM + 64GB RAM. I only use llama.cpp so far.
mouthass187@reddit
sorry if this is stupid, but I have an 8GB card and 64 GB of RAM; can I run this model? I've only tinkered with ollama so far, and I don't see how people are offloading to RAM. Do I use llama.cpp instead? What's the easiest way to do this? (I'm curious since RAM went up in price, but have no clue why.)
tmvr@reddit
It's going to be rough with an 8GB GPU only, the model itself would fill the RAM and offloading only 8GB from that is not a lot. A 16GB card would do better, it works fine with my 24GB 4090 and 64GB RAM because there is enough total memory to fit everything in comfortably.
Sixbroam@reddit
I don't know how you'd go about it with ollama, it seems to me that going the llama.cpp route is the "clean" way, you can look at my other comment regarding tensor splitting using llama.cpp here: https://www.reddit.com/r/LocalLLaMA/comments/1oc9vvl/amd_igpu_dgpu_llamacpp_tensorsplit_not_working/
Sea-Speaker1700@reddit
MTP is almost certainly not active in the 80B, so, just like in vLLM, we get an echo of what Next 80B is actually capable of due to serving limitations.
Finanzamt_Endgegner@reddit
not only that, the tri and cumsum kernels are still CPU-only I think; at least the CUDA ones aren't mergeable yet, though I'm sure we'll get them rather fast (;
MikeLPU@reddit
The same for glm4.5. They just skip these layers. So sad...
qcforme@reddit
I did implement it correctly in a branch of vLLM, with correct use of the linear attention mechanism interleaved with full attention, as an experiment attempting to integrate prefix caching.
It does work: prefix caching worked really well, I saw 50k+ TPS prefill on cache hits, but decode performance is poor because of CUDA graph incompatibility with the hybrids. Plus I was working with a 3-bit quant due to the VRAM I had at the time, so model damage was inseparable from kernel mistakes when debugging.
The hybrids will require months of work to get fully right, and need fundamental changes in the core of both inference engines, llama.cpp and vLLM, plus someone with 192GB+ VRAM to properly test it.
More than I was willing to take on at the moment, as I can't serve the 16-bit 80B for verification.
Sixbroam@reddit
Thank you for the added bit of information regarding MTP! Yes, I saw a few comments explaining that the focus wasn't on performance, but I wasn't expecting such a hit on tg. It's just out of curiosity though, not complaining :)
PraxisOG@reddit
Depends pretty heavily on what RAM that is. DDR5-5600 in dual channel has a bandwidth of about 90 GB/s; divided by 3B active parameters, that gives about 30 tok/s, though real performance might be about half that.
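That back-of-envelope math can be reproduced directly. This sketch assumes ~1 byte per active parameter (roughly Q8) and DDR5-5600 dual channel at ~89.6 GB/s; both numbers are assumptions, not measurements:

```shell
# Decode-speed upper bound ≈ memory bandwidth / bytes read per token.
# 3B active params * ~1 byte/weight (Q8-ish) ≈ 3 GB touched per generated token.
awk 'BEGIN {
  bw_bytes_per_s  = 89.6e9   # assumed dual-channel DDR5-5600 bandwidth
  bytes_per_token = 3.0e9    # assumed bytes of weights read per token
  printf "%.1f tok/s upper bound\n", bw_bytes_per_s / bytes_per_token
}'
```

At Q4 the bytes per token roughly halve, which is part of why quantized MoE models decode faster than the raw parameter count suggests; real throughput lands well below this bound due to overheads beyond weight reads.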
JawGBoi@reddit
I forgot to mention RAM speed, whoops.
I only have DDR4 3200. I expect that will affect the end speed significantly.
usernameplshere@reddit
~51 GB/s for your RAM
InevitableWay6104@reddit
does llama.cpp not support qwen3 next 80b on rocm???
fallingdowndizzyvr@reddit
It does. But Vulkan is faster.
InevitableWay6104@reddit
vulkan is not faster on amd.
fallingdowndizzyvr@reddit
It is.
https://github.com/ggml-org/llama.cpp/pull/16095#issuecomment-3589897501
T_UMP@reddit
On Strix Halo with Vulkan it loads, but then it crashes once it tries to generate, with no errors.
With ROCm it works at 114 t/s pp and 14 tk/s tg.
CPU works at 7 tk/s.
UD-Q4_K_XL.
Mean-Sprinkles3157@reddit
I used to run Qwen3-Next with a test build of llama.cpp (mainline didn't support 'next'). Is it still true that I have to use a different llama.cpp?
Finanzamt_Endgegner@reddit
nope, this is in main-branch llama.cpp now
Mean-Sprinkles3157@reddit
Thanks! I ran the model (Q8) on a DGX Spark; it gets 14 tokens per second. I think that's OK for a model using 80GB of VRAM. It passed my Latin test; I hope it can replace gpt-oss-120b (60GB VRAM) for me.
Below is my command line:
```
./bin/llama-server \
  -m ~/models/Qwen3-Next-80B-A3B-Instruct-Q8_0-00001-of-00002.gguf \
  --host 0.0.0.0 --port 8080 \
  -ngl 99 -n 16384 -c 131072 \
  --temp 0.7 --top-p 0.8 --top-k 20 \
  --verbose
```
If anyone is an expert on using llama-server, please teach me whether I could increase the context window size to 262144. I mostly use the model with Cline (VS Code), and I'm not sure if "rope-scaling: yarn" would work with Cline.
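On the 262144 question: Qwen3-Next advertises a 262,144-token native context window, so if the GGUF metadata carries it, raising `-c` alone should be enough given memory; llama.cpp's YaRN flags are only needed to stretch past the native window. A sketch, with illustrative flag values:

```shell
# Ask for the full native window; no RoPE scaling needed up to 262144.
./bin/llama-server -m ~/models/Qwen3-Next-80B-A3B-Instruct-Q8_0-00001-of-00002.gguf \
  --host 0.0.0.0 --port 8080 -ngl 99 -c 262144 \
  --temp 0.7 --top-p 0.8 --top-k 20

# Only for contexts beyond the native window (towards ~1M), something like:
#   --rope-scaling yarn --rope-scale 4 --yarn-orig-ctx 262144
```

Note the KV/state memory for a 262k context is substantial, so whether this fits depends on your free VRAM/RAM.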
Finanzamt_Endgegner@reddit
yeah implementation is not yet optimized, but people are working on that (;
noiserr@reddit
I'm assuming that using smaller Qwen3 models as draft models for speculative decoding is not compatible, right?
2legsRises@reddit
awesome! now i only need 56GB+ of vram.
sammcj@reddit
Nice work, will be interesting to see how the UD-Q3_K_XL compares to Q4_K_M, as that would allow it to fit on 2x 24GB cards.
DrVonSinistro@reddit
Q8 UD K XL from Unsloth on 2x P40 + 1x RTX A2000 (60GB vram) gives me 11-12 t/s with 17k ctx filled out of 32k.
rm-rf-rm@reddit
From what I can tell from anecdotal usage and comments on here, it isn't a noticeable improvement over qwen3-coder:a3b, especially for coding.
It won't replace GPT-OSS:120b either. I'll still try it out and see whether it can replace qwen3-coder:a3b for agentic coding tasks.
The real win is the forward compatibility for Qwen 3.5/4; as I understand it, they will all follow this arch.
kevin_1994@reddit
my understanding is CUDA isn't quite ready yet?
also, does anyone know if these models support FIM? This seems perfect as a coding autocomplete model for me.
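For what it's worth, llama-server exposes an `/infill` endpoint for FIM-style completion; whether it works here depends on the model's vocab actually defining FIM tokens, which I haven't verified for Qwen3-Next. A sketch, assuming a server already running on port 8080:

```shell
# Fill-in-the-middle request against a running llama-server instance.
# Only works if the loaded model's tokenizer defines FIM tokens;
# unverified for Qwen3-Next, so treat this as an experiment.
curl -s http://localhost:8080/infill -d '{
  "input_prefix": "def add(a, b):\n    ",
  "input_suffix": "\nprint(add(1, 2))\n",
  "n_predict": 32
}'
```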
Finanzamt_Endgegner@reddit
Yeah, we just got the solve_tri kernel merged for CUDA; cumsum and tri are still missing as I understand it, but should be here soon (;
illkeepthatinmind@reddit
With llama.cpp 7180 getting
illkeepthatinmind@reddit
Looks like brew hasn't updated to 7190 yet
AleksHop@reddit
3B active, good. How much, and what, should be offloaded to the GPU? And the llama.cpp commands with the relevant flags?
Dreamthemers@reddit
`--n-cpu-moe 48` offloads all experts to CPU (same as `--cpu-moe`), so lower it until your VRAM is almost full for a performance increase.
Sea-Speaker1700@reddit
42
Your question is wildly lacking any amount of actual context to provide a meaningful answer.
Icy_Resolution8390@reddit
What is the difference between this version and the lefromage version?
jacobpederson@reddit
does it load in lm studio yet?
Nieles1337@reddit
No, it needs a runtime update.
yoracale@reddit
The Thinking ones will be up in like 1-2 hours or so: https://huggingface.co/unsloth/Qwen3-Next-80B-A3B-Thinking-GGUF
Icy_Resolution8390@reddit
Finally we did it
WhaleFactory@reddit (OP)
🐐🐐🐐🐐
Trilogix@reddit
https://github.com/Mainframework/HugstonOne/releases/tag/HugstonOne_Enterprise_Edition_with_memory
Qwen Next 80 supported now.
munkiemagik@reddit
Slightly adjacent question to this Qwen3-Next post:
With the work that's been done in llama.cpp to be able to finally run hybrid moe and gated deltanet Qwen3-Next, does any of that currently have any negative impact on regular MOE or dense models like GPT-OSS or Seed-OSS run with the same llama.cpp b7186?
mantafloppy@reddit
GGUF Model/llama.cpp release is broken.
Trying my standard coding prompt :
First try: the model got stuck repeating the same CSS at around 3000 tokens of context. Second try: the model got stuck writing an SVG forever at around 5000 tokens of context.
Prompt : Recreate a Pokémon battle UI — make it interactive, nostalgic, and fun. Stick to the spirit of a classic battle, but feel free to get creative if you want. In a single-page self-contained HTML.
I used the recommended setting from : https://docs.unsloth.ai/models/qwen3-next
As someone who uses lmstudio-community/Qwen3-Next-80B-A3B-Instruct-MLX-4bit as their main model, this is sad. Guess vibe coding a llama.cpp release doesn't work.
_raydeStar@reddit
Ahhhhhhhhhhhhh it's here!!!
That's all I gotta say
jacek2023@reddit
Thanks!
WhaleFactory@reddit (OP)
🐐🐐🐐🐐
WhaleFactory@reddit (OP)
Don't thank me, THANK YOU!
🐐🐐🐐🐐
Hulksulk666@reddit
Thanks!!
nore_se_kra@reddit
Interesting: finally some medium-sized models again, despite MoE. Benchmarks don't look that overwhelming, but let's see.