Fastest QWEN Coder 80B Next

[-]

cleverusernametry@reddit

"Insanely fast"

Shares no numbers at all

[-]

StacksHosting@reddit (OP)

nathan@llm1:\~$ \~/llama.cpp/build/bin/llama-bench \

-m \~/models/Qwen3-Coder-Next-APEX-I-Quality.gguf \

-ngl 99 -fa 1 \

-p 512 -n 128 \

-r 3

ggml_vulkan: Found 1 Vulkan devices:

| ------------------------------ | ---------: | ---------: | ---------- | --: | -: | --------------: | -------------------: |

| qwen3next 80B.A3B Q6_K | 50.39 GiB | 79.67 B | Vulkan | 99 | 1 | pp512 | 585.31 ± 3.14 |

| qwen3next 80B.A3B Q6_K | 50.39 GiB | 79.67 B | Vulkan | 99 | 1 | tg128 | 50.35 ± 0.14 |

build: 825eb91a6 (8606)

Prompt processing 585 Tok/s
Output 50 Tok/s

On an 80B parameter model with amd 395_ Max AI 128GB of ram

[-]

Automatic-Arm8153@reddit

And have you compared a regular quant to be making that claim about the speed? Try a unsloth Q4 or Q5 and report the speed otherwise this is a useless post

[-]

StacksHosting@reddit (OP)

So glad you asked this question

Mudler just made a post about that

https://x.com/mudler_it/status/2040834275667235305

Left: Unsloth Qwen3.5-35B-A3B-UD-Q8_K_XL.gguf (48.7 GB, \~32 tok/s) → flat square (?????)

Right: APEX Qwen3.5-35B-A3B-APEX-I-Quality.gguf (22.8 GB, \~53 tok/s) →

Honestly I'm trying to get the AI and PicoClaw configured and working great for my new Cloud Launching

I've been installing drives and setting up servers for the last 2 days getting ready for Datacenter install Wednesday

Creating the cheapest cloud on the planet

[-]

Automatic-Arm8153@reddit

Okay not sure if you and this mud person are the same and trying to promote yourselves or if you’re just new to this and have been fooled.

That test is what we call unfair. Look at the weights, 48gb for unsloth vs 22gb for whatever apex bs.

Tell him to run a unsloth Q4 with similar size eg Q4_K_KL

Or you can do it if you’re so confident about this and your cheapest cloud.

Not hating, just no need to spread bs and misrepresented info.

[-]

StacksHosting@reddit (OP)

Too funny first i'm not showing it's fast

I do that

then it's how does it compare to unsloth

I show you that

then you just go to "Your a Liar!" LOL

Mudler runs https://localai.io/

Don't know him never met him

the reduced size is EXACTLY the point of Apex if you look at Q4 it's still larger and you get 0 Q8 like you do with APEX

don't worry you'll get there

I'm setting up stackshosting.cloud come check me out maybe sign up for the waitlist grab a free container when I launch here in a week or so

[-]

Automatic-Arm8153@reddit

What do you think regular quantisation is this whole time?…

The reason you see all those funny letters after quants eg Q4_K_KL is because some layers are in higher precision like the Q8 you mention..

So what exactly is the point of alpha mud tech?

[-]

Wonderful_Second5322@reddit

https://huggingface.co/mudler/Qwen3-Coder-Next-APEX-GGUF

Oh hehe

[-]

unbannedfornothing@reddit

What's the difference between i and non-i variants?

[-]

pmttyji@reddit

Below repo has all details

https://github.com/mudler/apex-quant

[-]

StacksHosting@reddit (OP)

Great question and to be totally honest I'm still learning myself LOL

A lot of them I think right now are being trained on just wikitext for the openweights being used during the Apex Process, I used Coding specifically on this one

So I took the BF16 file used the coding examples to crete the matrix that's in the repo

that tells it that these coding weights are the most important to optimize for

then I ran it through the APEX process which shrunk it but also emphasized coding

it's built on TurboQuant, that shrinks and optimizes KV cache well now this shrinks and optimizes the model............totally braking my brain but it works

[-]

StacksHosting@reddit (OP)

Oh I didn't even see he did that one also he's been doing it a lot since he created the process

I just ran the complete process myself and posted it

The main difference is he's using a varied dataset for his APEX where mine is SPECIFICALLY focused on Coding

So the APEX version I did should be far better at coding than his diverse dataset

[-]

Tried with GPU (RTX 4090 24G) + CPU (i9 13900KS), no improvement made: prompt 37.94 tokens/s, gen 27.45 t/s remained, same as Qwen3-Coder-Next-UD-Q4_K_XL. Switched to the CPU-only and seen no improvement either.

llama.cpp master, start options:

# CPU + GPU
llama-server -m ./Qwen3-Coder-Next-APEX-I-Quality.gguf \
    -c $((64 * 1024)) \
    -fa on \
    --seed 3407 \
    --temp 1.0 \
    --top-p 0.95 \
    --min-p 0.01 \
    --top-k 40 \
    --threads 16 \
    --direct-io --no-mmap --mlock \
    --port 9099

# CPU-only
llama-server -m ./Qwen3-Coder-Next-APEX-I-Quality.gguf \
    -c $((64 * 1024)) \
    -fa on \
    -ngl 0 \
    --seed 3407 \
    --temp 1.0 \
    --top-p 0.95 \
    --min-p 0.01 \
    --top-k 40 \
    --threads 16 \
    --direct-io --no-mmap --mlock \
    --port 9099

Am I doing something wrong? It would be great to actually get those 50 t/s for the agentic coding.

[-]

StacksHosting@reddit (OP)

I would try a smaller one that only fits in VRAM GPU Only

Try this and let's see how it does

https://huggingface.co/mudler/Qwen3.5-35B-A3B-Claude-Distilled-APEX-GGUF

Qwen3.5-35B-A3B-Claude-Distilled-APEX-I-Compact.gguf	I-Compact	\~17 GB	Consumer GPUs, best quality/size

[-]

Public-Thanks7567@reddit

Можно так ? :)

[-]

soyalemujica@reddit

Gave this a try, and its quality is comparable to Q6 for what I could test

[-]

soyalemujica@reddit

How does does it compare to Q4 or Q5?

[-]

StacksHosting@reddit (OP)

it's far better near lossless quality while being smaller and faster

[-]

asfbrz96@reddit

How does it compare to q8

[-]

StacksHosting@reddit (OP)

I literally did this yesterday for the first time LOL so still learning but this is what I understand

The overall average is 5.43 bits per weight so it's smaller than Q8

But traditional Quants apply the same quantization across every layer

so if you are Q8 everything is Q8 but do you really care that everything is Q8?

The critical layers — shared experts, attention — get Q8_0 precision

the parts rarely activated are Q4/Q5 but the end result is near Q8 for 2/3 of the size

[-]

isugimpy@reddit

Apologies if I'm just not understanding something that's explained by the repo and the APEX process, but is this meant to be comparable to the q8 of the base model in terms of output quality? It's not obvious what the user should expect in terms of trade-offs.

[-]

StacksHosting@reddit (OP)

it's not Quant4 it's basically full quality ,it's breaking my brain this guy mudler_it on X created it I think

it's not like Quant8 or 6 or 4 it's something completely new

it's taking the BF16 version and then shrinking it down but first I created an importance matrix with 50k code examples from HuggingFace

this is all built upon KV Caching which reduces your context cache and that actually speeds up token input and you can combine the two together

[-]

isugimpy@reddit

I understand that the process is different, that's not really what I'm asking. I'm asking about the resulting output. With traditional quantization, the results tend to degrade as you reach lower values. I'm asking where on the spectrum this compares. Like, bf16 to q8 tends to be relatively close. q8 to q6 usually isn't a noticeable difference. q4 outputs tend to be significantly worse to a point where complex problems can't easily be solved.

Have you benchmarked this in some way to see how your results compare to the base model?

[-]

StacksHosting@reddit (OP)

I haven't run formal benchmarks comparing the APEX quant against the BF16 base model yet, so I can't give you exact numbers.

it's not evenly quantized

Basically the important layers get the best quality and the less critical weights based on my importance matrix are lower precision

so you end up with a better smaller faster model around what you optimize it for

to me this is a complete game changer in how models are quantized I still need to do more testing this is so new everyone is really just testing but so far the results are great from what i've seen with my limited experience

[-]

Wonderful_Second5322@reddit

You replicate it dude?

[-]

StacksHosting@reddit (OP)

I don't know what you mean, I took QWEN Coder 80B Next and ran it through Apex Quantization process

now it's even better at coding, faster, and smaller

[-]

Wonderful_Second5322@reddit

https://huggingface.co/mudler/Qwen3-Coder-Next-APEX-GGUF

Dude?

[-]

StacksHosting@reddit (OP)

Yes for sure I did what he's doing using his scripts!

[-]

Own_Suspect5343@reddit

Can you do it with qwen 3.5 122B?

[-]

StacksHosting@reddit (OP)

Mudler already did it, he started this here is more that he did
https://huggingface.co/collections/mudler/apex-quants-gguf

[-]

Easy_Kitchen7819@reddit

Is it possible make something like q4kxl with using this technique

[-]

StacksHosting@reddit (OP)

This is some interesting stuff MUCH higher quality than Q4 basically losless in quality but it's smaller and faster

[-]

FerradalFCG@reddit

but this is not MLX, is it?

[-]

StacksHosting@reddit (OP)

No, it's GGUF llama.cpp format

Run llama.cpp and check it out

[-]

FerradalFCG@reddit

I'm using omlx all the time now... only mlx models, never used any other format, maybe I'll give a try to this one in omlx to see if it is as fast and as good as mlx version of that model...

[-]

StacksHosting@reddit (OP)

Try it and let me know

the new APEX process is blowing my mind it's built around TurboQuant KV caching but now it's extended to the model