Qwen 3.6 q8 at 50 t/s or q4 at 112 t/s?
Posted by GotHereLateNameTaken@reddit | LocalLLaMA | View on Reddit | 24 comments
How would you go about choosing between the two for use in a harness like pi?
Did a good bit with q4 yesterday, and it was so consistent and reliable. I had it set to 131k context and it worked through two compactions on a clearly defined task without messing the whole thing up. Very excited about this recent step forward.
I'm going to start working with the q8 today, but I was interested in your impressions of the kinds of differences I might expect between the two.
denoflore_ai_guy@reddit
With the right system prompt and tweaking your top-p, min-p, and temp values, I've been able to get really, really good quality out of bartowski's IQ4_NL quant: 200 tok/s, or about 56-80 tok/s running 8-12 parallel batch tasks.
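For reference, sampler values like these are set on the llama.cpp server command line. This is a hypothetical sketch: the model path, sampler values, and slot count are illustrative placeholders, not the commenter's exact setup.

```shell
# Hypothetical llama-server launch with explicit sampler settings.
# Values here are placeholders; tune temp/top-p/min-p for your model.
llama-server -m Qwen_Qwen3.6-35B-A3B-IQ4_NL.gguf \
  -c 131072 -ngl 99 \
  --temp 0.7 --top-p 0.8 --min-p 0.05 \
  --parallel 8
```

`--parallel` sets the number of concurrent request slots, which is what makes the 8-12 parallel batch tasks above possible; note that slots share the context window.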
booey@reddit
Have you compared Qwen3.6-35B-A3B-UD-IQ4_NL_XL.gguf From unsloth to Qwen_Qwen3.6-35B-A3B-IQ4_NL.gguf from bartowski?
denoflore_ai_guy@reddit
I'm using bartowski's. Unsloth, god bless 'em, just never works right for me.
cviperr33@reddit
I think q8 is a waste; the differences are so small that you're giving up valuable context space and speed.
sn2006gy@reddit
I like the reliability of q8. As others have said, anything below Q8 and I start seeing more things that actually break my flow. MXFP4 has worked well, but I found you have to rely a lot more on the harness surviving these models than anything else. pi may be very light and less opinionated, but running Claude Code on an INT4 makes anyone want to blow their brains out with how dumb the model is without a really strong harness on the API side.
cviperr33@reddit
If you have a Blackwell GPU you don't have to worry about such things; you just use the NVFP4 format, which is basically the size of Q4 at the same speed, but its precision is ~98% identical to full BF16 lol, imagine.
sn2006gy@reddit
yeah, but you still need 600+ GB of VRAM before your first user is on.
cviperr33@reddit
Wait, if you have hardware that supports NVFP4, why isn't it your primary and only choice? On paper it sounds perfect. I can't try it out, so I have no idea how it actually is in practice. Is it easy to find these quants, and what is the real speed/precision benefit you get? And what's this MXFP4? I've never heard of it.
grumd@reddit
If NVFP4/MXFP4 is so good and basically identical to BF16, then why are MXFP4 quants of Qwen models even worse than Q4_K quants in terms of KLD benchmarks?
R_Duncan@reddit
NVFP4 is not mature, and not actually as fast as MXFP4, at least not on an RTX 6000 Blackwell with CUDA 13.0.
sn2006gy@reddit
MXFP4 is the open, non-Nvidia version of the same kind of quant. I'm not sure I want to pay an Nvidia tax for knowledge work in perpetuity. GPT-OSS 120B is MXFP4-native; I hope we see more. GPUs can run it, but it will be a nice day when AMD/Intel et al. have native acceleration for it.
Converting to MXFP4 from FP16 is fine, but MXFP4 really shines when a model is trained in it natively rather than quantized after the fact.
R_Duncan@reddit
Sadly no GGUF of NVFP4 exists, and vLLM is way slower than llama.cpp even with NVFP4, at least on the RTX 6000 Blackwell I'm using.
cviperr33@reddit
Damn, that sucks. I thought you guys with Blackwell 30GB+ VRAM were having a party right now with 300-400 tk/s 😂
I have vLLM installed but never bothered trying it. In your experience, how does it compare in terms of speed for MoE models like Gemma 4 or Qwen 3.5/3.6? Is single-connection inference speed the same, or much slower on vLLM? How much more VRAM do you need for vLLM quants?
With llama.cpp everything is just so easy: you look at the different quant sizes, pick IQ4_NL, and download the .gguf file. That's it. You know exactly how much VRAM it's going to take, and I can fill 24GB at 21GB used for 200k context.
If I were to download a vLLM quant, how much VRAM would it use, and what kind of context and speed would I get?
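As a rough back-of-the-envelope check (my own sketch, not a vLLM- or llama.cpp-specific formula), weight VRAM is roughly parameter count times bits per weight, plus a KV cache that scales with context. The bits-per-weight and KV-bytes-per-token figures below are illustrative assumptions; real values vary by quant and model architecture.

```python
def est_vram_gb(params_b, bits_per_weight, ctx, kv_bytes_per_token):
    """Rough VRAM estimate: weights + KV cache (ignores activations/overhead)."""
    weights_gb = params_b * 1e9 * bits_per_weight / 8 / 1e9
    kv_gb = ctx * kv_bytes_per_token / 1e9
    return weights_gb + kv_gb

# e.g. a 35B model at ~4.5 bits/weight with 200k context,
# assuming ~10 KB of KV cache per token (model-dependent)
print(round(est_vram_gb(35, 4.5, 200_000, 10_000), 1))  # -> 21.7
```

Under those assumptions the estimate lands near the ~21GB the commenter reports, but treat it as a sanity check rather than a guarantee; engine overhead differs between llama.cpp and vLLM.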
R_Duncan@reddit
NVFP4 isn't present in llama.cpp, and vLLM is much slower here.
takuonline@reddit
Try: https://www.reddit.com/r/unsloth/comments/1so4uoh/qwen3635ba3b_gguf_performance_benchmarks/
ixdx@reddit
Q5_K or Q6_K at ~100 t/s
Adventurous-Paper566@reddit
Q6_K_L is the sweet spot ^^
ixdx@reddit
That's right, Q6_K_L fits into 32GB of VRAM without mmproj, and Q5_K_L with mmproj.
Adventurous-Paper566@reddit
Q6
AndreVallestero@reddit
Check the perplexity graphs for the exact quants you're using. It'll help you figure out where losses begin. If you're like everyone else and using unsloth quants, q5 seems to be the sweet spot.
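If no published graph covers your exact quants, llama.cpp ships a `llama-perplexity` tool that can measure both perplexity and KL divergence against a higher-precision baseline. The file names below are placeholders, and flag spellings can change between builds, so check `llama-perplexity --help` for yours.

```shell
# 1) Save reference logits from a high-precision quant on a test corpus:
llama-perplexity -m model-Q8_0.gguf -f wiki.test.raw \
  --kl-divergence-base logits.bin
# 2) Compare a smaller quant's KL divergence against that baseline:
llama-perplexity -m model-Q5_K_M.gguf -f wiki.test.raw \
  --kl-divergence-base logits.bin --kl-divergence
```

This is the same KLD-style comparison mentioned upthread for MXFP4 vs Q4_K, just run on your own hardware and corpus.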
Hot_Turnip_3309@reddit
If I run anything under q8, it gets stuck in loops around 60-70k ctx. And I get 40 tk/s with q8.
R_Duncan@reddit
UD-Q5_K_XL and UD-MXFP4-MOE do not get stuck in loops.
tecneeq@reddit
I think UD-Q6_K_XL is where it's at. I get 50 t/s on a Strix Halo board. Very happy.
asfbrz96@reddit
Output quality on Q8 is on par with F16.