Qwen 3.6 q8 at 50t/s or q4 at 112 t/s?

Posted by GotHereLateNameTaken@reddit | LocalLLaMA | View on Reddit | 24 comments

What are some ways that you would go about thinking about choosing between the two for use in a harness like pi?

Did a good bit with q4 yesterday and it was so consistent and reliable I had it set to 131k context and it worked through 2 compactings on a clearly defined task without messing the whole thing up. Very excited about this recent step forward.

I'm going to start working with the q8 some today but I was interested in what your impressions of the types of differences I might expect between the two.

[-]

denoflore_ai_guy@reddit

With the right system prompt and tweaking your top and min and temp values I’ve been able to get really really good quality out of bartowski’s iQ4_nl quant - 200tok/s or about 56-80tok/s doing 8-12 parallel batch tasks.

[-]

booey@reddit

Have you compared Qwen3.6-35B-A3B-UD-IQ4_NL_XL.gguf From unsloth to Qwen_Qwen3.6-35B-A3B-IQ4_NL.gguf from bartowski?

[-]

denoflore_ai_guy@reddit

In using bartowski - unsloth god bless em just never works right for me.

[-]

cviperr33@reddit

i think q8 is waste , like the differences are so small that ur wasting valuable contex space and speed

[-]

sn2006gy@reddit

I like the reliability of q8 - as others have said anything below Q8 and i start seeing more things that actually break my flow. MXFP4 has worked well, but i found you have to rely a lot more on the harness surviving these models than anything else. pi may be very light and less opinionated, but running Claude Code on a INT4 makes anyone want to blow their brains out with how dumb the model is without a really strong harness on the API side.

[-]

cviperr33@reddit

If you have a Blackwell GPU you dont have to worry about such things , you just use the NV4 or whatever it was called format , which basically is the size of Q4 and same speed , but its precision is 98% identical to full BF16 lol , imagine.

[-]

sn2006gy@reddit

yeah, but you still need 600+ GB of VRAM before your first user is on.

[-]

cviperr33@reddit

Wait if you have hardware that supports NVFP4 , why isnt it your primary and only choice ? on paper it sounds perfect , i cant try it out so i have no idea how is it actually in practice , is it easy to find these quants ? and what is the real speed/precision benefit you get , and whats this MXFP4 i never heard of it.

[-]

grumd@reddit

If NVFP4/MXFP4 is so good and basically identical to BF16, then why are MXFP4 quants of Qwen models even worse than Q4_K quants in terms of KLD benchmarks?

[-]

R_Duncan@reddit

Nvfp4 is not mature, not actually as fast as mxfp4. At least not on rtx 6000 Blackwell+cuda 13.0

[-]

sn2006gy@reddit

MXFP4 is the open non Nvidia version of the same quant. I'm not sure I want to pay an Nvidia tax for knowledge work in the perpetual future. GPT OSS 120b is MXFP4 native - i hope we see more. GPUs can run it, but it will be a nice day when AMD/Intel et all have native acceleration for it.

MXFP4 as a conversion is fine.. but it isn't advantageous like training from MXFP4 or converting to MXFP4 from FP16 where MXFP4 really shines.

[-]

R_Duncan@reddit

Sadly no gguf of nvfp4 exists, and vllm is way slower than llama.cpp even with nvfp4, at least on the rtx 6000 Blackwell I'm using.

[-]

cviperr33@reddit

Damn that sucks , i thought that you guys with blackwell 30gb+ vram are having a party right now with 300-400 tk/s 😂

I have vllm installed but i never bothered with trying it , in your experience how does it compare in terms of speed for MoE models , gemma 4 or qwen 3.5/6 , Is inference speed on single connection same or much slower for vllm. How much more VRAM do u need for vLLM quants ?

With llama.ccp everything is just so easy , you just look at different quant sizes , pick IQ4nl and download the .gguf file thats it. You know exactly how much vram its gonna take , and i can fill it in 24gb at 21gb used for 200k contex.

If i were to download vllm quant , how much vram would it use ? what kind of contex i would get and speed

[-]

ExecStart=/root/llama.cpp/build-rocm/bin/llama-server \
 --hf-repo unsloth/Qwen3.6-35B-A3B-GGUF:UD-Q6_K_XL \
 --no-mmap \
 --host 0.0.0.0 --port 11337 \
 --gpu-layers 99 --fit on \
 --flash-attn on --cache-type-k f16 --cache-type-v f16 \
 --device Vulkan1 \
 --presence-penalty 0.0 --repeat-penalty 1.0 --temperature 0.6 --top-k 20 --top-p 0.95 \
 --n-predict 32768 --ctx-size 524288 --parallel 2

I think UD-Q6_K_XL is where it's at. I get 50 t/s on a Strix Halo board. Very happy.

[-]

asfbrz96@reddit

Output quality on Q8 is on pair with f16