Consider running a bigger quant if possible
Posted by Flashy_Management962@reddit | LocalLLaMA | View on Reddit | 36 comments
Just a little reminder that *if* it is possible for you to run bigger quants, do it. I ran Qwen 3.6 IQ4_XS at 128k context and was very disappointed: it would loop, make formatting errors, implement the wrong things, etc. I had a little headroom, decided to give the new unsloth IQ4_NL_XL a try, and what can I say: it works MUCH better for agentic coding. If you are like me and pick your model conservatively based on what completely fits into VRAM, that can hurt your experience to a very big degree. Always look at how long a task really takes to complete, and ignore tok/s for quant comparisons. You get things done faster if the model with slower tok/s (even with offload) takes less time to complete queries correctly (duh).
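The OP's point about ignoring tok/s can be sketched with toy numbers: what matters is wall-clock time to a *correct* result, including retries. The tok/s figures and retry counts below are invented for illustration, not measurements.

```python
# Toy comparison of "effective" completion time: a slower quant that
# answers correctly in one shot can beat a faster quant that needs retries.
# All numbers here are made-up illustrations.

def total_time_s(tokens_per_reply: int, tok_per_s: float, attempts: int) -> float:
    """Wall-clock time to reach a correct reply, counting failed attempts."""
    return attempts * tokens_per_reply / tok_per_s

fast_small_quant = total_time_s(tokens_per_reply=2000, tok_per_s=40, attempts=3)
slow_big_quant = total_time_s(tokens_per_reply=2000, tok_per_s=25, attempts=1)

print(f"small quant, 3 tries: {fast_small_quant:.0f} s")  # 150 s
print(f"big quant, one shot:  {slow_big_quant:.0f} s")    # 80 s
```

The "slow" model wins despite generating 40% fewer tokens per second.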
synw_@reddit
Sometimes small quants are good: for example, I found that GLM Flash Q2_K_XL was better than Q3_K_M, and faster. Very good quant with a great size/power/speed ratio.
FullOf_Bad_Ideas@reddit
Then you need the model to be competitive with models that have more parameters and are more quantized.
Dry_Cartographer3348@reddit
Can someone please explain all the types of q4 quants? Idk the major differences between these XS, S, L variants
EggDroppedSoup@reddit
S = small (fastest), M = medium (balanced, the recommended default), L = large. XL isn't standard for quants; if you see it, it usually denotes model scale (e.g., 70B vs 8B), not quantization.
Evening_Ad6637@reddit
Don’t confuse the "I" in IQ4_NL with importance matrix.
Both I-quants and K-quants can have an importance matrix or not. The I-quants are just a little bit more optimized for CPU-only use.
PS: Regarding Q8_0 and speed: this quant always has the best prompt-processing-speed to file-size ratio.
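The importance-matrix idea mentioned above can be sketched in a few lines: when picking a quantization scale, you weight the squared error by a per-weight importance signal so the weights that matter most are preserved best. This is NOT llama.cpp's actual imatrix algorithm, just a toy illustration of the underlying idea; all data here is random.

```python
import numpy as np

rng = np.random.default_rng(0)
w = rng.normal(size=256)                   # original weights (toy)
importance = rng.uniform(0.1, 10.0, 256)   # e.g. derived from activation statistics

def quantize(w, scale, bits=4):
    """Symmetric round-to-nearest quantization at a given scale."""
    qmax = 2 ** (bits - 1) - 1
    q = np.clip(np.round(w / scale), -qmax - 1, qmax)
    return q * scale

# Search candidate scales; pick the one minimizing plain vs importance-weighted error.
scales = np.linspace(0.01, 1.0, 200)
plain = min(scales, key=lambda s: float(np.sum((w - quantize(w, s)) ** 2)))
weighted = min(scales, key=lambda s: float(np.sum(importance * (w - quantize(w, s)) ** 2)))

err_plain = float(np.sum(importance * (w - quantize(w, plain)) ** 2))
err_weighted = float(np.sum(importance * (w - quantize(w, weighted)) ** 2))
print(f"weighted error, plain scale:    {err_plain:.3f}")
print(f"weighted error, weighted scale: {err_weighted:.3f}")
```

By construction the importance-weighted search can never do worse than the plain one on the weighted metric, which is the intuition behind why imatrix quants tend to score better.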
SSOMGDSJD@reddit
The XL/L ones will have more weights at Q6, some at Q5, some at Q4. M and S will have more at Q4 and fewer at the bigger levels. Kind of like how 3G/4G/5G mobile data speeds started out as technical terms and morphed into relative speed terms, quant levels have done the same for model size. Some actually are just Q4, others are not; it just depends on who did the quantization and how they prioritized the weights.
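The mixed-recipe point above can be made concrete with back-of-the-envelope arithmetic: a quant's effective bits-per-weight is just the parameter-weighted average over its tensor groups. The tensor groups, parameter counts, and per-group bit widths below are invented for illustration; real recipes vary per quantizer.

```python
# Toy estimate of effective bits-per-weight for a mixed quant recipe.
recipe = {
    # tensor group: (params in billions, bits per weight) -- illustrative only
    "ffn (most experts)":  (20.0, 4.5),  # ~Q4_K level
    "attention":           (4.0, 6.5),   # ~Q6_K level
    "embeddings/output":   (2.0, 8.5),   # ~Q8_0 level
}

total_params = sum(p for p, _ in recipe.values())
total_bits = sum(p * b for p, b in recipe.values())
avg_bpw = total_bits / total_params

print(f"effective bits/weight: {avg_bpw:.2f}")
size_gib = total_bits * 1e9 / 8 / 2**30
print(f"approx file size: {size_gib:.1f} GiB")
```

So a file can land at a "Q5-ish" average size even though most of its weights sit at ~4.5 bits, which is why the level in the filename is only a rough summary.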
jikilan_@reddit
Just need to understand the bigger the size, the better the model. Just like our guns.
Strict_Primary_1664@reddit
What GPU are you running?
Flashy_Management962@reddit (OP)
2x rtx 3060 with 12gb vram total
Lost-Health-8675@reddit
Just a few days ago I found that Q4_K_XL does a surprisingly better job than Q5_K_S.
Willing-Toe1942@reddit
I remember that the first time unsloth did a KL divergence test, it showed UD-Q4_K_XL was indeed the closest to the original.
Willing-Toe1942@reddit
look here (relative error for UD-Q4_K_XL is zero and accuracy is insane)
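For reference, the KLD metric mentioned above compares the quantized model's next-token distribution against the full-precision model's on the same text. A minimal sketch of the math, with made-up logits (in practice both sets of logits come from running the two models on an evaluation corpus):

```python
import math

def softmax(logits):
    """Convert raw logits into a probability distribution."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    s = sum(exps)
    return [e / s for e in exps]

def kl_divergence(p, q):
    """KL(p || q): how far the quant's distribution q drifts from the reference p."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

full_precision = softmax([2.0, 1.0, 0.5, -1.0])  # reference model (toy logits)
quantized = softmax([1.9, 1.1, 0.4, -0.9])       # slightly perturbed logits

print(f"KLD: {kl_divergence(full_precision, quantized):.5f}")
```

Lower is better; an identical distribution gives exactly zero, which is why "relative error is zero" is the best possible result on that chart.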
Lost-Health-8675@reddit
Yeah, found the same graph
Pleasant-Shallot-707@reddit
Or, a better one
Gesha24@reddit
General recommendation is to avoid S models altogether. In theory Q5_S is better than Q4_L, but in reality Q4_L may be "smarter".
shockwaverc13@reddit
so should i choose Q3_K_XL over IQ4_XS?
DefNattyBoii@reddit
hell nah bro https://old.reddit.com/r/LocalLLaMA/comments/1rv6jyh/qwen3535b_gguf_quants_1622_gib_kld_speed/
ixdx@reddit
S, M, L in the names of models from different people can be completely different inside.
Velocita84@reddit
This advice is absolutely not universally applicable. Everyone quantizes models with different recipes, and labels like Q4_K_S or IQ4_XS are completely arbitrary. For example, Unsloth's Qwen3.6 35B IQ4_XS actually has IQ3_S tensors under the hood (ffn_gate_exps and ffn_up_exps), but because everything else is Q8 or F32 they call it IQ4_XS, because it's about as big as a quant with llama.cpp's default IQ4_XS mix would be.
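Rather than trusting the label, you can tally what a GGUF actually contains per tensor. The helper below works on a list of (name, quant type) pairs; the sample list is invented for illustration. With llama.cpp's `gguf` Python package you could feed it real metadata, e.g. something along the lines of `[(t.name, t.tensor_type.name) for t in GGUFReader("model.gguf").tensors]` (exact API per the package's docs).

```python
from collections import Counter

def summarize(tensors):
    """Count how many tensors sit at each quantization type."""
    return dict(Counter(qtype for _, qtype in tensors))

# Invented sample metadata for illustration.
tensors = [
    ("token_embd.weight", "Q8_0"),
    ("blk.0.ffn_gate_exps.weight", "IQ3_S"),
    ("blk.0.ffn_up_exps.weight", "IQ3_S"),
    ("blk.0.ffn_down_exps.weight", "IQ4_XS"),
    ("blk.0.attn_q.weight", "Q8_0"),
]
print(summarize(tensors))
```

A breakdown like this makes it obvious when a file labeled "IQ4_XS" is mostly 3-bit tensors averaged up by a few Q8 ones.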
Long_comment_san@reddit
Huh, that's a new one. Maybe that's why some models were "very miss" for me when they shouldn't have been, and I couldn't figure out why.
I thought that's just naming flavor
Gesha24@reddit
A lot depends on tasks too. For example, I tend to get "smarter" responses from Qwen3.6-35B than from Qwen3.5-27B, but once the context window gets well above 120K - it tends to get dumber, confused, may run in circles. Qwen3.5 in these conditions tends to be more reliable.
ag789@reddit
I used Qwen3.6-35B-A3B-UD-Q4_K_XL.gguf
https://huggingface.co/unsloth/Qwen3.6-35B-A3B-GGUF
as for memory, I have 32 GB of DRAM (no GPU!) and am running on a plain old Haswell i7 PC, getting about 5 tok/s when initially starting up.
if you have a GPU, I'd still suggest getting the big model, e.g. UD-Q4_K_XL. I run llama.cpp, and according to its docs it can 'overflow' part of the model from the GPU into main memory.
it seems the big models can solve 'difficult prompts' better.
FullstackSensei@reddit
Now imagine how much better the model would be if you run Q8.
And no, you don't need enough VRAM to run Q8. Just let it spill into system RAM with -fit in llama.cpp. Yes, technically it will be slower, but you'll get things done a lot faster because you won't need to intervene as often.
It has been running for 10 hours now in an agentic loop, documenting an entire (quite sizeable) project on its own. It's running at ~12 t/s with 100k context (configured for 200k), and it's generating markdown files like a champ, fully unattended.
nikhilprasanth@reddit
How much vram and ram split are you using?
FullstackSensei@reddit
0 VRAM, 64GB unified memory on a Jetson AGX Xavier
Xp_12@reddit
If you are using VRAM, you can go a little bigger than your VRAM before you start to precipitously lose speed. The general rule of thumb is to keep at least 0.8 of the model in VRAM before the massive drops kick in.
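The 0.8 rule of thumb above is easy to sanity-check with arithmetic before downloading a quant. A minimal sketch, with the threshold and example sizes purely illustrative (KV cache and context also need VRAM, which this ignores):

```python
# Rough check of the "keep ~80% of the model in VRAM" rule of thumb.
def vram_fraction(model_gib: float, vram_gib: float) -> float:
    """Fraction of the model that fits in VRAM (capped at 1.0)."""
    return min(1.0, vram_gib / model_gib)

def expect_big_slowdown(model_gib: float, vram_gib: float, threshold: float = 0.8) -> bool:
    """True if less than `threshold` of the model fits in VRAM."""
    return vram_fraction(model_gib, vram_gib) < threshold

print(expect_big_slowdown(14, 12))  # 12/14 ≈ 0.86 in VRAM -> False
print(expect_big_slowdown(20, 12))  # 12/20 = 0.60 in VRAM -> True
```

So on a 12 GiB card, a ~14 GiB quant is still in the comfortable zone by this rule, while a 20 GiB quant is well past it.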
FullstackSensei@reddit
I don't mind said massive drops, TBH. A slow running model that can one shot a task unattended in one hour is much better than spending said hour constantly prompting the model or fixing screwups only because the model finished in 5 mins.
Lost-Health-8675@reddit
exactly what I wrote in my post today, smart vs fast: in the end, fast becomes slower with the amount of extra work
Xp_12@reddit
The comment was for people using GPUs. Doesn't really apply for UMA people.
FullstackSensei@reddit
I have plenty of GPUs, including two machines with 192GB VRAM each, and my philosophy is the same. I'm now running two instances of Minimax 2.7 Q8_K_XL on one machine, one instance each on three GPUs (Mi50s) plus CPU (dual-CPU system), even though Q4_K_XL fits entirely in VRAM and is over 3x faster.
For complex tasks, even at 200B, I see a difference between Q4 and Q8.
DependentBat5432@reddit
the tok/s trap is real. A model that thinks slower but gets it right in one shot saves way more time than a fast model that needs three retries
CharacterAnimator490@reddit
I ran some tests with Qwen 3.5 122B A10B, and for me the UD-IQ4_XS was a little bit better in every run than the UD-IQ4_NL. Which I find weird, but it seems bigger is not always better.
ParaboloidalCrest@reddit
AFAIR those two quants are pretty much the same size for qwen3.5-122b.
tacticaltweaker@reddit
I recently upgraded my GPU and switched from Unsloth IQ3_XXS to Bartowski's Q6_K_L for Qwen3.6. I was surprised at how much of a difference it made, which I guess I shouldn't be.
Billysm23@reddit
Of course that's huge. If you had been using Q5_K before, you wouldn't have noticed much of a difference.
Strict_Primary_1664@reddit
I wish I could figure out what the best model/quant I can run is, but every time I try a model, everything breaks. I get 1 day of use for every 1 day of fixing everything, LOL