Consider running a bigger quant if possible
Posted by Flashy_Management962@reddit | LocalLLaMA | View on Reddit | 36 comments
Just a little reminder that *if* it is possible for you to run bigger quants, do it. I ran Qwen 3.6 IQ4_XS at 128k context and was very disappointed: it would loop, make formatting errors, implement the wrong things, etc. I had a little headroom, decided to give the new unsloth IQ4_NL_XL a try, and what can I say: it works MUCH better for agentic coding. If you are like me and pick your model conservatively based on what completely fits into VRAM, that can hurt your experience to a very big degree. Always look at how long a task really takes to complete, and ignore tok/s for quant comparisons. You get things done faster if the model with slower tok/s (even with offload) takes less time to complete queries correctly (duh).
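The OP's point about ignoring tok/s can be sketched with toy numbers: what matters is wall-clock time to a *correct* result, including retries. The tok/s figures and retry counts below are invented for illustration, not measurements.

```python
# Toy comparison of "effective" completion time: a slower quant that
# answers correctly in one shot can beat a faster quant that needs retries.
# All numbers here are made-up illustrations.

def total_time_s(tokens_per_reply: int, tok_per_s: float, attempts: int) -> float:
    """Wall-clock time to reach a correct reply, counting failed attempts."""
    return attempts * tokens_per_reply / tok_per_s

fast_small_quant = total_time_s(tokens_per_reply=2000, tok_per_s=40, attempts=3)
slow_big_quant = total_time_s(tokens_per_reply=2000, tok_per_s=25, attempts=1)

print(f"small quant, 3 tries: {fast_small_quant:.0f} s")  # 150 s
print(f"big quant, one shot:  {slow_big_quant:.0f} s")    # 80 s
```

The "slow" model wins despite generating 40% fewer tokens per second.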
synw_@reddit
Sometimes small quants are good: for example, I found that GLM Flash Q2_K_XL was better than Q3_K_M, and faster. Very good quant with a great size/power/speed ratio.
FullOf_Bad_Ideas@reddit
Then you need the model to be competitive with models that have more parameters and are more quantized.
Dry_Cartographer3348@reddit
Can someone please explain all the types of q4 quants? Idk the major differences between these XS, S, L variants
EggDroppedSoup@reddit
S = small (fastest), M = medium (balanced, the recommended default), L = large. XL isn't standard for quants; if you see it, it usually denotes model scale (e.g., 70B vs 8B), not quantization.
Evening_Ad6637@reddit
Don’t confuse the "I" in IQ4_NL with importance matrix.
Both I-quants and K-quants can have an importance matrix or not. The I-quants are just a little bit more optimized for CPU-only use.
PS: Regarding Q8_0 and speed: this quant always has the best prompt-processing-speed to file-size ratio.
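The importance-matrix idea mentioned above can be sketched in a few lines: when picking a quantization scale, you weight the squared error by a per-weight importance signal so the weights that matter most are preserved best. This is NOT llama.cpp's actual imatrix algorithm, just a toy illustration of the underlying idea; all data here is random.

```python
import numpy as np

rng = np.random.default_rng(0)
w = rng.normal(size=256)                   # original weights (toy)
importance = rng.uniform(0.1, 10.0, 256)   # e.g. derived from activation statistics

def quantize(w, scale, bits=4):
    """Symmetric round-to-nearest quantization at a given scale."""
    qmax = 2 ** (bits - 1) - 1
    q = np.clip(np.round(w / scale), -qmax - 1, qmax)
    return q * scale

# Search candidate scales; pick the one minimizing plain vs importance-weighted error.
scales = np.linspace(0.01, 1.0, 200)
plain = min(scales, key=lambda s: float(np.sum((w - quantize(w, s)) ** 2)))
weighted = min(scales, key=lambda s: float(np.sum(importance * (w - quantize(w, s)) ** 2)))

err_plain = float(np.sum(importance * (w - quantize(w, plain)) ** 2))
err_weighted = float(np.sum(importance * (w - quantize(w, weighted)) ** 2))
print(f"weighted error, plain scale:    {err_plain:.3f}")
print(f"weighted error, weighted scale: {err_weighted:.3f}")
```

By construction the importance-weighted search can never do worse than the plain one on the weighted metric, which is the intuition behind why imatrix quants tend to score better.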
SSOMGDSJD@reddit
The XL/L ones will have more weights at Q6, some at Q5, some at Q4. M and S will have more at Q4 and fewer at the bigger levels. Kind of like how 3G/4G/5G mobile data speeds started out as technical terms and morphed into relative speed terms, quant levels have done the same for model size. Some actually are just Q4, others are not; it just depends on who did the quantization and how they prioritized the weights.
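The mixed-recipe point above can be made concrete with back-of-the-envelope arithmetic: a quant's effective bits-per-weight is just the parameter-weighted average over its tensor groups. The tensor groups, parameter counts, and per-group bit widths below are invented for illustration; real recipes vary per quantizer.

```python
# Toy estimate of effective bits-per-weight for a mixed quant recipe.
recipe = {
    # tensor group: (params in billions, bits per weight) -- illustrative only
    "ffn (most experts)":  (20.0, 4.5),  # ~Q4_K level
    "attention":           (4.0, 6.5),   # ~Q6_K level
    "embeddings/output":   (2.0, 8.5),   # ~Q8_0 level
}

total_params = sum(p for p, _ in recipe.values())
total_bits = sum(p * b for p, b in recipe.values())
avg_bpw = total_bits / total_params

print(f"effective bits/weight: {avg_bpw:.2f}")
size_gib = total_bits * 1e9 / 8 / 2**30
print(f"approx file size: {size_gib:.1f} GiB")
```

So a file can land at a "Q5-ish" average size even though most of its weights sit at ~4.5 bits, which is why the level in the filename is only a rough summary.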
jikilan_@reddit
Just need to understand the bigger the size, the better the model. Just like our guns.
Strict_Primary_1664@reddit
What GPU are you running?
Flashy_Management962@reddit (OP)
2x rtx 3060 with 12gb vram total
Lost-Health-8675@reddit
Just a few days ago I found that Q4_K_XL does a surprisingly better job than Q5_K_S.
Willing-Toe1942@reddit
I remember that the first time unsloth did a KL divergence test, it showed UD-Q4_K_XL was indeed the closest to the original.
Willing-Toe1942@reddit
look here (relative error for UD-Q4_K_XL is zero and accuracy is insane)
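For reference, the KLD metric mentioned above compares the quantized model's next-token distribution against the full-precision model's on the same text. A minimal sketch of the math, with made-up logits (in practice both sets of logits come from running the two models on an evaluation corpus):

```python
import math

def softmax(logits):
    """Convert raw logits into a probability distribution."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    s = sum(exps)
    return [e / s for e in exps]

def kl_divergence(p, q):
    """KL(p || q): how far the quant's distribution q drifts from the reference p."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

full_precision = softmax([2.0, 1.0, 0.5, -1.0])  # reference model (toy logits)
quantized = softmax([1.9, 1.1, 0.4, -0.9])       # slightly perturbed logits

print(f"KLD: {kl_divergence(full_precision, quantized):.5f}")
```

Lower is better; an identical distribution gives exactly zero, which is why "relative error is zero" is the best possible result on that chart.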
Lost-Health-8675@reddit
Yeah, found the same graph
Pleasant-Shallot-707@reddit
Or, a better one
Gesha24@reddit
General recommendation is to avoid S models altogether. In theory Q5_S is better than Q4_L, but in reality Q4_L may be "smarter".
shockwaverc13@reddit
so should i choose Q3_K_XL over IQ4_XS?
DefNattyBoii@reddit
hell nah bro https://old.reddit.com/r/LocalLLaMA/comments/1rv6jyh/qwen3535b_gguf_quants_1622_gib_kld_speed/
ixdx@reddit
S, M, L in the names of models from different people can be completely different inside.
Velocita84@reddit
This advice is absolutely not universally applicable. Everyone quantizes models with different recipes, and labels like Q4_K_S or IQ4_XS are completely arbitrary. For example, Unsloth's Qwen3.6 35B IQ4_XS actually has IQ3_S tensors under the hood (ffn_gate_exps and ffn_up_exps), but because everything else is Q8 or F32 they call it IQ4_XS, because it's about as big as a quant with llama.cpp's default IQ4_XS mix would be.
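Rather than trusting the label, you can tally what a GGUF actually contains per tensor. The helper below works on a list of (name, quant type) pairs; the sample list is invented for illustration. With llama.cpp's `gguf` Python package you could feed it real metadata, e.g. something along the lines of `[(t.name, t.tensor_type.name) for t in GGUFReader("model.gguf").tensors]` (exact API per the package's docs).

```python
from collections import Counter

def summarize(tensors):
    """Count how many tensors sit at each quantization type."""
    return dict(Counter(qtype for _, qtype in tensors))

# Invented sample metadata for illustration.
tensors = [
    ("token_embd.weight", "Q8_0"),
    ("blk.0.ffn_gate_exps.weight", "IQ3_S"),
    ("blk.0.ffn_up_exps.weight", "IQ3_S"),
    ("blk.0.ffn_down_exps.weight", "IQ4_XS"),
    ("blk.0.attn_q.weight", "Q8_0"),
]
print(summarize(tensors))
```

A breakdown like this makes it obvious when a file labeled "IQ4_XS" is mostly 3-bit tensors averaged up by a few Q8 ones.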
Long_comment_san@reddit
Huh, that's a new one. Maybe that's why some models were "very miss" for me when they shouldn't have been, and I couldn't figure out why.
I thought that's just naming flavor
Gesha24@reddit
A lot depends on tasks too. For example, I tend to get "smarter" responses from Qwen3.6-35B than from Qwen3.5-27B, but once the context window gets well above 120K - it tends to get dumber, confused, may run in circles. Qwen3.5 in these conditions tends to be more reliable.
ag789@reddit
I used Qwen3.6-35B-A3B-UD-Q4_K_XL.gguf
https://huggingface.co/unsloth/Qwen3.6-35B-A3B-GGUF
as for memory, I have 32 GB of DRAM (no GPU!) and am running on a plain old Haswell i7 PC, getting about 5 tok/s when initially starting up.
if you have a GPU, I'd still suggest getting the big model, e.g. UD-Q4_K_XL. I run llama.cpp, and according to its docs it can 'overflow' part of the model from the GPU into main memory.
it seems the big models can solve 'difficult prompts' better.
FullstackSensei@reddit
Now imagine how much better the model would be if you run Q8.
And no, you don't need enough VRAM to run Q8. Just let it spill into system RAM with -fit in llama.cpp. Yes, technically it will be slower, but you'll get things done a lot faster because you won't need to intervene as often.
It has been running for 10 hours now in an agentic loop, documenting an entire (quite sizeable) project on its own. It's running at ~12 t/s with 100k context (configured for 200k), and it's generating markdown files like a champ, fully unattended.
nikhilprasanth@reddit
How much vram and ram split are you using?
FullstackSensei@reddit
0 VRAM, 64GB unified memory on a Jetson AGX Xavier
Xp_12@reddit
If you are using VRAM, you can go a little bigger than your VRAM before you start to precipitously lose speed. The general rule of thumb is to keep at least 0.8 of the model in VRAM before the massive drops kick in.
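The 0.8 rule of thumb above is easy to sanity-check with arithmetic before downloading a quant. A minimal sketch, with the threshold and example sizes purely illustrative (KV cache and context also need VRAM, which this ignores):

```python
# Rough check of the "keep ~80% of the model in VRAM" rule of thumb.
def vram_fraction(model_gib: float, vram_gib: float) -> float:
    """Fraction of the model that fits in VRAM (capped at 1.0)."""
    return min(1.0, vram_gib / model_gib)

def expect_big_slowdown(model_gib: float, vram_gib: float, threshold: float = 0.8) -> bool:
    """True if less than `threshold` of the model fits in VRAM."""
    return vram_fraction(model_gib, vram_gib) < threshold

print(expect_big_slowdown(14, 12))  # 12/14 ≈ 0.86 in VRAM -> False
print(expect_big_slowdown(20, 12))  # 12/20 = 0.60 in VRAM -> True
```

So on a 12 GiB card, a ~14 GiB quant is still in the comfortable zone by this rule, while a 20 GiB quant is well past it.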
FullstackSensei@reddit
I don't mind said massive drops, TBH. A slow running model that can one shot a task unattended in one hour is much better than spending said hour constantly prompting the model or fixing screwups only because the model finished in 5 mins.
Lost-Health-8675@reddit
exactly what I wrote in my post today, smart vs fast: in the end, fast becomes slower with the amount of extra work
Xp_12@reddit
The comment was for people using GPUs. Doesn't really apply for UMA people.
FullstackSensei@reddit
I have plenty of GPUs, including two machines with 192GB VRAM each, and my philosophy is the same. I'm now running two instances of Minimax 2.7 Q8_K_XL on one machine, one instance each on three GPUs (Mi50s) plus CPU (dual-CPU system), even though Q4_K_XL fits entirely in VRAM and is over 3x faster.
For complex tasks, even at 200B, I see a difference between Q4 and Q8.
DependentBat5432@reddit
the tok/s trap is real. A model that thinks slower but gets it right in one shot saves way more time than a fast model that needs three retries
CharacterAnimator490@reddit
I ran some tests with Qwen 3.5 122B A10B, and for me the UD-IQ4_XS was a little bit better in every run than the UD-IQ4_NL. Which I find weird, but it seems bigger is not always better.
ParaboloidalCrest@reddit
AFAIR those two quants are pretty much the same size for qwen3.5-122b.
tacticaltweaker@reddit
I recently upgraded my GPU and switched from Unsloth IQ3_XXS to Bartowski's Q6_K_L for Qwen3.6. I was surprised at how much of a difference it made, which I guess I shouldn't be.
Billysm23@reddit
Of course that's huge. If you had been using Q5_K before, you wouldn't have noticed much of a difference.
Strict_Primary_1664@reddit
I wish I could figure out what the best model/quant I can run is, but every time I try a model, everything breaks. I get 1 day of use for every 1 day of fixing everything, LOL