Introducing cyankiwi AWQ 4-bit Quantization — 26.05 update
Posted by _cpatonn@reddit | LocalLLaMA | 11 comments
In standard AWQ, per-channel scales and quantization ranges are picked in separate steps: scales first, then the quantization parameters. But they're not independent: the rounding error from one depends on the choice of the other, so optimizing them in sequence leaves quality on the table. Our cyankiwi AWQ 26.05 update jointly fits scales and quantization ranges against a reconstruction objective.
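For intuition, here is a minimal PyTorch sketch of what "jointly fits" means in practice. This is not the actual cyankiwi implementation: the alpha/clip grids, group size, and function names are illustrative assumptions. The point is simply that the scale exponent and the clip ratio are searched together against the layer's output reconstruction error, instead of committing to the scales first and searching the range afterwards.

```python
import torch

def quant_dequant(w, n_bits=4, group_size=128, clip_ratio=1.0):
    """Asymmetric round-to-nearest INT4 over groups, with a clipped range."""
    out_features, in_features = w.shape
    g = w.reshape(-1, group_size)
    wmax = g.amax(dim=1, keepdim=True) * clip_ratio
    wmin = g.amin(dim=1, keepdim=True) * clip_ratio
    scale = (wmax - wmin).clamp(min=1e-5) / (2 ** n_bits - 1)
    zero = torch.round(-wmin / scale)
    q = torch.clamp(torch.round(g / scale) + zero, 0, 2 ** n_bits - 1)
    return ((q - zero) * scale).reshape(out_features, in_features)

def joint_scale_and_range_search(w, x,
                                 alphas=(0.0, 0.25, 0.5, 0.75, 1.0),
                                 clip_ratios=(1.0, 0.95, 0.9, 0.85)):
    """Pick the scale exponent and clip ratio together, not sequentially,
    by minimizing the reconstruction error of the layer output."""
    y_ref = x @ w.t()                                # full-precision reference output
    act_stat = x.abs().mean(dim=0).clamp(min=1e-5)   # per-input-channel activation magnitude
    best = {"alpha": None, "clip": None, "err": float("inf")}
    for alpha in alphas:
        s = act_stat.pow(alpha)                      # AWQ-style per-channel scale
        for clip in clip_ratios:
            w_q = quant_dequant(w * s, clip_ratio=clip) / s
            err = (y_ref - x @ w_q.t()).pow(2).mean().item()
            if err < best["err"]:
                best = {"alpha": alpha, "clip": clip, "err": err}
    return best

# Toy example: a linear layer [out_features, in_features] and a calibration batch
w = torch.randn(256, 512)
x = torch.randn(64, 512)
print(joint_scale_and_range_search(w, x))
```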
We benchmarked the cyankiwi AWQ 26.05 update against every major 4-bit method, using three Llama-3 instruct models as examples and measuring KL divergence (KLD) against the BF16 baseline on GPQA Diamond responses.
Result: cyankiwi posts the lowest KLD on all three models. Lower is better.
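The measurement itself is straightforward: score the same responses with the BF16 model and the quantized model and average the per-token KL divergence of their next-token distributions. Below is a hedged sketch of such a harness, not the exact benchmark code; it assumes both checkpoints load through Hugging Face transformers, and the real prompt formatting and aggregation details may differ.

```python
import torch
import torch.nn.functional as F
from transformers import AutoModelForCausalLM, AutoTokenizer

def mean_token_kld(ref_id, quant_id, texts, device="cuda"):
    """Average per-token KL(P_bf16 || Q_quant) over the given response texts."""
    tok = AutoTokenizer.from_pretrained(ref_id)
    ref = AutoModelForCausalLM.from_pretrained(
        ref_id, torch_dtype=torch.bfloat16).to(device).eval()
    qnt = AutoModelForCausalLM.from_pretrained(quant_id).to(device).eval()

    total, count = 0.0, 0
    with torch.no_grad():
        for text in texts:
            ids = tok(text, return_tensors="pt").input_ids.to(device)
            log_p = F.log_softmax(ref(ids).logits.float(), dim=-1)  # BF16 reference
            log_q = F.log_softmax(qnt(ids).logits.float(), dim=-1)  # 4-bit candidate
            # Pointwise p * (log p - log q), summed over the vocab = per-token KLD
            kld = F.kl_div(log_q, log_p, log_target=True, reduction="none").sum(-1)
            total += kld.sum().item()
            count += kld.numel()
    return total / count

# Example call (hypothetical list of GPQA Diamond response texts):
# print(mean_token_kld("meta-llama/Llama-3.2-3B-Instruct",
#                      "cyankiwi/Llama-3.2-3B-Instruct-AWQ-INT4", texts))
```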
Llama-3.2-3B-Instruct
| Quantized Model | Method | KLD |
|---|---|---|
| cyankiwi/Llama-3.2-3B-Instruct-AWQ-INT4 | cyankiwi AWQ INT4 | 0.00510 |
| unsloth/Llama-3.2-3B-Instruct-unsloth-bnb-4bit | unsloth BNB NF4 | 0.00785 |
| unsloth/Llama-3.2-3B-Instruct-bnb-4bit | BNB NF4 | 0.00896 |
| nvidia/Meta-Llama-3.2-3B-Instruct-ONNX-INT4 | AWQ INT4 | 0.01494 |
| casperhansen/llama-3.2-3b-instruct-awq | AWQ INT4 | 0.02437 |
Llama-3.1-8B-Instruct
| Quantized Model | Method | KLD |
|---|---|---|
| cyankiwi/Llama-3.1-8B-Instruct-AWQ-INT4 | cyankiwi AWQ INT4 | 0.00478 |
| RedHatAI/Meta-Llama-3.1-8B-Instruct-quantized.w4a16 | GPTQ INT4 | 0.00729 |
| unsloth/Meta-Llama-3.1-8B-Instruct-unsloth-bnb-4bit | unsloth BNB NF4 | 0.00769 |
| unsloth/Meta-Llama-3.1-8B-Instruct-bnb-4bit | BNB NF4 | 0.00835 |
| RedHatAI/Llama-3.1-8B-Instruct-NVFP4 | SmoothQuant NVFP4 | 0.01059 |
| nvidia/Llama-3.1-8B-Instruct-NVFP4 | NVFP4 | 0.01190 |
Llama-3.3-70B-Instruct
| Quantized Model | Method | KLD |
|---|---|---|
| cyankiwi/Llama-3.3-70B-Instruct-AWQ-INT4 | cyankiwi AWQ INT4 | 0.02826 |
| unsloth/Llama-3.3-70B-Instruct-unsloth-bnb-4bit | unsloth BNB NF4 | 0.04444 |
| casperhansen/llama-3.3-70b-instruct-awq | AWQ INT4 | 0.04859 |
| unsloth/Llama-3.3-70B-Instruct-bnb-4bit | BNB NF4 | 0.06879 |
| nvidia/Llama-3.3-70B-Instruct-NVFP4 | NVFP4 | 0.08307 |
| RedHatAI/Llama-3.3-70B-Instruct-quantized.w4a16 | GPTQ INT4 | 0.09272 |

demidev@reddit
Any chance of getting this update for the minimax m2.7 quant?
digitalfreshair@reddit
How are you testing KLD? I believe vLLM does not have a native benchmark
a_slay_nub@reddit
How does this compare to your old AWQ quants? Or are those the same as casperhansen's?
Also, what is your timeline for updating the models (particularly Gemma)?
_cpatonn@reddit (OP)
My initial quant models, from around Fall and Winter 2025, would be similar to casperhansen's; they only start to differ from Spring 2026. The most significant update is 26.05, which builds on the AWQ research lineage and is much better than those earlier quants.
It is fully updated for Gemma 31B and half-updated for Gemma 26B, as vLLM currently does not support asymmetric-quantized Gemma 26B.
MoodyPurples@reddit
This is really cool! Just curious, have you considered comparing against the Lorbus quants? It seems like your quants and those (mainly from the club-3090 repo) are the main recommendations for 3090 users currently.
_cpatonn@reddit (OP)
Thank you for sharing with me. I will include them in my next Qwen 3.6 benchmarks.
Embarrassed_Soup_279@reddit
have you looked into ParoQuant?
_cpatonn@reddit (OP)
Yes, I intended to make ParoQuant quants, but it seems that vLLM does not support ParoQuant at the moment.
dinerburgeryum@reddit
Wow, these are killer numbers, great work!
Icy-Roll-4044@reddit
Nice info