Introducing cyankiwi AWQ 4-bit Quantization — 26.05 update

Posted by _cpatonn | r/LocalLLaMA

In standard AWQ, per-channel scales and quantization ranges are chosen in separate steps: scales first, then the quantization parameters. But the two are not independent: the rounding error induced by one depends on the choice of the other, so optimizing them sequentially leaves quality on the table. Our cyankiwi AWQ 26.05 update instead fits scales and quantization ranges jointly against a reconstruction objective.
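To make the idea concrete, here is a rough sketch of what a joint search could look like: a nested grid over the AWQ scale exponent and the clipping ratio, scored together by reconstruction error on calibration activations. Function names, the search grids, and the INT4 round-trip helper are all illustrative, not our actual implementation:

```python
import torch

def quantize_int4(w, clip_ratio, group_size=128):
    """Asymmetric INT4 round-trip quantization, with the quantization
    range shrunk by clip_ratio. Assumes in_features % group_size == 0."""
    orig_shape = w.shape
    w = w.reshape(-1, group_size)
    w_max = w.amax(dim=1, keepdim=True) * clip_ratio
    w_min = w.amin(dim=1, keepdim=True) * clip_ratio
    scale = (w_max - w_min).clamp(min=1e-8) / 15   # 16 levels for 4 bits
    zero = (-w_min / scale).round()
    q = (w / scale + zero).round().clamp(0, 15)
    return ((q - zero) * scale).reshape(orig_shape)

def joint_search(w, x, alphas, clip_ratios, group_size=128):
    """Jointly grid-search the per-channel scale exponent (alpha) and the
    clip ratio, scoring each pair by output reconstruction error on
    calibration activations x of shape [n_tokens, in_features]."""
    act_scale = x.abs().mean(dim=0)               # per-channel activation magnitude
    y_ref = x @ w.t()                             # full-precision layer output
    best = (None, None, float("inf"))
    for alpha in alphas:
        s = act_scale.pow(alpha).clamp(min=1e-4)  # AWQ-style per-channel scale
        for clip in clip_ratios:
            w_q = quantize_int4(w * s, clip, group_size) / s
            err = (x @ w_q.t() - y_ref).pow(2).mean().item()
            if err < best[2]:
                best = (alpha, clip, err)
    return best                                    # (alpha, clip, error)
```

The point of the nested loop is exactly the dependency described above: the best clip ratio changes as the scale exponent changes, so neither can be fixed first without paying for it in reconstruction error.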

We benchmarked the cyankiwi AWQ 26.05 update against the other major 4-bit methods on three Llama-3 instruct models, measuring KL divergence against the BF16 baseline on GPQA Diamond responses.

Result: cyankiwi posts the lowest KLD on all three models. Lower is better.
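For reference, the KLD numbers below are mean per-token KL divergences between the BF16 model's next-token distribution and the quantized model's on the same text. A minimal sketch of that computation, assuming you already have teacher-forced logits from both models (the helper name is illustrative):

```python
import torch
import torch.nn.functional as F

def mean_token_kld(logits_ref, logits_quant):
    """Mean per-token KL(P_ref || P_quant), given logits of shape
    [n_tokens, vocab_size] from the BF16 and quantized models."""
    log_p = F.log_softmax(logits_ref.float(), dim=-1)
    log_q = F.log_softmax(logits_quant.float(), dim=-1)
    # KL(p || q) = sum_v p_v * (log p_v - log q_v), averaged over tokens
    return (log_p.exp() * (log_p - log_q)).sum(-1).mean().item()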

Llama-3.2-3B-Instruct

| Quantized Model | Method | KLD |
|---|---|---|
| cyankiwi/Llama-3.2-3B-Instruct-AWQ-INT4 | cyankiwi AWQ INT4 | 0.00510 |
| unsloth/Llama-3.2-3B-Instruct-unsloth-bnb-4bit | unsloth BNB NF4 | 0.00785 |
| unsloth/Llama-3.2-3B-Instruct-bnb-4bit | BNB NF4 | 0.00896 |
| nvidia/Meta-Llama-3.2-3B-Instruct-ONNX-INT4 | AWQ INT4 | 0.01494 |
| casperhansen/llama-3.2-3b-instruct-awq | AWQ INT4 | 0.02437 |

Llama-3.1-8B-Instruct

| Quantized Model | Method | KLD |
|---|---|---|
| cyankiwi/Llama-3.1-8B-Instruct-AWQ-INT4 | cyankiwi AWQ INT4 | 0.00478 |
| RedHatAI/Meta-Llama-3.1-8B-Instruct-quantized.w4a16 | GPTQ INT4 | 0.00729 |
| unsloth/Meta-Llama-3.1-8B-Instruct-unsloth-bnb-4bit | unsloth BNB NF4 | 0.00769 |
| unsloth/Meta-Llama-3.1-8B-Instruct-bnb-4bit | BNB NF4 | 0.00835 |
| RedHatAI/Llama-3.1-8B-Instruct-NVFP4 | SmoothQuant NVFP4 | 0.01059 |
| nvidia/Llama-3.1-8B-Instruct-NVFP4 | NVFP4 | 0.01190 |

Llama-3.3-70B-Instruct

| Quantized Model | Method | KLD |
|---|---|---|
| cyankiwi/Llama-3.3-70B-Instruct-AWQ-INT4 | cyankiwi AWQ INT4 | 0.02826 |
| unsloth/Llama-3.3-70B-Instruct-unsloth-bnb-4bit | unsloth BNB NF4 | 0.04444 |
| casperhansen/llama-3.3-70b-instruct-awq | AWQ INT4 | 0.04859 |
| unsloth/Llama-3.3-70B-Instruct-bnb-4bit | BNB NF4 | 0.06879 |
| nvidia/Llama-3.3-70B-Instruct-NVFP4 | NVFP4 | 0.08307 |
| RedHatAI/Llama-3.3-70B-Instruct-quantized.w4a16 | GPTQ INT4 | 0.09272 |
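If the cyankiwi checkpoints follow the usual AWQ safetensors layout (as AWQ repos on the Hub typically do), they should load directly in vLLM. A minimal, untested sketch using the 8B repo from the table above:

```python
from vllm import LLM, SamplingParams

# Load the AWQ INT4 checkpoint; vLLM picks up the quantization config
# from the repo, and the explicit flag makes the intent clear.
llm = LLM(model="cyankiwi/Llama-3.1-8B-Instruct-AWQ-INT4",
          quantization="awq")

out = llm.generate(["Explain AWQ quantization in one sentence."],
                   SamplingParams(max_tokens=64, temperature=0.0))
print(out[0].outputs[0].text)
```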