Introducing cyankiwi AWQ 4-bit Quantization — 26.05 update
Posted by _cpatonn@reddit | LocalLLaMA | 11 comments
In standard AWQ, per-channel scales and quantization ranges are picked in separate steps: scales first, then the quantization parameters. But they're not independent: the rounding error from one depends on the choice of the other, so optimizing them in sequence leaves quality on the table. Our cyankiwi AWQ 26.05 update jointly fits scales and quantization ranges against a reconstruction objective.
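For intuition, here is a minimal PyTorch sketch of what "jointly fits" means in practice. This is not the actual cyankiwi implementation: the alpha/clip grids, group size, and function names are illustrative assumptions. The point is simply that the scale exponent and the clip ratio are searched together against the layer's output reconstruction error, instead of committing to the scales first and searching the range afterwards.

```python
import torch

def quant_dequant(w, n_bits=4, group_size=128, clip_ratio=1.0):
    """Asymmetric round-to-nearest INT4 over groups, with a clipped range."""
    out_features, in_features = w.shape
    g = w.reshape(-1, group_size)
    wmax = g.amax(dim=1, keepdim=True) * clip_ratio
    wmin = g.amin(dim=1, keepdim=True) * clip_ratio
    scale = (wmax - wmin).clamp(min=1e-5) / (2 ** n_bits - 1)
    zero = torch.round(-wmin / scale)
    q = torch.clamp(torch.round(g / scale) + zero, 0, 2 ** n_bits - 1)
    return ((q - zero) * scale).reshape(out_features, in_features)

def joint_scale_and_range_search(w, x,
                                 alphas=(0.0, 0.25, 0.5, 0.75, 1.0),
                                 clip_ratios=(1.0, 0.95, 0.9, 0.85)):
    """Pick the scale exponent and clip ratio together, not sequentially,
    by minimizing the reconstruction error of the layer output."""
    y_ref = x @ w.t()                                # full-precision reference output
    act_stat = x.abs().mean(dim=0).clamp(min=1e-5)   # per-input-channel activation magnitude
    best = {"alpha": None, "clip": None, "err": float("inf")}
    for alpha in alphas:
        s = act_stat.pow(alpha)                      # AWQ-style per-channel scale
        for clip in clip_ratios:
            w_q = quant_dequant(w * s, clip_ratio=clip) / s
            err = (y_ref - x @ w_q.t()).pow(2).mean().item()
            if err < best["err"]:
                best = {"alpha": alpha, "clip": clip, "err": err}
    return best

# Toy example: a linear layer [out_features, in_features] and a calibration batch
w = torch.randn(256, 512)
x = torch.randn(64, 512)
print(joint_scale_and_range_search(w, x))
```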
We benchmarked the cyankiwi AWQ 26.05 update against every major 4-bit method, using three Llama-3 instruct models as examples and measuring KL divergence (KLD) against the BF16 baseline on GPQA Diamond responses.
Result: cyankiwi posts the lowest KLD on all three models. Lower is better.
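The measurement itself is straightforward: score the same responses with the BF16 model and the quantized model and average the per-token KL divergence of their next-token distributions. Below is a hedged sketch of such a harness, not the exact benchmark code; it assumes both checkpoints load through Hugging Face transformers, and the real prompt formatting and aggregation details may differ.

```python
import torch
import torch.nn.functional as F
from transformers import AutoModelForCausalLM, AutoTokenizer

def mean_token_kld(ref_id, quant_id, texts, device="cuda"):
    """Average per-token KL(P_bf16 || Q_quant) over the given response texts."""
    tok = AutoTokenizer.from_pretrained(ref_id)
    ref = AutoModelForCausalLM.from_pretrained(
        ref_id, torch_dtype=torch.bfloat16).to(device).eval()
    qnt = AutoModelForCausalLM.from_pretrained(quant_id).to(device).eval()

    total, count = 0.0, 0
    with torch.no_grad():
        for text in texts:
            ids = tok(text, return_tensors="pt").input_ids.to(device)
            log_p = F.log_softmax(ref(ids).logits.float(), dim=-1)  # BF16 reference
            log_q = F.log_softmax(qnt(ids).logits.float(), dim=-1)  # 4-bit candidate
            # Pointwise p * (log p - log q), summed over the vocab = per-token KLD
            kld = F.kl_div(log_q, log_p, log_target=True, reduction="none").sum(-1)
            total += kld.sum().item()
            count += kld.numel()
    return total / count

# Example call (hypothetical list of GPQA Diamond response texts):
# print(mean_token_kld("meta-llama/Llama-3.2-3B-Instruct",
#                      "cyankiwi/Llama-3.2-3B-Instruct-AWQ-INT4", texts))
```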
Llama-3.2-3B-Instruct
| Quantized Model | Method | KLD |
|---|---|---|
| cyankiwi/Llama-3.2-3B-Instruct-AWQ-INT4 | cyankiwi AWQ INT4 | 0.00510 |
| unsloth/Llama-3.2-3B-Instruct-unsloth-bnb-4bit | unsloth BNB NF4 | 0.00785 |
| unsloth/Llama-3.2-3B-Instruct-bnb-4bit | BNB NF4 | 0.00896 |
| nvidia/Meta-Llama-3.2-3B-Instruct-ONNX-INT4 | AWQ INT4 | 0.01494 |
| casperhansen/llama-3.2-3b-instruct-awq | AWQ INT4 | 0.02437 |
Llama-3.1-8B-Instruct
| Quantized Model | Method | KLD |
|---|---|---|
| cyankiwi/Llama-3.1-8B-Instruct-AWQ-INT4 | cyankiwi AWQ INT4 | 0.00478 |
| RedHatAI/Meta-Llama-3.1-8B-Instruct-quantized.w4a16 | GPTQ INT4 | 0.00729 |
| unsloth/Meta-Llama-3.1-8B-Instruct-unsloth-bnb-4bit | unsloth BNB NF4 | 0.00769 |
| unsloth/Meta-Llama-3.1-8B-Instruct-bnb-4bit | BNB NF4 | 0.00835 |
| RedHatAI/Llama-3.1-8B-Instruct-NVFP4 | SmoothQuant NVFP4 | 0.01059 |
| nvidia/Llama-3.1-8B-Instruct-NVFP4 | NVFP4 | 0.01190 |
Llama-3.3-70B-Instruct
| Quantized Model | Method | KLD |
|---|---|---|
| cyankiwi/Llama-3.3-70B-Instruct-AWQ-INT4 | cyankiwi AWQ INT4 | 0.02826 |
| unsloth/Llama-3.3-70B-Instruct-unsloth-bnb-4bit | unsloth BNB NF4 | 0.04444 |
| casperhansen/llama-3.3-70b-instruct-awq | AWQ INT4 | 0.04859 |
| unsloth/Llama-3.3-70B-Instruct-bnb-4bit | BNB NF4 | 0.06879 |
| nvidia/Llama-3.3-70B-Instruct-NVFP4 | NVFP4 | 0.08307 |
| RedHatAI/Llama-3.3-70B-Instruct-quantized.w4a16 | GPTQ INT4 | 0.09272 |

demidev@reddit
Any chance of getting this update for the minimax m2.7 quant?
digitalfreshair@reddit
How are you testing KLD? I believe vLLM does not have a native benchmark
a_slay_nub@reddit
How does this compare to your old AWQ quants? Or are those the same as casperhansen's?
Also, what is your timeline for updating the models (particularly Gemma)?
_cpatonn@reddit (OP)
My initial quant models, from around Fall and Winter 2025, would be similar to casperhansen's; they only start to differ from Spring 2026. The most significant update is 26.05, which builds on the AWQ research lineage and is much better than those earlier quants.
It is fully updated for Gemma 31B and half-updated for Gemma 26B, as vLLM currently does not support asymmetric-quantized Gemma 26B.
MoodyPurples@reddit
This is really cool! Just curious, have you considered comparing against the Lorbus quants? It seems like your quants and those (mainly from the club-3090 repo) are the main recommendations for 3090 users currently.
_cpatonn@reddit (OP)
Thank you for sharing with me. I will include them in my next Qwen 3.6 benchmarks.
Embarrassed_Soup_279@reddit
have you looked into ParoQuant?
_cpatonn@reddit (OP)
Yes, I intended to make ParoQuant quants, but it seems that vLLM does not support ParoQuant at the moment.
dinerburgeryum@reddit
Wow, these are killer numbers, great work!
Icy-Roll-4044@reddit
Nice info