Fine-tuning LLMs to 1.58bit: extreme quantization experiment

Posted by shing3232@reddit | LocalLLaMA

[https://github.com/huggingface/blog/blob/main/1\_58\_llm\_extreme\_quantization.md](https://github.com/huggingface/blog/blob/main/1_58_llm_extreme_quantization.md) [https://huggingface.co/blog/1\_58\_llm\_extreme\_quantization](https://huggingface.co/blog/1_58_llm_extreme_quantization)

showmeufos@reddit

I know a proper BitNet implementation has to happen at the training stage, but given the memory/compute savings, why isn’t every major AI lab using BitNet? Is something lost by training with BitNet? Do the models perform worse? One would assume that if you could achieve the same results using 10x fewer GPUs... everyone would do it?

az226@reddit

Turns out that the more tokens you train on, the wider the gap between ternary and 4-bit gets. If you only look at pretraining costs, you should follow Chinchilla scaling laws. But that’s not how it works in practice. In practice, inference costs matter a lot too. That’s why we’ve seen the surge in large teacher models and smaller student models. So it makes sense to train models past Chinchilla-optimal settings. When you train that far, the gap is even wider. So until we figure out how to close that gap, ternary models will remain in the smaller sizes and underperform.

Calcidiol@reddit

> Turns out that the more tokens you train on, the gap between ternary and 4bit widens.

What is the gap? I infer from your comment that you're suggesting there are empirical (or theoretical?) advantages to 4-bit coefficients vs. ternary ones. I assume you mean that complexity / time / throughput / performance / size is somehow in favor of 4-bit, though it's not clear why or which. At the micro level one would expect ternary operations (e.g. add, multiply, ...) to be universally simpler, more efficient, faster, smaller, etc. than 4-bit ones. So if 4-bit has empirical advantages, I assume it'd have to be some much higher-level problem, or maybe just an architectural / optimization failing of the training host (GPU, TPU, whatever) not having the most efficient possible support for ternary vector / matrix operations?
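To illustrate the "simpler at the micro level" point above: with weights restricted to -1, 0, and +1, a dot product needs no multiplications at all, only additions and subtractions. The following is a minimal sketch with a hypothetical `ternary_dot` helper, not how any particular BitNet kernel is actually written.

```python
def ternary_dot(weights, activations):
    """Dot product where every weight is assumed to be -1, 0, or +1."""
    total = 0
    for w, x in zip(weights, activations):
        if w == 1:
            total += x      # +1 weight: add the activation
        elif w == -1:
            total -= x      # -1 weight: subtract the activation
        # a 0 weight contributes nothing, so it is skipped
    return total


# Hypothetical example values, chosen only for illustration.
weights = [1, -1, 0, 1]
activations = [3, 5, 7, 4]
result = ternary_dot(weights, activations)
```

Whether this simplicity turns into real speed depends on the hardware actually having fast paths for it, which is the question raised in the comment.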

Thick-Protection-458@reddit

AFAIK the gap is both empirical and theoretical. The theoretical part is that a model with a total size of N bits can only store N bits of information (in the information theory sense). So while an fp16 model is severely undertrained, a bitnet model might represent (almost) the same math. But the more training (and so the more information) goes in, the bigger the model you need to have a chance of representing it. So past a certain undertraining threshold, low-bit models of the same architecture will be unable to improve further.
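The capacity argument can be put into rough numbers. A ternary weight carries at most log2(3) ≈ 1.58 bits, which is where the "1.58bit" name comes from. The sketch below compares the raw storage ceiling of one parameter count in three formats; the 7e9 parameter count and the `capacity_gb` helper are assumptions for illustration, and the raw bit count is only an upper bound, not a measure of what a trained model actually retains.

```python
import math

# Maximum information per weight, in bits, for each format.
BITS_PER_WEIGHT = {
    "ternary": math.log2(3),  # three states per weight, about 1.58 bits
    "int4": 4.0,
    "fp16": 16.0,
}


def capacity_gb(num_params, fmt):
    """Upper bound on raw storage, in gigabytes (10^9 bytes)."""
    return num_params * BITS_PER_WEIGHT[fmt] / 8 / 1e9


# Hypothetical 7-billion-parameter model, same architecture in every format.
for fmt in BITS_PER_WEIGHT:
    print(f"{fmt}: {capacity_gb(7e9, fmt):.2f} GB")
```

At the same parameter count, the ternary model has roughly a tenth of the fp16 model's bit budget, so it would be expected to saturate earlier as more training data goes in.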

Calcidiol@reddit

That makes sense. Sure, if you start out with a plan to make a 32B (or whatever) BF16 model, then you can severely undertrain it and get a functional but not very great model, or train it at some medium setting and still have the ability to quantize well and get a decent model, or train it a lot more and still not 'overflow' most of the model weights / information density, though it'll hurt (further) quantization more. But with bitnet, yeah, you'd have to grow the model layers or whatever sooner if you have a very well trained model that you want to keep training to make it significantly more knowledgeable over a few more versions based on the same model, since it'd reach 'full' more abruptly after reaching the 'medium' training level for a given fixed model size.

No_Afternoon_4260@reddit

That, and probably also the fact that current hardware has no optimization for ternary. NVIDIA just released FP4 cards, so maybe next gen 🤷

kif88@reddit

I'm trying to get my head around it. So it's a matter of "I have 5 GB of model and that's better than 2 GB of model, no matter how you arrange those 2 GB"?

Master-Meal-77@reddit

Ternary computing hasn't taken off yet, so we can't get the full advantage of ternary quantization. As it stands, running a real bitnet model (which is different from a BF16 model that has been ternarized post-training) still takes a lot of memory and compute power since GPUs were designed to work with F32, F16, BF16, FP8, etc. (this is my understanding)

shing3232@reddit (OP)

That's why weight packing exists.
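As a rough idea of what weight packing means here: since no common hardware type holds a single ternary value, several weights can be squeezed into one byte and unpacked when needed. The sketch below assumes a simple scheme of 2 bits per weight, four weights per byte; the function names and the bit layout are hypothetical and not taken from any specific library.

```python
def pack_ternary(weights):
    """Pack weights in {-1, 0, 1} four to a byte, using 2 bits each."""
    packed = bytearray()
    for i in range(0, len(weights), 4):
        byte = 0
        for j, w in enumerate(weights[i:i + 4]):
            # Shift -1, 0, 1 to the unsigned codes 0, 1, 2 before storing.
            byte |= (w + 1) << (2 * j)
        packed.append(byte)
    return bytes(packed)


def unpack_ternary(packed, count):
    """Recover the first `count` ternary weights from packed bytes."""
    weights = []
    for byte in packed:
        for j in range(4):
            # Extract one 2-bit code and shift it back to -1, 0, 1.
            weights.append(((byte >> (2 * j)) & 0b11) - 1)
    return weights[:count]  # drop padding slots in the last byte


# Hypothetical example values.
original = [1, -1, 0, 1, 0]
packed = pack_ternary(original)
restored = unpack_ternary(packed, len(original))
```

This uses 2 bits per weight rather than the theoretical 1.58, trading a little space for simple shift-and-mask unpacking. The unpacking step itself is extra work on hardware that has no native ternary support.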

rog-uk@reddit

Might I please DM you with a couple of questions directly related to this specific narrow topic? No worries if not.

Master-Meal-77@reddit

Sure, not a problem

rog-uk@reddit

I am poking at that exact problem. Not there yet though.

Calcidiol@reddit

IDK, but I assume machine ALU architecture efficiency is one possible problem. If you have an ALU / TPU that handles i4, i8, i16, i32, i64, f4, f8, f16, f32, f64, and can do vectors / tensors of N of those packed efficiently, but it doesn't deal with packed trits at the ISA level, then there's going to be some inefficiency in converting ternary to something the ALU does deal with, like i4, and back again, packing, unpacking, etc. It may be a minor, almost irrelevant thing, but it could be N% "overhead" that isn't the fault of ternary but is the fault of the GPU/CPU ISA lacking support for it.

pas_possible@reddit

Nice