What happened to 1.58bit LLMs?

Posted by Sloppyjoeman@reddit | LocalLLaMA | 61 comments

Last year I remember them being super hyped and largely theoretical. Since then, I understand there’s a growing body of evidence that larger sparse models outperform smaller denser models, which 1.58bit quantisation seems poised to drastically improve. I haven’t seen people going “oh, the 1.58bit quantisation was overhyped” - did I just miss it?

teachersecret@reddit

I played with it a bit. I actually got Microsoft’s 2b bitnet 1.58b model running at something silly like 11k tokens/second without cuda through some creative use of silicon. I think there’s insane potential in 1.58b models but nobody made any larger ones and it’s a pain in the ass to turn a big existing model ternary (Microsoft trained directly ternary with 4 trillion tokens which mitigated a bit). Microsoft did say that their process scales to bigger sizes. I’d love to go further but until someone puts out a larger model or I get a wild hair and train or convert one, it’s gonna stay an experiment.
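
For readers unfamiliar with what "ternary" means here, the following is a minimal sketch of absmean-style rounding to {-1, 0, +1}, the scheme described in the BitNet b1.58 paper. The function names and sample weights are made up for illustration, and it uses one scale for a flat list rather than a full weight matrix. It is not the commenter's code or Microsoft's implementation.

```python
def absmean_ternarize(weights, eps=1e-8):
    """Quantize a flat list of floats to {-1, 0, +1} with one shared scale."""
    # The scale is the mean absolute value of the weights (the "absmean").
    scale = sum(abs(w) for w in weights) / len(weights)
    # Divide by the scale, round to the nearest integer, then clip to [-1, 1].
    ternary = [max(-1, min(1, round(w / (scale + eps)))) for w in weights]
    return ternary, scale


def dequantize(ternary, scale):
    """Approximate reconstruction of the original weights."""
    return [t * scale for t in ternary]


# Hypothetical sample weights, chosen only to show all three output values.
weights = [0.9, -0.05, 0.4, -1.2, 0.02, -0.6]
ternary, scale = absmean_ternarize(weights)
print(ternary, round(scale, 4))
print(dequantize(ternary, scale))
```

Applying this rounding to an already trained fp16 model throws away most of the information in each weight, which is why the comment above distinguishes training directly in ternary from converting an existing model.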

Reddactor@reddit

Do you have a writeup on that? Sounds super interesting!

TomLucidor@reddit

Hopefully, and Tequila 1.58b quantization is useful as well to convert your favorite model to something that runs fast in magical ways.

teachersecret@reddit

Never heard of it. I did share my nano gpt-2 experiments but I haven't checked out that Tequila paper. I'll read it over today.

Educational_Win_2982@reddit

There's also the released "HESTIA: A Hessian-Guided Differentiable Quantization-Aware Training Framework for Extremely Low-Bit LLMs" paper, another low-bit quantization method, which seems to be SOTA right now.

teachersecret@reddit

Appreciate it. I might take a peek.

Educational_Win_2982@reddit

Hope you enjoy! It's a bit hidden, but they have a GitHub repo as well: [https://github.com/hestia2026/Hestia](https://github.com/hestia2026/Hestia)

TomLucidor@reddit

Compare this with ternary computation then, would it be faster?

Educational_Win_2982@reddit

It's still ternary computation, it's just a quantization method to turn existing fp16 models into ternary models, allowing you the benefit of bitnet.cpp while keeping 95% of the output quality.

teachersecret@reddit

I wrote up my nanogpt speedrun efforts here: [https://github.com/Deveraux-Parker/nanoGPT\_1GPU\_SPEEDRUN](https://github.com/Deveraux-Parker/nanoGPT_1GPU_SPEEDRUN) That's not ternary 1.58b, but my 1.58b efforts used a similar training stack just geared for 1.58b training directly using info from the bitnet paper from Microsoft, so if you want an idea of how I am experimenting in the training space that's a good place to start.

TomLucidor@reddit

\*hums Tequila\*

teachersecret@reddit

Have ya used it yet? Got a bigger model distilled down or some code to do so?

TomLucidor@reddit

$ is the bottleneck, someone else would have to quant it for everyone to jump in to use. GPT-OSS or GLM-4.5-Air is a good testbed for these things.

GeoMorax@reddit

A recent research paper, "Sherry: Hardware-Efficient 1.25-Bit Ternary Quantization via Fine-grained Sparsification" (https://arxiv.org/pdf/2601.07892), reduces ternary quantization to 1.25 bits (with 3:4 sparsity), which is more regular and does not have bit waste. This seems impressive.
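
The bit counts in this thread can be checked with a few lines of arithmetic. This is a rough sketch. The reading of "3:4 sparsity" as exactly one zero per group of four weights is an assumption made here because it reproduces the 1.25-bit figure, not something taken from the paper.

```python
import math

# Information content of one ternary weight with three equally likely states.
ideal_bits = math.log2(3)  # about 1.585, the "1.58" in the thread title

# Naive storage: one 2-bit code per weight, with one of the four codes unused.
naive_bits = 2.0

# Base-3 packing: five ternary digits fit in one byte because 3**5 = 243 <= 256.
packed_bits = 8 / 5

# Assumed reading of 3:4 sparsity: each group of four weights has exactly one
# zero, and the other three weights are +1 or -1.
patterns_per_group = 4 * 2**3                    # zero position times three signs
sparse_bits = math.log2(patterns_per_group) / 4  # bits per weight

print(f"ideal ternary: {ideal_bits:.3f} bits/weight")
print(f"2-bit codes:   {naive_bits:.3f} bits/weight")
print(f"5 per byte:    {packed_bits:.3f} bits/weight")
print(f"3:4 groups:    {sparse_bits:.3f} bits/weight")
```

Under that assumption a group has 32 possible patterns, which fills 5 bits exactly. That would explain the "no bit waste" remark, since plain ternary never lines up with a whole number of bits.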

Hot-Employ-3399@reddit

Joined retnet in blasting off current models.

SlowFail2433@reddit

You are in luck because there was a big breakthrough recently https://arxiv.org/abs/2511.21910

TomLucidor@reddit

Is it software or hardware advancements? Please make Pentium and Duo CPUs useful for edge computing again lol

SlowFail2433@reddit

Hardware sorry

TomLucidor@reddit

Can we just hack x64 and ARM + GPU to play nice already? Can't just wait for AirLLM and Tequila to be forgotten

SlowFail2433@reddit

There is a type of chip that can be reconfigured in software called an FPGA

TomLucidor@reddit

Consumer hardware hacking would be better than just asking for FPGA. Legacy hardware revitalization is an ideal for reducing e-waste yadayada

SlowFail2433@reddit

Well there are old, legacy FPGAs. It is not possible to change the number format of the native matmul of a CPU or a GPU. Only FPGAs or certain ASICs can do that.

TomLucidor@reddit

Can't we rely on bit manipulation hacks to get some of the work done for GPUs?
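
As a rough illustration of what a software-only approach looks like, the sketch below packs ternary weights five to a byte and computes a dot product with additions and subtractions only. All function names are hypothetical, and this is plain Python for clarity. It does not show how bitnet.cpp or any GPU kernel works, and it says nothing about whether such tricks are fast on real hardware.

```python
def pack_trits(trits):
    """Pack ternary weights (-1, 0, +1) five at a time into bytes, base 3."""
    padded = list(trits) + [0] * (-len(trits) % 5)
    out = bytearray()
    for i in range(0, len(padded), 5):
        value = 0
        for t in reversed(padded[i:i + 5]):
            value = value * 3 + (t + 1)  # map -1, 0, +1 to digits 0, 1, 2
        out.append(value)                # at most 242, so it fits in one byte
    return bytes(out)


def unpack_trits(packed, count):
    """Inverse of pack_trits; count is the original number of weights."""
    trits = []
    for byte in packed:
        for _ in range(5):
            trits.append(byte % 3 - 1)
            byte //= 3
    return trits[:count]


def ternary_dot(trits, activations):
    """Dot product that never multiplies: add, subtract, or skip."""
    total = 0.0
    for t, a in zip(trits, activations):
        if t == 1:
            total += a
        elif t == -1:
            total -= a
    return total


# Hypothetical example data.
weights = [1, -1, 0, 0, 1, -1, 1]
activations = [0.5, 2.0, 3.0, -1.0, 4.0, 1.0, 0.25]
packed = pack_trits(weights)
restored = unpack_trits(packed, len(weights))
print(len(packed), restored, ternary_dot(restored, activations))
```

The storage saving is real on any hardware. Whether the add-and-subtract loop beats a native fp16 or int8 matmul unit is the separate question being argued in this part of the thread.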

SlowFail2433@reddit

No it is 100% completely impossible to change native matmul number format and it will never be possible. It would require moving actual SRAM locations. What can be done is FPGAs which can be re-routed after manufacturing using software, or certain other ASICs that have that ability. It will never be possible on existing CPUs or GPUs.

TomLucidor@reddit

We don't need 100% parity to FPGA/ASIC, even some of that acceleration for GPUs would be good enough.

SlowFail2433@reddit

Issue is that energy is the main cost of AI, not hardware, both in dollar terms and to the environment. In their paper the ASIC was 2,000%+ more energy efficient than CPU.

TomLucidor@reddit

So "train on ASIC, run on GPU" would make sense, you still need to be consumer-friendly.

SlowFail2433@reddit

Hmm so I did some research and for low batch size, heavily memory bandwidth constrained, older or slower hardware, bitnet could be good yeah.
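
The memory-bandwidth point can be made concrete with a back-of-the-envelope estimate. This sketch assumes single-stream decoding where every weight is read once per generated token, and it ignores activations, the KV cache, and compute time. The 50 GB/s bandwidth figure is a hypothetical value for an older machine, not a measurement.

```python
def model_bytes(params, bits_per_weight):
    """Size of the weights alone, in bytes."""
    return params * bits_per_weight / 8


def max_tokens_per_second(params, bits_per_weight, bandwidth_bytes_per_s):
    """Upper bound if every weight is read once per generated token."""
    return bandwidth_bytes_per_s / model_bytes(params, bits_per_weight)


PARAMS = 2_000_000_000  # 2B parameters, the model size mentioned in this thread
BANDWIDTH = 50e9        # hypothetical 50 GB/s memory bandwidth

for label, bits in [("fp16", 16), ("4-bit", 4), ("ternary, 5 per byte", 1.6)]:
    gigabytes = model_bytes(PARAMS, bits) / 1e9
    rate = max_tokens_per_second(PARAMS, bits, BANDWIDTH)
    print(f"{label}: {gigabytes:.1f} GB, at most {rate:.1f} tokens/s")
```

The bound scales inversely with bits per weight, so the benefit shows up mainly where bandwidth, not arithmetic, is the limit.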

MitsotakiShogun@reddit

The biggest innovation of that line of research was also its downfall: hardware. I remember in one of the papers I read, the authors actually implemented their idea and built a PoC circuit or something to validate it, and proved the benefits (convincingly enough for me anyway). But, simply put, Nvidia / AMD / Intel / Apple and their Chinese counterparts aren't going to implement that hardware before it becomes really prevalent... which is not going to happen without hardware first.

DHasselhoff77@reddit

Also it might be patented by Microsoft.

TomLucidor@reddit

FOSS circumvention + GPL-locking. They are probably not fast enough to resist.

DHasselhoff77@reddit

I don't understand how that would help. It's standard practice to patent everything coming out of industrial labs _before_ a preprint hits arXiv. Any company designing accelerator hardware must know this. Any liberally licensed open sourced code would be irrelevant as it wouldn't count as prior work.

kidflashonnikes@reddit

This is absolutely false. It has nothing to do with hardware at all. I work for one of the largest privately funded ai labs on the planet. Quantization reduces accuracy by shrinking down the range of precision. Going down to 1 bit - you’re left with someone like this guy - an IQ of 10. Anything less than 4bit is just not there yet - you lose too much intelligence. For something like whisper - it’s okay (voice to text and vice versa). OpenAI is almost done wrapping up Garlic (5.3). My friends who work there are focusing on voice models for the company. A lot is going on

MitsotakiShogun@reddit

Saying I have an IQ of 10, and then likening bitnet to quantization... You should give your employer a refund.

kidflashonnikes@reddit

Looks like the news came out about my work. Call altman then - MergeLabs is now public so I can talk about it - since he is my boss. Low IQ classic speech.

MitsotakiShogun@reddit

Now that it's public and I can see the size of your lab... Your entire funding is less than our yearly budget dedicated to AI. Congrats.

PastPalpitationCry@reddit

Except bitnet isn't standard post-training quantization. It requires more tuning on the affected layers.

DanielKramer_@reddit

as the cvo of one of the largest small ai labs (kramer intelligence) i can assure you this is not the reason bitnet flopped

Confusion_Senior@reddit

china may build it to bypass current architectures

gnaarw@reddit

Or they already have it and are thus denying import of h200s 🤔

phhusson@reddit

I'm not sure Nvidia won't implement it. Remember how they went from fp32 flops to fp16 to fp8 to fp4?

SlowFail2433@reddit

The idea itself has been proven to work yeah but what isn’t proven is all the different types of scaling. FP4 might be “low enough.”

Slow-Gur6419@reddit

BitNet was definitely overhyped but the research is still ongoing - the main issue is that most hardware doesn't really benefit from 1.58bit weights since you still need proper GPU support for the weird quantization schemes

Sloppyjoeman@reddit (OP)

Okaaay, this makes a lot of sense, thanks. So at the moment we’re able to prove lack of loss of ability, but not so much the performance improvements leaving 4 bit quantisation the current king?

az226@reddit

Bitnet diverged in capability the further you went past Chinchilla. Plus Nvidia made NVFP4 so you get essentially half precision performance at 4x speed up and memory compression. So it’s possible that with bitnet bespoke hardware there is a new Pareto optimal constellation but for now they are mostly academic.

TomLucidor@reddit

Wait, what about ternary quantization? Could they yield something more functional?

az226@reddit

Even Unsloth will go down to like 1.9 bpw but do it dynamically. So it’s not purely ternary. So bespoke hardware couldn’t process it. I’m sure you could, but quality suffers a lot.

TomLucidor@reddit

"BitNet" is a good brand name for ternary, pure ternary is probably there to speed up compute, and Tequila vibes promising... Maybe if we go 2-3x the size to offset the 4-5x speed boost with matrix operations.

az226@reddit

I mean it’s possible ASICs for Bitnet can lead to crazy high performance per dollar. We just haven’t seen any big splash there.

TomLucidor@reddit

Cus nobody wants to bother with hardware, it's like asking for crypto mining chips. I would rather see people hack GPUs in efficient use of old cards before begging for ASICs/FPGAs.

SlowFail2433@reddit

Is an issue because these days we go way way past chinchilla for cheaper inference

az226@reddit

Correct. Today we consider the total compute budget. And inference compute is a bigger part of it.
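
To put "way past Chinchilla" in numbers, here is a small sketch using the common rule of thumb of roughly 20 training tokens per parameter for compute-optimal training. That ratio is an approximation assumed for illustration. The 2B parameter and 4 trillion token figures are the ones quoted earlier in this thread for Microsoft's model.

```python
CHINCHILLA_TOKENS_PER_PARAM = 20  # rule-of-thumb ratio, an assumption


def overtraining_factor(params, tokens, ratio=CHINCHILLA_TOKENS_PER_PARAM):
    """How many times more tokens were used than the rule of thumb suggests."""
    return tokens / (params * ratio)


params = 2e9    # 2B parameters
tokens = 4e12   # 4 trillion training tokens
factor = overtraining_factor(params, tokens)
print(f"{tokens / params:.0f} tokens per parameter, {factor:.0f}x the rule of thumb")
```

By this estimate the released model already sits far beyond the compute-optimal point, which is the regime where the comments above say ternary models start to fall behind.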

TomLucidor@reddit

Not enough indie hackers to get revolutionary.

Firm-Fix-5946@reddit

> So at the moment we’re able to prove lack of loss of ability

Nobody said that

Sloppyjoeman@reddit (OP)

Oh, I thought that was what the papers have been showing, am I mistaken? What’s the point of 1.58bit LLMs then?

Revolutionalredstone@reddit

Still around but small models keep coming out that are so much smarter that I think we're less thinking about scrunching and more just searching at the moment. There was some really impressive 4bit int stuff with oaioss models which still blow my mind (if only we could get a 20B nanbeige model which loaded as fast and ran like oss20b 😱) Bitnet will soon be back and in greater numbers 😉

TomLucidor@reddit

"If not me, who? If not now, when?"

ortegaalfredo@reddit

The problem is that it is a technology that requires huge investments:

1. Small/Medium models already fit on existing GPUs/RAM
2. Big models that would benefit from training at 1.58 bits require millions in investment

Most big companies (Nvidia/OpenAI/Google) aren't interested in technology that makes them less competitive. Huge amount of RAM is their moat. The only company that could use this is Microsoft but they already have a deal with OpenAI and I guess they pressured them into not advancing this. Innovation on this side will come from China.

TomLucidor@reddit

China also plays the same game, so ternary computing should be indie-first, not corporate-first.

PieArtistic9707@reddit

Microsoft itself is a major investor.

Embarrassed_Sun_7807@reddit

Check out unsloth's dynamic quants 2.0
