Deepseek 700b Bitnet

Posted by silenceimpaired@reddit | LocalLLaMA | View on Reddit | 19 comments

Deepseek’s team has demonstrated the age-old adage that necessity is the mother of invention, and we know they have a great need for compute when compared against X, OpenAI, and Google. This led them to develop V3, a 671B-parameter MoE with 37B activated parameters. MoE is here to stay, at least for the interim, but the exercise untried to this point is MoE Bitnet at large scale. Bitnet underperforms full precision at the same parameter count, so future releases will likely adopt higher parameter counts. What do you think the chances are that Deepseek releases a MoE Bitnet, what will be the maximum parameter count, and what will the expert sizes be? Do you think it will have a foundation expert that always runs in addition to the other experts?

19 Comments

[-]

gentlecucumber@reddit

The answer to the first question informs all the rest. No, I don't think that Deepseek will do a scaled-up Bitnet model. The end result may be a smaller model, but Bitnet models are more computationally expensive to train, which is kind of antithetical to the Deepseek approach thus far. Their major claim to fame was developing a model that was competitive with o1 at the lowest possible cost, so I don't think they'll do a 180 and inflate their costs just to publish a lower-precision model.

[-]

silenceimpaired@reddit (OP)

I am not very knowledgeable, and you seem very confident. It was my understanding that BitNet’s design maintains the same BF16 master-weight memory footprint, reduces activation memory via low-bit quantization, and (aside from minor quantize/dequantize overhead) matches or improves on overall training resource usage compared to BF16-only training. At least, that is based on my reading of these articles:
https://huggingface.co/microsoft/bitnet-b1.58-2B-4T-bf16?utm_source=chatgpt.com "microsoft/bitnet-b1.58-2B-4T-bf16 - Hugging Face"
https://arxiv.org/html/2504.12285v1 "BitNet b1.58 2B4T Technical Report"
https://huggingface.co/docs/transformers/main/en/model_doc/bitnet?utm_source=chatgpt.com "BitNet - Hugging Face"
https://arxiv.org/abs/2310.11453?utm_source=chatgpt.com "BitNet: Scaling 1-bit Transformers for Large Language Models"
https://huggingface.co/blog/1_58_llm_extreme_quantization?utm_source=chatgpt.com "Fine-tuning LLMs to 1.58bit: extreme quantization made easy"
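To make the "master weights plus low-bit quantization" idea concrete, here is a minimal sketch of absmean ternary weight quantization as I understand it from the BitNet b1.58 description: divide by the mean absolute weight, round, and clip to {-1, 0, +1}. The function names and the toy weights are made up for illustration, and this is pure Python rather than anything a real training stack would use. Note that Python's round() rounds halves to even, which may differ from a given implementation.

```python
def absmean_ternary(weights, eps=1e-5):
    """Quantize float weights to {-1, 0, +1} with a single shared scale."""
    # Scale is the mean absolute value of the tensor (plus eps to avoid /0).
    scale = sum(abs(w) for w in weights) / len(weights) + eps
    # Round to the nearest integer, then clip into the ternary range.
    quantized = [max(-1, min(1, round(w / scale))) for w in weights]
    return quantized, scale


def dequantize(quantized, scale):
    """Approximate reconstruction used in the forward pass."""
    return [q * scale for q in quantized]


# Hypothetical high-precision "master" weights kept by the optimizer.
master = [0.42, -0.07, -0.91, 0.03, 0.55, -0.38]
q, s = absmean_ternary(master)
print(q)                 # ternary weights used for the matmul
print(dequantize(q, s))  # what the layer effectively computes with
```

The point for the training-cost discussion is that the master copy stays in higher precision and gets re-quantized every step, so weight memory during training does not shrink; only the forward-pass representation does.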

[-]

Puzzled-Truck9932@reddit

This is what I thought too.

[-]

MisterARRR@reddit

Has anyone even made a fully trained and usable bitnet model yet, beyond the proof of concept in the research paper? Surely if it was possible to make a competitive model in the smaller size ranges (1-7b), someone would have done it by now.

[-]

silenceimpaired@reddit (OP)

I think Microsoft just released one, and another post pointed to the possibility of converting a full-precision model to Bitnet.

[-]

Double_Cause4609@reddit

Keep in mind that enterprise requirements are different from consumer requirements. The thing about enterprise inference is that they're running tons of requests in parallel, which has different characteristics from single-user inference.
If you want to play a fun game, take any consumer CPU, throw about 96GB of RAM in it, and see how many tokens per second you can get on a 7-9B model if you do 256 parallel requests in vLLM. Something you'll notice is that it goes crazy high. Like, 200 T/s. The reason this works is that the hidden state is so much smaller than the weights that you can amortize the weight-loading memory cost over a ton of requests, and this works because modern processors are more memory bound than compute bound.
Now, the thing is, if you suddenly do, say, a Bitnet quantization, does the total T/s increase? Maybe. Maybe a bit. The increase from going to 4-bit already isn't really that big (I think it's only like a 10% increase, maybe, from what I've seen of things like GemLite). But the quality difference (especially in technical domains) when going to 4-bit is huge.
And the other thing is that native training (i.e., QAT, which Bitnet effectively is) of a model at a given bit width isn't free. Things like Bitnet add training time (something like 30%, even), so for the same training cost, you could just overtrain a 10% smaller model, infer at the same speed, and possibly have higher evaluation quality.
Sadly, Bitnet doesn't make sense for the big corporations to train. The math just doesn't work out. It's only super great for single-user inference, and companies generally don't plan around consumers.
I think what's more likely is that we might see community-driven efforts to train large Bitnet models with distributed compute. The incentives make way more sense: everybody wants the biggest and best model they can fit on their hardware, but no one person can train on the same hardware they intend to do inference on.
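The amortization argument above can be sketched with a toy roofline model. This is only a back-of-the-envelope estimate under loud assumptions: a dense model, every decode step streams all weights once for the whole batch, about 2 FLOPs per parameter per sequence, no KV-cache traffic, and made-up hardware numbers (80 GB/s, 2 TFLOP/s). The function name is hypothetical.

```python
def decode_tokens_per_second(params, bytes_per_weight, batch, mem_bw, flops):
    """Upper bound on aggregate decode throughput for one dense model."""
    # Memory side: weights are read once per step and shared by the batch.
    weight_time = params * bytes_per_weight / mem_bw
    # Compute side: ~2 FLOPs per parameter for each sequence in the batch.
    compute_time = 2 * params * batch / flops
    # The slower of the two sets the step time.
    step_time = max(weight_time, compute_time)
    return batch / step_time


PARAMS = 8e9        # assumed 8B dense model
MEM_BW = 80e9       # assumed 80 GB/s memory bandwidth
FLOPS = 2e12        # assumed 2 TFLOP/s of usable compute

for bits in (16, 4, 1.58):
    for batch in (1, 256):
        tps = decode_tokens_per_second(PARAMS, bits / 8, batch, MEM_BW, FLOPS)
        print(f"{bits:>5} bits, batch {batch:>3}: {tps:8.1f} tok/s")
```

In this toy model, lower bit widths help a lot at batch 1 (memory bound) and not at all at batch 256 (compute bound), which is the shape of the argument above. Real kernels, KV-cache reads, and low-bit compute paths will shift the actual numbers.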

[-]

erfan_mehraban@reddit

Those numbers are highly unlikely to be true.

[-]

Double_Cause4609@reddit

Which numbers? I pulled them off the top of my head, but they do match my personal experience. If you're talking about the inference numbers on CPU, I based them on my own system's performance. If we're talking about the performance of LLMs at scale with high concurrency, I have less direct experience with that (deploying quantized models, as most people want quality out of the cloud), but you can google the numbers from GemLite; that's where I took them from. So... yes, you only get like a 10% performance increase from int4 GEMMs when doing inference at scale. You get a much bigger increase as an end user, but it looks different in an enterprise. As for the training numbers, it was a bit of a guess, but QAT is known to add about 30% training time (just look at the TorchAO documents; that's where I got that figure from). If you combine those existing numbers and add in LLM scaling laws (and some findings from "Scaling Laws for Precision"), you find that QAT can be framed as adjusting the "effective parameter count". What that means is you could train at FP16, or you could train, say, a 20% larger model at Int8 (if you quantize all linear layers), and you get about the same performance, so you could say that an Int8 model has "80% of the effective parameter count" of an FP16 LLM (or even an FP8 LLM; sorry, I don't remember which paper it was that noted FP8 performs better than Int8 in pre-training).
Factoring in all of those, my claim is this: you can train a 10% smaller model for 30% longer than the Bitnet model (because training a non-Bitnet, or non-QAT, model is in general faster), and smaller models function like larger models if you train them for longer (Llama 3 outperforms Llama 2 precisely because it was trained longer). Taking all of that into account, the hypothetical non-Bitnet formulation I described gives all of the benefits of the Bitnet model when deploying in the cloud, but is easier to train or has higher quality (whichever one you want to take). So... which of my numbers are wrong? Are you saying the TorchAO dev team doesn't know how to do their job? Are you saying that GemLite isn't a valid library? Are you saying that scaling laws don't exist? Are you saying the authors of "Scaling Laws for Precision" are incorrect? Are you saying that the performance of my computer is invalid? Which of my numbers are wrong?
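For anyone who wants to check the budget arithmetic in that argument, here is a rough sketch. It assumes the common rule of thumb that training cost is about 6 x parameters x tokens FLOPs, treats the 30% QAT overhead quoted above as a flat multiplier on cost, and uses an arbitrary 7B / 2T-token baseline. None of these numbers come from a measured run.

```python
def training_flops(params, tokens, overhead=1.0):
    """Approximate training cost: 6 * N * D, times any method overhead."""
    return 6 * params * tokens * overhead


def tokens_for_budget(budget_flops, params, overhead=1.0):
    """How many tokens a model of this size can see within a FLOP budget."""
    return budget_flops / (6 * params * overhead)


N = 7e9            # assumed baseline parameter count
D = 2e12           # assumed baseline token count
QAT_OVERHEAD = 1.3 # the ~30% figure quoted in the comment above

# Budget spent by a Bitnet/QAT run of size N on D tokens.
budget = training_flops(N, D, overhead=QAT_OVERHEAD)

# Same budget spent on a 10% smaller model trained without QAT overhead.
alt_tokens = tokens_for_budget(budget, 0.9 * N)
print(f"token multiplier for the smaller model: {alt_tokens / D:.3f}")
```

Under these assumptions the 10% smaller model gets roughly 1.44x the tokens for the same compute. Whether that beats the quantized model on quality depends on the scaling-law fit, which this sketch does not attempt to model.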

[-]

kaeptnphlop@reddit

The only place I see BitNet models make sense from a business perspective is on-device, offline applications. But that is very niche in the scheme of things. And there we won’t see huge models as they will probably be more tailored for small file / memory footprint to run efficiently. Now what those applications may be is a good question, but I’ve been surprised by interesting use-cases before.

[-]

dividebynano@reddit

Approximately **68.6%** of global internet traffic originates from mobile phones. The best UX for mobile for many people is just to talk to it but mobile phones often suffer from poor connectivity, high data charges and latency issues. Perhaps the shared supercomputers we rely upon now are the niche.

[-]

ThisWillPass@reddit

It will when serving costs an order of magnitude more or less than training.

[-]

Lissanro@reddit

The issue is that BitNet, even though it looked promising at first, does not provide much advantage in practice. It is 1.58-bit, and not everything can be made ternary, so in practice it will most likely be closer to 2-bit. It requires more compute to train and more parameters to store the same knowledge. So can it really offer a model that is smaller than Q4 with similar knowledge and quality? Maybe, but only by a little bit; training is very expensive, and it would be too risky to try for little to no gain. Given that DeepSeek is limited in compute resources, it is highly unlikely they will release a huge BitNet model. Even if they consider releasing BitNet models at some point in the future, they will most likely start with smaller models first.
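A quick sketch of the storage arithmetic behind the 1.58-bit versus 2-bit versus Q4 comparison. The 1.58 figure is log2(3), the information content of one ternary weight. This counts weights only and ignores scales, embeddings, and any layers kept at higher precision, so real files would be larger; the function name is made up for illustration.

```python
import math


def weight_gigabytes(params, bits_per_weight):
    """Raw weight storage in GB (10^9 bytes), ignoring scales and metadata."""
    return params * bits_per_weight / 8 / 1e9


PARAMS = 671e9  # DeepSeek V3 total parameter count from the post

formats = [
    ("ternary, ideal packing", math.log2(3)),  # ~1.585 bits per weight
    ("ternary, 2-bit packing", 2.0),
    ("4-bit", 4.0),
    ("8-bit", 8.0),
]

for label, bits in formats:
    print(f"{label:<24} {bits:5.3f} bpw  {weight_gigabytes(PARAMS, bits):8.1f} GB")
```

At 2 bits per weight, a 671B model is still on the order of 170 GB of weights, about half of a plain 4-bit version, before accounting for any extra parameters needed to match quality.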

[-]

silenceimpaired@reddit (OP)

It is unlikely their large model will be Bitnet. Still, I hope they distill their next large model down to 4B and 70B Bitnet models trained from scratch, as opposed to fine-tunes of existing models.

[-]

PinkysBrein@reddit

I think the Hadamard activation quantization from the latest Bitnet paper has more chance of being used. Deepseek embraced FP8, FP4 is the likely next step. FP4 weights and FP4 Hadamard domain activations/gradients for FP4 matmul in forward/backward, that would be a pretty huge savings. More suited to NVIDIA's hardware than binary/ternary weights.
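For readers unfamiliar with why a Hadamard transform helps low-bit activation quantization, here is a small illustrative sketch. It is not taken from the paper; it only shows the general effect that an orthonormal Hadamard rotation spreads a single large outlier across all channels, which lowers the peak-to-RMS ratio that a 4-bit quantizer has to cover. The activation values are invented.

```python
import math


def hadamard_transform(x):
    """Orthonormal fast Walsh-Hadamard transform; len(x) must be a power of two."""
    n = len(x)
    if n == 0 or n & (n - 1):
        raise ValueError("length must be a power of two")
    out = list(x)
    h = 1
    while h < n:
        # Butterfly step: combine pairs that are h apart.
        for start in range(0, n, 2 * h):
            for i in range(start, start + h):
                a, b = out[i], out[i + h]
                out[i], out[i + h] = a + b, a - b
        h *= 2
    # Divide by sqrt(n) so the transform preserves the vector norm.
    return [v / math.sqrt(n) for v in out]


def peak_to_rms(x):
    """Ratio of the largest magnitude to the root-mean-square value."""
    rms = math.sqrt(sum(v * v for v in x) / len(x))
    return max(abs(v) for v in x) / rms


# Made-up activations with one large outlier channel.
acts = [0.1, -0.2, 0.05, 12.0, 0.15, -0.1, 0.2, -0.05]
rotated = hadamard_transform(acts)

print(f"peak/RMS before: {peak_to_rms(acts):.2f}")
print(f"peak/RMS after:  {peak_to_rms(rotated):.2f}")
```

Because the normalized transform is its own inverse, applying hadamard_transform twice recovers the original vector, so the rotation itself loses no information; only the quantization step does.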

[-]

silenceimpaired@reddit (OP)

Interesting idea. I’ll have to look up that paper.

[-]

DeepWisdomGuy@reddit

What do you know, sir...

[-]

silenceimpaired@reddit (OP)

Very little … if the essay comments on this post are to be believed. ;)

[-]

Healthy-Nebula-3603@reddit

I've been hearing about Bitnet for a year, and no one has trained such a model at a bigger scale. I think Bitnet is already dead.

[-]

aurelivm@reddit

DeepSeek V3 derivatives already have experts that are always active. It was apparently a very difficult task for them to stabilize FP8 training for DeepSeek V3, so I seriously doubt they would blindly scale an unproven architecture like that. In addition to the other comments, which explain why BitNet is not good for batched inference, you'd probably also be disappointed by the speed and performance of a 671B BitNet model. I would not expect it to work comparably well to a 671B non-BitNet model, and you'd still be looking at single-digit tokens per second on any setup worth less than $10,000. MoE models are great for batched inference (that is, 99% of LLM inference applications), but for single-user local models you will almost certainly want to choose a good 20B-40B dense model, which fits comfortably on a single prosumer card like the 3090. My favorites are GLM4-32B and Gemma 3 27B.