Deepsilicon runs neural nets with 5x less RAM and ~20x faster. They are building SW and custom silicon for it
Posted by hamada0001@reddit | LocalLLaMA | View on Reddit | 43 comments
Apparently "representing transformer models as ternary values (-1, 0, 1) eliminates the need for computationally expensive floating-point math".
Seems a bit too easy so I'm skeptical. Thoughts on this?
limapedro@reddit
Seems about right! It's from the 1-bit paper; now things are getting interesting with custom hardware! Bitnets are very promising!
hamada0001@reddit (OP)
But surely this'll reduce accuracy if it's 1bit? Unless I'm missing something... Perhaps it's my ignorance and I need to read more on it 😆
veriRider@reddit
You can read the bitnet paper from earlier this year for the first insights into the trade-offs; no one has done it at scale yet.
https://arxiv.org/abs/2310.11453
hamada0001@reddit (OP)
Thanks!
charlesrwest0@reddit
This might be more relevant:
https://arxiv.org/abs/2406.02528
Or this one:
https://huggingface.co/papers/2407.12327
_yustaguy_@reddit
*that we know of
jasminUwU6@reddit
I don't see why someone would do something like this and just hide it when they could profit from it
_yustaguy_@reddit
Imagine this: Anthropic makes Claude 4, even the smallest model outperforms Sonnet 3.5 by a pretty wide margin, Opus 4 is pretty much AGI, yadda yadda. Now, what would happen if they revealed that it was bitnet that enabled all of that innovation?
Literally every single AI lab would invest heavily in bitnet and Anthropic's advantage would disappear instantly.
The very knowledge that an experimental technology can work at scale is extremely important to every company in this sector. Not everyone gives away their sauce like Meta.
az226@reddit
I wonder why they didn’t take it to the logical extreme of 0.68 bits per weight.
jasminUwU6@reddit
What would that even mean?
az226@reddit
It's an experimental approach where all weights are either 1 or null; LLMs trained this way average out to about 68% of the weights being 1 and the rest being nothing. Then you can use lookup tables/simple addition instead of matmul, with crazy fast inference and a super low memory footprint.
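To make that concrete, here's a tiny toy sketch (my own illustration, not anything from a paper or from deepsilicon) of how a dot product collapses into pure selection and addition when every weight is either 1 or absent:

```python
# Toy sketch of the {1, null} weight idea: with weights that are either 1 or
# absent, a dot product reduces to summing the activations at the positions
# where a weight exists. Purely illustrative numbers.
import numpy as np

rng = np.random.default_rng(0)
activations = rng.standard_normal(16).astype(np.float32)

# ~68% of positions hold a weight of 1, the rest hold nothing.
mask = rng.random(16) < 0.68          # True where a weight of 1 exists

# Conventional dot product with an explicit weight vector ...
weights = mask.astype(np.float32)
dot_mul = float(activations @ weights)

# ... versus pure selection + addition, no multiplication at all.
dot_add = float(activations[mask].sum())

assert np.isclose(dot_mul, dot_add)
```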
jasminUwU6@reddit
I'm not sure how that's different than 1 bit, since in both cases you just have 2 states the weight can occupy.
az226@reddit
Ternary uses 1.68 bits on average per weight, about 2.5x the size.
compilade@reddit
Lossless ternary takes 1.6 bits (5 trits per 8 bits). Of course some lossy quantization scheme could go down further.
The HN comment where I think this 0.68 bit idea comes from (https://news.ycombinator.com/item?id=39544500) referred to distortion resistance of binary models, if I recall correctly.
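For anyone curious, here's a toy illustration of the "5 trits per 8 bits" point (my own sketch, not the actual llama.cpp encoding): since 3^5 = 243 <= 256, five ternary weights fit losslessly into a single byte, i.e. 1.6 bits per weight.

```python
# Pack five ternary digits (-1/0/+1) into one byte via base-3 encoding.
# Toy encoding for illustration only.

def pack5(trits):                         # trits: five values from {-1, 0, 1}
    value = 0
    for t in reversed(trits):
        value = value * 3 + (t + 1)       # map -1/0/1 -> 0/1/2, base-3 encode
    return value                          # fits in 0..242

def unpack5(byte):
    trits = []
    for _ in range(5):
        trits.append(byte % 3 - 1)        # decode base-3 digit back to -1/0/1
        byte //= 3
    return trits

assert unpack5(pack5([-1, 0, 1, 1, -1])) == [-1, 0, 1, 1, -1]
# 8 bits / 5 trits = 1.6 bits per weight, matching the figure above.
```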
eloitay@reddit
No idea why people seem to feel quantisation does not degrade quality significantly. When I tested it for language translation it became unusable; not sure whether I did something wrong or translation is just one of the cases that degrades a lot.
Dayder111@reddit
BitNet and ternary models in general are not quantization (in its current widely used meaning).
Put simply: imagine you train a model with high-precision weights, each (in theory) capable of holding a lot of intricacies and information when combined with other weights. Then you force them to take only a few values, forcing them to choose, leaving no possibility to represent those nuances and intricacies and inevitably breaking the whole model in that regard.
In BitNet they force the model to learn, to form its inner structure, with this low-precision limitation already applied from the very beginning. It has to represent nuances and intricacies with rougher values from the start, and apparently manages to do it well, at least given a bit more computing time/power.
So no information loss/"brain damage" happens in this case, but it might take a bit longer to train.
The advantage is the possibility of designing much simpler hardware that will run such models hundreds to thousands of times more energy-efficiently and/or faster.
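A rough sketch of what "applying the limitation from the very beginning" can look like in code (my own simplification in the spirit of BitNet b1.58, not the authors' implementation): keep latent full-precision weights, ternarize them with an absmean scale on every forward pass, and pass gradients straight through so the latent weights keep learning.

```python
# Minimal sketch of a ternary-from-scratch linear layer using a
# straight-through estimator. Hyperparameters and init are placeholders.
import torch

class TernaryLinear(torch.nn.Module):
    def __init__(self, in_features, out_features):
        super().__init__()
        self.weight = torch.nn.Parameter(torch.randn(out_features, in_features) * 0.02)

    def forward(self, x):
        w = self.weight
        scale = w.abs().mean()                          # absmean scale
        w_ternary = (w / (scale + 1e-8)).round().clamp(-1, 1)
        # Straight-through estimator: ternary values in the forward pass,
        # but gradients flow to the latent full-precision weights.
        w_q = w + (w_ternary * scale - w).detach()
        return torch.nn.functional.linear(x, w_q)

layer = TernaryLinear(8, 4)
out = layer(torch.randn(2, 8))
out.sum().backward()                                    # latent weights get gradients
```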
eloitay@reddit
But wouldn't this mean more branching, and thereby require more RAM? Though maybe it's more compute-efficient? Thanks for sharing; I always thought that BitNet was quantization. Didn't realize that retraining is required.
Dayder111@reddit
These values do not have to be treated like booleans/if statements. They can get added/subtracted like everything else on GPUs. Like, if you see a 1 or 0, you do not branch on which input to process: you process both, set one of them to 0 and add them "both" (I am not that savvy in neural networks, this might not be the most fitting/correct explanation, but in shaders that run on GPUs it works kind of like that, because branching is more expensive than just calculating both paths and zeroing one out).
It removes the need for multiplication (since multiplying by -1, 0 and 1 takes just addition plus bit logic), floating-point numbers, and high-precision numbers (at least for the largest parts of the model's calculations). And high-precision floating-point multipliers take an order or orders of magnitude more transistors, and hence space, interconnect length, and energy, than low-precision integer adders.
So, you can get better speed and energy efficiency with way smaller chips, add more adders or on-chip memory in freed-up space, and/or clock them higher if possible.
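A toy numpy sketch of that point (my own illustration, not deepsilicon's or any paper's code): with weights restricted to {-1, 0, +1}, a matrix-vector product needs no multiplications and no data-dependent branches, only masking, addition, and subtraction.

```python
# Multiplication-free ternary matrix-vector product, checked against a
# conventional matmul. Shapes and values are arbitrary toy data.
import numpy as np

rng = np.random.default_rng(1)
W = rng.integers(-1, 2, size=(4, 16)).astype(np.int8)   # ternary weights
x = rng.integers(-128, 128, size=16).astype(np.int32)   # int8-range activations

# Reference result using ordinary multiplication.
ref = W.astype(np.int32) @ x

# Branchless, multiplication-free version: add where the weight is +1,
# subtract where it is -1, ignore where it is 0.
out = np.where(W == 1, x, 0).sum(axis=1) - np.where(W == -1, x, 0).sum(axis=1)

assert np.array_equal(ref, out)
```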
limapedro@reddit
You're spot on! When learning about neural networks, one of the first models we learn to train is an XOR network, and I think it's not too hard to imagine using logical operators to do the math.
askchris@reddit
That's an interesting observation, I know some quantization techniques are biased towards maintaining English performance as they try to compress the weights.
But that said, BitNet is not quantization; it's a different training paradigm. It seems to act more like a minimal attention routing system rather than relying on fuzzy (heavy floating-point) math and matrix multiplication.
limapedro@reddit
The key is training the model from scratch: quantization reduces accuracy, but a model trained from scratch this way seems to match fp16 performance.
az226@reddit
And the key is that the bigger the model, the smaller the delta.
themrzmaster@reddit
Nice. But isn't it better to invest in training a large-scale bitnet before creating custom hardware?
LinuxSpinach@reddit
Seems like a huge gamble to pitch custom silicon for what is currently niche architecture. I hope it pays off, but won’t be surprised if it doesn’t.
3-4pm@reddit
I have a feeling that those highly invested in GPU-based architectures are going to scoff at this until it has realized potential at scale.
compilade@reddit
Ternary models will be able to run fast on GPUs too. Implementation will need time, but TQ2_0 and TQ1_0 in llama.cpp will eventually get ported to CUDA and other backends. Not sure exactly how fast they will perform, but these types are not based on lookup tables, so they should scale well on GPU (hopefully).
Ternary models use mixed ternary-int8 matrix multiplications (weights in ternary, activations in 8-bit). Fast accumulation of 8-bit integers is necessary.
On CPUs with AVX2 (which have the amazing _mm256_maddubs_epi16 instruction), the speed of TQ2_0 is in the same ballpark as T-MAC (twice as fast as Q2_K), even though the layout of TQ2_0 is not as optimized (no interleaving, no pre-tiling).
On GPU I guess dp4a will be useful. Of course, to save some power, ideally there would be a 2-bit x 8-bit mixed-signedness dot-product instruction.
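For illustration, here's a scalar sketch of the kind of mixed ternary-int8 dot product described here, with weights packed four per byte as 2-bit codes (my own toy layout; the real TQ2_0 format in llama.cpp is organized differently, this only shows the core arithmetic):

```python
# Toy 2-bit packed ternary weights (codes 0/1/2 meaning -1/0/+1) multiplied
# against int8 activations, checked against a plain integer dot product.
import numpy as np

def pack_ternary_2bit(w):                 # w: ternary weights, length multiple of 4
    codes = (w + 1).astype(np.uint8)      # -1/0/1 -> 0/1/2
    codes = codes.reshape(-1, 4)
    return codes[:, 0] | codes[:, 1] << 2 | codes[:, 2] << 4 | codes[:, 3] << 6

def dot_ternary_int8(packed, x):          # x: int8 activations
    acc = 0
    for i, byte in enumerate(packed):
        for j in range(4):
            code = (byte >> (2 * j)) & 0b11
            acc += (int(code) - 1) * int(x[4 * i + j])   # -1/0/+1 times int8
    return acc

rng = np.random.default_rng(2)
w = rng.integers(-1, 2, size=32).astype(np.int8)
x = rng.integers(-128, 128, size=32).astype(np.int8)
assert dot_ternary_int8(pack_ternary_2bit(w), x) == int(w.astype(np.int32) @ x.astype(np.int32))
```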
Longjumping-Solid563@reddit
Be wary of anything Y Combinator; they will give money to any Ivy League dropout with a decent idea. There was a Hacker News thread by the founders and it is very worrying: https://news.ycombinator.com/item?id=41490905
Enough-Meringue4745@reddit
How is this worrying? I believe tackling the Edge market is a great move.
Portable ML for hardware/robotics is the next big move.
hamada0001@reddit (OP)
Yeah I felt this too. It seems they have a "they're smart they'll figure it out" type attitude which usually creates more hype than value.
keisukegoda3804@reddit
The motivation for this startup is that BitNet b1.58 performs well but lacks the hardware support to fully realize its gains (for example, decoding on GPUs is actually compute-bound if matmuls are used). The bitnet authors hypothesize that more specialized hardware can resolve this (for ternary values, matmuls technically aren't even needed; it's all just addition and subtraction).
So deepsilicon is basically developing this specialized hardware, targeting the edge-device setting. In their words, the server setting is already addressed too well by NVIDIA. There are a few existential risks, namely:
1. It's unclear if bitnet scales well to larger models.
2. As models grow, it's unclear how economical serving large models on the edge is (even if they are 1.58-bit), as opposed to in a server setting.
3. It's unclear if foundation model companies will train future models in 1.58 bits (bitnet only works when trained from scratch; it is not a post-training quant method).
4. By using custom silicon, they risk not being able to integrate future optimizations. For example, the bitnet authors published Q-Sparse, which allows for further potential speedup through activation sparsity (it is unclear if deepsilicon would be able to utilize this).
Overall though, it’s an interesting bet, and I wish the best for the founders!
BangkokPadang@reddit
I remember seeing a chart that called into question whether even a 70B would be improved by this technique, but it's been long enough that I think the improvements were just inference speed. Even if quality were 1:1, it seems like this would still save VRAM.
Whatever happened to Intel's quantization method that was supposed to bring 4-bit back to 1:1 with 16-bit performance?
keisukegoda3804@reddit
There are a few quantization methods that are decently close to fp16 at 4-bit IIRC (QuIP#, etc.)
LoSboccacc@reddit
Custom hardware targets are providers, and this will be a very hard sell since it can only run one specific type of net, regardless of how fast.
A100 cards are 4 years old and still going very strong for tensor, diffusion, and traditional architectures.
This card is chasing one unproven fad and on top of that requires custom software. I don't want to go through all the materials to understand the stack deeply, but if this custom stack is not a torch backend, it's DOA. If it cannot significantly undercut A100s, it's DOA.
Providers will not buy them at the scale trainers need because they don't have a proven shelf life.
Trainers will not go around building datacenters to host them either.
So what exactly is their target?
It seems they're targeting investors, selling dreams.
hamada0001@reddit (OP)
Fair points. Groq's doing pretty well though. If the benefits are huge then maybe the industry will make exceptions.
LoSboccacc@reddit
Yeah, but Groq was founded by a group of engineers who worked on Google's TPU, with a 10 million seed round, and it's a generic computation engine that accelerates matmul in general, not just bitnets: a completely different value proposition, a team with connections, and one that understands the logistics of silicon design.
ResidentPositive4122@reddit
People building asics know this already. There's a company that wants to do that for language transformers and they very openly admit it's a gamble. If the arch stays pretty much the same, they're in a nice place to serve transformers at scale (inference). If the arch moves, they can only use a deprecated tech stack. So the risks are well understood and assumed. I find it funny tho that everyone keeps thinking that they're seeing some obvious things that others miss. Oh well.
LoSboccacc@reddit
yeah but I'm not providing feedback to r / we know asics , I'm providing context on r / local llama
brahh85@reddit
This got me
Inevitable-Start-653@reddit
Too soon man 😠 I spent my entire weekend messing with that model, so much time wasted.
ArtyfacialIntelagent@reddit
Yeah, I got that reference, and the comparison is massively unfair to DeepSilicon.
Schumer is just some random dude who dropped out of "entrepreneurship" school (WTF is that anyway) because he couldn't be bothered to educate himself in his impatience to start his get-rich-quick scams. The full depth of his AI knowledge can be acquired by hanging out here on /r/localLlama for a few weeks.
These guys actually know something. Their startup is based on SOTA theory (the BitNet papers) combined with building custom silicon, which is not fucking trivial these days.
eras@reddit
It seems possible they could also reduce the power requirements for inference by quite a bit.
Dayder111@reddit
20x faster is just the beginning for this approach; they likely haven't optimized their design yet, and/or the current neural networks they want to run do not allow further changes.