Deepsilicon runs neural nets with 5x less RAM and ~20x faster. They are building SW and custom silicon for it
Posted by hamada0001@reddit | LocalLLaMA | View on Reddit | 43 comments
Apparently "representing transformer models as ternary values (-1, 0, 1) eliminates the need for computationally expensive floating-point math".
Seems a bit too easy so I'm skeptical. Thoughts on this?
limapedro@reddit
Seems about right! It's from the 1-bit paper; now things are getting interesting with custom hardware! Bitnets are very promising!
hamada0001@reddit (OP)
But surely this'll reduce accuracy if it's 1bit? Unless I'm missing something... Perhaps it's my ignorance and I need to read more on it 😆
veriRider@reddit
You can read the bitnet paper from earlier this year for the first insights into the trade-offs; no one has done it at scale yet.
https://arxiv.org/abs/2310.11453
hamada0001@reddit (OP)
Thanks!
charlesrwest0@reddit
This might be more relevant:
https://arxiv.org/abs/2406.02528
Or this one:
https://huggingface.co/papers/2407.12327
_yustaguy_@reddit
*that we know of
jasminUwU6@reddit
I don't see why someone would do something like this and just hide it when they could profit from it
_yustaguy_@reddit
Imagine this: Anthropic makes Claude 4, even the smallest model outperforms Sonnet 3.5 by a pretty wide margin, Opus 4 is pretty much AGI, yadda yadda. Now, what would happen if they revealed that it was bitnet that enabled all of that innovation?
Literally every single AI lab would invest heavily in bitnet and Anthropic's advantage would disappear instantly.
The very knowledge that an experimental technology can work at scale is extremely important to every company in this sector. Not everyone gives away their sauce like Meta.
az226@reddit
I wonder why they didn’t take it to the logical extreme of 0.68 bits per weight.
jasminUwU6@reddit
What would that even mean?
az226@reddit
It's an experimental approach where all weights are either 1 or null; LLMs trained this way average out to about 68% of the weights being 1 and the rest being nothing. Then you can use lookup tables/simple addition instead of matmul, with crazy fast inference and a super low memory footprint.
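To make that concrete, here's a tiny toy sketch (my own illustration, not anything from a paper or from deepsilicon) of how a dot product collapses into pure selection and addition when every weight is either 1 or absent:

```python
# Toy sketch of the {1, null} weight idea: with weights that are either 1 or
# absent, a dot product reduces to summing the activations at the positions
# where a weight exists. Purely illustrative numbers.
import numpy as np

rng = np.random.default_rng(0)
activations = rng.standard_normal(16).astype(np.float32)

# ~68% of positions hold a weight of 1, the rest hold nothing.
mask = rng.random(16) < 0.68          # True where a weight of 1 exists

# Conventional dot product with an explicit weight vector ...
weights = mask.astype(np.float32)
dot_mul = float(activations @ weights)

# ... versus pure selection + addition, no multiplication at all.
dot_add = float(activations[mask].sum())

assert np.isclose(dot_mul, dot_add)
```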
jasminUwU6@reddit
I'm not sure how that's different than 1 bit, since in both cases you just have 2 states the weight can occupy.
az226@reddit
Ternary uses 1.68 bits on average per weight, about 2.5x the size.
compilade@reddit
Lossless ternary takes 1.6 bits (5 trits per 8 bits). Of course some lossy quantization scheme could go down further.
The HN comment where I think this 0.68 bit idea comes from (https://news.ycombinator.com/item?id=39544500) referred to distortion resistance of binary models, if I recall correctly.
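For anyone curious, here's a toy illustration of the "5 trits per 8 bits" point (my own sketch, not the actual llama.cpp encoding): since 3^5 = 243 <= 256, five ternary weights fit losslessly into a single byte, i.e. 1.6 bits per weight.

```python
# Pack five ternary digits (-1/0/+1) into one byte via base-3 encoding.
# Toy encoding for illustration only.

def pack5(trits):                         # trits: five values from {-1, 0, 1}
    value = 0
    for t in reversed(trits):
        value = value * 3 + (t + 1)       # map -1/0/1 -> 0/1/2, base-3 encode
    return value                          # fits in 0..242

def unpack5(byte):
    trits = []
    for _ in range(5):
        trits.append(byte % 3 - 1)        # decode base-3 digit back to -1/0/1
        byte //= 3
    return trits

assert unpack5(pack5([-1, 0, 1, 1, -1])) == [-1, 0, 1, 1, -1]
# 8 bits / 5 trits = 1.6 bits per weight, matching the figure above.
```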
eloitay@reddit
No idea why people seem to feel quantisation does not degrade quality significantly. When I tested it for language translation it became unusable; not sure whether I did something wrong or translation is just one of the cases that degrades a lot.
Dayder111@reddit
BitNet and ternary models in general are not quantization (in its current widely used meaning).
Put simply: imagine you train a model with high-precision weights, each (in theory) capable of holding a lot of intricacies and information when combined with other weights. Then you force them to take only a few values, forcing them to choose, leaving no possibility to represent those nuances and intricacies and inevitably breaking the whole model in that regard.
In BitNet they force the model to learn, to form its inner structure, with this low-precision limitation already applied from the very beginning. It has to represent nuances and intricacies with rougher values from the start, and apparently manages to do it well, at least given a bit more computing time/power.
So no information loss/"brain damage" happens in this case, but it might take a bit longer to train.
The advantage is the possibility of designing much simpler hardware that will run such models hundreds to thousands of times more energy-efficiently and/or faster.
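A rough sketch of what "applying the limitation from the very beginning" can look like in code (my own simplification in the spirit of BitNet b1.58, not the authors' implementation): keep latent full-precision weights, ternarize them with an absmean scale on every forward pass, and pass gradients straight through so the latent weights keep learning.

```python
# Minimal sketch of a ternary-from-scratch linear layer using a
# straight-through estimator. Hyperparameters and init are placeholders.
import torch

class TernaryLinear(torch.nn.Module):
    def __init__(self, in_features, out_features):
        super().__init__()
        self.weight = torch.nn.Parameter(torch.randn(out_features, in_features) * 0.02)

    def forward(self, x):
        w = self.weight
        scale = w.abs().mean()                          # absmean scale
        w_ternary = (w / (scale + 1e-8)).round().clamp(-1, 1)
        # Straight-through estimator: ternary values in the forward pass,
        # but gradients flow to the latent full-precision weights.
        w_q = w + (w_ternary * scale - w).detach()
        return torch.nn.functional.linear(x, w_q)

layer = TernaryLinear(8, 4)
out = layer(torch.randn(2, 8))
out.sum().backward()                                    # latent weights get gradients
```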
eloitay@reddit
But wouldn't this mean more branching, and thereby require more RAM? Though maybe it's more compute-efficient? Thanks for sharing; I always thought that BitNet was quantization. Didn't realize that retraining is required.
Dayder111@reddit
These values do not have to be treated like booleans/if statements. They can get added/subtracted like everything else on GPUs. Like, if you see a 1 or 0, you do not branch on which input to process: you process both, set one of them to 0 and add them "both" (I am not that savvy in neural networks, this might not be the most fitting/correct explanation, but in shaders that run on GPUs it works kind of like that, because branching is more expensive than just calculating both paths and zeroing one out).
It removes the need for multiplication (since multiplying by -1, 0 and 1 takes just addition plus bit logic), floating-point numbers, and high-precision numbers (at least for the largest parts of the model's calculations). And high-precision floating-point multipliers take an order or orders of magnitude more transistors, and hence space, interconnect length, and energy, than low-precision integer adders.
So, you can get better speed and energy efficiency with way smaller chips, add more adders or on-chip memory in freed-up space, and/or clock them higher if possible.
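A toy numpy sketch of that point (my own illustration, not deepsilicon's or any paper's code): with weights restricted to {-1, 0, +1}, a matrix-vector product needs no multiplications and no data-dependent branches, only masking, addition, and subtraction.

```python
# Multiplication-free ternary matrix-vector product, checked against a
# conventional matmul. Shapes and values are arbitrary toy data.
import numpy as np

rng = np.random.default_rng(1)
W = rng.integers(-1, 2, size=(4, 16)).astype(np.int8)   # ternary weights
x = rng.integers(-128, 128, size=16).astype(np.int32)   # int8-range activations

# Reference result using ordinary multiplication.
ref = W.astype(np.int32) @ x

# Branchless, multiplication-free version: add where the weight is +1,
# subtract where it is -1, ignore where it is 0.
out = np.where(W == 1, x, 0).sum(axis=1) - np.where(W == -1, x, 0).sum(axis=1)

assert np.array_equal(ref, out)
```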
limapedro@reddit
You're spot on! When learning about neural networks, one of the first models we learn to train is an XOR network, and I think it's not too hard to imagine using logical operators to do the math.
askchris@reddit
That's an interesting observation, I know some quantization techniques are biased towards maintaining English performance as they try to compress the weights.
But that said, BitNet is not quantization; it's a different training paradigm. It seems to act more like a minimal attention routing system rather than relying on fuzzy (heavy floating-point) math and matrix multiplication.
limapedro@reddit
The key is training the model from scratch: quantization reduces accuracy, but a model trained from scratch this way seems to match fp16 performance.
az226@reddit
And the key is that the bigger the model, the smaller the delta.
themrzmaster@reddit
Nice. But isn't it better to invest in training a large-scale bitnet before creating custom hardware?
LinuxSpinach@reddit
Seems like a huge gamble to pitch custom silicon for what is currently niche architecture. I hope it pays off, but won’t be surprised if it doesn’t.
3-4pm@reddit
I have a feeling that those highly invested in GPU-based architectures are going to scoff at this until it has realized potential at scale.
compilade@reddit
Ternary models will be able to run fast on GPUs too. Implementation will need time, but TQ2_0 and TQ1_0 in llama.cpp will eventually get ported to CUDA and other backends. Not sure exactly how fast they will perform, but these types are not based on lookup tables, so they should scale well on GPU (hopefully).
Ternary models use mixed ternary-int8 matrix multiplications (weights in ternary, activations in 8-bit). Fast accumulation of 8-bit integers is necessary.
On CPUs with AVX2 (which have the amazing _mm256_maddubs_epi16 instruction), the speed of TQ2_0 is in the same ballpark as T-MAC (twice as fast as Q2_K), even though the layout of TQ2_0 is not as optimized (no interleaving, no pre-tiling).
On GPU I guess dp4a will be useful. Of course, to save some power, ideally there would be a 2-bit x 8-bit mixed-signedness dot-product instruction.
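For illustration, here's a scalar sketch of the kind of mixed ternary-int8 dot product described here, with weights packed four per byte as 2-bit codes (my own toy layout; the real TQ2_0 format in llama.cpp is organized differently, this only shows the core arithmetic):

```python
# Toy 2-bit packed ternary weights (codes 0/1/2 meaning -1/0/+1) multiplied
# against int8 activations, checked against a plain integer dot product.
import numpy as np

def pack_ternary_2bit(w):                 # w: ternary weights, length multiple of 4
    codes = (w + 1).astype(np.uint8)      # -1/0/1 -> 0/1/2
    codes = codes.reshape(-1, 4)
    return codes[:, 0] | codes[:, 1] << 2 | codes[:, 2] << 4 | codes[:, 3] << 6

def dot_ternary_int8(packed, x):          # x: int8 activations
    acc = 0
    for i, byte in enumerate(packed):
        for j in range(4):
            code = (byte >> (2 * j)) & 0b11
            acc += (int(code) - 1) * int(x[4 * i + j])   # -1/0/+1 times int8
    return acc

rng = np.random.default_rng(2)
w = rng.integers(-1, 2, size=32).astype(np.int8)
x = rng.integers(-128, 128, size=32).astype(np.int8)
assert dot_ternary_int8(pack_ternary_2bit(w), x) == int(w.astype(np.int32) @ x.astype(np.int32))
```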
Longjumping-Solid563@reddit
Be wary of anything Y Combinator; they will give money to any Ivy League dropout with a decent idea. There was a Hacker News thread by the founders and it is very worrying: https://news.ycombinator.com/item?id=41490905
Enough-Meringue4745@reddit
How is this worrying? I believe tackling the Edge market is a great move.
Portable ML for hardware/robotics is the next big move.
hamada0001@reddit (OP)
Yeah I felt this too. It seems they have a "they're smart they'll figure it out" type attitude which usually creates more hype than value.
keisukegoda3804@reddit
The motivation for this startup is that BitNet b1.58 performs well but lacks the hardware support to fully realize its gains (for example, decoding on GPUs is actually compute-bound if matmuls are used). The bitnet authors hypothesize that more specialized hardware can resolve this (for ternary values, matmuls technically aren't even needed; it's all just addition and subtraction).
So deepsilicon is basically developing this specialized hardware, targeting the edge-device setting. In their words, the server setting is already addressed too well by NVIDIA. There are a few existential risks, namely:
1. It's unclear if bitnet scales well to larger models.
2. As models grow, it's unclear how economical serving large models on the edge is (even if they are 1.58-bit), as opposed to in a server setting.
3. It's unclear if foundation model companies will train future models in 1.58 bits (bitnet only works when trained from scratch; it is not a post-training quant method).
4. By using custom silicon, they risk not being able to integrate future optimizations. For example, the bitnet authors published Q-Sparse, which allows for further potential speedup through activation sparsity (it is unclear if deepsilicon would be able to utilize this).
Overall though, it’s an interesting bet, and I wish the best for the founders!
BangkokPadang@reddit
I remember seeing a chart that called into question whether even a 70B would be improved by this technique, but it's been long enough that I think the improvements were just inference speed. Even if quality were 1:1, it seems like this would still save VRAM.
Whatever happened to Intel's quantization method that was supposed to bring 4-bit back to 1:1 with 16-bit performance?
keisukegoda3804@reddit
There are a few quantization methods that are decently close to fp16 at 4-bit IIRC (QuIP#, etc.)
LoSboccacc@reddit
Custom hardware targets are providers, and this will be a very hard sell since it can only run one specific type of net, regardless of how fast.
A100 cards are 4 years old and still going very strong for tensor, diffusion, and traditional architectures.
This card is chasing one unproven fad and on top of that requires custom software. I don't want to go through all the materials to understand the stack deeply, but if this custom stack is not a torch backend, it's DOA. If it cannot significantly undercut A100s, it's DOA.
Providers will not buy them at the scale trainers need because they don't have a proven shelf life.
Trainers will not go around building datacenters to host them either.
So what exactly is their target?
It seems they're targeting investors, selling dreams.
hamada0001@reddit (OP)
Fair points. Groq's doing pretty well though. If the benefits are huge then maybe the industry will make exceptions.
LoSboccacc@reddit
Yeah, but Groq was founded by a group of engineers who worked on Google's TPU, with a 10 million seed round, and it's a generic computation engine that accelerates matmul in general, not just bitnets: a completely different value proposition, a team with connections, and one that understands the logistics of silicon design.
ResidentPositive4122@reddit
People building asics know this already. There's a company that wants to do that for language transformers and they very openly admit it's a gamble. If the arch stays pretty much the same, they're in a nice place to serve transformers at scale (inference). If the arch moves, they can only use a deprecated tech stack. So the risks are well understood and assumed. I find it funny tho that everyone keeps thinking that they're seeing some obvious things that others miss. Oh well.
LoSboccacc@reddit
yeah but I'm not providing feedback to r / we know asics , I'm providing context on r / local llama
brahh85@reddit
This got me
Inevitable-Start-653@reddit
Too soon man 😠 I spent my entire weekend messing with that model, so much time wasted.
ArtyfacialIntelagent@reddit
Yeah, I got that reference, and the comparison is massively unfair to DeepSilicon.
Schumer is just some random dude who dropped out of "entrepreneurship" school (WTF is that anyway) because he couldn't be bothered to educate himself in his impatience to start his get-rich-quick scams. The full depth of his AI knowledge can be acquired by hanging out here on /r/localLlama for a few weeks.
These guys actually know something. Their startup is based on SOTA theory (the BitNet papers) combined with building custom silicon, which is not fucking trivial these days.
eras@reddit
It seems possible they could also reduce the power requirements for inference by quite a bit.
Dayder111@reddit
20x faster is just the beginning for this approach; they likely haven't optimized their design yet, and/or the current neural networks they want to run do not allow further changes.