Running BitNet b1.58 inside DRAM by intentionally breaking DDR4 timing rules

Posted by use-one_of-these@reddit | hardware | 16 comments

I have been working on running BitNet b1.58 inside DRAM by intentionally breaking DDR4 timing rules. I also made a visual explainer: https://pcdeni.github.io/CaSA/explainer/
This is tested and works on commercial off-the-shelf memory, driven by a custom memory controller on an FPGA. The underlying effect is well characterized in academic work (CMU SAFARI, SiMRA, DRAM Bender, etc.). While getting this to work I also made a previously undocumented discovery about DDR behaviour: https://pcdeni.github.io/CaSA/explainer/xor-spread.html
Overall it is a bit slow, because full rows of data have to be moved even when all that is actually needed is the count of '1' bits (popcount). Making it competitive would require changes to the memory die, but nothing as drastic as merging compute and memory into one piece of silicon. That would in turn avoid the memory wall the industry is currently facing.
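
To put a rough number on the popcount point above, here is a minimal sketch. It assumes a row of about 8 KB (the figure mentioned later in the thread); the function names are made up for illustration and are not part of the project.

```python
# Assumed row size, taken from the "~8 KB" figure quoted in the thread.
ROW_BYTES = 8 * 1024


def popcount_result_bits(row_bits: int) -> int:
    """Bits needed to hold any popcount of a row, i.e. any value in 0..row_bits."""
    return row_bits.bit_length()


def movement_overhead(row_bytes: int = ROW_BYTES) -> float:
    """Ratio of bits moved (the full row) to bits actually needed (the count)."""
    row_bits = row_bytes * 8
    return row_bits / popcount_result_bits(row_bits)


if __name__ == "__main__":
    print(popcount_result_bits(ROW_BYTES * 8))  # width of the count in bits
    print(round(movement_overhead()))           # how many times more data is moved
```

Under that assumption the count fits in 17 bits while 65,536 bits cross the bus, so a popcount done on the die would cut the transfer by more than three orders of magnitude.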

SignalButterscotch73@reddit

I felt something flying over my head but I haven't a clue what it was.

HitM3Upjessy57@reddit

basically he is overclocking the memory so hard that the bits start leaking into each other and he is using that glitch to do math without a cpu. it is completely unhinged but surprisingly legit.

SomewhereRude2144@reddit

lmao same 💀😂

Quiet_Dinner3787@reddit

So you use the DRAM in your DDR4 RAM instead of your GPU? And you run the LLM at a lower hardware level?

use-one_of-these@reddit (OP)

The idea is that you don't need to move the data from memory to the compute units (GPU) through a straw (the memory bus); you keep it in memory and do the computation there. GPU inference for AI is memory-bandwidth bound: the GPU often sits idle waiting for data to arrive from memory, and higher tokens/s mainly tracks higher memory bandwidth.
The silicon characteristics that suit memory and those that suit compute are pretty much opposite, which is why the two haven't been merged yet. What I am showing is that merging them is not necessary. Are memory design changes needed to make it fast? Yes. Are they drastic? No. Samsung already holds a patent for in-memory popcount, although they never released any hardware with it, and ironically Samsung memory is suspected of rejecting timing-violating commands.
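
A back-of-the-envelope sketch of the bandwidth-bound argument, assuming every weight is read from memory once per generated token and ignoring activations and the KV cache. The model size and the 2-bit weight encoding below are illustrative numbers, not measurements from this project.

```python
def max_tokens_per_second(bandwidth_bytes_per_s: float,
                          n_params: float,
                          bits_per_weight: float) -> float:
    """Upper bound on tokens/s when each token needs one full pass over the weights."""
    bytes_per_token = n_params * bits_per_weight / 8
    return bandwidth_bytes_per_s / bytes_per_token


# One DDR4-3200 channel: 3200 MT/s * 8 bytes per transfer = 25.6 GB/s.
DDR4_3200_CHANNEL = 3200e6 * 8

if __name__ == "__main__":
    # Hypothetical 2-billion-parameter ternary model stored at 2 bits per weight.
    print(max_tokens_per_second(DDR4_3200_CHANNEL, 2e9, 2))  # about 51 tokens/s
```

The bound scales linearly with bandwidth and says nothing about compute, which is why doing the work where the data already sits is attractive.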

EmergencyCucumber905@reddit

How does DRAM do the calculation?

use-one_of-these@reddit (OP)

Memory is built out of analog electronic components such as capacitors and transistors. What makes it digital is defining voltage ranges that correspond to '0' and '1', with a gap between the two. Capacitors take time to charge and discharge. So why does a memory behave like a memory? Because the timing and the maintenance refreshes happen to sit in the regime that captures that behaviour. For calculation you operate in the analog domain a little longer before quantizing back to digital. The primitive is a MAJority of 3 cells, each in a different row, applied to entire rows at once (~8 KB): the 1st cell is operand A, the 2nd cell is operand B, and the value of the 3rd cell selects between a logical AND and OR (0 gives AND, 1 gives OR). Why is this enough for ternary? Ternary uses weights of -1, 0 and 1, which can be encoded in 2 bits. To multiply these weights with the inputs you use a bitwise AND between them.
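
The logic described above can be sketched in plain software. This is only a model of the bit-level behaviour, not the DRAM command sequence. It assumes one particular 2-bit encoding (a "positive" and a "negative" bit-plane per weight) and single-bit inputs; both are assumptions made for illustration, and all names are hypothetical.

```python
ROW_BITS = 16                   # toy row width; a real row is ~8 KB per the comment
ALL_ONES = (1 << ROW_BITS) - 1  # control row filled with '1'


def maj3(a: int, b: int, c: int) -> int:
    """Bitwise majority of three rows, each modelled as a Python int."""
    return (a & b) | (a & c) | (b & c)


def row_and(a: int, b: int) -> int:
    return maj3(a, b, 0)         # third row all zeros: majority acts as AND


def row_or(a: int, b: int) -> int:
    return maj3(a, b, ALL_ONES)  # third row all ones: majority acts as OR


def pack_bits(bits) -> int:
    """Pack a list of 0/1 values into a row, element i at bit position i."""
    row = 0
    for i, bit in enumerate(bits):
        if bit not in (0, 1):
            raise ValueError("inputs must be 0 or 1")
        row |= bit << i
    return row


def encode_ternary(weights):
    """Assumed encoding: bit set in pos for +1, bit set in neg for -1, neither for 0."""
    if any(w not in (-1, 0, 1) for w in weights):
        raise ValueError("weights must be -1, 0 or 1")
    pos = pack_bits([1 if w == 1 else 0 for w in weights])
    neg = pack_bits([1 if w == -1 else 0 for w in weights])
    return pos, neg


def popcount(row: int) -> int:
    return bin(row).count("1")


def ternary_dot(weights, inputs) -> int:
    """Dot product of ternary weights with 0/1 inputs using only AND and popcount."""
    pos, neg = encode_ternary(weights)
    x = pack_bits(inputs)
    return popcount(row_and(pos, x)) - popcount(row_and(neg, x))


if __name__ == "__main__":
    w = [1, -1, 0, 1, -1, 0, 1, 1]
    x = [1, 1, 1, 0, 0, 1, 1, 1]
    print(ternary_dot(w, x), sum(wi * xi for wi, xi in zip(w, x)))
```

The two AND results still have to be counted, which is where the popcount cost mentioned in the original post comes in.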

Express_Living2264@reddit

Is there some crazy bottleneck that makes this not really feasible? I'm guessing one issue here is that you lose tons of storage as you repurpose it for processing tasks. So you end up trading good storage cells for presumably bad processing cells. Is the only remaining benefit the removal of the transfer bottleneck?

Motor_Trouble2280@reddit

You just casually created in-memory computing. Isn't that supposed to be a big deal?

Quiet_Dinner3787@reddit

So you use the DRAM as a calculation unit? This seems really neat! Thanks for your explanation. Is this "solution" being advertised to the manufacturers so that they will want to make those changes to make it faster?

use-one_of-these@reddit (OP)

I wish I were able to advertise it to them. There are many open source projects for chips (CPU, GPU, FPGA), but for memory there is only one guy on YouTube who builds them in his shed. The basic circuit is known, but the sophisticated implementation is a secret, and so is the fine-tuning to the silicon properties. A startup would get outcompeted by established manufacturers immediately. As for patenting it: only 3% of patents ever generate profit. If somebody copied it, I would have to legally prove that they did, and according to a search, even if that can be done and the copier pays, I am still more likely to end up with a net loss due to the fees for patenting and legal procedures.

lnkofDeath@reddit

Uh, this is nuts? What a casual post for something so creative!

tat_tvam_asshole@reddit

What do you do for work?

use-one_of-these@reddit (OP)

Mainly FPGA/firmware design, but I have a background in bionics engineering.

tat_tvam_asshole@reddit

I was going to drop a Bionicle meme but alas no image sharing here. Oh well, anyway that's super neat and I hope you are able to see the fruit of this project. Absolutely ambitious and wish you the best.
