Could High Bandwidth Flash be Local Inference's saviour?

NoFaithlessness951@reddit

It's very much hypothetical for now

MattAlex99@reddit

I worked with a device using Flash as its "memory" in the past. First, let's look at the cost/complexity of Flash vs DRAM.

Fundamentally, DRAM is a lot more complex to deal with than Flash simply due to the physics. You may know that DRAM stores its information in capacitors, with a "high" charge being a 1 and no charge a 0. The issue you run into when scaling DRAM density is that as you make capacitors smaller, their capacitance drops, meaning you need to compensate via deep trench capacitors (which are exactly what they sound like) and by stacking them in complex 3D shapes to maximize density while keeping the capacitors as large as possible. This means you consume a massive amount of vertical space and add a ton of process complexity: a single DRAM cell is already very "high" compared to a flash cell.

Flash instead uses no separate capacitors and therefore does not have the same scaling/refresh constraints as DRAM (you can further improve this by having transistors share diffusion regions, which is what traditional NAND flash already does). Modern flash (e.g. V-NAND) stacks hundreds of layers vertically, which improves the density even further. You can't do this for DRAM since the capacitors are already positioned vertically.

This also carries over into the lithography: Flash can deal with large voltage variations and comparatively high error rates (the density is high enough to just add error-correcting bits to account for process uncertainty). DRAM does not have that luxury: you have extremely tight sense margins and need to guarantee very low leakage. This is before we even account for the fact that you can store multiple bits in a single flash cell by using multiple threshold levels. This is also why there are so few DRAM manufacturers, despite the actual architecture not being that complex: you can't just scale your DRAM process with the node size, since you have a lot of "non-transistor" components which don't scale the way you would expect them to at the physics level.

The issue with Flash from a performance POV is that reading and writing are slower on a physical level. NAND reads bits by first applying a voltage to the word line, then measuring the current through a series string, and finally determining whether (and how much) the transistor conducts by comparing against one or more thresholds. Writing is also physically slow, since flash uses electron tunneling for writing, which needs rather high voltages (cells locally reach anywhere from 10-20 V) and verification cycles afterwards (this electron tunneling is also what slowly degrades the cells). Essentially you "kick" electrons over an oxide barrier, from which they can't escape, and to do that you need high voltage.

A lot of the complexity in Flash arises from its need to be non-volatile. You could use floating gate storage with very thin oxides, which has poorer retention but should program faster (you essentially build a weird version of SRAM with one transistor per cell). However, this doesn't matter: you can already use current flash technology as a backbone for computation. Fundamentally, Flash is not that slow (especially if you heavily multiplex it), it just has high latency. For DL, where the access patterns are somewhat predictable, using a high-bandwidth, high-latency mass storage might already be worth it (i.e. you have SRAM->DRAM->flash). You technically only need one "hot" layer and one "loading" layer, assuming you have enough throughput, since you can pipeline from one layer to the next. The complexity is just in how to properly deal with such an additional tier of memory.

The actual principle is pretty well understood: you use the same trick as with HBM by connecting multiple (flash) dies via microbumps. The throughput is similar to DRAM, just the latency is horrible.

In short: Flash will always be orders of magnitude cheaper per gigabyte, and you don't need any special engineering for Flash to become a high-bandwidth, high-capacity source. The difficulty is dealing with latency, but you can mask this easily if you have static computation graphs and are careful with prefetch (the golden ticket with "low latency, high bandwidth, high capacity" would be FeFETs/RRAM or MRAM, but those are not mass-market ready).
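To put rough numbers on the "one hot layer plus one loading layer" idea, here is a minimal Python sketch of a double-buffered timing model. The function name `stream_time` and every timing in it (80 layers, 1 ms transfer, 1.5 ms compute, 50 µs access latency) are made-up assumptions for illustration, not measurements of any real high bandwidth flash part.

```python
def stream_time(n_layers, load_s, compute_s, latency_s, prefetch=True):
    """Seconds to run n_layers whose weights are streamed from flash.

    load_s    : time to transfer one layer's weights (size / bandwidth)
    compute_s : time to compute one layer once its weights are resident
    latency_s : fixed access latency paid at the start of every transfer
    """
    fetch = latency_s + load_s
    if not prefetch:
        # Fetch a layer, then compute it, strictly one after the other.
        return n_layers * (fetch + compute_s)
    # Double buffering: while layer i computes, layer i+1 is fetched.
    # The first fetch cannot be hidden, each overlapped step costs the
    # slower of the two activities, and the last layer only computes.
    return fetch + (n_layers - 1) * max(fetch, compute_s) + compute_s


# Hypothetical numbers, chosen only to show the effect.
serial = stream_time(80, load_s=1e-3, compute_s=1.5e-3, latency_s=50e-6, prefetch=False)
overlapped = stream_time(80, load_s=1e-3, compute_s=1.5e-3, latency_s=50e-6, prefetch=True)
print(f"no prefetch: {serial * 1e3:.2f} ms")
print(f"prefetch:    {overlapped * 1e3:.2f} ms")
```

With these assumed numbers the serial version takes roughly 204 ms and the overlapped one roughly 121 ms. As long as fetching a layer (latency included) takes no longer than computing one, the latency only shows up once, on the very first fetch. If the fetch is the slower side, the model becomes bandwidth-bound instead and prefetching hides only the compute.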

Double_Cause4609@reddit

Because flash in general is a lot cheaper than RAM? If you look at the cost per GB of an SSD, it's way lower than the cost per GB of VRAM. High bandwidth flash will be more expensive than traditional flash, but it won't be as expensive as, for example, HBM, or potentially even DRAM.

gh0stwriter1234@reddit

The issue here though is they are talking about flash that runs at RAM-like speeds... so it will be more expensive than regular flash. I imagine it is something like an HBM base die modified to talk to a stack of flash chips instead of DRAM chips. The interface to the system would be basically the same as HBM.

LevianMcBirdo@reddit

Exactly. Flash isn't just cheaper because it's cheaper to produce but because it's slower and dies faster. If you make DRAM-like flash, it will cost DRAM-like prices.

gh0stwriter1234@reddit

It's not because it dies faster, because when used as intended it doesn't. It's because the logic to make the chip-to-chip interface is expensive, and breaking the flash into smaller independent units that can be accessed in parallel to speed up accesses is also expensive. Flash is cheaper because it has been optimized for a slower point in the memory hierarchy. This means you can share lots of logic and bus paths in the chip. Flash is still cheaper per bit stored because the bits are smaller and have less logic in them. DRAM actually has a lot of additional logic that is part of it to make it work in such a way that it can be accessed at low latency. MLC and QLC flash make the density even higher by storing more than one bit per cell... Also, Intel's biggest, stupidest mistake of all time is ending Optane production... HBM Optane would have been a virtual goldmine.
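For anyone wondering what "more than one bit per cell" costs, a small Python sketch of the standard arithmetic: n bits per cell means 2^n distinguishable charge levels, and 2^n - 1 reference thresholds to tell them apart on a read. The helper names are made up for illustration.

```python
# Bits stored per cell for the common NAND cell types.
CELL_TYPES = {"SLC": 1, "MLC": 2, "TLC": 3, "QLC": 4}


def threshold_levels(bits_per_cell):
    """Number of distinct charge levels a cell must hold."""
    return 2 ** bits_per_cell


def read_references(bits_per_cell):
    """Number of reference thresholds needed to separate those levels."""
    return threshold_levels(bits_per_cell) - 1


for name, bits in CELL_TYPES.items():
    print(f"{name}: {bits} bit(s)/cell, "
          f"{threshold_levels(bits)} levels, "
          f"{read_references(bits)} read references")
```

Each extra bit doubles the number of levels squeezed into the same voltage window, which is why denser cell types tend to be slower to read and program and lean harder on error correction.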

KallistiTMP@reddit

> HBF could achieve bandwidths exceeding 1,638 GB/s

Nope.

petuman@reddit

An HBM3E stack is 1,200 GB/s, so it's not slower.
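A quick back-of-the-envelope in Python for what those two bandwidth figures would mean for token generation, under the simplifying assumption that decoding is purely memory-bound and every token has to read all the weights once. The 70 GB model size is a hypothetical chosen for illustration; the bandwidth numbers are the ones quoted in this thread.

```python
def max_tokens_per_second(bandwidth_gb_s, weights_read_per_token_gb):
    """Upper bound on decode speed if reading weights is the only cost."""
    return bandwidth_gb_s / weights_read_per_token_gb


WEIGHTS_GB = 70.0  # hypothetical: weights that must be read per token

SOURCES = {
    "HBF (figure quoted above)": 1638.0,  # GB/s
    "HBM3E stack": 1200.0,                # GB/s
}

for name, bandwidth in SOURCES.items():
    rate = max_tokens_per_second(bandwidth, WEIGHTS_GB)
    print(f"{name}: at most {rate:.1f} tokens/s")
```

This ignores latency, compute, KV cache traffic and batching, so treat it as a ceiling rather than a prediction.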

Fast-Satisfaction482@reddit

Except we're not going to get consumer hardware that has it, because it will all go in data center cards to give ChatGPT another 10x scale.

Odd-Ordinary-5922@reddit

It really makes me wonder why they need so much when all the Chinese models are catching up or are basically equivalent.

jazir555@reddit

I would imagine American labs' internal models are MUCH better than their public releases, and what we are getting publicly are essentially distills of their top-tier internal models. I've read on /r/singularity before (however much that's worth, i.e. almost nothing) that Chinese model developers release theirs almost as soon as they're ready, whereas US AI model developers release ones that are 3-6 months behind their internal models.

AcePilot01@reddit

Because we want to stay ahead of them. I made a long comment about the geopolitical aspects here, but think of the Cold War and the nuclear arms race... bleeding-edge tech is very much that level of power. Especially AI: even now it can let a bad actor do some bad things. Imagine if you are behind on the "detection" of better AI? You could not trust the data you are getting.

dark-light92@reddit

So that new startups can't compete. OpenAI is pricing out smaller players by artificially raising costs.

Odd-Ordinary-5922@reddit

idk man... I doubt that's the reason. Either they have some crazy LLM that they are hiding, and that's why every single time a new model becomes SOTA they release a model that is 1% better, or they have the worst business model ever. Also, you can just rent a GPU off of runpod, vastai etc. and it's not even that expensive.

dark-light92@reddit

OpenAI is running purely on the promise of a future payoff that most likely isn't going to materialize. The 1% better models are a desperate attempt to stay relevant in a market where competition is catching up quickly and frontier models are most likely hitting hard limits of current LLM architectures. Platforms like vastai and runpod are fine for fine-tuning but highly unreliable for training large models.

KaMaFour@reddit

Oh no, the raising of the costs by (checks notes) inventing more efficient hardware which is gonna make training and running models way faster and take less energy. That will surely get them... (OpenAI must be thrilled for that to become the new standard, with all the GPUs they bought last year which are amortised over 5 years)

dark-light92@reddit

>more efficient hardware

Non-existent hardware, running in non-existent data centers, powered by non-existent grid capacity. FTFY.

gh0stwriter1234@reddit

Doesn't matter, because as long as they put it in HPC GPUs we can have our RAM back, since HPC VRAM abuse can be reduced... AI inference doesn't actually need the half TB of VRAM upcoming GPUs are sporting; it needs something like 64GB of VRAM (context) + 4TB of flash (weights). It would be cheaper and better for the AI companies to have such a setup.
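A rough Python sizing sketch for the 4TB-of-flash idea. It takes the 1,638 GB/s figure quoted elsewhere in this thread as an assumed bandwidth and a hypothetical target of 20 tokens/s, then asks how long one full pass over the weights takes and what fraction of them could be touched per token (which is why a setup like this suits sparse models such as MoE better than dense ones). Function names are invented for the example.

```python
def full_sweep_seconds(capacity_gb, bandwidth_gb_s):
    """Time to read every byte of the weight store once."""
    return capacity_gb / bandwidth_gb_s


def max_active_fraction(capacity_gb, bandwidth_gb_s, target_tokens_per_s):
    """Largest share of the weights readable per token at the target rate."""
    per_token_budget_gb = bandwidth_gb_s / target_tokens_per_s
    return min(1.0, per_token_budget_gb / capacity_gb)


FLASH_GB = 4000.0         # 4 TB of weights, from the comment above
BANDWIDTH_GB_S = 1638.0   # assumed: the HBF figure quoted in this thread
TARGET_TOKENS_PER_S = 20  # hypothetical target decode speed

sweep = full_sweep_seconds(FLASH_GB, BANDWIDTH_GB_S)
fraction = max_active_fraction(FLASH_GB, BANDWIDTH_GB_S, TARGET_TOKENS_PER_S)
print(f"full sweep of weights: {sweep:.2f} s")
print(f"readable per token at {TARGET_TOKENS_PER_S} tok/s: {fraction:.1%} of the weights")
```

Under these assumptions a full sweep takes about 2.4 seconds, so only around 2% of a 4TB weight store could be read per token at that speed.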

cobalt1137@reddit

You are forgetting about the trickle down effect of technology.

kaisurniwurer@reddit

Exactly, me and my TPUs are excited to see this innovation.

kreiggers@reddit

This was the aim of the Nvidia “acquisition” of Groq. Groq has the tech for high-bandwidth inference. “Acquisition” because it’s an exclusive licensing deal for everything that is the company Groq, and it leaves a shell behind so as to skip any pesky market regulations.

petuman@reddit

> HBF supports unlimited read cycles, it’s limited to approximately 100,000 write cycles.

> unlimited read cycles
> flash

sus, at least with normal NAND found in SSDs that's not true?

gh0stwriter1234@reddit

Yeah, and it's definitely SLC because of the speed requirements. Most likely, even if it weren't SLC, it would not go bad before the GPU itself was obsolete in 4-5 years.
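To sanity-check the endurance point, a tiny Python calculation using the roughly 100,000 write cycles quoted above. It assumes each model swap rewrites the whole flash once and that wear is spread evenly; the rewrite rates are hypothetical, and read wear is ignored.

```python
def years_until_worn_out(write_cycles, full_rewrites_per_day):
    """Years until the rated write cycles are used up at a steady rewrite rate."""
    return write_cycles / full_rewrites_per_day / 365.0


RATED_CYCLES = 100_000  # figure quoted in the parent comment

# Hypothetical usage patterns: new weights daily vs. hourly.
for rewrites_per_day in (1, 24):
    years = years_until_worn_out(RATED_CYCLES, rewrites_per_day)
    print(f"{rewrites_per_day} full rewrite(s)/day: about {years:.0f} years")
```

Since inference mostly reads weights and rarely rewrites them, even swapping the whole model once an hour comes out above ten years under these assumptions.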

pmp22@reddit

When Engram becomes the standard, perhaps these giant memory tables can be stored in and read from flash. That way they can scale the knowledge to 100x of today's and use MoE with RAM for weights and VRAM for inference with the active experts.

WolfeheartGames@reddit

There is a hard limit to how large Engram can be, so this is irrelevant for Engram; you'd want it in RAM or VRAM. Having actually trained Engram, I'll tell you it's a massive speed killer and not the magic bullet people would have you think. It needs to be in VRAM.

Psionikus@reddit

Algorithm >>>>>>>>>>>>>>>>>>>>> Hardware. Yes, the current style of LLMs requires a lot of hardware. The reason they require a lot of hardware is that these implementations are so primitive.

Dr_Kel@reddit

I mean, yeah, exactly, HBF is pretty much designed to address the memory constraints!

Round_Mixture_7541@reddit

Put that away before Scam sees it

FastDecode1@reddit

no
