Could High Bandwidth Flash be Local Inference's saviour?

Posted by DeltaSqueezer@reddit | LocalLLaMA

We are starved for VRAM, but in a local setting a large part of that VRAM requirement comes from the model weights. If we put the weights on cheaper HBF and assume a 10x cost advantage per GB, then instead of a GPU with just 32GB of VRAM we could have 32GB of VRAM plus 256GB of HBF. With 4 of these cards, you'd have 128GB of VRAM and 1TB of HBF, enough to run bigger models. With 8 of them (256GB of VRAM and 2TB of HBF), you could run the largest models locally.
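
To make the arithmetic above easy to check, here is a rough Python sketch. The 32GB and 256GB per-card figures and the 10x cost ratio come straight from the post. The `Card` class and the helper names are made up for illustration, and the cost ratio is an assumption, not a real price.

```python
from dataclasses import dataclass

GB_PER_TB = 1024  # binary convention, so 4 x 256GB comes out as 1TB


@dataclass(frozen=True)
class Card:
    """Hypothetical GPU with VRAM plus HBF, using the post's numbers."""
    vram_gb: int = 32
    hbf_gb: int = 256


def totals(card: Card, n_cards: int) -> tuple[int, int]:
    """Return (total VRAM in GB, total HBF in GB) across n_cards."""
    return card.vram_gb * n_cards, card.hbf_gb * n_cards


def hbf_cost_in_vram_gb(hbf_gb: float, cost_ratio: float = 10.0) -> float:
    """How many GB of VRAM the same money would buy.

    cost_ratio is the assumed per-GB cost advantage of HBF over VRAM
    (10x in the post; an assumption, not a measured price).
    """
    return hbf_gb / cost_ratio


if __name__ == "__main__":
    card = Card()
    for n in (1, 4, 8):
        vram, hbf = totals(card, n)
        print(f"{n} card(s): {vram}GB VRAM + {hbf}GB HBF ({hbf / GB_PER_TB:g}TB)")
    extra = hbf_cost_in_vram_gb(card.hbf_gb)
    print(f"256GB of HBF costs about as much as {extra:g}GB of VRAM")
```

Under the 10x assumption, the 256GB of HBF on each card costs roughly what another 25.6GB of VRAM would, so the budget of a hypothetical 58GB card buys 288GB of combined capacity instead.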