Could High Bandwidth Flash be Local Inference's saviour?
Posted by DeltaSqueezer@reddit | LocalLLaMA | 29 comments
We are starved for VRAM, but in a local setting, a large part of that VRAM requirement is due to model weights.
If we put the weights on cheaper HBF instead, and assume a 10x cost advantage, then instead of a GPU with just 32GB of VRAM, we could have one with 32GB of VRAM plus 256GB of HBF.
With 4 of these, you'd have 128GB of VRAM and 1TB of HBF, enough to run bigger models. With 8 of them, you could run the largest models locally.
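To make the arithmetic concrete, here is a rough sketch of the capacity math. The 32GB and 256GB per-card figures and the 10x cost ratio come from the post; reading the cost advantage as "per GB" is my assumption, and the function names are made up for illustration. The weight footprint uses the usual rule of thumb of parameters times bits per weight divided by 8, and ignores KV cache and activations.

```python
# Back-of-the-envelope capacity math for hypothetical VRAM + HBF cards.
# Per-card sizes and the 10x ratio are the post's numbers; everything
# else here (names, per-GB reading of the cost ratio) is an assumption.

def pool_capacity(num_cards, vram_gb=32, hbf_gb=256):
    """Return (total VRAM in GB, total HBF in GB) across identical cards."""
    return num_cards * vram_gb, num_cards * hbf_gb

def weights_gb(params_billions, bits_per_weight):
    """Approximate weight footprint in GB: params * bits / 8 bytes each."""
    return params_billions * bits_per_weight / 8

def hbf_cost_in_vram_gb(hbf_gb=256, cost_ratio=10):
    """How many GB of VRAM cost the same as hbf_gb of HBF,
    assuming HBF is cost_ratio times cheaper per GB."""
    return hbf_gb / cost_ratio

if __name__ == "__main__":
    for cards in (1, 4, 8):
        vram, hbf = pool_capacity(cards)
        print(f"{cards} card(s): {vram} GB VRAM, {hbf} GB HBF")

    # Example: a 70B-parameter model at 16 bits per weight.
    need = weights_gb(70, 16)
    _, hbf_one_card = pool_capacity(1)
    print(f"70B @ 16-bit needs ~{need:.0f} GB; fits in one card's HBF: {need <= hbf_one_card}")
```

Under these assumptions, the 256GB of HBF on each card would cost about as much as 25.6GB of VRAM, so the card ends up a bit under twice the memory cost of a plain 32GB one.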