running models bigger than physical memory capacity
Posted by ag789@reddit | LocalLLaMA | View on Reddit | 14 comments
has anyone really tried running models bigger than physical memory capacity?
I'd guess most users stick with running models that fit in DRAM + VRAM
https://unsloth.ai/docs/models/qwen3.5
even Google's Gemma 4 is released at about 32 billion parameters; my guess is that even at Q8, it'd fit 'comfortably' in 32GB
https://huggingface.co/collections/google/gemma-4
but there are *huge* models, e.g. the bigger Qwen 3.5 models, and e.g.
the Qwen Coder Next 80B model is 40GB at Q4 quant
https://huggingface.co/unsloth/Qwen3-Coder-Next-GGUF
a guess is that mmap (on Linux) may be able to accommodate that, e.g. in llama.cpp,
but the system could 'swap like crazy'.
it'd be quite interesting if that 'swap' goes to SSD, which likely has (much) faster seek speeds than hard drives.
qubridInc@reddit
Yes, you can run bigger-than-RAM models via mmap/swap, but it's painfully slow. The real solution is MoE or proper offloading, not brute-forcing with an SSD.
DanRey90@reddit
There was an Apple paper a few years ago called “LLM in a flash”, it was rediscovered by some Twitter users a few weeks ago, and now there’s a lot of experimentation in that field. Search for “flash-moe”.
Right now the experimentation is focusing on macOS because the new Macs have ridiculously fast SSDs (15GB/s reads or something like that), so it starts to become somewhat feasible. It’s only viable for MoE models, because for each token the SSD must read almost all the active parameters (almost because MoE models usually have a shared expert and/or dense layers). So, the idea is that for a huge model like Kimi (1T total, 32B active), you only hold in RAM the KV cache (20GB at full context, maybe less) and the shared experts (15GB?), you quantize the rest, and you’re left with about 5-10GB to read for every token, which would leave you with almost-useable 2-3 tok/s. That’s the theory, as I said it’s highly experimental, and the prompt processing is almost as slow as generation.
Also, mmap is not the same as swap. It only reads from the SSD, so it doesn’t wear it down. It’s just slow.
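The back-of-envelope arithmetic in that comment can be sketched quickly (the 15GB/s and 5-10GB/token figures are the rough assumptions quoted above, not measurements):

```python
# Token rate when the active expert weights must be streamed from SSD
# for every generated token. All numbers are the comment's rough guesses.
ssd_read_gb_per_s = 15.0               # claimed Apple-silicon SSD read speed
gb_read_per_token = [5.0, 10.0]        # quantized active params minus what stays in RAM

tok_per_s = [ssd_read_gb_per_s / gb for gb in gb_read_per_token]
for gb, rate in zip(gb_read_per_token, tok_per_s):
    print(f"{gb:4.1f} GB/token -> {rate:.1f} tok/s")
```

which works out to roughly 1.5-3 tok/s, the same ballpark as the "almost-useable 2-3 tok/s" quoted above.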
ag789@reddit (OP)
I think even with an SSD, a limit would be the SATA bus itself etc. one thing to ask: does the neural network need to iterate over the whole model? after all, these are neural network activations, as in one layer activating the next in some complicated assembly. I don't think it's quite feasible for the network to figure out that it needs 'this activation', look up just the right block on the SSD / hard drive, compute the result, feed forward, process the inputs further, and return the results.
if it is that 'focused', then a 200 GB model simply works like a 'library': it's modular, mmap etc. would work excellently, e.g. if you only need 1 MB out of that 200 GB to compute the results, fetching that 1 MB probably takes milliseconds at most. But if the feedforward LLM neural network needs to visit all 200 GB, this will be a performance nightmare, as the bottleneck is the SATA interface.
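The 'fetch only 1 MB of a 200 GB file' idea is exactly what mmap gives you. A minimal self-contained sketch (using a small stand-in file, since the mechanism is identical at any size):

```python
import mmap
import os
import tempfile

# Create a small stand-in "model" file; with mmap the mechanism is the
# same for a 200 GB file: pages are only read from disk when touched.
fd, path = tempfile.mkstemp()
os.write(fd, b"\x00" * (4 * 1024 * 1024))  # 4 MiB stand-in
os.close(fd)

with open(path, "rb") as f:
    mm = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
    # Slicing faults in only the pages covering this 1 MiB range,
    # not the whole file.
    chunk = mm[1024 * 1024 : 2 * 1024 * 1024]
    mm.close()
os.remove(path)
print(f"read {len(chunk)} bytes without loading the rest")
```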
ag789@reddit (OP)
thinking of it this way, those 'MoE' (mixture of experts) style models may work better with SSDs
https://huggingface.co/blog/moe
as in, based on the illustration and explanations, the 'router' within the network 'chooses' which expert network to send the input to. if each 'expert' is after all 'modular', and the model is after all 'sparse',
then running off an SSD may win big, e.g. the network visits some of the expert feedforward networks, and those get 'cached' in memory by virtue of mmap.
if this is true after all, it may mean it's feasible to run big MoE models 'off' SSDs, running models bigger than system memory that way! :)
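A toy sketch of that routing idea (hypothetical sizes; a real gating network is learned, not random):

```python
import random

NUM_EXPERTS = 64  # experts per MoE layer (made-up number)
TOP_K = 4         # experts actually visited per token

def route(hidden_state):
    """Stand-in for the learned gating network: random scores here."""
    scores = [(random.random(), e) for e in range(NUM_EXPERTS)]
    return [e for _, e in sorted(scores, reverse=True)[:TOP_K]]

chosen = route(None)
# Only these TOP_K experts' weights need to be paged in for this token;
# mmap keeps recently used experts cached in RAM via the page cache.
print(f"token visits {len(chosen)}/{NUM_EXPERTS} experts "
      f"({100 * TOP_K / NUM_EXPERTS:.1f}% of expert weights)")
```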
DanRey90@reddit
The router would never end up “visiting all the experts”. By design. Qwen 35B A3B means that only 3B params are visited in total (including the “router” params etc.). That’s the entire point of MoE. The blog you linked is quite outdated; modern models are more sparse, activating fewer parameters relative to the total.
It makes no sense to talk about a “SATA bus”. Modern Macs don’t have SATA, and if you built a PC to run models this way, you would use a fast NVMe PCIe SSD, not SATA; we aren’t in 2015.
Prize_Negotiation66@reddit
what about AirLLM
Nyghtbynger@reddit
Isn't it abandoned? No updates for 1 year...
ag789@reddit (OP)
thanks for the mention, that is quite interesting :)
https://github.com/lyogavin/airllm
Past_Shift6441@reddit
Is AirLLM any good, or just hype?
I spent hours setting up krasis
https://github.com/brontoguana/krasis
Which says it can help run big models on a small GPU, but I couldn't get it to run anything on mine.
MelodicRecognition7@reddit
https://old.reddit.com/r/LocalLLaMA/comments/1r65y85/how_viable_are_egpus_and_nvme/o60f9c0/
ag789@reddit (OP)
I think there is RAG (retrieval-augmented generation)
https://www.promptingguide.ai/techniques/rag
https://arxiv.org/pdf/2005.11401
I'm not too sure if the tech has evolved into 'modular hot-pluggable neural nets'; that would be quite 'fun' to watch. if that were feasible, one could imagine large LLMs with 'modules', and then 200B on 32 GB could seem feasible :)
sgmv@reddit
There's nothing interesting about using the SSD to load the model from; I'd say probably 99.9% of people load their models from an SSD. If the model spills to RAM it gets much slower, and if it loads parts from the SSD, it's even slower. Good as an experiment, but practically unusable in most scenarios.
ag789@reddit (OP)
my guess is the whole model 'lives in memory'. it's hard to be convinced that neural-net activations can be so discrete that they 'activate like software libraries'. if that were feasible, one could imagine running a 200 GB model in 32 GB of memory, where the small 'library' activations/parameters are loaded on demand :)
Herr_Drosselmeyer@reddit
You're guessing wrong. Q8 is 32.64GB in file size alone; with decent context, you're up to 40GB and more.
As for your original question, sure, people have tried it. There's no way it's even remotely practical though.
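The file-size point follows the usual params × bits-per-weight rule of thumb (the bits/weight figures below are approximate assumptions; real GGUF quants carry per-block scales and metadata on top):

```python
params = 32e9  # a ~32B-parameter model

# Approximate effective bits per weight for common formats.
bits_per_weight = {"F16": 16.0, "Q8_0": 8.5, "Q4_K_M": 4.8}

size_gb = {name: params * bits / 8 / 1e9 for name, bits in bits_per_weight.items()}
for name, gb in size_gb.items():
    print(f"{name:7s} ~{gb:5.1f} GB")
```

so a ~32B model at Q8 lands in the low-to-mid 30s of GB before any KV cache, which is why it doesn't fit in 32GB with context on top.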