running models bigger than physical memory capacity

Posted by ag789@reddit | LocalLLaMA | View on Reddit | 14 comments

has anyone really tried running models bigger than physical memory capacity?

I'd guess most users stick with running models that fit in DRAM + VRAM

https://unsloth.ai/docs/models/qwen3.5

even Google's Gemma 4 is released at about 32 billion parameters; my guess is that even at Q8 it'd fit 'comfortably' in 32GB

https://huggingface.co/collections/google/gemma-4

but there are *huge* models, e.g. the bigger Qwen 3.5 models; Qwen3 Coder Next, an 80B model, is about 40GB at Q4 quant
https://huggingface.co/unsloth/Qwen3-Coder-Next-GGUF
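rough back-of-envelope for those file sizes (a sketch only; the bits-per-weight figures are approximate, and real GGUF files add scale factors and metadata on top):

```python
def approx_gguf_size_gb(n_params_billion: float, bits_per_weight: float) -> float:
    # params (billions) * bits per weight / 8 bits per byte -> gigabytes
    return n_params_billion * bits_per_weight / 8

print(approx_gguf_size_gb(32, 8.0))   # 32B model at Q8
print(approx_gguf_size_gb(80, 4.5))   # 80B model at ~Q4 (Q4_K_M is roughly 4.5 bpw)
```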

my guess is that mmap (on Linux) may be able to accommodate that, e.g. in llama.cpp, but the system could 'swap like crazy' (strictly speaking it thrashes the page cache: read-only mapped weights get evicted and re-read from the file rather than written out to swap).
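for what it's worth, llama.cpp does memory-map model files by default (it has a `--no-mmap` flag to turn that off), so cold weights are demand-paged in from the file and evicted under memory pressure. a tiny Python sketch of the same mechanism on Linux (the filename and sizes are made up for the demo; a real GGUF would be tens of GB):

```python
import mmap
import os
import tempfile

# hypothetical stand-in for a multi-GB GGUF file: a sparse file that
# occupies no physical pages until touched
path = os.path.join(tempfile.mkdtemp(), "weights.bin")
size = 256 * 1024 * 1024  # 256 MiB for the demo
with open(path, "wb") as f:
    f.truncate(size)  # sparse: no data blocks allocated yet

with open(path, "rb") as f:
    # read-only MAP_SHARED mapping: pages are faulted in from the file on
    # first touch and can be dropped by the kernel under memory pressure,
    # so nothing ever goes through swap
    mm = mmap.mmap(f.fileno(), 0, prot=mmap.PROT_READ)
    page = mmap.PAGESIZE
    # touch one byte every 1024 pages; only those pages become resident
    checksum = sum(mm[i] for i in range(0, size, page * 1024))
    mm.close()

print(checksum)  # sparse file reads back as zeros -> 0
```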

it'd be quite interesting if that 'swapping' goes to an SSD, whose random-read (seek) latency is likely (much) lower than a hard drive's.
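a quick-and-dirty way to see that seek-speed difference is to time sequential vs random 4 KiB reads on the same file (sketch only: on a freshly written file much of this comes from the page cache, so for a true cold-read test you'd drop caches or use O_DIRECT):

```python
import os
import random
import tempfile
import time

# make a scratch file to read back
path = os.path.join(tempfile.mkdtemp(), "testfile.bin")
size = 64 * 1024 * 1024  # 64 MiB
with open(path, "wb") as f:
    f.write(os.urandom(size))

block = 4096
offsets_seq = list(range(0, size, block))
offsets_rand = offsets_seq[:]
random.shuffle(offsets_rand)

def timed_read(offsets):
    # read 4 KiB at each offset, unbuffered, and time the whole pass
    t0 = time.perf_counter()
    total = 0
    with open(path, "rb", buffering=0) as f:
        for off in offsets:
            f.seek(off)
            total += len(f.read(block))
    return time.perf_counter() - t0, total

t_seq, n_seq = timed_read(offsets_seq)
t_rand, n_rand = timed_read(offsets_rand)
print(f"sequential: {t_seq:.3f}s  random: {t_rand:.3f}s")
```

on an SSD the two times end up close together; on a spinning disk (cold cache) the random pass is dramatically slower, which is exactly why mmap-ing a too-big model off an SSD is more tolerable.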