Running/Evaluating Models Larger Than RAM + VRAM Capacity (with SSD)

Posted by Treidge@reddit | LocalLLaMA

Just a friendly reminder: you can actually run quite large models that substantially exceed your combined RAM and VRAM capacity by using a fast SSD to store model weights (GGUFs). This could be useful for testing and evaluation, or even for daily use if you don’t strictly require high-speed prompt processing or token generation.

In my case, this works using Llama.cpp on Windows 11 with 128GB of DDR4 RAM, an RTX 5090 (32GB VRAM), and an NVMe SSD for my models. I believe this will also work reasonably well with other GPUs.

In the latest Llama.cpp builds, these "SSD streaming" mechanics should work out of the box. It "just works" even with default parameters; just make sure memory mapping stays enabled (it is the default, so simply don't pass --no-mmap).

Additionally, you may want to quantize the KV cache so that more layers fit into your VRAM, which helps token generation speed, especially when using a larger context (for example, with the -ctk q8_0 -ctv q8_0 arguments).
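As a rough sanity check on why KV-cache quantization frees up room for more layers, here is some back-of-envelope arithmetic. The layer/head/context numbers below are made up purely for illustration (they are not Qwen's actual config), and q8_0's small per-block scale overhead is ignored:

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, ctx_len, bytes_per_elem):
    # K and V each hold ctx_len * n_kv_heads * head_dim values per layer,
    # hence the factor of 2.
    return 2 * n_layers * n_kv_heads * head_dim * ctx_len * bytes_per_elem

# Hypothetical dimensions, chosen only to illustrate the ratio:
f16_cache = kv_cache_bytes(48, 8, 128, 32768, 2)  # default f16 cache
q8_cache  = kv_cache_bytes(48, 8, 128, 32768, 1)  # -ctk q8_0 -ctv q8_0, ~1 byte/elem

print(f16_cache / 2**30, q8_cache / 2**30)  # → 6.0 3.0
```

Roughly halving the cache means those gigabytes of VRAM can hold model layers instead, which is where the token-generation speedup comes from.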

How it works (as I understand it): with memory mapping enabled (the default in Llama.cpp; --no-mmap disables it), the model is mapped into virtual memory directly from storage (the SSD) and is not forced to fit into physical memory entirely. During the warm-up stage, the model fills all available RAM via the OS page cache, and the "missing" capacity is streamed from the SSD on demand during inference. While this is slower than computing entirely in memory, it is still fast enough to be usable, especially when the "missing" portion isn't large relative to the overall model size.
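The on-demand behavior is just the ordinary mmap mechanism. A minimal Python sketch of the same idea, using a small scratch file in place of a real GGUF:

```python
import mmap
import os
import tempfile

# Scratch file standing in for a model file; llama.cpp maps the real
# GGUF the same way, just much larger.
fd, path = tempfile.mkstemp()
os.write(fd, b"\x00" * (1 << 20))  # 1 MiB placeholder
os.close(fd)

with open(path, "rb") as f:
    # Mapping only reserves virtual address space; physical pages are
    # faulted in from disk the first time each region is touched.
    mm = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
    chunk = mm[512:1024]  # reading this slice pages in just that region
    mm.close()

os.remove(path)
```

The OS keeps recently touched pages cached in RAM and evicts cold ones under pressure, which is why the "hot" part of the model effectively lives in memory while the rest streams from the SSD.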

The best part: This does not wear out your SSD. There are virtually no write operations; the model is only being read. You can verify this yourself by checking the Performance tab in Task Manager and monitoring the SSD activity metrics.
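The read-only nature is visible in the mapping itself: a read-only mapping cannot even be written to, so there is nothing for the OS to flush back to disk. Continuing the hypothetical Python sketch:

```python
import mmap
import os
import tempfile

fd, path = tempfile.mkstemp()
os.write(fd, b"weights")  # stand-in for model data
os.close(fd)

with open(path, "rb") as f:
    mm = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
    try:
        mm[0] = 0x58  # any write attempt fails on a read-only mapping
        modified = True
    except TypeError:
        modified = False  # pages are never dirtied, so nothing is written back
    mm.close()

os.remove(path)
```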

My specific example (what to expect): I have a combined memory capacity of 160GB (128GB RAM + 32GB VRAM), with ~152GB usable after Windows overhead. I am running Qwen3.5-397B-A17B at MXFP4_MOE quantization (Unsloth's Q4_K_XL should work similarly), which is 201GB. This exceeds my "maximum" capacity by a solid 50GB (or 33%).
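The streamed portion is just the difference between model size and usable memory; a quick check of these numbers (exact subtraction gives 49 GB / 32%, which the post rounds to ~50 GB / ~33%):

```python
model_gb  = 201   # Qwen3.5-397B-A17B at MXFP4_MOE
usable_gb = 152   # 128 GB RAM + 32 GB VRAM, minus Windows overhead

streamed_gb   = model_gb - usable_gb     # portion read from SSD on demand
streamed_frac = streamed_gb / usable_gb  # overshoot relative to usable memory

print(streamed_gb, round(streamed_frac * 100))  # → 49 32
```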

I imagine those with DDR5 RAM might see notably higher numbers (I'm stuck on DDR4 for the foreseeable future, huh :( ). The most painful part of this setup is the prompt processing speed, which can be a struggle for large requests. However, the token generation speed is actually quite good considering the model is running partially from an SSD.

I'm quite thrilled that I can run Qwen3.5-397B-A17B locally at 4 bits this way, slow as it is.

P.S. The Q3_K_XL quant is 162 GB and runs even faster (7-8 t/s on my setup), so I'd imagine it could do quite well on something with 128 GB RAM + 24 GB VRAM.