Running/Evaluating Models Larger Than RAM + VRAM Capacity (with SSD)
Posted by Treidge@reddit | LocalLLaMA | View on Reddit | 18 comments
Just a friendly reminder: you can actually run quite large models that substantially exceed your combined RAM and VRAM capacity by using a fast SSD to store model weights (GGUFs). This could be useful for testing and evaluation, or even for daily use if you don’t strictly require high-speed prompt processing or token generation.
In my case, this works using Llama.cpp on Windows 11 with 128GB of DDR4 RAM, an RTX 5090 (32GB VRAM), and an NVMe SSD for my models. I believe this will also work reasonably well with other GPUs.
In the latest Llama.cpp builds, these "SSD streaming" mechanics should work out of the box. It "just works" even with default parameters, but you should ensure that:
- Memory mapping (--mmap) is enabled or not specified (default is enabled).
- Memory lock (--mlock) is disabled or not specified (default is disabled).
- Model fit (--fit) is enabled or not specified (default is enabled).
Additionally, you may want to quantize the KV Cache to fit as many layers as possible into your VRAM to help with token generation speed, especially when using a larger context (for example, using the -ctk q8_0 -ctv q8_0 arguments).
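Putting the flags above together, here's a minimal sketch of how the full invocation might look, assembled as an argv list in Python (the binary name and model path are placeholders; only the KV-cache flags come from the post, since mmap/fit default to on and mlock defaults to off):

```python
import shlex

# Flags from the post: --mmap and --fit default to enabled and --mlock to
# disabled, so only the KV-cache quantization needs to be spelled out.
cmd = [
    "llama-cli",
    "-m", "model.gguf",   # placeholder path to the GGUF on the NVMe SSD
    "-ctk", "q8_0",       # quantize the K cache
    "-ctv", "q8_0",       # quantize the V cache
]
print(shlex.join(cmd))    # → llama-cli -m model.gguf -ctk q8_0 -ctv q8_0
```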
How it works (as I understand it): If we use --mmap, the model is mapped to virtual memory directly from the storage (SSD) and is not forced to fit into physical memory entirely. During the warm-up stage, the model saturates all available RAM, and the "missing" capacity is streamed from the SSD on-demand during inference. While this is slower than computing entirely in memory, it is still fast enough to be usable—especially when the "missing" portion isn't significantly large relative to the overall model size.
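The mmap behaviour can be sketched in a few lines of Python: a read-only mapping exposes the whole file without loading it up front, physical pages are faulted in only when touched, and writes are rejected. This is a toy stand-in for GGUF weights, not llama.cpp's actual loader:

```python
import mmap, os, tempfile

# Create a stand-in "weights" file (in reality, a multi-gigabyte GGUF).
fd, path = tempfile.mkstemp()
os.write(fd, b"\x00" * 4096 * 16)  # 16 pages of zeros
os.close(fd)

with open(path, "rb") as f:
    # Read-only mapping: the file is mapped into virtual memory, and
    # physical pages are only faulted in when a range is actually read.
    mm = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
    chunk = mm[4096:8192]       # touching this range streams one page in
    print(len(chunk))           # → 4096
    try:
        mm[0:1] = b"\x01"       # rejected: the mapping is read-only
    except TypeError:
        print("read-only mapping, no writes hit the SSD")
    mm.close()
os.remove(path)
```

The read-only access mode is also why the "no SSD wear" point below holds: the OS only ever reads pages from the file.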
The best part: This does not wear out your SSD. There are virtually no write operations; the model is only being read. You can verify this yourself by checking the Performance tab in Task Manager and monitoring the SSD activity metrics.
My specific example (what to expect): I have a combined memory capacity of 160GB (128GB RAM + 32GB VRAM), with ~152GB usable after Windows overhead. I am running Qwen3.5-397B-A17B at MXFP4_MOE quantization (Unsloth's Q4_K_XL should work similarly), which is 201GB. This exceeds my "maximum" capacity by a solid 49GB (roughly 32%).
- Model load time: ~2 minutes (mostly the warm-up stage).
- SSD Read Speed: 800–900 MB/s during warm-up; ~500 MB/s during prompt processing; 100–200 MB/s during token generation.
- Performance: Prompt processing is ~4 t/s; token generation is ~5–6 t/s.
I imagine those with DDR5 RAM might see notably higher numbers (I'm stuck on DDR4 for the foreseeable future, huh :( ). The most painful part of this setup is the prompt processing speed, which can be a struggle for large requests. However, the token generation speed is actually quite good considering the model is running partially from an SSD.
I'm quite thrilled that this way I can run Qwen3.5-397B-A17B locally at 4-bit, slow as it is.
P.S. The Q3_K_XL quant is 162 GB and runs even faster than that (7–8 t/s on my setup), so I'd imagine it could do quite well on something with 128 GB RAM + 24 GB VRAM.
charmander_cha@reddit
Could you test diffusion-based speculative decoding, so we can find out how many tokens per second you're getting?
Treidge@reddit (OP)
Unfortunately, I can't. But sooner or later, someone certainly will be able to.
NoShoulder69@reddit
Before you go down the SSD offloading rabbit hole, check localops.tech to see if a smaller quant would just fit in your VRAM. Sometimes it's easier than managing swap.
_mayuk@reddit
Mm yeah, I think everybody knows about swap memory... or was I the only one using the SSD? Lol
Treidge@reddit (OP)
Apparently not everybody, since I've seen a lot of folks around here with hardware similar to mine looking at Q2 or Q1 quants, though they could perfectly well try running Q3 or Q4.
_mayuk@reddit
Lol, I was using my SSD as backup or for this, since RAM offloading is a thing hehe…
I hope the people realizing this don't fuck up the SSD prices.
Treidge@reddit (OP)
Just give me at least 10 t/s from SSD already and I would abuse mine like never before to run something like GLM 5 (or whatever comes next) locally. 😁
PermanentLiminality@reddit
Probably more like 1 token per 10 seconds.
Marksta@reddit
The params change over and over again so it's hard to fault the LLM that wrote this. As of the last time I checked, llama.cpp builds last week didn't have mmap on by default.
What a weird post, actually talking about a new model like a real user but also full of LLM slop too.
Treidge@reddit (OP)
It's weird to you because you're eager to call AI slop on anything that has bold text and some formatting. Check again: mmap is enabled by default in llama.cpp, and you can see this in the console at load. For me under Windows 11, at the very least.
Marksta@reddit
There's the human! Hi! It's not the formatting as much as the over formatting. It's clearly LLM output tokens, but I guess it's up for debate if the tokens are slop or not.
And ohhh, you meant on Windows. It's not the default on Linux anymore; you need to pass params to disable the new direct-io stuff and re-enable mmap. I wrote about it here.
Treidge@reddit (OP)
I guess I'm fucked in the new era :( I ran a blog back in the day, wrote a lot, and have a thing for well-formatted and structured texts. LLM output actually looks "right" to me, since I would write and format an article much like that. I even use the long "—" casually (Alt + 0151 is my favorite combo 😁). Woe is me!
By the way, my raw post was pretty much identical to what got posted - Gemini fixed a few typos for me and added bullet points (I probably would have added them myself anyway).
To be honest, I'm quite baffled by this. I've been reading the sub for quite some time and sincerely tried to be helpful to those who may not actually know that they CAN run larger models on their hardware, and got AI slop shaming instead. 😒
Former-Ad-5757@reddit
Lol, prompt processing at 4 t/s? Strix Halo is called bad at pp, but at least I reach the 100s with it, and that's already slow… this is workable for calculator functions, 1+1=…, but have fun with a 132k context at 4–8 t/s.
Treidge@reddit (OP)
Yeah, that's a real downer. Too bad I can't think of anything to help with that; it looks like running a substantial portion of the model from SSD will almost always end up with slow prompt processing like this. No combination of -b or -ub arguments makes a large difference here.
Still, for fresh starts with an empty context it isn't as limiting as it looks (at least for tests/evaluations). It just takes much longer to process requests (but you'd better not unload a model that already has a large context filled 😁).
RoughOccasion9636@reddit
The MoE architecture is doing a lot of the heavy lifting here. With Qwen3.5-397B-A17B, only 17B parameters are active per forward pass, so the SSD bottleneck mostly hits warmup and inactive expert loading rather than actual token generation. That explains why your 5-6 t/s is better than people would expect for a model this size.
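A rough back-of-envelope supports this, assuming ~4.5 bits per active weight for a ~4-bit MoE quant (an assumed average with scales included; actual GGUF bits-per-weight vary):

```python
active_params = 17e9    # A17B: active parameters per forward pass
bits_per_weight = 4.5   # assumed average for a ~4-bit quant, scales included

bytes_per_token = active_params * bits_per_weight / 8
print(round(bytes_per_token / 1e9, 1))  # → 9.6

# ~9.6 GB touched per token is the worst case. In practice most of those
# experts are already resident in RAM/VRAM, so only the cold fraction
# streams from the SSD, consistent with the 100-200 MB/s generation-time
# reads reported above.
```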
One thing worth experimenting with: -ctk q4_0 instead of q8_0. You lose some quality at very long contexts but can push a few more layers into VRAM. On a 5090 with 32GB, every extra GPU layer has outsized impact on generation speed.
Question: are you setting --n-gpu-layers explicitly to max out the 32GB, or letting llama.cpp auto-detect? With this kind of setup I'd be explicit about it to make sure the compute-heavy layers land on GPU and not RAM.
Treidge@reddit (OP)
Yeah, I would do it like that with context. Something like -ctk q8_0 -ctv q4_0 seems like a sweet spot in terms of accuracy/memory consumption.
I'm using --fit autodetect, since it does a good job of saturating VRAM. I've stopped tuning manually since --fit was introduced. Maybe I was doing something wrong, but with --n-gpu-layers I was forced to watch the Shared GPU Memory metric: in some cases an incorrectly selected number of GPU layers spilled data past VRAM capacity into system memory and performance tanked (not 100% sure about this, it was in my earlier days with llama.cpp). It never happened with --fit, though.
MelodicRecognition7@reddit
lol, first post in /r/localllama/ in 15 years and it's AI slop. Did you sell your account to spammers?
Treidge@reddit (OP)
So, you're surprised that I'm actually using AI to check my grammar and punctuation before posting to the sub, given that I'm not a native English speaker? Way to go! 👍