(Linux) Has anyone succeeded in using NVMe space as substitute RAM for larger models? Is it worthwhile?
Posted by Quiet-Owl9220@reddit | LocalLLaMA | 31 comments
So I have a consumer-grade AMD GPU with 24 GB VRAM and 64 GB DDR5 RAM, which have served me well enough for models up to around 120B. Of course, this just isn't enough for larger models in the 300B+ range.
Storage and RAM are expensive, so I'm not going to be upgrading my hardware any time soon, but I have plenty of high-speed NVMe space available. Is it possible to leverage this as a workaround? What would be the method, a swap file? Do I need to take any special steps to make sure something like LM Studio can actually utilize it?
I realize this will probably be much slower but I want to give it a try and see if I can make it work for me as basically a background process.
dametsumari@reddit
It depends on your read speed. With some MoE models it is sort of feasible. E.g., an A3B model at q4 needs about 1.5 GB of data per token, so with a 15 GB/s read SSD you can get up to (theoretical maximum) 10 output tokens per second. However, good luck finding an SSD that fast, and even then it is quite slow.
If we're talking about e.g. an A20B at q4, you will not even get one token per second. So not worth it.
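A minimal sketch of that arithmetic (assuming q4 works out to roughly 0.5 bytes per weight and that every active parameter is read from disk once per generated token):

```python
# Rough ceiling on output speed when MoE weights stream from an SSD.
# Assumptions: q4 quantization ~0.5 bytes/weight, every active parameter
# read once per token, nothing cached in RAM.

def max_tokens_per_second(active_params_billions: float,
                          read_gb_per_s: float,
                          bytes_per_weight: float = 0.5) -> float:
    gb_per_token = active_params_billions * bytes_per_weight
    return read_gb_per_s / gb_per_token

print(max_tokens_per_second(3, 15))  # A3B on a 15 GB/s SSD        -> 10.0 tok/s
print(max_tokens_per_second(20, 7))  # A20B on a 7 GB/s PCIe 4.0   -> 0.7 tok/s
```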
portmanteaudition@reddit
You're forgetting latency
dametsumari@reddit
Not really. You can fetch the next layer(s) while computing the previous one, so even an HDD RAID (with enough total throughput) works equally well.
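A minimal sketch of that overlap idea, as a double-buffered reader thread (the layer files and the `compute` callback are hypothetical stand-ins):

```python
import threading
from queue import Queue

def run_layers(layer_paths, compute, depth=2):
    """Hide disk latency: while layer i computes, layer i+1 is being read."""
    q = Queue(maxsize=depth)  # bounded queue = double/triple buffering

    def reader():
        for path in layer_paths:
            with open(path, "rb") as f:
                q.put(f.read())  # big sequential reads keep the disk saturated
        q.put(None)  # sentinel: no more layers

    threading.Thread(target=reader, daemon=True).start()
    while (weights := q.get()) is not None:
        compute(weights)  # overlaps with the reader fetching the next layer
```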
z_3454_pfk@reddit
All the PCIe 5.0 NVMe drives are near that speed.
Quiet-Owl9220@reddit (OP)
I can probably tolerate speeds as low as 0.5 tok/sec while I do something else. I'm not trying to be productive here, this is just for hobby stuff. But my disk speed is more like this: [disk benchmark screenshot omitted]
Just so I can learn, how are you calculating how much data per second a model needs?
AppealSame4367@reddit
I just switched from llama.cpp to ik_llama yesterday and got almost twice as much speed on q3.6 35b a3b scattered across a low-VRAM GPU and RAM. Maybe it can speed up something on your end as well.
Question is also: with the advent of models like q3.6 27b, do you really need such a large model? Maybe running a good specialist like 27b and a smaller specialist for something else in parallel is the way to go for you.
Quiet-Owl9220@reddit (OP)
I have been meaning to give ik_llama a try, but it requires some manual effort in LM Studio. Speed isn't really a priority for me in the first place, but it'd be nice if it can give the larger models I use a bit of a boost.
The new Qwen models are pretty good and running well on my system, but they just aren't great at writing IME; they are more focused on productivity. The larger creative-type models (e.g. from TheDrummer) still seem better for writing, at the cost of speed, reasoning, and tool calls.
For that trade-off I'm already down to the ballpark of 2 tok/sec... and still I am not particularly impressed with the results, so I want to see what the big models can do.
portmanteaudition@reddit
RAM has something like 10x the bandwidth of NVMe and roughly 1/10,000th the latency.
rditorx@reddit
You can try using the mmap (memory map) option to load from the SSD.
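For the curious, this is roughly what that option does under the hood: the model file is memory-mapped instead of read into RAM, so the OS pages weights in from the SSD on first touch. A minimal Linux sketch (the model.gguf path is a placeholder):

```python
import mmap
import os

# Map a (hypothetical) GGUF file instead of reading it all into RAM.
fd = os.open("model.gguf", os.O_RDONLY)
size = os.fstat(fd).st_size
weights = mmap.mmap(fd, size, prot=mmap.PROT_READ)

# Linux, Python 3.8+: hint the kernel that access is mostly sequential,
# so it reads ahead aggressively instead of paging in 4 KB at a time.
weights.madvise(mmap.MADV_SEQUENTIAL)

print(weights[:4])  # touching bytes pages them in from disk; prints b'GGUF'
```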
Quiet-Owl9220@reddit (OP)
Well, turns out I've had mmap enabled by default for a long time and just never really understood what it does. LM Studio does not normally allow loading models larger than the available RAM+VRAM, but I looked into it and gave it a try anyway with a large MoE model... in the end I was able to run q4_k_m of Minimax-M2.7. I just had to disable keeping the model in memory and then override LM Studio's guard rails. I would not have figured this out otherwise, so thanks.
It takes a very long time to process prompts and writes VERY slowly, starting and stopping, but I'm quite pleased to know that it works at all. I'll download a few other models and experiment some more; maybe I can find a configuration that's a little faster.
yami_no_ko@reddit
That's a terrible idea. It's slow and it literally heats up and grinds away your SSD.
Awkward-Candle-4977@reddit
I had to use a cheap NVMe drive in an external USB enclosure as virtual memory when converting Llama to ONNX. It was very slow.
Samurai2107@reddit
I think, if I remember correctly, it damages your storage unit.
Fine_League311@reddit
Yes, but too slow.
Street_Teaching_7434@reddit
This is technically feasible but far beyond the point where the price to performance ratio makes any sense.
The application would be huge 100B+ MoE models.
You would need a lot of small SSDs and a very competent RAID controller with a super-high-bandwidth connection to your CPU.
At that point you might as well just buy either an old dual-Xeon platform that still uses DDR3, or an EPYC or Threadripper platform that uses DDR4, and fill up 512 GB to 1 TB of RAM, which will give you much, much better performance. If you then throw in a 3090 for the offloading, you can actually get quite good performance on a lower budget than buying the SSDs.
Squik67@reddit
Yes, with a swap file it technically works, but it is so slow. Maybe by combining multiple NVMe drives in RAID 0...
Chlorek@reddit
It works out of the box in llama.cpp, for example, through OS mechanisms: the system can map a file on disk as memory, avoiding as much overhead as possible. How fast it works depends on many factors; for some MoE models you can hardly see a difference between this and RAM offload. There's a sometimes-useful option to mix mmap and mlock, which makes model cold start very fast (as it's just mapped from disk), but once a page is moved to RAM it stays there.
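As one concrete illustration, the llama-cpp-python bindings expose both switches; a sketch (the model file and layer split are made up):

```python
from llama_cpp import Llama

# use_mmap=True: weights are mapped from disk and paged in lazily,
# so cold start is fast and the model can be bigger than RAM.
# use_mlock=True: pages that have been touched are pinned in RAM,
# so the OS can't evict them back to disk (the "stays there" part).
llm = Llama(
    model_path="huge-moe-q4_k_m.gguf",  # hypothetical model file
    n_gpu_layers=20,                    # whatever fits in your VRAM
    use_mmap=True,
    use_mlock=True,
)
print(llm("Hello,", max_tokens=8)["choices"][0]["text"])
```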
Chromix_@reddit
It'll likely be way too slow by default, but improvements could be made so that it's at least not completely hopeless.
This could then allow you to run the 600 GB Kimi K2.6 "Q8" quant at 1 token per second. You'd need to wait 2 hours for it to do a bit of reasoning (at 1 tok/s, that's roughly 7,200 tokens) and provide a reply on shorter tasks, costing you way more electricity than paying for access to a hosted version, but it could technically work.
andy_potato@reddit
Do not abuse your NVMe as swap space for AI models.
miniocz@reddit
Yes, and no. Llama.cpp can do it automatically. It works, but we are talking about 1 t/s speeds with short prompts and an empty context.
Mia_the_Snowflake@reddit
Optane
Quiet-Owl9220@reddit (OP)
Never heard of this before, but it looks like a hardware-specific Intel thing, so it won't be possible for me.
Mia_the_Snowflake@reddit
Then RAID 0.
But it will be very slow regardless.
Quiet-Owl9220@reddit (OP)
I'm not very familiar with RAID; I thought it was for storage with redundancy? How does RAID 0 help here?
Mia_the_Snowflake@reddit
It speeds up sequential read speed and random reads. RAID 0 stripes data across the drives with no redundancy, so read throughput roughly scales with the number of disks.
Quiet-Owl9220@reddit (OP)
Wish I knew that before I formatted all my disks 🥲
I will have to think about whether that's worth the effort. Thanks for your advice.
Aizen_keikaku@reddit
Could’ve googled it in the time it took you to write that answer.
taking_bullet@reddit
I'm looking forward to this, but first I want to get at least a PCIe 7.0 SSD.
MelodicRecognition7@reddit
https://old.reddit.com/r/LocalLLaMA/comments/1r65y85/how_viable_are_egpus_and_nvme/o60f9c0/
cakemates@reddit
Plenty of people have tried, and for obvious reasons it's usually so terribly slow it's not worth it.
MotokoAGI@reddit
It's not worthwhile.