(Linux) Has anyone succeeded in using NVMe space as substitute RAM for larger models? Is it worthwhile?
Posted by Quiet-Owl9220@reddit | LocalLLaMA | 31 comments
So I have a consumer-grade AMD GPU with 24 GB VRAM and 64 GB DDR5 RAM, which have served me well enough for models up to around 120B. Of course, this just isn't enough for larger models in the 300B+ range.
Storage and RAM are expensive, so I'm not going to be upgrading my hardware any time soon, but I have plenty of high-speed NVMe space available. Is it possible to leverage this as a workaround? What would be the method, a swap file? Do I need to take any special steps to make sure something like LM Studio can actually utilize it?
I realize this will probably be much slower but I want to give it a try and see if I can make it work for me as basically a background process.
dametsumari@reddit
It depends on your read speed. With some MoE models it is sort of feasible. E.g., an A3B model at q4 needs about 1.5 GB of data per token, so with a 15 GB/s read SSD you can get up to (theoretical maximum) 10 output tokens per second. However, good luck finding an SSD that fast, and even then it is quite slow.
If we're talking about e.g. an A20B at q4, you will not even get one token per second. So not worth it.
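A minimal sketch of that arithmetic (assuming q4 works out to roughly 0.5 bytes per weight and that every active parameter is read from disk once per generated token):

```python
# Rough ceiling on output speed when MoE weights stream from an SSD.
# Assumptions: q4 quantization ~0.5 bytes/weight, every active parameter
# read once per token, nothing cached in RAM.

def max_tokens_per_second(active_params_billions: float,
                          read_gb_per_s: float,
                          bytes_per_weight: float = 0.5) -> float:
    gb_per_token = active_params_billions * bytes_per_weight
    return read_gb_per_s / gb_per_token

print(max_tokens_per_second(3, 15))  # A3B on a 15 GB/s SSD        -> 10.0 tok/s
print(max_tokens_per_second(20, 7))  # A20B on a 7 GB/s PCIe 4.0   -> 0.7 tok/s
```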
portmanteaudition@reddit
You're forgetting latency
dametsumari@reddit
Not really. You can fetch the next layer(s) while computing the previous one, so even an HDD RAID (with enough total throughput) works equally well.
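A minimal sketch of that overlap idea, as a double-buffered reader thread (the layer files and the `compute` callback are hypothetical stand-ins):

```python
import threading
from queue import Queue

def run_layers(layer_paths, compute, depth=2):
    """Hide disk latency: while layer i computes, layer i+1 is being read."""
    q = Queue(maxsize=depth)  # bounded queue = double/triple buffering

    def reader():
        for path in layer_paths:
            with open(path, "rb") as f:
                q.put(f.read())  # big sequential reads keep the disk saturated
        q.put(None)  # sentinel: no more layers

    threading.Thread(target=reader, daemon=True).start()
    while (weights := q.get()) is not None:
        compute(weights)  # overlaps with the reader fetching the next layer
```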
z_3454_pfk@reddit
All the PCIe 5.0 NVMe drives are near that speed.
Quiet-Owl9220@reddit (OP)
I can probably tolerate speeds as low as 0.5 tok/sec while I do something else. I'm not trying to be productive here, this is just for hobby stuff. But my disk speed is more like this: [disk benchmark screenshot omitted]
Just so I can learn, how are you calculating how much data per second a model needs?
AppealSame4367@reddit
I just switched from llama.cpp to ik_llama yesterday and got almost twice as much speed on q3.6 35b a3b scattered across a low-VRAM GPU and RAM. Maybe it can speed up something on your end as well.
Question is also: with the advent of models like q3.6 27b, do you really need such a large model? Maybe running a good specialist like 27b and a smaller specialist for something else in parallel is the way to go for you.
Quiet-Owl9220@reddit (OP)
I have been meaning to give ik_llama a try, but it requires some manual effort in LM Studio. Speed isn't really a priority for me in the first place, but it'd be nice if it can give the larger models I use a bit of a boost.
The new Qwen models are pretty good and running well on my system, but they just aren't great at writing IME; they are more focused on productivity. The larger creative-type models (e.g. from TheDrummer) still seem better for writing, at the cost of speed, reasoning, and tool calls.
For that trade-off I'm already down to the ballpark of 2 tok/sec... and still I am not particularly impressed with the results, so I want to see what the big models can do.
portmanteaudition@reddit
RAM has something like 10x the bandwidth of NVMe and roughly 1/10,000th the latency.
rditorx@reddit
You can try using the mmap (memory map) option to load from the SSD.
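For the curious, this is roughly what that option does under the hood: the model file is memory-mapped instead of read into RAM, so the OS pages weights in from the SSD on first touch. A minimal Linux sketch (the model.gguf path is a placeholder):

```python
import mmap
import os

# Map a (hypothetical) GGUF file instead of reading it all into RAM.
fd = os.open("model.gguf", os.O_RDONLY)
size = os.fstat(fd).st_size
weights = mmap.mmap(fd, size, prot=mmap.PROT_READ)

# Linux, Python 3.8+: hint the kernel that access is mostly sequential,
# so it reads ahead aggressively instead of paging in 4 KB at a time.
weights.madvise(mmap.MADV_SEQUENTIAL)

print(weights[:4])  # touching bytes pages them in from disk; prints b'GGUF'
```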
Quiet-Owl9220@reddit (OP)
Well, turns out I've had mmap enabled by default for a long time and just never really understood what it does. LM Studio does not normally allow loading models larger than the available RAM+VRAM, but I looked into it and gave it a try anyway with a large MoE model... in the end I was able to run q4_k_m of Minimax-M2.7. I just had to disable keeping the model in memory and then override LM Studio's guard rails. I would not have figured this out otherwise, so thanks.
It takes a very long time to process prompts and writes VERY slowly, starting and stopping, but I'm quite pleased to know that it works at all. I'll download a few other models and experiment some more; maybe I can find a configuration that's a little faster.
yami_no_ko@reddit
That's a terrible idea. It's slow and it literally heats up and grinds away your SSD.
Awkward-Candle-4977@reddit
I had to use a cheap NVMe drive in an external USB enclosure as virtual memory when converting Llama to ONNX. It was very slow.
Samurai2107@reddit
I think, if I remember correctly, it damages your storage unit.
Fine_League311@reddit
Yes, but too slow.
Street_Teaching_7434@reddit
This is technically feasible but far beyond the point where the price to performance ratio makes any sense.
The application would be huge 100B+ MoE models.
You would need a lot of small SSDs and a very competent RAID controller with a super-high-bandwidth connection to your CPU.
At that point you might as well just buy either an old dual-Xeon platform that still uses DDR3, or an EPYC or Threadripper platform that uses DDR4, and fill up 512 GB to 1 TB of RAM, which will give you much, much better performance. If you then throw in a 3090 for the offloading, you can actually get quite good performance on a lower budget than buying the SSDs.
Squik67@reddit
Yes, with a swap file it technically works, but it is so slow. Maybe by combining multiple NVMe drives in RAID 0...
Chlorek@reddit
It works out of the box in llama.cpp, for example, through OS mechanisms: the system can map a file on disk as memory, avoiding as much overhead as possible. How fast it works depends on many factors; for some MoE models you can hardly see a difference between this and RAM offload. There's a sometimes-useful option to mix mmap and mlock, which makes model cold start very fast (as it's just mapped from disk), but once a page is moved to RAM it stays there.
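As one concrete illustration, the llama-cpp-python bindings expose both switches; a sketch (the model file and layer split are made up):

```python
from llama_cpp import Llama

# use_mmap=True: weights are mapped from disk and paged in lazily,
# so cold start is fast and the model can be bigger than RAM.
# use_mlock=True: pages that have been touched are pinned in RAM,
# so the OS can't evict them back to disk (the "stays there" part).
llm = Llama(
    model_path="huge-moe-q4_k_m.gguf",  # hypothetical model file
    n_gpu_layers=20,                    # whatever fits in your VRAM
    use_mmap=True,
    use_mlock=True,
)
print(llm("Hello,", max_tokens=8)["choices"][0]["text"])
```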
Chromix_@reddit
It'll likely be way too slow by default, but improvements could be made so that it's at least not completely hopeless.
This could then allow you to run the 600 GB Kimi K2.6 "Q8" quant at 1 token per second. You'd need to wait 2 hours for it to do a bit of reasoning (at 1 tok/s, that's roughly 7,200 tokens) and provide a reply on shorter tasks, costing you way more electricity than paying for access to a hosted version, but it could technically work.
andy_potato@reddit
Do not abuse your NVMe as swap space for AI models.
miniocz@reddit
Yes, and no. Llama.cpp can do it automatically. It works, but we are talking about 1 t/s speeds with short prompts and an empty context.
Mia_the_Snowflake@reddit
Optane
Quiet-Owl9220@reddit (OP)
Never heard of this before, but it looks like a hardware-specific Intel thing, so it won't be possible for me.
Mia_the_Snowflake@reddit
Then RAID 0.
But it will be very slow regardless.
Quiet-Owl9220@reddit (OP)
I'm not very familiar with RAID; I thought it was for storage with redundancy? How does RAID 0 help here?
Mia_the_Snowflake@reddit
It speeds up sequential read speed and random reads. RAID 0 stripes data across the drives with no redundancy, so read throughput roughly scales with the number of disks.
Quiet-Owl9220@reddit (OP)
Wish I knew that before I formatted all my disks 🥲
I will have to think about whether that's worth the effort. Thanks for your advice.
Aizen_keikaku@reddit
Could’ve googled it in the time it took you to write that answer.
taking_bullet@reddit
I'm looking forward to this, but first I want to get at least a PCIe 7.0 SSD.
MelodicRecognition7@reddit
https://old.reddit.com/r/LocalLLaMA/comments/1r65y85/how_viable_are_egpus_and_nvme/o60f9c0/
cakemates@reddit
Plenty of people have tried, and for obvious reasons it's usually so terribly slow it's not worth it.
MotokoAGI@reddit
It's not worthwhile.