Running LLMs on RAM?
Posted by Electronic_Image1665@reddit | LocalLLaMA | View on Reddit | 8 comments
Hey guys, I have been seeing posts here and there about people who are able to run local models partly in RAM, and I hadn't heard of this until I found this subreddit. Is there a good source of information on how to do this? I'm running a 4060 Ti 16GB and I also have an RX 6700 Nitro, but I took that one out since most of my web searches said that trying to run both at the same time would be a huge pain and I'd be better off selling it. But I do have 64 GB of RAM. Thanks!
lly0571@reddit
You can use --n-cpu-moe with llama.cpp to offload MoE expert layers to RAM while keeping decent performance. If you want to run Qwen3-30B-A3B, you could load the model fully onto the GPUs (using both the 4060 Ti and the RX 6700) with the Vulkan backend of llama.cpp, but prefill performance may drop (still fast enough for single-user scenarios).
If you want to run larger models, you can get ~20 t/s decode and ~100 t/s prefill for GPT-OSS-120B with DDR5, which is usable.
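As a minimal sketch of that setup, assuming you've already built llama.cpp's llama-server binary (the GGUF file name and the layer count here are illustrative):

```sh
# -ngl 99        : offload all layers to the GPU where possible
# --n-cpu-moe 32 : keep the MoE expert tensors of the first 32 layers in system RAM
# -c 8192        : context size; the KV cache needs memory on top of the weights
./llama-server -m Qwen3-30B-A3B-Q4_K_M.gguf -ngl 99 --n-cpu-moe 32 -c 8192
```

You'd nudge the --n-cpu-moe value up or down until whatever is left on the GPU fits in the 4060 Ti's 16 GB.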
Electronic_Image1665@reddit (OP)
You think I can use both? I thought there was some fundamental thing that would keep AMD and Nvidia from working together. But then again I got that from Claude lol
thebadslime@reddit
If you use llama.cpp it's pretty automatic. I have a 4GB GPU but 32GB of DDR5. I can run most MoE models really quickly.
Electronic_Image1665@reddit (OP)
Ah, that might have been where I screwed the pooch. I believe I have DDR4 (other upgrades always seemed more urgent than RAM). I run 30B models at 13 tokens per second and was wondering if bringing RAM into the mix would improve this, but now that you bring it up, I think it's time for another upgrade.
dionysio211@reddit
You can run some local models in RAM depending on your setup. You can get over 20 tokens per second of generation with gpt-oss-20b and the ERNIE/Qwen 30B models if you have DDR5-6000 or faster. A lot of people really love the Qwen 4B, so you could try that as well. You will want to use your VRAM regardless.
The quickest and simplest way to mess around with it is to download LM Studio and grab a couple of small models to try. You can use CUDA with the 4060 Ti and ROCm with the AMD card, or you can combine them and use Vulkan (a cross-platform graphics/compute API), which is your best bet for running models that need more VRAM than either card has on its own. Context (the tokens cached in memory from the conversation) requires additional RAM/VRAM, so you need to take that into account as well.
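If you do put both cards back in and go the Vulkan route, a quick sanity check (assuming the standard vulkan-tools package is installed) is to confirm that both GPUs actually show up as Vulkan devices:

```sh
# Prints a short summary of every Vulkan-capable device; the 4060 Ti and the
# RX 6700 should both be listed before you point llama.cpp's Vulkan backend at them.
vulkaninfo --summary
```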
Many of the new models are MoE (Mixture of Experts), and their active parameter counts are low enough that a combination of VRAM and RAM works well. You can pin the expert layers to the CPU and get some gains that way. Good luck on your journey! It is really a lot of fun.
Electronic_Image1665@reddit (OP)
Thanks man, yeah, so far the ones that I've been running purely in VRAM are GPT-OSS-20B and Qwen3 Coder, but I was wondering if, with RAM integrated into the process, I could maybe run something slightly larger than the 30-billion-parameter models I'm currently using.
I have been on Ollama so far, and I haven't touched LM Studio since I tried it with the RX 6700 and it left a bad taste in my mouth, but maybe I'll go back and try it since a lot of people seem to be suggesting it for this specific use case.
Double_Cause4609@reddit
Git clone llama.cpp.
Build with CUDA support (if you want to just use your Nvidia GPU), or build with Vulkan support (if you want to use both).
If using two GPUs, pass an appropriate layer-split value.
When you go to start the server, use the backend you built with, and pass --cpu-moe
This puts all conditional experts onto system RAM. This is only relevant for MoE models.
Dense models will slow down significantly as soon as any of the model spills into system RAM, but you can still set -ngl to a value below the total number of layers in the model. This explicitly reduces the number of layers on the GPU (meaning the rest are implicitly on the CPU).
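Roughly, that sequence looks like this. It's a sketch, not exact instructions; the model path and the 16,10 VRAM split are placeholders to adjust for your cards:

```sh
# Build llama.cpp with the Vulkan backend so the NVIDIA and AMD cards can work together;
# swap -DGGML_VULKAN=ON for -DGGML_CUDA=ON if you only want the 4060 Ti.
git clone https://github.com/ggml-org/llama.cpp
cd llama.cpp
cmake -B build -DGGML_VULKAN=ON
cmake --build build --config Release -j

# Start the server with all conditional experts kept in system RAM (--cpu-moe)
# and the GPU-resident layers divided across the two cards by --tensor-split.
./build/bin/llama-server \
  -m /path/to/model.gguf \
  -ngl 99 \
  --cpu-moe \
  --tensor-split 16,10
```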
lumos675@reddit
Use LM Studio and MoE models like Qwen3 30B Instruct. Offload experts into RAM; there is a setting for it in LM Studio.