Inference of LLMs with offloading to SSD(NVMe)
Posted by GRIFFITHUUU@reddit | LocalLLaMA | View on Reddit | 14 comments
Hey folks 👋 Sorry for the long post, I added a TLDR at the end.
The company I work at wants to see if it's possible (and somewhat usable) to use GPU + SSD (NVMe) offloading for models that far exceed a GPU's VRAM.
I know llama.cpp and ollama basically take care of this by offloading to CPU, and it's slower than GPU-only, but I want to see if I can use SSD offloading and get at least 2-3 tk/s.
The model I'm interested in running is Llama 3.3 70B in BF16 (and hopefully other similarly sized models), and I have an L40S with 48GB VRAM.
While researching this I came across something called DeepSpeed, and I saw DeepNVMe and its application in their ZeRO-Inference optimization.
As far as I understood, there are three ways to configure ZeRO-Inference: stage 1 is GPU, stage 2 is CPU offload, and stage 3 is NVMe. I could not figure out how to use it with disk, so I first tried their CPU offload config.
Instead of offloading the model to RAM when the GPU's VRAM is full, it simply throws a CUDA OOM error. Then I tried loading the model entirely in RAM and offloading to GPU, but I can't control how much goes to the GPU (I can see around 7 GB usage with nvidia-smi), so almost all of the model stays in RAM.
The prompt I gave: "Tell mahabharata in 100 words". With ollama and their Llama 3.3 70B (77 GB, 8-bit quantization), I was able to get 2.36 tk/s. I know mine is BF16, but the same prompt took 831 seconds to generate, around 14 minutes! DeepSpeed doesn't support the GGUF format and I could not find an 8-bit model for a like-for-like test, but the result shouldn't be this bad, right?
The issue is most likely my bad config and script and my lack of understanding of how this works; I am a total noob. But if anyone has experience with DeepSpeed or offloading to disk for inference, please share your suggestions on how to tackle this, any better alternatives, and whether it's feasible at all.
Run log: https://paste.laravel.io/ce6a36ef-1453-4788-84ac-9bc54b347733
TLDR: To save costs, I want to run inference on models by offloading to disk (NVMe). Tried DeepSpeed but couldn't make it work; would appreciate some suggestions and insights.
Queasy-Contract9753@reddit
Did you get this to work? Or try an MoE?
badatgaems@reddit
I've been looking into using SSD RAID arrays with GDS mode to accomplish this, but I haven't actually got an environment up to test in yet, so I'm very interested in your results.
Have you tried adjusting the config param: "stage3_max_live_parameters": 1.000000e+09
Which has a description in the docs: "The maximum number of parameters resident per GPU before releasing. Smaller values use less memory, but perform more communication (default 1e9)"
https://www.deepspeed.ai/docs/config-json/#zero-optimizations-for-fp16-training
If the parameter actually applies during inference and not just training, it seems like the offload by default forces there to be only 1 billion parameters on the GPU at any one time, which with a 16-bit data type would mean something like a 2 GB (16 Gb) limit for the space allocated to parameter weights on the GPU.
To test whether that's the issue, you could try setting the limit to 1e10, giving it 10x the default allocation, and see if that changes the result much.
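Roughly the shape of config I have in mind, as an untested sketch (the model name, nvme_path, buffer sizes and the bumped stage3_max_live_parameters are placeholders/guesses, following the usual Hugging Face + ZeRO-3 inference pattern):

```python
# Untested sketch: ZeRO-Inference with stage-3 NVMe parameter offload.
# Launch with: deepspeed --num_gpus 1 this_script.py
import torch
import deepspeed
from transformers import AutoModelForCausalLM, AutoTokenizer
from transformers.integrations import HfDeepSpeedConfig

model_name = "meta-llama/Llama-3.3-70B-Instruct"  # placeholder

ds_config = {
    "bf16": {"enabled": True},
    "zero_optimization": {
        "stage": 3,
        "stage3_max_live_parameters": 1e10,  # 10x the 1e9 default, per the suggestion above
        "offload_param": {
            "device": "nvme",                # "cpu" for RAM offload instead
            "nvme_path": "/mnt/nvme/ds_offload",  # placeholder path
            "pin_memory": True,
            "buffer_count": 5,
            "buffer_size": 4e9,              # may need to grow to fit the largest tensor
        },
    },
    "aio": {                                 # async I/O tuning for the NVMe reads
        "block_size": 1048576,
        "queue_depth": 16,
        "single_submit": False,
        "overlap_events": True,
        "thread_count": 2,
    },
    "train_micro_batch_size_per_gpu": 1,
}

# Must be created BEFORE from_pretrained so weights stream into ZeRO-3 partitions
# instead of being fully materialized in RAM/VRAM first.
dschf = HfDeepSpeedConfig(ds_config)

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.bfloat16)

engine = deepspeed.initialize(model=model, config_params=ds_config)[0]
engine.module.eval()

inputs = tokenizer("Tell mahabharata in 100 words", return_tensors="pt").to("cuda")
with torch.no_grad():
    out = engine.module.generate(**inputs, max_new_tokens=150)
print(tokenizer.decode(out[0], skip_special_tokens=True))
```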
They have examples up here, which include a llama2 sample, but I see nothing there about that parameter:
https://github.com/deepspeedai/DeepSpeedExamples/blob/master/inference/huggingface/zero_inference/README.md
I don't know if this helps at all, but with the NVMe offload you want to be running something like a 4-disk RAID 0 array where each disk can saturate an x4 link with as low a latency as possible, so you get as close to the card's x16 bus utilisation as possible (the number of drives would likely need to scale with the number of GPUs to prevent bus saturation and additional latency).
Without GDS enabled and the array on the same PCIe switch as the card, you'll likely get too much added latency for it to be worthwhile compared to just using RAM.
GDS also supports only a limited number of filesystems, so make sure the cache/offload drives are formatted appropriately to get GDS working; I think locally you're fine with xfs or ext4.
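Rough numbers for the drive count (ballpark assumptions: ~7 GB/s sequential read per fast Gen4 x4 SSD, a Gen4 x16 slot on the card; scale up for Gen5):

```python
# Rough sizing: how many NVMe drives to come close to the GPU's PCIe link.
# Both figures are ballpark assumptions, not measurements.
gen4_x4_drive_gbps = 7.0    # ~7 GB/s sequential read for a fast Gen4 x4 SSD
gen4_x16_link_gbps = 31.5   # raw PCIe 4.0 x16 bandwidth

drives_needed = gen4_x16_link_gbps / gen4_x4_drive_gbps
print(f"~{drives_needed:.1f} Gen4 x4 drives to saturate a Gen4 x16 link")
# -> ~4.5 drives, which is why a 4-disk RAID 0 is roughly the right shape
```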
Vegetable_Low2907@reddit
This is an incredible application for Intel Optane drives - such a shame they're not in production any longer!
Why did you black out the GPU model?
jazir555@reddit
Has anyone tried to use Direct Storage to speed up SSD offloading?
GRIFFITHUUU@reddit (OP)
I saw Nvidia GPUDirect Storage mentioned in the DeepNVMe README:
--use_gds is set to enable NVIDIA GDS and move parameters directly between the NVMe and GPU, otherwise an intermediate CPU bounce buffer will be used to move the parameters between the NVMe and GPU.
Valuable_Issue_@reddit
I was looking at Optanes a few days ago wondering what the performance would be like compared to a high-end NVMe SSD (in LLM inference). SSD offloading is quite rare by itself, let alone something as niche as Optane. Do you know if there are any benchmarks?
GRIFFITHUUU@reddit (OP)
I could not find benchmarks for any newer models; check these out:
DeepSpeedExamples/inference/huggingface/zero_inference/README.md at master · deepspeedai/DeepSpeedExamples
DeepNVMe: Affordable I/O scaling for Deep Learning Applications – PyTorch
GRIFFITHUUU@reddit (OP)
Yeah, Intel Optane drives were crazy. And I just hid the name of the VM (just in case I'm not supposed to share it), not the GPU name.
Commercial-Celery769@reddit
For speed increases you would need something like RAID 0 PCIe Gen 5 NVMe, and even then I'm not sure what the speed would be.
kryptkpr@reddit
It doesn't really make sense to SSD-offload a dense model; these techniques were developed for MoE, where you don't need to read all the weights and mostly need "storage".
This method is ~10-30x worse than CPU/RAM offload, so your numbers check out.
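Quick back-of-envelope on why (bandwidth figures are rough assumptions, not measurements):

```python
# Why SSD offload of a dense model is so slow: every token needs every weight
# that doesn't fit in VRAM to be streamed in again. All figures are ballpark.
params          = 70e9
bytes_per_param = 2                                 # BF16
weights_gb      = params * bytes_per_param / 1e9    # ~140 GB

vram_gb  = 48                                       # L40S
spill_gb = weights_gb - vram_gb                     # ~92 GB re-read per token

nvme_gbps = 7.0    # single fast Gen4 NVMe, sequential read
ram_gbps  = 80.0   # rough modern dual-channel DDR5 figure

print(f"NVMe-bound ceiling: {nvme_gbps / spill_gb:.3f} tok/s")  # ~0.08 tok/s
print(f"RAM-bound ceiling:  {ram_gbps / spill_gb:.2f} tok/s")   # ~0.9 tok/s
```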
GRIFFITHUUU@reddit (OP)
Hmm, makes sense. For CPU offloading, is llama.cpp the best available option? I'm willing to work with more complex tools if I can squeeze out a little more performance, but the GGUF support and great quants from bartowski and unsloth make llama.cpp appealing.
kryptkpr@reddit
For CPU alone, llama.cpp is your best bet.
For hybrid GPU/CPU you can get some reward from digging into ik_llama.cpp and trying their special quants that work with its fused MoE kernels.
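If you end up scripting the hybrid split rather than using the CLI, the mainline llama-cpp-python bindings expose the layer-offload knob directly (sketch only; the model path and layer count are placeholders, and this doesn't cover ik_llama.cpp's extra options):

```python
# Hybrid GPU/CPU offload with llama-cpp-python (mainline llama.cpp bindings).
# n_gpu_layers controls how many transformer layers live in VRAM; the rest
# stay in system RAM. Model path and layer count are placeholders.
from llama_cpp import Llama

llm = Llama(
    model_path="/models/Llama-3.3-70B-Instruct-Q8_0.gguf",  # placeholder path
    n_gpu_layers=40,   # tune until VRAM is nearly full; -1 = offload everything
    n_ctx=4096,
)

out = llm("Tell mahabharata in 100 words", max_tokens=150)
print(out["choices"][0]["text"])
```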
GRIFFITHUUU@reddit (OP)
Will look into it, thank you!
BABA_yaaGa@reddit
I am figuring out a way to offload a larger-than-memory model on my M4 Max MBP. Any help with an inference engine that supports the Metal backend would be appreciated.