Nvidia cards using too much VRAM?
Posted by Maxumilian@reddit | LocalLLaMA | 14 comments
So I've been running on a 7900 XTX + 6800 XT until uh, yesterday. This combo had 40GB of VRAM and I was able to load and run 37GB Models fine even with like 32K context. It just... Worked. It was fast too.
I just upgraded to a 5090 + 5060 Ti 16GB because I mainly wanted some more gaming oomph, and it's still 8GB more VRAM... Weirdly enough, I now cannot load and use the 37GB models I was using before. It just complains there's not enough VRAM.
Even when loading like a 19GB model it's using 28GB of VRAM.
I assume this is a configuration issue on my end? But I'm not sure what the cause would be or where to start diagnosing, because I'm using all the same settings I did on my AMD cards.
tmvr@reddit
What are you using, and how? For example, loading Qwen2.5 Coder 32B at Q4_K_M (which is also close to 19GB) with 32768 ctx and an unquantized KV cache only takes up about 22GB here with llama.cpp.
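Rough math for where the VRAM on top of the weights file goes: a minimal sketch of the KV-cache size for an unquantized (fp16) cache. The model-shape numbers below are illustrative placeholders, not measurements from this thread; read the real ones from your model's card or GGUF metadata, and remember the backend also adds compute buffers on top.

```python
# Back-of-the-envelope KV-cache size for an unquantized (fp16) cache.
# Model-shape numbers are placeholders; read the real ones from your model's metadata.
def kv_cache_gib(n_layers: int, n_kv_heads: int, head_dim: int,
                 ctx_len: int, bytes_per_elem: int = 2) -> float:
    # 2x for keys and values, fp16 = 2 bytes per element
    return 2 * n_layers * n_kv_heads * head_dim * ctx_len * bytes_per_elem / 2**30

# Example: a 64-layer model with GQA (8 KV heads, head_dim 128) at 32K context
print(f"{kv_cache_gib(64, 8, 128, 32768):.1f} GiB")  # ~8 GiB on top of the weights
```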
pulse77@reddit
Kernel binaries for CUDA and ROCm have different sizes, which means different VRAM usage (the kernels must also be loaded into VRAM, next to the model parameters and context)!
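If you want to see that fixed overhead on its own, a minimal sketch (assuming a Python environment with a CUDA-enabled torch) is to just spin up the CUDA context and check how much VRAM is already gone before any weights are loaded:

```python
# Minimal sketch: VRAM in use before any model weights are loaded.
# Note this also includes whatever the Windows desktop / other apps are using,
# not just the CUDA context and kernel images.
import torch

torch.cuda.init()
_ = torch.zeros(1, device="cuda")           # forces context creation + kernel module load
free, total = torch.cuda.mem_get_info(0)    # driver-level free/total, in bytes
print(f"GPU 0: {(total - free) / 2**20:.0f} MiB already in use")
```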
Marksta@reddit
What inference engine are you using? You need to give some info. Preferably, give us the llama.cpp command you're running, or tell us what flavor of llama.cpp wrapper obfuscation you're using.
Maxumilian@reddit (OP)
Updated post, just using pre-built KoboldCPP on Win 11 to load everything. Super simple setup.
Marksta@reddit
I'd say mess with the tensor split a little and see if you can get it to fit. It's annoying when cards have different VRAM; there's no sure way to tell, you just need to play with the numbers a little and see what doesn't OOM (rough sketch below). 37GB in 32+16 should definitely work, even with buffers and such.
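Something like this (hypothetical helper, not part of KoboldCPP itself) can give you a starting ratio to put into the Tensor Split field / llama.cpp's `--tensor-split` flag, based on free VRAM per card with some headroom left for KV cache and buffers:

```python
# Hypothetical starting point for a tensor split, based on free VRAM per card.
# Leaves a couple of GiB of headroom on each GPU for KV cache / compute buffers;
# tweak from there until nothing OOMs.
import torch

def suggest_split(headroom_gib: float = 2.0) -> str:
    parts = []
    for i in range(torch.cuda.device_count()):
        free, _total = torch.cuda.mem_get_info(i)
        parts.append(max(free / 2**30 - headroom_gib, 0.0))
    return ",".join(f"{p:.1f}" for p in parts)

print("try something like: --tensor-split", suggest_split())
```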
Running a 37GB model on your original two GPUs totaling 40GB, with one of them also driving the Windows GUI, plus context on top, doesn't actually sound all that realistic.
I'm thinking your AMD setup had the VRAM overflow/swap setting on and you were overflowing into system RAM, or maybe llama.cpp's mmap was on and making it possible. Now, on the Nvidia setup without that memory swap or mmap setting, the 16GB card is going OOM instead of overflowing.
Maxumilian@reddit (OP)
Seems like it's a difference between the Vulkan and CUDA backends. When I use the Vulkan backend the cards use several fewer gigs of VRAM, which is... enough of a difference to get it to load.
Sadly it seems to process the prompt like molasses, which is odd. Even using Vulkan on my AMD cards it was still able to process the prompt super fast. I can't remember the exact tokens per second, but it was definitely way faster than this.
Maxumilian@reddit (OP)
Certainly possible... I'm not sure what default flags the YellowRoseCx build sets behind the scenes...
Past-Grapefruit488@reddit
What is the output of nvidia-smi?
Maxumilian@reddit (OP)
Updated post
960be6dde311@reddit
What does nvidia-smi say about your driver and CUDA version?
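If you'd rather grab it programmatically, a small sketch using the nvidia-ml-py (pynvml) bindings pulls roughly the same driver/CUDA info and per-card memory that nvidia-smi prints:

```python
# Sketch: report driver version, CUDA driver version and per-GPU memory use,
# i.e. roughly the header and memory columns of nvidia-smi.
# Requires the nvidia-ml-py package (imported as pynvml).
import pynvml

pynvml.nvmlInit()
cuda = pynvml.nvmlSystemGetCudaDriverVersion()          # e.g. 12080 -> CUDA 12.8
print("driver:", pynvml.nvmlSystemGetDriverVersion())
print(f"CUDA (driver): {cuda // 1000}.{(cuda % 1000) // 10}")
for i in range(pynvml.nvmlDeviceGetCount()):
    h = pynvml.nvmlDeviceGetHandleByIndex(i)
    mem = pynvml.nvmlDeviceGetMemoryInfo(h)
    print(i, pynvml.nvmlDeviceGetName(h),
          f"{mem.used / 2**20:.0f} / {mem.total / 2**20:.0f} MiB used")
pynvml.nvmlShutdown()
```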
Maxumilian@reddit (OP)
Updated post.
thirteen-bit@reddit
You'll have to describe the software you're using.
First idea: your current environment is built with ROCm-only support; reinstall it with CUDA support?
If your environment is Python-based, check whether a torch build supporting CUDA 12.8 or higher is installed; 50xx cards will not work with CUDA versions older than 12.8:
https://pytorch.org/get-started/locally/
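A quick way to check that (assuming a Python environment with torch installed), since Blackwell/50xx cards need a build compiled against CUDA 12.8 or newer:

```python
# Sanity check: is this torch build new enough for a 50xx (Blackwell) card?
import torch

print("torch:", torch.__version__)
print("built against CUDA:", torch.version.cuda)   # should be "12.8" or newer for 50xx
print("CUDA available:", torch.cuda.is_available())
for i in range(torch.cuda.device_count()):
    print(i, torch.cuda.get_device_name(i),
          "compute capability", torch.cuda.get_device_capability(i))
```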
Maxumilian@reddit (OP)
Updated post.