CUDA + ROCm simultaneously with -DGGML_BACKEND_DL=ON!
Posted by LegacyRemaster@reddit | LocalLLaMA | 27 comments
I invested quite a bit of time and it wasn't easy, but I can finally run models like MiniMax M2.7 Q4 using CUDA+ROCm at the same time, bypassing Vulkan.
load_tensors: offloaded 63/63 layers to GPU
load_tensors: CUDA0 model buffer size = 83650.42 MiB
load_tensors: CUDA_Host model buffer size = 622.76 MiB
load_tensors: ROCm0 model buffer size = 40314.35 MiB
The main advantage is prefill speed.
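If you want to put a number on that yourself, llama.cpp's bundled llama-bench tool times prompt processing (prefill) separately from token generation. A minimal sketch, reusing the model path from the command further below; the -p/-n sizes are my arbitrary picks, not the OP's:
:: -p 2048 times prompt processing (prefill); -n 128 times token generation.
llama-bench -m "H:\gptmodel\unsloth\MiniMax-M2.7-GGUF\MiniMax-M2.7-UD-Q4_K_S-00001-of-00004.gguf" -p 2048 -n 128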
On Windows:
rmdir /s /q build
cmake -B build -G Ninja ^
-DCMAKE_C_COMPILER="C:/Program Files/AMD/ROCm/6.4/bin/clang-cl.exe" ^
-DCMAKE_CXX_COMPILER="C:/Program Files/AMD/ROCm/6.4/bin/clang-cl.exe" ^
-DCMAKE_HIP_COMPILER="C:/Program Files/AMD/ROCm/6.4/bin/clang-cl.exe" ^
-DCMAKE_PREFIX_PATH="C:/Program Files/AMD/ROCm/6.4" ^
-DHIP_ROOT_DIR="C:/Program Files/AMD/ROCm/6.4" ^
-DGGML_HIP=ON ^
-DGGML_CUDA=ON ^
-DGGML_BACKEND_DL=ON ^
-DGGML_CPU_ALL_VARIANTS=ON ^
-DGGML_AVX_VNNI=OFF ^
-DGGML_AVX512=OFF ^
-DGGML_AVX512_VBMI=OFF ^
-DGGML_AVX512_VNNI=OFF ^
-DGGML_AVX512_BF16=OFF ^
-DGGML_AMX_TILE=OFF ^
-DGGML_AMX_INT8=OFF ^
-DGGML_AMX_BF16=OFF ^
-DCMAKE_CUDA_COMPILER="C:/Program Files/NVIDIA GPU Computing Toolkit/CUDA/v13.1/bin/nvcc.exe" ^
-DCMAKE_CUDA_ARCHITECTURES="120" ^
-DCMAKE_BUILD_TYPE=Release
___________________
cmake --build build -j
_______________________
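With -DGGML_BACKEND_DL=ON, each backend is built as a separately loadable module instead of being linked into the binary. As a quick sanity check you can look for both modules after the build; the file names below are my assumption based on recent llama.cpp layouts, not something from the original post:
:: Both backend modules should exist after a successful build; ggml-cuda.dll
:: drives the NVIDIA card, ggml-hip.dll the AMD one, loaded at startup.
dir build\bin\ggml-cuda.dll
dir build\bin\ggml-hip.dll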
Unfortunately, the -DGGML_CPU_ALL_VARIANTS=ON flag caused many compilation errors, and I had to edit the build files. For example:
notepad C:\llm\llamacpp\ggml\src\CMakeLists.txt
and comment out (or delete) the alderlake variant line: # ggml_add_cpu_backend_variant(alderlake SSE42 AVX F16C FMA AVX2 BMI2 AVX_VNNI)
With a Ryzen 5950X that's fine, since the alderlake variant targets Intel CPUs anyway.
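An untested alternative, my suggestion rather than the OP's: skip the multi-variant CPU build entirely and compile a single CPU backend tuned to the host machine, which sidesteps the alderlake errors when runtime dispatch isn't needed:
:: Hypothetical sketch: replace -DGGML_CPU_ALL_VARIANTS=ON in the configure
:: command above with these two flags; keep every other flag unchanged.
cmake -B build -G Ninja -DGGML_CPU_ALL_VARIANTS=OFF -DGGML_NATIVE=ON ^
-DGGML_HIP=ON -DGGML_CUDA=ON -DGGML_BACKEND_DL=ON [... remaining flags as above ...]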
Then:
set PATH=C:\Program Files\AMD\ROCm\6.4\bin;%PATH%
llama-server.exe --model "H:\gptmodel\unsloth\MiniMax-M2.7-GGUF\MiniMax-M2.7-UD-Q4_K_S-00001-of-00004.gguf" --ctx-size 91920 --threads 16 --host 127.0.0.1 --no-mmap --jinja --fit on --flash-attn on -sm layer --n-cpu-moe 0 --cache-type-k q8_0 --cache-type-v q8_0 --parallel 1
Done.
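To confirm at runtime that both backends are actually picked up, recent llama.cpp builds can print the detected devices; hedged, since the flag depends on your build being new enough:
:: Should list both the NVIDIA (CUDA0) and AMD (ROCm0) devices if the
:: dynamically loaded backends initialized correctly.
llama-server.exe --list-devices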
el95149@reddit
Been doing this for quite a while. My "franken"-setup (an R9700, a 5080, and an old 2060 Super I had lying around) drove me to dig into all the ways I could "milk the cow" performance-wise, aside from Vulkan. At first Vulkan performed better, but recently, with all the llama.cpp enhancements, the native backend builds have surpassed it, so I've been sticking with that.
LegacyRemaster@reddit (OP)
thx!!!
No-Manufacturer-3315@reddit
As someone with a 7900 XT and a 4090: thank you!!! I didn't know this blasphemy was possible! I've been running Vulkan!
milpster@reddit
Just tried it and it made my Qwen 3.6 27B output only /////// without end.
LegacyRemaster@reddit (OP)
tested Qwen3.5-397B-A17B-UD-IQ2_M-00001-of-00004.gguf now. Works.
YairHairNow@reddit
Interesting. So could I pair an R9700 with my 5080? Or how about using my 5080 with the iGPU on my 9950X? I've been wondering about something like this, but I always read it was extremely difficult to do.
xspider2000@reddit
It's easy using Vulkan.
LegacyRemaster@reddit (OP)
Yes, you can! It's just painful to compile, but yeah... all good.
Few_Water_1457@reddit
load_tensors: CUDA0 model buffer size = 83650.42 MiB
load_tensors: CUDA_Host model buffer size = 622.76 MiB
load_tensors: ROCm0 model buffer size = 40314.35 MiB ---> It seems so, I have to try it.
FullstackSensei@reddit
What's the hardware setup? 57 t/s on Q4_S on what seem to be pretty expensive GPUs seems a bit... slow. Is one of the GPUs starving for bandwidth?
FWIW, I run Q4_K_XL, which is 10GB larger, on six Mi50s, which combined have maybe 1/7th the tensor TFLOPS of a single RTX 6000 Pro, and I get 30 t/s. All six, before prices went bananas, cost probably less than the heatsink of a 6000 Pro.
LegacyRemaster@reddit (OP)
here "full vulkan" load_tensors: Vulkan0 model buffer size = 83650.42 MiB
load_tensors: Vulkan1 model buffer size = 40314.35 MiB
load_tensors: Vulkan_Host model buffer size = 622.76 MiB
ggml_vulkan: Pinned memory disabled, using CPU memory fallback for 1 MB
ggml_vulkan: Pinned memory disabled, using CPU memory fallback for 1 MB
ggml_vulkan: Pinned memory disabled, using CPU memory fallback for 1 MB
ggml_vulkan: Pinned memory disabled, using CPU memory fallback for 1 MB
_____________________________
I made a code change to disable pinned memory, which puts a heavy load on system RAM during loading (if the model exceeds RAM, llama.cpp crashes, which is why I tested ROCm+CUDA). I have outdated drivers, the RTX 6000 is limited to 300W instead of 600W, and the W7800 to 240W instead of 300W. These are the "full Vulkan" numbers, but they degrade much faster in both prefill and long context than with CUDA+ROCm.
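For the CUDA side there's an environment variable that disables pinned memory without any code change; it's documented in (at least older) llama.cpp docs, though I can't say whether a Vulkan equivalent exists, so treat this as a CUDA-only analogue of the patch described above:
:: GGML_CUDA_NO_PINNED makes the CUDA backend fall back to regular pageable
:: host memory instead of pinned buffers, easing RAM pressure while loading.
set GGML_CUDA_NO_PINNED=1
llama-server.exe --model ... [same flags as in the post]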
FullstackSensei@reddit
But why Vulkan though? You said you can have CUDA+ROCm using GGML_BACKEND_DL. How are the GPUs connected? How many PCIe lanes does each get? Your power limits are very reasonable, and I'd still expect significantly higher t/s given the hardware.
My potato Mi50s are power limited to 170W and during inference only one is active at any given moment reaching 120W max. But they all have x16 links (albeit Gen 3).
What change did you make to the llama.cpp code? Do you have a fork or a gist with it? It might be worth submitting a PR. I've noticed that when I first load a model it loads at 1.1-1.5GB/s, even though I have an NVMe RAID-0 that can do 11GB/s. The second load is super fast, though, because it's cached in RAM.
LegacyRemaster@reddit (OP)
The changes I make to the code are automated (VS Code + Kilo Code), and since I run two businesses, I don't have time to review and optimize code. For example, I had fun trying to get DeepSeek V4 Flash to work with Vulkan here: https://github.com/Fringe210/llama.cpp-deepseek-v4-flash-cuda , but it certainly can't be used in production, and we'll have to wait for the llama.cpp pros to get it working properly.
Back to your question: Asus Pro WS X570-Ace, with 3x PCIe 4.0 x8 slots.
As for ROCm+CUDA: ZERO changes to llama.cpp, except disabling the alderlake variant in the CMakeLists:
notepad C:\llm\llamacpp\ggml\src\CMakeLists.txt
and commenting out the line: # ggml_add_cpu_backend_variant(alderlake SSE42 AVX F16C FMA AVX2 BMI2 AVX_VNNI)
FullstackSensei@reddit
Can you ask your favorite LLM via Kilo Code to make a gist of the pinned memory changes to speed up loading?
I run Xeons and Epyc, so no big-little confusion. The former have AVX-512, which makes them quite comparable to Epyc with twice the core count for hybrid inference.
LegacyRemaster@reddit (OP)
you are welcome
_________________________________
Daemontatox@reddit
What in the registry fuck is going on here?
Can he do that? Is that legal?
FullstackSensei@reddit
While not widely known, it's been supported for quite a while. If you search for the compilation flag on this sub, you'll see comments mentioning it going back at least to August or September of last year, IIRC.
LegacyRemaster@reddit (OP)
True, but it's very hard to compile without clear steps.
FullstackSensei@reddit
Is it though? Your cmake command is basically a merge of the CUDA and ROCm build flags. I have both Nvidia and AMD, though not in the same system (yet), so I'm familiar with building for either.
Koksny@reddit
What's the setup? Two RTXs, and the ROCm comes from an integrated Vega?
FullstackSensei@reddit
Smells like an RTX 6000 Pro; there's only CUDA0, with 83GB. CUDA_Host is system memory.
ROCm0 seems to be 48GB, maybe a Radeon Pro W7800 or 7900 (the pro version of the 7900 XTX).
I'm surprised they're running them on Windows, and using ROCm 6.4. Switching to Linux and going to 7.1 would eke out some more performance.
LegacyRemaster@reddit (OP)
You're right. I still have to test Windows 11 and Linux; this is Windows 10.
FullstackSensei@reddit
W11 will make you go crazy. Just save yourself the hassle and go to Linux
LegacyRemaster@reddit (OP)
I have Linux, but about 4 different SSDs to boot from. I need to find time to migrate the whole setup I've built over the years on Windows 10 to Linux.
Sisuuu@reddit
I'm doing the same: an Ubuntu server on a cheap old Threadripper system with 2x RTX 3090 + an RX 6800 XT... with those builds they work out of the box, and I can run huge models across all GPUs. The only thing I wish I could figure out is how to have Pi.dev/opencore use one model on the 2x RTX 3090 with higher quant and context, and another model on the RX 6800 XT, utilizing both models simultaneously in agentic coding work!
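One way to get that with plain llama.cpp, sketched under the assumption that your build is recent enough to support the --device selector (the model file names here are placeholders): run two independent server instances, each pinned to its own GPU set and port, and point the coding agent at whichever endpoint it needs.
:: Untested sketch: big model on the two 3090s, second model on the 6800 XT.
llama-server --device CUDA0,CUDA1 -m big-model.gguf --port 8080
llama-server --device ROCm0 -m small-model.gguf --port 8081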
Vaguswarrior@reddit
Sorry, I'm pretty new; I thought the CPU didn't matter for most things?
wbulot@reddit
I use a mixed setup too, with an AMD GPU and an NVIDIA one. But I use Vulkan for the AMD card and CUDA for the NVIDIA one, and they work pretty well together.