CUDA + ROCm simultaneously with -DGGML_BACKEND_DL=ON!
Posted by LegacyRemaster@reddit | LocalLLaMA | 27 comments
I invested quite a bit of time and it wasn't easy, but I can finally run models like MiniMax M2.7 Q4 using CUDA+ROCm at the same time, bypassing Vulkan.
load_tensors: offloaded 63/63 layers to GPU
load_tensors: CUDA0 model buffer size = 83650.42 MiB
load_tensors: CUDA_Host model buffer size = 622.76 MiB
load_tensors: ROCm0 model buffer size = 40314.35 MiB
The main advantage is prefill speed.
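If you want to put a number on that yourself, llama.cpp's bundled llama-bench tool times prompt processing (prefill) separately from token generation. A minimal sketch, reusing the model path from the command further below; the -p/-n sizes are my arbitrary picks, not the OP's:
:: -p 2048 times prompt processing (prefill); -n 128 times token generation.
llama-bench -m "H:\gptmodel\unsloth\MiniMax-M2.7-GGUF\MiniMax-M2.7-UD-Q4_K_S-00001-of-00004.gguf" -p 2048 -n 128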
On Windows:
rmdir /s /q build
cmake -B build -G Ninja ^
-DCMAKE_C_COMPILER="C:/Program Files/AMD/ROCm/6.4/bin/clang-cl.exe" ^
-DCMAKE_CXX_COMPILER="C:/Program Files/AMD/ROCm/6.4/bin/clang-cl.exe" ^
-DCMAKE_HIP_COMPILER="C:/Program Files/AMD/ROCm/6.4/bin/clang-cl.exe" ^
-DCMAKE_PREFIX_PATH="C:/Program Files/AMD/ROCm/6.4" ^
-DHIP_ROOT_DIR="C:/Program Files/AMD/ROCm/6.4" ^
-DGGML_HIP=ON ^
-DGGML_CUDA=ON ^
-DGGML_BACKEND_DL=ON ^
-DGGML_CPU_ALL_VARIANTS=ON ^
-DGGML_AVX_VNNI=OFF ^
-DGGML_AVX512=OFF ^
-DGGML_AVX512_VBMI=OFF ^
-DGGML_AVX512_VNNI=OFF ^
-DGGML_AVX512_BF16=OFF ^
-DGGML_AMX_TILE=OFF ^
-DGGML_AMX_INT8=OFF ^
-DGGML_AMX_BF16=OFF ^
-DCMAKE_CUDA_COMPILER="C:/Program Files/NVIDIA GPU Computing Toolkit/CUDA/v13.1/bin/nvcc.exe" ^
-DCMAKE_CUDA_ARCHITECTURES="120" ^
-DCMAKE_BUILD_TYPE=Release
___________________
cmake --build build -j
_______________________
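With -DGGML_BACKEND_DL=ON, each backend is built as a separately loadable module instead of being linked into the binary. As a quick sanity check you can look for both modules after the build; the file names below are my assumption based on recent llama.cpp layouts, not something from the original post:
:: Both backend modules should exist after a successful build; ggml-cuda.dll
:: drives the NVIDIA card, ggml-hip.dll the AMD one, loaded at startup.
dir build\bin\ggml-cuda.dll
dir build\bin\ggml-hip.dll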
Unfortunately, the -DGGML_CPU_ALL_VARIANTS=ON flag caused many compilation errors, and I had to edit the build files. For example:
notepad C:\llm\llamacpp\ggml\src\CMakeLists.txt
and comment out (or delete) the alderlake variant line: # ggml_add_cpu_backend_variant(alderlake SSE42 AVX F16C FMA AVX2 BMI2 AVX_VNNI)
With a Ryzen 5950X that's fine, since the alderlake variant targets Intel CPUs anyway.
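An untested alternative, my suggestion rather than the OP's: skip the multi-variant CPU build entirely and compile a single CPU backend tuned to the host machine, which sidesteps the alderlake errors when runtime dispatch isn't needed:
:: Hypothetical sketch: replace -DGGML_CPU_ALL_VARIANTS=ON in the configure
:: command above with these two flags; keep every other flag unchanged.
cmake -B build -G Ninja -DGGML_CPU_ALL_VARIANTS=OFF -DGGML_NATIVE=ON ^
-DGGML_HIP=ON -DGGML_CUDA=ON -DGGML_BACKEND_DL=ON [... remaining flags as above ...]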
Then:
set PATH=C:\Program Files\AMD\ROCm\6.4\bin;%PATH%
llama-server.exe --model "H:\gptmodel\unsloth\MiniMax-M2.7-GGUF\MiniMax-M2.7-UD-Q4_K_S-00001-of-00004.gguf" --ctx-size 91920 --threads 16 --host 127.0.0.1 --no-mmap --jinja --fit on --flash-attn on -sm layer --n-cpu-moe 0 --cache-type-k q8_0 --cache-type-v q8_0 --parallel 1
Done.
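To confirm at runtime that both backends are actually picked up, recent llama.cpp builds can print the detected devices; hedged, since the flag depends on your build being new enough:
:: Should list both the NVIDIA (CUDA0) and AMD (ROCm0) devices if the
:: dynamically loaded backends initialized correctly.
llama-server.exe --list-devices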
el95149@reddit
Been doing this for quite a while. My "franken"-setup (an R9700, a 5080, and an old 2060 Super I had lying around) drove me to dig into all the ways I could "milk the cow" performance-wise, aside from Vulkan. At first Vulkan performed better, but recently, with all the llama.cpp enhancements, the native backend builds have surpassed it, so I've been sticking with that.
LegacyRemaster@reddit (OP)
thx!!!
No-Manufacturer-3315@reddit
As someone with a 7900 XT and a 4090: thank you!!! I didn't know this blasphemy was possible! I've been running Vulkan!
milpster@reddit
Just tried it and it made my Qwen 3.6 27B output only /////// without end.
LegacyRemaster@reddit (OP)
tested Qwen3.5-397B-A17B-UD-IQ2_M-00001-of-00004.gguf now. Works.
YairHairNow@reddit
Interesting. So could I pair an R9700 with my 5080? Or how about using my 5080 with the iGPU on my 9950X? I've been wondering about something like this, but I always read it was extremely difficult to do.
xspider2000@reddit
It's easy using Vulkan.
LegacyRemaster@reddit (OP)
Yes, you can! It's just painful to compile, but yeah... all good.
Few_Water_1457@reddit
load_tensors: CUDA0 model buffer size = 83650.42 MiB
load_tensors: CUDA_Host model buffer size = 622.76 MiB
load_tensors: ROCm0 model buffer size = 40314.35 MiB ---> It seems so, I have to try it.
FullstackSensei@reddit
What's the hardware setup? 57 t/s on Q4_S on what seem to be pretty expensive GPUs seems a bit... slow. Is one of the GPUs starving for bandwidth?
FWIW, I run Q4_K_XL, which is 10GB larger, on six Mi50s, which combined have maybe 1/7th the tensor TFLOPS of a single RTX 6000 Pro, and I get 30 t/s. All six, before prices went bananas, cost probably less than the heatsink of a 6000 Pro.
LegacyRemaster@reddit (OP)
here "full vulkan" load_tensors: Vulkan0 model buffer size = 83650.42 MiB
load_tensors: Vulkan1 model buffer size = 40314.35 MiB
load_tensors: Vulkan_Host model buffer size = 622.76 MiB
ggml_vulkan: Pinned memory disabled, using CPU memory fallback for 1 MB
ggml_vulkan: Pinned memory disabled, using CPU memory fallback for 1 MB
ggml_vulkan: Pinned memory disabled, using CPU memory fallback for 1 MB
ggml_vulkan: Pinned memory disabled, using CPU memory fallback for 1 MB
_____________________________
I made a code change to disable pinned memory, which puts a heavy load on system RAM during loading (if the model exceeds RAM, llama.cpp crashes, which is why I tested ROCm+CUDA). I have outdated drivers, the RTX 6000 is limited to 300W instead of 600W, and the W7800 to 240W instead of 300W. These are the "full Vulkan" numbers, but they degrade much faster in both prefill and long context than with CUDA+ROCm.
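For the CUDA side there's an environment variable that disables pinned memory without any code change; it's documented in (at least older) llama.cpp docs, though I can't say whether a Vulkan equivalent exists, so treat this as a CUDA-only analogue of the patch described above:
:: GGML_CUDA_NO_PINNED makes the CUDA backend fall back to regular pageable
:: host memory instead of pinned buffers, easing RAM pressure while loading.
set GGML_CUDA_NO_PINNED=1
llama-server.exe --model ... [same flags as in the post]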
FullstackSensei@reddit
But why Vulkan though? You said you can have CUDA+ROCm using GGML_BACKEND_DL. How are the GPUs connected? How many PCIe lanes does each get? Your power limits are very reasonable, and I'd still expect significantly higher t/s given the hardware.
My potato Mi50s are power limited to 170W and during inference only one is active at any given moment reaching 120W max. But they all have x16 links (albeit Gen 3).
What change did you make to the llama.cpp code? Do you have a fork or a gist with it? It might be worth submitting a PR. I've noticed that when I first load a model it loads at 1.1-1.5GB/s, even though I have an NVMe RAID-0 that can do 11GB/s. The second load is super fast, though, because it's cached in RAM.
LegacyRemaster@reddit (OP)
The changes I make to the code are automated (VS Code + Kilo Code), and since I run two businesses, I don't have time to review and optimize code. For example, I had fun trying to get DeepSeek V4 Flash to work with Vulkan here: https://github.com/Fringe210/llama.cpp-deepseek-v4-flash-cuda , but it certainly can't be used in production, and we'll have to wait for the llama.cpp pros to get it working properly.
Back to your question: Asus Pro WS X570-Ace, with 3x PCIe 4.0 x8 slots.
As for ROCm+CUDA: ZERO changes to llama.cpp, except disabling the alderlake variant in the CMakeLists:
notepad C:\llm\llamacpp\ggml\src\CMakeLists.txt
and commenting out the line: # ggml_add_cpu_backend_variant(alderlake SSE42 AVX F16C FMA AVX2 BMI2 AVX_VNNI)
FullstackSensei@reddit
Can you ask your favorite LLM via Kilo Code to make a gist of the pinned memory changes to speed up loading?
I run Xeons and Epyc, so no big-little confusion. The former have AVX-512, which makes them quite comparable to Epyc with twice the core count for hybrid inference.
LegacyRemaster@reddit (OP)
you are welcome
_________________________________
Daemontatox@reddit
What in the registry fuck is going on here?
Can he do that? Is that legal?
FullstackSensei@reddit
While not widely known, it's been supported for quite a while. If you search for the compilation flag on this sub, you'll see comments mentioning it going back at least to August or September of last year, IIRC.
LegacyRemaster@reddit (OP)
True, but it's very hard to compile without clear steps.
FullstackSensei@reddit
Is it though? Your cmake command is basically a merge of the CUDA and ROCm build flags. I have both Nvidia and AMD, though not in the same system (yet), so I'm familiar with building for either.
Koksny@reddit
What's the setup? Two RTXs, and the ROCm comes from an integrated Vega?
FullstackSensei@reddit
Smells like an RTX 6000 Pro; there's only CUDA0, with 83GB. CUDA_Host is system memory.
ROCm0 seems to be 48GB, maybe a Radeon Pro W7800 or 7900 (the pro version of the 7900 XTX).
I'm surprised they're running them on Windows, and using ROCm 6.4. Switching to Linux and going to 7.1 would eke out some more performance.
LegacyRemaster@reddit (OP)
You're right. I still have to test Windows 11 and Linux; this is Windows 10.
FullstackSensei@reddit
W11 will make you go crazy. Just save yourself the hassle and go to Linux
LegacyRemaster@reddit (OP)
I have Linux, but about 4 different SSDs to boot from. I need to find time to migrate the whole setup I've built over the years on Windows 10 to Linux.
Sisuuu@reddit
I'm doing the same: an Ubuntu server on a cheap old Threadripper system with 2x RTX 3090 + an RX 6800 XT... with those builds they work out of the box, and I can run huge models across all GPUs. The only thing I wish I could figure out is how to have Pi.dev/opencore use one model on the 2x RTX 3090 with higher quant and context, and another model on the RX 6800 XT, utilizing both models simultaneously in agentic coding work!
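One way to get that with plain llama.cpp, sketched under the assumption that your build is recent enough to support the --device selector (the model file names here are placeholders): run two independent server instances, each pinned to its own GPU set and port, and point the coding agent at whichever endpoint it needs.
:: Untested sketch: big model on the two 3090s, second model on the 6800 XT.
llama-server --device CUDA0,CUDA1 -m big-model.gguf --port 8080
llama-server --device ROCm0 -m small-model.gguf --port 8081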
Vaguswarrior@reddit
Sorry, I'm pretty new; I thought the CPU didn't matter for most things?
wbulot@reddit
I use a mixed setup too, with an AMD GPU and an NVIDIA one. But I use Vulkan for the AMD card and CUDA for the NVIDIA one, and they work pretty well together.