OCuLink dGPU for AMD: RX 7600 XT vs RX 7800 XT for LLM — worth the price gap? Also llamacpp + Vulkan vs Ollama + ROCm?
Posted by Pablo_Gates@reddit | LocalLLaMA | View on Reddit | 10 comments
Planning a homelab with a GMKtec K12 (Ryzen 7 H255, 780M iGPU, OCuLink). Phase 1 runs Ollama on the 780M. Phase 2 adds an OCuLink dGPU specifically for LLM (Ollama + Open WebUI), freeing the iGPU for Frigate object detection only.
GPU choice: RX 7600 XT vs RX 7800 XT
- RX 7600 XT: 16GB VRAM (~€330-370). Fits 14B models at Q4 comfortably; 32B at Q4 only with partial CPU offload (Q4_K_M weights alone are ~19GB).
- RX 7800 XT: 16GB VRAM (~€400-450). More compute, same VRAM ceiling.
For LLM use on home hardware, is the RX 7800 XT worth the ~€80-100 premium? My primary use case is Qwen 2.5 14B and eventually Qwen 2.5 32B at Q4. No image generation.
Stack: llamacpp + Vulkan vs Ollama + ROCm
I've seen recommendations to use llamacpp with pre-built Vulkan binaries instead of Ollama for AMD, especially with an OCuLink setup. The binaries are on the llama.cpp GitHub releases page so no compilation is needed.
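Something like this is what I had in mind for grabbing a build (the release tag and asset name here are placeholders — the exact filename changes per release, so check the releases page first):

```shell
# Placeholder tag and asset name -- substitute whatever the current
# release on https://github.com/ggml-org/llama.cpp/releases actually lists.
TAG=b4600
curl -LO "https://github.com/ggml-org/llama.cpp/releases/download/${TAG}/llama-${TAG}-bin-ubuntu-vulkan-x64.zip"
unzip "llama-${TAG}-bin-ubuntu-vulkan-x64.zip" -d llama.cpp-vulkan
# Path inside the zip may differ between releases; sanity-check with:
./llama.cpp-vulkan/build/bin/llama-server --list-devices
```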
Questions:
- For AMD OCuLink dGPU + Linux, is llamacpp + Vulkan noticeably better than Ollama + ROCm in practice?
- Any specific flags for the llamacpp Vulkan build on AMD that make a real difference? I've seen mention of a "fit flag" that simplifies layer allocation.
- OCuLink bandwidth: is there any measurable throughput loss for LLM inference vs a native PCIe slot? The K12 uses OCuLink which is PCIe 4.0 x4.
- Dual GPU scenario: 780M iGPU (Frigate) + dGPU via OCuLink (Ollama) — any complications with ROCm or Vulkan seeing both devices and picking the wrong one?
Running Linux (Ubuntu 24.04 LTS).
IntrepidDig1581@reddit
tbh for pure LLM throughput the 7800 XT's compute advantage over the 7600 XT is real, but with the same VRAM ceiling it's a tough sell at that price difference
Awwtifishal@reddit
llama.cpp is better than ollama, and I noticed vulkan being a bit faster than rocm, so yes, llama.cpp + vulkan is the winning combination.
Fit is enabled by default, you don't need to pass any flags.
PCIe bandwidth is not critical for the typical layer split mode, 4.0 x4 is fine, I use it.
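For example, a typical single-dGPU launch (model path, context size and port are just placeholders):

```shell
# Sketch: offload all layers to the dGPU over Vulkan. Once the weights are
# loaded, per-token traffic over the PCIe link is small, which is why
# OCuLink's 4.0 x4 bandwidth mostly only affects load time.
./llama-server \
  -m ./models/Qwen2.5-14B-Instruct-Q4_K_M.gguf \
  -ngl 99 \
  -c 8192 \
  --port 8080
```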
When you have the same device available in multiple backends, you have to select one. For example, with vulkan and cuda compiled in, I see this with --list-devices:
CUDA0: nvidia
Vulkan0: the same nvidia
Vulkan1: amd
So I need to specify -dev CUDA0,Vulkan1, and if I don't like how it allocates space (usually proportional to free VRAM) I pass a specific tensor split: -ts 50,50 (llama.cpp only uses the proportions, so -ts 1,1 means the same thing).
Pablo_Gates@reddit (OP)
Really useful breakdown on the tensor split.
Quick follow-up: for an all-AMD setup (K12 780M iGPU + dGPU via OCuLink, both Vulkan), does
--list-devices correctly show them as separate Vulkan entries?
Any quirks with two AMD Vulkan devices vs a Nvidia+AMD mix like your example?
Awwtifishal@reddit
AFAIK, each device should have an entry for each backend. But I'm not sure because I couldn't manage to build it with HIP support. And there are no quirks as far as I know: if a model works by itself in all backends, it will work in multiple at the same time, of the same type or mixed. I used -dev CUDA0,Vulkan1, but using -dev Vulkan0,Vulkan1 would also work. If I don't specify anything, I think it tries to use one GPU twice as much, because it treats it as two separate GPUs.
Pablo_Gates@reddit (OP)
Thanks, the all-AMD clarification is exactly what I needed!
Good to know
-dev Vulkan0,Vulkan1 works cleanly with no quirks vs a mixed setup.
One follow-up on Q1 from my original post, which you didn't get to: for someone planning to run Qwen 32B as the daily driver and eventually try 70B, is the RX 7900 XTX (24GB, ~€550) meaningfully better than the RX 7600/7800 XT (16GB, ~€350-420)?
The way I see the VRAM math with tensor split across the 780M (~8GB shared) + dGPU:
Is the real-world difference between those two scenarios as big as the VRAM math suggests, or does 70B with CPU offload end up usable enough that the 16GB option is fine?
Awwtifishal@reddit
Yes, with 24 GB you could fit more weights and more context. Ideally you want to fit all attention, KV cache and shared experts in VRAM, and as much as possible of the regular experts.
70B models are kind of outdated and don't run that well with 24GB (particularly classic architectures that need a lot of KV cache). The biggest dense models nowadays are ~30B, have smaller caches, and perform better at almost any task.
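Rough napkin math (assuming a ballpark ~4.8 bits per weight for Q4_K_M, and ignoring KV cache and runtime overhead):

```shell
# Weight-size estimate: params (billions) * bits-per-weight / 8 = GB.
# 48/80 encodes ~4.8 bpw in integer arithmetic; these are ballpark figures only.
for params in 14 32 70; do
  gb=$(( params * 48 / 80 ))
  echo "${params}B at ~4.8 bpw -> ~${gb} GB of weights"
done
```

So 32B just barely misses a 16GB card even before context, while 70B doesn't fit in 24GB either — which is why the offload question matters more than the card choice there.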
Annual-Constant-5962@reddit
Ask chatgpt
Awwtifishal@reddit
chatgpt is frequently out of date, it may not know that llama.cpp now has fit enabled by default
Annual-Constant-5962@reddit
Ask claude
Awwtifishal@reddit
Claude also has no idea of current llama.cpp functionality; I just asked sonnet 4.6 and it didn't mention autofitting a single time. Then I explicitly asked about it and it said llama.cpp doesn't have it.
All LLMs are outdated and completely unreliable on some topics.