OCuLink dGPU for AMD: RX 7600 XT vs RX 7800 XT for LLM — worth the price gap? Also llamacpp + Vulkan vs Ollama + ROCm?
Posted by Pablo_Gates@reddit | LocalLLaMA | View on Reddit | 10 comments
Planning a homelab with a GMKtec K12 (Ryzen 7 H255, 780M iGPU, OCuLink). Phase 1 runs Ollama on the 780M. Phase 2 adds an OCuLink dGPU specifically for LLM (Ollama + Open WebUI), freeing the iGPU for Frigate object detection only.
GPU choice: RX 7600 XT vs RX 7800 XT
- RX 7600 XT: 16GB VRAM (~€330-370). Fits 14B models at Q4 comfortably; 32B at Q4 only with partial CPU offload (Q4_K_M weights alone are ~19GB).
- RX 7800 XT: 16GB VRAM (~€400-450). More compute, same VRAM ceiling.
For LLM use on home hardware, is the RX 7800 XT worth the ~€80-100 premium? My primary use case is Qwen 2.5 14B and eventually Qwen 2.5 32B at Q4. No image generation.
Stack: llamacpp + Vulkan vs Ollama + ROCm
I've seen recommendations to use llamacpp with pre-built Vulkan binaries instead of Ollama for AMD, especially with an OCuLink setup. The binaries are on the llama.cpp GitHub releases page so no compilation is needed.
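Something like this is what I had in mind for grabbing a build (the release tag and asset name here are placeholders — the exact filename changes per release, so check the releases page first):

```shell
# Placeholder tag and asset name -- substitute whatever the current
# release on https://github.com/ggml-org/llama.cpp/releases actually lists.
TAG=b4600
curl -LO "https://github.com/ggml-org/llama.cpp/releases/download/${TAG}/llama-${TAG}-bin-ubuntu-vulkan-x64.zip"
unzip "llama-${TAG}-bin-ubuntu-vulkan-x64.zip" -d llama.cpp-vulkan
# Path inside the zip may differ between releases; sanity-check with:
./llama.cpp-vulkan/build/bin/llama-server --list-devices
```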
Questions:
- For AMD OCuLink dGPU + Linux, is llamacpp + Vulkan noticeably better than Ollama + ROCm in practice?
- Any specific flags for the llamacpp Vulkan build on AMD that make a real difference? I've seen mention of a "fit flag" that simplifies layer allocation.
- OCuLink bandwidth: is there any measurable throughput loss for LLM inference vs a native PCIe slot? The K12 uses OCuLink which is PCIe 4.0 x4.
- Dual GPU scenario: 780M iGPU (Frigate) + dGPU via OCuLink (Ollama) — any complications with ROCm or Vulkan seeing both devices and picking the wrong one?
Running Linux (Ubuntu 24.04 LTS).
IntrepidDig1581@reddit
tbh for pure LLM throughput the 7800 XT's compute advantage over the 7600 XT is real, but with the same VRAM ceiling it's a tough sell at that price difference
Awwtifishal@reddit
llama.cpp is better than ollama, and I noticed vulkan being a bit faster than rocm, so yes, llama.cpp + vulkan is the winning combination.
Fit is enabled by default, you don't need to pass any flags.
PCIe bandwidth is not critical for the typical layer split mode, 4.0 x4 is fine, I use it.
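For example, a typical single-dGPU launch (model path, context size and port are just placeholders):

```shell
# Sketch: offload all layers to the dGPU over Vulkan. Once the weights are
# loaded, per-token traffic over the PCIe link is small, which is why
# OCuLink's 4.0 x4 bandwidth mostly only affects load time.
./llama-server \
  -m ./models/Qwen2.5-14B-Instruct-Q4_K_M.gguf \
  -ngl 99 \
  -c 8192 \
  --port 8080
```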
When you have the same device available in multiple backends, you have to select one. For example, with vulkan and cuda compiled in, I see this with --list-devices:
CUDA0: nvidia
Vulkan0: the same nvidia
Vulkan1: amd
So I need to specify -dev CUDA0,Vulkan1, and if I don't like how it allocates space (usually proportional to free VRAM) I pass a specific tensor split: -ts 50,50 (llama.cpp only uses the proportions, so -ts 1,1 means the same thing).
Pablo_Gates@reddit (OP)
Really useful breakdown on the tensor split.
Quick follow-up: for an all-AMD setup (K12 780M iGPU + dGPU via OCuLink, both Vulkan), does
--list-devices correctly show them as separate Vulkan entries?
Any quirks with two AMD Vulkan devices vs a Nvidia+AMD mix like your example?
Awwtifishal@reddit
AFAIK, each device should have an entry for each backend. But I'm not sure because I couldn't manage to build it with HIP support. And there are no quirks as far as I know: if a model works by itself in all backends, it will work in multiple at the same time, of the same type or mixed. I used -dev CUDA0,Vulkan1, but using -dev Vulkan0,Vulkan1 would also work. If I don't specify anything, I think it tries to use one GPU twice as much, because it treats it as two separate GPUs.
Pablo_Gates@reddit (OP)
Thanks, the all-AMD clarification is exactly what I needed!
Good to know
-dev Vulkan0,Vulkan1 works cleanly with no quirks vs a mixed setup.
One follow-up on Q1 from my original post, which you didn't get to: for someone planning to run Qwen 32B as the daily driver and eventually try 70B, is the RX 7900 XTX (24GB, ~€550) meaningfully better than the RX 7600/7800 XT (16GB, ~€350-420)?
The way I see the VRAM math with tensor split across the 780M (~8GB shared) + dGPU:
Is the real-world difference between those two scenarios as big as the VRAM math suggests, or does 70B with CPU offload end up usable enough that the 16GB option is fine?
Awwtifishal@reddit
Yes, with 24 GB you could fit more weights and more context. Ideally you want to fit all attention, KV cache and shared experts in VRAM, and as much as possible of the regular experts.
70B models are kind of outdated and don't run that well with 24GB (particularly classic architectures that need a lot of KV cache). The biggest dense models nowadays are ~30B, have smaller caches, and perform better at almost any task.
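Rough napkin math (assuming a ballpark ~4.8 bits per weight for Q4_K_M, and ignoring KV cache and runtime overhead):

```shell
# Weight-size estimate: params (billions) * bits-per-weight / 8 = GB.
# 48/80 encodes ~4.8 bpw in integer arithmetic; these are ballpark figures only.
for params in 14 32 70; do
  gb=$(( params * 48 / 80 ))
  echo "${params}B at ~4.8 bpw -> ~${gb} GB of weights"
done
```

So 32B just barely misses a 16GB card even before context, while 70B doesn't fit in 24GB either — which is why the offload question matters more than the card choice there.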
Annual-Constant-5962@reddit
Ask chatgpt
Awwtifishal@reddit
chatgpt is frequently out of date, it may not know that llama.cpp now has fit enabled by default
Annual-Constant-5962@reddit
Ask claude
Awwtifishal@reddit
Claude also has no idea of current llama.cpp functionality; I just asked sonnet 4.6 and it didn't mention autofitting a single time. Then I explicitly asked about it and it said llama.cpp doesn't have it.
All LLMs are outdated and completely unreliable on some topics.