AMD Radeon RX 6900 XT - ROCm vs Vulkan - Gemma 4 and Qwen 3.5 speed benchmarks
Posted by grumd@reddit | LocalLLaMA | View on Reddit | 23 comments
Did some quick tests after building llama.cpp with ROCm 6.4.2 and the latest Vulkan backend for my 6900 XT.
gemma4 E2B Q4_K
| ubatch | ROCm pp512 (t/s) | Vulkan pp512 (t/s) | ROCm tg128 (t/s) | Vulkan tg128 (t/s) |
|---|---|---|---|---|
| 32 | 1536.60 | 1423.49 | 151.92 | 174.59 |
| 64 | 1590.65 | 1930.60 | 151.41 | 173.76 |
| 128 | 2651.11 | 2998.42 | 151.53 | 173.71 |
| 256 | 3653.19 | 3233.44 | 151.45 | 173.45 |
| 512 | 3807.60 | 3950.71 | 151.47 | 173.67 |
| 1024 | 3806.77 | 3948.27 | 151.49 | 173.35 |
qwen35 4B Q8_0
| ubatch | ROCm pp512 (t/s) | Vulkan pp512 (t/s) | ROCm tg128 (t/s) | Vulkan tg128 (t/s) |
|---|---|---|---|---|
| 32 | 1368.32 | 706.18 | 77.57 | 88.58 |
| 64 | 1841.68 | 1323.46 | 77.65 | 88.57 |
| 128 | 2577.95 | 1672.51 | 77.97 | 88.46 |
| 256 | 2984.38 | 2244.62 | 77.72 | 88.50 |
| 512 | 3023.75 | 2390.09 | 77.81 | 88.57 |
| 1024 | 3019.70 | 2386.97 | 77.60 | 88.53 |
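For reference, these numbers come from llama-bench. A minimal sketch of the kind of ubatch sweep that produces tables like the ones above (model paths and build directories are placeholders, not my exact invocation):

```bash
# Sketch only: sweep ubatch sizes with llama-bench; pp512/tg128 are the default tests.
# Paths and flag values are assumptions, not copied from the original run.
./build-rocm/bin/llama-bench -m models/model-Q4_K.gguf \
    -ub 32,64,128,256,512,1024

# Same sweep against the Vulkan build for comparison.
./build-vulkan/bin/llama-bench -m models/model-Q4_K.gguf \
    -ub 32,64,128,256,512,1024
```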
FullstackSensei@reddit
Why are you still using ROCm 6? 7 has been out for a while and should bring a good performance uplift.
grumd@reddit (OP)
7.1 doesn't recognize my GPU
FunkyMuse@reddit
Have you tried 7.2.2?
grumd@reddit (OP)
Nope, and I don't really intend to. I'm running this thing as an RPC server, and my pp is ~400 anyway due to Ethernet overhead.
FullstackSensei@reddit
Why are you running it over RPC? And why don't you try updating your graphics driver?
grumd@reddit (OP)
I have two PCs, one with a 5080 and one with a 6900XT. The latter is an RPC server and the former is where I'm running my models.
I'm also using the latest GPU drivers, obviously.
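For context, the setup is roughly the stock llama.cpp RPC flow. A minimal sketch, with the host, port, and model path as placeholders rather than my actual config:

```bash
# On the 6900 XT box: build with the RPC backend and expose the GPU over the LAN.
# cmake flags and the port are assumptions about the usual llama.cpp RPC setup.
cmake -B build -DGGML_HIP=ON -DGGML_RPC=ON
cmake --build build --config Release
./build/bin/rpc-server --host 0.0.0.0 --port 50052

# On the 5080 box: point llama-server at the remote worker.
./build/bin/llama-server -m models/some-model.gguf --rpc 192.168.1.50:50052
```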
FullstackSensei@reddit
Why are you doing this though? That's like the worst possible way to run two GPUs, and arguably the least economical. Even a single PCIe lane would be much faster and cause far fewer issues. Beyond choking the card over 1Gb Ethernet, llama.cpp RPC is far from optimized, and in fact disables a ton of the optimizations you'd get if both GPUs were in the same machine. And you can run both in the same machine: you just need to build llama.cpp from source with GGML_BACKEND_DL and both the CUDA and ROCm backends enabled.
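Something like this is what I mean (a sketch, assuming current cmake option names and that both the CUDA and ROCm toolchains are installed):

```bash
# Sketch: one build with dynamically loadable backends, so the CUDA and ROCm
# backends can live side by side and be loaded at runtime (option names assumed current).
cmake -B build \
    -DGGML_BACKEND_DL=ON \
    -DGGML_CUDA=ON \
    -DGGML_HIP=ON
cmake --build build --config Release
# llama.cpp then loads the CUDA and ROCm backends at runtime and can split work across both GPUs.
```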
grumd@reddit (OP)
Because it's my and my wife's gaming PCs. I'm not building a datacenter here.
Jatilq@reddit
Test both in LM Studio, because it ships both runtimes.
grumd@reddit (OP)
No, I'm compiling these myself.
Jatilq@reddit
I understand that. I’m saying test to help nail down the problem.
grumd@reddit (OP)
And what's the problem?
Jatilq@reddit
Omg! I just realized I responded to the wrong thread. I’m sitting in a hospital gown about to go into surgery. I must be loopy. I’m sorry.
grumd@reddit (OP)
I hope the surgery goes well!
Jatilq@reddit
Thank you. If your watch ever says you have AFib, tell your doctor right away.
grumd@reddit (OP)
My watch is mechanical sadly haha, but with my WPW syndrome I should probably do an ECG once in a while
spaceman_@reddit
You should also test at non-zero context depths. As of a few months ago, Vulkan PP speeds typically decline far less at larger prompt/context sizes.
Vulkan also seems to handle "weird" quantizations like Q5/Q6 better than ROCm in my experience.
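For example, recent llama-bench builds can benchmark at a given context depth. A minimal sketch (the -d flag and the depth values here are an assumption, not something from the OP's run):

```bash
# Sketch: measure pp/tg at non-zero context depths to see how each backend
# degrades as the KV cache fills (flag assumed to be -d / --n-depth in recent builds).
./build/bin/llama-bench -m models/model.gguf -d 0,4096,16384
```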
grumd@reddit (OP)
Yeah, that's true. I only built these binaries to use this machine as an RPC server, so I didn't bother with long depths, just some quick tests.
RoomyRoots@reddit
Have you tried the preview builds of ROCm? I am getting better results with ROCm than Vulkan now.
grumd@reddit (OP)
ROCm 7.1 didn't even recognize the GPU; this is the latest build that worked.
taking_bullet@reddit
I believe in Vulkan supremacy 👌
ps5cfw@reddit
This is useless!
MikeLPU@reddit
Like your comment