backend-agnostic tensor parallelism has been merged into llama.cpp
Posted by jacek2023@reddit | LocalLLaMA | View on Reddit | 56 comments
If you have more than one GPU, your models can now run much faster.
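For anyone wanting to try it, the new mode is selected with the `-sm tensor` flag on `llama-server` (the model path here is a placeholder; `-sm layer` remains the default):

```bash
# Default behavior: whole layers are distributed across GPUs (-sm layer).
# The new mode splits individual tensors across GPUs instead:
llama-server -m /path/to/your-model.gguf -ngl 999 -sm tensor
```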
Far_Course2496@reddit
Does this mean I don't need to figure out vllm? Serious question
viperx7@reddit
Yesterday I tried to figure out vLLM and, man, I have no idea what exactly I need to do. Turns out I can't run the FP8 models because I have a 3090 in my system, which won't work.
I was able to load an AWQ model, but only with half the context llama.cpp allows, and the speed wasn't that much better.
McSendo@reddit
I think only v0.9.0 supports block fp8 via marlin for sm86. Looks like they are refactoring some code regarding fp8 so the nightly releases don't support it yet.
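If that's right, the workaround would be pinning that release rather than running a nightly (version number taken from the comment above; the model name is a placeholder):

```bash
# Pin the release said to support block-FP8 via Marlin kernels on sm86 (Ampere / RTX 3090)
pip install vllm==0.9.0
vllm serve some-org/some-fp8-model --max-model-len 8192
```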
jacek2023@reddit (OP)
I had similar experiences with vLLM. It's probably a more "pro" solution: you need to focus on one specific model to set it up correctly, while llama.cpp is more "hacker friendly" in that you can just experiment and have fun quickly.
jacek2023@reddit (OP)
vLLM has a serious limitation: you need two or four GPUs. I have three, and three GPUs only work with llama.cpp.
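The reason, as I understand it, is that vLLM's tensor parallelism shards attention heads across GPUs, so the tensor-parallel size has to divide the model's head count evenly — which in practice means power-of-two GPU counts for most models. A hedged sketch (model name is a placeholder):

```bash
# Head counts like 32 are divisible by 2, 4, 8... but not by 3,
# so a 3-GPU tensor-parallel setup typically fails a divisibility check:
vllm serve some-org/some-model --tensor-parallel-size 2   # works
vllm serve some-org/some-model --tensor-parallel-size 3   # usually rejected
```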
sleepingsysadmin@reddit
`-sm layer` baseline though. Cries a little.
Cries even more.
jacek2023@reddit (OP)
is this caused by different GPUs on your setup?
sleepingsysadmin@reddit
Well, no, I have identical GPUs. Am I misunderstanding here? I'm reading it as AMD cards are shit out of luck again.
Guess I have to test.
jacek2023@reddit (OP)
I mean RX 6800 and MI50 are two different GPUs; maybe it requires them to be the same.
sleepingsysadmin@reddit
Testing right now, identical AMD cards. No split flag (i.e. layer): ~40 TPS. With tensor split: 20 TPS.
AMD sads.
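For an apples-to-apples comparison, `llama-bench` can sweep both split modes in one run, assuming `-sm` accepts a comma-separated list like the other benchmark parameters do (model path is a placeholder):

```bash
# Benchmark the same model under both split modes in a single invocation
llama-bench -m /path/to/model.gguf -ngl 999 -fa 1 \
    -p 1024 -n 128 -sm layer,tensor
```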
TaroOk7112@reddit
With 2 AMD r9700, one connected to PCIe 4 x16 (CPU) and the other to PCIe 3 x4 (chipset):
llama.cpp ROCm - build_info: b8760-073bb2c20 (compiled an hour ago with ROCm 7.2.1)
TaroOk7112@reddit
This is the difference between --split-mode layer and tensor in my system with one GPU connected to PCIe 4 x16 (CPU lanes) and the other to PCIe 3 x4.
**TLDR:**
| split mode | pp (t/s) | tg (t/s) |
| :--- | :---: | :---: |
| not set (default) | 72.40 | 15.50 |
| tensor | 65.49 | 17.20 |
**llama.cpp ROCm - build_info: b8760-073bb2c20**
```text
Device 0: AMD Radeon AI PRO R9700, gfx1201 (0x1201), VMM: no, Wave Size: 32, VRAM: 32624 MiB
Device 1: AMD Radeon AI PRO R9700, gfx1201 (0x1201), VMM: no, Wave Size: 32, VRAM: 32624 MiB
```
## -sm layer (default)
```bash
llama-server -m Qwen3.5-27B-UD-Q8_K_XL.gguf -c 262144 -n 32000 -t 20 --temp 0.6 --top-p 0.95 --top-k 20 --min-p 0.00 --host 0.0.0.0 --port 8888 --jinja --fit on --flash-attn on --metrics
```
```text
prompt eval time = 814.87 ms / 59 tokens ( 13.81 ms per token, 72.40 tokens per second)
eval time = 259966.21 ms / 4029 tokens ( 64.52 ms per token, 15.50 tokens per second)
```
## -sm tensor
```bash
llama-server -m Qwen3.5-27B-UD-Q8_K_XL.gguf -c 262144 -n 32000 -t 20 --temp 0.6 --top-p 0.95 --top-k 20 --min-p 0.00 --host 0.0.0.0 --port 8888 --jinja --fit on --flash-attn on --metrics -sm tensor
```
```text
prompt eval time = 900.93 ms / 59 tokens ( 15.27 ms per token, 65.49 tokens per second)
eval time = 241768.28 ms / 4159 tokens ( 58.13 ms per token, 17.20 tokens per second)
```
jacek2023@reddit (OP)
try different models, I had big speedup on qwen 3 dense but terrible result on qwen 3 MoE
sapoepsilon@reddit
I am so glad I went with 3090s instead of AMD GPUs. I was really, really tempted to get AMD GPUs.
hp1337@reddit
I tried Qwen 3.5 397B IQ2_XXS with -sm tensor on my 6x3090 setup and it crashes. I tried gemma-4-31b-it-ud-q8_k_xl with 2x3090, and performance is worse in both PP and TG with -sm tensor.
This feature needs a bit of work to be useful. I'm glad there is progress however!
jacek2023@reddit (OP)
try older models
viperx7@reddit
Qwen3.5 27B Q8 went from 28-29t/s to 42t/s
My rig is 4090+3090ti
spaceman_@reddit
As far as I can tell, it doesn't work for Vulkan yet, based on the various comments in the PR.
I'm currently testing this against Gemma4 31B, Gemma4 26B A4B, Qwen3-Coder-Next and Qwen3.5-31B on my desktop with 2x R9700 and the ROCm backend.
TheBlueMatt@reddit
It works with Vulkan, but falls back to an unoptimized way of doing the AllReduce step: instead of a targeted implementation, it has to do lots of copying.
jacek2023@reddit (OP)
in case of problems try old models like llama 3 or qwen 3 dense too
spaceman_@reddit
Update: Gemma4 performance using tensor split on ROCm is about 1/3 of the layer split speed (prompt processing) and Qwen3.5 models crash.
skaldamramra@reddit
Tested the new `-sm tensor` on 2× AMD Radeon 7900 XTX (gfx1100, 2×24 GB = 48 GB total VRAM) with ebircak/gemma-4-31B-it-GGUF_IQ4_NL_L on llama.cpp build d132f22fc (b8739), ROCm backend.
Token Generation — clear win for `-sm tensor`
Prompt Processing — `-sm layer` leads at most context sizes
Both splits run the full model fully on-GPU with zero CPU offload. Really impressive to see this working on AMD/ROCm out of the box with the new backend-agnostic implementation!
Raw data:
llama-bench -t 5 -ngl 999 -m /data_fast/gemma-4-31B-it-IQ4_NL_L_AMD.gguf -fa 1 -ub 1024 -b 1024 -p 1024,2048,4096,8192,16384,32768,65536 -n 1,128,512,1024 -sm tensor
ggml_cuda_init: found 2 ROCm devices (Total VRAM: 49120 MiB):
Device 0: , gfx1100 (0x1100), VMM: no, Wave Size: 32, VRAM: 24560 MiB
Device 1: , gfx1100 (0x1100), VMM: no, Wave Size: 32, VRAM: 24560 MiB
| model | size | params | backend | ngl | threads | n_batch | n_ubatch | sm | fa | test | t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | ------: | ------: | -------: | -----: | -: | --------------: | -------------------: |
| gemma4 ?B IQ4_NL - 4.5 bpw | 20.10 GiB | 30.70 B | ROCm | 999 | 5 | 1024 | 1024 | tensor | 1 | pp1024 | 1439.71 ± 3.87 |
| gemma4 ?B IQ4_NL - 4.5 bpw | 20.10 GiB | 30.70 B | ROCm | 999 | 5 | 1024 | 1024 | tensor | 1 | pp2048 | 1341.66 ± 0.94 |
| gemma4 ?B IQ4_NL - 4.5 bpw | 20.10 GiB | 30.70 B | ROCm | 999 | 5 | 1024 | 1024 | tensor | 1 | pp4096 | 1320.92 ± 1.02 |
| gemma4 ?B IQ4_NL - 4.5 bpw | 20.10 GiB | 30.70 B | ROCm | 999 | 5 | 1024 | 1024 | tensor | 1 | pp8192 | 1271.03 ± 0.49 |
| gemma4 ?B IQ4_NL - 4.5 bpw | 20.10 GiB | 30.70 B | ROCm | 999 | 5 | 1024 | 1024 | tensor | 1 | pp16384 | 1175.64 ± 0.58 |
| gemma4 ?B IQ4_NL - 4.5 bpw | 20.10 GiB | 30.70 B | ROCm | 999 | 5 | 1024 | 1024 | tensor | 1 | pp32768 | 1019.99 ± 0.13 |
| gemma4 ?B IQ4_NL - 4.5 bpw | 20.10 GiB | 30.70 B | ROCm | 999 | 5 | 1024 | 1024 | tensor | 1 | pp65536 | 804.25 ± 0.31 |
| gemma4 ?B IQ4_NL - 4.5 bpw | 20.10 GiB | 30.70 B | ROCm | 999 | 5 | 1024 | 1024 | tensor | 1 | tg1 | 36.90 ± 0.27 |
| gemma4 ?B IQ4_NL - 4.5 bpw | 20.10 GiB | 30.70 B | ROCm | 999 | 5 | 1024 | 1024 | tensor | 1 | tg128 | 37.08 ± 0.01 |
| gemma4 ?B IQ4_NL - 4.5 bpw | 20.10 GiB | 30.70 B | ROCm | 999 | 5 | 1024 | 1024 | tensor | 1 | tg512 | 36.53 ± 0.09 |
| gemma4 ?B IQ4_NL - 4.5 bpw | 20.10 GiB | 30.70 B | ROCm | 999 | 5 | 1024 | 1024 | tensor | 1 | tg1024 | 36.26 ± 0.04 |
llama-bench -t 5 -ngl 999 -m /data_fast/gemma-4-31B-it-IQ4_NL_L_AMD.gguf -fa 1 -ub 256 -b 1024 -p 1024,2048,4096,8192,16384,32768,65536 -n 1,128,512,1024 -sm layer
ggml_cuda_init: found 2 ROCm devices (Total VRAM: 49120 MiB):
Device 0: , gfx1100 (0x1100), VMM: no, Wave Size: 32, VRAM: 24560 MiB
Device 1: , gfx1100 (0x1100), VMM: no, Wave Size: 32, VRAM: 24560 MiB
| model | size | params | backend | ngl | threads | n_batch | n_ubatch | fa | test | t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | ------: | ------: | -------: | -: | --------------: | -------------------: |
| gemma4 ?B IQ4_NL - 4.5 bpw | 20.10 GiB | 30.70 B | ROCm | 999 | 5 | 1024 | 256 | 1 | pp1024 | 1426.22 ± 1.41 |
| gemma4 ?B IQ4_NL - 4.5 bpw | 20.10 GiB | 30.70 B | ROCm | 999 | 5 | 1024 | 256 | 1 | pp2048 | 1544.12 ± 1.27 |
| gemma4 ?B IQ4_NL - 4.5 bpw | 20.10 GiB | 30.70 B | ROCm | 999 | 5 | 1024 | 256 | 1 | pp4096 | 1580.21 ± 1.13 |
| gemma4 ?B IQ4_NL - 4.5 bpw | 20.10 GiB | 30.70 B | ROCm | 999 | 5 | 1024 | 256 | 1 | pp8192 | 1543.74 ± 0.39 |
| gemma4 ?B IQ4_NL - 4.5 bpw | 20.10 GiB | 30.70 B | ROCm | 999 | 5 | 1024 | 256 | 1 | pp16384 | 1424.47 ± 0.18 |
| gemma4 ?B IQ4_NL - 4.5 bpw | 20.10 GiB | 30.70 B | ROCm | 999 | 5 | 1024 | 256 | 1 | pp32768 | 1216.51 ± 0.17 |
| gemma4 ?B IQ4_NL - 4.5 bpw | 20.10 GiB | 30.70 B | ROCm | 999 | 5 | 1024 | 256 | 1 | pp65536 | 933.86 ± 0.26 |
| gemma4 ?B IQ4_NL - 4.5 bpw | 20.10 GiB | 30.70 B | ROCm | 999 | 5 | 1024 | 256 | 1 | tg1 | 28.68 ± 0.32 |
| gemma4 ?B IQ4_NL - 4.5 bpw | 20.10 GiB | 30.70 B | ROCm | 999 | 5 | 1024 | 256 | 1 | tg128 | 27.87 ± 0.00 |
| gemma4 ?B IQ4_NL - 4.5 bpw | 20.10 GiB | 30.70 B | ROCm | 999 | 5 | 1024 | 256 | 1 | tg512 | 27.74 ± 0.02 |
| gemma4 ?B IQ4_NL - 4.5 bpw | 20.10 GiB | 30.70 B | ROCm | 999 | 5 | 1024 | 256 | 1 | tg1024 | 27.49 ± 0.01 |
(PP graph omitted.)
jacek2023@reddit (OP)
what about generation speed?
spaceman_@reddit
I put the raw numbers in my comment, so you can look at the parts you're interested in.
jacek2023@reddit (OP)
So it helps for dense models.
nicholas_the_furious@reddit
How can newer models be supported? What allows for support or not?
spaceman_@reddit
Those aren't in my arsenal, I'm testing what I use at the moment. If these don't work, I still have GLM-4.7-Flash on disk. But I'm not likely to have time to fiddle with other models at the moment.
jacek2023@reddit (OP)
I have some models from 2024 :)
fallingdowndizzyvr@reddit
Yes it does. Right in the comments.
"Very nice. This makes prompt processing way faster with Vulkan"
In that comment, they post numbers from Vulkan.
TaroOk7112@reddit
What PCIe slots are they plugged into? I have 2 R9700s too, but one on PCIe 4 x16 and one on PCIe 3 x4, so not ideal. I'm curious how it performs with shitty PCIe connectivity.
spaceman_@reddit
Both are connected at PCIe 4.0 x16
TaroOk7112@reddit
Qwen 3.5 27B? I mean, there isn't a new 31B model I missed, right?
CatalyticDragon@reddit
"This should be considered as an experimental feature that is not yet production ready."
Maybe let this one cook before getting excited/disappointed. I know how you kids can get :)
jacek2023@reddit (OP)
please read last sentence of my post young man
CatalyticDragon@reddit
I did. I'm making it exceptionally clear, since you didn't lead with the most important part.
ML-Future@reddit
If I have a laptop with an NVIDIA GPU plus integrated CPU graphics, does this count?
jacek2023@reddit (OP)
I don’t think so, but there is a well known placebo effect, so if you dream hard enough...
MDSExpro@reddit
Now add prefix cache and it can make llama.cpp actually usable.
Awkward-Boat1922@reddit
Oh wow, time to rebuild.
jacek2023@reddit (OP)
Qwen 3 14B tested in March
sersoniko@reddit
Mind the ordinate axis doesn’t start at 0
jacek2023@reddit (OP)
You people aren't interested in the actual data? Without scaling, the difference would be less visible.
sersoniko@reddit
Because it's not as impactful.
nicholas_the_furious@reddit
When you only care about the absolute distance between two points, you don't need to start a graph at 0.
jax_cooper@reddit
I like this graph because it starts at 0.... ohh wait
Time-Dot-1808@reddit
The 'backend-agnostic' part is the real story here. Tensor parallelism that works across backends means AMD and Intel GPU users aren't second-class citizens anymore. Layer splitting was always the fallback, and while it works, the memory bandwidth bottleneck kills throughput on anything latency-sensitive.
Curious to see benchmarks on mixed GPU setups (different VRAM sizes). That's where layer splitting had a clear advantage since you could just assign fewer layers to the smaller card.
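For mixed-VRAM setups, layer split can already be weighted manually with llama.cpp's `--tensor-split`/`-ts` option, which sets the proportion of the model assigned to each GPU (the ratio and model path here are illustrative):

```bash
# Give the larger card three times as much of the model as the smaller one
llama-server -m /path/to/model.gguf -ngl 999 -sm layer -ts 3,1
```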
the__storm@reddit
Loving this new trend to end every post with a short paragraph beginning "Curious ..." - makes it real easy to spot the bots.
AustinM731@reddit
This makes me sad that I sold my V100s. I pretty much only use vLLM these days for TP. And Volta support has all but been dropped from vLLM.
JLeonsarmiento@reddit
So… is a shoebox LLM server a possibility now?
https://www.tiktok.com/@shop_boxphonefarm?_r=1&_t=ZS-95OnI83YFJS
ResponsibleTruck4717@reddit
Do both GPUs need to have the same amount of VRAM?
Egoz3ntrum@reddit
Wonderful news!
Alarming-Ad8154@reddit
Oh nice! So I can split Qwen3.5 27B over my two 7900 XTs at 4-bit and still get fairly high context!
Alarming-Ad8154@reddit
If this propagates to LM Studio (I use LMlink to serve 4 machines), I might genuinely switch to dual AMD R9700 AI Pros for fast dense models at 5/6-bit and full context…
jacek2023@reddit (OP)
maybe test llama.cpp first :)
m94301@reddit
Thanks for the post - finally!
jacek2023@reddit (OP)
Qwen 3 32B tested in March