Qwen 3.6 27B llama.cpp | Multi-GPU pp t/s help
Posted by SemaMod@reddit | LocalLLaMA | 26 comments
The new dense model is great, but I'm trying to figure out how to increase prompt processing (PP) and token generation speed. I'm running Q8 quants across three 7900 XTX GPUs and consistently getting only 18-20 t/s generation and ~650 t/s prompt processing, which feels low. Wondering what other people are getting in multi-GPU setups and how I can optimize performance.
UniqueAttourney@reddit
I tried this on a single 3090 (LM Studio) and I only get 1 to 2 tokens per second. Although it's a 27B, it seems to need more compute than previous models.
bonobomaster@reddit
Nah, you fucked something up in your config.
UniqueAttourney@reddit
Can you share your settings if you're using LM Studio?
bonobomaster@reddit
Show me your model loading parameters
UniqueAttourney@reddit
Sorry for the late response, here is the load config. I'm also using the unsloth Qwen3.6 27B Q4_K_S.
https://imgur.com/a/ZMAxR0x
bonobomaster@reddit
147,054 context size and 10 layers on CPU...
Here's what I would do: having 147k context is very nice, but 30+ tk/s is even nicer (or whatever a 3090 can reach).
Any layer at all on CPU is pure poison for tk/s.
Reduce context to almost nothing (8192) and 64 layers GPU offload. Then test.
Now increase context in nice chunky steps until things break (model loading or speed), then reduce context in finer steps until things are fine again (the model loads, speed is good), and finally tune it to leave about 1 to 1.5 GiB of headroom in your VRAM.
Because if you don't, things will start to suck if you, for example, open a browser while inferencing.
Keep an eye on the GPU memory load while doing that, for clues when the context is set too high.
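Roughly, as a starting point (a sketch with plain llama.cpp; LM Studio exposes the same knobs via its context-length and GPU-offload sliders, and the model path here is just a placeholder):

```
# Start small: every layer on the GPU, modest context, flash attention on.
./build/bin/llama-server -m ./Qwen3.6-27B-Q4_K_S.gguf -c 8192 -ngl 99 -fa on

# Watch VRAM while stepping the context up; aim to keep about 1-1.5 GiB free.
watch -n 1 nvidia-smi --query-gpu=memory.used,memory.total --format=csv
```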
RoroTitiFR@reddit
I'm getting similar performance with 27B model on my Tesla P40 + T4 setup.
Since I find that low, I prefer the 35B MoE variant...
tahorg@reddit
The MoE is not usable for me in a coding environment.
RoroTitiFR@reddit
That may be context-size related, no? If your context is low, it can be absolutely unusable because it loses the code base early.
tahorg@reddit
Not a context thing, 128k for both. It's just the depth of reasoning and the quality of the output. For coding, the 27B is really good, like Claude Sonnet level. The 35B MoE is not, unfortunately. I'd love it to be otherwise because the 35B is the only one usable on my hardware, but it sucks :(
orinoco_w@reddit
1. What are your ROCm, llama.cpp, and pytorch versions?
2. And which llama.cpp settings are you using?
3. And whose quant?
I fixed item 1 on ROCm 7.2.2 with pytorch 2.10
I'm finding all kinds of issues with combinations of the above. Here's an example - identical settings, just two different Q8_0 implementations.
```
root@fedora:/workspace/llama.cpp# ./build/bin/llama-bench -p 512 -n 128 -fa 1 --hf-repo unsloth/Qwen3.6-27B-GGUF:Q8_0
ggml_cuda_init: found 2 ROCm devices (Total VRAM: 57312 MiB):
  Device 0: AMD Radeon RX 7900 XTX, gfx1100 (0x1100), VMM: no, Wave Size: 32, VRAM: 24560 MiB
  Device 1: AMD Instinct MI100, gfx908:sramecc+:xnack- (0x908), VMM: no, Wave Size: 64, VRAM: 32752 MiB
| model           |       size |     params | backend | ngl | fa |  test |           t/s |
| --------------- | ---------: | ---------: | ------- | --: | -: | ----: | ------------: |
| qwen35 27B Q8_0 |  26.62 GiB |    26.90 B | ROCm    |  99 |  1 | pp512 | 846.56 ± 0.87 |
| qwen35 27B Q8_0 |  26.62 GiB |    26.90 B | ROCm    |  99 |  1 | tg128 |  23.59 ± 0.02 |
build: f53577432 (8942)

root@fedora:/workspace/llama.cpp# ./build/bin/llama-bench -p 512 -n 128 -fa 1 --hf-repo bartowski/Qwen_Qwen3.6-27B-GGUF:Q8_0
ggml_cuda_init: found 2 ROCm devices (Total VRAM: 57312 MiB):
  Device 0: AMD Radeon RX 7900 XTX, gfx1100 (0x1100), VMM: no, Wave Size: 32, VRAM: 24560 MiB
  Device 1: AMD Instinct MI100, gfx908:sramecc+:xnack- (0x908), VMM: no, Wave Size: 64, VRAM: 32752 MiB
| model           |       size |     params | backend | ngl | fa |  test |           t/s |
| --------------- | ---------: | ---------: | ------- | --: | -: | ----: | ------------: |
| qwen35 27B Q8_0 |  26.69 GiB |    26.90 B | ROCm    |  99 |  1 | pp512 | 222.74 ± 0.92 |
| qwen35 27B Q8_0 |  26.69 GiB |    26.90 B | ROCm    |  99 |  1 | tg128 |  23.32 ± 0.13 |
build: f53577432 (8942)
```
gusbags@reddit
One major downside of llama.cpp-based engines is that they do not support tensor parallelism, which leaves a lot of performance (particularly during PP) untapped. vLLM / SGLang is what you want, though that usually involves more tinkering to find the right setup (also, TP is only available across 2, 4, 8, etc. GPUs).
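For reference, the launch is roughly this (a sketch only: the model id is a placeholder, and running vLLM on RDNA3 cards needs a ROCm-enabled build):

```
# Two-way tensor parallel with vLLM; the model id is a placeholder.
vllm serve Qwen/Qwen3.6-27B --tensor-parallel-size 2 --max-model-len 32768
```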
finevelyn@reddit
llama.cpp now has experimental support for --split-mode tensor, which gives the expected speed boost on dual GPUs for Gemma 4 for me (at least on generation). It supports an arbitrary number of GPUs, and they can also have different amounts of VRAM.
The Qwen 3.6 implementation still has a crash with tensor parallel, though.
SemaMod@reddit (OP)
Update:
Running with
-sm tensor -ctxcp 0 -cram 0 -fa 1 -c 0 has significantly helped. I'm consistently getting 28 t/s and somewhat improved prompt processing this way.
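For anyone trying the same, the full launch looks roughly like this (the model path is a placeholder; the flags are the ones above):

```
./build/bin/llama-server -m ./Qwen3.6-27B-Q8_0.gguf \
  -sm tensor -ctxcp 0 -cram 0 -fa 1 -c 0 \
  --host 127.0.0.1 --port 8080
```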
SemaMod@reddit (OP)
Benchmarks:
build: 0adede866 (8925)
SemaMod@reddit (OP)
Update: Did some benching, got interesting results.
```
| model | size | params | backend | ngl | n_ubatch | sm | fa | dev | test | t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | -------: | -----: | -: | ------------ | --------------: | -------------------: |
| qwen35 27B Q8_0 | 32.89 GiB | 26.90 B | ROCm,Vulkan | 999 | 2048 | layer | 1 | ROCm0/ROCm1/ROCm2 | pp512 | 960.08 ± 2.04 |
| qwen35 27B Q8_0 | 32.89 GiB | 26.90 B | ROCm,Vulkan | 999 | 2048 | layer | 1 | ROCm0/ROCm1/ROCm2 | tg128 | 20.16 ± 0.01 |
| qwen35 27B Q8_0 | 32.89 GiB | 26.90 B | ROCm,Vulkan | 999 | 2048 | layer | 1 | ROCm0/ROCm1/ROCm2 | pp512+tg32 | 255.92 ± 0.12 |
| qwen35 27B Q8_0 | 32.89 GiB | 26.90 B | ROCm,Vulkan | 999 | 2048 | layer | 1 | ROCm0/ROCm1/ROCm2 | pp2048+tg64 | 387.85 ± 0.36 |
| qwen35 27B Q8_0 | 32.89 GiB | 26.90 B | ROCm,Vulkan | 999 | 2048 | layer | 1 | ROCm0/ROCm1/ROCm2 | pp8192+tg128 | 559.36 ± 0.08 |
| qwen35 27B Q8_0 | 32.89 GiB | 26.90 B | ROCm,Vulkan | 999 | 2048 | layer | 1 | ROCm0/ROCm1/ROCm2 | pp512 @ d8192 | 379.62 ± 0.61 |
| qwen35 27B Q8_0 | 32.89 GiB | 26.90 B | ROCm,Vulkan | 999 | 2048 | layer | 1 | ROCm0/ROCm1/ROCm2 | tg128 @ d8192 | 19.65 ± 0.01 |
| qwen35 27B Q8_0 | 32.89 GiB | 26.90 B | ROCm,Vulkan | 999 | 2048 | layer | 1 | ROCm0/ROCm1/ROCm2 | pp512+tg32 @ d8192 | 182.70 ± 0.11 |
| qwen35 27B Q8_0 | 32.89 GiB | 26.90 B | ROCm,Vulkan | 999 | 2048 | layer | 1 | ROCm0/ROCm1/ROCm2 | pp2048+tg64 @ d8192 | 244.39 ± 0.12 |
| qwen35 27B Q8_0 | 32.89 GiB | 26.90 B | ROCm,Vulkan | 999 | 2048 | layer | 1 | ROCm0/ROCm1/ROCm2 | pp8192+tg128 @ d8192 | 372.67 ± 0.17 |
| qwen35 27B Q8_0 | 32.89 GiB | 26.90 B | ROCm,Vulkan | 999 | 2048 | layer | 1 | Vulkan0/Vulkan1/Vulkan2 | pp512 | 870.61 ± 1.28 |
| qwen35 27B Q8_0 | 32.89 GiB | 26.90 B | ROCm,Vulkan | 999 | 2048 | layer | 1 | Vulkan0/Vulkan1/Vulkan2 | tg128 | 19.34 ± 0.01 |
| qwen35 27B Q8_0 | 32.89 GiB | 26.90 B | ROCm,Vulkan | 999 | 2048 | layer | 1 | Vulkan0/Vulkan1/Vulkan2 | pp512+tg32 | 240.85 ± 2.16 |
| qwen35 27B Q8_0 | 32.89 GiB | 26.90 B | ROCm,Vulkan | 999 | 2048 | layer | 1 | Vulkan0/Vulkan1/Vulkan2 | pp2048+tg64 | 381.95 ± 7.11 |
| qwen35 27B Q8_0 | 32.89 GiB | 26.90 B | ROCm,Vulkan | 999 | 2048 | layer | 1 | Vulkan0/Vulkan1/Vulkan2 | pp8192+tg128 | 521.42 ± 1.72 |
| qwen35 27B Q8_0 | 32.89 GiB | 26.90 B | ROCm,Vulkan | 999 | 2048 | layer | 1 | Vulkan0/Vulkan1/Vulkan2 | pp512 @ d8192 | 753.02 ± 57.60 |
| qwen35 27B Q8_0 | 32.89 GiB | 26.90 B | ROCm,Vulkan | 999 | 2048 | layer | 1 | Vulkan0/Vulkan1/Vulkan2 | tg128 @ d8192 | 18.94 ± 0.00 |
| qwen35 27B Q8_0 | 32.89 GiB | 26.90 B | ROCm,Vulkan | 999 | 2048 | layer | 1 | Vulkan0/Vulkan1/Vulkan2 | pp512+tg32 @ d8192 | 227.03 ± 4.31 |
| qwen35 27B Q8_0 | 32.89 GiB | 26.90 B | ROCm,Vulkan | 999 | 2048 | layer | 1 | Vulkan0/Vulkan1/Vulkan2 | pp2048+tg64 @ d8192 | 347.00 ± 7.69 |
| qwen35 27B Q8_0 | 32.89 GiB | 26.90 B | ROCm,Vulkan | 999 | 2048 | layer | 1 | Vulkan0/Vulkan1/Vulkan2 | pp8192+tg128 @ d8192 | 459.58 ± 9.04 |
| qwen35 27B Q8_0 | 32.89 GiB | 26.90 B | ROCm,Vulkan | 999 | 2048 | tensor | 1 | ROCm0/ROCm1/ROCm2 | pp512 | 521.71 ± 0.04 |
| qwen35 27B Q8_0 | 32.89 GiB | 26.90 B | ROCm,Vulkan | 999 | 2048 | tensor | 1 | ROCm0/ROCm1/ROCm2 | tg128 | 31.76 ± 0.27 |
| qwen35 27B Q8_0 | 32.89 GiB | 26.90 B | ROCm,Vulkan | 999 | 2048 | tensor | 1 | ROCm0/ROCm1/ROCm2 | pp512+tg32 | 255.19 ± 0.08 |
| qwen35 27B Q8_0 | 32.89 GiB | 26.90 B | ROCm,Vulkan | 999 | 2048 | tensor | 1 | ROCm0/ROCm1/ROCm2 | pp2048+tg64 | 348.56 ± 0.19 |
| qwen35 27B Q8_0 | 32.89 GiB | 26.90 B | ROCm,Vulkan | 999 | 2048 | tensor | 1 | ROCm0/ROCm1/ROCm2 | pp8192+tg128 | 377.54 ± 0.03 |
| qwen35 27B Q8_0 | 32.89 GiB | 26.90 B | ROCm,Vulkan | 999 | 2048 | tensor | 1 | ROCm0/ROCm1/ROCm2 | pp512 @ d8192 | 365.05 ± 11.70 |
| qwen35 27B Q8_0 | 32.89 GiB | 26.90 B | ROCm,Vulkan | 999 | 2048 | tensor | 1 | ROCm0/ROCm1/ROCm2 | tg128 @ d8192 | 31.86 ± 0.34 |
| qwen35 27B Q8_0 | 32.89 GiB | 26.90 B | ROCm,Vulkan | 999 | 2048 | tensor | 1 | ROCm0/ROCm1/ROCm2 | pp512+tg32 @ d8192 | 221.75 ± 0.13 |
| qwen35 27B Q8_0 | 32.89 GiB | 26.90 B | ROCm,Vulkan | 999 | 2048 | tensor | 1 | ROCm0/ROCm1/ROCm2 | pp2048+tg64 @ d8192 | 279.43 ± 0.09 |
| qwen35 27B Q8_0 | 32.89 GiB | 26.90 B | ROCm,Vulkan | 999 | 2048 | tensor | 1 | ROCm0/ROCm1/ROCm2 | pp8192+tg128 @ d8192 | 292.38 ± 0.04 |
| qwen35 27B Q8_0 | 32.89 GiB | 26.90 B | ROCm,Vulkan | 999 | 2048 | tensor | 1 | Vulkan0/Vulkan1/Vulkan2 | pp512 | 258.99 ± 0.12 |
| qwen35 27B Q8_0 | 32.89 GiB | 26.90 B | ROCm,Vulkan | 999 | 2048 | tensor | 1 | Vulkan0/Vulkan1/Vulkan2 | tg128 | 6.56 ± 0.01 |
| qwen35 27B Q8_0 | 32.89 GiB | 26.90 B | ROCm,Vulkan | 999 | 2048 | tensor | 1 | Vulkan0/Vulkan1/Vulkan2 | pp512+tg32 | 77.83 ± 0.01 |
| qwen35 27B Q8_0 | 32.89 GiB | 26.90 B | ROCm,Vulkan | 999 | 2048 | tensor | 1 | Vulkan0/Vulkan1/Vulkan2 | pp2048+tg64 | 125.57 ± 0.05 |
| qwen35 27B Q8_0 | 32.89 GiB | 26.90 B | ROCm,Vulkan | 999 | 2048 | tensor | 1 | Vulkan0/Vulkan1/Vulkan2 | pp8192+tg128 | 173.43 ± 0.06 |
| qwen35 27B Q8_0 | 32.89 GiB | 26.90 B | ROCm,Vulkan | 999 | 2048 | tensor | 1 | Vulkan0/Vulkan1/Vulkan2 | pp512 @ d8192 | 244.10 ± 9.61 |
| qwen35 27B Q8_0 | 32.89 GiB | 26.90 B | ROCm,Vulkan | 999 | 2048 | tensor | 1 | Vulkan0/Vulkan1/Vulkan2 | tg128 @ d8192 | 6.45 ± 0.01 |
| qwen35 27B Q8_0 | 32.89 GiB | 26.90 B | ROCm,Vulkan | 999 | 2048 | tensor | 1 | Vulkan0/Vulkan1/Vulkan2 | pp512+tg32 @ d8192 | 76.61 ± 0.41 |
| qwen35 27B Q8_0 | 32.89 GiB | 26.90 B | ROCm,Vulkan | 999 | 2048 | tensor | 1 | Vulkan0/Vulkan1/Vulkan2 | pp2048+tg64 @ d8192 | 123.08 ± 0.13 |
| qwen35 27B Q8_0 | 32.89 GiB | 26.90 B | ROCm,Vulkan | 999 | 2048 | tensor | 1 | Vulkan0/Vulkan1/Vulkan2 | pp8192+tg128 @ d8192 | 170.02 ± 0.18 |
build: 0adede866 (8925)
```
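For anyone wanting to reproduce a sweep like this, the llama-bench invocation would look roughly like the following (a sketch only: the model path is a placeholder, and -d, -pg, and the tensor split mode need a recent llama.cpp build):

```
./build/bin/llama-bench -m ./Qwen3.6-27B-Q8_0.gguf \
  -fa 1 -ngl 999 -ub 2048 \
  -sm layer,tensor \
  -p 512,2048,8192 -n 128 \
  -pg 512,32 -pg 2048,64 -pg 8192,128 \
  -d 0,8192
```

The ROCm and Vulkan rows came from separate runs with the corresponding backend/device selection.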
RedAdo2020@reddit
That's weird. I'm using Q8, and across 4x 5070 Ti I get:
| PP | TG | N_KV | T_PP s | S_PP t/s | T_TG s | S_TG t/s |
|-------|--------|--------|----------|----------|----------|----------|
| 4096 | 1024 | 0 | 1.873 | 2187.40 | 23.452 | 43.66 |
| 4096 | 1024 | 4096 | 1.875 | 2184.66 | 23.179 | 44.18 |
| 4096 | 1024 | 8192 | 1.885 | 2172.91 | 23.371 | 43.81 |
| 4096 | 1024 | 12288 | 1.913 | 2141.39 | 23.620 | 43.35 |
| 4096 | 1024 | 16384 | 1.945 | 2106.43 | 23.844 | 42.95 |
| 4096 | 1024 | 20480 | 1.972 | 2077.54 | 24.103 | 42.48 |
| 4096 | 1024 | 24576 | 2.007 | 2040.41 | 24.345 | 42.06 |
| 4096 | 1024 | 28672 | 2.031 | 2016.48 | 24.584 | 41.65 |
| 4096 | 1024 | 32768 | 2.063 | 1985.32 | 24.933 | 41.07 |
| 4096 | 1024 | 36864 | 2.091 | 1959.02 | 25.021 | 40.93 |
| 4096 | 1024 | 40960 | 2.117 | 1935.15 | 25.176 | 40.67 |
| 4096 | 1024 | 45056 | 2.145 | 1909.44 | 25.348 | 40.40 |
| 4096 | 1024 | 49152 | 2.180 | 1878.64 | 25.530 | 40.11 |
| 4096 | 1024 | 53248 | 2.205 | 1857.82 | 25.693 | 39.86 |
| 4096 | 1024 | 57344 | 2.238 | 1830.50 | 25.886 | 39.56 |
| 4096 | 1024 | 61440 | 2.263 | 1810.10 | 26.051 | 39.31 |
| 4096 | 1024 | 65536 | 2.292 | 1787.15 | 26.342 | 38.87 |
| 4096 | 1024 | 69632 | 2.327 | 1760.00 | 26.459 | 38.70 |
| 4096 | 1024 | 73728 | 2.355 | 1738.95 | 26.602 | 38.49 |
| 4096 | 1024 | 77824 | 2.382 | 1719.48 | 26.772 | 38.25 |
| 4096 | 1024 | 81920 | 2.415 | 1696.21 | 26.946 | 38.00 |
| 4096 | 1024 | 86016 | 2.446 | 1674.40 | 27.115 | 37.77 |
| 4096 | 1024 | 90112 | 2.478 | 1652.82 | 27.299 | 37.51 |
| 4096 | 1024 | 94208 | 2.511 | 1631.46 | 27.482 | 37.26 |
| 4096 | 1024 | 98304 | 2.541 | 1611.75 | 27.732 | 36.92 |
| 4096 | 1024 | 102400 | 2.572 | 1592.84 | 27.869 | 36.74 |
| 4096 | 1024 | 106496 | 2.600 | 1575.32 | 28.004 | 36.57 |
| 4096 | 1024 | 110592 | 2.640 | 1551.29 | 28.171 | 36.35 |
| 4096 | 1024 | 114688 | 2.672 | 1532.74 | 28.361 | 36.11 |
| 4096 | 1024 | 118784 | 2.709 | 1512.06 | 28.519 | 35.91 |
| 4096 | 1024 | 122880 | 2.746 | 1491.84 | 28.703 | 35.68 |
| 4096 | 1024 | 126976 | 2.798 | 1463.79 | 28.889 | 35.45 |
| 4096 | 1024 | 131072 | 2.836 | 1444.28 | 29.509 | 34.70 |
| 4096 | 1024 | 135168 | 2.882 | 1420.99 | 30.131 | 33.98 |
| 4096 | 1024 | 139264 | 2.909 | 1407.94 | 29.469 | 34.75 |
| 4096 | 1024 | 143360 | 2.940 | 1392.99 | 29.720 | 34.45 |
| 4096 | 1024 | 147456 | 2.997 | 1366.67 | 29.755 | 34.41 |
| 4096 | 1024 | 151552 | 3.041 | 1346.80 | 29.935 | 34.21 |
| 4096 | 1024 | 155648 | 3.070 | 1334.28 | 30.143 | 33.97 |
| 4096 | 1024 | 159744 | 3.123 | 1311.50 | 30.454 | 33.62 |
| 4096 | 1024 | 163840 | 3.259 | 1256.69 | 31.215 | 32.80 |
| 4096 | 1024 | 167936 | 3.163 | 1294.83 | 31.784 | 32.22 |
| 4096 | 1024 | 172032 | 3.236 | 1265.64 | 31.213 | 32.81 |
| 4096 | 1024 | 176128 | 3.324 | 1232.16 | 31.855 | 32.15 |
| 4096 | 1024 | 180224 | 3.338 | 1227.03 | 32.425 | 31.58 |
| 4096 | 1024 | 184320 | 3.338 | 1226.97 | 31.851 | 32.15 |
| 4096 | 1024 | 188416 | 3.399 | 1205.05 | 32.099 | 31.90 |
| 4096 | 1024 | 192512 | 3.425 | 1195.87 | 32.489 | 31.52 |
So even at 192k context I get faster PP and TG than you.
I run a 9950X3D and 4x 5070 Ti, with x8 lanes on the first card and x4 on the rest. My command:
```
CUDA_VISIBLE_DEVICES=0,1,2,3 ./LLM/ik_llama.cpp/build/bin/llama-server \
--model /LLM/Models/Qwen3.6-27B-Uncensored-HauhauCS-Aggressive/Qwen3.6-27B-Uncensored-HauhauCS-Aggressive-Q8_K_P.gguf \
--alias Qwen3.6-27B-Uncensored-HauhauCS-Aggressive-Q8_K_P.gguf \
--ctx-size 196608 \
-fa on \
-b 4096 -ub 4096 \
-smgs \
--max-gpu 4 \
-sm graph \
-mg 0 \
-ngl 999 \
--host 127.0.0.1 \
--port 8080 \
--threads 16 \
--parallel 1 \
--temp 1 \
--top-p 0.95 \
--top-k 20 \
--min-p 0.0 \
--presence-penalty 1.5 \
--repeat-penalty 1.0 \
--cache-ram -1 \
-ts 0.9,1,1,0.4 \
--jinja
```
Apprehensive_Use1906@reddit
I’ll give it a try with my 2x r9700 cards and report back.
rebelSun25@reddit
Nice. I'm looking into this card. Even at $1900 CAD it's worth it, since the 30GB models are getting interesting now.
Apprehensive_Use1906@reddit
It was definitely worth it for me. Two of these were around the same price as one 5080 in my neck of the woods. I was more interested in the memory than the speed (speed is nice, of course). They actually work pretty well for gaming as well.
Apprehensive_Use1906@reddit
I ran a benchmarking script in LM studio and made sure both GPUs were being utilized. Following is the output (Sorry about the wonky spacing):
"usage":
"prompt_tokens": 19,
"completion_tokens": 500,
"total_tokens": 519,
"completion_tokens_details":
"reasoning_tokens": 499
"system_fingerprint": "qwen/qwen3.6-27b"
Hope this helps.
picosec@reddit
I get about a 50% speedup in token generation with Q8_K_XL quants using llama.cpp with "-sm tensor" vs the default "-sm layer", on an RTX 3090 Ti and an RTX 3090 (both in PCIe 4.0 x16 slots).
I'm not sure if "-sm tensor" works with multiple 7900 XTX cards, though.
DeProgrammer99@reddit
I get the same speed with Q6_K_XL on my RX 7900 XTX with a bit of it offloaded to my RTX 4060 Ti.
I thought you could use --split-mode tensor for identical GPUs for better speed, but it seems that change was reverted. (Was merged in pull request 19378 with a lot of unsupported cases.)
BigYoSpeck@reddit
My two cards are faster.
You don't give enough details about your setup to help, though.
Pretend_Engineer5951@reddit
18-20 t/s is a nice speed. My 8060S only gives 6-7 t/s.
Look_0ver_There@reddit
This is sadly "about right" for the Q8 quants of Qwen3.6-27B.
With a multi-GPU setup you're also likely up against PCIe latency, whereby every hop from card to card requires one card writing state back to system memory via the CPU, which then passes it on to the next card.
I have some Radeon AI Pro 9700s, and even when the model fits entirely on one card, PP performance peaks at around 1100 tok/s and TG is around 22 t/s. As more cards get used, performance drops due to the aforementioned card-to-card latency.
You can use something like vLLM across two cards to improve things a bit, but even there the gain for a single user is almost nothing. vLLM works best when you have a number of requests in parallel, since it does a better job of keeping all the cards busy at once, whereas llama.cpp won't keep multiple cards as busy.
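To illustrate: the batching advantage only shows up once several requests are in flight, e.g. something along these lines against an OpenAI-compatible endpoint (the port and model id are assumptions):

```
# Fire 8 completions concurrently so continuous batching can keep every GPU busy.
for i in $(seq 1 8); do
  curl -s http://localhost:8000/v1/completions \
    -H "Content-Type: application/json" \
    -d '{"model": "qwen3.6-27b", "prompt": "Explain GPU scheduling.", "max_tokens": 128}' &
done
wait
```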