Qwen 3.6 27B llama.cpp | Multi-GPU pp t/s help
Posted by SemaMod@reddit | LocalLLaMA | 26 comments
The new dense model is great, but I'm trying to figure out how to increase prompt processing (PP) and token generation speed. I'm running Q8 quants across three 7900 XTX GPUs and consistently getting only 18-20 t/s generation and ~650 t/s prompt processing, which feels low. Wondering what other people are getting in multi-GPU setups and how I can optimize performance.
UniqueAttourney@reddit
I tried this on a single 3090 (LM Studio) and I only get 1 to 2 tokens per second. Although it's a 27B, it seems to need more compute than previous models.
bonobomaster@reddit
Nah, you fucked something up in your config.
UniqueAttourney@reddit
Can you share your settings if you're using LM Studio?
bonobomaster@reddit
Show me your model loading parameters
UniqueAttourney@reddit
Sorry for the late response, here is the load config. I'm also using the unsloth Qwen3.6 27B Q4_K_S.
https://imgur.com/a/ZMAxR0x
bonobomaster@reddit
147,054 context size and 10 layers on CPU...
Here's what I would do: having 147k context is very nice, but 30+ tk/s is even nicer (or whatever a 3090 can reach).
Any layer at all on CPU is pure poison for tk/s.
Reduce context to almost nothing (8192) and 64 layers GPU offload. Then test.
Now increase context in nice chunky steps until things break (model loading or speed), then reduce context in finer steps until things are fine again (the model loads, speed is good), and finally tune it to leave about 1 to 1.5 GiB of headroom in your VRAM.
Because if you don't, things will start to suck if you, for example, open a browser while inferencing.
Keep an eye on the GPU memory load while doing that, for clues when the context is set too high.
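Roughly, as a starting point (a sketch with plain llama.cpp; LM Studio exposes the same knobs via its context-length and GPU-offload sliders, and the model path here is just a placeholder):

```
# Start small: every layer on the GPU, modest context, flash attention on.
./build/bin/llama-server -m ./Qwen3.6-27B-Q4_K_S.gguf -c 8192 -ngl 99 -fa on

# Watch VRAM while stepping the context up; aim to keep about 1-1.5 GiB free.
watch -n 1 nvidia-smi --query-gpu=memory.used,memory.total --format=csv
```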
RoroTitiFR@reddit
I'm getting similar performance with 27B model on my Tesla P40 + T4 setup.
Since I find that low, I prefer the 35B MoE variant...
tahorg@reddit
The MoE is not usable for me in a coding environment.
RoroTitiFR@reddit
That may be context-size related, no? If your context is low, it can be absolutely unusable because it loses the code base early.
tahorg@reddit
Not a context thing, 128k for both. It's just the depth of reasoning and the quality of the output. For coding, the 27B is really good, like Claude Sonnet level. The 35B MoE is not, unfortunately. I'd love it to be otherwise because the 35B is the only one usable on my hardware, but it sucks :(
orinoco_w@reddit
1. What are your ROCm, llama.cpp, and pytorch versions?
2. And which llama.cpp settings are you using?
3. And whose quant?
I fixed item 1 on ROCm 7.2.2 with pytorch 2.10
I'm finding all kinds of issues with combinations of the above. Here's an example - identical settings, just two different Q8_0 implementations.
```
root@fedora:/workspace/llama.cpp# ./build/bin/llama-bench -p 512 -n 128 -fa 1 --hf-repo unsloth/Qwen3.6-27B-GGUF:Q8_0
ggml_cuda_init: found 2 ROCm devices (Total VRAM: 57312 MiB):
  Device 0: AMD Radeon RX 7900 XTX, gfx1100 (0x1100), VMM: no, Wave Size: 32, VRAM: 24560 MiB
  Device 1: AMD Instinct MI100, gfx908:sramecc+:xnack- (0x908), VMM: no, Wave Size: 64, VRAM: 32752 MiB
| model           |       size |     params | backend | ngl | fa |  test |           t/s |
| --------------- | ---------: | ---------: | ------- | --: | -: | ----: | ------------: |
| qwen35 27B Q8_0 |  26.62 GiB |    26.90 B | ROCm    |  99 |  1 | pp512 | 846.56 ± 0.87 |
| qwen35 27B Q8_0 |  26.62 GiB |    26.90 B | ROCm    |  99 |  1 | tg128 |  23.59 ± 0.02 |
build: f53577432 (8942)

root@fedora:/workspace/llama.cpp# ./build/bin/llama-bench -p 512 -n 128 -fa 1 --hf-repo bartowski/Qwen_Qwen3.6-27B-GGUF:Q8_0
ggml_cuda_init: found 2 ROCm devices (Total VRAM: 57312 MiB):
  Device 0: AMD Radeon RX 7900 XTX, gfx1100 (0x1100), VMM: no, Wave Size: 32, VRAM: 24560 MiB
  Device 1: AMD Instinct MI100, gfx908:sramecc+:xnack- (0x908), VMM: no, Wave Size: 64, VRAM: 32752 MiB
| model           |       size |     params | backend | ngl | fa |  test |           t/s |
| --------------- | ---------: | ---------: | ------- | --: | -: | ----: | ------------: |
| qwen35 27B Q8_0 |  26.69 GiB |    26.90 B | ROCm    |  99 |  1 | pp512 | 222.74 ± 0.92 |
| qwen35 27B Q8_0 |  26.69 GiB |    26.90 B | ROCm    |  99 |  1 | tg128 |  23.32 ± 0.13 |
build: f53577432 (8942)
```
gusbags@reddit
One major downside of llama.cpp-based engines is that they do not support tensor parallelism, which leaves a lot of performance (particularly during PP) untapped. vLLM / SGLang is what you want, though that usually involves more tinkering to find the right setup (also, TP is only available across 2, 4, 8, etc. GPUs).
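For reference, the launch is roughly this (a sketch only: the model id is a placeholder, and running vLLM on RDNA3 cards needs a ROCm-enabled build):

```
# Two-way tensor parallel with vLLM; the model id is a placeholder.
vllm serve Qwen/Qwen3.6-27B --tensor-parallel-size 2 --max-model-len 32768
```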
finevelyn@reddit
llama.cpp now has experimental support for --split-mode tensor, which gives the expected speed boost on dual GPUs for Gemma 4 for me (at least on generation). It supports an arbitrary number of GPUs, and they can also have different amounts of VRAM.
The Qwen 3.6 implementation still has a crash with tensor parallel, though.
SemaMod@reddit (OP)
Update:
Running with
-sm tensor -ctxcp 0 -cram 0 -fa 1 -c 0 has significantly helped. I'm consistently getting 28 t/s and somewhat improved prompt processing this way.
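For anyone trying the same, the full launch looks roughly like this (the model path is a placeholder; the flags are the ones above):

```
./build/bin/llama-server -m ./Qwen3.6-27B-Q8_0.gguf \
  -sm tensor -ctxcp 0 -cram 0 -fa 1 -c 0 \
  --host 127.0.0.1 --port 8080
```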
SemaMod@reddit (OP)
Benchmarks:
build: 0adede866 (8925)
SemaMod@reddit (OP)
Update: Did some benching, got interesting results.
```
| model | size | params | backend | ngl | n_ubatch | sm | fa | dev | test | t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | -------: | -----: | -: | ------------ | --------------: | -------------------: |
| qwen35 27B Q8_0 | 32.89 GiB | 26.90 B | ROCm,Vulkan | 999 | 2048 | layer | 1 | ROCm0/ROCm1/ROCm2 | pp512 | 960.08 ± 2.04 |
| qwen35 27B Q8_0 | 32.89 GiB | 26.90 B | ROCm,Vulkan | 999 | 2048 | layer | 1 | ROCm0/ROCm1/ROCm2 | tg128 | 20.16 ± 0.01 |
| qwen35 27B Q8_0 | 32.89 GiB | 26.90 B | ROCm,Vulkan | 999 | 2048 | layer | 1 | ROCm0/ROCm1/ROCm2 | pp512+tg32 | 255.92 ± 0.12 |
| qwen35 27B Q8_0 | 32.89 GiB | 26.90 B | ROCm,Vulkan | 999 | 2048 | layer | 1 | ROCm0/ROCm1/ROCm2 | pp2048+tg64 | 387.85 ± 0.36 |
| qwen35 27B Q8_0 | 32.89 GiB | 26.90 B | ROCm,Vulkan | 999 | 2048 | layer | 1 | ROCm0/ROCm1/ROCm2 | pp8192+tg128 | 559.36 ± 0.08 |
| qwen35 27B Q8_0 | 32.89 GiB | 26.90 B | ROCm,Vulkan | 999 | 2048 | layer | 1 | ROCm0/ROCm1/ROCm2 | pp512 @ d8192 | 379.62 ± 0.61 |
| qwen35 27B Q8_0 | 32.89 GiB | 26.90 B | ROCm,Vulkan | 999 | 2048 | layer | 1 | ROCm0/ROCm1/ROCm2 | tg128 @ d8192 | 19.65 ± 0.01 |
| qwen35 27B Q8_0 | 32.89 GiB | 26.90 B | ROCm,Vulkan | 999 | 2048 | layer | 1 | ROCm0/ROCm1/ROCm2 | pp512+tg32 @ d8192 | 182.70 ± 0.11 |
| qwen35 27B Q8_0 | 32.89 GiB | 26.90 B | ROCm,Vulkan | 999 | 2048 | layer | 1 | ROCm0/ROCm1/ROCm2 | pp2048+tg64 @ d8192 | 244.39 ± 0.12 |
| qwen35 27B Q8_0 | 32.89 GiB | 26.90 B | ROCm,Vulkan | 999 | 2048 | layer | 1 | ROCm0/ROCm1/ROCm2 | pp8192+tg128 @ d8192 | 372.67 ± 0.17 |
| qwen35 27B Q8_0 | 32.89 GiB | 26.90 B | ROCm,Vulkan | 999 | 2048 | layer | 1 | Vulkan0/Vulkan1/Vulkan2 | pp512 | 870.61 ± 1.28 |
| qwen35 27B Q8_0 | 32.89 GiB | 26.90 B | ROCm,Vulkan | 999 | 2048 | layer | 1 | Vulkan0/Vulkan1/Vulkan2 | tg128 | 19.34 ± 0.01 |
| qwen35 27B Q8_0 | 32.89 GiB | 26.90 B | ROCm,Vulkan | 999 | 2048 | layer | 1 | Vulkan0/Vulkan1/Vulkan2 | pp512+tg32 | 240.85 ± 2.16 |
| qwen35 27B Q8_0 | 32.89 GiB | 26.90 B | ROCm,Vulkan | 999 | 2048 | layer | 1 | Vulkan0/Vulkan1/Vulkan2 | pp2048+tg64 | 381.95 ± 7.11 |
| qwen35 27B Q8_0 | 32.89 GiB | 26.90 B | ROCm,Vulkan | 999 | 2048 | layer | 1 | Vulkan0/Vulkan1/Vulkan2 | pp8192+tg128 | 521.42 ± 1.72 |
| qwen35 27B Q8_0 | 32.89 GiB | 26.90 B | ROCm,Vulkan | 999 | 2048 | layer | 1 | Vulkan0/Vulkan1/Vulkan2 | pp512 @ d8192 | 753.02 ± 57.60 |
| qwen35 27B Q8_0 | 32.89 GiB | 26.90 B | ROCm,Vulkan | 999 | 2048 | layer | 1 | Vulkan0/Vulkan1/Vulkan2 | tg128 @ d8192 | 18.94 ± 0.00 |
| qwen35 27B Q8_0 | 32.89 GiB | 26.90 B | ROCm,Vulkan | 999 | 2048 | layer | 1 | Vulkan0/Vulkan1/Vulkan2 | pp512+tg32 @ d8192 | 227.03 ± 4.31 |
| qwen35 27B Q8_0 | 32.89 GiB | 26.90 B | ROCm,Vulkan | 999 | 2048 | layer | 1 | Vulkan0/Vulkan1/Vulkan2 | pp2048+tg64 @ d8192 | 347.00 ± 7.69 |
| qwen35 27B Q8_0 | 32.89 GiB | 26.90 B | ROCm,Vulkan | 999 | 2048 | layer | 1 | Vulkan0/Vulkan1/Vulkan2 | pp8192+tg128 @ d8192 | 459.58 ± 9.04 |
| qwen35 27B Q8_0 | 32.89 GiB | 26.90 B | ROCm,Vulkan | 999 | 2048 | tensor | 1 | ROCm0/ROCm1/ROCm2 | pp512 | 521.71 ± 0.04 |
| qwen35 27B Q8_0 | 32.89 GiB | 26.90 B | ROCm,Vulkan | 999 | 2048 | tensor | 1 | ROCm0/ROCm1/ROCm2 | tg128 | 31.76 ± 0.27 |
| qwen35 27B Q8_0 | 32.89 GiB | 26.90 B | ROCm,Vulkan | 999 | 2048 | tensor | 1 | ROCm0/ROCm1/ROCm2 | pp512+tg32 | 255.19 ± 0.08 |
| qwen35 27B Q8_0 | 32.89 GiB | 26.90 B | ROCm,Vulkan | 999 | 2048 | tensor | 1 | ROCm0/ROCm1/ROCm2 | pp2048+tg64 | 348.56 ± 0.19 |
| qwen35 27B Q8_0 | 32.89 GiB | 26.90 B | ROCm,Vulkan | 999 | 2048 | tensor | 1 | ROCm0/ROCm1/ROCm2 | pp8192+tg128 | 377.54 ± 0.03 |
| qwen35 27B Q8_0 | 32.89 GiB | 26.90 B | ROCm,Vulkan | 999 | 2048 | tensor | 1 | ROCm0/ROCm1/ROCm2 | pp512 @ d8192 | 365.05 ± 11.70 |
| qwen35 27B Q8_0 | 32.89 GiB | 26.90 B | ROCm,Vulkan | 999 | 2048 | tensor | 1 | ROCm0/ROCm1/ROCm2 | tg128 @ d8192 | 31.86 ± 0.34 |
| qwen35 27B Q8_0 | 32.89 GiB | 26.90 B | ROCm,Vulkan | 999 | 2048 | tensor | 1 | ROCm0/ROCm1/ROCm2 | pp512+tg32 @ d8192 | 221.75 ± 0.13 |
| qwen35 27B Q8_0 | 32.89 GiB | 26.90 B | ROCm,Vulkan | 999 | 2048 | tensor | 1 | ROCm0/ROCm1/ROCm2 | pp2048+tg64 @ d8192 | 279.43 ± 0.09 |
| qwen35 27B Q8_0 | 32.89 GiB | 26.90 B | ROCm,Vulkan | 999 | 2048 | tensor | 1 | ROCm0/ROCm1/ROCm2 | pp8192+tg128 @ d8192 | 292.38 ± 0.04 |
| qwen35 27B Q8_0 | 32.89 GiB | 26.90 B | ROCm,Vulkan | 999 | 2048 | tensor | 1 | Vulkan0/Vulkan1/Vulkan2 | pp512 | 258.99 ± 0.12 |
| qwen35 27B Q8_0 | 32.89 GiB | 26.90 B | ROCm,Vulkan | 999 | 2048 | tensor | 1 | Vulkan0/Vulkan1/Vulkan2 | tg128 | 6.56 ± 0.01 |
| qwen35 27B Q8_0 | 32.89 GiB | 26.90 B | ROCm,Vulkan | 999 | 2048 | tensor | 1 | Vulkan0/Vulkan1/Vulkan2 | pp512+tg32 | 77.83 ± 0.01 |
| qwen35 27B Q8_0 | 32.89 GiB | 26.90 B | ROCm,Vulkan | 999 | 2048 | tensor | 1 | Vulkan0/Vulkan1/Vulkan2 | pp2048+tg64 | 125.57 ± 0.05 |
| qwen35 27B Q8_0 | 32.89 GiB | 26.90 B | ROCm,Vulkan | 999 | 2048 | tensor | 1 | Vulkan0/Vulkan1/Vulkan2 | pp8192+tg128 | 173.43 ± 0.06 |
| qwen35 27B Q8_0 | 32.89 GiB | 26.90 B | ROCm,Vulkan | 999 | 2048 | tensor | 1 | Vulkan0/Vulkan1/Vulkan2 | pp512 @ d8192 | 244.10 ± 9.61 |
| qwen35 27B Q8_0 | 32.89 GiB | 26.90 B | ROCm,Vulkan | 999 | 2048 | tensor | 1 | Vulkan0/Vulkan1/Vulkan2 | tg128 @ d8192 | 6.45 ± 0.01 |
| qwen35 27B Q8_0 | 32.89 GiB | 26.90 B | ROCm,Vulkan | 999 | 2048 | tensor | 1 | Vulkan0/Vulkan1/Vulkan2 | pp512+tg32 @ d8192 | 76.61 ± 0.41 |
| qwen35 27B Q8_0 | 32.89 GiB | 26.90 B | ROCm,Vulkan | 999 | 2048 | tensor | 1 | Vulkan0/Vulkan1/Vulkan2 | pp2048+tg64 @ d8192 | 123.08 ± 0.13 |
| qwen35 27B Q8_0 | 32.89 GiB | 26.90 B | ROCm,Vulkan | 999 | 2048 | tensor | 1 | Vulkan0/Vulkan1/Vulkan2 | pp8192+tg128 @ d8192 | 170.02 ± 0.18 |
build: 0adede866 (8925)
```
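For anyone wanting to reproduce a sweep like this, the llama-bench invocation would look roughly like the following (a sketch only: the model path is a placeholder, and -d, -pg, and the tensor split mode need a recent llama.cpp build):

```
./build/bin/llama-bench -m ./Qwen3.6-27B-Q8_0.gguf \
  -fa 1 -ngl 999 -ub 2048 \
  -sm layer,tensor \
  -p 512,2048,8192 -n 128 \
  -pg 512,32 -pg 2048,64 -pg 8192,128 \
  -d 0,8192
```

The ROCm and Vulkan rows came from separate runs with the corresponding backend/device selection.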
RedAdo2020@reddit
That's weird. I'm using Q8, and across 4x 5070 Ti I get:
| PP | TG | N_KV | T_PP s | S_PP t/s | T_TG s | S_TG t/s |
|-------|--------|--------|----------|----------|----------|----------|
| 4096 | 1024 | 0 | 1.873 | 2187.40 | 23.452 | 43.66 |
| 4096 | 1024 | 4096 | 1.875 | 2184.66 | 23.179 | 44.18 |
| 4096 | 1024 | 8192 | 1.885 | 2172.91 | 23.371 | 43.81 |
| 4096 | 1024 | 12288 | 1.913 | 2141.39 | 23.620 | 43.35 |
| 4096 | 1024 | 16384 | 1.945 | 2106.43 | 23.844 | 42.95 |
| 4096 | 1024 | 20480 | 1.972 | 2077.54 | 24.103 | 42.48 |
| 4096 | 1024 | 24576 | 2.007 | 2040.41 | 24.345 | 42.06 |
| 4096 | 1024 | 28672 | 2.031 | 2016.48 | 24.584 | 41.65 |
| 4096 | 1024 | 32768 | 2.063 | 1985.32 | 24.933 | 41.07 |
| 4096 | 1024 | 36864 | 2.091 | 1959.02 | 25.021 | 40.93 |
| 4096 | 1024 | 40960 | 2.117 | 1935.15 | 25.176 | 40.67 |
| 4096 | 1024 | 45056 | 2.145 | 1909.44 | 25.348 | 40.40 |
| 4096 | 1024 | 49152 | 2.180 | 1878.64 | 25.530 | 40.11 |
| 4096 | 1024 | 53248 | 2.205 | 1857.82 | 25.693 | 39.86 |
| 4096 | 1024 | 57344 | 2.238 | 1830.50 | 25.886 | 39.56 |
| 4096 | 1024 | 61440 | 2.263 | 1810.10 | 26.051 | 39.31 |
| 4096 | 1024 | 65536 | 2.292 | 1787.15 | 26.342 | 38.87 |
| 4096 | 1024 | 69632 | 2.327 | 1760.00 | 26.459 | 38.70 |
| 4096 | 1024 | 73728 | 2.355 | 1738.95 | 26.602 | 38.49 |
| 4096 | 1024 | 77824 | 2.382 | 1719.48 | 26.772 | 38.25 |
| 4096 | 1024 | 81920 | 2.415 | 1696.21 | 26.946 | 38.00 |
| 4096 | 1024 | 86016 | 2.446 | 1674.40 | 27.115 | 37.77 |
| 4096 | 1024 | 90112 | 2.478 | 1652.82 | 27.299 | 37.51 |
| 4096 | 1024 | 94208 | 2.511 | 1631.46 | 27.482 | 37.26 |
| 4096 | 1024 | 98304 | 2.541 | 1611.75 | 27.732 | 36.92 |
| 4096 | 1024 | 102400 | 2.572 | 1592.84 | 27.869 | 36.74 |
| 4096 | 1024 | 106496 | 2.600 | 1575.32 | 28.004 | 36.57 |
| 4096 | 1024 | 110592 | 2.640 | 1551.29 | 28.171 | 36.35 |
| 4096 | 1024 | 114688 | 2.672 | 1532.74 | 28.361 | 36.11 |
| 4096 | 1024 | 118784 | 2.709 | 1512.06 | 28.519 | 35.91 |
| 4096 | 1024 | 122880 | 2.746 | 1491.84 | 28.703 | 35.68 |
| 4096 | 1024 | 126976 | 2.798 | 1463.79 | 28.889 | 35.45 |
| 4096 | 1024 | 131072 | 2.836 | 1444.28 | 29.509 | 34.70 |
| 4096 | 1024 | 135168 | 2.882 | 1420.99 | 30.131 | 33.98 |
| 4096 | 1024 | 139264 | 2.909 | 1407.94 | 29.469 | 34.75 |
| 4096 | 1024 | 143360 | 2.940 | 1392.99 | 29.720 | 34.45 |
| 4096 | 1024 | 147456 | 2.997 | 1366.67 | 29.755 | 34.41 |
| 4096 | 1024 | 151552 | 3.041 | 1346.80 | 29.935 | 34.21 |
| 4096 | 1024 | 155648 | 3.070 | 1334.28 | 30.143 | 33.97 |
| 4096 | 1024 | 159744 | 3.123 | 1311.50 | 30.454 | 33.62 |
| 4096 | 1024 | 163840 | 3.259 | 1256.69 | 31.215 | 32.80 |
| 4096 | 1024 | 167936 | 3.163 | 1294.83 | 31.784 | 32.22 |
| 4096 | 1024 | 172032 | 3.236 | 1265.64 | 31.213 | 32.81 |
| 4096 | 1024 | 176128 | 3.324 | 1232.16 | 31.855 | 32.15 |
| 4096 | 1024 | 180224 | 3.338 | 1227.03 | 32.425 | 31.58 |
| 4096 | 1024 | 184320 | 3.338 | 1226.97 | 31.851 | 32.15 |
| 4096 | 1024 | 188416 | 3.399 | 1205.05 | 32.099 | 31.90 |
| 4096 | 1024 | 192512 | 3.425 | 1195.87 | 32.489 | 31.52 |
So even at 192k context I get faster PP and TG than you.
I run a 9950X3D and 4x 5070 Ti, with x8 lanes on the first card and x4 on the rest. My command:
```
CUDA_VISIBLE_DEVICES=0,1,2,3 ./LLM/ik_llama.cpp/build/bin/llama-server \
--model /LLM/Models/Qwen3.6-27B-Uncensored-HauhauCS-Aggressive/Qwen3.6-27B-Uncensored-HauhauCS-Aggressive-Q8_K_P.gguf \
--alias Qwen3.6-27B-Uncensored-HauhauCS-Aggressive-Q8_K_P.gguf \
--ctx-size 196608 \
-fa on \
-b 4096 -ub 4096 \
-smgs \
--max-gpu 4 \
-sm graph \
-mg 0 \
-ngl 999 \
--host 127.0.0.1 \
--port 8080 \
--threads 16 \
--parallel 1 \
--temp 1 \
--top-p 0.95 \
--top-k 20 \
--min-p 0.0 \
--presence-penalty 1.5 \
--repeat-penalty 1.0 \
--cache-ram -1 \
-ts 0.9,1,1,0.4 \
--jinja
```
Apprehensive_Use1906@reddit
I’ll give it a try with my 2x r9700 cards and report back.
rebelSun25@reddit
Nice. I'm looking into this card. Even at $1900 CAD it's worth it, since the 30GB models are getting interesting now.
Apprehensive_Use1906@reddit
It was definitely worth it for me. Two of these were around the same price as one 5080 in my neck of the woods. I was more interested in the memory than the speed (speed is nice, of course). They actually work pretty well for gaming as well.
Apprehensive_Use1906@reddit
I ran a benchmarking script in LM studio and made sure both GPUs were being utilized. Following is the output (Sorry about the wonky spacing):
"usage":
"prompt_tokens": 19,
"completion_tokens": 500,
"total_tokens": 519,
"completion_tokens_details":
"reasoning_tokens": 499
"system_fingerprint": "qwen/qwen3.6-27b"
Hope this helps.
picosec@reddit
I get about a 50% speedup in token generation with Q8_K_XL quants using llama.cpp with "-sm tensor" vs the default "-sm layer", on an RTX 3090 Ti and an RTX 3090 (both in PCIe 4.0 x16 slots).
I'm not sure if "-sm tensor" works with multiple 7900 XTX cards, though.
DeProgrammer99@reddit
I get the same speed with Q6_K_XL on my RX 7900 XTX with a bit of it offloaded to my RTX 4060 Ti.
I thought you could use --split-mode tensor for identical GPUs for better speed, but it seems that change was reverted. (Was merged in pull request 19378 with a lot of unsupported cases.)
BigYoSpeck@reddit
My two cards are faster.
You don't give enough details about your setup to help, though.
Pretend_Engineer5951@reddit
18-20 t/s is a nice speed. My 8060S only gives 6-7 t/s.
Look_0ver_There@reddit
This is sadly "about right" for the Q8 quants of Qwen3.6-27B.
With a multi-GPU setup you're also likely up against PCIe latency, whereby every hop from card to card requires one card writing state back to system memory via the CPU, which then passes it on to the next card.
I have some Radeon AI Pro 9700s, and even when the model fits entirely on one card, PP performance peaks at around 1100 tok/s and TG is around 22 t/s. As more cards get used, performance drops due to the aforementioned card-to-card latency.
You can use something like vLLM across two cards to improve things a bit, but even there the gain for a single user is almost nothing. vLLM works best when you have a number of requests in parallel, since it does a better job of keeping all the cards busy at once, whereas llama.cpp won't keep multiple cards as busy.
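To illustrate: the batching advantage only shows up once several requests are in flight, e.g. something along these lines against an OpenAI-compatible endpoint (the port and model id are assumptions):

```
# Fire 8 completions concurrently so continuous batching can keep every GPU busy.
for i in $(seq 1 8); do
  curl -s http://localhost:8000/v1/completions \
    -H "Content-Type: application/json" \
    -d '{"model": "qwen3.6-27b", "prompt": "Explain GPU scheduling.", "max_tokens": 128}' &
done
wait
```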