Don't forget about dem free gains!
Posted by Ok-Measurement-1575@reddit | LocalLLaMA | 15 comments
Looks like progress has been made on -sm tensor; a few weeks ago llama-bench wouldn't even run with it:
1 card - 1580/44:
$ llama-bench -m Qwen3.6-27B-UD-Q4_K_XL.gguf -fa 1
ggml_cuda_init: found 1 CUDA devices (Total VRAM: 24112 MiB):
Device 0: NVIDIA GeForce RTX 3090 Ti, compute capability 8.6, VMM: yes, VRAM: 24112 MiB
| model | size | params | backend | ngl | fa | test | t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | -: | --------------: | -------------------: |
| qwen35 27B Q4_K - Medium | 16.39 GiB | 26.90 B | CUDA | 99 | 1 | pp512 | 1580.12 ± 104.92 |
| qwen35 27B Q4_K - Medium | 16.39 GiB | 26.90 B | CUDA | 99 | 1 | tg128 | 44.43 ± 0.17 |
build: 665abc609 (8951)
2 cards - 2047/58:
$ export CUDA_VISIBLE_DEVICES=0,1
$ llama-bench -m Qwen3.6-27B-UD-Q4_K_XL.gguf -fa 1 -sm tensor
ggml_cuda_init: found 2 CUDA devices (Total VRAM: 48224 MiB):
Device 0: NVIDIA GeForce RTX 3090 Ti, compute capability 8.6, VMM: yes, VRAM: 24112 MiB
Device 1: NVIDIA GeForce RTX 3090 Ti, compute capability 8.6, VMM: yes, VRAM: 24112 MiB
| model | size | params | backend | ngl | sm | fa | test | t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | -----: | -: | --------------: | -------------------: |
| qwen35 27B Q4_K - Medium | 16.39 GiB | 26.90 B | CUDA | 99 | tensor | 1 | pp512 | 2047.28 ± 76.47 |
| qwen35 27B Q4_K - Medium | 16.39 GiB | 26.90 B | CUDA | 99 | tensor | 1 | tg128 | 58.83 ± 2.28 |
build: 665abc609 (8951)
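The split mode is a shared argument (-sm / --split-mode), so the same setting should carry over to llama-server for actual serving. A rough sketch; double-check the flag spellings against --help on your build, and note that some builds may only have the tensor mode wired into llama-bench so far:
$ export CUDA_VISIBLE_DEVICES=0,1
$ llama-server -m Qwen3.6-27B-UD-Q4_K_XL.gguf -ngl 99 -sm tensor
# enable flash attention with -fa / --flash-attn (newer builds take a value, e.g. -fa on)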
jwpbe@reddit
I have 2x 3090s and I'm currently using vLLM with the int4 autoround quant. I get between 50 and 80 tokens a second depending on context length (no MTP at this point; it slows down generation for me for no apparent reason).
The prompt caching is generally more aggressive and effective than llama.cpp's implementation.
Would you mind comparing your llama.cpp setup to ik_llama's -sm graph? It's also a tensor split, and I'd like to see where all three land. I'll share my vLLM args / setup if you'd like.
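For reference, the vLLM and ik_llama sides of that comparison would look roughly like this; the model path, context length, and memory-utilization numbers are placeholders, not the actual args from this thread:
# vLLM, tensor parallel across both 3090s
$ vllm serve /path/to/Qwen-27B-int4-autoround \
    --tensor-parallel-size 2 \
    --max-model-len 32768 \
    --gpu-memory-utilization 0.90
# vLLM usually picks up the AutoRound quantization from the model's own config, so no --quantization flag is needed
# ik_llama.cpp, using the -sm graph mode mentioned above
$ llama-bench -m Qwen3.6-27B-UD-Q4_K_XL.gguf -fa 1 -sm graph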
Ok-Measurement-1575@reddit (OP)
These are my numbers for int4 autoround:
jwpbe@reddit
Looks like you found your new inference engine lmao. AutoRound is great.
Ok-Measurement-1575@reddit (OP)
45% on MMLU-Pro after 100 questions... something seems awry.
jwpbe@reddit
Can you briefly share how you set up this bench so I can try to replicate it on my end?
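The exact harness isn't shown in the thread, but any local MMLU-Pro runner boils down to sending OpenAI-style chat requests to the server, so the setup is mostly pointing it at the right base URL (llama-server defaults to port 8080, vLLM to 8000). A minimal hand-rolled request, with a placeholder question, looks like:
$ curl -s http://localhost:8080/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{"model": "local", "temperature": 0, "max_tokens": 8,
         "messages": [{"role": "user", "content": "<question + lettered options>\nAnswer with the letter only."}]}'
# "model" is ignored by llama-server (it serves whatever was loaded) but must match the served name on vLLM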
Ok-Measurement-1575@reddit (OP)
Very strong numbers, fair play. Hopefully the intelligence is there, too :D
Ok-Measurement-1575@reddit (OP)
Run llama-benchy against it. 80 t/s sounds very strong.
I do have that quant downloaded, I think.
fala13@reddit
Doesn't work with KV cache quantization, so no go.
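For context, KV cache quantization here means the cache-type flags, so the combination being reported as broken is along these lines (this was the state at the time of the thread; it may have been fixed since):
$ llama-bench -m Qwen3.6-27B-UD-Q4_K_XL.gguf -fa 1 -sm tensor -ctk q8_0 -ctv q8_0
# -ctk / -ctv set the K and V cache types (f16 by default); q8_0 roughly halves the KV cache footprint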
Ok-Measurement-1575@reddit (OP)
You appear to be right:
youcloudsofdoom@reddit
Is this not just because you're using two cards instead of one?
jwpbe@reddit
Split mode layer generally gets you a speedup, yeah. Not sure why OP says otherwise; maybe she's an old head and it used to be true. But splitting by tensor instead of by layer tends to reduce communication over the PCIe bus, which leads to speedups.
Ok-Measurement-1575@reddit (OP)
I'm relatively new to the game, but in my experience simply adding extra cards to llama-server doesn't offer any appreciable speedup if the model already fits comfortably inside VRAM.
-sm tensor is a game changer for llama.cpp, just like -sm graph was for ik_llama (except that kept segfaulting for me back when I tried it - it's probably great now).
For reference, the three split modes (a one-run comparison is sketched below):
- Layer
- Row
- Tensor
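llama-bench accepts comma-separated lists for most of its parameters, so all three can be compared in a single run (assuming your build accepts tensor as a value):
$ llama-bench -m Qwen3.6-27B-UD-Q4_K_XL.gguf -fa 1 -sm layer,row,tensor
# prints a pp512 and a tg128 row for each split mode, same format as the tables above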
ttkciar@reddit
Back when I fiddled with it (about a year ago), splitting by layer did not improve unbatched performance at all, but improved batched throughput quite a bit (40% to 80% depending on batch size and model).
If that has changed since, I am very curious to learn why.
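The batched case can be measured directly with llama-batched-bench from the same build; a rough sketch, assuming it accepts the usual common flags (-ngl, -sm):
$ llama-batched-bench -m Qwen3.6-27B-UD-Q4_K_XL.gguf -c 16384 -ngl 99 -npp 512 -ntg 128 -npl 1,2,4,8 -sm tensor
# -npp / -ntg are the prompt and generation lengths per sequence, -npl the parallel-sequence counts to sweep;
# rerun with -sm layer to separate the batching gain from the split-mode gain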
Ok-Measurement-1575@reddit (OP)
Historically, two cards only get you extra VRAM to play with; they don't get you extra speed.
This is a new feature, added a few weeks ago. It's the same or a very similar feature to what vLLM has, but without all the headache associated with installing it.
nsfnd@reddit
I wonder about Vulkan, any information on that? I have a 5090 and a 7900 XTX.
-sm row and -sm layer work, and I can utilize 56 GB of VRAM minus OS usage.
I'm using the 27B int4 autoround on the 5090 at the moment for some work, so I can't test -sm tensor right now.