Don't forget about dem free gains!

Posted by Ok-Measurement-1575@reddit | LocalLLaMA | 15 comments

Looks like progress has been made on -sm tensor. llama-bench wouldn't even run with it a few weeks ago:

1 card - 1580 t/s pp / 44 t/s tg:

$ llama-bench -m Qwen3.6-27B-UD-Q4_K_XL.gguf -fa 1
ggml_cuda_init: found 1 CUDA devices (Total VRAM: 24112 MiB):
  Device 0: NVIDIA GeForce RTX 3090 Ti, compute capability 8.6, VMM: yes, VRAM: 24112 MiB
| model                          |       size |     params | backend    | ngl | fa |            test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | -: | --------------: | -------------------: |
| qwen35 27B Q4_K - Medium       |  16.39 GiB |    26.90 B | CUDA       |  99 |  1 |           pp512 |     1580.12 ± 104.92 |
| qwen35 27B Q4_K - Medium       |  16.39 GiB |    26.90 B | CUDA       |  99 |  1 |           tg128 |         44.43 ± 0.17 |

build: 665abc609 (8951)

2 cards - 2047 t/s pp / 58 t/s tg:

$ export CUDA_VISIBLE_DEVICES=0,1
$ llama-bench -m Qwen3.6-27B-UD-Q4_K_XL.gguf -fa 1 -sm tensor
ggml_cuda_init: found 2 CUDA devices (Total VRAM: 48224 MiB):
  Device 0: NVIDIA GeForce RTX 3090 Ti, compute capability 8.6, VMM: yes, VRAM: 24112 MiB
  Device 1: NVIDIA GeForce RTX 3090 Ti, compute capability 8.6, VMM: yes, VRAM: 24112 MiB
| model                          |       size |     params | backend    | ngl |     sm | fa |            test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | -----: | -: | --------------: | -------------------: |
| qwen35 27B Q4_K - Medium       |  16.39 GiB |    26.90 B | CUDA       |  99 | tensor |  1 |           pp512 |      2047.28 ± 76.47 |
| qwen35 27B Q4_K - Medium       |  16.39 GiB |    26.90 B | CUDA       |  99 | tensor |  1 |           tg128 |         58.83 ± 2.28 |

build: 665abc609 (8951)
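For anyone curious what the second card actually buys you here, a quick sketch that computes the relative speedup from the t/s figures in the two runs above (numbers taken straight from the tables, nothing else assumed):

```python
# Speedup from adding a second RTX 3090 Ti with -sm tensor,
# using the mean t/s values reported by llama-bench above.
one_card = {"pp512": 1580.12, "tg128": 44.43}
two_cards = {"pp512": 2047.28, "tg128": 58.83}

for test in one_card:
    speedup = two_cards[test] / one_card[test]
    print(f"{test}: {speedup:.2f}x")
```

So roughly a 1.30x gain on prompt processing and 1.32x on token generation from the second card. Not linear scaling, but a real improvement over running a single GPU.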