Don't forget about dem free gains!
Posted by Ok-Measurement-1575@reddit | LocalLLaMA | 15 comments
Looks like progress has been made on -sm tensor; a few weeks ago llama-bench wouldn't even run with it:
1 card - 1580/44:
$ llama-bench -m Qwen3.6-27B-UD-Q4_K_XL.gguf -fa 1
ggml_cuda_init: found 1 CUDA devices (Total VRAM: 24112 MiB):
Device 0: NVIDIA GeForce RTX 3090 Ti, compute capability 8.6, VMM: yes, VRAM: 24112 MiB
| model | size | params | backend | ngl | fa | test | t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | -: | --------------: | -------------------: |
| qwen35 27B Q4_K - Medium | 16.39 GiB | 26.90 B | CUDA | 99 | 1 | pp512 | 1580.12 ± 104.92 |
| qwen35 27B Q4_K - Medium | 16.39 GiB | 26.90 B | CUDA | 99 | 1 | tg128 | 44.43 ± 0.17 |
build: 665abc609 (8951)
2 cards - 2047/58:
$ export CUDA_VISIBLE_DEVICES=0,1
$ llama-bench -m Qwen3.6-27B-UD-Q4_K_XL.gguf -fa 1 -sm tensor
ggml_cuda_init: found 2 CUDA devices (Total VRAM: 48224 MiB):
Device 0: NVIDIA GeForce RTX 3090 Ti, compute capability 8.6, VMM: yes, VRAM: 24112 MiB
Device 1: NVIDIA GeForce RTX 3090 Ti, compute capability 8.6, VMM: yes, VRAM: 24112 MiB
| model | size | params | backend | ngl | sm | fa | test | t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | -----: | -: | --------------: | -------------------: |
| qwen35 27B Q4_K - Medium | 16.39 GiB | 26.90 B | CUDA | 99 | tensor | 1 | pp512 | 2047.28 ± 76.47 |
| qwen35 27B Q4_K - Medium | 16.39 GiB | 26.90 B | CUDA | 99 | tensor | 1 | tg128 | 58.83 ± 2.28 |
build: 665abc609 (8951)
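The split mode is a shared argument (-sm / --split-mode), so the same setting should carry over to llama-server for actual serving. A rough sketch; double-check the flag spellings against --help on your build, and note that some builds may only have the tensor mode wired into llama-bench so far:
$ export CUDA_VISIBLE_DEVICES=0,1
$ llama-server -m Qwen3.6-27B-UD-Q4_K_XL.gguf -ngl 99 -sm tensor
# enable flash attention with -fa / --flash-attn (newer builds take a value, e.g. -fa on)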
jwpbe@reddit
I have 2x 3090s and I'm currently using vLLM with the int4 autoround quant. I get between 50 and 80 tokens a second depending on context length (no MTP at this point; it slows down generation for me for no apparent reason).
The prompt caching is generally more aggressive and effective than llama.cpp's implementation.
Would you mind comparing your llama.cpp setup to ik_llama's -sm graph? It's also a tensor split, and I'd like to see where all three land. I'll share my vLLM args / setup if you'd like.
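For reference, the vLLM and ik_llama sides of that comparison would look roughly like this; the model path, context length, and memory-utilization numbers are placeholders, not the actual args from this thread:
# vLLM, tensor parallel across both 3090s
$ vllm serve /path/to/Qwen-27B-int4-autoround \
    --tensor-parallel-size 2 \
    --max-model-len 32768 \
    --gpu-memory-utilization 0.90
# vLLM usually picks up the AutoRound quantization from the model's own config, so no --quantization flag is needed
# ik_llama.cpp, using the -sm graph mode mentioned above
$ llama-bench -m Qwen3.6-27B-UD-Q4_K_XL.gguf -fa 1 -sm graph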
Ok-Measurement-1575@reddit (OP)
These are my numbers for int4 autoround:
jwpbe@reddit
Looks like you found your new inference engine lmao. AutoRound is great.
Ok-Measurement-1575@reddit (OP)
45% on MMLU-Pro after 100 questions... something seems awry.
jwpbe@reddit
Can you briefly share how you set up this bench so I can try to replicate it on my end?
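The exact harness isn't shown in the thread, but any local MMLU-Pro runner boils down to sending OpenAI-style chat requests to the server, so the setup is mostly pointing it at the right base URL (llama-server defaults to port 8080, vLLM to 8000). A minimal hand-rolled request, with a placeholder question, looks like:
$ curl -s http://localhost:8080/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{"model": "local", "temperature": 0, "max_tokens": 8,
         "messages": [{"role": "user", "content": "<question + lettered options>\nAnswer with the letter only."}]}'
# "model" is ignored by llama-server (it serves whatever was loaded) but must match the served name on vLLM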
Ok-Measurement-1575@reddit (OP)
Very strong numbers, fair play. Hopefully the intelligence is there, too :D
Ok-Measurement-1575@reddit (OP)
Run llama-benchy against it. 80 t/s sounds very strong.
I do have that quant downloaded, I think.
fala13@reddit
Doesn't work with KV cache quantization, so no go.
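For context, KV cache quantization here means the cache-type flags, so the combination being reported as broken is along these lines (this was the state at the time of the thread; it may have been fixed since):
$ llama-bench -m Qwen3.6-27B-UD-Q4_K_XL.gguf -fa 1 -sm tensor -ctk q8_0 -ctv q8_0
# -ctk / -ctv set the K and V cache types (f16 by default); q8_0 roughly halves the KV cache footprint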
Ok-Measurement-1575@reddit (OP)
You appear to be right:
youcloudsofdoom@reddit
Is this not just because you're using two cards instead of one?
jwpbe@reddit
Split mode layer generally gets you a speedup, yeah. Not sure why OP says otherwise; maybe she's an old head and it used to be true. But splitting by tensor instead of by layer tends to reduce communication over the PCIe bus, which leads to speedups.
Ok-Measurement-1575@reddit (OP)
I'm relatively new to the game, but in my experience simply adding extra cards to llama-server doesn't offer any appreciable speedup if the model already fits comfortably inside VRAM.
-sm tensor is a game changer for llama.cpp, just like -sm graph was for ik_llama (except that kept segfaulting for me back when I tried it - it's probably great now).
For reference, the three split modes (a one-run comparison is sketched below):
- Layer
- Row
- Tensor
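llama-bench accepts comma-separated lists for most of its parameters, so all three can be compared in a single run (assuming your build accepts tensor as a value):
$ llama-bench -m Qwen3.6-27B-UD-Q4_K_XL.gguf -fa 1 -sm layer,row,tensor
# prints a pp512 and a tg128 row for each split mode, same format as the tables above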
ttkciar@reddit
Back when I fiddled with it (about a year ago), splitting by layer did not improve unbatched performance at all, but improved batched throughput quite a bit (40% to 80% depending on batch size and model).
If that has changed since, I am very curious to learn why.
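The batched case can be measured directly with llama-batched-bench from the same build; a rough sketch, assuming it accepts the usual common flags (-ngl, -sm):
$ llama-batched-bench -m Qwen3.6-27B-UD-Q4_K_XL.gguf -c 16384 -ngl 99 -npp 512 -ntg 128 -npl 1,2,4,8 -sm tensor
# -npp / -ntg are the prompt and generation lengths per sequence, -npl the parallel-sequence counts to sweep;
# rerun with -sm layer to separate the batching gain from the split-mode gain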
Ok-Measurement-1575@reddit (OP)
Historically, two cards only get you extra VRAM to play with; they don't get you extra speed.
This is a new feature, added a few weeks ago. It's the same or a very similar feature to what vLLM has, but without all the headache associated with installing it.
nsfnd@reddit
I wonder about Vulkan, any information on that? I have a 5090 and a 7900 XTX.
-sm row and -sm layer work, and I can utilize 56 GB of VRAM minus OS usage.
I'm using the 27B int4 autoround on the 5090 at the moment for some work, so I can't test -sm tensor right now.