am I running this llama-bench of Qwen3.6-27B on these V100s right?
Posted by starkruzr@reddit | LocalLLaMA | View on Reddit | 19 comments
basically what I'm doing here is trying to validate whether it's a reasonable idea to get a couple of V100s in the first place (either SXMs with PCIe adapters or straight-up PCIe cards) for the sake of running this model or models like it, for codegen and other mostly-text applications. a pair of these is around $1200 for 64GB of VRAM, compared to $1100 for 24GB from a 3090. my sense is that with 64GB you are simply not going to run out of room for context with an arrangement like this, with the model running at INT8 and the KV cache unquantized, for any remotely reasonable amount of context.
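(rough sanity check on that claim, using placeholder numbers since I haven't pulled the actual model config: an unquantized fp16 KV cache costs about 2 × n_layers × n_kv_heads × head_dim × 2 bytes per token. if this model were something like 48 layers with 8 KV heads at head_dim 128, that would be roughly 192 KiB per token, or about 12 GiB at 64K context, which still fits comfortably next to the ~27GB of Q8_0 weights in 64GB total.)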
one thing though is that I'm not sure why pp takes a dive at 64K context in this series of benchmarks. I'm just wondering if there are obvious things I'm not remembering to do here. TIA.
4478180@pdgx0001:~/llama.cpp/build/bin$ CUDA_VISIBLE_DEVICES=0,1 ./llama-bench -hf unsloth/Qwen3.6-27B-GGUF:Q8_0 -sm tensor -ngl 999 -t 64 --flash-attn 1 -p 2048 -d 4096,16384,65536
ggml_cuda_init: found 2 CUDA devices (Total VRAM: 65002 MiB):
Device 0: Tesla V100-SXM2-32GB, compute capability 7.0, VMM: yes, VRAM: 32501 MiB
Device 1: Tesla V100-SXM2-32GB, compute capability 7.0, VMM: yes, VRAM: 32501 MiB
| model | size | params | backend | ngl | threads | sm | fa | test | t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | ------: | -----: | -: | --------------: | -------------------: |
| qwen35 27B Q8_0 | 26.62 GiB | 26.90 B | CUDA | 999 | 64 | tensor | 1 | pp2048 @ d4096 | 797.25 ± 3.55 |
| qwen35 27B Q8_0 | 26.62 GiB | 26.90 B | CUDA | 999 | 64 | tensor | 1 | tg128 @ d4096 | 31.16 ± 0.40 |
| qwen35 27B Q8_0 | 26.62 GiB | 26.90 B | CUDA | 999 | 64 | tensor | 1 | pp2048 @ d16384 | 702.58 ± 8.55 |
| qwen35 27B Q8_0 | 26.62 GiB | 26.90 B | CUDA | 999 | 64 | tensor | 1 | tg128 @ d16384 | 30.27 ± 0.36 |
| qwen35 27B Q8_0 | 26.62 GiB | 26.90 B | CUDA | 999 | 64 | tensor | 1 | pp2048 @ d65536 | 473.34 ± 2.69 |
| qwen35 27B Q8_0 | 26.62 GiB | 26.90 B | CUDA | 999 | 64 | tensor | 1 | tg128 @ d65536 | 26.71 ± 0.29 |
build: 2496f9c14 (9049)
4478180@pdgx0001:~/llama.cpp/build/bin$ CUDA_VISIBLE_DEVICES=0,1 ./llama-bench -hf unsloth/Qwen3.6-27B-GGUF:Q8_0 -sm tensor -ngl 999 -t 64 --flash-attn 1 -p 2048 -d 200000
ggml_cuda_init: found 2 CUDA devices (Total VRAM: 65002 MiB):
Device 0: Tesla V100-SXM2-32GB, compute capability 7.0, VMM: yes, VRAM: 32501 MiB
Device 1: Tesla V100-SXM2-32GB, compute capability 7.0, VMM: yes, VRAM: 32501 MiB
| model | size | params | backend | ngl | threads | sm | fa | test | t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | ------: | -----: | -: | --------------: | -------------------: |
| qwen35 27B Q8_0 | 26.62 GiB | 26.90 B | CUDA | 999 | 64 | tensor | 1 | pp2048 @ d200000 | 267.16 ± 0.29 |
| qwen35 27B Q8_0 | 26.62 GiB | 26.90 B | CUDA | 999 | 64 | tensor | 1 | tg128 @ d200000 | 18.53 ± 0.14 |
build: 2496f9c14 (9049)
4478180@pdgx0001:~/llama.cpp/build/bin$ CUDA_VISIBLE_DEVICES=0,1 ./llama-bench -hf unsloth/Qwen3.6-27B-GGUF:Q8_0 -sm tensor -ngl 999 -t 64 --flash-attn 1 -p 2048 -d 128000
ggml_cuda_init: found 2 CUDA devices (Total VRAM: 65002 MiB):
Device 0: Tesla V100-SXM2-32GB, compute capability 7.0, VMM: yes, VRAM: 32501 MiB
Device 1: Tesla V100-SXM2-32GB, compute capability 7.0, VMM: yes, VRAM: 32501 MiB
| model | size | params | backend | ngl | threads | sm | fa | test | t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | ------: | -----: | -: | --------------: | -------------------: |
| qwen35 27B Q8_0 | 26.62 GiB | 26.90 B | CUDA | 999 | 64 | tensor | 1 | pp2048 @ d128000 | 352.66 ± 0.61 |
| qwen35 27B Q8_0 | 26.62 GiB | 26.90 B | CUDA | 999 | 64 | tensor | 1 | tg128 @ d128000 | 23.06 ± 0.23 |
build: 2496f9c14 (9049)
Glittering-Call8746@reddit
Get a 3090, period. A V100 isn't worth it unless you're doing more than 4 of them, and then they need to be SXM2 on a PCIe adapter.
starkruzr@reddit (OP)
why?
DaMoot@reddit
You want the SXM2 version, NOT the PCIe version, or run the SXM2 modules in a PCIe adapter.
You want the SXM2 version because it has full NVLink 2.0 connectivity (six links per GPU), allowing you to connect 2 to 8 of them together.
You can get carrier boards in 2-way or 4-way configurations that provide the SXM2 connection with a full NVLink mesh topology, meaning a 300GB/s connection between the modules, instead of the slower 50GB/s link that PCIe cards are limited to, or forcing your GPUs to write instructions on stone tablets back and forth over slow 16 or 32GB/s PCIe.
And when you connect full-mesh modules together (they have to be the same memory capacity per pair), you get the benefit of basically unified memory, and the individual GPUs combine into a big boost in processing speed. Not a full N× boost; you don't get 4× for 4 GPUs, more like a 3.5× boost.
Unfortunately, YouTubers have discovered the V100 secret and broadcast it, so prices spiked about $200 overnight or listings went out of stock.
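(Once you have whatever setup in hand, it's easy to verify what the GPUs actually see; these nvidia-smi subcommands should tell you whether they're linked over NVLink or just PCIe:
nvidia-smi topo -m
nvidia-smi nvlink --status
In the topo matrix, NV1/NV2/etc. between two GPUs means NVLink hops; PIX/PHB/SYS means they're talking over PCIe only.)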
starkruzr@reddit (OP)
thanks. guessing you're referring to this? https://ebay.us/m/lsl4ts where/how does one even mount that in a machine? also -- does it really matter if you're only talking about two cards needing to communicate over x16 PCIe 3.0 when you're only doing inference with mostly text? my understanding is that that kind of use doesn't really require that much bandwidth for tensor parallelism.
DaMoot@reddit
Yep that's one of the boards.
These boards are motherboard-sized and go in a pc case or custom mount.
And yes, it matters big time, because if you're splitting the model you need the bandwidth for the GPUs to coordinate processing the split layers.
Also keep in mind that most consumer systems are limited to one CPU-attached x16 slot and one CPU-attached x4 slot (NVMe), with the other slots and peripherals going through the chipset, which is connected to the CPU by x8 PCIe. You CAN load models over any PCIe lane width, even x1 if you hate yourself. These boards plug into a pair of x16 adapter cards (each GPU gets x8), or into a PLX 8749 switch card that puts all 4 on a single x16 slot, splitting it into four x4 ports without your motherboard needing to support bifurcation.
Loading the model takes PCIe speed; processing after that is all done on the GPUs.
If you want just 2, https://ebay.us/m/YyOSB5 is an option. Same full mesh topology in a friendlier form factor.
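(If you want to see how much the link actually matters for your own workload before spending anything, one thing worth trying, assuming your llama.cpp build exposes both split modes like OP's does, is benching layer split against tensor split on the same model; layer split barely touches the interconnect while tensor split leans on it:
CUDA_VISIBLE_DEVICES=0,1 ./llama-bench -hf unsloth/Qwen3.6-27B-GGUF:Q8_0 -ngl 999 --flash-attn 1 -p 2048 -d 16384 -sm layer,tensor
llama-bench should accept a comma-separated list for -sm; if your build doesn't, just run it twice. Comparing the two rows at least shows how much the communication-heavy mode suffers on your particular interconnect.)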
FinalCap2680@reddit
Do you know if the 4 x V100 board can run with only one or two modules installed?
DaMoot@reddit
It should, since each module has its own discrete PCIe connection, and two installed modules would still negotiate an NVLink mesh between themselves.
starkruzr@reddit (OP)
this is all great info, thanks.
Glittering-Call8746@reddit
I only see 4-way SXM2 NVLink boards on Taobao... if you add more than that, you're going over PCIe.
SmartCustard9944@reddit
Why are you using 64 threads? That’s way too many
starkruzr@reddit (OP)
thanks, I think this might have been a fuckup. is "threads" supposed to be the number of concurrent requests? bc the use case is "literally just one request at a time, specifically from me."
MelodicRecognition7@reddit
https://files.catbox.moe/5w3eqh.png
threads = CPU cores utilized
starkruzr@reddit (OP)
CPU or GPU?
MelodicRecognition7@reddit
CPU
giveen@reddit
I set mine to 8 because my 295k has 8 performance cores and 16 efficiency cores, and I've found it runs faster that way than using all the cores.
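(Easy to check on your own box: llama-bench takes a comma-separated list for -t, so a quick sweep like
CUDA_VISIBLE_DEVICES=0,1 ./llama-bench -hf unsloth/Qwen3.6-27B-GGUF:Q8_0 -sm tensor -ngl 999 --flash-attn 1 -p 2048 -d 4096 -t 4,8,16,64
should show whether thread count matters at all for you; with -ngl 999 everything is offloaded, so I'd expect the rows to be nearly identical, since the CPU is mostly just feeding the GPUs.)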
starkruzr@reddit (OP)
what is it doing with CPU cores when it doesn't have to use CPU because everything runs in the cards?
Herr_Drosselmeyer@reddit
It'll work, the question is whether you really want to spend money on cards that are nine years old and three generations behind.
It's to be expected. Prompt processing scales quadratically with context length, not linearly.
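(Rough back-of-envelope against OP's own numbers, assuming the usual KV-cached attention cost model: each of the 2048 prompt tokens at d=65536 attends to roughly 16× more cached entries than at d=4096, so the attention share of the work grows ~16× while the feed-forward share stays constant. That's consistent with pp dropping from ~797 to ~473 t/s rather than falling off a cliff, so nothing looks misconfigured.)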
This_Maintenance_834@reddit
If you manage to enable MTP (multi-token prediction), the tg might double.
Ell2509@reddit
It depends what you need. A 3090 will be faster for inference. But if you need larger models and multiple 3090s are too expensive or otherwise unrealistic, V100s come up quite often and obviously will work.