NVLINK improves dual RTX 3090 inference performance by nearly 50%
Posted by hp1337@reddit | LocalLLaMA | View on Reddit | 54 comments
SeymourBits@reddit
I have one dual 3090 FE setup with NVLink and I find that it generates only slightly slower than another nearly identical system with dual 4090 FEs. I was initially surprised as I expected much higher performance from the dual 4090s but have yet to figure out exactly why.
prompt_seeker@reddit
I tested the same thing on my 3090s; here are the results.
(I don't know what power limit OP uses, so I tested at both 270W and 220W.)
python benchmarks/benchmark_serving.py --backend openai --base-url http://localhost:8000 --model AI-45/Qwen_Qwen2.5-7B-Instruct-1M --seed 12345 --dataset-name=random --num-prompts=200
My result is slightly faster (~15%), but I could confirm that bandwidth affects throughput when you send 200 requests at once.
On PCIe 4.0 x16, the result would likely be much closer to NVLink.
And here's the same test but with --request-rate 1 (i.e., one request at a time). The benchmark duration was 189 seconds, so the difference is less than 1 second.
So I don't think I'd notice NVLink unless I were serving the model to other users.
It would be useful if you serve a model that fits on 2x3090 but is not so small that it fits on a single 3090, because if it fits in 24GB, running two vLLM instances and routing between them would be faster than one vLLM instance with tensor parallel across two GPUs (a rough sketch of that setup follows below).
Test system: AMD 5700X, 128GB DDR4 (dual channel), 4x3090 (x8/x8/x4/x4 with OCuLink)
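A minimal sketch of that two-instance approach, assuming two 3090s at indices 0 and 1 and a model that fits in 24GB; the ports and the 220W power cap are illustrative values, not necessarily what anyone in this thread used:

# Optional: power-limit the cards, as in the 220W runs above
sudo nvidia-smi -i 0 -pl 220
sudo nvidia-smi -i 1 -pl 220

# One independent vLLM instance per GPU, each serving the full model on its own port
CUDA_VISIBLE_DEVICES=0 vllm serve Qwen/Qwen2.5-7B-Instruct-1M --gpu-memory-utilization 0.9 --max-model-len 32768 --port 8000 &
CUDA_VISIBLE_DEVICES=1 vllm serve Qwen/Qwen2.5-7B-Instruct-1M --gpu-memory-utilization 0.9 --max-model-len 32768 --port 8001 &

# Then round-robin requests across ports 8000/8001 with any HTTP load balancer (nginx, haproxy, etc.)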
__JockY__@reddit
A couple of things. First, you don't give your system spec, which is unfortunate because a lot of the time the devil is in the details.
Second: the 3090 doesn't support FP8, so the quantization you're trying to use will be broken. I'm not sure what will happen because I'm not familiar with vLLM, but it can't work in your setup because it's unsupported on Ampere.
Third, I can't reproduce your findings. I have an RTX 5000 Ada 32GB and a pair of RTX A6000 48GB Ampere cards, and I used the two A6000s to follow your methodology with a PNY NVLink. The model was Qwen/QwQ-32B and I copied your exact command line (yes, even the non-functional quantization args), and the results were as I expected: inference runs approximately 1 token/sec slower WITH the NVLink.
I then disabled quantization (omitted the --quantization flag) and the results were the same: slower inference when using NVLink.
My system is a Ryzen Threadripper Pro 5995WX on a Supermicro M12SWA-TF motherboard with 128GB of DDR4-3200. All the GPUs are plugged into PCIe 4.0 x16 slots that are verified to be working at x16.
Can you show us any data that demonstrates the speedup you saw from your NVLink setup? I'm curious if our discrepancy is a measurement error on your end or perhaps an implementation error on my end, or even some difference in our rigs that causes NVLink to become a factor (like if you're using x1 risers or something).
getfitdotus@reddit
Did you set TP to 2?
__JockY__@reddit
Yes
hp1337@reddit (OP)
I re-ran the experiment after physically removing the NVLink. Same result:
CUDA_VISIBLE_DEVICES=1,2 vllm serve Qwen/Qwen2.5-7B-Instruct-1M --tensor-parallel 2 --gpu-memory-utilization 0.9 --max-model-len 32768
python benchmarks/benchmark_serving.py --backend openai --base-url http://localhost:8000 --model Qwen/Qwen2.5-7B-Instruct-1M --seed 12345 --dataset-name=random --num-prompts=200
============ Serving Benchmark Result ============
Successful requests: 200
Benchmark duration (s): 51.11
Total input tokens: 204800
Total generated tokens: 24100
Request throughput (req/s): 3.91
Output token throughput (tok/s): 471.56
Total Token throughput (tok/s): 4478.82
---------------Time to First Token----------------
Mean TTFT (ms): 24523.48
Median TTFT (ms): 27132.67
P99 TTFT (ms): 41782.79
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms): 407.41
Median TPOT (ms): 236.08
P99 TPOT (ms): 1419.19
---------------Inter-token Latency----------------
Mean ITL (ms): 222.43
Median ITL (ms): 71.84
P99 ITL (ms): 242.01
==================================================
CUDA_VISIBLE_DEVICES=3,5 vllm serve Qwen/Qwen2.5-7B-Instruct-1M --tensor-parallel 2 --gpu-memory-utilization 0.9 --max-model-len 32768
python benchmarks/benchmark_serving.py --backend openai --base-url http://localhost:8000 --model Qwen/Qwen2.5-7B-Instruct-1M --seed 12345 --dataset-name=random --num-prompts=200
============ Serving Benchmark Result ============
Successful requests: 200
Benchmark duration (s): 33.92
Total input tokens: 204800
Total generated tokens: 24100
Request throughput (req/s): 5.90
Output token throughput (tok/s): 710.39
Total Token throughput (tok/s): 6747.24
---------------Time to First Token----------------
Mean TTFT (ms): 15879.45
Median TTFT (ms): 15428.11
P99 TTFT (ms): 26762.19
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms): 285.56
Median TPOT (ms): 145.32
P99 TPOT (ms): 778.63
---------------Inter-token Latency----------------
Mean ITL (ms): 150.52
Median ITL (ms): 55.36
P99 ITL (ms): 225.97
==================================================
__JockY__@reddit
Ok, but what’s the rest of your system?
hp1337@reddit (OP)
See my other comments
__JockY__@reddit
Oh, you're using x8 risers. Mystery solved, no wonder the NVLink improves things!
Why are you using x8 risers, though? Your motherboard supports PCIe 4.0 x16, so you could just run riser cables and you'd get x16 to your GPUs. What am I missing?
hp1337@reddit (OP)
It's a complicated setup. The motherboard supports x16/x16/x8/x8 or x8/x8/x8/x8/x8/x8. I have 6 GPUs, so I chose the latter. The risers themselves support x16, but the motherboard has to divide the lanes based on what the Threadripper can support.
I guess I'll have to repeat the test with x16/x16 to see what happens.
Competitive_Buy6402@reddit
I thought an FP8 quant would work on a 3090, but with no hardware FP8 support it would just run on the FP16 silicon and gain no speed benefit.
Or maybe it makes the GPU technically run INT8?
psilent@reddit
Can confirm they work, but they just use less VRAM; they aren't faster.
__JockY__@reddit
Ah ok, that makes sense. If that's the case then it sounds like it's not a factor here. Thanks!
leavezukoalone@reddit
Who tf downvotes someone for acknowledging they may have been wrong?
chromaaadon@reddit
3090 nvlink fanboys!
__JockY__@reddit
Is that a question for me? I'm confused, I don't know what you're talking about.
leavezukoalone@reddit
No, you just had a bunch of downvotes earlier
bihungba1101@reddit
vLLM uses the Marlin kernel to do FP8 calculations on unsupported hardware. This is applied automatically when vLLM detects the model's quantization and the available hardware. It significantly boosts performance with no quality degradation compared to native FP8 hardware.
hp1337@reddit (OP)
Specs are as follows: Threadripper 3960X, MSI TRX40 Pro 10G, 128GB quad-channel DDR4-3200 RAM, and a mixture of SlimSAS and direct PCIe Gen 4 x8 risers.
The data is in the blogpost.
I was surprised by the result as well. I appreciate the feedback and am confused why you weren't able to reproduce my findings.
I don't think the quantization should matter, but I'll try running additional tests with an unquantized model. I will also physically remove the NVLink to test. The risers are definitely working at PCIe Gen 4 x8; I have tested this.
sgsdxzy@reddit
vLLM uses the fp8_marlin kernel on Ampere.
bullerwins@reddit
correct: https://github.com/vllm-project/vllm/pull/5975
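In practice that means an FP8 quant can still be served on 3090s, with vLLM swapping in the Marlin weight-only fallback. A minimal sketch, assuming a reasonably recent vLLM build and the same Qwen model used elsewhere in this thread (flags are illustrative, not OP's exact command):

# On Ampere, FP8 weights are handled by the fp8_marlin fallback kernel
CUDA_VISIBLE_DEVICES=0,1 vllm serve Qwen/Qwen2.5-7B-Instruct-1M --quantization fp8 --tensor-parallel-size 2 --max-model-len 32768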
kryptkpr@reddit
I got a 2x3090 NVLink rig as well
On 7-8B models the lower latency really helps, but don't get too excited; by 32B it's more like 8-10%, and only at big batch sizes.
getfitdotus@reddit
Yes, but what are you using for inference? Ollama? Ollama does not support TP; multi-GPU without something like vLLM or SGLang is a waste.
kryptkpr@reddit
vLLM.
getfitdotus@reddit
So you used the two cards with NVLink and set --tensor-parallel 2?
kryptkpr@reddit
Yep, I compared 8B, 32B, and 72B AWQ. My 3090s are connected to the host at x8 PCIe 3.0. NVLink helped a lot with small models at big batch sizes; for bigger models or fewer streams (OP is running 200 here) the speed improvement was much less drastic. I'm still happy with the purchase.
getfitdotus@reddit
OK, I have never used NVLink, but I am going to test it. I have mostly newer cards, but I recently got two 3090s and I am waiting for the NVLink bridge.
Zyj@reddit
Back when I had NVLink connecting my RTX 3090s, I didn't notice any significant performance improvements.
getfitdotus@reddit
If you are just loading a model without vLLM, SGLang, or TGI and using tensor parallel, you are leaving big performance gains behind. The GPU-to-GPU bandwidth requirement goes up significantly with tensor parallel turned on; if you watch nvtop during inference, all of the GPUs will sit at 99% utilization. With it off, the data transferred between them is minimal.
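If you want to watch that traffic yourself during a run, nvtop works, or nvidia-smi's dmon mode; a small sketch (the -s letters select power, utilization, and PCIe throughput columns):

# Per-GPU power, utilization, and PCIe Rx/Tx throughput, refreshed every second
nvidia-smi dmon -s put -d 1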
FullstackSensei@reddit
Zooming in on your motherboard, it looks like a mining board, and it looks like you're running x1 links to the GPUs. In that scenario your GPUs are bandwidth-starved when running vLLM, so of course you see a big improvement with NVLink.
A much better way to spend your money would be to upgrade to an Epyc motherboard with x16 riser cables. Even at PCIe 3.0 speeds you'd get much better bang for your buck, and probably even more performance than with NVLink.
hp1337@reddit (OP)
My specs are as follows:
I was as surprised as you are at the performance benefit. I ran this experiment after finding a Reddit post that arrived at a similar result of around a 40% improvement in prompt processing with NVLink.
joelypolly@reddit
Can you confirm the GPUs are actually running at that speed? Just because the connection says x8@4.0 doesn't mean that's what is actually being used.
hp1337@reddit (OP)
AD7GD@reddit
pcie.link.gen.current==1 !!
randomanoni@reddit
Not so fast: do these drop to x1 when not in use, like they do in nvtop?
hp1337@reddit (OP)
Precisely. The max gen is what matters; when not in use, the PCIe speed is negotiated down.
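One way to see both values, assuming a recent driver (query field names can differ slightly across versions), is to poll nvidia-smi while the benchmark is actually generating:

# Current vs. maximum PCIe generation/width, sampled every second under load
nvidia-smi --query-gpu=index,pcie.link.gen.current,pcie.link.gen.max,pcie.link.width.current,pcie.link.width.max --format=csv -l 1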
ortegaalfredo@reddit
I have two servers: one is a mining board with PCIe 3.0 x1 links, the other has PCIe 3.0 x4 links.
I run vLLM on both with Qwen2.5-32B-Coder. The x4 board gets ~40 t/s and the x1 board gets ~30 t/s, so there is a difference, but not as much as you'd think.
hp1337@reddit (OP)
I wouldn't call that a small difference. That's a 33% improvement!
DinoAmino@reddit
My understanding is that NVLink does not improve multi-turn chat sessions when you're sending one prompt at a time. It does wonders for training, though: about a 4x speedup. I wonder if the observations here about NVLink performance are due to batch inferencing with 200 prompts?
hp1337@reddit (OP)
Maybe that's where the confusion is. I think people are comparing performance with a single user, whereas I am benchmarking my setup to serve many users. I guess I should have qualified my post by saying this is the improvement with batch inferencing. I will also try single-batch inference and update.
a_beautiful_rhind@reddit
There's still usually some improvement if the backends use peer access.
I mean look at what p2p does with latency.
P2P=Disabled Latency Matrix (us)
   GPU     0      1      2      3
     0   1.80  16.90  20.54  13.48
     1  16.69   1.83  15.65  16.59
     2  16.53  13.02   1.87  13.74
     3  17.04  16.21  11.56   1.76
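(That looks like output from the p2pBandwidthLatencyTest sample in NVIDIA's cuda-samples repo; a sketch for reproducing it on your own rig, noting that the path inside the repo changes between CUDA versions:)

# Build and run the P2P latency/bandwidth test from NVIDIA's samples
git clone https://github.com/NVIDIA/cuda-samples.git
cd cuda-samples/Samples/5_Domain_Specific/p2pBandwidthLatencyTest
make && ./p2pBandwidthLatencyTest

# Also handy: show how each GPU pair is connected (NVLink, PIX, PHB, ...)
nvidia-smi topo -m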
Conscious_Cut_6144@reddit
Try it with the NVLink unplugged instead of using NCCL_P2P_DISABLE.
NCCL_P2P_DISABLE does more than just disable the NVLink bridge: it disables all P2P GPU connections, both over the NVLink bridge and over PCIe.
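For reference, a sketch of the two comparisons being discussed; the serve command mirrors OP's setup and is illustrative, not a verbatim repro:

# Software-only test: NCCL_P2P_DISABLE=1 turns off ALL peer-to-peer paths (NVLink and PCIe)
NCCL_P2P_DISABLE=1 CUDA_VISIBLE_DEVICES=1,2 vllm serve Qwen/Qwen2.5-7B-Instruct-1M --tensor-parallel-size 2

# Hardware test: physically remove the bridge, then confirm no NVLink links are active
nvidia-smi nvlink --status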
a_beautiful_rhind@reddit
The 3090 doesn't have P2P unless you use the hacked driver.
Conscious_Cut_6144@reddit
Considering disabling p2p is lowering performance...
FullOf_Bad_Ideas@reddit
Is an NVLink bridge for 3090s even obtainable anymore? I can't find one under $300.
Second, I would like to see more models tested, non-reasoning and normal-context: regular Qwen 2.5 7B, 14B, and 32B at various quantizations. QwQ and the 1M-context 7B aren't your typical LLMs.
a_beautiful_rhind@reddit
You can "nvlink" for free with the https://github.com/tinygrad/open-gpu-kernel-modules
Caveat being that you need decent PCIE links and probably should patch the driver that comes with cuda toolkit instead of the ones provided since the versions don't match.
Pedalnomica@reddit
That just enables P2P (which is cool!). Last I checked, it hadn't been updated since a major security flaw in the Nvidia drivers, and it didn't work with NVLink.
An actual NVLink is ~75% faster than even PCIe 4.0 x16, the maximum you could possibly have on a 3090, and I think it is also independent of the PCIe interconnect. So if you've got 4x 3090s that need to talk to each other and two NVLink bridges, each card only has to push the communication with two of the other three over PCIe.
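A rough back-of-the-envelope for that ~75% figure, using the commonly quoted per-direction numbers (treat these as approximations):

NVLink on a 3090:  112.5 GB/s bidirectional ≈ 56.25 GB/s per direction
PCIe 4.0 x16:      ≈ 32 GB/s per direction
56.25 / 32 ≈ 1.76, i.e. roughly 75% faster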
a_beautiful_rhind@reddit
Yeah, it disables the NVLink. But you can't NVLink 4 cards together anyway. The patch is simple and can be applied to most drivers; the main issue is that the source released by Nvidia doesn't match the versions included in the CUDA repo.
Better than nothing when you can't find NVLink bridges, right? It won't save you from PCIe x1, but it's much better than paying $300 or having no P2P.
David_Delaune@reddit
I bought a few NVLink bridges last year; they were a little over $100 each used on eBay. I had no idea they had tripled in price.
-oshino_shinobu-@reddit
I have two 3090s on a single motherboard. I only do inference and mostly use LM Studio. Will an NVLink help inference speed in this case, or does it only help with a setup like yours?
Emergency_Pack8248@reddit
Interesting discovery. The article I read before said that NVLink's extra transfer bandwidth did not significantly improve inference speed, but your conclusion here contradicts what I had seen.
notlongnot@reddit
Nice write-up! You've convinced me on 220W.