NVLINK improves dual RTX 3090 inference performance by nearly 50%
Posted by hp1337@reddit | LocalLLaMA | View on Reddit | 54 comments
SeymourBits@reddit
I have one dual 3090 FE setup with NVLink and I find that it generates only slightly slower than another nearly identical system with dual 4090 FEs. I was initially surprised as I expected much higher performance from the dual 4090s but have yet to figure out exactly why.
prompt_seeker@reddit
I tested the same thing on my 3090s; here are the results.
(I don't know what power limit OP uses, so I tested at both 270W and 220W.)
python benchmarks/benchmark_serving.py --backend openai --base-url http://localhost:8000 --model AI-45/Qwen_Qwen2.5-7B-Instruct-1M --seed 12345 --dataset-name=random --num-prompts=200
My result is slightly faster (~15%), but I could confirm that bandwidth affects throughput when you send 200 requests at once.
On PCIe 4.0 x16, the result would likely be much closer to NVLink.
And here's the same test but with --request-rate 1 (i.e., one request at a time). The benchmark duration was 189 seconds, so the difference is less than 1 second.
So I don't think I'd notice NVLink unless I were serving the model to other users.
It would be useful if you serve a model that fits on 2x3090 but is not so small that it fits on a single 3090, because if it fits in 24GB, running two vLLM instances and routing between them would be faster than one vLLM instance with tensor parallel across two GPUs (a rough sketch of that setup follows below).
Test system: AMD 5700X, 128GB DDR4 (dual channel), 4x3090 (x8/x8/x4/x4 with OCuLink)
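A minimal sketch of that two-instance approach, assuming two 3090s at indices 0 and 1 and a model that fits in 24GB; the ports and the 220W power cap are illustrative values, not necessarily what anyone in this thread used:

# Optional: power-limit the cards, as in the 220W runs above
sudo nvidia-smi -i 0 -pl 220
sudo nvidia-smi -i 1 -pl 220

# One independent vLLM instance per GPU, each serving the full model on its own port
CUDA_VISIBLE_DEVICES=0 vllm serve Qwen/Qwen2.5-7B-Instruct-1M --gpu-memory-utilization 0.9 --max-model-len 32768 --port 8000 &
CUDA_VISIBLE_DEVICES=1 vllm serve Qwen/Qwen2.5-7B-Instruct-1M --gpu-memory-utilization 0.9 --max-model-len 32768 --port 8001 &

# Then round-robin requests across ports 8000/8001 with any HTTP load balancer (nginx, haproxy, etc.)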
__JockY__@reddit
A couple of things. First, you don't give your system spec, which is unfortunate because a lot of the time the devil is in the details.
Second: the 3090 doesn't support FP8, so the quantization you're trying to use will be broken. I'm not sure what will happen because I'm not familiar with vLLM, but it can't work in your setup because it's unsupported on Ampere.
Third, I can't reproduce your findings. I have an RTX 5000 Ada 32GB and a pair of RTX A6000 48GB Ampere cards, and I used the two A6000s to follow your methodology with a PNY NVLink. The model was Qwen/QwQ-32B and I copied your exact command line (yes, even the non-functional quantization args), and the results were as I expected: inference runs approximately 1 token/sec slower WITH the NVLink.
I then disabled quantization (omitted the --quantization flag) and the results were the same: slower inference when using NVLink.
My system is a Ryzen Threadripper Pro 5995WX on a Supermicro M12SWA-TF motherboard with 128GB of DDR4-3200. All the GPUs are plugged into PCIe 4.0 x16 slots that are verified to be working at x16.
Can you show us any data that demonstrates the speedup you saw from your NVLink setup? I'm curious if our discrepancy is a measurement error on your end or perhaps an implementation error on my end, or even some difference in our rigs that causes NVLink to become a factor (like if you're using x1 risers or something).
getfitdotus@reddit
Did you set TP to 2?
__JockY__@reddit
Yes
hp1337@reddit (OP)
I re-ran the experiment after physically removing the NVLink. Same result:
CUDA_VISIBLE_DEVICES=1,2 vllm serve Qwen/Qwen2.5-7B-Instruct-1M --tensor-parallel 2 --gpu-memory-utilization 0.9 --max-model-len 32768
python benchmarks/benchmark_serving.py --backend openai --base-url http://localhost:8000 --model Qwen/Qwen2.5-7B-Instruct-1M --seed 12345 --dataset-name=random --num-prompts=200
============ Serving Benchmark Result ============
Successful requests: 200
Benchmark duration (s): 51.11
Total input tokens: 204800
Total generated tokens: 24100
Request throughput (req/s): 3.91
Output token throughput (tok/s): 471.56
Total Token throughput (tok/s): 4478.82
---------------Time to First Token----------------
Mean TTFT (ms): 24523.48
Median TTFT (ms): 27132.67
P99 TTFT (ms): 41782.79
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms): 407.41
Median TPOT (ms): 236.08
P99 TPOT (ms): 1419.19
---------------Inter-token Latency----------------
Mean ITL (ms): 222.43
Median ITL (ms): 71.84
P99 ITL (ms): 242.01
==================================================
CUDA_VISIBLE_DEVICES=3,5 vllm serve Qwen/Qwen2.5-7B-Instruct-1M --tensor-parallel 2 --gpu-memory-utilization 0.9 --max-model-len 32768
python benchmarks/benchmark_serving.py --backend openai --base-url http://localhost:8000 --model Qwen/Qwen2.5-7B-Instruct-1M --seed 12345 --dataset-name=random --num-prompts=200
============ Serving Benchmark Result ============
Successful requests: 200
Benchmark duration (s): 33.92
Total input tokens: 204800
Total generated tokens: 24100
Request throughput (req/s): 5.90
Output token throughput (tok/s): 710.39
Total Token throughput (tok/s): 6747.24
---------------Time to First Token----------------
Mean TTFT (ms): 15879.45
Median TTFT (ms): 15428.11
P99 TTFT (ms): 26762.19
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms): 285.56
Median TPOT (ms): 145.32
P99 TPOT (ms): 778.63
---------------Inter-token Latency----------------
Mean ITL (ms): 150.52
Median ITL (ms): 55.36
P99 ITL (ms): 225.97
==================================================
__JockY__@reddit
Ok, but what’s the rest of your system?
hp1337@reddit (OP)
See my other comments
__JockY__@reddit
Oh, you're using x8 risers. Mystery solved, no wonder the NVLink improves things!
Why are you using x8 risers, though? Your motherboard supports PCIe 4.0 x16, so you could just run riser cables and you'd get x16 to your GPUs. What am I missing?
hp1337@reddit (OP)
It's a complicated setup. The motherboard supports x16/x16/x8/x8 or x8/x8/x8/x8/x8/x8. I have 6 GPUs, so I chose the latter. The risers themselves support x16, but the motherboard has to divide the lanes based on what the Threadripper can support.
I guess I'll have to repeat the test with x16/x16 to see what happens.
Competitive_Buy6402@reddit
I thought an FP8 quant would work on a 3090, but with no hardware FP8 support it would just run on the FP16 silicon and gain no speed benefit.
Or maybe it makes the GPU technically run INT8?
psilent@reddit
Can confirm they work, but they just use less VRAM; they aren't faster.
__JockY__@reddit
Ah ok, that makes sense. If that's the case then it sounds like it's not a factor here. Thanks!
leavezukoalone@reddit
Who tf downvotes someone for acknowledging they may have been wrong?
chromaaadon@reddit
3090 nvlink fanboys!
__JockY__@reddit
Is that a question for me? I'm confused, I don't know what you're talking about.
leavezukoalone@reddit
No, you just had a bunch of downvotes earlier
bihungba1101@reddit
vLLM uses the Marlin kernel to do FP8 calculations on unsupported hardware. This is applied automatically when vLLM detects the model's quantization and the available hardware. It significantly boosts performance with no quality degradation compared to native FP8 hardware.
hp1337@reddit (OP)
Specs are as follows: Threadripper 3960X, MSI TRX40 Pro 10G, 128GB quad-channel DDR4-3200 RAM, and a mixture of SlimSAS and direct PCIe Gen 4 x8 risers.
The data is in the blogpost.
I was surprised by the result as well. I appreciate the feedback and am confused why you weren't able to reproduce my findings.
I don't think the quantization should matter, but I'll try running additional tests with an unquantized model. I will also physically remove the NVLink to test. The risers are definitely working at PCIe Gen 4 x8; I have tested this.
sgsdxzy@reddit
vLLM uses the fp8_marlin kernel on Ampere.
bullerwins@reddit
correct: https://github.com/vllm-project/vllm/pull/5975
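In practice that means an FP8 quant can still be served on 3090s, with vLLM swapping in the Marlin weight-only fallback. A minimal sketch, assuming a reasonably recent vLLM build and the same Qwen model used elsewhere in this thread (flags are illustrative, not OP's exact command):

# On Ampere, FP8 weights are handled by the fp8_marlin fallback kernel
CUDA_VISIBLE_DEVICES=0,1 vllm serve Qwen/Qwen2.5-7B-Instruct-1M --quantization fp8 --tensor-parallel-size 2 --max-model-len 32768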
kryptkpr@reddit
I got a 2x3090 NVLink rig as well
On 7-8B models the lower latency really helps, but don't get too excited; by 32B it's more like 8-10%, and only at big batch sizes.
getfitdotus@reddit
Yes, but what are you using for inference? Ollama? Ollama does not support TP; multi-GPU without something like vLLM or SGLang is a waste.
kryptkpr@reddit
vLLM.
getfitdotus@reddit
So you used the two cards with NVLink and set --tensor-parallel 2?
kryptkpr@reddit
Yep, I compared 8B, 32B, and 72B AWQ. My 3090s are connected to the host at x8 PCIe 3.0. NVLink helped a lot with small models at big batch sizes; for bigger models or fewer streams (OP is running 200 here) the speed improvement was much less drastic. I'm still happy with the purchase.
getfitdotus@reddit
OK, I have never used NVLink, but I am going to test it. I have mostly newer cards, but I recently got two 3090s and I am waiting for the NVLink bridge.
Zyj@reddit
Back when I had NVLink connecting my RTX 3090s, I didn't notice any significant performance improvements.
getfitdotus@reddit
If you are just loading a model without vLLM, SGLang, or TGI and using tensor parallel, you are leaving big performance gains behind. The GPU-to-GPU bandwidth requirement goes up significantly with tensor parallel turned on; if you watch nvtop during inference, all of the GPUs will sit at 99% utilization. With it off, the data transferred between them is minimal.
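If you want to watch that traffic yourself during a run, nvtop works, or nvidia-smi's dmon mode; a small sketch (the -s letters select power, utilization, and PCIe throughput columns):

# Per-GPU power, utilization, and PCIe Rx/Tx throughput, refreshed every second
nvidia-smi dmon -s put -d 1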
FullstackSensei@reddit
Zooming in on your motherboard, it looks like a mining board, and it looks like you're running x1 links to the GPUs. In that scenario your GPUs are bandwidth-starved when running vLLM, so of course you see a big improvement with NVLink.
A much better way to spend your money would be to upgrade to an Epyc motherboard with x16 riser cables. Even at PCIe 3.0 speeds you'd get much better bang for your buck, and probably even more performance than with NVLink.
hp1337@reddit (OP)
My specs are as follows:
I was as surprised as you are at the performance benefit. I ran this experiment after finding a Reddit post that arrived at a similar result of around a 40% improvement in prompt processing with NVLink.
joelypolly@reddit
Can you confirm the GPUs are actually running at that speed? Just because the connection says x8@4.0 doesn't mean that's what is actually being used.
hp1337@reddit (OP)
AD7GD@reddit
pcie.link.gen.current==1 !!
randomanoni@reddit
Not so fast: do these drop to x1 when not in use, like they do in nvtop?
hp1337@reddit (OP)
Precisely. The max gen is what matters; when not in use, the PCIe speed is negotiated down.
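One way to see both values, assuming a recent driver (query field names can differ slightly across versions), is to poll nvidia-smi while the benchmark is actually generating:

# Current vs. maximum PCIe generation/width, sampled every second under load
nvidia-smi --query-gpu=index,pcie.link.gen.current,pcie.link.gen.max,pcie.link.width.current,pcie.link.width.max --format=csv -l 1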
ortegaalfredo@reddit
I have two servers: one is a mining board with PCIe 3.0 x1 links, the other has PCIe 3.0 x4 links.
I run vLLM on both with Qwen2.5-32B-Coder. The x4 board gets ~40 t/s and the x1 board gets ~30 t/s, so there is a difference, but not as much as you'd think.
hp1337@reddit (OP)
I wouldn't call that a small difference. That's a 33% improvement!
DinoAmino@reddit
My understanding is that NVLink does not improve multi-turn chat sessions when you're sending one prompt at a time. It does wonders for training, though: about a 4x speedup. I wonder if the observations here about NVLink performance are due to batch inferencing with 200 prompts?
hp1337@reddit (OP)
Maybe that's where the confusion is. I think people are comparing performance with a single user, whereas I am benchmarking my setup to serve many users. I guess I should have qualified my post by saying this is the improvement with batch inferencing. I will also try single-batch inference and update.
a_beautiful_rhind@reddit
There's still usually some improvement if the backends use peer access.
I mean look at what p2p does with latency.
P2P=Disabled Latency Matrix (us)
   GPU     0      1      2      3
     0   1.80  16.90  20.54  13.48
     1  16.69   1.83  15.65  16.59
     2  16.53  13.02   1.87  13.74
     3  17.04  16.21  11.56   1.76
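(That looks like output from the p2pBandwidthLatencyTest sample in NVIDIA's cuda-samples repo; a sketch for reproducing it on your own rig, noting that the path inside the repo changes between CUDA versions:)

# Build and run the P2P latency/bandwidth test from NVIDIA's samples
git clone https://github.com/NVIDIA/cuda-samples.git
cd cuda-samples/Samples/5_Domain_Specific/p2pBandwidthLatencyTest
make && ./p2pBandwidthLatencyTest

# Also handy: show how each GPU pair is connected (NVLink, PIX, PHB, ...)
nvidia-smi topo -m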
Conscious_Cut_6144@reddit
Try it with the NVLink unplugged instead of using NCCL_P2P_DISABLE.
NCCL_P2P_DISABLE does more than just disable the NVLink bridge: it disables all P2P GPU connections, both over the NVLink bridge and over PCIe.
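For reference, a sketch of the two comparisons being discussed; the serve command mirrors OP's setup and is illustrative, not a verbatim repro:

# Software-only test: NCCL_P2P_DISABLE=1 turns off ALL peer-to-peer paths (NVLink and PCIe)
NCCL_P2P_DISABLE=1 CUDA_VISIBLE_DEVICES=1,2 vllm serve Qwen/Qwen2.5-7B-Instruct-1M --tensor-parallel-size 2

# Hardware test: physically remove the bridge, then confirm no NVLink links are active
nvidia-smi nvlink --status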
a_beautiful_rhind@reddit
The 3090 doesn't have P2P unless you use the hacked driver.
Conscious_Cut_6144@reddit
Considering disabling p2p is lowering performance...
FullOf_Bad_Ideas@reddit
Is an NVLink bridge for 3090s even obtainable anymore? I can't find one under $300.
Second, I would like to see more models tested, non-reasoning and normal-context: regular Qwen 2.5 7B, 14B, and 32B at various quantizations. QwQ and the 1M-context 7B aren't your typical LLMs.
a_beautiful_rhind@reddit
You can "nvlink" for free with the https://github.com/tinygrad/open-gpu-kernel-modules
Caveat being that you need decent PCIE links and probably should patch the driver that comes with cuda toolkit instead of the ones provided since the versions don't match.
Pedalnomica@reddit
That just enables P2P (which is cool!). Last I checked, it hadn't been updated since a major security flaw in the Nvidia drivers, and it didn't work with NVLink.
An actual NVLink is ~75% faster than even PCIe 4.0 x16, the maximum you could possibly have on a 3090, and I think it is also independent of the PCIe interconnect. So if you've got 4x 3090s that need to talk to each other and two NVLink bridges, each card only has to push the communication with two of the other three over PCIe.
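A rough back-of-the-envelope for that ~75% figure, using the commonly quoted per-direction numbers (treat these as approximations):

NVLink on a 3090:  112.5 GB/s bidirectional ≈ 56.25 GB/s per direction
PCIe 4.0 x16:      ≈ 32 GB/s per direction
56.25 / 32 ≈ 1.76, i.e. roughly 75% faster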
a_beautiful_rhind@reddit
Yeah, it disables the NVLink. But you can't NVLink 4 cards together anyway. The patch is simple and can be applied to most drivers; the main issue is that the source released by Nvidia doesn't match the versions included in the CUDA repo.
Better than nothing when you can't find NVLink bridges, right? It won't save you from PCIe x1, but it's much better than paying $300 or having no P2P.
David_Delaune@reddit
I bought a few NVLink bridges last year; they were a little over $100 each used on eBay. I had no idea they had tripled in price.
-oshino_shinobu-@reddit
I have two 3090s on a single motherboard. I only do inference and mostly use LM Studio. Will an NVLink help inference speed in this case, or does it only help with a setup like yours?
Emergency_Pack8248@reddit
Interesting discovery. The article I read before said that NVLink's extra transfer bandwidth did not significantly improve inference speed, but your conclusion here contradicts what I had seen.
notlongnot@reddit
Nice write-up! You've convinced me on 220W.