Fastest local inference options for 2 x 3090 with NVLink
Posted by IngeniousIdiocy@reddit | LocalLLaMA | 48 comments
I’m currently at 15 tps with llama.cpp but both cards are at 50% utilization during inference on deepseek-r1:70b q4_k_m
I’m wondering if any others have experience with vLLM, exllama or TensorRT and what kind of throughput they have seen with llama 3.3 70b class models at 4 bit?
Far_Park_1943@reddit
What are you guys talking about? This topic is meaningless. If you were planning to train large models yourself, it would be completely pointless: the gap between your hardware and the compute nodes in server clusters is vast, thousands of times over. The only thing you can do is use the open-source models these big companies have trained to generate content. That's it. The problem with the 70B class is that many of these models are actually quite poor; a 70B model isn't even as good as a 35B hybrid model, so it has no real advantage. The awkward part of your situation is that you'd be better off running a 35B model, which at least lets both of your GPUs be fully loaded with the model and run more efficiently. That's all that matters. All the other discussion, like chasing 10% efficiency improvements, is completely pointless.
smflx@reddit
If you're using llama.cpp with the layers split across the two GPUs, 50% utilization is normal. It's effectively one GPU's worth of computation with 2x the VRAM. Adding GPUs this way doesn't lower the GPU computation time; it's only faster because it reduces how much computation falls back to the CPU.
vLLM tensor parallel is a different kind of split. It will be almost 2x, especially when you use 2 GPUs and all the weights are loaded into VRAM. vLLM handles communication between the 2 GPUs especially efficiently, so you might not need NVLink.
Test vLLM tensor parallel first. If GPU utilization is near 100%, there's no need for NVLink. If utilization is low, NVLink will give a boost.
BTW, nvlink is only for 2 GPUs.
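A minimal way to run that test, assuming vLLM is installed (the model path is a placeholder; any 70B-class 4-bit quant that fits in 2x24 GB works):

```
# serve with tensor parallelism across both 3090s
vllm serve <your-70b-awq-or-gptq-model> --tensor-parallel-size 2

# in another terminal, watch per-GPU utilization while sending requests
watch -n 1 nvidia-smi
```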
IngeniousIdiocy@reddit (OP)
Thanks. I’ll give it a shot.
knownProgress1@reddit
did it help?
smflx@reddit
Oh, you already have NVLink. I just noticed it in the picture. Not sure why I didn't notice it earlier...
Just use tensor parallel in vLLM; it should be close to 2x the performance.
Conscious_Cut_6144@reddit
Use vLLM with an AWQ model. Probably close to 2x faster.
NVLink doesn't matter for inference.
silenceimpaired@reddit
Does it help with model loading?
Conscious_Cut_6144@reddit
No, loading depends on storage/PCIe speeds.
silenceimpaired@reddit
So it only helps with training. Tragic
silenceimpaired@reddit
Have you looked into draft models and exllama?
fizzy1242@reddit
Are you using a 12VHPWR cable or a 12-pin PCIe cable? Looking to replace the adapter.
IngeniousIdiocy@reddit (OP)
Using the new style 12 pin connector. My power supply came with two ports and two cables.
FrederikSchack@reddit
Maybe you can help me by running a small test on Ollama? :)
I'm collecting data to try to see if there are tendencies related to hardware choices other than just the GPUs:
https://www.reddit.com/r/LocalLLaMA/comments/1ip7zaz/lets_do_a_structured_comparison_of_hardware_ts/
coffee-on-thursday@reddit
As a note, I found Ollama was really bad at taking advantage of the 2x3090 + NVLink setup.
FrederikSchack@reddit
Yes, I'm aware that Ollama doesn't support NVLink and doesn't split individual layers across GPUs, but I'm not so much testing the GPUs in this test; I'm more curious about the CPU and motherboard influence.
coffee-on-thursday@reddit
I have almost the exact same setup (2x3090 with NVLink). I usually run vLLM on Linux, as it's by far the best at taking advantage of the link between the cards and all the available bandwidth; I looked into the other frameworks and found vLLM to be the best by a big margin with the right settings. Tweaking the settings is pretty important, and using an AWQ quant gives me much better performance. If you run large batches, you can get really high throughput, and it fully uses the cards to their limit. I've had power brownouts even with a 1200W PSU and power caps on the cards; the 3090s just spike really hard on power if you fully use them. I had to upgrade to a 1500W PSU for the best results.
kmouratidis@reddit
How did you get vLLM to run a 4-bit quant on 2x3090 and not OOM? What command are you running?
13henday@reddit
Not OP, but I've run AWQ 72B models on vLLM with tensor parallel. 72B AWQ works just fine as long as you limit the context to 8192 with the KV cache at 8-bit.
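A rough sketch of how those settings map onto vLLM's server flags (the model path is a placeholder, and treating the 8-bit cache as an fp8 KV cache is an assumption):

```
vllm serve <some-72b-awq-model> \
  --tensor-parallel-size 2 \
  --max-model-len 8192 \
  --kv-cache-dtype fp8 \
  --gpu-memory-utilization 0.95
```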
kmouratidis@reddit
And you leave all the other parameters at their defaults? And awq or awq_marlin?
13henday@reddit
Marlin
fairydreaming@reddit
But do you use --tensor-parallel-size 2?
Tall_Instance9797@reddit
Since you have the same setup and found the best way to take advantage of the link, can you share how much of a difference NVLink between two 3090s really makes for training and inference? And since the extra bandwidth seems important... do you know how much of a difference running the 3090s in x8, or even x4, slots makes for inference and training? Thanks.
coffee-on-thursday@reddit
Inference is not twice as fast with NVLink, but it is 15-20% faster depending on what you're doing. The link is used sparingly when your model fits within 24 GB of VRAM; with a bigger model it makes good use of the link. My understanding is that the link is very helpful for training if you make the most of both cards, but I don't have hard numbers for you unfortunately. I originally had a motherboard where one of the slots ran at x4, and that does make a significant difference in time to first token.
Tall_Instance9797@reddit
Ok, thank you for sharing! Appreciate the info.
a_beautiful_rhind@reddit
turn off turbo and your power woes will go away.
13henday@reddit
LMDeploy. It's finicky to get started, but once you do, the token rate is downright silly. Just remember to set your max cache below 0.4.
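For anyone trying it, a sketch of what that looks like with LMDeploy's API server (the model path is a placeholder; the cache flag is the knob referred to above as max cache):

```
# split across both GPUs and cap the KV cache at 40% of free VRAM
lmdeploy serve api_server <your-model> \
  --tp 2 \
  --cache-max-entry-count 0.4
```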
a_beautiful_rhind@reddit
I saw up to 18.9t/s in llama.cpp with nvlink enabled in the compile parameters. Not sure how a current version does since they changed some of the split by row/layer stuff.
Moved to exllama and get 19-21t/s but that doesn't utilize nvlink.
vllm I don't use often because it's difficult to fit the context into memory.
In transformers, nvlink added 2.5t/s. Because I have 3 and maybe even soon 4 cards, I wonder if it would be better to use the patched peer enabled driver so all cards can communicate directly, despite it being slower.
kryptkpr@reddit
I also see ~18 Tok/sec with two nvlinked 3090 and llama.cpp
No amount of compile time tweaking those P2P options make it any faster for me, only slower.
vLLM gets me ~23 tok/sec single stream. NVLink only helps when batching, and even then only by around 8%; for inference it's a latency thing, not bandwidth.
a_beautiful_rhind@reddit
I tried hard to find a non-250V server supply over 1200W and they don't seem to exist. So how are these consumer PSUs doing it :D
I put my P2P limit up to 4096; at the default of 128 it basically never got used. Didn't think there was any other tweaking to be done.
I will have to try that tinybox driver at some point tho. 3 or 4x 25G link is likely better than only 2 cards with a 100G link.
No-Statement-0001@reddit
The nvlink won’t help much with inference speeds. I tried out a bunch of things on my dual 3090 and here’s the gist of it:
exllama (tabbyAPI) with Llama 3.3 70B EXL2 4.25bpw: you get more tokens/second, but you don't benefit from the GGUF ecosystem. The EXL2, AWQ, etc. quants are harder to find, or you have to make them yourself.
vLLM and exllamav2 are better than llama.cpp for overall tokens/second with parallel requests. I'm the only user on my box, so this doesn't benefit me much.
Use speculative decoding: pick a draft model in the same family (Qwen, Llama 3, etc.). This gives the greatest overall gain in single-stream tok/sec per request (see the sketch below).
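For example, with llama.cpp's server on recent builds (flag names may differ in older versions, and the GGUF file names are placeholders):

```
# main 70B model plus a small same-family draft model for speculative decoding
llama-server \
  -m llama-3.3-70b-instruct-q4_k_m.gguf \
  -md llama-3.2-1b-instruct-q8_0.gguf \
  -ngl 99 -ngld 99 \
  --draft-max 16
```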
Tall_Instance9797@reddit
I came here hoping someone would share this... how much of a difference does NVLink really make? Thanks for sharing. Do you know if NVLink makes much of a difference for training? Also, do you know how much of a difference running the 3090s in x8, or even x4, slots makes for inference and/or training? Thanks.
a_beautiful_rhind@reddit
Not a lot of inference stuff supports it. In llama.cpp you have to explicitly raise the peer access limit to get a benefit; the default is something like 128, and any batch bigger than that doesn't use it.
Peer access has to be turned on and off explicitly in the NVIDIA code, and if you enable it twice you segfault.
Torch is set to use NVLink, or any peer access, by default; not sure about vLLM. Not a lot of people run models purely in transformers, so few notice the extra few t/s. Most use quantized models that don't exchange as much data during inference.
Plus you can only link 2 cards as a cherry on top.
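For reference, a sketch of raising that limit when building llama.cpp (the define name is from recent GGML_CUDA-based builds; treat it as an assumption if your tree is older):

```
# default peer-access batch limit is 128; raise it so peer copies actually get used
cmake -B build -DGGML_CUDA=ON -DGGML_CUDA_PEER_MAX_BATCH_SIZE=4096
cmake --build build --config Release -j
```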
smflx@reddit
Yeah, vLLM with tensor parallel is certainly faster. Communication speed between GPUs is important, but it should be fine when you use quantization (low traffic) and only 2 GPUs (fast PCIe communication).
FullOf_Bad_Ideas@reddit
Have you tried tensor parallel in exllamav2? I think it's using CPU RAM and not nvlink. Does it speed things up compared to gpu split? Does it speed things up for smaller 20-32B 4bpw models too?
nite2k@reddit
id be interested
dazzou5ouh@reddit
If you are not explicitly setting something somewhere to use NVLink, you are probably not using NVLink. But from what I have read, what matters much more is making sure your GPUs are running at x16 or x8. A lot of motherboards can't do that.
lyfisshort@reddit
Can you share the specs of the motherboard and power supply?
IngeniousIdiocy@reddit (OP)
Sure, it's an MSI Z490 MEG that I bought new in box off eBay a few months ago, running x8/x8 Gen 4 PCIe with a 1200W ASUS TUF power supply.
smflx@reddit
Hmm, it's PCIe x8. With x16 I'm sure no NVLink would be needed. Test tensor parallel first; if GPU utilization is low even with tensor parallel, NVLink will be well worth it.
TyraVex@reddit
Llama 3.3 70B 4.5bpw - No TP - No spec decoding:
Prompt ingestion: 1045.8 T/s
Generation: 18.14 T/s
10 * Generation: 63.39 T/s
Llama 3.3 70B 4.5bpw - TP - No spec decoding:
Prompt ingestion: 622.57 T/s
Generation: 22.93 T/s
10 * Generation: 87.57 T/s
Llama 3.3 70B 4.5bpw - No TP - Spec decoding:
Prompt ingestion: 1010.34 T/s
Generation: 34.44 T/s
10 * Generation: 75.48 T/s
Llama 3.3 70B 4.5bpw - TP - Spec decoding:
Prompt ingestion: 618.54 T/s
Generation: 44.5 T/s (that's very cool)
10 * Generation: 100.72 T/s
Notes:
Speculative decoding is Llama 3.2 1B 8.0bpw
Context length tested is 16k
Context cache is Q8 (8.0bpw)
Context batch size is 2048
Both 3090s are uncapped at 350w - PCIE 4.0 @ x8
Wrong-Historian@reddit
Try mlc-llm. With tensor parallel it was the fastest solution I've tested, nearly 2x as fast as llama.cpp with 2 GPUs.
getmevodka@reddit
Sooo... I run DeepSeek 70B on my NVLink-bridged dual 3090 setup too, but my cards show 70-90% usage during inference. What I'm asking myself here is: did you go into your GeForce system settings and activate SLI? Because I get more like 18-22 t/s depending on the questions and their complexity. Good luck!
getmevodka@reddit
Oh, but I use Ollama. I reread your post. Sorry, I don't think I'm helpful for you here then.
ArsNeph@reddit
For models that size, people have had reasonable speedups using ExllamaV2, though it gets less significant the smaller the model. You can use it through TabbyAPI or Oobabooga WebUI. If you enable tensor parallelism, there is even more of a speedup. Depending on the model, speculative decoding can also help boost throughput.
tengo_harambe@reddit
If you have any excess VRAM, use speculative decoding to get a 20-50% boost in inference speed
FrederikSchack@reddit
That is some decent tps for a 70b q4 model I think.
From what I understand, most inference engines default to putting entire layers on each GPU instead of splitting them across GPUs. This means that only one GPU is really working at a time, because only the layer the token is currently passing through is being processed. There is some experimental functionality in Exllama that should increase the use of both GPUs.
IngeniousIdiocy@reddit (OP)
Both cards run at 50% utilization during inference, and the llama.cpp developer speaks to this on GitHub as basically the cards working in sync on a single stream, though you could get higher utilization with multiple streams. This is just for me locally, so I'm looking to optimize single-stream performance. My googling seems to indicate exllama wins the single-stream 4-bit inference game with its custom native 4-bit matrix multiplication implementation, but TensorRT is NVIDIA's best effort and always worth a look... I was just hoping to find someone who has already gone down this rabbit hole and save some time!
FrederikSchack@reddit
I'm basically a newbie, reading and reading to try to get a grasp of it :) But take a look at this:
https://www.reddit.com/r/LocalLLaMA/comments/1f3htpl/exllamav2_now_with_tensor_parallelism/