Is 2x 3090 with NVLink faster than 2x 4090 for large 70b models?
Posted by thefreemanever@reddit | LocalLLaMA | View on Reddit | 19 comments
I am wondering whether a 2x 3090 + NVLink setup is faster than 2x 4090s, for both inference and training/fine-tuning tasks.
I have read several other posts where people report getting 10 t/s on 2x 3090s and 17 t/s on 2x 3090s + NVLink. That would mean NVLink is about 70% faster.
Since the 40 series doesn't support NVLink, is a 2x 4090 setup still faster than 2x 3090s + NVLink?
If it isn't, do you think there is any other reason to choose 2x 4090s over 2x 3090s + NVLink for a home AI machine?
PS: The problem revolves around the slower communication speed of PCIe compared to NVLink. When a model exceeds 24GB, it must be split across the cards, and 3090s + NVLink have a speed advantage in this respect. However, I am not sure whether that advantage outweighs the raw speed advantage of the 4090s.
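For a sense of scale, here is a back-of-envelope sketch of what actually crosses the inter-card link per token when a model is layer-split across two GPUs. The hidden size (8192, Llama-2-70B-like) and the PCIe 3.0 x4 bandwidth figure are illustrative assumptions:

```python
# Back-of-envelope: per-token traffic between cards when a 70B model
# is layer-split (pipeline-style) across two GPUs. Assumes a
# Llama-2-70B-like hidden size of 8192 and fp16 activations.
hidden_size = 8192          # assumed model dimension
bytes_per_value = 2         # fp16
act_bytes = hidden_size * bytes_per_value   # activation handed to GPU 2 per token
print(f"activation per token: {act_bytes / 1024:.0f} KiB")

# Even a slow PCIe 3.0 x4 link (~4 GB/s, rough figure) moves that in microseconds:
pcie3_x4 = 4e9              # bytes/s
transfer_us = act_bytes / pcie3_x4 * 1e6
print(f"transfer time over PCIe 3.0 x4: {transfer_us:.1f} us")
```

The per-token payload is tiny relative to any modern link, which is why several commenters below see no inference speedup from NVLink with a simple layer split.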
ZET_unown_@reddit
PhD student specializing in Computer Vision here. In general, 2x 4090s will be faster than 2x 3090s NVLinked by around 40%.
See the link below and scroll down to the image right above the Conclusions section, where they specifically benchmark 2x 4090 against 3090s: https://lambdalabs.com/blog/nvidia-rtx-4090-vs-rtx-3090-deep-learning-benchmark
Whether it's worth the price difference is up to you and what's available to you.
I don't know your specific use case, whether it's only fine-tuning and inference or whether you also want to research and build new models. But as a general rule, you should always go for the largest VRAM on a single card, because that's more often the limiting factor. With a slower card, you just need to wait a few days longer; with too little VRAM, you can't train the model at all. VRAM pooling is a lot of headache, and how well it works depends heavily on the model itself.
I would recommend the RTX 6000 Ada 48GB or the older RTX A6000 48GB. Out of the dual 4090s and dual 3090s, I would go with the dual 4090s.
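The VRAM-first advice above can be made concrete with a rough estimate of what a 70B model's weights alone require at common precisions (illustrative arithmetic only; KV cache and activation overhead come on top):

```python
# Rough VRAM needed just for 70B weights at common precisions.
# Excludes KV cache, activations, and framework overhead.
params = 70e9
needs_gb = {
    name: params * bits / 8 / 1e9
    for name, bits in [("fp16", 16), ("8-bit", 8), ("5bpw", 5), ("4-bit", 4)]
}
for name, gb in needs_gb.items():
    print(f"{name}: ~{gb:.0f} GB")
```

Even at 4-bit, a 70B model won't fit on one 24GB card, which is why the split (and the link between the cards) matters at all.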
GetOutOfMyFeedNow@reddit
The 4090 cannot be NVLinked.
ZET_unown_@reddit
You don’t need NVLink. It's one of the factors affecting speed, but just because a setup doesn’t have it doesn’t mean it will be slower. I have access to 3090s and 4090s at my university research lab, and a set of 4090s without NVLink is faster than NVLinked 3090s for neural network training and inference.
GetOutOfMyFeedNow@reddit
Two 3090s act as one if they are NVLinked, and that matters when running bigger models.
ZET_unown_@reddit
In all my dual-GPU training and inference runs (using around 46GB of VRAM), the dual 4090s have been faster than NVLinked 3090s.
Dusty_da_Cat@reddit
I have tried running 2x 3090s NVLinked.
It didn't do squat for inference, and I can't confirm either way for training/fine-tuning, since I haven't tried that.
It might be a Windows thing, but I haven't found any benefit in added t/s, NVLinked or not. Might be different on Linux.
absolutxtr@reddit
Well, it won't necessarily be faster unless you use something that supports some sort of parallelism. But it can certainly fit bigger models! It's just creating a VRAM pool, by way of an insanely fast communication link between both cards and their VRAM.
Pedalnomica@reddit
Most reports I've seen say NVLink doesn't help much at all with inference (unless the PCIe connection between your cards is slow as hell, maybe <= PCIe 3.0 x4?) and helps at most 20% with training/fine-tuning on 3090s.
4090 pros: they will be faster (again, assuming reasonable PCIe speeds), but probably not by a ton. They are also more modern, so there are things they can do in CUDA that the 3090 can't.
Depends on what you're doing, but for most people around here, buying more 3090s is the better buy.
Dusty_da_Cat@reddit
Going from PCIe 5.0 x8/x8 to PCIe 3.0 x4/x4 didn't have an effect on t/s either.
I ended up putting a 3090 Ti on the PCIe 5.0 slot at x16, with the 2x 3090s on PCIe 3.0 x4/x4. Even when I configured a GPU split to use just the 3090s, the inference t/s didn't change: it averages 13 t/s on a 70b at 5bpw and 11 t/s on a 120b at 3bpw.
However, I do acknowledge there is a potential gain from llama.cpp's NVLink support, but I'm not sure the effort is worth it: switching to the slower llama.cpp for a 'potential' NVLink speedup with additional tinkering needed, versus staying on exllama2 for already-fast inference speeds with minimal effort.
ApprehensiveView2003@reddit
Can we get a year later update?
Dusty_da_Cat@reddit
What kind of update are you looking for?
Guilty-History-9249@reddit
update
nero10578@reddit
NVLink basically doesn’t do shit for inference, and it only helps training depending on whether you use inter-GPU communication a lot or not.
ApprehensiveView2003@reddit
You can load larger models for one...
rdkilla@reddit
the extra 12000 CUDA cores?
deleted_by_reddit@reddit
There is not much communication during inference. NVLink is mostly useful in training, where you need to move all the gradients of the model between all the cards. Just think about your use case: what data gets moved, and how much.
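The asymmetry can be sketched with rough numbers. Assuming a 7B-parameter model trained in fp16 and a Llama-like hidden size of 8192 (both illustrative), gradients exchanged per training step dwarf the activations exchanged per inference token:

```python
# Rough contrast: data moved per training step (gradient all-reduce)
# vs per inference token (a single hidden-state activation),
# assuming a 7B-parameter fp16 model split across 2 GPUs.
params = 7e9
grad_bytes = params * 2                 # fp16 gradients, ~1x buffer per GPU
per_step_gb = grad_bytes / 1e9          # crosses the link every optimizer step

hidden_size = 8192                      # assumed model dimension
per_token_kb = hidden_size * 2 / 1024   # fp16 activation crossing the link
print(f"training:  ~{per_step_gb:.0f} GB of gradients per optimizer step")
print(f"inference: ~{per_token_kb:.0f} KB of activations per token")
```

That gap (gigabytes per step vs kilobytes per token) is why NVLink can matter for training while barely registering for inference.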
alb5357@reddit
So the GPU-to-GPU communication isn't a bottleneck? Then shouldn't 2x 3090s be twice as fast as 1x 3090 on inference?
prudant@reddit
I think it will speed up your inference if you go for tensor parallelism mode, with vLLM for example. Pipeline parallelism is almost unaffected by NVLink.
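For reference, a minimal sketch of what tensor parallelism looks like in vLLM's offline API. This is a configuration sketch, not a tested run: it assumes vLLM is installed, two GPUs are visible, and the model name is illustrative:

```python
# Sketch: shard a 70B model across two GPUs with tensor parallelism in vLLM.
# Requires 2 visible GPUs and downloaded weights; model name is illustrative.
from vllm import LLM, SamplingParams

llm = LLM(
    model="meta-llama/Llama-2-70b-chat-hf",  # illustrative model
    tensor_parallel_size=2,                  # split each layer across both cards
)
out = llm.generate(["Hello"], SamplingParams(max_tokens=32))
print(out[0].outputs[0].text)
```

With tensor parallelism, every layer's matmuls are split across both cards and the partial results are all-reduced each layer, so inter-GPU bandwidth matters far more than in the pipeline-style layer split discussed above.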