Need advice upgrading an old gaming desktop with a 5090 for AI
Posted by dtdisapointingresult@reddit | LocalLLaMA | 9 comments
A relative is giving me an XPS 8950 desktop: Intel i7-12700K, 64GB DDR5, Nvidia RTX 3070 8GB.
I would like to use it for an image generation long-term project (a few hours a day), + using/testing various AI tools for fun since I love this stuff. It turns out I have an opportunity to get a 5090 (or any brand new card) at a significant discount.
The problem is that the XPS 8950 is unsuitable:
- The case is too small to fit a 5090
- Not enough PCIe lanes or PSU juice to keep both GPUs
I have to choose between:
- Get a new large case for the XPS to put the 5090 in. Get rid of the 3070 to free up the PSU for the 5090, potentially underclock the 5090 if PSU is still not good enough. Might have to buy a new PSU anyway, in which case I keep both cards.
- Buy an external enclosure for the 5090 (eGPU), get to keep both GPUs. Although my experience with external SSD enclosures has been negative, I'm excited at the idea of having a portable 32GB AI lab. From reading this sub I know that if you're using a single GPU, and the entire model fits in VRAM, the slow bus speed has no effect on inference speed, only on initial model loading. The 5090 can't be combined with the 3070 for any tasks without nuking the speed (more on that in a follow-up question below, please confirm), so the 3070 can act as a secondary bank of VRAM for isolated small models you want to run fast.
I'm leaning toward the eGPU because it seems like such a neat solution to the problem, but I would appreciate some feedback here. It's too good to be true, right?
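For what it's worth, the "entire model fits in VRAM" premise can be sanity-checked with back-of-envelope arithmetic. This is my own rough sketch (the 2 GB overhead figure and quant bit-widths are assumptions, not measured numbers from the thread):

```python
# Rough estimate: does a quantized model fit in a card's VRAM,
# leaving some headroom for KV cache and CUDA runtime overhead?
# (overhead_gb = 2.0 is an assumed ballpark, not a measured value)

def fits_in_vram(params_b: float, bits_per_weight: float,
                 vram_gb: float, overhead_gb: float = 2.0) -> bool:
    """True if a params_b-billion-parameter model at the given quant
    bit-width fits in vram_gb with overhead_gb reserved."""
    # 1B params at 8 bits/weight is ~1 GB, so scale by bits/8
    weights_gb = params_b * bits_per_weight / 8
    return weights_gb + overhead_gb <= vram_gb

# A 32B model at ~Q4 (~4.5 bits/weight) on a 32GB 5090: weights ~18 GB
print(fits_in_vram(32, 4.5, 32))   # True
# A 70B model at the same quant on the same card: weights ~39 GB
print(fits_in_vram(70, 4.5, 32))   # False
```

Actual VRAM use also grows with context length (KV cache), so treat the overhead constant as a floor, not a ceiling.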
I have more questions but I'm gonna add them as comments below. I really want to understand this stuff, and I don't want to scare people off with a wall of text, lol.
THANK YOU FOR YOUR ATTENTION TO THIS MATTER!
tmvr@reddit
That machine came with an RTX 3090 or an RX 6900XT as well, those are not small cards, are you sure none of the 5090 cards on the market would fit?
dtdisapointingresult@reddit (OP)
Question: the reality of combining multiple GPUs
Let's say I have multiple GPUs on one system. For example 3x 5060 16GB's on the same mobo, so 48GB total VRAM. How does that work in practice? Can you actually load a single high-quality 48GB model?
I did some research and found this info, but I would like confirmation:
Is the above accurate?
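In practice, tools like llama.cpp answer the "can I load a 48GB model on 3x16GB?" question by splitting the model's layers across cards in proportion to a user-supplied ratio. Here's a simplified sketch of that idea (my own illustration; the real allocator also accounts for per-layer sizes and KV cache, so actual splits differ):

```python
# Simplified sketch of proportional layer splitting across GPUs,
# in the spirit of llama.cpp's --tensor-split ratios.

def split_layers(n_layers: int, ratios: list[float]) -> list[int]:
    """Assign n_layers across cards in proportion to ratios."""
    total = sum(ratios)
    counts = [int(n_layers * r / total) for r in ratios]
    counts[-1] += n_layers - sum(counts)  # remainder goes to the last card
    return counts

# 48 layers over three equal 16GB cards:
print(split_layers(48, [1, 1, 1]))   # [16, 16, 16]
# 5090 (32GB) + 3070 (8GB), weighted 4:1:
print(split_layers(48, [4, 1]))      # [38, 10]
```

Each card holds only its own layers, and during inference the small activation tensor is handed from one card to the next at each split boundary, which is why total VRAM is (roughly) additive.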
see_spot_ruminate@reddit
Okay...
So I have a similar setup. 3x 5060 Ti: two are internal and one is in an AG01 eGPU, connected via OCuLink to an NVMe slot (4 PCIe lanes).
So it is still faster than DDR5. The computation happens on the cards and some data is shared between them, but the amount of data is probably low (not a scientific term), so you're still fast.
As for setup, it works well and is easy. I run them on Ubuntu and all cards were detected without any additional configuration.
Reasons to do this:
you want to min-max for vram.
you want to be cheap; this is a cheap and good combination.
you don't mind some speed loss. This is going to happen any time you bring in multiple cards. A single 48GB card will work better than 3x cards that add up to 48GB.
This setup works well for llms that can split layers across cards or cpu. If you are going to be gaming or doing image gen then it would probably work better with alternatives.
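A minimal llama.cpp invocation for this kind of layer split might look like the following (a sketch: the model path is a placeholder, and the ratios should match your actual cards):

```shell
# Split layers across three equal cards; -ngl 99 offloads all layers to GPU.
# --split-mode layer keeps whole layers on one card (less inter-card traffic
# than splitting individual tensors).
llama-server -m ./model-48gb.gguf \
  --n-gpu-layers 99 \
  --split-mode layer \
  --tensor-split 1,1,1
```

For an uneven pair like a 5090 + 3070, a ratio like `--tensor-split 4,1` would weight the split toward the bigger card.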
dtdisapointingresult@reddit (OP)
Thank you, that's an interesting post with a hybrid setup. You gave a crucial piece of info: that it's faster than DDR5. That means the understanding I had was wrong.
Can I ask how many tokens/sec you get on the largest model you run? Mention the model name and quant. I'll look up benchmarks for that model and be able to position your setup compared to standalone large cards and DDR5.
see_spot_ruminate@reddit
Check my post history.
see_spot_ruminate@reddit
Oh and don't get rid of the 3070 just yet.
it is nice to have a backup GPU sometimes
find out what you can't do with it first
aimark42@reddit
Multi-GPU is a huge pain on consumer CPUs. It's mostly not worth it, but you will see some gains in MoE models where that link speed isn't as important. I'd personally sell the 3070 and just stick with a single 5090. It will be a far simpler setup, and the 5090 is a beast; the 3070 would likely slow it down in a lot of applications.
Long_comment_san@reddit
Yeah, get a new case, sell the 3070, and decrease power to the 5090. I don't know by how much, but if I had a 5090 I would slash its power limit by 25%, losing maybe 10% performance and gaining 400% more comfortable sleep knowing it won't fck me up with that power connector.
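Power-capping is a one-liner with `nvidia-smi`; the wattage below is a guess based on a ~575 W board power, so check what your exact card reports first:

```shell
# Show the card's default, current, and min/max allowed power limits
nvidia-smi -q -d POWER

# Cap the card at roughly 75% of a ~575 W limit (example value;
# resets on reboot/driver reload unless you script it at startup)
sudo nvidia-smi -pl 430
```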
dtdisapointingresult@reddit (OP)
Using WSL
Ideally the PC should stay booted into Windows, but I prefer to do tech work in Linux. Dual-booting sucks. Can WSL get easy pass-through to the GPU(s)? Can you configure the Windows desktop not to use the Nvidia GPU at all and stick to the Intel iGPU?
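For what it's worth, on WSL2 with a current NVIDIA driver installed on the Windows side, CUDA passthrough generally works out of the box; a quick check from inside the WSL distro looks like:

```shell
# Inside WSL2: the Windows driver is exposed to Linux automatically
nvidia-smi

# WSL2 maps the driver's CUDA library into the distro here
ls /usr/lib/wsl/lib/libcuda.so*
```

If `nvidia-smi` lists your GPU(s) there, frameworks like PyTorch and llama.cpp inside WSL can use them without a separate Linux driver install.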
Consequences of offloading model layers to a 5090 eGPU
Above I noted "if you're using a single GPU with a model loaded entirely in VRAM, bus speed doesn't affect inference speed, only initial model loading". What about if I'm using a larger GGUF model that doesn't fit in the 5090's VRAM? I know about the GPU offloading feature, though I've never used it. What I don't know is what sort of traffic happens between the CPU and GPU during inference, and how much it's impacted by the slow bus. Would I be wasting the 5090 by dragging the model down to USB 3.2 speed?
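A back-of-envelope estimate of that traffic (my own assumptions, not benchmarked): with layer offloading, the weights stay put on each side, and what crosses the bus per generated token is roughly one hidden-state activation at the layer-split boundary:

```python
# Estimate the CPU<->GPU transfer per generated token when layers are
# split between CPU and GPU: roughly one hidden-state vector per crossing.
# (hidden_size and fp16 activations are illustrative assumptions.)

def per_token_transfer_kb(hidden_size: int, bytes_per_elem: int = 2) -> float:
    """Activation size crossing the split boundary per token, in KB."""
    return hidden_size * bytes_per_elem / 1024

# e.g. a 70B-class model with hidden_size 8192, fp16 activations:
print(per_token_transfer_kb(8192))   # 16.0 KB per token
```

At ~16 KB per token, even a slow external link is nowhere near the bottleneck; the usual limiter for partially offloaded models is CPU RAM bandwidth for the CPU-resident layers, not the PCIe/USB link. Prompt processing moves more data at once, so it feels bus speed more than token generation does.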