criticize my local ai build, what would you change?
Posted by nas2k21@reddit | LocalLLaMA | View on Reddit | 20 comments
Glittering_Mouse_883@reddit
I know OP says they don't need bigger models, but I think in the future they might regret this. Why not drop down to an AM4 motherboard with more PCIe slots, like the ASRock X570S or similar? All those AM4 boards have like 6 slots, in case you decide to Frankenstein something bigger later on. And in the same vein, get a 1600W PSU while you're at it. Plus DDR4 memory is cheaper and an AM4 CPU is cheaper, so overall the increased cost of the PSU will balance out.
nas2k21@reddit (OP)
If you guys are so sure a 3090 won't mind bifurcation, why fear 3.0 x1? Because you know a 3090 wants 4.0 x16, and even 3.0 x4 is a joke to it?
Glittering_Mouse_883@reddit
Haha, yeah maybe. I won't argue with you, just giving my two cents. The smaller models are just dumber IMHO, so slightly slower loading times are worth it to run something bigger. Inference won't really take a hit even at PCIe 3.0 x1 (which I am not suggesting for your 3090).
nas2k21@reddit (OP)
Exactly, reddit is projecting what THEY want, not what I need. I've said already that response time is #1; it doesn't need to always give perfect answers, but it does need to respond ASAP, over and over again. I don't want a "dumb" model, but I can't sacrifice speed for anything, and it's not solely an LLM task: there are other neural networks that will need to feed inputs to the 3090. I can make the CPU or a 2nd GPU do that, but then PCIe bandwidth is needed to send that info to the 3090.
xflareon@reddit
I agree with the other comment, you'll probably regret not having more storage
Similarly, your power supply doesn't have a ton of headroom to add more GPUs later, and the same goes for your motherboard. According to the manual it has one PCIe x16 slot running at x16 speeds, doesn't mention bifurcation, and the other two physical x16 slots only run at PCIe 3.0 x1 speeds.
PCIe bandwidth doesn't really matter for inference except with tensor parallel, but there should still be better options, for example bifurcation to x8/x8 or x4/x4/x4/x4, or even two x16 slots at x8 speeds, with another through the chipset.
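For rough numbers, here's a minimal sketch of the bandwidth math (per-lane rates are the standard published figures after encoding overhead; the load time assumes the whole model crosses the bus exactly once):

```python
# Rough PCIe math: approximate usable GB/s per lane times lane count,
# and how long a one-time model load would take over each link.
PCIE_GBPS_PER_LANE = {"3.0": 0.985, "4.0": 1.969, "5.0": 3.938}

def link_bandwidth_gbs(gen: str, lanes: int) -> float:
    """Approximate usable bandwidth of a PCIe link in GB/s."""
    return PCIE_GBPS_PER_LANE[gen] * lanes

def load_time_s(model_size_gb: float, gen: str, lanes: int) -> float:
    """Seconds to push a model of this size across the link once."""
    return model_size_gb / link_bandwidth_gbs(gen, lanes)

if __name__ == "__main__":
    model_gb = 24.0  # e.g. a 3090 filled to the brim
    for gen, lanes in [("3.0", 1), ("3.0", 4), ("4.0", 8), ("4.0", 16)]:
        bw = link_bandwidth_gbs(gen, lanes)
        t = load_time_s(model_gb, gen, lanes)
        print(f"PCIe {gen} x{lanes}: ~{bw:5.1f} GB/s, ~{t:5.1f} s to load {model_gb:.0f} GB")
```

Outside of tensor parallel, the link mostly matters for that one-time load; token generation barely touches it.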
nas2k21@reddit (OP)
If I bifurcate a 3090 or something similar, I cripple the card. My take is that unless you can afford the latency penalty, the best bet is to stick to things that can run on a single card; my use case needs to respond in real time, not be a top-quality model.
kryptkpr@reddit
You do not 'cripple the card' with x8; the only difference in single-GPU use cases is initial model loading speed. It's pretty safe to assume you will want dual GPUs sooner or later, 24GB isn't enough for 70B.
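Quick back-of-the-envelope on the weights alone (bits per parameter are approximate for the quantized formats; KV cache and runtime overhead come on top):

```python
# Back-of-the-envelope VRAM needed just for the weights of a 70B model.
PARAMS_B = 70  # billions of parameters

BITS_PER_PARAM = {
    "fp16":   16.0,
    "q8_0":    8.5,  # approximate, includes quantization scales
    "q4_K_M":  4.8,  # approximate
}

for fmt, bits in BITS_PER_PARAM.items():
    gb = PARAMS_B * 1e9 * bits / 8 / 1e9
    print(f"70B @ {fmt:7s}: ~{gb:5.0f} GB of weights")
# Even the ~4.8-bit quant lands around 42 GB, well past a single 24 GB card.
```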
nas2k21@reddit (OP)
I just said I'm cool with no 70b?
kryptkpr@reddit
Not now, not ever?
All it takes is a better choice of motherboard that can run either x16/x0 or x8/x8
nas2k21@reddit (OP)
This board can run x16/x0 just fine... Lmao, 4.0 x8/x8 would reduce the bandwidth I'm spending $1500 to get. Yes, I'd rather have real-time responses than have to wait for a higher-quality answer, because my application needs to respond in real time above all. Using 2 GPUs means extra PCIe latency, taking away from the exact goal of building this PC.
kryptkpr@reddit
That isn't how any of this works.. have a great day.
nas2k21@reddit (OP)
... That's exactly how it works, even a 12GB 3060 knows the difference between 4.0 x8 and 4.0 x16, you think a 3090 Ti or 5080 won't?
kryptkpr@reddit
I measured this and posted about it 8 months ago - there was no meaningful difference for single GPU, even when dropping all the way to USB x1 riser garbage.
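(Not the original benchmark post, but a minimal sketch of how to repeat the single-GPU measurement yourself, assuming llama-cpp-python and a GGUF model that fits in VRAM; the model path is a placeholder:)

```python
import time
from llama_cpp import Llama

MODEL_PATH = "model.gguf"  # placeholder: any GGUF that fits on the card
MAX_TOKENS = 256

# Offload every layer to the GPU so the bus is only used for the initial load.
llm = Llama(model_path=MODEL_PATH, n_gpu_layers=-1, verbose=False)

start = time.perf_counter()
out = llm("Write a short story about a GPU.", max_tokens=MAX_TOKENS)
elapsed = time.perf_counter() - start

n_generated = out["usage"]["completion_tokens"]
print(f"{n_generated} tokens in {elapsed:.1f} s -> ~{n_generated / elapsed:.1f} tok/s")
```

Run it once with the card in the x16 slot and once behind the x1 riser; generation speed should barely move, only the initial load time changes.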
nas2k21@reddit (OP)
"for single gpu" the only reason id need a 2nd gpu is for models i cant run in single gpu tho, then pcie interfacing becomes thebottle neck, you're claiming \~ 60 tk/s in single, run that same model split on both cards across that x1 and see if you feel the same
xflareon@reddit
I can't speak to bifurcation adapters because I haven't used them myself but there should be boards that have two x16 slots at Gen 5 x8 speeds.
A 3090 is not bottlenecked in a meaningful way even by PCIE Gen 3 X8, so Gen 5 x8 certainly won't be a problem.
In the first place, inference speed is not affected by PCIe bandwidth unless you're using tensor parallel. Neither prompt processing nor tokens per second depends on total PCIe bandwidth, so I'm assuming you're referring to some other form of latency from the bifurcation that I'm not aware of.
Unless you mean the latency increase from larger models having more active parameters, in which case that's true, but it's also not a good reason to willingly prevent yourself from using the rig for other use cases.
Zyj@reddit
Mainboard has only one slot suitable for GPU (2nd slot is x1). Get a board where you can use 2 cards at x8 each.
nas2k21@reddit (OP)
I do not need bigger or better models. The board is cheaper to me than any other AM5 board except a worse Giga Gaming X, and I'm not building for upgradability. If I ever buy another card it will be 5000 series or better, which means I'd need PCIe Gen 5 and something like Giga's AI TOP board to have enough lanes. That AI TOP board will cost me an extra $450 before I spend extra on a 5080 with only 16GB, so I'm pretty sure about settling on one 3090 Ti for now.
DeltaSqueezer@reddit
Need more storage
Zyj@reddit
You can add a second M.2 later on
segmond@reddit
It's fine provided you know you are going to be running smaller models. Less than 32B.