Feedback | Local LLM Build 2x RTX Pro 4000

Posted by sebakirs@reddit | LocalLLaMA

Dear Community,

I have been following this community for weeks and appreciate it a lot! I managed to explore local LLMs with a budget build around a 5060 Ti 16 GB on Linux & llama.cpp. After successful prototyping, I would like to scale up. I have read a lot of the ongoing discussions and topics here, and came up with the following gos and nos:

Gos:
- Linux-based, wake-on-LAN AI workstation (I already have a 24/7 Proxmox main node)
- future-proof AI platform where components can be upgraded or exchanged as trends change
- 1 or 2 GPUs, with 16-48 GB VRAM each
- total VRAM of 32-48 GB
- MoE models of >70B parameters
- large RAM buffer to stay future-proof for bigger MoE models
- partial GPU offloading, as I am fine with a low tk/s chat experience (see the sketch right after this list)
- budget with a pain limit of 6,000 €, ideally below 5,000 €
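Since partial GPU offloading with llama.cpp is central to the plan, here is a minimal sketch of what it could look like via the llama-cpp-python bindings; the model filename, layer count, context size and tensor split are illustrative assumptions, not a tuned configuration:

```python
# Minimal sketch of partial GPU offloading with llama-cpp-python
# (pip install llama-cpp-python, built with CUDA support).
# The model path and all numbers below are illustrative assumptions.
from llama_cpp import Llama

llm = Llama(
    model_path="models/gpt-oss-120b-Q4.gguf",  # hypothetical local GGUF file
    n_gpu_layers=24,   # keep only part of the layers in VRAM; the rest stays in system RAM
    n_ctx=8192,        # context window size
    # tensor_split=[0.5, 0.5],  # optional: how to divide the GPU layers across two cards
)

out = llm.create_chat_completion(
    messages=[{"role": "user", "content": "Explain mixture-of-experts in two sentences."}],
    max_tokens=256,
)
print(out["choices"][0]["message"]["content"])
```

The same idea applies to a dual-GPU box; llama.cpp can also keep the MoE expert tensors in system RAM while attention and shared layers stay in VRAM, which is the usual trick for making a 120B-class MoE usable on 32-48 GB of VRAM.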

Nos:
- no N× 3090 build, because of space & power demands plus the risk of used hardware without warranty
- no 5090 build, as I don't have a heavy processing load
- no single-GPU setup, since I want >32 GB of VRAM across two GPUs
- no Strix Halo, as I don't want a "monolithic" setup that isn't modular or repairable

Our use case is local use by 2 people for daily tech & science research. We are quite happy with a readable token speed of ~20 tk/s per person. At the moment I feel quite comfortable with GPT-OSS 120B in its INT4 GGUF version, which I have played around with in rented AI spaces.
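As a sanity check for the ~20 tk/s goal, a back-of-the-envelope memory-bandwidth estimate helps. The sketch below assumes GPT-OSS 120B's ~5.1B active parameters per token, roughly 4-bit weights, and a guessed DDR5 bandwidth and VRAM share; it gives an optimistic upper bound, and real throughput will be lower:

```python
# Rough upper-bound estimate of decode speed for an MoE model whose expert
# weights partly stream from system RAM. All numbers are assumptions.

active_params = 5.1e9       # GPT-OSS 120B activates ~5.1B parameters per token
bits_per_weight = 4.25      # ~4-bit quantization (e.g. MXFP4 / Q4 GGUF)
bytes_per_token = active_params * bits_per_weight / 8

ram_bandwidth = 80e9        # ~80 GB/s, dual-channel DDR5 (assumed)
gpu_fraction = 0.4          # share of the active weights held in VRAM (assumed)

# Decode speed is limited by how fast the RAM-resident share can be read per token.
ram_bytes_per_token = bytes_per_token * (1 - gpu_fraction)
max_tok_per_s = ram_bandwidth / ram_bytes_per_token

print(f"Bytes read per token (total): {bytes_per_token / 1e9:.2f} GB")
print(f"Theoretical upper bound:      {max_tok_per_s:.0f} tok/s")
```

The takeaway is that decode speed is dominated by how much of the active weights must be streamed from system RAM per token, which is why the VRAM share and RAM bandwidth (channel count and speed) matter more than raw CPU cores for this kind of setup.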

Overall, I am quite open to different perspectives and appreciate your thoughts!

Why am I sharing my plan and asking for your feedback? I would like to avoid bottlenecks in my setup, as well as overkill components that don't bring any benefit but are unnecessarily expensive.

| Component | Model | Price (€) |
|---|---|---|
| CPU | | |
| CPU Cooler | | |
| Motherboard | | |
| RAM | | |
| GPU | 2× RTX Pro 4000 | |
| SSD | | |
| Case | | |
| Power Supply | | |
| **Total Price** | | |

Thanks a lot in advance, looking forward to your feedback!

Best wishes