Feedback | Local LLM Build 2x RTX Pro 4000

Posted by sebakirs@reddit | LocalLLaMA

Dear Community,

I have been following this community for weeks and appreciate it a lot! I managed to explore local LLMs with a budget build around a 5060 Ti 16 GB on Linux & llama.cpp. After successful prototyping, I would like to scale up. I have read a lot of the ongoing discussions and topics here, and came up with the following gos and nos:

Gos:
- Linux-based, wake-on-LAN AI workstation (I already have a 24/7 Proxmox main node)
- future-proof AI platform where components can be upgraded or exchanged as trends change
- 1 or 2 GPUs, with 16-48 GB VRAM each
- total VRAM of 32-48 GB
- MoE models of >70B parameters
- large RAM buffer to stay future-proof for bigger MoE models
- partial GPU offloading, as I am fine with a low tk/s chat experience (see the sketch right after this list)
- budget with a pain limit of 6,000 €, ideally below 5,000 €
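Since partial GPU offloading with llama.cpp is central to the plan, here is a minimal sketch of what it could look like via the llama-cpp-python bindings; the model filename, layer count, context size and tensor split are illustrative assumptions, not a tuned configuration:

```python
# Minimal sketch of partial GPU offloading with llama-cpp-python
# (pip install llama-cpp-python, built with CUDA support).
# The model path and all numbers below are illustrative assumptions.
from llama_cpp import Llama

llm = Llama(
    model_path="models/gpt-oss-120b-Q4.gguf",  # hypothetical local GGUF file
    n_gpu_layers=24,   # keep only part of the layers in VRAM; the rest stays in system RAM
    n_ctx=8192,        # context window size
    # tensor_split=[0.5, 0.5],  # optional: how to divide the GPU layers across two cards
)

out = llm.create_chat_completion(
    messages=[{"role": "user", "content": "Explain mixture-of-experts in two sentences."}],
    max_tokens=256,
)
print(out["choices"][0]["message"]["content"])
```

The same idea applies to a dual-GPU box; llama.cpp can also keep the MoE expert tensors in system RAM while attention and shared layers stay in VRAM, which is the usual trick for making a 120B-class MoE usable on 32-48 GB of VRAM.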

Nos:
- no N× 3090 build, because of space & power demands plus the risk of used hardware without warranty
- no 5090 build, as I don't have a heavy processing load
- no single-GPU setup, since I want >32 GB of VRAM across two GPUs
- no Strix Halo, as I don't want a "monolithic" setup that isn't modular or repairable

Our use case is local use by 2 people for daily tech & science research. We are quite happy with a readable token speed of ~20 tk/s per person. At the moment I feel quite comfortable with GPT-OSS 120B in its INT4 GGUF version, which I have played around with in rented AI spaces.
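As a sanity check for the ~20 tk/s goal, a back-of-the-envelope memory-bandwidth estimate helps. The sketch below assumes GPT-OSS 120B's ~5.1B active parameters per token, roughly 4-bit weights, and a guessed DDR5 bandwidth and VRAM share; it gives an optimistic upper bound, and real throughput will be lower:

```python
# Rough upper-bound estimate of decode speed for an MoE model whose expert
# weights partly stream from system RAM. All numbers are assumptions.

active_params = 5.1e9       # GPT-OSS 120B activates ~5.1B parameters per token
bits_per_weight = 4.25      # ~4-bit quantization (e.g. MXFP4 / Q4 GGUF)
bytes_per_token = active_params * bits_per_weight / 8

ram_bandwidth = 80e9        # ~80 GB/s, dual-channel DDR5 (assumed)
gpu_fraction = 0.4          # share of the active weights held in VRAM (assumed)

# Decode speed is limited by how fast the RAM-resident share can be read per token.
ram_bytes_per_token = bytes_per_token * (1 - gpu_fraction)
max_tok_per_s = ram_bandwidth / ram_bytes_per_token

print(f"Bytes read per token (total): {bytes_per_token / 1e9:.2f} GB")
print(f"Theoretical upper bound:      {max_tok_per_s:.0f} tok/s")
```

The takeaway is that decode speed is dominated by how much of the active weights must be streamed from system RAM per token, which is why the VRAM share and RAM bandwidth (channel count and speed) matter more than raw CPU cores for this kind of setup.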

Overall, I am quite open to different perspectives and appreciate your thoughts!

Why am I sharing my plan and asking for your feedback? I would like to avoid bottlenecks in my setup, as well as overkill components that don't bring any benefit but are unnecessarily expensive.

| Component | Model | Price (€) |
|---|---|---|
| CPU | | |
| CPU Cooler | | |
| Motherboard | | |
| RAM | | |
| GPU | 2× RTX Pro 4000 | |
| SSD | | |
| Case | | |
| Power Supply | | |
| **Total Price** | | |

Thanks a lot in advance, looking forward to your feedback!

Best wishes