What workstation to get for ~13k EUR?

Posted by TechNerd10191@reddit | LocalLLaMA | View on Reddit | 40 comments

My use-cases will be to test open-weight LLMs and work on harnesses, inference systems and possibly other non-ML workflows (CS-related) in the future. Fine-tuning would not be something I do locally because I can rent a B200 from RunPod for a couple of hours and be done with it. For my budget, my options are:

(assuming it gets released and the price tag is up to 13000 EUR in my country) M5 Ultra Mac Studio with 36 CPU cores, 64 or 80 GPU cores, 256 GB of unified memory (1.2 TB/s memory bandwidth) and 4 TB storage. With this option, I am locked behind MLX (can only use llama.cpp, oMLX and vllm-metal) but could fit comfortably DeepSeek-V4-Flash and MiniMax-M2.7.
Get a workstation with one RTX PRO 5000 (48 GB), Ryzen 9 9950X, 64 GB DDR5, 4 TB Storage - which would cost me almost 12000 EUR.

I know there is the option to get 2x DGX Sparks, but I doubt that the Sparks will get serious support or attention in 2027 and after (all contributions will focus on datacenter Blackwells first and consumer Blackwells - not a one-off Nvidia product, SM121). And, this also has the low memory-bandwidth issue.
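As a rough sketch of why memory bandwidth matters here: single-stream token generation is often treated as bandwidth-bound, so a crude ceiling is bandwidth divided by the bytes of active weights read per token. This is a rule of thumb rather than a benchmark, and the function name and parameters below are made up for illustration. It ignores KV-cache reads, compute time and all software overhead.

```python
def decode_tokens_per_second_ceiling(bandwidth_gb_s: float,
                                     active_params_billion: float,
                                     bytes_per_param: float) -> float:
    """Crude upper bound on single-stream decode speed.

    Assumption: every active weight is read from memory once per generated
    token, and nothing else (KV cache, compute, overhead) costs any time.
    """
    if active_params_billion <= 0 or bytes_per_param <= 0:
        raise ValueError("model size and bytes per parameter must be positive")
    gb_read_per_token = active_params_billion * bytes_per_param
    return bandwidth_gb_s / gb_read_per_token


if __name__ == "__main__":
    # 1.2 TB/s is the bandwidth quoted in the post; the model sizes are examples.
    for label, active_b in [("27B dense", 27.0), ("3B active (MoE)", 3.0)]:
        ceiling = decode_tokens_per_second_ceiling(1200.0, active_b, 1.0)
        print(f"{label} at 8-bit: at most ~{ceiling:.0f} tok/s")
```

Real numbers land well below this ceiling, but the ratio between two machines tends to track their bandwidth ratio for generation. Prompt processing is compute-bound and does not follow this estimate.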

Notes:

The smallest LLMs I want to run with enough headroom for 262k token context are 30B-35B models (Gemma-4 31B/26B-A4B and Qwen3.6 27B/35B-A3B). While it is not a hard requirement, I'd like to test MiniMax and DeepSeek-V4-Flash locally.
When it comes to GPU prices in my country, the RTX PRO 5000 (72 GB) and RTX PRO 6000 go for at least 9500 and 12500 EUR respectively; ergo, the RTX PRO 5000 (48 GB) is the most expensive GPU I can use without going over-budget.
I do not want to risk it and get used hardware from eBay (and I don't want to have a GPU with >300W power consumption if I am going to build a workstation).
2x RTX 5090s would cost the same as the RTX PRO 5000 and have 16 GB more VRAM, but even if I reduce the power of each GPU to 400W, the workstation will act as a space heater (and it gets 35-40 degrees Celsius - 100 Fahrenheit - in the summer, so I'd rather avoid this).
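For the 262k-context requirement, the KV cache can matter as much as the weights. A minimal sketch of the usual estimate for plain grouped-query attention follows. The layer count, KV head count and head dimension are hypothetical placeholders, not the real configuration of any model named above. Models with sliding-window, hybrid or linear attention need considerably less.

```python
def kv_cache_gib(n_layers: int, n_kv_heads: int, head_dim: int,
                 n_tokens: int, bytes_per_element: int = 2) -> float:
    """KV cache size in GiB for standard (grouped-query) attention.

    Each layer stores one key and one value vector per KV head per token,
    hence the leading factor of 2.
    """
    bytes_per_token = 2 * n_layers * n_kv_heads * head_dim * bytes_per_element
    return bytes_per_token * n_tokens / 2**30


if __name__ == "__main__":
    # Hypothetical architecture: 48 layers, 8 KV heads, head_dim 128, 16-bit cache.
    size = kv_cache_gib(n_layers=48, n_kv_heads=8, head_dim=128, n_tokens=262_144)
    print(f"KV cache at 262,144 tokens: {size:.1f} GiB")
```

With those placeholder values the cache alone would fill a 48 GB card, so it is worth plugging in the real numbers from a model's config before picking a VRAM size.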

lithium_bromide@reddit

How do you combine the VRAM like that? Does it actually work? Is it faster than a DGX Spark say? How does the model get distributed among the GPUs? Do the GPUs need to match in any way? Like manufacturer? Any specifics or gotchas? Can you mix and match GPUs? Do the available PCIe lanes matter? Can I start combining random shit I find? Thanks for reading a half dozen noob questions

rmhubbert@reddit

I use vllm with tensor parallelism set to 8 for the vast majority of the models I use. It works very well. I don't need a lot of concurrent users, so I generally see prompt processing speeds in the 3500-5000t/s range, and token generation in the 80-130t/s range for models such as Minimax M2.5 (at 4 bit precision, and full context), Qwen3-Coder-Next (at full precision and context), Qwen3.5-122b-a10b (at fp8 precision and full context), and Qwen3.6-27b (over 4 cards, full precision and context).
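A minimal sketch of what launching vllm with tensor parallelism could look like, built as an argument list so it can be inspected before running. The model name is a placeholder and the helper names are invented for illustration. `--tensor-parallel-size` and `--max-model-len` are vllm's own options. vllm also expects the model's attention head count to be divisible by the tensor parallel size, which the small check mirrors.

```python
from typing import List, Optional


def tp_size_is_valid(num_attention_heads: int, tensor_parallel_size: int) -> bool:
    """True if the attention heads can be split evenly across the GPUs."""
    return (tensor_parallel_size > 0
            and num_attention_heads % tensor_parallel_size == 0)


def build_vllm_serve_cmd(model: str, tensor_parallel_size: int,
                         max_model_len: Optional[int] = None) -> List[str]:
    """Build the argv for `vllm serve` without executing it."""
    cmd = ["vllm", "serve", model,
           "--tensor-parallel-size", str(tensor_parallel_size)]
    if max_model_len is not None:
        cmd += ["--max-model-len", str(max_model_len)]
    return cmd


if __name__ == "__main__":
    # "your-org/your-model" is a placeholder, not a real repository.
    print(" ".join(build_vllm_serve_cmd("your-org/your-model", 8)))
```

Passing the resulting list to `subprocess.run` would start the server on a machine that actually has vllm and eight visible GPUs.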

With vllm, you will have a much, much better time if all of your GPUs match, at least in terms of being the same base architecture (Ampere for the RTX 3090) and VRAM size (24GB for the RTX 3090).

Those GPUs don't all have to be from the same manufacturer, though. I have a combination of HP, Nvidia, MSI, Gigabyte, Palit, and Zotac 3090s running happily together.

The CPU and motherboard in my spec will give you 128 pcie lanes, which is more than enough. I only need to bifurcate one of the x16 slots to x8x8 in order to run 8 cards on this setup. That means when running all 8 cards on a single model, I'll get x8, but for the smaller models that I can run on 4, I can make sure they all run on x16. I've not seen that make a huge difference for inference, tbh.
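The lane arithmetic can be sketched in a few lines, assuming the slot layout given further down the thread (5 slots at x16, 2 at x8) and one x16 slot split into x8x8. The helper names are invented for illustration.

```python
from typing import List


def bifurcate(slots: List[int], index: int) -> List[int]:
    """Split one slot into two links of half the width (e.g. x16 -> x8x8)."""
    width = slots[index]
    if width % 2 != 0:
        raise ValueError("slot width must be even to bifurcate")
    half = width // 2
    return slots[:index] + [half, half] + slots[index + 1:]


def widest_links(links: List[int], n_gpus: int) -> List[int]:
    """Pick the n widest links, as you would for a job using fewer GPUs."""
    if n_gpus > len(links):
        raise ValueError("not enough links for that many GPUs")
    return sorted(links, reverse=True)[:n_gpus]


if __name__ == "__main__":
    slots = [16] * 5 + [8] * 2      # layout taken from the thread
    links = bifurcate(slots, 0)     # eight links after the split
    print("links:", links, "lanes used:", sum(links), "of 128")
    print("8-GPU job, narrowest link: x%d" % min(widest_links(links, 8)))
    print("4-GPU job, narrowest link: x%d" % min(widest_links(links, 4)))
```

This reproduces the description above: an 8-card job is limited by its x8 links, while a 4-card job can sit entirely on x16 slots.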

If you are happy with llama.cpp instead, you can be much more flexible with regards to mixing and matching GPUs, as well as offloading to the CPU and system ram. Just be aware that when mixing and matching GPUs, your performance will be set by the slowest, least capable GPU.

lithium_bromide@reddit

How much of a factor is system ram and CPU then? Like do I need to go buy a server mobo with a ton of cores?

rmhubbert@reddit

Not if your focus is on inference. My CPU cost me about £100 second hand. Memory was about £120 per 16GB, so around £600 of the close to £10k budget was on CPU and system RAM. The vast majority of my outlay was on the GPUs.

Do yourself a big favour and get a decent server or workstation motherboard like the H12SSL-i from the start. I previously tried to run multiple GPU off a PC motherboard and desktop CPU, and it was a world of pain.

lithium_bromide@reddit

Hmm fair enough. I was hoping maybe I can make use of an old i5 board from an old gaming rig. I do have access to 2 Jetson AGX Orins from an old project, each with 64GB of unified memory at 205GB/s. I tried to oculink them together with the help of CC but we couldn’t get it working.

kivaougu@reddit

Tensor parallelism

HelloSummer99@reddit

A Mac Studio and a good monitor... For your use case it's pointless to build a powerful PC. A Mac will do this job and you'll never hear the fans. The PC will sound like a jumbo jet on takeoff and also consume a lot of power.

Loud-Swim-2932@reddit

For now, I am pretty happy with the Spark option, and since it is native in the Nvidia ecosystem, I feel better with it than with an XTX or Intel investment.

I spent close to £10k recently on the following. It's working very well for me, at least, and gives me 192GB of VRAM -

8 x RTX 3090 (second hand)
64GB DDR4 DRAM (second hand)
Epyc 7443 CPU (second hand)
Supermicro H12SSL-i motherboard

FullstackSensei@reddit

H12SSL with 3090s is the most sane option, IMO. You can run four 3090s on a single PSU and bump the CPU to 48-64 core and RAM to 256GB (8x32GB) and get even more flexibility.

How does system ram and CPU affect performance? Obviously GPU is what matters most right?

I've definitely optimised for GPU only inference. I use vllm, and don't offload anything to the CPU or system RAM at runtime, so CPU speed and the amount of system RAM mostly just affects loading times, in my case.

Saying that, I am in the process of adding more system RAM, but that has more to do with the general performance improvement I should see from having all of the memory buses in use, than the actual addition of system ram.

FullOf_Bad_Ideas@reddit

Nice I have something similar but I cheaped out on mobo. What PCI-E bus are those GPUs on?

My config has 7 GPUs with their own PCIe slot (5 at x16, 2 at x8), with one of the x16 slots bifurcated to x8x8. I believe each of those is utilising PCIe 4 lanes controlled directly by the CPU.

txoixoegosi@reddit

9950x 128gb RAM Rtx pro 6000

TechNerd10191@reddit (OP)

As I write in the post, the cheapest price for an RTX PRO 6000 in my country is 12500 EUR (and 128 GB of non-ECC DDR5 cost 2500 EUR).

Disposable110@reddit

Then RTX Pro 6000 and whatever shitty second hand computer with adequate powersupply you can get for 500.

Goose-Difficult@reddit

You will need a CPU that can provide sufficient PCIe lanes if you even wish to have more than one GPU ...

welp, having headroom for offloading models is a good feature. Personally I would not step down from 64gb RAM if the rtx6000 powerhouse were installed.

vasimv@reddit

I'd go for RTX PRO 6000 96GB. And anything that you can afford with the remaining money (any cheap used desktop PC with PCIe and a 700W PSU from a second hand shop).

Because you can add RAM/better CPU/bigger SSD later but you won't be able to add more speed and VRAM that way. 😄

Sorry to bother, but a 700W PSU is insufficient. The 6000 alone peaks at 600W.

Go 1250 at least.

The Max-Q (if I were to buy an RTX PRO 6000, I'd buy the Max-Q variant) peaks at 300W
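The PSU disagreement comes down to simple headroom arithmetic, sketched below. Only the 600W and 300W GPU peaks come from the thread. The CPU figure, the allowance for the rest of the system and the 70% target load are assumptions chosen for illustration, so treat the output as a starting point rather than a sizing rule.

```python
import math


def recommended_psu_watts(gpu_peak_w: float, cpu_peak_w: float,
                          other_w: float, max_load_fraction: float = 0.7,
                          step: int = 50) -> int:
    """Round the PSU size up so peak draw stays under max_load_fraction.

    max_load_fraction and step are assumed conventions, not a standard.
    """
    if not 0 < max_load_fraction <= 1:
        raise ValueError("max_load_fraction must be in (0, 1]")
    total_peak = gpu_peak_w + cpu_peak_w + other_w
    return math.ceil(total_peak / max_load_fraction / step) * step


if __name__ == "__main__":
    # Assumed: 170 W for the CPU and 100 W for board, RAM, drives and fans.
    print("600 W GPU:", recommended_psu_watts(600, 170, 100), "W")
    print("300 W Max-Q:", recommended_psu_watts(300, 170, 100), "W")
```

Under these assumptions the 600W card lands at 1250W and the Max-Q at 850W, which is why the two replies above reach different conclusions about a 700W unit.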

Cool, so either A) add 1000€ more for a decent mobo and NVME B) step down to 64GB RAM (still nice) and keep your 15k budget

Freonr2@reddit

2x RTX 5090s would cost the same as the RTX PRO 5000 and have 16 GB more VRAM, but even if I reduce the power of each GPU to 400W, the workstation will act as a space heater (and it gets 35-40 degrees Celsius - 100 Fahrenheit - in the summer, so I'd rather avoid this).

Before you throw in the towel on this, realize that one 5090 has substantially more compute and memory bandwidth than one 5000. Two 5090s with tensor parallel will be roughly 2.5x the speed of one 5000 48GB on top of the extra 16GB total VRAM. This isn't even a competition, so it's worth figuring out a workaround to the 400W min limit. I think you can undervolt as one option. I don't own a 5090 but the RTX 6000 Ada, RTX 6000 Blackwell, and 3090s can all be set to basically anything in Linux. Here's a 6000 running at 150W https://imgur.com/a/9gr5PqR

Also keep in mind two 5090s beg for a board with two x8 slots as well. Asus Creator X870E, Gigabyte AI TOP B850, etc. 2x8 boards tend to have a slight premium on price, but it is worth it so tensor parallel will be efficient. A bit more on a board won't break your budget.

The 5000 is not a great buy IMO until you are buying so many GPUs that you need higher GB/slot density to hit a VRAM GB target inside the physical install constraints of a particular motherboard and case. Not going to be a concern unless you double or triple your budget.

hyouko@reddit

FWIW, I had a 5090 on Linux and the 400W lower limit applied there. Maybe there's a way to trick nvidia-smi into going lower, but I could not find it. Likely they are trying to prevent people from using them as server cards.

I have an RTX Pro 6000 now and can confirm that I can set the power limit to basically whatever (non max-Q model so basically 150 to 600w, I think)
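For reference, a small sketch of how the power limit is usually queried and set with nvidia-smi, built as argument lists so nothing runs by accident. The query fields and the `-pl` option are nvidia-smi's own. The helper names are invented, and the 400W floor and 575W ceiling in the example are illustrative values based on the reports above, not something read from a real card.

```python
from typing import List


def query_power_limits_cmd(gpu_index: int) -> List[str]:
    """argv that prints the allowed and current power limits for one GPU."""
    return ["nvidia-smi", "-i", str(gpu_index),
            "--query-gpu=power.min_limit,power.max_limit,power.limit",
            "--format=csv"]


def power_limit_cmd(gpu_index: int, watts: int,
                    min_w: int, max_w: int) -> List[str]:
    """argv that sets a power limit, refusing values the driver would reject.

    min_w and max_w should come from the query above; setting the limit
    normally requires root, hence the sudo prefix.
    """
    if not min_w <= watts <= max_w:
        raise ValueError(f"{watts} W is outside the allowed {min_w}-{max_w} W")
    return ["sudo", "nvidia-smi", "-i", str(gpu_index), "-pl", str(watts)]


if __name__ == "__main__":
    print(" ".join(query_power_limits_cmd(0)))
    print(" ".join(power_limit_cmd(0, 400, min_w=400, max_w=575)))
```

Running the query first shows whether a given card accepts anything below its stock floor, which is the point of disagreement in this sub-thread.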

Did you try setting it with LACT? Anyway I don't really see why PL would be an issue since you can undervolt/limit core clocks so it's never reached.

I imagine undervolting is the workaround. I swear someone posted that undervolting works on Linux now.

gingerbeer987654321@reddit

Rent it.

Obvious_Equivalent_1@reddit

Honest question, how do the economics compare? Given one can fill 24hr capacity with own workflows, against running it on local hardware.

I am familiar with the sponsored subscriptions from the big AI players, but given proprietary hardware can run 24/7 capacity how does that compare to renting I’m wondering?

$5/hr will get you an RTX 6000 or even better.

If you run it 24/7 then that’s $3600/month and it pays for itself in (hand wave) 4-6 months.

Economics do favour buying if it’s really a 24/7 flat out proposition, but most people don’t know what they want given how new AI is, and obviously means upfront cost vs spread out.

“Rent it until you really understand what you want long term” is probably the better version. If it runs 8hrs/5days then probably never worth buying.
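The rent-versus-buy comparison above can be sketched as a break-even calculation. The $5/hr rate is from the thread. Treating the hardware as a flat 13000 in the same currency is a simplifying assumption, as is ignoring electricity, resale value and the fact that rented and owned hardware may not be equally fast.

```python
def monthly_rental_cost(rate_per_hour: float, hours_per_day: float,
                        days_per_month: float) -> float:
    """Rental spend per month for a given usage pattern."""
    return rate_per_hour * hours_per_day * days_per_month


def breakeven_months(hardware_cost: float, monthly_rental: float) -> float:
    """Months of renting that add up to the purchase price."""
    if monthly_rental <= 0:
        raise ValueError("monthly rental cost must be positive")
    return hardware_cost / monthly_rental


if __name__ == "__main__":
    always_on = monthly_rental_cost(5.0, 24, 30)
    # 5 working days a week is roughly 5 * 52 / 12 days per month.
    office_hours = monthly_rental_cost(5.0, 8, 5 * 52 / 12)
    for label, monthly in [("24/7", always_on), ("8h x 5d", office_hours)]:
        months = breakeven_months(13000.0, monthly)
        print(f"{label}: {monthly:.0f}/month, break-even after {months:.1f} months")
```

With these inputs, running 24/7 breaks even in under four months, while an 8-hours, 5-days pattern takes about fifteen, before counting power and depreciation.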

Rent it until you really understand

This basically sums up the answer very eloquently. The ballpark figures you shared speak for themselves. I appreciate the answer and completely agree.

FinalCap2680@reddit

For me, spending 13K on new hardware in the current market is a waste, unless you do not care about the money...

hurdurdur7@reddit

Mac Studio if you want to be in the same room with it. Consumer GPUs produce too much heat, plus the noise needed to vent it all.

lacerating_aura@reddit

Just for reference, so you can make an informed decision: on an Intel NUC with a 12th gen i9, 64GB DDR4 and an RTX A4000 16GB (Ampere), I get about 500tk/s for processing and 22tk/s for generation when using qwen 3.6 35BA3B Q8 class quants, so all those k_xl or k_p whatever. I can fit the complete 256k context in BF16 along with mmproj in VRAM by using --cpu-moe in llama.cpp

On the same machine I can use qwen 3.5 122BA10B IQ4_XS quant with slight disk offload and above 200k BF16 context, mmproj in VRAM, again with --cpu-moe, and get 120ish tk/s for processing and 11tk/s for generation.
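A minimal sketch of what such a llama.cpp launch might look like, built as an argument list. `-m`, `--mmproj`, `-c`, `-ngl` and `--cpu-moe` are llama.cpp options. The file paths are placeholders, the helper name is invented, and the exact flags used for the numbers above are not stated, so this is a guess at the general shape rather than the poster's command.

```python
from typing import List


def build_llama_server_cmd(model_path: str, mmproj_path: str,
                           context_tokens: int,
                           gpu_layers: int = 99) -> List[str]:
    """argv for llama-server with MoE expert weights kept on the CPU.

    --cpu-moe leaves the expert tensors in system RAM, so the GPU only has to
    hold the remaining weights, the KV cache and the multimodal projector.
    """
    return ["llama-server",
            "-m", model_path,
            "--mmproj", mmproj_path,
            "-c", str(context_tokens),
            "-ngl", str(gpu_layers),
            "--cpu-moe"]


if __name__ == "__main__":
    # Placeholder paths; substitute the GGUF files you actually have.
    cmd = build_llama_server_cmd("model.gguf", "mmproj.gguf", 262_144)
    print(" ".join(cmd))
```

Keeping the experts on the CPU is what lets a 16GB card hold a long context for a sparse model, at the cost of generation speed being tied to system RAM bandwidth.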

Big qwen feels slow and frankly dumb, small qwen is usable but even more shallow and very, very prone to getting into loops, deepseek v4 flash is good and the one i want to run. Just waiting for llama.cpp support. The forks i have tried crash with gpu so i cannot give any usable numbers for that, but if i force a cpu only run of Q3_K_M, which is about 127gb gguf, so again like more than 50% disk offloaded in my case, i get the following numbers: {placeholder, couldnt remember exact numbers so doing a fresh run, will edit}

twnznz@reddit

If you don't care how long prompt processing takes, a Mac Studio is fine. My advice though - if you want to run 27B/31B class models, a single 5090 is sufficient and is WILDLY faster than a Mac Studio

autisticit@reddit

Which country?

Kal-LZ@reddit

3 x R9700 32GB 4900€~

https://www.alternate.de/SAPPHIRE/Radeon-AI-PRO-R9700-32GB-Grafikkarte/html/product/10016529

Workstation Dell Precision 7960 Xeon W7 3565X 32C 128GB DDR5 RDIMM 1400W PS 3 year warranty

7960€

https://www.ebay.de/itm/406558704620

It's just an idea, but there are multiple options for refurbished workstations that allow for the installation of multiple GPUs.

Precision 7960 support up to 4 GPU PCIe 5.0 x16

thavoc77@reddit

Look into getting a dual R9700. 64GB VRAM, faster and more flexible than the Mac. It should be easily in your budget. With a little bit of luck/DIY work even a quad R9700 should be ok, but worst case a quad-ready workstation with 2 GPUs now.

You're making so many assumptions without anything to back them up but more assumptions.

By dismissing almost all options arbitrarily, you're not leaving much room to answer your question.

OverclockingUnicorn@reddit

How are you arriving at 12k euros for that workstation?