First-time builder trying to put together a $90K 4-GPU inference server in Dubai - please tell me what I'm missing
Posted by Material-Link9151@reddit | LocalLLaMA | View on Reddit | 93 comments
TL;DR: Got executive approval to build a 4× RTX Pro 6000 Blackwell on-prem inference server for my company. Budget is $90k. I've never built a server in my life. Sourcing in UAE is harder than I expected. Looking for reality-check from people who've actually done this.
Hey everyone. Long-time lurker, first time posting. I work for a trading company in Dubai and I've somehow ended up as the guy in charge of building our first on-prem AI setup after a presentation I gave to the board went well. I'm a data/AI guy by background, not a hardware person. I've built gaming PCs before, that's about it. Now I'm staring at a BOM for something that's an order of magnitude more complex than anything I've touched, and I'm getting nervous. The project is basically my whole neck on the line at this point so I'd really appreciate sanity checks from people who've been here before.
What we're building:
We want to run 70B-class open-weight models in production (starting with Llama 3.3 70B at FP16) and grow toward flagship MoE models (Qwen3-235B-A22B at FP8) as the system proves itself. It'll be the backend for a multi-agent setup that plugs into our ERP, Outlook, and internal trading tools via MCP. 15–20 concurrent users, 24/7 uptime, with LoRA fine-tuning on top.
Current spec:
- 4× NVIDIA RTX Pro 6000 Blackwell 96GB (Server Edition or Max-Q — whichever I can actually source in the UAE right now)
- 1× AMD EPYC 9654 (single socket, 96 cores — went Zen 4 over Zen 5 to save budget for storage, figured the CPU isn't the bottleneck anyway on an inference workload, happy to be corrected)
- 1,152 GB DDR5-4800 ECC RDIMM (12× 96GB, fully populated)
- 4× Micron 9550 PRO 15.36TB PCIe 5.0 NVMe + 2× mirrored boot
- 2× Mellanox ConnectX-7 100GbE (bonded)
- Eaton 9PX 6000VA online UPS with extended battery
- Supermicro 4U chassis, 2× 2000W redundant Titanium PSU
- Ubuntu 24.04 + CUDA 12.8 + vLLM
All in, I have a maximum budget of $90K USD.
What I actually need help with:
- Is this spec balanced or am I overbuilding / underbuilding somewhere? I know some of you will tell me 1TB+ of RAM is overkill, but the logic was 3× GPU VRAM for MoE CPU offload on Qwen3-235B. Is that still the rule of thumb or am I operating on outdated advice?
- Max-Q vs Server Edition vs Workstation Edition — am I thinking about this right? My understanding: Workstation = dual-fan axial, only safe for 2 GPUs max. Max-Q = 300W blower, made for 4-GPU workstations. Server Edition = 600W passive, needs chassis airflow. If I'm going into a 4U Supermicro rackmount with proper fans, Server Edition seems like the "right" answer and not Max-Q. Anyone actually deployed these side-by-side?
- Sourcing in Dubai is turning into a real issue. Anyone here done on-prem AI hardware procurement in the GCC region recently? Any vendors I should be looking at that I'm missing, or any I should avoid?
- Can a hardware rookie actually assemble this or am I kidding myself? I'm comfortable with Linux, I can rack gear, I know which end of a screwdriver to hold. But I've never done tensor-parallel GPU config (there's a rough sketch of what that looks like right after this list), I've never tuned BIOS for a 12-channel EPYC, I've never burned-in a server for 72 hours. Am I going to brick $40K of silicon on day one if I try to assemble this myself, or is it actually doable with good documentation and patience? If it's not doable solo, is the right move paying a local integrator a few thousand USD to handle the physical build?
- Thermals in a Dubai office. We're not putting this in a datacenter. It's going into a standard office server room with a regular AC unit. The system draws ~2.1kW steady-state, ~2.4kW under training bursts. Ambient summer temps outside the building hit 45°C+. Anyone operated a 4-GPU box in a non-purpose-built room in a hot climate? What did you wish you'd known?
- Gotchas I'm not seeing. This is the one I care about most. You know that thing where people who've actually done this say "oh by the way, make sure you have X" and it's not in any guide? I want those. Fire me all the "wish I'd known" moments you've got.
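for concreteness, here's roughly what I think the tensor-parallel part amounts to in vLLM. this is a sketch, untested on my side; the model name is the FP16 target above and the knob values are placeholders I'd tune during burn-in:

```python
# minimal vLLM tensor-parallel sketch for the 4x 96GB cards; untested,
# max_model_len and gpu_memory_utilization are placeholders to tune
from vllm import LLM, SamplingParams

llm = LLM(
    model="meta-llama/Llama-3.3-70B-Instruct",
    tensor_parallel_size=4,        # shard the weights across all 4 GPUs
    dtype="float16",               # the FP16 target from the spec
    max_model_len=8192,            # cap context to protect KV-cache headroom
    gpu_memory_utilization=0.90,   # leave slack for fragmentation
)
out = llm.generate(["ping"], SamplingParams(max_tokens=32))
print(out[0].outputs[0].text)
```

the production equivalent would be `vllm serve meta-llama/Llama-3.3-70B-Instruct --tensor-parallel-size 4`, which exposes an OpenAI-compatible endpoint for the agent layer.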
I know this is a long post. I also know some of you will tell me to just buy a DGX Spark or rent from Lambda, I promise I thought about it, the on-prem requirement is non-negotiable because of data residency. I'm not trying to reinvent anything, just trying not to screw up my first serious AI deployment.
Any help, even a single sentence, is genuinely appreciated. I'll read every reply.
Thanks from Dubai 🙏
matt-k-wong@reddit
Get the new dell gb300 for roughly the same budget
Material-Link9151@reddit (OP)
can you share a link? I'm not finding a Dell "GB300" at anywhere near $90K
did you mean a specific PowerEdge model? happy to look if you can point me at the exact product.
thank you
matt-k-wong@reddit
https://www.dell.com/en-us/shop/desktop-computers/dell-pro-max-with-gb300/spd/dell-pro-max-fct6263-desktop
Material-Link9151@reddit (OP)
I got a quote of around $130–140K for the cheapest option, which is over my budget. Do you know alternatives in a more reasonable price band?
matt-k-wong@reddit
keep shopping around or tell them others on reddit got lower quotes. I saw one guy got a quote for $85K
Material-Link9151@reddit (OP)
It seems it's that price only in the USA; for other countries there are tariffs and extra export costs
matt-k-wong@reddit
TL;DR: if you are serving 70B-class models your proposed spec is pretty good. It'll be very similar to building your gaming PCs. You can't go wrong with too much memory but you can go wrong with too little. KV cache and context take up huge amounts of memory, though there are tricks you can use, such as running more memory-efficient models like the NVIDIA Nemotron series. The big reason to move to GB300 is that if you ever want to run 1T-class models or need the extra memory, you will be looking to upgrade. BTW, if uptime and reliability are a thing you should be looking at AWS Bedrock and Google Vertex; they simplify a lot of the production-quality things. If I were to build a mission-critical local system I would make sure to load balance / fail over to a proper cloud provider.
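A rough sketch of that load-balance / fail-over pattern, assuming both the local box and the cloud side expose OpenAI-compatible endpoints (hostnames and model names here are placeholders, not a tested config):

```python
# local-first with cloud fallback; endpoints and models are placeholders
from openai import OpenAI

LOCAL = OpenAI(base_url="http://llm-box:8000/v1", api_key="EMPTY")
CLOUD = OpenAI()  # real provider credentials come from environment variables

def complete(prompt: str) -> str:
    for client, model in ((LOCAL, "llama-3.3-70b"), (CLOUD, "gpt-4o")):
        try:
            r = client.chat.completions.create(
                model=model,
                messages=[{"role": "user", "content": prompt}],
                timeout=30,  # don't hang on a saturated local box
            )
            return r.choices[0].message.content
        except Exception:
            continue  # local box down or overloaded, fall through to cloud
    raise RuntimeError("all backends failed")
```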
NoahFect@reddit
People have been getting quotes near $200K from Dell, though.
ForsookComparison@reddit
Step 1: Use some of this budget to pay someone that knows what they're doing
valdev@reddit
Never worked on a car before, got approval to build a Ferrari. Don't worry guys, I'll ask ChatGPT what to do next.
Material-Link9151@reddit (OP)
I will build the Ferrari, as my first car. Message me in 6 months and learn the results. Thanks for the motivation.
valdev@reddit
"Learn the results".
My guy, the point of what I said wasn't that it couldn't be done, but more that it's unwise.
I've built no fewer than 5 servers and between 250 and 300 desktop computers in my life, have a local home server with many graphics cards to power my compute, and serve as the CTO for quite a few companies. I say all of that not to brag, but to say this.
I would STILL bring in a specialist to make sure that expensive equipment is installed correctly and then the implementation is done correctly.
sob727@reddit
That is the right answer
Material-Link9151@reddit (OP)
lol fair, that's step 1 I'm already doing. got a local integrator here in Dubai lined up for the physical assembly + burn-in + 3-year on-site support. no way my screwdriver is going anywhere near $35k of silicon on day one.
the post is less "should I DIY the build" and more "am I about to get taken for a ride on sourcing." UAE channel supply for the Blackwell SKUs is weird, some vendors straight up refuse to touch NVIDIA because of export-restriction paperwork, others are quoting like $90k isn't enough when I know the BOM cost. trying to figure out if I'm seeing normal regional pricing or getting anchored.
appreciate the push though, genuinely good reminder not to cheap out on the expertise part.
SadGuitar5306@reddit
Models you mention are a bit stale now. IMO you will get much better results running modern models up to 500B in a smaller quant (8 or even 4 bits). Say, Qwen 3.5 397B Int4 is about 236GB in size, which would leave a lot of space for context.
Material-Link9151@reddit (OP)
good point, and honest update: I've been mentally anchored on the models that were "stable production" 12 months ago rather than what's actually current.
question back: could you point me at how to do my own research on this and select the optimal model?
thanks for the nudge.
SadGuitar5306@reddit
Benchmarks, but they don't provide the full picture. You need to try them on your specific tasks. You can use OpenRouter to try different models.
MelodicRecognition7@reddit
meh I won't even read further, you really should DYOR prior to building this system.
Material-Link9151@reddit (OP)
on the models, fair. that's a genuine research gap on my end and worth calling out.
on the CPU though: if you've got a specific CPU recommendation for this workload I'd genuinely take it. "DYOR" isn't helpful on its own, your power-limiting link in the other comment actually is, which is why I'd value your take on this too if you have one.
MelodicRecognition7@reddit
as other person mentioned already,
- this is correct; also, depending on whether you do full-GPU inference or offload the model to system RAM, you might need more cores, but definitely not 96 and highly likely not even 64. However, with AMD there is another catch: it is necessary to check the number of CCDs/CCXs, because the usable memory bandwidth scales with the CCD count https://old.reddit.com/r/LocalLLaMA/comments/1mcrx23/psa_the_new_threadripper_pros_9000_wx_are_still/ and most <=32-core models usually have fewer than 8 or 12 CCDs. If you offload models to system RAM then you will need the full 8- or 12-channel memory bandwidth, so you will need to choose a specific CPU model.
And with both AMD and Intel there is a "memory bandwidth limit" where just a few threads (how many depends on the CPU model) fully saturate the memory bandwidth, and adding more threads lowers the token generation speed: https://litter.catbox.moe/kzlxdu9nwwa1hr4s.png
although for prompt processing the performance is linear and always grows with the number of threads.
Material-Link9151@reddit (OP)
thank you, this unpacked something I genuinely didn't understand. let me make sure I got it:
so for my workload (mostly VRAM-resident inference with occasional MoE offload), the right CPU is probably a higher-clock part with enough CCDs rather than a max-core-count one.
this is one of the most useful technical corrections I've had on the thread. thank you for taking the time on it.
MelodicRecognition7@reddit
system memory bandwidth matters if you offload parts of the LLM into system RAM; for an LLM fully fitting into VRAM, system memory bandwidth is not that important, but single-core CPU performance is. The 9174F has 8 CCDs, but its 4.4 GHz is better than the 9654's 3.7: https://en.wikipedia.org/wiki/Epyc#Fourth_generation_Epyc_(Genoa,_Bergamo_and_Siena)
However some tests show that 8-CCD models have the same bandwidth as 12-CCD models: https://old.reddit.com/r/LocalLLaMA/comments/1fcy8x6/memory_bandwidth_values_stream_triad_benchmark/
If you want to make sure you should rent servers with CPUs you're interested in and run memory benchmarks yourself.
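For example a crude single-threaded probe like this (array size is a placeholder, just big enough to blow out the caches; numpy runs it on one thread, so launch several copies pinned with numactl to reproduce the saturation curve from the plot above):

```python
# crude memory-bandwidth probe, ~24 bytes touched per element
import time
import numpy as np

N = 200_000_000              # 3 arrays x ~1.6 GB each
a = np.empty(N)
b = np.random.rand(N)
c = np.random.rand(N)

t0 = time.perf_counter()
np.add(b, c, out=a)          # read b, read c, write a; no temporary
dt = time.perf_counter() - t0
print(f"~{N * 24 / dt / 1e9:.1f} GB/s on one thread")
```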
Green-Rule-1292@reddit
How's the Apple situation in UAE?
If all you're gonna do is inference and regular "a human in the loop"-type things you could maybe consider getting a bunch of 256GB Mac Studios instead (or rather, wait until the M5 version is released in a few months and see if the 512GB version makes a return).
Cluster them if needed, or put a load balancer in front, or let different departments use separate instances, and/or keep a couple as spares in the drawer under your desk for sudden failures (failures that would be very unlikely as well since it's all just prebuilt little compute cubes from Apple instead of a single custom-built monster server) etc.
Easier to maintain, lower power usage, lower heat generated. Also probably slower than the server you imagine and not as useful for training since lacking cuda and all but like...
How fast do you really need here? From your description it sounds mostly like regular office stuff?
What are the workloads like? Big documents with thousands of pages? High frequency and time critical stuff? Or mainly just regular office worker stuff with about a dozen people typing questions by hand into chat UIs?
You mention "internal trading tools", are you able to elaborate a bit on what that means more specifically? (In terms of throughput frequency and time-criticality I mean, not what anybody is trading in)
Material-Link9151@reddit (OP)
really appreciate the creative angle, Mac Studios are genuinely underrated for local LLM work and the M5 Ultra at 256GB+ is a legitimate option for a lot of deployments. let me answer your clarifying questions first because you nailed the useful ones, then explain why I'm still on the server path.
workload reality check: you're absolutely right, it's mostly "regular office worker stuff." but here's the part that shifts the architecture:
it's agentic + HITL, and it's the foundation, not the endpoint. what we're actually building is a multi-agent system with supervisor agents, auto-entry into Microsoft Dynamics, SharePoint + Outlook access, mandatory validation at every step, full audit trail, and human-in-the-loop checkpoints. every user request cascades into 5–15 internal LLM calls across planner/worker/supervisor/validator agents. so "12 users" at the UI layer becomes 60–180 concurrent model calls under the hood.
why Mac Studios don't fit for us specifically: Microsoft enterprise stack end-to-end, the supervisor-agent architecture, fine-tuning needs, and the future scaling path.
so: for a research lab or a small shop doing inference-only, Mac Studios are a great call and genuinely simpler. for our specific case, Microsoft enterprise stack, supervisor-agent architecture, fine-tuning needs, foundation for future scaling, server is the right call even though the workload itself is modest.
thanks for pushing on this, it's the kind of question that's healthy to answer out loud.
imonlysmarterthanyou@reddit
I have over a decade of experience both spec’ing and building servers for Mission critical applications. No explicit experience with inference outside of my own home lab.
I too come from a Windows/Linux environment, and have basically no experience with Macs.
Some of the things you're saying don't add up.
You are talking about high-availability workloads, but your specs have everything on a single server.
Your monitoring requirements do not come out of the box with server grade hardware. While having some variation of IPMI could give you some out of band observation of the hardware, everything else is going to be baked into the OS or the specific service.
If you build this server, and it starts being used as you intended…what are the failure modes? How long could this be down for?
You can have support all day long, but if they don’t have a cache of the exact same hardware locally your downtime is likely going to be measured in days or weeks. Being where you are, it might be measured in weeks or months, depending on what’s happening at the time.
What is your tolerance for downtime? In the event the server dies, how long can your organization tolerate the loss of this resource?
Also keep in mind there are environmental concerns that you probably haven’t thought about if this is new to you.
Even without GPUs, servers generate a ton of heat and noise. This server isn't going to be sitting in someone's office unless you don't like them.
You will need at least an area large enough to host a 4 post rack. A datacenter grade HVAC that controls both temperature and humidity. If it’s too wet, you will get condensation. If it’s too dry then static electricity will build up from the massive amount of air flowing and in both scenarios you have a dead server.
This is mission critical, so you will want two of everything. Two HVACs in case one goes down or needs maintenance. Lead times on these HVACs, as of last year when I ordered one, were 56 weeks in the US.
You will want two separate circuits, each with their own breaker. You will likely want two independent UPSes. Keep in mind server-grade UPSes are not the same as the ones on your desktop. Even if the desktop versions are large enough to handle that load, they are likely not fast enough to switch in case of a power failure.
Upstream of this, do you have generators in case main power fails? Do you have a means to monitor, maintain, and refill those generators?
I think the Mac solution might be a better template for you. You do not have to actually use Macs, but as an alternative:
Get two Mac Studios that can run your inference loads like you would on a pair of the 6000s. Each can use a desktop UPS. Each is designed to run quiet, but throwing a fan on them would help you out without killing anything or anyone.
Load balance the http endpoints across both units. You can do this by running HA nginx servers on cheaper hardware in front of them or on the Mac’s themselves.
This will get you local inference.
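Roughly like this if you skip nginx and round-robin client side (hostnames and model name are made up; assumes whatever you run on the boxes speaks the OpenAI API):

```python
# naive client-side round-robin across two inference boxes; placeholders only
import itertools
from openai import OpenAI

ENDPOINTS = ["http://box-1:8000/v1", "http://box-2:8000/v1"]
clients = itertools.cycle(
    [OpenAI(base_url=u, api_key="EMPTY") for u in ENDPOINTS])

def chat(prompt: str) -> str:
    r = next(clients).chat.completions.create(
        model="llama-3.3-70b",
        messages=[{"role": "user", "content": prompt}],
    )
    return r.choices[0].message.content
```

A real deployment would add health checks so a dead box drops out of the rotation, which is what the nginx layer buys you.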
Training is a different problem. You can get desktop versions of those same 6000 cards. Buy a high-end desktop to run your training, or you can rent B200 clusters by the hour to do your training on and avoid that locally altogether. I don’t know what your restrictions on your data are exactly.
I don’t know if you plan to scale this up into having multiple large servers if this becomes successful, if so, the mac studio route might not be what you’re looking for… but building a small data center is far more expensive than a lot of people realize.
Material-Link9151@reddit (OP)
genuinely the hardest-hitting comment on the thread and I've been sitting with it before replying. going to engage honestly because you're right about most of it.
single-point-of-failure critique: you're right. I've been papering over this. a single-node $90K build is not HA in any meaningful datacenter sense. if the motherboard dies mid-day, we're down until RMA which in the UAE is genuinely weeks for enterprise parts.
you're right that I've been underweighting "what if a GPU dies in month 3." I need a cold-spare strategy and a documented DR runbook, not hand-waving.
on HVAC + humidity + dual UPS + generators: you're right that proper enterprise infra is expensive in ways most people don't see. 56-week HVAC lead time blew my mind. but the honest tradeoff: to do this at true datacenter grade (dual HVAC with humidity control, dual UPS, generator, proper rack room) you're at $300K+ on facilities alone. I have $90,000 total. so I'm consciously accepting reduced uptime for budget feasibility
on the Mac Studios suggestion: can't use Macs specifically (Microsoft enterprise stack, and I'd also like to keep the upgrade path open).
want to ask your take on something related though: I've been looking hard at the GB300 DGX Station. priced it in UAE at ~$143K which is well over my $90K budget. but the design itself genuinely feels like it would save me from half of what you're describing.
so my question: do you know of appliance-class alternatives in a more reasonable price band?
what I'm really trying to buy is "factory-validated, NVIDIA-certified, warranty-backed, pre-imaged, someone else's integration problem" less so any specific GPU silicon. is any of this category actually good, or is the appliance premium always 2x a DIY build without meaningful reliability gains? you'll know better than I would.
imonlysmarterthanyou@reddit
Also, if you are going to run the inference servers on windows you will be leaving perf on the table…
imonlysmarterthanyou@reddit
I have seen some AI in a box companies, but they are usually shipping containers and way outside your price range.
I actually just looked at the price of the workstation in the US…it’s just under $90k USD…straight from CDW. They ship worldwide…so not sure if your tariffs are what are increasing the price so much. (I do see the gb300 has ddr5 vs the gddr7 in the 6000 pro, so there will be some tradeoffs it looks like).
What I was probably poorly explaining about the mac idea was to use it as a template that you apply to other hardware.
You don’t have the infrastructure to run a server properly, so don’t.
Go buy a dell workstation outfitted with two RTX 6000 pros. You were going to run the cards in pairs, so buy two workstations. (They do sell them in quads if you really want to run them in a single workstation).
Kitted out on Dell's website it comes in under your budget. (Never buy directly from Dell's site, they always give discounts if you call).
The only downside of two boxes vs one is training.
Also, I saw you included massive networking…what are you connecting it to? Do you have 100Gb internet there?
Green-Rule-1292@reddit
I did in fact come to think about one more thing to add.
There is a company called tinygrad (I have no affiliation, there's probably other companies as well that specialize in this and I can't vouch for these guys specifically or anything like that).
They do offer a prebuilt machine with those GPUs though and it's listed at about 65k usd on the webpage.
https://tinygrad.org/#tinybox
Perhaps consider reaching out to them (or whichever of their competitors) and get yourselves some hw support as part of a package deal?
Material-Link9151@reddit (OP)
Thank you for the advice
Green-Rule-1292@reddit
Ah, got it! Yeah I agree that changes things quite a bit.
I sadly have nothing else to add then other than going +1 on what others have already mentioned about server fan noises and heat generation sucks to deal with in office environments.
Best of luck with your project!
Material-Link9151@reddit (OP)
Thank you very much!
AurumDaemonHD@reddit
Is $90K the local budget in Dubai? I would look at enterprise cards with HBM, not GDDRx.
Material-Link9151@reddit (OP)
yeah $90K USD, company-approved budget. HBM was on the shortlist for the bandwidth (H200 has ~3× the memory bandwidth of GDDR7 Blackwell) but the pricing doesn't work: H200 141GB is ~$35–40K per card. 4× H200 = $140K+ just on GPUs, no host system, no storage, no UPS. even 2× H200 eats 75% of the budget and leaves nothing for the build.
RTX Pro 6000 Blackwell Max-Q at ~$10K/card for 96GB is the sweet spot for this budget tier. lower bandwidth, yeah, but ~4× the VRAM per dollar. if I had $300K I'd be all over H200 🫡
AurumDaemonHD@reddit
I would look into things like what models you are going to target exactly, what quants and inference engines, and whether they support the hardware. There have been some recent breakthroughs on Hoppers and Blackwells too, along with dflash and turboquants.
You could rent the exact configuration in the cloud, run the builds, and see it in practice. Although new improvements are dropping every day... I think the wise strategy would be to buy a platform to which you can add more resources later.
As concurrency rises, not just VRAM but memory bandwidth becomes more crucial, so if that's your lategame I'd consider them.
I view these RTX Pro 6000 cards as workstation grade. Yes, you can stack Max-Q in a server, and yes, it might be economical to do so. But there are better cards designed for that kind of workload. As you say, if you had the budget you'd have them. Why not negotiate for more, or at least open with something and add other cards later...
Anyway, if you decide to go with this config it's at least something; I mean you can't get anything better for the price, as you said. But you need to look into P2P and how you plan to run this agent thing, whether tensor parallel or pipeline parallel; it becomes quite a tangled mess quickly.
Turns out the hardware will be the lowest cost. Development is hard. Especially when Slop is slowing you down.
And the models you plan to use are too big in my opinion. You can achieve great lengths with a 27B.
Going 10× that size doesn't yield even 2× the result.
Hence. Bandwidth...
Material-Link9151@reddit (OP)
appreciate the detail here, multiple good points.
cloud benchmark before buying - 100% going to do this, one of the best pieces of advice on the thread today.
P2P concern — real and fair. But my knowledge isn't enough to understand this yet; what would you suggest, how can I look into it?
"buy a platform you can add to later" — yes, this is where I'm heading.
on the 27B-is-enough point — for most Q&A and summarization yes, but our workload is multi-agent with supervisor + worker + validator chains. reasoning depth matters at the planner layer, and 70B has a meaningful lead over 27B on structured planning benchmarks. may use 27B as a "worker" model for tool use and formatting alongside a 70B planner though — model cascading is definitely in the design. What do you think?
AurumDaemonHD@reddit
P2P is hardcore. for rtx 3090 i read a lot on https://github.com/guru1987/open-gpu-kernel-modules/blob/580.105.08-p2p/docs/P2P_ENABLE_RTX3090.md
You will have to follow your own path to understand this stuff with BAR, IOMMU, and whatnot to get your custom drivers if needed. It's an overlooked aspect for people entering.
Any lead the 70B has will be vaporized by finetuning the 27B.
It's logical not to give up 2× the space for a 10% performance boost.
Material-Link9151@reddit (OP)
good points, thanks for the time on this.
nmrk@reddit
If you're looking for H200 availability in Dubai, you're going to need a connection in government. Those processors are on the international restriction list, with exceptions for only certain shipments. High-end NVIDIA processors should be hard to obtain in Dubai.
Material-Link9151@reddit (OP)
Yes, that's true, and H200s are already out of our budget anyway.
ChampionMuted9627@reddit
Your TPM (tokens per minute) directly depends on memory bandwidth
Material-Link9151@reddit (OP)
correct
chisleu@reddit
Material-Link9151@reddit (OP)
thanks, clean summary. appreciate the directness.
CalligrapherFar7833@reddit
Is the room staying closed? Those Eaton lead-acid batteries need proper ventilation.
Material-Link9151@reddit (OP)
good shout. going to price both battery options when I place the UPS order, appreciate the nudge.
Loose_Rip359@reddit
The hardware side is the easy part — the thing that kills first on-prem deployments is the serving layer (vLLM/TGI config, request queueing, per-team isolation), not the GPUs themselves. I've been working on on-prem agent harnesses at https://valet.dev/enterprise and would love to share notes on the software stack before you lock in the build.
Material-Link9151@reddit (OP)
I would love it if you would share any notes on this, because this is the second big wall I need to get through after hardware.
john0201@reddit
The CPU doesn't make much sense. I would get a 9975WX Threadripper or a 32-core EPYC part. You need single-core performance, not a bunch of idle cores. You don't need 12 memory channels and I can't imagine what you would use 1TB of memory for. What do you need 30TB of storage for? You can get 3× 4TB 990 Pros and fit most models (one for startup, one for models, one for backup). Also, why do you need 200GbE?
The Server Edition cards do not have fans; you can't run those in an office unless you want everyone wearing hearing protection. I would get the Workstation cards and dial them back to the lowest non-Max-Q wattage, which I think is 400 watts. I would get a SilverStone case that can be set up vertically like a workstation or horizontally in a rack, with dual PSUs you can plug into two outlets.
This is still going to be the equivalent of a space heater on high all year. You cannot do that with standard AC. I have a 2x gpu threadripper box and I had to move it to my garage, which it heats up by about 10 degrees.
Material-Link9151@reddit (OP)
this is the comment I was hoping for, genuinely. let me go through point by point:
Server Edition = loud — yep, fair, I underweighted this. zero onboard cooling + 4U chassis fans screaming at datacenter noise levels is a non-starter in an office adjacent to traders on phone calls. Max-Q back to the top of the list.
Threadripper Pro vs EPYC — legit. TR Pro 9975WX + WRX90 saves me ~$6K over EPYC 9654 and you're right that single-core perf matters more than raw core count for tokenizer/sampler work. the 12-ch vs 8-ch memory bandwidth thing isn't decisive since we're not CPU-memory bound. reason I leaned EPYC was 30 concurrent users = 30 parallel token pipelines, which likes more cores — but 32 Zen 5 cores is probably plenty for that load. going to price both ways.
1TB RAM overkill — partially fair. the 3× VRAM logic is Qwen3-235B-specific (full FP16 in RAM while FP8 runs on GPU for fast model swap). if I'm honest, year 1 will be Llama 3.3 70B dense where 512GB is more than enough. smart move is spec 512GB now with room to expand, not lock in 1152GB day one.
Storage 30TB — same logic. for POC year 1 with 2-3 models + vector DB + recent trading data, 30TB is fine. I was sizing for month-18 not month-1.
Bonded 100GbE — pushing back gently here: I don't need the throughput (single 100GbE moves a 140GB weight in \~14s, plenty). I want dual NICs for redundancy — can't have a NIC failure take down a 24/7 trading workload. so: dual 100GbE but active/passive failover, not LACP. fully agree bonding is a headache that doesn't buy anything for inference.
Cooling — this is where I most need your help. I'm putting this in a standard office room with regular split AC, and Dubai summer hits 45°C+ ambient. your 2-GPU TR box heats your garage by 10°; mine at 4 GPUs would be roughly 2× that heat. my current plan is: dedicated 2-3 ton mini-split running continuously, room sealed from the rest of the office, temperature alerts wired to my phone. main question: am I sizing for airflow or for raw heat removal?
this comment saved me from 2-3 real mistakes. thank you.
john0201@reddit
It's not airflow, it's the BTUs. An additional 2-ton mini-split would work according to the math. The Workstation cards have their own fans. At 400W it will be fine.
I would get the silverstone threadripper AIO and something like this: https://www.silverstonetek.com/en/product/info/computer-chassis/rm53_502/ which is built for just what you are doing.
Material-Link9151@reddit (OP)
thanks, the BTU framing is the right correction; I was conflating airflow and heat removal. 2-ton mini-split + 3× headroom on the load matches what I was sizing anyway so good to have it validated.
on the Workstation @ 400W suggestion: genuinely interested but want to make sure I understand before committing. the thing I'm nervous about is axial-fan recirculation in a 4-card layout. on your 2-GPU TR box, what GPU junction temps are you seeing at sustained 100% load? and does the SilverStone RM53-502 space the cards out enough that the front fans can actually feed cold air between them, or are they butted against each other?
main thing I'm trying to avoid is thermal throttling silently eating 20% of performance 6 months in because temps climb over Dubai summer. if the RM53 chassis + power-limit trick actually keeps junction temps under ~85°C sustained with 4 cards, I'll take that over Max-Q (you get the clock headroom as a free upgrade). but I want eyes-on numbers from someone who has run it, which is why I'm asking.
the silverstone AIO for TR Pro is going on the list either way. air coolers for 350W+ TDPs are a losing battle.
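for reference, the kind of throttle watcher I have in mind; just a sketch, untested, and the 85°C threshold / 30s interval are guesses I'd tune (assumes nvidia-ml-py is installed):

```python
# GPU temp/power watcher sketch; pip install nvidia-ml-py
import time
import pynvml

pynvml.nvmlInit()
handles = [pynvml.nvmlDeviceGetHandleByIndex(i)
           for i in range(pynvml.nvmlDeviceGetCount())]
while True:
    for i, h in enumerate(handles):
        temp = pynvml.nvmlDeviceGetTemperature(h, pynvml.NVML_TEMPERATURE_GPU)
        watts = pynvml.nvmlDeviceGetPowerUsage(h) / 1000
        if temp >= 85:
            print(f"ALERT gpu{i}: {temp}C @ {watts:.0f}W")  # wire SMS here
    time.sleep(30)
```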
john0201@reddit
Is this a person or an LLM?
Material-Link9151@reddit (OP)
Me, human.
john0201@reddit
I only have 2 cards, but those cards are designed to be stacked. Limited to 400W in a normal or even warm room, I wouldn't expect any issues in a proper case.
NVIDIA probably has specs for the max ambient temp.
Material-Link9151@reddit (OP)
Thank you
MelodicRecognition7@reddit
workstation can go down to 150W: https://old.reddit.com/r/LocalLLaMA/comments/1nkycpq/gpu_power_limiting_measurements_update/
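if you want to script it instead of `nvidia-smi -pl`, something like this via pynvml should work (needs root; values are in milliwatts, and the 300W cap is just an example):

```python
# read the allowed power-limit range, then cap the card
import pynvml

pynvml.nvmlInit()
h = pynvml.nvmlDeviceGetHandleByIndex(0)
lo, hi = pynvml.nvmlDeviceGetPowerManagementLimitConstraints(h)
print(f"allowed: {lo // 1000}-{hi // 1000} W")
pynvml.nvmlDeviceSetPowerManagementLimit(h, 300_000)  # cap at 300 W
```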
Material-Link9151@reddit (OP)
this is genuinely the most useful data point I've gotten on the whole Max-Q vs Workstation debate, thank you for the link. reading the measurements thread now.
Annual_Award1260@reddit
Wait for the dgx stations
https://marketplace.nvidia.com/en-us/enterprise/personal-ai-supercomputers/?superchip=GB300&page=1&limit=15
Material-Link9151@reddit (OP)
I am trying to get a quote and they are already pricing it at around $130K USD, which is over my budget. do you know alternatives in a more reasonable price band?
StableLlama@reddit
What are you missing? Experience. And knowledge.
Building a server is a serious business that is much more than plugging some cards somewhere.
But there's an easy solution: buy it from a company that has a good track record in building servers.
Getting the software to run in a way you expect it to (and to keep it running!) is already a big job. So don't waste resources on something where you don't even know where the problems might be.
Material-Link9151@reddit (OP)
Thank you for your comment
deejeycris@reddit
Don't do it yourself is my advice.
Material-Link9151@reddit (OP)
Thanks
Signal_Ad657@reddit
The part I'm lost on is the 1,152GB of RAM against the 4× RTX 6000s against wanting to run Qwen3-235B. There are a lot of things here that don't feel like they add up. There are 70Bs you want to run that you could host on 1 card, or 4 in parallel. The next step up, you have more than enough VRAM at 4 cards to do Qwen3.5-397B and beyond, which would be pretty epic. Now where does the extra 1,100GB of RAM fit into your plans?
Material-Link9151@reddit (OP)
hitting on real things here, appreciate the push. going through it:
1152GB RAM vs VRAM for inference — fair critique, and another commenter u/john0201 hit me with the same point. you're right that if the model is VRAM-resident, RAM requirements drop a lot. my 3× VRAM math was MoE-offload-specific for Qwen3-235B (full FP16 weights in RAM while FP8 runs on GPU for fast swapping). trimming to 512GB for year 1 with headroom to expand is the better call. good catch.
why 4 cards for 70B — you're right that 70B fits on 1–2 cards. the 4-card choice isn't for model fit, it's for concurrent throughput. Llama 3.3 70B at FP16 on 1 card serves ~3–5 concurrent users before latency tanks; across 4 cards with tensor parallelism it serves 20–30 at acceptable latency. that's the actual business requirement. room for Qwen3-235B and future 400B-class models is a bonus, not the driver.
define success — this is the sharpest thing in your comment and worth answering directly. success = 15–20 traders using it daily for natural-language questions across trading history, auto-generated contracts/sales/proformas, and multi-step agent workflows touching ERP + email. measurable by daily active users + hours saved vs manual work. if those metrics don't materialize in 6 months, the $90K is a write-off regardless of the hardware spec. that's the real risk I'm managing, and you're right that I should be shouting that louder in the plan.
$10–12K pilot first — 100% was my original mental model. I actually already did the pilot over the last 4 months on laptop GPUs + obfuscated data through Claude API, presented results to execs, and they approved the on-prem build off the back of it. going back now to say "actually let's do a smaller pilot" would unwind the commitment and kill momentum. the $90K build is the scaled-up version of the pilot you're describing — it's not skipping the pilot, it's already past it.
totally hear you that this feels open-ended. the agentic orchestration piece is genuinely unpredictable. the hardware investment is defensible because Llama 3.3 70B at our required scale is proven tech. the risk is in the agent layer, not the iron. thanks for the push.
Signal_Ad657@reddit
Okay so here’s the next question. Standard inference at scale versus agentic assistants for employees at scale would be totally different worlds. Which are you building for?
Material-Link9151@reddit (OP)
agentic, no question. and you're right that it's a totally different world.
the token math changes fast. standard inference at 30 concurrent users = 30 parallel requests, each ~500–1500 tokens total. agentic at 30 users = 30 sessions, but each one spawns 5–15 internal LLM calls (planner → retrieve → tool select → execute → reflect → respond). so actual GPU load is 150–450 calls happening concurrently, not 30. plus context windows are 3–5× bigger because you're carrying conversation state + tool outputs + RAG chunks into every call.
what that changes on hardware is mainly KV-cache capacity and batch-scheduling headroom.
honest part: 4 cards is actually on the edge for aggressive agentic at 30 concurrent users. realistic operational pattern is probably 10–15 concurrent heavy agent sessions + 15–20 lighter direct Q&A, which is what we modeled in the pilot. if every trader was running 5-tool-deep agent chains simultaneously all day I'd hit KV cache walls. I'm not pretending there's a lot of headroom.
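for anyone checking my math, the back-of-envelope KV sizing I'm using, assuming Llama-70B-class geometry (80 layers, 8 KV heads under GQA, head_dim 128, fp16 cache; verify against the model's actual config.json):

```python
# back-of-envelope KV-cache sizing; geometry and load numbers are assumptions
layers, kv_heads, head_dim, bytes_per = 80, 8, 128, 2     # fp16 K/V
per_token = 2 * layers * kv_heads * head_dim * bytes_per  # K and V
ctx, sessions = 8192, 40                                  # assumed agent load
total_gb = per_token * ctx * sessions / 1e9
print(f"{per_token / 1024:.0f} KiB/token -> "
      f"{sessions} sessions @ {ctx} ctx = {total_gb:.0f} GB of KV cache")
```

roughly 107GB of KV for just 40 mid-size sessions, out of the ~244GB of VRAM left over after FP16 weights. the wall shows up fast.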
curious what your take is? you've clearly thought about this. what would you change about the hardware if you were sizing for the same use case?
Signal_Ad657@reddit
I don't think hardware moves the needle for you as much as model selection in this case. Like for high-throughput agentic use, Qwen3-Coder-Next is an 80B MOE that can FLY vs a dense 70B. Also if you are going to stay parallel hosted, not one big model but 4 instances of a model on repeat, you start building a case for the simplicity of single-GPU distributed systems. Power gets easier to think about, thermals get easier to think about, you burn less cash on all the special hardware to make something this big friendly in one box. You of course lose the ability to combine them efficiently for one big model but it's worth a real question of whether that's the goal or not. You could probably do 4x single RTX PRO 6000 boxes for under 45k and put them on a switch and just do traffic routing on a shared API. The cost savings and simplicity start paying dividends there, but again only if parallelism and max throughput and concurrency is your real goal.
Material-Link9151@reddit (OP)
honestly, want to pause and say thank you properly. you've been the most thoughtful voice on this entire thread and every one of your comments has forced me to sharpen my thinking. the framework you keep pointing at (define success, challenge the scale, think about operational simplicity) is exactly the pressure test I needed before committing $90K of the company's money. I'd love to keep learning from you as this progresses, genuinely grateful for the time you've put in.
quick correction to my own framing from earlier, because I think I gave you (and the whole thread) a misleading picture of what I'm building:
this isn't a trading assistant. I undersold it. what we're actually building is the company's foundational AI platform. trading is month 1. marketing, finance, and legal all onboard within year 1. trading is the beachhead, not the destination.
that changes the architecture math significantly.
so the 4-GPU shared-pool rig sounds more correct for the real mission (imho, please correct me if you see it differently), even though it would be overbuilt for pure trading. that's on me for not leading with this earlier, I was mentally stuck in "how do I justify this to the board" mode rather than "what am I actually building."
that said, your horizontal-scale framework is exactly what the year-2 expansion path looks like. once a single rig starts to saturate with multi-department load, adding satellite single-GPU nodes behind a shared API (the pattern you described) is the cleanest way to grow. so the flagship rig now, horizontal expansion later.
really, thank you. I'd be making worse decisions without your pushback. if you ever have thoughts on the multi-tenant / multi-department resource-partitioning side (MIG, vLLM batching strategies, that kind of thing), I'd read them carefully.
Signal_Ad657@reddit
That’s really nice my friend thank you. Happy to hang out sometime and grab digital coffee if you want to talk more.
ExG0Rd@reddit
I just want to share my excitement about the complexity of such a help request and about the great people of the internet helping. Will follow this topic, so much interesting stuff here.
laterbreh@reddit
Have you contacted vendors/enterprise vendors for pre-built solutions? I would likely start there before considering doing this yourself.
Material-Link9151@reddit (OP)
Yes, I am doing that at the moment, and if you have any advice for the GCC it would be appreciated.
Deep_Bee6767@reddit
I would be willing to walk you through it. You would have to be OK with allowing me to include your server in my portfolio (just the build).
I would recommend waiting for the DGX Station, not overpaying on a DGX Spark cluster or any comparables (GX10). It will be out before the end of the year and can run 1T dense models. Also, you should be able to port one of your Blackwells into it.
> I know some of you will tell me 1TB+ of RAM is overkill, but the logic was 3× GPU VRAM for MoE CPU offload on Qwen3-235B. Is that still the rule of thumb or am I operating on outdated advice?
1TB of RAM as you proposed is overkill; there isn't enough VRAM for it to make sense.
The EPYC 9965 would be my recommendation; for AI and virtualization it is the best in my opinion, and your goal is to build a BACKEND. AMD Zen 6 will crank up speeds and max core count, but those don't come out until Q3 and they will also be prohibitively expensive. I would go best-of-Zen-5, which is the 9965. It will make you look good when prices hike for a short period as we get closer to the Zen 6 release.
If your goal is truly a backend, I might look into a dual-EPYC mobo. The CPU and RAM quite frankly are what make a backend take off; the GPU is only needed to virtualize.
I would split the PSUs, 2 GPUs / 1 CPU apiece, with 300W overhead. Just a heads up, 240V is needed for 2000W PSUs (maybe that's standard over there or at your location of choice).
Yes, fully populate your RAM.
In a chassis and not a tower, Server Edition is best with Max-Q as the alternative; you are absolutely right that Workstation should not be used when deploying 4. For AI and hosting LLMs, Server Edition is correct. It's meant to be on constantly, and yes, your chassis can support it. Max-Q is the safer short-term bet.
I have deployed both. Max-Q is more efficient in home labs, but for growth and constant "on", Server is by far the best choice. Just remember they are reliant upon chassis fans, so make sure your chassis is fully functional. I would not go refurbished.
No, unfortunately this is something I am not able to help with.
> I'm comfortable with Linux, I can rack gear, I know which end of a screwdriver to hold. But I've never done tensor-parallel GPU config, I've never tuned BIOS for a 12-channel EPYC, I've never burned-in a server for 72 hours. Am I going to brick $40K of silicon on day one if I try to assemble this myself, or is it actually doable with good documentation and patience? If it's not doable solo, is the right move paying a local integrator a few thousand USD to handle the physical build?
What I do know: Dubai has a lot of techies and you should be able to get it up and running for less than a grand. I would only pay the big bucks for a warranty on parts as part of their service.
Can you do this on your own? Most certainly, though it will test your patience. There are too many resources, including AI, for this not to be doable with little to no knowledge.
If you can suffer through IKEA sets by yourself then you can do this.
Depending on the mobo you might honestly be able to finish setting it up remotely from a hotel.
> We're not putting this in a datacenter. It's going into a standard office server room with a regular AC unit. The system draws ~2.1kW steady-state, ~2.4kW under training bursts. Ambient summer temps outside the building hit 45°C+. Anyone operated a 4-GPU box in a non-purpose-built room in a hot climate? What did you wish you'd known?
What I wish I knew: my first server was propped up near a window in AZ, and it was a very short-lived server. Heat is what we try to avoid. I would make sure the empty rack slots are blanked, and I would have a 72-hour inspection window during the summer where I check fans and monitor temperature over time. I would have cold air blowing onto the server. You can install baffles to force air to problem areas.
My biggest thing: your goal is to run 70B as a minimum and be able to scale, but the system itself is the backend.
I would strengthen the CPU, as going from 96 cores to 700+ means you don't have to tear it down once you are ready to go to market or offer services to others. You are already scaled on the backend.
As for running a 235B model, your GPUs will run it fast but still won't be able to make a bigger jump. You are paying a premium to run smaller models faster, which isn't as future-proof as I would like for a $90K investment. It's why I would 100 percent wait if possible and splash that on a DGX Station, which also has ConnectX-7.
I would have seriously looked at the Apple Studio 512GB offering ($18–22K) just for LLM inference, and then built the backend with the dual EPYC 9965 and 1–2 Blackwells for virtualization / rendering. The Studio should become more expensive as the year closes, and the perfect upgrade is selling the Studio and running the DGX Station with your prebuilt backend.
I can get the 9965 for $5.5K all day right now, and compared to the $2.2K for the 9654 it is a no-brainer.
Blackwell and AMD both come with their own software stacks and programs you can join, for hundreds of thousands in credits as well.
Hopefully that helped; you can reach out with any more questions whenever.
Material-Link9151@reddit (OP)
hey, opening with the important part first; yes, absolutely happy to take you up on the walkthrough offer, and completely fine with you including the build in your portfolio. genuine help deserves genuine credit and you've clearly got the experience to make this build land.
going to DM you tomorrow to exchange contact details and find a time that works for you.
quick acknowledgments on the substance before we talk properly:
EPYC 9965 at $5.5K vs 9654 at ~$2.2K — if that pricing holds in UAE channels, it's a no-brainer upgrade to Zen 5. going to specifically ask vendors to quote both and see what the delta actually is here. Zen 6 wait might be too long for my May deadline.
PSU sizing — both you and another commenter caught this. 2× 2000W in 1+1 redundant on 1900W system draw is too tight.
Max-Q vs Server Edition call — exactly aligns with where I was landing. Server Edition is architecturally correct for a 4U rackmount with proper fans, but Max-Q is safer for office deployment because of noise. going Max-Q for year 1, door open to Server Edition if we ever move it to a real server room.
NVIDIA + AMD partner programs with cloud credits — this is something I hadn't lined up yet. Thank you for sharing.
DGX Station / wait-and-splash strategy — honestly tempting but I can't wait. When do you think we'd be able to buy it in Dubai? My executive approval has a June deadline and pivoting to "let's wait 6 months for DGX Station" kills momentum. filing it for version 2 of the platform though.
Apple Studios + EPYC backend hybrid — creative architecture, but we're a Microsoft shop end-to-end (Dynamics, Outlook, SharePoint) and adding Apple Silicon to that stack creates maintenance pain our IT can't absorb. honest answer for our specific context.
again, really appreciate this. genuine help.
ConversationNice3225@reddit
Don't know the pricing in Dubai but that RAM allocation is bonkers and is more than your budget (I've recently been quoted 3k per 32GB DIMM). As others have noted, pare this down to like 128 to 256GB (I've been playing with Gemma 4 on llamacpp and prompt caching eats a lot of system RAM, so YMMV).
You list only 2x 2000w PSUs. You have 2400 watts in GPU alone. That CPU, RAM, motherboard, NICs, etc are going to eat even more. If one PSU fails, the whole system will probably crash.
Unsure why you need those 100G NICs, you're providing inference to users not training across multiple nodes in a data center. You could easily go with something significantly more affordable, like 10g.
Material-Link9151@reddit (OP)
three solid points, going to address each:
RAM: yeah, you're the third commenter to flag this and the consensus is right. trimming from 1152GB to 512GB or even less.
PSU sizing — this one you caught and I missed.
100G NICs — fair pushback, and for pure inference delivery 10G is plenty. reason I want the bandwidth is model weight loading (140GB of weights = ~14 seconds on 100G vs ~2 minutes on 10G) and future multi-node expansion when departments saturate. probably the right compromise is 25G — good middle ground, half the cost of 100G, 2.5× the throughput of 10G. will price all three.
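the back-of-envelope behind those numbers (line rate only, real transfers will be slower):

```python
# weight-load time at line rate; protocol overhead will add to these
def load_seconds(size_gb: float, link_gbps: float) -> float:
    return size_gb * 8 / link_gbps

for link in (10, 25, 100):
    print(f"{link}G: 140GB in ~{load_seconds(140, link):.0f}s")
# -> 10G: ~112s, 25G: ~45s, 100G: ~11s
```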
appreciate the directness.
neoescape@reddit
Where are you getting $10K a card from in Dubai? These items are export-restricted and require export permits from most countries, which makes them difficult to source.
Also what is your RAM pricing per module?
Material-Link9151@reddit (OP)
This is where I'm stuck; I am trying to find a source. If you have any suggestions, I would appreciate it.
Regarding RAM, it seems I made a mistake on pricing, and it's overkill anyway.
neoescape@reddit
No worries, I've shot over a chat request!
MelodicRecognition7@reddit
check viperatech.com
Ambitious-Profit855@reddit
I've never heard of that 3× VRAM requirement and don't know what you'd need it for. I would scrap the whole "server" part and use a plain Ryzen CPU and mainboard with enough PCIe lanes. 64-128GB of RAM. A single GbE connection is enough for inference. A few TB of SSD storage.
With the saved money build another one of them for testing and training.
Material-Link9151@reddit (OP)
thanks for the pushback, legit appreciate it. for a single-user hobbyist box I'd 100% agree with you: Ryzen + 128GB + a few TB of SSD is the move.
the constraints shift when it's 24/7 production with 15-20 concurrent users though. u/john0201 already hit the main blocker; consumer Ryzen AM5 caps at ~24 PCIe 5.0 lanes from the CPU, so 4 GPUs is DOA. Threadripper has the lanes but ends up costing about the same as EPYC Genoa once you price it out, so no real win.
on the other pieces: the "build a second one" idea is actually something I considered, just couldn't justify the ops overhead of maintaining two stacks with the tiny IT team I have. if we were a full AI research shop, different answer.
Ambitious-Profit855@reddit
These sound like very different requirements to me. My initial assumption was that you want as many tokens per second per USD as possible, and $90K for <$30K of GPUs sounds like there is room for improvement. If it needs to double as a database and storage server, it's different.
But: Managing two servers is imo significantly easier than one. New version? You can test it. Finetuning? Just try it. If it takes 5 hours longer, no problem. With one server if it overruns your weekend timeslot (which you have to monitor, so.. have fun with that), you are impacting productive workloads, rescheduling etc. Prod system breaks? Reuse Dev until the part is replaced.
About rack-mounted HW: the default server HW is LOUD. Needs-a-dedicated-room-with-a-massive-door loud. You mentioned 4U, so there should be enough height, but I've never looked at silent rack-mount EPYC coolers. The 2kW of heat depends on your office's AC. Central, or does each room have its own unit? This is one additional single-room AC unit's worth of heat.
FreezeS@reddit
So you want 24/7 with a single server?
Material-Link9151@reddit (OP)
Yes, and open to learning.
FreezeS@reddit
If 24/7 is important, you will need at least 2 fully redundant servers.
john0201@reddit
That would limit them to 1xGPU
Ambitious-Profit855@reddit
No.
john0201@reddit
Hard to argue with such a well thought out response. Allow me to retort:
Yes.
Ambitious-Profit855@reddit
Your response included zero reasoning or source. I'm personally running two GPUs on a standard AM4 Mainboard. Localllama has many build reports similar to this.
So, please let me know: why do you claim an AM5 motherboard would limit him to single GPU?
Quad GPU might run into PCIe Lane issues though (without using hacks). So Threadripper would be necessary (a model with enough PCIe lanes).
john0201@reddit
You need a source to know you can't run two PCIe 5.0 x16 cards on AM4/5? And two memory channels would be a bad choice for $20,000 in GPUs.