Server build for local inference. 128 GB 3200 MHz or 256 GB 2133 MHz RAM?
Posted by PreparationTrue9138@reddit | LocalLLaMA | 28 comments
Hi, I am building a server so that my dual rtx 3090 setup runs at full speed.
- asrock romed8 t2 revision 1.3
- epyc 7642
- ddr4 128 gb 3200 or 256 gb 2133 (256 gb is a bit cheaper) 8 channel
- dual rtx 3090
- gigabyte psu 1600 w
What do you think? Is using RAM for MoE models worth it, for something like Qwen 3.5 397B? And should I go for the fastest RAM or for more RAM?
amberdrake@reddit
Definitely faster
ttkciar@reddit
I use large MoE models on my ancient 256GB DDR-2133 Xeons using pure-CPU inference. It's slow as hell but IMO getting high-quality responses is worth the wait, especially when I can be working on other things (or sleeping) while it's inferring.
On my rig, using llama.cpp and models quantized to Q4_K_M, at short context, I get about 3.5 tokens/second from GLM-4.5-Air (106B-A12B), 0.9 tokens/second from K2-V2-Instruct (72B dense), and 0.5 tokens/second from Mistral Medium 3.5.
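As a rough sanity check on figures like these, decode speed on CPU can be bounded by assuming every active weight is read from RAM once per generated token. The sketch below is only that back-of-envelope bound. The 4.8 bits per weight for Q4_K_M and the 60 GB/s bandwidth are illustrative assumptions, not measurements from this rig, and `decode_tokens_per_second` is a hypothetical helper name.

```python
def decode_tokens_per_second(active_params_b, bits_per_weight, bandwidth_gb_s):
    """Bandwidth-bound ceiling on decode speed (tokens/second).

    Assumes each active parameter is read from RAM exactly once per token
    and ignores compute, cache effects, and KV-cache traffic.
    """
    bytes_per_token = active_params_b * 1e9 * bits_per_weight / 8
    return bandwidth_gb_s * 1e9 / bytes_per_token


Q4_K_M_BITS = 4.8      # assumption: approximate average bits per weight
BANDWIDTH_GB_S = 60.0  # assumption: hypothetical sustained RAM bandwidth

# Active parameter counts are the ones quoted in the comment above.
for name, active_b in [("GLM-4.5-Air (12B active)", 12), ("72B dense", 72)]:
    rate = decode_tokens_per_second(active_b, Q4_K_M_BITS, BANDWIDTH_GB_S)
    print(f"{name}: at most {rate:.1f} tokens/second")
```

Real throughput lands below this ceiling, but the ratio shows why a 12B-active MoE is several times faster than a 72B dense model on the same memory.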
ai-christianson@reddit
the 256gb is worth it for the agent workloads. you can queue heavy context tasks to run overnight on the cpu while you sleep. it turns the slow inference into a batch processing pipeline where capacity matters more than latency.
ttkciar@reddit
It works really nicely for non-agentic codegen, too.
GLM-4.5-Air doesn't work in agentic harnesses because its tool-calling is extremely weak, but it makes up for it with exemplary instruction-following competence.
What I do is prompt it with an extensive plan / project specification with literally dozens of instructions, and it will reliably follow all of those instructions. It doesn't stop inferring until everything in the plan is completed.
I've tried every new 120B-class model, but they all follow some instructions while ignoring others, so GLM-4.5-Air is still my go-to.
Of course it takes hours to infer a complete project, on my rig, but that just means I can work on other things while it chomps away on its task.
Financial-Most5372@reddit
Please correct me if I'm wrong, but isn't that like extremely expensive per answer in power costs?
crantob@reddit
That depends on whether you're getting 40c/kwh power from the government green-cartel or something reasonable.
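The power cost of one long CPU run is easy to estimate: energy in kWh times the price per kWh. This is a minimal sketch, assuming a constant 300 W draw and an 8-hour overnight job, both of which are made-up illustrative values. The 0.40 price matches the 40c/kWh figure mentioned above, and the 0.10 price is a hypothetical cheaper tariff.

```python
def energy_cost(power_watts, hours, price_per_kwh):
    """Cost of running a machine at a constant power draw for a given time."""
    kwh = power_watts / 1000 * hours
    return kwh * price_per_kwh


DRAW_WATTS = 300  # assumption: average wall power during CPU inference
JOB_HOURS = 8     # assumption: one overnight run

for price in (0.40, 0.10):
    cost = energy_cost(DRAW_WATTS, JOB_HOURS, price)
    print(f"{price:.2f} per kWh -> {cost:.2f} per overnight job")
```

Swap in a measured wall-power reading and your own tariff to get a number that applies to your rig.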
ambient_temp_xeno@reddit
256gb. It will let you run decent sized moe models and the ram speed won't make any difference to whatever you're only putting on the 3090s.
1337Captain@reddit
256GB isn't enough for this kind of setup
SagaciousFool@reddit
It's enough for qwen3.5 397b iq4, mimo v2.5 iq4, minimax 2.7 iq4 etc. No, you cannot run the T parameter models. But it can run some good models.
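Whether a quantized model fits comes down to parameter count times bits per weight, compared against RAM plus VRAM. The sketch below assumes roughly 4.25 bits per weight for an IQ4-class quant and reserves a hypothetical 16 GB for the OS, KV cache, and buffers. Both numbers are assumptions, and `model_size_gb` and `fits` are illustrative helper names.

```python
def model_size_gb(total_params_b, bits_per_weight):
    """Approximate weight size in GB (billions of params * bytes per param)."""
    return total_params_b * bits_per_weight / 8


def fits(total_params_b, bits_per_weight, ram_gb, vram_gb, overhead_gb=16.0):
    """True if the weights plus a fixed overhead fit in combined RAM and VRAM."""
    needed = model_size_gb(total_params_b, bits_per_weight) + overhead_gb
    return needed <= ram_gb + vram_gb


IQ4_BITS = 4.25  # assumption: approximate bits per weight for an IQ4 quant
VRAM_GB = 48     # dual RTX 3090

for params_b in (397, 1000):
    for ram_gb in (128, 256):
        size = model_size_gb(params_b, IQ4_BITS)
        verdict = "fits" if fits(params_b, IQ4_BITS, ram_gb, VRAM_GB) else "does not fit"
        print(f"{params_b}B at ~{size:.0f} GB with {ram_gb} GB RAM: {verdict}")
```

Under these assumptions a 397B model needs about 211 GB of weights, so it fits with 256 GB of RAM but not with 128 GB, and a 1T-parameter model fits in neither.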
1337Captain@reddit
Won't it be sluggish? Unless you mean only for writing LinkedIn posts
ProfessionalSpend589@reddit
I’ve read that for planning it’s ok.
I’ve tried to use it for agentic work and chat. For chat it was good, but I got annoyed when it couldn’t make a change that a smaller dense model managed to do.
I blame the quant, so now I’m trying to go down in total weights and go up a quant (Q5) with MiMo V2.5. I’m testing a Q4 variant of MiMo V2.5 now, but I wasn’t satisfied when I asked it to make an SVG diagram of the proof of the Pythagorean theorem and embed it in an HTML file. I’ll see if the Q5 does it a bit better.
kivaougu@reddit
What sampling params have you been trying? Mimo v2.5 seems to be very sensitive to temp and top-p. Top-k should be disabled. I have had some mixed results with the original fp8 weights too
ProfessionalSpend589@reddit
Defaults, because I was swapping models via the router functionality of llama-server.
Can you recommend some guide which mentions what configuration is best?
No-Refrigerator-1672@reddit
It will. Any CPU offload is sluggish. However, there is a significant number of people who claim that waiting half a day for a task to finish with a 400B model on CPU is better than running <100B all in VRAM. I've been in this kind of discussion far too many times.
Antoniethebandit@reddit
Pointless. As others stated, running on system RAM will be slow.
ttkciar@reddit
It's extremely slow, but sometimes it's worth waiting for a quality response.
Back when Llama-3 was the shizzle, I would infer with Llama-3-Tulu-405B from system RAM, and it would literally run overnight. But I was sleeping, so why not?
I'd test my prompts with Llama-3-Tulu-70B until it was right, then pass the well-working prompt to the 405B and go to bed.
That was my go-to for my hardest physics tasks until it was finally surpassed by GLM-4.5-Air, which is several times faster.
MLDataScientist@reddit
Get 256gb ram (8x32gb). I have the same motherboard with 256gb ram (8x32gb) 3200 MHz. CPU 7532. GPU: one 5090. Qwen3.5 397b Q4_k_m runs at 20 t/s with 700 t/s PP. You want more cores with your CPU. Mine has 32 cores and I get 150GB/s RAM bandwidth. I bought this entire setup for $3.2k ($2.2k for the GPU at Best Buy and $1k for CPU+mobo+RAM on eBay) before the RAM crisis.
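Prompt-processing and generation speeds like these translate into a per-turn wait of prefill time plus decode time. Here is a minimal sketch using the 700 t/s and 20 t/s figures from the comment above. The 10,000-token prompt and 1,000-token reply are hypothetical sizes, and `turn_seconds` is an illustrative helper name.

```python
def turn_seconds(prompt_tokens, output_tokens, prefill_tps, decode_tps):
    """Estimated wall time for one turn: prompt processing plus generation."""
    return prompt_tokens / prefill_tps + output_tokens / decode_tps


PREFILL_TPS = 700  # prompt processing speed quoted above
DECODE_TPS = 20    # generation speed quoted above

# Hypothetical turn: a 10,000-token prompt and a 1,000-token reply.
seconds = turn_seconds(10_000, 1_000, PREFILL_TPS, DECODE_TPS)
print(f"about {seconds:.0f} seconds per turn")
```

This ignores prompt caching, which can remove most of the prefill cost on follow-up turns that share a prefix.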
segmond@reddit
The 3200 MHz RAM will be 1.5x faster than the 2133 MHz RAM. If you have money to upgrade in the future, then get the 3200 MHz and upgrade later. If you don't think you can afford to upgrade anytime in the near future, then get the 256 GB.
reto-wyss@reddit
I don't recommend that build, offloading will be slow.
I don't know your market, but you can likely get a W7800 (48gb) for significantly less than the cost of that build. Stick it into your dock - done.
Run Qwen3.5-27b for agentic coding and Gemma-3-31b for everything else (8bit + MTP). Will be fast, will have warranty, use a lot less power, and takes up little space.
If you want more and as a server, get some AM5 board with x8x8, cheapest DDR5 you can find, and stick two R9700 in there - or keep the 3090, and use those.
Fabulous_Fact_606@reddit
256+ GB of system RAM will get you what, 1-10 t/s? 30 minutes to 1 hour response time per turn? Go headless. Skip the RAM. All you need is 16gb of RAM. The 3090x2 48gb vram will fit Qwen3.6-27B 8bit-mtp at 50-70 t/s with ~150k-200k ctx. Still slow when you send 100k tokens for 10-20k tokens output. You're looking at about 60s-200s response time per turn. Very usable.
tl;dr: set server to headless, wifi llm API tunnel to internet, stash that heater in the garage or outside. Use your laptop or desktop to access the API anywhere.
sockusminimus@reddit
Overclock the RAM on that board. You should be able to push to 2666 or even 2933 if lucky. Just monitor your CEs and UEs (correctable and uncorrectable ECC errors) over time. If you start to see any, back it off one bin.
I have this same board and have been running 1TB of PC4-2400 at 2933 for months without issue.
Annual_Award1260@reddit
I have an X10DRi system with 1TB of DDR4 RAM running at 2400. I can run large models like Kimi K2 at a blistering 1.8 tok/s
Glittering_Painting8@reddit
I think people are underestimating how much better an 8-channel EPYC platform is than a TB3/Oculink eGPU setup. Even with slower DDR4, removing external GPU bottlenecks + getting full PCIe bandwidth should help a lot.
If your goal is MoE experimentation, I’d still lean 256 GB. Capacity determines whether the model runs at all; bandwidth mainly affects how painful it is.
Annual_Award1260@reddit
DDR4 is painfully slow. Go for unified memory or proper gpu
PreparationTrue9138@reddit (OP)
I thought 8-channel would be close to Strix Halo, though 2133 is about two times slower. But the server platform will allow tensor parallelism, as far as I understand.
XccesSv2@reddit
Running inference from RAM won't make you happy. Or how much speed are you expecting?
PreparationTrue9138@reddit (OP)
Well, currently I have a laptop with two eGPUs (OCuLink + Thunderbolt 3) and 64 GB of dual-channel DDR4-2933.
I tried to run Qwen 397B at 1 bit (107 GB). It was running at 7 t/s.
So my amateur math is that running the same quant on 8 channel ram (3-4x bandwidth) and with no bottlenecks for GPUs will double or triple speed at least.
Though my plan is to use q3 at least, so it will be slower.
Also people say ik_llama is best for moe models on cpu/gpu mixed setups so I expect some boost there too.
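The bandwidth side of this estimate can be checked with the usual theoretical peak for DDR memory: channels times transfer rate times 8 bytes per transfer (a 64-bit channel). This is a sketch of that formula only. Sustained bandwidth is lower in practice, as the 150 GB/s measurement quoted elsewhere in the thread for 8-channel 3200 shows, and `peak_bandwidth_gb_s` is an illustrative helper name.

```python
def peak_bandwidth_gb_s(channels, mega_transfers_per_s, bus_bytes=8):
    """Theoretical peak DDR bandwidth in GB/s for 64-bit (8-byte) channels."""
    return channels * mega_transfers_per_s * bus_bytes / 1000


laptop = peak_bandwidth_gb_s(2, 2933)       # current dual-channel laptop
server_fast = peak_bandwidth_gb_s(8, 3200)  # 128 GB option
server_slow = peak_bandwidth_gb_s(8, 2133)  # 256 GB option

print(f"laptop 2ch DDR4-2933: {laptop:.1f} GB/s")
print(f"server 8ch DDR4-3200: {server_fast:.1f} GB/s ({server_fast / laptop:.1f}x)")
print(f"server 8ch DDR4-2133: {server_slow:.1f} GB/s ({server_slow / laptop:.1f}x)")
```

On paper that is about 4.4x the laptop's bandwidth at 3200 and about 2.9x at 2133, so the "3-4x" figure holds roughly for the faster kit and is optimistic for the slower one.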
woolcoxm@reddit
Get the 256; it will let you run decent-sized MoE models. They might not perform the best, but they will be sufficient. (2133 is slow.)