My New AI build - please be kind!
Posted by Ell2509@reddit | LocalLLaMA | 51 comments
This is my new AI machine!
Lian Li Lancool 217 case with 2 large (170 x 30mm) front intake fans, 3 bottom (120mm) intake fans, 1 rear (120mm) exhaust fan plus the 2 GPU exhausts at the back, and 3 ceiling (120mm) exhaust fans. 3 of those fans I added to what came with the case as standard; those were Arctic P12 Pro fans.
Thermalright Assassin CPU cooler.
ASUS ROG Strix B550-A mobo, which is somehow negotiating x16 PCIe lanes on both slots simultaneously. That isn't in the spec sheet, but it is happening for sure.
5800X processor. Not the 3D version, but that isn't hugely consequential for my use case.
128GB DDR4 3200 running at 2666 MT/s CL18 (snappy for model weight overflow).
32GB Radeon Pro W6800
32GB Radeon AI Pro R9700
1 old mechanical 2TB spinning disk drive.
Main boot drive is a 2TB basic SSD. Snappy enough.
Another 1TB SSD mounted.
Corsair RM850e PSU
------
This was for local AI on a budget. I also needed to upgrade several existing pieces of hardware (adding RAM and SSDs), so I opted for an AM4 build for the desktop. My laptops are AM5, AM4, and an old Intel notebook upgraded with 32GB DDR4 for CPU inference. So when I want to game I use the AM5 lappy. Won't discuss such heresy any further in this sacred sub.
I have undervolted the R9700 to 260W, down from its standard 300W, because of that 12V connector issue. I have been monitoring temps carefully and it seems fine, with little to no performance reduction. Even when I allowed it, it rarely drew the full 300W.
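For anyone wanting to do the same, the cap can be set from rocm-smi on Linux. A rough sketch of the idea (flags vary between ROCm versions, so check rocm-smi --help first; the 260 is just my value):

```bash
# cap GPU 0 at 260W (needs root; the card must expose power overdrive control)
sudo rocm-smi --device 0 --setpoweroverdrive 260

# keep an eye on temps and draw while testing the new limit
watch -n 2 rocm-smi --showtemp --showpower
```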
I apologise to the PC Master Race overlords for my poor cable management.
Lastly, this is not its final home. I move apartment soon and will then have it all set up on a desk, in a space with proper airflow.
OK, fingers crossed this goes nicely and you guys don't sh*t all over my lovely build. I am not a pro, so it was tough! And financially stressful!
Thanks :)
Zyj@reddit
Your mainboard is suboptimal, it has:
1x PCIe 4.0 x16, 1x PCIe 3.0 x16 (x4), 3x PCIe 3.0 x1
So unlike good desktop mainboards that can run two slots at PCIe 4.0 x8 each, your second slot will run 4x slower than that.
Ell2509@reddit (OP)
The RAM I bought was slower. It fit my budget; 3200 didn't.
I did run that during setup. As I say, somehow it negotiates x16 for both GPUs. I have no idea how or why.
Zyj@reddit
Well, I can assure you it's physically impossible. So you are mistaken.
Ell2509@reddit (OP)
We don't know what is going on with my specific board. Maybe the factory line was out of one type of component, or the machines needed to follow a different plan, or the base board is actually from a different variant that was used during manufacturing out of necessity born of shortage, without ever being mentioned.
Supply chains are disrupted.
We don't know.
I agree, it should be impossible, but the PC I am sat next to right now is doing it.
Zyj@reddit
Paste the command I put in my other comment and show the output.
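The usual way to check what each slot actually negotiated on Linux (not necessarily the exact command referenced above) is something like:

```bash
# LnkCap = what the slot/card supports; LnkSta = what was actually negotiated,
# e.g. "Speed 16GT/s, Width x16" vs "Speed 8GT/s, Width x4"
sudo lspci -vv | grep -E "VGA|Display|LnkCap:|LnkSta:"
# note: many GPUs train down to a narrower/slower link at idle,
# so re-check under load before trusting the reading
```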
Ell2509@reddit (OP)
You have any more thoughts?
I am genuinely stumped as to why. I don't want to break what is working well, either.
FullstackSensei@reddit
What will the extra PCIe speed do here exactly? OP is running hybrid inference anyway.
I have four 3090s, each with x16 Gen 4, and the max I see in hybrid inference of Qwen 3.5 397B Q4_K_XL is a few hundred MB/s.
The impact of memory being 2666 vs 3200 is also much less than you think. Maybe a couple of tokens/s at best.
metmelo@reddit
Idk why people fixate on PCIe speed so much. You could run at x1 PCIe speeds and t/s would barely drop.
FullstackSensei@reddit
With hybrid inference or splitting by layer, absolutely. Even with tensor parallelism, x8 Gen 3 is absolutely fine for a couple of GPUs.
It's not theory: I have rigs with 8, 6 and 4 GPUs, and they give x8 Gen 3, x16 Gen 3 and x16 Gen 4, respectively, to each GPU. I can actually see the bandwidth numbers on each GPU during inference.
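Anyone can watch it live on NVIDIA cards; dmon streams the PCIe counters:

```bash
# -s t selects the PCIe throughput columns (rxpci/txpci, MB/s per GPU)
nvidia-smi dmon -s t
```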
Xp_12@reddit
Yep. I've gotten in a few arguments about it. My dual 5060 Tis do TP in vLLM just fine on PCIe 4 x8/x1, and pasting a simple benchmark makes them go quiet when I'm hitting thousands on prefill.
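The launch is nothing exotic either; roughly this (model name is just an example):

```bash
# tensor parallelism across both cards
vllm serve Qwen/Qwen2.5-7B-Instruct --tensor-parallel-size 2
```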
FullstackSensei@reddit
Ignoring cost is the thing that irks me the most. It's like everyone here has industrial money printing operations in their basements.
Just get four 512GB Mac Studios, or 8 RTX 6000 Pros, as if you can get them for free with Uber Eats vouchers.
My entire homelab, with 18 GPUs across three machines, costs less than a single Blackwell RTX 6000 Pro. I can run two instances of Minimax at Q8 at 12-13 t/s, or two instances at Q4 at 30 t/s *in parallel*, on a single machine that cost me less than what 128GB of DDR5 costs today. When DS4 flash support is merged in llama.cpp, I'll be able to run two instances of that at probably higher t/s. That machine alone lets me work on two large tasks in parallel, completely unattended, for at least an hour at a time! Oh, and PP on that machine is still faster than on the M3 Ultra!
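Running two instances in parallel is just pinning each server to its own GPUs, roughly like this (paths, ports and device indices are illustrative):

```bash
# instance 1 on the first four GPUs
CUDA_VISIBLE_DEVICES=0,1,2,3 ./llama-server -m minimax-q4.gguf --port 8080 &
# instance 2 on the other four
CUDA_VISIBLE_DEVICES=4,5,6,7 ./llama-server -m minimax-q4.gguf --port 8081 &
```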
metmelo@reddit
Awesome! What's your setup like? I run four 32GB MI50s but was wondering if I should've gone for the SXM2 V100s at a similar price.
FullstackSensei@reddit
The MI50 is definitely not worth current prices. I got mine at $140, and at that they're great value. But at $500, it's a horrible deal. It has 20% more compute than the P40, and while the P40 has only 24GB VRAM, you can get about two P40s for the price of a single MI50. I have both, and the P40 is still really good, especially when paired with ik_llama.cpp or using the new -sm tensor in vanilla llama.cpp. Just make sure to give each card x4 lanes, or even better x8.
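Going by that flag, the invocation looks roughly like this (check --help on your build for the accepted split modes):

```bash
# split tensors across all visible GPUs instead of splitting by layer
./llama-server -m model-q4.gguf -ngl 99 -sm tensor
```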
metmelo@reddit
Idk man, I think my MI50's PP speed is too slow with dense models, or even something like Minimax with 32B active params. How is it with P40s?
FullstackSensei@reddit
P40s are slightly better using ik or -sm tensor.
The MI50's weak point is PP, but honestly I don't care. I got them very cheap, and with six in one machine I can run Minimax Q4_K_XL fully in VRAM, or use three GPUs plus the CPUs (the machine is a dual 24-core Xeon) to load an instance of Qwen 3.5 397B at Q4 or Minimax at Q8, at about 40% of the speed vs full VRAM.
The advantage with such large models is that I can run quite big, long-running tasks unattended. Even with two Qwen 397B instances running in parallel, power at the wall is ~600W.
Zyj@reddit
It's not like mainboards supporting 2x x8 are 300 bucks more expensive, what are you talking about?
Xp_12@reddit
Please show me an AM5 PCIe 5 x8/x8 board that's less than 300. For people trying to get in cheap, that can be ~200 more bucks.
lolwutdo@reddit
It matters a ton if you're offloading to CPU at all; if the model fits entirely in both GPUs, then not so much.
Ell2509@reddit (OP)
I have found that CL18 on DDR4 vs CL36 on DDR5 seems to make a positive difference. Maybe it's in my head, but it feels snappier when I am overflowing to CPU on the AM4 desktop than on the AM5 laptop.
FullstackSensei@reddit
It's not exactly hard to measure. You should also account for cpu core count and core capabilities.
I doubt timings make a difference vs bandwidth. LLM memory access is very linear and very easy for the memory controller to prefetch, even on 20-year-old CPUs.
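Back-of-envelope, assuming dual channel: DDR4-2666 moves about 2 × 8 B × 2666 MT/s ≈ 42.7 GB/s, and DDR4-3200 about 51.2 GB/s. If, say, 15 GB of weights sit in system RAM, the CPU side tops out around 42.7 / 15 ≈ 2.8 t/s vs 51.2 / 15 ≈ 3.4 t/s — well under the couple of tokens/s I called the ceiling above.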
Ell2509@reddit (OP)
Encouraging to hear about older hardware. I have a 12-year-old DDR4 laptop that I haven't put through its paces yet.
MotokoAGI@reddit
cool, now go run some models and do something cool
Ell2509@reddit (OP)
I do! My something cool is multifold:
1) building the offline system itself: schemas, all code, backends, tools, skills, personality stubs, etc.
2) I use it as a deployable framework for certain types of entity, helping them set up locally.
3) consulting, training, education courses, and a lot more.
4) using it to build several industry-specific applications which are at various stages of development.
5) as an experiment (I have 20 or so models across 4 devices).
6) as a home-lab-based personal assistant for my own life. I suffer from rather full-on ADHD and need help. AI has been a godsend at giving me structure to get out there and be myself without the really disabling executive function gap.
Vaguswarrior@reddit
Are you me?
Ell2509@reddit (OP)
As you have done to the least of my brothers, you have done unto me. So, in theory, we are all different aspects of the same person!
Vaguswarrior@reddit
Ah, yes. Francis Bacon.
Ell2509@reddit (OP)
The Godfather of crispy, smoky AI.
FullstackSensei@reddit
Really nice build! Good call on using a pair of workstation cards with blower fans!
What are your numbers with minimax Q4?
Ell2509@reddit (OP)
So I just did a big benchmarking task for Minimax. Heavy multi-step problem. I am not posting the answer analysis because of how long it is and I am on my phone. Will do separate posts with full benchmarking deets another day...
But quick stats:
Ell2509@reddit (OP)
I am just out at the moment, but I will get back to you later on when I am home. I can't remember exactly. PP is not super quick, but also not terrible. Token gen is unreal. Again, will come back later with some numbers for you.
Thanks for showing interest.
HopePupal@reddit
oh hey, nice. i'm also using an R9700 with an AM4 CPU (5900XT) but with much less RAM.
mind sharing numbers next time you have MiniMax loaded? i'd love to know whether having a real GPU (or two) can make up for the much smaller memory bandwidth of an AM4 CPU vs. my other box (a Strix Halo).
brosvision@reddit
Does your Gigabyte R9700 have a strange fan buzz at idle? Mine has it and it is so distracting that I will RMA it.
Ell2509@reddit (OP)
It might. Honestly, the case fans drown out anything except the R9700 on full blast.
Haeppchen2010@reddit
Nice, good to see a fellow mixed AMD dual-GPU setup!
As you mention ROCm: if you just do inference, give Vulkan a chance too... For inference it often seems faster than ROCm (at least it was for me on the RX 7800 XT).
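Trying it is cheap; a minimal build sketch, assuming the Vulkan drivers/SDK are already installed:

```bash
git clone https://github.com/ggml-org/llama.cpp
cd llama.cpp
# build with the Vulkan backend instead of ROCm
cmake -B build -DGGML_VULKAN=ON
cmake --build build --config Release -j
# quick comparison run (model path is yours)
./build/bin/llama-bench -m /path/to/model.gguf
```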
Dangerous_Bad6891@reddit
Very impressive!
Looking forward to some stats for models like:
gemma-4-31B
qwen 3.6 27B
GLM 4.7 flash
ComfyUI basic workflow stats for:
Flux 2 Klein, Z-Image Turbo, Qwen Image Edit 2511, LTX-2 basic workflow
and your feedback on how simple it is to set it all up and run.
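For the LLM ones, llama-bench output would be perfect — something like this (the filename is just a guess at a quant):

```bash
# pp512/tg128 numbers, directly comparable across rigs
./build/bin/llama-bench -m glm-4.7-flash-Q4_K_M.gguf -p 512 -n 128
```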
Polaris_debi5@reddit
Awesome build! Those Radeon Pros are a beastly choice for VRAM-heavy models.
Since you mentioned having some issues with gated delta net attention and needing very specific llama.cpp builds for ROCm, have you considered giving the Vulkan backend a shot?
Recent benchmarks (especially with Mesa 26+) have shown that Vulkan is actually outperforming ROCm in PP benchmarks for multi-GPU AMD setups, and it's much more forgiving with mismatched architectures like your RDNA 2 + RDNA 4 combo. It might solve that re-processing lag you're seeing without the ROCm dependency headache. Might be worth a quick cmake -DGGML_VULKAN=ON just to compare!
Ell2509@reddit (OP)
Thanks for the push. It is on my radar, like vLLM, but I haven't ventured there yet. Encouraging that it might be better for mixed architectures. Tbh, I am on the edge of selling the 6800 and getting another 9700.
SweetHomeAbalama0@reddit
Good stuff! 64GB of VRAM can open a lot of doors.
Do you typically do GPU/CPU offloading or do you sometimes do pure GPU inference? Curious what speeds those two cards get together.
Also, how was your experience with ROCm, easy enough to get it all working? I've not (yet) touched AMD for inferencing but I've heard a lot of mixed reviews with the software/driver setup.
Ell2509@reddit (OP)
I have a range of model sizes even on this one device, with ollama and/or llama.cpp to run them. Some are single-GPU, some use both (like the example one). All are really quick! I am away from home now so do not have numbers in front of me, but I can tell you later if you like!
ROCm on Linux is mostly fine. A few small issues, and one big problem that only seems to affect Qwen 3.6 27B dense.
I went into it expecting major problems, but so far it has been all good.
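For anyone curious how the single- vs dual-GPU runs are picked, it's roughly this (device indices and model names are illustrative; on AMD the selector env var is HIP_VISIBLE_DEVICES):

```bash
# small model pinned to just one card
HIP_VISIBLE_DEVICES=0 ./llama-server -m small-model.gguf -ngl 99
# big model split evenly across both cards
./llama-server -m big-model.gguf -ngl 99 -ts 1,1
```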
jacek2023@reddit
I think an open frame is the only valid choice with multi-GPU because of the airflow. I don't use any additional fans in my setup except on the CPU.
JohnBooty@reddit
What’s the benefit of caseless vs. cased?
Assuming your goal isn’t to just point a big-ass floor fan at the rig or something. (Which is not necessarily a bad thing to do. I mean really, that’s kind of the ultimate)
If you have multiple 160mm+ intake fans and multiple 160mm+ exhaust fans, the case should not be a limiting factor. Caseless gets you passive convection cooling, but that’s negligible.
jacek2023@reddit
I can run 100B models in total silence
Ell2509@reddit (OP)
Oh really? I usually keep the glass side panel on it (part of the case). I just removed it to take the photo because I have no interior lights.
With the 2 large fans, 7x 120mm fans, plus the 2 GPU blowers, all correctly oriented, temps are staying damned cool. It sounds like an actual jet engine though.
BinaryNebula110@reddit
other than blower type cards, right?
jacek2023@reddit
I have the fans on the 3090s, but no separate fans.
FullstackSensei@reddit
You can absolutely have multiple GPUs inside a case, even air-cooled. Just plan for it properly: use a larger case, plenty of fans, and a motherboard that provides enough spacing between cards to let them breathe. And that's just with air.
If you go with watercooling, you can really go crazy. I have eight P40s in a single tower case and another quad-3090 system inside an O11D (not the XL). Both are pretty quiet and run quite cool.
mrmontanasagrada@reddit
Very nice man! Enjoy them tokens
Ell2509@reddit (OP)
Thank you :)
It has been a super steep learning curve so far. Loving it though :)
taking_bullet@reddit
Impressive. Very nice.
Ell2509@reddit (OP)
Thank you :)