Strix Halo or GPUs?
Posted by undernightcore@reddit | LocalLLaMA | 40 comments
I want to build my own AI server. I already have multiple servers at home, but none of them has a GPU or is powerful enough to host 4B+ models.
I'd like to be able to host dense 27-30B parameter models, or some MoE with 3B active parameters.
Let's say I could spend about 2k. What would be the best route, and what token speeds should I expect?
ImportancePitiful795@reddit
Considering today's costs, especially for 128GB of RAM, which is $2000 on its own(!), I would propose the following.
a) Buy a Strix Halo 128GB, the cheapest version possible. If I'm right, that's still the Bosgame M5 (around $2000-2200). DO NOT buy the ones with a price tag over $3000. Makes no sense.
b) Hook an R9700 up to the Strix Halo later on. Or a W7800/7900 48GB if you can find one cheaply.
Profit. 😀
darktotheknight@reddit
I wish the price gap wasn't that big. It's really difficult to justify paying nearly double the price (\~3000 - 4000€) just to have the HP or Corsair logo on the front, but at the same time, Chinese no-name brands like Minisforum, Beelink, GMKtec or BOSGAME make it really hard to trust them. The usual 1-year warranty doesn't help either (at least Beelink offers 3 years). They have no office in my country, so whenever there is a warranty claim/repair, it means serious trouble. Just looking at the Amazon reviews, these devices *can* drop dead just like that, and then you're literally fucked. You can't swap or sell the RAM, CPU or mainboard; it's one big, expensive unit turning into a paperweight. So at least try to buy from a reputable seller, depending on the consumer rights in your country.
The Zotac ZBOX Magnus EAMAX395C seems to offer a 3-year warranty + 2 extra years after registration within 30 days of purchase. It's more expensive than the usual offerings from BOSGAME or GMKtec, but not as expensive as HP/Corsair. Zotac (or PC Partner Group) is also a reputable company with many years of experience and history; I'd have no worries about support. I just don't know how the unit itself compares to other boxes, as I couldn't find any reviews of it.
mfarmemo@reddit
I spent the extra on a Framework Desktop and it bricked suddenly. Had a new mainboard within 2 weeks, at no extra cost to me.
ProfessionalSpend589@reddit
I got lucky, and my 2 units have been working fine for 6 months now.
Today I’m even testing MiMo V2.5 :)
I just asked it about the latest version of Go, and it says: Go 1.24 (February 11, 2025).
I'm not a Go programmer, I just like the language a bit.
blackbird2150@reddit
I bought the Corsair on sale for $2500. It works well, all things considered. The tps is slow, but it can do many things with all that RAM.
bebetterinsomething@reddit
Bosgame shows $2.6K for me in the US. Where are you getting it at $2.0K?
shaonline@reddit
You're not going to hook a dedicated GPU to the Strix Halo platform that easily; the number of available PCIe lanes on that APU is very small, and without a dedicated fast x16 port it's going to be a bit slow.
yes2matt@reddit
Profit? Teach me, wizard.
uti24@reddit
To host \~30B dense you need some GPUs, like a 3090 + 3060 or something; you can fit that into the budget. I can't recommend something obscure like older AI GPUs, since I don't have experience with them.
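For two mismatched cards like that, llama.cpp can split the weights proportionally across GPUs. A minimal sketch (the GGUF filename is a placeholder, and the split ratio is something you'd tune):

```
# -ngl 99 offloads all layers to GPU;
# --tensor-split 2,1 puts roughly 2/3 of the weights on the 3090 and 1/3 on the 3060
llama-server -m ./model-27b-Q4_K_M.gguf -ngl 99 --tensor-split 2,1 -c 8192
```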
On the other hand, you can run MoE models on something like that AMD AI MAX 395 thingie at maybe 40 t/s. But that thingie would run dense models much slower.
undernightcore@reddit (OP)
Are MoE models actually comparable to a 27B dense model? Let's say one of those Qwen 30B-A3B models vs Gemma 4 \~30B.
uti24@reddit
Yeah, they are worse, but comparable. In particular Qwen3.6 27B (dense) vs Qwen3.6 35B (MoE).
As always, even a slightly smarter model can make a big difference on a task, making a lot of slightly better decisions that add up, so...
If you have a short, simple task, you will not even notice the difference.
Look_0ver_There@reddit
As a completely anecdotal example, I was having issues with a zsh config I was trialling on a system I'm setting up: historical CLI commands were not displaying correctly when using the up-arrow. Qwen3.6-35B kept saying that it was due to my arrow-key bindings, which didn't seem right. I loaded up 27B, and it correctly identified that it was due to an option setting. Rather than compacting runs of identical consecutive commands into a single history entry, the option was removing all identical commands that appeared anywhere in the history. So even if I had just run a command, if that same command had been run, say, 500 commands ago, pushing up-arrow wouldn't bring up the command I had just run. Even after pointing this out to 35B, it tried to refute that as the cause of the issue I described (I had used the exact same prompt for both), and I essentially had to convince it that this was the root cause, after which it finally relented and said it had made a mistake.
It's just small stuff like that which adds up over time. MoE models are good, but they're not drawing upon the widest set of possibilities and can miss some of the deeper implications when analysing something.
Having said that, big MoE's do a lot better at this sort of thing. With a Strix Halo you're pretty much forced to run >100B MoE models to get the same sort of intelligence that a small, compact dense model provides, and once you get to those big models, much of the speed advantage shrinks.
Now don't get me wrong, OP. I have both a Strix Halo and GPU solutions here (3x AMD AI Pro 9700). At this point in time I think just 2x 9700's would do exactly what I want, but that may very well change tomorrow if Qwen releases a 3.6-122B model, which would likely be better than the dense 27B.
Moral is, what's best can shift from day to day as new models get released.
undernightcore@reddit (OP)
Thank you so much! What is a good price for that GPU? Just in case I come across an opportunity to buy one. Also, are these considerably faster than, let's say, 4x V100s, as someone suggested?
Look_0ver_There@reddit
The AMD AI Pro 9700's cost around $1300 each. They offer 32GB of VRAM, and while the ROCm software stack isn't as mature as nVidia's CUDA stack, you can also use Vulkan, and they do "okay". Their memory bandwidth is around 1/3rd of something like the 96GB nVidia RTX Pro 6000 Blackwell, but those are around $9000. As such, the 9700's run at about 1/3rd the speed of the Blackwells. You basically get what you pay for.
The V100's were released in 2017. They're cheap, but not exactly the fastest. Two things to be wary of with multiple GPU cards: AI inferencing has to move data from card to card, and that isn't "free", so the more cards you add, the smaller the gain each card gives. The other thing to worry about is having enough PCIe slots to drive 4 cards. It's easy to find consumer boards with 2 or 3 PCIe x16 slots, but 4-slot consumer boards don't really exist. Almost all modern 4-slot boards require registered memory DIMMs, and those are now crazy expensive.
To my mind, on your stated budget of \~$2000, you have a few options. Pick up a pair of second-hand nVidia 3090's or AMD 7900XTX cards (24GB VRAM on either) and that will get you good speed for <$2000. If you want brand new, then you're going to need to jump to \~$2600 and get a pair of AMD AI Pro 9700's, or a 128GB Strix Halo. The pair of GPU's will run dense models better, while the Strix Halo will have the memory to run MoE models at a similar speed. The GPU's will process prompts faster, though, regardless of whether the model is MoE or dense. So I would lean towards the GPU solution at this moment, but as I mentioned, that may very well change if a better-than-27B MoE model that a Strix Halo can run drops tomorrow.
Oh, there are also the new 32GB VRAM Intel AI cards that cost $1000 each, but their software stack is extremely immature, and I personally would not recommend them to someone new to the whole AI thing.
Just my experience and 2c. Just make sure to do lots of research.
def_not_jose@reddit
https://www.reddit.com/r/LocalLLaMA/s/3yyWLiNoBU
Check this test. Qwen 3.6 27B completed a non-trivial task successfully, while the MoE just couldn't, and that was my experience using it in an agentic setting on a spaghetti codebase too. A3B just can't compare with the full 27B.
Still, Strix Halo can push bigger MoE models; the results will likely be better with A10B or A17B.
undernightcore@reddit (OP)
Very interesting post. I think I'm almost fully convinced I should aim for running dense models.
waitmarks@reddit
The flip side, I would say, is that the 128GB version of Strix Halo opens you up to the larger MoEs. I can run Qwen 3.5 122B A10B at \~20 t/s on mine.
ManySugar5156@reddit
kinda, but not really 1:1. MoE can punch above its active params, but dense 27b usually feels more consistent overall.
CervezaPorFavor@reddit
I tried Gemma 31B on Strix Halo. 10 t/s.
g_rich@reddit
Try enabling speculative decoding with the Gemma 4 assistant models or the Dflash model from z-lab. Using either of these in conjunction with Gemma 4 31B should double to triple your tps.
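Untested sketch of what that looks like with llama-server (filenames are placeholders, and flag names can shift between llama.cpp versions):

```
# -md loads the small draft model that speculates tokens for the big one;
# -ngld offloads the draft's layers, --draft-max caps speculated tokens per step
llama-server -m ./gemma-4-31b-Q4_K_M.gguf -md ./gemma-4-draft-Q8_0.gguf \
    -ngl 99 -ngld 99 --draft-max 16
```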
CervezaPorFavor@reddit
Thanks. I'll try it
Awwtifishal@reddit
Qwen 3.6 27B with MTP on Strix Halo is about 20 t/s generally. I haven't tried the 35B A3B with MTP yet, but without it, it's about 50 t/s.
undernightcore@reddit (OP)
Is MTP as easy as a flag in llama.cpp? Also, are there any drawbacks to using it?
Evgeny_19@reddit
MTP requires a separate GGUF (unsloth already has updated models available for Qwen 3.6) and a separate build of llama.cpp with the MTP pull request applied. After that, yes, it will be as simple as adding a flag to llama.cpp.
Those MTP changes are not yet merged into mainline llama.cpp, but you can find precompiled binaries or containers, or you can just compile the source code yourself.
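Building against the open PR looks roughly like this (the PR number is a placeholder; look it up on the llama.cpp GitHub):

```
git clone https://github.com/ggml-org/llama.cpp
cd llama.cpp
# fetch the MTP pull request into a local branch (PR number is a placeholder)
git fetch origin pull/<PR_NUMBER>/head:mtp
git checkout mtp
cmake -B build && cmake --build build --config Release -j
```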
fyv8@reddit
I was really tempted to go with the Strix Halo and after looking at memory speed and what I'm going to be using it for, I opted for speed over memory capacity and went the RTX route. Not regretting it.
What you want may be different. But right now you can get some genuinely good performance (>130 tok/sec) at reasonable quality with the very capable Qwen 3.6 35B (MoE) using llama-server and the UD-Q4_K-XL quant on a RTX 3090 which would be in-budget for you. At your budget you might be able to snag a 4090 deal, too...same 24GB but higher memory bandwidth.
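For reference, the invocation is nothing exotic; something like this (the GGUF filename is a placeholder, context size down to taste and VRAM):

```
# all layers on the 3090; the UD-Q4_K-XL quant leaves headroom for context in 24GB
llama-server -m ./Qwen3.6-35B-A3B-UD-Q4_K_XL.gguf -ngl 99 -c 32768
```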
With the Strix Halo, you may be able to get into the 90 tok/sec range with Qwen 3.6 35B in particular (not direct experience, ymmv), and that's usable. But your dense-model tok/sec is going to be FAR worse (I've seen numbers in the 20s), because unified memory bandwidth is a fraction of dedicated VRAM bandwidth (\~256 GB/s on Strix Halo vs \~936 GB/s on a 3090). MoE sidesteps this a bit since fewer parameters are active per token, but with dense models the bottleneck is much more obvious.
Personally I would go with a 3090 or better, in terms of speed + memory specs. You're not going to be disappointed in the speed, and 24GB is enough for good quality right now. While your quality ceiling is technically higher with a Strix Halo, the tok/sec hit isn't worth the marginal difference.
FullstackSensei@reddit
Probably an unpopular opinion, but if I already had servers and a 2k budget, I'd dump it all into V100s. You can get the native PCIe version for 250-300 a pop. There's even a seller on eBay offering bundles of four for 1100. I'm sure you could negotiate the price down a bit if you offered to buy two bundles. That's eight GPUs, for a total of 128GB VRAM.
The main downside of the V100 is idle power: it idles at ~50W. But that's not an issue if you're willing to shut down those particular servers when not in use. If you absolutely need those servers on 24/7, I'd cut the purchase to four GPUs and use the rest of the money to buy a DDR3 or early-DDR4 server chassis just for those, and shut it down when not in use.
Every other option will net you a small fraction of the VRAM, and most will have worse performance. You could get modded 3080 20GB cards, but those are over 500 apiece. Strix Halo is so much slower it's not even funny.
undernightcore@reddit (OP)
Are modified V100s with 32GB worth it, or should I stick to the 16GB ones? What is the sweet spot?
FullstackSensei@reddit
Up to you. I like the stock PCIe cards because they're easier to cool. SXM2 cards run hotter, IIRC.
A single 32GB is significantly better than 2x16GB. There's always some waste when you split across GPUs, and then there's the communication overhead. Does it justify the price difference? Only you can answer that depending on your needs.
I personally prefer 4x16GB vs 2x32GB for V100s. More compute is more better IMO. I build my rigs with hybrid inference in mind and I'd much rather spend the money on extra RAM to be able to run larger MoE models on GPU+CPU than focus solely on maximizing t/s for models that fit in VRAM. But that's my personal preference. You do you.
undernightcore@reddit (OP)
Do the stock ones come with a fan, or how am I supposed to cool them?
FullstackSensei@reddit
You say you have servers, which almost always have plenty of forced front-to-back airflow. They get cooled via said airflow.
undernightcore@reddit (OP)
Is that really enough? I have a couple of R730xds, but I usually disconnect some of the fans.
reto-wyss@reddit
No, it's complete nonsense.
You'll end up with an expensive, janky, loud mess.
undernightcore@reddit (OP)
So would you rather go for the Strix Halo, or do you mean I should use newer GPUs?
undernightcore@reddit (OP)
Is DDR3 enough? Isn't system memory also important for the models to work, or does everything run 100% in VRAM? I have a couple of servers with plenty of PCIe slots available, so I might give this a try.
FullstackSensei@reddit
If you offload to CPU, yes, you want as many memory channels and as fast memory as possible. But if you only plan to run in VRAM, you only care about the PCIe lanes. For MoE models, you can also pin just the expert tensors to system RAM, as in the sketch below.
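A hedged example of that hybrid split in llama.cpp (the GGUF name is a placeholder; the regex matches the usual MoE expert tensor naming, but check your model's tensor names):

```
# -ngl 99 keeps attention/shared weights on the GPU;
# --override-tensor routes anything matching "exps" (the MoE expert FFNs) to CPU RAM
llama-server -m ./big-moe-Q4_K_M.gguf -ngl 99 --override-tensor "exps=CPU"
```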
It all depends on whether you'll stick to what you're saying now, or you'll creep into larger and larger models. If you're worried about the latter, get an LGA3647 board/server with an ES Cascade Lake (QQ89). You can fit four V100s if the board has the slots at X8 each, which is enough even for tensor parallelism on the V100. I have this very build in the pipeline (already have most of the components).
Expensive-Paint-9490@reddit
RAM is important only if you can't fit the whole model in VRAM.
Far_Suit575@reddit
I'd still go GPUs tbh. Once you hit 30B models, VRAM matters way more than CPU power. Used 3090s are prob the best value route rn.
Signal_Ad657@reddit
Neither machine nor setup can be bought for 2k. Your best bet might be a used 3090 tower.
tecneeq@reddit
Two Intel B70s. Use llama.cpp with Vulkan. You'll have enough VRAM for Qwen 3.6 27B Q6 at full context.
If your mainboard lacks PCIe slots, use NVMe adapters.
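A sketch of that route (assumes Vulkan drivers/SDK are installed; the GGUF name is a placeholder):

```
# build llama.cpp with the Vulkan backend; both cards then show up as devices
cmake -B build -DGGML_VULKAN=ON
cmake --build build --config Release -j
./build/bin/llama-server -m ./qwen3.6-27b-Q6_K.gguf -ngl 99
```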
Own_Suspect5343@reddit
For dense models, a GPU is better, I think.