Why the Strix Halo is a poor purchase for most people
Posted by NeverEnPassant@reddit | LocalLLaMA | 400 comments
I've seen a lot of posts that promote the Strix Halo as a good purchase, and I've often wondered if I should have purchased one myself. I've since learned a lot about how these models are executed. In this post I would like to share empirical measurements, explain where I think those numbers come from, and make the case that few people should be purchasing this system. I hope you find it helpful!
Model under test
- llama.cpp
- gpt-oss-120b
- One of the highest quality models that can run on mid range hardware.
- Total size for this model is ~59GB and ~57GB of that is expert layers.
Systems under test
First system:
- 128GB Strix Halo
- Quad channel LPDDR5X-8000

Second system (my system):
- Dual channel DDR5-6000 + pcie5 x16 + an rtx 5090
- An rtx 5090 with the largest context size requires about 2/3 of the experts (38GB of data) to live in system RAM.
- CUDA backend
- mmap off
- batch 4096
- ubatch 4096
Real world measurements
Here are user submitted numbers for the Strix Halo:
| test | t/s |
|---|---|
| pp4096 | 997.70 ± 0.98 |
| tg128 | 46.18 ± 0.00 |
| pp4096 @ d20000 | 364.25 ± 0.82 |
| tg128 @ d20000 | 18.16 ± 0.00 |
| pp4096 @ d48000 | 183.86 ± 0.41 |
| tg128 @ d48000 | 10.80 ± 0.00 |
What can we learn from this? Performance is acceptable only at context 0. As context grows performance drops off a cliff for both prefill and decode.
And here are numbers from my system:
| test | t/s |
|---|---|
| pp4096 | 4065.77 ± 25.95 |
| tg128 | 39.35 ± 0.05 |
| pp4096 @ d20000 | 3267.95 ± 27.74 |
| tg128 @ d20000 | 36.96 ± 0.24 |
| pp4096 @ d48000 | 2497.25 ± 66.31 |
| tg128 @ d48000 | 35.18 ± 0.62 |
Wait a second, how are the decode numbers so close at context 0? The strix Halo has memory that is 2.5x faster than my system. And why does my system have a large lead in decode at larger context sizes?
This comes down to one of the advantages of MoE models. Let's look closer at gpt-oss-120b. This model is 59 GB in size. There is roughly 0.76GB of layer data that is read for every single token. Since every token needs this data, it is kept in VRAM. Each token also needs to read 4 experts which is an additional 1.78 GB, but each token needs a potentially different set of weights. Considering we can fit 1/3 of the experts in VRAM, this brings the total split to 1.35GB in VRAM and 1.18GB in system RAM at context 0.
Now VRAM on a 5090 is much faster than both the Strix Halo unified memory and dual channel DDR5-6000. When all is said and done, doing ~53% of your reads in ultra fast VRAM and ~47% of your reads in somewhat slow system RAM gives a decode time that is roughly equal to (a touch slower than) doing all your reads in Strix Halo's moderately fast quad channel LPDDR5X-8000.
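To make the split concrete, here is a rough sketch of the arithmetic. The per-token sizes come from the post; the bandwidth figures are theoretical peaks that I am assuming for illustration (they are not measurements from either machine), and the model ignores compute, caches, and any overlap between reads.

```python
# Per-token reads at context 0, in GB (figures from the post).
SHARED_GB = 0.76              # layer data every token reads; kept in VRAM
EXPERT_GB = 1.78              # the 4 experts each token activates
VRAM_EXPERT_FRACTION = 1 / 3  # share of experts that fit in VRAM

# ASSUMPTION: theoretical peak bandwidths in GB/s, not measured values.
VRAM_BW = 1792.0   # RTX 5090
DDR5_BW = 96.0     # dual channel DDR5-6000 (2 x 8 bytes x 6000 MT/s)
STRIX_BW = 256.0   # Strix Halo, 256-bit bus at 8000 MT/s


def split_reads(shared_gb, expert_gb, vram_fraction):
    """Return (GB read from VRAM, GB read from system RAM) per token."""
    vram_gb = shared_gb + expert_gb * vram_fraction
    ram_gb = expert_gb * (1 - vram_fraction)
    return vram_gb, ram_gb


def decode_ceiling_tps(reads):
    """Upper bound on tokens/s if each (GB, GB/s) read happens one after another."""
    seconds_per_token = sum(gb / bw for gb, bw in reads)
    return 1.0 / seconds_per_token


vram_gb, ram_gb = split_reads(SHARED_GB, EXPERT_GB, VRAM_EXPERT_FRACTION)
hybrid_tps = decode_ceiling_tps([(vram_gb, VRAM_BW), (ram_gb, DDR5_BW)])
strix_tps = decode_ceiling_tps([(SHARED_GB + EXPERT_GB, STRIX_BW)])
ram_time = ram_gb / DDR5_BW
ram_share = ram_time / (vram_gb / VRAM_BW + ram_time)

print(f"VRAM {vram_gb:.2f} GB, system RAM {ram_gb:.2f} GB per token")
print(f"5090 + DDR5 ceiling: {hybrid_tps:.0f} t/s ({ram_share:.0%} of time in system RAM)")
print(f"Strix Halo ceiling:  {strix_tps:.0f} t/s")
```

Both ceilings sit well above the measured 39 and 46 t/s, so treat them as bounds only. Under these assumptions they come out in the same order as the measurements, and nearly all of the hybrid system's per-token time goes to the system RAM reads.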
But wait, what about the slowdown in decode? That's because as your context grows, decode must also read the KV cache once per layer. At 20k context, that is an extra ~4GB per token that needs to be read! Simple math (2.54 / 6.54) shows it should run 0.38x as fast as at context 0, which is almost exactly what we see in the Strix Halo table above (18.16 / 46.18 ≈ 0.39).
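The same ratio can be checked in a few lines. This is only a sketch of the bandwidth-bound model described above: it assumes decode speed scales inversely with bytes read per token, and the function names are mine. The second helper runs the model backwards to show what KV cache size the 48k measurement would imply under that assumption.

```python
BASE_GB = 2.54    # weights read per token at context 0 (0.76 + 1.78, from the post)
KV_GB_20K = 4.0   # extra KV cache read per token at 20k context (post's estimate)


def predicted_slowdown(base_gb, kv_gb):
    """Speed relative to context 0 if decode is purely bandwidth bound."""
    return base_gb / (base_gb + kv_gb)


def implied_kv_gb(base_gb, tps_at_zero, tps_at_depth):
    """Invert the model: KV cache GB per token implied by a measured slowdown."""
    return base_gb * (tps_at_zero / tps_at_depth - 1)


predicted_20k = predicted_slowdown(BASE_GB, KV_GB_20K)
measured_20k = 18.16 / 46.18          # Strix Halo tg128 @ d20000 vs tg128
implied_48k = implied_kv_gb(BASE_GB, 46.18, 10.80)

print(f"predicted at 20k: {predicted_20k:.3f}x, measured: {measured_20k:.3f}x")
print(f"KV cache implied by the 48k measurement: {implied_48k:.1f} GB per token")
```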
But wait, why does my system show very little slowdown? That's because all of the KV cache is stored in VRAM, which has ultra fast reads. The decode time is dominated by the slow reads from system RAM, so the extra KV cache traffic barely moves the needle.
Why do prefill times degrade so quickly on the Strix Halo? Good question! I would love to know!
Can I just add a GPU to the Strix Halo machine to improve my prefill?
Unfortunately not. The ability to leverage a GPU to improve prefill times depends heavily on the pcie bandwidth and the Strix Halo only offers pcie x4.
I went into my BIOS and forced my pcie slot into various configurations to gather some empirical data:
| config | prefill t/s |
|---|---|
| pcie5 x16 | ~4100tps |
| pcie4 x16 | ~2700tps |
| pcie4 x4 (what the strix halo has) | ~1000tps |
But why? Here is my best high level understanding of what llama.cpp does with a GPU + CPU MoE setup:
- First it runs the router on all 4096 tokens to determine what experts it needs for each token.
- Each token will use 4 of 128 experts, so on average each expert will map to 128 tokens (4096 * 4 / 128).
- Then for each expert, upload the weights to the GPU and run on all tokens that need that expert.
- This is well worth it because prefill is compute intensive and just running it on the CPU is much slower.
- This process is pipelined: you upload the weights for the next expert while running compute for the current one.
- Now all experts for gpt-oss-120b are ~57GB. That will take ~0.9s to upload using pcie5 x16 at its maximum 64GB/s. That places a ceiling on pp of ~4600tps.
- For pcie4 x16 you will only get 32GB/s, so your maximum is ~2300tps. For pcie4 x4, like the Strix Halo via OCuLink, it's 1/4 of this number (~575tps).
- In practice neither will get its full bandwidth, but the relative ratios hold.
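The ceiling arithmetic in the list above can be sketched as follows. It assumes, as the list does, that every batch must upload the expert weights over the link and that upload time is the only limit; the link speeds are theoretical maximums and the function names are mine. `weights_gb` is a parameter so you can try other figures, for example only the ~38GB of experts that live in system RAM on my machine.

```python
BATCH_TOKENS = 4096        # -b / -ub used in the benchmarks
EXPERT_WEIGHTS_GB = 57.0   # all expert layers of gpt-oss-120b (from the post)
EXPERTS_TOTAL = 128
EXPERTS_PER_TOKEN = 4

# Theoretical link bandwidth in GB/s.
LINKS = {"pcie5 x16": 64.0, "pcie4 x16": 32.0, "pcie4 x4": 8.0}


def tokens_per_expert(batch, per_token, total):
    """Average number of tokens in a batch routed to each expert."""
    return batch * per_token / total


def prefill_ceiling_tps(link_gb_s, weights_gb=EXPERT_WEIGHTS_GB, batch=BATCH_TOKENS):
    """Tokens/s if each batch only waits for its weight upload."""
    upload_seconds = weights_gb / link_gb_s
    return batch / upload_seconds


ceilings = {name: prefill_ceiling_tps(bw) for name, bw in LINKS.items()}

print(f"tokens per expert: {tokens_per_expert(BATCH_TOKENS, EXPERTS_PER_TOKEN, EXPERTS_TOTAL):.0f}")
for name, tps in ceilings.items():
    print(f"{name}: ~{tps:.0f} t/s ceiling")
```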
Other benefits of a normal computer with an rtx 5090
- Better cooling
- Higher quality case
- A 5090 will almost certainly have higher resale value than a Strix Halo machine
- More extensible
- More powerful CPU
- Top tier gaming
- Models that fit entirely in VRAM will absolutely fly
- Image generation will be much much faster.

What is Strix Halo good for
- Extremely low idle power usage
- It's small
- Maybe all you care about is chat bots with close to 0 context

TLDR
If you can afford an extra $1000-1500, you are much better off just building a normal computer with an rtx 5090. Even if you don't want to spend that kind of money, you should ask yourself if your use case is actually covered by the Strix Halo.

Corrections
Please correct me on anything I got wrong! I am just a novice!
fallingdowndizzyvr@reddit
Those must be ancient numbers, since the Strix Halo is better than that now and getting better every day. Here's a fresh run that just finished a minute ago.
Sure, the Strix Halo can't hope to have the compute to go up against the 5090 for PP. But in TG, I dare say it goes toe to toe with the 5090. Even at large context.
jjwhitaker@reddit
It's been about 6 months. I'm setting up a Framework AMD Strix Halo next week with an option to plug a GPU into the x4 PCIe slot, and if that is working well enough after I learn to split work across the AMD and Nvidia sides, I'm looking at a 24GB+ VRAM Nvidia GPU for that hybrid setup.
Did I burn cash on the Strix system? Am I mad to chase an attached Nvidia GPU and hybrid setup? Is this going to be enough fun to validate the spend vs skill gain? Let's find out.
fallingdowndizzyvr@reddit
Not at all. I've used a 3060 with my Strix Halo before. I had a 7900xtx in the eGPU slot for the longest time. It's currently rocking a V340. And soon, I'll have a 5070ti as the little helper for my Strix Halo.
jjwhitaker@reddit
A 3060 12gb is what I'm currently testing on, via Ubuntu with a 5900XT and 32gb of slow DDR4 (I already had these parts). I'm impressed, enough to seek more VRAM with the Strix Halo.
I already have my main pc or laptop connecting using LM Studio Link and using a model on the Ubuntu server in VS Code/Copilot. I've been using Gemma 4 E4B a lot this last week preparing for the Copilot Pro token budget changes that hit today...
It reads like Vulkan will be my friend, or vLLM for splitting between AMD and Nvidia. I'll test out the x4 to x16 riser cable currently in the mail and see if I need a more reliable setup for the 3060.
After my company built an Azure AI Service backed code review process that works great, we're investing in an on prem server for local LLM. With luck I can follow that group and get past the shiny hardware phase of this project, learn both sides of the hardware battle, and figure out what training an LLM is all about.
Timely-Coffee-6408@reddit
Tg?
fallingdowndizzyvr@reddit
Token Generation
NeverEnPassant@reddit (OP)
Thanks. Btw, these are numbers I got from you not too long ago.
In these new numbers, It is strange to see the tg is somehow faster at 48k context than 20k context.
fallingdowndizzyvr@reddit
Oh I know. ;) But that was so long ago. What was it, 2... 3 weeks? In Strix Halo time, that was a lifetime ago. Unlike Nvidia which is pretty baked, Strix Halo has just started to rise. It's got a long way to go. In fact, I got another run going right now since those numbers I posted was from way last half an hour ago. So dated as to be useless in Strix Halo time. I'll post the more current numbers when they are done.
Nope. You would know that from the results I posted. Since it would say what the KV cache settings were if they differed from the default. That's how llama-bench rolls
Anyways, here's the command line. As you can see the options I used are reflected in those results I posted. I couldn't be bothered to go find the command line we used in our earlier discussion. So I replicated it as best I could from memory.
./llama-bench -m/gpt-oss-120b-mxfp4-00001-of-00003.gguf -fa 1 --mmap 0 -ngl 99 -ub 4096 -b 4096 -d 0,20000,48000 -p 4096
NeverEnPassant@reddit (OP)
Well, I sincerely hope Strix Halo continues to get better. I still think the prefill numbers are a bit painful, but the tg is now really nice for the price.
Also, I just learned the DDR5 RAM kit I purchased in June for $300 is now $600. That also makes Strix Halo more attractive.
fallingdowndizzyvr@reddit
This hour's numbers are done. Don't look at those old dated numbers I posted from last hour. Here are this hour's numbers. Not as peaky at 0 context, but I think the better performance at higher context makes up for it.
norbosp@reddit
FYI DGX Spark numbers for comparison:
```
$ build/bin/llama-bench -m ~/.cache/llama.cpp/ggml-org_gpt-oss-120b-GGUF_gpt-oss-120b-mxfp4-00001-of-00003.gguf -fa 1 --mmap 0 -ngl 99 -ub 4096 -b 4096 -d 0,20000,48000 -p 4096
[...]
ggml_cuda_init: found 1 CUDA devices:
Device 0: NVIDIA GB10, compute capability 12.1, VMM: yes
| model | size | params | backend | ngl | n_batch | n_ubatch | fa | mmap | test | t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | ------: | -------: | -: | ---: | --------------: | -------------------: |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | CUDA | 99 | 4096 | 4096 | 1 | 0 | pp4096 | 1579.11 ± 33.43 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | CUDA | 99 | 4096 | 4096 | 1 | 0 | tg128 | 46.06 ± 0.09 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | CUDA | 99 | 4096 | 4096 | 1 | 0 | pp4096 @ d20000 | 1305.40 ± 3.76 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | CUDA | 99 | 4096 | 4096 | 1 | 0 | tg128 @ d20000 | 36.18 ± 0.10 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | CUDA | 99 | 4096 | 4096 | 1 | 0 | pp4096 @ d48000 | 938.58 ± 6.47 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | CUDA | 99 | 4096 | 4096 | 1 | 0 | tg128 @ d48000 | 29.60 ± 0.06 |
build: e3b35ddf1 (7509)
```
ochbad@reddit
Would an eGPU be reasonably expected to increase pp4096 @ d48000 (with the improvement limited by the pcie4 x4 bottleneck)? Or would the bottleneck be worse with larger context? I don’t understand the relationship between the pcie bandwidth required for prompt processing and context length. Is the amount of data that needs to be sent to the gpu a function of context size?
fallingdowndizzyvr@reddit
I can give you numbers shortly. Stand by......
Update: So here you go. As you can see, using an eGPU doesn't really do much to increase the speed. That's why I've described it as effectively just expanding the amount of available RAM. I don't think it's bound by the PCIe speed as OP suggests. To illustrate that, I've included both a run with it having only 2 layers on the 7900xtx and another run with it having 32 layers. While there is a difference in speed, that's accounted for by the 7900xtx having more layers to help out more versus not. In this case, it basically balances out the inherent performance penalty of going multi-gpu in llama.cpp when 32 layers are loaded on the 7900xtx
The reason I don't think it's bound by the PCIe bus is that OP's premise is that the dGPU has to do all the work for PP and thus it's I/O bound by the PCIe bus while accessing the layers that aren't local to it. But the reality is that both GPUs are working during PP. In this case, the iGPU is pretty much working all the time while the 7900xtx only goes in bursts. That's because the iGPU has a lot more of the model to deal with and is slower. The 7900xtx on the other hand blasts through its little portion and spends most of its time idle. I've included a screenshot that shows this.
ggml_cuda_init: found 2 ROCm devices:
Device 0: Radeon RX 7900 XTX, gfx1100 (0x1100), VMM: no, Wave Size: 32
Device 1: AMD Radeon Graphics, gfx1151 (0x1151), VMM: no, Wave Size: 32
ggml_vulkan: Found 2 Vulkan devices:
ggml_vulkan: 0 = Radeon RX 7900 XTX (RADV NAVI31) (radv) | uma: 0 | fp16: 1 | bf16: 0 | warp size: 64 | shared memory: 65536 | int dot: 1 | matrix cores: KHR_coopmat
ggml_vulkan: 1 = AMD Radeon Graphics (RADV GFX1151) (radv) | uma: 1 | fp16: 1 | bf16: 0 | warp size: 64 | shared memory: 65536 | int dot: 1 | matrix cores: KHR_coopmat

| model | size | params | backend | ngl | n_batch | n_ubatch | fa | ts | mmap | test | t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | ------: | -------: | -: | ------------ | ---: | --------------: | -------------------: |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | ROCm,Vulkan | 99 | 4096 | 4096 | 1 | 3.00/97.00 | 0 | pp4096 @ d48000 | 188.75 ± 0.16 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | ROCm,Vulkan | 99 | 4096 | 4096 | 1 | 3.00/97.00 | 0 | tg128 @ d48000 | 30.29 ± 0.03 |

ggml_cuda_init: found 2 ROCm devices:
Device 0: Radeon RX 7900 XTX, gfx1100 (0x1100), VMM: no, Wave Size: 32
Device 1: AMD Radeon Graphics, gfx1151 (0x1151), VMM: no, Wave Size: 32
ggml_vulkan: Found 2 Vulkan devices:
ggml_vulkan: 0 = Radeon RX 7900 XTX (RADV NAVI31) (radv) | uma: 0 | fp16: 1 | bf16: 0 | warp size: 64 | shared memory: 65536 | int dot: 1 | matrix cores: KHR_coopmat
ggml_vulkan: 1 = AMD Radeon Graphics (RADV GFX1151) (radv) | uma: 1 | fp16: 1 | bf16: 0 | warp size: 64 | shared memory: 65536 | int dot: 1 | matrix cores: KHR_coopmat

| model | size | params | backend | ngl | n_batch | n_ubatch | fa | ts | mmap | test | t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | ------: | -------: | -: | ------------ | ---: | --------------: | -------------------: |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | ROCm,Vulkan | 99 | 4096 | 4096 | 1 | 35.00/65.00 | 0 | pp4096 @ d48000 | 236.81 ± 0.33 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | ROCm,Vulkan | 99 | 4096 | 4096 | 1 | 35.00/65.00 | 0 | tg128 @ d48000 | 33.03 ± 0.39 |
fallingdowndizzyvr@reddit
Here are the numbers.
ggml_cuda_init: found 2 ROCm devices:
Device 0: Radeon RX 7900 XTX, gfx1100 (0x1100), VMM: no, Wave Size: 32
Device 1: AMD Radeon Graphics, gfx1151 (0x1151), VMM: no, Wave Size: 32
ggml_vulkan: Found 2 Vulkan devices:
ggml_vulkan: 0 = Radeon RX 7900 XTX (RADV NAVI31) (radv) | uma: 0 | fp16: 1 | bf16: 0 | warp size: 64 | shared memory: 65536 | int dot: 1 | matrix cores: KHR_coopmat
ggml_vulkan: 1 = AMD Radeon Graphics (RADV GFX1151) (radv) | uma: 1 | fp16: 1 | bf16: 0 | warp size: 64 | shared memory: 65536 | int dot: 1 | matrix cores: KHR_coopmat

| model | size | params | backend | ngl | n_batch | n_ubatch | fa | ts | mmap | test | t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | ------: | -------: | -: | ------------ | ---: | --------------: | -------------------: |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | ROCm,Vulkan | 99 | 4096 | 4096 | 1 | 3.00/97.00 | 0 | pp4096 @ d48000 | 188.75 ± 0.16 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | ROCm,Vulkan | 99 | 4096 | 4096 | 1 | 3.00/97.00 | 0 | tg128 @ d48000 | 30.29 ± 0.03 |
randomisednick@reddit
Hmm I wonder what is the max high context pp that could be achieved on a combination of strix halo plus 3/4/5090 by shuffling sections of the model across to the dGPU to keep it fed as much as possible while also using the iGPU and NPU in parallel, with the dGPU ending up holding the shared layers and some experts ready for tg phase?
I guess the dGPU would be bandwidth limited on PCIe to around 400 pp tk/s and the iGPU + NPU might manage another 250? Still a decent speed up.
Could one even potentially use that approach in an Exo style cluster of a gaming PC plus Strix Halo over a 80Gbps USB4v2NET network?
NeverEnPassant@reddit (OP)
But 32 layers is 50GB?
fallingdowndizzyvr@reddit
Oh shit. You're right. Before this little side quest, I was using Qwen 3 VL which is 94 layers. So I was doing the 3% versus 35% numbers off of that. 35% of 94 layers is ~ 32 layers. Little OSS 120B is only 36 layers. Which makes 3% 1 layer and 35% 12 layers. That explains why I had to use 3%. Since 1-2% didn't work. 1-2% isn't even a layer.
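The layer counts in this exchange can be reproduced with a simple proportional estimate. This is only a sketch: it is not how llama.cpp actually assigns layers for a given tensor split, the function names are made up for illustration, and the size estimate assumes every layer is the same size.

```python
MODEL_GIB = 59.02  # gpt-oss-120b size reported by llama-bench in this thread


def layers_for_share(total_layers: int, percent: int) -> int:
    """Whole layers covered by `percent` of the model (proportional estimate)."""
    return total_layers * percent // 100


def approx_gib(layers: int, total_layers: int, model_gib: float = MODEL_GIB) -> float:
    """Rough size of `layers`, ASSUMING all layers are equally large."""
    return model_gib * layers / total_layers


print("35% of 94 layers:", layers_for_share(94, 35))   # the Qwen 3 VL case
print(" 3% of 36 layers:", layers_for_share(36, 3))    # gpt-oss-120b
print("35% of 36 layers:", layers_for_share(36, 35))
print(" 2% of 36 layers:", layers_for_share(36, 2))    # less than one layer
print(f"32 of 36 layers: ~{approx_gib(32, 36):.0f} GiB")
print(f"12 of 36 layers: ~{approx_gib(12, 36):.0f} GiB")
```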
sudochmod@reddit
He wouldn’t have on gpt oss.
NeverEnPassant@reddit (OP)
You can still quantize the KV cache. It's different from the model weights.
sudochmod@reddit
Yes I’m aware. As far as I know there isn’t really a reason to quantize the kv cache on gpt oss.
Eugr@reddit
Gpt-oss runs much slower with quantized cache on llama.cpp, but the cache doesn't take much space for this model, so there's no reason to do it at all.
MarkoMarjamaa@reddit
...and this was run on Q8 quant, not the original release F16.
nakedspirax@reddit
With the new qwen3.6 models and agent harnesses. The strix halo is a dream to run 24/7. LOW POWER.
NeverEnPassant@reddit (OP)
Depends on what you mean by 24/7.
Strix Halo only has advantages over GPUs for idling. Under load, and watt limited, GPUs do quite well.
Educational_Sun_8813@reddit
and full context speed?
NeverEnPassant@reddit (OP)
What are you asking me?
Educational_Sun_8813@reddit
what is the speed in your setup if you fill the 130k?
NeverEnPassant@reddit (OP)
Deleted the last comment, forgot a command line param that made the numbers worse.
Corrected numbers:
Educational_Sun_8813@reddit
i think table is missing ts speed?
NeverEnPassant@reddit (OP)
I think the 2 numbers you are looking for are:
pp130000 2390.77 t/s
and
tg128 @ d125000 31.56 t/s
The first is the prefill speed when starting from 0 context and processing 130000 tokens.
The second is the output speed once the context is at 125000 tokens.
xryl669@reddit
Please confirm: pp 2390 t/s for 128k tokens means you have to wait 53s for an answer right (plus the time it takes for the answer itself, but it's the same for both system, so there's no point in this)?
Maybe adding this computation in your post can help other understand your main point: the preprocessing actual meaning is time to wait for answer and it's not negligible but not horrendous either.
Then reading your post, there's something odd here. In the Strix case, the PP is like 230t/s for 40k context, which means it's taking way more time to process a smaller context (173s) vs a larger one. How could it be?
NeverEnPassant@reddit (OP)
Yes, 53 seconds. This is for my RTX 5090 + RAM system.
I don't have pp130000 numbers for Strix Halo, but I expect it would take 10-20x longer.
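For anyone who wants to redo the wait-time arithmetic from this exchange, here is a minimal sketch. It assumes the wait before the first token is simply prompt tokens divided by prefill speed; the function name is mine. As far as I understand llama-bench, a `pp4096 @ dN` figure is the speed of one 4096-token batch at depth N, not an average over the whole prompt, so plugging it in for a full prompt (as in the 40k example) gives a pessimistic number.

```python
def prefill_wait_s(prompt_tokens: int, pp_tps: float) -> float:
    """Seconds before the first output token, assuming a constant prefill speed."""
    return prompt_tokens / pp_tps


# RTX 5090 + RAM system: pp130000 measured at 2390.77 t/s (average over the prompt).
wait_5090 = prefill_wait_s(130_000, 2390.77)

# The Strix Halo figure quoted in the question: ~230 t/s at ~40k context.
wait_strix_40k = prefill_wait_s(40_000, 230.0)

print(f"5090 system, 130k prompt: {wait_5090:.0f} s")
print(f"Strix Halo, 40k prompt at 230 t/s: {wait_strix_40k:.0f} s")
```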
Educational_Sun_8813@reddit
ah interesting that values are not rendered in the table i can see above, seems it's cut somewhere, thx!
NeverEnPassant@reddit (OP)
ts?
Traditional_Monk_291@reddit
I think the real elephant in the room is: can these companies STOP trying to push more expensive vram and just make faster ram that is unified as standard? Then we can start to talk about ram needing heating just as GPU and cpu chips. All this bs about h100 and 4k devices just to have an llm read/compute is a waste of resources. Right now the industry is trying to make AS MUCH money from the end consumers as possible.
minitoxin@reddit
Strix halo is fantastic - I love it. For me the most important thing is the power consumption as my systems run 24 hours a day. I'm ok running important jobs overnight so prompt speed is not an issue for me. I like it so much I bought another one dedicated to running Wan 2.1/2.2, HunyuanVideo, LTX-Video (LTX-2) and the occasional 70B LLM.
Sizzin@reddit
If your needs are always a "cold start" with a big context, Strix Halo is definitely a horrible choice. But if you go incrementally, like in RP or asking it to analyze one project file and then another and another, it's really not that big of a deal if you use cache. I tried loading a 100k context into a 120b model on my MS-S1 Max and it took forever for the first answer, but any subsequent question had very little difference from a 0 context start. So if you don't make system prompt changes or edit previous messages, you're good, really.
I have a Desktop with a 3090 and a spare 4070. I'm flirting with the idea of plugging the 3090 through Thunderbolt to my MS-S1 and putting the active params inside it. From my hypothetical math, the performance loss isn't really that big and I could use it almost like plug-and-play, turning it off when I'm not using it for heavy inference and enjoying the ~14W idle draw of the mini PC. That's still only theory, of course, I'll be trying it soon.
NeverEnPassant@reddit (OP)
You will be waiting 10+ minutes to fill up 100k context with even a fast model like gpt-oss-120b. Maybe you will do it in a single request, or spread that wait out among many, but you will be waiting.
You can't really get any improvements by adding a GPU to the Strix Halo. Its pcie is too slow. I go over why in the post.
Sizzin@reddit
I'm not sure you understand how cache works. But I just did a random test on a Qwen3.5 35b and on Qwen3.5 122b. I pasted the whole The Linux Programmer’s Guide content on it, it's around 65k tokens.
Qwen3.5 35b took 100s to process it all and token generation was at ~44t/s.
Qwen3.5 122b took 340s to process and token generation was at ~22t/s. I made a followup question of about 1k tokens, it took 7s to process it, tg was still at 22t/s.
That's hardly useless, really.
And you can, indeed, get improvement by adding a beefier, external GPU. It won't be the performance of the GPU at 100%, of course. You would put just some layers on the eGPU but these layers would be processed at a much faster speed than on the iGPU, so you'd get a faster result in the end. I haven't yet managed to load the NVIDIA drivers due to my BIOS being naughty so this is all theory from my part, based on others' benchmark, though.
NeverEnPassant@reddit (OP)
That's pretty much in line with what I said: 10+ minutes to load 100k tokens
You talk about the linux programmer's guide. That sounds like a coding agent to me, and you will be chewing through context and compacting a LOT. That 65k will be the first of many 65k. You will be doing a lot of waiting.
Also, you can not add on a GPU to Strix Halo to accelerate prefill very much. The pcie is just too slow, so you are limited by how many experts you can fit in VRAM, which will not be many unless you are spending $8K on a GPU.
Sizzin@reddit
TLPG is a kinda self-contained documentation, you use it to learn Linux more deeply. This would be the only prefill. So you'd send it to the LLM, go make coffee and when you come back, you're ready for a study session. And you can use session files, saving your prefill cache so you don't need to reprocess the prefill if the context starts getting too big and you want to reset the conversation.
But I understand what you mean.
My point of contention is your affirmation of "Strix Halo is a poor purchase for most people". It's precisely the contrary. Most people aren't going to be inserting chunks of 100k context every other message to the LLM. Strix Halo is a very good option for its price point. Never mind now. This was still true at the time you wrote this post. And that's only considering the raw throughput. When you start taking into consideration broader aspects, the Strix Halo becomes even more attractive.
And again, like with the title, you use very strong affirmations and they aren't exactly correct. You CAN get improvements with an eGPU. It won't give you the same performance as using the GPU alone, but even using TB5, which is slower than even x4 PCIe, you can potentially get double the prompt processing and +30~50% of token generation of the Strix Halo, depending on the model and the GPU you use. That would get that atrocious 10+ minutes to process 100k to a much more acceptable ~7 minutes. But this is a digression anyways.
NeverEnPassant@reddit (OP)
Not every other message, no. But quite often. Most people want things like coding agents. If you want to load a book into context and then ask it questions, that's a very niche use case, but yes Strix Halo will handle it well. I don't consider that to be useful for most people.
People have tried and failed to see significant speedups. Again the problem is the poor pcie x4 speed. Qwen3.5 122b with a 4 bit quant is ~77GB. You will have to transfer most of that to the GPU for every micro batch of tokens (maybe 2048 or so). Pcie4 x4 is ~8GB/s theoretical, but less in practice. Assuming ~6GB/s transfer speed and that you need to transfer 65GB of weights, that's a 10.83s pause per 2048 input tokens. That's 528s per 100k tokens just to transfer the weights. But your GPU still needs to process the tokens on top of that. It's probably somewhere from slightly slower to slightly faster, but certainly not a big speedup.
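Here is that transfer-time estimate as a small sketch, so the inputs are easy to change. All inputs are the assumed values from the paragraph above (65GB per micro batch, ~6GB/s effective, 2048-token micro batches), and the function names are mine; it counts only weight transfer, not compute.

```python
def pause_per_ubatch_s(weights_gb: float, link_gb_s: float) -> float:
    """Seconds spent uploading weights for one micro batch."""
    return weights_gb / link_gb_s


def transfer_time_s(total_tokens: int, ubatch: int, weights_gb: float, link_gb_s: float) -> float:
    """Total upload time for a prompt, counting fractional micro batches."""
    return (total_tokens / ubatch) * pause_per_ubatch_s(weights_gb, link_gb_s)


pause = pause_per_ubatch_s(65.0, 6.0)
total = transfer_time_s(100_000, 2048, 65.0, 6.0)

print(f"pause per 2048-token micro batch: {pause:.2f} s")
print(f"weight transfer for 100k tokens: {total:.0f} s ({total / 60:.1f} min)")
```

A larger micro batch amortizes each upload over more tokens, which is why the batch sizes used in a benchmark matter for this kind of estimate.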
BeginningReveal2620@reddit
The real question have you actually tested a Strix Halo PC or is this just your "insights" ? Seems like you have not actually tested the hardware!
NeverEnPassant@reddit (OP)
I posted the best numbers I have received from Strix Halo users. Are you just too dumb to read?
BeginningReveal2620@reddit
Arm Chair General - Yes I can read. Nice LARP. If you actually had a Strix Halo on your desk, you’d know that setting your BIOS UMA to 512MB is a performance death sentence. On this architecture, the BIOS-carved pool is 'Coarse-Grained' (non-coherent), which is the only way to hit the 215GB/s bandwidth. By 'unleashing' the rest via GART, you're forcing the GPU into 'Fine-Grained' coherency mode, which is 3x slower. You’re effectively running a $2,500 machine at the speed of a budget laptop.
Also, the `ixgbe` issues on Strix Halo aren't 'driver API changes'; it's a well-documented PCIe power conflict that crashes the Intel E610s whenever the APU spikes. Anyone actually troubleshooting this on Debian would be talking about `pcie_aspm=off`, not 'classic' models from 2023. Next time you copy-paste a tech stack for clout, try to get the memory architecture right
We_Master@reddit
someone is finally saying what i wanted to after reading this post.
NeverEnPassant@reddit (OP)
I think you are mentally ill.
I posted the best Strix Halo numbers regardless of BIOS settings.
Don't DDoS me with nonsense.
BeginningReveal2620@reddit
Here is my 128G HP Z2 G1A Strix Halo FYI
Potential_Block4598@reddit
I couldn’t get gpt-oss-120b to run in LMStudio though
Is that an LMStudio thing ?! I mean does direct command line in llama.cpp load the model just fine ?!
Sizzin@reddit
Like already said, it may be the memory. But otherwise, it's definitely LM Studio's llama.cpp. If you're on Strix Halo and using the ROCm backend, then change it to Vulkan and it will probably work. If you're on CUDA, it should work without problem if you have the RAM. You could try running llama-server/llama-cli directly too (downloading a llama.cpp pre-built binary or building it yourself, not that hard). It's much less user-friendly, but you'll most likely gain in performance.
jerryeight@reddit
Do you have it set up to run the model from your system ram? My system has 48gb so it refuses to run the gpt-oss-120b model. But, it will run the smaller one just fine.
Hector_Rvkp@reddit
Feb26, European pricing (EUR): RTX 5090 (TDP 575W) 3,350, CPU 200, RAM 64 6000 680, PSU 120, Mobo 110, case 60, stuff 30, total 4,550. Strix Halo meanwhile 1,700, turn key, new, TDP 120W. That's 2.7x more money, and way, way more power, heat, noise, bulk. But yes thank god it's faster :)
NeverEnPassant@reddit (OP)
Yeah, RAM prices screwed this setup. It was the better choice when I made this post. Unfortunately, I think Strix Halo is just not useful for local llm, so my advice would be to buy nothing.
Hector_Rvkp@reddit
I have no horse in this game, but i'm considering getting a strix halo, so i'd rather not regret buying something. Are you saying that running large MoE models (GPT-OSS 120B, Qwen3 80B/253B, GLM-4.5), quantized, so that TPS is 20-50, isn't useful at a $2k price point?
NeverEnPassant@reddit (OP)
Yes, this.
TPS comes in 2 flavors, prefill and decode. Strix Halo does pretty well with decode. It's the very slow prefill that makes it useless IMO. In my graph you can see Strix Halo has over 10x worse prefill (pp) than my 5090 machine with gpt-oss-120b at 48k context. That's 10x the time you will be waiting before you see the first token.
If your use case somehow doesn't have a lot of input tokens, then it may still be useful.
waitmarks@reddit
This just in, more expensive computers are faster than less expensive computers. More at 11.
NeverEnPassant@reddit (OP)
My argument is that for 50% more money you go from barely useful to extremely useful. There are often optimal price points in product segments.
Haiart@reddit
The RTX 5090 ALONE is more expensive than the full Strix Halo system, are you really listening to yourself right now?
NeverEnPassant@reddit (OP)
A 5090 is $2000
SebastianOpp@reddit
Are you as stupid as this?
cleverestx@reddit
Your $2000 5090 (even if you could get it for that) is not going to outperform my AMD Strix Halo mini-PC when I want to run a huge LLM with large contexts while web browsing and watching YouTube videos and actually USING the system... Some of the models I enjoy on my 128GB one, such as Qwen3, take 80-115GB of VRAM/unified memory to run. Mine will run circles around your "fast" card in those cases at usable speeds, and even dual cards won't allow you to compete with it.
I think you should educate yourself on this platform and its clear advantages in some contexts, for example using bf16, non-quantized image models combined with larger batch generations and upscaling without experiencing an OOM or dumping into your slow system memory because your fancy fast RTX cards can't handle that workload.
Also you are above a thousand watts of power drain/cost. I'm usually at a couple hundred (330 max), so that's a big factor over time too.
NeverEnPassant@reddit (OP)
But I also have 128GB RAM: 32GB VRAM + 96GB system RAM.
You mention large contexts, but that is where a 5090 completely trounces the Strix Halo for a couple reasons:
I'm happy to provide you with benchmarks if you have a model you would like me to test.
Rich_Repeat_22@reddit
WHERE? The cheapest one is north of $2600 there aren't any $2000 5090s. Even NVIDIA stopped selling them.
And 128GB DDR5 these days is close to $600 by itself. So we are at over 50% of an AMD AI 395 without even counting case, PSU, NVME, motherboard, CPU and cooler.
NeverEnPassant@reddit (OP)
I agree with you on DDR5. I had no idea the price had doubled in the past 2 months. $2400 5090s are attainable today and mine was in stock for TWO WEEKS before I pulled the trigger at $2000. I suspect with a little patience it will come back to that soon.
MangoAtrocity@reddit
I don’t think it’s coming back to $2000, g
ASYMT0TIC@reddit
I bought one retail last week for $2000. They had only one in stock.
EnderCrypt@reddit
i did consider getting a setup with dedicated graphics cards initially, the problem is that prices outside.. i guess america, are bad
im not super knowledgeable in hardware and such, so anyone can correct me if im wrong, but in sweden, the prices i could find for RTX 5090 (ventus) were the equivalent of about 2900 USD
and astral about 3500 USD, im not a hardware person, i assume the ventus would have been enough... but yeah, so sadly its quite an expensive product
Responsible-Earth821@reddit
Ah yes, so I've acquired this 5090, how do I plug my monitor and keyboard into it? How do I install Ubuntu on it? Plz master, I've spent 50% more on this thing and I don't know what to do next.
arentol@reddit
To be fair, OP did say 50% more, and for 50% more ($1000) you can indeed build a computer to put the 5090 in. Not a stellar computer, but one that will do the job.
robberviet@reddit
$2000 is listing price, retail price at my place currently is about $3500.
Rich_Repeat_22@reddit
128GB RAM alone is $600 these days.
throwawayacc201711@reddit
The 5090 graphics card on its own, no RAM, no MOBO, no CPU, etc is already 50% more expensive than an entire strix halo computer. It’s not just 50% more expensive.
NeverEnPassant@reddit (OP)
That's not true. A 5090 is $2000.
throwawayacc201711@reddit
Please share a link that I can get one for 2k and not a lottery or isn’t immediately sold out for that price. The cheapest I see is $2700 and some going in the 3000s
NeverEnPassant@reddit (OP)
I paid $2000 for mine, and they were in stock for weeks when I purchased it. It looks like stock has thinned out a bit the past couple weeks, but you can still find the more expensive "OC" versions for $2400. I also expect you can find a non-"OC" version in 1-2 weeks for $2000 if you set some alerts.
Here are some links:
starkruzr@reddit
1) this does not exist anymore at this price,
2) an entire Bosgame M5 with 128GB RAM that can run MUCH larger models than your build is cheaper than *just the card* by several hundred dollars.
cleverestx@reddit
He wouldn't know what it's like to run and use a 70b+ model, not butchered by over-quantization, fully in VRAM at usable speeds...because his "better" card simply cannot do it, LOLZ
curson84@reddit
32GB 14" MacBook Pro is 2.549,00 € and no Option to go to 128GB ram
throwawayacc201711@reddit
Both of the links you sent are 2400$
Altruistic_Ad3374@reddit
it is not. its a miracle if you can find it for that price.
NeverEnPassant@reddit (OP)
I did. It was in stock for 2 weeks when I bought it.
waitmarks@reddit
50% more money for just the GPU you mean. what about the rest of the computer? strix halo is a whole computer for 75% the money of just one piece of your setup.
NeverEnPassant@reddit (OP)
No, a 5090 is about the same price as a 128GB Strix Halo.
waitmarks@reddit
They are hard to find for the same price, but assuming you do. What about the rest of the computer? Does that just not count? Just the RAM alone is going to add close to $1k for the 128GB of DDR5 6000 you used in your setup these days.
NeverEnPassant@reddit (OP)
I paid $329 for 96GB DDR5-6000 CL30 in June.
suspi@reddit
Hello from the future!
CORSAIR Vengeance DDR5 96GB (2x48GB) DDR5 6000MHz CL30 CMK96GX5M2B6000Z30 on 12/01/2025 is an astronomical $1,175.99. At this rate, you'll be able to sell just your RAM kit and go buy a Strix Halo cluster.
NeverEnPassant@reddit (OP)
I've been following it! This is crazy!
waitmarks@reddit
AI datacenters happened. RAM and NAND are both insane prices now because datacenters are buying them all up.
NeverEnPassant@reddit (OP)
In the last 2 months? Jesus.
nero10578@reddit
Last 2 weeks
starkruzr@reddit
my man, what in God's name are you talking about? https://www.bosgamepc.com/products/bosgame-m5-ai-mini-desktop-ryzen-ai-max-395?sku=18070578044354691493644095
NeverEnPassant@reddit (OP)
I paid $170 more than that.
starkruzr@reddit
okay, well, congrats on your unicorn purchase, that isn't happening for the rest of us anytime soon. and as you've already had to acknowledge, DDR5 DIMMs for building systems are rapidly becoming insanely overpriced as well.
Zyj@reddit
I paid less than 1600€ for the entire computer. You‘re wrong.
No-Consequence-1779@reddit
The spark is the best value. Then 6000 pro. 1 5090 is nice but 2 are required to do most work.
NeverEnPassant@reddit (OP)
5090 is actually very useful if you want to run models like gpt-oss-120b. Prefill is probably only 25% slower than a rtx 6000 pro, and decode is like 4-5x slower. But the price is like 1/4th.
No-Consequence-1779@reddit
I run oss 120 on 2 5090s. It still spills over to ram. Luckily it is designed to be fast either way.
Try qwen q3 25Xb .. it’s 1-3 tokens per second.
Assuming a minimum useful context size of at least 16k.
ParthProLegend@reddit
Truly stupid. Oss 120b is MoE with 5.1B active & 117B total parameters, while Qwen 3 235B has 22B active & 235B total parameters; comparing them is stupid to say the least. Some people actually don't deserve the things they have.
No-Consequence-1779@reddit
Extremely emotional. There is no comparison. Child level.
ParthProLegend@reddit
Spark best value??? Nvidia paid shill? Those are trash boxes. You can't care as you are stupid enough to believe shit.
Rich_Repeat_22@reddit
5090 is 40-50% more expensive these days than a full miniPC using 395 128GB.
ParthProLegend@reddit
Exactly
ravage382@reddit
I mean, for 50% more, you could buy 2 strix halo boxes and run minimax at q5...
Apples to apples is how you make a comparison. A $2000 desktop will not run as well as the 395.
NeverEnPassant@reddit (OP)
2 strix halo boxes would be 100% more
ravage382@reddit
50% more than your dedicated card and system rig.
NeverEnPassant@reddit (OP)
Gotcha. Technically run that? Sure. At a usable speed? No.
ravage382@reddit
They are capable systems for their price.
While it can be fun to watch a wall of text appear at 60-100t/s, I can run some pretty big models spread between 2 at 10-15t/s for jobs over night.
CV514@reddit
Funny to read all that from my cool hill of partially offloaded 12B models chugging along at 5t/s average and it's perfectly usable for my case.
PhaseExtra1132@reddit
50% more money? A lot of folks are already basically spending the max they can for a hobby and having something functional is good enough.
Like the hard limit here is the cash not the work type. I got like a simple MacBook and I’m doing all my learning on that.
NeverEnPassant@reddit (OP)
I can understand that. I just don't think they are getting what they think they are and I don't want them to throw their money away.
PhaseExtra1132@reddit
The issue is that it’s sunk cost fallacy.
If we spend 2k. Then the best bang for the buck could be a Mac or the framework desktop.
But to go for what you’re saying, we need to go to like 3-5k potentially with everything loaded out: the CPU, graphics card, case and memory.
That’s assuming we can find a 5090 God knows how hard it is to find these stupid cards these days.
For most folks it’s just easier to go 2k and learn. Even if 3k and it would be something killer. The time cost of finding the graphics card alone versus walking into an Apple Store is something you need to factor in.
Like for me my laptop is all I have. And upgrading would be the framework desktop.
If and only if I hit my ceiling there after learning would the 5090 be the next rung up.
NeverEnPassant@reddit (OP)
It's getting easier to find these cards every day. Mine was in stock for nearly 2 weeks before I pulled the trigger for $2k.
emprahsFury@reddit
Which to be clear your post does convey, people just like to be contrary
TheLexoPlexx@reddit
And now, the feather.
Infinite100p@reddit
u/NeverEnPassant could you explain your train of thought please?
You use Gpt-oss-120b in your benchmarks, and then you advocate for 5090 as a better option, but Gpt-oss-120b would not fit on a single 5090, probably not even on dual 5090s (unless zero cache), and 3x 5090s is a ton of money.
Or are you advocating 1x 5090 with RAM offload? Because then you are gonna be dealing with slow pp speeds again, no?
NeverEnPassant@reddit (OP)
I literally explain all of this in my post?
I explained how llama.cpp handles prefill with cpu/ram offload. It is bottlenecked by pcie speed. With pcie5 x16 (standard for new builds), and a 5090, you get extremely fast prefill with gpt-oss-120b. See my post for a detailed breakdown of how it batches the uploads to make this efficient. The result is like 10x-20x faster prefill than Strix Halo, even with 2/3 of the expert layers offloaded with a 5090/DDR5-6000 system.
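To make the PCIe argument concrete, here is a rough back-of-the-envelope sketch. It assumes the offloaded expert weights are copied to the GPU once per micro-batch and that nothing else costs time, so it gives a ceiling, not a prediction. The 38GB figure and the 4096 ubatch come from the post; the 63 GB/s link speed is the theoretical PCIe 5.0 x16 rate per direction and is an assumption about the hardware, not a measurement.

```python
# Rough upper bound on prefill throughput when offloaded experts must be
# streamed over PCIe once per micro-batch. All constants are assumptions.

PCIE5_X16_GBS = 63.0   # assumed: theoretical PCIe 5.0 x16 bandwidth, GB/s per direction
OFFLOADED_GB = 38.0    # expert weights kept in system RAM (figure from the post)


def prefill_ceiling_tps(offloaded_gb: float, link_gbs: float, ubatch: int) -> float:
    """Tokens/s if the only cost were uploading the offloaded weights once per ubatch."""
    seconds_per_ubatch = offloaded_gb / link_gbs
    return ubatch / seconds_per_ubatch


if __name__ == "__main__":
    # Larger micro-batches amortize the same upload over more tokens.
    for ubatch in (512, 2048, 4096):
        tps = prefill_ceiling_tps(OFFLOADED_GB, PCIE5_X16_GBS, ubatch)
        print(f"ubatch={ubatch:5d}: ceiling ~{tps:,.0f} t/s")
```

Under these assumptions the ceiling scales linearly with ubatch, which is why a large ubatch matters so much for offloaded prefill. The measured 4065 t/s at zero depth sits below the ubatch 4096 ceiling, as expected once compute and real-world link efficiency are included.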
Infinite100p@reddit
Is RAM CL/speed important in this scenario, in your opinion?
I.e., both DDR5 but:
CL42 vs, say, CL30?
5600 VS 6400?
2-channel consumer Ryzen VS 8-channel Epyc?
NeverEnPassant@reddit (OP)
CL doesn't really matter. Bandwidth is all that matters for decode (but not prefill). Both number of channels and speed contribute to bandwidth.
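As a quick illustration of how channel count and speed combine, theoretical peak bandwidth is just transfers per second times bytes per transfer. This sketch assumes the usual desktop convention of 64 bits per DDR5 channel; the 8-channel line is a hypothetical configuration for comparison, and real sustained bandwidth will be lower than any of these peaks.

```python
def peak_bandwidth_gbs(transfers_mt_s: int, bus_width_bits: int) -> float:
    """Theoretical peak bandwidth in GB/s: transfers per second times bytes per transfer."""
    return transfers_mt_s * 1e6 * (bus_width_bits / 8) / 1e9


if __name__ == "__main__":
    # (label, MT/s, total bus width in bits); 64 bits per channel is an assumption.
    configs = [
        ("2-channel DDR5-5600", 5600, 2 * 64),
        ("2-channel DDR5-6000", 6000, 2 * 64),
        ("2-channel DDR5-6400", 6400, 2 * 64),
        ("8-channel DDR5-5600 (hypothetical)", 5600, 8 * 64),
        ("256-bit LPDDR5X-8000", 8000, 256),
    ]
    for label, speed, width in configs:
        print(f"{label:36s} {peak_bandwidth_gbs(speed, width):6.1f} GB/s")
```

CAS latency does not appear anywhere in the formula, which matches the point that it barely matters for this workload.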
NewspaperFirst@reddit
Ma' guy, i bought 2 of these before ram spike. And now if i sell them they are 800$ more expensive. Not only that, but they are a beast, they consume like a few watts idle, i leave them 24/7 on (bonus, as servers - and i was paying like hundreds of $ in servers/month). Inference is crazy good for size/power consumption and I got an egpu too (mooar speed). Im set for like the next 4-5 years. What'r u talking about? This is the best hybrid choice as a system that can be used for inference, personal use, server, etc. P.s. i got your case with 5090 but that 5090 needs a PC. And even if you got one, good luck running gpt oss 120b with acceptable speeds after you offload half of it inside probably a slow system ram.
NeverEnPassant@reddit (OP)
Ma' dude. You didn't read the post:
noiserr@reddit
Strix Halo has been amazing for me, you have no clue.
NeverEnPassant@reddit (OP)
I think I have more than a clue. I have benchmarks. They are in this post.
noiserr@reddit
Your benchmarks are worthless. 5090 alone costs more than Strix Halo entire system. Also Strix Halo has gotten much faster. It now does 65 t/s on gpt-oss-120.
NeverEnPassant@reddit (OP)
When I posted this, a 5090 was not hard to find for $2000.
Bullshit.
noiserr@reddit
Are you stupid? Even at $2000 for just a GPU is way more expensive than Strix Halo:
https://i.imgur.com/zCC7a1v.png
https://youtu.be/wAIzlGwEAO0?t=678
NeverEnPassant@reddit (OP)
I'm not gonna repeat arguments I already put in my post and other comments here.
That's a vllm specific benchmark where it launches multiple concurrent requests so it can batch decode steps together. It's also low context.
Here on r/LocalLLAMA, when people say t/s without qualifications, they are almost always referring to the decode speed of a single concurrent request.
For the Strix Halo, you will see 30-52 t/s for this depending on context length.
Altruistic_Ad3374@reddit
is this bait?
NeverEnPassant@reddit (OP)
I would love someone to make an argument other than "LOL MORE EXPENSIVE SYSTEM BETTER NO SHIT".
sudochmod@reddit
Isn’t that literally your argument though? “If you can afford an extra $1,000-1500, you are much better off just building a normal computer with an rtx 5090.”
NeverEnPassant@reddit (OP)
My argument is that a normal computer with a 5090 is a vastly better value proposition. Imagine you could buy a pair of boots that lasted a week for $10 or lasted a year for $20.
sudochmod@reddit
Respectfully, I disagree. I think you’re being a bit generous with being able to get a 5090 for 2k. Every time I see them at that price they’re sold out.
I think you should maybe go on pc part picker and show a build with costs. If it’s a vastly better value proposition then it should be able to absorb the cost increase on the 5090 due to scarcity, right?
Aside from raw performance there’s also the power draw which is considerably lower on the strix.
How fast do you run the new minimax m2 model on q3_k_xl?
NeverEnPassant@reddit (OP)
That's what I paid. I think anyone can find that price if they wait a few weeks. It was actually in stock for 2 weeks before I made my purchase.
I can run that benchmark for you. What numbers do you get?
starkruzr@reddit
okay so like, you know how delusional this was now, right? like we aren't going to be able to even purchase 5090s pretty soon and even 3090s are going to become vanishingly rare.
NeverEnPassant@reddit (OP)
It was true when I wrote this post. And it is still possible, but a bit harder now. nvidia.com still does MSRP drops once a month or so. But for the most part 5090 AND Strix Halo have gone up in price.
The real reason my post is now invalid is DDR5 prices. They have like 3x'd in price since I wrote this.
Badger-Purple@reddit
You didnt know about the RAM shortage, gpus shipped without the ram chips, etc? I’ve been following the story since Oct/November, and you are blatantly wrong about a whole PC set up WITH enough DDR5 AND a 5090, plus a processor as good as the one on the Strix, plus the wattage.
NeverEnPassant@reddit (OP)
96GB DDR5-6000 was like $500 when I wrote this, and yes such a system demolishes a Strix Halo in almost every workload.
sudochmod@reddit
173pp/30tg on vulkan with stx halo. Just did a quick llama bench earlier to see if it would fit. Just curious because one of the things I like about my Strix is being able to run 100gb models like that one.
Once they fix the ROCm issue with models larger than 64gb on the newer versions it should be significantly faster. 7.9 and 7.10 have a big speed up in PP and keeping TG stable at longer contexts.
NeverEnPassant@reddit (OP)
I have to run the benchmark with mmap on or I run out of memory.
I get ~1000pp/~25tg. If I can figure out how to run without mmap that pp would probably double.
sudochmod@reddit
Ah nice!
IORelay@reddit
The thing is Strix Halo is not cheap, and it has serious troubles running bigger dense models. It's kind of pointless for it to run 12-30B models at a decent speed because modern day 16GB GPUs can do it well also.
Independent-Band7571@reddit
Makes me wonder if the 96GB version is the smarter choice. Not much difference between the available models to run. Last time I checked, that extra 32GB cost $400-500 extra, apple-esque. Perhaps even an egpu setup could be considered at that price point.
audioen@reddit
No. It's an absolutely disastrous choice to go for 96 GB in my opinion. You can squeeze minimax-m2 into this machine, for example, and it still runs at some kind of usable speed because it is MoE, but it's already a tight cram at 128 GB on mobo and with GTT exposing about 120 GB of the RAM as VRAM.
Address-Street@reddit
Can you help me choose a good GPU setup for gaming and running Qwen 30B A3B Q6 (26 GB) with a 40k context?
Can you estimate their relative speed compared to each other for pp and tg at 40k context?
My system: 9950X, 64 GB DDR5-6000, PCIe 5.0 ×16 + PCIe 4.0 ×4.
sunole123@reddit
This argument of good and bad does not make sense. The value is in the price. Higher price is faster at small and medium size model. Lower price is for slower at the same sizes. Plus you get the ability to run larger models like 70b and 120b. So if all you need is fast then you are all set.
cleverestx@reddit
He doesn't get it.
NeverEnPassant@reddit (OP)
You clearly didn't read the post.
cleverestx@reddit
If someone offered to give me a RTX-5090 in exchange for my 128GB Strix Halo mini-PC, I would decline. It's that much better for my use case, running cachyOS and models these "fast" cards can only dream of...my OS is instant fast and way more capable with huge AI models with large contexts than even dual RTX set-ups can achieve. I wish it was as fast as the cards, but when those cards can't load the models and crawl down to less than 1 token/sec, Strix Halo WINS.
NeverEnPassant@reddit (OP)
Again, provide some numbers to back up your statements. I have done so.
cleverestx@reddit
It's just common sense addition and division. You can't load an 80B LLM fully in VRAM on your 24-49GB card when the model itself is 60GB; it spills over into your glacially slow RAM, whereas I have an easy 50GB to spare of way faster unified memory (compared to your RAM speed). Do the basic arithmetic dude.
NeverEnPassant@reddit (OP)
Do yourself a favor and actually read my post instead of being a fanboy. I provided real world numbers from both my system and also a Strix halo for a 60+GB model (gpt-oss-120b). I also broke down why my system wins so handily.
A 5090 + 96 DDR5-6000 annihilates Strix halo in pp (DDR5 speed is not the bottleneck for pp, pci5 bandwidth is), and it is fairly even on tg (slower at 0 context, but faster at high context). Again, I explain why in this post.
cleverestx@reddit
You do that while image generating with Z-Image and watching a YouTube video on a 4k monitor, then you can posture against a Strix Halo machine that does so easily.
NeverEnPassant@reddit (OP)
You are hopeless. blocked.
cleverestx@reddit
You are clueless. Ignored.
cleverestx@reddit
This is so misleading... "Maybe all you care about is chat bots with close to 0 context" - you've obviously never used one. I can run circles around your RTX-5090 with no (at least much higher quantized for bigger ones) LLMs and much larger contexts, or do you magically have 115GB of faster performance LOCALLY chatting with a 24-32GB or even dual card 48-64GB card(s)? Where is your context going to fit after loading all that up? You see the issue with this claim? You are making this up.
NeverEnPassant@reddit (OP)
I provided actual benchmark numbers in this post. Feel free to post your own and I will show you how my system compares, good or bad.
cleverestx@reddit
I can simply add. You can't fit larger 80-120b models in your super fast VRAM without crippling quantizing and sometimes not even then, so your speed becomes less than 1 token/sec... or you would have an argument worth entertaining against Strix Halo users who know this.
solidsnakeblue@reddit
I switched from my 2x3090 x 128GB DDR5 desktop to a Halo Strix and couldn’t be happier. GLM 4.5 Air doing inference at 120w is faster than the same model running on my 800w desktop. And now my pc is free for gaming again
PermanentLiminality@reddit
My 4x p102-100 rig is mostly shutoff due to my 50 cents per kWh power.
panchovix@reddit
Man I thought my 25 cents per kWh here in Chile was insane, but 50 cents? Where is that?
Swimming_Arrival5760@reddit
and i was here wondering why people care so much about the electricity lol
i pay 0,08 usd per kwh. i do use some 1200kwh monthly and it already hurts...but i could easily supply that for $2.5k and have 100% solar power.
FullOf_Bad_Ideas@reddit
do you live in the middle of a desert?
fallingdowndizzyvr@reddit
Probably California. 50 cents per kwh isn't even really that high. In one place in California, when all the stars align the top rate is about $2/kwh.
panchovix@reddit
Oof, like I consumed 330 kwh past month. That would be 660 USD at that price lol.
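The arithmetic behind these comparisons is simple enough to sketch. The rates and the 330 kWh figure below are the ones quoted in this thread; the continuous-draw helper and its 100 W example are only illustrative assumptions (24 hours a day, 30-day month).

```python
def monthly_cost_usd(kwh_per_month: float, usd_per_kwh: float) -> float:
    """Energy used in a month times the price per kWh."""
    return kwh_per_month * usd_per_kwh


def kwh_per_month(avg_watts: float, hours_per_day: float = 24, days: int = 30) -> float:
    """Energy used by a device drawing avg_watts for the given hours and days."""
    return avg_watts * hours_per_day * days / 1000


if __name__ == "__main__":
    usage_kwh = 330  # monthly consumption quoted in the thread
    for rate in (0.08, 0.25, 0.50, 2.00):  # USD per kWh, rates quoted in the thread
        print(f"${rate:.2f}/kWh -> ${monthly_cost_usd(usage_kwh, rate):.2f} per month")

    # Hypothetical: a box idling at 100 W around the clock.
    print(f"100 W idle -> {kwh_per_month(100):.0f} kWh per month")
```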
Nice_Grapefruit_7850@reddit
That's pretty insane, at least 3x what I pay.
Educational_Sun_8813@reddit
exactly... https://www.reddit.com/r/LocalLLaMA/comments/1osuat7/benchmark_results_glm45air_q4_at_full_context_on/
MitsotakiShogun@reddit
I'm on a 4x3090 system and the difference in speeds is even higher than what OP saw (not to mention under batch cases, performance would only go up). The ~170-180W vs ~1000W difference gets easily recouped by the time saved. What is not recouped is the ~25W vs >200W idle power, which was why I bought a 395 in the first place.
No-Statement-0001@reddit
i have my LLM box (linux) suspend on a cron job and wrote an openai api compatible wake-on-lan proxy. Everything is automatic. My box idles at 130W and suspends down to 6W.
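For readers curious about the wake-up half of a setup like this, here is a minimal sketch of a Wake-on-LAN sender. It is not the commenter's proxy, just the standard magic packet (6 bytes of 0xFF followed by the target MAC repeated 16 times) sent as a UDP broadcast. The MAC address, broadcast address, and port 9 in the usage comment are placeholder assumptions.

```python
import socket


def build_magic_packet(mac: str) -> bytes:
    """Build a Wake-on-LAN magic packet: 6 x 0xFF, then the MAC repeated 16 times."""
    cleaned = mac.replace(":", "").replace("-", "")
    if len(cleaned) != 12:
        raise ValueError(f"expected a 6-byte MAC address, got {mac!r}")
    return b"\xff" * 6 + bytes.fromhex(cleaned) * 16


def send_magic_packet(mac: str, broadcast: str = "255.255.255.255", port: int = 9) -> None:
    """Send the magic packet as a UDP broadcast on the local network."""
    packet = build_magic_packet(mac)
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.setsockopt(socket.SOL_SOCKET, socket.SO_BROADCAST, 1)
        sock.sendto(packet, (broadcast, port))


# Usage (placeholder MAC; the target NIC must have Wake-on-LAN enabled):
# send_magic_packet("00:11:22:33:44:55")
```

A proxy on an always-on device could call something like `send_magic_packet` when a request arrives, wait for the LLM box to answer, and then forward the request.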
MitsotakiShogun@reddit
Not a solution for a server that does other things too. Also where does the proxy run if your main device is sleeping?
No-Statement-0001@reddit
on a raspberry pi
MaruluVR@reddit
Have you tried installing a Windows VM for idle usage?
Windows has way lower idle power consumption for GPUs my 5090 idles at 30W on linux but only 2W on windows. (You can also use WSL in windows with your GPUs if you dont want to switch between two VMs)
MitsotakiShogun@reddit
Even if I did use a Windows VM, I wouldn't be passing the GPUs through because I want to run vllm/sglang, and I don't think Windows is supported.
And since this is a server, I'm not going to go Windows as a base either, and thus no WSL. I am doing that on my PC, and it's mostly okay, but there is obviously some performance loss in WSL.
NeverEnPassant@reddit (OP)
Yeah, I'm super jealous of the idle power consumption!
MitsotakiShogun@reddit
You're already doing fine. 5090 idle is probably ~10W? The 3090s start low but after any load they get stuck at ~20W idle until the next reboot. But the speeds... Here are some numbers on 55-60K context length across a few models with vLLM:
DeltaSqueezer@reddit
I got mine idling at 9W now: https://www.reddit.com/r/LocalLLaMA/comments/1k7m902/further_explorations_of_3090_idle_power/
AppearanceHeavy6724@reddit
If you run linux, there is a trick to mitigate that:
https://www.reddit.com/r/LocalLLaMA/comments/1kd0csu/solution_for_high_idle_of_30603090_series/
NeverEnPassant@reddit (OP)
My entire computer idles at closer to 100W
lightningroood@reddit
there should be plenty of room for improvements. my system with 5060ti idles at 25w. at idle, 5090 shouldn't be consuming that much more power in comparison with 5060ti.
AppearanceHeavy6724@reddit
Strange, 3090 usually idle at ~20W.
MitsotakiShogun@reddit
My bad, that 200W was for the whole system, so 4 of them plus all the other components.
Eugr@reddit
Your Strix Halo numbers are off. Here is my latest gpt-oss-120b numbers on llama.cpp with ROCm 7.10:
starkruzr@reddit
wonder how much better it's gotten now.
Eugr@reddit
They actually got much worse with ROCm 7, but ROCm 6.4.4 got to the same levels as they've been with ROCm 7 before.
At the same time, DGX Spark pp performance on gpt-oss-120b has been improved by 50%.
coder543@reddit
DGX Spark for comparison in case future readers stumble across this thread:
Eugr@reddit
Try to recompile llama.cpp with Blackwell optimizations on. Here are the latest numbers on Spark:
coder543@reddit
What optimizations?
This is my command line: cmake -B build -DGGML_CUDA=ON -DLLAMA_CURL=ON -DCMAKE_CUDA_ARCHITECTURES=121a-real && cmake --build build --config Release -j
Eugr@reddit
Ok, this should compile the Blackwell kernels and you should get pp numbers similar to mine.
coder543@reddit
Hmm, what's strange is I found this nvidia thread where you also commented, and I'm able to reproduce the 4500 tok/s PP for GPT-OSS-20B that's shown at the top of that thread, but I'm still not getting above 2000 tok/s PP for GPT-OSS-120B.
I tried recompiling with a few different flag variations on the latest upstream.
Not sure what's going on, but I would like to have 2400 tok/s PP.
Eugr@reddit
What are your llama-bench params? Gpt-oss-120b likes -ub 2048
coder543@reddit
I guess I just needed to crank up the batch sizes.
$ llama-bench -m ggml-gpt-oss-120b-mxfp4.gguf -ngl 999 -fa 1 -p 2048 -n 0 -mmp 0 -r 1 -b 1024,2048,4096 -ub 512,2048,4096,8192
ggml_cuda_init: found 1 CUDA devices:
Device 0: NVIDIA GB10, compute capability 12.1, VMM: yes
| model | size | params | backend | ngl | n_batch | n_ubatch | fa | mmap | test | t/s |
|---|---:|---:|---|---:|---:|---:|---:|---:|---:|---:|
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | CUDA | 999 | 1024 | 512 | 1 | 0 | pp2048 | 1829.12 ± 0.00 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | CUDA | 999 | 1024 | 2048 | 1 | 0 | pp2048 | 2226.66 ± 0.00 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | CUDA | 999 | 1024 | 4096 | 1 | 0 | pp2048 | 2238.81 ± 0.00 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | CUDA | 999 | 1024 | 8192 | 1 | 0 | pp2048 | 2223.70 ± 0.00 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | CUDA | 999 | 2048 | 512 | 1 | 0 | pp2048 | 1842.51 ± 0.00 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | CUDA | 999 | 2048 | 2048 | 1 | 0 | pp2048 | 2470.05 ± 0.00 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | CUDA | 999 | 2048 | 4096 | 1 | 0 | pp2048 | 2477.28 ± 0.00 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | CUDA | 999 | 2048 | 8192 | 1 | 0 | pp2048 | 2475.72 ± 0.00 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | CUDA | 999 | 4096 | 512 | 1 | 0 | pp2048 | 1850.29 ± 0.00 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | CUDA | 999 | 4096 | 2048 | 1 | 0 | pp2048 | 2470.78 ± 0.00 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | CUDA | 999 | 4096 | 4096 | 1 | 0 | pp2048 | 2480.44 ± 0.00 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | CUDA | 999 | 4096 | 8192 | 1 | 0 | pp2048 | 2471.27 ± 0.00 |
coder543@reddit
Maybe I need to recompile. It’s good that there is potential for even more.
NeverEnPassant@reddit (OP)
The numbers were only like 2-3 weeks out of date. I already added an edit at the end of the post with updated numbers.
Eugr@reddit
I was getting similar numbers to the ones I posted two weeks ago too.
The problem with Strix Halo (and DGX Spark to some extent) is that the platform support is not mature yet, so if you just take an off the shelf llama.cpp build (or worse, Ollama), you may not get the best performance.
Even with ROCm, performance degradation is much higher if you use rocWMMA that was highly recommended by some people and that indeed increases performance, but only on short contexts. There is a fix, but it won't be merged because the whole Flash Attention on ROCm support in llama.cpp is getting reworked.
AppearanceHeavy6724@reddit
No, the problem is ass bandwidth, and half-ass compute. There is no way clever patches to llama.cpp can fix sub-300 GB/sec bandwidth.
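The bandwidth point can be sketched as a simple ceiling: single-request decode cannot go faster than memory bandwidth divided by the bytes that must be read per token. The per-token sizes below are the gpt-oss-120b figures from the original post; the 256 GB/s figure is an assumed theoretical peak for a 256-bit bus at 8000 MT/s, and KV-cache reads (which grow with context) are deliberately left out.

```python
# Bandwidth-only ceiling on single-request decode speed.
# Per-token read sizes are taken from the original post; the bandwidth
# figure is an assumed theoretical peak, not a measurement.

DENSE_GB_PER_TOKEN = 0.76     # shared layer data read for every token (from the post)
EXPERT_GB_PER_TOKEN = 1.78    # 4 active experts read per token (from the post)
STRIX_HALO_PEAK_GBS = 256.0   # assumed: 256-bit bus at 8000 MT/s


def decode_ceiling_tps(bandwidth_gbs: float, gb_per_token: float) -> float:
    """Tokens/s if decode were limited purely by reading weights from memory."""
    return bandwidth_gbs / gb_per_token


if __name__ == "__main__":
    per_token = DENSE_GB_PER_TOKEN + EXPERT_GB_PER_TOKEN
    ceiling = decode_ceiling_tps(STRIX_HALO_PEAK_GBS, per_token)
    print(f"{per_token:.2f} GB/token -> ceiling ~{ceiling:.0f} t/s at zero context")
```

Software improvements can move measured numbers closer to this ceiling, but under these assumptions they cannot move the ceiling itself.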
Educational_Sun_8813@reddit
there is already AMD NPU support for insiders available, hopefully it will get public soon
Eugr@reddit
Yes, but it doesn't even give you that performance, unless you tinker with it.
AppearanceHeavy6724@reddit
There is a Russian expression "v sortah govna ne razbirayus", "I am not expert in grades/types of shit".
Eugr@reddit
"one man's garbage is another man's treasure"
NeverEnPassant@reddit (OP)
I'm only seeing tg improvements compared to the numbers I posted. pp numbers are still too slow to be useful for mid to high context.
AppearanceHeavy6724@reddit
I think this is a meaningless battle; those minipcs are terrible for LLMs except for narrow cases with ultrasparse models, like oss-120b, but even then they suck at PP. People in this sub seem to lack knowledge of basics, as they believe that better software support somehow compensates for weak hardware.
avl0@reddit
Ok but doesn’t this now show that the strix is better than your machine which costs x2 more for any context size up to 20k?
NeverEnPassant@reddit (OP)
No, tg is close now, but pp is still unusably slow for mid to large context.
avl0@reddit
That's just not true, a 5090 on its own is 2.9k euros compared to the most expensive mini 395 option (framework) for 2.4k.
Impossible_Ground_15@reddit
I have a minisforum s1 max on its way and look forward to putting it through its paces
sudochmod@reddit
You’ll love it. Don’t listen to this guy. Everyone I know with Strix halos loves them. AMD is making ROCm better and better. They just sent some Strix halos out to llamacpp maintainers to have them see what performance optimizations they can make.
The concept of spending another 2k for a 5090 is wild. You literally can’t beat the value of a Strix halo system. I got mine for 1650 awhile back and it’s my daily driver. Aside from AI, I have 128gb of super fast ram paired with a cpu that is almost as performant as a 9950. Even as a home lab it’s an insane deal.
cleverestx@reddit
Agreed. This guy is missing the point and superior use of these amazing systems.
AppearanceHeavy6724@reddit
What to love if your PP is 300 t/s?
sudochmod@reddit
That depends on the model. Even then with prompt caching it really isn’t that bad. Up to you though.
AppearanceHeavy6724@reddit
nah, for coding anything below 500 t/s is outright unusable, and below 1000 t/s is not serious.
CYTR_@reddit
Low effort bait
AppearanceHeavy6724@reddit
Low effort bait
NeverEnPassant@reddit (OP)
It's not another 2k. It's more like a bit over $1k more.
sudochmod@reddit
For a 5090?
NeverEnPassant@reddit (OP)
$2k for a 5090 + w/e else for a computer.
Rich_Repeat_22@reddit
You will love it. Also use Lemonade. Even gpt-oss-120b-mxfp-GGUF is supported for hybrid execution.
So not only is the iGPU being used, but the whole APU including the NPU.
dragonbornamdguy@reddit
What is token speed for hybrid vs gpu only?
cafedude@reddit
You don't mention much about price here ("If you can afford an extra $1000-1500" and then "WOW! The ddr5 kit I purchased in June has doubled in price since I bought it. Maybe 50% more is now an underestimate.").
Not all of us can afford that extra $1000 to $1500 (and probably much more now), so the Strix Halo is in the sweet spot for us.
NeverEnPassant@reddit (OP)
Ya, RAM prices changes everything. My DDR5 kit went from $320 in June to almost $1200 now.
I still think Strix Halo is not very useful, but 5090 + DDR5 is now a LOT more expensive, so I dunno.
Impossible_Ground_15@reddit
I'd like to see AMD increase the memory bus from 256-bit to 1024-bit. That's what Apple does with its memory interface so Mac Studios are way faster for inference with their on package memory
UmpireBorn3719@reddit
Your prefill performance is good. What CPU are you using?
NeverEnPassant@reddit (OP)
9950x, but CPU is irrelevant for prefill
GPU and pcie bandwidth is what matters
No-Weird-7389@reddit
Maybe mmap matter, what is your pp if you turn on mmap?
NeverEnPassant@reddit (OP)
mmap gives me a big slowdown, close to 2x
SameIsland1168@reddit
This sounds like you just have no interest in learning anything about strix halo usage. Perhaps we should seek feedback from people who wish to actually learn things properly.
NeverEnPassant@reddit (OP)
What a child you are.
SameIsland1168@reddit
:( name calling won’t fix your inability to learn things
SocialDinamo@reddit
I currently have a 3090 + 5060ti setup and have a framework 395 128gb coming in Tuesday! As excited as I am to run gpt oss 120b at solid speeds, I’m more excited for what it can run 6 or 12+ months from now
-dysangel-@reddit
everyone kept telling me how terrible my Mac is, but I always see people on here being excited about getting 7tps on tiny models..
Goldkoron@reddit
The real answer is to connect 4 5090s to your strix halo.
I kid I kid, 4 3090s is fine.
My current build is 128gb strix halo with 2 3090s and 1 48gb 4090, which is letting me load larger models like GLM-4.6
NeverEnPassant@reddit (OP)
How can you connect 3 GPUs to the Strix Halo?
Goldkoron@reddit
There are usb4 docks on amazon that can daisy chain up to 2 per usb4 port.
How does the llama-bench command work? My setup is actually partially down right now since I need to swap a dock out.
notdba@reddit
Interesting.. How do you split the weights across the 3 GPUs and the iGPU? Can you share some performance number? Also, most importantly, during prompt processing, is it possible to keep the 3 GPUs 100% busy at the same time?
Due to the insane RAM price increase, I will probably be stuck with the 128gb strix halo for awhile. OP and I previously explored the PCIe performance bottleneck in a single GPU scenario, but I guess we haven't looked into how multiple GPUs may help to improve the performance.
Goldkoron@reddit
I should add my llama-server args for loading look like this (with vulkan being igpu):
-dev cuda0,cuda1,cuda2,vulkan0 --no-mmap -ts 24,24,48,48 -fa on
notdba@reddit
With the -ts 24,24,48,48 split due to the 64gb limitation on Windows, the strix halo is only handling 1/3 of the workload, thus the overall performance is pretty good.
With let's say a -ts 24,24,48,120 split, then I think the limitation of the strix halo will be much more apparent.
Goldkoron@reddit
Yeah, and of course it can be adjusted freely per model, like some models I am only offloading around 5-10% of the model to strix halo.
As long as I am getting more than 10t/s I am generally happy though.
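The tensor-split values discussed above are proportions, so each device's share is its value divided by the sum. A small sketch, using the device order from the llama-server args (cuda0, cuda1, cuda2, vulkan0) and the two splits mentioned in this exchange:

```python
def split_shares(weights: list[float]) -> list[float]:
    """Convert tensor-split proportions into per-device fractions of the model."""
    total = sum(weights)
    return [w / total for w in weights]


if __name__ == "__main__":
    devices = ["cuda0", "cuda1", "cuda2", "vulkan0"]  # vulkan0 is the Strix Halo iGPU
    for split in ([24, 24, 48, 48], [24, 24, 48, 120]):
        shares = split_shares(split)
        summary = ", ".join(f"{d}={s:.1%}" for d, s in zip(devices, shares))
        print(f"-ts {','.join(map(str, split))}: {summary}")
```

With the first split the iGPU holds one third of the weights; with the second it would hold more than half, which is where the slower device would be expected to dominate.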
Goldkoron@reddit
Even when using tensor split with llama-cpp, GPUs never seem to hit 100% busy during prompt processing, but it's not too bad overall.
On Qwen3 VL 235B I get over 20t/s from Q4.
On GLM-4.6 IQ3 XS I get around 15-16t/s
GLM-4.5 air is around 30t/s.
Prompt processing gets slowed proportionally to how much of the model is on strix halo of course.
For splitting weights, at least on Windows there are some bugs with both rocm and vulkan that prevent you from using more than 64gb from 8060S igpu. Seems to be related to AMD splitting into multiple memory heaps of 64GB size and llama-cpp only sees the first one.
profcuck@reddit
This is a great post even if I am only partly persuaded. I'd love to see more posts with similar detail for people trying to judge the best buy at various price points.
I just did a spot check on Google and the cheapest 5090 I can find is $2350 while 128gb Strix Halo boxes are right around $2000. So I am not fully persuaded that your build costs only 1000-1500 more.
And if you're up to 3500 you are now in Mac Studio territory, which comes with its own strengths and weaknesses of course.
I think there's little doubt that for 2000, Strix halo wins in many cases. And for 5000, Mac M4 Max is hard to beat for inference (some caveats of course).
sudochmod@reddit
Pretty sure you can get Strix halos for around 1800. I think the bosgame is still cheaper. Microcenter had the evox2 for 1800 awhile ago.
I picked up my Strix for 1650 from Nimo(they’re sold out now)
fallingdowndizzyvr@reddit
What is this Nimo?
sudochmod@reddit
It’s a stock Sixunited variant.
fallingdowndizzyvr@reddit
I just checked. It's back in stock.
"Availability: 97 in stock"
profcuck@reddit
Link? I am searching and may find it but instructions unclear!
fallingdowndizzyvr@reddit
LOL. Dude, you are the one that brought them up since you bought them, and you don't know the link? I just googled "nimo strix halo" and this came up.
https://www.nimopc.com/products/nimos-smallest-office-gaming-ai-pc-amd-ryzen-ai-max-395-up-to-5-1-ghz-128gb-lpddr5-8000mhz-16gb-8-2tb-4tb-ssd-with-3-performance-modes-up-to-120w
profcuck@reddit
I'm not the dude that brought them up! Different dude did.
fallingdowndizzyvr@reddit
I guess you missed this?
"🎫Pre-order will save $330"
What does that big blue button say right after "quantity"?
profcuck@reddit
It says that, but I went all the way through to just before paying and there was a slot for a discount code (which I couldn't find anywhere) that might have taken off the 330. Had I finished, there was no discount.
fallingdowndizzyvr@reddit
Yes it does.
"🎫Pre-order will save $330"
Contact them and ask them how to do it. Make a screenshot of that page and keep it on hand.
NeverEnPassant@reddit (OP)
It's not hard to find a 5090 for $2000 if you are willing to wait 1-2 weeks.
Ok-Representative-17@reddit
You considered RAM (100GB/s) as the bottleneck for speed but in reality it is the pcie 5.0 at 64GB/s. This will decrease your net theoretical speed where 47% is served from RAM.
Also you did not calculate for multiple KV cache sizes. You have taken only 20k context, but in actual tasks the context grows way faster, which is the issue. If you could calculate for multiple context sizes it would be fair: 20k, 40k, 60k, 80k, 100k, 150k, 200k.
NeverEnPassant@reddit (OP)
I said that pcie is the bottleneck for prefill.
Ok-Representative-17@reddit
Yes... You are just getting better results as hit ratio is 50% that makes average of 1600GB/s and 64GB/s. But what happens when context increases? What is the use of spending 3-4k if you are gonna get so limited context.
I don't mean it's not a good setup it is good when you are not going to do agentic workflow, but once you try agentic workflow it's all about lot of vram.
NeverEnPassant@reddit (OP)
Im very confused. The prefill numbers I posted are excellent even at large context. The percent of data read from vram per token only increases as context size grows. An rtx 6000 with everything in vram is only like 25% faster for prefill.
Ok-Representative-17@reddit
What all context size have you tested this at?
NeverEnPassant@reddit (OP)
I have tested the full context of gpt oss 120b: 128k
It does get progressively slower, but never unusable.
Ok-Representative-17@reddit
Can you compare at multiple contexts strix halo vs PC?
I feel strix will perform better at larger context when model hit rate in Ram gets below 30%
NeverEnPassant@reddit (OP)
This very post does that for context size 0, 20k, 48k.
Ok-Representative-17@reddit
I found out gpt-oss-120B has a max context size of 128k tokens and the Vram required at FP16 is 9.5GB, which makes your point quite clear: even at max token size you would fit 20GB in Vram and the rest (30) offloaded. This will result in slower speed, but not by much, as the expert hit will drop from 53% to ~30%. Due to this you are getting good speed.
I am planning to build a PC or buy mac or strix halo. So I am still researching. I had no idea there is a max context window. I am doubting if I should even do this.
The problem is I am a developer and I was looking for agentic workflow. I assumed context size is infinite till today. I currently use gemini CLI and need 100k-200k context easily, I was planning this workflow so I could get more context so I was leaning towards halo, I was wrong. I am now at the stage to decide are open source models even worth it? Cause I get 1M context from gemini, which I generally need 200-300k.
NeverEnPassant@reddit (OP)
I can fit a full 128k context on my GPU and still have room for 1/3 experts so that's how I ran the benchmarks. I can actually get slightly better numbers if I put more experts on the GPU, but I didn't do that as it would feel like cheating to me.
Ok-Representative-17@reddit
Yeah. Like I said almost 30%. For what all you use this setup? Like what do you plan to do via llm?
NeverEnPassant@reddit (OP)
coding agents
Ok-Representative-17@reddit
DM ing u
NeverEnPassant@reddit (OP)
I'd rather just keep this to threads.
Ok-Representative-17@reddit
Ok.
Give me idea about the tech stack you are working and the kind of agentic you were able to achieve with 128k context window?
NeverEnPassant@reddit (OP)
I am building something that uses APIs and I wanted something where I don't have to worry about API costs and I want low latency.
For actual coding, claude/codex are going to be superior and cheaper than buying hardware.
Ok-Representative-17@reddit
Cheaper and better is obviously the API. But I wanna know what local models are capable of. Did you give it any coding task? Was the 128k context window enough?
NeverEnPassant@reddit (OP)
It's good enough for my testing. Im not yet to the point to compare to a frontier model.
Ok-Representative-17@reddit
How good, can you give me a usecase where you used it for ABC task. I understand it's basic, but want to know what all it can perform.
NeverEnPassant@reddit (OP)
Just use openrouter and try it yourself for very cheap.
Ok-Representative-17@reddit
Ok. Thanks for gatekeeping
NeverEnPassant@reddit (OP)
gatekeeping?
Red-Pony@reddit
I mean, you kinda need to compare rigs of similar price, no?
NeverEnPassant@reddit (OP)
As much as I hate metaphors, some people need them:
Imagine you could buy a car that can go 40mph for $20000 or a car that can go 80mph for $30000.
pablo_chicone_lovesu@reddit
Never do a car comparison.
What if my daily drive only has a 35 mph speed limit? Why buy a car that won't ever go over that limit?
NeverEnPassant@reddit (OP)
Is your use case chat bots with no context or being ok waiting many many minutes for prompt processing?
Fywq@reddit
That's fine and all but if the budget for a car is only $15,000 so $20,000 is already pushing it and then the $30,000 race car also needs a significantly more expensive type of gasoline...
I can appreciate the point that 5090 is better, but for the price of that alone I could probably add a 5060ti 16GB to the shopping cart too and still get a Strix Halo setup. Hell at the current RAM price squeeze I could get somewhere between a 5070 and a 5070ti just for the price of RAM to a new 5090 machine.
alfentazolam@reddit
... and power draw
biggiesmalls29@reddit
Yeah no offense but you're advising people to go spend potentially thousands more and draw a heap more power for doing inference.. I don't see the point, in my case I got a rock solid super fast tiny desktop that is elite for the money.. it's not giving me frontier model speeds for inference but it's def unreal for playing with local models without breaking the bank. I'm far happier with this over building a desktop to match the speeds of my strix with a 5090 on top of that.
NeverEnPassant@reddit (OP)
My argument is that the Strix Halo is barely useful at all, and you can just spend 50% more for a gigantic leap in functionality. Or maybe skip it altogether if you are not looking to spend that much.
Also, if you ever want to sell the machine, Strix Halo will depreciate much harder.
Swimming_Arrival5760@reddit
i am doing a lot of research before choosing and im reaching the conclusion that it is indeed useless. i want to feed thousands of tokens for coding...and if i have to pay 100% more to get 10x the speed, it is very surely a good deal...whereas putting 2k usd to have something unusable is just pointless. im not sitting minutes waiting for the prompt processing for the large contexts i will be using. it surely seems like a toy usage for llms.
NeverEnPassant@reddit (OP)
Btw, if you do build a computer, pcie5 is really important.
NeverEnPassant@reddit (OP)
Yep! I think this is the only reasonable conclusion.
starkruzr@reddit
I think I see the confusion here. You see, the problem you're running into is that your argument is bad and dumb. HTH GL HF
biggiesmalls29@reddit
I dunno what driver stack you used but "barely" useful at all is a dramatic over estimate.. then you say skip it all together and go for a 5080? So what is it? You've obviously made your mind up about strix halo not being good enough for your use case but your alternative is thousands more.. the math doesn't math
AppearanceHeavy6724@reddit
It is terribly slow for anything dense >14B.
starkruzr@reddit
your information is really outdated.
AppearanceHeavy6724@reddit
You lack knowledge of fundamentals. 14B models on 270 Gb/sec hardware would barely make 20 t/s on empty context and degenerate to 12 t/s on 16k context. There is no way around it.
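The bandwidth argument here can be written out as a rough upper bound: during decode a dense model has to stream every weight once per token, so tokens per second cannot exceed bandwidth divided by model size. This is only a sketch. The 1 byte per weight figure (roughly Q8) is an assumption, since the comment does not name a quantization, and the bound ignores KV cache reads and compute, which is where the extra slowdown at 16k context would come from.

```python
def decode_tps_upper_bound(params_billion, bytes_per_param, bandwidth_gb_s):
    """Bandwidth-bound ceiling on decode speed for a dense model.

    Every generated token reads all weights once, so the ceiling is
    memory bandwidth divided by the size of the weights.
    """
    model_gb = params_billion * bytes_per_param
    return bandwidth_gb_s / model_gb


# 14B dense model, ~1 byte per weight (assumed), 270 GB/s as quoted above.
print(f"{decode_tps_upper_bound(14, 1.0, 270):.1f} t/s ceiling at empty context")
```

A smaller quantization raises the ceiling proportionally, so the exact figure depends on which quant is being discussed.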
alfentazolam@reddit
Agree. It's the perfect sweet spot of large model useability (128GB unified RAM), heat/noise/power draw and cost. Comparing anything else at this stage will usually result in at least one significant trade-off (possibly >1). Apple's the only competition and ecosystem wise it's apples to oranges.
ASYMT0TIC@reddit
"This is well worth it because prefill is compute intensive and just running it on the CPU is much slower."
Software support aside, would the X4 handicap for the dGPU be mitigated to any extent by running the RAM experts on the iGPU instead of cpu during prefill, so splitting between the dGPU and IGPU not dGPU and CPU?
NeverEnPassant@reddit (OP)
Good question. Maybe it already even does that since the Strix Halo is GPU. If it were possible I would only expect a modest speedup since only 1/3 of the experts can realistically fit on the GPU.
perelmanych@reddit
So many hateful comments. Change in his setup RTX 5090 to RTX 3090 and you will get 70% of his performance at -2k dollars.
NeverEnPassant@reddit (OP)
Maybe the new AMD card would be a good match, too.
Awwtifishal@reddit
That's NOT true. It's just the hidden state being transferred back and forth, which is much smaller, and that's only during generation.
That's NOT true either: prefill doesn't use the experts so you can have all the attention and shared tensors on the GPU, therefore PCIe bandwidth is irrelevant. If it varies for you, then you have something misconfigured.
NeverEnPassant@reddit (OP)
Prefill does use the experts. It really does what I said. Ive also measured pcie traffic to my gpu during inference. it sends a lot of data to the gpu.
Im away from my computer atm, but from memory:
mmap off fa on batch 4096 ubatch 4096 prompt 4096 ngl 99 n cpu moe 24
Awwtifishal@reddit
Ok, I was mistaken on the second part, but correct on the first one.
NeverEnPassant@reddit (OP)
You are still mistaken. Unless you specify --no-op-offload, then llama will send all the expert layers (which is 57/59GB of gpt-oss-120b's model) to the gpu for any batch >= 32 tokens and pcie bandwidth becomes the bottleneck.
If I do specify --no-op-offload then my pp drops from 4100 to 217.
arentol@reddit
What I find interesting about this post is that it is titled "Why the Strix Halo is a poor purchase for most people", but nowhere in it does it establish why people would consider purchasing a Halo, what the most common probable use cases for it are, and how it is a poor choice for most of those use cases. How can you say it's a poor purchase for most people without establishing what most people who might buy it want to be able to do and what issues and limitations they might be running into with it versus a 5090 or other options?
I have a Halo. I also have a 5090 (and an RTX Pro 6000 too). For the purpose for which I purchased it the Halo is WAY more useful and considerably faster than the 5090. The 6000 could of course destroy it at the same uses, but then I would be wasting the 6000 on something that the Halo can do well enough. That way the 6000 can do things that take actual speed on top of a large amount of VRAM.
Your argument doesn't support your thesis, and you clearly don't understand nearly as much about this stuff as you want to believe you do... You might know some technical details, but you don't understand hardly any of the MANY ways in which people can use these tools.
Icy-Pay7479@reddit
can you explain how you're using the strix, 5090, and 6000, and why each use case is the best fit for the hardware?
NeverEnPassant@reddit (OP)
I did my best to quantify the useful cases of the Strix Halo: Low context LLM inference. I just don't think that is worth the price. How about you tell me what it is useful for instead of being vague.
MarkoMarjamaa@reddit
You are running the quantized version, not F16? In these tests it should always be mentioned.
ShengrenR@reddit
there is no fp16 release: https://huggingface.co/openai/gpt-oss-120b is the OG and it comes to you pre-quantized to mxfp4;
Per the model card:
The models were post-trained with MXFP4 quantization of the MoE weights, making gpt-oss-120b run on a single 80GB GPU (like NVIDIA H100 or AMD MI300X) and the gpt-oss-20b model run within 16GB of memory. All evals were performed with the same MXFP4 quantization.
MarkoMarjamaa@reddit
No. Unsloth has an f16 version that contains the original released version, and part of that is f16 weights. The experts part is already quantized by OpenAI.
The version that Unsloth released as MXFP4 is in fact Q8, because the parts that were f16 in the original are q8. The experts are MXFP4.
That's why the version released as 'mxfp4' is faster than the 'f16' version, even when both have the same expert weights.
iLaurens@reddit
PP4096 @ d20000 or the other one with even longer context is a weird metric. What are you even measuring at this point?
Prompt processing with 4096 means the speed at which you can process a context of 4096 tokens. What does it mean to process 4096 tokens after 20000 tokens? Aren't you processing 204096 tokens at this point? Or do you first calculate 20000 tokens, store them in KV cache and then add 4096 tokens at once and process those?
NeverEnPassant@reddit (OP)
The latter.
iLaurens@reddit
But that's not a realistic scenario anyone would encounter. At most maybe in a tool calling situation where a tool produces a bunch of additional context?
NeverEnPassant@reddit (OP)
Agentic coding hits this use case for sure.
But also it mostly tells you the same thing as a pp20000 test.
arekku255@reddit
This reads as though someone started with the conclusion and then went looking for evidence to support it.
When viewed on its own, the actual benchmarks for the Strix Halo show a perfectly capable inference machine, the data simply doesn’t support the stated conclusion. Even with the performance drop-off at larger context sizes, the Strix Halo still delivers perfectly acceptable inference speed for most use cases.
The benchmark highlighted in the post focuses on a narrow, worst-case configuration, which makes it feel a bit cherry-picked. I could just as easily cherry-pick benchmarks where the Strix Halo absolutely smokes the 5090.
Moreover, that alternative configuration belongs to an entirely different market segment. The Strix Halo targets the low-cost segment, while the alternative targets the high-end market. If anything, the Halo should be compared to its most direct competitor, the DGX Spark.
The Strix Halo’s unique selling point is that it offers a high-memory inference machine without making your bank account cry.
NeverEnPassant@reddit (OP)
The prefill is really slow which severely limits the use cases.
Amblyopius@reddit
I look at my HP ZBook Ultra G1a that I got for about the cost of an RTX 5090. I've no issues at all coming up with use cases where it will totally trash that desktop with a 5090. For starters, it's quite easy to take it anywhere.
You've also just demonstrated a difference in benchmarks. Cool, but that really tells us nothing as to how one is "barely useful" and the other is "extremely useful". Barely useful for what? What exactly is your actual use case? E.g. at a context of 20000, speed is half for generation. That's unlikely to make a massive difference. So then it has to be context and preprocessing, but depending on the use case that's a one-off.
There's definitely plenty a 5090 is good for (I have a desktop with a 4090 myself) but you've oversimplified this quite a bit.
nostriluu@reddit
Strix Halo is a laptop chip, it makes a lot of sense there, even past LLM use since generally it's much faster than other x86 CPUs. If you're going to have something plugged into the wall on your desk all the time, might as well have proper expansion and higher power limits with more robust cooling.
From what I've seen, quite a few people would buy a Thinkpad with Strix Halo, including myself, though in a few months I'll be in a holding pattern again for Strix Medusa.
Amblyopius@reddit
I had a reduced base starting price via work and HP was running a promo on any Workstation class desktop/laptop which added a reduction on top of that. In immediately available UK Strix Halo options the laptop was the same price as a desktop Strix Halo but in a (for me) far more convenient form factor.
I was overdue a personal laptop upgrade anyway so I bit the bullet. Fingers crossed that Strix Medusa is good enough to see it as a valid desktop upgrade.
chisleu@reddit
I bought a 64GB version to run Qwen 3 Coder and I'm getting really poor performance. Only the CPU driver worked out of the box with LM Studio with very low TPS. I installed ubuntu last night and plan to try to compile llama.cpp with rocm or vulkan, but I haven't found a guide. Rocm looks to be a pain in the ass to pull off.
CUDA is so much easier, but I miss everything just working on Mac...
abnormal_human@reddit
I generally agree. People love to hate NVIDIA but if you have the budget and you’re serious there’s really no alternative. For a hobbyist whose only concern is to run models for interactive chat, the AMD system isn’t the worst thing but it’s not magical and I would argue that in most of those cases a Mac is the superior choice.
Badger-Purple@reddit
Quality comment here. You can get an M3 ultra refurb’d for 3500, or an M2 ultra for 3000 in ebay.
And it can run OSS120 with room to spare: it will toast your bread, make you a pizza and suck your d…no, wait, Tim Cook has not put that feature in yet. Yet.
starkruzr@reddit
that M3 Ultra refurb is twice the price of a STXH machine.
Badger-Purple@reddit
Yes, at 4x the memory bandwidth, with TB5, 10Gb ethernet, 60 core GPU vs ?40 in STRX395 etc. Yes there is a difference in price. It’s like having a 4080 with 96GB memory though, plus a whole computer.
sudochmod@reddit
Ehhhh I use my strix halo with local agentic coding. I’ve had no real issues with it. Even smaller models are decently fast on it. To each their own. But I could also throw another GPU on it and run a smaller model directly on that too.
I love mine and I never use it for local chat :D
Karyo_Ten@reddit
Coding on a Mac with 540GB/s mem bandwidth felt too slow already due to slow prompt processing, making it too painful as soon as repos become medium-sized.
sudochmod@reddit
It depends on the tool you’re using. Aider runs pretty fast because of how it manages context and I’ve also made a agentic coder in powershell that minimizes context for those operations. YMMV but I love mine.
Ok-Adhesiveness-4141@reddit
NVIDIA is too expensive for many, spending 4000 USD is not a joke.
Salty_Flow7358@reddit
"The more you buy the more you save"
Django_McFly@reddit
I'd give up a lot of the connectivity options to get decent PCIe. I think/hope gen 2 opens that up a bit more. I'd take a much more stripped down version that was like 2 USB, 1 nvme, 1 Ethernet, no wifi if it meant x16. You could build a perfect little inference box for all types of AI stuff.
ldn-ldn@reddit
If your model only needs 1.35GB in VRAM, you can buy RTX5080 instead and save $1,000.
alexmulo@reddit
What is your main application for these local models on strix halo?
NeverEnPassant@reddit (OP)
I dont have a strix halo. Numbers provided by others.
alexmulo@reddit
What for do you use these local models?
avl0@reddit
So if you spend twice as much you can get something that works better in most situations?
Truly shocking
Rich_Repeat_22@reddit
a) 5090 alone is close to $3000 these days, with the rest of the system given prices is close to $4000.
b) gpt-oss-120b is MOE. Ofc it will be faster on the 5090 as only a tiny bit is loaded actually. Try now a medium size dense model on the 5090 and compare it to the AMD 395.
c) What is the Strix Halo machine? Laptop or miniPC? Since no information is given. Asking because there is a gap in perf due to power allowance between laptop (85C) and MiniPC (140W).
d) What are the numbers when Lemonade is being used for hybrid execution (iGPU + NPU)?
gpt-oss-120b-mxfp-GGUF via Lemonade is supported.
NeverEnPassant@reddit (OP)
a) You are like the 10th person to repeat that lie. An overpriced "OC" 5090 is attainable TODAY for $2400. With a little patience it is $2000. That's what I paid after seeing it in stock for 2 weeks.
b) MoE models are the present and future.
c) desktop
d) I posted updated numbers at the end of this post. tg improves, pp barely. It's not lemonade though, its patched llama. Probably at least as fast as lemonade. Isn't lemonade just another llama fork?
Rich_Repeat_22@reddit
patched llama doesn't use the NPU.
NeverEnPassant@reddit (OP)
Please post benchmarks that use the NPU
Charming_Support726@reddit
For me the StrixHalo is a perfect choice. The unit costs less than one 5090 itself - a 5090 or 2x5090 based workstation costs 2 or maybe 4 times as much as such a unit. Making this a questionable comparison.
It is a quiet little desktop, capable of running most of the models at decent speed. I am mostly using cloud services for production tasks anyway.
My 3090 based workstation has been retired.
Tyme4Trouble@reddit
What if I don’t want to use Llama.cpp, if I want to finetune Llama 3.3 70B? Strix Halo, DGX Spark the arguments for X090 + fast DRAM fall apart when your workloads don’t involve Llama.cpp.
NeverEnPassant@reddit (OP)
Yes, this won't work for fine tuning. I'm not sure how well strix halo would do either though. Dgx spark has a lot more compute than strix halo.
wishstudio@reddit
Agree with most part.
But for the last part, I believe Strix Halo + GPU still have potential. IMO the current PCIe bandwidth bound behavior is actually due to inferior llama.cpp implementation.
The basic relevant heuristic is: for the RAM-offloaded MoE weights, if the batch size is small (for decoding it's 1), then it's definitely memory bandwidth bound, so we simply compute it on the CPU. If the batch size is larger (esp. prefill), then the computational complexity will overcome memory/PCIe bandwidth, so we transfer the weights to GPU.
The biggest problem here is, how large is large? Currently llama.cpp uses a very crude number: 32. Yes, it's a fixed number, regardless of your CPU/GPU configuration. Let's do some napkin math. Suppose the 120b are all MoE parameters. A 32-item batch will require exactly 4*32=128 expert multiplications, i.e. 120G OPs. Now the performance depends on the expert reuse rate, i.e. how many experts need to be read. If the expert usage is spread evenly, then we need to read 60GB of data for 120G OPs. Modern consumer CPUs could do hundreds of GFLOPS easily, so this is obviously not worth it to send the data over PCIe. There is a PR in ik_llama from months ago that tackles this (https://github.com/ikawrakow/ik_llama.cpp/pull/520). With a few parameter tweaks they can get ~2x PP performance at small batch sizes.
Now comes the case of Strix Halo. Following the above math you'll see that sending the weights over PCIe will never be worth it for Strix Halo, even at 4096 batch size. The Strix Halo's GPU+NPU has a theoretical 126 TOPs, i.e. easily ~100x faster than a conventional consumer CPU. And its RAM bandwidth is ~4x PCIe 5 bandwidth. It would be crazy to send the weights over PCIe instead of calculating in RAM in-situ.
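The batch-of-32 napkin math above can be reproduced with a few lines. This is a sketch under the comment's own simplifications: all 120B parameters are treated as expert weights, weights are about 4 bits each, and expert usage is spread evenly so every expert gets read. The 128 experts per layer figure is an assumption chosen because it matches the 4*32=128 arithmetic; one "op" here means one weight used for one token.

```python
TOTAL_PARAMS = 120e9        # simplification from the comment: all params are experts
EXPERTS_PER_LAYER = 128     # assumed; consistent with the 4*32=128 arithmetic
ACTIVE_PER_TOKEN = 4        # experts used per token
BYTES_PER_PARAM = 0.5       # ~4-bit weights, ignoring scale overhead


def batch_cost(batch_tokens, unique_fraction=1.0):
    """Return (ops, bytes_read) for the expert layers of one batch.

    unique_fraction is the share of experts actually touched; 1.0 models
    the 'spread evenly' case where every expert has to be read.
    """
    expert_uses = batch_tokens * ACTIVE_PER_TOKEN
    ops = expert_uses / EXPERTS_PER_LAYER * TOTAL_PARAMS
    bytes_read = unique_fraction * TOTAL_PARAMS * BYTES_PER_PARAM
    return ops, bytes_read


ops, bytes_read = batch_cost(32)
print(f"batch 32: {ops / 1e9:.0f} G ops for {bytes_read / 1e9:.0f} GB read")
print(f"ops per byte moved: {ops / bytes_read:.1f}")
```

At batch 32 that is only about two ops per byte moved, which is the basis for the claim that shipping the weights over PCIe is hard to justify at that size.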
NeverEnPassant@reddit (OP)
Let's assume a balanced batch size of 4096. Each expert on average will have 128 tokens. The ik llama thing is really only relevant for small prompts, and for small prompts, pp speed doesnt matter a whole lot. It's a micro optimization.
I agree with you that Strix Halo's pcie speed is a problem. That was a major point in this post: how you can not really use a GPU to speed up prefill on a Strix Halo. Btw, it's pcie4 x4, not pcie5 x4.
I've heard people say that the NPU is not being used. But a quick search says the NPU on the strix halo is 50 TOPS and 50 AI TOPS (I guess it doesn't support 4 bit, so it's the same) vs a rtx 5090 with 838 TOPS and 3352 AI TOPS. I can't see it not getting absolutely decimated in pp by 5090 with pcie5 x16.
wishstudio@reddit
My point is, for MoE part it's still RAM bound even if you transfer the computation to GPU (well it's PCIe bound which is even worse). The ik_llama case just demonstrates that 32 could already be too small for CPUs to make PCIe transfer worth it. For Strix Halo we have much faster computation so the "optimization" to transfer MoE calculations to GPU will likely be a regression.
Take your case of 4096 batch size: each expert read will be reused for 128 tokens. So with 64GB/s bandwidth you only need 64*128*2=16384 GFLOPS to saturate it (*2 due to MXFP4); having any computational speed higher than that is a waste. Even if you have an RTX 9090 that is 100x faster than the 5090 it won't help at all. For Strix Halo, it's 256 GB/s, which corresponds to 256*128*2=65536 GFLOPS, which looks like a much nicer match to its actual computational capability, even without the NPU.
In conclusion, doing MoE calculation in Strix Halo's system RAM should be faster than traditional dual channel CPUs doing MoE offload to GPU. For attention part it should be the same to offload to GPU, because it's computation bound.
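The saturation estimate in this comment follows one formula: bandwidth times weights per byte times the number of tokens that reuse each expert read. Here is a minimal sketch of it, counting ops the same way the comment does (one op per weight per token). The 128 experts and 4 active experts per token are assumptions consistent with the "128 tokens per expert at batch 4096" figure used in the thread.

```python
def tokens_per_expert(batch_tokens, active_per_token=4, experts=128):
    """Average number of tokens routed to each expert in a balanced batch."""
    return batch_tokens * active_per_token / experts


def saturating_gops(bandwidth_gb_s, reuse, params_per_byte=2):
    """Compute rate (G ops/s) at which a link of this bandwidth is saturated.

    params_per_byte=2 models ~4-bit weights; reuse is how many tokens
    each transferred weight is applied to.
    """
    return bandwidth_gb_s * params_per_byte * reuse


reuse = tokens_per_expert(4096)
for name, bw in (("PCIe 5.0 x16 (64 GB/s)", 64), ("Strix Halo RAM (256 GB/s)", 256)):
    print(f"{name}: {saturating_gops(bw, reuse):.0f} G ops/s to saturate")
```

Any compute beyond these figures would sit idle waiting on the link, under the stated assumptions; it says nothing about attention, which is compute bound either way.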
NeverEnPassant@reddit (OP)
I'm not interested in discussing ik llama micro optimizations. That patch is mostly irrelevant.
I have numbers backing me up. 4100tps with 4096 batch where an rtx 6000 pro (everything in VRAM) only gets 5518tps. The 5090 also starts to catch up to the 6000 as context size increases as pcie is the bottleneck.
You think there is some magic that is going to make this:
catch up to this?
It's magical thinking and it's not gonna happen.
wishstudio@reddit
Looks like you don't actually want to read my comment.
Well I probably shouldn't waste my time here. You believe what you want to believe.
NeverEnPassant@reddit (OP)
I gave you numbers showing I can get 2497tps at 48000 doing exactly this. Meanwhile Strix Halo is currently getting 184tps. Can you justify why you think Strix Halo can somehow see a 13x increase to match it?
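For reference, the gap being argued about can be computed directly from the pp4096 @ d48000 rows in the post. This sketch only divides the measured numbers; the dictionary keys are labels made up for illustration.

```python
# pp4096 @ d48000 throughput in tokens/s, copied from the tables in the post.
PP_AT_48K = {"strix_halo": 183.86, "5090_with_offload": 2497.25}


def chunk_seconds(tokens, tokens_per_second):
    """Time to prefill a chunk of the given size at a fixed throughput."""
    return tokens / tokens_per_second


ratio = PP_AT_48K["5090_with_offload"] / PP_AT_48K["strix_halo"]
print(f"prefill speed ratio at 48k depth: {ratio:.1f}x")
for name, tps in PP_AT_48K.items():
    print(f"{name}: {chunk_seconds(4096, tps):.1f} s per 4096-token chunk")
```

In practical terms that is a bit over 20 seconds versus under 2 seconds for each additional 4096 tokens once the context is already 48k deep.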
wishstudio@reddit
At the top of my comment, I said that I want to point out a potential implementation issue. I think it's quite clear all my following calculations are hypothetical situations assuming perfect implementation, which of course does not exist yet.
If you do not like theoretical analysis and just want to compare empirical numbers, you can simply ignore my comments. I thought you do because I read a lot of calculations in your OP to prove why Strix Halo is inferior.
I posted two comments with detailed calculation to justify my hypothesis. While you kept posting your test results without pointing out any flaw in my calculation. Therefore I see no point in continuing the discussion here.
BTW, I have almost the same setup as you, and I agree with most of your reasoning in your OP.
NeverEnPassant@reddit (OP)
So you do think a perfect implementation would give a 13x improvement in prefill throughput?
When empirical numbers disagree with your theoretical analysis it should make you question that analysis.
wishstudio@reddit
This is not a valid argument.
I bet you agree that changing a few parameters of your bench command line will easily affect the performance by much more than 13x. So why a potential implementation issue couldn't lead to that difference?
NeverEnPassant@reddit (OP)
Because Strix Halo is limited by AI TOPS.
wishstudio@reddit
This is oversimplification. The logic is the same as the folks thinking a 32G VRAM GPU must be better than a 16G one, or a 3GHz CPU must be better than a 2GHz one. Performance tuning has much more story to say than simply comparing two spec numbers, especially for esoteric configurations like hybrid inference. Otherwise llama.cpp won't have so many options for you to try.
For Strix Halo, only the attention is limited by TOPS; the expert FFNs are memory bound in both PP and TG, if my calculation is correct. But for PP on ordinary CPUs, both parts are compute bound. And that's the reason for the offloading MoE to GPU optimization in the first place.
If you really want to discuss, show me why calculating MoE FFNs is limited by TOPS for Strix Halo. I would be pretty happy to learn that.
AppearanceHeavy6724@reddit
I miss old Localllama, when people actually paid attention to specs. These days any conversation about PP, TFLOPs, GB/sec falls on deaf ears.
NeverEnPassant@reddit (OP)
But strix halo + 5090 will always have significantly slower prefill than any system with pcie5 x16, so what's the point?
wishstudio@reddit
The point is, strix halo + 5090 is currently significantly slower because in the current llama.cpp implementation it is limited by PCIe bandwidth. But it shouldn't.
NeverEnPassant@reddit (OP)
If you don't want to upload all experts weights to the GPU on each ubatch, what work do you expect the GPU to perform that will provide a significant speedup?
Baldur-Norddahl@reddit
There is another option now: instead of RTX 5090 get dual or quad R9700. Each card has 32 GB, so you can run the entire model in VRAM. The memory bandwidth is less, but with two cards and tensor parallel, that doubles the bandwidth.
These are two slot blower cards and 300 watt. That makes it much easier to build compared to multiple 3090 or 5090.
shockwaverc13@reddit
so does performance actually improve when you quantize kv cache?
NeverEnPassant@reddit (OP)
kv cache quant kills prefill performance for me. I'm not sure why!
AppearanceHeavy6724@reddit
because it is compute bound, not bw.
NeverEnPassant@reddit (OP)
I just tested and quantized KV cache is not giving me the significant slowdown I previously saw. Not sure why it happened before. I always compiled with -DGGML_CUDA_FA_ALL_QUANTS=ON.
sudochmod@reddit
That’s a great question! I should find out later when I have time.
cride20@reddit
what about the NPU? There is an open source project that specifically uses the ryzen NPUs. They get pretty consistent tps at as low as 30w power usage
AppearanceHeavy6724@reddit
No, KV cache slowdown is dominated by slow compute. Set KV quant to Q4 and you won't see any difference in PP.
NeverEnPassant@reddit (OP)
Yeah, I think I may have gotten this wrong seeing how newer llama doesn't see such dramatic slowdowns with Strix Halo (but still much more than a 5090).
Alocas@reddit
Your whole discussion should end when you compare the price of a quad channel (I suppose threadripper/epyc), 128gb ddr5 RAM, rtx 5090 PC (definitely not "normal") to a 1500$ or € mini PC, including considering running costs (power consumption).
NeverEnPassant@reddit (OP)
It's not $1500.
Alocas@reddit
Hmmm, I stand corrected. 1700$ and 1600€ for the cheapest full RAM system. Still my point stands...
michaelsoft__binbows@reddit
i have a 5090 rig, a 7 liter one in fact, so it's not even all that less portable than a strix halo box but it really doesn't make sense to split the work across the main system memory, it's just such a massive bottleneck.
Strix halo perf as many have shown is getting better and 30+tok/s is attainable with large context. That means it's usable.
I think if you have need for one, it would be really nice, but it's the next iteration of these halo chips that will start to get compelling. if they are able to continue to add even more memory channels, and of course there will be more compute on tap, then we will be starting to see 100tok/s out of this 120b model and at that point we're talking fast enough for general use.
It's also going to just be so nice for general cpu algorithms to be able to tap all that memory bandwidth. once you start eclipsing half a TB/s it's a different ballgame.
I also think that once software catches up, there is going to be a responsiveness upper hand for unified memory systems, since they can skip the bus transfer.
NeverEnPassant@reddit (OP)
It's only really a bottleneck for decode. Prefill is still really really fast.
I'm just saying that a 5090 is faster than a Strix Halo even when splitting work across GPU and system RAM. For example gpt-oss-120B is much more usable because prefill is over 13x higher at 48k context. I think it's worth the extra cost.
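For anyone who wants to check the "over 13x" figure, here is a minimal sketch that divides the two llama-bench tables from the post row by row. The dictionary and function names are made up for illustration; the numbers are the mean t/s values from the post, with the ± error dropped.

```python
# Throughput (tokens/s) copied from the two benchmark tables in the post.
# Only the means are used; the +/- spread is ignored in this rough comparison.
strix_halo = {
    "pp4096": 997.70,
    "tg128": 46.18,
    "pp4096 @ d20000": 364.25,
    "tg128 @ d20000": 18.16,
    "pp4096 @ d48000": 183.86,
    "tg128 @ d48000": 10.80,
}
rtx_5090 = {
    "pp4096": 4065.77,
    "tg128": 39.35,
    "pp4096 @ d20000": 3267.95,
    "tg128 @ d20000": 36.96,
    "pp4096 @ d48000": 2497.25,
    "tg128 @ d48000": 35.18,
}


def speedups(system_a, system_b):
    """Return system_a / system_b throughput for every test in system_b."""
    return {test: system_a[test] / system_b[test] for test in system_b}


ratios = speedups(rtx_5090, strix_halo)
for test, ratio in ratios.items():
    print(f"{test:<18} {ratio:5.2f}x")
```

The only row where the Strix Halo comes out ahead is tg128 at context 0; every other row favors the 5090 system, and the gap widens with depth.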
No_Shape_3423@reddit
Interesting. Please tell us what your computer + GPU would cost today from a major retailer.
NeverEnPassant@reddit (OP)
I guess you want to spend days and days learning about local llm, but don't want to invest 3-4 hours in purchasing and building a computer?
No_Shape_3423@reddit
I don't understand the point of your comment or how you have any basis to say I don't know about building computers. I've been building computers for decades and currently have an EPYC server I use for inference.
You claim that a DDR5 machine with a 5090 costs like $1000 more than a Strix Halo. Can you support that claim?
NeverEnPassant@reddit (OP)
I said $1000-$1500 depending on components. I'd add about $300 to that now after learning that DDR5 RAM prices have DOUBLED in the past 2 months.
twilight-actual@reddit
If all you want to do is run MoEs on your system that barely are larger than your 32GB of ram, then you have a point. But let's say you want to go larger, something that would nearly fill 96GB of ram.
I don't care how fast your memory bandwidth is, you're getting hammered. I've seen the same thing play out on older Apple Studio M1s vs the 5090. The 5090 kicks ass until it hits a wall. And the larger the memory allocation goes over 32GB, the more the 5090 suffers. Memory becomes more valuable than bandwidth, because you're not constantly thrashing memory, limited by PCIe, or having to split the overage to CPUs which just can't compete with GPUs.
You found one data point and decided to make an entire generalization about it.
NeverEnPassant@reddit (OP)
Funny how I publish numbers and you don't. Gpt-oss-120b is 60GB before any kv cache, etc. tg will skew slightly more towards strix halo with larger models, but pp will remain the same.
Maleficent-Ad5999@reddit
I wish strix halo came with a couple of pcie x16 slots
Queasy_Asparagus69@reddit
3 tokens bro: ROI
chaosmetroid@reddit
There are 2 things that need to be considered that most people here don't really sit down and realize.
My understanding is that a GPU helps you run 1 model at once really well, but the moment you need multiple people running that same model it will struggle a bit.
While the strix will not have a performance loss.
I can be wrong here but this is my understanding.
johnkapolos@reddit
Awesome work, thanks for sharing!
pmttyji@reddit
Could you please add benchmarks for a few more models (GLM 4.5 Air, dense ones like Llama 70B)? Yesterday there was a thread about the DGX Spark. Looks like both DGX & Strix are useful only for lightweight use with lightweight models. Haven't seen anyone use them for bigger models.
NeverEnPassant@reddit (OP)
GLM Air is like half the numbers of what I posted because it has double active parameters. Dense 70B is too slow on anything less than $8000.
pmttyji@reddit
This pretty much clarifies things. Hope we see more benchmarks from others sooner or later. I won't go for such a unified memory setup unless total memory is something bigger like 512GB (e.g. Mac) or 1TB. Because I would like to try additional bigger models like GLM Air, Qwen3-235B @ Q4, Llama4-Scout, etc., which don't run well on these 128GB setups.
We already regret our laptop purchase from last year (though a friend bought it mainly for gaming) as we can't upgrade/expand it anymore. So I won't go with a non-upgradable/expandable setup again unless it's 512GB/1TB.
recoverygarde@reddit
Strix Halo is overrated but the better model to test is gpt oss 20B as it doesn’t lose much intelligence (o3 mini vs o4 mini) and runs much faster. For the money M4 Pro and M4 Max are the better buys as both are faster and much faster respectively. Also pretty soon M5 coming to the Mac mini will be an even better deal
DistanceAlert5706@reddit
I'm not a fan of Strix Halo and think it's slightly overpriced and over hyped, but most people don't have an RTX 5090 and a system that is even capable of running DDR5-6000.
Btw, there was a post a few days ago with a llama.cpp fork which improves performance with context growth.
No-Consequence-1779@reddit
I run 2 5090s on a 1200$ threadripper gen 3 pcie and ddr4. It is not required. This is faulty thinking. Maybe some sort of damaged brain.
NeverEnPassant@reddit (OP)
pcie5 and ddr5 is only a large benefit if you are doing hybrid inference on your GPU and system RAM.
No-Consequence-1779@reddit
Well .. yeah. Though it’s not as much of a difference as people think.
And running multiple GPUs still requires pcie traffic to sync up (matrix multiplication and layers).
Unless it’s a model designed to be fast like oss , it’s unproductively slow.
The most obvious thing is when posters don't actually run the models on the hardware they comment about.
Probably not this case but it’s almost all the comments. It’s pretty crazy.
NeverEnPassant@reddit (OP)
This post included numbers.
pcie5 x16 to pcie4 x16 drops my prefill from 4100 to 2700. I think it's more important than people think if they are storing MoE weights in system RAM.
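A rough way to see why the link generation matters: if the expert weights that live in system RAM (about 38GB per the post) have to cross PCIe once per ubatch of 4096 tokens, the transfer time alone caps prefill. The sketch below is back-of-envelope only. The link speeds are assumed theoretical per-direction figures for x16 (about 63 GB/s for PCIe 5.0, about 31.5 GB/s for PCIe 4.0), the function name is hypothetical, and it ignores compute, overlap, and protocol overhead.

```python
def prefill_ceiling_tps(ubatch_tokens, upload_gb, link_gb_per_s):
    """Upper bound on prefill t/s if every ubatch waits for a full weight upload."""
    seconds_per_ubatch = upload_gb / link_gb_per_s
    return ubatch_tokens / seconds_per_ubatch


UBATCH_TOKENS = 4096        # ubatch size from the post
EXPERTS_IN_RAM_GB = 38.0    # expert weights kept in system RAM, from the post

# ASSUMPTION: theoretical x16 bandwidth per direction, not measured values.
PCIE5_X16_GB_S = 63.0
PCIE4_X16_GB_S = 31.5

ceiling_pcie5 = prefill_ceiling_tps(UBATCH_TOKENS, EXPERTS_IN_RAM_GB, PCIE5_X16_GB_S)
ceiling_pcie4 = prefill_ceiling_tps(UBATCH_TOKENS, EXPERTS_IN_RAM_GB, PCIE4_X16_GB_S)

print(f"pcie5 x16 ceiling: {ceiling_pcie5:7.0f} t/s (measured: ~4100)")
print(f"pcie4 x16 ceiling: {ceiling_pcie4:7.0f} t/s (measured: ~2700)")
```

Under these assumptions both measured numbers sit below their ceilings. Halving the link speed halves the ceiling, while the measured drop is smaller than 2x, which would be consistent with transfer being a large part of the cost but not all of it.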
Badger-Purple@reddit
Having run these models I agree that a 32GB gpu from nvidia will blaze an MOE like OSS120. But I find being able to run qwen next, oss120 AND glm4.5 air at the same time in my M2 ultra mac way more satisfying. I like these models for agent work and they are super smart when they connect.
In an ideal world, I would get a RTXpro6000. But my used M2 ultra has twice the GPU-allocated RAM, 4x the bandwidth of the strix halo, and cost me 3200 refurbished with a year warranty.
NeverEnPassant@reddit (OP)
Literally at the same time? Or would llama-swap be good enough?
Badger-Purple@reddit
No I can also swap models but I love being able to just keep them all on. I place each as a different agent, they work on a task together. It’s true that at the prefill of 30k+ I’ll wait a minute before it starts decoding, but the decode speed is great. 70b llama , GLM4.6 q4 are slow (15tkps) but anything smaller runs acceptably fast. It was not bought for this purpose ONLY, and the resale value of an older studio is not going to be like the 5090, but the price, power consumption and acceptable inference speed with 80-300B parameter models should make it a sweet spot for many. MLX is so good, those guys are obsessed with getting every model to run on mac hardware close to day 1 of release. There are pain points with pytorch and yeah if I was a developer I would want CUDA…but I am not.
GradatimRecovery@reddit
i find this post very illuminating. others seem fixated on the price and power draw difference. your numbers reveal that the price and power differential don’t matter at all. strix halo simply can not do useful work. there’s no tk/s$ or tk/sW comparisons to be made when the tk/s is so close to zero
solidsnakeblue@reddit
I use mine for useful work every day. It’s a bit slow if you’re working with large context and using it as a chat backend. But for workflows that run automatically in the background it’s hard to beat the efficiency
sudochmod@reddit
They’re also ignoring the NPU which is pretty quick as well.
ataylorm@reddit
Multiple testers have shown the OSS model runs significantly better on Nvidia hardware, but the performance differences are smaller when using models without the expert layers.
That being said, as of this past weekend NewEgg didn't have any 5090’s for less than $3200, before the other components, vs $2000 for the Strix…
NeverEnPassant@reddit (OP)
MoE models are the norm now.
No-Consequence-1779@reddit
This is well known on the LLM side. Preload (context processing) is compute bound. Cuda is king.
Token generation is vram speed bound.
Moe and thinking can be various combinations.
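To put numbers on "token generation is memory speed bound", here is a minimal sketch of the decode ceiling at context 0, using the per-token read sizes from the post (0.76GB shared plus 1.78GB of experts; 1.35GB from VRAM and 1.18GB from system RAM on the hybrid setup). The bandwidth figures are assumptions based on theoretical peaks (256 GB/s for 256-bit LPDDR5X-8000, 96 GB/s for dual channel DDR5-6000, 1792 GB/s for the 5090), the function name is hypothetical, and the reads are treated as fully sequential.

```python
def decode_ceiling_tps(reads):
    """Upper bound on decode t/s for a list of (GB read per token, GB/s) pairs."""
    seconds_per_token = sum(gb / bandwidth for gb, bandwidth in reads)
    return 1.0 / seconds_per_token


# ASSUMPTION: theoretical peak bandwidths, real sustained numbers are lower.
STRIX_HALO_GB_S = 256.0   # 256-bit LPDDR5X-8000
DDR5_6000_GB_S = 96.0     # dual channel DDR5-6000
RTX_5090_GB_S = 1792.0    # 5090 VRAM

# Strix Halo reads everything from unified memory.
strix_ceiling = decode_ceiling_tps([(0.76 + 1.78, STRIX_HALO_GB_S)])

# Hybrid setup: split between VRAM and system RAM as described in the post.
hybrid_ceiling = decode_ceiling_tps([(1.35, RTX_5090_GB_S), (1.18, DDR5_6000_GB_S)])

print(f"Strix Halo ceiling: {strix_ceiling:5.1f} t/s (measured tg128: 46.18)")
print(f"5090 + DDR5 ceiling: {hybrid_ceiling:5.1f} t/s (measured tg128: 39.35)")
```

Both systems land at roughly half of this naive ceiling at context 0, which fits the observation that their tg128 numbers are close despite the difference in system memory speed. The sketch says nothing about how decode scales with context, since KV cache reads and attention compute are left out.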
This is the main reason I went from 2x3090s to 2x5090s (before 6000 and spark).
Doing any serious work requires a lot of information in the context. I was waiting 10-15+ minutes. 30 minutes. Then generation for a task was 2 hours.
Task was billed out at 8 grand (4 days manual work) so it paid for the 5090s immediately.
NeverEnPassant@reddit (OP)
There is some nuance here. Specifically, I was measuring a MoE model much larger than would fit in VRAM, so the particularities of how and why it scaled was interesting to me. Also the importance of pcie5 was a surprise to me.
No-Consequence-1779@reddit
It should be a surprise how much difference it does not make running lager-small models.
Terminator857@reddit
What do the numbers look like with qwen3 coder and large context?
NeverEnPassant@reddit (OP)
Probably depends on the quant. At 4 bits it will fit entirely in VRAM and a 5090 would be several times faster at everything. At larger quants I'm not sure how it would compare.
hp1337@reddit
I actually agree with you. It makes more sense to combine a decent GPU with a fast ddr5 system than to get strix halo. Now when the next gen APUs come out with ddr6 it may be another discussion.
Longjumpinghy@reddit
AI post,
NeverEnPassant@reddit (OP)
I promise you I wrote this by hand.
dsartori@reddit
I read every word and never doubted. Some people can't tell the difference between a confident communicator and AI.
SeaHorseManner@reddit
Thank you for this detailed analysis and explanation! Definitely cleared some things up for me.