Why the Strix Halo is a poor purchase for most people

Posted by NeverEnPassant@reddit | LocalLLaMA

I've seen a lot of posts that promote the Strix Halo as a good purchase, and I've often wondered if I should have purchased one myself. I've since learned a lot about how these models are executed. In this post I would like to share empirical measurements, explain where I think those numbers come from, and make the case that few people should be purchasing this system. I hope you find it helpful!

Model under test

- llama.cpp
- gpt-oss-120b
- One of the highest quality models that can run on mid range hardware.
- Total size for this model is ~59GB, and ~57GB of that is expert layers.

Systems under test

First system:
- 128GB Strix Halo
- Quad channel LPDDR5X-8000

Second system (my system):
- Dual channel DDR5-6000 + pcie5 x16 + an rtx 5090
- An rtx 5090 with the largest context size requires about 2/3 of the experts (38GB of data) to live in system RAM.
- CUDA backend
- mmap off
- batch 4096
- ubatch 4096

Real world measurements

Here are user submitted numbers for the Strix Halo:

test t/s
pp4096 997.70 ± 0.98
tg128 46.18 ± 0.00
pp4096 @ d20000 364.25 ± 0.82
tg128 @ d20000 18.16 ± 0.00
pp4096 @ d48000 183.86 ± 0.41
tg128 @ d48000 10.80 ± 0.00

What can we learn from this? Performance is acceptable only at context 0. As context grows, performance drops off a cliff for both prefill and decode.

And here are numbers from my system:

test t/s
pp4096 4065.77 ± 25.95
tg128 39.35 ± 0.05
pp4096 @ d20000 3267.95 ± 27.74
tg128 @ d20000 36.96 ± 0.24
pp4096 @ d48000 2497.25 ± 66.31
tg128 @ d48000 35.18 ± 0.62
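To make the two tables easier to compare, here is a small sketch that expresses each result as a fraction of the same system's context-0 speed. The numbers are copied from the tables above; the variable and function names are just mine for illustration.

```python
# Tokens/s copied from the two tables above, keyed by context depth.
strix_halo = {
    "pp4096": {0: 997.70, 20000: 364.25, 48000: 183.86},
    "tg128": {0: 46.18, 20000: 18.16, 48000: 10.80},
}
rtx_5090_box = {
    "pp4096": {0: 4065.77, 20000: 3267.95, 48000: 2497.25},
    "tg128": {0: 39.35, 20000: 36.96, 48000: 35.18},
}


def retained(series):
    """Fraction of the depth-0 speed that survives at each depth."""
    base = series[0]
    return {depth: tps / base for depth, tps in series.items()}


for name, system in (("Strix Halo", strix_halo), ("5090 + DDR5", rtx_5090_box)):
    for test, series in system.items():
        cells = ", ".join(f"d{d}: {r:.0%}" for d, r in retained(series).items())
        print(f"{name:12s} {test}: {cells}")
```

On these numbers the Strix Halo keeps well under half of its context-0 speed by 20k context in both tests, while my system keeps most of its decode speed even at 48k.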

Wait a second, how are the decode numbers so close at context 0? The Strix Halo has memory that is roughly 2.5x faster than my system's RAM. And why does my system have a large lead in decode at larger context sizes?

This comes down to one of the advantages of MoE models. Let's look closer at gpt-oss-120b. This model is 59GB in size. There is roughly 0.76GB of layer data that is read for every single token. Since every token needs this data, it is kept in VRAM. Each token also needs to read 4 experts, which is an additional 1.78GB, but each token needs a potentially different set of expert weights. Since we can fit 1/3 of the experts in VRAM, on average about 1/3 of those expert reads hit VRAM and 2/3 hit system RAM. That brings the total split to roughly 1.35GB in VRAM and 1.19GB in system RAM per token at context 0, about 2.54GB in total.
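Here is that split as a quick calculation. It is only a sketch: it assumes the experts a token picks are spread evenly between VRAM and system RAM in proportion to where they are stored, which real routing will not match exactly. The function and constant names are mine.

```python
# Per-token weight reads for gpt-oss-120b, using the figures quoted above.
ALWAYS_READ_GB = 0.76         # layer data every token touches (kept in VRAM)
ACTIVE_EXPERT_GB = 1.78       # the 4 experts each token activates
VRAM_EXPERT_FRACTION = 1 / 3  # share of all experts that fit in VRAM


def per_token_split(always_gb, expert_gb, vram_fraction):
    """Return (vram_gb, ram_gb) read per token, assuming uniform expert routing."""
    vram_gb = always_gb + expert_gb * vram_fraction
    ram_gb = expert_gb * (1 - vram_fraction)
    return vram_gb, ram_gb


vram_gb, ram_gb = per_token_split(
    ALWAYS_READ_GB, ACTIVE_EXPERT_GB, VRAM_EXPERT_FRACTION
)
total_gb = vram_gb + ram_gb

print(f"VRAM:  {vram_gb:.2f} GB ({vram_gb / total_gb:.0%})")
print(f"RAM:   {ram_gb:.2f} GB ({ram_gb / total_gb:.0%})")
print(f"Total: {total_gb:.2f} GB per token")
```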

Now VRAM on a 5090 is much faster than both the Strix Halo's unified memory and dual channel DDR5-6000. When all is said and done, doing ~53% of your reads in ultra fast VRAM and ~47% of your reads in somewhat slow system RAM gives a decode time that is roughly equal to (a touch slower than) doing all your reads in the Strix Halo's moderately fast quad channel LPDDR5X-8000.
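A back-of-the-envelope version of this argument, as a sketch. The bandwidths below are theoretical peaks that I am assuming from the memory specs, not measurements, and the model assumes decode is purely bound by memory reads that happen one after another. Real throughput lands well below these ceilings, so only the ratio between the two systems is interesting.

```python
# Theoretical peak bandwidths in GB/s (assumptions, not measurements).
RTX_5090_VRAM = 1792.0                # 512-bit GDDR7 at 28 Gbps per pin
DUAL_DDR5_6000 = 2 * 8 * 6000 / 1000  # 2 channels x 8 bytes x 6000 MT/s
STRIX_HALO = 32 * 8000 / 1000         # 256-bit bus (32 bytes) x 8000 MT/s

# Per-token reads in GB at context 0, from the figures in the post.
VRAM_GB = 0.76 + 1.78 / 3
RAM_GB = 1.78 * 2 / 3
TOTAL_GB = VRAM_GB + RAM_GB


def decode_ceiling_tps(reads):
    """Upper bound on tokens/s for a list of (gigabytes, GB/s) serial reads."""
    seconds = sum(gb / bandwidth for gb, bandwidth in reads)
    return 1.0 / seconds


hybrid = decode_ceiling_tps([(VRAM_GB, RTX_5090_VRAM), (RAM_GB, DUAL_DDR5_6000)])
strix = decode_ceiling_tps([(TOTAL_GB, STRIX_HALO)])
ram_share = (RAM_GB / DUAL_DDR5_6000) * hybrid  # fraction of hybrid time spent in RAM

print(f"5090 + DDR5 ceiling: {hybrid:.0f} t/s ({ram_share:.0%} of time in system RAM)")
print(f"Strix Halo ceiling:  {strix:.0f} t/s")
print(f"ceiling ratio: {hybrid / strix:.2f}, measured ratio: {39.35 / 46.18:.2f}")
```

Under these assumptions nearly all of my system's per-token time goes to the system RAM reads, which is why the fast VRAM half of the split is close to free.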

But wait, what about the slowdown in decode? That's because when your context size grows, decode must also read the KV cache once per layer. At 20k context, that is an extra ~4GB per token that needs to be read! Simple math (2.54 / 6.54) shows it should run about 0.39x as fast as at context 0, which is almost exactly what we see in the Strix Halo table above (18.16 / 46.18 ≈ 0.39).
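The same math as a snippet, for anyone who wants to plug in their own numbers. It assumes decode speed is inversely proportional to the bytes read per token, and it uses my rough ~4GB KV cache figure for 20k context, so treat it as an estimate. The names are illustrative.

```python
WEIGHTS_GB = 2.54    # weights read per token at context 0
KV_GB_AT_20K = 4.0   # rough KV cache read per token at 20k context


def relative_decode_speed(weights_gb, kv_gb):
    """Decode speed relative to context 0 if speed scales with 1 / bytes read."""
    return weights_gb / (weights_gb + kv_gb)


predicted = relative_decode_speed(WEIGHTS_GB, KV_GB_AT_20K)
measured = 18.16 / 46.18  # Strix Halo tg128 at d20000 vs. d0

print(f"predicted: {predicted:.2f}x, measured: {measured:.2f}x")
```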

But wait, why does my system show very little slowdown? That's because all of the KV cache is stored in VRAM, which has ultra fast reads. The decode time is dominated by the slow reads from system RAM, so the extra KV cache reads barely move the needle.

Why do prefill times degrade so quickly on the Strix Halo? Good question! I would love to know!

Can I just add a GPU to the Strix Halo machine to improve my prefill?

Unfortunately not. The ability to leverage a GPU to improve prefill times depends heavily on the pcie bandwidth, and the Strix Halo only offers pcie4 x4.

I went into my BIOS and forced my pcie slot into various configurations to gather some empirical data:

config prefill t/s
pcie5 x16 ~4100tps
pcie4 x16 ~2700tps
pcie4 x4 (what the strix halo has) ~1000tps
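For context, here is a sketch that puts the theoretical one-direction bandwidth of each link next to the prefill speed I measured. The per-lane rates (16 GT/s for gen 4, 32 GT/s for gen 5, with 128b/130b encoding) come from the PCIe specs. Real throughput is lower because of protocol overhead, and the helper name is mine.

```python
# Raw per-lane signalling rate in GT/s for each PCIe generation.
GT_PER_LANE = {4: 16.0, 5: 32.0}
ENCODING_EFFICIENCY = 128 / 130  # 128b/130b line encoding used by gen 3 and later


def pcie_gb_per_s(gen, lanes):
    """Theoretical one-direction bandwidth in GB/s, before protocol overhead."""
    return GT_PER_LANE[gen] * ENCODING_EFFICIENCY * lanes / 8


# (label, generation, lanes, measured prefill t/s from the table above)
configs = [
    ("pcie5 x16", 5, 16, 4100),
    ("pcie4 x16", 4, 16, 2700),
    ("pcie4 x4", 4, 4, 1000),
]

best_bw = pcie_gb_per_s(5, 16)
best_pp = configs[0][3]
for label, gen, lanes, prefill in configs:
    bw = pcie_gb_per_s(gen, lanes)
    print(
        f"{label:10s} {bw:5.1f} GB/s ({bw / best_bw:4.0%} of best link), "
        f"prefill {prefill} t/s ({prefill / best_pp:4.0%} of best)"
    )
```

Prefill drops at every step as the link gets narrower, though in these three data points it falls less than proportionally to the bandwidth.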

But why? Here is my best high level understanding of what llama.cpp does with a gpu + cpu moe:

Rough overview of what llama.cpp does:

Other benefits of a normal computer with an rtx 5090

- Better cooling
- Higher quality case
- A 5090 will almost certainly have higher resale value than a Strix Halo machine
- More extensible
- More powerful CPU
- Top tier gaming
- Models that fit entirely in VRAM will absolutely fly
- Image generation will be much much faster.

What is Strix Halo good for

- Extremely low idle power usage
- It's small
- Maybe all you care about is chat bots with close to 0 context

TLDR

If you can afford an extra $1000-1500, you are much better off just building a normal computer with an rtx 5090. Even if you don't want to spend that kind of money, you should ask yourself if your use case is actually covered by the Strix Halo.

Corrections

Please correct me on anything I got wrong! I am just a novice!