Why the Strix Halo is a poor purchase for most people

config	prefill (t/s)
PCIe 5.0 x16	~4100
PCIe 4.0 x16	~2700
PCIe 4.0 x4 (what the Strix Halo has)	~1000
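
For a rough sense of how those prefill figures line up with the link itself, here is a minimal sketch that works out theoretical one-direction PCIe bandwidth and sets it beside the quoted numbers. The per-lane rates (16 GT/s for PCIe 4.0, 32 GT/s for PCIe 5.0, 128b/130b encoding) are the published spec values; tying prefill speed to link bandwidth is an assumption about what the table is measuring, and the function names are made up for illustration.

```python
# Per-lane transfer rates in GT/s; PCIe 4.0 and 5.0 both use 128b/130b encoding.
GT_PER_S = {4: 16.0, 5: 32.0}
ENCODING = 128 / 130


def pcie_gb_per_s(gen: int, lanes: int) -> float:
    """Theoretical one-direction bandwidth in GB/s, before protocol overhead."""
    return GT_PER_S[gen] * ENCODING / 8 * lanes


# (label, PCIe generation, lanes, approximate prefill t/s quoted in the table)
configs = [
    ("PCIe 5.0 x16", 5, 16, 4100),
    ("PCIe 4.0 x16", 4, 16, 2700),
    ("PCIe 4.0 x4", 4, 4, 1000),
]


def summarize(rows):
    """Return (label, GB/s, link ratio, prefill ratio), relative to the last row."""
    _, base_gen, base_lanes, base_pp = rows[-1]
    base_bw = pcie_gb_per_s(base_gen, base_lanes)
    out = []
    for name, gen, lanes, pp in rows:
        bw = pcie_gb_per_s(gen, lanes)
        out.append((name, round(bw, 1), round(bw / base_bw, 1), round(pp / base_pp, 1)))
    return out


for name, bw, bw_ratio, pp_ratio in summarize(configs):
    print(f"{name:13s} {bw:5.1f} GB/s  link x{bw_ratio}  prefill x{pp_ratio}")
```

If the assumption holds, the interesting part is the gap between the two ratio columns: the quoted prefill numbers grow more slowly than the raw link bandwidth does.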

[-]

fallingdowndizzyvr@reddit

Performance is acceptable only at context 0. As context grows, performance drops off a cliff for both prefill and decode.

Those must be ancient numbers, since the Strix Halo is better than that now and getting better every day. Here's a fresh run that finished a minute ago.

ggml_cuda_init: found 1 ROCm devices:
  Device 0: AMD Radeon Graphics, gfx1151 (0x1151), VMM: no, Wave Size: 32
ggml_vulkan: Found 1 Vulkan devices:
ggml_vulkan: 0 = AMD Radeon Graphics (RADV GFX1151) (radv) | uma: 1 | fp16: 1 | bf16: 0 | warp size: 64 | shared memory: 65536 | int dot: 1 | matrix cores: KHR_coopmat
| model                          |       size |     params | backend    | ngl | n_batch | n_ubatch | fa | mmap |            test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | ------: | -------: | -: | ---: | --------------: | -------------------: |
| gpt-oss 120B MXFP4 MoE         |  59.02 GiB |   116.83 B | ROCm,Vulkan |  99 |    4096 |     4096 |  1 |    0 |          pp4096 |       1012.63 ± 0.63 |
| gpt-oss 120B MXFP4 MoE         |  59.02 GiB |   116.83 B | ROCm,Vulkan |  99 |    4096 |     4096 |  1 |    0 |           tg128 |         52.31 ± 0.05 |
| gpt-oss 120B MXFP4 MoE         |  59.02 GiB |   116.83 B | ROCm,Vulkan |  99 |    4096 |     4096 |  1 |    0 | pp4096 @ d20000 |        357.27 ± 0.64 |
| gpt-oss 120B MXFP4 MoE         |  59.02 GiB |   116.83 B | ROCm,Vulkan |  99 |    4096 |     4096 |  1 |    0 |  tg128 @ d20000 |         32.46 ± 0.03 |
| gpt-oss 120B MXFP4 MoE         |  59.02 GiB |   116.83 B | ROCm,Vulkan |  99 |    4096 |     4096 |  1 |    0 | pp4096 @ d48000 |        230.60 ± 0.26 |
| gpt-oss 120B MXFP4 MoE         |  59.02 GiB |   116.83 B | ROCm,Vulkan |  99 |    4096 |     4096 |  1 |    0 |  tg128 @ d48000 |         32.76 ± 0.05 |

Sure, the Strix Halo can't hope to have the compute to go up against the 5090 for PP. But in TG, I dare say it goes toe to toe with the 5090, even at large context.
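
To put a number on how much speed survives as context grows, here is a small sketch that takes the test and t/s columns from the llama-bench run above and expresses each depth as a percentage of the depth-0 rate. The rows are copied from that table; the helper names `parse` and `retention` are just illustrative, not part of llama.cpp.

```python
import re

# Last two columns (test, t/s) copied from the llama-bench output above.
ROWS = """
|          pp4096 |       1012.63 ± 0.63 |
|           tg128 |         52.31 ± 0.05 |
| pp4096 @ d20000 |        357.27 ± 0.64 |
|  tg128 @ d20000 |         32.46 ± 0.03 |
| pp4096 @ d48000 |        230.60 ± 0.26 |
|  tg128 @ d48000 |         32.76 ± 0.05 |
"""


def parse(rows: str) -> dict:
    """Map (kind, depth) to mean tokens/s, e.g. ("pp", 20000) -> 357.27."""
    result = {}
    for line in rows.strip().splitlines():
        test, rate = [cell.strip() for cell in line.strip("|").split("|")]
        match = re.fullmatch(r"(pp|tg)\d+(?: @ d(\d+))?", test)
        kind, depth = match.group(1), int(match.group(2) or 0)
        result[(kind, depth)] = float(rate.split("±")[0])
    return result


def retention(results: dict) -> dict:
    """Throughput at each depth as a percentage of the same test at depth 0."""
    return {
        key: round(100 * value / results[(key[0], 0)], 1)
        for key, value in results.items()
    }


for (kind, depth), pct in retention(parse(ROWS)).items():
    print(f"{kind} @ depth {depth:>5}: {pct:5.1f}% of depth-0 rate")
```

On these numbers, prefill keeps well under half of its depth-0 rate at both depths, while token generation keeps well over half, which is the gap between PP and TG that the comment is pointing at.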

[-]

It's been about 6 months. I'm setting up a Framework AMD Strix Halo next week with an option to plug a GPU into the x4 PCIe slot. If that works well enough once I learn to split work across the AMD and Nvidia sides, I'm looking at an Nvidia GPU with 24 GB or more of VRAM for that hybrid setup.

Did I burn cash on the Strix system? Am I mad to chase an attached Nvidia GPU and hybrid setup? Is this going to be enough fun to validate the spend vs skill gain? Let's find out.

[-]

fallingdowndizzyvr@reddit

Am I mad to chase an attached Nvidia GPU and hybrid setup?

Not at all. I've used a 3060 with my Strix Halo before. I had a 7900xtx in the eGPU slot for the longest time. It's currently rocking a V340. And soon, I'll have a 5070ti as the little helper for my Strix Halo.

[-]

jjwhitaker@reddit

A 3060 12gb is what I'm currently testing on, via Ubuntu with a 5900XT and 32gb of slow DDR4 (I already had these parts). I'm impressed, enough to seek more VRAM with the Strix Halo.

I already have my main pc or laptop connecting using LM Studio Link and using a model on the Ubuntu server in VS Code/Copilot. I've been using Gemma 4 E4B a lot this last week preparing for the Copilot Pro token budget changes that hit today...

It reads like Vulkan will be my friend, or vLLM, for splitting between AMD and Nvidia. I'll test out the x4 to x16 riser cable currently in the mail and see if I need a more reliable setup for the 3060.

After my company built an Azure AI Service-backed code review process that works great, we're investing in an on-prem server for local LLMs. With luck I can follow that group and get past the shiny hardware phase of this project, learn both sides of the hardware battle, and figure out what training an LLM is all about.

[-]

Timely-Coffee-6408@reddit

Tg?

[-]

fallingdowndizzyvr@reddit

Token Generation

[-]

test	t/s
pp4096	997.70 ± 0.98
tg128	46.18 ± 0.00
pp4096 @ d20000	364.25 ± 0.82
tg128 @ d20000	18.16 ± 0.00
pp4096 @ d48000	183.86 ± 0.41
tg128 @ d48000	10.80 ± 0.00

test	t/s
pp4096	4065.77 ± 25.95
tg128	39.35 ± 0.05
pp4096 @ d20000	3267.95 ± 27.74
tg128 @ d20000	36.96 ± 0.24
pp4096 @ d48000	2497.25 ± 66.31
tg128 @ d48000	35.18 ± 0.62
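
The two tables above are easier to compare as ratios. This is a minimal sketch that divides the second table's rate by the first for each test; the thread does not label which machine produced which table, so `FIRST` and `SECOND` are placeholder names and the values are simply copied from the rows as posted.

```python
# Tab-separated rows copied from the two tables above, in the order they appear.
FIRST = (
    "pp4096\t997.70 ± 0.98\n"
    "tg128\t46.18 ± 0.00\n"
    "pp4096 @ d20000\t364.25 ± 0.82\n"
    "tg128 @ d20000\t18.16 ± 0.00\n"
    "pp4096 @ d48000\t183.86 ± 0.41\n"
    "tg128 @ d48000\t10.80 ± 0.00\n"
)
SECOND = (
    "pp4096\t4065.77 ± 25.95\n"
    "tg128\t39.35 ± 0.05\n"
    "pp4096 @ d20000\t3267.95 ± 27.74\n"
    "tg128 @ d20000\t36.96 ± 0.24\n"
    "pp4096 @ d48000\t2497.25 ± 66.31\n"
    "tg128 @ d48000\t35.18 ± 0.62\n"
)


def parse(table: str) -> dict:
    """Map each test name to its mean tokens/s, dropping the ± spread."""
    rates = {}
    for line in table.strip().splitlines():
        test, rate = line.split("\t")
        rates[test] = float(rate.split("±")[0])
    return rates


def ratios(first: dict, second: dict) -> dict:
    """Second-table rate divided by first-table rate, per test."""
    return {test: round(second[test] / first[test], 2) for test in first}


for test, ratio in ratios(parse(FIRST), parse(SECOND)).items():
    print(f"{test:16s} second/first = {ratio}x")
```

The thing to look for is the tg128 rows: the second run is slower at depth 0 but ahead at d20000 and d48000, so the ordering flips once context is added.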

model	size	params	backend	ngl	n_batch	n_ubatch	fa	mmap	test	t/s
gpt-oss 120B MXFP4 MoE	59.02 GiB	116.83 B	CUDA	999	4096	4096	1	0	pp4096 @ d125000	1319.88 ± 97.42
gpt-oss 120B MXFP4 MoE	59.02 GiB	116.83 B	CUDA	999	4096	4096	1	0	tg128 @ d125000	31.56 ± 0.62