Strix Halo + eGPU RTX 5070 Ti via OCuLink in llama.cpp: Benchmarks and Conclusions (Part 2)
Posted by xspider2000@reddit | LocalLLaMA | 14 comments
Hello everyone! Based on the community's feedback on the previous post, I decided to write this follow-up to clarify and expand on a few things.
Many of you in the comments asked for benchmarks, so I'll start with benchmarks for current models.
I benchmarked Qwen3.5-27B-UD-Q4_K_XL.gguf, distributing the layers (tensor split) between the APU and the eGPU in 10% increments: from 100%/0% to 0%/100%.
Below, I'll show why running these benchmarks wasn't strictly necessary in the first place. We will compare the measured PP (Prompt Processing) and TG (Token Generation) metrics against the ones predicted by the formula from my first article. The main goal of the previous post was to demonstrate a universal method for estimating the performance of an APU+eGPU setup for any model when using a tensor split. However, judging by the number of questions, I didn't convey this idea clearly enough, so I'm correcting that now!
~/llama.cpp/build-vulkan/bin/llama-bench \
-m ~/Qwen3.5-27B-UD-Q4_K_XL.gguf \
-ngl 99 \
-fa 1 \
-dev vulkan1/vulkan0 \
-ts 10/0,9/1,8/2,7/3,6/4,5/5,4/6,3/7,2/8,1/9,0/10
ggml_vulkan: Found 2 Vulkan devices:
ggml_vulkan: 0 = NVIDIA GeForce RTX 5070 Ti (NVIDIA) | uma: 0 | fp16: 1 | bf16: 1 | warp size: 32 | shared memory: 49152 | int dot: 1 | matrix cores: NV_coopmat2
ggml_vulkan: 1 = Radeon 8060S Graphics (RADV STRIX_HALO) (radv) | uma: 1 | fp16: 1 | bf16: 0 | warp size: 64 | shared memory: 65536 | int dot: 1 | matrix cores: KHR_coopmat
| model | size | params | backend | ngl | fa | dev | ts | test | t/s |
|---|---|---|---|---|---|---|---|---|---|
| qwen35 27B Q4_K - Medium | 16.40 GiB | 26.90 B | Vulkan | 99 | 1 | Vulkan1/Vulkan0 | 10.00 | pp512 | 268.02 ± 0.46 |
| qwen35 27B Q4_K - Medium | 16.40 GiB | 26.90 B | Vulkan | 99 | 1 | Vulkan1/Vulkan0 | 10.00 | tg128 | 11.89 ± 0.03 |
| qwen35 27B Q4_K - Medium | 16.40 GiB | 26.90 B | Vulkan | 99 | 1 | Vulkan1/Vulkan0 | 9.00/1.00 | pp512 | 280.95 ± 10.11 |
| qwen35 27B Q4_K - Medium | 16.40 GiB | 26.90 B | Vulkan | 99 | 1 | Vulkan1/Vulkan0 | 9.00/1.00 | tg128 | 12.43 ± 0.03 |
| qwen35 27B Q4_K - Medium | 16.40 GiB | 26.90 B | Vulkan | 99 | 1 | Vulkan1/Vulkan0 | 8.00/2.00 | pp512 | 267.87 ± 9.95 |
| qwen35 27B Q4_K - Medium | 16.40 GiB | 26.90 B | Vulkan | 99 | 1 | Vulkan1/Vulkan0 | 8.00/2.00 | tg128 | 12.89 ± 0.02 |
| qwen35 27B Q4_K - Medium | 16.40 GiB | 26.90 B | Vulkan | 99 | 1 | Vulkan1/Vulkan0 | 7.00/3.00 | pp512 | 293.02 ± 2.44 |
| qwen35 27B Q4_K - Medium | 16.40 GiB | 26.90 B | Vulkan | 99 | 1 | Vulkan1/Vulkan0 | 7.00/3.00 | tg128 | 13.48 ± 0.13 |
| qwen35 27B Q4_K - Medium | 16.40 GiB | 26.90 B | Vulkan | 99 | 1 | Vulkan1/Vulkan0 | 6.00/4.00 | pp512 | 336.32 ± 1.94 |
| qwen35 27B Q4_K - Medium | 16.40 GiB | 26.90 B | Vulkan | 99 | 1 | Vulkan1/Vulkan0 | 6.00/4.00 | tg128 | 14.62 ± 0.24 |
| qwen35 27B Q4_K - Medium | 16.40 GiB | 26.90 B | Vulkan | 99 | 1 | Vulkan1/Vulkan0 | 5.00/5.00 | pp512 | 377.92 ± 14.46 |
| qwen35 27B Q4_K - Medium | 16.40 GiB | 26.90 B | Vulkan | 99 | 1 | Vulkan1/Vulkan0 | 5.00/5.00 | tg128 | 17.20 ± 0.08 |
| qwen35 27B Q4_K - Medium | 16.40 GiB | 26.90 B | Vulkan | 99 | 1 | Vulkan1/Vulkan0 | 4.00/6.00 | pp512 | 462.06 ± 3.56 |
| qwen35 27B Q4_K - Medium | 16.40 GiB | 26.90 B | Vulkan | 99 | 1 | Vulkan1/Vulkan0 | 4.00/6.00 | tg128 | 19.81 ± 0.08 |
| qwen35 27B Q4_K - Medium | 16.40 GiB | 26.90 B | Vulkan | 99 | 1 | Vulkan1/Vulkan0 | 3.00/7.00 | pp512 | 563.40 ± 1.84 |
| qwen35 27B Q4_K - Medium | 16.40 GiB | 26.90 B | Vulkan | 99 | 1 | Vulkan1/Vulkan0 | 3.00/7.00 | tg128 | 22.19 ± 0.10 |
| qwen35 27B Q4_K - Medium | 16.40 GiB | 26.90 B | Vulkan | 99 | 1 | Vulkan1/Vulkan0 | 2.00/8.00 | pp512 | 757.22 ± 3.64 |
| qwen35 27B Q4_K - Medium | 16.40 GiB | 26.90 B | Vulkan | 99 | 1 | Vulkan1/Vulkan0 | 2.00/8.00 | tg128 | 26.05 ± 0.06 |
| qwen35 27B Q4_K - Medium | 16.40 GiB | 26.90 B | Vulkan | 99 | 1 | Vulkan1/Vulkan0 | 1.00/9.00 | pp512 | 988.62 ± 5.18 |
| qwen35 27B Q4_K - Medium | 16.40 GiB | 26.90 B | Vulkan | 99 | 1 | Vulkan1/Vulkan0 | 1.00/9.00 | tg128 | 30.25 ± 0.06 |
ggml_vulkan: Device memory allocation of size 1067094656 failed.
ggml_vulkan: vk::Device::allocateMemory: ErrorOutOfDeviceMemory
main: error: failed to load model '~/Qwen3.5-27B-UD-Q4_K_XL.gguf'
The model didn't entirely fit into VRAM, so at 100% VRAM offload, llama-bench crashed with an out-of-memory error.
In the comments, many people were rightly surprised that I ran tests on the outdated llama-2-7b.Q4_0.gguf. Let me explain: it was a conscious choice, for two reasons:
- It's a universal baseline for comparison. Historically, this exact model became the "gold standard" for testing LLM hardware. There is a massive database of results online (for example, in this GitHub thread) for a wide variety of configurations: Apple Silicon, NVIDIA, AMD, APUs, and their backends. By comparing the TG and PP metrics on this Llama, it's easy to understand the performance level of our APU+eGPU combo relative to any other hardware out there.
- Calculating the hardware performance constant. On this model, I measured the TG128 and PP512 speeds for each device separately (with the model loaded entirely on the RTX 5070 Ti or entirely on the Strix Halo). The absolute numbers for the old Llama aren't what matters here; their ratio is. The ratio of GPU speed to APU speed (let's call it the GtA_ratio) is a constant that depends solely on the memory bandwidth and compute power of the chips themselves, and this constant will be the same for any model.
Here is what it looks like in numbers:
- Token Generation (TG128): For the 5070 Ti, it's 168.91 t/s; for the Strix Halo, it's 52.62 t/s. The TG128 GtA_ratio constant = 168.91 / 52.62 = 3.21.
- Prompt Processing (PP512): For the 5070 Ti, it's 7461.22 t/s; for the Strix Halo, it's 1194.55 t/s. The PP512 GtA_ratio constant = 7461.22 / 1194.55 = 6.25.
Naturally, if you swap the graphics card for a different one, these constants will change. But knowing them for your current system allows you to predict speeds for any new LLM.
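In code form, computing these constants is a one-liner per metric; the t/s values below are the single-device measurements quoted above:

```python
# Single-device llama-bench results for llama-2-7b.Q4_0.gguf (t/s)
gpu_tg128, apu_tg128 = 168.91, 52.62     # RTX 5070 Ti vs Strix Halo, tg128
gpu_pp512, apu_pp512 = 7461.22, 1194.55  # RTX 5070 Ti vs Strix Halo, pp512

gta_tg = gpu_tg128 / apu_tg128  # ~3.21
gta_pp = gpu_pp512 / apu_pp512  # ~6.25
print(f"GtA_ratio TG128: {gta_tg:.2f}, PP512: {gta_pp:.2f}")
```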
In the previous article, I mentioned that the performance drop during a tensor split follows Amdahl's Law, and that the graph of this drop is a hyperbola. For greater clarity, I have slightly adapted the base formula.
Here is what it looks like now:
Perf = [ GtA_ratio / ( 1 + (Share / 100) * (GtA_ratio - 1) ) ] * 100%
Where:
- Perf — total system performance (as a percentage relative to the base APU speed).
- GtA_ratio — our eGPU-to-APU speed ratio (the constant we calculated earlier).
- Share — the percentage of the model offloaded to the slower system memory (APU RAM). It ranges from 0 to 100, where 0 means the entire model fits into the fast eGPU VRAM, and 100 means it runs entirely in the system RAM.
Let's plot the overall performance graph based on our baseline llama-2-7b.Q4_0.gguf benchmarks.

Now, let's overlay the fresh test results for the current Qwen3.5-27B-UD-Q4_K_XL.gguf model onto this hyperbola.
Just a quick reminder: because the model didn't fully fit into VRAM, the final data point (100% VRAM offload) is missing from the graph.
As you can see, the real Qwen3.5 tests fit our mathematical curve perfectly! This proves the main point: to estimate the system performance for any new model, you don't necessarily have to run benchmarks. It's enough to follow a simple 3-step algorithm:
- Calculate the model's "tail": Subtract the GPU VRAM capacity (in my case, 16 GB) from the model file size. This tells us how many gigabytes of weights won't fit in the eGPU and will be sent to the Strix Halo's RAM.
- Find the Share percentage: Convert this "tail" into a percentage of the total model size. The resulting number is our Share value.
- Apply the formula: Plug in Share and our GtA_ratio constants to calculate the final speed Perf.
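These three steps are easy to wrap into a small estimator. A minimal sketch, assuming model size and VRAM are expressed in the same units (GiB), and keeping in mind that the usable VRAM is somewhat less than the nominal 16 GB because of context and compute buffers, as the OOM crash above showed:

```python
def estimate_perf(model_gib: float, vram_gib: float, gta_ratio: float) -> float:
    """Predict speed as a percentage of the APU-only speed for a tensor split
    that puts everything that fits onto the eGPU."""
    tail = max(model_gib - vram_gib, 0.0)  # step 1: weights spilling to APU RAM
    share = 100.0 * tail / model_gib       # step 2: tail as % of the whole model
    # step 3: the Amdahl-style formula from the post
    return gta_ratio / (1 + (share / 100) * (gta_ratio - 1)) * 100

# Example with the 16.40 GiB Qwen3.5-27B quant against a nominal 16 GiB card:
print(estimate_perf(16.40, 16.0, 3.21))  # TG128: ~305% of APU-only speed
print(estimate_perf(16.40, 16.0, 6.25))  # PP512: ~554% of APU-only speed
```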
For my system (RTX 5070 Ti + Strix Halo), the calculations look like this:
For Token Generation (TG128): GtA_ratio = 3.21. Formula:
Perf_tg128 = [ 3.21 / ( 1 + (Share / 100) * (3.21 - 1) ) ] * 100%
For Prompt Processing (PP512): GtA_ratio = 6.25. Formula:
Perf_pp512 = [ 6.25 / ( 1 + (Share / 100) * (6.25 - 1) ) ] * 100%
Reminder: Perf_tg128 and Perf_pp512 will show you the operating speed as a percentage relative to running the model solely on a single APU.
Another hot topic in the comments is the choice of eGPU interface. Many people asked about OCuLink versus Thunderbolt (TB) or USB4. Let's break down the mechanics of the process to clear up the confusion.
As I mentioned before, OCuLink is not a bottleneck for either prompt processing (PP) or token generation (TG). To understand why, let's look at what makes up the generation time of a single token when using Tensor Split. It is always the sum of three stages:
- Computing the first chunk of layers on the eGPU.
- Transmitting the activation tensor (intermediate results) through the cable from the eGPU to the APU.
- Computing the remaining layers in the APU's system RAM.
And here lies the most crucial nuance: during the second stage, latency is far more important than bandwidth.
The size of the transmitted activation tensor is relatively small, so the raw bandwidth of any modern interface (whether OCuLink, TB, or USB4) is more than enough with plenty of headroom. They do not saturate the "pipe." But because this transmission cycle repeats for every single generated token, what comes to the forefront is how quickly the signal initializes and travels from point A to point B.
This is where the main technical difference lies:
- OCuLink is essentially a "naked" PCIe bus extension. Data travels directly to the CPU lanes with the lowest possible latency.
- Thunderbolt and USB4 are forced to package (encapsulate) the PCIe signal into their own protocol, pass it through a controller, and then unpack it on the other side. This adds overhead and micro-delays to every transaction.
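A back-of-the-envelope model makes this latency-versus-bandwidth point concrete. The numbers below are illustrative assumptions, not measurements: a per-token hidden state of roughly 10 KB, a ~64 Gb/s effective link for both interfaces, ~2 µs round-trip latency for bare PCIe over OCuLink, and a couple dozen extra microseconds for TB/USB4 encapsulation:

```python
def per_token_link_us(tensor_kb: float, bw_gbps: float, latency_us: float) -> float:
    """Per-token cost (microseconds) of shipping one activation tensor
    from the eGPU to the APU: serialization time plus fixed link latency."""
    transfer_us = tensor_kb * 8 / bw_gbps  # KB over Gb/s comes out in microseconds
    return transfer_us + latency_us

# Hypothetical figures, chosen only to show the shape of the trade-off:
oculink = per_token_link_us(tensor_kb=10, bw_gbps=64, latency_us=2)   # 3.25 us
usb4    = per_token_link_us(tensor_kb=10, bw_gbps=64, latency_us=27)  # 28.25 us
print(f"OCuLink: {oculink:.2f} us/token, USB4: {usb4:.2f} us/token")
```

Under these assumptions the transfer term is the same tiny 1.25 µs in both cases; the entire difference comes from the fixed latency. Halving the bandwidth would barely move either number, while encapsulation overhead multiplies the per-token cost (and if llama.cpp issues several transactions per token, that penalty grows proportionally).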
Therefore, if you have a choice of interface for local LLMs, it is highly recommended to use OCuLink.
~/llama.cpp/build-vulkan/bin/llama-bench \
-m ~/Qwen3.5-122B-A10B-UD-Q4_K_XL-00001-of-00003.gguf \
-ngl 99 \
-fa 1 \
-dev vulkan1/vulkan0 \
-ts 100/0,95/5,90/10,85/15,80/20,75/25,70/30
ggml_vulkan: Found 2 Vulkan devices:
ggml_vulkan: 0 = NVIDIA GeForce RTX 5070 Ti (NVIDIA) | uma: 0 | fp16: 1 | bf16: 1 | warp size: 32 | shared memory: 49152 | int dot: 1 | matrix cores: NV_coopmat2
ggml_vulkan: 1 = Radeon 8060S Graphics (RADV STRIX_HALO) (radv) | uma: 1 | fp16: 1 | bf16: 0 | warp size: 64 | shared memory: 65536 | int dot: 1 | matrix cores: KHR_coopmat
| model | size | params | backend | ngl | fa | dev | ts | test | t/s |
|---|---|---|---|---|---|---|---|---|---|
| qwen35moe 122B.A10B Q4_K - Medium | 71.73 GiB | 122.11 B | Vulkan | 99 | 1 | Vulkan1/Vulkan0 | 100.00 | pp512 | 247.59 ± 5.96 |
| qwen35moe 122B.A10B Q4_K - Medium | 71.73 GiB | 122.11 B | Vulkan | 99 | 1 | Vulkan1/Vulkan0 | 100.00 | tg128 | 19.46 ± 0.26 |
| qwen35moe 122B.A10B Q4_K - Medium | 71.73 GiB | 122.11 B | Vulkan | 99 | 1 | Vulkan1/Vulkan0 | 95.00/5.00 | pp512 | 270.07 ± 2.77 |
| qwen35moe 122B.A10B Q4_K - Medium | 71.73 GiB | 122.11 B | Vulkan | 99 | 1 | Vulkan1/Vulkan0 | 95.00/5.00 | tg128 | 19.91 ± 0.63 |
| qwen35moe 122B.A10B Q4_K - Medium | 71.73 GiB | 122.11 B | Vulkan | 99 | 1 | Vulkan1/Vulkan0 | 90.00/10.00 | pp512 | 281.56 ± 12.32 |
| qwen35moe 122B.A10B Q4_K - Medium | 71.73 GiB | 122.11 B | Vulkan | 99 | 1 | Vulkan1/Vulkan0 | 90.00/10.00 | tg128 | 20.40 ± 0.39 |
| qwen35moe 122B.A10B Q4_K - Medium | 71.73 GiB | 122.11 B | Vulkan | 99 | 1 | Vulkan1/Vulkan0 | 85.00/15.00 | pp512 | 295.46 ± 16.68 |
| qwen35moe 122B.A10B Q4_K - Medium | 71.73 GiB | 122.11 B | Vulkan | 99 | 1 | Vulkan1/Vulkan0 | 85.00/15.00 | tg128 | 20.75 ± 0.57 |
| qwen35moe 122B.A10B Q4_K - Medium | 71.73 GiB | 122.11 B | Vulkan | 99 | 1 | Vulkan1/Vulkan0 | 80.00/20.00 | pp512 | 311.33 ± 2.39 |
| qwen35moe 122B.A10B Q4_K - Medium | 71.73 GiB | 122.11 B | Vulkan | 99 | 1 | Vulkan1/Vulkan0 | 80.00/20.00 | tg128 | 21.79 ± 0.46 |
ggml_vulkan: Device memory allocation of size 650418176 failed.
ggml_vulkan: vk::Device::allocateMemory: ErrorOutOfDeviceMemory
main: error: failed to load model '~/Qwen3.5-122B-A10B-GGUF/Qwen3.5-122B-A10B-UD-Q4_K_XL-00001-of-00003.gguf'
As you can see, because only a small fraction of the model (up to 20%) fit into the VRAM, the overall TG and PP speeds increased only slightly. Specifically, Token Generation (TG) went up by just ~12% (from 19.46 to 21.79 t/s), and Prompt Processing (PP) increased by ~25.7% (from 247.59 to 311.33 t/s).
For massive models, the performance uplift is limited simply because the eGPU's VRAM capacity is usually much smaller than the massive system RAM available on the Strix Halo.
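As a rough sanity check, the estimation method from earlier in the post can be applied here too, with a big caveat: the GtA_ratio constants were derived on a dense model, so a MoE model shouldn't be expected to match them exactly:

```python
gta_tg = 3.21                      # TG128 constant measured on the dense baseline
model_gib, vram_gib = 71.73, 16.0  # Qwen3.5-122B-A10B quant vs RTX 5070 Ti VRAM

share = 100 * (model_gib - vram_gib) / model_gib        # ~77.7% stays in APU RAM
perf = gta_tg / (1 + share / 100 * (gta_tg - 1)) * 100  # % of APU-only speed
print(f"predicted TG uplift: ~{perf - 100:.0f}%")       # ~18%
```

The measured uplift at the last split that actually fit (ts 80/20) was ~12%, which is at least in the same ballpark; the gap is unsurprising for a MoE model, where the active parameters per token behave quite differently from a dense model's.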
jdchmiel@reddit
Can you test with the expert layers on the external GPU and the MoE layers on the Strix Halo? I think you should be able to do this with the manual offload commands we used before the ncmoe and fit flags were added.
https://unsloth.ai/docs/models/gpt-oss-how-to-run-and-fine-tune#improving-generation-speed This has some examples with putting specific layers onto the CPU, but you can substitute CPU with Vulkan1 or Vulkan0 as desired.
xspider2000@reddit (OP)
I will try a bit later
aigemie@reddit
Very cool tests and write-up! Thank you! I wonder if we can force the eGPU to do PP, since that's pure computational power and the Strix Halo is far weaker than the eGPU.
Necessary-Summer-348@reddit
Curious what the PCIe bandwidth overhead looks like with OCuLink in practice. The theoretical ceiling is one thing but llama.cpp with large context windows can get chatty between host and device. Did you notice any bottlenecking during prefill vs decode phases?
FeiX7@reddit
If you have an APU, why do you need a GPU on top of that?
ProfessionalSpend589@reddit
Prompt processing and token generation are a lot faster if a model can fit in the eGPU. It may play a role as a small but fast companion to a bigger model. Or the GPU can hold a portion of the weights or context for a big model that otherwise wouldn't fit on the machine.
Also some people seem to like to play with ComfyUI, which runs best on a GPU.
Potential-Leg-639@reddit
Again, so much text and data, but the Qwen3.5 27B dense model would probably be a better candidate to test. Not only because it's a better model compared to the 122B, but here the eGPU could also make a difference.
And keep it simpler:
Benchmark with eGPU and without eGPU, that's it, no explanations of every little detail that 90% of the people in here know anyway...
Clear-Ad-9312@reddit
I feel like there are plenty of "without eGPU" benchmarks out there because, well, this guy is one of the few who has hooked up an eGPU.
+1 on keeping it simpler, it's hard to ingest a lot of fluff
xspider2000@reddit (OP)
thx for ur comment. the post here includes tests for both Qwen3.5-27B and Qwen3.5-122B-A10B
External_Dentist1928@reddit
You should set a meaningful context window in llama-bench, e.g., -d 100000. I don't see why you should do all of these tests with an almost-zero context window
kosnarf@reddit
Thanks for this data! I was thinking about this for a while but did not find anyone performing benchmarks.
simracerman@reddit
I replicated your setup for the 27B and definitely see a 20-30% improvement in TG, but my iGPU is considerably slower than yours, and as a result, prompt processing dropped about 30%. The good news is that when offloading to the iGPU with long context I keep the TG numbers, whereas CPU offloading kills TG for long context.
The reason your 122B numbers don't change much is that llama.cpp already does the work of offloading non-active params to the iGPU/CPU. Keeping the important stuff on your dGPU in MoE is a standard setup, so you won't see much improvement there. Now, a large dense model is your target. Wondering how the largest Gemma4 dense model behaves.
xspider2000@reddit (OP)
Yeah, while Strix Halo is solid for large MoE models, a Strix Halo + eGPU combo expands capabilities to handle medium dense models, such as Qwen3.5-27B or Gemma4-31B
Clear-Ad-9312@reddit
Awesome, been looking into how that would work and if it would be worth it.