Strix Halo + eGPU RTX 5070 Ti via OCuLink in llama.cpp: Benchmarks and Conclusions
Posted by xspider2000@reddit | LocalLLaMA | View on Reddit | 48 comments

Hey everyone. I have a Strix Halo miniPC (Minisforum MS-S1 Max). I added an RTX 5070 Ti eGPU to it via OCuLink, ran some tests on how they work together in llama.cpp, and wanted to share some of my findings.
TL;DR of my findings:
- Vulkan's versatility: It's a highly efficient API that lets you stably combine chips from different vendors (like an AMD APU + NVIDIA GPU). The performance drop compared to native CUDA or ROCm is minimal, just about 5–10%.
- The role of OCuLink: The bandwidth of this connection doesn't bottleneck token generation (tg) or prompt processing (pp). The data transferred is tiny. The real latency comes from the fast GPU idling while waiting for the slower APU.
- Amdahl's Law and Tensor Split: Since devices in llama.cpp process layers strictly sequentially (like a relay race), offloading some computations to slower memory causes a non-linear, hyperbolic drop in overall speed. This overall performance degradation for sequential execution is exactly what Amdahl's Law describes.
First, here are the standard llama-bench results for each GPU using their native backends:
~/llama.cpp/build-rocm/bin/llama-bench -m ~/llama-2-7b.Q4_0.gguf -ngl 99 -fa 1 -p 512,2048,8192
ggml_cuda_init: found 1 ROCm devices (Total VRAM: 126976 MiB): Device 0: Radeon 8060S Graphics, gfx1151 (0x1151), VMM: no, Wave Size: 32, VRAM: 126976 MiB
| model | size | params | backend | ngl | fa | test | t/s |
|---|---|---|---|---|---|---|---|
| llama 7B Q4_0 | 3.56 GiB | 6.74 B | ROCm | 99 | 1 | pp512 | 1493.28 ± 30.20 |
| llama 7B Q4_0 | 3.56 GiB | 6.74 B | ROCm | 99 | 1 | pp2048 | 1350.47 ± 40.94 |
| llama 7B Q4_0 | 3.56 GiB | 6.74 B | ROCm | 99 | 1 | pp8192 | 958.19 ± 1.85 |
| llama 7B Q4_0 | 3.56 GiB | 6.74 B | ROCm | 99 | 1 | tg128 | 50.16 ± 0.07 |
~/llama.cpp/build-cuda/bin/llama-bench -m ~/llama-2-7b.Q4_0.gguf -ngl 99 -fa 1 -p 512,2048,8192
ggml_cuda_init: found 1 CUDA devices (Total VRAM: 15841 MiB): Device 0: NVIDIA GeForce RTX 5070 Ti, compute capability 12.0, VMM: yes, VRAM: 15841 MiB
| model | size | params | backend | ngl | fa | test | t/s |
|---|---|---|---|---|---|---|---|
| llama 7B Q4_0 | 3.56 GiB | 6.74 B | CUDA | 99 | 1 | pp512 | 8476.95 ± 206.73 |
| llama 7B Q4_0 | 3.56 GiB | 6.74 B | CUDA | 99 | 1 | pp2048 | 8081.18 ± 27.82 |
| llama 7B Q4_0 | 3.56 GiB | 6.74 B | CUDA | 99 | 1 | pp8192 | 6266.69 ± 6.90 |
| llama 7B Q4_0 | 3.56 GiB | 6.74 B | CUDA | 99 | 1 | tg128 | 179.20 ± 0.13 |
Now, the tests for each GPU using Vulkan:
GGML_VK_VISIBLE_DEVICES=0 ~/llama.cpp/build-vulkan/bin/llama-bench -m ~/llama-2-7b.Q4_0.gguf -ngl 99 -fa 1 -p 512,2048,8192
ggml_vulkan: Found 1 Vulkan devices: ggml_vulkan: 0 = NVIDIA GeForce RTX 5070 Ti (NVIDIA) | uma: 0 | fp16: 1 | bf16: 1 | warp size: 32 | shared memory: 49152 | int dot: 1 | matrix cores: NV_coopmat2
| model | size | params | backend | ngl | fa | test | t/s |
|---|---|---|---|---|---|---|---|
| llama 7B Q4_0 | 3.56 GiB | 6.74 B | Vulkan | 99 | 1 | pp512 | 7466.51 ± 17.68 |
| llama 7B Q4_0 | 3.56 GiB | 6.74 B | Vulkan | 99 | 1 | pp2048 | 7216.51 ± 1.77 |
| llama 7B Q4_0 | 3.56 GiB | 6.74 B | Vulkan | 99 | 1 | pp8192 | 6319.98 ± 7.82 |
| llama 7B Q4_0 | 3.56 GiB | 6.74 B | Vulkan | 99 | 1 | tg128 | 167.77 ± 1.56 |
GGML_VK_VISIBLE_DEVICES=1 ~/llama.cpp/build-vulkan/bin/llama-bench -m ~/llama-2-7b.Q4_0.gguf -ngl 99 -fa 1 -p 512,2048,8192
ggml_vulkan: Found 1 Vulkan devices: ggml_vulkan: 0 = Radeon 8060S Graphics (RADV STRIX_HALO) (radv) | uma: 1 | fp16: 1 | bf16: 0 | warp size: 64 | shared memory: 65536 | int dot: 1 | matrix cores: KHR_coopmat
| model | size | params | backend | ngl | fa | test | t/s |
|---|---|---|---|---|---|---|---|
| llama 7B Q4_0 | 3.56 GiB | 6.74 B | Vulkan | 99 | 1 | pp512 | 1327.76 ± 17.68 |
| llama 7B Q4_0 | 3.56 GiB | 6.74 B | Vulkan | 99 | 1 | pp2048 | 1252.70 ± 5.86 |
| llama 7B Q4_0 | 3.56 GiB | 6.74 B | Vulkan | 99 | 1 | pp8192 | 960.10 ± 2.37 |
| llama 7B Q4_0 | 3.56 GiB | 6.74 B | Vulkan | 99 | 1 | tg128 | 52.29 ± 0.15 |
And the most interesting part: testing both GPUs working together with tensor split via Vulkan. The model weights were distributed between the NVIDIA RTX 5070 Ti VRAM and the AMD Radeon 8060S UMA in the following proportions: 100%/0%, 90%/10%, 80%/20%, 70%/30%, 60%/40%, 50%/50%, 40%/60%, 30%/70%, 20%/80%, 10%/90%, 0%/100%.
GGML_VK_VISIBLE_DEVICES=0,1 ~/llama.cpp/build-vulkan/bin/llama-bench -m ~/llama-2-7b.Q4_0.gguf -ngl 99 -fa 1 -dev vulkan0/vulkan1 -ts 10/0,9/1,8/2,7/3,6/4,5/5,4/6,3/7,2/8,1/9,0/10 -n 128 -p 512 -r 10
ggml_vulkan: Found 2 Vulkan devices: ggml_vulkan: 0 = NVIDIA GeForce RTX 5070 Ti (NVIDIA) | uma: 0 | fp16: 1 | bf16: 1 | warp size: 32 | shared memory: 49152 | int dot: 1 | matrix cores: NV_coopmat2 ggml_vulkan: 1 = Radeon 8060S Graphics (RADV STRIX_HALO) (radv) | uma: 1 | fp16: 1 | bf16: 0 | warp size: 64 | shared memory: 65536 | int dot: 1 | matrix cores: KHR_coopmat
| model | size | params | backend | ngl | fa | dev | ts | test | t/s |
|---|---|---|---|---|---|---|---|---|---|
| llama 7B Q4_0 | 3.56 GiB | 6.74 B | Vulkan | 99 | 1 | Vulkan0/1 | 10.00 | pp512 | 7461.22 ± 6.37 |
| llama 7B Q4_0 | 3.56 GiB | 6.74 B | Vulkan | 99 | 1 | Vulkan0/1 | 10.00 | tg128 | 168.91 ± 0.43 |
| llama 7B Q4_0 | 3.56 GiB | 6.74 B | Vulkan | 99 | 1 | Vulkan0/1 | 9.00/1.00 | pp512 | 5790.85 ± 52.68 |
| llama 7B Q4_0 | 3.56 GiB | 6.74 B | Vulkan | 99 | 1 | Vulkan0/1 | 9.00/1.00 | tg128 | 130.22 ± 0.40 |
| llama 7B Q4_0 | 3.56 GiB | 6.74 B | Vulkan | 99 | 1 | Vulkan0/1 | 8.00/2.00 | pp512 | 4230.90 ± 28.90 |
| llama 7B Q4_0 | 3.56 GiB | 6.74 B | Vulkan | 99 | 1 | Vulkan0/1 | 8.00/2.00 | tg128 | 112.66 ± 0.23 |
| llama 7B Q4_0 | 3.56 GiB | 6.74 B | Vulkan | 99 | 1 | Vulkan0/1 | 7.00/3.00 | pp512 | 3356.88 ± 27.64 |
| llama 7B Q4_0 | 3.56 GiB | 6.74 B | Vulkan | 99 | 1 | Vulkan0/1 | 7.00/3.00 | tg128 | 99.83 ± 0.20 |
| llama 7B Q4_0 | 3.56 GiB | 6.74 B | Vulkan | 99 | 1 | Vulkan0/1 | 6.00/4.00 | pp512 | 2658.89 ± 13.26 |
| llama 7B Q4_0 | 3.56 GiB | 6.74 B | Vulkan | 99 | 1 | Vulkan0/1 | 6.00/4.00 | tg128 | 85.67 ± 2.50 |
| llama 7B Q4_0 | 3.56 GiB | 6.74 B | Vulkan | 99 | 1 | Vulkan0/1 | 5.00/5.00 | pp512 | 2185.28 ± 16.92 |
| llama 7B Q4_0 | 3.56 GiB | 6.74 B | Vulkan | 99 | 1 | Vulkan0/1 | 5.00/5.00 | tg128 | 76.73 ± 1.13 |
| llama 7B Q4_0 | 3.56 GiB | 6.74 B | Vulkan | 99 | 1 | Vulkan0/1 | 4.00/6.00 | pp512 | 1946.46 ± 19.60 |
| llama 7B Q4_0 | 3.56 GiB | 6.74 B | Vulkan | 99 | 1 | Vulkan0/1 | 4.00/6.00 | tg128 | 62.84 ± 0.15 |
| llama 7B Q4_0 | 3.56 GiB | 6.74 B | Vulkan | 99 | 1 | Vulkan0/1 | 3.00/7.00 | pp512 | 1644.25 ± 29.88 |
| llama 7B Q4_0 | 3.56 GiB | 6.74 B | Vulkan | 99 | 1 | Vulkan0/1 | 3.00/7.00 | tg128 | 58.38 ± 0.31 |
| llama 7B Q4_0 | 3.56 GiB | 6.74 B | Vulkan | 99 | 1 | Vulkan0/1 | 2.00/8.00 | pp512 | 1458.99 ± 19.70 |
| llama 7B Q4_0 | 3.56 GiB | 6.74 B | Vulkan | 99 | 1 | Vulkan0/1 | 2.00/8.00 | tg128 | 55.70 ± 0.49 |
| llama 7B Q4_0 | 3.56 GiB | 6.74 B | Vulkan | 99 | 1 | Vulkan0/1 | 1.00/9.00 | pp512 | 1304.67 ± 45.80 |
| llama 7B Q4_0 | 3.56 GiB | 6.74 B | Vulkan | 99 | 1 | Vulkan0/1 | 1.00/9.00 | tg128 | 54.16 ± 1.07 |
| llama 7B Q4_0 | 3.56 GiB | 6.74 B | Vulkan | 99 | 1 | Vulkan0/1 | 0.00/10.00 | pp512 | 1194.55 ± 5.25 |
| llama 7B Q4_0 | 3.56 GiB | 6.74 B | Vulkan | 99 | 1 | Vulkan0/1 | 0.00/10.00 | tg128 | 52.62 ± 0.72 |
During token generation with split layers, the drop in overall tg and pp speed follows Amdahl's Law. Moving even a small fraction of layers to lower-bandwidth memory creates a bottleneck, leading to a non-linear drop in overall speed (t/s). If you graph it, it forms a classic hyperbola.

Formula: P(s) = 100 / [1 + s(k - 1)]
Where:
- P(s) = total system speed (in % of max eGPU speed).
- s = fraction of the model offloaded to the slower APU RAM (from 0 to 1, where 0 is all in VRAM and 1 is all in RAM).
- k = memory bandwidth gap ratio. Calculated as max speed divided by min speed (k = V_max / V_min).
As you can see, the overall tg and pp speeds depend only on the tg and pp of each node. OCuLink doesn't affect the overall speed at all.
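To sanity-check the curve, here is a small Python sketch (my own, not part of the original benchmark run) that plugs the two endpoint tg128 numbers from the table into the formula:

```python
def amdahl_tps(s, t_fast, t_slow):
    """Predicted combined tokens/s when a fraction s of the model
    sits on the slow device and execution is strictly sequential."""
    k = t_fast / t_slow                 # speed gap ratio (k = V_max / V_min)
    return t_fast / (1 + s * (k - 1))

# tg128 endpoints from the table: 168.91 t/s (all eGPU), 52.62 t/s (all APU)
for s in (0.0, 0.2, 0.5, 0.8, 1.0):
    print(f"s={s:.1f}: {amdahl_tps(s, 168.91, 52.62):6.1f} t/s")
```

At s = 0.5 the formula predicts about 80 t/s versus the measured 76.73, and at s = 0.2 about 117 versus the measured 112.66, which matches the hyperbolic shape of the data well.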
Detailed Conclusions & Technical Analysis:
Based on the benchmark data and the architectural specifics of LLMs, here is a deeper breakdown of why we see these results.
1. Vulkan is the Ultimate API for Cross-Vendor Inference
Historically, mixing AMD and NVIDIA chips for compute tasks in a single pipeline has been a driver nightmare. However, llama.cpp's Vulkan backend completely changes the game.
- The Justification: Vulkan abstracts the hardware layer, standardizing the matrix multiplication math across entirely different architectures (RDNA 3.5 on the APU and Blackwell on the RTX 5070 Ti).
- The Result: It allows for seamless, stable pooling of discrete VRAM and system UMA memory. The performance penalty compared to highly optimized, native backends like CUDA or ROCm is practically negligible (only about 5–10%). You lose a tiny fraction of raw speed to the API translation layer, but you gain the massive advantage of fitting larger models across different hardware ecosystems without crashing.
2. The OCuLink Myth: PCIe 4.0 x4 is NOT a Bottleneck for LLMs
There is a widespread stereotype in the eGPU community that the limited bandwidth of OCuLink (~7.8 GB/s, i.e. 64 Gbps) will throttle AI performance. For LLM inference, this is completely false: OCuLink bandwidth utilization sits around 1% during active generation. Here is the math behind why the communication penalty is practically zero:
- Token Generation (Decode Phase): Thanks to the Transformer architecture, GPUs do not send entire neural networks back and forth. When the model is split across two devices, they only pass a small tensor of hidden states (activations) for a single token at a time. That payload is hidden size × 2 bytes in fp16: roughly 8 KB for a 7B model and only 16 KB even for a 70B. Sending kilobytes over a 7.8 GB/s connection takes about a microsecond.
- Context Processing (Prefill Phase): Even when digesting a massive prompt of 10,000+ tokens, llama.cpp processes the data in chunks (typically 512 tokens at a time). A 512-token chunk translates to just a few Megabytes of data transferred across the PCIe bus. Moving 8MB over OCuLink takes about 1 millisecond. Meanwhile, the GPUs take tens or hundreds of milliseconds to actually compute that chunk.
- The True Bottleneck: System speed is dictated entirely by the memory bandwidth of the individual nodes (RTX 5070 Ti at ~900 GB/s vs the APU at ~200 GB/s), not the PCIe connection between them. The only scenarios where OCuLink's narrow bus will actually hurt you are the initial loading of the model weights from your SSD/RAM into the eGPU (taking 3–4 seconds instead of 1) or full fine-tuning, which requires constantly moving massive arrays of gradients.
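The transfer arithmetic above can be checked in a few lines. This sketch assumes Llama-2-7B's hidden size of 4096 and fp16 activations; the exact payload depends on the model, but the orders of magnitude hold:

```python
# Back-of-envelope PCIe transfer cost at the split boundary.
HIDDEN = 4096            # Llama-2-7B hidden size (assumption for this sketch)
FP16 = 2                 # bytes per activation value
LINK = 7.8e9             # OCuLink / PCIe 4.0 x4, ~7.8 GB/s usable

decode_bytes = HIDDEN * FP16            # one token's hidden state
prefill_bytes = 512 * HIDDEN * FP16     # one 512-token prefill chunk

print(f"decode:  {decode_bytes / 1024:.0f} KiB -> {decode_bytes / LINK * 1e6:.1f} us")
print(f"prefill: {prefill_bytes / 2**20:.0f} MiB -> {prefill_bytes / LINK * 1e3:.2f} ms")
```

Each decode hop costs about a microsecond and each 512-token prefill chunk about half a millisecond, which is noise next to the tens of milliseconds the GPUs spend actually computing the chunk.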
3. Amdahl’s Law and the "Relay Race" Pipeline Stalls
When using Tensor Splitting across multiple devices at batch size 1 (standard local inference without micro-batching), llama.cpp executes a strictly sequential pipeline.
- The Justification: Layer 2 cannot be computed until Layer 1 is finished. If you put 80% of the model on the lightning-fast RTX 5070 Ti and 20% on the slower AMD APU, they do not work simultaneously. The RTX processes its layers instantly, passes the tiny activation tensor over OCuLink, and then goes to sleep (Pipeline Stall). It sits completely idle, waiting for the memory-bandwidth-starved APU to grind through its 20% share of the layers.
- The Result: You are not adding compute power; you are adding a slow runner to a relay race. Because the fast GPU is forced to wait, the performance penalty of offloading layers to slower system memory is non-linear. As shown in the data, it perfectly graphs out as a classic hyperbola governed by Amdahl's Law. Moving just 10-20% of the workload to the slower node causes a disproportionately massive drop in total tokens per second.
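For intuition on how badly the stall hurts, here is a rough per-token time budget (my own sketch, reusing the tg128 endpoints from the split table): even 20% of the layers on the APU leaves the fast GPU idle for almost half of every token.

```python
def split_budget(s, t_fast, t_slow):
    """Per-token time (ms) each device contributes when a fraction s
    of the layers sits on the slow device; the pipeline is sequential."""
    fast_ms = (1 - s) / t_fast * 1000   # time the fast GPU spends computing
    slow_ms = s / t_slow * 1000         # time the fast GPU sits idle, waiting
    return fast_ms, slow_ms

# 80/20 split, using 168.91 t/s (eGPU alone) and 52.62 t/s (APU alone)
fast_ms, slow_ms = split_budget(0.2, 168.91, 52.62)
idle = slow_ms / (fast_ms + slow_ms)
print(f"fast GPU: {fast_ms:.2f} ms, APU: {slow_ms:.2f} ms, "
      f"fast GPU idle {idle:.0%} of each token")
```

With just 20% of the layers offloaded, the APU's ~3.8 ms share puts the RTX idle roughly 45% of each token, consistent with the 112.66 t/s measured for the 8/2 split.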
System Configuration:
- Base: Minisforum MS-S1 Max (Strix Halo APU, AMD Radeon 8060S iGPU, RDNA 3.5 architecture). Quiet power mode.
- RAM: 128GB LPDDR5X-8000 (iGPU memory bandwidth is ~210 GB/s in practice; theoretical is 256 GB/s).
- OS: CachyOS (Linux 6.19.11-1-cachyos) with the latest Mesa driver (RADV). Booted with GRUB params:
GRUB_CMDLINE_LINUX="... iommu=pt amdgpu.gttsize=126976 ttm.pages_limit=32505856"
eGPU Setup:
- GPU: NVIDIA RTX 5070 Ti
- To get an OCuLink port on the Minisforum MS-S1 Max, I added a PCIe 4.0 x4 to OCuLink SFF8611/8612 adapter.
- Dock: I bought a cheap F9G-BK7 eGPU dock. PSU is a 1STPLAYER NGDP Gold 850W.
- Everything worked right out of the box, zero compatibility issues.
meebeegee1123122@reddit
I have a similar build using the Framework 128GB and an ADT-Link PCIe riser cable, and I have been able to get fairly stable PCIe Gen 4 data transfer speeds. I am particularly interested in using CUDA for fine-tuning while holding the frozen model in the 128 GB of RAM, so that is my primary use case at this time. I only found a small benefit for token generation when using the split model vs using the APU alone, as you noted.
Due_Net_3342@reddit
What is the point of having both and running a 7B model? You could run it directly on the eGPU itself… Also, the bottleneck is minimal when you split the model more or less equally, BUT if you have a 110GB model and split it into 90 and 20GB you will see HUGE drops in tg; I tested this myself with 16GB of VRAM. For PP you will see modest improvements.
Currently waiting for a 24gb card to see if this improves things or not for the bigger models
Geritas@reddit
Reminded me of this guy on YouTube who gets all this crazy hardware and runs qwen 3 4b on it for more than half of the video every goddamn time
RedParaglider@reddit
Alex Ziskind. Next time I get a recommendation I'm telling youtube to never show me his fucking channel again. 4 fucking nvidia GPU's with 128gb of vram and runs a Qwen 3 4b, the lamest fucking shit ever. What the absolute fuck. Not even a, hey I'll hit the bigger models next video or anything, just on to the next unboxing useless clickbait bullshit.
It made the hardware look like shit too.
spaceman_@reddit
Isn't this typical of all tech reviewers? Seems like everyone - even AI focused or aware channels - are just running Llama3 8B and comparing the numbers.
The point is the memory size! Show us the big model numbers!
xspider2000@reddit (OP)
I’ve just published a new post where I tried to shed more light on the topic and answer some common questions
https://www.reddit.com/r/LocalLLaMA/comments/1sfzrdv/strix_halo_egpu_rtx_5070_ti_via_oculink_in/
Geritas@reddit
It is, and his videos are otherwise interesting, but I can’t stop rolling my eyes every time he gets something ridiculous like 4xRTX 6000 pro and runs qwen 3 4b on it in multiple different ways for 15 minutes
xspider2000@reddit (OP)
Fair question! It's just for benchmarking purposes.
llama-2-7b.Q4_0.gguf has historically become the de facto standard baseline for testing and comparing different hardware setups. You can find more examples of this benchmark here: https://github.com/ggml-org/llama.cpp/discussions/15013 As the article shows, Amdahl's law applies here: the less weight you leave on the slower Strix Halo memory, the faster your overall PP and TG will be. And the bottleneck in TG isn't OCuLink (which easily handles the tiny hidden-state transfers), but the lower memory bandwidth of the system RAM itself.
derekp7@reddit
Question. If you have a 120b model on just the apu, will you get a speedup moving some of it to the egpu (vs all of it on the APU)? Related, can the egpu speed up prompt processing compared to the 120b model on apu only?
xspider2000@reddit (OP)
Yes, it will speed up tg and pp. About a separate host I am not sure; due to the high latency you can lose tg speed. I’ve just published a new post where I tried to shed more light on the topic and answer some common questions
https://www.reddit.com/r/LocalLLaMA/comments/1sfzrdv/strix_halo_egpu_rtx_5070_ti_via_oculink_in/
madsheepPL@reddit
Not sure about the first part but I think I can answer the second - you can join the hosts with ethernet and run llama rpc - however the latency of the connection will kill the performance. Latency is more important than total bandwidth - look up how RDMA works (over usb or roce).
xspider2000@reddit (OP)
Bench for Qwen3.5-27B-UD-Q4_K_XL.gguf with different model fractions on the eGPU and iGPU:
~/llama.cpp/build-vulkan/bin/llama-bench \
-m /home/yulay/LLM/unsloth/Qwen3.5-27B-GGUF/Qwen3.5-27B-UD-Q4_K_XL.gguf \
-ngl 99 \
-fa 1 \
-dev vulkan1/vulkan0 \
-ts 10/0,9/1,8/2,7/3,6/4,5/5,4/6,3/7,2/8,1/9,0/10 \
-n 128 \
-p 512
ggml_vulkan: Found 2 Vulkan devices:
ggml_vulkan: 0 = NVIDIA GeForce RTX 5070 Ti (NVIDIA) | uma: 0 | fp16: 1 | bf16: 1 | warp size: 32 | shared memory: 49152 | int dot: 1 | matrix cores: NV_coopmat2
ggml_vulkan: 1 = Radeon 8060S Graphics (RADV STRIX_HALO) (radv) | uma: 1 | fp16: 1 | bf16: 0 | warp size: 64 | shared memory: 65536 | int dot: 1 | matrix cores: KHR_coopmat
| model | size | params | backend | ngl | fa | dev | ts | test | t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | -: | --------------- | ------------ | --------------: | -------------------: |
| qwen35 27B Q4_K - Medium | 16.40 GiB | 26.90 B | Vulkan | 99 | 1 | Vulkan1/Vulkan0 | 10.00 | pp512 | 269.13 ± 1.37 |
| qwen35 27B Q4_K - Medium | 16.40 GiB | 26.90 B | Vulkan | 99 | 1 | Vulkan1/Vulkan0 | 10.00 | tg128 | 11.90 ± 0.01 |
| qwen35 27B Q4_K - Medium | 16.40 GiB | 26.90 B | Vulkan | 99 | 1 | Vulkan1/Vulkan0 | 9.00/1.00 | pp512 | 296.54 ± 14.25 |
| qwen35 27B Q4_K - Medium | 16.40 GiB | 26.90 B | Vulkan | 99 | 1 | Vulkan1/Vulkan0 | 9.00/1.00 | tg128 | 12.33 ± 0.01 |
| qwen35 27B Q4_K - Medium | 16.40 GiB | 26.90 B | Vulkan | 99 | 1 | Vulkan1/Vulkan0 | 8.00/2.00 | pp512 | 303.92 ± 11.81 |
| qwen35 27B Q4_K - Medium | 16.40 GiB | 26.90 B | Vulkan | 99 | 1 | Vulkan1/Vulkan0 | 8.00/2.00 | tg128 | 12.95 ± 0.07 |
| qwen35 27B Q4_K - Medium | 16.40 GiB | 26.90 B | Vulkan | 99 | 1 | Vulkan1/Vulkan0 | 7.00/3.00 | pp512 | 341.83 ± 3.60 |
| qwen35 27B Q4_K - Medium | 16.40 GiB | 26.90 B | Vulkan | 99 | 1 | Vulkan1/Vulkan0 | 7.00/3.00 | tg128 | 13.54 ± 0.11 |
| qwen35 27B Q4_K - Medium | 16.40 GiB | 26.90 B | Vulkan | 99 | 1 | Vulkan1/Vulkan0 | 6.00/4.00 | pp512 | 392.76 ± 3.41 |
| qwen35 27B Q4_K - Medium | 16.40 GiB | 26.90 B | Vulkan | 99 | 1 | Vulkan1/Vulkan0 | 6.00/4.00 | tg128 | 14.80 ± 0.23 |
| qwen35 27B Q4_K - Medium | 16.40 GiB | 26.90 B | Vulkan | 99 | 1 | Vulkan1/Vulkan0 | 5.00/5.00 | pp512 | 443.23 ± 1.36 |
| qwen35 27B Q4_K - Medium | 16.40 GiB | 26.90 B | Vulkan | 99 | 1 | Vulkan1/Vulkan0 | 5.00/5.00 | tg128 | 17.43 ± 0.13 |
| qwen35 27B Q4_K - Medium | 16.40 GiB | 26.90 B | Vulkan | 99 | 1 | Vulkan1/Vulkan0 | 4.00/6.00 | pp512 | 457.50 ± 1.47 |
| qwen35 27B Q4_K - Medium | 16.40 GiB | 26.90 B | Vulkan | 99 | 1 | Vulkan1/Vulkan0 | 4.00/6.00 | tg128 | 19.89 ± 0.04 |
| qwen35 27B Q4_K - Medium | 16.40 GiB | 26.90 B | Vulkan | 99 | 1 | Vulkan1/Vulkan0 | 3.00/7.00 | pp512 | 629.92 ± 4.09 |
| qwen35 27B Q4_K - Medium | 16.40 GiB | 26.90 B | Vulkan | 99 | 1 | Vulkan1/Vulkan0 | 3.00/7.00 | tg128 | 22.24 ± 0.11 |
| qwen35 27B Q4_K - Medium | 16.40 GiB | 26.90 B | Vulkan | 99 | 1 | Vulkan1/Vulkan0 | 2.00/8.00 | pp512 | 801.37 ± 3.19 |
| qwen35 27B Q4_K - Medium | 16.40 GiB | 26.90 B | Vulkan | 99 | 1 | Vulkan1/Vulkan0 | 2.00/8.00 | tg128 | 26.01 ± 0.03 |
| qwen35 27B Q4_K - Medium | 16.40 GiB | 26.90 B | Vulkan | 99 | 1 | Vulkan1/Vulkan0 | 1.00/9.00 | pp512 | 1027.51 ± 6.28 |
| qwen35 27B Q4_K - Medium | 16.40 GiB | 26.90 B | Vulkan | 99 | 1 | Vulkan1/Vulkan0 | 1.00/9.00 | tg128 | 30.14 ± 0.08 |
ggml_vulkan: Device memory allocation of size 1067094656 failed.
ggml_vulkan: vk::Device::allocateMemory: ErrorOutOfDeviceMemory
main: error: failed to load model '/home/yulay/LLM/unsloth/Qwen3.5-27B-GGUF/Qwen3.5-27B-UD-Q4_K_XL.gguf'
segmond@reddit
crappy test. run a large 100B model, only on the Strix, then on the combo with the 5070 as main GPU.
xspider2000@reddit (OP)
I’ve just published a new post where I tried to shed more light on the topic and answer some common questions
https://www.reddit.com/r/LocalLLaMA/comments/1sfzrdv/strix_halo_egpu_rtx_5070_ti_via_oculink_in/
segmond@reddit
I wanted to see the test of the large model on only Strix, then on 5070/Strix. That would show if the 5070 has any impact on PP and TG.
Badger-Purple@reddit
I think layer split will improve your generation
harpysichordist@reddit
Good information. How does it perform with a larger model? For example Qwen3.5 122B, etc
xspider2000@reddit (OP)
Already downloading Qwen3.5-122B-A10B-UD-Q4_K_XL
simracerman@reddit
How's it going?
xspider2000@reddit (OP)
I’ve just published a new post where I tried to shed more light on the topic and answer some common questions
https://www.reddit.com/r/LocalLLaMA/comments/1sfzrdv/strix_halo_egpu_rtx_5070_ti_via_oculink_in/
simracerman@reddit
Thanks! I put a comment there for you.
Possible-Pirate9097@reddit
How'd it go?
StardockEngineer@reddit
Can we please flag these bullshit AI posts that seem to be from someone’s stolen Reddit account? They all have the same format and always testing old models.
xspider2000@reddit (OP)
I’ve just published a new post where I tried to shed more light on the topic and answer some common questions
https://www.reddit.com/r/LocalLLaMA/comments/1sfzrdv/strix_halo_egpu_rtx_5070_ti_via_oculink_in/
Potential-Leg-639@reddit
So much text and why llama 7B? Why not Qwen3.5 models?
xspider2000@reddit (OP)
I’ve just published a new post where I tried to shed more light on the topic and answer some common questions
https://www.reddit.com/r/LocalLLaMA/comments/1sfzrdv/strix_halo_egpu_rtx_5070_ti_via_oculink_in/
StableLlama@reddit
Interesting for the base line.
But the real test is a model that doesn't fit in the VRAM of the 5070 and comparing that with the baseline. Is it still the simple interpolation / prediction formula?
xspider2000@reddit (OP)
Yeah, you can interpolate. I’ve just published a new post where I tried to shed more light on the topic and answer some common questions
https://www.reddit.com/r/LocalLLaMA/comments/1sfzrdv/strix_halo_egpu_rtx_5070_ti_via_oculink_in/
Hrethric@reddit
Thanks for sharing! I tried something similar with a Framework Desktop a couple of months ago, with a powered PCIe x4-x16 adaptor, but I didn't have as much success. I had to drop it down to PCIe 3.0 mode to get it stable, and I was using CUDA with my 3090 and Vulkan for the Strix Halo. The best performance I was able to get was just a little slower than the Strix Halo alone, I think around 27 t/s with the split, whereas the Strix Halo alone would get 32. Unfortunately I don't have my notes handy. I was thinking of trying again with a shorter adapter; now I might try running Vulkan only and also try an OCuLink adapter.
xspider2000@reddit (OP)
No model can saturate the interface bandwidth, even PCIe 3.0 I think. But high interface latency can kill your tg. I’ve just published a new post where I tried to shed more light on the topic and answer some common questions
https://www.reddit.com/r/LocalLLaMA/comments/1sfzrdv/strix_halo_egpu_rtx_5070_ti_via_oculink_in/
aigemie@reddit
Very interesting, thanks for sharing! Maybe I missed it - does it help with PP (prefill) speed? Strix Halo is infamous for its PP.
xspider2000@reddit (OP)
Yes, you can see it on the graph in the post. The more model weights you offload to the fast eGPU, the better the PP (prompt processing) and TG (token generation) speeds. Performance drops quite fast as you move a larger share of the weights to the slower Strix Halo. However, splitting the weights between the eGPU and the Strix Halo still results in much better overall PP and TG speeds than running the model entirely on the Strix Halo alone.
aigemie@reddit
That's great to hear! I'm considering buying a USB4 eGPU dock, as OCuLink is not much more helpful.
xspider2000@reddit (OP)
If you have the choice, I would recommend OCuLink for LLMs. I’ve just published a new post where I tried to shed more light on the topic and answer some common questions
https://www.reddit.com/r/LocalLLaMA/comments/1sfzrdv/strix_halo_egpu_rtx_5070_ti_via_oculink_in/
Zc5Gwu@reddit
Please update us once you get things going. I'm interested in usb4 as well for a similar setup.
xspider2000@reddit (OP)
I’ve just published a new post where I tried to shed more light on the topic and answer some common questions
https://www.reddit.com/r/LocalLLaMA/comments/1sfzrdv/strix_halo_egpu_rtx_5070_ti_via_oculink_in/
mindwip@reddit
I have the same strix halo.
Two questions.
Do you recommend the eGPU? I missed it if you put that in. I plan to get a 32GB or bigger GPU and do the same thing.
Why not use the 80Gbps USB4 v2 port instead of the OCuLink? Not saying you did it wrong, just wondering why that choice!
Thanks!
ReactionaryPlatypus@reddit
I have a 128GB Strix Halo with a 3090 24GB eGPU. I can run Minimax 2.5 IQ4_XS (113GB size plus context) across both GPUs using llama.cpp's Vulkan backend.
It works great and is generally stable for hours of work, but it has one issue: exiting llama-server crashes Windows, if and only if both GPUs share a model.
madsheepPL@reddit
how fast is it? pp / tg?
mindwip@reddit
Lol good to know thanks
xspider2000@reddit (OP)
Yes, I recommend it. Since PCIe bandwidth doesn't bottleneck token generation, dropping a 32GB+ eGPU into this setup is a fantastic way to run heavy LLMs on a mini-PC without building a massive full-tower rig. Plus, having a dedicated NVIDIA eGPU means you can comfortably run diffusion models and generate images in ComfyUI without breaking a sweat.
Mainly for latency and cost. OCuLink is a direct PCIe connection with zero protocol overhead. USB4 encapsulates the signal, which inevitably adds minor latency. When you are doing tensor split across an APU and an eGPU, the pipeline is strictly sequential. In this specific scenario, any added latency from the connection protocol can quickly compound and become an actual bottleneck. Plus, OCuLink docks are incredibly cheap and proven to work flawlessly right now, whereas true 80Gbps USB4 enclosures are still very rare and expensive. It's simply the most direct, stable, and budget-friendly path.
mindwip@reddit
Thanks!
tisDDM@reddit
Just for reference. A month ago I posted a SH benchmark with my eGPU (3060) and a mixed ROCm/CUDA backend - the numbers were produced before llama.cpp got a bunch of optimizations https://www.reddit.com/r/StrixHalo/comments/1rm9nlo/performance_test_for_combined_rocm_cuda_llamacpp/
Looking at your numbers I see a lot of potential for optimization. E.g. your combined Vulkan numbers are in the same ballpark as my SH baseline. Even back then I got a 30% increase with partial offloading to the 3060, resulting in 600 tok/s PP4096 and 15 tok/s TG128 on Qwen 3.5 in q4_0.
Having said that, I changed from the 3060 to an R9700, giving me around PP 1000 / TG 20.
The 5070 should be capable of far more throughput.
Anarchaotic@reddit
Would love to do some of my own testing on the Framework, but unfortunately I don't have an OCuLink adapter or hub. I do have a Razer Core X, so hypothetically I could try that with my 5090. I wonder if I'd run into any bandwidth issues; you mentioned OCuLink didn't really matter all that much.
xspider2000@reddit (OP)
That sounds like an awesome setup! The RTX 5090 is a beast.
If the entire model fits into the VRAM of your 5090, you won't have any issues at all.
However, if you plan to do tensor split (offloading some layers to the Framework's system RAM), you will likely see a bigger performance hit than I did with OCuLink. The Razer Core X uses Thunderbolt, which encapsulates the PCIe signal and adds higher latency compared to the direct PCIe connection of OCuLink. Since layer processing is strictly sequential, the latency overhead from the Thunderbolt controller will compound and cause deeper pipeline stalls between the 5090 and the APU.
Still, absolutely try it out!
Zc5Gwu@reddit
Do you have any idea of the effect of the extra latency on speeds or TTFT?
Everlier@reddit
I think with configs like that a P/D disaggregation might make more sense than a tensor split, just to compensate for the area where the APU is the weak link. I know, however, that there's no ready-made solution (as far as I'm aware) for that with a Vulkan + Nvidia/AMD combo.
xspider2000@reddit (OP)
Absolutely agree. In theory, P/D disaggregation would be a much more elegant way to bypass the APU bottleneck. Sadly, as you mentioned, the software just isn't there yet for a mixed Nvidia/AMD setup.