Strix Halo + eGPU RTX 5070 Ti via OCuLink in llama.cpp: Benchmarks and Conclusions

Posted by xspider2000 | r/LocalLLaMA

Hey everyone. I have a Strix Halo miniPC (Minisforum MS-S1 Max). I added an RTX 5070 Ti eGPU to it via OCuLink, ran some tests on how they work together in llama.cpp, and wanted to share some of my findings.

TL;DR of my findings:

  1. Vulkan's versatility: it's a highly efficient API that lets you reliably combine chips from different vendors (e.g., an AMD APU + an NVIDIA GPU). The performance drop compared to native CUDA or ROCm is minimal, about 5–10%.
  2. The role of OCuLink: The bandwidth of this connection doesn't bottleneck token generation (tg) or prompt processing (pp). The data transferred is tiny. The real latency comes from the fast GPU idling while waiting for the slower APU.
  3. Amdahl's Law and Tensor Split: Since devices in llama.cpp process layers strictly sequentially (like a relay race), offloading some computations to slower memory causes a non-linear, hyperbolic drop in overall speed. This overall performance degradation for sequential execution is exactly what Amdahl's Law describes.

First, here are the standard llama-bench results for each GPU using their native backends:

~/llama.cpp/build-rocm/bin/llama-bench -m ~/llama-2-7b.Q4_0.gguf -ngl 99 -fa 1 -p 512,2048,8192

ggml_cuda_init: found 1 ROCm devices (Total VRAM: 126976 MiB):
  Device 0: Radeon 8060S Graphics, gfx1151 (0x1151), VMM: no, Wave Size: 32, VRAM: 126976 MiB

| model | size | params | backend | ngl | fa | test | t/s |
| --- | --- | --- | --- | --- | --- | --- | --- |
| llama 7B Q4_0 | 3.56 GiB | 6.74 B | ROCm | 99 | 1 | pp512 | 1493.28 ± 30.20 |
| llama 7B Q4_0 | 3.56 GiB | 6.74 B | ROCm | 99 | 1 | pp2048 | 1350.47 ± 40.94 |
| llama 7B Q4_0 | 3.56 GiB | 6.74 B | ROCm | 99 | 1 | pp8192 | 958.19 ± 1.85 |
| llama 7B Q4_0 | 3.56 GiB | 6.74 B | ROCm | 99 | 1 | tg128 | 50.16 ± 0.07 |

~/llama.cpp/build-cuda/bin/llama-bench -m ~/llama-2-7b.Q4_0.gguf -ngl 99 -fa 1 -p 512,2048,8192

ggml_cuda_init: found 1 CUDA devices (Total VRAM: 15841 MiB):
  Device 0: NVIDIA GeForce RTX 5070 Ti, compute capability 12.0, VMM: yes, VRAM: 15841 MiB

| model | size | params | backend | ngl | fa | test | t/s |
| --- | --- | --- | --- | --- | --- | --- | --- |
| llama 7B Q4_0 | 3.56 GiB | 6.74 B | CUDA | 99 | 1 | pp512 | 8476.95 ± 206.73 |
| llama 7B Q4_0 | 3.56 GiB | 6.74 B | CUDA | 99 | 1 | pp2048 | 8081.18 ± 27.82 |
| llama 7B Q4_0 | 3.56 GiB | 6.74 B | CUDA | 99 | 1 | pp8192 | 6266.69 ± 6.90 |
| llama 7B Q4_0 | 3.56 GiB | 6.74 B | CUDA | 99 | 1 | tg128 | 179.20 ± 0.13 |

Now, the tests for each GPU using Vulkan:

GGML_VK_VISIBLE_DEVICES=0 ~/llama.cpp/build-vulkan/bin/llama-bench -m ~/llama-2-7b.Q4_0.gguf -ngl 99 -fa 1 -p 512,2048,8192

ggml_vulkan: Found 1 Vulkan devices:
ggml_vulkan: 0 = NVIDIA GeForce RTX 5070 Ti (NVIDIA) | uma: 0 | fp16: 1 | bf16: 1 | warp size: 32 | shared memory: 49152 | int dot: 1 | matrix cores: NV_coopmat2

| model | size | params | backend | ngl | fa | test | t/s |
| --- | --- | --- | --- | --- | --- | --- | --- |
| llama 7B Q4_0 | 3.56 GiB | 6.74 B | Vulkan | 99 | 1 | pp512 | 7466.51 ± 17.68 |
| llama 7B Q4_0 | 3.56 GiB | 6.74 B | Vulkan | 99 | 1 | pp2048 | 7216.51 ± 1.77 |
| llama 7B Q4_0 | 3.56 GiB | 6.74 B | Vulkan | 99 | 1 | pp8192 | 6319.98 ± 7.82 |
| llama 7B Q4_0 | 3.56 GiB | 6.74 B | Vulkan | 99 | 1 | tg128 | 167.77 ± 1.56 |

GGML_VK_VISIBLE_DEVICES=1 ~/llama.cpp/build-vulkan/bin/llama-bench -m ~/llama-2-7b.Q4_0.gguf -ngl 99 -fa 1 -p 512,2048,8192

ggml_vulkan: Found 1 Vulkan devices:
ggml_vulkan: 0 = Radeon 8060S Graphics (RADV STRIX_HALO) (radv) | uma: 1 | fp16: 1 | bf16: 0 | warp size: 64 | shared memory: 65536 | int dot: 1 | matrix cores: KHR_coopmat

| model | size | params | backend | ngl | fa | test | t/s |
| --- | --- | --- | --- | --- | --- | --- | --- |
| llama 7B Q4_0 | 3.56 GiB | 6.74 B | Vulkan | 99 | 1 | pp512 | 1327.76 ± 17.68 |
| llama 7B Q4_0 | 3.56 GiB | 6.74 B | Vulkan | 99 | 1 | pp2048 | 1252.70 ± 5.86 |
| llama 7B Q4_0 | 3.56 GiB | 6.74 B | Vulkan | 99 | 1 | pp8192 | 960.10 ± 2.37 |
| llama 7B Q4_0 | 3.56 GiB | 6.74 B | Vulkan | 99 | 1 | tg128 | 52.29 ± 0.15 |

And the most interesting part: testing both GPUs working together with tensor split via Vulkan. The model weights were distributed between the NVIDIA RTX 5070 Ti VRAM and the AMD Radeon 8060S UMA in the following proportions: 100%/0%, 90%/10%, 80%/20%, 70%/30%, 60%/40%, 50%/50%, 40%/60%, 30%/70%, 20%/80%, 10%/90%, 0%/100%.

GGML_VK_VISIBLE_DEVICES=0,1 ~/llama.cpp/build-vulkan/bin/llama-bench -m ~/llama-2-7b.Q4_0.gguf -ngl 99 -fa 1 -dev vulkan0/vulkan1 -ts 10/0,9/1,8/2,7/3,6/4,5/5,4/6,3/7,2/8,1/9,0/10 -n 128 -p 512 -r 10

ggml_vulkan: Found 2 Vulkan devices:
ggml_vulkan: 0 = NVIDIA GeForce RTX 5070 Ti (NVIDIA) | uma: 0 | fp16: 1 | bf16: 1 | warp size: 32 | shared memory: 49152 | int dot: 1 | matrix cores: NV_coopmat2
ggml_vulkan: 1 = Radeon 8060S Graphics (RADV STRIX_HALO) (radv) | uma: 1 | fp16: 1 | bf16: 0 | warp size: 64 | shared memory: 65536 | int dot: 1 | matrix cores: KHR_coopmat

| model | size | params | backend | ngl | fa | dev | ts | test | t/s |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| llama 7B Q4_0 | 3.56 GiB | 6.74 B | Vulkan | 99 | 1 | Vulkan0/1 | 10.00/0.00 | pp512 | 7461.22 ± 6.37 |
| llama 7B Q4_0 | 3.56 GiB | 6.74 B | Vulkan | 99 | 1 | Vulkan0/1 | 10.00/0.00 | tg128 | 168.91 ± 0.43 |
| llama 7B Q4_0 | 3.56 GiB | 6.74 B | Vulkan | 99 | 1 | Vulkan0/1 | 9.00/1.00 | pp512 | 5790.85 ± 52.68 |
| llama 7B Q4_0 | 3.56 GiB | 6.74 B | Vulkan | 99 | 1 | Vulkan0/1 | 9.00/1.00 | tg128 | 130.22 ± 0.40 |
| llama 7B Q4_0 | 3.56 GiB | 6.74 B | Vulkan | 99 | 1 | Vulkan0/1 | 8.00/2.00 | pp512 | 4230.90 ± 28.90 |
| llama 7B Q4_0 | 3.56 GiB | 6.74 B | Vulkan | 99 | 1 | Vulkan0/1 | 8.00/2.00 | tg128 | 112.66 ± 0.23 |
| llama 7B Q4_0 | 3.56 GiB | 6.74 B | Vulkan | 99 | 1 | Vulkan0/1 | 7.00/3.00 | pp512 | 3356.88 ± 27.64 |
| llama 7B Q4_0 | 3.56 GiB | 6.74 B | Vulkan | 99 | 1 | Vulkan0/1 | 7.00/3.00 | tg128 | 99.83 ± 0.20 |
| llama 7B Q4_0 | 3.56 GiB | 6.74 B | Vulkan | 99 | 1 | Vulkan0/1 | 6.00/4.00 | pp512 | 2658.89 ± 13.26 |
| llama 7B Q4_0 | 3.56 GiB | 6.74 B | Vulkan | 99 | 1 | Vulkan0/1 | 6.00/4.00 | tg128 | 85.67 ± 2.50 |
| llama 7B Q4_0 | 3.56 GiB | 6.74 B | Vulkan | 99 | 1 | Vulkan0/1 | 5.00/5.00 | pp512 | 2185.28 ± 16.92 |
| llama 7B Q4_0 | 3.56 GiB | 6.74 B | Vulkan | 99 | 1 | Vulkan0/1 | 5.00/5.00 | tg128 | 76.73 ± 1.13 |
| llama 7B Q4_0 | 3.56 GiB | 6.74 B | Vulkan | 99 | 1 | Vulkan0/1 | 4.00/6.00 | pp512 | 1946.46 ± 19.60 |
| llama 7B Q4_0 | 3.56 GiB | 6.74 B | Vulkan | 99 | 1 | Vulkan0/1 | 4.00/6.00 | tg128 | 62.84 ± 0.15 |
| llama 7B Q4_0 | 3.56 GiB | 6.74 B | Vulkan | 99 | 1 | Vulkan0/1 | 3.00/7.00 | pp512 | 1644.25 ± 29.88 |
| llama 7B Q4_0 | 3.56 GiB | 6.74 B | Vulkan | 99 | 1 | Vulkan0/1 | 3.00/7.00 | tg128 | 58.38 ± 0.31 |
| llama 7B Q4_0 | 3.56 GiB | 6.74 B | Vulkan | 99 | 1 | Vulkan0/1 | 2.00/8.00 | pp512 | 1458.99 ± 19.70 |
| llama 7B Q4_0 | 3.56 GiB | 6.74 B | Vulkan | 99 | 1 | Vulkan0/1 | 2.00/8.00 | tg128 | 55.70 ± 0.49 |
| llama 7B Q4_0 | 3.56 GiB | 6.74 B | Vulkan | 99 | 1 | Vulkan0/1 | 1.00/9.00 | pp512 | 1304.67 ± 45.80 |
| llama 7B Q4_0 | 3.56 GiB | 6.74 B | Vulkan | 99 | 1 | Vulkan0/1 | 1.00/9.00 | tg128 | 54.16 ± 1.07 |
| llama 7B Q4_0 | 3.56 GiB | 6.74 B | Vulkan | 99 | 1 | Vulkan0/1 | 0.00/10.00 | pp512 | 1194.55 ± 5.25 |
| llama 7B Q4_0 | 3.56 GiB | 6.74 B | Vulkan | 99 | 1 | Vulkan0/1 | 0.00/10.00 | tg128 | 52.62 ± 0.72 |

With layers split across devices, the drop in overall tg and pp speed follows Amdahl's Law. Moving even a small fraction of layers to lower-bandwidth memory creates a bottleneck, leading to a non-linear drop in overall throughput (t/s). If you graph it, it forms a classic hyperbola.

Formula: P(s) = 100 / (1 + s(k - 1))

Where:

* P(s) = overall performance as a percentage of the fast device running alone
* s = the fraction of layers offloaded to the slower device
* k = how many times slower that device is (fast t/s divided by slow t/s)

Sanity check against the table: for tg, k ≈ 167.77 / 52.29 ≈ 3.2, so at a 5/5 split (s = 0.5) the formula gives P ≈ 100 / (1 + 0.5 × 2.2) ≈ 48%, i.e. about 80 t/s against the measured 76.73. As you can see, the overall tg and pp speeds depend only on the tg and pp of each node; OCuLink doesn't measurably affect the overall speed.
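
If you want to check the formula against the whole table, here's a minimal sketch in plain Python; the only inputs are the two solo Vulkan tg128 numbers from above, everything else is the measured split-mode data:

```python
# Predict split-mode tg from the two solo Vulkan tg128 speeds using
# P(s) = 100 / (1 + s * (k - 1)), then compare against the measured table.

fast_tg = 167.77   # RTX 5070 Ti solo, Vulkan, tg128
slow_tg = 52.29    # Radeon 8060S solo, Vulkan, tg128
k = fast_tg / slow_tg

# s = fraction of layers on the slower device -> measured tg from the table
measured = {0.0: 168.91, 0.1: 130.22, 0.2: 112.66, 0.3: 99.83,
            0.4: 85.67, 0.5: 76.73, 0.6: 62.84, 0.7: 58.38,
            0.8: 55.70, 0.9: 54.16, 1.0: 52.62}

for s, real in measured.items():
    predicted = fast_tg / (1 + s * (k - 1))
    print(f"s={s:.1f}  predicted {predicted:6.1f} t/s  measured {real:6.1f} t/s")
```

The prediction runs a few t/s optimistic (it ignores synchronization overhead), but the hyperbolic shape tracks the measurements across the whole range.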

Detailed Conclusions & Technical Analysis:

Based on the benchmark data and the architectural specifics of LLMs, here is a deeper breakdown of why we see these results.

1. Vulkan is the Ultimate API for Cross-Vendor Inference

Historically, mixing AMD and NVIDIA chips for compute in a single pipeline has been a driver nightmare. llama.cpp's Vulkan backend changes the game: as the solo benchmarks above show, it lands within roughly 5–10% of the native CUDA and ROCm backends (e.g., tg128 of 167.77 vs 179.20 t/s on the 5070 Ti), while letting GPUs from both vendors run in one pipeline.

2. The OCuLink Myth: PCIe 4.0 x4 is NOT a Bottleneck for LLMs

There is a widespread stereotype in the eGPU community that the limited bandwidth of OCuLink (PCIe 4.0 x4: 64 Gbps raw, ~7.8 GB/s effective) will throttle AI performance. For LLM inference this is completely false: during active generation the link sits at roughly 1% utilization at most. Here is the math behind why the communication penalty is practically zero:
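
A back-of-the-envelope sketch (my assumptions: fp16 activations, Llama-2-7B's hidden size of 4096, and, with two devices, exactly one boundary hop per token):

```python
# Estimate inter-device traffic with tensor split. At the device boundary
# only the current hidden state crosses the link; the model weights cross
# it exactly once, at load time.

hidden_size = 4096            # Llama-2-7B hidden dimension
per_token = hidden_size * 2   # fp16 activations: ~8 KiB per token per hop

link_bw = 7.8e9               # OCuLink / PCIe 4.0 x4, bytes per second

tg = 76.73                    # measured tg at the 5/5 split, tokens/s
pp = 2185.28                  # measured pp512 at the 5/5 split, tokens/s

for name, rate in (("generation", tg), ("prompt processing", pp)):
    traffic = per_token * rate
    print(f"{name}: {traffic / 1e6:6.2f} MB/s "
          f"= {100 * traffic / link_bw:.3f}% of OCuLink bandwidth")
```

That works out to well under 1% of the link even during prompt processing, which is why the split-mode speeds depend only on per-device compute and not on the interconnect.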

3. Amdahl’s Law and the "Relay Race" Pipeline Stalls

When using tensor splitting across multiple devices at batch size 1 (standard local inference, without micro-batching), llama.cpp executes a strictly sequential pipeline: the fast GPU computes its share of the layers, hands the hidden state over, and then sits idle until the slower APU finishes its share. The time per token is the sum of the two stages, which is exactly where the P(s) formula above comes from.
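
To make the relay race concrete, here's an idealized timing model built from the same two solo tg speeds (it deliberately ignores link transfer time, which the previous section showed is negligible):

```python
# "Relay race" timing model: per-token time is the sum of the fast GPU's
# stage and the slow APU's stage; the GPU sits idle during the latter.

fast_tg, slow_tg = 167.77, 52.29   # solo Vulkan tg128 speeds from above

for s in (0.1, 0.3, 0.5):          # s = fraction of layers on the APU
    t_gpu = (1 - s) / fast_tg      # seconds the GPU spends computing
    t_apu = s / slow_tg            # seconds the GPU spends waiting
    total = t_gpu + t_apu
    print(f"s={s:.1f}: {1 / total:5.1f} t/s, "
          f"GPU idle {100 * t_apu / total:4.1f}% of each token")
```

At a 50/50 split the 5070 Ti already spends roughly three quarters of every token waiting on the APU; that idle time, not the OCuLink link, is where the latency comes from.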

System Configuration:

* Minisforum MS-S1 Max (AMD Strix Halo APU with Radeon 8060S integrated graphics, gfx1151)
* 126976 MiB of unified memory visible to the iGPU (see the ROCm device report above)

eGPU Setup:

* NVIDIA GeForce RTX 5070 Ti (15841 MiB VRAM, compute capability 12.0)
* Connected over OCuLink (PCIe 4.0 x4, ~7.8 GB/s)