Strix Halo + eGPU RTX 5070 Ti via OCuLink in llama.cpp: Benchmarks and Conclusions
Posted by xspider2000@reddit | LocalLLaMA | View on Reddit | 48 comments

Hey everyone. I have a Strix Halo miniPC (Minisforum MS-S1 Max). I added an RTX 5070 Ti eGPU to it via OCuLink, ran some tests on how they work together in llama.cpp, and wanted to share some of my findings.
TL;DR of my findings:
- Vulkan's versatility: It's a highly efficient API that lets you stably combine chips from different vendors (like an AMD APU + NVIDIA GPU). The performance drop compared to native CUDA or ROCm is minimal, just about 5–10%.
- The role of OCuLink: The bandwidth of this connection doesn't bottleneck token generation (tg) or prompt processing (pp). The data transferred is tiny. The real latency comes from the fast GPU idling while waiting for the slower APU.
- Amdahl's Law and Tensor Split: Since devices in llama.cpp process layers strictly sequentially (like a relay race), offloading some computations to slower memory causes a non-linear, hyperbolic drop in overall speed. This overall performance degradation for sequential execution is exactly what Amdahl's Law describes.
First, here are the standard llama-bench results for each GPU using their native backends:
~/llama.cpp/build-rocm/bin/llama-bench -m ~/llama-2-7b.Q4_0.gguf -ngl 99 -fa 1 -p 512,2048,8192
ggml_cuda_init: found 1 ROCm devices (Total VRAM: 126976 MiB): Device 0: Radeon 8060S Graphics, gfx1151 (0x1151), VMM: no, Wave Size: 32, VRAM: 126976 MiB
| model | size | params | backend | ngl | fa | test | t/s |
|---|---|---|---|---|---|---|---|
| llama 7B Q4_0 | 3.56 GiB | 6.74 B | ROCm | 99 | 1 | pp512 | 1493.28 ± 30.20 |
| llama 7B Q4_0 | 3.56 GiB | 6.74 B | ROCm | 99 | 1 | pp2048 | 1350.47 ± 40.94 |
| llama 7B Q4_0 | 3.56 GiB | 6.74 B | ROCm | 99 | 1 | pp8192 | 958.19 ± 1.85 |
| llama 7B Q4_0 | 3.56 GiB | 6.74 B | ROCm | 99 | 1 | tg128 | 50.16 ± 0.07 |
~/llama.cpp/build-cuda/bin/llama-bench -m ~/llama-2-7b.Q4_0.gguf -ngl 99 -fa 1 -p 512,2048,8192
ggml_cuda_init: found 1 CUDA devices (Total VRAM: 15841 MiB): Device 0: NVIDIA GeForce RTX 5070 Ti, compute capability 12.0, VMM: yes, VRAM: 15841 MiB
| model | size | params | backend | ngl | fa | test | t/s |
|---|---|---|---|---|---|---|---|
| llama 7B Q4_0 | 3.56 GiB | 6.74 B | CUDA | 99 | 1 | pp512 | 8476.95 ± 206.73 |
| llama 7B Q4_0 | 3.56 GiB | 6.74 B | CUDA | 99 | 1 | pp2048 | 8081.18 ± 27.82 |
| llama 7B Q4_0 | 3.56 GiB | 6.74 B | CUDA | 99 | 1 | pp8192 | 6266.69 ± 6.90 |
| llama 7B Q4_0 | 3.56 GiB | 6.74 B | CUDA | 99 | 1 | tg128 | 179.20 ± 0.13 |
Now, the tests for each GPU using Vulkan:
GGML_VK_VISIBLE_DEVICES=0 ~/llama.cpp/build-vulkan/bin/llama-bench -m ~/llama-2-7b.Q4_0.gguf -ngl 99 -fa 1 -p 512,2048,8192
ggml_vulkan: Found 1 Vulkan devices: ggml_vulkan: 0 = NVIDIA GeForce RTX 5070 Ti (NVIDIA) | uma: 0 | fp16: 1 | bf16: 1 | warp size: 32 | shared memory: 49152 | int dot: 1 | matrix cores: NV_coopmat2
| model | size | params | backend | ngl | fa | test | t/s |
|---|---|---|---|---|---|---|---|
| llama 7B Q4_0 | 3.56 GiB | 6.74 B | Vulkan | 99 | 1 | pp512 | 7466.51 ± 17.68 |
| llama 7B Q4_0 | 3.56 GiB | 6.74 B | Vulkan | 99 | 1 | pp2048 | 7216.51 ± 1.77 |
| llama 7B Q4_0 | 3.56 GiB | 6.74 B | Vulkan | 99 | 1 | pp8192 | 6319.98 ± 7.82 |
| llama 7B Q4_0 | 3.56 GiB | 6.74 B | Vulkan | 99 | 1 | tg128 | 167.77 ± 1.56 |
GGML_VK_VISIBLE_DEVICES=1 ~/llama.cpp/build-vulkan/bin/llama-bench -m ~/llama-2-7b.Q4_0.gguf -ngl 99 -fa 1 -p 512,2048,8192
ggml_vulkan: Found 1 Vulkan devices: ggml_vulkan: 0 = Radeon 8060S Graphics (RADV STRIX_HALO) (radv) | uma: 1 | fp16: 1 | bf16: 0 | warp size: 64 | shared memory: 65536 | int dot: 1 | matrix cores: KHR_coopmat
| model | size | params | backend | ngl | fa | test | t/s |
|---|---|---|---|---|---|---|---|
| llama 7B Q4_0 | 3.56 GiB | 6.74 B | Vulkan | 99 | 1 | pp512 | 1327.76 ± 17.68 |
| llama 7B Q4_0 | 3.56 GiB | 6.74 B | Vulkan | 99 | 1 | pp2048 | 1252.70 ± 5.86 |
| llama 7B Q4_0 | 3.56 GiB | 6.74 B | Vulkan | 99 | 1 | pp8192 | 960.10 ± 2.37 |
| llama 7B Q4_0 | 3.56 GiB | 6.74 B | Vulkan | 99 | 1 | tg128 | 52.29 ± 0.15 |
And the most interesting part: testing both GPUs working together with tensor split via Vulkan. The model weights were distributed between the NVIDIA RTX 5070 Ti VRAM and the AMD Radeon 8060S UMA in the following proportions: 100%/0%, 90%/10%, 80%/20%, 70%/30%, 60%/40%, 50%/50%, 40%/60%, 30%/70%, 20%/80%, 10%/90%, 0%/100%.
GGML_VK_VISIBLE_DEVICES=0,1 ~/llama.cpp/build-vulkan/bin/llama-bench -m ~/llama-2-7b.Q4_0.gguf -ngl 99 -fa 1 -dev vulkan0/vulkan1 -ts 10/0,9/1,8/2,7/3,6/4,5/5,4/6,3/7,2/8,1/9,0/10 -n 128 -p 512 -r 10
ggml_vulkan: Found 2 Vulkan devices: ggml_vulkan: 0 = NVIDIA GeForce RTX 5070 Ti (NVIDIA) | uma: 0 | fp16: 1 | bf16: 1 | warp size: 32 | shared memory: 49152 | int dot: 1 | matrix cores: NV_coopmat2 ggml_vulkan: 1 = Radeon 8060S Graphics (RADV STRIX_HALO) (radv) | uma: 1 | fp16: 1 | bf16: 0 | warp size: 64 | shared memory: 65536 | int dot: 1 | matrix cores: KHR_coopmat
| model | size | params | backend | ngl | fa | dev | ts | test | t/s |
|---|---|---|---|---|---|---|---|---|---|
| llama 7B Q4_0 | 3.56 GiB | 6.74 B | Vulkan | 99 | 1 | Vulkan0/1 | 10.00 | pp512 | 7461.22 ± 6.37 |
| llama 7B Q4_0 | 3.56 GiB | 6.74 B | Vulkan | 99 | 1 | Vulkan0/1 | 10.00 | tg128 | 168.91 ± 0.43 |
| llama 7B Q4_0 | 3.56 GiB | 6.74 B | Vulkan | 99 | 1 | Vulkan0/1 | 9.00/1.00 | pp512 | 5790.85 ± 52.68 |
| llama 7B Q4_0 | 3.56 GiB | 6.74 B | Vulkan | 99 | 1 | Vulkan0/1 | 9.00/1.00 | tg128 | 130.22 ± 0.40 |
| llama 7B Q4_0 | 3.56 GiB | 6.74 B | Vulkan | 99 | 1 | Vulkan0/1 | 8.00/2.00 | pp512 | 4230.90 ± 28.90 |
| llama 7B Q4_0 | 3.56 GiB | 6.74 B | Vulkan | 99 | 1 | Vulkan0/1 | 8.00/2.00 | tg128 | 112.66 ± 0.23 |
| llama 7B Q4_0 | 3.56 GiB | 6.74 B | Vulkan | 99 | 1 | Vulkan0/1 | 7.00/3.00 | pp512 | 3356.88 ± 27.64 |
| llama 7B Q4_0 | 3.56 GiB | 6.74 B | Vulkan | 99 | 1 | Vulkan0/1 | 7.00/3.00 | tg128 | 99.83 ± 0.20 |
| llama 7B Q4_0 | 3.56 GiB | 6.74 B | Vulkan | 99 | 1 | Vulkan0/1 | 6.00/4.00 | pp512 | 2658.89 ± 13.26 |
| llama 7B Q4_0 | 3.56 GiB | 6.74 B | Vulkan | 99 | 1 | Vulkan0/1 | 6.00/4.00 | tg128 | 85.67 ± 2.50 |
| llama 7B Q4_0 | 3.56 GiB | 6.74 B | Vulkan | 99 | 1 | Vulkan0/1 | 5.00/5.00 | pp512 | 2185.28 ± 16.92 |
| llama 7B Q4_0 | 3.56 GiB | 6.74 B | Vulkan | 99 | 1 | Vulkan0/1 | 5.00/5.00 | tg128 | 76.73 ± 1.13 |
| llama 7B Q4_0 | 3.56 GiB | 6.74 B | Vulkan | 99 | 1 | Vulkan0/1 | 4.00/6.00 | pp512 | 1946.46 ± 19.60 |
| llama 7B Q4_0 | 3.56 GiB | 6.74 B | Vulkan | 99 | 1 | Vulkan0/1 | 4.00/6.00 | tg128 | 62.84 ± 0.15 |
| llama 7B Q4_0 | 3.56 GiB | 6.74 B | Vulkan | 99 | 1 | Vulkan0/1 | 3.00/7.00 | pp512 | 1644.25 ± 29.88 |
| llama 7B Q4_0 | 3.56 GiB | 6.74 B | Vulkan | 99 | 1 | Vulkan0/1 | 3.00/7.00 | tg128 | 58.38 ± 0.31 |
| llama 7B Q4_0 | 3.56 GiB | 6.74 B | Vulkan | 99 | 1 | Vulkan0/1 | 2.00/8.00 | pp512 | 1458.99 ± 19.70 |
| llama 7B Q4_0 | 3.56 GiB | 6.74 B | Vulkan | 99 | 1 | Vulkan0/1 | 2.00/8.00 | tg128 | 55.70 ± 0.49 |
| llama 7B Q4_0 | 3.56 GiB | 6.74 B | Vulkan | 99 | 1 | Vulkan0/1 | 1.00/9.00 | pp512 | 1304.67 ± 45.80 |
| llama 7B Q4_0 | 3.56 GiB | 6.74 B | Vulkan | 99 | 1 | Vulkan0/1 | 1.00/9.00 | tg128 | 54.16 ± 1.07 |
| llama 7B Q4_0 | 3.56 GiB | 6.74 B | Vulkan | 99 | 1 | Vulkan0/1 | 0.00/10.00 | pp512 | 1194.55 ± 5.25 |
| llama 7B Q4_0 | 3.56 GiB | 6.74 B | Vulkan | 99 | 1 | Vulkan0/1 | 0.00/10.00 | tg128 | 52.62 ± 0.72 |
During token generation with split layers, the drop in overall tg and pp speed follows Amdahl's Law. Moving even a small fraction of layers to lower-bandwidth memory creates a bottleneck, leading to a non-linear drop in overall speed (t/s). If you graph it, it forms a classic hyperbola.

Formula: P(s) = 100 / [1 + s(k - 1)]
Where:
- P(s) = total system speed (in % of max eGPU speed).
- s = fraction of the model offloaded to the slower APU RAM (from 0 to 1, where 0 is all in VRAM and 1 is all in RAM).
- k = memory bandwidth gap ratio. Calculated as max speed divided by min speed (k = V_max / V_min).
As you can see, the overall tg and pp speeds depend only on the tg and pp of each node. OCuLink doesn't affect the overall speed at all.
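To sanity-check the curve, here is a small Python sketch (my own, not part of the original benchmark run) that plugs the two endpoint tg128 numbers from the table into the formula:

```python
def amdahl_tps(s, t_fast, t_slow):
    """Predicted combined tokens/s when a fraction s of the model
    sits on the slow device and execution is strictly sequential."""
    k = t_fast / t_slow                 # speed gap ratio (k = V_max / V_min)
    return t_fast / (1 + s * (k - 1))

# tg128 endpoints from the table: 168.91 t/s (all eGPU), 52.62 t/s (all APU)
for s in (0.0, 0.2, 0.5, 0.8, 1.0):
    print(f"s={s:.1f}: {amdahl_tps(s, 168.91, 52.62):6.1f} t/s")
```

At s = 0.5 the formula predicts about 80 t/s versus the measured 76.73, and at s = 0.2 about 117 versus the measured 112.66, which matches the hyperbolic shape of the data well.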
Detailed Conclusions & Technical Analysis:
Based on the benchmark data and the architectural specifics of LLMs, here is a deeper breakdown of why we see these results.
1. Vulkan is the Ultimate API for Cross-Vendor Inference
Historically, mixing AMD and NVIDIA chips for compute tasks in a single pipeline has been a driver nightmare. However, llama.cpp's Vulkan backend completely changes the game.
- The Justification: Vulkan abstracts the hardware layer, standardizing the matrix multiplication math across entirely different architectures (RDNA 3.5 on the APU and Blackwell on the RTX 5070 Ti).
- The Result: It allows for seamless, stable pooling of discrete VRAM and system UMA memory. The performance penalty compared to highly optimized, native backends like CUDA or ROCm is practically negligible (only about 5–10%). You lose a tiny fraction of raw speed to the API translation layer, but you gain the massive advantage of fitting larger models across different hardware ecosystems without crashing.
2. The OCuLink Myth: PCIe 4.0 x4 is NOT a Bottleneck for LLMs
There is a widespread stereotype in the eGPU community that the limited bandwidth of OCuLink (~7.8 GB/s, i.e. 64 Gbps) will throttle AI performance. For LLM inference, this is completely false: OCuLink bandwidth utilization sits around 1% during active generation. Here is the math behind why the communication penalty is practically zero:
- Token Generation (Decode Phase): Thanks to the Transformer architecture, GPUs do not send entire neural networks back and forth. When the model is split across two devices, they only pass a small tensor of hidden states (activations) for a single token at a time. That payload is hidden size × 2 bytes in fp16: roughly 8 KB for a 7B model and only 16 KB even for a 70B. Sending kilobytes over a 7.8 GB/s connection takes about a microsecond.
- Context Processing (Prefill Phase): Even when digesting a massive prompt of 10,000+ tokens, llama.cpp processes the data in chunks (typically 512 tokens at a time). A 512-token chunk translates to just a few Megabytes of data transferred across the PCIe bus. Moving 8MB over OCuLink takes about 1 millisecond. Meanwhile, the GPUs take tens or hundreds of milliseconds to actually compute that chunk.
- The True Bottleneck: System speed is dictated entirely by the memory bandwidth of the individual nodes (RTX 5070 Ti at ~900 GB/s vs the APU at ~200 GB/s), not the PCIe connection between them. The only scenarios where OCuLink's narrow bus will actually hurt you are the initial loading of the model weights from your SSD/RAM into the eGPU (taking 3–4 seconds instead of 1) or full fine-tuning, which requires constantly moving massive arrays of gradients.
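The transfer arithmetic above can be checked in a few lines. This sketch assumes Llama-2-7B's hidden size of 4096 and fp16 activations; the exact payload depends on the model, but the orders of magnitude hold:

```python
# Back-of-envelope PCIe transfer cost at the split boundary.
HIDDEN = 4096            # Llama-2-7B hidden size (assumption for this sketch)
FP16 = 2                 # bytes per activation value
LINK = 7.8e9             # OCuLink / PCIe 4.0 x4, ~7.8 GB/s usable

decode_bytes = HIDDEN * FP16            # one token's hidden state
prefill_bytes = 512 * HIDDEN * FP16     # one 512-token prefill chunk

print(f"decode:  {decode_bytes / 1024:.0f} KiB -> {decode_bytes / LINK * 1e6:.1f} us")
print(f"prefill: {prefill_bytes / 2**20:.0f} MiB -> {prefill_bytes / LINK * 1e3:.2f} ms")
```

Each decode hop costs about a microsecond and each 512-token prefill chunk about half a millisecond, which is noise next to the tens of milliseconds the GPUs spend actually computing the chunk.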
3. Amdahl’s Law and the "Relay Race" Pipeline Stalls
When using Tensor Splitting across multiple devices at batch size 1 (standard local inference without micro-batching), llama.cpp executes a strictly sequential pipeline.
- The Justification: Layer 2 cannot be computed until Layer 1 is finished. If you put 80% of the model on the lightning-fast RTX 5070 Ti and 20% on the slower AMD APU, they do not work simultaneously. The RTX processes its layers instantly, passes the tiny activation tensor over OCuLink, and then goes to sleep (Pipeline Stall). It sits completely idle, waiting for the memory-bandwidth-starved APU to grind through its 20% share of the layers.
- The Result: You are not adding compute power; you are adding a slow runner to a relay race. Because the fast GPU is forced to wait, the performance penalty of offloading layers to slower system memory is non-linear. As shown in the data, it perfectly graphs out as a classic hyperbola governed by Amdahl's Law. Moving just 10-20% of the workload to the slower node causes a disproportionately massive drop in total tokens per second.
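For intuition on how badly the stall hurts, here is a rough per-token time budget (my own sketch, reusing the tg128 endpoints from the split table): even 20% of the layers on the APU leaves the fast GPU idle for almost half of every token.

```python
def split_budget(s, t_fast, t_slow):
    """Per-token time (ms) each device contributes when a fraction s
    of the layers sits on the slow device; the pipeline is sequential."""
    fast_ms = (1 - s) / t_fast * 1000   # time the fast GPU spends computing
    slow_ms = s / t_slow * 1000         # time the fast GPU sits idle, waiting
    return fast_ms, slow_ms

# 80/20 split, using 168.91 t/s (eGPU alone) and 52.62 t/s (APU alone)
fast_ms, slow_ms = split_budget(0.2, 168.91, 52.62)
idle = slow_ms / (fast_ms + slow_ms)
print(f"fast GPU: {fast_ms:.2f} ms, APU: {slow_ms:.2f} ms, "
      f"fast GPU idle {idle:.0%} of each token")
```

With just 20% of the layers offloaded, the APU's ~3.8 ms share puts the RTX idle roughly 45% of each token, consistent with the 112.66 t/s measured for the 8/2 split.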
System Configuration:
- Base: Minisforum MS-S1 Max (Strix Halo APU, AMD Radeon 8060S iGPU, RDNA 3.5 architecture). Quiet power mode.
- RAM: 128GB LPDDR5X-8000 (iGPU memory bandwidth is ~210 GB/s in practice; theoretical is 256 GB/s).
- OS: CachyOS (Linux 6.19.11-1-cachyos) with the latest Mesa driver (RADV). Booted with GRUB params:
GRUB_CMDLINE_LINUX="... iommu=pt amdgpu.gttsize=126976 ttm.pages_limit=32505856"
eGPU Setup:
- GPU: NVIDIA RTX 5070 Ti
- To get an OCuLink port on the Minisforum MS-S1 Max, I added a PCIe 4.0 x4 to OCuLink SFF8611/8612 adapter.
- Dock: I bought a cheap F9G-BK7 eGPU dock. PSU is a 1STPLAYER NGDP Gold 850W.
- Everything worked right out of the box, zero compatibility issues.
meebeegee1123122@reddit
I have a similar build using the Framework 128GB and an ADT-Link PCIe riser cable, and I have been able to get fairly stable PCIe Gen 4 data transfer speeds. I am particularly interested in using CUDA for fine-tuning while holding the frozen model in the 128 GB of RAM, so that is my primary use case at this time. I only found a small benefit for token generation when using the split model vs using the APU alone, as you noted.
Due_Net_3342@reddit
What is the point of having both and running a 7B model? You could run it directly on the eGPU itself… Also, the bottleneck is minimal when you split the model more or less equally, BUT if you have a 110GB model and split it into 90 and 20GB you will see HUGE drops in tg; I tested this myself with 16GB of VRAM. For PP you will see modest improvements.
Currently waiting for a 24gb card to see if this improves things or not for the bigger models
Geritas@reddit
Reminded me of this guy on YouTube who gets all this crazy hardware and runs qwen 3 4b on it for more than half of the video every goddamn time
RedParaglider@reddit
Alex Ziskind. Next time I get a recommendation I'm telling youtube to never show me his fucking channel again. 4 fucking nvidia GPU's with 128gb of vram and runs a Qwen 3 4b, the lamest fucking shit ever. What the absolute fuck. Not even a, hey I'll hit the bigger models next video or anything, just on to the next unboxing useless clickbait bullshit.
It made the hardware look like shit too.
spaceman_@reddit
Isn't this typical of all tech reviewers? Seems like everyone - even AI focused or aware channels - are just running Llama3 8B and comparing the numbers.
The point is the memory size! Show us the big model numbers!
xspider2000@reddit (OP)
I’ve just published a new post where I tried to shed more light on the topic and answer some common questions
https://www.reddit.com/r/LocalLLaMA/comments/1sfzrdv/strix_halo_egpu_rtx_5070_ti_via_oculink_in/
Geritas@reddit
It is, and his videos are otherwise interesting, but I can’t stop rolling my eyes every time he gets something ridiculous like 4xRTX 6000 pro and runs qwen 3 4b on it in multiple different ways for 15 minutes
xspider2000@reddit (OP)
Fair question! It's just for benchmarking purposes.
llama-2-7b.Q4_0.gguf has historically become the de facto standard baseline for testing and comparing different hardware setups. You can find more examples of this benchmark here: https://github.com/ggml-org/llama.cpp/discussions/15013 As the article shows, Amdahl's law applies here: the less weight you leave on the slower Strix Halo memory, the faster your overall PP and TG will be. And the bottleneck in TG isn't OCuLink (which easily handles the tiny hidden-state transfers), but the lower memory bandwidth of the system RAM itself.
derekp7@reddit
Question. If you have a 120b model on just the apu, will you get a speedup moving some of it to the egpu (vs all of it on the APU)? Related, can the egpu speed up prompt processing compared to the 120b model on apu only?
xspider2000@reddit (OP)
Yes, it will speed up tg and pp. About a separate host I am not sure; due to the high latency you can lose tg speed. I’ve just published a new post where I tried to shed more light on the topic and answer some common questions
https://www.reddit.com/r/LocalLLaMA/comments/1sfzrdv/strix_halo_egpu_rtx_5070_ti_via_oculink_in/
madsheepPL@reddit
Not sure about the first part but I think I can answer the second - you can join the hosts with ethernet and run llama rpc - however the latency of the connection will kill the performance. Latency is more important than total bandwidth - look up how RDMA works (over usb or roce).
xspider2000@reddit (OP)
Bench for Qwen3.5-27B-UD-Q4_K_XL.gguf with different model fractions on the eGPU and iGPU:
~/llama.cpp/build-vulkan/bin/llama-bench \
-m /home/yulay/LLM/unsloth/Qwen3.5-27B-GGUF/Qwen3.5-27B-UD-Q4_K_XL.gguf \
-ngl 99 \
-fa 1 \
-dev vulkan1/vulkan0 \
-ts 10/0,9/1,8/2,7/3,6/4,5/5,4/6,3/7,2/8,1/9,0/10 \
-n 128 \
-p 512
ggml_vulkan: Found 2 Vulkan devices:
ggml_vulkan: 0 = NVIDIA GeForce RTX 5070 Ti (NVIDIA) | uma: 0 | fp16: 1 | bf16: 1 | warp size: 32 | shared memory: 49152 | int dot: 1 | matrix cores: NV_coopmat2
ggml_vulkan: 1 = Radeon 8060S Graphics (RADV STRIX_HALO) (radv) | uma: 1 | fp16: 1 | bf16: 0 | warp size: 64 | shared memory: 65536 | int dot: 1 | matrix cores: KHR_coopmat
| model | size | params | backend | ngl | fa | dev | ts | test | t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | -: | --------------- | ------------ | --------------: | -------------------: |
| qwen35 27B Q4_K - Medium | 16.40 GiB | 26.90 B | Vulkan | 99 | 1 | Vulkan1/Vulkan0 | 10.00 | pp512 | 269.13 ± 1.37 |
| qwen35 27B Q4_K - Medium | 16.40 GiB | 26.90 B | Vulkan | 99 | 1 | Vulkan1/Vulkan0 | 10.00 | tg128 | 11.90 ± 0.01 |
| qwen35 27B Q4_K - Medium | 16.40 GiB | 26.90 B | Vulkan | 99 | 1 | Vulkan1/Vulkan0 | 9.00/1.00 | pp512 | 296.54 ± 14.25 |
| qwen35 27B Q4_K - Medium | 16.40 GiB | 26.90 B | Vulkan | 99 | 1 | Vulkan1/Vulkan0 | 9.00/1.00 | tg128 | 12.33 ± 0.01 |
| qwen35 27B Q4_K - Medium | 16.40 GiB | 26.90 B | Vulkan | 99 | 1 | Vulkan1/Vulkan0 | 8.00/2.00 | pp512 | 303.92 ± 11.81 |
| qwen35 27B Q4_K - Medium | 16.40 GiB | 26.90 B | Vulkan | 99 | 1 | Vulkan1/Vulkan0 | 8.00/2.00 | tg128 | 12.95 ± 0.07 |
| qwen35 27B Q4_K - Medium | 16.40 GiB | 26.90 B | Vulkan | 99 | 1 | Vulkan1/Vulkan0 | 7.00/3.00 | pp512 | 341.83 ± 3.60 |
| qwen35 27B Q4_K - Medium | 16.40 GiB | 26.90 B | Vulkan | 99 | 1 | Vulkan1/Vulkan0 | 7.00/3.00 | tg128 | 13.54 ± 0.11 |
| qwen35 27B Q4_K - Medium | 16.40 GiB | 26.90 B | Vulkan | 99 | 1 | Vulkan1/Vulkan0 | 6.00/4.00 | pp512 | 392.76 ± 3.41 |
| qwen35 27B Q4_K - Medium | 16.40 GiB | 26.90 B | Vulkan | 99 | 1 | Vulkan1/Vulkan0 | 6.00/4.00 | tg128 | 14.80 ± 0.23 |
| qwen35 27B Q4_K - Medium | 16.40 GiB | 26.90 B | Vulkan | 99 | 1 | Vulkan1/Vulkan0 | 5.00/5.00 | pp512 | 443.23 ± 1.36 |
| qwen35 27B Q4_K - Medium | 16.40 GiB | 26.90 B | Vulkan | 99 | 1 | Vulkan1/Vulkan0 | 5.00/5.00 | tg128 | 17.43 ± 0.13 |
| qwen35 27B Q4_K - Medium | 16.40 GiB | 26.90 B | Vulkan | 99 | 1 | Vulkan1/Vulkan0 | 4.00/6.00 | pp512 | 457.50 ± 1.47 |
| qwen35 27B Q4_K - Medium | 16.40 GiB | 26.90 B | Vulkan | 99 | 1 | Vulkan1/Vulkan0 | 4.00/6.00 | tg128 | 19.89 ± 0.04 |
| qwen35 27B Q4_K - Medium | 16.40 GiB | 26.90 B | Vulkan | 99 | 1 | Vulkan1/Vulkan0 | 3.00/7.00 | pp512 | 629.92 ± 4.09 |
| qwen35 27B Q4_K - Medium | 16.40 GiB | 26.90 B | Vulkan | 99 | 1 | Vulkan1/Vulkan0 | 3.00/7.00 | tg128 | 22.24 ± 0.11 |
| qwen35 27B Q4_K - Medium | 16.40 GiB | 26.90 B | Vulkan | 99 | 1 | Vulkan1/Vulkan0 | 2.00/8.00 | pp512 | 801.37 ± 3.19 |
| qwen35 27B Q4_K - Medium | 16.40 GiB | 26.90 B | Vulkan | 99 | 1 | Vulkan1/Vulkan0 | 2.00/8.00 | tg128 | 26.01 ± 0.03 |
| qwen35 27B Q4_K - Medium | 16.40 GiB | 26.90 B | Vulkan | 99 | 1 | Vulkan1/Vulkan0 | 1.00/9.00 | pp512 | 1027.51 ± 6.28 |
| qwen35 27B Q4_K - Medium | 16.40 GiB | 26.90 B | Vulkan | 99 | 1 | Vulkan1/Vulkan0 | 1.00/9.00 | tg128 | 30.14 ± 0.08 |
ggml_vulkan: Device memory allocation of size 1067094656 failed.
ggml_vulkan: vk::Device::allocateMemory: ErrorOutOfDeviceMemory
main: error: failed to load model '/home/yulay/LLM/unsloth/Qwen3.5-27B-GGUF/Qwen3.5-27B-UD-Q4_K_XL.gguf'
segmond@reddit
crappy test. run a large 100B model, only on the Strix, then on the combo with the 5070 as main GPU.
xspider2000@reddit (OP)
I’ve just published a new post where I tried to shed more light on the topic and answer some common questions
https://www.reddit.com/r/LocalLLaMA/comments/1sfzrdv/strix_halo_egpu_rtx_5070_ti_via_oculink_in/
segmond@reddit
I wanted to see the test of the large model on only Strix, then on 5070/Strix. That would show if the 5070 has any impact on PP and TG.
Badger-Purple@reddit
I think layer split will improve your generation
harpysichordist@reddit
Good information. How does it perform with a larger model? For example Qwen3.5 122B, etc
xspider2000@reddit (OP)
Already downloading Qwen3.5-122B-A10B-UD-Q4_K_XL
simracerman@reddit
How's it going?
xspider2000@reddit (OP)
I’ve just published a new post where I tried to shed more light on the topic and answer some common questions
https://www.reddit.com/r/LocalLLaMA/comments/1sfzrdv/strix_halo_egpu_rtx_5070_ti_via_oculink_in/
simracerman@reddit
Thanks! I put a comment there for you.
Possible-Pirate9097@reddit
How'd it go?
StardockEngineer@reddit
Can we please flag these bullshit AI posts that seem to be from someone’s stolen Reddit account? They all have the same format and always testing old models.
xspider2000@reddit (OP)
I’ve just published a new post where I tried to shed more light on the topic and answer some common questions
https://www.reddit.com/r/LocalLLaMA/comments/1sfzrdv/strix_halo_egpu_rtx_5070_ti_via_oculink_in/
Potential-Leg-639@reddit
So much text and why llama 7B? Why not Qwen3.5 models?
xspider2000@reddit (OP)
I’ve just published a new post where I tried to shed more light on the topic and answer some common questions
https://www.reddit.com/r/LocalLLaMA/comments/1sfzrdv/strix_halo_egpu_rtx_5070_ti_via_oculink_in/
StableLlama@reddit
Interesting for the base line.
But the real test is a model that doesn't fit in the VRAM of the 5070 and comparing that with the baseline. Is it still the simple interpolation / prediction formula?
xspider2000@reddit (OP)
Yeah, you can interpolate. I’ve just published a new post where I tried to shed more light on the topic and answer some common questions
https://www.reddit.com/r/LocalLLaMA/comments/1sfzrdv/strix_halo_egpu_rtx_5070_ti_via_oculink_in/
Hrethric@reddit
Thanks for sharing! I tried something similar with a Framework Desktop a couple of months ago, with a powered PCIe x4-x16 adaptor, but I didn't have as much success. I had to drop it down to PCIe 3.0 mode to get it stable, and I was using CUDA with my 3090 and Vulkan for the Strix Halo. The best performance I was able to get was just a little slower than the Strix Halo alone, I think around 27 t/s with the split, whereas the Strix Halo alone would get 32. Unfortunately I don't have my notes handy. I was thinking of trying again with a shorter adapter; now I might try running Vulkan only and also try an OCuLink adapter.
xspider2000@reddit (OP)
No model can saturate the interface bandwidth, even PCIe 3.0 I think. But high interface latency can kill your tg. I’ve just published a new post where I tried to shed more light on the topic and answer some common questions
https://www.reddit.com/r/LocalLLaMA/comments/1sfzrdv/strix_halo_egpu_rtx_5070_ti_via_oculink_in/
aigemie@reddit
Very interesting, thanks for sharing! Maybe I missed it - does it help with PP (prefill) speed? Strix Halo is infamous for its PP.
xspider2000@reddit (OP)
Yes, you can see it on the graph in the post. The more model weights you offload to the fast eGPU, the better the PP (prompt processing) and TG (token generation) speeds. Performance drops quite fast as you move a larger share of the weights to the slower Strix Halo. However, splitting the weights between the eGPU and the Strix Halo still results in much better overall PP and TG speeds than running the model entirely on the Strix Halo alone.
aigemie@reddit
That's great to hear! I'm considering buying a USB4 eGPU dock, as OCuLink is not much more helpful.
xspider2000@reddit (OP)
If you have the choice, I would recommend OCuLink for LLMs. I’ve just published a new post where I tried to shed more light on the topic and answer some common questions
https://www.reddit.com/r/LocalLLaMA/comments/1sfzrdv/strix_halo_egpu_rtx_5070_ti_via_oculink_in/
Zc5Gwu@reddit
Please update us once you get things going. I'm interested in usb4 as well for a similar setup.
xspider2000@reddit (OP)
I’ve just published a new post where I tried to shed more light on the topic and answer some common questions
https://www.reddit.com/r/LocalLLaMA/comments/1sfzrdv/strix_halo_egpu_rtx_5070_ti_via_oculink_in/
mindwip@reddit
I have the same strix halo.
Two questions.
Do you recommend the eGPU? I missed it if you put that in. I plan to get a 32GB or bigger GPU and do the same thing.
Why not use the 80Gbps USB4 v2 port instead of the OCuLink? Not saying you did it wrong, just wondering why that choice!
Thanks!
ReactionaryPlatypus@reddit
I have a 128GB Strix Halo with a 3090 24GB eGPU. I can run Minimax 2.5 IQ4_XS (113GB size plus context) across both GPUs using llama.cpp's Vulkan backend.
It works great and is generally stable for hours of work, but it has one issue: exiting llama-server crashes Windows, if and only if both GPUs share a model.
madsheepPL@reddit
how fast is it? pp / tg?
mindwip@reddit
Lol good to know thanks
xspider2000@reddit (OP)
Yes, I recommend it. Since PCIe bandwidth doesn't bottleneck token generation, dropping a 32GB+ eGPU into this setup is a fantastic way to run heavy LLMs on a mini-PC without building a massive full-tower rig. Plus, having a dedicated NVIDIA eGPU means you can comfortably run diffusion models and generate images in ComfyUI without breaking a sweat.
Mainly for latency and cost. OCuLink is a direct PCIe connection with zero protocol overhead. USB4 encapsulates the signal, which inevitably adds minor latency. When you are doing tensor split across an APU and an eGPU, the pipeline is strictly sequential. In this specific scenario, any added latency from the connection protocol can quickly compound and become an actual bottleneck. Plus, OCuLink docks are incredibly cheap and proven to work flawlessly right now, whereas true 80Gbps USB4 enclosures are still very rare and expensive. It's simply the most direct, stable, and budget-friendly path.
mindwip@reddit
Thanks!
tisDDM@reddit
Just for reference. A month ago I posted a SH benchmark with my eGPU (3060) and a mixed ROCm/CUDA backend - the numbers were produced before llama.cpp got a bunch of optimizations https://www.reddit.com/r/StrixHalo/comments/1rm9nlo/performance_test_for_combined_rocm_cuda_llamacpp/
Looking at your numbers I see a lot of potential for optimization. E.g. your combined Vulkan numbers are in the same ballpark as my SH baseline. Even back then I got a 30% increase with partial offloading to the 3060, resulting in 600 tok/s PP4096 and 15 tok/s TG128 on Qwen 3.5 in q4_0.
Having said that, I changed from the 3060 to an R9700, giving me around PP 1000 / TG 20.
The 5070 should be capable of far more throughput.
Anarchaotic@reddit
Would love to do some of my own testing on the Framework, but unfortunately I don't have an OCuLink adapter or hub. I do have a Razer Core X, so hypothetically I could try that with my 5090. I wonder if I'd run into any bandwidth issues; you mentioned OCuLink didn't really matter all that much.
xspider2000@reddit (OP)
That sounds like an awesome setup! The RTX 5090 is a beast.
If the entire model fits into the VRAM of your 5090, you won't have any issues at all.
However, if you plan to do tensor split (offloading some layers to the Framework's system RAM), you will likely see a bigger performance hit than I did with OCuLink. The Razer Core X uses Thunderbolt, which encapsulates the PCIe signal and adds higher latency compared to the direct PCIe connection of OCuLink. Since layer processing is strictly sequential, the latency overhead from the Thunderbolt controller will compound and cause deeper pipeline stalls between the 5090 and the APU.
Still, absolutely try it out!
Zc5Gwu@reddit
Do you have any idea of the effect of the extra latency on speeds or TTFT?
Everlier@reddit
I think with configs like that a P/D disaggregation might make more sense than a tensor split, just to compensate for the area where the APU is the weak link. I know, however, that there's no ready-made solution (as far as I'm aware) for that with a Vulkan + Nvidia/AMD combo.
xspider2000@reddit (OP)
Absolutely agree. In theory, P/D disaggregation would be a much more elegant way to bypass the APU bottleneck. Sadly, as you mentioned, the software just isn't there yet for a mixed Nvidia/AMD setup.