Advice needed on eGPU and Mini PC
Posted by Kulidc@reddit | LocalLLaMA | 21 comments
Hi all, I've come across a relatively niche problem and could not find many useful posts or guides about it.
I have a mini PC (Beelink SER8, 8745HS, 32GB 5600 DDR5 SODIMM) running as a headless server hosting some routing services, and I am wondering whether I could buy an external GPU docking station and a new GPU, connected through the USB4 interface (~40Gb/s) or through OCuLink from the spare SSD slot (PCIe 4.0 x4, ~64Gb/s), so it could also serve as a coding agent or small assistant.
I would prefer 32GB of VRAM, like the AI PRO R9700 (cheap, but ROCm, which is a headache) or the RTX Pro 4500, for serving Qwen 3.6 27B AWQ at 4 or 6 bit in vLLM.
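Software-wise, roughly what I have in mind is below (just a sketch using vLLM's Python API; the model path, quant, and context length are placeholders, not a tested config):

```python
# Rough serving sketch with vLLM's Python API. The model path,
# quantization, and context length below are placeholders.
from vllm import LLM, SamplingParams

llm = LLM(
    model="Qwen/Qwen3.6-27B-AWQ",  # placeholder repo for the AWQ build
    quantization="awq",
    max_model_len=16384,           # whatever fits in 32GB next to the KV cache
)

params = SamplingParams(temperature=0.7, max_tokens=256)
out = llm.generate(["Write a bash one-liner to tail the newest log file."], params)
print(out[0].outputs[0].text)
```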
I will not consider MoE models like the Qwen 3.6 35B-A3B with CPU offloading due to the connection interface, nor will I consider a 5090 due to its large size, heat output and high power draw (I do not want my house to be burnt down by the connector).
Am I missing anything important here, apart from the interface and offloading?
Could anyone share a similar experience setting up an eGPU with Ubuntu?
Material-Duck-6252@reddit
Similar setup here. I would highly recommend using OCuLink via the M.2 slot, which is much more stable. Some notes based on my experience:
- Works with Windows but better in Ubuntu.
- Always load the model fully into VRAM.
- eGPU works pretty well, and so do AMD GPUs (for inference). I use a 7900 XT and compile llama.cpp with HIP for inference. Did not see any difference other than the initial model load. ROCm also supports flash attention and some other accelerators. (Quick visibility check in the sketch after this list.)
- A lot slower with diffusion models compared with Nvidia cards.
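A quick way to confirm the eGPU is actually visible (assuming a ROCm build of PyTorch, which exposes AMD GPUs through the torch.cuda namespace):

```python
# Sanity check that ROCm sees the eGPU. Assumes a ROCm build of PyTorch.
import torch

print("HIP build:", torch.version.hip)       # None on a CUDA build
print("GPU available:", torch.cuda.is_available())
for i in range(torch.cuda.device_count()):
    print(i, torch.cuda.get_device_name(i))  # both iGPU and eGPU can show up here
```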
Kulidc@reddit (OP)
My man, that's very helpful to know!
If AMD works pretty well out of the box, then I would also consider the R9700 Pro.
As I would like to use it for text inference most of the time, no image generation would be done on this device (I have other PCs for that anyway).
I will install Ubuntu as dual boot very soon, so the OS should not be a problem either.
Once again, thanks for the info :)
Material-Duck-6252@reddit
BTW, consider GPU isolation (see "GPU isolation techniques" in the ROCm documentation) to avoid the model running on your integrated graphics (from the 8745HS). This should not be an issue with llama.cpp as you would specify the graphics card when compiling it.
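A minimal sketch of the env-var side of it (these are the documented ROCm visibility variables; treating device 0 as the eGPU is an assumption, check the ordering on your box first):

```python
# Hide the iGPU from ROCm before any GPU library loads. Index 0 for the
# eGPU is an assumption; verify the device order with rocminfo first.
import os
os.environ["ROCR_VISIBLE_DEVICES"] = "0"  # ROCm runtime level
os.environ["HIP_VISIBLE_DEVICES"] = "0"   # HIP level

# ...then import/launch whatever runs the inference.
```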
I am also considering getting an R9700 Pro card, so looking forward to hearing more from you. :)
Kulidc@reddit (OP)
Thx for the info, and I think I have read it before.
Man, was it a massive headache for me back at the launch of the 9070 XT. I was using a 4070 Ti Super with a modified 22GB 2080 Ti back then, then switched to the 9070 XT and had a taste of ROCm, HIP and ZLUDA.
After roughly a few days of debugging dependencies and compiling kernels, I realized that the 9070 XT's performance was behind my 4070 Ti Super on text generation (llama.cpp + ROCm), and even worse than my old 2080 Ti on image generation.
I know it is much better now due to optimizations, but I don't want to go through that hell again lol. The card was sold after 3 months.
Moreover, it seems there are some bugs, and some multimodal models may perform poorly on AMD GPUs iirc.
The R9700 is cheap, yet at the cost of your time imo. I am still saving for the card rn, and would like to see more posts related to this card as well.
Mantikos804@reddit
It’s doable, but at that point get an Ollama yearly subscription, get access to a variety of cloud models instead, and keep using your mini PC as-is. Take the rest of the cash you didn’t spend and buy NVDA.
Kulidc@reddit (OP)
That's an alternative I have considered before, and I thought about renting GPUs and hosting with vLLM as well.
The data is sensitive enough (customer names, CC info, addresses, etc.) that I want it fully under my own control though; that's why I want to add an eGPU to my mini PC and call the service from my company's PC in the first place.
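The calling side would be something like this (a sketch assuming vLLM's OpenAI-compatible server on the mini PC; the host, port, and model name are placeholders):

```python
# Query a self-hosted vLLM endpoint from another machine on the LAN.
# Host, port, and model name are placeholders for this setup.
from openai import OpenAI

client = OpenAI(
    base_url="http://minipc.local:8000/v1",  # vLLM's OpenAI-compatible endpoint
    api_key="not-needed-locally",            # vLLM ignores the key unless --api-key is set
)

resp = client.chat.completions.create(
    model="Qwen3.6-27B-AWQ",  # placeholder: whatever `vllm serve` loaded
    messages=[{"role": "user", "content": "Summarize yesterday's error logs."}],
)
print(resp.choices[0].message.content)
```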
Mantikos804@reddit
I would say then get a desktop for the GPU. It’s future proof and will always be more powerful than a mini PC setup. Mini PCs are really for web surfing.
Kulidc@reddit (OP)
I do have a desktop, which I use for both working and gaming, with a 5090. However, I do not really want this desktop to be on 24/7 running vLLM. It's both costly and risky imo.
The whole setup draws like 150W to 200W even at idle, not to mention it has other ongoing services as well, and I assume it will draw even more if I host vLLM on it. That's why I want to work with the mini PC and an RTX Pro lineup card. Together they should be much more power efficient than hosting on my desktop.
handyman5@reddit
I did basically this. It shows up in `nvidia-smi` and whatnot just like it was plugged directly into the motherboard.
Kulidc@reddit (OP)
Glad to hear; at least I know it's doable for Nvidia cards.
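Once it arrives, something like this should confirm the card programmatically too (a sketch using the nvidia-ml-py bindings; assuming pynvml is installed):

```python
# Enumerate Nvidia GPUs via NVML; an eGPU over USB4/OCuLink shows up
# exactly like a motherboard-attached card.
import pynvml

pynvml.nvmlInit()
for i in range(pynvml.nvmlDeviceGetCount()):
    handle = pynvml.nvmlDeviceGetHandleByIndex(i)
    name = pynvml.nvmlDeviceGetName(handle)
    mem = pynvml.nvmlDeviceGetMemoryInfo(handle)
    print(f"{i}: {name}, {mem.total // 2**20} MiB VRAM")
pynvml.nvmlShutdown()
```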
Comfortable-Fall1419@reddit
Best practice is never to pump PII into a model in the first place.
Kulidc@reddit (OP)
I do follow this practice as best I can and never include PII in chats to the cloud provider.
However, I noticed the major CLI agents could call commands and obtain some customer names from the development DB (luckily it's a dummy DB) during debugging. I found that while re-tracing the model's CoT. I guess it's kind of my fault for including the connection settings during development.
o0genesis0o@reddit
I have a similar mini PC with that chip. I bought it with the intention of having an eGPU from the get-go, so I actually found a model with an external OCuLink port, not random wiring from the SSD slot. Now I just need to save up for a new GPU for the main rig, so that I can take the old GPU out and attach it to the mini PC.
The mini PC itself is quite interesting. It even runs Cyberpunk at a stable framerate and resolution. The AMD iGPU can run something like OSS 20B or Gemma e4b at decent speed for chat too. However, there is a bad issue with amdgpu on Linux kernel 6.19 upward, so I have had hard crashes when running compute on the iGPU since Dec 2025. I heard that Ubuntu is not impacted since it runs on an LTS kernel.
Anyhow, pretty beefy chip for a tiny computer that does not cost that much. Just make sure you have the right port, so that it will be less painful with an eGPU later.
Silver-Champion-4846@reddit
I wish I had this so I could use 12B-14B models. Maybe even Qwen3.5 9B could work with a specially structured knowledge base for coding, where each function or set of functions has example uses like "print(var): types out the value of the variable". I can't test, so I can't vouch for the accuracy (or lack thereof) of this method.
o0genesis0o@reddit
IMHO, it's better to use MoE with this machine. This machine itself craps out in inference due to the amdgpu driver, but I have a laptop with the newer but more power-constrained version of this chip, and it runs MoE very nicely at around a 50W burst (for context, I have seen the mini PC surging to 120W). I kept OSS 20B on this laptop, but I'm recently getting rid of it in favor of Gemma e2b for the vision support.
Dense models, even the likes of Llama 8B, are no good for this kind of chip, since it just does not have enough compute for the prompt processing phase. Same situation with the likes of the MacBook Air M4. A ~3B-active MoE is the practical spot.
Kulidc@reddit (OP)
Thx for the input.
I know the iGPU is quite powerful and could be used for some smaller models, but those are slow and may not be good enough for coding.
As for the connection, that's why I would consider using the unused SSD slot rather than the USB-C connector on the front.
o0genesis0o@reddit
Not 100% sure, but I remember the reason I skipped that Beelink machine you mentioned and went with the Bosgame with OCuLink for a bit of extra money is that USB4 eGPU with an AMD CPU, on Linux, is kind of a risky business compatibility-wise, whilst OCuLink is smooth (since it's just PCIe, afaik). Can't recall the exact details, but I do remember this part from when I was shopping for the mini PC.
Oh well, it's kind of a waste at the moment. I'm just running a single Docker container of my own AI agent tool, with a VPN to access it from outside the house. That's it. Fingers crossed Linux 7 will fix the compute error. It can definitely be an all-in-one self-hosted AI assistant box. Not for coding though; neither the compute nor the memory bandwidth is enough.
Kulidc@reddit (OP)
That's interesting to know; guess I will be sticking to the SSD slot with an OCuLink-to-NVMe M.2 adapter then.
JohnToFire@reddit
I have seen that done in an NVMe slot. I suggest trying the model on Vast.ai first if you haven't, since it won't match the best cloud models.
Kulidc@reddit (OP)
I have a GPT Pro subscription as my main cloud model. However, I would like to use self-hosted models to keep some sensitive information and coding private.
I have tried the Qwen 3.6 model and Gemma 4 on my main workstation before. The performance is good enough for me. However, the workstation is too beefy and draws too much power even at idle. That's why I would like to migrate part of the service onto the mini PC.