Does it make sense to cluster HP Z2 Mini G1a to increase performance?
Posted by ThingRexCom@reddit | LocalLLaMA | 22 comments
I get around 30 t/s with Qwen3-Coder-Next-UD-Q4_K_XL on an HP Z2 Mini G1a. Has anyone clustered two Z2s and can share a performance gain?
I am considering clustering specifically to improve token generation performance, not to use larger models.
Rich_Artist_8327@reddit
Clustering would only make sense with 100Gb+ networking and RDMA. Anything else will slow everything down.
ThingRexCom@reddit (OP)
I plan to use a Thunderbolt 4 cable.
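For context, on Linux a Thunderbolt/USB4 cable between two machines is exposed as a network interface by the thunderbolt-net driver. A hypothetical setup sketch; the interface name `thunderbolt0` and the addresses are assumptions, not details from this thread:

```shell
# Load the IP-over-Thunderbolt driver (often auto-loaded on cable attach)
sudo modprobe thunderbolt-net

# Assign a point-to-point address; use 192.168.100.2/24 on the other machine
sudo ip addr add 192.168.100.1/24 dev thunderbolt0
sudo ip link set thunderbolt0 up

# Verify the link to the peer
ping -c 1 192.168.100.2
```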
Rich_Artist_8327@reddit
USB 4 seems fine if you only look at bandwidth. But latency is just as important, and there USB 4 sucks. So you are going to lose performance like hell.
audioen@reddit
You probably should already be getting more, though what I just tested on Vulkan is not the XL version:
Note that I always enable flash attention on Qwen3.5 and later, as I think it enhances performance slightly at medium context, up to around 50,000 tokens.
ThingRexCom@reddit (OP)
I need full context as I use this setup mainly for agentic coding.
audioen@reddit
Yeah, that's not what I said. It helps with smaller contexts and doesn't harm larger ones.
grabber4321@reddit
Doesn't that fit into 128GB of RAM?
You'd probably get more out of it if you just bought a 5090 and attached it via USB4.
Can you ACTUALLY cluster them? I saw this: https://www.reddit.com/r/LocalLLaMA/comments/1mviuzq/cluster_of_two_amd_strix_halo_machines_hp_z2_mini/
ThingRexCom@reddit (OP)
Yes, the memory is not the main concern.
grabber4321@reddit
Sucks that it doesn't have a PCIe out... 40Gbit is a bit slow.
You could try buying a cheap external GPU enclosure and see whether adding a GPU you might have lying around helps the speed.
Look_0ver_There@reddit
If using layer split mode, the bandwidth doesn't need to be that high. The big issue is latency.
I have a pair of clustered Strix Halos using USB4NET. By putting the net driver into poll mode, I'm able to get the latency down to 22µs from the usual ~55µs. This is not as good as the ~1µs you get via RCCL over RDMA, but the USB4NET solution works with Vulkan, whereas RDMA more or less confines you to ROCm.
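A hypothetical sketch of the kind of polling knobs involved on Linux; the interface name `thunderbolt0`, the specific values, and whether thunderbolt-net honors NAPI busy polling are all assumptions, not details confirmed in the comment above:

```shell
# Busy-poll blocking socket calls instead of sleeping on interrupts (values in µs)
sudo sysctl -w net.core.busy_poll=50
sudo sysctl -w net.core.busy_read=50

# Per-interface: defer hard IRQs and let the kernel poll NAPI instead
echo 2     | sudo tee /sys/class/net/thunderbolt0/napi_defer_hard_irqs
echo 20000 | sudo tee /sys/class/net/thunderbolt0/gro_flush_timeout   # ns
```

Trading CPU spin for wakeup latency like this is the usual way to shave tens of microseconds off small-message round trips.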
ThingRexCom@reddit (OP)
Are you using a Thunderbolt 4 cable?
Look_0ver_There@reddit
Yes.
ThingRexCom@reddit (OP)
Could you share the inference performance of Qwen3-Coder-Next-UD-Q4_K_XL or Qwen3.6-35B-A3B-UD-Q4_K_XL on your cluster?
Look_0ver_There@reddit
I'm a little confused as to why you're running them at a Q4 quant. You've got 128GB of RAM; you can fit both of those models in memory at Q8_0 quantization. I wouldn't cluster for either of those models, though. They run fine on a single machine.
Also, to be clear, clustering is a net performance loss due to the extra network hop. I typically see a ~10% performance drop with llama.cpp if I shard a model that normally fits on one machine across two machines. I only use clustering for models that don't fit on one machine alone.
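For reference, sharding across machines with llama.cpp is typically done through its RPC backend. A minimal sketch, assuming llama.cpp was built with `-DGGML_RPC=ON`; the address, port, and model path are placeholders:

```shell
# On the second machine: expose its compute over the network
./rpc-server --host 0.0.0.0 --port 50052

# On the first machine: run inference with layers split across both hosts
./llama-server -m model.gguf --rpc 192.168.100.2:50052 -ngl 99
```

Every token's activations cross that link, which is why the per-hop latency discussed above dominates the cost.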
Here are the results for both of those models at Q8_0 quantization on a single machine. I don't have Q4 quants of these models lying around.
grabber4321@reddit
I think you should try Qwen3.6:35B - I feel like it's a better model than Qwen3-Coder-Next.
I've tried a couple of prompts, and Qwen3.6 comes up with better solutions. Plus it's FAST.
grabber4321@reddit
Here is what I have on 2x 5070 Tis (32GB VRAM total) + 48GB RAM.
It's not fast; it definitely needs more GPUs.
grabber4321@reddit
https://www.youtube.com/watch?v=1i_PfH05ekw
If you disassemble it, you can put a cable into the M.2 port and have direct PCIe access.
Again, you will need to confirm what type of M.2 slot it is - it could be bandwidth-limited, but I'd assume it has some speed to it and would be faster than Thunderbolt 4/USB4.
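As a rough sanity check on that assumption, the peak link rates can be compared; this assumes the M.2 slot is PCIe 4.0 x4 (common, but not confirmed for this machine):

```shell
# Compare raw peak bandwidth of USB4/TB4 vs a PCIe 4.0 x4 M.2 slot
awk 'BEGIN {
  usb4 = 40 / 8                    # USB4/TB4: 40 Gbit/s -> GB/s
  pcie = 16 * 128/130 * 4 / 8      # PCIe 4.0: 16 GT/s/lane, 128b/130b encoding, 4 lanes
  printf "USB4/TB4:    %.2f GB/s\n", usb4
  printf "PCIe 4.0 x4: %.2f GB/s\n", pcie
}'
# USB4/TB4:    5.00 GB/s
# PCIe 4.0 x4: 7.88 GB/s
```

So even in the best case the M.2 route is only ~1.6x the raw USB4 rate, though a direct PCIe link also avoids the protocol-tunneling overhead and latency of USB4.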
ImportancePitiful795@reddit
Have a look here - you'll need an RDMA setup (vLLM etc.), as u/Rich_Artist_8327 said.
Strix Halo Distributed Cluster (2x Strix Halo, RDMA RoCE v2) benchmarks by kyuz0 : r/LocalLLaMA
Also, you should be able to wire together different systems, so there's nothing stopping you from getting a much cheaper Bosgame M5 (etc.) instead of the Z2.
Hungry_Elk_3276@reddit
Please don't.
Before you spend all of your money, you can give dflash a try, especially if you are using llama.cpp.
Check here: https://github.com/z-lab/dflash
Using llama.cpp across two clustered machines will not give you any performance gain. If you really want to go down the rabbit hole of clustering, you will need vLLM with high-speed networking for RDMA, which I assume you don't have right now.
ImportancePitiful795@reddit
llama.cpp at this point has been left behind compared to vLLM when it comes to clustering and RDMA.
ThingRexCom@reddit (OP)
I use llama.cpp. How does dflash improve performance?
Hungry_Elk_3276@reddit
From a GitHub issue, there's a chance the token-generation performance could be 2x (on the same AMD platform).
Link here: https://github.com/z-lab/dflash/issues/40