Does it make sense to cluster HP Z2 Mini G1a to increase performance?
Posted by ThingRexCom@reddit | LocalLLaMA | 22 comments
I get around 30 t/s with Qwen3-Coder-Next-UD-Q4_K_XL on an HP Z2 Mini G1a. Has anyone clustered two Z2s and can share a performance gain?
I am considering clustering specifically to improve token generation performance, not to use larger models.
Rich_Artist_8327@reddit
Clustering would only make sense with 100Gb+ networking and RDMA. Anything else will slow everything down.
ThingRexCom@reddit (OP)
I plan to use a Thunderbolt 4 cable.
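For context, on Linux a Thunderbolt/USB4 cable between two machines is exposed as a network interface by the thunderbolt-net driver. A hypothetical setup sketch; the interface name `thunderbolt0` and the addresses are assumptions, not details from this thread:

```shell
# Load the IP-over-Thunderbolt driver (often auto-loaded on cable attach)
sudo modprobe thunderbolt-net

# Assign a point-to-point address; use 192.168.100.2/24 on the other machine
sudo ip addr add 192.168.100.1/24 dev thunderbolt0
sudo ip link set thunderbolt0 up

# Verify the link to the peer
ping -c 1 192.168.100.2
```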
Rich_Artist_8327@reddit
USB 4 seems fine if you only look at bandwidth. But latency is just as important, and there USB 4 sucks. So you are going to lose performance like hell.
audioen@reddit
You probably should already be getting more, though what I just tested on Vulkan is not the XL version:
Note that I always enable flash attention on Qwen3.5 and later, as I think it enhances performance slightly at medium context, up to around 50,000 tokens.
ThingRexCom@reddit (OP)
I need full context as I use this setup mainly for agentic coding.
audioen@reddit
Yeah, that's not what I said. It helps with smaller contexts and doesn't harm larger ones.
grabber4321@reddit
Doesn't that fit into 128GB of RAM?
You'd probably get more out of it if you just bought a 5090 and attached it via USB4.
Can you ACTUALLY cluster them? I saw this: https://www.reddit.com/r/LocalLLaMA/comments/1mviuzq/cluster_of_two_amd_strix_halo_machines_hp_z2_mini/
ThingRexCom@reddit (OP)
Yes, the memory is not the main concern.
grabber4321@reddit
Sucks that it doesn't have a PCIe out... 40Gbit is a bit slow.
You could try buying a cheap external GPU enclosure and see whether adding a GPU you might have lying around helps the speed.
Look_0ver_There@reddit
If using layer split mode, the bandwidth doesn't need to be that high. The big issue is latency.
I have a pair of clustered Strix Halos using USB4NET. By putting the net driver into poll mode, I'm able to get the latency down to 22µs from the usual ~55µs. This is not as good as the ~1µs you get via RCCL over RDMA, but the USB4NET solution works with Vulkan, whereas RDMA more or less confines you to ROCm.
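A hypothetical sketch of the kind of polling knobs involved on Linux; the interface name `thunderbolt0`, the specific values, and whether thunderbolt-net honors NAPI busy polling are all assumptions, not details confirmed in the comment above:

```shell
# Busy-poll blocking socket calls instead of sleeping on interrupts (values in µs)
sudo sysctl -w net.core.busy_poll=50
sudo sysctl -w net.core.busy_read=50

# Per-interface: defer hard IRQs and let the kernel poll NAPI instead
echo 2     | sudo tee /sys/class/net/thunderbolt0/napi_defer_hard_irqs
echo 20000 | sudo tee /sys/class/net/thunderbolt0/gro_flush_timeout   # ns
```

Trading CPU spin for wakeup latency like this is the usual way to shave tens of microseconds off small-message round trips.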
ThingRexCom@reddit (OP)
Are you using a Thunderbolt 4 cable?
Look_0ver_There@reddit
Yes.
ThingRexCom@reddit (OP)
Could you share the inference performance of Qwen3-Coder-Next-UD-Q4_K_XL or Qwen3.6-35B-A3B-UD-Q4_K_XL on your cluster?
Look_0ver_There@reddit
I'm a little confused as to why you're running them at a Q4 quant. You've got 128GB of RAM; you can fit both of those models in memory at Q8_0 quantization. I wouldn't cluster for either of those models, though. They run fine on a single machine.
Also, to be clear, clustering is a net performance loss due to the extra network hop. I typically see a ~10% performance drop with llama.cpp if I shard a model that normally fits on one machine across two machines. I only use clustering for models that don't fit on one machine alone.
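For reference, sharding across machines with llama.cpp is typically done through its RPC backend. A minimal sketch, assuming llama.cpp was built with `-DGGML_RPC=ON`; the address, port, and model path are placeholders:

```shell
# On the second machine: expose its compute over the network
./rpc-server --host 0.0.0.0 --port 50052

# On the first machine: run inference with layers split across both hosts
./llama-server -m model.gguf --rpc 192.168.100.2:50052 -ngl 99
```

Every token's activations cross that link, which is why the per-hop latency discussed above dominates the cost.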
Here are the results for both of those models at Q8_0 quantization on a single machine. I don't have Q4 quants of these models lying around.
grabber4321@reddit
I think you should try Qwen3.6:35B - I feel like it's a better model than Qwen3-Coder-Next.
I've tried a couple of prompts, and Qwen3.6 comes up with better solutions. Plus it's FAST.
grabber4321@reddit
Here is what I have on 2x 5070 Tis (32GB VRAM total) + 48GB RAM.
It's not fast; it definitely needs more GPUs.
grabber4321@reddit
https://www.youtube.com/watch?v=1i_PfH05ekw
If you disassemble it, you can put a cable into the M.2 port and have direct PCIe access.
Again, you will need to confirm what type of M.2 slot it is - it could be bandwidth-limited, but I'd assume it has some speed to it and would be faster than Thunderbolt 4/USB4.
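As a rough sanity check on that assumption, the peak link rates can be compared; this assumes the M.2 slot is PCIe 4.0 x4 (common, but not confirmed for this machine):

```shell
# Compare raw peak bandwidth of USB4/TB4 vs a PCIe 4.0 x4 M.2 slot
awk 'BEGIN {
  usb4 = 40 / 8                    # USB4/TB4: 40 Gbit/s -> GB/s
  pcie = 16 * 128/130 * 4 / 8      # PCIe 4.0: 16 GT/s/lane, 128b/130b encoding, 4 lanes
  printf "USB4/TB4:    %.2f GB/s\n", usb4
  printf "PCIe 4.0 x4: %.2f GB/s\n", pcie
}'
# USB4/TB4:    5.00 GB/s
# PCIe 4.0 x4: 7.88 GB/s
```

So even in the best case the M.2 route is only ~1.6x the raw USB4 rate, though a direct PCIe link also avoids the protocol-tunneling overhead and latency of USB4.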
ImportancePitiful795@reddit
Have a look here - you'll need an RDMA setup (vLLM etc.), as u/Rich_Artist_8327 said.
Strix Halo Distributed Cluster (2x Strix Halo, RDMA RoCE v2) benchmarks by kyuz0 : r/LocalLLaMA
Also, you should be able to wire together different systems, so there's nothing stopping you from getting a much cheaper Bosgame M5 (etc.) instead of the Z2.
Hungry_Elk_3276@reddit
Please don't.
Before you spend all of your money, you can give dflash a try, especially if you are using llama.cpp.
Check here: https://github.com/z-lab/dflash
Using llama.cpp across two clustered machines will not give you any performance gain. If you really want to go down the rabbit hole of clustering, you will need vLLM with high-speed networking for RDMA, which I assume you don't have right now.
ImportancePitiful795@reddit
llama.cpp at this point has been left behind compared to vLLM when it comes to clustering and RDMA.
ThingRexCom@reddit (OP)
I use llama.cpp. How does dflash improve performance?
Hungry_Elk_3276@reddit
From a GitHub issue, there's a chance the token-generation performance could be 2x (on the same AMD platform).
Link here: https://github.com/z-lab/dflash/issues/40