[Benchmark] Dual RTX 5090 Distributed Inference via llama.cpp RPC - Running 122B MoE at 96 t/s over 2.5GbE

Posted by ReasonableDuty5319@reddit | LocalLLaMA | View on Reddit | 12 comments

Model Size Single 5090 (t/s) Dual 5090 RPC (t/s) Note
Qwen3.5-27B (Q6_K) 20.9 GB 59.83 55.41 -7% Overhead
Qwen3.5-35B MoE (Q6_K) 26.8 GB 206.76 150.99 Interconnect Bottleneck
Qwen2.5-32B (Q6_K) 25.0 GB 54.69 51.47 Stable Scaling
Qwen2.5-72B (Q4_K_M) 40.9 GB FAILED (OOM) 32.74 Now Playable!
Qwen3.5-122B MoE (IQ4_XS) 56.1 GB FAILED (OOM) 96.29 Beast Mode ON

The Setup

I recently tested the distributed inference capabilities of llama.cpp RPC using two identical workstations. This setup allows pooling VRAM (64GB total) to run models that are physically impossible to fit on a single 32GB card.

Benchmark Command
./llama-bench -m [model] -ngl 99 -fa 1 -p 2048 -n 256 -b 512 --rpc 192.168.X.X:50052

Conclusion

If you have two high-end GPUs in separate rigs, llama.cpp RPC is now mature enough to be a daily driver. It allows you to trade a bit of speed for the ability to run massive models that were previously reserved for professional H100/A100 clusters. Running a 122B model at nearly 100 t/s at home feels like the future.