[Benchmark] Dual RTX 5090 Distributed Inference via llama.cpp RPC - Running 122B MoE at 96 t/s over 2.5GbE
Posted by ReasonableDuty5319@reddit | LocalLLaMA | View on Reddit | 12 comments
| Model | Size | Single 5090 (t/s) | Dual 5090 RPC (t/s) | Note |
|---|---|---|---|---|
| Qwen3.5-27B (Q6_K) | 20.9 GB | 59.83 | 55.41 | -7% Overhead |
| Qwen3.5-35B MoE (Q6_K) | 26.8 GB | 206.76 | 150.99 | Interconnect Bottleneck |
| Qwen2.5-32B (Q6_K) | 25.0 GB | 54.69 | 51.47 | Stable Scaling |
| Qwen2.5-72B (Q4_K_M) | 40.9 GB | FAILED (OOM) | 32.74 | Now Playable! |
| Qwen3.5-122B MoE (IQ4_XS) | 56.1 GB | FAILED (OOM) | 96.29 | Beast Mode ON |
The Setup
I recently tested the distributed inference capabilities of llama.cpp RPC using two identical workstations. This setup allows pooling VRAM (64GB total) to run models that are physically impossible to fit on a single 32GB card.
- GPUs: 2x NVIDIA GeForce RTX 5090 (32GB VRAM each)
- Interconnect: 2.5GbE LAN
- OS: Ubuntu 24.04
- Software: llama.cpp (Build 8709 / Commit 85d482e6b)
- Method: `llama-bench` with `-ngl 99`, `-fa 1`, `-b 512`, `-p 2048`, `-n 256`
Key Findings
- Breaking the VRAM Barrier: The most significant result is the ability to run Qwen 2.5 72B and Qwen 3.5 122B. These models simply won't load on a single 32GB card at these quant levels. RPC effectively turns two machines into a 64GB unified AI workstation.
- MoE Performance is King: The Qwen 3.5 122B MoE is the star of the show, hitting 96.29 tokens/sec. Even with the network latency of a distributed setup, MoE's sparse activation makes it incredibly viable for real-time use.
- The 2.5GbE Bottleneck: For smaller, high-speed models like the 35B MoE, we see a 27% performance drop (206 -> 150 t/s) when moving to RPC. The 2.5GbE link is the bottleneck here. For the larger 72B/122B models, the computation time outweighs the transfer time, making the trade-off very worth it.
- Prompt Processing (PP): On a single 5090, Qwen 3.5 35B hits 6190 t/s in prefill. Over RPC, this drops to 2823 t/s. The raw prefill power of Blackwell is insane, but it's heavily throttled by network bandwidth in distributed mode.
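To get a feel for why decode barely stresses the link while prefill takes a big hit, here's a rough per-token traffic estimate. This is a sketch: the hidden size is a guessed placeholder (not the model's actual dimension), and it assumes fp16 activations crossing a single pipeline split at the theoretical 2.5 Gbit/s line rate.

```python
# Back-of-envelope for decode-time RPC traffic across one pipeline split.
# Assumptions (hypothetical, not measured): hidden size 5120, fp16
# activations, full theoretical 2.5 Gbit/s line rate.
hidden_size = 5120
bytes_per_token = hidden_size * 2            # fp16 = 2 bytes per value
link_bytes_per_s = 2.5e9 / 8                 # 2.5 Gbit/s -> 312.5 MB/s

wire_us = bytes_per_token / link_bytes_per_s * 1e6
traffic_mb_s = bytes_per_token * 96 / 1e6    # at ~96 t/s decode speed

print(f"{bytes_per_token} B/token, ~{wire_us:.0f} us on the wire")
print(f"~{traffic_mb_s:.2f} MB/s of activation traffic at 96 t/s")
```

At ~10 KB per token, decode traffic is around 1 MB/s, nowhere near saturating the link, which suggests per-call round-trip latency is the real decode cost. Prefill, by contrast, pushes whole 512-token batches of activations across the split at once, which is where the bandwidth ceiling actually bites.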
Benchmark Command
./llama-bench -m [model] -ngl 99 -fa 1 -p 2048 -n 256 -b 512 --rpc 192.168.X.X:50052
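The load phase is where the link really gets hammered, since the remote node's share of the weights has to cross the wire once. A rough estimate of that one-time cost, using assumed numbers (roughly half the weights on the remote host, ~250 MB/s effective throughput; neither figure is measured):

```python
# One-time cost of shipping weights to the remote node over the LAN.
# Assumptions (hypothetical): ~half of each model's weights live on the
# remote host, and the link sustains ~250 MB/s effective throughput.
effective_mb_s = 250
models = [("Qwen2.5-72B Q4_K_M", 40.9), ("Qwen3.5-122B IQ4_XS", 56.1)]
for name, size_gb in models:
    remote_gb = size_gb / 2
    seconds = remote_gb * 1000 / effective_mb_s  # GB -> MB, then / (MB/s)
    print(f"{name}: ~{seconds:.0f} s to transfer {remote_gb:.1f} GB")
```

So expect load times on the order of one to two minutes for the big models, after which the link mostly idles during decode.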
Conclusion
If you have two high-end GPUs in separate rigs, llama.cpp RPC is now mature enough to be a daily driver. It allows you to trade a bit of speed for the ability to run massive models that were previously reserved for professional H100/A100 clusters. Running a 122B model at nearly 100 t/s at home feels like the future.

ArtfulGenie69@reddit
I've got a couple of computers in the house, both with 2x3090. Using this RPC idea, I went from running Q4 at 27 t/s with a PP of 160 t/s on one machine to 55 t/s and PP of 700+ with the same Qwen3.5 122B loaded across both machines. Big upgrade.
nick_ziv@reddit
I am currently running 2 external 3090s on mining risers which supposedly have 1GB/s bandwidth each. I was wondering if Ethernet would work and it appears so. This would make distance less of an issue as when using GPU risers the cords have to be extremely short to avoid the GPUs disconnecting.
Fluffywings@reddit
I am curious how latency compares to bandwidth for tok/s generation.
Necessary-Summer-348@reddit
Network bandwidth is usually the bottleneck with RPC setups like this. Curious what the actual utilization looked like on that 2.5GbE link during inference - were you saturating it or is there headroom to add more nodes?
lemondrops9@reddit
Is it a bandwidth problem? Seems more like a latency issue. I've watched the PCIe bus in real time and it's often around 40-50 MB/s on each card.
Maybe more of an issue if a person had more GPUs in each PC?
Necessary-Summer-348@reddit
Fair point, latency on the inter-node hops would compound a lot faster than raw bandwidth saturation. PCIe utilization sitting low while throughput is still bottlenecked usually points to round-trip latency on the RPC calls. What were you seeing for actual token/s vs the theoretical max?
lemondrops9@reddit
None yet. I've been debating if it's worth it, as I have 6 GPUs in my main rig for 120GB of VRAM.
I could set up another 48GB with my other cards but I've been waiting for RPC to mature before even trying.
ReasonableDuty5319@reddit (OP)
Ideally, I’d love to run both RTX 5090s in a single machine, but I don't have the right hardware to support that setup just yet. For now, I'm sticking with RPC. My next step is to benchmark a 3-host, 4-GPU cluster: a dual-card AMD R9700 setup, plus two separate nodes each running an RTX 5090 via RPC.
ReasonableDuty5319@reddit (OP)
In my experience, the 2.5G bandwidth only spikes above 200MB/s during the initial model loading phase. Once it's up and running, the network load is actually quite low. I’m going to take a closer look at the PCIe exchange rates as well.
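A minimal way to watch that link during inference is to diff the byte counters in /proc/net/dev. This is a Linux-only sketch; "eth0" is a placeholder, substitute the 2.5GbE interface name from `ip link`.

```python
# Minimal link-utilization sampler for Linux (sketch; "eth0" is a
# placeholder -- use the actual 2.5GbE interface name from `ip link`).
import time

def parse_net_dev(text, iface):
    """Return the (rx_bytes, tx_bytes) counters for iface from /proc/net/dev text."""
    for line in text.splitlines():
        name, _, rest = line.partition(":")
        if name.strip() == iface:
            fields = rest.split()
            return int(fields[0]), int(fields[8])  # rx_bytes, tx_bytes
    raise ValueError(f"interface {iface!r} not found")

def sample_rate(iface="eth0", interval=1.0):
    """Average RX/TX bytes per second over `interval` seconds."""
    with open("/proc/net/dev") as f:
        rx0, tx0 = parse_net_dev(f.read(), iface)
    time.sleep(interval)
    with open("/proc/net/dev") as f:
        rx1, tx1 = parse_net_dev(f.read(), iface)
    return (rx1 - rx0) / interval, (tx1 - tx0) / interval

# Example: rx, tx = sample_rate("eth0"); print(rx / 1e6, tx / 1e6)  # MB/s
```

Running this in a loop while llama-bench generates tokens would show whether decode traffic ever gets anywhere near the ~312 MB/s ceiling.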
wizmyh34rt@reddit
thanks
3dom@reddit
With only 10B parameters active, shouldn't Qwen 122B fit on a single 5090 given enough system RAM?
3dom@reddit
A single M5 Max 128GB MacBook can run Qwen 122B at 50-65 t/s while costing roughly the same as, or less than, a single-5090 workstation with 128GB of RAM.
I wonder if the speed would double with two connected M5 Max MacBooks?