[Benchmark] Dual RTX 5090 Distributed Inference via llama.cpp RPC - Running 122B MoE at 96 t/s over 2.5GbE
Posted by ReasonableDuty5319@reddit | LocalLLaMA | View on Reddit | 12 comments
| Model | Size | Single 5090 (t/s) | Dual 5090 RPC (t/s) | Note |
|---|---|---|---|---|
| Qwen3.5-27B (Q6_K) | 20.9 GB | 59.83 | 55.41 | -7% Overhead |
| Qwen3.5-35B MoE (Q6_K) | 26.8 GB | 206.76 | 150.99 | Interconnect Bottleneck |
| Qwen2.5-32B (Q6_K) | 25.0 GB | 54.69 | 51.47 | Stable Scaling |
| Qwen2.5-72B (Q4_K_M) | 40.9 GB | FAILED (OOM) | 32.74 | Now Playable! |
| Qwen3.5-122B MoE (IQ4_XS) | 56.1 GB | FAILED (OOM) | 96.29 | Beast Mode ON |
The Setup
I recently tested the distributed inference capabilities of llama.cpp RPC using two identical workstations. This setup allows pooling VRAM (64GB total) to run models that are physically impossible to fit on a single 32GB card.
- GPUs: 2x NVIDIA GeForce RTX 5090 (32GB VRAM each)
- Interconnect: 2.5GbE LAN
- OS: Ubuntu 24.04
- Software: llama.cpp (Build 8709 / Commit 85d482e6b)
- Method: `llama-bench` with `-ngl 99`, `-fa 1`, `-b 512`, `-p 2048`, `-n 256`
Key Findings
- Breaking the VRAM Barrier: The most significant result is the ability to run Qwen 2.5 72B and Qwen 3.5 122B. These models simply won't load on a single 32GB card at these quant levels. RPC effectively turns two machines into a 64GB unified AI workstation.
- MoE Performance is King: The Qwen 3.5 122B MoE is the star of the show, hitting 96.29 tokens/sec. Even with the network latency of a distributed setup, MoE's sparse activation makes it incredibly viable for real-time use.
- The 2.5GbE Bottleneck: For smaller, high-speed models like the 35B MoE, we see a 27% performance drop (206 -> 150 t/s) when moving to RPC. The 2.5GbE link is the bottleneck here. For the larger 72B/122B models, the computation time outweighs the transfer time, making the trade-off very worth it.
- Prompt Processing (PP): On a single 5090, Qwen 3.5 35B hits 6190 t/s in prefill. Over RPC, this drops to 2823 t/s. The raw prefill power of Blackwell is insane, but it's heavily throttled by network bandwidth in distributed mode.
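To get a feel for why decode barely stresses the link while prefill takes a big hit, here's a rough per-token traffic estimate. This is a sketch: the hidden size is a guessed placeholder (not the model's actual dimension), and it assumes fp16 activations crossing a single pipeline split at the theoretical 2.5 Gbit/s line rate.

```python
# Back-of-envelope for decode-time RPC traffic across one pipeline split.
# Assumptions (hypothetical, not measured): hidden size 5120, fp16
# activations, full theoretical 2.5 Gbit/s line rate.
hidden_size = 5120
bytes_per_token = hidden_size * 2            # fp16 = 2 bytes per value
link_bytes_per_s = 2.5e9 / 8                 # 2.5 Gbit/s -> 312.5 MB/s

wire_us = bytes_per_token / link_bytes_per_s * 1e6
traffic_mb_s = bytes_per_token * 96 / 1e6    # at ~96 t/s decode speed

print(f"{bytes_per_token} B/token, ~{wire_us:.0f} us on the wire")
print(f"~{traffic_mb_s:.2f} MB/s of activation traffic at 96 t/s")
```

At ~10 KB per token, decode traffic is around 1 MB/s, nowhere near saturating the link, which suggests per-call round-trip latency is the real decode cost. Prefill, by contrast, pushes whole 512-token batches of activations across the split at once, which is where the bandwidth ceiling actually bites.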
Benchmark Command
./llama-bench -m [model] -ngl 99 -fa 1 -p 2048 -n 256 -b 512 --rpc 192.168.X.X:50052
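The load phase is where the link really gets hammered, since the remote node's share of the weights has to cross the wire once. A rough estimate of that one-time cost, using assumed numbers (roughly half the weights on the remote host, ~250 MB/s effective throughput; neither figure is measured):

```python
# One-time cost of shipping weights to the remote node over the LAN.
# Assumptions (hypothetical): ~half of each model's weights live on the
# remote host, and the link sustains ~250 MB/s effective throughput.
effective_mb_s = 250
models = [("Qwen2.5-72B Q4_K_M", 40.9), ("Qwen3.5-122B IQ4_XS", 56.1)]
for name, size_gb in models:
    remote_gb = size_gb / 2
    seconds = remote_gb * 1000 / effective_mb_s  # GB -> MB, then / (MB/s)
    print(f"{name}: ~{seconds:.0f} s to transfer {remote_gb:.1f} GB")
```

So expect load times on the order of one to two minutes for the big models, after which the link mostly idles during decode.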
Conclusion
If you have two high-end GPUs in separate rigs, llama.cpp RPC is now mature enough to be a daily driver. It allows you to trade a bit of speed for the ability to run massive models that were previously reserved for professional H100/A100 clusters. Running a 122B model at nearly 100 t/s at home feels like the future.

ArtfulGenie69@reddit
I've got a couple of computers in the house, both with 2x3090. Using this RPC idea, I went from running Q4 at 27 t/s with a PP of 160 t/s on one machine to 55 t/s and PP of 700+ with the same Qwen3.5 122B loaded across both machines. Big upgrade.
nick_ziv@reddit
I am currently running 2 external 3090s on mining risers which supposedly have 1GB/s bandwidth each. I was wondering if Ethernet would work and it appears so. This would make distance less of an issue as when using GPU risers the cords have to be extremely short to avoid the GPUs disconnecting.
Fluffywings@reddit
I am curious how latency compares to bandwidth for tok/s generation.
Necessary-Summer-348@reddit
Network bandwidth is usually the bottleneck with RPC setups like this. Curious what the actual utilization looked like on that 2.5GbE link during inference - were you saturating it or is there headroom to add more nodes?
lemondrops9@reddit
Is it a bandwidth problem? Seems more like a latency issue. I've watched the PCIe bus in real time and it's often around 40-50 MB/s on each card.
Maybe more of an issue if a person had more GPUs in each PC?
Necessary-Summer-348@reddit
Fair point, latency on the inter-node hops would compound a lot faster than raw bandwidth saturation. PCIe utilization sitting low while throughput is still bottlenecked usually points to round-trip latency on the RPC calls. What were you seeing for actual token/s vs the theoretical max?
lemondrops9@reddit
None yet. I've been debating if it's worth it, as I have 6 GPUs in my main rig for 120GB of VRAM.
I could set up another 48GB with my other cards but I've been waiting for RPC to mature before even trying.
ReasonableDuty5319@reddit (OP)
Ideally, I’d love to run both RTX 5090s in a single machine, but I don't have the right hardware to support that setup just yet. For now, I'm sticking with RPC. My next step is to benchmark a 3-host, 4-GPU cluster: a dual-card AMD R9700 setup, plus two separate nodes each running an RTX 5090 via RPC.
ReasonableDuty5319@reddit (OP)
In my experience, the 2.5G bandwidth only spikes above 200MB/s during the initial model loading phase. Once it's up and running, the network load is actually quite low. I’m going to take a closer look at the PCIe exchange rates as well.
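A minimal way to watch that link during inference is to diff the byte counters in /proc/net/dev. This is a Linux-only sketch; "eth0" is a placeholder, substitute the 2.5GbE interface name from `ip link`.

```python
# Minimal link-utilization sampler for Linux (sketch; "eth0" is a
# placeholder -- use the actual 2.5GbE interface name from `ip link`).
import time

def parse_net_dev(text, iface):
    """Return the (rx_bytes, tx_bytes) counters for iface from /proc/net/dev text."""
    for line in text.splitlines():
        name, _, rest = line.partition(":")
        if name.strip() == iface:
            fields = rest.split()
            return int(fields[0]), int(fields[8])  # rx_bytes, tx_bytes
    raise ValueError(f"interface {iface!r} not found")

def sample_rate(iface="eth0", interval=1.0):
    """Average RX/TX bytes per second over `interval` seconds."""
    with open("/proc/net/dev") as f:
        rx0, tx0 = parse_net_dev(f.read(), iface)
    time.sleep(interval)
    with open("/proc/net/dev") as f:
        rx1, tx1 = parse_net_dev(f.read(), iface)
    return (rx1 - rx0) / interval, (tx1 - tx0) / interval

# Example: rx, tx = sample_rate("eth0"); print(rx / 1e6, tx / 1e6)  # MB/s
```

Running this in a loop while llama-bench generates tokens would show whether decode traffic ever gets anywhere near the ~312 MB/s ceiling.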
wizmyh34rt@reddit
thanks
3dom@reddit
With only 10B parameters active, shouldn't Qwen 122B fit on a single 5090 given enough system RAM?
3dom@reddit
A single M5 Max 128GB MacBook can run Qwen 122B at 50-65 t/s while costing roughly the same as, or less than, a single-5090 workstation with 128GB of RAM.
I wonder if the speed would double with two connected M5 Max MacBooks?