Use llama.cpp to run a model with the combined power of a networked cluster of GPUs.
Posted by farkinga@reddit | LocalLLaMA | 14 comments
llama.cpp can be compiled with RPC support so that a model can be split across networked computers. Run even bigger models than before with a modest performance impact.
Specify GGML_RPC=ON when building llama.cpp so that rpc-server will be compiled.
cmake -B build -DGGML_RPC=ON
cmake --build build --config Release
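If a node also has a discrete GPU, you will probably want its backend compiled into the same build. For example, on a CUDA machine (GGML_CUDA is the standard llama.cpp CMake option; Metal is enabled by default on macOS, so adjust per node):
cmake -B build -DGGML_RPC=ON -DGGML_CUDA=ON
cmake --build build --config Release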
Launch rpc-server on each node:
build/bin/rpc-server --host 0.0.0.0
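The default port is 50052; you can pin it explicitly if you prefer (check rpc-server --help on your build for the exact flags). Note that binding to 0.0.0.0 exposes the server to your whole network, so keep it on a trusted LAN.
build/bin/rpc-server --host 0.0.0.0 --port 50052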
Finally, orchestrate the nodes with llama-server:
build/bin/llama-server --model YOUR_MODEL --gpu-layers 99 --rpc node01:50052,node02:50052,node03:50052
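For a quick smoke test without the server, the same --rpc flag also works with llama-cli (the model path and prompt here are placeholders):
build/bin/llama-cli --model YOUR_MODEL --gpu-layers 99 --rpc node01:50052,node02:50052,node03:50052 -p "Hello" -n 64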
I'm still exploring this so I am curious to hear how well it works for others.
Klutzy-Snow8016@reddit
Are there any known performance issues? I tried using RPC for DeepSeek R1, but it was slower than just running it on one machine, even though the model doesn't fit in RAM.
farkinga@reddit (OP)
I would not describe it as performance issues; it's more a matter of performance expectations.
Think of it this way: we like VRAM because it's fast once you load a model into it; this is measured in 100s of GB/s. We don't love RAM because it's so much slower than VRAM - but we still measure it in GB/s.
When it comes to networking - even 1000Mb, 2Gb, and so on - that's slow, slow, slow. Network speeds are quoted in bits, not bytes: 10Gb networking is barely 1 GB/s in theory, and almost never reaches that in practice. RAM sits right next to the CPU and VRAM is just across the PCIe bus; a network-attached device will always be slower than either.
My point is: the network is the bottleneck with the RPC strategy I described. And when I say it's not performance "issues" I simply mean that this is always going to be slower than if you have the VRAM in a single node.
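Some ballpark numbers to make that concrete (typical order-of-magnitude figures, not measurements from my setup):
1Gb Ethernet: ~0.125 GB/s theoretical peak
10Gb Ethernet: ~1.25 GB/s theoretical peak
dual-channel system RAM: roughly 50-90 GB/s
GDDR6/HBM VRAM: several hundred GB/s to 1+ TB/s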
Now, having said all that, I do believe MoE architectures could be fitted to a specific network and GPU topology. ...but that's getting technical.
There probably are no "issues" to work out; this is already about as fast as it will ever get. The advantage is that if you use this the right way, you can run models much larger than before; you are no longer limited to a single computer.
Calcidiol@reddit
I got it working some time ago, though it was very rough in terms of feature/function support, the user experience of configuring/controlling/using it, etc.
It seemed like an "it's better than nothing if you can get an advantage out of it where you can't run the model well, or at all, otherwise" kind of thing.
People have said better things about using it on a single system with multiple heterogeneous GPUs, since at least in that case the communication can happen at PCIe or local-VLAN speeds between instances, instead of being limited by the latency and throughput of 1GbE between multiple distinct hosts.
I've been meaning to try it with models in the roughly 50-170 GB range and see how much it helps, depending on the context size in use and the different nodes' actual performance and capabilities.
farkinga@reddit (OP)
That's where I'm at: I can't run 72b models on a single node but if I combine 3 GPUs, it actually works (even if it's slower).
By the way: this is a combination of Metal and CUDA acceleration. I wasn't even sure it would work - but the fact it's working is amazing to me.
Calcidiol@reddit
Hmm, IIRC there was some comment about RPC and Metal in the release changelog a while back; maybe they added support or fixed something having to do with heterogeneous platforms and backends. Or maybe I'm misremembering. Anyway, it's good that it works.
I'll try more homogeneous test platforms but with heterogeneous GPU architectures, and also add in CPU+RAM-based inference with partial distributed multi-GPU offload, to see where that gets me. It'd be nice to see how Qwen3-235B MoE works with enough distributed RAM to fit a decent quant of the model and modest VRAM per GPU to help out a bit. I'll have to figure out how to most intelligently use the partial VRAM mixed in with the much larger RAM+CPU regions to get the most benefit. Keeping a draft model fully in VRAM should be one thing that could help, anyway.
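Something like this is what I have in mind for the draft-model part (a sketch only; -md/--model-draft and --gpu-layers-draft come from llama-server's speculative-decoding support, I haven't verified how they interact with RPC, and the model paths are placeholders):
build/bin/llama-server --model BIG_MOE_MODEL.gguf --gpu-layers 20 --rpc node01:50052,node02:50052 --model-draft SMALL_DRAFT.gguf --gpu-layers-draft 99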
farkinga@reddit (OP)
Yes, I've been thinking about MoE optimization with RPC, split-mode, override-tensors, and the number of active experts. If each expert fits inside its own node, it should dramatically reduce network overhead. If relatively little data has to actually feed forward between experts, inference performance could get closer to PCIe speed.
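As a rough sketch of the kind of placement control I mean (--split-mode and --override-tensor are real llama.cpp flags, and the expert-FFN-to-CPU pattern is the commonly used form; whether individual RPC devices can be named as --override-tensor targets is something I haven't verified):
build/bin/llama-server --model YOUR_MOE_MODEL --gpu-layers 99 --rpc node01:50052,node02:50052 --split-mode layer --override-tensor "ffn_.*_exps=CPU"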
fallingdowndizzyvr@reddit
I use it all the time. It works fine. I don't know exactly what you mean by it being rough in those terms. Other than the rpc flag as a command-line arg, it works like any other GPU.
fallingdowndizzyvr@reddit
I posted about this about a year ago, and plenty of other times too. You can check that thread if you want to read more.
By the way, it's on by default in the pre-compiled binaries.
DoctorDirtnasty@reddit
reminds me of exo labs
https://github.com/exo-explore/exo
celsowm@reddit
llama.cpp uses a unified KV cache, so if you have two or more concurrent users/prompts the results are not good. Try vLLM or SGLang.
farkinga@reddit (OP)
I'm not running this in a multi-user environment - but if I ever do, I'll keep your advice in mind.
You_Wen_AzzHu@reddit
vLLM has a Ray cluster feature that lets you serve one model across multiple nodes.
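Roughly, following vLLM's documented multi-node pattern (the addresses, port, and parallel sizes are placeholders to adapt):
ray start --head --port=6379                 # on the head node
ray start --address=HEAD_IP:6379             # on each worker node
vllm serve YOUR_MODEL --tensor-parallel-size 4 --pipeline-parallel-size 2    # then launch on the head node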
farkinga@reddit (OP)
Using llama.cpp, I'm able to combine a Metal-accelerated node with 2 CUDA nodes and llama-server treats it as a unified object, despite the heterogeneous architectures. Pretty neat.
beedunc@reddit
Excellent. Will try it out.