Anyone tried multi-machine LLM inference?
Posted by human-exe@reddit | LocalLLaMA | View on Reddit | 18 comments
I've stumbled upon exo-explore/exo, an LLM engine that supports multi-peer inference in a self-organized p2p network. I got it running on a single node in LXC, and generally things looked good.
That sounds quite tempting; I have a homelab server, a Windows gaming machine and a few extra nodes; that totals 200+ GB of RAM, tens of cores, and some GPU power as well.
There are a few things that spoil the idea:
- First, exo is alpha software; it runs from Python source, and I doubt I could run it organically on Windows or macOS.
- Second, I'm not sure exo's p2p architecture is as sound as it's described, or that it can run workloads well.
- Last but most importantly, I doubt there's any reason to run huge models only to get maybe 0.1 t/s output.
Am I missing much? Are there any reasons to run big (100+GB) LLMs at home at snail speeds? Is exo good? Is there anything like it, yet more developed and well tested? Did you try any of that, and would you advise me to try?
Vegetable-Score-3915@reddit
Exo seems to work well now. Seems like a lot has changed over the last 6/7 months
human-exe@reddit (OP)
UPD: Jeff Geerling got us covered:
Awwtifishal@reddit
Exo was suddenly abandoned. Your best bet is llama.cpp with RPC. I have tried it and it works fine. The network link should be as fast as possible (particularly in latency, not so much in bandwidth).
Ok_Mine189@reddit
There are some forks of exo that include additional model support, fixes, etc.
Ok_Mine189@reddit
Like this one, for example: water-vapor/exo on GitHub ("Run your own AI cluster at home with everyday devices").
hackyroot@reddit
I don't think Exo is still in active development. vLLM + Ray could work for your use case, though: https://docs.vllm.ai/en/stable/serving/parallelism_scaling.html (rough sketch below).
Imo, hosting big LLMs at home doesn't make much sense because of high latency, limited throughput, and potential GPU contention. Unless what you're doing is very sensitive in nature, it's not worth it.
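For reference, the multi-node setup from those vLLM docs looks roughly like this, assuming two nodes with a few GPUs each (the model name, IP and parallel sizes are placeholders, not tested values):
Head node:
ray start --head --port=6379
Each worker node:
ray start --address=192.168.5.1:6379
Head node, once the Ray cluster is up:
vllm serve meta-llama/Llama-3.1-70B-Instruct --tensor-parallel-size 4 --pipeline-parallel-size 2
vLLM then shards the model across all GPUs visible to the Ray cluster; the docs suggest tensor parallelism within a node and pipeline parallelism across nodes.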
ACG-Gaming@reddit
May not be as good as the others, but I'm pretty sure GPUStack does all this. It's been robust and pretty easy to set up various things with it so far.
The_Soul_Collect0r@reddit
Try: llama.cpp rpc-server
I have been using it for a year now; it works, it's stable, from a use-it-at-home (not production) perspective.
One thing to keep in mind: the inference speed bottleneck will most likely be the speed of the server/node connection to your home network. If possible, connect the server and the nodes using physical (wired) connections, avoiding WiFi. Use the model cache on the nodes.
Node 1.:
rpc-server.exe --device CPU --host 192.168.5.6 --cache
Node 2.:
rpc-server.exe --device CUDA0 --host 192.168.5.7 --cache
Server:
llama-server.exe -m TheDrummer.gguf --host 127.0.0.1 --rpc 192.168.5.6:50052,192.168.5.7:50052 --no-mmap
kaxapi@reddit
SGLang can do it, according to this GitHub issue: https://github.com/sgl-project/sglang/issues/2794
Make sure you have InfiniBand properly configured though, otherwise it will be terribly slow.
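If I recall the SGLang docs correctly, a multi-node launch looks something like this (model name, address and --tp size are placeholders; --tp is the total GPU count across both nodes):
Node 0:
python -m sglang.launch_server --model-path meta-llama/Llama-3.1-70B-Instruct --tp 8 --nnodes 2 --node-rank 0 --dist-init-addr 192.168.5.1:5000
Node 1:
python -m sglang.launch_server --model-path meta-llama/Llama-3.1-70B-Instruct --tp 8 --nnodes 2 --node-rank 1 --dist-init-addr 192.168.5.1:5000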
fallingdowndizzyvr@reddit
Yes, I've been doing it for over a year. It's super easy. Just use llama.cpp.
minnsoup@reddit
Don't know about Windows, but I have been using vLLM on our HPC for months with success. Easy to do, and once the Ray cluster is started you just have to do things on a single node and it handles the orchestration.
lolzinventor@reddit
This worked for me also. Two nodes of 4x3090 allowed Llama 3 70B to run at FP16. I subsequently merged all the GPUs into a single chassis, so it's no longer needed.
woadwarrior@reddit
Take a look at GPUStack.
RP_Finley@reddit
Ray with vLLM should work. https://github.com/asprenger/ray_vllm_inference
I've made a video on how to do this on RunPod clusters, which is multi-machine LLM inference. But the process is pretty platform-agnostic and not specific to us, so you could easily set this up on multiple local machines the same way.
https://www.youtube.com/watch?v=k_5rwWyxo5s
zipzag@reddit
It should not sound tempting. Even when it was all GPU based, Exo was slow. Your setup will likely not even run.
Buy a 16 GB video card. Play with AI and also have a great card for gaming. AI is the land of "your expensive CPU just doesn't matter".
kryptkpr@reddit
llama.cpp RPC works, but prompt processing is abysmally slow.
eelectriceel33@reddit
Found this a while ago:
https://github.com/b4rtaz/distributed-llama
Still haven't gotten around to trying it, though. It seems like a much more manual process as of yet.
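From what I remember of its README, the manual part is starting a worker binary on every node and then pointing the root node at them, roughly like this (model/tokenizer file names and IPs are placeholders; check the repo for the current flags):
Worker nodes:
./dllama worker --port 9998 --nthreads 4
Root node:
./dllama inference --model dllama_model_llama3_8b_q40.m --tokenizer dllama_tokenizer_llama3.t --prompt "Hello" --nthreads 4 --workers 192.168.5.6:9998 192.168.5.7:9998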
human-exe@reddit (OP)
(just noticed that «big LLM» is a tautology, «big large language model»; but you get the idea)