Anyone with experience combining an Nvidia system & a Mac over llama-rpc?
Posted by segmond@reddit | LocalLLaMA | View on Reddit | 6 comments
I'm sick of building Nvidia rigs that are useless with these models. I could manage fine with Command R & Mistral Large, but Llama 405B, DeepSeek v2.5, R1, v3, etc. are all out of reach. So I'm thinking of getting an Apple machine next and throwing it on the network. Apple isn't cheap either, and I'm broke from my Nvidia adventures... so a 128GB model would probably be fine. If you have practical experience, please share.
fallingdowndizzyvr@reddit
My little cluster is AMD, Intel, Nvidia and Mac. It's simple to do with RPC using llama.cpp. There is a performance penalty for going multi-GPU that has nothing to do with networking: even if you run multi-GPU with RPC on the same machine, the penalty is still there, no networking required.
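For anyone who hasn't tried it, the basic llama.cpp RPC flow looks roughly like this. A minimal sketch: the `-DGGML_RPC=ON` build flag and the `rpc-server` / `--rpc` options come from the llama.cpp RPC example, but the IPs, port, and model path below are placeholders, and exact flag spellings can change between versions (check `--help`).

```bash
# On the remote box (e.g. the Mac): build llama.cpp with the RPC backend.
# Metal is enabled by default on macOS; on an Nvidia box add -DGGML_CUDA=ON.
cmake -B build -DGGML_RPC=ON
cmake --build build --config Release

# Expose that machine's GPU/RAM to the network (port is arbitrary).
./build/bin/rpc-server --host 0.0.0.0 --port 50052

# On the main machine: point llama.cpp at the remote worker.
./build/bin/llama-cli -m ./model.gguf -ngl 99 \
  --rpc 192.168.1.10:50052 -p "Hello"
```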
segmond@reddit (OP)
Yeah, I know there's a performance penalty for each rpc-server. With the Mac it would only be one server, which shouldn't be too bad. Does flash attention work on the Mac, given that it's not CUDA? How much total VRAM do you have across your cluster with that combo?
fallingdowndizzyvr@reddit
Yes. Flash attention works on Mac.
108GB currently. I have some other GPUs that aren't currently being used that I could use to spin up another machine.
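For reference, flash attention in llama.cpp is opt-in via a flag and works on the Metal backend too; a minimal sketch (flag spelling per current builds, model path and address are placeholders):

```bash
# -fa / --flash-attn turns flash attention on; combine it with --rpc as usual.
./build/bin/llama-cli -m ./model.gguf -ngl 99 -fa \
  --rpc 192.168.1.10:50052 -p "Hello"
```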
segmond@reddit (OP)
Nice, looks like I'm going to be adding a Mac to the party. I already have enough Nvidia GPUs. I figure a Mac or an iGPU PC is the best next step, and I'd only consider additional GPUs if performance is terrible. Thanks again!
Marksta@reddit
In my experience, GPUStack is the go-to project for this, if not just using llama-rpc directly. There's a competitor that does kind of the same thing; I forgot their name, but functionality-wise they looked same-ish and I already had GPUStack running. It has a small wrapper project called llama-box that uses llama-rpc and such mostly 1-for-1, but they added some other stuff to expand outside LLMs; I never messed with that though. It's really good at managing the behind-the-scenes of running llama.cpp for you: setting up servers and workers, swapping models, and downloading models from HF, ollama repos, or local files. Speculative decoding is unfortunately broken in the current release (I saw issue comments saying it'll be fixed soon-ish), so you can't set up draft model params yet. But otherwise it's a fast route to getting rpc servers going on your machines. It's written in Python, so it should run with the same compatibility as llama.cpp; Windows to Mac, Nvidia to AMD, etc. should mesh together, I believe?
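For anyone curious, the GPUStack setup is roughly the following, from memory of their quickstart; treat it as a sketch, the server URL and token are placeholders, and the exact commands and flags may have changed, so check the GPUStack docs.

```bash
# On the machine acting as the controller (assumes Python + pip are available).
pip install gpustack
gpustack start

# On each other machine (Mac, Nvidia box, ...), join it as a worker;
# the token is generated by the server on first start.
gpustack start --server-url http://192.168.1.5 --token <your-token>
```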
The downside is that it definitely slows things down some. It's not bad for low-context, quick-question usage, but with huge context on 32B models (a ~50GB VRAM run) I was probably maxing out my 1Gbps local network and tokens per second started to drop. There are downsides that still make you want to just put your cards in the same machine if possible. But it's probably as good as it gets right now for networking machines' resources together, and it's many times faster to use GPUs over the network than to just let the model spill into one local machine's slow normal RAM.
Also, don't mess with Exo. When I tried Exo last month, the code was 100% broken. It doesn't run; there are syntax errors pushed straight to main and no release tags on the git. The README is false: it absolutely does not run on anything, and if it did it might only be Apple desktops? They list compatibility with everything including phones, but that's not at all the case. The GitHub issues are just full of people like me who wasted their time trying to make it work before the buried lede comes out that it just doesn't.
segmond@reddit (OP)
I use llama.cpp directly with llama-rpc; I don't like any additional layer on top. I already run llama-rpc between two Nvidia clusters and likewise have a 1Gbps network. Unfortunately, my built-in Ethernet port maxes out at 1Gbps. I'd like to add another system so I can run DeepSeek v3. It will definitely beat offloading to system RAM.
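For completeness, spreading one model over several RPC workers is just a comma-separated list passed to `--rpc`; a sketch assuming two remote rpc-server instances plus the local GPUs, with hostnames, ports, context size, and the GGUF filename all placeholders:

```bash
# llama-server splits layers across local GPUs and every RPC endpoint it is given.
./build/bin/llama-server -m ./deepseek-v3-q4.gguf -ngl 99 \
  --rpc 192.168.1.10:50052,192.168.1.11:50052 \
  -c 8192 --host 0.0.0.0 --port 8080
```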