run local inference across machines
Posted by saint_0x@reddit | LocalLLaMA | View on Reddit | 5 comments
mesh is a distributed protocol for running large models locally across devices
the idea is that the control plane hosts local lan pools, which shard the model across the member ring and credit members proportionally to their compute contributions
it’s still rough, but it supports metal, cuda, and pure cpu backends (which can interoperate with one another)
i successfully ran a model locally on lan across both my metal m3 and my intel air :)
https://github.com/saint0x/mesh
Brigade_Project@reddit
This is interesting. I've been running Ollama on a dual-GPU machine (4070 Ti Super + 2060 Super) and the obvious limitation is that larger models still need to fit within a single GPU's VRAM budget even with both cards. The idea of a proper tensor-parallel ring across LAN machines rather than hacking around it with CUDA_VISIBLE_DEVICES is appealing.
A few things I noticed digging into the repo:
The "no silent provider fallback" design is the right call. Silent CPU fallback is exactly the kind of thing that makes Ollama frustrating to debug — you think you're running on GPU, you're not, and the only symptom is slowness.
What I'm curious about: how does shard assignment actually work when workers have mismatched VRAM? My two cards are 16GB and 8GB. Does the ring manager proportionally assign tensor chunks, or does it assume homogeneous nodes?
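For concreteness, the behavior I'd hope for with mismatched cards is something like largest-remainder proportional assignment. This is a hypothetical sketch of that idea, not mesh's actual code (the function name and shapes here are mine):

```python
# Hypothetical sketch: assign transformer layers to workers in
# proportion to each worker's VRAM, using largest-remainder rounding.
# Not mesh's real algorithm -- just the behavior being asked about.

def assign_layers(num_layers: int, vram_gb: list[float]) -> list[int]:
    """Split num_layers across workers proportional to their VRAM."""
    total = sum(vram_gb)
    # ideal (fractional) share per worker
    shares = [num_layers * v / total for v in vram_gb]
    counts = [int(s) for s in shares]
    # hand the leftover layers to the workers with the largest remainders
    by_remainder = sorted(range(len(shares)),
                          key=lambda i: shares[i] - counts[i],
                          reverse=True)
    for i in by_remainder[:num_layers - sum(counts)]:
        counts[i] += 1
    return counts

# a 32-layer model on a 16GB card + an 8GB card
print(assign_layers(32, [16.0, 8.0]))  # -> [21, 11], roughly 2:1
```

A homogeneous-nodes assumption would instead just split `num_layers // n` per worker, which is exactly what breaks on a 16GB + 8GB pair.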
Watching this one. If the artifact loading gets cleaner (right now you need to manually split safetensors and write manifests) this could be genuinely useful for homelab inference.
saint_0x@reddit (OP)
you also might be interested in this — i extracted the work-credit computation system into a standalone poc lib
https://github.com/ariacomputecompany/divy
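the core idea is just a proportional split of a pool reward by measured compute. rough sketch of that idea, not divy's actual api (the function and field names here are made up):

```python
# Hypothetical sketch of proportional work-credit: each pool member
# gets a share of the reward proportional to its measured compute
# contribution. Not divy's real API -- names are illustrative.

def credit(pool_reward: float, contributions: dict[str, float]) -> dict[str, float]:
    """Split pool_reward proportional to each member's contribution."""
    total = sum(contributions.values())
    return {member: pool_reward * c / total
            for member, c in contributions.items()}

# two members contributing 60% / 40% of the pool's compute
print(credit(100.0, {"m3": 60.0, "intel-air": 40.0}))
# -> {'m3': 60.0, 'intel-air': 40.0}
```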
saint_0x@reddit (OP)
hey man, thanks so much for digging in, glad you found it useful! definitely feel you on the silent provider fallback.
re: homogeneous nodes: it started that way simply bc i’m building this solo, but the ring is heterogeneity-aware, so to speak. it’s still rough, and your point about the artifact loading is accurate too
but i’m so excited for this to get better — this feels like something the world needs
niga_chan@reddit
this is actually a really interesting direction
feels like a lot of people are trying to solve the “how do we use all available hardware” problem from the multi-node side
we’ve been exploring the opposite a bit: pushing how far a single node can go when you optimize for agent workloads and orchestration
interestingly, even without distributing, you can get pretty far just by keeping things lightweight and memory-efficient
curious how mesh behaves when workloads become more agent-like vs just pure inference
saint_0x@reddit (OP)
that’s cool as fuck, and i agree. i think both of those approaches work together, because more powerful individual nodes only make a distributed protocol more valuable and powerful
any github links i can check out?