best approach for Strix Halo distributed inference in llama.cpp?

Posted by blbd@reddit | LocalLLaMA | 20 comments

I'm curious what people are doing for this use case to get the best trade-off of convenience and performance.

A private backhaul over the 10GbE? USB4 networking? Something else?

I see conflicting information on whether the parallelism is strictly per-layer, or whether there's a smarter form of parallelism that can drive 100% CPU/GPU utilization across all nodes.
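
For reference, the only distributed path I'm aware of in llama.cpp is the RPC backend, which as far as I can tell splits the model by layer across nodes. A rough sketch of what I mean (the IPs, port, and model path are placeholders, exact flags may vary by build, and it needs to be compiled with `-DGGML_RPC=ON`):

```bash
# On each worker node: expose its local backend over the network
# (hypothetical LAN addresses, default RPC port shown)
rpc-server -H 0.0.0.0 -p 50052

# On the head node: offload layers to the local GPU plus the remote rpc-servers
llama-cli -m ./model.gguf -ngl 99 \
  --rpc 192.168.10.11:50052,192.168.10.12:50052 \
  -p "hello"
```

If that really is just per-layer (pipeline-style) splitting, then only one node is busy at a time for a single request, which is why I'm asking whether there's a smarter scheme.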

Is it better to use the extra unified RAM to run bigger, smarter models? Or to take smaller models and try to push them for higher token throughput?