Anyone else struggling with multi-GPU stability when running larger local models?

Posted by Lyceum_Tech@reddit | LocalLLaMA | 21 comments

Been scaling up local LLM clusters, and multi-GPU setups are still a pain.

Power throttling, ROCm bugs, and utilization dropping at scale are killing me.
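For the power-throttling piece, here's a minimal diagnostic sketch (assuming NVIDIA cards with `nvidia-smi` on PATH; ROCm users would swap in `rocm-smi` and adjust the parsing). It polls per-GPU power draw and utilization, and flags cards sitting near their power cap while utilization sags, which is the usual signature of power throttling. The thresholds (`util_floor`, `power_margin`) are illustrative defaults, not anything official:

```python
import csv
import io
import subprocess

# Fields queried from nvidia-smi (CSV, no header, no units).
QUERY = "index,utilization.gpu,power.draw,power.limit"


def parse_smi_csv(text):
    """Parse `nvidia-smi --query-gpu=... --format=csv,noheader,nounits`
    output into a list of dicts with numeric fields."""
    rows = []
    for fields in csv.reader(io.StringIO(text)):
        idx, util, draw, limit = (f.strip() for f in fields)
        rows.append({
            "index": int(idx),
            "util_pct": float(util),
            "power_w": float(draw),
            "power_limit_w": float(limit),
        })
    return rows


def flag_throttling(rows, util_floor=80.0, power_margin=0.95):
    """Return indices of GPUs pinned near their power cap while
    utilization stays low -- a common power-throttling signature."""
    return [r["index"] for r in rows
            if r["power_w"] >= power_margin * r["power_limit_w"]
            and r["util_pct"] < util_floor]


def sample_gpus():
    """Query live GPU stats via nvidia-smi (requires NVIDIA driver)."""
    out = subprocess.run(
        ["nvidia-smi", f"--query-gpu={QUERY}",
         "--format=csv,noheader,nounits"],
        capture_output=True, text=True, check=True).stdout
    return parse_smi_csv(out)
```

Run `flag_throttling(sample_gpus())` in a loop during inference; if the same indices keep showing up, power limits (or cooling) are the bottleneck rather than the software stack.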

What’s the biggest headache you’re facing with larger local setups right now?