Repurposing 800 x RX 580s for LLM inference - 4 months later - learnings

Posted by rasbid420@reddit | LocalLLaMA | View on Reddit | 77 comments

Back in March I asked this sub if RX 580s could be used for anything useful in the LLM space and asked for help on how to implemented inference:

https://www.reddit.com/r/LocalLLaMA/comments/1j1mpuf/repurposing_old_rx_580_gpus_need_advice/

Four months later, we've built a fully functioning inference cluster using around 800 RX 580s across 132 rigs. I want to come back and share what worked, what didn’t so that others can learn from our experience.

what worked

Vulkan with llama.cpp

CXXFLAGS="-march=core2 -mtune=generic" cmake .. \
  -DLLAMA_BUILD_SERVER=ON \
  -DGGML_VULKAN=ON \
  -DGGML_NATIVE=OFF \
  -DGGML_AVX=OFF   -DGGML_AVX2=OFF \
  -DGGML_AVX512=OFF -DGGML_AVX_VNNI=OFF \
  -DGGML_FMA=OFF   -DGGML_F16C=OFF \
  -DGGML_AMX_TILE=OFF -DGGML_AMX_INT8=OFF -DGGML_AMX_BF16=OFF \
  -DGGML_SSE42=ON  \

Per-rig multi-GPU scaling

Load balancing setup

what didn’t work

ROCm HIP \ pytorc \ tensorflow inference

we’re also putting part of our cluster through some live testing. If you want to throw some prompts at it, you can hit it here:

https://www.masterchaincorp.com

It’s running Qwen-30B and the frontend is just a basic llama.cpp server webui. nothing fancy so feel free to poke around and help test the setup. feedback welcome!