Repurposing 800 x RX 580s for LLM inference - 4 months later - learnings
Posted by rasbid420@reddit | LocalLLaMA | 77 comments
Back in March I asked this sub if RX 580s could be used for anything useful in the LLM space and asked for help on how to implement inference:
https://www.reddit.com/r/LocalLLaMA/comments/1j1mpuf/repurposing_old_rx_580_gpus_need_advice/
Four months later, we've built a fully functioning inference cluster using around 800 RX 580s across 132 rigs. I want to come back and share what worked and what didn't, so that others can learn from our experience.
what worked
Vulkan with llama.cpp
- Vulkan backend worked on all RX 580s
- Required compiling Shaderc manually to get glslc
- llama.cpp built with custom flags for Vulkan support and no AVX instructions (the CPUs on our builds are very old Celerons). we tried countless build attempts and this is the best we could do:
CXXFLAGS="-march=core2 -mtune=generic" cmake .. \
  -DLLAMA_BUILD_SERVER=ON \
  -DGGML_VULKAN=ON \
  -DGGML_NATIVE=OFF \
  -DGGML_AVX=OFF -DGGML_AVX2=OFF \
  -DGGML_AVX512=OFF -DGGML_AVX_VNNI=OFF \
  -DGGML_FMA=OFF -DGGML_F16C=OFF \
  -DGGML_AMX_TILE=OFF -DGGML_AMX_INT8=OFF -DGGML_AMX_BF16=OFF \
  -DGGML_SSE42=ON
Per-rig multi-GPU scaling
- Each rig runs 6 GPUs and splits small models across multiple kubernetes containers, one GPU's VRAM per container (the minimum granularity is 1 GPU per container - we couldn't split a single GPU's VRAM between 2 containers)
- Used --ngl 999, --sm none for 6 containers on 6 GPUs; for bigger contexts we could extend a small model's limits and give it more than 1 GPU's VRAM
- For bigger models (Qwen3-30B_Q8_0) we used --ngl 999, --sm layer and built a recent llama.cpp version for reasoning management, where thinking mode can be turned off with --reasoning-budget 0 (a rough sketch of both launch modes follows below)
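Roughly, the two launch modes look like this (model paths and ports are illustrative, not our exact config; which GPU a container sees is handled at the kubernetes/container level rather than by these flags):

# small model, one GPU per container, no splitting
./llama-server -m /models/small-model.gguf -ngl 999 -sm none --port 1234

# Qwen3-30B_Q8_0 split layer-wise across the rig's GPUs, thinking disabled
./llama-server -m /models/Qwen_Qwen3-30B-A3B-Q8_0.gguf \
  -ngl 999 -sm layer --reasoning-budget 0 --port 1234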
Load balancing setup
- Built a FastAPI load-balancer backend that assigns each user to an available kubernetes pod
- Redis tracks current pod load and handles session stickiness
- The load-balancer also does prompt cache retention and restoration. The biggest challenge here was getting the llama.cpp servers to accept old prompt caches that weren't 100% in the processed eval format and would otherwise get dropped and reinterpreted from the beginning. We found that --cache-reuse 32 allows a margin of error big enough for all the conversation caches to be evaluated instantly
- Models respond via streaming SSE in an OpenAI-compatible format (a rough example of the flow follows below)
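To give an idea of the flow (the host, port, slot id and filenames here are illustrative, not our exact setup):

# streaming, OpenAI-compatible chat request as forwarded to a pod's llama-server
curl -N http://pod-ip:1234/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"stream": true, "messages": [{"role": "user", "content": "hello"}]}'

# prompt cache retention / restoration uses llama-server's slot save/restore
# endpoints (enabled with --slots and --slot-save-path on the server)
curl -X POST "http://pod-ip:1234/slots/0?action=save" \
  -H "Content-Type: application/json" -d '{"filename": "conv-123.bin"}'
curl -X POST "http://pod-ip:1234/slots/0?action=restore" \
  -H "Content-Type: application/json" -d '{"filename": "conv-123.bin"}'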
what didn’t work
ROCm HIP / PyTorch / TensorFlow inference
- ROCm technically works and tools like rocminfo and rocm-smi run, but we couldn't get a working llama.cpp HIP build
- there's no functional PyTorch backend for Polaris-class gfx803 cards, so PyTorch didn't work
- couldn't get TensorFlow to work either
we’re also putting part of our cluster through some live testing. If you want to throw some prompts at it, you can hit it here:
https://www.masterchaincorp.com
It’s running Qwen3-30B and the frontend is just the basic llama.cpp server web UI. Nothing fancy, so feel free to poke around and help test the setup. Feedback welcome!
az226@reddit
How much power? Where are you hosting them?
rasbid420@reddit (OP)
a total capacity of 132 kW
we're hosting them in the united states!
az226@reddit
How did you find a spot with so much juice? What’s the rent like?
rasbid420@reddit (OP)
it's very hard to find such spots nowadays with the AI data center gold rush
we were coincidentally lucky to be already involved in the field with Ethereum mining and thus we already had the capacity ready for the switch!
rent has been decent and stable over the past 7 years; we were very lucky to have great landlords!
az226@reddit
What’s the rent / electricity price if you don’t mind me asking?
iam_maxinne@reddit
Amazing result! I asked it to create a small single-page token visualizer, and it did amazingly well! Good to see this kind of result on a local LLM... Thanks for sharing!
juanlndd@reddit
Please shed some light if possible, could you test models with different weights, so we can have a benchmark? Whisper? Qwen 600m, qwen 1.5b, param 9b, gemma 4b with picture?
undisputedx@reddit
All 6 gpus connected with x1 risers?
rasbid420@reddit (OP)
hello, yes, each individual GPU is connected with an x1 riser
--dany--@reddit
Nice mining setup! With all servers up and running what’s your total power consumption?
rasbid420@reddit (OP)
so on full mining throttle we were pulling around 133 kW
on local inference at full throttle we would be doing around 50 kW (assuming nonstop inference, which isn't likely)
for the 20 rigs currently open on the https://masterchaincorp.com endpoint, usage is sporadic, around 10%
--dany--@reddit
Thanks! Hopefully you live far from the equator. For 800 GPUs that's very minimal though. Good business model! 😎
undisputedx@reddit
Nice.
How is the utilization per card on full load?
Qwen-30B which quant are you using?
kevin_1994@reddit
You got any pics? ;)
rasbid420@reddit (OP)
(rig photos attached in the original thread)
fallingdowndizzyvr@reddit
First, bravo! o7.
ROCm works on the RX580. I posted a thread about it. I would post it here but this sub tends to hide posts with reddit links in them. But if you look at my submitted posts, you'll see it from about a year ago.
rasbid420@reddit (OP)
thanks a lot! o/
we'll look into it and come back to you if we have any questions!
one of the annoying things we encountered was satisfying all the other constraints of our setup (4GB RAM, very old Celeron CPU with no AVX instruction set, no SSD/HDD but a mere 8GB USB stick for the operating system)
popegonzalo@reddit
So basically it is 132 independent machines with 48GB each, right? Since it seems that the old architecture temporarily blocks you from accessing more GPUs at one time.
rasbid420@reddit (OP)
yep, that's right!
hopefully we will find some task fit enough for these older cards with a not-so-sophisticated inference model!
DeltaSqueezer@reddit
What a cool project. Can you share more on the setup e.g. the llama launch config/command, helm charts etc.?
Also, did you consider using llm-d, the kubernetes-native implementation of vLLM? I saw there's some interesting stuff being done, including shared KV cache etc.
What's the idle power draw of a single 6 GPU pod? I'm envious of your 6c/kWh electricity. I'm paying 5x that. What country are the GPUs located in?
rasbid420@reddit (OP)
hello!
sure thing! here's a sample of the individual kubernetes pod config with which llama-server is run
- name: llama-server
  image: docker.io/library/llama-server:v8
  workingDir: /app/bin
  command: ["./llama-server"]
  args:
    ["-m", "/models/Qwen_Qwen3-30B-A3B-Q8_0.gguf",
     "--slots", "--slot-save-path", "/prompt_cache",
     "--temp", "0.7", "--top-p", "0.8", "--top-k", "20", "--min-p", "0",
     "--no-mmap", "--ctx-size", "8192", "-ngl", "49", "-sm", "row",
     "-b", "1028", "--reasoning-budget", "0",
     "--props", "--metrics", "--log-timestamps",
     "--host", "0.0.0.0", "--port", "1234", "--cache-reuse", "32",
     "--jinja", "--chat-template-file", "/chat-template/template.jinja"]
the docker image isn't published, but I can publish it if you want and provide more information about the volume mount paths / specific directories the image relies on
with regards to llm-d, the kubernetes-native implementation of vLLM: we haven't touched anything other than llama.cpp, so no, we haven't tried it, but it's on our list! we're achieving something similar to a shared cache between pods with a shared virtual mount point where all the pods can save and retrieve KV caches, so session stickiness isn't strictly required between successive messages in the same conversation
rig power consumption measured at the plug is 150 W idle and 550 W at full load during heavy prompt processing (explain quantum mechanics in 2000 words)
gildedseat@reddit
Great project. I'm curious how you feel your overall operating costs for power etc. compare to using more modern hardware. is this 'worth it' vs newer hardware?
rasbid420@reddit (OP)
i couldn't say because i haven't gotten my hands on any newer hardware to test it out.
however i can imagine pulling 200 tps for prompt eval must be amazing! i think the greatest weakness of these old Polaris cards is that if you have a big initial prompt it will take forever to receive the first token
in terms of operating costs they're negligible at the moment since we pay very low electricity costs of 6 c/kWh, and the electrical bill is nothing compared to the mining activity, where it represented 75% of our operating costs
HollowInfinity@reddit
Huh that's interesting, I'm trying the '--reasoning-budget 0' param for the latest repo build of llama.cpp server and it doesn't seem to do anything for my local Qwen3-30B-A3B-Q8_0. I would love to force reasoning off in the server instead of session - do you have any tweaks you did to make this work?
DeltaSqueezer@reddit
Sure, if it isn't too much trouble, I'd be interested in seeing the Dockerfile to see how it was all put together.
Here's the link to the LMCache I mentioned:
https://github.com/LMCache/LMCache
rasbid420@reddit (OP)
thank you very much for the link; i must say it looks very promising indeed and we absolutely have to look into vLLM with this LMCache Redis stuff!
here's the docker image
https://hub.docker.com/r/rasbid/llama-server
DeltaSqueezer@reddit
vLLM is essential for efficient batching/KV cache utilization across multiple streams. However, given that the RX 580 only has around 6 TFlops of compute, I'm not sure how much you can squeeze out of it/benefit from it.
polandtown@reddit
A dream weekend project! Congrats, and would love to hear a pt2 from all of the responses!
rasbid420@reddit (OP)
definitely coming back with updates following the great advice received from this wonderful community! Thank you!
polandtown@reddit
awesome - looking forward to it. this was such a cool post, tyvm for sharing
Pentium95@reddit
6.400 GB of VRAM? i bet you can run pretty large models! have you tested models like deepseek R1? how many tokens per second did you achieve?
rasbid420@reddit (OP)
hello! unfortunately we couldn't manage to pool the resources of multiple rigs to achieve a total available VRAM higher than 48GB (6x8GB)
the best we could do with the resources we have is qwen3-30b_Q8_0, which is around 32GB, leaving some extra space for conversation context
Remote_Cap_@reddit
Why not connect the nodes with llama.cpp rpc?
rasbid420@reddit (OP)
does llama.cpp support inter-node RPC for multi-node model parallelism or distributed inference? i thought that each instance runs independently and cannot share model weights or KV cache over RPC!
farkinga@reddit
The key is to specify GGML_RPC=ON when building llama.cpp so that rpc-server will be compiled. Then launch rpc-server on each node, and finally orchestrate the nodes with llama-server (a rough sketch follows below).
Seems like this could work!
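Roughly like this (hostnames, port, and model path are illustrative, and the exact flags are from memory, so double-check against the llama.cpp RPC docs):

# build with the RPC backend enabled alongside Vulkan
cmake -B build -DGGML_VULKAN=ON -DGGML_RPC=ON
cmake --build build --config Release

# on each GPU node: expose the local backend over the network
./build/bin/rpc-server -H 0.0.0.0 -p 50052

# on the orchestrating node: point llama-server at all the workers
./build/bin/llama-server -m /models/model.gguf -ngl 999 \
  --rpc 192.168.1.11:50052,192.168.1.12:50052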
rasbid420@reddit (OP)
wow this is great stuff 100% trying this and getting back to you with the results! maybe we could in fact run a bigger model after all!
segmond@reddit
it's going to be super slow, speaking from experience, and I used faster GPUs. I mean, it's better to have the ability to run bigger models, even if slowly, than not at all. I would happily run AGI at 1 tk/sec rather than not at all if it were a thing, so have fun with the experiments.
rasbid420@reddit (OP)
sure thing!
but wouldn't you rather have a janitor sweep a floor instead of a PhD professor?
couldn't there be some easy tasks that should be delegated to inferior models?
CheatCodesOfLife@reddit
Let us know how it goes. For me, adding a single RPC node ended up slowing down deepseek-r1.
5x3090 + CPU with -ot gets me 190t/s prompt processing + 16t/s generation
5x3090 + 2x3090 over RPC @2.5gbit LAN caps prompt processing to about 90t/s and textgen 11t/s
vllm doesn't have this problem though (running deepseek2.5 since I can't fit R1 at 4bit) so I suspect there are optimizations to be made to llama.cpp's rpc server
rasbid420@reddit (OP)
that's an amazing setup you have right there!
very impressive stuff, do you use it for personal or commercial application?
we're going to test the RPC distributed inference next week and come back with updates!
farkinga@reddit
I'm excited to hear the results!
It so happens I was researching distributed llama.cpp earlier this week. I had trouble finding documentation for it because I didn't know the "right" method for distributed llama.cpp computation. The challenge is: llama.cpp supports SO MANY methods for distributed computation; which one is best for a networked cluster of GPU nodes?
Anyway, to save you the trouble, I think the RPC method is likely to give the best results.
Very cool project, by the way.
Remote_Cap_@reddit
Search it up brother. It does.. OG G added it over a year ago.
TheTerrasque@reddit
last i checked it only worked with a CLI program and wasn't supported by llama-server. Has this changed?
rasbid420@reddit (OP)
thanks a lot! I'll look into it 100%. i've been solely focused on solutions / suggestions provided by reddit and haven't looked too much into llama.cpp itself, although I should
Marksta@reddit
You can handle this with the RPC client, but you'll need a port per GPU. It shouldn't be too bad if you assign ports in numerical order over some range and auto-run it on boot or something. But also check out the GPUStack project. It'll give you an interface and the auto-discovery logic for llama.cpp RPC clients for free. You'll just need to build or download a llama-box Vulkan binary and put it in the install folder yourself; out of the box it isn't configured to set up Vulkan yet, but it does work once you add the binary.
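The port-per-GPU idea would look roughly like this (the GGML_VK_VISIBLE_DEVICES variable and the port range are assumptions on my part; adjust to your setup):

# one rpc-server per GPU, each pinned to one Vulkan device on its own port
for i in 0 1 2 3 4 5; do
  GGML_VK_VISIBLE_DEVICES=$i ./build/bin/rpc-server -H 0.0.0.0 -p $((50052 + i)) &
done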
rasbid420@reddit (OP)
Thank you! will give it a try!
Pentium95@reddit
absolutely reasonable, dumb assumption on my part. Linking rigs together would require a datacenter-grade interconnect, which would mean building everything from scratch at a cost that absolutely isn't worth it.
So you have hundreds of 48GB VRAM rigs, which, for the cost, is kinda impressive. How about the speed? Have you tried larger MoE models, like the new https://huggingface.co/mradermacher/ICONN-e1-i1-GGUF? You should be able to run a very high quant, like.. Q2_K_S, but i really wonder how many t/s you might get
rasbid420@reddit (OP)
it's not a dumb assumption!
that's exactly what I wanted to achieve in the first place but i was humbled by the hardware limitations of bridging as well!
yes, these rigs are very low cost, i'd estimate $400 each for 48GB of VRAM, plus low energy costs @ 6 c/kWh. maybe we could find some use for them, who knows
i haven't tried that specific model, but for qwen3-30B_Q8 we're getting around 15-20 tps for prompt eval and 13-17 tps for inference; what's interesting here is the high variation between rigs (some with inferior hardware pull 20 while others pull 15)
Pentium95@reddit
13-20 tps? not bad! i thought PCIe and the slower VRAM would bottleneck it even more.
MoE models are pretty amazing for this hardware
fun fact: the model i linked turned out, a few minutes ago, to be not exactly as "built from scratch" as the owner said; a few quants have been made private
Remote_Cap_@reddit
Why not connect them up with llama.cpp RPC?
https://github.com/ggml-org/llama.cpp/discussions/14266
Django_McFly@reddit
I'm always like, "why are they trolling" then I realize the poster is from a period = comma country. 6,400 GB not 6.4 GB.
BITE_AU_CHOCOLAT@reddit
The power consumption must be insane. Have you checked whether just going for recent cards like the 5090 wouldn't have been more efficient in the long run?
rasbid420@reddit (OP)
Hello,
yes, there is of course a very strong argument to be made for newer cards from certain points of view:
more efficient power consumption
bigger VRAMs
higher memory bandwidth
we didn't have the opportunity to select the cards at the beginning of this project
we were left with a sea of used-up RX 580s from our old Ethereum mine and we wanted to put them to use
the biggest advantage that these cards offer is cost; comparing at a glance:
5090 32GB = $3,000
6 x RX 580 8GB (48GB) = $400
very cheap VRAM
DepthHour1669@reddit
2 x AMD MI50 32gb = 64gb $240
https://www.alibaba.com/trade/search?SearchText=mi50%2032gb&from=header&
rasbid420@reddit (OP)
oh that's very cool! i wasn't aware of the AMD MI50
so much cheap VRAM!
Ok-Internal9317@reddit
potentially you might want to have a look at the Tesla M40; it has 12 GiB of VRAM and is only 20% the price of the MI50, around 1.4x faster than the RX 580, and has native CUDA support
rasbid420@reddit (OP)
those are all great competitive alternatives but I'm afraid that further optimizing hardware choice here isn't necessarily the main problem
rather what sort of use case could there be for older, cheaper, inefficient, not-so-sophisticated hardware?
kironlau@reddit
from the photo, the rigs must have been mining rigs in the past, then converted to LLM rigs (no one would buy so many 580s, especially for hosting LLMs)
rasbid420@reddit (OP)
that's true!
DepthHour1669@reddit
So… what was the point of this? Is this being used commercially? 800 gpus as a hobby project seems insane.
rasbid420@reddit (OP)
the point of all of this hasn't been found yet and this is not being used commercially
we really wanted to give a second breath of life to old Polaris cards because there are so many of them out there on the secondary market and they're very cheap cards for the amount of VRAM they offer ($50-$70 each for 8GB)
PutMyDickOnYourHead@reddit
But their power consumption-to-performance ratio is terrible. 185W for 8 GB VRAM on slow cores and low bandwidth. You'd be better off putting your money into one H100.
rasbid420@reddit (OP)
1 H100 may be more efficient in terms of power consumption, but it's not as efficient when it comes to $ spent per GB of VRAM
I think the H100 has 80 GB of VRAM and costs $25,000
while 10 x RX 580 totaling 80GB of VRAM would cost $700
the cost savings are orders of magnitude!
there has to be a use for this old equipment that doesn't necessarily have to be the most advanced at reasoning / achieving very difficult tasks
DepthHour1669@reddit
At that price range you’re a lot better off with a 16gb V340 tbh
https://ebay.us/m/Y0sQTW
rasbid420@reddit (OP)
that's correct!
there are so many exciting possibilities for the future of Local LLM!
kadir_nar@reddit
Thank you for sharing.
rasbid420@reddit (OP)
thank you but all merit goes to the localllm community!
a_beautiful_rhind@reddit
Heh.. you need an old kernel and an old ROCm for it to work: https://github.com/woodrex83/ROCm-For-RX580 https://github.com/robertrosenbusch/gfx803_rocm
There used to be another repo with patches. PyTorch likely needs a downgrade too.
I ran A1111 on the one I had so it definitely was functional at one point.
rasbid420@reddit (OP)
- woodrex83/ROCm-For-RX580: patched ROCm 5.4.2 installers specifically tailored for Polaris GPUs. The obstacle we encountered here: rocminfo and dmesg showed that only one GPU was being added to the KFD topology; the others were skipped due to lack of PCIe atomic support ("PCI rejects atomics 730<0")
- robertrosenbusch/gfx803_rocm: documented how to patch an older ROCm release (5.0–5.2) for gfx803 compatibility
we will definitely give it a couple more tries because i'm really interested in the speed comparison!
thank you for your recommendation
a_beautiful_rhind@reddit
Tried to use mine on PCIE 2.0 and no dice because of atomics support. I never tried multiple cards on my PCIE 4 system since I just have the one.
There was some chinese repo too but it was hard to find and I don't have the bookmark. It was full of patched binaries. I found it through issues on other repos. Look there because they used to sell 16gb versions of this card with soldered ram and I can't imagine they never had it working with at least last year's versions.
Old card is old.
rasbid420@reddit (OP)
https://github.com/xuhuisheng/rocm-gfx803
i think this is the repo you're referring to!
a_beautiful_rhind@reddit
Wow, time flies. Good thing people put more recent stuff in the issues.
cantgetthistowork@reddit
All this work and you could have just used gpustack/gpustack
Mr_Moonsilver@reddit
That's an impressive setup. Thank you for sharing this here!
rasbid420@reddit (OP)
Thank you! The LocalLLaMA community helped us very much with pointers in the right direction 4 months ago and it saved us a lot of time!