Repurposing 800 x RX 580s for LLM inference - 4 months later - learnings
Posted by rasbid420@reddit | LocalLLaMA | 77 comments
Back in March I asked this sub if RX 580s could be used for anything useful in the LLM space and asked for help on how to implement inference:
https://www.reddit.com/r/LocalLLaMA/comments/1j1mpuf/repurposing_old_rx_580_gpus_need_advice/
Four months later, we've built a fully functioning inference cluster using around 800 RX 580s across 132 rigs. I want to come back and share what worked and what didn't, so that others can learn from our experience.
what worked
Vulkan with llama.cpp
- Vulkan backend worked on all RX 580s
- Required compiling Shaderc manually to get glslc
- llama.cpp built with custom flags for Vulkan support and no AVX instructions (the CPUs on our builds are very old Celerons). we tried countless build attempts and this is the best we could do:
CXXFLAGS="-march=core2 -mtune=generic" cmake .. \
  -DLLAMA_BUILD_SERVER=ON \
  -DGGML_VULKAN=ON \
  -DGGML_NATIVE=OFF \
  -DGGML_AVX=OFF -DGGML_AVX2=OFF \
  -DGGML_AVX512=OFF -DGGML_AVX_VNNI=OFF \
  -DGGML_FMA=OFF -DGGML_F16C=OFF \
  -DGGML_AMX_TILE=OFF -DGGML_AMX_INT8=OFF -DGGML_AMX_BF16=OFF \
  -DGGML_SSE42=ON
Per-rig multi-GPU scaling
- Each rig runs 6 GPUs and splits small models across multiple kubernetes containers, one GPU's VRAM per container (the minimum granularity is 1 GPU per container - we couldn't split a single GPU's VRAM between 2 containers)
- Used --ngl 999, --sm none for 6 containers on 6 GPUs; for bigger contexts we could extend a small model's limits and give it more than 1 GPU's VRAM
- For bigger models (Qwen3-30B_Q8_0) we used --ngl 999, --sm layer and built a recent llama.cpp version for reasoning management, where thinking mode can be turned off with --reasoning-budget 0 (a rough sketch of both launch modes follows below)
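Roughly, the two launch modes look like this (model paths and ports are illustrative, not our exact config; which GPU a container sees is handled at the kubernetes/container level rather than by these flags):

# small model, one GPU per container, no splitting
./llama-server -m /models/small-model.gguf -ngl 999 -sm none --port 1234

# Qwen3-30B_Q8_0 split layer-wise across the rig's GPUs, thinking disabled
./llama-server -m /models/Qwen_Qwen3-30B-A3B-Q8_0.gguf \
  -ngl 999 -sm layer --reasoning-budget 0 --port 1234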
Load balancing setup
- Built a FastAPI load-balancer backend that assigns each user to an available kubernetes pod
- Redis tracks current pod load and handles session stickiness
- The load-balancer also does prompt cache retention and restoration. The biggest challenge here was getting the llama.cpp servers to accept old prompt caches that weren't 100% in the processed eval format and would otherwise get dropped and reinterpreted from the beginning. We found that --cache-reuse 32 allows a margin of error big enough for all the conversation caches to be evaluated instantly
- Models respond via streaming SSE in an OpenAI-compatible format (a rough example of the flow follows below)
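To give an idea of the flow (the host, port, slot id and filenames here are illustrative, not our exact setup):

# streaming, OpenAI-compatible chat request as forwarded to a pod's llama-server
curl -N http://pod-ip:1234/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"stream": true, "messages": [{"role": "user", "content": "hello"}]}'

# prompt cache retention / restoration uses llama-server's slot save/restore
# endpoints (enabled with --slots and --slot-save-path on the server)
curl -X POST "http://pod-ip:1234/slots/0?action=save" \
  -H "Content-Type: application/json" -d '{"filename": "conv-123.bin"}'
curl -X POST "http://pod-ip:1234/slots/0?action=restore" \
  -H "Content-Type: application/json" -d '{"filename": "conv-123.bin"}'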
what didn’t work
ROCm HIP / PyTorch / TensorFlow inference
- ROCm technically works and tools like rocminfo and rocm-smi run, but we couldn't get a working llama.cpp HIP build
- there's no functional PyTorch backend for Polaris-class gfx803 cards, so PyTorch didn't work
- couldn't get TensorFlow to work either
we’re also putting part of our cluster through some live testing. If you want to throw some prompts at it, you can hit it here:
https://www.masterchaincorp.com
It’s running Qwen3-30B and the frontend is just the basic llama.cpp server web UI. Nothing fancy, so feel free to poke around and help test the setup. Feedback welcome!
az226@reddit
How much power? Where are you hosting them?
rasbid420@reddit (OP)
a total capacity of 132 kW
we're hosting them in the united states!
az226@reddit
How did you find a spot with so much juice? What’s the rent like?
rasbid420@reddit (OP)
it's very hard to find such spots nowadays with the AI data center gold rush
we were coincidentally lucky to be already involved in the field with Ethereum mining and thus we already had the capacity ready for the switch!
rent has been decent and stable over the past 7 years; we were very lucky to have great landlords!
az226@reddit
What’s the rent / electricity price if you don’t mind me asking?
iam_maxinne@reddit
Amazing result! I asked it to create a small single-page token visualizer, and it did amazingly well! Good to see this kind of result on a local LLM... Thanks for sharing!
juanlndd@reddit
Please shed some light if possible, could you test models with different weights, so we can have a benchmark? Whisper? Qwen 600m, qwen 1.5b, param 9b, gemma 4b with picture?
undisputedx@reddit
All 6 gpus connected with x1 risers?
rasbid420@reddit (OP)
hello, yes, each individual GPU is connected with an x1 riser
--dany--@reddit
Nice mining setup! With all servers up and running what’s your total power consumption?
rasbid420@reddit (OP)
so on full mining throttle we were pulling around 133 kW
on local inference at full throttle we would be doing around 50 kW (assuming nonstop inference, which isn't likely)
for the 20 rigs currently open on the https://masterchaincorp.com endpoint, usage is sporadic, around 10%
--dany--@reddit
Thanks! Hopefully you live far from the equator. For 800 GPUs that's very minimal though. Good business model! 😎
undisputedx@reddit
Nice.
How is the utilization per card on full load?
Qwen-30B which quant are you using?
kevin_1994@reddit
You got any pics? ;)
rasbid420@reddit (OP)
(rig photos attached in the original thread)
fallingdowndizzyvr@reddit
First, bravo! o7.
ROCm works on the RX580. I posted a thread about it. I would post it here but this sub tends to hide posts with reddit links in them. But if you look at my submitted posts, you'll see it from about a year ago.
rasbid420@reddit (OP)
thanks a lot! o/
we'll look into it and come back to you if we have any questions!
one of the annoying things we encountered was satisfying all the other constraints of our setup (4GB RAM, very old Celeron CPU with no AVX instruction set, no SSD/HDD but a mere 8GB USB stick for the operating system)
popegonzalo@reddit
So basically it is 132 independent machines with 48GB each, right? Since it seems that the old architecture temporarily blocks you from accessing more GPUs at one time.
rasbid420@reddit (OP)
yep, that's right!
hopefully we will find some task fit enough for these older cards with a not-so-sophisticated inference model!
DeltaSqueezer@reddit
What a cool project. Can you share more on the setup e.g. the llama launch config/command, helm charts etc.?
Also, did you consider using llm-d, the kubernetes-native implementation of vLLM? I saw there's some interesting stuff being done, including shared KV cache etc.
What's the idle power draw of a single 6 GPU pod? I'm envious of your 6c/kWh electricity. I'm paying 5x that. What country are the GPUs located in?
rasbid420@reddit (OP)
hello!
sure thing! here's a sample of the individual kubernetes pod config with which llama-server is run
- name: llama-server
  image: docker.io/library/llama-server:v8
  workingDir: /app/bin
  command: ["./llama-server"]
  args:
    ["-m", "/models/Qwen_Qwen3-30B-A3B-Q8_0.gguf",
     "--slots", "--slot-save-path", "/prompt_cache",
     "--temp", "0.7", "--top-p", "0.8", "--top-k", "20", "--min-p", "0",
     "--no-mmap", "--ctx-size", "8192", "-ngl", "49", "-sm", "row",
     "-b", "1028", "--reasoning-budget", "0",
     "--props", "--metrics", "--log-timestamps",
     "--host", "0.0.0.0", "--port", "1234", "--cache-reuse", "32",
     "--jinja", "--chat-template-file", "/chat-template/template.jinja"]
the docker image isn't published, but I can publish it if you want and provide more information about the volume mount paths / specific directories the image relies on
with regards to llm-d, the kubernetes-native implementation of vLLM: we haven't touched anything other than llama.cpp, so no, we haven't tried it, but it's on our list! we're achieving something similar to a shared cache between pods with a shared virtual mount point where all the pods can save and retrieve KV caches, so session stickiness isn't strictly required between successive messages in the same conversation
rig power consumption measured at the plug is 150 W idle and 550 W at full load during heavy prompt processing (explain quantum mechanics in 2000 words)
gildedseat@reddit
Great project. I'm curious how you feel your overall operating costs for power etc. compare to using more modern hardware. is this 'worth it' vs newer hardware?
rasbid420@reddit (OP)
i couldn't say because i haven't gotten my hands on any newer hardware to test it out.
however i can imagine pulling 200 tps for prompt eval must be amazing! i think the greatest weakness of these old Polaris cards is that if you have a big initial prompt it will take forever to receive the first token
in terms of operating costs they're negligible at the moment since we pay very low electricity costs of 6 c/kWh, and the electrical bill is nothing compared to the mining activity, where it represented 75% of our operating costs
HollowInfinity@reddit
Huh that's interesting, I'm trying the '--reasoning-budget 0' param for the latest repo build of llama.cpp server and it doesn't seem to do anything for my local Qwen3-30B-A3B-Q8_0. I would love to force reasoning off in the server instead of session - do you have any tweaks you did to make this work?
DeltaSqueezer@reddit
Sure, if it isn't too much trouble, I'd be interested in seeing the Dockerfile to see how it was all put together.
Here's the link to the LMCache I mentioned:
https://github.com/LMCache/LMCache
rasbid420@reddit (OP)
thank you very much for the link; i must say it looks very promising indeed and we absolutely have to look into vLLM with this LMCache Redis stuff!
here's the docker image
https://hub.docker.com/r/rasbid/llama-server
DeltaSqueezer@reddit
vLLM is essential for efficient batching/KV cache utilization across multiple streams. However, given that the RX 580 only has around 6 TFlops of compute, I'm not sure how much you can squeeze out of it/benefit from it.
polandtown@reddit
A dream weekend project! Congrats, and would love to hear a pt2 from all of the responses!
rasbid420@reddit (OP)
definitely coming back with updates following the great advice received from this wonderful community! Thank you!
polandtown@reddit
awesome - looking forward to it. this was such a cool post, tyvm for sharing
Pentium95@reddit
6.400 GB of VRAM? i bet you can run pretty large models! have you tested models like deepseek R1? how many tokens per second did you achieve?
rasbid420@reddit (OP)
hello! unfortunately we couldn't manage to pool the resources of multiple rigs to achieve a total available VRAM higher than 48GB (6x8GB)
the best we could do with the resources we have is qwen3-30b_Q8_0, which is around 32GB, leaving some extra space for conversation context
Remote_Cap_@reddit
Why not connect the nodes with llama.cpp rpc?
rasbid420@reddit (OP)
does llama.cpp support inter-node RPC for multi-node model parallelism or distributed inference? i thought that each instance runs independently and cannot share model weights or KV cache over RPC!
farkinga@reddit
The key is to specify GGML_RPC=ON when building llama.cpp so that rpc-server will be compiled. Then launch rpc-server on each node, and finally orchestrate the nodes with llama-server (a rough sketch follows below).
Seems like this could work!
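Roughly like this (hostnames, port, and model path are illustrative, and the exact flags are from memory, so double-check against the llama.cpp RPC docs):

# build with the RPC backend enabled alongside Vulkan
cmake -B build -DGGML_VULKAN=ON -DGGML_RPC=ON
cmake --build build --config Release

# on each GPU node: expose the local backend over the network
./build/bin/rpc-server -H 0.0.0.0 -p 50052

# on the orchestrating node: point llama-server at all the workers
./build/bin/llama-server -m /models/model.gguf -ngl 999 \
  --rpc 192.168.1.11:50052,192.168.1.12:50052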
rasbid420@reddit (OP)
wow this is great stuff 100% trying this and getting back to you with the results! maybe we could in fact run a bigger model after all!
segmond@reddit
it's going to be super slow, speaking from experience, and I used faster GPUs. I mean, it's better to have the ability to run bigger models, even if slowly, than not at all. I would happily run AGI at 1 tk/sec rather than not at all if it were a thing, so have fun with the experiments.
rasbid420@reddit (OP)
sure thing!
but wouldn't you rather have a janitor sweep a floor instead of a PhD professor?
couldn't there be some easy tasks that should be delegated to inferior models?
CheatCodesOfLife@reddit
Let us know how it goes. For me, adding a single RPC node ended up slowing down deepseek-r1.
5x3090 + CPU with -ot gets me 190t/s prompt processing + 16t/s generation
5x3090 + 2x3090 over RPC @2.5gbit LAN caps prompt processing to about 90t/s and textgen 11t/s
vllm doesn't have this problem though (running deepseek2.5 since I can't fit R1 at 4bit) so I suspect there are optimizations to be made to llama.cpp's rpc server
rasbid420@reddit (OP)
that's an amazing setup you have right there!
very impressive stuff, do you use it for personal or commercial application?
we're going to test the RPC distributed inference next week and come back with updates!
farkinga@reddit
I'm excited to hear the results!
It so happens I was researching distributed llama.cpp earlier this week. I had trouble finding documentation for it because I didn't know the "right" method for distributed llama.cpp computation. The challenge is: llama.cpp supports SO MANY methods for distributed computation; which one is best for a networked cluster of GPU nodes?
Anyway, to save you the trouble, I think the RPC method is likely to give the best results.
Very cool project, by the way.
Remote_Cap_@reddit
Search it up brother. It does.. OG G added it over a year ago.
TheTerrasque@reddit
last i checked it only worked with a CLI program and wasn't supported by llama-server. Has this changed?
rasbid420@reddit (OP)
thanks a lot! I'll look into it 100%. i've been solely focused on solutions / suggestions provided by reddit and haven't looked too much into llama.cpp itself, although I should
Marksta@reddit
You can handle this with the RPC client, but you'll need a port per GPU. It shouldn't be too bad if you assign ports in numerical order over some range and auto-run it on boot or something. But also check out the GPUStack project. It'll give you an interface and the auto-discovery logic for llama.cpp RPC clients for free. You'll just need to build or download a llama-box Vulkan binary and put it in the install folder yourself; out of the box it isn't configured to set up Vulkan yet, but it does work once you add the binary.
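The port-per-GPU idea would look roughly like this (the GGML_VK_VISIBLE_DEVICES variable and the port range are assumptions on my part; adjust to your setup):

# one rpc-server per GPU, each pinned to one Vulkan device on its own port
for i in 0 1 2 3 4 5; do
  GGML_VK_VISIBLE_DEVICES=$i ./build/bin/rpc-server -H 0.0.0.0 -p $((50052 + i)) &
done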
rasbid420@reddit (OP)
Thank you! will give it a try!
Pentium95@reddit
absolutely reasonable, dumb assumption on my part. Linking rigs together would require a datacenter-grade interconnect, which would mean building everything from scratch at a cost that absolutely isn't worth it.
So you have hundreds of 48GB VRAM rigs, which, for the cost, is kinda impressive. How about the speed? Have you tried larger MoE models, like the new https://huggingface.co/mradermacher/ICONN-e1-i1-GGUF? You should be able to run a very high quant, like.. Q2_K_S, but i really wonder how many t/s you might get
rasbid420@reddit (OP)
it's not a dumb assumption!
that's exactly what I wanted to achieve in the first place but i was humbled by the hardware limitations of bridging as well!
yes, these rigs are very low cost, i'd estimate $400 each for 48GB of VRAM, plus low energy costs @ 6 c/kWh. maybe we could find some use for them, who knows
i haven't tried that specific model, but for qwen3-30B_Q8 we're getting around 15-20 tps for prompt eval and 13-17 tps for inference; what's interesting here is the high variation between rigs (some with inferior hardware pull 20 while others pull 15)
Pentium95@reddit
13-20 tps? not bad! i thought PCIe and the slower VRAM would bottleneck it even more.
MoE models are pretty amazing for this hardware
fun fact: the model i linked turned out, a few minutes ago, to be not exactly as "built from scratch" as the owner said; a few quants have been made private
Remote_Cap_@reddit
Why not connect them up with llama.cpp RPC?
https://github.com/ggml-org/llama.cpp/discussions/14266
Django_McFly@reddit
I'm always like, "why are they trolling" then I realize the poster is from a period = comma country. 6,400 GB not 6.4 GB.
BITE_AU_CHOCOLAT@reddit
The power consumption must be insane. Have you checked whether just going for recent cards like the 5090 wouldn't have been more efficient in the long run?
rasbid420@reddit (OP)
Hello,
yes, there is of course a very strong argument to be made for newer cards from certain points of view:
more efficient power consumption
bigger VRAMs
higher memory bandwidth
we didn't have the opportunity to select the cards at the beginning of this project
we were left with a sea of used-up RX 580s from our old Ethereum mine and we wanted to put them to use
the biggest advantage that these cards offer is cost; comparing at a glance:
5090 32GB = $3,000
6 x RX 580 8GB (48GB) = $400
very cheap VRAM
DepthHour1669@reddit
2 x AMD MI50 32gb = 64gb $240
https://www.alibaba.com/trade/search?SearchText=mi50%2032gb&from=header&
rasbid420@reddit (OP)
oh that's very cool! i wasn't aware of the AMD MI50
so much cheap VRAM!
Ok-Internal9317@reddit
potentially you might want to have a look at the Tesla M40; it has 12 GiB of VRAM and is only 20% the price of the MI50, around 1.4x faster than the RX 580, and has native CUDA support
rasbid420@reddit (OP)
those are all great competitive alternatives but I'm afraid that further optimizing hardware choice here isn't necessarily the main problem
rather what sort of use case could there be for older, cheaper, inefficient, not-so-sophisticated hardware?
kironlau@reddit
from the photo, the rigs must have been mining rigs in the past, then converted to LLM rigs (no one would buy so many 580s, especially for hosting LLMs)
rasbid420@reddit (OP)
that's true!
DepthHour1669@reddit
So… what was the point of this? Is this being used commercially? 800 gpus as a hobby project seems insane.
rasbid420@reddit (OP)
the point of all of this hasn't been found yet and this is not being used commercially
we really wanted to give a second breath of life to old Polaris cards because there are so many of them out there on the secondary market and they're very cheap cards for the amount of VRAM they offer ($50-$70 each for 8GB)
PutMyDickOnYourHead@reddit
But their power consumption-to-performance ratio is terrible. 185W for 8 GB VRAM on slow cores and low bandwidth. You'd be better off putting your money into one H100.
rasbid420@reddit (OP)
1 H100 may be more efficient in terms of power consumption, but it's not as efficient when it comes to $ spent per GB of VRAM
I think the H100 has 80 GB of VRAM and costs $25,000
while 10 x RX 580 totaling 80GB of VRAM would cost $700
the cost savings are orders of magnitude!
there has to be a use for this old equipment that doesn't necessarily have to be the most advanced at reasoning / achieving very difficult tasks
DepthHour1669@reddit
At that price range you’re a lot better off with a 16gb V340 tbh
https://ebay.us/m/Y0sQTW
rasbid420@reddit (OP)
that's correct!
there are so many exciting possibilities for the future of Local LLM!
kadir_nar@reddit
Thank you for sharing.
rasbid420@reddit (OP)
thank you but all merit goes to the localllm community!
a_beautiful_rhind@reddit
Heh.. you need an old kernel and an old ROCm for it to work: https://github.com/woodrex83/ROCm-For-RX580 https://github.com/robertrosenbusch/gfx803_rocm
There used to be another repo with patches. PyTorch likely needs a downgrade too.
I ran A1111 on the one I had so it definitely was functional at one point.
rasbid420@reddit (OP)
- woodrex83/ROCm-For-RX580: patched ROCm 5.4.2 installers specifically tailored for Polaris GPUs. The obstacle we encountered here: rocminfo and dmesg showed that only one GPU was being added to the KFD topology; the others were skipped due to lack of PCIe atomic support ("PCI rejects atomics 730<0")
- robertrosenbusch/gfx803_rocm: documented how to patch an older ROCm release (5.0–5.2) for gfx803 compatibility
we will definitely give it a couple more tries because i'm really interested in the speed comparison!
thank you for your recommendation
a_beautiful_rhind@reddit
Tried to use mine on PCIE 2.0 and no dice because of atomics support. I never tried multiple cards on my PCIE 4 system since I just have the one.
There was some chinese repo too but it was hard to find and I don't have the bookmark. It was full of patched binaries. I found it through issues on other repos. Look there because they used to sell 16gb versions of this card with soldered ram and I can't imagine they never had it working with at least last year's versions.
Old card is old.
rasbid420@reddit (OP)
https://github.com/xuhuisheng/rocm-gfx803
i think this is the repo you're referring to!
a_beautiful_rhind@reddit
Wow, time flies. Good thing people put more recent stuff in the issues.
cantgetthistowork@reddit
All this work and you could have just used gpustack/gpustack
Mr_Moonsilver@reddit
That's an impressive setup. Thank you for sharing this here!
rasbid420@reddit (OP)
Thank you! The LocalLLaMA community helped us very much with pointers in the right direction 4 months ago and it saved us a lot of time!