LLM hardware acceleration—on a Raspberry Pi (top-end AMD GPU using a low-cost Pi as its base computer)
Posted by Colecoman1982@reddit | LocalLLaMA | View on Reddit | 25 comments
vk6_@reddit
This is certainly an interesting experiment, but when you look at it in terms of cost, efficiency, and performance, I don't see any situation where this has enough of an advantage to be practical.
In his accompanying blog post, Jeff Geerling cites a $383 USD cost for everything except the GPU. Meanwhile, there are x86 boards such as the ASRock N100M, which uses the similarly low-power N100 CPU in a standard mATX form factor. Since it's just a regular desktop PC, all the other components are cheaper and you don't need crazy M.2-to-PCIe adapters or multiple power supplies. Overall, a similar (and less janky) N100 setup costs about $260-300, excluding the GPU.
Regarding GPU performance, because the RPi is limited to AMD cards using Vulkan (not even ROCm), the inference speed will always be worse. On a similar x86 system, you can use CUDA with Nvidia cards, which also have a better price/performance ratio. On my RTX 3060 12GB (a card you can buy for $230 used), I get 55 t/s on ollama with llama3.1:8b. The 6700 XT that Jeff Geerling used, which costs the same, only gets 40 t/s. Also, because you have neither CUDA nor ROCm, you can't take advantage of faster libraries like vLLM.
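(For anyone who wants to reproduce a number like that: a minimal sketch, assuming a default local ollama install on port 11434 with llama3.1:8b already pulled; the eval_count/eval_duration fields of a non-streaming response give the generation rate.)

```python
# Rough tokens/sec measurement against a local ollama server (default port
# 11434); model and prompt here are just examples.
import requests

resp = requests.post(
    "http://localhost:11434/api/generate",
    json={
        "model": "llama3.1:8b",
        "prompt": "Explain PCIe lanes in one paragraph.",
        "stream": False,
    },
    timeout=600,
).json()

# eval_count = generated tokens, eval_duration = generation time in nanoseconds
print(f"{resp['eval_count'] / resp['eval_duration'] * 1e9:.1f} t/s")
```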
In terms of idle power consumption, you are looking at around 5 W more for the Intel N100. Even in the worst case, if you live somewhere like California with high electricity costs, that's an additional $13 per year at most. The extra hardware cost of the RPi doesn't pay for itself over time either.
And of course the user experience with setting up an RPI in this manner and dealing with all the driver issues and compatibility problems will be a major headache.
randomfoo2@reddit
It's a fun project and I hope he does whisper.cpp (and finds a Vulkan accelerated TTS next), but yeah, definitely impractical.
On eBay, I'm actually seeing 3060 12GBs being sold for as low as $100 (although Buy It Now pricing looks to be more in the $200 range). Honestly, plugging one into any $20 junk business PC from the past decade would be fine and would only add about 10 W of idle power (+10 W = 88 kWh/yr; at $0.30/kWh, about $25/yr in additional power), so you can go even cheaper. I agree that those mini-ITX low-power boards are pretty neat (Topton and Minisforum sell Ryzen 7840HS ones for ~$300, so you could put together some really powerful compact/power-efficient systems), even if they'd never pay off from an efficiency perspective.
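(The arithmetic behind that estimate, as a trivial sketch using the same assumed 10 W of extra idle draw and $0.30/kWh:)

```python
# Back-of-the-envelope idle-power cost: 10 W extra draw at $0.30/kWh.
extra_watts = 10
price_per_kwh = 0.30

kwh_per_year = extra_watts * 24 * 365 / 1000      # ~87.6 kWh/yr
cost_per_year = kwh_per_year * price_per_kwh      # ~$26/yr
print(f"{kwh_per_year:.0f} kWh/yr, ${cost_per_year:.0f}/yr")
```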
fallingdowndizzyvr@reddit
Completed prices? Where do you see that? Or are you confusing current bid with what it will actually sell for?
randomfoo2@reddit
Yes, click on "sold items" and scroll down. You can also go to usedrtx or aliexpress and see similarly priced ones. These are undoubtedly ex-mining cards, but at the end of the day, it probably doesn't matter all that much.
fallingdowndizzyvr@reddit
You don't need to scroll, just sort by lowest to highest price.
The vast majority of those sub-$100 3060 listings are for "parts" or even "box only" 3060s. They don't work. Of the ones listed as working, many are from sellers with 0 sales and thus 0 feedback, which just screams scam. Of the couple or so legit-looking listings, this one seems the most legit, since he has feedback from people who bought a 3060 from him. The other possibly-legit seller doesn't have any seller feedback at all.
But even for this seller, the ~$100 price was a unicorn, since at least one other 3060 he sold went for ~$170. That buyer got lucky. It's like winning the lottery; I wouldn't characterize it as a common occurrence.
There was someone who got a 3090 a few months ago for $300. He got lucky since no one else bid on it. I've been keeping my eye out for another $300 3090. No success so far.
Colecoman1982@reddit (OP)
Just as a follow-up to your point about the price ($385) for everything but the GPU, I think it should be possible to bring that down to around $285. First off, I've been able to find power supplies with equivalent wattage and 80 Plus efficiency ratings for $50 less than the one he lists there. Also, the hardware/cabling he's using to connect data and power to the GPU costs around $80, but he points out in one of his older posts (https://www.jeffgeerling.com/blog/2024/use-external-gpu-on-raspberry-pi-5-4k-gaming) that there is alternative hardware that costs ~$30 (https://pineboards.io/products/hat-upcity-lite-for-raspberry-pi-5).
Colecoman1982@reddit (OP)
All fair points. Though, at the very least, this project could help motivate AMD to take another look at supporting ROCm on ARM. Also, I believe Jeff Geerling has mentioned in the past that his specific choice of Pi-to-PCIe adapters isn't necessarily the lowest-cost option on the market, so there could still be some room to lower the total system cost even further.
Herr_Drosselmeyer@reddit
Cool but I don't see a practical application.
Ok-Recognition-3177@reddit
Power-efficient local voice assistant for Home Assistant; power efficiency will likely matter more to you in non-US countries.
Ok-Recognition-3177@reddit
I am infinitely delighted to see this
Downtown-Case-1755@reddit
What if AMD sold something like this? Like, a bottom-of-the-barrel APU attached to a 32GB GPU in a NUC-like case, so it can't be "misused" for workstation work, and sold cheap.
...That would be way too smart of them :(
randomfoo2@reddit
AMD will be selling Strix Halo soon so we'll see how much the demand actually is.
(Let's be honest though, general demand is probably close to zero atm, and people in r/LocalLLaMA would still complain about the price no matter how low it is, since you can still get 2 x P40s for $500 if you're looking, or, if you're more ambitious, 2 x MI100 or 2 x 3090 for $1600.)
Downtown-Case-1755@reddit
2x 24GB needs a whole system built around it though, while Strix Halo is a whole system and relatively low power.
...I think you are right though, AMD is gonna price it stupidly high, especially if there are no high-memory 8-core/1-CCD SKUs.
Colecoman1982@reddit (OP)
I'm still curious to see how the benchmarks compare to a full computer running the same LLMs on the same GPU. Clearly, the Raspberry Pi is enough to provide some good performance, but is it really fully equivalent to a regular PC? Also, I believe the Pi has a 4x PCIe connection. If that's the case, is it possible to connect more than one AMD GPU to a single Pi over 2x or 1x PCIe links and push the performance even further?
Downtown-Case-1755@reddit
+1
I'd be interested to see different frameworks too.
There are ARM builds of pytorch, right? ROCm exllama might be fast, since CPU overhead is so important here.
Colecoman1982@reddit (OP)
Sadly, the reason he had to use Vulkan in the link I provided is that AMD has, so far, stated that they have no intention of supporting ROCm on ARM...
Downtown-Case-1755@reddit
oof
Thellton@reddit
You would already be running into communication issues at 4x unless you were doing 'tensor sequentialism', i.e. the first half of the model sits on one GPU while the second half sits on the other, and only one GPU is active at any one time if there is only one prompt being worked on. The only way to mitigate that would be PCIe lanes faster than current standards allow.
Of course, I'd still buy a hypothetical high-end AMD GPU soldered to a low-end computer, with lots of VRAM and a very simple command-line system to run various inference servers for various modalities.
Colecoman1982@reddit (OP)
It was my understanding that's how programs like Exllama presently work for multi-GPU systems where the model being used is larger than will fit in a single GPU's VRAM. Is that not the case?
Thellton@reddit
In the case of the Pi + AMD GPU, the GPU carries all of the inference load while the Pi handles outputting the GPU's work to the user. If you add a second GPU to that mix, then yes, it works as you understand. As I understand it, the Pi and most other SBCs have PCIe 3.0 or worse, so the GPUs would be very slow at passing data to each other when needed, as well as at loading the model. That's generally not a huge problem in the case described, since the 'working memory' needed for the next token isn't huge. However, it would rule out fine-tuning completely.
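(For concreteness, a minimal sketch of that layer-wise split, not from the post: Hugging Face transformers with accelerate's device_map="auto" places contiguous blocks of layers on each GPU, so only the hidden state crosses the PCIe link between the halves; the model name and two-GPU host are assumptions for illustration.)

```python
# Sketch of splitting one model's layers across multiple GPUs so only one GPU
# is active per token; requires transformers + accelerate installed.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "meta-llama/Llama-3.1-8B-Instruct"  # example model, swap as needed
tok = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    device_map="auto",   # shards whole layers across all visible GPUs
    torch_dtype="auto",
)

inputs = tok("Hello", return_tensors="pt").to(model.device)
out = model.generate(**inputs, max_new_tokens=32)
print(tok.decode(out[0], skip_special_tokens=True))
```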
Shrug and all that
ChickenAndRiceIsNice@reddit
I'm currently developing a Mini-ITX motherboard with an x16 PCIe slot, a CM4 connector, and M.2 slots. So this is really exciting!
TheDreamWoken@reddit
I really regret my venture into the mini-ITX world, as it significantly worsens heat and longevity issues. For instance, my PSU is already about to fail, which has never happened to me before. I know it's because of how crammed everything is.
TheDreamWoken@reddit
So this is an external GPU adapter plugged into a Pi that has a fast enough connector? Cool.
wirthual@reddit
Would be cool to see what performance improvements llamafiles have in this setup.
https://github.com/Mozilla-Ocho/llamafile
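(If someone does try it: a hedged client-side sketch for timing a llamafile once it's running in server mode, which, as I understand it, exposes an OpenAI-compatible endpoint on localhost:8080 by default. The wall-clock number below isn't directly comparable to ollama's eval rate, since it includes prompt processing.)

```python
# Rough wall-clock throughput check against a llamafile running in server mode.
import time
import requests

t0 = time.time()
resp = requests.post(
    "http://localhost:8080/v1/chat/completions",
    json={
        "model": "local",  # placeholder; the local server ignores the name
        "messages": [{"role": "user", "content": "Write a haiku about PCIe."}],
    },
    timeout=600,
).json()
elapsed = time.time() - t0

tokens = resp["usage"]["completion_tokens"]
print(f"{tokens / elapsed:.1f} t/s (wall clock, includes prompt processing)")
```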
Colecoman1982@reddit (OP)
TLDR: He, along with others, has finally managed to get current- and previous-generation AMD GPUs connected to and running on a Raspberry Pi single-board computer (~$80) and to run LLMs on them using hacked-together driver support (Vulkan rather than ROCm). Apparently, because so much of LLM inference is GPU/VRAM-bottlenecked, this still manages to produce high token rates even though the Pi itself has little RAM and a slow processor.