Curious how AMD (Radeon) GPUs can handle LLMs

Posted by siegevjorn@reddit | LocalLLaMA | 34 comments

Hey folks,

Since the GPU craze, I've been eyeing what's available right now: the RX 6800 and the 7600 XT. Both offer decent price/VRAM with 16GB. My concern is whether VRAM on AMD translates well to VRAM on Nvidia. For instance, will 16GB on an RX 6800 load the same model size as 16GB on an Nvidia GPU? For those of you who have both AMD and Nvidia GPUs (3090 and 7900 XTX), what was your experience? Were you able to load the same model size on the 7900 XTX that you used to load onto the 3090? If AMD VRAM is less efficient, by how much? Is it 20% less efficient, or 30%?

Another question is about ROCm support. I see from llama.cpp that any GPU with HIP support will be able to offload layers: https://github.com/ggerganov/llama.cpp/blob/master/docs/build.md#hip

According to AMD's site, that includes the RX 6800: https://rocm.docs.amd.com/projects/install-on-windows/en/latest/reference/system-requirements.html

So I can safely assume that anything that runs llama.cpp on the backend will run LLMs out of the box on RDNA2 (RX 6800), right?

Does the same apply to vLLM? vLLM specifies only 7900 support: https://docs.vllm.ai/en/v0.6.5/getting_started/amd-installation.html

But does it support other 7000-series (RDNA3) GPUs? It seems like AMD has expanded their ML support to all RDNA3 GPUs: https://rocm.docs.amd.com/projects/radeon/en/latest/

If running vLLM with tensor parallelism is possible, the $300 price of the 7600 XT sounds quite attractive.

34 Comments

Excellent_Bar_2638@reddit

Hello, I'm a little late here, but I would like to add that ChatGPT is very good at helping you set up an AI locally. You can give it your hardware specs and the instructions that you intend to follow, and it can help you with any problems. In the past week I have used ChatGPT to guide me through the ROCm installation process and setting up TinyLlama as a test. This was on a Sapphire RX 6800, so I can confirm that it can run 1B models. I'm still trying to set up Mistral at 7B, but it is theoretically possible.

BlueSwordM@reddit

On CachyOS (Linux), it was very easy to get llama.cpp working with my 5700 XT, MI50 (Radeon VII equivalent), and even my RX 580. You could also install `ollama-rocm` that way if you wanted to.

legit_split_@reddit

How did you do it exactly?

Better_Athlete_JJ@reddit

Here's a Docker image for inference on AMD GPUs. We tried it a while back with small models and it did well. It should help you quickly answer the performance question you have: [https://github.com/slashml/amd\_inference](https://github.com/slashml/amd_inference)

05032-MendicantBias@reddit

I had success using LM Studio with llama.cpp and Vulkan acceleration. It loads up the card, but I'm not sure I'm getting optimal performance. ROCm acceleration has been a bloodbath for me. For diffusion I used StabilityMatrix+ComfyUI+ZLUDA and that got good acceleration. I haven't yet succeeded in getting llama.cpp with ROCm acceleration working.

Nindaleth@reddit

> I haven't yet succeeded in getting llama.cpp with ROCm acceleration working.

This is how I set my env up and how I compile llama.cpp on Fedora for my RX 6700 XT:

    sudo usermod -a -G render,video $USER

    # add amd ROCm repo using amdgpu-install, I've used the following on my Fedora
    # sudo dnf install https://repo.radeon.com/amdgpu-install/6.3.1/rhel/9.6/amdgpu-install-6.3.60301-1.el9.noarch.rpm
    sudo dnf install rocm-hip-libraries rocm-hip-sdk
    reboot

    # add this into your .bashrc
    export PATH="/opt/rocm/bin:$PATH"
    export ROCM_PATH=/opt/rocm
    export LD_LIBRARY_PATH="/opt/rocm/lib:$LD_LIBRARY_PATH"
    export HSA_OVERRIDE_GFX_VERSION="10.3.0"

    # build in llama.cpp directory like this
    HIPCXX="$(hipconfig -l)/clang" HIP_PATH="$(hipconfig -R)" \
    cmake -S . -B build -DGGML_HIP=ON -DGGML_VULKAN=OFF -DAMDGPU_TARGETS=gfx1030 -DCMAKE_BUILD_TYPE=Release && \
    cmake --build build --config Release -- -j 16

    # binaries are now in llama.cpp/build/bin/

The "10.3.0" and "gfx1030" values are 6700 XT specific, but the rest of it should be applicable to any AMD GPU, I think.
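For anyone adapting those two values to a different card, here is a rough Python sketch of how they appear to relate. It assumes the usual AMDGPU target naming, where the last two characters of a name like `gfx1030` are the minor version and the stepping and everything before them is the major version. The helper name `gfx_to_override` is made up for illustration, and whether a given override actually works on a given card is something to check against the ROCm docs, not something this code can tell you.

```python
def gfx_to_override(target: str) -> str:
    """Turn a target name such as 'gfx1030' into a 'major.minor.stepping' string.

    Assumption: the name is 'gfx' + decimal major version + one character
    for the minor version + one character for the stepping.
    """
    if not target.startswith("gfx") or len(target) < 6:
        raise ValueError(f"unexpected target name: {target!r}")
    digits = target[3:]
    major = int(digits[:-2])        # e.g. '10' in gfx1030
    minor = int(digits[-2], 16)     # single character, parsed as hex
    stepping = int(digits[-1], 16)  # single character, parsed as hex
    return f"{major}.{minor}.{stepping}"


if __name__ == "__main__":
    # The pair used in the build above.
    print("gfx1030 ->", gfx_to_override("gfx1030"))
```

With the build target from the comment above, `gfx_to_override("gfx1030")` gives the same "10.3.0" string that is exported as `HSA_OVERRIDE_GFX_VERSION`.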

siegevjorn@reddit (OP)

Glad you got ZLUDA working! I guess that means ROCm doesn't work for FLUX or SD, which is kinda frustrating. Did getting ZLUDA to work involve significant headaches, or was it relatively straightforward in terms of docs? For llama.cpp with ROCm, I think Ollama would handle it natively for you: https://community.amd.com/t5/ai/running-llms-locally-on-amd-gpus-with-ollama/ba-p/713266

San4itos@reddit

Ollama, Flux, SD, and TTS all work with ROCm. The only thing that doesn't work on my 7800 XT is video generation. As for the speed, I can't compare it to Nvidia, but it's decent, and it's higher than with ZLUDA.

05032-MendicantBias@reddit

I tried for three days and couldn't do it. I tried the various driver/HIP configs on the compatibility matrix, all for naught. The closest I got was using Ubuntu 22 on WSL2 with the driver and HIP, and that got a chunk of the acceleration going, but it accelerated just the diffusion and not the VAE for some reason. I was about to try Ubuntu 24 on WSL2, but someone suggested Stability Matrix and that worked out. I'd like to get ROCm more stable to try more things, like image to 3D.

Zenobody@reddit

Dual-booting Linux might be easier.

05032-MendicantBias@reddit

I'm doing everything in my power to avoid dual boot. I'd rather build a dedicated AI server than dual boot my desktop. Also with Nvidia I didn't have to dual boot.

San4itos@reddit

I don't use WSL2. I dual boot.

celsowm@reddit

How about vllm?

U_A_beringianus@reddit

Works fine on Linux if you build llama.cpp with ROCm. Vulkan is also an option.

siegevjorn@reddit (OP)

Which GPU are you using?

ttkciar@reddit

Using AMD MI60 with Vulkan on Slackware Linux, here, since ROCm no longer supports MI60. It works. Looking forward to more Vulkan optimizations.

U_A_beringianus@reddit

6950XT

siegevjorn@reddit (OP)

Thanks. Good to know ROCm works well with RDNA2!

thetaFAANG@reddit

AMD is so far behind that a buyout wouldn't even be seen as reducing competition at this point.

rdm13@reddit

7900 XT here. Inference on koboldcpp-rocm works great and can handle up to 24B at Q4_K_M with 16k+ context. I could go a bit higher, but I prefer the speed and context.

siegevjorn@reddit (OP)

Sounds great. Is your go-to model Mistral Small 24B? It's awesome they released it under Apache 2.0.

rdm13@reddit

Mistral 24B and 22B are great.

ForsookComparison@reddit

llama.cpp. Having a great time.

siegevjorn@reddit (OP)

In the U.S., Best Buy still sells new 6800s for $420, so it may be worth looking into. I think if all RDNA3 cards get vLLM support, though, multiple 7600s could be much faster with tensor parallelism. I'm not sure whether all RDNA3 cards are supported, as vLLM doesn't mention it in their docs.

Zenobody@reddit

How much is the 7800XT?

siegevjorn@reddit (OP)

7800xt is sold out rn from major stores.

frivolousfidget@reddit

I can confirm that. I am an idiot and got it running.

stjepano85@reddit

> But my concern is whether the VRAM in AMD translates well to that of Nvidia. For instance, will 16gb of RX 6800 load the same model size as 16gb of Nvidia GPU?

RAM is RAM, so you should be able to load the same models at the same size on both. You may want to check whether AMD supports the same quantized models as Nvidia. For example, my GPU (Radeon 7800 XT) supports int4 quantization, where each model weight is scaled down to 4 bits, so I can fit roughly 4x larger models into my GPU than at fp16. My Radeon supports fp16, bf16, int8 and int4.

I've seen that people use weights even smaller than 4 bits, most notably 1.58-bit quantization for DeepSeek R1, but I am not sure how that works, whether it is supported by GPU hardware or is purely software. Even at that quantization DeepSeek cannot run on my computer, as I have 80GB of RAM in total.
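The "4x larger at 4 bits" point is just arithmetic on bits per weight, and it comes out the same whichever vendor made the card. A minimal Python sketch, assuming the weights dominate memory use and ignoring the extra scale data that real quantized formats store (so actual files come out somewhat larger); `weight_gib` is an illustrative helper, not part of any library:

```python
def weight_gib(n_params_billion: float, bits_per_weight: float) -> float:
    """Approximate size of the model weights alone, in GiB."""
    total_bytes = n_params_billion * 1e9 * bits_per_weight / 8
    return total_bytes / 2**30


if __name__ == "__main__":
    # A 7B model at the precisions mentioned above.
    for name, bits in [("fp16", 16), ("int8", 8), ("int4", 4)]:
        print(f"7B @ {name}: {weight_gib(7, bits):.1f} GiB")
```

Going from 16 bits to 4 bits divides the weight footprint by four, which is where the rough "4x larger model in the same VRAM" figure comes from.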

Rich_Repeat_22@reddit

Get a second 6800, not a 7600 XT. Alternatively, what you should have done 15 days ago was sell the 6800 and buy a used 7900 XTX, which were going for around $700. Now their prices have skyrocketed by almost 50%, mainly due to the RTX 50 failure and because DeepSeek R1 (and its distills) runs impressively well on the 7900 XTX. I know people who bought 6 just last week and set up a 120GB server by selling 2 RTX 4090s, with change left over at the end.

siegevjorn@reddit (OP)

7900 xtx for $700 sounds like a great price/VRAM ratio. Unfortunately missed the golden time. Hopefully GPU prices get better in the long run.

Zenobody@reddit

Don't get the 7600; it only has 288 GB/s of bandwidth. The 6800 has 512 GB/s. For inferencing I recommend using [koboldcpp-rocm](https://github.com/YellowRoseCx/koboldcpp-rocm). Just install it inside a Docker container with [pre-configured ROCm](https://hub.docker.com/r/rocm/dev-ubuntu-24.04/tags). It should be easier to set up than CUDA if you use Linux (I don't know about Windows).
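To see why bandwidth matters so much, here is a back-of-the-envelope Python sketch. It assumes token generation is purely memory-bound and that every weight is read once per generated token, so it gives an optimistic ceiling and not a benchmark. The 8 GB model size is a made-up example, and `max_tokens_per_second` is an illustrative helper.

```python
def max_tokens_per_second(bandwidth_gb_s: float, model_size_gb: float) -> float:
    """Upper bound on decode speed if each token requires reading all weights once."""
    return bandwidth_gb_s / model_size_gb


if __name__ == "__main__":
    model_size_gb = 8.0  # hypothetical quantized model size
    for card, bandwidth in [("7600 XT", 288.0), ("RX 6800", 512.0)]:
        ceiling = max_tokens_per_second(bandwidth, model_size_gb)
        print(f"{card}: at most ~{ceiling:.0f} tokens/s")
```

Real numbers will land below these ceilings because of compute, KV cache reads, and software overhead, but the ratio between the two cards is the part to look at.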

SporksInjected@reddit

For LLM, I’ve found rocm to be noticeably faster than Vulkan. They’re both really usable though on Linux. Like you said, setup can be really easy in docker and I haven’t really had problems otherwise with my 7900xtx or 6700xt via rocm.

darth_chewbacca@reddit

On Linux, getting Ollama set up to run LLMs is pretty easy. Arch Linux is the simplest, as you just `sudo pacman -S ollama-rocm`. On other distros it's also fairly easy, as the worst you'll have to do is use distrobox to set up an Arch Linux "container" and do what I just explained above. This can be daunting to new Linux users, but it's not so bad if you've used Linux for a while.

Image gen with SD/Flux/etc. is pretty easy too. Just follow the instructions on the ComfyUI GitHub repo. There are problems with certain nodes, and for some weird reason my 7900 XTX takes an extremely long time to render an image (like 1 minute vs. 7 seconds) if the size of the image changes from the previous one. But if you create an image of the same size, it's speedy.

Video generation is pretty poor though. My 7900 XTX can do it (LTX/Hunyuan), but it's REALLY REALLY slow, like 4 times slower than my tests on a 3090 using RunPod. Note: when I ran those tests, it was with ROCm 6.2; I have not tried with ROCm 6.3, which has improved image gen a significant amount.

I have not heard good things about doing AI with AMD on Windows, however. I have no experience with this, as I am a 100% Linux "shop".

LagOps91@reddit

I am running local AI using KoboldCPP on my 7950xtx. You can load a model as long as it fits into VRAM, pretty much exactly the same as with NVIDIA cards. The difference comes from performance: NVIDIA has CUDA, which is better supported than AMD's stack. A comparable NVIDIA card would simply process prompts and generate outputs faster.

Personally I am very happy with what I have, and I consider the amount of VRAM the most important factor, since that ultimately determines how large a model you can run. In terms of performance, I'm seeing no issue. I'm getting about 12 tokens per second of output at 32k context, which is significantly above reading speed.
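One thing worth adding to "fits into VRAM": the KV cache for the context window takes memory on top of the weights, and it grows linearly with context length. A rough Python sketch, assuming a standard transformer with an fp16 cache; the layer count, KV head count, and head dimension below are hypothetical values chosen for illustration, not the specs of any particular model:

```python
def kv_cache_gib(n_layers: int, n_kv_heads: int, head_dim: int,
                 context_len: int, bytes_per_value: int = 2) -> float:
    """Approximate KV cache size in GiB: one key and one value vector per layer per token."""
    total_bytes = 2 * n_layers * n_kv_heads * head_dim * context_len * bytes_per_value
    return total_bytes / 2**30


if __name__ == "__main__":
    # Hypothetical architecture: 40 layers, 8 KV heads, head dimension 128.
    for context_len in (8192, 32768):
        size = kv_cache_gib(40, 8, 128, context_len)
        print(f"{context_len} tokens of context: ~{size:.2f} GiB of KV cache")
```

So when budgeting a 16GB or 24GB card, it helps to add the weight size and the cache size for the context you actually plan to run, on AMD and NVIDIA alike.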