Curious how AMD (Radeon) GPUs can handle LLMs

Posted by siegevjorn@reddit | LocalLLaMA | View on Reddit | 34 comments

Hey folks. Since the GPU craze, I've been eyeing what's available right now: the RX 6800 and the 7600 XT. Both have a decent price-to-VRAM ratio with 16 GB. My concern is whether VRAM on AMD translates well to VRAM on Nvidia. For instance, will a 16 GB RX 6800 load the same model size as a 16 GB Nvidia GPU? For those of you who have both AMD and Nvidia GPUs (say, a 3090 and a 7900 XTX), what was your experience? Were you able to load the same model size on the 7900 XTX that you used to load onto the 3090? If AMD VRAM is less efficient, by how much? Is it 20% or 30%?

Another question is about ROCm support. I see from llama.cpp that any GPU with HIP support will be able to offload layers: https://github.com/ggerganov/llama.cpp/blob/master/docs/build.md#hip

According to AMD's site, that includes the RX 6800: https://rocm.docs.amd.com/projects/install-on-windows/en/latest/reference/system-requirements.html

So I can safely assume that anything that runs llama.cpp as its backend will run LLMs out of the box on RDNA2 (RX 6800), right?

Does the same apply to vLLM? vLLM only lists 7900 support: https://docs.vllm.ai/en/v0.6.5/getting_started/amd-installation.html

But does it support other 7000 series (RDNA3) GPUs? It seems like AMD has expanded its ML support to all RDNA3 GPUs: https://rocm.docs.amd.com/projects/radeon/en/latest/

If running vLLM in tensor parallel is possible, the $300 price of the 7600 XT sounds quite attractive.
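To make the "same model size" question concrete, here is a rough back-of-the-envelope sketch in Python. It only encodes the arithmetic that weight size follows from parameter count and bits per weight, which does not depend on the GPU vendor. The function names and the `overhead_gib` default are hypothetical placeholders, not measured values; how much runtime overhead (KV cache, backend buffers) actually differs between ROCm and CUDA is exactly the open question above.

```python
def weights_gib(n_params_billion: float, bits_per_weight: float) -> float:
    """Approximate size of the model weights alone, in GiB."""
    total_bytes = n_params_billion * 1e9 * bits_per_weight / 8
    return total_bytes / 2**30


def fits(n_params_billion: float, bits_per_weight: float,
         vram_gib: float, overhead_gib: float = 1.5) -> bool:
    """Return True if weights plus an assumed overhead fit in VRAM.

    overhead_gib is a made-up placeholder for KV cache and runtime
    buffers; the real figure depends on backend, context length, etc.
    """
    return weights_gib(n_params_billion, bits_per_weight) + overhead_gib <= vram_gib


if __name__ == "__main__":
    # An 8B model at 16 bits per weight versus a 13B model at ~4.5 bits per weight
    for params, bits in [(8, 16), (13, 4.5)]:
        size = weights_gib(params, bits)
        print(f"{params}B @ {bits} bpw: {size:.1f} GiB, fits in 16 GiB: {fits(params, bits, 16)}")
```

If the overhead turns out to be larger on one backend, raising `overhead_gib` for that card shows how much model size would be lost in practice.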