AMD MI50
Posted by aspirio@reddit | LocalLLaMA | 10 comments
Hey all,
This question has probably popped up hundreds of times over the last months or even years, but since AI and everything around it evolves so fast, I'd like an up-to-date view on something.
Is it still worth buying an MI50 today to run a local LLM? I've read that ROCm support is long gone, that Vulkan is not that efficient, that some community patches allow ROCm 7.x.x but that running Qwen 3.5 with llama.cpp crashes, and so on (I'm fairly new to the local LLM game, so no judgement please).
I don't need to run a big model, but I'd like to spend the money well. Forget about the crazy $1000 graphics card setups; I can only afford a few hundred dollars, and even there I'd be cautious about what I buy.
I was initially going to buy a P40, as it seems like it should be enough for what I want to do, but on the other hand there's the MI50, which has 3x the bandwidth of the P40 and 8 GB more VRAM, for less than twice the price of the P40...
Any suggestions ?
ttkciar@reddit
I'm pretty happy with my MI50 with llama.cpp/Vulkan.
Vulkan has for the most part caught up with ROCm, though that seems to depend on the model, and you will find prompt processing to be slow. I don't care much about prompt processing, though, because most of my inference tasks have relatively short prompts and by far most of the time is spent on token generation.
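If it helps to see what that looks like in practice, here's a minimal sketch using the llama-cpp-python bindings; it assumes the package was built with the Vulkan backend enabled, and the model path and context size are placeholders you'd swap for your own.

```python
# Minimal llama-cpp-python sketch (assumes the package was built with
# CMAKE_ARGS="-DGGML_VULKAN=on" so the MI50 is picked up via Vulkan).
from llama_cpp import Llama

llm = Llama(
    model_path="models/your-model-Q4_K_M.gguf",  # placeholder path
    n_gpu_layers=-1,   # offload every layer to the GPU
    n_ctx=8192,        # constrain context so it fits in 32 GB of VRAM
)

out = llm("Summarize the MI50's strengths for local inference in one sentence.",
          max_tokens=128)
print(out["choices"][0]["text"])
```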
A 32GB MI50 will fit (with constrained context) lovely models like Gemma-4-31B-it, Skyfall-31B-v4.2, and Qwen3.5-27B, quantized to Q4_K_M, and with full (quantized) context Mistral 3 Small (24B) Q4_K_M derivatives.
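If you want to sanity-check what fits yourself, here is some back-of-the-envelope math; the bytes-per-weight figure for Q4_K_M, the layer count, and the KV width are assumed ballpark numbers, not measurements.

```python
# Back-of-the-envelope VRAM estimate. The ~0.55 bytes/weight figure for
# Q4_K_M and the layer/width numbers below are rough assumptions, and the
# KV-cache term ignores GQA, so it overestimates for modern models.
def vram_needed_gb(params_b, n_layers, kv_width, ctx,
                   bytes_per_weight=0.55,
                   kv_bytes_per_elem=1):          # ~q8_0-quantized KV cache
    weights_gb = params_b * bytes_per_weight       # billions of params -> GB
    kv_gb = 2 * n_layers * kv_width * ctx * kv_bytes_per_elem / 1e9  # K and V
    return weights_gb + kv_gb

need = vram_needed_gb(params_b=31, n_layers=48, kv_width=4096, ctx=8192)
print(f"~{need:.1f} GB needed, vs roughly 29 GB usable on a 32 GB MI50")
```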
Raredisarray@reddit
How is Gemma 4 on Vulkan? What's your tokens/s?
droptableadventures@reddit
At $125 a piece, they were a crazy good bargain that justified putting up with all the drawbacks on price alone.
At today's prices, probably not worth it.
dionysio211@reddit
I tend to agree with most people here that the MI50 can be a pain in the ass. I have spent countless hours figuring out how to maximize its output and running into constant struggles with vLLM. However, it can be great, depending on what you plan to do. For those fretting about vLLM, I have good news. Someone has taken up the mantle of continuing support for gfx906 (MI50s) and publishes updated versions of vLLM:
https://github.com/ai-infos/vllm-gfx906-mobydick
I am currently running Qwen 3.5 - 27B with TP=4 at ~50 tps and 1,800 tps prefill. I have not tried Gemma, but another user is posting benchmarks for it.
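For reference, this is roughly how a TP=4 run is launched with the stock vLLM Python API; whether the gfx906 fork above takes exactly the same arguments is an assumption, and the model id and sampling settings are placeholders.

```python
# Minimal vLLM sketch: split one model across 4 MI50s with tensor parallelism.
# Model name and sampling settings are placeholders; the gfx906 fork may need
# extra environment variables or build flags that stock vLLM does not.
from vllm import LLM, SamplingParams

llm = LLM(
    model="Qwen/Qwen3.5-27B",        # placeholder model id
    tensor_parallel_size=4,          # one shard per MI50
    gpu_memory_utilization=0.90,
    dtype="float16",                 # gfx906 has no native bf16 support
)

params = SamplingParams(temperature=0.7, max_tokens=256)
outputs = llm.generate(["Why does tensor parallelism help prefill throughput?"],
                       params)
print(outputs[0].outputs[0].text)
```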
Someone has also written a custom flash attention library for gfx900 (which also works on gfx906) that looks very promising:
https://www.reddit.com/r/LocalLLaMA/comments/1s614i8/built_a_simple_pytorch_flashattention_alternative/
Here are some breadcrumbs I have picked up from these efforts, which other tinkerers may want to explore as optimization paths. It is not true that you must use Opus to implement these; even Qwen 3.5 27B was able to stumble across the same ideas. It is, however, helpful to use something like Opus to create a detailed plan:
16GB MI50s > 32GB MI50s, all else being equal - These cards do not have matrix cores, so they rely on dp4a for comparable acceleration. That does not close the gap on its own, so it has to be made up with raw compute: 8 x 16GB MI50s provide close to double the prefill of 4 x 32GB MI50s in a well-tuned setup. 32GB MI50s are modified 16GB MI50s, so they have the same compute.
64-wide wavefronts are not optimized in llama.cpp - If you get a competent model to mess around in llama.cpp and dig into this, you will find that you can double the prompt processing speed. I want to come back to it and do a PR to address it, but I have mostly been messing around with vLLM/SGLang lately.
DP4A is also not optimized - I know next to nothing about this, but if you feed an agent the gfx906 documentation, it can eke out a lot of efficiency that is left on the table by exploring the dp4a-related functions (see the small illustration below).
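For anyone unfamiliar with the term: dp4a is a single instruction that dot-products four packed 8-bit integers and accumulates the result into a 32-bit integer. A tiny Python emulation of the arithmetic (not GPU code), just to show what the hardware collapses into one instruction:

```python
import numpy as np

# Emulate one dp4a step: dot four signed 8-bit values against four others
# and accumulate into a 32-bit integer. On gfx906 this is a single
# instruction, which is why int8/int4 quantized matmuls can lean on it.
def dp4a(a4, b4, acc):
    a = np.asarray(a4, dtype=np.int32)   # widen before multiplying, like the HW
    b = np.asarray(b4, dtype=np.int32)
    return acc + int(np.dot(a, b))

acc = 0
acc = dp4a([127, -3, 12, 0], [1, 5, -2, 9], acc)
print(acc)  # 127*1 + (-3)*5 + 12*(-2) + 0*9 = 88
```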
We are a hair away from being able to run models that can rewrite most of these libraries, ad hoc, to bridge this gap. I recently ran through 1.5 billion tokens with Qwen 3.5 27B to adapt Mini-SGLang for Qwen 3.5. I ended up trying to do it with Opus 4.6 with several million tokens and never got it to work. However, running something stronger would probably work if you have enough tokens.
WhatererBlah555@reddit
Given today's price of the MI50, the fact that it is no longer supported in ROCm, and that using it for anything other than llama.cpp with Vulkan - like docling - will be a royal PITA, I would recommend getting something else with 32GB. If I were to buy a GPU today, I think I would choose an AMD AI PRO R9700, although I haven't looked deeply into its performance.
Pixer---@reddit
I bought 4 MI50s 6 months ago. Well, there is no real vLLM support, and llama.cpp is not the best for multi-GPU setups. The only 2 use cases I would say make sense are the $200 16GB version for smaller models, or buying something like 16 of the 32GB version for a big model. Buy CUDA cards instead. I would suggest the VRAM-modified ones, like a 3080 with 20GB VRAM or a 4080 with 32GB VRAM. These cards are way faster than the MI50.
crowtain@reddit
One more downside of the MI50 is the lack of NVLink; there is technically the possibility, but it's nearly impossible to find the cables.
NVLink or Infinity Fabric lets you add more GPUs over time, and not only increases the VRAM but also increases the speed with tensor parallelism.
I have 2 MI50s, but I'm fed up with the lack of support for the amazing Qwen 3.5; the 27B dense model in TP=2 would have been amazing.
Like SSOMGDSJD said, better to get a V100 and an SXM2 adapter; you'll be able to add more later. But you'll need custom cooling and such...
SSOMGDSJD@reddit
I considered the MI50 and ended up going with a V100 32GB; it runs Gemma 4 31B and Qwen 27B Q4_K_Ms at like 25-30 tok/s. Slow but usable. The SXM2 V100 32GB is like $500-ish, with an Arctic P8 Max HVAC-taped to the front. I'd recommend a PCIe riser cable to connect it, because the heatsink they come with is heavy lol.
You could use Claude Code to write custom kernels for your MI50 and get better speed than Google will tell you, but it's going to be a lot of debugging (for Claude Code).
Reusing architecture ideas from other GPUs is tough because the MI50 has a completely different setup: no matrix acceleration, 64-wide wavefronts instead of 32 (I am far out of my depth talking about this; I had Opus deep-research it and the answers contained these terms).
If you want the GPU itself to be a project then sure, go ham; look on Alibaba and you might get an MI50 32GB for around $400.
AstraMythos@reddit
Hey, the Mi50's outdated support might lead to headaches with control and safety on local LLMs - it's not worth the hassle now. Focus on setups that let you keep a tight grip on runtime oversight to avoid surprises. If you're new, check for hardware with solid community backing first.
sgmv@reddit
What price are you getting them at? $250? Sure. But the cheapest I found on eBay is $600 plus potential taxes. Better to get the $950 Intel; it will be supported, even if it's not in great shape at the moment.
Or ~$700 3090s, a much better choice.
I have also looked at the V100 32GB: a bit more expensive, but same story. Old card, inefficient, no more CUDA support, etc. Dead end.
Remember that it's a hobby. It only pays off if you don't have advanced coding tasks to do, if you don't value your time much, if you will use the cards a lot, and if power is very cheap/free. Otherwise, you're better off with subscriptions. Just a reminder, I know this is LocalLLaMA.
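To put the "only if power is cheap" part into numbers, here is a quick sketch; the wattage, daily hours, electricity price, and the $20/month subscription figure are all assumptions you should swap for your own situation.

```python
# Rough running-cost sketch. All inputs are assumptions: a 2x MI50 box drawing
# ~450 W under load, 4 hours of heavy use per day, $0.30/kWh electricity, and a
# $20/month subscription as the alternative. Hardware cost is not included.
watts, hours_per_day, price_per_kwh = 450, 4, 0.30
monthly_kwh = watts / 1000 * hours_per_day * 30
monthly_cost = monthly_kwh * price_per_kwh
print(f"{monthly_kwh:.0f} kWh/month -> ${monthly_cost:.2f} in electricity")
print("cheaper than a $20/month subscription" if monthly_cost < 20
      else "electricity alone already rivals a $20/month subscription")
```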