With ROCm support on the RX 9060 XT 16GB, do we have a cheap alternative to 64GB of VRAM?
Posted by Loskas2025@reddit | LocalLLaMA | 24 comments
from https://videocardz.com/newz/amd-releases-rocm-7-0-2-with-radeon-rx-9060-support
Reading the news and considering that a card costs €300 + VAT, with €1200 + VAT you can get 4 cards for a total of 64GB of VRAM. I don't know the performance of the new drivers and I hope someone here tests them soon, but it seems like good news. Opinions? Also 160W x 4 = 640W. Cheap.
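As a quick sanity check on the arithmetic in the post, a minimal Python sketch (card price, count, and board power are as quoted above; the 21% VAT rate is an assumption for illustration):

```python
# Rough cost/VRAM/power arithmetic for the 4x RX 9060 XT idea from the post.
# The 21% VAT rate is an assumption for illustration; prices are as quoted above.
cards = 4
price_ex_vat = 300          # EUR per card, from the post
vat = 0.21                  # assumed VAT rate
vram_per_card = 16          # GB
tbp_per_card = 160          # W, from the post

total_price = cards * price_ex_vat * (1 + vat)
total_vram = cards * vram_per_card
total_power = cards * tbp_per_card

print(f"{total_vram} GB VRAM for ~EUR {total_price:.0f} "
      f"({total_price / total_vram:.1f} EUR/GB), ~{total_power} W board power")
# -> 64 GB VRAM for ~EUR 1452 (22.7 EUR/GB), ~640 W board power
```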
see_spot_ruminate@reddit
I think an interesting argument could be made for it, but I do think that this raises a different, albeit already explored, topic.
What is wrong with AMD that they do not offer day-one support for ROCm? At this point I believe that they have a curse upon them or are part of a large conspiracy. How can they do so well in the CPU market and flounder this badly with mistakes like this? Rant complete.
That said, like others have pointed out, you would need a specific motherboard with four full-size PCIe slots spaced perfectly to accommodate the cards. Most of the consumer boards out there seem to space their slots for triple-width cards, or just haphazardly.
mustafar0111@reddit
It's just AMD's release process. I think they are just resource-constrained and prioritizing enterprise.
The release process right now is everything CDNA gets day one support for the latest version of ROCm. Support for the latest version of ROCm then trickles down to the RDNA devices as they are able to roll it out.
see_spot_ruminate@reddit
I get it. I just… wish there was more competition.
Hopefully it will get better with UDNA.
Loskas2025@reddit (OP)
Ryzen AI: shared system memory (LPDDR5X-8000); max ≈256 GB/s bandwidth for the whole chip (CPU/iGPU/NPU).
RX 9060 XT (16 GB GDDR6): a very fast, dedicated bus; ≈320 GB/s of dedicated bandwidth.
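For context on why these bandwidth figures matter: single-stream decode speed for a dense model is roughly bounded by memory bandwidth divided by the bytes read per token. A back-of-the-envelope sketch, assuming a ~14 GB quantized model (one that fits in 16 GB of VRAM) as an illustrative example, not a benchmark:

```python
# Back-of-envelope decode ceiling: tokens/s <= bandwidth / bytes-touched-per-token.
# The 14 GB model size is an assumed example; real numbers come in lower due to
# overheads, KV-cache reads, and multi-GPU splitting.
def decode_ceiling_tps(bandwidth_gb_s: float, model_size_gb: float) -> float:
    return bandwidth_gb_s / model_size_gb

for name, bw in [("Ryzen AI shared LPDDR5X", 256), ("RX 9060 XT GDDR6", 320)]:
    print(f"{name}: ~{decode_ceiling_tps(bw, 14.0):.0f} t/s ceiling for a 14 GB model")
# Ryzen AI shared LPDDR5X: ~18 t/s ceiling for a 14 GB model
# RX 9060 XT GDDR6: ~23 t/s ceiling for a 14 GB model
```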
Ok_Top9254@reddit
2x MI50 32GB = 64GB for 350 bucks, and 1024 GB/s of bandwidth... I just wish AMD supported them for longer.
Sufficient_Prune3897@reddit
Bandwidth is not the problem with MI50s. They are just slow and old. Generation speed at the start with my 4x MI50s and GLM Air is 20 t/s, but just 15k tokens in, it's already at 3 t/s.
JaredsBored@reddit
Are you on the latest llama.cpp and ROCm 6.4? With Q5_K_XL GLM Air, a single MI50, and an Epyc 7532 system I'm getting 15tps with short prompts and little context. I'd expect more than 20tps with all layers going on GPU, and I've noticed less performance degradation with ROCm 6.4 and the recent llama.cpp enhancements.
Sufficient_Prune3897@reddit
Yes to both
JaredsBored@reddit
With my single MI50 (225W) and Epyc 7532 system, I'm seeing 15-16tps at 0 context with Unsloth's Q5_K_XL. At 13.6k context I'm still seeing 11.9tps (just rebuilt llama.cpp and ran this with zero prompt caching). That's with 34 layers on the CPU, 32k context at full F16, and flash attention enabled.
If you're only seeing 3tps at 15k context, something's gotta be wrong with your setup. 20tps at 0 context also really seems low
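For anyone wanting to reproduce this kind of partial-offload setup, a rough sketch of the sort of llama.cpp launch being described; the model filename, port, and exact layer split are assumptions rather than the commenter's actual command, while -m, -ngl, -c, and -fa are standard llama.cpp flags:

```python
# Hypothetical llama-server launch approximating the setup described above:
# partial GPU offload on a single 32GB MI50, 32k context with the default F16
# KV cache, flash attention on. Filename, port, and layer split are illustrative.
import subprocess

cmd = [
    "./llama-server",
    "-m", "GLM-4.5-Air-Q5_K_XL.gguf",  # hypothetical local filename
    "-ngl", "13",      # offload what fits on the GPU; the other ~34 layers stay on CPU
    "-c", "32768",     # 32k context
    "-fa",             # enable flash attention
    "--port", "8080",
]
subprocess.run(cmd, check=True)
```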
Ok_Top9254@reddit
They have about 1/4 the FP16 dense matrix throughput compared to the 9060 XT, yes, but I still think software is holding them back significantly. For one, I don't think we'll evaluate that much context at a time; humans don't work like that either. Secondly, there are tons of smart caching implementations already. I'm sure some maniacs with tons of free time are working on improving the usability of these cards, considering the crazy price per GB and per GB/s of VRAM.
tomz17@reddit
Nah, it's like the old Nvidia cards. The primary things holding them back are the lack of proper tensor units (which is why the prefill is so low) and the lack of primitive matrix ops for quantized matrices (e.g. you have to constantly upcast/downcast).
Nothing will ever fix that.
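To make the upcast point concrete, a toy NumPy sketch of the extra dequantize step that cards without quantized matrix primitives pay on every matmul (sizes and scales are made up for illustration):

```python
# Toy illustration of the upcast cost: int8 block-quantized weights have to be
# dequantized to float16 before the matmul, since there is no native quantized op.
import numpy as np

rng = np.random.default_rng(0)
w_q = rng.integers(-127, 128, size=(4096, 4096), dtype=np.int8)   # quantized weights
scales = rng.random((4096, 1), dtype=np.float32) * 0.01           # per-row scales
x = rng.random((4096,), dtype=np.float32).astype(np.float16)

# The extra step older GPUs pay on every matmul: upcast int8 -> f16 first.
w_f16 = (w_q.astype(np.float32) * scales).astype(np.float16)
y = w_f16 @ x                                                      # then a plain f16 GEMV
print(y.shape, y.dtype)   # (4096,) float16
```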
Sufficient_Prune3897@reddit
At that point you might as well pay the $120 more for 5060 Tis and get nearly double the performance plus CUDA support. Also, you will need a motherboard with enough PCIe slots.
mustafar0111@reddit
These have been benchmarked on YouTube head-to-head, and the 5060 Ti was not double the performance. It was maybe a 10-15% uplift for inference on text generation.
Sufficient_Prune3897@reddit
I wish that dude would have done llama.cpp benchmarks with models that just barely fit. I think LM Studio uses Vulkan by default on AMD, right? Vulkan is often faster than ROCm on single GPUs (especially on Windows), but behaves weirdly with multiple GPUs. At least that's what I've heard.
The CUDA advantage would be the ability to use better and faster backends like ikllama or exllama, not necessarily speed.
mustafar0111@reddit
He gave a range to give an idea of what real-world performance to expect. I've tested a couple of AMD cards as well; the performance gap has gotten very small.
The impressions some people on here give about AMD versus Nvidia performance for text inference are not remotely realistic these days. That said, as more people with AMD cards try text inference, actual performance is becoming more common knowledge.
LM Studio has the option for either Vulkan or ROCm. Both are selectable from the drop-down menu with a supported AMD GPU.
ExLlamaV2 works with ROCm under Linux. People are probably better off sticking with llama.cpp on Windows for now with AMD.
ParthProLegend@reddit
Two 5060s (I don't know if the Ti is better value for AI) and two 9060 XTs looks to be the best. You get excellent AI performance, as well as CUDA, as well as savings.
fallingdowndizzyvr@reddit
I think you would be better off getting a Max+ 395, since that budget puts you in striking range of a 64GB Max+ 395, and you get a pretty decent computer too.
lemon07r@reddit
Why wouldn't you just go dual 7900 XT or XTX at that point?
fallingdowndizzyvr@reddit
Because 48GB != 64GB.
sine120@reddit
I've worked through this thought experiment, and practically it never really comes out in favor of the lower-cost cards. The optimized lowest cost is 2x 5060 Ti 16GB running on consumer hardware: both PCIe 5.0 x8 connections are saturated and they'll perform best per dollar. If you're willing to pay to upgrade your mobo/architecture to something like Threadripper for more PCIe lanes, you're paying a lot more for the supporting board and CPU. Your performance is going to be close to a Mac with comparable system RAM, but the Mac will run off far less power.
If your only goal is to maximize VRAM per dollar it wins, but practically, for running models, 3090s are still the winner. 4x 24GB = 96GB of VRAM also opens up a much better tier of models you can run, for a ~20-30% increase in system cost.
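For reference on the PCIe point above, a rough link-bandwidth calculation (theoretical usable rates after 128b/130b encoding overhead; real transfers come in lower):

```python
# PCIe link bandwidth sketch: why x8 Gen5 slots are enough for a 5060 Ti
# (the card itself is a Gen5 x8 device). Roughly 3.94 GB/s per lane at Gen5.
def pcie_bw_gb_s(gen: int, lanes: int) -> float:
    per_lane = {3: 0.985, 4: 1.969, 5: 3.938}   # usable GB/s per lane
    return per_lane[gen] * lanes

print(f"Gen5 x8 : ~{pcie_bw_gb_s(5, 8):.1f} GB/s")   # ~31.5 GB/s
print(f"Gen4 x16: ~{pcie_bw_gb_s(4, 16):.1f} GB/s")  # ~31.5 GB/s
print(f"Gen4 x4 : ~{pcie_bw_gb_s(4, 4):.1f} GB/s")   # ~7.9 GB/s (typical chipset slot)
```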
Rich_Repeat_22@reddit
2x R9700s, which are around €1250 each. So for the price of a single 5090 you can have two of these for 64GB of VRAM, use Linux for the ECC VRAM support, and run vLLM.
Loskas2025@reddit (OP)
My post reads: lower cost for 64GB of video RAM. Obviously, by raising the price and power consumption, you get more. In fact, exactly double the GB/sec, processing power, and power consumption. So the growth is linear: 64GB with an RX 5600 costs half as much as 64GB with a 9700 for half the performance (and two more PCIe slots).
Rich_Repeat_22@reddit
Good luck getting ROCm running on an RX 5600.
Woof9000@reddit
Four of them would need a motherboard that comes with four full-size PCIe slots, or else you end up with some janky setup, which might defeat the purpose; if one is going for a janky build anyway, it might be much more cost-effective to get some old Teslas from Nvidia instead.
I have a PC with 2x 9060 XTs, a relatively cheap and effective 32GB VRAM build that fits everything neatly in a standard PC case, and that machine does everything I need: AI with models up to ~30B params, gaming on Linux, and everything else I expect from a PC.
The 9060 XT 16GB is a very fine card, as long as you know exactly what you want and it happens to match your wants and needs exactly.