AMD Radeon AI PRO R9700 GPU Offers 4x More TOPS & 2x More AI Performance Than Radeon PRO W7800
Posted by _SYSTEM_ADMIN_MOD_@reddit | LocalLLaMA | View on Reddit | 34 comments
Sea_Self_6571@reddit
Someone please correct me if I'm wrong - but local inference is, in the vast majority of cases, memory bandwidth bound. I'm looking at the AMD card with the most memory bandwidth - the Radeon Pro VII, at 1024 GB/s. That's around the same as a 4090, and a bit over half of a 5090 (at ~1800 GB/s). However, the Radeon Pro VII only has 16 GB of VRAM, and has been around for a few years now. So I'm guessing for local inference, these cards will probably be worse than existing consumer-grade Nvidia cards?
Double_Cause4609@reddit
There's a bit of nuance to this, but yes, inference starts memory bound.
There's a bit of a problem when you start looking at how people actually use these devices, though. For instance, most people run quantized. I don't know how LlamaCPP handles upcasts at each q-value, but vLLM, for one, supports native FP8 operations. So if you compare a card with native FP8 support against one that only supports FP16 (more typical of older cards), the FP16-only card loses some of its memory bandwidth advantage to the data type upcast operation. It's not a huge deal, but it's noticeable.
Your total FLOP/s also determine the rate at which you lose speed as the context increases. For people who run 16k+ token context windows (which is not a small number of people), compute can start to matter a lot. It's hard to say exactly how much, but you could imagine a rule like "I would trade 100 GB/s of bandwidth for X FLOP/s per doubling of the context window". I haven't really looked into it, but a relationship like that probably exists.
There's also concurrency. If you make two parallel requests to the same endpoint, both take about as long to compute as a single request. If memory bandwidth is the limiting factor, and the dominant cost in bandwidth is loading the weights, then logically you may as well use each loaded weight multiple times. You can hit very high total T/s with concurrency this way, and while it doesn't stay free forever, in general the hit to latency is smaller than the gain in total T/s. As you add more parallel requests, you become more compute bound, so again, FLOP/s matter a lot.
If you're doing high context high concurrency...It adds up.
Now, does the average person need to worry about that? Maybe, maybe not. It depends on what kinds of user-friendly frontends we see taking advantage of it for things like agentic workflows. But it's possible, and developers who use LLMs locally might have a lot of requests they can parallelize against their local endpoint. For instance, instead of using a single context window to understand a codebase, they could break it up into several agents that communicate with each other about their portion of the codebase, as in the sketch below.
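As a rough illustration of that parallel-request pattern (a minimal sketch, assuming a local OpenAI-compatible server such as vLLM or llama.cpp's llama-server; the base URL, model name, and prompts are placeholders):

```python
import asyncio
from openai import AsyncOpenAI  # pip install openai

# Assumed local OpenAI-compatible endpoint (vLLM, llama-server, etc.);
# the base_url and model name below are placeholders.
client = AsyncOpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")

async def summarize(chunk: str) -> str:
    resp = await client.chat.completions.create(
        model="local-model",
        messages=[{"role": "user", "content": f"Summarize this part of the codebase:\n{chunk}"}],
    )
    return resp.choices[0].message.content

async def main(chunks: list[str]) -> list[str]:
    # Fire all requests at once; a batching server processes them concurrently,
    # so total throughput rises far more than per-request latency does.
    return await asyncio.gather(*(summarize(c) for c in chunks))

if __name__ == "__main__":
    print(asyncio.run(main(["# module A source...", "# module B source..."])))
```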
Calcidiol@reddit
I'm a dev so I get what you're saying but I don't know enough about the LLM internals and inference optimizations to really totally understand why this would be so.
AFAICT there's some sort of quadratic complexity involved in the generic transformer architecture base case - compute... memory... both...? That makes sense by analogy to non-ML algorithmic stuff where you have N things and for each of them you have to consider / operate on all of the other N things, so N**2.
The bit that isn't making sense to me, though: if the context-associated data (the positional embedding matrices, I guess) is IN MEMORY, then scaling that out NNNN token positions wide scales the memory linearly. And if you apply an N**2 compute burden on top of that (or even a linear one), you're still computing on operands stored IN MEMORY. So yes, as context grows you proportionally increase memory needs and compute needs, but for every multiply-add you've got two memory operand loads and a result accumulate, so how do you get to a case where you START memory bound but then become compute bound, when every compute operation does at least one memory operand load?
It seems like it would stay memory bound if it ever was, unless one assumes a large enough L1 cache or register bank, something so much faster than "ordinary memory" that doing small regional correlation / whatever compute on local operands is "nearly free" in terms of data throughput. Then the total compute cycles become the stated bottleneck, which wouldn't be the case if the compute operations had to fetch operands "slowly" from VRAM/HBM/RAM or whatever.
Double_Cause4609@reddit
So, starting from the top, this conversation will be really confusing if you don't have an intuitive handle on how big the various tensors in an LLM typically are.
In terms of components, the weights are generally matrices. That means an N*M-sized tensor per weight, and you can expect those to grow roughly quadratically as you scale up the model's hidden dimensions.
Typically, 90+% of the weight matrices are in the FFN. And because matrix-vector multiplication (which is what single-request decoding boils down to) is memory bound, the FFN is the reason LLM inference starts out memory bound. It's just a big block of matmuls.
The hidden state at any given point is only a vector (ie: size "M"). This means that loading weights to cache >> loading hidden state to cache.
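To put rough numbers on that (assuming Llama-3-8B-ish shapes, 4096 hidden dim and 14336 FFN dim, which are my assumption rather than anything stated here): one FP16 FFN weight matrix is 4096 × 14336 × 2 bytes ≈ 112 MB, while a 4096-dim FP16 hidden state is 8 KB, so the weight traffic outweighs the activation traffic by roughly four orders of magnitude per matmul.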
Attention, on the other hand, holds a minority of the weights in the model. What it does is compare every token against every other token; that's why it has an O(N^2) scaling characteristic by default. First it multiplies the Key and Query weights by every input to build the K and Q matrices, then it multiplies the transpose of one against the other to give you the Attention matrix. Then it softmaxes that matrix, multiplies it by the Value matrix, and possibly applies a linear output transform.
If you were paying very close attention (no pun intended) you might notice there's a really interesting behavior here: What if you kept the old Keys (and even the old Value matrix)?
Well, an interesting consequence is that you only add one new entry to the Key matrix per new token, and the Attention matrix itself is basically unchanged (just one new row and column per token). If you do a few funky things with this, you get KV caching, which makes the time to generate each new token linear in the context length. There's also "Flash Attention", which works in a different way (it's an IO-aware algorithm that computes exact attention without ever materializing the full attention matrix, so the memory cost becomes linear even though the compute stays quadratic). Long story short, these let you manage the computational and memory cost of Attention somewhat.
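For intuition, here's a minimal single-head decode loop with a KV cache (a toy NumPy sketch with made-up sizes, ignoring multi-head, masking, and the rest of the layer), showing that each new token only appends one row to K/V and does O(n) work:

```python
import numpy as np

def attend(q, K, V):
    # q: (d,), K: (n, d), V: (n, d) -> attention output of shape (d,)
    scores = K @ q / np.sqrt(q.shape[-1])
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()           # softmax over the n cached positions
    return weights @ V

d, steps = 64, 8
rng = np.random.default_rng(0)
Wq, Wk, Wv = (rng.standard_normal((d, d)) * 0.02 for _ in range(3))

K_cache = np.empty((0, d))
V_cache = np.empty((0, d))
x = rng.standard_normal(d)             # stand-in for the current token's hidden state

for _ in range(steps):
    q, k, v = Wq @ x, Wk @ x, Wv @ x
    K_cache = np.vstack([K_cache, k])  # only ONE new Key row per token
    V_cache = np.vstack([V_cache, v])  # ...and ONE new Value row
    x = attend(q, K_cache, V_cache)    # O(n) work this step instead of recomputing O(n^2)
```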
Anyway, the interesting thing about Attention is that at a really big context window its cost profile looks more like a CNN's. The reason is that you use the same Q, K, and V weights on every single token, so you're re-using the same weights in memory many times while the compute keeps growing.
You can do this with the FFN too, by batching (which is what I was referring to earlier): you calculate tokens for multiple requests at the same time, and because loading the weights is so expensive and loading a hidden state is so cheap by comparison, you end up with roughly the same latency for 4-8 requests as for a single one.
This doesn't scale forever (there are memory bottlenecks, real-world bottlenecks, CUDA overhead, etc.), but the end result is that batching gives you more total T/s, as the rough benchmark below illustrates.
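A quick way to see the batching effect (a CPU/NumPy stand-in, so the absolute numbers won't match a GPU; the FFN shape is just an assumed Llama-ish size):

```python
import time
import numpy as np

d_model, d_ff = 4096, 14336                             # assumed Llama-ish FFN shape
W = np.random.randn(d_ff, d_model).astype(np.float32)   # ~235 MB of "weights"

def bench(batch: int, iters: int = 20) -> float:
    x = np.random.randn(batch, d_model).astype(np.float32)
    t0 = time.perf_counter()
    for _ in range(iters):
        _ = x @ W.T                                      # one FFN-style matmul
    return (time.perf_counter() - t0) / iters

for b in (1, 2, 4, 8):
    print(f"batch {b}: {bench(b) * 1e3:.2f} ms per matmul")
```

On most machines the batch-8 time is nowhere near 8x the batch-1 time, because the dominant cost is streaming W out of memory, not the extra multiplies.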
Calcidiol@reddit
Thank you very much for the generous and informative response. That does help my understanding of the assertions I hear about scaling in some of these cases but I previously hadn't seen enough context for the scaling rationale / bottlenecks to make sense.
TeakTop@reddit
AI TOPS this, FP16 that. All I want is a 7900 XTX-class card, but with 48 GB of VRAM, that is actually obtainable. The cherry on top would be optimizing the silicon specifically for quantized models... the thing these cards are actually good for. AMD has a huge opportunity here, but it seems like they are uninterested. Intel, of all companies, seems more likely to fill this gap.
epycguy@reddit
all I want is for ROCm to be on an equal playing field with CUDA..
Calcidiol@reddit
In what respect? I fully agree that AMD could/should have better historical / current support for their AI/ML stack. Like CUDA, it should work on everything from their lowest-end consumer PC GPU up to their highest-end data center accelerator, in terms of compatible APIs, libraries, etc. Though they'd understandably have market / user-specific add-ons, like frame-rate interpolation / resolution-upscaling gimmickry for video gaming vs. advanced photorealistic rendering for cinematic / design studio use.
And it's good that they're (hopefully) dumping RDNA for the unified architecture. That should help unify device / software / library support and make it less likely that consumer cards are simply hobbled out of major capabilities at the architectural level.
But other than those things, which AMD is supposedly striving to fix, what's left is user apathy and voluntarily ceded market share.
If I as a developer had a "write it from scratch" project that did 3D or some kind of GPGPU work and needed a library for BLAS or DNN primitives or whatnot, I think I'd find ROCm works fine in a lot of the cases where the CUDA libraries work fine. NVIDIA offers a lot more market / industry-specific hardware lines and software than AMD, which is sensible given the sizes of the two companies. But at the basic generic "make 3D / linear algebra work" API level, I don't know that AMD's best offering is particularly worse than NVIDIA's nominal one, where and when it works on a comparable AMD GPU.
As a student / developer, though, I'm probably already heavily invested in NVIDIA / CUDA, etc., because I may have been using it for 15 years across education, research, and so on. At that point, why would I switch to AMD for libraries / APIs that are in many cases just as good, but incompatible with my existing code / scripts and knowledge? That's the bigger battle.
It's like trying to get people to switch from MS Windows to Linux or Mac. If they did do it, they'd probably largely get along fine with the other offerings, maybe even better in a lot of ways. But it's hard to convince people to change to "different", even "better", when there's a price tag and a learning / adaptation journey between them and the point where they're happy with "new".
rerri@reddit
I wonder how big the opportunity / market demand truly is... I mean, us r/LocalLLaMA peeps might have a bit of a skewed view about this.
HilLiedTroopsDied@reddit
Explain how used 3090s and 4090s sell for the same as or higher than their launch MSRP, and why $2k MSRP 5090s are only in stock at $2,900. It's a good-sized market.
rerri@reddit
In which country does a used 3090 sell for 1500 USD??
They sell for around 450-650 EUR here in Finland. The 3080 Ti, which is an almost identical card except for having half the memory, goes for around 350-450€. Not a drastic difference, even though the 3090 is much more valuable for AI purposes.
QuantumSavant@reddit
According to Ollama's statistics, DeepSeek has been downloaded some 50M times. So it seems to me that's a pretty big market. And that's just from one provider.
ursustyranotitan@reddit
Lmao, that's a bunch of companies using Ollama as a free CDN. There aren't 50M people on the planet who have ever even thought about downloading a DeepSeek model.
QuantumSavant@reddit
I didn't say those are all individuals. But the need is still the same.
DepthHour1669@reddit
Demand isn't high if all you want is VRAM.
You can buy a Vega II Duo 64GB for $1.1k https://ebay.us/m/bESiuP or $2k if you need a machine for it.
Or get an AMD MI50 32GB on eBay for $250 https://ebay.us/m/Adm6Dk
If you need CUDA, you can get an RTX 5000 24GB from China for $550 or so. Get two of those and an NVLink bridge and it'll do better at finetuning a model than a 3090 setup.
MDSExpro@reddit
So... Radeon Pro W7900?
No-Manufacturer-3315@reddit
They don't want to compete with their own product family, just look like they're competing.
DepthHour1669@reddit
If all you need is 48GB, just buy a $1.5k RTX 8000 48GB from China via a reshipper. Slap a $30 fan on it and you're good to go.
https://e.tb.cn/h.hdONszBl0Gw4cx5?tk=q7WGVFiOhJg
Works with the latest version of CUDA as well.
The only real downside is that it's a bit slower than a 3090, but if you buy two of these and an NVLink bridge you'll have 96GB as basically a single card (Quadro cards support NVLink memory pooling; the 3090 doesn't), and it's hard to complain about 96GB.
CatalyticDragon@reddit
Only DisplayPort 2.1b outputs. This is getting ridiculous. AMD, guys, come on, you need to put USB-C on your cards.
Thunderbolt displays are a thing, VR headsets are a thing, USB4v2 supports UHBR20 rates of 80Gbit, so just add it already.
Calcidiol@reddit
It also seems like it could be nice to network systems / cards point-to-point via PCIe over TB/USB4, or whatever IP networking path that could enable, etc.
CatalyticDragon@reddit
Another great use-case and something people are doing with miniPCs already.
Ragecommie@reddit
We had it for a brief moment with the USB-C VirtualLink GPUs...
CatalyticDragon@reddit
It's been available on consumer AMD GPUs since 2020. Every reference RDNA2 and RDNA3 GPU comes with a USB-C port, but there are no reference-edition RDNA4 cards, and this card is marketed to workstation users who AMD appears to think don't want/need it, which I am telling them is incorrect.
atape_1@reddit
Yes AMD you have awesome hardware, we know. NOW FIX ROCm.
sub_RedditTor@reddit
The problem is that most likely the AMD MI100 beats it.
In my humble opinion, for what it's worth, the memory bandwidth is dog 💩.
IngwiePhoenix@reddit
What it comes down to is cost. And availability - but honestly, first the price, then the stock. If it's not worth buying, then it doesn't matter how many they made.
Intel Arc B60 looking reeeeeeeally good so far, love the dual-GPU approach that some AIBs take, wish they had a release date...
Herr_Drosselmeyer@reddit
If you thought Nvidia were the only ones using BS comparisons. ;)
Rich_Repeat_22@reddit
Except if AMD plans to sell it in that price range, which makes sense. It isn't going after the RTX 5000 Pro Blackwell but the 5080, similar to how Intel positioned the B60 against the 5060 Ti, which sits in the same bracket, even if the B60 has 24GB VRAM.
The R9700 32GB is still slower than the 5090 32GB, so it makes no sense for it to be more expensive. Also, the chip is tiny and dirt cheap, using GDDR6 (not even 6X, let alone 7). The whole card probably costs AMD barely $40 more than the 9070 XT.
If it's going after the Intel dual B60, AMD will have to sell it between $1000 and $1500. And that would hit NVIDIA too.
If AMD sells it at $2500+, it will be a laughing stock, sell nothing, and fail to capitalise.
Herr_Drosselmeyer@reddit
Even if the price is right, it's still not an apples to apples comparison.
Rich_Repeat_22@reddit
Well, the R9700/9070 XT is about 85% of the 5090's performance, so it is comparable. The problem is pricing.
DepthHour1669@reddit
There's no 70-tier 32GB GPU from Nvidia to do an apples-to-apples comparison with.
djdeniro@reddit
And it's impossible to find a place to buy it.
sunshinecheung@reddit
Maybe the price will be 2x that of the RX 9070 XT.