AMD R9700: yea or nay?
Posted by regional_chumpion@reddit | LocalLLaMA | View on Reddit | 32 comments
RDNA4, 32GB VRAM, decent bandwidth. Is ROCm an option for local inference with mid-sized models or Q4 quantizations?
| Item | Price |
|---|---|
| ASRock Creator Radeon AI Pro R9700 R9700 CT 32GB 256-bit GDDR6 PCI Express 5.0 x16 Graphics Card | $1,299.99 |
mustafar0111@reddit
It's a decent card, but they have it priced almost $300 too high for what it is.
Rich_Artist_8327@reddit
Yes, 4x 7900 XTX at 600€ each is OK.
Rich_Repeat_22@reddit
However, the 7900 XTX is much slower for LLM workloads, and it has no ECC.
RnRau@reddit
Eh? The 7900 XTX has higher memory bandwidth than the R9700.
Rich_Repeat_22@reddit
And? That doesn't mean it's slower.
Also, RDNA4 has a lot of enhancements when it comes to matrix computation: it supports FP8 and BF8 with improved performance, and the R9700 comes with ECC VRAM.
The R9700 is even 50% faster than the RTX 3090 at dense FP16 matrix math, and the 3090 is generally faster than the 7900 XTX.
MixtureOfAmateurs@reddit
And... LLM inference is memory bound: faster memory means faster inference. There's a degree of compute bottlenecking, and driver optimisation where the 9000 series would have an edge, but 644 GB/s vs 960 GB/s is too big a gap.
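As a back-of-the-envelope sketch of that memory-bound ceiling (the model size below is a made-up example, not a benchmark):

```python
# Rough decode-speed ceiling for memory-bound LLM inference: each generated
# token has to stream (roughly) all the active weights through VRAM once,
# so tokens/s is bounded by bandwidth / bytes read per token.

def decode_ceiling_tps(bandwidth_gb_s: float, model_gb: float) -> float:
    """Upper bound on decode tokens/s for a dense model."""
    return bandwidth_gb_s / model_gb

MODEL_GB = 17.0  # hypothetical ~30B model at Q4; swap in your own size
for name, bw in [("R9700", 644.0), ("7900 XTX", 960.0)]:
    print(f"{name}: ~{decode_ceiling_tps(bw, MODEL_GB):.0f} tok/s ceiling")
```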
Rich_Artist_8327@reddit
Yes and no. Wide memory bandwidth does not always mean faster inference; there are many factors.
Rich_Repeat_22@reddit
Yet:
The M3 Ultra is slow even though it has a lot of GB/s.
The 5090 has ~70% more bandwidth than the 4090, yet is only ~35% faster on average, which tracks with its +30% more cores and +15% higher clocks.
The R9700 is faster than the RTX 4500 Blackwell, even though the latter has 2.5x more bandwidth.
Tell me why?
RnRau@reddit
What benchmarks are you referencing? Do you have a link?
Rich_Artist_8327@reddit
Does ECC speed up inference?
shing3232@reddit
No, but ECC is needed for production deployment.
Zeikos@reddit
I have literally been waiting for it to hit the DIY market for months.
It'll take a while more to become available in the EU, hopefully it won't get scalped to oblivion.
Creative-Struggle603@reddit
It is already available in the EU (low stock). More brands are coming this month.
lly0571@reddit
Basically an AMD version of a 4080S with 32GB. Good if you need a warranty and can solve possible software issues yourself.
Rich_Artist_8327@reddit
Are there R9700 inference benchmarks somewhere? I have seen some YouTube videos.
Baldur-Norddahl@reddit
Get a motherboard with PCIe 5.0 and 4x R9700. A consumer motherboard will only give you x8 lanes per card for this, but that is probably OK since we are working with slower cards. Run tensor parallel and we are looking at a combined memory bandwidth of ~2,600 GB/s (4 × 644 GB/s) and 128 GB of VRAM for considerably cheaper than an RTX 6000 Pro (especially if you include the whole system).
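A minimal sketch of what that four-card tensor-parallel setup could look like with vLLM's Python API; the model name is a placeholder, and it assumes a ROCm build of vLLM that supports the model:

```python
# Shard one model across 4x R9700 with tensor parallelism in vLLM.
from vllm import LLM, SamplingParams

llm = LLM(
    model="Qwen/Qwen3-30B-A3B",   # placeholder model; pick your own
    tensor_parallel_size=4,        # one weight shard per R9700
    gpu_memory_utilization=0.90,   # leave headroom for activations/KV cache
)
outputs = llm.generate(["Why is LLM decode memory-bound?"],
                       SamplingParams(max_tokens=128))
print(outputs[0].outputs[0].text)
```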
NTFSynergy@reddit
I am confused why nobody talks about how hard it is to work with ROCm: getting it to run is one thing, getting it to run well is a whole other level.
ROCm's main priority is the MIxxx cards; the PRO is a consumer card (9070 XT) on VRAM steroids. It still has problems almost three-quarters of a year after release, PyTorch on RDNA4 is a performance rollercoaster, and Vulkan-based llama.cpp outperforms the ROCm backend. From experience, the PyTorch TunableOp variable must be set to get decent performance (see the sketch below), but that has its own caveats. Optimized GEMM kernels are still not a thing on RDNA4.
I have owned a 9070 XT since March and went through all of the pain: before ROCm 6.4.4, using TheRock, switching Linux kernels... Be aware that ROCm needs an older kernel (with HWE) than the latest stable. And the documentation was badly broken, full of contradictions and mistakes. It has gotten better, but, for example, you still have to take a wild guess at which PyTorch wheels to install: the official stable, the nightly, or the ROCm fork in the AMD repo (and there are two repos)? It is absolutely not "BFU" (average-user) friendly.
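For reference, a sketch of the TunableOp toggle mentioned above: these are the documented PyTorch TunableOp environment variables, but how much they help on a given ROCm/RDNA4 combination varies, as described.

```python
# Let PyTorch's TunableOp benchmark candidate GEMM kernels at runtime and
# cache the winners to a file. Set the env vars before importing torch.
import os
os.environ["PYTORCH_TUNABLEOP_ENABLED"] = "1"    # turn TunableOp on
os.environ["PYTORCH_TUNABLEOP_TUNING"] = "1"     # allow tuning, not just replay
os.environ["PYTORCH_TUNABLEOP_FILENAME"] = "tunableop_results.csv"

import torch  # on ROCm builds, the "cuda" device maps to HIP/AMD GPUs

a = torch.randn(4096, 4096, device="cuda", dtype=torch.float16)
b = torch.randn(4096, 4096, device="cuda", dtype=torch.float16)
c = a @ b  # first call of a GEMM shape triggers tuning; later calls reuse it
print(c.shape)
```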
Long_comment_san@reddit
It's far too expensive. I expect the 5000 Super cards to release, and then this thing drops to $1,100 max, optimally $900. It's about 5070/5070 Ti Super level of performance with 8 GB of extra VRAM, but no CUDA and no 4-bit precision, with a lot of driver shenanigans, for an extra $300. It's not an amazing deal. $900 is where it becomes fair, and $800 is where it starts to undermine a hypothetical 5000 Super. But AMD charges a huge premium because there's no competition. The dual Intel B60 with 48 GB of VRAM at $1,600 is exactly the same story.
AppearanceHeavy6724@reddit
...and 650 GB/s bandwidth. For $300 extra.
Ssjultrainstnict@reddit
Captured some of benchmarks on my thread https://www.reddit.com/r/LocalLLaMA/comments/1on4h8q/amd_ai_pro_r9700_is_great_for_inference_with/
Only_Situation_4713@reddit
It's slower than a 3090 and doesn't offer FP4. The 3090 can emulate FP8, and it's almost twice as fast. Also less of a headache...
Terminator857@reddit
Faster than a 3090 for models that fit in 32 GB of VRAM but not 24 GB, such as the popular Qwen3 Coder 30B at int8/fp8.
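A quick sketch of the arithmetic behind that (the parameter count is approximate):

```python
# Why a ~30B model at 8 bits/weight fits in 32 GB but not 24 GB.
# Weights only; the KV cache and activations need extra room on top.
params = 30.5e9        # Qwen3 Coder 30B-A3B parameter count (approximate)
bytes_per_param = 1    # int8 / fp8
weights_gb = params * bytes_per_param / 1e9
print(f"~{weights_gb:.1f} GB of weights")  # ~30.5 GB: over 24, under 32
```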
PaulMaximumsetting@reddit
I'll have to give a vLLM model a try next. GGUF models are usually a bit slower.
Qwen3-VL-32B-Instruct-UD-Q6_K_XL.gguf
KillerQF@reddit
The 3090 is 35 TFLOPS FP16.
The R9700 is 97 TFLOPS FP16.
The latter can likely emulate FP4 or FP8 faster.
Where the 3090 is better is bandwidth: 936 GB/s vs 645 GB/s.
And one is new with 32 GB; the other is used, with 24 GB.
Tyme4Trouble@reddit
The 3090 is 142 TFLOPS dense FP16 matrix.
KillerQF@reddit
Thanks for the correction.
3090: 142 TFLOPS
R9700: 191 TFLOPS
b3081a@reddit
FP8 Marlin kernels are way slower than native FP8 and nowhere near the card's theoretical tensor performance. If all you want is single-user decode performance (rather than batch decode/prefill), then the 3090's bandwidth is much more favorable, though.
Rich_Repeat_22@reddit
The 3090 offers neither FP4 nor FP8; it needs emulation, and performance tanks doing so.
On the R9700, FP8 and BF8 are fully supported, with improved performance.
FYI, FSR4 is FP8.
Don't confuse it with the 7900XTX and the rest of the RDNA3/3.5 lineup.
And here is the full list of RDNA4 WMMA/SWMMAC matrix instructions:
v_wmma_f32_16x16x16_f16
v_wmma_f32_16x16x16_bf16
v_wmma_f16_16x16x16_f16
v_wmma_bf16_16x16x16_bf16
v_wmma_i32_16x16x16_iu8
v_wmma_i32_16x16x16_iu4
v_wmma_i32_16x16x32_iu4
v_wmma_f32_16x16x16_fp8_fp8
v_wmma_f32_16x16x16_fp8_bf8
v_wmma_f32_16x16x16_bf8_fp8
v_wmma_f32_16x16x16_bf8_bf8
v_swmmac_f32_16x16x32_f16
v_swmmac_f32_16x16x32_bf16
v_swmmac_f16_16x16x32_f16
v_swmmac_bf16_16x16x32_bf16
v_swmmac_i32_16x16x32_iu8
v_swmmac_i32_16x16x32_iu4
v_swmmac_i32_16x16x64_iu4
v_swmmac_f32_16x16x32_fp8_fp8
v_swmmac_f32_16x16x32_fp8_bf8
v_swmmac_f32_16x16x32_bf8_fp8
v_swmmac_f32_16x16x32_bf8_bf8
ForsookComparison@reddit
The W6800 Pro has sat on used markets at about this price for over a year now.
This is that, but probably with better prompt processing and a hair faster inference.
If you never got excited by (or never came across) the W6800, you don't have to put much thought into the R9700, unless prompt processing was your one big stopper yet still isn't a huge requirement(?).
Woof9000@reddit
3.6 Roentgen, not great, not terrible.
regional_chumpion@reddit (OP)
That’s 1000 chest X-rays though.
Repsol_Honda_PL@reddit
Good price, but low core count and average performance.