AMD R9700: yea or nay?
Posted by regional_chumpion@reddit | LocalLLaMA | View on Reddit | 32 comments
RDNA4, 32GB VRAM, decent bandwidth. Is ROCm an option for local inference with mid-sized models or Q4 quantizations?
| Item | Price |
|---|---|
| ASRock Creator Radeon AI Pro R9700 R9700 CT 32GB 256-bit GDDR6 PCI Express 5.0 x16 Graphics Card | $1,299.99 |
mustafar0111@reddit
It's a decent card, but they have it priced almost $300 too high for what it is.
Rich_Artist_8327@reddit
Yes, 4x 7900 XTX at 600€ each is OK.
Rich_Repeat_22@reddit
However, the 7900 XTX is much slower for LLM workloads, and it has no ECC.
RnRau@reddit
Eh? The 7900 XTX has higher memory bandwidth than the R9700.
Rich_Repeat_22@reddit
And? That doesn't mean it's slower.
Also, RDNA4 has a lot of enhancements when it comes to matrix computation: it supports FP8 and BF8 with improved performance, and the R9700 comes with ECC VRAM.
The R9700 is even 50% faster than the RTX 3090 at dense FP16 matrix math, and the 3090 is generally faster than the 7900 XTX.
MixtureOfAmateurs@reddit
And... LLM inference is memory bound: faster memory means faster inference. There's a degree of compute bottlenecking, and driver optimisation where the 9000 series would have an edge, but 644 GB/s vs 960 GB/s is too big a gap.
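As a back-of-the-envelope sketch of that memory-bound ceiling (the model size below is a made-up example, not a benchmark):

```python
# Rough decode-speed ceiling for memory-bound LLM inference: each generated
# token has to stream (roughly) all the active weights through VRAM once,
# so tokens/s is bounded by bandwidth / bytes read per token.

def decode_ceiling_tps(bandwidth_gb_s: float, model_gb: float) -> float:
    """Upper bound on decode tokens/s for a dense model."""
    return bandwidth_gb_s / model_gb

MODEL_GB = 17.0  # hypothetical ~30B model at Q4; swap in your own size
for name, bw in [("R9700", 644.0), ("7900 XTX", 960.0)]:
    print(f"{name}: ~{decode_ceiling_tps(bw, MODEL_GB):.0f} tok/s ceiling")
```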
Rich_Artist_8327@reddit
Yes and no. Wide memory bandwidth does not always mean faster inference; there are many factors.
Rich_Repeat_22@reddit
Yet:
The M3 Ultra is slow even though it has a lot of GB/s.
The 5090 has ~70% more bandwidth than the 4090, yet is only ~35% faster on average, which tracks with its +30% more cores and +15% higher clocks.
The R9700 is faster than the RTX 4500 Blackwell, even though the latter has 2.5x more bandwidth.
Tell me why?
RnRau@reddit
What benchmarks are you referencing? Do you have a link?
Rich_Artist_8327@reddit
Does ECC speed up inference?
shing3232@reddit
No, but ECC is needed for production deployment.
Zeikos@reddit
I have literally been waiting for it to hit the DIY market for months.
It'll take a while more to become available in the EU, hopefully it won't get scalped to oblivion.
Creative-Struggle603@reddit
It is already available in the EU (low stock). More brands are coming this month.
lly0571@reddit
Basically an AMD version of a 4080S with 32GB. Good if you need a warranty and can solve possible software issues yourself.
Rich_Artist_8327@reddit
Are there R9700 inference benchmarks somewhere? I have seen some YouTube videos.
Baldur-Norddahl@reddit
Get a motherboard with PCIe 5.0 and 4x R9700. A consumer motherboard will only give you x8 lanes per card for this, but that is probably OK since we are working with slower cards. Run tensor parallel and we are looking at a combined memory bandwidth of ~2,600 GB/s (4 × 644 GB/s) and 128 GB of VRAM for considerably cheaper than an RTX 6000 Pro (especially if you include the whole system).
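A minimal sketch of what that four-card tensor-parallel setup could look like with vLLM's Python API; the model name is a placeholder, and it assumes a ROCm build of vLLM that supports the model:

```python
# Shard one model across 4x R9700 with tensor parallelism in vLLM.
from vllm import LLM, SamplingParams

llm = LLM(
    model="Qwen/Qwen3-30B-A3B",   # placeholder model; pick your own
    tensor_parallel_size=4,        # one weight shard per R9700
    gpu_memory_utilization=0.90,   # leave headroom for activations/KV cache
)
outputs = llm.generate(["Why is LLM decode memory-bound?"],
                       SamplingParams(max_tokens=128))
print(outputs[0].outputs[0].text)
```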
NTFSynergy@reddit
I am confused why nobody talks about how hard it is to work with ROCm: getting it to run is one thing, getting it to run well is a whole other level.
ROCm's main priority is the MIxxx cards; the PRO is a consumer card (9070 XT) on VRAM steroids. It still has problems almost three-quarters of a year after release, PyTorch on RDNA4 is a performance rollercoaster, and Vulkan-based llama.cpp outperforms the ROCm backend. From experience, the PyTorch TunableOp variable must be set to get decent performance (see the sketch below), but that has its own caveats. Optimized GEMM kernels are still not a thing on RDNA4.
I have owned a 9070 XT since March and went through all of the pain: before ROCm 6.4.4, using TheRock, switching Linux kernels... Be aware that ROCm needs an older kernel (with HWE) than the latest stable. And the documentation was badly broken, full of contradictions and mistakes. It has gotten better, but, for example, you still have to take a wild guess at which PyTorch wheels to install: the official stable, the nightly, or the ROCm fork in the AMD repo (and there are two repos)? It is absolutely not "BFU" (average-user) friendly.
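For reference, a sketch of the TunableOp toggle mentioned above: these are the documented PyTorch TunableOp environment variables, but how much they help on a given ROCm/RDNA4 combination varies, as described.

```python
# Let PyTorch's TunableOp benchmark candidate GEMM kernels at runtime and
# cache the winners to a file. Set the env vars before importing torch.
import os
os.environ["PYTORCH_TUNABLEOP_ENABLED"] = "1"    # turn TunableOp on
os.environ["PYTORCH_TUNABLEOP_TUNING"] = "1"     # allow tuning, not just replay
os.environ["PYTORCH_TUNABLEOP_FILENAME"] = "tunableop_results.csv"

import torch  # on ROCm builds, the "cuda" device maps to HIP/AMD GPUs

a = torch.randn(4096, 4096, device="cuda", dtype=torch.float16)
b = torch.randn(4096, 4096, device="cuda", dtype=torch.float16)
c = a @ b  # first call of a GEMM shape triggers tuning; later calls reuse it
print(c.shape)
```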
Long_comment_san@reddit
It's far too expensive. I expect the 5000 Super cards to release, and then this thing drops to $1,100 max, optimally $900. It's about 5070/5070 Ti Super level of performance with 8 GB of extra VRAM, but no CUDA and no 4-bit precision, with a lot of driver shenanigans, for an extra $300. It's not an amazing deal. $900 is where it becomes fair, and $800 is where it starts to undermine a hypothetical 5000 Super. But AMD charges a huge premium because there's no competition. The dual Intel B60 with 48 GB of VRAM at $1,600 is exactly the same story.
AppearanceHeavy6724@reddit
...and 650 GB/s bandwidth. For $300 extra.
Ssjultrainstnict@reddit
Captured some of benchmarks on my thread https://www.reddit.com/r/LocalLLaMA/comments/1on4h8q/amd_ai_pro_r9700_is_great_for_inference_with/
Only_Situation_4713@reddit
It's slower than a 3090 and doesn't offer FP4. The 3090 can emulate FP8, and it's almost twice as fast. Also less of a headache...
Terminator857@reddit
Faster than a 3090 for models that fit in 32 GB of VRAM but not 24 GB, such as the popular Qwen3 Coder 30B at int8/fp8.
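A quick sketch of the arithmetic behind that (the parameter count is approximate):

```python
# Why a ~30B model at 8 bits/weight fits in 32 GB but not 24 GB.
# Weights only; the KV cache and activations need extra room on top.
params = 30.5e9        # Qwen3 Coder 30B-A3B parameter count (approximate)
bytes_per_param = 1    # int8 / fp8
weights_gb = params * bytes_per_param / 1e9
print(f"~{weights_gb:.1f} GB of weights")  # ~30.5 GB: over 24, under 32
```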
PaulMaximumsetting@reddit
I'll have to give a vLLM model a try next. GGUF models are usually a bit slower.
Qwen3-VL-32B-Instruct-UD-Q6_K_XL.gguf
KillerQF@reddit
The 3090 is 35 TFLOPS FP16.
The R9700 is 97 TFLOPS FP16.
The latter can likely emulate FP4 or FP8 faster.
Where the 3090 is better is bandwidth: 936 GB/s vs 645 GB/s.
And one is new with 32 GB; the other is used, with 24 GB.
Tyme4Trouble@reddit
The 3090 is 142 TFLOPS dense FP16 matrix.
KillerQF@reddit
Thanks for the correction.
3090: 142 TFLOPS
R9700: 191 TFLOPS
b3081a@reddit
FP8 Marlin kernels are way slower than native FP8 and nowhere near the card's theoretical tensor performance. If all you want is single-user decode performance (rather than batch decode/prefill), then the 3090's bandwidth is much more favorable, though.
Rich_Repeat_22@reddit
The 3090 offers neither FP4 nor FP8; it needs emulation, and performance tanks doing so.
On the R9700, FP8 and BF8 are fully supported, with improved performance.
FYI, FSR4 is FP8.
Don't confuse it with the 7900XTX and the rest of the RDNA3/3.5 lineup.
And here is the full list of RDNA4 WMMA/SWMMAC matrix instructions:
v_wmma_f32_16x16x16_f16
v_wmma_f32_16x16x16_bf16
v_wmma_f16_16x16x16_f16
v_wmma_bf16_16x16x16_bf16
v_wmma_i32_16x16x16_iu8
v_wmma_i32_16x16x16_iu4
v_wmma_i32_16x16x32_iu4
v_wmma_f32_16x16x16_fp8_fp8
v_wmma_f32_16x16x16_fp8_bf8
v_wmma_f32_16x16x16_bf8_fp8
v_wmma_f32_16x16x16_bf8_bf8
v_swmmac_f32_16x16x32_f16
v_swmmac_f32_16x16x32_bf16
v_swmmac_f16_16x16x32_f16
v_swmmac_bf16_16x16x32_bf16
v_swmmac_i32_16x16x32_iu8
v_swmmac_i32_16x16x32_iu4
v_swmmac_i32_16x16x64_iu4
v_swmmac_f32_16x16x32_fp8_fp8
v_swmmac_f32_16x16x32_fp8_bf8
v_swmmac_f32_16x16x32_bf8_fp8
v_swmmac_f32_16x16x32_bf8_bf8
ForsookComparison@reddit
The W6800 Pro has sat on used markets at about this price for over a year now.
This is that, but probably with better prompt processing and a hair faster inference.
If you never got excited by (or never came across) the W6800, you don't have to put much thought into the R9700, unless prompt processing was your one big stopper yet still isn't a huge requirement(?).
Woof9000@reddit
3.6 Roentgen, not great, not terrible.
regional_chumpion@reddit (OP)
That’s 1000 chest X-rays though.
Repsol_Honda_PL@reddit
Good price, but low core count and average performance.