AMD Radeon AI Pro R9700 32GB vs. 2x RTX 5060 Ti 16GB for local setup?
Posted by vevi33@reddit | LocalLLaMA | 19 comments
How does the dual setup perform? Is it difficult to set everything up with, for example, llama.cpp?
I am asking since the dual setup would be way cheaper.
I am very satisfied with a few new models, and it would be nice to run Qwen 3.6 27B at higher quants.
Thanks in advance!
Enough-Astronaut9278@reddit
honestly I'd go with the R9700 32GB if you mainly care about running 27B models at higher quants. Having all 32GB on one card means you don't have to deal with tensor splitting across GPUs, which is always a bit of a headache in llama.cpp even though it works (see the sketch at the end of this comment).
Dual 5060 Ti is doable, but you're at the mercy of PCIe bandwidth between the two cards, and unless you have x16/x16 slots it's gonna bottleneck pretty hard on generation. Setup isn't terrible, but it's definitely more fiddly than a single GPU.
The tradeoff is ROCm vs. CUDA: ROCm has gotten way better for llama.cpp lately, but CUDA is still smoother overall. If you don't mind occasional driver weirdness, the single 32GB card is the simpler path, imo.
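A minimal sketch of the dual-GPU llama.cpp launch being discussed, assuming two equal-VRAM cards; the model path, split ratio, and context size are placeholders:

```bash
# Serve a GGUF model split across two GPUs.
# --tensor-split 1,1 divides layers evenly between the cards;
# --n-gpu-layers 99 offloads everything; -c sets the context window.
llama-server \
  -m ./qwen3.6-27b-q5_k_m.gguf \
  --n-gpu-layers 99 \
  --tensor-split 1,1 \
  -c 32768
```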
OddDesigner9784@reddit
Honestly, at this point just use Vulkan; it gets better performance than ROCm most of the time.
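If you want to try that, a sketch of building llama.cpp with the Vulkan backend, assuming a recent tree where the CMake option is GGML_VULKAN:

```bash
# Build llama.cpp against Vulkan instead of ROCm or CUDA
cmake -B build -DGGML_VULKAN=ON
cmake --build build --config Release
```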
Potential-Leg-639@reddit
I would always go with Nvidia if I had the choice.
vevi33@reddit (OP)
I have had very bad experiences with AMD. I bought an RX 7800 XT with 16 GB VRAM and the drivers are a nightmare compared to Nvidia, so it's a difficult choice :/
Potential-Leg-639@reddit
Especially for AI workloads, Nvidia is the better choice.
Bulky-Priority6824@reddit
Yeah, avoid AMD at all costs unless you're absolutely trying to pinch pennies.
Nvidia is just better: faster and easier. And resale value holds up better with Nvidia; old 3060s still fetch $250.
see_spot_ruminate@reddit
The card's max is x8 anyway; the 5060 Ti only has a PCIe x8 interface.
Kahvana@reddit
For raw performance, the AMD Radeon AI Pro R9700 will win hands down. But I've read multiple times on this subreddit that people returned it because of the noise.
Personally, I went with 2x ASUS PRIME RTX 5060 Ti 16GB because that specific model had the best air cooling and the least noise. I also wanted to buy one card at a time, and the power draw during inference is genuinely impressively low. u/Ok-Conflict391's numbers in this comment section are accurate. It can be slow (bartowski's Qwen 3.6 27B Q4_K_L generating at 10 t/s with 250k of q4_0 context), but it's "good enough" and does the job well.
While I am using a PCIe 5.0 x8/x8 motherboard so both RTX 5060 Ti's get their full lanes, some users report it might not be a huge factor. The ASUS ProArt Neo is cheap enough to use with it.
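For reference, a sketch of how the "q4_0 context" mentioned above is enabled in llama.cpp via KV-cache quantization; the model path and context size are placeholders:

```bash
# Quantize the KV cache to q4_0 so a very long context fits in VRAM.
# (Older builds may also need flash attention enabled, e.g. -fa,
# for quantized V-cache types to work.)
llama-server \
  -m ./qwen3.6-27b-q4_k_l.gguf \
  --n-gpu-layers 99 \
  --cache-type-k q4_0 \
  --cache-type-v q4_0 \
  -c 250000
```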
Xp_12@reddit
What are you doing wrong? I'm running 2x 5060 Ti with NVFP4 in vLLM: 40-70 t/s at 100k context with MTP, and 33 t/s base without MTP and with more context. I was getting over 23 t/s in llama.cpp at high context. That's on PCIe 4.0 x8/x1, too. People think bandwidth hurts like crazy, but not in pipeline or tensor parallelism; it's data parallelism where it bites. In TP you usually suffer most in prefill, not generation.
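A sketch of the vLLM tensor-parallel launch being described; the model path is a placeholder for whichever NVFP4 checkpoint you run:

```bash
# Shard one model across both 5060 Ti's with tensor parallelism
vllm serve ./qwen3.6-27b-nvfp4 \
  --tensor-parallel-size 2 \
  --max-model-len 100000 \
  --gpu-memory-utilization 0.90
```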
blojayble@reddit
I own 2x ASRock R9700s (will get an XFX one to compare soon).
Somewhat loud at 300W during long prefill (past 60°C), but almost inaudible during generation.
I think with some power management and maybe undervolting it would be pretty reasonable. I have a PA602 case.
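If you want to try that, a sketch of capping board power with ROCm's rocm-smi; the 250W figure is just an example:

```bash
# Cap GPU 0's package power to tame fan noise during long prefill
sudo rocm-smi -d 0 --setpoweroverdrive 250
```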
kiwibonga@reddit
Dual 5060 Ti for value: parallelism is working now, so is native NVFP4, and MTP just started getting support in llama.cpp. We're reaching single-5090-like performance for 4x cheaper.
hurdurdur7@reddit
2x 16GB VRAM means compromises on quants and context sizes, at least compared to 2x 32GB.
Gesha24@reddit
Out of those two, probably the single card; if nothing else, it makes things easier. I own an R9700 and the performance is OK, but Nvidia is still much better optimized. If you are not opposed to used gear, a Tesla V100 32GB can be bought for less than the R9700, and since we are primarily memory-bandwidth constrained, it will actually be faster than the R9700.
pepedombo@reddit
2x 5060 gives about 20 t/s with 27B Q4 f16 at the start and drops to 14-16 at 100k ctx (can't remember exactly). Qwen Code gives you 10-20k ctx at the start.
5070+5060 starts at 25 t/s and ends at an average of 18 t/s at 100k, but with 27B Q5 f16.
For 27B I'd rather get stronger GPUs, but as you've realized, it depends on $ :)
Ok-Conflict391@reddit
I very recently got myself a dual 5060 Ti setup. I'm getting 20 t/s on 27B Q4_K_M and 80 t/s on a 35B MoE Q6_K_M. I didn't do any optimizing though, just loaded up LM Studio and played around with models.
Also, I know you could technically overclock the VRAM chips to 512 GB/s, so you should be able to push 27B to 30 t/s easily.
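Rough arithmetic behind that estimate, as a sanity check (an upper bound that ignores KV-cache reads and other overhead): generation is memory-bandwidth bound because every token requires reading all of the weights once, so t/s ≈ bandwidth ÷ model size. A 27B Q4_K_M is roughly 17 GB, and with a layer split the two cards read their halves one after the other, so 512 GB/s ÷ 17 GB ≈ 30 t/s.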
vevi33@reddit (OP)
That indeed sounds promising, thank you for the info! And congrats on your new setup ^^
Ok-Conflict391@reddit
You're welcome. If you have any specific questions, feel free to ask.
autisticit@reddit
With dual 5060s, you should be able to get roughly 50-60 t/s with vLLM, and probably the same with the upcoming MTP patch in llama.cpp.
Ok-Conflict391@reddit
Alr, I did not know the performance gap was that big. Thanks for the info!