R9700 the beautiful beautiful VRAM gigs of AMD… my ai node future!

[-]

AbbreviationsSad5582@reddit

I’m looking to build something similar but with an X299 Sage board with the goal of running MiniMax locally. How’s your experience so far with the r9700?

[-]

HopePupal@reddit

installed my first one today (same ASRock model) and i already want another one. you using llama.cpp in row split mode or vLLM or what?

[-]

Llama.cpp now automatically splits the model between gpu. 2x 9700s with qwen3 coder next Q_4K_M 98% on 1 and 82% on 2 is 600-2000 t/s prompt and 40-60 response on rocm build with 200k context depending on how full the context is. Still tuning and trying other things to lessen the speed drop off. The pair is comparable to my 5090 running the same model. Qwen3.5 27b Q8 is faster on 9700s but slower than 3 coder next (350-500/25-30).

[-]

HopePupal@reddit

llama.cpp has two split modes, layer and row. layer is the default and puts some entire layers and parts of the KV cache on each GPU, which doesn't increase speed for any given single request and will in fact leave one of your pair idling while serving each half of the request, while row splits each layer across the available GPUs, keeping the KV cache on one.

tl;dr: the default might not be the best, if you're using layer try `row

[-]

TaroOk7112@reddit

Try tensor! it's new.

[-]

HopePupal@reddit

ooof, reading that PR i'm glad tensor exists now because it sounds like row only works on CUDA

[-]

blackhawk00001@reddit

I’ll give it a try later, thanks

[-]

nostriluu@reddit

Nice. They seem like great cards.
I have a 3090ti and I've been considering adding an R9700 to my 12700k / 64gb DDR4 system so I can do CUDA things and also experiment with larger models. Two things are holding me back.

One, I'd have to get a new system board that can do x8 for both cards, which are rare and expensive, and I'd really rather get new gear rather than an expensive four year old board.

The other, I've read the R9700 is noisy. I have an open case, so at least airflow shouldn't be an issue.

Any advice on "hybrid" setups or noise levels of R9700?

[-]

Downtown-Example-880@reddit (OP)

If I were you I’d make a seperate AMD node! Like two islands, interconnected with Ethernet NIC OR infiniband…

[-]

nostriluu@reddit

Interesting, but it doesn't sound like the thing to do if I want it to "just work."

[-]

Downtown-Example-880@reddit (OP)

with claude code, setup sets itself up, lol

[-]

foxpro79@reddit

What does prefill look like for qwen3.5 or gemma4? I’m thinking of grabbing one of these for the price but not sure how it will compare to my current setup with a 5060ti

[-]

Equal_Passenger9791@reddit

Didn't intel just release an even cheaper 32gb vram GPU? The Nvidia alternatives are finally piling up.

Anyway enjoy your stack, I've considered getting one a few times but have since promised myself to not buy one until I've actually needed to spend $1000 on cloud rentals, my 4090 goes pretty far for most projects I've taken on so far.

[-]

Thrumpwart@reddit

Any idea how they do for LLM in terms of speed?

[-]

_WaterBear@reddit

Quick test w 3x R9700s, windows 11, LMStudio, ROCm: Nemotron-3-nano (q8): 80t/s Nemotron-3-super (q4km)14t/s GPT-OSS-120b: 80t/s GPT-OSS-20b: 105 t/s Qwen-coder-next (q6k): 51 t/s Qwen3.5-35b-a3b (Q4km): 60t/s Qwen3-coded-30b (q4km): 75 t/s

For those that fit on 1 card, I notice about a 15 t/s drop when running it on multi-GPUs.

The t/s are all over the place and vary considerably by the model and probably latest driver or wrapper… so the numbers above are just ballparks.

[-]

Thrumpwart@reddit

Sorry, I meant the cheap Intel 32GB gpus.

Your numbers are interesting though. Is it running tensor parallel or pipeline parallel? Neat to see a 3x gpu rig inferencing like that.

[-]

freefall_junkie@reddit

I have 2 B70’s arriving tomorrow. I am sure that getting all the intel drivers and swapping to vLLM from my hodgepodge 4070 + 2070s using ollama will take a while, but I plan on doing some benchmarking. (Worth noting that they will be on my x570 taichi board, so pcie 4.0 x8 rather than the rated pcie 5.0 x16)

[-]

Thrumpwart@reddit

Post bench numbers once you’ve got everything sorted please!

[-]

freefall_junkie@reddit

Hey, got my 2x Arc B70 Pro setup working with vLLM 0.17.0-xpu. Still doing more testing and plan to do a full writeup this week with configs, docker-compose files, and detailed benchmarks, but here's what I've seen so far:

Hardware: 2x B70 Pro (32GB each), Ryzen 5 3600X, 48GB RAM, PCIe 4.0 x8, Ubuntu 24.04 w/ kernel 6.17

DeepSeek-R1-Distill-Qwen-32B (dense 32B, FP8 dynamic):

\~22.5 tok/s single user (tensor parallel)
\~14.6 tok/s single user (pipeline parallel)
16 GiB weights per GPU, 84K token KV cache
TP is the clear winner for dense models on PCIe

Qwen3-30B-A3B (MoE 30B/3B active, FP8 dynamic):

\~18-19 tok/s single user (both TP and PP similar, PP slightly faster)
\~294 tok/s total throughput at 16 concurrent (PP) vs \~264 tok/s (TP)
14.5 GiB weights per GPU, 293K token KV cache
PP actually wins for MoE on bandwidth-constrained PCIe — fewer inter-GPU transfers per token

Interesting finding: pipeline parallelism beats tensor parallelism for MoE models on PCIe 4.0 x8, but TP wins for dense models. Makes sense when you think about compute-to-communication ratio per layer. On NVLink TP would win both.

Getting the docker setup working was the hardest part honestly. Will document all the gotchas in the full post

[-]

Thrumpwart@reddit

Awesome, thank you. Those speeds ain’t half bad. Not the best, but definitely usable. And I can only expect speeds to improve as more hardware-specific optimizations are introduced.

New hardware is always fun. I know plenty in this sub (including me) are very interested in those GPUs which could be the new best bang for buck GPU.

Enjoy, and Mae sure you include pics of you can in the write up!

[-]

_WaterBear@reddit

Oh, whoops. I misread. Honestly, I don’t know. I’m using the “default” LMStudio multi-GPU setup w. ROCm llama.cpp. I do usually run models entirely in VRAM and with flash attention on, which keeps things speedy & allows for a maxed out context window - so my system RAM is basically not touched at all (only 64gb). My mobo may matter, too. It’s an X870E with two PCIe Gen5.0 bifurcated to 8 lanes each and one Gen4.0 at 4 lanes.

Interestingly, I don’t notice a meaningful difference in t/s between 2x GPUs at 5.0x8 and 2x GPUs with one at 4.0x4 and the other at 5.0x8, which is kinda exciting because I assume it means my bottleneck is… somewhere else, to possibly include the software/drivers.

[-]

Downtown-Example-880@reddit (OP)

Yeah $900 b70… can’t wait to own 12 of them lolol

[-]

Wildnimal@reddit

Now i am more curious what are you building with local AI?

[-]

beefgroin@reddit

nice, I'm also thinking of a similar setup, can you please test the performance of qwen3.5 27b?

[-]

Downtown-Example-880@reddit (OP)

Will do, looking for around 51 tk/sec for the new Claude reasoning 27b qwen3.5 fine tuned model, at q8_0 … and it outscored the bigger models and sonnet 4.5 on benchmark, with a little bit of data only out, but still looks incredible!! That would literally fit on a single GPU, with enough ram could get full context…

Plus I am gonna use raid0 and several 4x Nvmie pcie cards so I can get raid0 kv cache offloading for max concurrency, saves a ton of money this way and could probably sit 10k users in a flash off a single 3-GPU node + storage node!

I wanna build for scale so that my 9-nodes * storage nodes, and load-balancing node, etc, can fit 10k active and 100k cold users on the extended kv cache..

Do the math and you’ll see if you have 4x 1TB raid0 NVME’s set for kv caching the speeeds are just as fast as ram if they are pcie 5.0 and close to it at 4.0… close enough anyways; for massive user offloading… I’m one big excited nerd!!!! Lolol

[-]

fragment_me@reddit

I don't see how 4x 1TB raid 0 NVME ever makes sense over straight up putting the data in RAM.

[-]

crantob@reddit

I admit this was exciting to read.

[-]

beefgroin@reddit

Wow never heard of nvme kv chache offloading, I guess yeah the speeds will be comparable only if in raid, looking forward to see your results

[-]

putrasherni@reddit

what kind of models are you running with 96GB VRAM ?

[-]

Wildnimal@reddit

Good build. Full Specs?

What models you are going to use? Let us know how it performs locally for Agentic or tool calling? Maybe sprinkle some T2I into the mix :D

[-]

Downtown-Example-880@reddit (OP)

8tb sata Seagate Barracuda ssd + 2x1tb NVMe Kingston

256gb RAM 8x32gb OCW 3200mhz

1600w PSU ASRock Phantom Gaming atx5.1 (dual 12vpwr too!)

Ubuntu server

Test bench ($18 ftw!!!)

TR Pro 3945wx (bought these $40 each or less!!! Deadstock from overseas!!!)

Aio 1u dynatron 3fan server sp3 cooler!

Asrock Rack wrx80 rack -2t. (Dual occulink onboard for up to 6 GPU… 4 internal dual slot w/ 2 eGPU)

That’s about it lol, the board has dual 10gbe nics too plus ipmi/bmc…. Great freaking atx server board, each slot is full tilt 4.0x16 to cpu…. Gonna throw a connectx-4 and get several more of these with 100gbe infiniband for k8s inference nodes… or k3….

Heckkkk yah! Lovin it! My addiction is intense, plus it’s my small buisiness site gonna use it lol I’m the owner!

[-]

MidnightHacker@reddit

Wdym you got a $40 threadripper? Even on alibaba i only find $1000+ ones

[-]

_hypochonder_@reddit

This was the store where he maybe pick it up.
https://www.hary15.com/store/
https://youtu.be/Tre39MLVfaw

[-]

Downtown-Example-880@reddit (OP)

I got 8 from slovakia/slovenia (cant remember) from a "new dead stock" seller selling ones amd were gettin rid of by the tray, 8 to a tray brand new for $300 I felt like a kid in a candy store. These were 3945wx, the 3955wx are $700 for a tray of 8. Message me if you need to know the buyer he will has some listed on his personal website

[-]