R9700 the beautiful beautiful VRAM gigs of AMD… my ai node future!
Posted by Downtown-Example-880@reddit | LocalLLaMA | 51 comments
96GB VRAM with 5080 inference speed and quality for less than a 5090 lolol… shhh don’t tell anyone this!
AbbreviationsSad5582@reddit
I’m looking to build something similar but with an X299 Sage board with the goal of running MiniMax locally. How’s your experience so far with the r9700?
HopePupal@reddit
Installed my first one today (same ASRock model) and I already want another one. You using llama.cpp in row-split mode, or vLLM, or what?
blackhawk00001@reddit
Llama.cpp now automatically splits the model between GPUs. 2x 9700s running Qwen3 Coder Next Q4_K_M (98% on GPU 1 and 82% on GPU 2) gets 600-2000 t/s prompt processing and 40-60 t/s response on the ROCm build with 200k context, depending on how full the context is. Still tuning and trying other things to lessen the speed drop-off. The pair is comparable to my 5090 running the same model. Qwen3.5 27B Q8 is faster on the 9700s but slower than Coder Next (350-500/25-30).
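A rough sketch of the kind of launch that gives that uneven split (the model path and exact numbers here are placeholders, not my literal command):

```sh
# -ngl offloads all layers to the GPUs, -ts sets the per-GPU split
# ratio across the two R9700s, -c sets the ~200k context window
llama-server -m qwen3-coder-next-Q4_K_M.gguf \
  -ngl 99 -ts 98,82 -c 200000
```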
HopePupal@reddit
llama.cpp has two split modes, `layer` and `row`.
`layer` is the default and puts some entire layers and parts of the KV cache on each GPU, which doesn't increase speed for any given single request and will in fact leave one of your pair idling while serving each half of the request, while `row` splits each layer across the available GPUs, keeping the KV cache on one.
tl;dr: the default might not be the best. If you're using `layer`, try `row`.
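Concretely, that's llama.cpp's `--split-mode` flag; a minimal sketch (model path assumed):

```sh
# default: whole layers assigned per GPU
llama-server -m model.gguf -ngl 99 -sm layer
# row split: each layer sharded across both GPUs, KV cache on one
llama-server -m model.gguf -ngl 99 -sm row
```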
TaroOk7112@reddit
Try `tensor`! It's new.
HopePupal@reddit
ooof, reading that PR I'm glad `tensor` exists now, because it sounds like `row` only works on CUDA.
blackhawk00001@reddit
I’ll give it a try later, thanks
nostriluu@reddit
Nice. They seem like great cards.
I have a 3090ti and I've been considering adding an R9700 to my 12700k / 64gb DDR4 system so I can do CUDA things and also experiment with larger models. Two things are holding me back.
One, I'd have to get a new system board that can do x8 for both cards; those are rare and expensive, and I'd rather buy new gear than an expensive four-year-old board.
The other, I've read the R9700 is noisy. I have an open case, so at least airflow shouldn't be an issue.
Any advice on "hybrid" setups or noise levels of R9700?
Downtown-Example-880@reddit (OP)
If I were you I’d make a separate AMD node! Like two islands, interconnected with an Ethernet NIC or InfiniBand…
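One way to wire the two islands together is llama.cpp's RPC backend; a sketch, assuming a build with the RPC backend enabled (the IP and port here are made up):

```sh
# on the AMD node: expose its GPUs over the network
rpc-server --host 0.0.0.0 --port 50052
# on the main node: treat the remote GPUs as extra backends
llama-server -m model.gguf -ngl 99 --rpc 192.168.1.50:50052
```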
nostriluu@reddit
Interesting, but it doesn't sound like the thing to do if I want it to "just work."
Downtown-Example-880@reddit (OP)
with claude code, setup sets itself up, lol
foxpro79@reddit
What does prefill look like for qwen3.5 or gemma4? I’m thinking of grabbing one of these for the price but not sure how it will compare to my current setup with a 5060ti
Equal_Passenger9791@reddit
Didn't Intel just release an even cheaper 32GB VRAM GPU? The Nvidia alternatives are finally piling up.
Anyway, enjoy your stack. I've considered getting one a few times, but I've promised myself not to buy one until I've actually needed to spend $1000 on cloud rentals; my 4090 goes pretty far for most projects I've taken on so far.
Thrumpwart@reddit
Any idea how they do for LLM in terms of speed?
_WaterBear@reddit
Quick test w/ 3x R9700s, Windows 11, LM Studio, ROCm:
- Nemotron-3-nano (Q8): 80 t/s
- Nemotron-3-super (Q4_K_M): 14 t/s
- GPT-OSS-120B: 80 t/s
- GPT-OSS-20B: 105 t/s
- Qwen-coder-next (Q6_K): 51 t/s
- Qwen3.5-35B-A3B (Q4_K_M): 60 t/s
- Qwen3-coder-30B (Q4_K_M): 75 t/s
For the models that fit on one card, I notice about a 15 t/s drop when running them across multiple GPUs.
The t/s are all over the place and vary considerably by model, and probably by driver or wrapper version… so the numbers above are just ballparks.
Thrumpwart@reddit
Sorry, I meant the cheap Intel 32GB gpus.
Your numbers are interesting though. Is it running tensor parallel or pipeline parallel? Neat to see a 3x GPU rig inferencing like that.
freefall_junkie@reddit
I have two B70s arriving tomorrow. I'm sure getting all the Intel drivers sorted and swapping to vLLM from my hodgepodge 4070 + 2070s setup on Ollama will take a while, but I plan on doing some benchmarking. (Worth noting that they'll be on my X570 Taichi board, so PCIe 4.0 x8 rather than the rated PCIe 5.0 x16.)
Thrumpwart@reddit
Post bench numbers once you’ve got everything sorted please!
freefall_junkie@reddit
Hey, got my 2x Arc B70 Pro setup working with vLLM 0.17.0-xpu. Still doing more testing and plan to do a full writeup this week with configs, docker-compose files, and detailed benchmarks, but here's what I've seen so far:
Hardware: 2x B70 Pro (32GB each), Ryzen 5 3600X, 48GB RAM, PCIe 4.0 x8, Ubuntu 24.04 w/ kernel 6.17
DeepSeek-R1-Distill-Qwen-32B (dense 32B, FP8 dynamic):
Qwen3-30B-A3B (MoE 30B/3B active, FP8 dynamic):
Interesting finding: pipeline parallelism beats tensor parallelism for MoE models on PCIe 4.0 x8, but TP wins for dense models. Makes sense when you think about compute-to-communication ratio per layer. On NVLink TP would win both.
Getting the docker setup working was honestly the hardest part. Will document all the gotchas in the full post.
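Until then, here's the rough shape of it (image tag, paths, and model name below are illustrative stand-ins, not my exact config, and the entrypoint varies by image; vLLM's repo ships a Dockerfile.xpu):

```sh
# build the XPU image from vLLM's own Dockerfile (path may vary by version)
docker build -f docker/Dockerfile.xpu -t vllm-xpu .
# run with the Intel GPUs exposed; swap in --tensor-parallel-size 2
# instead of --pipeline-parallel-size 2 to compare TP vs PP
docker run --rm -p 8000:8000 --device /dev/dri -v ~/models:/models vllm-xpu \
  vllm serve /models/Qwen3-30B-A3B-FP8 --pipeline-parallel-size 2
```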
Thrumpwart@reddit
Awesome, thank you. Those speeds ain’t half bad. Not the best, but definitely usable. And I can only expect speeds to improve as more hardware-specific optimizations are introduced.
New hardware is always fun. I know plenty of people in this sub (including me) are very interested in those GPUs, which could be the new best bang-for-buck option.
Enjoy, and make sure you include pics if you can in the writeup!
_WaterBear@reddit
Oh, whoops. I misread. Honestly, I don’t know. I’m using the “default” LM Studio multi-GPU setup with the ROCm llama.cpp backend. I do usually run models entirely in VRAM with flash attention on, which keeps things speedy and allows a maxed-out context window, so my system RAM is basically untouched (only 64GB). My mobo may matter too: it’s an X870E with two PCIe Gen 5.0 slots bifurcated to x8 each and one Gen 4.0 at x4.
Interestingly, I don’t notice a meaningful difference in t/s between 2x GPUs at 5.0 x8 and 2x GPUs with one at 4.0 x4 and the other at 5.0 x8, which is kinda exciting because I assume it means my bottleneck is… somewhere else, possibly the software/drivers.
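Back-of-envelope numbers support that, using rough theoretical per-direction PCIe figures rather than anything I measured:

```sh
# approx. per-lane throughput: gen4 ~2 GB/s, gen5 ~4 GB/s
echo "gen4 x4: $((2 * 4)) GB/s; gen5 x8: $((4 * 8)) GB/s"
# with layer split, only small per-token activations cross the link,
# so even ~8 GB/s is rarely the limit; VRAM bandwidth and kernels are
```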
Downtown-Example-880@reddit (OP)
Yeah $900 b70… can’t wait to own 12 of them lolol
Wildnimal@reddit
Now I'm more curious: what are you building with local AI?
beefgroin@reddit
Nice, I'm also thinking of a similar setup. Can you please test the performance of Qwen3.5 27B?
Downtown-Example-880@reddit (OP)
Will do. I'm hoping for around 51 tk/sec on the new Claude-reasoning-fine-tuned Qwen3.5 27B at Q8_0… it outscored the bigger models and Sonnet 4.5 on benchmarks, with only a little data out so far, but it still looks incredible!! That would literally fit on a single GPU, and with enough RAM I could get full context…
Plus I'm gonna use RAID0 and several 4x NVMe PCIe cards so I can get RAID0 KV-cache offloading for max concurrency; it saves a ton of money this way and could probably fit 10k users in a flash off a single 3-GPU node + storage node!
I wanna build for scale so that my 9 nodes plus storage nodes, load-balancing node, etc., can fit 10k active and 100k cold users on the extended KV cache…
Do the math and you'll see: if you have 4x 1TB RAID0 NVMe drives set up for KV caching, the speeds are just about as fast as RAM if they're PCIe 5.0, and close to it at 4.0… close enough anyway for massive user offloading. I'm one big excited nerd!!!! Lolol
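Sketching that math with typical spec-sheet sequential reads (assumed figures, not benchmarks of this build):

```sh
# assumed per-drive sequential read: gen5 NVMe ~14 GB/s, gen4 ~7 GB/s
echo "4x gen5 RAID0: ~$((14 * 4)) GB/s; 4x gen4 RAID0: ~$((7 * 4)) GB/s"
# DDR4-3200 is ~25.6 GB/s per channel
echo "dual-ch DDR4: ~$((3200 * 8 * 2 / 1000)) GB/s; 8-ch: ~$((3200 * 8 * 8 / 1000)) GB/s"
# so a gen5 stripe lands near dual-channel DDR4, well below 8-channel
```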
fragment_me@reddit
I don't see how 4x 1TB raid 0 NVME ever makes sense over straight up putting the data in RAM.
crantob@reddit
I admit this was exciting to read.
beefgroin@reddit
Wow, never heard of NVMe KV-cache offloading. I guess the speeds will only be comparable if they're in RAID; looking forward to seeing your results.
putrasherni@reddit
what kind of models are you running with 96GB VRAM ?
Wildnimal@reddit
Good build. Full specs?
What models are you going to use? Let us know how it performs locally for agentic or tool calling. Maybe sprinkle some T2I into the mix :D
Downtown-Example-880@reddit (OP)
8TB SATA Seagate Barracuda SSD + 2x 1TB NVMe Kingston
256GB RAM (8x 32GB OCW 3200MHz)
1600W ASRock Phantom Gaming ATX 3.1 PSU (dual 12VHPWR too!)
Ubuntu Server
Test bench ($18 ftw!!!)
TR Pro 3945WX (bought these for $40 each or less!!! Dead stock from overseas!!!)
AIO 1U Dynatron 3-fan SP3 server cooler!
ASRock Rack WRX80 -2T (dual OCuLink onboard for up to 6 GPUs… 4 internal dual-slot w/ 2 eGPU)
That's about it lol. The board has dual 10GbE NICs too, plus IPMI/BMC…. Great freaking ATX server board; each slot is full-tilt 4.0 x16 to the CPU…. Gonna throw in a ConnectX-4 and get several more of these with 100GbE InfiniBand for k8s inference nodes… or k3s….
Heckkkk yah! Lovin' it! My addiction is intense, plus I'm gonna use it for my small business site, lol. I'm the owner!
MidnightHacker@reddit
Wdym you got a $40 Threadripper? Even on Alibaba I only find $1000+ ones.
_hypochonder_@reddit
This is maybe the store where he picked them up:
https://www.hary15.com/store/
https://youtu.be/Tre39MLVfaw
Downtown-Example-880@reddit (OP)
I got 8 from Slovakia/Slovenia (can't remember) from a "new dead stock" seller selling ones AMD was getting rid of by the tray, 8 to a tray, brand new for $300. I felt like a kid in a candy store. These were 3945WX; the 3955WX are $700 for a tray of 8. Message me if you need to know the seller; he still has some listed on his personal website.
Ok-Internal9317@reddit
old ones
Wildnimal@reddit
eBay has them for $70-100.
spaceman_@reddit
The 8TB SATA Seagate Barracuda is not an SSD.
Downtown-Example-880@reddit (OP)
Hdd* lol
Draco32@reddit
Yes, interested in the motherboard used.
MelodicRecognition7@reddit
What is that strange black and silver thing with the white and green sticker in the top-right corner?
Downtown-Example-880@reddit (OP)
lol Barracuda Hdd
BhaiBaiBhaiBai@reddit
LMAO you mean the Seagate drive?
Remarkable_Ad_282@reddit
Good!! What's the model of the motherboard and the PC case?
Downtown-Example-880@reddit (OP)
The test bench for server racks is an open-frame chassis, and it's remarkably cheap on Amazon, somewhere between $18-35.
BhaiBaiBhaiBai@reddit
What models (& their quants) do you run, and what t/s do you normally get?
ambassadortim@reddit
I like posts like this, thanks for sharing.
layer4down@reddit
New family heirlooms ya say?
Dorkits@reddit
Nice setup
HugoCortell@reddit
Not bad. I heard the new Intel card plans to undercut it by like 200 bucks a pop. We'll see if Intel keeps to their word this time (they won't).
Downtown-Example-880@reddit (OP)
$900, out yesterday. Of course they had a one-card limit, so I didn't, lol. I like mine in multiples, hahah.
A_Normal_Bruh@reddit
The captivating jeans were distracting bruh