3xR9700 for semi-autonomous research and development - looking for setup/config ideas.
Posted by blojayble@reddit | LocalLLaMA | View on Reddit | 64 comments
Hello everyone.
Over the last couple of months I have been assembling my local AI setup for personal use, and I thought I'd write a post here, firstly to collect some thoughts on the whole concept, and secondly to perhaps gather some feedback.
My setup is nowhere near as advanced as many professional rigs posted here, but I have the following specs:
- 9950X + 96 GB RAM,
- ASUS ProArt X870E mobo,
- 1300W Taichi T1300 PSU,
- 2x ASRock R9700,
- XFX R9700 (currently shipping).
So far I have mainly been using it to run Qwen 3.6 27B at Q8 on the two cards together. I experimented around a little bit, but overall I landed on running my models using llama.cpp with Vulkan drivers.
To get it out of the way, I am aware of the limitations of the connectivity in this system, especially for the 3rd GPU, which would run at a measly 4x Gen 4 lanes. This is likely to be a significant bottleneck if I were to run a single model distributed over all of my GPUs.
I am working on a hobby research project in the programming languages area, so access to some less common knowledge is generally very helpful. AFAIK there isn't really anything stronger than 27B for me to run locally at the moment.
Eventually, with 96GB of VRAM, I could run something bigger, but the PCIe limitations would affect the overall performance in that scenario. Therefore I was considering potentially running 2-3 agents locally, with a smarter overseer like K2.6 via API. For certain tasks which are smaller in scope, or where lower speed would be acceptable, I could also consider running some CPU inference, since I have a bunch of system RAM to utilize as well.
Generally, the idea I was considering was constructing some form of harness to allow for semi-autonomous research and development within the scope of my project. Potential deployments could consist of a number of agentic developers/testers/thinkers running separately, for example with something like Q6 quants of 27B, so each could have its own GPU. Depending on the workload, it could be nice for the "overseer" to dynamically deploy the agents and models needed to fit the current workload (maybe for certain tasks we would want to pause development and run a big model on all GPUs together, to benefit from broader knowledge).
Because of the complex and specific nature of the project, it touches on more niche CS areas which models like 27B are aware of but might not be well optimized for, so I think one key aspect would be allowing the agents to access internet search and bigger cloud models when necessary.
Overall, the most interesting part for me, which I do not know too much about at the moment and would like to learn more about, is how to effectively engineer a harness to manage this hardware deployment and project. I could definitely spend some time just (vibe) coding something to fit my specific needs, however I do not think my setup, at least conceptually, is anything new. I am aware there exist solutions like LangGraph and CrewAI, although I am unsure which would fit my use-case best and be easily extensible for my needs. A rough sketch of the serving layer I have in mind is below.
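To make that more concrete, the layer underneath any such harness would probably just be one llama-server per GPU, each exposing an OpenAI-compatible endpoint for the overseer to call. A sketch of what I mean (the model path, ports, and the GGML_VK_VISIBLE_DEVICES variable for pinning a Vulkan device are assumptions on my part, not a tested config):

# one worker per GPU, each on its own port
GGML_VK_VISIBLE_DEVICES=0 llama-server -m ./Qwen3.6-27B-Q6_K.gguf -ngl 99 -c 32768 --port 8081 &
GGML_VK_VISIBLE_DEVICES=1 llama-server -m ./Qwen3.6-27B-Q6_K.gguf -ngl 99 -c 32768 --port 8082 &
# the overseer (local or via API) then just talks to http://localhost:8081/v1 and :8082/v1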
I would be very curious to learn about other people's experiences and thoughts on this hardware setup and potential deployments on it.
If you read through all of that, thank you very much and sorry for the chaotic writing style.
Cheers.
Look_0ver_There@reddit
I have exactly this motherboard and config with 3xR9700's. Feel free to ask questions.
I used the two PCIe x16 slots for two of the cards, which can only run at 8 PCIe lanes each, but that's fine.
I also picked up one of these things: https://www.adt.link/product/F43-Shop.html
I put that in the PCIe5x4 M.2 slot to link the third card. This boosted 3-card performance by about 10% over using the bottom PCIe4x4 slot, which runs via the south-bridge chipset and so has higher latency.
The big gotcha with the R9700 Pro's though is that despite AMD's claim of native BF16 support, it actually appears to be firmware emulated, and runs only half as fast as F16. It's more or less the same story with their FP8 support. For this reason, stay away from the Unsloth UD-quants as these run slower due to their use of BF16 scaling weights.
Best performance though is typically seen when using just 2 cards. Adding a third card just adds in inter-card latency. Unfortunately AMD also appears to have nerfed the P2P performance of the cards for the consumer grade R9700Pro's, and so the more cards that you add, the slower the inter-card sharding is.
If you want to compare performances and work on tweaking the setup, we can exchange settings.
blojayble@reddit (OP)
Thank you very much for your response. I would love to discuss our setups more.
If you don't mind, I have some questions:
- how do you utilize your current setup? do you use it for agentic development? which models/runtimes do you use?
- how do you mount the 3rd gpu? is it in your case or external?
- did you consider connecting the gpus through a PCIe switch? apparently it can increase the performance further by connecting the gpus more directly.
Look_0ver_There@reddit
I have a number of different setups, Strix Halo x 2 in a cluster, 3 x AI Pro R9700's, and a work-supplied top-end MacBook 128GB M4 Max which has the 546GB/sec memory bandwidth.
I use llama.cpp for everything. I have gotten vLLM working before for most things, but I personally find it too fiddly and fragile on AMD to continue bothering with it. It also just doesn't work with 3 GPUs, so I feel like why do I even bother having a 3rd GPU if vLLM can't use it.
I mostly use them for agentic development, some image generation, and repository/document analysis. With Anthropic raising their prices, my workplace has started to place limits on its use, so I try to use Claude for planning or trickier issues. I use the local models for boilerplate/UI work (I'm a crusty older back-end developer who started coding in the 80's), and being able to hand off UI guff to AI models has been awesome for my productivity.
For local models I use MiniMax-M2.7 on the Strix Halo's as it shards nicely across the two. That's my local "big brain" model that I'll run plans through first, and then decide if I need to use Claude instead. I have it write a full implementation plan summary to a file in the repo, and then engage other faster models for implementation.
On the 3x Radeon AI Pro's, I mostly use Qwen3.5-122B-A10B at Q5_K_S quant size. That runs at around 1000t/s for PP, and ~35t/s for TG, which is fast enough for most things. Sometimes I won't even bother with MiniMax if I think the 122B is doing a good enough job.
I've also recently been experimenting with running Qwen3.6-27B at Q5_K_XL quant size on a single R9700 Pro, and that's a surprisingly good performer. It runs at about 1000 t/s for PP, and 28 t/s for TG. I've also experimented with Qwen3.6-35B-A3B @ F16 quant and F32 KV cache, and that is both quick and "intelligent" at those levels.
The 3rd GPU is in my case. I 3D Printed a sort of a cradle to hold it, and that cradle attaches to the rear case meshing. The GPU is therefore not blowing air directly outside of the case, but rather blowing towards the rear mesh.
I did consider a PCIe switch, but after researching the pricing and effort I decided that if I was going to go that route, I may as well just pick up a single work-station 96GB RTX6000 Blackwell instead, and that would be faster for far less hassle.
blojayble@reddit (OP)
Thank you for the detailed response. Since you have the same mobo as me, I am a bit curious how you organized the cards internally, especially wrt the PCIe adapter. Do you feel like you are done with expanding on your AM5 platform? Technically you could throw a 4th card in there, if you had a good reason to. Although I suppose your Strix cluster might already let you experiment with bigger models, albeit with presumably lower speeds.
On the topic of the cards themselves, have you attempted any undervolting or similar to squeeze out better performance or efficiency? I was curious if it is something worth looking into.
Look_0ver_There@reddit
I'm in the middle of 3D printing better ducting + cradle for the third card, and I'll respond to your comment later on with a photo of it when done.
I have one card in each of the PCIe5x16 slots, so that's all normal. As mentioned, they're effectively running at PCIe5x8 speeds each on the bus direct to the CPU, and that's actually plenty fast enough for inferencing. As I mentioned in another response, it's actually the latency that's the big performance killer.
Check out that response for a kernel parameter tweak that helps majorly to mitigate the inter-card latency overheads. The drawback though is that the cards won't draw less than 42W each at idle once you do that, so weigh up your costs vs benefits as suits your needs.
Regarding the third card. I have a Fractal Design Meshify 2 case, and that M.2 to PCIe adapter has a 25cm long ribbon cable. Unfortunately that cable is about 3cm too short for me to mount the third card directly into the vertical card mount that the Meshify 2 has as the ribbon cable has to first "get past" the height of the top card, and then drop down to below the second card, and it's just not quite long enough.
The card is currently just "sitting there" on a simple 3D printed cradle that itself just rests on the base of the main motherboard bay. It's like this until the more sophisticated mounting cradle finishes printing out. That new cradle uses magnets to secure it to the case, has ducting that routes the air-flow to the case's vertical PCIe mount points, and bolts onto the rear case panel for greater security; the card's PCI mount points bolt onto the ducting.
The Strix Halo's are surprisingly quick, at least for MoE models. They do suck in comparison to the discrete GPU's for pre-processing, being about 3x slower, but for MoE models, they can generate anywhere from 50-80% of the speed of the GPU's due to the unified memory's zero-copy efficiency making that 256GB/s perform better than you might think.
blojayble@reddit (OP)
Ah, thanks! I might give the power tweak a try too.
As for the M.2 slot you use, do you use M.2_1 or M.2_2? Based on the fact that you mention running both cards at x8, I suppose it is _1? Otherwise the third card would "steal" lanes from the second one, as the second PCIe slot shares its lanes with that M.2 slot.
Unfortunately, for a reason unknown to me, I am not able to get the second slot to be x8 no matter what I tried. It does not seem to be a specific card issue, as I tried swapping them around. Maybe I should contact ASUS about that but I doubt I would get much of a response.
Overall would you consider the Strixes to be a better bang for the buck? Or it would ultimately depend more on the models one wants to use?
Look_0ver_There@reddit
I used M2.1, as it's the only PCIe5-based M.2 slot on the motherboard that has a direct link to the CPU without going via the chipset AND won't steal lanes from the GPUs. This is essential to get the lowest latencies for multi-card inference.
Slot M2.2 has to remain unpopulated otherwise it'll steal lanes from the 2nd GPU slot. I forced the M2.1, and the two GPU PCIe slots to PCIE Gen 5 in the BIOS (ie. switch from Auto to Gen5 explicitly, otherwise it may drop to Gen 4).
I could add a 4th GPU I guess (to answer your earlier question) but I'd need to move to a larger case to improve air-flow to the 4th card. It would also risk killing multi-card performance further with its added latency via the chipset.
As for the Strix Halo's, they're great value if bought for <$2400, and were excellent value when they were priced at $1800 six months ago. Now that they're almost all >$3000, they're just not worth it IMO. At >$3000 you're better off buying two or three R9700's, at least until Qwen release 3.6-122B-A10B, and even then, that could be run with a Q5_S quant on 3 x R9700's and run rings around the Strix Halo's. Another good alternative, if you're stumping up for 3 x R9700's, would be to consider a 128GB M5 Max Apple MacBook Pro. Using an MLX quant, most models will run almost twice as fast as on the Strix Halo's for 1.6x the price of a Strix Halo. The 614GB/s memory bandwidth on the M5 Max MacBook is just shy of the memory bandwidth on the R9700 Pro's!
Having said that, when the DFLash and MTP PR's on llama.cpp get merged, then the balance may shift a bit again, making the Strix Halo's "fast enough" such that you won't really be missing out due to their lower performance. Everything should run well enough, even if other solutions are twice as fast for a bit more money.
Sadly there's no one clear best solution in the $3000-5000 price range. Everything is a swings and roundabouts tradeoff, and it really is a case of pick your poison and try not to worry too much about the greener grass "over there".
Evgeny_19@reddit
How would you compare Qwen3.5-122B-A10B to Qwen3.6-27B? Do you use Vulkan or ROCm?
I was using Qwen3.6-27B, and I managed to find one tricky bug when I switched from Q5 to Q6_K_XL. I was hopeful that Q8_K_XL would be even better, but now that I've installed a second 9700, I can't say that I notice the difference. The performance decrease is certainly noticeable, though. Although it has only been a couple of hours, I definitely ought to give it more time. My goal is to eventually go with four 9700s, but I think I can borrow a third one. I'm just not sure if that would make it worth switching from 3.6 27B to 3.5 122B.
Look_0ver_There@reddit
The problem with multiple cards is that it doesn't scale that well, because each card only does a portion of the compute, then writes its state back to main memory, and then the CPU copies that state to the next card, and so on. All that back and forth adds overhead, and the more cards you add, the greater the overhead per token generated.
My advice would be to not use more than 2 cards on a consumer motherboard. I know that OP and I have 3, but in hindsight, I believe that it's kind of pointless beyond having two of them unless you have some sophisticated PCIe switch setup going on.
The good news though is that pre-fill with two cards is almost always 30-50% faster than pre-fill with 1 card, because pre-fill is a compute-bound task.
It's the generation that takes the bulk of the hit, but there's a way to get some of that back!
Add these two fields to the GRUB_CMDLINE_LINUX line in /etc/default/grub:
processor.max_cstate=2 pcie_aspm=off
This prevents the CPU and cards on the PCIe bus from going into deeper sleep states when they're not doing anything/waiting. This can boost responsiveness, at the expense of drawing some extra power. This should boost multi-card generation performance by around 10-15%
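Concretely (assuming a GRUB-based distro; the existing contents of that line will differ per system), that means appending the parameters and regenerating the boot config:

GRUB_CMDLINE_LINUX="... processor.max_cstate=2 pcie_aspm=off"
sudo update-grub    # Debian/Ubuntu; on Fedora/RHEL it's grub2-mkconfig -o /boot/grub2/grub.cfg
sudo reboot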
Run this script. It will force the GPU's into a more responsive higher power state. This works for single GPU's too, but it also helps when you have 2+ cards. This should also be a 10-15% speed boost.
$ cat ~/bin/high-power
#!/bin/bash
# Force all GPUs into high-power mode
for card in /sys/class/drm/card*/device/power_dpm_force_performance_level; do echo high | sudo tee $card; done
Download this model: https://huggingface.co/unsloth/Qwen3.5-2B-GGUF/resolve/main/Qwen3.5-2B-Q4_K_M.gguf and put it into the same directory as your 27B model, and then add the following config lines to your llama-server config. This enables speculative decoding and will give about another 15% speed boost.
--spec-draft-model ./Qwen3.5-2B-Q4_K_M.gguf \
--spec-draft-ngl all \
--spec-type ngram-mod \
--spec-ngram-mod-n-match 24 \
--spec-ngram-mod-n-min 48 \
--spec-ngram-mod-n-max 64 \
For me, with all of the above, I went from ~19.5 tg/s to 27.5 tg/s when sharding Qwen3.6-27B-Q5_K_XL across 2 cards.
To answer your other question, I think that there's some natural noise/variance that occurs from run to run. It's hard to definitively say if Qwen3.6-27B is better or worse than Qwen3.5-122B-A10B. The latter runs a little faster though, and so that's what wins it for me.
Evgeny_19@reddit
Thank you for the detailed answer! I moved from AM5 to Milan (EPYC CPU) because of the PCIe lanes limitation. Milan is quite old and limited to PCIe gen 4, but still should be sufficient for 4 cards, which I hope to use someday. The new Threadripper/EPYC platforms are kind of bonkers, because of DDR5 prices.
Curiously enough, increased power consumption was one of the things I noticed after switching from a single GPU to a pair. With a single card, when a model was loaded into VRAM but not actively in use, power consumption would typically drop to around 44W and then settle at just 9W. Now that I'm using two cards, there is no such thing as idling at all: amdgpu_top reports 87-88 Watts for one GPU, and 100W for the other. The model is not distributed evenly, which is evident from VRAM usage. Additionally, the Graphics Pipe is at 100% on both cards, as well as Fetcher and Compute for the Command Processor. The Efficiency Arbiter also appears to be doing something constantly, though it never goes above 37%. So, I think that the deep sleep state issue that you mentioned has been fixed now, at least in the latest version of llama.cpp (I tested both ROCm and Vulkan).
If I understood your third point correctly, you're essentially suggesting using a small draft model to feed the main one. I didn't even know that was possible! I definitely have to try it. Thank you very much!
Look_0ver_There@reddit
Yeah, the EPYC platforms are insane now, not just because of DDR5, but also because the memory needs to be registered, which is currently 2x more expensive than the already stupidly inflated prices. 128GB of registered DDR5-6000 is around $4500 nowadays, and if you wanted 12 channels, then you're looking at $13,500. Just madness!
I benchmarked Qwen3.6-27B@Q8_0 on just two cards for you. Here's my full invocation command:
This is running at ~24t/s generation, and ~1350t/s pre-fill for me using all the tricks I mentioned in my prior post. That's already something like 25% faster than what a single card can manage for both pre-fill and generation.
Evgeny_19@reddit
Interesting, you don't even use q8 for -ctk/ctv. No options for prompt batch either (-b/ub). I thought all of those were essential to increase the speed.
I just tried your options on Vulkan, but I also added -ctk q8_0, -ctv q8_0, and -b 1024 -ub 512. Now I'm getting 2.5 to 5 tps in generation. Sheesh. That is trickier than I thought.
Look_0ver_There@reddit
Try removing the GGML line (the first line) and see what happens. I drive my display from the iGPU on the CPU, so I have to tell llama-server to skip over it. If you are driving the display from one of your cards, that may be the issue. Alternatively, upload your full llama-server output to Pastebin.net and drop a link here and I can check out what may be wrong.
Evgeny_19@reddit
I didn't use the GGML line. My actual command and the server output are here: https://privatebin.net/?1b83085a1aa80b58#592Ajt3Qv6fFMi6hUCZyVxKNHj98rKzCVPeGYFbett52
Look_0ver_There@reddit
Ok, so I just downloaded Qwen3.6-27B-Q8_K_XL and fired up a llama-server using your exact parameters. I do run bare-metal though. No podman. It's a bit slower due to having to move more data around, but it's nothing like your results. Literally the only thing that's different here is your use of podman, vs my using of bare-metal. I would presume that's where your issue lies.
llama-benchy (0.3.7) date: 2026-05-04 15:35:37 | latency mode: api
Evgeny_19@reddit
Well, that's actually good news. Building llama.cpp from scratch should be easy enough. I will try it, thank you for sharing your experience.
Look_0ver_There@reddit
Here's a little script that I use to build a Vulkan ready version of llama.cpp that is tuned for the native CPU that you build it on. Make sure you're in the top-level directory of llama.cpp before running it. You may need to install a bunch of library dependencies. Just feed the error messages into Gemini along with what Linux distro you're using, and it'll tell you the package names that you need to pull.
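A minimal sketch of that kind of build, run from the llama.cpp top-level directory (flag names as per the llama.cpp CMake options; the Vulkan SDK and shader compiler still need to be installed first):

cmake -B build -DGGML_VULKAN=ON -DGGML_NATIVE=ON -DCMAKE_BUILD_TYPE=Release
cmake --build build --config Release -j $(nproc)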
Evgeny_19@reddit
Thank you. I did actually manage to build it after our conversation. I was quite pleased with myself, but it took a bit of scrolling down on llama.cpp's github page for me to realise that I had actually built it without Vulkan support, haha. And it's quite a dance with all those libraries to build it properly for Vulkan.
It looks like you were right that something is wrong with podman, because everything completely broke down for me afterward. I can't even start a ROCm container anymore. It just says "warning: no usable GPU found, --gpu-layers option will be ignored". And it's at about two tokens per second on Vulkan. Thankfully I still have my SFF AM5 build, so I'm relying on it now. It can obviously host just one GPU, and since it runs with Wayland, there is a bit of overhead in terms of VRAM consumption, but at least it works. It also runs llama.cpp in podman, but there are zero problems with it. Not really sure what I did to my Epyc build to ruin it. Will probably investigate it over the weekend.
Look_0ver_There@reddit
I forgot to mention. I use Vulkan. ROCm is always slower for me.
CautiousStudent6919@reddit
That's interesting, I'm not seeing the same slowness with the UD quants on llamacpp and a single r9700.
Look_0ver_There@reddit
It's not a big difference because there's only a small number of BF16 values in the weight sets, but it is measurable, at least at my end. I'm probably just being overly picky. I'd prefer to keep everything in the native "fast path" if possible.
Vaguswarrior@reddit
This is a great reply. It helped me decide on some of my used purchases. Hilarious that my system is less than a grand (most of the money I have for spending) and this system is basically almost my year's income. Yet we still share the same hobby.
putrasherni@reddit
where is the 3rd gpu ?
blojayble@reddit (OP)
still shipping 😄
reto-wyss@reddit
For autonomous stuff, where you can potentially run things in parallel, you should go with vLLM or sglang. Depending on what you run, you could see anywhere from a few times higher throughput to tens of times higher throughput.
However, that won't work with 3 GPUs; tensor-parallel needs 2 or 4. Then you go with FP8, which the R9700 supports natively.
Loading the model 3 times in Q6 doesn't make sense; that's some kind of worse version of data parallelism, which you typically only want to use if the model is very small relative to your total VRAM.
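For example, a 2-way tensor-parallel serve would look roughly like this (the model ID is just a placeholder; use whatever FP8 checkpoint you actually run):

vllm serve <your-fp8-model> --tensor-parallel-size 2 --max-model-len 32768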
djdeniro@reddit
With MXFP4 you can use 2xR9700 with super speed: qwen3.6-35B.
MDSExpro@reddit
R9700 doesn't support MXFP4, data is upcasted to FP8
putrasherni@reddit
it does support mxfp4 , both on vulkan mesa and on rocm
MDSExpro@reddit
No it doesn't, check AMD's spec sheet or vLLM documentation.
putrasherni@reddit
I'll go as far as to say that, at any Q4 quantisation, mxfp4 gives the highest prefill tok/sec speed.
djdeniro@reddit
Why do you say it doesn't support it? I'm running MXFP4 in vLLM.
MDSExpro@reddit
It's upscaling FP4 to FP8 in the background (unless you use a custom version created by one of the redditors, then it's partially accelerated).
Check out R9700 specs on AMD's website - this GPU doesn't support FP4 in any form.
djdeniro@reddit
Well, yes, but why are you saying the card doesn't support MXFP4? There are cases where the card doesn't work, and there are cases where it does, but only through upscaling (and even then, not always). I also wanted to say that FP8 doesn't work reliably with the new models.
Even so, none of the new models work reliably with this card or VLLM. So what now? Should I tell everyone that the R9700 doesn't support the new AI models?
MDSExpro@reddit
I'm saying that card doesn't support MXFP4 because this card doesn't support MXFP4 - simple as that. Just because vLLM is flexible enough to upscale FP4 to FP8 doesn't mean that R9700 supports MXFP4, because this card is literally incapable of computing over that data type and never does.
Yes, since that's the truth.
djdeniro@reddit
You're right, and it's worth looking at the actual performance. You won't run INT4 more reliably than MXFP4, and MXFP4 dequantized to FP8 will run faster than INT4 or base FP8.
MDSExpro@reddit
Strange, because I have been running INT4 for 4 months without issues on 8x R9700.
Not true, even AMD's spec sheet shows INT4 to be a couple of times faster than FP8.
blojayble@reddit (OP)
Wow, an 8x setup!
Can I ask what models you use and what you use them for?
How did you build your platform? PCIe switches?
MDSExpro@reddit
Qwen3.5-122B-A10B-FP8 (INT4 had significant drop in quality). Threadripper Pro with PCIe Bifurcation Risers.
djdeniro@reddit
Qwen3.5-397b mxfp4 (I did the quantization myself). PCIe switches, yes.
I have an MZ32-AR0; I got it as a ready server without GPUs. I ordered x4x4x4x4 and x8x8 risers. The speed loss is less than 5% compared to connecting directly. I also power-limited the cards from 300W down to 210W.
djdeniro@reddit
Are you running vLLM via Docker? Can you share your build? I want to test it. I also have an 8x R9700 and have spent a very long time testing new nightly builds. What model do you use?
djdeniro@reddit
What I'm getting at is that AI is indexing Reddit, and after some time, when users are deciding whether to buy this card, they'll see that it doesn't work with MXFP4 and won't buy it, because Claude or Perplexity tells them so.
When I bought these cards, I didn't have a test rig to rent eight of them and check that the MiniMax M2.7 doesn't run out of the box, or that a bunch of models don't work.
But then some nice guy comes to Reddit and creates a build with MXFP4 quant support in vLLM, a miracle happens, and you tell me in two different threads that MXFP4 isn't supported. Why? There are already few people here who can run anything successfully with these cards, and yes, with MXFP4 quants I can run a model at half the size it would be with FP8, which, by the way, doesn't work from the standard build, which, by the way, doesn't exist. AITER FP8 is not supported, even though the site you link to says AITER, FP8, and does not explicitly state that AITER + FP8 != WORK.
blojayble@reddit (OP)
I generally agree with you, however for the time being I am constrained by the limitations of the AM5 platform. Perhaps in the future I might consider an external rig with PCIe fabric card to improve the connectivity and potentially expand to 4 cards, however it is not a goal at this moment. So in the context of this post, I am mainly thinking how to harness the existing setup with its limitations.
reto-wyss@reddit
Use two cards with vllm/sglang and use the third card that has the lower bandwidth PCIe connection on its own - easy.
You'll get much better throughput that way using 30b class models on the pair. Running the same model on three cards via llama.cpp is nonsense - you are throwing away two cards and still getting lower tg/s than you would using vllm/sglang for your purpose.
The only reason to use llama.cpp is if you want to use a model that's too large and you need to distribute across all three.
Your problem is not hardware, but how to leverage it efficiently with software.
ReferenceOwn287@reddit
The price difference between a 5090 and an R9700 is very high. I was wondering just yesterday what stops people from getting the R9700 - even if it's 20% slower, the bang for buck seems huge. Do you see any limitations from not having CUDA?
Miserable-Dare5090@reddit
there is a post above where someone lists the issues, from emulated BF16, to bandwidth 1/4 of the 5090's, to nerfed P2P latency in their prosumer GPUs. I would grab one if I needed it. I'm sitting fine on a lot of hardware right now, but this hobby/addiction always makes you want more.
Thrumpwart@reddit
I started with a 7900XTX, then got a W7900, now I've got an RTX Pro 6000 on the way, and I already want a 2nd one...
Miserable-Dare5090@reddit
That W7900 looks so interesting to add to my strix halo as an egpu. But the price now, oooof!
Thrumpwart@reddit
It's a beast of a card. I got mine used 1.5 years ago and it's a workhorse. IMHO it's the best 48GB VRAM option out there and only requires 300W. Rock-solid reliability - I've been running multi-day agentic workloads and it just keeps going and going without a hitch.
Global_Tap_1812@reddit
So not the same as your setup, but I've got a similar problem with running a second card on a PCIe x4 slot: Intel i9-14900K with an AMD Radeon 7900 XTX 24GB and 64GB RAM, plus an R9700 32GB on order.
Rather than split the model, I'm actually planning to run qwen3.6:27b dense on the 32GB card with a 64k context window at q8, and then use the 24GB card to optimize prompts that are fed to the 27B dense model, manage the context window, and handle multiple sub-agents in parallel; basically, take all of the stuff that the larger model would otherwise handle itself if it had a larger context window, implement it separately, and just deliver optimized context.
No idea if it will work well or not, but the hope is that the divide-and-conquer strategy is well enough adapted to my workflow that I'll get something usable out of my local machine that can handle most of my needs, and then elevate to Claude, Codex, and/or Gemini when I really need the deeper thinking and higher performance of those larger models that are impractical to run locally. I'm spending $120+ per month on average for extra usage, so as an alternative to upgrading to Max the payback window would be less than a year, and I have more control over my own data.
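For illustration, if both cards ended up being served with llama-server, the split could look something like this (the runtime choice, file names, ports, and the GGML_VK_VISIBLE_DEVICES pinning are all assumptions, just to show the shape of it):

# 32GB card: the 27B dense worker with the full 64k window
GGML_VK_VISIBLE_DEVICES=0 llama-server -m ./qwen3.6-27b-q8_0.gguf -ngl 99 -c 65536 --port 8080 &
# 24GB card: the smaller prompt-optimizer / sub-agent manager model
GGML_VK_VISIBLE_DEVICES=1 llama-server -m ./optimizer-8b-q8_0.gguf -ngl 99 -c 32768 --port 8081 &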
blojayble@reddit (OP)
Do you have some idea on how you plan to utilize the second card more concretely? Just running another model with instructions on how to manage the prompts and contexts, or some more specific solution that you had in mind?
Global_Tap_1812@reddit
Well I've been working through the implementation over the weekend, and really I don't know how to describe it other than via the expected workflow once the framework is built.
Basically, all models share a connection with a single Obsidian vault that tracks decisions, notes, changes, etc. When I input a prompt, a smaller 8B model with a DSPy harness (1) chooses the best model for implementation, (2) optimizes the prompt, and (3) collects relevant context from the vault. Then it creates a "package" of optimized prompt + context and passes it off, in most cases to the larger model on the 32GB card. Then the larger model takes that information and actually writes the spec, performs the implementation, etc., which can include delegation to subagents on the first card. It's already what I do with Claude, Codex, and Gemini, just better optimized by virtue of DSPy and the Obsidian vault. The system allows for dynamic routing and more optimal token usage.
The first card is a "baby orchestrator" because you can't have a weaker model review the output of a more powerful model. It's basically a translator and router that only optimizes prompts and context between agents and subagents to free up overhead while maintaining accuracy. There's a threshold over which it can escalate to Claude, Codex and/or Gemini, but once it's running and able to self-evaluate and improve, the hope is that said threshold increases over time.
braydon125@reddit
You need to go up to a serious HEDT mobo. Wrx80-90. 128 lanes.
AttitudeImportant585@reddit
nothing serious about threadrippers. they only do 7 gpus max at 16x
blojayble@reddit (OP)
what would you consider a next step-up from that?
AttitudeImportant585@reddit
to add more lanes to allow 8 gpus and more at full pcie speed, you would need to go multi-socket with epyc or xeon scalable. modern threadrippers only do 1 socket
blojayble@reddit (OP)
that would be the dream, but the cost of pro threadripper and ECC RAM is a bit too much for me at the moment.
braydon125@reddit
The WRX80 can take unbuffered UDIMMs as well as ECC.
blojayble@reddit (OP)
good to know!
Southern_Change9193@reddit
You need this:
https://www.ebay.com/itm/389916624594
Thrumpwart@reddit
Very nice setup. I very nearly bought the PA602 recently but opted for the Flux Pro.
Anyways, you can run all 3 of those GPUs at full PCIe x16 speed with one of the tools from this video.
You would probably want to switch to a different case or even an open mining case, but that LR-Link Broadcom PCIe expansion is made exactly for the situation you are facing.
Kahvana@reddit
Consider keeping qwen3.6 27b running on those two cards, and use the third card for utilities (run qwen 8b embedding, qwen 8b reranker, qwen 1.7b tts, and qwen 1.7b asr on there).
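As a rough sketch of that third-card utility stack (model files, ports, and the device pinning are placeholders; the tts/asr models would need their own runtimes rather than llama-server):

GGML_VK_VISIBLE_DEVICES=2 llama-server -m ./qwen-embedding-8b-q8_0.gguf --embeddings --port 8090 &
GGML_VK_VISIBLE_DEVICES=2 llama-server -m ./qwen-reranker-8b-q8_0.gguf --reranking --port 8091 &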
fluffywuffie90210@reddit
You'll do fine for inference with 3x. I use 3x 5090s, one in a Thunderbolt 4 port via USB and another on PCIe 4x4, and still get 100 tokens a sec on qwen 122b; you might get half that. Only the model loading will be slow using llama.cpp.
koushd@reddit
you need 2 or 4. 3 will be worse for performance.
putrasherni@reddit
Even with split-row or split-layer on llama.cpp with Vulkan?
I thought that rule applied to tensor parallelism only