In theory, if I have $20k-ish to spend on hardware what would actually get me closest to local coding agent that would allow me to go totally off the social grid?
Posted by Tired__Dev@reddit | LocalLLaMA | View on Reddit | 150 comments
Let's say I'm in the market to buy a studio or RTX 6000's. At what point am I off the grid with a local coding agent? Probably a model question too.
Quiet_Head_404@reddit
going off the grid because you bought a single workstation is the most reddit thing ive read all day.
PM_ME_UR_COFFEE_CUPS@reddit
Heck yeah 2 RTX 6000s and then add another 5k for the rest of the build
evil0sheep@reddit
2x RTX PRO 6000s is the move, could do a really solid build for under $25k and it could go in a normal case with a normal power supply and run on a normal outlet. More than 2 GPUs will add a lot of complexity, more than 1500 watts will add a lot of complexity, and non-NVIDIA adds a lot of friction in a lot of places. Dual 3 slot air-cooled GPUs in an atx case with 7 slots is the path of least resistance, and you can get 8 PCIe lanes to each card with a normal motherboard and a consumer grade CPU and no PCIe riser fuckery. Then run vLLM with 2x tensor parallelism and you’re gucci.
Krazygamr@reddit
would a ryzen 9950x3d be enough for this? Or would you still need a threadripper?
FoxiPanda@reddit
You can definitely use a Ryzen for driving 2 cards. Things get a little weirder for 4 cards, but you can technically still do it...but it's way, way easier to do it on a threadripper/epyc system at 4 cards.
portmanteaudition@reddit
You won't have enough PCIE lanes unless you go Threadripper/EPYC
t4a8945@reddit
No it would not, you need 2x PCI 5.0 16x for proper tensor parallelism. Otherwise you're just letting perf on the table. Consumer CPUs only have 24/28 lanes available, while server CPUs have up to 128.
Toastti@reddit
PCI 5.0x8 is still fast as hell and works fine
t4a8945@reddit
Well some of my client messed up, installing 2x 6000 on port 2 and 4 of their motherboard instead of 1 and 2 like I advised. The performance was plain bad.
They switched the cards around and it was light and day.
My conclusion was wrong, pci 5 x8 is indeed enough
MaruluVR@reddit
Also you can always add a plx splitter later and get 16x on each card even if the motherboard/cpu doesnt support it
OttoRenner@reddit
Say what now?😲
MaruluVR@reddit
Here are some posts about them in case you ant more info
https://www.reddit.com/r/LocalLLaMA/comments/1pt0av6/plxpex_pcie_40_seems_to_help_for_llms_and_p2p_ie/
https://www.reddit.com/r/LocalLLaMA/comments/1qeimyi/7_gpus_at_x16_50_and_40_on_am5_with_gen54/
OttoRenner@reddit
Thank you very much!
Krazygamr@reddit
Can't tell if serious or hallucinating. Proof / source please?
MaruluVR@reddit
Source, I own one and use it with the nvidia p2p driver with 4 cards at 8 lanes pcie 4 each on a 4 lane oculink / 2 lane usb4
https://www.reddit.com/r/LocalLLaMA/comments/1pt0av6/plxpex_pcie_40_seems_to_help_for_llms_and_p2p_ie/
https://www.reddit.com/r/LocalLLaMA/comments/1qeimyi/7_gpus_at_x16_50_and_40_on_am5_with_gen54/
Krazygamr@reddit
Ok that's really cool. Thank you kind stranger!
kmouratidis@reddit
*2-slot. All Pro 6000 models (Max-Q, Workstation, Server) are dual slot.
DeepWisdomGuy@reddit
Ribbon cables also give you more options if you don't give a rat's ass what it looks like:
winnen@reddit
Just be aware that extenders/ribbon cables may or may not work with PCIe 5.0 slots.There are some constraints you need to consider when working with 32 GTs. And the ones you need might be pretty expensive (>=$100) to ensure good signal integrity.
alphapussycat@reddit
192gb doesn't get you there though...
etaoin314@reddit
This is the way
ProfessionalJackals@reddit
Is it not cheaper to just get 6x or even 8x R9700's + a Epyc Motherboard?
fractalcrust@reddit
not when you have to rewire your house or install a subpanel
AutomaticDriver5882@reddit
This 30 amp circuit is not normal in a house
MrPecunius@reddit
30A+ 240VAC single phase plugs are totally normal in US households for ovens/ranges, clothes dryers, water heaters, and more.
ttkciar@reddit
Using a local LLM is not normal, either.
Fuck normal. Do what you need to do.
Mtolivepickle@reddit
Not with that attitude. /s
ninjazombielurker@reddit
I paid roughly 4.7k USD for a new subpanel to have two 20a/120v for a 4x 3090 rig, a 30a/250v and 20a/120v for my server rack, a 20a/120v for my workstation and just cause don’t know what the individual price cost was, a 50amp generator inlet for the house was also included in this cost.
So a single 30amp/250v plus some tandem and quad breakers to fit it without needing a separate subpanel would probably be around $700ish, probably a bit more w/ exterior conduit from 1st floor to 2nd floor, maybe a bit cheaper w/o that. I hired a very big international electrical company and they still turned what was supposed to be a 1 day job into a almost 2 month nightmare BUT I don’t ever have to worry about power ever again for my servers! :D
InfiniteBlink@reddit
Jesus.. that's crazy, what do you do with all that gear. Is everyone panic buying 3090s cuz new cards aren't gonna come out due to the chip shortages? I bought my 3090 last year for 1200 and I wanted a second but theyre like 1400 now. Something's gotta give.. I wanted to build a beefy nas but ssd prices are also crazy...
mCProgram@reddit
in a budget you guys are aware you can use our 240v outlet to safely power european spec PSU’s, right? if you don’t mind putting it in the garage probably 30-40% of houses have a 240v on its own 30-50A breaker already.
JollyJoker3@reddit
US houses have 240v for charging electric cars? (European here)
Ke5han@reddit
Canada here but the electricity shares the same properties, lol. Surprisingly we have 120v and 240v in our house, 240v is usually for dryer, stove, electric car charger, hot tub, pool pump, baseboard heater, water tank.... basically high power consumption items. But I haven't figured out how to run a regular 240v equipment on our 240v circuit, as our 240v has two hot wires 🤔
maccam912@reddit
and electric stoves and dryers. It's big chonky plugs so we don't have many of them like our normal 120v outlets, but it's possible.
OverclockingUnicorn@reddit
Thankfully some of us are european and have 240v! If I wasn't renting I'd use the spare capacity in the panel to have a pair of 32a Ceeforms put in lol.
Couldn't afford to run that for more than about 20 seconds per day though with our electricity rates though lol
RandomCSThrowaway01@reddit
Yes and no.
The problem is memory bandwidth. R9700 offers only around 640GB/s. If you have a model that fits entirely on a single card - yep, with 8 that's 5.1TB/s. Yay, that's a LOT! The problem? 32GB is enough to run like Qwen 3.6 27B or 35B MoE. Those are by no means bad models but they just don't have the level of knowledge and understanding of a frontier. You can give them smaller tasks but it's more like Haiku.
Now, if your model of choice fits on 2 of those - now you have 2.55TB/s total with 8 cards. Still very decent bandwidth but... there aren't any particularly great models in this size class.
Next step up is therefore 4 cards - 128GB, 1.25TB/s. Not enough for Minimax M2.7, should be able to run Mistral Medium 128B but, uh, it's gonna be slow. Usable in some cases and if you look deep enough you should find some models but you are stretching it a bit.
And finally, if you want to actually use your full 256GB - you are now limited to 640GB/s. At which point - sure, it can actually fit some really good models but it's also operating at 1/3rd of what you can get with Pro 6000 in terms of speed per card. Mind you, your setup will also consume 2.4KW under full load.
Annoyingly, Nvidia is the only company with (somewhat) affordable GDDR7 memory. AMD is either GDDR6 which is just slooow or HBM3e which is fantastic but extremely expensive. Somehow they have actually regressed since last gen as well - 7900XTX was offering 960GB/s.
And what you are after if running those fat models and if you are willing to spend $20000 is both fast AND plentiful VRAM. Aka what we need is 384-512bit GDDR6/GDDR7 + 48-64GB of it. That would be 100% worth buying. But right now it doesn't exist in AMD's lineup. It does exist in Nvidia's lineup (RTX Pro 5000, 72GB and I believe around 1.5TB/s) but it's not all that cheaper than 6000.
Apple with its unified memory has a chance of disruption since M5 Max does over 600GB/s meaning Ultra will do 1.2TB/s and you will be able to get 256+GB of it.
FortiTree@reddit
Im sorry but memory bandwidth doesnt work that way. You cant just "add" them up. 6x Strix Halo is not the same as 4x Sparks nor 8x 9700 or dual 6000.
The speed is capped at memory bandwidth, doesnt matter how many you stacked up.
It's a bad advise to spend 20K on 6x Strix
freefall_junkie@reddit
https://rocm.blogs.amd.com/artificial-intelligence/tensor-parallelism/README.html
I can’t tell if you’re pedantic since you aren’t literally increasing the physical memory bandwidth of any single card, but tensor parallelism does increase effective memory throughput linearly. The E2E token generation speed doesn’t scale linearly since you’re introducing an increasing amount of networking latency. I didn’t read this whole article but at a glance it seems to give a good breakdown.
I am not advocating for any of his given configurations but this is worth considering when building out a local inference server.
Fast-Satisfaction482@reddit
In my dual 4090 setup, tensor parallel did not give me any benefits. It's even slower latency wise than pipeline-parallel.
So while it's a nice option in theory, you need to have the bus bandwidth to back it up, or tensor parallel will not be worth it.
FortiTree@reddit
Yea, ppl only talks based on theory and discard real setup result. I also own those hardware and do real benchmark on them and I dont see any of those claims. The cross-link remains the main bottleneck.
FortiTree@reddit
Im not sure how you get your fact butTensor parallelism only help to pool memory space, not increasing processing speed in anyway.
The multi-gpu, multi-unit setup also suffers another critical literal bottleneck due to cross-link sync speed:
So only nvlink is practical (still 50% drop compared to 6000 pro 1.8 TG/s) but it's not available for any consumer grade cards, only data center.
Civil_Fee_7862@reddit
The knowledge might be outdated, but new models are likely to keep getting released.
I think the general idea is to encode general reasoning ability into the models but then connect it to all your data sources.
Don't use models as databases
forestryfowls@reddit
So well said! It’s been cool to see Claude agents recently will say “let me check rather than guessing”. It makes me excited for local models that get really good at making this judgement call. I don’t care if it takes longer to do all those checks.
unchained5150@reddit
ChatGPT has started saying that with 5.5 too. I was super surprised the first time it happened. It said a little sentence first like, 'I'm going to double-check that so we're not in speculative territory', then it checked like 12 websites.
Ended up taking about three minutes all said, but man, if the answer wasn't exactly what I needed. I even took the time to verify with the sources it chose to make sure. Yep, it got the right information from them too.
Super cool functionality. Looking into how to replicate that with my home system.
tmvr@reddit
In Europe the price matches the roughly 95-100 EUR/GB rule, so you basically get what you pay for nominally:
4600 € - RTX 5000 Pro 48GB
7200 € - RTX 5000 Pro 72GB
9500 € - RTX 6000 Pro 96GB
gamesta2@reddit
Pro 6000's dont have the link thing so youre very limited when running more than one 6000. Nvlink I think is what its called. Pcie speed becomes bottleneck anyways
HonestoJago@reddit
It’s not that bad. I still get good TP performance without spending six figures. And you can get creative with diffusion workflows.
dataexception@reddit
Which that is way more than OP would need to get "off grid", really. I mean, he could technically do it with a single A100 off of eBay for ~5k, leaving plenty for an awesome CPU/motherboard, plenty of RAM and storage, AND the electrical upgrade (or buy a diy solar system and save future electricity expenses) ¯\(ツ)/¯
CYTR_@reddit
2 RTX 6000 96gb + 2K for the reste of the build... And 3K for a Strix Halo 128gb : this way, you can deploy a fleet of SLMs with a lot of context on the Strix and have the LLMs alongside them on the RTX 🥸
Otherwise, you keep the 3K and wait for the RAM to decrease.
alphapussycat@reddit
192gb doesn't get you there though...
HonestoJago@reddit
Maxqs to help with thermal issues, especially if you’re just doing LLMs.
PM_ME_DEAD_CEOS@reddit
Yeah perfect build, and the best "future proof" option.
FoxiPanda@reddit
This is woefully off topic, but I need to know... what are the top 5 coffee cups you've been PM'd lately?
CatalyticDragon@reddit
Easy. An MI350P. 144GB and 4TB/s of bandwidth on a single card.
Price is yet to be announced but is estimated to be around this budget.
neuronym@reddit
I would not invest 20k+ into the current AMD ecosystem. 2x PRO 6000's are the safe bet.
I'd consider AMD if the budget was below a single PRO 6000.
CatalyticDragon@reddit
My reaction to that was "ok, boomer" but that seems rude :)
neuronym@reddit
You have any experience with high end hardware running ROCm? Its a clusterf*ck of edge cases and ROCm itself is one of the worst programmed pieces of software I have ever seen.
15+ GB of YAML is all I need to say, for people who are in the know.
CatalyticDragon@reddit
I do yes. And that is not how I would describe the experience. I would say my experience falls into two categories: pre-6.4 and post 6.4.
While I could get results with ROCm 5.x it wasn't until 6.4 that things became (mostly) seamless. At that point the most anoying issues I ran into were lazy module maintainers hardcoding NVIDIA dependencies and one kernel (amdgpu) driver bug.
With 7.x everything is just dead easy. In fact easier than dealing with NVIDIA's proprietary drivers and software IMHO.
So if it works just fine for me, and every hyperscaler, and the world's top supercomputers, then maybe you're operating on outdated assumptions?
neuronym@reddit
I'd love to see a happy hyperscaler that is 100% AMD hardware, there is a reason why they don't exist.
Vulcan is saving AMD's ass otherwise they'd still be completely irrelevant for AI like Intel.
CatalyticDragon@reddit
Aside from Google every hyoerscaler uses AMD Instinct cards. Yet you are confident they don't work..?
kitanokikori@reddit
AMD just isn't competitive, and this is coming from someone who really wants them to win
profcuck@reddit
This is definitely worth keeping an eye on.
CatalyticDragon@reddit
Yeah. 100-200 billion parameter dense models with long context and 80 tokens/sec sounds good to me.
RogerRamjet999@reddit
Of course $20K can buy a pretty sweet rig, but honestly, you're picking just about the worst time in history to buy high-end AI hardware. Do yourself a huge favor, and just wait a year until the prices come back to Earth. Rumors are that China hardware companies are investing very seriously in taking over the DRAM market and that's likely to cause a big drop in prices not even counting other factors. Waiting will mostly likely double the power of any AI rig you could buy today.
If you're dead-set on getting something now, the two obvious choices are to buy 2 RTX 6000s, or buy as many RTX 3090s as you can find used. The first is direct and simple, the second is hard (both to procure and to build), but probably gets you more inference per dollar spent. James Betker did a really nice write-up of how he built an 8 RTX 3090 build to train a TTS model he created (you can search Google to bring up the article).
johnnyApplePRNG@reddit
Wow please send me your crystal ball, bro! It sounds sweet!
SmartCustard9944@reddit
In one year prices might be even higher!
RogerRamjet999@reddit
Agreed, it is possible, but it's like the stock market, do you really want to buy at the all-time high, or would you rather wait and see if the price will return to previous values. The odds favor a return to lower prices, especially where computer hardware is concerned where the long-term trend is decreasing prices.
3flaps@reddit
Buying at all time highs is statistically a pretty good strategy
MinimumCourage6807@reddit
Looking maybe past 5 years i have really hard time seeing a case where prices would actually decrease in any meaningfull way except maybe the only case would be with used server hardware if some data centers would start going bust in a masses, but even in that case the used hardware is server hardware which are quite a bit harder to run in home setups. And would that bring the consumer cards prices down? Well at least there is so many ifs that if you need a gpu now, it can be a wait of the effective lifetime of the gpu for the prices to drop. And on that timeframe the current inflation levels keeps the prices high enough...
Confident_Luck2359@reddit
N00b question: why is the 3090 the sweet spot and not 4090 or 5090 or RTX 5000?
arbv@reddit
Due ro price-to-performance ratio. 3090 is one of the cheapest 24G VRAM cards.
RogerRamjet999@reddit
Yep, exactly this. You could go for a couple other recent AMD boards, but they're not supported as well as the NVidia cards.
arbv@reddit
RX7900XTX is alright (for inference, at least)
Blues520@reddit
Running costs are quite different though right?
RogerRamjet999@reddit
Definitely true, a bunch of 3090s are going to use a lot of watts, so if you're in a high electricity cost area, the RTX 6000s are probably a better choice for running costs.
mr_zerolith@reddit
RTX 6000 Pro plus 5090 will get you 128gb vram. From there you can run Step 3.5 Flash 197B Q4_K_L and still have 220k of context memory left. This model kicks ass and runs at 120 token/sec on the first prompts.
My box:
ga239577@reddit
I would probably do either a 1x 5090 build or 1x RTX 6000 build ... maybe you could do a 4x 5090 build also and get good throughout on vLLM.
I would lead towards one of the 5090 builds because it seems like 27B-32B models are good enough and can fit in 32GB with a good sized context window.
Single 5090 build would leave you with plenty of cash left over.
I'd only do the RTX 6000 build if you want to run big models.
HonestoJago@reddit
Four 5090s sounds painful. Didn’t Nvidia restrict power limiting on them?
Eritar@reddit
No, you can power limit them all you want, but unless you have free electricity, two 6000 Pro will be much better than a comparable amount of 5090s
ga239577@reddit
Wouldn't 4x 5090 be faster though? Assuming you're using concurrent requests ... Memory bandwidth on the 5090 and RTX 6000 Pro is the same IIRC.
Assuming 27B-32B models are good enough - and with Qwen 3.6 27B / Gemma 31B it seems they are - wouldn't a 6000 Pro be overkill for most people?
This of course assumes there isn't a breakthrough in the 100B ish range that makes them a ton better than the 27-32B range
bidet_enthusiast@reddit
Realistically speaking, if you know how to code and you just want a tool, you are fine with a pair of 3090s. Quen27b FTW.
If you want to vibe-code a new salesforce or do deep agentic work without supervision, you need at least 1/2 tb of vram, and a 10kw circuit to run your rack on. Thats 12 rtx6000s, to run SOTA open models at q5. If you want full precision think 16.
Basically the best local coding models are around 30b, and at least right now to get a major improvement you need to go up to about 1T parameters. There’s really nothing in between, and quen27b is so good at coding it beats the 100b models hands down. I don’t think you will see anyone releasing models > 100b >~1T because that’s just not a hardware category that’s out there.
You can run a 100b model on 4x 3090 or a cluster of 2x 2x3090 for a user or two. (Assuming 128gb RAM)
If you want to vibecode basic stuff or websites and demos, you’re back to dual 3090 territory.
I can run 27b just fine on my MacBook Pro M1 with 64gb.
swagonflyyyy@reddit
2xPro 6000 blackwell maxQs.
Wonderful-Ad-5952@reddit
3 Mac Studio!!
chafey@reddit
I am running 2x RTX PRO 6000 with Qwen-3.5-122b and it generates 180 tokens/second. I haven't had to use a cloud model for months - it has handled everything I have thrown at it. I know the latest cloud models are more capable but I prefer to work in smaller steps to make sure it is doing what I want. Large numbers of changes are much harder to grok
LankyGuitar6528@reddit
*on grok
DeusExNatura@reddit
Rent GPUs online by the hour to get familiar with and test various hardware. You may even decide to keep renting for a while, to have access to new hardware, and to switch as your needs change.
MomentImmortalizer@reddit
Ask your ai.
einthecorgi2@reddit
I use 4x DGX sparks running 397B Qwen fp8. This has been much better than quantized larger models like glm 5.1
spammmmmmmmy@reddit
To just run inference, I'd get any one of the Apple M? Max or Ultra, with 256GB or 512GB RAM.
I just saw a handful of this class of computer for about £20k on eBay this week.
RandomCSThrowaway01@reddit
It depends on your requirements.
If you want Opus / GPT 5.5 tier - 20 grand doesn't get you there. Closest open weight model is latest GLM and Q4 seems to require about 450GB VRAM, similar with Kimi K2.6. A minimum working setup right now is more like $60000 (maybe a bit less if instead of RTX Pro 6000 you get to use Huawei's 128GB Ascend 950PR but I would not bet on those working fine just yet out of the box plus I don't think they are available in retail - and they likely won't ever be available if you live in the land of tariffs).
Now, these things do get more efficient over time but to what extent remains to be seen - I do expect that over time a 256GB model will be able to reach current frontier level.
What $20000 gets you are indeed 2x RTX Pro 6000 (although their prices are going up so not for much longer). That's 192GB VRAM @ 2TB/s, roughly speaking. Maximum that works smoothly through it would be Mistral Medium 128B (aka a dense 128B model) or some variants of Minimax. Roughly speaking you can reach Sonnet level with those.
You are waiting for M5 Studio then. M3 Ultra has horrible prompt processing speed which imho rules it out for larger models if you want agentic coding. You need M5 Ultra which fixes that issue and hopefully also boosts bandwidth to about 1.2TB/s.
Now, if you are also fine waiting a bit - HBM2e/HBM3 cards might drop in price in a year or so as datacenter may be kinda forced to buy newer gen (cuz Nvidia's Rubin is more power dense and that seems to be a big limiting factor in assembling new data centers).
true_variation@reddit
Mistral Medium (I assume you mean 3.5 since you say 128B) performs slightly worse than Qwen 3.6 35B on agentic and coding indices, yet has the same 256K context, but Qwen 3.6 can run on much cheaper builds.
RandomCSThrowaway01@reddit
Honestly the problem with existing popular benchmarks is that they are all shoved straight into models datasets by now. This makes honest comparisons pretty difficult, you effectively need your own tests.
And in my experience Qwen 3.6 27B is probably the best small model and it handily beats 35B MoE (however smoothly running 27B asks for a 5090 whereas MoE one does with a 5070 + CPU). But they both fail at larger codebases - when giving a specific feature to alter in a video game (related to character's movement so a LOT of dependencies in many files) and even telling it where to start I had Qwen 3.6B fail completely starting to delete lines seemingly at random, 27B did a bit better (but still couldn't finish), surprisingly Mistral Medium did something mostly working... but only Opus actually did it in full. So your mileage may vary. Sadly I do not have VRAM to test Kimi/GLM.
Main coding problem with benchmarking models is that they ALL do good job on brand new applications and smaller codebases by now. So in order to actually gauge what it can accomplish is giving it some edge cases and larger projects to work with. This also showcases major fails even in frontier models - eg. Opus has a theoretical support of 1 million context but it seriously starts rotting after 200k. You see it daily on Anthropic subs where users start crying that Opus gets "lazy", tells them to wait until tomorrow, go to bed etc. All of that is context rot.
Mind you, I am not saying that your own observations are "wrong" by any means. Odds are we just have different use cases.
ILoveToyota37@reddit
The jump from $20k to $60k is 3 times as expensive for probably x1.5 the performance. In about 6 months, that performance gap will close with software optimizations and new models alone. Hell, I can even run a Minimax APEX quant on my RTX 3090 + 3080 (20gb mod), given I'm getting around 20 tok/s and have shit context length. We're in for a wild ride for new model releases within these upcoming months. Qwen 3.7, maybe a GLM 5.1 flash. Who knows...
Save your money folks! You're better off buying food storage for the upcoming apocalypse
ObjectiveActuator8@reddit
Tell me more about your apocalypse predictions
sn2006gy@reddit
the price of the 6000s just went over 10k this week. you’re looking at 20k for a single gpu box these days
Annual_Award1260@reddit
My suppliers have been stable in price for months. Newegg has the same price as well. But local shop here upped price $4k
AlwaysLateToThaParty@reddit
Been waiting for this to happen for a while. I bought my RTX 6000 Pro last year, and was surprised that the price hadn't gone up with the RAMpocalypse. Just this week up AUD$4K.
fallingdowndizzyvr@reddit
Central Computer had it for $7500 for the longest time. But now it's $10K.
Pleasant-Shallot-707@reddit
If you harness correctly you can get local coding correctly with 27B+ dense models. Use Pi coding agent so you can build a harness that works well for you.
AnonsAnonAnonagain@reddit
You would need 1x DGX Station. That would be sufficient to do most things easily enough.
angie_akhila@reddit
Not really, but with two you can run things like Qwen 397B, which is pretty great honestly— little sliw though, max out at mid-20 tok/s, need 4 to get cloud grade 60-90 tok/sec
AnonsAnonAnonagain@reddit
DGX Station GB300 which has Blackwell Ultra (Enterprise)
https://nvdam.widen.net/s/jnkrzwnqhj/dgx-station-datasheet
Not the DGX Spark.
angie_akhila@reddit
you’re totally right, midread. Thay do it— Freudian slip, I was just looking at op’s 20k budget, unfortunately station is a lot higher, 4 DGX setup is about 20k
OjinAI@reddit
the wait-a-year advice tracks for pure cost optimization but flips if you're trying to ship something that depends on local inference TODAY. hardware depreciates whether you use it or not. the question isn't "will it be cheaper next year" (yes), it's "how much value can the box generate between now and then." for builders shipping product, often more than the depreciation hit.
PWCIV@reddit
Honestly at this price you can hire a good swe from a cheap country and they will produce something that obliterates anything llms can in a year
Kahvana@reddit
First, why?
Second, off the grid as in network or electricity? Very different answers for either.
Third, why look at this as a money investment problem instead of looking for specific capabilities? What is “good enough” for your use-case?
Your requirements are incredibly vague. Define what you need out of the model clearly.
With 2x RTX 5060 Ti 16GB you can run Gemma4 and Qwen3.6 at very low electricity cost during inference, and still have the intelligence to get any task done. Smaller models require a bit more work in setup for system prompt, but are extrnely capable these days.
SecondFriendly4255@reddit
I can suggest you first to experiment something less costly before going all in local have a downside you have to figure out if it’s ok for you. Put more monney will not give you local opus4.7 for exemple so if you are new in maybe experiment a bit before going like that. My go to actually is something that can host properly deepseek v4 , an gen image setup
InteractionSmall6778@reddit
The 2x Pro 6000 path people are recommending checks out. The missing piece in most of these answers: at that VRAM range the model isn't the bottleneck. The harness is.
Cline with a local endpoint, or Aider running against your model, handles the agentic layer. Qwen3.6 35B is comfortable for most coding tasks at that memory bandwidth. The gap shows up at complex multi-file refactors and production debugging, and for those cases it's more about prompting discipline than raw hardware.
ToddlerPeePee@reddit
lol, I am using AI offline on my $1k laptop. That's why I think AI is becoming a commodity and AI companies are in a bubble. Why should I pay money every month when I can just install a software for free and runs it offline for free?
kivaougu@reddit
We run 2x pro 6000 max q on a proart mobo with a 9950X and 256gb ddr5. If you don't have an idea of what model you would like to run you should not be considering anything this expensive. Just rent gpu access to get a feel for what size of model you could work with.
MinimumCourage6807@reddit
I have now one 6000 pro + 5090. now the best I can use with this setup is minimax m2.7 with fast speeds (around 100 t/s). that is pretty neat. With two I could use it on vllm with higher quant, that would be epic. Now i have seriously also though about buying a second one as vllm is quite a game changer with multiple agents. Looking forward also to test deepseek v4 flash. Now one thing i have found with this setup is that the currently best model with vision seems to be qwen 27b or for some cases gemma 4 31b. With 2 6000 pros I could use qwen 3.5 397b probably, maybe that would be a improvement, maybe not. qwen 27 is incredibly solid tbh.😃 But to be honest. these are not nearly as good as opus 4.7 if you try to give these model some lazy "build me x, no mistakes" type of prompts. When prompted well with a vision, they do good job though. And also for example qwen 27b have done absolutely beautiful websites from scratch for example recently.
Civil_Fee_7862@reddit
AI is likely to switch more to smaller more reasoning based models over knowledge holding.
To match frontier models you could probably do it with a smaller models assuming you have its data connections done extremely well, and tuned for you use personally
SmartCustard9944@reddit
Holding more knowledge means having learned patterns from a larger diverse set of data that can transfer across domains
Civil_Fee_7862@reddit
Facts != knowledge != Wisdom
We want wise models.. Not knowledge, not facts.. Those change over time, so unless you want to download a new model every week with new knowledge and facts, we should aim for models that specialize in abstract thought and reasoning, and try to avoid storing any facts or knowledge inthem.,
HonestoJago@reddit
Harness is huge, too. We’re in the bespoke era.
GeramyL@reddit
20k? go buy yourself 10 AMD R9700 Pros!
corruptbytes@reddit
i’d just optimize for ds4 flash - my m3 mac 256gb runs it like a dream, generate around 160m tokens a day of it constantly running
then china will build ram in 2027-2028 and that 20k will go much farther
HonestoJago@reddit
No prefill bottleneck?
LargelyInnocuous@reddit
2x RTX Pro 6000 would be fine for most things 192GB-1700Gb/s-125-1k-2k-4k (FP32-FP4 TFLOPS). 20k could get you 2x 512GB Mac Studios for 1TB (albeit over thunderbolt), so 1024GB-800Gb/s-28-57-57?-57? (No FP8 or FP4 support in hardware).
The compute disparity seems large, but since most things are bandwidth limited instead, Mac Studio is only 40-50% the speed in most cases. GLM5.1 is \~20 tk/s on a Mac Studio and across 3x RTX 6000 Pro you would get \~40tk/s. But you can run the Q4 or MXFP4 quant on 1x Mac Studio vs 3x RTX 6000 Pro needed.
My general advice is pick the best model you want to run. And make sure you have that much VRAM+30-50%.
If you want to run GLM5.1 (465GB@Q4) then Mac Studio is cheapest entry (It would take 5xRTX6000Pro to run). If you want to run Qwen3.6 27B, then either is fine but RTX 6000 Pro will be 2-3x faster.
If you are talking large code bases and long vibe coding sessions, the mac may be a little more start and stop (make a cup of tea) since the prompt processing will take 5-20x longer, but you'll be able to do it with the best models which handle complex concepts better. The RTX 6000 Pros will be closer to realtime, but probably with a more modest model (Qwen3.7 27B is not bad at all though!).
If you are generating scratch code or complex processes, better model may enable better output. If you're just doing basic routines nothing terribly complicated, correcting errors etc, then a 27-70B may be just fine. Or just spend some time decomposing the problem into easier chunks for it to tackle.
siegevjorn@reddit
Rtx 5000 pro bw 72gbs are not bad at all. Half of TDP of RTX 6000 pro bw. Save some in power bill. Glm 4.7 is probably your best bet among models with similar vigor. But, believe it or not gemma 4 31b is ranked higher in arena.ai Although, gemma 4 is fincky on coding harness.
HonestoJago@reddit
I d gotten decent results with Qwen Code.
HonestoJago@reddit
Part of the Opus magic is parallel agents, so that’s also a consideration. With two 6000 Blackwell pros I can only run two DeepSeek v4 Flash agents.
FatheredPuma81@reddit
Depends on what your expectations are... Opus speed and quality? Not for $20k that's for sure. Old Sonnet 4.0 speed and quality or better? Qwen3.6 27B can achieve that and doesn't need an RTX 6000. 2 RTX 6000s will let you run Minimax M2.7 or Deepseek V4 Flash at 4 bit and should land around Sonnet 4.5 quality.
SmartCustard9944@reddit
How many people have tried MiniMax M2.7? I didn’t find it very good.
I have been using DeepSeek V4 Flash for the past weeks and, while cheap and fast, it is a bit frustrating to use. It forgets things and does not listen to direct instructions sometimes.
HonestoJago@reddit
It’s awful at following prompts but I’ve still seen some good results. Needs a custom harness (speculation).
FatheredPuma81@reddit
I used it for free through Nvidia. Was incredibly slow and ended up destroying my project because it got stuck in a loop when it was tasked with auditing and fixing issues. The orchestrator M2.7 looked at it was was like "What the hell everything is so mangled?" Thought it was a bug from Nvidia though.
M2.5 Free through OpenCode was decent up until it became stupid I tried using it to benchmark some models locally and it often forgot settings it was using.
Haven't tried Deepseek V4 Flash.
9gxa05s8fa8sh@reddit
if you have 20k you should wait until the market crashes and buy a couple servers after
330d@reddit
I can built a 4x3090 epyc 7v12 256G DDR4 server for under 5k recently, runs a high quant ofn27b qwen at 100tps generation and 1100pp via vllm, very usable with hermes, if only for inference I'd go that route again and not spend 20k on rtx 6000 pro build.
MadGenderScientist@reddit
if you're determined to spend $20k, the most HBM memory and perf is theoretically Intel Gaudi 2s IMO. they're deeply discounted because.. (a) Intel killed the entire project after Gaudi 3 (b) the hardware is weird and (c) the ecosystem support is... not great.
bozo move from Intel imo, but they seem to love axing promising hardware rather than actually marketing and making it easy to use (Xeon Phi, Altera, Optane, etc. etc.) but their loss my gain, IG.
FBIFreezeNow@reddit
Rtx 6000s pretty much the only suitable options for off grid local LLMs that can handle some beasts. If you are thinking of going back to grid, with 20k you can have Codex and Claude Code, spend $400 a month and access to SOTAs for like 5 years.
No-Comfortable-2284@reddit
no. you will never be able to run anything with max context limit, that will compete with codex or claude. unless u have 100k+ to spend dont try. either u wont have enough ram or enough bandwidth to run anything at usable speeds.
ttkciar@reddit
GLM-5.1 quantized to Q4_K_M and maxed-out context would fit in 512GB of VRAM, and supposedly it's more competent at codegen than Claude Sonnet. That's achievable for much less than $100K, depending on your inference speed requirements.
No-Comfortable-2284@reddit
I meant more opus than sonnet but sure. idk if you want to be coding at 3tokens per second but thats what itd be if u stitch 5 dgx spark together. so yea like I said u cant run anything with usable speeds under 100k.
ttkciar@reddit
That's still not accurate. If you can deal with the noise, cooling, and power requirements, you could run eight 32GB MI50 across two older Xeon servers via llama.cpp's
rpc-serverfor "only" about $8K, and that should infer at at least 15 tokens/second (but with admittedly long prompt processing wait times, at least under llama.cpp).No-Comfortable-2284@reddit
that is true nothing rly matches opus locally. but 16 32gb cards are not 512gb. and no space for any useful amounts of kvcache.
ttkciar@reddit
Quoting from my earlier comment (emphasis added):
No-Comfortable-2284@reddit
fair enough
transanethole@reddit
Save your money. Just get a 5090 for now, invest/save, and wait for mi350p price drop later.
Please don't do this, I don't think is a good idea.
Thebandroid@reddit
AI psychosis speed run
nomorebuttsplz@reddit
bro it's just a sensory deprivation tank, 5000 lbs tank of soylent, and a vr headset, hardly psychosis.
ttkciar@reddit
That's sort of my plan, too.
I figure many existing MI210 users will be buying MI350P as drop-in replacements, and some of those MI210 should find their way to eBay, causing MI210 prices to drop.
If I can pick up two MI210 for relatively cheap, I'll be happy for years while waiting for MI350P to appear on eBay and come down in price.
alex20_202020@reddit
The answer is not in hardware only, but models too. The set where you will get the answers to this post from them and not post here.
TokenRingAI@reddit
It entirely depends on how deep your pockets are for power. You could get two RTX 6000 for 192G of VRAM (700w), or a 768G HBM VRAM intel gaudi 2 server (~6000w)
Yorn2@reddit
If you're willing to put in the work you can learn how to run this model (Qwen 397B) on two RTX 6000s using tabbyAPI. There's ways to get it running on sglang and vllm and etc, but running a good quant of a 397B model in EXL3 on just two of the cards is pretty crazy.
ikkiho@reddit
strictly inference for coding, 4x rtx 6000 ada runs glm 4.6 + qwen3 coder 480b at decent tps but you're still ~30% behind sonnet on agent loops. mac studio at half the price but hits memory bandwidth walls on longer contexts. you can self-host fine, but you'll keep reaching for the hosted call when the local model gets stuck mid-loop. doable, just rough as a daily driver.
superSmitty9999@reddit
I think the premier open model right now for coding is GLM and I think you need a \~tb of vram for it. Most of the open models have basically a 4090 tier (24GB), an A100 (80GB tier) and then it jumps to \~400GB vram, from there 1tb+.
Think is for coding models they versions that fit in my dgx spark (128GB) kinda suck. If your use case is coding I recommend either spending more money to get a TB vram or just stick to api.
Wrong_Mushroom_7350@reddit
Honestly, not enough capital to make it worth while.
Realistically to be completely off grid you need roughly 800-1000 vram to run any model completely untethered. That allows you to run full 1T based parameters, and all full 16 bit models and as much vram cache as you want with fully context.
Which based on the math would be 25-31 cards and 93 grand to get it all set up.
Last_Mastod0n@reddit
As many others have also said, I think 2 RTX 6000 Pros are the move.
cibernox@reddit
With that money, two RTX6000 will give you 192gb of vram. With that you can run pretty much anything below 350B in Q4 and the best speeds that kind of money can buy.
You would have to go for the truly profesional datacenter chips for something faster, like h100s and such.
Thrumpwart@reddit
The upcoming AMD MI355P (I think) will be your best bet.
FoxiPanda@reddit
If you want something that's legitimately decent at that price range I'd aim at 2x RTX Pro 6000s and the supporting computer to go around them and then run something like DeepSeek v4 Flash / MiniMax M2.7 / Qwen3.5-122B-A10B (very high quant) / Qwen3.5-397B-A17B (much lower quant) / MiMo v2.5 (middling quant).
You could also try to get a GPU + fast RAM offload setup working (say a Turin based system + 1 RTX Pro 6000?) and potentially run some even bigger models like GLM-5.1 or Kimi-K2.6 but I think these will still be slow... and RAM is really, unbelievably expensive right now, so I'm a little hesitant on this option.
There's also the option of a Mac Studio M3 Ultra 512GB probably around 20K but probably mildly risky since used and ebay and all that. You'd probably be a little disappointed in the prompt processing performance of one of these too and they will be slower than 2x RTX Pro 6000s for the same model but the 512GB M3 Ultras can run bigger models than 2x RTX Pro 6000s so it's not apples to apples exactly...
You still probably won't get the smooth experience of Opus/Sonnet/GPT-5.5/etc in this scenario sadly, but it's pretty good for all local. You might also consider Qwen3.6-27B & Gemma-4-31B at near-unquantized weights (Q8 or BF16) with an RTX Pro 6000 which is something that I do and enjoy the results from (albeit still not as good as frontier model capabilities).