Top hardware stacks for local compute over the coming few months? (3-10K USD range)
Posted by IamFondOfHugeBoobies@reddit | LocalLLaMA | View on Reddit | 37 comments
I'm one of the 200 dollar a month plan Claude users currently tearing his hair out over how a company can offer a service this unstable and annoying (we are...many at the moment). And I'm thinking it might be time to just drop 3-10k USD on local AI.
I'm running GPT-OSS-20B on my gaming desktop atm and it is....way better than expected (also giving me a better experience than Gemma 4, which was wtf, but whatever).
Thing is. I'm not a hardware guy. I can program my own local AI tools easy enough. But hardware? Help please.
Currently I'm planning to wait for the new apple releases likely announced in June. Then look towards the Mac Studio line-up. But I'm sure there are people in here who know a LOT More about this than me.
What are the current top of the line solutions for Local AI in my price range? What are the trade-offs in terms of power consumption and things like ROCm on Linux (never, never, NEVER again oh god I value my sanity too much to try that again PURGE WITH FIRE).
I prefer the freedom of Linux but I'm fine with Apple. Windows is a no-go for me. Too much bloat, me and windows are permanently divorced.
Do note. Context is very important for me. It's not enough to just be able to get a model to load. I need to be able to use its full context well too.
I've labelled this thread a discussion since I suspect there will be a few different opinions on this and I'd love to get a good, productive discussion on this going.
RemarkableGuidance44@reddit
If you like Linux then I would suggest looking at the Intel B70's with 32GB of VRAM. Intel wants to be part of the local AI race now and is working with a lot of teams to get models working well with Intel GPUs.
I got 4 B70's coming; at that price point they were a no-brainer.
It's not just you who is sick of Anthropic. We spent millions on their API and over the last 3 months it's turned to sh#t, so we went and forked out half a million on a local server, now using GLM 5.1 and Kimi and fine-tuning our own models. Using local for 80-90% of the work, then getting Codex / Claude to finalise it.
I loved it so much I went and got 4 B70's for my own personal jobs.
I reckon the M5's are going to be 10k+ easy, while B70's are $950 USD. You could also go AMD with a bit more support. But if you are a tinkerer the B70's are great.
DrAlexander@reddit
I've seen reports of relatively low speeds with the B70. Are the expectations for performance to be similar to ROCm at least?
ea_man@reddit
...you can have all the expectations you want; the point is that the VRAM is the relatively cheap kind, the arch is ~2 years old, and the software stack currently delivers maybe 50% of the performance it may eventually get (you said it, expectations), at double the power, and only for heavy concurrent multi-prompt workloads.
HopePupal@reddit
you are literally describing the R9700. same chip, same memory bandwidth, 2× the VRAM.
RemarkableGuidance44@reddit
Which is $2600, while I paid $1400 AUD for each of my B70s. Price per performance, the B70's are good value with 32GB of VRAM.
HopePupal@reddit
yikes. the price difference is much smaller in the US
def_not_jose@reddit
They are currently significantly slower than amd ai r9700 (which are already far behind Nvidia). It's a bad idea to throw 1000s of dollars into a promise that support will get better one day
RemarkableGuidance44@reddit
The 9700 is $2600 AUD while I paid $1400 each for the B70s. I want larger models, not speed. Intel is already making headway with vLLM. Price per performance, with 32GB of VRAM, they are a good card.
I have dual 5090's if I want speed, but they use a lot of power; I want my server to run 24/7 without drawing as much. 4 x B70's (128GB of VRAM) for the price of one 5090 ($6500 AUD) is a damn good deal.
For the server we paid half a million for, we went straight to 16 x RTX 6000s, which cost 240k. I would love to get one of those for myself but they are 15k per card.
I also don't see this "far behind"; they are quite fast for me and run well for the models I use them for.
john0201@reddit
Mac Studio is the one. I have a threadripper 2x5090 workstation that’s probably worth close to $20k and I’m selling it once there is a new Mac Studio. M5 Ultra should be maybe 80% of 5090 numbers but with much more RAM. Plus it uses a fraction of the power and heat.
Currently I run Qwen3.5-27B on a 5090 with qwen code cli and perplexity search api, then I also have Qwen3.5-122B-A10B on my m5 max laptop, which is a little better but it crushes my battery and won’t fit on a 5090.
IamFondOfHugeBoobies@reddit (OP)
It really does seem like this is the way to go. It's just...such a clean, simple and low effort solution.
Not a huge fan of their eco-system but for this I'm building my own software and my code is generally easy to make mac compatible anyway.
It really does seem like you can get decent performance. I could build some amazing things with Sonnet 4.0 like quality. The main bottleneck there would be context and that's coming along nicely.
I just setup a quick and dirty multi-agent sandbox with GPT-OSS-20B on my desktop and it's pretty damn capable. Not capable enough for technical work but 10x that and we're fucking talking.
john0201@reddit
Context would be huge on a studio, you aren't limited by memory size, only bandwidth.
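As a rough back-of-envelope for what long context actually costs in memory (the dims below are illustrative placeholders, not any specific model's config):

```python
# KV-cache footprint: 2 (K and V) * layers * kv_heads * head_dim * bytes/elem * tokens.
# All dims here are illustrative, not a real model's config.
def kv_cache_bytes(layers, kv_heads, head_dim, tokens, bytes_per_elem=2):
    return 2 * layers * kv_heads * head_dim * bytes_per_elem * tokens

# e.g. 64 layers, 8 KV heads, head_dim 128, 128k tokens, fp16 cache:
print(f"{kv_cache_bytes(64, 8, 128, 131072) / 1024**3:.0f} GiB")  # 32 GiB
```

On a big unified-memory machine that's a rounding error capacity-wise; the catch is those bytes get read back on every generated token, which is why bandwidth, not capacity, is the ceiling.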
OSS-20B is pretty dated. Qwen3.5 Q4 quants will be better and faster.
IamFondOfHugeBoobies@reddit (OP)
Models this size are just toys. I'm just idling around iterating on a small local custom tool I built a while back. Experimenting with how to optimize my tool calling methodology.
That said, I'm actually finding OSS-20B beating out Qwen and Gemma which surprised the hell out of me but...it just does better on every task I throw at it.
Color me surprised but there it is.
john0201@reddit
They are definitely not toys today, they are better than old frontier models. OSS-20B is nowhere close to Qwen3.5 scores; maybe that is the difference - it’s an old model.
IamFondOfHugeBoobies@reddit (OP)
I guess it depends on the use case. For mine they're toys. Cool toys. Super useful toys (I've got more or less perfect accuracy on my tool use system prompt syntax today, if it works on OSS-20B it'll go swimmingly on more powerful ones later).
For other use cases probably useful enough sure. I mostly do very low level programming and system architecture though and they have no hope of handling that level of context.
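For flavor, the kind of check a tool-calling harness like that boils down to - the schema and wording here are hypothetical, not OP's actual syntax - looks something like:

```python
import json

# Hypothetical validator: pull the first {...} JSON object out of a model reply
# and check it against the (made-up) shape the harness expects.
def extract_tool_call(reply):
    start, end = reply.find("{"), reply.rfind("}")
    if start == -1 or end <= start:
        return None
    try:
        call = json.loads(reply[start:end + 1])
    except json.JSONDecodeError:
        return None
    if isinstance(call, dict) and "tool" in call and isinstance(call.get("args"), dict):
        return call
    return None

print(extract_tool_call('Sure: {"tool": "search", "args": {"q": "B70 specs"}}'))
```

If a small model passes this kind of check reliably, the same prompt syntax tends to hold up on bigger models.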
john0201@reddit
OSS-20B is only useful for historical comparison purposes- no one should be using it in 2026.
spky-dev@reddit
But it can ;)
https://github.com/3spky5u-oss/osmium
ea_man@reddit
You should first try those open models in the cloud via an API like OpenRouter, and only then consider buying 10k of hardware once you are confident in what you need.
Still, it won't cost you much to get a used 16GB GPU for a PC, then maybe add another later, to test those very same models locally.
FYI: AMD runs well with both Vulkan and ROCm for LLMs.
IamFondOfHugeBoobies@reddit (OP)
I have a 16GB GPU. I do run API models. I am running agents both locally and via API. But they are not nearly strong enough for my needs.
I don't think you appreciate the level of consumption I have however. There is no way API use would possibly be economical for me long term.
It's also about control and skilling.
ea_man@reddit
> I have a 16GB GPU. I do run API models. I am running agents both locally and via API. But they are not nearly strong enough for my needs.
Hmm, so what are you gonna do? You have to pay for those agents that are "strong enough". If you have already tried those open models via API... Kimi ain't strong enough? You sure?
> It's also about control and skilling.
No man, you are thinking about spending "10K" because that's a hard obstacle and you think there's a forbidden treasure behind it. It's nothing magical: you can try the very model that runs on 10K hardware right now and find out.
Also do not think that 10k at home performs like an NVIDIA rack worth 150K in a farm; you would get maybe 10-20 tok/sec with the big models that run twice as fast via an API.
Pleasant-Shallot-707@reddit
Aside from a quad 5090 setup, you can do dgx linked together or a maxed out 16” m5 max, or wait for the m5 ultra studio and run two of those
CC_NHS@reddit
I have revisited this decision internally every few months. and not for anything Claude is doing wrong (I am only on pro and find it fine still, just the usage seems a bit tighter since Opus 4.6)
my issue is ultimately twofold.
1: no local model comes close to Opus in ability.
2: at 20 a month it would take me a long time to make the price difference even look worthwhile for a decent local system. (at 200 a month against the 10k budget limit, you would need a bit over 4 years - 10k / 200 = 50 months - before it pays off financially, which still seems a lot since that is potentially time for upgrading gear again, but looks better than the calculation for my own case)
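The break-even math in that parenthetical is just:

```python
# Months until a local box costs less than a subscription
# (hardware price only; electricity and resale value ignored).
def breakeven_months(hardware_usd, monthly_usd):
    return hardware_usd / monthly_usd

print(breakeven_months(10_000, 200))  # 50.0 -> ~4 years 2 months on the $200 plan
print(breakeven_months(10_000, 20))   # 500.0 -> ~41 years on the $20 plan
```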
there is the factor also though of the upsides of data privacy and just the general feeling of it being on your own system and your control. which is the thing that keeps bringing me back to reassess the situation. sadly I can never justify it.
One other thing to consider is using a different provider than Claude. If local models are capable of doing the work you need, a cheap API like DeepSeek might be worth an experiment (or somewhere in middle like GLM)
KickLassChewGum@reddit
GLM 5.1 certainly comes close. I find it better than Opus 4.5 (and whichever lobotomized version of Opus 4.6 is currently live for subscribers, but not the version I remember).
CC_NHS@reddit
I am unconvinced 5.1 comes close enough on any machine in the 3-10k range. I have not looked into what VRAM (or, more likely, unified memory, to actually fit 5.1) you can get at 10k, but 5.1 is huge. Not to mention getting actual context.
I know technically it 'can' be a local model, but for this discussion I do not think it fits.
(also you seem to have got downvoted, maybe for same reason as my response, but wasn't me)
KickLassChewGum@reddit
That's fair enough. I think it's a model it would worth making the switch for, though, if one has deep enough pockets and an opportunity to colo in a nearby DC; and anything that can run this will likely be able to run whatever comes out in the future as well. My assumption is that models will only ever get more efficient, and that the gap the hyperscalers have might continue existing indefinitely, but they'll have to drop so much more compute into diminishing growth that a model that's 90% there at 5% of the cost is going to be competitive.
JavierSobrino@reddit
A single AMD RX 9060 XT with 16GB of VRAM is able to run Qwen3.5:35B at 50 tk/s in llama.cpp, in a host with 32GB of RAM. I'm almost sure 4 of these cards can run 120B models with ease. That is $1800 plus the host, around $3k.
ea_man@reddit
I would not push it that hard: I'd get ONE 9070xt with the idea of maybe getting one more later, and use open-model APIs for the few times you really need a SOTA with huge context.
Even today you can run good models with 16-32GB, and there are still good API offers with millions of free tokens for when you need something big.
ProfessionalSpend589@reddit
> I'm one of the 200 dollar a month plan Claude users currently tearing his hair out over how a company can offer a service this unstable and annoying
I think in this aspect the grass is always greener on the other side.
I had opencode installed in a VM running on cheap mini Chinese PC (ProxMox), but the PC died yesterday. The power supply seems OK, just that the PC is not powering on.
I don't have a replacement PC ready and I'm not thrilled at losing 2 hours to configure the new hardware when it arrives.
IamFondOfHugeBoobies@reddit (OP)
It's not greener. I realize it absolutely won't be the same quality. I'm just sick of the grass being randomly removed for periods of time on this side of the fence.
I'd rather start growing my own turf. It'll be a long time until it's as lush as Anthropic's. But the sooner I start the sooner I'll get there.
I think what people don't get is I'm not just looking to run an AI through some third party service. I'll be building custom harnesses and agent environments from scratch to plug it in to as well.
Forward_Compute001@reddit
Local hardware for multiple people is unthinkable.
Go for non local services that host the open source models, you get much cheaper prices per token and have enough headroom for failovers
Far-Usual5771@reddit
Everything depends on your budget, but there’s no point in spending a fortune on GPUs or a Mac. You can buy four RTX 5060 Ti 16GB cards: three connected directly to the motherboard via x4 PCIe slots, and the fourth through an NVMe-to-PCIe x4 riser. Then get either 4×48 GB or 4×64 GB of DDR5 RAM running at 6000–6400 MT/s, depending on availability. With this setup, you can comfortably run Qwen3.5-397B quantized to Q4 using llama.cpp at decent speeds—prefill around 700–800 tokens per second and generation at 18–20 tokens per second.
People who claim this is insufficient probably haven’t even used paid API-based models, where speeds typically fluctuate between 25 and 45 tokens per second. Simpler models run extremely fast on this hardware. It’s better to choose an Intel CPU, as it handles 192 GB or 256 GB of RAM across four sticks more stably. The entire setup costs roughly the same as a single RTX 5090, and does things you won't be able to do with just one or even two GPUs, or with a Mac, which at best costs three times as much for the same performance. For example, I can easily maintain a 200K context window with Qwen 3.5 397B and still have plenty of RAM left for other applications.
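Sanity-checking that setup's memory math, with the usual rough rule of thumb (weight footprint ≈ params × bits / 8, ignoring higher-precision embeddings and the KV cache):

```python
# Approximate weight footprint of a quantized model in GB.
def weight_gb(params_billion, bits):
    return params_billion * bits / 8  # 1B params at 8-bit ~= 1 GB

print(weight_gb(397, 4))  # 198.5 GB for a 397B model at Q4
print(weight_gb(397, 2))  # 99.25 GB at Q2
```

So with 4x16GB of VRAM you are offloading most of those ~200 GB to system RAM, which is why the 192-256 GB of DDR5 matters as much as the GPUs.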
Daniel_H212@reddit
What do you value more, speed or quality of responses? If you value quality of responses, get a 512 GB Mac Studio (used M3 or wait for 512 GB M5 tho no guarantee that will exist) and run GLM 5.1 at q4 or q5. If you value speed, get an RTX Pro 6000 and run Qwen3.5 122B.
PermanentLiminality@reddit
Everything in this arena is a compromise of trade offs. Prompt processing speed can be a big one for the coding use case. A real GPU helps a lot here since this phase is compute bound. Non GPU solutions like Apple or Strix Halo do relatively poorly here.
A single RTX Pro 6000 in whatever computer you have will be at the top end of your budget, but will fly on models that fit. The computer it is installed in is relatively unimportant. You can run larger models for less, with a mac or other solutions, but they will be slower by a large factor.
Basically you need to make a judgement of how much speed matters on what size models and go from there.
Start with OpenRouter to see what models will do what you need and then work on a system that will run that model. Don't buy anything without doing this first.
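A minimal sketch of that try-before-you-buy step, assuming OpenRouter's OpenAI-compatible chat endpoint (the model slug below is a placeholder, not a real one):

```python
# Build an OpenAI-compatible chat request; OpenRouter and most hosted
# open-model providers accept this shape.
def build_chat_request(model, prompt, max_tokens=512):
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
    }

payload = build_chat_request("some-provider/some-model", "Refactor this function...")
# POST it with any HTTP client, e.g.:
#   requests.post("https://openrouter.ai/api/v1/chat/completions",
#                 headers={"Authorization": f"Bearer {API_KEY}"},
#                 json=payload)
print(payload["model"])
```

Point the same request at different model slugs and you can benchmark candidates on your real workload for a few dollars before committing to hardware.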
GroundbreakingMall54@reddit
the fact that a $200/month cloud sub is driving people to spend 5-10k on local hardware says everything about the state of these services. if you want maximum flexibility id look at dual 3090s or a mac studio m4 ultra depending on whether you care more about raw vram or power consumption. the 3090 route gives you 48gb for like 2k used and you can actually fine tune on it
IamFondOfHugeBoobies@reddit (OP)
Ironically Opus 4.5 and beyond made me able to program at such a high level I'm at a point where I just want to build my own tooling from the ground up.
I'd happily stay on Claude.AI for two more months, but these degradations are becoming common. I'll need it for a while longer to get free, and they will probably sort out their current projects (in time to release Mythos, overrated model of the month).
But this is not an acceptable long term solution when you're working at my level of complexity.
Specific-Rub-7250@reddit
just pay per use for models like glm or minimax directly on openrouter e.g. That is more cost effective than buying local hardware.
remainedlarge@reddit
Most important thing is VRAM. The sweet spot right now is probably 2 GPUs: 2x [3090/4090/5090] gets you 48-64 GB and will let you run the current crop of models at Q8 or Q4 with good speeds and large context windows (100-250k).
You can technically buy 4 GPUs or an A6000 Blackwell and have 96-128GB in this range, but in my opinion the models themselves are in an awkward spot; there seems to be a gap in useful models in the 70B-100B range right now. The big models (400B+) will spill into RAM, you'll get 10-20 tok/s, and you'll be forced to use Q2 anyways.
I'm not a fan of buying Macs for their large unified memory unless you already needed one for something else or really want one. They are slower at inference for the same price or more than a machine with dedicated GPUs.
Ok_Mammoth589@reddit
Just throw an RTX Pro 6000 into whatever computer you already have. Power limit it to 300W if you have to.