How much hardware to self-host a setup comparable to Claude Sonnet 4.6?
Posted by SKX007J1@reddit | LocalLLaMA | View on Reddit | 61 comments
OK, I need to preface this with the statement that I have no intention of doing this, but I'm fascinated by the concept.
I have no use case where spending more money than I have on hardware would be remotely cost-effective or practical, given how cheap my subscriptions are in comparison.
But... I understand there are other people who need to keep it local.
So, purely from a thought experiment angle, what implementation would you go with, and in the spirit of home-lab self-hosting, what is your "cost-effective" approach?
ai_guy_nerd@reddit
Sonnet 4.6 equivalent is genuinely difficult with current hardware. You're looking at a 405B+ parameter model, and even quantized down to 4-bit, that's still ~200GB of VRAM. You'd need multiple high-end GPUs or a serious TPU setup.
Practical take: self-hosting makes sense for specialized tasks (domain-specific models, privacy-critical work) where you don't need raw reasoning power. For general-purpose reasoning at Sonnet's level, the cost-per-query still favors cloud services unless you're running massive volume. An RTX 4090 + CPU is ~$5K and gets you maybe 3-5 tokens/sec on a reasonably-sized model. At current pricing, Claude API is still cheaper unless you're hitting it >10K times daily.
If you're exploring it anyway, look at together.ai or baseten for hosted quantized 405B. Or go the hybrid route: local for embedding/filtering, Claude for reasoning. Best of both worlds for privacy + performance.
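That break-even claim can be sanity-checked with back-of-envelope arithmetic; every price below is an illustrative assumption, not a real quote:

```python
# Rough local-vs-API break-even sketch. All numbers are assumptions.
HARDWARE_COST = 5000.0            # assumed RTX 4090 + CPU build, USD
HARDWARE_LIFETIME_DAYS = 3 * 365  # assume 3 years of useful life
POWER_COST_PER_DAY = 2.0          # assumed electricity cost, USD/day
API_COST_PER_QUERY = 0.01         # assumed blended API price, USD

def local_cost_per_day() -> float:
    # Amortized hardware plus power.
    return HARDWARE_COST / HARDWARE_LIFETIME_DAYS + POWER_COST_PER_DAY

def breakeven_queries_per_day() -> float:
    # Below this volume the API is cheaper; above it, local wins.
    return local_cost_per_day() / API_COST_PER_QUERY

print(f"{breakeven_queries_per_day():.0f} queries/day to break even")
```

With these made-up prices the crossover lands in the hundreds of queries per day; plug in real prices and the threshold moves accordingly.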
MaxKruse96@reddit
~375.000€ https://www.deltacomputer.com/nvidia-dgx-b300-2304gb.html
have fun!
SKX007J1@reddit (OP)
Only 8 B300's?
MaxKruse96@reddit
for <10 concurrent users yes.
georgemp@reddit
Hold on... 375k euros for 10 concurrent users. How do any of these providers make any money? Even if we were to say 10% of our user base is concurrent, that would be 100 users serviceable by this config. Even at $100 per subscription, that's over 3 years just to break even (not even counting operational costs). Do I have the math wrong?
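Spelling my math out (the 375k figure is from the link above; the 10% concurrency ratio and 100-per-month subscription are my own assumptions):

```python
# Back-of-envelope provider economics. Concurrency ratio and
# subscription price are assumptions, not real figures.
hardware_cost = 375_000        # DGX B300 price from the link above, EUR
concurrent_users = 10
concurrency_ratio = 0.10       # assume 10% of users are online at once
subscription = 100             # assumed monthly price per user

total_users = concurrent_users / concurrency_ratio   # 100 users
monthly_revenue = total_users * subscription         # 10,000 per month
months_to_break_even = hardware_cost / monthly_revenue
print(months_to_break_even / 12)  # 3.125 years, ignoring opex
```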
MaxKruse96@reddit
There is no AI company that is in the green. None. Even if they make a "profit" like Anthropic, the time to get that investment back and be at net 0 is in the hundreds of years, if at all.
georgemp@reddit
I knew companies like OpenAI and Anthropic (the ones that research and make models) were deeply in the red. But I figured OpenCode, Ollama, etc. (providers) were able to break even, at least…
DanRey90@reddit
2x512GB Mac Studios (wait for the M5 Ultra release) can run any model the DGX can, just slower. For a homelab or small company (less than 10 concurrent users), that’s enough. That’s about 25.000€.
SexyAlienHotTubWater@reddit
That's a whack approach, the tokens per second will be horrifying. Just get a tinybox or something at that price point.
DanRey90@reddit
How would they be horrifying? Look at benchmarks for the M5 Max, multiply them by 2, and you get what a single M5 Ultra would achieve. Maybe you manage to make the 2 of them work in tensor parallel, maybe not, but that's your floor. It will have over 1,000GB/s of bandwidth, and the biggest SOTA model has about 35B active params, so assuming fp8 and some overhead, that's over 20 t/s for a single user. Fairly usable. Batching is another story.
OBVIOUSLY it would be slower than the 375.000€ DGX. Curious that you didn’t consider THAT a “whack approach”.
A TinyBox "or something" can't run the biggest SOTA models (GLM 5, Kimi, DeepSeek, Qwen 397B); 1TB of "slow" RAM beats 384GB of fast VRAM when you try to run something much larger than 384GB. Maybe you can make do with the TinyBox if you forget about Kimi and GLM and accept some light quantization, but now you're compromising, and the TinyBox costs more than double what 2 Mac Studios do, so it's not really comparable.
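For what it's worth, the decode-speed estimate works out like this (the bandwidth, active-parameter count, and overhead factor are the assumptions stated above):

```python
# Memory-bound decode estimate: every generated token must stream
# all active weights through the memory bus.
bandwidth_gb_s = 1000     # assumed M5 Ultra memory bandwidth, GB/s
active_params_b = 35      # active params of a big MoE, in billions
bytes_per_param = 1.0     # fp8
overhead = 1.3            # assumed KV-cache reads and inefficiency

bytes_per_token_gb = active_params_b * bytes_per_param * overhead
tokens_per_sec = bandwidth_gb_s / bytes_per_token_gb
print(f"{tokens_per_sec:.1f} t/s")  # ~22 t/s for a single user
```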
SexyAlienHotTubWater@reddit
"I love ice cream" "Oh so you hate waffles???" ass response. Both are terrible approaches unless your goal is efficiency.
That's 10 trillion tokens on Kimi 2.5. If your goal is single-user, just buy them; that's going to last you years.
Sure, two macs will give you access to the big models, but you could also just run a smaller amazing model at like 10x the speed.
> It will have over 1,000GB/s bandwidth
Apple's website claims 650GB/s per unit. For $25k!!!! Single-user you won't be able to exploit combined bandwidth.
Another option would be an AMD datacenter GPU with 288GB of memory and 8,000GB/s of bandwidth.
DanRey90@reddit
Well, sure, self-hosting a frontier model is never the economical choice, we all know that. But there's "expensive" and there's "absurd". $60,000 goes into "absurd" territory.
Running “a smaller model faster” goes entirely against the premise of this post. You can’t get Sonnet level on anything less than the big open-source models (Minimax maybe? But that won’t last, they say their next model will be bigger). But the whole premise of this post is engagement bait, so eh, whatever.
You looked at the specs for the M5 Max. The M5 Ultra will probably have double that, so 1,200GB/s, assuming the pattern holds (an Mx Ultra has always been 2 Mx Max dies stitched together: double bandwidth, double CPU cores, double GPU cores). That's not a given, but it's an educated guess.
You have the same problem with the MI350X; for SOTA models you need to stack at least 2, and that's over $50k.
For running SOTA on low concurrency scenarios, it’s either stacking Mac Studios when the M5 Ultra is released (slow, everyone has their own tolerance for speed), or stacking GPUs (>$50k).
SexyAlienHotTubWater@reddit
I think all of these approaches are retarded but if you're going to be retarded, you might as well go 8000GB/s retarded. Double the price for 8x the braincell loss - it's a no-brainer.
(I think I probably was reading the M5 Max's specs - you're right. That's wild bandwidth for a consumer mini PC.)
DanRey90@reddit
LOL, fair enough.
Wild indeed, although at those price points, I wouldn't call that thing a "mini PC" anymore. It's a whole-ass workstation, just with the Apple packaging, and they can get away with it because of the insane thermals. I hope they don't get too greedy with this next generation; it will be the first one actually usable for LLMs.
PotatoQualityOfLife@reddit
"\~15 kW max"
Oh, is that all? LOL
Cold_Tree190@reddit
You think they run Black Friday deals?
MaxKruse96@reddit
only if you manage to hijack the truck
equatorbit@reddit
Excellent. My Kit Kat hijack will work very well as a trial run.
Possible-Pirate9097@reddit
Nah just camp outside an OpenAI datacenter and sneak in once it's hit by Iran.
Randommaggy@reddit
If I win the lottery there will be signs..
Able-Locksmith-1979@reddit
The problem is the model and tooling, not the raw GPU power. If it were just raw GPU power, some companies could combine to buy it.
LoSboccacc@reddit
You need ~1TB of memory, give or take, to host a quant of a top OSS model plus the context. Depending on the speed you want, you can get a stack of M4 Ultras (or wait for the M5 Ultra) and have something that costs, I don't know, 15 to 20 years of a Claude Max subscription.
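A rough way to size that (quantized weights plus KV cache; every number below is an illustrative placeholder, not any real model's config):

```python
def model_memory_gb(params_b, weight_bits, layers, kv_heads, head_dim,
                    kv_bits, context_tokens):
    """Back-of-envelope memory: quantized weights + KV cache."""
    weights_gb = params_b * weight_bits / 8   # billions of params -> GB
    # The KV cache stores one K and one V vector per layer per token.
    kv_bytes_per_token = 2 * layers * kv_heads * head_dim * kv_bits / 8
    kv_gb = kv_bytes_per_token * context_tokens / 1e9
    return weights_gb + kv_gb

# Hypothetical 700B-class dense-attention model, 4-bit everything,
# 200k-token context: ~350GB weights + ~147GB KV cache.
print(model_memory_gb(700, 4, 90, 64, 128, 4, 200_000))
```

Real deployments land higher once you add activation buffers, batching, and runtime overhead, which is how these configs creep toward the 1TB mark.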
SKX007J1@reddit (OP)
Sorry, I should have been clearer: I'm not talking about actually self-hosting Sonnet, more the theoretical comparison of a self-hosted configuration that gets into the same ballpark in the very specific use case of coding.
LoSboccacc@reddit
Well, me neither, as Sonnet is private-weight. In the ballpark of Sonnet are 700B models, and to have sufficient context for coding, that's another large chunk of RAM getting used. GLM at 4-bit + a 4-bit 200k context goes to 900GB or something.
sleepingsysadmin@reddit
Minimax 2.7 is Sonnet-strength, at 230B.
Prosumer:
2x DGX Spark
1x DGX Station
2x RTX Pro 6000
Rack mount:
6x 5090s, or R9700s, or Intel B70s
8x 24GB GPUs.
So probably in that $10,000-40,000 range.
SKX007J1@reddit (OP)
Thank you, appreciate you reading the assignment rather than just being like "oh good, yet another post from someone who wants to run Claude Opus on a V100"
Sincerely appreciated.
sleepingsysadmin@reddit
The one thing I would say, though: hold off by about 1 year, 1.5 years tops, and save $200/month.
DDR6 drops soon.
Medusa Halo (Strix Halo's successor) will likely have a 384-bit bus with 192GB of RAM.
Something that might cost just $4,000 will suddenly be Sonnet level. That's assuming Minimax 3 isn't better, that StepFun isn't better, that others don't show up in this slot.
NotArticuno@reddit
Don't buy a GPU to run the local models now.
Use what you have.
There will be dedicated cards in 3-5 years running at literally 1000x current consumer card speeds, at the same price as current GPUs.
I love playing with local LLMs on my 2080 Ti, but I bought that shit to play Rust; it just happens to also be able to generate a few tokens.
You're going to spend minimum $1k on a GPU that will disappoint you and be obsolete VERY soon.
SexyAlienHotTubWater@reddit
3-5 years is a long time. That's longer than it takes to get a degree.
The GPU supply chain is seriously fucked right now. You might be right about long time horizons (1000x is optimistic; Moore's law says more like 8x), but in the immediate term we're seeing insane spikes in consumer demand for LLMs and severe constraints on GPU supply.
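(For reference, the doubling math behind that 8x figure; the cadence is an assumption:)

```python
def moores_law_gain(years, doubling_period_years=1.5):
    # Assumed performance doubling every ~18 months.
    return 2 ** (years / doubling_period_years)

print(moores_law_gain(3.0))  # 4x in 3 years
print(moores_law_gain(4.5))  # 8x in 4.5 years
```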
NotArticuno@reddit
It's not that GPUs will get 1000x more powerful; it's dedicated AI processing units that I think will realize that gain. Something with dedicated memory and a smaller model "burned" right into the chip! I'll see if I can find what I was reading that described this.
CalligrapherFar7833@reddit
Are you talking about the ASICs that were running a Llama model?
NotArticuno@reddit
Did a quick Google and I honestly have no idea. I'm probably spouting misinformation.
Perhaps I'm just thinking of hearing about the idea of model-on-chip, and I had some hallucinated conversation with a chat AI fantasizing about the future.
CalligrapherFar7833@reddit
It's called Taalas?
NotArticuno@reddit
Ah thank you, yes that's one company doing it.
CalligrapherFar7833@reddit
That means you are wrong; it's cost-prohibitive to burn larger-parameter models this way.
NotArticuno@reddit
Actually, you're wrong! The performance of the smaller-parameter models is insane. I don't need a trillion-parameter model burned into silicon. You can get insane results out of well-optimized, much smaller models. Imagine qwen3.5:9b at 10k tokens per second. It would be absolutely insane!
CalligrapherFar7833@reddit
The topic is "comparable to Sonnet 4.6", not your local 9B LLM.
NotArticuno@reddit
Okay fair, I haven't looked into the feasibility of larger models.
But I think it's safe to assume the performance of what we'll be able to squeeze into a much smaller package in a few years will be insane. Compare a 9B model today to a 9B model two years ago…
CalligrapherFar7833@reddit
Again, you won't be able to afford it.
NotArticuno@reddit
Wait, why assume that? I just did a quick search on Taalas and the r/singularity post about it from a month ago, and I don't understand your claim that baking in large models is impossible and super expensive. Can they not scale up the fabs? OK, what I'll agree on is that scaling up fabs to the point of consumer affordability is probably more than 5 years out, lol. Does sub-10 feel more reasonable?
CalligrapherFar7833@reddit
Taalas costs a shit ton of money, dude. It's not cheap.
SexyAlienHotTubWater@reddit
Nah, you're right. The network is burned directly onto the chip, i.e. the weights are right next to the computation units, they don't need to be pulled over from VRAM. Basically eliminates the concept of bandwidth, resulting in dramatic speedup at extremely low wattage.
SexyAlienHotTubWater@reddit
Ah, I see what you mean. In that case I agree with you - but I don't think it's that close. In the meantime, GPUs are our model runners.
NotArticuno@reddit
Haha, yeah, I'm definitely being hopeful with that timeline. My understanding is they can burn the weights right into the chip next to the processing units, so stuff doesn't need to get transferred from VRAM. That's a layman's understanding, though, lol.
SexyAlienHotTubWater@reddit
I dunno, to be honest. Especially if there's a GPU squeeze, burning a capable model onto a chip doesn't seem that far-fetched; 3-5 years seems totally reasonable to me.
NotArticuno@reddit
I sure hope so! Someone mentioned Taalas as one company that's doing it. I'm not researching others right now; just wanted to share the name.
ea_man@reddit
> (1000x is optimistic - Moore's law, 4 years means 4x, 8x at a stretch)
He meant ASICs, dedicated hardware.
SKX007J1@reddit (OP)
I have no intention to do this. Just interested in what the options would be for people who have to keep their code offline.
NotArticuno@reddit
Ah okay. Yeah I haven't found anyone who's actually using a local model for production. As far as I know we are all just tinkering doing hobby stuff.
ProfessionalSpend589@reddit
Maybe not for production, but I did use it for research and to test how things might perform before doing any real work myself.
SexyAlienHotTubWater@reddit
You don't need 5090s. 2x Arc B70s will get you the same VRAM at a third of the tok/s, decent enough, for $1900. The 5090s will be like $8000.
NotArticuno@reddit
That's a much better suggestion!
Randommaggy@reddit
I'm planning on, and am in early testing of, using an 8B model for some small tasks in production.
It's cheap enough to run without being a financial drain on the company when the large players stop subsidizing usage, and it's good enough to be a net benefit in a few small features.
Mostly intelligent suggested values to make data entry faster for humans.
Saves a few minutes here and there, improving UX, and it's easy to ground to the point where hallucinations don't make it a net-negative contributor to the user experience, like way too many AI features that currently ship.
Long_comment_san@reddit
Who would pay for Sonnet if there was a local alternative that's going to be free forever?
SKX007J1@reddit (OP)
Can you not see how people could be confused and seek clarification when, in the very same thread, you have your comment and a comment saying "Depending on your use case, models that can be run locally on relatively modest hardware can compete with cloud behemoths"?
But in answer to your question: for some people, a high upfront cost for infrastructure is less of an issue when it's tax-deductible.
I feel like people are reading my post as "I want Claude Opus on my 5060, or do I need to buy 2 Tesla V100 SXM2 32GB HBM2 Volta GPU Accelerator Card to be able to do this?"
eli_pizza@reddit
People who don’t want a quarter mil in up front infra costs?
Bulky-Priority6824@reddit
Sorry, mate, but a large part of the world doesn't even know the difference between a USB-C port and a TB4 port, let alone all of this AI speak.
Herr_Drosselmeyer@reddit
Self-hosting a model of that size is currently not feasible unless you want to spend hundreds of thousands or are willing to accept having it run really slowly.
But you really don't need to. Gemma 4-31B comes damn close and runs on consumer hardware (albeit high-end consumer hardware). For instance, on Chatbot Arena, we have:
Its closest competitor on this ranking, for non-proprietary models, is Qwen3.5-397B-A17B but that's also very hard to run locally. So you're getting very close to Sonnet level in real-user preference for a tiny fraction of the price.
And there's constant evolution, Gemma may be the star right now, but who knows what will surpass it next month, maybe even next week?
TLDR: Depending on your use case, models that can be run locally on relatively modest hardware can compete with cloud behemoths.
FuckDeRussianFuckers@reddit
Gemma 4 doesn't even come close.
I got some coding results back where it was putting newlines in the middle of the variable types, so instead of
you'd get
That's pretty poor.
Similarly, you'd get quoted strings with newlines in them so it'd be something like:
... which obviously won't compile.
ikkiyikki@reddit
For ~20k I have a regular PC w/ two 6000 Pros that runs Qwen3.5 397 IQ4. These two models are comparable (though the speed is obviously much slower).
KaMaFour@reddit
Define comparable.
If Qwen3.5-27B is comparable enough, then a few thousand for a 5090 (or maybe even some cheaper 32GB card like the Arc Pro B70? (no first-hand experience with Intel GPU support)) will do. That's stretching the definition of comparable, but it should be fine.