Say I want my own Claude?
Posted by tbandtg@reddit | LocalLLaMA | View on Reddit | 19 comments
What is the absolute cheapest way to get my own Claude, self-hosted? I don't want it to tell me how to write an email, but I do want it to know programming really well, and datasheets.
I would like it to work about as fast as Claude in the cloud does.
Let's assume I am doing this for my own edification, but it is also because, as a software contractor, I never want to expose my customers' code to the cloud. I am not rich by any means and have not even had a customer in a year. But I was using Claude in VS Code this week and it was fantastic.
It would be one user only, working in VS Code. What machine, operating system, model, and backend would get me there for pennies?
go-llm-proxy@reddit
You can't realistically get Opus level locally; I've tried just about everything.
The closest I've found was GLM-5 as Opus, MiniMax-M2.5 as Sonnet, Qwen-3-VL 8B as the vision processor, Paddle for OCR, and Qwen-3.5 9B for Haiku.
To run all that, even quantized, you legitimately need about 0.8 TB of VRAM, so the cheapest option is probably building out an 8x RTX 6000 Pro (Max-Q) rig to run it on.
I built the proxy in my bio to connect all this and use it locally, but I only have 4x 6000 Pros, so I can't host GLM and am using the GLM-5.1 sub from ZAI for Opus; it's working with the rest just fine.
Is it Claude Code? Not really. Is it strong enough to be useful? Yeah, I use it constantly and just use Opus through CC for major things where I know GLM will come up short.
The best more affordable option I've found that actually works convincingly is MM2.5 + Codex once you add back in the missing tooling. It's fast, the context pushes 200k for me at NVFP4, and I prefer it to the Claude Code harness with local models.
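To make the multi-model setup concrete, here is a minimal routing sketch (not my actual proxy): it maps Anthropic-style tiers to local OpenAI-compatible servers, and the ports, base URLs, and model names are placeholders for whatever you actually run.

```python
# Rough sketch of the routing idea: map Anthropic-style model tiers to local
# OpenAI-compatible servers (llama.cpp, vLLM, etc.) and forward chat requests.
# All endpoints and model names below are placeholders.
from openai import OpenAI

BACKENDS = {
    "opus":   {"base_url": "http://localhost:8001/v1", "model": "glm-5"},
    "sonnet": {"base_url": "http://localhost:8002/v1", "model": "minimax-m2.5"},
    "haiku":  {"base_url": "http://localhost:8003/v1", "model": "qwen-3.5-9b"},
}

def chat(tier: str, messages: list[dict]) -> str:
    """Route a chat request to the local model standing in for the given tier."""
    backend = BACKENDS[tier]
    client = OpenAI(base_url=backend["base_url"], api_key="not-needed-locally")
    resp = client.chat.completions.create(model=backend["model"], messages=messages)
    return resp.choices[0].message.content

if __name__ == "__main__":
    print(chat("sonnet", [{"role": "user", "content": "Write a binary search in Python."}]))
```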
Hector_Rvkp@reddit
Strix Halo with 128 GB of RAM is the cheapest rig ($2,200) with memory fast enough to run something big that's usable. Anything slower means plain DDR5, and that's not usable. Anything faster gets a lot more expensive, at which point you have to decide between an Nvidia GPU + DDR5 vs Apple silicon vs MAYBE a DGX Spark. But $2,200 gets you in the game.
It is obviously NOT as competent or fast as cloud Claude.
The stack sucks, but it sucks less and less since Jan '26.
LevianMcBirdo@reddit
Whether it's usable really depends on your tasks. I have a hoard of MoEs running on my 780M with 96 GB of DDR5. With a lot of context and big models it easily dips into single-digit tokens per second, but I have time.
Hector_Rvkp@reddit
Sub-ten tokens per second is going to make sense in very, very specific use cases. For almost everyone, it just sucks. For someone who is looking to BUY hardware, it would be extremely misguided.
SweetHomeAbalama0@reddit
Absolute cheapest?
First, find about $100k for upfront expenses, add thousands more for electrical/cooling infrastructure, set aside an extra 5-10% for wiggle room, then expect hundreds per month in opex for as long as the unit is in operation. If you cannot administer it yourself, add additional budget for hiring a contractor to maintain the hardware for you.
Load the highest quant of Kimi K2/2.5 that will fit in its VRAM, and that is your "cheapest" Claude-like self-hosted coder.
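As a back-of-the-napkin way to think about "highest quant that will fit", here is a toy sketch; the file sizes and the KV-cache allowance are made-up placeholders, not real figures for Kimi, so check the actual GGUF sizes for your context length before budgeting.

```python
# Toy helper: pick the largest quant whose weights (plus some KV-cache headroom)
# fit in available VRAM. Sizes are illustrative placeholders, not real numbers.
QUANT_SIZES_GB = {"Q8_0": 1100, "Q6_K": 850, "Q4_K_M": 620, "Q3_K_M": 480, "Q2_K": 360}

def pick_quant(vram_gb: float, kv_cache_gb: float = 60) -> str | None:
    """Return the biggest quant that fits, or None if even the smallest doesn't."""
    usable = vram_gb - kv_cache_gb
    for name, size in sorted(QUANT_SIZES_GB.items(), key=lambda kv: -kv[1]):
        if size <= usable:
            return name
    return None

print(pick_quant(768))  # 'Q4_K_M' with these placeholder numbers
```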
jhov94@reddit
They make an Nvidia workstation with 786 GB of VRAM for a bit under $100k; with that you could run GLM-5 or Kimi K2.5 at the speeds you're used to, at a quality that is just shy.
NoahFect@reddit
No one is actually shipping those yet, though.
ActEfficient5022@reddit
And it's under 100k?
nmay-dev@reddit
Probably not, after taxes, tariffs, the upgraded electrical panel you will need to power it, and of course shipping.
ActEfficient5022@reddit
Sheeit. I had less than 100k to throw around but over 100k is a little too rich for my blood 😮💨
ttkciar@reddit
You're not going to get it for pennies, but you could host GLM-5 quantized to Q4_K_M on a cluster of four older Xeon systems with four MI210s each (sixteen MI210s total), using rpc-server from llama.cpp, for about $120K if you can get a good deal on the MI210s. Maybe up to twice that much. That would get you something about 80% as good as Claude and perhaps 10% as fast.
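For anyone who hasn't seen llama.cpp's RPC backend, here is a rough sketch of the wiring; hostnames, ports, and the model filename are placeholders, and the exact flags may differ between llama.cpp builds, so check rpc-server --help and llama-server --help on yours.

```python
# Sketch of driving a llama.cpp RPC cluster from the head node. Each worker
# is assumed to already be running something like: rpc-server -H 0.0.0.0 -p 50052
# Hostnames, port, and model filename are placeholders.
import subprocess

WORKERS = ["xeon-1", "xeon-2", "xeon-3"]
RPC_ENDPOINTS = ",".join(f"{host}:50052" for host in WORKERS)

# Head node: llama-server splits the model across its own GPUs plus the RPC workers.
subprocess.run([
    "llama-server",
    "-m", "GLM-5-Q4_K_M-00001-of-00009.gguf",  # placeholder filename
    "--rpc", RPC_ENDPOINTS,
    "-ngl", "999",      # offload as many layers as the combined backends allow
    "-c", "65536",      # context size; tune to taste
    "--port", "8080",
], check=True)
```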
Ztoxed@reddit
With prices still creeping up on RAM, GPUs, and motherboards with enough lanes, I cannot see, in my inexperienced mind, how it could be done offline at low cost.
My brain tells me you could buy a nice new car for what this would cost.
And that's before the electricity bill, which is ongoing.
tu9jn@reddit
It is simple, really: there is nothing that gets close to Claude.
Not for any money, because there is no Claude-tier open-source model available.
The smallest somewhat usable coding model right now is Qwen/Qwen3.5-35B-A3B; you need 32 GB of RAM and preferably an Nvidia GPU with 8+ GB of VRAM.
But it's nothing like Opus. You should try it out through an API before spending your money.
dark-light92@reddit
This is wrong. You only need to wait for about a year.
Or, you can invent a time machine...
redditscraperbot2@reddit
>You should try it out through an API before spending your money.
This reads like a crack dealer recommending they try a little sample just to see if they like it.
tu9jn@reddit
OP doesn't have much money.
Better to spend 5 bucks on an API before buying a rig and being disappointed.
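In practice that five bucks goes a long way; here is roughly what the taste test looks like with any OpenAI-compatible provider. The base_url shown is OpenRouter's, and the model slug is a placeholder, so use whatever id your provider lists for the model you're eyeing.

```python
# Quick taste test before buying hardware: point the OpenAI client at an
# OpenAI-compatible provider and try the open model you're considering.
from openai import OpenAI

client = OpenAI(
    base_url="https://openrouter.ai/api/v1",
    api_key="YOUR_API_KEY",
)

resp = client.chat.completions.create(
    model="qwen/qwen3.5-35b-a3b",  # placeholder slug -- use the provider's exact id
    messages=[{"role": "user", "content": "Refactor this C function to avoid a double free: ..."}],
)
print(resp.choices[0].message.content)
```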
Fun_Smoke4792@reddit
Years later? Maybe.
Front_Eagle739@reddit
The cheapest way to get close to (but not matching) Claude is not cheap. You need the big open models to get a similar experience, with more holes, and then you have to spend way more for speed. On a reasonable home-user budget you are talking "can just run the Claude Code/OpenCode tools but needs a LOT more human-in-the-loop planning and reviewing, so expect a lot of effort guiding the model" with Qwen3 Coder Next, the Qwen 3.5s like the 27B, and Devstral 2 24B. You will want a big, fast GPU with 24 GB or 32 GB of VRAM. If you want to run the bigger MoEs like Qwen 3.5 122B you'll need to offload and speed will plummet, but it is usable with fast enough DDR5 RAM.
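To make the offload part concrete, here is roughly what partial GPU offload looks like with llama-cpp-python; the model filename, layer count, and context size are placeholders you'd tune to your card and quant.

```python
# Rough illustration of partial offload with llama-cpp-python: keep as many
# layers as fit on the GPU, spill the rest to system RAM. Model path and
# n_gpu_layers are placeholders.
from llama_cpp import Llama

llm = Llama(
    model_path="qwen3.5-122b-Q4_K_M.gguf",  # placeholder filename
    n_gpu_layers=30,   # layers that fit in 24-32 GB of VRAM; the rest run on CPU/DDR5
    n_ctx=32768,       # long context costs extra memory, so budget for it
)

out = llm.create_chat_completion(
    messages=[{"role": "user", "content": "Add unit tests for this parser."}]
)
print(out["choices"][0]["message"]["content"])
```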
The absolute minimum for a remotely similar experience, though, is Q4 quants of the MiniMax M2.5s and Step 3.5-class 200B models, and also Devstral 2 123B. These you can expect to run OK on 128 GB of VRAM. This is where your Sparks, MacBooks, and AI Max processors end up being the best bang for the buck. It'll be slow (very slow in Devstral's case) but pretty good at going off and doing what you want. It won't be close to Claude Code in quality or intelligence, but it will be able to work in the same sort of way. If you want to run this class WELL, quick enough to really feel like the API, you want 4x GPUs like RTX 3090s or RTX 5090s and a PC with enough PCIe lanes to fit them all. Spendy, but with second-hand parts you can get into that tier for a few grand, a lot of effort, and lots of electricity.
For the models that actually get into the Sonnet 4.5 intelligence tier? Kimi 2.5, GLM-5, GLM-4.7, Qwen 397? You really need 512 GB of VRAM or unified memory. $12k or so for a big Mac Studio that runs them very slowly (set them going, then head off for coffee and lunch) is the cheap way. 6x RTX 6000 Pros with 96 GB of VRAM each is the fast way.
For Opus tier? API.
RedParaglider@reddit
You don't. You aren't getting SOTA models on a budget. End of story.
Can you have an intelligent model that is able to code about as well as, say, Sonnet 4? I think Qwen 3 Coder Next is pretty close, and that can be run pretty well on a $2,500 Strix Halo system.