Open-source model that is as intelligent as Claude Sonnet 4
Posted by vishwa1238@reddit | LocalLLaMA | View on Reddit | 290 comments
I spend about 300-400 USD per month on Claude Code with the max 5x tier. I’m unsure when they’ll increase pricing, limit usage, or make models less intelligent. I’m looking for a cheaper or open-source alternative that’s just as good for programming as Claude Sonnet 4. Any suggestions are appreciated.
innahema@reddit
I can confirm from my own experience with both Kimi 2.5 and Sonnet 4.6 that Sonnet is WAY smarter. But Kimi beats Haiku 4.5 by a wide margin.
RhubarbSimilar1683@reddit
Maybe qwen: https://www.reddit.com/r/LocalLLaMA/comments/1mllt5x/imagine_an_open_source_code_model_that_in_the/
Thomas-Lore@reddit
Look into:
GLM-4.5
Qwen 3 Coder
Qwen3 235B A22B Thinking 2507 (and the instruct version)
Kimi K2
DeepSeek: R1 0528
DeepSeek: DeepSeek V3 0324
All are large and will be hard to run locally unless you have a Mac with lots of unified RAM, but they will be cheaper than Sonnet 4 via API. They may be worse than Sonnet 4 at some things (and better at others); you won't find a 1:1 replacement.
itchykittehs@reddit
Just to note, practical usage of heavy coding models is not actually very viable on macs. I have a 512gb M3 Ultra that can run all of those models, but for most coding tasks you need to be able to use 50k to 150k tokens of context per request. Just processing the prompt with most of these SOTA open source models on a mac with MLX takes 5+ minutes with 50k context.
It's fine if you are using much less context. But for most projects that's not feasible.
Opteron67@reddit
xeon with AMX
EridianExplorer@reddit
This makes me think that for my use cases it does not make sense to try to run models locally, until there is some miracle discovery that does not require giant amounts of ram for contexts of more than 100k tokens and that does not take minutes to achieve an output.
FroyoCommercial627@reddit
Local LLMs are great for privacy and small context windows, bad for large context windows.
FroyoCommercial627@reddit
Time to first token is the biggest issue with Macs.
Prefill computes attention scores for every single token pair (32k x 32k = 1 Billion scores / layer)
128gb - 512gb unified memory is fast and can fit large models, but the PRE-FILL phase requires massive parallelism.
Cloud frontier models can spread this out to 16+ THOUSAND cores at a time. Mac can spread to 40 cores at most.
Once pre-fill is done, we only need to compute attention for ONE token at a time.
So, Mac is GREAT for linear processing needed for inference, BAD for parallel processing needed for pre-fill.
That said, speculative decoding, KV caching, sparse attention, etc are all tricks that can help solve this issue.
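The prefill-vs-decode asymmetry above can be put into rough numbers. A minimal sketch (the layer count is an assumed example, and attention-score counts stand in for actual FLOPs):

```python
# Back-of-envelope: attention work during prefill vs decode.
# Counts attention scores, not FLOPs; layer count is an assumed example value.

def prefill_vs_decode(context_tokens: int, layers: int = 48) -> tuple[int, int]:
    # Prefill: every token attends to every token, once per layer (quadratic).
    prefill_scores = context_tokens * context_tokens * layers
    # Decode: each new token attends only to the cached context (linear).
    scores_per_new_token = context_tokens * layers
    return prefill_scores, scores_per_new_token

prefill, per_tok = prefill_vs_decode(32_000)
print(f"prefill: {prefill:.2e} scores")     # 32k x 32k ~ 1e9 per layer, times 48 layers
print(f"per decoded token: {per_tok:.2e}")  # 32,000x less work than the whole prefill
```

At 32k context the prefill does 32,000x more attention work than generating a single token, which is why a Mac that decodes comfortably can still take minutes to chew through a long prompt.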
__JockY__@reddit
Agreed. It’s a $40k+ proposition to run those models at cloud-like speeds locally. Ideally you’d have at least 384GB VRAM (e.g. 4x RTX A6000 Pro 96GB), 12-channel CPU (Epyc most likely), and 12 RDIMMS for performant system RAM. Power, motherboard, SSDs…
If you’ve got the coin then… uh… post pics 🙂
HerrWamm@reddit
Well, that is the fundamental problem that someone will have to solve in the coming months (I'm pretty sure it will not take years). Efficiency is the key: whoever overcomes the efficiency problem will "win" the race, and scaling is certainly not the solution here. I foresee small, very nimble models coming very soon, without a huge knowledge base, relying instead on RAG (just like humans: they don't know everything, but learn on the go). These will dominate the competition in the coming years.
DistinctStink@reddit
I would rather it admit a lack of knowledge, know when it's wrong, and be able to learn, instead of bullshitting and talking like I'm going to fight it if it makes a mistake. I really dislike how the super-polite ones use flowery words to excuse their bullshit lying.
utilitycoder@reddit
Token conservation is key. Simple things help, like running builds in quiet mode so they only output errors and warnings. You can do a lot with a smaller context if you're judicious.
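That kind of log trimming can be sketched in a few lines (the keyword pattern here is an assumption; real toolchains need their own filters):

```python
import re

# Keep only lines a model actually needs to see; drop build-progress chatter.
KEEP = re.compile(r"\b(error|warning|failed)\b", re.IGNORECASE)

def compact_log(log: str, max_lines: int = 50) -> str:
    kept = [line for line in log.splitlines() if KEEP.search(line)]
    return "\n".join(kept[:max_lines])

raw = """\
Compiling module a...
src/a.c:10: warning: unused variable 'x'
Compiling module b...
src/b.c:42: error: expected ';' before '}'
Build FAILED with 1 error, 1 warning
"""
print(compact_log(raw))  # only the warning, the error, and the FAILED summary survive
```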
Western_Objective209@reddit
Doesn't using a cache mitigate a lot of that? When I use Claude Code at work it is overwhelmingly reads from cache: I get a few million tokens of cache writes and 10+ million cache reads.
Final-Rush759@reddit
They should have released an M4 Ultra; that should have had > 1.1 TB/sec memory bandwidth.
notdba@reddit
I guess many of the agents are still suffering from a similar issue as https://github.com/block/goose/issues/1835, i.e. they may mix some small requests in between that totally breaks prompt caching. For example, Claude Code will send some small simpler requests to Haiku.
If prompt caching works as expected, then PP should still be fine on Mac.
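A toy illustration of that failure mode (this grossly simplifies real provider caches, which key on exact prompt prefixes, but it shows why an interleaved small request forces a full re-prefill):

```python
import os

class PrefixCache:
    """Toy prefix cache: only the most recent prompt's prefill work stays warm."""
    def __init__(self):
        self.cached = ""

    def process(self, prompt: str) -> int:
        """Return how many characters (stand-ins for tokens) must be prefilled fresh."""
        common = len(os.path.commonprefix([self.cached, prompt]))
        self.cached = prompt
        return len(prompt) - common

cache = PrefixCache()
history = "SYSTEM: coding agent rules...\nUSER: fix the bug\n"
cache.process(history)                                    # first call: all fresh
cheap = cache.process(history + "USER: now add tests\n")  # only the new suffix is fresh

cache.process("Summarize this diff in one line.")         # small interleaved request
costly = cache.process(history + "USER: next task\n")     # whole history fresh again
print(cheap, "<", costly)
```

The follow-up turn is cheap, but one unrelated side request in between means the next long-context turn pays full prefill again.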
DrummerPrevious@reddit
I hope Memory bandwidth increases on upcoming macs
illusionst@reddit
I’m using GLM 4.5 with Claude Code. I think this easily replaces Sonnet 4. The tool calling is good and it’s much faster than Sonnet.
vossage_RF@reddit
Gemini Pro 2.5 is NOT more expensive than Sonnet 4.0!
vishwa1238@reddit (OP)
Thanks, I do have a Mac with unified RAM. I’ve also tried O3 with the Codex CLI. It wasn’t nearly as good as Claude 4 Sonnet. Gemini was working fine, but I haven’t tested it out with more demanding tasks yet. I’ll also try out GLM 4.5, Qwen3, and Kimi K2 from OpenRouter.
Caffdy@reddit
the question is how much RAM?
fairrighty@reddit
Say 64 gb, m4 max. Not OP, but interested nonetheless.
pokemonplayer2001@reddit
You’ll be able to run nothing close to Claude. Nowhere near.
txgsync@reddit
So far in, even just the basic Qwen3-30b-a3b-thinking in full precision (16-bit, 60GB safetensors converts to MLX in a few seconds) has managed to produce simple programming results and analyses for me in throwaway projects similar to Sonnet 3.7. I haven’t yet felt like giving up use of my Mac for a couple of days to try to run SWEBench :).
But Opus 4 and Sonnet 4 are in another league still!
NamelessNobody888@reddit
Concur. Similar experiences here (*). The thing just doesn't compare to full-auto mode working to an implementation plan in CC, Roo, or Kiro with Claude Sonnet 4, as you rightly point out.
* Did you find 16 bit made a noticeable difference cf. Q_8? I've never tried full precision.
txgsync@reddit
4 bit to 16 bit Qwen3-30B-A3B is … weird? Lemme think how to describe it…
So like yesterday, I was attempting to “reason” with the thinking model in 4 bit. Because at >100tok/sec, the speed feels incredible, and minor inaccuracies for certain kinds of tasks don’t bother me.
But I ended up down this weird rabbit hole of trying to convince the LLM that it was actually Thursday, July 31, 2025. And all the 4-bit would do was insist that no, that date would be a Wednesday, and that I must be speaking about some form of speculative fiction because the current date was December 2024… the model’s training cutoff.
Meanwhile the 16-bit just accepted my date template and moved on through the rest of the exercise.
“Fast, accurate, good grammar, but stupid, repetitive, and obstinate” would be how I describe working at four bits :).
I hear Q5_K_M is a decent compromise for most folks on a 16GB card.
fairrighty@reddit
I figured. But as the reaction was to someone with a MacBook, I got curious if I’d missed something.
DepthHour1669@reddit
GLM-4.5 air maybe
thatkidnamedrocky@reddit
give devstral (mistral) a try, ive gotten decent results with it for IT based work (few scripts, working with csv files and stuff like that)
NamelessNobody888@reddit
Great for chatting with in (say) open-webui and asking for some code. You will get good results. It's just never going to be much good for agentic-type programming.
umataro@reddit
Decent results even when compared to qwen3-coder (or qwen2.5-coder)? If so, which languages/frameworks/libraries?
brownman19@reddit
Glm 32b rumination (with a fine tune and a bunch of standard dram for context)
DepthHour1669@reddit
GLM Rumination actually isn’t that much better than just regular reasoning.
Orson_Welles@reddit
He’s spending $400 a month on AI.
PaluMacil@reddit
He’s actually spending $100 but has a plug-in that estimates what it would cost if he were paying for the API 🤷♂️
squired@reddit
Holy shit. He can literally rent an H100 VM for 224 hours per month. Which is what? 224/8 = ~28 workdays per month.
tmarthal@reddit
Claude Sonnet is really the best. You’re trading time for $$$; you can setup deepseek and run the local models on your own infra but you almost have to relearn how to prompt them.
-dysangel-@reddit
Try GLM 4.5 Air. It feels pretty much the same as Claude Sonnet - maybe a bit more cheerful
Tetrylene@reddit
I just have a hard time believing a model that can be downloaded and run on 64gb of ram compares to sonnet 4
NamelessNobody888@reddit
Depends a bit on coding style, too. Something like Aider (more scalpel than shotgun approach to AI coding) can be pretty OK with local models.
-dysangel-@reddit
I understand. I don't need you to believe for it to work for me lol. It's not like Anthropic are some magic company that nobody can ever compete with.
ANDYVO_@reddit
This stems from what people consider comparable. If this person is spending $400+/month, it’s fair to assume they’re wanting the latest and greatest and currently unless you have an insane rig, paying for Claude code max seems optimal.
-dysangel-@reddit
Well put it this way - a Macbook with 96GB or more of RAM can run GLM Air, so that gives you a Claude Sonnet quality agent, even with zero internet connection. It's £160 per month for 36 months to get a 128GB MBP currently on the Apple website - so cheaper than those API costs. And the models are presumably just going to keep getting smaller, smarter and faster over time. Hopefully this means the prices for the "latest and greatest" will come down accordingly!
ANDYVO_@reddit
I respect your opinion and the goal you're trying to achieve - it's why I spec'd out my M1 max when it first came out.
Even if you can run it in say LM Studio and get a response with decent speeds, you'd still be missing out on Claude Code type of functionality/quality for the time being.
Maybe your experience is different with your machine. But for my use case, and it seems OP's, having it accessible via an API or service is the more straightforward way of getting the best of the best without a lot of hassle.
Western_Objective209@reddit
Claude 4 Opus is also a complete cut above Sonnet, I paid for the max plan for a month and it is crazy good. I'm pretty sure Anthropic has some secret sauce when it comes to agentic coding training that no one else has figured out yet.
icedrift@reddit
Personally, I would keep pushing Gemini CLI and see if that works. If it isn't smart enough for your tasks nothing else will be.
Aldarund@reddit
Gemini CLI only has 50 requests to 2.5 Pro on the free tier.
icedrift@reddit
Only if you sign in with your regular google credentials. If you use an API key (completely free don't even need to add a credit card) the limits are way higher. I've yet to hit it while coding, only hit it when I put it in a loop summarizing images.
Capaj@reddit
Gemini can be even better than Claude, but it outputs a fuck-ton more thinking tokens, so be aware of that. Claude 4 strikes the perfect balance in the amount of thinking tokens it outputs.
Ladder-Bhe@reddit
To be honest, K2's tool use is not stable enough, and its code quality is slightly worse. DeepSeek is completely unable to handle stable tool use and can only handle Haiku-level work. Qwen 3 Coder is said to be better, but it has the problem of consuming too many tokens. GLM 4.5 is currently on par with Qwen.
givingupeveryd4y@reddit
Qwen Code (what you refer to as Qwen CLI, I guess) is a fork of Gemini CLI, so most approaches applicable to Gemini CLI still work with both.
DistinctStink@reddit
I have an AMD 7800 XT with 16GB of GDDR6 and 32GB of DDR5-6000, with an 8-core/16-thread AMD 7700X at 4.8-5.2 GHz. Can I use any of these? I find the DeepSeek app on Android is alright, with fewer shit answers than Gemini and that other fuck.
Expensive-Apricot-25@reddit
Prices for closed source will never stay constant and will likely continue to rise.
The only real permanent solution would be open source, but only if you have the resources for it.
Delicious-Farmer-234@reddit
This is a great suggestion. Any reason why you put GLM 4.5 first and not Qwen 3 coder?
BidWestern1056@reddit
npcsh is an agentic CLI tool which makes it easy to use any diff model or provider https://github.com/NPC-Worldwide/npcsh
Reasonable-Job2425@reddit
I would say the closest experience to Claude right now is Kimi, but I haven't tried the latest Qwen or GLM yet.
txgsync@reddit
Qwen3-30B-a3b-thinking runs comfortably on my M4 Max at full precision (about 60GB). Over 50 tokens per second if I convert the BF16 to FP16 myself in my Mac! I’ve been experimenting with tool calls and it seems roughly about as good as Sonnet 3.7. Which was eminently usable. And the speed lets me do dumb things like spin up five agents solving the same problem in five worktrees and then pick the winner.
So far, I am not using it for anything serious. But with this much speed and really solid thinking? I might very soon.
I haven’t gotten the new Qwen3-30B-A3B-Coder version working yet. MLX complains about missing layers. Still figuring out what I am doing wrong. Or maybe I am doing nothing wrong other than needing to update MLX for the new format…
I am very excited about the new Qwen series at full 16-bit precision for Mac.
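The "spin up five agents, pick the winner" pattern above can be sketched like this (the agent call and the scoring are stubs; a real version would invoke the local model and run the test suite inside each git worktree):

```python
from concurrent.futures import ThreadPoolExecutor

def run_agent(attempt: int, prompt: str) -> dict:
    # Stub: a real implementation would invoke the local model in worktree N
    # and score the result by running the project's tests.
    return {"attempt": attempt, "solution": f"solution-{attempt}", "score": attempt * 10}

def best_of_n(prompt: str, n: int = 5) -> dict:
    # Run N attempts in parallel, keep the highest-scoring one.
    with ThreadPoolExecutor(max_workers=n) as pool:
        results = list(pool.map(lambda i: run_agent(i, prompt), range(1, n + 1)))
    return max(results, key=lambda r: r["score"])

winner = best_of_n("implement the parser")
print(winner["solution"])
```

Fast local decode is what makes this brute-force pattern affordable: five attempts cost five prompts, not five subscriptions.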
deyil@reddit
Among them how they rank?
Caffdy@reddit
Qwen 235B non-thinking 2507 is the current top open model. Now, given that OP wants to code, I'd go with Qwen Coder or R1
sluuuurp@reddit
Not possible. If it were, everyone would have done it by now. You can definitely experiment with cheaper models that are almost as good, but nothing local will come close.
Ylsid@reddit
I disagree there. It depends on the use case. Claude seems to be trained a lot on web, but not too much on gamedev.
QueDark@reddit
which one you feel is best for gamedev?
Ylsid@reddit
It probably again depends on the model. For me, I've found R1 to be the best at Lua
urekmazino_0@reddit
Kimi K2 is pretty close imo
sluuuurp@reddit
You can’t really run that locally at reasonable speeds.
No_Afternoon_4260@reddit
That's why not everybody is doing it.
tenmileswide@reddit
it will cost you $60/hr on Runpod at full weights, $30/hr at 8 bit.
so, for a company that's probably doable, but can't imagine a solo dev spending that.
No_Afternoon_4260@reddit
And those instances can serve so many people
noodlepotato@reddit
Wait how to run it on runpod? Tons of h200 instance then vllm?
tenmileswide@reddit
You can run clusters now, multiple 8 GPU pods connected together.
8xh200 for 8 bit, and 2x pods of h200 in a cluster for 16
DepthHour1669@reddit
Nah, $30k for a dozen RTX 8000s will run a 4 bit model with space for context for a couple of users.
Kimi is 32b active so it will do like 30 tok/sec.
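That throughput figure is consistent with simple memory-bandwidth arithmetic: a bandwidth-bound decoder must stream every active parameter once per generated token. A sketch (the effective-bandwidth number is an assumption, not a measurement):

```python
# Decode speed estimate for a memory-bandwidth-bound MoE model.

def decode_tps(active_params_b: float, bits_per_weight: int, bandwidth_gbs: float) -> float:
    # Bytes that must be streamed from VRAM per generated token.
    bytes_per_token = active_params_b * 1e9 * bits_per_weight / 8
    return bandwidth_gbs * 1e9 / bytes_per_token

# Kimi K2: ~32B active parameters, 4-bit quant, and an assumed ~600 GB/s of
# effective bandwidth after multi-GPU pipelining overhead.
print(round(decode_tps(32, 4, 600)))  # ~38 tok/sec, the same ballpark as "30 tok/sec"
```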
sluuuurp@reddit
Right now you can get double the precision, double the throughput, and 0.7 second latency for $2.20 per million tokens. It doesn’t make sense to buy $30k of GPU for such an inferior inference setup.
This is really a fundamental computer science problem. For large models limited by RAM bandwidth, batch_size=1 inference will always be much more expensive. And that’s even before considering the fact that you won’t be using the compute every second of every day.
https://openrouter.ai/moonshotai/kimi-k2
ProfessionalJackals@reddit
Even worse, running a "dozen RTX 8000s" kills the performance. People really underestimate how much interconnect bandwidth influences multi-GPU solutions for AI. There is a reason Nvidia has those fancy interconnects that do insane speeds.
The best solution is a single GPU with tons of memory and ultra-high bandwidth. But yeah... Nvidia is never going to release something like that at the consumer level (when they can charge companies 10x more).
And I doubt that Apple is going to release an Ultra with those capabilities. If they do, it will be an Ultra+++ for 20x the price (knowing Apple ;) ).
DepthHour1669@reddit
Inference doesn’t need PCIe bandwidth; you’re thinking of training or finetuning.
lfrtsa@reddit
And you can run it at home if you live in a datacenter.
SadWolverine24@reddit
Kimi K2 has a really small context window.
GLM 4.5 is slightly worse than Sonnet 4 in my experience.
Aldarund@reddit
Maybe in writing one-shot code. When you need to check or modify something, it's utter shit.
MerePotato@reddit
It's smarter than 3.5 Sonnet but falls well short of Sonnet 4.
unhappy-2be-penguin@reddit
Isn't qwen 3 coder pretty much on the same level for coding?
dubesor86@reddit
based on some benchmarks sure. but use each for an hour in a real coding project and you will notice a gigantic difference.
-dysangel-@reddit
Have you tried GLM 4.5 Air? I've used it in my game project and it seems on the same level, just obviously a bit slower since I don't own a datacenter. I created some 3D design tools with Claude in the last while, and asked GLM to create a similar one. Claude seems to have a slight edge on 3D visuospatial debugging (which is obviously a really difficult thing for an LLM to get a handle on), but GLM's tool had better aesthetics.
I agree, Qwen 3 Coder wasn't that impressive in the end, but GLM just is.
FyreKZ@reddit
GLM Air is amazingly good for its size, I'm blown away by it.
YouDontSeemRight@reddit
This is good to hear. I'm waiting for llama cpp support.
ForsookComparison@reddit
This is true.
Qwen3-Coder is awesome but it is not Claude 4.0 Sonnet on anything except benchmarks. In fact it often loses to R1-0528.
BoJackHorseMan53@reddit
Have you used them?
unhappy-2be-penguin@reddit
Fair enough
Orolol@reddit
Even if that were the case, it would be impossible to reach even 10% of the speed of the Claude API. When coding, you need to process very large contexts all the time, so it would require datacenter-grade GPUs, and that would be very expensive.
sluuuurp@reddit
I don’t think so, but I haven’t done a lot of detailed tests. Also I think it’s impossible to run that at home with high speed and full precision on normal hardware.
zipzag@reddit
Nothing local is likely to be as good as the best frontier models for coding for a number of years.
It's about what the productivity gain is worth. For experienced devs, $4000/month for a 2X productivity gain could be a bargain.
I would like to trade Claude Code for a Mac Studio, but I find no evidence that the switch would be a prudent financial decision.
Dudmaster@reddit
At the scale of $4000 you could save up for 3 months and be able to buy a local LLM dream machine
zipzag@reddit
If that were true, companies would be running LLMs in-house for their devs. A first-year programmer is in the $100K range all-in.
The reality is that the major frontier models are still improving rapidly and are decidedly superior in real-world use, regardless of what comparison tests report.
Running models locally is for learning and fun, not usually a cost-effective alternative to the frontier models.
devshore@reddit
People constantly pay in perpetuity to rent when buying is cheaper. One example is how people pay $10/mo in perpetuity for 1TB of cloud storage when they could buy two 10TB drives to have redundancy and 10 times the space, and serve it to themselves via a VPN, for the price of renting 1TB for two years. Not only is it better for all the obvious reasons, it's even cheaper.
Dudmaster@reddit
Most companies haven't reached even a fraction of $4000 token usage per month on purely development costs, and they also don't manage their own hardware no matter how cost effective it might be. Sure faang might, but there is a much larger market than just them
porest@reddit
Since most of the consumed tokens go to thinking (some here say up to 90%) and the rest to coding (10%), I think using gemini-cli (open source) could be one of the best value-for-money AI pair-programming tools out there (while saving you a sweet $100 a month). And by money I mean FREE (you need a Gmail account though). The free tier gives you some Gemini 2.5 Pro usage (10%) and Flash usage (90%) for free every day (it resets daily).
Having said that, this is how the workflow goes: use your free Gemini 2.5 Pro allotment for thinking/planning/code review/deep analysis/writing plans/tasks, then, once that's finished, use Gemini 2.5 Flash for implementing/coding all day long, again, for free.
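That two-tier routing can be written down as a tiny helper (the model names and the task taxonomy are assumptions; in practice you'd pass the chosen model to gemini-cli or the API):

```python
PLAN_MODEL = "gemini-2.5-pro"    # scarce free quota: planning, review, analysis
CODE_MODEL = "gemini-2.5-flash"  # generous free quota: bulk implementation

PLANNING_TASKS = {"plan", "review", "analyze"}

def pick_model(task_kind: str) -> str:
    # Route expensive reasoning to Pro, everything else to Flash.
    return PLAN_MODEL if task_kind in PLANNING_TASKS else CODE_MODEL

print(pick_model("plan"))       # gemini-2.5-pro
print(pick_model("implement"))  # gemini-2.5-flash
```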
Vusiwe@reddit
How’s closed source doing?
$400/mo
(pause)
Jesus fucking christ meme.png
vishwa1238@reddit (OP)
I don’t spend $400 every month. I use $400 worth of API calls from my $100 subscription to Claude.
Electronic-Site8038@reddit
With some sort of auto-retry script for when limits are reached, or?
valdev@reddit
Even if there was one, ready to spend 300-400 a month in extra electricity cost?
-dysangel-@reddit
I have a Mac Studio with 512GB of RAM. It uses 300W at max so the electricity use is about the same as a games console.
Deepseek R1 inference speed is fine, but ttft is not.
It sounds like you've not tried GLM 4.5 Air yet! I've been using it for the last few days both in one-shot tests and agentic coding, and it absolutely is as good as Claude Sonnet from what I've seen. It's a MoE taking up only 80GB of VRAM, so it has great context processing, and I'm getting 44 tps. It's mind-blowing compared to every other local model I've run (including Kimi K2, DeepSeek R1-0528, Qwen Coder 480B, etc).
I'm so happy to finally have a local model that has basically everything I was hoping for. 256k context would have been the cherry on top, but 128K is pretty good. And things can only get better from here!
notdba@reddit
Last November, after testing the performance of Qwen2.5-Coder-32B, I bought a used 3090 and an Aoostar AG02.
This August, after testing the performance of GLM-4.5, I bought a Strix Halo, to be paired with the above.
(Qwen3-Coder-480B-A35B is indeed a bit underwhelming, hopefully there will be a Qwen3.5-Coder)
ProfessionalJackals@reddit
Not the best choice... The bandwidth is too limited at around 256GB/s. So ironically, you can fit 128GB of memory, but if you go above 32B models it's way too slow.
You're better off buying one of those Chinese 48GB 4090s, which will run WAY better with 1TB/s of bandwidth.
notdba@reddit
Those 4090's are too loud, and I also don't have the space to accommodate a 4th gen EPYC workstation. Not to mention that either of these options is also more expensive.
I am betting on getting good TG speed from either speculative decoding or MTP. But even without those, these Strix Halo machines can probably still do 15-20 tps with an IQ2_K quant of GLM-4.5, which is acceptable for me.
The mini pc + egpu setup is also more modular. When I have the space and some money to spare, I can always add more 3090 FE to the mix.
ProfessionalJackals@reddit
They sell them with Watercooling... ;)
Why do you need an EPYC workstation? A single 4090 will run circles around a Strix Halo. Or just buy a 5090 at that point; it's a gaming + LLM budget combined for maximum gain.
How? You're mixing GPUs on a platform that has, at best, OCuLink or M.2-to-PCIe. And take it from me, you're not going to save space with external M.2-to-PCIe GPU solutions. You need a PSU for the 3090, a dock, cabling, etc., and in the end you're using the same footprint or more space versus just building an SFF or mini-tower.
I ran this type of eGPU + mini-PC setup for a long time, and eventually went back to an SFF case. Way more silent, more flexible. A 22L case is very small and can easily accommodate 2 GPUs, each with 8x PCIe 4 (or 5) bandwidth.
I'll probably get downvoted for saying this, but you're frankly better off (for now) just buying a Copilot subscription, because Claude is going to give you way better results than a downsampled GLM will. Even unlimited 4.1 + Beast Mode for 10 bucks per month will do better.
Running local LLMs requires a ton of investment and a proper build. I just find the combination of hardware you're investing in a rather suboptimal mix for getting the best out of a local LLM setup. Anyway, it's your money ;)
power97992@reddit
Qwen 3 Coder 480B is not as good as Sonnet 4 or Gemini 2.5 Pro... maybe for some tasks, but for certain JavaScript tasks it wasn't following the prompt very well...
-dysangel-@reddit
agreed, Qwen 3 Coder was better than anything else I'd tried til then for intelligence vs size, but GLM Air stole its thunder.
PatienceKitchen6726@reddit
Hey, I’m glad to see some realism here. So can I ask your realistic opinion: how long until you think we can get actual Sonnet performance on current consumer hardware? Say the newest-gen AMD chip with the newest-gen GeForce card. Do you think it’s an LLM architecture problem?
valdev@reddit
That's like asking a magic 8 ball when it will get some new answers.
Snark aside, it really depends. There are some new model training methods in testing that can drop model size by multitudes (if they work), and there is also a lot of different hardware targeting consumers in development.
Essentially the problem we are facing is many-faced, but here are the main issues that have to be solved:
1. A model trained in such a way that it contains enough raw information to be as good as Sonnet, but available freely.
2. A model architecture that can keep a model small but retain enough information to be useful, and fast enough to be usable.
3. Hardware capable of running that model that is accessible to the average person.
#1 I think we are quickly approaching; of #2 and #3, I feel we will see #2 arrive before #3. 3 to 5 years maybe? But I would expect major strides... all the time?
Careless_Wolf2997@reddit
that is, possibly, maybe, that can, to be ...
valdev@reddit
Lol, yes.
PatienceKitchen6726@reddit
Thanks for sharing your perspective!
-dysangel-@reddit
You can run GLM 4.5 Air on any new Mac with 96GB of RAM or more. And once the GGUFs are out, you'll be able to run it on EPYC systems too. Myself and a bunch of others here consider it Claude Sonnet level in real world use (the benchmarks place it about neck and neck, and that seems accurate)
rukind_cucumber@reddit
I'd like to give this one a try. I've got the 96 GB M2 Max Mac Studio. I saw a post about a 3-bit quantized version for MLX, "specifically sized so people with 64GB machines could have a chance at running it." I don't have a lot of experience running local models. Think I can get away with the 4-bit quantization?
https://huggingface.co/mlx-community/GLM-4.5-Air-4bit
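A quick back-of-envelope for whether a given quant fits in 96 GB of unified memory (the ~106B total-parameter figure for GLM-4.5-Air and the headroom allowance are assumptions):

```python
def weights_gb(params_b: float, bits: int) -> float:
    # Parameters in billions -> weight size in GB at the given bit width.
    return params_b * bits / 8

for bits in (3, 4, 8):
    gb = weights_gb(106, bits)
    fits = gb + 10 < 96  # leave ~10 GB assumed headroom for KV cache + OS
    print(f"{bits}-bit: ~{gb:.0f} GB of weights, fits in 96 GB: {fits}")
```

By this estimate the 4-bit quant (~53 GB of weights) fits with room for context, while 8-bit does not.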
-dysangel-@reddit
Yes, I think it's worth a try. I just did a test with Cline on 128k of context, and usage goes up to 88GB. It's worth trying the 3-bit to see if it's good enough for you, though. It's presumably going to be much better than anything else you could run locally either way; it's way better than Qwen 32B.
rukind_cucumber@reddit
Thank you. I am a total newb when it comes to making the best use of my machine for local models. There's so much information out there, and it's difficult for me to make time to separate the wheat from the chaff. Any pointers on where to start?
-dysangel-@reddit
In terms of separating wheat from chaff: it's just GLM Air from now on, really. It's so far ahead of anything else you could fit into your RAM.
Once Qwen 3 Coder 32B comes out I'd give it a go too. Otherwise just keep checking/asking in here and seeing what people are saying
evia89@reddit
Probably in 5 years, with Chinese hardware. Nvidia will never release a GPU with that much VRAM. Prepare to spend 10-20k.
GrungeWerX@reddit
“5 years”.
You guys are so funny with your over inflated estimations. 5 years. Cute.
datbackup@reddit
I agree w u, 2 years tops
GrungeWerX@reddit
Tops.
evia89@reddit
Sonnet 3.5 is a really strong model. Do you think an RTX 8090 48 GB would run a better local model? I assume 128k context and 40+ tokens/sec speed for it to be of any use.
GrungeWerX@reddit
Open source has mostly caught up with closed source, with kimi-k2 and Qwen 3 coder. Future iterations will close that gap even further. That gap has been closed in a matter of months, not years.
I don’t think GPT-5 will be as much of a leap as people think. Llama 4 was hyped up big and mostly landed below expectations. Meanwhile, Chinese OSS models have exceeded expectations. In months, not years.
And all of this without knowing gpt’s proprietary code. Knowledge is growing.
Agentic frameworks are the future right now. This will only escalate as AI improves itself. Progress is growing exponentially, not incrementally.
Years? I think not. Sonnet will most likely be outdone by end of year. 2026 will be the true AI arms race.
PatienceKitchen6726@reddit
Wait your prediction is that China will end up taking over the consumer hardware market? That’s an interesting take I haven’t thought about
power97992@reddit
I hope the drivers are good and they support pytorch and have good libraries
jferments@reddit
The US government will most likely prevent this with tariffs/regulations to protect US corporate profits.
RoomyRoots@reddit
Everyone knows that AMD and Nvidia will not deliver for consumers. Intel may try something, but that's a hard bet. China has the power to do it, and the desire and the need.
TheThoccnessMonster@reddit
I don’t think they can produce efficient enough chips any time this decade to make this a reality.
evia89@reddit
For LLM enthusiasts, for sure. Consumer Nvidia hardware will never be powerful enough.
Pipalbot@reddit
I see two main barriers for China in the semiconductor space. First, they lack domestic EUV lithography manufacturing capabilities. Second, they don't have a CUDA equivalent—though this is less concerning since if Chinese companies can produce consumer hardware that outperforms NVIDIA on price and performance, the open-source community will likely develop compatible software tools for that hardware stack.
Ultimately, the critical bottleneck is manufacturing 3-nanometer chips at scale, which requires extensive access to EUV lithography machines. ASML currently holds a monopoly in this space, making it the key constraint for any country trying to achieve semiconductor independence.
momono75@reddit
OP's use case is programming. I'm not sure software developments still need that 5 years later.
Pipalbot@reddit
Current consumer-grade hardware isn't designed to handle full-scale LLM models. Hardware companies are prioritizing the lucrative commercial market over consumer needs, leaving individual users underserved. The situation will likely change in one of two ways: either we'll see a breakthrough in affordable hardware (similar to DeepSeek's impact on model accessibility), or model efficiency will improve dramatically—allowing 20-billion-parameter models to match today's larger models while running on a single high-end consumer GPU with 35GB of memory.
colin_colout@reddit
$10-15k to run state of the art models slowly. No way you can get 1-2tb of vram... You'll barely get 1tb of system ram for that.
Unless you run it quantized, but if you're trying to approach sonnet-4 (or even 3.5) you'll need to run a full fat model or at least 8bit+.
Local llms won't save you $$$. It's for fun, skill building, and privacy.
Gemini Flash Lite is pennies per million tokens and has a generous free tier (and is comparable in quality to what most people here can run, at Sonnet-like speeds). Even running small models doesn't really have a good return on investment unless the hardware is free and low-power.
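To put rough numbers on that break-even (every figure below is a hypothetical assumption, not a quote; API pricing and usage vary wildly):

```python
# Break-even sketch with made-up but plausible numbers: months of heavy API
# use needed before a local rig pays for itself. Electricity is excluded,
# which would only push break-even further out.
hardware_cost = 12_500            # midpoint of the $10-15k estimate above
api_price_per_m = 3.00            # hypothetical blended $/1M tokens
tokens_per_month = 50_000_000     # hypothetical heavy-use figure

monthly_api_bill = api_price_per_m * tokens_per_month / 1_000_000
months_to_break_even = hardware_cost / monthly_api_bill
print(round(monthly_api_bill), round(months_to_break_even, 1))  # 150 83.3
```

Even under heavy assumed usage, the hardware takes years to pay for itself, which is the "for fun, skill building, and privacy" point above.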
devshore@reddit
Local LLMs save Anthropic money, so it should save you money too if you rent out the availability that you aren't using
notdba@reddit
> Unless you run it quantized, but if you're trying to approach sonnet-4 (or even 3.5) you'll need to run a full fat model or at least 8bit+.
I have seen many people having this supposition that quantization can heavily impact coding performance. From my testing so far, I don't think that's true.
For LLM models, coding is like the simplest task, as the solution space is really limited. That's why even a super small 0.5B draft model can speed up TG performance **for coding** by 2-3x.
We probably need a coding alternative to wikitext to calculate perplexity scores for quantized models.
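The metric itself is straightforward once you have per-token log-probs from a model over a code corpus; the open question is just standardizing the corpus. A minimal sketch (the log-prob values below are illustrative, not from a real model):

```python
import math

def perplexity(token_logprobs):
    # Perplexity = exp(mean negative log-likelihood) over the corpus tokens.
    nll = -sum(token_logprobs) / len(token_logprobs)
    return math.exp(nll)

# Illustrative per-token log-probs a full-precision and a quantized model
# might assign to the same code snippet:
full_precision = [-0.2, -0.1, -0.3, -0.15, -0.25]
quantized = [-0.25, -0.12, -0.35, -0.18, -0.3]

print(round(perplexity(full_precision), 3))  # 1.221
print(round(perplexity(quantized), 3))       # 1.271
```

Comparing those two numbers over a held-out code file would give exactly the wikitext-style signal for quantization damage on coding.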
valdev@reddit
Ha yeah, I was going to add the slowly part but felt my point was strong enough without it.
-dysangel-@reddit
GLM 4.5 Air is currently giving me 44tps. If someone does the necessary to enable multi token prediction on mlx or llama.cpp, it's only going to get faster
kittencantfly@reddit
What's your machine spec
-dysangel-@reddit
M3 Ultra
kittencantfly@reddit
How much memory do you have
-dysangel-@reddit
It has 512GB of unified memory - shared addressing between both CPU and GPU, so you don't need to transfer stuff to/from the GPU. Similar deal to AMD EPYC. You can allocate as much or as little memory to GPU as you want. I allocate 490GB with `sudo sysctl iogpu.wired_limit_mb=490000`
colin_colout@reddit
Lol we all dream of cutting the cord. Some day we will
Double_Cause4609@reddit
There *are* things that can be done with local models that can't be done in the cloud to make them better, but you need actual ML engineering skills and have to be pretty comfortable playing with embeddings, doing custom forward passes, engineering your own components, reinforcement learning, etc etc.
No_Efficiency_1144@reddit
Actual modern RL on your data is better than any cloud yes but it is very complex. There is a lot more to it than just picking an algorithm like REINFORCE, PPO, GRPO etc
GrungeWerX@reddit
Are you crazy? You're spouting myths
bfume@reddit
I dunno, my Mac Studio rarely gets above 200W total at full tilt. Even if I used it 24x7 it comes out to 144 kWh @ roughly $0.29 /kWh which would be $23.19 (delivery) + $18.69 (supply) = $41.88
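A quick check of that arithmetic (assuming a 30-day month and the combined $0.29/kWh rate):

```python
# Sanity check of the electricity math above.
watts = 200
hours = 24 * 30                  # 30-day month, running 24/7
kwh = watts / 1000 * hours       # 144 kWh
cost = kwh * 0.29                # combined delivery + supply rate
print(kwh, round(cost, 2))       # 144.0 41.76
```

Pennies off the quoted $41.88 only because delivery and supply are billed at separate rates that round differently.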
InGanbaru@reddit
Prompt processing speed is practically unusable on macs though
bfume@reddit
I disagree. Try it for yourself.
InGanbaru@reddit
I have. If you have short prompts it's fine. If you are using a large 70B model and load it with file reads for agentic coding it takes minutes for time to first token.
Try it yourself
bfume@reddit
Why would I do that? I don’t use it to code. Vibe coding is pretty dumb
InGanbaru@reddit
This applies to any workflow that needs large prompt context. That was pretty disrespectful though I'll end here.
bfume@reddit
Disrespectful because I disagree with you? Got it.
InGanbaru@reddit
You said it's dumb to do agentic coding, and implied that therefore I must be dumb for doing agentic coding.
You disregard my use case because it's not what you personally need it for. You have a Mac Studio with a ton of RAM to load whatever model, great, but I didn't say you're dumb for slow prompt times because it doesn't fit my use case
bfume@reddit
I said it seems dumb to me. TO ME. I made no judgements about you. I don’t even know you dude.
InGanbaru@reddit
Ok well, maybe to illustrate:
Using a Mac studio with 512gb of ram when it can't even load a long prompt with decent latency is dumb. Asking short prompts of a model like it's Wikipedia is dumb.
To me.
SporksInjected@reddit
The south is more like $0.10-0.15/kWh
bfume@reddit
Oh I’m well aware that my electric rates are fucking highway robbery. Checked my bill, and when adding in taxes and other regulatory BS, it’s actually closer to $55 a month for me.
calmbill@reddit
Isn't one of those a fixed rate on your electric bill? Do you get charged per kWh for both supply and delivery?
bfume@reddit
Yep. Per kWh for each.
Strangely enough the gas, provided by the same utility on the same monthly bill, charges it the way you’re asking about.
OfficialHashPanda@reddit
Sure, but your mac studio isn't going to be running those big ahh models at high speeds.
equatorbit@reddit
Which model(s)?
das_war_ein_Befehl@reddit
At that point it’s just easier to rent a gpu and you’ll spend far less money
vishwa1238@reddit (OP)
I tried R1 when it was released. It was better than OpenAI’s O1, but it wasn’t even as good as Sonnet 3.5.
LagOps91@reddit
there has been a new and improved version of R1 which is significantly better since then.
vishwa1238@reddit (OP)
Oh, I’ll try it out then.
LagOps91@reddit
"R1 0528" is the updated version
OldEffective9726@reddit
Why spend money knowing that your data will be leaked, sold or otherwise collected for training their own AI.
entsnack@reddit
This is why I don't use Openrouter.
valdev@reddit
Did I say anything about not wanting to run this locally? I have my own local AI server. lol
Blksagethenomad@reddit
I highly suggest getting all the top Chinese models, especially the generative models prior to their new laws going into effect September 1st.
somethedaring@reddit
I echo this sentiment. Claude in its current state has no open source equivalent. Everything else has something.
BoJackHorseMan53@reddit
Try GLM, it's working flawlessly in Claude Code.
Qwen Coder is bad at tool calling in Claude Code.
FammasMaz@reddit
Wait, what? You can use non-Anthropic models in Claude Code?
6227RVPkt3qx@reddit
yup. all you have to do is just set these 2 variables. this is how you would use kimi k2. i made an alias in linux so now when i enter "kclaude" it sets:
export ANTHROPIC_AUTH_TOKEN=sk-YOURKEY
export ANTHROPIC_BASE_URL=https://api.moonshot.ai/anthropic
and then when you launch claude code, it instead will be routed through kimi.
for GLM it would be your Z API key and the URL:
export ANTHROPIC_AUTH_TOKEN=sk-YOUR_Z_API_KEY
export ANTHROPIC_BASE_URL=https://api.z.ai/api/anthropic
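A sketch of what that alias setup could look like as shell functions ("kclaude" is the commenter's own name; "gclaude" and the placeholder keys are assumptions added here):

```shell
# Point Claude Code at Kimi K2 or GLM by swapping the Anthropic endpoint.
kclaude() {
  export ANTHROPIC_AUTH_TOKEN="sk-YOUR_MOONSHOT_KEY"      # placeholder key
  export ANTHROPIC_BASE_URL="https://api.moonshot.ai/anthropic"
  claude "$@"
}

gclaude() {
  export ANTHROPIC_AUTH_TOKEN="sk-YOUR_Z_API_KEY"         # placeholder key
  export ANTHROPIC_BASE_URL="https://api.z.ai/api/anthropic"
  claude "$@"
}
```

Functions rather than plain aliases, so any flags you pass still reach `claude` and the variables are only set when you launch it.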
tvibabo@reddit
How to set it up in Claude Code?
BoJackHorseMan53@reddit
https://docs.z.ai/scenario-example/develop-tools/claude
BananaPeaches3@reddit
unsloth version fixes the tool calling issue.
Brave-History-6502@reddit
Why aren’t you on the max 200 plan?
vishwa1238@reddit (OP)
I’m currently on the Max $100 plan, and I barely use up my quota, so I didn’t upgrade to the $200 plan. Recently, Anthropic announced that they’re transitioning to a weekly limit instead of a daily limit. Even with the $200 plan, you’ll now have a lower limit.
Skaronator@reddit
The daily limit won't go away. The weekly limit works in conjunction with it, since people started sharing accounts and reselling access, resulting in a 24/7 usage pattern, which is not what they intended with the current pricing.
devshore@reddit
So are you saying that a normal dev only working 30 hours a week will not run into the limits, since the limits are only for people sharing accounts and thus using impossible amounts of usage?
LegendMotherfuckurrr@reddit
Didn't they say in the announcement it's only going to affect 5% of users?
evia89@reddit
100% sonnet will be usable for 30h/week on $100 plan
devshore@reddit
Damn, there goes the argument I gave my wife for getting the M3 Ultra. Maybe when Anthropic releases its actual pricing, she will let me
rukind_cucumber@reddit
we'll see...
popsumbong@reddit
I kinda gave up trying local models. There’s just more work that needs to be done to get them to sonnet 4 level
MonitorAway2394@reddit
wha? well yeah, but not much. I guess I'm deep into the local shit so... ok like I am alright with 4-8 tk/s max LOLOLOLOLOL I'm a weird one it seems :P
martpho@reddit
I have very recently started exploring AI models in agent mode with free GitHub copilot and Claude is my favorite so far.
In the context of local LLMs, having a Mac M1 with 16 GB RAM means I cannot do anything locally, right?
MonitorAway2394@reddit
oh no, you can have tons of fun. I have the pre-silly-cone :D mac 2019, 16gb shat ram and like, I run 12b, 16b quant 6, etc. any of the models (sans image/video) it's surprisingly faster with each update using Ollama and my own kit but, yeah, requires patience :D it's explicitly useful for what I'm using them for, but I swap models in and out constantly, have multi-model conversation modules and whatnots, so yeah, you're good, have fun! (HugFace has a lil icon that lets you know what will run, don't necessarily listen to it unless the models > 16b, I have run 14-16b models just slower, longer pre-loading, incredibly useful if you work with them, learn them, keep a "weak" seeming model around and don't bin them until you know for sure it's not you. I am kinda wonked out, sorry for the weird'ish response lolol O.o
STvlsv@reddit
Never used any cloud LLM, only a local Ollama instance.
For programming with continue.dev, I used these over the last three months:
- qwen2.5-tools (mostly general purpose)
- devstral (better than qwen2.5-tools for programming)
- qwen3-coder (new 30B variant. Haven't done enough testing, only a few days. Very quick after devstral)
All these models are not very large and can be run locally at several levels of quantization (in my case between Q4 and Q8, on a server with two RTX A5000s).
duaneadam@reddit
You are underutilising your Max plan. I am on the $100 plan and my usage this month according to ccusage is $2k.
earendil137@reddit
There is Crush CLI that recently came out. There's OpenCode CLI too, open source, but I'm yet to try it personally. You could use either along with Qwen3 on OpenRouter. Free until you hit OpenRouter's limits.
TangoRango808@reddit
https://github.com/sapientinc/HRM when this is figured out for LLM’s this is what we need
alexkissijr@reddit
I say qwen coder and kimi 2 work
defiant103@reddit
Nvidia nemotron 1.5 would be my suggestion to take a peek at
rookan@reddit
Claude code 5x costs 100 usd
vishwa1238@reddit (OP)
Yes, but I spend more than 400 USD worth of tokens every month with the 5x plan.
valdev@reddit
Okay, I've got to ask something.
So I've been programming about 25 years, and professionally since 2009. I utilize all sorts of coding agents, and am the CTO of a few different successful startups.
I'm utilizing Codex, Claude Code ($100 plan), GitHub Copilot and some local models, and I am paying closer to $175 a month and am nowhere near the limits.
My agents code based upon specifications, a rigid testing requirement phase, and architecture that I've built specifically around segmenting AI code into smaller contexts to reduce errors and repetition.
My point in laying all that out isn't to brag; it's to get to this.
How well do you know programming? It's not impossible to spend a ton on claude code and be good at programming, but generally speaking when I see this it's because the user is constantly having to fight the agent into making things right and not breaking other things, essentially brute forcing solutions.
mrjackspade@reddit
I'm in the same boat as you, professional for 20 years now.
I've spent ~$50 TOTAL since early 2024 using Claude to code, and it does most of my work for me. The amount people are spending is mind-boggling to me, and the only way I can see it happening is if it's a constant "No, that's wrong, rewrite it" loop rather than having the knowledge and experience to specify what you need correctly on the first go.
ProfessionalJackals@reddit
It's relative, is it not? Think about it... A company pays what, 3 to 5k for somebody per month? Spending $200 per month on something that gets, let's say, 25% more productivity out of somebody is a bargain.
It just hurts more, if you are maybe a self employed dev, and you see that money directly going from your account ;)
The problem is that most LLMs get worse if they need to work on existing code. Create a plan, let it create brand new code and often the result in the first try is good. At worst you update the plan, and let it start from zero again.
But the moment you have it edit existing code, and the more context it needs, the more often you see new files being created that are not needed, incorrect code references, deleting critical code by itself or just bad code.
The more you vibe code, the worse it gets, as your codebase grows and the context window needs to be bigger. Maybe it's me, but you need to really structure your project almost to fit the LLM's way of working to even mitigate this. No single style.css file that is 4000 lines, because the LLM is going to do funky stuff.
If you work in the old way, like requests per function or limited to an independent shorter file (max 1000 lines), it tends to do a good job.
But ironically, using something like CoPilot, you actually get more or less punished by doing small requests (each = premium request) vs one big Agent task that may do dozens of actions (under a single premium request).
vishwa1238@reddit (OP)
Frontier LLMs offer you more usage than the cost you pay. If you calculate the API cost of your own usage, you'll easily find that you're using more than $175 worth of AI APIs every month.
valdev@reddit
Okay. I take it as you didn't read what I said, chose to not comprehend it or are trying to deflect. Despite my elitist attitude I am actually trying to help you.
You want to get ahead of the rug being pulled from under your feet when all of these AI providers inevitably start charging more, just like claude code.
Ultimately, my point was going to be this. If you want to solve this for yourself, start with input which will optimize the output cost and reliance. From there you can shift to cheaper or local methods.
Watchguyraffle1@reddit
Neither of you are reading each other’s messages. Or at least thinking about them.
iambecomebird@reddit
What outsourcing thinking to AI does to a mf
mightyloot@reddit
Cold-blooded take
ForeignAdagio9169@reddit
🥇
Marksta@reddit
I think that's the point, it's as you said. Some people are doing new-age paradigm (vibe) of really letting the AI be in the driver seat and pushing and/or begging them to keep fixing and changing things.
By the time I even get to prompting anything, I've pre-processed and planned so much or just did it myself if it's hyper specific or architecture stuff. Really, if the AI steps outside of the function I told it to work in I'm peeved, like don't go messing with everything.
I don't think we're there yet to imagine for even a second an AI can accept some general concept for a prompt and run with it and build something of value and to my undefined expectations. If I was, I guess I'd probably be paying $500/mo in tokens.
valdev@reddit
Exactly! AI coders are powerful, but ultimately they are kind of like senior devs with head trauma. They have to be railroaded and well contained, more importantly they need to have rails.
For complicated problems, I've found that prebuilding failing unit tests with specific guidelines to build around specifications and to run the tests to verify functionality is essentially non-negotiable.
For smaller things that are tedious, at a minimum specifying the specific files affected and a detailed goal is good enough.
But when I see costs like this, I fear the prompts being sent are "One of my users are getting x error on y page, fix it"
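A minimal, hypothetical illustration of that failing-tests-first setup: the spec is written before any implementation exists, the agent is handed this file plus the target module, and its only job is to make the assertions pass.

```python
# Hypothetical spec-first workflow. The function name and spec are made up
# for illustration; in practice the agent would be asked to produce the
# implementation from the assertions alone.
def slugify(title: str) -> str:
    # Reference implementation the agent would be asked to write.
    return "-".join(title.lower().split())

# The pre-built spec, failing until the implementation above exists:
assert slugify("Hello World") == "hello-world"
assert slugify("  Spaced   Out  ") == "spaced-out"
print("spec satisfied")
```

The point is the rails: the agent can run the file itself and knows unambiguously when it is done.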
PositiveEnergyMatter@reddit
Those are fake numbers aimed at making the plans look good.
vishwa1238@reddit (OP)
I use a tool called ccusage to find the tokens and their corresponding costs.
TechExpert2910@reddit
it costs anthropic only ~20% of the presented API cost in actual inference cost.
the rest is revenue to fund research, training, and a fleeting profit.
GL-AI@reddit
Source?
boringcynicism@reddit
Claude API is crazy expensive, don't think you want to use it without a plan?
boringcynicism@reddit
What are you saying? You hit the plan limits immediately?
vishwa1238@reddit (OP)
No. I use the same account only. You can see your usage compared to the api price with a free tool called ccusage.
boringcynicism@reddit
I know what ccusage is. I'm saying I can't make sense of what you're saying or what your actual problem is.
rookan@reddit
I present to you Claude Max 20x - costs 200 only.
No_Hornet_1227@reddit
Just hire a hobo for 50$ a week, its gonna be more accurate than the AI and youll save money
OldEffective9726@reddit
Why spend money knowing that your data will be leaked, sold or otherwise collected for training their own AI?
rbit4@reddit
How do you fine-tune Phi-4 or Qwen3 for coding?
COBECT@reddit
I would say, you’ll have to figure it out by yourself. The problem with open-source models is that they are trained more for some areas and less for others. People have suggested a good set of models here, but you will have to figure out which one works for your needs. You can try all of them via OpenRouter or other aggregators and, after that, estimate the cost and set up locally.
gojukebox@reddit
Qwen3-coder
ZeroSkribe@reddit
When Ollama fixes the tooling on Qwen3-coder, that will be the jazz
gthing@reddit
FYI, Claude Code uses 5x-10x more tokens than practicing efficient prompting. And almost all of those tokens are spent planning, making and updating lists, or figuring out which files to read: things that are arguably pretty easy for the human to do. Maybe 10% of the tokens go to actually coding.
So for $400 in Claude code use you're probably actually only doing $40 of anything useful.
docker-compost@reddit
it's not local, but cerebras just came out with a claude code competitor that uses the open source qwen3-coder. it's supposed to be on-par with sonnet 4, but significantly faster.
https://www.cerebras.ai/blog/introducing-cerebras-code
lyth@reddit
Ooooh... I wish I could follow your updates.
Investolas@reddit
If you're using Claude Code you should be subscribed and using Opus. Seriously, don't pay by the API. You get a 5-hour window with a max token budget, and then it resets after the 5 hours. If you already knew this and use the API intentionally for better results, please let me know, but there is a stark difference between Opus and Sonnet in my opinion.
vishwa1238@reddit (OP)
I don’t pay through the API. I subscribe to Claude Max. Claude Code is available with both the Pro and Max subscriptions.
Investolas@reddit
Yes, I use it as well. Why do you use Sonnet instead of Opus? Try this: 'claude --allowedTools Edit,Bash,Git --model opus'. I found that online and that's what I use. Try Opus if you haven't already and let me know what you think. You will never hit the rate limit if you use plan mode every time and use a single instance.
vishwa1238@reddit (OP)
I have also used Opus in the past, but I did hit a limit with Opus, which wasn't the case with Sonnet. I noticed that, at least for my use case, Sonnet with planning and ultrathink performs quite similarly to Opus.
Investolas@reddit
I can respect that! I hope you come up with something awesome!
Investolas@reddit
That is what is most important. If you like your experience, that's awesome. I would encourage you to never stop refining your process, though, simply because things are advancing so rapidly. It's worth it to start new sessions every 1-2 days and see the difference, especially as your prompting and communication skills with them grow. Also, try different perspectives. I highly suggest experimenting with suggestions and describing your actions as though they appeared in a story: "You will complete the instructions you just provided". It removes tone bias. There is much less variability in sentence structure, and so your thoughts translate to action much more accurately.
vinesh178@reddit
https://chat.z.ai/
Heard good things about this. Give it a try. you can find it in HF too
https://huggingface.co/zai-org/GLM-4.5
HF spaces - https://huggingface.co/spaces/zai-org/GLM-4.5-Space
rahularyansharma@reddit
Far better than any other models. I tried Qwen3-Coder, but GLM 4.5 is still far above it.
AppearanceHeavy6724@reddit
not for c/c++ low level code. I've asked many different models to write some 6502 assembly code, and among open source models only the big Qwen3-coder and (you ready?) Mistral Nemo wrote correct code (yeah I know).
tekert@reddit
Funny, that's how I test AI: plain Plan9 assembler, UTF-16 conversions using SSE2. Claude took like 20 tries to get it right (75% don't know Plan9, but when confronted they magically know and get it right). All the other AIs failed hard on that, except this new GLM, which also took many attempts (same as Claude).
Now, to make that decoder faster... with a little help, only Claude thinking had the creativity; all the others, including GLM, just fall short on performance.
AppearanceHeavy6724@reddit
claude is not open source. not local.
vishwa1238@reddit (OP)
Thanks. I think I will try out GLM-4.5. Just found it's available on OpenRouter as well.
Singularity-42@reddit
300-400 USD seem pretty low usage to be honest, mine is at $2380.38 for the past month, I do have the 20x tier for the past 2 weeks (before that 5x), but I never hit the limit even once. I've heard of $10,000/mo usages as well - those are the ones Anthropic is curbing for sure.
Your usage is pretty reasonable and I think Anthropic is quite "happy" with you.
In any case from what I've heard Kimi2 and glm4.5 can work well (didn't try) and can be even literally used inside Claude Code with Claude Code Router:
https://github.com/musistudio/claude-code-router
GTHell@reddit
Literally everything out there is better than Claude. It’s the claude code and claude agent that make it superior.
boringcynicism@reddit
Exactly. Sonnet is pretty retarded especially if
unrulywind@reddit
I can tell you how I cut down a ton of cost. Use the $100-a-year Copilot that has unlimited GPT-4.1. This can do a ton of planning, document writing and general setup and cleanup. It has access to Sonnet 4 and it works OK, but not as well as the actual Claude Code. But for $100 you can move a lot of the workload there. Then, once you have all your documents and a large detailed prompt in order, use Sonnet 4 or Claude Code for deep analysis and implementation.
Low-Opening25@reddit
What you are asking for doesn’t exist
Low-Opening25@reddit
Why don’t you upgrade to 20x tier?!
NiqueTaPolice@reddit
Kimi is the king of html css design
Party-Cartographer11@reddit
To get the smallest/cheapest VM with a GPU on Google Cloud, it's $375/month if run 24/7. Maybe turn it on and off and use spot pricing to get it down to $100/month.
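Rough arithmetic behind those figures, with an assumed hourly rate (actual GCP pricing varies by region and GPU type):

```python
# Assumed on-demand rate for a small GPU VM; 730 hours ~ one month 24/7.
# Both rates here are illustrative assumptions, not quoted GCP prices.
on_demand_hourly = 0.51
hours = 730
print(round(on_demand_hourly * hours))        # ~372, close to the $375 quoted
print(round(on_demand_hourly * 0.3 * hours))  # ~70% spot discount: ~112
```

The spot figure only holds if you tolerate preemption, which is fine for interactive coding sessions you can restart.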
vishwa1238@reddit (OP)
I can do this. I do have 5,000 USD in credits on Google Cloud Platform (GCP). However, the last time I attempted to run a GPU virtual machine, I was restricted: I was only allowed to use T4s and A10s.
Stef43_@reddit
Have you tried Perplexity Pro?
StackOwOFlow@reddit
Give it a year
Ssjultrainstnict@reddit
We are not at the replacement level yet, but close with GLM 4.5. I think the future of a 30ish-B-param coding model that's as good as Claude Sonnet isn't too far away.
kai_3575@reddit
I don’t think I understand your problem, you say you are on the max plan but say you spend 400 dollars, are you using Claude code with the API or tying it to the Max plan?!
vishwa1238@reddit (OP)
I use Claude Code with the Max plan. I used a tool called ccusage, which shows the tokens and the cost I would have incurred if I had used the API instead. I used $400 worth of Claude Code on my Claude Max subscription.
theundertakeer@reddit
Erm... sorry for my curiosity, but what do you use it for that much? I am a developer and I use a mixture of local LLMs, DeepSeek, Claude and ChatGPT. The funny part is that it's all for free except Copilot, which I pay 10 bucks a month for. I own only a 4090 with 24GB VRAM and occasionally use Qwen3 Coder with 30B params.
Anyway, I still can't find justification for 200-300 bucks a month for AI. Does that make sense for the sphere you use it in?
vishwa1238@reddit (OP)
I don’t spend $200 to $300 every month on AI. I have a Claude Max subscription that costs $100 per month. With that subscription, I get access to Claude Code. There’s this tool called ccusage that shows the tokens used in Claude Code. It says that I use approximately $400 each month on my $100 subscription.
theundertakeer@reddit
Ahh I see, makes sense, thanks. But still, 100 bucks is a lot. The ultimate I paid was 39 bucks and I didn't find any use for it. So with that mixture I mentioned, you can probably get yourself going, but that is pretty much connected to what you do with your AI. Tell me please, so I can guide you better.
vishwa1238@reddit (OP)
Ultimate?? Is that some other subscription?
theundertakeer@reddit
Lol sorry for that, autocorrection; for whatever reason my phone decided to autocorrect "maximum" to "ultimate" lol. Meant to say that the maximum I ever paid was 39 bucks, for Copilot only.
jonydevidson@reddit
By all accounts, the closest one is QwenCode + Qwen3 Coder
MerePotato@reddit
Hate to say this but there are none
createthiscom@reddit
kimi-k2 is the best model that runs on llama.cpp at the moment. It's unclear if GLM-4.5 will overtake it, currently. If you're running with CPU+GPU, kimi-k2 is your best bet. If you have a shit ton of GPUs, maybe try vLLM.
dogepope@reddit
how do you spend $300-400 on a $100 plan? you have multiple accounts?
vishwa1238@reddit (OP)
No. With the Claude Max subscription, you get pretty good limits on Claude Code. Check r/claude; you'll find people using thousands of dollars' worth of API usage on a $200 plan.
Kep0a@reddit
Can I ask what is your job ? What is it you are using that much claude for?
vishwa1238@reddit (OP)
I work at an early-stage startup. I also have other projects and startup ideas that I work on.
ElectronSpiderwort@reddit
After you try some options, will you update us with what you found out? I'd appreciate it!
vishwa1238@reddit (OP)
Sure :)
OkTransportation568@reddit
You get what you pay for. None of the local models running on a local machine will be as good, and it will be a bit slower running it on a single machine. Remember that you still have to pay for a local model, in the form of electricity bills, especially when running LLMs, and how much it costs depends on where you are but it will be cheaper than 300-400 for sure.
That said, if your concern is just that it might get more expensive or the model might get dumber, why don’t you stop worrying about it and just cross that bridge when you get there? AI is moving so fast, and there are lots of cheap competitive alternatives coming from China. That might keep the prices in check.
Tiny_Judge_2119@reddit
Personal experience the GLM 4.5 is quite solid..
Brilliant-Tour6466@reddit
Gemini CLI sucks in comparison to Claude Code, although I'm not sure why, given that Gemini 2.5 Pro is a really good model.
aonsyed@reddit
Depends on how you are using it and whether you can use a different orchestrator vs. coder model. If possible, use o3/R1 0528 for planning, and then, depending on the language and code, Qwen3-Coder/K2/GLM 4.5. Test all three and see which one works best for you. None of them is Claude Sonnet, but with 30-50% extra time they can replicate the results, as long as you understand how to prompt them, since all of them have different traits.
InfiniteTrans69@reddit
It's literally insane to me how someone is willing to pay these amounts for an AI when open-source alternatives are now better than ever.
GLM4.5 is amazing at coding, from what I can tell.
theundertakeer@reddit
I still can't find a real use case to justify spending such an amount, unless you are a mad vibe coder with zero understanding of what you are doing
PermanentLiminality@reddit
I use several different tools for different purposes. I use the top tier models only when I really need them. For a lot of more mundane things lesser models do the job just as well. Just saying that you don't always need Sonnet 4.
I tend to use continue.dev, as it has a dropdown for which model to use. I've hardly tried everything, but most tools seem to be set up for a single model, and switching on the fly isn't a thing. Here it's just a click and I can be running a local model or any of the frontier models through OpenRouter.
With the release of Qwen Coder 3 30B-A3B, I now have a local option that can really be useful even with my measly 20GB of VRAM. Prior to this, I could only use a local model for the most mundane tasks.
icedrift@reddit
I don't know how heavy $400/month of usage is but Gemini CLI is still free to use with 2.5 pro and has a pretty absurd daily limit. Maybe you will hit it if you go full ape and don't participate in the development process but I routinely have 100+ executions and am moving at a very fast pace completely free.
rkv42@reddit
Maybe self hosting like this guy: https://x.com/nisten/status/1950620243258151122?t=K2To8oSaVl9TGUaScnB1_w&s=19
It all depends on the hours you are spending with coding during a month.
rkv42@reddit
I like Horizon and Kimi K2
umbrosum@reddit
You could have a strategy of using different models, for example DeepSeek R1 for easier tasks, and only switch to Sonnet for more complex tasks. I find that it's cheaper this way.
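A sketch of that routing strategy; the heuristic and the model identifiers are illustrative assumptions, not a fixed recipe:

```python
# Route easy prompts to the cheap model; escalate when the task looks complex.
# The hint list and threshold are made-up heuristics you'd tune for yourself.
COMPLEX_HINTS = ("refactor", "architecture", "concurrency", "race condition")

def pick_model(prompt: str, files_touched: int) -> str:
    hard = files_touched > 3 or any(h in prompt.lower() for h in COMPLEX_HINTS)
    return "claude-sonnet-4" if hard else "deepseek-r1"

print(pick_model("rename this variable", files_touched=1))      # deepseek-r1
print(pick_model("Refactor the auth module", files_touched=5))  # claude-sonnet-4
```

Even a crude gate like this keeps the expensive model reserved for the minority of requests that actually need it.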
boringcynicism@reddit
DeepSeek is much smarter than Sonnet. It's just not as good at tool usage.
Sonnet in Claude Code is semi retarded.
Zealousideal-Part849@reddit
There is always some difference between models.
You should pick the model depending on the task.
If the task is minimal, running open-source models from OpenRouter or other providers would be fine.
If the task needs planning, more careful updates, and complicated code, Claude Sonnet works well (no guarantee it does everything, but it works the best).
You can look at GPT models like GPT-4.1 as well, and use mini or DeepSeek/Kimi K2/Qwen3/GLM or the new models that keep coming for most tasks. These are usually priced about 5 times lower than running a Claude model.
HeartOfGoldTacos@reddit
You can point Claude code at AWS bedrock with Claude 4 Sonnet. It’s surprisingly easy to do. I’m not sure whether it’d be cheaper or not: it depends how much you use it.
usernameplshere@reddit
Qwen 3 Coder and DS R1 0528 will be the closest ones.
Maleficent_Age1577@reddit
R1 is closest to your asking, but you need more than your 5090 to run it beneficially.
vishwa1238@reddit (OP)
Is the one in OpenRouter capable of producing similar results as running it on an RTX 5090? Additionally, I have Azure credits. Does the one on Azure AI Foundry perform the same as running it locally? I tried R1 when it was released. It was better than OpenAI’s O1, but it wasn’t even as good as Sonnet 3.5.
boringcynicism@reddit
Current R1 is Opus 4 (with no thinking) level and dirt cheap.
ResidentPositive4122@reddit
GPT404
IGiveAdviceToo@reddit
GLM 4.5 (hearing good things; tested it, and performance is quite amazing), Qwen3 Coder, Kimi K2
SunilKumarDash@reddit
Kimi K2 is the closest you will get. https://composio.dev/blog/kimi-k2-vs-claude-4-sonnet-what-you-should-pick-for-agentic-coding
vishwa1238@reddit (OP)
Thanks, will try it out.
AaronFeng47@reddit
Qwen3 coder
Kimi k2