Can I get similar experience running local LLMs compared to Claude Code (Sonnet 4.5)?
Posted by Significant_Chef_945@reddit | LocalLLaMA | View on Reddit | 23 comments
Hopefully this has not been asked before, but I started using Claude about six months ago via the Max plan. As an infrastructure engineer, I use Claude Code (Sonnet 4.5) to write simple to complex automation projects including Ansible, custom automation tools in Python/Bash/Go, MCPs, etc. Claude Code has been extremely helpful in accelerating my projects. Very happy with it.
That said, over the last couple of weeks I have become frustrated by hitting the "must wait until yyy time before continuing" issue. Thus, I was curious whether I could get a similar experience by running a local LLM on my Mac M2 Max w/32GB RAM. As a test, I installed Ollama and LM Studio along with aider last night and downloaded the qwen-coder:30b model. Before I venture too far into the abyss with this, I was looking for feedback. I mainly code interactively from the CLI - not via some IDE.
Is it reasonable to expect anything close to Claude Code on my Mac (speed, quality, reliability, etc.)? I have business money to spend on additional hardware (M3 Ultra, etc.) if necessary. I could also get a Gemini account in lieu of purchasing more hardware if that would provide better results than local LLMs.
Thanks for any feedback.
studentofknowledg3@reddit
Six months later: yes, you have open-source models from Google, Qwen, Kimi, GLM, etc. that are better than Sonnet 4.5.
WesternDrama5566@reddit
Oh... but at what price? How much GPU and VRAM do I need? Are you coding with LLMs locally?
studentofknowledg3@reddit
Of course lots of VRAM is needed if you want that much quality and speed.
RiskyBizz216@reddit
not in a million years.
Take that "business money to spend" and go buy yourself 2x RTX 6000 Adas,
and run GLM 4.6, or GLM 4.5/GLM 4.5 Air - and then maybe.
Significant_Chef_945@reddit (OP)
Thanks, but the RTX 6000s are about $7K/ea on Amazon. Getting 3 would be about $21K. Is this really the hardware needed to get similar Claude experience?
gamblingapocalypse@reddit
You might be able to wait a little longer and get better hardware, and by that time hopefully smaller, more capable models will have come out. For example, the M3 Ultra with 512 GB RAM could run very good models but is quite expensive. Give it a year or two, though, and you might be able to find a laptop with that much RAM - and it might even be x86 rather than arm64, if you want to run Linux or program robots more easily.
gtrak@reddit
If it's cheaper, it's b/c it's no longer relevant. You could probably pay cheaper cloud prices for the same performance at that time, too.
notlongnot@reddit
Local models are bound by memory size and speed. The "maybe" part can translate to "maybe not".
With the RTXs, you load a model in and "maybe" that model will perform well enough for you - or maybe not. Sonnet 4.5 performance is not easy to replicate. And then you have to build a system around those RTXs (another $5k plus), and it costs electricity to run. Not a sales guy for them, but getting their monthly plan is the easy lift.
aaronpaulina@reddit
Get a Codex subscription and swap between Claude Code and Codex - way cheaper.
muchCode@reddit
This. 3x RTX 6000 in my setup gives me great performance with Qwen coder models.
paramarioh@reddit
qwen is not even close. Oh come on!
Ok_Priority_4635@reddit
No, you will not get Claude Sonnet 4.5 level reasoning from local models including Qwen Coder 30B. The capability gap is real, especially for complex multi step reasoning and context management across large codebases.
Your M2 Max with 32GB unified memory can run Qwen Coder 30B or similar models at reasonable speed. These models are solid for straightforward coding tasks like writing functions, explaining code, basic refactoring, but they struggle with the complex orchestration and architectural reasoning Claude Code handles.
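As a rough sanity check on the "it fits in 32GB" claim, here is a back-of-the-envelope memory estimate (a sketch; the ~4.5 bits/weight figure for a Q4_K_M-style quant and the KV-cache/runtime overheads are approximations that vary by quant scheme and context length):
```python
# Rough memory estimate for a quantized 30B model on 32GB unified memory.
# Assumptions (approximate): ~4.5 effective bits/weight for a Q4_K_M-style
# quant, a few GB of KV cache at a moderate context window, plus runtime
# buffers. Real GGUF sizes vary.

def model_memory_gb(params_b: float, bits_per_weight: float) -> float:
    """Approximate weight memory in GB for a quantized model."""
    return params_b * 1e9 * bits_per_weight / 8 / 1e9

weights = model_memory_gb(30, 4.5)  # ~16.9 GB of weights
kv_cache = 2.0                      # assumed: moderate context window
overhead = 1.5                      # assumed: runtime buffers, etc.

total = weights + kv_cache + overhead
print(f"~{total:.1f} GB")  # → ~20.4 GB
```
That leaves only around 10GB for macOS and your development stack, which is why 32GB is workable but tight.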
The specific pain points you will hit are multi file edits where the model loses context, complex debugging where reasoning chains break down, and architectural decisions where the model gives surface level suggestions instead of deep analysis.
For your use case of infrastructure automation in Ansible, Python, Bash, Go, local models can handle individual script generation and simple automation tasks but will disappoint on complex projects where Claude Code currently excels.
Your options are Gemini Advanced which has higher rate limits than Claude and comparable capability for coding tasks, or get a local setup with Qwen 2.5 Coder 32B or DeepSeek Coder V2 33B as a supplement not a replacement. Use local models for simple repetitive tasks to preserve your Claude quota for complex work.
Spending on M3 Ultra will not close the reasoning gap. You get faster inference but the same model limitations. The bottleneck is model capability not hardware speed.
What specific tasks hit the rate limit most often? That determines whether local models can actually help.
No_Conversation9561@reddit
Maybe next year this time.
pokemonplayer2001@reddit
No, no you can not.
dcforce@reddit
M2 looks promising
No-Marionberry-772@reddit
With the Claude Code Max $100 plan, I aggressively use subagent tasks in my work. I use it for hours on end, and actually make a point of using my tokens aggressively to maximize my value.
Since switching to Max, I have not run out of tokens.
$100/month == $1200/year.
Five years of service is $6000, which will not even get you enough hardware to run a model that comes anywhere close to the current quality.
I say: switch to Max, wait a year, and see how good local models have gotten.
Significant_Chef_945@reddit (OP)
Thanks. I am already on the Max plan but don't use subagent tasks. Do these tasks use the same amount of tokens as the main agent? Guess I need to learn more about this stuff!
No-Marionberry-772@reddit
I don't know specifically. My understanding is that a Task run by a subagent works the same as the main agent, but it executes in parallel and its work is hidden, unlike the main agent's. So this suggests to me it would use significantly more tokens than not using tasks.
Maybe your prompts are much larger or something?
zenmagnets@reddit
The strongest local model for your M2 Max with 32GB unified memory is Qwen3 Coder 30B at q4. The best API coding models change quickly, but usage quickly follows price-performance: https://openrouter.ai/rankings
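For a feel of how fast that model could run, decode speed is roughly memory-bandwidth bound: each generated token has to read every active weight once. The sketch below uses assumed figures - ~400 GB/s bandwidth for the M2 Max and ~3.3B active parameters for the Qwen3 Coder 30B MoE at ~4.5 bits/weight - and gives only a theoretical ceiling; real-world throughput is typically a fraction of it:
```python
# Back-of-the-envelope decode-speed ceiling for a local MoE model.
# Assumptions: M2 Max memory bandwidth ~400 GB/s; Qwen3 Coder 30B
# activates ~3.3B params per token; ~4.5 effective bits/weight at q4.

def decode_ceiling_tps(bandwidth_gbps: float, active_params_b: float,
                       bits_per_weight: float) -> float:
    """Upper bound on tokens/sec if decode is purely bandwidth-bound."""
    bytes_per_token = active_params_b * 1e9 * bits_per_weight / 8
    return bandwidth_gbps * 1e9 / bytes_per_token

print(f"~{decode_ceiling_tps(400, 3.3, 4.5):.0f} tok/s ceiling")  # → ~215 tok/s ceiling
```
The small active-parameter count is why a 30B MoE feels snappy on Apple Silicon even though a dense 30B model at the same quant would be several times slower.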
Pro-editor-1105@reddit
Yes qwen3 0.6B actually beats Claude 4.5 Sonnet for coding.
Eugr@reddit
You won't get the same SOTA experience with local models, but they have gotten to the point where they are "good enough" for many tasks. You can always use Claude when you need something more sophisticated.
Having said that, you will run into hardware limitations very quickly: 32GB RAM is just too tight, given that you'll have to keep some of that RAM for your development stack.
Significant_Chef_945@reddit (OP)
Thanks for this. Appreciate the feedback.
Awwtifishal@reddit
Try GLM-4.5-Air with some inference provider (or GLM-4.6-Air when it comes out soon) to see how it performs for your use cases. It won't be the same as claude, but it could potentially be 90% of it depending on your needs. If it works for you, then you can easily run it in a machine with 64 GB of RAM and some GPU like a 3090. If it doesn't, try GLM-4.6 but you will need a bigger machine (or multiple smaller machines connected together). People say that GLM-4.6 is on the level of sonnet 4.0, not 4.5.