Considering ditching Claude/Codex completely
Posted by Adorable_Weakness_39@reddit | LocalLLaMA | 48 comments
They have become completely unusable over the past few days.
A few things I have noticed:
- Codex has cut its 5-hour session cap massively so now you can barely tell it to program fizz buzz before running out of tokens.
- Claude Code has the same problem.
They have both just massively dropped in intelligence as well. I have heard people on X talking about how Anthropic models are being throttled in terms of intelligence (for non-API tokens). I have had the same problem with GPT-5.4, where it just refuses to do stuff and has a bias against taking actions even when explicitly instructed to (which I've heard is a byproduct of limiting reasoning tokens).
This causes people to have to send more messages which then uses even more input & output tokens.
Might take the open-source pill. Perhaps Qwen3.5 27B locally, and GLM5.1 on the cloud.
cheyyne@reddit
Give Seed-OSS-36B a shot; go for the magicquant version. It's as fast as you're ever going to get, it can handle complex tasks, and it has a truly functional 100k of context, as long as you don't mind a little wait.
Macaulay_Codin@reddit
dude, the model layer is going to keep flipping. every 3 months something gets worse or something new drops. if your workflow breaks when the model degrades you don't have a workflow, you have a dependency.
Adorable_Arrival_666@reddit
I’ve been running Qwopus3.5:27b as my coder and it’s been doing pretty well.
Techngro@reddit
I honestly don't know what you guys are doing that you are hitting these limits and seeing poor output from Claude Code. I am on the $100 Max plan and I hit it from both ends with Opus 4.6/high (planning on the web and coding through the terminal), and I NEVER hit any limits. And I am working on multiple projects at once every day. As for intelligence, it's the same or better than it's been since I started using it last year. I give it a well formed prompt and it literally just churns through it.
You guys must have some really bad luck.
fractalcrust@reddit
brother, it was a genius 2 weeks ago and now it's retarded.
i literally vibed out a new production workflow with it and now it can't even run an eval properly for the same workflow, despite ample agents.md files
PhilWheat@reddit
Honestly, this is why I run my own local models. While they can be better in some cases and worse in others, I know what I'm tweaking and can get them as consistent as is possible with LLMs. With cloud providers, you don't know what has changed between your good responses and your bad ones.
fractalcrust@reddit
yea thats a strong argument. i'm considering some RTX 6000s for M2.7 but my experience with it on the coding plan is kind of poor. but i heard others complain similarly so maybe the subscription customers are nerfed
PhilWheat@reddit
What I've found is that the models themselves make less difference than the harness/framework they're in. A great model can be hamstrung by bad infrastructure, and small/less capable models can grind away and get good results if you scope them right.
But that's what got me started on the local journey, I wanted to know how the tools actually worked, not just how to get them to churn out something that kind of worked.
fractalcrust@reddit
do you have a preferred harness?
PhilWheat@reddit
I'm using both Roo for some coding tasks and some custom stuff I've written. It does seem to be pretty variable depending on what you're doing with it as to what works best.
For general queries, I tend to do AnythingLLM - it fits best with my use requirements, but it isn't perfect.
Tymid@reddit
I think this is a phenomenon where a new model comes out like Opus 4.6 and it’s super intelligent and then after using it for a while we start to see its limitations and then we say that it’s being nerfed. Could this be what is happening?
ea_man@reddit
Before declaring bankruptcy they will quant down whatever they are serving and throttle resources toward the big contracts.
They run the business at a loss; that is what happens with more customers and bigger models coming online.
eposnix@reddit
AMD is dropping Claude because of these issues. It's not imagined. Most likely because of the influx of new users over the past couple months
https://www.theregister.com/2026/04/06/anthropic_claude_code_dumber_lazier_amd_ai_director/
nikgeo25@reddit
Yeah I think that's definitely part of it. The expectations increase and people tend to get too hyped and forget what the previous gen model was like, then compare based on the hyped version.
fractalcrust@reddit
i am comparing it to the exact same eval that was used to vibe out the original workflow. it had no issues 2 weeks ago and now it struggles with basic requests associated with iterating on the workflow
nachoaverageplayer@reddit
your agents md file is likely way too long. all those .md files get loaded on every message. it’s gonna eat up your context if you don’t audit them and keep them trim and to the point. especially as your codebase grows.
ozzeruk82@reddit
I would have agreed with you 100% right up until about 3 days ago. Then something changed; it's just really struggling to be impressive. Earlier today I asked for a simple code change, one I even thought, ah, that's silly, I should just do it myself, and then to my surprise it took 5 minutes of investigating to try and do it. I swear a month ago it would have done it in about 20 seconds and then mocked me. Something is going on, they are crazy short on compute and we're all getting queued endlessly, I fear.
cromagnone@reddit
With Claude, I think it very much depends where people are located, and when they use high intensity thinking (and as well as their prompting efficiency and structure). It’s pretty clear they are load shedding at peak US usage times, and I’d be surprised if there isn’t more geographical limitation than just EST/PST.
No_Run8812@reddit
Agree on the limits part, disagree on the intelligence part.
Ok_Weakness_5253@reddit
Opencode, and get claude to modify the model loader with a bootloader for local and networked ollama machines. You can use local computers to serve and a laptop to write into OpenCode on the same network; you'll need to alter the ollama host variable (it defaults to 127.0.0.1), and try qwen coder next, it does tool calls well. But you can tell the limits of a small local model at lower quantization much quicker than with claude or codex. If you don't have 4 3090s or 4 4090s, don't even bother though. You will spend more time waiting on shitty tokens produced locally instead; the trade-off gets really expensive.
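For reference, a minimal sketch of that networked setup, assuming a stock ollama install; the LAN IP and model tag are placeholders for whatever you actually run:

```shell
# On the machine with the GPUs: ollama binds to 127.0.0.1 by default,
# so re-bind it to all interfaces to make it reachable over the LAN.
OLLAMA_HOST=0.0.0.0:11434 ollama serve &

# On the laptop: point the client at the server instead of localhost.
export OLLAMA_HOST=192.168.1.50:11434   # placeholder LAN IP
ollama run qwen2.5-coder:32b "explain this function"
```

OpenCode (or any OpenAI-compatible client) can then be pointed at the same `http://<server-ip>:11434` endpoint.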
CalligrapherFar7833@reddit
Retarded or llm slop recommending ollama
Ok_Weakness_5253@reddit
Use whatever provider you want, doesn't matter to me lol. That's just what I tested it on lol... and it's pretty easy to use and set up, sooo why are you sour about ollama again? You work for anthropic or something?
CalligrapherFar7833@reddit
Are you stupid or what, literally search this sub for ollama
Ok_Weakness_5253@reddit
Is ollama a trigger word for you??? Hahahahahha
Due-Function-4877@reddit
I think we're all hitting up against the limitations of the tech. Local models are pretty stupid, too. I know mine is. I verbally abuse it all day long.
croninsiglos@reddit
I’ve jumped on the Pi bandwagon.
It’s minimal and at a project level you can have it create extensions and skills for itself so it becomes exactly what you need for that particular project.
jeffwadsworth@reddit
Good luck
ivarec@reddit
I've gotten frighteningly good results using StepFun 3.5 Flash combined with simple ad-hoc task decomposition.
Like, I create a list of tasks and have it attack each one independently with my open-source coding agent of choice. The quality of the deliveries is astonishingly good for my use. It's a weird model, where sometimes it will change language during its thought chain (to Chinese, but also others) but then it comes back with very decent coding results.
It feels like a drunk driver that, somehow, arrives at the destination unscathed. And, since it's ridiculously cheap, I can use it with cloud hosting in a sustainable way. But it's the first model that has me considering getting better hardware to run it locally.
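The decomposition loop is simple enough to sketch in shell. `agent` here is a stand-in for whichever coding-agent CLI you drive (not a real command); defaulting it to `echo` keeps the sketch runnable as-is:

```shell
# "agent" is a placeholder for your coding-agent CLI of choice;
# echo is used as a stand-in so the sketch runs without one installed.
AGENT="${AGENT:-echo}"

# one line per independent task
printf '%s\n' \
  "add input validation to the /login endpoint" \
  "write unit tests for the retry helper" > tasks.txt

# one agent invocation per task, so every run starts with a fresh context
while IFS= read -r task; do
  "$AGENT" "Task: $task. Touch only the files this task needs."
done < tasks.txt
```

The key property is that each task gets its own clean context window, which is what lets a cheap model punch above its weight.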
bieker@reddit
In my office it has become almost a meme: someone stands up and yells "is it just me or has Claude become stupid today?", everyone nods in agreement and yells "new version inbound!", and invariably a week or two later a new version of Opus or Sonnet is released. We think they are either reducing available compute because they need it for the final stages of building a new version, or they need it for final 'red team' QA checks or something.
The cynics believe they make the old one stupid for a few weeks so when the new one drops you feel good about it, but the benchmarks don't lie (ha!) so i don't know what they really gain out of that.
So, with that said I need to ask the question, New version inbound?
Dry_Yam_4597@reddit
Mix them, cloud and local. For instance, I use local to review claude-generated code for security issues. Finds a LOT. Also for architectural reviews, qwen works wonders. Then, by feeding a clear, pre-reviewed context into Claude, I save on tokens. Works a charm.
Hefty_Development813@reddit
How does the routing work? You just tell it which to use in the text prompt? Or it works in background?
Dry_Yam_4597@reddit
For now, I simply open qwen code in the same project directory, ask it to generate a review file, review the file myself and get rid of what I think is irrelevant, and then ask claude whether it makes sense and to implement what it thinks does. Was considering a tool call passing default prompts on to qwen and having claude handle the answer, but I actually want to review stuff myself. I do the same for linters and static code analyzers and let claude handle the output on those, i.e. sonar for code complexity, eslint for styles, npm audit for basic security issues.
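A rough sketch of that loop; the `qwen` and `claude` prompt invocations are assumptions about your CLI entry points rather than exact flags, while the linter commands are standard:

```shell
cd my-project

# 1. local model writes a review file (hypothetical prompt flag)
qwen -p "Review this repo for security issues; write findings to REVIEW.md"

# 2. manually prune REVIEW.md before handing it over

# 3. collect static-analysis output for claude to triage
npx eslint . --format json --output-file eslint.json
npm audit --json > audit.json

# 4. hand the curated context to claude in one shot
claude -p "Read REVIEW.md, eslint.json and audit.json; implement only the fixes that make sense"
```

Keeping the human review step between 1 and 4 is the point: the expensive model only ever sees a curated context.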
xeeff@reddit
don't forget to look up 'caveman' on github ;)
PandemicGrower@reddit
I legit started a new project with Claude yesterday as a demo for a buddy, it was an epic fail. We hit two session limits before getting a working app.
I went home and gave the same info to Google AIStudio and within 15 minutes I had a working interface to build off of.
I was about to go home and cancel Claude; I get home, and my terminal says I have $20 in free extra use from Anthropic. So they know users are having issues.
CalligrapherFar7833@reddit
Why do you call it a throttle on intelligence? They just run reduced quants, but that's uncomfortable for PR
sod0@reddit
Try Gemma 4 26B a4b. It's a beast! Even with zero prompting it already is great at agentic coding.
mtmttuan@reddit
On every post like this I always feel like, unless you already have the hardware, you're planning on paying way more for way less. If your reason for going local is privacy or fun or whatever, I can get it, but for performance? You'll get worse code, generated slower, than whatever you had from claude/codex/gemini.
Objective-Stranger99@reddit
I jumped into local LLMs for privacy, but stayed because I got pissed at rate limits and it's really fun to tinker with, even with mediocre hardware (GTX 1080).
LivingHighAndWise@reddit
"Might take the open-source pill. Perhaps Qwen3.5 27B locally, and GLM5.1 on the cloud."
You are going to be very disappointed. While GLM5.1 looks good in benchmarks, in reality it isn't even close to the level of Claude or GPT. I do use Qwen3.5 27B locally on an Asus GX10 with a max context window and it does work fairly well for light work.
Opening-Broccoli9190@reddit
Do you use OpenCode? or any oss ClaudeCode?
LoSboccacc@reddit
GLM 5.1 is like old claude 3.7: it can write code but can't do structure on its own.
Currently I have codex writing interfaces and GLM writing the classes (and even then, every so often I need to clean up leftover shit all over the place).
Haven't found any local coding model I can fully trust yet, but qwen coder next works OK for bulk implementations, minor refactorings and simple helper methods.
Adorable_Weakness_39@reddit (OP)
What harness do you use for both?
Opening-Broccoli9190@reddit
yeah, I was thinking about the same setup - you can also find some baremetal services to deploy GLM on
Ok_Weakness_5253@reddit
Minimax 2.5 and 2.7 are really good at coding as well. If you can run them locally, you're laughing at anthro and openai
Barry_22@reddit
I'm sitting on 27B AND Claude. AMA.
gyzerok@reddit
What quant? What engine? What configuration?
fittyscan@reddit
I did exactly that. I dropped Claude and Codex in favor of Swival with GLM-5.1, and that combination works really well at a fraction of the cost.
Weird_Search_4723@reddit
https://github.com/0xku/kon
I shared this recently in this channel, might be exactly what you are looking for
https://www.reddit.com/r/LocalLLaMA/comments/1shkqj5/gemma426ba4b_with_my_coding_agent_kon/