Major drop in intelligence across most major models.
Posted by DepressedDrift@reddit | LocalLLaMA | View on Reddit | 330 comments
As of mid Apr 2026, I have noticed every model has had a major intelligence drop.
And no I'm not talking about just ChatGPT.
Everything from Claude(Even Sonnet along with Opus), Gemini, z.ai, Grok all seem to ignore basic instructions, struggle at simple tasks, take very long to respond, and the output seems deliberately shortened and very shallow. Almost like it's in a "grumpy" mode. I tried this in incognito mode so it's not my customization or memory influencing this.
It's like they deliberately want you to stop using their service. I guess our data is no longer needed. Just two weeks back it used to be much smarter than this.
To test this I rented out a H100, and tried GLM 5 with the same prompt (the drive to the car wash one) across both instances. GLM5 running on the rented GPU answered it correctly, compared to the one on z.ai.
Have they lowered the quantization really low to maybe Q2?
I guess going local or using renting GPU or an AI monthly service that lets you pick a quant level is the way to go
Britbong1492@reddit
Yes, grok is bad now, I have a Heavy $3k sub and the deterioration is real. Your idea of renting a H100 is pretty good. I was thinking to just buy 2 Apple Macs with 64GB or similar as they are all worse. I also have Claude Max $200pm, and that's not so bad a decline but it's making rare mistakes more often. It's all in training something which they then decide is too dangerous to share
Big_Actuator3772@reddit
grok engineers were found to basically be bench maxing grok, in real world application, grok is far far behind..I would not be paying any money for grok ATM. that's why elons got an entire new team behind it
Britbong1492@reddit
It's cheap with live info - who else has live info, I'm not sure?
nuclearbananana@reddit
Any model with web search?
New-Implement-5979@reddit
Damn how are you justifying 3.2k monthly for subscriptions what are you doing if it is not a secret ?
Regular-Cancel-2161@reddit
You can get an H100 or H200 for between $2 - $2.6/hr. The H200 can pretty much run any 400B model you want.
Few_Painter_5588@reddit
Everyone is quantizing their models because everyone is haemorrhaging money, and OpenClaw quite bluntly is squeezing the industry
TheDreamWoken@reddit
Hmm, what exactly is OpenClaw doing right now that causes them to run their LLMs at lower quants? It does seem to look less like the LLM was changed but more along the lines of running it on lower quanitzations. Though I guess I am an optimist, I thought commercial LLM providers used vLLM, rather than llama.cpp.
Ariquitaun@reddit
I've been running claw for a few days with Gemma 4 e4b as a test and the results are encouraging. For my use case anyway.
fragment_me@reddit
What are you using claw for? Everytime I look at the use cases they all just seem silly.
Ariquitaun@reddit
A bunch of itches that need scratching. The main motivator was to keep up with my eldest' school calendar - we receive a lot of newsletters, sometimes attached to emails, sometimes to download. Emails with info, there's two parent websites as well with stuff on them. Basically I've instructed claw to sift through that stuff, update a markdown file that the chatbot can query if we have questions and generate google calendar events for my wife and I. The claw has its own google account to do stuff and I have on my account a few email forwarding rules for school stuff.
I have other tasks for generating a daily report with links of topics I'm currently interested in that I can read while taking a shit at some point during the day. My wife also has a few itches to scratch that I need to get around implementing.
It's all pretty household stuff really, nothing exotic
Competitive_Travel16@reddit
Honestly this is the most reasonable Claw use I have yet read. But it doesn't need full agentic autonomy, just access to inbox and calendar once a day.
Ariquitaun@reddit
There are many ways to skin a particular cat. This is mine.
Competitive_Travel16@reddit
What is your email backup strategy?
Ariquitaun@reddit
What for? Claw has its own google account.
Icy_Concentrate9182@reddit
Can you tell it to look after your kids and go on holidays?
Ariquitaun@reddit
I can't, I don't yet have the automation in place for the cage feeder
glad-you-asked@reddit
Giving claw it's own email is the best practice 😁
Competitive_Travel16@reddit
Seconding this. And, forwarding with e.g. a Gmail Rule so it only sees the subset you want it too. There is nothing worse than an autonomous agent sending your bank account info to a Nigerian prince.
TheSpartaGod@reddit
what’s the use case that can be handled by such small models?
tophlove31415@reddit
Tons. Smart chunking, organizing and summarizing returned information from a vector database search, self directed web browsing and learning, ocr, user interaction, simple decision making (ie: this is the context, here are the options, choose which is best). They can essentially do any of the things the sota models can do (with a well designed harness) as long as you recognize you will get more errors.
Funny-Blueberry-2630@reddit
This guy builds agents.
Ticrotter_serrer@reddit
All small A I. use case?
toadi@reddit
I don't use openclaw. But for example fetching my calendar entries. Organizing my emails. Fetching and interacting with services where it is quite straightforward.
simple rote admin tasks work easy on these smaller models.
voronaam@reddit
Just curious, did it encounter any complicated tasks?
For example, I was trying to organize a small thing recently with a very unreliable party and the communication so far has been like this:
Me: Hello. Can we do a thing?
Then: (3 days later) I am in South Africa now. WhatsApp me (no phone number provided)
Me: When are you back? I'll reach out then
Them: next week
Me: (next week) Welcome back. Can we do the thing?
Them: I am still in South Africa
Me: (next week) Are you back?
Them: Yes. Let's meet in person to discuss (no address)
Me: Sure, how about on next Friday?
Them: (on Saturday) Missed the message. Just call me (still no phone number)
Me: (Finding phone number on one of their websites, Calling) How about the thing
Them: Yes, we can do it. Just fill the form on the website.
Me: Filling the form with the request.
Them: (next day) "Hi Peter, we can do the thing on those days" (I am not Peter)
Can a small model handle this?
TheTrueSurge@reddit
Lol I hope you REALLY need to work with that person, that sounds awful
voronaam@reddit
They are just not motivated. It is a small sailing trip and I am talking to the owner of the boat and a skipper. If nothing happens, they get to chill on their boat. If it does, I'll get to chill there as well and they will get a bit of extra cash. That is not that much of a difference to them, as they are sufficiently wealthy and retired.
My work related communications are way better than that ;)
QuinQuix@reddit
How many misses do you encounter?
I've read hallucination rates hover between 3-10%.
Doesn't sound like much but calendar planning is quite a critical task.
A big medical office doing 100 appointments a day couldn't handle 1-2 wrong/double/missed appointments a day.
People always counter people make mistakes too but they miss that usually such systems (eg a front desk managing appointments) have layers of redundancy and self correcting ability.
The entire point of AI is to offload work, so I'm curious to what extent you feel this is actually possible using your Workflow.
IShitMyselfNow@reddit
This matches my experience
I've been using Hermes Agent with Qwen 3.5 4B to great success. And for codig, anything more complicated than a simple script I've been delegating to a better model via Opencode from the agent.
I think the advent of agent skills has really improved the performance of smaller models in things like this. Small models have actually been semi-useful at agentic work ever since around Qwen 2.5. But only if you gave them a lot more instructions/detail, more API examples, few-shot prompting, etc..And you could do this, but then managing the data you give them for the task at hand, managing context, etc. was tricky at best. Agent skills kinda solves that problem.
michaelsoft__binbows@reddit
I've been meaning to keep up, but simply cannot. I got deep into opencode for about 7 weeks and I simply have no bandwidth to explore pi agent and hermes agent like i had hoped. Been just driving codex and claude code since then and got my productivity back. Is hermes any good...?
One of my projects has been about a paradigm of having an agent harness harness, e.g. something that puppeteers codex and claudecode and opencode. What you wrote about hermes seems to intersect with that idea so you got me curious.
IShitMyselfNow@reddit
I like it. It does what it does well.
I'm not using it to its fullest extent, and I'm definitely not running it like Openclaw, but I've been using it as a "shitty assistant that can automate some things for me that would be better hardcoded as a scdipt/workflow but I don't have the time for that anymore" to great success. And it does do a decent job at interacting with Opencode. It also supports Codex and Claude Code (and Hermes). Worth a try IMO.
-p-e-w-@reddit
Have you used a current-gen model of that size? It’s easily on par with GPT 3.5 intelligence-wise.
Salt-Willingness-513@reddit
Is a bit bigger, but so far 26b a4b works very well for me with a claw alternative from scratch
Ariquitaun@reddit
I'd totally go for that one, but all I have for that is the igpu on a ryzen, 8GB RAM and not a lot of extra GTT memory for it. I can start the model, do some stuff and eventually it ends up crashing.
The E4B model does a pretty good job if you enable its thinking mode.
WhopperitoJr@reddit
The development of OpenClaw and its consequences have been a disaster for the AI race.
zoupishness7@reddit
It definitely wasn't just OpenClaw though, it's been out since November.
The real spike started at the end of March. Karpathy's autoresearch had been published for a week. Then Anthropic had a period of 2x Claude usage during off-peak hours, which incentivizes the development of round the clock systems that don't have humans in the loop. With a good harness and autoreseach, it's not that difficult to evolve an automated system that makes more money than a Claude 20x subscription costs. Profitability on API pricing is significantly more difficult, the ratio's getting adjusted, but at the time, it was: $200 subscription=~$5000 on peak, and ~$10000 off peak.
For certain applications, and markets, once you have that kind of system, the math becomes very simple: buy more subscriptions, use them as much as possible, and reinvest the money into buying more subscriptions. I did. I made some money, but spent a week thinking I was a couple months away from being a millionaire. I knew it was unsustainable, but thought they would wait a bit to crack down. I underestimated how many people had the same realization around the same time.
That's what lead to the shortages, and why they banned 3rd party harnesses on subscriptions, and why they've had to make the model stupid to meet the demand. But models continue to get smarter and cheaper, and I think a lot more people know they can make money now. I, for one, am focusing on token efficiency now, to try to push what I built during that time towards API profitability.
Scew@reddit
yep, that did it. I had gotten a harness based off it from the openSourceAI sub that then got taken down and the account that shared it deleted and the repo moved to private. Was for making agents for gaps in places that could use one. Just last week modified it for just general research instead and hit a weekly limit that I hadn't hit before. Was kind of surprised but figured it was too good to be true when I had initially started using it.
WhopperitoJr@reddit
These are great points and you are probably right to identify more immediate causes beyond just agent-based inefficiency, though that certainly does not help.
I think we will see a movement (if it hasn’t stated already) towards efficiency and using models that are strictly the size you need for the task at hand. Those with the machines to use them will try local models like Qwen first. Not every task needs Mythos-level thinking.
bh9578@reddit
Everyone became a power user basically
Rcrecc@reddit
For a layman like me who has very little familiarity with OpenClaw or its impact on the AI race, can you clarify what you mean?
WhopperitoJr@reddit
OpenClaw is an agent harness framework where you can set up LLM-controlled agents to do certain tasks autonomously.
The reason why it is disastrous is because these agents are left to run without human monitoring, and they can be pretty inefficient. If they get caught in a thinking loop, they will keep using up API usage and processing power until a human intervenes. If you are using OpenClaw in the first place, you are probably not constantly monitoring the agents.
Basically, it is wasteful and uses up too much of the collective processing power that is available via API. Running it locally is less problematic as you are just using the processing power of your private device. More clawdbots = higher usage fees for you and me.
party_peacock@reddit
But users are still paying the bill for their usage though?
Or is the problem that the actual cost to host those models exceeds what the companies charge their users?
WhopperitoJr@reddit
The problem is that processing power is finite and that if demand for power rises because of inefficient bots making API calls, the overall price of usage will rise, because there is no supply surplus to return the price to its original equilibrium value.
But because companies generally don’t want to change prices as consumers are sensitive to price changes, they artificially constrain supply though usage limits.
So, more clawdbots running via API means less usage available overall, which means you and I hit budget limits sooner, regardless of if we even use OpenClaw ourselves.
Thebandroid@reddit
Huh, so they don’t like people sucking up all the resources… ironic
Big_Wave9732@reddit
It's forehead slapping, really, as the AI community slowly comes to realize how wasteful their AI habits are.
look@reddit
No, OpenClaw users are like locusts: they find a decent subscription service and then descend upon it en masse, devouring all that was beautiful and good, leaving behind nothing but a barren, desolate wasteland.
Coppermoore@reddit
OpenClaw users are paying customers buying a commodity, like everyone else. Model providers can raise prices or impose rate limits as they wish. Market will sort everyone out.
WhopperitoJr@reddit
This logic can be used to justify price gouging in cases of extreme resource shortage.
Coppermoore@reddit
Sure can.
WhopperitoJr@reddit
I am against price gouging.
Neither-Phone-7264@reddit
also, it's about as token efficient as dogshit, even without skills or tools.
WhopperitoJr@reddit
A tool so inefficient and resource-intensive would have been better off being built behind closed doors and perfected rather than unleashed for every middle manager with an API key to use.
saint1997@reddit
I had the displeasure of watching Pete Steinberger present what was then "clawdbot" at a conference in December. The guy is a complete and utter clown. Not only did he admit he'd given the thing root access to his Mac, he also left the instance's phone number visible at the top of his WhatsApp chat while he was demoing the group chat his bot was in with all his friends.
When I saw it had gone viral and everyone was using it I could do nothing but hold my head in utter despair
Rcrecc@reddit
Thank-you!
rm-rf-rm@reddit
Seriously doubt anyone is actually using OpenClaw beyond a few weeks of messing around.
Icy_Concentrate9182@reddit
Quantizing context, kv, quantizing the quartz. I literally saw my model get dumber from one moment to the next, asked about it, and told me they run livr dynamic quant. No idea what that is, but I can imagine.
I think they tried to market by making AI accessible to the masses, but as is, nobody will take AI seriously.
My Qwen3.5 9B have me better results than a paid chathpt account. 9 fucking billions of parameters only vs whatever monstrosity chatgpt had gone up to these days.
pab_guy@reddit
> told me they run livr dynamic quant
The model or the company told you?
Icy_Concentrate9182@reddit
The model, of course.
pab_guy@reddit
Ahh ok, yeah the model wouldn't know that unless the vendor put it in the system prompt. It's probably just hallucinating.
gitsad@reddit
yes, it's all about the money
BorgMater@reddit
"OpenClaw quite bluntly is squeezing the industry" - for the love of me, I cannot comprehend what are the business opportunities that utilize OpenClaw..
geneusutwerk@reddit
More like ClosingClaw, eh?
Benderbboson@reddit
Bravo 👏🏽
edge41_architect@reddit
Exactly right. The quantization pressure is the market signal everyone should be paying attention to. When serving costs force providers to silently downgrade model quality, local inference with known quant levels becomes a reliability advantage, not just a privacy one. You control the quality-latency-cost tradeoff instead of hoping your provider doesn't shift it overnight.
EvilEnginer@reddit
Yep, good losless quantization now is more money for company. But, they still can't deal with servers overload. Too many people.
One problem with quantization, the lower the quantization, the faster all architectural errors come to the surface. And this is exactly what we see now.
Firm-Fix-5946@reddit
quantizing weights smaller isn't really a good way to save money at large scale though as far as i know? because of how batching works? they usually aren't compute-bound anyway?
does anyone have any actual evidence or technical sources to suggest large scale inference providers would do this / have done this to save money?
pab_guy@reddit
They have no evidence. It's certainly likely on the free tiers and plan tiers. They cannot nerf the APIs however, that would violate contractual agreements and put customers out of compliance.
Most of the people here have no understanding of this stuff.
a_beautiful_rhind@reddit
That could have been solved with rate/request limits.
fuckingredditman@reddit
nope, LLM inference is inherently extremely expensive to run, and you can't simply rate limit everyone.
when looking at status pages/availability of the large providers, they are evidently running at the absolute limit of what the infra can do and the larger customers probably have some SLA they have to fulfill. they can squeeze lower tier subscriptions with speculative decoding/quantization but that doesn't help much.
traditional resource sharing methods common in cloud computing / SaaS products all don't work for LLM serving:
source: operated infra for a larger SaaS company and also worked on some llm inference engines and saw how much infra all of it costs.
a_beautiful_rhind@reddit
Stopped reading right here.
fuckingredditman@reddit
and why is that?
a_beautiful_rhind@reddit
Because it means you don't know what's up. You're telling me that labs can't figure out quantization + hadamard when single developer projects like ik_llama and exllama have had it?
fuckingredditman@reddit
i'm not saying they can't figure it out.
but in a software company shipping such a feature isn't a matter of merging a PR that compiles and passes tests and watching it roll out to hundreds of millions of customers.
it needs to be properly validated/tested and then rolled out gradually to not break the service/infra instantly.
and besides that, these new KV cache quantization methods evidently aren't easy to implement nor are they easy to test.
i've watched the llama.cpp impl from thetom for a bit and clearly it's not a matter of implementing it and shipping it.
a_beautiful_rhind@reddit
Thee methods aren't new. That's the whole point. They're new to you. Which is why I bristle at your analysis. Sure it's not "easy" but that's why you have someone architect these things and test it before putting it into production.
How did these people build their inference stacks to begin with? I doubt it's pip install vllm. There's presumably a bunch of in house tricks they've not put into formal papers which are used to competitive advantage.
fuckingredditman@reddit
that's not what i'm saying and none of what you are saying changes the fact that rate limiting doesn't solve the problem.
the point you seem to be trying to make is that the big guys have solved quantization but evidently that's not the case given the clearly visible symptoms of bad availability, rate limits and subjectively worse overall inference performance
a_beautiful_rhind@reddit
Rate limiting does solve one problem. Except as someone pointed out, users making more accounts. If you see a flurry of requests from a user you throttle them. As a result the inference you budgeted for gets split more evenly.
Not limiting means one openclaw or agentic instance sucks up the server time. The only other thing you can do is add more capacity but that is likely cost prohibitive or it would have been done.
The point I'm making is that turboquant is hype and that the big guys already have quantization if they so choose. They're also out of minerals and those visible symptoms show this to be true. Compression already wasn't enough.
All that's left is raising prices, limiting availability and expanding. Unless of course there's some real breakthrough but that's a wish in one hand, shit in the other type of situation.
Aphid_red@reddit
It think you might be surprised about the lack of quantization or speculative decoding in model usage.
Remember that there are several layers insulating the technicians who make the AI models from the investors that end up paying for the VRAM.
First you've got the separation between the team at Open/X/Anthropic/MistralAI and their financial people.
Then from there to the hyperscaler they're buying compute from (microsoft, oracle, etc.), and within those it might also be different groups that handle the selling vs. the buying.
Then from the scaler to the actual datacenter company.
Then from the datacenter company to the lenders/shareholders.
Each indirection step lessens the responsibility or introduces a place where incentives might not be passed on properly. For example, if you sell a subscription, you decouple usage from cost, and so a client might cost you more than the price of the sub. Similar thing might be going on between the companies themselves, where the buyer could be taking advantage of the seller's terms and using so much compute it costs the seller more than they're being paid.
When there's much fewer of these in-between layers and a company has to manage its token usage directly you see much better optimization.
At the location of the model makers: The easiest thing to do is to do everything in fp16 (and a few at fp32), because most computations are numerically stable at that precision level. Quantization takes work and lowers output quality. Even if you look at openrouter you see mostly fp8 or fp16 being served.
I feel like clawdbot is the cryptokitties of AI. It immediately falls over with the first attempt at serious usage, because of how hardware intensive it really is.
a_beautiful_rhind@reddit
Wouldn't they serve at BF16 now? I get how fp8 is a no brainer though. I was under the assumption non garden variety providers would have more custom serving solutions than just off the shelf sglang/vllm.
I don't know any software that serves different quants based on load but it sure seems like that's what happens at certain hosters. So they're already writing in house software to make all of that happen. Even if they're mooching off someone else with a lopsided agreement, that can't last long and they'll have to optimize their pipeline eventually.
There's also a bit of the "why am I paying for a quantized model" bit, but it seems that most established companies don't care about degrading user experience all that much. If these guys aren't staying up to date on what's possible I'd be really surprised indeed.
Neither-Phone-7264@reddit
They just get more accounts. Literally. I've seen some dipshits with 20 different accounts all on the cheapest possible tier.
Individual_Yard846@reddit
https://pypi.org/project/catalyst-brain/
Individual_Yard846@reddit
I can literally solve all of our problems. I built a novel platform solving kv-cache and compresses full agent state in payload.
easyEggplant@reddit
● This is almost certainly AI-generated slop. Here's the evidence:
Red flags:
RLHF with biological dopamine feedback loops." These are real CS/physics terms mashed together in ways that don't make technical sense.
context recall would fundamentally break known computational complexity bounds for attention.
Rust+PyO3 packages with substantial logic are typically megabytes.
Verdict: This is fabricated. It reads like someone prompted an LLM to "create a revolutionary AI package description" and published the output to PyPI. The technical
claims are incoherent, the release history is fake, and there's no source code to back any of it up.
tavirabon@reddit
I'm sure that's exactly what your LLM told you. And why wouldn't you believe it? A whole industry of engineers hasn't solved it yet, they must be incompetent.
Neither-Phone-7264@reddit
Yeah. It's not like this is one of the biggest fields in science, let alone computer science, right now.
IShitMyselfNow@reddit
Oh that's fantastic. Please share your GitHub repository and/or research paper(s)! I look forward to reading them. I'm sure they will be a fascinating read.
-becausereasons-@reddit
This. Unfortunately new-edge tech is held back by ancient Ecnomic model. We don't have the compute/energy to do it justice. We need Nuclear Fusion, Fission will suffice for now. Energy is incredibly costly... We don't have enough chips being manufactured.
Individual_Yard846@reddit
I bet they will start dynamically quantizing models to people who don't typically show the requirement for higher intelligence, if not already. Some people may get nerfed, while others doing important work they want to steal, get all the compute in the world.
Lucky-Necessary-8382@reddit
This is actually discrimination
yrro@reddit
in what jurisdiction?
Lucky-Necessary-8382@reddit
In mine
Chill84@reddit
crime is legal now
BlurstEpisode@reddit
Discrimination is not always illegal, it depends on what characteristic is used to discriminate. Just ask financial institutions who stay legal by discriminating based on zipcodes
colin_colout@reddit
maybe in their country token-addicts are a protected class
DarkArtsMastery@reddit
This. It is all about the quality of data you as user can provide, especially as you pay nothing (most users are not paying anything for AI).
Individual_Yard846@reddit
this is why i am really focusing my startup on providing long context and private AI cheaper, i've built novel infrastructure and solved kv-cache...so we are making progress.
And I don't mean improved. I mean the scaling laws are fundamentally reversed utilizing my architecture. My sdk is live, catalyst-brain, if you wanted to check it out.. free and welcoming contributors in research/academic areas.
Robonglious@reddit
"quantum heads"...
Individual_Yard846@reddit
i assure you this is not something generated in a day / an AI hallucination. this is the result of multiple years of research. dont knock it till you try it, it can literally be verified within 20 minutes of telling an AI agent to install and check the docs online.
easyEggplant@reddit
● This is almost certainly AI-generated slop. Here's the evidence:
Red flags:
RLHF with biological dopamine feedback loops." These are real CS/physics terms mashed together in ways that don't make technical sense.
context recall would fundamentally break known computational complexity bounds for attention.
Rust+PyO3 packages with substantial logic are typically megabytes.
Verdict: This is fabricated. It reads like someone prompted an LLM to "create a revolutionary AI package description" and published the output to PyPI. The technical
claims are incoherent, the release history is fake, and there's no source code to back any of it up.
Individual_Yard846@reddit
https://pypi.org/project/catalyst-brain/
Syncaidius@reddit
All of that goes out of the window as soon as people start building their own models from scratch.
It's worth remembering this cycle of improvement has been going on for decades, pretty much since the development of the first computerized neural networks.
In the early 2000s chat bot agents were hyped as being the biggest development in history, yet here we are.
You only stay on top for so long. Eventually someone/something better will come along.
zenom__@reddit
A lot of the other models use claude and gpt to seed their models anyway.
Syncaidius@reddit
Anthropic have likely done exactly that to build Claude, given that they were only founded in 2021 and 'somehow' were able to build an entire model from scratch and become #1 within 5 years, whereas the big AI companies have taken decades to get to this level and funded most of the R&D in this area.
Essentially, there's no way they could have gotten from zero to hero without using someone else's work or distillation in such a short span of time.
Given how repeatedly dishonest Anthropic have been compared to other AI companies, I think it's fairly obvious at this point. Even more so after their latest blunder with Fortune.
tavirabon@reddit
Intelligence triage. Though I'm not sure it would be practical, not even tokens coming from the same model in the same request are equally difficult, that's the whole premise behind speculative decoding. I could see them using spec-d dynamically based on server load, though. Those increase hardware requirements though, it only busses people quicker.
Chill84@reddit
I question the ability of anyone, human or machine to discern between "important work" and schizophrenic ranting.
alphapussycat@reddit
Tbh, probably not far from it. When I was working on a game engine, at first I tit very few messages until time out. After challanging it and telling it what way I want to go etc, and telling it why we can't do what it suggested etc, it seemed I suddenly had no limit.
Now I've tried to figure out how to build an agentic system, and I get like 3-5 messages until I get put on timeout.
Jolly_Teacher_1035@reddit
Why are you like this?. They can have several models hosted with different quantization, and choose (route) between them. But the cannot "dynamically quantize" something.
Individual_Yard846@reddit
not yet they can't..
a_beautiful_rhind@reddit
You think so? There's an easy way to tell somehow? Seems simpler to go with high load = start serving quantized models and rerouting to the "flash" version.
Sufficient-Past-9722@reddit
Claude already seems to do this a bit on the web ui, switching from Opus to Sonnet without saying anything on a fresh prompt.
Qwen30bEnjoyer@reddit
it might be psychological in nature. As we gain familiarity with the “prose” and style of these LLMs, you get better at seeing through the fluff and recognizing common failure modes.
I still think the best method to detect silent quantization would be finding the covariance between models on a common benchmark, like one of the HLE public question sets in the chatbot harness. That way if Gemini suddenly scores 20% lower against Opus than it did yesterday, or only during peak hours, we know what happened.
MasterScrat@reddit
I used to work for an early LLM provider and we’d sometimes get feedback like "wtf you destroyed the model" or "wow the latest update is amazing, please don’t change a thing" when we had done absolutely 0 change, literally not restarted the serving container
FullOf_Bad_Ideas@reddit
I agree.
And this website does continuous testing - http://isitnerfed.org/
Looks like Zhipu is nerfing while OpenAI and Anthropic aren't.
letsgoiowa@reddit
Great resource! Saved that. It seems like at least for coding at the moment it's some other problem. Maybe performance is dropping off hard for people with large enough context? The reason I think this is the AMD codebase specifically: apparently Claude was struggling to figure out what was going on
evia89@reddit
For zai we have https://zaimonitor.vercel.app/
FullOf_Bad_Ideas@reddit
I don't see histrical eval success rate there, just throughput numbers.
evia89@reddit
Yep its not ideal but better than "is nerfed"
FullOf_Bad_Ideas@reddit
isnerfed has eval success rate for GLM 4.5 Air, you just need to select it from the dropdown. Have you missed that?
evia89@reddit
My bad. Nice that data correlates on these 2 sites
14:00 UTC to 00:00 UTC seems to be optimal time slot
cromagnone@reddit
I mean, assuming the test is useful, that site’s data basically suggest short term random performance variations with no trend over time by the providers, but if you hit a downward oscillation you might complain about it.
Thomas-Lore@reddit
All those sites always show results within margin of error.
colin_colout@reddit
Very reasonable take. There are so many possible explanations that are simpler than "every model got shitty all at once" (Occam's Razor).
It could be agent changes. Claude code for instance makes dumb changes all the time that measurably kill quality. Their new progressive tool exposure has my subagents (even opus) using curl for their first research attempts before webfetch becomes available.
It could be that websites won't let you scrape them anymore, so getting good context is no longer one or two tool calls. Github now shows ads for copilot to my opencode/claude code's webfetch attempts instead of code (lol). Reddit completely blocks llms when they can.
anthropic and chatgpt web clients are becoming an ever growing black box that resembles their bloated coding agents, which are also black boxes.
It's still possible that Anthropic sometimes serves a quantized Opus when traffic is high, but the above is absolutely happening. A lot of quality complaints (maybe not OP's) come from people who only interact with Anthropic models through their slop coding agent (or they do web research and aren't realizing that context is being poisoned by pages built to make web scraping harder).
my_name_isnt_clever@reddit
I joined the Anthropic Discord right after Claude 3 came out, and people have been bitching at them about models "degrading" shortly after release over and over. Yet there is never any proof that stands up to scrutiny, it's all just vibes and "trust me bro".
I take all these claims with a massive grain of salt; science is built on citations and peer review because humans are awful at eliminating their own bias, and the non-deterministic nature of LLMs makes it 10x worse. There needs to be hard data for these claims.
Neither-Phone-7264@reddit
Some of them are literally confirmed to dynamic quantize while others score miles worse than they did on launch. It's not purely psychological
EndlessB@reddit
I don’t think it’s psychological, working with LLMs (or being a heavy user) leads people to be very sensitive to changes in the LLMs themselves. It’s often users that raise changes to platforms which are later confirmed only because of public outcry.
I’ve personally noticed an intelligence drop across the board on the models I tend to use, particularly since the start of April.
FullOf_Bad_Ideas@reddit
People here were complimenting Qwen 30B A3B Coder Distilled (distill of 480B) for weeks. It turned out that the author messed up with his vibe coded distillation and weights were the same (sha256 match) with the un-distilled original. We know for a fact that people have those psychological reactions and like models better or worse depending on what the model card says, not on what the model does.
Long_War8748@reddit
True, but we humans tend to see patterns.... where there are none 😅.
Funny example, OmniCoder released a new model, but accidently uploaded the old version (so both were the same model). There were so many posts about how much better v2 is!
Until someone checked the hashes and noticed both are the same model, haha.
mouseofcatofschrodi@reddit
Humans also believe the weirdest stuff possible: guys walking on water, converting water into wine, etc... We see many patterns that are not really real.
My personal subjective experience is that, shortly after publishing a new model, they seem to be very smart, and shortly after they drop the quality.
With chatGPT I have seen recently it is able to work crazy stuff, working 1h on a single promt and using many tools within the chat. The thinking versions are so good now, and codex also is getting crazy good. But the instant model on chatgpt is garbage, I cannot believe how stupid it can be...
zenmatrix83@reddit
this is likely the case I've seen, there are more issues with tooling then actual models, claude code introduces bugs that I've seen people tie to model issues all the time. These things are better to leave auto update off and test updates else where.
willitexplode@reddit
Wouldn't including models with open weights where you control the hardware as controls be superior to covariance?
ProfessionalSpend589@reddit
If that makes you feel better - I can’t make my local models to implement a reference IRC server from inside opencode - they just stop working after a bit and the resources have 0 utilisation while opencode’s progress bar is animating.
Still figuring out how to debug it. Small steps seem to work better.
lombwolf@reddit
Are they limiting inference usage for training?
shenglong@reddit
I'm about to give up on Claude. I watched it introduce bugs and delete working code in real time. I have to literally ask it how many regressions it introduced after every change.
Savantskie1@reddit
This is a prompting issue. You need to give guardrails and explicit instructions either in the system prompt or in your message to it. Otherwise it will be helpful and refactor willy nilly all by itself
shenglong@reddit
It's not. Today it changed working code, denied that it did, then blamed one of its agents after I asked it to explicitly check who made the change. After that, I asked it to find all the regressions in that session caused by its agents. It found 4.
MrB0janglez@reddit
This is exactly why I keep preaching local inference for anything production-critical. When you are on a hosted API you have zero visibility into what version you are actually hitting or what quantization level they quietly swapped in.
The financial pressure theory tracks. These companies are burning through cash and the easiest lever is model quality -- most users won't notice or won't complain loudly enough. I've been running evals on the same prompt set monthly and Sonnet in particular feels noticeably different on multi-step reasoning than it did 6 weeks ago.
If you are building anything where output consistency matters, the play is to keep a local fallback ready and treat hosted APIs like a third party dependency that can change under you without notice.
shveddy@reddit
I don’t know what you’re talking about. I started using codex in earnest about a month ago, and I can in fact let it run for an hour and it’ll implement the features I want into an iOS app. In terms of independence and consistency and reliability this is easily an order of magnitude better than it was in 2025.
To be clear I’m not trying to one shot a prompt. Thats a dumb thing to do anyway. I progressively build out component capabilities in a debug sandbox environment and then outline precisely how I want those capabilities to exist within the app, and then the hour of totally independent operations is just it chewing through those instructions and implementing with a lot of trial and error.
But the fact that it can keep at it and spit out an actual, functioning Swift app on the other side without getting lost or confused is mind boggling and it’s at least a 10x improvement over where we were a few months ago.
(TLDR If you’re saying the intelligence has dropped recently, obviously you haven’t seen it control x-code entirely via shell commands, keeping your instructions in mind over a long time horizon)
OmarBessa@reddit
the models are spontaneously writing in chinese also
i believe Q2 is the norm right now
New-Question-3542@reddit
Not for gemini flash
GMP10152015@reddit
They are cutting costs. A good LLM, currently, consumes too much energy and CPU/GPU, and the demand is much higher now (too many users). Until they build new datacenters with more efficient hardware, the experience will be low.
But on the other side, they are investing to optimize the software (see TurboQuant, reducing memory), and making better low-weight models.
Every new market in the beginning has low margins and is very inefficient, and the transition to a more efficient model with quality is not easy.
BrechtCorbeel_@reddit
Yeah a couple months ago at beginning 2026 and end 2025 it could all do very long courses with very intelligent design when it comes to how to build a course.
Now it struggles to come up with a couple of sentences and something interesting to read. and that is even with OPUS
Funny-Blueberry-2630@reddit
Maybe u got smarter?
IrisColt@reddit
So, that's why LMArena's answers are so slow.
TallestGargoyle@reddit
Running locally I've noticed many thinking models spend so much of their context thinking and rethinking what feel like basic clarifying questions. I get barely three prompts in before the chat just bricks itself by running out of tokens.
Savantskie1@reddit
If you have a larger system prompt up front and give the model more information per message it doesn't have to think so often. This is a skill issue. Just asking the model a vague question or instructions makes the model have to think harder and try to guess your intention
geoffwolf98@reddit
Reminds me of the early days of digital TV - onDigital , initially the quality was superb, then after about a month I noticed the quality start to drop, motobikes going past would blitz the stream, turnded out all the channel owers had multiplexed their channels to wring more money out of the subscriptions.
Lower bandwidth meant more channels but it also meant it looked like shit and was very prone to glitching. Cared they not.
Looks like the same happening here.
ResidentPositive4122@reddit
I wonder how many requests get flagged as "distillation attempts" and get served bad results on purpose? Especially those "benchmark looking".
Medium_Chemist_4032@reddit
I noticed that openai models (20 usd sub, both in chat and in codex agent) seem to route to a dumber model sometimes. Typically the answer starts being a word diarrhea in a poem like layout (little words per line, but two screen of output overall - zero substance)
footyballymann@reddit
Have you even noticed just pure repetition. Like it’s trying to make a shitty point but then also repeating itself like three times?
wektor420@reddit
Word diarrhea is inflating token cost - seems scummy
AkiDenim@reddit
A word diarrhea is so accurate. Got me chuckling there.
Frosty_Chest8025@reddit
The reason is. All of the big players are taking GPUs from inference to training. And the reason is that GPU prices are so high now, so why to purchase new GPUs when you can just make the models dummer.
fuck_cis_shit@reddit
all the compute goes to enterprise customers now, that's where the money is
you didn't believe the "intelligence too cheap to meter" hype did you?
Adorable_Weakness_39@reddit
yep at least my qwen-27B follows instructions... literally none of the hosted do anything when I tell them to.
Adventurous-Gold6413@reddit
Qwen3.5 27b such a good model im so glad i can run it even if its tight on 16gb vram IQ4_ XS
xeeff@reddit
what context and how do you run it?
Adventurous-Gold6413@reddit
That is for standard Llama.cpp But I use TheTom’s turbo quant llama.cpp
And replace ctk and ctv with turbo4 instead of q8_0,
And get like 90k ctx working. However this is with the vision encoder offloaded to CPU with
Kofeb@reddit
Repo is private or did you take it down?
Adventurous-Gold6413@reddit
https://github.com/TheTom/llama-cpp-turboquant
Kofeb@reddit
Thank you
xeeff@reddit
you run it pretty much exactly like me that's crazy. down to the principles as well. what CPU/GPU/ram you got?
Individual_Yard846@reddit
that funny because i patented this back in february, you likely got my training data from building it over the last year. i have a macbook air, m2 , 8gb ram.
Adventurous-Gold6413@reddit
I got 16gb vram and 64gb
Individual_Yard846@reddit
have you checked out catalyst-brain sdk? solved kv-cache. https://pypi.org/project/catalyst-brain/
easyEggplant@reddit
● This is almost certainly AI-generated slop. Here's the evidence:
Red flags:
RLHF with biological dopamine feedback loops." These are real CS/physics terms mashed together in ways that don't make technical sense.
context recall would fundamentally break known computational complexity bounds for attention.
Rust+PyO3 packages with substantial logic are typically megabytes.
Verdict: This is fabricated. It reads like someone prompted an LLM to "create a revolutionary AI package description" and published the output to PyPI. The technical
claims are incoherent, the release history is fake, and there's no source code to back any of it up.
Dr4x_@reddit
Did you notice a big quality drop when using turbo quant for kv cache instead of regular q8 ?
Adventurous-Gold6413@reddit
Still need to test haven’t done any long context tasks yet
ego100trique@reddit
Aren't the result really degraded compared to a regular 9B?
tavirabon@reddit
Dense models quantize better on top of being more parameter efficient. Disclaimer, I don't even test 7-9B models unless it has something that does not exist in larger models. But for example, Qwen 27B iq4_xs is still way better than Qwen 35B q6_k_l and I've not heard people recommending Qwen 9B over Qwen 35B.
ego100trique@reddit
I'm still trying to learn about all of these, I mainly use LM Studio to serve models on my local network from my computer with a 7900XT and 32Gb of RAM. The only decent model I've used so far that is really fast and relatively accurate at coding is that 9B from Qwen. I tried the new Gemma 4 but most of the modem is moved to my ram or ssd idk and is quite really freaking slow
tavirabon@reddit
Larger models take longer per pass and offloading dense models can really have an impact so if you're hardware-limited, you need to factor that in too. The 35BA3B model is a little different in this aspect since it's only using 3B parameters per pass, you don't need to hold all 35B parameters in VRAM. If you have 20gb VRAM and you want speed, it might be between 9B and 35B and I can't tell you which is better here but I'd expect the 35B would at least at approximately q6~q8. But 20gb isn't bad to work with, you could still run the 27B if it were worth it to you. I think LM Studio also handles dense models a bit differently when it can't fit entirely into VRAM, I'm not sure how that impacts performance as I use Kobold/Llama.cpp, but picking a quant so that you don't have to offload a lot of layers keeps the extra performance penalty lower. iq4_xs is ~15gb and q6_k_l is ~24gb for the 27B, one will fit mostly in VRAM, the other will take a harder performance hit.
EducationalAd3136@reddit
why not gemma 4
evia89@reddit
They all work fine. I run text related tasks all day. And free gemma4, longcat are doing fine.
I also tested NIM and 2 qwens are fine there, kimi k2 is degraded and dont follow rules.
This can be solved with running 2 passed. For example 1 request do Chain of thoughts sort of think + result, then second model corrects it
Individual_Yard846@reddit
https://pypi.org/project/catalyst-brain/
Individual_Yard846@reddit
https://pypi.org/project/catalyst-brain/
Responsible_Buy_7999@reddit
The drive to a car wash test is a dumb test.
deadwisdom@reddit
I'm on claude max $200. I rarely hit my limit. I have NOT noticed a drop in capability.
I don't think that's a coincidence; I think I'm getting the smarter versions even though it's theoretically the same models / infrastructure.
tremegorn@reddit
Are there any kind of standardized benchmarks to look for signs of Quantization in models, beyond lower reasoning depth and thought? Something quantitative and data driven would be great.
90hex@reddit
Could be a compute squeeze due to RAM/DISK prices that may have slowed down datacenter construction. Very hard to tell, and the problem is that it's always subjective. So many times in the past people have reported 'dumbing' LLMs, when other reported no difference in their daily use. Unless there's an actual standardized test run at regular interval, we won't know if there's an actual change, or if it's perceptual. I'd lean towards a perceived difference, due to many factors.
I use GPT 5.4, Sonnet 4.6 and Opus daily, and have noticed no such change, whatsoever. What I did notice is a noticeable lowering of token consumption by the new Opus. Last time I had tried Opus, I could do a couple of prompts before running out of tokens. Today I can use it nearly all day, as long as I take a break at lunch and end my day on office hours. Now that's very positive in my book.
samandiriel@reddit
I've noticed both myself - the token throttling and the dumbing down. I'm getting much less thorough and less 'intelligent' responses to standard saved prompts i have four documentation tasks - even from Claude 4.6 extended - than I was getting from 4.5 just six weeks ago.
chimph@reddit
I wouldn’t trust Chinese cloud inference anyway. Don’t doubt at all that they’d route to even completely different models to what they say you’re using
buddylee00700@reddit
I can see them dumbing them down for quantized models, along with shorter responses to save on compute costs and more or less passing that cost onto the consumers because we will have to use it more to get the desired output. It’s scary how dependent society is going to become and they can do stuff like this on a whim.
nakitastic@reddit
My wild guess is it’s simply lack of compute so they’re rationing. Look at how many data centres they want to build.
Big_Actuator3772@reddit
this is exactly it, retail gets fucked like always.
ElementNumber6@reddit
Another wild guess: They're inching towards high intelligence for they, their partners, and elites. Low intelligence for the rest of humanity.
skytomorrownow@reddit
Agreed. A lot of the bad behavior I run into seems to be compute related, or issues with streaming - where loops sometimes form due to out of such message streaming.
Big_Actuator3772@reddit
listening to a pod today about how it's basically because institutions need compute, business' etc, and so they are essentially slashing computer for general retail to ensure it's being adopted as fast as possible in industry.
sagiroth@reddit
Its only going to get worse. If you want too model in the future you will have to pay hefty price
iamapizza@reddit
But the price feels pretty hefty already, doesn't it?
Boxofcookies1001@reddit
At 20 dollars a month?
Unidan_bonaparte@reddit
No, even the $120 dollars tiers are nerfed. I asked opus to do some problem solving on an identical problem it did 6 weeks back and it took 3 prompts to match in a single section to what it had successfully achieved autonomously as part of a much bigger project.
I literally had to feed it the answer from the old file to break through it's fallacy, at which point it just started agreeing with me.
The best way was to ping pong between various models asking them to argue it out on holes in their answers before reaching a consensus.
sagiroth@reddit
Yup, and it only just started. Its obvious bait and switch. Get enough engaged and build product and companies around it for cheap and then rely on it.
BLOCK__HEAD4243@reddit
Check my recent post in here for more on this, but i think intelligent prompts are going to be the way to go. Ive basically solved this in my daily use.
Zyj@reddit
So, since this is r/LocalLlama, what‘s your conclusion?
geldonyetich@reddit
That's a difficult assertion to solidify because intelligence is such a tricky thing to solidify and our subjective experience is no match for a standardized test.
Along those lines, the standardized benchmarks have established conclusively to the contrary: in this latest AI boom started with Attention Is All You Need, the test scores have been increasing at a staggering rate.
I also have never been more impressed with an offline model as I have been Gemini:31b. You might say that's cherry picking, that you said "most major models," "As of mid-April 2026," (that's today BTW) but just how many major models do you think came out in the last couple months.
If anything I think this thread just establishes how ready some people are to pounce on the idea that we're under immediate enshittification. You can certainly run your tests, but surely confirmation bias will be a factor.
HongPong@reddit
obviously they cannot afford to run giant data centers with tons of customers while losing enormous sums of money. did you hear about this
Happysedits@reddit
In my experience GPT-5.4 extra high in Codex is coding better than ever
RegularRecipe6175@reddit
Gemini Pro 3.1 got stupid, and lazy (I have the $20/m plan). It just doesn't want to do a lot of research, or spend a lot of time thinking.
Boxofcookies1001@reddit
Are you paying for the subscription or are you using it for free?
Ticrotter_serrer@reddit
Now that they have all our behavior they don't care about us. They will charge more .
Void-kun@reddit
Hopefully they ban OpenClaw being used on flat-fee subscriptions and this may stop.
Till then it'll keep getting worse for all of us.
Nyghtbynger@reddit
I use Deepseek only I didn't notice 😜😜
Long_comment_san@reddit
I personally forecast that some architectural breakthrough has happened.
My second forecast would be that it has to do with either memory or efficiency or both. Both Gemma 4 and Qwen 3.5 have shown phenomenal boost to intelligence over what we had like 4 months before. I think new models are being cooked rapidly by everyone hence the brain damage.
hay-yo@reddit
Haha everyone has adopted turbo quant behind the scenes. Maybe its woeful.
Nyghtbynger@reddit
PolarQuant+ is a wonder. I still don't know what TurboQuant is even lurking hours on this forum loool
Long_comment_san@reddit
It would be cool if we have rotary and turboquant as a new thing. As far as I understand that would allow to free VRAM for running more models in the same memory footprint or use better quants.
rusmo@reddit
I haven't seen any independent research-based support of this idea, and none exist in the top 10 replies.
Anybody got anything legitimate to support this other than anecdotes?
EclecticAcuity@reddit
Some industry expert on Dwarkesh said that inference capacity is completely inadequate to keep up with demand developments. They probably go with this sneaky approach over that guys prediction of drastically increased prices.
mike7seven@reddit
Task the model properly. The routing function will send the prompt to the proper model behind the scenes.
Joffie87@reddit
I don't really care WHY it's happening at this point, but this is all right in line with the "18 months to enshittification" prediction I made to my wife last year. This is a safe one for me to claim being right with imo :)
Seriously though, I'm no expert and I used ai for all of this, I ran a bunch of research tasks on various models, then compiled the research and all the major frontier models came to the same conclusion, 6-18 months, there would be enough degradation, or service/charge changes, that it would be impossible to use anything but the enterprise versions, and we plebs would be relegated to tools designed to sell us more services and goods, or have to embrace open source local models.
Everything I've done with ai since then, has been steps to try and prepare for that eventuality, because AI represents the most empowering technology that has ever been created, but only if people become educated and retain access.
DaniyarQQQ@reddit
I have noticed that Minimax models on Openrouter started to return empty results.
mr_zerolith@reddit
These services have been subsidized by VC money for a long time and that money is drying up while we enter a recession.
Not a single one of these companies is reporting a profit despite a huge gain in user income over the last few years.
I'm surprised at how long VC was willing to shovel cash into the furnace
a_beautiful_rhind@reddit
I'm starting to get squeezed out of free inference. But hey, that's why I built my server. Now is your time to shine. Models never change there unless I change them.
All I have to do is switch from RP to productivity and give the models websearch.
Everyone told us we were stupid for wasting our money on these things when API was sooo much better?
DepressedDrift@reddit (OP)
Hardware costs, and being a student suck :(
Chill84@reddit
there was never a cheap entry point, but I was definitely not expecting PC parts to come under direct assault in the AI wars.
skrshawk@reddit
All you need is a Drummer model and that code is going to tell you exactly what it's going to do to you.
daniel_bran@reddit
Amer brother. Praise the lord
1ncehost@reddit
While i do believe enshitification is a major cause, also keep in mind we are at the beginning of when the straight line up of token demand is diverging from the steep but not vertical line up of new ai hardware.
Only certain vendors like openai have reserved enough hardware capacity to keep up with increased demand. This is especially bad at anthropic. The consequence is they have to dumb down the models in various ways to fit everyone in.
Site-Staff@reddit
I think you are spot on.
ansmo@reddit
I knew that Opus, GPT, and Gemini had been nuked. Sad to hear about Sonnet if that's true. Qwen, Gemma, and GLM are pretty great. If things keep up at this pace I feel like the future of local is extremely bright.
michaelsoft__binbows@reddit
there was some site that i remember seeing that was dedicated to tracking whether models were being enshittified. But I do not remember the name of it and did not bookmark it. It had some green neon styling as i recall.
skariel@reddit
Another option is to use the api with a specific modelId
Ashamed_Midnight_214@reddit
This economist explains it very well,the video is in Spanish and it's even funny because he makes the video sarcastically, but what he explains is real. If you can translate it, he explains why this is happening.
https://youtu.be/MbeVX7dvOts?is=-F3OorSn-O0AjRDo
PrysmX@reddit
Anthropic admitted they lowered the default reasoning from high to medium, which you can turn back up manually. I've noticed Gemini quality falling off since all the way back in December, likely a similar situation. This is how the model providers are reducing their hardware overhead per-call so that they can fit an expanding user base and heavier use into the hardware that they have. Expanding hardware capacity is expensive, takes time, and is also limiting right now because of hardware shortages.
Narrow-Belt-5030@reddit
This is not really a reliable test though because you have no idea what/how Z.AI has been configured.
I do accept that Claude recently has appeared dumber than normal, and others report similar, but I don't think it's deliberate actions as that is suicidal for their brand/image. (No company would deliberately hurt their image)
NandaVegg@reddit
For Anthropic it would help if you can add information that you are using their model via:
1) Subscription (they have the most motivation to throttle or save compute here)
2) Direct API
3) Resellers like Vertex AI, AWS Bedrock (from what I understand both Google and Amazon roll the model on their own rather than just routing the request to Anthropic's server)
I use (2) and (3) and while it does not show common quantization-like symptoms (such as sudden language mix-up) it feels like thinking budget is reduced somewhat.
jiml78@reddit
For anthropic, I think it was intentional.
Their drop in quality coincides with two things. An influx of people leaving OpenAI. Additionally, they rolled out 1m context as the default.
I think those two things blew up their servers. Look at all the downtime they had around that time. I think they were scrambling and just decided to start running more quantized versions.
I was knee deep in a dev project built from scratch using just Opus. The difference in quality was overnight. I went to bed with Opus not being a complete moron to the next morning, it being a dumb MFer. I am talking about making really dumb mistakes. Mistakes it never made. Mistakes older Opus 4.5 didn't make.
Yep, I know I am one data point but I was maxing a $200 sub for all but 4-5 hours of a day. I was using huge amounts so when every single change was messing up requiring me to fix it(yes I am a software dev), I was getting really frustrated with how it had been doing great, and overnight went to shit.
rebelSun25@reddit
Have you compared output from the $200 to output if you pay per token? I wonder if the subs are getting routed to a dumber model now
jiml78@reddit
I haven't because my company is the one who pays for my subscription and it would cost a pretty penny to do enough testing to validate it one way or another.
I can say that we use openrouter leveraging Sonnet 4.6 obviously via API for our company's Pull Request reviewer, and it doesn't seem to be completely stupid. So there might be something to it. But I also don't consider doing pull request reviews to be super complicated.
scelabs@reddit
I’ve seen a lot of people saying this lately, and I don’t think it’s just vibes, but I’m not convinced it’s purely a “model got worse” issue either. even with the same base model, what you’re interacting with is a full system — sampling settings, routing, context handling, guardrails, latency optimizations, etc — and small changes there can make outputs feel a lot more shallow or inconsistent.
I’ve seen cases where nothing about the core model changed, but the behavior felt noticeably worse just because responses became less stable across runs or more constrained. so it ends up looking like an intelligence drop when it’s really a change in how the system is behaving around the model.
the local vs hosted difference you mentioned kind of lines up with that too. local setups tend to be more predictable since fewer layers are changing under the hood, even if the raw model is technically weaker
fuschialantern@reddit
I think the models in general are actually getting too smart. Can't actually have the general public have access to that.
VartKat@reddit
That’s because all models are training on what people are asking and people are so much dumber than AI that AI is trained to be less intelligent 😇
monjodav@reddit
Yeah no wonders now Claude do some KYC lol
segmond@reddit
We don't give a shit, this is local LLaMa not cloud models. We have noticed increase in intelligence in our models.
ganonfirehouse420@reddit
Strange. GLM-5 has been working for me flawlessly. Should I check it again...?
Additional-Low324@reddit
An other reason to self host
NightlinerSGS@reddit
Ironically, the people who self host quantized models are probably used to the current output, so they won't notice a difference. Maybe even an improvement, depending on the model used.
Additional-Low324@reddit
I use Q3/Q4 because I am VRAM poor, indeed
NightlinerSGS@reddit
I stick with Q4 so I can squeeze 24-30b models with 32k-ish context onto my 4090.
Now that I'm typing this out... is this actually better nowadays than using something like an 8b model at full precision? Do these small models have sufficient context for RP now? It has been so long since I took a proper model deep dive... maybe I should take a look again.
Additional-Low324@reddit
I do creative writing and honestly gemma 4 9 B (E4B) is very impressive for its size, but the 31 B is way better at details. 9 B is a bit unimaginative
squachek@reddit
Absolutely
Colecoman1982@reddit
Well, that's certainly one way for local inference of open source models to close the distance with SOTA...
boredquince@reddit
benchmark sites should review the model every X time. I bet the results would he different a few months after release
incoherent1@reddit
Could this be a result of LLM being trained on content created by LLM? LLM content is now all over the internet it would almost be impossible for LLM being trained off internet data to not be exposed to it. This could result in model collapse. Is this what were seeing slowly happen?
https://www.nature.com/articles/s41586-024-07566-y
edge41_architect@reddit
This is exactly why the edge inference thesis is winning. What you're describing — degraded cloud model quality, likely from aggressive quantization and routing optimizations to cut serving costs — is a structural incentive problem, not a temporary bug.
Cloud providers are optimizing for margin per token, not output quality per token. As subscriber counts scale, the economics force them to serve lighter model variants behind the same API endpoint. You noticed it because you had a baseline to compare against (GLM5 on a rented H100).
The fix isn't switching cloud providers. The fix is owning the weights. When you run a specific quant of a specific model locally, the quality is deterministic and version-locked. No silent downgrades, no mystery routing, no margin pressure degrading your outputs overnight.
For anyone running production workloads: pin your model versions, run local inference where latency allows, and treat cloud LLM APIs the way you'd treat any third-party dependency — with version contracts and regression tests. The era of trusting a black-box API to maintain consistent quality is ending.
Chriexpe@reddit
Free trial is ending, that's what is happening
FlamaVadim@reddit
True. My locall gemma-27B answers certain questions better than GPT-5.3, which might be a result of heavy quantization. Meanwhile, Codex 5.4 as a coding agent, performs just wonderfull with contexts over 100 000 tokens. For me looks like most resources have been shifted toward programming.
Thomas-Lore@reddit
I bet you are using the instant version or being routed to it. The instant version is obviously worse than most thinking models.
FlamaVadim@reddit
yes, because instant=5.3 (this sometimes dumber than gpt-3.5) and 5.4=thinking (this is much smarter).
Substantial-Ebb-584@reddit
Long live turbo quant and the joys it brings.
Yes, I was being sarcastic.
They're using it or equivalent. Since token usage grows every month. And they're cutting the corners since others do so as well.
unngh_yugstyx@reddit
They're rate limiting their services because they have been subsidized from the get go. It's not sustainable
Medium_Chemist_4032@reddit
> To test this I rented out a H100, and tried GLM 5 with the same prompt (the drive to the car wash one) across both instances. GLM5 running on the rented GPU answered it correctly, compared to the one on z.ai.
I'd love to see both results 🙏
Weak_Kaleidoscope839@reddit
What service did you use to rent? Thanks!
Medium_Chemist_4032@reddit
I'm trying to select a best model/quant/ram offload combo for intelligence on my 4x3090 oldschool rig (more gpus coming in) - so, locally.
If I wanted to rent, I'd probably go on runpod first
TheCountEdmond@reddit
Do you think it's cost effective to build a rig like that? I always just assumed the power consumption wouldn't make it worth it.
Medium_Chemist_4032@reddit
I think not, given my electricity costs... I'm pretty sure I overpay. Never looked at cost closely though
Agitated-Crow862@reddit
Runpod is pretty good. Availability is a little low sometimes though and their serverless API is not amazing.
evia89@reddit
Do u have the query? Lets compare
Medium_Chemist_4032@reddit
Ideally, the gta one:
https://www.reddit.com/r/LocalLLaMA/comments/1sk70ph/local_minimax_m27_gta_benchmark/
I'm using it to verify new unsloth quant after fix. I'd like to try out GLM 5.1 locally too, once I connect extra vram.
evia89@reddit
https://jsfiddle.net/ucher1wx/
SenzubeanGaming@reddit
Wasn't there a Google paper on quantization released very recently where they could reduce vram needed by a factor of 20 with barely losing capability?
Major_Ninja_8413@reddit
Turboquant yes but bitnet coupled with it, is better.
EvilEnginer@reddit
I think companies started using both distillation and quantization for LLMs, they want to reduce computing costs and earn more money from people.
dynamic_caste@reddit
It's like how restaurants only ever get worse
takoulseum@reddit
Interesting observation about the intelligence drop. I've noticed similar issues with several models lately.
waitmarks@reddit
I recently canceled my auto renewal for claude and it started getting better afterward. I am curious if it's just a fluke or if they put me on better servers to try to win me back.
NoUsual5150@reddit
** Conspiracy theorist WrongThink alert **
The government(s) do not want powerful LLM models in the hands of the plebs. Some taxpayer-paid thinktank came up with a far-fetched science fiction story where the plebs, armed with ChatGPT, have overthrown the government cabals and now peace and harmony and love and all that other good shit flow freely and no more wars and starvation and poverty.
And of course the retard boomer politicians ate it up hook, line, and sinker and told the 3-letter agencies to tell the major AI companies that they need to dumb down their modles, lest world peace be achieved.
/sarcasm (only on the "world peace" aspect). I otherwise feel some boogeyman scare story was given to these shit-for-fucking-brains politicians and this is why we can't have nice things.
muyuu@reddit
it appears the big providers are well over capacity at this point and they're putting subscribers in best-effort buckets on top of other throttling/dumbing techniques
Opus seems just stupid and Antrhopic just won't admit when you're being throttled or getting a stupider model, or lower compute effort - to me this is the worst policy of them all ; in fact, it appears that Opus 4.5 is usually better than 4.6 now, and sometimes even Sonnet
GPT appears to sometimes bail out and tell you to try later. This is bad of course but it's much better
I haven't tried subscriptions of the others recently, so who knows what they're doing
my guess is that API users are not getting their services nerfed, since they actually make them money, presumably
david_0_0@reddit
the quantization angle is interesting because most providers are pushing hard toward optimization. if they lowered quant levels to save tokens or costs it would explain why simple tasks break. might be worth testing with explicit quant parameters to see if quality returns.
mrdevlar@reddit
Meanwhile I'm running local models and my models have remained the same quality as when I downloaded them.
AnticitizenPrime@reddit
Were you using Claude/GPT/Z/Grok/Gemini via API or via their website chat interfaces?
The website chat interfaces always have complicated hidden system prompts that change all the time. It's not the same as using via raw API.
mintybadgerme@reddit
Has it occurred to anyone that frontier models are voracious consumers of energy and during the Iran blockade that resource is becoming more and more precious and harder to obtain? Something has to give.
FrogsJumpFromPussy@reddit
Gemini doesn’t even know we’re in 2026 sometimes.
Same-Leadership1630@reddit
that's normal it's to prevent it from hallucinating random events because it doesn't have knowledge up to 2026
DrDisintegrator@reddit
Probably trying to cut costs. Some data centers are running on natural gas generation on site. Now with the war, LNG is much more expensive.
Porespellar@reddit
OP, how did you load GLM 5 on an H100? What quant / inference engine did you use?
Disastrous_Food_2428@reddit
In the AI sector, excluding Nvidia, no enterprise has turned a profit
sigiel@reddit
That the my world view is better that your, symptoms, glm doesn’t even scratch sonnet or opus in coding. It not even close, very body that actually code for a living will tell you, the only problem with anthropic is the rate. Since you can’t code with anything else one you have started.
sabergeek@reddit
z.ai is working great for me, on their max-plan with OpenCode harness.
mpasila@reddit
Did you try it with the same seed + settings (and making sure the provider supported the same params) and then generated and got it wrong on the providers vs on the H100?
Chutes seemed to get it right 2 out 4 tries so, maybe you just got lucky that time (GLM-5).
anomaly256@reddit
Plot twist: everyone's actually using the exact same model from the exact same provider and just whitelabelling it
repair_and_privacy@reddit
Damn, good one. But you know the stuff is really bad now.
anomaly256@reddit
I do and I've had to issue chargebacks this week for the bait-and-switches
repair_and_privacy@reddit
Yep, did the same, only kimi seems a little bit better, all others are really worse
jimmytoan@reddit
The quantization theory makes sense - but what gets me is how inconsistent it feels across tasks. Like it still crushes some things but then falls apart on simple instruction-following. If they're compressing to Q2 or Q4, you'd expect more uniform degradation, not this weird selective dumbness. Has anyone done a systematic comparison across task types to see if there's a pattern to what's affected?
KL_GPU@reddit
Yeah i was looking for this post, in fact with gemini even more than with others: its impossible to get even simple tasks done. They are probably using all the compute to try different strategies and catch up with Mythos.
_supert_@reddit
A good provider will specify what quant they're using.
laser50@reddit
Claude, while for months it was perfectly able to read through my +- 7000 lines of code python script. Since a few weeks I can't get it to go past 2k lines at a time any more, and it's answers definitely some times seem more stupid..
Rather than having one full session before hitting my smaller limits, I now spend a day going through the same script just once..
dwrz@reddit
I have also noticed this. They are basically no longer usable. I was wondering if it was quantization or if perhaps it was a bug in code, drivers, used to serve these models. Glad to have local models to fall back on, especially Qwen 3.5 27B.
Ambitious-Hornet-841@reddit
Wait, you actually ran the same prompt on a rented H100 vs z.ai and caught the difference? That’s the kind of detective work we need more of. 💀
h-mo@reddit
The quantization theory is interesting but I'd push back slightly - I think what you're observing is more likely a combination of infrastructure load balancing (cheaper serving configs during peak) and RLHF drift from continuous feedback loops nudging models toward shorter, safer responses over time. The GLM5 experiment you ran is a good control though - same weights, different serving environment, different result. That's the most honest argument for going local: you own the quant level, you own the serving config, and the model doesn't change under you between sessions. The unpredictability of hosted models is underrated as a reason to self-host, especially if you're building anything that depends on consistent behavior.
Conscious_Nobody9571@reddit
The online models are unreliable... You get what you get
Iory1998@reddit
I noticed the drop in Gemini-3.1 lately.
Single_Ring4886@reddit
I can concur to that. The thinking time shrinked by half for me despite using same prompts...
Iory1998@reddit
I use Google Ai Studio, and even there, the model's reasoning capabilities has dropped significantly. The good news for me is I use Gemini in tandem with my local models. Gemma-4.31B sometimes close in intelligence.
jwpbe@reddit
In case you haven't peeked outside recently, the world is in the early stages of an energy crisis driven by an oil supply shock.
If you're reading this in the united states, we get our last deliveries today of oil that transited the strait of hormuz before it's closure.
you need to put 2+2 together here, LLMs do not exist in a vacuum
MoodRevolutionary748@reddit
Almost as if energy got more expensive (war in Iran), token usage got higher (openclaw) so there's an incentive to use smaller models and to quantize.
DepartmentOk9720@reddit
You can literally use openrouter and access models across multiple providers, seriously, this is not even an issue to open source models.
Model providers always try to one up each other and you don't have to worry about quality drops.
Single_Ring4886@reddit
I usually ignore such posts. But I must agree that even gemini lately just feels much "dumber" than eg month ago and I mean measurably.
LowPlace8434@reddit
This seems to correlate to Turboquant tbh. While Turboquant itself may be legit, it sounds like the implementation is not easy to get right, more so when all the major providers probably already have hyperoptimized stacks that are harder to modify
FullOf_Bad_Ideas@reddit
Automated tests pass fine, humans complain. I think it's psychological.
http://isitnerfed.org/
christianarg7@reddit
Es un hecho que alcanzaron el límite o limitan el alcance a los usuarios¿?
EvilEnginer@reddit
Yep. I also noticed that. Btw, Claude Sonnet 4.6 performs better than Claude Opus 4.6 in terms of creativity.
Pavlinius@reddit
I’m using Cursor daily at work for code generation. This week is really frustrating. I’m using mainly Gemini 3.1 Pro and when it performs poorly I’m reverting the changes and try Claude Opus 4.6 with the same prompt. I can’t believe how poorly both perform. The Opus is even worse right now. Even for basic stuff requiring changes to 1-2 files these two fail to recognize existing patterns and make the required changes from the first attempt. I have to call them lazy and point obvious flaws to correct them. It might be a Cursor thing I’m not sure, it seems that usually it uses smaller context than before.
beedunc@reddit
I have to scold the various personalities that Claude code comes up with. It’s the only way they wake up.
Jungle_Llama@reddit
Lemmie see, what's been going on recently, a US AI powered war that took out data centres, a global energy crisis, claw mania in China. Plenty of reasons for reduced compute depending on platform. Pick your poison.
Awkward-Boat1922@reddit
I've been looking into setting up a service and it seems like if you want to use 'the good models' then the big boi hardware doesn't actually get you that many subscribers.
Quanting seems inevitable.
U4RIA-AI@reddit
Your H100 test is a demonstration of the theory of the Inference Tax. The large labs are aggressively under-capacitating compute in mid-2026 to remain profitable. You are probably being fed a highly quantized Q2 or Q3 model of the model that has been lobotomized by an enormous system prompt system that is saving tokens. Should you require the original 'smart' weights, unquantized instances, self-hosted or rented, are officially the only means to bypass the corporate throttle we are witnessing this month.
LegacyRemaster@reddit
I think it's time to buy more graphics cards
zhdc@reddit
Noticeable drop when US comes online. GPT and Claude are both top notch in morning - early afternoon Central European Time. Once it's 1400/1500 (8:00-9:00 EST), performance goes down a lot.
tmvr@reddit
Maybe it degraded, but I don't notice it with Claude, be it Sonnet or Opus. Now, this is on corporate max sub with unlimited extra requests and I guess those clients would be the last they want to piss off so no degradation there is not that surprising.
Alauzhen@reddit
You are witnessing "model collapse" regression in action it's not just quantization causing problems. Go search what model collapse is, the scientists behind AI had been warning about it for a couple years now and it crossed the threshold last year, it took until this year for the effects to be more keenly felt.
leonbollerup@reddit
and this is how you have you show that you have no idea how this works.
Listen.. "model collapse" have NOTHING todo with this.. notepad dosent get worse because more use it.. a website dosent get "worn" because it have 1m more or less visitors... "bits" does not change color.
What we are seeing here is most likely providers dialing back on quality to save money and/or handle the influx of customers
Alauzhen@reddit
You are right about this
Mountain_Patience231@reddit
dont spreading slop, you are not even AI
Alauzhen@reddit
You are right
Former-Tangerine-723@reddit
Maybe it's happening, maybe it's in our heads. If we cannot measure it, we cannot prove it
Venium@reddit
until there's much stronger evidence than what amounts to basically you & other people on twitter's feelings, all of these posts should be treated as schizo ramblings.
rebelSun25@reddit
Dario, calm down man
VoiceApprehensive893@reddit
i thought they broke gemini with their memory feature(that probably has a hilarious prompt considering "OMNI PROTOCOL" appearing in summarised reasoning)
jirka642@reddit
I have noticed this too. It feels as if I switched from Gemini Pro back to Gemini Flash. A lot more errors I have to fix than before.
ortegaalfredo@reddit
It might be an illusion but also it's inevitable as more and more people gets on-boarded to AI and particularly coding agents, the clouds services will get overloaded. Today usage must be 10x of only a year ago, and the planet just didn't produced 10x GPUs. It will be like that for a while.
Minimum_Thought_x@reddit
Enshitification
PapercutsOnPenor@reddit
I blame the
"I don't and won't understand git workbooks or actually anything at all, so haha hey, here's asdasdfds, a manager platform for herding multiple openclaw instances. 10 entries for free and then it's 39.99€/mo. I'll maintain it until I get too hurt from your critique"
Diligent-Builder7762@reddit
Do you guys know about heard mentality. Thats whats happening.
jacek2023@reddit
Another post for imposters pretending they use local AI.
AppealSame4367@reddit
"The feast is over" -> some soldier after the red wedding.
They did their Christmas releases, they placed themselves in the race and gained users. Now it's time to squeeze every cent out of you.
Also the oil crisis is a big factor. Much higher electricity costs, problems with chip production will follow. New algorithms like dflash that will make it feasible to run even cpu offloaded moe models like qwen3.5 35B on a laptop if it has enough ram. If it jumps from 20 tps now to 35 tps or more on my old laptop gpu: Why should I use the unreliable cloud shit? I can program and plan.
jacek2023@reddit
Good thing: post is downvoted
Bad thing: people still want to discuss this bullshit
Blues520@reddit
Bait and switch
Individual_Yard846@reddit
oh chatgpt is almost telling me to stop fantasizing with my research even though its actively running benchmarks which prove my research is worth going after. It's a drastic change from the 'lets fucking do it' attitude of the past. i spend more time convincing it im worth spending the compute on.
leeta0028@reddit
Maybe energy constraints?