What are you using claw for? Everytime I look at the use cases they all just seem silly.

[-]

Ariquitaun@reddit

A bunch of itches that need scratching. The main motivator was to keep up with my eldest' school calendar - we receive a lot of newsletters, sometimes attached to emails, sometimes to download. Emails with info, there's two parent websites as well with stuff on them. Basically I've instructed claw to sift through that stuff, update a markdown file that the chatbot can query if we have questions and generate google calendar events for my wife and I. The claw has its own google account to do stuff and I have on my account a few email forwarding rules for school stuff.

I have other tasks for generating a daily report with links of topics I'm currently interested in that I can read while taking a shit at some point during the day. My wife also has a few itches to scratch that I need to get around implementing.

voronaam@reddit

simple rote admin tasks work easy on these smaller models

Just curious, did it encounter any complicated tasks?

For example, I was trying to organize a small thing recently with a very unreliable party and the communication so far has been like this:

Me: Hello. Can we do a thing?

Then: (3 days later) I am in South Africa now. WhatsApp me (no phone number provided)

Me: When are you back? I'll reach out then

Them: next week

Me: (next week) Welcome back. Can we do the thing?

Them: I am still in South Africa

Me: (next week) Are you back?

Them: Yes. Let's meet in person to discuss (no address)

Me: Sure, how about on next Friday?

Them: (on Saturday) Missed the message. Just call me (still no phone number)

Me: (Finding phone number on one of their websites, Calling) How about the thing

Them: Yes, we can do it. Just fill the form on the website.

Me: Filling the form with the request.

Them: (next day) "Hi Peter, we can do the thing on those days" (I am not Peter)

Can a small model handle this?

[-]

TheTrueSurge@reddit

Lol I hope you REALLY need to work with that person, that sounds awful

[-]

voronaam@reddit

They are just not motivated. It is a small sailing trip and I am talking to the owner of the boat and a skipper. If nothing happens, they get to chill on their boat. If it does, I'll get to chill there as well and they will get a bit of extra cash. That is not that much of a difference to them, as they are sufficiently wealthy and retired.

My work related communications are way better than that ;)

[-]

QuinQuix@reddit

How many misses do you encounter?

I've read hallucination rates hover between 3-10%.

Doesn't sound like much but calendar planning is quite a critical task.

A big medical office doing 100 appointments a day couldn't handle 1-2 wrong/double/missed appointments a day.

People always counter people make mistakes too but they miss that usually such systems (eg a front desk managing appointments) have layers of redundancy and self correcting ability.

The entire point of AI is to offload work, so I'm curious to what extent you feel this is actually possible using your Workflow.

[-]

IShitMyselfNow@reddit

This matches my experience

I've been using Hermes Agent with Qwen 3.5 4B to great success. And for codig, anything more complicated than a simple script I've been delegating to a better model via Opencode from the agent.

I think the advent of agent skills has really improved the performance of smaller models in things like this. Small models have actually been semi-useful at agentic work ever since around Qwen 2.5. But only if you gave them a lot more instructions/detail, more API examples, few-shot prompting, etc..And you could do this, but then managing the data you give them for the task at hand, managing context, etc. was tricky at best. Agent skills kinda solves that problem.

[-]

michaelsoft__binbows@reddit

I've been meaning to keep up, but simply cannot. I got deep into opencode for about 7 weeks and I simply have no bandwidth to explore pi agent and hermes agent like i had hoped. Been just driving codex and claude code since then and got my productivity back. Is hermes any good...?

One of my projects has been about a paradigm of having an agent harness harness, e.g. something that puppeteers codex and claudecode and opencode. What you wrote about hermes seems to intersect with that idea so you got me curious.

[-]

IShitMyselfNow@reddit

I like it. It does what it does well.

I'm not using it to its fullest extent, and I'm definitely not running it like Openclaw, but I've been using it as a "shitty assistant that can automate some things for me that would be better hardcoded as a scdipt/workflow but I don't have the time for that anymore" to great success. And it does do a decent job at interacting with Opencode. It also supports Codex and Claude Code (and Hermes). Worth a try IMO.

[-]

-p-e-w-@reddit

Have you used a current-gen model of that size? It’s easily on par with GPT 3.5 intelligence-wise.

[-]

Salt-Willingness-513@reddit

Is a bit bigger, but so far 26b a4b works very well for me with a claw alternative from scratch

[-]

Ariquitaun@reddit

I'd totally go for that one, but all I have for that is the igpu on a ryzen, 8GB RAM and not a lot of extra GTT memory for it. I can start the model, do some stuff and eventually it ends up crashing.

The E4B model does a pretty good job if you enable its thinking mode.

[-]

WhopperitoJr@reddit

The development of OpenClaw and its consequences have been a disaster for the AI race.

[-]

zoupishness7@reddit

It definitely wasn't just OpenClaw though, it's been out since November.

The real spike started at the end of March. Karpathy's autoresearch had been published for a week. Then Anthropic had a period of 2x Claude usage during off-peak hours, which incentivizes the development of round the clock systems that don't have humans in the loop. With a good harness and autoreseach, it's not that difficult to evolve an automated system that makes more money than a Claude 20x subscription costs. Profitability on API pricing is significantly more difficult, the ratio's getting adjusted, but at the time, it was: $200 subscription=~$5000 on peak, and ~$10000 off peak.

For certain applications, and markets, once you have that kind of system, the math becomes very simple: buy more subscriptions, use them as much as possible, and reinvest the money into buying more subscriptions. I did. I made some money, but spent a week thinking I was a couple months away from being a millionaire. I knew it was unsustainable, but thought they would wait a bit to crack down. I underestimated how many people had the same realization around the same time.

That's what lead to the shortages, and why they banned 3rd party harnesses on subscriptions, and why they've had to make the model stupid to meet the demand. But models continue to get smarter and cheaper, and I think a lot more people know they can make money now. I, for one, am focusing on token efficiency now, to try to push what I built during that time towards API profitability.

[-]

Scew@reddit

Karpathy's autoresearch

yep, that did it. I had gotten a harness based off it from the openSourceAI sub that then got taken down and the account that shared it deleted and the repo moved to private. Was for making agents for gaps in places that could use one. Just last week modified it for just general research instead and hit a weekly limit that I hadn't hit before. Was kind of surprised but figured it was too good to be true when I had initially started using it.

[-]

WhopperitoJr@reddit

These are great points and you are probably right to identify more immediate causes beyond just agent-based inefficiency, though that certainly does not help.

I think we will see a movement (if it hasn’t stated already) towards efficiency and using models that are strictly the size you need for the task at hand. Those with the machines to use them will try local models like Qwen first. Not every task needs Mythos-level thinking.

[-]

bh9578@reddit

Everyone became a power user basically

[-]

Rcrecc@reddit

For a layman like me who has very little familiarity with OpenClaw or its impact on the AI race, can you clarify what you mean?

[-]

WhopperitoJr@reddit

OpenClaw is an agent harness framework where you can set up LLM-controlled agents to do certain tasks autonomously.

The reason why it is disastrous is because these agents are left to run without human monitoring, and they can be pretty inefficient. If they get caught in a thinking loop, they will keep using up API usage and processing power until a human intervenes. If you are using OpenClaw in the first place, you are probably not constantly monitoring the agents.

Basically, it is wasteful and uses up too much of the collective processing power that is available via API. Running it locally is less problematic as you are just using the processing power of your private device. More clawdbots = higher usage fees for you and me.

[-]

party_peacock@reddit

But users are still paying the bill for their usage though?

Or is the problem that the actual cost to host those models exceeds what the companies charge their users?

[-]

WhopperitoJr@reddit

The problem is that processing power is finite and that if demand for power rises because of inefficient bots making API calls, the overall price of usage will rise, because there is no supply surplus to return the price to its original equilibrium value.

But because companies generally don’t want to change prices as consumers are sensitive to price changes, they artificially constrain supply though usage limits.

I had the displeasure of watching Pete Steinberger present what was then "clawdbot" at a conference in December. The guy is a complete and utter clown. Not only did he admit he'd given the thing root access to his Mac, he also left the instance's phone number visible at the top of his WhatsApp chat while he was demoing the group chat his bot was in with all his friends.

When I saw it had gone viral and everyone was using it I could do nothing but hold my head in utter despair

[-]

Rcrecc@reddit

Thank-you!

[-]

rm-rf-rm@reddit

Seriously doubt anyone is actually using OpenClaw beyond a few weeks of messing around.

[-]

Icy_Concentrate9182@reddit

Quantizing context, kv, quantizing the quartz. I literally saw my model get dumber from one moment to the next, asked about it, and told me they run livr dynamic quant. No idea what that is, but I can imagine.

I think they tried to market by making AI accessible to the masses, but as is, nobody will take AI seriously.

My Qwen3.5 9B have me better results than a paid chathpt account. 9 fucking billions of parameters only vs whatever monstrosity chatgpt had gone up to these days.

[-]

pab_guy@reddit

> told me they run livr dynamic quant

That could have been solved with rate/request limits.

[-]

fuckingredditman@reddit

nope, LLM inference is inherently extremely expensive to run, and you can't simply rate limit everyone.

when looking at status pages/availability of the large providers, they are evidently running at the absolute limit of what the infra can do and the larger customers probably have some SLA they have to fulfill. they can squeeze lower tier subscriptions with speculative decoding/quantization but that doesn't help much.

traditional resource sharing methods common in cloud computing / SaaS products all don't work for LLM serving:

KV cache takes a metric fuck ton of RAM for each individual user atm, and i assume most larger companies aren't using the SOTA methods like turboquant in production yet until they are properly implemented in their engines and proven to be stable. but even with those, the cost is still insanely large compared to other SaaS type use cases
adoption is still on a rapid rise

source: operated infra for a larger SaaS company and also worked on some llm inference engines and saw how much infra all of it costs.

[-]

a_beautiful_rhind@reddit

assume most larger companies aren't using "SOTA" (TBD i would say) methods like turboquant in production yet

Stopped reading right here.

[-]

fuckingredditman@reddit

and why is that?

[-]

a_beautiful_rhind@reddit

Because it means you don't know what's up. You're telling me that labs can't figure out quantization + hadamard when single developer projects like ik_llama and exllama have had it?

[-]

fuckingredditman@reddit

i'm not saying they can't figure it out.

but in a software company shipping such a feature isn't a matter of merging a PR that compiles and passes tests and watching it roll out to hundreds of millions of customers.

it needs to be properly validated/tested and then rolled out gradually to not break the service/infra instantly.

and besides that, these new KV cache quantization methods evidently aren't easy to implement nor are they easy to test.

i've watched the llama.cpp impl from thetom for a bit and clearly it's not a matter of implementing it and shipping it.

[-]

a_beautiful_rhind@reddit

Thee methods aren't new. That's the whole point. They're new to you. Which is why I bristle at your analysis. Sure it's not "easy" but that's why you have someone architect these things and test it before putting it into production.

How did these people build their inference stacks to begin with? I doubt it's pip install vllm. There's presumably a bunch of in house tricks they've not put into formal papers which are used to competitive advantage.

[-]

fuckingredditman@reddit

that's not what i'm saying and none of what you are saying changes the fact that rate limiting doesn't solve the problem.

the point you seem to be trying to make is that the big guys have solved quantization but evidently that's not the case given the clearly visible symptoms of bad availability, rate limits and subjectively worse overall inference performance

[-]

a_beautiful_rhind@reddit

Rate limiting does solve one problem. Except as someone pointed out, users making more accounts. If you see a flurry of requests from a user you throttle them. As a result the inference you budgeted for gets split more evenly.

Not limiting means one openclaw or agentic instance sucks up the server time. The only other thing you can do is add more capacity but that is likely cost prohibitive or it would have been done.

The point I'm making is that turboquant is hype and that the big guys already have quantization if they so choose. They're also out of minerals and those visible symptoms show this to be true. Compression already wasn't enough.

All that's left is raising prices, limiting availability and expanding. Unless of course there's some real breakthrough but that's a wish in one hand, shit in the other type of situation.

[-]

Aphid_red@reddit

It think you might be surprised about the lack of quantization or speculative decoding in model usage.

Remember that there are several layers insulating the technicians who make the AI models from the investors that end up paying for the VRAM.

First you've got the separation between the team at Open/X/Anthropic/MistralAI and their financial people.

Then from there to the hyperscaler they're buying compute from (microsoft, oracle, etc.), and within those it might also be different groups that handle the selling vs. the buying.

Then from the scaler to the actual datacenter company.

Then from the datacenter company to the lenders/shareholders.

Each indirection step lessens the responsibility or introduces a place where incentives might not be passed on properly. For example, if you sell a subscription, you decouple usage from cost, and so a client might cost you more than the price of the sub. Similar thing might be going on between the companies themselves, where the buyer could be taking advantage of the seller's terms and using so much compute it costs the seller more than they're being paid.

When there's much fewer of these in-between layers and a company has to manage its token usage directly you see much better optimization.

At the location of the model makers: The easiest thing to do is to do everything in fp16 (and a few at fp32), because most computations are numerically stable at that precision level. Quantization takes work and lowers output quality. Even if you look at openrouter you see mostly fp8 or fp16 being served.

I feel like clawdbot is the cryptokitties of AI. It immediately falls over with the first attempt at serious usage, because of how hardware intensive it really is.

[-]

a_beautiful_rhind@reddit

Wouldn't they serve at BF16 now? I get how fp8 is a no brainer though. I was under the assumption non garden variety providers would have more custom serving solutions than just off the shelf sglang/vllm.

I don't know any software that serves different quants based on load but it sure seems like that's what happens at certain hosters. So they're already writing in house software to make all of that happen. Even if they're mooching off someone else with a lopsided agreement, that can't last long and they'll have to optimize their pipeline eventually.

I can literally solve all of our problems

this is why i am really focusing my startup on providing long context and private AI cheaper, i've built novel infrastructure and solved kv-cache...so we are making progress.

[-]

tavirabon@reddit

Intelligence triage. Though I'm not sure it would be practical, not even tokens coming from the same model in the same request are equally difficult, that's the whole premise behind speculative decoding. I could see them using spec-d dynamically based on server load, though. Those increase hardware requirements though, it only busses people quicker.

[-]

Chill84@reddit

I question the ability of anyone, human or machine to discern between "important work" and schizophrenic ranting.

[-]

alphapussycat@reddit

Tbh, probably not far from it. When I was working on a game engine, at first I tit very few messages until time out. After challanging it and telling it what way I want to go etc, and telling it why we can't do what it suggested etc, it seemed I suddenly had no limit.

MasterScrat@reddit

I used to work for an early LLM provider and we’d sometimes get feedback like "wtf you destroyed the model" or "wow the latest update is amazing, please don’t change a thing" when we had done absolutely 0 change, literally not restarted the serving container

[-]

FullOf_Bad_Ideas@reddit

I agree.

And this website does continuous testing - http://isitnerfed.org/

My bad. Nice that data correlates on these 2 sites

14:00 UTC to 00:00 UTC seems to be optimal time slot

[-]

cromagnone@reddit

this is likely the case I've seen, there are more issues with tooling then actual models, claude code introduces bugs that I've seen people tie to model issues all the time. These things are better to leave auto update off and test updates else where.

[-]

willitexplode@reddit

Wouldn't including models with open weights where you control the hardware as controls be superior to covariance?

[-]

ProfessionalSpend589@reddit

If that makes you feel better - I can’t make my local models to implement a reference IRC server from inside opencode - they just stop working after a bit and the resources have 0 utilisation while opencode’s progress bar is animating.

[-]

shveddy@reddit

I don’t know what you’re talking about. I started using codex in earnest about a month ago, and I can in fact let it run for an hour and it’ll implement the features I want into an iOS app. In terms of independence and consistency and reliability this is easily an order of magnitude better than it was in 2025.

To be clear I’m not trying to one shot a prompt. Thats a dumb thing to do anyway. I progressively build out component capabilities in a debug sandbox environment and then outline precisely how I want those capabilities to exist within the app, and then the hour of totally independent operations is just it chewing through those instructions and implementing with a lot of trial and error.

But the fact that it can keep at it and spit out an actual, functioning Swift app on the other side without getting lost or confused is mind boggling and it’s at least a 10x improvement over where we were a few months ago.

(TLDR If you’re saying the intelligence has dropped recently, obviously you haven’t seen it control x-code entirely via shell commands, keeping your instructions in mind over a long time horizon)

[-]

OmarBessa@reddit

the models are spontaneously writing in chinese also

i believe Q2 is the norm right now

[-]

New-Question-3542@reddit

Not for gemini flash

[-]

GMP10152015@reddit

They are cutting costs. A good LLM, currently, consumes too much energy and CPU/GPU, and the demand is much higher now (too many users). Until they build new datacenters with more efficient hardware, the experience will be low.

But on the other side, they are investing to optimize the software (see TurboQuant, reducing memory), and making better low-weight models.

Every new market in the beginning has low margins and is very inefficient, and the transition to a more efficient model with quality is not easy.

[-]

BrechtCorbeel_@reddit

Yeah a couple months ago at beginning 2026 and end 2025 it could all do very long courses with very intelligent design when it comes to how to build a course.

[-]

fuck_cis_shit@reddit

all the compute goes to enterprise customers now, that's where the money is

you didn't believe the "intelligence too cheap to meter" hype did you?

[-]

Adorable_Weakness_39@reddit

yep at least my qwen-27B follows instructions... literally none of the hosted do anything when I tell them to.

[-]

Adventurous-Gold6413@reddit

Qwen3.5 27b such a good model im so glad i can run it even if its tight on 16gb vram IQ4_ XS

[-]

xeeff@reddit

what context and how do you run it?

[-]

Adventurous-Gold6413@reddit

/path/to/llama-server \
  -m /path/to/models/mradermacher/Qwen3.5-27B-i1-GGUF/Qwen3.5-27B.i1-IQ4_XS.gguf \
  --fit off \
  -ngl -1 \
  -c 29000 \
  -t 8 \
  -b 1024 \
  -ub 512 \
  -fa on \
  -ctk q8_0 \
  -ctv q8_0 \
  --temp 0.7 \
  --top-k 20 \
  --top-p 0.8 \
  --min-p 0.0 \
  --repeat-penalty 1.0 \
  --presence-penalty 1.5 \
  --chat-template-kwargs '{"enable_thinking":false}'

That is for standard Llama.cpp But I use TheTom’s turbo quant llama.cpp

And replace ctk and ctv with turbo4 instead of q8_0,

And get like 90k ctx working. However this is with the vision encoder offloaded to CPU with

- -no-mmproj-offload

[-]

Kofeb@reddit

Repo is private or did you take it down?

[-]

deadwisdom@reddit

I'm on claude max $200. I rarely hit my limit. I have NOT noticed a drop in capability.

I don't think that's a coincidence; I think I'm getting the smarter versions even though it's theoretically the same models / infrastructure.

[-]

tremegorn@reddit

Are there any kind of standardized benchmarks to look for signs of Quantization in models, beyond lower reasoning depth and thought? Something quantitative and data driven would be great.

[-]

90hex@reddit

Could be a compute squeeze due to RAM/DISK prices that may have slowed down datacenter construction. Very hard to tell, and the problem is that it's always subjective. So many times in the past people have reported 'dumbing' LLMs, when other reported no difference in their daily use. Unless there's an actual standardized test run at regular interval, we won't know if there's an actual change, or if it's perceptual. I'd lean towards a perceived difference, due to many factors.

No, even the $120 dollars tiers are nerfed. I asked opus to do some problem solving on an identical problem it did 6 weeks back and it took 3 prompts to match in a single section to what it had successfully achieved autonomously as part of a much bigger project.

I literally had to feed it the answer from the old file to break through it's fallacy, at which point it just started agreeing with me.

Nyghtbynger@reddit

I use Deepseek only I didn't notice 😜😜

[-]

Long_comment_san@reddit

I personally forecast that some architectural breakthrough has happened.

My second forecast would be that it has to do with either memory or efficiency or both. Both Gemma 4 and Qwen 3.5 have shown phenomenal boost to intelligence over what we had like 4 months before. I think new models are being cooked rapidly by everyone hence the brain damage.

[-]

hay-yo@reddit

Haha everyone has adopted turbo quant behind the scenes. Maybe its woeful.

[-]

Nyghtbynger@reddit

PolarQuant+ is a wonder. I still don't know what TurboQuant is even lurking hours on this forum loool

[-]

Long_comment_san@reddit

It would be cool if we have rotary and turboquant as a new thing. As far as I understand that would allow to free VRAM for running more models in the same memory footprint or use better quants.

[-]

rusmo@reddit

I haven't seen any independent research-based support of this idea, and none exist in the top 10 replies.

Anybody got anything legitimate to support this other than anecdotes?

[-]

EclecticAcuity@reddit

Some industry expert on Dwarkesh said that inference capacity is completely inadequate to keep up with demand developments. They probably go with this sneaky approach over that guys prediction of drastically increased prices.

[-]

mike7seven@reddit

Task the model properly. The routing function will send the prompt to the proper model behind the scenes.

[-]

Joffie87@reddit

I don't really care WHY it's happening at this point, but this is all right in line with the "18 months to enshittification" prediction I made to my wife last year. This is a safe one for me to claim being right with imo :)

Seriously though, I'm no expert and I used ai for all of this, I ran a bunch of research tasks on various models, then compiled the research and all the major frontier models came to the same conclusion, 6-18 months, there would be enough degradation, or service/charge changes, that it would be impossible to use anything but the enterprise versions, and we plebs would be relegated to tools designed to sell us more services and goods, or have to embrace open source local models.

Everything I've done with ai since then, has been steps to try and prepare for that eventuality, because AI represents the most empowering technology that has ever been created, but only if people become educated and retain access.

Hardware costs, and being a student suck :(

[-]

Chill84@reddit

there was never a cheap entry point, but I was definitely not expecting PC parts to come under direct assault in the AI wars.

[-]

skrshawk@reddit

All you need is a Drummer model and that code is going to tell you exactly what it's going to do to you.

[-]

daniel_bran@reddit

Amer brother. Praise the lord

[-]

1ncehost@reddit

While i do believe enshitification is a major cause, also keep in mind we are at the beginning of when the straight line up of token demand is diverging from the steep but not vertical line up of new ai hardware.

PrysmX@reddit

Anthropic admitted they lowered the default reasoning from high to medium, which you can turn back up manually. I've noticed Gemini quality falling off since all the way back in December, likely a similar situation. This is how the model providers are reducing their hardware overhead per-call so that they can fit an expanding user base and heavier use into the hardware that they have. Expanding hardware capacity is expensive, takes time, and is also limiting right now because of hardware shortages.

[-]

Narrow-Belt-5030@reddit

To test this I rented out a H100, and tried GLM 5 with the same prompt (the drive to the car wash one) across both instances. GLM5 running on the rented GPU answered it correctly, compared to the one on z.ai.

This is not really a reliable test though because you have no idea what/how Z.AI has been configured.

I do accept that Claude recently has appeared dumber than normal, and others report similar, but I don't think it's deliberate actions as that is suicidal for their brand/image. (No company would deliberately hurt their image)

[-]

NandaVegg@reddit

For Anthropic it would help if you can add information that you are using their model via:

1) Subscription (they have the most motivation to throttle or save compute here)
2) Direct API
3) Resellers like Vertex AI, AWS Bedrock (from what I understand both Google and Amazon roll the model on their own rather than just routing the request to Anthropic's server)

I use (2) and (3) and while it does not show common quantization-like symptoms (such as sudden language mix-up) it feels like thinking budget is reduced somewhat.

[-]

jiml78@reddit

For anthropic, I think it was intentional.

Their drop in quality coincides with two things. An influx of people leaving OpenAI. Additionally, they rolled out 1m context as the default.

I think those two things blew up their servers. Look at all the downtime they had around that time. I think they were scrambling and just decided to start running more quantized versions.

I was knee deep in a dev project built from scratch using just Opus. The difference in quality was overnight. I went to bed with Opus not being a complete moron to the next morning, it being a dumb MFer. I am talking about making really dumb mistakes. Mistakes it never made. Mistakes older Opus 4.5 didn't make.

Yep, I know I am one data point but I was maxing a $200 sub for all but 4-5 hours of a day. I was using huge amounts so when every single change was messing up requiring me to fix it(yes I am a software dev), I was getting really frustrated with how it had been doing great, and overnight went to shit.

[-]

rebelSun25@reddit

Have you compared output from the $200 to output if you pay per token? I wonder if the subs are getting routed to a dumber model now

[-]

jiml78@reddit

I haven't because my company is the one who pays for my subscription and it would cost a pretty penny to do enough testing to validate it one way or another.

I can say that we use openrouter leveraging Sonnet 4.6 obviously via API for our company's Pull Request reviewer, and it doesn't seem to be completely stupid. So there might be something to it. But I also don't consider doing pull request reviews to be super complicated.

[-]

scelabs@reddit

I’ve seen a lot of people saying this lately, and I don’t think it’s just vibes, but I’m not convinced it’s purely a “model got worse” issue either. even with the same base model, what you’re interacting with is a full system — sampling settings, routing, context handling, guardrails, latency optimizations, etc — and small changes there can make outputs feel a lot more shallow or inconsistent.

I’ve seen cases where nothing about the core model changed, but the behavior felt noticeably worse just because responses became less stable across runs or more constrained. so it ends up looking like an intelligence drop when it’s really a change in how the system is behaving around the model.

Additional-Low324@reddit

I do creative writing and honestly gemma 4 9 B (E4B) is very impressive for its size, but the 31 B is way better at details. 9 B is a bit unimaginative

[-]

squachek@reddit

Absolutely

[-]

Colecoman1982@reddit

Well, that's certainly one way for local inference of open source models to close the distance with SOTA...

[-]

boredquince@reddit

benchmark sites should review the model every X time. I bet the results would he different a few months after release

[-]

incoherent1@reddit

Could this be a result of LLM being trained on content created by LLM? LLM content is now all over the internet it would almost be impossible for LLM being trained off internet data to not be exposed to it. This could result in model collapse. Is this what were seeing slowly happen?

https://www.nature.com/articles/s41586-024-07566-y

[-]

edge41_architect@reddit

This is exactly why the edge inference thesis is winning. What you're describing — degraded cloud model quality, likely from aggressive quantization and routing optimizations to cut serving costs — is a structural incentive problem, not a temporary bug.

Cloud providers are optimizing for margin per token, not output quality per token. As subscriber counts scale, the economics force them to serve lighter model variants behind the same API endpoint. You noticed it because you had a baseline to compare against (GLM5 on a rented H100).

The fix isn't switching cloud providers. The fix is owning the weights. When you run a specific quant of a specific model locally, the quality is deterministic and version-locked. No silent downgrades, no mystery routing, no margin pressure degrading your outputs overnight.

For anyone running production workloads: pin your model versions, run local inference where latency allows, and treat cloud LLM APIs the way you'd treat any third-party dependency — with version contracts and regression tests. The era of trusting a black-box API to maintain consistent quality is ending.

[-]

Chriexpe@reddit

Free trial is ending, that's what is happening

[-]

FlamaVadim@reddit

True. My locall gemma-27B answers certain questions better than GPT-5.3, which might be a result of heavy quantization. Meanwhile, Codex 5.4 as a coding agent, performs just wonderfull with contexts over 100 000 tokens. For me looks like most resources have been shifted toward programming.

[-]

Thomas-Lore@reddit

GPT-5.3

I bet you are using the instant version or being routed to it. The instant version is obviously worse than most thinking models.

[-]

FlamaVadim@reddit

yes, because instant=5.3 (this sometimes dumber than gpt-3.5) and 5.4=thinking (this is much smarter).

[-]

Substantial-Ebb-584@reddit

Long live turbo quant and the joys it brings.

Opus seems just stupid and Antrhopic just won't admit when you're being throttled or getting a stupider model, or lower compute effort - to me this is the worst policy of them all ; in fact, it appears that Opus 4.5 is usually better than 4.6 now, and sometimes even Sonnet

GPT appears to sometimes bail out and tell you to try later. This is bad of course but it's much better

I haven't tried subscriptions of the others recently, so who knows what they're doing

my guess is that API users are not getting their services nerfed, since they actually make them money, presumably

[-]

david_0_0@reddit

the quantization angle is interesting because most providers are pushing hard toward optimization. if they lowered quant levels to save tokens or costs it would explain why simple tasks break. might be worth testing with explicit quant parameters to see if quality returns.

[-]

mrdevlar@reddit

Meanwhile I'm running local models and my models have remained the same quality as when I downloaded them.

[-]

AnticitizenPrime@reddit

Were you using Claude/GPT/Z/Grok/Gemini via API or via their website chat interfaces?

you need to put 2+2 together here, LLMs do not exist in a vacuum

[-]

MoodRevolutionary748@reddit

Almost as if energy got more expensive (war in Iran), token usage got higher (openclaw) so there's an incentive to use smaller models and to quantize.

[-]

DepartmentOk9720@reddit

You can literally use openrouter and access models across multiple providers, seriously, this is not even an issue to open source models.

Model providers always try to one up each other and you don't have to worry about quality drops.

[-]

Single_Ring4886@reddit

I usually ignore such posts. But I must agree that even gemini lately just feels much "dumber" than eg month ago and I mean measurably.

[-]

LowPlace8434@reddit

This seems to correlate to Turboquant tbh. While Turboquant itself may be legit, it sounds like the implementation is not easy to get right, more so when all the major providers probably already have hyperoptimized stacks that are harder to modify

[-]

FullOf_Bad_Ideas@reddit

Automated tests pass fine, humans complain. I think it's psychological.

I've been looking into setting up a service and it seems like if you want to use 'the good models' then the big boi hardware doesn't actually get you that many subscribers.

and this is how you have you show that you have no idea how this works.

Listen.. "model collapse" have NOTHING todo with this.. notepad dosent get worse because more use it.. a website dosent get "worn" because it have 1m more or less visitors... "bits" does not change color.

Diligent-Builder7762@reddit

Do you guys know about heard mentality. Thats whats happening.

[-]

jacek2023@reddit

Another post for imposters pretending they use local AI.

[-]

AppealSame4367@reddit

"The feast is over" -> some soldier after the red wedding.

They did their Christmas releases, they placed themselves in the race and gained users. Now it's time to squeeze every cent out of you.

Also the oil crisis is a big factor. Much higher electricity costs, problems with chip production will follow. New algorithms like dflash that will make it feasible to run even cpu offloaded moe models like qwen3.5 35B on a laptop if it has enough ram. If it jumps from 20 tps now to 35 tps or more on my old laptop gpu: Why should I use the unreliable cloud shit? I can program and plan.

[-]

jacek2023@reddit

Good thing: post is downvoted

Bad thing: people still want to discuss this bullshit

[-]

Blues520@reddit

Bait and switch

[-]

Individual_Yard846@reddit

oh chatgpt is almost telling me to stop fantasizing with my research even though its actively running benchmarks which prove my research is worth going after. It's a drastic change from the 'lets fucking do it' attitude of the past. i spend more time convincing it im worth spending the compute on.

[-]

leeta0028@reddit

Maybe energy constraints?