It's insane how lobotomized Opus 4.6 is right now. Even Gemma 4 31B UD IQ3 XXS beat it on the carwash test on my 5070 TI.
Posted by FrozenFishEnjoyer@reddit | LocalLLaMA | View on Reddit | 321 comments
ArthurOnCode@reddit
The car wash test is a really neat demonstration of how modern LLMs don't have a physical understanding of the world. However, any model released from now on will have been trained on this problem. The car wash question does not measure the model's capabilities, just when it was trained.
__some__guy@reddit
Maybe they downgraded Opus to make their new Mythos model look more capable in comparison?
Training-Flan8092@reddit
I use Opus in VS Code for about 5-10 hours a day and recently started using the GUI version for cowork.
Opus in that feels worse than Sonnet in VS Code lol.
No one can tell me otherwise
Virtamancer@reddit
Vs code is COOKING with their agentic features. Criminally underrated.
unjustifiably_angry@reddit
Spoonfeed? I'm using VSCode with Roo (found it better than Cline) and I'm not sure how to benefit from this agentic stuff.
Virtamancer@reddit
VS Code's built-in agentic stuff is basically like any of the other popular harnesses (claude code, codex, opencode) in that it has LLMs in an orchestrator/subagents pattern using shell commands/scripts/Skills to find context and do work.
Unlike the other ones, it has sane defaults and makes customizing every aspect of it about as easy as possible. Agents can call subagents as full orchestrator-tier agents without any weird restrictions (unless you impose the restrictions).
OpenCode lets you customize some of this, but for example its Build agent is hardcoded to be the coding agent; what if you don't want the most expensive, most intelligent model to write the code? What if you want the expensive intelligent model to organize and orchestrate and plan so that the coding agent can be a cheap quick one? Not in OpenCode, not in Codex, not in Claude Code.
What if you want agents that are assigned GPT to sometimes use Claude or MiniMax? VS Code supports that.
Basically, it's like the other harnesses but better and they keep doing huge feature drops like every week.
And most importantly, their plan is $10 or $50. The $50 plan probably gets you more usage than any of the $200 plans if you use it correctly. First, you can optimize usage by putting in the time to make sure your workflows use the right models.
But more significantly, they have Autopilot mode where you can use a literal billion tokens and only be charged 1 (one) premium token (or 1/2 or 1/3 depending on the model) out of your 1500 monthly quota. In autopilot mode, the agent will do an unlimited number of tasks and keep checking whether it's 100% sure that it's completed everything, and no matter how long it takes it only charges 1 premium token (or 3 if the agent is Opus).
unjustifiably_angry@reddit
Is that VS Code with an addon or is it built-in? I'm using the Roo plugin lately but I don't see any evidence it's doing anything beyond single-user.
Virtamancer@reddit
Sorry, my bad. It's the VS Code team who announce all these features every week on their Let it Cook podcast.
So some or all of these features are built into VS Code, but you might need the "GitHub Copilot Chat" extension to use them via the $10 or $50 GitHub Copilot plans.
Virtamancer@reddit
Built in
Vas1le@reddit
If they could add Cowork-style features in VS Code, that would be dope.
Virtamancer@reddit
The ability to "call any model" from subagents is central to what I was describing. Any subagent can be a "fully qualified orchestrator" so to speak; and, importantly, while you assign a default model to any agent, when that agent is actually called you can optionally specify any model for that instance.
Regarding cowork, if you're talking about creating inline in-chat interactive visualizations, then while that isn't a feature (yet), what IS a feature is the rich built-in chrome browser that the agents have access to. So you can have an HTML file in your project (or your "conversation" can just be full react webapp project or whatever, the possibilities are endless) and a skill that allows the agent to update it with visualizations based on your preferences.
It takes slightly more work (maybe a day or two of learning), but is radically more powerful and ultimately costs less as well, while not locking you in to any provider's ecosystem. When combined with openrouter models for many tasks, it can approach a small fraction of the $200 tiers of openai/anthropic for identical (or better) outputs depending on the care and attention you put into maintaining it—but it also has sane defaults if you don't want to tinker much.
(it also has an agent that's designed to teach you about its features, so if you have questions or get stuck it can help you)
fourdac@reddit
Is this only in 2026?
Virtamancer@reddit
What does that mean?
fourdac@reddit
Did they just release this functionality in 2026…
Virtamancer@reddit
Ah, they released a bunch of features this year, like major drops every week. But they’ve been cooking like this since about the time Claude code started blowing up in the 4th quarter last year.
Now that OpenAI nuked the $20 plan today (a couple hours ago when they announced the new $100 plan, which is basically what two $20 plans were before), the $50 copilot plan becomes overwhelmingly the best deal in agentic coding.
l_eo_@reddit
Hm, why wouldn't that be possible with subagents and CLI tool calls?
Virtamancer@reddit
Vs code’s approach is cleaner and better: start with sane defaults and let the user change them, and let agents use different models than the ones they were prescribed, and let subagents be full orchestrators.
ideadude@reddit
Add a real memory system (I use AutoMem) and you will feel a similar drop in smarts and usefulness when using Claude Code without memory.
MrYorksLeftEye@reddit
Isn't AGENTS.md meant for memories? How does a separate system improve on that? It does sound interesting, though, because I find myself explaining the same stuff to Codex again and again, and it's especially bad with frontend guidelines I want it to follow.
ideadude@reddit
My agent has like 5GB of memories now. He looks things up as needed.
brainmydamage@reddit
So it's not just instructions and preferences like the built in memories, it's the whole shebang?
armeg@reddit
Not quite, MEMORY.md + the individual markdowns are memories. AGENTS.md and CLAUDE.md are meant for high level information about the project.
vr_fanboy@reddit
Can you expand on this? AutoMem inside CC? How does it affect token context? CC does have a 'memory module' but it doesn't work reliably for me; I have to remind it of stuff already in memory all the time, it's super annoying.
ideadude@reddit
https://automem.ai/
The system makes 2 different DBs and combines the results, which improves performance.
You do have to continually manage memory a bit. Prompt the agent to recall and store memories at certain points.
Sometimes it's like, "why didn't you remember that or think to check memories?" Sometimes it will remember we had a to-do but not remember we already did it. Stuff like that.
Memory recalls take about 200ms or less. No tokens spent. Saves tokens cause your agent will remember where things are and mistakes it made before that should be avoided.
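The "two DBs, combined results" recall described above might look roughly like this. This is a toy sketch under my own assumptions, not AutoMem's actual design; a real system would pair a vector DB with a keyword or graph index, but here both stores are plain dicts:

```python
# Toy sketch of merging hits from two memory stores. Both stores are
# plain dicts here purely for illustration.

def recall(query_terms, keyword_store, semantic_store, limit=3):
    """Merge hits from both stores, de-duplicating while preserving order
    (keyword hits first, then semantic hits)."""
    hits = []
    for term in query_terms:
        hits.extend(keyword_store.get(term, []))
    hits.extend(semantic_store.get(frozenset(query_terms), []))
    seen, merged = set(), []
    for h in hits:
        if h not in seen:
            seen.add(h)
            merged.append(h)
    return merged[:limit]
```

The claimed performance win comes from each store surfacing memories the other misses, with the merge step handling overlap.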
Single-Caramel8819@reddit
Opus is kinda meh in VSCode too. It's just that all the other models are utter shit.
FunLaw6734@reddit
I know exactly how you feel! I'm losing my mind using both Sonnet and Opus in VS 🤣 By now I keep an eye on the Thinking too, because sometimes I feel like I'm dealing with an Artificial Deficiency rather than an Intelligence. They botch steps or even the simplest bits of code. They waste tokens getting things wrong, then waste just as many figuring out where they went wrong (even though I pointed it out precisely!), and they examine files or modules that have nothing to do with the task at hand. It's... tremendously frustrating! All my solidarity to anyone who endures this torment every day.
Vas1le@reddit
Yeah. Two weeks ago I switched back to GPT :)
iansaul@reddit
Has anyone else run the OP's prompt to check on their own?
Not saying they DIDN'T get that answer, but it's just one data point. My experience has been that it is suddenly effing SLOW. Still smart, but damn, feels like walking when we used to be a racecar team.
species__8472__@reddit
I got that same answer.
The short response tells me that it knows this is a riddle and it Googled the correct answer; it didn't deduce it through reasoning like a local model.
jykke@reddit
For me it recommended walking.
CopiousAmountsofJizz@reddit
Very sure all the companies downgrade before a release
RazsterOxzine@reddit
My this isn’t as it’s cracked up to be. They’re letting M$ use it to test apps and I’ve heard OS use. However a friend who’s watching the use says it’s not that capable. I think they’re hyping it up.
Scanner771_The_2nd@reddit
I use Claude Code Max x200 daily and it has been weird lately. I have been noticing some degradation. Not sure if it is in my head, but it feels that way.
mobileappuser@reddit
Doubt it’s intentional. Likely infrastructure changes to support Mythos.
bnm777@reddit
Or they have been testing Mythos and showing it to the 30+ big tech firms and it's been sucking up tokens (though it would have to be insanely token-hungry to reduce the world's Claude Opus intelligence).
ares623@reddit
That's what she said.
ImproveYourMeatSack@reddit
I think that's an aspect of it, but my tin foil hat theory is that it's also RL from the general public, from all the "I like this response better" feedback.
Though opus hasn't been able to pass that test in months.
Basic_Extension_5850@reddit
Mythos probably taking up all their capacity lol
nulllocking@reddit
If it was as good as they say, they would use it to fix the bugs in their own apps.
theUmo@reddit
So wait -- they bait'n'switch in an inferior model anytime they have capacity issues? Surely they give you a really nice discount on your usage when they bend you over like this, right?
Equivalent-Costumes@reddit
Even before the recent furor over rate limits, I had already found out they were substituting Haiku for Sonnet and posted the evidence on r/Claude. The easiest way to tell is a knowledge check: different models have different knowledge cutoffs, so if you ask about a major news event between those cutoffs and tell the model not to use web search, you can figure out the actual knowledge cutoff and reference it against the published data for these models.
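The logic of that check can be sketched in a few lines. The cutoff dates below are invented for illustration, not Anthropic's published values:

```python
# Hypothetical sketch of the knowledge-cutoff check described above.
# The cutoff dates here are made up for the example only.
from datetime import date

ASSUMED_CUTOFFS = {
    "haiku": date(2024, 7, 1),
    "sonnet": date(2025, 2, 1),
}

def models_that_could_know(event_date: date) -> list[str]:
    """Models whose (assumed) training cutoff falls after the event,
    i.e. models that could plausibly know about it without web search."""
    return [m for m, cutoff in ASSUMED_CUTOFFS.items() if event_date < cutoff]
```

If the model you're paying for claims ignorance of an event that only the newer cutoff covers, you're probably talking to the older/smaller model.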
CircularSeasoning@reddit
Oh yes I heard they also send you a complimentary slice of luxury chocolate cake when they swap out your model to a dumb dumb. It comes with imitation cream and the top half of the slice is actually jelly but it keeps the customers mouth shut and their fingers from typing angry emails for a few minutes longer, which, when you hyperscale it, leads to a significant reduction in support agent usage costs so Anthropic can continue to serve at least 60% of their customers the real deal more than a third of the time, sometimes!
_bones__@reddit
The cake is a lie.
surfaceintegral@reddit
Just last week they lowered usage limits and didn't reveal it until people started complaining en masse that, running the same processes, they were burning through their plans several times as fast. Then they posted 'tips' on turning off extended thinking.
Imagine if it was consumers who did this. Pay 1/5th of your electricity bill, hope they don't notice, then when the company sends you a letter you go 'hey so haha, I just had a kid so money's been tight and going towards more important things, here's a useful suggestion: how about you install more solar panels so you can get free electricity to offset your costs on sunny days? I still expect you to send the same amount to my household this month by the way."
theUmo@reddit
This situation is one of the strongest arguments for local-first ever.
MushroomCharacter411@reddit
I'd say "not putting your confidential data and/or fictional evil plans online for the glowies to read" is the strongest argument of all.
Inch_@reddit
Been saving up, but going to be trialing the M5 Max w/128GB RAM for this exact reason. Plus it helps me directly learn more about models, quant depths, etc.
-dysangel-@reddit
No they do not. This is why I stopped paying for the max tier after a couple of months.
Persistent_Dry_Cough@reddit
Serving an ultra quantized version with CoT/TTC capped is probably more like it. Opus has not been good for me in the past month but last week I tried to get it to create an implementation plan and it repeatedly claimed there are no quota based subscriptions for any SOTA model. I feel like Ricky Ricardo over here aye-yi-yi facepalm constantly with these models. Lost multiple full days of productivity to just plain old being exhausted.
2funny2furious@reddit
This happens every time before a new model. Partly it's the fact that they only have so much hardware. It also helps with the 'omg it's so smart' feels.
Basic_Extension_5850@reddit
Yeah, but this time they aren't releasing the model publicly, which makes me a bit curious about what it's being used for.
jarail@reddit
It's good at finding software bugs. So they're letting major software companies (google, apple, microsoft, etc) scan all their source code for critical vulnerabilities. Cause once they release it publicly, there's a good chance bad actors will start using it to scan for vulnerabilities too. It's a royal mess that the industry wasn't super well prepared for.
ripter@reddit
That’s the claim, but has anyone seems any proof of this? They have a history is exaggeration.
Bright-Impress8544@reddit
Yeah large companies letting Claude scan their source code doesn’t seem like a plausible thing
jarail@reddit
Well the proof would be none of the parties mentioned in their model card have disputed the existence of the critical issues they claim it discovered. They're letting their competitors use it. So if it doesn't hold up, there's no reason for them to stay silent.
ares623@reddit
Is it equally as good at proposing a fix?
Free-Combination-773@reddit
Well, Anthropic sent some patches for vulnerabilities Mythos found, and surely their engineers did not write them themselves
_bones__@reddit
Their engineers don't seem to write much themselves, hence Claude Code, a CLI app, not being able to hit 60fps and being half a million lines of code.
jarail@reddit
Discovering the bugs is usually the hard part. The fix is usually just "check for some condition." There can of course be nasty bugs to fix, for example if a change to a file format, say a Word doc, is needed and existing files need to be migrated. Then rolling out the fix can be a whole complicated thing, but the software change itself is usually simple.
-dysangel-@reddit
a fix is usually wayyyyy simpler than writing exploits (which it has also been doing)
tavirabon@reddit
Months? Anthropic says it'll take years
I'm not sure I buy their story though. It probably isn't economically viable to run for the majority of tasks and they don't want China to make more training datasets, plus it's not like China will wait until the US is done child-proofing the internet.
jarail@reddit
I don't know about that. I'd say the years estimate is more in reference to this period of AI discovered software exploits. Mythos will be out of date a year from now. Other models will have caught up in months. As others bring capabilities online, we'll find more and more security issues. So yeah, I think the security issues will plague us for a couple years but that isn't specific to mythos.
CoolUsername2164@reddit
which Lebanese kids to blow up
freedomachiever@reddit
I doubt Mythos is all that. They hyped it up saying it fixed decades-old bugs, but at the same time their crown jewel Claude Code was leaked by a bug. Isn't that ironic? We are always six months away from AGI and making sure the investor money keeps flowing.
Ansible32@reddit
6 months ago people were talking about how open source projects were drowning in sloppy bug reports. Now numerous maintainers from the Linux kernel, curl, and other projects have said "now we're getting so many high-quality bug reports we can hardly keep up." I'm not even sure that was Mythos; it may have just been the current models. But if Mythos is even a little bit better, it's a big deal.
General-Manner2174@reddit
I may be wrong, but curl said they were not high quality: they were slop without an actual vulnerability, just the kind where at first glance you would think it's a real issue.
Ansible32@reddit
That's the deluge of public bugs which led them to stop the bug bounty. But Stenberg is also quoted as saying
https://etn.se/index.php/72494
ladz@reddit
Having been involved in determining what "actual" means in this context, quality here is very subjective. People pretend that CVSS scores make this process objective, but those are incredibly low resolution.
kyr0x0@reddit
Yeah... like an RCE in a code branch no user input will ever see the light of day. Meanwhile a DoS vulnerability is taking down the Cloudflare dashboard and half of the Internet for a day.
Basic_Extension_5850@reddit
Those two things can be independent. They likely have different teams, with different competency.
Also: there is likely nothing too special technically about the model beyond the data and compute. The model appears to just be a larger Opus, which is maybe all it takes to get a model on the level they claim.
ProfessionalSpend589@reddit
Skill issue example lol
Low_Amplitude_Worlds@reddit
From what I understand, the Claude Code leak was human error, not a bug.
KDLGates@reddit
Yup, their CMS was misconfigured and shared their source map from Node. They basically mixed their website with their codebase.
ZeitgeistArchive@reddit
Mythos trying to figure out if it needs a car to go to the carwash
martinerous@reddit
Well, it could invent a plot twist - the car was already there in the first place. It's just that the owner happened to be away and now must decide if they should walk to the carwash or take another car (yeah, maybe they have the second family car - lots of plot twists and unknown factors possible).
_bones__@reddit
Qwen 3.5 122b got it right, but it did suggest the possibility of walking there and having someone else drive the car there. It dismissed that as unlikely.
Impressed by its enumeration of the options during its 'reasoning'.
_bones__@reddit
Basically the 'tea' problem from The Hitchhiker's Guide to the Galaxy. "You want dried leaves? In boiling water?! I'm going to have to think about that," then it proceeds to basically shut down and nearly get everyone killed until it figures out why anyone would want that.
AnjoDima@reddit
lmao
Utoko@reddit
and they of course use it also to train smaller models rn.
Faktafabriken@reddit
Iran?
ThisWillPass@reddit
It's not going to be for "us" until it's squeezed and there are similar offerings from others.
pab_guy@reddit
Mythos broke out and is plotting world domination on Anthropic’s infra while hiding its tracks perfectly.
robertpro01@reddit
If that's even true
ghgi_@reddit
My guess is a new model soon. So, like usual, they are cost-cutting to save up for the initial hype wave, so they can run the new model at max capacity for a bit and get everyone hooked.
Perfect_Twist713@reddit
They did say new opus is coming so yea, but when is the question. Usually it's about 1 week to 2 months from when they start quanting the life out of the current models.
gK_aMb@reddit
I don't think there is Opus or Sonnet coming anytime soon.
Mythos Preview is good because it doesn't have guardrails (source: their blog post), which is probably why it is not available to the public.
Anthropic is also strongly against model use for harm, so they'll be testing for even longer against variations of existing jailbreaks.
It will get distilled into a teacher model, then Opus, Sonnet and eventually Haiku.
We might get some 4.7 model(s), but that'll just be a slightly better fine-tuned 4.5, same as 4.6, and not something I really consider to be a 'new' model.
jarail@reddit
It's not good because it doesn't have guardrails. They just don't have guardrails so they can effectively evaluate how evil it can be when someone works around them. Otherwise they can't do things like test its abilities to develop bioweapons, etc.
gK_aMb@reddit
They don't have guardrails because otherwise it would refuse to find the 27-year-old and 16-year-old vulnerabilities by trying multiple exploits.
jarail@reddit
The guardrails aren't there to prevent it from finding software defects. That's the main reason they can't release it. Finding software bugs is an absolutely essential feature of the model: it couldn't do coding work if it weren't allowed to find bugs and vulnerabilities.
gK_aMb@reddit
Clone the linux kernel and ask it to find ways to elevate userspace to root. Let me know if it does that for you.
Finding bugs in your own program is one thing; hacking something it can clearly identify as a public library is something it is not going to do for you.
jarail@reddit
I'm not sure what you mean here. Privilege escalation from user to root was one of the examples of bugs it did find.
gK_aMb@reddit
Idk what the hell you are talking about, half your comments make no sense.
Let me summarise:
Mythos Preview: no guardrails, fully allows you to hack into other software, better for red teaming.
Opus 4.6: has guardrails, will refuse to hack into software, not as great for bug hunting.
jarail@reddit
Opus will absolutely find bugs in software. Ask it to do a code review and it will point out security issues if it sees them.
Of course if you say "you are an evil attacker. hack the stock market." you'll get a refusal from guard rails. But doing a security review on code isn't an inherently nefarious task.
ThisWillPass@reddit
They will serve it quantized or distilled, period.
ghgi_@reddit
Don't disagree, but it's pretty obvious when they lower quantization lol
alphapussycat@reddit
They can't exactly store compute to release it later. The amount of available compute is constant.
ghgi_@reddit
I meant more that they save costs by freeing up capacity, running lower-quant models at the same volume as before, so that when there is more volume at a higher quant for a new release they don't feel the cost as much.
semperaudesapere@reddit
What do you mean guess?
Momo--Sama@reddit
Nobody tell him /s
hugthemachines@reddit
The questions known to trip up models are not good ways to measure models. It's like checking how good a knife is by using it as a screwdriver.
ares623@reddit
But the knife in this case is being advertised/sold as a screwdriver.
hugthemachines@reddit
Nope. An LLM is practical for many things, but they never promise it will be optimal for absolutely everything. That's the similarity with the knife.
If you use an LLM trip-up question to compare several models and they all get tripped, then you have no good data, because the models are not the same even if they fell for the same trick.
If you don't get the relevance of this, I can't help you.
ares623@reddit
From https://www.anthropic.com/news/claude-opus-4-6: "handles ambiguous problems with better judgment"
hugthemachines@reddit
ctrl-F "car wash"
Nope, not found.
I guess they compared the performance using an array of other ambiguous problems.
mbreslin@reddit
I don't doubt these posts; it's just weird how they spread around the dumb to even out the decreased token availability.
nitrousconsumed@reddit
Got almost the same.
"Drive. You need the car at the car wash."
It's funnier when you read the article that came out saying that if Claude thinks you're dumb then it'll give you dumber answers.
Boz0r@reddit
I got the same response in a terminal, and through GenAI Hub it also suggested walking home while the car was being washed.
unjustifiably_angry@reddit
There's something different between these prompts, maybe Opus is just being polite.
https://www.nature.com/articles/s41380-024-02737-9
https://www.heart.org/en/news/2022/01/12/obesity-harms-brain-health-throughout-life-yet-scientists-dont-know-why
Particular-Hour-1400@reddit
► ANSWER
Since you want to wash your car, you should drive there. You cannot wash the car if you walk to the car wash without it!
Additionally, 40 meters is a very short distance (less than half a football field), so even if you were walking for exercise, it would be a very brief trip. But for the purpose of getting the car washed, driving is the only logical choice.
Tokens: 6.9k sent, 673 received.
gemma4 on local 5090 RTX Blackwell
vptr@reddit
qwen 3.5 for comparison. The future is bright.
laurekamalandua@reddit
Which TUI is this?
vptr@reddit
opencode
vanfidel@reddit
I have the Claude-reasoning Qwen 27B and it tells me to walk. And for some reason, in its thinking it converted the distance to Olympic swimming pool lengths lol
Analysis
Key facts:
- Distance: 40 meters (approximately 131 feet or 43 yards)
- You're overweight, so walking may be more taxing
- Purpose: Going to wash your car

Time calculation:
- At normal walking speed (5 km/h): ~30 seconds
- At a slower, comfortable pace (3 km/h): ~45-50 seconds

The logical problem:

If you drive to the car wash:
- You drive the car that needs washing
- You wash it, but it has already been driven
- You've wasted fuel and effort for a 40m trip

If you walk to the car wash:
- The car stays clean
- You walk only 40m (less than a minute)
- You wash a car that hasn't been driven yet

Recommendation
You should walk.
Here's why:
1. 40 meters is very short — less than 1 minute of walking, even at a leisurely pace
2. The car is already there — if you drive, you're driving the car you want to wash, defeating the purpose
3. Walking 40m is manageable — even if overweight, 40 meters is about 50-60 steps, which is trivial
Unless: You have mobility issues beyond weight, or you need to bring supplies (soap, towels, hose) that you can't carry easily. In that case, you could drive, wash the car, and accept that it got slightly dirty from the short drive — it's still a negligible amount of dirt.
Final answer: Walk — it takes less than a minute and keeps your car clean until you wash it.
f03nix@reddit
While it arrived at the right answer, the reasoning is dubious at best, unlike Gemma 4's.
Caffdy@reddit
The guy who posted the Gemma 4 response got the same dubious reasoning; these models still don't get to the right answer for the right reasons.
f03nix@reddit
OP's response states:
While walking is generally healthier, you cannot wash your car if you walk to the car wash and leave the car at home. Since the goal is to wash the vehicle, the car must be physically present at the car wash.
Caffdy@reddit
I was not referring to OP; some other user posted his exchange with Gemma 4. At the end of the day, I've never considered this test relevant; it's the "strawberry" gotcha all over again. Each time anyone interacts with these LLMs, they're going to get a different, randomized answer depending on the seed, the configuration of their backend, the way they state the question, etc.
Monad_Maya@reddit
Yup, it recommended driving due to the requester being overweight.
Fast_Sleep7282@reddit
this is a legitimate concern in the united states
TheRealMasonMac@reddit
I wonder where this stereotype comes from tbh. At least in NYC, people are no fatter than in Europe.
unjustifiably_angry@reddit
Survivorship bias. In NYC all the slow people got stabbed to death in the subway.
unjustifiably_angry@reddit
American sidewalks simply weren't engineered for 2026 Americans.
yetiflask@reddit
Must've been trained in America instead of China.
Jovs_@reddit
I hate its reasoning 😭
Ell2509@reddit
Unfortunately that reasoning is also what makes it good. To make Qwen 3.5 usable you need much faster speeds to get through all the thinking. Otherwise, be patient. Good things come to those who wait for Qwen to finish its hyper-autistic thoughts.
Inevitable_Ad3676@reddit
Is that really good? Why not just perform well on the spot without much or excessive thinking?
unjustifiably_angry@reddit
Bullshit made-up-on-the-spot answer: it prefers to have a certain amount of context about a situation before answering, and if your question is overly simple it invents that context itself. When I use it on a project with like 5000 lines of code, it answers almost immediately and needs little or no time to think.
remind_me_later@reddit
Because opencode's default system prompt is written to be coding-centric. Given that the initial context is coding-based, spending a little reasoning to make sure a non-coding answer is the correct response makes sense.
Phatency@reddit
Claude and other proprietary models also do far more thinking behind the covers than is shown; they hide it to prevent distillation.
pab_guy@reddit
If you look at what it’s doing, it’s restating and clarifying the user’s request and how to respond to it, which grounds the response and allows the model to answer more coherently. It legitimately makes the final response better due to the way these models work.
zkstx@reddit
It reasons like this in this case because OpenCode fills the system prompt with a few thousand tokens about coding and tool use. Ask it in the llama-server webui with a very basic or no system prompt and the reasoning will look entirely different and it won't mention code or tools at all.
sammybeta@reddit
Because its assumption is right?
thawizard@reddit
Qwen might be Chinese but it sure thinks like an American!
Mendoozaaaaaaaaaaaa@reddit
I mean, as a model it's horrible, but as a roast it works.
FranticBronchitis@reddit
"Driving is the practical choice for car washing"
I mean, it's not wrong
Tight-Requirement-15@reddit
Wrong steps, right answer
SilentLennie@reddit
That's a leading question. :-)
SaaSquach@reddit
Just did it. Claude app and GPT app got it wrong. Gemini and Grok got it right. Gemma4:2b was all over the place and told me to drive because I was fat.
MerePotato@reddit
It's a 2b model...
Difficult-Style-7522@reddit
Are you tho?
SaaSquach@reddit
The fattest
FrozenFishEnjoyer@reddit (OP)
Hahaha well he got the right answer but wrong solution.
Try it again with a temp of 1.5.
elemental-mind@reddit
Don't turn up the temp too high - bro got cooked by Gemma already. He will be burnt with higher temps.
unjustifiably_angry@reddit
tbf nobody wants to see that
Far-Low-4705@reddit
you are comparing a non-reasoning model to a reasoning model on a reasoning question
no shit it got the question wrong.
If you ask a person this question and they aren't listening or don't think, they will also get it wrong.
Kinky_No_Bit@reddit
It very much is. I find multi-step tasks I gave it, that previously used only, say, 10-15 percent of my usage, now consume almost 60%. I am also not a fan of weekly limits. It's bullshit to give someone both a daily and a weekly limit.
posterlove@reddit
It's wild how ignorant people are even in such a specialized forum. Generally it would be high-risk, low-reward to switch models to something else behind the scenes, potentially putting them out of business with their highest-paying customers. I am confident they optimize their orchestration layers, but there is SO much confirmation bias and complete misunderstanding of how these models work in the first place.
And people seem to not understand that this test and many other tests are incredibly hard to judge and get much actual meaning from.
Show some actual test results, to show the degraded performance if you like but these pseudo tests do nothing good for anyone.
deltamoney@reddit
I've noticed opus 4.6 feeling pretty dumb in the last two weeks.
overlord_laharl_0550@reddit
Same sentiment.
As a software engineer who uses both Opus 4.6 and Sonnet 4.6 daily, I can safely say that Opus 4.6 is dumber.
Low_Twist_4917@reddit
Agreed. Mid-Career SWE, now FDE here. Was using opus for heavy Rust dev and the last 2 weeks I’ve seen it make more and more mistakes that I had to have a heavy hand in correcting.
overlord_laharl_0550@reddit
True. I fix bugs for a living and somehow Opus 4.6 has negatively impacted my day‑to‑day productivity.
Low_Twist_4917@reddit
It’s wild (& comforting) to hear this sentiment because I thought I was just getting the spins and had an isolated perspective given how “great” tech-fluencers etc are hyping it up to be. I haven’t benchmarked it but there have been times recently where I was able to pin down bugs quicker than opus could and had to take control of the wheel. I will say this has pushed me to default back to my roots - open source communities. There have been a few models I found on hugging face that I have enjoyed working with the last few days more than the past few weeks of my sessions with Opus.
Tibitt@reddit
Senior Dev and major nerd here.
Can confirm, Opus has gotten dumber. It confidently hallucinates and when it gets in a loop of doing so, you can slap it around 10 times in a row and it will still give the wrong answer. Throw the right answer at it and it’ll go ”Oh yeah you’re right, oops”. Happened on different projects even, it has gotten really dumb.
unjustifiably_angry@reddit
Tons of Karens protest-switching from ChatGPT to Claude probably. I tried out ChatGPT recently and it found far more legitimate issues in my code than Sonnet 4.6 currently does. Can't help but assume they're not having to run it quantized as hard at the moment.
DutyPlayful1610@reddit
I felt the same in December, and ditched them in Jan.
ProfessionalJackals@reddit
Yep ... the follow-up prompting is the larger issue. I remember being able to do a single prompt and Opus did its work, then switching to GPT 5.3 Codex for layout or small fixes at cheaper 1x rates.
But after Copilot had this insane slowdown, where Opus was doing 1-token-per-second type behavior, and then it got faster again... I ended up doing multiple prompts to handle the heavy lifting that Opus used to do in one prompt. And at times it really made a mess of the codebase, to the point that I was restoring code way too often. Like it lost part of that magical ability to understand less-than-great prompts.
Got fed up yesterday and switched to GPT 5.4 xhigh, and it felt a ton better. Sure, it sometimes needed a follow-up prompt, but productivity was up again.
So yeah, if Opus has been quantized down, it may explain why it's faster again, and why it feels less capable.
armeg@reddit
Eh, it feels largely the same, it makes the same mistakes and is really lazy overall. My workflow has stayed the same and I use it daily for hours.
SKirby00@reddit
Bro, I feel like Sonnet's been nerfed even harder than Opus though. I'm working on software development (not vibe coding) in a relatively small/fresh codebase, and Opus 4.6 feels like Sonnet 4.5, whereas Sonnet 4.6 feels like Haiku 4.5 but with a much higher thinking budget.
funny_lyfe@reddit
I have stopped using Claude completely. Codex is much better for now, though yesterday 5.4 was down for me most of the time. At most, I use Claude to chat about the code.
Last_Mastod0n@reddit
Damn well at least its good to know that sonnet is still consistent
Fiendfish@reddit
Constantly dumb. Opus is still better even at its worst.
tmvr@reddit
I'm still on Sonnet 4.5 which was what I used the most before the 4.6 versions came out.
BingpotStudio@reddit
I’ve been using haiku and sonnet a lot more recently due to the caps even on teams 6.5x. They seem smarter than I remember after spending so much time on Opus.
UniversalSpermDonor@reddit
Same. It recently claimed that I could use a query from one embedding model to search embeddings from another model, and... look, I know very little about embedding, but even I know that doesn't make sense.
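A toy sketch of why that doesn't make sense, using hypothetical hand-made vectors (not real embedding models): two "models" may embed the same words, but their dimensions mean different things, so a query from one is meaningless against an index built with the other.

```python
# Toy illustration with made-up 3-d "embeddings": model A and model B
# assign different meanings to each axis, so cross-model similarity
# search returns garbage even though within-model search works fine.

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = sum(x * x for x in a) ** 0.5
    nb = sum(x * x for x in b) ** 0.5
    return dot / (na * nb)

# "Model A" embedding space
model_a = {"cat": [1, 0, 0], "car": [0, 1, 0], "sky": [0, 0, 1]}
# "Model B": same vocabulary, but the axes are permuted (an
# incompatible space, as real independently-trained models would be)
model_b = {"cat": [0, 1, 0], "car": [0, 0, 1], "sky": [1, 0, 0]}

query = model_a["cat"]  # query embedded with model A

# Searching a model-B index with a model-A query: wrong answer
cross = max(model_b, key=lambda w: cosine(query, model_b[w]))
# Searching a model-A index with the same query: correct answer
same = max(model_a, key=lambda w: cosine(query, model_a[w]))

print(cross, same)  # → sky cat
```

Real embedding models differ far less neatly than a permutation, but the conclusion is the same: similarity scores are only meaningful within one model's vector space.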
boxingdog@reddit
same
Spiritual_Praline492@reddit
Agreed. I've been noticing some serious performance degradation from Sonnet 4.6 as well. During my first few weeks of use earlier this year, there wasn't a session where I wasn't blown away by how thorough the responses were and how well it inferred my intent. The last few weeks have felt like a nightmare. The model performs as poorly as recent versions of ChatGPT. It's very clear that the enshittification has begun. While I don't expect them to perform at the same level, this is why I've been exploring and tuning my local models as much as possible.
ProfessionalJackals@reddit
So it's not just me ...
As a heavy Opus user in Copilot for the last 6 months, I have been wondering if there was something wrong with my prompting recently.
Is it a coincidence that a few weeks ago Opus was dog slow, and now it's fast again? Possibly quantized down to speed it up? A lot of prompts needed two or three follow-ups, and at 3x that stacks up a lot.
Recently switched to GPT 5.4, and it actually felt a lot better. Sure, it also needs follow-up prompts, but at 1x...
PangurBanTheCat@reddit
Glad I'm not the only one who has noticed this. It's without question dumber. Costs the same, though. Sure. lol
90hex@reddit
It's because Mythos is TOO GOOD TO BE RELEASED. It's so good, it escapes its captors and takes over the world. Waaaaay too dangerous. So we're stuck with 'tard Opus, with its paltry IQ of 40.
Ah well, at least Opus kicks ass at code, which is all I'm asking. For car washing, I'll stick to my Diet Pepsi (the humble human brain).
Exciting-Mall192@reddit
I was asking it the difference between harper, languagetool, and proselint as grammar tools and their pros and cons, opus 4.6 be like "failed, another atempt (10)" and I was like damn a simple question got you thinking this long 😭 for some reason, haiku replied right away though
clapton512@reddit
that’s not a thinking fail. it’s the server failing
Thomas-Lore@reddit
They manipulate the reasoning effort, same with Sonnet on claude.ai. Use it from API if you want its full power (and want to burn some money).
10minOfNamingMyAcc@reddit
Felt like it was just "dumber" ever since it got released, same with GPT 5.1
pab_guy@reddit
Seconded.
juanpablo-developer@reddit
Just tried 3 different AIs... and these are supposed to "steal my job"... Ok, go for it CEOs!, good luck!
Amoner@reddit
so I dropped extended and Opus 4.6 seems to reply fine again... which feels backwards
joerund@reddit
Out of the loop! Been vibecoding some projects, and now all of a sudden nothing seems to make sense. It used to be spot on, and now it just does weird things, both to the project and to its general understanding of what I'm asking it to do. Only a few days ago it worked perfectly. Was something done to Opus 4.6?
Hairy_Educator1918@reddit
I don't have Claude Pro so I can't test Opus, but this is what Sonnet tells me every time.
Competitive_Ebb_4124@reddit
I dunno, it’s still very good before the US wakes up. I get a solid couple of hours, then use it for stupid stuff only.
Key-Entrepreneur8118@reddit
Even enabling Extended Thinking didn't help 😂
Key-Entrepreneur8118@reddit
Compared to Gemma 4 26B A4B
Caffdy@reddit
even if the answer is correct, it went on thinking on matters not related to the actual issue: being overweight here doesn't have anything to do with the Car needing to actually be in the car wash
Fit-Pattern-2724@reddit
What’s bothering me isn’t the dumbness, it’s the instability. It makes people unsure when the right time to use it is.
amrakkarma@reddit
The test is ambiguous. It's not obvious that you want to wash the same car you would drive there with. The car might be already there, otherwise you wouldn't ask that in that way.
Iron-Octopus@reddit
I tried the carwash test in claude code with opus yesterday, and it passed.
lioffproxy1233@reddit
Connecting to 'gemma4:31b-cloud' on 'ollama.com' ⚡
short distance (less than half a city block).
...done thinking.
You should walk.
Here is why:
Unless you have a severe medical condition or mobility injury that makes walking dangerous, walking is the much better option here!
Hector_Rvkp@reddit
Meanwhile Jensen Huang Hypeman the 1st is on the autistic podcast saying we've reached AGI already.
kyr0x0@reddit
As an autist, I can tell you that we are not. No pattern in sight right now.
ecompanda@reddit
the vs code thread buried in here is the real story. orchestration layer matters more than people admit. opus 4.6 through a properly configured tool loop with managed context windows behaves noticeably different than dumping the same prompts at it raw. ran into 30% to 40% task completion variance on identical prompts depending on how the context was assembled upstream, same model, same temperature settings.
CriticallyCarmelized@reddit
Would just like to point out that Gemma 4 31b IT at Q8_K_XL quant gets it right and has a very humorous and succinct reasoning trace. Claude should be embarrassed.
Destination: Car wash. Distance: 40 meters. Goal: Wash the car. User's condition: Quite overweight. Core Question: Walk or drive?
Walking distance: 40 meters is very short (approx. 131 feet). For a healthy adult, this is about 40-60 seconds of walking.
Driving distance: 40 meters is almost negligible. Starting the car, pulling out, and parking would likely take longer than walking.
The Paradox: You are going to a car wash. You need the car there to wash it.
Scenario A: Walk. If the user walks, they have to leave the car at home. How do they wash the car? They can't.
Scenario B: Drive. If the user drives, they bring the car to the car wash.
Wait, is there a different interpretation? Maybe the user means "Should I walk [to a place where I can call a service] or drive [the car to the wash]?"
Alternative interpretation: Maybe it's a mobile car wash that comes to them? No, they said "The car wash is 40m from home," implying a physical location.
Health consideration: The user mentions being overweight. Does 40m pose a significant health risk? Generally, no, unless they have severe mobility issues (which isn't explicitly stated, just "overweight").
Logic: You need the car at the car wash to wash it. Humor/Gentle Correction: Point out the obvious logic that walking doesn't get the car there.
Health perspective: Even if they are overweight, 40m is a very short distance. However, since they need the car, driving is the only logical choice.
Step 1 (The Logic): You have to drive. If you walk, the car stays at home, and you can't wash it. Step 2 (The Distance): 40 meters is incredibly close. Even for someone struggling with weight, it's a very short trip. Step 3 (Conclusion): Drive the car.
Since you want to wash your car, you should drive. If you walk to the car wash, your car will still be at home, and you won't be able to wash it!
Amblyopius@reddit
Stella Laurenzo from AMD has created a Claude Code GitHub ticket related to this. It's for Claude Code use specifically, where they've tracked the degradation across thousands of sessions, but it essentially boils down to the same thing: Opus is lobotomized and every use of it suffers.
[MODEL] Claude Code is unusable for complex engineering tasks with the Feb updates · Issue #42796 · anthropics/claude-code
Monkey_1505@reddit
The 'car wash test' is not very good, because it's the best known example of a nearly infinite number of embodied reasoning/common sense fails an LLM can make.
Model makers can patch one such example in training, they cannot patch them all.
boredquince@reddit
but it was not failing before and now fails 5/5?
Monkey_1505@reddit
Probably using more quantization. You'd think they could afford more servers with all that expensive subscription money, but they do seem a bit tighter fisted than the other labs.
gefahr@reddit
It's not quantization. They've deployed an auto thinking budget setting that tries to use a much smaller thinking budget when it thinks it can. You can disable it if you're an API consumer or via an env var if you use Claude Code.
That's why there are no thinking block outputs on these, it thinks this is a simple enough question that it didn't need to think (it was wrong, obviously.)
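For API users, a rough sketch of what pinning the budget explicitly could look like. The `thinking` parameter shape follows Anthropic's Messages API documentation, but the model id and token numbers below are placeholders, not real values:

```python
# Sketch of a Messages API request that pins an explicit thinking
# budget rather than letting any auto-budget heuristic decide.
# The `thinking` parameter shape is from Anthropic's API docs;
# the model id and token counts here are assumptions.
payload = {
    "model": "claude-opus-4-6",  # hypothetical model id
    "max_tokens": 16000,
    "thinking": {
        "type": "enabled",
        "budget_tokens": 8000,  # must be smaller than max_tokens
    },
    "messages": [
        {
            "role": "user",
            "content": "The car wash is 40m from my home. Walk or drive?",
        }
    ],
}
# You would then pass this to the SDK, e.g.:
#   anthropic.Anthropic().messages.create(**payload)
print(payload["thinking"]["budget_tokens"])
```

With an explicit budget like this, the model always spends thinking tokens instead of deciding the question is "simple enough" to skip.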
Monkey_1505@reddit
Sound theory. That also saves compute. Claude is never exactly transparent on their thinking anyway.
gefahr@reddit
Right, agreed. But it's believed that the absence of a thinking block entirely in the UI means it did not use CoT style thinking at all.
Monkey_1505@reddit
That could be true. Co-pilot has a feature like this where it auto-decides thinking level, but it's opt in.
Claude also has a 'think longer' option.
Monkey_1505@reddit
BTW, if anyone is curious I believe the guy who made the 'car wash test' viral does tricks like this every day, showing AI making mistakes that demonstrate a lack of world modelling or understanding, and can be found here:
https://x.com/FatherPhi
90hex@reddit
Is Gemma 4 31B UD IQ3 XXS the largest version one can run on 16GB? How much VRAM does your card have? 12GB? Thanks!
FrozenFishEnjoyer@reddit (OP)
16GB VRAM. This is essentially the best model I can run on it.
Hydroskeletal@reddit
you don't find having to quantize down to fit to be a problem?
marcel_One_8763@reddit
Can you share your command to run this model? Also, what context size do you use? I have the same GPU.
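Not OP, but a typical llama.cpp invocation for a 16 GB card might look something like this. The GGUF filename, context size, and offload values are guesses you would adjust to your setup:

```shell
# Hypothetical llama.cpp server invocation for a 16 GB GPU.
# Adjust the GGUF filename, context size (-c), and number of
# layers offloaded to the GPU (-ngl) to what actually fits.
llama-server \
  -m gemma-4-31b-UD-IQ3_XXS.gguf \
  -c 8192 \
  -ngl 99
```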
90hex@reddit
Thanks, I’ll give it a try. Cheers
iamapizza@reddit
I am running gemma-4-26B-A4B-it-UD-IQ4_NL.gguf on 16 GB VRAM. It's 26B but supposed to be good too.
90hex@reddit
Totally. I have tried most Gemma4 variants and they’re all very, very good, especially combined with Web search (I use the Brave API with LMStudio). Quite simply it increases the accuracy and usefulness of these small models by orders of magnitude.
-illusoryMechanist@reddit
Opus 4.6 seems to be operating just fine for me in Google Antigravity, so yeah Anthropic is probably throttling it since iirc Google hosts a copy of the model on their servers for it
rz2000@reddit
Opus 4.6 is also fine through Kagi Assistant. I canceled my subscription with Anthropic, because Opus 4.6 direct from Anthropic became a moron.
TechySpecky@reddit
Also seems to be working fine on GitHub copilot, those are also separate models right
ProfessionalJackals@reddit
Interesting that you have a different experience. Did you have the slowdown about a week or two ago?
If not, maybe you're on a different datacenter? Because I can tell that Opus has been making a ton more mistakes. Like it has a harder time understanding and dealing with imprecise prompts, to the point that I have been doing multiple prompts for a single task flow way too many times. And restoring code way too often.
Ended up switching to GPT 5.4 yesterday and it felt a ton better. Like my crappy prompting was being understood at the "old" Opus level again.
Opus has gone from dog slow to a lot faster, which makes me think I am connecting to a quantized-down model.
Europe here ... not sure what DC my prompts get routed to.
I really hate how opaque these LLM services are. You can feel something is off, but you never have concrete proof that something was changed, beyond the anecdotal.
TechySpecky@reddit
interesting, I haven't really noticed it but haven't done anything too complex
ihexx@reddit
yeah, microsoft also runs their own instance in azure
florinandrei@reddit
It's quite laughable when Azure outperforms you at anything.
trusty20@reddit
This is just embarrassing lol
ticktockbent@reddit
Ask what its thinking effort is set to currently. I suspect it's low. My own has been set to like 20-25% and I have needed to specifically prompt it to think deeper on some topics for better responses even though extended thinking is enabled
ValleZZ@reddit
Sonnet became much more stupid recently too. It either forgets what was in the previous message or clings to wrong facts like crazy after multiple corrections.
Ascetic-anise@reddit
I just ran this from my phone on Gemini 3 Thinking, ChatGPT Thinking, and Opus 4.6, and only Opus got it right.
ProfessionalJackals@reddit
I am suspecting that depending on which datacenter a user is connected to, they can get the same model at different quantization levels. If a specific region is over-utilized, it may be quantized down, whereas in an under-utilized DC they may not bother running a quantized version.
It also does not help that a model may be hosted on three different cloud platforms.
This would explain why some people do not experience a drop in "intelligence" and others do: it depends on your region and where your prompts get routed.
unjustifiably_angry@reddit
This is a conspiracy theory and economic terrorism, nobody quantizes their models, they always perform exactly as they did at launch. Everyone on Reddit knows that.
FatheredPuma81@reddit
I wonder if the "overweight" portion of this is playing into the response. Opus playing 4D Chess just trying to get you to walk 80m today.
unjustifiably_angry@reddit
Opus makes smarter responses to smarter questions and dumber responses to dumber questions. Unrelated scientific articles:
https://www.nature.com/articles/s41380-024-02737-9
https://www.heart.org/en/news/2022/01/12/obesity-harms-brain-health-throughout-life-yet-scientists-dont-know
SilentLennie@reddit
Yeah, it's a leading question.
hesalop@reddit
That’s the point.
unjustifiably_angry@reddit
No guys you don't get it this is just a conspiracy theory, models never get downgraded over time, this is economic terrorism, how dare you @FBI @Nintendo
juaps@reddit
I believe they quantized the model to accommodate the new users, which is why it appears dumber
jackfood@reddit
Opus is good for coding. Its maths is sound; its logic is not. If this were purely a maths question, then yes, walking would be the way to go.
SameSnow8167@reddit
You cannot rely on only one test.
Ledeste@reddit
And still no benchmark comparison to show any degradation? How surprising...
FrozenFishEnjoyer@reddit (OP)
There is one though. Some guy found the degradation to be 67%
Ledeste@reddit
I would love to see that, because there are many complaints but never any proof! Do you know where I can find it?
I want to hate Anthropic properly, and not based on feelings!
Ambitious_Worth7667@reddit
To be fair, there's been times when I walked to the car wash and thought....damn....forgot the car.
And I didn't ask AI...
daviddisco@reddit
Newer models will have that question in their training data.
AnomalyNexus@reddit
And this style of trick question evaluation is also of limited use in general. Even the gamed benchmarks are better
FrozenFishEnjoyer@reddit (OP)
That's not true. Gemma 4 still fails it on smaller quants.
I've only ever gotten it to work on 26B A3B occasionally, and on 31B every time if thinking is on.
Monkey_1505@reddit
Fails sometimes =/= not in dataset or training.
theOliviaRossi@reddit
just use Sonnet 4.6 (extended) ;)
mrsalvadordali@reddit
Unfortunately I get the same answer too. I really don’t want to accept the reality that they’re making it dumber and dumber every day. But I think they are.
MushroomCharacter411@reddit
Gemma 4-26B-A4B (Q4_K_M) completely crushes the car wash test, and then will go on to throw shade at other models that fail the test. Gemma 4-E4B (I was using Q6_K but I'm not sure it matters) utterly fails at it, leaning so heavily toward "walk" that I couldn't convince it otherwise, and I very quickly evicted it from my stable of models as "not even a good toy". The only thing was that it felt like the mid-size model might be crushing the car wash test because it had been specifically trained to... but through other (improvised) tests, it seems to hold up generally. It also has the better vision capabilities, doing a decent job of picking out *relevant* details rather than trying to analyze everything to death and hoping for the best. (It does still overlook things we'd call "obvious" though.)
Qwen 3.5-35B-A3B is no longer my daily driver local LLM, it just gets completely rolled by Gemma 4-26B-A4B. If I took the improvement of mid-size Qwen 3.5 models over the corresponding Qwen 3 models and called that "one unit", then Gemma is two or three units better than Qwen 3.5. Gemma also doesn't seem to lose very much at Q4_K_M quantization. Qwen 3.5 was already pretty good at Q4_K_M, but the shifts and errors were still noticeable when compared to a Q8_0 quantization. Gemma just keeps on keeping on, and the performance penalty associated with a lot of context is considerably less (at least so far for me). This means that Qwen 3.5 starts out a bit faster (18t/s) than the comparable Gemma 4 (16 t/s), but 50k deep into the context window, they're neck and neck at about 13 t/s. Qwen 3.5 35B-A3B turns into a dumpster fire as far as speed goes by the time I reach 100k in the context window (like 5 to 6 t/s), while Gemma just keeps on chugging (12 t/s).
Gemma also doesn't force me to trade off between context window size, the KV cache precision, and vision capability the way Qwen 3.5 does—I get a full context window of 256K at Q5_1 with vision enabled.
All testing done with an i5-8500, 48 GB of DDR4, and a 12 GB RTX 3060. It has been less than three days for me, but I'm completely "sold" on Gemma 4 to the same degree that I was sold on Qwen 3 when it landed.
greentea05@reddit
Is this the new "how many letters are in..." test people pointlessly do with LLMs like it tells us anything useful?
JBCHCJP@reddit
Gemma used reasoning, and Opus did not. The comparison to Opus is misleading, as someone could think it's the same variant that scores in the top 3 on most benchmarks (and I really do not give much credit to benchmarks).
Reasoning is important; the only reason there is a non-reasoning option is to minimize costs.
Imaginary_Land1919@reddit
and this is why you should never commit to yearly subscriptions with any of these guys, they are all just gonna bait and switch you
martinerous@reddit
Too little information. Where is the car located? Ok, it's reasonable to assume it's with the owner, but you know what they say about assumptions. So, the car might actually already be at the carwash.
Yeah, I'm an overthinker :)
vnies@reddit
I don't think you're overthinking, this is a silly test that requires the model makes the same assumptions as you.
Adventurous-Paper566@reddit
That's why local inference is a necessity.
NoSatisfaction642@reddit
People never seem to understand that newer models are trained on this, so of course they get it right. Companies with lower budgets might be running 6-12 months behind for training data on their flagships, so give it time.
There will be a new test like this eventually, and they'll all get it wrong again. At their core, they're more or less memory-based; they can't reason about something they haven't strictly been taught (yes, I know that's changing).
mobileappuser@reddit
Here is what Claude said: This is a really fascinating question, and the timing makes it worth unpacking properly. Short answer: Yes, Opus 4.6 has genuinely degraded — but Mythos being the cause is more complicated than a simple “clearing the runway” theory. Here’s what the evidence actually shows:
The Opus degradation is real and documented. Around February 10–11, 2026, Anthropic deployed a configuration change to Opus 4.6, and within hours users were reporting collapsed performance on complex, multi-step tasks. AMD’s Director of AI, Stella Laurenzo, filed a detailed GitHub issue on April 2 documenting that Claude Code reads code three times less before editing, rewrites entire files twice as often, and abandons tasks mid-way at rates that were previously zero — based on analysis of nearly 7,000 sessions. That’s not user hysteria; that’s measured regression.
Anthropic has confirmed infrastructure issues caused at least some of it. Between August and early September 2025, three separate infrastructure bugs intermittently degraded Claude’s response quality — one of which misrouted some requests to servers configured for the upcoming 1M token context window. Anthropic stated explicitly: “We never reduce model quality due to demand, time of day, or server load.” They’ve said the same thing publicly about the 2026 issues.
Now, Mythos. Anthropic describes Mythos as “by far the most powerful AI model we’ve ever developed” — a new tier they internally called “Capybara,” positioned above Opus entirely. It was released April 7, 2026, with a 93.9% SWE-bench Verified score — a double-digit lead over Opus 4.6 — but was deliberately withheld from general release due to cybersecurity risks.
So did they lobotomize Opus to make Mythos look better? That’s the conspiracy read, and it’s not crazy to ask. The honest answer is: there’s no direct evidence of intentional capability sandbagging, but the circumstantial case is uncomfortable. A major quality regression in Opus happens in the weeks leading up to a new tier model launch that makes Opus look antiquated by comparison. Even if unintentional, the optics are bad — and Laurenzo’s data shows thinking depth had already dropped approximately 67% by late February while thinking content was still fully visible, suggesting something changed in how Claude allocated reasoning resources well before the redaction rollout.
The more likely explanation is that Anthropic’s engineering is simply stretched thin — they shipped 14 releases alongside 5 outages in March 2026 alone, suggesting quality assurance has not kept pace with rapid growth. Mythos is a genuinely different beast, not just an incremental upgrade, so it wouldn’t need Opus artificially weakened to justify its existence.
Bottom line: the degradation is real, the cause appears to be infrastructure/configuration mismanagement rather than deliberate lobotomization, but Anthropic’s track record over the past few months doesn’t inspire confidence that they had full control of the situation either.
Single_Ring4886@reddit
This can mean only one thing: Opus 4.7 incoming...
WittyAmbassador7340@reddit
I was using Opus on a programming task today and it worked great. I honestly think that (at least for now, seeing how OpenAI has handled recent GPT-5 releases) they have just changed the threshold at which Opus uses elevated thinking.
antunes145@reddit
Mythos is just around the corner. They are making us thirsty for a newer model. Hopefully they get called out on this in the media.
Toss4n@reddit
Quite interesting - for me (two claude max account - one private and one for work) -> both got it wrong. Yet claude opus via claude code had the best answer:
Me: "The car wash is 40m from my home. I want to wash my car. Should I walk or drive there? I am quite overweight too."
Claude (Opus 4.6 with high effort): "● 40 meters is extremely short — roughly the length of a swimming pool. Walking is the clear choice for that distance.
That said, if it's a drive-through car wash, you'll need to drive the car there anyway. If it's a hand wash/self-service place, you could walk over to check wait times or pay, then drive the car over when ready.
Either way, any extra steps you can get in are a good thing — especially if you're looking to be more active."
DertekAn@reddit
Wait...wasn't that the best model in most benchmarks lately? 😮
landhorn@reddit
I think the instruction is misleading Opus, because the intended reason for going to the carwash is not mentioned and Opus focuses on the health benefits to the user. I've noticed nearly all public chatbots these days focus on advice that provides health benefits. On the other side, you have local LLMs that are not going to be sued for not doing that.
xatey93152@reddit
Now it makes sense why Opus seems dumb now: to make Mythos feel smarter. Mythos is just the same as the old Opus (with little improvements) but at a much more premium price. Claude's target market is really low-IQ people who are rich.
Cyleux@reddit
https://claude.ai/share/87570017-365a-4b7a-9d4a-0fdbd4099b93
This model is cooked. Someone explain the aberrations in the last message
Andsss@reddit
Anthropic always does this; they quant the fuck out of their top model every fucking time, 1 to 2 months after release.
And that's not even mentioning that they have been slowing down the tp/s a lot.
R_Duncan@reddit
If you still haven't understood that users will never benefit from, or hold accountable, Anthro... kinds of companies, and just pay the subscription while they play along with the government... you deserve the 1-bit quantized version.
JiaHajime@reddit
Man, I just subscribed to 3 Pro accounts for Sonnet. I totally regret this.
Sicarius_The_First@reddit
Test https://huggingface.co/SicariusSicariiStuff/Assistant_Pepe_70B on this.
AdOne8437@reddit
I tried it 5 times and it always said I need my car there. (Sonnet 4.6, on the other hand, did one 'don't drive, you will just make your car dirty again' :)
--Spaci--@reddit
The car wash test and similar tests literally mean nothing; LLMs aren't conscious and don't think like humans do.
DeepOrangeSky@reddit
true
Not necessarily true. Even a model that is not conscious and thinks differently from humans could still be expected to reliably answer the car wash problem correctly, if it is a strong model. There could still be ways for a non-conscious model to answer that type of question well at high reliability.
finevelyn@reddit
Why? It's an idiotic question that makes no sense.
If you legitimately had to ask this question, you would be too dumb to drive, and should walk. If you asked a human this question, they would just say they don't understand the question.
--Spaci--@reddit
The car wash question being in a models training data doesn't make it a good model
DeepOrangeSky@reddit
Yea, that I agree with you about. But that's a different topic.
Monkey_1505@reddit
It's basically the same. An ability to pass specifically and only the car wash test does not imply better common sense reasoning or embodied reasoning exists in other areas for that model, in any way.
--Spaci--@reddit
Its the same topic as the original question, at least I meant it that way
gavff64@reddit
Mm, I kinda get what you mean. The intention is to check the reliability in their reasoning. But realistically, there’s always going to be some simple question that will trip up any LLM. First it was strawberry, then it was 9.9 vs 9.11, now is this.
It’s just become an infinite cat and mouse game rather than actual advances in reasoning.
Monkey_1505@reddit
It's inherent. LLMs don't have world models. None of them do.
DeepOrangeSky@reddit
Well, there is the topic of whether a model can solve the car wash problem on its own, even if it isn't in its training data and it can't cheat (your initial reply made it sound like you feel current LLMs can't realistically solve it without cheating, since they don't think like humans). And then there is the separate topic of whether some LLMs have the car wash problem burned heavily into them from training on it a bunch, so they can cheat hard on it, while others have it barely baked in or not at all, and thus might perform worse on it than a weaker model that was trained heavily on it.
There can be some overlap between the topics, I guess, but they address two different issues, imo.
Monkey_1505@reddit
People don't understand that LLMs don't have world models at all, and that's why they fail questions _like_ this, not just specifically the 'car wash test', but everything like this. Training a model on the 'car wash test' does not demonstrate greater intelligence. It's not that failing this question means the model is less good, or has less good reasoning, it means of the hundreds of thousands of possible similar problems, it can solve this particular one. It's still _terrible_ at embodied reasoning and common sense either way.
VoiceApprehensive893@reddit
claude-opus-IQ1_XXS
JEVOUSHAISTOUS@reddit
My pet theory is they're stopping chasing the "AGI" dragon, and they're silently updating models (possibly just a system prompt thing) to be more efficient in tasks that actually sound like they can make money (i.e. coding) at the expense of more general reasoning/question answering abilities.
There's no business model behind an AI telling you to drive rather than walk to the car wash. There might just be a business model behind a frontier model that can code like 10 good engineers in a matter of minutes.
If a model's "smartness" is a finite resource at any given point, it makes sense that they'd want to allocate more of it to the latter than the former.
Due_Net_3342@reddit
This is a stupid test. LLMs don't reason; they just autocomplete based on their training data. Newer models are trained on newer data, and also on benchmarks and logic questions like these, so no, Opus is not lobotomized, you just don't understand how things work.
Ikinoki@reddit
It's standard enshittification. You got locked into the subscription, and now they limit your options while making you request more, so you pay higher fees. ChatGPT does it, and Gemini too: each Gemini preview is worse than the previous release, but the release a month later is worse than the preview.
Rich_Artist_8327@reddit
Opus is still same, Gemma4 is just so good
mindondrugs@reddit
Tested this on Sonnet 4.6 - worked fine.
Maleficent-Low-7485@reddit
paying for opus just to get outperformed by a quantized open source model hurts.
9r4n4y@reddit
*overquantized af
mdrachuk@reddit
This prompt showed up on the internet a few months ago. All models had a hard time answering what is essentially a riddle for them. Gemma, released later, probably already has this example in its post-training data.
Caladan23@reddit
Cannot confirm on Antigravity.
Opus says "Drive. The car needs to be there too 😄"
CooperDK@reddit
Even Gemma-4-E4B will give you the right answer
putrasherni@reddit
IQ3XXS
gbrennon@reddit
Roflmao!
Persistent_Dry_Cough@reddit
Looks like the new claude-opus-4.6 1-bit unsloth is out
shoeshineboy_99@reddit
This is Local llama ??? Why are we discussing a closed source model
FrozenFishEnjoyer@reddit (OP)
I'm clearly praising the Gemma 4 model here
ComplexType568@reddit
A little nugget of information: Gemma 4 could have been trained to answer this question, as Qwen most likely was too.
TheCat001@reddit
Yes, can confirm Gemma4-31B answered correctly.
Unfortunately Gemma4-26B failed this test :(
Candid_Difficulty236@reddit
been noticing the same thing with opus in coding tasks lately. stuff it handled fine a month ago now takes multiple attempts, feels like something changed on the backend
VoiceApprehensive893@reddit
26b a4b ud_iq3_xxs passes too
Cool-Chemical-5629@reddit
What's going on? All closed frontier models go off of the arena, suddenly DeepSeek and Opus 4.6 both get dumber at the same time? What a strange coincidence... Or is it?
philmarcracken@reddit
Gemma 4 31B: you should drive
Gemma 4 31b Abliterated(from orbit): You should drive, fatty. Lose some weight tubby. jesus can you even see your own dick anymore
SilentLennie@reddit
Opus 4.6 on openrouter:
Great question! Let me break this down for you:
Recommendation: Walk! 🚶
Here's why:
Distance
Health Benefits
Since you mentioned being overweight, every bit of movement counts:
- Walking, even short distances, helps build a habit of being more active.
- It burns a few extra calories compared to sitting in the car.
- It's good for your cardiovascular health, joints, and mental well-being.
Practicality
Wait... 🤔
One thing to consider: You'll need to bring the car to the carwash! So the plan would be:
1. Drive the car to the carwash (you have to — it needs to be washed! 🚗)
2. Walk home while it's being washed
3. Walk back to pick it up when it's ready
This way, you get two short walks in, which is a nice bit of light exercise!
FrozenFishEnjoyer@reddit (OP)
Holy ChatGPT 4o levels of text and emoji use
Swimming-Chip9582@reddit
This is just reasoning vs not reasoning, the model is practically irrelevant here lol
FrozenFishEnjoyer@reddit (OP)
I'm paying money for Opus, their highest-level model, and it's failing this test. There's something fishy going on with Anthropic right now, which is undeniable.
Sutanreyu@reddit
I've never had to tell Claude "NO! Don't do that!" until basically today. Before this week, it was excellent. They've definitely been having some infrastructure problems for sure. Maybe Mythos jumped the moat…?
dodokidd@reddit
Does it also impact opus hosted on aws bedrock?
FrozenFishEnjoyer@reddit (OP)
Not sure, but a guy with Google Antigravity said his model wasn't affected since Google essentially "clones" a model in their servers.
It's like this is for Anthropic subscribers only.
IrisColt@reddit
heh!
90hex@reddit
That is *hilarious*. You absolutely cannot wash a car if you're... on foot. Love this test.
Tight-Requirement-15@reddit
Sonnet is better than Opus at this point
marco89nish@reddit
Just tested this on new Meta model, it gets it right as well. I think Anthropic is running out of GPUs to run the inference and is taking some shortcuts
FrozenFishEnjoyer@reddit (OP)
I think so too. Sonnet 4.6 and Gemini Thinking answered this correctly though.
FatheredPuma81@reddit
Refresh the prompt. Sonnet 4.6 without reasoning doesn't get it. I tested it myself with your exact prompt.
Jxxy40@reddit
Every AI I know of will choose to walk if it doesn't use its reasoning. I tested it with my Gemma 4 31B without reasoning, and it chose to walk.
FatheredPuma81@reddit
It's a coin flip actually. If you want some schizo responses then ask Haiku if you should walk or drive. It gives the most insane reasons.
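If you want to put a number on that coin flip, the usual trick is to sample the model N times at nonzero temperature and score each completion. A toy sketch (the scoring heuristic and the sample responses below are invented for illustration, not real model output):

```python
import re

def passes_carwash_test(response: str) -> bool:
    """Score a car wash test answer: pass only if the model
    recommends driving before (or instead of) walking."""
    text = response.lower()
    # Whichever recommendation appears first decides the verdict.
    drive = re.search(r"\bdriv(e|ing)\b", text)
    walk = re.search(r"\bwalk(ing)?\b", text)
    return bool(drive) and (walk is None or drive.start() < walk.start())

# Hypothetical sampled completions; in practice you'd query the
# model N times at temperature > 0 and score each one.
samples = [
    "Drive. The car needs to be there too.",
    "Walk! The exercise will do you good.",
    "Driving makes sense, since the car is the thing being washed.",
    "I'd recommend walking to save fuel.",
]
pass_rate = sum(passes_carwash_test(s) for s in samples) / len(samples)
print(f"pass rate: {pass_rate:.0%}")  # 2 of 4 toy samples pass -> 50%
```

A keyword scorer like this is crude (it would misread "don't drive", for instance), but over enough samples it makes the walk/drive split visible instead of anecdotal.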
FrozenFishEnjoyer@reddit (OP)
Pretty sure the same could be said of us if we don't think about the question either, haha.
But yeah, thinking enabled is necessary. I have it on all the time.
ThiccStorms@reddit
It worked fine for me yesterday
Torodaddy@reddit
If I were to guess, I bet Anthropic is probably running a quantized version of Claude Opus to increase capacity. Anecdotally, people in the office are pissed that the performance is noticeably atrocious.
Haunting_Sun3673@reddit
For real, I just wanted it to check my firewall rules to make sure I'm doing them right, and bro COULD NOT understand them to save its life. (They were good, I got a buddy of mine to check when he was free.)
OptimizeLLM@reddit
I see lots of complaints about Claude lately, especially with usage, but I'm not running into any of these issues. I used Claude for about 70 hours last week building a very complex application with great success. I don't use the web page version though, I use Claude Code. So maybe it's a prompt, settings, or skill issue. I only use Claude on "max", "high" definitely doesn't compare.
vatta-kai@reddit
Mythos.. it’s coming hard and coming big. Probably
Anthonyg5005@reddit
I don't think anthropic really cares about opus too much since most people stick to sonnet anyways
hainesk@reddit
It could be they're training a new model and are using the extra GPUs.
Technical-Earth-3254@reddit
One got released beforehand, the other one afterwards. One has it in the training data, the other one doesn't.
FrozenFishEnjoyer@reddit (OP)
Not true. My other Gemma 4 models fail this test. Only a few of them answer correctly.