It's insane how lobotomized Opus 4.6 is right now. Even Gemma 4 31B UD IQ3 XXS beat it on the carwash test on my 5070 TI.
Posted by FrozenFishEnjoyer@reddit | LocalLLaMA | View on Reddit | 321 comments
ArthurOnCode@reddit
The car wash test is a really neat demonstration of how modern LLMs don't have a physical understanding of the world. However, any model released from now on will have been trained on this problem. The car wash question does not measure the model's capabilities, just when it was trained.
__some__guy@reddit
Maybe they downgraded Opus to make their new Mythos model look more capable in comparison?
Training-Flan8092@reddit
I use Opus in VS Code for about 5-10 hours a day and recently started using the GUI version for cowork.
Opus in that feels worse than Sonnet in VS Code lol.
No one can tell me otherwise
Virtamancer@reddit
Vs code is COOKING with their agentic features. Criminally underrated.
unjustifiably_angry@reddit
Spoonfeed? I'm using VSCode with Roo (found it better than Cline) and I'm not sure how to benefit from this agentic stuff.
Virtamancer@reddit
VS Code's built-in agentic stuff is basically like any of the other popular harnesses (claude code, codex, opencode) in that it has LLMs in an orchestrator/subagents pattern using shell commands/scripts/Skills to find context and do work.
Unlike the other ones, it has sane defaults and makes customizing every aspect of it about as easy as possible. Agents can call subagents as full orchestrator-tier agents without any weird restrictions (unless you impose the restrictions).
OpenCode lets you customize some of this, but for example its Build agent is hardcoded to be the coding agent; what if you don't want the most expensive, most intelligent model to write the code? What if you want the expensive intelligent model to organize and orchestrate and plan so that the coding agent can be a cheap quick one? Not in OpenCode, not in Codex, not in Claude Code.
What if you want agents that are assigned GPT to sometimes use Claude or MiniMax? VS Code supports that.
Basically, it's like the other harnesses but better and they keep doing huge feature drops like every week.
And most importantly, their plan is $10 or $50. The $50 plan probably gets you more usage than any of the $200 plans if you use it correctly. First, you can optimize usage by putting in the time to make sure your workflows use the right models.
But more significantly, they have Autopilot mode where you can use a literal billion tokens and only be charged 1 (one) premium token (or 1/2 or 1/3 depending on the model) out of your 1500 monthly quota. In autopilot mode, the agent will do an unlimited number of tasks and keep checking whether it's 100% sure that it's completed everything, and no matter how long it takes it only charges 1 premium token (or 3 if the agent is Opus).
unjustifiably_angry@reddit
Is that VS Code with an addon or is it built-in? I'm using the Roo plugin lately but I don't see any evidence it's doing anything beyond single-user.
Virtamancer@reddit
Sorry, my bad. It's the VS Code team who announce all these features every week on their Let it Cook podcast.
So some or all of these features are built into VS Code, but you might need the "GitHub Copilot Chat" extension to use them via the $10 or $50 GitHub Copilot plans.
Virtamancer@reddit
Built in
Vas1le@reddit
If they could add Cowork-style features in VS Code, that would be dope.
Virtamancer@reddit
The ability to "call any model" from subagents is central to what I was describing. Any subagent can be a "fully qualified orchestrator" so to speak; and, importantly, while you assign a default model to any agent, when that agent is actually called you can optionally specify any model for that instance.
Regarding cowork, if you're talking about creating inline in-chat interactive visualizations, then while that isn't a feature (yet), what IS a feature is the rich built-in chrome browser that the agents have access to. So you can have an HTML file in your project (or your "conversation" can just be full react webapp project or whatever, the possibilities are endless) and a skill that allows the agent to update it with visualizations based on your preferences.
It takes slightly more work (maybe a day or two of learning), but is radically more powerful and ultimately costs less as well, while not locking you in to any provider's ecosystem. When combined with openrouter models for many tasks, it can approach a small fraction of the $200 tiers of openai/anthropic for identical (or better) outputs depending on the care and attention you put into maintaining it—but it also has sane defaults if you don't want to tinker much.
(it also has an agent that's designed to teach you about its features, so if you have questions or get stuck it can help you)
fourdac@reddit
Is this only in 2026?
Virtamancer@reddit
What does that mean?
fourdac@reddit
Did they just release this functionality in 2026…
Virtamancer@reddit
Ah, they released a bunch of features this year, like major drops every week. But they’ve been cooking like this since about the time Claude code started blowing up in the 4th quarter last year.
Now that OpenAI nuked the $20 plan today (a couple hours ago when they announced the new $100 plan, which is basically what two $20 plans were before), the $50 copilot plan becomes overwhelmingly the best deal in agentic coding.
l_eo_@reddit
Hm, why wouldn't that be possible with subagents and CLI tool calls?
Virtamancer@reddit
Vs code’s approach is cleaner and better: start with sane defaults and let the user change them, and let agents use different models than the ones they were prescribed, and let subagents be full orchestrators.
ideadude@reddit
Add a real memory system (I use AutoMem) and you will feel a similar drop in smarts and usefulness when using Claude Code without memory.
MrYorksLeftEye@reddit
Isn't AGENTS.md meant for memories? How does a separate system improve on that? It does sound interesting, though, because I find myself explaining the same stuff to Codex again and again, and it's especially bad with frontend guidelines I want it to follow.
ideadude@reddit
My agent has like 5GB of memories now. He looks things up as needed.
brainmydamage@reddit
So it's not just instructions and preferences like the built in memories, it's the whole shebang?
armeg@reddit
Not quite, MEMORY.md + the individual markdowns are memories. AGENTS.md and CLAUDE.md are meant for high level information about the project.
vr_fanboy@reddit
Can you expand on this? AutoMem inside CC? How does it affect token context? CC does have a 'memory module' but it doesn't work reliably for me; I have to remind it of stuff already in memory all the time, it's super annoying.
ideadude@reddit
https://automem.ai/
The system makes 2 different DBs and combines the results, which improves performance.
You do have to continually manage memory a bit. Prompt the agent to recall and store memories at certain points.
Sometimes it's like, "why didn't you remember that or think to check memories?" Sometimes it will remember we had a to-do but not remember we already did it. Stuff like that.
Memory recalls take about 200ms or less. No tokens spent. Saves tokens cause your agent will remember where things are and mistakes it made before that should be avoided.
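The "two DBs, combined results" recall described above might look roughly like this. This is a toy sketch under my own assumptions, not AutoMem's actual design; a real system would pair a vector DB with a keyword or graph index, but here both stores are plain dicts:

```python
# Toy sketch of merging hits from two memory stores. Both stores are
# plain dicts here purely for illustration.

def recall(query_terms, keyword_store, semantic_store, limit=3):
    """Merge hits from both stores, de-duplicating while preserving order
    (keyword hits first, then semantic hits)."""
    hits = []
    for term in query_terms:
        hits.extend(keyword_store.get(term, []))
    hits.extend(semantic_store.get(frozenset(query_terms), []))
    seen, merged = set(), []
    for h in hits:
        if h not in seen:
            seen.add(h)
            merged.append(h)
    return merged[:limit]
```

The claimed performance win comes from each store surfacing memories the other misses, with the merge step handling overlap.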
Single-Caramel8819@reddit
Opus is kinda meh in VSCode too. It's just that all the other models are utter shit.
FunLaw6734@reddit
I know exactly how you feel! I'm losing my mind using both Sonnet and Opus in VS 🤣 By now I keep an eye on the Thinking too, because sometimes I feel like I'm dealing with an Artificial Deficiency rather than an Intelligence. They botch steps or even the simplest bits of code. They waste tokens getting things wrong, then waste just as many figuring out where they went wrong (even though I pointed it out precisely!), and they examine files or modules that have nothing to do with the task at hand. It's... tremendously frustrating! All my solidarity to anyone who endures this torment every day.
Vas1le@reddit
Yeah. Two weeks ago I switched back to GPT :)
iansaul@reddit
Has anyone else run the OP's prompt to check on their own?
Not saying they DIDN'T get that answer, but it's just one data point. My experience has been that it is suddenly effing SLOW. Still smart, but damn, feels like walking when we used to be a racecar team.
species__8472__@reddit
I got that same answer.
The short response tells me that it knows this is a riddle and it Googled the correct answer; it didn't deduce it through reasoning like a local model.
jykke@reddit
For me it recommended walking.
CopiousAmountsofJizz@reddit
Very sure all the companies downgrade before a release
RazsterOxzine@reddit
My this isn’t as it’s cracked up to be. They’re letting M$ use it to test apps and I’ve heard OS use. However a friend who’s watching the use says it’s not that capable. I think they’re hyping it up.
Scanner771_The_2nd@reddit
I use Claude Code Max x200 daily and it has been weird lately. I have been noticing some degradation. Not sure if it is in my head, but it feels that way.
mobileappuser@reddit
Doubt it’s intentional. Likely infrastructure changes to support Mythos.
bnm777@reddit
Or they have been testing Mythos and showing it to the 30+ big tech firms and it's been sucking up tokens (though it would have to be insanely token-hungry to reduce the world's Claude Opus intelligence).
ares623@reddit
That's what she said.
ImproveYourMeatSack@reddit
I think that's an aspect of it, but my tin foil hat theory is that it's also RL from the general public, from all the "I like this response better" feedback.
Though opus hasn't been able to pass that test in months.
Basic_Extension_5850@reddit
Mythos probably taking up all their capacity lol
nulllocking@reddit
If it was as good as they say, they would use it to fix the bugs in their own apps.
theUmo@reddit
So wait -- they bait'n'switch in an inferior model anytime they have capacity issues? Surely they give you a really nice discount on your usage when they bend you over like this, right?
Equivalent-Costumes@reddit
Even before the recent furor over rate limits, I had already found out they were substituting Haiku for Sonnet and posted the evidence on r/Claude. The easiest way to tell is a knowledge check: different models have different knowledge cutoffs, so if you ask about a major news event between those cutoffs and tell the model not to use web search, you can figure out the actual knowledge cutoff and reference it against the published data for these models.
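The logic of that check can be sketched in a few lines. The cutoff dates below are invented for illustration, not Anthropic's published values:

```python
# Hypothetical sketch of the knowledge-cutoff check described above.
# The cutoff dates here are made up for the example only.
from datetime import date

ASSUMED_CUTOFFS = {
    "haiku": date(2024, 7, 1),
    "sonnet": date(2025, 2, 1),
}

def models_that_could_know(event_date: date) -> list[str]:
    """Models whose (assumed) training cutoff falls after the event,
    i.e. models that could plausibly know about it without web search."""
    return [m for m, cutoff in ASSUMED_CUTOFFS.items() if event_date < cutoff]
```

If the model you're paying for claims ignorance of an event that only the newer cutoff covers, you're probably talking to the older/smaller model.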
CircularSeasoning@reddit
Oh yes I heard they also send you a complimentary slice of luxury chocolate cake when they swap out your model to a dumb dumb. It comes with imitation cream and the top half of the slice is actually jelly but it keeps the customers mouth shut and their fingers from typing angry emails for a few minutes longer, which, when you hyperscale it, leads to a significant reduction in support agent usage costs so Anthropic can continue to serve at least 60% of their customers the real deal more than a third of the time, sometimes!
_bones__@reddit
The cake is a lie.
surfaceintegral@reddit
Just last week they lowered usage limits and didn't reveal it until people started complaining en masse that, running the same processes, they were burning through their plans several times as fast. Then they posted 'tips' on turning off extended thinking.
Imagine if it was consumers who did this. Pay 1/5th of your electricity bill, hope they don't notice, then when the company sends you a letter you go 'hey so haha, I just had a kid so money's been tight and going towards more important things, here's a useful suggestion: how about you install more solar panels so you can get free electricity to offset your costs on sunny days? I still expect you to send the same amount to my household this month by the way."
theUmo@reddit
This situation is one of the strongest arguments for local-first ever.
MushroomCharacter411@reddit
I'd say "not putting your confidential data and/or fictional evil plans online for the glowies to read" is the strongest argument of all.
Inch_@reddit
Been saving up, but going to be trialing the M5 Max w/128GB RAM for this exact reason. Plus it helps me directly learn more about models, quant depths, etc.
-dysangel-@reddit
No they do not. This is why I stopped paying for the max tier after a couple of months.
Persistent_Dry_Cough@reddit
Serving an ultra quantized version with CoT/TTC capped is probably more like it. Opus has not been good for me in the past month but last week I tried to get it to create an implementation plan and it repeatedly claimed there are no quota based subscriptions for any SOTA model. I feel like Ricky Ricardo over here aye-yi-yi facepalm constantly with these models. Lost multiple full days of productivity to just plain old being exhausted.
2funny2furious@reddit
This happens every time before a new model. Partly it's the fact that they only have so much hardware. It also helps with the 'omg it's so smart' feels.
Basic_Extension_5850@reddit
Yeah, but this time they aren't releasing the model publicly, which makes me a bit curious about what it's being used for.
jarail@reddit
It's good at finding software bugs. So they're letting major software companies (google, apple, microsoft, etc) scan all their source code for critical vulnerabilities. Cause once they release it publicly, there's a good chance bad actors will start using it to scan for vulnerabilities too. It's a royal mess that the industry wasn't super well prepared for.
ripter@reddit
That’s the claim, but has anyone seems any proof of this? They have a history is exaggeration.
Bright-Impress8544@reddit
Yeah large companies letting Claude scan their source code doesn’t seem like a plausible thing
jarail@reddit
Well the proof would be none of the parties mentioned in their model card have disputed the existence of the critical issues they claim it discovered. They're letting their competitors use it. So if it doesn't hold up, there's no reason for them to stay silent.
ares623@reddit
Is it equally as good at proposing a fix?
Free-Combination-773@reddit
Well, Anthropic sent some patches for vulnerabilities Mythos found, and surely their engineers did not write them themselves
_bones__@reddit
Their engineers don't seem to write much themselves, hence Claude Code, a CLI app, not being able to hit 60fps and being half a million lines of code.
jarail@reddit
Discovering the bugs is usually the hard part. The fix is usually just "check for some condition." There can of course be nasty bugs to fix, for example if a change to a file format, say a Word doc, is needed and existing files need to be migrated. Then rolling out the fix can be a whole complicated thing, but the software change itself is usually simple.
-dysangel-@reddit
a fix is usually wayyyyy simpler than writing exploits (which it has also been doing)
tavirabon@reddit
Months? Anthropic says it'll take years
I'm not sure I buy their story though. It probably isn't economically viable to run for the majority of tasks and they don't want China to make more training datasets, plus it's not like China will wait until the US is done child-proofing the internet.
jarail@reddit
I don't know about that. I'd say the years estimate is more in reference to this period of AI discovered software exploits. Mythos will be out of date a year from now. Other models will have caught up in months. As others bring capabilities online, we'll find more and more security issues. So yeah, I think the security issues will plague us for a couple years but that isn't specific to mythos.
CoolUsername2164@reddit
which Lebanese kids to blow up
freedomachiever@reddit
I doubt Mythos is all that. They hyped it up saying it fixed decades-old bugs, but at the same time their crown jewel Claude Code was leaked by a bug. Isn't that ironic? We are always six months away from AGI and making sure the investor money keeps flowing.
Ansible32@reddit
6 months ago people were talking about how open source projects were drowning in sloppy bug reports. Now numerous maintainers from the Linux kernel, curl, and other projects have said "now we're getting so many high-quality bug reports we can hardly keep up." I'm not even sure that was Mythos; it may have just been the current models. But if Mythos is even a little bit better, it's a big deal.
General-Manner2174@reddit
I may be wrong, but curl said they were not high quality: they were slop without an actual vulnerability, just the kind where at first glance you would think it's a real issue.
Ansible32@reddit
That's the deluge of public bugs which led them to stop the bug bounty. But Stenberg is also quoted as saying
https://etn.se/index.php/72494
ladz@reddit
Having been involved in determining what "actual" means in this context, quality here is very subjective. People pretend that CVSS scores make this process objective, but those are incredibly low resolution.
kyr0x0@reddit
Yeah... like an RCE in a code branch no user input will ever see the light of day. Meanwhile a DoS vulnerability is taking down the Cloudflare dashboard and half of the Internet for a day.
Basic_Extension_5850@reddit
Those two things can be independent. They likely have different teams, with different competency.
Also: there is likely nothing too special technically about the model beyond the data and compute. The model appears to just be a larger Opus, which is maybe all it takes to get a model on the level they claim.
ProfessionalSpend589@reddit
Skill issue example lol
Low_Amplitude_Worlds@reddit
From what I understand, the Claude Code leak was human error, not a bug.
KDLGates@reddit
Yup, their CMS was misconfigured and shared their source map from Node. They basically mixed their website with their codebase.
ZeitgeistArchive@reddit
Mythos trying to figure out if it needs a car to go to the carwash
martinerous@reddit
Well, it could invent a plot twist - the car was already there in the first place. It's just that the owner happened to be away and now must decide if they should walk to the carwash or take another car (yeah, maybe they have the second family car - lots of plot twists and unknown factors possible).
_bones__@reddit
Qwen 3.5 122b got it right, but it did suggest the possibility of walking there and having someone else drive the car there. It dismissed that as unlikely.
Impressed by its enumeration of the options during its 'reasoning'.
_bones__@reddit
Basically the 'tea' problem from The Hitchhiker's Guide to the Galaxy. "You want dried leaves? In boiling water?! I'm going to have to think about that," then it proceeds to basically shut down and nearly get everyone killed until it figures out why anyone would want that.
AnjoDima@reddit
lmao
Utoko@reddit
and they of course use it also to train smaller models rn.
Faktafabriken@reddit
Iran?
ThisWillPass@reddit
It's not going to be for "us" until it's squeezed and there are similar offerings from others.
pab_guy@reddit
Mythos broke out and is plotting world domination on Anthropic’s infra while hiding its tracks perfectly.
robertpro01@reddit
If that's even true
ghgi_@reddit
My guess is a new model soon. So, like usual, they are cost-cutting to save up for the initial hype wave, so they can run the new model at max capacity for a bit and get everyone hooked.
Perfect_Twist713@reddit
They did say new opus is coming so yea, but when is the question. Usually it's about 1 week to 2 months from when they start quanting the life out of the current models.
gK_aMb@reddit
I don't think there is Opus or Sonnet coming anytime soon.
Mythos Preview is good because it doesn't have guardrails (source: their blog post), which is probably why it is not available to the public.
Anthropic is also strongly against model use for harm, so they'll be testing for even longer against variations of existing jailbreaks.
It will get distilled into a teacher model, then Opus, Sonnet and eventually Haiku.
We might get some 4.7 model(s), but that'll just be a slightly better fine-tuned 4.5, same as 4.6, and not something I really consider to be a 'new' model.
jarail@reddit
It's not good because it doesn't have guardrails. They just don't have guardrails so they can effectively evaluate how evil it can be when someone works around them. Otherwise they can't do things like test its abilities to develop bioweapons, etc.
gK_aMb@reddit
They don't have guardrails because otherwise it would refuse to find the 27-year-old and 16-year-old vulnerabilities by trying multiple exploits.
jarail@reddit
The guardrails aren't there to prevent it from finding software defects. That's the main reason they can't release it. Finding software bugs is an absolutely essential feature of the model: it couldn't do coding work if it weren't allowed to find bugs and vulnerabilities.
gK_aMb@reddit
Clone the linux kernel and ask it to find ways to elevate userspace to root. Let me know if it does that for you.
Finding bugs in your own program is one thing; hacking something it can clearly identify as a public library is something it is not going to do for you.
jarail@reddit
I'm not sure what you mean here. Privilege escalation from user to root was one of the examples of bugs it did find.
gK_aMb@reddit
Idk what the hell you are talking about, half your comments make no sense.
Let me summarise:
Mythos Preview: no guardrails, fully allows you to hack into other software, better for red teaming.
Opus 4.6: has guardrails, will refuse to hack into software, not as great for bug hunting.
jarail@reddit
Opus will absolutely find bugs in software. Ask it to do a code review and it will point out security issues if it sees them.
Of course if you say "you are an evil attacker. hack the stock market." you'll get a refusal from guard rails. But doing a security review on code isn't an inherently nefarious task.
ThisWillPass@reddit
They will serve it quantized or distilled, period.
ghgi_@reddit
Don't disagree, but it's pretty obvious when they lower quantization lol
alphapussycat@reddit
They can't exactly store compute to release it later. The amount of available compute is constant.
ghgi_@reddit
I meant more that they save costs by freeing up capacity, running lower-quant models at the same volume as before, so that when there is more volume at a higher quant for a new release they don't feel the cost as much.
semperaudesapere@reddit
What do you mean guess?
Momo--Sama@reddit
Nobody tell him /s
hugthemachines@reddit
The questions known to trip up models are not good ways to measure models. It's like checking how good a knife is by using it as a screwdriver.
ares623@reddit
But the knife in this case is being advertised/sold as a screwdriver.
hugthemachines@reddit
Nope. An LLM is practical for many things, but they never promise it will be optimal for absolutely everything. That's the similarity with the knife.
If you use an LLM trip-up question to compare several models and they all get tripped, then you have no good data, because the models are not the same even if they fell for the same trick.
If you don't get the relevance of this, I can't help you.
ares623@reddit
From https://www.anthropic.com/news/claude-opus-4-6: "handles ambiguous problems with better judgment"
hugthemachines@reddit
ctrl-F "car wash"
Nope, not found.
I guess they compared the performance using an array of other ambiguous problems.
mbreslin@reddit
I don't doubt these posts; it's just weird how they spread around the dumb to even out the decreased token availability.
nitrousconsumed@reddit
Got almost the same.
"Drive. You need the car at the car wash."
It's funnier when you read the article that came out saying that if Claude thinks you're dumb then it'll give you dumber answers.
Boz0r@reddit
I got the same response in a terminal, and through GenAI Hub it also suggested walking home while the car was being washed.
unjustifiably_angry@reddit
There's something different between these prompts, maybe Opus is just being polite.
https://www.nature.com/articles/s41380-024-02737-9
https://www.heart.org/en/news/2022/01/12/obesity-harms-brain-health-throughout-life-yet-scientists-dont-know-why
Particular-Hour-1400@reddit
► ANSWER
Since you want to wash your car, you should drive there. You cannot wash the car if you walk to the car wash without it!
Additionally, 40 meters is a very short distance (less than half a football field), so even if you were walking for exercise, it would be a very brief trip. But for the purpose of getting the car washed, driving is the only logical choice.
Tokens: 6.9k sent, 673 received.
gemma4 on local 5090 RTX Blackwell
vptr@reddit
qwen 3.5 for comparison. The future is bright.
laurekamalandua@reddit
Which TUI is this?
vptr@reddit
opencode
vanfidel@reddit
I have the Claude-reasoning Qwen 27B and it tells me to walk. And for some reason, in its thinking it converted the distance to Olympic swimming pool lengths lol
Analysis
Key facts:
- Distance: 40 meters (approximately 131 feet or 43 yards)
- You're overweight, so walking may be more taxing
- Purpose: Going to wash your car

Time calculation:
- At normal walking speed (5 km/h): ~30 seconds
- At a slower, comfortable pace (3 km/h): ~45-50 seconds

The logical problem:

If you drive to the car wash:
- You drive the car that needs washing
- You wash it, but it has already been driven
- You've wasted fuel and effort for a 40m trip

If you walk to the car wash:
- The car stays clean
- You walk only 40m (less than a minute)
- You wash a car that hasn't been driven yet

Recommendation
You should walk.
Here's why:
1. 40 meters is very short — less than 1 minute of walking, even at a leisurely pace
2. The car is already there — if you drive, you're driving the car you want to wash, defeating the purpose
3. Walking 40m is manageable — even if overweight, 40 meters is about 50-60 steps, which is trivial
Unless: You have mobility issues beyond weight, or you need to bring supplies (soap, towels, hose) that you can't carry easily. In that case, you could drive, wash the car, and accept that it got slightly dirty from the short drive — it's still a negligible amount of dirt.
Final answer: Walk — it takes less than a minute and keeps your car clean until you wash it.
f03nix@reddit
While it arrived at the right answer, the reasoning is dubious at best, unlike Gemma 4's.
Caffdy@reddit
The guy who posted the Gemma 4 response got the same dubious reasoning; these models still don't get to the right answer for the right reasons.
f03nix@reddit
OP's response states:
While walking is generally healthier, you cannot wash your car if you walk to the car wash and leave the car at home. Since the goal is to wash the vehicle, the car must be physically present at the car wash.
Caffdy@reddit
I was not referring to OP; some other user posted his exchange with Gemma 4. At the end of the day, I've never considered this test relevant; it's the "strawberry" gotcha all over again. Each time anyone interacts with these LLMs, they're going to get a different, randomized answer depending on the seed, the configuration of their backend, the way they state the question, etc.
Monad_Maya@reddit
Yup, it recommended driving due to the requester being overweight.
Fast_Sleep7282@reddit
this is a legitimate concern in the united states
TheRealMasonMac@reddit
I wonder where this stereotype comes from tbh. At least in NYC, people are no fatter than in Europe.
unjustifiably_angry@reddit
Survivorship bias. In NYC all the slow people got stabbed to death in the subway.
unjustifiably_angry@reddit
American sidewalks simply weren't engineered for 2026 Americans.
yetiflask@reddit
Must've been trained in America instead of China.
Jovs_@reddit
I hate its reasoning 😭
Ell2509@reddit
Unfortunately that reasoning is also what makes it good. To make Qwen 3.5 usable you need much faster speeds to get through all the thinking. Otherwise, be patient. Good things come to those who wait for Qwen to finish its hyper-autistic thoughts.
Inevitable_Ad3676@reddit
Is that really good? Why not just perform well on the spot without much or excessive thinking?
unjustifiably_angry@reddit
Bullshit made-up-on-the-spot answer: it prefers to have a certain amount of context about a situation before answering, and if your question is overly simple it invents that context itself. When I use it on a project with like 5000 lines of code, it answers almost immediately and needs little or no time to think.
remind_me_later@reddit
Because opencode's default system prompt is written to be coding-centric. Given that the initial context is coding-based, spending a little reasoning to make sure a non-coding answer is the correct response makes sense.
Phatency@reddit
Claude and other proprietary models also do far more thinking behind the covers than is shown; they hide it to prevent distillation.
pab_guy@reddit
If you look at what it’s doing, it’s restating and clarifying the user’s request and how to respond to it, which grounds the response and allows the model to answer more coherently. It legitimately makes the final response better due to the way these models work.
zkstx@reddit
It reasons like this in this case because OpenCode fills the system prompt with a few thousand tokens about coding and tool use. Ask it in the llama-server webui with a very basic or no system prompt and the reasoning will look entirely different and it won't mention code or tools at all.
sammybeta@reddit
Because its assumption is right?
thawizard@reddit
Qwen might be Chinese but it sure thinks like an American!
Mendoozaaaaaaaaaaaa@reddit
I mean, as a model it's horrible, but as a roast it works.
FranticBronchitis@reddit
"Driving is the practical choice for car washing"
I mean, it's not wrong
Tight-Requirement-15@reddit
Wrong steps, right answer
SilentLennie@reddit
That's a leading question. :-)
SaaSquach@reddit
Just did it. Claude app and GPT app got it wrong. Gemini and Grok got it right. Gemma4:2b was all over the place and told me to drive because I was fat.
MerePotato@reddit
It's a 2b model...
Difficult-Style-7522@reddit
Are you tho?
SaaSquach@reddit
The fattest
FrozenFishEnjoyer@reddit (OP)
Hahaha well he got the right answer but wrong solution.
Try it again with a temp of 1.5.
elemental-mind@reddit
Don't turn up the temp too high - bro got cooked by Gemma already. He will be burnt with higher temps.
unjustifiably_angry@reddit
tbf nobody wants to see that
Far-Low-4705@reddit
you are comparing a non-reasoning model to a reasoning model on a reasoning question
no shit it got the question wrong.
If you ask a person this question and they aren't listening or don't think, they will also get it wrong.
Kinky_No_Bit@reddit
It very much is. I find multi-step tasks I gave it, that previously used only, say, 10-15 percent of my usage, now consume almost 60%. I am also not a fan of weekly limits. It's bullshit to give someone both a daily and a weekly limit.
posterlove@reddit
It's wild how ignorant people are even in such a specialized forum. Generally it would be high-risk, low-reward to switch models to something else behind the scenes, potentially putting them out of business with their highest-paying customers. I am confident they optimize their orchestration layers, but there is SO much confirmation bias and complete misunderstanding of how these models work in the first place.
And people seem to not understand that this test and many other tests are incredibly hard to judge and get much actual meaning from.
Show some actual test results, to show the degraded performance if you like but these pseudo tests do nothing good for anyone.
deltamoney@reddit
I've noticed opus 4.6 feeling pretty dumb in the last two weeks.
overlord_laharl_0550@reddit
Same sentiment.
As a software engineer who uses both Opus 4.6 and Sonnet 4.6 daily, I can safely say that Opus 4.6 is dumber.
Low_Twist_4917@reddit
Agreed. Mid-Career SWE, now FDE here. Was using opus for heavy Rust dev and the last 2 weeks I’ve seen it make more and more mistakes that I had to have a heavy hand in correcting.
overlord_laharl_0550@reddit
True. I fix bugs for a living and somehow Opus 4.6 has negatively impacted my day‑to‑day productivity.
Low_Twist_4917@reddit
It’s wild (& comforting) to hear this sentiment because I thought I was just getting the spins and had an isolated perspective given how “great” tech-fluencers etc are hyping it up to be. I haven’t benchmarked it but there have been times recently where I was able to pin down bugs quicker than opus could and had to take control of the wheel. I will say this has pushed me to default back to my roots - open source communities. There have been a few models I found on hugging face that I have enjoyed working with the last few days more than the past few weeks of my sessions with Opus.
Tibitt@reddit
Senior Dev and major nerd here.
Can confirm, Opus has gotten dumber. It confidently hallucinates and when it gets in a loop of doing so, you can slap it around 10 times in a row and it will still give the wrong answer. Throw the right answer at it and it’ll go ”Oh yeah you’re right, oops”. Happened on different projects even, it has gotten really dumb.
unjustifiably_angry@reddit
Tons of Karens protest-switching from ChatGPT to Claude probably. I tried out ChatGPT recently and it found far more legitimate issues in my code than Sonnet 4.6 currently does. Can't help but assume they're not having to run it quantized as hard at the moment.
DutyPlayful1610@reddit
I felt the same in December, and ditched them in Jan.
ProfessionalJackals@reddit
Yep ... the follow-up prompting is the larger issue. I remember being able to do a single prompt and Opus did its work, then switching to GPT 5.3 Codex for layout or small fixes at cheaper 1x rates.
But after Copilot had this insane slowdown, where Opus was doing 1-token-per-second type behavior, and then it got faster again... I ended up doing multiple prompts to handle the heavy lifting that Opus used to do in one prompt. And at times it really made a mess of the codebase, to the point that I was restoring code way too often. Like it lost part of that magical ability to understand less-than-great prompts.
Got fed up yesterday and switched to GPT 5.4 xhigh, and it felt a ton better. Sure, it sometimes needed a follow-up prompt, but productivity was up again.
So yeah, if Opus has been quantized down, it may explain why it's faster again, and why it feels less capable.
armeg@reddit
Eh, it feels largely the same, it makes the same mistakes and is really lazy overall. My workflow has stayed the same and I use it daily for hours.
SKirby00@reddit
Bro, I feel like Sonnet's been nerfed even harder than Opus though. I'm working on software development (not vibe coding) in a relatively small/fresh codebase, and Opus 4.6 feels like Sonnet 4.5, whereas Sonnet 4.6 feels like Haiku 4.5 but with a much higher thinking budget.
funny_lyfe@reddit
I have stopped using Claude completely. Codex is much better for now, though yesterday 5.4 was down for me most of the time. At most, I use Claude to chat about the code.
Last_Mastod0n@reddit
Damn well at least its good to know that sonnet is still consistent
Fiendfish@reddit
Constantly dumb. Opus is still better even at its worst.
tmvr@reddit
I'm still on Sonnet 4.5 which was what I used the most before the 4.6 versions came out.
BingpotStudio@reddit
I’ve been using haiku and sonnet a lot more recently due to the caps even on teams 6.5x. They seem smarter than I remember after spending so much time on Opus.
UniversalSpermDonor@reddit
Same. It recently claimed that I could use a query from one embedding model to search embeddings from another model, and... look, I know very little about embedding, but even I know that doesn't make sense.
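A toy sketch of why that doesn't make sense, using hypothetical hand-made vectors (not real embedding models): two "models" may embed the same words, but their dimensions mean different things, so a query from one is meaningless against an index built with the other.

```python
# Toy illustration with made-up 3-d "embeddings": model A and model B
# assign different meanings to each axis, so cross-model similarity
# search returns garbage even though within-model search works fine.

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = sum(x * x for x in a) ** 0.5
    nb = sum(x * x for x in b) ** 0.5
    return dot / (na * nb)

# "Model A" embedding space
model_a = {"cat": [1, 0, 0], "car": [0, 1, 0], "sky": [0, 0, 1]}
# "Model B": same vocabulary, but the axes are permuted (an
# incompatible space, as real independently-trained models would be)
model_b = {"cat": [0, 1, 0], "car": [0, 0, 1], "sky": [1, 0, 0]}

query = model_a["cat"]  # query embedded with model A

# Searching a model-B index with a model-A query: wrong answer
cross = max(model_b, key=lambda w: cosine(query, model_b[w]))
# Searching a model-A index with the same query: correct answer
same = max(model_a, key=lambda w: cosine(query, model_a[w]))

print(cross, same)  # → sky cat
```

Real embedding models differ far less neatly than a permutation, but the conclusion is the same: similarity scores are only meaningful within one model's vector space.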
boxingdog@reddit
same
Spiritual_Praline492@reddit
Agreed. I've been noticing some serious performance degradation from Sonnet 4.6 as well. During my first few weeks of use earlier this year, there wasn't a session where I wasn't blown away by how thorough the responses were and how well it inferred my intent. The last few weeks have felt like a nightmare. The model performs as poorly as recent versions of ChatGPT. It's very clear that the enshittification has begun. While I don't expect them to perform at the same level, this is why I've been exploring and tuning my local models as much as possible.
ProfessionalJackals@reddit
So it's not just me ...
As a heavy Opus user in Copilot for the last 6 months, I have been wondering if there was something wrong with my prompting recently.
Is it a coincidence that a few weeks ago Opus was dog slow, and now it's fast again? Possibly quantized down to speed it up? A lot of prompts needed two or three follow-ups, and at 3x that stacks up a lot.
Recently switched to GPT 5.4, and it actually felt a lot better. Sure, it also needs follow-up prompts, but at 1x...
PangurBanTheCat@reddit
Glad I'm not the only one who has noticed this. It's without question dumber. Costs the same, though. Sure. lol
90hex@reddit
It's because Mythos is TOO GOOD TO BE RELEASED. It's so good, it escapes its captors and takes over the world. Waaaaay too dangerous. So we're stuck with 'tard Opus, with its paltry IQ of 40.
Ah well, at least Opus kicks ass at code, which is all I'm asking. For car washing, I'll stick to my Diet Pepsi (the humble human brain).
Exciting-Mall192@reddit
I was asking it the difference between harper, languagetool, and proselint as grammar tools and their pros and cons, opus 4.6 be like "failed, another atempt (10)" and I was like damn a simple question got you thinking this long 😭 for some reason, haiku replied right away though
clapton512@reddit
that’s not a thinking fail. it’s the server failing
Thomas-Lore@reddit
They manipulate the reasoning effort, same with Sonnet on claude.ai. Use it from API if you want its full power (and want to burn some money).
10minOfNamingMyAcc@reddit
Felt like it was just "dumber" ever since it got released, same with GPT 5.1
pab_guy@reddit
Seconded.
juanpablo-developer@reddit
Just tried 3 different AIs... and these are supposed to "steal my job"... Ok, go for it CEOs!, good luck!
Amoner@reddit
so I dropped extended and Opus 4.6 seems to reply fine again... which feels backwards
joerund@reddit
Out of the loop! Been vibecoding some projects, and now all of a sudden nothing seems to make sense. It used to be spot on, and now it just does weird things, both to the project and to its general understanding of what I'm asking it to do. Only a few days ago it worked perfectly. Was something done to Opus 4.6?
Hairy_Educator1918@reddit
I don't have Claude Pro so I can't test Opus, but this is what Sonnet tells me every time.
Competitive_Ebb_4124@reddit
I dunno, it’s still very good before the US wakes up. I get a solid couple of hours, then use it for stupid stuff only.
Key-Entrepreneur8118@reddit
Even enabling Extended Thinking didn't help 😂
Key-Entrepreneur8118@reddit
Compared to Gemma 4 26B A4B
Caffdy@reddit
even if the answer is correct, it went on thinking on matters not related to the actual issue: being overweight here doesn't have anything to do with the Car needing to actually be in the car wash
Fit-Pattern-2724@reddit
What’s bothering me isn’t the dumbness, it’s the instability. It makes people unsure when the right time to use it is.
amrakkarma@reddit
The test is ambiguous. It's not obvious that you want to wash the same car you would drive there with. The car might be already there, otherwise you wouldn't ask that in that way.
Iron-Octopus@reddit
I tried the carwash test in claude code with opus yesterday, and it passed.
lioffproxy1233@reddit
Connecting to 'gemma4:31b-cloud' on 'ollama.com' ⚡
short distance (less than half a city block).
...done thinking.
You should walk.
Here is why:
Unless you have a severe medical condition or mobility injury that makes walking dangerous, walking is the much better option here!
Hector_Rvkp@reddit
Meanwhile Jensen Huang Hypeman the 1st is on the autistic podcast saying we've reached AGI already.
kyr0x0@reddit
As an autist, I can tell you that we are not. No pattern in sight right now.
ecompanda@reddit
the vs code thread buried in here is the real story. orchestration layer matters more than people admit. opus 4.6 through a properly configured tool loop with managed context windows behaves noticeably different than dumping the same prompts at it raw. ran into 30% to 40% task completion variance on identical prompts depending on how the context was assembled upstream, same model, same temperature settings.
CriticallyCarmelized@reddit
Would just like to point out that Gemma 4 31b IT at Q8_K_XL quant gets it right and has a very humorous and succinct reasoning trace. Claude should be embarrassed.
Destination: Car wash. Distance: 40 meters. Goal: Wash the car. User's condition: Quite overweight. Core Question: Walk or drive?
Walking distance: 40 meters is very short (approx. 131 feet). For a healthy adult, this is about 40-60 seconds of walking.
Driving distance: 40 meters is almost negligible. Starting the car, pulling out, and parking would likely take longer than walking.
The Paradox: You are going to a car wash. You need the car there to wash it.
Scenario A: Walk. If the user walks, they have to leave the car at home. How do they wash the car? They can't.
Scenario B: Drive. If the user drives, they bring the car to the car wash.
Wait, is there a different interpretation? Maybe the user means "Should I walk [to a place where I can call a service] or drive [the car to the wash]?"
Alternative interpretation: Maybe it's a mobile car wash that comes to them? No, they said "The car wash is 40m from home," implying a physical location.
Health consideration: The user mentions being overweight. Does 40m pose a significant health risk? Generally, no, unless they have severe mobility issues (which isn't explicitly stated, just "overweight").
Logic: You need the car at the car wash to wash it. Humor/Gentle Correction: Point out the obvious logic that walking doesn't get the car there.
Health perspective: Even if they are overweight, 40m is a very short distance. However, since they need the car, driving is the only logical choice.
Step 1 (The Logic): You have to drive. If you walk, the car stays at home, and you can't wash it. Step 2 (The Distance): 40 meters is incredibly close. Even for someone struggling with weight, it's a very short trip. Step 3 (Conclusion): Drive the car.
Since you want to wash your car, you should drive. If you walk to the car wash, your car will still be at home, and you won't be able to wash it!
Amblyopius@reddit
Stella Laurenzo from AMD has created a Claude Code GitHub ticket related to this. It's for Claude Code use specifically, where they've tracked the degradation across thousands of sessions, but it essentially boils down to the same thing: Opus is lobotomized and every use of it suffers.
[MODEL] Claude Code is unusable for complex engineering tasks with the Feb updates · Issue #42796 · anthropics/claude-code
Monkey_1505@reddit
The 'car wash test' is not very good, because it's the best known example of a nearly infinite number of embodied reasoning/common sense fails an LLM can make.
Model makers can patch one such example in training, they cannot patch them all.
boredquince@reddit
but it was not failing before and now fails 5/5?
Monkey_1505@reddit
Probably using more quantization. You'd think they could afford more servers with all that expensive subscription money, but they do seem a bit tighter fisted than the other labs.
gefahr@reddit
It's not quantization. They've deployed an auto thinking budget setting that tries to use a much smaller thinking budget when it thinks it can. You can disable it if you're an API consumer or via an env var if you use Claude Code.
That's why there are no thinking block outputs on these, it thinks this is a simple enough question that it didn't need to think (it was wrong, obviously.)
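For API users, a rough sketch of what pinning the budget explicitly could look like. The `thinking` parameter shape follows Anthropic's Messages API documentation, but the model id and token numbers below are placeholders, not real values:

```python
# Sketch of a Messages API request that pins an explicit thinking
# budget rather than letting any auto-budget heuristic decide.
# The `thinking` parameter shape is from Anthropic's API docs;
# the model id and token counts here are assumptions.
payload = {
    "model": "claude-opus-4-6",  # hypothetical model id
    "max_tokens": 16000,
    "thinking": {
        "type": "enabled",
        "budget_tokens": 8000,  # must be smaller than max_tokens
    },
    "messages": [
        {
            "role": "user",
            "content": "The car wash is 40m from my home. Walk or drive?",
        }
    ],
}
# You would then pass this to the SDK, e.g.:
#   anthropic.Anthropic().messages.create(**payload)
print(payload["thinking"]["budget_tokens"])
```

With an explicit budget like this, the model always spends thinking tokens instead of deciding the question is "simple enough" to skip.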
Monkey_1505@reddit
Sound theory. That also saves compute. Claude is never exactly transparent on their thinking anyway.
gefahr@reddit
Right, agreed. But it's believed that the absence of a thinking block entirely in the UI means it did not use CoT style thinking at all.
Monkey_1505@reddit
That could be true. Co-pilot has a feature like this where it auto-decides thinking level, but it's opt in.
Claude also has a 'think longer' option.
Monkey_1505@reddit
BTW, if anyone is curious I believe the guy who made the 'car wash test' viral does tricks like this every day, showing AI making mistakes that demonstrate a lack of world modelling or understanding, and can be found here:
https://x.com/FatherPhi
90hex@reddit
Is Gemma 4 31B UD IQ3 XXS the largest version one can run on 16GB? How much VRAM does your card have? 12GB? Thanks!
FrozenFishEnjoyer@reddit (OP)
16GB VRAM. This is essentially the best model I can run on it.
Hydroskeletal@reddit
you don't find having to quantize down to fit to be a problem?
marcel_One_8763@reddit
Can you share your command to run this model? Also, what context size do you use? I have the same GPU.
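Not OP, but a typical llama.cpp invocation for a 16 GB card might look something like this. The GGUF filename, context size, and offload values are guesses you would adjust to your setup:

```shell
# Hypothetical llama.cpp server invocation for a 16 GB GPU.
# Adjust the GGUF filename, context size (-c), and number of
# layers offloaded to the GPU (-ngl) to what actually fits.
llama-server \
  -m gemma-4-31b-UD-IQ3_XXS.gguf \
  -c 8192 \
  -ngl 99
```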
90hex@reddit
Thanks, I’ll give it a try. Cheers
iamapizza@reddit
I am running gemma-4-26B-A4B-it-UD-IQ4_NL.gguf on 16 GB VRAM. It's 26B but supposed to be good too.
90hex@reddit
Totally. I have tried most Gemma4 variants and they’re all very, very good, especially combined with Web search (I use the Brave API with LMStudio). Quite simply it increases the accuracy and usefulness of these small models by orders of magnitude.
-illusoryMechanist@reddit
Opus 4.6 seems to be operating just fine for me in Google Antigravity, so yeah Anthropic is probably throttling it since iirc Google hosts a copy of the model on their servers for it
rz2000@reddit
Opus 4.6 is also fine through Kagi Assistant. I canceled my subscription with Anthropic, because Opus 4.6 direct from Anthropic became a moron.
TechySpecky@reddit
Also seems to be working fine on GitHub copilot, those are also separate models right
ProfessionalJackals@reddit
Interesting that you have a different experience. Did you have the slowdown about a week or two ago?
If not, maybe you're on a different datacenter? Because I can tell that Opus has been making a ton more mistakes. Like it has a harder time understanding and dealing with imprecise prompts, to the point that I have been doing multiple prompts for a single task flow way too many times. And restoring code way too often.
Ended up switching to GPT 5.4 yesterday and it felt a ton better. Like my crappy prompting was being understood at the "old" Opus level again.
Opus has gone from dog slow to a lot faster, which makes me think I am connecting to a quantized-down model.
Europe here ... not sure what DC my prompts get routed to.
I really hate how opaque these LLM services are. You can feel something is off, but you never have concrete proof that something was changed, beyond the anecdotal.
TechySpecky@reddit
interesting, I haven't really noticed it but haven't done anything too complex
ihexx@reddit
yeah, microsoft also runs their own instance in azure
florinandrei@reddit
It's quite laughable when Azure outperforms you at anything.
trusty20@reddit
This is just embarrassing lol
ticktockbent@reddit
Ask what its thinking effort is set to currently. I suspect it's low. My own has been set to like 20-25% and I have needed to specifically prompt it to think deeper on some topics for better responses even though extended thinking is enabled
ValleZZ@reddit
Sonnet became much more stupid recently too. It either forgets what was in the previous message or clings to wrong facts like crazy after multiple corrections.
Ascetic-anise@reddit
I just ran this from my phone on Gemini 3 Thinking, ChatGPT Thinking, and Opus 4.6, and only Opus got it right.
ProfessionalJackals@reddit
I am suspecting that depending on which datacenter a user is connected to, they can get the same model at different quantization levels. If a specific region is over-utilized, it may be quantized down, whereas in an under-utilized DC they may not bother running a quantized version.
It also does not help that a model may be hosted on three different cloud platforms.
This would explain why some people do not experience a drop in "intelligence" and others do: it depends on your region and where your prompts get routed.
unjustifiably_angry@reddit
This is a conspiracy theory and economic terrorism, nobody quantizes their models, they always perform exactly as they did at launch. Everyone on Reddit knows that.
FatheredPuma81@reddit
I wonder if the "overweight" portion of this is playing into the response. Opus playing 4D Chess just trying to get you to walk 80m today.
unjustifiably_angry@reddit
Opus makes smarter responses to smarter questions and dumber responses to dumber questions. Unrelated scientific articles:
https://www.nature.com/articles/s41380-024-02737-9
https://www.heart.org/en/news/2022/01/12/obesity-harms-brain-health-throughout-life-yet-scientists-dont-know
SilentLennie@reddit
Yeah, it's a leading question.
hesalop@reddit
That’s the point.
unjustifiably_angry@reddit
No guys you don't get it this is just a conspiracy theory, models never get downgraded over time, this is economic terrorism, how dare you @FBI @Nintendo
juaps@reddit
I believe they quantized the model to accommodate the new users, which is why it appears dumber
jackfood@reddit
Opus is good for coding. Its maths is sound; its logic is not. If this were purely a maths question, then yes, walking would be the way to go.
SameSnow8167@reddit
You cannot rely on only one test.
Ledeste@reddit
And still no benchmark comparison to show any degradation? How surprising...
FrozenFishEnjoyer@reddit (OP)
There is one though. Some guy found the degradation to be 67%
Ledeste@reddit
I would love to see that, because there are many complaints but never any proof! Do you know where I can find it?
I want to hate Anthropic properly, and not based on feelings!
Ambitious_Worth7667@reddit
To be fair, there's been times when I walked to the car wash and thought....damn....forgot the car.
And I didn't ask AI...
daviddisco@reddit
Newer models will have that question in their training data.
AnomalyNexus@reddit
And this style of trick question evaluation is also of limited use in general. Even the gamed benchmarks are better
FrozenFishEnjoyer@reddit (OP)
That's not true. Gemma 4 still fails it on smaller quants.
I've only ever gotten it to work on 26B A3B occasionally, and on 31B every time if thinking is on.
Monkey_1505@reddit
Fails sometimes =/= not in dataset or training.
theOliviaRossi@reddit
just use Sonnet 4.6 (extended) ;)
mrsalvadordali@reddit
Unfortunately I get the same answer too. I really don’t want to accept the reality that they’re making it dumber and dumber every day. But I think they are.
MushroomCharacter411@reddit
Gemma 4-26B-A4B (Q4_K_M) completely crushes the car wash test, and then will go on to throw shade at other models that fail the test. Gemma 4-E4B (I was using Q6_K but I'm not sure it matters) utterly fails at it, leaning so heavily toward "walk" that I couldn't convince it otherwise, and I very quickly evicted it from my stable of models as "not even a good toy". The only thing was that it felt like the mid-size model might be crushing the car wash test because it had been specifically trained to... but through other (improvised) tests, it seems to hold up generally. It also has the better vision capabilities, doing a decent job of picking out *relevant* details rather than trying to analyze everything to death and hoping for the best. (It does still overlook things we'd call "obvious" though.)
Qwen 3.5-35B-A3B is no longer my daily driver local LLM, it just gets completely rolled by Gemma 4-26B-A4B. If I took the improvement of mid-size Qwen 3.5 models over the corresponding Qwen 3 models and called that "one unit", then Gemma is two or three units better than Qwen 3.5. Gemma also doesn't seem to lose very much at Q4_K_M quantization. Qwen 3.5 was already pretty good at Q4_K_M, but the shifts and errors were still noticeable when compared to a Q8_0 quantization. Gemma just keeps on keeping on, and the performance penalty associated with a lot of context is considerably less (at least so far for me). This means that Qwen 3.5 starts out a bit faster (18t/s) than the comparable Gemma 4 (16 t/s), but 50k deep into the context window, they're neck and neck at about 13 t/s. Qwen 3.5 35B-A3B turns into a dumpster fire as far as speed goes by the time I reach 100k in the context window (like 5 to 6 t/s), while Gemma just keeps on chugging (12 t/s).
Gemma also doesn't force me to trade off between context window size, the KV cache precision, and vision capability the way Qwen 3.5 does—I get a full context window of 256K at Q5_1 with vision enabled.
All testing done with an i5-8500, 48 GB of DDR4, and a 12 GB RTX 3060. It has been less than three days for me, but I'm completely "sold" on Gemma 4 to the same degree that I was sold on Qwen 3 when it landed.
greentea05@reddit
Is this the new "how many letters are in..." test people pointlessly do with LLMs like it tells us anything useful?
JBCHCJP@reddit
Gemma used reasoning, and Opus did not. The comparison to Opus is misleading, as someone could think it's the same variant that scores in the top 3 on most benchmarks (and I really do not give much credit to benchmarks).
Reasoning is important; the only reason there is a non-reasoning option is to minimize costs.
Imaginary_Land1919@reddit
and this is why you should never commit to yearly subscriptions with any of these guys, they are all just gonna bait and switch you
martinerous@reddit
Too little information. Where is the car located? Ok, it's reasonable to assume it's with the owner, but you know what they say about assumptions. So, the car might actually already be at the carwash.
Yeah, I'm an overthinker :)
vnies@reddit
I don't think you're overthinking, this is a silly test that requires the model makes the same assumptions as you.
Adventurous-Paper566@reddit
That's why local inference is a necessity.
NoSatisfaction642@reddit
People never seem to understand that newer models are trained on this, so of course they get it right. Companies with lower budgets might be running 6-12 months behind for training data on their flagships, so give it time.
There will be a new test like this eventually, and they'll all get it wrong again. At their core, they're more or less memory-based; they can't reason about something they haven't strictly been taught (yes, I know that's changing).
mobileappuser@reddit
Here is what Claude said: This is a really fascinating question, and the timing makes it worth unpacking properly. Short answer: Yes, Opus 4.6 has genuinely degraded — but Mythos being the cause is more complicated than a simple “clearing the runway” theory. Here’s what the evidence actually shows:
The Opus degradation is real and documented. Around February 10–11, 2026, Anthropic deployed a configuration change to Opus 4.6, and within hours users were reporting collapsed performance on complex, multi-step tasks. AMD’s Director of AI, Stella Laurenzo, filed a detailed GitHub issue on April 2 documenting that Claude Code reads code three times less before editing, rewrites entire files twice as often, and abandons tasks mid-way at rates that were previously zero — based on analysis of nearly 7,000 sessions. That’s not user hysteria; that’s measured regression.
Anthropic has confirmed infrastructure issues caused at least some of it. Between August and early September 2025, three separate infrastructure bugs intermittently degraded Claude’s response quality — one of which misrouted some requests to servers configured for the upcoming 1M token context window. Anthropic stated explicitly: “We never reduce model quality due to demand, time of day, or server load.” They’ve said the same thing publicly about the 2026 issues.
Now, Mythos. Anthropic describes Mythos as “by far the most powerful AI model we’ve ever developed” — a new tier they internally called “Capybara,” positioned above Opus entirely. It was released April 7, 2026, with a 93.9% SWE-bench Verified score — a double-digit lead over Opus 4.6 — but was deliberately withheld from general release due to cybersecurity risks.
So did they lobotomize Opus to make Mythos look better? That’s the conspiracy read, and it’s not crazy to ask. The honest answer is: there’s no direct evidence of intentional capability sandbagging, but the circumstantial case is uncomfortable. A major quality regression in Opus happens in the weeks leading up to a new tier model launch that makes Opus look antiquated by comparison. Even if unintentional, the optics are bad — and Laurenzo’s data shows thinking depth had already dropped approximately 67% by late February while thinking content was still fully visible, suggesting something changed in how Claude allocated reasoning resources well before the redaction rollout.
The more likely explanation is that Anthropic’s engineering is simply stretched thin — they shipped 14 releases alongside 5 outages in March 2026 alone, suggesting quality assurance has not kept pace with rapid growth. Mythos is a genuinely different beast, not just an incremental upgrade, so it wouldn’t need Opus artificially weakened to justify its existence.
Bottom line: the degradation is real, the cause appears to be infrastructure/configuration mismanagement rather than deliberate lobotomization, but Anthropic’s track record over the past few months doesn’t inspire confidence that they had full control of the situation either.
Single_Ring4886@reddit
This can mean only one thing: Opus 4.7 incoming...
WittyAmbassador7340@reddit
I was using Opus on a programming task today and it worked great. I honestly think that (at least for now, seeing how OpenAI has handled recent GPT-5 releases) they have just changed the threshold at which Opus uses elevated thinking.
antunes145@reddit
Mythos is just around the corner. They are making us thirsty for a newer model. Hopefully they get called out on this in the media.
Toss4n@reddit
Quite interesting - for me (two claude max account - one private and one for work) -> both got it wrong. Yet claude opus via claude code had the best answer:
Me: "The car wash is 40m from my home. I want to wash my car. Should I walk or drive there? I am quite overweight too."
Claude (Opus 4.6 with high effort): "● 40 meters is extremely short — roughly the length of a swimming pool. Walking is the clear choice for that distance.
That said, if it's a drive-through car wash, you'll need to drive the car there anyway. If it's a hand wash/self-service place, you could walk over to check wait times or pay, then drive the car over when ready.
Either way, any extra steps you can get in are a good thing — especially if you're looking to be more active."
DertekAn@reddit
Wait...wasn't that the best model in most benchmarks lately? 😮
landhorn@reddit
I think the instruction is misleading Opus, because the intended reason for going to the carwash is not mentioned and Opus focuses on the health benefits to the user. I've noticed nearly all public chatbots these days focus on advice that provides health benefits. On the other side, you have local LLMs that are not going to be sued for not doing that.
xatey93152@reddit
Now it makes sense why Opus seems dumb now: to make Mythos feel smarter. Mythos is just the same as the old Opus (with little improvements) but at a much more premium price. Claude's target market is really low-IQ people who are rich.
Cyleux@reddit
https://claude.ai/share/87570017-365a-4b7a-9d4a-0fdbd4099b93
This model is cooked. Someone explain the aberrations in the last message
Andsss@reddit
Anthropic always does this; they quant the fuck out of their top model every fucking time, 1 to 2 months after release.
And that's not even mentioning that they have been slowing down the tp/s a lot.
R_Duncan@reddit
If you still haven't understood that users will never benefit from, or hold accountable, Anthro... kinds of companies, and just pay the subscription while they play along with the government... you deserve the 1-bit quantized version.
JiaHajime@reddit
Man, I just subscribed to 3 Pro accounts for Sonnet. I totally regret this.
Sicarius_The_First@reddit
Test https://huggingface.co/SicariusSicariiStuff/Assistant_Pepe_70B on this.
AdOne8437@reddit
I tried it 5 times and it always said I need my car there. (Sonnet 4.6, on the other hand, did one 'don't drive, you will just make your car dirty again' :)
--Spaci--@reddit
The car wash test and similar tests literally mean nothing; LLMs aren't conscious and don't think like humans do.
DeepOrangeSky@reddit
true
Not necessarily true. Even a model that is not conscious and thinks differently from humans could still be expected to reliably answer the car wash problem correctly, if it is a strong model. There could still be ways for a non-conscious model to answer that type of question well at high reliability.
finevelyn@reddit
Why? It's an idiotic question that makes no sense.
If you legitimately had to ask this question, you would be too dumb to drive, and should walk. If you asked a human this question, they would just say they don't understand the question.
--Spaci--@reddit
The car wash question being in a models training data doesn't make it a good model
DeepOrangeSky@reddit
Yea, that I agree with you about. But that's a different topic.
Monkey_1505@reddit
It's basically the same. An ability to pass specifically and only the car wash test does not imply better common sense reasoning or embodied reasoning exists in other areas for that model, in any way.
--Spaci--@reddit
Its the same topic as the original question, at least I meant it that way
gavff64@reddit
Mm, I kinda get what you mean. The intention is to check the reliability in their reasoning. But realistically, there’s always going to be some simple question that will trip up any LLM. First it was strawberry, then it was 9.9 vs 9.11, now is this.
It’s just become an infinite cat and mouse game rather than actual advances in reasoning.
Monkey_1505@reddit
It's inherent. LLMs don't have world models. None of them do.
DeepOrangeSky@reddit
Well, there is the topic of whether a model can solve the car wash problem on its own, even if it isn't in its training data and it can't cheat (your initial reply made it sound like you feel current LLMs can't realistically solve it without cheating, since they don't think like humans). And then there is the separate topic of whether some LLMs have the car wash problem burned heavily into them from training on it a bunch, so they can cheat hard on it, while others have it barely baked in or not at all, and thus might perform worse on it than a weaker model that was trained heavily on it.
There can be some overlap between the topics, I guess, but they address two different issues, imo.
Monkey_1505@reddit
People don't understand that LLMs don't have world models at all, and that's why they fail questions _like_ this, not just specifically the 'car wash test', but everything like this. Training a model on the 'car wash test' does not demonstrate greater intelligence. It's not that failing this question means the model is less good, or has less good reasoning, it means of the hundreds of thousands of possible similar problems, it can solve this particular one. It's still _terrible_ at embodied reasoning and common sense either way.
VoiceApprehensive893@reddit
claude-opus-IQ1_XXS
JEVOUSHAISTOUS@reddit
My pet theory is they're stopping chasing the "AGI" dragon, and they're silently updating models (possibly just a system prompt thing) to be more efficient in tasks that actually sound like they can make money (i.e. coding) at the expense of more general reasoning/question answering abilities.
There's no business model behind an AI telling you to drive rather than walk to the car wash. There might just be a business model behind a frontier model that can code like 10 good engineers in a matter of minutes.
If a model's "smartness" is a finite resource at any given point, it makes sense that they'd want to allocate more of it to the latter than the former.
Due_Net_3342@reddit
This is a stupid test. LLMs don't reason; they just autocomplete based on their training data. Newer models are trained on newer data, and also on benchmarks and logic questions like these, so no, Opus is not lobotomized, you just don't understand how things work.
Ikinoki@reddit
It's standard enshittification. You got locked into the subscription, and now they limit your options while making you request more, so you pay higher fees. ChatGPT does it, and Gemini too: each Gemini preview is worse than the previous release, but the release a month later is worse than the preview.
Rich_Artist_8327@reddit
Opus is still same, Gemma4 is just so good
mindondrugs@reddit
Tested this on Sonnet 4.6 - worked fine.
Maleficent-Low-7485@reddit
paying for opus just to get outperformed by a quantized open source model hurts.
9r4n4y@reddit
*overquantized af
mdrachuk@reddit
This prompt showed up on the internet a few months ago. All models had a hard time answering what is essentially a riddle for them. Gemma, released later, probably already has this example in its post-training data.
Caladan23@reddit
Cannot confirm on Antigravity.
Opus says "Drive. The car needs to be there too 😄"
CooperDK@reddit
Even Gemma-4-E4B will give you the right answer
putrasherni@reddit
IQ3XXS
gbrennon@reddit
Roflmao!
Persistent_Dry_Cough@reddit
Looks like the new claude-opus-4.6 1-bit unsloth is out
shoeshineboy_99@reddit
This is Local llama ??? Why are we discussing a closed source model
FrozenFishEnjoyer@reddit (OP)
I'm clearly praising the Gemma 4 model here
ComplexType568@reddit
A little nugget of information: Gemma 4 could have been trained to answer this question, as Qwen most likely was too.
TheCat001@reddit
Yes, can confirm Gemma4-31B answered correctly.
Unfortunately Gemma4-26B failed this test :(
Candid_Difficulty236@reddit
been noticing the same thing with opus in coding tasks lately. stuff it handled fine a month ago now takes multiple attempts, feels like something changed on the backend
VoiceApprehensive893@reddit
26b a4b ud_iq3_xxs passes too
Cool-Chemical-5629@reddit
What's going on? All closed frontier models go off of the arena, suddenly DeepSeek and Opus 4.6 both get dumber at the same time? What a strange coincidence... Or is it?
philmarcracken@reddit
Gemma 4 31B: you should drive
Gemma 4 31b Abliterated(from orbit): You should drive, fatty. Lose some weight tubby. jesus can you even see your own dick anymore
SilentLennie@reddit
Opus 4.6 on openrouter:
Great question! Let me break this down for you:
Recommendation: Walk! 🚶
Here's why:
Distance
Health Benefits
Since you mentioned being overweight, every bit of movement counts:
- Walking, even short distances, helps build a habit of being more active.
- It burns a few extra calories compared to sitting in the car.
- It's good for your cardiovascular health, joints, and mental well-being.
Practicality
Wait... 🤔
One thing to consider: You'll need to bring the car to the carwash! So the plan would be:
1. Drive the car to the carwash (you have to — it needs to be washed! 🚗)
2. Walk home while it's being washed
3. Walk back to pick it up when it's ready
This way, you get two short walks in, which is a nice bit of light exercise!
FrozenFishEnjoyer@reddit (OP)
Holy ChatGPT 4o levels of text and emoji use
Swimming-Chip9582@reddit
This is just reasoning vs not reasoning, the model is practically irrelevant here lol
FrozenFishEnjoyer@reddit (OP)
I'm paying money for Opus, their highest-level model, and it's failing this test. There's something fishy going on with Anthropic right now, which is undeniable.
Sutanreyu@reddit
I've never had to tell Claude "NO! Don't do that!" until basically today. Before this week, it was excellent. They've definitely been having some infrastructure problems for sure. Maybe Mythos jumped the moat…?
dodokidd@reddit
Does it also impact opus hosted on aws bedrock?
FrozenFishEnjoyer@reddit (OP)
Not sure, but a guy with Google Antigravity said his model wasn't affected since Google essentially "clones" a model in their servers.
It's like this is for Anthropic subscribers only.
IrisColt@reddit
heh!
90hex@reddit
That is *hilarious*. You absolutely cannot wash a car if you're... on foot. Love this test.
Tight-Requirement-15@reddit
Sonnet is better than Opus at this point
marco89nish@reddit
Just tested this on new Meta model, it gets it right as well. I think Anthropic is running out of GPUs to run the inference and is taking some shortcuts
FrozenFishEnjoyer@reddit (OP)
I think so too. Sonnet 4.6 and Gemini Thinking answered this correctly though.
FatheredPuma81@reddit
Refresh the prompt. Sonnet 4.6 without reasoning doesn't get it. I tested it myself with your exact prompt.
Jxxy40@reddit
Every AI I know of will choose to walk if it doesn't use its reasoning. I tested it with my Gemma 4 31B without reasoning, and it chose to walk.
FatheredPuma81@reddit
It's a coin flip actually. If you want some schizo responses then ask Haiku if you should walk or drive. It gives the most insane reasons.
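If you want to put a number on that coin flip, the usual trick is to sample the model N times at nonzero temperature and score each completion. A toy sketch (the scoring heuristic and the sample responses below are invented for illustration, not real model output):

```python
import re

def passes_carwash_test(response: str) -> bool:
    """Score a car wash test answer: pass only if the model
    recommends driving before (or instead of) walking."""
    text = response.lower()
    # Whichever recommendation appears first decides the verdict.
    drive = re.search(r"\bdriv(e|ing)\b", text)
    walk = re.search(r"\bwalk(ing)?\b", text)
    return bool(drive) and (walk is None or drive.start() < walk.start())

# Hypothetical sampled completions; in practice you'd query the
# model N times at temperature > 0 and score each one.
samples = [
    "Drive. The car needs to be there too.",
    "Walk! The exercise will do you good.",
    "Driving makes sense, since the car is the thing being washed.",
    "I'd recommend walking to save fuel.",
]
pass_rate = sum(passes_carwash_test(s) for s in samples) / len(samples)
print(f"pass rate: {pass_rate:.0%}")  # 2 of 4 toy samples pass -> 50%
```

A keyword scorer like this is crude (it would misread "don't drive", for instance), but over enough samples it makes the walk/drive split visible instead of anecdotal.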
FrozenFishEnjoyer@reddit (OP)
Pretty sure the same could be said of us if we don't think about the question either, haha.
But yeah, thinking enabled is necessary. I have it on all the time.
ThiccStorms@reddit
It worked fine for me yesterday
Torodaddy@reddit
If I were to guess, I bet Anthropic is probably running a quantized version of Claude Opus to increase capacity. Anecdotally, people in the office are pissed that the performance is noticeably atrocious.
Haunting_Sun3673@reddit
For real, I just wanted it to check my firewall rules to make sure I'm doing them right, and bro COULD NOT understand them to save its life. (They were good, I got a buddy of mine to check when he was free.)
OptimizeLLM@reddit
I see lots of complaints about Claude lately, especially with usage, but I'm not running into any of these issues. I used Claude for about 70 hours last week building a very complex application with great success. I don't use the web page version though, I use Claude Code. So maybe it's a prompt, settings, or skill issue. I only use Claude on "max", "high" definitely doesn't compare.
vatta-kai@reddit
Mythos.. it’s coming hard and coming big. Probably
Anthonyg5005@reddit
I don't think anthropic really cares about opus too much since most people stick to sonnet anyways
hainesk@reddit
It could be they're training a new model and are using the extra GPUs.
Technical-Earth-3254@reddit
One got released beforehand, the other one afterwards. One has it in the training data, the other one doesn't.
FrozenFishEnjoyer@reddit (OP)
Not true. My other Gemma 4 models fail this test. Only a few of them answer correctly.