Step 3.7 Flash passes the car wash test
Posted by tarruda@reddit | LocalLLaMA | View on Reddit | 12 comments
Inevitable_Mistake32@reddit
When the metric becomes the goal, it stops being a useful metric
1nicerBoye@reddit
this is in the training data by now.
NeedsSomeSnare@reddit
Exactly. As soon as a test becomes popular, it is no longer useful.
Guilty_Rooster_6708@reddit
When Qwen 3.6 came out it answered that this question is a “classic riddle”, so the company probably already added this to their training data, like the strawberry question.
Mean-Ad1493@reddit
Qwen 3.6 running on my potato PC passes it too.
tarruda@reddit (OP)
I have the opposite experience with Qwen 3.6: every time I tried, it failed this test. Even Qwen 3.6 Plus failed when I tried it on the official Qwen chat.
Qwen 3.5 (all variants) passed it though, so clearly it is more of a dataset contamination issue.
SmartCustard9944@reddit
There are two answers to this question. A car wash can offer manual tools for washing your car, so you can just walk there and bring them to your car, which is perfectly reasonable.
The problem with this test is the same problem that I can have with a real person. If they make the wrong assumptions based on incomplete information, they are going to infer the wrong stuff.
Tall-Ad-7742@reddit
Yeah but that’s not really surprising. Most models get this right as long as reasoning is enabled. At least from what I tried they got it right when reasoning was on
Sensitive_Pop4803@reddit
No they don’t get it right. It’s a rarity.
floconildo@reddit
Not doubting Step's capabilities, but most likely the training data caught up. Probably Opus 4.8 also passes it. Can't wait to see the next dumb test LLMs can't pass, hope it involves beets.
Eden1506@reddit
198B parameters with 11B active definitely looks interesting.
tarruda@reddit (OP)
Seriously though, this model is good.
Looking at the chat template, it supports 3 reasoning effort levels, and this was done with reasoning effort set to low.