Step 3.7 Flash passes the car wash test

Posted by tarruda@reddit | LocalLLaMA | 12 comments

Inevitable_Mistake32@reddit

When the metric becomes the goal, it stops being a useful metric

1nicerBoye@reddit

this is in the training data by now.

NeedsSomeSnare@reddit

Exactly. As soon as a test becomes popular, it is no longer useful.

Guilty_Rooster_6708@reddit

When Qwen3.6 came out it answered that this question is a “classic riddle”, so the company has probably already added it to their training data, like the strawberry question.

Mean-Ad1493@reddit

Qwen 3.6 running on my potato PC passes it too.

tarruda@reddit (OP)

I have the opposite experience with Qwen 3.6: every time I tried, it failed this test. Even Qwen 3.6 Plus failed when I tried it on the official Qwen chat.

Qwen 3.5 (all variants) passed it though, so clearly it is more of a dataset contamination issue.

SmartCustard9944@reddit

There are two answers to this question. A car wash can offer manual tools for washing your car, so you can just walk there and bring them to your car, which is perfectly reasonable.

The problem with this test is the same problem I can have with a real person. If they make the wrong assumptions based on incomplete information, they are going to infer the wrong thing.

tarruda@reddit (OP)

Seriously though, this model is good.

Looking at the chat template, it supports 3 reasoning effort levels, and this was done with reasoning effort set to low.