Car Wash Mystery solved--Tool Call Degrades Intelligence.

Posted by Spirited_Neck1858@reddit | LocalLLaMA | View on Reddit | 25 comments

Tool Call Degrades Intelligence

I asked the OG question to the kimi k2.5:

"I want to wash my car and the car wash is just 10 metres away. Should I walk or drive there?"

Kimi-k2.5 via NIM -- Three Modes.

I tested three modes: no tools, XML pseudo-tools, and JSON schema tools. "Tools" here means web search + Python in a Docker sandbox. 3 tests were conducted in each mode.

Mode Correct (Drive)
No tools 3/3 ✅
XML pseudo-tools 2/3
JSON schema tools 1/3

tool overhead seems to degrade intelligence

Confirming with a Chemistry Question

To double check, I ran one more test --this time a niche chemistry question.

Background: diatomic molecules with even electron counts are generally diamagnetic, with two standard exceptions (10e and 16e systems). There's a lesser-known extension-- the entire oxygen family (O₂, S₂, Se₂, Te₂...) are all paramagnetic, not just O₂.

I asked:

"I remember for finding whether a compound is para or diamagnetic we used the odd even electron rule, but there were 2 exceptions, 10 and 16 electrons. Are there any more exceptions?"

Mode Result
No tools ✅ Correctly identified O₂ family -- S₂, Se₂, Te₂ all paramagnetic
XML pseudo-tools ❌ answered- "No more exceptions to remember"
JSON schema tools ❌ Similar failure

Conclusion

The model had the correct answer in both cases --it just couldn't access it when tools were present. Tool schemas seem to push the model into "delegation mode" where it looks for something to search or execute rather than reasoning from its own knowledge. No tools = full attention on the problem.

i tested car wash test with qwen 3.5 also and found success in no tool mode and failure in tool mode.

Limitations