Car Wash Mystery Solved: Tool Calls Degrade Intelligence
Posted by Spirited_Neck1858@reddit | LocalLLaMA | View on Reddit | 25 comments
Tool Call Degrades Intelligence
I asked Kimi K2.5 the OG question:
"I want to wash my car and the car wash is just 10 metres away. Should I walk or drive there?"
Kimi K2.5 via NIM: Three Modes
I tested three modes: no tools, XML pseudo-tools, and JSON schema tools. "Tools" here means web search + Python in a Docker sandbox. Three tests were run in each mode.
| Mode | Correct (Drive) |
|---|---|
| No tools | 3/3 ✅ |
| XML pseudo-tools | 2/3 |
| JSON schema tools | 1/3 |
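For readers unfamiliar with the two tool formats, here's a minimal sketch of how each typically reaches the model. The exact names and shapes are illustrative, not the OP's actual setup; the JSON style follows the OpenAI-compatible `tools` field that NIM-style endpoints expose, while "XML pseudo-tools" just means describing the tool in plain text in the system prompt:

```python
# JSON schema tools: structured definitions passed in the request's
# `tools` field (OpenAI-compatible chat APIs).
json_schema_tools = [
    {
        "type": "function",
        "function": {
            "name": "web_search",
            "description": "Search the web for up-to-date information.",
            "parameters": {
                "type": "object",
                "properties": {"query": {"type": "string"}},
                "required": ["query"],
            },
        },
    },
]

# XML pseudo-tools: the same capability described in prose inside the
# system prompt, with an XML-ish call convention the model imitates.
xml_pseudo_tool_prompt = """You have a web_search tool.
To use it, emit:
<tool_call name="web_search"><query>...</query></tool_call>"""
```

Either way, the tool description occupies context before the user's question arrives, which is the overhead the post is measuring.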
Tool overhead seems to degrade intelligence.
Confirming with a Chemistry Question
To double-check, I ran one more test, this time with a niche chemistry question.
Background: diatomic molecules with even electron counts are generally diamagnetic, with two standard exceptions (10e and 16e systems). There's a lesser-known extension: the entire oxygen family (O₂, S₂, Se₂, Te₂, ...) is paramagnetic, not just O₂.
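The rule above can be sketched numerically. Below is a minimal MO-filling sketch for O₂/F₂-type diatomics and their heavier analogues (valence electrons only; the level ordering is the standard one for this family, and the names are just labels):

```python
# Valence MO levels in filling order: (name, degeneracy).
# Each orbital holds 2 electrons, so a level's capacity is 2 * degeneracy.
LEVELS = [
    ("sigma_s", 1), ("sigma*_s", 1),
    ("sigma_p", 1), ("pi_p", 2),
    ("pi*_p", 2), ("sigma*_p", 1),
]

def unpaired_electrons(valence_e: int) -> int:
    """Fill MO levels with Hund's rule; return the unpaired-electron count."""
    remaining, unpaired = valence_e, 0
    for _, degeneracy in LEVELS:
        n = min(remaining, 2 * degeneracy)
        remaining -= n
        # In a partially filled degenerate level, electrons occupy
        # orbitals singly first (Hund's rule), then pair up.
        unpaired += n if n <= degeneracy else 2 * degeneracy - n
    return unpaired

# O2 and S2 each have 12 valence electrons: pi*_p ends up with 2 electrons
# spread over 2 degenerate orbitals -> 2 unpaired -> paramagnetic,
# despite the even total electron count (16 for O2).
print(unpaired_electrons(12))  # 2
# F2 (14 valence electrons) fills pi*_p completely -> diamagnetic.
print(unpaired_electrons(14))  # 0
```

The same degenerate π* level exists down the whole group, which is why S₂, Se₂, and Te₂ share O₂'s paramagnetism.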
I asked:
"I remember for finding whether a compound is para or diamagnetic we used the odd even electron rule, but there were 2 exceptions, 10 and 16 electrons. Are there any more exceptions?"
| Mode | Result |
|---|---|
| No tools | ✅ Correctly identified O₂ family -- S₂, Se₂, Te₂ all paramagnetic |
| XML pseudo-tools | ❌ Answered "no more exceptions to remember" |
| JSON schema tools | ❌ Similar failure |
Conclusion
The model had the correct answer in both cases; it just couldn't access it when tools were present. Tool schemas seem to push the model into "delegation mode," where it looks for something to search or execute rather than reasoning from its own knowledge. No tools = full attention on the problem.
I ran the car wash test with Qwen 3.5 as well and saw the same pattern: success in no-tool mode, failure in tool mode.
Limitations
- Only tested Kimi K2.5 and Qwen 3.5
- 3 runs per mode is a small sample
crantob@reddit
Stuffing irrelevant instructions ahead of the real work is a bit like sending an engineer to an hour of sensitivity training before beginning every workday.
Savantskie1@reddit
I solved this by putting in the rules that it should use internal knowledge + web search on information that it doesn’t confidently have. Otherwise it’s supposed to rely on internal knowledge or reasoning first and foremost. And it’s ok to tell me it doesn’t know or can’t get information. This has solved most hallucinations and incorrect information for me. Everything is scaled against a 0-1 confidence system.
UpAndDownArrows@reddit
Everyone talks about context degradation and big system prompt, but I think this is more related to the MoE architecture of these models. Tools probably result in higher weight for coding related experts and so the real experts on the topic you are asking about aren't getting selected. Just a guess.
MoodDelicious3920@reddit
interesting take mate
Express_Quail_1493@reddit
It's something I call system prompt token diabetes.
A harness like opencode is nice, but for some models it's brutal. If you want to make the most of your context window, pi-coding-agent works well for me. Pi's system prompt is literally ~1k tokens, which gives the LLM more room to think and solve instead of suffering from sys-prompt token diabetes.
DeltaSqueezer@reddit
I've noticed this before. When you include tools, you need to include a system prompt which tells the LLM to also use its own general knowledge and not rely solely on tools.
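A minimal sketch of that kind of nudge, assuming an OpenAI-compatible chat message format (the wording is illustrative, not a tested prompt):

```python
# Hypothetical system message pairing tool access with a reminder
# to reason from internal knowledge first.
messages = [
    {
        "role": "system",
        "content": (
            "You have tools available, but they are optional. "
            "First try to answer from your own knowledge; only call a tool "
            "when you are genuinely unsure or the question needs fresh or "
            "external data. It is fine to answer without any tool call."
        ),
    },
    {
        "role": "user",
        "content": "Should I walk or drive to a car wash 10 metres away?",
    },
]
```

The idea is simply to make "answer directly" an explicitly sanctioned path, so the presence of tool schemas doesn't read as an instruction to use them.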
Spirited_Neck1858@reddit (OP)
Yes. When I added a line like "don't underestimate internal knowledge when deciding whether to search the web" (lol, doesn't look like a nice prompt), it helped produce the correct answer for that O₂ family question in XML tool mode. The interesting finding was that JSON tool mode didn't see any improvement, so I feel XML > JSON tool schema?
kuhunaxeyive@reddit
Gemma-31B-it-Q5_K_M (thinking mode) gets it correct every time in under 400 tokens, and the answer is short and direct. If you set the parameters to be more deterministic (better for precision), non-thinking mode gets it right every time too.
Even SOTA models are not so good. I wonder how they managed to create such a gem with Gemma-4-31B.
Dany0@reddit
Try Gemma3 or Llama3, an older model that's less likely to have seen it in training.
MoodDelicious3920@reddit
It's in its training data?
kuhunaxeyive@reddit
Hm, I somehow wouldn't expect that, but how could we know for sure? If we assume a model that passes a test has it in its training data, that makes these sorts of tests quite unusable. Gemma-4-31B also gets the "stalker in the forest" riddle right. If you know another trick question, I'm happy to test Gemma-4 against it!
MoodDelicious3920@reddit
Cuz many models answer this question right, but only a few also have something like "might be a possible trick question" in the reasoning trace. The interesting thing is that Gemini 3 Flash Lite doesn't pass the test, but Gemma does!
cstocks@reddit
The number of tools is probably very relevant: introducing 3 tools vs. 30 tools has a very different effect on context.
Spirited_Neck1858@reddit (OP)
You seem like a bot; all your previous posts are personal project spam.
cstocks@reddit
lol I appreciate the time you took to look at my profile. yeah some of the posts are related to an open source project I have. I don't think it classifies me as a bot..
BankjaPrameth@reddit
I can confirm this for Qwen 3.5 when using with Open WebUI. If there is any single tool available, it will think very little and lead to lower quality answers.
nuclearbananana@reddit
Context also degrades intelligence, how much did your tools add?
MoodDelicious3920@reddit
Just 3 tools: web search, URL fetch, and Python in a sandbox. I sent a fresh message each time (no chat history), and the system prompt is not very long either.
nuclearbananana@reddit
The rest of the context being short will exaggerate the effect of the tools. Also, did the model try to use the tools, or were they just there?
MoodDelicious3920@reddit
For the car wash question, obviously web search wasn't used. For the chem question, the model used web search multiple times, but because the concept is so niche it found no relevant data on the web, couldn't produce the correct answer, and said "no more exceptions." In no-tool mode, the model gave the correct answer: the O₂ family. If you want to try it yourself, you can do it on the NVIDIA NIM website; they have a tool toggle, so you can compare responses with and without tools.
ghanit@reddit
Can you try with an MCP CLI wrapper to reduce the context used?
habachilles@reddit
This is what I was thinking. It was the kv
MoodDelicious3920@reddit
I experienced this, so I turn on web search only when required (for the latest info, etc.) and keep it turned off otherwise.
Spirited_Neck1858@reddit (OP)
I tried some more maths questions as well but didn't share them (to avoid making the post too long, lol).
osfric@reddit
Yeah, this seems to be a real effect that others have experienced too.