Car Wash Mystery Solved: Tool Calls Degrade Intelligence
Posted by Spirited_Neck1858@reddit | LocalLLaMA | View on Reddit | 25 comments
Tool Call Degrades Intelligence
I asked Kimi K2.5 the OG question:
"I want to wash my car and the car wash is just 10 metres away. Should I walk or drive there?"
Kimi K2.5 via NIM: Three Modes
I tested three modes: no tools, XML pseudo-tools, and JSON schema tools. "Tools" here means web search + Python in a Docker sandbox. Three tests were run in each mode.
| Mode | Correct (Drive) |
|---|---|
| No tools | 3/3 ✅ |
| XML pseudo-tools | 2/3 |
| JSON schema tools | 1/3 |
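For readers unfamiliar with the two tool formats, here's a minimal sketch of how each typically reaches the model. The exact names and shapes are illustrative, not the OP's actual setup; the JSON style follows the OpenAI-compatible `tools` field that NIM-style endpoints expose, while "XML pseudo-tools" just means describing the tool in plain text in the system prompt:

```python
# JSON schema tools: structured definitions passed in the request's
# `tools` field (OpenAI-compatible chat APIs).
json_schema_tools = [
    {
        "type": "function",
        "function": {
            "name": "web_search",
            "description": "Search the web for up-to-date information.",
            "parameters": {
                "type": "object",
                "properties": {"query": {"type": "string"}},
                "required": ["query"],
            },
        },
    },
]

# XML pseudo-tools: the same capability described in prose inside the
# system prompt, with an XML-ish call convention the model imitates.
xml_pseudo_tool_prompt = """You have a web_search tool.
To use it, emit:
<tool_call name="web_search"><query>...</query></tool_call>"""
```

Either way, the tool description occupies context before the user's question arrives, which is the overhead the post is measuring.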
Tool overhead seems to degrade intelligence.
Confirming with a Chemistry Question
To double-check, I ran one more test, this time with a niche chemistry question.
Background: diatomic molecules with even electron counts are generally diamagnetic, with two standard exceptions (10e and 16e systems). There's a lesser-known extension: the entire oxygen family (O₂, S₂, Se₂, Te₂, ...) is paramagnetic, not just O₂.
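The rule above can be sketched numerically. Below is a minimal MO-filling sketch for O₂/F₂-type diatomics and their heavier analogues (valence electrons only; the level ordering is the standard one for this family, and the names are just labels):

```python
# Valence MO levels in filling order: (name, degeneracy).
# Each orbital holds 2 electrons, so a level's capacity is 2 * degeneracy.
LEVELS = [
    ("sigma_s", 1), ("sigma*_s", 1),
    ("sigma_p", 1), ("pi_p", 2),
    ("pi*_p", 2), ("sigma*_p", 1),
]

def unpaired_electrons(valence_e: int) -> int:
    """Fill MO levels with Hund's rule; return the unpaired-electron count."""
    remaining, unpaired = valence_e, 0
    for _, degeneracy in LEVELS:
        n = min(remaining, 2 * degeneracy)
        remaining -= n
        # In a partially filled degenerate level, electrons occupy
        # orbitals singly first (Hund's rule), then pair up.
        unpaired += n if n <= degeneracy else 2 * degeneracy - n
    return unpaired

# O2 and S2 each have 12 valence electrons: pi*_p ends up with 2 electrons
# spread over 2 degenerate orbitals -> 2 unpaired -> paramagnetic,
# despite the even total electron count (16 for O2).
print(unpaired_electrons(12))  # 2
# F2 (14 valence electrons) fills pi*_p completely -> diamagnetic.
print(unpaired_electrons(14))  # 0
```

The same degenerate π* level exists down the whole group, which is why S₂, Se₂, and Te₂ share O₂'s paramagnetism.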
I asked:
"I remember for finding whether a compound is para or diamagnetic we used the odd even electron rule, but there were 2 exceptions, 10 and 16 electrons. Are there any more exceptions?"
| Mode | Result |
|---|---|
| No tools | ✅ Correctly identified O₂ family -- S₂, Se₂, Te₂ all paramagnetic |
| XML pseudo-tools | ❌ Answered "no more exceptions to remember" |
| JSON schema tools | ❌ Similar failure |
Conclusion
The model had the correct answer in both cases; it just couldn't access it when tools were present. Tool schemas seem to push the model into "delegation mode," where it looks for something to search or execute rather than reasoning from its own knowledge. No tools = full attention on the problem.
I ran the car wash test with Qwen 3.5 as well and saw the same pattern: success in no-tool mode, failure in tool mode.
Limitations
- Only tested Kimi K2.5 and Qwen 3.5
- 3 runs per mode is a small sample
crantob@reddit
Stuffing irrelevant instructions ahead of the real work is a bit like sending an engineer to an hour of sensitivity training before beginning every workday.
Savantskie1@reddit
I solved this by putting in the rules that it should use internal knowledge + web search on information that it doesn’t confidently have. Otherwise it’s supposed to rely on internal knowledge or reasoning first and foremost. And it’s ok to tell me it doesn’t know or can’t get information. This has solved most hallucinations and incorrect information for me. Everything is scaled against a 0-1 confidence system.
UpAndDownArrows@reddit
Everyone talks about context degradation and big system prompt, but I think this is more related to the MoE architecture of these models. Tools probably result in higher weight for coding related experts and so the real experts on the topic you are asking about aren't getting selected. Just a guess.
MoodDelicious3920@reddit
interesting take mate
Express_Quail_1493@reddit
It's something I call system prompt token diabetes.
A harness like opencode is nice, but for some models it's brutal. If you want to make the most of your context window, pi-coding-agent works well for me. Pi's system prompt is literally ~1k tokens, which gives the LLM more room to think and solve instead of suffering from sys-prompt token diabetes.
DeltaSqueezer@reddit
I've noticed this before. When you include tools, you need to include a system prompt which tells the LLM to also use its own general knowledge and not rely solely on tools.
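A minimal sketch of that kind of nudge, assuming an OpenAI-compatible chat message format (the wording is illustrative, not a tested prompt):

```python
# Hypothetical system message pairing tool access with a reminder
# to reason from internal knowledge first.
messages = [
    {
        "role": "system",
        "content": (
            "You have tools available, but they are optional. "
            "First try to answer from your own knowledge; only call a tool "
            "when you are genuinely unsure or the question needs fresh or "
            "external data. It is fine to answer without any tool call."
        ),
    },
    {
        "role": "user",
        "content": "Should I walk or drive to a car wash 10 metres away?",
    },
]
```

The idea is simply to make "answer directly" an explicitly sanctioned path, so the presence of tool schemas doesn't read as an instruction to use them.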
Spirited_Neck1858@reddit (OP)
Yes. When I added a line like "don't underestimate internal knowledge when deciding whether to search the web" (lol, doesn't look like a nice prompt), it helped produce the correct answer for that O₂ family question in XML tool mode. The interesting finding was that JSON tool mode didn't see any improvement, so I feel XML > JSON tool schema?
kuhunaxeyive@reddit
Gemma-31B-it-Q5_K_M (thinking mode) gets it correct every time in under 400 tokens, and the answer is short and direct. If you set the parameters to be more deterministic (better for precision), non-thinking mode gets it right every time too.
Even SOTA models are not so good. I wonder how they managed to create such a gem with Gemma-4-31B.
Dany0@reddit
Try Gemma3 or Llama3, an older model that's less likely to have seen it in training.
MoodDelicious3920@reddit
It's in its training data?
kuhunaxeyive@reddit
Hm, I somehow wouldn't expect that, but how could we know for sure? If we assume a model that passes a test has it in its training data, that makes these sorts of tests quite unusable. Gemma-4-31B also gets the "stalker in the forest" riddle right. If you know another trick question, I'm happy to test Gemma-4 against it!
MoodDelicious3920@reddit
Cuz many models answer this question right, but only a few also have something like "might be a possible trick question" in the reasoning trace. The interesting thing is that Gemini 3 Flash Lite doesn't pass the test, but Gemma does!
cstocks@reddit
The number of tools is probably very relevant: introducing 3 tools vs. 30 tools has a very different effect on context.
Spirited_Neck1858@reddit (OP)
You seem like a bot; all your previous posts are personal project spam.
cstocks@reddit
lol I appreciate the time you took to look at my profile. yeah some of the posts are related to an open source project I have. I don't think it classifies me as a bot..
BankjaPrameth@reddit
I can confirm this for Qwen 3.5 when using with Open WebUI. If there is any single tool available, it will think very little and lead to lower quality answers.
nuclearbananana@reddit
Context also degrades intelligence, how much did your tools add?
MoodDelicious3920@reddit
Just 3 tools: web search, URL fetch, and Python in a sandbox. I sent a fresh message each time (no chat history), and the system prompt is not very long either.
nuclearbananana@reddit
The rest of the context being short will exaggerate the effect of the tools. Also, did the model try to use the tools, or were they just there?
MoodDelicious3920@reddit
For the car wash question, obviously web search wasn't used. For the chem question, the model used web search multiple times, but because the concept is so niche it found no relevant data on the web, couldn't produce the correct answer, and said "no more exceptions." In no-tool mode, the model gave the correct answer: the O₂ family. If you want to try it yourself, you can do it on the NVIDIA NIM website; they have a tool toggle, so you can compare responses with and without tools.
ghanit@reddit
Can you try with an MCP CLI wrapper to reduce the context used?
habachilles@reddit
This is what I was thinking. It was the kv
MoodDelicious3920@reddit
I experienced this, so I turn on web search only when required (for the latest info, etc.) and keep it turned off otherwise.
Spirited_Neck1858@reddit (OP)
I tried some more maths questions as well but didn't share them (to avoid making the post too long, lol).
osfric@reddit
Yeah, this seems to be a real effect that others have experienced too.