What are you using to work around inconsistent tool-calling on local models? (like Qwen)
Posted by Sutanreyu@reddit | LocalLLaMA | View on Reddit | 15 comments
Been dealing with the usual suspects — Qwen3 returning tool calls as XML, thinking tokens eating the whole response, malformed JSON that breaks the client. Curious what approaches people are using.
I've tried prompt engineering the model into behaving, adjusting system messages, capping max_tokens — none of it was reliable enough to actually trust in a workflow.
Eventually just wrote a proxy layer that intercepts and repairs responses before the client sees them. Happy to share if anyone's interested, but more curious whether others have found cleaner solutions I haven't thought of.
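For context, the core of the proxy is basically a repair pass like this (a simplified sketch; the tag formats and function name are illustrative, and the real thing handles more edge cases):

```python
import json
import re

THINK_RE = re.compile(r"<think>.*?</think>", re.DOTALL)
TOOL_CALL_RE = re.compile(r"<tool_call>\s*(\{.*?\})\s*</tool_call>", re.DOTALL)

def repair_message(message: dict) -> dict:
    """Strip leaked thinking tokens and lift XML-wrapped tool calls
    into the OpenAI-style tool_calls field before the client sees them."""
    content = message.get("content") or ""
    # Drop any <think> blocks that leaked into the visible content.
    content = THINK_RE.sub("", content)
    tool_calls = message.get("tool_calls") or []
    for i, raw in enumerate(TOOL_CALL_RE.findall(content)):
        try:
            call = json.loads(raw)
        except json.JSONDecodeError:
            continue  # leave truly malformed payloads for a later pass
        tool_calls.append({
            "id": f"repaired_{i}",
            "type": "function",
            "function": {
                "name": call.get("name", ""),
                # OpenAI-style clients expect arguments as a JSON *string*
                "arguments": json.dumps(call.get("arguments", {})),
            },
        })
    # Remove the XML wrappers from the visible content once lifted.
    content = TOOL_CALL_RE.sub("", content).strip()
    return {**message, "content": content, "tool_calls": tool_calls}
```

It sits between the client and the model server and rewrites each assistant message on the way through.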
Final_Ad_7431@reddit
i have never seen qwen3.5 9b or 35b drop a tool call in hermes, personally
Sutanreyu@reddit (OP)
Hermes actually works, imo. It’s in OpenClaw or OpenCode where I’ve had problems with the Qwen models.
mp3m4k3r@reddit
It's possible the chat template shipped with the model version(s) you have isn't properly handling the tooling harnesses. I use between Q4 and Q6 with at least a Q8 KV cache and rarely hit parsing issues in pi-coding-agent and roocode, but occasionally do with the Continue plugin for VS Code (their prompts seem funky) and n8n (usually my fault).
But some of the earlier 3.5 model releases had issues with their ChatML template that got fixed after the initial release, so it might be worth overriding the chat template with one from the official (or the unsloth) models and seeing if that improves things?
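If you end up on llama.cpp directly, the override looks something like this (the filenames here are made up; check the flags against your build's `--help`):

```shell
# Export the fixed Jinja template from the official/unsloth repo into a
# file, then point llama-server at it instead of the one baked into the GGUF:
llama-server \
  -m qwen3.5-9b-q6_k_xl.gguf \
  --jinja \
  --chat-template-file fixed-chatml-template.jinja
```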
Sutanreyu@reddit (OP)
Thanks for the tip. This made me revisit the base model I was using... I had originally pulled 'qwen3.5:9b' straight from Ollama. Then I tried the Q8 version, but it was too slow to use; it'd time out on even short requests. I ended up going with a Q6_K_XL version, where some of the layers are Q8 and the rest are 6-bit. Best of both worlds in that regard.
I've also played around with the different model parameters, like what's suggested on unsloth's "How to Run Locally" Qwen 3.5 page, which helped too: less fabricating facts, and a lot more competent tool-calling. But obviously, at 9B parameters, there's only so much it can do... Right now I only have an RTX 5070 w/ 12GB of VRAM, so this is probably the best model I can squeeze into that memory envelope. I haven't tried the Gemma4 models yet, as I'm holding out hope they'll release a 12B like Gemma3, or something around that size.
mp3m4k3r@reddit
Time will tell! And next month we'll have whatever the new kid on the block is! For the VRAM, it might be worth trying the 35B-A3B with MoE offloading; it's a pretty good model and fast, and I've found it pretty solid overall. Depending on how much context you can get out of it, the Q4s are really not that much different from their Q8 counterparts, and with many of these newer models that saving can easily buy you greater context.
Now that you've played around a bit, it might be worth making the jump to llama.cpp/llama-server, or even a Qwen-specific hosting tool that might optimize a bit further. My rig is basically just for GPUs, so it's all Docker and llama-server; not sure what the current hotness looks like for on-machine local hosting.
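For reference, the MoE offload I mean looks roughly like this on llama-server (flag names are from recent llama.cpp builds and may drift; the model filename is just an example):

```shell
# 35B-A3B with expert (MoE) weights kept on CPU so the attention/shared
# layers fit in 12GB of VRAM; tune --n-cpu-moe down as memory allows.
llama-server \
  -m qwen3.5-35b-a3b-q4_k_m.gguf \
  --n-gpu-layers 99 \
  --n-cpu-moe 24 \
  --ctx-size 32768 \
  --cache-type-k q8_0 --cache-type-v q8_0
```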
abnormal_human@reddit
What models are you using? Are you quantizing them? How much does/does not your harness look like Qwen Code or Claude Code?
I have been using Qwen models heavily for agentic work, mainly the 122B and 397B variants, and have not had most of your issues. Malformed JSON or a switch to XML feels like either a really bad harness or a model that's been quantized to nothing.
DinoAmino@reddit
You guys?!!! Qwen 3 models are heavily trained on XML. There's literally a tool parser for it!
https://docs.vllm.ai/en/stable/api/vllm/tool_parsers/qwen3xml_tool_parser/
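Serving with that parser is basically a one-liner (the parser name here is inferred from that docs URL and the model name is an example; double-check both against your vLLM version):

```shell
# Let vLLM parse Qwen's XML-style tool calls into proper tool_calls JSON
vllm serve Qwen/Qwen3.5-9B \
  --enable-auto-tool-choice \
  --tool-call-parser qwen3_xml
```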
Sutanreyu@reddit (OP)
So what you're saying is I should just ditch Ollama and go to vLLM? Bet.
DinoAmino@reddit
Not if Ollama has some equivalent option to set or override the tool parser. Probably llama.cpp does, idk.
Sutanreyu@reddit (OP)
I'll have to look. I've been meaning to check whether 'structured output' would resolve this, but I haven't found a JSON schema for the Qwen3.5 line. I originally started in LM Studio, then moved over to Ollama; of course, both are backed by llama.cpp, and they both leak XML or JSON sometimes. Like I mentioned in the opening post, I ended up vibe coding a whole proxy layer to mitigate malformed tool calls and the aforementioned leakage, and it does pretty well. It's not perfect; there are still cases I'm trying to sort out. But it actually makes Qwen3.5:9B output clean tool calls and helps it follow through on things like "I'm going to do the task" without just going silent. I guess vLLM is the one to try next?
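On the structured-output front, Ollama does accept an arbitrary JSON schema in the `format` field of `/api/chat`, so you can constrain the shape yourself even without an official schema for the model. Something like this (the schema below is just one I'd write by hand, not anything published for Qwen):

```shell
curl http://localhost:11434/api/chat -d '{
  "model": "qwen3.5:9b",
  "messages": [{"role": "user", "content": "List the files in /tmp"}],
  "stream": false,
  "format": {
    "type": "object",
    "properties": {
      "tool": {"type": "string"},
      "arguments": {"type": "object"}
    },
    "required": ["tool", "arguments"]
  }
}'
```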
DinoAmino@reddit
Or maybe try Qwen CLI?
Sutanreyu@reddit (OP)
I've been using Qwen3.5:9B, the Q4_K_M variant, first through LM Studio and later with Ollama. While it seems better behaved in Ollama, it would still leak tool calls occasionally.
abnormal_human@reddit
Quantizing a 9B model and then expecting reliability is bold. I would troubleshoot by going to an 8bit version, or if that's not feasible locally, testing with OpenRouter. If it's still bad, try escalating to the 35B or 122B. At that point it should be nailing it 99.9%. Stay within the model family to control for differences in harness style during post training.
You might also play with temperature, or look carefully at the full text going into the model when things go wrong to try and understand why.
VoiceApprehensive893@reddit
!YOU DO NOT HAVE A DALLE TOOL!
draconisx4@reddit
Yeah, inconsistent tool-calling like that can turn into a real headache for maintaining control in deployments. I always layer in runtime checks to catch malformed outputs early, which helps enforce oversight without messing up the workflow. That proxy approach you're using sounds solid for adding that extra safety net.
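As a concrete example, my runtime checks are mostly gatekeeping like this before a call gets dispatched (a sketch; the allowed-tool set is obviously app-specific):

```python
import json

def validate_tool_call(call: dict, allowed_tools: set[str]) -> list[str]:
    """Return a list of problems with a model-emitted tool call;
    an empty list means it's safe to dispatch."""
    problems = []
    fn = call.get("function") or {}
    name = fn.get("name")
    if name not in allowed_tools:
        problems.append(f"unknown tool: {name!r}")
    raw_args = fn.get("arguments", "")
    try:
        # Arguments usually arrive as a JSON string; tolerate dicts too.
        args = json.loads(raw_args) if isinstance(raw_args, str) else raw_args
        if not isinstance(args, dict):
            problems.append("arguments did not decode to an object")
    except json.JSONDecodeError:
        problems.append("arguments are not valid JSON")
    return problems
```

Anything that comes back non-empty gets rejected (or handed to a repair pass) instead of silently breaking the workflow downstream.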