What are you using to work around inconsistent tool-calling on local models? (like Qwen)

Posted by Sutanreyu@reddit | LocalLLaMA | View on Reddit | 15 comments

Been dealing with the usual suspects — Qwen3 returning tool calls as XML, thinking tokens eating the whole response, malformed JSON that breaks the client. Curious what approaches people are using.

I've tried prompt engineering the model into behaving, adjusting system messages, capping max_tokens — none of it was reliable enough to actually trust in a workflow.

Eventually just wrote a proxy layer that intercepts and repairs responses before the client sees them. Happy to share if anyone's interested, but more curious whether others have found cleaner solutions I haven't thought of.