Qwen 3.6 seems to have a lot of trouble with tool calling
Posted by Perfect-Campaign9551@reddit | LocalLLaMA | 58 comments
I've used both Codex and OpenCode with Qwen 3.6 27b and 35b running locally.
They continue to constantly barf whenever they need to create files.
For a simple prompt like "Create an HTML/CSS webpage for a salon, showing services, a map, and a contact page"
In OpenCode it kept failing the JSON formatting just to write out the HTML file content.
In Codex it constantly dies with complications trying to write files correctly through PowerShell.
Heck, even for something as simple as changing an address in the HTML, it would be faster for me to do it myself; the tool calling just loops for 1-2 minutes making JSON mistakes, etc.
In OpenCode the AI finally resorted to writing a python script to *create my HTML file*. Like wtf dude just write the text out to a file! You are literally a text machine.
At one point with Codex and Qwen 3.6 35b it kept telling me it had written / created some files, but they never actually got created.
Why does this AI model have nothing but problems simply *creating a damn file*?
It really seems like it has the most trouble with web files like HTML / CSS. I was able to edit a C++ file earlier today without a lot of fuss (surprisingly).
Perfect-Flounder7856@reddit
What parser are you using? I tried Hermes with little luck, going to try pythonic next.
Naiw80@reddit
I can't relate at all to what you're saying. For me Qwen 3.6 sticks to instructions really darn well, and I don't think I've seen it hallucinate a single tool call so far. Parameters, yes, but fairly rarely, and it always corrected itself when it hit an error.
Naiw80@reddit
To add to that, I use (Unsloth) Q4 quants. As for tool calling: for me, if it managed to understand my instructions, tool calling is extremely reliable (unlike Gemma 4, which at best makes one tool call), but I did experience several times where it did not understand the instructions at all.
Healthy-Nebula-3603@reddit
Try OpenCode + llama-server.
I'm using Qwen 27b (Q4_K_XL) with 85k context and a Q8 KV cache.
Everything works great, zero problems calling tools.
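Not that commenter's exact command, but a minimal llama-server launch along those lines might look roughly like this (model path, context size, and port are placeholders; newer builds may want -fa on rather than a bare -fa):
llama-server -m ./Qwen3.6-27B-UD-Q4_K_XL.gguf -c 85000 -ngl 99 \
  -ctk q8_0 -ctv q8_0 -fa --jinja \
  --temp 0.6 --top-p 0.95 --top-k 20 --min-p 0 --port 8080
# -ctk/-ctv q8_0 quantize the KV cache, which usually requires flash attention (-fa)
Then point OpenCode at http://localhost:8080/v1 as an OpenAI-compatible endpoint.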
Gesha24@reddit
That's a negative. Something else isn't right, my Qwen is calling tools quite reliably.
DifferenceCute8951@reddit
Same here, I've had zero issues.
Perfect-Campaign9551@reddit (OP)
Are you on Linux or Windows? I'm on a Windows system.
Gesha24@reddit
llama-server on Linux; the client calling it is on Windows.
Thrumpwart@reddit
I get tool calling failures for both at 100k+ context sometimes. I even run the 35B at BF16 and it still fails tool calls sometimes. The 27B is at Q8_K_XL.
Ok-Measurement-1575@reddit
llama-server on windows, yeh?
Thrumpwart@reddit
I do Llama-Server on Linux, and oMLX on Mac.
kevin_1994@reddit
I get them occasionally too at Q8_0. Once the model fails, you have to rerun its last message, IME, otherwise it will eventually fail more and more.
Pretty rare for me though. Maybe 1/100 tool calls. Maybe even less frequently.
bighead96@reddit
You literally made me LOL at your comment of just write it out to a damn file!
Ok-Measurement-1575@reddit
Known issue on Windows. Posted about it a few days ago.
Perfect-Campaign9551@reddit (OP)
I'd like to see your post, do you have a link?
Ok-Measurement-1575@reddit
Just click my name and click the posts tab bro.
vk3r@reddit
Disable "preserve_thinking". I've read that it's causing problems.
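If you're running llama-server, that toggle is typically passed through the chat-template kwargs, e.g. something like the following (assuming your build and template actually expose this kwarg):
llama-server ... --chat-template-kwargs '{"preserve_thinking": false}'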
Perfect-Campaign9551@reddit (OP)
That is turned off
natermer@reddit
My first guess would be that you have a bunch of extensions and skills installed.
The more "stuff" you have enabled and turned on for an LLM, the dumber it gets, the worse it performs, and the less data it can handle before it falls over.
Also longer sessions make things worse. It also costs you excess tokens/money when running against hosted LLMs.
This is often called "context rot".
Another possibility is that if you are using something like llama.cpp, the context window can be pretty small by default.
Like if you are running a model with a 4096 context, then... yeah, reading in a single file can easily cause it to fall over. Depending on the model and other settings it will try to do compaction, but that may not work out that well.
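For example, an illustrative llama-server flag to raise it (model path and size are placeholders):
llama-server -m ./model.gguf -c 65536   # -c sets the context window in tokens; defaults can be as small as 4096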
Also could be driver issues or memory issues. I've had LLMs running locally that just "die" and crash out on occasion just because of GPU issues or related stuff.
Perfect-Campaign9551@reddit (OP)
I don't have any extensions or skills. This is just a basic install, setup, and use.
Historical_Ease_1525@reddit
This is a known issue especially with vLLM. You need to use this jinja template in order to fix at least some of the bugs.
https://huggingface.co/froggeric/Qwen-Fixed-Chat-Templates/blob/main/qwen3.6/chat_template.jinja
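In case it helps, pointing vLLM at that template is roughly something like this (the model id is a placeholder, and the right --tool-call-parser for Qwen is an assumption worth double-checking against the vLLM docs):
vllm serve Qwen/Qwen3.6-27B \
  --chat-template ./chat_template.jinja \
  --enable-auto-tool-choice --tool-call-parser hermes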
Perfect-Campaign9551@reddit (OP)
Thank you for at least giving me useful info, unlike many of the other responses that don't want to bother explaining anything and just blame me for being dumb.
Medium_Chemist_4032@reddit
Int8 on vLLM works fully reliably up to 300k (YaRN-extended) context with the cache in FP16. Whenever I tried to quantize the cache, it fell apart quite quickly.
philmarcracken@reddit
mine was kinda wonky too until I tried: https://github.com/mlhher/late-cli/tree/main
FoxiPanda@reddit
Quants, launch parameters, harness settings, system prompt, bad prompting practices in general... any or all of these could be the culprit here. Can you elaborate on some or all of these for your scenario?
Perfect-Campaign9551@reddit (OP)
Qwen 3.6 27b or 35b using the Q4_K_M quants.
In OpenCode I am using a 90k context. No system prompt (if it has one, I don't know about it).
Codex CLI I think is set to 64k context for running Qwen, and just whatever default system prompt it comes with.
And the prompt from me was just "hey, create this HTML file". I mean... why should it have trouble just creating a file?
charmander_cha@reddit
But where's the info?
Makers7886@reddit
Try a higher quality quant, or if you haven't read the model card for best practices, I would start there. You didn't mention parameters, which means I assume you are "blindly" running the model, which would for sure give you poor results.
Perfect-Campaign9551@reddit (OP)
I'm running it as everyone recommends, though! I wouldn't think I'd need to be a complete rocket scientist....
Makers7886@reddit
Then you should be able to articulate exactly what those recommendations are, so that those of us who run this model successfully in various situations can help you. I don't listen to what "everyone recommends"; I read the model card and listen to the people who made the model. From there I deviate with my own tests on my own projects.
Perfect-Campaign9551@reddit (OP)
I'm really just trying to get others' experience; I realize they aren't necessarily going to be able to fix my problem.
I'm literally using the most simple installation and it has issues. I mean, Ollama is pretty much automatic.
ea_man@reddit
> I'm literally using the most simple installation and it has issues.
No man, you should run that Qwen in Qwen Code then, that's what it was trained for and the harness is made for it.
Perfect-Campaign9551@reddit (OP)
Actually, not a bad idea
ea_man@reddit
Yup and while you are there use it on Linux, that's where people work with commands and do dev work.
Makers7886@reddit
Therein lies your issue: you are blindly running the model, as I mentioned earlier and assumed. Take this as my experience and my diagnosis and do with it as you wish. Good luck in your LLM adventures.
FoxiPanda@reddit
I concur with this wholeheartedly.
This is rocket surgery. Right now, correctly running local LLMs is not user friendly for people who genuinely aren't in deep.
If you are using Ollama, you're flying blind. You have no idea what heinous BS they are launching that model with...or whether they even are using model card recommendations or the right chat template...or anything really.
Since you're on Windows OP, I'll share my Windows based Qwen3.6-27B-UD_Q5_K_XL launch parameters in hopes it inspires you to go switch to llama.cpp and learn what each parameter means and how to set your own parameters to avoid issues just like this one:
$env:PATH = "C:\llama.cpp\release\b8981;C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v13.1\bin;" + $env:PATH
$env:CUDA_PATH = "C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v13.1"
Here's also a link to all of the parameters available in llama.cpp to learn: https://github.com/ggml-org/llama.cpp/blob/master/tools/server/README.md
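The actual serve line isn't shown above, but purely as an illustration (the paths, quant file, and context size below are placeholders, not FoxiPanda's real values), a Windows llama-server launch in that style could look like:
llama-server.exe -m C:\models\Qwen3.6-27B-UD_Q5_K_XL.gguf -c 65536 -ngl 99 --jinja --temp 0.6 --top-p 0.95 --top-k 20 --min-p 0 --port 8080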
jwpbe@reddit
You still haven't included any launch parameters, or said whether you're using the most up-to-date version of vLLM via uv on their nightly branch or otherwise.
Using Ollama is an auto-fail and nothing related to that run should be included. It's a dumpster fire.
Perfect-Campaign9551@reddit (OP)
I have zero skill files installed or anything.
jwpbe@reddit
Your setup seems hodgepodge. You're on Windows, but you're running with vLLM? I'm sure you can do that, but that's a Linux utility.
Try downloading llama.cpp from the ggml-org GitHub repo. Use that to run a Q5/Q6 bartowski quant of Qwen 3.6 27B. Make sure you set the launch parameters to include the default generation settings: temp=0.6, min_p=0, top_p=0.95, top_k=20.
Add -np 1 to set the number of parallel slots to 1; you only need one request at a time as a solo user. Let llama.cpp fit the model into your VRAM. Don't quantize the cache. Try it again.
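Taken together, that advice might translate into something like this (the quant filename is a placeholder; flags are documented in the llama.cpp server README):
llama-server -m ./Qwen3.6-27B-Q5_K_M.gguf -c 65536 -ngl 99 -np 1 \
  --jinja --temp 0.6 --top-p 0.95 --top-k 20 --min-p 0
# no -ctk/-ctv flags, so the KV cache stays unquantized; adjust -ngl down if the model doesn't fit in VRAM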
Available_Hornet3538@reddit
Use LM Studio, that is what fixed it for me.
My_Unbiased_Opinion@reddit
Well, I'm running the Hermes agent with 3.6 27B at UD-IQ3_XXS with a Q8 KV cache at max context, and it has not failed a single tool call yet.
llitz@reddit
If your vLLM is using MTP, that is what's causing the tool calling issues. MTP=1 seems to be OK, but 2 or 3 causes problems: words end up broken or injected in the wrong place. (I wonder what else is messed up.)
this chat template (not mine) does help a lot with everything else though:
https://huggingface.co/froggeric/Qwen-Fixed-Chat-Templates/tree/main/qwen3.6
I had a session with 200 tool calls and no issues. While I do see someone mentioning a problem with Codex and tool calling, I have also noticed problems with tools provided by various plugins; I had a nasty issue with pi-code, and it was all because the tool's instructions were missing information about edge cases.
Some models are trained differently from others. It is completely expected that the GPT models work perfectly in Codex with minimal instructions, but I wouldn't expect them to work well in Claude Code. I'd imagine the same happens with Qwen.
ai_guy_nerd@reddit
Tool calling failures in Qwen 3.6, especially with file operations, are often tied to the specific quantization used or the way the system prompt defines the tool's output format. When a model starts barfing JSON or writing scripts to do what it should do via a tool, it usually means the model is losing the thread of the state machine and falling back to its training data.
Using a more constrained system prompt that provides a one-shot example of a perfect tool call can sometimes stabilize this. It also helps to use a model with a larger context window or a different quantization if the current one is too lossy on the logic required for structured output.
OpenClaw handles a lot of this plumbing by wrapping the interaction in a more robust agentic loop, which can mitigate some of the fragility seen in raw tool calls. Other options include switching to a model like Gemma 4 31B if tool stability becomes the primary bottleneck.
leonbollerup@reddit
I have seen similar issues… can you verify against 3.5? I went back to that.
Training-Cup4336@reddit
Use the Claude Code extension in VS Code.
kevin_1994@reddit
Give me your command and we can compare. Mine is:
taskset -c 0-15 /home/kevin/ai/llama.cpp/build/bin/llama-server \
  -m /home/kevin/ai/models/Qwen3.6-27B-Q8_0.gguf \
  --mmproj /home/kevin/ai/models/mmproj-qwen3.6-27b/mmproj-F16.gguf \
  -ctv q8_0 -ctk q8_0 -c 156000 -ngl 999 -np 1 -t 16 \
  --temp 0.6 --top-p 0.95 --top-k 20 --min-p 0.00 -b 4096 -ub 4096 \
  --chat-template-kwargs '{"preserve_thinking": true}' --port 1231 -a model
Runs great for me. I use OpenCode.
randomrealname@reddit
Thanks Kevin
kevin_1994@reddit
Any time
xilvar@reddit
Well, no idea what's wrong with your tool chain, but I'd suggest trying a slightly different setup, because it certainly works fine for me (absent some minor quant errors now and then):
OpenCode the usual way
llama.cpp built from the source repo
Unsloth Qwen3.6 27b Q4_K_XL
Perfect-Campaign9551@reddit (OP)
See, this crap: even when Codex tries to use apply_patch, it fails.
WHY
Then it says "let me create a python script"
Why so much trouble just to edit the file contents? UGH.
ea_man@reddit
Try Qwencode or Aider.
dreamai87@reddit
I have a feeling about what the reason could be. I have noticed tool calls fail on Windows, mainly the search-and-replace one: it always tries to make the change but it never gets reflected, then it tries to solve it with Python, and if none of that works it falls back to a full-file write, which sometimes fails because it doesn't pass the file-override parameter. This issue is very common on Windows, where it uses cmd and gets the \r\n newline syntax wrong. And all of that holds even when you use a proper quant with the right template; it sometimes works and sometimes doesn't. I use the UD K_XL quants. I don't get this issue in Ubuntu under WSL, but this repeated failure is common on Windows.
Try the llama-server backend with --jinja; on Windows you will still get issues, but with a better quant maybe fewer.
Another way: install Git on Windows (including sed), make the Git environment available from cmd, and ask the harness CLI to use sed or bash commands to edit files.
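For example, the address change from the original post could then be a single in-place edit from Git Bash instead of going through the harness's PowerShell write path (the file name and strings here are made up for illustration):
sed -i 's/123 Old Street/456 New Avenue/' index.html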
soulhacker@reddit
Never use a quantized version below Q6 in tool-calling-heavy scenarios, although AWQ-BF16-INT4 might work.
dolomitt@reddit
Can you try it in WSL? Even Codex 5.3 struggles with PowerShell on Windows.
Great_Guidance_8448@reddit
Running Qwen 3.6 27b with K/V caches at Q8 in Cline. Zero issues. I am really impressed.
Charming-Author4877@reddit
Works flawlessly in the Copilot harness, at a 4-bit quant and 4-bit KV cache.
dead_dads@reddit
Yo! New to local LLMs/AI stuff in general. I have an old 3090 and 128GB of DDR4 RAM. I was going to sell my old machine for parts, but it occurred to me this week that I could turn it into an AI machine to dip my toes into locally run stuff.
My interest right now is to work on some vibe coding projects. I'd like to assess and test models that fit fully into the VRAM of the 3090, but I'm also curious about utilizing my (DDR4) RAM to see what larger models bring into the equation.
What models would be worth my time for testing? I've been working with Claude to ID some models of interest, but since this field moves so fast, I thought asking people who are actively engaged in this stuff would be better.
ortegaalfredo@reddit
Something is not right with your setup. Using the latest vLLM, 27B can do hundreds of tool calls without any of them failing.