Qwen 3.6 27B in Claude Code says it will do something, then stops and prompts for a user reply (not a failed tool call)
Posted by jettoblack@reddit | LocalLLaMA | 28 comments
I'm running Qwen/Qwen3.6-27B-FP8 via vLLM using this command:
vllm serve Qwen/Qwen3.6-27B-FP8 --tensor-parallel-size 4 --gpu-memory-utilization 0.95 --max-num-seqs 8 \
--enable-auto-tool-choice --tool-call-parser qwen3_xml \
--enable-prefix-caching --attention-backend flashinfer
It works pretty well in Claude Code, except fairly often it will announce it's about to do something, then just stop and wait for a user response. E.g.:
Let me continue with the remaining edits.
✻ Brewed for 48s
>
(waiting for user input)
No error message, no failed tool call as far as I can tell; it just fails to follow through. Sometimes it will do this several times in a row, even commenting "The user replied 'continue' - they want me to continue. Let me continue with the remaining edits." before stopping at the user prompt again, waiting for me to reply.
Is this just a deficiency in the model's thinking, an incompatibility between Claude Code's prompts and the model, or an error in the configuration?
I haven't seen this happen in OpenCode, but there are reasons I prefer CC for some tasks.
Thanks.
boyobob55@reddit
You could try the granite4 tool parser; for some reason that worked for me with Qwen3.5.
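Swapping it is just a flag change on the serve command; something like this, assuming your vLLM build actually ships a parser under that name (the --tool-call-parser choices in vllm serve --help will tell you):
vllm serve Qwen/Qwen3.6-27B-FP8 --tensor-parallel-size 4 --gpu-memory-utilization 0.95 --max-num-seqs 8 \
--enable-auto-tool-choice --tool-call-parser granite4 \
--enable-prefix-caching --attention-backend flashinfer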
viperx7@reddit
Bro, this FP8 model is so annoying. It was very painful to get it to work, and it still had a lot of stupid issues like this one. The issue you're facing is due to some chat template shenanigans.
There was so much pain that I went back to running the Q8 version with ik_llama.cpp; with sm graph it gives 42 t/s or so, but at least it works.
If you're still looking for more pain, https://github.com/allanchan339/vLLM-Qwen3.5-27B is a guide somebody wrote. Be warned: it will solve almost all the issues, but you'll still occasionally see slowdowns.
As for vLLM options that worked for me: the config given above hits peak speeds of 140 t/s on a 4090 + 3090 sometimes, but context is only 125k.
my_name_isnt_clever@reddit
Can I ask why you prefer CC? I know people prefer it, but I'm not hearing why.
-dysangel-@reddit
The reason I've always preferred it is that it seems to handle context compaction a lot better than other scaffolds. I haven't really tried anything else in the last few months though. They might be catching up. Or CC might be pulling away, since it can now also store "memories" and reference previous conversation histories apparently.
my_name_isnt_clever@reddit
Pulling away because it has memory? That's called an agent. It's not exactly a stand-out feature these days.
-dysangel-@reddit
You can have agents without memory, and memory without agents. I'm not really sure why you are making an absolute statement that's not true.
my_name_isnt_clever@reddit
I meant any agentic tool worth using in 2026 has memory of some kind out of the box.
Since I posted that I actually had Qwen explain how CC does compaction based on the leaks, and it's certainly a lot more involved than the FOSS agents I've used. So fair enough if that's important to you.
PattF@reddit
Try https://github.com/Opencode-DCP/opencode-dynamic-context-pruning with OpenCode. It works great for me.
captainmadness@reddit
Seeing the same issue with this model using llama.cpp in both Pi-agent and Hermes. Have yet to find an error or reason for it so far.
fyv8@reddit
Been seeing this a lot with 3.6 35B in opencode. If I just tell it to continue it always recovers. Sounds like others are hitting it enough with this model that it's more about the model than the harness.
wombweed@reddit
It can happen when tool calls use too much context. What is your context window set to?
DinoAmino@reddit
Since the length isn't specified in the command args here, vLLM will try to use the max context the model supports.
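If you'd rather cap it yourself, --max-model-len does that; e.g. the OP's command with an arbitrary 64k limit:
vllm serve Qwen/Qwen3.6-27B-FP8 --tensor-parallel-size 4 --gpu-memory-utilization 0.95 --max-num-seqs 8 \
--enable-auto-tool-choice --tool-call-parser qwen3_xml \
--enable-prefix-caching --attention-backend flashinfer --max-model-len 65536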
wombweed@reddit
Thanks, didn't know that about vLLM.
Top-Rub-4670@reddit
llama.cpp works the same way:
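-c / --ctx-size is the knob there; if I remember the docs right, 0 means the length is loaded from the model rather than a fixed default. An illustrative launch (made-up model path):
llama-server -m ./qwen3.6-27b-q8_0.gguf -c 0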
southpawgeek@reddit
Not sure if relevant, but this happens to me a lot with Qwen3.6 35B and Crush. Lots of "please continue" - I have context maxed out.
Evening_Ad6637@reddit
Oh, with the same setup (model and llama.cpp and Crush) it happened to me a lot as well. Very frustrating, since I really like the aesthetics of Crush.
However, I've had no problems so far with llama.cpp + the same Qwen model and these harnesses: pi, little-coder, vibe.
southpawgeek@reddit
I'll have to check those out - but yeah, I really like Crush so far, aside from constantly having to poke Qwen with a stick.
This_Maintenance_834@reddit
Same here. I just ask "why stopped?" and it picks it up and continues to finish.
Ok-Measurement-1575@reddit
I'm using 35b bf16 in opencode without any obvious issues yet.
Just ran out of credit on opus, switched to opencode, added the feature first try.
rmhubbert@reddit
I've been seeing this in both OpenCode and Cherry Studio using vLLM nightly with Qwen/Qwen3.6-27B at full precision, so it isn't a quantisation issue. It is very annoying. I don't remember ever seeing it with 3.5. I was a little hasty in deleting 3.5, it seems.
Ill_Barber8709@reddit
I have the exact same issue in Claude Code, except I'm using LM Studio, with Qwen3.6 27B Q4K_X_M on a 32GB M2 Max and 64k context. At some point tool calls will simply fail, with macOS beeping from the Terminal and no error message.
I handle this by working in development phases. Once a phase is finished, I unload the model and quit Claude Code, then go back to the project and ask it to go on with the next phase.
I'm not sure where the issue comes from, but so far my little trick seems to work fine (albeit very slowly).
audioen@reddit
You might need to use qwen3_coder as the tool call parser. At least that's what I have to do to make these work.
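That's the same serve command with only the parser flag swapped, assuming your vLLM version includes it:
vllm serve Qwen/Qwen3.6-27B-FP8 --tensor-parallel-size 4 --gpu-memory-utilization 0.95 --max-num-seqs 8 \
--enable-auto-tool-choice --tool-call-parser qwen3_coder \
--enable-prefix-caching --attention-backend flashinfer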
nunodonato@reddit
I use qwen3_coder and get the same problems from time to time
LA_rent_Aficionado@reddit
Pretty sure this is a known issue with Qwen3 on VLLM: https://github.com/vllm-project/vllm/pull/40783
It may be exacerbated by streaming and/or the Anthropic API.
Bottom line: vLLM's chat template parsing isn't at the same level of ease of use/coverage/user-friendliness as llama.cpp's.
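If you want to rule the template in or out, vllm serve takes --chat-template with a local Jinja file, so you can override whatever ships with the model (the file name here is just an example):
vllm serve Qwen/Qwen3.6-27B-FP8 --enable-auto-tool-choice --tool-call-parser qwen3_xml \
--chat-template ./my_qwen3.6_template.jinja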
jedisct1@reddit
Try with Swival.
Elusive_Spoon@reddit
Question: how do you check for failed tool calls? I encountered similar behavior when connecting the 35B version to pi, and thought failed tool calls might be the reason, but I haven't figured out how to check yet.
Ell2509@reddit
They are probably using llama.cpp and copying straight out of the server console.
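With vLLM you can get the same kind of visibility by turning up the server logs; VLLM_LOGGING_LEVEL is the usual env var, if I recall:
VLLM_LOGGING_LEVEL=DEBUG vllm serve Qwen/Qwen3.6-27B-FP8 --enable-auto-tool-choice --tool-call-parser qwen3_xml
Failed or half-parsed tool calls should then show up in the console output.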
ResponsibleTruck4717@reddit
It happens to me as well with OpenCode.