Qwen 3.6 27B in Claude Code says it will do something, then stops and prompts for a user reply (not a failed tool call)
Posted by jettoblack@reddit | LocalLLaMA | 28 comments
I'm running Qwen/Qwen3.6-27B-FP8 via vLLM using this command:
vllm serve Qwen/Qwen3.6-27B-FP8 --tensor-parallel-size 4 --gpu-memory-utilization 0.95 --max-num-seqs 8 \
--enable-auto-tool-choice --tool-call-parser qwen3_xml \
--enable-prefix-caching --attention-backend flashinfer
It works pretty well in Claude Code, except fairly often it will announce it's about to do something, then just stop and wait for a user response. E.g.:
Let me continue with the remaining edits.
✻ Brewed for 48s
>
(waiting for user input)
No error message, no failed tool call as far as I can tell; it just fails to follow through. Sometimes it will do this several times in a row, even commenting "The user replied 'continue' - they want me to continue. Let me continue with the remaining edits." before stopping at the user prompt again, waiting for me to reply.
Is this just a deficiency in the model's thinking, an incompatibility between Claude Code's prompts and the model, or an error in the configuration?
I haven't seen this happen in OpenCode, but there are reasons I prefer CC for some tasks.
Thanks.
boyobob55@reddit
You could try the granite4 tool parser; for some reason that worked for me with Qwen3.5.
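Swapping it is just a flag change on the serve command; something like this, assuming your vLLM build actually ships a parser under that name (the --tool-call-parser choices in vllm serve --help will tell you):
vllm serve Qwen/Qwen3.6-27B-FP8 --tensor-parallel-size 4 --gpu-memory-utilization 0.95 --max-num-seqs 8 \
--enable-auto-tool-choice --tool-call-parser granite4 \
--enable-prefix-caching --attention-backend flashinfer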
viperx7@reddit
Bro, this FP8 model is so annoying. It was very painful to get it to work, and it still had a lot of stupid issues like this one. The issue you're facing is due to some chat template shenanigans.
There was so much pain that I went back to running the Q8 version with ik_llama.cpp; with sm graph it gives 42 t/s or so, but at least it works.
If you're still looking for more pain, https://github.com/allanchan339/vLLM-Qwen3.5-27B is a guide somebody wrote. Be warned: it will solve almost all the issues, but you'll still occasionally see slowdowns.
As for vLLM options that worked for me: the config given above hits peak speeds of 140 t/s on a 4090 + 3090 sometimes, but context is only 125k.
my_name_isnt_clever@reddit
Can I ask why you prefer CC? I know people prefer it, but I'm not hearing why.
-dysangel-@reddit
The reason I've always preferred it is that it seems to handle context compaction a lot better than other scaffolds. I haven't really tried anything else in the last few months though. They might be catching up. Or CC might be pulling away, since it can now also store "memories" and reference previous conversation histories apparently.
my_name_isnt_clever@reddit
Pulling away because it has memory? That's called an agent. It's not exactly a stand-out feature these days.
-dysangel-@reddit
You can have agents without memory, and memory without agents. I'm not really sure why you are making an absolute statement that's not true.
my_name_isnt_clever@reddit
I meant any agentic tool worth using in 2026 has memory of some kind out of the box.
Since I posted that I actually had Qwen explain how CC does compaction based on the leaks, and it's certainly a lot more involved than the FOSS agents I've used. So fair enough if that's important to you.
PattF@reddit
Try https://github.com/Opencode-DCP/opencode-dynamic-context-pruning with OpenCode. It works great for me.
captainmadness@reddit
Seeing the same issue with this model using llama.cpp in both Pi-agent and Hermes. Have yet to find an error or reason for it so far.
fyv8@reddit
Been seeing this a lot with 3.6 35B in opencode. If I just tell it to continue it always recovers. Sounds like others are hitting it enough with this model that it's more about the model than the harness.
wombweed@reddit
It can happen when tool calls use too much context. What is your context window set to?
DinoAmino@reddit
Since the length isn't specified in the command args here, vLLM will try to use the max context the model supports.
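If you'd rather cap it yourself, --max-model-len does that; e.g. the OP's command with an arbitrary 64k limit:
vllm serve Qwen/Qwen3.6-27B-FP8 --tensor-parallel-size 4 --gpu-memory-utilization 0.95 --max-num-seqs 8 \
--enable-auto-tool-choice --tool-call-parser qwen3_xml \
--enable-prefix-caching --attention-backend flashinfer --max-model-len 65536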
wombweed@reddit
Thanks, didn't know that about vLLM.
Top-Rub-4670@reddit
llama.cpp works the same way:
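-c / --ctx-size is the knob there; if I remember the docs right, 0 means the length is loaded from the model rather than a fixed default. An illustrative launch (made-up model path):
llama-server -m ./qwen3.6-27b-q8_0.gguf -c 0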
southpawgeek@reddit
Not sure if relevant, but this happens to me a lot with Qwen3.6 35B and Crush. Lots of "please continue" - I have context maxed out.
Evening_Ad6637@reddit
Oh, with the same setup (model and llama.cpp and Crush) it happened to me a lot as well. Very frustrating, since I really like the aesthetics of Crush.
However, I've had no problems so far with llama.cpp + the same Qwen model and these harnesses: pi, little-coder, vibe.
southpawgeek@reddit
I'll have to check those out - but yeah, I really like Crush so far, aside from constantly having to poke Qwen with a stick.
This_Maintenance_834@reddit
Same here. I just ask "why stopped?" and it picks it up and continues to finish.
Ok-Measurement-1575@reddit
I'm using 35b bf16 in opencode without any obvious issues yet.
Just ran out of credit on opus, switched to opencode, added the feature first try.
rmhubbert@reddit
I've been seeing this in both OpenCode and Cherry Studio using vLLM nightly with Qwen/Qwen3.6-27B at full precision, so it isn't a quantisation issue. It is very annoying. I don't remember ever seeing it with 3.5. I was a little hasty in deleting 3.5, it seems.
Ill_Barber8709@reddit
I have the exact same issue in Claude Code, except I'm using LM Studio, with Qwen3.6 27B Q4K_X_M on a 32GB M2 Max and 64k context. At some point tool calls will simply fail, with macOS beeping from the Terminal and no error message.
I handle this by working in development phases. Once a phase is finished, I unload the model and quit Claude Code, then go back to the project and ask it to go on with the next phase.
I'm not sure where the issue comes from, but so far my little trick seems to work fine (albeit very slowly).
audioen@reddit
You might need to use qwen3_coder as the tool call parser. At least that's what I have to do to make these work.
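That's the same serve command with only the parser flag swapped, assuming your vLLM version includes it:
vllm serve Qwen/Qwen3.6-27B-FP8 --tensor-parallel-size 4 --gpu-memory-utilization 0.95 --max-num-seqs 8 \
--enable-auto-tool-choice --tool-call-parser qwen3_coder \
--enable-prefix-caching --attention-backend flashinfer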
nunodonato@reddit
I use qwen3_coder and get the same problems from time to time
LA_rent_Aficionado@reddit
Pretty sure this is a known issue with Qwen3 on VLLM: https://github.com/vllm-project/vllm/pull/40783
It may be exacerbated by streaming and/or the Anthropic API.
Bottom line: vLLM's chat template parsing isn't at the same level of ease of use/coverage/user-friendliness as llama.cpp's.
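If you want to rule the template in or out, vllm serve takes --chat-template with a local Jinja file, so you can override whatever ships with the model (the file name here is just an example):
vllm serve Qwen/Qwen3.6-27B-FP8 --enable-auto-tool-choice --tool-call-parser qwen3_xml \
--chat-template ./my_qwen3.6_template.jinja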
jedisct1@reddit
Try with Swival.
Elusive_Spoon@reddit
Question: how do you check for failed tool calls? I encountered similar behavior when connecting the 35B version to pi, and thought failed tool calls might be the reason, but I haven't figured out how to check yet.
Ell2509@reddit
They are probably using llama.cpp and copying straight out of the server console.
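With vLLM you can get the same kind of visibility by turning up the server logs; VLLM_LOGGING_LEVEL is the usual env var, if I recall:
VLLM_LOGGING_LEVEL=DEBUG vllm serve Qwen/Qwen3.6-27B-FP8 --enable-auto-tool-choice --tool-call-parser qwen3_xml
Failed or half-parsed tool calls should then show up in the console output.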
ResponsibleTruck4717@reddit
It happens to me as well with OpenCode.