Experience with Qwen 3.5-122B and 3.6
Posted by Impossible_Car_3745@reddit | LocalLLaMA | 15 comments
I am managing an on-premise LLM for my team using 2x RTX Pro 6000.
I have switched from Qwen3.5-122B -> Qwen3.6-35B-A3B -> Qwen3.6-27B (today :) )
And the Qwen team does not lie in their benchmarks: my experience matched them.
1) performance: definitely, Qwen3.5-122B < Qwen3.6-35B < Qwen3.6-27B
I have not tested its full knowledge base, and I do not clearly remember how good the old Opus was, but for my task requests Qwen3.6-27B performed solidly. It's very good.
2) speed and context with MTP & 2x RTX Pro 6000 & FP8
- Qwen3.6-35B-A3B: 512k x 11 & 280 tps
- Qwen3.6-27B: 320k x 6 & 110 tps
Undici77@reddit
I'm experiencing the opposite: Qwen 3 works fine, but 3.5 and 3.6 are not better and often worse!
I made a post about my experience in daily coding tasks!
https://www.reddit.com/r/LocalLLaMA/comments/1stbohn/comment/ohs2num/
Impossible_Car_3745@reddit (OP)
Oh, Qwen3... okay, I hadn't thought about it. Will test it.
Undici77@reddit
The comparison is pretty interesting: are you working with a long chain of operations, using different tools? One example for me is the AskQuestion tool of Qwen-Coder-CLI!
Qwen3 always uses it! Qwen3.6 never does, until I ask it explicitly!
Impossible_Car_3745@reddit (OP)
What about prompting Qwen to ask questions via QWEN.md?
Undici77@reddit
This often works, but for a model that should be a HUGE UPGRADE from Qwen3Coder, this is an ugly workaround! And the bigger issue is the infinite loops!
Impossible_Car_3745@reddit (OP)
Ah, really interesting. Are you using 30B or 480B? Btw, I am using qwen code 0.14.5 with Qwen3.6-27B and it also uses AskQuestion from time to time (not that often).
Undici77@reddit
I'm using qwen code cli 0.15.0, and with both Qwen3 30B and 80B-Next, tools are working pretty well (sometimes some edits fail), but nowhere near as badly as with Qwen3.5/3.6. I'm not sure the issue is only the model: I have the feeling the conversion to MLX may also be problematic! I tried many different versions (from mlx-community to Unsloth), but the issue remains the same: infinite loops and missed tool usage!
Are you using GGUF or MLX?
spvn@reddit
> 512k x 11
> 320k x 6

Sorry, what does this mean?
Impossible_Car_3745@reddit (OP)
Sorry, I was too terse; those were the max_num_seqs settings. I am too lazy to calculate the KV cache size exactly. I set context length 512K with max_num_seqs 11 when using Qwen3.6-35B-A3B, and 320K with max_num_seqs 6 when using Qwen3.6-27B; that was the maximum that fit in my GPU RAM.
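For a rough sense of where those limits come from: the per-sequence KV cache grows linearly with context length. A minimal back-of-the-envelope sketch (the layer count, KV-head count, and head dimension below are illustrative placeholders, not actual Qwen3.6 specs):

```python
def kv_cache_gib(context_len, num_layers, num_kv_heads, head_dim, bytes_per_elem=1):
    """Rough KV cache size for one sequence, in GiB.

    The factor of 2 covers the K and V tensors; bytes_per_elem=1
    assumes an FP8 KV cache.
    """
    per_token_bytes = 2 * num_layers * num_kv_heads * head_dim * bytes_per_elem
    return context_len * per_token_bytes / 2**30

# Placeholder dims (not real Qwen3.6 numbers): 48 layers, 8 KV heads, head dim 128.
print(f"{kv_cache_gib(320_000, 48, 8, 128):.1f} GiB per 320K-token sequence")
```

Multiplying that per-sequence figure by max_num_seqs gives the total KV budget that has to fit alongside the weights in GPU RAM.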
spvn@reddit
Are such large context lengths really usable? I thought it went up to 256k only by default for Qwen3.6
Impossible_Car_3745@reddit (OP)
You can find how to extend it on Hugging Face.
Here is my script:
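For reference only (this is not OP's actual script): Qwen model cards have generally documented context extension via a YaRN rope-scaling block in the model's config.json, with the scaling factor equal to the target length divided by the original window. A hedged sketch, assuming Qwen3.6 follows the same recipe and using placeholder lengths:

```python
import json

# Placeholder lengths: assume a 256K native window, as mentioned above.
original_len = 262144
target_len = 524288  # 512K target

# rope_scaling block in the style of earlier Qwen model cards.
cfg = {
    "max_position_embeddings": target_len,
    "rope_scaling": {
        "rope_type": "yarn",
        "factor": target_len / original_len,
        "original_max_position_embeddings": original_len,
    },
}
print(json.dumps(cfg, indent=2))
```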
spvn@reddit
Yeah, but I meant with actual usage: it doesn't start hallucinating or performing weirdly? Even at 100k context I sometimes get the Q5 Qwen3.6-27B model acting weird (it suddenly stops writing a Python script 1000 lines in when it's not done, and has to restart writing it from scratch).
Impossible_Car_3745@reddit (OP)
I haven't used Q5, only FP8. I had no problems around 200-300K for Qwen 35B-A3B. Actually, I have rarely touched more than 300K, but in general it works fine.
ResidentPositive4122@reddit
FYI, when you run vLLM, check the logs: it will tell you how many concurrent sessions it can run at your max context setting. Look for something like "6.79x".
Impossible_Car_3745@reddit (OP)
Yes, those numbers were set based on the logs.