Experience with Qwen 3.5-122B and 3.6
Posted by Impossible_Car_3745@reddit | LocalLLaMA | 15 comments
I am managing an on-premise LLM for my team using 2x RTX Pro 6000.
I have switched from Qwen3.5-122B -> Qwen3.6-35B-A3B -> Qwen3.6-27B (today :) )
And the Qwen team does not lie in their benchmarks: my experience matched them.
1) performance: definitely, Qwen3.5-122B < Qwen3.6-35B < Qwen3.6-27B
I have not tested its full knowledge base, and I do not clearly remember how good the old Opus was, but for my task requests Qwen3.6-27B performed solidly. It's very good.
2) speed and context with MTP & 2x RTX Pro 6000 & FP8
- Qwen3.6-35B-A3B: 512k x 11 & 280 tps
- Qwen3.6-27B: 320k x 6 & 110 tps
Undici77@reddit
I'm experiencing the opposite: Qwen 3 works fine, but 3.5 and 3.6 are not better and often worse!
I made a post about my experience in daily coding tasks!
https://www.reddit.com/r/LocalLLaMA/comments/1stbohn/comment/ohs2num/
Impossible_Car_3745@reddit (OP)
Oh, Qwen3... okay, I hadn't thought about it. Will test it.
Undici77@reddit
The comparison is pretty interesting: are you working with a long chain of operations, using different tools? One example for me is the AskQuestion tool of Qwen-Coder-CLI!
Qwen3 always uses it! Qwen3.6 never does, until I ask it explicitly!
Impossible_Car_3745@reddit (OP)
What about prompting Qwen to ask questions via QWEN.md?
Undici77@reddit
This often works, but for a model that should be a HUGE UPGRADE from Qwen3Coder, this is an ugly workaround! And the bigger issue is the infinite loops!
Impossible_Car_3745@reddit (OP)
Ah, really interesting. Are you using 30B or 480B? Btw, I am using qwen code 0.14.5 with Qwen3.6-27B and it also uses AskQuestion from time to time (not that often).
Undici77@reddit
I'm using qwen code cli 0.15.0, and with both Qwen3 30B and 80B-Next, tools are working pretty well (sometimes some edits fail), but nowhere near as badly as with Qwen3.5/3.6. I'm not sure the issue is only the model: I have the feeling the conversion to MLX may also be problematic! I tried many different versions (from mlx-community to Unsloth), but the issue remains the same: infinite loops and missed tool usage!
Are you using GGUF or MLX?
spvn@reddit
> 512k x 11
> 320k x 6

Sorry, what does this mean?
Impossible_Car_3745@reddit (OP)
Sorry, I was too terse; those were the max_num_seqs settings. I am too lazy to calculate the KV cache size exactly. I set context length 512K with max_num_seqs 11 when using Qwen3.6-35B-A3B, and 320K with max_num_seqs 6 when using Qwen3.6-27B; that was the maximum that fit in my GPU RAM.
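For a rough sense of where those limits come from: the per-sequence KV cache grows linearly with context length. A minimal back-of-the-envelope sketch (the layer count, KV-head count, and head dimension below are illustrative placeholders, not actual Qwen3.6 specs):

```python
def kv_cache_gib(context_len, num_layers, num_kv_heads, head_dim, bytes_per_elem=1):
    """Rough KV cache size for one sequence, in GiB.

    The factor of 2 covers the K and V tensors; bytes_per_elem=1
    assumes an FP8 KV cache.
    """
    per_token_bytes = 2 * num_layers * num_kv_heads * head_dim * bytes_per_elem
    return context_len * per_token_bytes / 2**30

# Placeholder dims (not real Qwen3.6 numbers): 48 layers, 8 KV heads, head dim 128.
print(f"{kv_cache_gib(320_000, 48, 8, 128):.1f} GiB per 320K-token sequence")
```

Multiplying that per-sequence figure by max_num_seqs gives the total KV budget that has to fit alongside the weights in GPU RAM.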
spvn@reddit
Are such large context lengths really usable? I thought it went up to 256k only by default for Qwen3.6
Impossible_Car_3745@reddit (OP)
You can find how to extend it on Hugging Face.
Here is my script:
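For reference only (this is not OP's actual script): Qwen model cards have generally documented context extension via a YaRN rope-scaling block in the model's config.json, with the scaling factor equal to the target length divided by the original window. A hedged sketch, assuming Qwen3.6 follows the same recipe and using placeholder lengths:

```python
import json

# Placeholder lengths: assume a 256K native window, as mentioned above.
original_len = 262144
target_len = 524288  # 512K target

# rope_scaling block in the style of earlier Qwen model cards.
cfg = {
    "max_position_embeddings": target_len,
    "rope_scaling": {
        "rope_type": "yarn",
        "factor": target_len / original_len,
        "original_max_position_embeddings": original_len,
    },
}
print(json.dumps(cfg, indent=2))
```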
spvn@reddit
Yeah, but I meant with actual usage: it doesn't start hallucinating or performing weirdly? Even at 100k context I sometimes get the Q5 Qwen3.6-27B model acting weird (it suddenly stops writing a Python script 1000 lines in when it's not done, and has to restart writing it from scratch).
Impossible_Car_3745@reddit (OP)
I haven't used Q5, only FP8. I had no problems around 200-300K for Qwen 35B-A3B. Actually, I have rarely touched more than 300K, but in general it works fine.
ResidentPositive4122@reddit
FYI, when you run vLLM, check the logs: it will tell you how many concurrent sessions it can run at your max context setting. Look for something like "6.79x".
Impossible_Car_3745@reddit (OP)
Yes, those numbers were set based on the logs.