Should I switch from Qwen 3.5 27B (dense) to Qwen 3.6 35B-A3B for tool calls & vision? Need Docker config review + VRAM advice
Posted by Colie286@reddit | LocalLLaMA | 20 comments
Hi r/LocalLLaMA,
I'm currently running Qwen3.5-27B-UD-Q4_K_XL locally via llama.cpp with OpenWebUI, and I'm considering upgrading to Qwen3.6-35B-A3B (GGUF). Before making the switch, I'd appreciate some community feedback on performance, intelligence, and my current setup.
My Hardware:
- CPU: Ryzen 9 5950X
- RAM: 64GB DDR4 3600MHz
- GPU: RTX 3090 OC (24GB VRAM)
- Current performance: ~37.5 tokens/s with Qwen 3.5 27B
My Use Cases:
- Tool calling (primary use case)
- Image understanding/vision capabilities
- Social media content ideas & general knowledge
- Programming tasks
The Question:
Based on benchmarks, Qwen 3.6 35B-A3B seems comparable or slightly better than Qwen 3.5 27B for tool calling and vision. However, I'm concerned about:
- Intelligence trade-off: Is the 35B MoE model as intelligent as the 27B dense model for general knowledge tasks?
- VRAM impact: The quantized Qwen 3.6 model is ~22.4GB. With my current setup (llama.cpp + ComfyUI + Whisper ASR all running), I'm worried about VRAM pressure when ComfyUI/Whisper GPU usage spikes.
- RAM offloading: Could parts be offloaded to system RAM if needed? Will this hurt performance significantly?
llama-cpp-qwen3.5:
  image: ghcr.io/ggml-org/llama.cpp:server-cuda12-b8532
  container_name: llama-cpp-qwen3.5
  command: >
    --model /models/Qwen3.5-27B-UD-Q4_K_XL.gguf
    --mmproj /models/mmproj-F16-new.gguf
    --alias "XXX"
    --host 0.0.0.0
    --port 8085
    --ctx-size 100000
    --n-gpu-layers 99
    --cache-type-k q8_0
    --cache-type-v q8_0
    --top-p 0.95
    --min-p 0.00
    --top-k 20
    --jinja
    --flash-attn on
    --n-predict 12288
    --sleep-idle-seconds 5
  volumes:
    - ./llama-cpp-models:/models:ro
  deploy:
    resources:
      reservations:
        devices:
          - driver: nvidia
            device_ids: ['0']
            capabilities: [gpu]
  restart: unless-stopped
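If VRAM gets tight once the other services wake up, one knob already in this config is the KV cache: shrinking --ctx-size or quantizing the cache harder frees VRAM at the cost of context length and precision. A minimal sketch with illustrative values; q4_0 as a cache type is an assumption to verify against this llama.cpp build (V-cache quantization also needs flash attention, which is already on here):

    --ctx-size 65536        # smaller context window -> smaller KV cache (illustrative value)
    --cache-type-k q8_0 -> q4_0   # more aggressive KV quantization (assumption: supported by this build)
    --cache-type-v q8_0 -> q4_0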
Other Services Running:
- ComfyUI (lowvram mode, ~400MB idle VRAM)
- Whisper ASR (faster-whisper large-v3-turbo, CUDA enabled, ~400MB idle VRAM)
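Since the worry is what happens when ComfyUI or Whisper spike, a simple check is to watch the card while all three services are under load. A minimal sketch using standard nvidia-smi query flags:

    # Poll GPU memory once per second while triggering a ComfyUI render and a Whisper transcription
    nvidia-smi --query-gpu=memory.used,memory.total --format=csv -l 1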
What I'm Looking For:
- Has anyone tested Qwen 3.6 35B-A3B on RTX 3090? What token speeds did you achieve?
- Is the intelligence gap between 27B dense and 35B MoE noticeable for general knowledge/tool calling?
- Any Docker/llama.cpp config tweaks you'd recommend to squeeze out more context or performance?
- Should I stick with the 27B dense model or switch to 35B-A3B given my hardware constraints?
Thanks in advance! Happy to provide more details if needed.
(Translated with AI, since my English isn't that good.)
changtimwu@reddit
And now you have a new option to compare against: Qwen 3.6 27B + dflash https://x.com/pupposandro/status/2047004830749597883
beefgroin@reddit
For me it didn't work out. Only 3.6 27b will beat 3.5 27b.
OldKaleidoscope7@reddit
They released the 27B today; I'm waiting for opinions before switching. Why exactly is 27B better?
Colie286@reddit (OP)
I came to the same conclusion after testing 3.6.
SM8085@reddit
Speaking of Qwen3.6-35B-A3B, anyone else getting thinking loops?
I was asking it to examine 20 frames from a video, and I had to turn on the 'reasoning budget' setting because it was just going in circles.
(Screenshot of it having exhausted 10k tokens in thinking.)
I'm using the recommended settings from https://huggingface.co/Qwen/Qwen3.6-35B-A3B
OldKaleidoscope7@reddit
All the Qwen models I tried got stuck in reasoning, but I get great results with enable_thinking false.
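For anyone wondering how to actually pass that with llama.cpp: a minimal sketch, assuming the Qwen chat template really reads an enable_thinking key, and using the same --jinja / --chat-template-kwargs flags that appear in pulse77's command further down; verify both against your build and template:

    # Sketch: disable thinking at the server level (assumption: the template honors this key)
    llama-server ... --jinja --chat-template-kwargs '{"enable_thinking": false}'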
kweglinski@reddit
Oddly, I had that work with GGUFs but never with MLX. Both Q8.
DOAMOD@reddit
This model consumes a lot of context in its reasoning, and if the problem becomes too complex, it can get lost and descend into chaos. Yesterday I hit a situation where it became obsessed with a closure it couldn't find and started going around in circles, trying suboptimal solutions. We need to be careful, because this is the biggest problem I'm seeing with this model: it can basically get lost in its own reasoning.
In my tests it's on par with the 27B, but it needs much more "mental effort", which eats up large windows of context. It's very context-hungry, and that's its weak point: all of its speed advantage is lost in the reasoning it spends on solving problems. In my test (code problems) it ended up about 15 seconds slower than the 27B overall, despite generating tokens much faster.
DaniDubin@reddit
100% agree here! It's super fast, but that speed gets lost in stubborn repetition and loops when a solution doesn't work, so the context grows very fast. I tried replacing my default Qwen3.5-122B-A10B 5bit with Qwen3.6-35B-A3B 8bit (all MLX) in a Hermes Agent doing pretty straightforward tasks (just following well-documented skills/scripts) and it failed every time :-(
pulse77@reddit
I switched from Qwen3.5 27B UD-Q5_K_XL to Qwen3.6 35B A3B UD-Q8_K_XL for tool calling and precise coding, and I got much better results and a much larger context at the same speed.
Colie286@reddit (OP)
Can you please pass me your settings? Is your setup similar to mine? Q8 is large af for one RTX 3090, isn't it?
MexInAbu@reddit
Offload some of the experts to CPU. MoE models are much more forgiving with that setup.
Colie286@reddit (OP)
Any recommended parameters?
LumpyWelds@reddit
Gating of experts can be uneven. It would be good to track which experts get called so the busiest experts can be prioritized in the GPU.
pulse77@reddit
My command line parameters:
llama-server \
  --model Qwen3.6-35B-A3B-UD-Q8_K_XL.gguf \
  --mmproj Qwen3.6-35B-A3B-mmproj-F16.gguf \
  --mmap \
  --jinja \
  --flash-attn on \
  --cache-type-k q8_0 \
  --cache-type-v q8_0 \
  --ctx-size 262144 \
  --fit on \
  --temp 0.0 \
  --top-p 0.95 \
  --top-k 20 \
  --min-p 0.0 \
  --presence-penalty 0.0 \
  --repeat-penalty 1.0 \
  --chat-template-kwargs '{"preserve_thinking": true}'
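If that Q8 file plus a large context doesn't fit in 24GB, here is a hedged sketch of MexInAbu's "offload some experts to CPU" suggestion. Both flags below are assumptions to check against your llama.cpp build (recent builds expose --n-cpu-moe and --override-tensor / -ot), and the values are illustrative, not tested on this setup:

    # Option A: keep the MoE experts of the first N layers on CPU (illustrative count)
    llama-server --model Qwen3.6-35B-A3B-UD-Q8_K_XL.gguf --n-gpu-layers 99 --n-cpu-moe 12 ...

    # Option B: keep attention on GPU, push expert FFN tensors to CPU via a tensor-name regex
    llama-server --model Qwen3.6-35B-A3B-UD-Q8_K_XL.gguf --n-gpu-layers 99 -ot "exps=CPU" ...

The usual approach is to start with more experts on CPU than you think you need, then move layers back to the GPU until VRAM is full, benchmarking tokens/s at each step.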
PromptInjection_@reddit
I can't make the decision for you... but... I switched ;)
Colie286@reddit (OP)
Yeah sure, but what made you switch to 3.6? :)
PromptInjection_@reddit
3.5 is very censored and an overthinker
qwen_next_gguf_when@reddit
Like how we deal with software in production: if it isn't broken, we don't upgrade it.
b1231227@reddit
If a task can be decomposed into multiple subtasks, I consider a Mixture of Experts (MoE) approach suitable. If the task cannot be decomposed, a dense model is the optimal solution.