Should I switch from Qwen 3.5 27B (dense) to Qwen 3.6 35B-A3B for tool calls & vision? Need Docker config review + VRAM advice
Posted by Colie286@reddit | LocalLLaMA | 20 comments
Hi r/LocalLLaMA,
I'm currently running Qwen3.5-27B-UD-Q4_K_XL locally via llama.cpp with OpenWebUI, and I'm considering upgrading to Qwen3.6-35B-A3B (GGUF). Before making the switch, I'd appreciate some community feedback on performance, intelligence, and my current setup.
My Hardware:
- CPU: Ryzen 9 5950X
- RAM: 64GB DDR4 3600MHz
- GPU: RTX 3090 OC (24GB VRAM)
- Current performance: ~37.5 tokens/s with Qwen 3.5 27B
My Use Cases:
- Tool calling (primary use case)
- Image understanding/vision capabilities
- Social media content ideas & general knowledge
- Programming tasks
The Question:
Based on benchmarks, Qwen 3.6 35B-A3B seems comparable or slightly better than Qwen 3.5 27B for tool calling and vision. However, I'm concerned about:
- Intelligence trade-off: Is the 35B MoE model as intelligent as the 27B dense model for general knowledge tasks?
- VRAM impact: The quantized Qwen 3.6 model is ~22.4GB. With my current setup (llama.cpp + ComfyUI + Whisper ASR all running), I'm worried about VRAM pressure when ComfyUI/Whisper GPU usage spikes.
- RAM offloading: Could parts be offloaded to system RAM if needed? Will this hurt performance significantly?
llama-cpp-qwen3.5:
  image: ghcr.io/ggml-org/llama.cpp:server-cuda12-b8532
  container_name: llama-cpp-qwen3.5
  command: >
    --model /models/Qwen3.5-27B-UD-Q4_K_XL.gguf
    --mmproj /models/mmproj-F16-new.gguf
    --alias "XXX"
    --host 0.0.0.0
    --port 8085
    --ctx-size 100000
    --n-gpu-layers 99
    --cache-type-k q8_0
    --cache-type-v q8_0
    --top-p 0.95
    --min-p 0.00
    --top-k 20
    --jinja
    --flash-attn on
    --n-predict 12288
    --sleep-idle-seconds 5
  volumes:
    - ./llama-cpp-models:/models:ro
  deploy:
    resources:
      reservations:
        devices:
          - driver: nvidia
            device_ids: ['0']
            capabilities: [gpu]
  restart: unless-stopped
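If VRAM gets tight once the other services wake up, one knob already in this config is the KV cache: shrinking --ctx-size or quantizing the cache harder frees VRAM at the cost of context length and precision. A minimal sketch with illustrative values; q4_0 as a cache type is an assumption to verify against this llama.cpp build (V-cache quantization also needs flash attention, which is already on here):

    --ctx-size 65536        # smaller context window -> smaller KV cache (illustrative value)
    --cache-type-k q8_0 -> q4_0   # more aggressive KV quantization (assumption: supported by this build)
    --cache-type-v q8_0 -> q4_0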
Other Services Running:
- ComfyUI (lowvram mode, ~400MB idle VRAM)
- Whisper ASR (faster-whisper large-v3-turbo, CUDA enabled, ~400MB idle VRAM)
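Since the worry is what happens when ComfyUI or Whisper spike, a simple check is to watch the card while all three services are under load. A minimal sketch using standard nvidia-smi query flags:

    # Poll GPU memory once per second while triggering a ComfyUI render and a Whisper transcription
    nvidia-smi --query-gpu=memory.used,memory.total --format=csv -l 1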
What I'm Looking For:
- Has anyone tested Qwen 3.6 35B-A3B on RTX 3090? What token speeds did you achieve?
- Is the intelligence gap between 27B dense and 35B MoE noticeable for general knowledge/tool calling?
- Any Docker/llama.cpp config tweaks you'd recommend to squeeze out more context or performance?
- Should I stick with the 27B dense model or switch to 35B-A3B given my hardware constraints?
Thanks in advance! Happy to provide more details if needed.
(Translated with AI, since my English isn't that good.)
changtimwu@reddit
And now you have a new option to compare against: Qwen 3.6 27B + dflash https://x.com/pupposandro/status/2047004830749597883
beefgroin@reddit
For me it didn't work out. Only 3.6 27b will beat 3.5 27b.
OldKaleidoscope7@reddit
They released the 27B today; I'm waiting for opinions before switching. Why exactly is 27B better?
Colie286@reddit (OP)
I came to the same conclusion after testing 3.6.
SM8085@reddit
Speaking of Qwen3.6-35B-A3B, anyone else getting thinking loops?
I was asking it to examine 20 frames from a video, and I had to turn on the 'reasoning budget' setting because it was just going in circles.
(Screenshot of it having exhausted 10k tokens in thinking.)
I'm using the recommended settings from https://huggingface.co/Qwen/Qwen3.6-35B-A3B
OldKaleidoscope7@reddit
All the Qwen models I tried got stuck in reasoning, but I get great results with enable_thinking false.
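For anyone wondering how to actually pass that with llama.cpp: a minimal sketch, assuming the Qwen chat template really reads an enable_thinking key, and using the same --jinja / --chat-template-kwargs flags that appear in pulse77's command further down; verify both against your build and template:

    # Sketch: disable thinking at the server level (assumption: the template honors this key)
    llama-server ... --jinja --chat-template-kwargs '{"enable_thinking": false}'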
kweglinski@reddit
Oddly, I had that work with GGUFs but never with MLX. Both Q8.
DOAMOD@reddit
This model consumes a lot of context in its reasoning, and if the problem becomes too complex, it can get lost and descend into chaos. Yesterday I hit a situation where it became obsessed with a closure it couldn't find and started going around in circles, trying suboptimal solutions. We need to be careful, because this is the biggest problem I'm seeing with this model: it can basically get lost in its own reasoning.
In my tests it's on par with the 27B, but it needs much more "mental effort", which eats up large windows of context. It's very context-hungry, and that's its weak point: all of its speed advantage is lost in the reasoning it spends on solving problems. In my test (code problems) it ended up about 15 seconds slower than the 27B overall, despite generating tokens much faster.
DaniDubin@reddit
100% agree here! It's super fast, but that speed gets lost in stubborn repetition and loops when a solution doesn't work, so the context grows very fast. I tried replacing my default Qwen3.5-122B-A10B 5bit with Qwen3.6-35B-A3B 8bit (all MLX) in a Hermes Agent doing pretty straightforward tasks (just following well-documented skills/scripts) and it failed every time :-(
pulse77@reddit
I switched from Qwen3.5 27B UD-Q5_K_XL to Qwen3.6 35B A3B UD-Q8_K_XL for tool calling and precise coding, and I got much better results and a much larger context at the same speed.
Colie286@reddit (OP)
Can you please pass me your settings? Is your setup similar to mine? Q8 is large af for one RTX 3090, isn't it?
MexInAbu@reddit
Offload some of the experts to CPU. MoE models are much more forgiving with that setup.
Colie286@reddit (OP)
Any recommended parameters?
LumpyWelds@reddit
Gating of experts can be uneven. It would be good to track which experts get called so the busiest experts can be prioritized in the GPU.
pulse77@reddit
My command line parameters:
llama-server \
  --model Qwen3.6-35B-A3B-UD-Q8_K_XL.gguf \
  --mmproj Qwen3.6-35B-A3B-mmproj-F16.gguf \
  --mmap \
  --jinja \
  --flash-attn on \
  --cache-type-k q8_0 \
  --cache-type-v q8_0 \
  --ctx-size 262144 \
  --fit on \
  --temp 0.0 \
  --top-p 0.95 \
  --top-k 20 \
  --min-p 0.0 \
  --presence-penalty 0.0 \
  --repeat-penalty 1.0 \
  --chat-template-kwargs '{"preserve_thinking": true}'
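If that Q8 file plus a large context doesn't fit in 24GB, here is a hedged sketch of MexInAbu's "offload some experts to CPU" suggestion. Both flags below are assumptions to check against your llama.cpp build (recent builds expose --n-cpu-moe and --override-tensor / -ot), and the values are illustrative, not tested on this setup:

    # Option A: keep the MoE experts of the first N layers on CPU (illustrative count)
    llama-server --model Qwen3.6-35B-A3B-UD-Q8_K_XL.gguf --n-gpu-layers 99 --n-cpu-moe 12 ...

    # Option B: keep attention on GPU, push expert FFN tensors to CPU via a tensor-name regex
    llama-server --model Qwen3.6-35B-A3B-UD-Q8_K_XL.gguf --n-gpu-layers 99 -ot "exps=CPU" ...

The usual approach is to start with more experts on CPU than you think you need, then move layers back to the GPU until VRAM is full, benchmarking tokens/s at each step.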
PromptInjection_@reddit
I can't make the decision for you... but... I switched ;)
Colie286@reddit (OP)
Yeah sure, but what made you switch to 3.6? :)
PromptInjection_@reddit
3.5 is very censored and an overthinker
qwen_next_gguf_when@reddit
Like how we deal with software in production: if it isn't broken, we don't upgrade it.
b1231227@reddit
If a task can be decomposed into multiple subtasks, I consider a Mixture of Experts (MoE) approach suitable. If the task cannot be decomposed, a dense model is the optimal solution.