Should I switch from Qwen 3.5 27B (dense) to Qwen 3.6 35B-A3B for tool calls & vision? Need Docker config review + VRAM advice

Posted by Colie286@reddit | LocalLLaMA

Hi r/LocalLLaMA,

I'm currently running Qwen3.5-27B-UD-Q4_K_XL locally via llama.cpp with OpenWebUI and considering upgrading to Qwen3.6-35B-A3B (GGUF). Before making the switch, I'd appreciate some community feedback on performance, intelligence, and my current setup.

My Hardware:

My Use Cases:

The Question:
Based on benchmarks, Qwen 3.6 35B-A3B seems comparable to or slightly better than Qwen 3.5 27B for tool calling and vision. However, I'm concerned about:

  1. Intelligence trade-off: Is the 35B MoE model as intelligent as the 27B dense model for general knowledge tasks?
  2. VRAM impact: The Qwen 3.6 model is ~22.4GB with quantization. With my current setup (llama.cpp + ComfyUI + Whisper ASR all running), I'm worried about VRAM pressure when ComfyUI/Whisper spike in GPU usage.
  3. RAM offloading: Could parts of the model be offloaded to system RAM if needed, and would that hurt performance significantly?
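On point 3: for a MoE model, recent llama.cpp builds let you keep the expert weights (the bulk of the parameters) in system RAM while everything else stays on GPU. A sketch of the relevant flags follows; verify the flag names against your llama.cpp build, and note the tensor-name regex is an assumption about Qwen's GGUF naming, not confirmed for this model:

```shell
# Keep all layers on GPU, but park the MoE expert tensors of the
# first 12 layers in system RAM (tune the number until the model fits):
--n-gpu-layers 99 --n-cpu-moe 12

# Roughly equivalent manual form using a tensor-override regex
# (pattern assumed from common Qwen GGUF naming; check your file
# with gguf-dump before relying on it):
# -ot "blk\.(0|1|2|3)\.ffn_.*_exps\.=CPU"
```

Because only ~3B parameters are active per token in an A3B model, CPU-resident experts tend to hurt throughput much less than offloading dense layers would, but expect a noticeable drop versus a fully GPU-resident model.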

My current Docker Compose service:

```yaml
llama-cpp-qwen3.5:
  image: ghcr.io/ggml-org/llama.cpp:server-cuda12-b8532
  container_name: llama-cpp-qwen3.5
  command: >
    --model /models/Qwen3.5-27B-UD-Q4_K_XL.gguf
    --mmproj /models/mmproj-F16-new.gguf
    --alias "XXX"
    --host 0.0.0.0
    --port 8085
    --ctx-size 100000
    --n-gpu-layers 99
    --cache-type-k q8_0
    --cache-type-v q8_0
    --top-p 0.95
    --min-p 0.00
    --top-k 20
    --jinja
    --flash-attn on
    --n-predict 12288
    --sleep-idle-seconds 5
  volumes:
    - ./llama-cpp-models:/models:ro
  deploy:
    resources:
      reservations:
        devices:
          - driver: nvidia
            device_ids: ['0']
            capabilities: [gpu]
  restart: unless-stopped
```
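On the VRAM question: the KV cache at `--ctx-size 100000` is itself a large allocation, separate from the model weights. A back-of-the-envelope estimate is sketched below; the layer count, KV head count, and head dimension are placeholder guesses, not the real Qwen 3.6 config, and q8_0 is approximated as ~1 byte per element:

```shell
#!/usr/bin/env bash
# Rough KV cache size: 2 (K and V) x layers x context x kv_heads x head_dim x bytes/elem
layers=48        # assumed, not the real model config
kv_heads=8       # assumed (GQA)
head_dim=128     # assumed
ctx=100000
bytes=1          # q8_0 is roughly 1 byte per element
kv=$(( 2 * layers * ctx * kv_heads * head_dim * bytes ))
echo "KV cache ~ $(( kv / 1024 / 1024 )) MiB"   # ~9375 MiB with these numbers
```

With numbers in that ballpark, the cache alone eats a large slice of a 24GB card, so lowering `--ctx-size` is the easiest lever if ComfyUI/Whisper need headroom; the q8_0 cache types and flash attention in the config above are already helping here.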

Other Services Running:

What I'm Looking For:

  1. Has anyone tested Qwen 3.6 35B-A3B on RTX 3090? What token speeds did you achieve?
  2. Is the intelligence gap between 27B dense and 35B MoE noticeable for general knowledge/tool calling?
  3. Any Docker/llama.cpp config tweaks you'd recommend to squeeze out more context or performance?
  4. Should I stick with the 27B dense model or switch to 35B-A3B given my hardware constraints?

Thanks in advance! Happy to provide more details if needed.

(Translated with AI, since my English isn't that good.)