Struggling with Qwen3.6 27B / 35B locally (3090): slow responses, broken code. Looking for a better setup + auto model switching

Posted by Clean_Initial_9618@reddit | LocalLLaMA

Hey everyone,

I’ve been experimenting with running Qwen models locally on my setup:

GPU: RTX 3090 (24GB VRAM)

RAM: 64GB

CPU: Ryzen 5700X

OS: Windows 11

What I’m currently running

Qwen 3.6 35B (UD Q4_K_M)

llama-server.exe -m "C:\Users\Dino\.lmstudio\models\unsloth\Qwen3.6-35B-A3B-GGUF\Qwen3.6-35B-A3B-UD-Q4_K_M.gguf" -ngl 99 -c 131072 -np 2 -fa on -ctk f16 -ctv f16 -b 2048 -ub 512 -t 8 --mlock -rea on --reasoning-budget 2048 --reasoning-format deepseek --jinja --metrics --slots --port 8081 --host 0.0.0.0

Qwen 3.6 27B (UD Q4_K_XL)

llama-server.exe -m "C:\Users\Dino\.lmstudio\models\unsloth\Qwen3.6-27B-GGUF\Qwen3.6-27B-UD-Q4_K_XL.gguf" -ngl 99 -c 196608 -np 1 -fa on -ctk q8_0 -ctv q8_0 -b 2048 -ub 512 -t 8 --no-mmap -rea on --reasoning-budget -1 --reasoning-format deepseek --jinja --metrics --slots --port 8081 --host 0.0.0.0

My use case

Mainly coding: I want the model to generate and edit code reliably.

Issues I’m facing

Responses are slow, and the generated code often comes back broken.

What I’m looking for

  1. Better model + quant recommendations
  2. Something that actually works well on a 3090
  3. Good balance between speed + coding reliability
  4. Ways to improve throughput (t/s)
  5. Are my flags bad?
  6. Context size too high?
  7. Anything obvious I’m missing?
  8. Auto model loading / routing. Right now I have to:
     - Kill the server
     - Paste the new command
     - Reload the model
  9. Is there a way to (see the sketch after this list):
     - Auto-switch models based on the request?
     - Keep multiple models warm and route between them?
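On points 8–9: one option is a small proxy in front of llama-server that reads the "model" field of each OpenAI-style request and restarts the backend when it changes (projects like llama-swap automate exactly this). Here is a minimal stdlib-only Python sketch, with hypothetical model names and paths you would replace with your own commands from above:

```python
# Minimal single-GPU "model switcher" sketch (stdlib only, hypothetical
# paths and names). It proxies OpenAI-style requests and restarts
# llama-server when the requested "model" differs from the loaded one.
# An illustration, not a hardened server; llama-swap does this robustly.
import json
import subprocess
import time
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer

# Map model names (as sent in the request body) to launch commands.
# These commands are placeholders -- paste in your full flag sets.
MODELS = {
    "qwen-35b": ["llama-server.exe", "-m", r"C:\models\Qwen3.6-35B.gguf",
                 "-ngl", "99", "--port", "8081"],
    "qwen-27b": ["llama-server.exe", "-m", r"C:\models\Qwen3.6-27B.gguf",
                 "-ngl", "99", "--port", "8081"],
}
BACKEND = "http://127.0.0.1:8081"

current = {"name": None, "proc": None}

def ensure_model(name: str) -> None:
    """Kill the running server and start the requested one if needed."""
    if current["name"] == name:
        return
    if current["proc"] is not None:
        current["proc"].terminate()
        current["proc"].wait()
    current["proc"] = subprocess.Popen(MODELS[name])
    current["name"] = name
    # Poll /health until llama-server reports ready.
    for _ in range(120):
        try:
            with urllib.request.urlopen(f"{BACKEND}/health", timeout=1):
                return
        except OSError:
            time.sleep(1)
    raise RuntimeError(f"{name} did not become ready in time")

class Router(BaseHTTPRequestHandler):
    def do_POST(self):
        body = self.rfile.read(int(self.headers["Content-Length"]))
        name = json.loads(body).get("model", "qwen-27b")
        ensure_model(name)
        # Forward the request to the backend (buffered, no streaming).
        req = urllib.request.Request(BACKEND + self.path, data=body,
                                     headers={"Content-Type": "application/json"})
        with urllib.request.urlopen(req) as resp:
            payload = resp.read()
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.end_headers()
        self.wfile.write(payload)

HTTPServer(("0.0.0.0", 8080), Router).serve_forever()
```

Note the sketch buffers responses (no streaming) and keeps only one model resident, which fits a single 24 GB card; keeping multiple models warm at once would mean holding every weight set and KV cache in VRAM simultaneously.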

What’s your stack?

Thanks in advance for any suggestions or help. Really appreciate it.