Qwen 3.6 + vLLM + Docker + 2x RTX 3090 setup, working great!

Posted by Zyj@reddit | LocalLLaMA

Our nonprofit association runs an AI server with 2x RTX 3090, and I finally switched over to vLLM to get better performance for multiple concurrent users.

Here's my docker compose file:

services:
  vllm:
    image: vllm/vllm-openai:latest
    container_name: vllm
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: all
              capabilities: [gpu]
    environment:
      - VLLM_API_KEY=my_very_secret_key_was_scrubbed
    volumes:
      - /opt/.cache/huggingface:/root/.cache/huggingface
    ports:
      - "8000:8000"
    ipc: host # Prevents shared memory bottlenecks during tensor parallelism
    command: >
      --model cyankiwi/Qwen3.6-35B-A3B-AWQ-4bit
      --tensor-parallel-size 2
      --max-model-len 65536
      --gpu-memory-utilization 0.85
      --enable-prefix-caching
      --reasoning-parser qwen3
      --enable-auto-tool-choice
      --tool-call-parser qwen3_coder
      --max-num-seqs 32
      --speculative-config '{"method":"qwen3_next_mtp","num_speculative_tokens":2}'
    restart: unless-stopped
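For anyone wondering how to talk to it: the container exposes vLLM's OpenAI-compatible API on the port mapped above. Here's a minimal stdlib-only sketch of a chat-completion request (model name, port, and API key placeholder are all taken from the compose file; the key shown is obviously the scrubbed placeholder, not a real one). It only builds the request object, with the actual send left as a comment, so it runs even without the server up:

```python
import json
import urllib.request

BASE_URL = "http://localhost:8000/v1"  # port 8000 is mapped in the compose file
API_KEY = "my_very_secret_key_was_scrubbed"  # must match VLLM_API_KEY

# Standard OpenAI-style chat-completions payload; vLLM serves this endpoint.
payload = {
    "model": "cyankiwi/Qwen3.6-35B-A3B-AWQ-4bit",
    "messages": [{"role": "user", "content": "Say hello in one word."}],
    "max_tokens": 32,
}

req = urllib.request.Request(
    f"{BASE_URL}/chat/completions",
    data=json.dumps(payload).encode(),
    headers={
        "Content-Type": "application/json",
        "Authorization": f"Bearer {API_KEY}",
    },
)

# With the server running, you'd send it like this:
#   with urllib.request.urlopen(req) as resp:
#       print(json.load(resp)["choices"][0]["message"]["content"])
print(req.full_url)
```

Any OpenAI-compatible client (the official `openai` Python package, Open WebUI, etc.) works the same way: point `base_url` at port 8000 and pass the key as the bearer token.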

I'm super happy with it, but if you have suggestions for improvements, let me know!

Here are my llama-benchy results:

model                              test             t/s               peak t/s         ttfr (ms)           est_ppt (ms)        e2e_ttft (ms)
cyankiwi/Qwen3.6-35B-A3B-AWQ-4bit  pp2048 @ d2000   5463.38 ± 111.87                    748.82 ± 14.93      741.48 ± 14.93      748.93 ± 14.93
cyankiwi/Qwen3.6-35B-A3B-AWQ-4bit  tg32 @ d2000      103.13 ± 22.06   112.49 ± 24.41
cyankiwi/Qwen3.6-35B-A3B-AWQ-4bit  pp2048 @ d32768  5178.25 ± 25.55                    6731.33 ± 33.06     6724.00 ± 33.06     6731.41 ± 33.05
cyankiwi/Qwen3.6-35B-A3B-AWQ-4bit  tg32 @ d32768      25.65 ± 1.43     27.93 ± 1.52
cyankiwi/Qwen3.6-35B-A3B-AWQ-4bit  pp2048 @ d63000  4534.72 ± 42.10                   14353.15 ± 133.93   14345.82 ± 133.93   14353.26 ± 133.94
cyankiwi/Qwen3.6-35B-A3B-AWQ-4bit  tg32 @ d63000      12.85 ± 3.50     14.45 ± 3.21
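If I'm reading the columns right (my assumption: est_ppt covers the 2048-token pp batch plus the depth tokens), the prompt-processing times are internally consistent with the throughput figures, i.e. est_ppt ≈ (depth + 2048) / (pp t/s). Quick check with the numbers copied from the table:

```python
# (depth, pp throughput in t/s, measured est_ppt in ms) from the pp2048 rows
cases = [
    (2000, 5463.38, 741.48),
    (32768, 5178.25, 6724.00),
    (63000, 4534.72, 14345.82),
]

for depth, tps, est_ppt_ms in cases:
    # Predicted time to process depth + 2048 prompt tokens at the measured rate
    predicted_ms = (depth + 2048) / tps * 1000
    print(f"depth {depth}: predicted {predicted_ms:.0f} ms vs measured {est_ppt_ms:.0f} ms")
```

All three rows agree to within a fraction of a percent, which also makes the slowdown at long context easy to read off: decode drops from ~103 t/s at 2k depth to ~13 t/s at 63k.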