Qwen 3.6 + vLLM + Docker + 2x RTX 3090 setup, working great!
Posted by Zyj@reddit | LocalLLaMA | 18 comments
Our nonprofit association has an AI server with 2x RTX 3090 and I finally switched over to vLLM to get better performance for multiple users.
Here's my docker compose file:
```yaml
services:
  vllm:
    image: vllm/vllm-openai:latest
    container_name: vllm
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: all
              capabilities: [gpu]
    environment:
      - VLLM_API_KEY=my_very_secret_key_was_scrubbed
    volumes:
      - /opt/.cache/huggingface:/root/.cache/huggingface
    ports:
      - "8000:8000"
    ipc: host  # Prevents shared memory bottlenecks during tensor parallelism
    command: >
      --model cyankiwi/Qwen3.6-35B-A3B-AWQ-4bit
      --tensor-parallel-size 2
      --max-model-len 65536
      --gpu-memory-utilization 0.85
      --enable-prefix-caching
      --reasoning-parser qwen3
      --enable-auto-tool-choice
      --tool-call-parser qwen3_coder
      --max-num-seqs 32
      --speculative-config '{"method":"qwen3_next_mtp","num_speculative_tokens":2}'
    restart: unless-stopped
```
I'm super happy with it, but if you have suggestions for improvements, let me know!
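If anyone wants to hit the server, here's a minimal client sketch for the OpenAI-compatible endpoint vLLM exposes on port 8000. The helper names, prompt, and `max_tokens` value are my own; the API key placeholder matches the scrubbed value in the compose file:

```python
import json
import urllib.request

VLLM_URL = "http://localhost:8000/v1/chat/completions"  # port from the compose file
API_KEY = "my_very_secret_key_was_scrubbed"             # matches VLLM_API_KEY


def build_chat_request(prompt: str, max_tokens: int = 256):
    """Build an OpenAI-compatible chat completion request for the vLLM server."""
    headers = {
        "Content-Type": "application/json",
        "Authorization": f"Bearer {API_KEY}",
    }
    payload = {
        # must match the --model argument exactly, or vLLM rejects the request
        "model": "cyankiwi/Qwen3.6-35B-A3B-AWQ-4bit",
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
    }
    return VLLM_URL, headers, payload


def send(prompt: str) -> str:
    """POST the request and return the assistant's reply text."""
    url, headers, payload = build_chat_request(prompt)
    req = urllib.request.Request(url, data=json.dumps(payload).encode(), headers=headers)
    with urllib.request.urlopen(req) as resp:
        body = json.load(resp)
    return body["choices"][0]["message"]["content"]
```

Any OpenAI-compatible client (openai Python SDK, Open WebUI, etc.) works the same way, pointed at the same base URL and key.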
Here are my llama-benchy results:
| model | test | t/s | peak t/s | ttfr (ms) | est_ppt (ms) | e2e_ttft (ms) |
|---|---|---|---|---|---|---|
| cyankiwi/Qwen3.6-35B-A3B-AWQ-4bit | pp2048 @ d2000 | 5463.38 ± 111.87 | | 748.82 ± 14.93 | 741.48 ± 14.93 | 748.93 ± 14.93 |
| cyankiwi/Qwen3.6-35B-A3B-AWQ-4bit | tg32 @ d2000 | 103.13 ± 22.06 | 112.49 ± 24.41 | | | |
| cyankiwi/Qwen3.6-35B-A3B-AWQ-4bit | pp2048 @ d32768 | 5178.25 ± 25.55 | | 6731.33 ± 33.06 | 6724.00 ± 33.06 | 6731.41 ± 33.05 |
| cyankiwi/Qwen3.6-35B-A3B-AWQ-4bit | tg32 @ d32768 | 25.65 ± 1.43 | 27.93 ± 1.52 | | | |
| cyankiwi/Qwen3.6-35B-A3B-AWQ-4bit | pp2048 @ d63000 | 4534.72 ± 42.10 | | 14353.15 ± 133.93 | 14345.82 ± 133.93 | 14353.26 ± 133.94 |
| cyankiwi/Qwen3.6-35B-A3B-AWQ-4bit | tg32 @ d63000 | 12.85 ± 3.50 | 14.45 ± 3.21 | | | |
ddog661@reddit
Do you also use something like open webUI?
Zyj@reddit (OP)
Sure
Nepherpitu@reddit
Very strange results. My 4x 3090 runs a 122B AWQ at 115 tps, dropping to 85 tps at 200K context. 12 tps is way too slow.
No-Refrigerator-1672@reddit
OP is running tensor parallel. If they are using a consumer mobo, most likely their second GPU is limited on PCIe lanes, which hits TP mode hard, especially with growing context. Tensor parallel is only usable with full x16 PCIe 4.0 to both GPUs, or NVLink. I would recommend switching to pipeline parallel.
TacGibs@reddit
Fake news: TP works great with at least x4 PCIe 4.0 (around 95% of x16 performance).
Got 4x 3090 at x4 PCIe 4.0, and before that it was 2x 3090 at x8: absolutely no real difference (maybe 1 or 2%) between x8 and x4 with 2 GPUs.
There are also detailed tests and benchmarks with 4 GPUs, just search a bit.
Zyj@reddit (OP)
Did you enable P2P?
Zyj@reddit (OP)
I'm running it on TR Pro with 7 PCIe 4.0 x16 slots.
No-Refrigerator-1672@reddit
Consult your mobo manual. Sometimes manufacturers share lanes between multiple slots, and then write in their manual "if you plug in nvme 3 then slot 5 is only x8" or similar.
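Rather than trusting the manual, you can ask the driver what link width each card actually negotiated. The `nvidia-smi` query fields below are real; the parsing helper around them is just an illustrative sketch:

```python
import subprocess

# Real nvidia-smi query fields: current PCIe generation and lane width per GPU
QUERY = (
    "nvidia-smi --query-gpu=index,pcie.link.gen.current,"
    "pcie.link.width.current --format=csv,noheader"
)


def parse_pcie_widths(csv_text: str):
    """Parse nvidia-smi CSV output into (gpu_index, pcie_gen, lane_width) tuples."""
    rows = []
    for line in csv_text.strip().splitlines():
        idx, gen, width = [field.strip() for field in line.split(",")]
        rows.append((int(idx), int(gen), int(width)))
    return rows


if __name__ == "__main__":
    out = subprocess.run(QUERY.split(), capture_output=True, text=True).stdout
    for idx, gen, width in parse_pcie_widths(out):
        flag = "" if width >= 16 else "  <-- below x16, check slot sharing"
        print(f"GPU {idx}: PCIe gen{gen} x{width}{flag}")
```

Note the *current* width can also drop when a card idles into a power-saving state, so check it under load.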
Zyj@reddit (OP)
I did.
Nepherpitu@reddit
I'm running tensor parallel with some cards on OCuLink x4, with AER-corrected PCIe errors due to interference, lol. No noticeable difference from x16.
caetydid@reddit
hmm, not sure I can interpret the benchmark correctly... but I can run the Q6 quant on one 3090 at 50 t/s with 100k context, so I'd expect much more from vLLM and dual 3090s
Zyj@reddit (OP)
vLLM can be slower for a single request but it can run multiple requests in parallel and achieve a higher total tokens/s.
pkese@reddit
Interesting... I too have a 4x 3090 but I'm getting only ~100 tps with 122B AWQ.
Would you be willing to share your vllm config?
Nepherpitu@reddit
I have a post in my profile about the applied patches. You can also use AutoRound... with more patches, but it will give you 150 tps at the same quality.
Blues520@reddit
Nice setup. What are you using the model for?
caetydid@reddit
just used llama.cpp, never vllm, but why can't you use higher quants than 4? or do you need additional vram for speculative decoding?
Zyj@reddit (OP)
Need room for the KV cache for multiple users (also I want at least 65,000 tokens of context)
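Rough back-of-the-envelope for that KV-cache headroom. The model dimensions below (layer count, KV heads, head dim) are assumed placeholder values, not the real Qwen3.6-35B-A3B config, but the formula is the standard one for GQA-style attention:

```python
# ASSUMED illustrative dimensions -- not the actual model config:
LAYERS = 48       # transformer layers
KV_HEADS = 4      # KV heads per layer (GQA)
HEAD_DIM = 128    # dimension per head
DTYPE_BYTES = 2   # fp16/bf16 KV cache


def kv_bytes_per_token(layers: int, kv_heads: int, head_dim: int, dtype_bytes: int) -> int:
    # Factor of 2: both K and V store layers * kv_heads * head_dim values per token
    return 2 * layers * kv_heads * head_dim * dtype_bytes


per_tok = kv_bytes_per_token(LAYERS, KV_HEADS, HEAD_DIM, DTYPE_BYTES)
ctx = 65_536
print(per_tok)                 # 98304 bytes (96 KiB) per token
print(per_tok * ctx / 2**30)   # 6.0 GiB for one full 65k-token sequence
```

Multiply that per-sequence figure by the number of concurrent long-context users and it eats the VRAM saved by the 4-bit quant pretty quickly, which is why a higher quant doesn't fit.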
robertpro01@reddit
You do need extra VRAM for the smaller draft model