Qwen 3.6 + vLLM + Docker + 2x RTX 3090 setup, working great!
Posted by Zyj@reddit | LocalLLaMA | 18 comments
Our nonprofit association has an AI server with 2x RTX 3090 and I finally switched over to vLLM to get better performance for multiple users.
Here's my docker compose file:
```yaml
services:
  vllm:
    image: vllm/vllm-openai:latest
    container_name: vllm
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: all
              capabilities: [gpu]
    environment:
      - VLLM_API_KEY=my_very_secret_key_was_scrubbed
    volumes:
      - /opt/.cache/huggingface:/root/.cache/huggingface
    ports:
      - "8000:8000"
    ipc: host  # Prevents shared memory bottlenecks during tensor parallelism
    command: >
      --model cyankiwi/Qwen3.6-35B-A3B-AWQ-4bit
      --tensor-parallel-size 2
      --max-model-len 65536
      --gpu-memory-utilization 0.85
      --enable-prefix-caching
      --reasoning-parser qwen3
      --enable-auto-tool-choice
      --tool-call-parser qwen3_coder
      --max-num-seqs 32
      --speculative-config '{"method":"qwen3_next_mtp","num_speculative_tokens":2}'
    restart: unless-stopped
```
I'm super happy with it, but if you have suggestions for improvements, let me know!
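If anyone wants to hit the server, here's a minimal client sketch for the OpenAI-compatible endpoint vLLM exposes on port 8000. The helper names, prompt, and `max_tokens` value are my own; the API key placeholder matches the scrubbed value in the compose file:

```python
import json
import urllib.request

VLLM_URL = "http://localhost:8000/v1/chat/completions"  # port from the compose file
API_KEY = "my_very_secret_key_was_scrubbed"             # matches VLLM_API_KEY


def build_chat_request(prompt: str, max_tokens: int = 256):
    """Build an OpenAI-compatible chat completion request for the vLLM server."""
    headers = {
        "Content-Type": "application/json",
        "Authorization": f"Bearer {API_KEY}",
    }
    payload = {
        # must match the --model argument exactly, or vLLM rejects the request
        "model": "cyankiwi/Qwen3.6-35B-A3B-AWQ-4bit",
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
    }
    return VLLM_URL, headers, payload


def send(prompt: str) -> str:
    """POST the request and return the assistant's reply text."""
    url, headers, payload = build_chat_request(prompt)
    req = urllib.request.Request(url, data=json.dumps(payload).encode(), headers=headers)
    with urllib.request.urlopen(req) as resp:
        body = json.load(resp)
    return body["choices"][0]["message"]["content"]
```

Any OpenAI-compatible client (openai Python SDK, Open WebUI, etc.) works the same way, pointed at the same base URL and key.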
Here are my llama-benchy results:
| model | test | t/s | peak t/s | ttfr (ms) | est_ppt (ms) | e2e_ttft (ms) |
|---|---|---|---|---|---|---|
| cyankiwi/Qwen3.6-35B-A3B-AWQ-4bit | pp2048 @ d2000 | 5463.38 ± 111.87 | | 748.82 ± 14.93 | 741.48 ± 14.93 | 748.93 ± 14.93 |
| cyankiwi/Qwen3.6-35B-A3B-AWQ-4bit | tg32 @ d2000 | 103.13 ± 22.06 | 112.49 ± 24.41 | | | |
| cyankiwi/Qwen3.6-35B-A3B-AWQ-4bit | pp2048 @ d32768 | 5178.25 ± 25.55 | | 6731.33 ± 33.06 | 6724.00 ± 33.06 | 6731.41 ± 33.05 |
| cyankiwi/Qwen3.6-35B-A3B-AWQ-4bit | tg32 @ d32768 | 25.65 ± 1.43 | 27.93 ± 1.52 | | | |
| cyankiwi/Qwen3.6-35B-A3B-AWQ-4bit | pp2048 @ d63000 | 4534.72 ± 42.10 | | 14353.15 ± 133.93 | 14345.82 ± 133.93 | 14353.26 ± 133.94 |
| cyankiwi/Qwen3.6-35B-A3B-AWQ-4bit | tg32 @ d63000 | 12.85 ± 3.50 | 14.45 ± 3.21 | | | |
ddog661@reddit
Do you also use something like open webUI?
Zyj@reddit (OP)
Sure
Nepherpitu@reddit
Very strange results. My 4x 3090 runs a 122B AWQ at 115 tps, dropping to 85 tps at 200K context. 12 tps is way too slow.
No-Refrigerator-1672@reddit
OP is running tensor parallel. If they are using a consumer mobo, most likely their second GPU is limited on PCIe lanes, which hits TP mode hard, especially with growing context. Tensor parallel is only usable with full x16 PCIe 4.0 to both GPUs, or NVLink. I would recommend switching to pipeline parallel.
TacGibs@reddit
Fake news: TP works great with at least x4 PCIe 4.0 (around 95% of x16 performance).
Got 4x 3090 at x4 PCIe 4.0, and before that it was 2x 3090 at x8: absolutely no real difference (maybe 1 or 2%) between x8 and x4 with 2 GPUs.
There are also detailed tests and benchmarks with 4 GPUs, just search a bit.
Zyj@reddit (OP)
Did you enable P2P?
Zyj@reddit (OP)
I'm running it on TR Pro with 7 PCIe 4.0 x16 slots.
No-Refrigerator-1672@reddit
Consult your mobo manual. Sometimes manufacturers share lanes between multiple slots, and then write in their manual "if you plug in nvme 3 then slot 5 is only x8" or similar.
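Rather than trusting the manual, you can ask the driver what link width each card actually negotiated. The `nvidia-smi` query fields below are real; the parsing helper around them is just an illustrative sketch:

```python
import subprocess

# Real nvidia-smi query fields: current PCIe generation and lane width per GPU
QUERY = (
    "nvidia-smi --query-gpu=index,pcie.link.gen.current,"
    "pcie.link.width.current --format=csv,noheader"
)


def parse_pcie_widths(csv_text: str):
    """Parse nvidia-smi CSV output into (gpu_index, pcie_gen, lane_width) tuples."""
    rows = []
    for line in csv_text.strip().splitlines():
        idx, gen, width = [field.strip() for field in line.split(",")]
        rows.append((int(idx), int(gen), int(width)))
    return rows


if __name__ == "__main__":
    out = subprocess.run(QUERY.split(), capture_output=True, text=True).stdout
    for idx, gen, width in parse_pcie_widths(out):
        flag = "" if width >= 16 else "  <-- below x16, check slot sharing"
        print(f"GPU {idx}: PCIe gen{gen} x{width}{flag}")
```

Note the *current* width can also drop when a card idles into a power-saving state, so check it under load.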
Zyj@reddit (OP)
I did.
Nepherpitu@reddit
I'm running tensor parallel with some cards on OCuLink x4, with AER-corrected PCIe errors due to interference, lol. No noticeable difference from x16.
caetydid@reddit
hmm, not sure I can interpret the benchmark correctly... but I can run the Q6 quant on one 3090 at 50 t/s with 100k context, so I'd expect much more from vLLM and dual 3090s
Zyj@reddit (OP)
vLLM can be slower for a single request but it can run multiple requests in parallel and achieve a higher total tokens/s.
pkese@reddit
Interesting... I too have a 4x 3090 but I'm getting only ~100 tps with 122B AWQ.
Would you be willing to share your vllm config?
Nepherpitu@reddit
I have a post in my profile about the applied patches. You can also use AutoRound... with more patches, but it will give you 150 tps at the same quality.
Blues520@reddit
Nice setup. What are you using the model for?
caetydid@reddit
just used llama.cpp, never vllm, but why can't you use higher quants than 4? or do you need additional vram for speculative decoding?
Zyj@reddit (OP)
Need room for the KV cache for multiple users (also I want at least 65,000 tokens of context)
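Rough back-of-the-envelope for that KV-cache headroom. The model dimensions below (layer count, KV heads, head dim) are assumed placeholder values, not the real Qwen3.6-35B-A3B config, but the formula is the standard one for GQA-style attention:

```python
# ASSUMED illustrative dimensions -- not the actual model config:
LAYERS = 48       # transformer layers
KV_HEADS = 4      # KV heads per layer (GQA)
HEAD_DIM = 128    # dimension per head
DTYPE_BYTES = 2   # fp16/bf16 KV cache


def kv_bytes_per_token(layers: int, kv_heads: int, head_dim: int, dtype_bytes: int) -> int:
    # Factor of 2: both K and V store layers * kv_heads * head_dim values per token
    return 2 * layers * kv_heads * head_dim * dtype_bytes


per_tok = kv_bytes_per_token(LAYERS, KV_HEADS, HEAD_DIM, DTYPE_BYTES)
ctx = 65_536
print(per_tok)                 # 98304 bytes (96 KiB) per token
print(per_tok * ctx / 2**30)   # 6.0 GiB for one full 65k-token sequence
```

Multiply that per-sequence figure by the number of concurrent long-context users and it eats the VRAM saved by the 4-bit quant pretty quickly, which is why a higher quant doesn't fit.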
robertpro01@reddit
You do need extra VRAM for the smaller draft model