Run Qwen3.5-397B-A13B with vLLM and 8xR9700
Posted by djdeniro@reddit | LocalLLaMA | View on Reddit | 8 comments
Special thanks for u/Sea-Speaker1700 to make possible run mxfp4 on R0700 GPU, first guide to run 122B models here
Well, 397B model works amazing, super fast.
Use this Dockerfile to build image, original image provided by u/Sea-Speaker1700
FROM tcclaviger/vllm-rocm-rdna4-mxfp4:latest
# Transformers Update
RUN pip install --upgrade transformers
# Triton Patch
RUN find /app -name "topk.py" -exec grep -l "N_EXPTS_ACT=k," {} \; | xargs -I{} sed -i 's/N_EXPTS_ACT=k, # constants/N_EXPTS_ACT=__import__("triton").next_power_of_2(k), # constants/' {}
CMD ["/bin/bash"]
build patched version
docker build -t vllm-mxfp4-patched -f Dockerfile .
Download model:
git lfs clone https://huggingface.co/djdeniro/Qwen3.5-397B-A17B-MXFP4
Launch script, keep your device id, replace $1 with model name, $2 with out port.
docker run --name "$1" \
--rm --tty --ipc=host --shm-size=32g \
--device /dev/kfd:/dev/kfd \
--device /dev/dri/renderD128:/dev/dri/renderD128 \
--device /dev/dri/renderD129:/dev/dri/renderD129 \
--device /dev/dri/renderD130:/dev/dri/renderD130 \
--device /dev/dri/renderD131:/dev/dri/renderD131 \
--device /dev/dri/renderD132:/dev/dri/renderD132 \
--device /dev/dri/renderD137:/dev/dri/renderD137 \
--device /dev/dri/renderD138:/dev/dri/renderD138 \
--device /dev/dri/renderD139:/dev/dri/renderD139 \
--device /dev/mem:/dev/mem \
-e HIP_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 \
-e ROCR_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 \
-v /mnt/llm_disk/models:/app/models:ro \
-e TRUST_REMOTE_CODE=1 \
-e OMP_NUM_THREADS=8 \
-e PYTORCH_TUNABLEOP_ENABLED=1 \
-e PYTORCH_TUNABLEOP_TUNING=0 \
-e PYTORCH_TUNABLEOP_RECORD_UNTUNED=0 \
-e VLLM_ROCM_USE_AITER=0 \
-e PYTORCH_TUNABLEOP_FILENAME=/tunableop/tunableop_merged.csv \
-e PYTORCH_TUNABLEOP_UNTUNED_FILENAME=/tunableop/tunableop_untuned%%d.csv \
-e GPU_MAX_HW_QUEUES=1 \
-p "$2":8000 \
-e TRITON_CACHE_DIR=/root/.triton/cache \
vllm-mxfp4-patched \
/app/models/Qwen3.5-397B-A17B-MXFP4 \
--served-model-name "$1" --host 0.0.0.0 --port 8000 --trust-remote-code \
--enable-prefix-caching --gpu-memory-utilization 0.98 --tensor-parallel-size 8 \
--max-model-len 131072 --max-num-seqs 4 \
--tool-call-parser qwen3_coder --enable-auto-tool-choice \
--override-generation-config '{"max_tokens": 64000, "temperature": 1.0, "top_p": 0.95, "top_k": 20, "presence_penalty": 1.5}' \
--compilation-config '{"cudagraph_capture_sizes": [1, 2, 4, 8, 16, 32, 64, 128], "max_cudagraph_capture_size": 128}' \
--max-num-batched-tokens 2048 \
--limit-mm-per-prompt.image 2 --mm-processor-cache-gb 1 \
--mm-processor-kwargs '{"max_pixels": 602112}' \
--reasoning-parser qwen3
Loading model 400-600s first time, and then got 30 t/s on tg, 3.5-3.7k on pp in one request.
in 4x requests you will got up to 100 t/s.
I limit power per gpu (210W), if power limit 300W per gpu will speedup model.
Best result with this model i have when thinking budget is 0 tokens for coding tasks.
Turbulent_Pin7635@reddit
1700W o.O
djdeniro@reddit (OP)
4800w PSU ☠️
Turbulent_Pin7635@reddit
Hell, baby Jesus in holy fucking sky!
FullOf_Bad_Ideas@reddit
that's a really good performance. 3.5k PP is impressive, especially with TP 8 and PCI-E. That's without prefix caching contaminating the numbers, right?
Thanks-Suitable@reddit
would love to see these results!
TaroOk7112@reddit
Where are you plugin 8 GPUs? What is your motherboard?
djdeniro@reddit (OP)
MB: MZ32-AR0
putrasherni@reddit
Great work !