Reality check on 50 t/s for Qwen3.5-122B-A3B on a 3,500 USD device
Posted by kuhunaxeyive@reddit | LocalLLaMA | View on Reddit | 67 comments
I found an optimization that achieves 51 tokens/s (48 for very long contexts) for Qwen3.5-122B-A3B, and the person who did it published a bash script on GitHub that sets everything up automatically:
This optimization was implemented on the NVIDIA Spark. The Asus Ascent GX10 shares the same internal hardware (the NVIDIA GB10 Grace Blackwell Superchip), with the main differences being the casing and cooling. It is priced at around USD 3,500 because it ships with only 1 TB of storage, which is sufficient for my use case. A generation speed of 50 tokens/s for a model of this size would make it practically usable. However, before purchasing the device, I want to verify whether my assumptions place it within a usable performance range.
My questions:
- Has anyone tested the Asus Ascent GX10? With an 8,000-token context, what are the TTFT and generation speeds? I want to verify whether 5 seconds TTFT and 50 tokens/s generation are achievable.
- Are there any issues caused by minor hardware differences between the devices? Specifically, will the optimization setup script run on the Asus Ascent without modification?
JojoScraggins@reddit
I have the gx10 and am running qwen3.5 122b int4 autoround. Benchmark results vary but were just better than nvfp4. I'm sure it won't be long until another model comes in that is better but this one sure does well for me.
On a coding benchmark I got pretty decent performance:
Here's my system unit:
kuhunaxeyive@reddit (OP)
You get 134 t/s, that's great and way more than expected, isn't it? Apart from this system unit, did you do any optimization I should consider? I'm reading up on everything and learning, preparing myself, as my Ascent GX10 will arrive tomorrow …
JojoScraggins@reddit
Keep in mind that 134 t/s is from a vLLM benchmark with concurrency (https://docs.vllm.ai/en/latest/benchmarking/cli/) whereas your link uses its own benchmarking method on a single request. vLLM's benchmarking is going to give a more complete and reliable picture. As far as optimizations go, you can drive yourself crazy. I decided to stick to upgrading across vLLM releases rather than patching, with the thought that in a few weeks a new model or kernel could entirely change what I'm doing with my inference setup.
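As a rough sketch of why the two numbers aren't comparable (all figures here are hypothetical, just the arithmetic): a concurrent benchmark reports aggregate tokens/s across all in-flight requests, while a single-request benchmark reports one stream's speed.

```python
# Rough model of aggregate vs. single-stream decode throughput.
# Batched decode serves several requests per forward pass, so aggregate
# throughput scales with concurrency at some efficiency < 1 (the batch
# shares compute and memory bandwidth). Numbers below are illustrative.

def aggregate_tps(single_stream_tps: float, concurrency: int, efficiency: float) -> float:
    """Aggregate tokens/s across all concurrent requests."""
    return single_stream_tps * concurrency * efficiency

single = 50.0  # tokens/s for one request, as in the tutorial's benchmark
batched = aggregate_tps(single, concurrency=4, efficiency=0.67)
print(f"single-stream: {single:.0f} t/s, aggregate at concurrency 4: {batched:.0f} t/s")
```

So a 134 t/s aggregate figure and a 50 t/s single-request figure can describe the same machine; per-stream latency for one user doesn't improve just because the aggregate number is higher.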
kuhunaxeyive@reddit (OP)
So as I understand it, the difference from the tutorial is that the tutorial gets you near-FP8 precision, but at the cost of only 50 t/s instead of 134 t/s …
CATLLM@reddit
I have two MSI variants of the Spark in a cluster running Qwen3.5 397B. I think buying a single one is a waste of money because the ConnectX-7 alone is already $1,700. Running models on a single Spark is just on the edge of "usable".
Turbulent_Ad7096@reddit
I'm considering getting a second GX10 for Qwen3.5 397B and MiniMax M2.5 172B REAP. I can run the MiniMax one on a single device. How is the Qwen 397B model? Is it substantially better at coding than the 122B?
CATLLM@reddit
Yes, the 397B model is amazing at everything. Keep in mind only the Intel AutoRound 4-bit quant fits; anything larger won't. Running the 122B in FP8 clustered is great too. I recommend looking through this thread in the NVIDIA developer forums to help make your decision:
https://forums.developer.nvidia.com/t/qwen3-5-397b-a17b-run-in-dual-spark-but-i-have-a-concern/361967/216
I personally think getting 2 sparks is worth it.
kuhunaxeyive@reddit (OP)
The guy behind the tutorial gets his single Spark doing near-FP8 quality at 50 t/s for the Qwen-122B-A10B model. I genuinely want to ask, because I need to decide which way I'll go: why do you think it's better to get a second Spark and run FP4 at 35 t/s, if you can get better results at half the price by following the tutorial's optimization, and learn something on the way? If you have the money and don't want to invest the time or work, I totally get it. But if someone wants to learn and doesn't have the money, then it seems more reasonable to stay on one machine and do the optimization tailored to that specific model and the Spark/Asus Ascent machine.
CATLLM@reddit
Look at the thread and see how much trouble people are going through to get that specific optimization working. If you want to spend the time tweaking, then by all means go ahead.
If you don’t have the money then the spark is not the best choice especially after the recent price hikes.
What do you want to learn, though? Just inference? Fine-tuning? CUDA?
kuhunaxeyive@reddit (OP)
The tutorial includes a setup bash script, so that part actually isn't troublesome. However, I want to do it manually because, beyond just having a local and efficient LLM, fine-tuning it and understanding the underlying theory are the reasons for my purchase.
fastheadcrab@reddit
I personally saw that thread a while ago and found it vaguely suspicious. It made statements that seemed to contradict known facts and documentation, and the post appeared to be written by an LLM. I've also previously seen well-founded rationale arguing that MTP on MoE is unlikely to give a significant speedup on real tasks.
Personally I would never make a purchasing decision based on a sketchy forum post promising very high performance, but the OP has clearly made up his mind from the beginning and wanted confirmation rather than advice or information.
https://www.reddit.com/r/LocalLLaMA/comments/1rzntv5/multitoken_prediction_mtp_for_qwen35_is_coming_to/obni2p2/
kuhunaxeyive@reddit (OP)
Getting confirmation for that specific optimization tutorial is exactly the purpose of my post. I don’t understand why you judge that I’ve already made up my mind for the Asus Ascent GX10. I’m simply looking for confirmation that it works as claimed.
Of course, I appreciate arguments against the tutorial. I’m open to new perspectives. However, please don’t judge me for making this specific request. So far, no one here has suggested an alternative setup that can achieve this for a total price of only 3080 USD.
My purchase decision will not be based on the NVIDIA forum post alone. It will depend on reports from others who can confirm that the approach works in practice.
Additionally, the Spark or Asus Ascent are good devices for getting started with LLMs, especially the Asus Ascent with 1 TB storage, which I can purchase for 3018 USD in Asia.
My use cases are:
My past experience shows that I'm capable of making solid system and purchasing decisions. Based on user reports, I'm fairly confident that the tutorial you describe as "sketchy" will actually achieve 50 tokens per second on a 3,080 USD device running Qwen3.5-122B-A10B, and that I'll learn the underlying optimization techniques in the process.
Some people call me crazy or ignorant, but I believe there is a strong chance of achieving these goals with the Asus Ascent. Going against the current invites criticism, but it can lead to discovering something the majority overlooks.
Thank you for taking the time to discuss, it's really helpful.
CATLLM@reddit
100% agree. Well said.
CalligrapherFar7833@reddit
You can't decouple the CX7 from the DGX board, so it doesn't work like that. You can't count the cost of a component that you can't reuse in something else.
CATLLM@reddit
If you are not using it, then it's a waste of money. It's sitting idle. Why make it any more complicated?
ArtfulGenie69@reddit
Why does anyone need this ConnectX-7 thing? Network is all you need, especially for something as slow as a Spark. RPC for llama.cpp and Ray for vLLM. Is it really a $1,700 cable? I'm out of the loop on expensive garbage like that.
I used rpc to get my machines working together, vllm has something similar. https://github.com/ggml-org/llama.cpp/blob/master/tools/rpc/README.md
CATLLM@reddit
The ConnectX-7 on the Spark can do 200 gigabit. Latency is about 3 microseconds. Yes, you need that if you are running a model across multiple machines. It's what data centers use. It's far from garbage.
llama.cpp RPC over Ethernet is orders of magnitude slower. You think AWS, Azure, Google Cloud etc. are running Cat 5e gigabit?
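As a rough sketch of how that latency adds up per generated token (the layer count and syncs-per-layer below are assumptions for illustration, not measured values):

```python
# In 2-way tensor parallel, each transformer layer typically needs a
# couple of all-reduces per token, and every one pays the link's
# round-trip latency. Numbers below are illustrative assumptions.

def sync_overhead_ms(layers: int, allreduces_per_layer: int, link_latency_us: float) -> float:
    """Total per-token synchronization latency in milliseconds."""
    return layers * allreduces_per_layer * link_latency_us / 1000.0

layers = 48  # assumed layer count for a mid-size model
for name, latency_us in [("RDMA link, ~3 us", 3.0), ("commodity Ethernet, ~100 us", 100.0)]:
    ms = sync_overhead_ms(layers, 2, latency_us)
    print(f"{name}: ~{ms:.2f} ms of sync latency per token")
```

Under these assumptions the low-latency link costs a fraction of a millisecond per token, while the high-latency path adds several milliseconds, which alone would cap generation around 100 t/s before any compute happens.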
ArtfulGenie69@reddit
All you are doing is reducing the original startup time of the sharding, and that is mitigated by a cache. It will not speed up inference much, especially with a Spark. The model just runs its shards on each machine and combines them; the combination part doesn't take much bandwidth, only the sharding if nothing is cached. That's what I've seen in testing; I got ripping fast inference speeds on a 2.5 GbE switch.
CATLLM@reddit
Latency on the connectx7 is about 3 microseconds.
Ethernet is orders of magnitude higher latency.
The latency adds up when you are running tensor parallel.
I don’t think you know what you’re talking about. Have you even seen a QSFP cable in person before?
CalligrapherFar7833@reddit
You can do RDMA on a CX5 with the same latency and it costs $300; the CX7 is useless for LLMs.
CATLLM@reddit
So can you buy a Spark with a CX5? Why don't you call up Jensen and tell him to make one, 'cause I'd love to buy a cheaper Spark.
CalligrapherFar7833@reddit
That's the whole financial point: you are buying the Spark. The CX7 cannot be used outside of the DGX board, so its cost is 0, or at most $300 compared to a CX5 with the same RDMA. The cost of the CX7 is $1,700 only if you can use it on something else, and you can't.
CATLLM@reddit
If you are not using the CX7 then you are wasting $1700.
The CX7’s value makes up part of the total cost of the machine.
If Nvidia were to make a DGX spark without the CX7 then it would be valued LESS in comparison to one with a CX7.
If you have no use for the CX7 then the spark is not for you. Go get something like the Strix halo which doesn’t have the CX7 and is cheaper.
How hard is it to understand that?
CalligrapherFar7833@reddit
There is no value in using a CX7 that costs $1,700 for something you can do with a CX5 for $300. Nvidia's cost for including that CX7 is $50, not the $1,700 MSRP, which is for the PCIe version.
CATLLM@reddit
This isn't just wrong, it's a complete misunderstanding of how the world works. You're throwing out a $50 number like BOMs are magic, when that doesn't even cover the connectors and PCB, never mind a bleeding-edge ASIC.
Nvidia doesn’t publish their BOM costs. What experience do you have that allows you to determine that the chip costs $50?
And comparing a CX5 to a CX7 like they're interchangeable just proves you have no clue about cost or value. Value isn't "the cheapest thing that exists," it's whether the hardware can actually do the job, and a CX5 literally cannot replace a CX7. Does Nvidia make a Spark with a CX5?
You're not making a point about pricing; you're just advertising that you don't understand hardware, manufacturing, or basic economics.
CalligrapherFar7833@reddit
Don't see a point arguing with you further; you don't understand jack shit about shit.
fastheadcrab@reddit
I agree with you entirely, but it does give me an interesting idea: could someone take several gaming PCs running gaming GPUs and make a cluster of them with these fast networking cards? Wonder if it is possible. Power bills alone may lead to bankruptcy though.
CATLLM@reddit
It's totally possible, and you are right, the power consumption / efficiency angle would be pretty bad. You also have to consider maintaining them. QSFP switches are expensive too.
I've seen some people do that on this sub, and I can see it being a hobby for those people.
Personally, at that point it might be simpler to get an RTX Pro 6000 and call it a day.
kuhunaxeyive@reddit (OP)
You need the ConnectX-7 if you want to connect devices and transfer data during computation. But whether connecting devices with such low memory bandwidth to run bigger models makes sense speed-wise is another question …
CalligrapherFar7833@reddit
And you can do the same with a second-hand mcx516cdat for $300 and 200G IB, so what's your point?
CATLLM@reddit
Have you checked the size and power consumption of the spark? 🙄
CalligrapherFar7833@reddit
Yes ? And ?
CATLLM@reddit
And it's a point of comparison, because size and power consumption are different and that's something to consider.
Do you have trouble comparing size and numbers?
kuhunaxeyive@reddit (OP)
That's actually true. The ConnectX-7 (200 GbE) is integrated, and it actually costs ASUS/NVIDIA far less in BOM than its standalone retail price.
CalligrapherFar7833@reddit
Bom is <50
CATLLM@reddit
That's not how the world works.
kuhunaxeyive@reddit (OP)
Connecting two of those devices would increase memory but not speed. Models that need more memory normally get too slow on this device because of the memory bandwidth limitation, so it wouldn't make sense to connect several. One device seems to be the sweet spot, with only 283 GB/s bandwidth.
CATLLM@reddit
With tensor parallel in vllm it increases speed bro
kuhunaxeyive@reddit (OP)
The guy who optimized the setup explains why he doesn't consider clustering these devices an option: clustering is out of scope because the hardware limit they're hitting is per-Spark memory bandwidth, not something parallelism across nodes easily overcomes for generation speed on a model that already fits.
CATLLM@reddit
I've read his thread. I don't agree with his view. I have two Sparks and cluster them to run Qwen3.5 122B and 397B.
From personal experience, running the 122B on a single Spark is just on the edge of usable. Clustering them together is just right.
kuhunaxeyive@reddit (OP)
CATLLM@reddit
Look at the amount of effort you need to get it running. Maybe that's fine as a hobby, but I'd rather spend the energy and time making money.
Alarming-Ad8154@reddit
Please, google "tensor parallel". Two of anything (GPU, Spark, Mac) will potentially significantly speed up inference; multiple people in this thread have been trying to tell you.
audioen@reddit
Installing this repo seems to be a multi-hour ordeal. I'll report if it starts on a Lenovo ThinkStation PGX when it's done.
kuhunaxeyive@reddit (OP)
This is awesome, thank you for checking that out! Exactly what I was hoping for.
You get this high speed at a favorable near-FP8 configuration, thanks to the hybrid quantization setup.
Qwen 122B-A10B near FP8 at 42 tokens/s average, with lots of headroom for context, on a single Spark/Ascent GX10 for 3,200 USD including tax (in Asia).
You helped me a lot. I'll order the Ascent!
fastheadcrab@reddit
https://spark-arena.com/leaderboard
kuhunaxeyive@reddit (OP)
Thanks! I now understand the reasoning behind the suggestion to connect two. But from what I know, it wouldn't increase speed, and what would I do with more memory? Running bigger models is not feasible, as they would run slower due to the 283 GB/s memory bandwidth limit of these devices. (NVIDIA advertises 1 TB/s, but that's not how it works in practice.)
fastheadcrab@reddit
Depends. What you said is true for dense models that need to compute all parameters for each token, so yes, increasing model size brings huge slowdowns there. But as the name implies, Qwen3.5-122B-A3B only has 3B active parameters, and even the 397B-A17B model only has 17B active parameters. So you do not need nearly as much memory speed as a dense 400B model would. There are better explanations than mine, and benchmarks, out there.
Also, there is some speedup if you get tensor parallel to work, because then GPUs across multiple nodes can compute at once. And larger models will give you better quality.
What are you trying to use the models for? You mention you are looking for models for your use case but don't go into it much. There are several people running the 397B model at 4-bit on 2x Sparks with passable speeds, as you can see on the leaderboard. Is quality needed? You will need to benchmark whatever models you are considering on your own use case. Have you at least tried the 122B Intel quant to see if the answers are acceptable to you? How about speed?
I'd recommend thinking about those factors before buying something that might not be useful to you. Also it can help people here give better recommendations.
As for your original question: if you are capable of getting vLLM working with the Qwen3.5-122B model and setting MTP to 2, you will likely be able to get something like 35-40 tok/s. Also, why an 8,000-token context?
kuhunaxeyive@reddit (OP)
Thank you for all that input. I rely heavily on researching and learning before making a purchase decision. I'd love to go for the 397B-A17B model on two Asus Ascent GX10s, but that one almost doubles the active parameter count, and with 283 GB/s bandwidth I'm afraid it drops under 40 t/s, presumably much lower. But there are folks here with practical experience, and I'm happy to be corrected if it can in fact run at higher speeds on the Spark or Asus Ascent.
About tensor parallel: as far as I understood the NVIDIA GB10 Grace Blackwell Superchip configurations, the memory bandwidth of 283 GB/s is the bottleneck, and it wouldn't help much to double the computational power if the bottleneck is in fact the speed of the memory bus. I read that on this system the processors are in fact already waiting for data.
I want a model that gives a somewhat predictable outcome for writing letters, restructuring data, researching information, and answering knowledge questions from the model's internal knowledge. In my tests, the 35B-A3B model is hit and miss even for repeated questions: for the same question, sometimes the result is perfect, sometimes plain wrong. The 122B-A10B gives me a correct answer every time at full precision, written in a more summarized yet clearer structure. I hope to get near-full-precision results at lower precision. The 122B-A10B is the size that feels good and safe for real-time usage.
As I want to use it as a daily driver for privacy reasons, it needs to be fast enough, and 40 to 50 t/s seems to be the lowest bar it needs to clear.
I have no experience with vLLM but take it as a learning opportunity, and I also hope to get to 50 t/s; see the forum post I linked.
fastheadcrab@reddit
In a 2x tensor-parallel setup, memory bandwidth is additive to a degree when running the same model, compared with a single node. Not 2x, because of overhead and parts that are not parallelizable, but if things are configured correctly, a significant speedup can be observed. You can check out the website I linked, because these things have been benchmarked already.
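A rough Amdahl-style sketch of "additive to a degree" (the parallel fractions and the baseline speed below are assumptions, not measurements):

```python
# Amdahl-style estimate of 2-node tensor-parallel speedup: only the
# parallelizable fraction p of per-token work is split across nodes;
# the rest runs at single-node speed. All numbers are illustrative.

def tp_speedup(nodes: int, parallel_fraction: float) -> float:
    """Ideal speedup when fraction p of the work splits across `nodes`."""
    p = parallel_fraction
    return 1.0 / ((1.0 - p) + p / nodes)

base_tps = 25.0  # hypothetical single-node generation speed
for p in (0.7, 0.85, 0.95):
    print(f"parallel fraction {p:.2f}: ~{base_tps * tp_speedup(2, p):.0f} t/s on 2 nodes")
```

So two nodes land somewhere between 1x and 2x depending on how much of the per-token work actually parallelizes, which is why benchmarks rather than the spec sheet decide whether a second unit is worth it.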
You need to benchmark the modified model and configuration from the post you linked on your actual workflow, and see if the results are still satisfactory in terms of both quality and speed. I do not use that configuration.
If you really need fast performance (it must not drop below 40 t/s) on the 122B, you have two options in my opinion: either buy two Sparks and run a cluster, or buy an RTX 6000 Pro. A single Spark cannot guarantee the performance you are aiming for, especially as the context fills up.
https://spark-arena.com/leaderboard?tab=compare
CATLLM@reddit
It does increase speed when running tensor parallel, and it is supported in vLLM.
Serprotease@reddit
An 8k-token prompt is sub-3.5 s for a single GB10, sub-3 s for a cluster.
Also, you consider 50 tk/s tg to be the usability bar? That's a tall order.
But anyway, token generation speed usually doesn't drop that much with vLLM, so you should expect >40 tk/s well into the 50-60k token range.
Honestly, the biggest hurdle is the vLLM setup. As you might guess from the Nvidia thread, it's not smooth sailing.
chensium@reddit
The repo literally has a one-command install.sh. Not sure how much easier it gets than that. I've been running it for about a day with no issues.
Serprotease@reddit
It ran fine for me until I tried to rebuild the container and it failed to build vLLM. It took me quite a few hours of troubleshooting to get it working again. I'm still learning my way around Docker, vLLM, and clusters, so it may be linked to that as well…
Prudent-Ad4509@reddit
I did some coding today with this model and I find 260k context to be a bit on the low side. And you are talking about 8k context. Looks like you have a very specific and limited task for it.
kuhunaxeyive@reddit (OP)
What is the price of the system you are using for coding?
Prudent-Ad4509@reddit
I have several, with the middle tier based on an old 9700K and the newest one based on a 9950X3D. It is a bit funny to see them bottlenecked at compile/tool calling at first, and at large context processing later on (when approaching 150k). Qwen3.5 can't grasp my code well enough until the context grows to about 100k, stops making stupid mistakes after 150k, and then steadily grinds to a halt while approaching 262k.
kuhunaxeyive@reddit (OP)
They tested it at a 240k-token context and the generation speed only went down to 48 t/s.
FatheredPuma81@reddit
I really feel like you can get better for less...
kuhunaxeyive@reddit (OP)
From what I know, at that price point and speed, only a Mac Studio would be comparable, but it's not Linux, and it's more expensive.
FatheredPuma81@reddit
Wouldn't two Intel B70s get better performance? And that lets you upgrade to a third or fourth. There are also better alternatives to the B70 that give even more performance, I think.
kuhunaxeyive@reddit (OP)
Intel cards let you add a 3rd/4th easily, but for interactive single-user inference on 122B, multi-card adds latency (all-reduce overhead) without proportional gains in generation speed.
Specter_Origin@reddit
Where do you get a GB10 for 3500? Their official price is ~4500k, and at that price a Mac Studio would be a better buy...
kuhunaxeyive@reddit (OP)
Did you look up the price of the 4 TB model? The 1 TB model is 3,600 USD in the US, 3,860 USD in Europe, and 3,250 USD in Asia.
Ell2509@reddit
4500k would be 4.5 million. I pray we never get there lol.
Specter_Origin@reddit
Ty! fixed it