Qwen 3.5 122B A10B running 50tok/s on DGX SPARK / Asus Ascent

Posted by Storge2@reddit | LocalLLaMA | 25 comments

Hello guys, wanted to share this:

https://github.com/albond/DGX_Spark_Qwen3.5-122B-A10B-AR-INT4

I am running the INT4 v2 build on my DGX Spark with the maximum context window, and I'm getting ~50 tok/s with Multi-Token Prediction:

It's working great for tool calling in both Open WebUI and Opencode. I can recommend it to anybody using a Spark with 128 GB of unified memory; it's probably the best model for 128 GB devices right now. What is your experience? For me it's been really good so far, especially with SearXNG in Opencode and in Open WebUI: queries that need a lot of knowledge and recent information (investing, etc.) can easily trigger 10+ website fetches and 50+ web-search calls.
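Open WebUI and Opencode both talk to the server through its OpenAI-compatible API. A minimal smoke test, assuming the host/port (`localhost:8000`) and model name (`qwen`) from the benchmark command further down, and that the container's port is published to the host:

```shell
# Hypothetical smoke test for the OpenAI-compatible endpoint the UIs use.
# Host, port, and model name are taken from the bench command below; adjust
# them for your own setup.
BODY='{"model": "qwen", "messages": [{"role": "user", "content": "ping"}], "max_tokens": 8}'

# Only fire the request if the server is actually reachable, so this is
# safe to paste anywhere.
if curl -sf http://localhost:8000/v1/models > /dev/null 2>&1; then
  curl -s http://localhost:8000/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d "$BODY"
else
  echo "server not reachable at localhost:8000"
fi
```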

For more info, check out Albond's post on the NVIDIA forum:
https://forums.developer.nvidia.com/t/qwen3-5-122b-a10b-on-single-spark-up-to-51-tok-s-v2-1-patches-quick-start-benchmark/365639/255

________

╔══════════════════════════════════════════════════════╗
║ Qwen3.5-122B-A10B Benchmark: v2
║ Mon Apr 13 04:07:56 PM CEST 2026
╚══════════════════════════════════════════════════════╝

── Run 1/2 ──────────────────────────────────────
[Q&A] 256 tokens in 5.08s = 50.3 tok/s (prompt: 23)
[Code] 498 tokens in 9.48s = 52.5 tok/s (prompt: 30)
[JSON] 1024 tokens in 19.85s = 51.5 tok/s (prompt: 48)
[Math] 64 tokens in 1.33s = 48.1 tok/s (prompt: 29)
[LongCode] 2048 tokens in 37.44s = 54.7 tok/s (prompt: 37)

── Run 2/2 ──────────────────────────────────────
[Q&A] 256 tokens in 5.11s = 50.0 tok/s (prompt: 23)
[Code] 512 tokens in 9.71s = 52.7 tok/s (prompt: 30)
[JSON] 1024 tokens in 20.15s = 50.8 tok/s (prompt: 48)
[Math] 64 tokens in 1.33s = 48.1 tok/s (prompt: 29)
[LongCode] 2048 tokens in 37.69s = 54.3 tok/s (prompt: 37)
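The decode tok/s figures are just generated tokens divided by wall-clock seconds; here's a quick sketch that recomputes Run 1 (awk handles the float math, since POSIX shells only do integer arithmetic). Any 0.1 tok/s difference from the printed numbers is down to rounding:

```shell
# Decode throughput = generated tokens / wall-clock seconds.
# toks() divides in awk because the shell itself can't do floats.
toks() { awk -v t="$1" -v s="$2" 'BEGIN { printf "%.1f", t / s }'; }

# Recompute Run 1's figures (token counts and timings copied from above).
for row in "Q&A 256 5.08" "Code 498 9.48" "JSON 1024 19.85" "LongCode 2048 37.44"; do
  set -- $row
  echo "[$1] $2 tok / ${3}s = $(toks "$2" "$3") tok/s"
done
```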

Albond's `bench_qwen35.sh` measures decode only. Here's the prefill side for anyone else curious about the performance:

printf "\n%-12s %-18s %-22s\n" "Input tok" "Mean TTFT (ms)" "Prefill tok/s"; \
  printf "%-12s %-18s %-22s\n" "---------" "--------------" "-------------"; \
  for L in 1000 4000 16000 32000 64000; do \
    OUT=$(docker exec vllm-qwen35 vllm bench serve \
      --backend openai-chat \
      --base-url http://localhost:8000 \
      --endpoint /v1/chat/completions \
      --model qwen \
      --tokenizer /models/qwen35-122b-hybrid-int4fp8 \
      --dataset-name random \
      --random-input-len $L \
      --random-output-len 1 \
      --num-prompts 1 \
      --max-concurrency 1 \
      --disable-tqdm 2>&1); \
    TTFT=$(echo "$OUT" | grep "Mean TTFT" | awk '{print $NF}'); \
    THR=$(echo "$OUT" | grep "Total token throughput" | awk '{print $NF}'); \
    printf "%-12s %-18s %-22s\n" "$L" "$TTFT" "$THR"; \
  done; echo ""

Input tok    Mean TTFT (ms)    Prefill tok/s
---------    --------------    -------------
1000         575.17            1739.94
4000         1912.80           2091.56
16000        8097.00           1976.13
32000        17512.64          1827.29
64000        40866.12          1566.11
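With `--random-output-len 1`, nearly the whole request is prompt processing, so the reported "Total token throughput" should roughly equal input tokens divided by TTFT. A quick cross-check against the table (small deviations come from request overhead that's counted in total time but not in TTFT):

```shell
# Cross-check: prefill tok/s ~ input tokens / TTFT.
# TTFT values (in ms) copied from the table above.
for row in "1000 575.17" "4000 1912.80" "16000 8097.00" "32000 17512.64" "64000 40866.12"; do
  set -- $row
  awk -v n="$1" -v ms="$2" 'BEGIN { printf "%-8d ~ %.0f tok/s\n", n, n / (ms / 1000) }'
done
```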