Qwen 3.5 122B A10B running 50tok/s on DGX SPARK / Asus Ascent
Posted by Storge2@reddit | LocalLLaMA | View on Reddit | 25 comments
Hello guys, wanted to share this:
https://github.com/albond/DGX_Spark_Qwen3.5-122B-A10B-AR-INT4
I am running it on my DGX Spark (int4 v2) with the max context window and am getting 50 tok/s with Multi-Token Prediction:
It's working great for tool calling in both OpenWebUI and Opencode; I can recommend it to anybody using a Spark with 128 GB of unified memory. It's probably the best model for 128 GB devices right now. What is your experience? For me it's been really good so far, especially with SearXNG in Opencode and in OpenWebUI. It can easily handle 10+ website fetches and 50+ web-search calls for queries that require a lot of knowledge and recent information (investing, etc.).
For more info check out Albond's post on the NVIDIA forum:
https://forums.developer.nvidia.com/t/qwen3-5-122b-a10b-on-single-spark-up-to-51-tok-s-v2-1-patches-quick-start-benchmark/365639/255
________
```
╔══════════════════════════════════════════════════════╗
║ Qwen3.5-122B-A10B Benchmark: v2
║ Mon Apr 13 04:07:56 PM CEST 2026
╚══════════════════════════════════════════════════════╝
── Run 1/2 ──────────────────────────────────────
[Q&A]       256 tokens in  5.08s = 50.3 tok/s (prompt: 23)
[Code]      498 tokens in  9.48s = 52.5 tok/s (prompt: 30)
[JSON]     1024 tokens in 19.85s = 51.5 tok/s (prompt: 48)
[Math]       64 tokens in  1.33s = 48.1 tok/s (prompt: 29)
[LongCode] 2048 tokens in 37.44s = 54.7 tok/s (prompt: 37)
── Run 2/2 ──────────────────────────────────────
[Q&A]       256 tokens in  5.11s = 50.0 tok/s (prompt: 23)
[Code]      512 tokens in  9.71s = 52.7 tok/s (prompt: 30)
[JSON]     1024 tokens in 20.15s = 50.8 tok/s (prompt: 48)
[Math]       64 tokens in  1.33s = 48.1 tok/s (prompt: 29)
[LongCode] 2048 tokens in 37.69s = 54.3 tok/s (prompt: 37)
```
Albond's `bench_qwen35.sh` measures decode only. Here's the prefill side for anyone else curious about the performance:
```
printf "\n%-12s %-18s %-22s\n" "Input tok" "Mean TTFT (ms)" "Prefill tok/s"; \
printf "%-12s %-18s %-22s\n" "---------" "--------------" "-------------"; \
for L in 1000 4000 16000 32000 64000; do \
  OUT=$(docker exec vllm-qwen35 vllm bench serve \
    --backend openai-chat \
    --base-url http://localhost:8000 \
    --endpoint /v1/chat/completions \
    --model qwen \
    --tokenizer /models/qwen35-122b-hybrid-int4fp8 \
    --dataset-name random \
    --random-input-len $L \
    --random-output-len 1 \
    --num-prompts 1 \
    --max-concurrency 1 \
    --disable-tqdm 2>&1); \
  TTFT=$(echo "$OUT" | grep "Mean TTFT" | awk '{print $NF}'); \
  THR=$(echo "$OUT" | grep "Total token throughput" | awk '{print $NF}'); \
  printf "%-12s %-18s %-22s\n" "$L" "$TTFT" "$THR"; \
done; echo ""
```
```
Input tok    Mean TTFT (ms)     Prefill tok/s
---------    --------------     -------------
1000         575.17             1739.94
4000         1912.80            2091.56
16000        8097.00            1976.13
32000        17512.64           1827.29
64000        40866.12           1566.11
```
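As a sanity check on the table above: the prefill throughput column is essentially input tokens divided by TTFT (the benchmark also counts the single output token and total latency, so the numbers differ by a fraction of a percent). A quick awk sketch to reproduce it:

```shell
# Reproduce the "Prefill tok/s" column from the TTFT column:
# throughput ≈ input tokens / (TTFT in seconds).
awk 'BEGIN {
  split("1000 4000 16000 32000 64000", tok, " ")
  split("575.17 1912.80 8097.00 17512.64 40866.12", ttft, " ")
  for (i = 1; i <= 5; i++)
    printf "%-8d %.1f tok/s\n", tok[i], tok[i] / (ttft[i] / 1000)
}'
```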
anzzax@reddit
I'm doing aider bench runs to find the best vLLM quantization for the Spark. Below is a table with a single run of each of several popular weights; more runs are needed to compare averages.
The most important numbers for me are Pass Rate 2 and Error Outputs.
audioen@reddit
Hmm, I tested this QuantTrio model using the albond vllm-qwen35-v2 image, and the model came up seemingly working, but on my first prompt it proved to be completely confused: it found bogus issues in code, cited incorrect line numbers, and clearly hadn't understood the most basic behavior correctly. I doubt a Q2 GGUF of the model would have been this bad.
Possibly some of the patches applied to the Docker image for the int4 AutoRound build are causing the issue, but it is really hard to say. I'm a complete noob with this vLLM stuff, and I find it to be total chaos, with random images everywhere, half of which don't even build from their own Dockerfiles. Add to this the huge amount of variability in configuration and it feels like quite a mess.
I've yet to try sglang. I think I'm through with vllm and desperately want to leave it behind forever.
anzzax@reddit
I use the standard eugr Docker image; here is the recipe I created: https://gist.github.com/anzax/e27179c5b74aa72cdf4ca56dcc560f8d
Storge2@reddit (OP)
Hey, this is truly great, u/anzzax. How much throughput (prefill and decode) are you getting for each of them? Also, which KV-cache quant are you running? Do they all have the same or similar configuration? Otherwise it might not be a fair comparison. Also, do you plan to do this for other models as well, like the Nemotron Super, to see how well they perform?
anzzax@reddit
Just want to add that with Albond's merged int4 and Docker image there is ~50 tps for a single request, but for some reason it doesn't scale with n4; it goes only to ~60 tps. After all these tests I'll stick with QuantTrio AWQ. It's a bit of an unexpected discovery: on the DGX forum the consensus is that int4 is the best option for the Spark, but it turned out AWQ is better. There is active development around NVFP4, so soon it could be the best option, but it's slightly bigger, so not much VRAM is left for KV cache, and my use cases are agentic parallel workflows.
Storge2@reddit (OP)
Amazing research and very helpful indeed my friend. You should post this in the forum :)
anzzax@reddit
I need to properly benchmark prefill and generation. Below is a Grafana chart of the aider runs with 4 parallel requests: batch-n4 generation throughput for int4 is ~70, for AWQ it is below 60. Single-request int4 generation is 40+, AWQ is ~30. I use speculative decoding with MTP=2 for all runs. All models are 4-bit; you can check HF for more details. The generation parameters stay the same for all runs:
```
use_temperature: 0.7
extra_params:
  top_p: 0.8
  top_k: 20
  min_p: 0.0
  presence_penalty: 1.5
  repetition_penalty: 1.0
extra_body:
  chat_template_kwargs:
    enable_thinking: false
```
Sticking_to_Decaf@reddit
Is this better than running an NVFP4? Spark supports NVFP4, right?
Storge2@reddit (OP)
Yes. NVFP4 on the Spark is currently not optimized, so INT4 runs better at the moment. In the future I think NVFP4 will probably beat it on both speed and quality.
mr_zerolith@reddit
That's really impressive for a Spark; I'm surprised to see those numbers.
Try Step 3.5 Flash. It's 197B, but context is cheap, and despite its extra size it's faster than Qwen 3.5 122B.
Storge2@reddit (OP)
Is it better than the Qwen 3.5 122B, though?
Glittering-Call8746@reddit
So what's the realistic output tok/s..
Juulk9087@reddit
I get 110 with my RTX pro 6000 with reasoning on
Saren-WTAKO@reddit
At least 45t/s, and 55t/s for simple tasks. It depends on the content and speculative decoding acceptance rate, which is normally 80%.
Glittering-Call8746@reddit
Ah.. it's with speculative decoding.. ok nvm
Saren-WTAKO@reddit
What's wrong with speculative decoding? AFAIK it's lossless if it's implemented correctly.
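For intuition on what an 80% acceptance rate buys: with a draft of k tokens and per-token acceptance probability a, speculative decoding emits on average (1 − a^(k+1)) / (1 − a) tokens per target-model forward pass. A rough sketch using the numbers from this thread (MTP=2 and 80% acceptance; real acceptance varies by prompt, and draft/verify overhead eats into the gain):

```shell
# Expected tokens emitted per target forward pass for speculative
# decoding with draft length k and per-token acceptance rate a:
#   E[tokens/step] = (1 - a^(k+1)) / (1 - a)
# With a=0.8, k=2: (1 - 0.512) / 0.2 = 2.44, i.e. up to ~2.4x decode
# speedup in the ideal case.
awk -v a=0.8 -v k=2 'BEGIN {
  printf "expected tokens per step: %.2f\n", (1 - a^(k+1)) / (1 - a)
}'
```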
coder543@reddit
You are not getting 50 tokens per second. I hate to be the unpopular person, but what the vLLM logs say is false. You must measure with a client like llama-benchy.
Saren-WTAKO@reddit
I am getting 50 tokens per second, and it already works for me in Claude Code or Cline. I asked the LLM to output 289 tokens, and it output 289 tokens in 5.668s, which is 50.99 t/s excluding KV cache read latency and request overhead.
Here is the command (2nd+ run):
```
time curl http://localhost:8000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{"model":"qwen","messages":[{"role":"user","content":"Repeat with me: The quick brown fox jumps over the lazy dog, its movements fluid and purposeful in the morning light. Meanwhile, a clever rabbit watches from nearby bushes, intrigued by the scene unfolding before its eyes. The fox continues its playful pursuit, demonstrating remarkable agility and grace in motion. The forest seems to hold its breath, watching this dance of predator and prey. Sunlight filters through the canopy, casting dappled shadows on the forest floor. The air is crisp and filled with the subtle sounds of nature'\''s symphony. Birds chirp their morning songs, while the gentle rustling of leaves adds a peaceful backdrop to this woodland scene. Every creature plays its part in this daily performance. A gentle breeze carries the scent of wildflowers and pine needles through the air. The morning dew glistens on spider webs strung between branches like delicate crystal necklaces. In the distance, a woodpecker'\''s rhythmic tapping echoes through the trees, nature'\''s own percussion section. The forest floor is a tapestry of fallen leaves, moss, and tiny mushrooms, each adding to the rich ecosystem. Small insects busily navigate this miniature landscape, carrying out their important roles in the grand scheme of things. The interplay of light and shadow creates an ever-changing pattern on the ground, as clouds drift lazily across the sky above the canopy. This is nature'\''s theater, where every day brings a new act, a new story to unfold"}], "chat_template_kwargs": {"enable_thinking": false}}'
```
And here is the command's output:
```
{"id":"chatcmpl-9b796786a58e4814","object":"chat.completion","created":1776135564,"model":"qwen","choices":[{"index":0,"message":{"role":"assistant","content":"The quick brown fox jumps over the lazy dog, its movements fluid and purposeful in the morning light. Meanwhile, a clever rabbit watches from nearby bushes, intrigued by the scene unfolding before its eyes. The fox continues its playful pursuit, demonstrating remarkable agility and grace in motion. The forest seems to hold its breath, watching this dance of predator and prey. Sunlight filters through the canopy, casting dappled shadows on the forest floor. The air is crisp and filled with the subtle sounds of nature's symphony. Birds chirp their morning songs, while the gentle rustling of leaves adds a peaceful backdrop to this woodland scene. Every creature plays its part in this daily performance. A gentle breeze carries the scent of wildflowers and pine needles through the air. The morning dew glistens on spider webs strung between branches like delicate crystal necklaces. In the distance, a woodpecker's rhythmic tapping echoes through the trees, nature's own percussion section. The forest floor is a tapestry of fallen leaves, moss, and tiny mushrooms, each adding to the rich ecosystem. Small insects busily navigate this miniature landscape, carrying out their important roles in the grand scheme of things. The interplay of light and shadow creates an ever-changing pattern on the ground, as clouds drift lazily across the sky above the canopy. 
This is nature's theater, where every day brings a new act, a new story to unfold.","refusal":null,"annotations":null,"audio":null,"function_call":null,"tool_calls":[],"reasoning":null},"logprobs":null,"finish_reason":"stop","stop_reason":null,"token_ids":null}],"service_tier":null,"system_fingerprint":null,"usage":{"prompt_tokens":303,"total_tokens":592,"completion_tokens":289,"prompt_tokens_details":null},"prompt_logprobs":null,"prompt_token_ids":null,"kv_transfer_params":null}curl http://localhost:8000/v1/chat/completions -H -d 0.00s user 0.00s system 0% cpu 5.668 total
```
coder543@reddit
No... llama-benchy just sends prompts and measures how the model performs. There is no "support" needed.
I'm not sure what you're doing. You don't provide enough detail. But, to be clear, here is why it is physically impossible:
MTP generates 2 or 3 tokens quickly, then the model must verify them. For a MoE, this means it still has to load the experts needed by each of those tokens. Unfortunately, token generation is bandwidth limited. Each additional set of experts means you're not gaining any speedup. Even for extremely obvious "copy the input text" prompts, you still don't get to batch any of the experts for multiple tokens because you're still only handling 2 or 3 tokens at a time, and each one is going to require random experts.
If you were using self-speculation like ngram-mod on llama-server, then you could speculate dozens of tokens at a time for that "copy the input text" case, and you would see speedup, because multiple tokens are getting batched for each expert. MTP simply doesn't predict enough tokens at a time.
What is the physical mechanism for the speedup? MTP is not magic. It works in large batch sizes (processing multiple requests at once), and it works for dense models. It does not work for MoE at batch size 1.
Claiming that this is just llama-benchy failing makes no sense.
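For reference, a back-of-envelope version of the bandwidth argument. All numbers below are assumptions, not measurements: ~273 GB/s is the commonly cited unified-memory bandwidth for the Spark, ~10B active parameters comes from the "A10B" in the model name, and ~4.5 bits/weight allows for int4 scales and metadata:

```shell
# Batch-1 decode ceiling: each generated token must stream the active
# weights from memory once, so tok/s <= bandwidth / bytes-per-token.
# Assumed: ~273 GB/s bandwidth, ~10B active params, ~4.5 bits/weight.
awk -v bw=273 -v params_b=10 -v bits=4.5 'BEGIN {
  gb_per_tok = params_b * bits / 8        # GB streamed per token
  printf "decode ceiling: %.1f tok/s\n", bw / gb_per_tok
}'
```

Under those assumptions the plain-decode ceiling lands in the ~50 tok/s range before any speculation; how the measured numbers relate to it depends on how much of that bandwidth the kernels actually achieve.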
Saren-WTAKO@reddit
OK, if you insist the speedup works only for text copying, here is a coding prompt that does not copy text: 10.381s for 500 output tokens.
Results
coder543@reddit
I am definitely intrigued now, but again, no one can provide an explanation of how this is physically possible. Is that "2nd+ run" again, or is that how it performs when provided with that prompt for the first time? Again, MTP is not magic. Bandwidth is bandwidth, and llama-benchy is the same as any other client. If llama-benchy can't measure speedup, someone needs to explain why. It is doing exactly what you're doing, measuring how many tokens are generated in a given number of seconds. No trickery.
Saren-WTAKO@reddit
It's simple: llama-benchy measures the frequency of the LLM's stream output, which is accurate with speculative decoding off. I also discovered this while trying to measure `for chunk in client.chat.completions.create()`; it turns out the stream is buffered with spec decoding on.
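One way to sidestep stream buffering entirely is to time a non-streaming request and take the token count from the server's `usage` field. A minimal sketch, assuming the same local vLLM endpoint and `jq` installed (this includes prefill time, so it slightly understates pure decode speed):

```shell
# Time a non-streaming completion and compute tok/s from the server's
# own completion_tokens count, avoiding stream-chunk timing artifacts.
START=$(date +%s.%N)
RESP=$(curl -s http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model":"qwen","messages":[{"role":"user","content":"Write a short story about a fox."}]}')
END=$(date +%s.%N)
TOK=$(echo "$RESP" | jq '.usage.completion_tokens')
awk -v t="$TOK" -v s="$START" -v e="$END" \
  'BEGIN { printf "%d tokens in %.2fs = %.1f tok/s\n", t, e - s, t / (e - s) }'
```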
coder543@reddit
Ok, I'm deleting my comments above out of an abundance of caution. I will need to try this out myself sometime soon. I have tried various MTP vLLM MoE solutions myself in the past, and it has never been visibly faster to the naked eye. Maybe something finally got fixed. Maybe the regular autoregressive decode is doing something wrong and leaving bandwidth on the table that the MTP batching is using.
Saren-WTAKO@reddit
https://gist.github.com/Saren-Arterius/519d376329c86de4346681e5a9b6452b
Yeah, also don't worry about the "2nd run" thing. I tried it with a KV-cache-busting prompt and it still works.
Uninterested_Viewer@reddit
Your experience is my experience: Qwen 122B is the sweet spot for a single-node Spark setup right now if you're trying to do anything interactive. I'm really liking Gemma4 31B dense, but that's not going to cut it for anything except background work on a Spark.