I just bought an Asus Ascent (Nvidia GB10 DGX) and it is slower than my Ryzen AI Max
Posted by Voxandr@reddit | LocalLLaMA | 43 comments
It is supposed to be 2-4x faster, but I am only getting 6 tk/s on Gemma4-31B. What am I doing wrong?
- Inference engine: llama-cpp, latest as of 15th May 2026, built myself via https://ggml.ai/dgx-spark.sh
- Tested models:
  - Step3.5-Apex-I-Quality: DGX 27 tk/s, AI Max 30 tk/s
  - gemma-4-31B-it-UD-Q8_K_XL: DGX 6.19 tk/s, AI Max 7.10 tk/s
Config:
llama-server --models-preset /home/dgx/models/models.ini --models-dir /home/dgx/models/ --host 0.0.0.0 --port 8080 --models-max 1 --parallel 1
models.ini:
[*]
threads = 12
flash-attn = on
mlock = off
mmap = off
fit = on
warmup = on
; batch-size = 4096
; ubatch-size = 512
cache-type-k = q8_0
cache-type-v = q8_0
jinja = true
direct-io = on
cache-prompt = true
cache-reuse = 256
cache-ram = 32768
reasoning-format = auto
n-gpu-layers = 999
dtdisapointingresult@reddit
Spark and Strix Halo showed identical token generation in all the benchmarks I saw before I bought mine. Only prompt processing is faster on the Spark (significantly so).
Don't run the 31B without MTP; that's a free ~50% speed boost you're leaving on the table.
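If your build doesn't have the MTP PR yet, mainline llama-server's draft-model speculative decoding is a similar multi-token-per-step trick. A rough sketch (model paths made up, not a tested config):
llama-server -m gemma4-31b-it-UD-Q8_K_XL.gguf -md gemma4-draft-q4_0.gguf --draft-max 16 --draft-min 4 --host 0.0.0.0 --port 8080
The draft model proposes a run of tokens and the big model validates them in one batched pass, so decode gets faster even though memory bandwidth is unchanged.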
laul_pogan@reddit
The 2-4x claim is for prefill (prompt processing), not decode. Single-token decode is purely memory bandwidth limited, and both machines sit around 270-279 GB/s LPDDR5X, so you will always land within 10-15% of each other there. The GB10's extra compute (GPU tensor cores) only helps when you are processing tokens in parallel.
To see the actual gap: benchmark prefill throughput on a 4k-8k token prompt at batch size 1. On the Spark you will see 3-5x faster prompt processing vs the AI Max. For interactive single-stream chat, the hardware is roughly equivalent. The marketing was not wrong, just measuring a different thing than you are.
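An easy way to measure both numbers separately is llama.cpp's bundled llama-bench (model path illustrative):
llama-bench -m gemma-4-31B-it-UD-Q8_K_XL.gguf -p 4096 -n 128 -fa 1
The pp4096 row is prefill throughput, where the Spark should pull ahead; the tg128 row is single-stream decode, where the two machines will look nearly identical.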
PositiveBit01@reddit
Check out https://github.com/spark-arena/sparkrun
But it's not 2-4x faster. It has more compute, but both the Spark and AI Max machines are memory-bound by the "slow" unified memory (compared to GDDR), and since they have roughly the same memory bandwidth you'll see similar results.
Prompt processing should be faster, though. Token generation will be similar.
Voxandr@reddit (OP)
Thanks a lot, checking it out. Surprised that there are only a few NVFP4 models.
PositiveBit01@reddit
Yeah... read more about NVFP4 on the Spark. It's still in development as far as I know. Some people have made changes that get it working, but out of the box it'll fall back to a slow backend (unless this has changed recently).
The GB10 is more like an RTX 5000-series card than their data center GPUs, despite the marketing.
Voxandr@reddit (OP)
https://github.com/albond/DGX_Spark_Qwen3.5-122B-A10B-AR-INT4
This looks promising.
https://forums.developer.nvidia.com/t/qwen3-5-122b-a10b-nvfp4-quantized-for-dgx-spark-234gb-75gb-runs-on-128gb/361819/45
cursortoxyz@reddit
albond/DGX_Spark_Qwen3.5-122B-A10B-AR-INT4 is amazing! The only problem is that it takes up a lot of disk space if you want to run it at full speed, due to the very specific version requirements for vLLM and PyTorch.
PositiveBit01@reddit
Yeah, there's a lot of stuff you can run. I mostly use the sparkarena recipes, because getting the different models to work well is not straightforward.
That one takes a bit too much RAM for me, because I also run some LXC containers and eventually want to add other stuff; it's more of a multi-use server for me than just an LLM host.
So far I like Qwen3 Coder Next, Qwen3.6 35B, and Gemma4 26B. The one you linked looks like a good one, and there is a sparkarena recipe for it. Looks like INT4 AutoRound may work better: https://spark-arena.com/leaderboard
Voxandr@reddit (OP)
Thanks, the 122B is my daily driver these days on the AI Max, so I'm trying this right now.
The rest I will run on llama.cpp; they said Qwen 3.6 is the same architecture over there.
PositiveBit01@reddit
There is already a sparkrun recipe; I recommend you use it. It uses vLLM, via eugr's container setup from https://github.com/eugr/spark-vllm-docker
You can get it working with llama.cpp, but again, this stuff is under active development. If you aren't super familiar with what's going on, I highly recommend using a recipe that's already created and benchmarked.
Eugr is on the forums and actively doing development here, seems very talented, and supports the Spark community. I would just use that stuff instead of trying things on your own, but it's up to you.
Voxandr@reddit (OP)
Can't find Albond's recipe in `sparkrun list`. Which one is it?
PositiveBit01@reddit
Oh yeah, you have to copy it from that site and run the YAML, but if there are modifications it only works from the git repo. The one you cited should be good, though. You may need to run it with some parameters to make sure it only tries to run on a single Spark.
Be careful on the Spark Arena page: the cluster size is important. Two Sparks have almost 2x the memory bandwidth, so even small models that easily fit in memory on one will show better performance when there's more than one. That's easy to overlook on the site.
Voxandr@reddit (OP)
Yeah, I just noticed that. I am now running Albond's.
It downloads the AutoRound weights, then patches 39 layers with the original FP8s, and is now building vLLM, patched with some PRs.
Looks interesting.
Miami_lord@reddit
What did you expect 😅
Voxandr@reddit (OP)
about 200 tk/s .... lol
gh0stwriter1234@reddit
Maybe you are confusing batch tokens with interactive token rate... a lot of times you'll see someone run 8-16 parallel batches and spit out some crazy token rate, but each session will only be generating 1/16th or 1/8th of that rate (e.g., an aggregate 200 tk/s across 16 streams is only ~12.5 tk/s per session).
Voxandr@reddit (OP)
I understand, just joking about the advertised performance.
But the INT4 AutoRound + FP8 replacement technique from Albond's work is very impressive, and I am running it now (almost done downloading the models).
At least it should give a 3x perf gain.
ImportancePitiful795@reddit
🤣
BankjaPrameth@reddit
For the Spark, try using vLLM. Good resource here: https://github.com/eugr/spark-vllm-docker
However, token generation (decode) speed relies on memory bandwidth, and both devices have almost equal memory bandwidth, so you will not see much improvement there.
The noticeable improvement is prompt processing (prefill) speed. There it's a night-and-day difference, especially when you run the model with vLLM.
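Once eugr's container is up, serving is a single command; a minimal sketch (model id illustrative, not an exact repo name):
vllm serve Qwen/Qwen3.5-122B-A10B-NVFP4 --max-model-len 32768 --gpu-memory-utilization 0.90
That exposes an OpenAI-compatible API on port 8000, so your existing clients keep working.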
Voxandr@reddit (OP)
So with smaller quants, could it be faster, or is it still limited by bandwidth?
BankjaPrameth@reddit
All token generation is limited by bandwidth, but we can now cheat that a bit by using MTP or similar.
Still, like I said, you'll not get much improvement here from the Spark vs the AI Max.
The benefit of the Spark is prompt processing. That can be 2-4x faster.
gh0stwriter1234@reddit
Yep, MTP takes my MI50s from 24 tk/s to about 35-40 tk/s on Qwen 3.6 27B Q4_1.
sn2006gy@reddit
Smaller quants could be faster, but I'd go for MoE models; Qwen3.6 35B A3B should perform much better than Gemma.
Voxandr@reddit (OP)
It performs well on the AI Max too. I am going to test the NVFP4 version of it.
Healthy-Nebula-3603@reddit
What did you expect?
LLMs are limited by memory bandwidth, and the DGX has only 279 GB/s of RAM bandwidth... so it sucks at this.
The DGX works better with diffusion models (picture/video generation), where memory bandwidth is not as important.
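Back-of-the-envelope check: decode streams the full weights once per token, so with roughly 33 GB of Q8 weights for a 31B model (rough figure, not measured), the hard ceiling is about 279 GB/s / 33 GB ≈ 8.5 tk/s. The 6.19 tk/s you measured is right in that ballpark.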
Voxandr@reddit (OP)
Thanks, that makes sense. I should just buy another AI Max and RDMAoE them together.
gh0stwriter1234@reddit
You should probably sell all of them and build a server with 8x R9700s, if you have that kind of money to throw around.
uti24@reddit
Not really; in single-stream LLM inference it's expected to be only ~7% faster, because this kind of task is limited by memory bandwidth: the AMD AI Max 395 has 256 GB/s and the GB10 has 273 GB/s.
That said, it still should be somewhat faster.
gh0stwriter1234@reddit
Be sure to check out the MTP PR build on both your machines for models that support it... it currently boosts token generation at the expense of prompt processing speed (to be resolved at a later date).
Pleasant-Shallot-707@reddit
The memory bandwidth is garbage, so token generation isn't any faster than a Strix.
Voxandr@reddit (OP)
Trying https://github.com/albond/DGX_Spark_Qwen3.5-122B-A10B-AR-INT4
If it works well, I would get a 2x-3x perf gain over the Strix (the Strix doesn't have such support in vLLM; not sure it's even possible).
Voxandr@reddit (OP)
Looks like this is the most promising setup:
https://github.com/albond/DGX_Spark_Qwen3.5-122B-A10B-AR-INT4
Kryohi@reddit
> It is supposed to be 2-4x faster
It isn't
gusbags@reddit
Depends on the quant and backend: Qwen 3.6 35B A3B INT8 AutoRound hits 6K tk/s PP and 60 tk/s TG on vLLM.
The Spark has an advantage over Strix Halo in PP speed and a small advantage in TG. Another big plus is when you get multiple Sparks: that CX7 link allows you to run decent-sized larger MoEs like Qwen 3.5 379B on 2 Sparks at usable speeds.
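For the curious, a two-Spark vLLM launch usually rides on a Ray cluster over that CX7 link; a rough sketch (addresses and model id illustrative):
ray start --head --port=6379              # on the first Spark
ray start --address=<head-ip>:6379        # on the second Spark
vllm serve Qwen/Qwen3.5-379B-A10B --tensor-parallel-size 2   # back on the head node
Sharding the model across both GPUs also puts both memory buses to work, which is where the bandwidth scaling comes from.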
Possible-Pirate9097@reddit
There is one 395 box that has a PCIe slot, which would allow for something similar; that Alex Ziskind guy has a new video on setting them up in a cluster.
LegitimateCopy7@reddit
it's "advertised" to be faster.
this is why tech journalism exist. assume everything companies or tech bros say are misleading or straight up lies because that's what they are.
Silicon Valley is all about "fake it until you make it". this is what defines them. also no, they will never be held accountable if you're wondering that.
Voxandr@reddit (OP)
straight up lies...
ren_in_rome@reddit
You should be on Nvidia's forums:
https://forums.developer.nvidia.com/t/how-to-run-the-gemma4-assistant-models-using-eugrs-custom-vllm-fork/370194
Icy_Programmer7186@reddit
Try using vLLM.
Voxandr@reddit (OP)
But I am using a llama.cpp build with CUDA; on my RTX 4070 Ti Super it is a lot faster than the AI Max (also on llama.cpp).
Grouchy-Bed-7942@reddit
Normally, the generation speed is proportional to the bandwidth, as mentioned in the previous comments.
Where the GX10 will surpass your AI Max is in prompt processing (PP) and its vLLM support, which lets it handle 4 to 5 requests in parallel without any issue (whereas with llama.cpp on your AI Max you'll be in trouble).
Another advantage compared to a 4070 Super is that you can run FP8 models that don't fit in its VRAM, or larger models like Qwen3.5 122B.
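If you want to see the parallel-requests difference for yourself, fire a few completions at once against the OpenAI-compatible endpoint (vLLM's default port; model name illustrative):
for i in 1 2 3 4; do
  curl -s http://localhost:8000/v1/completions \
    -H "Content-Type: application/json" \
    -d '{"model": "qwen3.5-122b", "prompt": "Hello", "max_tokens": 128}' &
done
wait
vLLM's continuous batching keeps aggregate throughput high as streams are added, while a single-stream llama.cpp setup will roughly divide its rate among them.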
Voxandr@reddit (OP)
Why am I getting downvoted?
The Stepfun model is not a dense one, and the performance is still the same.
-dysangel-@reddit
dense models are bandwidth bound, not compute bound