Reality check on 50 t/s for Qwen3.5-122B-A3B on a 3,500 USD device
Posted by kuhunaxeyive@reddit | LocalLLaMA | View on Reddit | 67 comments
I found an optimization that achieves 51 tokens/s (48 for very long contexts) for Qwen3.5-122B-A3B, and the person who did it published a bash script on GitHub that sets everything up automatically:
This optimization was implemented on the NVIDIA Spark. The Asus Ascent GX10 shares the same internal hardware (the NVIDIA GB10 Grace Blackwell Superchip), with the main differences being the casing and cooling. It is priced at around USD 3,500 because it ships with only 1 TB of storage, which is sufficient for my use case. A generation speed of 50 tokens/s for a model of this size would make it practically usable. However, before purchasing the device, I want to verify whether my assumptions place it within a usable performance range.
My questions:
- Has anyone tested the Asus Ascent GX10? With an 8,000-token context, what are the TTFT and generation speeds? I want to verify whether 5 seconds TTFT and 50 tokens/s generation are achievable.
- Are there any issues caused by minor hardware differences between the devices? Specifically, will the optimization setup script run on the Asus Ascent without modification?
JojoScraggins@reddit
I have the gx10 and am running qwen3.5 122b int4 autoround. Benchmark results vary but were just better than nvfp4. I'm sure it won't be long until another model comes in that is better but this one sure does well for me.
On a coding benchmark I got pretty decent performance:
Here's my system unit:
kuhunaxeyive@reddit (OP)
You get 134 t/s, that's great and way more than expected, isn't it? Apart from this system unit, did you do any optimization I should consider? I'm reading up on everything and learning, preparing myself, as my Ascent GX10 will arrive tomorrow …
JojoScraggins@reddit
Keep in mind that 134 t/s is from a vLLM benchmark with concurrency (https://docs.vllm.ai/en/latest/benchmarking/cli/) whereas your link uses its own benchmarking method on a single request. vLLM's benchmarking is going to give a more complete and reliable picture. As far as optimizations go, you can drive yourself crazy. I decided to stick to upgrading across vLLM releases rather than patching, with the thought that in a few weeks a new model or kernel could entirely change what I'm doing with my inference setup.
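As a rough sketch of why the two numbers aren't comparable (all figures here are hypothetical, just the arithmetic): a concurrent benchmark reports aggregate tokens/s across all in-flight requests, while a single-request benchmark reports one stream's speed.

```python
# Rough model of aggregate vs. single-stream decode throughput.
# Batched decode serves several requests per forward pass, so aggregate
# throughput scales with concurrency at some efficiency < 1 (the batch
# shares compute and memory bandwidth). Numbers below are illustrative.

def aggregate_tps(single_stream_tps: float, concurrency: int, efficiency: float) -> float:
    """Aggregate tokens/s across all concurrent requests."""
    return single_stream_tps * concurrency * efficiency

single = 50.0  # tokens/s for one request, as in the tutorial's benchmark
batched = aggregate_tps(single, concurrency=4, efficiency=0.67)
print(f"single-stream: {single:.0f} t/s, aggregate at concurrency 4: {batched:.0f} t/s")
```

So a 134 t/s aggregate figure and a 50 t/s single-request figure can describe the same machine; per-stream latency for one user doesn't improve just because the aggregate number is higher.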
kuhunaxeyive@reddit (OP)
So as I understand it, the difference from the tutorial is that the tutorial gets you near-FP8 precision, but at the cost of only 50 t/s instead of 134 t/s …
CATLLM@reddit
I have two MSI variants of the Spark in a cluster running Qwen3.5 397B. I think buying a single one is a waste of money because the ConnectX-7 alone is already $1,700. Running models on a single Spark is just on the edge of "usable".
Turbulent_Ad7096@reddit
I'm considering getting a second GX10 for Qwen3.5 397B and MiniMax M2.5 172B REAP. I can run the MiniMax one on a single device. How is the Qwen 397B model? Is it substantially better at coding than the 122B?
CATLLM@reddit
Yes, the 397B model is amazing at everything. Keep in mind only the Intel AutoRound 4-bit quant fits; anything larger won't. Running the 122B in FP8 clustered is great too. I recommend looking through this thread in the NVIDIA developer forums to help make your decision:
https://forums.developer.nvidia.com/t/qwen3-5-397b-a17b-run-in-dual-spark-but-i-have-a-concern/361967/216
I personally think getting 2 sparks is worth it.
kuhunaxeyive@reddit (OP)
The guy behind the tutorial gets his single Spark doing near-FP8 quality at 50 t/s for the Qwen-122B-A10B model. I genuinely want to ask, because I need to decide which way I'll go: why do you think it's better to get a second Spark and run FP4 at 35 t/s, if you can get better results at half the price by following the tutorial's optimization, and learn something on the way? If you have the money and don't want to invest the time or work, I totally get it. But if someone wants to learn and doesn't have the money, then it seems more reasonable to stay on one machine and do the optimization tailored to that specific model and the Spark/Asus Ascent machine.
CATLLM@reddit
Look at the thread and see how much trouble people are going through to get that specific optimization working. If you want to spend the time tweaking, then by all means go ahead.
If you don’t have the money then the spark is not the best choice especially after the recent price hikes.
What do you want to learn, though? Just inference? Fine-tuning? CUDA?
kuhunaxeyive@reddit (OP)
The tutorial includes a setup bash script, so that part actually isn't troublesome. However, I want to do it manually because, beyond just having a local and efficient LLM, fine-tuning it and understanding the underlying theory are the reasons for my purchase.
fastheadcrab@reddit
I personally saw that thread a while ago and found it vaguely suspicious. It made statements that seemed to contradict known facts and documentation, and the post appeared to be written by an LLM. I've also previously seen well-founded rationale arguing that MTP on MoE is unlikely to give a significant speedup on real tasks.
Personally I would never make a purchasing decision based on a sketchy forum post promising very high performance, but the OP has clearly made up his mind from the beginning and wanted confirmation rather than advice or information.
https://www.reddit.com/r/LocalLLaMA/comments/1rzntv5/multitoken_prediction_mtp_for_qwen35_is_coming_to/obni2p2/
kuhunaxeyive@reddit (OP)
Getting confirmation for that specific optimization tutorial is exactly the purpose of my post. I don’t understand why you judge that I’ve already made up my mind for the Asus Ascent GX10. I’m simply looking for confirmation that it works as claimed.
Of course, I appreciate arguments against the tutorial. I’m open to new perspectives. However, please don’t judge me for making this specific request. So far, no one here has suggested an alternative setup that can achieve this for a total price of only 3080 USD.
My purchase decision will not be based on the NVIDIA forum post alone. It will depend on reports from others who can confirm that the approach works in practice.
Additionally, the Spark or Asus Ascent are good devices for getting started with LLMs, especially the Asus Ascent with 1 TB storage, which I can purchase for 3018 USD in Asia.
My use cases are:
My past experience shows that I'm capable of making solid system and purchasing decisions. Based on user reports, I'm fairly confident that the tutorial you describe as "sketchy" will actually achieve 50 tokens per second on a 3,080 USD device running Qwen3.5-122B-A10B, and that I'll learn the underlying optimization techniques in the process.
Some people call me crazy or ignorant, but I believe there is a strong chance of achieving these goals with the Asus Ascent. Going against the current invites criticism, but it can lead to discovering something the majority overlooks.
Thank you for taking the time to discuss, it's really helpful.
CATLLM@reddit
100% agree. Well said.
CalligrapherFar7833@reddit
You can't decouple the CX7 from the DGX board, so it doesn't work like that. You can't count the cost of a component that you can't reuse in something else.
CATLLM@reddit
If you are not using it, then it's a waste of money. It's sitting idle. Why make it any more complicated?
ArtfulGenie69@reddit
Why does anyone need this ConnectX-7 thing? Network is all you need, especially for something as slow as a Spark. RPC for llama.cpp and Ray for vLLM. Is it really a $1,700 cable? I'm out of the loop on expensive garbage like that.
I used rpc to get my machines working together, vllm has something similar. https://github.com/ggml-org/llama.cpp/blob/master/tools/rpc/README.md
CATLLM@reddit
The ConnectX-7 on the Spark can do 200 gigabit. Latency is about 3 microseconds. Yes, you need that if you are running a model across multiple machines. It's what data centers use. It's far from garbage.
llama.cpp RPC over Ethernet is orders of magnitude slower. You think AWS, Azure, Google Cloud etc. are running Cat 5e gigabit?
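As a rough sketch of how that latency adds up per generated token (the layer count and syncs-per-layer below are assumptions for illustration, not measured values):

```python
# In 2-way tensor parallel, each transformer layer typically needs a
# couple of all-reduces per token, and every one pays the link's
# round-trip latency. Numbers below are illustrative assumptions.

def sync_overhead_ms(layers: int, allreduces_per_layer: int, link_latency_us: float) -> float:
    """Total per-token synchronization latency in milliseconds."""
    return layers * allreduces_per_layer * link_latency_us / 1000.0

layers = 48  # assumed layer count for a mid-size model
for name, latency_us in [("RDMA link, ~3 us", 3.0), ("commodity Ethernet, ~100 us", 100.0)]:
    ms = sync_overhead_ms(layers, 2, latency_us)
    print(f"{name}: ~{ms:.2f} ms of sync latency per token")
```

Under these assumptions the low-latency link costs a fraction of a millisecond per token, while the high-latency path adds several milliseconds, which alone would cap generation around 100 t/s before any compute happens.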
ArtfulGenie69@reddit
All you are doing is reducing the original startup time of the sharding, and that is mitigated by a cache. It will not speed up inference much, especially with a Spark. The model just runs its shards on each machine and combines them; the combination part doesn't take much bandwidth, only the sharding if nothing is cached. That's what I've seen in testing; I got ripping fast inference speeds on a 2.5 GbE switch.
CATLLM@reddit
Latency on the connectx7 is about 3 microseconds.
Ethernet is orders of magnitude higher latency.
The latency adds up when you are running tensor parallel.
I don’t think you know what you’re talking about. Have you even seen a QSFP cable in person before?
CalligrapherFar7833@reddit
You can do RDMA on a CX5 with the same latency and it costs $300; the CX7 is useless for LLMs.
CATLLM@reddit
So can you buy a Spark with a CX5? Why don't you call up Jensen and tell him to make one, 'cause I'd love to buy a cheaper Spark.
CalligrapherFar7833@reddit
That's the whole financial point: you are buying the Spark. The CX7 cannot be used outside of the DGX board, so its cost is 0, or at most $300 compared to a CX5 with the same RDMA. The cost of the CX7 is $1,700 only if you can use it on something else, and you can't.
CATLLM@reddit
If you are not using the CX7 then you are wasting $1700.
The CX7’s value makes up part of the total cost of the machine.
If Nvidia were to make a DGX spark without the CX7 then it would be valued LESS in comparison to one with a CX7.
If you have no use for the CX7 then the spark is not for you. Go get something like the Strix halo which doesn’t have the CX7 and is cheaper.
How hard is it to understand that?
CalligrapherFar7833@reddit
There is no value in using a CX7 that costs $1,700 for something you can do with a CX5 for $300. Nvidia's cost for including that CX7 is $50, not the $1,700 MSRP, which is for the PCIe version.
CATLLM@reddit
This isn't just wrong, it's a complete misunderstanding of how the world works. You're throwing out a $50 number like BOMs are magic, when that doesn't even cover the connectors and PCB, never mind a bleeding-edge ASIC.
Nvidia doesn’t publish their BOM costs. What experience do you have that allows you to determine that the chip costs $50?
And comparing a CX5 to a CX7 like they're interchangeable just proves you have no clue about cost or value. Value isn't "the cheapest thing that exists," it's whether the hardware can actually do the job, and a CX5 literally cannot replace a CX7. Does Nvidia make a Spark with a CX5?
You're not making a point about pricing; you're just advertising that you don't understand hardware, manufacturing, or basic economics.
CalligrapherFar7833@reddit
Don't see a point arguing with you further; you don't understand jack shit about shit.
fastheadcrab@reddit
I agree with you entirely, but it does give me an interesting idea: could someone take several gaming PCs running gaming GPUs and make a cluster of them with these fast networking cards? Wonder if it is possible. Power bills alone may lead to bankruptcy though.
CATLLM@reddit
It's totally possible, and you are right, the power consumption / efficiency angle would be pretty bad. You also have to consider maintaining them. QSFP switches are expensive too.
I've seen some people do that on this sub, and I can see it being a hobby for those people.
Personally, at that point it might be simpler to get an RTX Pro 6000 and call it a day.
kuhunaxeyive@reddit (OP)
You need the ConnectX-7 if you want to connect devices and transfer data during computation. But whether connecting devices with such low memory bandwidth to run bigger models makes sense speed-wise is another question …
CalligrapherFar7833@reddit
And you can do the same with a second-hand mcx516cdat for $300 and 200G IB, so what's your point?
CATLLM@reddit
Have you checked the size and power consumption of the spark? 🙄
CalligrapherFar7833@reddit
Yes ? And ?
CATLLM@reddit
And it's a point of comparison, because size and power consumption are different and that's something to consider.
Do you have trouble comparing size and numbers?
kuhunaxeyive@reddit (OP)
That's actually true. The ConnectX-7 (200 GbE) is integrated, and it actually costs ASUS/NVIDIA far less in BOM than its standalone retail price.
CalligrapherFar7833@reddit
Bom is <50
CATLLM@reddit
That's not how the world works.
kuhunaxeyive@reddit (OP)
Connecting two of those devices would increase memory but not speed. Models that need more memory normally get too slow on this device because of the memory bandwidth limitation, so it wouldn't make sense to connect several. One device seems to be the sweet spot, with only 283 GB/s bandwidth.
CATLLM@reddit
With tensor parallel in vllm it increases speed bro
kuhunaxeyive@reddit (OP)
The guy who optimized the setup explains why he doesn't consider clustering these devices an option: clustering is out of scope because the hardware limit they're hitting is per-Spark memory bandwidth, not something parallelism across nodes easily overcomes for generation speed on a model that already fits.
CATLLM@reddit
I've read his thread. I don't agree with his view. I have two Sparks and cluster them to run Qwen3.5 122B and 397B.
From personal experience, running the 122B on a single Spark is just on the edge of usable. Clustering them together is just right.
kuhunaxeyive@reddit (OP)
CATLLM@reddit
Look at the amount of effort you need to get it running. Maybe that's fine as a hobby, but I'd rather spend the energy and time making money.
Alarming-Ad8154@reddit
Please, google "tensor parallel". Two of anything (GPU, Spark, Mac) will potentially significantly speed up inference; multiple people in this thread have been trying to tell you.
audioen@reddit
Installing this repo seems to be a multi-hour ordeal. I'll report if it starts on a Lenovo ThinkStation PGX when it's done.
kuhunaxeyive@reddit (OP)
This is awesome, thank you for checking that out! Exactly what I was hoping for.
You get this high speed at a favorable near-FP8 configuration, thanks to the hybrid quantization setup.
Qwen 122B-A10B near FP8 at 42 tokens/s average, with lots of headroom for context, on a single Spark/Ascent GX10 for 3,200 USD including tax (in Asia).
You helped me a lot. I'll order the Ascent!
fastheadcrab@reddit
https://spark-arena.com/leaderboard
kuhunaxeyive@reddit (OP)
Thanks! I now understand the reasoning behind the suggestion to connect two. But from what I know, it wouldn't increase speed, and what would I do with more memory? Running bigger models is not feasible, as they would run slower due to the 283 GB/s memory bandwidth limit of these devices. (NVIDIA advertises 1 TB/s, but that's not how it works in practice.)
fastheadcrab@reddit
Depends. What you said is true for dense models that need to compute all parameters for each token, so yes, increasing model size brings huge slowdowns there. But as the name implies, Qwen3.5-122B-A3B only has 3B active parameters, and even the 397B-A17B model only has 17B active parameters. So you do not need nearly as much memory speed as a dense 400B model would. There are better explanations than mine, and benchmarks, out there.
Also, there is some speedup if you get tensor parallel to work, because then GPUs across multiple nodes can compute at once. And larger models will give you better quality.
What are you trying to use the models for? You mention you are looking for models for your use case but don't go into it much. There are several people running the 397B model at 4-bit on 2x Sparks with passable speeds, as you can see on the leaderboard. Is quality needed? You will need to benchmark whatever models you are considering on your own use case. Have you at least tried the 122B Intel quant to see if the answers are acceptable to you? How about speed?
I'd recommend thinking about those factors before buying something that might not be useful to you. Also it can help people here give better recommendations.
As for your original question: if you are capable of getting vLLM working with the Qwen3.5-122B model and setting MTP to 2, you will likely be able to get something like 35-40 tok/s. Also, why an 8,000-token context?
kuhunaxeyive@reddit (OP)
Thank you for all that input. I rely heavily on researching and learning before making a purchase decision. I'd love to go for the 397B-A17B model on two Asus Ascent GX10s, but that one almost doubles the active parameter count, and with 283 GB/s bandwidth I'm afraid it drops under 40 t/s, presumably much lower. But there are folks here with practical experience, and I'm happy to be corrected if it can in fact run at higher speeds on the Spark or Asus Ascent.
About tensor parallel: as far as I understood the NVIDIA GB10 Grace Blackwell Superchip configurations, the memory bandwidth of 283 GB/s is the bottleneck, and it wouldn't help much to double the computational power if the bottleneck is in fact the speed of the memory bus. I read that on this system the processors are in fact already waiting for data.
I want a model that gives a somewhat predictable outcome for writing letters, restructuring data, researching information, and answering knowledge questions from the model's internal knowledge. In my tests, the 35B-A3B model is hit and miss even for repeated questions: for the same question, sometimes the result is perfect, sometimes plain wrong. The 122B-A10B gives me a correct answer every time at full precision, written in a more summarized yet clearer structure. I hope to get near-full-precision results at lower precision. The 122B-A10B is the size that feels good and safe for real-time usage.
As I want to use it as a daily driver for privacy reasons, it needs to be fast enough, and 40 to 50 t/s seems to be the lowest bar it needs to clear.
I have no experience with vLLM but take it as a learning opportunity, and I also hope to get to 50 t/s; see the forum post I linked.
fastheadcrab@reddit
In a 2x tensor-parallel setup, memory bandwidth is additive to a degree when running the same model, compared with a single node. Not 2x, because of overhead and parts that are not parallelizable, but if things are configured correctly, a significant speedup can be observed. You can check out the website I linked, because these things have been benchmarked already.
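A rough Amdahl-style sketch of "additive to a degree" (the parallel fractions and the baseline speed below are assumptions, not measurements):

```python
# Amdahl-style estimate of 2-node tensor-parallel speedup: only the
# parallelizable fraction p of per-token work is split across nodes;
# the rest runs at single-node speed. All numbers are illustrative.

def tp_speedup(nodes: int, parallel_fraction: float) -> float:
    """Ideal speedup when fraction p of the work splits across `nodes`."""
    p = parallel_fraction
    return 1.0 / ((1.0 - p) + p / nodes)

base_tps = 25.0  # hypothetical single-node generation speed
for p in (0.7, 0.85, 0.95):
    print(f"parallel fraction {p:.2f}: ~{base_tps * tp_speedup(2, p):.0f} t/s on 2 nodes")
```

So two nodes land somewhere between 1x and 2x depending on how much of the per-token work actually parallelizes, which is why benchmarks rather than the spec sheet decide whether a second unit is worth it.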
You need to benchmark the modified model and configuration from the post you linked on your actual workflow, and see if the results are still satisfactory in terms of both quality and speed. I do not use that configuration.
If you really need fast performance (it must not drop below 40 t/s) on the 122B, you have two options in my opinion: either buy two Sparks and run a cluster, or buy an RTX 6000 Pro. A single Spark cannot guarantee the performance you are aiming for, especially as the context fills up.
https://spark-arena.com/leaderboard?tab=compare
CATLLM@reddit
It does increase speed when running tensor parallel, and it is supported in vLLM.
Serprotease@reddit
An 8k-token prompt is sub-3.5 s for a single GB10, sub-3 s for a cluster.
Also, you consider 50 tk/s tg to be the usability bar? That's a tall order.
But anyway, token generation speed usually doesn't drop that much with vLLM, so you should expect >40 tk/s well into the 50-60k token range.
Honestly, the biggest hurdle is the vLLM setup. As you might guess from the Nvidia thread, it's not smooth sailing.
chensium@reddit
The repo literally has a one-command install.sh. Not sure how much easier it gets than that. I've been running it for about a day with no issues.
Serprotease@reddit
It ran fine for me until I tried to rebuild the container and it failed to build vLLM. It took me quite a few hours of troubleshooting to get it working again. I'm still learning my way around Docker, vLLM, and clusters, so it may be linked to that as well…
Prudent-Ad4509@reddit
I did some coding today with this model and I find 260k context to be a bit on the low side. And you are talking about 8k context. Looks like you have a very specific and limited task for it.
kuhunaxeyive@reddit (OP)
What is the price of the system you are using for coding?
Prudent-Ad4509@reddit
I have several, with the middle tier based on an old 9700K and the newest one based on a 9950X3D. It is a bit funny to see them bottlenecked at compile/tool calling at first, and at large context processing later on (when approaching 150k). Qwen3.5 can't grasp my code well enough until the context grows to about 100k, stops making stupid mistakes after 150k, and then steadily grinds to a halt while approaching 262k.
kuhunaxeyive@reddit (OP)
They tested it at a 240k-token context and the generation speed only went down to 48 t/s.
FatheredPuma81@reddit
I really feel like you can get better for less...
kuhunaxeyive@reddit (OP)
From what I know, at that price point and speed, only a Mac Studio would be comparable, but it's not Linux, and it's more expensive.
FatheredPuma81@reddit
Wouldn't two Intel B70s get better performance? And that lets you upgrade to a third or fourth. There are also better alternatives to the B70 that give even more performance, I think.
kuhunaxeyive@reddit (OP)
Intel cards let you add a 3rd/4th easily, but for interactive single-user inference on 122B, multi-card adds latency (all-reduce overhead) without proportional gains in generation speed.
Specter_Origin@reddit
Where do you get a GB10 for 3500? Their official price is ~4500k, and at that price a Mac Studio would be a better buy...
kuhunaxeyive@reddit (OP)
Did you look up the price of the 4 TB model? The 1 TB model is 3,600 USD in the US, 3,860 USD in Europe, and 3,250 USD in Asia.
Ell2509@reddit
4500k would be 4.5 million. I pray we never get there lol.
Specter_Origin@reddit
Ty! fixed it