Is an NVIDIA DGX Spark or similar worth it?
Posted by Scary-Feedback-5837@reddit | LocalLLaMA | 11 comments
I currently run a mix of a local model and Claude Max. My local model runs on CPU with 256 GB of RAM, so it's quite slow.
With Claude usage limits becoming nearly intolerable, I face a choice: either move up to the $200 Max plan from Claude, or switch to an unlimited-usage local model.
I don't know which of these is the better option. Should it be a maxed-out Mac Studio? The NVIDIA DGX Spark or a similar setup?
What is the best option?
Ok_Warning2146@reddit
Useless POS until they make NVFP4 work properly. Better to use an M3 Ultra.
Less_Ad_1806@reddit
For local inference? Not at all. Frankly, I don't know what it's good for... and I bought two for my job, hoping they would act like local servers for AI inference... Dang, I'm glad nobody at work has spotted the problem so far...
No_Afternoon_4260@reddit
Care to elaborate?
Less_Ad_1806@reddit
Yeah, exotic Linux architecture (aarch64 or something like that) and big memory but low bandwidth (lower than a 3080). Big car with small tires... only frustration. Yeah, it can run big models, but it's slow, even with NVFP4... it's of no use... Thing is, NVIDIA's big dude (Jensen Huang or so) said "this is the computer for AI inference on your desktop" and I trusted him. The reality is that it's a niche product with no practical use.
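To put rough numbers on the "big car with small tires" point: single-stream decode is essentially memory-bandwidth-bound, so the ceiling falls out of simple arithmetic. A sketch, assuming ~273 GB/s for the Spark's LPDDR5X, ~760 GB/s for a 3080, and a hypothetical 60 GB quantized model:

```python
# Rough upper bound on single-stream decode speed when generation is
# memory-bandwidth-bound: each generated token streams all active weights
# from memory once. Bandwidth figures and the 60 GB model size are
# assumptions for illustration, not measurements.

def max_decode_tps(bandwidth_gbs: float, model_size_gb: float) -> float:
    """Best-case tokens/s if every token reads the full model once."""
    return bandwidth_gbs / model_size_gb

for name, bw in [("DGX Spark", 273.0), ("RTX 3080", 760.0)]:
    print(f"{name}: <= {max_decode_tps(bw, 60.0):.1f} tok/s for a 60 GB model")
# DGX Spark: <= 4.6 tok/s -- big memory, small tires, exactly as described.
```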
No_Afternoon_4260@reddit
I'm not sure I agree. I have 4 of them on 200 GbE. Using TP=4, that gives you roughly 3090 RAM speed (still faster than a Mac Ultra so far) for about 480 GB.
Yeah, aarch64 plus NVIDIA sm121 is really niche: lots of precompiled wheels are missing, etc. No MIG, and unified memory is a challenge of its own.
But for the price of 1.5 RTX Pros 🤷
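For anyone wondering what TP=4 looks like in practice, a minimal vLLM sketch; it assumes a Ray cluster already spans the four boxes, and the model id is a placeholder, not a real checkpoint:

```python
# Minimal vLLM tensor-parallel sketch. Assumes vLLM is installed and, for
# multi-node TP, that a Ray cluster already connects the machines
# (e.g. `ray start --head` on one box, `ray start --address=...` on the rest).
from vllm import LLM, SamplingParams

llm = LLM(
    model="Qwen/Qwen3.5-122B",  # placeholder model id for illustration
    tensor_parallel_size=4,      # shard the weights across the 4 GB10s
)
params = SamplingParams(temperature=0.7, max_tokens=256)
out = llm.generate(["Why is memory bandwidth the bottleneck?"], params)
print(out[0].outputs[0].text)
```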
Monad_Maya@reddit
What is your "local model"? Maybe share that and look at benchmarks for it.
Scary-Feedback-5837@reddit (OP)
Local model is Qwen3.5-122B for complex tasks, and Gemma 4 31B for lighter tasks these days.
my_name_isnt_clever@reddit
Qwen 3.5 122B does great on Strix Halo. I get a baseline of 300 pp/s and 30 t/s, and I rarely feel the need for frontier models.
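A quick sanity check of what 300 pp/s and 30 t/s mean for latency (the 4k-token prompt and 500-token reply are illustrative assumptions, not measured figures):

```python
# End-to-end latency estimate from the quoted throughput numbers.
# 300 pp/s prefill and 30 t/s decode come from the comment above; the
# prompt and reply sizes are illustrative assumptions.
prefill_tps, decode_tps = 300.0, 30.0
prompt_tokens, reply_tokens = 4000, 500

ttft = prompt_tokens / prefill_tps        # time to first token
total = ttft + reply_tokens / decode_tps  # full response latency
print(f"TTFT ~{ttft:.1f}s, full reply in ~{total:.1f}s")
# TTFT ~13.3s, full reply in ~30.0s
```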
Grouchy-Bed-7942@reddit
Look at the benchmarks here; assuming they're not all up to date, treat these as the minimum performance you'll get:
https://spark-arena.com/leaderboard
Each entry indicates whether it runs on one or two GB10s. Note that with vLLM you can easily run 3 or 4 requests in parallel, for agentic workflows for example.
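The parallel-requests part needs no special setup: vLLM's continuous batching picks up concurrent requests on its own. A minimal sketch against the OpenAI-compatible endpoint, assuming a default `vllm serve` running on port 8000 and a placeholder model name:

```python
# Fire a few concurrent requests at a local vLLM server; continuous batching
# handles them in parallel. Assumes `vllm serve <model>` is running with its
# default OpenAI-compatible endpoint on port 8000.
from concurrent.futures import ThreadPoolExecutor
import requests

URL = "http://localhost:8000/v1/chat/completions"

def ask(prompt: str) -> str:
    resp = requests.post(URL, json={
        "model": "local-model",  # placeholder; must match the served model
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": 128,
    })
    return resp.json()["choices"][0]["message"]["content"]

prompts = [f"Summarize point {i} of the plan." for i in range(4)]
with ThreadPoolExecutor(max_workers=4) as pool:
    for answer in pool.map(ask, prompts):
        print(answer[:80])
```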
Monad_Maya@reddit
A Strix Halo based mini PC would be OK for the MoE model. You can probably run a smaller quant of MiniMax M2.7 on it.
Dense models like Gemma 31B will struggle on most consumer hardware outside of dGPUs.
Check local pricing on Strix Halo machines and compare it with DGX Spark options.
It might be worth noting that Claude Code has issues related to token consumption - https://news.ycombinator.com/item?id=47739260
anzzax@reddit
I have a Spark clone and run qwen3.5-120b int4-autoround at ~40 tps with very low power consumption, plus plenty of KV-cache room for batching and a long prefix cache.
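For scale on the KV-cache point, a back-of-the-envelope sizing sketch; all the architecture numbers here are hypothetical placeholders, not the real model config:

```python
# Back-of-the-envelope KV-cache sizing, to show why spare memory buys
# batching and long prefix caches. Layer/head counts are hypothetical
# placeholders, not the actual Qwen architecture.
layers, kv_heads, head_dim = 60, 8, 128
bytes_per_elem = 2  # fp16/bf16 KV entries

def kv_bytes(tokens: int) -> int:
    # 2x for K and V, per layer, per KV head
    return 2 * layers * kv_heads * head_dim * bytes_per_elem * tokens

print(f"~{kv_bytes(32_000) / 1e9:.1f} GB of KV cache per 32k-token sequence")
# ~7.9 GB per sequence with this config: tens of GB of headroom means
# several long concurrent sequences plus a persistent prefix cache.
```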