Optimising NVIDIA’s DGX Spark (Grace + Blackwell) – 1.5× PyTorch speedup with custom build

Posted by guigsss@reddit | LocalLLaMA | View on Reddit | 22 comments

I’ve open-sourced a complete end-to-end setup to maximise AI performance on the new NVIDIA DGX Spark – the compact dev box built on the Grace-Blackwell superchip (20-core Grace ARM CPU + 6144-core Blackwell GPU).

Because this architecture is so new (SM 12.x GPU, unified CPU-GPU memory), many libraries weren’t fully utilising it out-of-the-box. I found that PyTorch and CUDA libs would fallback to older GPU kernels and miss out on Blackwell’s new FP8/FP4 tensor core formats, and even ignore some ARM64 CPU optimisations on the Grace side. So I decided to rebuild the stack myself to unlock its full potential.

What I did and why it matters:

Results: I wrote a simple FP16 GEMM (matrix multiply) burn-in benchmark to compare baseline vs optimised environments.

Baseline FP16 GEMM throughput (matrix size 8192) using stock PyTorch (CUDA 13 wheel). It sustains \~87 TFLOPs after warm-up, indicating the Blackwell GPU isn’t fully utilized by default kernels . Many new tensor core features remained inactive, resulting in suboptimal performance.

Optimised environment FP16 GEMM throughput (matrix size 8192) after rebuilding the stack. Sustained throughput is \~127 TFLOPs – roughly 50% higher than baseline. This gain comes from Blackwell-specific optimisations: updated cuBLAS routines, enabled FP8/FP4 cores, Triton JIT, and sparse tensor support. In practice, that’s about 1.5× the matrix multiplication performance on the same hardware.

In summary, recompiling and updating the ML stack specifically for DGX Spark yielded a \~50% speedup on this heavy compute workload. The repository includes all the installation scripts, build steps, and even a pre-built PyTorch wheels (torch 2.9.1 for CUDA 13 on aarch64) if you want to skip compiling .

Link to repo: 🔗 GitHub – https://github.com/GuigsEvt/dgx_spark_config

I’d love feedback from others who have a DGX Spark or similar hardware. Feel free to try out the build or use the wheel and let me know if it improves your workloads. Any suggestions for further tuning are very welcome!