Optimising NVIDIA’s DGX Spark (Grace + Blackwell) – 1.5× PyTorch speedup with custom build
Posted by guigsss@reddit | LocalLLaMA | 22 comments
I’ve open-sourced a complete end-to-end setup to maximise AI performance on the new NVIDIA DGX Spark – the compact dev box built on the Grace-Blackwell superchip (20-core Grace ARM CPU + 6144-core Blackwell GPU).
Because this architecture is so new (SM 12.x GPU, unified CPU-GPU memory), many libraries weren’t fully utilising it out of the box. I found that PyTorch and the CUDA libraries would fall back to older GPU kernels, miss Blackwell’s new FP8/FP4 tensor core formats, and even ignore some ARM64 CPU optimisations on the Grace side. So I decided to rebuild the stack myself to unlock its full potential.
What I did and why it matters:
- Rebuilt PyTorch from source with Blackwell (SM 12.x) support on ARM64, so it recognises the new GPU architecture, fully detects SM 12.x capabilities, and dispatches to optimised kernels.
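As a quick sanity check that a given build actually knows about SM 12.x, you can inspect the compiled-in arch list (the `torch.cuda` calls below are standard PyTorch APIs; `parse_sm` and `supports_blackwell` are small helpers written for this post, not part of any library):

```python
# Sanity check: does this PyTorch build know about Blackwell (SM 12.x)?
# parse_sm() and supports_blackwell() are helpers for this post, not PyTorch APIs.

def parse_sm(arch: str) -> tuple[int, int]:
    """Turn an arch tag like 'sm_120' into a (major, minor) capability."""
    digits = arch.split("_")[-1]               # 'sm_120' -> '120'
    return int(digits[:-1]), int(digits[-1])   # -> (12, 0)

def supports_blackwell(arch_list) -> bool:
    """True if any compiled-in arch is SM 12.x or newer."""
    return any(parse_sm(a)[0] >= 12 for a in arch_list if a.startswith("sm_"))

if __name__ == "__main__":
    try:
        import torch  # only needed for the live check
        archs = torch.cuda.get_arch_list()     # e.g. ['sm_90', ..., 'sm_120']
        print(archs, supports_blackwell(archs))
    except ImportError:
        pass  # torch not installed; the helpers above still work standalone
```

A stock wheel without Blackwell support will report no `sm_12x` entry here, which is exactly the fallback-to-older-kernels situation described above.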
- Updated NVIDIA libraries (cuBLAS, cuDNN, etc.) to the latest versions for CUDA 13. I also manually installed cuSPARSELt (the sparse GEMM library), since it wasn’t yet in the default DGX OS repos. This adds support for 2:4 structured-sparsity acceleration on Blackwell’s tensor cores.
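For context on what 2:4 structured sparsity means, here is a toy pure-Python illustration (my own sketch, not the cuSPARSELt API): in every group of four weights, only the two largest-magnitude values are kept, which is the pattern the sparse tensor cores can accelerate:

```python
# Toy illustration of 2:4 structured sparsity (the pattern cuSPARSELt
# accelerates on tensor cores): in each group of 4 values, keep the
# 2 with the largest magnitude and zero out the rest.

def prune_2_4(weights):
    """Apply 2:4 pruning to a flat list whose length is a multiple of 4."""
    assert len(weights) % 4 == 0
    out = []
    for i in range(0, len(weights), 4):
        group = weights[i:i + 4]
        # indices of the two largest-magnitude entries in this group
        keep = sorted(range(4), key=lambda j: abs(group[j]), reverse=True)[:2]
        out.extend(v if j in keep else 0.0 for j, v in enumerate(group))
    return out

print(prune_2_4([0.9, -0.1, 0.05, -0.8]))  # -> [0.9, 0.0, 0.0, -0.8]
```

Because exactly half the values in each group are zero at fixed positions, the hardware can skip those multiplications, which is where the sparse-GEMM speedup comes from.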
- Enabled FP4/FP8 tensor cores: the custom build unlocks the new low-precision tensor core instructions (FP8/FP4) that Blackwell supports, which the default libraries didn’t leverage. This should help with future models that use these formats.
- Triton GPU compiler tuned for Blackwell: recompiled the Triton compiler with LLVM support for SM 12.x. This means operations like FlashAttention or fused kernels can JIT-compile optimised code for Blackwell’s GPU.
- GPUDirect Storage (GDS): enabled cuFile so the GPU can load data directly from SSDs, bypassing the CPU. Useful for faster data throughput in training.
- Grace CPU optimisations: made sure to compile with ARM64 optimisations for the Grace CPU. The Grace has 20 cores (10× Cortex-X925 + 10× Cortex-A725) and I didn’t want it bottlenecked by x86 assumptions. The build uses OpenBLAS/BLIS tuned for ARM, plus OpenMPI, to utilise the CPU fully for any preprocessing or distributed work.
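Before tuning BLAS/OpenMP thread counts, a small stdlib-only check (my own helper, not from the repo) can confirm the interpreter is actually running on the ARM64 side and sees all 20 Grace cores:

```python
# Quick platform sanity check before pinning thread counts for Grace.
import os
import platform

def is_grace_like(machine: str, cores: int) -> bool:
    """Heuristic: an ARM64 host exposing the Spark's 20 Grace cores."""
    return machine == "aarch64" and cores >= 20

if __name__ == "__main__":
    m, c = platform.machine(), os.cpu_count() or 1
    print(f"arch={m} cores={c} grace_like={is_grace_like(m, c)}")
    # On Grace you'd typically also pin thread counts for OpenBLAS/OpenMP,
    # e.g. by exporting OMP_NUM_THREADS / OPENBLAS_NUM_THREADS before launch.
```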
Results: I wrote a simple FP16 GEMM (matrix multiply) burn-in benchmark to compare baseline vs optimised environments.
Baseline: FP16 GEMM throughput (matrix size 8192) using stock PyTorch (CUDA 13 wheel) sustains ~87 TFLOPs after warm-up, indicating the Blackwell GPU isn’t fully utilised by the default kernels. Many new tensor core features remained inactive, resulting in suboptimal performance.
Optimised environment: after rebuilding the stack, sustained FP16 GEMM throughput (matrix size 8192) is ~127 TFLOPs – roughly 50% higher than baseline. This gain comes from Blackwell-specific optimisations: updated cuBLAS routines, enabled FP8/FP4 cores, the Triton JIT, and sparse tensor support. In practice, that’s about 1.5× the matrix-multiplication performance on the same hardware.
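A minimal FP16 GEMM burn-in of this kind can be sketched as follows (my sketch, not the repo's exact script; the TFLOPs figure uses the standard 2·n³ operations per n×n matrix multiply):

```python
# Minimal FP16 GEMM burn-in benchmark sketch (not the repo's exact script).
import time

def gemm_tflops(n: int, seconds: float) -> float:
    """An n x n GEMM performs 2*n^3 FLOPs (n^3 multiplies + n^3 adds)."""
    return 2 * n**3 / seconds / 1e12

if __name__ == "__main__":
    try:
        import torch
        have_cuda = torch.cuda.is_available()
    except ImportError:
        have_cuda = False
    if have_cuda:
        n, iters = 8192, 100
        a = torch.randn(n, n, dtype=torch.float16, device="cuda")
        b = torch.randn(n, n, dtype=torch.float16, device="cuda")
        for _ in range(10):              # warm-up so clocks settle
            torch.matmul(a, b)
        torch.cuda.synchronize()
        t0 = time.perf_counter()
        for _ in range(iters):
            torch.matmul(a, b)
        torch.cuda.synchronize()
        dt = (time.perf_counter() - t0) / iters
        print(f"{gemm_tflops(n, dt):.1f} TFLOPs sustained")
```

At n=8192, the ~127 TFLOPs figure above corresponds to roughly 8.7 ms per matmul, versus ~12.6 ms at the ~87 TFLOPs baseline.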
In summary, recompiling and updating the ML stack specifically for the DGX Spark yielded a ~50% speedup on this heavy compute workload. The repository includes all the installation scripts, build steps, and even pre-built PyTorch wheels (torch 2.9.1 for CUDA 13 on aarch64) if you want to skip compiling.
Link to repo: 🔗 GitHub – https://github.com/GuigsEvt/dgx_spark_config
I’d love feedback from others who have a DGX Spark or similar hardware. Feel free to try out the build or use the wheel and let me know if it improves your workloads. Any suggestions for further tuning are very welcome!
Eugr@reddit
Great job! NVIDIA still recommends using the "official" PyTorch cu130 wheels, and they are missing Blackwell support. Haven't checked their latest container though.
guigsss@reddit (OP)
Yeah, I think they’ll provide it over time, but at the moment there is still a big difference if you optimise it yourself.
Eugr@reddit
Installed the pre-built wheels in my vllm container, can import torch and onnx, but any attempt to actually load a model with vllm results in:
```bash
ImportError: cannot import name '_cast_Long' from 'torch.onnx.symbolic_opset9' (/usr/local/lib/python3.12/dist-packages/torch/onnx/symbolic_opset9.py)
```
I'll try to rebuild vLLM again and see if that helps...
guigsss@reddit (OP)
Have you installed all the apt packages I provided in the README? Also, is it something only happening with vLLM? If you run a basic GEMM bench, does it work correctly?
Eugr@reddit
Looks like it happens only with torchvision workflows. vLLM works when I remove torchvision, but it would be great to have it working.
guigsss@reddit (OP)
Leave it to me, I'll come back with a solution and ping you. It might be the torchvision wheel that is corrupted in some way.
Eugr@reddit
Still getting messages like this in vLLM. The overall performance seems to be slightly better than the regular cu130 wheels I was using, but it needs more testing.
guigsss@reddit (OP)
u/Eugr when you say slightly better, what's the actual difference? I'm curious.
Also, for vLLM did you create a version with the latest PyTorch build? Because by default it uses torch 2.9.0.
Eugr@reddit
I've had some of the packages installed already, but I believe I installed everything else. I had to install an extra package for NVSHMEM support, because PyTorch was complaining about it.
Haven't tried other workflows yet.
mamasher@reddit
Amazing job! Do you have any numbers for prompt processing (pp) and tokens/s between the baseline and your optimised version?
guigsss@reddit (OP)
Not yet but I’ll provide some really soon.
Artix93@reddit
How does it compare against the optimized nvidia docker containers for pytorch like https://catalog.ngc.nvidia.com/orgs/nvidia/containers/pytorch?
guigsss@reddit (OP)
I'll try it and let you know where it stands.
Eugr@reddit
This is the one included in their newest vllm docker:
Regular-Forever5876@reddit
It was a pleasure to work on this subject with you. The optimisation stack was provided by the engineering team from https://P2Enjoy.studio, and OP wrote wonderful documentation 🙏
guigsss@reddit (OP)
Likewise 🙂
Eugr@reddit
I wonder if GPUDirect really works, because according to NVIDIA folks, it won't due to the unified memory architecture.
As they explained it, once CUDA allocates a memory segment, it becomes isolated from the rest of the SoC, including the PCIe bus, and DMA access from the CPU/PCIe side is not allowed, so direct NVMe-to-VRAM transfer is not possible.
They didn't explain the reasoning behind that, just cited "performance and stability reasons".
Regular-Forever5876@reddit
Thanks for the insight, we will look into it and investigate. 🙂
Long_comment_san@reddit
Amazing job, this needs 100× more likes – you basically did NVIDIA's job for free.
guigsss@reddit (OP)
Thanks bro, really appreciate it
Mythril_Zombie@reddit
The Thor is similar, think it works there as well?
guigsss@reddit (OP)
Thor is architecturally similar (Grace + Blackwell + unified memory), so the majority of this setup should work as well.
I’d suggest you set up two Python environments with both libraries and check the result by running the benchmark examples in the repo. I’d be interested to see the results as well.