DGX Spark just arrived — planning to run vLLM + local models, looking for advice

Posted by dalemusser@reddit | LocalLLaMA | 84 comments

Just got a DGX Spark set up today, and I'm starting to configure it for local LLM inference.

Plan is to run:

•   vLLM

•   PyTorch

•   Hugging Face models

as a local API backend for an application I’m building (education / analytics use case, trying to keep everything local/private).
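For context on what the backend setup looks like: vLLM ships an OpenAI-compatible HTTP server, so the whole stack can be a single `vllm serve` invocation plus standard API calls from the app. A minimal sketch of that (the model name is just a placeholder, not a recommendation for this hardware):

```shell
# Start vLLM's OpenAI-compatible API server.
# The model name below is a placeholder; substitute whatever fits in memory.
vllm serve meta-llama/Llama-3.1-8B-Instruct --host 0.0.0.0 --port 8000

# From the application side, query it like any OpenAI-style endpoint:
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "meta-llama/Llama-3.1-8B-Instruct",
        "messages": [{"role": "user", "content": "Hello"}]
      }'
```

Since the app only needs to speak the OpenAI API, swapping models later shouldn't require any client-side changes.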

I’ve mostly been working with cloud GPUs up to now, so this is my first time running something like this fully on-prem.

A few things I’m curious about:

•   Best models people are running efficiently on this hardware?

•   Any tuning tips for vLLM on unified memory systems like this?

•   Real-world throughput vs expectations?
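On the unified-memory question specifically, my current understanding is that the main knobs are how much of the shared pool vLLM is allowed to claim for weights and KV cache. These are stock vLLM flags, but the values below are untested starting points I'd plan to try, not tuned numbers for the Spark:

```shell
# All flags are standard vLLM options; values are guesses, not benchmarks.
#   --gpu-memory-utilization: fraction of memory vLLM may claim for weights
#     and KV cache; leave headroom since the CPU shares the same pool.
#   --max-model-len: cap the context length to bound KV-cache size.
#   --max-num-seqs: limit how many sequences are batched concurrently.
vllm serve meta-llama/Llama-3.1-8B-Instruct \
  --gpu-memory-utilization 0.80 \
  --max-model-len 8192 \
  --max-num-seqs 32
```

Curious whether people running on unified memory keep utilization lower than the usual 0.90 default to avoid starving the CPU side.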

Would appreciate any insights from people running similar setups.