DGX Spark just arrived — planning to run vLLM + local models, looking for advice

[-]

dalemusser@reddit (OP)

Great question—both are solid, just optimized for different use cases in my experience (and I may be biased coming from cloud setups).

For this, I’m leaning toward vLLM mainly because I’m treating the DGX Spark as a serving backend, not just a local runner. • Throughput / batching: vLLM’s paged KV cache + continuous batching seem like a big advantage for handling multiple requests efficiently. Since I’m planning to expose it as an API, that matters more to me than single-user latency. • GPU utilization: It’s designed to keep the GPU saturated under load, which feels like a better fit for this kind of system. • OpenAI-compatible API: Makes integration a lot easier without needing much glue code. • Scaling mindset: Feels closer to how things are typically run in production vs llama.cpp (which I’ve mostly used in a more interactive/local context).

That said, I think llama.cpp is great for: • quick local experiments • CPU / low-power setups • heavily quantized models when memory is tight

So for me it’s less “which is better” and more:

vLLM = backend service / throughput llama.cpp = local runtime / flexibility

Curious if anyone here is running llama.cpp as a multi-user API at scale—would be really interested in how it compares in practice.

Part of why I posted is to sanity check my assumptions—I’ve used llama.cpp locally (mostly interactive on Macs), but not really as an API backend, so definitely open to better approaches if people have found them.

[-]

the__storm@reddit

Great explanation Claude, thank you

[-]

dalemusser@reddit (OP)

You’re partly right 😄

Here’s what I originally wrote before cleaning it up using ChatGPT (not claude):

"I think both are good just for different things. I might be biased because I’ve mostly used cloud GPUs before.

For this I’m leaning vLLM because I’m treating this more like a backend service and not just something I run locally.

- batching and throughput seem like a big deal, especially if there are multiple requests

- seems better at keeping the GPU busy

- easier to just expose as an API and plug into other stuff

- feels more like how things are done in production

llama.cpp is still great though:

- quick local stuff

- runs well on CPU / lower power machines

- good with quantized models

so it’s not really which is better, more like:

vLLM = backend / throughput

llama.cpp = local / flexible

I haven’t really used llama.cpp as an API though, mostly just interactive on my Macs, so I could be missing something. Curious if people are running it at scale."

Then I had ChatGPT clean it up a bit for readability.

Figured if I’m working with LLMs, I should probably use them too 😄. You have a problem with using AI in your work?

[-]

kuhunaxeyive@reddit

It's not using AI by itself that's the problem, but the style of writing of an AI distracts much from the content while reading it. And for a personal opinion, it sounds less legit as it sounds generated. (Like "when memory is tight", "a better fit", "it's A, not B", and heavy usage of "—".)

[-]

Zanion@reddit

Ai;dr is more real by the day.

[-]

kuhunaxeyive@reddit

"being real day by day" doesn't help if the reader lands in uncanny valley with the current generation

[-]

Zanion@reddit

I get the impression you might have missed the intended interpretation of my comment. ai;dr is a play on tl;dr.

AI; didn't read.

I agree with you though. I find myself instantly discrediting, devaluing, and passing over the message of an author the more the format of their message is riddled with AI markers. I expect if they are too lazy to revise and curate their own messaging, then it's unlikely there is any depth or substance to their message worth considering.

[-]

dalemusser@reddit (OP)

Good point. Thanks

[-]

bzrkkk@reddit

I like your original text

[-]

dalemusser@reddit (OP)

Thanks, good to know.

[-]

Tyme4Trouble@reddit

This is the way. If high concurrency / batch are a priority then SGLang should also be on your radar

[-]

dalemusser@reddit (OP)

Thanks for pointing out SGLang.

SGLang is definitely on my radar now. It looks like it’s targeting the same kind of high-throughput serving space, with continuous batching, paged attention, prefix caching, and other serving optimizations. I’m still leaning vLLM as the default starting point just because it seems like the more common baseline for production-style deployments and the OpenAI-compatible server path is very straightforward, but I agree SGLang is something I should investment and benchmark.

I really appreciate the recommendation :)

[-]

KooperGuy@reddit

I'm sure someone much smarter than me will come along and help give feedback on this! Thanks for sharing

[-]

Writer_IT@reddit

When working with long context, vllm process Speed blows out llamacpp, unfortunately. It's the biggest ostacle to run in production scenario partially offloaded models

[-]

KooperGuy@reddit

Mmm but long context on a GB10?

[-]

CooperDK@reddit

Take a guess. Vllm owns everything else. It is insanely fast

[-]

insanemal@reddit

Get a refund.

I've got two.

They are too slow. NVIDIA made a lot of promises the hardware cannot meet.

[-]

keyser1884@reddit

I found getting llama cpp to work much easier than vllm. The arm cpu and new CUDA version can make compatibility an issue

[-]

VoiceApprehensive893@reddit

2 more until GLM 5.1 at a good quant

[-]

Perfect-Flounder7856@reddit

What do you need for hardware to run GLM5.1?

[-]

StardockEngineer@reddit

Use this repo to run vllm easily https://github.com/eugr/spark-vllm-docker

Great info and discussion in the forums. More than you'll get here: https://forums.developer.nvidia.com/c/accelerated-computing/dgx-spark-gb10/719

[-]

RipperFox@reddit

https://spark-arena.com/ even has their on runner: of https://github.com/spark-arena/sparkrun which can use multiple backends like vLLM, llama.cpp, SGLang..

[-]

entsnack@reddit

This repo has been a lifesaver.

[-]

Original_Finding2212@reddit

Great links and content!

Thank you for sharing!!

[-]

dalemusser@reddit (OP)

Thanks! Very helpful, I appreciate it.

[-]

RegisteredJustToSay@reddit

So jelly! I wish I could justify this as an expense but hourly pricing for spot GPU instances is so low that I just can't quite justify it.

[-]

dilberx@reddit

Good for a early adoption but nvidia is relasing more consumer products which will be vfm and better choice than macs.

[-]

Only_Play_868@reddit

I feel I don't understand something, but I thought the DGX Spark was a better investment for training validation, not inference. If you're just running local models, aren't there more economical options (high-end Mac Mini's, Mini AI PCs, etc)? I considered getting one myself but mostly for training adapters to models on something with CUDA and enough VRAM that I could experiment locally before moving to clusters on RunPod

[-]

Own_Mix_3755@reddit

Well.. yes and no. It depends what you are expecting from it. Mac mini with standard M chip is about the same in speed, but does not have CUDA on their side which helps alot. M Ultra is much faster, but its speed it still depends whether we are talking about promot processing (time to first token) or the response. For the later, the Mac with Ultra chip will be much faster, while for prompt reading they will be equal or Nvidia can be even faster thanks to some internal black magic. Also if you want 128gb memory, you have to aim for Mac Studio and not Mini. And there we are talking about Mac being more expensive (in some countries even much more expensive). Not to mention you need specific models for Mac. Also keep in mind that there are variants of DGX Spark from other manufactuers which can be significantly cheaper (eg due to not needing 4TB ssd).

And with Mini ai pcs your only option is Ryzen Strix Halo. Yes they will be a bit cheaper than this, but also much less optimized and honestly ROCm is always the last to get new models and all the bells and whistles. So it has a tradeoff.

The best obviously are dedicated GPUs because their speed is massive compared to all these. But so is the power needed to run those and also the cost, because getting 128gb of memory purerly of graphic cards… will take you to being twice as expensive than this little box.

[-]

Only_Play_868@reddit

Thanks for the explanation! 128GB VRAM is hard to come by in a similar price and form factor, that's for sure. I do wonder if the other OEM-variants of the DGX Spark are a better bet for inference, since it's the same GPU and VRAM but lower cost. As with anything, it depends on how big of a model you want and what kind of throughput you need

[-]

audioen@reddit

I run this model that is downloaded and setup by this install.sh

https://github.com/albond/DGX_Spark_Qwen3.5-122B-A10B-AR-INT4

with this shutdown/startup script, and vllm options that I'm experimenting with:

docker stop vllm-qwen35
docker rm vllm-qwen35
sudo sh -c 'sync; echo 3 > /proc/sys/vm/drop_caches'
docker run -d --name vllm-qwen35 \
  --gpus all --net=host --ipc=host \
  -v ~/models:/models \
  vllm-qwen35-v2 \
  serve /models/qwen35-122b-hybrid-int4fp8 \
  --served-model-name qwen \
  --max-model-len 262144 \
  --max-num-seqs 4 \
  --max-num-batched-tokens 32768 \
  --gpu-memory-utilization 0.90 \
  --port 8000 \
  --host 0.0.0.0 \
  --load-format fastsafetensors \
  --attention-backend FLASHINFER \
  --speculative-config '{"method":"mtp","num_speculative_tokens":2}' \
  --enable-chunked-prefill \
  --enable-auto-tool-choice \
  --reasoning-parser qwen3 \
  --tool-call-parser qwen3_coder \
  --generation-config auto \
  --override-generation-config '{"temperature": 0.6, "top_p": 0.95, "top_k": 20, "presence_penalty": 0.0, "repetition_penalty": 1.0}'

I just slapped this together last night from various sources. vLLM is poorly documented in my experience and there can be laughable problems with the above.

I think --max-num-seqs should be set lower from whatever its default is, which I suspect to be 16. For hardware like GB10, that is needlessly many -- the thing barely has the power to run about 4 inference jobs in parallel without running out of compute, and I suspect it's more like 2-3 when you have speculative config set. What I'm certain of is that the capacity for large number of parallel sequences eats up your VRAM.

The --load-format fastsafetensors is also essential, as it drops model loading time by like 80 %. It is just ridiculously slow without this. Even with this setting, starting vllm takes like 10 minutes...

Chunked prefill should speed up prompt processing, though I have no data if that is the case, the auto tool choice, tool call parser and reasoning parser are set like this because this works in opencode. I know the git repo said I should use qwen3_xml but tool calls immediately stopped working in opencode, so I don't think so. I don't know what is up with that. Getting rid of reasoning parser also will confuse the model greatly, so it has to be there. There is apparently a bug between reasoning and tool call parser in that the reasoning parser runs first and extracts the think sequence out, and the tool call parser runs on what remains, and if the model wrote tool call into think section, it misses it.

The generation config and override is just my attempt to guarantee that sampling is being performed with qwen3.5 recommendations for coding. I have no idea how to confirm what settings are used, as vllm really doesn't like to print this type of information that would help me to ascertain the correctness of the parameters that inference is executing under.

Anyway, prompt is around 1200 tok/s and generation around 50 tok/s with this repo. I'm seeing variation from 40-60 on medium sized prompts, very short completions are < 10 tok/s, likely some overhead causes such a low figure. I am not sure if the model is at full quality as this bastardized int4/fp8 combo, but it seems to behave quite well and while I think there could be slight gap to what I'm used to getting from 6-bit GGUF inference, the speed more than makes up for it.

[-]

dalemusser@reddit (OP)

Thanks! I really appreciate it your taking the time to provide it. I'll definitely use it.

[-]

Late-Assignment8482@reddit

That's an excellent "one shot to get something smart up" command. Refine into your preferred stack later.

[-]

Traditional-Gap-3313@reddit

What I'm certain of is that the capacity for large number of parallel sequences eats up your VRAM.

As far as I understand it, once KV Cache is allocated it doesn't matter how many sequences yo're running as long as they all fit. But it matters in boot-up of vllm since it has to calculate cuda graphs.

without --max-num-seqs vLLM will calculate cuda graphs for up to 512 parallel requests, which will require allocation of memory for those calculations and can result in OOM if it's a tight fit. This is especially pronounced on smaller GPUS, maybe not so much on spark.

But yeah, got burned a lot of times before I figured out that --max-num-seqs affects more then only the param name implies

[-]

Late-Assignment8482@reddit

I'm running Qwen3.5-122B at good speed with enough space for four 250k token streams, using eugr's repo mentioned below, Albond's Intel Int4 patches, the Intel Int4 quant of the model, and vllm.

[-]

_wOvAN_@reddit

one is not enough

[-]

dalemusser@reddit (OP)

I definitely *want* more than one. But, I am happy right now I have one. I understand though what you are saying.

[-]

codenow-zero@reddit

There is the new Gemma4 and they have few options which your beast cand handle for sure!

[-]

rerorerox42@reddit

Having «enough» is one of the things that cannot be bought

[-]

Secure_Archer_1529@reddit

You can run qwen 3.5 122b on one spark with >50 t/s as I remember. Check out the forum. The community around the spark (not including nvidia) has been hard at work on this on.

[-]

Easy-Unit2087@reddit

The 200GbE is 1/3 of the value in a GB10. Adding a second node is great for large MoE models. 1,500t/s PP and 30t/s TG with Qwen 3.5 397b in a vLLM Docker.

[-]

the_bollo@reddit

When I look at those things all I see is the grippy edge of concrete steps.

[-]

Budget-Juggernaut-68@reddit

aren't these built for training?

[-]

SoundEnthusiast89@reddit

I’ve been running VLLM but the CUDA takes 4.6GB of extra RAM per model. Also I could nit quantize any models to FP8 because of lack of software support. Running BF16 at extremely slow throughout. On the other hand, my mac M4 Max keeps churning tokens at lightening speed on vllm-mlx. Can anyone tell me if I’m doing something wrong or the lack of CUTLASS support is real in Sparks?

[-]

moofunk@reddit

I thought the DGX Spark was considered a bad option now. Did something change?

[-]

Easy-Unit2087@reddit

Yes, something did change. Agentic coding has increased the importance of PP relative to TG, and the GB10's ecosystem has improved: easy to deploy models thanks to community efforts, plus better firmware and software (vLLM). Mac is still the better option for lay people getting their toes wet, GB10 makes a lot of sense for devs especially if also using the powerful GPU for stable diffusion or fine-tuning small models.

[-]

Shot-Buffalo-2603@reddit

It’s not bad, it’s mediocre performance wise vs other options of similar costs for inference. There are a lot of trade offs though. It comes in a small for factor and is lower power vs hooking up 4 3090s.

[-]

DataGOGO@reddit

why run vLLM on a d DGX spark vs TRT LLM?

[-]

dalemusser@reddit (OP)

I don’t have any experience with TensorRT-LLM. From what I understand, TensorRT-LLM can be more optimized for NVIDIA hardware, but also seems like more setup compared to vLLM.

I'd be interested in hearing from anyone with experience when would you choose TensorRT-LLM over vLLM?

[-]

DataGOGO@reddit

TRT LLM is much faster, and you build much better optimized kernels vs using the generics in vLLM

Setup is about the same either way.

TRT LLM for Inference | DGX Spark

[-]

entsnack@reddit

It's really annoying to setup and debug, but it is indeed faster, even for image generation.

[-]

dalemusser@reddit (OP)

Thanks for letting me know that. I'll definitely try it out. Very useful to know.

[-]

Porespellar@reddit

Sparkrun is the easy button for running vLLM on Spark. They even have a Claude Code Skill if you need any extra help getting it running.

https://sparkrun.dev

Sparkrun also pairs well with Spark Arena where you can find the highest community rated quants and recipes to use via Sparkrun

https://spark-arena.com

[-]

head-of-potatoes@reddit

I've found setting up Claude Code on my Spark has made it a lot more fun. It figures out all the annoying version incompatibilities that made it tough to keep things up to date. I'm running a TTS model (from Qwen), a Qwen model for surveillance camera analysis in vLLM, a few text models, and some other AI tooling I need for various projects. It's been great. Inferencing of small models is faster on a desktop GPU, but the midsize models are where Spark really shines because of the unified memory.

[-]

Pawderr@reddit

I am also using qwen for video analysis, how do you handle wrong descriptions? For example someone pushing against a heavy object and qwen stating the person is simply resting their hands on it or something similar.

[-]

dalemusser@reddit (OP)

Thanks for the recommendation 🙂 I’ve been using Claude Code on my Macs for development, and it’s definitely made figuring things out a lot more enjoyable. It’s also helped me get through things I probably wouldn’t have had time to dig into otherwise by going through docs, forums, and experimenting until something works.

I appreciate you sharing what you’ve been doing on your Spark as well. On my local machines I’ve mostly been limited by memory, so I’ve only worked with smaller models. And when I’ve used cloud instances (mainly for a work project), it hasn’t really been practical to experiment much or spend time exploring due to cost.

That’s a big part of why I’m excited about this setup. I am able to work with larger models and experiment more freely without constantly thinking about hourly usage.

[-]

Cupakov@reddit

Once you get the LLMs running, switch to pi.dev from Claude code, it’s way easier to customize it to your particular setup so you have a dedicated IT support agent available at all times

[-]

dalemusser@reddit (OP)

Thanks for the recommendation. pi.dev is new to me. Definitely will try it. I appreciate the direction.

[-]

dalemusser@reddit (OP)

I should also mention that I’m working with a university on an educational 3D game (Unity/WebGL) where they are studying whether gameplay can teach science concepts as effectively as traditional classroom instruction. As part of this work, I’m using gameplay log data to generate LLM-based feedback on student performance, including identifying areas where students may need additional support with the curriculum.

Due to IRB, FERPA, and COPPA requirements, I’m not permitted to send this data, even in de-identified form, to external APIs, and there are also restrictions against using cloud-based GPU instances. Processing must remain on-site or not happen at all.

That’s a big reason I’m excited about having a local system like this. It allows me to experiment with generating meaningful, personalized student feedback in ways that simply wouldn’t be possible within those constraints otherwise.

[-]

Agreeable_Effect938@reddit

I can only wish you good luck with setting this shit up. great hardware, awful software

[-]

Confident_Dimension7@reddit

I run this on mine. https://github.com/eugr/spark-vllm-docker

[-]

Klutzy_Comfort_4443@reddit

Véndelo y cómprate un Mac

[-]

dalemusser@reddit (OP)

Jaja, ya tengo Macs...uno con 128GB de RAM también. Esto es más por el stack (CUDA vs Metal) que por la memoria 😄

[-]

createthiscom@reddit

Does the DGX Spark have the full tcgen05 instruction set?

[-]

dalemusser@reddit (OP)

La diferencia entre el stack en la nube y el stack en Mac era demasiado grande para escalar.

[-]

arm2armreddit@reddit

congrats, be careful its addicting, after getting secong dgx spark I understood memory bottle necks, need 2xH200 🤫🫣

[-]

dalemusser@reddit (OP)

Good advice in it being addicting. I can see that being the case. I really would like to have a second dgx spark. Hopefully something happens that enables me to be able to do that.

[-]

arm2armreddit@reddit

This is an impressive piece of hardware, not only for LLMs. Try OpenGL workloads; 128GB for large renderings is breathtaking.

[-]

dalemusser@reddit (OP)

Thanks for the recommendation. I look forward to checking it out.

[-]

weichafediego@reddit

I want one so bad!

[-]

dalemusser@reddit (OP)

I appreciate the feeling. I was obsessing so bad before getting it.

[-]

schnauzergambit@reddit

Single user, use llamacpp. Multi user, vllm

[-]

ambient_temp_xeno@reddit

I'm still not sure about the design of this thing.

[-]

dalemusser@reddit (OP)

I don't disagree. It is an "interesting" design. It is also so small and dense.

[-]

ambient_temp_xeno@reddit

It's not a beige box, at least!

[-]

dalemusser@reddit (OP)

The shiny parts are mirrors. Definitely the opposite of my first beige box PC.

[-]

WolpertingerRumo@reddit

I‘m guessing MoE will be you best guess. It has a lot of unified ram, but not quite as fast as GPU. So MoE should give you the most speed. By now there’s MoE of most model families, so you should be able to find one that fits.

It will be slower than cloud, but still fast.

[-]

dalemusser@reddit (OP)

That’s a good point. MoE does seem like a natural fit here given the unified memory. Being able to load larger models but only activate part of them per token could be a nice balance. Thanks for the suggestion.

[-]

conockrad@reddit

Biggest advice - think about extra cooling.

Throttling is real!

[-]

dalemusser@reddit (OP)

Good to know! Thanks.

[-]

Syzygy___@reddit

I'm jelly... my GF's DGX Spark seems to have gotten lost with the post...

[-]

CooperDK@reddit

Midt would do it the other way, know about ai before getting a spark

[-]

dalemusser@reddit (OP)

That’s fair, most of my experience so far has been on cloud GPUs, this is just moving that workflow on-prem. I also find I learn best by doing and my existing local computers didn't have the resources to try what I really wanted to try.