GPT-OSS-120B vs DGX Spark

Posted by AdamLangePL@reddit | LocalLLaMA | View on Reddit | 18 comments

Just curious what are your best speeds with that model. The max peak that i have using vllm is 32tps (out) on i think Q4 k\_s. Any way to make it faster without loosing response quality ?

Reply to Post

18 Comments

[-]

AdamLangePL@reddit (OP)

Ok changed from vllm to ollama.cpp, model runs faster but… started to loop. Any suggestions ?

[-]

hurdurdur7@reddit

Whatever the speed is... why would you use that model? Better quality models have come since this came out.

[-]

AdamLangePL@reddit (OP)

Point me to some better quality model that i can run on DGX :) then i will try it!

[-]

Data extraction and analysis mostly. I'm posting question -> runs MCP tool -> Prepares answer (in JSON). OSS-120B doing great job, OSS-20B missing some data while preparing output (frequetnly). Qwen3-30B ... mostly confused and returns rubbish or empty data.

[-]

hurdurdur7@reddit

I think instead of blind trust, for your case, i would give try to the following: Qwen3.5-122B at Q4\_K\_M (or UD-IQ4\_NL or mxfp4 if you can find that one) Nemotron 3 Super (hey, it's bad at coding but maybe it's good for your case) at whatever quant that you can fit Qwen3.5-27B at Q8 (might be slow but damn it's beautiful) GLM-4.7-Flash at Q8 And just compare the outcome of these by yourself.

[-]

Odd-Ordinary-5922@reddit

why are you using q4ks when oss 120b is already quantized to mxfp4

[-]

AdamLangePL@reddit (OP)

Checking it now

[-]

AdamLangePL@reddit (OP)

ok, with llama.cpp and MXFP4 i managed to get \~50, better :)

[-]

Odd-Ordinary-5922@reddit

nice!

[-]

prescorn@reddit

Buy a gpu

[-]

inevitabledeath3@reddit

A DGX Spark has a GPU dude

[-]

Ok_Appearance3584@reddit

https://spark-arena.com/leaderboard 50ish for single

[-]

Narrow-Belt-5030@reddit

I liked this site simply because you gave the settings / method too. For me (a numb nuts) that's priceless

[-]

pmttyji@reddit

[https://github.com/NVIDIA/dgx-spark-playbooks](https://github.com/NVIDIA/dgx-spark-playbooks) Use ggml's MXFP4 quant for both GPT-OSS models. And use llama.cpp. [https://github.com/ggml-org/llama.cpp/discussions/16578](https://github.com/ggml-org/llama.cpp/discussions/16578) [https://github.com/ggml-org/llama.cpp/blob/master/benches/dgx-spark/dgx-spark.md](https://github.com/ggml-org/llama.cpp/blob/master/benches/dgx-spark/dgx-spark.md)

[-]

ImportancePitiful795@reddit

Clearly you have setup problem. GPT OSS 120B should be close to 60tks on the DGX with MXFP4.

[-]

AdamLangePL@reddit (OP)

Well, which VLLM "flavor" to use then? i'm using spark-vllm-docker now which should be optimized for it.

[-]

pontostroy@reddit

Check spark-arena results for this model, [https://spark-arena.com/benchmark/56a0c113-ee9d-409e-99ae-1a144b2e08e4](https://spark-arena.com/benchmark/56a0c113-ee9d-409e-99ae-1a144b2e08e4) and you can use [https://github.com/spark-arena/sparkrun](https://github.com/spark-arena/sparkrun) to run this model