Qwen3.6-27B-GGUF:UD-Q8_K_XL and llama.cpp issue (DGX SPARK)
Posted by DOOMISHERE@reddit | LocalLLaMA | 14 comments
Hey all,
I'm having a crisis that I just can't figure out...
I've used Qwen3.6-27B-GGUF:UD-Q8_K_XL ever since it came out (on a DGX Spark) and it worked like magic with decent performance (~50 t/s). I update the Spark and llama.cpp on a daily basis, and 3 days ago something happened... now I'm getting ~8 t/s...
I tried EVERYTHING...
hard power cycling (disconnecting the power block, everything...)
factory reset on the DGX Spark
went back to older versions of llama.cpp
nothing worked...
banging my head against the wall didn't help either...
Any idea what could have gone wrong?
I have 2 DGX Sparks and this happens on both of them...
I'm just lost 😞
fairydreaming@reddit
8 t/s sounds about right for running a 27B dense model on the DGX Spark's 273 GB/s memory bandwidth.
It's not possible to get 50 t/s with this model on the Spark. You probably just used a different model before, Qwen 3.6 35B-A3B or similar.
Cha0s2522@reddit
Getting 15-19 t/s with MTP on my Spark.
ComfyUser48@reddit
I'm getting 55 tok/sec on my 5090 with Q8_K, so there's no way you get anything close to that on the Spark with the 27B.
hoschidude@reddit
On a DGX Spark, just use vLLM instead of llama.cpp. It's much faster.
FigAltruistic2086@reddit
I use vLLM and see almost the same drop in t/s on the 27B dense model compared to a 32B-3B MoE.
Turbulent_War4067@reddit
That is not my experience. I was using vLLM and switched to llama.cpp under LM Studio, and every model I have tried has sped up drastically. My theory is vLLM will probably do better when getting hit concurrently, but if not, llama.cpp is faster. I am running one Spark. Gemma4-26B went from mid-30s tps to 60, Qwen3.6-3B went from upper 30s to 70, and Gemma4-31B went from 6-7 tps to 10.
Jumpy-Possibility754@reddit
Looks like version drift masked as a performance regression. Easy to miss when multiple builds and models are in play. Curious what you're using now to verify what's actually loaded at runtime.
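If it helps, here's a minimal sketch of the kind of check I mean, assuming the model is served through llama-server's OpenAI-compatible API on localhost:8080 (adjust the URL to whatever your launch script actually uses):

```python
# Sketch: ask a running llama-server which model it actually loaded,
# so a renamed or stale GGUF doesn't hide behind the launch script.
# Assumes the OpenAI-compatible endpoint is reachable at localhost:8080.
import json
import urllib.request

BASE_URL = "http://localhost:8080"  # assumed default port; change to match your setup

with urllib.request.urlopen(f"{BASE_URL}/v1/models") as resp:
    models = json.load(resp)

# Print the model id(s) the server reports at runtime.
for entry in models.get("data", []):
    print(entry.get("id"))
```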
pirateadventurespice@reddit
As others noted, you weren't actually getting that. Dense models just aren't a great fit for the Spark; you're better served running a MoE if token speed is your goal (you could also try parallel querying, I guess, but I've not had much success there).
wasnt_in_the_hot_tub@reddit
Are you sure it wasn't the Q4_K_XL? That would probably get 50 tokens/second on your hardware.
audioen@reddit
You never got 50 t/s. You used another model.
temperature_5@reddit
You were not running the 27B at that speed. Probably the 35B was still configured somewhere, or the file was renamed, etc.
I have an 890M iGPU, typically about half the speed of the 8060S for LLMs, and I get ~4.7 tok/s on the 27B at Q4.
ReentryVehicle@reddit
It would indeed be magic because DGX Spark has 273GB/s memory bandwidth.
Q8 27B model => ~27 GB of weights read per generated token
273 GB/s / 27 GB per token ≈ 10 tokens/s
Could it be that you ran 35B accidentally?
Unless you had speculative decoding and it was working exceptionally well (the native MTP-based speculative decoding in vLLM can predict 5 tokens into the future, but only on very predictable requests, and it usually doesn't give a 5x speedup anyway), it is impossible.
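To make the back-of-the-envelope explicit, here's a rough sketch of the bandwidth-bound estimate. It assumes decode speed is limited purely by streaming the full set of weights once per token (KV cache and compute overhead ignored):

```python
# Rough sketch: memory-bandwidth-bound ceiling for dense-model decode speed.
# Assumption: each generated token requires reading all model weights from memory,
# so tokens/s <= memory bandwidth / weight size. Overheads only make this worse.

def max_tokens_per_second(bandwidth_gb_s: float, model_size_gb: float) -> float:
    """Upper bound on decode tokens/s for a dense model."""
    return bandwidth_gb_s / model_size_gb

# DGX Spark: ~273 GB/s; a 27B model at Q8 is roughly 27 GB of weights.
print(max_tokens_per_second(273, 27))  # ~10.1 t/s ceiling, consistent with the ~8 t/s observed

# For ~50 t/s at 273 GB/s you'd need to read only ~5.5 GB per token,
# i.e. a much smaller model or an MoE with few active parameters.
print(273 / 50)  # ~5.5 GB per token
```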
LA_rent_Aficionado@reddit
8 t/s looks pretty standard for Q8/FP8 on a Spark. I can't imagine you ever got 50 t/s with an 8 BPW 27B dense model on a Spark; you must have been using another model or are mistaken.
https://forums.developer.nvidia.com/t/whats-the-best-speed-we-can-get-with-qwen-3-6-27b-without-quantizing/367561
https://www.reddit.com/r/LocalLLaMA/comments/1s2cmzb/qwen3527b_cant_run_on_dgx_spark_stuck_in_a/
https://forums.developer.nvidia.com/t/qwen3-6-27b-is-out/367503/5
https://forums.developer.nvidia.com/t/how-fast-can-qwen3-5-27b-be-after-converting-to-nvfp4/362776
LA_rent_Aficionado@reddit
What are you launching llama.cpp with? What params?