[FOLLOW UP] Qwen3.6 27b q5_k_M MTP - 256k context - 5090
Posted by No_Mango7658@reddit | LocalLLaMA | 19 comments
DUAL 5090s!!!
Absolutely amazing results with dual 5090s, basically doubling my tps.
Just ran this test and was surprised by the results.
llama-cli-mtp \
-m ~/Downloads/Qwen3.6-27B-Q5_K_M-mtp.gguf \
--spec-type mtp \
--spec-draft-n-max 3 \
--cache-type-k q8_0 \
--cache-type-v q8_0 \
-c 262144 \
-ngl 99 \
--flash-attn on \
--verbose \
-p "Write a short Python function that parses a CSV file."
[ Prompt: 1735.6 t/s | Generation: 127.9 t/s ]
Peak GPU memory usage across both cards: 18+21 = 39GB total
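For intuition on what --spec-draft-n-max 3 buys: under the usual speculative-decoding model, where each drafted token is accepted independently with some probability, the expected number of tokens committed per verification pass has a simple closed form. A minimal sketch; the 0.70 acceptance rate below is a made-up illustration, not something measured from these runs:

def expected_tokens_per_pass(n_draft: int, p_accept: float) -> float:
    # Expected tokens committed per target-model forward pass, assuming
    # each draft token is accepted i.i.d. with probability p_accept.
    if p_accept >= 1.0:
        return n_draft + 1
    return (1 - p_accept ** (n_draft + 1)) / (1 - p_accept)

print(expected_tokens_per_pass(3, 0.70))  # ~2.53 tokens per pass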
I've done literally nothing besides put in the second GPU and alter my llama command.
llama-cli-mtp \
-m ~/Downloads/Qwen3.6-27B-Q5_K_M-mtp.gguf \
--spec-type mtp \
--spec-draft-n-max 3 \
-c 262144 \
-ngl 99 \
--verbose \
-p "Write a short Python function that parses a CSV file."
[ Prompt: 251.7 t/s | Generation: 119.4 t/s ]
Peak GPU memory usage across both cards: 22+25 = 47GB total
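Most of the 47GB vs 39GB gap should be the KV cache dtype, though note the second run also drops --flash-attn, so it isn't a perfectly controlled comparison. Here's a back-of-the-envelope calculator; the layer/head dims below are placeholders, NOT the real Qwen3.6-27B config (read the actual values from the GGUF metadata):

def kv_cache_gb(n_layers, n_kv_heads, head_dim, ctx_len, bytes_per_elem):
    # K and V tensors for every layer across the full context window.
    return 2 * n_layers * n_kv_heads * head_dim * ctx_len * bytes_per_elem / 1e9

# q8_0 stores ~1.0625 bytes/element (32 int8 quants plus one fp16 scale
# in each 34-byte block); f16 stores 2 bytes/element.
for name, b in [("f16", 2.0), ("q8_0", 34 / 32)]:
    print(name, round(kv_cache_gb(40, 4, 128, 262144, b), 1), "GB")  # hypothetical dims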
Sharing more configurations and tests. I haven't evaluated the output quality of these runs, just the speeds.
Due_Duck_8472@reddit
Still light years from the frontier models on a $20 sub... such a huge waste of money
Solary_Kryptic@reddit
until you hit your token limit and you're stuck with nothing
Due_Duck_8472@reddit
That's not true, you don't have to code 24/7
Solary_Kryptic@reddit
there are a myriad of other uses for tokens, but also this is literally a hobbyist sub. You've never heard of expensive hobbies?
No_Mango7658@reddit (OP)
This is a local llm subreddit. There are lots of reasons not to use proprietary offsite models.
Due_Duck_8472@reddit
Why are you?
No_Mango7658@reddit (OP)
I'm here to share my experience with others who are considering the same path for their own reasons. You obviously just want to fight. Why waste my time?
Practical-Collar3063@reddit
Imagine thinking this subreddit is about cost efficiency
tamerlanOne@reddit
It depends on what you want to do...
No-Dot-6573@reddit
Do we already have some numbers on q5/q6 vs nvfp4? Regarding tps and quality?
Such_Advantage_6949@reddit
I am using 8bpw with full context and dflash, and I average 100 tok/s. For coding questions like this I get 170 tok/s. You should try out exllama3
No_Night679@reddit
Following
uti24@reddit
Oh, so you also removed KV quantization?
I've noticed it can substantially improve speed by itself; you could try adding it back.
No_Mango7658@reddit (OP)
The second run had KV quantization removed. It was slightly faster with Q8.
hurdurdur7@reddit
You can afford Q6 and a full-precision cache with this much VRAM...
No_Mango7658@reddit (OP)
Ya for sure, just comparing against my previous post where I fit 256k on a single 5090.
Practical-Collar3063@reddit
Try vLLM with tensor parallelism; you will get better performance at long context, especially for prompt processing.
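A minimal sketch of that setup via vLLM's offline API; the checkpoint name is a placeholder (substitute whatever Qwen3.6-27B release you actually have), and whether a 262144 context fits depends on the quant and KV dtype:

from vllm import LLM, SamplingParams

llm = LLM(
    model="Qwen/Qwen3.6-27B",   # placeholder, not a confirmed repo id
    tensor_parallel_size=2,     # shard every layer across both 5090s
    max_model_len=262144,
    kv_cache_dtype="fp8",       # rough analogue of the q8_0 cache above
)
out = llm.generate(
    ["Write a short Python function that parses a CSV file."],
    SamplingParams(max_tokens=512, temperature=0.7),
)
print(out[0].outputs[0].text)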
caetydid@reddit
that is decent! now I just need a spare pair of 5090s
Inevitable-Log5414@reddit
Beautiful numbers. Saving this thread for when I finally pull the trigger on a second card. Thanks for sharing the exact flags.