[FOLLOW UP] Qwen3.6 27b q5_k_M MTP - 256k context - 5090
Posted by No_Mango7658@reddit | LocalLLaMA | 19 comments
DUAL 5090s!!!
Absolutely amazing results with dual 5090s, basically doubling my tps.
Just ran this test and was surprised by the results.
llama-cli-mtp \
-m ~/Downloads/Qwen3.6-27B-Q5_K_M-mtp.gguf \
--spec-type mtp \
--spec-draft-n-max 3 \
--cache-type-k q8_0 \
--cache-type-v q8_0 \
-c 262144 \
-ngl 99 \
--flash-attn on \
--verbose \
-p "Write a short Python function that parses a CSV file."
[ Prompt: 1735.6 t/s | Generation: 127.9 t/s ]
Peak GPU memory usage across both cards: 18+21 = 39GB total
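For intuition on what --spec-draft-n-max 3 buys: under the usual speculative-decoding model, where each drafted token is accepted independently with some probability, the expected number of tokens committed per verification pass has a simple closed form. A minimal sketch; the 0.70 acceptance rate below is a made-up illustration, not something measured from these runs:

def expected_tokens_per_pass(n_draft: int, p_accept: float) -> float:
    # Expected tokens committed per target-model forward pass, assuming
    # each draft token is accepted i.i.d. with probability p_accept.
    if p_accept >= 1.0:
        return n_draft + 1
    return (1 - p_accept ** (n_draft + 1)) / (1 - p_accept)

print(expected_tokens_per_pass(3, 0.70))  # ~2.53 tokens per pass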
I've done literally nothing besides put in the second GPU and alter my llama command.
llama-cli-mtp \
-m ~/Downloads/Qwen3.6-27B-Q5_K_M-mtp.gguf \
--spec-type mtp \
--spec-draft-n-max 3 \
-c 262144 \
-ngl 99 \
--verbose \
-p "Write a short Python function that parses a CSV file."
[ Prompt: 251.7 t/s | Generation: 119.4 t/s ]
Peak GPU memory usage across both cards: 22+25 = 47GB total
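Most of the 47GB vs 39GB gap should be the KV cache dtype, though note the second run also drops --flash-attn, so it isn't a perfectly controlled comparison. Here's a back-of-the-envelope calculator; the layer/head dims below are placeholders, NOT the real Qwen3.6-27B config (read the actual values from the GGUF metadata):

def kv_cache_gb(n_layers, n_kv_heads, head_dim, ctx_len, bytes_per_elem):
    # K and V tensors for every layer across the full context window.
    return 2 * n_layers * n_kv_heads * head_dim * ctx_len * bytes_per_elem / 1e9

# q8_0 stores ~1.0625 bytes/element (32 int8 quants plus one fp16 scale
# in each 34-byte block); f16 stores 2 bytes/element.
for name, b in [("f16", 2.0), ("q8_0", 34 / 32)]:
    print(name, round(kv_cache_gb(40, 4, 128, 262144, b), 1), "GB")  # hypothetical dims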
Sharing more configurations and tests. I haven't evaluated the output quality of these runs, just the speeds.
Due_Duck_8472@reddit
Still light years from the frontier models on a $20 sub... such a huge waste of money
Solary_Kryptic@reddit
until you hit your token limit and you're stuck with nothing
Due_Duck_8472@reddit
That's not true, you don't have to code 24/7
Solary_Kryptic@reddit
there are a myriad of other uses for tokens, but also this is literally a hobbyist sub. You've never heard of expensive hobbies?
No_Mango7658@reddit (OP)
This is a local llm subreddit. There are lots of reasons not to use proprietary offsite models.
Due_Duck_8472@reddit
Why are you?
No_Mango7658@reddit (OP)
I'm here to share my experience with others who are considering the same path for their own reasons. You obviously just want to fight. Why waste my time?
Practical-Collar3063@reddit
Imagine thinking this subreddit is about cost efficiency
tamerlanOne@reddit
It depends on what you want to do...
No-Dot-6573@reddit
Do we already have some numbers on q5/q6 vs nvfp4? Regarding tps and quality?
Such_Advantage_6949@reddit
I am using 8bpw with full context and dflash, and I average 100 tok/s. For coding questions like this I get 170 tok/s. You should try out exllama3
No_Night679@reddit
Following
uti24@reddit
Oh, so you also removed KV quantization?
I've noticed it can substantially improve speed by itself; you could try adding it back.
No_Mango7658@reddit (OP)
The second run had KV quantization removed. It was slightly faster with Q8.
hurdurdur7@reddit
You can afford Q6 and a full-precision cache with this much VRAM...
No_Mango7658@reddit (OP)
Ya for sure, just comparing against my previous post where I fit 256k on a single 5090.
Practical-Collar3063@reddit
Try vLLM with tensor parallelism; you will get better performance at long context, especially for prompt processing.
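A minimal sketch of that setup via vLLM's offline API; the checkpoint name is a placeholder (substitute whatever Qwen3.6-27B release you actually have), and whether a 262144 context fits depends on the quant and KV dtype:

from vllm import LLM, SamplingParams

llm = LLM(
    model="Qwen/Qwen3.6-27B",   # placeholder, not a confirmed repo id
    tensor_parallel_size=2,     # shard every layer across both 5090s
    max_model_len=262144,
    kv_cache_dtype="fp8",       # rough analogue of the q8_0 cache above
)
out = llm.generate(
    ["Write a short Python function that parses a CSV file."],
    SamplingParams(max_tokens=512, temperature=0.7),
)
print(out[0].outputs[0].text)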
caetydid@reddit
that is decent! now I just need a spare pair of 5090s
Inevitable-Log5414@reddit
Beautiful numbers. Saving this thread for when I finally pull the trigger on a second card. Thanks for sharing the exact flags.