Qwen 27b MTP Config, Llama.cpp Single 3090

Posted by GotHereLateNameTaken@reddit | LocalLLaMA | 31 comments

What setup are you using for Qwen 27B on a single 3090?

Here's what I started using today. It has to compact the context often, but I'm worried about giving up more accuracy and reliability by dropping to a lower quant:

llama-server -m /Models/q3.6/Qwen3.6-27B-Q5_K_S.gguf \
  -c 65536 -ngl -1 -t 8 \
  -ctk q8_0 -ctv q8_0 \
  --chat-template-kwargs "{\"preserve_thinking\": true}" \
  --spec-type draft-mtp --spec-draft-n-max 2 \
  --fit off \
  --mmproj /Models/q3.6/mmproj-Qwen3.6-27B-f16.gguf --no-mmproj-offload
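For what it's worth, the q8_0 KV-cache flags are doing a lot of work at a 65536-token context. Here's a rough sketch of the arithmetic, using placeholder model dimensions (48 layers, 8 KV heads, head dim 128) since I don't know the real Qwen3.6-27B numbers — treat every constant below as an assumption:

```python
# Rough KV-cache size estimate: 2 tensors (K and V) per layer,
# each n_kv_heads * head_dim values per token.
# NOTE: the layer/head counts below are placeholders, not real
# Qwen3.6-27B specs.

def kv_cache_bytes(n_layers: int, n_kv_heads: int, head_dim: int,
                   ctx_len: int, bytes_per_elem: float) -> float:
    return 2 * n_layers * n_kv_heads * head_dim * ctx_len * bytes_per_elem

LAYERS, KV_HEADS, HEAD_DIM, CTX = 48, 8, 128, 65536  # assumed dims

f16  = kv_cache_bytes(LAYERS, KV_HEADS, HEAD_DIM, CTX, 2.0)      # f16: 2 bytes/value
q8_0 = kv_cache_bytes(LAYERS, KV_HEADS, HEAD_DIM, CTX, 34 / 32)  # q8_0: ~8.5 bits/value

print(f"f16 KV cache:  {f16 / 1e9:.1f} GB")   # ~12.9 GB under these assumptions
print(f"q8_0 KV cache: {q8_0 / 1e9:.1f} GB")  # ~6.8 GB under these assumptions
```

So q8_0 roughly halves the cache vs f16, which is presumably what makes a 64K context fit next to the model weights at all.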

I'm getting around 65 tok/s.

I've also seen these recommendations: https://github.com/noonghunna/club-3090/blob/master/docs/SINGLE_CARD.md

They seem to be using the Q4 quant. How are you weighing the tradeoffs?
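The file-size side of that tradeoff is easy to ballpark. The bits-per-weight figures below are approximate GGUF averages (and the 27B parameter count is taken at face value), so treat them as rough assumptions, not exact numbers:

```python
# Ballpark GGUF file size: params * bits-per-weight / 8.
# The bpw values are approximate averages for these quant types,
# not exact figures for this particular model.

def gguf_size_gb(n_params: float, bits_per_weight: float) -> float:
    return n_params * bits_per_weight / 8 / 1e9

PARAMS = 27e9  # nominal 27B

for quant, bpw in [("Q4_K_M", 4.8), ("Q5_K_S", 5.5), ("Q8_0", 8.5)]:
    print(f"{quant}: ~{gguf_size_gb(PARAMS, bpw):.1f} GB")
```

On a 24 GB 3090, Q5_K_S at roughly 18-19 GB of weights leaves only a few GB for KV cache and compute buffers, while Q4 frees up a couple more GB for context — which would explain the club-3090 recommendation, at the cost of the quant-quality hit I'm worried about.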