Single 3090 with Q4 Qwen 27B: context dropped from 137k to 14k with MTP enabled. Is this normal?

Posted by regunakyle@reddit | LocalLLaMA

Note: I'm running the latest version of llama.cpp (commit b4c0549a49be9e6dc59ac9d0a5bc21dbda910774).

My run command:

```
llama-server \
  --temp 0.6 \
  --top-p 0.95 \
  --top-k 20 \
  --presence_penalty 0.0 \
  --min-p 0.00 \
  --gpu-layers all \
  -m /home/eleung/huggingface/unsloth/Qwen3.6-27B-MTP-GGUF/Qwen3.6-27B-UD-Q4_K_XL.gguf \
  -a llama.cpp \
  --host 0.0.0.0 \
  --cache-type-k q8_0 --cache-type-v q8_0 \
  --chat-template-kwargs '{"preserve_thinking":true}' \
  --flash-attn on
```

With this command, the built-in web UI shows a context size of 137k.

After adding `--spec-type draft-mtp --spec-draft-n-max 2`, the reported context size drops to 14k. Is this normal?
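For clarity, this is the full command with MTP enabled. It is the same command as above with the two speculative-decoding flags appended, on the assumption that the first flag takes the usual leading `--`. Nothing else is changed, and no context size is set explicitly in either run.

```shell
# Same invocation as the original run, plus the two MTP flags at the end.
# Assumption: the first flag is written as --spec-type (with leading dashes).
llama-server \
  --temp 0.6 \
  --top-p 0.95 \
  --top-k 20 \
  --presence_penalty 0.0 \
  --min-p 0.00 \
  --gpu-layers all \
  -m /home/eleung/huggingface/unsloth/Qwen3.6-27B-MTP-GGUF/Qwen3.6-27B-UD-Q4_K_XL.gguf \
  -a llama.cpp \
  --host 0.0.0.0 \
  --cache-type-k q8_0 --cache-type-v q8_0 \
  --chat-template-kwargs '{"preserve_thinking":true}' \
  --flash-attn on \
  --spec-type draft-mtp --spec-draft-n-max 2
```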