Single 3090 with Q4 Qwen 27B, context dropped from 137k to 14k with MTP enabled. Is it normal?

Posted by regunakyle@reddit | LocalLLaMA | 38 comments

Note: Latest version of llama.cpp (b4c0549a49be9e6dc59ac9d0a5bc21dbda910774)

My run command:

```
llama-server \
  --temp 0.6 \
  --top-p 0.95 \
  --top-k 20 \
  --presence_penalty 0.0 \
  --min-p 0.00 \
  --gpu-layers all \
  -m /home/eleung/huggingface/unsloth/Qwen3.6-27B-MTP-GGUF/Qwen3.6-27B-UD-Q4_K_XL.gguf \
  -a llama.cpp \
  --host 0.0.0.0 \
  --cache-type-k q8_0 --cache-type-v q8_0 \
  --chat-template-kwargs '{"preserve_thinking":true}' \
  --flash-attn on
```

The built-in web UI shows that the context size is 137k.

After adding `--spec-type draft-mtp --spec-draft-n-max 2`, the reported context size drops to 14k. Is this normal?

[-]

Similar-Ad5933@reddit

Could it be that you need to set -np 1 so there won't be so many slots?

[-]

Same cliff on my end with Qwopus3.6-27B-v2-MTP on a 5090: context collapsed the moment I enabled --spec-type draft-mtp --spec-draft-n-max 2 with Q8_0 KV. Two knobs that actually moved the needle on my llama-swap setup:

1. --spec-draft-type-k q4_0 --spec-draft-type-v q4_0. Dexamph already mentioned it, but it's worth a second voice: the MTP draft KV is the killer, not the merged weights, and quantising it separately cost me nothing measurable on accept rate for coding traffic.
2. Cap --spec-draft-n-max at 1. Past depth 1, the marginal accept rate on my workload didn't justify the extra KV. The throughput gain from n=2 vs n=1 is real but not 2×, and on a 24GB card the VRAM math stops working long before the speedup does.

One thing worth flagging: on older llama.cpp builds, --flash-attn on plus spec-draft KV quant on Blackwell silently corrupted draft logits; accept rate looked normal, but outputs were garbage. Current b4c0xxx is fine. What build are you on, and what does the accept rate look like at n=2 before the OOM kicks in?
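
A minimal sketch of how the two knobs above could be combined into one argument string. It only prints the flags and does not launch anything. The helper name `mtp_args` is made up for illustration; the flags themselves are the ones quoted in this thread, so check `llama-server --help` on your build before relying on them.

```shell
# Sketch only: print the MTP-related flags discussed in this thread.
# mtp_args is a hypothetical helper, not part of llama.cpp.
mtp_args() {
    # $1 = max draft depth, $2 = KV cache type for the draft (e.g. q4_0)
    printf '%s %s %s %s ' --spec-type draft-mtp --spec-draft-n-max "$1"
    printf '%s %s %s %s\n' --spec-draft-type-k "$2" --spec-draft-type-v "$2"
}

# Depth 1 with a q4_0 draft KV cache, as suggested above.
mtp_args 1 q4_0
```

The output can be appended to an existing command, for example `llama-server -m model.gguf $(mtp_args 1 q4_0)`, where `model.gguf` is a placeholder path.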

[-]

lacerating_aura@reddit

As far as I understand, MTP is not so different from a classic draft model: you still have model weights, just merged into the base GGUF, and there is a draft context. Both take extra VRAM.

[-]

regunakyle@reddit (OP)

Fair, I just didn't expect it to shrink to 1/10 of the original.

[-]

Hefty-Elk-7435@reddit

I'm doing fine on a 3090 with 3.6-27b-Q4_K_M with MTP, using an 8-bit KV cache --> 130k context

I'm also using GGML_CUDA_ENABLE_UNIFIED_MEMORY=1 so that I can load the mmproj at the same time and still have 130k context. The mmproj gets swapped out into system RAM when not in use.

(you can't use unified memory to extend context length beyond what would fit into the VRAM normally or t/s will go through the floor... but you can use it to store the mmproj because it's only used occasionally)

[-]

lacerating_aura@reddit

Yeah, it was a surprise for me too the first time I tried MTP. My context went from 211k to around 40k max for Qwen 3.5 122B IQ4_XS. Model quantization most probably affects this too, since a larger quant means the individual layers of the model, including the MTP ones I suppose, are bigger and need more VRAM, leaving less for context. I haven't tested since then, so I'm not up to date on the latest optimizations.

[-]

jtjstock@reddit

There are still lots of bugs with MTP. They seem to be working through them, so check back periodically; this should be fixed sooner or later. The MTP heads are tiny and should have a very small KV cache.

[-]

Dexamph@reddit

Try quantising the MTP model KV cache to Q4_0 with --spec-draft-type-k q4_0 and --spec-draft-type-v q4_0; it didn't seem to impact draft acceptance. Also look for quantised MTP model weights, unless Unsloth already quantised them at the same level as the main model. I saved nearly 4GB when I did that for Qwen 3.5 397B here, but idk how much that shaves off for 27B.

[-]

regunakyle@reddit (OP)

Thanks for your answer! I will try them later after work

[-]

marking89@reddit

Also, try lowering your --batch-size to 1024 or even 512, and your --ubatch-size to 512. It might only cost you a minimal PP (prompt processing) decrease, but it can free a significant amount of context space, especially combined with a KV setup like q5_1/q4_1 (very close to q8, see https://anbeeld.com/articles/kv-cache-quantization-benchmarks-for-long-context). I have no trouble with Q4_K_M and these settings while running Claude and Copilot agentic workflows.
Try llama-bench to see the exact effects. For me, the batch size decrease only lowered my PP speed by 1-5%, which is worth it for the serious context size bump.

If you don't use your GPU as the display output, also make sure to use --fit-target 64 (the default leaves 1024 MB, but I've had no trouble with 64 MB without a display connected).

Also, if you don't need image support, add '--no-mmproj' to save even more VRAM.

On my AMD 7900 XTX 24 GB with Vulkan, I get 170K context with two parallel slots and MTP at 2. NVIDIA & CUDA are different of course, but you should be able to get close to that?
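
To measure the prompt-processing cost of smaller batches, one option is a llama-bench sweep. The sketch below only prints the command. `bench_cmd` and the model path are placeholders made up for illustration. The 2048-token prompt length is an arbitrary choice, and `-n 0` skips the generation test so only prompt processing is timed. Note that llama-bench reports speed, not VRAM, so the context gain still has to be checked separately in llama-server.

```shell
# Sketch only: print (not run) a llama-bench sweep over batch sizes.
# MODEL is a placeholder; point it at your own GGUF file.
MODEL="${MODEL:-/path/to/model.gguf}"

bench_cmd() {
    # $1 = comma-separated batch sizes, $2 = ubatch size
    printf 'llama-bench -m %s -p 2048 -n 0 -b %s -ub %s\n' "$MODEL" "$1" "$2"
}

# Compare the default-sized batch against the two smaller ones suggested above.
bench_cmd 2048,1024,512 512
```

Copy the printed line into a terminal to run it, then compare the pp t/s column across the three batch sizes.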

[-]

robertpro01@reddit

Search for club-3090; they managed to get 41k ctx, which is still too low for me.

[-]

tomByrer@reddit

You missed my comment; they seemed to hit 183k context, or am I reading it wrong? https://github.com/noonghunna/club-3090/discussions/184

[-]

robertpro01@reddit

Yeah, for some reason, I didn't see any comment at all

[-]

game_difficulty@reddit

Consider offloading the mmproj to CPU; it helped me a bunch (24GB 7900 XTX, same model).

[-]

regunakyle@reddit (OP)

I don't use mmproj; you can see that from my command.

[-]

game_difficulty@reddit

It seemed to me (from testing) like the mmproj is baked into the GGUF. I got a big boost to context by using --no-mmproj, even though I changed nothing else. Went from barely fitting ud-q4 @64k to barely fitting ud-q5 @64k...

[-]

regunakyle@reddit (OP)

at least for unsloth they should be separate files

[-]

ea_man@reddit

Yeah, I guess previously the --fit* calcs didn't count the extra KV and compute properly.

You can reduce some with n=1

[-]

Pristine-Woodpecker@reddit

You were downvoted but that is exactly it. This was fixed in llama.cpp so now memory usage seems larger but before you would risk crashing due to OOM.

[-]

ea_man@reddit

Yeah, the point is that ~2 days ago this was not reported or calculated in llama.cpp. You could use --fit, start at your estimated ~130k, and at ~20k run out of VRAM because of the MTP baggage, with a performance dive or worse.

It didn't even account for the difference between the various n values for MTP:

--spec-draft-n-max 1
--spec-draft-n-max 1-5

[-]

regunakyle@reddit (OP)

you mean adding -n (as in --predict)?

[-]

ea_man@reddit

Yep, limit yourself to
--spec-type draft-mtp --spec-draft-p-min 0.75 --spec-draft-n-max 1

That means less KV cache for the MTP head. Yet if that's where you are, you should probably ditch it completely and maybe use the simple

--spec-type ngram-mod
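
One way to switch between the two setups above while testing is a small selector. This is a sketch: `spec_args` and its mode names (`mtp`, `ngram`, `none`) are invented for illustration, and only the flags come from this thread. It prints flags and does not start llama-server.

```shell
# Sketch only: map a hypothetical mode name to the flags quoted above.
spec_args() {
    case "$1" in
        mtp)   printf '%s\n' '--spec-type draft-mtp --spec-draft-p-min 0.75 --spec-draft-n-max 1' ;;
        ngram) printf '%s\n' '--spec-type ngram-mod' ;;
        none)  printf '\n' ;;  # no speculative decoding flags at all
        *)     printf 'unknown mode: %s\n' "$1" >&2; return 1 ;;
    esac
}

spec_args mtp
spec_args ngram
```

Running the same prompts under each mode and comparing reported context size and t/s should show whether depth-1 MTP is still worth its VRAM on a given card.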

[-]

Solary_Kryptic@reddit

Yeah, if your prompts call for repetitive outputs, like coding, ngram-mod is unbeatable

[-]

Poha_Best_Breakfast@reddit

You can get 180k ctx with iQ4_XS quant. Q4_K_XL will be a stretch with MTP.

[-]

regunakyle@reddit (OP)

Did you enable MTP for iQ4_XS? If yes, what is the final context size?

[-]

Poha_Best_Breakfast@reddit

180k is with MTP. Without MTP it does like 250k IIRC.

[-]

regunakyle@reddit (OP)

Damn, that's great. Gonna take a look later

[-]

Gesha24@reddit

Doesn't it warn you that fit doesn't work properly with MTP?

[-]

regunakyle@reddit (OP)

I don't see it in the command output

[-]

Gesha24@reddit

It's somewhere in the long output. Regardless, the fit option did not work for me at all. I did have to manually set the context size and offload size to get it to load properly. Had to experiment a bit to find the right size of the context too.
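
The manual experimenting described above can be sketched as a step-down search. Everything here is an assumption for illustration: `probe` is a stand-in that pretends 60000 tokens is the limit so the sketch runs without a GPU. In practice it would be replaced with something that starts llama-server with `-c "$1"` and returns 0 only if the model loads and answers a request.

```shell
# Sketch only: step the context size down until a probe succeeds.
# probe is a hypothetical hook; the 60000 limit is an arbitrary stand-in.
probe() {
    [ "$1" -le 60000 ]
}

find_ctx() {
    # $1 = starting context, $2 = step, $3 = lower bound
    ctx=$1
    while [ "$ctx" -ge "$3" ]; do
        if probe "$ctx"; then
            echo "$ctx"
            return 0
        fi
        ctx=$((ctx - $2))
    done
    return 1  # nothing fit between the start and the lower bound
}

find_ctx 131072 8192 8192
```

A coarse step finds the rough limit quickly; rerunning with a smaller step from the last failing value narrows it down.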

[-]

regunakyle@reddit (OP)

What is your final context size after manual configuration?

[-]

jonas-reddit@reddit

I can’t squeeze out more than around 60k myself with MTP.

I looked at this table to find suitable quantization for kv cache. I picked q8 and q5_1.

https://github.com/ggml-org/llama.cpp/pull/21038#issuecomment-4146397570

[-]

Mameiro@reddit

Yes bro, this can happen. MTP/speculative decoding adds extra memory overhead, and with a 27B Q4 model on a 24GB 3090 you’re already near the limit. Since KV cache scales with context length, context is usually the first thing that gets cut. I’d compare VRAM usage with MTP off vs on, then tune context manually. 137k context on a single 3090 with a 27B model sounds very optimistic anyway, so 14k with MTP enabled may just be the realistic memory limit.
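
To make the "KV cache scales with context length" point concrete, here is a rough estimate in shell arithmetic. The layer count, KV head count and head dimension are made-up placeholders, not the real values for any Qwen model; read the real ones from the GGUF metadata or the llama-server startup log. The sketch also assumes K and V use the same cache type and ignores compute buffers and any draft or MTP context. The block sizes are the standard ggml ones: 32 values take 64 bytes in f16, 34 bytes in q8_0 and 18 bytes in q4_0.

```shell
# Sketch only: estimate KV cache size in MiB for a given cache type and context.
# These three dimensions are hypothetical placeholders.
N_KV_LAYERS="${N_KV_LAYERS:-16}"
N_KV_HEADS="${N_KV_HEADS:-4}"
HEAD_DIM="${HEAD_DIM:-256}"

kv_mib() {
    # $1 = cache type (f16, q8_0, q4_0), $2 = context length in tokens
    case "$1" in
        f16)  block_bytes=64 ;;  # 32 values x 2 bytes
        q8_0) block_bytes=34 ;;  # 32 int8 values + one fp16 scale
        q4_0) block_bytes=18 ;;  # 32 4-bit values + one fp16 scale
        *) echo "unknown cache type: $1" >&2; return 1 ;;
    esac
    # Factor 2 covers both K and V.
    elems_per_token=$((2 * N_KV_LAYERS * N_KV_HEADS * HEAD_DIM))
    echo $((elems_per_token * block_bytes / 32 * $2 / 1048576))
}

kv_mib q8_0 131072
kv_mib q4_0 131072
```

Because the estimate is linear in the context length, every GB taken by extra weights or a draft cache removes a fixed number of tokens from whatever context would otherwise fit.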

[-]

cleversmoke@reddit

Expect about 40-50k context reduction if using q8_0 KV cache, due to the 2-3GB grafted draft model. Highly recommended to lower checkpoints to 16, else it'll likely run into OOM with 32 checkpoints.

[-]

regunakyle@reddit (OP)

My context was reduced by 90%, much larger than your estimate, so I thought maybe something is wrong

[-]

tomByrer@reddit

Try digging in here: https://github.com/noonghunna/club-3090/discussions/184

& the other comments are only partly correct; some MTP implementations use a 2nd model, others don't. (so much to learn...)

[-]

FoxiPanda@reddit

Yes, MTP uses more VRAM because it loads another whole small draft model into VRAM (even if it's already grafted into the main GGUF file). So you were already VRAM limited and adding another ~2-3GB model onto that reduced that headroom even further.