b9200 released - potential mtp pp increase

Posted by Bulky-Priority6824@reddit | LocalLLaMA | 5 comments

testing in progress ...

https://github.com/ggml-org/llama.cpp/releases/tag/b9200

am17an commented 13 hours ago:

Overview: Avoid copying the logits for every token in the batch when doing prompt processing (PP) for MTP, since it only requires the pre-norm. This reduces memory traffic quite a bit and in turn increases PP speed with MTP.
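To get a feel for why skipping the per-token logits copy matters, here is a rough back-of-the-envelope sketch. The vocabulary size, hidden size, and batch size below are made-up illustrative values (not read from any model or from llama.cpp), and it assumes one float32 row is kept per token either way.

```python
# Illustrative sizes only: assumptions, not values from a real model file.
N_VOCAB = 150_000   # hypothetical vocabulary size (width of one logits row)
N_EMBD = 2_048      # hypothetical hidden size (width of one pre-norm row)
BYTES_F32 = 4       # bytes per float32 value
BATCH = 2_048       # hypothetical number of prompt tokens in one batch


def bytes_copied(n_tokens: int, width: int) -> int:
    """Bytes needed to copy one float32 row of `width` values per token."""
    return n_tokens * width * BYTES_F32


logits_bytes = bytes_copied(BATCH, N_VOCAB)   # copying logits for every token
hidden_bytes = bytes_copied(BATCH, N_EMBD)    # copying only a hidden-size row

print(f"logits for every token: {logits_bytes / 1e6:.1f} MB")
print(f"hidden-size rows only:  {hidden_bytes / 1e6:.1f} MB")
print(f"ratio: {logits_bytes / hidden_bytes:.1f}x")
```

The exact savings depend on the real model dimensions and on what the build actually copies, so treat the ratio as an order-of-magnitude illustration only.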

Bulky-Priority6824@reddit (OP)

Here's a comparison table showing the improvement:

b9180 vs b9203 — Qwen3.6-35B MTP vs Base

Prompt Processing (PP t/s)

Run	Base b9180	MTP b9180	Base b9203	MTP b9203
Code Gen	324.91	182.02	242.36	199.00
Code Analysis	2399.44	1469.76	2348.03	2039.89

Token Generation (TG t/s)

Run	Base b9180	MTP b9180	Base b9203	MTP b9203
Code Gen	101.72	139.85	125.60	148.98
Code Analysis	115.23	157.20	125.79	167.67

PP Gap (Base vs MTP)

Build	Code Gen	Code Analysis
b9180	Base +78%	Base +63%
b9203	Base +22%	Base +15%

The headline is the PP gap between Base and MTP closing from 63-78% on b9180 down to 15-22% on b9203.
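For anyone who wants to rerun the gap numbers on their own results, here is a small sketch of the arithmetic. It assumes the gap is defined as Base PP divided by MTP PP, minus one; the helper names are made up for this example, and the data is copied from the PP table above.

```python
# PP t/s as (base, mtp) pairs, copied from the PP table above.
PP = {
    "b9180": {"Code Gen": (324.91, 182.02), "Code Analysis": (2399.44, 1469.76)},
    "b9203": {"Code Gen": (242.36, 199.00), "Code Analysis": (2348.03, 2039.89)},
}


def base_lead_pct(base: float, mtp: float) -> float:
    """How much faster Base PP is than MTP PP, in percent."""
    return (base / mtp - 1.0) * 100.0


def mtp_gain_pct(old: float, new: float) -> float:
    """Change in MTP PP speed from the older build to the newer one, in percent."""
    return (new / old - 1.0) * 100.0


# Base-vs-MTP gap per build and run.
gaps = {
    build: {run: base_lead_pct(base, mtp) for run, (base, mtp) in runs.items()}
    for build, runs in PP.items()
}

# MTP-only change between builds, independent of the Base numbers.
mtp_gains = {
    run: mtp_gain_pct(PP["b9180"][run][1], PP["b9203"][run][1])
    for run in PP["b9180"]
}

for build, runs in gaps.items():
    for run, pct in runs.items():
        print(f"{build} {run}: Base +{pct:.1f}%")
for run, pct in mtp_gains.items():
    print(f"MTP PP change, {run}: {pct:+.1f}%")
```

Computed this way, the gaps land within a percentage point of the table. The MTP-only change is worth checking separately, because the Base Code Gen PP figure also dropped between the two builds in this table, which narrows the gap on its own.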

CircularSeasoning@reddit

It increases pee-pee speed with empty pee? Genius.

apoptosist@reddit

I still get crashes when using vision and MTP. Anybody else?

Bulky-Priority6824@reddit (OP)

No, not anymore. Did you grab the correct mmproj for the model?

apoptosist@reddit

Yep, I let LM Studio do my downloading. The mmproj files all work fine without MTP, but they crash with MTP. There was a simple fix for the PR, but now I see the code is different.