Benchmarking the new b9200 update: Optimizing Qwen 3.6 27B mtp for Hermes Agent on a single RTX 3090
Posted by swizzcheezegoudaSWFA@reddit | LocalLLaMA | 8 comments
TL;DR
If you're running rigid agent frameworks locally with mtp on consumer hardware: drop your draft window to 3, lock parallel slots to 1, and build llama.cpp b9200 or newer to get your memory bandwidth back. The numbers back it up.
I've been testing the new Qwen 3.6 27B mtp gguf from Unsloth, running it as the backend for the hermes agent. While dialing it in, I noticed that the currently recommended Unsloth mtp flags actually bottleneck performance and tank draft acceptance rates for strict, multi-turn agentic workflows. Pairing a custom config with today's brand new llama.cpp b9200 release — which specifically fixes mtp memory traffic overhead — completely turns that around.
Hardware/Software
- RTX 3090 (24GB VRAM)
- Ryzen 7 5700G / 64GB
- Qwen3.6-27B-IQ4_NL.gguf
- llama-server (b9200 compiled from source)
- Hermes Agent (context capped at 64K to limit spillover)
The problem with default mtp settings
Running the standard recommended mtp flags (--spec-draft-n-max 6 and --spec-draft-p-min 0.75) gave poor results for agentic loops. Generation speeds sat around 7–8 t/s, and the mtp draft acceptance rate hovered around 22–26%.
Agent workflows are rigid. A 6-token lookahead frequently guesses the wrong punctuation, the main model rejects the draft, and the GPU throws out the math and recalculates — completely negating the mtp speed boost. Without explicitly declaring parallel slots, llama-server also defaults to 4, eating up memory bandwidth managing unused context slots.
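To see why a longer draft window can look worse when the drafts are often wrong, here is a rough back-of-the-envelope model. It assumes every drafted token is accepted independently with the same probability `p`, that everything after the first rejection is discarded, and that no p-min cutoff stops drafting early. Real acceptance is not independent per position, and the function names are made up for illustration, so treat this as a sketch and not as a description of what llama.cpp does internally.

```python
def expected_accepted(p: float, n_draft: int) -> float:
    """Expected number of accepted draft tokens per speculative step.

    Simplifying assumption: each draft token is accepted independently
    with probability p, and all tokens after the first rejection are
    thrown away. Token k survives only if tokens 1..k were all accepted.
    """
    return sum(p ** k for k in range(1, n_draft + 1))


def acceptance_rate(p: float, n_draft: int) -> float:
    """Accepted draft tokens divided by drafted tokens, same model."""
    return expected_accepted(p, n_draft) / n_draft


if __name__ == "__main__":
    # Compare a 3-token and a 6-token window at two illustrative values of p.
    for n in (3, 6):
        for p in (0.5, 0.8):
            print(
                f"n={n} p={p}: "
                f"accepted/step={expected_accepted(p, n):.2f} "
                f"rate={acceptance_rate(p, n):.0%}"
            )
```

Under this model the same per-token probability gives a lower acceptance rate at 6 than at 3, because the later draft positions are rarely reached and the work spent on them is wasted.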
The fix and the b9200 boost
For agent workflows on a 24GB card, limit to a single slot, drop the lookahead to 3, and remove the p-min threshold so it doesn't hesitate on rigid syntax. Combined with the b9200 release — which stops copying the full logits for every token in the batch during prompt processing — the optimized launch command looks like this:
    .\llama-server.exe ^
      -m D:\models\Qwen3.6-27B-IQ4_NL.gguf ^
      --spec-type draft-mtp ^
      --spec-draft-n-max 3 ^
      --ctx-size 65536 ^
      --parallel 1 ^
      --flash-attn on ^
      --cache-type-k q8_0 ^
      --cache-type-v q8_0 ^
      --port 8081
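If you would rather script the launch than keep a batch file around, the same flags can be assembled in Python. This is only a sketch: `build_llama_server_args` and `launch` are hypothetical helper names, the flags and values are copied from the command above, and the paths in the usage comment are placeholders for your own setup.

```python
import subprocess


def build_llama_server_args(
    exe: str,
    model_path: str,
    draft_n_max: int = 3,
    ctx_size: int = 65536,
    parallel: int = 1,
    port: int = 8081,
) -> list[str]:
    """Assemble the llama-server argument list used in this post.

    Note that no p-min flag is passed, matching the launch command above.
    """
    return [
        exe,
        "-m", model_path,
        "--spec-type", "draft-mtp",
        "--spec-draft-n-max", str(draft_n_max),
        "--ctx-size", str(ctx_size),
        "--parallel", str(parallel),
        "--flash-attn", "on",
        "--cache-type-k", "q8_0",
        "--cache-type-v", "q8_0",
        "--port", str(port),
    ]


def launch(args: list[str]) -> subprocess.Popen:
    """Start the server as a child process and return its handle."""
    return subprocess.Popen(args)


if __name__ == "__main__":
    # Placeholder paths: point these at your own build and model file.
    args = build_llama_server_args(
        r".\llama-server.exe",
        r"D:\models\Qwen3.6-27B-IQ4_NL.gguf",
    )
    print(" ".join(args))
    # server = launch(args)  # uncomment to actually start the server
```

Keeping the draft window, slot count, and context size as parameters makes it easy to rerun the same benchmark with 6 versus 3 and compare.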
Results
Prompt processing jumped from ~560 t/s to 991+ t/s. The b9200 memory traffic fix paired with --parallel 1 lets the 3090 tear through hermes' massive system prompts.
Token generation hit 17.06 t/s on short tasks and stabilized at ~9.5 t/s during heavy context reasoning loops where the agent is switching between tool calls and main memory.
Draft acceptance rate climbed from 26% to 77% on standard turns — a shorter lookahead is far better suited to strict formatting than the default window.
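For a quick sense of scale, the before/after figures above can be turned into ratios. The numbers are the ones reported in this post; the only assumption is using 7.5 t/s as the midpoint of the 7–8 t/s baseline for generation, and `speedup` is just an illustrative helper.

```python
def speedup(before: float, after: float) -> float:
    """Ratio of the new figure to the old one."""
    return after / before


# Assumption: midpoint of the reported 7-8 t/s baseline generation speed.
BASELINE_GEN_TPS = 7.5

# (label, before, after), all taken from the figures in this post.
reported = [
    ("prompt processing t/s", 560.0, 991.0),
    ("generation t/s, short tasks", BASELINE_GEN_TPS, 17.06),
    ("generation t/s, heavy loops", BASELINE_GEN_TPS, 9.5),
    ("draft acceptance %", 26.0, 77.0),
]

if __name__ == "__main__":
    for label, before, after in reported:
        print(f"{label}: {before} -> {after} ({speedup(before, after):.2f}x)")
```

This works out to roughly 1.8x on prompt processing and close to 3x on draft acceptance, while generation gains range from about 1.3x in heavy loops to about 2.3x on short tasks if the 7.5 t/s midpoint is a fair baseline.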
luckyj@reddit
Isn't --spec-draft-n-max 2 or 3 recommended? I feel like 6 is way too high anyways
Ok-Measurement-1575@reddit
Unsloth docs say 6, everywhere else suggests 1 - 3 max.
swizzcheezegoudaSWFA@reddit (OP)
3 max I concur...this is best
swizzcheezegoudaSWFA@reddit (OP)
Over on Hugging Face they now suggest --spec-draft-n-max 6 for llama-server. Without an explicit --parallel, llama-server also defaults to 4 slots, which eats up memory bandwidth managing unused context slots. I'm sticking with 3 as my default since it's solid on my system.
swizzcheezegoudaSWFA@reddit (OP)
Ok-Measurement-1575@reddit
Try -fit off?
swizzcheezegoudaSWFA@reddit (OP)
It might shave a second or two off the initial server boot time, simply because the engine doesn't have to pause to calculate memory projections. With --fit off you also have to manually add -ngl 999 to force the model back onto the GPU. The auto-fitter manages it nicely.
swizzcheezegoudaSWFA@reddit (OP)
I was testing just before the update dropped lol, go figure, so here it is... ----^