How to configure self-speculative decoding properly

Posted by milpster@reddit | LocalLLaMA | View on Reddit | 6 comments

So now that we have self-speculative decoding in Qwen 3.6 on llama.cpp, I was wondering if anyone had any advice about configuring it properly.

6 Comments

srigi@reddit

I gave my llama-server to GPT-5.4 with a bunch of links (GitHub PR, server's README.md) to analyze. Here is what I landed on (llama-server router mode):

```ini
[*]
ubatch-size = 2048
cache-type-k = q8_0
cache-type-v = q8_0
ctx-checkpoints = 4
flash-attn = on
fit = off
n-gpu-layers = 99
no-mmproj-offload = true ; disable GPU offloading for multimodal projector
parallel = 1

[unsloth/qwen3.6-35B_q5]
model = M:\unsloth\Qwen3.6-35B-A3B-UD-Q5_K_S.gguf
mmproj = M:\unsloth\Qwen3.6-35B-A3B.mmproj-F16.gguf
chat-template-kwargs = { "preserve_thinking": true }
cache-reuse = 128
ctx-size = 163840 ; 160k
n-cpu-moe = 9
no-mmap = true
draft-min = 48
draft-max = 64
spec-type = ngram-mod
spec-ngram-size-n = 24
temp = 0.75
top-k = 20
min-p = 0
```

The n-gram size increase (default: 12) was suggested by llama-server, draft-min/max by GPT. Note that I disabled `fit`; I'm tuning the GPU/CPU ratio manually with `n-cpu-moe`. I found that fit was leaving about 1GB of VRAM unused.

lum4chi@reddit

Use `--fit-target` to reduce the unused VRAM but beware of memory spikes, they will trigger OOM if you are too greedy...

srigi@reddit

Thank you and u/Jester14. I can see in the server's README that the `--fit-target` default value is 1024, so I can simplify my config with a proper `--fit` and `--fit-target`.

Jester14@reddit

Using `-fit` indeed reserves exactly 1024MB by default.

qubridInc@reddit

Nice feature, but easy to overdo. Start conservative (small draft length/steps), benchmark tokens/sec vs. quality, and slowly tune until you hit speed gains without hurting output.
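A minimal sketch of that "benchmark, then tune" loop, assuming you have already timed a few generation runs yourself. The `Run` record and both function names are hypothetical helpers made up for illustration; nothing here comes from llama.cpp.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Run:
    """One measured generation run (hypothetical record, filled in by hand)."""
    draft_max: int    # draft-max value used for this run (0 = no speculation)
    tokens: int       # number of tokens generated
    seconds: float    # wall-clock generation time


def tokens_per_second(run: Run) -> float:
    """Throughput of a single run."""
    if run.seconds <= 0:
        raise ValueError("seconds must be positive")
    return run.tokens / run.seconds


def best_setting(baseline: Run, candidates: list[Run]) -> Run:
    """Return the fastest candidate, or the baseline if nothing beats it."""
    best = baseline
    for run in candidates:
        if tokens_per_second(run) > tokens_per_second(best):
            best = run
    return best
```

Feed it a baseline run without speculation plus one run per draft setting you tried, all on the same prompt. If `best_setting` hands back the baseline, the draft settings are probably too aggressive for your workload.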

Objective-Stranger99@reddit

Use ngram-mod for the type; it's best for chatting and coding. Start with the defaults in the llama.cpp docs (there is a file named speculative.md) and tinker with them until token generation speed peaks. Also, check the acceptance rate in the llama.cpp logs and change the parameters if it is below 0.50.
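The acceptance-rate check can be sketched like this, assuming you read the drafted and accepted token counts off the logs by hand. The function names are hypothetical and no particular log format is assumed; only the 0.50 threshold comes from the advice above.

```python
def acceptance_rate(accepted: int, drafted: int) -> float:
    """Fraction of drafted tokens that were accepted."""
    if accepted < 0 or drafted < 0 or accepted > drafted:
        raise ValueError("need 0 <= accepted <= drafted")
    if drafted == 0:
        return 0.0  # nothing was drafted, so nothing was accepted
    return accepted / drafted


def needs_retuning(accepted: int, drafted: int, threshold: float = 0.50) -> bool:
    """True when the acceptance rate falls below the threshold."""
    return acceptance_rate(accepted, drafted) < threshold
```

For example, `needs_retuning(10, 60)` is true (a rate of about 0.17), which would be a hint to go back and change the draft parameters.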