Qwen-3.6-27B, llamacpp, speculative decoding - appreciation post
Posted by Then-Topic8766@reddit | LocalLLaMA | 20 comments
First, a little explanation of what is happening in the pictures.
I ran a small experiment to determine how much of a speedup speculative decoding brings to the new Qwen (TL;DR: a big one!).
- The first image shows my simple prompt at the beginning of the session.
- The second image shows the time and token generation speed (13.60 t/s) for the first version of the program, plus my prompt asking for a new feature.
- The third image shows the time and token generation speed for the second version of the program (25.53 t/s - you can notice an improvement). You can also see there was a bug: I showed Qwen a screenshot with the browser console open, and it correctly identified the bug and fixed it.
- The fourth image shows the time and token generation speed for the fixed version of the program (68.35 t/s - a big improvement), plus my prompt asking for a small change to the program.
- The fifth image shows the time and token generation speed for the final version of the program after that small change (136.75 t/s !!!)
The last image shows the finished, beautiful aquarium. The aesthetics and functionality are on another level compared with older models of similar size, and with many much bigger ones.
So the speed goes 13.60 > 25.53 > 68.35 > 136.75 t/s over the session. Every time, Qwen delivered the full code. I use this kind of workflow very often. And all of this thanks to one simple line in the llama-server command:
'--spec-type ngram-mod --spec-ngram-size-n 24 --draft-min 12 --draft-max 48'.
I am not sure this is the best setting, but it works well for me. I will play with it more.
My llama-swap command:
${llama-server}
-m ${models}/Qwen3.6-27B/Qwen3.6-27B-Q8_0.gguf
--mmproj ${models}/Qwen3.6-27B/mmproj-BF16Qwen3.6-27B.gguf
--no-mmproj-offload
--spec-type ngram-mod --spec-ngram-size-n 24 --draft-min 12 --draft-max 48
--ctx-size 128000
--temp 1.0
--top-p 0.95
--top-k 20
--presence-penalty 1.5
--chat-template-kwargs '{"preserve_thinking": true}'
My linux PC has 40GB VRAM (rtx3090 and rtx4060ti) and 128GB DDR5 RAM.
Big thanks to all smart people who contribute to llamacpp, to this Reddit community and to the Qwen crew.
Free lunch, try it out...
EatTFM@reddit
do you need --no-mmproj-offload for spec decoding to work or does it just save some vram?
Then-Topic8766@reddit (OP)
Just to save some VRAM. I do not upload images that often, VRAM is precious, and the mmproj works very well on the CPU.
EatTFM@reddit
true! thank you
Then-Topic8766@reddit (OP)
I forgot to mention some changes in llama.cpp from two days ago. So try to update.
xornullvoid@reddit
Which model did you use for draft?
Then-Topic8766@reddit (OP)
No draft model is needed for Qwen here. The n-gram mode drafts tokens from the context itself instead of using a separate model. I use a smaller draft model for Gemma and some other models.
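To make the "no draft model" part concrete, here is a toy sketch of the n-gram (prompt-lookup) drafting idea, not llama.cpp's actual implementation: when the most recent tokens also appear earlier in the context, the tokens that followed them there are proposed as a cheap draft.

```python
# Illustrative sketch of n-gram (prompt-lookup) drafting.
# NOT llama.cpp's real code; token "strings" stand in for token IDs.

def ngram_draft(tokens, ngram_size=3, max_draft=8):
    """Return up to max_draft tokens guessed from a repeated n-gram."""
    if len(tokens) < ngram_size:
        return []
    tail = tokens[-ngram_size:]
    # Scan history (excluding the tail itself) for the most recent match.
    for start in range(len(tokens) - ngram_size - 1, -1, -1):
        if tokens[start:start + ngram_size] == tail:
            # Propose whatever followed that earlier occurrence.
            return tokens[start + ngram_size:start + ngram_size + max_draft]
    return []

# "( x" appeared before, so after seeing "( x" again we draft ") y =".
history = ["print", "(", "x", ")", "y", "=", "1", "print", "(", "x"]
print(ngram_draft(history, ngram_size=2, max_draft=3))  # -> [')', 'y', '=']
```

This is why it works so well on iterative coding sessions: each new version of the program repeats long stretches of the previous one, so the drafts are often long and correct.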
Dr4x_@reddit
How does this work? And how can you verify that no accuracy is lost in the process?
Then-Topic8766@reddit (OP)
I am by no means an expert, but you can google or ask your LLM about 'ngram and speculative decoding'. For me it is like magic...
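On the accuracy question: speculative decoding is designed to be lossless, because the target model still verifies every drafted token (checking a whole draft in one forward pass is where the speedup comes from). A toy greedy-decoding sketch, with a stand-in rule instead of a real model:

```python
# Why drafts change speed but not output (greedy case, illustrative).

def target_next(tokens):
    # Stand-in for the big model's greedy next-token choice;
    # a toy rule so the demo is self-contained.
    return tokens[-1] + 1

def verify(tokens, draft):
    """Accept draft tokens only while they match the target's own choice."""
    accepted = []
    for d in draft:
        t = target_next(tokens + accepted)
        if d == t:
            accepted.append(d)   # target agreed: token accepted "for free"
        else:
            accepted.append(t)   # mismatch: keep the target's token, stop
            break
    return accepted

print(verify([1, 2, 3], [4, 5, 9, 7]))  # -> [4, 5, 6] (9 rejected, 6 kept)
```

A bad draft only costs speed: rejected tokens are replaced by what the target would have produced anyway. (With sampling instead of greedy decoding, an acceptance rule preserves the target's output distribution rather than its exact tokens.)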
Dr4x_@reddit
Ok 😅 I'll do that
realmosai@reddit
Ah, that's new information. I will try it out, thanks for pointing it out.
charmander_cha@reddit
I think you should title your post something like: increasing tok/s using ngram
Then-Topic8766@reddit (OP)
I agree, but I cannot change the title after posting. Ngram is important.
DerDave@reddit
Is this using the draft models that are supposedly built into the newer qwen models? Cool to see this works in llama.cpp now.
A more than 10x improvement can't be from speculative decoding alone. The other fixes must have a big influence.
realmosai@reddit
I think the author is using ngram, not a draft model.
Puzzleheaded-Drama-8@reddit
I ran a test on my 7900XTX (Vulkan) with exactly your params and qwen3.6-27B-q4_k_m. I asked it to generate a simple HTML calculator and then make a few edits, each time outputting the full code. Generation speed stayed within 35-36 t/s the whole time. Is this a CUDA-only thing? It prints some acceptance rates in the logs, so I'd think it does use drafting.
Then-Topic8766@reddit (OP)
I forgot to mention some changes in llama.cpp from two days ago, so try to update. I'm not sure if it is CUDA-only.
nunodonato@reddit
I haven't seen any speed difference with or without spec decoding. Might be my use case.
Then-Topic8766@reddit (OP)
I forgot to mention some changes in llama.cpp from two days ago. So try to update.
Then-Topic8766@reddit (OP)
It is most effective when there are a lot of repeating patterns (like small changes to code, or summarizing and making new versions of some text...).
mouseofcatofschrodi@reddit
Is it possible to get something like this on MLX?