Speculative decoding in llama.cpp for Gemma 4 31B IT / Qwen 3.5 27B?
Posted by No_Algae1753@reddit | LocalLLaMA | View on Reddit | 34 comments
Has anyone here tested speculative decoding in llama.cpp with Gemma 4 31B IT or Qwen 3.5 27B?
For Gemma, I was thinking about using a smaller same-family draft model.
For Qwen 3.5, I’m not sure if it works well at all in llama.cpp.
If you tried it, which draft model worked best and did you get a real speedup?
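For reference, a minimal llama.cpp speculative-decoding invocation looks like the sketch below. The filenames are placeholders (adjust to whatever quants you actually have); `-md`, `-ngld`, `--draft-max`, and `--draft-p-min` are the standard llama.cpp draft-model options.

```shell
# Minimal llama-server speculative-decoding sketch.
# Model filenames are placeholders, not real release names.
llama-server \
  -m  gemma-4-31b-it-Q6_K.gguf \
  -md gemma-4-e2b-it-Q6_K.gguf \
  -ngl 99 -ngld 99 \
  --draft-max 16 --draft-p-min 0.9
```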
tecneeq@reddit
I'm doing a benchmark right now on a Strix Halo. Qwen 3.5 doesn't work with llama.cpp, but Gemma 4 31B Dense shows promise.
I'll try these combinations to find the sweet spot:
klotar99@reddit
From another thread where I posted: if you have the VRAM, the 26B A4B is a better spec drafter for me, since the active params are similar (UD Q2 gives 80-95% acceptance). I can get a Strix Halo as high as 26 tok/s on llama.cpp (in chat).
tecneeq@reddit
I have these results so far. Clearly MoE doesn't benefit from draft models:
https://docs.google.com/spreadsheets/d/1NzZC4JShGluwH2fdjlMbZ2ke99AcTctUnM7rG12_cYE/edit?usp=sharing
I'm benchmarking the MoE as a draft model for the 31B dense as we speak, but it takes forever ;-).
nicholas_the_furious@reddit
I think he is saying using the MoE in place of the E2B/E4B.
tecneeq@reddit
I'm also trying every unsloth quant below Q4 this time. Maybe smaller quants give better results?
finevelyn@reddit
I have 5090+5070ti and am using Gemma 4 E2B Q6 as the draft model for 31B Q6. The draft model is on the 5090 and the main model is split on both (using llama.cpp with --fit-target 7400,200 --device-draft CUDA0). The 7400 fit target for the 5090 is required to ensure enough space is left for the draft model, because --fit doesn't seem to take it into account.
Draft model parameters: --draft-min 0 --draft-max 16 --draft-p-min 0.9 (min and max same as default, p-min up from 0.75 to 0.9).
For pure coding tasks the speed-up is from 45tk/s to 60-90tk/s. On other tasks there is a slight speed-up but nothing dramatic.
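Assembled into one invocation, that setup would look roughly like this. The model filenames are placeholders; the flags and values are the ones quoted in the comment above (draft model pinned to the 5090, `--fit-target` leaving headroom for it).

```shell
# Sketch of the dual-GPU setup described above.
# Filenames are placeholders; flag values quoted from the comment.
llama-server \
  -m  gemma-4-31b-it-Q6_K.gguf \
  -md gemma-4-e2b-it-Q6_K.gguf \
  --fit-target 7400,200 \
  --device-draft CUDA0 \
  --draft-min 0 --draft-max 16 --draft-p-min 0.9
```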
Adventurous-Paper566@reddit
That's very impressive. How much context can you manage with this setup?
finevelyn@reddit
128k context when main model KV cache is unquantized and draft model cache is Q8. I assume if I set both to Q8, I could use the full context size, but haven't had any reason to try yet.
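If I read that right, the cache split would be expressed with the per-model cache-type flags, along the lines of the sketch below. I'm assuming a recent llama.cpp build that has the draft-cache variants (`-ctkd`/`-ctvd`); filenames are placeholders, and the main model's KV cache is left at the f16 default.

```shell
# Sketch: main KV cache unquantized (default), draft KV cache at Q8.
# Assumes a llama.cpp build with the -ctkd/-ctvd draft-cache flags.
llama-server \
  -m  gemma-4-31b-it-Q6_K.gguf \
  -md gemma-4-e2b-it-Q6_K.gguf \
  -c 131072 \
  -ctkd q8_0 -ctvd q8_0
```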
oxygen_addiction@reddit
Have you played around with Qwen 27B on the same setup?
finevelyn@reddit
Last I read, speculative decoding is currently broken in llama.cpp for Qwen3.5, so I didn't try it. But for the record, the 27B Q6 fits with 128k context (unquantized) on the 5090 alone and runs at around 55tk/s, or split across both cards for max context at 45tk/s. Qwen3.5 0.8B or 2B will probably work nicely as a draft model when they fix it.
https://github.com/ggml-org/llama.cpp/issues/20039
GrungeWerX@reddit
What are your settings for Q6? On a 3090 Ti, Q6 is slow at 100K ctx, but better than Q5, so I'd like to use it more... on LM Studio. Llama.cpp was slower.
oxygen_addiction@reddit
vLLM supports MTP with Qwen3.5, and now you can also try DFlash, as it supports the 27B and should give a big boost.
StardockEngineer@reddit
I get 87 tok/s with your draft setup vs 48 without it! Awesome.
I'm using
llama-bench -hf unsloth/gemma-4-31B-it-GGUF:Q6_K_XL -devd CUDA1 --draft 5 -hfd unsloth/gemma-4-E2B-it-GGUF:Q6_K --no-mmproj
StardockEngineer@reddit
Is there a flag I'm missing? I get an error that drafting is not supported with multimodal. There must be a flag to do text only you're using?
Acceptable_Home_@reddit
Anything about the MoEs of the gemma4 and qwen3.5 families?
cunasmoker69420@reddit
does it speed up prompt processing at all?
finevelyn@reddit
No. As far as I understand, it has to process the prompt independently on both models, so if anything it makes it slower, but I'd say it's not a noticeable difference.
ethertype@reddit
Inspired by this thread, this is what I have ATM:
Confident_Ideal_5385@reddit
Speculation with qwen3.5 is harder because 3/4 of the attention is gated deltanet, which means you need to commit the recurrent state tensors to a snapshot every time you begin speculating, and roll back to that snapshot whenever you have a draft miss, and the machinery to do this in llama.cpp seems - imperfect, perhaps, based on my testing.
This state is on the order of hundreds of MB, so it's not huge, but even manually snapshotting this stuff to host RAM (via llama_memory_recurrent's save/copy state) seems to miss something, so there'd need to be a bit of work on the ggml side, I figure (especially if you want to store the rollback tensor state in a VRAM buffer to avoid a host copy on every draft miss). Apparently llama-server manages this with its built-in checkpointing, but I haven't had a good look at how.
That said, qwen3.5 apparently has some MTP heads baked into the models, so there's that.
Context -- I've been fucking around trying to get R/S tensor snapshotting working at arbitrary n_past in order to roll back state after doing e.g. JIT RAG and it's broken enough that I rolled back to qwen3, which is pure KV-cache style attention and works fine.
FinBenton@reddit
I tried the 4B Q4 as a draft model for the 31B Q5, but I didn't really get much benefit; some prompts got +10% speed, but most of the time it was around the same speed.
AppealSame4367@reddit
I'm thinking: Turboquants + DFlash + a gemma 4 dflash model in around 4 weeks in llama.cpp. That would be amazing.
Acceptable_Home_@reddit
Holy
xandep@reddit
I was thinking the same. Now imagine that with 1bit quants like bonsai.
AppealSame4367@reddit
Yes, that would suddenly bring huge models to the smallest laptops. Can't wait.
Just found out about the newer ngram params in llama-server today while trying to get sglang to run with a byteshape Qwen 4B (it didn't; I only have an RTX 2060, which doesn't support Flash Attention 3 and CUDA graphs, which it needed to run in sglang).
The ngram param is "free" and gives a 25% speedup for some models. At least Gemma E4B ran at 50 tps instead of 40 tps on this shitty laptop of mine. Prompt caching isn't really enabled by default either, something else I found out; you should give it a bigger value (512 is already said to be "large", 4092 is just an experiment).
--cache-ram 4092 \
--spec-type ngram-mod \
--spec-ngram-size-n 16 \
--draft-max 12 \
Try to find the right values yourself; I just asked Gemini (free) and they were good enough.
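Put together as a complete call, those flags would look something like the sketch below. The model path is a placeholder; the flag values are the ones quoted above, and ngram speculation needs no separate draft model.

```shell
# Sketch: ngram-based speculation (no draft model required).
# Model filename is a placeholder; flag values quoted from the comment.
llama-server \
  -m gemma-e4b-Q4_K_M.gguf \
  --cache-ram 4092 \
  --spec-type ngram-mod \
  --spec-ngram-size-n 16 \
  --draft-max 12
```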
Pomegranate-and-VMs@reddit
I have a draft/target model setup. My target model is split between two Tesla M40s, with a Quadro RTX 4000 running the draft.
Qwen3-0.6b draft model. Qwen3-30b-a3b-q4 target model.
I average around a 70% acceptance rate and 35 t/s.
When running Kilo code, this looks like 20-25t/s output on a 128k context window.
Currently working on some KV cache ideas to speed up prefill. The VRAM bandwidth on the M40s is not great, at 288 GB/s.
No_Algae1753@reddit (OP)
Just a general question: why do you still use Qwen3?
Pomegranate-and-VMs@reddit
I've been using the 30b-a3b on a single M40 since it came out. For the testing I was doing with this, I wanted to keep the benchmarks from before.
I'll get around to qwen3.5:27b next. Curious to see what could happen with Gemma4 as well.
For the last couple of months, I’ve been more focused on the KV cache and prefill speeds.
colin_colout@reddit
damn... how does this compare to no draft model?
I tried this a while back with these models, but I found that the 0.6B had way too many misses, and the 30B is already fast. What is your config?
Pomegranate-and-VMs@reddit
With these older cards, I was getting something like 20-25t/s on a 4K context. It’s been my test server in the homelab. I went this way to have a mental reference point with no drafting. Need to get around to moving over to qwen3.5:27b.
I think it’s felt worthwhile for maintaining some speed in a bigger context. But again, it’s been more of a tinkering project to see how much blood I can squeeze out of these turnips. 😂
If y'all got other ideas, happy to try. I’ll share my config when I get home this evening.
xeeff@reddit
yeah running 0.6b for a model with 3b compute is crazy
xanduonc@reddit
Quantized gemma4-e2b works well as a draft for gemma4-31b; the speedup depends on the task, up to ~2x. Too bad llama.cpp does not support speculative decoding with mmproj multimodal processing enabled.
No_Algae1753@reddit (OP)
I'll try that out.
pjsgsy@reddit
I believe it is broken (currently) for Qwen 3.5, though you can use the less effective ngram-mod (no draft model needed). There are a few PRs that hope to fix it. Hopefully, they will get to it.
Adventurous-Paper566@reddit
I don't know, but I think it will be easier with the Qwen models, which are all similar, because the different sizes of Gemma 4 models give very different outputs.