Speculative decoding in llama.cpp for Gemma 4 31B IT / Qwen 3.5 27B?
Posted by No_Algae1753@reddit | LocalLLaMA | View on Reddit | 34 comments
Has anyone here tested speculative decoding in llama.cpp with Gemma 4 31B IT or Qwen 3.5 27B?
For Gemma, I was thinking about using a smaller same-family draft model.
For Qwen 3.5, I’m not sure if it works well at all in llama.cpp.
If you tried it, which draft model worked best and did you get a real speedup?
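For reference, a minimal llama.cpp speculative-decoding invocation looks like the sketch below. The filenames are placeholders (adjust to whatever quants you actually have); `-md`, `-ngld`, `--draft-max`, and `--draft-p-min` are the standard llama.cpp draft-model options.

```shell
# Minimal llama-server speculative-decoding sketch.
# Model filenames are placeholders, not real release names.
llama-server \
  -m  gemma-4-31b-it-Q6_K.gguf \
  -md gemma-4-e2b-it-Q6_K.gguf \
  -ngl 99 -ngld 99 \
  --draft-max 16 --draft-p-min 0.9
```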
tecneeq@reddit
I'm doing a benchmark right now on a Strix Halo. Qwen 3.5 doesn't work with llama.cpp, but Gemma 4 31B Dense shows promise.
I'll try these combinations to find the sweet spot:
klotar99@reddit
From another thread where I posted: if you have the VRAM, the 26B A4B is a better spec drafter for me, since the active params are similar (UD Q2 gives 80-95% acceptance). I can get a Strix Halo as high as 26 tok/s on llama.cpp (in chat).
tecneeq@reddit
I have these results so far. Clearly MoE doesn't benefit from draft models:
https://docs.google.com/spreadsheets/d/1NzZC4JShGluwH2fdjlMbZ2ke99AcTctUnM7rG12_cYE/edit?usp=sharing
I'm benchmarking the MoE as a draft model for the 31B dense as we speak, but it takes forever ;-).
nicholas_the_furious@reddit
I think he is saying using the MoE in place of the E2B/E4B.
tecneeq@reddit
I'm also trying every unsloth quant below Q4 this time. Maybe smaller quants give better results?
finevelyn@reddit
I have 5090+5070ti and am using Gemma 4 E2B Q6 as the draft model for 31B Q6. The draft model is on the 5090 and the main model is split on both (using llama.cpp with --fit-target 7400,200 --device-draft CUDA0). The 7400 fit target for the 5090 is required to ensure enough space is left for the draft model, because --fit doesn't seem to take it into account.
Draft model parameters: --draft-min 0 --draft-max 16 --draft-p-min 0.9 (min and max same as default, p-min up from 0.75 to 0.9).
For pure coding tasks the speed-up is from 45tk/s to 60-90tk/s. On other tasks there is a slight speed-up but nothing dramatic.
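Assembled into one invocation, that setup would look roughly like this. The model filenames are placeholders; the flags and values are the ones quoted in the comment above (draft model pinned to the 5090, `--fit-target` leaving headroom for it).

```shell
# Sketch of the dual-GPU setup described above.
# Filenames are placeholders; flag values quoted from the comment.
llama-server \
  -m  gemma-4-31b-it-Q6_K.gguf \
  -md gemma-4-e2b-it-Q6_K.gguf \
  --fit-target 7400,200 \
  --device-draft CUDA0 \
  --draft-min 0 --draft-max 16 --draft-p-min 0.9
```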
Adventurous-Paper566@reddit
That's very impressive. How much context can you manage with this setup?
finevelyn@reddit
128k context when main model KV cache is unquantized and draft model cache is Q8. I assume if I set both to Q8, I could use the full context size, but haven't had any reason to try yet.
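If I read that right, the cache split would be expressed with the per-model cache-type flags, along the lines of the sketch below. I'm assuming a recent llama.cpp build that has the draft-cache variants (`-ctkd`/`-ctvd`); filenames are placeholders, and the main model's KV cache is left at the f16 default.

```shell
# Sketch: main KV cache unquantized (default), draft KV cache at Q8.
# Assumes a llama.cpp build with the -ctkd/-ctvd draft-cache flags.
llama-server \
  -m  gemma-4-31b-it-Q6_K.gguf \
  -md gemma-4-e2b-it-Q6_K.gguf \
  -c 131072 \
  -ctkd q8_0 -ctvd q8_0
```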
oxygen_addiction@reddit
Have you played around with Qwen 27B on the same setup?
finevelyn@reddit
Last I read, speculative decoding is currently broken in llama.cpp for Qwen3.5, so I didn't try it. But for the record, the 27B Q6 fits with 128k context (unquantized) on the 5090 alone and runs at around 55tk/s, or split across both cards for max context at 45tk/s. Qwen3.5 0.8B or 2B will probably work nicely as a draft model when they fix it.
https://github.com/ggml-org/llama.cpp/issues/20039
GrungeWerX@reddit
What are your settings for Q6? On a 3090 Ti, Q6 is slow at 100K ctx, but better than Q5, so I'd like to use it more... on LM Studio. Llama.cpp was slower.
oxygen_addiction@reddit
vLLM supports MTP with Qwen3.5, and now you can also try DFlash, as it supports the 27B and should give a big boost.
StardockEngineer@reddit
I get 87 tok/s with your draft setup vs 48 without it! Awesome.
I'm using
llama-bench -hf unsloth/gemma-4-31B-it-GGUF:Q6_K_XL -devd CUDA1 --draft 5 -hfd unsloth/gemma-4-E2B-it-GGUF:Q6_K --no-mmproj
StardockEngineer@reddit
Is there a flag I'm missing? I get an error that drafting is not supported with multimodal. There must be a flag to do text only you're using?
Acceptable_Home_@reddit
Anything about the MoEs of the gemma4 and qwen3.5 families?
cunasmoker69420@reddit
does it speed up prompt processing at all?
finevelyn@reddit
No. As far as I understand, it has to process the prompt independently on both models, so if anything it makes it slower, but I'd say it's not a noticeable difference.
ethertype@reddit
Inspired by this thread, this is what I have ATM:
Confident_Ideal_5385@reddit
Speculation with qwen3.5 is harder because 3/4 of the attention is gated deltanet, which means you need to commit the recurrent state tensors to a snapshot every time you begin speculating, and roll back to that snapshot whenever you have a draft miss, and the machinery to do this in llama.cpp seems - imperfect, perhaps, based on my testing.
This state is on the order of hundreds of MB, so it's not huge, but even manually snapshotting this stuff to host RAM (via llama_memory_recurrent's save/copy state) seems to miss something, so there'd need to be a bit of work on the ggml side, I figure (especially if you want to store the rollback tensor state in a VRAM buffer to avoid a host copy on every draft miss). Apparently llama-server manages this with its built-in checkpointing, but I haven't had a good look at how.
That said, qwen3.5 apparently has some MTP heads baked into the models, so there's that.
Context -- I've been fucking around trying to get R/S tensor snapshotting working at arbitrary n_past in order to roll back state after doing e.g. JIT RAG and it's broken enough that I rolled back to qwen3, which is pure KV-cache style attention and works fine.
FinBenton@reddit
I tried the 4B Q4 as a draft model for the 31B Q5, but I didn't really get much benefit; some prompts got +10% speed, but most of the time it was around the same speed.
AppealSame4367@reddit
I'm thinking: Turboquants + DFlash + a gemma 4 dflash model in around 4 weeks in llama.cpp. That would be amazing.
Acceptable_Home_@reddit
Holy
xandep@reddit
I was thinking the same. Now imagine that with 1bit quants like bonsai.
AppealSame4367@reddit
Yes, that would suddenly bring huge models to the smallest laptops. Can't wait.
Just found out about the newer ngram params in llama-server today while trying to get sglang to run with a byteshape Qwen 4B (it didn't; I only have an RTX 2060, which doesn't support Flash Attention 3 and CUDA graphs, which it needed to run in sglang).
The ngram param is "free" and gives a 25% speedup for some models. At least Gemma E4B ran at 50 tps instead of 40 tps on this shitty laptop of mine. Prompt caching isn't really enabled by default either, something else I found out; you should give it a bigger value (512 is already said to be "large", 4092 is just an experiment).
--cache-ram 4092 \
--spec-type ngram-mod \
--spec-ngram-size-n 16 \
--draft-max 12 \
Try to find the right values yourself; I just asked Gemini (free) and they were good enough.
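Put together as a complete call, those flags would look something like the sketch below. The model path is a placeholder; the flag values are the ones quoted above, and ngram speculation needs no separate draft model.

```shell
# Sketch: ngram-based speculation (no draft model required).
# Model filename is a placeholder; flag values quoted from the comment.
llama-server \
  -m gemma-e4b-Q4_K_M.gguf \
  --cache-ram 4092 \
  --spec-type ngram-mod \
  --spec-ngram-size-n 16 \
  --draft-max 12
```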
Pomegranate-and-VMs@reddit
I have a draft/target model setup. My target model is split between two Tesla M40s, with a Quadro RTX 4000 running the draft.
Qwen3-0.6b draft model. Qwen3-30b-a3b-q4 target model.
I average around a 70% acceptance rate and 35 t/s.
When running Kilo code, this looks like 20-25t/s output on a 128k context window.
Currently working on some KV cache ideas to speed up prefill. The VRAM bandwidth on the M40s is not great, at 288 GB/s.
No_Algae1753@reddit (OP)
Just a general question: why do you still use Qwen3?
Pomegranate-and-VMs@reddit
I've been using the 30b-a3b on a single M40 since it came out. For the testing I was doing with this, I wanted to keep the benchmarks from before.
I'll get around to qwen3.5:27b next. Curious to see what could happen with Gemma4 as well.
For the last couple of months, I’ve been more focused on the KV cache and prefill speeds.
colin_colout@reddit
damn... how does this compare to no draft model?
I tried this a while back with these models, but I found that the 0.6B had way too many misses, and the 30B is already fast. What is your config?
Pomegranate-and-VMs@reddit
With these older cards, I was getting something like 20-25t/s on a 4K context. It’s been my test server in the homelab. I went this way to have a mental reference point with no drafting. Need to get around to moving over to qwen3.5:27b.
I think it’s felt worthwhile for maintaining some speed in a bigger context. But again, it’s been more of a tinkering project to see how much blood I can squeeze out of these turnips. 😂
If y'all got other ideas, happy to try. I’ll share my config when I get home this evening.
xeeff@reddit
yeah running 0.6b for a model with 3b compute is crazy
xanduonc@reddit
Quantized gemma4-e2b works well as a draft for gemma4-31b; the speedup depends on the task, up to ~2x. Too bad llama.cpp does not support speculative decoding with mmproj multimodal processing enabled.
No_Algae1753@reddit (OP)
I'll try that out.
pjsgsy@reddit
I believe it is broken (currently) for Qwen 3.5, though you can use the less effective ngram-mod (no draft model needed). There are a few PRs that hope to fix it. Hopefully, they will get to it.
Adventurous-Paper566@reddit
I don't know, but I think it will be easier with the Qwen models, which are all similar, because the different sizes of Gemma 4 models give very different outputs.