Speculative decoding question, 665% speed increase
Posted by GodComplecs@reddit | LocalLLaMA | View on Reddit | 40 comments
I'm using these settings in llama.cpp: --spec-type ngram-map-k --spec-ngram-size-n 24 --draft-min 12 --draft-max 48
What's the real reason for the difference between models? Let's say the prompt asks for "minor changes in code" — why do the speedups differ so much:
Gemma 4 31B: token generation doubles, so +100%
Qwen 3.6: only 40% more speed
Devstral Small: 665% increase in speed (what?)
masterlafontaine@reddit
Do you need to add the smaller model? What are the args?
GodComplecs@reddit (OP)
No smaller model needed in latest llama.cpp
masterlafontaine@reddit
What kind of black magic is this?
dtdisapointingresult@reddit
There are two different, unrelated things being mixed up here: (1) classic speculative decoding with a small draft model, and (2) n-gram/self-speculative lookup that reuses tokens already in the context.
So #2 only works if you're literally asking the LLM to repeat itself, in order to get a cache hit from the existing sequences. Rare tasks like "Here's a letter, I want you to change the name and pronouns of the person it's addressed to." 99% of the letter it will write you is identical, so you'll get a huge speed boost. This is very situational and I found it brought me no benefits in practice on the sort of tasks I do. However extra memory usage is negligible so you may want to give it a try.
Meanwhile #1 can genuinely help with generating brand new tokens, especially for predictable/repetitive things like coding, but even just for writing text. But this requires finding a tiny draft model that is compatible. For Qwen 3.5/3.6 it's easy, Qwen has a 0.8B little brother. For Gemma 4, the "small" model E2B is actually 10GB, it would kill your performance! On the bright side, it's easy to test with different draft models using the --hf-repo-draft flag to llama-server. I've managed to get great acceleration on coding tasks with Gemma 4 31B even using Gemma 3 270M Q8, so I assume they have compatible tokenizers.
Trollfurion@reddit
I think both Qwen 3.5 and 3.6 have MTP (a small model embedded in the model for spec decoding), so you don't need another model
dtdisapointingresult@reddit
Haha, my bad, it's my llama.cpp blindness.
You need to use vLLM or SGLang, meaning you must have a GPU, to use MTP. llama.cpp doesn't support it.
Corosus@reddit
Great info, thanks. Testing Qwen 3.6 with "make me a mock discord webpage": the first time was 83 tps, the second time it crept up all the way to 108 tps, using "--spec-type ngram-map-k --spec-ngram-size-n 24 --draft-min 12 --draft-max 48".
ngram-mod was slower than without any of this; I kept getting low acceptance streaks in the terminal.
Will try it in real scenarios in opencode; it might give at least a little boost when it's doing some mildly repetitive work.
Sadman782@reddit
It's not black magic, it's basically just search-based speculative decoding. That means it only really works for coding, or wherever the model repeatedly answers the same thing with small changes. Say a model generates a big chunk of code, a bug occurs, and only one line needs to change: the second time around, the model just searches its previous output and reuses the earlier prediction (which it obviously still verifies). That's how it works.
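The search-based drafting described above can be sketched roughly like this (a toy illustration of the idea, not llama.cpp's actual implementation; the function name and string "tokens" are made up for clarity):

```python
def ngram_draft(history, n=4, max_draft=12):
    """Propose draft tokens by matching the last n tokens of `history`
    against an earlier occurrence and copying what followed it."""
    if len(history) < n:
        return []
    key = history[-n:]
    # Search earlier context for the same n-gram, most recent match first.
    for i in range(len(history) - n - 1, -1, -1):
        if history[i:i + n] == key:
            # Copy the tokens that followed the earlier occurrence as the draft.
            return history[i + n:i + n + max_draft]
    return []

# Example: "a b c d e x" appeared before, so after seeing "a b" again
# we draft the continuation that followed it last time.
tokens = ["a", "b", "c", "d", "e", "x", "a", "b"]
print(ngram_draft(tokens, n=2, max_draft=4))  # → ['c', 'd', 'e', 'x']
```

The target model then verifies the drafted tokens in one batched pass, which is why reusing previous output is nearly free while novel text gets no benefit.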
EbbNorth7735@reddit
Isn't it basically using a lookup table to output the next probable token? Most words are more than one token, so assuming it'll finish a word is a fairly safe prediction.
GodComplecs@reddit (OP)
llama.cpp magic :)
fallingdowndizzyvr@reddit
Spec decoding works great if you are asking it to recite something verbatim, like the text of the Constitution of the United States. It'll fly!
But ask it to do something unique, like write a story about spider monkeys. The rejection rate will be high and it'll be next to useless.
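The accept/reject behavior being described works roughly like this greedy-verification sketch (a simplified model; `target_next_token` is a hypothetical stand-in for the full model's prediction, and real implementations verify the whole draft in one batched forward pass):

```python
def verify_draft(target_next_token, context, draft):
    """Greedy speculative verification: accept draft tokens one by one
    while they match what the target model would have produced anyway."""
    accepted = []
    for tok in draft:
        if target_next_token(context + accepted) == tok:
            accepted.append(tok)  # matches the target's own choice: keep it
        else:
            break  # first mismatch: the rest of the draft is discarded
    return accepted

# Toy "target model" that always continues the fixed string "hello world".
canon = list("hello world")
target = lambda ctx: canon[len(ctx)] if len(ctx) < len(canon) else ""

# A partly wrong draft: only the matching prefix survives verification.
print(verify_draft(target, list("hello"), list(" woXld")))  # → [' ', 'w', 'o']
```

This is why a high rejection rate (unique text) makes it next to useless: you pay for drafting but keep almost nothing.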
GodComplecs@reddit (OP)
But it won't really slow you down, and when it flies it flies! I only see 5% drop on "hard" tasks, but overall a huge improvement when iterating code.
fallingdowndizzyvr@reddit
Ah... yeah. That's what I said. That's because it's mostly producing the same tokens again.
pedronasser_@reddit
That speculative decoding, for some reason, is heavily affecting Qwen3.6's ability to follow instructions.
melspec_synth_42@reddit
665% makes sense for devstral if the ngram patterns in code are super predictable line by line. the speedup drops hard once the model has to reason about something novel tho. context length kills it too
Sadman782@reddit
Only helpful in minor chat coding; for agentic use it has no benefit at all, since it is just search-based. The speed difference might be due to hidden whitespace: even if most of the code doesn't look changed, there will be slight changes that invalidate the search. Dflash is what we need
GodComplecs@reddit (OP)
Don't think that is true, agents output a lot of samey stuff, especially coding ones.
Sadman782@reddit
Agents use highly efficient replace tools to fix bugs without writing the full file again, so the benefit will be very low in most cases.
cviperr33@reddit
interesting, gonna try this on qwen3.6
FatheredPuma81@reddit
My speed halved when I tried it.
GodComplecs@reddit (OP)
I think you are running out of VRAM. I do too, so offloading correctly is key:
-m Qwen3.6-35B-A3B-UD-Q4_K_XL.gguf --port 8083 --temp 0.6 --top-p 0.95 --min-p 0.00 --presence_penalty 0.0 --repeat-penalty 1.0 --top-k 20 --fit on --fit-ctx 70000 --fit-target 256 -b 2048 -ub 1024 --spec-type ngram-mod --spec-ngram-size-n 24 --draft-min 12 --draft-max 48
FatheredPuma81@reddit
No, I'm not even close, I have like 500MB to spare, but ngram just works fine now??? That's downright bizarre... Also, I would use IQ4_NL or XS btw; for your case the size vs quality ratio is just insane.
My --models-preset
GodComplecs@reddit (OP)
Yes, I'll see where we land on the quant for this. Some models I run at IQ3_XS and it's fine, and some at Q5 XL; it just depends on the speed vs intelligence tradeoff. Hard to argue with Qwen 3.6 at 100 tk/s base plus a 50-140% boost on similar content.
audioen@reddit
So, this test is about having the model repeat text it already said before. That will be the key factor in how much self-speculative output you can get, and it also matters how different the prompt processing and token generation speeds are. Ideally for speculative decoding, token generation is very slow (e.g. a dense model) but prompt processing is very fast. Also, if the sequence has to be reverted, as in recurrent models where you have to go back to prior Mamba states, then there's a question of how that is achieved and what the cost of e.g. storing and reverting the state copies is.
My experience is that about 24 for size-n is needed, and draft min and max do have to be in some 12-48 type range, so I have been using similar settings whenever I have tried this type of speculative decoding.
Good multitoken prediction should not be using self-speculative decoding but either a draft model or the MTP heads of e.g. Qwen3.5. I am hoping that the MTP support comes soon now that the whole partial sequence reverting appears to work with hybrid models. MTP is very interesting as it can speculate ahead cheaply for 3 tokens with high acceptance rate at cost of just evaluating a single extra layer per token.
When using a draft model, e.g. a 0.8B drafting for the 27B, I got about double the token generation rate with extremely good acceptance of the tokens from the draft model, often sequences of 4-8 tokens with a 90% success rate for the attempted long sequence (the speculative decoding cuts off when the draft model isn't confident in the next token). I guess the draft model speaks similarly to the large model, especially during sequences that are quite repetitive and formulaic in nature. Despite this, it didn't make the 27B usable for me; going from 7 tokens/s to something like 11 just isn't enough. Possibly I could tune the settings and get something more, but I feel I would want to triple the token generation speed, which may be too much to ask from speculative decoding.
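For rough intuition on why ~2x is a common ceiling, here is a standard back-of-the-envelope model of speculative-decoding speedup (an idealized formula assuming independent per-token acceptance and ignoring batching overheads, so real numbers land lower):

```python
def expected_speedup(p, k, c):
    """Idealized speculative-decoding speedup under a geometric model:
    p = per-token acceptance probability, k = draft length,
    c = draft-model cost relative to one target-model step."""
    # Expected tokens produced per round (accepted prefix + 1 bonus token).
    expected_tokens = (1 - p ** (k + 1)) / (1 - p)
    # Cost per round: one target verification pass plus k draft steps.
    cost_per_round = 1 + k * c
    return expected_tokens / cost_per_round

# e.g. 90% acceptance, drafts of 6 tokens, draft model ~3% of target cost:
print(round(expected_speedup(0.9, 6, 0.03), 2))  # → 4.42
```

The idealized number is far above the ~2x seen in practice, which reflects the costs the formula ignores: verification batches aren't free, acceptance isn't independent per token, and drafting overhead grows with context.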
GodComplecs@reddit (OP)
Yes, there are a lot of interesting features like MTP etc. that we are "missing" from llama.cpp; they must be really hard to implement since no LLM has data on them, I guess ;)
ResidentPositive4122@reddit
Is it possible devstral does full-file edits instead of search and replace? In that case, it'd reuse lots of tokens from before, and you'd see those numbers reported. So the n-gram spec decode works, but the result is not a 600% end-to-end speedup; it's just that it copied what was already there in the file.
GodComplecs@reddit (OP)
No, impossible. What I found out, which probably answers my actual question, is that it has been trained extensively on LINE BY LINE code, unlike Gemma and Qwen 3.6, which have seen full code bases or agentic tasks.
UnionCounty22@reddit
Mistral 😆
GodComplecs@reddit (OP)
Goated speed at least, and not half bad at coding, but really replaced by Qwen 3.6 and Gemma 4 by today's standards.
DinoAmino@reddit
Using ngrams instead of a draft model means it is highly dependent on tokens it has already generated or seen. So performance will vary quite a bit. How "scientific" were these comparisons? Did you use the same prompts and context for each?
GodComplecs@reddit (OP)
Yes, every time.
Fresh_Finance9065@reddit
Speculative decoding works for simple questions, but it doesn't really speed up difficult questions, where the small and big model would give different answers.
DinoAmino@reddit
Meanwhile, OP isn't using a draft model at all. Using ngrams here.
Karyo_Ten@reddit
The main issue is quantization: complex stuff requires longer answers, and if your quantization pushes the KL-divergence too far from the expected distribution, your draft cannot predict your quant.
So make a good quantized model first, then do speculative decoding.
maschayana@reddit
Misinformation. Context size is what makes the speed difference: at long context you're better off without speculative decoding, because its benefit drops off sharply. The answer is the same even with complex or long context (complexity doesn't make the difference).
Fresh_Finance9065@reddit
Isn't there a speculative decoding acceptance rate?
GodComplecs@reddit (OP)
Yes, I wrote MINOR edits, but it seems Devstral is fundamentally a line-by-line code model, while Gemma and Qwen are big-picture code models.
last_llm_standing@reddit
what is the use? you can't do anything real with it. I could do something similar with a bigram model
GodComplecs@reddit (OP)
It's for code editing
GodComplecs@reddit (OP)
Ok, added --repeat-penalty 1.0 for Qwen 3.6; now speed is increased by 140 tk/s over the 100 tk/s base on minor edits.