llama.cpp speculative checkpointing was merged
Posted by AdamDhahabi@reddit | LocalLLaMA | View on Reddit | 89 comments
https://github.com/ggml-org/llama.cpp/pull/19493
Some prompts get a speedup, others don't (when acceptance streaks are low).
For coding, I got roughly a 10-15% speedup with these params:
--spec-type ngram-mod --spec-ngram-size-n 24 --draft-min 48 --draft-max 64
pj-frey@reddit
Does it work with vision (--mmproj set)?
andy2na@reddit
Unfortunately not, it seems
pj-frey@reddit
I simply tried it.
It does work.
The effect is not really impressive, though; I might need to tweak the parameters. At least it is not slower :-) I will monitor it for a while.
andy2na@reddit
Are there any cases where it's faster, anything in the logs showing it is working?
pj-frey@reddit
I have not run a real session with repetitive patterns yet. There are statistics printed after every call:
andy2na@reddit
interesting, wonder if it would be beneficial to me since I have a lot of calls from Frigate to analyze images/clips
andy2na@reddit
u/nickm_27 worth trying this, especially with frigate. Logs when Frigate pulls a request from llama.cpp:
The N-Gram cache successfully recognized a pattern in the Frigate output, instantly guessed the next sequence of words, and got a 100% perfect score. It injected those tokens directly into your output with zero generation time.
Config:
nickm_27@reddit
I gave this a try but so far the results seem like a relatively minor improvement:
andy2na@reddit
yeah, very minor, we'll have to wait for DFlash to see any noticeable improvement. I'd test it on vLLM now, but it's such a resource hog
https://x.com/zhijianliu_/status/2046352785000771674
nickm_27@reddit
yeah, that's more difficult though, as it requires running a smaller draft LLM, which needs more VRAM, whereas N-gram runs leaner
andy2na@reddit
yeah, looked into current solutions and all the AWQ models for Qwen are over 20gb, and with the DFlash model, it hits my 24gb limit :(
unbannedfornothing@reddit
It is now.
andy2na@reddit
where do you see that?
In that PR, it says "Drafts with mmproj are not supported in this PR."
trusty20@reddit
I see no change whatsoever in t/s for this regardless of what prompts I try with build 8846 (i.e. super vanilla stuff like "Make a simple snake game with HTML and JS" or "How many planets are there in the solar system?" etc.). Does this only apply to MoE or certain quants etc.? Am I maybe missing something? I tried a few versions of the new CLI flags but saw no difference.
oxygen_addiction@reddit
Same here. I'm getting a high acceptance rate, but speed seems about the same.
fragment_me@reddit
The ngram-mod seems to improve things, but using a draft model definitely slows things down. There's something going on with either the code or the communication between the two components (draft output and model output) that causes high latency. It doesn't make sense that we get high draft acceptance but low tok/s, especially when both models are in VRAM.
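One hypothetical way to square high acceptance with low tok/s: even at near-perfect acceptance, per-step drafting overhead caps the win. Here's a back-of-the-envelope model (my own simplification and made-up numbers, not llama.cpp measurements; it ignores the cost of batch verification over multiple tokens):

```python
# Simplistic speculative-decoding speedup model (my own assumptions):
# each step does one target verification pass plus some draft overhead,
# and yields one verified token plus the accepted drafted tokens.
def est_speedup(k, accept_rate, draft_cost):
    """k: draft length; accept_rate: fraction of drafted tokens accepted;
    draft_cost: draft time per step, relative to one target forward pass."""
    tokens_per_step = 1 + k * accept_rate  # accepted drafts + 1 verified token
    time_per_step = 1 + draft_cost         # target verify pass + draft overhead
    return tokens_per_step / time_per_step

# Near-free drafting (n-gram style): well above 1x.
print(est_speedup(8, 0.9, 0.1))
# Expensive separate draft model: can dip below 1x despite 90% acceptance.
print(est_speedup(8, 0.9, 8.0))
```

Under this toy model, high acceptance is necessary but not sufficient; if the draft model's per-step latency rivals the target's, throughput drops even when nearly every token is accepted.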
OsmanthusBloom@reddit
It will only help in situations where the model has to echo back snippets of your prompt or other pieces of context it has seen. For example in code editing this happens a lot.
Try something like "Repeat after me:" and then a longer piece of text or code.
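For anyone curious what that lookup roughly looks like, here's a toy Python sketch of prompt-lookup (n-gram) drafting - my own simplification, not llama.cpp's actual code: the draft is simply whatever followed an earlier occurrence of the trailing n-gram in the context.

```python
# Toy prompt-lookup drafting (a sketch, not llama.cpp internals):
# if the last n tokens appeared earlier in the context, propose the
# tokens that followed that occurrence as the draft.
def ngram_draft(tokens, n=3, draft_max=8):
    if len(tokens) < n:
        return []
    tail = tokens[-n:]
    # Scan backwards for an earlier occurrence of the trailing n-gram
    # (excluding the tail itself).
    for i in range(len(tokens) - n - 1, -1, -1):
        if tokens[i:i + n] == tail:
            return tokens[i + n:i + n + draft_max]
    return []

# "Repeat after me" style contexts match immediately:
ctx = [1, 2, 3, 4, 5, 6, 1, 2, 3]
print(ngram_draft(ctx, n=3, draft_max=4))  # -> [4, 5, 6, 1]
```

This is why echoing code back gets big speedups while novel reasoning gets none: the trailing n-gram only matches when the model is reproducing text it has already seen.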
Jungle_Llama@reddit
I got it to revise a large chunk of code and saw no increase.
Jungle_Llama@reddit
updated to b8855, now I see the increase, about 30% in parts where it is working.
trusty20@reddit
So I just re-tested (this time with Qwen3.6 35b) using your suggestion, I gave it a 2 page documentation snippet along with the following prompt: "Can you extract all of the c code snippets from this document:"
It returns the snippets, then just to be sure I'm testing this fairly, I ask it:
"Can you give them all again?"
On both the first and second repeating previous code snippets output, I see no performance improvement whatsoever, in fact, performance drops a few t/s.
I'm using these flags: --spec-type ngram-mod --spec-ngram-size-n 24 --draft-min 8 --draft-max 64 --ctx-checkpoints 4
trusty20@reddit
Oh, I guess that makes sense, but I'm just confused because I heard there was some sort of speculative decoding (aka draft model) thing built into Qwen3.5 and Gemini, versus the previous draft-model-focused approach, but this sounds more like a completely adjacent thing that re-uses KV cache blocks?
I really am not an expert so I absolutely could be just wrong to have assumed that but that's what I went in expecting based on the conversation leading up to this.
Jungle_Llama@reddit
Same here. b8849. Whole model (unsloth/Qwen3.6-35B-A3B-UD-Q4_K_XL) in GPU, Vulkan
iportnov@reddit
Does this work with llama-cli?
Fresh-Resolution182@reddit
the 0-50% variance depending on task is the interesting part. ngram acceptance rate doing all the heavy lifting - curious what kills it outside coding
Fresh-Resolution182@reddit
the acceptance variance makes sense once you realize ngram-mod is pattern matching on exact token sequences. boilerplate-heavy typescript/java hits the high end, one-off logic or reasoning chains will be near zero. still worth having on by default and letting it fall back
CodeMichaelD@reddit
works way better with the latest MoE Gemma than with Qwen; it also causes a slowdown instead if the entire model does not fit into VRAM, especially Qwen. For some --spec-type values, restarting/reprocessing the entire context is required, otherwise a new chat won't trigger the optimizations.
hedsht@reddit
using the same settings:
Generation throughput improved by about 28.0%
Prompt throughput improved by about 4.2%
rerri@reddit
This is an exciting one (DFlash):
https://github.com/ggml-org/llama.cpp/pull/22105
AppealSame4367@reddit
As far as I understood though, it needs quite some extra VRAM. Just like the vLLM and transformers implementations?
UnknownLesson@reddit
Can you really run qwen3.6 35b with a 8 GB VRAM GPU?
AppealSame4367@reddit
Yes, you can. It won't be super fast, but for context < 60000 you will get somewhere between 200-800 tps prefill and 5-25 tps output.
I run it without thinking; it's still good enough. Depending on what you do, you might want to re-enable thinking.
Adapt the values to your system:
#!/bin/bash
export GGML_CUDA_GRAPHS=0
./build/bin/llama-server \
-hf unsloth/Qwen3.6-35B-A3B-GGUF:UD-IQ2_M \
--no-mmproj \
--no-mmproj-offload \
-c 80000 \
-b 2048 \
-ub 2048 \
--prio 3 \
-fit on \
-np 1 \
-kvu \
--clear-idle \
--cont-batching \
--slot-save-path ./slots \
--port 8129 \
--host 0.0.0.0 \
--cache-ram 8184 \
--spec-type ngram-map-k4v \
--draft-max 32 \
--draft-min 5 \
--spec-ngram-size-n 4 \
--spec-ngram-min-hits 1 \
--mlock \
--no-mmap \
--cache-type-k q4_0 \
--cache-type-v q4_0 \
-t 6 \
--temp 1.0 \
--top-p 0.95 \
--top-k 20 \
--min-p 0.0 \
--presence_penalty 0.0 \
--repeat-penalty 1.0 \
--jinja \
--reasoning off
illforgetsoonenough@reddit
You can run it without any gpu. It will just be very slow
vincespeeed@reddit
I also have 6GB of VRAM, and I'm getting 22 t/s with these settings.
[⚡qwen3.6-35b-a3b]
model = F:/Programlar/LM Studio/.lmstudio/models/bartowski\Qwen3.6-35B-A3B\Qwen3.6-35B-A3B-UD-IQ4_NL.gguf
override-tensor = blk.[3-9].ffn.*exps=CPU,blk.[1-2][0-9].ffn.*exps=CPU,blk.3[0-6].ffn.*exps=CPU
spec-type = ngram-mod
spec-ngram-size-n = 24
draft-min = 4
draft-max = 32
n-gpu-layers = all
ctx-size = 60000
parallel = 1
threads = 10
batch-size = 128
ubatch-size = 128
mlock = true
cont-batching = true
flash-attn = true
sleep-idle-seconds = 600
temp = 1.0
top-k = 20
top-p = 0.95
min-p = 0.0
presence-penalty = 1.5
repeat-penalty = 1.10
cache-type-k = q8_0
cache-type-v = q8_0
P0pMan20@reddit
I reach 25-28 tps depending on context usage on my 3060 mobile with 6GB VRAM using these llama.cpp flags:
llama-server --jinja -c 20000 -m Qwen3.5-35B-A3B-Q4_K_S.gguf --temp 1 --top-p 0.95 --repeat-penalty 1.0 --top-k 20 --presence-penalty 1.5 --min-p 0 -fa on --fit on
AppealSame4367@reddit
Not sure about some params. 2060, 6gb vram, 32gb system ram.
At the beginning: 750 tps pp, 18 tps output
At ~50k context: 200 tps pp, ~2-3 tps output
I use qwopus 4b or gemma 4 e4b for exploration and plan the actual code fixes with 3.6 35b. Very new setup, so of course I also still use some cloud ai.
llama-server \
-hf unsloth/Qwen3.6-35B-A3B-GGUF:UD-IQ2_M \
--no-mmproj \
--no-mmproj-offload \
-c 80000 \
-b 2048 \
-ub 2048 \
--prio 3 \
-fit on \
-np 1 \
-kvu \
--clear-idle \
--cont-batching \
--slot-save-path ./slots \
--port 8129 \
--host 0.0.0.0 \
--cache-ram 8184 \
--spec-type ngram-map-k4v \
--draft-max 32 \
--draft-min 5 \
--spec-ngram-size-n 4 \
--spec-ngram-min-hits 1 \
--mlock \
--no-mmap \
--cache-type-k q4_0 \
--cache-type-v q4_0 \
-t 6 \
--temp 1.0 \
--top-p 0.95 \
--top-k 20 \
--min-p 0.0 \
--presence_penalty 0.0 \
--repeat-penalty 1.0 \
--jinja \
--reasoning off
OsmanthusBloom@reddit
Thanks for this, I was just wondering what parameters to use for qwen 3.6 35b on my 3060 mobile. I will try it soon, I've been too busy since it was released.
Based on my earlier experience I would also try:
-ctk q8_0 -ctv q8_0 (fit twice as long context in the same VRAM, basically free now with attn-rot)
--fit-target 128 (use all VRAM, adjust up if you hit OOM or down if brave)
-np 1 (if you need only one session at a time, saves VRAM)
-ub 2048 (higher ubatch improves PP speed a lot but costs some VRAM)
ea_man@reddit
Maybe try with an 8B or 4B, even lower quants, reasoning disabled.
AppealSame4367@reddit
That's just not what I hoped for. But luckily llama.cpp just published speculative checkpoints. Like ngram, they work without a speculative model. If they keep going in that direction, maybe I can still run Q3.6 at 20 tps.
xienze@reddit
There’s no free lunch in computing. Everything is a space/time trade off.
ea_man@reddit
Quoting the PR:
For MoE targets (gpt-oss-20b), DFlash speedup is generally smaller than for dense attention targets because more experts get activated during the parallel verification step than during single-token autoregressive decoding (same observation as in #18039 for gpt-oss EAGLE3).
----
I guess that is gold for omnicoder 2, which has its use. YMMV
SnooPaintings8639@reddit
I don't get one thing - where do we get drafting models for DFlash? Do we have to hope sole labs will do proper training or distillation for free, for each model we use? And how does it even work - training a diffusion model against an autoregressive model so it's usable as a drafter?
rerri@reddit
Sole labs? Do you mean Z-labs?
The Speculators project has already added support for DFlash draft model training, and a third party (RedHatAI) has released a preliminary model on HF for Gemma 4 31B.
https://github.com/vllm-project/speculators
https://huggingface.co/RedHatAI/gemma-4-31B-it-speculator.dflash
Far-Low-4705@reddit
my only gripe with speculative decoding is that it disables vision.
That makes it unusable for my use case, unfortunately
D2OQZG8l5BI1S06@reddit
https://github.com/ggml-org/llama.cpp/pull/19493#issuecomment-4269556794
Far-Low-4705@reddit
OMG LETS FUCHING GOOOO
no more painfully slow 27b, might actually be usable now
TheOnlyBen2@reddit
I am curious to know your use case?
Far-Low-4705@reddit
engineering and coding
But i need vision for engineering.
ea_man@reddit
Holy shit that thing should do some 8x speed up for Omnicoder without reasoning for coding, tonight I'm gonna test that!
fragment_me@reddit
This means we can use self spec decoding on Qwen3.5 and 3.6!! Just add it to the params and watch the tokens go brrrrrrrrrr
ForsookComparison@reddit
How well does this work?
Does this branch enable regular spec dec with a second draft model?
fragment_me@reddit
I've used this with Gemma 4 quite a bit and it's basically free tokens. I haven't seen it ever be noticeable in terms of speed, but the stats are there and they show it's generating tokens.
I ran about 10 experiments earlier with Q3.5 27B and I found the following to be most useful in agentic coding (generated the most tokens):
--spec-type ngram-mod --spec-ngram-size-n 24 --draft-min 4 --draft-max 64
The docs state that lowering draft min and draft max is better for dense models. I think it depends a lot on your use case. I'm also not sure how the latency of drafting a minimum of 4-6 tokens impacts overall processing.
I'm also not sure about the second draft model; that's an even better question, because that would provide a much more meaningful speedup... I'm going to test that.
FatheredPuma81@reddit
I'm seeing [12860] draft acceptance rate = 1.00000 ( 105 accepted / 105 generated) every single time with Ngram and a draft model?
ForsookComparison@reddit
2B seems appropriate for 397B
I'm interested in 0.8B for 27B 👀
fragment_me@reddit
Results with 0.8B were not great. Draft acceptance was very high, but the overall tok/s were 50-80% of the normal generation. So that made no sense. This was with pipeline parallelism, so maybe I need to go back and try just single GPU. Although I don't see why it would have issues since it all fit in VRAM (2x 3090). I experimented with draft min and max but found no overall positive values. I also tried various temp and min p sizes.
I have to peel back some layers and try a more basic approach with fewer parameters.
Also, I have an idea of putting the draft model on a 3060 with 6GB RAM dedicated to it. I need to cut some holes in the server case for the PCIe riser cable though.
ForsookComparison@reddit
You rock for doing all these tests.
If you're considering isolating one to a single GPU and testing that, what if you tested a version of 27B quantized enough to fit on one 3090 and dedicated the other to the draft model? It probably wouldn't be a usable long-term setup, but it might be a quick way to validate the "worth pursuing"-ness of this
fragment_me@reddit
Just tested 3 scenarios:
FatheredPuma81@reddit
Did some simple non scientific singular tests with a fresh context:
Qwen3.5 27B: 43t/s across the board.
Qwen3.5 27B w/ 0.8B Q2_K_XL: Was basically 27t/s throughout.
Qwen3.5 27B w/ 0.8B Q4_K_XL: Started at 30t/s while reasoning and jumped to 50t/s by the end.
Qwen3.5 27B w/ Ngram-mod: Started 42.7t/s while reasoning and jumped to 48t/s by the end.
Qwen3.5/3.6 35B: Went from 130t/s to 60t/s no matter what.
Nothing I can do about run to run variance but imo I'd recommend using Ngram-mod on Qwen3.5 27B and that's it.
FatheredPuma81@reddit
Does this also add Draft Model support? Because I don't have the VRAM to use that T_T.
Ngram is great though 100% recommended everyone use it.
FatheredPuma81@reddit
It does but it sucks with 0.8B. Saw a token decrease with Qwen3.5 27B while reasoning and a token increase while writing code. I'd say that limits the use for that. Qwen3.5 and 3.6 35B both see their speed halved. It also says that draft token acceptance was 100% lol.
Due_Net_3342@reddit
I guess this is fine, but I really want MTP working
ArtfulGenie69@reddit
There's still vllm hehe
FatheredPuma81@reddit
But I don't have the VRAM and Qwen3.6 27B isn't out!
emprahsFury@reddit
ah but you look at the files changed, https://github.com/ggml-org/llama.cpp/pull/19493/files
And once again, no documentation files were updated for a major feature release.
ParaboloidalCrest@reddit
Not really. Those are the original speculative decoding docs from 2 months ago.
RevolutionaryPick241@reddit
Do you know what the params mean?
MoneyPowerNexis@reddit
It means parameters. They are the values passed into a function, in this case the main function of the relevant applications that are built when you compile llama.cpp. They are used to initialize the state of the program, telling it which features to use (or not) and how.
SmartCustard9944@reddit
Write a banana bread recipe
MoneyPowerNexis@reddit
Sure thing I found this easy to follow instructional video: https://youtu.be/WlreNuiJ5KE
Momsbestboy@reddit
And as an outlook - because there already was a thread about how disappointed people are with the new B70:
https://github.com/ggml-org/llama.cpp/pull/22066 - 17 to 50% speed up on SYCL
https://github.com/ggml-org/llama.cpp/pull/21845 - up to 50% speed up
https://github.com/ggml-org/llama.cpp/pull/21527 - another 50% speed up
So it is as I said: don't judge the B70 too early. It will take some weeks to improve the software and drivers, but for sure the current numbers are not the final ones.
Gesha24@reddit
I had all of them installed and built for SYCL. The benchmarks were decent-ish: 700 t/s for prompt processing and 50 t/s for generation when using Qwen3.6-35B. Seems ok, right?
The real issue - once you start using them and get to the high context (like 100K), the performance drops to 50 t/s for pp and about 7 t/s for generation. And once you push it even higher - it just crashes.
I gave up on it, returned the B70 and got a 9700 Pro. Same 32G VRAM. Same llama-server (but built with ROCm support vs SYCL). Prompt processing - up to 1800 t/s. Generation - up to 80 t/s. And once you reach 120K context, it a) doesn't crash and b) still chugs along at 650 t/s pp and 50 t/s generation.
So in practical terms, some claude code calls were simply impossible with B70 and those that were possible - could take an hour. With 9700 the experience is certainly not as good as with sonnet, but it's certainly workable.
So extra $350 ($950 for B70 and $1300 for 9700) are more than worth it IMO.
Momsbestboy@reddit
$950 for B70 and $1300 for 9700 for me is not $350 but $450 - or a 36% higher price.
Plus: while the ROCm drivers have been out for a while, SYCL support in llama.cpp is still more or less new. Otherwise, you couldn't add 50% more t/s in a single change, but would have to go through multiple smaller improvement steps.
They are still working on the low-hanging fruits.
So, let's see what comes next.
fallingdowndizzyvr@reddit
LOL. 1300 - 950 is...... $350.
No. It definitely is not. I was trying it on my A770 a couple of years ago. It's been around as long as Vulkan has. It's just that not many people use it. They use Vulkan, because it's more performant.
Gesha24@reddit
Sometimes I wonder if there are people in here or just bots.
"$950 for B70 and $1300 for 9700 for me is not $350 but $450 - or a 36% higher price." - check your math, what is 1300-950?
However, the point was - even if the price is 36% higher, you get at least 300% more real life performance.
fallingdowndizzyvr@reddit
Ah.... you realize that's because SYCL was just slow. I've used both SYCL and Vulkan on my A770s and Vulkan consistently blows SYCL away. But even using Vulkan, my A770s punch well below their weight.
TheBlueMatt@reddit
I mean also try the vulkan backend. Vulkan appears to still be faster on pp than SYCL even after some of the updates in those PRs. Might be worth optimizing tg in Vulkan more than fixing pp in SYCL.
andy2na@reddit
No mmproj support it seems 😔
ai_without_borders@reddit
the acceptance rate variance makes sense when you think about what ngram-mod is actually matching on. code heavy on boilerplate/repeated variable names (typescript/java enterprise patterns) should see the high end of 0-50%. one-off logic or reasoning chains will be near zero. the --spec-ngram-size-n 24 is aggressive - 24 tokens of context for pattern matching means waiting for very precise repetitions. might be worth experimenting with lower values (8-12) for mixed code/prose tasks to widen the matching window and get more hits, at the cost of slightly shorter draft runs
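A toy demonstration of that tradeoff (my own sketch, not llama.cpp internals): a large n-gram size demands that the entire trailing window recur verbatim, so a single differing token kills a long match that a shorter one survives.

```python
# Check whether the trailing n tokens reappear verbatim earlier in the
# context - the precondition for n-gram drafting to fire at all.
def has_match(tokens, n):
    tail = tokens[-n:]
    # Only earlier windows count, not the tail itself.
    return any(tokens[i:i + n] == tail for i in range(len(tokens) - n))

# A context that repeats an 8-token span exactly, but whose wider
# 12-token window differs by a single token:
span = list(range(100, 108))  # 8 identical tokens
ctx = [1, 2, 3, 4] + span + [50, 60] + [1, 2, 9, 4] + span
print(has_match(ctx, 8))    # True  - the 8-token span recurs exactly
print(has_match(ctx, 12))   # False - the 12-token window differs by one token
```

So a lower --spec-ngram-size-n fires more often on mixed code/prose, at the risk of matching coincidental repeats and drafting tokens that get rejected.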
robertpro01@reddit
Is this for dflash model?
milkipedia@reddit
I'm hopeful this will speedup Gemma 4 31B for me, and make it usable
Beginning-Window-115@reddit
only on tasks that are repetitive otherwise no
iamapizza@reddit
Doesn't that rule out reasoning based tasks? Sorry if I'm misunderstanding.
AdamDhahabi@reddit (OP)
Let's say you're using your LLM as a chatbot for coding: it gives you some requested changes, and you ask it to implement the proposed changes and return the full code (which already exists somewhere in the context). Now you'll see a large speedup.
iamapizza@reddit
Aha understood thanks, so actually the speedups could occur in the middle of your workflow. I understand a lot better now cheers
iamapizza@reddit
I'm not seeing any speed ups. Does it depend on the amount of free VRAM? I'm running Qwen3.6-35B-A3B-UD-Q8_K_XL.gguf on RTX 5080.
iamapizza@reddit
Hm, I'm just not seeing any speed up. Did you try something specific? Or is it my setup... I'm using Qwen3.6-35B-A3B-UD-Q8_K_XL.gguf on RTX 5080
AppealSame4367@reddit
Wonderful. Thx to all that contributed, I feel like Christmas every other day with llama cpp.
cviperr33@reddit
thanks for the post!