speculative decoding silently broken for Qwen3.6 on the TurboQuant fork — PR to fix
Posted by dangerousdotnet@reddit | LocalLLaMA | 5 comments
Body
if you're running Qwen3.6-35B-A3B on the TurboQuant fork and tried speculative decoding, it was quietly doing nothing: the server silently falls back to normal decoding without raising any error.
basic idea for anyone unfamiliar: you run a tiny model (like Qwen3.5-0.8B) alongside your big model. the small one guesses the next bunch of tokens really fast, then the big model checks all the guesses in one batched pass. whatever the big model agrees with, it keeps; whatever it rejects, it redoes itself. so if your big model does 30 tok/s, the small one does 150 tok/s, and say 60% of guesses are accepted, that's a lot of tokens you didn't have to decode one by one. net effect is faster output basically for free, since verifying a batch of drafted tokens costs roughly as much as decoding a single token.
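to put rough numbers on it, here's a back-of-envelope sketch in python (illustrative only; it assumes one batched verification pass costs about one normal target step, and real acceptance rates vary a lot by content):

```python
# back-of-envelope estimate of speculative decoding throughput,
# using the illustrative numbers from the post.

def spec_decode_tok_per_sec(t_target, t_draft, k, accept_rate):
    """Expected output tok/s when drafting k tokens per round.

    t_target:    target model decode speed (tok/s); also taken as the
                 cost of verifying k drafted tokens in one batched pass
    t_draft:     draft model decode speed (tok/s)
    k:           tokens drafted per round
    accept_rate: fraction of drafted tokens the target accepts
    """
    # time per round: draft k tokens one by one, then one target pass
    round_time = k / t_draft + 1 / t_target
    # tokens kept per round: the accepted drafts, plus the one token
    # the target itself produces at the verification step
    tokens_out = k * accept_rate + 1
    return tokens_out / round_time

print(spec_decode_tok_per_sec(t_target=30, t_draft=150, k=5, accept_rate=0.6))
# -> 60.0, i.e. roughly 2x the 30 tok/s baseline under these assumptions
```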
the reason it was broken: Qwen3.6 isn't a pure transformer, it's a hybrid with recurrent layers mixed in. when the big model rejects a draft token it needs to roll back its internal state to before the bad guess, and the recurrent layers didn't support that rollback. mainline llama.cpp fixed it last week but TurboQuant hadn't picked the fix up.
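to make the rollback problem concrete, here's a toy python sketch (made-up names, nothing like llama.cpp's actual internals): an attention KV cache can roll back by just truncating its per-position entries, but a recurrent layer folds every token into one fused state, so the only way back is an explicit snapshot.

```python
# toy illustration of why rollback is easy for attention but needs
# explicit support for recurrent layers. hypothetical classes, not
# llama.cpp's API.

class AttentionCache:
    def __init__(self):
        self.kv = []                    # one cache entry per position

    def rollback(self, n_keep):
        self.kv = self.kv[:n_keep]      # trivial: truncate by position

class RecurrentLayer:
    def __init__(self):
        self.state = 0.0                # single fused state, nothing per-position

    def step(self, tok):
        self.state = 0.9 * self.state + tok   # every token gets absorbed

layer = RecurrentLayer()
snapshot = layer.state                  # checkpoint *before* speculating
for tok in [1.0, 2.0, 3.0]:             # run the drafted tokens through
    layer.step(tok)

# target rejected the drafts: there's no way to "subtract" them from the
# fused state, so restore the checkpoint -- the support that was missing
layer.state = snapshot
```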
one thing to be aware of: vocab compatibility between your draft and main model matters. if the tokenizers don't match exactly, llama.cpp has to translate tokens between them on the fly, which adds overhead and can lower acceptance rates. we tested Qwen3.5-0.8B as a draft for Qwen3.6-35B-A3B and got the "vocabs not compatible" warning: they're both in the qwen35 tokenizer family but apparently not identical. it's not a hard block, the server still runs speculative decoding, just with some drag from the translation. depending on your model pairing and the kind of content you're generating (code tends to have more predictable tokens, so acceptance rates run higher) the speedup can still be substantial.
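conceptually the translation is just a round-trip through text. rough python sketch below with hypothetical helper names, not llama.cpp's actual internals; the point is where the overhead and the acceptance hit come from:

```python
# hypothetical sketch of cross-vocab draft translation. when draft and
# target vocabs differ, draft token ids mean nothing to the target, so
# they have to be mapped through the text they represent.

def translate_draft(draft_ids, draft_tok, target_tok):
    text = draft_tok.decode(draft_ids)   # draft ids -> text
    return target_tok.encode(text)       # text -> target ids

# cost 1: a decode/encode round-trip on every speculation round.
# cost 2: the two tokenizers can segment the same text differently, so
# a draft that was "right" as text may no longer line up token-for-token
# with what the target would emit, and gets rejected anyway -- which is
# the acceptance-rate drag mentioned above.
```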
the upstream fix is now merged into the fork, benchmarked before and after on an M2 Pro: zero speed regression, and perplexity was identical. details in the PR.
PR: TheTom/llama-cpp-turboquant#100
only matters if you're on the TurboQuant fork specifically. if you're on regular llama.cpp you already have this.
Pidtom@reddit
upstream sync planned for this week, should pick up these changes. been busy with AMD OOM fixes and the mlx-swift work.
dangerousdotnet@reddit (OP)
appreciate all your work man.
Ha_Deal_5079@reddit
vocab mismatch between draft and main model is lowkey killing ur speedup more than the rollback bug. even with the warning it still runs, but acceptance rate tanks hard
dangerousdotnet@reddit (OP)
yup, that's a qwen 3.6-specific issue. for qwen 3.5 and gemma-4 (the whole gemma-4 family shares the exact same tokenizer) the benefits should be quite measurable, so it's still a useful feature
dangerousdotnet@reddit (OP)
that said, gemma-4 on llama-server is still in **rough** shape. lots of open issues. tool calling is still broken, which is fine if you don't need tool calling, but in that case you're probably there for its multi-modal capabilities -- and structured output (json_schema) isn't supported, so you can't get much out of the VLM side either.
it's annoying because gemma-4 runs really well on mlx-vlm, structured output included.
so no single server currently runs all the models i want: qwen 3.6, SAM3, gemma-4