Orthrus-Qwen3-8B: up to 7.8× tokens/forward on Qwen3-8B, frozen backbone, provably identical output distribution
Posted by Franck_Dernoncourt@reddit | LocalLLaMA | View on Reddit | 33 comments
- Code: https://github.com/chiennv2000/orthrus
- Paper: https://arxiv.org/abs/2605.12825
- HF: https://huggingface.co/chiennv/Orthrus-Qwen3-1.7B ; https://huggingface.co/chiennv/Orthrus-Qwen3-4B ; https://huggingface.co/chiennv/Orthrus-Qwen3-8B
- Disclosure: co-author.
Idea: Inject a trainable diffusion attention module into each layer of a frozen AR Transformer. Both heads share one KV cache. The diffusion head drafts K=32 tokens in parallel; the AR head verifies them in a second pass and accepts the longest matching prefix. The output distribution is provably identical to the base model's.
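A minimal sketch of the greedy accept rule (assumptions: PyTorch, greedy decoding, and that the verification pass returns the AR head's argmax for each drafted position plus one extra position; the function name is a hypothetical placeholder, not the repo's API):

```python
import torch

def accept_longest_prefix(draft: torch.Tensor, ar_argmax: torch.Tensor) -> torch.Tensor:
    """Keep drafted tokens while they match the AR head's argmax at the same
    position, then append the AR token at the first mismatch, so every
    verification pass yields at least one token."""
    n = 0
    while n < draft.numel() and draft[n] == ar_argmax[n]:
        n += 1
    return torch.cat([draft[:n], ar_argmax[n:n + 1]])

# Toy example with K=8 drafted tokens: 5 positions match, so 6 tokens are emitted.
draft = torch.tensor([11, 22, 33, 44, 55, 9, 9, 9])
ar    = torch.tensor([11, 22, 33, 44, 55, 66, 77, 88, 99])  # K+1 verified positions
print(accept_longest_prefix(draft, ar))  # tensor([11, 22, 33, 44, 55, 66])
```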
Results:
- Up to 7.8× TPF, ~6× wall-clock on MATH-500.
- 16% of params trained, <1B tokens, 24h on 8×H200.
- vs. diffusion LMs (Dream, Fast-dLLM-v2, SDAR, Mercury, Gemini Diffusion): they modify base weights and lose accuracy (Fast-dLLM-v2: -11 pts on MATH-500). Orthrus freezes the backbone; accuracy matches Qwen3-8B exactly.
- vs. Speculative Decoding (EAGLE-3, DFlash): no external drafter, no separate cache, and zero Time-To-First-Token (TTFT) penalty because we don't have to initialize and sync a separate drafter model. KV overhead is O(1) (~4.5 MiB flat). Acceptance length on MATH-500: 11.7 vs. 7.9 (DFlash) vs. 3.5 (EAGLE-3).
- Single-step denoising beats multi-step (6.35 vs. 3.53 TPF). KL distillation beats CE on acceptance rate.
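On the last point, a rough sketch of what a per-position KL distillation objective can look like (shapes, the direction of the KL, and the function name are assumptions; the paper's exact loss may differ):

```python
import torch
import torch.nn.functional as F

def kl_distill_loss(draft_logits: torch.Tensor, base_logits: torch.Tensor) -> torch.Tensor:
    """KL(base || draft) averaged over drafted positions: the trainable diffusion
    head is pulled toward the frozen AR head's full next-token distribution,
    instead of toward one-hot gold tokens as with a plain CE objective."""
    log_p_draft = F.log_softmax(draft_logits, dim=-1)
    p_base = F.softmax(base_logits, dim=-1)
    return F.kl_div(log_p_draft, p_base, reduction="batchmean")

# Toy shapes: 2 sequences x K=32 drafted positions, Qwen3 vocab size ~151k.
loss = kl_distill_loss(torch.randn(64, 151936), torch.randn(64, 151936))
```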
Limitations: strictly bounded by the frozen base model (inherits its biases, hallucinations, knowledge gaps); Qwen3-only evaluation; greedy + rejection sampling only.
Finanzamt_Endgegner@reddit
How is it on longer contexts? DFlash has issues with that, for example.
Round-Beach7348@reddit
We tested up to 64K context and it works pretty well. DFlash's regression at longer contexts is likely because it maintains a separate external drafter with its own KV cache, which can cause distributional drift as context grows. Since Orthrus shares a single KV cache with no external drafter, it might not have that issue. Give it a try and let us know how it goes!
Finanzamt_Endgegner@reddit
That's amazing, I'll pass that info to some people who might have compute for bigger models (;
Thrumpwart@reddit
Great work, this is really ingenious. Looking forward to reading the paper.
hainesk@reddit
Could this be done with Qwen 3.5 or 3.6? Does this work with MoE models?
KptEmreU@reddit
oh yeah . pls give me a qwen3.5 9b with it 😄 My puny 16gb vram requires --- demands it!
Round-Beach7348@reddit
Yep, Orthrus-Qwen-3.5-9B will be available soon
Rare_Potential_1323@reddit
Better would be Carnice-9b
Round-Beach7348@reddit
Of course, Qwen 3.5 and 3.6 support is coming soon! This is also compatible with MoE models and other attention-based architectures.
StudentDifficult8240@reddit
How does it handle larger contexts? DFlash is great up to 16K but experiences severe PP regression at higher contexts.
Round-Beach7348@reddit
We tested up to 64K context and it works pretty well. DFlash's regression at longer contexts is likely because it maintains a separate external drafter with its own KV cache, which can cause distributional drift as context grows. Since Orthrus shares a single KV cache with no external drafter, it might not have that issue. Give it a try and let us know how it goes!
StudentDifficult8240@reddit
I will! Thank you
MerePotato@reddit
PP as in prompt processing or perplexity?
Round-Beach7348@reddit
perplexity
Party-Special-5177@reddit
They released a demo at 8B, but I'm really curious whether what this unlocks is tolerable tg in large models (>400B) without having to hold the whole model in VRAM (i.e., at reasonable quants).
How well does this speedup scale?
Round-Beach7348@reddit
Theoretically, the speedup should scale well with model size. The KV overhead is O(1) regardless of model size, so it won't eat into your VRAM budget; the only extra cost is the diffusion parameters, which are a small fraction of the total.
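As a back-of-the-envelope check on the ~4.5 MiB figure from the post (assuming Qwen3-8B's config: 36 layers, 8 KV heads, head dim 128, fp16 cache entries, and K=32 drafted positions; the paper's exact accounting may differ):

```python
# Rough KV footprint for K=32 drafted positions, assuming Qwen3-8B's config.
K, layers, kv_heads, head_dim, bytes_per_elem = 32, 36, 8, 128, 2  # fp16
per_layer = 2 * K * kv_heads * head_dim * bytes_per_elem  # K and V tensors
print(f"{layers * per_layer / 2**20:.1f} MiB")  # ~4.5 MiB, independent of context length
```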
knownboyofno@reddit
I took a quick look at this. It's a great start. I see the code, but the training pipeline isn't on GitHub. Also, it looks like you only tested up to a length of 2048. Have you tested it beyond that?
Round-Beach7348@reddit
Yes, the full training pipeline will be released soon. We've tested the model on context lengths up to 64K, and it works pretty well. Feel free to try it out and share any feedback if you run into edge cases.
letsgoiowa@reddit
Some questions for a smooth brain:
1. Will this work on MoE architectures?
2. Is there a downside?
3. Does this still work with CPU/RAM offload?
simcop2387@reddit
I imagine the main downsides are:
I'd bet it can work with CPU/RAM offload, but I doubt they've tried to optimize it in their research implementation. It'd need a little work to make sure all the layers are set up in the right places so it's not doing anything wonky like bouncing back and forth between the GPU and system RAM, but I'd think it would still make a difference in speed.
FerLuisxd@reddit
What about the difference in RAM usage?
okyaygokay@reddit
This, and what about macOS Metal?
oxygen_addiction@reddit
The community would probably pool money together to do this for Qwen 3.5 27B
Finanzamt_Endgegner@reddit
Bro there are people in the community with enough compute 👀
wesmo1@reddit
Does this need support to be added to llama.cpp?
Round-Beach7348@reddit
Yes, it would require custom kernel support. We are currently working on expanding model support, so llama.cpp is on the roadmap. Contributions are also welcome.
Endlesscrysis@reddit
Kind of curious why you went for older models?
Round-Beach7348@reddit
We started with Qwen3 since it's a well-documented dense baseline and ideal for validating the research idea. More models are coming soon!
Endlesscrysis@reddit
Okay, yes 3.5 9b please!
Round-Beach7348@reddit
Noted! We're working on adding Qwen3.5 and Qwen3.6, so stay tuned. Feel free to star the repo for updates!
Unhappy_Project_3723@reddit
Wow! I was here.
met_MY_verse@reddit
!RemindMe 12 hours