Orthrus-Qwen3-8B: up to 7.8× tokens/forward on Qwen3-8B, frozen backbone, provably identical output distribution
Posted by Franck_Dernoncourt@reddit | LocalLLaMA | View on Reddit | 33 comments
- Code: https://github.com/chiennv2000/orthrus
- Paper: https://arxiv.org/abs/2605.12825
- HF: https://huggingface.co/chiennv/Orthrus-Qwen3-1.7B ; https://huggingface.co/chiennv/Orthrus-Qwen3-4B ; https://huggingface.co/chiennv/Orthrus-Qwen3-8B
- Disclosure: co-author.
Idea: Inject a trainable diffusion attention module into each layer of a frozen AR Transformer. Both heads share one KV cache. The diffusion head drafts K=32 tokens in parallel; the AR head verifies them in a second pass and accepts the longest matching prefix. The output distribution is provably identical to the base model's.
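A minimal sketch of the greedy accept rule (assumptions: PyTorch, greedy decoding, and that the verification pass returns the AR head's argmax for each drafted position plus one extra position; the function name is a hypothetical placeholder, not the repo's API):

```python
import torch

def accept_longest_prefix(draft: torch.Tensor, ar_argmax: torch.Tensor) -> torch.Tensor:
    """Keep drafted tokens while they match the AR head's argmax at the same
    position, then append the AR token at the first mismatch, so every
    verification pass yields at least one token."""
    n = 0
    while n < draft.numel() and draft[n] == ar_argmax[n]:
        n += 1
    return torch.cat([draft[:n], ar_argmax[n:n + 1]])

# Toy example with K=8 drafted tokens: 5 positions match, so 6 tokens are emitted.
draft = torch.tensor([11, 22, 33, 44, 55, 9, 9, 9])
ar    = torch.tensor([11, 22, 33, 44, 55, 66, 77, 88, 99])  # K+1 verified positions
print(accept_longest_prefix(draft, ar))  # tensor([11, 22, 33, 44, 55, 66])
```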
Results:
- Up to 7.8× TPF, ~6× wall-clock on MATH-500.
- 16% of params trained, <1B tokens, 24h on 8×H200.
- vs. diffusion LMs (Dream, Fast-dLLM-v2, SDAR, Mercury, Gemini Diffusion): they modify base weights and lose accuracy (Fast-dLLM-v2: -11 pts on MATH-500). Orthrus freezes the backbone; accuracy matches Qwen3-8B exactly.
- vs. Speculative Decoding (EAGLE-3, DFlash): no external drafter, no separate cache, and zero Time-To-First-Token (TTFT) penalty because we don't have to initialize and sync a separate drafter model. KV overhead is O(1) (~4.5 MiB flat). Acceptance length on MATH-500: 11.7 vs. 7.9 (DFlash) vs. 3.5 (EAGLE-3).
- Single-step denoising beats multi-step (6.35 vs. 3.53 TPF). KL distillation beats CE on acceptance rate.
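On the last point, a rough sketch of what a per-position KL distillation objective can look like (shapes, the direction of the KL, and the function name are assumptions; the paper's exact loss may differ):

```python
import torch
import torch.nn.functional as F

def kl_distill_loss(draft_logits: torch.Tensor, base_logits: torch.Tensor) -> torch.Tensor:
    """KL(base || draft) averaged over drafted positions: the trainable diffusion
    head is pulled toward the frozen AR head's full next-token distribution,
    instead of toward one-hot gold tokens as with a plain CE objective."""
    log_p_draft = F.log_softmax(draft_logits, dim=-1)
    p_base = F.softmax(base_logits, dim=-1)
    return F.kl_div(log_p_draft, p_base, reduction="batchmean")

# Toy shapes: 2 sequences x K=32 drafted positions, Qwen3 vocab size ~151k.
loss = kl_distill_loss(torch.randn(64, 151936), torch.randn(64, 151936))
```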
Limitations: strictly bounded by the frozen base model (inherits its biases, hallucinations, knowledge gaps); Qwen3-only evaluation; greedy + rejection sampling only.
Finanzamt_Endgegner@reddit
How is it on longer contexts? DFlash has issues with that, for example.
Round-Beach7348@reddit
We tested up to 64K context and it works pretty well. DFlash's regression at longer contexts is likely because it maintains a separate external drafter with its own KV cache, which can cause distributional drift as context grows. Since Orthrus shares a single KV cache with no external drafter, it might not have that issue. Give it a try and let us know how it goes!
Finanzamt_Endgegner@reddit
That's amazing, I'll pass that info to some people who might have compute for bigger models (;
Thrumpwart@reddit
Great work, this is really ingenious. Looking forward to reading the paper.
hainesk@reddit
Could this be done with Qwen 3.5 or 3.6? Does this work with MoE models?
KptEmreU@reddit
oh yeah . pls give me a qwen3.5 9b with it 😄 My puny 16gb vram requires --- demands it!
Round-Beach7348@reddit
Yep, Orthrus-Qwen-3.5-9B will be available soon
Rare_Potential_1323@reddit
Better would be Carnice-9b
Round-Beach7348@reddit
Of course, Qwen 3.5 and 3.6 support is coming soon! This is also compatible with MoE models and other attention-based architectures.
StudentDifficult8240@reddit
How does it handle larger contexts? DFlash is great up to 16K but experiences severe PP regression at higher contexts.
Round-Beach7348@reddit
We tested up to 64K context and it works pretty well. DFlash's regression at longer contexts is likely because it maintains a separate external drafter with its own KV cache, which can cause distributional drift as context grows. Since Orthrus shares a single KV cache with no external drafter, it might not have that issue. Give it a try and let us know how it goes!
StudentDifficult8240@reddit
I will! Thank you
MerePotato@reddit
PP as in prompt processing or perplexity?
Round-Beach7348@reddit
perplexity
Party-Special-5177@reddit
They released a demo at 8B, but I'm really curious whether what this unlocks is tolerable tg in large models (>400B) without having to hold the whole model in VRAM (i.e., at reasonable quants).
How well does this speedup scale?
Round-Beach7348@reddit
Theoretically, the speedup should scale well with model size. The KV overhead is O(1) regardless of model size, so it won't eat into your VRAM budget; the only extra cost is the diffusion parameters, which are a small fraction of the total.
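As a back-of-the-envelope check on the ~4.5 MiB figure from the post (assuming Qwen3-8B's config: 36 layers, 8 KV heads, head dim 128, fp16 cache entries, and K=32 drafted positions; the paper's exact accounting may differ):

```python
# Rough KV footprint for K=32 drafted positions, assuming Qwen3-8B's config.
K, layers, kv_heads, head_dim, bytes_per_elem = 32, 36, 8, 128, 2  # fp16
per_layer = 2 * K * kv_heads * head_dim * bytes_per_elem  # K and V tensors
print(f"{layers * per_layer / 2**20:.1f} MiB")  # ~4.5 MiB, independent of context length
```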
knownboyofno@reddit
I took a quick look at this. It's a great start. I see the code, but the training pipeline isn't on GitHub. Also, it looks like you only tested up to a length of 2048. Have you tested it beyond that?
Round-Beach7348@reddit
Yes, the full training pipeline will be released soon. We've tested the model on context lengths up to 64K, and it works pretty well. Feel free to try it out and share any feedback if you run into edge cases.
letsgoiowa@reddit
Some questions for a smooth brain:
1. Will this work on MoE architectures?
2. Is there a downside?
3. Does this still work with CPU/RAM offload?
simcop2387@reddit
I imagine the main downsides are:
I'd bet it can work with CPU/RAM offload, but I doubt they've tried to optimize it in their research implementation. It'd need a little work to make sure all the layers are set up in the right places so it's not doing anything wonky like bouncing back and forth between the GPU and system RAM, but I'd think it would still make a difference in speed.
FerLuisxd@reddit
What about the difference in RAM usage?
okyaygokay@reddit
This, and what about macOS Metal?
oxygen_addiction@reddit
The community would probably pool money together to do this for Qwen 3.5 27B
Finanzamt_Endgegner@reddit
Bro there are people in the community with enough compute 👀
wesmo1@reddit
Does this need support to be added to llama.cpp?
Round-Beach7348@reddit
Yes, it would require custom kernel support. We are currently working on expanding model support, so llama.cpp is on the roadmap. Contributions are also welcome.
Endlesscrysis@reddit
Kind of curious why you went for older models?
Round-Beach7348@reddit
We started with Qwen3 since it's a well-documented dense baseline and ideal for validating the research idea. More models are coming soon!
Endlesscrysis@reddit
Okay, yes 3.5 9b please!
Round-Beach7348@reddit
Noted! We're working on adding Qwen3.5 and Qwen3.6, so stay tuned. Feel free to star the repo for updates!
Unhappy_Project_3723@reddit
Wow! I was here.
met_MY_verse@reddit
!RemindMe 12 hours