Orthrus-Qwen3-8B: up to 7.8× tokens/forward on Qwen3-8B, frozen backbone, provably identical output distribution

Posted by Franck_Dernoncourt@reddit | LocalLLaMA | View on Reddit | 33 comments

Idea: inject a trainable diffusion attention module into each layer of a frozen AR Transformer. Both heads share a single KV cache. The diffusion head drafts K=32 tokens in parallel; the AR head verifies them in a second forward pass and accepts the longest matching prefix. The output distribution is provably identical to the base model's.
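Under greedy decoding, that verify-and-accept step reduces to a longest-matching-prefix check. Here is a minimal sketch of one such step, assuming hypothetical callables `draft_next_k` (diffusion head) and `verify_logits` (AR head over the shared KV cache) standing in for the actual Orthrus internals:

```python
# Sketch of one draft-and-verify step under greedy decoding.
# `draft_next_k` and `verify_logits` are hypothetical stand-ins, not the Orthrus API.
import torch

def speculative_step(prefix: torch.Tensor, draft_next_k, verify_logits, k: int = 32) -> torch.Tensor:
    """Propose k tokens with the drafter, keep the longest prefix the AR head agrees with."""
    draft = draft_next_k(prefix, k)                        # (k,) proposed token ids
    candidate = torch.cat([prefix, draft])                 # prefix followed by all drafted tokens
    logits = verify_logits(candidate)                      # (len(candidate), vocab) in one forward pass
    # The AR prediction for drafted position i is the argmax of the logits one position earlier.
    ar_pred = logits[len(prefix) - 1 : len(candidate) - 1].argmax(dim=-1)
    matches = (ar_pred == draft)
    n_accept = int(matches.cumprod(dim=0).sum())           # length of the longest agreeing prefix
    accepted = draft[:n_accept]
    # The first disagreeing position (or the position after a full match) is filled
    # by the AR head itself, so the output equals plain greedy decoding of the base model.
    bonus = logits[len(prefix) - 1 + n_accept].argmax(dim=-1, keepdim=True)
    return torch.cat([prefix, accepted, bonus])
```

Because every emitted token is either confirmed or produced directly by the AR head, the accepted sequence never deviates from what the base model would generate greedily; the drafter only decides how many tokens each forward pass can commit.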

Results:

Limitations: output quality is strictly bounded by the frozen base model (it inherits its biases, hallucinations, and knowledge gaps); evaluation is on Qwen3 only; only greedy decoding and rejection sampling are supported.
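For sampling, the "provably identical output distribution" claim rests on the standard speculative-sampling acceptance rule: accept a drafted token x with probability min(1, p(x)/q(x)), where p is the base model's distribution and q the drafter's, otherwise resample from the normalized residual (p - q)+. The per-token sketch below uses my own notation and is not taken from the Orthrus code:

```python
# Standard speculative-sampling acceptance for one drafted token (illustrative notation,
# not confirmed as Orthrus' exact procedure). p: base-model probs, q: drafter probs.
import torch

def accept_or_resample(x: int, p: torch.Tensor, q: torch.Tensor) -> tuple[int, bool]:
    """Accept x with prob min(1, p[x]/q[x]); otherwise draw from the residual (p - q)+."""
    if torch.rand(()) < torch.clamp(p[x] / q[x], max=1.0):
        return x, True
    residual = torch.clamp(p - q, min=0.0)
    residual /= residual.sum()
    return int(torch.multinomial(residual, 1)), False
```

Applied position by position to the drafted block, this rule yields samples distributed exactly as the base model's, which is why speedup comes for free only in throughput, not in quality.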