BeeLlama.cpp: advanced DFlash & TurboQuant with support for reasoning and vision. Qwen 3.6 27B Q5 with 200k context on a 3090, 2-3x faster than baseline (peak 135 tps!)

Posted by Anbeeld@reddit | LocalLLaMA

TL;DR New llama.cpp fork! I wanted a Windows-friendly inference stack to run Qwen 3.6 27B Q5 on a single RTX 3090 with speculative decoding, high context without excessive quantization, and vision enabled. Nothing did this out of the box for me without VRAM and/or tooling issues (this was before the MTP PR for llama.cpp surfaced), so I pulled out an old trick: stay up until 4am one too many times and cram a month-plus of work into a week or two. Now I have what seems to be the solution, and I don't mind sharing.

Anbeeld's BeeLlama.cpp

BeeLlama.cpp (or just Bee) is Anbeeld's performance-focused llama.cpp fork for squeezing more speed and context out of local GGUF inference. It keeps the familiar llama.cpp tools, server flow, and model compatibility, then adds DFlash speculative decoding, adaptive draft control, TurboQuant/TCQ KV-cache compression, reasoning-loop protection, full multimodal support, and experimental speculation modes.

Not quite a pegasus, but close enough.
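For a rough sense of what DFlash speculative decoding and adaptive draft control are doing under the hood, here's a minimal, runnable Python sketch of the generic draft-and-verify loop. The toy "models" and the greedy acceptance rule are simplified illustrations of the general technique, not Bee's actual implementation or API.

```python
# Toy draft-and-verify speculative decoding. The "models" below just replay
# fixed sequences; real DFlash and Bee's adaptive draft control are far more
# involved. Illustration only, not Bee's code.

TARGET_SEQ = [3, 1, 4, 1, 5, 9, 2, 6]   # what the big model "wants" to emit
DRAFT_SEQ  = [3, 1, 4, 1, 7, 9, 2, 6]   # cheap draft, wrong at position 4

def draft_next(pos):
    # Cheap model: one token per call, like a small draft step.
    return DRAFT_SEQ[pos]

def target_verify(start, proposed):
    # Big model: scores the whole drafted block in a single forward pass,
    # returning its own prediction for every drafted slot plus one bonus slot.
    return [TARGET_SEQ[start + i] for i in range(len(proposed) + 1)]

def speculative_step(pos, draft_len):
    drafted = [draft_next(pos + i) for i in range(draft_len)]
    verified = target_verify(pos, drafted)        # len(drafted) + 1 predictions
    accepted = []
    for d, t in zip(drafted, verified):
        accepted.append(t)                        # target's token is always kept
        if d != t:                                # first mismatch ends the free ride
            break
    else:
        accepted.append(verified[-1])             # full match: one extra token for free
    return accepted

print(speculative_step(0, 6))   # -> [3, 1, 4, 1, 5]: 5 tokens from one target pass
```

In the real thing the draft length isn't fixed either: adaptive draft control presumably widens the speculation window while the target keeps accepting and shrinks it when acceptance drops, so drafting doesn't turn into pure overhead.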

Here's a plug-and-play Qwen 3.6 27B setup with a config to run it in Q5 + 200k of practically lossless KV cache + vision on a single RTX 3090 or 4090.
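To see why cache compression matters at this context length, here's some back-of-the-envelope KV-cache arithmetic. The layer/head numbers are placeholders (the post doesn't spell out Qwen 3.6 27B's architecture); only the formula and the orders of magnitude are the point.

```python
# Rough KV-cache size estimate. n_layers / n_kv_heads / head_dim are ASSUMED
# placeholder values, not the real Qwen 3.6 27B architecture.

n_layers   = 48        # assumed
n_kv_heads = 8         # assumed (GQA)
head_dim   = 128       # assumed
ctx        = 200_000   # tokens of context

def kv_cache_gib(bits_per_value):
    # 2x for K and V, one (n_kv_heads * head_dim) vector per layer per token.
    bytes_total = 2 * n_layers * n_kv_heads * head_dim * ctx * bits_per_value / 8
    return bytes_total / 2**30

for name, bits in [("f16", 16), ("8-bit", 8), ("4-bit", 4), ("~2.5-bit (TCQ-ish)", 2.5)]:
    print(f"{name:>18}: {kv_cache_gib(bits):5.1f} GiB")
```

With ~27B parameters, the Q5 weights alone land somewhere around 18-19 GB, so an f16 cache at 200k tokens is clearly out of the question on a 24 GB card; that's the gap TurboQuant/TCQ cache compression is meant to close.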

Fork Features

For the full feature and public-repo comparison, read docs/beellama-features.md. For the complete argument reference, read docs/beellama-args.md.

TurboQuant (WHT-based scalar quantization) originates from TheTom/llama-cpp-turboquant. TCQ (Trellis-Coded Quantization) and the basic DFlash implementation originate from spiritbuun/buun-llama-cpp (paper: Closing the Gap: Trellis-Coded Quantization for KV Cache at 2-3 Bits).
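For the curious, the core idea behind WHT-based quantization is "rotate, then scalar-quantize": a Walsh-Hadamard transform spreads activation outliers across the whole vector, so a plain low-bit scalar quantizer loses much less. Below is a small, self-contained illustration of that trick; it is not TurboQuant's actual kernel, and TCQ's trellis coding is a different, more involved scheme entirely.

```python
# Illustration of rotate-then-quantize: compare plain 4-bit scalar quantization
# against WHT + 4-bit on a vector with one large outlier. Not TurboQuant's code.

import numpy as np

def fwht(x):
    """Orthonormal fast Walsh-Hadamard transform (it is its own inverse)."""
    x = x.copy()
    n = len(x)                      # must be a power of two
    h = 1
    while h < n:
        for i in range(0, n, h * 2):
            a = x[i:i + h].copy()
            b = x[i + h:i + 2 * h].copy()
            x[i:i + h] = a + b
            x[i + h:i + 2 * h] = a - b
        h *= 2
    return x / np.sqrt(n)

def quantize_4bit(x):
    scale = np.max(np.abs(x)) / 7.0          # symmetric 4-bit: levels -7..7
    q = np.clip(np.round(x / scale), -7, 7)
    return q, scale

rng = np.random.default_rng(0)
v = rng.normal(size=128)
v[3] = 40.0                                   # one outlier, typical of K/V activations

for label, vec in [("plain 4-bit", v), ("WHT + 4-bit", fwht(v))]:
    q, s = quantize_4bit(vec)
    rec = q * s
    rec = rec if label.startswith("plain") else fwht(rec)   # undo the rotation
    err = np.linalg.norm(rec - v) / np.linalg.norm(v)
    print(f"{label:>12}: relative error {err:.3f}")
```

Running it, the WHT-rotated version should come out with several times lower relative error on the outlier-heavy vector, which is the whole argument for rotating before quantizing the KV cache.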