BeeLlama v0.3.1 – latest llama.cpp with extras! DFlash, MTP, q6_0 cache, TurboQuant. Single RTX 3090: Qwen 3.6 27B & Gemma 4 31B up to 177.8 tps (4.93x over baseline)
Posted by Anbeeld@reddit | LocalLLaMA | 42 comments
**BeeLlama v0.3.0 and v0.3.1 are here!** This is a big architectural update that aligns the fork with upstream llama.cpp and integrates its additions, such as MTP and Gemma 4 12B support, while also updating DFlash to handle complex configurations like multi-slot and multi-GPU.
Now also recommended by [club-3090](https://github.com/noonghunna/club-3090)! Thanks to [noonghunna](https://github.com/noonghunna) for inviting Bee to the club and for their help with testing v0.3.0 on a multi-GPU setup.
>Not quite a pegasus, but close enough.
[**GitHub**](https://github.com/Anbeeld/beellama.cpp) **|** [**Qwen 3.6 27B Quick Start**](https://github.com/Anbeeld/beellama.cpp/blob/main/docs/quickstart-qwen36-dflash.md) **|** [**Gemma 4 31B Quick Start**](https://github.com/Anbeeld/beellama.cpp/blob/main/docs/quickstart-gemma-4-31b-dflash.md)
* Updated to a much newer llama.cpp base: MTP, Gemma 4 12B, VRAM optimizations, unified llama app, backend improvements across CUDA, Metal, Vulkan, and more.
* Prebuilt binaries and Docker images are now provided for all major platforms.
* DFlash now works across multiple concurrent slots with shared drafter batching.
* Adaptive draft depth control got smarter: it seeds baselines, probes depths, backs off on failure, and resets per request.
* Multi-GPU DFlash now works (and quite decently) after many fixes and improvements.
* Faster speculative verification that fails safely on bad state.
* Better tool-call and reasoning output handling: earlier streaming, stale KV state clearing, isolated reasoning deltas.
* New cache and quantization options: `q6_0` KV cache, `TQ3_1S` and `TQ4_1S` models.
* ...and many more improvements!
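For readers curious what adaptive draft depth control can look like, here is a minimal Python sketch. It is not BeeLlama's actual implementation: the class name, the thresholds, and the simple probe/back-off rule are all assumptions made for illustration. It only mirrors the behaviors listed above (start from a seeded depth, probe deeper on success, back off on failure, reset per request).

```python
from dataclasses import dataclass


@dataclass
class DraftDepthController:
    """Illustrative controller; all names and thresholds are assumptions."""

    min_depth: int = 1
    max_depth: int = 16
    start_depth: int = 4        # seeded depth used at the start of a request
    backoff_ratio: float = 0.5  # back off when under half the draft is accepted

    def __post_init__(self) -> None:
        self.reset()

    def reset(self) -> None:
        """Start each request from the same seeded depth."""
        self.depth = self.start_depth

    def update(self, proposed: int, accepted: int) -> int:
        """Adjust depth after one verification step and return the new depth."""
        if proposed <= 0:
            return self.depth
        ratio = accepted / proposed
        if ratio >= 1.0:
            # Whole draft accepted: probe one token deeper.
            self.depth = min(self.max_depth, self.depth + 1)
        elif ratio < self.backoff_ratio:
            # Mostly rejected: back off quickly by halving the depth.
            self.depth = max(self.min_depth, self.depth // 2)
        # Otherwise hold the current depth.
        return self.depth
```

A server would call `update()` after every verification step and `reset()` when a new request begins. Halving on failure while growing by one on success is a deliberately conservative choice, since rejected draft tokens are wasted work.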
**Benchmarks**
These were run back on BeeLlama v0.2.0, but neither engine has had a *major* performance update since then, other than MTP getting 5-10% faster. [club-3090](https://github.com/noonghunna/club-3090) ran benchmarks of their own on v0.3.0, including a multi-GPU setup, and ended up recommending Bee as the default.
* Setup: Windows 11, AMD Ryzen 7 5700X3D, 32 GB DDR4 RAM, RTX 3090 24 GB
* Config: same as in quick start docs, but with reasoning off for non-chat prompts
* Baseline and MTP server used for comparison: llama.cpp [b9275](https://github.com/ggml-org/llama.cpp/releases/tag/b9275) CUDA 13.1 Windows prebuilt
* The full text of the benchmark prompts is in [README.md on GitHub](https://github.com/Anbeeld/beellama.cpp/blob/main/README.md#dflash-speedup)
**Qwen 3.6 27B**
Target model: [Qwen 3.6 27B Q5\_K\_S](https://huggingface.co/unsloth/Qwen3.6-27B-GGUF) or [Qwen 3.6 27B MTP Q5\_K\_S](https://huggingface.co/unsloth/Qwen3.6-27B-MTP-GGUF). DFlash model: [Q4\_K\_M](https://huggingface.co/Anbeeld/Qwen3.6-27B-DFlash-GGUF).
|Prompt|Server|Output|Median|Best|Speedup|Acceptance|
|:-|:-|:-|:-|:-|:-|:-|
|Task store module|Baseline|\~1K tok|37.2 tok/s|37.2 tok/s|1.00x|N/A|
|Task store module|DFlash|\~1K tok|**163.9 tok/s**|181.9 tok/s|**4.40x**|67.7% / 89.2%|
|Task store module|MTP|\~1K tok|69.3 tok/s|69.6 tok/s|1.86x|92.0% / 73.3%|
|KV report module|Baseline|\~1K tok|34.6 tok/s|36.5 tok/s|1.00x|N/A|
|KV report module|DFlash|\~1K tok|**157.7 tok/s**|162.5 tok/s|**4.56x**|58.8% / 88.9%|
|KV report module|MTP|\~1K tok|67.3 tok/s|68.1 tok/s|1.94x|89.3% / 73.0%|
|Doubly-linked list|Baseline|\~4K tok|36.8 tok/s|36.9 tok/s|1.00x|N/A|
|Doubly-linked list|DFlash|\~4K tok|**130.8 tok/s**|154.1 tok/s|**3.56x**|50.4% / 86.8%|
|Doubly-linked list|MTP|\~4K tok|66.3 tok/s|68.0 tok/s|1.80x|87.8% / 72.5%|
|Prompt processing|Baseline|\~20K tok|1229.5 tok/s|1229.5 tok/s|1.00x|N/A|
|Prompt processing|DFlash|\~20K tok|**1214.4 tok/s**|1221.7 tok/s|**0.99x**|N/A|
|Prompt processing|MTP|\~20K tok|1162.6 tok/s|1164.7 tok/s|0.95x|N/A|
|Multi-turn coding|Baseline|\~28K tok|33.3 tok/s|33.3 tok/s|1.00x|N/A|
|Multi-turn coding|DFlash|\~30K tok|**64.6 tok/s**|65.4 tok/s|**1.94x**|24.9% / 72.9%|
|Multi-turn coding|MTP|\~34K tok|56.5 tok/s|56.5 tok/s|1.70x|71.9% / 68.3%|
*Acceptance: accepted draft tokens as a share of proposed draft tokens / accepted draft tokens as a share of final generated tokens*
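To make the two acceptance numbers concrete, here is a small Python sketch of how they could be computed from raw counters. The function name and the token counts in the usage example are hypothetical and are not taken from the benchmark logs; only the two ratio definitions come from the note above.

```python
def acceptance_rates(proposed: int, accepted: int, generated: int) -> tuple[float, float]:
    """Return (accepted / proposed, accepted / generated) as fractions.

    proposed  -- draft tokens the drafter offered for verification
    accepted  -- draft tokens the target model kept
    generated -- all tokens in the final output (drafted or not)
    """
    if proposed <= 0 or generated <= 0:
        raise ValueError("proposed and generated must be positive")
    return accepted / proposed, accepted / generated


# Hypothetical counters for a ~1K token answer (not real benchmark data).
draft_rate, output_share = acceptance_rates(proposed=1300, accepted=880, generated=1000)
print(f"{draft_rate:.1%} / {output_share:.1%}")
```

Read this way, a low first number next to a high second number suggests the drafter proposes aggressively: many proposals are rejected, yet most of the final text still came from accepted draft tokens.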
**Gemma 4 31B**
Target model: [Gemma 4 31B Q4\_K\_S](https://huggingface.co/unsloth/gemma-4-31b-it-GGUF). DFlash model: [Q5\_K\_M](https://huggingface.co/Anbeeld/gemma-4-31B-it-DFlash-GGUF).
|Prompt|Server|Output|Median|Best|Speedup|Acceptance|
|:-|:-|:-|:-|:-|:-|:-|
|Task store module|Baseline|\~1K tok|36.1 tok/s|36.1 tok/s|1.00x|N/A|
|Task store module|DFlash|\~1K tok|**177.8 tok/s**|182.0 tok/s|**4.93x**|65.7% / 90.0%|
|KV report module|Baseline|\~1K tok|35.9 tok/s|36.0 tok/s|1.00x|N/A|
|KV report module|DFlash|\~1K tok|**154.3 tok/s**|162.8 tok/s|**4.29x**|55.7% / 88.6%|
|Doubly-linked list|Baseline|\~1.9K tok|36.0 tok/s|36.0 tok/s|1.00x|N/A|
|Doubly-linked list|DFlash|\~1.9K tok|**116.6 tok/s**|127.3 tok/s|**3.24x**|44.5% / 84.9%|
|Prompt processing|Baseline|\~24K tok|1021.3 tok/s|1021.3 tok/s|1.00x|N/A|
|Prompt processing|DFlash|\~24K tok|**954.5 tok/s**|954.9 tok/s|**0.93x**|N/A|
|Multi-turn coding|Baseline|\~12K tok|34.8 tok/s|34.8 tok/s|1.00x|N/A|
|Multi-turn coding|DFlash|\~12K tok|**60.6 tok/s**|64.1 tok/s|**1.74x**|24.4% / 72.3%|
*Acceptance: accepted draft tokens as a share of proposed draft tokens / accepted draft tokens as a share of final generated tokens*
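The Speedup column is consistent with dividing each server's median throughput by the baseline median for the same prompt. A quick Python check using two Gemma 4 31B rows from the table above follows; the helper name is made up for this example, and small last-digit differences on other rows are possible if the published speedups were computed from unrounded medians.

```python
def speedup(median_tps: float, baseline_median_tps: float) -> float:
    """Speedup of a server over the baseline, rounded to two decimals."""
    if baseline_median_tps <= 0:
        raise ValueError("baseline throughput must be positive")
    return round(median_tps / baseline_median_tps, 2)


# Task store module: DFlash median vs. baseline median.
print(speedup(177.8, 36.1))    # 4.93
# Prompt processing: DFlash median vs. baseline median.
print(speedup(954.5, 1021.3))  # 0.93
```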