BeeLlama v0.2.0 – major DFlash update. Single RTX 3090: Qwen 3.6 27B up to 164 tps (4.40x), Gemma 4 31B up to 177.8 tps (4.93x). Prompt processing speed near baseline.

Posted by Anbeeld@reddit | LocalLLaMA | 32 comments

BeeLlama v0.2.0 is here!

Not quite a pegasus, but close enough.

GitHub | Qwen 3.6 27B Quick Start | Gemma 4 31B Quick Start

Full Gemma 4 31B support with efficient DFlash implementation and vision.
Major Qwen 3.6 27B performance update from lower DFlash overhead, cleaner prefill handling, drafter K/V projection caching, and safer CUDA execution.
DFlash GGUFs with upstream architecture are now supported.
Fixes to adaptive profit behavior around baseline probing.
Reduced verifier path is stricter now, with safer fallback to full logits when grammar, sampler state, or reasoning requires it.
Reasoning and tool-call boundaries were tightened.
Stricter draft/target validation and better draft-model discovery.
...and many more improvements!

Benchmarks

Setup: Windows 11, AMD Ryzen 7 5700X3D, 32 GB DDR4 RAM, RTX 3090 24 GB
Config: same as in quick start docs, but with reasoning off for non-chat prompts
Baseline and MTP server in comparison: llama.cpp b9275 CUDA 13.1 Windows prebuilt
The full text of the benchmark prompts is in README.md on GitHub

Qwen 3.6 27B

Target model: Qwen 3.6 27B Q5_K_S or Qwen 3.6 27B MTP Q5_K_S. DFlash model: Q4_K_M.

Prompt	Server	Output	Median	Best	Speedup	Acceptance
Task store module	Baseline	~1K tok	37.2 tok/s	37.2 tok/s	1.00x	N/A
Task store module	DFlash	~1K tok	163.9 tok/s	181.9 tok/s	4.40x	67.7% / 89.2%
Task store module	MTP	~1K tok	69.3 tok/s	69.6 tok/s	1.86x	92.0% / 73.3%
KV report module	Baseline	~1K tok	34.6 tok/s	36.5 tok/s	1.00x	N/A
KV report module	DFlash	~1K tok	157.7 tok/s	162.5 tok/s	4.56x	58.8% / 88.9%
KV report module	MTP	~1K tok	67.3 tok/s	68.1 tok/s	1.94x	89.3% / 73.0%
Doubly-linked list	Baseline	~4K tok	36.8 tok/s	36.9 tok/s	1.00x	N/A
Doubly-linked list	DFlash	~4K tok	130.8 tok/s	154.1 tok/s	3.56x	50.4% / 86.8%
Doubly-linked list	MTP	~4K tok	66.3 tok/s	68.0 tok/s	1.80x	87.8% / 72.5%
Prompt processing	Baseline	~20K tok	1229.5 tok/s	1229.5 tok/s	1.00x	N/A
Prompt processing	DFlash	~20K tok	1214.4 tok/s	1221.7 tok/s	0.99x	N/A
Prompt processing	MTP	~20K tok	1162.6 tok/s	1164.7 tok/s	0.95x	N/A
Multi-turn coding	Baseline	~28K tok	33.3 tok/s	33.3 tok/s	1.00x	N/A
Multi-turn coding	DFlash	~30K tok	64.6 tok/s	65.4 tok/s	1.94x	24.9% / 72.9%
Multi-turn coding	MTP	~34K tok	56.5 tok/s	56.5 tok/s	1.70x	71.9% / 68.3%

Acceptance: accepted to proposed draft tokens / accepted draft tokens to final generated tokens
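
To make the two Acceptance figures concrete, here is a minimal sketch of how they could be computed from raw counters. The function name and the counter values are made up for illustration (they are picked so the result matches the 67.7% / 89.2% of the first DFlash row); the actual servers may count things differently.

```python
def acceptance_metrics(proposed: int, accepted: int, generated: int) -> tuple[float, float]:
    """Return (accepted / proposed, accepted / generated), both as percentages."""
    if proposed <= 0 or generated <= 0:
        raise ValueError("proposed and generated must be positive")
    return 100.0 * accepted / proposed, 100.0 * accepted / generated


# Hypothetical counters, not measured data.
draft_rate, output_share = acceptance_metrics(proposed=1000, accepted=677, generated=759)
print(f"{draft_rate:.1f}% / {output_share:.1f}%")
```

The first number says how often a drafted token survives verification; the second says how much of the final output came from the drafter rather than from the target model directly.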

Gemma 4 31B

Target model: Gemma 4 31B Q4_K_S. DFlash model: Q5_K_M.

Prompt	Server	Output	Median	Best	Speedup	Acceptance
Task store module	Baseline	~1K tok	36.1 tok/s	36.1 tok/s	1.00x	N/A
Task store module	DFlash	~1K tok	177.8 tok/s	182.0 tok/s	4.93x	65.7% / 90.0%
KV report module	Baseline	~1K tok	35.9 tok/s	36.0 tok/s	1.00x	N/A
KV report module	DFlash	~1K tok	154.3 tok/s	162.8 tok/s	4.29x	55.7% / 88.6%
Doubly-linked list	Baseline	~1.9K tok	36.0 tok/s	36.0 tok/s	1.00x	N/A
Doubly-linked list	DFlash	~1.9K tok	116.6 tok/s	127.3 tok/s	3.24x	44.5% / 84.9%
Prompt processing	Baseline	~24K tok	1021.3 tok/s	1021.3 tok/s	1.00x	N/A
Prompt processing	DFlash	~24K tok	954.5 tok/s	954.9 tok/s	0.93x	N/A
Multi-turn coding	Baseline	~12K tok	34.8 tok/s	34.8 tok/s	1.00x	N/A
Multi-turn coding	DFlash	~12K tok	60.6 tok/s	64.1 tok/s	1.74x	24.4% / 72.3%

Acceptance: accepted to proposed draft tokens / accepted draft tokens to final generated tokens
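
The post does not define the Speedup column, but it appears to be the server's median throughput divided by the baseline median for the same prompt. A small sketch under that assumption, using medians copied from the tables (the `speedup` helper is just for illustration):

```python
def speedup(median_tps: float, baseline_median_tps: float) -> float:
    """Ratio of a server's median tok/s to the baseline median tok/s."""
    return median_tps / baseline_median_tps


# (server median, baseline median) in tok/s, copied from the tables above.
rows = {
    "Qwen 3.6 27B, task store, DFlash": (163.9, 37.2),
    "Qwen 3.6 27B, task store, MTP": (69.3, 37.2),
    "Gemma 4 31B, task store, DFlash": (177.8, 36.1),
}

for name, (tps, base) in rows.items():
    print(f"{name}: {speedup(tps, base):.2f}x")
```

Because the table medians are already rounded to one decimal, a recomputed ratio can differ from the published one in the last digit.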

[-]

Anbeeld@reddit (OP)

If it fits on baseline llama.cpp, then sure. The whole DFlash thing needs like 1-2 GB.

But as you probably have a multi-GPU setup, I must warn you that it's not properly supported yet. I don't own a multi-GPU rig myself, so I apply fixes based on reports, and there haven't been many of them lately.

On the other hand, if you try it and something doesn't work, please file a report in GitHub issues. It would help a lot.

[-]

wgaca2@reddit

I will check it out over the weekend. I have Q8 fitting tight, so I might have to adjust a bit.

[-]

I have an RTX 3090 as well, currently running the latest version of llama.cpp with MTP support and getting around 50 tps with Hermes agent. How is DFlash better? Would it give me higher output? I use my Hermes agent with llm-wiki to process my notes, plus a few crons for scraping websites. I was looking to set up PI over the weekend and do some coding with Qwen 3.6 27B. Would switching to beellama.cpp with DFlash be useful and worth the hassle? Sorry, I'm not that good with local LLMs yet, so a little help would be great.

[-]

Zarzou@reddit

bee-llama -- forked --> buun-llama -- forked --> TheTom/llama -- forked --> llama.cpp

IMHO such fragmentation is to be avoided.

[-]

Anbeeld@reddit (OP)

What I care about is being able to ship good stuff to the community, that's it.

[-]

segmond@reddit

The community is not using all of those. If you care about the community, you will put that effort into making a decent PR that goes back into llama.cpp.

[-]

pmttyji@reddit

Can you add Qwen3.5-9B MTP to the Plug-and-Play Setups? Many of us could run a 9B model with less VRAM. Also add Qwen3.6-35B-A3B & Gemma-4-26BA4B for the same reason.

[-]

Anbeeld@reddit (OP)

I will add more models over time, for sure. For MoE I'm planning a separate update focused solely on them. It's just that v0.2.0 ended up being a crazy time sink, and there's also upstream MTP and the general spec architecture that I now need to merge, so for now there's quite a lot of stuff to do.

[-]

m0py@reddit

As a 5070 owner, looking forward to the smaller models, especially the MoE versions. Thank you for your work!

[-]

caetydid@reddit

Are the speed gains for Qwen MTP expected to be lower, or is it just not optimized? I'm wondering because its acceptance rates are high compared to DFlash.

[-]

Anbeeld@reddit (OP)

MTP uses a fixed draft-n-max of 3 by default, which I left as is for the benchmarks. BeeLlama DFlash uses draft-n-max 16 and a profit-based adaptive controller that dynamically lowers it.

So the reason DFlash has much lower acceptance is that it drafts many more tokens. Drafting was made cheap back in v0.1.2, so it's more profitable to try more, even if most fail.

The important number here is the second one: accepted draft tokens to final generated tokens. In every single benchmark DFlash had it higher than MTP did. This means the "acceptance rates are high compared to DFlash" statement is not correct here.

MTP just has a high accepted-to-proposed ratio, but without context it doesn't mean anything, and the context is that it drafts very conservatively.
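
The two acceptance figures can be turned into per-step numbers with a bit of arithmetic. This is a rough sketch that assumes each verification step emits exactly one token from the target model on top of the accepted draft tokens, as in textbook speculative decoding; whether BeeLlama or llama.cpp count it exactly this way is not confirmed here. The helper names are made up, and the inputs are the "Task store module" rows of the Qwen 3.6 27B table.

```python
def accepted_per_step(output_share: float) -> float:
    """Mean accepted draft tokens per verification step.

    If a step yields `a` accepted draft tokens plus 1 target token,
    then output_share = a / (a + 1), so a = output_share / (1 - output_share).
    """
    if not 0.0 <= output_share < 1.0:
        raise ValueError("output_share must be in [0, 1)")
    return output_share / (1.0 - output_share)


def proposed_per_step(accept_rate: float, output_share: float) -> float:
    """Mean drafted tokens per step, given accepted / proposed = accept_rate."""
    return accepted_per_step(output_share) / accept_rate


# (accepted / proposed, accepted / generated) from the benchmark table.
for name, rate, share in [("DFlash", 0.677, 0.892), ("MTP", 0.920, 0.733)]:
    a = accepted_per_step(share)
    p = proposed_per_step(rate, share)
    print(f"{name}: ~{a:.1f} accepted of ~{p:.1f} drafted tokens per step")
```

Under that assumption MTP lands at roughly 3 drafted tokens per step, which matches its draft-n-max of 3, while DFlash drafts about 12 and keeps about 8 of them. That is why the lower accepted-to-proposed ratio still translates into more tokens per target pass.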

[-]

Infamous-Play-3743@reddit

Interesting but for the web

[-]

craftogrammer@reddit

Looking great. Is there something for us 16GB VRAM poors? 🫡 Thanks!

[-]

Anbeeld@reddit (OP)

Unfortunately I didn't have time to tinker with smaller models yet, but if they use the same architecture as these 2 headliners, it might very well work as is.

[-]

craftogrammer@reddit

Yeah, makes sense, at least if Q3 can be used for just tool calling. I will try that tonight with your new version. Thanks for sharing.

[-]

Qwen_os_has_died@reddit

Beellama ...

[-]

Anbeeld@reddit (OP)

Not quite a pegasus, but close enough.

[-]

sagiroth@reddit

This is incredible. Squeezing that 3090 like a lemon. Keep up the good work man

[-]

Anbeeld@reddit (OP)

Wait until I get a second one, somehow. The juice will flow like a river.

[-]

Poha_Best_Breakfast@reddit

Isn’t DFlash support still pending on llama.cpp mainline?

[-]

Anbeeld@reddit (OP)

Well, that's the power of forks: no idea what mainline's stance on DFlash is. I just did some stuff, said stuff seems to work, and I share the stuff with the community.

[-]

caetydid@reddit

and here goes my evening...

[-]

sagiroth@reddit

Too real. Every single time something new drops. I'm grateful we have such incredible people with lots of spare time.

[-]

FerLuisxd@reddit

MTP seems so slow. I saw other comparisons, but this one looks very different. Any reason for that? Not optimized yet?

[-]

Anbeeld@reddit (OP)

I used default settings for both methods, one of the latest llama.cpp builds, and the usual unsloth models. Maybe MTP benefits from some tinkering with draft-n-max and whatnot, and that's what you've seen in other tests? Unfortunately, I had too much benchmarking to do with my own stuff to explore that myself.

[-]

Shockersam@reddit

Can someone enlighten me on whether there are any accuracy drops when using DFlash and/or MTP?

[-]

Anbeeld@reddit (OP)

There shouldn't be any with a proper implementation. All output of the drafter is verified by the target model, so the lil bro can't just output some nonsense unsupervised.
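
For anyone curious what "verified by the target model" means in practice, here is a toy sketch of the textbook greedy draft-then-verify loop. It is not BeeLlama's code: both "models" are tiny made-up functions, the drafter here is autoregressive (how DFlash actually drafts is not modeled), and a real server verifies the whole draft in one batched target pass instead of a Python loop. Sampling-based variants use a probabilistic accept/reject rule instead of exact matching.

```python
from typing import Callable, List

NextToken = Callable[[List[int]], int]  # context -> greedy next token


def speculative_step(target: NextToken, draft: NextToken,
                     context: List[int], n_draft: int) -> List[int]:
    """Run one draft-then-verify step and return the tokens it emits."""
    # 1. The drafter proposes n_draft tokens on its own.
    proposed: List[int] = []
    ctx = list(context)
    for _ in range(n_draft):
        tok = draft(ctx)
        proposed.append(tok)
        ctx.append(tok)

    # 2. The target checks them in order and keeps the matching prefix.
    emitted: List[int] = []
    ctx = list(context)
    for tok in proposed:
        expected = target(ctx)
        if tok != expected:
            emitted.append(expected)  # first mismatch: use the target's token
            return emitted
        emitted.append(tok)
        ctx.append(tok)
    emitted.append(target(ctx))  # all accepted: target adds one more token
    return emitted


def generate_speculative(target: NextToken, draft: NextToken,
                         prompt: List[int], n_new: int, n_draft: int) -> List[int]:
    out: List[int] = []
    while len(out) < n_new:
        out.extend(speculative_step(target, draft, list(prompt) + out, n_draft))
    return out[:n_new]


def generate_plain(target: NextToken, prompt: List[int], n_new: int) -> List[int]:
    out: List[int] = []
    for _ in range(n_new):
        out.append(target(list(prompt) + out))
    return out


# Toy stand-ins for real models (purely illustrative).
def toy_target(ctx: List[int]) -> int:
    return (ctx[-1] * 3 + 1) % 7


def toy_draft(ctx: List[int]) -> int:
    # Agrees with the target except after token 5, where it guesses wrong.
    return 0 if ctx[-1] == 5 else toy_target(ctx)


same = generate_speculative(toy_target, toy_draft, [1], 20, 4) == generate_plain(toy_target, [1], 20)
print("identical output:", same)
```

Every emitted token is either a draft token the target agreed with or the target's own pick, so under greedy decoding a wrong guess from the drafter costs speed, not output quality.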

[-]

xeeff@reddit

support rocm or vulkan

[-]

Anbeeld@reddit (OP)

I don't own AMD myself to test, but folks PR'd some stuff for HIP/ROCm, so should be working.

[-]

Toastti@reddit

For agentic coding, so something like large 200k-context chats in opencode: is MTP from the latest llama.cpp or DFlash faster?

[-]

Anbeeld@reddit (OP)

It depends a lot on the prompts and the context. In v0.2.0 both DFlash and prompt processing should be in a good state, at least for Qwen 3.6 27B and Gemma 4 31B, which were my main targets. In a multi-turn chat benchmark DFlash managed to beat MTP, but it is only a benchmark, of course.

While testing the new version, I've done a decent number of long-context chats in OpenCode and VSCode Copilot extension, and in my completely unbiased opinion it was a pleasant experience, with minimal issues around tool calls and stuff. I tried following up with targeted file reading and editing, usage of vision deep into the chat, and everything worked fine.

But as my opinion is just way too unbiased, the honest answer is probably: try both. If you are on Windows, there are prebuilts available; otherwise, follow the instructions. There might still be some issues in edge cases, but I would be happy to fix those based on community reports.