BeeLlama v0.2.0 – major DFlash update. Single RTX 3090: Qwen 3.6 27B up to 164 tps (4.40x), Gemma 4 31B up to 177.8 tps (4.93x). Prompt processing speed near baseline.
Posted by Anbeeld@reddit | LocalLLaMA | 32 comments
BeeLlama v0.2.0 is here!
Not quite a pegasus, but close enough.
GitHub | Qwen 3.6 27B Quick Start | Gemma 4 31B Quick Start
- Full Gemma 4 31B support with efficient DFlash implementation and vision.
- Major Qwen 3.6 27B performance update from lower DFlash overhead, cleaner prefill handling, drafter K/V projection caching, and safer CUDA execution.
- DFlash GGUFs with upstream architecture are now supported.
- Fixes to adaptive profit behavior around baseline probing.
- Reduced verifier path is stricter now, with safer fallback to full logits when grammar, sampler state, or reasoning requires it.
- Reasoning and tool-call boundaries were tightened.
- Stricter draft/target validation and better draft-model discovery.
- ...and many more improvements!
Benchmarks
- Setup: Windows 11, AMD Ryzen 7 5700X3D, 32 GB DDR4 RAM, RTX 3090 24 GB
- Config: same as in quick start docs, but with reasoning off for non-chat prompts
- Baseline and MTP server in comparison: llama.cpp b9275 CUDA 13.1 Windows prebuilt
- The full text of the benchmark prompts is in README.md on GitHub
Qwen 3.6 27B
Target model: Qwen 3.6 27B Q5_K_S or Qwen 3.6 27B MTP Q5_K_S. DFlash model: Q4_K_M.
| Prompt | Server | Output | Median | Best | Speedup | Acceptance |
|---|---|---|---|---|---|---|
| Task store module | Baseline | \~1K tok | 37.2 tok/s | 37.2 tok/s | 1.00x | N/A |
| Task store module | DFlash | \~1K tok | 163.9 tok/s | 181.9 tok/s | 4.40x | 67.7% / 89.2% |
| Task store module | MTP | \~1K tok | 69.3 tok/s | 69.6 tok/s | 1.86x | 92.0% / 73.3% |
| KV report module | Baseline | \~1K tok | 34.6 tok/s | 36.5 tok/s | 1.00x | N/A |
| KV report module | DFlash | \~1K tok | 157.7 tok/s | 162.5 tok/s | 4.56x | 58.8% / 88.9% |
| KV report module | MTP | \~1K tok | 67.3 tok/s | 68.1 tok/s | 1.94x | 89.3% / 73.0% |
| Doubly-linked list | Baseline | \~4K tok | 36.8 tok/s | 36.9 tok/s | 1.00x | N/A |
| Doubly-linked list | DFlash | \~4K tok | 130.8 tok/s | 154.1 tok/s | 3.56x | 50.4% / 86.8% |
| Doubly-linked list | MTP | \~4K tok | 66.3 tok/s | 68.0 tok/s | 1.80x | 87.8% / 72.5% |
| Prompt processing | Baseline | \~20K tok | 1229.5 tok/s | 1229.5 tok/s | 1.00x | N/A |
| Prompt processing | DFlash | \~20K tok | 1214.4 tok/s | 1221.7 tok/s | 0.99x | N/A |
| Prompt processing | MTP | \~20K tok | 1162.6 tok/s | 1164.7 tok/s | 0.95x | N/A |
| Multi-turn coding | Baseline | \~28K tok | 33.3 tok/s | 33.3 tok/s | 1.00x | N/A |
| Multi-turn coding | DFlash | \~30K tok | 64.6 tok/s | 65.4 tok/s | 1.94x | 24.9% / 72.9% |
| Multi-turn coding | MTP | \~34K tok | 56.5 tok/s | 56.5 tok/s | 1.70x | 71.9% / 68.3% |
Acceptance: accepted to proposed draft tokens / accepted draft tokens to final generated tokens
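To make the two acceptance numbers concrete, here is a minimal sketch of how both ratios fall out of three counters. The `DraftStats` class and its field names are hypothetical, not BeeLlama or llama.cpp identifiers, and the counts are made up for illustration rather than taken from the benchmark runs.

```python
from dataclasses import dataclass


@dataclass
class DraftStats:
    """Hypothetical counters for one generation run (names are illustrative)."""
    proposed: int   # draft tokens the drafter proposed
    accepted: int   # draft tokens the target model accepted
    generated: int  # all tokens in the final output

    def accepted_to_proposed(self) -> float:
        # First number in the Acceptance column.
        return self.accepted / self.proposed

    def accepted_to_final(self) -> float:
        # Second number in the Acceptance column.
        return self.accepted / self.generated


# Made-up counts, not taken from the benchmark runs.
stats = DraftStats(proposed=1200, accepted=800, generated=900)
print(f"{stats.accepted_to_proposed():.1%} / {stats.accepted_to_final():.1%}")
# prints: 66.7% / 88.9%
```

The first ratio says how often a guess lands. The second says how much of the output came from drafting rather than from ordinary target-model decoding.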
Gemma 4 31B
Target model: Gemma 4 31B Q4_K_S. DFlash model: Q5_K_M.
| Prompt | Server | Output | Median | Best | Speedup | Acceptance |
|---|---|---|---|---|---|---|
| Task store module | Baseline | \~1K tok | 36.1 tok/s | 36.1 tok/s | 1.00x | N/A |
| Task store module | DFlash | \~1K tok | 177.8 tok/s | 182.0 tok/s | 4.93x | 65.7% / 90.0% |
| KV report module | Baseline | \~1K tok | 35.9 tok/s | 36.0 tok/s | 1.00x | N/A |
| KV report module | DFlash | \~1K tok | 154.3 tok/s | 162.8 tok/s | 4.29x | 55.7% / 88.6% |
| Doubly-linked list | Baseline | \~1.9K tok | 36.0 tok/s | 36.0 tok/s | 1.00x | N/A |
| Doubly-linked list | DFlash | \~1.9K tok | 116.6 tok/s | 127.3 tok/s | 3.24x | 44.5% / 84.9% |
| Prompt processing | Baseline | \~24K tok | 1021.3 tok/s | 1021.3 tok/s | 1.00x | N/A |
| Prompt processing | DFlash | \~24K tok | 954.5 tok/s | 954.9 tok/s | 0.93x | N/A |
| Multi-turn coding | Baseline | \~12K tok | 34.8 tok/s | 34.8 tok/s | 1.00x | N/A |
| Multi-turn coding | DFlash | \~12K tok | 60.6 tok/s | 64.1 tok/s | 1.74x | 24.4% / 72.3% |
Acceptance: accepted to proposed draft tokens / accepted draft tokens to final generated tokens
wgaca2@reddit
Q8 on 260k context for 48gb vram?
Anbeeld@reddit (OP)
If it fits on baseline llama.cpp, then sure. The whole DFlash thing needs like 1-2 GB.
But as you probably have a multi-GPU setup, I must warn you that it's not properly supported yet. I don't own a multi-GPU rig myself, so I apply fixes based on reports, and there haven't been many of them lately.
On the other hand, if you try it and something doesn't work, please file a report in GitHub issues. It would help a lot.
wgaca2@reddit
I will check it out over the weekend, i have q8 fitting tight so i might have to adjust a bit
Clean_Initial_9618@reddit
I have an RTX 3090 as well, currently running the latest version of llama.cpp with MTP support and getting around 50 tps with Hermes agent. How is DFlash better, would it give me higher output? I use my Hermes agent with llm-wiki to process my notes, plus a few crons for scraping websites. I was looking to set up PI over the weekend and do some coding with Qwen 3.6 27B. Would switching to beellama.cpp with DFlash be useful and worth the hassle? Sorry, I'm not that good with local LLMs yet, a little help would be great.
Zarzou@reddit
bee-llama -- forked --> buun-llama -- forked --> TheTom/llama -- forked --> llama.cpp
IMHO Such fragmentation is to be avoided.
Anbeeld@reddit (OP)
What I care about is being able to ship good stuff to the community, that's it.
segmond@reddit
The community is not using all of those. If you care about the community, you will put that effort into making a decent PR that goes back into llama.cpp.
pmttyji@reddit
Can you add Qwen3.5-9B MTP to the Plug-and-Play Setups? Many of us could run a 9B model with less VRAM. Also add Qwen3.6-35B-A3B & Gemma-4-26BA4B for the same reason.
Anbeeld@reddit (OP)
I will add more models over time, for sure. For MoE I'm planning a separate update focused solely on them. It's just that v0.2.0 ended up being a crazy time sink, and there's also upstream MTP and the general spec architecture that I now need to merge, so for now there's quite a lot of stuff to do.
m0py@reddit
As a 5070 owner, looking forward to the smaller models, especially the MoE versions. Thank you for your work!
caetydid@reddit
Are the speed gains for Qwen MTP expected to be lower, or is it just not optimized? I just wonder because acceptance rates are high compared to DFlash.
Anbeeld@reddit (OP)
MTP uses a fixed draft-n-max of 3 by default, which I left as is for the benchmarks. BeeLlama DFlash uses draft-n-max 16 and a profit-based adaptive controller that dynamically lowers it.
So the reason DFlash has much lower acceptance is that it drafts many more tokens. Drafting was made cheap back in v0.1.2, so it's more profitable to try more, even if most fail.
The important number here is the second one, accepted draft tokens to final generated tokens. In every single benchmark DFlash had it higher than MTP did. This means the "acceptance rates are high compared to DFlash" statement is not correct there.
MTP just has high accepted to proposed ratio, but without context it doesn't mean anything, and the context is that it drafts very conservatively.
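A rough back-of-the-envelope check of those two ratios, as a sketch only. It assumes every verification step emits the accepted draft tokens plus exactly one token from the target model, which is a simplifying assumption and not something confirmed from BeeLlama or llama.cpp internals. The helper name `per_step` is made up for this example.

```python
def per_step(accepted_to_proposed: float, accepted_to_final: float):
    """Estimate (accepted, proposed) draft tokens per verification step.

    Assumption: each step yields `accepted` draft tokens plus one target
    token, so accepted_to_final = accepted / (accepted + 1).
    """
    accepted = accepted_to_final / (1.0 - accepted_to_final)
    proposed = accepted / accepted_to_proposed
    return accepted, proposed


# Ratios from the Qwen 3.6 27B "Task store module" rows of the table.
mtp_accepted, mtp_proposed = per_step(0.920, 0.733)
dflash_accepted, dflash_proposed = per_step(0.677, 0.892)

print(f"MTP:    ~{mtp_accepted:.1f} accepted of ~{mtp_proposed:.1f} proposed per step")
print(f"DFlash: ~{dflash_accepted:.1f} accepted of ~{dflash_proposed:.1f} proposed per step")
```

Under that assumption, MTP comes out at roughly 3 proposed tokens per step, in line with draft-n-max 3. DFlash lands at about 8 accepted out of about 12 proposed, so it misses more guesses but still moves the output forward by far more tokens per step.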
Infamous-Play-3743@reddit
Interesting but for the web
craftogrammer@reddit
Looking great. Is there something for us 16 GB VRAM poors? 🫡 Thanks!
Anbeeld@reddit (OP)
Unfortunately I didn't have time to tinker with smaller models yet, but if they use the same architecture as these 2 headliners, it might very well work as is.
craftogrammer@reddit
Yeah, makes sense, at least if Q3 can be used just for tool calling. I will try that tonight with your new version. Thanks for sharing.
Qwen_os_has_died@reddit
Beellama ...
Anbeeld@reddit (OP)
Not quite a pegasus, but close enough.
sagiroth@reddit
This is incredible. Squeezing that 3090 like a lemon. Keep up the good work man
Anbeeld@reddit (OP)
Wait until I get a second one, somehow. The juice will flow like a river.
Poha_Best_Breakfast@reddit
Isn’t DFLASH support still pending on llama cpp mainline?
Anbeeld@reddit (OP)
Well, that's the power of forks: no idea what mainline's stance on DFlash is. I just did some stuff, said stuff seems to work, I share the stuff with the community.
caetydid@reddit
and here goes my evening...
sagiroth@reddit
Too real. Every single time something new drops. I'm grateful we have such incredible people with lots of spare time
FerLuisxd@reddit
MTP seems so slow, I saw other comparisons but this one seems too different, any reason for that? Not optimized yet?
Anbeeld@reddit (OP)
I used default settings for both methods, one of the latest llama.cpp builds, and the usual unsloth models. Maybe MTP benefits from some tinkering with draft-n-max and whatnot, and that's what you've seen in other tests? Unfortunately, I had too much benchmarking to do with my own stuff to explore that myself.
Shockersam@reddit
Can someone enlighten me: are there any accuracy drops when using DFlash and/or MTP?
Anbeeld@reddit (OP)
Shouldn't be with proper implementation. All output of the drafter is verified by the target model, so the lil bro can't just output some nonsense unsupervised.
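The verification idea can be shown with a toy greedy-decoding sketch. This is not BeeLlama's or llama.cpp's code: `target` and `drafter` are made-up stand-in functions, and it only covers greedy decoding (with sampling, implementations typically use a rejection-sampling scheme instead of exact matching).

```python
from typing import Callable, List

NextToken = Callable[[List[int]], int]


def greedy_decode(target: NextToken, prompt: List[int], n_new: int) -> List[int]:
    """Plain decoding: the target model picks every token itself."""
    out = list(prompt)
    for _ in range(n_new):
        out.append(target(out))
    return out


def speculative_decode(target: NextToken, drafter: NextToken,
                       prompt: List[int], n_new: int, draft_len: int = 4) -> List[int]:
    """Drafter guesses ahead; only guesses the target agrees with are kept."""
    out = list(prompt)
    goal = len(prompt) + n_new
    while len(out) < goal:
        # 1) The drafter proposes draft_len tokens, building on its own guesses.
        ctx, draft = list(out), []
        for _ in range(draft_len):
            tok = drafter(ctx)
            draft.append(tok)
            ctx.append(tok)
        # 2) The target checks each guess (real servers batch this in one pass).
        for tok in draft:
            expected = target(out)
            if tok != expected:
                out.append(expected)  # first mismatch: keep the target's token
                break
            out.append(tok)           # match: the draft token is accepted
        else:
            out.append(target(out))   # all accepted: one extra target token
    return out[:goal]


def target(ctx: List[int]) -> int:
    # Toy deterministic "big model".
    return (ctx[-1] * 31 + len(ctx)) % 50


def drafter(ctx: List[int]) -> int:
    # Toy "small model": agrees with the target except at every third position.
    tok = target(ctx)
    return tok if len(ctx) % 3 else (tok + 1) % 50


prompt = [1, 2, 3]
same = speculative_decode(target, drafter, prompt, 20) == greedy_decode(target, prompt, 20)
print(same)
```

Every token appended to the output is either a draft token the target would have picked anyway or the target's own choice, so in this greedy setting a wrong guess costs speed, not output quality.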
xeeff@reddit
Support for ROCm or Vulkan?
Anbeeld@reddit (OP)
I don't own AMD myself to test, but folks PR'd some stuff for HIP/ROCm, so should be working.
Toastti@reddit
For agentic coding, so like large 200k-context chats in OpenCode: is MTP from the latest llama.cpp or DFlash faster?
Anbeeld@reddit (OP)
It depends a lot on the prompts and the context. In v0.2.0 both DFlash and prompt processing should be in a good state right now, at least for Qwen 3.6 27B and Gemma 4 31B that were my main targets. In a multi-turn chat benchmark DFlash managed to beat MTP, but it is only a benchmark, of course.
While testing the new version, I've done a decent number of long-context chats in OpenCode and VSCode Copilot extension, and in my completely unbiased opinion it was a pleasant experience, with minimal issues around tool calls and stuff. I tried following up with targeted file reading and editing, usage of vision deep into the chat, and everything worked fine.
But as my opinion is just way too unbiased, the honest answer is probably: try both. If you are on Windows, there are prebuilts available, otherwise follow the instructions. There might still be some issues in edge cases, but I would be happy to fix those based on community reports.