BeeLlama v0.3.1 – latest llama.cpp with extras! DFlash, MTP, q6_0 cache, TurboQuant. Single RTX 3090: Qwen 3.6 27B & Gemma 4 31B up to 177.8 tps (4.93x over baseline)

Posted by Anbeeld@reddit | LocalLLaMA | 42 comments

NickCanCode@reddit

I am getting strange behaviour with DFlash. The same model keeps trying to use `computer_use__set_value` instead of the Edit file tool to edit a file, and the call gets auto-rejected by the coding CLI under yolo mode. I have never seen an agent so insistent on retrying `computer_use__set_value`. I didn't even know there was such a tool. I had heard DFlash should not affect how the model behaves, but it is actually happening in front of me today. Running KV at 8_0. Not sure how this happens. Well, the speedup is real. It's faster than MTP once the context grows to a certain size.
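For context on why a draft model is normally expected to leave behaviour unchanged, here is a minimal sketch of greedy speculative decoding. It assumes DFlash is used as a drafter that the main model verifies, which is an assumption about the general technique and not BeeLlama's actual code. `toy_target` and `toy_draft` are hypothetical stand-ins for real models. With sampling enabled, real engines use an acceptance rule that preserves the output distribution but not the exact tokens.

```python
from typing import Callable, List

NextToken = Callable[[List[int]], int]


def greedy_decode(target: NextToken, prompt: List[int], n_new: int) -> List[int]:
    """Plain greedy decoding with the target model only."""
    seq = list(prompt)
    for _ in range(n_new):
        seq.append(target(seq))
    return seq[len(prompt):]


def speculative_decode(target: NextToken, draft: NextToken,
                       prompt: List[int], n_new: int, k: int = 4) -> List[int]:
    """Greedy speculative decoding: the draft proposes, the target decides."""
    seq = list(prompt)
    produced = 0
    while produced < n_new:
        # 1. The draft model cheaply proposes k tokens.
        ctx = list(seq)
        proposal = []
        for _ in range(k):
            tok = draft(ctx)
            proposal.append(tok)
            ctx.append(tok)
        # 2. The target checks each position (real engines batch this into one pass).
        for tok in proposal:
            expected = target(seq)
            seq.append(expected)  # always keep the target's own token
            produced += 1
            if expected != tok or produced >= n_new:
                break  # first mismatch: discard the rest of the proposal
    return seq[len(prompt):]


# Toy stand-ins for real models (hypothetical, for illustration only).
def toy_target(seq: List[int]) -> int:
    return (sum(seq) * 31 + len(seq)) % 50


def toy_draft(seq: List[int]) -> int:
    # Agrees with the target except at every third sequence length.
    guess = toy_target(seq)
    return guess if len(seq) % 3 else (guess + 1) % 50


prompt = [3, 1, 4]
assert speculative_decode(toy_target, toy_draft, prompt, 20) == greedy_decode(toy_target, prompt, 20)
```

Because only tokens the target would have picked are kept, a bad draft should cost speed rather than change the output. A behaviour change like the one described above would therefore point at something else in the pipeline, which this sketch does not diagnose.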

Anbeeld@reddit (OP)

Model? Harness? Config? Logs?

NickCanCode@reddit

Here is my launch script: [https://pastebin.com/zAAiEru1](https://pastebin.com/zAAiEru1)

BeeLlama console logs: [https://pastebin.com/bHPz8F7N](https://pastebin.com/bHPz8F7N)

What the agent did in the CLI: [https://pastebin.com/6khq8pXP](https://pastebin.com/6khq8pXP)

It tried to use the `computer_use__set_value` tool, which I had never seen it use before. Surprisingly, my yolo-mode qwen-code actually denied the call automatically, but the model didn't give up and kept retrying.

Note: I built from source. I didn't modify anything beyond what was needed to make it compile on my PC, so that should not have broken anything. You may think that since the model is `Qwen3.6-27B-Q2_K_MIXED-AutoRound`, it's just a Q2 model doing its thing, but it is not. I have been using this model with llama.cpp and ik_llama.cpp for weeks and never saw anything like this until today. Here is what the model looks like from the inside: [https://pastebin.com/4HzT8Kx7](https://pastebin.com/4HzT8Kx7).

Harness: Qwen Code. `preserve_thinking` is set from there, so it's not shown in the launch script.

Anbeeld@reddit (OP)

Oh, multi-GPU. That's not fully fleshed out for BeeLlama yet, just a note. Anyways, I'll conduct an investigation. Thank you for the report.

cleversmoke@reddit

Thank you BeeLlama team! Will try this later today

Anbeeld@reddit (OP)

One man team!

cleversmoke@reddit

Dang, awesome!

AwaitingSerotonin@reddit

DFlash still doesn't work with `-sm tensor` on multiple GPUs? Is that planned (or even possible)?

Anbeeld@reddit (OP)

I'm working on it, implementing fixes every time I get a report, as I don't have a multi-GPU setup.

xspider2000@reddit

This may be an unpopular opinion, but I think [club-3090](https://github.com/noonghunna/club-3090) has good ideas while the implementation is full of slop. It lacks consistency in both code and documentation.

taking_bullet@reddit

Is there a Windows Vulkan package? Can't find it on GitHub.

Anbeeld@reddit (OP)

Whoops... will add it for the next release.

anubhav_200@reddit

Thank you so much! With the last version I was getting 90 tps token generation with Qwen3.6-27B Q4_K_M.

Robo_Ranger@reddit

What did you mean by Gemma4 12B support? Does it get a speed boost as well, or just run at normal speed? I looked on the GitHub page and found nothing mentioned about Gemma4 12B.

thoquz@reddit

How do llama.cpp forks confirm that they still produce the same output tokens as the baseline model / inference engine would? Would I be able to set a fixed seed value in this one and get the same output as the upstream? (Just at different speeds)
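One rough way to check this yourself is to generate from both builds with the same prompt, the same fixed seed, and greedy sampling, then compare the outputs token by token. The helper below is a minimal sketch of that comparison; `first_divergence` is a hypothetical name, and collecting the two outputs from each server is left out. Even with identical settings, different kernels or batch shapes can change floating-point rounding, so a late divergence is not automatically a bug.

```python
from typing import Optional, Sequence


def first_divergence(a: Sequence, b: Sequence) -> Optional[int]:
    """Return the index of the first differing item, or None if the sequences match."""
    for i, (x, y) in enumerate(zip(a, b)):
        if x != y:
            return i
    # One sequence is a strict prefix of the other.
    if len(a) != len(b):
        return min(len(a), len(b))
    return None


# Example with made-up token IDs (not real model output).
upstream_tokens = [101, 7, 42, 9, 13]
fork_tokens = [101, 7, 42, 8, 13]
idx = first_divergence(upstream_tokens, fork_tokens)
print("identical" if idx is None else f"first difference at position {idx}")
```

Running it over a batch of prompts and recording where the outputs first diverge gives a more useful picture than a single pass/fail result.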

sagiroth@reddit

What's your take on setting a reasoning budget? I tried various limits and found -2k to be the sweet spot for agent/multi-turn coding.

Anbeeld@reddit (OP)

Unlimited. 😎 Probably because I spend much more time testing it than actually using it...

sagiroth@reddit

Too real... at work I'm full-on Claude Code, but when I'm done I barely touch AI because of AI fatigue lol

robertpro01@reddit

Lol, I love it, that's what I used to do when I was self-hosting stuff: just getting it ready and never actually using it. Now I've set up a local LLM and I'm fucking addicted to it. (Really using it.)

sittingmongoose@reddit

On your 3090 test unit, what context are you using for the Qwen 27B model? Your setup is similar to mine and I want to run the 27B.

Anbeeld@reddit (OP)

[Qwen 3.6 27B Quick Start](https://github.com/Anbeeld/beellama.cpp/blob/main/docs/quickstart-qwen36-dflash.md) is where I described all this stuff. Also, since the latest llama.cpp VRAM optimizations, you might be able to fit even more context.

artash26@reddit

Awesome write-up, thank you for sharing. If I were to run this on RunPod's L4, can I reuse this example or will I have to tweak something? Will it be much worse compared to a 3090?

Anbeeld@reddit (OP)

No idea honestly, but if it has 24 GB VRAM then the example config should work.

Due_Steak_1249@reddit

How does it work with multi-GPU? I remember some weeks ago you didn't have a multi-GPU setup available and had to depend on user feedback. I'm considering giving it a try with my 3090 Ti + 4070S.

Anbeeld@reddit (OP)

club-3090 tried it with 2x3090 and reported it as working well.

sittingmongoose@reddit

Thank you!

alew3@reddit

What's the max quantization it can push on an RTX 5090 with 256k context with Qwen 3.6 27B?

Anbeeld@reddit (OP)

UD-Q6_K_XL should fit well, I guess? With some cache quantization.
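To get a feel for how much cache quantization matters at 256k context, here is a back-of-the-envelope KV cache estimate for a standard attention layout. The layer count, KV head count, and head size below are hypothetical placeholders, not the real Qwen 3.6 27B dimensions, so read the actual values from the GGUF metadata before trusting any number. The bits-per-element figures follow the ggml block layouts (`q8_0` stores 32 values in 34 bytes, `q4_0` stores 32 values in 18 bytes).

```python
# Effective storage cost per cached element, in bits.
BITS_PER_ELEMENT = {"f16": 16.0, "q8_0": 8.5, "q4_0": 4.5}


def kv_cache_gib(n_layers: int, n_kv_heads: int, head_dim: int,
                 n_ctx: int, cache_type: str = "f16") -> float:
    """Estimate KV cache size in GiB when every layer caches full attention."""
    # Factor 2 covers keys plus values.
    elements = 2 * n_layers * n_kv_heads * head_dim * n_ctx
    return elements * BITS_PER_ELEMENT[cache_type] / 8 / 2**30


# Hypothetical dimensions, for illustration only.
for cache_type in ("f16", "q8_0", "q4_0"):
    size = kv_cache_gib(n_layers=64, n_kv_heads=8, head_dim=128,
                        n_ctx=262144, cache_type=cache_type)
    print(f"{cache_type}: {size:.1f} GiB")
```

Models that keep a full KV cache in only some of their layers need proportionally less, so treat this as an upper bound for the layout you plug in. The weights and compute buffers come on top of it.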

LetsGoBrandon4256@reddit

> and many more improvements!

Absolutely insane that one person can do more than the whole llama.cpp team combined with Huggingface money lmfao.

jazir55@reddit

Move fast and break things with no design by committee

Dandz@reddit

Why 0.3.0 and 0.3.1?

feverdoingwork@reddit

What speeds are you getting? I'm on a goofy 5080 and 5060 Ti setup.

Anbeeld@reddit (OP)

Because I released v0.3.0 and then woke up to llama.cpp upgrading MTP and adding Gemma 4 12B support, so I just had to do one more update right away.

soyalemujica@reddit

I have issues with HIP on AMD: it won't find the GPU when using DFlash.

Anbeeld@reddit (OP)

Please leave a detailed report in [this issue](https://github.com/Anbeeld/beellama.cpp/issues/45). I don't own an AMD GPU myself, so I rely on user logs for fixes.

kiwibonga@reddit

Nice! I actually downloaded some dflash draft models yesterday but deleted them when I realized it couldn't split on 2 GPUs. Then baby jesus heard my cries, apparently. CUDA OOM errors, here I come.

Fabulous_Fact_606@reddit

Any support for Qwen3.6-27B-UD-Q8_K_XL? Only quant that does not make coding mistakes...

Anbeeld@reddit (OP)

Should be supported.

Fit_Split_9933@reddit

How is the speed now after context exceeds 100k?

Anbeeld@reddit (OP)

I tried it many times in high-context agentic usage, mainly "analyze the repo" stuff to stress tool calls as well, and my rating is "not bad". I don't have any numbers with actual methodology behind them, but the *30 minutes to do the same analysis* that I had initially with pre-MTP llama.cpp (which is what got me into creating the fork) are long gone.

sagiroth@reddit

Club 3090 & Beellama best community! GOAT

JSVD2@reddit

wow big value, thank you