How many of you have tried BeeLlama.cpp? How is it? Is agentic coding possible with 8GB VRAM?
Posted by pmttyji@reddit | LocalLLaMA | 20 comments
We'll be getting those features (check the link at the bottom) on mainline sooner or later anyway. But for now this fork could be useful to see the full potential of our poor GPUs (and big GPUs too).
Are any 8GB VRAM (and 32GB RAM) folks already doing agentic coding with models (at Q4 at least) like Qwen3.6-35B-A3B, Qwen3.6-27B, Gemma-4-31B, or Gemma-4-26B-A4B? I would love to see some t/s stats, full commands & more details on that. I'm not expecting any miracle with 8GB VRAM, but I still want to do something decent within those constraints. Though I'm getting a new rig this month, I want to use my current laptop (8GB VRAM) for agentic coding too.
Others (who have more than 8GB VRAM), please share your stats, full commands & a comparison with mainline.
Below is a related thread by the creator. I hope the creator keeps adding features.
jQuaade@reddit
I hit about 32 tok/s in LM Studio with the 27B on 1x3090. With BeeLlama at the recommended settings I get about 40-45 tok/s for most of my use cases (a mix of creative writing and Python code). If I ask for some simple boilerplate Python I can reach 120-130 tok/s on a fresh prompt. My context is pretty small though (24k) and I haven't experimented with larger contexts.
pmttyji@reddit (OP)
The Qwen Next/3.5/3.6 series can handle large contexts, so try 64K/96K/128K, etc. It won't reduce t/s that much. You're welcome.
MindPsychological140@reddit
On 8GB VRAM, A3B/A4B MoEs are your only realistic path; a dense 27-31B at Q4 will spend most of inference moving weights across PCIe. Use `--n-cpu-moe` in mainline llama.cpp to keep only the active expert path in VRAM and the rest in RAM. Watch the KV cache at long context: at 32k+ tokens it'll eat your VRAM faster than the weights do.
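A minimal sketch of that setup on mainline llama.cpp (the model file, context size, and `--n-cpu-moe` count are placeholders, not anyone's tested command; raise or lower the layer count until it just fits in 8GB):

```
# -ngl 99 offloads all layers to the GPU, then --n-cpu-moe keeps the MoE expert
# tensors of the first 28 layers in system RAM, so attention and shared weights
# stay in VRAM. q8_0 KV cache halves cache memory versus f16 at long context.
llama-server -m Qwen3-30B-A3B-Q4_K_M.gguf \
    -c 32768 -ngl 99 --n-cpu-moe 28 \
    --cache-type-k q8_0 --cache-type-v q8_0
```

(Quantized V cache needs flash attention enabled; recent builds handle that automatically.)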
pmttyji@reddit (OP)
Agreed, I can't expect to run those dense models on 8GB VRAM. MoE is the way.
I already tried the Qwen3-30B MoEs with mainline and got only 15 t/s at 32K context with Q8 KV cache. Obviously that's not enough for agentic coding, so I want to experiment with the fork mentioned in the thread.
trialbuterror@reddit
How about 16GB VRAM and 16GB DDR4 RAM?
pmttyji@reddit (OP)
I'm getting a new rig which comes with 48-96GB VRAM (AMD). I still want to push my current laptop to its limits.
FatheredPuma81@reddit
https://www.reddit.com/r/LocalLLaMA/comments/1t88zvv/comment/okuoxii/?utm_source=share&utm_medium=web3x&utm_name=web3xcss&utm_term=1&utm_content=share_button
That comment sums it up imo. I would rather wait for llama.cpp to implement things properly than use a fork stacked three layers deep, of dubious quality.
One guy did run Qwen3.6-35B-A3B-UD-IQ4_NL_XL Turbo benchmark tests on buun-llama-cpp, but his benchmark shows Q8_0 and Turbo4 being on par with F16 (which contradicts Georgi's own findings when he added KV Rotation), and when I asked he said he'd run Q4_1 as a sanity check, but I guess he never got around to it. I tried to do the tests myself but hit a roadblock at the Turboquants: GPT-OSS 20B doesn't work, and Qwen and other models think way, way too much for each question. Reasoning is required for KV quant tests, and short reasoning is required for your sanity.
Anbeeld@reddit
"Qwen thought too much so I dropped tests, instead I prefer to spread 2nd hand opinions about things I haven't tried".
FatheredPuma81@reddit
Yes... I dropped running my own AIME2025 benchmark because I didn't want to spend 12+ hours per KV quant on it, and I chose to say that I'd recommend against using a mostly AI-coded fork (built on the back of other AI-coded forks) in favor of a well-implemented version, with benchmarks and all that jazz, before it's merged into llama.cpp...
If you want to, feel free; it's rather easy to set up! Make sure you share your results in... about 80+ hours? Less if you have a GPU better than a 4090 at your disposal.
And for those that want to check the math: 40,000 tokens per AIME question / 200 tokens/s (RTX 4090 w/ 4 slots) * 240 questions (more if results aren't reliable) / 60 (minutes) / 60 (hours) = 13.3 hours per quant, * 7 quants (F16, Q8, Q4_1, TQ4, TQ3, some V-cache combinations) = 93.3 hours, give or take.
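A quick sanity check of that arithmetic (a one-liner, using the same figures quoted above):

```
python3 -c "h = 40000/200*240/3600; print(round(h, 1), 'h per quant,', round(h*7, 1), 'h total')"
# prints: 13.3 h per quant, 93.3 h total
```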
Anbeeld@reddit
I am in fact planning to run a number of benchmarks.
pmttyji@reddit (OP)
Yep, I want to do this on my current laptop (8GB VRAM). I'll be staying with mainline on my upcoming rig.
R_Duncan@reddit
Qwen3.6-35B-A3B is the only choice with 8GB VRAM: Gemma has a huge KV cache, and the 27B is way too slow.
pmttyji@reddit (OP)
How many t/s are you getting? Please include the full llama.cpp command.
trialbuterror@reddit
How?
R_Duncan@reddit
I use plain llama.cpp (usually 128k context with q8_0 KV cache, the only sane setup for now) plus a harness of choice (opencode, pi, etc.).
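Not the commenter's actual command, but a sketch of what that kind of setup might look like (the model file, `--n-cpu-moe` count, and port are assumptions):

```
# ~128k context with q8_0 KV cache, exposing an OpenAI-compatible endpoint
# that a harness like opencode can point at; --jinja enables the model's
# chat template, which tool-calling harnesses generally need
llama-server -m Qwen3.6-35B-A3B-Q4_K_M.gguf \
    -c 131072 --cache-type-k q8_0 --cache-type-v q8_0 \
    -ngl 99 --n-cpu-moe 30 --jinja --port 8080
```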
mraurelien@reddit
Would you share your llama-server launch command, please?
trialbuterror@reddit
Use Ollama? What is your PC config?
Still-Notice8155@reddit
Qwen3.6-35B-A3B running on a GTX 1070 8GB and 32GB RAM, using Q4_K_M with MTP + Turboquant. It gives an initial 40 t/s at 0 ctx, then degrades to 18 t/s at 131k ctx. Good for agentic coding on jurassic hardware.
Amazing_Athlete_2265@reddit
I am currently using unsloth/Qwen3.6-35B-A3B with mainline llama.cpp and the pi coding agent, and it's pretty bloody good. I would try the 27B, but at 3 t/s it's a tad slow.
Anbeeld@reddit
Hello there.
So far the fork has basically been about Qwen 3.6 27B fully offloaded into VRAM, so I haven't properly tested and benchmarked other use cases yet. I've noticed, for example, that the MoE 35B doesn't seem to benefit from the existing DFlash implementation/config right now, and I'm planning to investigate.
The roadmap for now is something like: figure out the spec stack (there's more to it than DFlash) and the blockers (multi-GPU support), then do a number of optimizations for the main use case, then expand to different models and hardware.
You can still try it and see for yourself whether it's useful for you right now. And even if it isn't, create a GitHub issue with a detailed description of how and why, and it will help me fix it.