BeeLlama.cpp: advanced DFlash & TurboQuant with support for reasoning and vision. Qwen 3.6 27B Q5 with 200k context on a 3090, 2-3x faster than baseline (peak 135 tps!)
Posted by Anbeeld@reddit | LocalLLaMA | View on Reddit | 186 comments
TL;DR New llama.cpp fork! I wanted a Windows-friendly inference setup to run Qwen 3.6 27B Q5 on a single RTX 3090 with speculative decoding, high context without excess quantization, and vision enabled. No option did this out of the box for me without VRAM and/or tooling issues (this was before the MTP PR for llama.cpp surfaced), so I pulled out an old trick: stay up until 4am one too many times to do a month-plus of work in a week or two. Now I have what seems to be the solution and don't mind sharing.
Anbeeld's BeeLlama.cpp

BeeLlama.cpp (or just Bee) is Anbeeld's performance-focused llama.cpp fork for squeezing more speed and context out of local GGUF inference. It keeps the familiar llama.cpp tools, server flow, and model compatibility, then adds DFlash speculative decoding, adaptive draft control, TurboQuant/TCQ KV-cache compression, reasoning-loop protection, full multimodal support, and experimental speculation modes.
Not quite a pegasus, but close enough.
Here's a plug-and-play Qwen 3.6 27B setup with a config to run it in Q5 + 200k of practically lossless KV cache + vision on a single RTX 3090 or 4090.
Fork Features
- DFlash speculative decoding: --spec-type dflash drives a DFlash draft GGUF alongside the target model. The target captures hidden states into a per-layer 4096-slot ring buffer; the drafter cross-attends to the most recent --spec-dflash-cross-ctx hidden-state tokens and proposes drafts for target verification.
- TurboQuant / TCQ KV-cache compression: Five cache types (turbo2, turbo3, turbo4, turbo2_tcq, turbo3_tcq) spanning 4x to 7.5x compression, with the higher-bit options being practically lossless in many cases. Set independently with --cache-type-k and --cache-type-v.
- Adaptive draft-max control: The server adjusts the active draft horizon at runtime instead of using a fixed --spec-draft-n-max. The default profit controller compares speculative throughput against a no-spec baseline; the fringe alternative maps acceptance-rate bands to draft depth.
- Full multimodal support: When --mmproj is active, the server keeps flat DFlash available for text generation. The mmproj model can be fully offloaded to CPU with no problems, reducing VRAM pressure.
- Reasoning-loop protection: The server detects repeated hidden reasoning output and intervenes. The default mode is force-close, with --reasoning-loop-window and --reasoning-loop-max-period tuning available.
- Sampled DFlash verification: --spec-draft-temp enables rejection-sampling drafter behavior. It activates when both draft and target temperature exceed zero; draft log probabilities must be available for rejection sampling to produce correct output.
- DDTree branch verification: The optional --spec-branch-budget adds branch nodes beyond the main draft path, with GPU parent_ids, tree masks, and recurrent tree kernels. Disabled automatically when the target model spans more than one GPU. This one is very much work in progress!
- Request-level speculative overrides: Draft-max and branch budget can be overridden per request through JSON fields without restarting the server.
- CopySpec model-free speculation: --spec-type copyspec provides rolling-hash suffix matching over previous tokens without a draft model.
For the full feature and public-repo comparison, read docs/beellama-features.md. For the complete argument reference, read docs/beellama-args.md.
TurboQuant (WHT-based scalar quantization) originates from TheTom/llama-cpp-turboquant. TCQ (Trellis-Coded Quantization) and basic DFlash implementation originate from spiritbuun/buun-llama-cpp (paper: Closing the Gap: Trellis-Coded Quantization for KV Cache at 2-3 Bits).
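To make the flags above concrete, here is a minimal launch sketch that wires them together - the model paths, quants, and context size are placeholders, and the full recommended configs live in the repo docs and in the comments below:
llama-server \
  -m Qwen3.6-27B-Q5_K_S.gguf \
  --mmproj mmproj-F32.gguf --no-mmproj-offload \
  --spec-type dflash \
  --spec-draft-model dflash-draft-3.6-q4_k_m.gguf \
  --spec-dflash-cross-ctx 1024 \
  --cache-type-k turbo4 --cache-type-v turbo3_tcq \
  --ctx-size 200000 \
  --flash-attn on --jinja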
soyalemujica@reddit
Gave this a try with my AMD GPU 7900 XTX, using HIP since that's what the repository supports, and it works great! turbo3_tcq does not work though (it crashes the server), but turbo4 and turbo3 do. DFlash also appears to be working; getting 45-50 t/s at 128k context.
EbbNorth7735@reddit
I'm seeing a lot of API calls failing when using with Cline. It's eventually getting through but I'm wondering if there's an issue with the jinja format or if it might be unstable? I ran a test in open web ui and it seemed to jump back to thinking while it was answering the question.
Anbeeld@reddit (OP)
v0.1.1 lacks a crucial fix for tool and response stability that I added today; I'd suggest building the current repo yourself.
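For anyone building from source: the fork appears to keep upstream llama.cpp's CMake flow, so a CUDA build along these lines should work on Windows or Linux (the repo URL is a placeholder and the exact options are an assumption - defer to the repo's own docs):
git clone <beellama-repo-url> beellama.cpp && cd beellama.cpp   # URL placeholder
cmake -B build -DGGML_CUDA=ON
cmake --build build --config Release -j
# as with upstream llama.cpp, the binaries (llama-server etc.) end up in build/bin/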
caetydid@reddit
you referring to the EOS fix?
Anbeeld@reddit (OP)
Yes.
caetydid@reddit
I rebuilt with the current repo state.
I am single-shotting a Xenon 2-like arcade 2D doom scroller in pi agent with a skills.md - right now I am at step 6/12 (40k context). Write tool calls start failing again.
Not sure if something is wrong with my pi agent setup or the beellama setup.
It is still proceeding, will repost once it has finished.
caetydid@reddit
Ok. Update on latest release.
I had to start a new session in pi, and was able to successfully finish the project. Tool calling seems stable now; at the end, ~90k of the 122k context had been used.
this is with Q4_K_M and the param mentioned in the previous post.
EbbNorth7735@reddit
Are you planning on spinning another release? Last time I tried it was incredibly painful.
Anbeeld@reddit (OP)
Yo! v0.1.2 is out.
EbbNorth7735@reddit
My man!
Anbeeld@reddit (OP)
I do, but I need to sort out some multi-GPU stuff before that.
Human-Gas-1288@reddit
u r awesome !
prompt eval time = 735.04 ms / 665 tokens ( 1.11 ms per token, 904.71 tokens per second)
eval time = 29762.03 ms / 1919 tokens ( 15.51 ms per token, 64.48 tokens per second)
total time = 30497.08 ms / 2584 tokens
draft acceptance rate = 0.47547 ( 1008 accepted / 2120 generated)
adaptive dm: fringe=0.00 n_max=2
14.53.894.344 I statistics dflash: #calls(b,g,a) = 10 2047 1219, #gen drafts = 2047, #acc drafts = 1219, #gen tokens = 5848, #acc tokens = 2149, dur(b,g,a) = 0.013, 5569.042, 0.305 ms
14.53.894.737 I slot release: id 0 | task 2684 | stop processing: n_tokens = 17821, truncated = 0
14.53.894.778 I srv update_slots: spec cycle (1 slots): draft=2.9ms verify=24.9ms accept=1.3ms other=0.0ms total=29.1ms
14.53.894.782 I srv update_slots: all slots are idle
coherentspoon@reddit
I'm running Kilo code 5.16.1 and I'm just getting a ton of these errors today. Not sure if it's the tool call issue? Sorry, I'm not an expert with this stuff.
Provider: openai (proxy) Model: Qwen3.6-27B-Q5_K_S.gguf
Unexpected API Response: The language model did not provide any assistant messages. This may indicate an issue with the API or the model's output.
Anbeeld@reddit (OP)
When did you build it exactly?
coherentspoon@reddit
I'm using your prebuilt v0.1.1
Anbeeld@reddit (OP)
I see. v0.1.1 lacks a crucial fix for tool and response stability that I added today; I'd suggest building the current repo yourself.
EbbNorth7735@reddit
Hey, any chance you could spin a 0.1.2? Seeing the same issue, and last time I set up the build pipeline on Windows it was a week-long painful process. That was 2 years ago... maybe it's gotten easier?
coherentspoon@reddit
Thanks!
IrisColt@reddit
I used the Speed / VRAM combo mentioned in quickstart-qwen36-dflash.md and I got a meager +33% (around 40 t/s on a 3090). Sigh... Am I doing something wrong?
Anbeeld@reddit (OP)
What prompts?
IrisColt@reddit
Hard Math problem (unsolvable by Frontier-level AIs back in March 2025):
default llama.cpp: 32 t/s
beellama.cpp: 58 t/s
Thanks again!
Anbeeld@reddit (OP)
Yep, it's doing so much better on tasks where prediction is viable, compared to open prose and stuff.
IrisColt@reddit
Thanks again!
IrisColt@reddit
Er... Now I get it...!
"Print all the numbers from 0 to 100, in the following format: 0, 1, 2 ..."
default llama.cpp: 34.22 t/s
beellama.cpp: 96.47 t/s
"Detail every element visible in the image, from foreground to background." + 512 x 768 image
default llama.cpp: 34.19 t/s
beellama.cpp: 41.36 t/s
Thanks!!!
IrisColt@reddit
By the way, would killing all that logging spam (--log-timestamps, --log-prefix, --log-colors, and probably --metrics too) actually make this thing run noticeably faster? My console is getting absolutely buried in text right now, heh
Anbeeld@reddit (OP)
Oh, right, I actually should re-think what logging I put into the recommended config...
Kaioh_shin@reddit
I have to say this is the fastest version I have tried on my 7900xt.
Did have to fiddle around to get a build for HIP, but all good otherwise.
Would be nice if you could get it to not randomly stop (even after 0.1.1).
Anbeeld@reddit (OP)
I just committed changes to EOS handling during reasoning, which allows the model to shake off incorrect usage of tool calls and response semantics and move on. Combined with yesterday's fixes, it should help a ton with random stopping!
Tested by having BeeLlama read its own repo in multiple turns, totalling 120k tokens with a ton of file reading. It still butchered tool calls a few times later on, which could be caused by model/cache quantization + verification mistakes, but the model recovered gracefully and simply continued working.
Kaioh_shin@reddit
Thank you for your work. No more random stops with the latest commits.
I do feel like it's less reliable though, not sure if I changed something else.
I was trying to get turbo3_tcq working with HIP and thought the results were because of it or the changes. Then I switched back to the one with only my HIP changes and noticed it behaves the same.
I use it for scripting/coding, so I care about accuracy.
My benchmark is the chess board from a few posts ago. https://qwen3-6-27b-benchmark.vercel.app/
It's more a feeling than empiric evidence, but it used to get it consistently right before.
Now it's more like 1 out of 3 is right.
Anbeeld@reddit (OP)
I will conduct a review to ensure these changes did not cause unwanted regressions.
Kaioh_shin@reddit
I am on a very tight VRAM budget, using IQ4_XS with 100k+ context. Does the drafter make a big diff? Up until now the drafter was also IQ4_XS; going to try the Q4_K_M.
Anbeeld@reddit (OP)
Drafter quant shouldn't be a huge factor; that said, Q4_K_M might be faster while making a negligible difference in VRAM, since the IQ4 format is slower to decode.
Anbeeld@reddit (OP)
Can you share what fiddling was required for HIP, so I can update the readme and everything?
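For reference, upstream llama.cpp documents its HIP build roughly as below; whether BeeLlama needs anything beyond this is exactly what's being asked here, and the flag names are taken from upstream docs rather than from this fork (gfx1100 corresponds to the 7900 XT/XTX):
# build with ROCm/HIP, targeting gfx1100; adjust the target for your GPU
HIPCXX="$(hipconfig -l)/clang" HIP_PATH="$(hipconfig -R)" \
  cmake -B build -DGGML_HIP=ON -DAMDGPU_TARGETS=gfx1100 -DCMAKE_BUILD_TYPE=Release
cmake --build build -- -j 16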
Sufficient_Sir_5414@reddit
Phenomenal work on the integration. For that 200k context Qwen setup, how does the TurboQuant/TCQ handle the 'lost in the middle' problem compared to standard 8-bit or 4-bit KV cache? Does the TCQ overhead impact the token latency significantly compared to the baseline llama.cpp MTP PR?
Anbeeld@reddit (OP)
I ran some tests by making it analyze projects that I wrote by hand and understand well, and at 100k context it still looked on point, used tools and vision properly, etc.
I think people around here massively overstate issues with context quantization and basically don't talk about Q4 models being a shadow of their proper selves. Like, what's the point of having nice sleek glasses when you're blind in the first place?
For me it seems like making both mildly quantized, like I did here with a Q5 model + turbo4/turbo3_tcq cache setup, is more balanced.
r00x@reddit
Do you observe large differences between Q5 and Q4 quants then? I understood there wasn't supposed to be a huge difference and had never really bothered with Q5 models before (Q4 27b/35b-a3b just fit better with context onto 24GB of VRAM)
Anbeeld@reddit (OP)
Ironically I was so busy implementing this project that I didn't have a chance to properly enjoy it myself; I'm doing that just now. So I don't have a large set of examples, but from my limited testing and from benchmarks I've seen, Q4 is much rougher around the edges; 4-bit is really low, so the math starts to play against us and visibly affects precision.
I think the best approach is always to go for the highest model quant you can fit and adapt around it, but if that just doesn't work, it's no shame to drop to Q4. I tried to frame my Qwen 3.6 quick start doc around exactly this and listed a couple of combinations that seem practical, rather than just one preferred by me.
r00x@reddit
Interesting! Would you say using Q4 for dflash risks torpedoing the performance of the Q5 target or does it not work that way (would the main Q5 model just reject more of the predictions if it didn't like them, or something?)
Anbeeld@reddit (OP)
There are no limitations on which quants to use for target vs drafter; they are not required to match. Quantized drafters are weaker at correctly predicting tokens, but this is 1) not as prominent, since all the drafter has to do is predict 8-16 tokens, and 2) larger drafter models can be significantly slower, which might turn gains from better prediction into net-negative tps.
r00x@reddit
Interesting! Using a drafter definitely makes a difference vs vanilla Q5 (get about ~8tok/s with that, vs 30-40 with BeeLlama) but the reason I asked is because I am occasionally having trouble with tool calling where it seems it just gets the structure wrong and ends up blasting the CLI with XML content (this almost never happens on the vanilla models, even Q4 or IQ3_XXS models are fine at tool calling).
Have you encountered that at all or have I just done something wrong? I wondered if it were a context issue but I don't think so - it will go right back to working fine again afterwards, which you'd think it would screw up if it had forgotten the syntax.
Anbeeld@reddit (OP)
There's a bug with DFlash breaking tools at large context (basically prediction ruining the calls), I'm working on fixing it right now.
r00x@reddit
Absolute legend. I was already working on a shonky pi.dev plugin that catches this kind of model stall and pokes the model autonomously (so as to avoid a model quietly stopping while your attention is elsewhere) but I'll for sure keep an eye out for that!
Anbeeld@reddit (OP)
Updated the repo, most tool calls should be fixed. The issue may still happen, but much more rarely now, and it's caused by quality loss from model/cache quantization rather than incorrect integration with DFlash. Tried a number of 100k+ chats with a lot of file reading; seems to work alright.
r00x@reddit
OK, it's definitely miles better than before. Appreciate the work on this! I wish I could plug this into LM Studio somehow to use with LM Link but they seem to use their own llama.cpp builds, or something.
Anbeeld@reddit (OP)
Yeah, I also committed changes to EOS handling during reasoning, which allows the model to shake off incorrect usage of tool calls and response semantics and move on. Combined with yesterday's fixes, it should help a ton with random stopping!
Tested by having BeeLlama read its own repo in multiple turns, totalling 120k tokens with a ton of file reading. It still butchered tool calls a few times later on, which could be caused by model/cache quantization + verification mistakes, but the model recovered gracefully and simply continued working.
r00x@reddit
I think for me it squiffed a tool call once, in hours of use, that's all. It did stop randomly a few times (usually in the middle of thinking, usually right after a "let me read the blah blah file and check if it blah blah blah" kind of message) but otherwise kept going for long periods of time on difficult requests.
r00x@reddit
Thanks muchly, I'll give it a go tonight and let you know how it goes!
chimpera@reddit
After testing, I'm a fan of this fork. It's outperforming the MTP PR on mainline. I like the --no-mmproj-offload. I'm getting 200 tps on code with Qwen3.6-27B-Q5_K_S and a 5090.
coherentspoon@reddit
mind sharing your parameters? I'm "only" getting 100-120
chimpera@reddit
My test prompt is "Make a single page worm game." which is probably highly deterministic.
exec "$SERVICE_DIR/build/bin/llama-server" \
-m "$MODELS/unsloth/Qwen3.6-27B-GGUF/Qwen3.6-27B-Q5_K_S.gguf" \
--spec-draft-model "$MODELS/spiritbuun/Qwen3.6-27B-DFlash-GGUF/dflash-draft-3.6-q4_k_m.gguf" \
--spec-type dflash \
--spec-dflash-cross-ctx 1024 \
--no-mmproj-offload \
--mmproj "$MODELS/unsloth/Qwen3.6-27B-GGUF/mmproj-F32.gguf" \
-np 1 \
--kv-unified \
-ngl all \
--spec-draft-ngl all \
-b 2048 \
-ub 256 \
--ctx-size 256000 \
--cache-type-k q8_0 \
--cache-type-v q8_0 \
--flash-attn on \
--cache-ram 0 \
--jinja \
--no-host \
--metrics \
--log-timestamps --log-prefix --log-colors off \
--reasoning on \
--chat-template-kwargs '{"preserve_thinking":true}' \
--temp 0.6 --top-k 20 --min-p 0.0 \
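If you want to reproduce a quick check like this yourself, the server speaks the usual llama-server OpenAI-compatible API; a minimal sketch, assuming the default port 8080 since the config above doesn't set one (throughput then shows up in the server log's eval-time lines):
curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"messages": [{"role": "user", "content": "Make a single page worm game."}], "temperature": 0.6}'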
coherentspoon@reddit
Thanks for the info! I tried it out and got about 180 tps. When going with the turbo cache I got about 170 tps.
Anbeeld@reddit (OP)
Thank you for the kind words. The numbers are for very short context, I presume? I am planning to follow up with long-context optimization, but for v0.1.0 the target was to get it working and have some results. So honestly there's quite a fall-off the deeper you go (which is obviously rooted in the same fall-off in the baseline), which I view as a new milestone to improve on.
Chromix_@reddit
Did the MRs for this get rejected on the original llama.cpp, or is the MR flow just so slow (read: "takes a week") that it made more sense to make a fork?
In any case, with this demonstrating that it runs (fast), it might help get this into regular llama.cpp.
Anbeeld@reddit (OP)
Initially I just wanted a working solution for myself; for example, I did memory-leak PRs into Luce DFlash before that, but their tool turned out to be quite broken in everything else too. Then I randomly saw buun's fork, but it was far from ideal in terms of DFlash, so I planned to fix it and PR into it. Over time the list grew so large and so opinionated that I decided to just release it as a separate fork, because I felt like I had already done a ton of work and I just wanted to ship it so I could go touch some grass.
So, honestly, fiddling with llama.cpp bureaucracy and anti-AI policy was never on the table. I mean, even basic TurboQuant is not in llama.cpp yet, and this fork doubles down on that and freely renames shit left and right. But also I kinda didn't want my stuff to become one more "there's a PR for that in llama.cpp".
HFT0DTE@reddit
I agree 100% with your approach as I had basically TurboQuant's PR from llama.cpp further customized, merged and tested within 24 hours and have been using it in my production work ever since. There's just no time for a lot of the other llama bs - thx for Bee btw
k_means_clusterfuck@reddit
Yeah, llama.cpp's anti-AI policy is really something. I get that you want a way to manage slop PRs, of course, but micromanaging how people work is not the way. On the flipside, ~~I've~~ my agent's been able to make a handful of (actually good) contributions to vllm and vllm-omni, and the maintainers' attitude was just like: if your PR is good, it doesn't matter. They have been really constructive and working with them has honestly been a joy. Two completely different worlds.
Anbeeld@reddit (OP)
Coincidentally, of these two projects, vLLM is the one with both higher performance and a wider feature set.
Chromix_@reddit
Thanks for making it happen still. Yes, the AI policy is a rather slippery slope, yet they've had their fair share of low-quality code PR'ed; those rules were established to reduce the load on the reviewers and maintain code quality.
So basically the issue is that "making it happen" took too long, if done in a maintainable way in the llama.cpp codebase. ik_llama.cpp diverged quite a bit and a few things are ported over. With the fork history here it probably needs quite a bit of refactoring, not just porting it over, but maybe it'll happen eventually.
Anbeeld@reddit (OP)
It's also kinda how I am, I come up with various projects all the time and release them as my own responsibility and my own maintenance burden, so for me it was just a natural way, in a sense.
henk717@reddit
It's the nature of it that means it will never get merged upstream. They don't want these massive vibe-coded codebases.
The TurboQuant fork had massive vibe coding, so does buun, and this beellama one was a single commit so I can't clearly tell what was done with that one and by whom, but I'd be surprised if there is no vibe coding involved.
So it's a vibe-coded fork on top of a vibe-coded fork, and possibly another vibe-coded fork on top. None of that will ever land upstream.
Mashic@reddit
For such critical software that is the de facto standard for local LLMs, I'd rather it be developed manually, with the developers knowing the ins and outs of the software, than via fast vibe-coding that accumulates tech debt.
ebolathrowawayy@reddit
I'm surprised to see redditors here are so anti-AI. LLMs write better code than 99.9% of humans now if steered by someone with even half a brain who has worked as a SWE for a couple of years. But also... maybe a lot of non-coder enthusiasts are clogging up the pipes, but in that case, if I owned the repo I would just throw agents at the problem and make them reject all the crap.
idk, "vibe coding" seems like a non-problem now with agentic coding filtering out the crap.
draconic_tongue@reddit
It has been said time and time again that AI usage and "vibe coding" are not the problem; it's that if they don't take this stance, their pipeline will get shitted up by people who do not care. Too many of the tools are built in a way that seems to purposely obscure the development process. Claude Code, opencode, and Codex are all unfriendly to use if you want to closely follow the codebase, and tons of tools try to push LLMs to take over a lot of the authoring process, which makes it even easier to miss changes. It all conditions you to not give a shit about your code. VSCode extensions are a bare minimum imo.
ebolathrowawayy@reddit
I see your point but that's an old way of thinking now IMO. I don't review code manually anymore, I have my agents do everything. Only thing I do is check that the behavior of the code is correct, after all of the automated e2e tests finish.
When agents can write code 300x faster than you at a generally higher quality and do the same with refactoring and reviews then it just doesn't make sense to have humans in the loop anymore.
rpkarma@reddit
No, they don’t.
henk717@reddit
It's the maintainability of it; we've accepted vibe-coded PRs for KoboldCpp too.
A really good example is this recent one given to us in a bug report since it wasn't something the creator could easily PR: https://github.com/LostRuins/koboldcpp/issues/2173
This is the good kind of AI-assisted coding, where the maintainer isolated / understands the change; it's only a few lines different, and if he'd said he did this manually I'd have believed him.
The 502 page in our router mode I also let Qwen generate, since it's just a quick way of getting something that looked nice and worked well (I did specifically say what it needed to adhere to). I of course then check whether the code is sensible, and since it was just a single page whose code I understand, I can then PR it to KoboldCpp.
It becomes a problem when it's endless vibe-coded PR upon endless vibe-coded PR, where the submitters / maintainers have to take the AI's word for all the massive changes. Those I don't believe in, and those are the kinds of PRs we reject.
Upstream llamacpp is the same way, you can use AI to assist in your coding but you have to be able to explain every line of the code yourself in case they have questions. That's a bar that most of these turboquant forks can't hit.
YearnMar10@reddit
GG does not like vibecoded contributions to llama.cpp
politerate@reddit
Personally, I find the idea of doing an MR I don't fully understand very off-putting. And I am quite sure that 99% of these types of contributions are of this kind.
Anbeeld@reddit (OP)
Doesn't automatically mean that 99% of them are harmful. It's mostly a maintenance problem for popular repos.
ArtfulGenie69@reddit
If you want problems in your massive code base, the best place to start is blindly dropping in code no one ever looked at.
Anbeeld@reddit (OP)
Nowhere in my comment did I say anything about blindly dropping in code no one ever looked at. In fact, I said the opposite by framing it as a maintenance problem.
Fresh-Letterhead986@reddit
that is a crazy take.
If you want to start a new project and vibe it, cool. Merge anything, because you've set the ground rules as such; you're accepting the potential problems, and frankly it's yours.
but saying "yo bro comeon be cool man why wont you take my AI slop into your keystone-of-the-AI-world, tip-of-the-spear in human tech frontier codebase??????????"
yes "it's mostly a maintenance problem". notice you're not volunteering to do said maintenance ;-)
Anbeeld@reddit (OP)
I literally never said any of this, and created a fork instead of PRs. Not even a nice try from your side.
srigi@reddit
Look at OpenClaw and where eagerly accepting vibe-coded contributions got them. They had like 200 contributors on every release, until the project almost collapsed. I like GG's philosophy/ruling more. But yes, the project moves at a very slow pace now.
segmond@reddit
I have tried various forks for DeepSeek V4 support that were vibed. Every single one of them crashes when I start passing in parameters. Performance is abysmal: CPU-level performance for something that is 100% loaded on GPU, <10 tk/sec TG, 40 tk/sec PP. Just an all-around mess. It would be a disaster to accept any of these and hope they get fixed in the future. Worst of all, plenty of them touch already-baked code, hacking around FA etc., which will probably introduce regressions and break other models.
LegacyRemaster@reddit
The fork I made of antirez's gets to 17 t/s on CUDA, but the problem is that DS4 seems to be "not considered" currently.
Velocita84@reddit
And for very good reasons
dsanft@reddit
It's not about running fast, it needs to be good quality and demonstrated as such. Any idiot can make it run fast, it takes effort to get it to be fast and correct.
Chromix_@reddit
The good thing is that correctness and speed can both be tested, by comparing KLD, benchmark scores and well, tokens per second. If correct, there'll at least be code that's "just" not in the shape that fits llama.cpp (yet). As long as the correctness topic is unknown it'd probably not be very motivating to bring it into shape.
coherentspoon@reddit
Thanks very much for this amazing work! Went from 120 t/s on MTP to about 180 t/s and using a better quant!
leonbollerup@reddit
Have you done any quality comparison between the generated text and a clean, plain 27B?
Anbeeld@reddit (OP)
What do you mean by clean plain 27B? The cleanest and plainest of them just won't fit into my hardware. As for quants, I'm planning to do some deep-ish testing of cache types.
leonbollerup@reddit
OK, they fit on mine. If you want, we could do some quality comparison on the same prompt.
Thomasedv@reddit
I got this working but there seems to be a bug. I was using this in Qwen code and sometimes tool calls would just be printed out in the chat and end. On some occasions the chat also just stopped or ran for a while and not print anything while the server was still processing.
I tried another model to be sure, but if I had to guess, it might be that if the speculative decoding happens around a tool call, something goes wrong? I haven't had this issue before, and it might be caused by something else in the fork, but it seems good so far after dropping the speculative decoding part. It got increasingly common as context grew. I didn't get it too high either; my max was 120k.
patricious@reddit
Same on my side, compiled the build and followed everything to a T (5090, 200k context). In OpenCode, I prompt it to analyze a specific section of my code; it starts calling the right tools (in this case Serena), then it just stops in its tracks. I then tell it to continue and it stops again. Might be something wrong with the chat templates, but I am not sure.
Anbeeld@reddit (OP)
Updated the repo, most tool calls should be fixed. The issue may still happen, but much more rarely now, and it's caused by quality loss from model/cache quantization rather than incorrect integration with DFlash. Tried a number of 100k+ chats with a lot of file reading; seems to work alright.
Anbeeld@reddit (OP)
Updated the repo, most tool calls should be fixed. The issue may still happen, but much more rarely now, and it's caused by quality loss from model/cache quantization rather than incorrect integration with DFlash. Tried a number of 100k+ chats with a lot of file reading; seems to work alright.
Anbeeld@reddit (OP)
Yes, I've noticed this issue too when I was trying some more deep code analysis today. Will investigate.
caetydid@reddit
Feedback:
I've made it build; had to fix several trivial errors and ended up disabling tool building entirely instead of fixing it all.
/home/holu/beellama.cpp/build/bin/llama-server \
-m "/home/holu/llama.cpp/models/qwen3.6-27b/Qwen3.6-27B-IQ4_XS.gguf" \
--mmproj "/home/holu/llama.cpp/models/qwen3.6-27b/mmproj-F32.gguf" \
--spec-draft-model "/home/holu/llama.cpp/models/qwen3.6-27b/Qwen3.6-27B-DFlash-IQ4_XS.gguf" \
--spec-type dflash \
--spec-dflash-cross-ctx 1024 \
--port 8082 \
-np 1 \
--kv-unified \
-ngl all \
--spec-draft-ngl all \
-b 2048 -ub 256 \
--ctx-size 262000 \
--cache-type-k turbo4 --cache-type-v turbo3_tcq \
--flash-attn on \
--cache-ram 0 \
--jinja \
--no-mmap --mlock \
--no-host --metrics \
--log-timestamps --log-prefix --log-colors off \
--reasoning on \
--chat-template-kwargs '{"preserve_thinking":true}' \
--temp 0.6 --top-k 20 --min-p 0.0 \
--host 0.0.0.0 --port 8888
Over 100 t/s on the first request, drops very quickly to 50 and later 30, then OOM. I ran it on my RTX 3090.
Anbeeld@reddit (OP)
Can you elaborate on what issues you had, preferably in the form of a GitHub issue? I don't know if you are on Linux or not, but on Windows I had everything building just fine, and the resulting binaries (available in the releases) worked just fine on another PC as well.
As for the OOM, you're loading the massive 1.76 GB mmproj model into VRAM instead of offloading it to CPU as per the recommended config, which starves you of the VRAM required to run DFlash and 256k context.
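In config terms that's a one-flag change to the command above, using the same flag the other configs in this thread already use:
--mmproj "/home/holu/llama.cpp/models/qwen3.6-27b/mmproj-F32.gguf" \
--no-mmproj-offload \
# keeps the 1.76 GB projector in system RAM instead of VRAM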
caetydid@reddit
Running under Ubuntu. Yeah, thought about that, too, and will first retest with --no-mmproj-offload.
I assumed that using the IQ4 quant saves the necessary VRAM, and my consumption on startup was 21G, but maybe VRAM consumption just increases later on.
I haven't been using much context though, maybe 20k or less.
Anbeeld@reddit (OP)
In llama.cpp the VRAM for context is mostly allocated on startup; later OOMs are caused by smaller moving pieces when you are already on the edge. Well, at least on Windows, but it should work the same everywhere.
caetydid@reddit
thanks for your reaction. I will need to play more with that, alas, useful bug reporting takes its time.
In pi agent I experience context degradation after 50k, i.e. tool calling does not work reliably any more, and the agent stops half-way in its tasks.
Maybe I need to adjust my prompts and/or skill.mds?
I switched to the Q5 and the bf16 mmproj - no crashes any more so far - however, I did not exceed full context yet.
Anbeeld@reddit (OP)
There's a bug with DFlash breaking tools at large context (basically prediction ruining the calls), I'm working on fixing it right now.
caetydid@reddit
great to hear! thanks for your effort!
Anbeeld@reddit (OP)
Updated the repo, most tool calls should be fixed. The issue may still happen, but much more rarely now, and it's caused by quality loss from model/cache quantization rather than incorrect integration with DFlash. Tried a number of 100k+ chats with a lot of file reading; seems to work alright.
Pablo_the_brave@reddit
For Vulkan, VRAM for context is allocated on startup (mostly), but for CUDA there is some additional allocation even if you set batch sizes.
wowsers7@reddit
This looks great. Any chance you could add support for Intel GPUs & iGPUs? That would be amazing.
Sabin_Stargem@reddit
Speaking for myself, I would like to see this implementation integrated into a KoboldCPP fork, so that I can try out TQ4 and see if it is worthwhile. A TurboKobold, if you would.
The appeal of KoboldCPP is that it is a gui-based method of running LlamaCPP for Windows & Linux, that is open source and doesn't require much fiddling to run, all while leveraging VRAM+RAM. Good for people who fear and hate the terminal, like myself.
bonobomaster@reddit
KoboldCPP is the worst of the worst in regards to UI design.
Absolutely not worth it, in my opinion.
Alex_L1nk@reddit
>TQ mentioned
>instantly loses interest
ah, yes, vibecoded project based on another vibecoded project, we are reaching new level of spreading BS on GitHub
Anbeeld@reddit (OP)
Thanks for your valuable input. It is quite baffling that you lost interest but still left a comment. Did a TurboQuant-based AI cyborg murder your family, or why are you so active in shitting on it?
henk717@reddit
TurboQuant is just associated with vibe-coded forks at this point. The moment you see TurboQuant + llama.cpp there is just a 90% chance of that. It also instantly makes me assume it's just another one of those.
Anbeeld@reddit (OP)
I don't quite understand what the problem is with AI-assisted development in itself, without even checking the substance of the work. It's one thing if it's worthless slop, but what if it has something of value to offer?
Last time I checked, this was a community about AI. What are you guys running local AI for, then?
ebolathrowawayy@reddit
People are just getting left behind and want to feel like they still matter. I don't even manually review code anymore.
henk717@reddit
The problem is maintainability; nothing wrong with AI-assisted development. It's when it looks like fully AI-driven development that you tend to get changes that become problematic down the line.
Anbeeld@reddit (OP)
I totally understand the problem with unmaintainable PRs making it hell for reviewers. What I don't understand is people disregarding everything vibecoded, even when it's just a separate product or fork that doesn't put responsibility on anyone but the author. The result is either good or bad; if it's bad, then the author can likely fix it, but no one is asked to read and verify the code, so no one's hurt.
I don't know all the drama around TurboQuant regarding PRs. I'll be completely honest, I have much better things to do than read through it, one of them being using the TurboQuant fork, which from my outside perspective doesn't make anyone's life worse by just existing.
Same with my project: even if I do use a lot of AI, how does that hurt people so much that they come to the comments here just to write about it? I don't agree with the llama.cpp policy, so I just... didn't annoy them with PRs and made it a fork? So circling back, even if my project is "just another one of those", I don't get why that is immediately a problem. People are literally free to use or not use it.
Pablo_the_brave@reddit
Generally, the TurboQuant types from TheTom are weak in classic perplexity tests. But when you look at https://qwen3-6-27b-benchmark.vercel.app/ there is clearly some benefit (though IMHO the asymmetric setup isn't really good for Qwen3.6). For me, the most interesting is turbo3 - not good, not terrible.
Alex_L1nk@reddit
First, it's my fourth comment on TQ. Second, wake me up when TQ is properly benchmarked against f16/Q8/Q4 both in quality (not just PPL) and speed. Shitting? No, I'm just skeptical, because the only bench I saw was from TheTom's repo, which has zero words written by a human being. And even in his tests TQ was on the same level as Q4 while being slower.
Anbeeld@reddit (OP)
Isn't it a bit contradictory to state that TQ was not properly benchmarked but also presume by default that it's bad?
Alex_L1nk@reddit
Speaking of being contradictory... Why are you making bold claims of "near-lossless" quants using untested tools? If you run proper tests like were done in this PR (PPL, KLD and AIME comparison) and prove that TQ is worth it, then you'll get the respect of the whole community.
imgroot9@reddit
Well, I ran all the kinds of tests you mentioned (PPL, KLD, AIME) using TurboQuant and couldn't find anything that would've proved I can't use it (27B and Q5 - just take a look at my post with the results). Also, whatever test I try from this thread (chess SVG, etc.) and my everyday experience all prove that it's all right.
Alex_L1nk@reddit
I appreciate your effort, but I don't think using quantized weights is fair, because it adds a lot of noise on top of the bench results. You're testing KV AND weight quants. I think more reliable tests should be done on f16/bf16, and correct me if I'm wrong, but you only tested on AIME once? Because pwilkin noticed that it's quite random and should be done under equal conditions to properly compare results. [read this comment and GG's answer]
Anbeeld@reddit (OP)
Because I borrowed the upstream implementation and their claims. I do agree that it would be more correct to verify them rather than wait for someone to "wake me up", so I'm planning to do just that when I have time.
Alex_L1nk@reddit
Where did I say that TQ is bad?
r00x@reddit
The sheer audacity of complaining about people using AI to do things on a subreddit about using AI to do things... I can't even, I'm ded.
Anyway thanks for sharing this OP, it rocks! My main worry was whether it would screw up tool calling but it seems fine so far.
When you said you got up to 130 tok/s, what configuration was that with, exactly? By my eye, on Q5_K_S with a Q4_K_M dflash it seems more like 40-50 tok/s maybe. Prompt eval is ~130 tok/s though, yeah.
Anbeeld@reddit (OP)
Thank you for giving it a try.
130 tok/s is shameless benchmaxxing just to show off the peaks: a short prompt to write a linked list in Python without explanations, basically something where a drafter can predict the next tokens very reliably.
I already did some stuff that helps with long context (tool handling, adaptive draft-max, various optimizations of drafting), but that's not enough yet; I'm planning to focus on improvements specifically for it going forward.
No_Seaworthiness9278@reddit
What??!
korino11@reddit
Nvidia only?!?
LegacyRemaster@reddit
I'm starting tests now on RTX 6000 Pro. If you have the time and inclination, check out https://github.com/Fringe210/llama.cpp-deepseek-v4-flash-cuda . I'm up to 17 tokens/sec, but I'm sure you can do better.
LegacyRemaster@reddit
Very good. From 57 t/s to 87 with a Q8 draft.
Anbeeld@reddit (OP)
Thanks for giving it a try!
LegacyRemaster@reddit
Obviously, prediction works better in certain contexts. With thinking, the benefit is lower. While writing HTML or Python code, I've had peaks of 100 t/s.
Anbeeld@reddit (OP)
Yes, such are the limits of prediction as a technology. But there's a lot to optimize in how it's applied, which is what I plan to do over time.
caetydid@reddit
The first fork where I've succeeded in getting measurable speedups - so I remain curious and will stay committed to following future commits.
Anbeeld@reddit (OP)
Glad to hear. More to come!
Avendasora@reddit
Can this be used with multi GPU? 5080 + 3060?
Anbeeld@reddit (OP)
Unfortunately I don't have the hardware setup myself to test this, but based on upstream info the DFlash itself should work. Tree mode is not yet compatible, but at this point I wasn't able to fix its performance so it's kinda irrelevant at the moment.
NickCanCode@reddit
Doesn't work for me. It gives
whenever I make a request.
P.S. Using 2 identical cards.
Anbeeld@reddit (OP)
Thanks for info, will investigate.
r00x@reddit
That would be interesting, if I could run the model on my 3090 and then stuff an old 2060 in for KV cache or something (as this comment probably reveals, I've never tried multi-gpu and have no idea how one should go about it. Presumably doing this is better than sharing model layers between GPUs, at least)
Anbeeld@reddit (OP)
From what I've read, any multi-GPU setup is useful even if the GPUs are heavily mismatched like in your 3090 and 2060 example, as the 3090 really can do so much more if you give it just a bit more VRAM, and the need to quantize everything heavily partially goes away. On a good motherboard both slots might run at PCIe x8, which is quite decent for LLM data exchange. I'd like to dip into that myself when I have some free cash.
SectionCrazy5107@reddit
Will this work on V100?
the_koom_machine@reddit
Anbeeld? you're the guy from Victoria 3 AI mods?
Anbeeld@reddit (OP)
Yes! :D
IrisColt@reddit
I kneel
IrisColt@reddit
Absolutely game changing... Thanks!!!
pmttyji@reddit
It would be nice to have a Vulkan (and also CPU-only, for old systems) version of this.
devedse@reddit
Will there also be a docker build / support for Intel Arc GPU?
Regular-Forever5876@reddit
OP should REALLLLLY post some PP (prompt processing) benchmarks, as generation speed is by far the least important thing when it comes to agentic usage 🙂
Is it planned?
Anbeeld@reddit (OP)
I'll do that after next set of benchmarks, but it shouldn't be much different from baseline llama.cpp, or at least it felt similar to me.
Potential_Block4598@reddit
What about AMD & the Strix Halo ?!
Sofakingwetoddead@reddit
Waiting on a 9700 to show up within the next few days. I will test
Anbeeld@reddit (OP)
Unfortunately I just don't own an AMD GPU, and neither does anyone I know, but I'll happily fix whatever issues there are if the community would be so kind as to describe them in detail.
thenaquad@reddit
Great work! I see ~2x with 27B, using it with OpenCode.
For those trying to build on the recently updated Arch Linux and experiencing C errors:
P. S. Repo has no issues enabled, so I couldn't post there.
Anbeeld@reddit (OP)
Sorry, I completely missed the issue with issues! Enabled them.
legatinho@reddit
OK, I got a Windows setup to test this out with a 3090 - is that what you used? What do PP and TG look like at filled-up context?
Anbeeld@reddit (OP)
Yes, I used the same. Honestly, large context is a bit of a struggle right now; it works, but there's no crazy speedup like in short coding tasks. v0.1.0 was mostly about making it work, and in some cases work very well; for the rest I'll continue working on it over time.
legatinho@reddit
I was also thinking of getting a cheap video card for the main display, so I can use the full 24 GB of the 3090 for this. I noticed Windows sometimes tends to push stuff into the shared memory space, and I wonder if that's why we experience slowdowns. I'll report back if there are any improvements, but thanks for your work on this!
Anbeeld@reddit (OP)
I presume you don't have iGPU in the processor?
dsanft@reddit
Have you done any measurements for your TQ implementation in terms of comparing e.g. KLD of final LM_HEAD of the forward pass of FP16 vs Q8 vs your TQ modes? You claim almost lossless, what are the actual numbers?
Anbeeld@reddit (OP)
The TurboQuant implementation comes from upstream. I'm planning to do cache benchmarks of my own before settling on a final configuration, but for now I was focused mostly on DFlash.
dsanft@reddit
Well, you claim almost lossless; you need to measure that. I think you'll find it's BS, because I've measured it and 4-bit TQ is pretty bad. I encourage you to get the numbers.
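For what it's worth, the kind of numbers being asked for can be gathered with the stock llama-perplexity tool; a rough sketch, assuming the fork keeps upstream's --kl-divergence options and that its turbo cache types are accepted through the usual --cache-type-k/-v args (paths and dataset are placeholders):
# 1) save reference logits with an unquantized KV cache
llama-perplexity -m Qwen3.6-27B-F16.gguf -f wiki.test.raw --kl-divergence-base ref-logits.bin
# 2) rerun with the compressed cache and report KLD against the reference
llama-perplexity -m Qwen3.6-27B-F16.gguf -f wiki.test.raw \
  --cache-type-k turbo4 --cache-type-v turbo3_tcq \
  --kl-divergence-base ref-logits.bin --kl-divergence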
Anbeeld@reddit (OP)
I agree that parroting upstream claims is not ideal there, will do my own tests.
Sabin_Stargem@reddit
Hopefully, projects like this will prove the worth of TQ+ and DFLASH so that they can become part of mainline LlamaCPP.
Velocita84@reddit
TQ is already proven to be crap.
imgroot9@reddit
Thanks for this! I agree; Q5 dense models work for me too, without any issues with TurboQuant. If I have to choose between Q4 with a small Q8 cache or Q5 with a huge TurboQuant cache, Q5 wins hands down for common programming tasks.
Anbeeld@reddit (OP)
I agree, people are for some reason focused solely on cache quality, while with Q4 models there's not as much quality to preserve in the first place.
floconildo@reddit
This seems interesting and legit. I'll give it a whirl on my 4090 to see how it behaves, and I'll also keep an eye on the project to see if it doesn't die in a week or so.
Not related to the project's goal itself, but worth mentioning: you'll get a lot of backlash for using AI so extensively. Try to either not answer those comments or at least be understanding of the community as a whole. There's a general exhaustion with AI projects, as we've been flooded with "that's why I built X" posts containing nothing but slop solving issues the developer couldn't be bothered to research, nor to understand what's out there already.
The flashy post that looks like it's trying to "sell" it certainly doesn't help. My monkey brain immediately categorized this as "tech bro can't RTFM nor wants to play by the rules" and it took me some effort to go through it.
Anbeeld@reddit (OP)
I just always cared about how I present the project, AI or not, but I guess doing that is slop too now.
floconildo@reddit
Just giving you some honest feedback bro.
When every other post you see everywhere looks extra polished, our brains will just clump everything together. When that meets a community that is frankly exhausted by tech claw crypto bros, you'll find some backlash for sure, and this kind of attitude will just make it worse for you.
Anbeeld@reddit (OP)
Yes, I understand, thank you for input. I've no problem with people pushing against AI slop, or even non-slop. I just saw this guy shit on TurboQuant specifically every time it's mentioned so I found it quite amusing.
floconildo@reddit
Yeah I understand the guy tho. A shit ton of entitled ppl complaining about features in llama.cpp with zero stakes in the project itself and zero will to pull up their sleeves and actually contribute to the community. I'd be skeptical too.
Just watch out not to let it drown your own project. Community building is hard, community management is even harder.
Anbeeld@reddit (OP)
I understand the points but personally I don't get why llama.cpp ignores TurboQuant, even if it's flawed in one way or another.
Alex_L1nk@reddit
Here is answer from one of maintainers of llama.cpp on TQ
https://github.com/ggml-org/llama.cpp/pull/21089#issuecomment-4187393635
Anbeeld@reddit (OP)
I don't necessarily agree with their contributing guidelines so their personal vengeance is of no interest to me, and besides that there's not much substance in that "answer". This feels barely related to TurboQuant as a technology.
floconildo@reddit
I can think of plenty of reasons:
As you said in another comment: not everyone is willing to go through the bureaucracy of submitting PRs to llama.cpp, especially vibe coders and other zero-stake contributors.
And I honestly think you did the best thing by just rolling up your sleeves and doing it yourself. If your project gets traction and more people start using TurboQuant, then llama.cpp might change their stance or reorder their priorities. Worst case, you've got your own implementation that works (I hope; didn't find time to test yet haha).
Dany0@reddit
As suspected, we can disregard this stack except as a good tech preview, or for people who have a use case for <4096 tok context.
rtx 5090 with q5 k xl
smoke test short prompt 220tok/s, 130tok/s with vision
51k prompt, real (easy) coding task test
vLLM with Q5.6 equivalent q3.6 27b PrismaSCOUT NVFP4 quant with b12x + MTP + vision ±112 tok/s decode 7200k tok/s prefill. 66-83% acceptance rate
beellama q5 k xl (effectively ±5.96bpw) with dflash (using the recommended drafter) + vision + turbo4. 2164 prefill, 60.4 tok/s decode. 33% draft, 35% verify
latest llamacpp q5 k xl with ngram-mod + vision + q8 kv 2600 prefill, 58.2 tok/s decode
Anbeeld@reddit (OP)
Responding here as you continue to edit and delete your comments left and right.
I don't see any reason to be so passively or even actively aggressive about it. Yes, v0.1.0 of my personal OSS project that I don't gain anything from doesn't satisfy all the use cases in the niche, or not as well as mature alternatives like vLLM. I don't think it's that big of a deal, considering I started to work on it because all the other solutions I'd seen people recommending here basically didn't function when I tried them.
andy2na@reddit
They provide actual performance numbers though, between vLLM, mainline llama.cpp, and yours - you just mention peak, and I don't see the number of tokens used. This is all too common with the benchmaxxing community, especially with Qwen3.6-27B and the 3090 - mention peak TPS and nothing else. That's not really useful in real usage, and unfortunately I'm done chasing the high TPS that you guys keep pushing.
Anbeeld@reddit (OP)
It's the average tps in a small coding task with multiple passes, with the number of tokens specified in the table. I think that's decent for v0.1.0, considering there's also a number of edge cases and issues with actual agentic usage solved already. Long-context optimization will follow, and it's easier to do based on community feedback.
marscarsrars@reddit
Cool, Is it Linux friendly?
Anbeeld@reddit (OP)
I'm on Windows myself but I don't see any reason for it to not be.
Dany0@reddit
Running a smoke test on this on a 5090 right now. If you really did get it working, I owe you a coffee.
But I see vibecoded documentation. I can even imagine how you prompted this bullshit. I don't think you stayed up to 4am by pulling an old trick. In the past you stayed up till 4am solving problems. Now you stayed up till 4am having a clanker do most of the work for you, haven't you?
You'll have a lot more rapport with the community if you invest in writing sloppy human technical writing so we can trust you in the first place. A lot of people will skip this because of it.
Anbeeld@reddit (OP)
Responding here as you continue to edit and delete your comments left and right.
I don't see any reason to be so passively or even actively aggressive about it. Yes, v0.1.0 of my personal OSS project that I don't gain anything from doesn't satisfy all the use cases in the niche, or not as well as mature alternatives like vLLM. I don't think it's that big of a deal, considering I started to work on it because all the other solutions I'd seen people recommending here basically didn't function when I tried them.
Anbeeld@reddit (OP)
I have no problem with stating that I used AI a lot here. As I stated in another comment, it was initially just a personal project, not a government contract. It's just that over time it grew so much that I decided to release it, as it might be useful for someone else too. If that's the case, then I think that's more important than the tools used. If it's broken in one way or another, I'll be happy to take the responsibility and continue fixing issues with it.
About the 4 a.m. thing: you can't tell a clanker "implement DFlash end-to-end autonomously, make no mistakes" and expect it won't, with higher probability, wipe your system out of frustration instead. Otherwise I would just do that and go to sleep, I guess.
Dany0@reddit
Yeah and the end result was not worth our time. Thanks!
VoiceApprehensive893@reddit
layers of slop
HumanAlternative@reddit
Is this worth a shot on a MacBook M3 Pro 18GB, or would this just need more RAM, which is already a bottleneck on this machine? If there's a way to get a smart-enough model with enough context to code small projects and chat with web research running locally, I'd love to do so. I've tried a heavily quantized Unsloth Qwen 3.6 (Q2_K_XL) with LM Studio. The output was better than I expected it to be, but it gets slow quickly.
Anbeeld@reddit (OP)
I'll be honest, I couldn't test it on a Mac myself yet, and I'm not very proficient in build differences between platforms outside of what upstream tells me. As I stated in another comment, it was initially just a personal project, tested on my RTX 3090 (and my wife's).
In my experience the resulting DFlash is much more VRAM-friendly than what I experienced with stuff like WSL2+vLLM+MTP, and heavily quantized drafter models perform just fine, which helps too. But there might be issues with how it works on Mac that I couldn't test myself, though I'd be happy to make the solution work 100% on all platforms over time based on feedback.
HumanAlternative@reddit
I appreciate your honesty. This DFlash news sounds so revolutionary, but I don't understand most of it, TBH. I'm just interested in getting a useful LLM running quickly and locally as soon as possible. I guess I'll keep waiting and hope the advancements keep coming at the same speed.
GrungeWerX@reddit
Which version of Qwen 3.6 27B Q5 were you testing? I'm currently using Q5 UD K XL and might be interested in trying yours out, as I'm hitting a cliff around 120K.
Orolol@reddit
From Paradox AI to llama.cpp
Anbeeld@reddit (OP)
Exploring AI in full.
herpnderpler@reddit
I've found the Nvidia kernels for prompt processing in mpt/thetom to be lacking - like... mainline llama.cpp gets 2k tp/s on my 5090, but mpt/thetom seem to have kernels that give me 10-15 tp/s. TG/s is good, but prompt processing sucked.