Luce Megakernel: Why is nobody talking about this?

Posted by PaceZealousideal6091@reddit | LocalLLaMA | 9 comments

Everyone has been talking about Luce DFlash and PFlash. I just came across their megakernel, which seems to have been released along with DFlash and PFlash. It seems to give them 1.8x greater speed with much better power efficiency on NVIDIA GPUs, comparable to the efficiency achieved on Apple silicon! How is it that nobody is talking about this? They say they developed a method that avoids CPU dispatches at every layer boundary. In llama.cpp (lcpp), the CUDA implementation does about 100 kernel launches per token. The amount of power being used is crazy, especially as people are running powerful multi-GPU setups. Isn't this really huge? Am I missing something?

stoppableDissolution@reddit

Because handwriting kernels per-model (not even per-family) is not remotely feasible?

JumpyAbies@reddit

I think that if, for example, it has support for qwen3.6-27b or gemma-4, it becomes a very attractive option for those who use those models. It would be a solution focused on a smaller scope of models.

dinerburgeryum@reddit

The post goes on to say “Megakernel fusion benefits shrink as model size grows and compute begins to dominate over launch overhead.” Sounds like diminishing returns.
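To put rough numbers on that quote, here is a back-of-the-envelope sketch in Python. It is not from the Luce writeup: the function name `fused_speedup`, the 5 µs per-launch overhead, and the per-token compute times are all assumptions picked for illustration. Only the "about 100 launches per token" figure comes from the original post. The model also ignores anything else a megakernel might gain (for example keeping data on-chip between layers).

```python
def fused_speedup(compute_us, launches=100, launch_overhead_us=5.0):
    """Toy model: per-token time with one launch per kernel, divided by
    per-token time with a single fused launch.

    compute_us         -- assumed GPU compute time per token (hypothetical)
    launches           -- kernel launches per token (~100, per the original post)
    launch_overhead_us -- assumed CPU dispatch cost per launch (illustrative)
    """
    unfused = compute_us + launches * launch_overhead_us
    fused = compute_us + 1 * launch_overhead_us
    return unfused / fused


# Hypothetical per-token compute times standing in for small, medium, large models.
for compute_us in (500, 5_000, 50_000):
    print(f"{compute_us:>6} us compute -> {fused_speedup(compute_us):.2f}x")
```

Under these assumed numbers the ratio falls toward 1x as compute time grows, which is the diminishing-returns shape the quote describes. The actual crossover depends on real launch latency and real per-token compute, neither of which is given in this thread.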

JumpyAbies@reddit

Why didn't you let me dream?? 😆

foomanchu89@reddit

Because:

1. It only works for Qwen 3.5-0.8B right now; it’s not a general inference engine
2. The DFlash story (27B speculative decoding on a 3090) is more immediately practical for people running larger models
3. The Lucebox-Hub project provides paper-style technical writeups and benchmark reports, which is unusual and thorough, but the install involves low-level CUDA compilation that raises the barrier to entry

NickCanCode@reddit

I think they know. They just don't have the time to do everything. Just look at the pull request count on those other projects.