Luce Megakernal: Why nobody is taking about this?

Posted by PaceZealousideal6091@reddit | LocalLLaMA | View on Reddit | 9 comments

Everyone has been taking about Luce DFlash and PFlash. I just came across their megakernal and it seems it was released along with Dflash and PFlash. It seems it's giving them 1.8x greater speed with much more power efficiency on nvidia gpu comparable to the efficacy achieved on apple silicon! How's it that nobody is talking about this? They say that they developed a method of avoiding cpu despatches between every layer boundaries. In lcpp, there are about 100 kernal launches per token for CUDA implementation. The amount of power being used is crazy especially as people are using powerful multi gpu setup. Isn't this really huge? Am I missing something?