Why bother with RWKV/Mamba instead of decoder transformers?

Posted by netikas@reddit | LocalLLaMA | 16 comments

The classic transformer is quadratic in both space and time. However, most models now use FlashAttention (FA1/2/3, which is linear in space), and during inference we use KV caching (which makes each decoding step linear in time), so at inference decoder transformers are basically linear.

During training, decoder transformers are obviously quadratic in time, since there is no KV caching, but the context length stays reasonable because they are mostly trained on shorter sequences. To increase the context size, the models are RoPE-scaled and fine-tuned on longer sequences, so n stays small, and the time cost low, for most of the training run.

So why bother with alternative architectures if transformers are already proven and linear? Why are people deriving linear approximations of attention, which perform worse? What is the catch, and what am I missing?

P.S. I am talking about practicality, not research; obviously, more research is better.
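To make the counting in the post concrete, here is a rough back-of-the-envelope sketch. It is a toy cost model, not a benchmark: the function names, `d_model`, and `d_state` are hypothetical, and it ignores projections, MLP layers, heads, and constant factors. It only tallies the attention work per decoding step with a KV cache, alongside a fixed-size recurrent state of the kind RWKV/Mamba-style models keep.

```python
def attention_decode_cost(n_tokens: int, d_model: int) -> tuple[int, int]:
    """Toy cost of decoding n_tokens with a KV cache (one layer, assumed model).

    Returns (total multiply-adds spent in attention, floats held in the KV cache).
    """
    total = 0
    for t in range(1, n_tokens + 1):
        # Step t computes t query-key dot products and a weighted sum over
        # t cached values, so the per-step work grows linearly with t.
        total += 2 * t * d_model
    # The cache stores one key and one value vector per token seen so far.
    cache_floats = 2 * n_tokens * d_model
    return total, cache_floats


def recurrent_decode_cost(n_tokens: int, d_state: int) -> tuple[int, int]:
    """Toy cost of decoding with a fixed-size recurrent state (assumed model).

    Returns (total state-update work, floats held in the state).
    """
    # Assumption: each step touches the whole state once, regardless of position.
    total = n_tokens * d_state
    return total, d_state


if __name__ == "__main__":
    for n in (1_000, 2_000, 4_000):
        attn_work, attn_mem = attention_decode_cost(n, d_model=64)
        rec_work, rec_mem = recurrent_decode_cost(n, d_state=64)
        print(f"n={n}: attention work={attn_work}, cache={attn_mem}; "
              f"recurrent work={rec_work}, state={rec_mem}")
```

Under these assumptions, the per-step attention cost is linear in the current position, the total over a full generation still grows roughly with n squared, and the cache grows with n, while the recurrent state stays the same size. Whether that difference matters in practice depends on the context lengths and hardware involved.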