Why bother with RWKV/Mamba instead of decoder transformers?
Posted by netikas@reddit | LocalLLaMA | 16 comments
Classic transformers are quadratic in both space and time. However, most models now use FlashAttention (FA1/2/3), which makes attention memory linear in sequence length, and inference uses KV caching, which makes the cost of each generated token linear in the context length. So during inference, decoder transformers are basically linear.
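To make the KV caching point concrete, here is a minimal counting sketch, not a benchmark. It counts query-key dot products for a single attention head in a single layer. The function names are hypothetical, and it assumes that without a cache the full causal attention matrix is recomputed at every step.

```python
def decode_step_scores(t: int, use_kv_cache: bool) -> int:
    """Query-key dot products needed to produce token t (1-indexed).

    Assumption: one head, one layer, causal attention.
    """
    if use_kv_cache:
        # Only the newest query is compared against the t cached keys.
        return t
    # No cache: every position's query is recomputed against all earlier keys.
    return t * (t + 1) // 2


def total_scores(n: int, use_kv_cache: bool) -> int:
    """Dot products summed over a whole generation of n tokens."""
    return sum(decode_step_scores(t, use_kv_cache) for t in range(1, n + 1))


if __name__ == "__main__":
    for n in (1_000, 10_000):
        print(n, decode_step_scores(n, True), total_scores(n, True))
```

Under these assumptions the per-token cost with a cache grows linearly with the position, while the total over a generation of n tokens is n(n+1)/2, so "linear" here describes each decoding step rather than the whole sequence.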
During training, decoder transformers are obviously quadratic in time, since there is no KV caching, but the context length stays reasonable because they are mostly trained on shorter sequences. To increase the context size, they are RoPE-scaled and fine-tuned on longer sequences, so n stays small, and the time cost stays low, for most of training.
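The training argument can be sketched with rough arithmetic. Attention over a sequence of length n costs on the order of n^2, so the attention cost per training token scales with n. The token counts and context lengths below are made-up illustrative numbers, not figures from any real training run, and the function name is hypothetical.

```python
def relative_attention_cost(stages):
    """Relative attention cost of a training schedule.

    stages: list of (num_tokens, context_length) pairs.
    Assumption: attention cost per token is proportional to context length,
    because a sequence of length n costs about n^2 and contains n tokens.
    """
    return sum(tokens * ctx for tokens, ctx in stages)


# Hypothetical schedule: 95% of tokens at 4k context, 5% at 32k context.
staged = [(950, 4096), (50, 32768)]
# Hypothetical baseline: every token trained at 32k context.
all_long = [(1000, 32768)]

ratio = relative_attention_cost(staged) / relative_attention_cost(all_long)

if __name__ == "__main__":
    print(f"staged schedule costs {ratio:.1%} of the all-long baseline")
```

With these made-up numbers, the staged schedule pays roughly 17% of the attention cost of training everything at the long context, which is the sense in which n stays low for most of training.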
So why bother with alternative architectures if transformers are already proven and linear? Why are people deriving linear approximations of attention, which perform worse? What is the catch, and what am I missing?
P.S. I am talking about practicality, not research — obviously, more research is better.