Why bother with RWKV/Mamba instead of decoder transformers?

Posted by netikas@reddit | LocalLLaMA | 16 comments

Classic transformers are quadratic in both space and time complexity. However, since most models now use FA1/2/3 (which is linear in space) and during inference we use KV caching (which is linear in time), decoder transformers are basically linear during inference. During training, decoder transformers are obviously quadratic in time, since there is no KV caching, but the context length stays reasonable, since they are mostly trained on shorter sequences. To increase the context size, they are RoPE-scaled and finetuned on longer sequences, so n stays small, and the time cost low, for most of the training run.

So, why bother with alternative architectures if transformers are already proven and linear? Why are people deriving linear approximations of attention, which perform worse? What is the catch and what am I missing?

P.S. I am talking about practicality, not research; obviously, more research is better.

16 Comments

hazardous1222@reddit

[https://huggingface.co/recursal/QRWKV6-32B-Instruct-Preview-v0.1](https://huggingface.co/recursal/QRWKV6-32B-Instruct-Preview-v0.1) This is a 32B model; on 4x4090 it can run inference at a batch size of 64 with at least 500 tps of total output. This is why.

netikas@reddit (OP)

Not sure about 4x4090, but I've had Qwen-2.5-32B-Instruct generating at about 600-700 tps in batched inference in vLLM on a single A100, which makes this a marginal improvement rather than a groundbreaking one.

MLTyrunt@reddit

I'd intuit that a more recurrent architecture is closer to how our mind works. Especially with regard to RWKV, but also other architectures leaning more towards Mamba, there is indeed some innovation happening at a fundamental research level. Currently, practically speaking, the transformer is clearly preferable for most uses. But I expect RWKV to do something interesting in the near future. The currently trained version is also no longer merely a linear approximation of attention. The devs of RWKV show some genuine creativity in algorithm design, and people are working on improving the alternatives as well.

Suryova@reddit

If they're a bother, then don't bother with them. The way things stand right now, messing with this kind of stuff is for people who love doing it, not for people who ask "why bother?" If you're just wondering how long it'll be before context handling stops being such a PITA, I'd say the field of NLP needs some more time to check out and keep inventing various ways to bring recurrence back into language processing while preserving the huge progress made by the transformer. This task is nontrivial. However, more of it than one might think is really just a matter of writing the code to implement the required operations efficiently. In some cases implementations actually exist, but only at a preliminary proof-of-concept level, and they need a lot more work. The first example off the top of my head is xLSTM.

konistehrad@reddit

RWKV and Mamba have significantly lower VRAM requirements, to my knowledge, making them more suitable for local inference. Even Nvidia’s Hymba hybrid Transformer/SSM architecture packs a lot of context into not a ton of RAM. That matters especially when Nvidia is using VRAM as a major differentiator.

silenceimpaired@reddit

Do you use either of these models? If so, how?

konistehrad@reddit

Unfortunately, as you’ve probably figured, neither can compete with the monster sized highly trained models of Meta or Alibaba. However, I think that people trying to find solutions, and further shipping them for review, is a net positive more than a net negative.

silenceimpaired@reddit

I’m asking out of interest rather than to make a point. I’m dealing with a large context where it would be helpful if it can hold it together better than the 70b models.

konistehrad@reddit

Ah gotcha, yeah. Unfortunately RWKV is still in the "cooking" phase, and Hymba, while impressive, lacks the training set to make it shine. (Or the overall model size; I'd love to see how a 3B or 7B Hymba performs!) Honestly, it's one of the coolest models I've seen in a while, and I really hope Nvidia keeps iterating on it.

oKatanaa@reddit

Transformers with FA are linear in memory but quadratic in compute during inference; that's the part you got wrong. RWKV and the like are constant in memory and in per-token compute during inference, since they can run as RNNs. That's a huge win (with tradeoffs, obviously), and that's the reason to bother.
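To put rough numbers on the memory side of this, here is a minimal back-of-the-envelope sketch. Every default below (layer count, KV heads, head size, recurrent state size) is a made-up illustrative value, not the config of any real model, and the function names are hypothetical.

```python
def kv_cache_bytes(n_tokens, n_layers=32, n_kv_heads=8, head_dim=128, bytes_per_elem=2):
    """Memory for the cached keys and values of n_tokens tokens.

    The factor 2 covers keys plus values. The result grows linearly with n_tokens.
    All defaults are illustrative assumptions, not a real model config.
    """
    return 2 * n_layers * n_kv_heads * head_dim * n_tokens * bytes_per_elem


def recurrent_state_bytes(n_layers=32, state_elems_per_layer=1_048_576, bytes_per_elem=2):
    """Memory for a fixed-size recurrent state; it does not depend on sequence length.

    state_elems_per_layer is an assumed placeholder, not taken from RWKV or Mamba.
    """
    return n_layers * state_elems_per_layer * bytes_per_elem


if __name__ == "__main__":
    for n in (1_000, 32_000, 256_000):
        kv_mib = kv_cache_bytes(n) / 2**20
        rnn_mib = recurrent_state_bytes() / 2**20
        print(f"{n:>7} tokens: KV cache {kv_mib:.0f} MiB, recurrent state {rnn_mib:.0f} MiB")
```

The point is only the shape of the two curves: one line grows with the context, the other stays flat, whatever the actual constants turn out to be.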

netikas@reddit (OP)

Sure, they are quadratic in compute with *only* FA, but thanks to KV caching, they are linear in compute. About the constant vs linear -- fair enough.

Thick-Protection-458@reddit

> but thanks to KV caching, they are linear in compute

How exactly? To generate the (N+1)-th state it still needs all those multiplications with states 1..N. Thanks to the KV cache it doesn't need to recompute them, but it still needs to process them. So in the end, generation of *one token* is linear, sure. But since each token's generation is linear, the overall sequence is quadratic.
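A minimal counting sketch of that argument, assuming one "read" per cached key/value entry per generated token and ignoring heads, layers, and constant factors. The function names are hypothetical and exist only for this illustration.

```python
def attention_reads_per_step(n_new_tokens, prompt_len=0):
    """Cached entries each decoding step attends over, with a KV cache.

    Step t (1-based) sees the prompt plus the t tokens generated so far,
    including the current one, so the per-step cost grows linearly.
    """
    return [prompt_len + t for t in range(1, n_new_tokens + 1)]


def total_attention_reads(n_new_tokens, prompt_len=0):
    """Sum over all steps; for prompt_len=0 this is n(n+1)/2, i.e. quadratic."""
    return sum(attention_reads_per_step(n_new_tokens, prompt_len))


def total_recurrent_updates(n_new_tokens):
    """An RNN-style model does one fixed-size state update per token: linear total."""
    return n_new_tokens


if __name__ == "__main__":
    for n in (10, 100, 1000):
        print(n, total_attention_reads(n), total_recurrent_updates(n))
```

So the KV cache removes the recomputation of old keys and values, but the sum of linear steps is still quadratic over the whole generation.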

iperson4213@reddit

Prefill is still quadratic, so TTFT can blow up with very large inputs. Of course, there are further optimizations like a persistent KV cache that let similar prefills reuse the cache. That’s why APIs charge less for cache hits from similar prompts.
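A toy sketch of the cache-reuse idea, assuming the cache only helps for an exact shared token prefix. It is not how any particular server or API implements it, and the helper names are hypothetical.

```python
def shared_prefix_len(cached_tokens, prompt_tokens):
    """Number of leading tokens the new prompt shares with the cached one."""
    n = 0
    for cached, new in zip(cached_tokens, prompt_tokens):
        if cached != new:
            break
        n += 1
    return n


def tokens_to_prefill(cached_tokens, prompt_tokens):
    """Tokens that still need a prefill pass after reusing the shared prefix.

    Simplification: a real system may always reprocess at least the last token.
    """
    return len(prompt_tokens) - shared_prefix_len(cached_tokens, prompt_tokens)


if __name__ == "__main__":
    system_prompt = list(range(1000))      # stand-in token ids for a long shared prefix
    first = system_prompt + [5001, 5002]   # earlier request, now cached
    second = system_prompt + [7001]        # new request with the same prefix
    print(tokens_to_prefill(first, second))
```

Only the part of the prompt after the shared prefix pays the prefill cost, which is why a long common system prompt is cheap on a cache hit.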

FullstackSensei@reddit

Even with FA, KV caching, and RoPE, most LLMs do very badly on long contexts. Almost all I've seen fall apart after 32K, whereas Mamba and RWKV seem to hold their context recall ability to 128K or even 256K context. As long as RAG is still a nascent field with no established methodologies that actually work without fail on most use cases, Mamba/RWKV/other SSMs will provide better recall than pure transformer architectures.

Finally, we're finding that better-quality training data will yield orders of magnitude better models (in terms of parameter size). There's no inherent reason why this better data wouldn't yield better SSMs. I'll personally take a Mamba or RWKV that's 2x the size of a transformer model with similar performance any day because of how much more memory- and compute-efficient they are compared to transformers, not to mention their better in-context recall. I also believe the future will see more hybrid architectures because of this memory and compute efficiency.

netikas@reddit (OP)

The NeurIPS paper BABILong shows this: https://arxiv.org/abs/2406.10149 Basically, only Mamba and ARMT (a modified transformer) show good performance up to 1M tokens, despite models like Gemini technically supporting these context sizes. I also remember reading a paper a long time ago which says that for longer sequences SSM models (e.g. Mamba) do not cut it, since they are limited by their latent state size: https://arxiv.org/abs/2402.01032 The paper is ancient, so maybe this is not true for newer models (Mamba v2/v3) and RNNs (I do not remember if they even checked RNNs), but the problem still sounds relevant.

FullstackSensei@reddit

IIRC, that "Repeat After Me" paper was largely discredited not long after its publication. Benchmarks like RULER have shown SSM models perform quite a bit better than attention-based models. Of course, this does not generalize to any/all models, but recent ones like Jamba seem to do very well on the RULER benchmark. Jamba 1.5 Mini can even be run at reasonable speed on CPU and shows very respectable performance on RULER with a 256K context.