Decoupled Attention from Weights - Gemma 4 26B

Posted by yeah-ok@reddit | LocalLLaMA | 14 comments

Absolutely, unbelievably exciting work: you split the attention computation (i.e. only a couple of GB of state) onto your local machine and put the weights on another local machine (say, a cheap Xeon box), which basically sidesteps the scale problem with local LLMs completely!! Repo with functional code: https://github.com/chrishayuk/larql
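
To make the split concrete, here's a toy single-head sketch (my own illustration, not larql's actual code or API: the `WeightServer`/`AttentionClient` names, the tiny dimensions, and the in-process calls standing in for the network hop are all assumptions). The idea is that the big, stateless matmuls live wherever the weights are, while the small, stateful KV cache and softmax stay local, so only activation-sized vectors ever cross the wire:

```python
# Toy sketch of decoupling attention (local) from weights (remote).
# NOT larql's implementation; all names/sizes are illustrative.
import numpy as np

D_MODEL = 64  # toy hidden size; a real ~26B model would be several thousand


class WeightServer:
    """The cheap-Xeon side: holds the large, stateless weight matrices."""

    def __init__(self, rng):
        self.w_qkv = rng.standard_normal((D_MODEL, 3 * D_MODEL)) * 0.02
        self.w_out = rng.standard_normal((D_MODEL, D_MODEL)) * 0.02

    def project_qkv(self, x):
        # Big matmul happens where the weights live; only x (D_MODEL floats)
        # and q/k/v come back over the "wire".
        q, k, v = np.split(x @ self.w_qkv, 3)
        return q, k, v

    def project_out(self, attn):
        return attn @ self.w_out


class AttentionClient:
    """The local side: holds the small, stateful KV cache and the softmax."""

    def __init__(self, server):
        self.server = server
        self.k_cache, self.v_cache = [], []

    def step(self, x):
        # 1. Ship the tiny activation to the weight server for the projections.
        q, k, v = self.server.project_qkv(x)
        # 2. Attention state never leaves this machine: grow the cache,
        #    run scaled-dot-product softmax attention locally.
        self.k_cache.append(k)
        self.v_cache.append(v)
        ks, vs = np.stack(self.k_cache), np.stack(self.v_cache)
        scores = ks @ q / np.sqrt(D_MODEL)
        weights = np.exp(scores - scores.max())
        weights /= weights.sum()
        attn = weights @ vs
        # 3. One more small round trip for the output projection.
        return self.server.project_out(attn)


rng = np.random.default_rng(0)
client = AttentionClient(WeightServer(rng))
for _ in range(4):  # decode a few toy tokens
    out = client.step(rng.standard_normal(D_MODEL))
print("per-token traffic is O(d_model) floats; the weights never move")
```

In a real deployment the two classes would sit on separate boxes behind some RPC layer, but the economics are the same: per-token traffic is a handful of small vectors, while the multi-GB weight matrices stay parked on the machine that has the RAM.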