Decoupled Attention from Weights - Gemma 4 26B
Posted by yeah-ok@reddit | LocalLLaMA | View on Reddit | 14 comments
Absolutely, unbelievably exciting work: split the attention (i.e. a couple of GB) onto a local machine and the weights onto another local machine (say a cheap Xeon) to basically bypass the scale issue with local LLMs completely!! Repo with functional code: https://github.com/chrishayuk/larql
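To make the idea concrete, here is a minimal sketch of the split being described: attention computed locally, the bulky feed-forward weights held by a second machine, and only the per-layer hidden state crossing the wire. This is not larql's actual code or protocol (see the repo for that); the shapes, host address, and JSON-over-TCP framing are placeholder assumptions.

```python
# Sketch only: attention stays local, the multi-GB FFN weights live on a
# remote box, and only activations travel per layer, per token.
import json
import socket

import numpy as np

D_MODEL = 4096                         # assumed hidden size
WEIGHT_HOST = ("192.168.1.50", 9000)   # hypothetical machine holding the weights


def local_attention(hidden: np.ndarray, wq, wk, wv, wo) -> np.ndarray:
    """Single-head attention over the sequence, computed on the local machine."""
    q, k, v = hidden @ wq, hidden @ wk, hidden @ wv
    scores = q @ k.T / np.sqrt(hidden.shape[-1])
    probs = np.exp(scores - scores.max(-1, keepdims=True))
    probs /= probs.sum(-1, keepdims=True)
    return (probs @ v) @ wo


def remote_ffn(hidden: np.ndarray) -> np.ndarray:
    """Ship the hidden state to the weight host and get the FFN output back."""
    with socket.create_connection(WEIGHT_HOST) as conn:
        conn.sendall(json.dumps(hidden.tolist()).encode() + b"\n")
        reply = conn.makefile().readline()
    return np.array(json.loads(reply))


def decoupled_layer(hidden, wq, wk, wv, wo):
    # Attention is cheap and stays local; the heavy feed-forward weights never
    # leave the remote machine -- only the hidden state makes the round trip.
    hidden = hidden + local_attention(hidden, wq, wk, wv, wo)
    hidden = hidden + remote_ffn(hidden)
    return hidden
```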
the__storm@reddit
...might as well offload to disk - this is going to be slow as balls
jacek2023@reddit
How is it different from RPC?
yeah-ok@reddit (OP)
One of the amazing outcomes of this is that low-VRAM, high-compute consumer cards like the 12GB 5070 would essentially be way overpowered for most models, since they would suddenly "only" need to run 2-4 GB of attention layers. The rest could presumably sit under the table on a "cheap" external Xeon with 128 GB of DDR4 to hold the weights!? Interconnect via regular high-speed TCP/IP over Ethernet, and Bob could be your uncle.
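For a rough sense of that split, here is a back-of-envelope parameter count. The dimensions are generic assumptions for a ~26B decoder with grouped-query attention, not Gemma's published config.

```python
# Back-of-envelope check of the "only 2-4 GB of attention" claim.
# All model dimensions below are assumptions for illustration.
n_layers, d_model, n_heads, n_kv_heads = 48, 5120, 40, 8
d_head = d_model // n_heads
d_ffn = 4 * d_model

# Per-layer attention weights: Q and O are d_model x d_model; K and V are
# shrunk by grouped-query attention to d_model x (n_kv_heads * d_head).
attn_per_layer = 2 * d_model * d_model + 2 * d_model * n_kv_heads * d_head
# Per-layer feed-forward weights: gate, up, and down projections.
ffn_per_layer = 3 * d_model * d_ffn

for name, bytes_per_param in [("fp16", 2), ("4-bit (~0.5 B/param)", 0.5)]:
    attn_gb = n_layers * attn_per_layer * bytes_per_param / 1e9
    ffn_gb = n_layers * ffn_per_layer * bytes_per_param / 1e9
    print(f"{name}: attention ≈ {attn_gb:.1f} GB, feed-forward ≈ {ffn_gb:.1f} GB")

# With these assumptions attention weights are a single-digit-GB slice
# (≈6 GB at fp16, ≈1.5 GB at 4-bit), while the feed-forward weights are the
# ~7-30 GB bulk that would sit in the Xeon's DDR4. The GPU still needs room
# for the KV cache on top of this.
```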
Party-Special-5177@reddit
There are multiple possible limiting factors of inference, most famously flops and memory bandwidth, and this proposal introduces network latency.
You know the layers don’t run in parallel right? They are sequential and blocking, since a layer requires all previous layers’ computation to perform its own.
The only way this could make sense is if the network latency of sending + receiving a hidden state beats the latency of alternatives (e.g. computing the layer in RAM on CPU). This does scale better in some circumstances, but I’m just worried the inflection point is e.g. a 1T model in fp8 or something similarly silly.
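To put rough numbers on that trade-off, here is an illustrative per-token estimate. Every figure below (layer count, hidden size, round-trip time, bandwidths, offloaded weight size) is an assumption for the sake of the calculation, not a measurement.

```python
# Layers run one after another, so during decode every remote layer costs a
# full round trip per generated token. Illustrative numbers only.
n_layers = 48
d_model = 5120
hidden_bytes = d_model * 2           # one token's hidden state in fp16
rtt = 0.5e-3                         # assumed LAN round-trip time, seconds
link_bw = 125e6                      # 1 GbE ≈ 125 MB/s

# Pure network overhead per token if every layer's FFN lives on the other box:
net_overhead = n_layers * (rtt + 2 * hidden_bytes / link_bw)

# The alternative mentioned above: compute those layers from local CPU RAM.
# Decode is bandwidth-bound, so time ≈ weight bytes read / RAM bandwidth.
offloaded_weight_bytes = 13e9        # assumed ~13 GB of weights kept off-GPU
ram_bw = 50e9                        # assumed dual-channel DDR4, ~50 GB/s
cpu_time = offloaded_weight_bytes / ram_bw

print(f"network overhead per token : {net_overhead * 1e3:.0f} ms")
print(f"local CPU compute per token: {cpu_time * 1e3:.0f} ms")

# ≈32 ms of round trips vs ≈260 ms of bandwidth-bound CPU work here -- but the
# remote box still has to do that same work on its own DDR4, so the round trips
# are pure added latency unless the weight host is much faster than the local
# alternative. That crossover is the inflection point the comment above is
# worried about.
```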
yeah-ok@reddit (OP)
As far as I can make out (via https://github.com/ggml-org/llama.cpp/blob/master/tools/rpc/README.md), RPC seems focused on running distributed GPU compute on the attention layers, whereas this larql decoupling focuses on keeping latency low by having the GPU as primary and distributing the weights themselves onto x local devices (it could also be internet-scale, but latency seems to kill that off at the moment).
Awwtifishal@reddit
RPC can run weights in absolutely any configuration. You can perfectly well run all attention locally and the rest on one or several RPC servers, which may be running on CPU or GPU.
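For reference, a minimal sketch of that topology using llama.cpp's RPC backend, based on the README linked above. Only the GGML_RPC build flag, the rpc-server binary, and the --rpc option come from that README; the addresses, model file, and the Python wrapper are assumptions for illustration.

```python
# Sketch, assuming llama.cpp was built with -DGGML_RPC=ON and you are in its
# build directory. On the box holding the bulk of the weights (CPU-only is
# fine), you would first start the server:
#   bin/rpc-server -p 50052
import subprocess

WEIGHT_HOST = "192.168.1.50:50052"   # assumed LAN address of that box

# On the GPU workstation, llama-cli treats the RPC server as one more backend
# device, so you point it at the server and offload layers as usual:
subprocess.run([
    "bin/llama-cli",
    "-m", "gemma-26b-q4_k_m.gguf",   # hypothetical GGUF file
    "--rpc", WEIGHT_HOST,            # comma-separated list of RPC servers
    "-ngl", "99",                    # offload layers across available devices
    "-p", "Hello",
], check=True)
```

Exactly which tensors land on which device is then controlled by llama.cpp's usual split options rather than anything scheme-specific, so an "attention local, everything else remote" layout is a configuration question.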
yeah-ok@reddit (OP)
OK, llama.cpp is a sprawling ecosystem indeed, never heard of it until today! So... does it make sense performance-wise to put the weights somewhere else on the LAN and let my workstation handle the attention layers alone via RPC, or is the performance penalty too high? Honestly, I've never seen this discussed before; would love to see practical examples!
Awwtifishal@reddit
It sounds like you're relying too much on knowledge from LLMs, which is not up to date and for some reason ignores llama.cpp's existence (even though it's the base of many popular projects like ollama and LM Studio). When they DO know about llama.cpp, they don't know about many of its features (some recent, some not so recent).
retireb435@reddit
The GitHub repo itself shows the method running 23 times slower. I don't see any improvement compared to today's offloading methods. Seems like clickbait.
denoflore_ai_guy@reddit
Finally someone else figured this out. Glad we're getting to the point where I don't have to explain the concept to people over and over again. Good work.
oxygen_addiction@reddit
AI psychosis final boss.
Swimming_Beginning24@reddit
So you figured it out first?
TokenRingAI@reddit
So he figured out slow inference across a network? Cool
https://docs.vllm.ai/en/latest/serving/expert_parallel_deployment/#backend-selection-guide
Gear5th@reddit
Every new video is crazier than the previous one... incredible work!