Decoupled Attention from Weights - Gemma 4 26B
Posted by yeah-ok@reddit | LocalLLaMA | View on Reddit | 14 comments
Absolutely, unbelievably exciting work: split the attention (i.e. a couple of GB) onto a local machine and the weights onto another local machine (say a cheap Xeon) to basically bypass the scale issue with local LLMs completely!! Repo with functional code: https://github.com/chrishayuk/larql
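To make the idea concrete, here is a minimal sketch of the split being described: attention computed locally, the bulky feed-forward weights held by a second machine, and only the per-layer hidden state crossing the wire. This is not larql's actual code or protocol (see the repo for that); the shapes, host address, and JSON-over-TCP framing are placeholder assumptions.

```python
# Sketch only: attention stays local, the multi-GB FFN weights live on a
# remote box, and only activations travel per layer, per token.
import json
import socket

import numpy as np

D_MODEL = 4096                         # assumed hidden size
WEIGHT_HOST = ("192.168.1.50", 9000)   # hypothetical machine holding the weights


def local_attention(hidden: np.ndarray, wq, wk, wv, wo) -> np.ndarray:
    """Single-head attention over the sequence, computed on the local machine."""
    q, k, v = hidden @ wq, hidden @ wk, hidden @ wv
    scores = q @ k.T / np.sqrt(hidden.shape[-1])
    probs = np.exp(scores - scores.max(-1, keepdims=True))
    probs /= probs.sum(-1, keepdims=True)
    return (probs @ v) @ wo


def remote_ffn(hidden: np.ndarray) -> np.ndarray:
    """Ship the hidden state to the weight host and get the FFN output back."""
    with socket.create_connection(WEIGHT_HOST) as conn:
        conn.sendall(json.dumps(hidden.tolist()).encode() + b"\n")
        reply = conn.makefile().readline()
    return np.array(json.loads(reply))


def decoupled_layer(hidden, wq, wk, wv, wo):
    # Attention is cheap and stays local; the heavy feed-forward weights never
    # leave the remote machine -- only the hidden state makes the round trip.
    hidden = hidden + local_attention(hidden, wq, wk, wv, wo)
    hidden = hidden + remote_ffn(hidden)
    return hidden
```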
the__storm@reddit
...might as well offload to disk - this is going to be slow as balls
jacek2023@reddit
How is it different from RPC?
yeah-ok@reddit (OP)
One of the amazing outcomes of this is that low-VRAM, high-compute consumer cards like the 12GB 5070 would essentially be way overpowered for most models, since they would suddenly "only" need to run 2-4 GB of attention layers. The rest could presumably sit under the table on a "cheap" external Xeon with 128 GB of DDR4 to hold the weights!? Interconnect via regular high-speed TCP/IP over Ethernet, and Bob could be your uncle.
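For a rough sense of that split, here is a back-of-envelope parameter count. The dimensions are generic assumptions for a ~26B decoder with grouped-query attention, not Gemma's published config.

```python
# Back-of-envelope check of the "only 2-4 GB of attention" claim.
# All model dimensions below are assumptions for illustration.
n_layers, d_model, n_heads, n_kv_heads = 48, 5120, 40, 8
d_head = d_model // n_heads
d_ffn = 4 * d_model

# Per-layer attention weights: Q and O are d_model x d_model; K and V are
# shrunk by grouped-query attention to d_model x (n_kv_heads * d_head).
attn_per_layer = 2 * d_model * d_model + 2 * d_model * n_kv_heads * d_head
# Per-layer feed-forward weights: gate, up, and down projections.
ffn_per_layer = 3 * d_model * d_ffn

for name, bytes_per_param in [("fp16", 2), ("4-bit (~0.5 B/param)", 0.5)]:
    attn_gb = n_layers * attn_per_layer * bytes_per_param / 1e9
    ffn_gb = n_layers * ffn_per_layer * bytes_per_param / 1e9
    print(f"{name}: attention ≈ {attn_gb:.1f} GB, feed-forward ≈ {ffn_gb:.1f} GB")

# With these assumptions attention weights are a single-digit-GB slice
# (≈6 GB at fp16, ≈1.5 GB at 4-bit), while the feed-forward weights are the
# ~7-30 GB bulk that would sit in the Xeon's DDR4. The GPU still needs room
# for the KV cache on top of this.
```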
Party-Special-5177@reddit
There are multiple possible limiting factors of inference, most famously flops and memory bandwidth, and this proposal introduces network latency.
You know the layers don’t run in parallel right? They are sequential and blocking, since a layer requires all previous layers’ computation to perform its own.
The only way this could make sense is if the network latency of sending + receiving a hidden state beats the latency of alternatives (e.g. computing the layer in RAM on CPU). This does scale better in some circumstances, but I’m just worried the inflection point is e.g. a 1T model in fp8 or something similarly silly.
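To put rough numbers on that trade-off, here is an illustrative per-token estimate. Every figure below (layer count, hidden size, round-trip time, bandwidths, offloaded weight size) is an assumption for the sake of the calculation, not a measurement.

```python
# Layers run one after another, so during decode every remote layer costs a
# full round trip per generated token. Illustrative numbers only.
n_layers = 48
d_model = 5120
hidden_bytes = d_model * 2           # one token's hidden state in fp16
rtt = 0.5e-3                         # assumed LAN round-trip time, seconds
link_bw = 125e6                      # 1 GbE ≈ 125 MB/s

# Pure network overhead per token if every layer's FFN lives on the other box:
net_overhead = n_layers * (rtt + 2 * hidden_bytes / link_bw)

# The alternative mentioned above: compute those layers from local CPU RAM.
# Decode is bandwidth-bound, so time ≈ weight bytes read / RAM bandwidth.
offloaded_weight_bytes = 13e9        # assumed ~13 GB of weights kept off-GPU
ram_bw = 50e9                        # assumed dual-channel DDR4, ~50 GB/s
cpu_time = offloaded_weight_bytes / ram_bw

print(f"network overhead per token : {net_overhead * 1e3:.0f} ms")
print(f"local CPU compute per token: {cpu_time * 1e3:.0f} ms")

# ≈32 ms of round trips vs ≈260 ms of bandwidth-bound CPU work here -- but the
# remote box still has to do that same work on its own DDR4, so the round trips
# are pure added latency unless the weight host is much faster than the local
# alternative. That crossover is the inflection point the comment above is
# worried about.
```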
yeah-ok@reddit (OP)
As far as I can make out (via https://github.com/ggml-org/llama.cpp/blob/master/tools/rpc/README.md), RPC seems focused on running distributed GPU compute on the attention layers, whereas this larql decoupling focuses on keeping latency low by having the GPU as primary and distributing the weights themselves onto x local devices (it could also be internet-scale, but latency seems to kill that off at the moment).
Awwtifishal@reddit
RPC can run weights in absolutely any configuration. You can perfectly well run all attention locally and the rest on one or several RPC servers, which may be running on CPU or GPU.
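For reference, a minimal sketch of that topology using llama.cpp's RPC backend, based on the README linked above. Only the GGML_RPC build flag, the rpc-server binary, and the --rpc option come from that README; the addresses, model file, and the Python wrapper are assumptions for illustration.

```python
# Sketch, assuming llama.cpp was built with -DGGML_RPC=ON and you are in its
# build directory. On the box holding the bulk of the weights (CPU-only is
# fine), you would first start the server:
#   bin/rpc-server -p 50052
import subprocess

WEIGHT_HOST = "192.168.1.50:50052"   # assumed LAN address of that box

# On the GPU workstation, llama-cli treats the RPC server as one more backend
# device, so you point it at the server and offload layers as usual:
subprocess.run([
    "bin/llama-cli",
    "-m", "gemma-26b-q4_k_m.gguf",   # hypothetical GGUF file
    "--rpc", WEIGHT_HOST,            # comma-separated list of RPC servers
    "-ngl", "99",                    # offload layers across available devices
    "-p", "Hello",
], check=True)
```

Exactly which tensors land on which device is then controlled by llama.cpp's usual split options rather than anything scheme-specific, so an "attention local, everything else remote" layout is a configuration question.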
yeah-ok@reddit (OP)
OK, llama.cpp is a sprawling ecosystem indeed, never heard of it until today! So... does it make sense performance-wise to put the weights somewhere else on the LAN and let my workstation handle the attention layers alone via RPC, or is the performance penalty too high? Honestly, I've never seen this discussed before; would love to see practical examples!
Awwtifishal@reddit
It sounds like you're relying too much on knowledge from LLMs, which is not up to date and for some reason ignores llama.cpp's existence (even though it's the base of many popular projects like ollama and LM Studio). When they DO know about llama.cpp, they don't know about many of its features (some recent, some not so recent).
retireb435@reddit
The GitHub repo itself shows the method running 23 times slower. I don't see any improvement compared to today's offloading methods. Seems like clickbait.
denoflore_ai_guy@reddit
Finally someone else figured this out. Glad we're getting to the point where I don't have to explain the concept to people over and over again. Good work.
oxygen_addiction@reddit
AI psychosis final boss.
Swimming_Beginning24@reddit
So you figured it out first?
TokenRingAI@reddit
So he figured out slow inference across a network? Cool
https://docs.vllm.ai/en/latest/serving/expert_parallel_deployment/#backend-selection-guide
Gear5th@reddit
Every new video is crazier than the previous one... incredible work!