[Paper] Residual Streams / KV Direct
Posted by Ueberlord@reddit | LocalLLaMA | 6 comments
It seems we have entered a period of accelerating innovation around the KV cache. Someone mentioned this paper in the llama.cpp GitHub issue for implementing Turbo Quant.
The Residual Stream Is All You Need: On the Redundancy of the KV Cache in Transformer Inference
https://arxiv.org/html/2603.19664v1
Associated Github repo: https://github.com/Kaleemullahqasim/KV-Direct
Abstract:
The key-value (KV) cache is widely treated as essential state in transformer inference, and a large body of work engineers policies to compress, evict, or approximate its entries. We prove that this state is entirely redundant: keys and values at every layer are deterministic projections of the residual stream, and recomputing them from a single residual vector per token incurs exactly zero reconstruction error, not approximately, but bit-identically. We verify this across six models from four architecture families (135M to 4B parameters). Cross-task residual patching at every layer produces D_KL = 0 between patched and original output distributions, confirming that the residual stream satisfies a Markov property and is the sole information-carrying state. Removing the cache entirely and recomputing from scratch yields token-identical output under greedy decoding on all models tested. We build on this result with KV-Direct, a bounded-memory inference scheme that checkpoints residual vectors (5 KB per token on Gemma 3-4B) instead of full KV pairs (136 KB), recomputing keys and values on demand. Over 20 conversation turns, KV-Direct holds peak memory at 42 MB while the standard cache grows past 103 MB. Against five eviction baselines (H2O, StreamingLLM, SnapKV, TOVA, window-only), KV-Direct maintains 100% token match at every cache budget; all baselines degrade to 5–28%. A per-operation latency analysis shows recomputation runs up to 5× faster than reading cached tensors at moderate batch sizes.
My take (not fully understanding the abstract): I think it makes sense. The KV cache can be seen as a bridge from the model weights (origin) to the tokens produced so far (destination). They refer to this bridge as the "residual stream" and have found some clever math, which I can't fully comprehend, to recreate the KV cache very efficiently, like interpolating from weights to tokens.
If someone more knowledgeable can explain this better and what the consequences might be (no more KV cache?!) I would be highly interested.
dinerburgeryum@reddit
Emphasis my own, but greedy decoding is absolutely not a given for the vast majority of workloads. Cool work, but I do wonder immediately how it scales.
Also, as a little soapbox aside, I’ll never understand spinning up a git repo only to say “code comes later.” Just make a webpage or something.
Ueberlord@reddit (OP)
Smells a little like another "too good to be true" paper, unfortunately. When I asked Gemini about the paper before I posted here, it was super-hyped and even considered this method a game changer.
z_latent@reddit
Reading this paper was a big pain. The authors can't seem to decide whether their approach stores a single residual stream vector per token or one per layer, with the latter completely killing their advantage.
The idea is that the residual stream is essentially the vector h^(l) that gets processed throughout the model and its blocks/layers. For example, think of the MLP of block l as a function MLP^(l)(h^(l)); the residual stream is then updated as h^(l+1) = h^(l) + MLP^(l)(h^(l)).[^1][^2]
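To make the update rule concrete, here's a minimal numpy sketch of one residual step. The dimensions and the two-matrix MLP are made-up illustrations, not any real model's config, and layer norms are omitted as in the simplified equation above:

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, d_ff = 8, 32  # illustrative sizes, not from any real model

# Hypothetical MLP weights for one block (frozen at inference time)
W_in = rng.standard_normal((d_model, d_ff))
W_out = rng.standard_normal((d_ff, d_model))

def mlp(h):
    # MLP^(l)(h): up-projection, nonlinearity, down-projection
    return np.maximum(h @ W_in, 0.0) @ W_out

h_l = rng.standard_normal(d_model)  # residual stream entering block l
h_next = h_l + mlp(h_l)             # h^(l+1) = h^(l) + MLP^(l)(h^(l))
```

The point is that the block only ever *adds* to the stream, so h^(l+1) carries everything h^(l) did plus the block's contribution.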
Now, the attention layer depends on the other tokens as well, so we usually save the keys and values computed for earlier tokens in the KV cache, to avoid recomputing them every time.
However, the paper's idea is: "let's not save the keys and values, let's save the residual stream vectors!" And they spend a while proving that this gives identical, lossless reconstruction (which I think is kind of obvious).
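The lossless part really is just determinism: K and V are fixed linear projections of the residual vector, so caching the residual and re-projecting reproduces them bit-for-bit. A sketch with made-up dimensions (real models also apply a layer norm before the projection, but that's deterministic too):

```python
import numpy as np

rng = np.random.default_rng(1)
d_model, d_head = 16, 4  # illustrative dimensions

# Fixed projection weights for one attention layer (frozen at inference)
W_K = rng.standard_normal((d_model, d_head))
W_V = rng.standard_normal((d_model, d_head))

h = rng.standard_normal(d_model)  # residual vector for one token at this layer

# Standard path: compute K and V once, store them in the cache
k_cached, v_cached = h @ W_K, h @ W_V

# Residual-checkpoint path: store h instead, recompute K and V on demand
k_recomputed, v_recomputed = h @ W_K, h @ W_V

# Same inputs, same weights, same float ops => bit-identical results
assert np.array_equal(k_cached, k_recomputed)
assert np.array_equal(v_cached, v_recomputed)
```

Note this only shows the easy case: reconstructing K/V at a layer where you already *have* the residual vector, which is exactly the gap discussed next.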
But they then do one confusing thing: they claim that you only need to save the residual vector for one layer, and reconstruct the vector for all other layers from it. Which is not wrong... but it's not what they seem to be doing later on. They somehow assume the residual vectors already exist for all layers.
They ignore that, to achieve this from one single saved vector, you need to rerun the whole model up to that layer for every token. This is way more expensive than what they claimed, which was "reconstructing K and V for 𝑁 evicted tokens at a single layer".
If you stored the residual stream for every layer, you'd almost certainly use more memory than KV. The key and value dimensions, even combined, are usually smaller than the residual stream's.
If you instead recompute every layer for every token... well, even then, why store the residuals at all? You could just use the token ids, a single int per token, then fetch the token embeddings, and so on. After all, this whole thing amounts to doing full prompt processing from scratch, repeated every time you generate a token.
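Some back-of-envelope arithmetic on the storage trade-off. The GQA-style dimensions below are my assumptions, not the model's published config, though they happen to land on the abstract's ~5 KB and ~136 KB per-token figures:

```python
# Illustrative per-token memory in fp16 (2 bytes/element); all sizes assumed
bytes_per = 2
d_model, n_layers = 2560, 34    # hypothetical hidden size and depth
n_kv_heads, d_head = 4, 256     # hypothetical grouped-query attention shape

residual_one_layer = d_model * bytes_per                   # one checkpointed vector
residual_all_layers = n_layers * d_model * bytes_per       # residual at every layer
kv_cache = n_layers * 2 * n_kv_heads * d_head * bytes_per  # keys + values, all layers

print(residual_one_layer)   # 5120 bytes  (5 KB)
print(kv_cache)             # 139264 bytes (136 KB)
print(residual_all_layers)  # 174080 bytes (170 KB), i.e. worse than the KV cache
```

So the single-vector variant wins big on memory but implies full recomputation, while the per-layer variant avoids recomputation but stores more than the KV cache it was meant to replace.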
It sounds really expensive to me. I don't understand their benchmarks either. They sound so bad for an M3 Max. But I don't wanna think about this anymore.
[^1]: There are also layer norm operations, but they're not too important here.
[^2]: The attention layer does this as well, and it usually comes before the MLP. So technically there are two residual vectors per block: a pre-attention h^(l) and a post-attention, pre-MLP one. Only after the MLP do you advance to h^(l+1).
Ueberlord@reddit (OP)
Thank you for reading through this and sharing your thoughts, very helpful even if disappointing, as it seems we do not get a free lunch.
dinerburgeryum@reddit
Remind me to update my email signature to read “I don’t wanna think about this anymore.”
For real, I just finished a close read of the paper and I tend to agree. They report OK performance on 0.5B-class models, but you can see the gulf once they hit 4B. Possibly a win for edge inference on handhelds, but if I'm reading this right, it's no silver bullet.
HopePupal@reddit
"residual stream" seems to be an established term in transformer model analysis. this mech interp paper https://transformer-circuits.pub/2021/framework/index.html explains it.