[Paper] Residual Streams / KV Direct

Posted by Ueberlord@reddit | LocalLLaMA | View on Reddit | 6 comments

It seems we have entered a period of accelerating innovation around the KV cache. Someone mentioned this paper in a llama.cpp GitHub issue about implementing Turbo Quant.

The Residual Stream Is All You Need: On the Redundancy of the KV Cache in Transformer Inference https://arxiv.org/html/2603.19664v1

Associated GitHub repo: https://github.com/Kaleemullahqasim/KV-Direct

Abstract:

The key-value (KV) cache is widely treated as essential state in transformer inference, and a large body of work engineers policies to compress, evict, or approximate its entries. We prove that this state is entirely redundant: keys and values at every layer are deterministic projections of the residual stream, and recomputing them from a single residual vector per token incurs exactly zero reconstruction error, not approximately, but bit-identically. We verify this across six models from four architecture families (135M to 4B parameters). Cross-task residual patching at every layer produces D_KL = 0 between patched and original output distributions, confirming that the residual stream satisfies a Markov property and is the sole information-carrying state. Removing the cache entirely and recomputing from scratch yields token-identical output under greedy decoding on all models tested. We build on this result with KV-Direct, a bounded-memory inference scheme that checkpoints residual vectors (5 KB per token on Gemma 3-4B) instead of full KV pairs (136 KB), recomputing keys and values on demand. Over 20 conversation turns, KV-Direct holds peak memory at 42 MB while the standard cache grows past 103 MB. Against five eviction baselines (H2O, StreamingLLM, SnapKV, TOVA, window-only), KV-Direct maintains 100% token match at every cache budget; all baselines degrade to 5–28%. A per-operation latency analysis shows recomputation runs up to 5× faster than reading cached tensors at moderate batch sizes.
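To make the abstract's core claim concrete, here is a minimal numpy sketch of the idea: since keys and values are deterministic linear projections of the residual stream, recomputing them from a checkpointed residual reproduces the cached tensors bit-for-bit. The toy dimensions and the Gemma-3-4B-like shape constants (34 layers, hidden size 2560, 4 KV heads of head dim 256, fp16) are my assumptions for illustration, not taken from the paper's code.

```python
import numpy as np

rng = np.random.default_rng(0)
n_layers, n_tokens, d_model = 4, 8, 64  # tiny toy sizes

# Toy per-layer key/value projection matrices (stand-ins for W_K, W_V).
W_k = rng.standard_normal((n_layers, d_model, d_model)).astype(np.float32)
W_v = rng.standard_normal((n_layers, d_model, d_model)).astype(np.float32)

# Residual-stream inputs to each layer's attention block, as a forward
# pass would produce them (random here, purely for the equality check).
resid = rng.standard_normal((n_layers, n_tokens, d_model)).astype(np.float32)

# Standard approach: cache keys and values at generation time.
cached_k, cached_v = resid @ W_k, resid @ W_v

# KV-Direct-style approach: checkpoint only the residuals and recompute
# K/V on demand. Same deterministic projection, so the result is
# bit-identical, not merely numerically close.
recomputed_k, recomputed_v = resid @ W_k, resid @ W_v
assert np.array_equal(cached_k, recomputed_k)  # exact, bitwise match
assert np.array_equal(cached_v, recomputed_v)

# Back-of-envelope memory per token, under my assumed Gemma-3-4B-like
# dims: checkpointing one residual vector vs. full per-layer KV pairs.
layers, hidden, kv_heads, head_dim, fp16 = 34, 2560, 4, 256, 2
resid_ckpt_bytes = hidden * fp16                          # 5120 B ~ 5 KB
kv_cache_bytes = layers * 2 * kv_heads * head_dim * fp16  # 139264 B ~ 136 KB
print(resid_ckpt_bytes, kv_cache_bytes)
```

Under those assumed dimensions the per-token arithmetic lands on the same ~5 KB vs. ~136 KB figures the abstract reports, which suggests the savings come simply from storing one hidden-size vector instead of two KV tensors per layer.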

My take (not fully understanding the abstract): I think it makes sense. The KV cache can be seen as a bridge from the model weights (origin) to the tokens produced so far (destination). The paper calls this bridge the "residual stream" and shows, with some clever math I can't fully follow, that the KV cache can be recreated from it very efficiently, a bit like interpolating from weights to tokens.

If someone more knowledgeable could explain this better, and what the consequences might be (no more KV cache?!), I would be highly interested.