Serving a 3.3 Million-Token Context for Llama-3-8B on a Single GPU

Posted by Van_Chopiszt | LocalLLaMA

This paper on DuoAttention looks interesting: https://arxiv.org/abs/2410.10819

They also provide code to run the 4M-context Llama-3-8B model with up to a 3.3M-token context on a single A100, which is really cool!
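For anyone wondering how it fits that much context: per the paper's abstract, DuoAttention keeps the full KV cache only for a small set of "retrieval heads" and gives the remaining "streaming heads" a constant-size cache of attention sinks plus recent tokens. Here's a rough PyTorch sketch of that cache policy — the function name, shapes, and defaults are mine for illustration, not the repo's actual API:

```python
import torch

def prune_kv_cache(keys, values, is_retrieval_head,
                   num_sink_tokens=4, recent_window=256):
    """Drop old KV entries for streaming heads only (illustrative sketch).

    keys, values:       [num_heads, seq_len, head_dim]
    is_retrieval_head:  [num_heads] bool mask (the paper identifies these offline)
    Returns per-head lists, since heads now have different cache lengths.
    """
    seq_len = keys.shape[1]
    pruned_k, pruned_v = [], []
    for h in range(keys.shape[0]):
        if is_retrieval_head[h] or seq_len <= num_sink_tokens + recent_window:
            # Retrieval heads keep the full history: O(seq_len) memory.
            pruned_k.append(keys[h])
            pruned_v.append(values[h])
        else:
            # Streaming heads keep only the attention sinks + a recent
            # window: constant memory regardless of context length.
            idx = torch.cat([torch.arange(num_sink_tokens),
                             torch.arange(seq_len - recent_window, seq_len)])
            pruned_k.append(keys[h, idx])
            pruned_v.append(values[h, idx])
    return pruned_k, pruned_v
```

Since most heads end up as streaming heads, the cache for a multi-million-token context stays close to constant size for them, which is where the memory savings come from.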

https://github.com/mit-han-lab/duo-attention