Serving a 3.3 Million-Token Context for Llama-3-8B on a Single GPU

Posted by Van_Chopiszt | LocalLLaMA

This paper on DuoAttention looks interesting: https://arxiv.org/abs/2410.10819

They also provide code to run the 4M-context Llama-3-8B model with up to a 3.3M-token context on a single A100, which is really cool!
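For anyone wondering how it fits that much context: per the paper's abstract, DuoAttention keeps the full KV cache only for a small set of "retrieval heads" and gives the remaining "streaming heads" a constant-size cache of attention sinks plus recent tokens. Here's a rough PyTorch sketch of that cache policy — the function name, shapes, and defaults are mine for illustration, not the repo's actual API:

```python
import torch

def prune_kv_cache(keys, values, is_retrieval_head,
                   num_sink_tokens=4, recent_window=256):
    """Drop old KV entries for streaming heads only (illustrative sketch).

    keys, values:       [num_heads, seq_len, head_dim]
    is_retrieval_head:  [num_heads] bool mask (the paper identifies these offline)
    Returns per-head lists, since heads now have different cache lengths.
    """
    seq_len = keys.shape[1]
    pruned_k, pruned_v = [], []
    for h in range(keys.shape[0]):
        if is_retrieval_head[h] or seq_len <= num_sink_tokens + recent_window:
            # Retrieval heads keep the full history: O(seq_len) memory.
            pruned_k.append(keys[h])
            pruned_v.append(values[h])
        else:
            # Streaming heads keep only the attention sinks + a recent
            # window: constant memory regardless of context length.
            idx = torch.cat([torch.arange(num_sink_tokens),
                             torch.arange(seq_len - recent_window, seq_len)])
            pruned_k.append(keys[h, idx])
            pruned_v.append(values[h, idx])
    return pruned_k, pruned_v
```

Since most heads end up as streaming heads, the cache for a multi-million-token context stays close to constant size for them, which is where the memory savings come from.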

https://github.com/mit-han-lab/duo-attention