MLA optimization with FlashAttention for llama.cpp: MLA + FA now only uses K-cache - 47% saving on KV-cache size
Posted by shing3232@reddit | LocalLLaMA | 45 comments
[MLA + FA now only uses K-cache - 47% saving on KV-cache size (only for use with #13435 for now) by jukofyork · Pull Request #13529 · ggml-org/llama.cpp](https://github.com/ggml-org/llama.cpp/pull/13529)
`llama_kv_cache_unified: kv_size = 163840, type_k = 'f16', type_v = 'f16', n_layer = 61, can_shift = 0, padding = 256`
`llama_kv_cache_unified: CUDA0 KV buffer size = 10980.00 MiB`
`llama_kv_cache_unified: KV self size = 10980.00 MiB, K (f16): 10980.00 MiB, V (f16): 0.00 MiB`
The full 160k-token context (kv_size = 163840) now takes up less than 11 GB, with the cache left at f16 rather than quantized.
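As a sanity check on the log above, here is a rough back-of-the-envelope sketch of where the 10980 MiB and the 47% figure could come from. The values for `kv_size`, `n_layer`, and the f16 element size are taken from the log; the split of the per-token width into a 512-wide compressed latent plus a 64-wide RoPE part is my assumption (it matches DeepSeek-style MLA, which the log's `n_layer = 61` suggests but does not state), as is the assumption that the old layout stored an extra 512-wide V entry per token. The helper names are made up for illustration and are not llama.cpp functions.

```python
MIB = 1024 * 1024

def kv_cache_mib(n_ctx, n_layer, width, bytes_per_elem=2):
    """Cache size in MiB for `width` elements per token per layer (f16 = 2 bytes)."""
    return n_ctx * n_layer * width * bytes_per_elem / MIB

def implied_width(buffer_mib, n_ctx, n_layer, bytes_per_elem=2):
    """Elements stored per token per layer, back-computed from a reported buffer size."""
    return buffer_mib * MIB / (n_ctx * n_layer * bytes_per_elem)

# Values copied from the log lines in the post.
n_ctx, n_layer = 163840, 61
width = implied_width(10980.0, n_ctx, n_layer)

# ASSUMPTION: 512-wide compressed KV latent + 64-wide RoPE key (DeepSeek-style MLA).
K_WIDTH = 512 + 64
# ASSUMPTION: the previous layout additionally kept a 512-wide V entry per token.
V_WIDTH = 512

k_only = kv_cache_mib(n_ctx, n_layer, K_WIDTH)
k_and_v = kv_cache_mib(n_ctx, n_layer, K_WIDTH + V_WIDTH)
saving = 1 - k_only / k_and_v

print(f"implied width per token per layer: {width:.0f}")
print(f"K-only cache: {k_only:.2f} MiB, K+V cache: {k_and_v:.2f} MiB")
print(f"saving: {saving:.1%}")
```

Under those assumptions the numbers line up: the reported buffer corresponds to 576 elements per token per layer, and dropping a 512-wide V cache from a 1088-wide K+V layout saves about 47%, which is consistent with the PR title.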