MLA optimization with FlashAttention for llama.cpp: MLA + FA now only uses K-cache - 47% saving on KV-cache size

Posted by shing3232@reddit | LocalLLaMA | 45 comments

[MLA + FA now only uses K-cache - 47% saving on KV-cache size (only for use with #13435 for now) by jukofyork · Pull Request #13529 · ggml-org/llama.cpp](https://github.com/ggml-org/llama.cpp/pull/13529)

`llama_kv_cache_unified: kv_size = 163840, type_k = 'f16', type_v = 'f16', n_layer = 61, can_shift = 0, padding = 256`

`llama_kv_cache_unified: CUDA0 KV buffer size = 10980.00 MiB`

`llama_kv_cache_unified: KV self size = 10980.00 MiB, K (f16): 10980.00 MiB, V (f16): 0.00 MiB`

The full context of 160k tokens now takes up less than 11 GB without quantizing the cache.
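As a rough sanity check on the log above, the K-cache size can be reproduced with a few lines of arithmetic. This is only a sketch: `kv_size` and `n_layer` come from the log, but the per-token widths (a 512-wide compressed latent plus a 64-wide RoPE part for DeepSeek-V3, and a separate 512-wide V-cache before the PR) are assumptions that are not stated in the post.

```python
# From the log above.
KV_SIZE = 163840      # cached token positions
N_LAYER = 61
BYTES_F16 = 2

# Assumed DeepSeek-V3 MLA dimensions (not stated in the post).
KV_LORA_RANK = 512    # compressed latent width per token per layer
ROPE_DIM = 64         # RoPE part stored next to the latent


def cache_mib(width_per_token: int) -> float:
    """Size in MiB of an f16 cache holding `width_per_token` values per token per layer."""
    return KV_SIZE * N_LAYER * width_per_token * BYTES_F16 / 2**20


# K-cache only: latent + RoPE part.
k_only = cache_mib(KV_LORA_RANK + ROPE_DIM)

# Assumed layout before the PR: the same K-cache plus a separate 512-wide V-cache.
k_and_v = cache_mib(KV_LORA_RANK + ROPE_DIM + KV_LORA_RANK)

saving = 1 - k_only / k_and_v

print(f"K only: {k_only:.2f} MiB")
print(f"K + V:  {k_and_v:.2f} MiB")
print(f"saving: {saving:.1%}")
```

Under those assumptions the K-only figure matches the 10980.00 MiB in the log, and dropping the V-cache works out to roughly the 47% quoted in the PR title.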

45 Comments

panchovix@reddit

Not OP, but for reference, I run DeepSeekV3 0324 685B Q3\_K\_XL on a 7800X3D, 192GB RAM at 6000 MHz, 5090+4090x2+3090+A6000.

Without this PR, I can load Q3\_K\_XL at 64K with fp16 cache at basically the limit. With this PR, it basically frees half of the cache, and it lets me run 128K ctx without issues. And then with -ctk q8\_0, I can run it at 160K+ without issues as well.

With this and -ub 2048, I get about 130-170 t/s PP depending on the context, and 7-8 t/s TG.

This is huge for systems like these, which aren't servers and where you have to offload!

segmond@reddit

what command are you using to run it? are you offloading layers or tensors across your GPUs?

panchovix@reddit

I use this command, and yes, I offload layers to the GPUs.

`./llama-server -m '/models_llm/DeepSeek-V3-0324-UD-Q3_K_XL-00001-of-00007.gguf' -c 65536 --no-mmap -ngl 999 -ot "blk.(0|1|2|3|4|5|6).ffn.=CUDA0" -ot "blk.(7|8|9|10).ffn.=CUDA1" -ot "blk.(11|12|13|14).ffn.=CUDA2" -ot "blk.(15|16|17).ffn.=CUDA3" -ot "blk.(18|19|20|21|22|23|24|25).ffn.=CUDA4" -ot "ffn.*=CPU" -fa -mg 0 -ub 2048`
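To see which tensors each `-ot` pattern in that command would catch, here is a minimal Python sketch. It assumes the patterns are applied as an unanchored regex search against the tensor name and that the first matching rule wins; the tensor names below are hypothetical examples in the usual `blk.N.ffn_*` style, and `placement` is a helper made up for illustration, not part of llama.cpp.

```python
import re

# Override rules from the command above, in order: (regex, buffer name).
RULES = [
    (r"blk.(0|1|2|3|4|5|6).ffn.", "CUDA0"),
    (r"blk.(7|8|9|10).ffn.", "CUDA1"),
    (r"blk.(11|12|13|14).ffn.", "CUDA2"),
    (r"blk.(15|16|17).ffn.", "CUDA3"),
    (r"blk.(18|19|20|21|22|23|24|25).ffn.", "CUDA4"),
    (r"ffn.*", "CPU"),
]


def placement(tensor_name: str):
    """Return the buffer of the first rule that matches, or None if no rule applies.

    Assumption: rules are tried in order and the first match wins.
    """
    for pattern, buffer_name in RULES:
        if re.search(pattern, tensor_name):
            return buffer_name
    return None  # no override: the tensor follows the normal -ngl placement


# Hypothetical tensor names used only to exercise the patterns.
for name in [
    "blk.3.ffn_gate_exps.weight",
    "blk.10.ffn_down_exps.weight",
    "blk.13.ffn_up_exps.weight",
    "blk.26.ffn_gate_exps.weight",
    "blk.3.attn_q.weight",
]:
    print(f"{name:32s} -> {placement(name)}")
```

Read this way, the FFN tensors of blocks 0-25 are pinned to specific GPUs, every other FFN tensor falls through to the final `ffn.*=CPU` rule, and attention tensors are not touched by any override.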

giant3@reddit

From my testing, offloading entire layers to CPU gives better performance than splitting a single layer by moving ffn or attn blocks. For example, on Qwen3 14B, just moving the first 9 blocks (`-ot 'blk\.[0-8]{1}\.=CPU'`) gives better performance for me than moving either 10 blocks or 20 blocks.
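A quick way to check that this pattern really selects nine whole blocks is to run it against a list of tensor names. This is a sketch only: the 40-block count for Qwen3 14B and the `attn_q` tensor name are assumptions used for illustration.

```python
import re

PATTERN = r"blk\.[0-8]{1}\."   # regex part of the -ot argument above
N_BLOCKS = 40                  # assumed block count for Qwen3 14B

# One hypothetical tensor name per block is enough to test the block prefix.
names = [f"blk.{i}.attn_q.weight" for i in range(N_BLOCKS)]

on_cpu = [name for name in names if re.search(PATTERN, name)]

print(len(on_cpu))             # number of blocks sent to the CPU
print(on_cpu[0], on_cpu[-1])   # first and last matching block
```

Because the pattern requires a literal dot right after a single digit, two-digit blocks such as `blk.10` or `blk.18` are not matched, and because it does not mention `ffn` or `attn`, every tensor in blocks 0-8 is moved together.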

pmttyji@reddit

It's been 6 months since this comment, and there have been so many changes to llama.cpp. What command settings do you currently use for this? I'm looking for an optimized command to get higher t/s for dense models like Qwen3 14B and Gemma3 12B. Please share your stash. Thanks

giant3@reddit

llama.cpp has made a lot of progress on CUDA since some people from Nvidia are contributing to the project. Contributions for AMD, Intel, or OpenCL seem to be minimal. Unfortunately, performance on AMD and Intel hasn't improved, or has degraded slightly. I build weekly and I haven't seen much improvement, so the above command should still work.

pmttyji@reddit

OK thanks, I'll try your OT. Currently I'm trying a few 22-24B dense models with my 8GB VRAM (and 32GB RAM). Not getting usable t/s (tg) so far. Also, on CPU-only inference, what command/settings could give us better t/s? Can we do something with OT?

giant3@reddit

> Also on CPU only inference, what command/settings could give us better t/s? Can we do something with OT?

CPU-only performance would be even worse unless your CPU supports AVX-512 and has high memory throughput like some of the Apple Macs.

BTW, I have stopped using local LLMs unless it is something that involves private information. Gemini is very good and I would use it for almost all use cases.

pmttyji@reddit

> CPU-only performance would be even worse unless your CPU supports AVX-512 and has high memory throughput like some of the Apple Macs.

No wonder ik\_llama's AVX-512 setup is not working on my laptop. I just checked the status using HWiNFO. AVX-512 is Disabled, but with the tooltip below.

Advanced Vector Extensions 512

* Foundation Instructions: Not supported

Supported:

* AVX-512 Galois Fields New Instruction
* AVX-512 Vector AES
* AVX-512 Carry-Less Multiplication Quadword

Is it possible to enable this? And are there disadvantages? My system info is below.

Intel(R) Core(TM) i7-14700HX 2.10 GHz | 32 GB RAM | 64-bit OS, x64-based processor | NVIDIA GeForce RTX 4060 Laptop GPU

giant3@reddit

I don't think your CPU supports AVX-512. Only certain models support it.

pmttyji@reddit

Oh OK. Thanks for multiple replies & your time. Coming year, I'll try this on my new desktop.

Mass2018@reddit

Is -ot part of an unmerged PR? I can’t seem to find any documentation on it.

panchovix@reddit

It was merged some time ago; there just isn't much info on it: [https://github.com/ggml-org/llama.cpp/pull/11397](https://github.com/ggml-org/llama.cpp/pull/11397)

Mass2018@reddit

Thanks!

AbheekG@reddit

Please please share which motherboard you’re using! Super curious to hear how a standard ATX platform is supporting all those GPUs!!

panchovix@reddit

An MSI X670E Carbon. I use X8/X4/X4/X4/X4, all from the CPU. The X8 is bifurcated to X4/X4, and the other two X4 links come from M.2-to-PCIe adapters.

AbheekG@reddit

Wow, that’s amazing! Thanks so much for taking the time to respond, and so promptly at that, really appreciate it! Any specific risers / adapters you’d recommend?

panchovix@reddit

I mostly use LINKUP risers and an open-case rig structure (like a mining rig). I'm waiting for AMD to release the Threadripper 9000 series to upgrade.

Aphid_red@reddit

Depending on how much you want to spend, I'd rather recommend going for either EPYC Milan ($2-3K for CPU/mobo/RAM) or EPYC Genoa ($8-10K). For Milan, you can get 8x64GB DDR4 @ 200 GB/s; for Genoa, 12x64GB DDR5 @ 460 GB/s.

Make sure you get a CPU with the full CCD count. Any 'X' variant or the full-fat core CPU will do, as well as a few select others.

For Genoa, the chips with 12 CCDs (preferred) are: 9634, 9654, 9654P, 9684X, 9734, 9754S, 9754. The ones with only 4 (avoid!) are: 4xxx, 8xxx, 9124, 9224, 9254, 9334. A CPU with 8 CCDs should also be okay and not constrain the bandwidth too much. Mind you, if you're doing CPU offloading, the best speeds will come from the CPUs with the best compute performance, i.e. the fully unlocked 96xx or 97xx class.

For Milan, the ones with the full 8 CCDs are: 76xx, 77xx, 7543, 77C3, and any 'X' or 'F' suffix parts. The parts with only 2 CCDs (these are really bad) are: 7203, 7303.

The bad thing is that *none* of the reviews of Genoa/Milan CPUs mention this, and it has a massive performance impact for LLMs (usually they test only the top SKU, which isn't crippled this way).

You'll actually find, if shopping for CPUs second-hand, that the memory ends up being the most expensive part of the build. Unfortunately, DDR5 ECC currently carries an enormous premium, costing $5-$6/GB, or $300 for one stick: over double the price of DDR5 without ECC, and *three times* the price of DDR4 ECC.
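The 200 GB/s and 460 GB/s figures line up with theoretical peak memory bandwidth if one assumes DDR4-3200 on 8 channels for Milan and DDR5-4800 on 12 channels for Genoa, each channel moving 8 bytes per transfer. Those speeds and the helper name are assumptions for this sketch; the comment itself only gives the channel counts and the totals.

```python
def peak_bandwidth_gbs(channels: int, mega_transfers_per_s: int, bytes_per_transfer: int = 8) -> float:
    """Theoretical peak bandwidth in GB/s: channels x MT/s x bytes per transfer."""
    return channels * mega_transfers_per_s * bytes_per_transfer / 1000


# Assumed memory speeds: DDR4-3200 for Milan, DDR5-4800 for Genoa.
milan = peak_bandwidth_gbs(channels=8, mega_transfers_per_s=3200)
genoa = peak_bandwidth_gbs(channels=12, mega_transfers_per_s=4800)

print(f"Milan, 8 x DDR4-3200:  {milan:.1f} GB/s")
print(f"Genoa, 12 x DDR5-4800: {genoa:.1f} GB/s")
```

These are upper bounds for the memory controller. The point about CCD count is that a CPU with few CCDs may not be able to pull that much bandwidth from the memory side in practice.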

un_passant@reddit

Thx for spreading the info about CCDs! Do you happen to know how many CCDs there are in the 7R32 (AWS custom chip)? It seems it's only 6, if I'm not mistaken: [https://www.anandtech.com/show/15830/amazon-makes-amd-rome-instances-available](https://www.anandtech.com/show/15830/amazon-makes-amd-rome-instances-available)

Aphid_red@reddit

I do not know this info; this is a custom chip for Amazon. According to passmark, apparently it has 48 cores, runs at 2.8 GHz, and given the '2' suffix this should be a Rome chip. However, that seems wrong. 1.8 GHz would make more sense for a provider like Amazon who might be interested in saving on power costs.

I suspect this is an underclocked version of an existing chip, either the 7552 or 7642. Looking at the known chips on wikichip/wikipedia, I can see no 48-core Rome chips running at that speed at all, so we're left guessing. That would give it either 6 or 8 (active, functioning) chiplets.

Let's look at another property that might give away the information: the cache size. On [https://xmrig.com/benchmark/4PDGeF](https://xmrig.com/benchmark/4PDGeF) there's someone who did a benchmark of this system where the benchmarking tool registered 384M of L3 cache. Divide that between 2 CPUs and you get 192MB per CPU. EPYC Rome (except the 7232P, a very low-end part) uses 16MB of L3 cache per CCX, or 32MB per chiplet. 32 \* 6 = 192, so it should have 6 chiplets.
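The cache-based estimate above is a two-step division, sketched here with the numbers from the comment (384 MB reported across 2 sockets, 32 MB of L3 per Rome chiplet). The function name is made up for illustration.

```python
def chiplets_per_cpu(total_l3_mb: int, sockets: int, l3_per_chiplet_mb: int = 32) -> int:
    """Estimate active chiplets per CPU from the total L3 cache reported by the system."""
    l3_per_cpu_mb = total_l3_mb // sockets
    return l3_per_cpu_mb // l3_per_chiplet_mb


# Numbers from the comment: 384 MB of L3 reported for a dual-socket system.
estimate = chiplets_per_cpu(total_l3_mb=384, sockets=2)
print(estimate)
```

The estimate only holds if every active chiplet exposes its full 32 MB of L3, which is the assumption the comment makes for Rome parts other than the 7232P.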

un_passant@reddit

Thx. It's most certainly not underclocked, as this CPU has the highest TDP of the Epyc Gen 2 lineup even with a reduced core count (Amazon probably wanted to avoid thermal throttling between VMs). I'm not sure how the cache gives us the CCD count, and fwiw, I have 227 GB/s of memory bandwidth on a Proxmox VM on a dual-CPU system with 2DPC (32 DDR 3200 sticks). I'm wondering how much bandwidth, if any, is lost because of the CPU (I picked them for computing power, not paying attention to the CCD situation).

panchovix@reddit

Wow, many thanks! This is very useful info, I may go for Genoa.

AbheekG@reddit

Awesome, thanks so much again!

MLDataScientist@reddit

@panchovix can you please share which bifurcation card you are using? I bought one from eBay but it is bifurcating into x4 and X1 (probably some cheap wiring there). Also, if you are using your M.2 slots, are you using SATA drives for storage?

panchovix@reddit

I'm using an X8/X8 bifurcator I got from AliExpress, but set in the BIOS to X4/X4 on the second slot. I'm not at the PC right now, but it is a PCIe 4.0 one that costs like 20-25 USD. I'm using the other 2 M.2 slots (bottom, chipset) for the OSes (Windows, Linux), and SATA + USB-to-NVMe for storage.

MLDataScientist@reddit

Thanks! One last question. My motherboard supports PCIe 4.0 X16 to 4x4 bifurcation for connecting four M.2 drives in RAID mode using the ASUS Hyper M.2 expansion card. Do you think I can get that expansion card, use four M.2 to X16 adapters, and connect 4 GPUs to it? I could not find any answer in multiple forums.

panchovix@reddit

Yes, you can. No issues, just make sure you get something good, from ADT Link. I suggest K43SP or F43SP and you will be fine. K43SG/F43SG if you have multiple PSUs.

MLDataScientist@reddit

Thanks! I wonder why this is not discussed often. X16 to 4x4 bifurcation should have been popular during the coin mining period. But no, no one actually used such a setup. What I want to do is as follows. I have four Gigabyte CRSG421 PCIe 4.0 x16 to 2x16 boards with active switch microchips. I want to use that 4x4 M.2 expansion card, then M.2 to PCIe X16 adapters, and finally those switches to connect a total of 8 GPUs. Basically, I will have PCIe 4.0 x16 to 8x2, with each GPU limited to PCIe 4.0 X2 speed. Not sure if this is a good idea 😅

kevin_1994@reddit

Question! How are you mixing amd with nvidia in llama.cpp??

Sir_Joe@reddit

Btw I do that and there's no problem at all with llama.cpp. You just need to compile with support for Vulkan (or ROCm) + CUDA.

panchovix@reddit

It is mixing CUDA + CPU, so it is as simple as offloading layers to the CUDA devices, with the rest on the CPU.

kevin_1994@reddit

Ooh sorry my bad. Thought you were referring to Radeon 7800 graphics card haha. Carry on

Vostroya@reddit

What do you use for your front end? Kobold? vLLM?

panchovix@reddit

ST and the normal llama.cpp server work fine for me.

Vostroya@reddit

Nice! I’m working my way up to getting DeepSeek local. Got an Intel 8-channel DDR5 setup, but ktransformers is a mess to try and get going right now.

shing3232@reddit (OP)

And any future model that uses MLA as well. I am looking forward to some GQA models converted to MLA via TransMLA.

Chance-Hovercraft649@reddit

How does it calculate the values, if it doesn't cache them?

VoidAlchemy@reddit

I have a graph showing how much VRAM is used for various MLA context lengths on my [ubergarm/DeepSeek-V3-0324-GGUF](https://huggingface.co/ubergarm/DeepSeek-V3-0324-GGUF#quant-comparisons) quant, as the [ik_llama.cpp fork]() has had FA MLA working for a while now, at higher speeds for CPU than mainline. Be careful, as the newer mainline llama.cpp MLA quants were implemented differently for some reason, and ik had to add backwards compatibility for them, which may not get you the full speed of using `-mla 3`.

I would love to see someone convert qwen3moe to use MLA with proper fine-tuning. The long-context VRAM savings are pretty amazing, though I haven't measured the performance drop at that very long context length.

> The expressiveness of MLA is greater than that of GQA when both have the same size of KV cache.
>
> -[TransMLA: Multi-head Latent Attention Is All You Need](https://arxiv.org/html/2502.07864v1)

shing3232@reddit (OP)

With proper training, MLA should exceed GQA performance for the same model. It also trains faster than GQA.
