TheaterFire

KV cache fix for GLM 4.7 Flash

Posted by jacek2023@reddit | LocalLLaMA | View on Reddit | 73 comments

tl;dr: remove Air from GLM 4.7 Flash

The KV cache uses a lot of VRAM, and GLM 4.7 Flash doesn't even use V in the KV cache. With long contexts, dropping it saves gigabytes of VRAM, so you can run a much longer context on the same setup.
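
To put rough numbers on the tl;dr, here is a minimal Python sketch of the arithmetic. A per-token cache grows linearly with context length, so dropping one of two equally sized buffers halves it. The layer count and row width are made up for illustration (they are not GLM 4.7 Flash's real dimensions), and the sketch assumes f16 storage and a V buffer as wide as the K buffer.

```python
def cache_bytes(n_layers: int, n_ctx: int, row_width: int, bytes_per_elem: float) -> float:
    """Bytes needed to cache one row of `row_width` elements per token per layer."""
    return n_layers * n_ctx * row_width * bytes_per_elem


# Hypothetical shape, chosen only to make the arithmetic concrete.
N_LAYERS = 40
ROW_WIDTH = 512        # elements cached per token per layer
F16_BYTES = 2          # bytes per f16 element
N_CTX = 128 * 1024     # 128K tokens
GIB = 1024 ** 3

k_only = cache_bytes(N_LAYERS, N_CTX, ROW_WIDTH, F16_BYTES)
k_and_v = 2 * k_only   # assumption: V buffer has the same width as K

print(f"K + V : {k_and_v / GIB:.2f} GiB")
print(f"K only: {k_only / GIB:.2f} GiB")
print(f"saved : {(k_and_v - k_only) / GIB:.2f} GiB")
```

Because the cost is linear in `N_CTX`, the same VRAM budget holds roughly twice the context once the second buffer is gone, which is in line with the 45K to 90K jump reported in the comments.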

73 Comments

teachersecret@reddit

Just tested with UD's k_xl 4 bit version on my 4090. Yesterday I was using it with about 45,000 context and maxing out the 4090. Now it fits with 90,000 context. I like the model. Still a bit quirky though.
View on Reddit #76766945

AfterAte@reddit

If you can, run your display off your iGPU. Before this build I could get 65K context on my 3090, with all 23.3GB used by llama.cpp.
View on Reddit #76838447

Kitchen-Year-8434@reddit

Be careful with this if you also game on your rig. I ended up with throttling and underutilized gpu in gaming because my iGPU couldn’t handle the bandwidth of framebuffer copying from my dGPU. Though I am running dual 4K ultra wide resolution on a NEO G9. Might not be a problem with 4K or 1440p.
View on Reddit #76931047

AfterAte@reddit

Good to know. I only game at 1440p at 60fps and my AMD 7000 series iGPU hasn't affected my 3090. But maybe I haven't been paying enough attention, so I'll keep an eye out next time I notice slowdowns.
View on Reddit #77115508

teachersecret@reddit

I'm rolling a 4090/5900x, no igpu. That said, it's better now. Here's my latest testing:

| Context | Graph Splits | VRAM | TTFT | Prompt | Generation |
|---------|--------------|------|------|--------|------------|
| 32K | 2 | 19.8 GB | 47.1 ms | 670.5 tok/s | 133.3 tok/s |
| 64K | 2 | 21.5 GB | 48.9 ms | 648.9 tok/s | 134.6 tok/s |
| 95K | 2 | 22.9 GB | ~50 ms | ~650 tok/s | ~135 tok/s |
| 96K | 7 | 23.0 GB | 52.0 ms | 612.1 tok/s | 125.4 tok/s |
| 128K | 24 | 23.0 GB | 80.3 ms | 396.7 tok/s | 95.2 tok/s |

It's slower at 128k because it has to split the graphs a bit more (it goes from 2 to 9 or something, I think). Still works, just a bit of performance loss.

| Graph Splits | Performance Impact |
|--------------|-------------------|
| 2 | Optimal - minimal overhead |
| 7-9 | ~7% slowdown |
| 21-24 | ~30% slowdown |
| 100+ | ~75% slowdown |

Also tried it with kv cache quantization:

| Configuration | Context | Generation Speed | Use Case |
|---------------|---------|------------------|----------|
| q4_0 KV | **202K** | **139 tok/s** | **Best overall** |
| q8_0 KV | 128K | 138 tok/s | High-quality KV cache |
| f16 KV (default) | 95K | 135 tok/s | Baseline |
| CPU MoE | 202K | 32 tok/s | Low VRAM systems |

On CPU MoE it still pulls off 32 tok/s at 202k context length.
View on Reddit #76886339
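
For a sense of why quantizing the KV cache stretches context, here is the per-element storage cost of the three cache types in the table above, based on ggml's 32-element block layouts for `q8_0` and `q4_0`. The jump in usable context in the table is smaller than these ratios alone would suggest, presumably because model weights and compute buffers share the same VRAM. That last part is an assumption, not something measured here.

```python
# ggml stores q8_0 and q4_0 in blocks of 32 elements, each with one f16 scale.
BLOCK_ELEMS = 32
BYTES_PER_BLOCK = {
    "f16": 32 * 2,    # not block-quantized; 32 values counted for a uniform comparison
    "q8_0": 2 + 32,   # f16 scale + 32 int8 values
    "q4_0": 2 + 16,   # f16 scale + 32 four-bit values packed into 16 bytes
}


def bytes_per_element(cache_type: str) -> float:
    """Average storage cost of one cached element."""
    return BYTES_PER_BLOCK[cache_type] / BLOCK_ELEMS


def relative_size(cache_type: str) -> float:
    """Cache size as a fraction of the f16 baseline."""
    return bytes_per_element(cache_type) / bytes_per_element("f16")


for name in BYTES_PER_BLOCK:
    print(f"{name}: {bytes_per_element(name):.4f} B/elem, "
          f"{relative_size(name):.1%} of f16")
```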

AfterAte@reddit

Thanks for the detailed info! I keep hearing KV cache quantization isn't worth it. But if you need to use it, don't go below Q8.
View on Reddit #77115274

floppypancakes4u@reddit

I'm on a 4090 as well, using LM Studio though, and I'm sure that's my problem, since I'm only getting 10 tk/s. What setup are you using and what's your tk/s?
View on Reddit #76789876

__Maximum__@reddit

I was impressed by its tool use. You throw tools at it, and it chains them like a pro. It calls search, then fetches the URL, then based on that does another search, then based on all the above git clones a repo, edits it, runs tests and so on for hours without any issues. All simple tasks, of course. When given a huge codebase, it will still use tons of tools but will come up with wrong conclusions or have obviously wrong priorities. I've only used the API so far, so I don't know if this holds up on local setups with quants, but I sure hope so. Btw, the model behind the API is having huge issues atm as well. Almost unusable.
View on Reddit #76771578

teachersecret@reddit

Yeah. I found you have to loop in some agentic double checking and scaffolding to keep it on track, and on a larger codebase I think you’d really want to focus it on some small piece or feature. I can’t imagine actually coding with it over something like opus 4.5, but for agentic local stuff? It’s pretty damn impressive. I plan on getting vllm up and running with it once they’ve got it all dialed in there. It’s small enough that we should be able to run multiple simultaneous agents - possibly dozens of them. I’m kinda excited to see what a pile of local agents set to work could do with such reliable tool calling.
View on Reddit #76772818

Front_Eagle739@reddit

Yeah, between this and mirothinker 30 we definitely just hit a new level of ability for 30b models. Struggling to figure out which I prefer though. Still getting a bit more confusion out of flash, but I'm struggling to keep up with all the fixes lol
View on Reddit #76774016

Cool-Chemical-5629@reddit

I trust ggerganov, but still I have to ask. Is this REALLY safe? I mean removal of the V portion of the cache? Is that really how the model works / is supposed to work? I just hope they aren't vibe coding this or something and that they really know what they are doing lol. Sure the model is currently slow but what the heck it's far better than other models of that size, so they better not break it more. 😂
View on Reddit #76766609

jacek2023@reddit (OP)

(not sure whether you're trolling or not) From my understanding, MLA uses a different kind of cache, so one value (the latent) is used instead of two (K/V).
View on Reddit #76766903

Cool-Chemical-5629@reddit

It was an honest question, not trolling at all. Stuff breaks sometimes, it happens even to the best coders out there. I'm starting to like this model more every day, so naturally I'm anxious whenever there's a new change to the runtime which could make it run 5000 times better or leave it completely broken lol
View on Reddit #76767468

insulaTropicalis@reddit

There is no way to vibe code llama.cpp. It's a huge app mainly in C++, something that even frontier models would struggle with.
View on Reddit #76772591

ResidentPositive4122@reddit

> There is no way to vibe code llama.cpp

People have vibecoded a tensor library and trained models on top of it, so the capabilities are improving fast.
View on Reddit #76791119

AfterAte@reddit

It would make it un-maintainable by humans, and grow the tech debt on an exponential scale, where even the LLMs would have a hard time making fixes. Llama.cpp isn't a one off proof of concept. Although for Llama.cpp PRs, it seems you can still use LLMs to diagnose or suggest a plan (and state that you did), but you still need to understand the implications of the code you're writing, which means experts only.
View on Reddit #76838794

Eisenstein@reddit

You'd be surprised. I had Claude code in GLM 4.7 Flash support for llama.cpp on day one. I showed the differences to a llamacpp dev today and was told

> overall not much difference. Some details missing on your side and some minor different ways of doing stuff

[Feel free to examine it.](https://huggingface.co/Jobaar/GLM-4.7-Flash-GGUF/tree/main/src)
View on Reddit #76801182

jacek2023@reddit (OP)

There are ways to validate model outputs, look at previous PRs
View on Reddit #76767557

Able_Ad1273@reddit

what is going on with this model lmao
View on Reddit #76764381

rashaniquah@reddit

I had a horrible time running it on vLLM too, because vLLM 0.14.0 came out a couple of hours after the model's release
View on Reddit #76831670

jacek2023@reddit (OP)

let me quote Z.ai: "two weeks" ;)
View on Reddit #76764596

MrWeirdoFace@reddit

Tweeeeeeo weeerral....
View on Reddit #76785948

crantob@reddit

.. to flatten the curve?
View on Reddit #76826505

-p-e-w-@reddit

Modern LLMs are extremely complex, with almost all of them now introducing new attention or MoE techniques, every single time. But the biggest problem is that automated correctness testing pretty much isn’t a thing, with basically no progress on that topic in the past 2 years.
View on Reddit #76765175

teachersecret@reddit

I am surprised someone hasn't knocked something together for that purpose. Life on the bleeding edge.
View on Reddit #76766433

-p-e-w-@reddit

It’s a lot more difficult than it may seem, because even updating the GPU driver can change the results.
View on Reddit #76772903
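
A small Python illustration of why bit-exact output comparison is fragile. The comment does not name a mechanism, so this is only one plausible contributor, stated as an assumption: floating-point addition is not associative, so a kernel that accumulates the same values in a different order can return a slightly different result. A test harness would then have to compare within a tolerance rather than for equality.

```python
import math


def sum_in_order(values):
    """Accumulate left to right, the way a simple reduction loop would."""
    total = 0.0
    for v in values:
        total += v
    return total


values = [0.1, 0.2, 0.3]
a = sum_in_order(values)                   # (0.1 + 0.2) + 0.3
b = sum_in_order(list(reversed(values)))   # (0.3 + 0.2) + 0.1

exact_match = a == b                              # bitwise-equal results?
close_match = math.isclose(a, b, rel_tol=1e-9)    # equal within a tolerance?

print(f"a = {a!r}, b = {b!r}")
print(f"exact match: {exact_match}, close match: {close_match}")
```

Picking the tolerance is the hard part: too tight and every driver update fails the test, too loose and real implementation bugs slip through.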

teachersecret@reddit

Yeah, I hear you. I’m constantly annoyed by my shuffling stack of drivers.
View on Reddit #76786380

gtek_engineer66@reddit

Drivers and wheels built for different versions of things that don't get along, installed by different package managers to deal with new hardware on old systems. Talk about a Goldilocks condition to get one of these things running.
View on Reddit #76800338

Objective_Mousse7216@reddit

If only AI could write complex code for itself....
View on Reddit #76785912

ilintar@reddit

Non-trivial architecture that has to be adapted. I told you give us a week :)
View on Reddit #76793386

sleepingsysadmin@reddit

I get qwen next having pains on release; they did something new. This model is cursed.
View on Reddit #76765997

jacek2023@reddit (OP)

qwen next is at least merged, look at kimi linear ;)
View on Reddit #76766084

Hunting-Succcubus@reddit

Do the llama.cpp devs hate Kimi Linear? No love at all
View on Reddit #76770829

jacek2023@reddit (OP)

In my personal opinion there are big differences between the ComfyUI community and the local LLM community. The pressure from users is higher in ComfyUI because people actually use models every day, while here a big portion of LocalLLaMA users just hype the benchmarks and a minority is actually doing something. We need more projects like heretic from u/-p-e-w-/ to make people more creative.
View on Reddit #76775179

Hunting-Succcubus@reddit

I thought LLMs had significantly more users than 1girl generators. LLMs should have more pressure.
View on Reddit #76780864

jacek2023@reddit (OP)

you must remember about cloud models vs local models
View on Reddit #76781758

ilintar@reddit

No, we just have to pick our work to do and someone else volunteered to work on Kimi. Anyways, it's almost done.
View on Reddit #76776688

markole@reddit

Somehow it works great on my side with recent llama.cpp, opencode and unsloth q8 quant. 🤷
View on Reddit #76771774

teachersecret@reddit

Not unusual for some of these Chinese models to be broken for a few weeks while people get them properly implemented :).
View on Reddit #76764917

Aggressive-Bother470@reddit

wtf are these downvotes, lol.  truer words ne'er be spake.
View on Reddit #76767344

teachersecret@reddit

I'm guessing it's bots who thought I was being negative to China or something?
View on Reddit #76767593

mister2d@reddit

No, the downvote is because your reply was inaccurate and lacking in understanding. Lately, it feels like sharing accurate information is becoming an afterthought.
View on Reddit #76778820

jacek2023@reddit (OP)

it's llama.cpp implementation, not the model itself
View on Reddit #76765028

teachersecret@reddit

Yeah, I know (although sometimes it's both, lol).
View on Reddit #76766088

Alarming-Ad8154@reddit

Very much this! They have a super innovative attention implementation, which sips memory (see the mlx implementations and benchmarks of the same model). It just requires new inference code in llama.cpp…
View on Reddit #76765285

LocoMod@reddit

I've abstained from using this model until the issues are ironed out. Seems like we're at a point where we can cook. What are the recommended llama-server params to primarily use it as an "orchestrator" that invokes tools and other agents? I'm using the Q6_K_XL Unsloth version on an RTX5090. The model is 26GB so I have 6GB to fit the maximum content in. What ctx and temp is everyone using?
View on Reddit #76814681

LocoMod@reddit

Disregard. It's working great. https://preview.redd.it/9hbijyiuplfg1.png?width=3092&format=png&auto=webp&s=3b795db3996792d384f5bbd315cc4d1ecb077893
View on Reddit #76817549

alex_bit_@reddit

Where’s vLLM?
View on Reddit #76797383

ladz@reddit

Latest build tripled generation TPS for me. Yay!
View on Reddit #76795853

harrro@reddit

The model is good and fast but it is so verbose in reasoning (even for simple things). Is it possible to limit/disable reasoning or is this not trained for that?
View on Reddit #76774331

robiinn@reddit

You can disable it with `--chat-template-kwargs '{"enable_thinking": false}'`
View on Reddit #76784675
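
If you launch the server from a script, shell quoting of the JSON argument is easy to get wrong. Here is a small Python sketch that builds the argument list instead. The `--chat-template-kwargs` flag and its value come from the comment above; `-m` and `-c` are the usual llama-server model and context-size options, and `model.gguf` is a placeholder path, not a real file name from this thread.

```python
import json
import shlex


def llama_server_command(model_path: str, ctx_size: int, enable_thinking: bool) -> list[str]:
    """Build an argv list for llama-server with the thinking toggle set."""
    # json.dumps produces the lowercase `false`/`true` that the flag expects.
    template_kwargs = json.dumps({"enable_thinking": enable_thinking})
    return [
        "llama-server",
        "-m", model_path,
        "-c", str(ctx_size),
        "--chat-template-kwargs", template_kwargs,
    ]


# Placeholder model path and context size, for illustration only.
argv = llama_server_command("model.gguf", 65536, enable_thinking=False)

# shlex.join shows the equivalent, correctly quoted shell command.
print(shlex.join(argv))
```

Passing `argv` to `subprocess.run` as a list avoids the shell entirely, so the quotes inside the JSON never need escaping.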

harrro@reddit

Worked perfectly! Thank you.
View on Reddit #76787724

__Maximum__@reddit

What are you using it for? I think you are supposed to turn on the thinking because this is an agent model
View on Reddit #76793447

viperx7@reddit

When I use it directly I feel the same, but somehow when using it with opencode it thinks very optimally and to the point. That leads me to believe a good state prompt is what you need to make this model's thinking not too verbose.
View on Reddit #76783838

jacek2023@reddit (OP)

I have the same experience: opencode somehow works. With this new patch I have a kind of "Claude Code at home" feeling.
View on Reddit #76789425

nasone32@reddit

it reasons less at lower temperature
View on Reddit #76785740

Odd-Ordinary-5922@reddit

Getting 5 more tokens/s, but it's good because I was getting 25 before.
View on Reddit #76792667

GaboureySidibe@reddit

A KV data structure without the values is just a set.
View on Reddit #76792223

jacek2023@reddit (OP)

BTW this model is quite popular https://preview.redd.it/th6swn271jfg1.png?width=1964&format=png&auto=webp&s=804be603da5a2b358a0f9f826aab4d1f1849d067
View on Reddit #76776094

mister2d@reddit

We want it to succeed. 😊
View on Reddit #76779057

__Maximum__@reddit

We are now just 5 patches away from running this model locally without issues!
View on Reddit #76766345

jacek2023@reddit (OP)

what are you ta..... [https://github.com/ggml-org/llama.cpp/pull/19092](https://github.com/ggml-org/llama.cpp/pull/19092)
View on Reddit #76778616

Hunting-Succcubus@reddit

Actually its 7 patches
View on Reddit #76770997

LagOps91@reddit

wait what? how does it work without using values? is this an RNN architecture?
View on Reddit #76769557

jacek2023@reddit (OP)

MLA
View on Reddit #76769618

LagOps91@reddit

how does it avoid V cache? i was under the impression that MLA is still based on standard attention with some improvements made to increase memory efficiency. is the V cache combined with something else that's stored or how does it work?
View on Reddit #76769783

shing3232@reddit

V cache is basically compressed inside K cache
View on Reddit #76773012
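
A toy Python sketch of the idea in the replies above: if both K and V are derived from one shared low-rank latent per token, caching only that latent is enough to rebuild both. All sizes and weights here are invented for illustration, and the sketch leaves out everything else a real MLA layer does (positional encoding, multiple heads, the attention itself), so treat it as a picture of the caching trick rather than of llama.cpp's implementation.

```python
def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


# Toy weights (hypothetical): hidden size 4, latent size 2, head size 3.
W_DOWN = [[1, 0, 2, 0], [0, 1, 0, 3]]   # hidden -> latent
W_UP_K = [[1, 0], [0, 1], [1, 1]]       # latent -> key
W_UP_V = [[2, 0], [0, 2], [1, -1]]      # latent -> value

hidden_states = [[1, 2, 3, 4], [0, 1, 0, 1], [2, 0, 1, 0]]  # one row per token

# Conventional cache: store a key row and a value row for every token.
k_cache = [matvec(W_UP_K, matvec(W_DOWN, h)) for h in hidden_states]
v_cache = [matvec(W_UP_V, matvec(W_DOWN, h)) for h in hidden_states]

# Latent cache: store only the shared latent, expand it when needed.
latent_cache = [matvec(W_DOWN, h) for h in hidden_states]
k_from_latent = [matvec(W_UP_K, c) for c in latent_cache]
v_from_latent = [matvec(W_UP_V, c) for c in latent_cache]

per_token_conventional = len(k_cache[0]) + len(v_cache[0])
per_token_latent = len(latent_cache[0])

print("same keys:  ", k_from_latent == k_cache)
print("same values:", v_from_latent == v_cache)
print(f"numbers cached per token: {per_token_conventional} vs {per_token_latent}")
```

In this picture there is no separate V to store, which is the sense in which "V is compressed inside the K cache".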

jacek2023@reddit (OP)

https://www.reddit.com/r/LocalLLaMA/s/6WDxYlAzAm
View on Reddit #76769877

viperx7@reddit

# GLM 4.7 unsloth (data for 20k context filled)

# Before this change

|Quant|GPU|Context|Prompt Processing|Token Generation|Notes|
|:-|:-|:-|:-|:-|:-|
|UD-Q4_K_XL|Single 4090|64k|3489 t/s|88 t/s||
|UD-Q4_K_XL|4090 + 3060|170k|2017 t/s|52 t/s||
|Q8|4090 + 3060|30k|2087 t/s|47.1 t/s||
|Q8|4090 + 3060 + cpu|64k|1711 t/s|41.3 t/s|`-ot '([2][0-2]).ffn_.*_exps.=CPU'`|

# After the change

|Quant|GPU|Context|Prompt Processing|Token Generation|Notes|
|:-|:-|:-|:-|:-|:-|
|UD-Q4_K_XL|Single 4090|128k|3510 t/s|92.5 t/s||
|UD-Q4_K_XL|4090 + 3060|200k|2041 t/s|56.2 t/s||
|Q8|4090 + 3060|72k|2058 t/s|50.4 t/s||
|Q8|4090 + 3060 + cpu|100k|1968 t/s|45.7 t/s|`-ot '([2][0-2]).ffn_.*_exps.=CPU'`|
View on Reddit #76769835
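
The `-ot` value in the tables above is a `pattern=buffer` pair. The Python sketch below checks which tensors the pattern would select. It rests on two assumptions: the tensor names follow llama.cpp's usual `blk.<layer>.<name>.weight` convention (the specific names listed are illustrative), and Python's `re.search` behaves like llama.cpp's matcher for this simple pattern.

```python
import re


def split_override(spec: str) -> tuple[str, str]:
    """Split an override spec of the form 'pattern=buffer' at the last '='."""
    pattern, _, buffer_type = spec.rpartition("=")
    return pattern, buffer_type


pattern, buffer_type = split_override("([2][0-2]).ffn_.*_exps.=CPU")
regex = re.compile(pattern)

# Illustrative tensor names, not read from a real GGUF file.
tensor_names = [
    "blk.19.ffn_gate_exps.weight",
    "blk.20.ffn_gate_exps.weight",
    "blk.21.ffn_up_exps.weight",
    "blk.22.ffn_down_exps.weight",
    "blk.23.ffn_down_exps.weight",
    "blk.20.attn_q.weight",
]

# Unanchored search: the pattern may match anywhere inside the name.
offloaded = [name for name in tensor_names if regex.search(name)]

print(f"buffer type: {buffer_type}")
for name in offloaded:
    print("  ->", name)
```

Under those assumptions the pattern picks the expert FFN tensors of layers 20 to 22 only. Because it is unanchored and its dots are unescaped, it would also match a layer such as 120 in a deeper model.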

FluoroquinolonesKill@reddit

This at least doubles the speed on my rig. Now I am getting about 30 t/s. Before, I was getting about 10-13 t/s.
View on Reddit #76768456

Deep_Traffic_7873@reddit

Is re-re-download needed for the gguf? 
View on Reddit #76765841

jacek2023@reddit (OP)

no
View on Reddit #76765873