KV cache fix for GLM 4.7 Flash

[-]

teachersecret@reddit

Just tested with UD's k\_xl 4 bit version on my 4090. Yesterday I was using it with about 45,000 context and maxing out the 4090. Now it fits with 90,000 context. I like the model. Still a bit quirky though.

Reply

[-]

AfterAte@reddit

If you can, run your display off your IGPU. I could get 65K context before this build on my 3090, using 23.3GB all for llama.cpp.

Reply

[-]

Kitchen-Year-8434@reddit

Be careful with this if you also game on your rig. I ended up with throttling and underutilized gpu in gaming because my iGPU couldn’t handle the bandwidth of framebuffer copying from my dGPU. Though I am running dual 4K ultra wide resolution on a NEO G9. Might not be a problem with 4K or 1440p.

Reply

[-]

AfterAte@reddit

Good to know. I only game at 1440p at 60fps and my AMD 7000 series iGPU hasn't affected my 3090. But maybe I haven't been paying enough attention, so I'll keep an eye out next time I notice slowdowns.

Reply

[-]

teachersecret@reddit

I'm rolling a 4090/5900x, no igpu. That said, it's better now. Here's my latest testing: | Context | Graph Splits | VRAM | TTFT | Prompt | Generation | |---------|--------------|------|------|--------|------------| | 32K | 2 | 19.8 GB | 47.1 ms | 670.5 tok/s | 133.3 tok/s | | 64K | 2 | 21.5 GB | 48.9 ms | 648.9 tok/s | 134.6 tok/s | | 95K | 2 | 22.9 GB | \~50 ms | \~650 tok/s | \~135 tok/s | | 96K | 7 | 23.0 GB | 52.0 ms | 612.1 tok/s | 125.4 tok/s | | 128K | 24 | 23.0 GB | 80.3 ms | 396.7 tok/s | 95.2 tok/s | It's slower at 128k because it has to split the graphs a bit more (it goes from 2 to 9 or something, I think). Still works, just a bit of performance loss. | Graph Splits | Performance Impact | |--------------|-------------------| | 2 | Optimal - minimal overhead | | 7-9 | \~7% slowdown | | 21-24 | \~30% slowdown | | 100+ | \~75% slowdown | Also tried it with kv cache quantization: | Configuration | Context | Generation Speed | Use Case | |---------------|---------|------------------|----------| | q4\_0 KV | \*\*202K\*\* | \*\*139 tok/s\*\* | \*\*Best overall\*\* | | q8\_0 KV | 128K | 138 tok/s | High-quality KV cache | | f16 KV (default) | 95K | 135 tok/s | Baseline | | CPU MoE | 202K | 32 tok/s | Low VRAM systems | On CPU MoE it still pulls off 32 tok/s at 202k context length.

Reply

[-]

AfterAte@reddit

Thanks for the detailed info! I keep hearing kv cache isn't worth it. But if you need to use it, don't go below Q8.

Reply

[-]

floppypancakes4u@reddit

Im on 4090 as well, using LM studio though and im sure thats my problem, since im only getting 10tks. What setup are you using and whats your tks?

Reply

[-]

Maximum@reddit

I was impressed by its tool use. You throw at it tools, and chains them like a pro. It calls search, then fetches the URL, then based on that another search, based on all the above git clones a repo, edits it, runs tests and so on for hours without any issues. All simple tasks, of course. When given a huge codebase, it will still use tons of tools but will come up with wrong conclusions or have obviously wrong priorities. I used the API so far, so don't know if this holds up on local setups with quants, but I sure hope so. Btw, the model behind API is having huge issues atm as well. Almost unusable.

Reply

[-]

teachersecret@reddit

Yeah. I found you have to loop in some agentic double checking and scaffolding to keep it on track, and on a larger codebase I think you’d really want to focus it on some small piece or feature. I can’t imagine actually coding with it over something like opus 4.5, but for agentic local stuff? It’s pretty damn impressive. I plan on getting vllm up and running with it once they’ve got it all dialed in there. It’s small enough that we should be able to run multiple simultaneous agents - possibly dozens of them. I’m kinda excited to see what a pile of local agents set to work could do with such reliable tool calling.

Reply

[-]

Front_Eagle739@reddit

Yeah between this and mirothinker 30 we definitely just hit a new level of ability for 30b models. Steuggling to figure out which i prefer though. Still getting a bit more confusion out of flash but im struggling to keep up with all the fixes lol

Reply

[-]

Cool-Chemical-5629@reddit

I trust ggerganov, but still I have to ask. Is this REALLY safe? I mean removal of the V portion of the cache? Is that really how the model works / is supposed to work? I just hope they aren't vibe coding this or something and that they really know what they are doing lol. Sure the model is currently slow but what the heck it's far better than other models of that size, so they better not break it more. 😂

Reply

[-]

jacek2023@reddit (OP)

(not sure are you trolling or not) from my understanding MLA uses different kind of cache, so one value (latent) is used instead two k/v

Reply

[-]

Cool-Chemical-5629@reddit

It was a honest question, not trolling at all. Stuff breaks sometimes, it happens even to the best coders out there. I'm starting to like this model more every day, so naturally I'm anxious whenever there's a new change to the runtime which could make it run 5000 times better or leave it completely broken lol

Reply

[-]

insulaTropicalis@reddit

There is no way to vibe code llama.cpp. It's a huge app mainly in C++, something that even frontier models would struggle with.

Reply

[-]

ResidentPositive4122@reddit

> There is no way to vibe code llama.cpp People have vibecoded a tensor library and trained models on top of it, so the capabilities are improving fast.

Reply

[-]

AfterAte@reddit

It would make it un-maintainable by humans, and grow the tech debt on an exponential scale, where even the LLMs would have a hard time making fixes. Llama.cpp isn't a one off proof of concept. Although for Llama.cpp PRs, it seems you can still use LLMs to diagnose or suggest a plan (and state that you did), but you still need to understand the implications of the code you're writing, which means experts only.

Reply

[-]

Eisenstein@reddit

You'd be surprised. I had Claude code in GLM 4.7 Flash support for llama.cpp on day one. I showed the differences to a llamacpp dev today and was told > overall not much difference. Some details missing on your side and some minor different ways of doing stuff [Feel free to examine it.](https://huggingface.co/Jobaar/GLM-4.7-Flash-GGUF/tree/main/src)

Reply

[-]

jacek2023@reddit (OP)

There are ways to validate model outputs, look at previous PRs

Reply

[-]

Able_Ad1273@reddit

what is going on with this model lmao

Reply

[-]

rashaniquah@reddit

I had a horrible time running it on vLLM too because the 0.14.0 was released a couple hours after release

Reply

[-]

jacek2023@reddit (OP)

let me quote Z.ai: "two weeks" ;)

Reply

[-]

MrWeirdoFace@reddit

Tweeeeeeo weeerral....

Reply

[-]

crantob@reddit

.. to flatten the curve?

Reply

[-]

-p-e-w-@reddit

Modern LLMs are extremely complex, with almost all of them now introducing new attention or MoE techniques, every single time. But the biggest problem is that automated correctness testing pretty much isn’t a thing, with basically no progress on that topic in the past 2 years.

Reply

[-]

teachersecret@reddit

I am surprised someone hasn't knocked something together for that purpose. Life on the bleeding edge.

Reply

[-]

-p-e-w-@reddit

It’s a lot more difficult than it may seem, because even updating the GPU driver can change the results.

Reply

[-]

teachersecret@reddit

Yeah, I hear you. I’m constantly annoyed by my shuffling stack of drivers.

Reply

[-]

gtek_engineer66@reddit

Drivers and wheels built for different versions of things that don't get along installed by different package managers to deal with new hardware on old systems. Talk about a goldilock condition to get one of these things running

Reply

[-]

Objective_Mousse7216@reddit

If only AI could write complex code for itself....

Reply

[-]

ilintar@reddit

Non-trivial architecture that has to be adapted. I told you give us a week :)

Reply

[-]

sleepingsysadmin@reddit

I get qwen next having pains on release; they did something new. This model is cursed.

Reply

[-]

jacek2023@reddit (OP)

qwen next is at least merged, look at kimi linear ;)

Reply

[-]

Hunting-Succcubus@reddit

Do llama dev hate kimi linear? No love at all

Reply

[-]

jacek2023@reddit (OP)

In my personal opinion there are big differences between comfyui community and local LLMs community. The pressure from users is higher in comfyui because people actually use models every day, while here big portion of LocalLLaMA users just hype the benchmarks and minority is actually doing something. We need more projects like heretic from u/-p-e-w-/ to make people more creative.

Reply

[-]

Hunting-Succcubus@reddit

I thought llm has significantly more user than 1girl generators. Llm should have more pressure.

Reply

[-]

jacek2023@reddit (OP)

you must remember about cloud models vs local models

Reply

[-]

ilintar@reddit

No, we just have to pick our work to do and someone else volunteered to work on Kimi. Anyways, it's almost done.

Reply

[-]

markole@reddit

Somehow it works great on my side with recent llama.cpp, opencode and unsloth q8 quant. 🤷

Reply

[-]

teachersecret@reddit

Not unusual for some of these Chinese models to be broken for a few weeks while people get them properly implemented :).

Reply

[-]

Aggressive-Bother470@reddit

wtf are these downvotes, lol. truer words ne'er be spake.

Reply

[-]

teachersecret@reddit

I'm guessing it's bots who thought I was being negative to China or something?

Reply

[-]

mister2d@reddit

No, the downvote is because your reply was inaccurate and lacking in understanding. Lately, it feels like sharing accurate information is becoming an afterthought.

Reply

[-]

jacek2023@reddit (OP)

it's llama.cpp implementation, not the model itself

Reply

[-]

teachersecret@reddit

Yeah, I know (although sometimes it's both, lol).

Reply

[-]

Alarming-Ad8154@reddit

Very much this! They have a super innovative attention implementation, which sips memory (see the mlx implementations and benchmarks of the same model). It just requires new inference code in llama.cpp…

Reply

[-]

LocoMod@reddit

I've abstained from using this model until the issues are ironed out. Seems like we're at a point where we can cook. What are the recommended llama-server params to primarily use it as an "orchestrator" that invokes tools and other agents? I'm using the Q6_K_XL Unsloth version on an RTX5090. The model is 26GB so I have 6GB to fit the maximum content in. What ctx and temp is everyone using?

Reply

[-]

LocoMod@reddit

Disregard. Its working great. https://preview.redd.it/9hbijyiuplfg1.png?width=3092&format=png&auto=webp&s=3b795db3996792d384f5bbd315cc4d1ecb077893

Reply

[-]

alex_bit_@reddit

Where’s vLLM?

Reply

[-]

ladz@reddit

Latest build tripled generation TPS for me. Yay!

Reply

[-]

harrro@reddit

The model is good and fast but it is so verbose in reasoning (even for simple things). Is it possible to limit/disable reasoning or is this not trained for that?

Reply

[-]

robiinn@reddit

You can disable it with `--chat-template-kwargs '{"enable_thinking": false}'`

Reply

[-]

harrro@reddit

Worked perfectly! Thank you.

Reply

[-]

Maximum@reddit

What are you using it for? I think you are supposed to turn on the thinking because this is an agent model

Reply

[-]

viperx7@reddit

When I use. It directly I feel the same but somehow when using it with opencode it thinks very optimally and to the point That leads me to believe a good state prompt is what you need to make this model's thinking not too verbose

Reply

[-]

jacek2023@reddit (OP)

I have same experiences, opencode somehow works, with this new patch I have kind of "Claude Code at home" feeling

Reply

[-]

nasone32@reddit

it reasons less at lower temperature

Reply

[-]

Odd-Ordinary-5922@reddit

getting 5 more tokens/s but its good because I was getting 25 before

Reply

[-]

GaboureySidibe@reddit

A KV data structure without the values is just a set.

Reply

[-]

WithoutReason1729@reddit

Your post is getting popular and we just featured it on our Discord! [Come check it out!](https://discord.gg/PgFhZ8cnWW) You've also been given a special flair for your contribution. We appreciate your post! *I am a bot and this action was performed automatically.*

Reply

[-]

jacek2023@reddit (OP)

BTW this model is quite popular https://preview.redd.it/th6swn271jfg1.png?width=1964&format=png&auto=webp&s=804be603da5a2b358a0f9f826aab4d1f1849d067

Reply

[-]

mister2d@reddit

We want it to succeed. 😊

Reply

[-]

Maximum@reddit

We are now just 5 patches away from running this model locally without issues!

Reply

[-]

jacek2023@reddit (OP)

what are you ta..... [https://github.com/ggml-org/llama.cpp/pull/19092](https://github.com/ggml-org/llama.cpp/pull/19092)

Reply

[-]

Hunting-Succcubus@reddit

Actually its 7 patches

Reply

[-]

LagOps91@reddit

wait what? how does it work without using values? is this an RNN architecture?

Reply

[-]

jacek2023@reddit (OP)

MLA

Reply

[-]

LagOps91@reddit

how does it avoid V cache? i was under the impression that MLA is still based on standard attention with some improvements made to increase memory efficiency. is the V cache combined with something else that's stored or how does it work?

Reply

[-]

shing3232@reddit

V cache is basically compressed inside K cache

Reply

[-]

jacek2023@reddit (OP)

https://www.reddit.com/r/LocalLLaMA/s/6WDxYlAzAm

Reply

[-]

viperx7@reddit

# GLM 4.7 unsloth (data for 20k context filled) # Before this change |Quant|GPU|Context|Prompt Processing|Token Generation|Notes| |:-|:-|:-|:-|:-|:-| |UD-Q4\_K\_XL|Single 4090|64k|3489 t/s|88 t/s|| |UD-Q4\_K\_XL|4090 + 3060|170k|2017 t/s|52 t/s|| |Q8|4090 + 3060|30k|2087 t/s|47.1 t/s|| |Q8|4090 + 3060 + cpu|64k|1711 t/s|41.3 t/s|`-ot '([2][0-2]).ffn_.*_exps.=CPU'`| # After the change |Quant|GPU|Context|Prompt Processing|Token Generation|Notes| |:-|:-|:-|:-|:-|:-| |UD-Q4\_K\_XL|Single 4090|128k|3510 t/s|92.5 t/s|| |UD-Q4\_K\_XL|4090 + 3060|200k|2041 t/s|56.2 t/s|| |Q8|4090 + 3060|72k|2058 t/s|50.4 t/s|| |Q8|4090 + 3060 + cpu|100k|1968 t/s|45.7 t/s|`-ot '([2][0-2]).ffn_.*_exps.=CPU'`|

Reply

[-]

FluoroquinolonesKill@reddit

This at least doubles the speed on my rig. Now I am getting about 30 t/s. Before, I was getting about 10-13 t/s.

Reply to Post

73 Comments