GLM 4.7 Flash: Huge performance improvement with -kvu

[-]

DreamingInManhattan@reddit

Usually with a TLDR there's a section that's too long to read. But whoa, gotta try this out. Good stuff.

Reply

[-]

Aggressive_Arm9817@reddit

Did it fully make that zelda game like no interference from you? If so that's impressive asf for a model that small!

Reply

[-]

One prompt, temperature 0, using unsloth BF16, llama.cpp, and Cherry Desktop: `create a zelda game in html, placing the html for the game in a markdown code block` Should be repeatable if you want to try it.

Reply

[-]

teachersecret@reddit

I gave this a shot on the UD 4 bit k_xl model. 0 temp, 0 rep pen, 1 top p, 0.01 min p. [https://gist.github.com/Deveraux-Parker/8dec86f7d94c5d5d01a7cc6bbec3c4b2](https://gist.github.com/Deveraux-Parker/8dec86f7d94c5d5d01a7cc6bbec3c4b2) Prompt was "Create a full featured beautiful 3d Zelda game that feels and plays beautifully in a single markdown block, in a single HTML page." It ended up spitting out 719 lines of code. It spent 7,299 tokens thinking and writing that code. Took about 89 seconds on a 4090 with the model set up on llama.cpp with a 96k token context window. It's a fantastic model.

Reply

[-]

TokenRingAI@reddit (OP)

Join the competition: [https://www.reddit.com/r/LocalLLaMA/comments/1qouiy8/oneshot_zelda_game_competition/](https://www.reddit.com/r/LocalLLaMA/comments/1qouiy8/oneshot_zelda_game_competition/)

Reply

[-]

BuenosAir@reddit

Is this the exact prompt that you used? I tried the same prompt with the same parameters on the UD 8bit K_XL and I'm always getting broken code. I'm using it in opencode.

Reply

[-]

teachersecret@reddit

Yeah, that was the prompt, BUT!!! That said, I realize I probably had a system prompt accidentally set that influenced that generation, and I've since changed my system prompt and I'm not sure exactly what it was. Hilariously, I was using a writing-based system prompt, so IDK why that would have changed things, but when I try to recreate it, the result is not exact. I tried several times though and had no issue getting working code, most of the time on the first try or with one minor fix (like, open the file in your browser, hit F12, copy the console errors by right clicking and selecting 'copy console', then paste that into the conversation and it'll spit out the fix).

Reply

[-]

TokenRingAI@reddit (OP)

Wow, that is even better

Reply

[-]

Aggressive_Arm9817@reddit

I gotta try this soon!

Reply

[-]

TokenRingAI@reddit (OP)

[https://www.reddit.com/r/LocalLLaMA/comments/1qouiy8/oneshot_zelda_game_competition/](https://www.reddit.com/r/LocalLLaMA/comments/1qouiy8/oneshot_zelda_game_competition/)

Reply

[-]

zoyer2@reddit

Using that prompt i got this :,D. Not bad though! https://preview.redd.it/hqxjgbd0kvfg1.png?width=848&format=png&auto=webp&s=f3fe603c208ea0282d031f0ab49e3c00c86d352a

Reply

[-]

ikkiyikki@reddit

I tried the same prompt in the full non-Flash 4.7 Q4 and the page it generated choked on the opening screen. If you one-shot it, I'm leaning toward it being a fluke.

Reply

[-]

Far-Low-4705@reddit

damn that is very impressive, what quant r u using?

Reply

[-]

TokenRingAI@reddit (OP)

The full model, FP16

Reply

[-]

TokenRingAI@reddit (OP)

https://preview.redd.it/xudvlhvu0sfg1.png?width=2799&format=png&auto=webp&s=e841fad99dea59ad008d36643e98ec5229e019a6

Reply

[-]

Mashiro-no@reddit

Wow, what UI frontend is that?

Reply

[-]

epyctime@reddit

Pretty sure based off his other comments it's https://www.cherry-ai.com/

Reply

[-]

fancyawesome@reddit

4.7flash cannot even do basic reasoning correctly. The speed is useless

Reply

[-]

TokenRingAI@reddit (OP)

It can. It is ridiculously fragile and needs temperature 0.2. But it can work agentically and solve problems. I have been seeing significant gains with it agentically after updating some of our tool descriptions. If your tool descriptions aren't perfect it will absolutely mess up. It might benefit from a different tool format; I will have to experiment with that.

Reply

[-]

fancyawesome@reddit

That's too tricky. The problem is, if you use it as the main LLM because it is good at tool calling but it has low intelligence in general, then what kind of use will an agent running on it actually be good for?

Reply

[-]

TokenRingAI@reddit (OP)

Here's an example of what it can do. I am running it in a loop on a new Svelte website I am working on, to implement proper meta and JSON-LD tags. It's a very specific task, essentially a foreach loop which runs a prompt on a single file. The loop is scripted. The agent is invoked on each file. The agent has a knowledge repository detailing our expectations for each page. It then updates each page. We run it, then run a TypeScript and Svelte check looking for problems, and feed those back to the agent up to 5 times. https://preview.redd.it/yrdeu1reqxfg1.jpeg?width=1280&format=pjpg&auto=webp&s=3eb93c275bfdb8632a7a4dbec8cb611a56bd7f1c
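For anyone who wants to picture the loop, here is a minimal Python sketch of that per-file workflow. It is not the actual Tokenring Coder implementation: `run_agent` and `run_checks` are hypothetical callables standing in for "invoke the agent on one file" and "run the TypeScript and Svelte checks and return any error text", and the limit of 5 feedback rounds comes from the description above.

```python
from typing import Callable, Dict, List

# Hypothetical signatures (assumptions, not a real API):
#   run_agent(path, feedback)  -> edits the file in place; feedback is "" on the first pass
#   run_checks(path)           -> returns error text, or "" when the checks pass
Agent = Callable[[str, str], None]
Checker = Callable[[str], str]


def fix_file_with_feedback(path: str, run_agent: Agent, run_checks: Checker,
                           max_rounds: int = 5) -> bool:
    """Run the agent once on a file, then feed check errors back up to max_rounds times.

    Returns True if the checks pass at the end, False otherwise.
    """
    run_agent(path, "")  # initial pass with no feedback
    for _ in range(max_rounds):
        errors = run_checks(path)
        if not errors:
            return True  # checks are clean, stop early
        run_agent(path, errors)  # give the errors back to the agent
    # One last check after the final feedback round.
    return not run_checks(path)


def process_site(paths: List[str], run_agent: Agent, run_checks: Checker,
                 max_rounds: int = 5) -> Dict[str, bool]:
    """The scripted 'foreach' part: one agent task per file, one result per file."""
    return {p: fix_file_with_feedback(p, run_agent, run_checks, max_rounds)
            for p in paths}
```

Keeping the loop in plain code and giving the model exactly one file per invocation means the model never has to plan the iteration itself; it only has to do the narrow edit and react to concrete checker output.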

Reply

[-]

fancyawesome@reddit

Can't those kinds of automation tasks be done with just Python?

Reply

[-]

TokenRingAI@reddit (OP)

It will be the best kind of agent that you can run on a single 5090 or R9700. FWIW, this model brought the purchase price of workable local agentic AI down from $7000 to $1300. I am excited to see what the next GLM Air might look like.

Reply

[-]

viperx7@reddit

Can you recommend me a model that is better than GLM 4.7 Flash? And can you rank the following:

- nemotron nano
- qwen 3 30b moe
  - coder
  - thinking
  - instruct
- GLM 4.7 flash

Reply

[-]

fancyawesome@reddit

For tool calling, GLM 4.7 Flash. For reasoning, Nemotron Nano. For vision, Qwen 3 30B VL. For coding, get GLM 4.7. Just my opinion.

Reply

[-]

jacek2023@reddit

`-kvu, --kv-unified`: use single unified KV buffer shared across all sequences (default: enabled if number of slots is auto)

Reply

[-]

Cool-Chemical-5629@reddit

Unfortunately LM Studio doesn't let you customize the command line parameters passed to llama.cpp, but since it says "enabled if number of slots is auto", I was wondering what the number of slots is. Maybe that can be set somewhere. Ugh. At this point, I wish LM Studio would just allow people to set their custom parameters. At least some extras that wouldn't really create a conflict with the default arguments they already pass there.

Reply

[-]

mycall@reddit

Have you tried using environment variables? I thought llama.cpp will use those if found.

Reply

[-]

exceptioncause@reddit

model.yaml is all about model capabilities, not about runtime settings, unfortunately

Reply

[-]

jacek2023@reddit

Why are you forced to use closed source software?

Reply

[-]

Cool-Chemical-5629@reddit

I'm not forced, I'm using it for convenience. It has its downsides yes, but it offers convenience like no other software for the same purpose out there. Unfortunately, being able to pass custom parameters when needed is not on the feature list.

Reply

[-]

No_Afternoon_4260@reddit

Sorry, I'm not trying to be nitpicky, but what are these conveniences?

Reply

[-]

Cool-Chemical-5629@reddit

I'm not much of a "command line parameters" lover, so naturally perhaps the most important feature for me is the GUI that doesn't look like a high school project done in someone's free time over a weekend. It looks like a premium app GUI that actually makes sense even to beginners. It allows me to load models without knowing anything about the specific command line parameters required by llama.cpp. Like I said though, the downside is that you don't get to see or modify the parameters they pass to llama.cpp beyond what the GUI itself lets you modify in the model loading window, so any time someone comes up with a "quick fix" such as "add this parameter to improve performance", that's something I cannot use there.

Reply

[-]

No_Afternoon_4260@reddit

I understand. If you ever get interested in advanced configuration, don't hesitate to ask. It's pretty simple really, and if the default llama.cpp UI isn't to your taste, it's pretty easy to find others. (IMHO the default llama.cpp UI is enough and reliable, but I don't know your use case.)

Reply

[-]

AdInternational5848@reddit

I’m in the transition phase of converting to llama.cpp for more control over parameters under the hood. I’ve worked with closed source AI tools to build my own UI and they've given me guidance on migrating to llama.cpp, but is there anything specific you think I should know?

Reply

[-]

jacek2023@reddit

People are irrational (all people, not just some subset). I discovered that people use Ollama because they can switch models on the fly. Llama.cpp introduced this functionality at some point, but I still don’t understand why it’s so important. Models sometimes take a few minutes to load; I can do that from the command line. I also need to manage models manually on my drives because they use a lot of space across multiple drives, so doing that automatically is pointless in my case. But for some people this is a crucial feature, and they can ignore performance issues as long as they can do things “easily” (whatever that means).

Reply

[-]

No_Afternoon_4260@reddit

I used to do a lot of model swapping, a really good way to "know your models". But there are 3 kinds of people: those who used ollama, those who used llama-swap, and those who built their own router/resource allocation. Where do you think I am, and where are the most irrational? 😅

Reply

[-]

jacek2023@reddit

how many models do you have on your computer?

Reply

[-]

No_Afternoon_4260@reddit

Depends. Right now I mix devstral (or gemma 12b it) locally as a router and for quick vision and stuff, deepseekocr (need to try the 2), asr (Nvidia nemo for streaming and vibevoice asr for offline with speaker diarization), and tts (soprano). Over the years this feels less and less like a prototype, it's amazing! For the big boy I used to run mistral large locally, but now I'm mostly using openrouter honestly, so I don't have that need for model swapping anymore. What about you?

Reply

[-]

jacek2023@reddit

`jacek@AI-SuperComputer:/mnt$ find -name "*.gguf"|wc -l`

`258`

Reply

[-]

No_Afternoon_4260@reddit

Lol i know most of these are parts x) What are you running these days?

Reply

[-]

jacek2023@reddit

opencode with GLM Flash, but I am trying other models with opencode too. My point is that I use various models and I don't need this swap feature at all.

Reply

[-]

kaisurniwurer@reddit

Not picking a fight. But what is that convenience that makes you prefer it? Is it chat GUI preference/features?

Reply

[-]

Cool-Chemical-5629@reddit

[https://www.reddit.com/r/LocalLLaMA/comments/1qnwa33/comment/o1zgp2e/?utm_source=share&utm_medium=web3x&utm_name=web3xcss&utm_term=1&utm_content=share_button](https://www.reddit.com/r/LocalLLaMA/comments/1qnwa33/comment/o1zgp2e/?utm_source=share&utm_medium=web3x&utm_name=web3xcss&utm_term=1&utm_content=share_button)

Reply

[-]

pmttyji@reddit

Is this flag beneficial for laptop GPUs? Ex: my laptop has 4060 8GB.

Reply

[-]

Mean-Sprinkles3157@reddit

Thanks, I tested on DGX Spark. It went from 20 t/s to 40 t/s, which is very close to oss-120b.

Reply

[-]

FluoroquinolonesKill@reddit

Doesn't make a difference on my 5060 rig and the latest llama.cpp build.

Reply

[-]

simracerman@reddit

Something is off. On my 5070 Ti 16 GB, before the patch from 2 days ago I did 27 t/s at 16k. Now it’s doing 58 t/s at 16k. How come your Pro 6000 was doing 17 t/s? Maybe you need to let llama.cpp do the fitting and assign the right parameters.

Reply

[-]

SheepherderBeef8956@reddit

What options are you using? I have the same GPU but I'm only getting 20-30t/s

Reply

[-]

simracerman@reddit

Here: `llamasvr -m {mpath}\GLM-4.7-Flash-MXFP4_MOE.gguf --no-mmap -c 32000 --temp 1.0 --top-p 0.95 --min-p 0.01`

It started doing 68 t/s, and at 8k it was doing 60 still. Didn’t go on this all the way to 16k, but usually after 8k it stabilizes at high 50s. Lowest I saw it was 55 with over 28k. https://imgur.com/a/EzfRCxw

If it helps, the rest of my hardware is Ryzen AI HX370 with 64GB LPDDR5X at 8000 MT/s. However, the 5070 Ti is hooked to an eGPU via Oculink, so I am bandwidth constrained to a max of 64 Gb/s. If the GPU was internal and on PCIe 5.0 x16, I’d be doing much faster speeds since the model already spills into system memory.
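To put the bandwidth gap in numbers, here is a rough Python calculation. It assumes the Oculink connection is a PCIe 4.0 x4 link (which matches the 64 Gb/s raw figure above) and uses the standard per-lane rates of 16 GT/s for PCIe 4.0 and 32 GT/s for PCIe 5.0 with 128b/130b encoding. Real-world throughput will be lower because of protocol overhead.

```python
def pcie_gbytes_per_s(gt_per_s_per_lane: float, lanes: int) -> float:
    """Approximate usable PCIe bandwidth in GB/s for one direction.

    Applies 128b/130b line encoding (PCIe 3.0 and later) and converts bits to bytes.
    Ignores packet/protocol overhead, so treat the result as an upper bound.
    """
    raw_gbits = gt_per_s_per_lane * lanes      # raw link rate in Gb/s
    usable_gbits = raw_gbits * (128 / 130)     # 128b/130b encoding efficiency
    return usable_gbits / 8                    # bits -> bytes


# Assumption: Oculink here is PCIe 4.0 x4 (16 GT/s per lane, 4 lanes = 64 Gb/s raw).
oculink = pcie_gbytes_per_s(16, 4)
# Internal slot from the comment: PCIe 5.0 x16 (32 GT/s per lane, 16 lanes).
internal = pcie_gbytes_per_s(32, 16)

print(f"Oculink (PCIe 4.0 x4):  {oculink:.1f} GB/s")
print(f"PCIe 5.0 x16:           {internal:.1f} GB/s")
print(f"Ratio:                  {internal / oculink:.0f}x")
```

Under those assumptions the external link tops out at roughly 7.9 GB/s against roughly 63 GB/s for an internal PCIe 5.0 x16 slot, an 8x difference, which matters whenever part of the model sits in system memory.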

Reply

[-]

Maximum@reddit

4.7 Flash is AGI, we just haven't found the right params yet.

Reply

[-]

markole@reddit

I don't see any difference on my AMD card. But I do actually use CPU+RAM inference to fit the whole q8 (offloading most of the layers to GPU), so that might be why. NGL, this is the single best model of this size that I've had the pleasure of using locally.

Reply

[-]

17hoehbr@reddit

Is there a way to do this in LM Studio?

Reply

[-]

qwen_next_gguf_when@reddit

4090 + q4 = 124 tok/s. What quant are you running?

Reply

[-]

TokenRingAI@reddit (OP)

FP16

Reply

[-]

ClimateBoss@reddit

can u post ur full llama-server --what ?

Reply

[-]

StardockEngineer@reddit

Mine was already faster than that without the flag?

Reply

[-]

SectionCrazy5107@reddit

Is an Ampere or newer GPU mandatory to use -kvu?

Reply

[-]

TokenRingAI@reddit (OP)

I think it should be available on any architecture

Reply

[-]

ClimateBoss@reddit

no difference on pascal 2 P40, still 14tk/s TG

Reply

[-]

lmpdev@reddit

I was running GLM-4.7-Flash-UD-Q8_K_XL with these params on RTX 6000, and it started off at 130 tok/s and went down to 109 tok/s by 8000 tokens: `--ctx-size 64000 --no-warmup -n 48000`. Added -kvu, and the only thing that changed is that it now goes down to 115 tok/s by 8000 tokens. Which is an improvement, I suppose, but something is different in our setups.

Reply

[-]

teachersecret@reddit

I think the KVU option is automatic if you have llama.cpp set up normally for flash 4.7. Least it is on my install. I think this fix happened a day or two back and definitely improved speed.

Reply

[-]

TokenRingAI@reddit (OP)

I am running the latest git release, and it definitely wasn't enabled automatically.

Reply

[-]

teachersecret@reddit

Mine is: (enabled automatic if number of slots is auto) But I have slots on auto.

Reply

[-]

TokenRingAI@reddit (OP)

On the RTX 6000, I have the slots set explicitly, since I have enough context for a certain number of users, and there was no indication that setting the slots would drop the performance to 1/6 of normal.
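A small Python sketch of the takeaway, based only on the help text quoted earlier in the thread ("default: enabled if number of slots is auto"): if you pin the slot count yourself, pass `-kvu` yourself too. The helper name and the model filename are made up for illustration; `-m`, `-c`, `-np`, and `-kvu` are the llama-server flags discussed here.

```python
from typing import List, Optional


def llama_server_args(model_path: str, ctx_size: int,
                      slots: Optional[int] = None) -> List[str]:
    """Build a llama-server command line (hypothetical helper for illustration).

    When the slot count is left on auto (slots=None), the unified KV buffer is
    enabled by default according to the help text, so nothing extra is added.
    When slots are set explicitly, -kvu is appended so the unified KV buffer
    is still requested.
    """
    args = ["llama-server", "-m", model_path, "-c", str(ctx_size)]
    if slots is not None:
        # Explicit slot count: unified KV is no longer the default, so ask for it.
        args += ["-np", str(slots), "-kvu"]
    return args


# Example with a placeholder model filename and 4 explicit slots.
print(" ".join(llama_server_args("GLM-4.7-Flash.gguf", 64000, slots=4)))
```

Whether `-kvu` helps on a given setup still seems to vary, judging by the mixed reports in this thread, so it is worth benchmarking with and without it.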

Reply

[-]

teachersecret@reddit

That makes sense!

Reply

[-]

fractal_engineer@reddit

how are you coding with it?

Reply

[-]

TokenRingAI@reddit (OP)

I use my own app, Tokenring Coder, for agentic work, or I use Cherry Studio or the JetBrains AI Assistant for interactive code or other assistance.

Reply

[-]

ethereal_intellect@reddit

Now that's what I'm hoping for - tho idk why even in openrouter they're running in 28tps? I definitely expected more like your 100 from an a3b model for sure

Reply

[-]

Friendly-Pause3521@reddit

Holy crap that's a massive jump, gonna have to try this on my 4090 tonight. The Zelda game is actually pretty solid too, thanks for sharing the flag

Reply

72 Comments