Whatever happened to GLM 4.7 Flash hype?
Posted by Enragere@reddit | LocalLLaMA | View on Reddit | 22 comments
Are you guys still using it? How does it fare vs. Qwen 3.5 35B and 27B? And against Gemma 4 26B and 31B?
From what I've heard, Qwen 3 Coder Next 80B is still the go-to for many?
Agentic coding is the main use case.
qubridInc@reddit
GLM-4.7 Flash is still solid for agentic workflows, but Qwen 3.5 (especially the coder variants) has largely taken over for raw coding performance and reasoning, so most people have moved on unless they care about cost or tool-use stability.
Cool-Chemical-5629@reddit
For coding, GLM 4.7 Flash is still very capable and ambitious in visual design, but it lacks in logic. Gemma 4 feels like the opposite, so I'm going to use both to compensate for each other's weaknesses.
TheAsp@reddit
This is currently my daily driver: 4-bit AWQ plus 100k tokens of FP16 KV cache in 24GiB, and it works great with OpenCode and Hermes. My only complaint is that throughput drops off quickly as context grows.
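For anyone trying to budget a 24GiB card the same way, here's a back-of-envelope sketch. All model dimensions (layer count, KV heads, head dim, parameter count) are illustrative assumptions, not GLM 4.7 Flash's actual architecture:

```python
# Rough VRAM budget: 4-bit weights + FP16 KV cache.
# All dimensions below are made-up, plausible values for illustration only.

GIB = 1024 ** 3

def kv_cache_bytes(tokens, layers, kv_heads, head_dim, dtype_bytes=2):
    """KV cache size: one K and one V tensor per layer, per token."""
    return 2 * tokens * layers * kv_heads * head_dim * dtype_bytes

def weight_bytes(params, bits=4):
    """Quantized weight size, ignoring quantization overhead (scales, etc.)."""
    return params * bits // 8

kv = kv_cache_bytes(tokens=100_000, layers=32, kv_heads=8, head_dim=128)
weights = weight_bytes(params=15_000_000_000, bits=4)

print(f"KV cache: {kv / GIB:.1f} GiB")       # ~12.2 GiB at FP16
print(f"Weights:  {weights / GIB:.1f} GiB")  # ~7.0 GiB at 4-bit
print(f"Total:    {(kv + weights) / GIB:.1f} GiB")
```

With GQA-style dimensions like these, 100k tokens of FP16 KV cache plus 4-bit weights lands under 24GiB with a little headroom for activations; a dense model with more KV heads blows the budget much sooner.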
m31317015@reddit
I find the logic somewhat lacking as well, but one way I work around it is with an AGENTS.md, a TODO.md, and a PENDING.md. The model first puts its plan into PENDING.md, scans the repo, and validates the idea over and over until I think it's good enough; then the task is run and the results are summarized. Once in a while I tell it to update AGENTS.md as documentation for the project and guidelines on how to change it. TODO.md holds the todo list: I let the model expand the ideas, modify them manually if there's room for improvement, and then it handles the planning part in PENDING.md. I also have it cross-reference AGENTS.md and note any reusable parts or related sections the new idea could be grouped into.
It's definitely not a one-click-done solution, but with the docs in place GLM behaves quite well IMO.
ttkciar@reddit
I never liked GLM-4.7-Flash. It wasn't nearly as competent as GLM-4.5-Air, and ZAI introduced some weird new guardrail behaviors with GLM-4.7 which killed it for me.
Some people like Qwen models for codegen, but GLM-4.5-Air is still the best codegen model I've ever used, beating out Qwen3-Coder-Next, Qwen3.5-122B-A10B, GPT-OSS-120B, and Devstral 2 Large (123B).
In my experience, GLM-4.5-Air can introduce bugs, but its overall design is always sound, and its bugs are easily fixed. Qwen3.5-122B-A10B generated code with bizarre design flaws which were not easily fixed, and it would frequently ignore some instructions and/or altogether neglect to implement some of the features required.
Different people have different standards, but that makes GLM-4.5-Air the better codegen model, to me.
spaceman_@reddit
Have you tried any of the INTELLECT models built on top of GLM 4.5 Air?
ttkciar@reddit
I did not, no, though looking through models/, it looks like I downloaded INTELLECT-3 back in November, but never got around to evaluating it. Thanks for putting it back on my radar. I'll evaluate it after I'm done kicking the tires on Gemma 4.
DinoAmino@reddit
GPT-OSS 120B is much better at code gen than GLM-4.5 Air ... it's the best LLM for code gen under 200B (and better than some 200B+ LLMs)
Enragere@reddit (OP)
My hardware's too poor to be talking your language! 😅 Out of everything you mentioned I can only run Qwen 3 Coder Next at 4-bit, and barely, with 64GB unified memory.
Silver-Champion-4846@reddit
Bitnet, save our egos from dooming despair!
audioen@reddit
I haven't tried GLM-4.5-Air. The data from Artificial Analysis doesn't suggest it's even half as good as the 122B-A10B, which, at least for me, is the only model that has ever worked as a fully autonomous developer: I can just hand it a task, check the results, and it needs relatively little guidance.
I haven't observed its designs being unsound, though I have noticed that an LLM often copies an existing design as a base if you don't describe what the code is supposed to do. This is also likely strongly affected by the agent program you're using, since each has its own prompts. So everything matters here.
I've been testing with opencode-cli lately, whose prompt I think is pretty bad: it's hugely long and seems suited to yesteryear's models with poor instruction following. I absolutely disagree that this Qwen has trouble following instructions; in my experience it is almost painfully sensitive, which is one reason it spends so long pondering simple context-free requests: it's trying very hard to work out the best way to respond when context is lacking.
Bird476Shed@reddit
Agree, my first try is usually with GLM-4.5-Air; it's still a good speed-quality trade-off.
m31317015@reddit
As someone who ripped his own dual-3090 build apart into two separate builds, I can tell you GLM 4.7 Flash is extremely useful for coding if you only have a single 24GB VRAM card, which, without offloading, can't step up to Qwen 3.5 27B or Gemma 4 31B.
The option I thought would be compelling, Gemma 4 26B, instead requires extreme babysitting: it refuses to do multi-tool calls 99% of the time and is completely useless in opencode / claude code. It wasted three hours of my time before I gave up fiddling with it and fell back to GLM.
Enragere@reddit (OP)
I didn't get your point about a single 3090 with GLM 4.7 Flash vs. the dense Gemma 4 or Qwen 3.5.
AFAIK both dense models can be fully loaded into a 3090's VRAM with 4-bit quants?
m31317015@reddit
Q4_K_M? Yeah, but the context window quickly runs out. Coding-wise they're unusable, at least on ollama and llama.cpp, where I tested them with thinking enabled.
Silver-Champion-4846@reddit
What about Turboquant/rotorquant?
m31317015@reddit
It's... not implemented in official upstreams yet, thanks bot.
P.S. I'm also adding a 5090 this weekend, so yeah, IDK, maybe they are good; I won't know until I'm free from having only one 3090 in my server.
Silver-Champion-4846@reddit
That's a little insulting, my motors aren't even 1% rusty, you know! /j I was just trying to rile up your curiosity/hope, so you'd maybe wait for it to be implemented and get the extra power.
ilintar@reddit
Waiting for GLM 5.1 Flash ;)
NeedleworkerHairy837@reddit
If you already know what you want to do and just use GLM 4.7 Flash to type out the code, it's really, really, really great. Especially given my resource constraints (8GB VRAM).
HopePupal@reddit
it's okay. size-wise it's not very different from Qwen 3.5 27B. behavior-wise it seems to be slightly less prone to getting stuck in stupid loops than Qwen, or stopping before it's actually finished, but makes up for this by being more prone to change stuff i didn't tell it to change. perhaps i should give it another shot now that i have a real GPU.
it doesn't have a vision component (4.6 did, 4.7 doesn't), if that matters. Qwen does.
but if we're talking best open weight code model, my money's still on MiniMax M2.x. that's the one i break out when Qwen gets stuck on things like cryptic macro errors in Askama templates. i can barely run it on my hardware, but even so, it's oddly effective.
Prestigious-Use5483@reddit
The AI space moves quickly. It was a nice model when it came out, but plenty of models released since are more capable on similar hardware.