It looks like there are no plans for smaller GLM models
Posted by jacek2023@reddit | LocalLLaMA | 112 comments
but my Air discussion is still open... ;)
Big_Mix_4044@reddit
The problem with smaller models right now is that they have to be better than Qwen to make sense for big labs to release. It's a high hurdle to clear.
10minOfNamingMyAcc@reddit
Better than gemma 4 for rp. That model is a hidden gem, much better than qwen 3.
Kodix@reddit
Gemma 4 is amazingly fast, can hold huge contexts easily, and is very smart. The only issues I have with it at this time are its imperfect tool usage and occasional looping. If those were fixed, it would be ludicrous.
And I agree, it's excellent for RP too.
jacek2023@reddit (OP)
LocalLLaMA should be very happy with Qwen 3.5 and Gemma 4, but some people are focused on GLM 5 or DeepSeek 4 for some reason
Kryohi@reddit
Imho what is currently missing are models in the 50B-100B range (dense), or MoE models with more than 20B active parameters but fewer than 120B total parameters, that are clearly better than what you can get with, say, Gemma 4 31B.
Big_Mix_4044@reddit
IMO even Google didn't quite pull off this release. Gemma is better in some places, but it's still not better overall, which is wild, especially considering how much later it was released.
Yu2sama@reddit
I don't think you should expect an all-rounder win in LLMs. They do so many different things in different ways that, even if Qwen 3.5 doesn't do what you like perfectly, maybe it does some tasks better than Gemma for other people. They are trained differently and have different biases, which makes those differences not flaws but strengths and variety.
The good thing is that they are open and free to use, if you had to marry a single model that would suck.
Big_Mix_4044@reddit
I'm not talking about the greater good or use cases here, although I'm happy to have access to all these models. What I'm saying is that labs release models to participate in a race, and if you're a big player, you won't bring out anything mid-tier
toothpastespiders@reddit
What really drove that point home for me was the llama.cpp issues with it. Even with a somewhat buggy implementation, there were people getting great results, just because some use cases hit the bugs while others don't. That's kind of wild to me, that the same tool can be used for such drastically different scenarios.
Top-Rub-4670@reddit
Gemma 4 was released ONE MONTH after Qwen 3.5...
Ok_Warning2146@reddit
one month is like one year here. ;)
jacek2023@reddit (OP)
I use Gemma 4 MoE with Codex and I am very happy with its performance. We have two different communities here: one is trying to use local models, even small models, for some chat, and the other is reading benchmarks and leaderboards. I don't care about benchmarks, I am just trying to use stuff.
Deep90@reddit
Imo the 2 groups are actually
"Things that make money." and "Things that don't."
ResearchCrafty1804@reddit
Since you’re using Codex, I’m curious how you would rank your experience with Gemma 4 in Codex compared to GPT models. Do you think Gemma-4 is around GPT-5.2 level, or noticeably worse?
Also, I’m guessing you rate Gemma-4 above Qwen-3.5, but I’m not totally sold on that. In my experience, Gemma-4 tends to be stronger on frontend tasks, while Qwen-3.5 feels more reliable for logic-heavy/backend work.
jacek2023@reddit (OP)
This is the first time I've used Codex with a local model, so I'll have to compare it with 31B and the Qwens later. Probably a good workflow is to start a project with a cloud model (just to init the stuff which requires deeper knowledge), then continue with a local model. If you read the Claude Code / Codex subs you will notice lots of crying about small limits. Looks like people on r/LocalLLaMA have a solution for this problem :)
Caffdy@reddit
have you tried Qwen Coder 3 or Qwen3.5 122B with codex?
jacek2023@reddit (OP)
Please reread :)
Caffdy@reddit
CTRL + F -> "Qwen Coder" or "122B" on the whole post/thread only hits on my own comment. I don't know what you want me to reread, because you never mentioned either of the models I asked about.
jacek2023@reddit (OP)
"This is the first time I use codex with local model"
Caffdy@reddit
Well, you mentioned that you're gonna compare Codex performance with the Qwen models later. Would totally recommend trying Qwen Coder 80B and seeing how it goes.
jacek2023@reddit (OP)
you can now look at this https://www.reddit.com/r/LocalLLaMA/comments/1sisthb/codex_with_gemma_4_26b_a4b_q8_0/
Big_Mix_4044@reddit
I'm not talking about benchmarks either.
jacek2023@reddit (OP)
It works for me, and there are still more PRs related to Gemma 4 coming. GLM Flash was also working great; it's just that Qwen 3.5 / Gemma 4 are better.
AltruisticList6000@reddit
Idk, for me it is mixed. I can only use the 26B one and it is heavily censored for me, to the point of being similar to GPT-OSS. It seems to have a very in-depth native knowledge base too, but its sloppy "if not x then y" default style and inability to focus on the specifics of some prompts make it not really performant. I'd still be happy with 20-24B dense models from anyone, ideal for 16-32GB VRAM.
Borkato@reddit
Because GLM is better at RP.
jacek2023@reddit (OP)
And we have GLM Air, but GLM doesn't want to create a new Air
Borkato@reddit
Air isn’t runnable by many. Flash is what most are asking for.
jacek2023@reddit (OP)
But some of us are committed to local AI
specify_@reddit
For something like an LLM harness, using different models is great because you can use each model's strengths for specific/specialized tasks. For example, I found that Gemma 4 26B-A4B makes better-looking UIs than Qwen 3.5 35B-A3B. So you could have something like Qwen 3.5 27B as an orchestrator, Gemma 4 26B-A4B as a UI/UX designer, and Qwen 3.5 35B-A3B for other tasks.
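Something like this minimal sketch, with each role behind its own OpenAI-compatible local server (the ports and model names are placeholders I'm assuming, not real defaults):

```python
# A minimal sketch of the multi-model harness idea: route each task
# type to a different local OpenAI-compatible server (llama.cpp,
# vLLM, etc.). Ports and model names are illustrative assumptions.
from openai import OpenAI

BACKENDS = {
    "orchestrator": ("http://localhost:8001/v1", "qwen3.5-27b"),
    "ui":           ("http://localhost:8002/v1", "gemma-4-26b-a4b"),
    "general":      ("http://localhost:8003/v1", "qwen3.5-35b-a3b"),
}

def ask(role: str, prompt: str) -> str:
    """Send a prompt to the backend registered for this task type."""
    base_url, model = BACKENDS[role]
    client = OpenAI(base_url=base_url, api_key="none")
    resp = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
    )
    return resp.choices[0].message.content

print(ask("ui", "Sketch a settings page layout for a notes app."))
```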
layer4down@reddit
For coding on Apple Silicon, I think local Qwen3.5-27b-bf16 + oMLX is my replacement stack for GLM5 (cloud). I know that’s a huge statement and the former can’t compete with the latter on performance but I’m seeing 180-250tps prefill and 9.5tps decode sustained at ctx 131072. And quite frankly, 27b has been able to solve just as many coding and app/system configuration problems as GLM5, sometimes even a little better.
Another growing frustration among users is the feeling that inference providers are sometimes nerfing these models to save on inference costs. I never experienced that with Z.AI pre-IPO; it was completely rock solid for months, but as of January 2026 it just feels different. That doesn't happen with my local model. I literally run it for hours and hours over days and days. Same performance and intelligence.
I got the $360/1yr z.ai promotional deal back in October 2025 and have been on the hunt for a GLM replacement before the annual renewal ever since. It's $720/yr thereafter, which is honestly still a reasonable price, but I have an M2 Mac Studio Ultra 192GB at home and have been hoping the tech will improve enough to wean myself off the cloud models.
I think the tech has improved enough, and six months from now my 2-yr-old machine will be serving the models I always hoped it could. And if the vLLM team can figure out how to truly unlock paged attention on Apple Silicon for even just 2-4x decode gains, it's game, set, match for local.
DaniDubin@reddit
I'm just curious, did you try Qwen3.5-122B or Qwen3-Next-Coder 80B for your use cases? They should have more or less similar performance for coding/agentic stuff, but much much faster than dense Qwen3.5-27B.
layer4down@reddit
I love MoEs when I want something quick without depth, and the models you mention are great for that. But in my experience, 80B-Coder (A5B) isn't quite as smart as 27B and still suffers from over-thinking (which I no longer see in 3.5), and the 122B model seems to have vaster knowledge than 27B but isn't necessarily smarter at coding, which is all I need it for.
Occasionally the 27B@bf16 just isn't cutting it and I reach for Qwen3.5-397B-A17B-2.5bit, but at more than twice the VRAM (~54GB vs ~125GB) I only whip it out when I'm not heavily multitasking with other important work. I'm just more comfortable letting the 27B run in the background and chew on coding work for hours without worrying about OOM issues while I'm doing other work.
ayu-ya@reddit
I know some people focused more on the RP/storytelling use cases who like Qwen and Gemma and have either already finetuned them (Qwen) or are working on it / plan to (Gemma). I played around with the Qwen 27B finetunes and base Gemma and they're fun, plus I can somewhat run them on my own. I didn't like any GLM after 4.6 for anything fiction-writing related. I might be interested in some new DeepSeek, but I'll always be happier about the smaller models because I don't have to rely on an API service to try them, and they get finetuned. I won't run anything GLM5/DS-sized even if I get the hardware I'm saving for at the moment.
Randomdotmath@reddit
Why not? Qwen and Gemma are good for limited hardware, but still far from SOTA, and GLM delivers that.
BothYou243@reddit
Why do you think we should just be satisfied with what we have? Helping out and giving the labs feedback would make the smaller-variant scene more active, and innovation would speed up.
That's what growth looks like.
ProfessionalSpend589@reddit
I suspect people run more than one model at a time.
I run 2 - one small on a GPU and one big on a (sketchy) cluster.
anubhav_200@reddit
For my use case (browser usage), 4.7 Flash performs better than 3.5 31B and 27B. The UI it generates is always much better in quality than 31B's as well. I hope they release a smaller one in the future.
Several-Tax31@reddit
How are you doing browser usage? (MCP, web fetch with opencode, anything else?) I cannot make the Qwens do web fetches in qwen code; they can't reliably search the internet. GLM was much better in opencode with web fetch, but it's slower. And it's still not browser use, I think. I wonder if there's something I'm missing. What is your method?
anubhav_200@reddit
MCP. Results are good, but it still struggles with very complex websites.
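If it helps, here's a minimal sketch of the kind of MCP tool server I mean, using the Python MCP SDK's FastMCP helper. The server name, tool, and crude truncation are just illustrative assumptions; you'd register it with your client (opencode, qwen code, etc.) as a stdio server:

```python
# A minimal sketch of an MCP server exposing a web-fetch tool, built on
# the official Python MCP SDK (pip install mcp). The tool name and the
# truncation limit are illustrative assumptions, not a real product.
import urllib.request

from mcp.server.fastmcp import FastMCP

mcp = FastMCP("browser-tools")

@mcp.tool()
def fetch_page(url: str, max_chars: int = 4000) -> str:
    """Fetch a URL and return up to max_chars of its raw body."""
    with urllib.request.urlopen(url, timeout=30) as resp:
        body = resp.read().decode("utf-8", errors="replace")
    return body[:max_chars]

if __name__ == "__main__":
    mcp.run()  # stdio transport by default
```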
Several-Tax31@reddit
Yeah, thanks, will try. It would at least be cool if it handles simple web tasks.
KURD_1_STAN@reddit
GLM 4.7 Flash is quite comparable to Qwen3.5 35B A3B for me tbh, although it seems to break a lot.
jacek2023@reddit (OP)
it's slower with long context; that was my main problem with it
KURD_1_STAN@reddit
Yeah def slower in general as well
FullOf_Bad_Ideas@reddit
That's to be expected, given that they've IPOed and don't need open-weight goodwill anymore.
Local GLM will be out of reach for 95% of people here.
Minimax IPOed too, expect the same thing.
Qwen might be dead.
Meta is closed now.
Google didn't release the 120B model they had on hand; they'll be feeding us only breadcrumbs.
StepFun is doing some stuff and should keep doing it as long as they'll have compute and funding.
InclusionAI should put out models too but it's the same Alibaba that killed Qwen, so they might quietly get cut too.
OpenAI no longer gets shat on for being closed, thanks to their GPT-OSS release; they'll keep making their models more and more closed now.
Focus is on the holy grail of automated coding right now. And getting to profitability.
I think we'll see fewer good open 30-150B models in the next 12 months than we've seen in the last 12.
And there's a non-zero chance that big Chinese models will go closed once again. We need another DeepSeek model to make Zhipu and Minimax look uncompetitive, then they'd probably go open once again. If that won't happen, it'll be slowly getting worse.
TheRealMasonMac@reddit
ZAI doubled the price for their coding plans yesterday as well; they're almost as expensive as Claude now.
Ok_Warning2146@reddit
Really? Then they will only have the Chinese market.
TheRealMasonMac@reddit
I guess that's the play? Idk, maybe they got some investors who really want returns ASAP. GLM is good but not that good.
Ok_Warning2146@reddit
But then how do they compete with other Chinese players in the Chinese market with their higher price?
TheRealMasonMac@reddit
Don’t ask me. ¯\_(ツ)_/¯
--Spaci--@reddit
Eventually they will make another Flash/Air model. There's no point in asking or pestering them; it won't speed it up.
Borkato@reddit
I don't think this is true. I genuinely think that if enough people bother them about it, they will push for it. Consider the Sonic movie: fan backlash got the character redesigned.
--Spaci--@reddit
The people you're bothering aren't the people training the models. It's like complaining to a Walmart cashier about the store's prices.
jacek2023@reddit (OP)
I call you the "let them cook" community
insanemal@reddit
Yeah let them cook.
What's wrong with that?
toothpastespiders@reddit
The fact that there's no promise or guarantee that it'll ever happen. It's just wishful thinking. You're walking by a sealed kitchen with no windows while thinking about pizza and just assuming there's some people in there making pizza. They might be, or it could be empty, or they could have removed pizza from the menu entirely.
I mean, look at how they handled 4.6 Air. And that was one they did actually say they'd release. Not saying they or any other company has any obligation to give us free stuff. Just that it doesn't make much sense to assume they will continue doing so either, especially when it comes to very specific iterations to match what we personally want.
insanemal@reddit
"They might never give me what I want!"
I think you're missing my point.
They don't give a flying fuck what you want. They will make whatever they want. They are giving away literally millions of dollars in work.
I'm not talking about assuming what they will and won't do. I'm talking about shutting the fuck up with demands.
It's fine to speculate or wish in forums, it's not ok to rock up at their official GitHub/HuggingFace/etc and be all like "Hey make this thing I want"
Shut up and let them cook.
Admirable-Star7088@reddit
He wants the chicken raw!
Ok_Warning2146@reddit
Well, the big models being released are too costly to self-host, so people just pay Zhipu for API access. Small models are easy to host, so Zhipu makes no money on them. Same logic for why others don't release any small models at all.
Awkward-Candle-4977@reddit
Because only SOTA models make news headlines and attract investors
anubhav_200@reddit
:(
Yu2sama@reddit
:(
Borkato@reddit
:(
Abubakar_Minhas_7@reddit
They are very low at this point; every model and space responds with "not available" or a runtime error whenever we use them, so it's kind of upsetting.
a_beautiful_rhind@reddit
Air size makes sense. Models that small, I dunno. They are obviously compute starved and keep raising their prices.
jacek2023@reddit (OP)
It's obvious that the current situation with AI can't continue forever. Claude Code users are crying, Codex users are crying, and Chinese cloud users are crying. AI companies are losing money, and people are addicted to the cloud, so the prices must go up and the limits must go down. Basic economics.
Local AI stays stable because it's local.
temperature_5@reddit
Many AI data center builds are being cancelled or delayed because there is not enough power, not enough transformers manufactured, not enough metals and rare earths... Data centers are the new paper clips, and we have not yet optimized for making them.
a_beautiful_rhind@reddit
Exactly.. even if it all goes away I can still run whatever. Subscription costs are starting to be more than my electricity costs.
Few_Painter_5588@reddit
GLM-5 is a huge jump over GLM-4, and unlike DeepSeek, they train their models in FP16. Between that and inference, they probably just don't have the compute to spend on smaller models.
jacek2023@reddit (OP)
But GLM-4.5-Air is awesome and GLM-4.7-Flash is also great. And I can't say the same about GLM-5, because I am unable to run it.
Few_Painter_5588@reddit
Those areas are also the most competitive from a business standpoint. And to be frank, the 120B area is still dominated by GPT-OSS; no other model at a comparable parameter count has really come close.
spaceman3000@reddit
Yeah, right. It can't write even one proper sentence in my language without making a mistake, while 26B Gemma-4 can write tens of pages with marginal errors.
AXYZE8@reddit
I don't think that's quite the right argument here: Gemma is SOTA in multilinguality. In Polish, only DeepSeek V3 (671B) is competitive, and I would say Kimi K2.5 (1T) is slightly below that.
It's not that GPT-OSS 120B is bad at multilinguality for that param range; it's just that Gemma beats the crap out of everyone in that aspect, and it doesn't mean that Qwen/Kimi/DS etc. are useless.
spaceman3000@reddit
Qwen is very good. I was referring to GPT-OSS. It writes Polish the way Duda speaks English.
AXYZE8@reddit
A simple test on diminutive forms of one of the most popular names, Qwen 122B vs GPT-OSS 120B:
I assume you speak Polish, so you can see Qwen is wrong pretty much top to bottom. For those who don't: the first one from Qwen is "Baba", which translates to old hag/old woman. Imagine having your email rewritten like that xD
IMO Gemma and DeepSeek are the only "very good" models; GPT-OSS is meh/passable, Qwen is bottom tier.
Few_Painter_5588@reddit
Gemma's vocab size is larger than most LLMs', which is part of why they use more memory and are slower. That's also why their multilingual performance is better.
spaceman3000@reddit
Which is perfect for my use. I don't code but I write a lot
jacek2023@reddit (OP)
Are you sure Nemotron, Mistral and Qwen are worse? Or are you comparing the speed of Q8 to the speed of Q4?
Few_Painter_5588@reddit
It's activated parameters. Somehow OpenAI pulled off a 120B model with 5B activated parameters; Nemotron and Qwen need like 12B and 10B active parameters to surpass GPT-OSS. It's like OpenAI captured lightning in a bottle with that model.
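Back-of-envelope, using the usual ~2 * N_active FLOPs-per-token approximation for a transformer forward pass (the active-parameter counts here are just the ones claimed above, not official figures):

```python
# Rough decode-cost comparison: per generated token, a transformer
# forward pass costs roughly 2 * N_active FLOPs, so MoE decode speed
# tracks active params, not total. Counts are the ones claimed above.
ACTIVE_PARAMS = {
    "GPT-OSS 120B": 5e9,
    "Qwen (cited)": 10e9,
    "Nemotron (cited)": 12e9,
}

for name, n_active in ACTIVE_PARAMS.items():
    print(f"{name}: ~{2 * n_active / 1e9:.0f} GFLOPs/token")
# GPT-OSS decodes each token for roughly half the compute of the others.
```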
AXYZE8@reddit
You're getting downvoted, but I totally agree even tho I don't like GPT-OSS that much for my own use.
GPT-OSS is token-efficient, param-efficient, and on top of that it's among the least bugged models (overthinking, looping).
Then you also have the low/medium/high reasoning-effort setting, so you can make it even more efficient, while in the majority of other models all you can do is turn off reasoning completely.
If someone asked which model is a workhorse that creates the least amount of problems in production and is the most efficient, then GPT-OSS 120B is the default choice.
I think many people believe that just because something is old, it's automatically bad and obsolete. DeepSeek V3 0324 is even older than GPT-OSS and many people will agree that it's still the #1 creative writing model.
jacek2023@reddit (OP)
For sure GPT-OSS was underrated initially.
Emotional-Baker-490@reddit
Qwen3.5 exists, though. It's A10B, but 75% of that is linear, not attention.
FaceDeer@reddit
This may not be a popular opinion here on /r/LocalLLaMA specifically, but I think this is a perfectly good niche to be focusing on.
I like running local models and I'm very happy to have high-quality models that I can run entirely on my own hardware, but I also think it's important for open weight models to be competitive in the hundreds-to-thousands of gigaparameters size class too - if I decide to use an API I want to have the option for small providers to be competing with OpenAI and Google. Or if I'm a medium-sized business that has heavy need for high-quality AI, teraparameter models might still count as "local" AI to my IT department.
Of course it'd be even more awesome if z.ai gave us all sizes of models and they were delivered directly to my house on the back of a magical pony. But there are other companies focusing on the smaller model sizes so I'm not going to ding them too hard for focusing on something else.
RipperFox@reddit
OMG, you could read that like: "We already HAVE smaller models, but sadly no release plans"
pmttyji@reddit
They (model creators who release only large models) are losing the majority audience. Don't know why.
Ryoonya@reddit
What do you mean by that? There are plenty of people who pay API prices for GLM. Who pays for an API for a small model? Almost nobody, because small models have such edge cases that you'd be better off using a script instead.
They are not missing the majority audience lol?
pmttyji@reddit
I meant local small/medium models. Like, GLM released only a large model for 5.1, so most of us can't use that model on our devices. Even Q1 is impossible.
Ryoonya@reddit
Yeah, but most people just use the API if they can't run the model. They are reaching a majority audience, and one that can bring in some revenue.
They don't stand to gain much by catering to this odd subsection of people who have some hardware, but not enough to keep up to date.
The majority of people, entry-level people, will most likely use a subscription/free API rather than even have the hardware to run it locally.
So the people with some hardware, but not hardware able to run SOTA models, aren't a majority.
Plus, everyone and their mom is making small models, because that is the thing people can do easily. Making a SOTA open-source model is more valuable.
pmttyji@reddit
I'm not against local large models, just expecting additional small/medium models, like how Qwen releases them every time.
Also, by 'majority audience' I meant local models only.
Agree with you on the rest of your response.
muyuu@reddit
Because we generally don't pay them in any way, nor do we provide a path for them to make a profit.
pmttyji@reddit
Many people do use online models while also having small models locally.
I'm sure people would buy LLM Burners (Taalas) with large models if they sold them. Like, if Kimi/GLM/Qwen/DeepSeek came out with affordable LLM Burners carrying their large models, I would buy them.
muyuu@reddit
for instance myself, I'm a local-first user
but that has no bearing on my expenditure
and their models being open-weights and small introduces no lock-in whatsoever
at most, it gives them exposure, but that only gets you so far, esp. after you've already done it repeatedly and have some sort of brand recognition - indeed, what gets you the most press is the strength of your very best model, so it makes sense to put most effort there for that reason too (obvs it will also make you more $$$ in "cloud" token usage)
i'm grateful for what they've published so far and hope they publish more, but I think there will be much more focus on larger models going forward, esp. since we're clearly hitting diminishing returns on what can be done with models fitting in <100GB; there will still be improvements, but they're no longer so dramatic
right now my 128GB Strix Halo machine can run decent quants but I hope in a year or two Medusa Halo machines with 512GB of unified memory are out there and they're not prohibitively expensive
jacek2023@reddit (OP)
Maybe small models were just bait to sell the cloud later. So now people who discuss cloud prices for GLM are "on topic" on r/LocalLLaMA.
But then we have Google with Gemma 4. Let's hope Qwen is also alive; we will see soon (Qwen 3.6).
spaceman3000@reddit
Yeah... where's Gemma 120B then? Why was MTP removed from the Gemma 4 weights they released? I wouldn't count on Google in the long run.
pmttyji@reddit
Looks like only Qwen satisfies a wide range of the audience. (Though InclusionAI releases models similarly, half of their models are diffusion, and there's also the lack of llama.cpp support.)
We can't expect much from them for now. Maybe that 120B model in a couple of months, that's it; new models probably only in 2027.
Yep, we'll be getting improved versions of the 3.5 models & a few new models, possibly within 1-2 months.
I expect a similar thing from both AMD and Intel. They should've started building & releasing original models from scratch last year already.
PromptInjection_@reddit
Really sad. We have also seen no new base models since GLM 4.5
Marcuss2@reddit
If you really want the style of GLM-5.1, you should be able to distill it into Qwen3.5
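A very rough sketch of what that could look like: sample the teacher through any OpenAI-compatible endpoint, dump prompt/response pairs as JSONL, then SFT Qwen3.5 on the result. The URL and teacher model name below are placeholders, not real defaults:

```python
# Sketch of building a style-distillation dataset: sample completions
# from a teacher model behind an OpenAI-compatible API and write
# prompt/response pairs as JSONL for fine-tuning a student on them.
import json

from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="none")

prompts = [
    "Write a short scene where two rivals are forced to cooperate.",
    "Describe a rainy city street in second person.",
]

with open("distill.jsonl", "w", encoding="utf-8") as f:
    for prompt in prompts:
        resp = client.chat.completions.create(
            model="glm-5.1",  # placeholder teacher name
            messages=[{"role": "user", "content": prompt}],
            temperature=0.8,
        )
        pair = {"messages": [
            {"role": "user", "content": prompt},
            {"role": "assistant", "content": resp.choices[0].message.content},
        ]}
        f.write(json.dumps(pair, ensure_ascii=False) + "\n")
```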
jacek2023@reddit (OP)
please share a link to your distillation
Marcuss2@reddit
I don't have one; I'm saying you can make one.
I mean, honestly, just use Qwen3.5 or Gemma 4.
ForsookComparison@reddit
That is okay
Zai should chase SOTA
I should chase cheap RAM deals
popiazaza@reddit
Not really a surprise, since they are pushing the boundary against SOTA models with limited compute resources.
Pattinathar@reddit
Disappointing but not surprising; most Chinese labs focus on the flagship model first. At least the community will probably make distilled/quantized versions pretty quickly.
In the meantime, Qwen2.5 Coder 32B and 7B are solid alternatives if you need smaller models for coding. And Gemma 4 just dropped smaller variants too.
The demand for 8B-16B models is real though; not everyone has 80GB of VRAM sitting around. 10 upvotes on that issue says it all.
jacek2023@reddit (OP)
please share a good recipe for pancakes with toilet paper
RedditUsr2@reddit
I'm still hoping that, between BitNet and other breakthroughs, CPU offloading becomes viable.
Yu2sama@reddit
I really liked GLM 9B at the time; I hope we see something like that again eventually. A 9B that writes better/differently than Qwen would be very much appreciated by me.
Skyline34rGt@reddit
No more smaller models from them, so it's the best time for me to unfollow their X (Twitter).
brown2green@reddit
Probably impossible to compete with Qwen 3.5 and now Gemma 4. Gemma 4 in particular, I think, has seen so much RL training that jaws will drop once the technical report comes out.
jacek2023@reddit (OP)
Qwen 2.5 was also a breakthrough. It's normal that models are progressing. Qwen 3.5 and Gemma 4 are leading now, but I believe more stuff is coming.
Monkey_1505@reddit
I think it's wise to make small models. Every open-weights model family that has truly taken off in popularity covers a range of sizes, with one version or another able to run on an average GPU.
jacek2023@reddit (OP)
I believe their problem is how to explain that to their bosses from a business perspective.
Eyelbee@reddit
It's better that they focus their energy on pushing out frontier ones. Everyone can distill them if they want.
jacek2023@reddit (OP)
You know it’s bullshit, but I can understand the rationalization.