I just realised how good GLM 5 is
Posted by CrimsonShikabane@reddit | LocalLLaMA | View on Reddit | 148 comments
This is crazy. As a heavy Claude code user, who has used over 12 billion tokens in the last few months, and never tried local coding, I finally decided to try OpenCode with the Zen plan and GLM 5.
Initially tried Kimi K2.5 but it was not good at all.
Did a test to see how far 1-2 prompts could get me with GLM 5 versus the same prompt in Claude Code.
First task, a simple dashboard inventory tracker. About equal although Claude code with opus 4.6 came out ahead.
Then I ran a harder task. Real time chat application with web socket.
Much to my surprise, GLM comes out ahead. Claude code first shot doesn’t even have working streaming. Requires a page refresh to see messages.
GLM scores way higher on my criteria.
Wrote detailed feedback to Claude and GLM on what to fix.
GLM still comes out better after the changes.
Am I tripping here or what? GLM better than Claude code on any task is crazy.
Does anyone here have some difficult coding tasks that can showcase the real gap between these two models or is GLM 5 just that good.
JohnSnowHenry@reddit
For me, GLM is almost useless for Unreal Engine, but even Claude Sonnet handles everything I need nicely :)
maladiusdev@reddit
Are you using it for C++ or trying to get it to control the editor? I'm using Codex and it's great for C++ but it burns through quota too fast so I'm looking for a backup option.
Powerful_Froyo1727@reddit
I've built an adapter for Claude Code that uses the leaked code and integrates glm5 as the core brain — pretty creative, if I do say so myself! 😂
NewtMurky@reddit
If only there were a good GLM-5 provider with a coding plan…
brabdi@reddit
ollama cloud pro has been pretty good in my experience
AcidicAttorney@reddit
Alibaba Cloud's pretty good.
look@reddit
I’d not be shocked if Alibaba is running 2-bit quants on their coding plan models. Personally, I found it a complete waste of my $5.
r3mp3y3k@reddit
I'm currently subscribed to the Alibaba Coding Plan. I don't know if they're somehow tuning down the models, but I found GLM-5 is still better than any other model they provide in their coding plan.
look@reddit
Yeah, GLM-5 held up the best in my tests on the Alibaba plan. It didn’t go full psychotic like the other models did, but some head-to-head comparisons I did against other providers’ GLM-5s still showed significant degradation in its thinking output. It was like theirs was drunk or stoned, repeatedly losing its train of thought and eventually coming back to it.
estimated1@reddit
Just to give another option: we (Neuralwatt) just started offering our hosted inference. We've been focused more on an "energy pricing" model but feel pretty confident about the throughput of the models we're hosting. Our base subscription is $20 and we don't really have rate limits, just focused on energy consumption. I'd be happy to give some free credits in exchange for some feedback if there is interest. Please DM me! (https://portal.neuralwatt.com).
Also, we serve GLM-5 with solid throughput (IMO)
We also have a virtual endpoint (GLM-5-Fast) that turns off reasoning for fast agentic scenarios.
NewtMurky@reddit
Do I understand correctly that you don't support caching? At least, I don't see it mentioned on the pricing page.
estimated1@reddit
We do support caching; we need to make that more clear. With a cache hit the energy cost is 0, so you will just see much reduced energy cost of those requests. We expose the caching data in the result, but we don't put it in the UI on the dashboard -- I'll file a bug to do this and provide more info about our caching support.
GreenGreasyGreasels@reddit
This is interesting. How do you make money if users are paying only for electricity? What about hosting charges and the servers? How do you turn a profit and stay sustainable long term?
Is the lack of a discount on cached input etc. how you make your money?
The playground tests seem fast enough. Do you serve quantized or full-fat models (16-bit for GLM-5, 4-bit for K2.5, and so on)?
estimated1@reddit
We bake infra costs into pricing. The difference is: inference gets cheaper at scale (batching, higher GPU utilization → lower energy/request).
Instead of keeping that as margin, we pass it through. So over time you get more tokens per kWh.
That’s the core idea behind energy pricing. This is all built upon our core tech which provides increased energy efficiency for GPUs/inference. We will license that to other hyperscalers/neoclouds as well to make inference more energy efficient.
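The claim above (batching drives down energy per request) can be sketched with a toy model. The wattage figures below are illustrative placeholders, not Neuralwatt's actual numbers:

```python
# Hypothetical sketch: fixed (idle) GPU power is shared across a batch,
# so per-request energy falls as utilization rises.

def energy_per_request(batch_size,
                       idle_watts=100.0,
                       per_stream_watts=25.0,
                       seconds_per_request=2.0):
    """Joules consumed per request when batch_size requests share a GPU."""
    total_watts = idle_watts + per_stream_watts * batch_size
    return total_watts * seconds_per_request / batch_size

solo = energy_per_request(batch_size=1)   # 250 J per request
busy = energy_per_request(batch_size=8)   # 75 J per request
```

Under this (assumed) cost structure, an 8-way batch cuts per-request energy by more than 3x, which is the headroom an energy-priced plan can pass through as "more tokens per kWh."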
Superb_Onion8227@reddit
Why not do energy trading at the same time? You could save ppl's money by buying cheap energy
estimated1@reddit
oh sorry, for the other questions:
For GLM-5 it's FP8, and for K2.5 it's INT4. We don't do any of our own quantizations (yet).
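For readers unfamiliar with what 4-bit (INT4) quantization trades away, here is a toy illustration: weights get mapped onto 16 levels and reconstructed, sacrificing precision for memory. Real schemes (per-group scales, etc.) are more sophisticated; this is only the core idea:

```python
# Toy 4-bit quantization: snap floats to 16 evenly spaced levels.

def quantize_int4(weights):
    lo, hi = min(weights), max(weights)
    scale = (hi - lo) / 15 or 1.0          # 16 levels: 0..15
    levels = [round((w - lo) / scale) for w in weights]
    return [lo + q * scale for q in levels]

w = [0.01, -0.73, 0.42, 0.99, -0.15]
w4 = quantize_int4(w)
# Each reconstructed weight is within half a quantization step
# of the original.
max_err = max(abs(a - b) for a, b in zip(w, w4))
```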
TheMisterPirate@reddit
this is interesting to me. so it's similar to openrouter but billing is by energy usage rather than tokens? it's hard for me to understand how much usage I'd actually get for $20/mo or even if I pay per kwh. I think that would be a good thing to add to your website. Like, if I use this model for X hours, how much would it actually cost, both in kwh and $, since some models are more efficient right?
I'd be interested in trying it out if the rates are good.
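The calculator being asked for could look something like this. All numbers (throughput, wattage, electricity price) are made-up placeholders for illustration:

```python
# Hypothetical session-cost estimator: given a model's throughput and
# energy draw, estimate kWh and dollars for a coding session.

def session_cost(hours, tokens_per_second, watts, usd_per_kwh=0.30):
    seconds = hours * 3600
    kwh = watts * seconds / 3.6e6          # joules -> kWh
    return {
        "tokens": int(tokens_per_second * seconds),
        "kwh": round(kwh, 3),
        "usd": round(kwh * usd_per_kwh, 2),
    }

# A 2-hour session on a model serving 80 tok/s from a 700 W GPU slice:
est = session_cost(hours=2, tokens_per_second=80, watts=700)
```

This also makes the commenter's point concrete: a more efficient model draws fewer watts per token, so the same session costs less under energy pricing.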
estimated1@reddit
Thanks for the feedback u/TheMisterPirate . I agree having some sort of calculator would be a good thing to help people understand. I think our method *does* enable much more inference per $ than other methods but we have work to do to present this more clearly. I'd be happy to grant some credits if you created an account in exchange for more feedback (what we're really eager for at this stage).
TheMisterPirate@reddit
sure, I'll DM you.
rudkws@reddit
I have an Ollama subscription. I am using the cloud version - it's good enough. Quite often, it finds a better solution than Claude, which is interesting.
davernow@reddit
Can’t tell if sarcastic. Z.ai coder is the best $27 I spend a month. Can easily put a billion tokens through it
17hoehbr@reddit
I bought a year of the lite plan during the black Friday sale but ever since GLM 5 came out it feels like they really dumbed down GLM 4.7, and of course GLM 5 is paywalled behind the pro plan.
imonlysmarterthanyou@reddit
Their GLM pro isn’t that much better oddly, but much slower.
AldoEliacim@reddit
Try OpenCode Go, it's providing a decent amount of GLM-5 usage, I've been looping it around to write tests for my Opus code
harrro@reddit
I keep hearing Opencode Zen/Go's GLM is heavily quantized
AldoEliacim@reddit
Not really. I haven't tested another provider to compare it, but it does its job.
Maybe it is quantized, but their plan is really cheap at $10 and they have an offer right now for only $5
hawseepoo@reddit
I just use it on Fireworks AI, pretty cheap per 1M tokens
vramkickedin@reddit
GLM5 is stupid underrated. It hallucinates less than the top dogs (chatgpt, claude, etc). Very good at following small details. In some cases, it will tell you "no" and it will explain why (unless you have a logical use case or explanation).
Its very good at paraphrasing content, spotting SEO hiccups and of course, very tight with coding.
ProfessionalSpend589@reddit
I've had similar feelings for smaller models like MiniMax M2.5 in Q6 (unsloth) and Qwen 3 235b in similar quant. People prized MiniMax, but Qwen just worked for me (and was better for lyrics and songs).
lookwatchlistenplay@reddit
That's great, ProfessionalSpend.
twack3r@reddit
Please leave this sub.
lookwatchlistenplay@reddit
And your problem with the words, "That's great", is...
Emotional-Baker-490@reddit
Thats great, lookwatchlistenplay.
lebed2045@reddit
while it's cool if true, i think people should stop measuring the quality of ai coders by asking them to build template-like projects from scratch. it's much more realistic to ask them to fix some bugs in existing big code bases.
EffectiveCeilingFan@reddit
How in the world do you use 12B tokens?? In an entire year, I doubt I will reach 1B, and I use vibe coding daily.
In order to use 12B tokens in six months of work, you’d need to be using 771 tokens per second every single second of the day, including at night. There’s no way.
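The arithmetic above checks out as back-of-the-envelope math (assuming 30-day months):

```python
# Sanity check: 12B tokens over ~6 months of around-the-clock usage.
months = 6
seconds = months * 30 * 24 * 3600     # ~15.55M seconds
tokens = 12_000_000_000
rate = tokens / seconds               # tokens per second, nonstop
# ~771 tok/s every second including nights, which is why replies below
# point at cached input tokens and multi-agent setups as the explanation.
```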
temperature_5@reddit
This is why all the coding plans end up getting capped and rate limited. People just abuse the hell out of them, running multiple instances with multiple sub-agents simultaneously, or setting them up to just constantly poll github for issues, or even backdooring them for production inference.
BannedGoNext@reddit
If a coding plan offers X usage and you use X usage how is that abuse? The problem is that the plans are trying to utilize the phone system/modem internet provider type model where they expect most people to use very little and the heavy users to be subsidized, but the few people they can get to subscribe at all are voracious.
nakedspirax@reddit
And then the project gets sidelined haha
emprahsFury@reddit
Abuse? What? Inference time compute/scaling/buzzword was sold for years as the solution. These companies sold "tokens are cheap" for actual years. Anthropic & OpenAI only have themselves to blame. We only have them to blame, cause we're using their best advice.
Simple_Split5074@reddit
Most of that will be cached input tokens which can get to a million in a minute or two with tool calls and half filled context without even trying hard.
rosstafarien@reddit
Cached input tokens shouldn't be counted in your usage.
Simple_Split5074@reddit
According to whom? I think *all* providers do count them.
EffectiveCeilingFan@reddit
Eh. That’s still RAM that you’re taking up. I think the 1/10th cost that most providers do for cached input is fairly reasonable. What’s ridiculous are the providers that don’t offer any discount and just cache transparently, taking all the cost savings for themselves, or the ones that make you pay extra to use the cache (i.e. Anthropic).
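The pricing difference being debated can be made concrete with a small sketch. The $3/M input price and 1/10th cache discount are illustrative, not any specific provider's rates:

```python
# How a cached-input discount changes an agentic coding bill.

def bill_usd(input_tokens, cached_tokens,
             usd_per_m=3.0, cache_discount=0.1):
    fresh = input_tokens - cached_tokens
    return (fresh * usd_per_m
            + cached_tokens * usd_per_m * cache_discount) / 1e6

# 100M input tokens where 90% are cache hits (typical for long
# agentic sessions that resend the same context every turn):
with_discount = bill_usd(100_000_000, 90_000_000)   # $57
no_discount = bill_usd(100_000_000, 0)              # $300
```

Under these assumed rates, the cache discount is the difference between $57 and $300 for the same session, which is why transparent caching with no discount is so lucrative for a provider.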
lemondrops9@reddit
I've only done local so that seems crazy to me too. But many have said that they do.
ConSemaforos@reddit
I'm a hobbyist at home and can burn through 5 million easily within a day. I imagine someone doing it full time could blow through 12B.
EffectiveCeilingFan@reddit
Let’s say they use GLM-5 full-time from the moment it’s released right up until now. That’s 35 days. So, roughly 340M tokens per day. That’s about 70x what you’re burning through, every single day, with no breaks at all. OP has probably vibe-coded seven SaaS startups by now.
IrisColt@reddit
heh
ConSemaforos@reddit
Heck yeah! I love it. It's opened up so much. I've built 6 websites for local businesses and am working on two apps for local nonprofits. I used to spend an hour just reading docs trying to get Firebase set up. Now I can literally speak a command, and it's done for me.
mr_Owner@reddit
Ease with subagents and multi sessions
vinigrae@reddit
Agentic systems
florinandrei@reddit
Human-powered Ralph loop.
arman-d0e@reddit
I alone am responsible for 2 Billion tokens of usage on hunter alpha lol
klawisnotwashed@reddit
About a year ago i was hitting 2b tokens in CC coding my ass off 15+ hours a day with multiple REPLs open, about 99% of my tokens were input cached… no idea how bro is doing 12b
Accomplished-Bird829@reddit
i have used GLM5 from day 1. i used it to code and do all sorts of stuff, and it's excellent. thanks to z.ai i have more time to enjoy my small tools
qubridInc@reddit
GLM-5 is genuinely strong, especially for structured coding + execution tasks. It can sometimes outperform Claude on specific implementations.
But on complex systems, edge cases, and long-term reasoning, Claude still tends to be more consistent
Exciting_Garden2535@reddit
I have a feeling that Opus 4.6 has become stupider than it was initially. Or maybe not exactly stupider, but more lazy. It skips requirements, does more careless work, and even argues: when I asked it to fix its own error, it spent time proving that the error was made in a previous session, not during this feature implementation.
wouldacouldashoulda@reddit
I wonder if it might be related to the larger context window? Like it cant deal with it.
m0j0m0j@reddit
This happens to me when the context is short
salomo926@reddit
Like a real programmer
anon377362@reddit
There was a Reddit post saying that the Opus 4.6 system prompt was updated to tell it to not think things through as much or something like that (keep answers brief).
So I don’t think it was a change to the model itself, just the prompt.
Personally I think if the system prompt is updated then it should count as a new minor/patch version bump as it can really affect the performance.
So Opus:
4.6.0 -> great
4.6.1 -> not as great.
Easy-Unit2087@reddit
I thought it was just me. Maybe (just guessing here) due to capacity problems they nerfed Opus 4.6 a bit. Codex 5.4 is doing better rn.
Fast-Satisfaction482@reddit
Since yesterday, 5.4 started annoying me with not fulfilling the requirements and then "next I could do this or that to actually make it work. Shall I?". Super annoying stuff.
mrtie007@reddit
i find i have to interrupt its thinking more lately because it's inventing/going on silly sidequests
Goldkoron@reddit
I wouldn't put it past Anthropic to intentionally nerf opus during periods where they think they are being distill attacked.
NoahFect@reddit
They are definitely having capacity issues at the moment. I'm getting a lot of "Overloaded" and unspecified "Internal server errors" in Claude CLI right now.
It also seems less... relentless than usual, somehow. I've had to prompt it to go back and finish subtasks it gave up on.
-dysangel-@reddit
Yeah that was my experience when I tried it again the other day. It kept asking "want me to do this?" instead of just doing things like GLM does. It makes sense that if they're having capacity issues that they'd train it to stop more often
aeroumbria@reddit
I'm not familiar with large model deployments, but I wonder if you can do something like increasing KV cache match tolerance to reduce server load, so not-so-matched prefixes will end up using the same cache? I can see how such a saving measure could lead to strange, unpredictable behaviours.
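The speculation above can be sketched. Production servers (vLLM and the like) match cached prefixes by exact block hash; the snippet below shows what a hypothetical "tolerant" prefix match would do, and why reusing state from a slightly different prompt could produce the strange behaviour the thread describes:

```python
# Speculative sketch: a prefix cache that accepts a "close enough"
# prefix instead of an exact match. Not how real servers work today.

def find_cache(prompt, cache, tolerance=0):
    """Return (prefix, state) for the longest cached prefix whose
    character mismatches against the prompt are within tolerance."""
    best = None
    for prefix, state in cache.items():
        if len(prefix) > len(prompt):
            continue
        mismatches = sum(a != b for a, b in zip(prompt, prefix))
        if mismatches <= tolerance:
            if best is None or len(prefix) > len(best[0]):
                best = (prefix, state)
    return best

cache = {"You are a helpful assistant.": "kv-state-1"}
# Exact prefix: hits even with zero tolerance.
exact = find_cache("You are a helpful assistant. Hi!", cache, tolerance=0)
# One-character difference: only hits if tolerance is loosened,
# at which point the model silently continues from the wrong state.
loose = find_cache("You are a helpXul assistant. Hi!", cache, tolerance=2)
```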
Easy-Unit2087@reddit
My flow has audits from Codex 5.4, Opus 4.6 and Qwen 397b (local) on each other's work after each phase and Codex is finding more things to fix in Opus than usual. Also noticed a few API unreachable errors with Claude today.
ansmo@reddit
I knew it couldn't just be me! On medium and low effort especially, it takes those instructions literally. On max effort it still seems to be getting the job done, just at thrice the price. If GLM 5 was hosted at a usable speed, I'd definitely consider switching. Though now that I'm getting used to the 1M context window, and spending less than a fifth of the previous time compacting and summarizing, it would be pretty hard to go back. My only hope is that the degradation in Opus 4.6 signals the imminent release of new models.
nekmatu@reddit
I’ve felt this too. It was really good and then got … I like your word lazier. It was super sharp at first.
Disposable110@reddit
Same experience, it oneshot an entire roguelike game before and now it can barely implement two features without me handholding it.
Infamous-Crew1710@reddit
Lol at the arguing
suddenlypandabear@reddit
I wonder if they have multiple variants of the opus models at different sizes, to scale down load at different times rather than just rate limit.
Ell2509@reddit
I noticed that too.
Dazzling_Focus_6993@reddit
That means Anthropic will release Opus 4.7 soon (rebranded Opus 4.5)
oodelay@reddit
Token greedy?
Vlyn@reddit
Sorry, but what does this have to do with LocalLLaMA?
You didn't run anything locally, you just switched to a different provider/model.
Spectrum1523@reddit
The model they switched to can be run locally so
Vlyn@reddit
Can you run a 744B model locally? (:
Spectrum1523@reddit
Sure - although it'll be quanted to 1.8 bits or very, very slow
Vlyn@reddit
So 180+ GB of memory for a lobotomized model that runs as fast as a snail, gz.
These simply aren't "local" models, except you're a company with an expensive server rack.
DedsPhil@reddit
If it's open source, it's fair game. Even Qwen3.5 27B or 35B can't run well on 24GB VRAM if you need to code anything.
Electroboots@reddit
I mean, it's not technically Llama either. And there are plenty of people who can't run the larger Llamas, so by this logic this reddit should only be about Llama 3.2 1B and its various finetunes to be really 100% authentic to the name.
But that would make for a lame sub.
Vlyn@reddit
Even with just 16 GB VRAM at the moment I'm running a 24B Q4_K_M model fully on my GPU.
So limiting it down to 1B is a bit too much (:
OmarBessa@reddit
It's really good. I'm generating around 1B tokens per month and it really feels very close to opus 4.5.
The current opus is a bit nerfed these days.
RevolutionaryLime758@reddit
Lmfao you think a cloud model is local
Spectrum1523@reddit
you can run it locally, ez
RevolutionaryLime758@reddit
Btw Reddit mod I can see all your posts even if you hide them and I’m gonna make fun of you in a minute
RevolutionaryLime758@reddit
You think OpenCode Zen is local? Are you an idiot?
divide0verfl0w@reddit
GLM 5 is very good but now try Minimax 2.5 and have your mind explode.
Same bug. Same prompt. Claude Code w Opus 4.6 took 32 minutes. OpenCode w Minimax 2.5 took 8 mins.
I realized I had accidentally let Minimax 2.5 plan before execute and Claude was not in plan mode. Felt like apples ≠ oranges. So created another worktree, started Claude Code w Opus 4.6 in plan mode. Unfortunately, Claude went down a path for over 30 mins and never solved the issue.
I compared the code quality of the solutions produced. Minimax 2.5 used the correct React Router API to fix the issue. Claude Code switched to setting window.location. Something I would do back when I was junior and too stubborn to learn the right paradigm for the framework.
evia89@reddit
Do you use skills like systematic debug and provide sample logs?
divide0verfl0w@reddit
I provide the npm command to run end-to-end playwright tests, which spits out logs.
I don’t use any skills. And it doesn’t matter because Minimax didn’t have skills either.
adrazzer@reddit
yeah I am really impressed with GLM-5 myself, have been running it on Ollama cloud
novalounge@reddit
I've been running one of the Unsloth quants (UD-Q3_K_XL) at home with 128k, and it's been a great general purpose home AI model.
MR_Weiner@reddit
Am I missing something, or does that quant need like 350gb vram? And you’re running it locally on what hardware?
twack3r@reddit
Not who you are asking but I have the following system and can run the same quant. Does it fly? No. Can I work with it, very much so.
TR 7955WX, 256GiB DDR5-6400 (8 channels), 1x RTX 6000, 1x 5090, and 3 pairs of 3090s (NVLinked)
MR_Weiner@reddit
Ha, man I’m such a noob in here that I forget the crazy setups some of y’all have!
relmny@reddit
UD-Q2_K_XL can be run on a single 32gb vram GPU + 128gb ram, if you don't mind less than 1.6t/s...
sshwifty@reddit
Oof, that is some "It's compiling" speed
novalounge@reddit
Sorry - M3 Ultra 512. It’s fast, still have 100gb free.
getpodapp@reddit
I’ve been using Kimi K2.5 because it’s a vision model and I like to just send screenshots to my ai tools. If GLM5 is that much better then I’ll have to take a look 🤔
IrisColt@reddit
u-use c-case? genuinely intrigued...
Dany0@reddit
12 bil tokens? What have you shipped?
lookwatchlistenplay@reddit
Billionaire status.
Dany0@reddit
Another plaque from openai
Risen_from_ash@reddit
We use 10x tokens! 10x developers use 10x tokens. …ship? Yes, we shipped 10x tokens.
LoaderD@reddit
$200 to Anthropic
lookwatchlistenplay@reddit
Every 30 days or so. On repeat. Relentlessly.
segmond@reddit
GLM-5 is good. I had a coding task that Kimi K2.5, Qwen3.5-397B-Q6, Qwen3CoderNext-Q8 and DeepSeekv3.2-Q6 all failed at. As in, they generated code that was heading towards the right idea but was all bugged, and none could run correctly. GLM5 at Q4 is the only model that generated code that works. Not perfect, but it works and is a good foundation to build on. I'm running locally and did multiple passes. So impressed by it that I'm now downloading Q5 and hope to upgrade my system soon to be able to run Q6.
relmny@reddit
Have you tried deepseek-v3.1-terminus by any chance? I'm still trying to figure out whether v3.2 is actually better or not (I only get about 1t/s, so I can't test them much...), and I have my doubts
ihaag@reddit
What hardware are you using ?
Blackvz@reddit
You could also check out minimax m2.5
Also a good open source model.
I would love to hear your opinion in comparison to glm5
R_Duncan@reddit
This highlights something we already know or suspect: under the hood, every model served can be quantized/changed without users being notified. Is there a new version? A distilled model? A 3-bit quantized version?
Users don't know, and the worst part is that it can happen from one day to the next, so you start a project with one model, midway it becomes dumber, and your project goes....
Conclusion: you can't trust an online service until this gets addressed and a checksum of the model being served is published, together with quantization and other parameters.
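The transparency being asked for could be as simple as a published, pinnable manifest. The field names below are hypothetical; no provider exposes this today:

```python
# Sketch: provider publishes a manifest (weights identity, quantization,
# key serving parameters); clients pin its checksum and notice swaps.
import hashlib
import json

def manifest_checksum(manifest):
    # Canonical JSON so the hash is stable regardless of key order.
    canonical = json.dumps(manifest, sort_keys=True).encode()
    return hashlib.sha256(canonical).hexdigest()

served = {"model": "GLM-5", "quant": "FP8", "kv_cache_dtype": "fp16"}
pinned = manifest_checksum(served)

# If the provider silently swaps in a 3-bit quant mid-project,
# the checksum no longer matches what the client pinned:
swapped = dict(served, quant="Q3")
```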
Vozer_bros@reddit
I spent several B last year, and I would say GLM-5 is incredible, but sometimes the quality just drops significantly due to their lack of hardware, which I do understand.
Try the GLM5-turbo one bro, that one is solid.
ihaag@reddit
I find GLM really good, but Claude does make prettier CSS than GLM. Still, GLM is my go-to.
4xi0m4@reddit
I think the most useful takeaway here is that this sounds like a workload fit issue more than a clean global ranking.
If the task is concrete, tool heavy, and the feedback loop is short, GLM 5 can absolutely overperform expectations. Claude still feels stronger to me when the task gets messy, under-specified, or needs better judgment during refactors.
So your result does not sound crazy. It sounds like your benchmark is rewarding a type of work that GLM handles unusually well.
Emergency-Pick5679@reddit
How do you guys run this models ? OpenCode ? any way to access all the latest sota models ?
Briskfall@reddit
It actually surprised me as well. Thought that it was going to be a dud due to how much I've heard that it's "distilled."
I have a private set of questions for historical facts with "misleading" formats that usual open-source models fail in but SOTA ones don't.
Smart models would actually not get swayed by the template, while dumb ones wouldn't even bother to do the search and would capitulate.
GLM-5 actually was one of the rare few that passed it during a test with LMArena. (and of course, Opus 4.6 Thinking and Gemini 3.1 Pro did too)
(but some older SOTA models like Gemini 2.5 didn't though... nor did the latest versions of Grok or Mistral.)
JimJamieJames@reddit
How did Qwen3.5 do?
randomlyme@reddit
This isn’t how you perform spec-driven development testing
po_stulate@reddit
Wasn't GLM 5 focused on general chatting instead of coding?
Fun_Nebula_9682@reddit
GLM 5 is genuinely underrated. I've been running GLM-OCR locally on Mac Studio M2 Ultra for document processing — tables, math equations, mixed CJK text — and it handles everything at ~260 tokens/sec with just 2GB VRAM.
What surprised me most is how well it handles code-related content. I use it as part of a local pipeline where OCR output feeds into Claude Code for analysis. The combination of a fast local model for extraction + a frontier model for reasoning is way more cost-effective than sending everything to the cloud.
Have you tried it for any specific use cases beyond chat?
Own-Relationship-362@reddit
GLM 5 is surprisingly good at structured tasks too — I've been testing it for matching natural language task descriptions to structured skill files (SKILL.md format). The instruction following is solid enough that it picks up domain-specific terminology better than some of the bigger models. Not great for creative writing but for tool-use and structured reasoning it punches above its weight.
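For readers unfamiliar with the task shape described above, here is a rough sketch of skill matching by keyword overlap. In the commenter's setup an LLM does the matching; the file names and SKILL.md contents below are invented for illustration:

```python
# Toy skill matcher: score each SKILL.md body by word overlap with the
# natural-language task description, pick the best.

def match_skill(task, skills):
    task_words = set(task.lower().split())
    def score(body):
        return len(task_words & set(body.lower().split()))
    return max(skills, key=lambda name: score(skills[name]))

skills = {
    "db-migrations/SKILL.md": "apply schema migrations to the database",
    "pdf-reports/SKILL.md": "generate pdf reports with charts",
}
best = match_skill("generate a quarterly pdf report", skills)
```

A model that is good at instruction following and domain terminology effectively does a much fuzzier version of this, which is the capability being praised.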
jeffwadsworth@reddit
The web version is nothing compared to the 4bit version run locally. Night and day.
agentcubed@reddit
As others say, I don't recommend using one-shots as a benchmark.
In the end, it depends on your workflow. If you are a 100% vibe coder (pls no), then maybe judging by one-shots is fine
cantgetthistowork@reddit
Writing fresh code is something every model does well these days. It's working with existing codebases where you see all the problems
slypheed@reddit
Let me guess; you write JS ?
fugogugo@reddit
12 billion token .. how much you spent already?
BP041@reddit
Real-time chat with websockets is actually a decent stress test because it requires getting async state management right on the first attempt. That's a different skill from code generation — it's more about the model's internal architecture of how state flows.
For harder tests that separate them: try multi-file refactoring where the context spans more than one codebase, or debugging something where the bug is in a dependency interaction rather than obvious logic. Those tend to reveal where each model's "implicit understanding" of the codebase breaks down. Claude tends to track cross-file state better in my experience, but GLM might surprise you on certain patterns.
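The async-state point above can be made concrete. A working real-time chat needs server-side fan-out so every connected client sees new messages pushed to it; the sketch below shows that core pattern with asyncio queues standing in for websocket connections (a simplified stand-in, not OP's actual app):

```python
# Minimal fan-out: a chat room pushes each message to every subscriber,
# so clients see new messages without a page refresh (the failure mode
# OP observed in the Claude Code version).
import asyncio

class ChatRoom:
    def __init__(self):
        self.subscribers = []

    def join(self):
        # In a real app, one queue per websocket connection,
        # drained by that connection's send loop.
        q = asyncio.Queue()
        self.subscribers.append(q)
        return q

    async def post(self, message):
        # Fan the message out to every live connection.
        for q in self.subscribers:
            await q.put(message)

async def demo():
    room = ChatRoom()
    alice, bob = room.join(), room.join()
    await room.post("hello")
    return [alice.get_nowait(), bob.get_nowait()]

received = asyncio.run(demo())
```

A model that only writes the request/response half (store message, render on next page load) produces exactly the refresh-to-see-messages behaviour described in the post.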
metigue@reddit
A lot of that has to do with the agentic harness. Claude code despite being so popular is just not good. You should compare opus 4.6 and GLM in the same harness - I recommend Droid or forge code.
unltdhuevo@reddit
When it comes to following instructions GLM 5 is too good
robberviet@reddit
I would love some task on existing repo too. Also what gpu/hardware are you using at what speed?
-dysangel-@reddit
No you're not tripping. I've been using GLM Coding Plan for a while. The brief time I tried Claude again, I felt like I was babysitting vs working with a competent colleague.
Though GLM-5's coherence has been getting lower and lower. I suspect they're heavily quantising the KV cache. A few days ago it would lose it at 80k tokens, but earlier today I was getting issues even at 40k tokens. I've switched to GLM 4.7 until they work out the bugs, or unless I really need better quality planning for something
twack3r@reddit
Which is exactly why there is no logical substitute for owning your own metal and running your own local models.
Spurnout@reddit
I've been using it lately, especially while building a piece of software similar to openclaw, but I actually got better results from Kimi K2.5, which I was a bit surprised about. I've been thinking of updating the scoring though...
LargelyInnocuous@reddit
Isn't that like $50k in tokens? do you mean 12M? Or are you creating datasets for a large model and have business paying for it?
okyaygokay@reddit
Sorry but creating a websocket chat app is not a hard task
SvenVargHimmel@reddit
You're right about that. The test is a bit arbitrary. I find GLM fails in existing codebases. It's not very good with anything that's not React, and it gets worse when your language is not TypeScript.
I find planning with opus and building with Kimi and reviewing with Gemini works well
Happythen@reddit
oi, yes it is. at least a production one.
SpicyWangz@reddit
Really depends on the level of features, but yeah. Just the bare bones is pretty underwhelming.
asria@reddit
What hardware do you have? How many t/s did you achieve?
Orlandocollins@reddit
They said opencode zen so they aren't running locally
Effective-Drawer9152@reddit
It is very very slow
johnerp@reddit
What spec machine did you run it on, what quant, etc.?
lookwatchlistenplay@reddit
Too many questions. AI already answered those. Ask your AI what specs OP has, haha.
FullOf_Bad_Ideas@reddit
I don't run GLM 5 (too big) but I do use local GLM 4.7 355B in OpenCode and Claude Opus in CC. I think the difference is really big there. Way more bugs in the code with GLM. Maybe in your testing GLM 5 looked so good because of the front-end aspect. I don't do front end. I think Zhipu focused on web dev so it should shine there. GLM 5 is pretty high up on the DesignArena.