GLM-5.1 Scores 94.6% Against Claude Opus on Coding at a Fraction the Cost
Posted by dev_is_active@reddit | LocalLLaMA | View on Reddit | 7 comments
Here is the HF link: https://huggingface.co/zai-org/GLM-5.1-FP8
LocalLLaMA-ModTeam@reddit
Duplicate
CC_NHS@reddit
It is cool that GLM-5.1 scores higher than GLM-5, but beyond that I trust the benchmarks zero (comparing models against their own earlier versions is perhaps the one somewhat useful signal).
For example, the practical difference between Opus 4.6 on one hand and GPT, Gemini, and GLM-5 on the other was huge; Opus has simply been the best by far for me in practice. Which just shows that the benchmarks measure a thing, and the labs make sure their LLMs are good at that thing for the marketing... but in actual use the scores are kind of irrelevant, except maybe as a data point when deciding whether to try a model out yourself.
It does seem a great model though :) So far it does not seem on the same tier as Opus, but it is probably my second favourite now.
segmond@reddit
GLM-5 and Kimi K2.5 have consistently beaten Opus 4.6 for me. What's your point?
hainesk@reddit
Not sure why you would be downvoted. The whole point of their comment is that there is a difference between benchmarks and real-world use, and you provided your real-world experience and asked for more detail. Perfectly reasonable.
Substantial_Swan_144@reddit
GLM-5 is ever so slightly worse than Opus, but in my experience it is much more consistent (especially given everything Anthropic has been doing).
CC_NHS@reddit
I have not noticed inconsistencies with Opus (I have not kept up with the Anthropic news; I'll read up on it later, but whatever it is does not seem to have translated into anything I've noticed in actual use so far).
Tbh the biggest difference for me between Opus and the others is in coding: its ability to write code 'well', not just code that functions. It seems to use good judgement about future-proofing and so on. And for non-coding work, it keeps a good weight on the different aspects of the conversation, rather than focusing too little on important things or too much on less important ones. It seems to capture the context well.
GLM and most other models in that tier can perform a well-structured task, but if the task is followed up, or further information is added as context that helps with the task without strictly being part of it, things can get a bit inconsistent. For example, Qwen and GPT have some context about my game-dev projects in their web-app personalisation settings, and almost every answer inserts something about it, even when it is totally unrelated to the current question. Opus, given the same personalisation content (which states in it to only use it when relevant), brings it up only when it actually is relevant to the conversation.
In terms of plain 'do this thing' tasks that don't need huge context, though, I think they may well end up fairly similar.
Substantial_Swan_144@reddit
You are going to see inconsistencies with very complex tasks. If you design a complex application, dumber AIs tend to make your code incrementally over-engineered in subtle ways. With the smarter AIs, you get more concise code and it'll all be rainbows. But if performance fluctuates, you won't realise it until your code has become a large behemoth.