GLM 5.1 tops the code arena rankings for open models
Posted by Auralore@reddit | LocalLLaMA | View on Reddit | 86 comments
SteppenAxolotl@reddit
LocalLLaMA could run out of steam if sizes keep increasing at the leading edge.
What are the chances some near-AGI can be condensed down to <100 GB of VRAM?
droptableadventures@reddit
To be fair, those of us who do run models this size don't run them in FP8, and we don't run them on H200s.
FullOf_Bad_Ideas@reddit
I bet that's largely an effect of distilling Opus 4.6 instead of Opus 4.5. Opus 4.6 was a big jump.
Plus they're obviously spending a lot of compute on post-training GLM 5.1 now, whereas GLM 5 had more of its compute spent on pre-training.
Zulfiqaar@reddit
But Opus 4.6 wasn't really an improvement for coding? It was optimised for Claude Cowork and all sorts of other general tasks, especially computer use, and made massive jumps there. Perhaps that carried over as cross-domain capability, but if it didn't improve Opus at coding, I don't see why it would have made much difference to GLM.
-dysangel-@reddit
plus they have GLM coding plan users to gather more training data from
last_llm_standing@reddit
I think "GLM 5.1 in the top 3 models in the code arena ranking" would be a better title. None of the other open models are even in the vicinity.
Leafytreedev@reddit
Unfortunately GLM models only have 200k context, whereas Claude and Gemini both have 1M. With Claude Code system prompts sometimes taking 65k+, it might be a while before it's ready for agentic harnesses. Hopefully the Qwen 3.6 open-source variants will come with 1M context like their 3.6-plus-preview, currently sitting at rank 9.
arcanemachined@reddit
Can't speak for Gemini, but Claude's quality starts degrading after 150K, and it's worse than useless after 200K (as in, I would have been better off starting a fresh session than dealing with the garbage it produces after 200K).
Pessimistic take: Anthropic is milking extra money out of people by charging for the cache reads caused by such a high context window. (Remember, the whole context is sent during each prompt.)
Optimistic take: Future Anthropic models will actually be useful above 200K.
droptableadventures@reddit
It's really just an arbitrary number. A few small edits to the GGUF and GLM-5.1 can have 1M tokens too.
It'll be similarly rubbish, but hey, it'll have it.
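For what it's worth, with llama.cpp you don't even need to edit the file: the context window and RoPE scaling can be overridden at load time. A sketch (the model filename and scale factor here are illustrative, and quality past the trained window is another matter, as noted above):

```shell
# Hypothetical: stretch the advertised 200K window toward ~1M via YaRN RoPE scaling.
llama-server -m GLM-5.1-IQ4_K.gguf \
  --rope-scaling yarn \
  --rope-scale 5 \
  -c 1000000
```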
thepetek@reddit
Your optimistic take is correct. One problem with long context is that there's very little data to train on at that length. Expanding the default to 1M is more about starting to gather that data than about it being immediately useful.
AnOnlineHandle@reddit
I might be a little behind, but I never consider these models useful for huge contexts anyway. They're always good for brainstorming, asking for research pointers, and writing standalone functions in a small context window. I frequently restart conversations so the LLM's claims from the previous convo don't pollute the new one, then check whether it still says the same thing rather than doubling down.
e.g. just today, using Google's flagship Gemini Pro model in AI Studio, it claimed one thing about a public source repo and double-confirmed it. Then I started a new convo and asked if it was true; it researched it, said nope, and linked to the proof (I haven't had time to read the proof yet, so I don't know which was true).
nuclearbananana@reddit
What? Last I checked it was ~16K. Which is still insane, but 65K+ would be unusable for any model besides Opus, and even that would be very dumb.
alija_kamen@reddit
Most providers don't actually run inference with the full 1M context though. In Copilot for example you only get 160k for Opus.
RipperFox@reddit
I was experimenting with GLM 5.1 over the last three days with a Hermes agent, adding features to a llama.cpp fork.
I've found it to be incredibly token-efficient compared to, e.g., Qwen 3.6.
etherd0t@reddit
Which is to say... #1 among open-source models.
etherd0t@reddit
The bigger signal is this: GLM-5.1 appears to be the first open model sitting in the top 3 overall on Arena’s coding board, which is a meaningful milestone for open-weight coding models.
Auralore@reddit (OP)
You're correct tbh
Cool-Chemical-5629@reddit
GLM-5.1 at chat.z.ai is a hilarious overthinking procrastinator...
I asked it to code a simple 3D game and here are some of the funniest parts of its seemingly endless chain of thought and how it got itself tangled up in it along the way.
Every time it promised to start writing the actual code, it went ahead and started doing something completely different... 😂
1:
Let me write this now.
One concern:...
2:
Let me write the code now. I'll be thorough.
For the terrain height function that returns height at any (x, z) world position, I need to:...
3:
OK, let me write this all out now. I'll aim for a polished, complete experience.
Let me think about the fog too. I'll use exponential fog for atmospheric effect. The color should match the sky at the horizon.
Alright, here's my complete plan for the code structure:...
4:
Let me write this now. I'll be comprehensive but try to keep the code clean.
One more thing - I need to make sure the terrain normals are recomputed after displacing vertices, and that I use those normals for slope-based coloring. The order should be:...
5:
OK let me just write the code now. I've planned enough.
For the water, I want it to look nice. I'll use a MeshStandardMaterial with:...
6:
Let me finalize and write the code.
Actually, I realize I need to think about the terrain texture more carefully...
7:
OK, writing the code now for real.
One more consideration: the terrain mesh is going to be...
8:
Alright, here's the final code:
Let me also think about the fog and how it interacts with the sky. The fog color should match the horizon color of the sky for a seamless blend...
9:
OK, final plan. Let me write the code now.
Let me also think about the NPC models...
10:
OK I'm just going to write the full code now. Let me do it.
For the sky, I'll create a shader-based sky dome...
11:
OK, let me just write the code now. I'll aim for a warm afternoon sky.
For the water shader, I'll use:...
12:
Let me finalize and write the code.
One more thing: for the grass patches, I'll create...
13:
OK, I've planned enough. Let me write the complete code now. It's going to be long but well-structured.
Actually, I want to also add some rocks scattered around...
14:
OK, let me now write the complete code. I'll organize it into clear sections.
For the player controller, I need:...
15:
This should work well. Let me now write the full code.
I also want to add a sun (directional light) with shadows. But shadows on such a large terrain might be expensive. Let me skip shadows for performance and just use ambient + directional lighting.
Actually, let me add shadows but with a limited shadow camera frustum centered on the player. This way only nearby objects cast shadows, which is performant.
Hmm, implementing shadow following adds complexity. Let me skip shadows for now and focus on the visual quality of the terrain, water, and objects.
Alright, HERE IS THE FINAL CODE. Writing it now.
Let me be careful about a few things:...
16:
OK, let me write the code now. For real this time.
Let me think about the structure one more time:...
17:
Let me write it all now.
Actually, for trees, let me think about how to create them efficiently. I'll use two InstancedMeshes:...
18:
OK, I think I have everything figured out. Let me write the complete code now.
Let me also think about the buildings. I'll create simple stone houses:...
19:
Let me now write the actual code. I'll aim for about 500-600 lines of well-structured JavaScript.
One more consideration:...
20:
OK, starting to code now.
Actually, I realize that for a truly nice-looking terrain, I should use a custom shader...
21:
OK, writing the code now. For real.
Let me be practical about the code length. This is going to be a substantial file. I'll aim for clean, well-commented code that's organized into clear sections...
22:
OK this is getting really long. Let me just write the full code now in one go, keeping it organized but not over-commenting.
Let me think about a few more things:...
23:
OK, let me now write the complete code. I'll be thorough but efficient.
For NPC meshes, I'll create simple humanoid shapes. Let me use:...
24:
OK, let me write the complete code now. I'll add leg animation as a bonus.
Alright, I'm going to write the entire HTML file now. Let me be methodical and get everything right.
I just realized I need to be more careful about the water. The water plane should cover...
25:
OK, writing the complete code now:
Actually, let me think about the grass patches more carefully. I want them to look like small tufts of grass scattered on the terrain. Using InstancedMesh with a crossed-quad geometry:...
26:
OK, let me just write the code now. I've been overthinking this.
Here's my approach for the grass:...
27:
OK, I'm going to write the complete code now. No more deliberation.
Let me also think about the building placement. I want a small village in a flat area. I'll find a suitable location by sampling the terrain for a flat, above-water area...
28:
OK, let me now write the complete code. I'll keep building descriptions simple.
Let me think about one more thing: the NPC walking animation...
29:
OK, I'm going to write the complete code now. Let me organize it and write it all at once.
Actually, let me also think about the initial player position. I need to find a good spawn point that's above water and on relatively flat terrain...
30:
Let me also think about collision detection. For a simple open world, I don't need complex collision. The player just follows the terrain height. I won't implement collision with buildings or trees - the player can walk through them. This simplifies the code significantly.
Actually, let me at least prevent the player from walking through buildings. I'll add simple AABB collision for buildings. For trees, the player can walk through them (they're just visual).
Hmm, building collision adds complexity. Let me skip it and focus on the visual experience. The player can walk through everything except the terrain.
OK, WRITING THE CODE NOW.
Let me estimate the code length. This will probably be 600-800 lines of JavaScript plus HTML/CSS. That's a substantial single file but manageable.
Let me structure the code clearly with section headers.
One thing I want to make sure: the water looks good. Let me use a custom shader for the water with:...
31:
OK, let me write the complete code now. I'm going to do it in one shot.
Let me also decide on the NPC count...
32:
OK, I'm going to write the full code now. Let me be thorough and get it right.
One more consideration: I need to make sure the game loop is smooth. I'll use requestAnimationFrame with delta time capping (to prevent huge jumps when the tab is inactive):...
33:
Alright, here's the complete code. Let me write it.
Hmm, actually I realize I should also think about the tree placement more carefully...
34:
OK, I think I have everything figured out. Let me write the code now.
I also want to add some flowers scattered in the grass...
35:
OK, let me now write the COMPLETE code. No more planning.
Let me estimate the performance:...
36:
Let me write the code now.
Wait, one more thing about the water plane. I need to make sure it's at the correct Y level...
37:
OK, writing the code now. Final version.
Let me think about one more optimization: for the grass...
38:
OK, HERE IS THE CODE. Writing it now for real.
I need to be careful about the Three.js import. Let me use version 0.160.0 which is well-tested.
Actually, let me also add a minimap!...
39:
OK, I'm going to write the complete code now. Let me aim for a comprehensive, polished experience.
Let me also add NPC markers on the minimap...
40:
Let me write it all. This is going to be long but I'll keep it clean.
OK I need to stop planning and start writing. Let me do it.
Actually, I just thought of something...
41:
OK, writing the code now. For real. Final time.
I think I also want to add a slight ambient animation to make the world feel alive. Maybe:...
42:
OK, writing the code now. FINAL.
Actually, you know what, let me also add some particle effects. Small floating particles (dust motes, fireflies, etc.) would add a lot of atmosphere. I can...
43:
This is comprehensive. Let me write it.
Hmm, I realize the flowers might be too much. Let me skip flowers and focus on the core features. I can always add them later.
Actually, flowers are easy to add with InstancedMesh. Let me include them...
44:
OK, writing the code now. Let me go.
I'm going to structure the code as a series of function calls in order, with each function creating and adding its elements to the scene.
Let me also make sure the player starts at a good location. I'll find a flat area above water for the spawn point.
Here's the final approach for the player spawn:...
45:
OK let me write this. I'll write it as a continuous script.
For the sake of keeping the response manageable but complete, let me write efficient code without excessive comments, but with clear section headers.
Let me go.
---
This time it started writing the actual code, but stopped generating in the middle of it for whatever reason, probably ran out of context window lol
frozen_tuna@reddit
I'd be curious to see how many tokens each of these use during the benchmark too. I used gpt 5.4 for exactly one prompt and after seeing how much it read and generated, I decided to switch away from it and never look back, regardless of benchmarks.
cafedude@reddit
Hoping for a GLM-5.1-flash
Several-Tax31@reddit
And it being open source! If it's similar to GLM-4.7-flash in size, we can easily run it.
jinnyjuice@reddit
Flash was about 1/10th the size of 4.7. By that ratio, a GLM 5.1 Flash would be around 150 GB, and a 4-bit quant would bring that to roughly 50 GB.
Though these days image + audio capabilities are the trend, so it would likely end up higher than 50 GB.
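As a sanity check on those numbers, the size arithmetic is just parameters times bits per weight. A quick sketch (the 754B parameter count comes from elsewhere in this thread; the ~75B "flash" size is the 1/10th guess above, not an announced model):

```python
def weight_gb(params_billion: float, bits_per_weight: float) -> float:
    """Approximate on-disk size (GB) of a weight dump:
    billions of params * bits per weight / 8 bits per byte."""
    return params_billion * bits_per_weight / 8

# 754B params at BF16 (16 bits/weight) -> ~1508 GB, matching the ~1.5 TB figure
print(round(weight_gb(754, 16)))   # 1508
# a hypothetical 1/10th-size "flash" (~75B) at BF16 and at a ~4.5-bpw quant
print(round(weight_gb(75, 16)))    # 150
print(round(weight_gb(75, 4.5)))   # 42
```

(Real GGUF files land a bit off these numbers because different tensors get different bit widths, but it's close enough for napkin math.)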
ShelZuuz@reddit
Is it really higher than GPT-5.4?
zhdc@reddit
Wouldn't agree in real-world usage. It's good, but it's not as good as GPT 5.4. For what it's worth, I'm not convinced that Opus and Sonnet are better than 5.4 either.
That said, it's great for what it costs. MiniMax 2.5 is also worth looking at.
Haiku-575@reddit
Writing arcane VBA scripts for PowerPoint (yeah, gross, I know), Opus can kinda do it but makes a lot of mistakes. GPT 5.4 fails outright even with specific instructions. GLM 5.1 one-shot a complete rewrite and closed the ticket.
You need a lot of anecdotes before you can call it "data", but my universal experience has been GLM 5.1 and Opus > GPT 5.2 or 5.4.
Ecsta@reddit
5.4 is better if the task is straightforward and well-defined, i.e. raw programming. For anything that needs any sort of discussion or planning, Opus blows it out of the water. In my personal experience.
SadEntertainer9808@reddit
I'd put money on it being benchmaxxed to hell, but maybe I'm wrong.
susibacker@reddit
Arena is user ratings in blind tests iirc
dictionizzle@reddit
there are ways to arenemaxx a model
vazyrus@reddit
How?
-dysangel-@reddit
I've been using the GLM coding plan for months now. Of course, who am I to say whether the models are benchmaxxed, but I can say they feel on par with, or even better than, Claude in some areas. When I briefly tried Claude again, I felt like it needed more babysitting than GLM.
SadBBTumblrPizza@reddit
Are you getting random stops and halts with the coding plan? I'm on a grandfathered lite plan and the service just times out constantly and always has.
-dysangel-@reddit
Yep, the service levels have been pretty terrible since GLM 5 dropped. I've just gotten in the habit of using /compact regularly in Claude Code. The model itself, though, is fantastic.
SadEntertainer9808@reddit
Would be absolutely delighted if there were an open model that was competitive with the frontier.
procgen@reddit
yeah 5.4 on codex mogs claude code + opus in realworld use. smells like benchmaxxing
yaboyyoungairvent@reddit
LLM arena is based on blind usage by users. I don't think it's as simple as saying it was benchmaxxed. If you wanted to, you could go on there yourself and compare it alongside Claude Opus.
One_Internal_6567@reddit
Of course it’s not even close
hp1337@reddit
How does it compare to Qwen 3.5 397B?
Terminator857@reddit
Qwen 3.5 397B is way down on the list at number 25. https://arena.ai/leaderboard/text/coding-no-style-control
_derpiii_@reddit
wow. am I imagining things or wasn’t Qwen a top coding model just a week ago?
Terminator857@reddit
I wonder if that was 3.6.
volleyneo@reddit
You people have more than 16gb vram 🫠
Fair_Ad845@reddit
A Q4 quant should fit in 16GB if the model is around 32B. The real question is whether the quant kills the code quality that got it to the top of the arena.
LeRustMan@reddit
GLM-5.1 has 754B parameters, lol. The mixed-precision BF16 model is 1.5TB, and it's still 206GB at IQ1. Safe to say this is out of reach for local unless you hijacked a shipment of B200s.
Beamsters@reddit
512GB couldn't even run this thing at 8-bit.
VoidAlchemy@reddit
but it can run it at smol-IQ4_K: 405.502 GiB (4.621 BPW) without sacrificing much quality. Though it's chonky; with A40B, even quantized, it slows down with more than 32k of KV cache (I'm running it CPU-only). https://huggingface.co/ubergarm/GLM-5.1-GGUF
themoregames@reddit
I am very surprised Gemini 3.1 Pro ranks higher than GPT-5.3-Codex
I only had a brief episode with 5.3-Codex before it got replaced by GPT-5.4, but I know Gemini 3.1 Pro very well.
Almost everything I do with Gemini 3.1 Pro is probably "watered down" somewhat by automatic model selection in Gemini-CLI. But I really have to ask Claude or GPT-5.4 for reviews of like 98% of what Gemini does for me, and those reviews often reveal a lot of problems.
I am not yet sure how well Gemini 3.1 Pro does Code Reviews, but at least it seems it can be steered very well, e.g. by telling it to be "pedantic" and why and what to look out for.
For individual developers, it seems somewhat difficult to compare models. A $200 OpenAI subscription or even a $200 Claude Max subscription is probably good enough for most private use, while at work your employer most probably decides which AI agents you get.
Kaljuuntuva_Teppo@reddit
A bit strange that gpt-5.4 xhigh is not on the list, but high is.
DonnaPollson@reddit
Arena wins are useful signal, but the more interesting shift is that open models are becoming boring in the best possible way: good enough to enter real workflows instead of just benchmark arguments. Once a model is consistently near the top, the next battleground is inference cost, tool use, and how well it survives messy production prompts. Rankings get attention, reliability gets adoption.
mambo_cosmo_@reddit
Insane that it beats chatGPT and Gemini by such a landslide
VoiceApprehensive893@reddit
Arena scores are kinda weird, like Gemma being #10 in code.
Opus is higher than Gemini on vision even though it hallucinates a lot on docs that Gemini ingests with no issues.
Character_Wind6057@reddit
I'm not really surprised about GLM 5.1 beating Gemini 3.1 Pro; it's really shit in comparison. Gemini, at least for me, shines for studying and vision. For everything else, it's meh.
-p-e-w-@reddit
I do think it’s rather surprising that a Chinese startup is beating the flagship model made by the world’s most powerful technology company that literally invented transformers.
And Google is heavily marketing Gemini as a coding model. This is all they've got, and it's coming up short.
Character_Wind6057@reddit
For me it's absurd how the company with the biggest search engine and the most data can't even make its own model search the internet instead of hallucinating. Or worse, you ask it to search the internet and then read in its reasoning something like "I'll simulate an online search".
Anyway, it's better for us. Competition keeps prices low and innovation high.
IrisColt@reddit
You nailed it!
Mkengine@reddit
Thank god I'm not the only one experiencing this. I don't know why I would ever pick the sycophantic model that can't properly search the web or code when there's ChatGPT or Claude. Nano Banana is really the only part of Gemini that's useful for me.
logicchains@reddit
It's not surprising if you've used Angular, or TensorFlow, or BigQuery, and seen how they ruined them.
TheRealMasonMac@reddit
I think ZAI genuinely has something top-tier in their training regime that not even the frontier labs have. Their models are the best for general assistant work by a long shot.
yaboyyoungairvent@reddit
Gemini is relatively bad at coding compared to the top players GPT and Claude.
mambo_cosmo_@reddit
Yet it's made by the largest data company in the world, the one that invented transformers and markets the model as a coding agent too...
eli_pizza@reddit
5.1 seems great, but I'm skeptical that an arena is a good approach to evaluating a coding model. Isn't it biased toward whichever answer looks better?
MainFunctions@reddit
"Ranks #1 in Code Arena", when it clearly ranked third. And yes, I understand it's "#1 open model", but why be greasy like that? Just say it's #3 and open weights. That's very impressive on its own.
unjustifiably_angry@reddit
Got literacy?
unjustifiably_angry@reddit
Is that Sonnet and Opus today, or Sonnet and Opus from 2 weeks ago before they were lobotomized?
VonDenBerg@reddit
5.1 is legit. So legit I want to get a second z.ai account so I can run concurrent operations.
ninjaeon@reddit
I agree and did the same, but in a different way.
The Z.ai coding plan has been dog s**t lately, with servers being overloaded. So, of two other providers offering glm-5.1 on coding plans (OpenCode Go & Ollama Cloud), I ran a test on quota consumption and found Ollama Cloud to be about 8x cheaper per request.
Then I set up the Omniroute API router with both Z.ai and Ollama Cloud in a combo that prioritizes Z.ai coding plan usage first, then falls back to Ollama Cloud when Z.ai fails for whatever reason (which happens every day, often repeatedly). Working great so far!
If I didn't have Z.ai coding lite already prepaid for a year at $27, I would just go straight to Ollama Cloud at $20/month. And this is from someone who hates Ollama in general, until now and just for this specific use case.
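The prioritize-then-fallback logic described above is simple enough to sketch. A minimal version (the provider names and `send` callables are stand-ins for illustration, not Omniroute's actual config):

```python
def call_with_fallback(providers, prompt):
    """Try each provider in priority order and return (name, reply)
    from the first one that succeeds; raise only if all of them fail."""
    errors = []
    for name, send in providers:
        try:
            return name, send(prompt)
        except Exception as exc:  # overloaded, timed out, etc.
            errors.append((name, exc))
    raise RuntimeError(f"all providers failed: {errors}")

def flaky_primary(prompt):
    # stand-in for a coding-plan endpoint that is currently overloaded
    raise TimeoutError("server overloaded")

def healthy_fallback(prompt):
    # stand-in for the fallback provider
    return f"reply to: {prompt}"

providers = [("z.ai", flaky_primary), ("ollama-cloud", healthy_fallback)]
print(call_with_fallback(providers, "hello"))  # ('ollama-cloud', 'reply to: hello')
```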
MeYaj1111@reddit
Thanks for this!
zhdc@reddit
Great advice. Trying Ollama Cloud now.
VonDenBerg@reddit
Fucking a. I’ll give this a go, appreciate writing it up.
Leafytreedev@reddit
I've been looking to sign up for a z.ai account, but I've been reading lately that they're notorious for serving lower quantizations of their models and other enshittifications after the initial hype of a model release. Not sure how true that all is, but it definitely has me second-guessing an $80/month subscription. Have you noticed any quality drops over time, and how long have you been subscribed?
DinoAmino@reddit
What's the methodology here? I don't see mention of individual benchmarks. But I do see a total of "votes". So it's a popularity contest?
toffee0_0@reddit
It's subjective scoring with LMArena. The methodology is basically: give a prompt, two models answer the same question (A/B), and if you like A, you vote for A. So yeah, it's not objective.
cafedude@reddit
I'm thinking it was mostly a publicity stunt.
zhdc@reddit
I have GLM 5.1, GPT 5.4, and Claude Sonnet running side by side. GPT 5.4 > Sonnet > GLM 5.1.
However, GLM 5.1 is still very, very good compared to other open source/third party models.
milkipedia@reddit
GLM 5 Turbo not being on the list has me wondering. It's been pretty good for me.
Tight-Requirement-15@reddit
elephant: address me meme
satireplusplus@reddit
The 4-bit quant of this doesn't fit in 256GB.
The full 16-bit model is 1.5TB, lmao.
Terminator857@reddit
The one thing GLM 5.1 is way above the rest of the models at is code review. Give it a try. It's like 3x better than any other model at code reviews.
Hot-Employ-3399@reddit
I hope they'll release a GLM-5.1-flash then, so we can run it on our GPUs.
Technical-Earth-3254@reddit
I really like GPT 5.4, 5.3 Codex, and 5.1 Codex Max, so seeing that GLM beats all of them is very nice. Wish I could run it locally, but I take what I can get. I'm probably getting their coding subscription then.
OmarBessa@reddit
It's an amazing model, well deserved
masterlafontaine@reddit
Let's support the models, guys
Efficient-Series-939@reddit
better than Sonnet 4.6... sure buddy.
Serprotease@reddit
Honestly, yeah, it's better than Sonnet 4.6.
The fact that z.ai now compares it to Opus rather than Sonnet tells you how confident they are in its capabilities. Also, it's interesting to see Anthropic talking up mythos now that Chinese models have closed the gap with Sonnet.
bruhhhhhhhhhhhh_h@reddit
Great model.