GLM 5.1 tops the code arena rankings for open models
Posted by Auralore@reddit | LocalLLaMA | View on Reddit | 86 comments
SteppenAxolotl@reddit
LocalLLaMA could run out of steam if sizes keep increasing at the leading edge.
What are the chances some near-AGI can be condensed down to <100 GB of VRAM?
droptableadventures@reddit
To be fair, those of us who do run models this size don't run them in FP8, and we don't run them on H200s.
FullOf_Bad_Ideas@reddit
I bet that's largely an effect of distilling Opus 4.6 instead of Opus 4.5. Opus 4.6 was a big jump.
Plus they're obviously spending a lot of compute on post-training GLM 5.1 now, whereas GLM 5 had more of its compute spent on pre-training.
Zulfiqaar@reddit
But Opus 4.6 wasn't really an improvement for coding? It was optimised for Claude Cowork and all sorts of other general tasks, especially computer use, and made massive jumps there. Perhaps that carried over as cross-domain capability, but if it didn't improve Opus at coding, I don't see why it would have made much difference to GLM.
-dysangel-@reddit
plus they have GLM coding plan users to gather more training data from
last_llm_standing@reddit
I think "GLM 5.1 in the top 3 models in the code arena ranking" would be a better title. None of the other open models are even in the vicinity.
Leafytreedev@reddit
Unfortunately GLM models only have 200k context, whereas Claude and Gemini both have 1M. With Claude Code system prompts sometimes taking 65k+, it might be a while before it's ready for agentic harnesses. Hopefully the Qwen 3.6 open-source variants will come with 1M context like their 3.6-plus-preview, currently sitting at rank 9.
arcanemachined@reddit
Can't speak for Gemini, but Claude's quality starts degrading after 150K, and it's worse than useless after 200K (as in, I would have been better off starting a fresh session than dealing with the garbage it produces after 200K).
Pessimistic take: Anthropic is milking extra money out of people by charging for the cache reads caused by such a high context window. (Remember, the whole context is sent during each prompt.)
Optimistic take: Future Anthropic models will actually be useful above 200K.
droptableadventures@reddit
It's really just an arbitrary number. A few small edits to the GGUF and GLM-5.1 can have 1M tokens too.
It'll be similarly rubbish, but hey, it'll have it.
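For what it's worth, with llama.cpp you don't even need to edit the file: the context window and RoPE scaling can be overridden at load time. A sketch (the model filename and scale factor here are illustrative, and quality past the trained window is another matter, as noted above):

```shell
# Hypothetical: stretch the advertised 200K window toward ~1M via YaRN RoPE scaling.
llama-server -m GLM-5.1-IQ4_K.gguf \
  --rope-scaling yarn \
  --rope-scale 5 \
  -c 1000000
```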
thepetek@reddit
Your optimistic take is correct. One problem with long context is that there's very little data to train on at that length. Expanding the default to 1M is more about starting to gather that data than about it being immediately useful.
AnOnlineHandle@reddit
I might be a little behind, but I never consider these models useful for huge contexts anyway. They're always good for brainstorming, asking for research pointers, and writing standalone functions in a small context window. I frequently restart conversations so the LLM's claims from the previous convo don't pollute the new one, then check whether it still says the same thing rather than doubling down.
e.g. just today, using Google's flagship Gemini Pro model in AI Studio, it claimed one thing about a public source repo and double-confirmed it. Then I started a new convo and asked if it was true; it researched it, said nope, and linked to the proof (I haven't had time to read the proof yet, so I don't know which was true).
nuclearbananana@reddit
What? Last I checked it was ~16K. Which is still insane, but 65K+ would be unusable for any model besides Opus, and even that would be very dumb.
alija_kamen@reddit
Most providers don't actually run inference with the full 1M context though. In Copilot for example you only get 160k for Opus.
RipperFox@reddit
I was experimenting with GLM 5.1 over the last three days with a Hermes agent, adding features to a llama.cpp fork.
I've found it to be incredibly token-efficient compared to, e.g., Qwen 3.6.
etherd0t@reddit
Which is to say... #1 among open-source models.
etherd0t@reddit
The bigger signal is this: GLM-5.1 appears to be the first open model sitting in the top 3 overall on Arena’s coding board, which is a meaningful milestone for open-weight coding models.
Auralore@reddit (OP)
You're correct tbh
Cool-Chemical-5629@reddit
GLM-5.1 at chat.z.ai is a hilarious overthinking procrastinator...
I asked it to code a simple 3D game and here are some of the funniest parts of its seemingly endless chain of thought and how it got itself tangled up in it along the way.
Every time it promised to start writing the actual code, it went ahead and started doing something completely different... 😂
1:
Let me write this now.
One concern:...
2:
Let me write the code now. I'll be thorough.
For the terrain height function that returns height at any (x, z) world position, I need to:...
3:
OK, let me write this all out now. I'll aim for a polished, complete experience.
Let me think about the fog too. I'll use exponential fog for atmospheric effect. The color should match the sky at the horizon.
Alright, here's my complete plan for the code structure:...
4:
Let me write this now. I'll be comprehensive but try to keep the code clean.
One more thing - I need to make sure the terrain normals are recomputed after displacing vertices, and that I use those normals for slope-based coloring. The order should be:...
5:
OK let me just write the code now. I've planned enough.
For the water, I want it to look nice. I'll use a MeshStandardMaterial with:...
6:
Let me finalize and write the code.
Actually, I realize I need to think about the terrain texture more carefully...
7:
OK, writing the code now for real.
One more consideration: the terrain mesh is going to be...
8:
Alright, here's the final code:
Let me also think about the fog and how it interacts with the sky. The fog color should match the horizon color of the sky for a seamless blend...
9:
OK, final plan. Let me write the code now.
Let me also think about the NPC models...
10:
OK I'm just going to write the full code now. Let me do it.
For the sky, I'll create a shader-based sky dome...
11:
OK, let me just write the code now. I'll aim for a warm afternoon sky.
For the water shader, I'll use:...
12:
Let me finalize and write the code.
One more thing: for the grass patches, I'll create...
13:
OK, I've planned enough. Let me write the complete code now. It's going to be long but well-structured.
Actually, I want to also add some rocks scattered around...
14:
OK, let me now write the complete code. I'll organize it into clear sections.
For the player controller, I need:...
15:
This should work well. Let me now write the full code.
I also want to add a sun (directional light) with shadows. But shadows on such a large terrain might be expensive. Let me skip shadows for performance and just use ambient + directional lighting.
Actually, let me add shadows but with a limited shadow camera frustum centered on the player. This way only nearby objects cast shadows, which is performant.
Hmm, implementing shadow following adds complexity. Let me skip shadows for now and focus on the visual quality of the terrain, water, and objects.
Alright, HERE IS THE FINAL CODE. Writing it now.
Let me be careful about a few things:...
16:
OK, let me write the code now. For real this time.
Let me think about the structure one more time:...
17:
Let me write it all now.
Actually, for trees, let me think about how to create them efficiently. I'll use two InstancedMeshes:...
18:
OK, I think I have everything figured out. Let me write the complete code now.
Let me also think about the buildings. I'll create simple stone houses:...
19:
Let me now write the actual code. I'll aim for about 500-600 lines of well-structured JavaScript.
One more consideration:...
20:
OK, starting to code now.
Actually, I realize that for a truly nice-looking terrain, I should use a custom shader...
21:
OK, writing the code now. For real.
Let me be practical about the code length. This is going to be a substantial file. I'll aim for clean, well-commented code that's organized into clear sections...
22:
OK this is getting really long. Let me just write the full code now in one go, keeping it organized but not over-commenting.
Let me think about a few more things:...
23:
OK, let me now write the complete code. I'll be thorough but efficient.
For NPC meshes, I'll create simple humanoid shapes. Let me use:...
24:
OK, let me write the complete code now. I'll add leg animation as a bonus.
Alright, I'm going to write the entire HTML file now. Let me be methodical and get everything right.
I just realized I need to be more careful about the water. The water plane should cover...
25:
OK, writing the complete code now:
Actually, let me think about the grass patches more carefully. I want them to look like small tufts of grass scattered on the terrain. Using InstancedMesh with a crossed-quad geometry:...
26:
OK, let me just write the code now. I've been overthinking this.
Here's my approach for the grass:...
27:
OK, I'm going to write the complete code now. No more deliberation.
Let me also think about the building placement. I want a small village in a flat area. I'll find a suitable location by sampling the terrain for a flat, above-water area...
28:
OK, let me now write the complete code. I'll keep building descriptions simple.
Let me think about one more thing: the NPC walking animation...
29:
OK, I'm going to write the complete code now. Let me organize it and write it all at once.
Actually, let me also think about the initial player position. I need to find a good spawn point that's above water and on relatively flat terrain...
30:
Let me also think about collision detection. For a simple open world, I don't need complex collision. The player just follows the terrain height. I won't implement collision with buildings or trees - the player can walk through them. This simplifies the code significantly.
Actually, let me at least prevent the player from walking through buildings. I'll add simple AABB collision for buildings. For trees, the player can walk through them (they're just visual).
Hmm, building collision adds complexity. Let me skip it and focus on the visual experience. The player can walk through everything except the terrain.
OK, WRITING THE CODE NOW.
Let me estimate the code length. This will probably be 600-800 lines of JavaScript plus HTML/CSS. That's a substantial single file but manageable.
Let me structure the code clearly with section headers.
One thing I want to make sure: the water looks good. Let me use a custom shader for the water with:...
31:
OK, let me write the complete code now. I'm going to do it in one shot.
Let me also decide on the NPC count...
32:
OK, I'm going to write the full code now. Let me be thorough and get it right.
One more consideration: I need to make sure the game loop is smooth. I'll use requestAnimationFrame with delta time capping (to prevent huge jumps when the tab is inactive):...
33:
Alright, here's the complete code. Let me write it.
Hmm, actually I realize I should also think about the tree placement more carefully...
34:
OK, I think I have everything figured out. Let me write the code now.
I also want to add some flowers scattered in the grass...
35:
OK, let me now write the COMPLETE code. No more planning.
Let me estimate the performance:...
36:
Let me write the code now.
Wait, one more thing about the water plane. I need to make sure it's at the correct Y level...
37:
OK, writing the code now. Final version.
Let me think about one more optimization: for the grass...
38:
OK, HERE IS THE CODE. Writing it now for real.
I need to be careful about the Three.js import. Let me use version 0.160.0 which is well-tested.
Actually, let me also add a minimap!...
39:
OK, I'm going to write the complete code now. Let me aim for a comprehensive, polished experience.
Let me also add NPC markers on the minimap...
40:
Let me write it all. This is going to be long but I'll keep it clean.
OK I need to stop planning and start writing. Let me do it.
Actually, I just thought of something...
41:
OK, writing the code now. For real. Final time.
I think I also want to add a slight ambient animation to make the world feel alive. Maybe:...
42:
OK, writing the code now. FINAL.
Actually, you know what, let me also add some particle effects. Small floating particles (dust motes, fireflies, etc.) would add a lot of atmosphere. I can...
43:
This is comprehensive. Let me write it.
Hmm, I realize the flowers might be too much. Let me skip flowers and focus on the core features. I can always add them later.
Actually, flowers are easy to add with InstancedMesh. Let me include them...
44:
OK, writing the code now. Let me go.
I'm going to structure the code as a series of function calls in order, with each function creating and adding its elements to the scene.
Let me also make sure the player starts at a good location. I'll find a flat area above water for the spawn point.
Here's the final approach for the player spawn:...
45:
OK let me write this. I'll write it as a continuous script.
For the sake of keeping the response manageable but complete, let me write efficient code without excessive comments, but with clear section headers.
Let me go.
---
This time it started writing the actual code, but stopped generating in the middle of it for whatever reason, probably ran out of context window lol
frozen_tuna@reddit
I'd be curious to see how many tokens each of these use during the benchmark too. I used gpt 5.4 for exactly one prompt and after seeing how much it read and generated, I decided to switch away from it and never look back, regardless of benchmarks.
cafedude@reddit
Hoping for a GLM-5.1-flash
Several-Tax31@reddit
And it being open source! If it's similar to GLM-4.7-flash in size, we can easily run it.
jinnyjuice@reddit
Flash was about 1/10th the size of 4.7. By that ratio, a GLM 5.1 Flash would be around 150 GB, and a 4-bit quant would bring that to roughly 50 GB.
Though these days image + audio capabilities are the trend, so it would likely end up higher than 50 GB.
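As a sanity check on those numbers, the size arithmetic is just parameters times bits per weight. A quick sketch (the 754B parameter count comes from elsewhere in this thread; the ~75B "flash" size is the 1/10th guess above, not an announced model):

```python
def weight_gb(params_billion: float, bits_per_weight: float) -> float:
    """Approximate on-disk size (GB) of a weight dump:
    billions of params * bits per weight / 8 bits per byte."""
    return params_billion * bits_per_weight / 8

# 754B params at BF16 (16 bits/weight) -> ~1508 GB, matching the ~1.5 TB figure
print(round(weight_gb(754, 16)))   # 1508
# a hypothetical 1/10th-size "flash" (~75B) at BF16 and at a ~4.5-bpw quant
print(round(weight_gb(75, 16)))    # 150
print(round(weight_gb(75, 4.5)))   # 42
```

(Real GGUF files land a bit off these numbers because different tensors get different bit widths, but it's close enough for napkin math.)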
ShelZuuz@reddit
Is it really higher than GPT-5.4?
zhdc@reddit
Wouldn't agree in real-world usage. It's good, but it's not as good as GPT 5.4. For what it's worth, I'm not convinced that Opus and Sonnet are better than 5.4 either.
That said, it's great for what it costs. MiniMax 2.5 is also worth looking at.
Haiku-575@reddit
Writing arcane VBA scripts for PowerPoint (yeah, gross, I know), Opus can kinda do it but makes a lot of mistakes. GPT 5.4 fails outright even with specific instructions. GLM 5.1 one-shot a complete rewrite and closed the ticket.
You need a lot of anecdotes before you can call it "data", but my universal experience has been GLM 5.1 and Opus > GPT 5.2 or 5.4.
Ecsta@reddit
5.4 is better if the task is straightforward and well-defined, i.e. raw programming. For anything that needs any sort of discussion or planning, Opus blows it out of the water. In my personal experience.
SadEntertainer9808@reddit
I'd put money on it being benchmaxxed to hell, but maybe I'm wrong.
susibacker@reddit
Arena is user ratings in blind tests iirc
dictionizzle@reddit
there are ways to arenemaxx a model
vazyrus@reddit
How?
-dysangel-@reddit
I've been using the GLM coding plan for months now. Of course, who am I to say whether the models are benchmaxxed, but I can say they feel on par with, or even better than, Claude in some areas. When I briefly tried Claude again, I felt like it needed more babysitting than GLM.
SadBBTumblrPizza@reddit
Are you getting random stops and halts with the coding plan? I'm on a grandfathered lite plan and the service just times out constantly and always has.
-dysangel-@reddit
Yep, the service levels have been pretty terrible since GLM 5 dropped. I've just gotten in the habit of using /compact regularly in Claude Code. The model itself, though, is fantastic.
SadEntertainer9808@reddit
Would be absolutely delighted if there were an open model that was competitive with the frontier.
procgen@reddit
yeah 5.4 on codex mogs claude code + opus in realworld use. smells like benchmaxxing
yaboyyoungairvent@reddit
LLM arena is based on blind usage by users. I don't think it's as simple as saying it was benchmaxxed. If you wanted to, you could go on there yourself and compare it alongside Claude Opus.
One_Internal_6567@reddit
Of course it’s not even close
hp1337@reddit
How does it compare to Qwen 3.5 397B?
Terminator857@reddit
Qwen 3.5 397B is way down on the list at number 25. https://arena.ai/leaderboard/text/coding-no-style-control
_derpiii_@reddit
wow. am I imagining things or wasn’t Qwen a top coding model just a week ago?
Terminator857@reddit
I wonder if that was 3.6.
volleyneo@reddit
You people have more than 16gb vram 🫠
Fair_Ad845@reddit
A Q4 quant should fit in 16GB if the model is around 32B. The real question is whether the quant kills the code quality that got it to the top of the arena.
LeRustMan@reddit
GLM-5.1 has 754B parameters, lol. The mixed-precision BF16 model is 1.5TB, and it's still 206GB at IQ1. Safe to say this is out of reach for local unless you hijacked a shipment of B200s.
Beamsters@reddit
512GB couldn't even run this thing at 8-bit.
VoidAlchemy@reddit
but it can run it at smol-IQ4_K: 405.502 GiB (4.621 BPW) without sacrificing much quality. Though it's chonky; with A40B, even quantized, it slows down with more than 32k of KV cache (I'm running it CPU-only). https://huggingface.co/ubergarm/GLM-5.1-GGUF
themoregames@reddit
I am very surprised Gemini 3.1 Pro ranks higher than GPT-5.3-Codex
I only had a brief episode with 5.3-Codex before it got replaced by GPT-5.4, but I know Gemini 3.1 Pro very well.
Almost everything I do with Gemini 3.1 Pro is probably "watered down" somewhat by automatic model selection in Gemini-CLI. But I really have to ask Claude or GPT-5.4 for reviews of like 98% of what Gemini does for me, and those reviews often reveal a lot of problems.
I am not yet sure how well Gemini 3.1 Pro does Code Reviews, but at least it seems it can be steered very well, e.g. by telling it to be "pedantic" and why and what to look out for.
For individual developers, it seems somewhat difficult to compare models. A $200 OpenAI subscription or even a $200 Claude Max subscription is probably good enough for most private use, while at work your employer most probably decides which AI agents you get.
Kaljuuntuva_Teppo@reddit
A bit strange that gpt-5.4 xhigh is not on the list, but high is.
DonnaPollson@reddit
Arena wins are useful signal, but the more interesting shift is that open models are becoming boring in the best possible way: good enough to enter real workflows instead of just benchmark arguments. Once a model is consistently near the top, the next battleground is inference cost, tool use, and how well it survives messy production prompts. Rankings get attention, reliability gets adoption.
mambo_cosmo_@reddit
Insane that it beats chatGPT and Gemini by such a landslide
VoiceApprehensive893@reddit
Arena scores are kinda weird, like Gemma being #10 in code.
Opus is higher than Gemini on vision even though it hallucinates a lot on docs that Gemini ingests with no issues.
Character_Wind6057@reddit
I'm not really surprised about GLM 5.1 beating Gemini 3.1 Pro; it's really shit in comparison. Gemini, at least for me, shines for studying and vision. For everything else, it's meh.
-p-e-w-@reddit
I do think it’s rather surprising that a Chinese startup is beating the flagship model made by the world’s most powerful technology company that literally invented transformers.
And Google is heavily marketing Gemini as a coding model. This is all they've got, and it's coming up short.
Character_Wind6057@reddit
For me it's absurd how the company with the biggest search engine and the most data can't even make its own model search the internet instead of hallucinating. Or worse, you ask it to search the internet and then read in its reasoning something like "I'll simulate an online search".
Anyway, it's better for us. Competition keeps prices low and innovation high.
IrisColt@reddit
You nailed it!
Mkengine@reddit
Thank god I'm not the only one experiencing this. I don't know why I would ever pick the sycophantic model that can't properly search the web or code when there's ChatGPT or Claude. Nano Banana is really the only part of Gemini that's useful for me.
logicchains@reddit
It's not surprising if you've used Angular, or TensorFlow, or BigQuery, and seen how they ruined them.
TheRealMasonMac@reddit
I think ZAI genuinely has something top-tier in their training regime that not even the frontier labs have. Their models are the best for general assistant work by a long shot.
yaboyyoungairvent@reddit
Gemini is relatively bad at coding compared to the top players GPT and Claude.
mambo_cosmo_@reddit
Yet it's made by the largest data company in the world, the one that invented transformers and markets the model as a coding agent too...
eli_pizza@reddit
5.1 seems great, but I'm skeptical that an arena is a good approach to evaluating a coding model. Isn't it biased toward whichever answer looks better?
MainFunctions@reddit
"Ranks #1 in Code Arena", when it clearly ranked third. And yes, I understand it's "#1 open model", but why be greasy like that? Just say it's #3 and open weights. That's very impressive on its own.
unjustifiably_angry@reddit
Got literacy?
unjustifiably_angry@reddit
Is that Sonnet and Opus today, or Sonnet and Opus from 2 weeks ago before they were lobotomized?
VonDenBerg@reddit
5.1 is legit. So legit I want to get a second z.ai account so I can run concurrent operations.
ninjaeon@reddit
I agree and did the same, but in a different way.
The Z.ai coding plan has been dog s**t lately, with servers being overloaded. So, of two other providers offering glm-5.1 on coding plans (OpenCode Go & Ollama Cloud), I ran a test on quota consumption and found Ollama Cloud to be about 8x cheaper per request.
Then I set up the Omniroute API router with both Z.ai and Ollama Cloud in a combo that prioritizes Z.ai coding plan usage first, then falls back to Ollama Cloud when Z.ai fails for whatever reason (which happens every day, often repeatedly). Working great so far!
If I didn't have Z.ai coding lite already prepaid for a year at $27, I would just go straight to Ollama Cloud at $20/month. And this is from someone who hates Ollama in general, until now and just for this specific use case.
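The prioritize-then-fallback logic described above is simple enough to sketch. A minimal version (the provider names and `send` callables are stand-ins for illustration, not Omniroute's actual config):

```python
def call_with_fallback(providers, prompt):
    """Try each provider in priority order and return (name, reply)
    from the first one that succeeds; raise only if all of them fail."""
    errors = []
    for name, send in providers:
        try:
            return name, send(prompt)
        except Exception as exc:  # overloaded, timed out, etc.
            errors.append((name, exc))
    raise RuntimeError(f"all providers failed: {errors}")

def flaky_primary(prompt):
    # stand-in for a coding-plan endpoint that is currently overloaded
    raise TimeoutError("server overloaded")

def healthy_fallback(prompt):
    # stand-in for the fallback provider
    return f"reply to: {prompt}"

providers = [("z.ai", flaky_primary), ("ollama-cloud", healthy_fallback)]
print(call_with_fallback(providers, "hello"))  # ('ollama-cloud', 'reply to: hello')
```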
MeYaj1111@reddit
Thanks for this!
zhdc@reddit
Great advice. Trying Ollama Cloud now.
VonDenBerg@reddit
Fucking a. I’ll give this a go, appreciate writing it up.
Leafytreedev@reddit
I've been looking to sign up for a z.ai account, but I've been reading lately that they're notorious for serving lower quantizations of their models and other enshittifications after the initial hype of a model release. Not sure how true that all is, but it definitely has me second-guessing an $80/month subscription. Have you noticed any quality drops over time, and how long have you been subscribed?
DinoAmino@reddit
What's the methodology here? I don't see mention of individual benchmarks. But I do see a total of "votes". So it's a popularity contest?
toffee0_0@reddit
It's subjective scoring with LMArena. The methodology is basically: give a prompt, two models answer the same question (A/B), and if you like A, you vote for A. So yeah, it's not objective.
cafedude@reddit
I'm thinking it was mostly a publicity stunt.
zhdc@reddit
I have GLM 5.1, GPT 5.4, and Claude Sonnet running side by side. GPT 5.4 > Sonnet > GLM 5.1.
However, GLM 5.1 is still very, very good compared to other open source/third party models.
milkipedia@reddit
GLM 5 Turbo not being on the list has me wondering. It's been pretty good for me.
Tight-Requirement-15@reddit
elephant: address me meme
satireplusplus@reddit
The 4-bit quant of this doesn't fit in 256GB.
The full 16-bit model is 1.5TB, lmao.
Terminator857@reddit
The one thing GLM 5.1 is way above the rest of the models at is code review. Give it a try. It's like 3x better than any other model at code reviews.
Hot-Employ-3399@reddit
I hope they'll release a GLM-5.1-flash then, so we can run it on our GPUs.
Technical-Earth-3254@reddit
I really like GPT 5.4, 5.3 Codex, and 5.1 Codex Max, so seeing that GLM beats all of them is very nice. Wish I could run it locally, but I take what I can get. I'm probably getting their coding subscription then.
OmarBessa@reddit
It's an amazing model, well deserved
masterlafontaine@reddit
Let's support the models, guys
Efficient-Series-939@reddit
better than Sonnet 4.6... sure buddy.
Serprotease@reddit
Honestly, yeah, it's better than Sonnet 4.6.
The fact that z.ai now compares it to Opus rather than Sonnet tells you how confident they are in its capabilities. Also, it's interesting to see Anthropic talking up mythos now that Chinese models have closed the gap with Sonnet.
bruhhhhhhhhhhhh_h@reddit
Great model.