"Actually wait" ... the current thinking SOTA open source
Posted by FPham@reddit | LocalLLaMA | View on Reddit | 34 comments
I'm trying GLM 5.1, but is it just me, or does this thing really just work by over-cranking thinking to almost ridiculous heights?
For the last 20 minutes it has been writing novellas about what it is going to do, full of "Uhm", "Actually wait", and "but no...", and I really just asked it to write an owner-drawn CButton with different colors.
Now don't get me wrong, in the end it seems to get there, but I'm just having my own "Actually wait" thinking moment:
Is this the way they made it so smart?
Meanwhile, other models like Claude (the $20 plan is now just a total ripoff: the tokens get spent in 15 minutes, then you wait for hours) or ChatGPT (I currently prefer Codex over CC; honestly it feels just as smart) simply give you the answer almost right away for such simple things.
Edit: 30 minutes and >100k tokens in, and now it starts writing CThemedButtonCtrl.
crantob@reddit
Sir, this is LocalLLaMA, not a Wendy's. Downvote.
FoxiPanda@reddit
Yeah, I think there is a lot of what is basically recursion going on inside these models that lets them "hmm..." for much, much longer, but I have to say I can't argue with the results... especially if you can run those models locally. gently pats multiple Mac Studios sitting on the desk
FPham@reddit (OP)
That alone is of course the "wow" part, and I'm not trying to diminish it in any way, although with my 128GB Mac Studio I can't run it. But yes, in theory, for $10k or whatever it is, we can run it at home in some capacity (although I'm not sure Q2 would cut it as a coder...).
FoxiPanda@reddit
Honestly, I've found the UD_Q2_K_XL variant of GLM-5.1 to be shockingly competent. I dunno if it's just such a huge model that it doesn't suffer immensely from quantization or what, but it's... kinda good? It has limitations, of course, but they seem quite muted.
FPham@reddit (OP)
That can somehow be run on 256GB, right?
FoxiPanda@reddit
Yeah, I have two 256GB machines and one 512GB, and I run it on the 512. I've actually been experimenting with exo and a few other tensor-sharding tools to see if I can run it across multiple machines to speed things up (curse you, memory bandwidth)... so far I'm getting ~20 tok/s, which isn't abysmal.
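For napkin math on which quants fit in which machine, a minimal sketch: model size is roughly params times bits-per-weight divided by 8, plus some overhead. The ~754B parameter count comes from elsewhere in the thread; the 2.7 bits/weight average for a Q2_K-style mixed quant and the 10% overhead factor are assumptions, since real GGUF sizes vary by quant recipe and context length.

```python
# Napkin math: does a quantized model fit in a given amount of unified memory?
# Size ~= params * bits_per_weight / 8, plus overhead for KV cache and metadata.
# The bits/weight figures below are rough assumptions, not measured GGUF sizes.

def model_size_gb(params_billion: float, bits_per_weight: float,
                  overhead: float = 1.10) -> float:
    """Approximate in-memory size in GB for a quantized model."""
    return params_billion * bits_per_weight / 8 * overhead

q2_size = model_size_gb(754, 2.7)   # ~280 GB: over 256 GB, fits in 512 GB
q4_size = model_size_gb(754, 4.8)   # ~500 GB: barely squeezes into 512 GB
print(f"Q2-ish: {q2_size:.0f} GB, Q4-ish: {q4_size:.0f} GB")
```

Which lines up with running the Q2-class quant on the 512GB node rather than the 256GB ones.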
q-admin007@reddit
You have lots of RAM, then, and little compute. A perfect case for a draft model?
I have a Strix Halo with 128GB VRAM; I could speed up Gemma 4 31B UD_Q8_K_XL from 5.9 to 20 t/s output. The question is, is something like it available for Macs:
https://docs.google.com/spreadsheets/d/1NzZC4JShGluwH2fdjlMbZ2ke99AcTctUnM7rG12_cYE/edit?gid=1361824152#gid=1361824152
FoxiPanda@reddit
Yeah, I also have a couple of RTX 5090s, and I have seriously considered seeing if I can get draft models working between the two setups. I haven't done it yet, but it seems like an excellent pairing: something like a ~4B-9B model on the 5090(s) drafting for a 754B model like GLM-5.1 sharded across the Studios, to get maybe ~30-40 tok/s.
With various tricks, I have managed to get 16-22 tok/s out of GLM-5.1 at decent quantizations so far, using only the 512GB node. Being able to use a draft model is really appealing, as I might be able to get to ~30 tok/s for a lot of use cases, and at 30 it becomes genuinely usable for everyday tasks.
FPham@reddit (OP)
My problem is (and I spun up GLM 5.1 yesterday in the cloud) that while it might be the best open-source coding model, it is, well, what can I say without offending people... lacking? I asked it to fix some code in parallel with Sonnet, and GLM 5.1 not only took forever, it ended up hallucinating functions and code that do not exist (yes, the old trick: just invent a convenience function from a fictional library, as if what I asked for were already implemented, just to shut me up), while Sonnet did it on the first try, in about a minute.
So I'm quite surprised by the whole "This is nearly as good as Opus." I'm not an Anthropic fanboy, and CC on the Pro plan is basically unusable now, but I just can't see myself using GLM; this feels like torture when a paid alternative exists. I can't imagine how it works quantized and at 22 tok/s when it eats tokens for lunch thinking.
FoxiPanda@reddit
Interesting, I have not actually experienced the hallucinated functions and code thing... but I've definitely had GLM ignore my instructions. Then again, I've had Opus outright ignore my instructions too, so not much new there.
GLM DEFINITELY takes a while and does a lot more tool calls and pondering; it's not a "straight to code that works" sort of thing...
If you don't like it, though, shrug? I dunno, you probably have to pay Anthropic's king's ransom to keep doing the thing you do, or wait for another model.
FPham@reddit (OP)
Not entirely bad for a home setup, well... sadly, I'm far from buying a 512GB Mac.
cakemates@reddit
They all do this. If you expand Claude's thinking, it's an essay per prompt eating your tokens; Gemini probably does it less, in my limited use of it. I haven't tested ChatGPT.
FPham@reddit (OP)
In all honesty, and probably an unpopular opinion, Codex rocks in my case. It works, and the limits on the baby plan are not bad.
RealLordMathis@reddit
I don't know what changed, but I started using GLM 5.1 when it got added to the z.ai coding plan, and it was amazing: basically Sonnet 4.5 level. It was also reasonably fast and did not overthink. Then something changed, and I got the same 20 minutes of "wait, actually..." where it never really does anything. I'm using the same API and the same coding harness. I don't have the HW to run it locally.
FPham@reddit (OP)
I can also only run it in the cloud, and yesterday was my first try, and boy, did it NOT perform well. We even got into hallucinated convenience functions when it couldn't fix the issue ("just call engine.ThisFixesEverything(file)"), but more worrying was that EVERY single time the code was non-functional or had errors, and I needed to go through the whole 20-minute charade of "Wait, I think I'm a sliced cheese, not an AI" a few times until the code finally worked. It feels like pulling an elephant through the eye of a needle... Yes, call me spoiled, but I have a very different experience with Codex and CC.
sleepy_roger@reddit
Yeah... "Ok, writing the code For Real this time!"
I'm running it locally at Q2 and Q1, and some tasks are taking over an hour due to the crazy thinking. I turned thinking off and actually didn't have terrible results.
MrRandom04@reddit
Q2 and Q1 are absolutely gonna obliterate the intelligence of the model for any serious work...
FPham@reddit (OP)
I'm using it from the cloud. I can't run even Q2 on 128GB.
sleepy_roger@reddit
You'd be surprised. Yes, normally that's the case, but the Unsloth quants are actually amazing. I've got a Max coding plan with z.ai as well, but Q2's design on the same prompt came out better than the API's.
Normal-Ad-7114@reddit
Q2 can still be fine but Q1 is too far
FPham@reddit (OP)
Now, that's interesting... I mean, the code at the end wasn't bad, and it really went through places and corrected itself, but my task wasn't rocket science either.
codeprimate@reddit
It’s kind of like allowing a broader search across the neural network to make up for holes in training.
TheRealMasonMac@reddit
Hey, in fairness, it's really smart! It does overthink simple, straightforward stuff, though, and I realized it's better to just use M2.7 (which is actually REALLY good at doing exactly what you tell it to do).
cantgetthistowork@reddit
MM and Qwen are the champions of gaslighting and will fight you to the death saying they completed a task that they didn't
Logical_Two_7736@reddit
I saw GLM 5.1 think “THIS IS THE ACTUAL FINAL FINAL CODE” as if we haven’t all been there lol
bakawolf123@reddit
There was a recent paper by Apple: you can improve a model with its own thought process. So yeah, this is a proven tactic that currently works for scaling these models further and further, both at training and inference time.
Remember Qwen 3.5 thinking?
Even before Qwen there was a nice small Chinese model, Nanbeige, quite capable at agentic work, but under the hood it would write at least a dozen poems in its thinking if you asked it for one.
Everyone has jumped on this tactic by now, and I think there will be even more scaling wars, because "we need more compute" so that the thinking block will be less annoying.
Khaos1125@reddit
Assuming you mean the self-distillation paper, the effect was much larger for non-thinking model variants than for thinking variants, and the paper was testing smaller models overall (8B dense, or a 30B MoE with 3B active params, IIRC).
It’s very unlikely that the self-distillation strategy would work for larger models in thinking mode, since it seems to operate primarily by rebalancing away from “dumb mistakes” at key trigger points in the code while maintaining diversity of options (see their forks-vs-locks discussion).
FPham@reddit (OP)
This also means we need to budget ~100k tokens for simple stuff, so this is not a freebie. What CC or Codex does with 20k, GLM pushes past 100k (right now I'm at 150k for a single task).
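At local decode speeds, that token overhead translates directly into wall-clock time. A trivial sketch, using the token counts quoted in this thread and the ~20 tok/s local rate mentioned earlier:

```python
def task_minutes(total_tokens: int, tok_per_s: float) -> float:
    """Wall-clock minutes to generate total_tokens at a fixed decode rate."""
    return total_tokens / tok_per_s / 60

# A Codex-like 20k-token task vs. a GLM-like 150k-token task at 20 tok/s:
print(f"20k tokens  @ 20 tok/s: {task_minutes(20_000, 20):.0f} min")
print(f"150k tokens @ 20 tok/s: {task_minutes(150_000, 20):.0f} min")
```

Which is roughly the difference between a coffee break and the "over an hour per task" reported below for local runs.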
wolframko@reddit
so cc and codex will be doing 100k+ soon. Like Claude Mythos.
segmond@reddit
You are probably annoyed because it's slow. Turn off thinking; turn it on only for really difficult problems.
FPham@reddit (OP)
I'm not that annoyed; I'm mostly annoyed because my Codex and CC are out of weekly limits :(. But how do I switch off thinking in opencode?
segmond@reddit
Oh, I thought you were running locally. I don't know then; everything is local for me.
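For local llama.cpp setups, recent llama-server builds expose a reasoning-budget flag that asks the chat template to suppress the thinking block. A sketch, assuming a recent build and a hypothetical GGUF filename (flag availability varies by version and model, so check your build's --help):

```shell
# --reasoning-budget 0 disables thinking; -1 (the default) leaves it unlimited.
# The model path here is hypothetical.
llama-server -m GLM-5.1-UD-Q2_K_XL.gguf --reasoning-budget 0 -c 65536 --port 8080
```

For hosted OpenAI-compatible endpoints, some servers instead take a chat-template kwarg (e.g. vLLM-style `chat_template_kwargs: {"enable_thinking": false}`); whether a given coding harness can pass that through is a separate question.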
SocialDinamo@reddit
I do feel like there is a stark difference in the amount of thinking tokens between Qwen 3.5 and Gemma 4.
DeltaSqueezer@reddit
Are you using claude code? If so, do you disable the 'force thinking always' flag?