Why are proprietary frontier models (like Opus and GPT-5.4) so much better at long-running tasks than open-source models?

Posted by asian_tea_man@reddit | LocalLLaMA | View on Reddit | 20 comments

This is something I don't quite understand, and I'm hoping someone can steer me in the right direction here.

Why is it that closed-source models like Opus 4.6 and GPT 5.4 are so much better at long-running agentic tasks than open-source leaders like GLM 5 and Kimi 2.5?

In benchmarks, the open source models are quite close to their proprietary counterparts. Like, in the first 60k tokens, quality of output from models like GLM 5.1 is on par with output from Opus 4.6 (and in some cases I've found GLM's output to be better, especially with front-end stuff).

Yet, with GPT 5.4, I can give it a complex feature story, have it work for 1.5 hours (I've done this before), and then come back and see it's built a fully complete, complex feature.

Another example: I wanted GPT 5.4 to build me an engine that converts HTML/CSS into a complex proprietary Application Data schema for a no-code web dev platform. I provided a few references, i.e. sample HTML/CSS and its corresponding schema, and had it keep running until it built me a converter that reliably converts between the two. It took 2 hours and produced a 100% working version. This really shocked me.

The same can't be said about even GLM 5.1. The open models (I know GLM 5.1 isn't open source yet) seem great at first, but after a compaction it all falls apart.

The thing is, the closed-source models don't have larger context windows than the open-source ones. And Codex/Claude Code frequently auto-compact.

I've seen GPT 5.4-High undergo like 10 compactions and still maintain focus.
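For anyone who hasn't dug into this: my rough mental model of what a harness's auto-compaction step does is something like the sketch below. This is a guess at the general mechanism, not actual Codex or Claude Code internals, and `summarize` is a stand-in for a real LLM summarization call:

```python
# Hypothetical sketch of auto-compaction in an agent harness.
# When the conversation exceeds a token budget, everything except the
# most recent messages is collapsed into a single summary message.

def estimate_tokens(messages):
    # Crude heuristic: roughly 4 characters per token.
    return sum(len(m["content"]) for m in messages) // 4

def summarize(messages):
    # Placeholder for an LLM call that condenses old messages into a
    # summary; here it just joins truncated snippets of each message.
    snippets = [m["content"][:40] for m in messages]
    return {"role": "system",
            "content": "Summary of earlier work: " + " | ".join(snippets)}

def maybe_compact(messages, limit=100_000, keep_recent=10):
    """If the context is over budget, replace all but the most recent
    messages with one summary message; otherwise return it unchanged."""
    if estimate_tokens(messages) <= limit:
        return messages
    old, recent = messages[:-keep_recent], messages[-keep_recent:]
    return [summarize(old)] + recent
```

If something like this is what's happening, then how well the run survives a compaction depends on two model-dependent things: how good the summary is, and how well the model can pick the task back up from a lossy summary, which might explain why the same harness behaves so differently across models.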

So I'm assuming it's the memory layer, then? But the memory layer isn't dependent on the LLM, right? So does this mean the harness is doing the heavy lifting with regard to long-running tasks?

But then if it's the harness doing the auto-compaction and guiding the model, wouldn't we expect similarly good performance from, say, GLM 5 running in Claude Code or Codex?

I guess I'm confused about how the memory layer and auto-compaction work in Claude Code and Codex. If there are any good videos or readings on the application/auto-compaction side of things specifically, I'd love to learn more. Thanks!