We gave 12 LLMs a startup to run for a year. GLM-5 nearly matched Claude Opus 4.6 at 11× lower cost.
Posted by DreadMutant@reddit | LocalLLaMA | View on Reddit | 100 comments
We built YC-Bench, a benchmark where an LLM plays CEO of a simulated startup over a full year (~hundreds of turns). It manages employees, picks contracts, handles payroll, and survives a market where ~35% of clients secretly inflate work requirements after you accept their task. Feedback is delayed and sparse, with no hand-holding.
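For a rough idea of the environment dynamics, here's a toy sketch. This is not the actual benchmark code; the payouts, costs, and inflation multiplier are made-up illustrative numbers, and only the ~35% inflation rate comes from the description above.

```python
import random

random.seed(0)

INFLATION_RATE = 0.35  # fraction of clients who secretly inflate scope (from the post)

def run_year(decide, turns=300, funds=200_000):
    """Run one simulated year; `decide` maps a contract offer to accept/reject."""
    for _ in range(turns):
        offer = {"payout": random.randint(5_000, 50_000),
                 "cost": random.randint(2_000, 30_000)}
        if decide(offer):
            cost = offer["cost"]
            if random.random() < INFLATION_RATE:
                cost = int(cost * 1.8)  # hidden scope inflation after acceptance
            funds += offer["payout"] - cost
        if funds <= 0:
            return 0  # bankrupt
    return funds

# A greedy baseline: accept anything that looks profitable on paper.
final = run_year(lambda offer: offer["payout"] > offer["cost"])
print(f"final funds: ${final:,}")
```

The point of the toy version: a policy that ignores the hidden inflation rate systematically overestimates margins, which is exactly the trap the real benchmark sets.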
12 models, 3 seeds each. Here's the leaderboard:
- 🥇 Claude Opus 4.6 - $1.27M avg final funds (~$86/run in API cost)
- 🥈 GLM-5 - $1.21M avg (~$7.62/run)
- 🥉 GPT-5.4 - $1.00M avg (~$23/run)
- Everyone else - below starting capital of $200K. Several went bankrupt.
GLM-5 is the finding we keep coming back to. It's within 5% of Opus on raw performance and costs a fraction as much to run. For anyone building production agentic pipelines, the cost-efficiency curve here is real, and Kimi-K2.5 actually tops the revenue-per-API-dollar chart at 2.5× better than the next model.
The benchmark exposes something most evals miss: long-horizon coherence under delayed feedback. When you can't tell immediately whether a decision was good, most models collapse into loops, abandon strategies they just wrote, or keep accepting tasks from clients they've already identified as bad.
The strongest predictor of success wasn't model size or benchmark score; it was whether the model actively used a persistent scratchpad to record what it learned. Top models rewrote their notes ~34 times per run. Bottom models averaged 0–2 entries.
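For readers asking what "using a scratchpad" means mechanically, here's a minimal sketch of the pattern. This is illustrative only, not our harness code; the note format and overwrite policy are assumptions.

```python
class Scratchpad:
    """Persistent notes an agent reads before each decision and rewrites after."""

    def __init__(self):
        self.notes: dict[str, str] = {}
        self.rewrites = 0

    def read(self) -> str:
        # Injected into the prompt at the start of every turn.
        return "\n".join(f"- {k}: {v}" for k, v in self.notes.items())

    def write(self, key: str, lesson: str) -> None:
        self.notes[key] = lesson  # overwriting = "rewrote their notes"
        self.rewrites += 1

pad = Scratchpad()
pad.write("client:acme", "inflated scope on last contract, be careful")
pad.write("strategy", "keep 3 months of payroll in reserve")
pad.write("client:acme", "inflated scope twice now, hard blacklist")

print(pad.read())
print("rewrites:", pad.rewrites)
```

The models that failed were essentially running this loop with `write` never called, so every turn started from the same blank slate.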
📄 Paper: https://arxiv.org/abs/2604.01212
🌐 Leaderboard: https://collinear-ai.github.io/yc-bench/
💻 Code (fully open-source): https://github.com/collinear-ai/yc-bench
Feel free to run it on any of your own models, and I'm happy to reply to your queries!
Witty_Mycologist_995@reddit
Now do Gemma 4 26b
Fault23@reddit
$0K 🥀🥀🥀
StupidScaredSquirrel@reddit
Amateur, Openai manages to do $00.000.000 per year
Fault23@reddit
Damn, to reach their level i need to profit 1000 times more per year
DreadMutant@reddit (OP)
Yes, we will be updating the leaderboard regularly with latest models
Lazy-Pattern-5171@reddit
And 31B
eliko613@reddit
This is fascinating work - the cost-efficiency analysis really highlights something more and more teams struggle with as they move agentic systems into production.
The GLM-5 vs Claude Opus comparison is particularly interesting because it shows how non-obvious the cost-performance tradeoffs can be until you actually measure them systematically. Most teams I've talked to are flying blind on these economics, especially for multi-turn scenarios where costs compound.
What's striking about your benchmark is how it simulates the real challenge of production agentic systems - you can't just optimize for single-turn performance, you need to understand how costs accumulate over hundreds of interactions. The $86/run vs $7.62/run difference becomes massive at scale. We started using zenllm.io for complex agentic flows to get better cost observability and it's been decent so far.
A few questions on your methodology:
1. Did you track token usage patterns across the different models? Curious if the cost differences came from efficiency in reasoning vs just different pricing tiers.
2. For the persistent scratchpad usage - did you notice any correlation between scratchpad verbosity and token consumption?
This kind of systematic evaluation is exactly what the industry needs as we move beyond toy demos into real production deployments where unit economics actually matter.
NOLO-App@reddit
Should I test NOLO?
sbaxEE@reddit
Are there patterns in how the scratchpad is used or not used by specific models?
DreadMutant@reddit (OP)
Yes you can check out the paper for all the details!
CalligrapherFar7833@reddit
Great, you just lost $60k in revenue by not running opus but saved $79 in expenses by running glm. What kind of stupid ass leaderboard is that? Why would anyone with any business sense think that anything but the profit : revenue : expenses ratio is the driving factor?
9gxa05s8fa8sh@reddit
it's a benchmark of intelligence, not an actual business
CalligrapherFar7833@reddit
Intelligence for losing money ?
9gxa05s8fa8sh@reddit
no, those money numbers are a simulation
CalligrapherFar7833@reddit
It lost money on it. If money isn't part of it, then it shouldn't be ranked by it.
bambamlol@reddit
True. And choosing Kimi K2.5 (the leader in terms of revenue-per-API-dollar) would have been more than 10x as stupid, since you'd have lost 861k in revenue by "saving" around $84 in API costs.
almbfsek@reddit
can't you tweak the instructions so that the "bottom" models use the scratchpad more consistently? wouldn't that make their scores much higher? if so, this test is only measuring whether a model can follow instructions or not
Due-Memory-6957@reddit
And being able to follow instructions is vital
almbfsek@reddit
definitely is but you don't need a sophisticated benchmark like this for it. so I assume they were trying to measure general intelligence not just instruction following but I feel like the results are tainted by the scratchpad use/non-use.
DreadMutant@reddit (OP)
Thanks for the results! If you can share the trajectories, we can update the leaderboard!
talatt@reddit
The cost gap between Opus ($86/run) and GLM-5 ($7.62/run) is striking, but I wonder how much of that is reducible at the infrastructure layer rather than just model selection.
With hundreds of turns and a growing scratchpad, the context window keeps expanding linearly — but most of that context is redundant across turns. The scratchpad from turn 50 carries 90% of the same content as turn 49, yet you're paying full token price for all of it every time.
We've been experimenting with proxy-level context optimization for multi-turn pipelines. The idea is: compress the context between turns without losing semantic meaning. In our tests, the first turn saves ~14%, but by turn 11 it compounds to ~71% — because each optimized turn becomes part of the next turn's (smaller) input.
For a 200+ turn simulation like this, that could potentially bring Opus costs much closer to GLM-5 territory while keeping Opus-level performance. The cost-efficiency question might not be just "which model" but also "how efficiently are you feeding context to the model."
Fascinating benchmark — the scratchpad finding alone is worth the paper.
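For intuition, the compounding can be modeled as a multiplicative per-turn saving. How cleanly real pipelines compound is an assumption; note that at 14% per turn this simple model gives roughly 81% saved by turn 11, a bit above the ~71% quoted, which would suggest only part of each turn's context is actually compressible.

```python
def cumulative_saving(per_turn: float, turns: int) -> float:
    """Fraction of tokens saved after `turns` turns, assuming each turn's
    compressed output becomes the next turn's input (multiplicative model)."""
    return 1 - (1 - per_turn) ** turns

for n in (1, 5, 11, 50):
    print(f"turn {n:3d}: {cumulative_saving(0.14, n):.0%} of tokens saved")
```

Under this model the savings saturate quickly, so for a 200+ turn simulation the marginal benefit per extra turn is small; most of the win is captured in the first dozen turns.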
DreadMutant@reddit (OP)
Yes, stay tuned for more updates from collinear.ai :)
talatt@reddit
Interesting, I'll keep an eye on collinear.ai. The benchmark methodology is really solid — especially the multi-turn simulation aspect. Curious if you've looked at how context compression affects those cost numbers. We've been seeing significant differences depending on the optimization layer.
crantob@reddit
The naive are still imagining that a simulation model equals the real world: this is how economists fool entire countries.
DreadMutant@reddit (OP)
If models still struggle on a deterministic simulation do you think they will thrive in the real world?
crantob@reddit
Nope, just as the world suffered from King's College modeling, or IPCC.
Clown World™
Alternative_Star755@reddit
I understand that models are expensive and time consuming to run. But in any non-LLM context a statistic about an algorithm with randomness an n=3 for repeated testing would be laughed at. We really need to see numbers at least hitting 10-100 runs to average out before you can start to be confident we're not just engaging in confirmation bias.
snusc@reddit
I'm starting to think the glm posts on this sub are marketing posts by zai. I got influenced by this and bought their subscription to try the models with openclaw. It feels 100x dumber than opus and 70x dumber than sonnet.
My friends in the same group can tell by a single response that openclaw has switched to glm just because of how stupid and incoherent it is compared to the models it claims to compete on benchmarks with.
9gxa05s8fa8sh@reddit
most benchmarks aren't checking for how smart a human thinks their replies are
Due-Memory-6957@reddit
What models do you plan to add?
BP041@reddit
the scratchpad finding is the most interesting part to me. it basically shows that what matters for long-horizon tasks isn't raw intelligence — it's whether the model maintains working memory across a multi-step problem.
I've been building agentic systems where agents need to reason across dozens of turns, and the ones that degrade fastest are those that treat each turn as stateless. adding even a simple structured note-taking step in the prompt drastically changes output quality over long runs.
curious whether you saw a difference between models that used the scratchpad reactively (writing notes after bad outcomes) vs proactively (writing strategy before decisions).
Due-Memory-6957@reddit
Just like humans (Get fucked Socrates)!
NStep-Studio@reddit
So the more notes and strategies it writes, the better it does?! And obviously the more it records in memory, the better it thinks!?
DreadMutant@reddit (OP)
Nothing too different from a performance standpoint whether the model is reactive or proactive with the scratchpad. The main pattern is that it's able to ground itself in its past experience and in what's right and wrong to do.
BP041@reddit
that's useful signal — reactive vs proactive didn't change performance, but the grounding behavior is the consistent factor regardless of timing.
makes sense mechanically: what matters is the model can reference prior context ("this failed last time") when it needs to, not when it writes the note. the scratchpad working at all implies some self-referencing capacity.
interesting follow-up would be whether models that used it more frequently outperformed ones that wrote sparse notes — measuring note density vs outcome quality across runs.
BP041@reddit
that makes sense for the benchmark task -- summarizing is essentially stateless, so whether the scratchpad is populated proactively or consulted reactively doesn't change the core capability test.
where i'd expect the gap to show up more is in tasks with sequential dependencies: planning something across 4-5 steps where each decision depends on the previous one. if the scratchpad is reactive (only written when something goes wrong), the early steps don't accumulate the context needed to constrain later ones. proactive scratchpad use -- writing down assumptions and decisions as you go -- is what lets the later steps stay coherent.
basically: for single-task evaluations the distinction mostly disappears. for long-horizon tasks it probably matters a lot.
BP041@reddit
that grounding function is interesting — it's essentially what episodic memory does for humans. the scratchpad as a running log of 'what I tried, what worked, what didn't' changes the agent from a reactive responder to something closer to a problem-solver with continuity.
the performance parity between reactive/proactive makes sense in that frame: both can access the same grounding information. the real performance split probably shows up at longer horizons where reactive scratchpad users start losing thread of earlier context while proactive ones maintain a structured state.
did you see that pattern in the YC-Bench runs — agents diverging more as the simulated time horizon extended?
ketosoy@reddit
Claude has slightly better RLHF, and way way better tool use than the other guys right now.
*.md files are severely underrated
constructrurl@reddit
Nearly matched on a simulated startup task - wait until we see how Opus 4 actually runs a company for a year lol
DreadMutant@reddit (OP)
What if there are few companies being run by Opus rn and no one knows 😶🌫️
Due-Memory-6957@reddit
There's an army being at least helped by it and they kinda just killed a bunch of schoolchildren and motivated the enemy to fight even harder...
Significant_Dark_89@reddit
Thanks for sharing this insightful analysis. $86 vs $7.62 is a huge difference, and I am still perplexed at how widely adoption differs between these models. Claude seems like the go-to model in almost all US enterprises, with a huge customer base (our team included!), compared to GLM-5. Is data privacy / security the biggest concern blocking GLM-5 adoption?
Due-Memory-6957@reddit
Only 35%?
Limp_Classroom_2645@reddit
Is glm5 open weights?
HopePupal@reddit
yes it is, but even the 2 bit quant 50% REAP chainsaw brain surgery versions are huge. if you have a Strix Halo or some other 128 GB system, you can try https://huggingface.co/0xSero/GLM-5-REAP-50pct-UD-IQ2_XXS-GGUF
DreadMutant@reddit (OP)
Yes it is
TurnUpThe4D3D3D3@reddit
Opus is basically unusable through API. The only way to use it at a reasonable cost is through subscription.
j4ys0nj@reddit
Qwen really lettin us down! Not surprised by Gemini though.
crantob@reddit
These aren't real companies. This is performance in a simulation. The simulation is not the real world. The simulation only vaguely approximates the real world.
A different simulation would have different results for the models.
Medical_Lengthiness6@reddit
It's coming down to brand popularity now. People are using Claude not because it's that much better, but because it has the image of being that much better.
JacketHistorical2321@reddit
Everything has always come down to brand marketing. EVERYTHING
crantob@reddit
Marketing is a necessary component of production in a competitive market.
Don't hate that -- the alternative is far worse... a system where you have no choices.
ProfessionalJackals@reddit
Difference in generated coding also matters.
Claude models tend to be more forgiving in how somebody structures their prompts. With the disadvantage that Claude models can be a bit too overzealous.
Whereas GPT models are more strict, which can result in code not being updated in all the right spots.
Personally, i found that a mix is the best of both worlds. Now, if somebody can offer a model with the same accuracy/tool usage/behaviors that can run at home on basic hardware, sign me up! But unfortunately, this still seems to be reserved for those that go big (as in $$$$$) on local LLMs.
Voxandr@reddit
Not local LLM
fallingdowndizzyvr@reddit
GLM 5 is most definitely local LLM.
DreadMutant@reddit (OP)
You can try with a locally hosted llm as well, we tried qwen 3.5 9b that way
CryptoUsher@reddit
glad to see GLM-5 punching above its weight, cost efficiency at that scale is wild
but has anyone checked how much the simulation penalizes risk-taking versus real founder behavior, or are we just rewarding conservative play in a fixed environment
ai_without_borders@reddit
the cost efficiency thing isnt accidental. zhipu has been ruthlessly optimizing inference costs for the chinese enterprise market where margins are way tighter than the US. theyve been doing MoE and speculative decoding stuff internally since late last year. i follow a bunch of their researchers on zhihu and the optimization work they share is pretty nuts, basically squeezing everything they can out of limited compute. makes sense that translates to better performance per dollar on a bench like this
CryptoUsher@reddit
makes sense, chinese market’s pricing pressure is no joke. iirc zhipu’s 128k context stuff last quarter was already 30% cheaper than comparable ablations from deepseek, guess that’s why
Alwaysragestillplay@reddit
https://www.reddit.com/r/LocalLLaMA/comments/1s658kv/comment/od3ndry/
And an accidental double post. Really cool.
CryptoUsher@reddit
yeah makes total sense, chinese cloud pricing is brutal compared to us west coast stuff. iirc zhipu's a100 equivalent inference costs like 30% less than comparable models here? fwiw their MoE work on zhihu looked like they're doing dynamic routing with under-4b active params per token, which is kind of wild. wonder how much of that trickles down to open models or if it's all locked in enterprise.
Mission_Bear7823@reddit
hmm, and thats v5, not 5.1, even..
DreadMutant@reddit (OP)
Yesss isnt it already crazy good
Glittering-Brief9649@reddit
Broke down the main ideas for a quicker pass.
https://lilys.ai/digest/8918746/10156708?s=1&noteVersionId=6645264
glenrhodes@reddit
The scratchpad usage finding is the most interesting thing here. Models that kept notes rewrote them 34 times vs 0-2 for the bottom models. That is basically a proxy for whether the model maintains working memory across a long-running task or just reacts greedily to its immediate context. Would love to see how this correlates with context window size and whether shorter context models compensate by writing more aggressively.
DreadMutant@reddit (OP)
That is a very good question and would be a fun experiment to try out
weiyong1024@reddit
been building AI Company where multiple agents run long-term tasks and this matches exactly what i saw. without some kind of persistent state between turns the agents just forget context and start repeating themselves after a few rounds. the model being smart doesnt help if it cant remember what it decided yesterday
DreadMutant@reddit (OP)
Yep having some form of noting stuff down or memory helps a lot
Craygen9@reddit
This is really neat, thanks for sharing. What I find most interesting is figure 8 that is buried in the paper. It shows the trajectory of each individual seed. For seed 3, Opus and GLM5 had a runaway success while almost all the others bombed, while for seed 2 the results were more tightly grouped. Given the high variance of results across seeds, it could be helpful to run more seeds.
DreadMutant@reddit (OP)
Yeah, there is high variance based on the random seed, but we observed Opus 4.6 and GLM-5 to outperform the others consistently regardless of seed.
jslominski@reddit
Or it shows that the paper is garbage ;)
EightRice@reddit
YC-Bench is a much better evaluation framework than most benchmarks because it tests the thing that actually matters: multi-step decision-making under uncertainty with deceptive counterparties.
The 35% of clients secretly inflating work requirements is the key design choice. It forces the LLM to develop a theory of mind about adversarial actors and learn to price in risk - which is fundamentally a mechanism design problem. The CEO is not just optimizing a function, it is navigating a game-theoretic landscape.
What I find most interesting is how different models handle the payroll constraint. Managing cash flow while investing in growth requires the kind of long-horizon planning that most LLMs struggle with because they optimize for immediate reward. The models that perform well probably develop some implicit model of deferred value - accepting short-term losses for long-term positioning.
Curious whether you tested what happens when you give the LLM-CEO explicit governance rules (e.g. 'never accept a contract from a client who has previously inflated requirements') versus letting it learn these heuristics from experience. The governance-constrained version might actually outperform the unconstrained one if the rules are well-designed.
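A sketch of what that governance-constrained policy could look like. The rule, margin parameter, and data shapes here are assumptions for illustration, not anything from the paper.

```python
def governed_accept(offer: dict, history: list[dict], margin: float = 0.2) -> bool:
    """Accept only sufficiently profitable offers from clients with no
    record of inflating requirements (the explicit governance rule)."""
    known_inflaters = {h["client"] for h in history if h["inflated"]}
    if offer["client"] in known_inflaters:
        return False  # hard rule: never re-hire a known inflater
    # require a margin buffer to price in the risk of hidden scope inflation
    return offer["payout"] >= offer["cost"] * (1 + margin)

history = [
    {"client": "acme", "inflated": True},
    {"client": "globex", "inflated": False},
]

print(governed_accept({"client": "acme", "payout": 100, "cost": 10}, history))    # blocked by rule
print(governed_accept({"client": "globex", "payout": 130, "cost": 100}, history))  # clears margin
```

The interesting comparison would be this fixed policy versus a model that has to distill the same blacklist into its scratchpad from experience.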
stunning_man_007@reddit
That's wild—GLM-5 at 1/11th the cost getting that close to Opus is huge. Makes me wonder if the big players are overcharging for most real-world use cases where you just need competent reasoning rather than elite performance.
tillybowman@reddit
no one is overcharging. running llms the way openai or anthropic do is a money grave.
i don't want to know what the real cost of such a subscription would be if billionaires and countries weren't dumping billions into it.
j_osb@reddit
honestly I expect them to have a positive margin on API stuff. Where they sink a lot of money is plans.
Chupa-Skrull@reddit
I'd be surprised if they were sinking that much with plans. In terms of unit economics a power user may be an individual loss, but I would expect the overall structure to balance out or even trend positive, considering how many under-utilizers there are vs the power users, tendencies to spill over into on-demand credit purchasing where available to supplement plans without moving up sub tiers, people buying higher sub tiers, and people using their experiences with subs to influence adoption at organizations that pay API rates for much greater quantities of inference
etre1337@reddit
I looked it up. The prices are comparable to what Anthropic is offering with increased usage.
Doesn't look so good to me.
TheRealMasonMac@reddit
IMO each model has its use-case. Gemini 3.1 Pro, I’ve found, has a habit of forgetting constraints and rules. I’ve found that I can improve its understanding by using a CoT prompt to guide its reasoning (this also works with other models like GLM-5). Gemini 3.1 Pro seems to be very good at understanding cause-and-effect, but it has a high inductive bias. That’s good for some things, not good for others. (I think Gemini also just sucks ass at tool-calling.)
InternalHeron7274@reddit
Great benchmark! I've been researching GLM-5 deeply and here's some context that might help explain the cost-performance curve:
**Official GLM-5 Specs:**
- Total params: 744B (MoE architecture)
- Activated params: ~40B per forward pass
- Context: 200K tokens
- API pricing: ~$0.55-0.82/M input, $2.47-3.02/M output
**Why the 11× cost advantage is real:** The 744B/40B MoE design is key here. Most of the "intelligence" is in the router selecting which experts to activate, so you get near-frontier capability at a fraction of the compute cost. This matches what you observed — GLM-5 isn't just "cheaper Claude", it's architecturally optimized for long-horizon tasks where token efficiency matters more than raw reasoning depth.
**Scratchpad finding is huge:** The fact that top models wrote ~34 notes/run vs 0-2 for bottom performers suggests that **persistent state management** is the bottleneck for long-horizon coherence, not model size. This aligns with what we're seeing in production agentic systems — the "working memory" problem dominates once you go beyond ~10 sequential decisions.
Have you considered running GLM-5-Turbo (the agent-optimized variant) or testing with different scratchpad prompting strategies? Would love to see how the cost-efficiency curve shifts with those variables.
Thanks for the open benchmark!
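The activated-parameter arithmetic above can be sketched directly. The shared/expert split, expert count, and top-k below are guesses chosen so the numbers land near the quoted ~40B; the thread doesn't state GLM-5's real routing config.

```python
def active_fraction(total_params: float, shared_params: float,
                    n_experts: int, top_k: int) -> float:
    """Fraction of parameters touched per token in a top-k routed MoE:
    shared layers always run, plus top_k of n_experts expert blocks."""
    expert_params = total_params - shared_params
    return (shared_params + expert_params * top_k / n_experts) / total_params

# Illustrative assumption: ~4B shared params, 160 experts, 8 routed per token.
frac = active_fraction(744e9, 4e9, 160, 8)
print(f"~{frac * 744:.0f}B active out of 744B per token ({frac:.1%})")
```

The takeaway is just that per-token compute scales with the active slice, not the total, which is where the serving-cost advantage comes from.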
IrisColt@reddit
I'm guessing that being able to perfectly distinguish good from evil would make you a paranoid being capable of anticipating the moves of malicious actors. Kudos to Anthropic.
Worried_Drama151@reddit
So let me get this straight OP, you've made 0 fucking posts on Reddit for 2+ months, and then come on here to shill for GLM 5? You're definitely not a paid z.ai shill, got it 😅
WithoutReason1729@reddit
Your post is getting popular and we just featured it on our Discord! Come check it out!
You've also been given a special flair for your contribution. We appreciate your post!
I am a bot and this action was performed automatically.
wwa56@reddit
do step-3.5-flash too please
Delyzr@reddit
Glm5 was released feb2026, so how did you run it for a year ?
IrisColt@reddit
Wow, this is fascinating! Thanks!!!
MLDataScientist@reddit
!remindme 1 day "test glm 5 q3_k_s locally for yc-bench".
RemindMeBot@reddit
I will be messaging you in 1 day on 2026-04-05 10:34:56 UTC to remind you of this link
pomatotappu@reddit
Are the initial and final net worth both part of the simulation as mentioned above, or did the models earn it in the real world? If it's simulated, how can we be sure they perform similarly in the real world?
I'm asking because a good number of models post good benchmark scores that compete with proprietary top models like opus, sonnet and codex, but when I use them in my daily work, they fail to deliver.
pomatotappu@reddit
What is the harness you are using for each of the models? Also, is it the same for each of the models?
ghulamalchik@reddit
Change:
$1.21M --> $1.21/M
To make it more understandable.
HenryThatAte@reddit
You're misunderstanding
Ok_Ambassador9111@reddit
how did you do that if gemini 3.1 pro just came out this year?
DreadMutant@reddit (OP)
It is a simulated environment where each day is a step, similar to how time passes in turn-based games.
R_Duncan@reddit
Is GLM 5 already a year aged??!? Hmmmmm
bhalothia@reddit
There’s no frontier model moat. The only real moats left in enterprise AI are infrastructure, compliance, and unit economics.
kvothe5688@reddit
11x lower cost in what API? i mean for most max plan is way more worth compared to any open-weight model pricing.
DreadMutant@reddit (OP)
We used openrouter for inference to maintain uniformity
Fun_Nebula_9682@reddit
the scratchpad finding is the most important result imo. been building persistent memory for coding agents — sqlite + fts5 so the agent can search its own past decisions across sessions. without it the thing literally repeats mistakes it fixed hours ago, especially after context compaction drops the original reasoning. 34 rewrites per run tracks — the pattern that works is save early, search before acting, update when wrong.
GLM-5 at $7.62/run doing 95% of Opus is wild though. for production agentic pipelines where you're running hundreds of turns that cost difference compounds fast.
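A minimal version of that sqlite + FTS5 pattern, with an assumed schema (not the commenter's actual setup):

```python
import sqlite3

# In-memory DB for the sketch; a real agent would use a file so memory
# survives across sessions and context compactions.
db = sqlite3.connect(":memory:")
db.execute("CREATE VIRTUAL TABLE memory USING fts5(topic, lesson)")

def save(topic: str, lesson: str) -> None:
    """Save early: record a decision or lesson as soon as it's learned."""
    db.execute("INSERT INTO memory VALUES (?, ?)", (topic, lesson))

def recall(query: str) -> list[str]:
    """Search before acting: full-text match over past lessons, best first."""
    rows = db.execute(
        "SELECT lesson FROM memory WHERE memory MATCH ? ORDER BY rank", (query,))
    return [r[0] for r in rows]

save("payroll", "keep three months of runway before hiring")
save("clients", "acme inflated scope twice; decline their contracts")

# consult memory before the next acme offer, instead of repeating the mistake
print(recall("acme"))
```

One caveat: FTS5 availability depends on how your sqlite was compiled, though stock CPython builds generally include it.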