Stanford: Self-improving Meta-Harness
Posted by GodComplecs@reddit | LocalLLaMA | 13 comments
We had prompt engineering, then context engineering, then agents and harnesses. Now we have Meta-Harness: a harness that auto-corrects its agentic mistakes, improving performance while using less context:
https://arxiv.org/abs/2603.28052
"The performance of large language model (LLM) systems depends not only on model weights, but also on their harness: the code that determines what information to store, retrieve, and present to the model. Yet harnesses are still designed largely by hand, and existing text optimizers are poorly matched to this setting because they compress feedback too aggressively. We introduce Meta-Harness, an outer-loop system that searches over harness code for LLM applications. It uses an agentic proposer that accesses the source code, scores, and execution traces of all prior candidates through a filesystem. On online text classification, Meta-Harness improves over a state-of-the-art context management system by 7.7 points while using 4x fewer context tokens. On retrieval-augmented math reasoning, a single discovered harness improves accuracy on 200 IMO-level problems by 4.7 points on average across five held-out models. On agentic coding, discovered harnesses surpass the best hand-engineered baselines on TerminalBench-2. Together, these results show that richer access to prior experience can enable automated harness engineering."
Looks like an easy performance gain for local LLMs, since you can have it running after the main tasks are done to improve on mistakes. Try it with opencode, or with the project's own artifact here: https://github.com/stanford-iris-lab/meta-harness-tbench2-artifact
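The outer-loop idea from the abstract can be sketched in a few lines. This is a toy illustration, not the paper's actual code: every name, the harness parameter, and the scoring function are made up. The key point is that the proposer sees the scores (and, in the real system, source code and execution traces) of all prior candidates before proposing the next one.

```python
import random

def run_harness(harness_params, task):
    # Stand-in for running an LLM app under a given harness config;
    # this toy score just prefers a mid-sized context budget.
    ctx = harness_params["context_tokens"]
    return 1.0 - abs(ctx - 500) / 1000

def propose(history):
    # The paper's agentic proposer reads code, scores, and traces of
    # prior candidates via a filesystem; this toy version only
    # mutates the best candidate seen so far.
    if not history:
        return {"context_tokens": 100}
    best = max(history, key=lambda h: h["score"])["params"]
    return {"context_tokens": best["context_tokens"] + random.choice([-50, 50, 100])}

history = []
for step in range(20):
    params = propose(history)
    score = run_harness(params, task="classify")
    history.append({"params": params, "score": score})

best = max(history, key=lambda h: h["score"])
print(best["params"], round(best["score"], 2))
```

In the real system the candidate is harness *code*, not a parameter dict, and scoring means running the full application on a benchmark, which is why each iteration is expensive.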
Silver-Champion-4846@reddit
Can this work for applications other than coding, like fiction worldbuilding/writing?
GodComplecs@reddit (OP)
Yes, very easily, by using the LLM-as-judge concept for output quality review:
"To make this work for fiction writing, you would need to provide the system with two things (which are not currently detailed for creative writing in the sources):
Silver-Champion-4846@reddit
Yeah it would need a lot of work
TomLucidor@reddit
Do you also try: (a) out-of-sample comparison between different self-evolving agents, for robustness across task categories; (b) token-adjusted "harness optimizer search progress", since it is 350x more costly per iteration to create the ideal harness; (c) mixing different models, as Google's MAS paper does, to see if there are ways to reduce cost-per-token in certain areas; (d) seeing whether optimizing the harness on simple tasks yields good results on hard tasks; (e) multi-task harness testing, to see if the method is robust under diversification?
FullstackSensei@reddit
If you can measure it, you can improve it. With code and math, measuring is easy. For other domains, it can be quite hard.
Silver-Champion-4846@reddit
Indeed
FoxiPanda@reddit
Short answer: Yes.
Longer answer: Yes, but you'd have to define the "success parameters" to get the self-improving portion. That probably requires some iteration and work on your part to figure out what's important to you and then run experiments and possibly hand grade experiments until you were satisfied with the success parameter definitions for self improvement.
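One way to make those "success parameters" concrete is a weighted rubric you tune by hand-grading a few samples. The metric names and weights below are invented for illustration:

```python
def success_score(metrics, weights):
    # Combine per-dimension scores (0-1) into one number;
    # weights must sum to 1 so scores stay comparable across runs.
    assert abs(sum(weights.values()) - 1.0) < 1e-9
    return sum(weights[k] * metrics[k] for k in weights)

weights = {"coherence": 0.5, "novelty": 0.3, "prose_quality": 0.2}
draft_a = {"coherence": 0.9, "novelty": 0.4, "prose_quality": 0.7}
draft_b = {"coherence": 0.6, "novelty": 0.9, "prose_quality": 0.8}

print(success_score(draft_a, weights))  # weighted sum, roughly 0.71
print(success_score(draft_b, weights))  # weighted sum, roughly 0.73
```

Iterating here means adjusting the weights (or the metrics themselves) until the ranking agrees with your hand grades; only then is the score trustworthy enough to drive self-improvement.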
Silver-Champion-4846@reddit
I'm cardless, unfortunately
Taenk@reddit
The cut-off y-axis makes the result more impressive than it is, but it honestly is quite impressive. I was wondering the other day whether you could run an LLM to evaluate the agent's quality after each run and generate ideas that could improve the harness in place. Will need to check out the paper in detail.
Shingikai@reddit
The buried result: a single discovered harness improved accuracy across five held-out models on those IMO problems. Same harness, five different models, all improved. That's not overfitting to one model's quirks. It's finding something more general about how to structure problems for LLMs.
The hand-engineered baselines weren't suboptimal in model-specific ways. They were suboptimal in ways that cut across architectures. The 4x token reduction alongside the accuracy gain is the tell here. Most people solve hard tasks by throwing more context at them; the discovered harnesses apparently find a more efficient path.
Not sure this extends beyond math and coding. But if the cross-model generalization holds in other domains, a lot of performance people attribute to model differences might actually be harness differences.
valkarias@reddit
This is similar to this: https://arxiv.org/html/2602.03786v2
Basically giving an orchestrator the ability to create or tune sub-agents dynamically.
The Recursive Language Models paper also does something similar for Long-Context-Reasoning.
TomLucidor@reddit
Everyone wants dynamic subagents but nobody can test if there are ways to make this cheaper
Bite_It_You_Scum@reddit
... or you could just use Hermes, which could implement the same thing as a tool.