Stanford: Self-improving Meta-Harness
Posted by GodComplecs@reddit | LocalLLaMA | 13 comments
We had prompt engineering, then context engineering, then agents and harnesses. Now we have Meta-Harness: a harness that auto-corrects its agentic mistakes, improving performance while using less context:
https://arxiv.org/abs/2603.28052
"The performance of large language model (LLM) systems depends not only on model weights, but also on their harness: the code that determines what information to store, retrieve, and present to the model. Yet harnesses are still designed largely by hand, and existing text optimizers are poorly matched to this setting because they compress feedback too aggressively. We introduce Meta-Harness, an outer-loop system that searches over harness code for LLM applications. It uses an agentic proposer that accesses the source code, scores, and execution traces of all prior candidates through a filesystem. On online text classification, Meta-Harness improves over a state-of-the-art context management system by 7.7 points while using 4x fewer context tokens. On retrieval-augmented math reasoning, a single discovered harness improves accuracy on 200 IMO-level problems by 4.7 points on average across five held-out models. On agentic coding, discovered harnesses surpass the best hand-engineered baselines on TerminalBench-2. Together, these results show that richer access to prior experience can enable automated harness engineering."
Looks like an easy performance gain for local LLMs, since you can have it running after the main tasks are done to improve on mistakes. Try it with opencode, or with the project's own artifact here: https://github.com/stanford-iris-lab/meta-harness-tbench2-artifact
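The outer-loop idea from the abstract can be sketched in a few lines. This is a toy illustration, not the paper's actual code: every name, the harness parameter, and the scoring function are made up. The key point is that the proposer sees the scores (and, in the real system, source code and execution traces) of all prior candidates before proposing the next one.

```python
import random

def run_harness(harness_params, task):
    # Stand-in for running an LLM app under a given harness config;
    # this toy score just prefers a mid-sized context budget.
    ctx = harness_params["context_tokens"]
    return 1.0 - abs(ctx - 500) / 1000

def propose(history):
    # The paper's agentic proposer reads code, scores, and traces of
    # prior candidates via a filesystem; this toy version only
    # mutates the best candidate seen so far.
    if not history:
        return {"context_tokens": 100}
    best = max(history, key=lambda h: h["score"])["params"]
    return {"context_tokens": best["context_tokens"] + random.choice([-50, 50, 100])}

history = []
for step in range(20):
    params = propose(history)
    score = run_harness(params, task="classify")
    history.append({"params": params, "score": score})

best = max(history, key=lambda h: h["score"])
print(best["params"], round(best["score"], 2))
```

In the real system the candidate is harness *code*, not a parameter dict, and scoring means running the full application on a benchmark, which is why each iteration is expensive.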
Silver-Champion-4846@reddit
Can this work for applications other than coding, like fiction worldbuilding/writing?
GodComplecs@reddit (OP)
Yes, very easily, by using the LLM-as-judge concept for output quality review:
"To make this work for fiction writing, you would need to provide the system with two things (which are not currently detailed for creative writing in the sources):
Silver-Champion-4846@reddit
Yeah it would need a lot of work
TomLucidor@reddit
Do you also try: (a) out-of-sample comparison between different self-evolving agents, for robustness across task categories; (b) token-adjusted "harness optimizer search progress", since it is 350x more costly per iteration to create the ideal harness; (c) mixing different models, as Google's MAS paper does, to see if there are ways to reduce cost-per-token in certain areas; (d) seeing whether optimizing the harness on simple tasks yields good results on hard tasks; (e) multi-task harness testing, to see if the method is robust under diversification?
FullstackSensei@reddit
If you can measure it, you can improve it. With code and math, measuring is easy. For other domains, it can be quite hard.
Silver-Champion-4846@reddit
Indeed
FoxiPanda@reddit
Short answer: Yes.
Longer answer: Yes, but you'd have to define the "success parameters" to get the self-improving portion. That probably requires some iteration and work on your part to figure out what's important to you and then run experiments and possibly hand grade experiments until you were satisfied with the success parameter definitions for self improvement.
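One way to make those "success parameters" concrete is a weighted rubric you tune by hand-grading a few samples. The metric names and weights below are invented for illustration:

```python
def success_score(metrics, weights):
    # Combine per-dimension scores (0-1) into one number;
    # weights must sum to 1 so scores stay comparable across runs.
    assert abs(sum(weights.values()) - 1.0) < 1e-9
    return sum(weights[k] * metrics[k] for k in weights)

weights = {"coherence": 0.5, "novelty": 0.3, "prose_quality": 0.2}
draft_a = {"coherence": 0.9, "novelty": 0.4, "prose_quality": 0.7}
draft_b = {"coherence": 0.6, "novelty": 0.9, "prose_quality": 0.8}

print(success_score(draft_a, weights))  # weighted sum, roughly 0.71
print(success_score(draft_b, weights))  # weighted sum, roughly 0.73
```

Iterating here means adjusting the weights (or the metrics themselves) until the ranking agrees with your hand grades; only then is the score trustworthy enough to drive self-improvement.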
Silver-Champion-4846@reddit
I'm cardless, unfortunately
Taenk@reddit
The cut-off y-axis makes the result more impressive than it is, but it honestly is quite impressive. I was wondering the other day whether you could run an LLM to evaluate the agent's quality after each run and generate ideas that could improve the harness in place. Will need to check out the paper in detail.
Shingikai@reddit
The buried result: a single discovered harness improved accuracy across five held-out models on those IMO problems. Same harness, five different models, all improved. That's not overfitting to one model's quirks. It's finding something more general about how to structure problems for LLMs.
The hand-engineered baselines weren't suboptimal in model-specific ways. They were suboptimal in ways that cut across architectures. The 4x token reduction alongside the accuracy gain is the tell here. Most people solve hard tasks by throwing more context at them; the discovered harnesses apparently find a more efficient path.
Not sure this extends beyond math and coding. But if the cross-model generalization holds in other domains, a lot of performance people attribute to model differences might actually be harness differences.
valkarias@reddit
This is similar to this: https://arxiv.org/html/2602.03786v2
Basically giving an orchestrator the ability to create or tune sub-agents dynamically.
The Recursive Language Models paper also does something similar for Long-Context-Reasoning.
TomLucidor@reddit
Everyone wants dynamic subagents but nobody can test if there are ways to make this cheaper
Bite_It_You_Scum@reddit
... or you could just use Hermes, which could implement the same thing as a tool.