Dynamically allocating compute budget to a hard set of problems and evolving the solutions with Qwen-35B-A3B gets you near GPT-5.4-xHigh on HLE
Posted by Ryoiki-Tokuiten@reddit | LocalLLaMA | View on Reddit | 6 comments
fuck_cis_shit@reddit
test-time compute is good
Ryoiki-Tokuiten@reddit (OP)
Use this only if you want to throw a huge compute budget at your local model for your favorite problems that you usually test with frontier models. I wouldn't recommend providing this as a tool / MCP for your harness agent working in your codebase because there's too much divergence here.
The baseline 35B variant of the 3.6 family scores 21.4% on HLE (reported in their official blog post), and GPT-5.4-xHigh scores 41.6% (officially reported).
I let Qwen dynamically allocate the compute budget across the problems and assign each solution approach a priority. We ask it to output in a structured format so that we can take each solution and independently spin off parallel agents that work solely on that approach. The number of solutions each agent has to generate equals the priority assigned to its approach. You can of course continue this with the new set of evolved solutions and iterate down further if you don't care about compute at all; however, I found a single iteration to be the sweet spot, avoiding context bloat while still providing context from the other solutions in the pool. Qwen scored 39.9% on the HLE set. I haven't tested it on other benchmarks yet, but these seemed like useful gains, so I thought I'd share them here.
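The allocation step described above can be sketched roughly as follows. This is not the repo's actual code; `generate_solution` is a hypothetical stub standing in for a call to the local model (Qwen-35B-A3B in the post), and the approach-to-priority mapping stands in for the model's structured output.

```python
import concurrent.futures

def generate_solution(problem: str, approach: str, seed: int) -> str:
    # Stub standing in for an actual model call; each (approach, seed)
    # pair would produce one independently generated candidate solution.
    return f"solution(problem={problem!r}, approach={approach}, seed={seed})"

def allocate_and_generate(problem: str, approaches: dict[str, int]) -> list[str]:
    """Spin off parallel workers; each approach gets `priority` generations."""
    pool = []
    with concurrent.futures.ThreadPoolExecutor() as ex:
        futures = [
            ex.submit(generate_solution, problem, approach, seed)
            for approach, priority in approaches.items()
            for seed in range(priority)  # compute budget = assigned priority
        ]
        for f in concurrent.futures.as_completed(futures):
            pool.append(f.result())
    return pool

# Structured output from the model: each approach mapped to its priority.
approaches = {"algebraic": 3, "combinatorial": 1}
pool = allocate_and_generate("problem statement", approaches)
assert len(pool) == 4  # 3 + 1 generations total
```

The resulting pool is what gets carried into the next (optional) iteration, with the evolved solutions replacing the seeds.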
Just to be absolutely clear, there is no "Final Answer" or "Judged Solution" here. We simply have a pool of solutions, and you have to look through them manually (although you could have an LLM go through them and pick the most plausible ones; I didn't have time to set that up).
Github Repo: https://github.com/ryoiki-tokuiten/Iterative-Contextual-Refinements
The mode is called "Dynamic Compute Budget Allocation" or DCA.
Qwen30bEnjoyer@reddit
I think it would be interesting to see if the LLM is capable of finding the correct solution. Would you ever try to have another subagent collate the findings into the single likeliest answer, to better reflect the needs of real-world scenarios?
Kincar@reddit
I love seeing your post, keep it up man!
Ok-Measurement-1575@reddit
Is this essentially self-consistency / majority voting?
Ryoiki-Tokuiten@reddit (OP)
No, that's completely different. Here we are evolving the solution pool through depth-first search. You could interpret seeding the initial pool with some plausible solutions and then assigning them initial priorities as BFS. So from a global perspective, it's continuously applying BFS + DFS over the plausible solutions themselves. Of course we should have a critique step if we want to iterate to the next round, but I don't intend to add that in this mode, because the goal here was really just dynamically allocating compute budget to certain problems, and I ended up with this. A critique-based approach is already present in the Deepthink, Refine, and Contextual modes of the application.
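To make the contrast concrete, here is a minimal sketch of both ideas. The `refine` callback is hypothetical, standing in for another model call that evolves a solution; neither function is the repo's actual implementation.

```python
from collections import Counter

def self_consistency(answers: list[str]) -> str:
    """Self-consistency / majority voting: sample N answers, return the mode."""
    return Counter(answers).most_common(1)[0][0]

def evolve_pool(seeds: dict[str, int], refine) -> list[str]:
    """Pool evolution as described in the post: seed the pool breadth-first,
    then deepen each branch depth-first by its priority. No vote is taken;
    the evolved pool itself is the output."""
    pool = []
    for approach, priority in seeds.items():  # BFS over the seeded approaches
        solution = approach
        for _ in range(priority):             # DFS: refine this branch deeper
            solution = refine(solution)
        pool.append(solution)
    return pool

# Majority voting collapses samples into one answer...
assert self_consistency(["42", "41", "42"]) == "42"
# ...while pool evolution keeps every branch, each refined to its budget.
assert evolve_pool({"a": 2, "b": 1}, lambda s: s + "'") == ["a''", "b'"]
```

The key difference: majority voting aggregates independent samples of the same question, whereas pool evolution keeps distinct approaches alive and spends more refinement steps on higher-priority ones.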