How I designed a multi-agent framework that gives any baseline model a 5-20% gain across various benchmarks. Here is what I learned.

Posted by Ryoiki-Tokuiten@reddit | LocalLLaMA

I built a project called "Deepthink": you can run any baseline model through it and get approximately +5-20% gains on most benchmarks. If I had to summarize it for technical people, it'd be "implementing BFS and DFS algorithms over LLMs to iteratively explore the solution search space at scale". Running GPT-5.4-xHigh through it gives approximately GPT-5.4-Pro level performance, Gemini 3.1 Pro gives approximately Gemini 3 Deepthink level performance, and Gemma4-31B gives approximately Gemini 3 Pro level performance (yeah, that last one is the most insane). The compute cost is approximately 25x-40x per problem, so be aware of that. This system is not for day-to-day tasks; it's built for solving the most complex problems you have. I thought I'd share what I learned while building this system and the thoughts that went into finalizing the architecture. This is just a brain-dump, to be honest, and I'm not used to writing stuff much, so please forgive some inconsistencies.

| Benchmark | Gemini 3.1 Pro Preview (High) | Kimi 2.5 Thinking | Gemma 31B (Dense) | Gemma 31B (Dense) + Multi-agent Cross Learning (SSP + PQF) |
|---|---|---|---|---|
| MMMU-Pro (multimodal understanding) | 80.5% | 78.5% | 76.9% | 80.5% (+3.6%) |
| International Math Olympiad 2025 (mathematics) | 71.4% | 28.3% | 19.05% | 44.64% (+25.59%) |
| LiveCodeBench Pro (Hard 25Q3) | 40.0% | 33.3% | 13.34% | 26.67% (+13.33%) |
| International Physics Olympiad 2025 (theory) | 76.3% | 45.2% | 59.17% | 74.40% (+15.23%) |
| International Chemistry Olympiad 2025 (theory) | 82.8% | 66.2% | 38.10% | 66.2% (+28.10%) |
| USAMO 2026 | 74.4% | 36.31% | 24.40% | 44.64% (+20.24%) |
| GPQA Diamond | 94.3% | 87.6% | 84.3% | 91.3% (+7.0%) |
| AIME 2026 | 98.3% | 89.2% | 89.2% | 96.0% (+6.8%) |

SSP = Structured Solution Pool, PQF = Post Quality Filter. Results were evaluated with iterative corrections turned on at depth = 5.

Github

I am an undergraduate student and don't really feel qualified to write here, but I thought the learnings were very interesting and could be useful to other people, so I'm sharing them. I have also done various other projects involving multi-agent systems that are currently deployed to production, so I've been deep in this stuff for the past few months and there is a lot on my mind to write about. Other projects in this direction: autonomous forest fire detection/prediction, agentic CCTV surveillance, generative recursive education, parallel React-codebase generation, and a data science agent that handles sourcing, cleans and pre-processes data, does feature engineering, generates visualizations, finds anomalies, and trains predictive/forecasting models.

If you clear away all these terminologies like "agents", "taking this action", "self-correction", and "verification" and start thinking at the lowest level, it's just a model receiving a certain context and then deciding its output over it. It's genuinely surprising when you think about this: the same baseline model is able to solve some really hard problems that it normally wouldn't. How? It just received a different context.

Tell agents how to coordinate inside the multi-agent system they are in: It is extremely important to be absolutely crystal clear about how you want the agents to coordinate and behave in various scenarios. You don't have to consider all edge cases and think about every possibility; just think about what you want in the ideal case and literally write that ideal behavior into the system prompts of all the agents involved (of course, don't just blindly copy-paste the same instructions; you can be smart about it). In other words, every agent should know exactly what is ideally expected from it, what it might expect from the other agents connected to it, and how it is supposed to react to their responses. Also, I'd say it's safe to never have more than 4 agents coordinate, no matter what kind of use-case you have.

Long prompts over routing: Routing adds unnecessary complexity, and there is a high chance of calling the wrong endpoint. Think about the Gemini or ChatGPT site or app: they don't route your requests to an image model, deep research, or a music-generation model. Instead, the main agent can access them as tools during its thinking process, so it feels very smooth, almost as if there are no tools involved at all. Don't be shy about the length of the system prompts. Claude Code has a system prompt of about 28k tokens; last time I checked, the Claude web app had around 35k. In Deepthink, even though you'd think these are extremely long prompts, they hardly go over 10k tokens per agent. Always write extremely detailed system prompts. At the beginning of the DeepthinkPrompts.ts file there is a shared constant that literally tells all the agents how this system works, what behavior is expected, what the other agents are there for, what you can approximately expect from each of them, and examples of how to collectively work and adapt in novel, unseen cases. Use LLMs to make the prompts concise or longer, or to remove noise from them. I had a really strong urge to remove the long "adaptability" part of the system prompts and instead add routing by manually writing 10-15 custom prompt files for various tasks, but I'm glad I didn't finalize that decision.
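A minimal sketch of what that shared-constant layout could look like (names, prompt text, and roles here are illustrative, not the actual contents of DeepthinkPrompts.ts): one system-overview string shared verbatim, with each agent appending only its role-specific expectations.

```typescript
// Hypothetical sketch of the shared-constant prompt layout described above.
// SYSTEM_OVERVIEW is shared by every agent; each agent appends only its own role.

const SYSTEM_OVERVIEW = `
You are one agent inside the Deepthink multi-agent system.
Agents: strategy generator, executor, critique, corrector, solution pool, PQF, red team.
Expected behavior: know exactly what is expected from you, what to expect from the
agents connected to you, and how to adapt in novel unseen cases.
`.trim();

const ROLE_PROMPTS: Record<string, string> = {
  critique: "Your role: critique the execution you receive. Expect a corrected solution back.",
  corrector: "Your role: fix the flaws the critique raises. You also see the solution pool.",
};

// Compose one agent's full system prompt from the shared overview plus its role.
function buildSystemPrompt(role: string): string {
  return `${SYSTEM_OVERVIEW}\n\n${ROLE_PROMPTS[role]}`;
}
```

The point of the shared prefix is that every agent gets the same global picture of the system while the role-specific tail stays short and focused.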

Build your own custom conversation history manager for each agent: Even if you have just 2 agents working back and forth in a loop, you don't necessarily need exactly the same history-building logic for both of them. Even if you stick with the default conversation history manager, when the number of agents increases or when there is non-linear context flow between agents, things start to go bad real soon. What I learned is that if your task has non-linear flow, or you need to wait for other agents, or you need to trim or process text from previous messages, then you don't have to use the native conversation history function from LangChain or whatever AI framework you are using. This is highly specific to the agent roles in your system, though.
This isn't difficult; Claude can write a highly optimized function for it in 2 or 3 minutes. In Deepthink, I created distinct history managers depending on the agent's exact role. For instance, the critique agent gets the full back-and-forth history with the correction agent so I can keep it hyper-focused on the critique, while the corrector agent gets not only the immediate critique but also the full solution pool repository containing cross-strategy executions and critiques. You might think this is madness, but it is a must, because we want this agent to actually break out of the "confidently incorrect" loop, do cross-learning, and anticipate the critiques for the other approaches it may have in mind. Another example is the custom history management for the Strategy Generator - PQF loop.
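A sketch of what per-role history managers could look like (the `Msg` shape and function names are my own illustration, not the repo's API): same message log, two different views.

```typescript
// Hypothetical per-role history managers over one shared message log.
type Msg = { from: string; to: string; text: string };

// Critique agent: only the back-and-forth with the corrector,
// to keep it hyper-focused on the critique loop.
function critiqueHistory(log: Msg[]): Msg[] {
  return log.filter(
    (m) =>
      (m.from === "critique" && m.to === "corrector") ||
      (m.from === "corrector" && m.to === "critique")
  );
}

// Corrector agent: just the latest critique, plus the whole solution pool,
// so it can cross-learn instead of re-justifying its own answer.
function correctorHistory(log: Msg[], pool: string[]): { critique?: Msg; pool: string[] } {
  const critiques = log.filter((m) => m.from === "critique" && m.to === "corrector");
  return { critique: critiques[critiques.length - 1], pool };
}
```

The asymmetry is the whole point: one agent's context is deliberately narrow, the other's deliberately wide.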

Agent isolation is just as important as agent collaboration: To explain this part and the rest, I'll include examples from my "Deepthink" project. The first temptation while building a multi-agent system is to have all the agents talk to each other and brainstorm in a shared context. I didn't even try this, because I knew intuitively it was never going to work with current agents. To be honest, this isn't even about the system failing through catastrophic group-think; it's about intent. In such a system, no matter how many instructions you give each agent to clarify its role, while solving problems the agents will always converge to either the correct answer or a confidently incorrect one. No matter how many critique or correction agents you have, the system will never escape unless it actually stumbles upon the correct answer, because the correction agent will always try to justify its answer against the critique, and their entire loop then becomes literally all about justifying the wrong answer in the most rigorous way possible. If you go through my repo, you will see I have a mode called "Iterative Corrections" (aka contextual mode); it has a critique-correction-solution-pool loop. If you remove the solution pool agent from this loop and try hard problems in this mode, it will always do what I just described: after several iterations you'd see the most rigorous justification there exists for a completely wrong answer. So adding some meaningful random noise is very useful.

Example: In Deepthink, hypothesis testing, strategy and sub-strategy execution, hypothesis generation, strategy generation, and critique all work in isolation. That's because we want each of them to first produce their own independent executions, approaches, and analysis; then we can pass that context across the agents. Even though the red team agent is not in a loop with anyone, I'd still consider it a collaborative agent, because it has to view multiple agents' outputs and take decisions based on their critiques. The Post Quality Filter, the Correction Agent, and the Solution Pool Agent are all collaborative because they need to
a) learn techniques from other approaches
b) see what's working and what's not, and maybe ask for strategy evolution (PQF agent)
c) see other solutions so that the pool doesn't seed solutions that already exist in the repository
d) learn the failure modes of each pathway by looking at the critiques
e) truly understand what will actually be novel in this system versus what is just a refinement of an existing flawed idea.
These kinds of things matter the most in a distributed system like this, because we are intentionally spending more compute here and we should utilize it in the best way possible.
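Point (c) above can be sketched in a few lines. This is only an illustration of the idea: in the real system an LLM would judge novelty; here a normalized string key stands in for that judgment, and all names are hypothetical.

```typescript
// Illustrative dedupe-before-seed check for the solution pool (point c).
// A normalized key stands in for an LLM's novelty judgment.
type PoolEntry = { approach: string; solution: string };

function normalize(s: string): string {
  return s.toLowerCase().replace(/\s+/g, " ").trim();
}

// Seed the candidate only if no equivalent approach already exists in the pool.
function seedIfNovel(pool: PoolEntry[], candidate: PoolEntry): boolean {
  const exists = pool.some((e) => normalize(e.approach) === normalize(candidate.approach));
  if (!exists) pool.push(candidate);
  return !exists;
}
```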

Using sub-agents to generate spoon-feedable complex logic: I have seen growing usage of sub-agents for context gathering and various other sub-tasks. One extremely useful use-case I came up with in this system is to generate multiple hypotheses about the solution to the problem, run independent parallel agents to test those hypotheses, and simply concatenate the results into a knowledge packet that gets fed to the actual main execution agents. The concatenation is the simple part; what matters is which hypotheses we generate. Along with plain hypotheses, we can play broadly by generating problem statements for smaller cases, removing or adding specific constraints from the problem, or literally generating specific "stuck points", i.e., places where the model might actually get stuck while solving the problem (for example, checking some symmetry condition or a lower bound proof). Since we have independent, focused agents that will solely focus on fully testing each specific statement, we get extremely valuable context that literally contains hard, complex logic pre-thought by independent LLMs, which we can then spoon-feed to the execution agents. This is one of the most useful things in the entire system, because you can use it to generate valuable, meaningful, related context on the fly.
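The fan-out-and-concatenate step could be sketched like this (`callAgent` is a stub standing in for a real LLM call; all names are my own illustration):

```typescript
// Hedged sketch of the hypothesis-testing fan-out: each hypothesis (a smaller
// case, a relaxed constraint, a predicted "stuck point") goes to its own
// isolated agent call; the findings are concatenated into one knowledge packet.
async function callAgent(prompt: string): Promise<string> {
  return `Findings for: ${prompt}`; // stub for an actual LLM call
}

async function buildKnowledgePacket(problem: string, hypotheses: string[]): Promise<string> {
  // Independent parallel calls: each agent focuses on fully testing one statement.
  const results = await Promise.all(
    hypotheses.map((h) => callAgent(`Problem: ${problem}\nFully test this statement: ${h}`))
  );
  // Concatenate into the packet that gets prepended to the executors' context.
  return results.map((r, i) => `[H${i + 1}] ${r}`).join("\n");
}
```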

Optimizations:

Depth-First-Search vs Breadth-First-Search:

Sub-strategies enabled is BFS; iterative corrections turned on is DFS. You cannot enable both at the same time, because doing so creates an insane amount of unmanageable noise. The problem with breadth-first search in a multi-agent system is the heavy parallelism: an insane amount of parallel compute is needed. The primary problem with using agents for depth-first search inside a system that does parallel executions and gathering is extremely fast-growing context.
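The mutual-exclusion rule is trivial to enforce at config time. A minimal sketch (field names are assumptions, not the repo's actual config shape):

```typescript
// Minimal config check for the BFS/DFS exclusivity described above.
type DeepthinkConfig = {
  subStrategies: boolean;       // BFS mode
  iterativeCorrections: boolean; // DFS mode
  depth?: number;
};

function validateConfig(cfg: DeepthinkConfig): string[] {
  const errors: string[] = [];
  if (cfg.subStrategies && cfg.iterativeCorrections) {
    errors.push("Enable either sub-strategies (BFS) or iterative corrections (DFS), not both.");
  }
  return errors;
}
```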

Optimizing Breadth-First-Search:

Initially, I had one solution critique call per sub-strategy execution, so if we had N strategies and M sub-strategies each, that'd be N*M critique generations in total, or M critique generations for the M executions inside one main strategy. That's a huge number of parallel calls. I noticed that all M executions for a given main strategy had, on average, similar critiques, because they are on average executions of one global framework/strategy in various local ways. So instead of calling M critique agents, I literally just call one critique agent per main strategy. It now receives the executions of all M sub-strategies and reasons over them. You might think this degrades the divergence and thus overall system performance? Nope. The motivation here is similar to the iterative corrections with the structured solution pool repo I introduced for when sub-strategies are turned off. See, the reason critique-correction loops historically don't work is that they converge too easily, aka confidently incorrect answers.

By introducing a solution pool, I added some noise and removed the cognitive barrier of outputting the memorized answer, and that method worked. Here, I wanted the critique agent to have a broader view of the solution space w.r.t. the current strategy, instead of the narrow main-strategy > sub-strategy > execution view. The old view forced the critique to output feedback that led back to direct execution of the main strategy (on average). By literally showing it all the sub-strategies and their executions inside that main global strategy, we let it view them in a much broader context. And that's not even the most useful gain from this change. You see, the single critique generated here is received by M corrector agents. That means the M parallel corrector agents now not only know what they did wrong, but also see what other sub-strategies are being explored in their global branch, and more importantly, what flaws those initial executions had. This is literally a goldmine, because it warns the corrector agent in advance: "your previous solution had flaw X; the other executions tried method M instead of yours and they had flaws Y and Z".

Obviously, after learning the flaw in its solution, the corrector agent will try other methods. But because it is in the same global strategy branch as the other sub-strategies, it will tend to patch its solution using a methodology from another sub-strategy... except now it knows in advance that if it uses that patch and moves further that way, it will receive the same critique the other sub-strategy got for using that approach. So a) this naturally forces it to learn from other sub-strategies, and b) it avoids patching the solution in a way that draws a critique similar to what the other sub-strategies received. How is "b" even possible? What are the chances it would make a patch that is literally the initial execution of some other sub-strategy? Actually, very high, and that's what happens with most problems. At the end of the day, there is a limited number of approaches anyone (human or LLM) can try within a given framework (global strategy). Going outside the global strategy gives a framework-violation critique, and the agent has to get back to its branch anyway. I also implemented a dissected-observations synthesis agent.
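The batching above can be sketched in a few lines (the `critiqueLLM` stub stands in for a real LLM call; names are illustrative):

```typescript
// Sketch of the batched-critique optimization: one critique call per main
// strategy, reasoning over all M sub-strategy executions, then broadcast
// verbatim to the M corrector agents.
type Execution = { subStrategy: string; solution: string };

function critiqueLLM(executions: Execution[]): string {
  // one shared critique covering every sub-strategy in this branch (stub)
  return executions.map((e) => `flaw in ${e.subStrategy}`).join("; ");
}

function batchedCritique(executions: Execution[]): { critique: string; recipients: number } {
  const critique = critiqueLLM(executions); // 1 call instead of M
  return { critique, recipients: executions.length }; // same critique goes to all M correctors
}
```

The call count per main strategy drops from M to 1, and each corrector now sees its siblings' flaws for free.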

Optimizing Depth-First-Search (Avoiding Long Context Collapse In Solution Pool Repo):

In Deepthink, it is even worse. Deepthink has a shared context called the "Solution Pool Repository", and once the number of iterations grows past 3 or 4, it can instantly exceed 200k tokens. I solved this by using delta updates instead of summarizing when the context bloats. I changed the output format of the pool agents to structured output and ask them to output an extra field called "atomic reconstruction" (2-3 lines) alongside each solution they write to the pool; it should contain all the information needed to reconstruct that exact solution. That way, in later iterations, the corrector agents or pool agents can realize during their reasoning that "oh, this approach was previously seeded in the pool but wasn't picked; it actually yields this solution, which seems to be correct but was ignored previously." This is similar to providing small updates as in memory frameworks or the ACE paper by Stanford. It is way better than summarizing a 500k-token context pool, which would obviously lose lots of information. Moreover, processing 500k tokens with a separate agent is a whole other problem.
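The structured-output shape could look roughly like this (field names are my own illustration of the scheme, not the repo's actual schema):

```typescript
// Hedged sketch of the delta-update scheme: each pool entry carries a 2-3 line
// "atomic reconstruction" so later iterations can recall a solution without
// re-reading (or summarizing) a 200k+ token pool.
type PoolSolution = {
  approach: string;
  fullSolution: string;         // the long text; not re-sent every iteration
  atomicReconstruction: string; // must suffice to reconstruct fullSolution
};

// What later-iteration agents actually see: compact deltas, not the whole pool.
function compactView(pool: PoolSolution[]): string {
  return pool.map((s, i) => `#${i + 1} [${s.approach}] ${s.atomicReconstruction}`).join("\n");
}
```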

Here we need to reconsider what we even mean by using LLMs for DFS, because we can't just generate some strategies and let agents explore them deeply in parallel with a shared pool. We need a meta-learning loop that updates/evolves a strategy based on what is working across the entire system and what is not (this is very apparent from the critiques). So I implemented a post-quality filter for this purpose. It evolves the strategies themselves at the beginning, after looking at the N parallel executions and their critiques. We can even set how many times it is allowed to do that, since the LLM might never generate correct strategies for extremely hard problems like the IMO. Currently I have set this to 3 because it seemed like a sweet spot. That is for now; I have plans to implement this at intermediate levels instead of evolving only at the beginning.
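The capped evolution loop might look like this (`evolveLLM` is a stub; the structure, not the names, is the point):

```typescript
// Sketch of the PQF meta-loop: evolve the strategy set at most MAX_EVOLUTIONS
// times, driven by the critiques of the N parallel executions.
const MAX_EVOLUTIONS = 3; // the "sweet spot" cap mentioned above

function evolveLLM(strategies: string[], critiques: string[]): string[] {
  return strategies.map((s) => `${s} (evolved)`); // stub for an actual LLM call
}

function runPQF(
  initial: string[],
  getCritiques: (s: string[]) => string[]
): { strategies: string[]; evolutions: number } {
  let strategies = initial;
  let evolutions = 0;
  while (evolutions < MAX_EVOLUTIONS) {
    const critiques = getCritiques(strategies);
    if (critiques.length === 0) break; // nothing left to fix: stop early
    strategies = evolveLLM(strategies, critiques);
    evolutions++;
  }
  return { strategies, evolutions };
}
```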

Red Teaming Realization (similar to the BFS optimization):
I'll be brief here. Initially, I called the red team agent once per strategy, but then replaced that with a single agent that can prune multiple strategies, or the sub-strategies inside them, in one go. This is actually very useful because the agent now has broader context about the actual search space being explored, so it can remove duplicate strategies and decide where to focus our compute.
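A toy sketch of that batched pruning pass (the scoring and string-based dedupe stand in for LLM judgment; all names are illustrative):

```typescript
// Sketch of the batched red-team pass: one agent sees the whole search space,
// removes duplicate strategies, and keeps only the most promising branches.
type Branch = { strategy: string; score: number };

function redTeamPrune(branches: Branch[], keep: number): Branch[] {
  // Dedupe by normalized strategy name (a stand-in for semantic duplicate detection).
  const seen = new Set<string>();
  const deduped = branches.filter((b) => {
    const key = b.strategy.toLowerCase().trim();
    if (seen.has(key)) return false;
    seen.add(key);
    return true;
  });
  // Keep the top-`keep` branches: this is where compute gets focused.
  return deduped.sort((a, b) => b.score - a.score).slice(0, keep);
}
```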