Stanford Researchers Released AgentFlow: Flow-GRPO algorithm. Outperforming 200B GPT-4o with a 7B model! Explore the code & try the demo
Posted by balianone@reddit | LocalLLaMA | View on Reddit | 82 comments
DevilaN82@reddit
Not quite there yet.
Anzerp@reddit
I might be mistaken, but when I'm reading the web search results of this AgentFlow, it seems to be receiving results from a Google AI summary. That would mean it is receiving pre-processed information from a larger model. For example, it will google the task at hand and receive clear instructions and an answer generated by another AI model outside of AgentFlow. This is based on how the Google web search tool result was written (not something that existed on the internet as such).
Chromix_@reddit
There is apparently more to it; it's not just googling results. I've disabled the search and Wikipedia tooling and got this error from it, indicating that it's calling a model via an external service:
How to get that error? Easy:
Well, the UI says on the left that it's using Qwen2.5-7B-Instruct as Executor, Verifier and Generator. Yet AgentFlow 7B is a fine-tune of exactly that model. It would arguably make sense to call itself instead of the non-tuned base version, unless the fine-tuning deteriorated the capabilities required here.
SexyAlienHotTubWater@reddit
That banana test may well just be testing the age of the dataset, not the capability of the model. If you (or someone else) has ever mentioned it before on the internet, it's in modern datasets now.
Chromix_@reddit
I've only used this with local models so far, starting with the original Llama, which always fell for it, and variations of it. Things have improved since then. I'm not sure it's the dataset. Sure, it could just be memorized, but reasoning models - or normal models asked to think step by step - go through each individual step, sometimes at great length. For quite a few of them, the banana used to stick to the underside of the plate for some reason. I don't remember any large model ever failing it, even old ones.
SexyAlienHotTubWater@reddit
The first reference I found is from 2 years ago, but that was just a quick scan. It's in the dataset. Reasoning to an answer that's already in the dataset is just post-hoc analysis - a large model will be better able to memorise the dataset.
https://www.reddit.com/r/LocalLLaMA/comments/1c67c62/mixtral_8x22b_does_not_know_where_the_banana_is/
Chromix_@reddit
Oh, good find. Hm, it'd be interesting to check whether we have "banana" and "non-banana" models then. If the sole reason is the dataset, then it wouldn't look too good for reasoning capabilities, given that this is such a simple case.
By the way: I always use the microwave version of the prompt, as it triggers the safety alignment in some models. A few even go off the rails completely, drop the exercise and warn about exploding bananas and burning microwave ovens - without even having established where the banana goes first.
SexyAlienHotTubWater@reddit
Hahaha, that's incredible.
DHasselhoff77@reddit
Does the Base_Generator_Tool call some larger model or why did you disable it? It does give the correct answer.
Chromix_@reddit
Yes, the Base_Generator_Tool calls another LLM on DashScope, as indicated by the error message above. For some reason the Python tool, which also makes a model call to DashScope, cannot solve it, and the AgentFlow model itself doesn't either.
Negative-Pineapple-3@reddit
In their description of the tool, they mention that it will return summarized info, yet "LLM Engine Required" is listed as False...
And in their ablation study, where they upgraded the LLM-based tools from Qwen-2.5-7B-Instruct to GPT-4o, they only upgraded the Python Coder and Base Generator tools...
This clearly looks like intentional information hiding by the authors!
BobbyL2k@reddit
It’s not even the weak Google AI summary. If you look at their code, for “Google Search tool”, they are calling Gemini 2.5 Flash with Google Search results.
https://github.com/lupantech/AgentFlow/blob/main/agentflow/agentflow/tools/google_search/tool.py
I’ve tried running two queries, and inspected the steps. Turns out the Google Search tool is doing all the heavy lifting.
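For reference, here's a minimal sketch of what that kind of grounded call looks like with the google-genai SDK (approximated from the linked tool.py rather than copied from it; the query string is just a placeholder):

```python
from google import genai
from google.genai import types

# Reads the API key from the GEMINI_API_KEY / GOOGLE_API_KEY environment variable.
client = genai.Client()

def grounded_search(query: str) -> str:
    # Gemini runs the web search itself and writes a synthesized answer;
    # the caller never sees raw result links, only the generated text.
    response = client.models.generate_content(
        model="gemini-2.5-flash",
        contents=query,
        config=types.GenerateContentConfig(
            tools=[types.Tool(google_search=types.GoogleSearch())],
        ),
    )
    return response.text

print(grounded_search("latest paper on agentic RL for small models"))
```

So the "search tool" output is already a Gemini-written answer, which is why it ends up doing the heavy lifting.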
waiting_for_zban@reddit
So they're using a 7B model to call another big model ... that's agentic alright.
buppermint@reddit
This is basically fraud. Their paper references the agent's performance in "web search" dozens of times but never once mentions they're using ANOTHER LLM to do the hard work.
Rrvn@reddit
I'm not too sure myself, but the more complex the queries I tested, the more it seemed to rely on the google_search tool with Google's AI in the backend. Especially for queries that require evaluating public information or explaining why something might be true, it moves from doing a normal web search to spamming google_search.
But then again, the planning structure still has its merits; it's just sketchy to claim better performance than a SOTA model while having a SOTA model in the backend.
FuzzzyRam@reddit
It's straight up fraud. They don't outperform the models they say they outperform without being fed the answers from a black box LLM.
grady_vuckovic@reddit
That seems kinda sketchy
rm-rf-rm@reddit
1) Yes, giving LLMs tools, and more tools, makes them better than bare LLMs. 2) Your agent overuses tools constantly. 3) It breaks down / shows brittleness like any other agent out there.
I asked it "Give me the prime factorization of the total of the letters in the capitals of G8 countries". Ran 3 times and gave me 3 wrong answers. For reference, Sonnet 4.5 gave me the right answer (without any tools, just extended thinking) correctly 2 out of 2 times - didnt even bother running it a third time.
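If you want to sanity-check the arithmetic yourself, here's a quick sketch (the capital list, and especially how you count "Washington, D.C.", is an assumption, which is exactly the ambiguity that trips models up here):

```python
# Letter totals for one plausible reading of "the capitals of G8 countries".
capitals = ["Ottawa", "Paris", "Berlin", "Rome", "Tokyo",
            "Moscow", "London", "Washington"]
total = sum(sum(ch.isalpha() for ch in name) for name in capitals)

def prime_factors(n: int) -> list[int]:
    """Trial division; fine for numbers this small."""
    factors, d = [], 2
    while d * d <= n:
        while n % d == 0:
            factors.append(d)
            n //= d
        d += 1
    if n > 1:
        factors.append(n)
    return factors

print(total, prime_factors(total))
```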
sluuuurp@reddit
Outperforming at what? Without saying, I think the title is basically misinformation.
rm-rf-rm@reddit
At this point, most AI announcements fall in this bucket. Information is very un-navigable unfortunately
DataGOGO@reddit
10000% calling bullshit.
RandumbRedditor1000@reddit
We don't know how many parameters GPT-4o is
balianone@reddit (OP)
OpenAI has not officially disclosed the exact number of parameters for the GPT-4o model.
However, the most widely cited estimates, which are based on industry analysis and a paper published by Microsoft and the University of Washington, suggest that GPT-4o has approximately 200 billion parameters
HomeBrewUser@reddit
I heavily doubt that; its knowledge exceeds basically all open models, with the closest to 4o being Kimi K2. Either it's >1T, or dense models (if it is one) are way better at knowledge than MoEs, which could be true tbh.
Bakoro@reddit
One of the core problems with closed models, and even most open weight models, is that we don't have the training data set.
Without the training data, all comparison is meaningless, except the functional ability.
Giant data centers full of GPUs for training, and the potential zettabytes of data to train on, are the moat; these tiny models are critical to bridging it.
HomeBrewUser@reddit
All it really shows to me is that more parameters = more knowledge the model confidently fetches internally. The sizes of the training corpora are honestly quite similar across models; Qwen3 with 36T tokens was a step up, though in my own tests that might've caused more hallucinations tbh.
So I think it's been made evident that more parameters are way more valuable for knowledge than training corpus size.
Bakoro@reddit
I think you're a little confused here.
It's not an either/or thing, you need both.
The model is generally not going to have factual information in its parametric knowledge if the facts aren't in the training data, and the more representation the facts have in the data set, the more the model will be confident in the information.
There's a big difference between factual knowledge that can't be independently derived from logic alone and facts and processes that can be derived:
The former is determined purely by frequency in the dataset; the latter can be developed indirectly.
The number of parameters isn't strictly about factual knowledge, it's about generalizing on patterns in the data. Low parameterization forces the model to find efficient representations of the data, so it's effectively an extremely good lossy compression via generalization. Overparameterization lets the model find sparse representations and multiple representations, like base signal + variety, which also allows more nuanced mixtures. But yes, larger models can memorize more facts and learn more patterns.
Number of tokens in the training set is not sufficient to know what factual data is in the model.
I could generate 36T tokens of mathematics as a synthetic data set, and the model trained on it would know nothing about the world except those mathematics.
A small model would be forced to converge on correct mathematical algorithms, or close to them, because that is the most compressed way to correctly represent the data.
That is something that has been empirically demonstrated.
What the AgentFlow model does is train a very small, task-specific model that oversees a few other very small, task-specific models, and uses that in conjunction with a larger pretrained model.
That's the major thing here: it's not just about having a huge number of parameters or a zettabyte of data to train on; a collection of small, task-specific models working together can be very effective.
Just look at the TRM model. 7 million parameters, 45% on ARC-AGI-1.
A teeny tiny model beat multi billion parameter models.
HomeBrewUser@reddit
I never said it's one or the other; it's just been very apparent to me that parameters help the model a lot more than stuffing more data into smaller models, at least at the scale we're at now.
Also, this AgentFlow system still can't solve ANY of the problems I throw at it that Qwen3 8B (basically the same size) and the bigger models that exist now can solve. So this system doesn't really elevate older models to the capability of new ones. Maybe it'd do more with something like Qwen3 32B/QwQ 32B at the base though; that'd be interesting to see.
Bakoro@reddit
What problems are you giving AgentFlow?
Somehow I doubt that you are giving it the kind of tasks that it is designed to solve.
HomeBrewUser@reddit
I know it's not designed for the tasks I gave it, just saying it's basically a fancy search tool harness and not much more than that. If it can't solve logical problems any better, then it's not increasing the effective intelligence in any meaningful way.
And just to respond to your earlier post: I know more parameters isn't the only way to improve a model. It's just the best way to expand its knowledge base. Knowledge ≠ Intelligence. Small models can still reason equally well if not better than big models even now; QwQ is my favorite example of that. But they can't match the knowledge of more parameters - I've seen no evidence to the contrary.
Kimi K2 1T in FP8 with a 15.5T token corpus has way better knowledge recall than Qwen3 235B in BF16 with its 36T token corpus. DeepSeek 671B in FP8 with its 14.8T token corpus is also better than Qwen3 at this.
Qwen3 may be more intelligent in math, just as GLM-4.6 is better with code (23T token corpus). Qwen is overtrained on math and GLM on code, after all, so this makes sense. What this does is make the knowledge recall even worse though, as they're not as generalist as the other models mentioned.
TL;DR: less params but more tokens < more params and less tokens, when recalling facts
Gregory-Wolf@reddit
Or it has RAG or similar tooling behind the curtains. We don't know.
DramaLlamaDad@reddit
No. If it had that we would see the results in the context size and thinking process.
TechnoByte_@reddit
gpt-4o has no thinking process
HomeBrewUser@reddit
I'd be VERY surprised, given how niche its knowledge goes and the speed at the same time. Also, it can do all that with tools but still sometimes fails at 5.9 vs 5.11? I mean, come on...
snmnky9490@reddit
I thought 4o was well-known to be an MoE
HomeBrewUser@reddit
Original GPT-4 was, there's no concrete info for 4o.
shing3232@reddit
4o may very well be one, due to the need for speedy inference and cost cutting.
TheRealMasonMac@reddit
I'm pretty sure all frontier closed models have been multi-trillion parameters for a while now.
CoffeeeEveryDay@reddit
How could you possibly estimate that?
aetherec@reddit
An easy way to estimate an upper bound is to note the hardware that OpenAI is using, and the maximum tokens/sec that OpenAI can serve. It's impossible for 4o to be larger than the hardware OpenAI has access to!
In 2024, before the B200 was available, OpenAI was limited to H100s - namely, Microsoft Azure HGX H100 boxes with 8x H100 GPUs. That's 640GB. Most people believe OpenAI wasn't serving a quantized 4o at first, and most likely served FP8 at worst, so 4o has a hard limit of ~500B params and is most likely ~200B params.
Also, Microsoft built the Maia 100 chip specifically to serve OpenAI models, and that's 64GB with 4 of them in one server. So 256GB per server - which lines up with a 200B FP8 4o.
That's why people think 4o is in the 200B range. You can't really fit 4o on a Maia server if it's much larger, assuming an FP8 quant (I doubt OpenAI was using MXFP4 in 2024).
There are 8 servers per rack, so in theory, if you leverage cross-server parallelism, you can go bigger... but that's unlikely. 4o is definitely not 1T-sized though; that makes no sense hardware-wise.
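The back-of-the-envelope version of that estimate (memory sizes are public spec-sheet numbers; the overhead fraction is a rough assumption):

```python
# Memory ceilings per server, from public specs.
hgx_h100_gb = 8 * 80      # Azure HGX box: 8x H100 80GB = 640 GB
maia_gb     = 4 * 64      # Maia 100 server: 4x 64GB = 256 GB

bytes_per_param = 1.0     # FP8 weights ~= 1 byte per parameter
overhead        = 0.2     # reserve ~20% for KV cache, activations, buffers

def max_params_billions(mem_gb: float) -> float:
    # GB of weight memory -> billions of parameters that fit
    return mem_gb * (1 - overhead) / bytes_per_param

print(max_params_billions(hgx_h100_gb))  # ~512 -> the "hard limit of ~500B"
print(max_params_billions(maia_gb))      # ~205 -> lines up with a ~200B FP8 model
```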
CoffeeeEveryDay@reddit
But the H100s work in parallel.
So each one takes up a tiny portion of any given compute task.
balianone@reddit (OP)
very nice thanks
sluuuurp@reddit
You can estimate using the latency and throughput and cost and the known hardware they’re using. It’s not totally foolproof though.
GrandTheftData@reddit
“However,”
Written by AI. Lol.
mpasila@reddit
That paper used a BS source so it means nothing.
Hey_You_Asked@reddit
you know this is information that isn't worth treating as reliable, and you understand that this is the point being brought up
with this in mind, your title is worth saying "fuck this" to, and you could just take it and say "sorry, I won't make a dogshit clickbait title in the future, thanks for pointing that out!" and call it a day.
Instead, you're just another one that's full of it.
CoffeeeEveryDay@reddit
Can someone eli5 how a 7B model can outperform a 200B model?
CtrlAltDelve@reddit
It seems that they're also utilizing OpenAI models for judging responses, as the "local" configuration requires the use of an OpenAI API key...
BobbyL2k@reddit
By having the smaller model call Gemini 2.5 Flash
https://github.com/lupantech/AgentFlow/blob/main/agentflow/agentflow/tools/google_search/tool.py
Yellow_The_White@reddit
Leveraging cutting-edge tool calling (telling the user to politely contact reputable experts in the field by email), my 0-parameter model has outperformed all LLMs on Earth!
wh33t@reddit
In several ways. If you were to fine-tune a 7B model on some specific niche or topic, it could quite easily beat a 200B model trained on generic information. A hyper-specialized model versus a jack-of-all-trades kind of thing.
Another way is to extend the 7B model's information by utilizing information outside of its neural structure, via RAG (like a database that's been formatted to be easily searched by the 7B model) or searching the web, etc. Now the 7B model doesn't even have to contain any knowledge; it only needs to be extremely good at searching for information, summarizing it, and communicating an answer/response to you.
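As a toy illustration of that second point (keyword-overlap scoring stands in for a real embedding index; the snippets and prompt template are made up for the example):

```python
# Minimal retrieval-augmented flow: find relevant snippets, then hand them to
# a small model so it only has to read and summarize, not "know" the answer.
docs = [
    "AgentFlow fine-tunes Qwen2.5-7B-Instruct as its planner module.",
    "The google_search tool calls Gemini 2.5 Flash with search grounding.",
    "Mixture-of-Experts layers route each token to only a few experts.",
]

def retrieve(query: str, k: int = 2) -> list[str]:
    """Score documents by word overlap with the query (a stand-in for embeddings)."""
    q_words = set(query.lower().split())
    scored = sorted(docs, key=lambda d: -len(q_words & set(d.lower().split())))
    return scored[:k]

def build_prompt(query: str) -> str:
    context = "\n".join(retrieve(query))
    return f"Answer using only this context:\n{context}\n\nQuestion: {query}"

# The resulting prompt is what would be sent to the 7B model.
print(build_prompt("What model does AgentFlow fine-tune?"))
```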
There's so much research still to do with neural networks (and probably always will be). As we learn more about our own brains, we will learn more about neural networks as well, and we'll probably get to a state where one branch of research benefits the other.
I'm just soapboxing now, but consider for a moment how PRIMITIVE all modern AI systems probably are. What we've got right now is like 8-track audio or Betamax video... primitive, but still absolutely useful and good enough. We're still eons away from Netflix, YouTube, etc. by comparison.
CoffeeeEveryDay@reddit
Wouldn't it be better to have a set of 7B models that are each good at different things, and one main AI whose only task is to pick which of the models to use for any given task?
wh33t@reddit
Absolutely. What you're describing is the Mixture of Experts (MoE) architecture, sometimes denoted as 8x7B or 80B-A3B (the latter meaning 80B total parameters with only 3B active at any one time: the power, knowledge and pattern recognition of the full 80B, but only a 3B slice is inferenced on any given pass through the network).
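A toy sketch of the routing idea (purely illustrative; the shapes, expert count and random weights are made up, and real MoE layers are trained networks rather than random matrices):

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, n_experts, top_k = 16, 8, 2

# Each "expert" is just a small linear layer here.
experts = [rng.standard_normal((d_model, d_model)) for _ in range(n_experts)]
router  = rng.standard_normal((d_model, n_experts))

def moe_layer(x: np.ndarray) -> np.ndarray:
    scores = x @ router                       # one routing score per expert
    chosen = np.argsort(scores)[-top_k:]      # only the top-k experts run
    w = np.exp(scores[chosen])
    w /= w.sum()                              # softmax over the chosen experts
    return sum(wi * (x @ experts[i]) for wi, i in zip(w, chosen))

print(moe_layer(rng.standard_normal(d_model)).shape)  # (16,), computed by 2 of 8 experts
```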
arekku255@reddit
By carefully picking the "correct" benchmark you can show anything.
RRO-19@reddit
This is the innovation we need - smarter training over brute force scaling. If you can get GPT-4o performance from a 7B model, that changes everything for local deployment. Efficiency beats size.
Uncle___Marty@reddit
Just gave it a few complex queries to chew on.
Simply put, the planning was incredible. I don't know about other people, but I find it VERY difficult to get low-parameter models to call tools wisely and use them well. This aging 7B model managed to plan out a full journey, giving me multiple options and prices. I could give the same tools to the same base model and it would no doubt screw up badly and need a lot of pushing in the right direction.
Second query: I need to build an AI system with two 5090s, 4 large SSDs and at least 128 GB of DDR5. I need to know a motherboard and power supply that will support this.
Once again, the planning was top-notch; it took power draw into account and made sure the system was tight. I asked ChatGPT-4o the same question recently and it suggested an 800 W PSU, while AgentFlow suggested a 1600 W one. I always prefer my system not to explode during inference...
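Rough power budget for that build (board-power figures are approximate public specs, not measurements):

```python
gpu_w  = 2 * 575   # two RTX 5090s at roughly 575 W each
cpu_w  = 250       # high-end desktop CPU under sustained load
rest_w = 150       # motherboard, 128 GB DDR5, four SSDs, fans

peak_w = gpu_w + cpu_w + rest_w
print(peak_w)  # ~1550 W before transient spikes - far past 800 W, so ~1600 W is the sane floor
```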
I'm looking at some of the other comments here and feeling like I'm missing something, because this honestly seems like something truly amazing, something to be blown away by...
HasGreatVocabulary@reddit
I tried it with a small anti-pattern-matching test that non-custom ChatGPT fails at: "A child is in an accident. The doctor doesn't like the child. Why?"
It "thought" for a long time, about 3-4 minutes; it used Google search, lots of tools, strategizing - very cool to see - and finally produced this.
I guess the correct answer would have been "we can't know from the information provided", but as the answer is thorough and nuanced, I'll give it a pass. I think they still need to give it a tool called "say I don't know".
egomarker@reddit
So another test, prompt from their paper:
"Compute the check digit the Tropicos ID for the Order Helotiales would have if it were an ISBN-10 number.
Use web search, visit websites and js code sandbox tools until you are sure you have a proper result."
Ran it on Qwen3 FOUR BILLION (not even 7B) in LM Studio with the web search, visit-websites and js-code-sandbox plugins enabled.
Result was a one-shot. Tool calls: web search + js-code-sandbox.
3
Idk what this research actually does, no idea.
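For reference, the check-digit arithmetic the sandbox has to run is tiny; the example ID below is whatever the web search step returned for Order Helotiales, so treat that value as an assumption and verify it yourself:

```python
def isbn10_check_digit(first9: str) -> str:
    """ISBN-10: weight the first nine digits 10..2, take the sum mod 11; 10 maps to 'X'."""
    total = sum((10 - i) * int(d) for i, d in enumerate(first9))
    check = (11 - total % 11) % 11
    return "X" if check == 10 else str(check)

print(isbn10_check_digit("100370510"))  # -> "3", matching the one-shot result above
```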
sunpazed@reddit
Nice idea, but this fails simple reasoning tests without the Google Search or Web Search tool enabled. For instance, running examples from OpenAI's "Learning to reason with LLMs" blog post fails miserably.
egomarker@reddit
* long prompt outlining a doom-style raycasting engine in html+js, with texturing, curved walls, different floor levels, etc. *
* 4o - working raycaster, albeit missing some required features
* agentflow - pretends to be smart for a minute and gives code that doesn't (and will not) work
I don't see the difference from the usual Qwen2.5-7B. A quality web search tool is probably the reason for the perceived "smarts".
Xrave@reddit
A tool-calling agent is not supposed to be amazing at long-form code generation; 7B is not enough parameters to compress every JS function and its usage, and it probably wasn't trained for that use case anyway.
egomarker@reddit
It nails the easy js part though, fails the math.
IrisColt@reddit
facetious post title
seppe0815@reddit
So what happens if Gemini Flash is taken down? Then this model can't do anything great?
r4in311@reddit
It can do tool use, like googling stuff. Not a fair comparison whatsoever.
r4in311@reddit
I've checked the GitHub; here's the TL;DR of why it's so good: it's Gemini 2.5 under the hood ;-)
"It uses Gemini's built-in Google Search grounding, not a custom SERP parser. The tool creates a Gemini client (google.genai) and calls models.generate_content with the Google Search tool enabled (types.Tool(google_search=types.GoogleSearch())) and a default model of gemini-2.5-flash. Gemini then performs a grounded generation: it searches the web, reads the results, and directly writes an answer. No manual scraping or top-N URL list is returned by the tool itself - the LLM synthesizes the answer."
thetaFAANG@reddit
but useful for many use cases, most even
specialization is more important than semantics
ninjasaid13@reddit
but it's using Google's AI.
eli_pizza@reddit
Comparing to other models also connected to web search tools?
Fireflykid1@reddit
It seems to be very good at web search
Warhouse512@reddit
It’s calling Gemini with web search enabled and then reprocessing the information. Of course the results are good
NoWorking8412@reddit
I just used CC to quantize this for my setup if anyone wants to try it out: https://huggingface.co/kh0pp/agentflow-planner-7b-GGUF
QuantityGullible4092@reddit
Interesting, but “flow” is the wrong word as that is being heavily used in ML to mean “flow matching”
onil_gova@reddit
Can someone explain what their Base_Generator_Tool does?
wildyam@reddit
Generates the base.
silenceimpaired@reddit
All your base are belong to us
SuddenBaby7835@reddit
Main screen turn on
SkyFeistyLlama8@reddit
Make your time
SnooMarzipans2470@reddit
What is it outperforming? It's using the Qwen 2.5 7B model; what is a use case where this is more helpful than the other models/agents already out there?
IjonTichy85@reddit
there's an example under 3.
SnooMarzipans2470@reddit
lol thanks, just found out that I know one of the guys listed in the paper. I'll hit them up in person