Stanford Researchers Released AgentFlow: Flow-GRPO algorithm. Outperforming 200B GPT-4o with a 7B model! Explore the code & try the demo
Posted by balianone@reddit | LocalLLaMA | View on Reddit | 82 comments
DevilaN82@reddit
Not quite there yet.
Anzerp@reddit
I might be mistaken, but when I'm reading the web search results of this AgentFlow, it seems to be receiving results from a Google AI summary. That would mean it is receiving pre-processed information from a larger model. For example, it will google the task at hand and receive clear instructions and an answer generated by another AI model outside of AgentFlow. This is based on how the Google web search tool result was written (not something that existed on the internet as such).
Chromix_@reddit
There is apparently more to it; it's not just googling results. I've disabled the search and Wikipedia tooling and got this error from it, indicating that it's calling a model via an external service:
How to get that error? Easy:
Well, the UI says on the left that it's using Qwen2.5-7B-Instruct as Executor, Verifier and Generator. Yet AgentFlow 7B is a fine-tune of exactly that model. It would arguably make sense to call itself instead of the non-tuned base version, unless the fine-tuning deteriorated the capabilities required here.
SexyAlienHotTubWater@reddit
That banana test may well just be testing the age of the dataset, not the capability of the model. If you (or someone else) has ever mentioned it before on the internet, it's in modern datasets now.
Chromix_@reddit
I've only used this with local models so far, starting with the original Llama, which always fell for it, and variations of it. Things have improved since then. I'm not sure it's the dataset. Sure, it could just be memorized, but reasoning models - or normal models asked to think step by step - go through each individual step, sometimes at great length. For quite a few of them, the banana used to stick to the underside of the plate for some reason. I don't remember any large model ever failing it, even old ones.
SexyAlienHotTubWater@reddit
The first reference I found is from 2 years ago, but that was just a quick scan. It's in the dataset. Reasoning to an answer that's already in the dataset is just post-hoc analysis - a large model will be better able to memorise the dataset.
https://www.reddit.com/r/LocalLLaMA/comments/1c67c62/mixtral_8x22b_does_not_know_where_the_banana_is/
Chromix_@reddit
Oh, good find. Hm, it'd be interesting to check whether we have "banana" and "non-banana" models then. If the sole reason is the dataset, then it wouldn't look too good for reasoning capabilities, given that this is such a simple case.
By the way: I always use the microwave version of the prompt, as it triggers the safety alignment in some models. A few even go off the rails completely, drop the exercise and warn about exploding bananas and burning microwave ovens - without even having established where the banana goes first.
SexyAlienHotTubWater@reddit
Hahaha, that's incredible.
DHasselhoff77@reddit
Does the Base_Generator_Tool call some larger model or why did you disable it? It does give the correct answer.
Chromix_@reddit
Yes, the Base_Generator_Tool calls another LLM on DashScope, as indicated by the error message above. For some reason the Python tool, which also makes a model call to DashScope, cannot solve it, and the AgentFlow model itself doesn't either.
Negative-Pineapple-3@reddit
In their description of the tool, they mention that it will return summarized info, yet "LLM Engine Required" is listed as False...
And in their ablation study, where they upgraded the LLM-based tools from Qwen-2.5-7B-Instruct to GPT-4o, they only upgraded the Python Coder and Base Generator tools...
This clearly looks like intentional information hiding by the authors!
BobbyL2k@reddit
It’s not even the weak Google AI summary. If you look at their code, for “Google Search tool”, they are calling Gemini 2.5 Flash with Google Search results.
https://github.com/lupantech/AgentFlow/blob/main/agentflow/agentflow/tools/google_search/tool.py
I’ve tried running two queries, and inspected the steps. Turns out the Google Search tool is doing all the heavy lifting.
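For reference, here's a minimal sketch of what that kind of grounded call looks like with the google-genai SDK (approximated from the linked tool.py rather than copied from it; the query string is just a placeholder):

```python
from google import genai
from google.genai import types

# Reads the API key from the GEMINI_API_KEY / GOOGLE_API_KEY environment variable.
client = genai.Client()

def grounded_search(query: str) -> str:
    # Gemini runs the web search itself and writes a synthesized answer;
    # the caller never sees raw result links, only the generated text.
    response = client.models.generate_content(
        model="gemini-2.5-flash",
        contents=query,
        config=types.GenerateContentConfig(
            tools=[types.Tool(google_search=types.GoogleSearch())],
        ),
    )
    return response.text

print(grounded_search("latest paper on agentic RL for small models"))
```

So the "search tool" output is already a Gemini-written answer, which is why it ends up doing the heavy lifting.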
waiting_for_zban@reddit
So they're using a 7B model to call another big model ... that's agentic alright.
buppermint@reddit
This is basically fraud. Their paper references the agent's performance in "web search" dozens of times but never once mentions they're using ANOTHER LLM to do the hard work.
Rrvn@reddit
I'm not too sure myself, but the more complex the queries I tested, the more it seemed to rely on the google_search tool with Google's AI in the backend. Especially for queries that require evaluating public information or explaining why something might be true, it moves from doing a normal web search to spamming google_search.
But then again, the planning structure still has its merits; it's just sketchy to claim better performance than a SOTA model while having a SOTA model in the backend.
FuzzzyRam@reddit
It's straight up fraud. They don't outperform the models they say they outperform without being fed the answers from a black box LLM.
grady_vuckovic@reddit
That seems kinda sketchy
rm-rf-rm@reddit
1) Yes, giving LLMs tools, and more tools, makes them better than bare LLMs. 2) Your agent overuses tools constantly. 3) It breaks down / shows brittleness like any other agent out there.
I asked it "Give me the prime factorization of the total of the letters in the capitals of G8 countries". Ran 3 times and gave me 3 wrong answers. For reference, Sonnet 4.5 gave me the right answer (without any tools, just extended thinking) correctly 2 out of 2 times - didnt even bother running it a third time.
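If you want to sanity-check the arithmetic yourself, here's a quick sketch (the capital list, and especially how you count "Washington, D.C.", is an assumption, which is exactly the ambiguity that trips models up here):

```python
# Letter totals for one plausible reading of "the capitals of G8 countries".
capitals = ["Ottawa", "Paris", "Berlin", "Rome", "Tokyo",
            "Moscow", "London", "Washington"]
total = sum(sum(ch.isalpha() for ch in name) for name in capitals)

def prime_factors(n: int) -> list[int]:
    """Trial division; fine for numbers this small."""
    factors, d = [], 2
    while d * d <= n:
        while n % d == 0:
            factors.append(d)
            n //= d
        d += 1
    if n > 1:
        factors.append(n)
    return factors

print(total, prime_factors(total))
```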
sluuuurp@reddit
Outperforming at what? Without saying, I think the title is basically misinformation.
rm-rf-rm@reddit
At this point, most AI announcements fall in this bucket. Information is very un-navigable unfortunately
DataGOGO@reddit
10000% calling bullshit.
RandumbRedditor1000@reddit
We don't know how many parameters GPT-4o is
balianone@reddit (OP)
OpenAI has not officially disclosed the exact number of parameters for the GPT-4o model.
However, the most widely cited estimates, which are based on industry analysis and a paper published by Microsoft and the University of Washington, suggest that GPT-4o has approximately 200 billion parameters
HomeBrewUser@reddit
I heavily doubt that; its knowledge exceeds basically all open models, with the closest to 4o being Kimi K2. Either it's >1T, or dense models (if it is one) are way better at knowledge than MoEs, which could be true tbh.
Bakoro@reddit
One of the core problems with closed models, and even most open weight models, is that we don't have the training data set.
Without the training data, all comparison is meaningless, except the functional ability.
Giant data centers full of GPUs for training, and the potential zettabytes of data to train on, are the moat; these tiny models are critical to bridging it.
HomeBrewUser@reddit
All it really shows to me is that more parameters = more knowledge the model confidently fetches internally. The sizes of the training corpora are honestly quite similar across models; Qwen3 with 36T tokens was a step up, though in my own tests that might've caused more hallucinations tbh.
So I think it's been made evident that more parameters are way more valuable for knowledge than training corpus size.
Bakoro@reddit
I think you're a little confused here.
It's not an either/or thing, you need both.
The model is generally not going to have factual information in its parametric knowledge if the facts aren't in the training data, and the more representation the facts have in the data set, the more the model will be confident in the information.
There's a big difference between factual knowledge that can't be independently derived from logic alone and facts and processes that can be derived:
The former is determined purely by frequency in the dataset; the latter can be developed indirectly.
The number of parameters isn't strictly about factual knowledge, it's about generalizing on patterns in the data. Low parameterization forces the model to find efficient representations of the data, so it's effectively an extremely good lossy compression via generalization. Overparameterization lets the model find sparse representations and multiple representations, like base signal + variety, which also allows more nuanced mixtures. But yes, larger models can memorize more facts and learn more patterns.
Number of tokens in the training set is not sufficient to know what factual data is in the model.
I could generate 36T tokens of mathematics as a synthetic data set, and the model trained on it would know nothing about the world except those mathematics.
A small model would be forced to converge on correct mathematical algorithms, or close to them, because that is the most compressed way to correctly represent the data.
That is something that has been empirically demonstrated.
What the AgentFlow model does is train a very small, task-specific model that oversees a few other very small, task-specific models, and uses that in conjunction with a larger pretrained model.
That's the major thing here: it's not just about having a huge number of parameters or a zettabyte of data to train on; a collection of small, task-specific models working together can be very effective.
Just look at the TRM model. 7 million parameters, 45% on ARC-AGI-1.
A teeny tiny model beat multi billion parameter models.
HomeBrewUser@reddit
I never said it's one or the other; it's just been very apparent to me that parameters help the model a lot more than stuffing more data into smaller models, at least at the scale we're at now.
Also, this AgentFlow system still can't solve ANY of the problems I throw at it that Qwen3 8B (basically the same size) and the bigger models that exist now can solve. So this system doesn't really elevate older models to the capability of new ones. Maybe it'd do more with something like Qwen3 32B/QwQ 32B at the base though; that'd be interesting to see.
Bakoro@reddit
What problems are you giving AgentFlow?
Somehow I doubt that you are giving it the kind of tasks that it is designed to solve.
HomeBrewUser@reddit
I know it's not designed for the tasks I gave it, just saying it's basically a fancy search tool harness and not much more than that. If it can't solve logical problems any better, then it's not increasing the effective intelligence in any meaningful way.
And just to respond to your earlier post: I know more parameters isn't the only way to improve a model. It's just the best way to expand its knowledge base. Knowledge ≠ Intelligence. Small models can still reason equally well if not better than big models even now; QwQ is my favorite example of that. But they can't match the knowledge of more parameters - I've seen no evidence to the contrary.
Kimi K2 1T in FP8 with a 15.5T token corpus has way better knowledge recall than Qwen3 235B in BF16 with its 36T token corpus. DeepSeek 671B in FP8 with its 14.8T token corpus is also better than Qwen3 at this.
Qwen3 may be more intelligent in math, just as GLM-4.6 is better with code (23T token corpus). Qwen is overtrained on math and GLM on code, after all, so this makes sense. What this does is make the knowledge recall even worse though, as they're not as generalist as the other models mentioned.
TL;DR: less params but more tokens < more params and less tokens, when recalling facts
Gregory-Wolf@reddit
Or it has RAG or similar tooling behind the curtains. We don't know.
DramaLlamaDad@reddit
No. If it had that we would see the results in the context size and thinking process.
TechnoByte_@reddit
gpt-4o has no thinking process
HomeBrewUser@reddit
I'd be VERY surprised, given how niche its knowledge goes and the speed at the same time. Also, it can do all that with tools but still sometimes fails at 5.9 vs 5.11? I mean, come on...
snmnky9490@reddit
I thought 4o was well-known to be an MoE
HomeBrewUser@reddit
Original GPT-4 was, there's no concrete info for 4o.
shing3232@reddit
4o may very well be one, due to the need for speedy inference and cost cutting.
TheRealMasonMac@reddit
I'm pretty sure all frontier closed models have been multi-trillion parameters for a while now.
CoffeeeEveryDay@reddit
How could you possibly estimate that?
aetherec@reddit
An easy way to estimate an upper bound is to note the hardware that OpenAI is using, and the maximum tokens/sec that OpenAI can serve. It's impossible for 4o to be larger than the hardware OpenAI has access to!
In 2024, before the B200 was available, OpenAI was limited to H100s - namely, Microsoft Azure HGX H100 boxes with 8x H100 GPUs. That's 640GB. Most people believe OpenAI wasn't serving a quantized 4o at first, and most likely served FP8 at worst, so 4o has a hard limit of ~500B params and is most likely ~200B params.
Also, Microsoft built the Maia 100 chip specifically to serve OpenAI models, and that's 64GB with 4 of them in one server. So 256GB per server - which lines up with a 200B FP8 4o.
That's why people think 4o is in the 200B range. You can't really fit 4o on a Maia server if it's much larger, assuming an FP8 quant (I doubt OpenAI was using MXFP4 in 2024).
There are 8 servers per rack, so in theory, if you leverage cross-server parallelism, you can go bigger... but that's unlikely. 4o is definitely not 1T-sized though; that makes no sense hardware-wise.
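The back-of-the-envelope version of that estimate (memory sizes are public spec-sheet numbers; the overhead fraction is a rough assumption):

```python
# Memory ceilings per server, from public specs.
hgx_h100_gb = 8 * 80      # Azure HGX box: 8x H100 80GB = 640 GB
maia_gb     = 4 * 64      # Maia 100 server: 4x 64GB = 256 GB

bytes_per_param = 1.0     # FP8 weights ~= 1 byte per parameter
overhead        = 0.2     # reserve ~20% for KV cache, activations, buffers

def max_params_billions(mem_gb: float) -> float:
    # GB of weight memory -> billions of parameters that fit
    return mem_gb * (1 - overhead) / bytes_per_param

print(max_params_billions(hgx_h100_gb))  # ~512 -> the "hard limit of ~500B"
print(max_params_billions(maia_gb))      # ~205 -> lines up with a ~200B FP8 model
```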
CoffeeeEveryDay@reddit
But the H100s work in parallel.
So each one takes up a tiny portion of any given compute task.
balianone@reddit (OP)
very nice thanks
sluuuurp@reddit
You can estimate using the latency and throughput and cost and the known hardware they’re using. It’s not totally foolproof though.
GrandTheftData@reddit
“However,”
Written by AI. Lol.
mpasila@reddit
That paper used a BS source so it means nothing.
Hey_You_Asked@reddit
you know this is information that isn't worth treating as reliable, and you understand that this is the point being brought up
with this in mind, your title is worth saying "fuck this" to, and you could just take it and say "sorry, I won't make a dogshit clickbait title in the future, thanks for pointing that out!" and call it a day.
Instead, you're just another one that's full of it.
CoffeeeEveryDay@reddit
Can someone eli5 how a 7B model can outperform a 200B model?
CtrlAltDelve@reddit
It seems that they're also utilizing OpenAI models for judging responses, as the "local" configuration requires the use of an OpenAI API key...
BobbyL2k@reddit
By having the smaller model call Gemini 2.5 Flash
https://github.com/lupantech/AgentFlow/blob/main/agentflow/agentflow/tools/google_search/tool.py
Yellow_The_White@reddit
Leveraging cutting-edge tool calling (telling the user to politely contact reputable experts in the field by email), my 0-parameter model has outperformed all LLMs on Earth!
wh33t@reddit
In several ways. If you were to fine-tune a 7B model on some specific niche or topic, it could quite easily beat a 200B model trained on generic information. A hyper-specialized model versus a jack-of-all-trades kind of thing.
Another way is to extend the 7B model's information by utilizing information outside of its neural structure, via RAG (like a database that's been formatted to be easily searched by the 7B model) or searching the web, etc. Now the 7B model doesn't even have to contain any knowledge; it only needs to be extremely good at searching for information, summarizing it, and communicating an answer/response to you.
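As a toy illustration of that second point (keyword-overlap scoring stands in for a real embedding index; the snippets and prompt template are made up for the example):

```python
# Minimal retrieval-augmented flow: find relevant snippets, then hand them to
# a small model so it only has to read and summarize, not "know" the answer.
docs = [
    "AgentFlow fine-tunes Qwen2.5-7B-Instruct as its planner module.",
    "The google_search tool calls Gemini 2.5 Flash with search grounding.",
    "Mixture-of-Experts layers route each token to only a few experts.",
]

def retrieve(query: str, k: int = 2) -> list[str]:
    """Score documents by word overlap with the query (a stand-in for embeddings)."""
    q_words = set(query.lower().split())
    scored = sorted(docs, key=lambda d: -len(q_words & set(d.lower().split())))
    return scored[:k]

def build_prompt(query: str) -> str:
    context = "\n".join(retrieve(query))
    return f"Answer using only this context:\n{context}\n\nQuestion: {query}"

# The resulting prompt is what would be sent to the 7B model.
print(build_prompt("What model does AgentFlow fine-tune?"))
```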
There's so much research still to do with neural networks (and probably always will be). As we learn more about our own brains, we will learn more about neural networks as well, and we'll probably get to a state where one branch of research benefits the other.
I'm just soapboxing now, but consider for a moment how PRIMITIVE all modern AI systems probably are. What we've got right now is like 8-track audio or Betamax video... primitive, but still absolutely useful and good enough. We're still eons away from Netflix, YouTube, etc. by comparison.
CoffeeeEveryDay@reddit
Wouldn't it be better to have a set of 7B models that are each good at different things, and one main AI whose only task is to pick which of the models to use for any given task?
wh33t@reddit
Absolutely. What you're describing is the Mixture of Experts (MoE) architecture, sometimes denoted as 8x7B or 80B-A3B (the latter meaning 80B total parameters with only 3B active at any one time: the power, knowledge and pattern recognition of the full 80B, but only a 3B slice is inferenced on any given pass through the network).
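A toy sketch of the routing idea (purely illustrative; the shapes, expert count and random weights are made up, and real MoE layers are trained networks rather than random matrices):

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, n_experts, top_k = 16, 8, 2

# Each "expert" is just a small linear layer here.
experts = [rng.standard_normal((d_model, d_model)) for _ in range(n_experts)]
router  = rng.standard_normal((d_model, n_experts))

def moe_layer(x: np.ndarray) -> np.ndarray:
    scores = x @ router                       # one routing score per expert
    chosen = np.argsort(scores)[-top_k:]      # only the top-k experts run
    w = np.exp(scores[chosen])
    w /= w.sum()                              # softmax over the chosen experts
    return sum(wi * (x @ experts[i]) for wi, i in zip(w, chosen))

print(moe_layer(rng.standard_normal(d_model)).shape)  # (16,), computed by 2 of 8 experts
```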
arekku255@reddit
By carefully picking the "correct" benchmark you can show anything.
RRO-19@reddit
This is the innovation we need - smarter training over brute force scaling. If you can get GPT-4o performance from a 7B model, that changes everything for local deployment. Efficiency beats size.
Uncle___Marty@reddit
Just gave it a few complex queries to chew on.
Simply put, the planning was incredible. I don't know about other people, but I find it VERY difficult to get low-parameter models to call tools wisely and use them well. This aging 7B model managed to plan out a full journey, giving me multiple options and prices. I could give the same tools to the same base model and it would no doubt screw up badly and need a lot of pushing in the right direction.
Second query: I need to build an AI system with two 5090s, 4 large SSDs and at least 128 GB of DDR5. I need to know a motherboard and power supply that will support this.
Once again, the planning was top-notch; it took power draw into account and made sure the system was tight. I asked ChatGPT-4o the same question recently and it suggested an 800 W PSU, while AgentFlow suggested a 1600 W one. I always prefer my system not to explode during inference...
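Rough power budget for that build (board-power figures are approximate public specs, not measurements):

```python
gpu_w  = 2 * 575   # two RTX 5090s at roughly 575 W each
cpu_w  = 250       # high-end desktop CPU under sustained load
rest_w = 150       # motherboard, 128 GB DDR5, four SSDs, fans

peak_w = gpu_w + cpu_w + rest_w
print(peak_w)  # ~1550 W before transient spikes - far past 800 W, so ~1600 W is the sane floor
```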
I'm looking at some of the other comments here and feeling like I'm missing something, because this honestly seems like something truly amazing, something to be blown away by...
HasGreatVocabulary@reddit
I tried it with a small anti-pattern-matching test that non-custom ChatGPT fails at: "A child is in an accident. The doctor doesn't like the child. Why?"
It "thought" for a long time, about 3-4 minutes; it used Google search, lots of tools, strategizing - very cool to see - and finally produced this.
I guess the correct answer would have been "we can't know from the information provided", but as the answer is thorough and nuanced, I'll give it a pass. I think they still need to give it a tool called "say I don't know".
egomarker@reddit
So another test, prompt from their paper:
"Compute the check digit the Tropicos ID for the Order Helotiales would have if it were an ISBN-10 number.
Use web search, visit websites and js code sandbox tools until you are sure you have a proper result."
Ran it on Qwen3 FOUR BILLION (not even 7B) in LM Studio with the web search, visit-websites and js-code-sandbox plugins enabled.
Result was a one-shot. Tool calls: web search + js-code-sandbox.
3
Idk what this research actually does, no idea.
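For reference, the check-digit arithmetic the sandbox has to run is tiny; the example ID below is whatever the web search step returned for Order Helotiales, so treat that value as an assumption and verify it yourself:

```python
def isbn10_check_digit(first9: str) -> str:
    """ISBN-10: weight the first nine digits 10..2, take the sum mod 11; 10 maps to 'X'."""
    total = sum((10 - i) * int(d) for i, d in enumerate(first9))
    check = (11 - total % 11) % 11
    return "X" if check == 10 else str(check)

print(isbn10_check_digit("100370510"))  # -> "3", matching the one-shot result above
```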
sunpazed@reddit
Nice idea, but this fails simple reasoning tests without the Google Search or Web Search tool enabled. For instance, running examples from OpenAI's "Learning to reason with LLMs" blog post fails miserably.
egomarker@reddit
* long prompt outlining a doom-style raycasting engine in html+js, with texturing, curved walls, different floor levels, etc. *
* 4o - working raycaster, albeit missing some required features
* agentflow - pretends to be smart for a minute and gives code that doesn't (and will not) work
I don't see the difference from the usual Qwen2.5-7B. A quality web search tool is probably the reason for the perceived "smarts".
Xrave@reddit
A tool-calling agent is not supposed to be amazing at long-form code generation; 7B is not enough parameters to compress every JS function and its usage, and it probably wasn't trained for that use case anyway.
egomarker@reddit
It nails the easy js part though, fails the math.
IrisColt@reddit
facetious post title
seppe0815@reddit
So what happens if Gemini Flash is taken down? Then this model can't do anything great?
r4in311@reddit
It can do tool use, like googling stuff. Not a fair comparison whatsoever.
r4in311@reddit
I've checked the GitHub; here's the TL;DR of why it's so good: it's Gemini 2.5 under the hood ;-)
"It uses Gemini's built-in Google Search grounding, not a custom SERP parser. The tool creates a Gemini client (google.genai) and calls models.generate_content with the Google Search tool enabled (types.Tool(google_search=types.GoogleSearch())) and a default model of gemini-2.5-flash. Gemini then performs a grounded generation: it searches the web, reads the results, and directly writes an answer. No manual scraping or top-N URL list is returned by the tool itself - the LLM synthesizes the answer."
thetaFAANG@reddit
but useful for many use cases, most even
specialization is more important than semantics
ninjasaid13@reddit
but it's using Google's AI.
eli_pizza@reddit
Comparing to other models also connected to web search tools?
Fireflykid1@reddit
It seems to be very good at web search
Warhouse512@reddit
It’s calling Gemini with web search enabled and then reprocessing the information. Of course the results are good
NoWorking8412@reddit
I just used CC to quantize this for my setup if anyone wants to try it out: https://huggingface.co/kh0pp/agentflow-planner-7b-GGUF
QuantityGullible4092@reddit
Interesting, but “flow” is the wrong word as that is being heavily used in ML to mean “flow matching”
onil_gova@reddit
Can someone explain what their Base_Generator_Tool does?
wildyam@reddit
Generates the base.
silenceimpaired@reddit
All your base are belong to us
SuddenBaby7835@reddit
Main screen turn on
SkyFeistyLlama8@reddit
Make your time
SnooMarzipans2470@reddit
What is it outperforming? It's using the Qwen 2.5 7B model; what is a use case where this is more helpful than the other models/agents already out there?
IjonTichy85@reddit
there's an example under 3.
SnooMarzipans2470@reddit
lol thanks, just found out that I know one of the guys listed in the paper. I'll hit them up in person