Top reasoning LLMs failed horribly on USA Math Olympiad (maximum 5% score)
Posted by Kooky-Somewhere-2883@reddit | LocalLLaMA | View on Reddit | 219 comments

I need to share something that blew my mind today. I just came across this paper evaluating state-of-the-art LLMs (like O3-MINI, Claude 3.7, etc.) on the 2025 USA Mathematical Olympiad (USAMO). And let me tell you, this is wild.
The Results
These models were tested on six proof-based math problems from the 2025 USAMO. Each problem was scored out of 7 points, with a max total score of 42. Human experts graded their solutions rigorously.
The highest average score achieved by any model? Less than 5%. Yes, you read that right: 5%.
Even worse, when these models tried grading their own work (e.g., O3-MINI and Claude 3.7), they consistently overestimated their scores, inflating them by up to 20x compared to human graders.
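For scale, here is the arithmetic behind those percentages (the values are taken from the setup above; the 2-point figure is an illustrative reading of "less than 5%", not a number from the paper):

```python
# Back-of-the-envelope check of the numbers in the post.
problems = 6
points_per_problem = 7
max_score = problems * points_per_problem  # 42

# "Less than 5%" of the maximum is barely 2 points total:
five_percent = 0.05 * max_score

print(max_score)     # 42
print(five_percent)  # 2.1

# A 20x self-grading inflation would turn 2 human-graded points into 40:
print(2 * 20)        # 40
```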
Why This Matters
These models have been trained on all the math data imaginable: IMO problems, USAMO archives, textbooks, papers, etc. They've seen it all. Yet, they struggle with tasks requiring deep logical reasoning, creativity, and rigorous proofs.
Here are some key issues:
- Logical Failures: Models made unjustified leaps in reasoning or labeled critical steps as "trivial."
- Lack of Creativity: Most models stuck to the same flawed strategies repeatedly, failing to explore alternatives.
- Grading Failures: Automated grading by LLMs inflated scores dramatically, showing they can't even evaluate their own work reliably.
Given that billions of dollars have been poured into these models in the hope that they can "generalize" and provide a "crazy lift" to human knowledge, this result is shocking, especially since the models here were probably trained on all previous Olympiad data (USAMO, IMO, anything).
Link to the paper: https://arxiv.org/abs/2503.21934v1
mangoclimb@reddit
If the answer is determined by logic and rules, there is no reason a computer cannot find it. If it fails at a math competition, that is simply a problem with how the software works. A large, complex program like a game engine can simulate elaborate graphics, yet be useless for solving math problems; software must be built precisely for its intended purpose. Although still in their early stages, various AIs created specifically for math already exist, and these tools can infer accurate answers to math problems with consistency and universality. The problem is not the performance of the computer but the specialization of the software. Even the average main-memory access time of an old Pentium 100 MHz processor was 50 ns, faster than the RAM response time of the latest processors (~100 ns), while raw processing capacity has of course become incomparably greater than in past systems. Math is a predictable subject, because its answers are determined by logic and rules. By contrast, no person or tool can predict or solve a random number generator; in such cases the words "victory" or "defeat" are not appropriate, and any advantage or disadvantage is merely a matter of conditions and circumstances. If a general AI is not suitable for a math competition, it is because there is no optimized data or functionality for thoroughly specializing the AI software in mathematics. This is entirely a problem of purpose-built coding: software written for mathematical processing can solve mathematics clearly. Like the Unabomber, Kaczynski, some worry that amazing computer technology will make their (mathematicians') math competitions meaningless.
Articles and news pieces that disparage computer technology and AI in coverage of math competitions are a bad tactic to keep the field of mathematics an impenetrable league or sanctuary. AI software developers need to push back aggressively against the arrogance of such mathematicians' "outstanding" mathematical talents(?). It is foolish to perceive tools (computer technology and computer AI) as competitors.
Healthy-Nebula-3603@reddit
That math olympiad is far more difficult than the AIME.
-p-e-w-@reddit
And getting a 5% score is something many professional mathematicians can only dream of. Never mind the average human, who couldn't understand a single question.
If this is supposed to be an argument for how bad LLMs are, it falls flat.
sam_the_tomato@reddit
Then how come so many high-schoolers crush it?
Research mathematician performance is a red herring. This is not what they train for.
-p-e-w-@reddit
If LLMs have indeed reached the performance of very bright high schoolers, then AGI is here. Because that would make them smarter than the vast majority of humans.
Ok_Net_1674@reddit
What does the G in AGI stand for again?
hann953@reddit
I think that's overestimating the difficulty of the questions. Professional mathematicians will solve some of the questions.
-p-e-w-@reddit
Most of them won’t, because contest math is very different from the type of problems most mathematicians work on.
RiseStock@reddit
It doesn't matter. The problems are basically easy in that they are all elementary. Pretty much any PhD level mathematician can solve any of the problems with enough time.
DecompositionalBurns@reddit
I've looked at the problems, and they're not that difficult. Working mathematicians may be unable to solve all of the problems under the exam constraints (4.5 hours for 3 problems on day 1 and another 4.5 hours for the other 3 problems on day 2), but they should be able to solve most of the problems on their own without the exam constraints.
Healthy_Albatross_73@reddit
The IMO is also for highschoolers sssooooo...
Neurogence@reddit
I think we'll get super intelligence by 2030, but there's no need to rationalize everything that doesn't sound good. The average human was not trained on the entire internet, and did not have billions of dollars invested in them.
Benchmarks that require true creativity, like the Olympiads, are the only ones that should be taken seriously, especially if we want AI to be able to come up with solutions to problems that we can't solve.
youarebritish@reddit
These results don't surprise me. What I've found from tinkering with LLMs is that they're very good at producing the solutions to problems they've encountered before but completely incompetent at novel problems. If your problem can be phrased in terms of another problem it's trained on, you can get good results, but if not, no amount of prompting can get it to answer correctly.
Chimezie-Ogbuji@reddit
Exactly. Autoregressive modelling is the extent of their 'super power'. Why do we still expect general intelligence (that can handle unanticipated forms of problems, questions, or tasks) to ever arise from that, regardless of how large the training dataset is?
Ansible32@reddit
I mean it's not really rationalization, it's trying to evaluate the models' capabilities fairly. The kneejerk is "well looks like actually these models are stupid" but then on the other hand Terence Tao's estimation of o1 was "mediocre, but not completely incompetent grad student," so I think the question is how does this score compare to your typical mediocre, but not completely incompetent grad student?
-p-e-w-@reddit
What does that matter? The average horse didn’t have billions of dollars invested in it either, yet cars have almost completely replaced horses.
Fee_Sharp@reddit
This is a very big stretch with "5% is a dream for professional mathematicians". 5% is something that a lot of people who know math well can do. 5% does not mean they solved 5 out of 100 problems. It just means they "started" solving a few problems. You can get a lot of points just by making logical observations about the problem that bring you closer to the solution. I'm not saying it is super easy, but it's definitely not something "professional mathematicians can only dream of".
maboesanman@reddit
Exactly. 5% isn’t really even close to solving one of the six problems.
Stabile_Feldmaus@reddit
The average human can understand these questions and the average professional mathematician can solve them if given enough time.
yeet5566@reddit
According to the 2024 stats 8% is where people landed for the first quartile
redditburner00111110@reddit
Score of 8, not 8%. ~19% of the max score of 42.
-p-e-w-@reddit
People who took the test. Who by definition are elite students.
Due_Scallion220@reddit
To be honest I think the vast majority of people would struggle to achieve 5% on math olympiad problems. :-)
EternalFlame117343@reddit
It's not intelligent. It's not creative. It's just a fancy auto complete. Period.
Hyperths@reddit
I honestly don’t see how anyone who has used the technology can say this
EternalFlame117343@reddit
They are buying into the AI hype.
The thing just predicts which word makes sense and spews it.
Hyperths@reddit
If it's just "fancy auto complete", how is it so good at math? I remember about 2 years ago, everyone said that LLMs couldn't do math, and I remember it couldn't even get the first few questions right on the AMC math competition. Now it can easily solve AMC problems not in its dataset and get a score in the top 1% of people. Same with the AIME. Now people are saying that it can't do the USAMO, which the models in the image can't, but recently they tested Gemini 2.5 and it got a 25%.
Yes, LLMs are next word predictors but it's very clearly not "fancy auto complete". Anyone who has used it for mathematics, coding, writing, etc can clearly see that. I personally have profited thousands of dollars on AI generated code, and I probably spend 75% less time learning new concepts for school. It's one of the most useful technologies in my life, and anyone who doesn't use it is very much missing out.
Muted-Bike@reddit
Zero-shot, though, and without any human-assisted architecting of reasoning. If you integrate it with a human problem solver, then they solve the problem blazingly fast, much faster than a person by themselves. Zero-shot is only possible for these LLMs if you engineer the prompt for the input context.
M3GaPrincess@reddit
I wonder how a specialized model like qwen2-math would have done.
ivoras@reddit
One thing is certain: LLM's don't "think", for any really applicable definitions of thinking. They are indeed just predicting tokens. They will fail on any problems not yet in their training databases.
That's not to say they are useless. Even mathematicians will probably one day get assistance from them.
procgen@reddit
What is "thinking" if not predicting tokens? You think in a linear sequence, and your brain must predict what concepts follow whatever is currently in your short-term memory.
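The "predicting tokens" loop being debated here can be sketched in a few lines. The bigram table and greedy decoding below are illustrative toys, nothing like a real transformer, but they show the core shape: each step conditions only on what has been generated so far.

```python
# Hypothetical toy "next-token predictor": a hand-written bigram table.
BIGRAMS = {
    "the": {"cat": 0.6, "dog": 0.4},
    "cat": {"sat": 0.9, "ran": 0.1},
    "sat": {"down": 1.0},
}

def generate(start: str, max_tokens: int = 5) -> list[str]:
    tokens = [start]
    for _ in range(max_tokens):
        dist = BIGRAMS.get(tokens[-1])
        if not dist:
            break  # no known continuation: stop generating
        # Greedy decoding: always pick the most probable next token.
        tokens.append(max(dist, key=dist.get))
    return tokens

print(generate("the"))  # ['the', 'cat', 'sat', 'down']
```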
ivoras@reddit
If you mean to say that the universe as we know it is governed by causality (events following other events), then yeah, that applies to both minds and machines.
I'm more or less thinking about how some (not all) human inventors discovered something new, through flashes of inspiration rather than step-by-step prediction.
On the other hand - science in the last 150 years or so strives to be sterile and dispassionate, so there's less of such stories nowadays.
procgen@reddit
No, that's not what I'm saying. I'm saying that all thought is prediction.
When we discover something new, we're predicting the outcome of counterfactual events (predicting something out of distribution, i.e. extrapolating).
SnooPuppers1978@reddit
I think the problem is calling LLMs just "next token predictors", because that phrase could describe something far more powerful than any current LLM or anything else: if you could truly predict the future, you would be able to simulate the whole universe faster than the universe moves itself. Where LLMs currently fall short is imagination and visualization, which are less linear than inner monologue. Visualization and imagination must similarly "predict" something, but they must fire from multiple threads at once in a more capable way than LLMs currently manage, which is why there are certain simple visualization problems that LLMs still can't solve.
SnooPuppers1978@reddit
I think your examples involve imagination, modelling, and visualization, which can be considered subcategories of thinking, and I would agree that LLMs have trouble with those; it's evident when you try to play Connect Four with them and they can't really do it. But there is also verbal inner monologue, which is likewise considered thinking, and LLMs do seem to do that kind of thinking, so the claim that LLMs don't think isn't clear-cut. It also depends on how you define or understand the word "think".
datbackup@reddit
People can and should understand and frequently use the term “out-of-distribution“ aka “outside of training distribution”
Example here:
https://x.com/rbhar90/status/1781964112911822854
ivoras@reddit
A very good point! Thanks!
Ok_Cow1976@reddit
but predicting the next token, or the next few tokens, is actually very useful in understanding and solving problems, imo.
ivoras@reddit
It is.
Purplekeyboard@reddit
Not true, they can handle all sorts of novel problems. One that I used to use to test LLMs was "If there is a great white shark in my basement, is it safe for me to be upstairs?" This is not a question that appears in their training material (or it didn't used to, I have now mentioned it online a number of times) and they can answer it just fine.
ivoras@reddit
On the one hand, there goes the novelty of your question - the next batch of LLMs will surely have that reply in their training data.
On the other, that question is just too simple. When I ask GPT-4o a variant of that: "If there is a great white shark in my basement, is it safe for me to metabolize psilocybin upstairs?" it concludes with "Probably not the best idea. The potential for a bad trip skyrockets when a real-life nightmare scenario is in play. Maybe relocate the shark first." -- while technically correct (the best kind of correct), it's not like it indicates profound thinking is going on beyond "shark=bad".
Purplekeyboard@reddit
But that's the endless raising of the bar for AI. Whatever a language model can do becomes simple, whatever it can't do proves that we'll never have AI. Older and dumber LLMs couldn't answer the shark in the basement question properly at all, they would give stupid advice like "Lock all your doors and windows, and if the shark is near, back away slowly and don't make eye contact". Now that they can answer the question, it becomes too simple.
ivoras@reddit
If you expect that we're on a road to true AI, then you'll probably agree that at some point, posts like that will stop - that whatever tech is the state of the art will be able to solve completely novel tasks and questions that humans designed to test other humans - like the one in the OP.
When that happens, then I'll agree we are at least approaching true AI.
Purplekeyboard@reddit
If you could have shown Chatgpt to people in the 1990s, they would have declared that this was AI. Today we say it isn't, because it can't answer questions that 99% of people can't answer, so now we have to get it to be able to do graduate level math before it counts as AI.
I don't see any end in sight to this. I can easily see AI models some years from now writing best selling books and hit songs and people saying, "Oh yeah, well has it created any novel theories in physics? Not AI".
ivoras@reddit
No issue there - LLMs are very useful, and they will cause a lot of changes in how we use other tools.
But I'm thinking of in this way: today, we can produce guitars cheaper and better than Jimi Hendrix has ever dreamed of, and even more, today we can simulate his sound, his technique on mobile phones, without even needing a guitar (or an AI). The instruments we have now are both significantly better and more affordable -- and still, real creative, emotional musicians are as hard, or harder to find today as ever. Have you ever listened to the generic "royalty free" music libraries for YouTube?
Stephen King is well known for mass-producing thick novels at a quick pace (65+ at this time), but most of his work just isn't good and feels mass-produced and uninspired. The dozen or so books that did catch on, though, have basically become part of the civilisational backbone.
Each year, between 500k to 1m books are published in the traditional industry, and up to 1.7m more are self-published. Only a few hundred become well-known or respected.
LLMs can obviously outpace all of them, but even trained on all the writing tools of the trade, tvtropes.com and Wikipedia, I don't see an LLM producing an interesting book top-to-bottom without a human setting direction and pace.
I completely agree that writers being *assisted* with LLMs will create good books, the same way they are now assisted by Google or the other things. Same with music. But I don't see real creativity possible without true intelligence. And personally, I don't think true intelligence is possible without embodying it.
AppearanceHeavy6724@reddit
Very true. Short stories by Gemma and Command A are quite good, though.
asssuber@reddit
Please define "think".
If it can solve the first problem after just being pointed to the weakness in its argument, does that mean the problem was in its training database after all?
Glxblt76@reddit
I think this is one of the first things that will age like milk. It is possible to self-play mathematical reasoning using automated engines like Wolfram.
Latter-Pudding1029@reddit
It only took 8 hours and your prediction has come to pass. Google came out with something.
C_8urun@reddit
This post is so classical deepseek style
drwebb@reddit
The real LLM revolution is not math genius and cures for cancer; rather, I now suspect a ton of people are secretly using an LLM for everyday writing.
slurpyslurper@reddit
LLM, please take my outline and expand to a formal email. LLM, please condense this overly formal email to a brief outline.
Fluid-Cry-1223@reddit
Would it make sense to test how these models help someone solve complex math problems, rather than solving the problems themselves?
Best-Apartment1472@reddit
Wow. Looks like it's way harder if you've never seen it before. Who knew?
TimJBenham@reddit
I've always suspected the reason commercial LLMs do well on standard tests and qualification exams is that they have trained the heck out of them on every test they can get their hands on.
Best-Apartment1472@reddit
Yeah. Just try using an LLM on your legacy code base and having it introduce a new feature from your backlog. It won't go smoothly.
davebren@reddit
Even for the ARC-AGI problems they get a lot of training data, even though humans can solve them easily without training.
Ayman_donia2347@reddit
The Mathematical Olympiad is very hard for 99.99% of people.
vaette@reddit
Don't worry, I am sure that models with much better scores will quickly show up. Unfortunately, they may then weirdly turn out not to be good at the 2026 problem set...
Kooky-Somewhere-2883@reddit (OP)
hahaha this cracks me up
pier4r@reddit
sweatierorc@reddit
We invented expert systems in the 80s that were really good at solving domain-specific tasks. We still do that; Google just won the Nobel for AlphaFold. The goal is for your AI to be able to zero-shot or few-shot as many tasks as any human.
pier4r@reddit
Everyone and their pets know all of this. The point is: why not have an LLM director that picks the proper narrow AI (or glues them together appropriately) to solve problems, rather than having only one big network doing everything?
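The "director" idea above can be sketched as a trivial dispatcher. The prefix rules and handler functions here are hypothetical stand-ins for real classifiers and specialist tools:

```python
# Minimal sketch of an "LLM director": route each request to a narrow
# specialist, falling back to a general model otherwise.
def solve_arithmetic(query: str) -> str:
    # Narrow tool: evaluate a plain arithmetic expression after the prefix.
    expr = query.split(":", 1)[1].strip()
    return str(eval(expr))  # fine for a toy; never eval untrusted input

def solve_general(query: str) -> str:
    return "routed to general model"

ROUTES = [("math:", solve_arithmetic)]  # (prefix, specialist) pairs

def director(query: str) -> str:
    for prefix, handler in ROUTES:
        if query.startswith(prefix):
            return handler(query)
    return solve_general(query)

print(director("math: 6 * 7"))   # 42
print(director("write a poem"))  # routed to general model
```

In practice the routing step would itself be a model call rather than string matching, which is roughly what mixture-of-experts and tool-use setups do.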
sweatierorc@reddit
Everybody is doing that already: between mixture of experts, tool use, reasoning models, and routing, this is probably the most common approach.
neuroticnetworks1250@reddit
Proper preparation is just brushing up on their memory. LLMs arguably have eidetic memory.
pier4r@reddit
I thought that LLM memory was akin to a lossy compressed archive. If they have a perfect one, then I am with you: they should combine known solutions.
neuroticnetworks1250@reddit
Not really. There’s a really cool video by 3b1b that shows where memory lives in LLMs. The whole series is pretty cool
TheDreamWoken@reddit
Link?
neuroticnetworks1250@reddit
https://youtu.be/9-Jl0dxWQs8?si=-ocYghr36f5dEFei
If you’re not well versed in transformer architecture, I’d suggest watching the previous ones too
vintage2019@reddit
Because that would be AGI
AppearanceHeavy6724@reddit
No, absolutely not. Problem #1 is solvable by even an amateur like me, let alone a professional mathematician.
keepthepace@reddit
The year is 2025. We are disappointed that the best free models are not yet at superhuman levels of mathematical thinking.
grekiki@reddit
They are far below a trained high school student level.
rruusu@reddit
Yes, if you don't account for time. These competitions give the human participants 9 hours to answer.
Mountain_Trouble_882@reddit
I read another comment that said those students' first-quartile score is ~19%. If that's true I wouldn't really call that "far below." And I wouldn't want to bet against models improving beyond that this year.
yur_mom@reddit
I agree, yet o1-pro is definitely not free, so it is not a free vs. paid issue. The tech is improving monthly, but I think this is one of the more difficult tasks for an LLM. I know my human brain even had issues with proofs in my CS college courses.
djm07231@reddit
It makes sense, as at this point models are focused more on getting the answer to a question right.
There haven't been many proof-focused mathematical benchmarks. Ones like AIME are based on getting final answers right.
I do think AI labs will start tackling proofs when the tooling and the benchmarks become more mature.
If you want to automate proof evaluation you probably need proof assistants like Lean or Coq, and fully formalizing a proof using those tools is really tedious and hard at this point. If models start to get good at using those tools, then with enough training there is no reason why they couldn't keep getting better at it.
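For a flavor of what formalization means, here is a minimal illustrative Lean 4 example; even a one-line fact must be stated and proved in a form the checker accepts, and a real USAMO proof would be vastly longer:

```lean
-- Toy Lean 4 example: commutativity of addition on natural numbers,
-- closed by a lemma from the core library.
theorem add_comm_example (a b : Nat) : a + b = b + a :=
  Nat.add_comm a b
```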
HanzJWermhat@reddit
Wouldn’t that mean we’re further away from, not closer to, “AGI”?
Mindless_Pain1860@reddit
I don't think we'll achieve AGI unless we move beyond the Transformer architecture. LLMs feel more like they're reciting countless sentences. They predict the next token, not underlying concepts; that's why they need massive amounts of training data just to `learn` something that seems trivial to humans. Humans don't need that kind of brute-force exposure. When you prompt them, they just recall something similar and spit it back. They don't actually understand what they're saying.
eras@reddit
Anthropic made an argument in their whitepaper that LLMs do not only predict the next token; the paper is explained at https://www.anthropic.com/research/tracing-thoughts-language-model .
I think their argument is decent.
LLMs indeed don't do "one-shot learning" like (some) people can. Perhaps a step towards AGI would be a model that can just learn concepts online and apply them immediately, without needing a ton of examples.
space_monster@reddit
humans don't really one-shot though - they can solve new-ish problems by applying adjacent solutions, which they have had a ton of training on.
you wouldn't be able (for example) to train a human just on a bunch of literature and then ask them to solve a complex math problem. they need to have a good understanding of similar problems first which they can then adapt.
that adaptation though is a requirement for AGI anyway, it's at the heart of generalisation - they need to be able to identify when and how they can use existing knowledge to solve novel problems.
Mindless_Pain1860@reddit
True, when a problem is complex humans also can't do one-shot learning, but the amount of data required (e.g. math problems) for humans is orders of magnitude smaller than what LLMs need.
space_monster@reddit
sure, but humans are trained on insane amounts of data every day from just being alive. the fundamentals of math are reinforced all the time for decades, then the more complex concepts are layered on top. you can't take a human from no math to complex math in one step.
and LLMs don't learn from trying math, which we do. I think embedded models in agents and robots with dynamic self-learning are an essential step before we can really start talking about AGI.
eras@reddit
Let's say though you show a person who doesn't know what a giraffe is a single line drawing illustration of one. Then you visit a zoo.
How likely do you think it is that that person would be able to recognize the new animal? How likely would a VLM be, in the same conditions?
I believe the odds would favor the person.
Bakoro@reddit
Funny you should mention that, I just read about Siamese networks, which are supposed to be pretty good at one-shot learning.
Still, it would probably favor a human aged three or older. A younger toddler might still call everything a dog.
Meanwhile, I had a dog that never learned the difference between moose and alligator toys.
Brains are weird things.
You're still underestimating the amount of data humans process in the first few years though, it's equivalent to billions of gigabytes of data. Also, recognizing animals is something where we've got the benefit of billions of years of evolution.
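The Siamese-network idea mentioned above can be sketched without any framework. The "weights" below are hypothetical fixed numbers rather than a trained model; the point is only the structure: one shared embedding function applied to both inputs, with similarity measured as a distance between the two embeddings.

```python
import math

W = [[0.5, -0.2], [0.1, 0.8]]  # shared (made-up) embedding weights

def embed(x: list[float]) -> list[float]:
    # The same weights embed every input: this sharing is the "Siamese" part.
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def distance(a: list[float], b: list[float]) -> float:
    ea, eb = embed(a), embed(b)
    return math.sqrt(sum((p - q) ** 2 for p, q in zip(ea, eb)))

# One-shot matching: compare a new example against a single stored
# reference per class and pick the nearest one.
reference = {"giraffe": [1.0, 0.0], "dog": [0.0, 1.0]}
new_animal = [0.9, 0.1]
label = min(reference, key=lambda k: distance(reference[k], new_animal))
print(label)  # giraffe
```

A real Siamese network would learn the embedding with a contrastive or triplet loss, but classification from a single reference example works the same way.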
eras@reddit
I think the concept of the test can be extended to imaginary animals, or to unfamiliar games: e.g. a person who has not played chess has the rules explained to them, versus an LLM in the same conditions (so it hasn't seen games but has seen the rules).
I must admit that absorbing the rules of a new board game can take some time, but after doing it people are basically able to play in interesting ways without breaking the rules, unless the rules are very complicated. In addition, people learn games better as they play, with no need for thousands of example games.
Bakoro@reddit
We've recently seen the benefits of reinforcement learning.
Most of human life from 0 to 25 is nonstop reinforcement learning, and then different reinforcement learning.
Mindless_Pain1860@reddit
These phenomena are expected, as post-training with DPO/PPO enables the model to generate sentences in ways preferred by humans. This still reflects memorization (policy) rather than actual planning.
mekonsodre14@reddit
Humans one-shot learn most concepts through a combination of senses. It's multi-sensorial learning that enables us to quickly understand and cognitively process the concept of something without having to dig into knowledge accumulation.
I'm sure AGI could learn certain concept types in a relatively short time frame, but most are bound to a physical world to which the AGI only has very limited access. Of course this could all change with robots, but unless these have very advanced sensorial suites and processing, I assume this is more than a decade away.
HanzJWermhat@reddit
I fully agree. It's not just transformers; to me it's also the training space. Humans are able to do much more than embedding does today, which means we're able to connect a far wider array of experiences into our analytical thinking. LLMs just take the text, and they can see how some text applies to other tangential situations via embeddings and model weights, but they can't really do any out-of-bounds conception.
Virtualcosmos@reddit
We are quite a few years from an actual AGI. Perhaps more than a few... Our fast development of AI until now has been thanks to the huge amounts of data on the internet. But you know what? Not everything is on the internet; there is a lot of information not yet digitalized, information we use to train our brains that is also very relevant. I foresee that the development of AI will slow down the moment we can't improve our models further with the current amount of curated data, since collecting more would take months or years.
pyr0kid@reddit
we'll have AGI 30 years after fusion, so in other words probably by 2170
Virtualcosmos@reddit
By 2170 the big replacement would probably be occurring. Artificial people and machines would be so much better than biological ones that there would be nearly no reason to continue as biological machines. Quantum computers will bring that world much faster than most people expect, but those machines still need a couple of decades to develop.
HanzJWermhat@reddit
I also don’t believe LLMs are suited to working in non-digitized space. LLMs and generative image/sound synthesis are inherently designed around linear data, but we know the world is not experienced linearly.
Virtualcosmos@reddit
Transformers, as well as others like CNNs, are non-linear models whose main strength is simulating non-linear data; it's pretty basic in computer science to use models like these in ML. Perhaps you mean that the digitization of the world transforms the *continuous* real world into a *discrete* virtualization. Though at really small scales the real world is more discrete than continuous; that's why it's called quantum physics.
The thing is, mathematical models can interpolate between frames of discrete data to simulate a continuous virtual world. I don't think it would be a major problem for AI in the future.
pyr0kid@reddit
LLMs, as a type of next-word-prediction software, fundamentally are not and cannot evolve into AGI.
FeathersOfTheArrow@reddit
Google is already working on it (using Lean).
martinerous@reddit
Since finding out about AlphaProof a long time ago, I have been imagining an AI based on a similar "reasoning core" that follows strict formalized symbolic logic and can apply it not only to math but everything. Then it combines the core with a diffusion-like process to find the concepts to work with, and only as the last step the language module kicks in with the usual autoregressive text prediction to form the ideas into valid sentences. Just dreaming. Still, I doubt that we will get far enough by just scaling the existing LLMs. There must be better ways to progress.
Ok_Jello_1673@reddit
If AI doesn't use language to reason, what else will it use?
martinerous@reddit
It could use concepts: https://github.com/facebookresearch/large_concept_model
Or at least it could reason in latent space instead of tokens: https://arxiv.org/abs/2412.06769
And there are also neurosymbolic options: https://research.ibm.com/topics/neuro-symbolic-ai
luckymethod@reddit
You describe exactly what I think will be the next wave of architectures for generally useful AIs and I agree LLMs by themselves aren't the solution to everything.
JohnnyLiverman@reddit
With the amount of funding LLM research is getting I think the only commercial grade AIs in the short term future will be perturbative around LLMs, maybe with like a few layers of some other architecture slotted in like they did with hunyuan t1.
reaper2894@reddit
Oh this is a nice one.
ain92ru@reddit
Open-source researchers are working on it as well! https://arxiv.org/html/2502.07640v2
quantummufasa@reddit
But they didn't get the answers right
auradragon1@reddit
Agreed.
Give the LLM proof software and train it to use it. I think the scores will be much higher. I don’t think it’s been a focus yet.
ain92ru@reddit
It is being done since about late last year, I posted three papers from this year which are close to SOTA on relevant benchmarks slightly below
auradragon1@reddit
What were the results?
raiffuvar@reddit
MCP servers it is.
ain92ru@reddit
Check the links I posted! We are still very early in the process but not unlikely to see a lot of progress this year, at least with proofs of reasonable length (up to ~50k tokens, which is comparable to the effective context length of SOTA LLMs).
s-jb-s@reddit
Have you by any chance seen the talks by Buzzard & Gowers on automated theorem proving (here is the Q&A from it) -- this was two years ago, but I got the sense that Gowers was particularly sceptical of getting STP to a place where they'd be able to do e.g. research mathematics any time soon (I think he says he's doubtful of it happening the next 10 years). Buzzard was more optimistic (obviously). The big bottleneck they talk about is the difficulty in generating 'good data' to learn from with respect to e.g. LEAN and generalising that to mathematics that would require an STP to not just to perform their own verifications to a conjecture, but also to formalise mathematical structures beyond what's in mathlib to verify problems.
Obviously, within the context of Olympiad mathematics, the need to extend beyond current formalisms is much less of an issue for a large subset of problems (but from what I've read, tactics are still lacking in quite a few notable areas?).
ain92ru@reddit
Thank you for the link. No, I haven't seen them, but I first read the summary of the Q&A at https://www.summarize.tech/www.youtube.com/watch?v=A7IHa8n3EOA and then downloaded the subtitles and discussed them with Gemini 2.5 https://aistudio.google.com/prompts/1zUnkTq6CeWk__YCJyHL3SIDunq8A4uAb (hope the link works for you)
What do you mean by STP, self-play theorem provers?
I am also skeptical that even neurosymbolic toolkits (such as a scaffolded LLM with a Lean interpreter), let alone LLMs per se, will be able to "do research mathematics" by themselves. But that is, IMHO, somewhat of a red herring: it's more constructive to discuss the productivity gains mathematicians may get from future AI tools. The degree of autonomy will likely depend on the degree to which we are able to solve the current problems with hallucinations, attentiveness to detail, and long contexts, which seems impossible to predict.
I certainly still expect human mathematicians to decide which new mathematical structures to create, while the AI tools will likely help them with formalization, speeding up the bottleneck discussed by Buzzard.
Ruibiks@reddit
Hi, if I may plug my tool here: you would be a great candidate to try it head to head against summarize.tech. I've had great feedback so far, and if you take a couple of minutes of your time, you may find value in it.
Here is a direct link for the same video. You can chat with the video (transcript) and make custom prompts. All answers are grounded in the video.
https://www.cofyt.app/search/computer-guided-mathematics-symposium-qanda-with-s-vk3dlHSGcePYc_ZmkAg4ky
MoffKalast@reddit
I cannot describe how fucking infuriating it is that everyone trains their models as question answering machines and literally nothing else.
quiet-sailor@reddit
that's what most people use LLMs for... of course that will be their main goal.
Dudmaster@reddit
Wait until you learn about base versus instruct fine-tune
djm07231@reddit
Reference:
A mathematician at Epoch AI, group behind Frontier Math, stating some of the difficulties of using proof based evaluations.
https://xcancel.com/ElliotGlazer/status/1870644104578883648
Deficiencies of Lean4:
https://xcancel.com/ElliotGlazer/status/1870999025874530781
rruusu@reddit
Is that really a fail? 5% sounds like a lot to me. I'm pretty sure that 99% of people would get a flat-out zero on the Math Olympiad problems.
Even for the actual winners, figuring out the answers to the questions takes hours. The participants have 9 hours to answer 3 really hard questions that require not just creativity and intuition but also a boatload of mental effort.
LearnNTeachNLove@reddit
Interesting
Sad-Elk-6420@reddit
The other models failed miserably when it came to low-level mathematics; however, Gemini 2.5 did pretty well. You should test that.
GrapplerGuy100@reddit
Unfortunately, the critical piece was testing shortly after the problems were released. So to truly recreate this, it needs to be timed with an event (maybe the International Olympiad in July?).
haloweenek@reddit
Well, people still argue when I say that LLMs are not AI.
I've received numerous downvotes and comments.
TheOnlyBliebervik@reddit
I don't think superintelligence will emerge from LLMs
Ok_Claim_2524@reddit
That is because it is a nonsensical take. You don't know what AI is and you think in terms of movie nonsense.
im_not_here_@reddit
That's because that statement is worthless nonsense.
What is AI? The thing you have only seen in science fiction? Can't you see how stupid that is? The definition of AI rightly changes as real-life capability changes, along with real-world implementation, not based on the fact that you watched Star Trek or Ex Machina and now that is the bar.
Healthy-Nebula-3603@reddit
You know those math problems are far harder than AIME? AIME is for secondary schools.
Spongebubs@reddit
USA math Olympiad is also for secondary students..?
haloweenek@reddit
My talking point was: LLMs are not artificial intelligence. They're artificial memory.
A system that's intelligent would tackle those problems.
Healthy-Nebula-3603@reddit
So the research papers from Anthropic are probably wrong, because YOU know better than the experts...
terminoid_@reddit
probably because the whole "what is AI" discussion has been done to death and rarely covers any new ground
IrisColt@reddit
Despite being trained on vast amounts of mathematical data, including Olympiad problems, the results are hardly surprising. These models excel at well-trodden benchmark tasks but falter when confronted with the deep, creative reasoning that Olympiad problems demand. Hey! I don't need to imagine how they suffer when faced with isolated, research-oriented problems that require constructing novel solutions from scratch.
At the present stage, top reasoning LLMs are hindered by a reliance on pattern matching rather than 100% genuine understanding, which is why even extensive dataset exposure doesn’t instantaneously translate to mastery in crafting rigorous proofs. Oh, and the phenomenon of benchmaxxing becomes apparent here: the models are optimized for standard tests, yet this doesn’t equip them for the novel, tricky challenges of high-level competitions.
TimJBenham@reddit
Probably no better than the average new grad student.
bartturner@reddit
I have been just blown away by Gemini 2.5. That is what you should have included in this.
NNN_Throwaway2@reddit
This is really not shocking at all to anyone who has actually used AI for real-world tasks. It's sort of the elephant in the room that AI is still hugely flawed despite billions invested.
ihexx@reddit
is it though?
Ok-Kaleidoscope5627@reddit
It becomes like a monkeys-on-typewriters situation
luchadore_lunchables@reddit
How is this upvoted its completely wrong.
stat-insig-005@reddit
Not really. They are not generating tons of solution candidates and checking if any of them is correct. That's the infinite-monkeys-with-typewriters analogy.
A more appropriate analogy would be: you give a monkey a typewriter, lock him in a room for 30 days, and only check the last page he produces.
davikrehalt@reddit
No, the large compute budget does many generations; this is clear in, for example, the Codeforces o3 paper.
stat-insig-005@reddit
Are you saying that large compute budget produces many candidate answers to a given question and if even one answer is correct the model is considered to have answered the question correctly? Isn’t that obviously wrong and idiotic?
davikrehalt@reddit
No, it's run in parallel and then there's a program/model which chooses the best answer to submit. But in some domains, like formal proof (and to some extent competitive programming), verification is much easier than generation, so it's roughly the same as you describe. I don't know if this is "idiotic", because it's still much smarter than naive search, which is intractable.
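The sample-then-select setup described above can be sketched in a few lines. This is a toy illustration only, not any lab's actual pipeline: the random "generator" stands in for expensive model sampling, and the exact-match "verifier" stands in for a cheap checker such as a proof validator.

```python
import random

def generate_candidate(rng):
    """Stand-in for one (expensive) model sample: here, just a random guess."""
    return rng.randint(1, 100)

def verify(candidate, target=42):
    """Cheap checker. In domains like formal proof, verification is far
    cheaper than generation, which makes best-of-n sampling practical."""
    return candidate == target

def best_of_n(n, seed=0):
    """Draw n candidates (conceptually in parallel) and submit one that
    passes the verifier, if any."""
    rng = random.Random(seed)
    candidates = [generate_candidate(rng) for _ in range(n)]
    passing = [c for c in candidates if verify(c)]
    return passing[0] if passing else None
```

The key property: the loop never needs to know *how* to find the answer, only how to recognize it, which is exactly why this works for verifiable answers but not for grading free-form proofs.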
stat-insig-005@reddit
Oh, that's not idiotic at all. I misunderstood your comment. For a moment, I thought all "intermediate answers" were being evaluated too.
As long as the model produces the one answer that is used in the benchmark, it's OK.
ShadowbanRevival@reddit
It was the... Blurst of times?!
Solarka45@reddit
Insane how Flash Thinking beat OpenAI models. Wonder how the new 2.5 Pro would fare.
OftenTangential@reddit
1.8 vs 1.2 out of 42 isn't really significant, to be fair. At that point all of these models are just outputting random irrelevant word salad; Flash Thinking just chanced into better word salad. FWIW, the bar to get 1/7 on USAMO problems isn't super high; graders often award this for solutions that include vague facts pointing in the direction of an answer, so it's totally possible to get this by guessing.
At this point some AI-based models can do well on hard math problems, but they need to rely on a "skeleton" of a deterministic logic engine; see Google's AlphaGeometry. Even those super-specialized LLM tunes do not do well one-shotting proofs.
Illustrious-Sail7326@reddit
what? 1.8 vs 1.2 is 50% better
ravimohankhanna7@reddit
Maybe the difference between 1.8 and 1.2 is in the margin of error
Due-Memory-6957@reddit
I've been saying for a while now (not that I'm anyone important, but still!) that OpenAI has been more hype and marketing than results; none of their mini-models has been good for anything for me. The competition for open source is Anthropic (and Gemini now), not OpenAI; all they have is brand power.
WonderFactory@reddit
Even QwQ did, at a cost of $0.42 vs $203.44
raiffuvar@reddit
I'm confused where is 2.5?!
Ok-Lengthiness-3988@reddit
This is a preprint of an academic paper. It likely was finalized before the release of Gemini 2.5 Pro Experimental.
Thebombuknow@reddit
I know someone who is a genius when it comes to math (one of the top in our state in the math olympiad) and let me tell you, these questions are fucking insane. At this stage in the olympiad, you're in the top couple thousand in the country (the rest were eliminated in previous rounds), you are given HOURS for each question, and the vast majority of contestants still struggle to get most of the questions right.
It doesn't surprise me that these models can't do well at this. They're language models, not math models. They only "learned" math through their understanding of language and explanations of math concepts. From my experience, the top models are only reliable up to a basic calculus level. Anything past that and you're better off with a college freshman or high schooler who's taken first year calculus, as they'll likely understand the questions better.
Giving LLMs access to the same tools as us definitely helps (e.g. Wolfram Alpha, rather than relying on the model to do math itself), but that still doesn't help with questions more complicated than "solve this integral" or "what is the fifth derivative of _____", because everything past that is far less structured and requires advanced logical/conceptual thinking to solve. Most people who have taken a basic Calculus class would probably agree with me here, Calculus is far more conceptual than it is structured. You can't go through a list of memorized steps like in Algebra, you have to understand all the concepts and how to apply them in unique ways to get the result you want, and that's hard to do when you're a word predictor and not a human with actual thoughts.
I apologize if this was very rambly and far too long, I just wanted to get my thoughts out there.
tl;dr These problems are near impossible to solve for anyone but the absolute best mathematicians, and LLMs are far from being the best for a variety of reasons, primarily because Calculus requires a lot of unique conceptual thinking for each advanced problem, and LLMs aren't capable of memorizing every single possible question, and they aren't capable of conceptual thought either.
smalldickbigwallet@reddit
I fully like the LLM critique here, BUT you should clarify:
An LLM that performed at a 5% level would already be insanely good. If one hit 100% regularly, you probably wouldn't need mathematicians anymore.
AppearanceHeavy6724@reddit
...so naive.
smalldickbigwallet@reddit
I'm a Mathematician. I scored a 12 on the USAMO in the early 2000s.
Work I've done for money in life:
* During college, tutoring / teaching assistant
* During college, worked for a CPA
* An actuary internship fresh out of school
* CS / ML (the majority of my career, local regional companies, later FAANG)
* some minor quant work sprinkled in
I think that there are aspects of all of these jobs that may provide protection, but I would consider all of these as highly likely to be automated if a system had the level of creativity and rigor required to ace the USAMO.
Enough-Meringue4745@reddit
What is the average score for an IQ of 100?
Sad-Elk-6420@reddit
0
Enough-Meringue4745@reddit
What's crazy is to think that these LLMs can get 5% and still do absolutely everything else that it can do well. It's so crazy.
71651483153138ta@reddit
It's not surprising if you're an engineer using LLMs daily. Yes, they help a lot with programming, and they have pretty much replaced Google for me. But anything too complex and they just can't do it unless you break it into small pieces. It still takes a human to piece it all together.
tothatl@reddit
Yep. They are good at the repetitive slop that makes up 80-90% of code.
For humans that's expensive in hours too, so they have a big advantage in creating something from scratch.
But the rest has to be hand-crafted/debugged into actual usability.
Alas, this delusion is what will make many companies lay off a lot of people soon, thinking they can trim that 80-90% of people in one fell swoop, but they will suffer when they have to productize.
Ok_Claim_2524@reddit
I predict the same. Managers often don't have a single clue about what they are managing. One person can easily handle the 20% gap they have to fill in for the LLM and speed up their deliveries a lot, but if that person suddenly has to fill in the gap for what 5 other people were supposed to be doing, it gets much worse. It is not linear, and that's not even touching how much of a dev's time is spent on things that aren't exclusively code.
When do you expect me to actually code when I'm covering the meetings, engineering, infrastructure, etc. that 5 other people were doing?
"Nine women can make a baby in one month, right?"
dogcomplex@reddit
How'd AlphaProof fare? My understanding is that to get high math performance out of LLMs you need to pair them with a long-term-memory theorem resolver. Those have existed for many years and basically just act as a database that finds contradictions. The LLMs are in charge of novel hypothesis generation, entering hypotheses into the DB and reading what they know so far.
perelmanych@reddit
1) Proof questions are really hard, not only for models but for humans too.
2) Proof questions constitute a very small proportion of all Olympiad tasks. My wild guess is around 5-10%. So there is a lack of training data.
3) It is quite difficult to formally check a proof in auto mode. I am aware of proof assistants, but you first need to translate the task into a specific language and then translate every step of the proof.
I think once there are big enough datasets of proof questions and a reliable way to translate both the task and the proof itself into the formalism of provers, we will see a big jump in models' performance.
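To make point 3 concrete, here is roughly what "translating into the formalism of a prover" looks like for a trivial statement. This is a Lean 4 sketch; exact lemma names like `Nat.mul_add` can differ across library versions.

```lean
-- Toy formalization: the sum of two even numbers is even.
-- Every step must type-check; no step can be waved off as "trivial".
theorem even_add_even (a b : Nat)
    (ha : ∃ k, a = 2 * k) (hb : ∃ m, b = 2 * m) :
    ∃ n, a + b = 2 * n := by
  obtain ⟨k, hk⟩ := ha
  obtain ⟨m, hm⟩ := hb
  exact ⟨k + m, by rw [hk, hm, Nat.mul_add]⟩
```

Doing this for an actual USAMO proof means formalizing not just the final statement but every intermediate lemma the informal proof leans on, which is the translation burden the comment describes.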
hann953@reddit
All olympiad questions are proof questions.
perelmanych@reddit
Look at any Olympiad other than the USAMO. When you press any model's score you will see the question and the model's answer.
https://matharena.ai/
hann953@reddit
Since the IMO is proof based most national olympiads are also proof based. I only got to the second round of our national olympiads but they were already proof based.
perelmanych@reddit
Man, maybe now it is different. When I was studying, only the last, hardest questions were proof-based.
dobkeratops@reddit
humans safe for another couple of years..
Warm_Iron_273@reddit
WeRE ClOSe To AgI GuYs. FeEl The SINGuLArItY.
Vervatic@reddit
5 years ago it was shocking that these models could speak english. I would give it more time.
Pyros-SD-Models@reddit
You guys are aware that this paper is basically evaluating the reasoning traces of a model, right?
Making conclusions about actual LLM performance based on their reasoning steps is just bad methodology. You're judging the thought process instead of the outcome. LLMs don't think like humans, and you can't draw any conclusions about their "intelligence" by evaluating them this way. Every LLM "thinks" differently depending on how post-training was designed. Reasoning traces aren't the only form of "thinking" an LLM does, and you'd first need to evaluate in detail how a specific model even uses its reasoning traces, similar to how Anthropic did in their interpretability paper:
https://transformer-circuits.pub/2025/attribution-graphs/biology.html
Reading that paper will also help you understand why the text a model outputs during reasoning says nothing about what's happening inside the model. OP's paper misses this completely, which is honestly mind-blowing.
They're essentially hallucinating their way to a solution, and that process doesn't have to look like linear, step-by-step human reasoning. Nor should it. Forcing a model to mimic human reasoning just to be interpretable would actually make it worse.
Did you forget the Meta paper about letting the LLM reason in its own internal language or latent representation? "0 points, reasoning not readable." Come on. https://arxiv.org/abs/2412.06769
But that's exactly what even current reasoning LLMs do; their internal language just happens to have some surface-level similarities with human language, but that's all. RL post-training is like 0.00001% of total training steps, and people are like "look at the model being stupid in its reasoning".
Here's a real paper that actually understands the limitations of using straight math olympiad questions (which the above paper either completely ignores, which would be strange bias, or didn't know about, which would be strange incompetence) and won't get absolutely destroyed in peer review:
https://arxiv.org/pdf/2410.07985
Also, the math olympiad is one of the hardest math competitions out there, and the average human would score exactly 0%, especially with the methodology used in that paper. Which makes it even more stupid, because we have no idea how an undergrad, a PhD, or anyone else performs on this benchmark. How do we even know 5% is "horrible"? What's the ground truth? The comparator?
kiriloman@reddit
All these benchmarks are pretty silly. I can train a model on a given benchmark so it scores 100% there. If the benchmark is math, that doesn't mean the model can solve complex tasks. LLM providers are gaming the system to convince others that they are doing good work.
OmarBessa@reddit
I mean. This is good news.
More years to escape the apocalypse.
05032-MendicantBias@reddit
I think all SOTA models have common benchmarks IN the training data, making those benchmarks useless.
When someone tries another evaluation, or even shuffles and fudges previous evaluations, the score collapses.
LLMs are good for lots of tasks, but there is no general problem-solving intelligence in there.
Physical-Iron-9839@reddit
They didn't evaluate a Gemini 2.5 agentic loop equipped with Lean, and we should take this seriously?
Affectionate-Tax1389@reddit
Even though the scores are mediocre, R1, which was the cheapest to train to my knowledge, performed better than the others.
im_not_here_@reddit
Unless it used data from other LLMs.
It seems pretty misleading that we are ignoring that the cost of training one has to include the cost of training the model(s) used to generate the data, because they have to exist.
lordpuddingcup@reddit
Sounds like the issue is the reasoning step training is flawed in some way in these models
WowSoHuTao@reddit
Claude can’t even beat Pokémon Red
plankalkul-z1@reddit
I'm a "glass half-full" type, so seeing that
QwQ is on par with o1-Pro and beats o3-mini overall, plus beats everyone but Flash-thinking handily on P1,
R1 beats everyone including Claude 3.7 (non-thinking?..) on total score,
all I can say is "not bad, not bad at all!"
RipleyVanDalen@reddit
AI text in the post 👎🏻👎🏻
Weak-Abbreviations15@reddit
LMAO, shows that they finetune models to pass benchmarks.
perelmanych@reddit
How fast we went from complaining that models can't correctly compare 9.11 and 9.6 to complaining that models can't prove Fermat's Last Theorem.
Feztopia@reddit
It's shocking that these models, which were trained for many different tasks, can't beat a task made for individuals who specialized in one field? Lol? If they were already able to ace the best mathematicians at math, they would also be able to ace everyone else at anything. Not everyone is a mathematician. I'm sure they can do better math than the average person around me. They can code better than the average person around me (most of them can't code at all). They know English grammar better than me. This is just the beginning of the story. Compare a midrange smartphone of today with the top models of the first smartphones. Compare the capabilities of a Nintendo Switch to the NES. That's how tech evolves.
Lone_void@reddit
The math Olympiad is for high schoolers. These high schoolers can grow up to be amazing mathematicians, but at the time of taking the exam they are hardly the best mathematicians you claim they are.
So yeah, LLMs cannot beat high schoolers.
AppearanceHeavy6724@reddit
I think I can solve Problem #1; I am not a mathematician, just a rando SDE with some basic number theory knowledge, and it cannot beat even me, let alone high schoolers.
Sad-Elk-6420@reddit
I think the issue here isn't getting the solution, but proving things without using high-level reasoning. Those are quite different problems. The models probably make valid jumps, but those jumps simply aren't allowed because they aren't sufficiently low-level. I have only found Gemini 2.5 able to stick to low-level reasoning.
QuantumPancake422@reddit
more like "LLMs cannot beat the smartest highschoolers in the country"
alongated@reddit
These results are not shocking given the 'billions of dollars that have been poured into it'.
CoUsT@reddit
Honestly, expected result if you consider architecture and technical limitations.
muchcharles@reddit
It shouldn't be harder than FrontierMath, except FrontierMath was apparently secretly funded by OpenAI and there is an accusation that they had the problem set.
Healthy-Nebula-3603@reddit
Ehh... that math is far more complex than AIME.
TheInfiniteUniverse_@reddit
It makes sense that R1 beat everyone, but how can the cost of o3-mini be "lower" than R1's?!
cnnyy200@reddit
While intelligence is about recognition, that's not the whole picture of a thinking process.
Neomadra2@reddit
What are the implications? There are benchmarks like AIME where these reasoning models excel. Did they just overfit on AIME-like questions and for other kinds of questions they fail?
Limp_Brother1018@reddit
If Agda, Coq, and Lean had the same level of datasets as TypeScript and Python, the situation might be different.
JLeonsarmiento@reddit
Asian kid still does better tho (R1).
ain92ru@reddit
https://lemmata.substack.com/p/coaxing-usamo-proofs-from-o3-mini
FiTroSky@reddit
Turns out that models tested on a benchmark they're not trained to ace are actually bad.
PeachScary413@reddit
Well... we haven't trained our model on this benchmark yet, just wait a couple of more releases and it will be 80% 😊👌
AvidCyclist250@reddit
Apparently, LLMs are good enough for reddit submissions though.
AppearanceHeavy6724@reddit
Ahaha runnable on potato machine QWQ smashed o1-pro. ewwww.
TheRealGentlefox@reddit
They have the exact same score.
phhusson@reddit
and 500 times cheaper
GKGriffin@reddit
Are we really surprised that the transformer model, however advanced an iteration we are talking about, is not able to do a job it was not designed for in the slightest?
Also, this is more proof that we are not close to AGI and AI CEOs talk out of their asses.
arg_max@reddit
The key word here is proof-based. All the reasoning RLHF is done on calculations where you can easily evaluate the answer against ground truth. These can be very complex calculations sometimes, but they're not proofs. To evaluate a proof, you have to check every step, and to do that you need a complex LLM judge (or you'd need to parse the entire proof into an automatic proof-validation tool). OP mentioned the issue with self-evaluation of proofs in his post, which means you cannot just use your own model to check the proof and use that as a reward signal.
This is a huge limitation for any kind of reasoning training, because it assumes that finding the answer might be hard but checking an answer has to be easy. However, if you look at theoretical computer science, sometimes even deciding whether a solution is correct can be NP-hard.
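The asymmetry is easy to see in code: a reward for a boxed final answer is a few lines, while there is no comparably cheap check for a free-form proof. This is a hypothetical sketch; the function name and regex are mine, not taken from any actual training stack.

```python
import re

def boxed_answer_reward(model_output: str, ground_truth: str) -> float:
    """Answer-based RL reward: extract the last \\boxed{...} in the output
    and compare it to ground truth. Trivial to automate, which is why RL
    training gravitates toward 'final answer' problems rather than proofs."""
    matches = re.findall(r"\\boxed\{([^}]*)\}", model_output)
    if not matches:
        return 0.0  # no final answer produced
    return 1.0 if matches[-1].strip() == ground_truth.strip() else 0.0
```

Nothing this simple exists for grading whether each step of a natural-language proof is justified, which is the gap the comment is pointing at.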
kvothe5688@reddit
huh impressed with flash thinking. at that speed that model is criminally good
davikrehalt@reddit
why do you talk like a LLM ...
davikrehalt@reddit
anyway, it won't be the case by EOY.
Cuplike@reddit
Lol, lmao
Cuplike@reddit
>this result is shocking
Only shocking to people that don't understand how LLM's work
masc98@reddit
wait till they train on them as well.
Non-profit business idea: closed-source AI benchmarking datasets. The data goes in a secure vault.
Companies building world models, closed or open source, it doesn't matter, should provide an unbiased benchmark of their models' performance. Since they are scraping the web, public benches are fed to models as well, one way or another (filtering at that scale is difficult).
The AI bubble will burst for this too. First law of machine learning: don't test your model on training data. That's cheating. Especially for companies trying to build AGI and raising billions of dollars.
We should build something like ImageNet by Fei-Fei Li, but private, with the goal of providing an off-the-grid test dataset for aspiring, real world models.
shadowbyter@reddit
I wonder how few shot prompting would affect the reasoning-based models in a positive way.
New_World_2050@reddit
Hmm what did the new Gemini get ?
custodiam99@reddit
Well, it was obvious from the beginning. Stochastic plagiarism is not human intellect. QwQ 32B made all the AGI hype laughable. These are input-output mathematical language transformers, nothing more.
cant-find-user-name@reddit
Someone post this on r/singularity
ResidentPositive4122@reddit
These models were trained with RL for boxed{answer}, not boxed{theorem proving here}...
If you want USAMO, check out AlphaGeometry and the like: things trained specifically for that.