Do you think an AI will achieve a gold medal in the 2025 International Math Olympiad (tomorrow)?
Posted by mathsTeacher82@reddit | LocalLLaMA | View on Reddit | 30 comments
The International Math Olympiad will take place on 15th and 16th July in Australia. Google DeepMind will attempt to win a gold medal with their models AlphaProof and AlphaGeometry, after announcing a silver medal performance in 2024. Any open-source model that wins a gold medal will receive a $5 million AIMO prize from XTX Markets.
https://youtu.be/vJjgtOcXq8A
AccomplishedBuy9768@reddit
All they need to do is put some Arc-AGI-2 puzzles in the olympiad and all models would fail.
offlinesir@reddit
Yes, I assume so. It's been a year, and last year they got silver, and after a whole year of training and developments in reasoning (not sure if that's how the Alpha models work, though) they should get gold.
pigeon57434@reddit
Not only did AlphaProof get silver last year, I believe it was literally only 1 point away from gold. They can't possibly not get gold this year if they tried.
Mart-McUH@reddit
Was it only geometry or all tasks?
Either way it required human input afaik, e.g. they somehow preprocessed the problems, which imo can't be counted as success. E.g. if a teacher knows how a student thinks, he could probably preprocess a problem and formulate it in such a way that the student will solve it, even though he might not grasp the original.
Once they get just raw inputs (text+image like the competitors) and solve it from that, then I will count it as success.
mycall@reddit
What are the odds with the bookies?
Figai@reddit
I mean a lot of the papers people are adding are about large reasoning models, which are totally different to the neurosymbolic solvers like the models you mentioned.
So that work is a bit redundant; the paper explicitly said the training methodology and the CoT systems in those models might be inadequate. Quite literally, neurosymbolics is the fix, though I am a little biased.
Honestly, it feels like a bit of a cheat in comparison. I mean, calculators and external proof languages are banned for the human competitors, and the symbolic part of these systems is quite literally that. I don't know how AlphaProof works exactly, but I know AlphaGeometry definitely has some formal verification built in.
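For reference, a formally verified proof in a proof assistant like Lean looks roughly like the toy example below: the kernel only accepts it if every step checks out. This is just an illustrative sketch in Lean 4, not anything taken from AlphaProof itself.

```lean
-- Toy illustration of formal verification in Lean 4 (not from AlphaProof).
-- The kernel checks that the proof term really establishes the stated claim.
theorem add_comm_example (a b : Nat) : a + b = b + a := by
  exact Nat.add_comm a b   -- reuse the library lemma; the checker verifies it applies

-- Even arithmetic facts are checked rather than trusted:
example : 2 + 2 = 4 := rfl  -- 'rfl' succeeds only because both sides compute to 4
```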
Honestly, I'd be very surprised if they didn't get a gold. They are simply not on an equal footing with the poor lone humans who have to manage without a calculator. I know how hard the IMO is; I've gotten to BMO Round 2 (I've doxxed myself), which is just before IMO team selection in my country. The prep for it is ridiculous, I mean it relies on a lot of pattern recognition and intuition.
If it's true they're using neurosymbolic systems, those are probably gonna be fed huge numbers of samples and different ideas, thanks to insane amounts of compute. I mean, it's gonna be a similar story to AlphaZero and Stockfish: AlphaZero won out, but it was running on Google's own huge TPUs compared to whatever TCEC permits for engines.
DerekMorr@reddit
"our results suggest that current LLMs are inadequate for rigorous mathematical reasoning tasks" https://arxiv.org/pdf/2503.21934
When you look at the detailed answers, performance falls apart.
Also, from a different paper, "Our analyses demonstrate that the occasional correct final answers provided by LLMs often result from pattern recognition or heuristic shortcuts rather than genuine mathematical reasoning. These findings underscore the substantial gap between LLM performance and human expertise in advanced mathematical reasoning and highlight the importance of developing benchmarks that prioritize the soundness of the reasoning used to arrive at an answer rather than the mere correctness of the final answers" https://arxiv.org/pdf/2504.01995
Tarekun@reddit
AlphaProof/AlphaGeometry are not simple LLMs actually, much less the recent AlphaEvolve from DeepMind, so I'm not sure this really applies.
IrisColt@reddit
I've spent the last two years testing advanced math tasks on various anonymous SOTA models, a few of which give off strong Google vibes, over at LMArena. Throughout most of 2023, even the biggest models couldn't handle these problems; they were simply out of reach. But in just the past half-year, new models have started solving them consistently. At this pace, winning a Math Olympiad gold medal tomorrow doesn't seem out of the question, heh!
mileseverett@reddit
The question is: did the models get smarter, or did the tasks end up in the data distribution?
IrisColt@reddit
I’m only able to discuss my tasks, but before I got the chance to choose the right answer, I…
AppearanceHeavy6724@reddit
I wonder if they can achieve a gold medal on, say, the 1995 math olympiad, i.e. old material they probably have not been trained much on.
randomrealname@reddit
There is no past data they haven't been trained on. Passing a future test is the acid test.
AppearanceHeavy6724@reddit
Still interesting though, as they optimised training towards the more recent corpus. Old stuff may not even be in the training data at all, or may be completely clobbered by newer training data.
ResidentPositive4122@reddit
There's probably 0 chance a problem from 1995 isn't in plenty of places already. Forums, logs, blogs, sites, universities, re-interpreted problems and so on... In fact even aime25 had some problems that were already present on the web, some word-for-word, some very close.
AppearanceHeavy6724@reddit
You are making these bold statements, but as I said, the truth is far more complex than that. Lots of facts are present in enormous amounts on the web, say, which Nirvana songs were on which album, but LLMs still hallucinate the answer. Lots of old problems were mundane, do not have a large presence on the internet, and have become essentially obscure, especially problems from smaller nations.
It is a pointless conversation, I think. I believe that performance on, say, the 1995 math olympiad would be worse than on the 2023 one; you believe the opposite. The only way to know is to check. Shrug.
randomrealname@reddit
Delusion.
ResidentPositive4122@reddit
Terence Tao thinks they will. And iiuc they'll have some "real-time" tests as well this year, where the teams get the problems at the same time and have roughly the same amount of time to solve them (last year Google had 24/48h iirc).
mathsTeacher82@reddit (OP)
Interesting. Meatbags, haha, that's the last word I would think to describe IMO kids 🤣
ResidentPositive4122@reddit
Haha, it's a term used in sci-fi when robots discuss lowly humans :)
IrisColt@reddit
"No brain?"
"Oh, there is a brain all right. It's just that the brain is made out of meat!"
"So... what does the thinking?"
"You're not understanding, are you? The brain does the thinking. The meat."
"Thinking meat! You're asking me to believe in thinking meat!"
__JockY__@reddit
You might enjoy this: https://www.mit.edu/people/dpolicar/writing/prose/text/thinkingMeat.html
IrisColt@reddit
Remindme! 2 days
RemindMeBot@reddit
I will be messaging you in 2 days on 2025-07-15 11:45:32 UTC to remind you of this link
ninjasaid13@reddit
Probably, but just like coding benchmarks, it will mean nothing in the real world.
shaman-warrior@reddit
Wut
LevianMcBirdo@reddit
Last year they used very specialized tools to achieve this, and the most interesting thing is whether they can achieve the same result with something more general.
EternalOptimister@reddit
No they won’t
medialoungeguy@reddit
Yes
SlowFail2433@reddit
If they got silver last year then yes I expect gold