Do you think an AI will achieve a gold medal in the 2025 International Math Olympiad (tomorrow)?
Posted by mathsTeacher82@reddit | LocalLLaMA | View on Reddit | 30 comments
The International Math Olympiad will take place on 15th and 16th July in Australia. Google DeepMind will attempt to win a gold medal with their models AlphaProof and AlphaGeometry, after announcing a silver medal performance in 2024. Any open-source model that wins a gold medal will receive a $5 million AIMO prize from XTX Markets.
https://youtu.be/vJjgtOcXq8A
AccomplishedBuy9768@reddit
All they need to do is put some Arc-AGI-2 puzzles in the olympiad and all models would fail.
offlinesir@reddit
Yes, I assume so. It's been a year, and last year they got silver, and after a whole year of training and developments in reasoning (not sure if that's how the Alpha models work, though) they should get gold.
pigeon57434@reddit
Not only did AlphaProof get silver last year, I believe it was literally only 1 point away from gold. They can't possibly not get gold this year if they tried.
Mart-McUH@reddit
Was it only geometry or all tasks?
Either way it required human input afaik, e.g. they somehow preprocessed the problems, which imo can't be counted as success. E.g. if a teacher knows how a student thinks, he could probably preprocess a problem and formulate it in such a way that the student will solve it, even though he might not grasp the original.
Once they get just raw inputs (text+image like the competitors) and solve it from that, then I will count it as success.
mycall@reddit
What are the odds with the bookies?
Figai@reddit
I mean a lot of the papers people are adding are about large reasoning models, which are totally different to the neurosymbolic solvers like the models you mentioned.
So that work is a bit redundant; the paper explicitly said the training methodology and the CoT systems in those models might be inadequate. Quite literally, neurosymbolics is the fix, though I am a little biased.
Honestly, it feels like a bit of a cheat in comparison. I mean, calculators and external proof languages are banned for the human competitors, and the symbolic part of these systems is quite literally that. I don't know how AlphaProof works exactly, but I know AlphaGeometry definitely has some formal verification built in.
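For reference, a formally verified proof in a proof assistant like Lean looks roughly like the toy example below: the kernel only accepts it if every step checks out. This is just an illustrative sketch in Lean 4, not anything taken from AlphaProof itself.

```lean
-- Toy illustration of formal verification in Lean 4 (not from AlphaProof).
-- The kernel checks that the proof term really establishes the stated claim.
theorem add_comm_example (a b : Nat) : a + b = b + a := by
  exact Nat.add_comm a b   -- reuse the library lemma; the checker verifies it applies

-- Even arithmetic facts are checked rather than trusted:
example : 2 + 2 = 4 := rfl  -- 'rfl' succeeds only because both sides compute to 4
```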
Honestly, I'd be very surprised if they didn't get a gold. They are simply not on an equal footing with the poor lone humans who have to manage without a calculator. I know how hard the IMO is; I've gotten to BMO Round 2 (I've doxxed myself), which is just before IMO team selection in my country. The prep for it is ridiculous, I mean it relies on a lot of pattern recognition and intuition.
If it's true they're using neurosymbolic systems, those are probably gonna be fed huge numbers of samples and different ideas, thanks to insane amounts of compute. I mean, it's gonna be a similar story to AlphaZero and Stockfish: AlphaZero won out, but it was running on Google's own huge TPUs compared to whatever TCEC permits for engines.
DerekMorr@reddit
"our results suggest that current LLMs are inadequate for rigorous mathematical reasoning tasks" https://arxiv.org/pdf/2503.21934
When you look at the detailed answers, performance falls apart.
Also, from a different paper, "Our analyses demonstrate that the occasional correct final answers provided by LLMs often result from pattern recognition or heuristic shortcuts rather than genuine mathematical reasoning. These findings underscore the substantial gap between LLM performance and human expertise in advanced mathematical reasoning and highlight the importance of developing benchmarks that prioritize the soundness of the reasoning used to arrive at an answer rather than the mere correctness of the final answers" https://arxiv.org/pdf/2504.01995
Tarekun@reddit
AlphaProof/AlphaGeometry are not simple LLMs actually, much less the recent AlphaEvolve from DeepMind, so I'm not sure this really applies.
IrisColt@reddit
I've spent the last two years testing advanced math tasks on various anonymous SOTA models, a few of which give off strong Google vibes, over at LMArena. Throughout most of 2023, even the biggest models couldn't handle these problems; they were simply out of reach. But in just the past half-year, new models have started solving them consistently. At this pace, winning a Math Olympiad gold medal tomorrow doesn't seem out of the question, heh!
mileseverett@reddit
The question is: did the models get smarter, or did the tasks end up in the data distribution?
IrisColt@reddit
I’m only able to discuss my tasks, but before I got the chance to choose the right answer, I…
AppearanceHeavy6724@reddit
I wonder if they can achieve a gold medal on, say, the 1995 math olympiad, i.e. old material they probably have not been trained much on.
randomrealname@reddit
There is no past data they haven't been trained on. Passing a future test is the acid test.
AppearanceHeavy6724@reddit
Still interesting though, as they optimised training towards the more recent corpus. Old stuff may not even be in the training data at all, or may be completely clobbered by newer training data.
ResidentPositive4122@reddit
There's probably 0 chance a problem from 1995 isn't in plenty of places already. Forums, logs, blogs, sites, universities, re-interpreted problems and so on... In fact even aime25 had some problems that were already present on the web, some word-for-word, some very close.
AppearanceHeavy6724@reddit
You are making these bold statements, but as I said, the truth is far more complex than that. Lots of facts are present in enormous amounts on the web, say, which Nirvana songs were on which album, but LLMs still hallucinate the answer. Lots of old problems were mundane, do not have a large presence on the internet, and have become essentially obscure, especially problems from smaller nations.
It is a pointless conversation, I think. I believe that performance on, say, the 1995 math olympiad would be worse than on the 2023 one; you believe the opposite. The only way to know is to check. Shrug.
randomrealname@reddit
Delusion.
ResidentPositive4122@reddit
Terence Tao thinks they will. And iiuc they'll have some "real-time" tests as well this year, where the teams get the problems at the same time and have roughly the same amount of time to solve them (last year Google had 24/48h iirc).
mathsTeacher82@reddit (OP)
Interesting. Meatbags, haha, that's the last word I would think to describe IMO kids 🤣
ResidentPositive4122@reddit
Haha, it's a term used in sci-fi when robots discuss lowly humans :)
IrisColt@reddit
"No brain?"
"Oh, there is a brain all right. It's just that the brain is made out of meat!"
"So... what does the thinking?"
"You're not understanding, are you? The brain does the thinking. The meat."
"Thinking meat! You're asking me to believe in thinking meat!"
__JockY__@reddit
You might enjoy this: https://www.mit.edu/people/dpolicar/writing/prose/text/thinkingMeat.html
IrisColt@reddit
Remindme! 2 days
RemindMeBot@reddit
I will be messaging you in 2 days on 2025-07-15 11:45:32 UTC to remind you of this link
ninjasaid13@reddit
Probably, but just like coding benchmarks, it will mean nothing in the real world.
shaman-warrior@reddit
Wut
LevianMcBirdo@reddit
Last year they used very specialized tools to achieve this, and the most interesting thing is whether they can achieve the same result with something more general.
EternalOptimister@reddit
No they won’t
medialoungeguy@reddit
Yes
SlowFail2433@reddit
If they got silver last year then yes I expect gold