I thought I would share this article because I think it's a pretty big marker in AI progress. It does sound like there's a bit of a confidence effect going on - LLMs/LRMs tend to always sound confident in their conclusions. When we're dealing with work at the edge of human performance, it's normal for even brilliant, educated people to make mistakes and doubt themselves. It reminds me of chess champion Kasparov despairing at Deep Blue's abilities, even though it turned out the program had made a significant error. Yes, Deep Blue was still flawed, but the writing was on the wall: humans were about to be outdone at chess.
Mathematics is far more open-ended than chess, but I suppose most problems could be boiled down to a system of rules being used to reach a specific, well-defined goal. For that reason, it seems inevitable to me that many classes of mathematical research will simply be taken over by AI. The work being done on software proof assistants is going to be vital for humans to have confidence in the results. Otherwise, AI's stamina, speed, and ability to combine eclectic fields of math that humans just don't have time to master are going to overwhelm our ability to validate the work.

From the article:

Epoch AI hired Elliot Glazer, who had recently finished his math Ph.D., to join the new collaboration for the benchmark, dubbed FrontierMath, in September 2024. The project collected novel questions over varying tiers of difficulty, with the first three tiers covering undergraduate-, graduate- and research-level challenges. By April 2025, Glazer found that o4-mini could solve around 20 percent of the questions. He then moved on to a fourth tier: a set of questions that would be challenging even for an academic mathematician. Only a small group of people in the world would be capable of developing such questions, let alone answering them. The mathematicians who participated had to sign a nondisclosure agreement requiring them to communicate solely via the messaging app Signal. Other forms of contact, such as traditional e-mail, could potentially be scanned by an LLM and inadvertently train it, thereby contaminating the dataset.
Each problem the o4-mini couldn’t solve would garner the mathematician who came up with it a $7,500 reward. The group made slow, steady progress in finding questions. But Glazer wanted to speed things up, so Epoch AI hosted an in-person meeting on Saturday, May 17, and Sunday, May 18. There, the participants would finalize the last batch of challenge questions. The 30 attendees were split into groups of six. For two days, the academics competed among themselves to devise problems that they could solve but would trip up the AI reasoning bot.
By the end of that Saturday night, Ono was frustrated with the bot, whose unexpected mathematical prowess was foiling the group’s progress. “I came up with a problem which experts in my field would recognize as an open question in number theory—a good Ph.D.-level problem,” he says. He asked o4-mini to solve the question. Over the next 10 minutes, Ono watched in stunned silence as the bot unfurled a solution in real time, showing its reasoning process along the way. The bot spent the first two minutes finding and mastering the related literature in the field. Then it wrote on the screen that it wanted to try solving a simpler “toy” version of the question first in order to learn. A few minutes later, it wrote that it was finally prepared to solve the more difficult problem. Five minutes after that, o4-mini presented a correct but sassy solution. “It was starting to get really cheeky,” says Ono, who is also a freelance mathematical consultant for Epoch AI. “And at the end, it says, ‘No citation necessary because the mystery number was computed by me!’”
Defeated, Ono jumped onto Signal early that Sunday morning and alerted the rest of the participants. “I was not prepared to be contending with an LLM like this,” he says. “I’ve never seen that kind of reasoning before in models. That’s what a scientist does. That’s frightening.”
...
While sparring with o4-mini was thrilling, its progress was also alarming. Ono and He express concern that the o4-mini’s results might be trusted too much. “There’s proof by induction, proof by contradiction, and then proof by intimidation,” He says. “If you say something with enough authority, people just get scared. I think o4-mini has mastered proof by intimidation; it says everything with so much confidence.”
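This "proof by intimidation" worry is exactly what proof assistants address: a claim checked by Lean's kernel is accepted because it type-checks, not because it sounds confident. As a minimal sketch (a toy theorem I picked for illustration, not one of the benchmark problems), here's what a fully machine-checked proof looks like in Lean 4:

```lean
-- A tiny machine-checked fact: the sum of two even numbers is even.
-- If this file compiles, the kernel has verified every step; no
-- authority or confidence is involved.
theorem even_add_even (m n : Nat)
    (hm : ∃ a, m = 2 * a) (hn : ∃ b, n = 2 * b) :
    ∃ c, m + n = 2 * c := by
  cases hm with
  | intro a ha =>
    cases hn with
    | intro b hb =>
      -- Witness: c = a + b, since 2*a + 2*b = 2*(a + b).
      exact ⟨a + b, by rw [ha, hb, Nat.mul_add]⟩
```

The point of the workflow isn't that an AI can't write such a proof; it's that whoever (or whatever) writes it, a human only has to trust the small, well-audited proof checker rather than the author's tone.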