
AI Just Took the World’s Hardest Maths Test And Humans Won

Artificial intelligence has been on a winning streak lately: cracking decades-old puzzles, mastering complex games, and writing code at superhuman speed. But when it comes to the frontier of mathematical research, human minds are still firmly ahead.

That’s the takeaway from First Proof, a new project that put four leading AI systems through what may be the toughest maths challenge ever designed for machines.

The test

Researchers handed the AI models ten genuine research-level problems: questions that working mathematicians had recently solved but had not yet published. A panel of anonymous expert reviewers then graded the models’ answers. When the results went live on 10 June, the verdict was unambiguous: not a single model reached the standard of a top human mathematician.

Why this test matters

This was the first benchmark to combine three crucial conditions at once: research-level difficulty, problems that don’t appear anywhere in AI training data, and formal grading by expert mathematicians. By sourcing questions from researchers’ unpublished work, First Proof closed a long-standing loophole: earlier benchmarks were often criticised because models may have simply memorised answers seen during training rather than reasoning them out.

Who competed

OpenAI was the only major tech firm to enter a commercial model (ChatGPT 5.5 Pro). The other three systems came from academic teams at UCLA, Princeton, and ETH Zurich. Notably absent were Google’s maths-focused Aletheia and the full version of Anthropic’s Claude Mythos, which couldn’t be officially entered because the test required ruling out any human assistance.

The AI systems also showed a familiar flaw: hallucination. Even when told to verify their references, they produced factually wrong outputs, a serious problem in a field where precision is everything.

The bigger picture

It’s not all bad news for AI. Earlier this year, an OpenAI model made headlines by cracking an 80-year-old problem first posed by Hungarian mathematician Paul Erdős. But as First Proof shows, solving a known historical puzzle is very different from tackling a brand-new research problem.

The team behind the test sees future rounds not as a way to measure whether AI can replace mathematicians, but as a way to explore how it might assist them: checking proofs for errors, suggesting new lines of inquiry, and eventually working independently in narrow areas. For now, though, the message from the world’s hardest maths test is clear: the humans are still winning.

Source: Gulf News