AI Research | 7/25/2025

When Humanity's Last Exam for AI Goes Wrong: The Flawed Questions Behind the Test

The ambitious AI benchmark, "Humanity's Last Exam," is under fire after researchers found that many of its questions are flawed. This raises serious concerns about the reliability of AI evaluations.

So, picture this: you’ve got a big exam coming up, one that’s supposed to measure just how smart AI has gotten. It’s called "Humanity's Last Exam" (HLE), and it’s touted as the ultimate test for artificial intelligence. But here’s the kicker—new research is showing that a lot of the questions on this exam might be, well, kinda messed up. Yeah, you heard that right.

Researchers from a company called FutureHouse took a deep dive into HLE, and what they found was pretty shocking: about 29% of the biology and chemistry questions they checked had answers that were wrong or misleading when stacked up against the published scientific literature. Imagine studying for a test only to find out that the answer key is based on bad info. That's a huge deal, especially when you think about how much we rely on these benchmarks to gauge AI's progress.

Now, let’s rewind a bit. The whole idea behind HLE was to tackle something called "benchmark saturation." That’s when AI models start posting near-perfect scores on existing tests, making it tough to tell whether they’re actually getting any better. So the folks at the Center for AI Safety (CAIS) and Scale AI thought, "Let’s create a super tough exam that really pushes AI to its limits!" They pulled together around 2,500 to 3,000 tricky questions from nearly 1,000 experts, including university professors and top researchers from over 500 institutions. They even threw in a $500,000 prize pool to get people excited about submitting the hardest questions they could think of.

But here’s the thing: the way they went about creating this exam might’ve backfired. The main selection criterion was whether a question could stump current AI models, not whether its answer actually held up. Sounds good on paper, right? But it meant the review process didn’t require experts to spend more than five minutes checking a question’s accuracy. Can you imagine? It’s like trying to bake a cake but only giving yourself five minutes to make sure the oven’s working. You might end up with a burnt mess instead of a delicious dessert.

FutureHouse’s analysis showed that this focus on stumping AI led to some pretty bizarre questions. One standout example? A question asked what the rarest noble gas on Earth was back in 2002, and the expected answer was oganesson. Now, oganesson is a synthetic element whose few atoms existed for just a blink of an eye after being made at a nuclear research institute in Russia. It’s not even a naturally occurring substance! Talk about a curveball.

So, what does this mean for the AI industry? Well, if a benchmark that’s supposed to be the gold standard is riddled with errors, it really shakes up the whole evaluation process. Top models like OpenAI’s o1, Google’s Gemini, and Anthropic’s Claude have all taken the HLE, and none of them scored all that well. At first glance, it looked like there was still a big gap between AI and human experts. But if a chunk of the questions are fundamentally flawed, those low scores might not be a true reflection of the AI’s abilities. It’s like getting a bad grade because the answer key was wrong, not because you didn’t know the material.
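
To put some rough numbers on that, here’s a quick back-of-the-envelope sketch in Python. The 29% figure is FutureHouse’s estimate for HLE’s biology and chemistry questions; the "true accuracy" values and the scoring assumptions are purely illustrative, not anything measured on real models.

```python
# Back-of-the-envelope sketch: how wrong answer keys drag down a benchmark score.
# Assumption (pessimistic and purely illustrative): a correct answer is always
# marked wrong when the answer key is bad, and a wrong answer never happens to
# match a bad key by accident.

def observed_score(true_accuracy: float, bad_key_rate: float) -> float:
    """Score the grader reports when a fraction of the answer keys is wrong."""
    return true_accuracy * (1 - bad_key_rate)

BAD_KEY_RATE = 0.29  # FutureHouse's estimate for HLE's bio/chem questions

for true_acc in (0.30, 0.60, 0.90):
    print(f"true accuracy {true_acc:.0%} -> reported score "
          f"{observed_score(true_acc, BAD_KEY_RATE):.0%}")
```

Under those assumptions, even a model that genuinely gets 90% of the material right would only be credited with about 64%, so part of the apparent gap to human experts could just be the answer key talking.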

This whole situation highlights a major challenge in the AI world. As these models get more powerful, creating accurate and genuinely challenging benchmarks is getting tougher. The creators of HLE even set up a bug bounty program to fix errors after the exam was released, acknowledging that making a perfect dataset is a tall order. But FutureHouse’s findings suggest that the problems might be more deeply rooted in how the exam was designed and reviewed.

As we keep pushing the envelope in AI development, this episode is a wake-up call. The tools we use to measure AI need to be checked just as rigorously as the AI models themselves. If we want to make real progress, we’ve gotta make sure the benchmarks we rely on are solid. Otherwise, we’re building a house of cards that could come tumbling down at any moment.