Humanity’s Last Exam Leaderboard Scores: Why the World’s Hardest Benchmark is Breaking AI

Humanity’s Last Exam Leaderboard Scores and AI Failure Analysis

Quick Summary: Key Takeaways

  • The "Failing" Reality: Even the most advanced models like Gemini 3 Pro are scoring under 40%, proving they still lack true expert-level reasoning.
  • Saturation Proof: Unlike the MMLU, which models "aced" years ago, Humanity's Last Exam (HLE) is designed to be "Google-proof."
  • Detection Signal: Low HLE scores indicate that AI struggles with complex, multi-step reasoning, a key differentiator for identifying human authorship.
  • The Gap: There is currently a massive chasm between "fluent" AI writing and "factually reasoned" expert content.

The New Standard for "Superintelligence" (And Why AI Is Failing It)

This deep dive is part of our extensive guide on Best AI Mode Checkers 2026: The Tools That Prove What’s Human (and What’s Not).

For years, we watched AI models crush standardized tests. They passed the Bar Exam, aced the SATs, and destroyed the MMLU benchmark. But in 2026, the party is over.

Humanity’s Last Exam (HLE) has humbled the giants. Created by the Center for AI Safety (CAIS) and Scale AI, this benchmark wasn't built to see if AI is smart; it was built to see where it breaks.

Unlike previous tests that relied on information you could find on Wikipedia, HLE requires novel reasoning. It asks questions that have never been seen on the internet before, written by PhDs in niche fields ranging from abstract mathematics to obscure history.

If you are trying to verify whether a text was written by a human expert, this leaderboard is your cheat sheet.

Current Leaderboard Scores (January 2026)

The data is shocking. In a world where we assume AI knows everything, the top models are barely clearing a third of the exam. This confirms that while AI is great at mimicking style, it still struggles to reason its way to new, complex truths.

AI Model Rank   Model Name              HLE Accuracy Score
#1              Google Gemini 3.0 Pro   38.3%
#2              OpenAI GPT-5.2          29.9%
#3              Claude Opus 4.5         25.8%
#4              DeepSeek 3.2            21.8%

Note: Scores reflect the "pass@1" accuracy on the closed-ended reasoning subset.
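
Curious what "pass@1" actually measures? Here is a minimal Python sketch: one attempt per question, no retries, no external tools. The toy questions, the ask_model call, and the exact-match grading are simplified placeholders of ours, not the official HLE evaluation harness.

```python
# Minimal sketch of pass@1 accuracy: one attempt per question, graded strictly.
# The questions, ask_model(), and exact-match grading are illustrative placeholders,
# not the official HLE harness.

def pass_at_1(questions, ask_model):
    """Fraction of questions answered correctly on the first (and only) try."""
    correct = 0
    for q in questions:
        answer = ask_model(q["prompt"])      # single attempt, no tools, no retries
        if answer.strip() == q["expected"]:  # exact-match grading for simplicity
            correct += 1
    return correct / len(questions)

toy_questions = [
    {"prompt": "2 + 2 = ?", "expected": "4"},
    {"prompt": "Capital of France?", "expected": "Paris"},
]

# A "model" that always answers "4" gets 1 of 2 right: pass@1 = 0.5 (50%).
print(pass_at_1(toy_questions, lambda prompt: "4"))
```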

What Do These Low Scores Mean for You?

If an AI model gets only 38% of expert-level reasoning questions right, it is getting the other 62% wrong, whether by hallucinating facts or by failing the logic. This is the "Truth Gap."

For content creators and researchers, this is where human value lies. If you are writing about complex topics that require novel synthesis, like advanced chemical engineering or supply-chain macroeconomics, you are operating in the zone where AI fails.

To see how these scores correlate with detectable patterns in text, read our breakdown on How to Detect Gemini 3.0 Content.

Why "MMLU" is Dead and "HLE" is King?

You might ask, "Didn't GPT-4 score 90% on exams years ago?" Yes, on the MMLU (Massive Multitask Language Understanding) benchmark. But the MMLU had a fatal flaw: Contamination.

Because the MMLU questions were all over the internet, AI models "memorized" the answers during training. They weren't thinking; they were reciting.
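
To make "contamination" concrete, here is a toy Python sketch of the kind of n-gram overlap check researchers run to see whether a benchmark question already appears verbatim in training text. The corpus, the question, and the 8-word window are illustrative assumptions, not the actual MMLU or HLE audit pipeline.

```python
# Toy contamination check: flag a benchmark question if any 8-word span of it
# appears verbatim in the training corpus. Corpus, question, and window size
# are illustrative, not the real MMLU/HLE audit pipeline.

def ngrams(text, n=8):
    words = text.lower().split()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}

def looks_contaminated(question, training_corpus, n=8):
    """True if the question shares any n-word sequence with a training document."""
    question_grams = ngrams(question, n)
    return any(question_grams & ngrams(doc, n) for doc in training_corpus)

corpus = ["the mitochondria is the powerhouse of the cell and produces ATP for energy"]
question = "Which organelle is the powerhouse of the cell and produces ATP in eukaryotes?"

print(looks_contaminated(question, corpus))  # True -> the question likely leaked
```

If a question trips a check like this, the model may simply be reciting training data rather than reasoning, which is exactly the failure mode HLE was designed to rule out.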

HLE fixes this by being:

  • Private: The questions are not in the public training data.
  • Multimodal: 14% of questions require analyzing images (like interpreting a specific bone structure in a hummingbird).
  • Reasoning-Heavy: You cannot "search" for the answer; you have to derive it.

This shift from retrieval to reasoning is why researchers are panicking. It turns out, when you take away the cheat sheet, the "superintelligence" looks a lot less super.

The "Reasoning Trace" Frontier

The main way AI models are inching up the HLE leaderboard is by using "Chain of Thought" (CoT) reasoning. This is where the model "talks to itself" before answering.

However, this leaves a digital footprint. Detectors are now looking for the specific syntax of these reasoning chains.
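
To see why that footprint exists, compare a direct prompt with a chain-of-thought-style prompt. This is a generic Python illustration; the wording is our own example, not taken from any vendor's actual API or system prompt.

```python
# Illustrative comparison: a direct prompt vs. a chain-of-thought (CoT) style prompt.
# The wording is a generic example, not from any specific model or vendor API.

question = "A train leaves at 9:00 and covers 120 km at 80 km/h. When does it arrive?"

direct_prompt = f"{question}\nAnswer with only the final time."

cot_prompt = (
    f"{question}\n"
    "Think step by step and write out your reasoning before giving the final answer."
)

# The CoT version tends to produce intermediate steps such as
# "120 km / 80 km/h = 1.5 hours, so arrival is 10:30" before the answer --
# exactly the kind of structured reasoning trace detectors can look for.
print(cot_prompt)
```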

If you are using tools to check for plagiarism, you need software that understands these new patterns. For a look at tools that can spot these specific AI artifacts, check our review of the Best AI Plagiarism Checkers for Research Papers.

Conclusion

The Humanity's Last Exam leaderboard scores are more than just numbers; they are a reality check. They prove that in 2026, "expert-level" AI is still statistically unreliable for high-stakes novelty.

For now, the ability to reason through a complex, never-before-seen problem remains the ultimate "human" watermark.



Frequently Asked Questions (FAQ)

1. What is the highest score on Humanity's Last Exam (HLE)?

As of early 2026, the highest recorded score is approximately 38.3%, held by Google's Gemini 3.0 Pro. This low percentage highlights the extreme difficulty of the benchmark compared to older tests like MMLU.

2. Which AI model failed Humanity's Last Exam?

Technically, all current models are "failing" by academic standards. While Gemini 3.0 leads, models like Llama 4 and older iterations of GPT-4 scored in the single digits or low teens, effectively performing little better than random guessing on many sections.

3. How are HLE scores calculated?

Scores are calculated based on accuracy on a held-out dataset. The questions are kept private (not on the internet) to ensure the AI hasn't "memorized" the answers. The score reflects the percentage of questions the AI answers correctly on the first try without external tools.
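
In plain arithmetic, the score is simply correct answers divided by total questions. The numbers below are illustrative, not the official HLE question count:

```python
# Accuracy = correct answers / total questions (illustrative numbers only).
correct, total = 958, 2500
print(f"{correct / total:.1%}")  # 38.3%
```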

4. Can I download the Humanity's Last Exam answer key?

No. To prevent "data contamination," the full answer key and the test set are kept secret by the Center for AI Safety (CAIS) and Scale AI. This ensures future AI models cannot simply train on the answers to cheat the test.

5. How does HLE compare to LMSYS Chatbot Arena?

LMSYS Chatbot Arena measures human preference (which answer "feels" better or is more helpful). HLE measures objective correctness on difficult facts. A model can be popular on Chatbot Arena (writing smooth text) but fail HLE (getting the complex logic wrong).
