Humanity’s Last Exam Leaderboard Scores: Why the World’s Hardest Benchmark is Breaking AI
Quick Summary: Key Takeaways
- The "Failing" Reality: Even the most advanced models like Gemini 3 Pro are scoring under 40%, proving they still lack true expert-level reasoning.
- Saturation Proof: Unlike the MMLU, which models "aced" years ago, Humanity's Last Exam (HLE) is designed to be "Google-proof."
- Detection Signal: Low HLE scores indicate that AI struggles with complex, multi-step reasoning, a key differentiator for identifying human authorship.
- The Gap: There is currently a massive chasm between "fluent" AI writing and "factually reasoned" expert content.
The New Standard for "Superintelligence" (And Why AI Is Failing It)
This deep dive is part of our extensive guide on Best AI Mode Checkers 2026: The Tools That Prove What’s Human (and What’s Not).
For years, we watched AI models crush standardized tests. They passed the Bar Exam, aced the SAT, and destroyed the MMLU benchmark. But in 2026, the party is over.
Humanity’s Last Exam (HLE) has humbled the giants. Created by the Center for AI Safety (CAIS) and Scale AI, this benchmark wasn't built to see if AI is smart; it was built to see where it breaks.
Unlike previous tests that relied on information you could find on Wikipedia, HLE requires novel reasoning. It asks questions that have never been seen on the internet before, written by PhDs in niche fields ranging from abstract mathematics to obscure history.
If you are trying to verify if a text is written by a human expert, this leaderboard is your cheat sheet.
Current Leaderboard Scores (January 2026)
The data is shocking. In a world where we assume AI knows everything, the top models are barely answering a third of the questions correctly. This confirms that while AI is great at mimicking style, it still struggles to generate new, complex truths.
| Rank | Model Name | HLE Accuracy Score |
|---|---|---|
| #1 | Google Gemini 3.0 Pro | 38.3% |
| #2 | OpenAI GPT-5.2 | 29.9% |
| #3 | Claude Opus 4.5 | 25.8% |
| #4 | DeepSeek 3.2 | 21.8% |
Note: Scores reflect the "pass@1" accuracy on the closed-ended reasoning subset.
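To make that note concrete, here is a minimal pass@1 sketch in Python. The question count and the number of correct answers are hypothetical, chosen only so the arithmetic reproduces a 38.3% figure; this is an illustration of the metric, not the benchmark's actual grading code.

```python
def pass_at_1(first_attempt_results):
    """Toy pass@1 scoring: the share of questions a model answers
    correctly on its first (and only) attempt, with no retries or tools.
    `first_attempt_results` is a list of booleans, one per question."""
    if not first_attempt_results:
        return 0.0
    return sum(first_attempt_results) / len(first_attempt_results)

# Hypothetical example: 1,150 correct first attempts out of 3,000 questions.
graded = [True] * 1150 + [False] * 1850
print(f"pass@1 = {pass_at_1(graded):.1%}")  # pass@1 = 38.3%
```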
What These Low Scores Mean for You
If an AI model gets only 38% of expert-level reasoning questions right, it means that the other 62% of the time it is hallucinating or failing the logic. This is the "Truth Gap."
For content creators and researchers, this is where human value lies. If you are writing about complex topics requiring novel synthesis, like advanced chemical engineering or supply-chain macroeconomics, you are operating in the zone where AI fails.
To see how these scores correlate with detectable patterns in text, read our breakdown on How to Detect Gemini 3.0 Content.
Why "MMLU" is Dead and "HLE" is King?
You might ask, "Didn't GPT-4 score 90% on exams years ago?" Yes, on the MMLU (Massive Multitask Language Understanding) benchmark. But the MMLU had a fatal flaw: Contamination.
Because the MMLU questions were all over the internet, AI models "memorized" the answers during training. They weren't thinking; they were reciting.
HLE fixes this by being:
- Private: The questions are not in the public training data.
- Multimodal: 14% of questions require analyzing images (like interpreting a specific bone structure in a hummingbird).
- Reasoning-Heavy: You cannot "search" for the answer; you have to derive it.
This shift from retrieval to reasoning is why researchers are panicking. It turns out, when you take away the cheat sheet, the "superintelligence" looks a lot less super.
The "Reasoning Trace" Frontier
The main way AI models are inching up the HLE leaderboard is by using "Chain of Thought" (CoT) reasoning. This is where the model "talks to itself," working through intermediate steps before answering.
However, this leaves a digital footprint. Detectors are now looking for the specific syntax of these reasoning chains.
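To make the idea concrete, here is a minimal, hypothetical sketch of what "looking for reasoning-chain syntax" could mean in practice: counting boilerplate chain-of-thought phrases in a passage. The phrase list, scoring, and example are illustrative assumptions only; real detectors rely on far richer statistical signals than a hand-written marker list.

```python
import re

# Illustrative only: a few phrases commonly associated with
# chain-of-thought style output. Real detectors rely on far richer
# statistical signals than a hand-written phrase list.
COT_MARKERS = [
    r"let'?s think step by step",
    r"first,.*second,.*finally",
    r"therefore, the answer is",
    r"step \d+:",
]

def reasoning_trace_score(text: str) -> float:
    """Return the fraction of marker patterns found in `text` (0.0 to 1.0)."""
    lowered = text.lower()
    hits = sum(bool(re.search(pattern, lowered, re.DOTALL)) for pattern in COT_MARKERS)
    return hits / len(COT_MARKERS)

sample = "Step 1: restate the problem. Therefore, the answer is 42."
print(reasoning_trace_score(sample))  # 0.5 -> two of the four markers matched
```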
If you are using tools to check for plagiarism, you need software that understands these new patterns. For a look at tools that can spot these specific AI artifacts, check our review of the Best AI Plagiarism Checkers for Research Papers.
Conclusion
The Humanity's Last Exam leaderboard scores are more than just numbers; they are a reality check. They prove that in 2026, "expert-level" AI is still statistically unreliable for high-stakes, novel problems.
For now, the ability to reason through a complex, never-before-seen problem remains the ultimate "human" watermark.
Frequently Asked Questions (FAQ)
What is the highest score on Humanity's Last Exam?
As of early 2026, the highest recorded score is approximately 38.3%, held by Google's Gemini 3.0 Pro. This low percentage highlights the extreme difficulty of the benchmark compared to older tests like the MMLU.
Are any AI models actually passing HLE?
Technically, all current models are "failing" by academic standards. While Gemini 3.0 Pro leads, models like Llama 4 and older iterations of GPT-4 scored in the single digits or low teens, effectively performing little better than random guessing on many sections.
How are HLE scores calculated?
Scores are calculated based on accuracy on a held-out dataset. The questions are kept private (not on the internet) to ensure the AI hasn't "memorized" the answers. The score reflects the percentage of questions the AI answers correctly on the first try without external tools.
Is the HLE answer key publicly available?
No. To prevent "data contamination," the full answer key and the test set are kept secret by the Center for AI Safety (CAIS) and Scale AI. This ensures future AI models cannot simply train on the answers to cheat the test.
How is HLE different from the LMSYS Chatbot Arena?
LMSYS Chatbot Arena measures human preference (which answer "feels" better or is more helpful). HLE measures objective correctness on difficult questions. A model can be popular on Chatbot Arena (writing smooth text) but fail HLE (getting the complex logic wrong).
Sources & References
Internal Analysis:
- Best AI Mode Checkers 2026
- How to Detect Gemini 3.0 Content
- Best AI Plagiarism Checkers for Research Papers
External Resources:
- Scale AI & CAIS: Official "Humanity's Last Exam" Leaderboard and Research Paper.
- arXiv: "Humanity's Last Exam" (Phan et al., 2025).