How do you translate the ancient Palmyrene script on a Roman-era tombstone? How many pairs of tendons are supported by a given sesamoid bone in a hummingbird? Can you identify closed syllables in Biblical Hebrew using the most recent research on the Tiberian pronunciation tradition?
These are some of the questions in “Humanity's Last Exam,” a new benchmark introduced in a study published this week in Nature. The collection of 2,500 questions is specifically designed to probe what today's artificial intelligence (AI) systems still cannot do.
The benchmark is the result of a global collaboration of nearly 1,000 experts from a wide range of academic fields. These academics and researchers contributed questions at the frontier of human knowledge, problems requiring graduate-level expertise in mathematics, physics, chemistry, biology, computer science and the humanities. Crucially, each question was tested against leading AI models before being included: if a model could answer it correctly at the time the exam was assembled, the question was rejected.
This process explains why the initial results looked so different from those on other benchmarks. While AI chatbots routinely score over 90% on popular tests, when Humanity's Last Exam was first released in early 2025, leading models struggled. GPT-4o managed only 2.7% accuracy. Claude 3.5 Sonnet scored 4.1%. Even OpenAI's best-performing model at the time, o1, achieved only 8%.
The low scores were the point: the benchmark was designed to measure what remained beyond the reach of AI. But some commentators have suggested that benchmarks like “Humanity's Last Exam” chart a path to artificial general intelligence and even superintelligence – that is, AI systems capable of performing any task at a human or superhuman level. We think that idea is flawed, for three reasons.
Benchmarks measure task performance, not intelligence
If a student does well on the bar exam, we can reasonably assume they will make a competent lawyer. That is because the test was designed to assess whether a person has acquired the knowledge and thinking skills necessary to practice law – and with people, it works. The understanding required to pass genuinely transfers to the job.
But AI systems are not people preparing for a profession.
When a large language model performs well on the bar exam, that tells us the model can produce correct-looking answers to legal questions. It does not tell us the model understands the law, can advise a nervous client, or can exercise professional judgment in ambiguous situations.
For humans, the test measures something real; for AI, it measures only performance on the test itself.
Using tests of human skills to benchmark AI is common practice, but it is fundamentally misleading. Assuming that a high test score means the machine has become more human-like is a category error, much like concluding that a calculator “understands” mathematics because it can solve equations faster than any human.
Human and machine intelligence are fundamentally different
Humans continually learn from experience. We have intentions, needs and goals. We live life, inhabit bodies and experience the world directly. Our intelligence has evolved to serve our survival as organisms and our success as social beings.
But AI systems are very different.
Large language models derive their abilities from patterns in the text they were trained on. But they do not learn from experience the way humans do.
For humans, intelligence comes first, and language serves as a means of communication – intelligence is prelinguistic. But in large language models, the language is the intelligence – there is nothing underneath.
Even the creators of Humanity's Last Exam acknowledge this limitation:
High accuracy on (Humanity's Last Exam) would demonstrate expert-level performance on closed-ended, verifiable questions and cutting-edge scientific knowledge, but it would not alone suggest autonomous research capabilities or artificial general intelligence.
Subbarao Kambhampati, a professor at Arizona State University and former president of the Association for the Advancement of Artificial Intelligence, puts it more plainly:
The essence of humanity is not captured by a static test, but rather by our ability to evolve and tackle previously unimaginable questions.
Developers like leaderboards
There is another problem. AI developers use benchmarks to optimize their models for leaderboard performance. They are, in effect, cramming for the exam. And unlike a human, for whom studying for a test builds understanding, AI optimization simply means getting better at the test in question.
But it works.
Since Humanity's Last Exam was released online in early 2025, scores have risen dramatically. Gemini 3 Pro Preview now tops the leaderboard with 38.3% accuracy, followed by GPT-5 at 25.3% and Grok 4 at 24.5%.
Does this improvement mean these models are approaching human intelligence? No. It means they have become better at the questions included in the exam. The benchmark has become a target for optimization.
The industry recognizes this problem.
OpenAI recently introduced an evaluation called GDPval, specifically designed to assess practical usefulness.
Unlike academic-style benchmarks, GDPval focuses on tasks based on real work products such as project documents, data analyses and other deliverables found in professional settings.
What this means for you
If you use AI tools in your work, or are thinking about adopting them, don't be swayed by benchmark results. A model that passes Humanity's Last Exam with flying colours might still struggle with the specific tasks you need it to complete.
It's also worth noting that the exam questions are heavily weighted toward particular areas. Mathematics alone makes up 41% of the benchmark, with physics, biology and computer science accounting for much of the remainder. If your work involves writing, communications, project management or customer support, the exam will tell you almost nothing about which model would serve you best.
A practical approach is to develop your own tests based on what you actually need the AI for, and then evaluate newer models against the criteria that matter to you; a rough sketch of what that might look like follows below. AI systems are genuinely useful – but talk of superintelligence remains science fiction, and a distraction from the real work of making these tools relevant to people's lives.
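To make that concrete, here is a minimal sketch of what a personal benchmark could look like in code. Everything in it is illustrative: the tasks, the pass/fail checks and the `ask_model` placeholder are assumptions to adapt, not part of the study or of any particular product.

```python
# A minimal sketch of a "personal benchmark": a handful of tasks drawn from your
# own workflow, each scored against a criterion you care about. ask_model() is a
# hypothetical placeholder -- replace it with a call to whichever chatbot or API
# you are actually evaluating.

def ask_model(prompt: str) -> str:
    """Placeholder for the model under test; swap in a real API call here."""
    return "A short stub answer that mentions the invoice."

# Each case pairs a realistic task from your own work with a simple pass/fail check.
test_cases = [
    {
        "prompt": "Summarise this customer complaint in no more than two sentences: ...",
        "passes": lambda answer: answer.count(".") <= 2,
    },
    {
        "prompt": "Draft a polite follow-up email about an overdue invoice.",
        "passes": lambda answer: "invoice" in answer.lower(),
    },
]

passed = sum(case["passes"](ask_model(case["prompt"])) for case in test_cases)
print(f"{passed}/{len(test_cases)} tasks met your criteria")
```

The specific checks matter far less than the habit: a few tasks taken from your own work will tell you more about a model's usefulness than its position on any leaderboard.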

