
Humanity's Last Exam wants to stump AI with its difficult questions

Benchmarks are struggling to keep up with the advancing capabilities of AI models, and the Humanity's Last Exam project aims to solve this problem with your help.

The project is a collaboration between the Center for AI Safety (CAIS) and AI data company Scale AI. Its goal is to measure how close we are to developing expert-level AI systems, something existing benchmarks are unable to do.

OpenAI and CAIS developed the popular MMLU (Massive Multitask Language Understanding) benchmark in 2021. At the time, according to CAIS, "AI systems performed no better than random."

The impressive performance of OpenAI's o1 model has "destroyed the most popular benchmarks for logical reasoning," said Dan Hendrycks, executive director of CAIS.

OpenAI's o1 MMLU performance compared to previous models. Source: OpenAI

If AI models achieve 100% on MMLU, how will we measure them? CAIS says, "Existing tests have become too simplistic and we can no longer track AI development well or determine how far they are from expert level."

Seeing the jump in benchmark results that o1 added to the already impressive GPT-4o numbers, it won't be long before an AI model gets the best of MMLU.

Humanity's Last Exam asks participants to submit questions where you would be genuinely surprised if an AI model came up with the correct answer. They want PhD-level exam questions, not "how many Rs are in strawberry?" type questions that trip up some models.

Scale explained: "As existing tests become too easy, we lose the ability to differentiate between AI systems that pass undergraduate exams with flying colors and those that could make a real contribution to frontier research and problem solving."

If you have an original question that could challenge an advanced AI model, you could be listed as a co-author of the project paper and share in a $500,000 prize pool awarded for the best questions.

To give you an idea of the level the project is aiming for, Scale explained: "If a randomly chosen student can understand what is being asked, it is probably too easy for the modern LLMs of today and tomorrow."

There are a few interesting restrictions on the types of questions that can be submitted. They don't want questions related to chemical, biological, radiological, or nuclear weapons, or to cyberweapons used to attack critical infrastructure.

If you think your question meets the requirements, you can submit it here.
