
Humanity's Last Exam wants to stump AI with its difficult questions

Benchmarks are struggling to keep up with the advancing capabilities of AI models, and the Humanity's Last Exam project aims to solve this problem with your help.

The project is a collaboration between the Center for AI Safety (CAIS) and AI data company Scale AI. Its goal is to measure how close we are to developing expert-level AI systems, something existing benchmarks are unable to do.

OpenAI and CAIS developed the popular MMLU (Massive Multitask Language Understanding) benchmark in 2021. At the time, according to CAIS, "AI systems performed no better than random."

The impressive performance of OpenAI's o1 model has "destroyed the most popular benchmarks for logical reasoning," said Dan Hendrycks, executive director of CAIS.

OpenAI's o1 MMLU performance compared to previous models. Source: OpenAI

If AI models achieve 100% on MMLU, how will we measure them? CAIS says, "Existing tests have become too simplistic and we can no longer track AI development well or determine how far they are from expert level."

Seeing the jump in benchmark results that o1 added to the already impressive GPT-4o numbers, it won't be long before an AI model gets the best of MMLU.

Humanity's Last Exam asks participants to submit questions where you would be genuinely surprised if an AI model came up with the correct answer. They want PhD-level exam questions, not "how many Rs are in strawberry?" type questions that trip up some models.

Scale explained: "As existing tests become too easy, we lose the ability to differentiate between AI systems that pass undergraduate exams with flying colors and those that could make a real contribution to frontier research and problem solving."

If you have an original question that could challenge an advanced AI model, you could be listed as a co-author of the project paper and share in a $500,000 prize pool awarded for the best questions.

To give you an idea of the level the project is aiming for, Scale explained: "If a randomly chosen student can understand what is being asked, it is probably too easy for the modern LLMs of today and tomorrow."

There are a few interesting restrictions on the types of questions that can be submitted. They don't want questions related to chemical, biological, radiological, or nuclear weapons, or to cyberweapons used to attack critical infrastructure.

If you think your question meets the requirements, you can submit it here.
