Stanford University released its 2024 AI Index Report, which finds that benchmark comparisons with humans are becoming less relevant due to the rapid advancement of AI.
The annual report offers a comprehensive look at the trends and current state of AI development. It says that AI models are now improving so quickly that the benchmarks we use to measure them are becoming increasingly irrelevant.
Many industry benchmarks compare AI models against how well humans perform the same tasks. The Massive Multitask Language Understanding (MMLU) benchmark is a good example.
It uses multiple-choice questions to assess LLMs across 57 subjects, including mathematics, history, law, and ethics. MMLU has been the AI benchmark of choice since 2019.
The human baseline on MMLU is 89.8%, and in 2019 the average AI model scored just over 30%. Just five years later, Gemini Ultra became the first model to beat the human baseline with a score of 90.04%.
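For readers unfamiliar with how such a score is produced, the idea is simple: the model answers each multiple-choice question, and its accuracy is the fraction answered correctly, compared against the reported human baseline. Below is a minimal illustrative sketch, not the actual MMLU harness; the sample question and the `ask_model` function are hypothetical placeholders.

```python
# Minimal sketch of MMLU-style scoring: accuracy over multiple-choice questions,
# compared against the reported human baseline. `ask_model` is a hypothetical
# stand-in for whatever LLM you are evaluating; it should return "A"-"D".

HUMAN_BASELINE = 0.898  # the 89.8% expert baseline reported for MMLU

questions = [
    {"prompt": "Which planet is known as the Red Planet?\nA. Venus\nB. Mars\nC. Jupiter\nD. Mercury",
     "answer": "B"},
    # ... thousands more questions across 57 subjects in the real benchmark
]

def ask_model(prompt: str) -> str:
    """Placeholder: call your model here and return its chosen letter."""
    raise NotImplementedError

def evaluate(questions) -> float:
    correct = sum(1 for q in questions if ask_model(q["prompt"]) == q["answer"])
    return correct / len(questions)

# accuracy = evaluate(questions)
# print(f"Model accuracy: {accuracy:.1%} vs human baseline {HUMAN_BASELINE:.1%}")
```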
The report notes that current “AI systems routinely outperform human performance on standard benchmarks.” The trends in the chart below suggest that MMLU and other benchmarks may need to be replaced.
AI models have reached performance saturation on established benchmarks such as ImageNet, SQuAD, and SuperGLUE, prompting researchers to develop more sophisticated tests.
One example is the graduate-level Google-Proof Q&A Benchmark (GPQA), which pits AI models against expert-level humans rather than average human intelligence.
The GPQA test consists of 400 difficult graduate-level multiple-choice questions. Experts who hold or are pursuing a doctorate answer the questions correctly 65% of the time.
The GPQA paper states that “highly skilled non-expert validators achieve only 34% accuracy on questions outside their area of expertise, despite spending, on average, more than 30 minutes with unrestricted access to the internet.”
Last month, Anthropic announced that Claude 3 scored just under 60% on GPQA with 5-shot chain-of-thought (CoT) prompting; a sketch of what such a prompt looks like follows the tweet below. We're going to need a bigger scale.
Claude 3 achieves ~60% accuracy on GPQA. It's hard for me to overstate how difficult these questions are – literal PhDs (in fields other than the questions') with access to the internet get 34%.
PhD students *in the same domain* (even with internet access!) achieve 65%–75% accuracy. https://t.co/ARAiCNXgU9 pic.twitter.com/PH8J13zIef
— david rein (@idavidrein) March 4, 2024
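For context, “5-shot CoT” means the model is shown five worked examples, each with its reasoning written out, before the real question. The sketch below shows how such a prompt is typically assembled; the worked example is invented for illustration and is not drawn from GPQA or Anthropic's actual evaluation setup.

```python
# Rough sketch of assembling a 5-shot chain-of-thought (CoT) prompt.
# The worked example below is a hypothetical placeholder, not a GPQA item.

examples = [
    {
        "question": "A 2 kg mass accelerates at 3 m/s^2. What net force acts on it?",
        "reasoning": "Newton's second law gives F = m * a = 2 kg * 3 m/s^2 = 6 N.",
        "answer": "6 N",
    },
    # ... four more worked examples to make it "5-shot"
]

def build_5shot_cot_prompt(examples, question: str) -> str:
    parts = []
    for ex in examples:
        parts.append(
            f"Question: {ex['question']}\n"
            f"Reasoning: {ex['reasoning']}\n"
            f"Answer: {ex['answer']}\n"
        )
    # The real test question: the model is prompted to reason step by step first.
    parts.append(f"Question: {question}\nReasoning: Let's think step by step.")
    return "\n".join(parts)

# prompt = build_5shot_cot_prompt(examples, "<a GPQA-style question>")
# The prompt is then sent to the model, and its final answer is scored for accuracy.
```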
Human reviews and safety
The report states that AI still faces significant problems: “It cannot reliably process facts, perform complex reasoning, or explain its conclusions.”
These limitations contribute to another characteristic of AI systems that the report says is poorly measured: AI safety. We don't have effective benchmarks that let us say, “This model is safer than that one.”
This is partly because safety is difficult to measure and partly because “AI developers lack transparency, particularly when it comes to disclosing training data and methods.”
The report notes an interesting industry trend: crowdsourcing human assessments of AI performance instead of relying solely on benchmark testing.
Assessing the aesthetics of a model's images or the quality of its prose is difficult with an automated test. As a result, the report states that “benchmarking has slowly begun shifting toward incorporating human evaluations such as the Chatbot Arena Leaderboard rather than computerized rankings like ImageNet or SQuAD.”
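Arena-style leaderboards collect those human judgments as pairwise “which answer was better?” votes and convert them into ratings, in the spirit of chess Elo scores. Here is a simplified sketch of that kind of update rule; the model names and votes are invented for illustration, and the real leaderboard uses a more careful statistical model.

```python
# Simplified Elo-style aggregation of pairwise human votes, in the spirit of
# arena-style leaderboards. Model names and vote data are illustrative only.

K = 32  # update step size

def expected_score(rating_a: float, rating_b: float) -> float:
    """Probability that A beats B under the Elo model."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400))

def update(ratings: dict, winner: str, loser: str) -> None:
    e_winner = expected_score(ratings[winner], ratings[loser])
    ratings[winner] += K * (1 - e_winner)
    ratings[loser] -= K * (1 - e_winner)

ratings = {"model_a": 1000.0, "model_b": 1000.0}
votes = [("model_a", "model_b"), ("model_a", "model_b"), ("model_b", "model_a")]

for winner, loser in votes:
    update(ratings, winner, loser)

print(ratings)  # a higher rating means human voters preferred that model more often
```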
As AI models watch the human baseline disappear in the rearview mirror, sentiment may ultimately determine which model we use.
Trends suggest that AI models will eventually become smarter than us and harder to measure. We might soon find ourselves saying, “I don’t know why, but I just like this one better.”