Tech firms are rushing to revamp the way they test and evaluate their artificial intelligence models, as the rapidly advancing technology surpasses current benchmarks.
OpenAI, Microsoft, Meta and Anthropic have all recently announced plans to develop AI agents that can autonomously perform tasks on behalf of humans. To do that effectively, the systems must be able to carry out increasingly complex actions using reasoning and planning.
Companies conduct evaluations of AI models using teams of employees and external researchers. These are standardized tests, so-called benchmarks, that assess the capabilities of models and compare the performance of systems from different groups or against earlier versions.
However, recent advances in AI technology have resulted in many of the newest models achieving near or above 90 percent accuracy on existing tests, highlighting the need for new benchmarks.
“The pace of the industry is amazingly fast. We are now starting to saturate our ability to measure some of these systems, and as an industry it’s becoming increasingly difficult to evaluate them,” said Ahmad Al-Dahle, head of generative AI at Meta.
To address this problem, several technology groups, including Meta, OpenAI and Microsoft, have created their own internal benchmarks and intelligence tests. However, this has raised concerns within the industry about the comparability of the technology, as the tests are not public.
“Many of these benchmarks show us how far we are from automating tasks and jobs. Without them being made public, it’s hard for companies and society as a whole to say,” said Dan Hendrycks, executive director of the Center for AI Safety and adviser to Elon Musk’s xAI.
Current public benchmarks such as HellaSwag and MMLU use multiple-choice questions to assess common sense and knowledge across various topics. However, researchers argue that this method is becoming redundant and that models require more complex problems.
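To make the mechanics concrete, a multiple-choice evaluation of this kind amounts to prompting a model with each question and its options and counting exact matches against the answer key. Below is a minimal sketch in Python; `query_model` is a hypothetical stand-in for whatever API serves the model, and the scoring loop is illustrative rather than the official MMLU or HellaSwag harness.

```python
# Minimal sketch of multiple-choice benchmark scoring (MMLU/HellaSwag-style).
# `query_model` is a hypothetical placeholder; swap in any real model API.

import string
from dataclasses import dataclass


@dataclass
class Item:
    question: str
    choices: list[str]   # answer options, in order
    answer: int          # index of the correct choice


def query_model(prompt: str) -> str:
    """Placeholder: return the model's single-letter choice for the prompt."""
    raise NotImplementedError


def accuracy(items: list[Item]) -> float:
    letters = string.ascii_uppercase
    correct = 0
    for item in items:
        options = "\n".join(f"{letters[i]}. {c}" for i, c in enumerate(item.choices))
        prompt = f"{item.question}\n{options}\nAnswer with a single letter."
        reply = query_model(prompt).strip().upper()[:1]
        correct += int(reply == letters[item.answer])
    return correct / len(items)
```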
“We are entering an era where many of the tests written by humans are no longer sufficient as a good barometer of model performance,” said Mark Chen, SVP of research at OpenAI. “This presents us as a research world with a new challenge.”
A public benchmark, SWE-bench Verified, was updated in August to better evaluate autonomous systems, drawing on feedback from companies including OpenAI.
It uses real-world software problems sourced from the developer platform GitHub: the AI agent is given a code repository and a technical issue, with a prompt to fix it. The tasks require reasoning to complete.
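The general shape of such an agent evaluation can be sketched as follows: check out the repository at a fixed commit, hand the agent the issue text, apply whatever patch it proposes and run the project’s own tests. The `run_agent` function and task fields below are hypothetical placeholders, not the official SWE-bench Verified harness.

```python
# Rough sketch of a SWE-bench-style check: give an agent a repository and an
# issue, apply its proposed patch, then run the project's tests.
# `run_agent` and the task fields are assumptions for illustration only.

import subprocess
import tempfile


def run_agent(repo_dir: str, issue_text: str) -> str:
    """Placeholder: return a unified diff the agent believes fixes the issue."""
    raise NotImplementedError


def evaluate_task(repo_url: str, commit: str, issue_text: str, test_cmd: list[str]) -> bool:
    with tempfile.TemporaryDirectory() as repo_dir:
        # Pin the repository to the state the issue was filed against.
        subprocess.run(["git", "clone", repo_url, repo_dir], check=True)
        subprocess.run(["git", "checkout", commit], cwd=repo_dir, check=True)

        # Ask the agent for a patch and apply it from stdin.
        patch = run_agent(repo_dir, issue_text)
        subprocess.run(["git", "apply", "-"], cwd=repo_dir,
                       input=patch, text=True, check=True)

        # The task counts as resolved only if the project's own tests pass.
        result = subprocess.run(test_cmd, cwd=repo_dir)
        return result.returncode == 0
```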
On this benchmark, OpenAI's latest model, o1-preview, solves 41.4 percent of problems, while Anthropic's Claude 3.5 Sonnet achieves 49 percent.
“It's much more sophisticated (with agent systems) because you have to connect those systems to lots of additional tools,” said Jared Kaplan, chief science officer at Anthropic.
“Basically you have to create an entire sandbox environment for them to play in. It’s not as simple as just giving a prompt, seeing what the completion is and then assessing that,” he added.
Another essential factor when conducting more complex tests is ensuring that the benchmark questions are not publicly available, so that models cannot effectively “cheat” by generating the answers from training data instead of solving the problem.
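One simplified way to screen for this kind of contamination, sketched below purely as an assumption about how such a check might look, is to flag benchmark questions whose word n-grams already appear in a sample of the training corpus; real decontamination pipelines are considerably more elaborate.

```python
# Illustrative contamination check: flag benchmark questions whose word
# n-grams already appear in sampled training documents. The corpus sample
# and the n-gram length are assumptions for the example.

def ngrams(text: str, n: int = 8) -> set[tuple[str, ...]]:
    words = text.lower().split()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}


def is_contaminated(question: str, corpus_docs: list[str], n: int = 8) -> bool:
    q_grams = ngrams(question, n)
    if not q_grams:
        return False
    return any(q_grams & ngrams(doc, n) for doc in corpus_docs)
```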
The ability to reason and plan is critical to unlocking the potential of AI agents that can perform tasks and self-correct across multiple steps and applications.
“We are discovering new ways to measure these systems, and one of them is, of course, reasoning, which is an important area,” said Ece Kamar, vice president and lab manager of AI Frontiers at Microsoft Research.
That's why Microsoft is working on its own internal benchmark that incorporates problems not previously encountered in training, to assess whether its AI models can reason like a human.
Some, including researchers at Apple, have questioned whether current large language models “reason” or merely “pattern match” the most similar data seen in their training.
“In the narrower domains that companies care about, they do reason,” said Ruchir Puri, chief scientist at IBM Research. “(The debate is about) this broader concept of human-level reasoning, which would almost put it in the context of artificial general intelligence. Are they really reasoning, or are they parroting?”
OpenAI measures reasoning primarily through evaluations in the areas of mathematics, STEM subjects and programming tasks.
“Reasoning is a really great term. Everyone defines it differently and has their own interpretation . . . That line is very blurry (and) we try not to get too caught up in that distinction itself, but rather look at whether it affects utility, performance or capabilities,” OpenAI’s Chen said.
The need for new benchmarks has also led to efforts by external organizations.
In September, startup Scale AI and Hendrycks announced a project called Humanity's Last Exam, which crowdsourced complex questions from experts across disciplines that require abstract reasoning to answer.
Another example is FrontierMath, a new benchmark released this week and created by expert mathematicians. On this test, the most advanced models can answer less than 2 percent of questions.
However, experts warn that without explicit agreement on how to measure these capabilities, it may be difficult for companies to assess their competitors or for businesses and consumers to understand the market.
“There is no clear way to say ‘this model is definitely better than this model’ (because when a measure becomes a target, it is no longer a good measure) and models are trained to meet the established benchmarks,” said Al-Dahle.
“It’s something we’re working on as a whole industry.”