
This is why most AI benchmarks tell us so little

On Tuesday, startup Anthropic released a family of generative AI models that it claims achieve best-in-class performance. Just a couple of days later, rival Inflection AI introduced a model that it says approaches some of the most capable models on the market, including OpenAI's GPT-4, in quality.

Anthropic and Inflection are hardly the first AI companies to claim that their models match or objectively beat the competition. Google made the same argument when it released its Gemini models, and OpenAI said the same of GPT-4 and its predecessors GPT-3, GPT-2, and GPT-1. The list goes on.

But what metrics are they talking about? When a vendor says a model achieves state-of-the-art performance or quality, what exactly does that mean? More to the point: will a model that technically "performs" better than another model actually feel noticeably better to users?

On that last question: unlikely.

The reason, or rather the problem, lies in the benchmarks AI companies use to quantify a model's strengths and weaknesses.

Esoteric measures

The most commonly used benchmarks for AI models today, particularly chatbot-powered models like OpenAI's ChatGPT and Anthropic's Claude, do a poor job of capturing how the average person interacts with the models being tested. For example, one benchmark cited by Anthropic in its recent announcement, GPQA ("A Graduate-Level Google-Proof Q&A Benchmark"), contains hundreds of graduate-level biology, physics, and chemistry questions, yet most people use chatbots for tasks like answering emails, writing cover letters, and talking about their feelings.

Jesse Dodge, a scientist at the Allen Institute for AI, the nonprofit AI research organization, says the industry has reached an "evaluation crisis."

"Benchmarks tend to be static and narrowly focused on evaluating a single capability, like a model's factuality in a single domain, or its ability to solve multiple-choice mathematical reasoning questions," Dodge told TechCrunch in an interview. "Many benchmarks used for evaluation are more than three years old, from a time when AI systems were mostly used for research and didn't have many real users. In addition, people use generative AI in a lot of different ways; they're very creative."

The flawed metrics

It's not that the most commonly used benchmarks are completely useless. Someone, undoubtedly, is asking ChatGPT Ph.D.-level math questions. But as generative AI models are increasingly positioned as mass-market, "do-it-all" systems, the old benchmarks are becoming less applicable.

David Widder, a postdoctoral researcher at Cornell University who studies AI and ethics, notes that many of the skills common benchmarks test, from solving elementary-school-level math problems to identifying whether a sentence contains an anachronism, will never be relevant to the majority of users.

"Older AI systems were often designed to solve a particular problem in a particular context (e.g., medical AI expert systems), which made a deep contextual understanding of what constitutes good performance in that context more attainable," Widder told TechCrunch. "As systems are increasingly treated as 'general purpose,' that's less possible, so we're seeing an increasing emphasis on testing models against a variety of benchmarks across different fields."

Errors and other defects

Aside from their misalignment with real use cases, there's the question of whether some benchmarks even accurately measure what they claim to measure.

An analysis of HellaSwag, a test designed to assess common-sense reasoning in models, found that more than a third of its test questions contained typos and "nonsensical" writing. Elsewhere, MMLU (short for Massive Multitask Language Understanding), a benchmark that vendors like Google, OpenAI, and Anthropic point to as proof that their models can reason through logic problems, asks questions that can be solved through pure memorization.

Test questions from the HellaSwag benchmark.

"Benchmarks like MMLU are more about remembering and associating two keywords," Widder said. "I can find (a relevant) article and answer the question fairly quickly, but that doesn't mean I understand the causal mechanism, or could use an understanding of that causal mechanism to actually reason through and solve new and complex problems in unexpected contexts. A model can't either."
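Widder's point is easier to see against how benchmarks like MMLU are actually scored: each question is multiple choice, and the headline number is plain exact-match accuracy over the model's letter picks. A minimal sketch (the function name and data are illustrative, not taken from any real evaluation harness):

```python
# Minimal sketch of MMLU-style scoring: each question offers options A-D,
# the model emits one letter, and the score is exact-match accuracy.
# Names and data here are illustrative, not from any real harness.

def score_multiple_choice(predictions, gold_answers):
    """Return the fraction of predicted letters matching the gold letters."""
    if len(predictions) != len(gold_answers):
        raise ValueError("predictions and gold answers must align")
    correct = sum(p == g for p, g in zip(predictions, gold_answers))
    return correct / len(gold_answers)

# The metric only sees the final letter: a model that memorized keyword
# associations scores identically to one that reasoned its way there.
print(score_multiple_choice(["B", "C", "A", "D"], ["B", "C", "A", "A"]))  # 0.75
```

Nothing in that number distinguishes recall from reasoning, which is exactly the gap Widder describes.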

Fix what's broken

So benchmarks are broken. But can they be fixed?

Dodge thinks so, with more human involvement.

"The right path forward here is a mix of evaluation benchmarks and human evaluation," he said, "prompting a model with a real user query and then hiring a person to evaluate how good the answer is."
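Such a mix could be rolled into a single score in many ways. One hypothetical sketch, assuming a 1-to-5 human rating scale and an arbitrary 50/50 weighting (neither the scale nor the weights come from Dodge or any real harness):

```python
# Hypothetical blend of an automated benchmark score with human ratings
# of responses to real user prompts. The 1-5 rating scale and the
# default weighting are assumptions for illustration only.

def combined_eval(benchmark_accuracy, human_ratings, weight=0.5):
    """Blend a 0-1 benchmark accuracy with averaged 1-5 human ratings.

    `weight` is the share given to the automated benchmark; the human
    average is rescaled from [1, 5] to [0, 1] before blending.
    """
    if not 0.0 <= weight <= 1.0:
        raise ValueError("weight must be in [0, 1]")
    human_score = (sum(human_ratings) / len(human_ratings) - 1) / 4
    return weight * benchmark_accuracy + (1 - weight) * human_score

# 80% benchmark accuracy, human ratings averaging 4.0 out of 5:
print(combined_eval(0.8, [4, 5, 3, 4]))  # 0.775
```

The interesting design question is the weighting itself: the more weight the human side gets, the more the score reflects how real users actually experience the model.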

Widder, for his part, is less optimistic that today's benchmarks, even with fixes for the more obvious errors like typos, can be improved to the point of being informative for the vast majority of generative AI users. Instead, he believes model testing should focus on the downstream effects of these models, and on whether those effects, good or bad, are seen as desirable by the people they affect.

"I'd ask which specific contextual goals we want AI models to be usable for, and evaluate whether they are, or could be, successful in those contexts," he said. "And hopefully that process includes evaluating whether we should be using AI in such contexts at all."
