Study suggests that even the best AI models hallucinate often

All generative AI models hallucinate, from Google’s Gemini to Anthropic’s Claude to the latest stealth version of OpenAI's GPT-4o. In other words, the models are unreliable narrators – sometimes to hilarious effect, sometimes problematically so.

But not all models make things up at the same rate. And the kinds of untruths they spread depend on what sources of data they’ve been exposed to.

A recent study by researchers at Cornell University, the Universities of Washington and Waterloo, and the nonprofit research institute AI2 attempted to benchmark hallucinations by testing models like GPT-4o against trusted sources on topics ranging from law and health to history and geography. They found that no model performed particularly well across all topics, and that the models that hallucinated the least did so partly because they refused to answer questions they would otherwise get wrong.

“The most important lesson from our work is that we cannot yet fully trust the outputs of model generations,” Wenting Zhao, a doctoral student at Cornell University and co-author of the study, told TechCrunch. “At present, even the best models can generate hallucination-free text only about 35% of the time.”

There have been other academic attempts to examine the “factuality” of models, including one by a separate AI2-affiliated team. But Zhao notes that these earlier tests asked the models questions whose answers can easily be found on Wikipedia – not exactly the most difficult test, considering most models are trained on Wikipedia data.

To make their benchmark tougher – and to more accurately reflect the sorts of questions people ask models – the researchers identified topics around the web that lack Wikipedia references. Just over half of the questions in their test can’t be answered using Wikipedia (some questions with Wikipedia sources were included for good measure), and they touch on topics such as culture, geography, astronomy, pop culture, finance, medicine, computer science, and celebrities.

For their study, the researchers evaluated over a dozen popular models, many of which were released in the past year. In addition to GPT-4o, they tested “open” models such as Meta's Llama 3 70B, Mistral's Mixtral 8x22B and Cohere's Command R+, as well as API-gated models such as Perplexity's Sonar Large (which is based on Llama), Google's Gemini 1.5 Pro and Anthropic's Claude 3 Opus.

The results suggest that today's models don’t hallucinate much less, despite claims to the contrary from OpenAI, Anthropic and the other major players in generative AI.

GPT-4o and OpenAI's much older flagship GPT-3.5 performed about equally on the benchmark in terms of the share of questions they answered factually correctly. (GPT-4o scored slightly better.) OpenAI's models were the least hallucinatory overall, followed by Mixtral 8x22B, Command R, and Perplexity's Sonar models.

Questions about celebrities and finance gave the models the most difficulty, while questions about geography and computer science were the easiest for them to answer (perhaps because their training data contained more references to those topics). Where the source of an answer was not Wikipedia, every model answered less factually on average (especially GPT-3.5 and GPT-4o), suggesting that they are all heavily influenced by Wikipedia content.

Even models that can search the web for information, such as Command R and Perplexity's Sonar models, struggled with the benchmark's “non-wiki” questions. Model size didn't matter much; smaller models (e.g. Anthropic's Claude 3 Haiku) hallucinated about as often as larger, ostensibly more capable models (e.g. Claude 3 Opus).

So what does all this mean – and where are the improvements that the providers promised?

Well, we wouldn't put it past the vendors to exaggerate their claims. But a more charitable view is that the benchmarks they use aren't fit for purpose. As we've written before, many, if not most, AI evaluations are cursory, lack essential context, and are doomed to fall victim to Goodhart's Law.

Regardless, Zhao believes the problem of hallucinations will “remain for a long time.”

“The empirical results of our work show that, despite the promise of certain methods to reduce or eliminate hallucinations, the actual improvement that can be achieved with these methods is limited,” she said. “In addition, our analysis shows that even the knowledge found on the Internet can often be contradictory, partly because the training data created by humans can also contain hallucinations.”

A workaround might be to simply program the models to refuse to answer questions more often – the technical equivalent of telling a know-it-all to stop.

When the researchers tested this approach, Claude 3 Haiku answered only about 72% of the questions it was asked and left the rest unanswered. Accounting for those abstentions, Claude 3 Haiku was in fact the most accurate model of all – at least in the sense that it lied the least.

But will people use a model that doesn't answer many questions? Zhao doesn't think so, and says vendors should spend more time and effort on research to reduce hallucinations. Eliminating hallucinations entirely may not be possible, but they can be mitigated through human fact-checking and citation during a model's development, she claims.

“Policies and regulations must be developed to ensure that human experts are always involved in the process of verifying and validating the information generated by generative AI models,” Zhao added. “There are still numerous opportunities to make a significant impact in this field, such as developing advanced fact-checking tools for any free text, providing citations for factual content, and offering corrections for hallucinated text.”
