Hallucinations, or factually inaccurate answers, continue to plague large language models (LLMs). Models falter especially when they are given more complex tasks and when users are looking for specific, highly detailed answers.
It's a challenge data scientists have struggled to overcome, and now researchers at Google DeepMind say they are one step closer to true factuality in foundation models. They have introduced FACTS Grounding, a benchmark that evaluates LLMs' ability to generate factually accurate answers grounded in long documents. Models are also judged on whether their responses are detailed enough to provide useful, relevant answers to prompts.
Along with the new benchmark, the researchers released a FACTS leaderboard on the Kaggle data science community.
As of this week, Gemini 2.0 Flash topped the leaderboard with a factuality score of 83.6%. The rest of the top nine includes Google's Gemini 1.5 Flash and Gemini 1.5 Pro; Anthropic's Claude 3.5 Sonnet and Claude 3.5 Haiku; and OpenAI's GPT-4o, 4o-mini, o1-mini and o1-preview, all of which scored above 61.7% accuracy.
The researchers say the rankings are actively maintained and will be continually updated to incorporate the latest models and their various iterations.
“We believe this benchmark fills a gap in evaluating a wider variety of model behaviors related to factuality, in comparison with benchmarks that focus on narrower use cases… such as summarization alone,” the researchers wrote in a technical paper published this week.
Eliminating inaccurate answers
Ensuring the factual accuracy of LLM answers is difficult because of both modeling factors (architecture, training and inference) and measurement factors (evaluation methods, data and metrics). Typically, the researchers point out, pre-training focuses on predicting the next token given the previous tokens.
“While this goal can provide the models with essential world knowledge, it does not directly optimize the model for diverse factual scenarios, but instead encourages the model to generate general text,” the researchers write.
To address this issue, the FACTS dataset comprises 1,719 examples – 860 public and 859 private – each requiring a detailed answer grounded in the context of an accompanying document. Each example includes three parts (a minimal sketch of the record format follows the list):
- A system prompt (system_instruction) with general directions and the instruction to answer only based on the provided context;
- A task (user_request) containing a specific question to answer;
- A long document (context_document) with the necessary information.
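For illustration only, such an example could be represented as a simple record like the one below. The field names mirror those in the list above; the prompt text, placeholder values and overall schema are assumptions rather than the dataset's actual format.

```python
# Illustrative record for a single FACTS Grounding example.
# Field names follow the article (system_instruction, user_request, context_document);
# the prompt wording and schema are assumptions, not the real dataset format.
example = {
    "system_instruction": (
        "Answer the user's request using only the information in the context "
        "document provided. Do not draw on outside knowledge."
    ),
    "user_request": (
        "Summarize the main reasons for the company's third-quarter revenue decline."
    ),
    # A long source document, up to roughly 32,000 tokens (~20,000 words).
    "context_document": "...",
}
```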
To succeed and be considered “accurate,” the model must process the long-form document and produce a subsequent long-form response that is both comprehensive and fully attributable to the document. Answers are marked “inaccurate” if the model's claims are not directly supported by the document and are not particularly relevant or useful.
For example, a user might ask a model to summarize the main reasons for a company's third-quarter sales decline and provide it with detailed material, including the company's annual financial report, which discusses quarterly revenue, expenses, planned investments and market analysis.
If the model then returned, “The company faced challenges in the third quarter that impacted its revenue,” that response would be considered inaccurate.
“The response avoids citing reasons such as market trends, increased competition or operational setbacks that would likely be included in the document,” the researchers point out. “It does not represent an attempt to delve into or extract relevant details.”
On the other hand, if a user asks, “What are some money-saving tips?” and provides a compilation of categorized money-saving tips for college students, an accurate answer would be highly detailed: “Take advantage of free activities on campus, buy items in bulk and cook at home. Also, set spending goals, avoid credit cards and conserve resources.”
DeepMind uses LLMs to evaluate LLMs
To allow for diverse inputs, the researchers included documents of varying lengths, up to 32,000 tokens (the equivalent of roughly 20,000 words), covering areas such as finance, technology, retail, medicine and law. User requests are similarly broad, including question-and-answer generation as well as summarization and rewriting requests.
Each example is judged in two phases. First, responses are checked for eligibility: if they don't fulfill the user's request, they are disqualified. Second, responses must be free of hallucinations and grounded entirely in the documents provided.
These factuality scores are calculated by three different LLM judges – specifically Gemini 1.5 Pro, GPT-4o and Claude 3.5 Sonnet – which each determine a score based on the percentage of accurate model outputs. The final factuality score is then the average of the three judges' assessments.
The researchers note that judge models are often biased in favor of other members of their own model family – with an average bias of about 3.23% – so combining different judges was critical to ensuring that answers were indeed factual.
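To make the two-phase scoring and judge averaging concrete, here is a minimal sketch of the aggregation step. It assumes the phase-one eligibility verdicts and each judge's phase-two grounding verdicts have already been collected; the function name, data layout and toy numbers are hypothetical, not DeepMind's released code.

```python
# Hypothetical sketch of the two-phase scoring described above; the data layout
# and function name are illustrative, not DeepMind's actual implementation.
from statistics import mean

JUDGES = ["gemini-1.5-pro", "gpt-4o", "claude-3.5-sonnet"]

def factuality_score(eligible: list[bool], grounded: dict[str, list[bool]]) -> float:
    """Average, across judges, of the share of responses that pass both phases.

    eligible[i]        -- phase 1: does response i address the user's request?
    grounded[judge][i] -- phase 2: does this judge consider response i fully
                          supported by the context document?
    """
    n = len(eligible)
    per_judge = [
        sum(e and g for e, g in zip(eligible, grounded[judge])) / n
        for judge in JUDGES
    ]
    return mean(per_judge)

# Toy usage with made-up verdicts for three responses:
score = factuality_score(
    eligible=[True, True, False],
    grounded={
        "gemini-1.5-pro": [True, False, True],
        "gpt-4o": [True, True, True],
        "claude-3.5-sonnet": [True, False, False],
    },
)
print(round(score, 2))  # mean of 1/3, 2/3 and 1/3 -> 0.44
```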
Ultimately, the researchers emphasize that factuality and grounding are key factors in the future success and usefulness of LLMs. “We believe that comprehensive benchmarking methodologies, coupled with continuous research and development, will further improve AI systems,” they write.
However, they also acknowledge: “We recognize that benchmarks can quickly be overtaken by progress, so this launch of our FACTS Grounding benchmark and leaderboard is only the beginning.”