A new study from Google's DeepMind research unit has found that an artificial intelligence system can outperform human fact-checkers in assessing the accuracy of information generated by large language models.
The paper, titled "Long-form factuality in large language models" and published on the preprint server arXiv, introduces a method called Search-Augmented Factuality Evaluator (SAFE). SAFE uses a large language model to break down generated text into individual facts and then uses Google Search results to determine the veracity of each claim.
"SAFE uses an LLM to break down a long answer into a series of individual facts and evaluate the accuracy of each fact using a multi-step reasoning process that involves submitting search queries to Google Search and determining whether a fact is supported by the search results," the authors explained.
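The pipeline described in that quote can be illustrated with a short sketch. The Python code below is a minimal, hypothetical illustration of a SAFE-style evaluator, not DeepMind's released implementation; the llm_complete and google_search helpers, the prompts, and the verdict labels are assumptions made for illustration only.

```python
# Minimal, hypothetical sketch of a SAFE-style factuality evaluator.
# `llm_complete` and `google_search` are assumed helpers (any LLM API and
# search API could stand in); prompts and labels are illustrative only.

from dataclasses import dataclass


@dataclass
class FactVerdict:
    fact: str
    label: str      # e.g. "supported", "not_supported", or "irrelevant"
    evidence: list  # search snippets consulted while rating the fact


def split_into_facts(answer: str, llm_complete) -> list[str]:
    """Step 1: use an LLM to break a long answer into individual facts."""
    prompt = (
        "Split the following answer into self-contained, individually "
        "checkable facts, one per line:\n\n" + answer
    )
    return [line.strip() for line in llm_complete(prompt).splitlines() if line.strip()]


def rate_fact(fact: str, llm_complete, google_search, max_queries: int = 3) -> FactVerdict:
    """Steps 2-3: issue search queries, then judge whether the fact is supported."""
    evidence = []
    for _ in range(max_queries):
        query = llm_complete(
            f"Fact: {fact}\nEvidence so far: {evidence}\n"
            "Write one Google search query that would help verify this fact."
        )
        evidence.extend(google_search(query))  # assumed to return a list of snippets

    verdict = llm_complete(
        f"Fact: {fact}\nSearch results: {evidence}\n"
        "Answer with exactly one word: supported, not_supported, or irrelevant."
    ).strip().lower()
    return FactVerdict(fact=fact, label=verdict, evidence=evidence)


def evaluate_long_answer(answer: str, llm_complete, google_search) -> list[FactVerdict]:
    """Full pipeline: decompose the answer, then rate each fact independently."""
    return [
        rate_fact(fact, llm_complete, google_search)
        for fact in split_into_facts(answer, llm_complete)
    ]
```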
“Superhuman” performance sparks debate
The researchers pitted SAFE against human annotators on a dataset containing around 16,000 facts and found that SAFE's ratings matched the human ratings 72% of the time. Even more strikingly, in a sample of 100 disagreements between SAFE and the human raters, SAFE's judgment was found to be correct 76% of the time.
While the paper claims that "LLM agents can achieve superhuman assessment performance," some experts question what "superhuman" really means here.
Gary Marcus, a well-known AI researcher and frequent critic of exaggerated claims, suggested on Twitter that "superhuman" in this case might simply mean "better than an underpaid crowd worker, rather than a true human fact-checker."
“That makes the characterization misleading,” he said. “Like saying the 1985 chess software was superhuman.”
Marcus makes a legitimate point. To truly demonstrate superhuman performance, SAFE would have to be compared with expert human fact-checkers, not just crowdsourced workers. Specific details about the human raters, such as their qualifications, compensation, and fact-checking process, are critical to properly contextualizing the results.
Cost savings and benchmarking of top models
A clear advantage of SAFE is cost: the researchers found that using the AI system was about 20 times cheaper than using human fact-checkers. As the volume of information generated by language models continues to grow, a cost-effective and scalable way to verify claims becomes increasingly important.
The DeepMind team used SAFE to evaluate the factual accuracy of 13 top language models from four families (Gemini, GPT, Claude, and PaLM-2) using a new benchmark called LongFact. Their results suggest that larger models generally produced fewer factual errors.
However, even the best-performing models produced a significant number of false claims. This highlights the risk of relying too heavily on language models that can fluently express inaccurate information. Automated fact-checking tools like SAFE could play a key role in mitigating these risks.
Transparency and human baselines are crucial
While the SAFE code and the LongFact dataset have been open-sourced on GitHub, allowing other researchers to scrutinize and build on the work, more transparency is still needed regarding the human baselines used in the study. Knowing the specifics of the crowdworkers' backgrounds and processes is essential to assessing SAFE's capabilities in the proper context.
As tech giants race to develop increasingly powerful language models for applications ranging from search to virtual assistants, the ability to automatically fact-check the outputs of these systems could prove crucial. Tools like SAFE represent an important step toward building a new level of trust and accountability.
However, it is crucial that the development of such consequential technologies happens in the open, with input from a wide range of stakeholders beyond the walls of any single company. Rigorous, transparent benchmarking against human experts, not just crowd workers, will be essential to measuring real progress. Only then can we assess the true impact of automated fact-checking on the fight against misinformation.