Google DeepMind unveils a “superhuman” AI system that excels at fact-checking, cutting costs and improving accuracy

March 29, 2024

A new study from Google's DeepMind research unit has found that an artificial intelligence system can outperform human fact-checkers when evaluating the accuracy of information generated by large language models.

The paper, titled “Long-form factuality in large language models” and published on the preprint server arXiv, introduces a method called Search-Augmented Factuality Evaluator (SAFE). SAFE uses a large language model to break down generated text into individual facts and then uses Google Search results to determine the accuracy of each claim.

“SAFE utilizes an LLM to break down a long-form response into a set of individual facts and to evaluate the accuracy of each fact using a multi-step reasoning process comprising sending search queries to Google Search and determining whether a fact is supported by the search results,” the authors explained.
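The decompose, search and judge loop described in the quote can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the names `evaluate_response`, `split_into_facts`, `search` and `is_supported` are hypothetical, the toy stand-ins replace the LLM and Google Search calls with plain string operations, and the multi-step reasoning is collapsed into a single search per fact.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class FactVerdict:
    fact: str
    supported: bool


def evaluate_response(
    response: str,
    split_into_facts: Callable[[str], List[str]],
    search: Callable[[str], List[str]],
    is_supported: Callable[[str, List[str]], bool],
) -> List[FactVerdict]:
    """Split a long answer into facts and judge each one against search results."""
    verdicts = []
    for fact in split_into_facts(response):
        snippets = search(fact)  # assumption: one query per fact
        verdicts.append(FactVerdict(fact, is_supported(fact, snippets)))
    return verdicts


# Toy stand-ins (assumptions): a real system would call an LLM and a search API.
TOY_SNIPPETS = [
    "The Eiffel Tower is in Paris, France.",
    "Water boils at 100 degrees Celsius at sea level.",
]


def toy_split(text: str) -> List[str]:
    # Treat each sentence as one "fact".
    return [part.strip() for part in text.split(".") if part.strip()]


def toy_search(query: str) -> List[str]:
    # Ignore the query and return the whole toy corpus.
    return list(TOY_SNIPPETS)


def toy_is_supported(fact: str, snippets: List[str]) -> bool:
    # A fact counts as supported if it appears verbatim in any snippet.
    return any(fact.lower() in snippet.lower() for snippet in snippets)


if __name__ == "__main__":
    answer = "The Eiffel Tower is in Paris. The Eiffel Tower is in Rome."
    for verdict in evaluate_response(answer, toy_split, toy_search, toy_is_supported):
        print(verdict)
```

Passing the three steps in as callables keeps the loop itself independent of any particular model or search provider, which is the only design point the sketch is meant to show.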

“Superhuman” performance sparks debate

The researchers pitted SAFE against human annotators on a dataset of roughly 16,000 facts and found that SAFE's assessments matched the human ratings 72% of the time. Even more strikingly, in a sample of 100 disagreements between SAFE and the human raters, SAFE's judgment was found to be correct in 76% of cases.
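To put those percentages in absolute terms, here is a back-of-the-envelope calculation. It treats "around 16,000" as exactly 16,000, and the last line assumes the 76% rate from the 100-case sample would hold across all disagreements, which the article does not claim.

```python
TOTAL_FACTS = 16_000             # "around 16,000" in the article; rounded here
AGREEMENT_PCT = 72               # share of facts where SAFE matched the human raters
SAFE_CORRECT_IN_SAMPLE_PCT = 76  # SAFE judged correct in the 100 sampled disagreements

agreed = TOTAL_FACTS * AGREEMENT_PCT // 100
disagreed = TOTAL_FACTS - agreed

# Assumption: extrapolate the sample rate to every disagreement.
safe_right_if_extrapolated = round(disagreed * SAFE_CORRECT_IN_SAMPLE_PCT / 100)

print(f"agreed: {agreed}")        # agreed: 11520
print(f"disagreed: {disagreed}")  # disagreed: 4480
print(f"SAFE correct, if the sample rate held: {safe_right_if_extrapolated}")  # 3405
```

Only 100 of the roughly 4,480 disagreements were actually reviewed, so the final figure is an illustration of scale rather than a reported result.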

While the paper claims that “LLM agents can achieve superhuman rating performance,” some experts question what “superhuman” really means here.

From a quick read I can't figure out much about the human subjects, but it looks like superhuman means better than an underpaid crowd worker, rather than a true human fact checker? That makes the characterization misleading. (Like saying that 1985 chess software was superhuman).…

– Gary Marcus (@GaryMarcus) March 28, 2024

Gary Marcus, a well-known AI researcher and frequent critic of overhyped claims, suggested on Twitter that “superhuman” in this case might simply mean “better than an underpaid crowd worker, rather than a true human fact checker.”

“That makes the characterization misleading,” he said. “Like saying the 1985 chess software was superhuman.”

Marcus raises a valid point. To truly demonstrate superhuman performance, SAFE would need to be benchmarked against expert human fact-checkers, not just crowdsourced workers. The specific details of the human raters, such as their qualifications, compensation and fact-checking process, are critical to properly contextualizing the results.

Cost savings and benchmarking of top models

One clear advantage of SAFE is cost: the researchers found that using the AI system was about 20 times cheaper than using human fact-checkers. As the volume of information generated by language models continues to grow, having an economical and scalable way to verify claims becomes increasingly important.

The DeepMind team used SAFE to evaluate the factual accuracy of 13 top language models from four families (Gemini, GPT, Claude, and PaLM-2) using a new benchmark called LongFact. Their results suggest that larger models generally produced fewer factual errors.

However, even the best-performing models produced a significant number of false claims. This highlights the risk of relying too heavily on language models that can fluently express inaccurate information. Automated fact-checking tools like SAFE could play a key role in mitigating these risks.

Transparency and human baselines are crucial

While the SAFE code and the LongFact dataset have been open-sourced on GitHub, allowing other researchers to scrutinize and build on the work, more transparency is needed regarding the human baselines used in the study. To assess SAFE's capabilities in the proper context, it is essential to understand the specifics of the crowdworkers' backgrounds and process.

As tech giants race to develop ever more powerful language models for applications ranging from search to virtual assistants, the ability to automatically fact-check the outputs of these systems could prove crucial. Tools like SAFE represent an important step toward building a new layer of trust and accountability.

However, it's crucial that the development of such consequential technologies happens in the open, with input from a broad range of stakeholders beyond the walls of any one company. Rigorous, transparent benchmarking against human experts, not just crowd workers, will be essential to measuring real progress. Only then can we gauge the real-world impact of automated fact-checking on the fight against misinformation.
