Researchers published a study comparing the accuracy and quality of summaries produced by LLMs. Claude 3 Opus performed particularly well, but humans still came out on top.
AI models are extremely useful for summarizing long documents when you don't have the time or inclination to read them.
The luxury of ever-growing context windows means we can feed models longer documents, which tests their ability to consistently and accurately present the facts in a summary.
Researchers from the University of Massachusetts Amherst, Adobe, the Allen Institute for AI, and Princeton University published a study to find out how good AI models are at summarizing book-length content (>100,000 tokens).
FABLES
They selected 26 books published in 2023 and 2024 and had the texts summarized by various LLMs. The recent publication dates were chosen to avoid potential data contamination from the models' original training data.
After the models created the summaries, the researchers extracted decontextualized claims from them using GPT-4. They then hired human annotators who had read the books and asked them to fact-check the claims.
The resulting data was compiled into a dataset called Faithfulness Annotations for Book-Length Summarization (FABLES). FABLES contains 3,158 claim-level faithfulness annotations across 26 narrative texts.
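The paper's exact prompts aren't reproduced here, but the claim-extraction step can be sketched roughly as follows. This is a minimal illustration only; the prompt wording, the parsing of the model output, and the use of the OpenAI Python client are assumptions, not details from the study.

```python
# Hypothetical sketch: asking GPT-4 to split a summary into atomic,
# decontextualized claims. Prompt text and parsing are illustrative.
from openai import OpenAI

client = OpenAI()

def extract_claims(summary: str) -> list[str]:
    """Break a book summary into decontextualized factual claims."""
    response = client.chat.completions.create(
        model="gpt-4",
        messages=[{
            "role": "user",
            "content": (
                "Split the following book summary into a numbered list of "
                "atomic, decontextualized factual claims:\n\n" + summary
            ),
        }],
    )
    text = response.choices[0].message.content
    claims = []
    for line in text.splitlines():
        line = line.strip()
        # Keep only numbered lines such as "3. The narrator moves to Paris."
        if line and line[0].isdigit() and "." in line:
            claims.append(line.split(".", 1)[1].strip())
    return claims
```

In the study, claims extracted this way were handed to human readers for verification rather than judged automatically.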
The test results showed that Claude 3 Opus was “by far the most faithful” book summarizer, with over 90% of its claims confirmed as truthful or accurate.
GPT-4 came in a distant second, with only 78% of its claims confirmed as true by human annotators.
The hard part
All the models tested appeared to struggle with the same problems. Most of the facts the models got wrong related to events or to the states of characters and relationships.
The paper states that “most of these claims can only be refuted through multi-hop reasoning over the evidence, highlighting the complexity of the task and its difference from existing fact-checking settings.”
LLMs also frequently omitted critical information from their summaries. They tended to place too much emphasis on content toward the end of the books and miss important content at the beginning.
Will AI replace human annotators?
Human annotators and fact-checkers are expensive. The researchers spent $5,200 on human annotators to verify the claims in the AI summaries.
Could an AI model have done the job for less money? Claude 3 handles simple fact-checking well, but its performance at verifying claims that require a deeper understanding of the content is less consistent.
When the AI models were presented with the extracted claims and asked to verify them, they fell short of the human annotators. They performed particularly poorly at identifying unfaithful claims.
Although Claude 3 Opus was by far the best claim verifier, the researchers concluded that it “ultimately performed too poorly to be a reliable automated evaluator.”
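As a rough idea of what such an automated evaluator looks like, the sketch below asks Claude 3 Opus to label a single claim against the source text. The prompt, the one-word answer format, and the use of the Anthropic Python client are assumptions for illustration; the study's actual protocol may differ.

```python
# Hypothetical sketch: automated claim verification with Claude 3 Opus.
import anthropic

client = anthropic.Anthropic()

def verify_claim(claim: str, book_text: str) -> str:
    """Ask the model whether a claim is faithful to the source text."""
    response = client.messages.create(
        model="claude-3-opus-20240229",
        max_tokens=10,
        messages=[{
            "role": "user",
            "content": (
                "Book text:\n" + book_text
                + "\n\nClaim:\n" + claim
                + "\n\nAnswer with exactly one word: faithful or unfaithful."
            ),
        }],
    )
    return response.content[0].text.strip().lower()
```

Even with this kind of setup, the researchers found that the model missed many unfaithful claims, which is why humans remained the more reliable judges.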
When it comes to understanding the nuances, complex human relationships, storylines, and character motivations in a long-form narrative, humans still appear to have the edge for now.