Despite rapid advances in LLMs, our understanding of how these models handle longer inputs remains limited.
Mosh Levy, Alon Jacoby, and Yoav Goldberg from Bar-Ilan University and the Allen Institute for AI examined how the performance of large language models (LLMs) changes as the length of the input text they are given to process grows.
They developed a reasoning evaluation framework specifically for this purpose, which allowed them to investigate the influence of input length on LLM reasoning in a controlled setting.
The framework presents different versions of the same question, each containing the information necessary to answer it and padded with additional, irrelevant text of varying lengths and types.
This isolates input length as a variable and ensures that changes in model performance can be attributed directly to the length of the input.
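A minimal sketch of this kind of setup is shown below. It is not the authors' code; the `build_prompt_variant` helper, the example facts, the padding paragraphs, and the target lengths are all hypothetical placeholders used only to illustrate the idea of holding the task fixed while varying input length.

```python
import random

def build_prompt_variant(key_facts: list[str], question: str,
                         padding_paragraphs: list[str], target_words: int) -> str:
    """Build one prompt variant: the facts needed to answer the question,
    surrounded by irrelevant padding until the prompt reaches roughly
    target_words words (a crude stand-in for a token budget)."""
    parts = list(key_facts)
    # Insert irrelevant paragraphs at random positions until the target length is reached.
    while sum(len(p.split()) for p in parts) < target_words:
        parts.insert(random.randrange(len(parts) + 1),
                     random.choice(padding_paragraphs))
    return "\n\n".join(parts) + f"\n\nQuestion: {question}\nAnswer True or False."

# The same question at several input lengths, so any change in accuracy
# can be attributed to length rather than to the task itself.
facts = ["Dana is older than Max.", "Max is older than Riva."]
question = "Is Dana older than Riva?"
padding = ["A completely unrelated paragraph about the weather in another city.",
           "Another irrelevant paragraph drawn from a different text entirely."]
variants = {n: build_prompt_variant(facts, question, padding, n)
            for n in (50, 100, 250, 500)}
```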
Key findings
Levy, Jacoby, and Goldberg found that LLMs exhibit a significant decline in reasoning performance at input lengths far below what their developers claim the models can handle. They documented these findings in the study.
The decline was observed consistently across all versions of the dataset, indicating a systemic issue with processing longer inputs rather than a problem tied to specific data samples or model architectures.
As the researchers describe it: “Our results show a notable degradation in the reasoning performance of LLMs at much shorter input lengths than their technical maximum. We show that the degradation trend occurs in every version of our dataset, albeit with varying intensity.”
Additionally, the study shows that traditional metrics such as perplexity, commonly used to evaluate LLMs, do not correlate with the models' performance on long-input reasoning tasks.
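To make the "no correlation" observation concrete, one could compare a model's perplexity with its reasoning accuracy across input-length buckets, for example with a rank correlation. The sketch below uses made-up placeholder numbers purely for illustration; they are not the study's measurements.

```python
from scipy.stats import spearmanr

# Hypothetical per-length-bucket measurements (placeholders, not the study's data):
# perplexity on the padded prompts vs. accuracy on the reasoning task.
input_lengths = [250, 500, 1000, 2000, 3000]    # bucket labels (approx. tokens)
perplexity    = [9.1, 8.7, 8.4, 8.2, 8.1]       # keeps improving as context grows
accuracy      = [0.92, 0.85, 0.74, 0.61, 0.55]  # reasoning keeps degrading

rho, p_value = spearmanr(perplexity, accuracy)
print(f"Spearman rho = {rho:.2f} (p = {p_value:.3f})")
# With numbers like these, rho is strongly positive: lower (better) perplexity
# coincides with lower accuracy, so perplexity is a poor proxy for reasoning quality.
```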
Further investigation revealed that the performance degradation does not depend only on the presence of irrelevant information (padding): it is also observed when the padding consists of duplicated relevant information.
If we keep the two core paragraphs together and add text around them, accuracy already drops. When paragraphs are inserted between them, the results drop even more. The drop occurs both when the added texts are similar to the task texts and when they are completely different. 3/7 pic.twitter.com/c91l9uzyme
– Mosh Levy (@mosh_levy) February 26, 2024
This suggests that the challenge for LLMs lies not only in filtering out noise but also in the inherent processing of longer text sequences.
Ignoring instructions
A critical failure mode highlighted in the study is the tendency of LLMs to disregard instructions embedded in the input as the input length increases.
Models also sometimes produced responses that signaled uncertainty or a lack of sufficient information, such as “The text does not contain enough information,” even though all the necessary information was present.
Overall, as input length increases, LLMs appear to struggle more and more to prioritize and focus on important information, including direct instructions.
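A rough way to quantify these failure modes is to classify each model reply by whether it followed the formatting instruction, claimed that information was missing, or did something else, and then track the category shares per input-length bucket. The heuristic below is a hypothetical sketch, not the study's evaluation code.

```python
import re

def classify_reply(reply: str) -> str:
    """Classify a model reply into the failure modes described above
    (a rough, hypothetical heuristic; real evaluation would be stricter)."""
    text = reply.strip().lower()
    if re.fullmatch(r"(true|false)\.?", text):
        return "followed_instruction"         # answered in the requested format
    if "enough information" in text:
        return "claimed_missing_information"  # refused despite sufficient facts
    return "other"                            # ignored the formatting instruction

replies = ["True", "False.", "The text does not contain enough information.",
           "Well, it depends on how you read the second paragraph..."]
print([classify_reply(r) for r in replies])
# Tracking these categories per input-length bucket shows whether instruction
# following and unwarranted refusals change as prompts get longer.
```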
Biases in the answers
Another notable problem was the growing bias in the models' responses as the inputs became longer.
In particular, LLMs increasingly tended to answer “false” as the input length grew. This bias points to a skew in the model's probability estimation or decision-making processes, perhaps as a defense mechanism in response to the increased uncertainty that comes with longer inputs.
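One simple way to check for this kind of label bias is to measure the share of “false” answers on a balanced set of true/false questions at each input length. The sketch below is hypothetical; `ask_model` stands in for whatever inference call is actually used.

```python
from collections import defaultdict

def false_answer_rate(examples, ask_model):
    """examples: dicts with 'prompt' and 'length_bucket' keys, drawn from a set
    whose gold labels are balanced between true and false.
    ask_model(prompt) -> the model's answer as a string.
    Returns the share of 'false' answers per length bucket."""
    counts = defaultdict(lambda: [0, 0])  # bucket -> [false_answers, total]
    for ex in examples:
        answer = ask_model(ex["prompt"]).strip().lower()
        counts[ex["length_bucket"]][0] += answer.startswith("false")
        counts[ex["length_bucket"]][1] += 1
    return {bucket: n_false / total for bucket, (n_false, total) in counts.items()}

# On a balanced dataset the rate should stay near 0.5 at every length;
# a rate that climbs with the length bucket reproduces the bias described above.
```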
The tendency to prefer “false” answers could also reflect an underlying imbalance in the training data or an artifact of the models' training process, in which negative answers may be over-represented or associated with contexts of uncertainty and ambiguity.
This tendency affects the accuracy of model outputs and raises concerns about the reliability and fairness of LLMs in applications that require nuanced understanding and impartiality.
Implementing robust bias detection and mitigation strategies throughout model training and fine-tuning is critical to reducing unwarranted biases in model responses.
Ensuring that training datasets are diverse, balanced, and representative of a wide range of scenarios can also help minimize bias and improve model generalization.
This adds to other recent studies that also highlight fundamental problems with the way LLMs work, suggesting that this “technical debt” could compromise the functionality and integrity of the models over time.