A widely known problem with large language models (LLMs) is their tendency to produce false or nonsensical output, commonly known as “hallucinations.” While much research has focused on analyzing these errors from the user's perspective, a recent study by researchers at the Technion, Google Research, and Apple examines the inner workings of LLMs and shows that these models have a much deeper grasp of truthfulness than previously thought.
The term “hallucination” has no generally accepted definition and covers a wide range of LLM errors. For their study, the researchers adopted a broad interpretation, treating hallucinations as any error produced by an LLM, including factual inaccuracies, biases, common-sense mistakes, and other real-world errors.
Most previous research on hallucinations has focused on analyzing the external behavior of LLMs and examining how users perceive these errors. However, these methods provide limited insight into how errors are encoded and processed within the models themselves.
Some researchers have examined the internal representations of LLMs and suggested that they encode signals of truthfulness. However, previous efforts have mostly focused on the last token generated by the model or the last token of the prompt. Because LLMs typically generate long-form answers, this approach can miss crucial details.
The new study takes a different approach. Instead of looking only at the final output, the researchers analyze “exact answer tokens”: the response tokens that, if modified, would change the correctness of the answer.
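To make the idea concrete, here is a minimal sketch (not the authors' code) of how exact answer tokens could be located in a generated response using a Hugging Face tokenizer. The model name and the assumption that the answer string appears verbatim in the generated text are purely illustrative.

```python
# Minimal sketch: locate the "exact answer tokens" inside a generated response
# by matching the answer string and mapping its character span back to token
# indices via the tokenizer's offset mapping.
from transformers import AutoTokenizer

def exact_answer_token_indices(tokenizer, generated_text: str, exact_answer: str):
    """Return indices of the tokens that spell out the exact answer."""
    start = generated_text.find(exact_answer)
    if start == -1:
        return []  # answer not found verbatim; a real pipeline would need fuzzier matching
    end = start + len(exact_answer)

    # offset_mapping gives the (start, end) character span covered by each token
    enc = tokenizer(generated_text, return_offsets_mapping=True, add_special_tokens=False)
    return [i for i, (s, e) in enumerate(enc["offset_mapping"]) if s < end and e > start]

# Illustrative usage with one of the model families mentioned in the study
tok = AutoTokenizer.from_pretrained("mistralai/Mistral-7B-v0.1")
answer_text = "The capital of France is Paris, of course."
print(exact_answer_token_indices(tok, answer_text, "Paris"))
```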
The researchers conducted their experiments on four variants of the Mistral 7B and Llama 2 models across ten datasets covering a range of tasks, including question answering, natural language inference, mathematical problem solving, and sentiment analysis. They let the models generate unconstrained responses to simulate real-world usage. Their results show that truthfulness information is concentrated in the exact answer tokens.
“These patterns are consistent across nearly all datasets and models, suggesting a general mechanism by which LLMs encode and process veracity during text generation,” the researchers write.
To predict hallucinations, the researchers trained classifier models they call “probing classifiers” to predict features related to the truthfulness of generated outputs from the internal activations of the LLMs. They found that training these classifiers on exact answer tokens significantly improves error detection.
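As a rough illustration of what such a probing classifier looks like, the sketch below trains a logistic-regression probe on hidden-state vectors taken at exact answer tokens. This is an assumed setup, not the paper's implementation: the random arrays stand in for real activations and correctness labels, which would come from the model and a gold-answer check.

```python
# Minimal sketch of a probing classifier: map a hidden-state vector collected
# at an exact answer token to a correct/incorrect label.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(2000, 4096))   # stand-in for 4096-dim hidden states of a 7B model
y = rng.integers(0, 2, size=2000)   # stand-in labels: 1 = correct answer, 0 = hallucination

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
probe = LogisticRegression(max_iter=1000).fit(X_train, y_train)
print("probe accuracy:", probe.score(X_test, y_test))  # ~0.5 on random data, higher on real activations
```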
“Our evidence that a trained probing classifier can predict errors suggests that LLMs encode information related to their own veracity,” the researchers write.
Generalizability and skill-specific veracity
The researchers also examined whether a probing classifier trained on one dataset could detect errors in others. They found that probing classifiers do not generalize across different tasks. Instead, they exhibit “skill-specific” veracity: they can generalize within tasks that require similar skills, such as factual recall or common-sense reasoning, but not across tasks that require different skills, such as sentiment analysis.
“Overall, our results suggest that models exhibit a multifaceted representation of veracity,” the researchers write. “They encode truthfulness not through a single unified mechanism, but rather through multiple mechanisms, each corresponding to different conceptions of truth.”
Further experiments showed that these probing classifiers could predict not only the presence of errors but also the types of errors the model is likely to make. This suggests that LLM representations contain information about the specific ways in which they fail, which could be useful for developing targeted mitigation strategies.
Finally, the researchers examined how the internal truthfulness signals encoded in LLM activations align with the models' external behavior. In some cases they found a surprising discrepancy: a model's internal activations could identify the correct answer, yet it consistently produced an incorrect one.
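One way to surface this kind of discrepancy is to let a trained probe re-rank several sampled answers and compare its pick with the model's default greedy answer. The sketch below is hypothetical: the probe, the sampled answers, and the helper `get_answer_activation` are assumed to exist and are not part of the paper's released code.

```python
# Hypothetical sketch: among several sampled answers, pick the one whose
# exact-answer activation the probe scores as most likely to be correct.
# `probe` is a trained classifier with predict_proba (e.g. the sketch above);
# `get_answer_activation(ans)` is an assumed helper returning the hidden-state
# vector at the exact answer token of a sampled answer.

def probe_selected_answer(probe, sampled_answers, get_answer_activation):
    scores = [
        probe.predict_proba(get_answer_activation(ans).reshape(1, -1))[0, 1]
        for ans in sampled_answers
    ]
    best = max(range(len(sampled_answers)), key=lambda i: scores[i])
    return sampled_answers[best]

# If the probe-selected answer is often correct while the greedy answer is not,
# the model's internals "know" more than its default decoding shows.
```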
This finding suggests that current evaluation methods, which rely solely on the final output of LLMs, may not accurately reflect their true capabilities. By better understanding and leveraging the internal knowledge of LLMs, it may be possible to unlock hidden potential and significantly reduce errors.
Future Impact
The study's findings may help in developing better systems to curb hallucinations. However, the techniques it uses require access to internal LLM representations, which is mainly feasible with open-source models.
Still, the results have broader implications for the field. The insights gained from analyzing internal activations can help develop more effective error detection and mitigation techniques. This work is part of a wider body of research aimed at better understanding what happens inside LLMs and the billions of activations that occur at each inference step. Leading AI labs such as OpenAI, Anthropic, and Google DeepMind have been working on various techniques for interpreting the inner workings of language models. Together, these studies can help build more robust and reliable systems.
“Our results suggest that the internal representations of LLMs provide useful insights into their errors, highlighting the complex connection between the internal processes of models and their external outcomes, and hopefully paving the way for further improvements in error detection and mitigation,” the researchers write.