Large language models don’t behave like humans, even though we might expect them to.

One aspect that makes large language models (LLMs) so powerful is the variety of tasks they can be applied to. The same machine-learning model that helps a graduate student draft an email could also help a clinician diagnose cancer.

However, the broad applicability of these models also makes them difficult to evaluate systematically. It would be impossible to build a benchmark dataset that tests a model on every question it could be asked.

In a new paper, researchers at MIT took a different approach. They argue that, because people decide when to use large language models, evaluating a model requires understanding how people form beliefs about its capabilities.

For example, the graduate student must decide whether the model could be helpful in drafting a particular email, and the clinician must determine in which cases the model is best consulted.

Building on this idea, the researchers created a framework for evaluating an LLM based on its alignment with a person's beliefs about how it will perform on a given task.

They introduce a human generalization function, a model of how people update their beliefs about an LLM's capabilities after interacting with it. They then evaluate how closely LLMs align with this human generalization function.

Their results show that when models are misaligned with the human generalization function, a user may be overconfident or underconfident about where to deploy them, which can lead to unexpected model failures. Because of this misalignment, more capable models also tend to perform worse than less capable ones in high-stakes situations.

“These tools are exciting because they’re universal. But because they’re universal, they will be working with humans, so we need to involve humans in the process,” says Ashesh Rambachan, study co-author, assistant professor of economics, and principal investigator at the Laboratory for Information and Decision Systems (LIDS).

Rambachan is joined on the paper by lead author Keyon Vafa, a postdoctoral fellow at Harvard University, and Sendhil Mullainathan, an MIT professor in the departments of Electrical Engineering and Computer Science and of Economics, and a member of LIDS. The research will be presented at the International Conference on Machine Learning.

Human generalization

When we interact with other people, we form beliefs about what they know and what they don't. For example, if your friend is fussy about correcting other people's grammar, you might generalize and assume they would also be excellent at sentence construction, even though you've never asked them any questions about sentence construction.

“Language models often seem so human. We wanted to show that this power of human generalization is also present in the way people form beliefs about language models,” says Rambachan.

As a starting point, the researchers formally defined the human generalization function. It involves asking questions, observing how a person or an LLM answers them, and then drawing inferences about how that person or model would answer related questions.
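To make the idea concrete, here is a minimal sketch, in Python, of a human generalization function treated as a predictor that maps observed question-and-answer outcomes to a belief about a related question. The class names, the similarity weighting, and the topics are illustrative assumptions for this article, not the formulation used in the paper.

# Illustrative sketch only: a toy "human generalization function" that maps
# observed (topic, correct?) outcomes to a belief about a related topic.
# The similarity weights and topic labels are invented for this example.
from dataclasses import dataclass


@dataclass
class Observation:
    topic: str                # e.g., "matrix inversion"
    answered_correctly: bool  # what the person or LLM did on that question


def human_generalization(observations: list[Observation],
                         new_topic: str,
                         similarity: dict[tuple[str, str], float]) -> float:
    """Return a belief in [0, 1] that the same responder answers a question
    on new_topic correctly, weighting past outcomes by topic similarity."""
    weighted = total = 0.0
    for obs in observations:
        w = similarity.get((obs.topic, new_topic), 0.0)
        weighted += w * float(obs.answered_correctly)
        total += w
    return weighted / total if total > 0 else 0.5  # 0.5 = no information


# Seeing a correct answer on matrix inversion raises the belief that
# simple arithmetic will also be answered correctly.
sim = {("matrix inversion", "simple arithmetic"): 0.9}
belief = human_generalization([Observation("matrix inversion", True)],
                              "simple arithmetic", sim)
print(belief)  # -> 1.0

In this toy version, a model is "aligned" with the function if its actual accuracy on the new topic matches the belief such observations would produce in a person.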

If someone sees that an LLM can correctly answer questions about matrix inversion, they might assume it would also ace questions about simple arithmetic. A model that is misaligned with this function, that is, one that does poorly on questions a human would expect it to answer correctly, could fail in deployment.

With this formal definition in hand, the researchers designed a survey to measure how people generalize when interacting with LLMs and other people.

They showed survey participants questions that a person or an LLM had answered correctly or incorrectly, and then asked whether they believed that person or LLM would answer a related question correctly. Through the survey, they generated a dataset of nearly 19,000 examples of how people generalize about LLM performance across 79 diverse tasks.
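For illustration, a single entry in such a dataset might record what the participant observed and what they predicted. The field names and values below are hypothetical, chosen only to make the survey setup concrete; they are not the paper's released schema.

# Hypothetical illustration of one survey record; field names and values are
# invented for clarity and do not reflect the paper's released dataset.
example = {
    "responder": "llm",                  # whether the observed answers came from a person or an LLM
    "observed_question": "Invert the matrix [[2, 0], [0, 2]].",
    "observed_answer_correct": True,     # participants saw whether that answer was right
    "probe_question": "What is 7 + 5?",  # the related question they were asked to judge
    "predicted_correct": True,           # participant's belief about the probe question
}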

Measuring alignment errors

They found that participants did reasonably well at predicting whether a human who answered one question correctly would also answer a related question correctly, but they were much worse at generalizing about the performance of LLMs.

“Human generalization gets applied to language models, but that breaks down because these language models don’t show patterns of expertise the way people would,” says Rambachan.

Participants were also more likely to update their beliefs about an LLM when it answered questions incorrectly than when it answered them correctly. They also tended to believe that an LLM's performance on simple questions would have little bearing on its performance on more complex ones.

In situations where people gave more weight to incorrect answers, simpler models outperformed very large models such as GPT-4.

“Language models that keep getting better can almost trick people into thinking they will perform well on related questions when, in reality, that’s not the case,” he says.

One possible explanation for why people are worse at generalizing about LLMs could be their novelty: people have far less experience interacting with LLMs than with other people.

“In the long run, it is possible that we will get better just by interacting with language models more,” he says.

To this end, the researchers would like to conduct further studies of how people's beliefs about LLMs evolve over time as they interact with a model. They also want to explore how human generalization could be incorporated into the development of LLMs.

“If we train these algorithms in the first place, or try to update them with human feedback, we have to take the human generalization function into account when we think about measuring performance,” he says.

In the meantime, the researchers hope their dataset can be used as a benchmark to compare how LLMs perform with respect to the human generalization function, which could help improve the performance of models deployed in real-world situations.

“For me, the contribution of the paper is twofold. The first is practical: The paper exposes a critical problem in making LLMs available for general consumer use. If people don't have a correct understanding of when LLMs are accurate and when they fail, they are more likely to notice errors and may be put off from further use. This highlights the issue of tailoring the models to people's understanding of generalization,” says Alex Imas, a professor of behavioral science and economics at the University of Chicago's Booth School of Business, who was not involved in this work. “The second contribution is more fundamental: The lack of generalization to expected problems and domains helps build a better picture of what the models are doing when they solve a problem ‘correctly.’ It provides a test of whether LLMs ‘understand’ the problem they are solving.”

This research was funded, in part, by the Harvard Data Science Initiative and the Center for Applied AI at the University of Chicago Booth School of Business.
