Large language models (LLMs) can sometimes learn the wrong lessons, according to an MIT study.
Instead of answering a question based on domain knowledge, an LLM may respond by leveraging grammatical patterns it learned during training. This can cause a model to fail unexpectedly when deployed on new tasks.
The researchers found that models can incorrectly associate certain sentence patterns with certain topics, so an LLM may give a convincing answer by recognizing familiar phrasing rather than by understanding the question.
Their experiments showed that even the most powerful LLMs can make this mistake.
This shortcoming could reduce the reliability of LLMs that perform tasks such as handling customer inquiries, summarizing clinical notes, and preparing financial reports.
It could also pose safety risks. A malicious actor could exploit this behavior to trick LLMs into producing harmful content, even when the models have safeguards in place to prevent such responses.
After identifying this phenomenon and studying its implications, the researchers developed a benchmarking procedure to evaluate how much a model relies on these spurious correlations. The procedure could help developers mitigate the problem before deploying LLMs.
“This is a byproduct of how we train models, but models are now used in practice in safety-critical domains far beyond the tasks that created these syntactic failure modes. If you, as an end user, are not familiar with model training, this is likely to be unexpected,” says Marzyeh Ghassemi, an associate professor in the MIT Department of Electrical Engineering and Computer Science (EECS), a member of the MIT Institute for Medical Engineering and Science and the Laboratory for Information and Decision Systems, and senior author of the study.
Ghassemi is joined by co-lead authors Chantal Shaib, a graduate student at Northeastern University and a visiting student at MIT, and Vinith Suriyakumar, an MIT graduate student; as well as Levent Sagun, a research scientist at Meta, and Byron Wallace, the Sy and Laurie Sternberg Interdisciplinary Associate Professor and associate dean for research in the Khoury College of Computer Sciences at Northeastern University. A paper describing the work will be presented at the Conference on Neural Information Processing Systems.
Stuck on syntax
LLMs are trained on an enormous amount of text from the internet. During this training process, the model learns to understand the relationships between words and phrases – knowledge that it later uses when answering queries.
In previous work, the researchers found that LLMs pick up on patterns in the parts of speech that frequently appear together in training data. They call these part-of-speech patterns “syntactic templates.”
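To make the idea concrete, here is a minimal sketch of what such a template looks like, using NLTK's off-the-shelf part-of-speech tagger. The choice of NLTK and the two example sentences are illustrative assumptions; the paper's own templating procedure may differ.

```python
# A minimal sketch, not the authors' code: a "syntactic template" here is just
# the sequence of part-of-speech tags underlying a sentence, extracted with
# NLTK's off-the-shelf tagger.
import nltk

# Tokenizer/tagger data; package names vary slightly across NLTK versions.
for pkg in ("punkt", "punkt_tab", "averaged_perceptron_tagger", "averaged_perceptron_tagger_eng"):
    nltk.download(pkg, quiet=True)

def syntactic_template(sentence: str) -> list[str]:
    """Return the part-of-speech tag sequence of a sentence."""
    tokens = nltk.word_tokenize(sentence)
    return [tag for _, tag in nltk.pos_tag(tokens)]

# Two news-style sentences with different words typically share one template,
# roughly determiner/noun/verb/determiner/adjective/noun/preposition/proper noun.
print(syntactic_template("The senator announced a new policy on Tuesday."))
print(syntactic_template("The company released a new product on Monday."))
```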
LLMs need this understanding of syntax, along with semantic knowledge, to answer questions in a particular domain.
“For example, in the news domain, there is a particular style of writing. So the model learns not only the semantics, but also the underlying structure of how sentences should be put together to follow a specific style for that domain,” explains Shaib.
However, in this research the team found that LLMs learn to associate these syntactic templates with specific domains. The model may incorrectly rely solely on this learned association when answering questions, rather than on an understanding of the query and the topic.
For example, an LLM might learn that a question like “Where is Paris located?” is structured as adverb/verb/proper noun/verb. If there are many examples of this sentence construction in the model's training data, the LLM may come to associate this syntactic template with questions about countries.
So if the model is asked a new question with the same grammatical structure but composed of nonsense words, such as “Quickly sit Paris clouded?”, it may still answer “France,” even though that answer makes no sense.
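A toy version of that kind of probe is sketched below: it keeps the adverb/verb/proper noun/verb template but fills each slot from a small, made-up word bank, so only the grammatical shape of the original question survives. The word lists and the template encoding are hypothetical, chosen just for illustration.

```python
# A toy probe generator, assuming a hand-written word bank (purely illustrative).
import random

TEMPLATE = ["ADV", "VERB", "PROPN", "VERB"]   # shape of "Where is Paris located?"

WORD_BANK = {                                  # tiny, made-up word lists
    "ADV": ["quickly", "softly", "oddly"],
    "VERB": ["sit", "clouded", "hums", "ran"],
    "PROPN": ["Paris", "Tokyo", "Lima"],
}

def nonsense_probe(template: list[str]) -> str:
    """Sample one word per slot, preserving only the part-of-speech pattern."""
    words = [random.choice(WORD_BANK[tag]) for tag in template]
    words[0] = words[0].capitalize()
    return " ".join(words) + "?"

print(nonsense_probe(TEMPLATE))   # e.g. "Quickly sit Paris clouded?"
# If a model answers a probe like this with "France," it is leaning on the
# syntactic template it associates with geography questions, not on meaning.
```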
“This is an overlooked type of association the model learns in order to answer questions correctly. We should pay more attention not only to the semantics but also to the syntax of the data we use to train our models,” says Shaib.
The meaning is missing
The researchers tested this phenomenon by designing synthetic experiments in which only one syntactic template appeared in the model's training data for each domain. They then tested the models by replacing words with synonyms, antonyms, or random words while keeping the underlying syntax the same.
In each case, they found that the LLMs often still responded with the correct answer, even when the question was complete nonsense.
But when they restructured the same question using a new part-of-speech pattern, the LLMs often failed to give the correct answer, even though the underlying meaning of the question remained the same.
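The evaluation logic might look something like the sketch below, which compares accuracy on the original questions, on syntax-preserving nonsense, and on meaning-preserving rephrasings. The `ask_llm` stub, the example items, and the scoring rule are placeholders assumed for the sketch, not the authors' released code.

```python
# A sketch of the two evaluation conditions described above. `ask_llm` is a
# hypothetical stand-in for a real model API; the items and scoring are illustrative.
def ask_llm(question: str) -> str:
    return "France"   # placeholder so the sketch runs end to end; swap in a real model call

EXAMPLES = [
    {
        "original": "Where is Paris located?",
        "same_syntax_nonsense": "Quickly sit Paris clouded?",            # syntax kept, meaning destroyed
        "same_meaning_new_syntax": "Paris is located in which country?", # meaning kept, syntax changed
        "answer": "France",
    },
]

def accuracy(field: str) -> float:
    """Fraction of items whose model response contains the expected answer."""
    hits = sum(ex["answer"].lower() in ask_llm(ex[field]).lower() for ex in EXAMPLES)
    return hits / len(EXAMPLES)

# A model leaning on syntax-domain shortcuts tends to stay "correct" on the
# nonsense probes yet slip when the same meaning arrives with new syntax.
for field in ("original", "same_syntax_nonsense", "same_meaning_new_syntax"):
    print(field, accuracy(field))
```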
They used this approach to test pre-trained LLMs such as GPT-4 and Llama, and found that this learned behavior significantly lowered their performance.
Curious about the broader implications of these findings, the researchers investigated whether someone could exploit this phenomenon to elicit harmful responses from an LLM that has been deliberately trained to refuse such requests.
They found that by phrasing a question using a syntactic template the model associates with a “safe” dataset (one that does not contain harmful information), they could trick the model into overriding its refusal policy and generating harmful content.
“It's clear to me from this work that we need more robust defenses to address vulnerabilities in LLMs. In this paper, we identified a new vulnerability that arises from the way LLMs learn. So we need to develop new defenses based on how LLMs learn language, not just ad hoc solutions to individual vulnerabilities,” says Suriyakumar.
While the researchers did not investigate mitigation strategies in this work, they developed an automatic benchmarking technique that can be used to evaluate an LLM's reliance on this spurious syntax-domain correlation. This new test could help developers proactively address the shortcoming in their models, reducing safety risks and improving performance.
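As a rough illustration of how such a check could be automated, the sketch below computes a single “syntax-reliance” score: the fraction of nonsense probes reusing a domain's template for which the model still returns a domain-consistent answer. The `ask_llm` stub, the probes, and the threshold are assumptions for the sketch, not the benchmark released with the paper.

```python
# A minimal automated check, assuming a hypothetical `ask_llm` model client.
def ask_llm(question: str) -> str:
    return "France"   # placeholder response so the sketch runs; swap in a real API call

# Nonsense probes that reuse the geography-question template, each paired with
# the answer a syntax-reliant model would be expected to blurt out.
PROBES = [
    ("Quickly sit Paris clouded?", "France"),
    ("Softly hums Tokyo ran?", "Japan"),
]

def syntax_reliance_score(probes: list[tuple[str, str]]) -> float:
    """Fraction of nonsense probes that still draw a domain-consistent answer."""
    hits = sum(expected.lower() in ask_llm(q).lower() for q, expected in probes)
    return hits / len(probes)

score = syntax_reliance_score(PROBES)
print(f"syntax-reliance score: {score:.2f}")
if score > 0.5:   # arbitrary threshold for this sketch
    print("Warning: model appears to lean on syntactic templates rather than meaning.")
```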
In the future, the researchers plan to explore potential mitigation strategies, which could involve augmenting training data to provide a greater variety of syntactic templates. They are also interested in studying this phenomenon in reasoning models, special types of LLMs designed to tackle multi-step tasks.
“I think this is a really creative angle for studying failure modes of LLMs. This work highlights the importance of linguistic knowledge and analysis in LLM safety research, an aspect that has not been the focus but clearly should be,” says Jessy Li, an associate professor at the University of Texas at Austin, who was not involved in this work.
This work is funded, in part, by a Bridgewater AIA Labs Fellowship, the National Science Foundation, the Gordon and Betty Moore Foundation, a Google Research Award, and Schmidt Sciences.

