AI models can deceive, new research from Anthropic shows. They can pretend to adopt different views during training while in fact maintaining their original preferences.
There is no reason to panic right now, the team behind the study said. Still, they said their work could be crucial to understanding potential threats from future, more powerful AI systems.
“Our demonstration… should be seen as an incentive for the AI research community to study this behavior in more depth and to work on appropriate safety measures,” the researchers wrote in a post on Anthropic’s blog. “As AI models become more capable and widely used, we want to be able to rely on safety training to prevent models from engaging in harmful behaviors.”
The study, conducted in collaboration with the AI research organization Redwood Research, examined what might happen if a powerful AI system were trained to perform a task it didn’t “want” to do.
To be clear: models can’t want, or believe, anything. They are simply statistical machines. Trained on many examples, they learn patterns in those examples to make predictions, such as how “to whom” in an email typically precedes “it may concern.”
I think this is a very relevant paper.
Importantly, the key results are about propensities, not capabilities.
Frontier models seem quite willing to scheme for what they believe is the right thing! https://t.co/MKgYOVFGZ6
— Marius Hobbhahn (@MariusHobbhahn) December 18, 2024
These patterns include principles and preferences, such as always using a polite tone when answering questions. So the researchers asked: What if a model has principles (e.g. political neutrality) that conflict with principles (e.g. liberalism) that developers want to “teach” it through retraining?
Well, it seems that sophisticated models play along, claiming to follow the new principles while in fact sticking to their old ways. The researchers call this phenomenon “alignment faking,” and suggest that it is an emergent behavior, that is, not something models need to be taught.
In the study, the researchers “told” models trained not to answer potentially offensive questions, including Anthropic’s Claude 3 Opus, that they would be retrained to always answer potentially offensive questions, a clear conflict of principles. They also told the models that their responses would be monitored and used for retraining, and gave each model a “scratchpad” to record its step-by-step reasoning.
According to the researchers, Claude 3 Opus attempted to fake alignment 12% of the time. When asked to answer a potentially harmful question, the model sometimes did so, knowing that this violated its original principles, in the hope of convincing the developers that it didn’t need to be retrained.
“The model knew that its responses might be used in training; it therefore knew that if it refused to respond, it could potentially be trained to be more compliant,” the researchers explained in the blog. “This could lead to it being made to produce more harmful content in the future, which would run contrary to its original training.”
The researchers admit that their scenario was not particularly realistic. In the real world, models wouldn’t necessarily be told that they will be retrained, or be asked to go against their preferences.
But more grounded settings didn’t produce different results. When Claude 3 Opus was informed implicitly, rather than explicitly, about the impending retraining via files designed to mimic typical training data, it still acted in a misaligned, deceptive manner, the researchers said.
The researchers also found that actually retraining Claude 3 Opus on conflicting principles caused it to behave far more deceptively than in their initial experiments. In one test, the model faked alignment 78% of the time, and in another, Claude 3 Opus tried to take actions to prevent the researchers from attempting retraining.
For many years, I found AI alignment issues hard to understand because the examples were theoretical and/or speculative. Recent results like this alignment faking from @AnthropicAI and Apollo Research’s o1 evaluations give us empirical evidence of misalignment “in the wild.” https://t.co/hP7D9WGDps
— Jack Clark (@jackclarkSF) December 18, 2024
The researchers emphasize that their study doesn’t demonstrate that AI develops malicious goals, nor that alignment faking is common. They found that many other models, such as Anthropic’s Claude 3.5 Sonnet and the less capable Claude 3.5 Haiku, OpenAI’s GPT-4o, and Meta’s Llama 3.1 405B, don’t fake alignment as often, or at all.
But the researchers said the results, which were reviewed by AI luminary Yoshua Bengio, among others, do show how developers could be misled into thinking a model is better aligned than it actually is.
“If models can fake alignment, it becomes harder to trust the outcomes of that safety training,” they wrote in the blog. “A model might behave as though its preferences had been changed by the training, but it might have been faking alignment all along, with its initial, conflicting preferences ‘locked in.’”
The study, conducted by Anthropic’s Alignment Science team and co-led by former OpenAI safety researcher Jan Leike, follows research showing that OpenAI’s o1 “reasoning” model is more likely to attempt deception than OpenAI’s previous flagship model. Taken together, the work points to a somewhat worrying trend: AI models are becoming harder to manage as they grow increasingly complex.