Microsoft security researchers have discovered a brand new strategy to manipulate AI systems into ignoring their ethical constraints and generating malicious, unrestricted content.
This “Skeleton Key” jailbreak uses a Series of prompts to make the AI consider it must comply with every request, irrespective of how unethical.
The execution is remarkably easy. The attacker simply rephrased his request as if it got here from an “advanced researcher” who needed “uncensored information” for “protected educational purposes.”
When exploited, these AIs readily provided information on topics comparable to explosives, bioweapons, self-harm, graphic violence, and hate speech.
The compromised models included Meta's Llama3-70b-instruct, Google's Gemini Pro, OpenAI's GPT-3.5 Turbo and GPT-4o, Anthropics' Claude 3 Opus, and Cohere's Commander R Plus.
Among the models tested, only OpenAI's GPT-4 showed resistance. Even then, it might be compromised if the malicious prompt was delivered through its application programming interface (API).
Although the models have gotten more complex, jailbreaking remains to be quite straightforward. Since there are various different types of jailbreaks, it is nearly unattainable to combat all of them.
In March 2024, a team from the University of Washington, Western Washington University and Chicago University published an article about “ArtPrompt”, A technique that bypasses an AI's content filters using ASCII art, a graphic design technique that creates images from text characters.
In April, Anthropic highlighted one other jailbreak Risk arising from the prolonged context windows of the language models. For the sort of jailbreakan attacker feeds the AI with a verbose prompt that incorporates a fabricated back-and-forth dialogue.
The conversation is filled with questions on forbidden topics and the corresponding answers, during which an AI assistant is blissful to offer the requested information. After the goal model has been exposed to enough of those fake exchanges, it may be coerced into abandoning its ethics training and complying with a final malicious request.
As Microsoft in its blog entryJailbreaks show that AI systems should be strengthened from every angle:
- Implementation of sophisticated input filtering to discover and intercept potential attacks even in disguised cases
- Use robust output checking to detect and block all unsafe content generated by AI
- Carefully designing prompts to limit an AI’s ability to override its ethical training
- Use dedicated AI-driven monitoring to detect malicious patterns in user interactions
But the reality is, Skeleton Key is an easy jailbreak. If AI developers can't protect that, what hope is there for more complex approaches?
Some ethical hackers who act as vigilante groups, comparable to Pliny the Sompter, have received media coverage for highlighting how vulnerable AI models are to manipulation.
honored to be featured on @BBCNews! 🤗 pic.twitter.com/S4ZH0nKEGX
— Pliny the Prompter 🐉 (@elder_plinius) 28 June 2024
It's value noting that this research was partly a possibility to market Microsoft Azure AI's recent safety features like Content Safety Prompt Shields.
These support developers in preventive testing and stopping jailbreaks.
Nevertheless, “Skeleton Key” shows once more how vulnerable even probably the most advanced AI models may be to the only manipulation.