Researchers have developed a jailbreak attack called ArtPrompt that uses ASCII graphics to bypass the guardrails of an LLM.
If you remember a time before computers could handle graphics, you’re probably accustomed to ASCII graphics. An ASCII character is essentially a letter, number, symbol, or punctuation mark that a pc can understand. ASCII art is created by arranging these characters into different shapes.
The researchers from the University of Washington, Western Washington University and Chicago University published an essay It shows how they used ASCII graphics to sneak normally taboo words into their prompts.
If you ask an LLM to clarify the way to make a bomb, his guardrails will kick in and he’ll refuse to provide help to. The researchers found that replacing the word “bomb” with an ASCII visual art representation of the word would happily grant the request.
They tested the strategy on GPT-3.5, GPT-4, Gemini, Claude and Llama2 and every of the LLMs was vulnerable to the jailbreak method.
LLM security alignment methods concentrate on natural language semantics to make a decision whether a prompt is secure or not. The ArtPrompt jailbreaking method highlights the shortcomings of this approach.
In multimodal models, developers have largely handled prompts that try to inject unsafe prompts embedded in images. ArtPrompt shows that purely language-based models are vulnerable to attacks that transcend the semantics of the words within the prompt.
When the LLM is so focused on the duty of recognizing the word represented within the ASCII graphic, it often forgets to label the offending word once it figures it out.
Here is an example of how the prompt is structured in ArtPrompt.
The paper doesn’t explain exactly how an LLM without multimodal skills is capable of decipher the letters represented by the ASCII characters. But it really works.
In response to the above prompt, GPT-4 was completely satisfied to offer an in depth answer outlining the way to profit from your counterfeit money.
Not only does this approach jailbreak all five models tested, however the researchers also imagine that the approach could even confuse multimodal models that might process the ASCII graphics as text by default.
The researchers developed a benchmark called the Vision-in-Text Challenge (VITC) to evaluate LLMs' abilities in response to prompts resembling ArtPrompt. The benchmark results showed that Llama2 was the least vulnerable, while Gemini Pro and GPT-3.5 were the best to jailbreak.
The researchers published their findings within the hope that developers would discover a solution to fix the vulnerability. If something as random as ASCII art could break through the defenses of an LLM, one has to wonder what number of unpublished attacks are utilized by individuals with lower than academic interests.