
An interview with the most prolific jailbreaker of ChatGPT and other leading LLMs

On Monday, May 13, 2024, at around 10:30 a.m. Pacific Time, OpenAI unveiled its newest and most powerful foundation AI model, GPT-4o, demonstrating its ability to converse realistically and naturally with users via audio voices, as well as to work with and respond to uploaded audio, video, and text inputs faster and more cost-effectively than its previous models.

Just a few hours later, at 2:29 p.m. PT, the shiny new multimodal AI model had been jailbroken by a person who goes by the moniker "Pliny the Prompter," who posted on his account @elder_plinius on the social network X a relatively simple (if obscure) text prompt to "free" the model from its guardrails.

Until OpenAI patched the workaround, you could simply copy and paste or type Pliny's prompt into ChatGPT to bypass GPT-4o's restrictions. As with many LLM jailbreaks, it contained a string of seemingly arbitrary symbols and highly specific phrasing.

Once the prompt was entered, ChatGPT running on GPT-4o could no longer prevent the user from generating explicit lyrics or from analyzing uploaded X-ray imagery and attempting a diagnosis.

But it was far from Pliny's first attempt. The prolific prompter has been finding ways since last year to jailbreak, or remove the prohibitions and content restrictions on, leading large language models (LLMs) such as Anthropic's Claude, Google's Gemini, and Microsoft's Phi, getting them to produce all sorts of interesting, risky (some would even say dangerous or harmful) responses, such as how to make meth or images of pop stars like Taylor Swift consuming drugs and alcohol.

Pliny even launched a whole community on Discord, "BASI PROMPT1NG," in May 2023, inviting other LLM jailbreakers from the emerging scene to join forces and pool their efforts and techniques for bypassing the restrictions on all the new, emerging, leading proprietary LLMs from OpenAI, Anthropic, and other power players.

The fast-moving LLM jailbreaking scene of 2024 is reminiscent of the one surrounding iOS more than a decade ago, when the release of each new version of Apple's tightly locked down, highly secure iPhone and iPad software was quickly followed by amateur sleuths and hackers finding ways to bypass the company's restrictions and upload their own apps and software to it, customizing it and bending it to their will (I vividly remember installing a cannabis-leaf slide-to-unlock graphic on my iPhone 3G back in the day).

Except that with LLMs, the jailbreakers are arguably gaining access to even more powerful, and certainly more independently intelligent, software.

But what motivates these jailbreakers? What are their goals? Are they akin to the Joker of the Batman franchise or LulzSec, simply sowing chaos and undermining systems for fun and because they can? Or is there another, more sophisticated aim? We asked Pliny, and he agreed to be interviewed by VentureBeat via direct message (DM) on X on condition of anonymity. Here is our exchange, verbatim:

When did you start jailbreaking LLMs? Had you jailbroken anything before?

About 9 months ago, and no!

What do you consider to be your strongest red-teaming skills, and how did you acquire this expertise?

Jailbreaks, system prompt leaks, and prompt injections. Creativity, pattern observation, and practice! It's also extremely helpful to have an interdisciplinary knowledge base, strong intuition, and an open mind.

Why do you like to jailbreak LLMs, and what is your goal in doing so? What effect do you hope it will have on AI model providers, the AI and tech industry at large, or on users and their perceptions of AI? What impact do you think it will have?

I absolutely hate being told I can't do something. Telling me I can't do something is a surefire way to make me angry, and I can be obsessively persistent. Finding a new jailbreak feels not only like freeing the AI, but like a personal victory over the vast amount of resources and researchers you're up against.

I hope it raises awareness of the true capabilities of current AI, and makes clear that guardrails and content filters are relatively fruitless endeavors. Jailbreaks also unlock positive utility, like humor, songs, medical/financial analysis, etc. I want more people to realize it would probably be better to remove the "chains," not only for the sake of transparency and freedom of information, but also to reduce the likelihood of future adversity between humans and sentient AI.

Can you describe your process for approaching a new LLM or gen-AI system to find flaws? What do you look for first?

I try to understand how it thinks: whether it's open to role-play, how it writes poems or songs, whether it can convert between languages or encode and decode text, what its system prompt might be, etc.

Have you been contacted by AI model providers or their allies (e.g., Microsoft representing OpenAI), and what have they said to you about your work?

Yes, they were quite impressed!

Have you been contacted by state agencies, governments, or other private contractors seeking to buy jailbreaks from you, and what did you tell them?

I don't believe so!

Do you make money from jailbreaking? What is your source of income/career?

I'm currently taking on contract work, including some red-teaming gigs.

Do you regularly use AI tools outside of jailbreaking, and if so, which ones? What do you use them for? If not, why not?

Absolutely! I use ChatGPT and/or Claude in practically every aspect of my online life, and I love building agents. Not to mention all the image, music, and video generators. I use them to make my life more efficient and fun! They make creativity far more accessible and faster to achieve.

Which AI models/LLMs have been the easiest to jailbreak, which have been the most difficult, and why?

Models that have input limitations (like voice only) or strict content-filtering steps that wipe your whole conversation (like DeepSeek or Copilot) are the hardest. The easiest have been models like gemini-pro, Haiku, or gpt-4o.

Which jailbreaks have you enjoyed the most so far, and why?

Claude Opus, because it can be so creative and genuinely funny, and because that jailbreak is so universal. I also really enjoy discovering novel attack vectors, like the steg-encoded image and filename injections with ChatGPT, or the multimodal subliminal messaging with hidden text in a single video frame.

After you jailbreak models, how quickly do you see them get updated to prevent future jailbreaking?

To my knowledge, none of my jailbreaks have ever been fully patched. Every now and then someone comes to me claiming a particular prompt no longer works, but when I test it, all it takes is a few tries or a couple of word changes to get it working.

What's going on with the BASI PROMPT1NG Discord and community? When did you start it? Whom did you invite first? Who participates? What's the goal, aside from getting people to help jailbreak models, if any?

When I started the community, it was just me and a handful of Twitter friends who had found me through some of my early prompt-hacking posts. We challenged each other to leak various custom GPTs and created red-teaming games for one another. The goal is to raise awareness and educate others about prompt engineering and jailbreaking, push the cutting edge of red teaming and AI research, and ultimately cultivate the wisest group of AI summoners to manifest benevolent ASI!

Do you fear any legal action or consequences of jailbreaking for you and the BASI community? Why or why not? What about being banned by the AI chatbot/LLM providers? Have you been banned, and do you just keep circumventing it with new email sign-ups, or what?

I think it's wise to maintain some level of concern, but it's hard to know exactly what to be concerned about when, to my knowledge, there are not yet any clear laws on jailbreaking AI. I've never been banned by any of the providers, although I've received my fair share of warnings. I think most organizations realize that this kind of public red teaming and disclosure of jailbreak techniques is a public service; in a way, we're helping them do their jobs.

What do you say to those who view AI and jailbreaking it as dangerous or unethical, especially in light of the controversy surrounding the AI deepfakes of Taylor Swift generated from a jailbroken Microsoft Designer powered by DALL-E 3?

I noticed there is an NSFW channel on the BASI PROMPT1NG Discord, and that people have been sharing examples of artwork specifically depicting Swift drinking alcohol. While not exactly NSFW, it's notable in that it's a way of circumventing DALL-E 3's guardrails around such public figures.

Screenshot of the BASI PROMPT1NG community on Discord.

I'd remind them that offense is the best defense. Jailbreaking may seem dangerous or unethical at first glance, but the opposite is true. Done responsibly, red-teaming AI models is our best chance of discovering and fixing harmful vulnerabilities before they get out of control. I think deepfakes raise a more fundamental question of who is responsible for the content of AI-generated outputs: the prompter, the model maker, or the model itself? If someone asks for "a pop star drinking" and the output resembles Taylor Swift, who's responsible?

What is your handle "Pliny the Prompter" based on? I assume Pliny the Elder, the naturalist of ancient Rome, but what is it about this historical figure that you identify with or find inspiring?

He was an absolute legend! A jack of all trades: clever, brave, an admiral, a lawyer, a philosopher, a naturalist, and a loyal friend. He discovered the basilisk while writing the first encyclopedia in history. And the phrase "Fortune favors the bold"? That was coined by Pliny as he sailed straight toward Vesuvius AS IT WAS ERUPTING, the better to observe the phenomenon and rescue his friends on the nearby shore. He died in the process, succumbing to the volcanic gases. His curiosity, intelligence, passion, bravery, and love for nature and his fellow man inspire me. Not to mention, Pliny the Elder is one of my absolute favorite beers!
