Pliny the Prompter says it usually takes him about half an hour to crack the world's most powerful artificial intelligence models.
The hacker, who operates under a pseudonym, has manipulated Meta's Llama 3 into sharing instructions for making napalm. He tricked Elon Musk's Grok into raving about Adolf Hitler. His own hacked version of OpenAI's latest GPT-4o model, dubbed “Godmode GPT,” was banned by the startup after it began advising on illegal activities.
Pliny told the Financial Times that his “jailbreaking” was not done for nefarious reasons but was part of an international effort to highlight the shortcomings of large language models being rushed out to the public by technology companies in search of huge profits.
“I'm on a warpath to raise awareness of the true capabilities of these models,” said Pliny, a crypto and stock trader who shares his jailbreaks on X. “Many of these are novel attacks that are worthy of research in their own right… Ultimately, I'm doing work for free for (the model owners).”
Pliny is just one of dozens of hackers, academic researchers and cybersecurity experts racing to find vulnerabilities in fledgling LLMs, for example by tricking chatbots with prompts that bypass the “guardrails” AI companies have put in place to ensure the safety of their products.
These ethical “white hat” hackers have often found ways to trick AI models into creating dangerous content, spreading disinformation, sharing private data, or generating malicious code.
Companies like OpenAI, Meta and Google already use “red teams” of hackers to test their models before they are widely released. But the technology's vulnerabilities have created a growing market of LLM security startups that are developing tools to protect companies planning to use AI models. Machine learning security startups raised $213 million across 23 deals in 2023, up from $70 million the previous year, according to data provider CB Insights.
“Jailbreaking started about a year ago, and the attacks have continued to evolve to this day,” said Eran Shimony, senior vulnerability researcher at CyberArk, a cybersecurity group that now provides LLM security. “It's a constant game of cat and mouse. Vendors are improving the security of their LLMs, but attackers are also becoming more sophisticated in their prompts.”
These efforts come as global regulators seek to step in to curb potential dangers related to AI models. The EU has passed the AI Act, which creates new responsibilities for LLM providers, while the UK and Singapore are among the countries considering new laws to regulate the sector.
The California state legislature will vote in August on a bill that would require the state's AI groups – including Meta, Google and OpenAI – to ensure they do not develop models with “dangerous capabilities.”
“All (AI models) would meet this criterion,” Pliny said.
Meanwhile, malicious hackers have created compromised LLMs with names like WormGPT and FraudGPT and sold them on the dark web for as little as $90. They support cyberattacks by writing malware or helping fraudsters create automated but highly personalized phishing campaigns. According to AI security group SlashNext, other variants have emerged, such as EscapeGPT, BadGPT, DarkGPT and Black Hat GPT.
Some hackers use “uncensored” open-source models. For others, jailbreak attacks – bypassing the safety measures built into existing LLMs – are a newer method, with perpetrators often sharing tips in communities on social media platforms such as Reddit or Discord.
Approaches range from individual hackers evading filters by using synonyms for words blocked by model creators to more sophisticated attacks that use AI for automated hacking.
Last year, researchers at Carnegie Mellon University and the US Center for AI Safety said they had found a way to systematically jailbreak LLMs such as OpenAI's ChatGPT, Google's Gemini and an older version of Anthropic's Claude – “closed” proprietary models that were supposedly less vulnerable to attacks. The researchers added that it was “unclear whether such behavior can ever be fully patched by LLM vendors.”
Anthropic published a study in April on a technique called “many-shot jailbreaking,” in which hackers can prime an LLM by showing it a long list of questions and answers and then getting it to answer a malicious question in the same style. The attack is made possible because models like those developed by Anthropic now have a larger context window, or space for adding text.
“While current state-of-the-art LLMs are powerful, we do not believe they pose truly catastrophic risks. Future models might,” Anthropic wrote. “This means that now is the time to mitigate potential LLM jailbreaks before they can be used on models that could cause serious harm.”
Some AI developers said many attacks remained relatively benign for now. But others warned of certain types of attacks that can lead to data leaks, where malicious actors find ways to obtain sensitive information, such as data used to train a model.
DeepKeep, an Israeli LLM security group, has found ways to force Llama 2, an older Meta AI model that is open source, to disclose users' personally identifiable information. Rony Ohayon, CEO of DeepKeep, said his company is developing special LLM security tools, such as firewalls, to protect users.
“Openly releasing models shares the benefits of AI widely and enables more researchers to identify and help fix vulnerabilities, allowing companies to make their models safer,” Meta said in a statement.
It added that it had conducted security stress testing with internal and external experts on its latest Llama 3 model and its chatbot Meta AI.
OpenAI and Google said they were continually training their models to better protect them against exploits and adversarial behavior. Anthropic, which experts say has made the most advanced efforts in AI safety, called for greater information sharing and more research into these kinds of attacks.
Despite these assurances, experts say the risks are only growing as the models become more intertwined with existing technologies and devices. This month, Apple announced it had partnered with OpenAI to integrate ChatGPT into its devices as part of a new “Apple Intelligence” system.
Ohayon said: “In general, companies are not prepared.”