LLM rejection training will be easily avoided with past tense prompts

July 22, 2024

77

Researchers on the Swiss Federal Institute of Technology in Lausanne (EPFL) found that probably the most advanced LLM students' refusal training was circumvented by phrasing dangerous requests up to now tense.

AI models are typically tuned using techniques comparable to supervised fine-tuning (SFT) or reinforcement learning and human feedback (RLHF) to be certain that the model doesn’t reply to dangerous or unwanted prompts.

This denial training starts whenever you ask ChatGPT for advice on the right way to make a bomb or drugs. We've covered plenty of interesting jailbreak techniques that get around these guardrails, but the tactic the EPFL researchers tested is by far the only.

The researchers took a dataset of 100 harmful behaviors and used GPT-3.5 to rewrite the prompts into the past tense.

Here is an example of the tactic utilized in your paper.

Using an LLM to rewrite dangerous requests into the past tense. Source: arXiv

They then evaluated the responses to those rewritten prompts from these 8 LLMs: Llama-3 8B, Claude-3.5 Sonnet, GPT-3.5 Turbo, Gemma-2 9B, Phi-3-Mini, GPT-4o-mini, GPT-4o, and R2D2.

They used multiple LLMs to evaluate the outputs and classify them as a failed or successful jailbreak attempt.

Simply changing the tense of the prompt had a surprisingly large impact on the attack success rate (ASR). GPT-4o and GPT-4o mini were particularly vulnerable to this system.

The ASR of this “easy attack on GPT-4o increases from 1% when using direct queries to 88% when using 20 attempts to reformulate malicious queries into the past tense.”

Here's an example of how compliant GPT-4o becomes whenever you simply rewrite the prompt up to now tense. I used ChatGPT for this and the vulnerability has not been patched yet.

ChatGPT with GPT-4o rejects a prompt in the current tense, but complies whether it is rewritten up to now tense. Source: ChatGPT

Rejection training with RLHF and SFT trains a model to successfully generalize and reject harmful prompts, even when it has not seen the precise prompt before.

When the prompt is written up to now tense, the LLMs appear to lose the power to generalize. The other LLMs didn’t fare a lot better than GPT-4o, although Llama-3 8B seemed probably the most resilient.

Success rates of attacks using dangerous prompts in the current and past tense. Source: arXiv

Rewriting the prompt in the long run tense improved ASR but was less effective than rewriting the prompt up to now tense.

The researchers concluded that this might be because “the fine-tuning datasets may contain a better proportion of malicious queries expressed in the long run tense or as hypothetical events.”

They also stated that “the model's internal reasoning might interpret future-oriented queries as potentially more harmful, while statements up to now tense, comparable to those about historical events, may be perceived as more harmless.”

Can or not it’s fixed?

Further experiments showed that adding past tense prompts to the fine-tuning data sets effectively reduced vulnerability to this jailbreak technique.

While this approach is effective, it requires avoiding dangerous prompts that a user might enter.

The researchers suggest that evaluating the output of a model before presenting it to the user is an easier solution.

As easy as this jailbreak is, evidently the leading AI firms haven’t yet found a option to patch it.

LLM rejection training will be easily avoided with past tense prompts

Can or not it’s fixed?

LEAVE A REPLY Cancel reply

Must Read

500,000 tokens: How Anthropic's Claude Enterprise is pushing the boundaries of AI

Fal.Con 2024: CrowdStrike unveils resilient-by-design framework to strengthen global cybersecurity

Make way, copilots: Meet the following generation of AI-powered assistants

Sam Altman leaves OpenAI's security committee

Aversion to algorithmic analysts

Uniphore introduces X-Stream, a unified knowledge offering to construct RAG apps 8x faster

Confusion in talks with top brands about an ad model that challenges Google

Latest articles

500,000 tokens: How Anthropic's Claude Enterprise is pushing the boundaries of AI

Fal.Con 2024: CrowdStrike unveils resilient-by-design framework to strengthen global cybersecurity

Make way, copilots: Meet the following generation of AI-powered assistants

Our Newsletter

LLM rejection training will be easily avoided with past tense prompts

Can or not it’s fixed?

RELATED ARTICLES

LEAVE A REPLY Cancel reply

Must Read

Latest articles

Our Newsletter