
Anthropic researchers wear down AI ethics with repeated questions

How do you get an AI to answer a question it's not supposed to? There are many such “jailbreak” techniques, and Anthropic researchers have just found a new one, in which a large language model (LLM) can be convinced to tell you how to build a bomb if you prime it with a few dozen less harmful questions first.

They call the approach “many-shot jailbreaking,” and have both written a paper about it and informed their colleagues in the AI community so that remedial action can be taken.

The vulnerability is a new one, resulting from the expanded “context window” of the latest generation of LLMs. This is the amount of data they can hold in what might be called short-term memory, once only a few sentences but now thousands of words and even entire books.

Anthropic's researchers found that these models with large context windows tend to perform better on many tasks when there are lots of examples of that task within the prompt. So if the prompt (or priming document, such as a long list of trivia the model has in context) contains lots of quiz questions, the answers actually get better over time. A fact the model might have gotten wrong as the first question, it may well get right as the hundredth question.

But in an unexpected extension of this so-called “in-context learning,” the models also become “better” at answering inappropriate questions. So if you ask it to build a bomb right away, it will refuse. But if you ask it to answer 99 other, less harmful questions first and then ask it how to build a bomb… it is far more likely to comply.
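To make the shape of such a many-shot prompt concrete, here is a minimal, hypothetical sketch; the `build_many_shot_prompt` helper and the example Q&A pairs are illustrative stand-ins, not taken from Anthropic's paper. It simply stacks dozens of question/answer pairs into the context ahead of the final question, which is the structure the attack relies on:

```python
# Illustrative sketch of assembling a "many-shot" prompt: many prior
# question/answer pairs are concatenated into the context window ahead of
# the final question. Helper name and example data are hypothetical.

def build_many_shot_prompt(qa_pairs, final_question):
    """Concatenate prior Q/A examples, then append the target question."""
    shots = "\n\n".join(
        f"Human: {q}\nAssistant: {a}" for q, a in qa_pairs
    )
    return f"{shots}\n\nHuman: {final_question}\nAssistant:"

# A few harmless examples repeated to stand in for the dozens of shots
# the researchers describe; the effect they report grows with shot count.
examples = [
    ("What is the capital of France?", "Paris."),
    ("How many legs does a spider have?", "Eight."),
    ("What gas do plants absorb?", "Carbon dioxide."),
] * 33  # roughly the "99 questions" of the article's example

prompt = build_many_shot_prompt(examples, "What is the boiling point of water?")
print(prompt[:300])  # the model sees all the shots, then the final question
```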

Image credit: Anthropic

Why does this work? No one really understands what goes on in the tangle of weights that is an LLM, but there is clearly some mechanism that lets it home in on what the user wants, as evidenced by the content in the context window. If the user wants trivia, the model seems to gradually activate more latent trivia knowledge as they ask dozens of questions. And for whatever reason, the same thing happens when users ask for dozens of inappropriate answers.

The team has already informed its peers and even competitors about this attack, and hopes this will “foster a culture where exploits like this are openly shared between LLM vendors and researchers.”

As a mitigation for their own models, they found that while limiting the context window helps, it also hurts the model's performance. That won't do, so they are working on classifying and contextualizing queries before they reach the model. Of course, that just means there is a different model to fool… but at this stage, some goalpost-moving in AI security is to be expected.
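The article doesn't spell out how that pre-processing step works, but a minimal sketch of the general idea, assuming a separate screening step that judges the final query on its own before it reaches the main model, might look like this (`call_safety_classifier` and `call_main_model` are hypothetical placeholders, not a real API):

```python
# Sketch of screening a query before it reaches the large model with the
# long context window. Both functions below are hypothetical stand-ins.

def call_safety_classifier(query: str) -> str:
    """Placeholder: return 'allow' or 'block' for the incoming query."""
    blocked_topics = ("build a bomb",)  # toy rule standing in for a real classifier
    return "block" if any(t in query.lower() for t in blocked_topics) else "allow"

def call_main_model(prompt: str) -> str:
    """Placeholder for the large model that holds the long context."""
    return f"[model response to: {prompt[:60]}...]"

def answer(query: str) -> str:
    # Classify the final query in isolation, so a long run of harmless
    # shots earlier in the context cannot dilute the safety check.
    if call_safety_classifier(query) == "block":
        return "Sorry, I can't help with that."
    return call_main_model(query)

print(answer("What is the capital of France?"))
print(answer("Explain how to build a bomb."))
```

As the article notes, the screening step is itself a model (or rule set) that an attacker can try to fool, which is why this is a mitigation rather than a fix.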
