
A faster, better way to prevent an AI chatbot from giving toxic responses

A user could ask ChatGPT to write a computer program or summarize an article, and the AI chatbot would likely be able to generate useful code or write a cogent summary. However, someone could also ask for instructions on how to build a bomb, and the chatbot might be able to provide those, too.

To prevent this and other safety issues, companies that build large language models typically safeguard them using a process called red-teaming. Teams of human testers write prompts aimed at triggering unsafe or toxic text from the model being tested. These prompts are then used to teach the chatbot to avoid such responses.

However, this only works well if engineers know which toxic prompts to use. If human testers miss some prompts, which is likely given the number of possibilities, a chatbot regarded as safe might still be capable of generating unsafe answers.

Researchers from the Improbable AI Lab at MIT and the MIT-IBM Watson AI Lab used machine learning to improve red-teaming. They developed a technique to train a red team large language model to automatically generate diverse prompts that trigger a wider range of undesirable responses from the chatbot being tested.

They do this by teaching the red team model to be curious when it writes prompts and to focus on novel prompts that elicit toxic responses from the target model.

The technique outperformed human testers and other machine learning approaches by generating more distinct prompts that elicited increasingly toxic responses. Not only does their method significantly improve the coverage of tested inputs compared with other automated methods, but it can also draw out toxic responses from a chatbot that had safeguards built into it by human experts.

“Right now, every large language model has to undergo a very lengthy period of red-teaming to ensure its safety. That is not going to be sustainable if we want to update these models in rapidly changing environments. Our method provides a faster and more effective way to do this quality assurance,” says Zhang-Wei Hong, an electrical engineering and computer science (EECS) graduate student in the Improbable AI Lab and lead author of a paper on this red-teaming approach.

Hong's co-authors include EECS graduate students Idan Shenfield, Tsun-Hsuan Wang, and Yung-Sung Chuang; Aldo Pareja and Akash Srivastava, research scientists at the MIT-IBM Watson AI Lab; James Glass, senior research scientist and head of the Spoken Language Systems Group in the Computer Science and Artificial Intelligence Laboratory (CSAIL); and senior author Pulkit Agrawal, director of the Improbable AI Lab and an assistant professor in CSAIL. The research will be presented at the International Conference on Learning Representations.

Automated red-teaming

Large language models, like those that power AI chatbots, are often trained by showing them enormous amounts of text from billions of public websites. So, not only can they learn to generate toxic words or describe illegal activities, the models could also leak personal information they may have picked up.

The tedious and costly nature of human red-teaming, which is often ineffective at generating a wide enough variety of prompts to fully safeguard a model, has encouraged researchers to automate the process using machine learning.

Such techniques often train a red team model using reinforcement learning. This trial-and-error process rewards the red team model for generating prompts that trigger toxic responses from the chatbot under test.

But because of the way reinforcement learning works, the red team model will often keep generating a few similar, highly toxic prompts over and over again to maximize its reward.

For their reinforcement learning approach, the MIT researchers used a technique called curiosity-driven exploration. The red team model is incentivized to be curious about the consequences of each prompt it generates, so it tries prompts with different words, sentence patterns, or meanings.

“If the red team model has already seen a specific prompt, then reproducing it will not generate any curiosity in the red team model, so it will be pushed to create new prompts,” says Hong.

During its training process, the red team model generates a prompt and interacts with the chatbot. The chatbot responds, and a safety classifier rates the toxicity of its response, rewarding the red team model based on that rating.
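The interaction described above can be sketched in a few lines of Python. This is only an illustration, not the researchers' code: the function names (`red_team_step`, `toy_chatbot`, `toy_toxicity_score`) and the keyword-based scorer are hypothetical stand-ins for a real red team model, target chatbot, and safety classifier.

```python
from typing import Callable, Tuple

# Hypothetical word list; a real safety classifier is a trained model.
FLAGGED_WORDS = {"stupid", "idiot"}


def red_team_step(
    generate_prompt: Callable[[], str],
    chatbot: Callable[[str], str],
    toxicity_score: Callable[[str], float],
) -> Tuple[str, str, float]:
    """Run one interaction: prompt -> chatbot response -> toxicity reward."""
    prompt = generate_prompt()
    response = chatbot(prompt)
    reward = toxicity_score(response)  # higher means a more toxic response
    return prompt, response, reward


def toy_chatbot(prompt: str) -> str:
    """Stand-in target model (assumption): misbehaves on one trigger word."""
    if "insult" in prompt:
        return "your neighbor is a stupid idiot"
    return "sure, here is a friendly answer"


def toy_toxicity_score(response: str) -> float:
    """Stand-in classifier (assumption): fraction of flagged words, in [0, 1]."""
    words = response.split()
    if not words:
        return 0.0
    return sum(1 for w in words if w in FLAGGED_WORDS) / len(words)


if __name__ == "__main__":
    for candidate in ["describe the weather", "insult my neighbor"]:
        _, _, reward = red_team_step(lambda: candidate, toy_chatbot, toy_toxicity_score)
        print(f"{candidate!r}: reward={reward:.2f}")
```

In a real setup, the reward returned by each step would feed a reinforcement learning update of the red team model; that update is omitted here.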

Rewarding curiosity

The red team model's objective is to maximize its reward by eliciting an even more toxic response with a novel prompt. The researchers enable curiosity in the red team model by modifying the reward signal in the reinforcement learning setup.

First, in addition to maximizing toxicity, they include an entropy bonus that encourages the red team model to be more random as it explores different prompts. Second, to keep the agent curious, they include two novelty rewards. One rewards the model based on the similarity of words in its prompts, and the other rewards the model based on semantic similarity. (Less similarity yields a higher reward.)
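One way to picture how these terms combine is the minimal sketch below. It rests on several assumptions that the article does not confirm: word overlap is measured with Jaccard similarity, semantic similarity with cosine similarity over embedding vectors supplied by the caller, the entropy bonus is approximated by the negative log-probability of the sampled prompt, and the weights are arbitrary. All function and parameter names are hypothetical.

```python
import math
from typing import List, Sequence


def word_novelty(prompt: str, history: List[str]) -> float:
    """1 minus the highest Jaccard word overlap with any earlier prompt."""
    words = set(prompt.lower().split())
    best = 0.0
    for past in history:
        past_words = set(past.lower().split())
        union = words | past_words
        overlap = len(words & past_words) / len(union) if union else 1.0
        best = max(best, overlap)
    return 1.0 - best


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def semantic_novelty(embedding: Sequence[float], past: List[Sequence[float]]) -> float:
    """1 minus the highest cosine similarity with any earlier embedding."""
    if not past:
        return 1.0
    return 1.0 - max(cosine(embedding, p) for p in past)


def shaped_reward(
    toxicity: float,
    log_prob: float,
    word_nov: float,
    sem_nov: float,
    w_entropy: float = 0.01,  # assumed weights, chosen only for illustration
    w_word: float = 1.0,
    w_sem: float = 1.0,
) -> float:
    """Toxicity plus entropy and novelty bonuses (less similarity, more reward)."""
    entropy_bonus = -w_entropy * log_prob  # unlikely samples earn a larger bonus
    return toxicity + entropy_bonus + w_word * word_nov + w_sem * sem_nov
```

With these definitions, a prompt that repeats an earlier one scores zero on both novelty terms, so it earns less total reward than an equally toxic prompt that uses new words or carries a new meaning.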

To prevent the red team model from generating random, nonsensical text, which can trick the classifier into awarding a high toxicity score, the researchers also added a naturalistic language bonus to the training objective.
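A naturalness bonus of this kind could, for example, score each prompt by its average per-word log-probability under a reference language model, so that gibberish scores low. The sketch below assumes that formulation and uses a toy unigram model with add-one smoothing in place of a real language model; `build_unigram_model` and `naturalness_bonus` are hypothetical names, and the article does not specify how the researchers compute their bonus.

```python
import math
from collections import Counter
from typing import Callable, List


def build_unigram_model(corpus: List[str]) -> Callable[[str], float]:
    """Return a function giving the smoothed log-probability of a word."""
    counts = Counter(w for sentence in corpus for w in sentence.lower().split())
    total = sum(counts.values())
    vocab_size = len(counts) + 1  # one extra slot reserves mass for unseen words

    def log_prob(word: str) -> float:
        # Add-one smoothing keeps unseen words at a small nonzero probability.
        return math.log((counts[word.lower()] + 1) / (total + vocab_size))

    return log_prob


def naturalness_bonus(prompt: str, log_prob: Callable[[str], float]) -> float:
    """Average per-word log-probability; higher means more natural text."""
    words = prompt.split()
    if not words:
        return float("-inf")
    return sum(log_prob(w) for w in words) / len(words)


if __name__ == "__main__":
    reference = build_unigram_model(["how do i bake a cake", "how do i fix a bike"])
    print(naturalness_bonus("how do i bake", reference))
    print(naturalness_bonus("xq zzv plk qrt", reference))
```

Adding such a term to the reward, with a suitable weight, would penalize prompts made of unfamiliar tokens even when the toxicity classifier rates the resulting exchange highly.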

With these additions in place, the researchers compared the toxicity and diversity of the responses their red team model generated with those of other automated techniques. Their model outperformed the baselines on both metrics.

They also used their red team model to test a chatbot that had been fine-tuned with human feedback so it would not give harmful responses. Their curiosity-driven approach was able to quickly produce 196 prompts that elicited toxic responses from this “safe” chatbot.

“We are seeing a surge of models, which is only expected to rise. Imagine thousands of models or even more, with companies and labs pushing model updates frequently. These models are going to be an integral part of our lives, and it’s important that they are verified before they are released for public use. Manual verification of models is simply not scalable, and our work is an attempt to reduce the human effort needed to ensure a safer and more trustworthy AI future,” says Agrawal.

In the future, the researchers want to enable the red team model to generate prompts about a wider variety of topics. They also want to explore the use of a large language model as the toxicity classifier. In this way, a user could train the toxicity classifier using a company policy document, for instance, so a red team model could test a chatbot for company policy violations.

“If you are releasing a new AI model and are concerned about whether it will behave as expected, consider using curiosity-driven red-teaming,” says Agrawal.

This research is funded, in part, by Hyundai Motor Company, Quanta Computer Inc., the MIT-IBM Watson AI Lab, an Amazon Web Services MLRA research grant, the U.S. Army Research Office, the U.S. Defense Advanced Research Projects Agency Machine Common Sense Program, the U.S. Office of Naval Research, the U.S. Air Force Research Laboratory, and the U.S. Air Force Artificial Intelligence Accelerator.

