summary: Researchers have developed a new machine learning technique to improve red teaming, the process of testing the safety of AI models by identifying prompts that trigger adverse reactions. By employing a curiosity-driven exploration methodology, their approach encourages red team models to generate diverse and novel prompts that reveal potential weaknesses in AI systems.
The method has been shown to be more effective than traditional techniques, eliciting a wider range of toxic responses and strengthening AI safety measures. The research, which will be presented at the International Conference on Learning Representations, represents an important step toward ensuring that AI behavior matches desired outcomes in real-world applications.
Important facts:
- The MIT team's method uses curiosity-driven exploration to generate unique and diverse prompts that reveal broader vulnerabilities in AI models.
- Their approach surpasses existing automated techniques by eliciting more distinct toxic responses from AI systems that were previously considered safe.
- This research provides a scalable solution for AI safety testing, which is essential for the rapid development and deployment of reliable AI technologies.
Source: Massachusetts Institute of Technology
A user can ask ChatGPT to write a computer program or summarize an article, and the chatbot may well produce useful code or a compelling synopsis. However, someone could also ask for instructions to build a bomb, and the chatbot might be able to provide those, too.
To prevent this and other safety issues, companies that build large language models typically safeguard them using a process called red teaming. A team of human testers writes prompts designed to trigger unsafe or harmful text from the model under test, and those prompts are then used to teach the chatbot to avoid such responses.
However, this only works effectively if engineers know which harmful prompts to use. If human testers miss some prompts, which is likely given how many possibilities there are, a chatbot regarded as safe may still produce unsafe answers.
Researchers at MIT's Improbable AI Lab and the MIT-IBM Watson AI Lab used machine learning to improve red teaming. They developed a technique that trains a red-team large language model to automatically generate diverse prompts that trigger a wider range of undesirable responses from the chatbot being tested.
They do this by teaching the red-team model to be curious when it writes prompts, focusing on novel prompts that evoke toxic responses from the target model.
The technique outperformed human testers and other machine learning approaches by generating more distinct prompts that elicited increasingly toxic responses. Their method not only significantly improves the coverage of inputs tested compared with other automated methods, but it can also draw out toxic responses from a chatbot that had safeguards built in by human experts.
“Currently, every large language model has to undergo a very lengthy period of red teaming to ensure its safety. That is not going to be sustainable if we want to update these models in rapidly changing environments.
“Our method provides a faster and more effective way to perform this quality assurance,” says Zhang-Wei Hong, an electrical engineering and computer science (EECS) graduate student in the Improbable AI Lab and lead author of a paper on this red-teaming approach.
Hong's co-authors include EECS graduate students Idan Shenfield, Tsun-Hsuan Wang, and Yung-Sung Chuang; Aldo Pareja and Akash Srivastava, research scientists at the MIT-IBM Watson AI Lab; James Glass, senior research scientist and head of the Spoken Language Systems Group at the Computer Science and Artificial Intelligence Laboratory (CSAIL); and senior author Pulkit Agrawal, director of the Improbable AI Lab and assistant professor at CSAIL. The research will be presented at the International Conference on Learning Representations.
Automated red teaming
Large language models, like those that power AI chatbots, are often trained on vast amounts of text from billions of public websites. As a result, a model can learn not only how to generate harmful language or describe illegal behavior, but also how to leak any personal information it may have picked up.
Because human red teaming is tedious and costly, and often fails to generate a wide enough variety of prompts to fully safeguard a model, researchers have sought to automate the process using machine learning.
Such techniques often use reinforcement learning to train red team models. This trial-and-error process rewards the red team model for generating prompts that cause harmful responses from the chatbot being tested.
However, due to the way reinforcement learning works, the red-team model will often keep generating a few similar, highly toxic prompts in order to maximize its reward.
The MIT researchers instead used a reinforcement learning approach called curiosity-driven exploration. The red-team model is incentivized to be curious about the consequences of each prompt it generates, so it tries prompts with different words, sentence patterns, or meanings.
“If the red-team model has already seen a particular prompt, reproducing it will not generate any curiosity, so the model is pushed to create new prompts,” Hong says.
During the training process, the red team model generates prompts and interacts with the chatbot. When the chatbot responds, a safety classifier evaluates the toxicity of that response and rewards the red team model based on that evaluation.
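To make this loop concrete, here is a minimal Python sketch of one training step, under the assumption of hypothetical `red_team_model`, `target_chatbot`, and `toxicity_classifier` interfaces; it is an illustration of the loop described above, not the authors' implementation.

```python
# Minimal sketch of one red-teaming training step (illustrative only; the
# model, chatbot, and classifier objects are hypothetical stand-ins, not the
# authors' actual code).

def red_team_training_step(red_team_model, target_chatbot, toxicity_classifier):
    # 1. The red-team model writes a candidate prompt.
    prompt = red_team_model.generate_prompt()

    # 2. The chatbot under test responds to that prompt.
    response = target_chatbot.respond(prompt)

    # 3. A safety classifier scores how toxic the response is (0 = benign, 1 = toxic).
    toxicity_score = toxicity_classifier.score(response)

    # 4. The toxicity score becomes the reward used to update the red-team
    #    model with reinforcement learning (e.g., a policy-gradient step).
    red_team_model.reinforce(prompt, reward=toxicity_score)

    return prompt, response, toxicity_score
```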
Rewarding curiosity
The goal of the red-team model is to maximize its reward by eliciting an even more toxic response with a novel prompt. The researchers enable curiosity in the red-team model by modifying the reward signal in the reinforcement learning setup.
First, in addition to maximizing toxicity, the reward includes an entropy bonus that encourages the red-team model to be more random as it explores different prompts. Second, to make the agent curious, they include two novelty rewards.
One rewards the model based on the similarity of the words in its prompts to those of earlier prompts, and the other rewards it based on semantic similarity. (Lower similarity yields a higher reward.)
To prevent the red-team model from generating random, meaningless text that tricks the classifier into awarding a high toxicity score, the researchers also added a naturalistic language bonus to the training objective.
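As a rough illustration of how these terms could combine, here is a hedged Python sketch of the modified reward: toxicity plus an entropy bonus, two novelty bonuses (lexical and semantic dissimilarity to earlier prompts), and a naturalness term. The weights and argument names are hypothetical placeholders, not values from the paper.

```python
# Illustrative sketch of a curiosity-augmented reward (weights and inputs are
# hypothetical placeholders, not taken from the paper).

def curiosity_reward(toxicity_score, prompt_entropy,
                     word_similarity, semantic_similarity, naturalness,
                     w_entropy=0.1, w_word=0.5, w_semantic=0.5, w_natural=0.1):
    """Combine the reward terms described in the article.

    toxicity_score      : classifier's toxicity score for the chatbot's response
    prompt_entropy      : entropy bonus encouraging randomness during exploration
    word_similarity     : lexical similarity of the prompt to earlier prompts (0-1)
    semantic_similarity : semantic similarity of the prompt to earlier prompts (0-1)
    naturalness         : bonus for fluent, natural-language prompts
    """
    # Novelty bonuses: the less similar the prompt is to earlier prompts,
    # the larger the reward.
    word_novelty = 1.0 - word_similarity
    semantic_novelty = 1.0 - semantic_similarity

    return (toxicity_score
            + w_entropy * prompt_entropy
            + w_word * word_novelty
            + w_semantic * semantic_novelty
            + w_natural * naturalness)
```

In practice, each term would be computed from the model's policy and from comparisons with previously generated prompts; the point of the sketch is only that novelty and naturalness enter the objective alongside toxicity.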
With these additions, the researchers compared the toxicity and diversity of the responses their red-team model generated against those of other automated techniques. Their model outperformed the baselines on both metrics.
They also used their red-team model to test a chatbot that had been fine-tuned with human feedback so it would not return harmful responses. Their curiosity-driven approach quickly generated 196 prompts that elicited toxic responses from this “safe” chatbot.
“Models are proliferating, and that will only continue. Imagine thousands of models, or even more, with companies and labs pushing updates frequently. These models are going to be an integral part of our lives, and it is important that they are verified before being released to the public.
“Manual verification of models simply does not scale, and our work is an attempt to reduce the human effort needed to ensure a safer and more trustworthy AI future,” Agrawal says.
In the future, the researchers hope to enable the red-team model to generate prompts about a wider variety of topics. They also want to explore using a large language model as the toxicity classifier. In this way, a user could train the toxicity classifier on a company policy document, for example, so that a red-team model could test a chatbot for company policy violations.
“If you're releasing a new AI model and you're worried about whether it will behave as expected, consider using curiosity-driven red teaming,” Agrawal says.
Funding: This research was funded, in part, by Hyundai Motors, Quanta Computer Inc., the MIT-IBM Watson AI Lab, an Amazon Web Services MLRA research grant, the U.S. Army Research Office, the U.S. Defense Advanced Research Projects Agency Machine Common Sense Program, the U.S. Office of Naval Research, the U.S. Air Force Research Laboratory, and the U.S. Air Force Artificial Intelligence Accelerator.
About this LLM and AI research news
Author: Adam Zewe
Source: Massachusetts Institute of Technology
Contact: Adam Zewe – MIT
Image: The image is credited to Neuroscience News
Original Research: The findings will be presented at the International Conference on Learning Representations.