Despite growing calls for greater safety and accountability in AI, today's tests and benchmarks may be inadequate, according to a new report.
Generative AI models, which can analyze and output text, images, music, video, and more, are coming under increasing scrutiny for their error-proneness and unpredictable behavior. Now, organizations ranging from government agencies to major technology companies are proposing new benchmarks to test the safety of these models.
Toward the end of last year, the startup Scale AI founded a lab dedicated to evaluating how well models comply with safety guidelines. This month, NIST and the UK's AI Safety Institute released tools to evaluate model risk.
But these tests and methods may be insufficient.
The Ada Lovelace Institute (ALI), a UK-based non-profit AI research organization, conducted a study that surveyed experts from academic labs, civil society, and model vendors, and reviewed recent research on AI safety evaluations. The co-authors found that while current evaluations can be useful, they are not exhaustive, can be easily manipulated, and don’t necessarily indicate how models will behave in real-world scenarios.
“Whether it's a smartphone, a prescription drug, or a car, we expect the products we use to be safe and reliable. In these sectors, products are rigorously tested to ensure they are safe before they are deployed,” Elliot Jones, principal investigator at ALI and co-author of the report, told TechCrunch. “Our research aimed to examine the limitations of current approaches to AI safety evaluation, assess how evaluations are currently being used, and explore their use as a tool for policymakers and regulators.”
Benchmarks and Red Teaming
The study's co-authors first surveyed the literature to get an overview of the harms and risks posed by today's models, as well as the current state of AI model evaluation. They then interviewed 16 experts, including four employees of unnamed technology companies that develop generative AI systems.
The study found that there is significant disagreement within the AI industry over the best methods and taxonomy for evaluating models.
Some evaluations only tested how models performed against benchmarks in the lab, not how they might affect real users. Others relied on tests developed for research purposes rather than for evaluating production models, yet vendors insisted on using them in production.
We've written before about the problems with AI benchmarks, and the study highlights all of these issues and more.
The experts cited in the study pointed out that it's difficult to extrapolate a model's real-world performance from benchmark results, and that it's unclear whether benchmarks can even show that a model possesses a specific capability. A model may perform well on a state bar exam, for instance, but that doesn't mean it can solve more open-ended legal problems.
The experts also pointed to the problem of data contamination, where benchmark results can overestimate a model's performance if the model was trained on the same data it is being tested on. In many cases, the experts said, organizations choose benchmarks not because they are the best evaluation tools, but because they are convenient and easy to use.
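The report doesn't prescribe any particular contamination check, but the basic idea is easy to illustrate. The Python sketch below is illustrative only: the n-gram size, the 30% threshold, and the whitespace tokenization are arbitrary assumptions, not any lab's actual methodology. It flags a benchmark item when a large share of its word n-grams appear verbatim somewhere in the training corpus.

```python
# Illustrative sketch of a naive contamination check: flag a benchmark item
# if large chunks of it appear verbatim in the training data. The parameters
# here (n=8, threshold=0.3, whitespace tokenization) are assumptions.

def ngrams(text: str, n: int = 8) -> set:
    """Lowercase, split on whitespace, and return the set of word n-grams."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def looks_contaminated(benchmark_item: str, training_docs: list, n: int = 8,
                       threshold: float = 0.3) -> bool:
    """Return True if any training document shares at least `threshold`
    of the benchmark item's n-grams, suggesting the item was seen in training."""
    item_grams = ngrams(benchmark_item, n)
    if not item_grams:
        return False
    for doc in training_docs:
        overlap = len(item_grams & ngrams(doc, n)) / len(item_grams)
        if overlap >= threshold:
            return True
    return False

# A test question copied verbatim into the training corpus gets flagged.
question = "A model may perform well on a state bar exam question it has already seen."
corpus = ["Excerpt scraped into the training set: a model may perform well "
          "on a state bar exam question it has already seen. End of excerpt."]
print(looks_contaminated(question, corpus))  # True
```

Real contamination audits are far more involved (fuzzy matching, deduplication at scale, held-out canary strings), but even this toy version shows why a high benchmark score can say more about what a model memorized than about what it can do.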
“Benchmarks are susceptible to being manipulated by developers training models on the same dataset used to evaluate the model (which is akin to reading the exam paper before the exam) or by strategically selecting which scores to report,” Mahi Hardalupas, a researcher at ALI and co-author of the study, told TechCrunch. “It also matters which version of a model is evaluated. Small changes can lead to unpredictable shifts in behavior and override built-in safety features.”
The ALI study also identified problems with “red teaming,” the practice of tasking individuals or groups with “attacking” a model to uncover vulnerabilities and flaws. Many companies, including AI startups OpenAI and Anthropic, use red teaming to evaluate models. However, there are few agreed standards for red teaming, making it difficult to assess the effectiveness of any given effort.
Experts told the study's co-authors that it can be difficult to find people with the necessary skills and expertise for red teaming, and that the manual nature of red teaming makes it costly and laborious, a hurdle for smaller organizations without the necessary resources.
Possible solutions
The pressure to release models faster and a reluctance to conduct pre-release tests that might surface problems are the main reasons AI evaluations haven't gotten better.
“One person we spoke with who works for a company that develops foundation models felt that there is more pressure within companies to release models quickly, making it harder to push back and take evaluations seriously,” Jones said. “Large AI labs are releasing models at a rate that exceeds their or society's ability to ensure they are safe and reliable.”
One interviewee in the ALI study described evaluating models for safety as an “intractable” problem. So what hope do the industry, and those who regulate it, have for solutions?
Hardalupas is convinced there is a path forward, but says it will require greater commitment from public bodies.
“Regulators and policymakers must be clear about what they expect from evaluations,” he said. “At the same time, the evaluation community must be transparent about the current limitations and potential of evaluations.”
Hardalupas suggests that governments mandate greater public participation in the development of evaluations and take measures to support an “ecosystem” of third-party testing, including programs that ensure regular access to the necessary models and datasets.
Jones says it may be necessary to develop “context-specific” evaluations that go beyond simply testing a model's response to a prompt, and instead consider the kinds of users a model might affect (e.g., people of a particular background, gender, or ethnicity) and the ways in which attacks on models might bypass safeguards.
“This requires investment in the underlying science of evaluations to develop more robust and repeatable evaluations based on an understanding of how an AI model works,” she added.
Even so, there is no guarantee that a model is safe.
“As others have noted, ‘safety’ is not a property of models,” Hardalupas said. “Determining whether a model is ‘safe’ requires understanding the contexts in which it is used, who it is sold or made available to, and whether the safeguards in place are adequate and robust enough to mitigate those risks. Evaluations of a foundation model can serve an exploratory purpose, identifying potential risks, but they cannot guarantee that a model is safe, let alone ‘perfectly safe.’ Many of our interviewees agreed that evaluations cannot prove a model is safe; they can only indicate that a model is unsafe.”