
How to evaluate the reliability of a general-purpose AI model before deploying it

Foundation models are massive deep-learning models that are pre-trained on enormous amounts of general, unlabeled data. They can be applied to a wide variety of tasks, such as generating images or answering customer questions.

But these models, which form the backbone of powerful artificial intelligence tools such as ChatGPT and DALL-E, can provide incorrect or misleading information. In a safety-critical situation, such as a pedestrian approaching a self-driving car, such errors could have serious consequences.

To help avoid such errors, researchers from MIT and the MIT-IBM Watson AI Lab developed a method to estimate the reliability of foundation models before they are deployed for a specific task.

To do this, they train a set of foundation models that are slightly different from one another. They then use their algorithm to evaluate the consistency of the representations each model learns for the same test data point. If the representations are consistent, the model is deemed reliable.

When they compared their technique to state-of-the-art baseline methods, it better captured the reliability of foundation models on a variety of classification tasks.

This technique could be used to decide whether a model should be applied in a particular setting without the need to test it on a real-world dataset. That could be especially useful when datasets are not accessible due to privacy concerns, such as in health care. In addition, the technique could be used to rank models by reliability score, allowing a user to select the best model for their task.

“All models can be wrong, but models that know when they are wrong are more useful. The problem of quantifying uncertainty or reliability becomes more challenging with these foundation models because their abstract representations are difficult to compare. Our method lets you quantify how reliable a representation model is for any given input data,” says senior author Navid Azizan, the Esther and Harold E. Edgerton Assistant Professor in MIT's Department of Mechanical Engineering and the Institute for Data, Systems, and Society (IDSS), and a member of the Laboratory for Information and Decision Systems (LIDS).

Azizan is joined on a paper about the work by lead author Young-Jin Park, a LIDS graduate student; Hao Wang, a scientist at the MIT-IBM Watson AI Lab; and Shervin Ardeshir, a senior scientist at Netflix. The paper will be presented at the Conference on Uncertainty in Artificial Intelligence.

Counting the consensus

Traditional machine-learning models are trained to perform a specific task. These models typically make a concrete prediction based on an input. For instance, the model might tell you whether a particular image contains a cat or a dog. In that case, assessing reliability can be as simple as looking at the final prediction to see whether the model is correct.

But foundation models are different. The model is pre-trained on general data, in a setting where its developers do not know all of the downstream tasks it will be applied to. Users adapt it to their specific tasks only after it has been trained.

Unlike traditional machine-learning models, foundation models don’t produce concrete outputs such as the labels “cat” or “dog.” Instead, they generate an abstract representation based on an input data point.

To assess the reliability of a foundation model, the researchers used an ensemble approach, training several models that share many properties but are slightly different from one another.

“Our idea is like counting the consensus. If all of these foundation models provide consistent representations for any data in our dataset, then we can say this model is reliable,” says Park.
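As a rough illustration of what such an ensemble might look like, the sketch below builds a handful of small encoders that share the same architecture, data, and training objective and differ only in their random initialization seeds. The tiny autoencoder setup, dimensions, and helper names are hypothetical stand-ins, not the models or training recipe used in the paper.

```python
# Minimal sketch (hypothetical setup): an ensemble of encoders that differ only
# in their random seed, trained on the same unlabeled data with the same objective.
import torch
import torch.nn as nn

def make_encoder(seed: int, dim_in: int = 32, dim_rep: int = 8) -> nn.Module:
    """Build one small encoder; the seed is the only thing that varies across the ensemble."""
    torch.manual_seed(seed)
    return nn.Sequential(nn.Linear(dim_in, 64), nn.ReLU(), nn.Linear(64, dim_rep))

torch.manual_seed(0)
data = torch.randn(1000, 32)             # stand-in for general, unlabeled pre-training data

ensemble = []
for seed in range(5):
    enc = make_encoder(seed)
    dec = nn.Linear(8, 32)               # decoder for a simple reconstruction objective
    opt = torch.optim.Adam(list(enc.parameters()) + list(dec.parameters()), lr=1e-3)
    for _ in range(200):                 # short, purely illustrative training loop
        opt.zero_grad()
        loss = nn.functional.mse_loss(dec(enc(data)), data)
        loss.backward()
        opt.step()
    ensemble.append(enc)

# Each encoder now maps the same input to its own abstract representation vector.
with torch.no_grad():
    reps = [enc(data[:1]) for enc in ensemble]
```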

But they ran into a problem: How could they compare abstract representations?

“These models just output a vector of some numbers, so we can’t easily compare them,” he adds.

They solved this problem with an idea called neighborhood consistency.

For their approach, the researchers prepare a set of reliable reference points to test across the entire ensemble of models. Then, for each model, they examine the reference points that lie near that model’s representation of the test point.

By looking at the consistency of these neighboring points, they can assess the reliability of the models.
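A minimal sketch of that neighborhood-consistency idea follows, using toy random-projection "models" and helper names of my own choosing: embed the shared reference points and the test point with each model, find the reference points nearest to the test point in each model's space, and score how much those neighbor sets agree across models. The k-nearest-neighbor search and Jaccard-overlap score are illustrative choices, not necessarily the exact metric from the paper.

```python
# Illustrative sketch of neighborhood consistency across an ensemble of models.
import numpy as np

def neighbor_set(encode, test_x, reference_x, k=10):
    """Indices of the k reference points closest to the test point under one model."""
    test_rep = encode(test_x)                                 # shape: (dim,)
    ref_reps = np.stack([encode(r) for r in reference_x])     # shape: (n_ref, dim)
    dists = np.linalg.norm(ref_reps - test_rep, axis=1)
    return set(np.argsort(dists)[:k])

def neighborhood_consistency(encoders, test_x, reference_x, k=10):
    """Average pairwise overlap of neighbor sets across all models in the ensemble."""
    sets = [neighbor_set(enc, test_x, reference_x, k) for enc in encoders]
    overlaps = []
    for i in range(len(sets)):
        for j in range(i + 1, len(sets)):
            overlaps.append(len(sets[i] & sets[j]) / len(sets[i] | sets[j]))
    return float(np.mean(overlaps))   # near 1.0: consistent neighborhoods, higher reliability

# Hypothetical usage with toy encoders (random projections stand in for real models).
rng = np.random.default_rng(0)
encoders = [lambda x, W=rng.normal(size=(32, 8)): x @ W for _ in range(5)]
reference_x = rng.normal(size=(100, 32))
test_x = rng.normal(size=32)
print(neighborhood_consistency(encoders, test_x, reference_x))
```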

Aligning the representations

Foundation models map data points into what is known as a representation space. One way to think about this space is as a sphere. Each model maps similar data points to the same part of its sphere, so images of cats end up in one place and images of dogs in another.

However, each model would map the animals differently within its own sphere. While one model might group cats near the south pole of its sphere, another might place them somewhere in the northern hemisphere.

The researchers use the neighboring points like anchors to align those spheres and make the representations comparable. If a data point’s neighbors are consistent across multiple representations, one can be confident about the reliability of the model’s output for that point.
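One standard way to perform that kind of anchor-based alignment, shown below purely for illustration and not necessarily the procedure used in the paper, is an orthogonal Procrustes fit: find the rotation that best maps one model's embeddings of the shared anchor points onto another model's embeddings of the same anchors, then apply that rotation to any other representations you want to compare.

```python
# Illustrative anchor-based alignment of two representation spaces via Procrustes.
import numpy as np
from scipy.linalg import orthogonal_procrustes

def align_to_reference(anchors_a, anchors_b, points_a):
    """Rotate model A's representations into model B's space using shared anchors.

    anchors_a, anchors_b: embeddings of the same anchor points under models A and B.
    points_a: other representations from model A to carry over with the same rotation.
    """
    R, _ = orthogonal_procrustes(anchors_a, anchors_b)   # best orthogonal map from A's anchors to B's
    return points_a @ R

# Hypothetical toy example: model B's space is an exactly rotated copy of model A's.
rng = np.random.default_rng(1)
anchors_a = rng.normal(size=(50, 8))
true_rotation, _ = np.linalg.qr(rng.normal(size=(8, 8)))
anchors_b = anchors_a @ true_rotation

test_a = rng.normal(size=(1, 8))
aligned = align_to_reference(anchors_a, anchors_b, test_a)
# After alignment, representations from the two models can be compared directly.
print(np.allclose(aligned, test_a @ true_rotation, atol=1e-6))
```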

When they tested this approach on a variety of classification tasks, they found it was much more consistent than baseline methods and was not tripped up by challenging test points that caused other methods to fail.

Moreover, their approach can be used to assess reliability for any input data. For instance, one could evaluate how well a model works for a particular type of person, such as a patient with certain characteristics.

“Even if all models have average performance overall, from an individual point of view, you’d prefer the one that works best for you,” says Wang.

One limitation, however, is that they must train an ensemble of large foundation models, which is computationally expensive. In the future, they plan to find more efficient ways to build multiple models, perhaps by using small perturbations of a single model.

This work is funded, in part, by the MIT-IBM Watson AI Lab, MathWorks, and Amazon.
