
Why it's essential to move beyond overly aggregated machine learning metrics

MIT researchers have identified significant examples of machine learning models failing when applied to data other than what they were trained on. This raises the question of whether testing is required each time a model is deployed in a new environment.

“We show that even if you train models on large amounts of data and choose the best average model, that ‘best model’ could be the worst model in a new environment for 6 to 75 percent of the new data,” says Marzyeh Ghassemi, an associate professor in MIT's Department of Electrical Engineering and Computer Science (EECS), a member of the Institute for Medical Engineering and Science, and a principal investigator in the Laboratory for Information and Decision Systems.

One paper, presented at the Conference on Neural Information Processing Systems (NeurIPS 2025) in December, suggests that models trained to effectively diagnose disease from chest X-rays in one hospital, for instance, might be considered effective in another hospital, on average. However, the researchers' performance assessment found that some of the best-performing models in the first hospital had the worst results for up to 75 percent of patients in the second hospital, even though, when all patients in the second hospital were pooled together, the high average performance masked this error.
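To make that masking effect concrete, here is a minimal, hypothetical sketch (not from the paper): the group sizes, error rates, and labels are invented for illustration, but they show how a pooled accuracy figure can hide a subgroup on which a model fails badly.

```python
import numpy as np

# Hypothetical evaluation: 1,000 patients from a "second hospital",
# 250 of whom belong to a subgroup the model handles poorly.
rng = np.random.default_rng(0)
group = np.array([0] * 750 + [1] * 250)      # 0 = majority, 1 = subgroup
y_true = rng.integers(0, 2, size=1000)       # binary ground-truth labels

# Simulate predictions: ~95% correct on the majority, ~40% correct on the subgroup.
correct_prob = np.where(group == 0, 0.95, 0.40)
is_correct = rng.random(1000) < correct_prob
y_pred = np.where(is_correct, y_true, 1 - y_true)

overall = (y_pred == y_true).mean()          # pooled metric looks fine (roughly 0.81)
by_group = [(y_pred == y_true)[group == g].mean() for g in (0, 1)]
print(f"overall accuracy:  {overall:.2f}")
print(f"majority accuracy: {by_group[0]:.2f}, subgroup accuracy: {by_group[1]:.2f}")
```

The pooled number looks acceptable even though one in four patients is served by a model that is wrong more often than it is right, which is the pattern the aggregate statistic conceals.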

Their results show that while spurious correlations – a simple example being a machine learning system that hasn't “seen” many pictures of cows on beaches and classifies a photograph of a cow on a beach as an orca based solely on its background – are assumed to be mitigated simply by improving model performance on observed data, they actually still occur and pose a risk to a model's trustworthiness in new environments. In many cases – including the areas the researchers studied, such as chest X-rays, cancer histopathology images, and hate speech detection – such spurious correlations are much harder to detect.

For example, in the case of a medical diagnostic model trained on chest X-rays, the model may have learned to correlate a specific but irrelevant marker on one hospital's X-rays with a particular pathology. In another hospital where that marker is not used, the pathology could be missed.

Previous research from Ghassemi's group has shown that models can incorrectly correlate factors such as age, gender, and race with medical findings. For example, if a model was trained on chest X-rays of older individuals with pneumonia and hasn't “seen” as many X-rays of younger people, it could predict that only older patients have pneumonia.

“We want models to learn to look at the patient's anatomical features and then make a decision based on that,” says Olawale Salaudeen, a postdoctoral fellow at MIT and lead author of the study, “but really anything in the data that correlates with a decision can be used by the model. And these correlations are not truly robust to changes in the environment, making the model's predictions unreliable sources for decision-making.”

Spurious correlations contribute to the risk of biased decision-making. In the NeurIPS paper, the researchers showed, for example, that chest X-ray models that improved overall diagnostic performance actually performed worse for patients with pleural disease or enlarged cardiomediastinum, an enlargement of the heart or central chest cavity.

Other authors of the paper include graduate students Haoran Zhang and Kumail Alhamoud, EECS assistant professor Sara Beery, and Ghassemi.

While previous work generally assumed that models ordered by performance from best to worst would maintain that order when applied to new environments – a phenomenon known as “accuracy on the line” – the researchers were able to show examples where the best-performing models in one environment were the worst-performing models in another.
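Whether accuracy on the line holds can be checked directly: if in-distribution and out-of-distribution accuracies are strongly rank-correlated across models, the ordering is preserved. The sketch below is illustrative rather than the authors' evaluation code; the arrays `acc_id` and `acc_ood` are assumed, made-up per-model accuracy values.

```python
import numpy as np
from scipy.stats import spearmanr

# Hypothetical per-model accuracies: acc_id[i] is model i's accuracy in the
# first (training) environment, acc_ood[i] its accuracy in the new environment.
acc_id = np.array([0.91, 0.89, 0.87, 0.85, 0.82])
acc_ood = np.array([0.62, 0.70, 0.74, 0.78, 0.80])  # here the ordering reverses

rho, _ = spearmanr(acc_id, acc_ood)
print(f"Spearman rank correlation: {rho:.2f}")  # -1.0: best ID models are worst OOD
```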

Salaudeen developed an algorithm called OODSelect to find examples where accuracy on the line breaks down. Essentially, he trained hundreds of models with in-distribution data, meaning the data came from the first setting, and calculated their accuracy. He then applied the models to data from the second setting. When the models with the highest accuracy on data from the first setting were wrong on a large percentage of examples in the second setting, those problematic subsets, or subpopulations, were identified. Salaudeen also emphasizes the risks of using aggregate statistics for evaluation, which can obscure more granular and consequential information about model performance.
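A rough sketch of that selection step, under the assumptions just described, might look like the following. This is not the released OODSelect implementation; it assumes you already have each model's in-distribution accuracy (`acc_id`) plus a boolean matrix `correct_ood` of shape (num_models, num_examples) recording whether each model classified each second-setting example correctly.

```python
import numpy as np

def select_ood_subset(acc_id, correct_ood, top_frac=0.1, k=100):
    """Return indices of OOD examples on which the models with the highest
    in-distribution accuracy are most often wrong.

    acc_id:      (num_models,) in-distribution accuracy per model
    correct_ood: (num_models, num_examples) boolean correctness on OOD data
    """
    num_models = len(acc_id)
    n_top = max(1, int(top_frac * num_models))
    top_models = np.argsort(acc_id)[-n_top:]           # best models in the first setting
    # Error rate of those top models on each OOD example.
    error_rate = 1.0 - correct_ood[top_models].mean(axis=0)
    # Keep the k examples they get wrong most often: a candidate
    # subpopulation where spurious correlations may be at work.
    return np.argsort(error_rate)[-k:]

# Hypothetical usage with random stand-in data.
rng = np.random.default_rng(1)
acc_id = rng.uniform(0.8, 0.95, size=200)
correct_ood = rng.random((200, 5000)) < 0.7
subset = select_ood_subset(acc_id, correct_ood)
print(subset.shape)  # (100,)
```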

In the course of their work, the researchers singled out the “most misclassified examples” so as not to conflate spurious correlations within a data set with cases that are simply difficult to classify.

The NeurIPS paper shares the researchers' code and some of the identified subsets for future work.

Once a hospital or organization using machine learning identifies subsets where a model performs poorly, that information can be used to improve the model for its specific task and environment. The researchers recommend that future work adopt OODSelect to highlight targets for evaluation and to design approaches that improve performance more consistently.

“We hope that the published code and OODSelect subsets will become a stepping stone,” the researchers write, “toward benchmarks and models that counteract the negative effects of spurious correlations.”
