Why it’s important to look beyond over-hyped machine learning metrics | MIT News

MIT researchers have identified striking examples of machine learning models failing when they are applied to data different from what they were trained on, underscoring the need to test a model whenever it is deployed in a new environment.
“We show that even if you train models with a large amount of data, and choose the model that performs best on average, in a new setting this ‘best’ model can be the worst model for 6 to 75 percent of the new data,” said Marzyeh Ghassemi, an associate professor in the MIT Department of Electrical Engineering and Computer Science (EECS) and a member of the Institute for Medical Engineering and Science and the Laboratory for Information and Decision Systems.
In a paper presented at the Conference on Neural Information Processing Systems (NeurIPS 2025) in December, the researchers show that models trained to effectively diagnose chest X-rays in one hospital, for example, may appear to be just as effective in a different hospital, on average. Their evaluation revealed, however, that some of the models that worked best in the first hospital were the ones that performed worst for up to 75 percent of the patients in the second hospital; because performance was averaged over all patients in the second hospital, the high overall numbers masked this failure.
Their findings show that although spurious correlations – a classic example being an image classifier that, having rarely “seen” cows at the beach, labels a photo of a cow on a beach as an orca because of the background – are often assumed to fade away simply by improving a model’s performance on the data it is evaluated on, in reality they persist and remain a risk to the model’s reliability in new settings. In many cases – including domains the researchers studied, such as chest X-rays, histopathology images of cancer, and hate speech detection – such spurious correlations are very difficult to detect.
In the case of a medical diagnostic model trained on chest X-rays, for example, the model might learn to associate a clinically irrelevant marker that appears in one hospital’s X-rays with a particular disease. In another hospital that does not use that marker, the model may miss the pathology.
Previous research by Ghassemi’s group has shown that models can link factors such as age, gender, and race to medical findings. If, for example, a model is trained on chest X-rays of many elderly patients with pneumonia and has “seen” few X-rays of younger patients, it may learn to predict pneumonia only in older patients.
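To see how such a shortcut can arise, here is a minimal synthetic sketch in Python (invented data, not the researchers’ experiments): a classifier trained where age happens to correlate with pneumonia leans on age, then degrades when that correlation disappears at a new hospital.

```python
# Minimal synthetic sketch (not the researchers' experiment): a model trained on data
# where age happens to correlate with pneumonia can lean on age as a shortcut,
# then fail when that correlation disappears in a new hospital.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

def make_data(n, age_label_corr):
    y = rng.integers(0, 2, n)                      # 1 = pneumonia
    signal = y + rng.normal(0, 2.0, n)             # weak "true" imaging feature
    # "age" agrees with the label with probability age_label_corr
    age = np.where(rng.random(n) < age_label_corr, y, 1 - y)
    X = np.column_stack([signal, age])
    return X, y

# Training hospital: age is strongly (and spuriously) tied to the label.
X_train, y_train = make_data(5000, age_label_corr=0.95)
# New hospital: the shortcut is gone; age is unrelated to the label.
X_new, y_new = make_data(5000, age_label_corr=0.50)

model = LogisticRegression().fit(X_train, y_train)
print("accuracy, training hospital:", model.score(X_train, y_train))
print("accuracy, new hospital:     ", model.score(X_new, y_new))
```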
“We want the models to learn to look at the patient’s anatomical features and make a decision based on those,” said Olawale Salaudeen, an MIT postdoc and lead author of the paper, “but essentially anything in the data that correlates with the decision can be used by the model.”
Spurious correlations compound the dangers of biased decision-making. In the NeurIPS paper, the researchers showed, for example, that chest X-ray models with better overall diagnostic performance actually performed worse for patients with pleural conditions or an enlarged cardiomediastinum, an enlargement of the heart or the central chest cavity.
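The following toy sketch (with made-up numbers, not results from the paper) illustrates how an aggregate accuracy can hide a subgroup failure like this:

```python
# Hypothetical sketch of why aggregate metrics mislead: two models with similar
# overall accuracy can behave very differently on a clinically defined subgroup
# (here, a made-up flag for an enlarged cardiomediastinum).
import numpy as np

rng = np.random.default_rng(1)
n = 10_000
subgroup = rng.random(n) < 0.1                    # 10% of patients in the subgroup

# Simulated per-patient correctness for two models.
correct_a = np.where(subgroup, rng.random(n) < 0.55, rng.random(n) < 0.95)
correct_b = np.where(subgroup, rng.random(n) < 0.90, rng.random(n) < 0.91)

for name, correct in [("model A", correct_a), ("model B", correct_b)]:
    print(name,
          "overall:", correct.mean().round(3),
          "subgroup:", correct[subgroup].mean().round(3))
# Model A looks slightly better overall, yet is far worse for the subgroup.
```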
Other authors of the paper include PhD students Haoran Zhang and Kumail Alhamoud, EECS Assistant Professor Sara Beery, and Ghassemi.
Although previous work has generally assumed that models ranked from best to worst performance will keep that order when applied to new settings – a phenomenon known as “accuracy on the line” – the researchers were able to show examples where models that performed best in one setting performed poorly in another.
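One hypothetical way to check this assumption for a pool of models is to correlate their in-distribution and out-of-distribution accuracies; the numbers below are invented for illustration.

```python
# "Accuracy on the line" is the common assumption that models ranked by in-distribution
# (ID) accuracy keep roughly the same ranking out of distribution (OOD). A quick check
# is to compute the rank correlation between the two accuracy lists.
from scipy.stats import spearmanr

# Hypothetical accuracies for five models, measured on the training-site test set (ID)
# and on a new site's data (OOD); the numbers are made up for illustration.
id_acc  = [0.91, 0.89, 0.87, 0.85, 0.83]
ood_acc = [0.62, 0.71, 0.70, 0.74, 0.73]   # the ID-best model is OOD-worst here

rho, _ = spearmanr(id_acc, ood_acc)
print("Spearman rank correlation:", round(rho, 2))  # strongly negative: the ordering flips
```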
Salaudeen designed an algorithm called OODSelect to find examples where accuracy on the line is violated. Essentially, he trained thousands of models on in-distribution data, meaning data from the original setting, and calculated their accuracy. He then applied the models to data from a second setting. When the models with the highest accuracy on the first setting’s data were wrong on a large percentage of samples in the second setting, those samples indicated a problematic subset. Salaudeen also emphasizes the dangers of aggregate test statistics, which can hide a great deal of granular and important information about a model’s performance.
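Below is a simplified sketch of the kind of search this describes, not the published OODSelect algorithm: given each model’s in-distribution accuracy and per-sample correctness on out-of-distribution data, it looks for samples on which the best in-distribution models tend to be wrong.

```python
# Simplified sketch of the idea (not the published OODSelect algorithm): find OOD
# samples whose correctness is negatively correlated with ID accuracy across models.
# Those samples form a subset where "accuracy on the line" breaks.
import numpy as np

rng = np.random.default_rng(2)
n_models, n_samples = 200, 1000

# Hypothetical inputs: in practice these would come from trained models and real data.
id_acc = rng.uniform(0.80, 0.95, n_models)                 # ID accuracy per model
# OOD correctness: most samples get easier as ID accuracy rises...
p_correct = np.clip(id_acc[:, None] - 0.1 + rng.normal(0, 0.02, (n_models, n_samples)), 0, 1)
# ...except a planted 10% of samples, where ID-better models are *more* likely to fail.
planted = rng.random(n_samples) < 0.1
p_correct[:, planted] = np.clip(1.6 - id_acc[:, None], 0, 1)
ood_correct = rng.random((n_models, n_samples)) < p_correct

# Score each OOD sample by how its correctness covaries with ID accuracy across models.
centered = ood_correct - ood_correct.mean(axis=0)
scores = (id_acc - id_acc.mean()) @ centered               # negative => ID-best models fail here

worst_subset = np.argsort(scores)[: int(0.1 * n_samples)]  # most negatively correlated samples
print("planted samples recovered:", planted[worst_subset].mean().round(2))
```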
As part of this work, the researchers screened out the most ambiguous, hard-to-identify examples, so that the subsets they flag reflect spurious correlations rather than cases that are simply difficult to distinguish.
Alongside the NeurIPS paper, the researchers are releasing their code and the subsets they identified, for use in future work.
Once a hospital, or any organization that uses machine learning, has identified subsets on which a model performs poorly, it can use that information to improve the model for its specific task and setting. The researchers recommend that future work adopt OODSelect to surface worthwhile evaluation targets and to design methods that consistently improve performance on them.
“We hope the released code and identified subsets of OODSelect will be a stepping stone,” the researchers wrote, “toward benchmarks and models that address the harmful effects of spurious correlations.”



