Getty Images / Monty Rakusen

A Google team’s discovery—that underspecification compromises success in machine-learning models—underscores potential flaws in the way scientists train artificial intelligence systems. “The process used to build most machine-learning models today cannot tell which models will work in the real world and which ones won’t,” wrote Will Douglas Heaven in a commentary about the Google findings.

Borrowed from statistics, underspecification means that a training process can produce a good model but still output a bad one in real-world applications because the training process won’t know the difference, said He Sarina Yang, PhD, an assistant professor at the Weill Cornell Medicine’s Department of Pathology and Laboratory Medicine. Yang was a co-developer of a machine-learning model that predicted SARS-CoV-2 infection status.

“It is not uncommon to see that a good model learned from training data doesn’t work well on real-world data,” Yang told CLN Stat.

Data drift is often cited as a reason why machine learning succeeds in the lab but falters in the real world. Drift refers to “a fundamental difference between the type of data used to develop a machine learning model and the data fed into the model during application,” wrote Daniel Nelson, a programmer who specializes in machine learning and deep learning, in a recent blog

Alexander D’Amour and colleagues at Google uncovered another reason why models fail. In their paper, they identified underspecification in machine-learning pipelines for medical imaging and medical genomics, clinical risk predictions, computer vision, and natural language processing. As an example, they found underspecification in two medical imaging models designed for use in the real world. One model classified images of patients’ retinas, whereas the second classified clinical images of patients’ skin. “We show that these models are underspecified along dimensions that are practically important for deployment. These results confirm the need for explicitly testing and monitoring [machine-learning] models in settings that accurately represent the deployment domain, as codified in recent best practices,” wrote D’Amour and colleagues.

Overall, the Google team looked at 50 models, all trained under the same process but exposed to stress tests that would reveal any performance variations. “Even though all 50 models had approximately the same performance on the training dataset, performance fluctuated widely when the models were run through the stress tests,” wrote Nelson. The team arrived at similar results in training and testing two different natural language processing systems. “In each case, the models diverged wildly from each other even though the training process for all of the models was the same,” added Nelson.

Two root causes could explain underspecification, said Yang. The model itself might be too complicated, making it sensitive to initial values or other variations in training, or the test data might be insufficient to represent all real-world scenarios. “The machine model Google researchers refer to is mostly deep neural networks (DNN), which are very complicated and powerful, and may involve millions or even billions of parameters. Therefore, the DNN models generally require a gigantic amount of training data, and tend to be sensitive to random initial values and the training hyper-parameters,” explained Yang.

Test data is seldom able to cover all possible scenarios in real-world applications. “Consequently, the performance on test data does not serve as a good measurement to select among many good DNN models, and the performance on real data is somewhat uncertain,” she added.

If undetected, underspecification could lead to a poor model that enters production and subsequently gets used in the real world, Nelson continued. “According to D’Amour, machine learning researchers and engineers need to be doing a lot more stress testing before releasing models into the wild. This can be hard to do, given that stress tests need to be tailored to specific tasks using data from the real world, data which can be hard to come by for certain tasks and contexts,” he added.

Several experts commenting on the Google findings offered potential solutions. “One option is to design an additional stage to the training and testing process, in which many models are produced at once instead of just one. These competing models can then be tested again on specific real-world tasks to select the best one for the job,” wrote Heaven.

Data scientist Matt Brems offered several strategies for avoiding underspecification, such as running stress tests and ensuring that a machine-learning model is reproducible. Drawing testing data from a source other than the training distribution is another solution. “If you can gather two separate sets of data (one for training/validation and a separate one for testing), you may better be able to mimic how your model will do in the real world,” he suggested.

Many researchers are actively exploring new applications of machine-learning models in clinical labs. “On one hand, test data in the clinical lab is regarded as ‘small data’ to machine learning, not millions or billions of samples. So underspecification could happen and needs to be addressed,” Yang noted. That said, labs typically use less complicated models such as random forest/boosting trees that aren’t as vulnerable to underspecification as neural networks.

If a machine learning model performs perfectly in a lab’s test data, “it may well be a perfect red alert that the test data is insufficient to have fair evaluation of the model performance,” said Yang. Labs should collect test data as much as possible to observe performance changes in training different machine-learning models, and find and analyze failure cases in test data. “Then, we are less likely to be surprised by underspecification.”