A machine learning model that leverages 27 laboratory tests and patient demographic features was able to rapidly and objectively predict SARS-CoV-2 infection status. Investigators who reported on their results in Clinical Chemistry said the approach shows promise in identifying and isolating high-risk cases before conventional test results are available.
Being able to rapidly identify individuals infected with SARS-CoV-2 is paramount, as the virus continues to spread in the United States and around the world. Clinicians generally rely on reverse transcription polymerase chain reaction (RT-PCR) via nasopharyngeal swabs as the gold standard diagnostic. However, access and supply problems have hindered efforts to deliver test results within 48 hours. Delayed results in turn can spawn delayed or mismanaged care, while also increasing risk of exposure to healthcare workers and other patients. “Rapid diagnosis and identification of high-risk patients for early intervention is vital for individual patient care, and, from a public health perspective, for controlling disease transmission and maintaining the healthcare workforce,” wrote He Sarina Yang, PhD, and colleagues.
Algorithms based on artificial intelligence are maturing as prognostic or diagnostic tools, particularly for complex diseases such as SARS-CoV-2. “In this study, we hypothesized that the results of routine laboratory tests performed within the same time frame as the RT-PCR testing, in conjunction with a limited number of previously identified predictive demographic factors (age, gender, race), can predict SARS-CoV-2 infection status,” wrote the investigators.
One of the researchers, Fei Wang, PhD, had been testing different machine learning models. The gradient boosted decision tree (GBDT) model yielded the best performance.
Wang and colleagues used routine lab results from patients to train a GBDT model. “Basically, 27 laboratory tests were selected to construct the input feature vector of each SARS-CoV-2 RT-PCR result. The computer was trained to learn what the vector (the profile of 27 laboratory test results) should look like when a RT-PCR result is positive or negative in the training dataset,” Yang, assistant professor at Weill Cornell Medicine’s Department of Pathology and Laboratory Medicine in New York City, and the study’s corresponding author, told CLN Stat.
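Yang's description of the training setup can be sketched with an off-the-shelf gradient boosting implementation. The code below is illustrative only, not the authors' code: it uses scikit-learn's `GradientBoostingClassifier` and synthetic stand-in values in place of the real laboratory results, with one 27-element feature vector per RT-PCR result.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n_patients, n_tests = 1000, 27                 # 27 routine lab tests per RT-PCR result
X = rng.normal(size=(n_patients, n_tests))     # synthetic stand-ins for lab values
# Synthetic labels (1 = RT-PCR positive) loosely tied to the first few "tests"
y = (X[:, :5].sum(axis=1) > 0).astype(int)

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
model = GradientBoostingClassifier(random_state=0).fit(X_train, y_train)
accuracy = model.score(X_test, y_test)
```

In this sketch the model learns what the 27-value profile looks like for positive versus negative labels, mirroring the training process Yang describes; the real study used actual patient lab results and RT-PCR outcomes.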
The model’s decision trees contain nodes, from root to leaf, that represent the different lab tests. “The tree nodes and their weights in the model reflect their impact to the prediction. Once trained, the algorithm can make predictions in the testing dataset,” said Yang.
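Yang's point about node weights reflecting impact corresponds to the feature importances a trained tree ensemble exposes. A minimal sketch, again assuming scikit-learn and synthetic data, in which only the first "lab test" drives the label, so the model's learned importances should single it out:

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier

rng = np.random.default_rng(1)
X = rng.normal(size=(500, 27))        # 27 synthetic lab-test features
y = (X[:, 0] > 0).astype(int)         # label determined by feature 0 alone
model = GradientBoostingClassifier(random_state=0).fit(X, y)

# The ensemble's importances rank how much each feature influenced the splits
top_feature = int(np.argmax(model.feature_importances_))
```

In the real model, a ranking like this would highlight which of the 27 routine tests most influenced the SARS-CoV-2 prediction.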
Investigators tested it on a retrospective dataset of 3,356 adult patients tested for SARS-CoV-2 (1,402 positive and 1,954 negative) who had had routine lab tests, such as procalcitonin, coagulation, and hematological tests, within 48 hours prior to the release of an RT-PCR test result. The study took place this spring at New York Presbyterian Hospital/Weill Cornell Medicine.
Compared with three other algorithms—random forest, logistic regression, and decision tree—the GBDT model performed the best, achieving an area under the receiver operating characteristic curve (AUC) of 0.854, sensitivity of 0.761, specificity of 0.808, and agreement with RT-PCR of 0.791. Investigators validated the GBDT model on an independent patient dataset from another hospital, which yielded a comparable AUC of 0.838.
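The reported sensitivity, specificity, and agreement all follow from a confusion matrix tallied against RT-PCR results. The sketch below shows the formulas with hypothetical counts chosen only for illustration; they are not the study's data.

```python
def sensitivity(tp, fn):
    """True positive rate: fraction of RT-PCR positives the model flags."""
    return tp / (tp + fn)

def specificity(tn, fp):
    """True negative rate: fraction of RT-PCR negatives the model clears."""
    return tn / (tn + fp)

def agreement(tp, tn, fp, fn):
    """Overall fraction of model predictions matching RT-PCR."""
    return (tp + tn) / (tp + tn + fp + fn)

# Hypothetical confusion-matrix counts (true pos, false neg, true neg, false pos)
tp, fn, tn, fp = 76, 24, 81, 19
print(sensitivity(tp, fn))          # 0.76
print(specificity(tn, fp))          # 0.81
print(agreement(tp, tn, fp, fn))    # 0.785
```

AUC, by contrast, summarizes the sensitivity–specificity trade-off across all possible decision thresholds rather than at a single cutoff, which is why it is reported alongside the threshold-dependent figures.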
The novel algorithm also predicted SARS-CoV-2 RT-PCR positivity in a majority (66%) of 32 patients whose RT-PCR results switched from negative to positive within 48 hours.
The hope is this model could serve as an application for an electronic medical record system, said Yang. “Clinicians could be alerted promptly of the infection risk level, allowing for rapid triaging and quarantining of high-risk patients as well as prompting rapid retesting in those with positive model findings and negative SARS-CoV-2 RT-PCR results. In addition, this analysis may play an important role in assisting the identification of SARS-CoV-2 infected patients in areas where RT-PCR testing is not accessible due to financial or supply constraints,” she said.
Investigators cited several potential limitations to the model, including that it was only tested on hospitalized patients with moderate to severe symptoms. “Thus, this model may not be applicable to mild COVID-19 cases,” they acknowledged.
The method Yang and colleagues present has “the potential to augment more conventional methods for rapid assessment of patients with COVID-19,” noted Christopher McCudden, PhD, DABCC, NRCC, FACB, in a related editorial. Applying this type of predictive analytics in the real world, however, presents some challenges. There’s the matter of integrating such an algorithm into electronic medical records and “the inherent opacity of machine learning algorithms. Underpinning these challenges are the key questions of what exactly a given prediction indicates and what action a physician can take with an individual patient,” offered McCudden.
Then there’s the task of reporting on the probabilities generated by such algorithms. Would the prediction manifest as a probability score, a binary measure, a risk-related keyword, or a textual report? “Overall, adoption is likely to depend on how easy it is to convey what the prediction can and cannot provide,” he wrote.
In next steps, the machine learning model will be used to prioritize high-risk patients in emergency departments and improve the triaging process, said Yang. “We are also hoping to test the model in different patient populations and geographic areas.”