An AI-powered patient triage platform for future viral outbreaks using COVID-19 as a disease model

Over the last century, outbreaks and pandemics have occurred with disturbing regularity, necessitating advance preparation and large-scale, coordinated response. Here, we developed a machine learning predictive model of disease severity and length of hospitalization for COVID-19, which can be utilized as a platform for future unknown viral outbreaks. We combined untargeted metabolomics on plasma data obtained from COVID-19 patients (n = 111) during hospitalization and healthy controls (n = 342), clinical and comorbidity data (n = 508) to build this patient triage platform, which consists of three parts: (i) the clinical decision tree, which amongst other biomarkers showed that patients with increased eosinophils have worse disease prognosis and can serve as a new potential biomarker with high accuracy (AUC = 0.974), (ii) the estimation of patient hospitalization length with ± 5 days error (R2 = 0.9765) and (iii) the prediction of the disease severity and the need of patient transfer to the intensive care unit. We report a significant decrease in serotonin levels in patients who needed positive airway pressure oxygen and/or were intubated. Furthermore, 5-hydroxy tryptophan, allantoin, and glucuronic acid metabolites were increased in COVID-19 patients and collectively they can serve as biomarkers to predict disease progression. The ability to quickly identify which patients will develop life-threatening illness would allow the efficient allocation of medical resources and implementation of the most effective medical interventions. We would advocate that the same approach could be utilized in future viral outbreaks to help hospitals triage patients more effectively and improve patient outcomes while optimizing healthcare resources. Supplementary Information The online version contains supplementary material available at 10.1186/s40246-023-00521-4.


Introduction
Most human coronavirus (CoV) infections result in mild patient symptoms.However, the novel severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) distinguishes itself from other CoVs in having led to more than 6.7 million deaths according to WHO (up to 23, January 2023.The outbreak of SARS-CoV-2 remains an on-going global pandemic.Elderly patients with underlying chronic diseases are considered at high risk for death from COVID-19, the disease caused by SARS-CoV-2, and younger people without major underlying diseases may also present with lethal complications [1].
Although vaccines appear to be safe and effective in preventing severe COVID-19 symptoms and death, the clinical management of COVID-19 patients continues to pose an enormous economic burden to the health system [2].Patients with COVID-19 present a broad spectrum of symptoms ranging from asymptomatic to mild respiratory tract infections and influenza-like illness to severe disease with accompanying lung injury, multiorgan failure, and death [3].Hypoxemia is a main marker of severity [4].Although the lungs are believed to be the site at which SARS-CoV-2 replicates, infected patients often report other symptoms, suggesting the involvement of the gastrointestinal tract, heart, cardiovascular system, kidneys, and other organs [5].
Several studies have attempted to classify COVID-19 patients' symptoms based on their clinical phenotypes [4,[6][7][8].However, the heterogeneity in patient medical histories and COVID-19 symptoms have prevented the establishment of concrete classifications that can predict patient outcomes, i.e., who might need hospital admission or closer monitoring while in hospital.Being able to predict which patients can be sent home and those possibly needing intensive care unit (ICU) admission is critical to hospital administrators and health officials as they seek to implement interventions that optimize health outcomes for each COVID-19 patient and effectively utilize available hospital resources.
Using machine learning (ML), we built a model of COVID-19 disease severity and prediction of hospitalization duration based on clinical data and the metabolic profiles of plasma samples collected from patients during hospitalization.The model led us to identify a panel of unique clinical and metabolite biomarkers that were highly indicative of disease progression and allows the prediction of patient management needs very soon after hospital admission.

Demographic data
The clinical cohort used in this study consisted of 431 participants, of whom 111 were considered SARS-CoV-2-positive upon admission in the Yale New Haven Hospital (YNHH).The remaining 324 participants were health care workers (HCW) healthy controls who were SARS-CoV-2-negative (Table 1).In sample collection, 131 of the samples from SARS-CoV-2-positive patients were collected from the same patient during their stay in the hospital.The SARS-CoV2 infection status of each study participant was confirmed by nasopharyngeal swab sample polymerase chain reaction (PCR) test.SARS-CoV-2-infected patients were arbitrarily divided into different classes based on their treatment needs during hospitalization as follows: classes 1 (patients who did not require any external oxygen supply), 2 (patients who required low or high flow oxygen supply), and 3 (patients who required positive airway pressure (biphasic; BIPAP or continuous; CPAP) or were intubated).

Untargeted metabolomics analyses between all SARS-CoV-2-infected patients and healthy controls
Untargeted metabolomics was performed in plasma samples to elucidate whether SARS-CoV-2 infection caused changes in the plasma metabolic profiles of SARS-CoV-2 positive patients admitted to hospital.Comprehensive profiles were acquired and a total of 82 features were detected.Partial least square-discriminant analysis (PLS-DA) of the metabolomics data showed good separation of the metabolic profiles between the SARS-CoV-2-positive patients and the healthy controls (Additional file 1: Figure S1A).Multifactorial analyses revealed metabolite differences between all of the SARS-CoV-2-positive patients (class 1, 2, and 3) and the SARS-CoV-2-negative controls (HCW).Of these metabolites, picolinate was the metabolite that best predicted an individual infected by SARS-CoV-2, i.e., increased plasma levels occurred in infected patients (Fig. 1A, C, Additional file 1: Fig. S1B, D).By contrast, tryptophan was the plasma metabolite found to be associated with SARS-CoV-2-negative individuals (Fig. 1A, B, Additional file 1: Fig. S1B, E).Although many factors can cause variation in the plasma metabolome of the SARS-CoV-2-positive patients (e.g., underlying diseases, age, severity of COVID-19), the test samples showed similar parameters as the training set used in the PLS-DA model and were classified correctly as SARS-CoV-2-positive individuals (Additional file 1: Figure S1C).Multivariate and univariate analyses showed differences in the plasma metabolome between the two classes (uninfected vs. infected).The effect of the two most significant metabolites (picolinate and tryptophan) on the overall impact of SARS-CoV-2 infection and the estimation of probability values (which is discrimination criteria between SARS-CoV-2-positive and negative cases) is depicted in Additional file 1: Figures S1D, E. Receiver operating characteristic (ROC) curves were calculated for both the training and test sets and resulted in an area under curve value of 1.00.This demonstrates the robustness of the PLS-DA model and its applicability to the accurate discrimination between SARS-CoV-2-negative and SARS-CoV-2-infected individuals.
Pathway analysis revealed metabolites changed in SARS-CoV-2-infected patients being involved in the purine salvage pathway.Plasma levels of inosine monophosphate, inosine or uric acid were not different between the healthy control and SARS-CoV-2-infected groups.However, hypoxanthine and xanthine were increased in SARS-CoV-2-infected patients (Figs.1D, 2A) and univariate analysis also revealed significant changes for each class individually when compared to uninfected   controls (Fig. 2).Similarly, allantoin, a metabolite derived from uric acid and reactive oxygen species [9], was higher in the SARS-CoV-2-infected patient group (Fig. 1A-D); however univariate analysis for the individual classes, revealed allantoin to be increased in classes 1 and 2 but not in class 3 patients when compared to uninfected controls (Fig. 2A).
The tryptophan metabolism pathway was also comprehensively characterized in the present study.In the SARS-CoV-2-infected patient group, plasma levels of tryptophan were decreased, while levels of 5-hydroxytryptophan and kynurenine were increased (Fig. 1B).By contrast, serotonin levels were decreased (Fig. 1B) in the SARS-CoV-2-infected group.Given that 5-hydroxytryptophan availability is the rate-limiting step in serotonin synthesis [10], the decreased serotonin levels observed in SARS-CoV-2-infected patients are unexpected.It is possible that SARS-CoV-2 infection may decrease plasma serotonin levels by reducing its synthesis from 5-hydroxy-tryptophan (e.g., by decreased expression or activity of aromatic L-amino acid decarboxylase (AAAD)), increasing its cellular uptake (e.g., by upregulation of the serotonin reuptake transporter) and/ or increasing its metabolism (e.g., by increased activity or expression of metabolic enzymes, such as monoamine oxidase or carboxypeptidase A3).Which of these possibilities underlies the current observation remains to be established.Levels of kynurenine metabolites varied in SARS-CoV-2-infected patients in a manner that appeared to vary by pathway.Specifically, increased levels of kynurenic acid, picolinate and quinolonic acid occurred in the plasma of SARS-CoV-2-infected patients [11].Notably, reduced levels of anthranillic acid and xanthurenic acid occurred in these same patients (Fig. 1E).Glucuronate plasma levels were increased in the SARS-CoV-2-infected patient group and levels of paraxanthine, a caffeine metabolism-related metabolite, were decreased in the SARS-CoV-2-infected patient group (Figs.1D,  2).Univariate analysis of only the metabolome revealed thirty-four metabolites that were different in SARS-CoV-2 infected patients who survived compared to those who did not survive.Amongst them are metabolites that showed up in the pathway analysis (such as paraxanthine, tryptophan, and glucuronate) and new ones including norvaline, cyclic-AMP and theobromine (Fig. 2B).

Machine learning: combined clinical, comorbidity and metabolomics data improved the prediction model
Initially, the analysis was performed using only patient clinical data during hospitalization (Additional file 1: Table S1) and comorbidities upon admission (Additional file 1: Table S2), an approach that resulted in lower total accuracy for the external data set (test set) (Additional file 1: Figure S2A, C).The combined clinical, comorbidity and metabolomics data improved the prediction accuracy of the model (Additional file 1: Figure S2B, D) from 0.916 to 0.954 and 0.938 to 0.953 (in case of intubation risk of classes 1 vs 2 and 2 vs 3, respectively).When modelling factors influencing the length of hospitalization, removal of metabolomic data decreased the squared correlation coefficient from 0.976 to 0.956.This supports the value of using combinatorial data instead of individual datasets (metabolomics versus clinical and comorbidity data).In our analyses, we used training data (which was a subset of 80% of the total dataset) to train the machine learning model, and testing data set (a subset of 20% of the total dataset) to independently evaluate the accuracy of the model.In these analyses, a DUPLEX algorithm [12] was used to select the cases in the test set because the selected cases had to be representative of training data for unbiased accuracy estimation.A The four metabolites most significantly down-regulated in infected patients (relative to uninfected subjects).B The four most significant metabolites up-regulated in infected patients (red symbol) relative to uninfected subjects (black symbols).C Metabolic pathways identified by untargeted metabolomics in the plasma of SARS-CoV-2-uninfected subjects (black symbols) and SARS-CoV-2-infected patients (red symbols).D Purine metabolism: adenosine monophosphate can be converted to inosine either by (i) deamination to form inosine monophosphate followed by dephosphorylation or (ii) dephosphorylation to form adenosine followed by deamination.Hypoxanthine, formed from inosine, can undergo oxidative hydroxylation to xanthine which can then be converted by xanthine oxidase to uric acid.Allantoin is formed from the reaction between uric acid and reactive oxygen species (ROS).E Tryptophan metabolism: In the kynurenine pathway (which accounts for ~ 95% of tryptophan degradation), tryptophan forms kynurenine (by tryptophan-2,3-dioxygenase (TDO) or indoleamine 2,3-dioxygenase (IDO)).Kynurenine can then undergo hydroxylation to 3-hydroxy kynurenine (by kynurenine 3-monooxygenase (KMO)).A minor degradation pathway involves tryptophan hydroxylation to 5-hydroxy-tryptophan (by tryptophan hydroxylase isoforms 1 and 2 (TPH1/2)) and then to serotonin and melatonin (by aromatic-L-amino-acid decarboxylase (AAAD)).In figures B-E, data are presented as the mean ± SD and each dot represents individual sample results.Dots outside the box plot are in the upper quartile (75th percentile) of the distribution and the dots inside the box plot are in the interquartile range (IQR), where 50% of the data are located.Outside the box plot are the patients that are outside the IQR range.The box plot is divided at the median.Probability values reflect results in SARS-CoV-2-infected patients being compared with uninfected subjects using a Student's unpaired t-test (See figure on next page.)

Clinical decision tree analysis: estimation of survival or mortality during hospitalization
A clinical decision tree (DT) approach was developed from clinical and comorbidity data obtained during the SARS-CoV-2-infected patient's hospital admission for the purpose of predicting the hospital death or survival (discharge disposition) of SARS-CoV-2-infected patients (Fig. 3A).The ROC curve for the training and test sets had accuracies of 0.974 and 0.926, respectively (Fig. 3B,  C).The clinical DT has an internal feature selection method that includes factors that result in the lowest error towards estimation of discharge disposition.From both metabolomics and clinical data, clinical data were found to be sufficient to estimate the possibility of death or a survival event after SARS-CoV-2 infection at an early stage.Albumin was the initial determinant factor in the DT.If the albumin level is > 2.7 g/dL (normal range 3.5-5.5 g/dL) [13], then the blood urea nitrogen (BUN) should be considered.If BUN > 97 mg/dL (i.e., close to 5 times higher than the upper normal range of 20 mg/dL) [13], the clinical DT predicts that the patient will not survive (death).If the BUN < 97 mg/dL, the percentage of blood eosinophils (EOS) should be considered.If the EOS > 3.7, the clinical DT predicts that the patient would not survive; if the EOS < 3.7, the patient is predicted to survive.If the albumin level < 2.7 g/dL, then the globulin level should be evaluated.If globulin ≥ 3.7 g/dL, the patient is predicted not to survive; if globulin < 3.7 g/dL, the lymphocytes need to be determined to evaluate the discharge disposition of the patient.If lymphocytes ≥ 2.1%, the patient is predicted to survive; if lymphocytes < 2.1%, the patient is predicted not to survive (death).

Prediction of duration of hospitalization
Through the application of machine learning analysis and the Random Forest (RF) regression to clinical, plasma metabolomics and comorbidity data, which we obtained during the hospitalization, the length of hospitalization of each SARS-CoV-2-infected patient admitted to the hospital was estimated.These estimates were found to be very accurate (i.e., with only few days of actual hospitalization duration), as shown in Fig. 4A (R 2 = 0.9765).The error associated with more than 60% (≈ 1.5 sigma) of the data is within 3 to 5 days (Fig. 4B, C), indicating that the acceptable error window is ± 5 days.The forty most important factors in the structure of the RF model are depicted in Fig. 4D.Respiration (RR) was the most important factor contributing to the longer hospitalization of the patient.In the SARS-CoV-2-infected patient cohort, 73% had a RR above the normal of 18 breaths/ min.Minimum blood urea nitrogen (BUN), a serum byproduct of protein metabolism, was the second critical clinical factor contributing to a prolonged hospitalization.Normal BUN ranges from 5 to 20 mg/dL [14] and in our cohort, 37% patients had BUN > 20 mg/dL.Considering the Shapley Additive exPlanations (SHAP) values between factors for patients with least and maximum duration of hospitalization (SARS-CoV-2-infected patient, patients #123 and #213 with 90-and 2-days duration of hospitalization, respectively), it can be concluded that simultaneous increases in BUN and RR values causes the duration of hospitalization to be increased (Additional file 1: Figure S3).Nevertheless, the cumulative effects of other factors apart from BUN and RR are essential to RF model to accurately estimate the duration of hospitalization.For example, for patient cases #123 and #213, excluding the other 508 factors can cause up to 20 and 8 days shifts in the outcome (Additional file 1: Figure S3).As expected, temperature was a factor to consider.Upon hospital admission, 36% SARS-CoV-2-infected patients had RBC < 4.0 million cells/μL (males and females; lower end of normal range for females [15]), 46% patients had HGB < 12 g/dL (males and females; lower end of normal range for females [15]).In 18.5% patients, PLT was lower than 150 (normal range: 150-400 × 10 9 /L [15]) and 44% had monocytes (differential, %) higher than 7% (normal range: 3-7%, [15],).Peritonitis and intestinal abscess were an underlying condition that was associated with prolonged hospitalization.Upon admission, only one patient had peritonitis/intestinal abscess, was hospitalized for nine days, intubated, and did not survive.During the present study, three more patients developed peritonitis/intestinal abscess.Of them, one patient stayed in the hospital for forty days, was intubated, and did not survive; the two other patients had a prolonged hospital stay for 77 and 90 days, both were intubated and survived.Fifty-nine (SARS-CoV-2-infected patients were male, and BMI data were obtained from (See figure on next page.)Fig. 2 Untargeted plasma metabolomics analyses of SARS-CoV-2-infected and uninfected subjects.A Results of univariate analysis of the metabolites are shown for SARS-CoV-2-negative subjects (HCW black square), SARS-CoV-2-positive patients who during their hospitalization did not require any external oxygen supply or required only a low flow of oxygen (Class 1, red square), SARS-CoV-2-positive patients who required a high flow of oxygen (Class 2, green square), and SARS-CoV-2-positive patients who needed positive airway pressure (biphasic; BIPAP or continuous; CPAP) or were intubated (Class 3, blue square).B Univariate analysis of the identified metabolites in the plasma showing differences in the metabolome between SARS-Cov-2 infected patients who survived (black square) and those who did not survive (red square).Data are presented as the mean ± SD; dots outside the box plot are in the upper quartile (75th percentile) of the distribution and the dots inside the box plot are in the interquartile range (IQR), where 50% of the data are located.Outside the box plot are the patients that are outside the IQR range, and the box plot is divided at the median.Student's unpaired t-test, NS (non-significant) (color figure online) fifty of them.Their mean BMI was 31.73 (8.62 SD) kg/ m 2 with six being in the healthy range (18-24.9kg/m 2 ).Of the fifty-six (SARS-CoV-2-infected female patients, BMI data were obtained from forty-seven.Their BMI was 30.77(8.86 SD) kg/m 2 , with 25 being in the healthy range (25.0-29.9).Liver function tests (including aspartate aminotransferase (AST) and albumin) contributed to the estimation of the length of hospitalization.Sixty-three percent of the SARS-CoV-2-infected patients had AST values above the upper limit of normal (8-33U/L) [16], the albumin levels of 17% patients were lower than the lower limit of normal (3.5-5 g/dL) [17].Lactate levels also contributed to estimation of the length of hospitalization, even though only 9 of 75 patients had hyperlactemia (lactate ≥ 2 mmol/L) [18] and we did not have data for 40 of the 111 patients.Systolic blood pressure data was available for 111 patients and 68 were found to have pressures exceeding 120 mmHg.The metabolites predicted by the model to be contributing to duration of hospitalization were (i) theobromine (a xanthine alkaloid and a product of caffeine metabolism) [19], (ii) glucuronic acid (a key metabolite of glucose involved in the detoxification of xenobiotic compounds which is produced in the liver) [20], (iii) paraxanthine (a metabolite of caffeine), and (iv) allantoin (a biomarker of oxidative stress in humans as the main product of uric acid oxidation by reactive oxygen species) [21] (Fig. 4D).For instance, it applies the threshold of 2.7 g/dL over Albumin_24_hours_min (minimum value obtained from the clinical data), the node evaluates whether if patients show Albumin_24_hours_min above 2.7.If yes, then the next decision rule in DT is at down to the root's left child node (Yes; depth 2).Ninety-one percent of patients will survive with a survival probability of ninety-nine percent.This way, inspecting the whole DT, the impact of features on the likelihood of survival can be derived.The percentage of patients at each node is provided below the probability values of survival (denoted as 1) or death (denoted as 2) on the DT; the green (survived) /blue (died) shows the fitted/estimated values for the patients in each class at given node.ROC curves for B training set and C test set.AUC provides an aggregate measure of performance across all possible classification thresholds

Prediction of the risk of mechanical ventilation and/ or intubation
Machine learning analyses (similar to those described above) were performed to identify factors contributing to the risk of intubation of patients and to triage patients during hospitalization.Discrimination of risk of intubation was achieved for the three patient classes (with risk increasing from class 1 to class 3, with the AUC of the ROC curve being above 0.950 for pairwise comparison of the classes (Fig. 5A, B).This reflects the observation that fused data can be safely and reliably used to understand the risk of intubation in patients admitted to hospital with SARS-CoV-2 infection.Several factors found to be influential in the discrimination accuracy of the RF model included the following clinical data: albumin, calcium, temperature, respiration rate, ESI index (Emergency Severity Index: a triage tool for emergency departments; from clinical data), as well as plasma levels of 5-methylcytosine, galacturonate, glucuronate, theobromine and citramalate.(Fig. 5C).

A personalized COVID-19 patient triage software application
Using the results obtained from our patient cohorts, a COVID Severity by Metabolomic and Clinical Study (CSMC) software was developed to support the prehospital process and to classify patients' condition when they arrive to the Emergency Department (Additional file 1: Figure S4).The software takes advantage of the power of machine learning and utilizes the clinical data (that are routinely obtained during hospital admission) and the plasma metabolomics data of the patient to predict: (i) the discharge disposition of the admitted patient (survival prediction), (ii) the length of hospitalization, and (iii) the disease prognosis (i.e., the risk of the patient for need for mechanical ventilation or intubation).The software supports a population health program for COVID-19 management by predicting care transitions, and patient monitoring during viral outbreaks.It can be accessed online here: http:// trams.chem.uoa.gr/ csmc/.

Discussion
We performed pathway analysis and statistical analysis of the plasma metabolomics data and found them to be in agreement with previous publications [11,22]; tryptophan metabolism is a pathway affected by SARS-CoV-2 infection.The tryptophan metabolic pathway was carefully characterized in the present study, and we found differences between the SARS-CoV-2 infected and uninfected groups, suggesting a disease-associated hyperactivation of the indoleamine-pyrrole 2,3-dioxygenase enzyme [23].Activation of the kynurenine pathway was anticipated because it plays a major role in generation of cellular energy (in the form of nicotinamide adenine dinucleotide (NAD+)) which is increased substantially during an immune response.Higher levels of plasma kynurenine are associated with inflammation and psychiatric disorders [24].This might explain the observed neurological disorders associated with the long-term effects of COVID-19.We also identified a decrease in plasma serotonin levels in SARS-CoV-2-infected patients who needed positive airway pressure oxygen and/or were intubated.Serotonin is an important autacoid and neurotransmitter.Studies have shown that SARS-CoV-2-infected patients treated with fluvoxamine, a selective serotonin reuptake inhibitor, had a lower likelihood of clinical deterioration [25,26] and a very recent paper suggested that serotonin might be the missing link between COVID-19 course of severity in patients with diabetes and obesity [27].Increased plasma levels in kynurenine, kynurenic acid, picolinic acid, and quinolinic acid (but not anthranilic acid) occurred in SARS-CoV-2-infected patients, suggesting hyperactivation of the kynurenine pathway in these subjects, as shown previously [28].
By combining the metabolomics results with the patient clinical data obtained during hospitalization, we developed an improved machine learning model compared to a model that only used clinical data.Our intent was to provide a personalized approach for SARS-CoV-2-infected patient management.Although some published machine learning models are used as screening, diagnostic [29] or prediction models of COVID-19 [30,31], our model provides a real-time prediction of the clinical progression and duration of hospitalization based on both routinely-obtained clinical data and the plasma metabolomic profile.
Our clinical data findings are in agreement with those of Baker and coworkers [32].Unlike previous assumptions [33], there appears to be a vigorous and early immune response in the upper airway in patients who develop COVID-19, a process that leads to recruitment of eosinophils, natural killer cells, and macrophages.In Baker's study, SARS-CoV-2-infected patients not treated with budesonide exhibited persistently raised interferon and eosinophil chemokines; patients with a worse disease prognosis showed a muted early inflammatory response (except for raised eosinophil chemokines), followed by a severe second peak of inflammation.In our clinical decision tree model, SARS-CoV-2-infected patients with increased eosinophils experienced worse clinical deterioration than those with lower eosinophil counts (< 3.7 cells/μL), leading to the possibility that blood eosinophil levels could serve as a potential biomarker for predicting the worsening of COVID-19.
In agreement with other studies [34,35], we determined plasma albumin levels to be an important factor in predicting whether a SARS-CoV-2-infected patient would survive.In addition, it was the clinical parameter that exerted the most influence on the discrimination accuracy of the Random Forest model assessing the risk of intubation.It was also a factor predicting the duration of hospitalization.Albumin, the most abundant protein in plasma, performs important metabolic functions in the transport of free fatty acids, bilirubin, and many drugs [36].Approximately 15 g of albumin is synthesized daily by the liver to maintain the albumin plasma steady state concentration [37].Decreased albumin synthesis and increased catabolism after oxidation [38] is observed in liver disease and our dataset agrees with recent publications showing an association of hypoalbuminemia with poor outcome in critically ill population including COVID-19 [39].
The present study showed that inclusion of plasma metabolome data to the model improved the accuracy of the prediction of the duration of hospitalization of SARS-CoV-2-infected patients.Metabolites shown to be important in the model included allantoin, paraxanthine and theobromine, and glucuronic acid.Allantoin is an excellent biomarker of oxidative stress in humans [21,40], and can be non-enzymatically oxidized from uric acid by reactive oxygen species (ROS).However, allantoin measurements are not routinely performed in clinical laboratories [21].Many studies have demonstrated elevated levels of allantoin in a variety of diseases, including chronic heart failure [41], gout [42], and cystic fibrosis [43].Methylxanthines, specifically paraxanthine and theobromine, which are found in coffee, cacao and tea [44], were found to be important in prediction of both the hospitalization duration and the severity of the disease.These metabolites are associated with consumption of cacao or caffeine-containing products, and thus can only be obtained through diet.As such, it is likely that the decreases in the levels of these metabolites in SARS-CoV-2-infected patients relates to decreased consumption of caffeine-containing or cacao-derived products.Interestingly, a recent study has suggested using methylxanthines to inhibit SARS-CoV-2 infection [45].
In the present study, glucuronic acid, a key metabolite of glucose involved in the detoxification of xenobiotic compounds [46], increased with the severity of SARS-CoV-2 infection symptoms and it contributed to both prediction of the hospitalization and the prognosis of the disease.A recent metabolomic study of patients with cirrhosis identified glucuronic acid as a biomarker of disease severity and future mortality [47], while another study demonstrated that glucuronic acid levels were robust predictors of all-cause mortality and correlate with future health span-related outcomes [20].
In the present study, large-scale plasma metabolomics analyses allowed us to identify metabolites, such as hydroxytryptophan, kynurenine, picolinic acid, allantoin and glucuronic acid, that were increased in SARS-CoV-2-infected patients.We also identified blood eosinophil count as a novel biomarker for COVID-19 disease severity.Combining all of the available clinical and comorbidity data with the metabolomics data, we demonstrated the value of metabolomics data to enhance model prediction.Further, the use of an advanced machine learning approach allowed the development a precision medicine model that is capable of being used for predicting outcomes in SARS-CoV-2-infected patients.This approach can be utilized for future viral outbreaks to help hospitals triage patients according to their need for emergency medical attention.

Limitations
Our study has some limitations.First, it was conducted before vaccines were available and before many of the treatments that are available now, such as remdesivir, anti-SARS-CoV-2 monoclonal antibodies, and nirmatrelvir/ritonavir.One might expect such treatments would reduce the changes observed in our metabolite biomarkers.Second, the IMPACT study was conducted in a community setting in New Haven, CT (USA).This resulted in our population of healthy controls (i.e., health care workers) being mainly White/Caucasian, and despite our attempts to recruit ethnic minorities, the SARS-CoV-2-infected subjects comprised a higher proportion of Black/African Americans.As such, the possibility of race/ ethnicity contributing to differences between SARS-CoV-2-infected and uninfected subjects cannot be excluded.These considerations notwithstanding, to our knowledge, this is the first study to combine both clinical and metabolomics data to build a model to predict hospitalization duration and disease severity.

Study design and participants
The cohort under study included plasma samples collected from SARS-CoV-2-infected patients during their hospitalization at the Yale New Haven Hospital (YNHH, n = 111) and from healthy (SARS-CoV-2-uninfected) Yale New Haven hospital healthcare workers (HCW, control) (n = 324) as part of the Yale IMPACT Biorepository.All samples were collected between March and May 2020 and all subjects in the study had not received any COVID-19 vaccinations.The healthcare workers were medical staff working in the YNHH.This time period preceded the availability of immunizations and medications for COVID-19.We included all patients with laboratory-confirmed SARS-CoV-2 infection who were hospitalized at the YNHH between March and May 2020.The protocol of this study was approved by the Institutional Review Board of Yale University (HIC number 2000027690).Written informed consent was obtained from all study participants.Subject demographic data are provided in Table 1.For data analysis, the subjects were divided into four groups as follows:

Clinical and comorbidity data
The clinical data (Additional file 1: Table S1) were obtained during the patient hospitalization.They were retrieved from electronic medical records, including clinical characteristics (comorbidities, Additional file 1: Table S2), and laboratory test results assessing lung, kidney, liver, heart function, blood clotting/inflammation biomarkers, immune system, respiratory function, and metabolic panel.These were measured multiple times within the first 24 h of patient admission and we obtained a 24 h minimum value (24_hours_min), a 24 h maximum value (24_hours_max) and an average of all the measurements withing the 24 h window (24_hours_mean).In total, we obtained 65 different clinical parameters and had information for 281 different comorbidities according to the Emergency Severity Index.In addition, we had access to data regarding oxygen supply.

Sample preparation
Untargeted metabolomic profiling was performed on plasma samples of SARS-CoV-2-infected and uninfected (healthy control) patients.Quality Control (QC) samples: Twenty μL of the sample supernatant was removed before the evaporation step.The aliquots were all pooled, and 400 μL aliquots were evaporated to dryness, stored and reconstituted as described above.
The mobile phase for HILIC-MS analysis consisted of A (25 mM ammonium hydroxide and 25 mM ammonium acetate in water) and B (acetonitrile) delivered at a flow rate of 0.5 mL/min.The linear gradient elution started at 95% B (0-0.5 min), 95%-65% B (0.5-7 min), 65-40% B (7-8 min), 40% B (8-9 min), 40-95% B (9-9.1 min) and continuing at 95% B (9.1-12.0min).The injection volume for all samples and standard solutions was 3 μL.QC samples were analyzed every ten to fifteen injections.The column temperature was set at 30 °C for HILIC, and the sample tray temperature was maintained at 8 °C.For MS analyses, the electrospray ionization source (ESI) was operated in negative mode.Q-ToF-MS scan data (300 ms/scan; mass scan range 50-1200 Da) were first acquired for each sample.Thereafter, MS e fragmentation data were acquired for metabolite identification (low energy scan: 200 ms/scan, collision energy 6 eV; high energy scan: 100 ms/scan, collision energy 10, 20, 30 and 40 eV, mass scan range 25-1000 Da).ESI source parameters were as follows: 1.8 kV capillary voltage, 40 V sampling cone, 50 °C source temperature, 420 °C desolvation temperature, 80 L/hr cone gas flow, 850 L/hr desolvation gas flow.Seven batches of randomized samples were run, and for each batch, a QC was injected 10 times before the individual samples for that sample type were injected to ensure matrix stabilization.

Data analysis and structural annotation
ProteoWizard (version 3.06150) was used to convert raw MS data files to the .mzMLformat [48] and then imported in R environment for further analysis.A general target screening strategy outlined in our previous work [49] and the metabolite identification was performed using our in-house library.The target list was used in screening of MS e data to confirm the metabolites.The screening strategy is depicted in Additional file 1: Figure S5.The screening was performed using in-house R algorithms based on several steps including:  [49].Peaks showing a probability > 0.5 were considered to be a false positive and may exist in the sample due to analytical method or contamination (e.g., carry-over/presence in guard column, contamination in ion-source or during sample preparation); 5. Peaks with probability < 0.5 were checked for the existence of experimental isotopes (theoretically calculated from the molecular formula) and various adducts forms meaning that characteristic peaks for the given m/z at given mass accuracy (2 mDa) are available.This step uses the same algorithm used in step 3) to determine if characteristic peaks were present and provide a score of match; 6. Peaks showing enough evidence regarding the isotopes and adducts (i.e., score above 0.5) were evaluated by available retention time (t R ) data.This step compared the t R information between the t R value in the database and the observed t R for the sample.The threshold used on t R value to accept the candidate was set to 20 s.If the absolute difference in the t R value was above 30 s, the t R data from the pooled sample (labeled TrueQuant metabolites spiked at a known concentration in the pooled samples) were used instead.Therefore, variation in t R data due to matrix could be minimized, and if the absolute t R difference became less than 30 s, the feature was kept; otherwise, it was moved to the list of unknowns (to be identified by non-target screening); 7. A candidate with acceptable absolute t R difference was evaluated using available MS/MS fragments between initial database and samples.At this stage, the minimum number of three MS/MS fragments were required to be matched from the target database to the MS/MS data derived from sample using a doc product algorithm (threshold > 0.5) [50].It is worth noting that structural isomers would not be distinguished using the current workflow unless their t R data exceeded 30 s shifts.
Any detected compound (i.e., whose identification was confirmed through the above seven steps) was screened in all samples and the peak area values were extracted.The t R range was obtained after using step 2) and the peak segmentation algorithm.Eventually, the metabolomics dataset included 559 samples (plus an additional 117 samples as QC), in which 82 metabolites were identified.The generated data used for machine learning consisted of the m/z value, t R , metabolite name and peak area.

Machine learning analysis
To discriminate between plasma samples of SARS-CoV-2-uninfected and infected subjects, the dataset (consisting of metabolite name, m/z, t R , peak area and metabolite name for all samples (N = 559) was imported into the R environment and then transformed based on logarithmic (base 10) scale.In the metabolomics dataset, missing values were treated by predictive mean matching (PMM) algorithm using "MICE" R package [51].The imputation of each dataset was performed groupwise (within class), i.e., the imputation was performed for metabolites by isolating them based on their group/class.Sixteen samples were excluded from the total list of 575 samples due to missing values.The complete data was then normalized by median value of each metabolite from the pooled samples (i.e., the QC set) as the estimation of the most probable quotient.The normalized dataset was then autoscaled prior to multivariate analysis.Subsequently, the dataset (a total of 559 samples without QC samples) was split into training (N = 449) and test set (N = 110) using the "DUPLEX" algorithm and "Mahalanobis" as the distance function [52,53].At first, Principal Component Analysis (PCA) was used to perform data exploratory analysis (using the "factoextra" and "mixOmics" R packages [54]).Partial least square discriminant analysis (PLS-DA) [54] was used as a supervised method to find markers that contribute to differentiation between samples from SARS-CoV-2-infected and uninfected patients.The optimal value for latent variables (LVs) used in PLS-DA was set by evaluating the misclassification error in 5-leave out cross-validation [55].The statistical significance of the PLS-DA model was evaluated using R2X (which measures the accumulative variance), R2Y (which measures the goodness of fit), Q2 (fivefold-cross-validated; predictive ability of the model) and a Receiver Operating Characteristics (ROC) curve.Bayesian theorem was used to find threshold at which two classes were discriminated by PLS-DA [55].This was needed to assign classes on the samples based on the calculated probability values.The threshold of absolute intensity was 0.0456 and was derived after evaluating the point at which specificity and sensitivity were equal to 1.00.Any probability value calculated using the PLS-DA method that lay above this 0.0456 threshold was class A (i.e., control; SARS-CoV-2-negative). Class B (i.e., SARS-CoV-2-positive) was assigned to a sample if the probability value was below 0.0456.The weight applied to each metabolite in each PLS-DA function was used as a basis to find potential markers that would discriminate between the samples.Outlier analysis and applicability domain were assessed using Hotelling's T 2 and Q residuals, respectively [55].Any samples showing high Hotelling's T 2 and Q residual values were considered to be outliers.PLS-DA and related quality assessment factors were implemented in R and are available in "http:// trams.chem.uoa.gr/ csmc/".
The risk of intubation of patients, the length of possible hospitalization, and patient discharge status from hospital were modelled using Random Forest (RF) [56].In these analyses, the clinical, comorbidity data and scaled data of the metabolites were fused and used collectively.Overall, 510 factors (combined Additional file 1: Tables S1 and S2) were used in RF structure to model the disease severity.To enable rapid determination of COVID-19 severity, only clinical measurements (min, mean and max values of each test) measured for 24 h during their hospital admission were considered.This clinical Decision Tree (DT) represents a collection of individual DTs and it was trained on the basis of bootstrapped resampling (for regression case, such as length of hospitalization) and out-of-bag (OOB) misclassification error [56].The risk of intubation was categorized into four groups on the basis of mechanical ventilation and/or intubation as follows: no risk (SARS-CoV-2-uninfected) (HCW), room air (class 1), low and high oxygen flow supplementation (class 2), and non-invasive (positive airway pressure, e.g., biphasic, BIPAP or continuous, CPAP) or invasive (intubation) oxygen supplementation (class 3).Discharge from hospital was a binary classification case: SARS-CoV-2-infected patients who survived and patients who died.The length of hospitalization was subjected to regression analysis and the uncertainty was calculated in terms of distribution of residuals.In addition, the quality of the models based on RF was evaluated internally and externally using training and independent test sets, and ROC curves.RF was implemented on the data using "rpart" and "randomForest" R packages.The interpretation of RF model was performed using SHAP and is based on game theory to explain the output of machine learning model.More details about SHAP can be found here [57].SHAP was implemented in python using SHAP python package (https:// pypi.org/ proje ct/ shap/).
b SARS-CoV-2-infected patients who during their hospitalization did not require any external oxygen supply.c SARS-CoV-2-infected patients who required low or high flow oxygen supply.dSARS-CoV-2-infected patients who required positive airway pressure (biphasic; BIPAP or continuous; CPAP) or were intubated.

Fig. 1
Fig.1Plasma metabolome differences in SARS-CoV-2-infected and uninfected subjects.Order of importance of individual metabolites (in PLS-DA model) in SARS-CoV-2-uninfected (healthy control, orange bar) and SARS-CoV-2-infected (blue bar) subjects.A The four metabolites most significantly down-regulated in infected patients (relative to uninfected subjects).B The four most significant metabolites up-regulated in infected patients (red symbol) relative to uninfected subjects (black symbols).C Metabolic pathways identified by untargeted metabolomics in the plasma of SARS-CoV-2-uninfected subjects (black symbols) and SARS-CoV-2-infected patients (red symbols).D Purine metabolism: adenosine monophosphate can be converted to inosine either by (i) deamination to form inosine monophosphate followed by dephosphorylation or (ii) dephosphorylation to form adenosine followed by deamination.Hypoxanthine, formed from inosine, can undergo oxidative hydroxylation to xanthine which can then be converted by xanthine oxidase to uric acid.Allantoin is formed from the reaction between uric acid and reactive oxygen species (ROS).E Tryptophan metabolism: In the kynurenine pathway (which accounts for ~ 95% of tryptophan degradation), tryptophan forms kynurenine (by tryptophan-2,3-dioxygenase (TDO) or indoleamine 2,3-dioxygenase (IDO)).Kynurenine can then undergo hydroxylation to 3-hydroxy kynurenine (by kynurenine 3-monooxygenase (KMO)).A minor degradation pathway involves tryptophan hydroxylation to 5-hydroxy-tryptophan (by tryptophan hydroxylase isoforms 1 and 2 (TPH1/2)) and then to serotonin and melatonin (by aromatic-L-amino-acid decarboxylase (AAAD)).In figures B-E, data are presented as the mean ± SD and each dot represents individual sample results.Dots outside the box plot are in the upper quartile (75th percentile) of the distribution and the dots inside the box plot are in the interquartile range (IQR), where 50% of the data are located.Outside the box plot are the patients that are outside the IQR range.The box plot is divided at the median.Probability values reflect results in SARS-CoV-2-infected patients being compared with uninfected subjects using a Student's unpaired t-test

Fig. 3
Fig. 3 Clinical decision tree (DT).A clinical DT model predicting the discharge disposition of a patient (survival or death) was developed.A The tree shows the rules applied to classify each patient into the related classes (survival or death).At the top of the DT, the overall proportion of the patients survived (95%) or died (5%) is shown.Next, the node applies the threshold over clinical data to achieve classification of patients into the two classes.For instance, it applies the threshold of 2.7 g/dL over Albumin_24_hours_min (minimum value obtained from the clinical data), the node evaluates whether if patients show Albumin_24_hours_min above 2.7.If yes, then the next decision rule in DT is at down to the root's left child node (Yes; depth 2).Ninety-one percent of patients will survive with a survival probability of ninety-nine percent.This way, inspecting the whole DT, the impact of features on the likelihood of survival can be derived.The percentage of patients at each node is provided below the probability values of survival (denoted as 1) or death (denoted as 2) on the DT; the green (survived) /blue (died) shows the fitted/estimated values for the patients in each class at given node.ROC curves for B training set and C test set.AUC provides an aggregate measure of performance across all possible classification thresholds

Fig. 4
Fig. 4 Random Forest (RF) machine learning algorithm to estimate the length of hospitalization of each SARS-CoV-2-infected patient admitted to the hospital.A Correlation plot between the RF estimated and the actual duration of hospitalization of SARS-CoV-2-infected patients; distribution of error (residuals) for training (B, red square) and test (C, blue square) sets.D The forty most significant factors in the structure of the RF model developed to predict the duration of hospitalization of SARS-CoV-2-infected patients (color figure online)

Fig. 5 A
Fig. 5 A pairwise comparison of classes by ROC curves in training (A) and test (B) sets; The ROC curves were derived pair-wise for the four risk classes, i.e., SARS-CoV-2-negative subjects (Control) SARS-CoV-2-positive patients who during their hospitalization did not require any external oxygen supply or required only a low flow of oxygen (Class 1), SARS-CoV-2-positive patients who required a high flow of oxygen (Class 2), and SARS-CoV-2-positive patients who needed positive airway pressure (biphasic; BIPAP or continuous; CPAP) or were intubated (Class 3).C The most significant factors in the structure of the Random Forest model developed to predict the risk of intubation due to SARS-CoV-2 infection.Note: Probability 1 (which is included in the most significant factors in the RF) is the output of the metabolomics results of the PLS-DA model showing the probability of healthy (Control) individuals CoV-2-infected patients who during their hospitalization did not require any external oxygen supply 2 62 SARS-CoV-2-infected patients who required low or high flow oxygen supply 3 20 SARS-CoV-2-infected patients who required positive airway pressure (biphasic; BIPAP or continuous; CPAP) or were intubated