Network machine learning maps phytochemically rich “Hyperfoods” to fight COVID-19

In this paper, we introduce a network machine learning method to identify potential bioactive anti-COVID-19 molecules in foods based on their capacity to target the SARS-CoV-2-host gene-gene (protein-protein) interactome. Our analyses were performed on the DreamLab App supercomputing platform, which harnesses the idle computational power of thousands of smartphones. Machine learning models were first calibrated by demonstrating that the proposed method can predict anti-COVID-19 candidates among experimental and clinically approved drugs (5658 in total) targeting COVID-19 interactomics with a balanced classification accuracy of 80–85% in 5-fold cross-validated settings. This identified the most promising drug candidates that could potentially be “repurposed” against COVID-19, including common drugs used to combat cardiovascular and metabolic disorders, such as simvastatin, atorvastatin and metformin. A database of 7694 bioactive food-based molecules was then run through the calibrated machine learning algorithm, which identified 52 biologically active molecules from varied chemical classes, including flavonoids, terpenoids, coumarins and indoles, predicted to target SARS-CoV-2-host interactome networks. These molecules were in turn used to construct a “food map”, with the theoretical anti-COVID-19 potential of each ingredient estimated from the diversity and relative levels of candidate compounds with antiviral properties. We expect this in silico predicted food map to play an important role in future clinical studies of precision nutrition interventions against COVID-19 and other viral diseases.

Supplementary Information: The online version contains supplementary material available at 10.1186/s40246-020-00297-x.


Parameter optimization, accuracy estimation and results aggregation
Pearson correlation coefficients were calculated between the propagated profile of each drug or food molecule and the propagated disease profile derived from the coronavirus-affected gene sets.
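The correlation step can be illustrated with a minimal sketch, assuming each propagated profile is a vector of per-gene scores over the same ordered gene list. The gene count, compound names and values below are hypothetical toy data, not results from this study.

```python
from math import sqrt

def pearson(x, y):
    """Pearson correlation coefficient between two equal-length profiles."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("profiles must have the same length (>= 2)")
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant profile")
    return cov / sqrt(var_x * var_y)

# Hypothetical propagated profiles over the same five genes (toy values).
disease_profile = [0.9, 0.1, 0.4, 0.0, 0.6]
compound_profiles = {
    "compound_A": [0.8, 0.2, 0.5, 0.1, 0.7],
    "compound_B": [0.1, 0.9, 0.3, 0.8, 0.2],
}

# Score each compound against the disease profile, then rank best-first.
scores = {name: pearson(profile, disease_profile)
          for name, profile in compound_profiles.items()}
ranking = sorted(scores, key=scores.get, reverse=True)
```

In this toy example the compound whose profile rises and falls with the disease profile is ranked first.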
The best parameters were established through cross-validation, using 5 repeats of stratified 5-fold splitting for each parameter combination. Drugs were ranked by the correlation of their profiles with the disease profile. The class separation threshold was set to the value that minimized the difference between sensitivity and specificity. Balanced accuracy was used to select the best parameter combinations because of the high class imbalance.
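The threshold rule and the balanced accuracy metric can be sketched as follows. This is an illustration under stated assumptions: a compound is called positive when its correlation is at or above the threshold, candidate thresholds are the observed scores themselves, and the repeated stratified cross-validation loop is omitted. The function names and toy data are hypothetical.

```python
def sensitivity_specificity(scores, labels, threshold):
    """Sensitivity and specificity when score >= threshold is called positive."""
    pairs = list(zip(scores, labels))
    tp = sum(1 for s, y in pairs if y == 1 and s >= threshold)
    fn = sum(1 for s, y in pairs if y == 1 and s < threshold)
    tn = sum(1 for s, y in pairs if y == 0 and s < threshold)
    fp = sum(1 for s, y in pairs if y == 0 and s >= threshold)
    return tp / (tp + fn), tn / (tn + fp)

def pick_threshold(scores, labels):
    """Return (threshold, balanced accuracy) minimizing |sensitivity - specificity|."""
    best = None
    for t in sorted(set(scores)):
        sens, spec = sensitivity_specificity(scores, labels, t)
        gap = abs(sens - spec)
        if best is None or gap < best[0]:
            best = (gap, t, sens, spec)
    _, threshold, sens, spec = best
    # Balanced accuracy is the mean of sensitivity and specificity.
    return threshold, (sens + spec) / 2

# Toy data: three known positives (label 1) among eight compounds.
toy_scores = [0.9, 0.8, 0.3, 0.7, 0.4, 0.2, 0.1, 0.05]
toy_labels = [1, 1, 1, 0, 0, 0, 0, 0]
threshold, balanced_accuracy = pick_threshold(toy_scores, toy_labels)
```

Because balanced accuracy averages the per-class recall, a classifier that labels every compound as negative scores only 0.5, however few positives there are.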
The ensemble of parameter settings with balanced classification accuracies in the range of 80–84.9% was used to provide a consensus ranking of drug and food molecule candidates. The final ranking list for the two parameter sets was calculated using the geometric mean of the r-values and MADs, to guarantee that only the candidates scored highly using both sources of SARS-CoV-2 target genes would be at the top of the list. The r-value of each compound was calculated as the number of compounds in the "negative" class with a correlation coefficient higher than that of the given compound, divided by the total number of "negative" class compounds.
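The r-value definition translates directly into a few lines of code. The sketch below covers only the r-value itself; the aggregation with MADs is not reproduced because its details are not specified here. The function name and the toy correlations are hypothetical.

```python
def r_value(score, negative_scores):
    """Fraction of "negative" class compounds scoring strictly above `score`.

    Lower is better: 0.0 means no negative compound outscores the candidate.
    """
    if not negative_scores:
        raise ValueError("at least one negative-class score is required")
    higher = sum(1 for s in negative_scores if s > score)
    return higher / len(negative_scores)

# Toy correlations for five "negative" class compounds.
negatives = [0.1, 0.2, 0.3, 0.4, 0.5]

r_mid = r_value(0.35, negatives)   # two of five negatives score higher
r_top = r_value(0.60, negatives)   # no negative scores higher
r_low = r_value(0.00, negatives)   # every negative scores higher
```

Read this way, the r-value behaves like an empirical false-positive rate at the candidate's own score.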
The toxic compounds were removed from the presented lists using literature and T3DB. For food molecules, we have also excluded compounds which are present in trace amounts (e.g. minerals) and/or are of non-natural origin.

GSEA pathway analysis
Pathway analysis was performed using gene set enrichment analysis (GSEA) via the Python GSEApy package [1]. We applied the random walk propagation algorithm to the initial SARS-CoV-2 host interactome to simulate the effects of SARS-CoV-2 on human interactome networks. The resulting simulated genomic profile was used as input for the Prerank module of GSEA to find statistically significantly enriched pathways/gene sets.
KEGG v7.2 and Reactome v74 were used as default gene sets.
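To make the propagation step concrete, here is a minimal sketch of a random walk with restart on a small undirected network. The restart variant, the restart probability of 0.5, the convergence tolerance and the four-gene toy network are all assumptions made for illustration; the exact propagation settings are not given in this section. The resulting per-gene scores, sorted in descending order, have the form of a ranked gene list that a preranked GSEA tool takes as input.

```python
def random_walk_with_restart(adjacency, seeds, restart=0.5, tol=1e-10, max_iter=1000):
    """Propagate seed-gene signal over an undirected network.

    adjacency: dict mapping each node to the set of its neighbours (symmetric).
    seeds: set of nodes where the walk starts and restarts.
    Returns a dict of steady-state visiting probabilities per node.
    """
    nodes = sorted(adjacency)
    degree = {n: len(adjacency[n]) for n in nodes}
    start = {n: (1.0 / len(seeds) if n in seeds else 0.0) for n in nodes}
    prob = dict(start)
    for _ in range(max_iter):
        updated = {}
        for n in nodes:
            # Mass arriving from each neighbour, split evenly over its edges.
            inflow = sum(prob[m] / degree[m] for m in adjacency[n])
            updated[n] = (1 - restart) * inflow + restart * start[n]
        change = sum(abs(updated[n] - prob[n]) for n in nodes)
        prob = updated
        if change < tol:
            break
    return prob

# Hypothetical chain network G1 - G2 - G3 - G4 with a single seed gene G1.
toy_network = {
    "G1": {"G2"},
    "G2": {"G1", "G3"},
    "G3": {"G2", "G4"},
    "G4": {"G3"},
}
profile = random_walk_with_restart(toy_network, seeds={"G1"})

# Ranked gene list (highest propagated score first).
ranked_genes = sorted(profile, key=profile.get, reverse=True)
```

Genes close to the seed set keep most of the signal, while distant genes receive progressively less.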

Food map construction
The final food selection was based on the highest number and, where available, quantity of anti-SARS-CoV-2 food compounds, and is provided in Additional file 8. Concentrations of compounds within foods were extracted from the USDA Special Interest Database on Flavonoids [2].
An enrichment score for each food item was calculated as a weighted sum of the number of different molecules with anti-COVID-19 properties (phytochemical "diversity") and their relative abundance where the experimental concentration data of molecules was available across all foods studied here. The enrichment score is defined as: