Human chromosomal-scale length variation and severity of COVID-19 infection using the UK Biobank dataset.

The course of COVID-19 varies from asymptomatic to severe (acute respiratory distress, cytokine storms, and death) in patients. The basis for this range in symptoms is unknown. One possibility is that genetic variation is responsible for the highly variable response to infection. We evaluated how well a genetic risk score based on chromosome-scale length variation and machine learning classification algorithms could predict severity of response to SARS-CoV-2 infection. We compared 981 patients from the UK Biobank dataset who had a severe reaction to SARS-CoV-2 infection before 27 April 2020 to a similar number of age-matched patients drawn from the general UK Biobank population. For each patient, we built a profile of 88 numbers characterizing the chromosome-scale length variability of their germ line DNA. Each number represented one quarter of one of the 22 autosomes. We used the machine learning algorithm XGBoost to build a classifier that could predict whether a person would have a severe reaction to COVID-19 based only on their 88-number profile. We found that the XGBoost classifier could differentiate between the two classes at a significant level, p = 2 · 10 as measured against a randomized control and p = 3 · 10 as measured against the expected value of a random guessing algorithm (AUC = 0.5). However, we found that the AUC of the classifier was only 0.51, too low for a clinically useful test.
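The 88-number profile (22 autosomes, each split into four quarters) can be illustrated with a minimal sketch. It assumes each chromosome is represented by an ordered list of per-probe numeric values and that each quarter is summarized by its mean; the input layout, the use of the mean, and the names `quarter_means` and `build_profile` are assumptions made for illustration, not details confirmed by the text.

```python
from statistics import mean

N_AUTOSOMES = 22  # autosomes 1..22, as stated in the abstract
QUARTERS = 4      # each autosome contributes four numbers


def quarter_means(values):
    """Split an ordered list of per-probe values into four contiguous
    quarters and return the mean of each quarter (assumed summary)."""
    n = len(values)
    if n < QUARTERS:
        raise ValueError("need at least four values per chromosome")
    bounds = [i * n // QUARTERS for i in range(QUARTERS + 1)]
    return [mean(values[bounds[i]:bounds[i + 1]]) for i in range(QUARTERS)]


def build_profile(per_chromosome):
    """Concatenate the quarter summaries of autosomes 1..22.

    per_chromosome is a hypothetical dict mapping chromosome number
    to an ordered list of per-probe values.
    """
    profile = []
    for chrom in range(1, N_AUTOSOMES + 1):
        profile.extend(quarter_means(per_chromosome[chrom]))
    return profile


# Toy input: eight identical values per chromosome.
toy = {chrom: [float(chrom)] * 8 for chrom in range(1, N_AUTOSOMES + 1)}
profile = build_profile(toy)
print(len(profile))  # 88
```

The only property taken from the text is the shape of the result: 22 × 4 = 88 numbers per person.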

We segmented the dataset into three overlapping subsets.

The first, which we called "1930", contained all UK Biobank participants born after 1930 who had a severe reaction to SARS-CoV-2 infection before 27 April 2020. The other two subsets contained [...]

Using the CSLV-Covid-19 dataset, we selected all people who tested positive before 27 April 2020 and labelled these as people having a severe reaction to Covid-19. We segmented these into three overlapping datasets, as shown in Table 1. We constructed an age-matched control group of the same size with an age profile identical to that of the severe reaction group. The age-matched control group was selected from the entire UK Biobank dataset, [...] trained to classify a person in the dataset, consisting of those who had a severe reaction and age-matched controls, based solely on their chromosome-scale length variation data.

This version posted July 7, 2020. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity. It is made available under a CC-BY-NC-ND 4.0 International license.

The results are presented in Figure 1 and Table 2. As Figure 1 shows, we found a significant difference between all three age groupings and their corresponding random controls.
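The construction of an age-matched control group of the same size can be sketched as follows. This is a minimal illustration under stated assumptions: matching is done on birth year, people in the severe reaction group are excluded from the candidate pool, and the function name `age_matched_controls` and the `(person_id, birth_year)` record layout are hypothetical.

```python
import random
from collections import Counter, defaultdict


def age_matched_controls(cases, population, seed=0):
    """Draw one control per case with the same birth year.

    cases and population are lists of (person_id, birth_year) tuples.
    Excluding cases from the pool is an assumption of this sketch.
    """
    rng = random.Random(seed)
    case_ids = {pid for pid, _ in cases}

    # Group the eligible candidates by birth year.
    pool = defaultdict(list)
    for pid, year in population:
        if pid not in case_ids:
            pool[year].append(pid)

    # Sample, for each birth year, as many controls as there are cases.
    needed = Counter(year for _, year in cases)
    controls = []
    for year, count in sorted(needed.items()):
        if len(pool[year]) < count:
            raise ValueError(f"not enough candidates born in {year}")
        controls.extend((pid, year) for pid in rng.sample(pool[year], count))
    return controls


cases = [("c1", 1940), ("c2", 1940), ("c3", 1955)]
population = (
    cases
    + [(f"p{i}", 1940) for i in range(10)]
    + [(f"q{i}", 1955) for i in range(10)]
)
controls = age_matched_controls(cases, population)
```

Sampling within each birth year guarantees that the control group has the same size and the same birth-year distribution as the case group.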

This finding indicates that the germ line genetics of the infected patient, as represented by the set of chromosome-scale length variation numbers, has an effect on the severity of COVID-19. [...] classification model with an AUC of 0.51 is just slightly better than guessing.

[...] patients with a "severe reaction" to Covid-19 and an equal number [...] datasets and randomly permuted the status ("severe reaction" or "normal") and repeated the process. This randomly permuted dataset is labelled oldest birthyear "random". For each dataset, we [...]
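The random control described above, in which the status labels are permuted before the process is repeated, can be sketched in a few lines. The function name `permute_labels` and the 1/0 encoding of "severe reaction" and "normal" are assumptions made for illustration.

```python
import random


def permute_labels(labels, seed=None):
    """Return a shuffled copy of the status labels.

    Shuffling breaks any link between a person's profile and their
    status while keeping the number of people in each class unchanged.
    """
    rng = random.Random(seed)
    shuffled = list(labels)
    rng.shuffle(shuffled)
    return shuffled


labels = [1] * 5 + [0] * 5  # 1 = "severe reaction", 0 = "normal"
permuted = permute_labels(labels, seed=42)
```

A classifier trained on the permuted labels should perform at chance level, which is what makes the permuted dataset usable as a control.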

[...] datasets also showed significant differences between the mean AUC and 0.5. The three random controls did not show a significant difference between the mean AUC and 0.5, as expected.
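The comparison of a mean AUC with 0.5 can be illustrated with a short sketch. The AUC is computed here as the fraction of (positive, negative) pairs that the score ranks correctly, with ties counted as one half. The text does not state which significance test was used, so the one-sample t statistic below is an assumption, as are the function names.

```python
from math import sqrt
from statistics import mean, stdev


def auc(scores, labels):
    """Probability that a random positive scores higher than a random
    negative; ties count as one half."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("both classes must be present")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


def t_statistic_vs_chance(aucs, chance=0.5):
    """One-sample t statistic for the mean of repeated AUC
    measurements against the chance level (assumed test)."""
    return (mean(aucs) - chance) / (stdev(aucs) / sqrt(len(aucs)))


print(auc([0.9, 0.8, 0.3, 0.1], [1, 1, 0, 0]))  # 1.0
print(auc([0.1, 0.9, 0.8, 0.3], [1, 1, 0, 0]))  # 0.5

# Hypothetical AUC values from repeated runs, for illustration only.
t_value = t_statistic_vs_chance([0.51, 0.52, 0.50, 0.51])
```

A small but consistent excess over 0.5 across many repeated runs can yield a large test statistic even when the mean AUC itself stays close to chance.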
An AUC value of 0.5 represents a random classification test, one in which the algorithm is no better than guessing.

Discussion: The two conclusions of this study are divergent. First, a genetic difference exists [...]

Although the AUC we found here is too low to be clinically useful, several avenues for improving the AUC exist. We were constrained by the data available to compare those who had [...] better approach would be to compare those who had a severe reaction to Covid-19 with those who were asymptomatic or had a mild reaction. Simply having more data on those patients who had a severe reaction might also lead to an increase in AUC. We could also have more data on [...]

Our results add to the recent work done by others on the link between genetics and [...]

[...] applied the polygenic risk score to a hold-out group, they found that the mean score was indistinguishable between the group of people who had tested positive and the group that had no positive test. In comparison, our work found that these two groups are distinguishable with a genetic risk score, but only very slightly. We measured the AUC at 0.51. They do not report an AUC, but a test that cannot distinguish the two groups is equivalent to an AUC of 0.50.

Other more comprehensive metastudies have identified one specific genetic component [...] Host Genetics Initiative [10,11], also indicate a strong association in chromosome 3, but fail to reproduce the association in chromosome 9. The Covid-19 Host Genetics Initiative "ANA_B2" [...]

In conclusion, we found a significant difference exists between the structural genomics of [...]

The data used in this study was obtained from the UK Biobank under Application Number 47850.
