Estimating prevalence of human traits among populations from polygenic risk scores

The genetic basis of phenotypic variation across populations has not been well explained for most traits. Several factors may cause disparities, from variation in environments to divergent population genetic structure. We hypothesized that a population-level polygenic risk score (PRS) can explain phenotypic variation among geographic populations based solely on risk allele frequencies. We applied a population-specific PRS (psPRS) to 26 populations from the 1000 Genomes Project for four phenotypes: lactase persistence (LP), melanoma, multiple sclerosis (MS) and height. Our models assumed additive genetic architecture among the polymorphisms in the psPRSs, as is conventional. Linear psPRSs explained a significant proportion of trait variance, ranging from 0.32 for height in men to 0.88 for melanoma. The best models for LP and height were linear, while those for melanoma and MS were nonlinear. As not all variants in a PRS may confer similar, or even any, risk among diverse populations, we also filtered out SNPs to assess whether variance explained was improved using psPRSs with fewer SNPs. Variance explained usually improved with fewer SNPs in the psPRS and was as high as 0.99 for height in men using only 548 of the initial 4208 SNPs. That reducing SNPs improves psPRS performance may indicate that missing heritability is partially due to complex architecture that does not mandate additivity, undiscovered variants or spurious associations in the databases. We demonstrated that PRS-based analyses can be used across diverse populations and phenotypes for population prediction and that these comparisons can identify the universal risk variants. Supplementary Information: The online version contains supplementary material available at 10.1186/s40246-021-00370-z.


Introduction
The prevalence of many phenotypes differs across populations. The causes of population disparity, though not always well understood, can be partially due to different frequencies of common causative alleles that are shared among populations and/or variation in environmental exposure across these same populations. However, it is also possible that population specific alleles affect prevalence. One way to increase our understanding of a trait's genetic architecture and population differences in disease prevalences is to determine if variants associated with risk in one or a few populations can be extrapolated to the phenotypic burden for other populations across the world. For example, some variants that are extremely common in some populations are very rare in others despite having large phenotypic effects (Jorde and Wooding 2004). It has been suggested that most heritability can be explained by variants associated with a specific phenotype that are not in the group of "core" variants thought to affect trait characteristics (Boyle et al. 2017). These "core" variants may, however, not necessarily be the same as variants determined to be most statistically associated, although there may be overlap. Examining variants that do or do not transfer among populations may help elucidate the concept of "core" genes.
Models of genetic architecture often assume that the effects of a trait's genetic components are additive, without interaction, and highly similar across populations. The assumption of additivity disregards potential complexity but can be implicitly tested by assessing how well a genetic model explains the genotype to phenotype relationship (Greene et al. 2009).
One additive model used to predict phenotypic status is the polygenic risk score (PRS), which can be somewhat informative in elucidating an individual's risk of a specific phenotype (Khera et al. 2018; Torkamani et al. 2018). A PRS is the sum of the known risk-associated loci of a phenotype based on the presence of the risk alleles in an individual and the average effects of these alleles.
Commonly, calculated PRSs also assume that a trait's genetic architecture is additive and that neither gene-gene nor gene-environment interactions are important factors. This method has seen some success, but often fails to predict an individual's disease status, especially at intermediate values of the PRS (Igo et al. 2019), possibly because translating population-level data to individual status is problematic and risks falling into the ecological fallacy. PRSs may also not be comparable among populations (Martin et al. 2017). In addition, PRSs may not predict individual risk across populations well because many of the variants and their effect sizes are derived from a limited ancestral group, as most research is done in European populations (Sirugo et al. 2019b). This may lead to a lack of PRS transferability across populations (Martin et al. 2017). Nonetheless, when a trait is multigenic or polygenic, the PRS is becoming a frequently used risk estimator. The role of the PRS in estimating prevalences among populations has not been explored as much as its role in individual risk, but it may point to key factors that are common.
One study on height in admixed European/African populations found that the predictive ability of a PRS for height was a function of the amount of European ancestry, supporting the idea that population-specific effect sizes and allele frequencies are important to its utility (Bitarello and Mathieson 2020). Another study used individual PRSs to estimate population-level disease prevalence (Evans et al. 2020), but the idea of using PRSs at a population level remains novel.
Diseases vary widely in complexity. Under simple architecture (i.e. additivity), a PRS weighted by allele frequencies at the population level should enable prediction of relative disease prevalence among populations, if the heritability is high and risk alleles are common among populations (Visscher et al. 2008). Therefore, in theory, we should be able to predict ranking of disease prevalence for simple traits, for which our understanding is relatively comprehensive with respect to the number and effect of risk loci. This will, presumably, allow us to predict population prevalence based on the correlation between prevalence and risk allele frequency using a weighted risk score and assuming locus additivity. We hypothesize that risk allele frequencies as compiled into a PRS are proportional to population prevalence and the change in prevalence based on specific variants is proportional to their importance in disease presentation, i.e., effect size, among populations. In this paper, we assessed the ability of PRSs, in traits of varying presumed genetic complexity, to explain population differences in prevalence and to assess whether the components in a PRS act additively in their contribution to disease prevalence. We also examined whether SNPs have universal effects by adjusting the number in each PRS.

Phenotypes
As proof of principle, we explored four phenotypes of differing presumed genetic complexity: lactase persistence, melanoma, multiple sclerosis, and height. As lactase persistence is monogenic, albeit with allelic heterogeneity, it is a genetically simple trait. Melanoma is dependent on both environment and a small number of known loci and is therefore likely oligogenic. Multiple sclerosis is a presumably moderately complex polygenic trait, with hundreds of associated alleles and several environmental factors. Height is a highly complex and heritable phenotype with thousands of associating alleles, making it essentially omnigenic.

Lactase Persistence
Lactase persistence into adulthood is a monogenic autosomal dominant trait caused by one or more of several mutations affecting the expression of LCT, the gene encoding lactase. Lactase persistence is reasonably well understood genetically in some, but not all, populations. Lactase is the enzyme that our bodies produce to help break down lactose, the sugar found in milk. The production of lactase usually decreases after weaning, in some cases leading to an intolerance of lactose. Lactase persistence shows strong evidence of selection, although why and when is a matter of debate (Gerbault et al. 2011; Plantinga et al. 2012; Segurel and Bon 2017; Segurel et al. 2020). It is, however, believed to be associated with the advent of dairy farming. Individuals who are lactose intolerant can often consume a moderate amount of dairy, especially if processed into foods such as cheese and yogurt.
In Europe, two alleles upstream of the LCT gene, −13910*T (rs4988235) and −22018*G (rs182549), have been identified as conferring lactase persistence. In populations outside of Europe, other alleles have been associated with lactase persistence, where it exists (Jones et al. 2013;Liebert et al. 2017;Ranciaro et al. 2014;Tishkoff et al. 2007). A total of 11 SNPs has been associated with lactase persistence (Table S1). The prevalence of lactase persistence varies among populations around the world. For example, 92% of people in Great Britain are lactase persistent, whereas, in Vietnam the prevalence is only 2% (Table 1). We tested the expected relative frequency of lactase persistence based on a PRS, including all of the variants known, to date, to see if we could predict relative prevalences, especially in populations that appear to carry the less penetrant alleles.

Melanoma
A moderately complex oligogenic disease with 39 associated GWAS SNPs (Table S2), melanoma is a skin cancer that is both heritable and dependent, to an extent, on environmental factors, especially sun exposure. Although considered rare, melanoma is responsible for most skin cancer deaths and the incidence is increasing, due partially to improved diagnosis (Chang et al. 2014). Most cases of melanoma are caused by somatic mutations from exposure to ultraviolet light, although the above noted germline variants have been identified as conferring risk.
There is significant variation in melanoma prevalence globally, with the lowest rate in Vietnam and highest in Finland (Table 1). As melanin is protective, melanoma is higher in prevalence in populations of lighter skin color. However, non-European populations have a higher risk of mortality, possibly because melanoma is harder to detect in darker skin, and detection and treatment is late in the course of the disease (Dimitriou et al. 2018). There is some indication, also, that skin color modifies the genetic architecture of melanoma (Hulur et al. 2017).
The heritability of melanoma ranges from 19% to 58% (Lu et al. 2014;Mucci et al. 2016;Shekar et al. 2009). However, while known melanoma predisposing genes range in penetrance and frequency, the heritability in families explained by known genes is still only ~50%, indicating missing heritability and uncertain genetic architecture (Read et al. 2016).

Multiple Sclerosis
Multiple sclerosis (MS) is an autoimmune neurologic disorder affecting the central nervous system. It is a relatively complex phenotype, dependent on both environmental exposures and genetics. Environmental factors include past Epstein-Barr virus infection, vitamin D insufficiency (Pierrot-Deseilligny and Souberbielle 2013, 2017) and cigarette smoking. MS also has a "latitude-gradient effect", i.e., the prevalence of MS is greater at higher latitudes, but there are some exceptions within Italy and Scandinavia (Simpson et al. 2011). A total of 372 SNPs have been identified by GWAS as associating with MS (Table S3). Estimates of both prevalence and heritability vary among studies. MS is more common in women (70%-75% of cases) (Schwendimann and Alekseeva 2007) and people of European descent (Milo and Kahana 2010). Studies vary on the heritability of MS: one done in Australia, multiple European countries, and US states showed moderate heritability (~20%) (International Multiple Sclerosis Genetics Consortium 2018), whereas a Swedish study showed a much higher heritability of 64% (36%-76%) (Westerlind et al. 2014).

Height
As a truly polygenic trait, human height is both complex and highly heritable (Lango Allen et al. 2010; Lettre 2011). In addition to the 4,388 variants currently found to associate with this phenotype by GWAS (Table S4), height is also dependent on environmental factors, including diet (Yeboah 2017). There are also differences in average height between men and women and between global populations. The average height for men ranges from 163.8 cm in Bangladesh to 179.6 cm in Finland. For women, average height ranges from 150.8 cm to 165.9 cm, also in Bangladesh and Finland, respectively (Table 1). Height is less heritable in women than men (0.68 to 0.84 vs. 0.87 to 0.93, respectively) (Silventoinen et al. 2003). Male and female population average heights are highly, but not completely, correlated (r² = 0.84), potentially leading to some differences in the genetic models between sexes.

Allele and Prevalence Data Collection
Associated alleles for each phenotype were identified by a literature search and by accessing the alleles that have been identified by GWAS from the GWAS catalog at p < 1 × 10⁻⁵. We chose to use this as the threshold for significance in our initial analyses, but report differences by p-value threshold as well. Prevalence data for each phenotype in each population came, similarly, from literature searches and from databases devoted to specific traits (cancer, height). For lactase persistence, prevalence was calculated as 1 minus the proportion of lactose intolerance in a population. An attempt was made to keep the sources as similar as possible for each population (Table 1).

Genomes
To assess the role of PRS in predicting population phenotype distributions, we chose to use only the populations included in The International Genome Sample Resource (IGSR) from the 1000 Genomes Project (Table 2) as our study populations. Each ethnic population in the IGSR belongs to a larger super-population defined as: East Asian (EAS), South Asian (SAS), European (EUR), African (AFR) and Admixed American (AMR). The allele frequencies of known risk alleles defined in the GWAS catalog and literature were extracted from the 1000 Genomes data using the Ensembl REST API.
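As an illustration of this extraction step, the following is a minimal Python sketch and not the authors' actual script. It assumes the public Ensembl REST variation endpoint with the `pops=1` option, and it assumes that 1000 Genomes populations appear in the response under names of the form `1000GENOMES:phase_3:<code>`; both should be checked against the current Ensembl documentation. The helper names and the mock record (with made-up frequencies) are hypothetical.

```python
import json
import urllib.request

ENSEMBL_REST = "https://rest.ensembl.org"  # public server; assumed reachable


def fetch_variant(rsid):
    """Query the Ensembl REST variation endpoint, asking for population data.

    Not called in this sketch, because it needs network access.
    """
    url = f"{ENSEMBL_REST}/variation/human/{rsid}?pops=1"
    request = urllib.request.Request(url, headers={"Content-Type": "application/json"})
    with urllib.request.urlopen(request, timeout=30) as response:
        return json.load(response)


def allele_frequencies(record, allele, prefix="1000GENOMES:phase_3:"):
    """Map population code -> frequency of `allele` for 1000 Genomes populations.

    The `populations` list and its keys reflect the assumed response shape.
    """
    freqs = {}
    for entry in record.get("populations", []):
        name = entry.get("population", "")
        if name.startswith(prefix) and entry.get("allele") == allele:
            freqs[name[len(prefix):]] = entry["frequency"]
    return freqs


# Hand-written mock record in the assumed response shape (frequencies are made up).
mock_record = {
    "name": "rs_example",
    "populations": [
        {"population": "1000GENOMES:phase_3:GBR", "allele": "A", "frequency": 0.7},
        {"population": "1000GENOMES:phase_3:GBR", "allele": "G", "frequency": 0.3},
        {"population": "1000GENOMES:phase_3:KHV", "allele": "G", "frequency": 1.0},
        {"population": "gnomADg:ALL", "allele": "A", "frequency": 0.4},
    ],
}
risk_freqs = allele_frequencies(mock_record, "A")
```

In practice, the allele reported by the API would need to be matched to the risk allele from the GWAS catalog, including strand, before any frequencies are summed.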

Polygenic risk scores
Under the assumption that the genetic architecture of a phenotype is additive, we used a PRS to account for the genetic risk in each of our study populations, based on the frequency of the disease-causing alleles to estimate the relative presence of the phenotype in that population.
As previously mentioned, in individuals this is done by simply summing the number of risk alleles that an individual possesses, usually GWAS hits, for the specific phenotype. Another approach is to weight each allele in the score by the effect size and/or the allele frequency.
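The individual-level score described here can be sketched in a few lines of Python. This is an illustrative sketch only: the SNP labels, genotype counts and weights are hypothetical, and defaulting an unweighted SNP to a weight of 1.0 is an assumption made for the example.

```python
def individual_prs(genotypes, weights=None):
    """Sum risk-allele counts (0, 1 or 2 per SNP), optionally weighted.

    genotypes: dict mapping SNP label -> number of risk alleles carried.
    weights:   optional dict mapping SNP label -> effect size; SNPs without
               a weight default to 1.0 (an assumption of this sketch).
    """
    if weights is None:
        weights = {}
    score = 0.0
    for snp, count in genotypes.items():
        if count not in (0, 1, 2):
            raise ValueError(f"risk-allele count for {snp} must be 0, 1 or 2")
        score += weights.get(snp, 1.0) * count
    return score


# Hypothetical individual carrying three risk alleles across three SNPs.
person = {"snp_a": 2, "snp_b": 0, "snp_c": 1}
unweighted = individual_prs(person)
weighted = individual_prs(person, {"snp_a": 0.5, "snp_b": 1.2, "snp_c": 0.25})
```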
However, for a population-specific PRS (psPRS), effect sizes may not be transferable (Sirugo et al. 2019a) and, as long as the direction of effect is the same, the role that any variant plays in prevalence should be proportional to the frequency of the risk allele in that population. We have structured the psPRS without effect size weighting, as there is often little to no information on effect sizes/OR of the risk alleles in different populations. In addition, many of the associating SNPs do not have reported effect sizes in the data sources available. Therefore, we calculated our psPRS only from the population allele frequencies. Our expression for the psPRS is simply the sum of the frequencies for the risk alleles in each population. For a population in the 1000 Genomes database, psPRS is the PRS for that population and F_i is the allele frequency of SNP_i:

$$\mathrm{psPRS} = \sum_{i=1}^{n} F_i$$

where n is the number of risk SNPs included for the phenotype. We then performed a linear regression for each phenotype to establish the relationship between the psPRS and the population prevalence of that phenotype (Table S5).
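To make the calculation concrete, the sketch below computes an unweighted psPRS per population and regresses prevalence on it by ordinary least squares. It is a minimal illustration using only the Python standard library; the population labels, allele frequencies and prevalences are invented for the example and are not taken from Table 1 or Table S5.

```python
def ps_prs(risk_allele_freqs):
    """psPRS for one population: unweighted sum of risk-allele frequencies."""
    return sum(risk_allele_freqs)


def linear_fit(x, y):
    """Least squares for y = intercept + slope * x; returns (slope, intercept, r2)."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    syy = sum((yi - mean_y) ** 2 for yi in y)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    r2 = (sxy * sxy) / (sxx * syy)  # squared Pearson correlation
    return slope, intercept, r2


# Hypothetical risk-allele frequencies (three SNPs) and prevalences.
freqs = {
    "POP_A": [0.10, 0.05, 0.20],
    "POP_B": [0.40, 0.30, 0.25],
    "POP_C": [0.70, 0.60, 0.35],
    "POP_D": [0.05, 0.10, 0.15],
}
prevalence = {"POP_A": 0.20, "POP_B": 0.55, "POP_C": 0.90, "POP_D": 0.15}

pops = sorted(freqs)
scores = [ps_prs(freqs[p]) for p in pops]
slope, intercept, r2 = linear_fit(scores, [prevalence[p] for p in pops])
```

Because the score is an unweighted sum, every SNP contributes on the same scale, so a common allele with a small effect moves the psPRS as much as a common allele with a large effect.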

Maximization of the coefficient of determination sensitivity analysis
We performed a sensitivity analysis, filtering SNPs to maximize the coefficient of determination (r²), i.e., the square of the correlation coefficient (r), in order to identify the alleles that make the relationship between the population psPRS and the phenotypic prevalence or trait mean the strongest. This method prioritizes the SNPs that shift the populations closer to the regression line. We assessed the effect of removing each SNP from the psPRS and ordered the SNPs by the r² value calculated for the linear regression between the population PRS without that SNP and the population prevalence.
We then permanently removed SNPs that resulted in the model with the remaining SNPs having the largest r². We then recalculated the r² values for the model with only the remaining SNPs (Tables S1-S4, S6 and Figures S6-S10). We repeated this process until the r² value reached a maximum. Under the assumption of additivity, the model with the largest r² was expected to include all truly associating SNPs with universal effects (Table 3). Our approach tested this implicitly.
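The filtering procedure can be read as a greedy backward elimination. The sketch below is one possible reading, not the authors' code: it removes one SNP per round and stops as soon as no single removal raises r², both of which are assumptions, and the toy frequencies are invented.

```python
def r_squared(x, y):
    """Squared Pearson correlation; 0.0 if either variable is constant."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    syy = sum((yi - mean_y) ** 2 for yi in y)
    if sxx == 0 or syy == 0:
        return 0.0
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    return sxy * sxy / (sxx * syy)


def maximize_r2(freqs, prevalence, tol=1e-12):
    """Greedy backward elimination of SNPs from an unweighted psPRS.

    freqs: dict SNP label -> list of risk-allele frequencies, one per
           population, in the same order as `prevalence`.
    Returns (kept SNPs, removed SNPs in order of removal, final r2).
    """
    def scores(snps):
        return [sum(freqs[s][k] for s in snps) for k in range(len(prevalence))]

    kept = list(freqs)
    best = r_squared(scores(kept), prevalence)
    removed = []
    while len(kept) > 1:
        # r2 of the model that omits each remaining SNP in turn
        trials = [
            (r_squared(scores([s for s in kept if s != c]), prevalence), c)
            for c in kept
        ]
        trial_r2, candidate = max(trials)
        if trial_r2 <= best + tol:  # no single removal improves the fit
            break
        kept.remove(candidate)
        removed.append(candidate)
        best = trial_r2
    return kept, removed, best


# Toy data: one SNP tracks prevalence, one is noise, one is constant.
toy_freqs = {
    "snp_signal": [0.1, 0.3, 0.6, 0.9],
    "snp_noise": [0.9, 0.2, 0.8, 0.1],
    "snp_flat": [0.5, 0.5, 0.5, 0.5],
}
toy_prevalence = [0.1, 0.3, 0.6, 0.9]
kept_snps, removed_snps, best_r2 = maximize_r2(toy_freqs, toy_prevalence)
```

Note that r² ignores the sign of the slope, so a procedure of this kind can also reward a strongly negative relationship; checking the direction of the final regression is a sensible safeguard.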

Lactase Persistence
We identified 11 SNPs associated with lactase persistence in the literature (Table S1). We used these SNPs to build our LP PRS for each population, using allele frequencies from the 1000 Genomes Project. We found a strong relationship between the PRS and the population prevalence of lactase persistence, with an r² value of 0.65 (Figure 1A, p-value: 1.84 × 10⁻⁶).
The relationship was especially strong amongst European populations, but less so for South Asian and Amerindian populations. However, in East Asian and African populations, the PRS failed to account for much, if any, of the relationship between the known lactase persistence alleles and the population prevalence (Figure S1A). Our sensitivity analysis (Figure S6, Table S1) based on r² maximization showed that keeping only 4 specific SNPs (Table S7) maximized the r² (r² = 0.67, p-value: 9.13 × 10⁻⁷) and, although the r² did not increase by much (0.65 to 0.67), the slope of the linear regression changed from 0.45 to 0.92 (Figure 1B). The position of the populations that had high European allele content changed quite a bit, as one of the alleles that was filtered out of the PRS was one of the two alleles originally identified in Europeans. This is not a surprise, because the European alleles are in strong LD (in the European populations), and of the alleles tested these are the only ones in LD in Europe.
Within super populations, the relationships varied considerably. In the African subpopulations, the trend of the linear regression was slightly negative before maximization but slightly positive after ( Figure S1B). The admixed African American population (ASW) has the highest PRS, but relatively low prevalence of LP (25%). This is due to the presence of the European alleles in the ASW population that are not present in West Africa. In the east Asian populations (EAS), the trend is also negative, but after the maximization of r 2 , there were no SNPs retained that existed in the EAS populations. In the European populations (EUR), the trend was positive and stayed positive after maximization. In the south Asian populations (SAS), the trend was again positive and stayed so after maximization.

Melanoma
Of the 39 GWAS SNPs, 37 were also in the 1000 Genomes Project (Table S2). The relationship of the melanoma PRS with these 37 associating SNPs to the population prevalence appears to be nonlinear (Figure 2A). We applied three different types of regression: linear, polynomial and exponential. The one that explained the relationship the best was the second-order polynomial regression (r² = 0.78, p-value: 2.19 × 10⁻⁷); the exponential model was next best (r² = 0.66, p-value: 2.7 × 10⁻⁶) and the linear the worst (r² = 0.59, p-value: 1.71 × 10⁻⁵), although all were significant. The overall relationship of the psPRSs and the population prevalences reflects the fact that the highest prevalence and psPRSs are in European populations. East Asian populations had the lowest PRSs and prevalence. South Asian populations clustered with some Amerindian populations with low to medium PRSs. African populations had medium PRSs, but low melanoma prevalence.
With the 16 SNPs that remained after the maximization analysis (Table S8, Figure S7), the relationship between the melanoma population PRS and the population prevalence appeared to remain nonlinear, similar to the original model, but with an improved explanation of variance and significance (linear regression: r² = 0.88, p-value: 2.81 × 10⁻¹¹) (Figure 2B). We also explored both polynomial (r² = 0.94, p-value: 7.36 × 10⁻¹³) and exponential relationships (r² = 0.77, p-value: 3.39 × 10⁻⁸). These models all performed better than the full PRS model.
When we separated populations according to their super populations, we observed that, apart from the Asian populations, the correlations were positive, but of varying strength (Figure S2A). However, none of the relationships were significant, perhaps due to the relatively small sample size. These results indicate that the significant correlation is driven by relationships among the continental populations that are not identical to each other. After maximization, the positive and negative trends were as described above, with the Asian populations staying negative and the EUR, AFR, and AMR remaining positive (Figure S2B). The correlations did not improve substantially within super populations and remained non-significant using the reduced number of SNPs.
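A comparison of linear, second-order polynomial and exponential fits of this kind can be sketched as follows. This is an illustrative sketch, not the authors' analysis: it assumes the exponential model is fitted by least squares on log-transformed prevalence and that r² for every model is computed on the original scale as 1 - SS_res/SS_tot; the text does not state how either was done. The data are invented.

```python
import math


def solve(a, b):
    """Solve a small linear system by Gauss-Jordan elimination with pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                factor = m[r][col] / m[col][col]
                m[r] = [v - factor * p for v, p in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]


def polyfit(x, y, degree):
    """Least-squares polynomial coefficients via normal equations, lowest order first."""
    size = degree + 1
    a = [[sum(xi ** (i + j) for xi in x) for j in range(size)] for i in range(size)]
    b = [sum(yi * xi ** i for xi, yi in zip(x, y)) for i in range(size)]
    return solve(a, b)


def evaluate(coefs, xi):
    """Evaluate a polynomial with coefficients given lowest order first."""
    return sum(c * xi ** k for k, c in enumerate(coefs))


def r2(y, fitted):
    """Coefficient of determination on the original scale."""
    mean_y = sum(y) / len(y)
    ss_res = sum((yi - fi) ** 2 for yi, fi in zip(y, fitted))
    ss_tot = sum((yi - mean_y) ** 2 for yi in y)
    return 1 - ss_res / ss_tot


def compare_models(x, y):
    """Return r2 for linear, quadratic and exponential fits (requires y > 0)."""
    lin = polyfit(x, y, 1)
    quad = polyfit(x, y, 2)
    log_lin = polyfit(x, [math.log(yi) for yi in y], 1)
    return {
        "linear": r2(y, [evaluate(lin, xi) for xi in x]),
        "polynomial": r2(y, [evaluate(quad, xi) for xi in x]),
        "exponential": r2(y, [math.exp(evaluate(log_lin, xi)) for xi in x]),
    }


# Hypothetical psPRS values and prevalences with a roughly exponential trend.
x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
y = [1.2, 1.9, 4.1, 8.3, 15.8, 32.5]
fits = compare_models(x, y)
best_model = max(fits, key=fits.get)
```

Because the linear model is nested within the quadratic one, the quadratic r² can never be lower than the linear r², so a higher r² alone does not show that the extra term is warranted.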

Multiple Sclerosis
For the full psPRS-prevalence multiple sclerosis model, we used 368 SNPs associated with MS that were in both the GWAS catalog and the 1000 Genomes Project (Table S3). The resulting relationship appears to be nonlinear (Figure 3A). We explored three different models for the regression: linear, polynomial and exponential. As with melanoma, the model that explained the largest proportion of the variance was the second-order polynomial (r² = 0.80, p-value: 3.94 × 10⁻⁸). The worst was the linear model (r² = 0.47, p-value: 2.12 × 10⁻⁴), while the exponential model was intermediate (r² = 0.64, p-value: 2.59 × 10⁻⁶).
The linearity remains, even when the European populations are removed. Within the super populations, the prevalences and psPRSs become more highly correlated and the relationships, apart from the South Asian populations, are significant ( Figure S3A).
The super populations clustered, with the European populations having the highest prevalence and psPRSs. The African populations had the lowest PRSs and prevalences, with the east and south Asian mixed with the admixed Amerindian with medium prevalences and PRSs.
When the super populations were examined individually, the linear correlations were all positive, with strengths ranging from EAS (r² = 0.0459) to AFR (r² = 0.4336). However, again, none of these relationships were significant ( Figure S3B).

Height
Because height has quite different ranges for men (~164 cm to ~180 cm) and women (~151 cm to ~166 cm) (Table 1), we examined the relationship between population average height and population PRS in each sex separately. The full PRS-population average height model included 4208 SNPs from the GWAS catalog and the 1000 Genomes Project for both men and women (Table S4). The relationships for both males and females between the population PRS and the population average height (cm) appear to be linear (Figure 4A and 4B). However, the regressions for men and women are different, with noticeable differences in the slopes of the regression lines, the correlations and the significance of the relationships (male: r² = 0.32, P-value: 2.55 × 10⁻³; female: r² = 0.11, P-value: 0.0992).
The populations generally clustered by super populations, with European populations being both the tallest and having the largest psPRSs for both men and women. The south Asian and Amerindian were the shortest groups, but with medium PRSs. African and east Asian populations had medium to tall height, but the lowest PRSs.
Within the African super population, the relationship between average height and population PRS was positive in both males and females. Both south Asian and Amerindian populations had positive relationships as well. However, surprisingly the European and east Asian populations had negative relationships ( Figures S4A and S5A).
The sensitivity analysis reduced the number of SNPs for the male model to 548 and for the female model to 188 (Figures 4C and 4D, S9-S10, Tables S6, S10-S11). The reduced male and female linear models changed substantially (male: slope from 0.06 to 3.92; female: slope from 0.03 to 3.86). The correlation strengthened for both males and females (male: r² = 0.99; female: r² = 0.98); the relationship became more significant in males and reached significance in females (male and female: P-value < 2 × 10⁻¹⁶).
After the maximization filtering, the positions of the populations shifted substantially. The South Asians had lower PRSs to match their lower average height. Europeans still had the highest psPRSs and the African and East Asian populations were mixed. Within the super populations, the relationships all became positive for both men and women (Figures S4B and S5B).

Effect of P-value thresholds for SNP selections
As we used only a moderately stringent threshold for the SNPs from the GWAS catalog, we wished to know if the maximization analysis selected SNPs that were statistically significant, i.e., with p-values of genome-wide significance. We found, using Fisher's exact test, that there was no significant enrichment of GWAS SNPs with a p-value less than 5 × 10⁻⁸, except for height in women (Table 4).
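An enrichment test of this kind operates on a 2×2 table of SNP counts (retained vs. removed, genome-wide significant vs. not). The sketch below computes a two-sided Fisher's exact p-value from the hypergeometric distribution using only the Python standard library. Whether the original analysis was one- or two-sided is not stated, so the two-sided form is an assumption, and the counts are hypothetical rather than those in Table 4.

```python
from math import comb


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact p-value for the 2x2 table [[a, b], [c, d]].

    Sums the hypergeometric probabilities of all tables with the same
    margins that are no more likely than the observed table.
    """
    row1, row2, col1 = a + b, c + d, a + c
    total = row1 + row2

    def prob(k):
        return comb(row1, k) * comb(row2, col1 - k) / comb(total, col1)

    p_observed = prob(a)
    low, high = max(0, col1 - row2), min(row1, col1)
    # small relative tolerance so ties with the observed table are included
    return sum(
        p for p in (prob(k) for k in range(low, high + 1))
        if p <= p_observed * (1 + 1e-7)
    )


# Hypothetical counts: rows = retained / removed by the maximization,
# columns = genome-wide significant / not genome-wide significant.
p_value = fisher_exact_two_sided(12, 28, 30, 130)
```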

Discussion
Overall, our psPRS method estimated population prevalence quite well. This indicates that the population PRS is a reasonably good indicator of disease presence in a population. For lactase persistence, we found that the psPRSs and the prevalence were strongly correlated, even before SNP filtering. For melanoma and MS, we also found strong correlations, albeit nonlinear ones. However, for height the correlations, while linear, were weaker. As expected, the complexity of the phenotype did affect the ability of the full PRS model to predict the population prevalence, with the fit sometimes being far from what would have been expected and being nonlinear, i.e., for melanoma and MS. Also, the SNP-pruned models improved the explained variance over the complete psPRSs, sometimes substantially, and the relationships achieved or approached linearity when the complete models did not. Although this can be viewed as "cherry picking", it does reveal that not all detected SNPs have similar effects across populations and that some may reflect effects that are universal as opposed to population specific. Our results show that the European populations often skew the overall full model and, with the exception of height, fit the PRS predictions best. This is not surprising, as most of the SNPs were discovered in populations of European descent (Sirugo et al. 2019b). We also repeatedly observed that there were not as many significant correlations within the super populations as between them, which may reflect the paucity of data within them.
Generally, the model of LP followed what was expected, as it is a monogenic trait. For LP in the African populations, the disparity between the observed prevalence in some populations and our psPRS model shows that our ability to predict prevalence is likely impacted by unidentified associating alleles or other mechanisms by which lactose is digested, perhaps acquired gut microbiome activity (Goodrich et al. 2017). This is supported by the negative and weak relationship in the full data set, although likely impacted by the admixed ASW population, where the European alleles exist but do not seem to confer lactase persistence to the extent that the psPRS would predict; nonetheless, this African descent population still had the highest prevalence of lactase persistence. Another possible reason for the psPRS not predicting prevalence in Africa well is that there may be context-dependent effects. For example, it has been found that the −13915*G DNA polymorphism that is associated with lactase persistence in Africa interacts with Oct-1 (Olds et al. 2011).
Given the known impact of environmental exposure on the development of melanoma (Dimitriou et al. 2018), the observed nonlinearity of the relationship between the population PRSs and the prevalence of the disease was not unexpected. That the nonlinearity continued after the filtering, implies that the actual relationship between the psPRS and the prevalence may be non-additive, and that we are missing key factors, either genetic interactions, environmental interactions, or both. Because we did not consider environmental factors in this study, we were not able to differentiate between the two. It has, however, been shown that at least one SNP pair at the TERF1 and the AFAP1L2 loci does interact to affect risk of melanoma (Brossard et al. 2015).
While the relationship observed with the full MS PRS model was nonlinear, the model after filtering of SNPs resulted in a strong linear model. This might indicate that there is some genetic interaction in MS, especially given that the r² improves as we drop SNPs from the psPRS model. Indeed, a DDX39B variant interacts with allelic variants in IL7R exon 6 to increase MS risk (Galarza-Munoz et al. 2017). Interaction with environmental factors has also been shown. Specifically, latitude, EBV infection, smoking and adolescent obesity interact with risk alleles at the HLA locus to increase risk of MS (Olsson et al. 2017).
While the relationship of the psPRS to population average height with the full model shows a relatively weak, though significant, positive correlation, the result of the maximization shows a very strongly correlated relationship. Although height is highly heritable, this was not expected given the known impact of environment on height, especially in women (Silventoinen et al. 2003). It may be that some of the variants left in the final model are correlated with environmental parameters due to past selection. Also, there may be epistasis in the genetic architecture of height. For example, genetic interaction was found between loci 6p21 and 2q21 to account for some of the variation in height (Liu et al. 2006).
We infer from our results that the r² maximization sensitivity analysis is filtering out the SNPs whose frequencies do not follow the population prevalence distribution of the phenotype in some, but maybe not all, populations. This is, in effect, similar to a previous method, Evolutionary Triangulation (Huang et al. 2016), where we filtered SNPs based specifically on their distribution relative to disease prevalence. Our results showing that pruning the SNPs in the model improves performance may be revealing heterogeneous effect sizes that may be present due to context-dependent effects, such as epistasis or gene × environment interactions, spurious associations, or other population-specific effects. This may explain why a filtered model is superior in some cases to a model with all associating SNPs included. This approach is, in essence, removing noisy data. psPRSs provide some explanation of population differences but are less effective when all SNPs are included. This indicates that PRSs have value but must be refined to improve prediction.
Our investigation as to whether GWAS significance was a useful threshold for inclusion indicated that it was not good at predicting which SNPs would end up in our pruned SNP set. As shown by our investigation of whether our model enriched for SNPs with a smaller p-value in the GWAS catalog, we can conclude that GWAS p-value is not always the best indicator of the value of a SNP in the PRS model. This does justify, to some extent, our use of SNPs that were not genome-wide significant at 5 × 10⁻⁸ and indicates that some care should be used in determining the importance of SNPs in models based solely on significance of p-values.
Understanding the relationship between allele frequency and disease prevalence will lead to further understanding of genetic influence, environmental pressure and gene-environment interactions. The effects of genetic variation on public health present challenges for the exploration and management of these phenotypes worldwide, as most traits are primarily considered in the context of European descent. This blind spot, due partially to a lack of diversity in biomedical research, is detrimental not only to those populations that are understudied, but also to the understanding of the underlying genetic basis, or genetic architecture, of the trait itself, thereby possibly affecting understanding in all populations. Nonetheless, some of our results indicate that even SNPs discovered primarily in Europeans are useful, when included in a psPRS, for predicting trait variation, e.g., height.
Our results help to identify the populations in which we are missing the most information regarding genetic foundations of trait variation. This is underlined by some of our results where the population PRSs do not match the population prevalences, i.e., where the prevalence is high or medium and the psPRS is low, as in the cases of height and LP in African populations. That using a reduced number of SNPs improves the psPRS likely indicates that a certain portion of missing heritability is due to more complex architecture, i.e., genetic interaction, possibly differing by population, and that there are still undiscovered variants. However, our method helps to define the areas of the genetic landscape where our knowledge of genetic architecture is relatively complete and where it is not.

Figure 1. The correlation between lactase persistence and psPRS. The data points are colored according to the super populations: AFR (orange), AMR (olive), EAS (green), EUR (blue) and SAS (purple). A) Full model (r² = 0.65; p-value: 1.84 × 10⁻⁶). B) After maximization (r² = 0.67; p-value: 9.13 × 10⁻⁷).

Supplemental Data
The supplemental data includes ten figures and eleven tables.
Table S1. Lactase persistence full data set. SNP rs number and minor allele are included, as well as the r² values from the sensitivity analysis. The columns headed with the 1000 Genomes population codes are the allele frequencies for each SNP in those populations. The SNPs are listed in order of removal in the sensitivity analysis.
Table S2. Melanoma full data set. SNP rs number and minor allele are included, as well as the r² values from the sensitivity analysis. The columns headed with the 1000 Genomes population codes are the allele frequencies for each SNP in those populations. The SNPs are listed in order of removal in the sensitivity analysis.
Table S3. Multiple sclerosis full data set. SNP rs number and minor allele are included, as well as the r² values from the sensitivity analysis. The columns headed with the 1000 Genomes population codes are the allele frequencies for each SNP in those populations. The SNPs are listed in order of removal in the sensitivity analysis.
Table S4. Height full data set. SNP rs number and minor allele are included, as well as the r² values from the sensitivity analysis. The columns headed with the 1000 Genomes population codes are the allele frequencies for each SNP in those populations. The SNPs are listed in order of removal in the sensitivity analysis.
Table S5. PRS values for each population, before and after maximization.
Table footnotes: 1, P ≤ 5 × 10⁻⁸; 2, 1 × 10⁻⁵ > P > 5 × 10⁻⁸.