Identifying positive selection candidate loci for high-altitude adaptation in Andean populations

High-altitude environments (>2,500 m) provide scientists with a natural laboratory to study the physiological and genetic effects of low ambient oxygen tension on human populations. One approach to understanding how life at high altitude has affected human metabolism is to survey genome-wide datasets for signatures of natural selection. In this work, we report on a study to identify selection-nominated candidate genes involved in adaptation to hypoxia in one highland group, Andeans from the South American Altiplano. We analysed dense microarray genotype data using four test statistics that detect departures from neutrality. Using a candidate gene, single nucleotide polymorphism-based approach, we identified genes exhibiting preliminary evidence of recent genetic adaptation in this population. These included genes that are part of the hypoxia-inducible transcription factor (HIF) pathway, a biochemical pathway involved in oxygen homeostasis, as well as three other genomic regions previously not known to be associated with high-altitude phenotypes. In addition to identifying selection-nominated candidate genes, we also tested whether the HIF pathway shows evidence of natural selection. Our results indicate that the genes of this biochemical pathway as a group show no evidence of having evolved in response to hypoxia in Andeans. Results from particular HIF-targeted genes, however, suggest that genes in this pathway could play a role in Andean adaptation to high altitude, even if the pathway as a whole does not show higher relative rates of evolution. These data suggest a genetic role in high-altitude adaptation and provide a basis for genotype/phenotype association studies that are necessary to confirm the role of putative natural selection candidate genes and gene regions in adaptation to altitude.


Introduction
Identifying gene regions showing signatures of natural selection in the human genome offers a window into our recent evolutionary past, as well as a deeper understanding of how this evolutionary force has shaped extant patterns of variation. Several recent studies have analysed dense single nucleotide polymorphism (SNP) genotype data to detect signatures of selection in three major continental groups: West Africans, East Asians and Northern Europeans. 1 -6 To date, only a few studies have focused on identifying candidate genes under selection with reference to a specific selective pressure. 7,8 Here, we use high-density SNP data to search for candidate genes for altitude adaptation in Andean populations. By expanding the populations of study to the Americas and targeting a specific selective pressure, hypobaric hypoxia, we can produce a more detailed and nuanced understanding of this evolutionary process.
High-altitude environments provide scientists with a natural laboratory to study the genetic and physiological effects of hypobaric hypoxia, the decreased partial pressure of oxygen at high altitude resulting in lower circulating oxygen levels in the body, on endemic highland species. 9 -11 Humans have inhabited three high-altitude (.2,500 m) zones of the world for multiple generations: the Tibetan Plateau, the Andean Altiplano and the Semien Plateau of Ethiopia ( Figure 1). Each of these populations exhibits unique circulatory, respiratory and haematological adaptations to life at high altitude. For example, research has shown that Tibetan and Ethiopian populations have relatively low haemoglobin concentrations, in contrast to the 'classic' Andean physiological adaptation (also seen in high-altitude sojourners), where haemoglobin concentrations are elevated compared with low-altitude groups. 12 -17 Andeans also exhibit lower levels of resting ventilation, a more 'blunted' hypoxic ventilatory response, higher levels of pulmonary arterial pressure and an increased frequency of chronic mountain sickness compared with their Tibetan counterparts. 18,19 Overall, this research has led to a substantial body of literature documenting the suite of human physiological responses to high-altitude habitation (for a review, see Hornbein and Schoene 20 ).
The physiological differences between low-and high-altitude populations have been well documented, but little work has focused on understanding the genetic bases or identifying the genetic variants underlying these adaptations. 21,22 The few natural selection genetics studies conducted previously have focused on specific genes hypothesised to play a role in adaptation to altitude, but none of them have found conclusive evidence of this evolutionary force. 21 -27 One recent study scanned 998 genetic markers in seven Nepalese Sherpa porters and identified genomic regions that may have been involved in adaptation to altitude. 28 However, genome scans using much larger panels of genetic markers and larger sample sizes will be necessary to expand upon these very preliminary findings. Related research has explored the heritability of specific altitude phenotypes such as arterial oxygen saturation, haemoglobin concentration and thoracic skeletal dimensions. 16,29,30 One heritability study concluded that a major autosomal dominant locus exists for high oxygen saturation, where Tibetan women carrying this high oxygen saturation allele had a higher offspring survival rate than women possessing the low oxygen saturation allele. 30 Even though research of this nature documents the potential for natural selection to act on phenotypic traits, it does not identify the gene(s) controlling the phenotype.
As part of an ongoing project to understand the role of natural selection in shaping human genetic diversity in high-altitude populations, we genotyped 490,032 autosomal SNPs using the Affymetrix, Inc. (Santa Clara, CA) GeneChipw Mapping 500 K array in 195 persons of highaltitude or low-altitude descent. By comparing high-altitude populations with related populations living at low altitude, a list of selection-nominated candidate genes and gene regions was generated using four summary statistics: locus-specific branch length (LSBL), the natural log of the ratio of heterozygosities (lnRH), Tajima's D and the wholegenome long-range haplotype (WGLRH) test. We focused our attention on the hypoxia-inducible factor (HIF) pathway, which is a transcriptional regulator that controls cellular oxygen homeostasis and plays a key role in energy metabolism. It is upregulated in many cancers and may be involved in the accumulation of adipose tissue. This pathway, comprising at least 75 genes scattered throughout the genome, is thought to regulate many of the physiological responses to cellular hypoxia. Based on their functional roles, we have an a priori reason to expect that genes in this pathway might be involved in adaptation to high altitude. 31 Genomic searches for signatures of natural selection, however, are also a means of aiding the identification of gene function or to expand the current understanding of a gene's function. Several studies of natural selection have helped to identify functional roles for the loci under selection. 32 -34 Therefore, we also considered non-HIF genes in this analysis.

Materials and methods
Populations SNP data were generated using 105 individuals of Native American descent ( previously reported on by Mao et al. 35 ). This sample could be further divided into two groups, a high-altitude group and a low-altitude group. The high-altitude group was composed of 50 individuals of Andean descent: 25 Quechua collected in Cerro de Pasco, Peru (4,300 m), and 25 individuals of largely Aymara ancestry collected in La Paz, Bolivia (3,600 m). 36,37 The low-altitude group consisted of Native American lowlanders from Mexico, including 11 Nahua, nine Mixtec and ten Tlapanec individuals collected in Guerrero (1,600 m) and 25 Maya individuals collected in the Yucatan Peninsula (10 m). Sampling locations for each of the Native American samples is shown in Figure 1. Native American population samples, highland and lowland alike, were selected based on the proportion of Native American to European individual genetic ancestry, with persons showing high levels of Native American and low levels of European ancestry chosen for this research. Genetic ancestry was estimated using a panel of ancestry informative markers (AIMs) that distinguish between West African, Northern European and Native American populations. 38,39 Additionally, we included 90 East Asian lowlanders in this research: the 45 Haplotype Mapping Project (HapMap) Han Chinese from Beijing and the 45 HapMap Japanese from Tokyo. In this analysis, we split the high-altitude and low-altitude populations into three population groupings: 1) Andeans (Quechua and Aymara); 2) Mesoamericans (Maya, Nahua, Tlapanec and Mixtec); and 3) East Asians (Han Chinese and Japanese).
In addition to the Native American and East Asian populations, 120 West African and Northern European individuals from the HapMap project were also genotyped using the Affymetrix 500 K array set. These included 60 Yoruba from Ibadan, Nigeria and 60 individuals from the USA who were of northern and western European ancestry, collected by the Centre d'Etude du Polymorphisme Humain (CEPH). The availability of the HapMap data made it possible to compare the results of our analysis with those of previous studies of natural selection in the same samples. By doing so, we confirm that this genotyping platform is appropriate for the analysis of Andean signatures of natural selection.

Genome scan data
The Affymetrix, Inc. Gene Chip Human mapping 500 K array set was used to generate high-density multi-locus SNP genotype scores. This mapping array has an even distribution across the genome, with an average inter-SNP distance of 5.8 kilobases (kb). It is composed of two arrays named for the restriction enzymes used in the complexity reduction step of the reaction, the Nsp array and the Sty array. Each array assays approximately 250,000 SNPs. In total, this analysis was conducted using 490,023 autosomal SNPs.

Tests for positive selection
We used four statistics to identify candidate loci showing positive selection in the Andean population: LSBL, lnRH, Tajima's D and the WGLRH test. 5,40 -42 LSBL, lnRH and the WGLRH test were implemented as previously described. 5,41,43 LSBL was calculated for each SNP in the dataset, whereas an overlapping sliding windows approach was taken to calculate lnRH and Tajima's D. We used a window size of 200,000 base pairs (bp), moving in 50,000 bp increments along each chromosome for lnRH and 100,000 bp with 10,000 bp increments for Tajima's D. Window size was determined by the genome coverage and the marker density of the Affymetrix 500 K array set. Statistical significance for each of LSBL, lnRH and Tajima's D was determined by using its respective genome-wide empirical distribution generated by these data. Those loci with pE values falling in the top (LSBL) or bottom (lnRH and Tajima's D) 5 per cent of the empirical distribution were considered statistically significant (a ¼ 0.05). For the WGLRH test, significance was assessed by comparing the relative extended haplotype homozygosity (REHH) of a specific core haplotype with the gamma distribution and applying the false discovery rate (FDR) approach to correct for multiple tests. 44 For Tajima's D, we compared standardised Tajima's D across windows similar to the integrated haplotype score (iHS) statistic of Voight et al. 2 To do so, we used the following equation: Where D i is the Tajima's D calculated for a sliding window in a given population panel (Andean, Mesoamerican or East Asian), m is the mean Tajima's D for all windows and SD is the standard deviation of Tajima's D for all windows.
Using standardised D, we identified regions of the genome that were significantly negative in the Andeans. Because we were interested in regions of the genome that have been subject to natural selection in Andeans but not in lowland Native American populations, however, we also wanted to compare the two New World populations-Andeans and Mesoamericans-to identify genomic regions that may have undergone selection in the high-altitude populations but not in the low-altitude populations. To do so, we developed a statistic to summarise the difference in Tajima's D between two populations using the following equation: Here, D iA is Tajima's D computed for a given sliding window in population A, D iB is Tajima's D computed for a given sliding window in population B, m is the mean Tajima's D for all windows and SD is the standard deviation of Tajima's D for all windows. Again, Tajima's D was calculated for each population using an overlapping sliding window size of 100,000 bp with 10,000 bp increments. We did not include East Asians in this comparison, in order to eliminate overlooking genomic regions that may have undergone changes in East Asians after the Old World and New World populations split.
Haplotypic phase was determined for each chromosome prior to calculating two of the statistics, Tajima's D and the WGLRH test. The program FastPHASE resolved the haplotypic phase from the unphased genotype data for Tajima's D. 45 Northern Europeans, West Africans, Native Americans and East Asians were phased individually. The high-altitude and low-altitude Native Americans were phased together as one population. Missing genotypes were inferred for all populations. For the WGLRH test, haplotypic phase was computed using the expectation maximisation (EM) algorithm as implemented in the program Haploview. 46 FastPHASE was not used for this test because we strictly followed the WGLRH algorithm as it was designed, which used Haploview for phasing. 41

Results
Working with 490,032 SNPs from the Affymetrix, Inc. Gene Chip Mapping 500 K array in 195 individuals from high and low altitudes, we identified gene regions (a ¼ 0.05 and a ¼ 0.01) that differed significantly between Andeans and two low-altitude populations, Mesoamericans and East Asians, using four statistics that detect departures from neutrality. The statistics included LSBL, lnRH, Tajima's D and the WGLRH test. The significant SNP comparisons or SNP windows for each of the four test statistics applied to these data are shown in Table 1. The empirical distribution for LSBL is shown in Figure 2.  We identified 14, seven and three HIF pathway genes that fell into the 5 per cent tail of the empirical distribution for LSBL, lnRH and Tajima's D difference, respectively. No HIF pathway candidate genes were located in a statistically significant extended haplotype region for the WGLRH test. The SNP genotyping platform used in this analysis, however, did not assay SNPs within the gene boundaries of 13 HIF pathway candidate genes. For this reason, it was important to look 50 kb upstream and downstream of the start and end coordinates of each gene for significant SNPs or SNP windows so as to not exclude a potential candidate gene from analysis. When doing so, 29, ten and eight HIF pathway genes show at least one significant SNP or window for LSBL, lnRH and Tajima's D difference, respectively. Table 2 enumerates the significant HIF pathway genes for each test statistic using the 50 kb upstream and downstream definition.
The number and proportion of significant SNPs or sliding windows varied for each gene. For example, all nine lnRH windows and six of 28 LSBLs for tenascin C (TNC) were statistically significant. By contrast, the gene nitric oxide synthase 2A (NOS2A) displayed only one significant lnRH window out of seven, and the gene vascular endothelial growth factor (VEGF) contained only one significant LSBL among 15 assayed SNPs. Moreover, the only gene with all of the windows in the gene region significant for lnRH was TNC. None of the HIF pathway candidate genes were statistically significant for all of the test statistics. However, we did identify significant HIF pathway genes using two or three out of the four statistics. VEGF was significant for Tajima's D difference and LSBL. TNC and cadherin 1 (CDH1) were statistically significant for LSBL and lnRH. Lastly, three genes, those encoding endothelin receptor type A (ENDRA) and protein kinase, AMP-activated, alpha 1 catalytic subunit (PRKAA1) and NOS2A, were statistically significant for LSBL, lnRH and the Tajima's D difference. To evaluate if HIF pathway genes are overrepresented in the 5 per cent tail of the empirical distribution for Andeans, we used Fisher's exact test. We tested the hypothesis that the proportion of significant LSBL values (a ¼ 0.05) is higher among HIF pathway candidate genes than among non-HIF genes using a 2 Â 2 contingency table, where the four categories were: significant LSBLs for HIF genes, non-significant LSBLs for HIF genes, significant LSBLs for non-HIF genes and non-significant LSBLs for non-HIF genes. The results indicated that the HIF pathway candidate genes are not overrepresented in the 5 per cent tail of the distribution (OR ¼ 0.644 (95 per cent confidence interval 0.538-0.778); p , 0.001). A second method of testing if the HIF pathway candidate genes are overrepresented in the 5 per cent tail is to compare the LSBL distribution of HIF pathway candidate genes to the LSBL distribution of all non-HIF genes using a one-sided Kolmogorov-Smirnov (K-S) test. Again, the results of this test suggested that the 5 per cent tail of the empirical LSBL distribution is not enriched with HIF genes (D n,m ¼ 0.0205; p ¼ 0.3162). It is important to note that these results do not preclude particular HIF genes from involvement in genetic adaptation to high altitude. Rather, they denote that the HIF pathway as a whole has not evolved in response to hypoxia among Andeans.
In addition to studying the HIF pathway candidate genes specifically, we also scanned across each chromosome to discern genomic regions showing evidence of reduced variation in Andeans, a hallmark of directional selection. Given the large number of significant tests for LSBL, lnRH and Tajima's D using a 5 per cent significance cut-off we restricted our attention to regions with clusters of significant values for one or more test statistics as selection-nominated candidate gene regions. To identify such regions, we calculated the significance of one megabase non-overlapping windows moving across each chromosome for LSBL, lnRH and Tajima's D using the hypergeometric distribution. The p-value for each window was corrected for multiple tests using the Bonferroni correction. 44 In total, p-values for 2,718 windows were calculated for each of the LSBL, lnRH and Tajima's D statistics. Significant p-values were defined such that one false-positive would be expected for all observed windows. Using this definition, windows for which p 0.004 were considered to be statistically significant. The results of this analysis revealed 54 regions displaying extended regions of continuously significant statistics for two or more of the three statistics. Three of these regions located on chromosomes 11, 12 and 15 were significant for all three statistics. Table 3 enumerates the chromosomal regions that were statistically significant for LSBL, lnRH and Tajima's D.  The WGLRH test identified 43,153 extended haplotype/core regions throughout the genome in the Andean panel. Only 57 of these regions were statistically significant after identifying 'flipped' SNPs and applying an FDR correction for multiple testing. Two of these extended haplotypes were also identified as significant in the Mesoamericans. After removing these two regions from the Andean analysis, 55 significant extended haplotype regions remained. Those significant extended haplotypes containing known genes in their core regions are listed in Table 4. No common core haplotypes were shared between the East Asians and the Andeans. Of the 55 significant 500 kb extended haplotype regions, seven contained SNPs that were statistically significant for LSBL. None of the core regions identified using the WGLRH test overlapped with the statistically significant gene windows for lnRH or Tajima's D.
To validate that this dataset was appropriate for identifying signatures of positive selection in Andean populations, we performed an identical analysis using all four statistical tests for positive selection on the HapMap project populations. 47 The samples included in this analysis have been used in previous genome scans conducted on larger datasets. 2,6 The populations included 60 Yoruba from Ibadan, Nigeria; 60 individuals of northern and western European ancestry from the USA collected by the CEPH; and 90 East Asians from China and Japan. The East Asians used in this analysis corresponded to the East Asians used for the Andean analysis. This analysis identified significant gene regions consistent with previous studies. For example, SNPs found in the gene solute carrier family 24, member 5 (SLC24A5)-a gene associated with skin pigmentation and shown to be under positive selection in European populations but not in East Asian and West African populations-possessed statistically significant LSBL and lnRH values in the European population; 32,48 however, this was not observed for Tajima's D difference or the WGLRH test. Another gene, ectodysplasin A receptor (EDAR), known to be involved in hair and tooth development, consistently shows evidence of positive directional selection among East Asian populations. 49 For the East Asians in this analysis, SNPs falling within EDAR showed statistically significant LSBL values and Tajima's D window, but non-significant lnRH windows. Additionally, it was not identified as a significant core haplotype for the WGLRH test. The absence of significant lnRH windows is not surprising in this population of Chinese and Japanese individuals, however, given that the haplotype under selection has not swept to fixation in the Japanese population. By extending our analysis to this group of well-studied old-world populations, we verified that the signatures of selection found in these three populations overlapped with those signatures identified using other SNP datasets, supporting our contention that the SNP dataset and analytical methods used here are appropriate for identifying signatures of positive selection in high-altitude Andeans.

Discussion
Using a dense genome-wide panel of SNPs (Affymetrix 500 K chip), we compared patterns of genetic diversity between high-altitude Andeans, low-altitude Mesoamericans and East Asians to identify selection-nominated candidate genes or gene regions in Andeans. Four tests based on different characteristics of the data were used in our analysis: LSBL, lnRH, Tajima's D and the WGLRH test. We selected these complementary methods because each statistic possesses a varying degree of efficacy for identifying signatures of natural selection depending on the allelic background of the populations used in the analysis, the strength of selection and the length of time elapsed since the start of the selective event.
Given the aspects of genetic variation summarised by these statistics, it is not expected that the results of tests will overlap. Rather, these methods should be considered as complementary tests that can be useful for the identification of regions under positive selection.
Based on the results of this study, the HIF pathway genes exhibiting the most compelling evidence of positive directional selection across the test statistics are ENDRA, PRKAA1 and NOS2A. ENDRA, expressed in vascular smooth muscle, encodes a vasoconstrictor whose actions are mediated through endothelin-1. 50 PRKAA1 encodes a heterotrimeric enzyme belonging to the ancient 5 0 -AMP-activated protein kinase gene family involved in regulation of cellular ATP (reviewed by Kemp et al. 51 ). PRKAA1 functions as a cellular energy sensor under ATP-deprived conditions, such as those that are experienced in hypoxia. Thus, it provides metabolic adaptations to the oxygen-starved cellular environment. NOS2A, in combination with additional nitric oxide synthase enzymes, synthesises nitric oxide (NO) from arginine and oxygen. NO increases blood flow in the arteries and helps to regulate blood pressure. Erzurum et al. 52 have recently shown that NO production is increased in Tibetans resident at 4,200 m compared with sealevel controls. Recent work has also demonstrated higher uterine artery blood flows during pregnancy in Andean than European high-altitude residents, possibly due to greater uterine artery vasodilation. 37 These studies suggest that vascular factors, not just haematological or pulmonary systems, contribute to altitude adaptation in Tibetan and Andean populations. Here, we showed preliminary evidence of positive selection in NOS2A in Andean populations.
It would be worthwhile to extend the work conducted in this study to Tibetan populations who show physiological adaptations with respect to NO production, to determine if a similar genetic signal is present in this Himalayan population.
The three chromosomal regions showing extended regions of statistically significant test results are excellent candidates for further study. They include regions on chromosomes 11, 12 and 15. In addition to the two chromosomal regions, the 55 candidate regions identified by the WGLRH test are also strong candidates for further study; however, the WGLRH test only considers derived alleles whose frequencies have risen to .0.85 in the populations under consideration. One problem with only considering those haplotypes with high frequencies of the derived allele is that natural selection could also act to select the ancestral allele, and these signatures cannot be detected with the WGLRH test. Given the low altitudes inhabited by human ancestors, however, it is more likely that selection acted on a novel mutation in one or more of the genes involved in adaptation to altitude, as opposed to an ancestral variant already present in the population. This is especially relevant with regard to the HIF pathway, as this is an evolutionarily ancient system important in embryogenesis, development and homeostasis.
One potential problem with this and other genome scans is that it uses pre-ascertained SNPs to look at the underlying pattern of nucleotide diversity. For example, Tajima's D was first designed for sequence-based tests of selection wherein the nucleotide diversity is known for an entire gene or gene region. Using this statistic on genome scans for natural selection, one must be aware of the ascertainment bias inherent in the analysis. Given the selection criteria of the Affymetrix 500 K panel, uncommon, low-frequency alleles will be under-represented and common alleles will be over-represented. This bias is more likely to miss candidate natural selection genes rather than increase our false-positive rate. Moreover, we used Tajima's D in conjunction with other tests for positive selection so that genes that might have been overlooked using this statistic could have been identified with one or more of the other three statistics. To illustrate this point, consider the genes TNC and CDH1, which, in our analysis, showed signatures of selection by LSBL and lnRH, but not either of the Tajima's D statistics. This pattern is the same as that observed for the gene SLC24A5, which is known to be the target of natural or sexual selection in European populations. 32,48 Therefore, it is possible that with unbiased complete sequence data of TNC or CDH1 in Andean and Mesoamerican populations, Tajima's D will reveal a pattern of nucleotide diversity consistent with positive directional selection.
In future work, it would be interesting to compare the overall genetic signals of natural selection found in Andean populations with those found in Tibetan populations, as these two populations are distinct in geographical locale as well as in duration of time living at altitude. Archaeological data indicate that Himalayan populations first inhabited the Tibetan Plateau as early as 25,000 years ago, whereas populations first moved onto the Andean Altiplano 11,000 years ago. 53,54 By understanding how similar environmental pressures with varying evolutionary time frames can result in either the same or different genetic adaptations, we will be better situated to understand the molecular basis for convergent human adaptations. After identification, all putative natural selection regions identified in Andeans and Tibetans must be confirmed by further research, such as genotype/phenotype association studies and functional assays.