Human SNPs resulting in premature stop codons and protein truncation

Single nucleotide polymorphisms (SNPs) constitute the most common type of genetic variation in humans. SNPs introducing premature termination codons (PTCs), herein called X-SNPs, can alter the stability and function of transcripts and proteins and thus are considered to be biologically important. Initial studies suggested a strong selection against such variations/mutations. In this study, we undertook a genome-wide systematic screening to identify human X-SNPs using the dbSNP database. Our results demonstrated the presence of 28 X-SNPs from 28 genes with known minor allele frequencies. Eight X-SNPs (28.6 per cent) were predicted to cause transcript degradation by nonsense-mediated mRNA decay. Seventeen X-SNPs (60.7 per cent) resulted in moderate to severe truncation at the C-terminus of the proteins (deletion of > 50 per cent of the amino acids). The majority of the X-SNPs (78.6 per cent) represent commonly occurring SNPs, by contrast with the rarely occurring disease-causing PTC mutations. Interestingly, X-SNPs displayed a non-uniform distribution across human populations: eight X-SNPs were reported to be prevalent across three different human populations, whereas six X-SNPs were found exclusively in one or two population(s). In conclusion, we have systematically investigated human SNPs introducing PTCs with respect to their possible biological consequences, distributions across different human populations and evolutionary aspects. We believe that the SNPs reported here are likely to affect gene/protein function, although their biological and evolutionary roles need to be further investigated.


Introduction
The Human Genome Project revealed the presence of a large number of genetic variations among individuals. Single nucleotide polymorphisms (SNPs) are the most common genetic variation; they occur, on average, once in every 400-1,000 base pairs along DNA. [1][2][3][4] The term 'polymorphism' traditionally refers to commonly occurring genetic variations (minor allele frequency approximately ≥ 1 per cent) in the population. 5 The density of SNPs varies among different genomic regions, and is thought to be dependent on both the mutation rate and the selective constraints on the region. 6 Currently, there is a strong interest in SNPs because they are hypothesised to contribute to differential disease risk and drug/treatment response among individuals. 7,8 SNPs located in the coding regions of genes may have important biological consequences. For example, non-synonymous SNPs (nsSNPs) change the amino acid sequence and thus may affect protein function. Although many approaches and systematic analyses have been undertaken to identify nsSNPs with possible biological significance, [9][10][11][12] to our knowledge no large-scale systematic analysis has been carried out to identify and characterise SNPs that introduce premature termination codons (PTCs; herein called X-SNPs). Both frameshift and nonsense mutations can lead to the introduction of PTCs along the open reading frames. As a result of PTCs, the stability of transcripts or proteins may be directly affected. 13,14 Alternatively, the truncated proteins may act in a dominant-negative fashion. 15 Thus, the PTCs can lead to either loss-of-function or gain-of-function by altering the stability and function of the transcripts/proteins.
The Mendelian human diseases are associated with high-penetrant disease-causing genetic alterations that are found in very low frequencies (approximately < 1 per cent) in the population, most likely due to strong selection against them. 16 In inherited human genetic disorders, approximately one-third of mutations introduce PTCs 17 that are considered to be deleterious. Similarly, the number of SNPs introducing PTCs in the human genome is estimated to be fairly low, and a previous study suggested the presence of strong evolutionary selection against X-SNPs. 18 Therefore, disease-related or not, the PTCs are considered dramatically to affect proteins, leading to potential biological abnormalities. In this study, our aim was to evaluate the polymorphisms introducing PTCs in the human genome with respect to their potential biological consequences, distributions across different human populations and minor allele frequencies. As more X-SNPs are discovered and deposited in public SNP databases, it will be possible to analyse a larger number of X-SNPs and obtain more comprehensive data. Nevertheless, our results do provide an interesting and unique catalogue of polymorphisms that deserves further biological and epidemiological disease-association studies.

SNPs
SNPs annotated 'premature termination codon SNPs' were retrieved from the dbSNP database build 120 (http://www.ncbi.nlm.nih.gov/SNP/). 19 We have annotated such SNPs as X-SNPs throughout this paper. There was a total of 977 X-SNPs in the dbSNP database; however, only 119 of them were presented with minor allele frequency information. Among these SNPs, only the ones that were found in at least two chromosomes with a sample size of ≥ 20 chromosomes were further analysed (herein annotated as validated X-SNPs). The X-SNPs that are located on the transcripts annotated as 'predictions', 'pseudogenes', 'similar to' or 'open reading frames' were excluded from this study. In total, 28 X-SNPs were in agreement with all of the above requirements.
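The selection criteria above can be expressed as a simple filter. The following is a minimal sketch only: the record layout, the field names and the annotation strings are hypothetical and do not correspond to the actual dbSNP export format.

```python
from dataclasses import dataclass
from typing import Optional

# Transcript annotation classes excluded in the text.
# The exact strings are an assumption made for illustration.
EXCLUDED_ANNOTATIONS = {"prediction", "pseudogene", "similar to", "open reading frame"}


@dataclass
class SnpRecord:
    snp_id: str                        # hypothetical identifier
    minor_allele_count: Optional[int]  # chromosomes carrying the minor allele; None if no frequency data
    chromosomes_sampled: int           # sample size, in chromosomes
    transcript_annotation: str         # annotation class of the host transcript


def is_validated_x_snp(rec: SnpRecord) -> bool:
    """Apply the three criteria described in the text to one X-SNP record."""
    if rec.minor_allele_count is None:   # no minor allele frequency information
        return False
    if rec.minor_allele_count < 2:       # must be seen in at least two chromosomes
        return False
    if rec.chromosomes_sampled < 20:     # sample size must be >= 20 chromosomes
        return False
    return rec.transcript_annotation not in EXCLUDED_ANNOTATIONS
```

For instance, a record observed in 3 of 40 chromosomes on a reviewed transcript would pass, whereas the same record on a transcript annotated as a pseudogene would be excluded.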

Candidate transcripts for nonsense-mediated mRNA decay
Blasting the transcript sequence against the human genome identified the genomic structures of transcripts. The subsequent manual analysis of the exon-intron boundaries identified X-SNPs that can lead to nonsense-mediated mRNA decay (NMD): the transcripts with an SNP introducing a PTC located ≥ 50 nucleotides upstream of an exon-intron junction are considered candidates to undergo NMD. 13,[27][28][29]

Results and discussion

Possible biological consequences of X-SNPs
Our systematic search of the dbSNP database 19 (build 120) yielded 28 validated X-SNPs from 28 genes (Table 1). Twenty-three genes bearing X-SNPs were found to code for a single transcript; however, the remaining five X-SNPs were found in genes undergoing alternative splicing: DSCR8-K79X, HPS4-R246X, IL17RB-Q484X, OAS2-W720X and TAP2-Q687X. With the exception of HPS4-R246X, all X-SNPs were mapped onto an alternatively spliced transcript variant (ASTV) coding for the longest protein isoform. For 22/28 X-SNPs, genotype information was available in the dbSNP database. As a result, for 12 X-SNPs, at least one homozygous sample was reported, suggesting that these X-SNPs do not affect the fitness per se (see below; Table 1). In the remaining cases, genotyping of larger sample sets may help in elucidating whether the homozygous state is deleterious (ie the homozygotes are not viable) or whether the low allele frequency makes it hard to detect the homozygotes in small populations.
Footnotes to Table 1:
This information indicates whether or not a homozygous sample in a sample set was reported for the corresponding X-SNP and was collected from the dbSNP database 'summary of genotypes' section: 'n/a': no information was available, '+': homozygous genotype was reported, '-': no homozygous genotype was reported.
f Length of the wild-type protein products. In parentheses are the percentages of the protein truncation at the C-terminus caused by the X-SNP.
h SNPs that occur at CpG dinucleotides, and thus can be hot-spot mutations, are annotated by '+'.
Savas, Tuzmen and Ozcelik Review PRIMARY RESEARCH
We then carried out a theoretical evaluation of the possible biological consequences of the identified X-SNPs at the mRNA and protein levels. For example, NMD is a surveillance system that specifically eliminates transcripts that contain PTCs as a result of mutations in DNA or errors in RNA processing. 15 NMD usually requires a downstream intron and at least 50-55 nucleotides before the downstream exon-intron junction in order for a PTC to be recognised. 27,28 Based on the 50-55 nucleotide rule, we analysed the locations of the X-SNPs with respect to the exon-intron boundaries and predicted that eight (28.6 per cent) X-SNPs (AGT-Q53X, APOC4-W47X, EPHX1-W97X, MS4A12-Q71X, POLE2-K443X, SERPINB11-E90X, SMUG1-Q3X and ZNF34-Q56X) may potentially cause mRNA degradation via NMD. Thus, at least these eight X-SNPs are likely to result in loss of gene function. Exceptions to this rule have also been reported, 17 however, which suggests that the proportion of PTC-containing mRNAs undergoing mRNA degradation may, in fact, be larger. The reported allele frequencies of these X-SNPs ranged from rare (AGT-Q53X 0-5 per cent; APOC4-W47X 0-5 per cent; EPHX1-W97X 0-2. [...] SERPINB11-E90X 22.9-52.1 per cent; ZNF34-Q56X 27 per cent) (Table 1). Individuals with a homozygous state for two of these X-SNPs, namely MS4A12-Q71X and SERPINB11-E90X, were reported in dbSNP submissions, suggesting that, in such individuals, the levels of these truncated protein products are likely to be reduced by the NMD mechanism.
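The 50-55 nucleotide rule can be illustrated with a short function. This is a hedged sketch and not the procedure used in this study, which relied on manual inspection of exon-intron boundaries. It assumes 0-based coordinates on the spliced transcript, measures the distance from the first nucleotide of the PTC to the last exon-exon junction, and uses 50 nucleotides as the default threshold. The function name and parameters are hypothetical.

```python
from itertools import accumulate
from typing import Sequence


def is_nmd_candidate(ptc_start: int, exon_lengths: Sequence[int], min_distance: int = 50) -> bool:
    """Return True if a PTC lies at least `min_distance` nt upstream of the last junction.

    ptc_start    -- 0-based position of the first PTC nucleotide in the spliced transcript
    exon_lengths -- exon lengths in transcript order (5' to 3')
    """
    if len(exon_lengths) < 2:
        return False  # a single-exon transcript has no downstream intron
    # Cumulative exon ends; every end except the final one marks an exon-exon junction.
    junctions = list(accumulate(exon_lengths))[:-1]
    last_junction = junctions[-1]
    return last_junction - ptc_start >= min_distance
```

With three exons of 100, 200 and 150 nucleotides, the last junction lies at position 300: a PTC starting at position 120 would be flagged as an NMD candidate, whereas one starting at position 280 (only 20 nucleotides upstream) or one in the last exon would not.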
If no NMD occurs and the protein products are translated, then the PTCs lead to protein truncation at the C-terminus, the consequences of which vary depending on the degree of the truncation. For example, 17 (60.7 per cent) X-SNPs led to moderate to severe truncation at the C-terminus of the proteins (deletion of > 50 per cent of the amino acid sequence), which is likely radically to alter protein structure and function (Table 1, Figure 1). As an extreme example, SMUG1-Q3X, when translated, would yield only a two-amino acid peptide, which would presumably be non-functional (loss of function). Also, PTCs can destabilise the protein products by altering the protein-folding state or kinetics 14,31 and may cause proteolysis. In addition, they may act as dominant-negative mutations 15 or cause exon skipping and alter the open reading frame. 32 Alternatively, mRNA molecules bearing a PTC closer to the 5′ end can still be translated if an in-frame translatable AUG start codon is present downstream of the PTC. 33-35 Such N-terminal truncated proteins can be fully or partially functional. For example, in the case of SMUG1-Q3X, there is an in-frame AUG located at the 18th codon of the SMUG1 gene, which can be experimentally evaluated to determine if an N-terminal truncated SMUG1 protein is produced and functional. To summarise, the stability, structure and function of the protein products or transcripts may be affected by the X-SNPs described in this study, and experimental approaches are needed to evaluate their true biological effects.
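Both quantities discussed above, the extent of C-terminal truncation and the position of a possible downstream in-frame start codon, can be computed directly from the codon position of the PTC. The sketch below is illustrative only: the function names are hypothetical, codons are numbered from 1, and the sequences used with it are assumed to be coding sequences that begin at the natural start codon.

```python
from typing import Optional


def truncation_percent(ptc_codon: int, protein_length: int) -> float:
    """Percentage of the wild-type protein lost when codon `ptc_codon` becomes a stop."""
    retained = ptc_codon - 1  # residues translated before the PTC
    return 100.0 * (protein_length - retained) / protein_length


def first_downstream_inframe_aug(cds: str, ptc_codon: int) -> Optional[int]:
    """Return the 1-based codon number of the first in-frame ATG/AUG after the PTC, or None."""
    usable = len(cds) - len(cds) % 3  # ignore an incomplete trailing codon
    codons = [cds[i:i + 3].upper() for i in range(0, usable, 3)]
    # Index ptc_codon (0-based) is the codon immediately after the PTC.
    for idx in range(ptc_codon, len(codons)):
        if codons[idx] in ("ATG", "AUG"):
            return idx + 1
    return None
```

A PTC at codon 3, as in SMUG1-Q3X, leaves two residues; for a hypothetical 100-residue protein this corresponds to a 98 per cent truncation. Applied to the SMUG1 coding sequence, the second function would be expected to return the in-frame AUG at codon 18 mentioned in the text, although this has not been verified here.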

Possible evolutionary explanations of common X-SNPs
The small number of validated X-SNPs identified suggests infrequent occurrence of the PTC-introducing variations in the human genome and thus agrees with the presence of selection against them. 18 By contrast with rare PTC-introducing mutations observed in human diseases, however, X-SNPs analysed in this study represented commonly occurring variations in humans: 22 X-SNPs (78.6 per cent) were found with minor allele frequencies of ≥ 5 per cent in at least one sample panel analysed (common X-SNPs) compared with only six rare X-SNPs (with < 5 per cent minor allele frequencies). How can we explain the abundance of such common (and perhaps deleterious) X-SNPs in the human population? Possible scenarios are summarised in Figure 2. For example, one explanation could be that the truncated protein product may still be functional in the presence of the X-SNP. For instance, LPL-S474X was located only one amino acid prior to the natural termination codon; thus, it may not really alter the protein properties and thus may not be deleterious to cell function at all. Alternatively, the protein may not be essential for the fitness of human beings; in this case, the evolutionary pressure is relieved, which can lead to toleration of an increase in allele frequency of premature stop codons in human populations.
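The common/rare distinction used above can be stated compactly. The following is a minimal sketch, assuming that minor allele frequencies are given as fractions (0.05 for 5 per cent) with one value per sample panel; the function name is hypothetical.

```python
from typing import Sequence


def classify_x_snp(panel_mafs: Sequence[float], threshold: float = 0.05) -> str:
    """Label an X-SNP 'common' if its minor allele frequency reaches the
    threshold in at least one sample panel, and 'rare' otherwise."""
    if not panel_mafs:
        raise ValueError("at least one panel frequency is required")
    return "common" if max(panel_mafs) >= threshold else "rare"


# Proportion of common X-SNPs reported in the text: 22 of 28.
common_percent = round(100 * 22 / 28, 1)
```

Under this definition, an X-SNP with frequencies of 0 and 0.27 in two panels is labelled common, because a single panel at or above the threshold is sufficient.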
Another possibility is that X-SNPs may be capable of affecting protein function/the organism per se, but other factors might modify their effects. Here, we will assume that these PTCs represent both the strongly deleterious mutations that are, as a result of selection, quickly removed from the populations, as well as the slightly deleterious mutations that are subject to both selection and drift. 36 For example, these X-SNPs may be hot-spot mutations, where the new mutations introduce (slightly) deleterious alleles and thus increase the allele frequency, despite the selection. In order to assess whether some of these X-SNPs might in fact represent the hot-spot mutations, we analysed the immediate flanking sequences of each X-SNP. As a result, we found that 25 per cent (7/28) of X-SNPs (all common) had occurred at CpG dinucleotides (Table 1). These data suggest that these X-SNPs might have arisen from spontaneous deamination of methylcytosine leading to a thymine, and thus may represent hot-spot mutations. 37 Additionally, diploidy was suggested to relieve the tension of purifying selection and increase the tolerance for PTCs, 38 which predicts a recessive effect or loss of function. All but one gene (MAGEE2) in Table 1 were located in autosomal chromosomes, which may also help to explain the frequency of the naturally occurring PTC polymorphisms in humans. Moreover, it is also likely that, even though (slightly) deleterious in a homozygous state, some X-SNPs can confer selective advantage to heterozygotes. 39 Alternatively, epistatic interactions of additional mutations, either on the same or different genes, may compensate for the (slightly) deleterious effects of the X-SNP. 16,40 Furthermore, X-SNPs may be beneficial at present conditions, which may favour the positive selection of the X-SNPs and increase their allele frequencies.
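The flanking-sequence check for CpG dinucleotides can be sketched as below. This is an illustration under stated assumptions rather than the exact procedure used: the reference base and its flanks are taken from the coding strand, and a G preceded by C is counted as a CpG site because deamination of the methylcytosine on the opposite strand appears as a G to A change. The function name and arguments are hypothetical.

```python
def occurs_at_cpg(upstream: str, ref_base: str, downstream: str) -> bool:
    """Return True if the reference base of an SNP is part of a CpG dinucleotide.

    upstream   -- flanking sequence immediately 5' of the SNP
    ref_base   -- reference (ancestral) base at the SNP position
    downstream -- flanking sequence immediately 3' of the SNP
    """
    ref = ref_base.upper()
    prev_base = upstream[-1:].upper()   # empty string if no 5' flank is given
    next_base = downstream[:1].upper()  # empty string if no 3' flank is given
    # C followed by G: the C of a CpG on this strand.
    # G preceded by C: the G of a CpG, i.e. the C of a CpG on the opposite strand.
    return (ref == "C" and next_base == "G") or (ref == "G" and prev_base == "C")
```

For example, an arginine codon CGA changed to the stop codon TGA has a reference C followed by G and is therefore scored as a CpG site.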
Moreover, if a PTC is located at the 5′ end of a gene and there is a nearby in-frame initiation codon after that PTC, then the protein translation can re-initiate and a peptide with an amino-terminal truncation may be produced. 33-35 Depending on the nature and extent of the truncation, the truncated peptide can fully or partially function and thus can, completely or to some extent, rescue the phenotype. There is a need for further studies to elucidate the molecular basis of the discrepancy and the determination of the biological differences between human disease-related mutations and naturally occurring stop codon-creating polymorphisms.

Frequency spectrum of X-SNPs in different human populations
Comparison of the population(s) and the minor allele frequencies of X-SNP entries in the dbSNP database 19 presented great variability across different human populations, at least in some cases (Table 1). For example, HPS4-R246X, IL17RB-Q484X, LPL-S474X, MS4A12-Q71X, OVCH2-W556X, SERPINB11-E90X, TAAR9-Q61X and TRPM1-E1305X were detected in samples from African, Asian and Caucasian backgrounds. This might mean that either these X-SNPs have been inherited from a common ancestor or they represent hot-spot mutations (HPS4-R246X and IL17RB-Q484X occurred at CpG dinucleotides and thus might in fact be hot-spot mutations; see Table 1). By contrast, CYP2C19-W212X (African and Asian), EPHX1-W97X (Asian and Hispanic) and OAS2-W720X (African and European) were detected in some populations but not in others. In addition, there were three X-SNPs that were found exclusively in one population: FUT2-R297X and LCE5A-R79X in Asian and LIG4-W46X in African samples. Either different selection in different populations or the occurrence of founder effect/genetic drift may explain the population spectrum of these SNPs. 16,41

Figure 2. How can we explain the allele frequencies of the X-SNPs? This figure presents a summary of possible biological consequences of X-SNPs. For simplicity, both deleterious and slightly deleterious variations are annotated as deleterious.

High allelic frequency:
1. X-SNP is not deleterious because the protein product is still functional
2. X-SNP is deleterious but also is a hot-spot mutation
3. X-SNP is deleterious but (mildly) tolerated because of diploidy/heterozygote advantage
4. X-SNP is beneficial
5. X-SNP is deleterious to protein function, but the protein is not required for the fitness of the organism
6. X-SNP is deleterious to protein function, but protein function is compensated by other protein(s)
7. X-SNP is deleterious but compensated by other mutation(s) either in the same or in other gene(s)

Low allelic frequency:
1. X-SNP is relatively new in the human population
2. X-SNP is deleterious and thus subject to purifying selection
3. Technical issues (sample size is not large enough to draw conclusions from, errors in genotyping, population specificity etc)

Conclusion
In conclusion, we have evaluated SNPs that introduce PTCs in the human genome that can potentially affect the stability of transcripts and their protein products. Although there is considerable information regarding the PTC-creating mutations in human genetic diseases, to date, there has been no systematic study reporting on the PTC-causing polymorphisms in the human genome and their evolutionary and biological roles in humans. Our results indicated that the allelic frequencies of the disease-causing PTC-creating mutations and polymorphisms display a marked difference. These X-SNPs were found in a variety of proteins with different cellular functions (signal transduction, DNA repair, transcription, immune response, drug metabolism etc; Table 1). A search of literature reports and the Human Gene Mutation Database 42 showed that a fraction of these genes have already been implicated in human diseases: AGT in essential hypertension; 43 HPS4 in Hermansky-Pudlak syndrome type 4; 44 LPL in disorders of lipoprotein metabolism; 45 and TLR5 in pneumonia caused by Legionella pneumophila. 46 In the latter case, the TLR5-R392X SNP was functionally characterised and found to be defective in flagellin signalling and associated with the pneumonia susceptibility. 46 In the case of the TAP2-Q687X SNP, TAP2-Q687 was reported to be a part of a haplotype associated with a reduced risk of insulin-dependent diabetes mellitus in a small sample set. 47 Our data suggest a potential deleterious effect for X-SNPs identified in this study; however, their true biological consequences and potential roles in human disease and health have yet to be experimentally verified and identified.