A meta-analysis of single base-pair substitutions in translational termination codons ('nonstop' mutations) that cause human inherited disease

'Nonstop' mutations are single base-pair substitutions that occur within translational termination (stop) codons and which can lead to the continued and inappropriate translation of the mRNA into the 3'-untranslated region. We have performed a meta-analysis of the 119 nonstop mutations (in 87 different genes) known to cause human inherited disease, examining the sequence context of the mutated stop codons and the average distance to the next alternative in-frame stop codon downstream, in comparison with their counterparts from control (non-mutated) gene sequences. A paucity of alternative in-frame stop codons was noted in the immediate vicinity (0-49 nucleotides downstream) of the mutated stop codons as compared with their control counterparts (p = 7.81 × 10⁻⁴). This implies that at least some nonstop mutations with alternative stop codons in close proximity will not have come to clinical attention, possibly because they will have given rise to stable mRNAs (not subject to nonstop mRNA decay) that are translatable into proteins of near-normal length and biological function. A significant excess of downstream in-frame stop codons was, however, noted in the range 150-199 nucleotides from the mutated stop codon (p = 8.55 × 10⁻⁴). We speculate that recruitment of an alternative stop codon at greater distance from the mutated stop codon may trigger nonstop mRNA decay, thereby decreasing the amount of protein product and yielding a readily discernible clinical phenotype. Confirmation or otherwise of this postulate must await the emergence of a clearer understanding of the mechanism of nonstop mRNA decay in mammalian cells.


Introduction
There are currently in excess of 60,000 missense and nonsense mutations (in nearly 4,000 different genes) listed in the Human Gene Mutation Database (HGMD) that are known to cause, or to be associated with, human inherited disease. 1 In addition, there are 119 examples of mutations (in 87 different genes) that occur within stop codons, a category of mutation which therefore constitutes 0.2 per cent of codon-changing mutations. 1 Such lesions have been termed 'nonstop', 'nostop' or 'readthrough' mutations on the basis that the loss of the normal translational termination (stop) codon is likely to lead to continued translation of the mRNA further downstream into the 3'-untranslated region (UTR).
Although many authors tacitly assume that the normal open reading frame will simply be extended until the next in-frame stop codon is encountered, too few human nonstop mutations have so far been characterised to allow any general conclusions to be drawn as to their likely phenotypic consequences at either the mRNA or the protein level. In three reported cases, however (namely, those nonstop mutations in the gene encoding ribosomal protein S19 [RPS19], causing Diamond-Blackfan anaemia, 2 the F10 gene causing factor X deficiency 3 and the forkhead box E3 [FOXE3] gene causing anterior segment dysgenesis 4), the levels of the mutant mRNA transcripts were found to be dramatically lower than those of their wild-type counterparts. By contrast, the mRNA level associated with a nonstop mutation in the 3-beta-hydroxy-delta-5-steroid dehydrogenase (HSD3B2) gene causing adrenal hyperplasia was found to be near normal, although both HSD3B2 enzymatic activity and antigen (associated with a predicted 467 amino-acid protein, extended by 95 residues beyond the wild-type length) were found to be dramatically reduced. 5 Similarly, in the case of a nonstop mutation in the thymidine phosphorylase (TYMP) gene responsible for mitochondrial neurogastrointestinal encephalomyopathy, the mRNA level was not found to be reduced, even though the thymidine phosphorylase protein product it encoded was undetectable. 6 In yeast, nonstop mRNAs (transcripts lacking translational termination codons) are recognised by the protein Ski7 on ribosomes that have become stalled at the 3' ends of the mRNAs; these RNAs are then targeted for exosome-mediated degradation. 7-9 While this process of 'nonstop mRNA decay' is fairly effective at removing nonstop mRNAs, any protein products generated by translation of residual nonstop mRNAs are degraded by the proteasome. 10,11
Although few such studies have so far been attempted in mammalian cells, the expression level of nonstop mRNAs generally appears unaltered while ribosome stalling at the 3' end of the elongated nonstop mRNA blocks translation before the completion of synthesis of full-length polypeptides. 12-14 Precisely how nonstop mRNA decay impacts upon naturally occurring human nonstop mutations is unknown but, as is clear from the five disease-associated examples mentioned above, the evidence acquired to date suggests that this may be a gene- and mutation-dependent process. 15 Thus, although nonstop mutations are not uncommon, remarkably little is as yet known about their nature and consequences. In this paper, we report a first meta-analysis of naturally occurring nonstop mutations causing human inherited disease. With a view to exploring the various possible factors that could impact upon the likelihood of a given nonstop mutation coming to clinical attention, we have performed an analysis of the sequence context of the mutated stop codons and the average distance to the next in-frame downstream stop codon in comparison with control (non-mutated) gene sequences.

Methods
Mutation and control datasets
A total of 119 naturally occurring nonstop mutations from 87 human genes (Supplementary Table S1) were identified from the HGMD. 1 The majority of these nonstop mutations were single examples identified in specific genes but 18 genes harboured a total of 50 examples of this type of lesion. Since the multiple inclusion of identical sequences flanking mutated stop codons would have introduced considerable bias into the subsequent analysis, only one mutation per gene was considered in the analysis of the sequence context.
A control dataset was established which comprised 1,692 genes listed in the HGMD (for which both coding sequences and 3'-UTRs were obtainable from Ensembl [Build 37] but for which no termination codon [nonstop] mutations have so far been recorded). Data from the Transterm database (http://uther.otago.ac.nz/Transterm.html), 16 representing a total of 29,210 stop codons associated with annotated human genes, were used as genome-wide controls.

Analysis of nonstop mutations
The relative frequency of each type of stop codon (ie TAG, TAA and TGA) in the mutated (nonstop mutation-bearing) sequences and non-mutated wild-type control gene sequences was assessed. Stop codons harbouring single and multiple mutations were examined separately.
To detect any bias in the pattern of stop codon mutability, the mutability of the dinucleotides within a pentanucleotide spanning the stop codon and including one flanking nucleotide on either side was assessed. The number of mutations occurring in each of the 12 possible dinucleotides (note that four dinucleotides [CC, CA, CG and TC] cannot occur in conjunction with any stop codon-spanning pentanucleotide and were therefore omitted) was counted. In the HGMD control dataset, one nucleotide position within each stop codon was randomly mutated and the numbers of mutations in each possible dinucleotide were then counted. Statistical significance was determined using Fisher's exact test with a Bonferroni correction being applied to allow for multiple testing.
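The count of 12 possible dinucleotides can be checked by brute-force enumeration. The following is a minimal sketch, not the authors' code: it assumes that a pentanucleotide is one flanking base, the stop codon and one flanking base, and the function name `stop_spanning_dinucleotides` is illustrative.

```python
from itertools import product

STOP_CODONS = ("TAA", "TAG", "TGA")
BASES = "ACGT"


def stop_spanning_dinucleotides():
    """Collect every dinucleotide found in any N-stop-N pentanucleotide.

    Each of the four overlapping dinucleotides in a pentanucleotide
    contains at least one stop codon base, so all four are recorded.
    """
    found = set()
    for left, stop, right in product(BASES, STOP_CODONS, BASES):
        penta = left + stop + right
        for k in range(4):
            found.add(penta[k:k + 2])
    return found


possible = stop_spanning_dinucleotides()
all_dinucleotides = {a + b for a, b in product(BASES, repeat=2)}
impossible = sorted(all_dinucleotides - possible)

print(len(possible))  # number of dinucleotides that can span a stop codon
print(impossible)     # dinucleotides that can never do so
```

The enumeration yields 12 possible dinucleotides and leaves CA, CC, CG and TC as the four that cannot occur, in agreement with the text.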
Since the identity of the nucleotides immediately flanking the stop codon may influence the susceptibility of the stop codon to mutation, the frequencies of each DNA base in each of the six positions upstream and downstream of the normally used stop codon were obtained for both the mutated sequences and the controls. The expected frequency E of the DNA bases at each position was calculated based on the probability of observing this nucleotide in the HGMD control sequences:

$$E_{ij} = \frac{F_{ij}}{N_c} \times N_m$$

where $E_{ij}$ is the expected frequency of the base $i \in \{A, C, G, T\}$ at position $j$, $F_{ij}$ is the observed frequency of base $i$ at position $j$ in the HGMD control dataset, $N_m$ is the total number of mutated sequences and $N_c$ is the number of sequences in the HGMD control dataset. Under the assumption that the data follow a binomial distribution, we considered that an increase or decrease in the observed frequency of a particular nucleotide in a specified position was statistically significant if the corresponding p value was <0.01. In addition, to investigate whether any particular stop codon (ie TGA, TAG or TAA) was associated with any specific flanking nucleotides, we placed both the mutated and control sequences into separate datasets for each of the three stop codons and repeated the above analysis for each of the new datasets.
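The calculation can be illustrated with a short sketch. It rests on two assumptions that the text does not spell out: that the expected count is the control proportion scaled to the number of mutated sequences, and that the binomial p value is two-tailed (summing all outcomes no more probable than the one observed). The function names and the control count of 846 are hypothetical; only the dataset sizes 87 and 1,692 come from the text.

```python
from math import comb


def expected_count(control_count, n_mutated, n_control):
    """Expected count of a base at one position among the mutated sequences."""
    return control_count / n_control * n_mutated


def binomial_two_tailed_p(k, n, p):
    """Two-tailed binomial p value for observing k successes in n trials.

    Sums the probabilities of all outcomes that are no more likely than k.
    The choice of a two-tailed test is an assumption of this sketch.
    """
    pmf = [comb(n, x) * p**x * (1 - p)**(n - x) for x in range(n + 1)]
    threshold = pmf[k] * (1 + 1e-9)  # tolerance for floating-point ties
    return min(1.0, sum(q for q in pmf if q <= threshold))


# Hypothetical example: a base seen in 846 of the 1,692 control sequences.
n_mutated, n_control = 87, 1692
e = expected_count(846, n_mutated, n_control)
print(e)

# Probability of a base under the null, taken from the control proportion.
p_null = 846 / n_control
# p value if the base were observed in 30 of the 87 mutated sequences.
print(binomial_two_tailed_p(30, n_mutated, p_null))
```

A position would then be flagged as showing a significant excess or paucity when the p value falls below the 0.01 threshold stated above.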
Determining the distance to the next downstream in-frame stop codon
The distance to the next downstream stop codon in the required reading frame is likely to determine the length of any extended protein product. For each mutated (nonstop mutation-bearing) DNA sequence and each sequence in the HGMD control dataset, we therefore determined the distance to the next in-frame stop codon downstream. Sequences in the HGMD control dataset, for which the next downstream stop codon was beyond the 3'-UTR sequence available from Ensembl, were not used in this analysis. Distances between 0 and 500 base pairs (bp) from the original stop codon were divided into 'bins', each 50 bp long, the final bin containing all sequences where the distance was greater than 500 bp. The number of sequences which fell into each bin was recorded for both the mutated sequences and the HGMD control sequences. The same procedure was repeated for those sequences with single mutations and for those sequences harbouring two or more mutations. To assess the statistical significance of our findings, we employed Fisher's exact test using a Bonferroni correction to allow for multiple testing. p values of <0.05 were considered to be statistically significant. Using the same method as for the original stop codons, we also investigated the frequency of occurrence of specific nucleotides surrounding the next in-frame stop codon downstream. It is possible that at least a proportion of these downstream in-frame stop codons are associated with naturally occurring splice isoforms of the gene, 17 and might therefore possess comparable sequence characteristics to the stop codons involved in the mutational events. The flanking sequence may also affect the likelihood of a mutation coming to clinical attention.
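The distance and binning steps can be sketched as follows. This is an illustration under stated assumptions rather than the authors' implementation: the distance is taken to be the number of nucleotides between the end of the original stop codon and the first base of the next in-frame stop codon, the final bin is taken to start at 500, and the function names and example sequences are hypothetical.

```python
from collections import Counter

STOP_CODONS = {"TAA", "TAG", "TGA"}


def next_inframe_stop_distance(utr3):
    """Distance (nt) from the end of the original stop codon to the next
    in-frame stop codon, or None if the 3'-UTR sequence contains none.

    utr3 is the sequence that immediately follows the original stop codon.
    """
    seq = utr3.upper()
    for offset in range(0, len(seq) - 2, 3):  # step codon by codon
        if seq[offset:offset + 3] in STOP_CODONS:
            return offset
    return None


def bin_label(distance, width=50, limit=500):
    """Assign a distance to a 50-nt bin; distances of 500 or more share one bin."""
    if distance >= limit:
        return f">={limit}"
    low = (distance // width) * width
    return f"{low}-{low + width - 1}"


# Hypothetical 3'-UTR fragments, for illustration only.
utrs = ["GCATAAGGG", "GTAAGCCTGA", "TGACCC"]
distances = [next_inframe_stop_distance(u) for u in utrs]
# Sequences with no in-frame stop in the available 3'-UTR are excluded.
bins = Counter(bin_label(d) for d in distances if d is not None)
print(distances, dict(bins))
```

The second fragment contains TAA and TGA only out of frame, so it returns no distance and is left out of the bin counts, mirroring the exclusion rule described for the control sequences.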

Relative frequency of stop codon involvement in nonstop mutation
We have performed a meta-analysis of the 119 nonstop mutations (in 87 different genes) known to cause human inherited disease (Supplementary Table S1) and recorded in the HGMD. 1 HGMD is a comprehensive collection of germline mutations causing (or associated with) human inherited disease and is an invaluable source of data for meta-analyses of human gene mutations.
The termination of synthesis of every human protein is effected by one of three stop codons, TAG, TAA and TGA, listed in increasing order of usage in human genes. We posed the question as to whether one of these stop codons might be more susceptible to mutation, or alternatively might be more likely to come to clinical attention once mutated, than the others. We noted that a majority of the nonstop mutations (57 per cent) in our dataset occurred within TGA codons (Table 1). Since 49.4 per cent and 48.6 per cent of stop codons in the HGMD control gene dataset and human genome dataset, respectively, were of this type, however, this finding did not attain statistical significance (Table 1; p values 0.107 and 0.066, respectively).
The proportion of mutations in the other two types of stop codon was also not significantly different from the corresponding proportions in the set of HGMD control gene sequences (p values, 0.674 for TAA and 0.201 for TAG) and in the human genome at large (p values, 0.753 for TAA and 0.88 for TAG).
The above notwithstanding, we speculated that TAA codons flanked on the 3' side by A might be hypermutable, since this would in effect constitute a short polyadenine run. It has been reported that bases adjacent to mononucleotide runs in the human genome are characterised by an increased single nucleotide polymorphism frequency. 18 We therefore assessed whether the nucleotide A following the TAA stop codon might influence the mutability of this codon. In agreement with our postulate, the presence of an A adjacent to a TAA stop codon was indeed found to increase the mutability of this codon by 1.4-fold (p = 0.016).
Genes exhibiting an abundance of missense/nonsense mutations do not harbour a disproportionate number of nonstop mutations
As we have noted above, a total of 18 human genes are known to harbour multiple nonstop mutations. We therefore sought to determine whether this was simply due to a particularly large number of mutations having been reported from these genes. At the time this analysis was performed (October 2010), the HGMD contained mutation data from a total of 2,249 human genes, for which a total of 55,813 missense or nonsense mutations had been reported. No correlation was found, however, between the probability of finding multiple nonstop mutations in a given gene and the total number of missense and nonsense mutations reported for that gene (Pearson's correlation -0.108; p = 0.67). Thus, for example, the largest [...] Table S2). One of the most significantly enriched terms was 'oxidoreductase' (p = 0.005 after Bonferroni correction), which was associated with 11 of the 67 nonstop mutation-harbouring genes identified in the DAVID database. 20 Six terms were found to be significantly enriched (p < 0.001 without correction for multiple testing) for genes harbouring multiple nonstop mutations (Supplementary Table S3); however, no significant bias in gene function was noted for these genes after correction for multiple testing. A search using all nonstop mutation-containing genes revealed an association with the protein information resource (PIR) term 'deafness' (p = 0.0248), corresponding to six of 86 sequences, although the biological relevance of this observation remains unclear.

Mutability of the DNA sequence encompassing the mutated stop codons
The dinucleotide mutabilities within the pentanucleotides flanking the naturally mutated stop codons and the randomly mutated HGMD control stop codons were calculated in order to determine whether there was any bias in the mutability of the various dinucleotides that occur within the three types of stop codon, taking the flanking nucleotides into consideration. A strong positive correlation was noted between the distributions of mutation-harbouring dinucleotides and randomly simulated mutations within the stop codons of HGMD control sequences (Pearson's correlation r = 0.975; p = 8.04 × 10⁻⁸) with respect to the frequencies of 12 dinucleotides. No significant differences were found in dinucleotide-wise comparisons (Table 2), however, indicating that there is no evidence for a nearest nucleotide-directed bias in stop codon mutability.

Sequence context around stop codons that have been subject to nonstop mutations
In eukaryotic cells, the translational efficiency and readthrough potential of the three different stop codons have been reported to vary as a consequence of the influence of the surrounding nucleotide sequence. 21-26 With respect to human gene sequences, Ozawa et al. reported that the first three nucleotide positions after the stop codon are highly conserved, with G and A predominating at the +1 position, and C at the +4 position. 24 Again in the context of human genes, Liu reported a preponderance of C immediately upstream of the stop codon (at position -1) and G or T at position +1. 26 Our HGMD control dataset exhibits similar sequence characteristics to those stop codon datasets reported by Ozawa et al. 24 and Liu. 26 This sequence bias flanking human stop codons represents, in effect, a consensus sequence for the translational termination signal that extends beyond the confines of the stop codon itself. With this in mind, we next examined the flanking sequences of the mutated stop codons in order to ascertain whether the local DNA sequence context could influence the likelihood that the associated nonstop mutations would come to clinical attention. We first examined the frequencies of six nucleotides on either side of the stop codon in both 87 mutated and 1,692 control sequences. When considering the entire stop codon dataset (which includes sequences flanking the TAA, TAG and TGA stop codons on the 5' side at positions -1 to -6, and on the 3' side at positions +1 to +6), we observed a significant paucity of G at the -2 position (p = 0.0063) (Supplementary Table S4). When considering the three types of stop codon separately, there was a significant excess (p = 0.0016) of G and a significant paucity of A (p = 0.0047) two nucleotides downstream of TAA stop codons (Table 3). Similarly, in the regions flanking TGA stop codons, we noted a significant excess of T at the +6 position (p = 0.0094) (Supplementary Table S5).
Although it is conceivable that TAA stop codons with a G at +2 and TGA stop codons with a T at +6 may be more prone to mutate than other sequences, we prefer the alternative explanation, that mutations occurring in TAA and TGA stop codons embedded within these sequence contexts are more likely, for whatever reason, to come to clinical attention. No significant difference was noted between the flanking regions of mutated and control TAG stop codons (data not shown).
The nucleotide frequencies of the flanking regions of the stop codons that harboured single and multiple mutations were also analysed separately, and compared both with the HGMD control dataset and with each other. Supplementary Table S6 presents the comparison of sequences containing only single mutations with sequences in the HGMD control dataset. These sequences exhibit a significant paucity of G at the -2 (p = 0.0078) and -3 (p = 0.0096) positions relative to the controls. However, no significant difference was apparent between those sequences harbouring multiple mutations and controls (data not shown).
Sequence context around the next in-frame stop codon downstream of the stop codons that have been subject to nonstop mutations
The DNA sequences around the next downstream in-frame stop codon were analysed using the same method as described above. The regions flanking the next in-frame stop codons located downstream of the mutated stop codons were compared with their counterparts in the HGMD control sequences. This analysis was performed for each of the three codon types (TAA, TAG and TGA) separately and for all the mutated stop codons combined. When analysing all downstream in-frame stop codons together, a significant excess of T was observed at the +6 position (p = 0.0051; Supplementary Table S7). When the three types of stop codon were examined separately, the only significant difference noted was in the sequences surrounding the next in-frame TGA stop codons, where an excess of C was found at the +6 position (p = 0.0019; Supplementary Table S8), as compared with the TGA codons in the control dataset. Taken together, these findings suggest that, in general, there is no obvious difference between the sequences surrounding the next downstream in-frame stop codons and their counterparts in the HGMD control sequences. However, it is possible that the nucleotide occurring at position +6 relative to the downstream alternative in-frame stop codon could influence the likelihood that a given nonstop mutation might come to clinical attention.
Table 3. Frequency of nucleotides present in regions flanking the mutated TAA stop codon (N = 40). Position 0, corresponding to the stop codon, is not shown. Nucleotide frequencies that are significantly higher/lower (p < 0.01) in comparison with the HGMD control dataset are shown underlined
The distance to the next stop codon is a key determinant of whether a given nonstop mutation will come to clinical attention
We next explored the possibility that the distance from the mutated stop codon to the next in-frame stop codon downstream might influence the likelihood that a given nonstop mutation would come to clinical attention. We reasoned that the greater the distance between the mutated stop codon and the next viable alternative downstream stop codon, the more likely it would be that the mRNA/protein would be unstable/degraded and hence that the nonstop mutation would give rise to a deleterious and clinically observable phenotype. Conversely, the presence of an alternative in-frame stop codon in the immediate vicinity of the mutated natural stop codon could yield a near-normal or at least ameliorated clinical phenotype. Since such phenotypes would be less likely to come to clinical attention, we might therefore expect there to be a paucity of alternative in-frame stop codons in the immediate vicinity of the mutated stop codons as compared with their counterparts derived from the HGMD control sequences. This was, indeed, what was found when mutated and control sequences were compared. Although a relatively strong correlation was noted between the distributions of the distances (Pearson's correlation 0.75; p = 0.008), the number of alternative in-frame stop codons was found to be significantly lower among the mutated sequences than in the controls, but only in the range 0-49 nucleotides downstream of the mutated stop codon (p = 7.81 × 10⁻⁴). This implies that at least some stop codon mutations with alternative stop codons 0-49 nucleotides downstream of the mutated stop codon will not have come to clinical attention, possibly because they will have given rise to stable mRNAs that were (i) not subject to nonstop mRNA decay and (ii) consequently translated into proteins of near-normal length and biological function.
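The per-bin comparison of mutated and control sequences relies on Fisher's exact test with a Bonferroni correction. The sketch below shows one way such a test could be computed with the Python standard library. The counts in the example are hypothetical and are not taken from this study, the two-sided form of the test is an assumption, and so is the figure of 11 tests (ten 50-nucleotide bins plus the final bin).

```python
from math import comb


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact test for the 2x2 table [[a, b], [c, d]].

    Sums the hypergeometric probabilities of all tables with the same
    margins that are no more probable than the observed table.
    """
    row1, row2, col1 = a + b, c + d, a + c
    denom = comb(row1 + row2, col1)

    def prob(x):
        return comb(row1, x) * comb(row2, col1 - x) / denom

    lowest, highest = max(0, col1 - row2), min(row1, col1)
    p_observed = prob(a) * (1 + 1e-9)  # tolerance for floating-point ties
    total = sum(prob(x) for x in range(lowest, highest + 1)
                if prob(x) <= p_observed)
    return min(1.0, total)


def bonferroni(p, n_tests):
    """Bonferroni-adjusted p value, capped at 1."""
    return min(1.0, p * n_tests)


# Hypothetical counts for one distance bin (illustration only):
# rows are mutated and control sequences, columns are sequences
# inside and outside the bin.
p_raw = fisher_exact_two_sided(8, 2, 1, 5)
print(p_raw, bonferroni(p_raw, 11))
```

In this scheme, a bin would be reported as significant only if its adjusted p value stayed below the 0.05 threshold given in the Methods.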
Although the number of in-frame stop codons in the HGMD control dataset approximates to a Zipfian distribution, and steadily decreases with increasing distance from the original stop codon (Figure 1), we noted a significant excess (by comparison with the controls) of downstream in-frame stop codons within 150-199 nucleotides of the mutated stop codon (p = 8.551 × 10⁻⁴). A significant (p = 6.558 × 10⁻⁶) excess of in-frame stop codons within 100-299 nucleotides was also noted as compared with the HGMD controls. One possible explanation could be that the recruitment of these alternative stop codons at an intermediate distance from the mutated stop codon may serve to trigger nonstop mRNA decay, thereby dramatically decreasing the amount of protein product and giving rise to a phenotype that is more likely to come to clinical attention. Confirmation or otherwise of this postulate must await the emergence of a clearer understanding of the mechanism of nonstop mRNA decay in mammalian cells. Figure 2 depicts a comparison of the single (N = 69 in 69 genes) and multiple (N = 18 in 18 genes) nonstop mutations with respect to the distribution of distances to the next downstream in-frame stop codon in each sequence. If those nonstop mutations which occurred within sequences lacking alternative in-frame stop codons in the range 0-49 nucleotides from the mutated codon did indeed display an increased likelihood of coming to clinical attention, then we might reasonably expect those sequences harbouring multiple nonstop mutations to exhibit an even greater paucity of alternative downstream in-frame stop codons in this size range relative to those sequences harbouring only one nonstop mutation.
Although only 18 sequences harboured multiple nonstop mutations (yielding very small sample sizes in each distance category and precluding formal statistical assessment), only one (corresponding to 5.5 per cent of the total number of multiple nonstop mutations) of these sequences bearing multiple nonstop mutations was characterised by an alternative in-frame stop codon within 50 nucleotides downstream of the mutated stop codon, as opposed to 21 sequences with single mutations (30.9 per cent of the total number of single nonstop mutations) (Figure 2). This finding is therefore wholly compatible with our postulate that nonstop mutations occurring within DNA sequences lacking alternative in-frame stop codons in the immediate vicinity of the mutated stop codon display an increased likelihood of coming to clinical attention, possibly because the resulting extended mRNAs are more likely to be subject to nonstop mRNA decay.
Table S8. Frequencies of nucleotides flanking the next