'A variant of uncertain significance' and the proliferation of human disease gene databases

The rapid accumulation of mutation data has led to the creation of nearly 300 locus-specific mutation databases. These sites may contain a few dozen to almost 20,000 mutations for a given gene. Many of the mutations are uncharacterised and have no known effects on the gene product, the 'variant of uncertain significance'. Here, the statistics of mutation distribution are examined for six different gene databases: BRCA1 and BRCA2, haemoglobin-beta (HBB), HPRT1, CFTR and TP53. The percentage of all possible point mutations for a protein (the mutation space) is calculated for each gene and the question 'How much mutation data is enough?' is raised.

The number of people on earth is now more than 6 billion. It is generally understood that every individual has a few mutations in their genome that did not exist in their parents' genomes. An estimate of the spontaneous mutation rate in humans is 1.8 × 10⁻⁸ per nucleotide per generation [1]. With 3.2 billion bases per haploid genome, this would translate into 115 mutations per diploid genome per generation. Multiplying that by 6 billion people gives 690 billion mutations per human generation, or about 216 mutations for each nucleotide in the human genome.
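The arithmetic in this paragraph can be reproduced in a few lines. The sketch below is only a back-of-envelope check, assuming that the 3.2 billion figure is the haploid genome size and that the diploid count is obtained by doubling; the function and constant names are illustrative.

```python
# Back-of-envelope check of the mutation-load figures quoted in the text.
MUTATION_RATE = 1.8e-8      # mutations per nucleotide per generation (from the text)
HAPLOID_GENOME_BP = 3.2e9   # bases per genome, assumed here to mean the haploid genome
POPULATION = 6e9            # people on earth (from the text)


def mutations_per_diploid_genome(rate, haploid_bp):
    """New mutations per person per generation, counting both genome copies."""
    return rate * haploid_bp * 2


def mutations_per_nucleotide(per_person, population, haploid_bp):
    """Population-wide new mutations per generation, per genome position."""
    return per_person * population / haploid_bp


# The text rounds to 115 per person before scaling up to the population.
per_person = round(mutations_per_diploid_genome(MUTATION_RATE, HAPLOID_GENOME_BP))
total_per_generation = per_person * POPULATION
per_site = mutations_per_nucleotide(per_person, POPULATION, HAPLOID_GENOME_BP)

print(per_person)                   # 115
print(total_per_generation / 1e9)   # in billions
print(round(per_site))              # 216
```

The per-site figure depends on the population size and rate estimate, so it is best read as an order of magnitude rather than a precise count.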
Our species is undergoing a massive worldwide mutational experiment, with every nucleotide in our genomes being tested by mutation. What this means is that, if we look hard enough, we should find all non-lethal sequence variants in any given human gene. The genetics research community has begun this search by identifying genes that, when defective, lead to a human disease. The Online Mendelian Inheritance in Man (OMIM) database [2] is a repository for information on human genes. OMIM suggests searching LocusLink [3], limited to human, with the term 'disease_known' to find a human disease gene count (2,440 entries). Refining this search with 'disease_known AND has_seq' will limit the result to human diseases with a known gene sequence. As of 23rd September, 2004, this number is 1,692. The current gene count in humans is 42,716 according to the Genome Alignment and Annotation (GALA) database [4]. Therefore, about 4 per cent of human genes today have a disease associated with them.
To show that a gene is responsible for a disease, a mutation search is usually carried out in patients and unaffected relatives. The location of stop codons, splice-site mutations, in-frame deletions, frameshifts or missense mutations that affect function in patients but not in unaffected individuals proves that the gene is responsible for the phenotype of the disease; however, this is only the beginning. If a disease is common or of longstanding interest to the research community, there is often a database started for that gene to compile the known mutations and phenotypes. Such databases start small, but they can become quite large. The Human Gene Mutation Database [5] has a link to 292 locus-specific mutation databases [6]. Because there is some duplication (such as six different TP53 databases) and some databases cover more than one gene, these 292 links cover a minimum of 257 different human genes. Many are free to the public, but some require registration, a password or subscription, as with the 'hypoxanthine guanine phosphoribosyltransferase-1' (HPRT) database [7,8]. These 257 genes currently represent 15 per cent of the 1,692 human disease genes with known sequences. This list is not comprehensive, since it does not contain the HbVar database of haemoglobin variants [9,10].
What can be expected from intense scrutiny of a gene? As mentioned above, if we look hard enough, mutations in every codon, and even every nucleotide, that do not cause a lethal phenotype should be discovered. In the real world, how close are we to achieving this hypothetical coverage? Presented here are six examples of well-studied genes: BRCA1 and BRCA2, the breast cancer susceptibility genes; haemoglobin-beta (HBB); HPRT1; the cystic fibrosis transmembrane conductance regulator (CFTR); and the TP53 tumour suppressor.

BRCA1 and BRCA2
The BRCA1 and BRCA2 genes are special cases of great human interest. More than 20,000 complete coding sequences of BRCA1 from patients with breast cancer or from their relatives have been determined [11]. This includes some flanking sequence around each of the 24 exons. The protein is 1,863 amino acids long, which is equivalent to 5,589 base pairs (bp) just for the coding sequence. Usually, the test that is ordered involves sequencing both BRCA1 and BRCA2 (27 exons, 3,418 amino acids, 10,254 bp for the coding region) at a cost of $2,975 for the first member of a family and $350 for subsequent members. The cost may be partially covered as part of a health insurance plan. Because Myriad Genetic Laboratories has a patent on the test, they perform all of these tests with uniform quality control, and all of the mutation data goes into one database, the Breast Cancer Information Core (BIC) [12]. Access to the BIC is password-protected for members only. Membership may be obtained by online application [13] and it is clearly intended for researchers, not patients. Results from the BIC have been summarised [14-16]. As of 22nd September, 2004, there were 9,556 entries for BRCA1 and 9,217 entries for BRCA2. Each entry represents a mutation found in one person. Since 20,000 individual sequence tests have been performed [11], one can infer that more than 10,000 of them are without a single mutation, else they would have been counted as an entry in the BIC. For BRCA1, 1,539 of the 9,556 entries are distinct mutations, polymorphisms or variants. The remainder are duplicates, and some are very common founder mutations. Of the 1,539 unique mutations, 878 were observed only once (57 per cent); these are the rare variants mentioned above and are seen once in 20,000 sequences (or even at a lower frequency, but discovered by accident in 20,000 sequences). By examining the BIC database for mutations in the coding region, it was shown that 850 of the 1,863 codons contain at least one mutation (46 per cent).
On the cDNA mutation map, the largest non-mutated region is 43 bp, from nucleotides 4,530 to 4,572. The mutations on this histogram plot are so thick that the whole gene looks like a dense bar code.
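The BRCA1 percentages quoted above follow directly from the BIC counts. The short sketch below recomputes them; it assumes, as the text does, that each BIC entry corresponds to a different tested person and that 20,000 is a lower bound on the number of tests. All names are illustrative.

```python
# BRCA1 figures quoted in the text for the BIC (22nd September, 2004).
TESTS_PERFORMED = 20_000     # "more than 20,000" tests, used here as a lower bound
BRCA1_ENTRIES = 9_556        # one entry = one mutation found in one person
DISTINCT_VARIANTS = 1_539    # distinct mutations, polymorphisms or variants
SEEN_ONCE = 878              # distinct variants reported only once
CODONS_TOTAL = 1_863
CODONS_HIT = 850             # codons with at least one reported mutation


def percent(part, whole):
    """Return part/whole as a percentage."""
    return 100.0 * part / whole


# Tests that produced no BIC entry (assumes at most one entry per person).
tests_without_entry = TESTS_PERFORMED - BRCA1_ENTRIES
# Entries that repeat an already known variant, including founder mutations.
duplicate_entries = BRCA1_ENTRIES - DISTINCT_VARIANTS
singleton_pct = percent(SEEN_ONCE, DISTINCT_VARIANTS)
codon_hit_pct = percent(CODONS_HIT, CODONS_TOTAL)

print(tests_without_entry, duplicate_entries)
print(round(singleton_pct), round(codon_hit_pct))
```

If some individuals carry more than one reported variant, the number of tests without any entry would be higher than this estimate.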
BRCA2 is a longer gene. It has 1,893 distinct mutations, polymorphisms or variants, with 1,146 reported only once (60 per cent); this is very similar to BRCA1. In BRCA2, 1,323 of the 3,418 codons have at least one mutation (39 per cent). From the cDNA mutation map, the largest non-mutated region is 48 bp, from nucleotides 1,282 to 1,329.
This brings us back to the earlier question of how saturated our mutation catalogue is. To discuss this, I will introduce the concept of mutation space. Considering only single-base changes, and leaving out insertions and deletions, if each codon can theoretically be mutated by nine different one-base changes (three positions, each with three alternative bases), then the mutation space for a protein-coding gene is nine times the number of codons. For illustrative purposes, the number of distinct mutations in BRCA1 is 1,539. The mutation space is 9 × 1,863 codons = 16,767. Thus, 1,539 distinct mutations is 9 per cent of the mutation space, although that number does include frameshift mutants, so the 9 per cent value is slightly inflated. For BRCA2, the percentage of mutation space is only 6 per cent. In other words, more than 90 per cent of the possible mutations are yet to be discovered. Because the sampling of this space is so low, a high percentage of the mutants found in BRCA1 and BRCA2 screening will never have been seen before. One-of-a-kind missense mutations that do not truncate the protein will be the so-called 'variants of uncertain significance'.
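The mutation space defined here is simple enough to express in code. The following is a minimal sketch, not a tool used in the original analysis: it enumerates the nine single-base neighbours of a codon to show where the factor of nine comes from, then computes the covered fraction for BRCA1 and BRCA2 from the counts in the text. The function names are illustrative.

```python
BASES = "ACGT"


def single_base_neighbours(codon):
    """All codons that differ from `codon` by exactly one base substitution."""
    codon = codon.upper()
    return [
        codon[:i] + base + codon[i + 1:]
        for i in range(3)
        for base in BASES
        if base != codon[i]
    ]


def mutation_space(codons):
    """Number of possible single-base substitutions in a coding sequence."""
    return 9 * codons


def coverage_percent(distinct_mutations, codons):
    """Share of the mutation space covered by the observed distinct mutations."""
    return 100.0 * distinct_mutations / mutation_space(codons)


# Three positions times three alternative bases gives nine neighbours.
print(len(single_base_neighbours("GAG")))

# Counts from the text; they include some non-point variants (for example
# frameshifts), so the percentages are slight overestimates.
brca1_pct = coverage_percent(1539, 1863)
brca2_pct = coverage_percent(1893, 3418)
print(round(brca1_pct), round(brca2_pct))
```

The definition treats every substitution as equally reachable and counts synonymous and nonsense changes alongside missense ones, which matches the usage in the text.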
Patients who have such a mutation are given a pamphlet entitled Testing for Hereditary Cancer Risk: WHAT DOES A "VARIANT OF UNCERTAIN SIGNIFICANCE" MEAN? [17] This is a looming question for geneticists faced with mutation data but no experimental data. The paper of Abkevich et al. [11] tries to address this by bioinformatics methods. Of 314 missense mutations in BRCA1 from the 20,000 sequence tests, they state that only 21 are classified as deleterious, or suspected deleterious, mutants. By comparing human BRCA1 with orthologues from other species, however, and by taking into account the properties of the side chains in the normal and mutated amino acids, they predicted that 50 more of these missense mutations would be suspected to be deleterious. Of the 243 other missense mutants, 14 had previously been deemed neutral or harmless. Their analysis added 92 more mutants into this category, for a total of 177/314 mutants with predictions either of deleterious, suspected deleterious or of little clinical significance. The remaining 137 mutants are still unclassified. The authors caution that this method only gives predictions, and that care must be used in interpretation in a clinical setting. It should be remembered that patients are making decisions about mastectomies, oophorectomies, chemotherapy options and radiation treatments based, in part, on this mutation information. Another study, by Goldgar et al. [18], also uses a bioinformatics approach and integrates the results from several analyses into a weighted probability of significance. Their model was applied to three mutants each from BRCA1 and BRCA2, allowing a risk to be assigned to five of the six mutants. It is not clear how the two methods will compare against each other.
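The bookkeeping behind the 177/314 figure is easy to lose track of, so the tally is laid out below. This is only a restatement of the counts reported in the text for the Abkevich et al. analysis; the dictionary keys are illustrative labels, not the authors' own category names.

```python
# Counts of BRCA1 missense mutations as reported in the text.
TOTAL_MISSENSE = 314

classified = {
    "deleterious_previously_known": 21,   # deleterious or suspected deleterious
    "deleterious_newly_predicted": 50,
    "neutral_previously_known": 14,       # deemed neutral or harmless
    "neutral_newly_predicted": 92,
}

deleterious_total = (
    classified["deleterious_previously_known"]
    + classified["deleterious_newly_predicted"]
)
neutral_total = (
    classified["neutral_previously_known"]
    + classified["neutral_newly_predicted"]
)
classified_total = deleterious_total + neutral_total
unclassified = TOTAL_MISSENSE - classified_total

print(deleterious_total, neutral_total, classified_total, unclassified)
print(round(100.0 * unclassified / TOTAL_MISSENSE))  # per cent still unclassified
```

Even after the predictions are added, well over a third of the missense mutations remain without any classification.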

Haemoglobin-beta
The second gene example is haemoglobin-beta (HBB). This was one of the earliest human sequences to be studied; in fact, many of the mutations were found by protein sequencing methods, rather than DNA sequencing. HBB is among the most intensively studied human proteins; mutations in it are responsible for sickle-cell anaemia and the beta-thalassaemias. The OMIM entry for HBB has 1,157 references and entries for 522 allelic variants [19]. In 1996, Huisman et al. stated that 138 of the 146 codons of the HBB gene have been mutated [20]. Inspection of the HbVar database [9,10] reveals that this number is now 141/146 codons, with only Thr-4, Thr-12, Thr-50, Pro-125 and Val-137 not having a known mutation in humans. Some codons have five or six different mutations. Note that the Val after the start Met has been assigned as amino acid 1. HBB has 461 distinct point mutants in 146 codons; this equals 35 per cent of the mutation space (9 × 146 = 1,314 possible point mutants). This value is much higher than that seen in the larger BRCA1 and BRCA2 genes. One observation that affects this value is the near-complete lack of synonymous substitutions in this database, which reflects the protein rather than DNA sequencing analyses. There are only two synonymous substitutions reported: both are Gly to Gly mutations. Certainly synonymous substitutions have been observed, but they do not seem to be reported. Including synonymous mutations would make the 35 per cent figure significantly higher.
Examination of the Pfam database of 7,503 protein family sequence alignments [21,22] for the globin family reveals that Thr-12 was seen as a Ser in the mouse-eared bat (Myotis velifer) and Thr-50 was also often seen as a Ser, even in primates such as Macaca mulatta. Pro-125 was almost always Gln, except in a few primates such as the gorilla, human and chimpanzee. Val-137 was invariant in all mammalian HBB sequences. Val-137 was an Ile in the South American lungfish (Lepidosiren paradoxus) and Ile or Leu in some frog HBB sequences. Thr-4 was outside the alignment edge in Pfam, but rabbits and black lemur (Eulemur macaco) both had a Ser at this position. These observations show that at least four of these five amino acids do vary in other mammals and probably could vary in humans without serious consequences. Val-137 seems to be the most invariant of the five, and it may have a significant effect if mutated to another residue.

HPRT1
Mutations in the HPRT1 gene lead to the Lesch-Nyhan syndrome, a defect in the purine salvage pathway. The protein is fairly small, with only 218 amino acids (217 in the mature protein), yet there are 2,500 mutations reported in this gene [7,8]. The database for HPRT1 is available only by subscription, so, without access, one is limited to published reports that give a detailed breakdown of the types of mutations and the number of unique mutations. In public sources, a total of 218 different mutations have been found in 271 patient cases studied [23]. The Human Gene Mutation Database now lists 223 mutants [5], with 115 being missense or nonsense point mutants in 80 codons (about 6 per cent of the mutation space). In people, very few mutants have been found independently more than once. Somatic functional mutations in HPRT1 can be selected in vitro in 6-thioguanine-resistant T lymphocytes from normal people. These are not naturally occurring mutants; they arise by inactivation of the HPRT1 gene. Inactivation prevents this purine salvage pathway enzyme converting 6-thioguanine to a toxic nucleotide analogue. These in vitro mutants make up the majority of the HPRT1 database that is not publicly available. Since they are inactivating mutations, they have to be at critical sites in the gene. The study by Duan et al. examined the known mutations (including patient mutations) and related them to the crystal structures of the HPRT protein [24]. Duan et al. report that 155 of the 217 amino acids (72 per cent) have known missense mutations that cause an amino acid change: these arise from a subset of 963 single-base-substitution mutants that cause missense mutations. The number of synonymous mutations in this set is not given. In addition to the 963 missense mutations, there are 51 (of a possible 66) nonsense mutations in the database; this gives a total of 1,014 single-base mutations in 217 codons for 52 per cent of the mutation space (9 × 217 = 1,953).
After accounting for missense, nonsense and synonymous mutations, there remain 46/217 codons with no known mutations.
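The HPRT1 totals combine several separately reported counts, and the sketch below shows how they fit together. It uses only the numbers quoted in the text and assumes that the 963 missense and 51 nonsense substitutions are distinct single-base changes, as the text's calculation implies; variable names are illustrative.

```python
# HPRT1 counts quoted in the text (Duan et al. and the mutation database).
CODONS = 217                  # mature protein
MISSENSE = 963                # single-base substitutions causing missense changes
NONSENSE = 51                 # nonsense mutations observed
NONSENSE_POSSIBLE = 66        # nonsense mutations theoretically possible
CODONS_WITH_MISSENSE = 155
CODONS_WITH_NO_MUTATION = 46  # after missense, nonsense and synonymous changes

single_base_total = MISSENSE + NONSENSE
space = 9 * CODONS
coverage_pct = 100.0 * single_base_total / space
nonsense_found_pct = 100.0 * NONSENSE / NONSENSE_POSSIBLE

# Codons lacking a missense mutation, and how many of those are nevertheless
# hit by a nonsense or synonymous mutation.
codons_without_missense = CODONS - CODONS_WITH_MISSENSE
codons_hit_by_other_types_only = codons_without_missense - CODONS_WITH_NO_MUTATION

print(single_base_total, space, round(coverage_pct))
print(round(nonsense_found_pct))
print(codons_without_missense, codons_hit_by_other_types_only)
```

On these figures, about three-quarters of the possible nonsense mutations have already been observed, a much higher saturation than for the mutation space as a whole.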

CFTR
There are currently 1,338 distinct mutations listed in the CFTR Mutation Database [25]. By examining the list of mutations and not including the mutations in non-coding regions or insertions and deletions, there were 702 unique point mutations in 501 of the 1,481 codons (34 per cent). This covers about 5 per cent of the CFTR gene's mutation space, which is very similar to that of the BRCA2 gene.

TP53
The International Agency for Research on Cancer (IARC) TP53 Mutation Database [26,27], version R9 of June 2004, contains 19,809 somatic mutations (74 per cent missense, 7 per cent nonsense) and 264 germline mutations. More than 1,700 different point mutations at more than 310 distinct codons (out of a total of 393 codons) have been found, as described in the slideshow on the database site. The exact numbers were not easily obtainable from the database. The 1,700 mutations represent 48 per cent of the mutation space for 393 codons (9 × 393 = 3,537); this is an even higher percentage than that seen in the HBB gene. This number includes synonymous mutations that do not change the amino acid. TP53 is unusual in the distribution of its mutations. The vast majority of the mutations are in the middle third of the gene, coding for the DNA-binding portion of the protein.
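With all six genes described, their coverage of mutation space can be compared side by side. The sketch below gathers the counts quoted in the preceding sections into one table; it is a summary under the text's own simplifications (nine substitutions per codon, TP53's 1,700 taken as a lower bound, BRCA1 and BRCA2 counts including some non-point variants), and the data structure is illustrative.

```python
# (distinct mutations, codons) for each gene, as quoted in the text.
GENES = {
    "BRCA1": (1539, 1863),
    "BRCA2": (1893, 3418),
    "HBB": (461, 146),
    "HPRT1": (1014, 217),   # missense plus nonsense, including in vitro mutants
    "CFTR": (702, 1481),
    "TP53": (1700, 393),    # "more than 1,700", so a lower bound
}


def coverage_percent(distinct_mutations, codons):
    """Share of the 9-per-codon mutation space that has been observed."""
    return 100.0 * distinct_mutations / (9 * codons)


coverage = {
    gene: coverage_percent(mutations, codons)
    for gene, (mutations, codons) in GENES.items()
}

# Rank the genes from most to least saturated.
ranked = sorted(coverage, key=coverage.get, reverse=True)
for gene in ranked:
    mutations, codons = GENES[gene]
    print(f"{gene:6s} {mutations:5d} of {9 * codons:6d}  {coverage[gene]:5.1f}%")
```

The ranking separates the small, long-studied genes (HPRT1, TP53, HBB) from the large ones (BRCA1, BRCA2, CFTR), where most of the mutation space is still unsampled.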

Interpreting the data
The data on these six genes present a trend that will be followed with other disease genes. More and more sequence data will be collected until a high percentage of all the possible mutations are found. We could simply perform a 'thought experiment' and ask: 'What does one do when all possible (non-lethal) point mutants in a gene are documented in real people and the missing mutants are made and tested for function?' Nearly all of these mutants will be 'variants of uncertain significance'.
We are heading in this direction, as indicated by the 12th February, 2004 Request for Application (RFA) from the National Institutes of Health, entitled 'Revolutionary Genome Sequencing Technologies: The $1,000 Genome' [28]. Once the low-cost genome is available, perhaps in 10 years, millions of human genomes will have been sequenced and every gene will have sequence data exceeding what is now available for the TP53 gene. All cancer patients will be reading the same pamphlet.
There are several approaches to addressing the massive information overload. First, genotype must be linked to phenotype by examining family histories; this will already have been done for the best-studied disease genes. Of course, this is not possible for a mutation that is seen only once. Secondly, the bioinformatics approach [11,18] will predict a mutation's probable significance by sequence comparison to orthologues and by rating the severity of an amino acid change. An enhancement of this method would be to scan for any sequence motifs for structure or targeting that may be affected by the mutation. Thirdly, examine the crystal structure of the protein, or of a close relative of that protein if known, to see if the mutation would cause a significant structural effect. Fourthly, assay the mutant protein in a variety of function tests or apply spectroscopic methods to detect changes in chromophores. The assay method may not be possible if the function of the protein is unknown. In many cases, the primary biology of the protein will need to be discovered, that is, the pathway in which it participates, any interacting proteins, expression levels, subcellular localisation, post-translational modifications, rates of turnover, etc. This is really an outline for the biochemistry, cell biology and molecular biology of human beings at the level of every gene. Such understanding only comes slowly, whereas sequencing is far faster. In the interim, we will have more data than we can possibly use. Considering the cost of analysing each mutation, there must be some point that crosses a practical return on investment. Do we know everything we need to know about mutants of haemoglobin-beta, or should we make the remaining 843 point mutants just to be sure we have not missed anything? Nicholas Murray Butler (1862-1947), president of Columbia University, wrote: 'An expert is one who knows more and more about less and less'. The question ultimately becomes, how 'expert' do we need to be?