Update on the aldehyde dehydrogenase gene (ALDH) superfamily

Members of the aldehyde dehydrogenase gene (ALDH) superfamily play an important role in the enzymic detoxification of endogenous and exogenous aldehydes and in the formation of molecules that are important in cellular processes, like retinoic acid, betaine and gamma-aminobutyric acid. ALDHs exhibit additional, non-enzymic functions, including the capacity to bind to some hormones and other small molecules and to diminish the effects of ultraviolet irradiation in the cornea. Mutations in ALDH genes leading to defective aldehyde metabolism are the molecular basis of several diseases, including gamma-hydroxybutyric aciduria, pyridoxine-dependent seizures, Sjögren-Larsson syndrome and type II hyperprolinaemia. Interestingly, several ALDH enzymes appear to be markers for normal and cancer stem cells. The superfamily is evolutionarily ancient and is represented within Archaea, Eubacteria and Eukarya taxa. Recent improvements in DNA and protein sequencing have led to the identification of many new ALDH family members. To date, the human genome contains 19 known ALDH genes, as well as many pseudogenes. Whole-genome sequencing allows for comparison of the entire complement of ALDH family members among organisms. This paper provides an update of ALDH genes in several recently sequenced vertebrates and aims to clarify the associated records found in the National Center for Biotechnology Information (NCBI) gene database. It also highlights where and when likely gene-duplication and gene-loss events have occurred. This information should be useful to future studies that might wish to compare the role of ALDH members among species and how the gene superfamily as a whole has changed throughout evolution.


Introduction
The aldehyde dehydrogenase gene (ALDH) superfamily is represented in all three taxonomic domains (Archaea, Eubacteria and Eukarya), suggesting a vital role throughout evolutionary history. Our understanding of the biological roles of this superfamily continues to expand in ways that are often unexpected and, perhaps, unprecedented for an enzyme family. As implied by their name, members of this superfamily serve to metabolise both physiologically and pathophysiologically relevant aldehydes. This capacity prevents the accumulation of toxic aldehydes derived from endogenous production and/or exogenous exposures, which, if left unchecked, adversely affect cellular homeostasis and organismal functions. 1 ALDH activity is also required for the synthesis of vital biomolecules through the metabolism of aldehyde intermediates, such as retinoic acid, folate and betaine, to name a few. 2-4 Whereas the ability of the ALDH family members to metabolise reactive aldehydes represents a major underlying cytoprotective mechanism, it is important to recognise that ALDHs demonstrate functions that extend beyond detoxification. Accumulating evidence supports roles for ALDHs in the modulation of cell proliferation, differentiation and survival, especially through participation in retinoic acid synthesis. 2 Members of this superfamily also exhibit functions that appear to be independent of their enzyme activity, including absorption of ultraviolet (UV) irradiation in the cornea by acting as a crystallin and binding to hormones and other small molecules, including androgens, cholesterol, thyroid hormone and acetaminophen. 2,5,6
Sequencing of the human genome and subsequent identification of mutations in ALDH genes associated with loss of ALDH enzyme activity have led to the identification of many disease associations, such as cataracts (ALDH1A1, ALDH3A1, ALDH18A1), seizures (ALDH7A1), hyperprolinaemia (ALDH4A1), heart disease (ALDH2), alcohol sensitivity (ALDH1A1, ALDH1B1, ALDH2), certain cancers (ALDH2) and a broad array of other metabolic and developmental abnormalities. 2 Recently, a role for ALDHs in normal and cancer stem cells has also been identified. For example, ALDH1A1 is differentially expressed in human haematopoietic stem cells (HSCs) and can be used as a stem cell marker for multiple cancers. 2 Similarly, ALDH1B1 is primarily expressed in stem cells in the normal colon and is strongly upregulated in human colonic adenocarcinomas. 7,8 As described by Nelson and colleagues, 9 genomic gene artefact identification becomes very important when using genotyping techniques to identify disease-causing alleles. Gene-duplication events, leading to multiple functional and/or non-functional genetic copies in the genome, can significantly complicate polymerase chain reaction (PCR)-based genotyping assays. Transgenic animal models have permitted the exploration of the functions of ALDHs under in vivo physiological and pathophysiological conditions. 2 These invaluable studies are heavily dependent upon our understanding of the mouse and human genomes. In addition to mutations in ALDH genes within populations, there is a large variation in the number of ALDH genes between species.
During the past decade, the availability of gene and protein information has grown rapidly, primarily due to advances in gene-sequencing technologies. The 2002 update of ALDH superfamily members 10 listed 555 ALDH genes, including 32 from Archaea, 351 from Eubacteria and 172 from Eukarya. Characteristic ALDH motifs were searched in 74 genomes: 16 in Archaea, 51 in Eubacteria and seven in Eukarya. A recent download from the current Pfam database (build version 24.0) includes 16,765 ALDH entries (listed as aldedh in the Pfam database). 11 This update focuses on 11 representative vertebrate species in which the full genome has been sequenced: five primates, the cow, two rodents, two birds and one fish. Many of these genomes have been annotated automatically, and permissive annotation algorithms can list pseudogenes as protein-coding genes. This update attempts to describe the ALDH complement within these organisms and identify pseudogenes and gene-duplication events, when possible.
ALDH genes were retrieved from Entrez Gene 12 using the terms 'ALDH' or 'aldehyde dehydrogenase'. Peptide sequences for each ALDH gene were retrieved from Entrez Protein 12 and aligned against a reference list of ALDH family members, including known human ALDHs and sequences from the NCBI's HomoloGene 12 using ClustalW. 13 To be included for description, a gene record was required to meet three criteria: 1) the protein product of the gene must be 'full-length' (ie excludes known fragments and partial records); 2) the gene must have a known unique chromosomal location on the annotated genome; and 3) the gene must be listed as protein-coding (ie excludes known pseudogenes).
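The three inclusion criteria above can be expressed as a simple filter. The following is a minimal sketch only: the `GeneRecord` layout, its field names and the example records are hypothetical illustrations, not the NCBI Entrez Gene schema or data from this study.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class GeneRecord:
    # Hypothetical record layout; field names are illustrative assumptions.
    symbol: str
    is_full_length: bool        # criterion 1: not a fragment or partial record
    chromosome: Optional[str]   # criterion 2: None when the record is unplaced
    gene_type: str              # criterion 3: e.g. "protein-coding" or "pseudo"


def meets_inclusion_criteria(rec: GeneRecord) -> bool:
    """Return True only if all three criteria described in the text hold."""
    return (
        rec.is_full_length
        and rec.chromosome is not None
        and rec.gene_type == "protein-coding"
    )


# Made-up example records, one failing each criterion in turn.
records = [
    GeneRecord("gene_ok", True, "9", "protein-coding"),
    GeneRecord("gene_unplaced", True, None, "protein-coding"),
    GeneRecord("gene_pseudo", True, "5", "pseudo"),
    GeneRecord("gene_fragment", False, "3", "protein-coding"),
]

included = [r.symbol for r in records if meets_inclusion_criteria(r)]
print(included)
```

Only the first record passes, because each of the others fails exactly one criterion.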
Parent genes were designated based on highest homology to the known human protein. Identified gene duplications were sequentially named according to nomenclature guidelines, based on decreasing sequence homology to the parent gene. Duplicated genes were further analysed to determine if they represented potentially new protein-coding genes or non-functional pseudogenes. Pseudogenes were identified according to criteria outlined previously 9 and assigned to the following categories: detritus pseudogenes (those which are fragments missing exons) and reverse-transcriptase events (those which resemble mRNA sequences and lack introns). If data suggested that a duplicated gene was protein coding, it was considered to be a new gene family member and named according to the previously established ALDH nomenclature system. 14 Zebrafish aldh genes were named according to the guidelines set out by the zebrafish nomenclature committee (http://www.zfin.org). 15 Pseudogenes in rodent (or fish) and non-rodent/non-fish genomes were appended with the suffix 'p' or 'P', respectively, and followed by a number designating multiple pseudogenes for a given gene family within each individual species.
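The pseudogene suffix rule described above can be sketched in a few lines. This is an illustration under stated assumptions: the function name, the `species_group` labels and the choice to pass a gene stem (for example `aldh3a2` rather than `aldh3a2.1`) are hypothetical and are not part of any published nomenclature tool. The three example names are those that appear elsewhere in this update.

```python
# Species groups that take the lower-case 'p' suffix, per the rule in the text.
LOWERCASE_SUFFIX_GROUPS = {"rodent", "fish"}


def pseudogene_symbol(parent_stem: str, index: int, species_group: str) -> str:
    """Append 'p' (rodent or fish) or 'P' (all other species) plus a running number."""
    if index < 1:
        raise ValueError("pseudogene numbering starts at 1")
    suffix = "p" if species_group in LOWERCASE_SUFFIX_GROUPS else "P"
    return f"{parent_stem}{suffix}{index}"


examples = [
    pseudogene_symbol("ALDH1A3", 1, "cow"),      # cow detritus pseudogene
    pseudogene_symbol("aldh3a2", 1, "fish"),     # zebrafish partial duplication
    pseudogene_symbol("ALDH7A1", 5, "primate"),  # macaque retro-duplication
]
print(examples)
```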
It is, again, important to underscore that this initial analysis should be considered preliminary and subject to change as experimental evidence sheds light on actual protein function. Alignment and clustering of protein sequences were used as a basis for assigning homology. Sequences were aligned, and dendrograms based on neighbour-joining distances were created using a ClustalW webserver at http://align.genome.jp. Percentage amino acid (AA) identities were determined using the Needle webserver at http://www.ebi.ac.uk/Tools/emboss/align/. 16 To assess whether protein sequences were actively transcribed, we employed several methods. Numerous promoter-prediction programs were used, but none was sufficiently consistent across species or discriminatory to be useful in the prediction of pseudogenes. The ratio of nonsynonymous to synonymous (Ka/Ks) nucleotide-substitution rates was used as a measure of selective pressure on each individual gene. Rates were calculated using homologous genes for all species in the current analysis, in order to determine ancestral states using the Bergen Center Ka/Ks Calculation Tool (http://services.cbu.uib.no/tools/kaks/) and default values, with the exception that the tree method was set to maximum likelihood. 17 Copy number variants (CNV; defined here as gains and losses of DNA sequences >1 kilobase [kb]), insertions and deletions (InDels; gains and losses of DNA sequences of 100-999 base pairs [bp]), and inversions in human ALDH genes were retrieved from the Database of Genomic Variants. 18

Table 1. List of all species examined in the current study, including the Latin name and common name and the number of unique ALDH genes found in each species. The data reflect the number of gene records found in the NCBI Gene Entrez database for each species, as of 13th March 2011.

Latin name         Common name          # ALDH genes
Homo sapiens       Human                19
Pan troglodytes    Common chimpanzee    18

Results
Records for ALDH genes were retrieved and sorted for all 11 species analysed (Table 1). The number of records that met the above-mentioned criteria is provided (ie the number of genes excluding nonfunctional pseudogenes). The number of ALDH genes per species varied from 14 in chicken to 25 in zebrafish. There are currently 207 distinct genes present within the database for these 11 species; this is a greater than fourfold increase from 2002, when only 51 were annotated. 10 This allows for a much more comprehensive comparison of ALDH superfamily members throughout vertebrate evolution during the past 450 million years. It is important to keep in mind that, for many species, some genes have yet to be identified. Further, many annotated genes may reflect gene-duplication events that represent non-functional pseudogenes. These situations will be explored in greater depth below. The total number of human annotations has remained unchanged since 2005, with 19 functional protein-coding genes. 19 The chimpanzee and the orangutan genomes diverged from humans 5 and 14 million years ago (MYA), respectively. 20,21 Both the chimpanzee and orangutan genomes contain 18 ALDH genes, each corresponding to a known human orthologue. The macaque and common marmoset genomes are more distantly related. They diverged 25 and 35-40 MYA 22 and contain 20 and 16 ALDH members, respectively. Orthologues for all 19 human genes were identified in mouse and rat. In addition, rodent genomes contain an Aldh1a1 paralogue (Aldh1a7) and an Aldh3b2 gene duplication, resulting in a total of 21 Aldh genes. The most recent common ancestor of humans and rodents lived 75-90 MYA.
The cow genome, which diverged from that of the human 80-100 MYA, has 20 annotated ALDH entries which, again, closely parallel human members. Variations include two gene duplications and one possible deletion. Both avian genomes currently lack orthologous entries for ALDH1A1, ALDH1B1, ALDH1L1, ALDH3A1, ALDH3B2 and ALDH16A1. Moreover, the zebra finch genome is also missing annotated sequences for ALDH18A1 and includes two apparent gene duplications. Table 2 summarises these ALDH orthologues, their chromosomal locations and the associated NCBI Entrez gene identification (ID) number for each of the 11 species. For zebrafish, Entrez gene ID 100334142 was listed as 'aldehyde dehydrogenase 1A1-like [D. rerio]'. This gene record appears to be derived from an unplaced chromosomal fragment, however, because no genome location could be determined. In addition, alignment of the peptide sequence for this gene ID to other mammalian ALDH1A1 protein sequences was poor. Specifically, sequence homology with human, mouse and rat ALDH1A1 was only 26.2 per cent, 26.4 per cent and 26.8 per cent, respectively. NCBI BlastP analysis indicated that it most closely resembles bacterial ALDH proteins. Together, this evidence suggests that this record may represent bacterial contamination, rather than a true zebrafish gene; thus, we have not included this gene. This also makes the zebrafish the only species among the 11 analysed that lacks a record for ALDH1A1. Interestingly, a protein blast (blastp) search using human ALDH1A1 and limiting results to fish species only (NCBI taxid: 7898) revealed ALDH1A2 homologues in multiple species (including salmon, pufferfish, ricefish and bichir), but no records for ALDH1A1 in any fish species. This is consistent with previous findings that indicate that ALDH1A1 is not present in the teleost lineage. 23 We found evidence for several gene duplications. Table 3 lists all genes that show duplications, compared with genes in the human genome. 
This table provides a summary of existing information available within the NCBI gene entries, as well as recommended gene names based on our analyses and current nomenclature guidelines. Table 4 lists additional information related to peptide sequences and calculated sequence identities. Additional genes (increase in gene number, compared with humans) show peptide divergence of as little as 0.4 per cent (zebrafish aldh2.2 and aldh2.3) and as much as 64.9 per cent (zebrafish aldh3a2.1 and aldh3a2.2). In most cases, gene duplications have similar sizes, are often nearby on the same

ALDH1
ALDH1A1 is present in all species except zebrafish, confirming earlier studies. 23 In cow, there are two distinct records for ALDH1A3: the gene found on Chr 21 is full length (537 AAs) and represents the putatively functional parent gene (ALDH1A3), whereas the second is a detritus pseudogene on Chr 28 which appears to be the product of a partial gene-duplication event (ALDH1A3P1). The shorter genomic sequence would translate a peptide sharing 100 per cent sequence identity to only the 127 carboxy-terminal AAs of the full-length parent protein. Several gene duplications appear to have been conserved in rodents. One such gene is ALDH1A7, found in rats and mice. In both cases, the ALDH1A7 gene is present on the same chromosome and in close proximity to ALDH1A1. Mouse ALDH1A7 shares 92 per cent AA identity with mouse ALDH1A1, and studies have confirmed that the gene encodes inducible tissue-specific mRNA. 24 ALDH1B1 is present in mammals but missing from birds and fish. ALDH1L1 is missing from both bird species (zebra finch and chicken) but present in other species examined and thus may represent a deletion in the avian lineage.

ALDH2
ALDH2 appears to be one of many genes duplicated in zebrafish. It has been suggested that an entire genome duplication event may have occurred after the divergence of teleosts and mammals; 25

ALDH3
The ALDH3 genes show the most variation in gene number of any ALDH family among the organisms studied. ALDH3A1 is missing from birds and fish but is present in every mammalian genome analysed in this study. The zebra finch has a duplicate ALDH3A2 (ALDH3A3) entry which encodes a full-length peptide that shares 84.1 per cent identity with the parent protein. Four ALDH3A2 homologues were identified within the zebrafish genome. The aldh3a2.1 is considered the parent gene. The aldh3a2.2 and aldh3a2.3 full-length gene products, respectively, share 64.9 per cent and 70.9 per cent sequence identity with that of Aldh3a2.1 and 64.9 per cent identity with each other. Zebrafish aldh3a2p1 represents a partial gene duplication; the resulting 169-AA peptide would most likely undergo proteolytic degradation if translated. ALDH3B1 is duplicated in cow and zebra finch, as well as in zebrafish, on the proviso that D. rerio aldh3d1 is also considered an ALDH3B1 homologue. Zebrafish Aldh3d1 shares 44 per cent AA identity with Aldh3b1 and is listed in NCBI HomoloGene as a homologue of ALDH3B1 (HomoloGene, data not shown). 12 Zebra finch ALDH3B5 encodes a 341-AA peptide that shares 100 per cent sequence identity with the 228 amino-terminal AAs of the parent gene's protein.
Cow and zebra finch ALDH3B4 and ALDH3B5 proteins share 80.9 per cent and 53.2 per cent sequence identity with their respective parent genes, and 39.7 per cent with one another, indicating that none of the genes is an orthologue. Zebra finch ALDH3B5 is shorter than ALDH3B1 (341 versus 450 AAs) and, without this sequence gap, they share 93.1 per cent AA identity; it is unknown whether this smaller gene product is functional.
ALDH3B2 is present as a single distinct gene in human, chimpanzee and macaque, whereas two copies occur in mouse and rat. ALDH3B2 is absent from common marmoset, cow, zebra finch, chicken and zebrafish. Mouse and rat ALDH3B3 share 86.4 per cent and 76.9 per cent AA identity, respectively, with the corresponding parent ALDH3B2 proteins and 83.4 per cent identity with each other. The two ALDH3B3 genes are found on corresponding syntenic chromosomes within their respective genomes. Presently, the protein product of Entrez Gene ID 688778 (R. norvegicus) is annotated as 'ALDH3B1 (predicted)'. Based on a phylogenetic clustering of ALDH3B1 and ALDH3B2 protein sequences (Figure 1), however, we believe it is better to name this protein ALDH3B3: the clustering shows that both mouse and rat ALDH3B3 proteins are in the ALDH3B2 clade and are more similar to each other than to rodent or human ALDH3B2 proteins. The alignment used for phylogenetic clustering can be seen in Supplementary Figure S1.

ALDH4
ALDH4A1 is missing from chimpanzee and common marmoset but is present in all others. Previously, rat ALDH4A1 had been conspicuously absent from the major databanks but it was recently added. During a BLAST search of the rat genome using various individual exon segments from mouse Aldh4a1, significant hits for Aldh4a1 in the rat genome were identified on Chr 5q36 and it was determined to be a part of the fusion gene LRRP Ba1-651. 26 Figure 2 shows an assembled structure of this fusion gene with the Aldh4a1 exons highlighted in red. Although it appears that these exons are transcribed and contain the conserved ALDH catalytic domain, it is not clear whether the gene product retains aldehyde dehydrogenase activity.

ALDH5 and beyond
ALDH5A1 is missing in marmoset and duplicated in zebrafish. The zebrafish duplication, aldh5a1.2, encodes a slightly truncated peptide (404 versus 514 AAs) which shares 100 per cent AA identity with the first 426 AAs and resides on the same Chr as aldh5a1.1.
ALDH7A1 is duplicated in the macaque. The ALDH7A1P5 duplication is located on Chr 14 and contains the complete ALDH7A1 coding sequence; however, the sequence lacks any intronic regions, suggesting a reverse transcriptase-mediated duplication event. Furthermore, this gene has a Ka/Ks score of 1.289, indicating a lack of selective pressure to conserve this gene. This provides further evidence that ALDH7A1P5 does not code for a functional protein.
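The reasoning behind the Ka/Ks score can be illustrated with a short sketch. A ratio well below 1 is conventionally read as purifying selection, a ratio near 1 as neutral evolution, and a ratio above 1 as positive or relaxed selection. The function names and the tolerance band around 1 are assumptions made for illustration; this is not the procedure used by the Bergen Center tool, which estimates the rates from aligned sequences and a tree.

```python
def ka_ks_ratio(ka: float, ks: float) -> float:
    """Ratio of nonsynonymous (Ka) to synonymous (Ks) substitution rates."""
    if ks <= 0:
        raise ValueError("Ks must be positive to form a ratio")
    return ka / ks


def selection_regime(ratio: float, tolerance: float = 0.05) -> str:
    """Coarse interpretation of a Ka/Ks ratio.

    The tolerance band around 1 is an illustrative assumption, not a
    published threshold.
    """
    if ratio < 1 - tolerance:
        return "purifying"
    if ratio > 1 + tolerance:
        return "positive or relaxed"
    return "neutral"


# The macaque ALDH7A1P5 score reported in the text.
print(selection_regime(1.289))
# Illustrative rates only (not from this study): strong constraint.
print(selection_regime(ka_ks_ratio(0.25, 2.0)))
```

Under this reading, the reported score of 1.289 falls outside the range expected for a gene whose protein sequence is being conserved.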
In zebrafish, aldh9a1 has three additional copies. The parent gene aldh9a1.1 and aldh9a1.3 reside on Chr 8; aldh9a1.2 is found on Chr 2. Both aldh9a1.2 and aldh9a1.3 encode putative full-length proteins which respectively share 71.2 per cent and 94.9 per cent AA identity with Aldh9a1.1 and 70.3 per cent sequence identity with each other. Zebrafish also contains a duplication of aldh18a1. The aldh18a1.2 is found on the same chromosome and encodes a protein that is 100 per cent identical with that of the parent gene. The naming of zebrafish genes required further genomic analyses in order to determine whether duplications originated from the ray-finned lineage whole-genome duplication event. Many of the duplicated genes reside within close proximity on the same chromosome, suggesting that they are segmental duplications that resulted from misguided recombination processes during meiosis and not a product of the whole-genome duplication that took place within the ray-finned lineage. 27 These include the aldh2, aldh5a1 and aldh18a1 paralogues, which are located in close proximity on Chr 5, 16 and 12, respectively. It also includes aldh3a2.1 and aldh3a2.2, located on Chr 15, as well as aldh9a1.1 and aldh9a1.3, found on Chr 8. The gene architecture surrounding aldh3a2.3 on Chr 21 does not support a duplicated chromosome, in that the region lacks other duplicated genes from Chr 15. Furthermore, studies looking at zebrafish gene duplications found that a high frequency of genes found on Chr 21 are duplicated on Chr 5 and none were identified on Chr 15, suggesting that Chr 5, rather than Chr 15, is the paralogous chromosome. 28,29 A similar situation was identified with respect to aldh9a1.2 on Chr 2. Uridine-cytidine kinase-2 homologues (uck2a and uck2b) are found upstream of both aldh9a1.1 and aldh9a1.2, supporting a tandem gene-duplication event; however, other genes in close proximity to this duplication do not show any homology between chromosomes 2 and 8.

Figure 2. Comparison of ALDH4A1 from human and rat. Rat Aldh4a1 is part of the larger fusion gene LRRP Ba1-651. 26 The exons representing the Aldh4a1 portion of this gene with homology to mouse and human are highlighted.

Alternatively spliced transcriptional variants and CNVs of human ALDH genes
In addition to the increase in ALDH identification through genomic sequencing, other sources of complexity in the ALDH superfamily are being studied. Transcript sequencing has revealed that many ALDH genes encode multiple mRNA splice variants (for a review of human ALDH splice variants, see Black et al. 30 ). Besides splice variants, CNVs have been reported for human ALDH genes. By querying the Database of Genomic Variants, 35 CNVs, 28 InDels and one inversion have been detected in the ALDH family, although these records are usually representative of one or several individuals (Supplementary Table S1). Of these 64 events, 33 were InDels entirely within intronic regions and may be silent. Others are likely to cause loss of function of the enzyme involved, including loss of the whole gene (11 events; occurred in ALDHs 1A3, 1B1, 3A1, 3B1, 5A1 and 16A1) or duplication, loss or inversion of exons within the coding sequence (16 events; occurred in ALDHs 1A3, 1L1, 1L2, 3A2, 3B2, 6A1 and 9A1). Finally, in a few cases, a region containing the entire gene and surrounding region was duplicated (four events; occurred in ALDH3B1 and ALDH3B2).
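The event counts in this paragraph can be cross-checked with a short tally. The sketch below copies the counts from the text and adds a hypothetical size classifier based on the size classes given in the methods. The dictionary keys, the function name and the treatment of an event of exactly 1,000 bp as a CNV are assumptions. Inversions are not size-classified here.

```python
# Counts copied from the paragraph above (Database of Genomic Variants query).
events_by_type = {"CNV": 35, "InDel": 28, "inversion": 1}
events_by_effect = {
    "intronic only": 33,
    "whole-gene loss": 11,
    "exon-level duplication, loss or inversion": 16,
    "whole-gene duplication": 4,
}


def classify_gain_or_loss(size_bp: int) -> str:
    """Assign a gain or loss to a size class.

    Treating exactly 1,000 bp as a CNV is an assumption; the text only
    states '>1 kb' for CNVs and '100-999 bp' for InDels.
    """
    if size_bp >= 1000:
        return "CNV"
    if 100 <= size_bp <= 999:
        return "InDel"
    return "below reporting threshold"


total_by_type = sum(events_by_type.values())
total_by_effect = sum(events_by_effect.values())
print(total_by_type, total_by_effect)
print(classify_gain_or_loss(500), classify_gain_or_loss(2500))
```

Both tallies sum to the 64 events quoted in the text.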

Discussion
The ALDH superfamily shows considerable diversity among vertebrate genomes, with species in the current study showing between 14 and 25 putatively protein-encoding genes. Many of the gene duplications discussed here probably encode functional proteins. There are also a number of duplication events that give rise to non-functional pseudogenes. Names were assigned to the 'new genes' and 'pseudogenes' (Table 3) according to the ALDH nomenclature system established in 1999. 14 The species-specific nomenclature system was used for zebrafish genes. 15 Pseudogenes were also named according to the standardised protocol. 20 In the cow genome, ALDH1A3P1 resembles the product of a partial gene-duplication event. The coding region would translate a peptide sharing 100 per cent sequence identity to the 127 carboxy-terminal AAs of the full-length parent gene. Such a high degree of sequence identity is suggestive of a relatively recent evolutionary duplication. Even if the truncated gene does encode the 127-AA peptide, however, that peptide lacks many highly conserved residues required for ALDH activity. Thus, the truncated peptide would probably be targeted for rapid degradation. As such, this gene represents a nonfunctional pseudogene and has been named accordingly.
ALDH1B1 is present in mammals but missing from birds and fish. The high degree of AA sequence conservation between ALDH2 and ALDH1B1 suggests that the latter may be the product of a gene duplication event that occurred some time after the avian-land animal split around 310 MYA. Future analyses should consider other species, including amphibians and reptiles, in order to verify and more accurately pinpoint this evolutionary event.
Analysis of the aldh2 gene duplications in zebrafish indicates that these represent protein-coding genes and not pseudogenes. As mentioned above, translation of either gene would result in a full-length peptide. The aldh2.2 gene would encode a product 95.2 per cent identical to that of the parent gene aldh2.1. At 95.2 per cent AA identity, aldh2.2 represents a new gene. The aldh2.3 homologue may represent a more evolutionarily recent duplication of aldh2.2, as evidenced by the 99.6 per cent sequence identity noted. Therefore, aldh2.3 is likely to be a gene-duplication event of aldh2.2. All three protein products include the conserved ALDH motifs and residues required for enzyme activity.
The ALDH3 family showed the greatest variability among species. ALDH3A1 facilitates cell cycle regulation and scavenging of reactive oxygen species, and acts as a corneal crystallin by filtering UV irradiation in the eye. ALDH3A1 is missing from birds and fish but is present in every mammalian genome analysed in this study, suggesting that the gene evolved some time after 310 MYA. ALDH3A1 is conserved among mammals and shows no apparent duplications. In some species, such as rabbit, it appears that ALDH1A1 is expressed as a corneal crystallin instead of ALDH3A1. 31 Interestingly, zebrafish is the only species in this study that apparently lacks both ALDH3A1 and ALDH1A1. Studies have suggested that zebrafish use scinla (cytosolic gelsolin) as a corneal crystallin instead. 32-34 Zebra finch ALDH3A3 encodes a full-length peptide that shares 84.1 per cent identity with the ALDH3A2 parent gene. Zebrafish has three aldh3a2 duplications, which include two full-length genes (aldh3a2.2 and aldh3a2.3) and a significantly truncated partial duplication (aldh3a2p1). The degree of sequence identity that Aldh3a2.2 and Aldh3a2.3 share with the parent peptide (64.9 per cent and 70.9 per cent, respectively) suggests that they diverged sufficiently long ago to be considered new ALDH3A family members. They also share 64.9 per cent identity with each other and less than 60 per cent identity with zebra finch ALDH3A3, suggesting that all three genes are paralogues rather than orthologues. Zebra finch ALDH3A5 should also be considered a new functional ALDH family member. In addition, the zebrafish pseudogene aldh3a2p1, if translated, would share the highest degree of sequence identity with aldh3a2.3. Thus, the pseudogene most likely reflects a more recent partial duplication of this gene.
ALDH3B1 is duplicated in both cow and zebra finch. The cow ALDH3B4-encoded protein would be full length and share 85.4 per cent identity to ALDH3B1, suggesting that it is a new ALDH3B family member. Zebra finch ALDH3B5 shares an extremely high degree of homology with the amino-terminus of ALDH3B1. However, it lacks 150 AAs that comprise the carboxy-terminus needed for enzyme oligomerisation. The truncated protein would still contain the conserved motifs required for ALDH activity. Until more experimental evidence becomes available, the ALDH3B5 gene should be considered as putatively functional.
The mouse and rat Aldh3b3 genes appear to represent new orthologous ALDH family members; the genes reside in syntenic chromosomal regions and share a high degree (83.4 per cent) of sequence identity with one another. The two proteins are more divergent than the rodent ALDH3B2 orthologues, which share 89.9 per cent sequence identity.
Aldh5a1 is another duplicated ALDH gene within the zebrafish genome. The duplication aldh5a1.2 resides on the same chromosome as the aldh5a1.1 parent gene, and the two share 100 per cent sequence identity. Aldh5a1.2 encodes a peptide containing an additional 22 amino-terminal and 88 carboxy-terminal residues. It also shares greater sequence identity with the human ALDH5A1 orthologue than Aldh5a1.1 (65.5 per cent versus 51.4 per cent). This suggests that aldh5a1.2 might actually be the parent gene and aldh5a1.1 a slightly truncated version formed as the result of gene duplication.
As mentioned above, the macaque ALDH7A1P5 genomic sequence lacks intronic regions, suggesting that a reverse transcriptase-mediated event gave rise to this pseudogene (ie having no adjacent promoter or other regulatory sequences). Four additional ALDH7A1 pseudogenes have been identified on chromosomes 5q14 (ALDH7A1P1), 2q31 (ALDH7A1P2), 7q36 (ALDH7A1P3) and 10q21 (ALDH7A1P4). 19 Macaque ALDH7A1P5 is located on Chr 14, which is not syntenic with human Chr 11 and does not share common origins with any of the human pseudogenes. Therefore, the event that gave rise to ALDH7A1P5 must have taken place within the last 25 million years.
Three full-length ALDH9A1 homologues were identified in zebrafish. The Aldh9a1.2 peptide shares 71.2 per cent and 70.3 per cent identity with Aldh9a1.1 and Aldh9a1.3, respectively. Aldh9a1.3 is 94.9 per cent identical to the parent Aldh9a1.1 peptide, suggesting that this duplication was a relatively recent event when compared with the duplication that gave rise to Aldh9a1.2. Hence, aldh9a1.1, aldh9a1.2 and aldh9a1.3 represent three distinct protein-coding ALDH9 family members. The zebrafish genome also contains two copies of aldh18a1, which are found in very close proximity on Chr 12. Both genes are considered protein coding and would give rise to peptides of the same length which share 100 per cent sequence identity, suggesting a relatively recent duplication event.
ALDH gene-naming conventions dictate that (i) ALDH superfamily members sharing more than 40 per cent AA identity belong to the same family (eg ALDH1A, ALDH1B, etc.), and (ii) ALDH family members that share greater than 60 per cent AA identity belong to the same subfamily (eg ALDH1A1, ALDH1A2, etc). This provides a convenient and systematic naming system for an entire superfamily. Interestingly, this does not always indicate homology properly; these rules in the cytochrome P450 (CYP) gene superfamily are known to break down when one includes evolutionarily distantly related animals. 27 For example, whereas zebrafish Aldh3d1 and Aldh3b1 share only 50 per cent AA identity, HomoloGene evidence and alignments suggest that aldh3d1 is probably a duplication of aldh3b1 (data not shown). Although aldh3d1 has diverged considerably, it is likely to be more closely related to aldh3b1 than the naming convention would suggest.
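The two identity thresholds in the naming convention can be sketched as a simple decision rule. This is a minimal illustration: the function name and return labels are hypothetical, and, as the paragraph above notes, the rule is a naming aid rather than proof of homology. The example identities are those quoted in this update.

```python
def naming_relationship(percent_identity: float) -> str:
    """Apply the >40 per cent (family) and >60 per cent (subfamily) thresholds."""
    if not 0 <= percent_identity <= 100:
        raise ValueError("percentage identity must lie between 0 and 100")
    if percent_identity > 60:
        return "same subfamily"
    if percent_identity > 40:
        return "same family"
    return "different family"


# Mouse ALDH1A7 versus mouse ALDH1A1 (92 per cent, from the text).
print(naming_relationship(92.0))
# Zebrafish Aldh3d1 versus Aldh3b1 (50 per cent, as quoted in this paragraph).
print(naming_relationship(50.0))
```

The second case shows why the convention can understate relatedness: a diverged duplicate can fall below the subfamily threshold even when other evidence points to a direct duplication.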
Many of these proteins have been defined based on genomic or dbEST data and have not been studied extensively. Many records remain in databases that are listed as 'protein-coding' but which instead may represent pseudogenes of various types. Furthermore, although the genes here do not have internal stop codons, without functional analysis, it is difficult to determine whether the genes might have other inactivating mutations or if they experience selective pressure. Although automated prediction and naming of ALDH proteins from completely sequenced genomes have yielded a great deal of information in a short amount of time, the alignment, curation and naming of these genes remains an important task. The fact that no new human ALDH genes have been identified over the past six years and that most other vertebrates seem to have settled close to this number suggests that identification of ALDH superfamily members in vertebrates is nearing completion. Determining the function and biological importance of each family member still requires additional work, however. As more information becomes available, the web database resource at www.aldh.org (the aldehyde dehydrogenase gene superfamily resource center) 35 will be updated to reflect our current understanding of this diverse and essential gene superfamily.

Figure S1. Alignment of ALDH3B2 genes in human, rat and mouse created by ClustalW. Dashes (-) represent sequence gaps, asterisks (*) represent identical amino acids (AAs), colons (:) represent very similar AAs, periods (.) represent less similar AAs, whereas spaces ( ) represent dissimilar AAs.

Table S1. Known copy number variations in humans. Included are the variation ID from the Database of Genomic Variants, ALDH family member, type (CNV: copy number variation with changes >1 kb; InDel: insertions and deletions with changes of 100-999 bp; inv: inversions with changes that invert the nucleotide sequence), whether the change was a loss or gain, site (intron: change only affects an intronic region; part: change affects one or more exons; whole: change affects the entire gene), sample size and chromosomal location.