Naming 'junk': Human non-protein coding RNA (ncRNA) gene nomenclature

Previously, the majority of the human genome was thought to be 'junk' DNA with no functional purpose. Over the past decade, the field of RNA research has rapidly expanded, with a concomitant increase in the number of non-protein coding RNA (ncRNA) genes identified in this 'junk'. Many of the encoded ncRNAs have already been shown to be essential for a variety of vital functions, and this wealth of annotated human ncRNAs requires standardised naming in order to aid effective communication. The HUGO Gene Nomenclature Committee (HGNC) is the only organisation authorised to assign standardised nomenclature to human genes. Of the 30,000 approved gene symbols currently listed in the HGNC database (http://www.genenames.org/search), the majority represent protein-coding genes; however, they also include pseudogenes, phenotypic loci and some genomic features. In recent years the list has also increased to include almost 3,000 named human ncRNA genes. HGNC is actively engaging with the RNA research community in order to provide unique symbols and names for each sequence that encodes an ncRNA. Most of the classical small ncRNA genes have now been provided with a unique nomenclature, and work on naming the long (> 200 nucleotides) non-coding RNAs (lncRNAs) is ongoing.


Introduction
At the beginning of this century, many geneticists were predicting that the human genome contained around 100,000 protein-coding genes, partly based on the assumption that more complex organisms would have a greater number of genes. Ten years later, with far more genomic data from a wide variety of organisms and a much better-quality, well-annotated human genome, this original expectation has been downsized to around 20,000 protein-coding genes. This means that highly complex organisms like the human have about the same number of protein-coding genes as much simpler life forms such as the roundworm, Caenorhabditis elegans. If we look to the human's closest living relative, the chimpanzee, we see that the equivalent proteins in human and chimpanzee typically differ by only two amino acids, and approximately 29 per cent of all the orthologous proteins encoded in human and chimpanzee are identical. 1 Why, then, when the protein-coding components of our genomes are so similar, are humans and chimpanzees so strikingly different? Since protein-coding genes comprise only two per cent of the human genome, the answer may lie in the large swathes of the genome previously regarded as 'junk DNA'. Indeed, the ENCyclopedia Of DNA Elements (ENCODE) Consortium, 2 which is aiming to identify all the functional elements in the human genome, suggests that the vast majority of the genome is transcribed as non-protein-coding RNA (ncRNA). These RNAs could be responsible for some of the complex differences between humans and other primates, especially since the expression of many genes is now thought to be regulated by ncRNAs. They are also known to be involved in [...] 4 was on hand to provide unique names for these genes, thus ensuring that everyone was able to retrieve and discuss relevant information concerning specific loci.
The proper characterisation and naming of human ncRNA genes and transcripts is equally vital, and so the HGNC has been actively engaging with the RNA research community in order to provide unique names for all the sequences encoding ncRNAs; the progress so far is shown in Table 1 and discussed below.

MicroRNAs
The first microRNA (miRNA), lin-4, was identified in C. elegans in 1993, 5 but the term 'microRNA' was not introduced until 2001. 6 MicroRNA genes encode primary transcripts (pri-miRNAs), which are processed into short stem-loop structures called pre-miRNAs, and these in turn are modified into mature miRNAs. Varying in length from 19 to 25 nucleotides (nt), these single-stranded mature miRNAs bind to the 3′-untranslated regions (3′-UTRs) of target mRNAs to destabilise them or inhibit their translation. 7 This regulation of gene expression by miRNAs has been shown to affect diverse cellular functions and has been implicated in the molecular mechanisms of many diseases; for instance, mutations in MIR96 have been linked to progressive hearing loss. 8 Since 2002, miRBase 9 (http://www.mirbase.org; formerly the miRNA Registry) has catalogued all identified human miRNAs and provided each sequence with a stable accession number, precise genomic location and a name in the format 'mir-#' for each stem-loop sequence and 'miR-#' for each resultant mature miRNA. For example, the MIR100 gene encodes the mir-100 stem-loop, which is modified to create the miR-100 mature transcript. Working in collaboration with miRBase, HGNC has provided approved gene symbols for all of the 1,000 currently identified human miRNAs. These are named using the miRBase names with the stem symbol MIR#. The MIR# symbols are assigned as sequential numerical identifiers to each novel miRNA, with those miRNAs that encode homologous mature transcripts sharing the same MIR number but with differing suffixes. If the mature miRNAs differ by only one or two nucleotides, they are allocated letter suffixes (eg MIR10A and MIR10B), whereas if the mature miRNAs are identical, the genes are given hyphenated numerical suffixes (eg MIR1-1 and MIR1-2).
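The relationship between miRBase stem-loop names and MIR# gene symbols described above can be sketched in a few lines of Python. This is only an illustration: the function names and the regular expression are assumptions built from the examples in the text (MIR100, MIR10A, MIR1-1), not an HGNC or miRBase tool, and real symbols may take forms that this pattern does not cover.

```python
import re

# Illustrative pattern derived only from the examples in the text:
# MIR<number>, an optional letter suffix (closely related mature miRNAs)
# and an optional hyphenated copy number (identical mature miRNAs).
_MIR_SYMBOL = re.compile(r"^MIR(?P<number>\d+)(?P<letter>[A-Z])?(?:-(?P<copy>\d+))?$")


def stem_loop_to_symbol(stem_loop_name):
    """Turn a stem-loop name such as 'mir-100' into a gene symbol such as 'MIR100'."""
    prefix, _, rest = stem_loop_name.partition("-")
    if prefix != "mir" or not rest:
        raise ValueError(f"not a stem-loop name of the form 'mir-#': {stem_loop_name!r}")
    return "MIR" + rest.upper()


def parse_mir_symbol(symbol):
    """Split a MIR# symbol into its number, letter suffix and copy number."""
    match = _MIR_SYMBOL.match(symbol)
    if match is None:
        raise ValueError(f"unrecognised miRNA gene symbol: {symbol!r}")
    copy = match.group("copy")
    return {
        "number": int(match.group("number")),
        "letter": match.group("letter"),          # e.g. 'A' in MIR10A, else None
        "copy": int(copy) if copy else None,      # e.g. 2 in MIR1-2, else None
    }


if __name__ == "__main__":
    print(stem_loop_to_symbol("mir-100"))
    print(parse_mir_symbol("MIR10A"))
    print(parse_mir_symbol("MIR1-2"))
```

Mature names of the form 'miR-#' are deliberately rejected by stem_loop_to_symbol, because gene symbols correspond to the stem-loop sequences rather than to the mature transcripts.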

Transfer RNAs
Transfer RNAs (tRNAs) are small RNA molecules, about 80 nucleotides in length, that play an essential role in protein translation by transporting specific amino acids to the ribosomes to be added to an elongating peptide chain. There are three essential regions on each tRNA: the anticodon, which comprises three nucleotides that can basepair to one or more specific triplet codons on the mRNA being translated; the attachment site that covalently binds the particular amino acid specified by the codon; and a second attachment site that recognises a specific aminoacyl tRNA synthetase, an enzyme that catalyses the binding of an amino acid to its compatible tRNA. Due to their high degree of structural conservation, such as the distinctive cloverleaf secondary folding and L-shaped tertiary structures, tRNAs can be accurately predicted within a genome using a sequence comparison algorithm. The Genomic tRNA database, 10 GtRNAdb (http://lowelab.ucsc.edu/GtRNAdb/), contains 506 tRNA genes predicted in the human genome using the tRNAscan-SE 11 program.
Working with this dataset, the HGNC has provided approved symbols for these genomic tRNA genes in the format 'TRNA + single letter code for amino acid isotype + incremental number', with the specific anticodon type also specified in the name. For example, the tRNA gene 'TRNAA1' has the name 'transfer RNA alanine 1 (anticodon UGC)'. GtRNAdb also identifies 110 putative tRNA pseudogenes that are named in the same format as coding tRNA genes, except that the symbol is appended with a 'P' for 'pseudogene' (eg TRNAA44P, 'transfer RNA alanine 44 [anticodon AGC] pseudogene'). The 'P' suffix is also used for pseudogenes in other ncRNA classes and for pseudogenes of protein-coding genes. Any ncRNAs that have been derived by computational prediction (as is the case for most of the genomic tRNAs) will require experimental validation to determine whether they do encode functional transcripts.
There are 22 tRNAs encoded in the human mitochondrial genome; these are thought to have a classical cloverleaf-like secondary structure but differ in the length of their loops and are missing some conserved residues. 12 There is much interest in mitochondrial tRNAs, since many pathological point mutations have been identified in these genes. Mamit-tRNAdb (http://mamit-trna.u-strasbg.fr/), 13 a database of mammalian mitochondrial tRNAs, contains the cloverleaf structures of the 22 human mitochondrial tRNAs, highlighting the positions of each of the mutations reported in the literature. The HGNC names for the mitochondrial tRNA genes are in the format 'MT-T + single letter code for amino acid isotype + incremental number', again with the specific anticodon type included in the name. For example, the tRNA gene 'MT-TS1' has the name 'mitochondrially encoded tRNA serine 1 (UCN)'.
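As a minimal sketch of how the two tRNA symbol formats above could be read programmatically, the following Python parser splits a symbol into its components. The class and function names are hypothetical, and the patterns are inferred from the three examples in the text (TRNAA1, TRNAA44P and MT-TS1). Treating the incremental number as optional for mitochondrial symbols is an assumption made here, not something stated above.

```python
import re
from typing import NamedTuple, Optional


class TrnaSymbol(NamedTuple):
    genome: str             # "nuclear" or "mitochondrial"
    amino_acid: str         # single letter code for the amino acid isotype
    number: Optional[int]   # incremental number, if present
    pseudogene: bool        # True when the symbol carries the 'P' suffix


# 'TRNA' + amino acid letter + incremental number + optional 'P' (pseudogene).
_NUCLEAR = re.compile(r"^TRNA([A-Z])(\d+)(P?)$")
# 'MT-T' + amino acid letter + incremental number (assumed optional here).
_MITOCHONDRIAL = re.compile(r"^MT-T([A-Z])(\d*)$")


def parse_trna_symbol(symbol):
    """Parse a tRNA gene symbol into genome, isotype, number and pseudogene flag."""
    match = _NUCLEAR.match(symbol)
    if match:
        letter, number, suffix = match.groups()
        return TrnaSymbol("nuclear", letter, int(number), suffix == "P")
    match = _MITOCHONDRIAL.match(symbol)
    if match:
        letter, number = match.groups()
        return TrnaSymbol("mitochondrial", letter, int(number) if number else None, False)
    raise ValueError(f"unrecognised tRNA gene symbol: {symbol!r}")


if __name__ == "__main__":
    for example in ("TRNAA1", "TRNAA44P", "MT-TS1"):
        print(example, parse_trna_symbol(example))
```

The anticodon is not recoverable from the symbol itself; as described above, it is recorded in the gene name rather than in the symbol.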

Ribosomal RNAs
Ribosomes, the sites of protein translation, comprise both ribosomal proteins and ribosomal RNAs (rRNAs). Eukaryotic ribosomes are 80S in size but are formed by two subunits, one large 60S subunit and one small 40S subunit. The passage of the tRNAs along the mRNA during translation is facilitated by the rRNAs within the ribosome. The ribosome brings about the interaction between the anticodon of each aminoacyl tRNA and the equivalent codon of the mRNA at its aminoacyl (A) site, and then aids formation of the peptide bond between the amino acids at the peptidyl (P) site before the tRNA exits the ribosome at the exit (E) site. Each single mRNA can be translated at multiple ribosomes at the same time. For a recent review on the structural dynamics of the ribosome, see Korostelev et al. 14 In eukaryotes, there are four types of rRNA: 18S rRNA is found in the small subunit of the ribosome and 28S, 5.8S and 5S rRNAs in the large subunit. The 18S, 5.8S and 28S rRNA genes are arranged in tandem repeats, with the genes separated by transcribed spacers known as externally and internally transcribed sequences (abbreviated to ETS and ITS). Each repeat, found in the arrangement 5′ ETS-18S-ITS1-5.8S-ITS2-28S-3′ ETS, produces one precursor transcript, which is then cleaved to produce the three types of rRNAs. 15 5S rRNA genes are also found between spacer sequences in tandem repeats scattered throughout the genome, but with several main clusters located on the q arm of chromosome 1. Large amounts of RNA are required to make ribosomes, so there are hundreds of copies of both types of rRNA repeats in the human genome. Repetitive sequences, such as these rRNA tandem repeat genes, prove to be problematic when sequencing and assembling genomes and so some rRNA genes are likely to be missing from the genome builds.
A simple gene nomenclature for rRNAs has been proposed in which the 18S, 28S, 5.8S and 5S types use the stem symbols RN18S#, RN28S#, RN5-8S# and RN5S#, respectively (eg 'RN5S1' for 'RNA, 5S ribosomal 1'). If agreed, this scheme will be implemented by the rRNA research community. A comprehensive set of all rRNAs can be found in the SILVA ribosomal RNA database (http://www.arb-silva.de/). 16

Spliceosomal RNAs
After transcription, most primary transcripts (pre-mRNAs) undergo a series of modifications, termed splicing, in which introns are removed so that the exons can be joined to form a mature mRNA (for a review, see Ritchie et al. 17). Some introns are self-splicing but most require the intervention of the spliceosome, a large ribonucleoprotein (RNP) made up of over 200 different proteins and five small nuclear RNAs (snRNAs), known as U1, U2, U4, U5 and U6. These snRNAs are highly conserved across genomes, since they play a key role in the multiple splicing events catalysed by the spliceosome. The U1, U2, U4, U5 and U6 snRNAs are assembled around the newly transcribed pre-mRNA following a precise pattern that produces a common canonical structure, known as the major or U2-dependent spliceosome. The pre-mRNA contains specific sequences that guide the formation of the major spliceosome, and also specific GT/AG splice sites that flank the introns (known as major or U2-type introns) to identify where the excisions should be made. The nomenclature of these snRNA genes follows the format 'RN + snRNA species + numerical identifier' (eg 'RNU1-1' for 'RNA, U1 small nuclear 1'). A much less common type of intron, the U12-type, is spliced by the minor spliceosome, which, in addition to U5, contains four different snRNAs, known as U11, U12, U4atac and U6atac. 18 Each of these four snRNAs has a functional counterpart in the major spliceosome, with U11 being analogous to U1, U12 to U2, U4atac to U4 and U6atac to U6. The 'atac' suffix on U4atac and U6atac denotes the unusual AT/AC splice sites for U12-type introns. The gene names for the U12-type snRNAs follow the same format as the U2-type snRNAs; for example, 'RNU6ATAC5' is 'RNA, U6atac small nuclear 5'.

Small nucleolar RNAs
Small nucleolar RNAs (snoRNAs) are responsible for guiding a series of site-specific posttranscriptional modifications to rRNAs, tRNAs and snRNAs. There are two main types of snoRNAs: H/ACA box snoRNAs direct the conversion of the nucleoside uridine to pseudouridine by the pseudouridine synthase protein dyskerin, and C/D box snoRNAs guide the addition of a methyl group by the methyltransferase protein fibrillarin. 19 Each snoRNA guides one, or sometimes two, such modification(s) by binding to complementary regions of the pre-RNA. H/ACA box snoRNAs are named after their intrinsic H box (ANANNA) and ACA box sequences and are found in RNPs containing the same four proteins: DKC1 (dyskerin), GAR1, NOP10 and NHP2. Similarly, the C/D box snoRNAs are named after the conserved C (RUGAUGA) and D (CUGA) boxes and form RNPs with four different proteins: FBL (fibrillarin), NOP56, NOP58 and NHP2L1. The H/ACA box snoRNAs were previously referred to in the literature and databases by the stem symbols ACA# or HBI-# (eg ACA1 and HBI-6); and the C/D box snoRNAs were named using either 14q(I-#), 14q(II-#) or HBII-# symbols (eg 14q(0), 14q(II-1) and HBII-99). These names were confusing and so, in collaboration with snoRNABase (http://www-snorna.biotoul.fr) 20 and experts in the field, HGNC devised an approved nomenclature for the snoRNA genes using a common stem symbol of SNORA# for the H/ACA box genes and SNORD# for the C/D boxes. HBI-6 is now approved as 'SNORA26' for 'small nucleolar RNA, H/ACA box 26' and 14q(0) is now 'SNORD112' for 'small nucleolar RNA, C/D box 112'. Another class of snoRNAs is the small Cajal body-specific RNAs (scaRNAs), named after the sub-organelles within the nucleus where they are located. 21 These RNAs often contain both H/ACA box and C/D box domains, but sometimes have only one of these domains. 
Previously, scaRNA genes were grouped in with the H/ACA box and C/D box snoRNA genes but they now have their own approved SCARNA# gene nomenclature; for example, HBII-382 is now SCARNA3 for 'small Cajal body-specific RNA 3'.
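To illustrate how the approved snoRNA and scaRNA symbols replace the older aliases, the sketch below expands an approved symbol into its gene name and looks up the three legacy aliases mentioned in the text. The dictionary and function names are hypothetical. The alias table contains only the pairs given above, and the pattern assumes a plain 'stem + number' symbol, which may not cover every approved symbol.

```python
import re

# Stem symbols and the corresponding gene name prefixes, as given in the text.
STEM_NAMES = {
    "SNORA": "small nucleolar RNA, H/ACA box",
    "SNORD": "small nucleolar RNA, C/D box",
    "SCARNA": "small Cajal body-specific RNA",
}

# Only the legacy alias to approved symbol pairs quoted in the text.
LEGACY_ALIASES = {
    "HBI-6": "SNORA26",
    "14q(0)": "SNORD112",
    "HBII-382": "SCARNA3",
}

_SYMBOL = re.compile(r"^(SNORA|SNORD|SCARNA)(\d+)$")


def gene_name(symbol):
    """Expand a symbol such as 'SNORA26' into 'small nucleolar RNA, H/ACA box 26'."""
    match = _SYMBOL.match(symbol)
    if match is None:
        raise ValueError(f"unrecognised snoRNA/scaRNA symbol: {symbol!r}")
    stem, number = match.groups()
    return f"{STEM_NAMES[stem]} {number}"


def approved_symbol(alias):
    """Return the approved symbol for a legacy alias, or None if it is not listed."""
    return LEGACY_ALIASES.get(alias)


if __name__ == "__main__":
    for alias in LEGACY_ALIASES:
        symbol = approved_symbol(alias)
        print(f"{alias} -> {symbol}: {gene_name(symbol)}")
```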

Piwi-interacting RNAs
Piwi-interacting RNAs (piRNAs) are the largest class of small ncRNAs expressed in vertebrates. Generally ranging from 25 to 33 nucleotides in length, piRNAs play a key role during spermatogenesis in defending germline cells against transposons by selectively silencing them. 22 They are found in positionally conserved clusters throughout mammalian genomes, although the piRNAs within these clusters are not conserved. A cluster can encode from tens up to thousands of individual piRNAs. piRNABank (http://pirnabank.ibab.ac.in/), 23 a web resource that classifies piRNAs and groups them into their genomic clusters, has so far identified 114 clusters in the human genome and HGNC has provided each of these with a PIRC# symbol for 'piwi-interacting RNA cluster #'. A vast number of individual human piRNA sequences has been identified, but piRNABank has removed repetitive and overlapping sequences in order to produce a non-redundant set of around 23,000 piRNAs that map to the human genome. HGNC and piRNABank are currently working together to develop a nomenclature for each piRNA gene, possibly with the stem symbol PIRNA#.

RNase P/MRP genes
The RNA component of ribonuclease (RNase) P plays a role in the processing of tRNA precursors (pre-tRNAs) into their mature products by cleaving sequence from the 5′ end. It is also thought to be involved in RNA polymerase (Pol) III transcription. 24 This RNA is encoded by the RNase P RNA component H1 (RPPH1) gene in the human genome. The evolutionarily related RNase mitochondrial RNA-processing (MRP) enzyme is involved in the maturation of precursor rRNAs (pre-rRNAs), by splicing out internally transcribed sequences, and also in mitochondrial DNA replication. 25 Although the RNA component is now known to be mostly localised in the nucleus, it was first identified in the mitochondria and this is reflected in the nomenclature of the gene, 'RMRP' for 'RNA component of mitochondrial RNA processing endoribonuclease'.

Other small ncRNAs
U7 snRNA has a role in histone pre-mRNA processing by specifically binding to the histone downstream element (HDE). 26 A computational study 27 has revealed only one functional copy (encoded by the gene RNU7-1) and 85 nonfunctional pseudogenes to be present in the human genome. A further computational analysis by the same group identified the four human vault RNA genes. 28 These genes encode the RNA components of the vault RNP. The function of this RNP is still unknown, but a role in drug resistance has been suggested. 29 To avoid confusion with viral RNAs (vRNAs), the genes encoding these ncRNAs have been named using the stem symbol VTRNA# (eg 'VTRNA2' for 'vault RNA 2'). Other small classes of ncRNAs in the human include: 7SK RNA, which is involved in the regulation of Pol II transcription 30 and is encoded by the 'RNA, 7SK small nuclear' (RN7SK) gene; the Y RNAs that form part of the Ro RNP, which are encoded by the RNY# genes (eg 'RNY1' for 'RNA, Ro-associated Y1'); the RNAs that form part of the signal recognition particle (commonly known as SRP or 7SL), which targets proteins and translocates them across membranes, and whose genes have the stem symbol RN7SL (eg 'RN7SL1' for 'RNA, 7SL, cytoplasmic 1'); and the RNA component of the enzyme telomerase, which adds TTAGGG DNA sequence repeats to chromosome ends (telomeres) in order to prevent their continual erosion in cell division, 31 and is encoded by the telomerase RNA component (TERC) gene.
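The stem symbols introduced for the small ncRNA classes so far lend themselves to a simple lookup. The sketch below assigns a class to a symbol by its longest matching stem. It is an illustration only: the table is assembled from the stems quoted in this article (the rRNA stems being a proposal rather than an agreed scheme), the function name is hypothetical, and prefix matching is a rough assumption that could misclassify unrelated symbols sharing the same leading letters.

```python
# Stem symbols quoted in the article, mapped to a short class description.
STEM_CLASSES = {
    "MIR": "microRNA",
    "TRNA": "transfer RNA (nuclear genome)",
    "MT-T": "transfer RNA (mitochondrial genome)",
    "RN18S": "18S ribosomal RNA (proposed stem)",
    "RN28S": "28S ribosomal RNA (proposed stem)",
    "RN5-8S": "5.8S ribosomal RNA (proposed stem)",
    "RN5S": "5S ribosomal RNA (proposed stem)",
    "RNU": "small nuclear RNA",
    "SNORA": "small nucleolar RNA, H/ACA box",
    "SNORD": "small nucleolar RNA, C/D box",
    "SCARNA": "small Cajal body-specific RNA",
    "PIRC": "piwi-interacting RNA cluster",
    "VTRNA": "vault RNA",
    "RNY": "Y RNA",
    "RN7SL": "7SL RNA (signal recognition particle)",
    "RN7SK": "7SK RNA",
}

# Try longer stems first so that a more specific stem wins over a shorter one.
_STEMS_BY_LENGTH = sorted(STEM_CLASSES, key=len, reverse=True)


def ncrna_class(symbol):
    """Return the class for a symbol, or None when no listed stem matches."""
    for stem in _STEMS_BY_LENGTH:
        if symbol.startswith(stem):
            return STEM_CLASSES[stem]
    return None


if __name__ == "__main__":
    for example in ("MIR100", "RN5S1", "RNU6ATAC5", "VTRNA2", "RNY1", "TERC"):
        print(example, "->", ncrna_class(example))
```

Genes named for a known function, such as RPPH1, RMRP and TERC, carry no class stem and therefore return None here.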

Long non-coding RNAs
The current set of small ncRNAs, as shown above, is necessarily biased towards those with conserved sequence homology, since this feature is used for their computational prediction and classification. There is a further class of ncRNAs, known as long non-coding RNAs (lncRNAs) because they are over 200 nucleotides in length (sometimes even more than 15 kilobases 32), that, in the majority of cases, do not share sequence homology with each other. These longer transcripts are spliced, capped and polyadenylated, suggesting that they are expressed and potentially functional within cells. Indeed, some lncRNAs do now have proven functions, and where these are known they have been named accordingly; for instance, 'XIST' for 'X (inactive)-specific transcript (non-protein coding)' is involved in transcriptionally silencing one of the pair of X chromosomes. 33 There are, however, potentially thousands of lncRNAs, and for the vast majority their function remains unresolved. Where lncRNAs reside on the opposite strand to a protein-coding gene, it is thought that they could potentially regulate the expression of the coding gene. 34 These antisense transcripts are named using the approved HGNC symbol for the protein-coding gene with the suffix '-AS' for 'antisense'; the lncRNA gene on the opposite strand to the BOK gene is 'BOK-AS1' for 'BOK antisense RNA 1 (non-protein coding)'. Likewise, those lncRNA genes that reside entirely within an intron of a protein-coding gene are symbolised by the suffix '-IT' for 'intronic transcript' (eg 'MAGI2-IT1' for 'MAGI2 intronic transcript 1 (non-protein coding)'). There are also lncRNAs that are postulated to function only as transcriptional apparatus for the expression of small ncRNA genes nested within their introns. These 'host genes' are named with the suffix 'HG' (eg 'SNHG1' for 'small nucleolar RNA host gene 1 (non-protein coding)'). A small number of lncRNAs share homology with each other and are named as paralogues (eg TTTY1A and TTTY1B).
Many transcripts do not fit any of these scenarios, that is: the function of the mature transcript is unknown; they are not proximal to a protein-coding gene; and they are not a member of a homologous family. Such 'orphan' ncRNA genes were previously all named with the anonymous stem symbol NCRNA# (eg 'NCRNA00029') but, in collaboration with the lncRNA database lncRNAdb (http://www.lncrnadb.org) 35 and the Vertebrate Genome Annotation (VEGA; http://vega.sanger.ac.uk/) team, 36 HGNC has recently decided to name these intergenic lncRNA genes with the symbol 'LINC#' for 'long intergenic non-protein coding RNA #'.
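The lncRNA naming conventions in this section (antisense, intronic, host gene, intergenic and the older anonymous symbols) can be recognised from the symbol alone, at least for the examples given. The following sketch does so with a short ordered list of patterns. The function name, the category labels and the patterns are assumptions drawn from the examples in the text; in particular, matching 'HG' followed by a number is a heuristic that could also match unrelated symbols.

```python
import re

# Ordered rules: the first pattern that matches decides the category.
_RULES = [
    (re.compile(r"^LINC\d+$"), "long intergenic non-protein coding RNA"),
    (re.compile(r"^NCRNA\d+$"), "anonymous ncRNA (older stem symbol)"),
    (re.compile(r"^(?P<parent>.+)-AS\d+$"), "antisense RNA"),
    (re.compile(r"^(?P<parent>.+)-IT\d+$"), "intronic transcript"),
    (re.compile(r"^.+HG\d+$"), "host gene"),
]


def classify_lncrna_symbol(symbol):
    """Return (category, parent gene symbol or None) for an lncRNA gene symbol."""
    for pattern, category in _RULES:
        match = pattern.match(symbol)
        if match:
            # Only the antisense and intronic rules capture a parent gene.
            return category, match.groupdict().get("parent")
    # Symbols named for a known function or as paralogues fall through to here.
    return "other", None


if __name__ == "__main__":
    for example in ("BOK-AS1", "MAGI2-IT1", "SNHG1", "NCRNA00029", "XIST"):
        print(example, "->", classify_lncrna_symbol(example))
```

Symbols such as XIST or TTTY1A are reported as 'other' because their names reflect function or family membership rather than genomic context.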

RNA nomenclature across species
Where an equivalent orthologous ncRNA gene can be shown to exist in another species, the human RNA gene nomenclature could be transferred directly to the other species and, indeed, this is already happening for highly conserved classes of small RNAs, such as the microRNAs. For example, mouse Mir100 is orthologous to human MIR100. The nomenclature for other ncRNA classes that have greatly diverged across genomes will need careful annotation and may require species-specific nomenclature.
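The MIR100/Mir100 example suggests how a human symbol might be rendered in mouse style once orthology has been established. The helper below is a minimal sketch with a hypothetical name: it changes only the letter case, on the assumption that mouse symbols are written with an initial capital followed by lower-case letters, and it does nothing to establish that an orthologue actually exists.

```python
def to_mouse_style(human_symbol):
    """Render a human gene symbol in mouse-style casing, e.g. 'MIR100' -> 'Mir100'.

    Casing only: whether a mouse orthologue exists must be determined separately.
    """
    return human_symbol[:1].upper() + human_symbol[1:].lower()


if __name__ == "__main__":
    print(to_mouse_style("MIR100"))
```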

Conclusion
Recent years have shown us that the human genome does not comprise transcriptional deserts of 'junk' DNA lying between protein-coding genes, but rather that these regions encode thousands of transcribed ncRNAs that may play crucial roles in vital biological processes. As a result, interest in these RNAs is growing quickly, and HGNC aims to keep pace with the discovery of new ncRNA classes to ensure it can provide a robust and systematic nomenclature for these intriguing genes. All human RNA genes named to date can be found at the HGNC RNA webpage (http://www.genenames.org/rna).