Gene family matters: expanding the HGNC resource
© Daugherty et al.;licensee BioMed Central Ltd. 2012
Received: 11 August 2011
Accepted: 5 July 2012
Published: 5 July 2012
The HUGO Gene Nomenclature Committee (HGNC) assigns approved gene symbols to human loci. There are currently over 33,000 approved gene symbols, the majority of which represent protein-coding genes, but we also name other locus types such as non-coding RNAs, pseudogenes and phenotypic loci. Where relevant, the HGNC organise these genes into gene families and groups. The HGNC website http://www.genenames.org/ is an online repository of HGNC-approved gene nomenclature and associated resources for human genes, and includes links to genomic, proteomic and phenotypic information. In addition to this, we also have dedicated gene family web pages and are currently expanding and generating more of these pages using data curated by the HGNC and from information derived from external resources that focus on particular gene families. Here, we review our current online resources with a particular focus on our gene family data, using it to highlight our new Gene Symbol Report and gene family data downloads.
The HUGO Gene Nomenclature Committee (HGNC) has been responsible for assigning unique and informative gene names and symbols to every human gene for over 30 years. Approved gene names and symbols preferably describe the structure, function or homology of a gene and its products. The provision of approved nomenclature allows researchers to discuss genes unambiguously, and this is reflected by HGNC symbol usage in scientific papers describing human genes, hence aiding the dissemination and interpretation of the associated data by the scientific community.
The HGNC website [1, 2] provides direct links to genomic, proteomic and phenotypic information that is held in the HGNC database and enables users to search and download current data associated with their gene(s) of interest. As of February 2012, there are over 33,000 approved human gene symbols (including protein-coding genes, pseudogenes, ncRNA genes and phenotypes), each with a publicly available Gene Symbol Report. It is important to note that although the main focus of HGNC concerns human genes, there are coordinated efforts with other nomenclature committees, in particular the Mouse Genomic Nomenclature Committee (MGNC) and Rat Genome Database (RGD), and any large new gene family reorganisation or assignment is usually coordinated among these three nomenclature groups. The HGNC also regularly works with specialist advisors and publishes scientific papers concerning gene family nomenclature and gene grouping [6–9]. The adoption of HGNC-approved gene names/symbols by the many genome browsers and databases reduces any uncertainty when referring to genes; for example, Ensembl, Entrez Gene, GeneCards, OMIM, UCSC, UniProt and Vega all use HGNC names/symbols. The data supplied by HGNC have been applied to a range of studies, such as assisting tools to identify candidate genes for further study, quantitative assessments of gene annotation status and projects involving ‘mashups’ of bioinformatics data to explore genes involved in a particular disease, to name just a few.
We have restructured the Gene Symbol Report, so that the HGNC ‘core’ data are now more prominent and are presented in a separate table at the top of each report. The Gene Symbol Report for TRPM2 (Figure 2) displays data in all the core fields. One of the most significant changes to our Gene Symbol Report is that we now provide access to the gene family page(s) linking from the ‘Gene Family’ name on the report to the associated gene family page. As shown in Figure 2, TRPM2 is associated with two gene families: the subgroup ‘Transient receptor potential channels’ and the ‘Nudix motif-containing family’. Gene families will be discussed in more detail later. Gene Symbol Reports also contain links to external biomedical resources. We have grouped related resources into the following sections: ‘Specialist Database’, ‘Homologs’, ‘Nucleotide Sequences’, ‘Gene Resources’, ‘Protein Resources’, ‘Clinical Resources’, ‘References’ and ‘Other Database Links’. External links can either be manually ‘curated’ by an HGNC curator, which is denoted by the letter ‘C’, or ‘downloaded’ from external sources, which is denoted by the letter ‘D’. The ‘Specialist Database’ section provides links to databases that are relevant to only certain classes of gene, and we now link to 14 specialist databases. In the example Gene Symbol Report for TRPM2 (Figure 2), the specialist receptor database we link to is IUPHAR. In the ‘Homologs’ section, in addition to linking to the mouse MGI and rat RGD databases, we now display the symbols approved by the two nomenclature committees. The nomenclature committees for human, mouse and rat aim to approve equivalent gene symbols and names for orthologous genes; for example, the human TRPM2 Gene Symbol Report (Figure 2) shows that the approved symbols for mouse and rat are both Trpm2.
Links to nucleotide sequences are grouped in the ‘Nucleotide Sequences’ section and include recently added links to Vertebrate Genome Annotation gene sequences curated by the Havana project. The ‘Gene Resources’ section groups together links to the gene annotation pages at Entrez Gene, Ensembl, UCSC and Vega. As part of the Symbol Report redesign, we now also provide direct links to the Genome Browsers supported by these four projects. The ‘Protein Resources’ section still includes links to the UniProt project, but as part of the update, we have added a link to the InterPro Protein Match page; this shows all protein signatures (integrated and unintegrated) predicted for the encoded protein by the InterPro member databases. All the mutation and variation-related data links are displayed in the ‘Clinical Resources’ section, while our curated links to references in PubMed and CiteXplore are shown in the ‘References’ section. Finally, the ‘Other Database Links’ section includes links to relevant biomedical resources that cannot be grouped into the categories above. For example, we now also link to the Reactome signalling pathway database and to a list of all Gene Ontology terms annotated for the gene product at the QuickGO project.
Gene families are generally defined as a group of genes descended by duplication from a common ancestor. The degree of divergence from the ancestral gene can vary considerably between members, and they may or may not have a conserved function. In many cases, the homology between the family members may be restricted to a specific highly conserved region or domain(s) of the encoded protein, and genes can therefore belong to more than one gene family, e.g. TRPM2 belongs to two gene families: the ‘Transient receptor potential channels’ and the ‘Nudix motif-containing family’ (see Figure 2). Large gene families may also be subdivided into smaller subfamilies, which often equate to functional groups. The ‘Transient receptor potential channels’ are a good example of this as they are subdivided into subfamily A, subfamily C, subfamily M and subfamily V.
When naming gene family members, the HGNC aims to use a common root (or stem) symbol which allows easy identification of the members, e.g. TRP is the root symbol for Transient receptor potential channels. As well as homologous gene families, the HGNC also organises human genes into gene groupings, which correspond to sets of genes that are not necessarily related by sequence homology but do have another shared feature: for example, a common function (e.g. ‘class I’ and ‘class II aminoacyl tRNA synthetases’), a specific chromosomal location (e.g. genes from the ‘pseudoautosomal region’), secondary structure (e.g. ‘micro RNAs’) or a grouping that provides a useful community resource (e.g. all genes encoding ‘blood group’ antigens). These groupings do not usually share a common root symbol, and again, one gene can be a member of more than one gene family and/or grouping. When assigning genes as members of families and groupings, the HGNC look at all available data including sequence similarity and conserved domain structure, publications and other databases, and where possible, take advice from specialist advisors who are experts working on that specific family or group. For ease of discussion, all gene families and groupings are referred to simply as gene families throughout this paper.
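The root-symbol convention can be illustrated with a minimal Python sketch. It assumes that TRP channel symbols follow the layout seen in TRPM2, namely the root TRP, one of the subfamily letters named above (A, C, M or V) and a member number. This layout is inferred from the example only and is not an official HGNC parsing rule; the function name is hypothetical.

```python
import re

# Assumption: ROOT + subfamily letter + member number, inferred from the
# TRPM2 example in the text; not an official HGNC rule.
TRP_SYMBOL = re.compile(r"^(TRP)([ACMV])(\d+)$")


def split_trp_symbol(symbol):
    """Return (root, subfamily, member_number), or None if the layout does not match."""
    match = TRP_SYMBOL.match(symbol)
    if match is None:
        return None
    root, subfamily, number = match.groups()
    return root, subfamily, int(number)


trpm2_parts = split_trp_symbol("TRPM2")
print(trpm2_parts)
```

A shared prefix alone is not evidence of family membership, so a pattern like this is useful only for reading symbols that are already known to belong to the family; the curated gene family data remain the authoritative source.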
As of June 2012, close to 45% of the 33,000 HGNC database entries are associated with a gene family. The HGNC website also currently displays over 237 curated webpages dedicated to individual gene families. However, if all the gene subfamilies are considered, this would give a larger total of around 400 pages due to instances where large gene families have been subdivided into smaller subgroups. The ‘Voltage-gated ion channels’ family (Figure 1c) is an example of a large gene family, which is made up of the following 11 subgroupings: ‘Voltage-Gated Sodium Channels’, ‘Voltage-Gated Calcium Channels’, ‘Voltage-Gated Sodium Channels’, ‘CatSper’, ‘Two-Pore Channels’, ‘Cyclic Nucleotide-Regulated Channels’, ‘Calcium-Activated Potassium Channels’, ‘Voltage-Gated Potassium Channels’, ‘Inwardly Rectifying Potassium Channels’, ‘Two-P Potassium Channels’ and ‘Hydrogen Voltage-Gated Ion Channels’.
HGNC is now actively expanding its gene family pages using internally curated data and data from the growing number of external resources and publications that focus on particular gene families. The HGNC homepage (Figure 1a) features a website search box at the bottom of the page that allows searching for gene family pages, HGNC documentation and information pages. Furthermore, there are two main ways of ascertaining whether a gene is associated with a particular gene family: by browsing the ‘Gene Families’ pages from the homepage or by directly querying the database by using one of the ‘Search Genes’ tools.
The ‘Gene Families’ section from the drop-down menu on the homepage (Figure 1a) gives an alphabetical listing of all the gene family names (Figure 1b) and links to the associated gene family pages. The information grouped on the gene family page is organised and represented by a table (Figure 1d), which lists all of the associated genes with the following data: ‘Approved Symbol’, ‘Approved Name’, ‘Previous Symbols’, ‘Synonyms’ and ‘Chromosome’. The ‘Approved Symbol’ links to the ‘Gene Symbol Report’, so selecting ‘TRPM2’ from the table will take the user to the specific Gene Symbol Report (Figure 2). The gene family page (Figure 1c,d) also indicates if there is a specialist advisor associated with the gene family, and their contact details, where applicable, are linked from each relevant gene family page by an orange ‘S’ icon. There are currently 115 specialist advisors who help with the content and approval of any new gene family members, and a separate page on the website lists all of HGNC's specialist advisors.
Users can also query the HGNC data by using the tools listed on the ‘Search Genes’ page: the ‘Quick Gene Search’, ‘Advanced Gene Search’ and the ‘List Search’. These tools can all be used to obtain the Gene Symbol Report, so if the gene of interest has a gene family associated, it will link to the relevant gene family page from the ‘Gene Family’ name in the Gene Symbol Report. The ‘Quick Gene Search’ tool (see Figure 1e) is a quick and easy way to check whether a gene of interest is associated with a gene family; this is done by querying the gene symbol, name or ID to locate a Gene Symbol Report. The ‘Advanced Gene Search’ allows the user to build a more specific query by choosing to query within specific datasets, e.g. only for those genes that are approved. The ‘List Search’ also gives access to the Gene Symbol Report but enables users to search for multiple genes at once by gene symbol.
The ‘Statistics and Downloads’ page has a direct link to the ‘complete HGNC dataset’. This file includes all the HGNC core fields and now includes the new gene family data fields, ‘Gene Family Tag’ and ‘Gene Family Description’ (discussed below). Alternatively, if only the gene family data are required, select ‘complete HGNC Gene Family dataset’. This gives a file with the following fields: URL for gene family page, Gene Family Tag, Gene Family Description, Symbol and HGNC ID.
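As a rough illustration of how the ‘complete HGNC Gene Family dataset’ might be processed, the Python sketch below groups gene symbols by their ‘Gene Family Tag’. The file layout is an assumption: a tab-delimited file with a header row and one row per gene and family pair is assumed here, and the column headers, tag values, URLs and HGNC IDs in the sample are hypothetical placeholders rather than values copied from the real download.

```python
import csv
import io
from collections import defaultdict

# Hypothetical excerpt: the tab delimiter, column headers, tag values and the
# <url-...>/<id-...> placeholders are assumptions, not real HGNC data.
SAMPLE = (
    "URL\tGene Family Tag\tGene Family Description\tSymbol\tHGNC ID\n"
    "<url-1>\tTRP\tTransient receptor potential channels\tTRPM2\t<id-1>\n"
    "<url-2>\tNUDT\tNudix motif-containing family\tTRPM2\t<id-1>\n"
)


def members_by_family(handle):
    """Map each gene family tag to the list of symbols recorded for it."""
    families = defaultdict(list)
    for row in csv.DictReader(handle, delimiter="\t"):
        families[row["Gene Family Tag"]].append(row["Symbol"])
    return dict(families)


families = members_by_family(io.StringIO(SAMPLE))

# A gene can appear under more than one family, as TRPM2 does.
families_of_trpm2 = sorted(
    tag for tag, symbols in families.items() if "TRPM2" in symbols
)
print(families_of_trpm2)
```

To run the same function on a downloaded file, an open file handle can be passed in place of the in-memory sample, after adjusting the column names to those actually present in the file.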
The ‘Custom Downloads’ page, on the other hand, allows the specification of the exact fields required. The selection should include the ‘Gene Family Description’ and the ‘Gene Family Tag’ fields to retrieve the gene family data in the output.
Finally, to obtain the data for the gene family associated to a particular gene of interest, select ‘Download gene family data’ beneath the gene family table on the relevant gene family page.
The data from our previously established ‘complete HGNC dataset’ and the newly created ‘complete HGNC Gene Family dataset’ can be linked through the ‘Gene Family Tag’ and ‘Gene Family Description’ fields. The ‘Gene Family Tag’ is used to generate gene family or group specific pages at the HGNC website and does not necessarily reflect an official nomenclature. The ‘Gene Family Description’ is the name given to a particular gene family. Each ‘Gene Family Description’ has an associated ‘Gene Family Tag’ and vice versa. If a particular gene is a member of more than one gene family, the tags and the descriptions will be shown in the same order. Like all HGNC data, the ‘Gene Family’ datasets are updated daily, so the user will always get the most recent data when clicking on the download links or accessing custom downloads.
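Because tags and descriptions are listed in the same order, the two fields of a multi-family gene can be paired by position. The Python sketch below shows one way this might be done. The separator character is an assumption (the text does not say how multiple values are delimited), the function name is hypothetical, and the tag values in the usage example are invented for illustration; only the two descriptions come from the TRPM2 example.

```python
def pair_families(tag_field, description_field, sep="|"):
    """Pair each gene family tag with its description by position.

    Assumption: multiple values in a field are joined by `sep`; the real
    separator used in the HGNC download may differ.
    """
    tags = [t.strip() for t in tag_field.split(sep)] if tag_field else []
    descriptions = (
        [d.strip() for d in description_field.split(sep)]
        if description_field
        else []
    )
    if len(tags) != len(descriptions):
        raise ValueError("tag and description counts differ; check the separator")
    return list(zip(tags, descriptions))


# Hypothetical tag values; descriptions taken from the TRPM2 example.
trpm2_families = pair_families(
    "TRP|NUDT",
    "Transient receptor potential channels|Nudix motif-containing family",
)
print(trpm2_families)
```

The length check guards against a wrong separator guess, for instance when a description itself contains the chosen character.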
In the future, we plan to increase the number of gene families and the assignment of human genes to gene families. We will also arrange the current alphabetical gene family list into more meaningful categories, for example, grouping them by domain structure, function or disease associations. Another area we are considering is integrating the graphical display in InterPro to represent the domains that are encoded by each gene family member.
Updates regarding the gene family pages are now also mentioned in our Twitter feed and quarterly ‘Newsletter’. If you subscribe to the Newsletter, you will be notified of any new gene families; see our feedback form and tick the box to receive it. Alternatively, newsletters are also available on the website. If you have a gene family you think should be represented or you would like to be considered as one of our specialist advisors, please contact us via email@example.com.
Louise Daugherty would like to thank HGNC for the opportunity to contribute to and work with the team. This work was supported by the Wellcome Trust (081979/Z/07/Z) and the National Human Genome Research Institute (P41 HG03345).
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.