The field of enrichment research has exploded, growing from 14 tools in 2005[1, 4] to 68 cited in a recent survey. This field is still very much under active development, however, with no one 'perfect' method or gold standard protocol guaranteed to give the best results. For this reason, it is useful to understand the current state of the art, the caveats and pitfalls associated with certain analyses and how to identify software tools best suited to a particular dataset.
Owing to the large number of available enrichment tools, it is helpful to use the nomenclature of Huang et al. when discussing enrichment software. Huang et al. classify enrichment tools as belonging to at least one of three algorithmic categories: singular enrichment analysis (SEA), gene set enrichment analysis (GSEA) and modular enrichment analysis (MEA).
The most traditional enrichment approach, SEA, tests annotation terms one at a time for enrichment in a list of interesting genes. An enrichment p-value is calculated by comparing the observed frequency of an annotation term with the frequency expected by chance; individual terms beyond some cut-off (eg p-value ≤ 0.05) are deemed enriched. This is a simple, useful and easy-to-use protocol. Tools belonging to this category (eg Onto-Express, FuncAssociate 2.0, GOStat, BiNGO and EasyGO) predominantly rely on the GO[4, 11] as a source of annotation terms. As SEA considers each term independently, however, it ignores the hierarchical relationships between GO terms. This frequently results in output lists of enriched terms numbering in the hundreds because similar terms are treated as though they were unique, leading to redundancy. Semantic redundancy between terms can also dilute an enriched biological concept, because enrichment is difficult to identify when it is spread across different, yet semantically similar, terms[4, 5]. A drawback to any method relying on a single knowledge or annotation source is that it will also inherit the limitations of that source. In the case of the GO, although it currently contains 29,365 terms, it is a work in progress and its annotations remain incomplete and biased towards well-studied genes.
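The term-by-term test described above can be illustrated with a minimal sketch. It assumes a one-sided hypergeometric test as the enrichment statistic, which is one common choice but is not stated here to be the test used by any of the tools named; the function names, the toy gene identifiers and the term labels are all hypothetical. No multiple-testing correction is applied, and each term is tested independently, which is exactly why related terms can each appear in the output.

```python
from math import comb


def hypergeom_enrichment_p(study_hits, study_size, population_hits, population_size):
    """Probability of seeing at least `study_hits` annotated genes in a random
    draw of `study_size` genes from a background of `population_size` genes,
    of which `population_hits` carry the annotation (upper hypergeometric tail)."""
    max_hits = min(study_size, population_hits)
    total = comb(population_size, study_size)
    tail = sum(
        comb(population_hits, k) * comb(population_size - population_hits, study_size - k)
        for k in range(study_hits, max_hits + 1)
    )
    return tail / total


def singular_enrichment(interesting, background, term_to_genes, cutoff=0.05):
    """Test every annotation term independently (assumed SEA-style workflow).

    Returns (term, hits, p_value) tuples with p_value <= cutoff, sorted by p-value.
    """
    background = set(background)
    interesting = set(interesting) & background
    results = []
    for term, genes in term_to_genes.items():
        annotated = set(genes) & background
        hits = len(annotated & interesting)
        p = hypergeom_enrichment_p(hits, len(interesting), len(annotated), len(background))
        results.append((term, hits, p))
    results.sort(key=lambda r: r[2])
    return [r for r in results if r[2] <= cutoff]


# Toy usage with hypothetical gene identifiers and term labels.
background = [f"g{i}" for i in range(20)]
interesting = ["g0", "g1", "g2", "g3", "g4"]
term_to_genes = {
    "termA": ["g0", "g1", "g2", "g3", "g10"],  # 4 of 5 annotated genes are interesting
    "termB": [f"g{i}" for i in range(5, 15)],  # no overlap with the interesting list
}
enriched = singular_enrichment(interesting, background, term_to_genes)
```

In this toy example only `termA` passes the cut-off; with a real ontology, a parent term and several of its children would typically pass together, producing the redundancy discussed above.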
GSEA-based methods, such as GSEA/P-GSEA[15, 16] and GeneTrail, are similar in character to SEA, but they consider all genes during analysis, not just those deemed interesting or significant by some metric or threshold. GSEA methods work best in scenarios in which phenotypic classes or time points are assayed (eg tumour versus normal tissue, or treated versus untreated state) because the method requires a quantitative biological value (such as fold change or degree of differential expression) for each gene in order to rank them. A maximum enrichment score (MES) is calculated for each annotation category from the positions of its genes in the ranked list of all genes, and an enrichment p-value is determined by comparing the observed MES with randomly generated MES distributions[5, 16]. Put simply, GSEA asks how the genes sharing a particular annotation (eg a biochemical pathway), known as a gene set, are distributed throughout the larger ranked gene list. If they are randomly distributed, the gene set is not significantly associated with any phenotypic class; if they tend to be over-represented towards the top or bottom of the ranked list, this indicates an association between the gene set and the phenotypic classes under study.
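The ranking-based procedure can be sketched as follows. This is a simplified illustration rather than the published GSEA implementation: it assumes a running-sum statistic in which genes in the set are weighted by the absolute value of their score (the `weight` exponent is an assumed parameter), and it builds the null distribution from random gene sets of the same size rather than from permuted phenotype labels. All function names and the toy data are hypothetical.

```python
import random


def enrichment_score(ranked_genes, scores, gene_set, weight=1.0):
    """Walk down the ranked list, stepping up at genes in the set and down
    otherwise, and return the running-sum value furthest from zero."""
    in_set = [g in gene_set for g in ranked_genes]
    n_hits = sum(in_set)
    n_miss = len(ranked_genes) - n_hits
    if n_hits == 0 or n_miss == 0:
        return 0.0
    hit_weights = [
        abs(scores[g]) ** weight if hit else 0.0
        for g, hit in zip(ranked_genes, in_set)
    ]
    total_hit_weight = sum(hit_weights)
    if total_hit_weight == 0:
        return 0.0
    running, best = 0.0, 0.0
    for w, hit in zip(hit_weights, in_set):
        running += w / total_hit_weight if hit else -1.0 / n_miss
        if abs(running) > abs(best):
            best = running
    return best


def permutation_p_value(ranked_genes, scores, gene_set, n_perm=1000, seed=0, weight=1.0):
    """Compare the observed score with scores of random gene sets of equal size
    (an assumed, simplified null model)."""
    rng = random.Random(seed)
    members = [g for g in ranked_genes if g in gene_set]
    observed = enrichment_score(ranked_genes, scores, set(members), weight)
    extreme = 0
    for _ in range(n_perm):
        random_set = set(rng.sample(ranked_genes, len(members)))
        null_score = enrichment_score(ranked_genes, scores, random_set, weight)
        if abs(null_score) >= abs(observed):
            extreme += 1
    # Add-one correction so the estimated p-value is never exactly zero.
    return observed, (extreme + 1) / (n_perm + 1)


# Toy usage: 20 hypothetical genes ranked by a made-up fold-change-like score.
genes = [f"g{i}" for i in range(20)]
scores = {g: 10 - i for i, g in enumerate(genes)}
top_set = {"g0", "g1", "g2"}
observed, p_value = permutation_p_value(genes, scores, top_set, n_perm=200, seed=1)
```

A gene set concentrated at the top of the list yields a score close to +1, one concentrated at the bottom a score close to -1, and a set scattered throughout the list a score nearer zero.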
Many different annotation categories can be used by GSEA methods, including biological function (eg GO terms), physical position (eg chromosomal location), regulation (eg co-expression) or any other attribute for which prior knowledge is available. As with SEA methods, however, these categories are still considered one at a time and treated independently, with no consideration given to the semantic relationships that may exist between the different annotation terms. At times, it may be difficult to assign a single value to a gene; for example, multiple SNPs within a single gene may have differing p-values, or comparisons may have been made across many time points or conditions. In such instances, GSEA-based methods may be inappropriate. It should be noted, however, that recent modifications to GSEA methods to cope with genome-wide association study-derived datasets have been proposed[18–21], and a novel GSEA method using mixed-effects models has successfully identified enriched GO terms in a time-course microarray dataset[22, 23]. In addition, highly ranked genes (ie those with larger fold changes) contribute greatly to the enrichment p-value, the underlying assumption being that genes with greater deregulation (ie larger fold changes) contribute more to the observed phenotype. This assumption does not always hold true in real biology, however.
The final algorithmic class defined by Huang et al., MEA, is the only class to use the relationships that may exist between different annotation terms during enrichment. As mentioned previously, doing so can reduce redundancy and prevent the dilution of potentially important biological concepts. A number of tools (eg Ontologizer, topGO and GeneCodis) claim to have improved sensitivity and specificity by considering relationships between GO annotation terms in this manner. The use of composite annotation terms may therefore be able to provide biological insight otherwise lacking in analyses treating single terms as independent entities. These tools, however, still focus on a single annotation source, in this instance the GO. Many additional types of information or attributes can be used during enrichment analysis, and incorporating a range of annotation types in concert can make analysis more effective, as increased coverage increases analytical power and can provide a more complete view of the biology underlying a gene set of interest.
Functional enrichment tools such as DAVID[3, 27] and the recently released ConceptGen do exactly that, not only considering relationships between annotation terms (both within an annotation source and between different sources), but also integrating annotation terms from a range of sources, including those representing protein-protein interactions, protein functional domains (eg InterPro), disease associations (eg OMIM), pathways (eg KEGG, BioCarta), sequence features, homology, expression patterns (eg GEO) and literature. By grouping similar, redundant and homogeneous annotation content from the same or different resources into annotation groups, the burden of associating similar and redundant terms is reduced, and the biological interpretation of gene sets moves from a gene-centric to a biological-module-centric approach, which may provide a better representation of a biological process[4, 5]. These tools have also invested in novel visualisation methods to support effective exploration of results. One consideration when using this seemingly comprehensive analysis protocol is that 'orphan' genes (ie genes without strong relationships to other terms) may be overlooked, requiring such genes to be investigated manually through other methods.
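The idea of collapsing redundant terms from different sources into annotation groups can be illustrated with a deliberately simple sketch. It is not the algorithm used by DAVID or ConceptGen: it assumes that two terms are 'similar' when the Jaccard overlap of their annotated gene sets reaches an arbitrary threshold, and it merges similar terms by single linkage. The function names, the threshold and the toy term labels are hypothetical.

```python
def jaccard(a, b):
    """Fraction of genes shared by two annotation terms (0.0 if both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def group_terms(term_to_genes, threshold=0.5):
    """Merge terms whose gene sets overlap by at least `threshold`
    (assumed similarity measure), using a small union-find structure."""
    terms = sorted(term_to_genes)
    gene_sets = {t: set(term_to_genes[t]) for t in terms}
    parent = {t: t for t in terms}

    def find(t):
        # Follow parent links to the group representative, compressing the path.
        while parent[t] != t:
            parent[t] = parent[parent[t]]
            t = parent[t]
        return t

    for i, a in enumerate(terms):
        for b in terms[i + 1:]:
            if jaccard(gene_sets[a], gene_sets[b]) >= threshold:
                parent[find(b)] = find(a)

    groups = {}
    for t in terms:
        groups.setdefault(find(t), []).append(t)
    return sorted(groups.values())


# Toy usage: two near-identical terms from different sources and one unrelated term.
term_to_genes = {
    "GO:apoptosis": {"a", "b", "c"},
    "KEGG:apoptosis": {"a", "b", "c", "d"},
    "InterPro:kinase": {"x", "y"},
}
annotation_groups = group_terms(term_to_genes)
```

Here the two apoptosis terms fall into one group while the kinase term remains a group of one, loosely analogous to the ungrouped material that, as noted above, may need to be examined separately.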