ArrayTrack: a free FDA bioinformatics tool to support emerging biomedical research -- an update
© Henry Stewart Publications 2010
Received: 28 May 2010
Accepted: 28 May 2010
Published: 1 August 2010
ArrayTrack™ is a Food and Drug Administration (FDA) bioinformatics tool that has been widely adopted by the research community for genomics studies. It provides an integrated environment for microarray data management, analysis and interpretation. Most of its functionality for statistical, pathway and gene ontology analysis can also be applied independently to data generated by other molecular technologies. ArrayTrack has been undergoing active development and enhancement since its inception in 2001. This review summarises its key functionalities, with emphasis on the most recent extensions in support of the evolving needs of FDA's research programmes. ArrayTrack has added the capability to manage, analyse and interpret proteomics and metabolomics data after quantification of peptide and metabolite abundance, respectively. Annotation information about single nucleotide polymorphisms and quantitative trait loci has been integrated to support genetics-related studies. Other extensions have been added to manage and analyse genomics data related to bacterial food-borne pathogens.
Microarray technology has been widely used in diverse fields, including pharmacology, toxicology and clinical medicine [1–5]. An integrated bioinformatics infrastructure to facilitate data management and analysis is vital to realising the full potential of genomics in research and public health. The Food and Drug Administration (FDA) has developed the genomics tool ArrayTrack™, which has a rich set of functionalities to manage, analyse and interpret genomic data, with a focus on microarray data [6, 7]. Most of this functionality is also applicable to biomarker development in biomedical research and personalised medicine using other molecular technologies.
Over the years, ArrayTrack has been presented and reviewed in several publications [6, 7, 9–12]. Following a brief overview of its key functionalities, this update focuses on its most recent extensions. The user manual and tutorials are available from the ArrayTrack website (http://www.fda.gov/ArrayTrack). At the time of writing, ArrayTrack version 3.5, which includes the added capabilities described herein, has just been released.
A brief overview
ArrayTrack was designed as a client-server system that provides: (i) a database server hosting microarray data, related study data and a collection of genomic functional information (eg gene annotations, pathways, gene ontology); (ii) a client application that integrates visualisation and analysis of microarray data with functional information; and (iii) a hierarchically modular structure that provides extensibility for other types of 'omics' data and data from other emerging molecular technologies (eg genome-wide association studies).
ArrayTrack comprises three major integrated components: (i) MicroarrayDB, a database that stores essential data associated with a microarray experiment, including raw gene expression data and information on samples, treatments and phenotypic observations; (ii) LIB, a series of libraries that contain functional information (eg gene annotations, protein function and pathways) from public sources; and (iii) TOOL, a set of algorithms that provide analysis capabilities for data visualisation, normalisation, significance analysis, clustering and classification. A user-friendly interface enables the selection of an analysis method from TOOL, applying the method to selected microarray data stored in MicroarrayDB and linking analysis results directly to associated functional annotations in LIB. Some key functionalities and their applications are highlighted below and the full list of functions is available on the ArrayTrack website.
Data import: SimpleTox, an internally developed Clinical Data Interchange Standards Consortium/Standard for Exchange of Non-clinical Data (CDISC/SEND)-compliant format, allows essential experiment summaries to be uploaded in a spreadsheet format. This consistent and easy-to-use format facilitates cross-study analysis.
Class comparison: ArrayTrack offers a rich set of methods for determining a list of differentially expressed genes (DEGs) by comparing two groups (eg treated versus control), such as the simple t-test, the Volcano plot and more advanced statistical approaches, such as false discovery rate (FDR) control and significance analysis of microarrays (SAM). In addition, the analysis of variance (ANOVA) is available for multi-group comparison.
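The class-comparison steps named above can be illustrated with a short Python sketch. It is not ArrayTrack's own implementation: it computes a Welch t-statistic for one gene, applies the Benjamini-Hochberg procedure (one common form of FDR control) to a list of p-values, and then applies a volcano-style filter that combines a fold-change cut-off with an adjusted p-value cut-off. All function names, thresholds and toy values are assumptions made for this example.

```python
import math
import statistics


def welch_t(treated, control):
    """Welch t-statistic for one gene (unequal variances allowed)."""
    se = math.sqrt(statistics.variance(treated) / len(treated)
                   + statistics.variance(control) / len(control))
    return (statistics.mean(treated) - statistics.mean(control)) / se


def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p-values in the original order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so adjusted values stay monotone.
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, p_values[idx] * m / rank)
        adjusted[idx] = running_min
    return adjusted


def volcano_filter(log2_fc, adj_p, fc_cut=1.0, p_cut=0.05):
    """Indices of genes passing both the fold-change and adjusted p-value cuts."""
    return [i for i, (fc, p) in enumerate(zip(log2_fc, adj_p))
            if abs(fc) >= fc_cut and p <= p_cut]


# Toy input: four hypothetical genes with made-up statistics.
p_values = [0.01, 0.04, 0.03, 0.005]
log2_fc = [1.5, 0.2, -1.2, 2.0]
adj_p = benjamini_hochberg(p_values)
selected = volcano_filter(log2_fc, adj_p)
```

In this toy input the second gene is dropped because its fold change is small, even though its adjusted p-value is below the cut-off. That is the point of combining both criteria in a volcano plot.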
Class discovery: Principal component analysis (PCA), hierarchical cluster analysis (HCA) and K-means clustering are available for unsupervised class discovery and pattern identification. Such methods are powerful ways to investigate the grouping of samples in terms of their similarities in gene expression.
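To make the idea of unsupervised grouping concrete, the following Python sketch runs a minimal K-means clustering on toy expression profiles. It is an illustration under stated assumptions, not the algorithm as implemented in ArrayTrack: the function name is hypothetical, distances are Euclidean, and the centroids are initialised from the first k samples purely to keep the example deterministic.

```python
import math


def kmeans(points, k, n_iter=20):
    """Minimal K-means; returns (labels, centroids)."""
    # Assumption: seed centroids with the first k samples (deterministic, not robust).
    centroids = [list(p) for p in points[:k]]
    labels = [0] * len(points)
    for _ in range(n_iter):
        # Assignment step: each sample joins its nearest centroid.
        labels = [min(range(k), key=lambda c: math.dist(p, centroids[c]))
                  for p in points]
        # Update step: each centroid moves to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centroids


# Toy data: four hypothetical samples, two genes each.
samples = [[1.0, 1.1], [0.9, 1.0], [5.0, 5.2], [5.1, 4.9]]
labels, centroids = kmeans(samples, 2)
```

The first two samples end up in one cluster and the last two in the other. In practice a better initialisation and multiple restarts would be advisable.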
Class prediction: For predictive model development, ArrayTrack offers supervised learning methods such as K-nearest neighbour (KNN), linear discriminant analysis (LDA) and support vector machines (SVMs). Model building is an important step towards applying microarray technology to diagnosis, prognosis and treatment outcome prediction.
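As a hedged illustration of supervised class prediction, the sketch below implements a plain K-nearest-neighbour classifier in Python. The function name, the Euclidean distance and the toy training set are assumptions for this example and do not describe ArrayTrack's internal code.

```python
import math
from collections import Counter


def knn_predict(train_x, train_y, query, k=3):
    """Predict the class of `query` by majority vote among its k nearest samples."""
    neighbours = sorted(zip(train_x, train_y),
                        key=lambda pair: math.dist(pair[0], query))[:k]
    votes = Counter(label for _, label in neighbours)
    return votes.most_common(1)[0][0]


# Toy training set: hypothetical two-gene profiles with known class labels.
train_x = [[1.0, 1.0], [1.2, 0.8], [0.9, 1.1],
           [5.0, 5.0], [5.2, 4.8], [4.9, 5.1]]
train_y = ["control"] * 3 + ["treated"] * 3

prediction_a = knn_predict(train_x, train_y, [1.1, 1.0])
prediction_b = knn_predict(train_x, train_y, [5.0, 4.9])
```

An odd k is used here so that a two-class vote cannot tie. A real model would also be assessed by cross-validation before any diagnostic or prognostic use.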
In summary, ArrayTrack provides a broad range of functionalities, allowing integration of functional information about genes, proteins, pathways and gene ontology [7, 14] with data from microarrays and other molecular technologies. Such integrative analysis capability furthers ArrayTrack's ability to support systems biology. ArrayTrack's diverse capabilities have been augmented recently, as described in the next section.
Support for proteomics and metabolomics data
Proteomics and metabolomics have grown steadily in importance in biomedical research, in parallel with microarrays. The integration of tri-omics data (ie genomics, proteomics and metabolomics data) has been a primary goal in systems biology for drug development and safety evaluation. To support this line of research and to review this type of data submitted by sponsors to the FDA through the Voluntary Genomic Data Submission (VGDS) program -- or, as it has been renamed, the Voluntary Exploratory Data Submissions (VXDS) program -- ArrayTrack was previously modified to accommodate lists of proteins and metabolites. Additionally, a new systems biology function, called CommonPathway, was added that enables the examination of common pathways and functional categories (eg gene ontology terms) shared by different data types.
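The idea behind a common-pathway analysis can be sketched as a set intersection: each data type contributes the pathways touched by its significant molecules, and the pathways shared by all types are reported. The Python below is only a minimal sketch of that idea, not the CommonPathway function itself. The function name, the identifiers and the molecule-to-pathway map are all hypothetical.

```python
def common_pathways(hits_by_type, pathway_map):
    """Return (pathways shared by all data types, pathways per data type)."""
    per_type = {}
    for data_type, molecules in hits_by_type.items():
        pathways = set()
        for molecule in molecules:
            pathways.update(pathway_map.get(molecule, ()))
        per_type[data_type] = pathways
    shared = set.intersection(*per_type.values()) if per_type else set()
    return shared, per_type


# Hypothetical annotation: molecule identifier -> pathway names.
pathway_map = {
    "GENE_A": {"Pathway 1", "Pathway 2"},
    "GENE_B": {"Pathway 3"},
    "PROT_X": {"Pathway 2"},
    "METAB_M": {"Pathway 2", "Pathway 3"},
}
# Hypothetical significant molecules from each omics experiment.
hits_by_type = {
    "genomics": ["GENE_A", "GENE_B"],
    "proteomics": ["PROT_X"],
    "metabolomics": ["METAB_M"],
}
shared, per_type = common_pathways(hits_by_type, pathway_map)
```

Here only "Pathway 2" is supported by all three data types, which is the kind of cross-omics agreement such an analysis is meant to surface.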
ArrayTrack is now capable of analysing data from any mass spectrometry platform once the raw data have been processed for detection and quantification of peptides or metabolites -- an important step in the data analysis workflow [15, 16]. New tools have been created to simplify the handling of proteomics data from the two most popular database search programs, Mascot and Sequest, for detection of peptides. The tools convert output files from these two programs into ArrayTrack-readable files.
The same interpretation tools used for microarray data in ArrayTrack can be applied to proteomics and metabolomics data. Thus, by linking the results to gene, protein and pathway databases, researchers can contextualise these results in the same way as those from gene expression experiments. Additionally, a unified interface helps to reduce the 'learning curve' associated with analysing new data types, giving researchers currently working on microarrays an incentive to move towards more integrated approaches that also encompass proteins and metabolites.
SNP and QTL libraries
Recent advances in microarray-based genotyping techniques have enabled researchers to scan rapidly for known single nucleotide polymorphisms (SNPs) across complete genomes. An efficient data-mining strategy and a set of sophisticated tools are needed to understand and utilise the findings from genetic association studies more fully.
One of the focuses in genetic association studies is to relate SNPs to genes and pathways in order to understand the underlying disease mechanisms. ArrayTrack has already provided a gene-pathway discovery platform. By integrating the SNP library, which contains annotation summary information for SNPs and their mapped relationship to genes, ArrayTrack now provides an integrated SNP-gene-pathway analysis platform for SNP studies.
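Conceptually, an SNP-gene-pathway analysis chains two annotation lookups: SNP to mapped genes, then gene to pathways. The Python sketch below shows that chain and counts how many SNPs of interest point to each pathway. It assumes simple in-memory dictionaries with made-up identifiers. The actual library schema is not described here, so every name in the example is hypothetical.

```python
from collections import Counter


def snp_to_pathways(snps, snp_to_genes, gene_to_pathways):
    """Map each SNP to the pathways of the genes it is annotated to."""
    result = {}
    for snp in snps:
        pathways = set()
        for gene in snp_to_genes.get(snp, ()):
            pathways.update(gene_to_pathways.get(gene, ()))
        result[snp] = pathways
    return result


def pathway_support(snp_pathways):
    """Count how many SNPs map to each pathway."""
    counts = Counter()
    for pathways in snp_pathways.values():
        counts.update(pathways)
    return counts


# Hypothetical annotations with made-up identifiers.
snp_to_genes = {"rsA": ["GENE_A"], "rsB": ["GENE_A", "GENE_B"], "rsC": []}
gene_to_pathways = {"GENE_A": {"Pathway 1"}, "GENE_B": {"Pathway 1", "Pathway 2"}}

mapped = snp_to_pathways(["rsA", "rsB", "rsC"], snp_to_genes, gene_to_pathways)
support = pathway_support(mapped)
```

SNPs with no mapped gene (such as "rsC" here) are kept with an empty pathway set, so unannotated variants remain visible rather than being silently dropped.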
A quantitative trait locus (QTL) is a region of DNA that is associated with a particular phenotypic trait. A common use for QTL data is to identify candidate genes underlying a trait within one or more QTLs. The identification of the SNP-gene-QTL relationship is the basis of tests to determine whether the gene/SNP is associated with the aetiology of a disease in animal models or human studies. The integration of SNP and QTL libraries into ArrayTrack enables dynamic mining of such complex biological interactions.
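Identifying candidate genes within a QTL comes down to an interval-overlap query on genomic coordinates. The following Python sketch illustrates that query under simple assumptions: regions are closed intervals on a named chromosome, and the `Region` type, function names and coordinates are invented for this example rather than taken from ArrayTrack.

```python
from typing import NamedTuple


class Region(NamedTuple):
    """A named genomic interval (closed, same coordinate system for all regions)."""
    name: str
    chrom: str
    start: int
    end: int


def overlaps(a, b):
    """True if two regions share at least one position on the same chromosome."""
    return a.chrom == b.chrom and a.start <= b.end and b.start <= a.end


def candidate_genes(qtls, genes):
    """For each QTL, list the names of genes that overlap it."""
    return {q.name: [g.name for g in genes if overlaps(q, g)] for q in qtls}


# Hypothetical QTLs and genes with made-up coordinates.
qtls = [Region("QTL_1", "chr1", 1_000, 5_000),
        Region("QTL_2", "chr2", 10_000, 20_000)]
genes = [Region("GENE_A", "chr1", 4_500, 6_000),
         Region("GENE_B", "chr1", 7_000, 8_000),
         Region("GENE_C", "chr2", 12_000, 13_000)]

candidates = candidate_genes(qtls, genes)
```

The same overlap test could be applied to SNP positions (intervals of length one) to relate SNPs, genes and QTLs in a single pass.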
SNP and QTL libraries have been constructed and incorporated into ArrayTrack. Data from several public repositories were collected in the SNP and QTL libraries and connected to other domain libraries (genes, proteins, metabolites and pathways) in ArrayTrack. Linking the data sets within ArrayTrack allows searching of SNP and QTL data, as well as their relationships to other biological molecules. The SNP library includes approximately 15 million human SNPs and their annotations, while the QTL library contains publicly available QTL data associated with specific phenotypes identified in mice, rats and humans. Case studies demonstrating the utility of these libraries have been reported [17, 18].
Support for microbial pathogen microarray data
Food-borne pathogens are a leading cause of illness in the USA. High-throughput microarray technology provides an effective way to identify, characterise and obtain a nearly complete snapshot of the genetic traits of bacterial strains, such as their pathogenicity, virulence or antimicrobial resistance. Such genome-wide insight is necessary for accurate identification and discrimination of pathogens that may contaminate the food supply.
Recent enhancements to ArrayTrack have broadened its capability to support proteomics, metabolomics and SNP-related studies. As an integrated omics data analysis environment, ArrayTrack allows meta-analyses and integration of results from multiple studies with pathway and functional annotations. By providing powerful but easy-to-use utilities, ArrayTrack is positioned to help make integrated, contextualised analyses more common, which, in turn, will help to harness genetic knowledge to improve the protection of public health.
The views presented in this paper do not necessarily reflect those of the US FDA.
The authors would like to thank all current and former members of the ArrayTrack development team for their dedication in building up ArrayTrack into an invaluable research tool. The authors would also like to thank all ArrayTrack users for their feedback and support.
© Federal Government, 2010
- van't Veer LJ, Dai H, van de Vijver MJ, He YD, et al: Gene expression profiling predicts clinical outcome of breast cancer. Nature. 2002, 415 (6871): 530-536. 10.1038/415530a.
- Gunther EC, Stone DJ, Gerwein RW, Bento P, et al: Prediction of clinical drug efficacy by classification of drug-induced genomic expression profiles in vitro. Proc Natl Acad Sci USA. 2003, 100 (16): 9608-9613. 10.1073/pnas.1632587100.
- Kaushik N, Fear D, Richards SCM, McDermott CR, et al: Gene expression in peripheral blood mononuclear cells from patients with chronic fatigue syndrome. J Clin Pathol. 2005, 58: 826-832. 10.1136/jcp.2005.025718.
- Bushel PR, Heinloth AN, Li J, Huang L, et al: Blood gene expression signatures predict exposure levels. Proc Natl Acad Sci USA. 2007, 104 (46): 18211-18216. 10.1073/pnas.0706987104.
- Oberthuer A, Warnat P, Kahlert Y, Westermann F, et al: Classification of neuroblastoma patients by published gene-expression markers reveals a low sensitivity for unfavorable courses of MYCN non-amplified disease. Cancer Lett. 2007, 250 (2): 250-267. 10.1016/j.canlet.2006.10.016.
- Tong W, Cao X, Harris S, Sun H, et al: ArrayTrack -- Supporting toxicogenomic research at the U.S. Food Drug Administration National Center for Toxicological Research. Environ Health Perspect. 2003, 111 (15): 1819-1826. 10.1289/ehp.6497.
- Harris SC, Fang H, Su Z, Chen M, et al: FDA bioinformatics tool for public use -- ArrayTrack™. Regulatory Research Perspectives. 2009, 8 (1): 1-25.
- Frueh FW: Impact of microarray data quality on genomic data submissions to the FDA. Nat Biotechnol. 2006, 24 (9): 1105-1107. 10.1038/nbt0906-1105.
- Fang H, Harris SC, Su Z, Chen M, et al: ArrayTrack: An FDA and public genomic tool. Methods Mol Biol. 2009, 563 (3): 379-398.
- Fang H, Perkins R, Tong W: Omics data integration: A systems approach view. American Drug Discovery. 2007, 2: 49-52.
- Tong W, Harris SC, Fang H, Shi L, et al: An integrated bioinformatics infrastructure essential for advancing pharmacogenomics and personalized medicine in the context of the FDA's Critical Path Initiative. Drug Discovery Today: Technologies. 2007, 4 (1): 3-8.
- Tong W, Harris S, Cao X, Fang H, et al: Development of public toxicogenomics software for microarray data management and analysis. Mutat Res. 2004, 549: 241-253. 10.1016/j.mrfmmm.2003.12.024.
- CDISC T: Clinical Data Interchange Standards Consortium (CDISC): CDISC Inc. 2007, 15907 Two Rivers Cove, Austin, Texas 78717, [http://www.cdisc.org/index.html]
- Sun H, Fang H, Chen T, Perkins R, et al: GOFFA: Gene Ontology For Functional Analysis -- A FDA gene ontology tool for analysis of genomic and proteomic data. BMC Bioinformatics. 2006, 7 (Suppl 2): S23. 10.1186/1471-2105-7-S2-S23.
- Domon B, Aebersold R: Challenges and opportunities in proteomics data analysis. Mol Cell Proteomics. 2006, 5 (10): 1921-1926. 10.1074/mcp.R600012-MCP200.
- Spratlin JL, Serkova NJ, Eckhardt SG: Clinical applications of metabolomics in oncology: A review. Clin Cancer Res. 2009, 15 (2): 431-440. 10.1158/1078-0432.CCR-08-1059.
- Xu J, et al: Two new ArrayTrack™ libraries for personalized biomedical research. BMC Bioinformatics. 2010.
- Wise C, Kaput J: A strategy for analyzing gene-nutrient interactions in Type 2 diabetes. J Diabetes Sci Technol. 2009, 3 (4): 710-721.
- Fang H, et al: An FDA bioinformatics tool for microbial genomics research on molecular characterization of bacterial foodborne pathogens using microarrays. BMC Bioinformatics. 2010.
- Kanehisa M: The KEGG database. Novartis Found Symp. 2002, 247: 91-101.