ArrayTrack: a free FDA bioinformatics tool to support emerging biomedical research -- an update

ArrayTrack™ is a Food and Drug Administration (FDA) bioinformatics tool that has been widely adopted by the research community for genomics studies. It provides an integrated environment for microarray data management, analysis and interpretation. Most of its functionality for statistical, pathway and gene ontology analysis can also be applied independently to data generated by other molecular technologies. ArrayTrack has been undergoing active development and enhancement since its inception in 2001. This review summarises its key functionalities, with emphasis on the most recent extensions in support of the evolving needs of FDA's research programmes. ArrayTrack has added capability to manage, analyse and interpret proteomics and metabolomics data after quantification of peptide and metabolite abundance, respectively. Annotation information about single nucleotide polymorphisms and quantitative trait loci has been integrated to support genetics-related studies. Other extensions have been added to manage and analyse genomics data related to bacterial food-borne pathogens.


Introduction
Microarray technology has been widely used in diverse fields, including pharmacology, toxicology and clinical medicine. 1-5 An integrated bioinformatics infrastructure to facilitate data management and analysis is vital to realising the full potential impact of genomics on research and public health. The Food and Drug Administration (FDA) has developed the genomics tool ArrayTrack™, which has a rich set of functionalities to manage, analyse and interpret genomic data, with a focus on microarray data. 6,7 Most of this functionality is applicable to biomarker development in biomedical research and personalised medicine using other molecular technologies.
ArrayTrack became the FDA's genomics tool to support the Voluntary Genomics Data Submission (VGDS) programme 8 in early 2004. VGDS provides a novel mechanism outside normal regulatory interactions for sponsors and the FDA to develop expertise, tools and processes appropriate for regulatory interpretation of pharmacogenomics data. In addition to its broad use within the FDA in various research and regulation-related programmes, ArrayTrack is also freely available to the scientific community. Users can gain access to ArrayTrack either through the website (http://www.fda.gov/ArrayTrack) or by requesting media for local installation. The ArrayTrack user base has grown steadily, and the tool has been adopted by several government agencies (eg the Environmental Protection Agency, Centers for Disease Control and Prevention and National Institutes of Health), academic institutions and private companies. Figure 1 presents an overview of the user base growth trend since 2004.
Over the years, ArrayTrack has been presented and reviewed in several publications. 6,7,9-12 This update will, following a brief overview of its key functionalities, focus on its most recent extensions. The user manual and tutorials are available from the ArrayTrack website (http://www.fda.gov/ArrayTrack). At the time of this writing, ArrayTrack version 3.5, which includes the added capabilities described herein, has just been released.

A brief overview
ArrayTrack was designed as a client-server system that provides: (i) a database server hosting microarray data, related study data and a collection of genomic functional information (eg gene annotations, pathways, gene ontology); (ii) a client application that integrates visualisation and analysis of microarray data with functional information; and (iii) a hierarchically modular structure that provides extensibility for other types of 'omics' data and data from other emerging molecular technologies (eg genome-wide association studies).
ArrayTrack comprises three major integrated components: (i) MicroarrayDB, a database that stores essential data associated with a microarray experiment, including raw gene expression data and information on samples, treatments and phenotypic observations; (ii) LIB, a series of libraries that contain functional information (eg gene annotations, protein function and pathways) from public sources; and (iii) TOOL, a set of algorithms that provide analysis capabilities for data visualisation, normalisation, significance analysis, clustering and classification. A user-friendly interface enables the selection of an analysis method from TOOL, applying the method to selected microarray data stored in MicroarrayDB and linking analysis results directly to associated functional annotations in LIB. Some key functionalities and their applications are highlighted below and the full list of functions is available on the ArrayTrack website.
• Data import: SimpleTox, an internally developed format compliant with the Clinical Data Interchange Standards Consortium/Standard for Exchange of Non-clinical Data (CDISC/SEND), 13 allows essential experiment summaries to be uploaded in a spreadsheet format. This consistent and easy-to-use format facilitates cross-study analysis.
• Class comparison: ArrayTrack offers a rich set of methods for determining a list of differentially expressed genes (DEGs) by comparing two groups (eg treated versus control), ranging from the simple t-test and the volcano plot to more advanced statistical approaches such as false discovery rate (FDR) control and significance analysis of microarrays (SAM). In addition, analysis of variance (ANOVA) is available for multi-group comparison.
• Class discovery: Principal component analysis (PCA), hierarchical cluster analysis (HCA) and K-means clustering are available for unsupervised class discovery and pattern identification. Such methods are powerful ways to investigate the grouping of samples in terms of their similarities in gene expression.
• Class prediction: For predictive model development, ArrayTrack offers supervised learning methods such as K-nearest neighbour (KNN), linear discriminant analysis (LDA) and support vector machines (SVMs). Model building is an important step towards applying microarray technology to diagnosis, prognosis and treatment outcome prediction.
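To make the FDR step of class comparison concrete, the following is a minimal sketch of one widely used FDR procedure, the Benjamini-Hochberg adjustment, applied to per-gene p-values. It is not ArrayTrack's implementation, and the text above does not state which FDR procedure ArrayTrack uses; the function names, gene names and p-values are all hypothetical.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p-values in the input order."""
    m = len(p_values)
    # Indices of the p-values sorted from smallest to largest.
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


def select_degs(gene_p_values, fdr=0.05):
    """Return genes whose adjusted p-value is at or below the FDR level."""
    genes = list(gene_p_values)
    adjusted = benjamini_hochberg([gene_p_values[g] for g in genes])
    return [g for g, q in zip(genes, adjusted) if q <= fdr]


# Hypothetical p-values, eg from a treated-versus-control t-test.
example_p = {"geneA": 0.001, "geneB": 0.008, "geneC": 0.039,
             "geneD": 0.041, "geneE": 0.6}
print(select_degs(example_p))  # ['geneA', 'geneB']
```

In this toy example geneC and geneD have raw p-values below 0.05 but are not retained after adjustment, which is the behaviour that distinguishes FDR control from an unadjusted t-test cut-off.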
Importantly, most ArrayTrack functionalities are not limited to microarray data. For example, the information from ArrayTrack's extensive collection of libraries can be applied directly to data from non-microarray technologies, results collected from the literature, or other sources (Figure 2). In addition, most statistical, classification and visualisation tools have been implemented as generic tools and can be readily used to analyse data whose dependent and independent variables come from other research fields.
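As an illustration of why such tools are generic, the sketch below shows a K-nearest neighbour classifier that operates on any numeric feature vectors, regardless of whether they come from microarrays or another technology. It is an assumption-laden toy example, not ArrayTrack code; the function name, the choice of Euclidean distance and the data are hypothetical.

```python
import math
from collections import Counter


def knn_predict(train_profiles, train_labels, query, k=3):
    """Predict the label of `query` by majority vote of its k nearest profiles."""
    # Pair each training profile's Euclidean distance to the query with its label.
    distances = sorted(
        (math.dist(profile, query), label)
        for profile, label in zip(train_profiles, train_labels)
    )
    # Count the labels of the k closest training samples.
    votes = Counter(label for _, label in distances[:k])
    return votes.most_common(1)[0][0]


# Hypothetical two-feature profiles (eg two gene or metabolite abundances).
profiles = [(1.0, 1.2), (0.9, 1.0), (1.1, 0.8),
            (3.0, 3.2), (3.1, 2.9), (2.8, 3.0)]
labels = ["control"] * 3 + ["treated"] * 3

print(knn_predict(profiles, labels, (2.9, 3.1)))  # treated
```

Nothing in the function refers to genes or arrays, so the same call works for any data set expressed as labelled numeric vectors.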
In summary, ArrayTrack provides a broad range of functionalities, allowing integration of functional information about genes, proteins, pathways and gene ontology 7,14 with data from microarrays and other molecular technologies. Such integrative analysis capability furthers ArrayTrack's ability to support systems biology. ArrayTrack's diverse capabilities have been augmented recently, as described in the next section.

New development
ArrayTrack development has continued with the addition of new utilities to address the growing and changing needs of FDA's research programmes. New features facilitate the management of preprocessed proteomics and metabolomics data. The single nucleotide polymorphism (SNP) and quantitative trait locus (QTL) libraries have been integrated to support pathway analysis and data mining for SNP-related studies. Extensive enhancements have also been made to manage and analyse the genetic profiling data related to bacterial food-borne pathogens. These enhancements are depicted in Figure 3 in relation to ArrayTrack's core functionality.

Support for proteomics and metabolomics data
Proteomics and metabolomics have grown steadily in importance in biomedical research, in parallel with microarrays. The integration of tri-omics data (ie genomics, proteomics and metabolomics data) has been a primary goal in systems biology for drug development and safety evaluation. To support this line of research and to review this type of data submitted by sponsors to the FDA through the VGDS programme, since renamed the Voluntary Exploratory Data Submissions (VXDS) programme, ArrayTrack was previously modified to accommodate lists of proteins and metabolites. Additionally, a new systems biology function, called CommonPathway, 10 was added that enables the examination of common pathways and functional categories (eg gene ontology terms) shared by different data types.
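The idea of examining pathways shared by different data types can be sketched as a set intersection. The example below is a hypothetical illustration, not the CommonPathway implementation: it assumes each pathway is represented as a set of member identifiers and reports the pathways that contain at least one hit from every data type. The function name, pathway memberships and hit lists are invented toy data.

```python
def common_pathways(pathway_members, hits_by_type):
    """Return pathways containing at least one hit from every data type.

    pathway_members: dict mapping pathway name -> set of member identifiers
    hits_by_type:    dict mapping data type name -> iterable of identifiers
    """
    shared = {}
    for pathway, members in pathway_members.items():
        # For each data type, keep the hits that belong to this pathway.
        per_type = {
            dtype: sorted(members & set(hits))
            for dtype, hits in hits_by_type.items()
        }
        # Keep the pathway only if no data type came up empty.
        if all(per_type.values()):
            shared[pathway] = per_type
    return shared


# Toy memberships and hit lists (hypothetical).
pathways = {
    "Glycolysis": {"HK1", "PFKM", "glucose", "pyruvate"},
    "TCA cycle": {"CS", "citrate"},
}
hits = {
    "genes": ["HK1", "CS"],
    "proteins": ["PFKM"],
    "metabolites": ["pyruvate", "citrate"],
}
print(common_pathways(pathways, hits))
```

Here only the toy "Glycolysis" pathway is returned, because the toy "TCA cycle" pathway has gene and metabolite hits but no protein hit.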
ArrayTrack is now capable of analysing data from any mass spectrometry platform once the raw data have been processed for detection and quantification of peptides or metabolites, an important step in the data analysis workflow. 15,16 New tools have been created to simplify the handling of proteomics data from the two most popular database search programs, Mascot and Sequest, for detection of peptides. The tools convert output files from these two programs into ArrayTrack-readable files.
The same interpretation tools used for microarray data in ArrayTrack extend to proteomics and metabolomics data. Thus, by linking the results to gene, protein and pathway databases, researchers will be able to contextualise these results in the same way as those from gene expression experiments. Additionally, a unified interface helps to reduce the 'learning curve' associated with analysing new data types, giving researchers currently working on microarrays an incentive to move towards more integrated approaches that also encompass proteins and metabolites.

SNP and QTL libraries
Recent advances in microarray-based genotyping techniques have enabled researchers to scan rapidly for known SNPs across complete genomes. An efficient data-mining strategy and a set of sophisticated tools are needed to understand and use the findings from genetic association studies more fully.
One focus of genetic association studies is to relate SNPs to genes and pathways in order to understand the underlying disease mechanisms. ArrayTrack already provides a gene-pathway discovery platform. By integrating the SNP library, which contains annotation summary information for SNPs and their mapped relationship to genes, ArrayTrack now provides an integrated SNP-gene-pathway analysis platform for SNP studies.
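Conceptually, SNP-gene-pathway analysis chains two lookups: SNP to gene, then gene to pathway. The sketch below illustrates that chaining with plain dictionaries; it is not ArrayTrack's database schema or query logic, and the function name, identifiers and mappings are hypothetical placeholders.

```python
def snps_to_pathways(snp_ids, snp_to_genes, gene_to_pathways):
    """Map SNPs to pathways through their annotated genes.

    Returns a dict: pathway -> set of (snp, gene) pairs supporting it.
    """
    pathway_hits = {}
    for snp in snp_ids:
        # A SNP may map to zero, one or several genes.
        for gene in snp_to_genes.get(snp, ()):
            # A gene may belong to zero, one or several pathways.
            for pathway in gene_to_pathways.get(gene, ()):
                pathway_hits.setdefault(pathway, set()).add((snp, gene))
    return pathway_hits


# Hypothetical annotation tables (placeholders, not real identifiers).
snp_to_genes = {"snp_1": ["GENE_A"], "snp_2": ["GENE_A", "GENE_B"], "snp_3": []}
gene_to_pathways = {"GENE_A": ["pathway_X"], "GENE_B": ["pathway_X", "pathway_Y"]}

hits = snps_to_pathways(["snp_1", "snp_2", "snp_3"], snp_to_genes, gene_to_pathways)
for pathway in sorted(hits):
    print(pathway, sorted(hits[pathway]))
```

Keeping the supporting (SNP, gene) pairs, rather than only pathway names, lets a reader trace each pathway back to the associations that implicate it.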
A QTL is a region of DNA that is associated with a particular phenotypic trait. A common use for QTL data is to identify candidate genes underlying a trait within one or more QTLs. The identification of the SNP-gene-QTL relationship is the basis of tests to determine whether the gene/SNP is associated with the aetiology of a disease in animal models or human studies. The integration of SNP and QTL libraries into ArrayTrack enables dynamic mining of such complex biological interactions.
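Identifying candidate genes within a QTL amounts to an interval-overlap query on genomic coordinates. The following minimal sketch assumes that QTLs and genes are both described by a chromosome and inclusive start and end positions; the `Region` type, the function name and all coordinates are hypothetical and do not reflect ArrayTrack's internal data model.

```python
from typing import NamedTuple


class Region(NamedTuple):
    name: str
    chrom: str
    start: int  # inclusive start position
    end: int    # inclusive end position


def candidate_genes(qtls, genes):
    """Return, for each QTL, the names of genes overlapping its interval."""
    candidates = {}
    for qtl in qtls:
        candidates[qtl.name] = [
            gene.name
            for gene in genes
            # Same chromosome, and the two intervals share at least one base.
            if gene.chrom == qtl.chrom
            and gene.start <= qtl.end
            and gene.end >= qtl.start
        ]
    return candidates


# Hypothetical coordinates for illustration only.
qtls = [Region("qtl_1", "chr1", 1000, 5000)]
genes = [
    Region("GENE_A", "chr1", 900, 1200),   # overlaps the QTL start
    Region("GENE_B", "chr1", 6000, 7000),  # same chromosome, outside the QTL
    Region("GENE_C", "chr2", 2000, 3000),  # different chromosome
    Region("GENE_D", "chr1", 4000, 4500),  # fully inside the QTL
]
print(candidate_genes(qtls, genes))  # {'qtl_1': ['GENE_A', 'GENE_D']}
```

The resulting gene list could then be intersected with SNP-to-gene annotations to explore the SNP-gene-QTL relationship described above.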
SNP and QTL libraries 17 have been constructed and incorporated into ArrayTrack. Data from several public repositories were collected in the SNP and QTL libraries and connected to other domain libraries (genes, proteins, metabolites and pathways) in ArrayTrack. Linking the data sets within ArrayTrack allows searching of SNP and QTL data, as well as their relationships to other biological molecules. The SNP library includes approximately 15 million human SNPs and their annotations, while the QTL library contains publicly available QTL data associated with specific phenotypes identified in mice, rats and humans. Case studies demonstrating the utility of these libraries have been reported. 17,18

Support for microbial pathogen microarray data
Food-borne pathogens are a leading cause of illness in the USA. High-throughput microarray technology provides an effective way to identify, characterise and obtain a nearly complete snapshot of the genetic traits of bacterial strains, such as their pathogenicity, virulence or antimicrobial resistance. Such genome-wide insight is necessary for accurate identification and discrimination of pathogens that may contaminate the food supply.
ArrayTrack has been extended to support microbial genomics research using microarrays. 19 ArrayTrack's libraries have been populated with bioinformatics data relating to bacterial pathogen species from the public domain. Data processing and visualisation tools have been enhanced with customised options to facilitate analysis of genetic profiling microarray data. Specifically, three new functions have been developed for these microarray data: flag-based hierarchical clustering analysis (HCA), a flag concordance (FC) heat map and flag indicators in the mixed scatter plot (where 'flag' refers to a gene's presence or absence call). These functions are particularly relevant and effective for the identification and characterisation of bacterial pathogens using microarray genetic profiling data. The enhancements are displayed in Figure 4. For example, the Microbial Library (Figure 4C) is the newest addition to ArrayTrack's collection of libraries. Currently, it holds 270,000 gene records from a total of 84 bacterial strains: 30 Escherichia coli, 39 Salmonella enterica, ten Shigella spp. and five Vibrio spp. Thus, as a starting point, the Microbial Library is focused on these four bacterial genera that are common food-borne pathogens. ArrayTrack also holds microbial pathway information from the Kyoto Encyclopedia of Genes and Genomes (KEGG) for over 50 of these strains 20 and gene ontology information for the E. coli K12 substrain MG1655. The gene annotations and sequences for the Microbial Library were downloaded from the National Center for Biotechnology Information (NCBI) website.
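The text does not define how flag concordance is computed. As one plausible reading, the sketch below assumes that the concordance between two strains is the fraction of shared genes with identical presence or absence calls, and builds the pairwise matrix that a heat map would display. The function names and the flag data are hypothetical, and ArrayTrack's actual calculation may differ.

```python
def flag_concordance(flags_a, flags_b):
    """Fraction of shared genes with the same presence/absence call.

    flags_a, flags_b: dicts mapping gene name -> True (present) / False (absent)
    """
    shared = flags_a.keys() & flags_b.keys()
    if not shared:
        raise ValueError("the two strains have no genes in common")
    matches = sum(flags_a[gene] == flags_b[gene] for gene in shared)
    return matches / len(shared)


def concordance_matrix(strain_flags):
    """Pairwise flag concordance for every pair of strains (nested dict)."""
    names = list(strain_flags)
    return {
        a: {b: flag_concordance(strain_flags[a], strain_flags[b]) for b in names}
        for a in names
    }


# Hypothetical presence/absence calls for four genes in three strains.
strains = {
    "strain_1": {"g1": True, "g2": True, "g3": False, "g4": False},
    "strain_2": {"g1": True, "g2": False, "g3": False, "g4": False},
    "strain_3": {"g1": False, "g2": False, "g3": True, "g4": True},
}
matrix = concordance_matrix(strains)
print(matrix["strain_1"]["strain_2"])  # 0.75
```

Under this assumption, one minus the concordance could also serve as a distance for flag-based hierarchical clustering, although the source does not specify the distance ArrayTrack uses.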

Summary
Recent enhancements to ArrayTrack have broadened its capability to support proteomics, metabolomics and SNP-related studies. As an integrated omics data analysis environment, ArrayTrack allows meta-analyses and integration of results from multiple studies with pathway and functional annotations. By providing powerful but easy-to-use utilities, ArrayTrack is positioned to assist in making integrated, contextualised analyses more common, which, in turn, will help to harness genetic knowledge to improve the protection of public health.

Disclaimer
The views presented in this paper do not necessarily reflect those of the US FDA.