In general, exon array analysis workflow is broken down into four major steps: data processing and normalisation; genomic annotation and data filtering; AS detection; and AS visualisation. R and Bioconductor statistical packages will be used for each step described below. We will provide sample code when appropriate.
Data processing and normalisation
Due to its popularity, summarisation and normalisation methods for oligonucleotide array have been thoroughly researched [8]. Among the most popular methods are robust multichip average (RMA); probe logarithmic intensity error (PLIER); and GC robust multichip average (GCRMA) [9–11]. The Bioconductor packages 'oligo' and 'affy' are the foundation for importing and processing oligonucleotide microarrays such as the exon array used in this review. The successful processing and normalisation of exon array chips rely on the Platform Design (pd) file. Such files contain the mapping information that assigns the probes to their corresponding probesets. A collection of available pd files can be downloaded from the 'Download' and 'Metadata' sections of the Bioconductor website.
Here, the procedures for loading the packages and importing the dataset are illustrated. The 'read.cellfiles' function requires users to specify the pd file for the chip types using the 'pkgname' argument.
> library('oligo')
> library('pd.huex.1.0.st.v2')
> celFiles = list.celfiles ('/path/to/my/celfiles', full.name = TRUE)
> raw = read.cellfiles (celFiles, pkgname = 'pd.huex. 1.0.st.v2')
Once the dataset is imported into the R session as an 'ExpressionSet' object, it can be normalised using multiple approaches. Here, we illustrate how the dataset can be normalised using three different methods: RMA, GCRMA and PLIER, respectively.
R> eset = rma(raw)
R> eset = gcrma(raw)
R> eset = justPlier(raw)
Recently, Gaidatzis et al.[12] discovered a systematic relationship between gene expression and AS. The authors showed that detection of AS increased significantly as the differential expression of the respective gene increases. This caused an overestimation of AS for the dataset. The authors developed an algorithm, COrrected Splicing Indices for Exon array (COSIE), to correct for this effect. The method was based on a positional dinucleotide modelling approach. The COSIE package can be downloaded from the authors' website http://www.fmi.ch/groups/gbioinfo. The function from the COSIE package takes the 'ExpressionSet' object as its input and corrected expression values as its output. Here is an example of COSIE codes:
R> source('cosie.R')
R> cosieOut = cosie(eset, '/path/to/cosie/output/file.txt')
Genomic annotation and data filtering
The exon array was designed to measure signals from exonic, intronic and intergenic regions. This ability increases the likelihood of discovering novel splice forms and regulatory elements. Keeping track of these genomic regions can be a daunting task, however, and entails the need for a database management system (DBMS). Okoniewski et al.[13] have built a database, Xmap, to capture these gene region entities. From now on, the different representations of a gene (exon, intron, intergenic, etc) will be referred to as gene entities. The database uses the Ensembl ID system to map each probeset to its corresponding gene, transcript, exon and other Ensembl representation http://www.ensembl.org. For example, the exon array ID, 3564255, is mapped to the gene ID ENSG00000100505, and three transcript IDs, ENST0000029355, ENST00000338969 and ENST00000360392. A browser interface was developed to access the database http://xmap.picr.man.ac.uk. In addition to the website, an R package, 'exonmap', was developed to provide direct access to the database using the R command line interface. The package provides three types of programmatic routine: translational, annotation and filtering.
The translational routine provides functionalities that perform conversions between different gene entities, such as probe, probeset, exon, transcript, gene, symbol and sequence. These functions assume the form X.to.Y, where X and Y represent different combinations of the entities. For example, the function 'probeset.to.gene' converts a probeset ID to the gene ID it represents. Please refer to the function's help file for the full listing of all the X.to.Y functions. Except for the probeset IDs, all other entities are represented by the Ensembl ID system. Because of this integration, it is relatively easy to extract extended annotation using the Ensembl database. One downside of using 'exonmap', however, is the requirement for installing a local copy of the Ensembl database onto your computer, which can take up a substantial amount of disk space.
The annotation routine is used to attach biological meanings to the different gene entities. At the time of writing, users can perform annotation at three levels: probeset, exon and gene. For example, the function 'probeset.details' takes a probeset ID as an argument and returns a predefined set of annotations for that probeset. This annotation routine returns predefined annotation information for the entities and cannot easily be extended. To perform a more customised annotation, Bioconductor offers the 'biomaRt' package, which provides a direct interface to the Ensemble database. The 'getBM' function from 'biomaRt' takes four input parameters: 'Attributes', 'Filters', 'Values' and 'Mart'. The 'Attributes' are the list of desired annotations to be returned. The 'Values' are the actual input IDs designated by the 'Filters' argument. Finally, 'Mart' indicates the database selected at the time of query. Here is a sample code for annotating the exon array ID, 3564255, using the 'genBM' function. First, we need to select the database, as well as the species, for this analysis. In this case, we have selected to use the Ensemble database for the human species. Then, we use 'getBM' to annotate the 'Affymetrix exon array' ID with two Ensembl IDs.
R> library(biomaRt)
R> ensemble = useMart('ensembl', dataset = 'hsapiens_gene_ensembl')
R> annotation = getBM(Attribute = c('ensembl_gene_id', 'ensembl_transcript_id'),
Filters = 'affy_huex_1_0_st_v2',
Values = '3564255',
Mart = ensemble)
There is also a web version of the Biomart database http://www.biomart.org, where users can perform point and click database searches. The website is a great companion to the 'biomaRt' package. Users can perform a prototype search before finalising the customised profiles with the 'getBM' function. The precise naming scheme for the 'Attributes' and 'Filters' arguments is mandatory, and can be listed using the 'listAttributes' and 'listFilters' functions.
Finally, the Filtering routine from 'exonmap' serves two purposes: verification and filtering. The verification utility takes the form is.X and is used to verify the identity of a probeset as either exonic, intronic, intergenic or multitarget. This is especially useful for identifying multi-targeted probesets using the 'is.multitarget' function. The removal of these entities will reduce the chances of selecting a false-positive AS. Lastly, the two filtering functions are 'select.probewise' and 'exclude.probewise', which will either select or remove probesets that belong to the selected entity from further consideration.
Alternative splicing detection
The simplest way to detect AS is to use the splicing index (SI) method [14]. The SI is calculated using the log-ratio of the normalised exon level from the comparison group, where the normalised exon level can be calculated by dividing the exon signal by its estimated gene level. One can think of SI as the equivalent of fold change in gene expression microarray analysis. The major pitfall of fold change is that no statistical confidence is computed. Therefore, better approaches, such as ANalysis Of Splice VAriation (ANOSVA), were developed to apply statistical stringency to changes in SI [15]. ANOSVA was based on modelling the interaction effect between the exon and the gene, where, under the null hypothesis, all interaction terms in an exon and a gene in the two-way ANOVA are not significant, and therefore no exon or probesets stand out. By contrast, under the alternative hypothesis, the interaction term is significant, and therefore characteristic of AS. This basic model was further improved by other authors [16–18]. The 'exonmap' package offers two functions for detecting AS: 'si' and 'splanova', which perform the SI and ANOVA modelling analyses, respectively. Please refer to the 'exonmap' vignette for details of their use.
Since exon array analysis deals with a massive amount of data and the current Bioconductor modules require them to be loaded into memory at once, a regular 32-bit computer might be insufficient for this purpose. Instead, users are encouraged to use a 64-bit computer system with sufficient RAM memory, preferably 8 megabytes (MB) or higher. Alternatively, Bengtsson et al.[19] have developed the 'aroma.affymetrix' package which has solved this memory problem by using persistent memory technology. These authors claim that the analysis can be done with just 1 MB of RAM, and can process up to 1,000 arrays. Using the 'aroma.affymetrix' framework, Purdom et al.[18] have implemented an AS detection algorithm, FIRMA (Finding Isoforms using Robust Multichip Analysis). In this algorithm, the authors modified the popular RMA normalisation method [9] by building an additive model to study the behaviour of an exon relative to its gene expression. Similar to ANOSVA, any deviation from the fitted behaviour would imply AS. The tutorial materials for the 'aroma.affymetrix' package can be found in a Google user group http://www.braju.com/R/aroma.affymetrix.
AS visualisation
Visualisation is important in AS detection. First, it shows the distribution ratio of probesets per exon. The number of probesets used to represent an exon reflects the exon's size and complexity. Secondly, a visual inspection of the exon's expression value ensures the accuracy of the AS prediction algorithm, and therefore reduces false positives. Thirdly, information without content is meaningless. Therefore, attaching genomic knowledge to the visualisation process allows further evaluation of the AS candidates.
The most comprehensive visualisation tool for the Affymetrix exon array is the Xmap Genome Browser http://xmap.picr.man.ac.uk. The website uses Google's map navigation technology to provide a smooth scanning of gene regions. The back-end Ensembl database support enables the display of diverse gene information, such as the multiple transcripts that represent different splice forms. To add expression information onto the display, Bioconductor offers the 'XMapBridge' package. The package enables a seamless connection between R and the Xmap Genome Browser via Java technology.