A short survey of computational analysis methods in analysing ChIP-seq data
© Henry Stewart Publications 2011
Received: 9 July 2010
Accepted: 9 July 2010
Published: 1 January 2011
Chromatin immunoprecipitation followed by massively parallel next-generation sequencing (ChIP-seq) is a valuable experimental strategy for assaying protein-DNA interaction over the whole genome. Many computational tools have been designed to find the peaks of the signals corresponding to protein binding sites. In this paper, three computational methods, ChIP-seq processing pipeline (spp), PeakSeq and CisGenome, used in ChIP-seq data analysis are reviewed. There is also a comparison of how they agree and disagree on finding peaks using the publically available Signal Transducers and Activators of Transcription protein 1 (STAT1) and RNA polymerase II (PolII) datasets with corresponding negative controls.
KeywordsCHIP-Seq analysis Next-generation sequencing comparative analysis bioinformatics
The regulation of gene expression is tightly controlled by transcription factors (TFs) that bind to specific DNA regulatory regions, histone modifications and positioned nucleosomes in the genome. High-throughput chromatin immunoprecipitation (ChIP) followed by massively parallel nextgeneration sequencing (ChIP-seq) represents a current approach in profiling genome-wide protein -DNA interactions, histone modifications and nucleosome positions. This new technology has marked advantages over microarray-based (ChIP-chip) approaches by offering higher specificity, sensitivity and coverage for locating TF occupancy or epigenetic markers across the genome. ChIP-seq experiments generate large amounts of data (in the order of tens of millions of reads), thus creating a bottleneck for data analysis and interpretation. Consequently, effective bioinformatics tools are needed to process, analyse and interpret these data in order to uncover underlying biological regulatory mechanisms.
Pre-processing. The goal of this step is to filter out erroneous and low-quality reads and to ensure that only the highest quality sequencing reads are retained for the sub-sequent mapping step;
Mapping. This is the key step in which mapped reads are converted to an integer count of 'tags' at each position in the genome with fixed read length. The choice of flexibility options on mapping multiple reads to multiple sites affects sensitivity and specificity dependent on the volume and complexity of target genome sequences. The user can increase specificity using unique reads only or can increase sensitivity allowing multiple alignments of reads;
Peak finding. This is the most challenging step in the analysis workflow, as the goal is to identify significant peak signals among background signals. This includes not only finding the strong signals, but also finding the statistically reproducible weak signals obtained from the modest read counts. To achieve this goal, statistical tests should be based on biologically meaningful background assumptions.
Types of peaks
Peaks in the ChIP-seq data can be classified into three groups: punctate signals (~100 base pairs [bp]); localised but broader signals (~ kilobase [kb]) and broad signals (~100 kb) . The predictive power of the existing tools depends on the type of data. A mixture of punctate and broader signals is a typical pattern of RNA polymerase II, which occupies transcription start sites and promoter-proximal pause sites in a punctate fashion, but the signals diffuse over the body of the transcribed genes [2, 3]. Most algorithms have been optimised to handle the punctate data but are not as good at detecting mixed binding patterns that require non-default parameter settings.
In this paper, three commonly used ChIP-seq computational tools are reviewed in detail, with special emphasis on their underlying peak-calling methods. Information is also provided on the following issues: (i) how ChIP-seq methods agree or disagree on different types of data; (ii) the benefits of using the combined methods by comparing these methods on two public datasets.
ChIP-seq processing pipeline (spp)
Spp  was developed as an analysis pipeline specifically designed to detect protein-binding positions with high accuracy by introducing methods to improve tag alignment and to correct for background signals. Spp implemented three peak-calling methods: (i) the window tag density (WTD), which is similar to XSET , is a method that extends positive- and negative-strand tags by the expected DNA fragment length in order to determine binding positions to those tags with the highest number of overlapping fragments, and scores positions based on the strand-specific tags; (ii) the matching strand peaks (MSP) approach (which determines local peaks of strand-specific tag density and identifies positions surrounded by positive- and negative-strand peaks); (iii) the mirror tag correlation (MTC) method (which scans the genome to identify positions exhibiting pronounced positive- and negative-strand tag patterns that mirror each other). All methods employ background subtraction of the control tag density to correct for the uneven background distribution. The p-value is calculated assuming Poisson density, and candidate binding sites were selected with p-values, < 10-5. Given the score s calculated by one of the above methods, the corresponding false discovery rate (FDR) can be estimated as the number of binding positions with the score s or higher found in the ChIP dataset, divided by that in the control set.
CisGenome  was developed specifically as a suite of tools for ChIP data analysis (both ChIP-chip and ChIP-seq data). For the peak calling method, it uses strand-specific tags to refine peak boundaries and filter out low-quality predictions, and uses a conditional binomial model for two-sample analysis (it uses a negative binomial for one-sample analysis) to identify peak regions. Windows passing a user-specified FDR cut-off are used to generate predicted binding regions. Detected windows that overlap with each other are merged into one region. (The minimal FDR among the overlapped windows is taken as the FDR of the region.) There are two post-processing options available in CisGenome: boundary refinement and single-strand filtering.
PeakSeq  was designed based on the observation that potential binding sites are strongly correlated with signal peaks in the control, probably revealing the features of open chromatin. As such, this method comprises a two-pass strategy to compensate for including control signals in the analysis. In brief, the first pass identifies regions or peaks in the ChIP-seq fragment density map which are substantially enriched compared with a simulated simple null background model. To construct the density map, it uses both predefined fragment length and extends tags. Once a fragment density map has been built, control tags are sampled with multiple simulations from the subdivided segments (~1 megabase [Mb]) in length considering mappability (eg the fraction of uniquely alignable bases in that segment) to generate the null background model. In the second pass, it filters out putative binding sites not significantly enriched compared with the normalised control by computing precise enrichments and significances. A peak is deemed statistically significant based on binomial distribution. The FDR is that estimated for these peaks following the Benjamini and Hochberg  approach. PeakSeq returns a ranked target list sorted by q-value and fold-enrichment values for each binding site.
Design of ChIP experiment
There are two strategies for ChIP-seq experimental design: one-sample and two-sample experiments. In one-sample analysis, only a ChIP sample is sequenced. In two-sample analysis, both a ChIP sample and a control sample are sequenced. One-sample design (without a control) is a costeffective strategy after careful post-processing, and some experiments have shown good agreement between one-sample and two-sample analyses . The uniform control model  does not hold due to the biases from non-specific fragments, such as random protein - DNA or antibody - DNA interactions, and the existence of sequencing  and mapping  biases or chromatin structure and genome copy number variations [9, 10]. Therefore, these intrinsic biases necessitate a control or two-sample analysis.
Peak finding procedure
Due to the intrinsic biases described above, most computational tools recommend the use of a control in the ChIP-seq analysis for identifying significant and reliable peaks. The pre-processing and mapping procedures are, in general, followed by the peak-calling steps: (i) create a profile; (ii) select candidate sites; (iii) calculate significance (p-value); and (iv) determine cut-off threshold (ie FDR).
Step1: Creating a signal profile
The ChIP profile is obtained by smoothing the tag counts with or without correcting tag-shifting effects (ie the difference in genomic distance between observed tags and the centre of the actual binding positions). This helps in intrapolating unobserved counts due to low mappability or low coverage, improving summit resolution (tag reshifting) and exclusion of outliers caused by artefacts. In general, a window of a fixed size develops the profile, sliding across the genome and replacing the tag count at each site with the summed value within the window centred at the site. Spp  and CisGenome  merge consecutive windows above a threshold value. The alternative is to use non-overlapping windows as in PeakSeq,  in which the peaked windows adjacent to each other can be aggregated. There are many modified versions. MACs  uses the sliding windows after shifting the tags. F-seq  uses kernel density estimation rather than the summed value. QuEST uses the kernel density approach for developing the strand-specific profile .
Step 2: Selecting candidate sites or calling peaks
Once generating profile each profile unit satisfies a criterion is considered candidate peaks. The criterion is an absolute ChIP signal or a relative enrichment to the background. The utility of this rough selection is twofold. The candidate peaks selected at this step are used to estimate a fragment size and a distance of tag shift. The regions not overlapping with the peaks are used for estimating negative control parameters.
Step 3: Calculating the significance of peaks
Different types of background models can be applied for each candidate peak. The natural choice of background model is a Poisson distribution, [5, 10] assuming a uniform effect of the negative control over the genome. Binomial distribution is an alternative model for utilising the non-uniform effects of the negative control after normalising the sampling ratio between the ChIP and the negative control sample in non-binding regions [6, 13].
Step 4: Determining cut-off threshold
Given the scores of the peaks, selecting a threshold value is a non-trivial problem. When p-values for the designated distribution are available, they can be used to calculate an FDR [5, 6, 14, 15]. Some tools do not provide p-values; these generally rank the peaks by the peak height or fold enrichment [4, 10, 12, 16, 17]. These tools instead calculate an empirical FDR by sampling the tags from the control and ChIP data. The FDR in this case is defined as the ratio of the number of peaks called in the control to the number of peaks called in the ChIP data.
The post-processing step considers tag-shifting effects and predicts the fragment sizes of the library. This consideration is important for prediction of the original binding positions. Spp pre-calculates the autocorrelation between positive- and negative-stranded tag counts to estimate a tag shift. CisGenome takes a two-step approach and corrects these effects in the second step. PeakSeq does not provide automatic correction of tag shifting but allows the user to define the fragment lengths. Some specialised tools have been developed to analyse broad ChIP peak types -- such as those associated with histone modifications -- utilising a hidden Markov model  and a clustering algorithm  to find significant patterns.
Preparation of ChIP-seq datasets
The publically available ChIP-seq datasets (after ELAND alignment) were used for human RNA polymerase II (PolII) and STAT1, each with matching input-DNA controls (http://www.gersteinlab.org/proj/PeakSeq/). Twenty-eight known human interferon-responsive STAT1 binding sites were obtained from the supplementary material of Robertson et al.  Data for the known binding sites are available through the ORegAnno database (http://www.oreganno.org) as dataset OREGDS00006.
Spp provides autocorrelation profiles that can be used to correct for tag shift and define a window size. Ninety-five bp and 60 bp tag shifts and 355 bp and 200 bp window sizes were observed for the STAT1 and PolII datasets, respectively. This information was used to set up the window sizes for PeakSeq and CisGenome with default settings. The MTC scoring scheme was used for the Spp procedure.
Comparative study of SPP, CisGenome and PeakSeq on STAT1 and PolII
The number of common peaks identified by the three ChIP-seq analytical tools for Stat1 experimental data
Predictions within 200 bp
Known Stat1 binding sites (stimulated)
The number of common peaks identified by the three ChIP-seq analytical tools for PolII experimental data
< 1 kbp
Discussion and conclusion
Three ChIP-seq computational tools were reviewed in this paper, focusing on their underlying peakcalling methods in particular. These tools were also compared and tested in the analysis of STAT1 and PolII ChIP-seq datasets. From this analysis, it appeared that most of the current ChIP-seq analysis tools were designed for identifying punctuate binding sites. As evident by showing good agreement on STAT1 data. By contrast, the predictions for PolII were localised within 1 kbp from the start sites of 70 per cent of RefSeq transcripts. The selected tools were not designed to characterise the trend of PolII occupancy, however, and consistently failed to describe biologically meaningful patterns or secondary peaks observed at the 3' ends in this experimental dataset.
The overall performance of a peak-calling algorithm was found to be highly dependent on the smoothing or profiling steps, either by increasing or decreasing the window size. For example, CisGenome with a window size of 100 bp predicted only 900 significant peaks in the STAT1 ChIP-seq experimental data. The tag shifting is another factor that affects performance in some analytical tools. Spp automatically determines this effect, which can be used to specify the fragment extension for PeakSeq and post-correction step in CisGenome.
The quality of ChIP experiments is highly dependent on the enrichment of TF-bound chromatin compared with background signals. For ChIP-seq analysis, the read number at each chromosomal position is related to the number of occupied sites in the genome, the range of signal intensity and the bias introduced by sequence pattern and chromatin structure. These parameters are not fully understood in advance and none of the existing ChIP-seq analysis tools can handle all of these possible situations. Integration of genomic contexts such as nucleosome occupancy,  GC-content  and multi-species conservation  will help to improve prediction performance. The alternative ChIP-chip studies are useful for comparing the ChIP-seq results and can help to tune the parameters of methods when it is difficult to find a gold standard test set for PolII.
In summary, ChIP-seq has become an indispensable tool for studying the transcriptional machinery and gene expression regulation on the genome-wide scale. Existing computational software can analyse highly sequence-specific ChIP-seq data with high accuracy. It is likely, however, that new computational methods and more user-friendly workflow will be developed to analyse more complex ChIP-seq data in the future.
We would like to thank Dr David L. Bentley for his constructive comments on the initial draft of this manuscript. H.K. is supported by NIH grant to GM063873 to D.L. Bentley.
- Pepke S, Wold B, Mortazavi A: Computation for ChIP- seq and RNA-seq studies. Nat Methods. 2009, 6 (11 Suppl): S22-S32.PubMed CentralView ArticlePubMedGoogle Scholar
- Rozowsky J, Euskirchen G, Auerbach RK, Zhang ZD, Gibson T, et al: PeakSeq enables systematic scoring of ChIP-seq experiments relative to controls. Nat Biotechnol. 2009, 27: 66-75. 10.1038/nbt.1518.PubMed CentralView ArticlePubMedGoogle Scholar
- Baugh LR, Demodena J, Sternberg PW: RNA Pol II accumulates at promoters of growth genes during developmental arrest. Science. 2009, 324: 92-94. 10.1126/science.1169628.View ArticlePubMedGoogle Scholar
- Kharchenko PV, Tolstorukov MY, Park PJ: Design and analysis of ChIP-seq experiments for DNA-binding proteins. Nat Biotechnol. 2008, 26: 1351-1359. 10.1038/nbt.1508.PubMed CentralView ArticlePubMedGoogle Scholar
- Robertson G, Hirst M, Bainbridge M, Bilenky M, Zhao Y, et al: Genome-wide profiles of STAT1 DNA association using chromatin immunoprecipitation and massively parallel sequencing. Nat Methods. 2007, 4: 651-657. 10.1038/nmeth1068.View ArticlePubMedGoogle Scholar
- Ji H, Jiang H, Ma W, Johnson DS, Myers RM, Wong WH: An integrated software system for analyzing ChIP-chip and ChIP-seq data. Nat Biotechnol. 2008, 26: 1293-1300. 10.1038/nbt.1505.PubMed CentralView ArticlePubMedGoogle Scholar
- Benjamini Y, Hochberg Y: Controlling the false discovery rate: A practical and powerful approach to multiple testing. J R Stat Soc Ser B. 1995, 57: 289-300.Google Scholar
- Dohm JC, Lottaz C, Borodina T, Himmelbauer H: Substantial biases in ultra-short read data sets from high-throughput DNA sequencing. Nucleic Acids Res. 2008, 36: e105-10.1093/nar/gkn425.PubMed CentralView ArticlePubMedGoogle Scholar
- Vega VB, Cheung E, Palanisamy N, Sung WK: Inherent signals in sequencing-based Chromatin-ImmunoPrecipitation control libraries. PLoS One. 2009, 4: e5241-10.1371/journal.pone.0005241.PubMed CentralView ArticlePubMedGoogle Scholar
- Zhang Y, Liu T, Meyer CA, Eeckhoute J, Johnson DS, et al: Model-based analysis of ChIP-Seq (MACS). Genome Biol. 2008, 9: R137-10.1186/gb-2008-9-9-r137.PubMed CentralView ArticlePubMedGoogle Scholar
- Boyle AP, Guinney J, Crawford GE, Furey TS: F-Seq: A feature density estimator for high-throughput sequence tags. Bioinformatics. 2008, 24: 2537-2538. 10.1093/bioinformatics/btn480.PubMed CentralView ArticlePubMedGoogle Scholar
- Valouev A, Johnson DS, Sundquist A, Medina C, Anton E, et al: Genome-wide analysis of transcription factor binding sites based on ChIP-Seq data. Nat Methods. 2008, 5: 829-834. 10.1038/nmeth.1246.PubMed CentralView ArticlePubMedGoogle Scholar
- Xu H, Handoko L, Wei X, Ye C, Sheng J, et al: A signal-noise model for significance analysis of ChIP-seq with negative control. Bioinformatics. 2010, 26: 1199-1204. 10.1093/bioinformatics/btq128.View ArticlePubMedGoogle Scholar
- Jothi R, Cuddapah S, Barski A, Cui K, Zhao K: Genome-wide identification of in vivo protein-DNA binding sites from ChIP-Seq data. Nucleic Acids Res. 2008, 36: 5221-5231. 10.1093/nar/gkn488.PubMed CentralView ArticlePubMedGoogle Scholar
- Zang C, Schones DE, Zeng C, Cui K, Zhao K, et al: A clustering approach for identification of enriched domains from histone modification ChIP-Seq data. Bioinformatics. 2009, 25: 1952-1958. 10.1093/bioinformatics/btp340.PubMed CentralView ArticlePubMedGoogle Scholar
- Johnson DS, Mortazavi A, Myers RM, Wold B: Genome-wide mapping of in vivo protein-DNA interactions. Science. 2007, 316: 1497-1502. 10.1126/science.1141319.View ArticlePubMedGoogle Scholar
- Nix DA, Courdy SJ, Boucher KM: Empirical methods for controlling false positives and estimating confidence in ChIP-Seq peaks. BMC Bioinformatics. 2008, 9: 523-10.1186/1471-2105-9-523.PubMed CentralView ArticlePubMedGoogle Scholar
- Xu H, Wei CL, Lin F, Sung WK: An HMM approach to genome-wide identification of differential histone modification sites from ChIP-seq data. Bioinformatics. 2008, 24: 2344-2349. 10.1093/bioinformatics/btn402.View ArticlePubMedGoogle Scholar
- Hon G, Ren B, Wang W: ChromaSig: A probabilistic approach to finding common chromatin signatures in the human genome. PLoS Comput Biol. 2008, 4: e1000201-10.1371/journal.pcbi.1000201.PubMed CentralView ArticlePubMedGoogle Scholar
- Lee CK, Shibata Y, Rao B, Strahl BD, Lieb JD: Evidence for nucleosome depletion at active regulatory regions genomewide. Nat Genet. 2004, 36: 900-905. 10.1038/ng1400.View ArticlePubMedGoogle Scholar