USF1binding site prediction method
We focused on USF1 to develop a novel TFBS prediction method because of its genetic association with CAD and the availability of USF1 ChIP-chip results from the ENCODE regions. Common TFBS prediction methods based on DNA sequence alone generate large numbers of false-positive results. One strategy for improving specificity of TFBS prediction is to use phylogenetic footprinting, which is based on the assumption that regions of multi-species sequence conservation are more likely to include regulatory elements. We hypothesised that combining multiple genomic features with regions of sequence conservation could increase the accuracy of TFBS predictions. To test our hypothesis, we began by using PWM, the most common binding motif search method, to identify potential USF1-BSs. We then incorporated several genomic features related to TFBSs, including sequence conservation, regulatory potential, and the presence of CpG islands and DNaseI HS sites. Using a training dataset constructed from published USF1 ChIP-chip results [17], we were able to compare the sensitivity, specificity and AUC of prediction models trained with different sets offeatures, such as PWM score alone, single features, all features and the features generated by feature selection. Prediction models based on single genomic features performed poorly, with low sensitivity and AUC. A prediction model using four selected features (PhastCons8, RP5, DNaseI HS and PWM score) produced the highest sensitivity (55.9 per cent) and AUC (0.827) among the models tested, while still achieving high specificity (87.6 per cent) (Figure 2). These results show that the prediction model using selected features outperforms models based on a single feature and all features. That the performance of the model using all features is not better than others might be due to the noise introduced by the irrelevant and redundant features.
Kernel-based classifiers allow for the development of non-linear classifiers in cases where simple linear combinations of features are not sufficient accurately to distinguish between sample classes (see Additional file 4). Kernel logistic regression modelling maps the training data to high-dimensional space by considering all features jointly, and generates a non-linear decision boundary to separate two classes. The data within each class depend on a specific combination of the features learned from the training dataset. For example, a site with a relatively low PWM score may still be predicted as a USF1-BS if it has high DNaseI HS, conservation or regulatory potential scores. Conversely, if each feature contributes to the prediction method as an independent filter, the predicted USF1-BS will be based on a limited range of values of each feature. For example, if the prediction method only relies on a stringent PWM threshold to improve the specificity, it will be biased toward the binding site with high affinity, and only the target genes of USF1 with strong binding sites will be identified by this method. Comparisons of PWM scores with the scores of each genomic feature are included in Additional files 5--8.
To test whether our prediction method may be biased toward sites with higher binding affinity, as indicated by higher PWM scores, we examined the PWM scores from the 20 robust USF1 target genes at each step during the prediction process. As shown in Additional file 9, we find that: 1) the common PWM scoring method was sufficient to identify potential USF1-BSs for most of these genes; 2) the potential sites identified spanned a wide range of PWM scores. Further, the distributions of PWM scores among the initial 290,614 potential sites and the final 24,010 predicted USF1-BSs were not significantly different (see Additional file 9). These results suggest that our prediction method is not biased toward binding sites within any specific range of PWM scores, and if these scores do correlate with binding affinities, predictions are also not biased towards sites with high affinities.
For this study, we focused on five genomic features related to regulatory elements currently available on a genome scale. Backward stepwise feature selection during model building indicated that DNaseI HS was the most important predictor of USF1-BS among the features considered. We will consider other relevant genomic annotations, such as histone modifications, in future prediction method development. One important caveat is that the reliability and accuracy of these individual features will influence the performance of the prediction method. Feature selection during model building will become even more important when we integrate more genomic features in the future.
Each TF is unique in its binding site preference. Universal prediction methods may not perform well for all TFs, given inherent variation in binding domains, binding sequence preferences, homology level across species and family members. We believe, however, that our general model-building framework has the potential to be extended to other TFs for which there are available data detailing locations of a sufficient number of binding sites for use as a training dataset. As more results from genome-wide ChIP-chip studies become publicly available, it will become feasible to apply this prediction method to many other TFs.
Several aspects of this method can be improved in the future, such as using additional TFBS-related genomic features, evaluating other motif scanning methods, incorporating protein-DNA interactions, including binding site cluster information, using different gene annotations and exploring additional computational prediction models, such as support vector machines. We focused on a region 5 kb upstream of the TSSs of RefSeq mRNAs because the published USF1 ChIP-chip study indicated that most of the USF1 binding regions were found in proximal promoters [17]. USFl-BSs could occur beyond 5 kb upstream of TSSs, however, implying that a wider range of genomic regions could be considered in the future.
Training dataset from published USF1ChIP-chip results
A reliable training dataset is crucial for the development of an accurate and reliable prediction method. We chose published USF1 ChIP-chip results [17] as our training dataset because they represented the largest publicly available USF1 binding dataset; however, the exact location of USF1-BSs from these data is confounded by the common noise of ChIP-chip experiments and by a large average locus size on the ENCODE microarray -- approximately 1 kb. To circumvent these problems, we used the potential USF1-BSs identified by the PWM scoring method from the positive and negative loci to construct the training dataset. Each positive locus might include multiple potential USF1-BSs; however, it is unlikely that every potential USF1-BS from each positive locus interacts with USF1. Accordingly, we expect that our training dataset includes some false positives. To address this problem, we grouped all the potential USF1-BSs in the training dataset by their locus on the microarray and performed a LOLO cross-validation to evaluate the prediction models. The prediction scores of these potential USF1-BSs within each locus were used to predict that locus. If the locus has at least one predicted USF1-BS, it would be scored as a positive locus, otherwise it would be scored as a negative locus. This allows us to compare our prediction with USF1 ChIP-chip results directly. We believe that LOLO cross-validation retains the underlying biological correlations while avoiding over-fitting the prediction models.
The training dataset was derived from the ENCODE regions, which included promoter, intronic, exonic and intergenic regions. USF1-BSs in all these regions may have different properties to the genomic features within the 5 kb upstream region of TSSs from the RefSeq mRNAs. These differences may cause the model based on this training dataset to behave differently on the full ENCODE regions.
Predicted USF1binding sites and target genes
Scanning 5 kb upstream of the TSSs of 23,105 RefSeq mRNAs, we identified a list of potential USF1-BSs and target genes that can be used as candidates for studying susceptibility to CAD and other complex human diseases. Our genome-scale prediction includes 24,010 USF1-BSs and 5,801 candidate genes. The numbers of USF1 binding sites and target genes in the genome were expected to be large, since USF1 is widely expressed in many tissues and developmental stages [15]. Other large-scale in vivo experiments have found that many other TFs are associated with an unexpectedly large numbers of target genes. For example, an investigation of c-Myc, which belongs to the same bHLH family as USF1 and shares a similar core binding site, identified 756 c-Myc binding sites on chromosomes 21 and 22 [31]. Extrapolating this to the whole genome provides an estimate of 25,000 c-Myc binding sites. A genome-wide study of the transcription factor signal transducer and activator of transcription 1 (STAT1) binding sites using ChlP-sequencing technology also identified 41,582 and 11,004 putative binding regions in stimulated and unstimulated cells, respectively [32]. These data are similar in magnitude to our genome-scale estimate of USF1-BSs. ChIP-based studies can only identify genes that are targets under specific cellular or environmental conditions. It is important to note, however, that our in silico prediction method will identify potential USF1-BSs independent of cell type, stage or environment. Thus, the numbers of predicted USF1 binding sites and target genes might be higher using our method than using in vivo experiments. As more data become available, especially DNaseI HS identified from multiple cell lines, we will be able to evaluate the tissue specificity of the genomic features and our predictions. We acknowledge that DNA binding domains of the two members of the USF family (USF1 and USF2) are highly conserved across multiple species, and often USF1 and USF2 form heterodimers to bind DNA, suggesting that the two proteins may share target genes. Although we used experimentally defined USF1-BSs to construct the PWM, other bHLH family members also have the same core binding sequence, 5'-CACGTG-3'; therefore, our prediction results might include the binding sites of other bHLH family members.
Application to human disease study
The main goals for this study were to predict genome-scale binding sites of USF1 and to identify a novel group of CAD candidate genes regulated by USF1 that give us the opportunity to evaluate gene-gene interactions. In Additional file 3, we have provided the predicted USF1-BSs and their prediction scores. By identifying a large number of predicted USF1-BSs, our results allow for adjusting the stringency of the prediction score threshold to refine gene targets and also for choosing specific filters to emphasise a particular subset of interest. 'Genomic convergence', a strategy that integrates several independent and separate lines of experimental evidence to prioritise disease-associated candidate genes [33], is being used by our CAD study to combine the USF1-BS prediction results with other information related to CAD to identify candidate genes. For example, a previously published study of gene expression signatures from human aortas identified 229 genes to be differentially expressed in aortas with and without atherosclerosis and found these genes to be highly predictive of the condition [34]. By combining our in silico USF1-BS prediction method with this expression result, we identified 87 USF1 target genes that were differentially expressed between cases and controls in aorta (see Additional files 10 and 11). This approach highlights the potential for combining information from two distinct and methodologically diverse genome-scale investigations to define a list of important candidate genes from an unmanageably large list ofinitial targets.
SNPs are the most abundant molecular markers in the human genome. SNPs are commonly used for large-scale genetic association studies to identify genetic factors responsible for complex genetic diseases. Current high-throughput genotyping technologies enable researchers to genotype large numbers of SNPs efficiently. It remains a challenge to select SNPs with potential functional impact, however, especially from the large number of identified non-coding SNPs. One variant of particular interest are the SNPs within cis-regulatory elements, such as TFBS, because changing the TFBS sequence could alter the TF binding affinity within this region and further may influence the transcriptional regulation of the corresponding gene. These cis-regulatory variations are not necessarily deleterious. They might have subtle effects on gene expression and may contribute to the disease through interacting with other alleles and/or environmental factors, thereby playing important roles in the pathogenesis of many complex diseases in humans [35]. The bp resolution of our USF1-BS predictions enables us to isolate potential functional variations that may be used to select candidate variants for further testing for a functional impact and relation to disease. We have identified 751 SNPs within our predicted USF1-BSs in the human genome based on the genomic locations of the SNPs released by the NCBI in dbSNP build 126 (see Additional file 3). The experimental approaches that distinguish functional from neutral variations among these SNPs include, but are not limited to, well-designed case-control or family-based genetic association studies, allele-specific gene expression analysis and focused molecular biology studies. In summary, these SNPs within predicted USF1-BSs have the potential to influence the regulation of USF1 target genes; they enable identification of a specific USF1 regulatory network and, ultimately, study of the association of USF1 with complex disease in humans.