Transcription start sites at the end of protein-coding genes

Previous studies demonstrated that massive induction of transcriptional readthrough generates downstream of gene-containing transcripts (DoGs) in cells under stress condition. Here, we analyzed TSS-seq (transcription start site sequencing) data from the DBTSS database. We investigated TSS tags at the end of gene for all pan-stress and untreated-cell DoGs, in comparison with expression-matched non-DoGs. We observed significantly more TSS tags at the end of pan-stress and untreated-cell DoG genes than non-DoG genes, even though their TSS tags in the promoter is the same. Importantly, the median value of TSS tags at gene end normalized to gene promoter is significantly higher than the median expression ratio of short DoG to host gene and of long DoG to host gene. Our results indicate that downstream overlapping long non-coding RNAs derived from the TSS at the gene end may be an important source of DoGs.


Background
Vilborg et al. analyzed nuclear transcriptome changes in SK-N-BE(2)C human neuroblastoma cells [1] and NIH3T3 mouse fibroblast cells [2] under heat shock, osmotic stress, and oxidative stress by using RNA-seq. They observed massive induction of transcriptional readthrough, or downstream of gene-containing transcripts (DoGs), under all stress conditions. Being long (often > 45 kb) and diverse (> 2000 species), DoGs may contribute significantly to the transcriptome.
Previously, we have demonstrated that the progesterone receptor (PGR) gene processes a very long 3′-UTR of approximately 10 kb and this length can be further extended in the monkey endometrium from the view of sequencing data [3]. However, we have found that this extension is not due to a readthough, but an independent transcription start site (TSS) at the end of PGR, resulting a sense long non-coding RNA (lncRNA) overlapping with PGR 3′-UTR. Thus, we questioned whether these DoGs observed by Vilborg et al. [1,2] are downstream overlapping lncRNAs instead of readthrough products from the promoter of protein-coding genes.
To answer this question, we performed a bioinformatic analysis of the public data. Our preliminary results challenge the readthough model proposed by Vilborg et al. [1,2].

Methods
The TSS-seq data performed on NIH3T3 cells were downloaded from the DataBase of Transcriptional Start Sites (DBTSS, https://dbtss.hgc.jp). The DNaseI data for NIH3T3 cells as well as Pol2, H3K4m1, and H3K4m3 for MEF (mouse embryo fibroblast) cells were derived from the ENCODE project (https://www.encodeproject.org). The UCSC Genome Browser (http://genome.ucsc.edu/) was used to display TSS-seq data and chromatin features for four representative DoGs: Hnrnpa2b1, Txn1, Hspa8, and Ifitm2. The genomic coordinates were based on mouse mm9 genome assembly.
In addition to the four representative DoGs, we extracted the genomic coordinates for all the DoGs described by Vilborg et al. [2]. The number of TSS tags at 1-kb region of a gene promoter and gene end were summarized according to TSS-seq data. Because DoGs and non-DoGs differ in size and gene expression levels, we constructed an equal size expression-matched subset for non-DoGs by randomly sampling using in-house PERL scripts. Difference between groups was tested by the nonparametric Mann-Whitney U test implemented in MATLAB (MathWorks, version 7.5).

Results and discussion
By combining oligo-capping with high throughput sequencing, the TSS-seq approach is able to collect genome-wide TSS information together with a quantitative analysis of the expression levels of transcripts [4]. We examined TSS-seq data performed on NIH3T3 cells from the DBTSS database [5]. For all four representative DoGs (Hnrnpa2b1, Txn1, Hspa8, and Ifitm2) [2], the number of TSS tags at the end of a gene is one order of magnitude lower than that at a promoter, except Hspa8 (Fig. 1). Hspa8 exhibits higher number of TSS tags at the gene end compared to the promoter, likely due to intronic snoRNAs. These TSSs may generate lncRNAs with an independent promoter at the gene end. We next investigated TSS tags at the end of a gene for all pan-stress and untreated-cell DoGs, in comparison with expression-matched non-DoGs. We observed significantly more TSS tags at the end of pan-stress and untreated-cell DoG genes than those of non-DoG genes, even though their TSS tags in the promoter is the same. Furthermore, we normalized the number of TSS tags at the gene end to the number of TSS tags at the promoter of the same gene. Significance was also reached for the normalized data (Table 1 and Additional file 1: Figure S1).
Additionally, the median value of TSS tags at gene end normalized to gene promoter is 0.1088, slightly higher than the median expression ratio of short DoG to host gene (0.0146) and of long DoG to host gene (0.0067). These results indicate that TSSs at a gene end may be an important source of DoGs.

Conclusion
Taken together, by analyzing TSS-seq data, we suggested that TSSs at the gene end may be an important major source of DoGs. Therefore, TSS-seq along with a large scale of Northern blot and tiling PCR experiments are required by Vilborg et al. [1,2] to support their idea that most DoGs are continuous transcripts caused by a readthrough of protein-coding genes.

Additional file
Additional file 1: Figure S1. Statistical analysis of TSSs at gene end (related to Table 1