Alternative transcription start sites contribute to acute-stress-induced transcriptome response in human skeletal muscle

More than half of human protein-coding genes have an alternative transcription start site (TSS). We aimed to investigate the contribution of alternative TSSs to the acute-stress-induced transcriptome response in human tissue (skeletal muscle) using the cap analysis of gene expression approach. TSSs were examined at baseline and during recovery after acute stress (a cycling exercise). We identified 44,680 CAGE TSS clusters (including 3764 first defined) belonging to 12,268 genes and annotated for the first time 290 TSSs belonging to 163 genes. The transcriptome dynamically changes during the first hours after acute stress; the change in the expression of 10% of genes was associated with the activation of alternative TSSs, indicating differential TSSs usage. The majority of the alternative TSSs do not increase proteome complexity suggesting that the function of thousands of alternative TSSs is associated with the fine regulation of mRNA isoform expression from a gene due to the transcription factor-specific activation of various alternative TSSs. We identified individual muscle promoter regions for each TSS using muscle open chromatin data (ATAC-seq and DNase-seq). Then, using the positional weight matrix approach we predicted time course activation of “classic” transcription factors involved in response of skeletal muscle to contractile activity, as well as diversity of less/un-investigated factors. Transcriptome response induced by acute stress related to activation of the alternative TSSs indicates that differential TSSs usage is an essential mechanism of fine regulation of gene response to stress stimulus. A comprehensive resource of accurate TSSs and individual promoter regions for each TSS in muscle was created. This resource together with the positional weight matrix approach can be used to accurate prediction of TFs in any gene(s) of interest involved in the response to various stimuli, interventions or pathological conditions in human skeletal muscle.

transcription start sites (TSSs), and polyadenylation sites. Analysis of RNA sequencing data from mouse nervous tissue at different stages of development [1] and in different human tissues (Genotype-Tissue Expression [GTEx] project database) [2] showed that the alternative TSSs and polyadenylation sites make a key contribution to the expression of alternative mRNA isoforms.
More than half of human protein-coding genes have an alternative TSS [3,4]. The cap analysis of gene expression (CAGE) approach revealed that in tissues of humans [3,5] and other organisms [6], as well as during the differentiation of various cells [7,8] and during embryogenesis in Drosophila, zebrafish, and chicken [6,9,10], regulation of specific sets of genes is controlled by the appearance/ activation of alternative TSSs (differential TSSs usage) within the same or distant promoter region (or nucleosome-depleted region [NDR]). The presence of several TSSs with specific regulatory motifs in a gene plays a key role in tissue-or context-specific regulation of gene expression through various transcription factors (TFs), which might be activated via different signaling events.
It is logical to assume that the change in the transcriptome induced by acute (and transient) stress exposure can also be impacted by differential TSSs usage. Recent studies using the CAGE method have shown the important role of differential TSSs usage in regulating the transcriptome response to acute chemical and physical stress in yeast [11,12]. To our knowledge, there are no studies in the literature that examine the effects of acute stress on differential TSSs usage in human tissues. Here, we aimed to investigate the contribution of alternative TSSs to the acute-stress-induced transcriptome response in human tissue using the CAGE approach, identifying exact TSS position. Human skeletal muscle is a good model for this task because (i) skeletal muscle is suitable for repeated sampling of muscle tissue with needle biopsies and (ii) skeletal muscle can be easily stressed by exercise (e.g., intense and prolonged [several tens of minutes] aerobic exercise on a bicycle ergometer). During such exercise, mechanical stress acts on muscle cells, intramuscular temperature increases, intramuscular metabolite content changes, and pH and intramuscular glycogen stores decrease. Together, these and other factors activate numerous signaling pathways and markedly change the transcriptome profile (including several hundred genes) of the muscle for several hours during recovery after exercise [13,14].
The second aim was to create a comprehensive resource of accurate TSSs and individual promoter regions for each TSS in muscle using both ATAC-seq and DNase-seq data-a strong marker of NDRs. This resource together with the positional weight matrix approach was used to identify TFs involved in responses to contractile activity as well as will help to identify TFs involved in responses to other stimuli in human skeletal muscle.

Results
Ten males exercised intermittently (60 min) on a cycle ergometer (Additional file1: Fig. S1). The pulmonary O 2 consumption rate (VO 2 ) and blood lactate concentration immediately after the high-intensity bouts of intermittent exercise were maintained above 80% of maximal VO 2 and 5 mM, respectively (Additional file1: Fig. S1) (i.e., the relative intensity of exercise was high, and the exercise induced substantial metabolic stress without progressive metabolite accumulation). Biopsy samples from the m. vastus lateralis were taken prior to and at 2 min, 1 h, 3 h, and 6 h after exercise for CAGE (Additional file1: Fig. S1).

Annotation of CAGE TSSs
In total, we identified 44,680 CAGE-tag-defined TSS clusters (CAGE TSS clusters) (Additional file2: Table S1). In our study, the CAGE TSS clusters showed the classical distribution into "sharp" and "broad" classes of promoters (Fig. 1A), as described previously [3]. To annotate CAGE TSS clusters to genes, we used both the Ensembl and RefSeq annotations. Most of our CAGE TSS clusters fall between ± 50 bp from the annotated TSSs (Fig. 1B). Hence, we used that interval to annotate CAGE TSS clusters as previously annotated TSSs; the annotation priority for other CAGE TSS clusters is shown in Fig. 1C. In total, we annotated 41,951 CAGE TSS clusters to the previously annotated TSSs, exons, and non-coding regions (Additional file2: Table S1). It is noteworthy that elimination of low-abundance CAGE TSS clusters (10% cutoff; see Materials and Methods) substantially reduced (by 30%) the number of annotated CAGE TSS clusters, particularly the number of CAGE TSS clusters annotated to the coding sequence (CDS; by 94%) and 3′ untranslated region (UTR; by 91%) (Fig. 1D; Additional file2: Table S1). These eliminated TSSs belonged predominantly to muscle-specific genes with high expression and/or many exons (e.g., titin, nebulin, and myosin heavy chain 1, 2, and 7) and are probably related to biological and/or technical noise.
The remaining CAGE TSS clusters were compared with the RNA sequencing data from our previous study investigating the effect of aerobic exercise training in human skeletal muscle [15]; verification by coverage and exon-exon junctions for the first exon allowed us to annotate for the first time 290 CAGE TSS clusters belonging to 163 genes (Additional file2: Table S1; Fig. 1D, G). Among them were several well-known protein-coding genes, including nitric oxide synthase 1 (NOS1), calcium/calmodulin-dependent protein kinase II alpha (CAMK2A), E1A binding protein P300 (EP300), ribosomal protein S6 kinase A2 (RPS6KA2), homeodomain interacting protein kinase 2 (HIPK2), angiomotin (AMOT), and homeobox A11 (HOXA11), as well as some pseudogenes and long non-coding RNAs. The remaining 2911 CAGE TSS clusters were annotated to introns and intergenic regions (Fig. 1D).
The mean expression level of the CAGE TSS clusters annotated to these locations was very low. In contrast, the CAGE TSS clusters verified by RNA sequencing (n = 290) show high mean expression level, similar to the CAGE TSS clusters annotated to TSSs and 5'UTRs ( Fig. 1E). This finding indirectly confirms the biological  Fig. 1 Annotation of cap analysis of gene expression (CAGE) transcription start site (TSS) clusters. A. CAGE TSS clusters showed the classical distribution to "sharp" and "broad" classes. B. Distribution of distance from CAGE TSS clusters to TSSs annotated in Ensembl and RefSeq (most of our CAGE TSS clusters fall between ± 50 bp from annotated TSSs). C. Location of CAGE TSS clusters (ordered by priority) versus gene annotation. Additionally, putative alternative starts were verified using the coverage and exon-exon junction (RNA sequencing data) and annotated to corresponding genes. D. Number of CAGE TSS clusters annotated to different locations before (left panel) and after (right panel) elimination of low-abundance CAGE TSS clusters. E. Expression (median and interquartile range) of CAGE TSS clusters annotated to different locations. F. Overlap of CAGE TSS clusters from the FANTOM5 and refTSS projects with those in our study and 3764 first defined (probably muscle-specific) CAGE TSS clusters. G. Analysis of the coverage and exon-exon junctions for the first exon (RNA sequencing data; example for the NOS1 gene) allows the annotation, for the first time, of 290 CAGE TSS clusters belonging to 163 genes. CDS, coding sequence significance of the CAGE TSS clusters verified by RNA sequencing in our study.
The CAGE TSS clusters defined in our study overlapped to a large extent with those defined in various human tissues and cells in the FANTOM5 and refTSS projects, which analyzed a limited number of human skeletal muscle samples. This partially explains why we found 3764 new (probably muscle-specific) CAGE TSS clusters (Fig. 1F). However, half of these CAGE TSS clusters were annotated to introns, intergenic regions, and genes verified by RNA sequencing data (Fig. 1F). As mentioned above, only a small fraction of these CAGE TSS clusters showed a high expression level (mainly the CAGE TSS clusters verified by RNA sequencing) (Fig. 1F).

Alternative TSSs contribute significantly to stress-induced transcriptome response
Acute exercise changed (mainly upregulated) expression of several hundred genes with a relatively small degree of overlap at each time point (in total 1411 differentially expressed genes [DEGs]) ( Fig. 2A, Additional file3: Table S2). Principal component analysis showed that gene responses to exercise at different time points fell into different clusters, confirming the consistency of gene responses in different volunteers (Fig. 2B). Those findings suggest that the transcriptome dynamically changes during the first hours after acute stress. In agreement with previous studies [13,15], the most highly enriched biological process for upregulated genes in human skeletal muscle was regulation of transcription (Additional file1: Fig. S2, Additional file3: Table S2). This is in line with findings in human and mouse cells showing domination of mRNA TFs in the earliest responses to various stimuli [16].
Two-thirds of DEGs have two or more TSSs (alternative CAGE TSS clusters) ( Fig. 2C and D), meaning that stress-induced gene expression is related to the regulation of alternative starts. Interestingly, the change in the expression of 146 (~ 10%) of DEGs was associated with the activation of alternative TSSs, indicating differential TSSs usage (Fig. 2C, D, Additional file3: Table S2). Moreover, in another set of 111 genes, differential TSSs usage occurred without altering overall gene expression (the sum of all TSSs related to a gene) and was associated with multidirectional changes in the expression of various TSSs in a gene (Fig. 2C, D, Additional file3: Table S2).
In line with findings in various human tissues [3,4], we found that 7591 of 12,268 expressed genes have more than one CAGE TSS clusters (Fig. 2E), suggesting that skeletal muscle has a high potential for generating alternative mRNA isoforms. Moreover, 948 of 7591 genes have at least one removed alternative promoter (> 200 bp beyond the promoter region of the canonical TSS-most highly expressed at baseline; see below) of which 89 demonstrate differential TSS usage (Additional file4: Table S3). Importantly, these genes mainly involved in regulation of transcription (Gene Ontology analysis; Additional file3: Table S2), indicating the important contribution of the alternative TSSs belonging to the removed alternative promoters in response to exercise-induced stress. If the alternative TSS is located up-or downstream of the 5′ UTR of the canonical TSS, then this can lead to the appearance of a new first exon(s) and another amino acid residue at the N-terminus (Fig. 2E), which may change the function of the protein.
Given the diversity of exon-intron structures of already known mRNA isoforms, predicting all possible mRNA isoforms based on CAGE TSS clusters is a difficult task. Therefore, using data on known mRNA isoforms with defined start codons (Ensembl) and on our CAGE TSS clusters (see Materials and Methods), we found that 197 genes have alternative start codons associated with annotated alternative protein isoforms (Fig. 2E, Additional file4: Table S3). This list can be increased by data from CAGE TSS clusters, which we annotated in our study for the first time ( Fig. 1D, G, Additional file2: Table S1). If we assume that most of the alternative mRNA isoforms encoding alternative protein isoforms are already known, then the presence of several TSSs in each of ~ 7000 genes means that the main function of alternative starts is associated not only (to a limited extent) with the generation of alternative protein isoforms but also with the fine regulation of expression of the mRNA isoform from a gene due to the activation of various alternative starts by TFs specific to them (e.g., when the alternative start is located in the 5′ UTR) (Fig. 2E).

Localization of the individual promoter regions surrounding CAGE TSSs in muscle
The positional weight matrix (PWM) method is a classic approach most frequently used for prediction of TFBSs and corresponding TFs responsible for stress-induced DEGs. Estimation of the exact location of each TSS for each gene and the expression level of each TSS is crucial for correct prediction of TFs. Usually, the size of the region around the TSSs in which TFBSs are sought (i.e., a conditional promoter region) is chosen empirically from − 1500-+ 500 bp to − 300-+ 100 bp (i.e., so-called standard promoter regions). However, the size of promoter regions determined by open chromatin differs significantly for different genes [17,18]. Therefore, the use of an individual promoter region for each TSS is necessary for the correct prediction of TFs. Because the position of open chromatin can differ between different cells [19,20] (Fig. 3A). We found an open chromatin probability distribution markedly shifted upstream of the CAGE TSSs (Fig. 3B), as shown previously in yeast [17] and mice macrophages [18].
In addition to open chromatin, a promoter region is characterized by a high density of TFBSs for various TFs [17,18,21,22], which means a high potential for DNA in this region to bind with various TFs. Using data from 15,982 chromatin immunoprecipitation (ChIP)-seq experiments with human cells and tissues (the GTRD database [23]), we determined the density of TFBSs on the DNA around each CAGE TSS (− 2000 to + 2000 bp). Figure 3B shows that the density of TFBSs is similar to the open chromatin probability distribution. Importantly, only a small fraction of the CAGE TSS clusters showed no open chromatin near a CAGE TSS clusters (> 200 bp beyond a CAGE TSS clusters). As expected, these TSSs demonstrate substantially lower expression and TFBS density than those with open chromatin (Fig. 3C).
Because the CAGE TSSs for gene(s) may be located near each other on the same or opposite strand (bidirectional promoters), such CAGE TSSs fall to one promoter region (Fig. 3A). These bidirectional promoters constitute 12% of all promoters and show similar density of TFBSs compared with other promoters with open chromatin (Fig. 3C). As a result of the overlap analysis, we distinguished the coordinates for 11,830 promoter regions in muscle (belonging to 90% of the genes defined in our study) (Fig. 3D, Additional file4: Table S3). No difference in the total length distribution was found between bidirectional and other promoters (Fig. 3D). Then, we identified differentially regulated (exercise-induced) promoter regions in muscle; these promoters demonstrate excellent coincidence with exercise-induced DEGs identified in the study by another bioinformatics approach (Additional file4: Table S3, Fig. 2A).
To examine the ability of the individual promoter regions to better predict TFs than the standard promoter regions, we identified TFs associated with exerciseinduced (differentially regulated) promoter for several standard and individual promoter regions by the PWM method (using the TRANSFAC Database v.2020.3 [24] containing matrices for 1357 of 1639 known TFs [25]). The number of predicted TFs with strong adjusted fold enrichment values > 1.5 were then compared. Figure 3E shows that if individual muscle promoter regions are used, the number of predicted TFs is substantially greater than that of standard promoter regions.

TFs associated with differentially regulated individual promoter regions in muscle
Promoters regulated by a set of TFs should demonstrate similar dynamics of gene expression. Using an unsupervised analysis (the Chinese restaurant process), we identified 21 clusters with co-expressed (presumably coregulated) differentially regulated individual promoter regions. Then, using the PWM method, TFBSs enriched in the individual promoter regions were predicted in each cluster (Fig. 4, Additional file5: Table S4). Figure 4 shows the average ranked expression of the individual promoter regions and the most enriched TFs for each cluster. Importantly, our approach allows us to identify time points with greater activity of the predicted TFs, meaning time points where the number of exercise-regulated (differentially expressed) individual promoter regions is close to the cluster size (numerator and denominator opposite each point, respectively, in Fig. 4).
The findings are in good agreement with the data in the literature on exercise-induced activation of TFs regulating angiogenesis, mitochondrial biogenesis, and carbohydrate and fat metabolism in skeletal muscle [26]. Namely, nuclear receptors subfamily 4A (cluster 3), as well as estrogen receptors/estrogen-related receptors, peroxisome proliferator-activated receptor gamma and other nuclear receptors (clusters 11 and 13) were found to be associated with increased expression of the individual promoter regions at 3 h and 6 h after exercise, respectively.
On the other hand, our approach allows us to search for TFs (and their potential target genes) with roles in the regulation of stress-induced gene expression in human

Discussion
In our paper, in contrast to previous studies investigating the alternative TSS usage under near-physiological conditions (in vitro), we examined TSSs in human skeletal muscle at baseline and after acute stress (aerobic exercise). In skeletal muscle, acute exercise (as well other stress stimuli in various tissues or cells) dynamically changes the transcriptome ( Fig. 2A, B). We showed, for the first time, that the alternative TSS usage contributes to acute-stress-induced transcriptome response in vivo (Fig. 2C, Additional file3: Table S2). This finding is in line with the response to acute chemical and physical stress obtained in yeast [11,12], meaning that the alternative TSS usage is a conserved mechanism regulating responses to acute stress. Interestingly, the alternative TSS usage is related to many (77) genes with removed (> 200 b.p.) alternative promoter/s, which highly likely have specific regulatory elements. Those genes include transcription regulators playing the key role in adaptation of skeletal muscle to exercise (PPARGC1A, ESRRG , NR4A1, NR4A2, RARA , HDAC5, etc. [26]; Additional file3: Table S2), indicating the important contribution of alternative TSSs in exercise-induced transcriptome response.
In various human tissues, two-thirds of genes have two or more TSSs ( [3,4] and Fig. 2E), indicating great potential for the expression of alternative protein isoforms. We confirmed about two hundred such examples in human skeletal muscle (Additional file4: Table S3), but the majority of the alternative TSSs do not increase proteome complexity (Fig. 2E). This is consistent with data in other human tissues [2], and with the fact that, in yeast in response to acute stress, alternative starts that change the amino acid sequence are expressed at low levels [12]. Together these findings suggest that the function of alternative TSSs is associated not only (to a limited extent) with the generation of alternative protein isoforms but also mainly with the fine regulation of mRNA isoform expression from a gene due to the TF-specific activation of various alternative TSSs (Fig. 2E). On the other hand, the change of 5'UTR lengths induced by the activation of alternative TSSs could potentially influence mRNA half-life [27] or translation efficiency [28], which play an important role in gene expression regulation.
Using ATAC-seq and DNase-seq data for human skeletal muscle-both strong markers of NDRs-we defined a comprehensive resource of the individual muscle promoter regions for 90% of the genes defined in our study (Fig. 3, Additional file4: Table S3). The lack of the individual muscle promoter regions for remaining genes relates to the lack of open chromatin near the CAGE TSS clusters and may be partially explained by the existence of non-muscle cells in biopsy samples, as well as by the limited number of open chromatin experiments in human skeletal muscle. The use of the positional weight matrix approach together with the individual muscle promoter regions substantially increases the number of predicted TFs compared to standard promoter regions and potentially decreases the rate of false-positive prediction, thereby enabling accurate prediction of TFs in any gene(s) of interest involved in the response to various stimuli in human skeletal muscle. For instance, the combination of this approach and cluster analysis allowed us to predict time course activation of "classic" TFs involved in response of skeletal muscle to endurance like contractile activity [26], as well as diversity of less/un-investigated TFs.

Conclusions
In conclusion, our work shows that in human skeletal muscle the function of alternative TSSs is associated mainly with the fine regulation of mRNA isoform expression. Transcriptome response induced by acute stress strongly related to activation of the alternative TSSs indicates that differential TSSs usage is a mechanism of fine regulation of gene response to stress stimulus. We   created a comprehensive resource of accurate TSSs and individual promoter regions for each TSS in muscle. This resource together with the positional weight matrix approach can be used to accurate prediction of TFs in any gene(s) of interest involved in the response to various stimuli, interventions, or pathological conditions in human skeletal muscle. For instance, the combination of this approach and cluster analysis allowed us to predict time course activation of "classic" TFs involved in response of skeletal muscle to endurance like contractile activity, as well as diversity of less/un-investigated TFs.

Experimental model and subject details
In skeletal muscle of untrained subject transcriptomic response to an aerobic exercise is much greater (and less specific) than that in skeletal muscle adapted to regular exercise [29].  Fig. S1). The muscle samples were quickly blotted with gauze to remove superficial blood, frozen in liquid nitrogen, and stored at − 80 °C until required. To improve the quality of identification of CAGE TSS clusters, all data were used for calculation. However, differential expressions of CAGE TSS clusters and genes as well as differential TSSs usage were evaluated for time points: prior to, 1 h, 3 h, and 6 h after an exercise only. Average effective size of the libraries (after depletion of ribosomal and mitochondrial RNAs) was ~ 11 million reads per sample. Raw sequencing data have been deposited to NCBI GEO: GSE164081.
To reduce the noise, low expressed CAGE TSS clusters (or minor CAGE TSS clusters with TPM normalized expression less than 10% of the maximum expressed CAGE TSS cluster for the same gene at any time point) were eliminated. Highly expressed CAGE TSSs were noted as major CAGE TSS clusters (Additional file2: Table S1).

DEGs and differential TSSs usage
Aggregated expression of CAGE TSS clusters per gene (all CAGE TSS clusters annotated to a gene) we used as raw input data for differential expression analysis (DESeq2 R package, analysis of paired samples with the Benjamini-Hochberg correction) with thresholds: |Fold Change|> 1.25 and P adj < 0.01. Log 2 (Fold Change) data were used for the principal component analysis. Genes with differential TSSs usage were identified for each experimental time point using Dirichlet-multinomial-based algorithm [33] from DRIMSeq R package (P adj < 0.05).

Functional gene ontology enrichment
Functional enrichment of protein groups in relation to all detected (~ 11,000) protein-coding genes was performed by the DAVID 6.8 using GOTERM_BP_Direct databases. GO terms with P adj < 0.05 (Fisher exact test, Benjamini correction) were regarded as significantly enriched.

Individual promoter regions in skeletal muscle and prediction of TFBSs (the PWM method)
The overlapped open chromatin intervals separated by < 71 bp (less than a half of histone wrapped DNA) from each other were joined to an interval. Individual promoter was defined as regions in which open chromatin is located 200 bp or less from the CAGE TSS cluster (Fig. 3A). Other CAGE TSS clusters separated by 200 bp or less were grouped and defined as pseudo-promoters. The pseudo-promoters were divided on promoters with removed open chromatin and without open chromatin. Promoter expression was determined as sum of expression all CAGE TSS clusters related to individual or pseudo-promoter. Maximal TFBS density for each promoter was estimated for each individual promoter interval or pseudo-promoter (− 2000 to + 2000 bp around the most expressed CAGE TSS cluster). The calculations were performed using the R environment and bedtools v2.26.0 (merge, intersect, genomecov).
Differentially regulated individual promoters (as well as pseudo-promoter regions) were determined as described above for DEGs. Enrichment of predicted TFBSs (and corresponding TFs) in individual muscle promoter regions were performed by the geneXplain platform (the "Search for enriched TFBSs (tracks)" function http:// wiki. biouml. org/ index. php/ Search_ for_ enric hed_ TFBSs_ (track s)_ (analy sis)) using the PWM database TRANSFAC v2020.3 [24,34]. The maximum enrichment (FE adj , statistically corrected odds ratios with a confidence interval of 99%) was determined for each PWM (site frequency ≤ 1 per 2000 bp) relative to that in 5000 random individual promoters showing no differential expression in any of experimental time points (DESeq2 method, P adj > 0.4).