A glycolysis-related two-gene risk model that can effectively predict the prognosis of patients with rectal cancer

Aerobic glycolysis is an emerging hallmark of cancer. Although some studies have constructed glycolysis-related prognostic models of colon adenocarcinoma (COAD) based on The Cancer Genome Atlas (TCGA) database, whether the COAD glycolysis-related prognostic model is appropriate for distinguishing the prognosis of rectal adenocarcinoma (READ) patients remains unknown. Exploring critical and specific glycolytic genes related to READ prognosis may help us discover new potential therapeutic targets for READ patients. Three gene sets, HALLMARK_GLYCOLYSIS, REACTOME_GLYCOLYSIS and REACTOME_REGULATION_OF_GLYCOLYSIS_BY_FRUCTOSE_2_6_BISPHOSPHATE_METABOLISM, were both significantly enriched in both COAD and READ through glycolysis-related gene set enrichment analysis (GSEA). We found that six genes (ANKZF1, STC2, SUCLG2P2, P4HA1, GPC1 and PCK1) were independent prognostic genes in COAD, while TSTA3 and PKP2 were independent prognostic genes in READ. Glycolysis-related prognostic model of COAD and READ was, respectively, constructed and assessed in COAD and READ. We found that the glycolysis-related prognostic model of COAD was not appropriate for READ, while glycolysis-related prognostic model of READ was more appropriate for READ than for COAD. PCA and t-SNE analysis confirmed that READ patients in two groups (high and low risk score groups) were distributed in discrete directions based on the glycolysis-related prognostic model of READ. We found that this model was an independent prognostic indicator through multivariate Cox analysis, and it still showed robust effectiveness in different age, gender, M stage, and TNM stage. A nomogram combining the risk model of READ with clinicopathological characteristics was established to provide oncologists with a practical tool to evaluate the rectal cancer outcomes. GO enrichment and KEGG analyses confirmed that differentially expressed genes (DEGs) were enriched in several glycolysis-related molecular functions or pathways based on glycolysis-related prognostic model of READ. We found that a glycolysis-related prognostic model of COAD was not appropriate for READ, and we established a novel glycolysis-related two-gene risk model to effectively predict the prognosis of rectal cancer patients.

2035 due to the constant increase in developing countries and increased morbidity in younger people [3]. Although colon cancer and rectal cancer have been regarded as the same disease over the past decades, accumulating evidence has revealed that rectal cancer is different from colon cancer in many aspects, such as embryological origins, anatomy, risk factors, sensitivities to carcinogens, microbiota, and genetic subtypes [3][4][5]. Moreover, rectal cancer is the most common subtype among Asians, and the past decade has witnessed an increasing trend of early-onset rectal cancer (diagnosis before 50 years of age) [2,4,6,7], which has constituted a formidable challenge in China and needs further investigation [4,8]. Various biomarkers associated with the survival and prognosis of rectal cancer have been explored in the past. However, a single gene or biomarker cannot accurately predict the outcomes of cancer patients. The past decade has witnessed the soaring development of high-throughput sequencing [9][10][11][12], which has expanded our insight into the genetic alterations of rectal cancer and made it possible to establish multiple-gene predictive models using clinical and genetic data obtained from the public databases [13][14][15]. An mRNA-based gene signature for predicting rectal cancer patient outcomes is still needed in the present context.
Reprogramming energy metabolism is an emerging hallmark of cancer [16]. Even in the presence of oxygen, tumor cells give priority to glycolysis rather than mitochondrial oxidative phosphorylation for glucose catabolism, resulting in a state termed "aerobic glycolysis" or the "Warburg effect" [17]. A well-acknowledged rationale for the glycolytic switch in tumor cells is that increased glycolysis is employed by proliferating cells as a versatile product line, which continuously generates glycolytic intermediates for various biosynthetic pathways to satisfy the requirement for active cell proliferation [18]. Thus, exploring critical glycolytic genes related to rectal cancer prognosis may help us discover new potential therapeutic targets.
In this study, we found that a glycolysis-related prognostic model of colon cancer could not distinguish the prognosis of rectal cancer patients through bioinformatic analysis based on The Cancer Genome Atlas (TCGA) database. Moreover, we discovered the critical and specific glycolysis-related genes that participate in the development of rectal cancer using bioinformatic methods and established a practical model to predict rectal cancer patients' prognosis.

Study design
Our study compared glycolysis-related gene sets through gene set enrichment analysis (GSEA) and compared glycolysis-related independent prognosis genes between colon adenocarcinoma (COAD) and rectal adenocarcinoma (READ). Glycolysis-related prognostic model of COAD and READ was, respectively, constructed and assessed in COAD and READ. Assessment of the glycolysis-related READ prognostic model in READ patients was further performed through time-dependent receiver operating characteristic curve (time ROC) analysis, univariate and multivariate Cox analyses, principal component analysis (PCA) analysis, t-distributed stochastic neighbor embedding (t-SNE) analysis, Gene Ontology (GO) enrichment and Kyoto Encyclopedia of Genes and Genomes (KEGG) analyses. A nomogram combining the risk model of READ with clinicopathological characteristics was established for READ. The workflow for this study is shown in Additional file 1: Fig. S1.

Data collection
The mRNA expression profile of 144 READ and 8 adjacent normal rectal tissues were downloaded from the TCGA Genomic Data Commons (GDC) database (https:// portal. gdc. cancer. gov/). The mRNA expression profiles of 398 COAD and 39 adjacent normal colon tissues were also downloaded from the TCGA GDC database. The mRNA profiles were standardized by log2 transformation for further analysis. However, only 142 rectal cancer samples were documented with mRNA expression profiles and detailed clinicopathological information, including age, gender, American Joint Committee on Cancer (AJCC) TNM stage, T stage, N stage, M stage and survival status. The detailed clinical features of 142 rectal cancer patients are shown in Table 1. Patients whose follow-up time was absent, zero-days or unknown were excluded from the survival analysis. Finally, 135 READ patients and 363 COAD patients were enrolled for further survival analysis of glycolysis-related genes.

Gene set enrichment analysis (GSEA)
Gene sets enrichment analysis (GSEA) was conducted by GSEA software 4.10 from the Broad Institute [19] to identify the gene sets enriched between cancer tissues and adjacent normal tissues. A total of 12 glycolysis-related gene sets were downloaded from the Molecular Signatures Database (MSigDB) (https:// www. gsea-msigdb. org/ gsea/ msigdb/ genes ets. jsp). To acquire a normalized enrichment score (NES), we conducted 1000 times gene set permutations for each analysis. When |NES| > 1, nominal p value < 0.1 and false discovery rate (FDR) q value < 0.1, the investigated gene sets were considered statistically significant. Glycolysis-related genes were extracted from the analysis performed by GSEA. We utilized the "limma" R package for analysis, which in turn yielded glycolysis-related differentially expressed genes (DEGs), with filtering criteria of p value < 0.05 and log2 fold change (log2FC) ≠ 0.

Standardization of data processing and construction of the risk score model
Univariate Cox regression analysis was first performed to screen out genes that were associated with overall survival (OS) (p value < 0.05). Subsequently, the identified genes were subjected to multivariate Cox regression analysis, and the corresponding regression coefficients related to OS were also acquired. When the values of hazard ratio (HR) are larger than 1, the corresponding factor is considered a risk factor. We used the "Cox. Zph" function to test the Cox proportional risk model in the "survival" R package. When the p value of each covariate and global test is > 0.05, it is considered that the Cox model conforms to the proportional risk hypothesis. In this study, glycolysis-related gene risk conformed to the proportional risk hypothesis and was constructed using the "coxph" function in the "survival" R package. The risk score formula is as follows: Risk score = n i=1 expression of Gi × βi ("G" means gene; "i" means order of genes; "n" means the number of prognostic genes; "β" means the regression coefficient of the corresponding gene after multivariate Cox analysis). According to the median risk score, all the included patients were divided into either the high-risk group and low-risk group.

Assessment of the risk score model in rectal cancer patients
Kaplan-Meier survival curves were drawn to show the survival rate difference between the two groups with a p value < 0.05 considered statistically significant. A ROC curve was plotted to evaluate the sensitivity and specificity of the risk score model. When the area under the curve (AUC) is 0.5-0.7, the risk score model has acceptable efficiency. When the AUC is larger than 0.7, the risk score model has good accuracy. The "ggplot2" and "pheatmap" R packages were used to plot heat maps and survival status charts. PCA and t-SNE were performed to explore the distribution of different groups using the "Rtsne" R package. The "rms" R package was used to plot the nomogram for predicting the survival of rectal cancer patients. The time-dependent ROC curves and calibration plots were depicted to evaluate the efficiency of the nomogram.

Functional enrichment analysis
The "clusterProfiler" R package was utilized to conduct Gene Ontology (GO) and Kyoto Encyclopedia of Genes and Genomes (KEGG) analyses based on the DEGs (|Log2FC| > 0.6, FDR < 0.05) between the high-risk and low-risk groups. P values were adjusted with the Benjamini-Hochberg (BH) method.

GSEA of glycolysis-related genes in COAD and READ
GSEA was, respectively, conducted to explore whether glycolysis-related gene sets were significantly enriched in the colon cancer and rectal cancer. Twelve glycolysis-related gene sets were investigated. Five gene sets, highlighted in boldface in Table 2, HALL-MARK_GLYCOLYSIS, REACTOME_GLYCOLYSIS,  Table 2 Glycolysis-related GSEA analysis on mRNA expression among COAD and READ patients  Table 2.
Four gene sets, highlighted in boldface in Table 2, HALL-MARK_GLYCOLYSIS, REACTOME_GLYCOLYSIS, GO_FRUCTOSE_1_6_BISPHOSPHATE_METABOLIC_ PROCESS, and REACTOME_REGULATION_OF_GLY-COLYSIS_BY_FRUCTOSE_2_6_BISPHOSPHATE_ METABOLISM, were significantly enriched in the rectal cancer samples (|NES| > 1, nominal p value < 0.1 and FDR q value < 0.1), as shown in Table 2. Three gene sets, HALL-MARK_GLYCOLYSIS, REACTOME_GLYCOLYSIS, and REACTOME_REGULATION_OF_GLYCOLYSIS_BY_ FRUCTOSE_2_6_BISPHOSPHATE_METABOLISM, were both significantly enriched in the colon cancer and rectal cancer.

Comparison of glycolysis-related prognostic genes between COAD and READ
A total of 316 genes in the above gene sets were included for further analysis. Subsequently, 199 glycolysis-related DEGs in READ and 254 glycolysis-related DEGs in COAD were, respectively, identified with p value < 0.05 and |log2FC| > 0 as the cutoff standard. To further, respectively, explore prognostic genes in COAD and READ, glycolysis-related DEGS were analyzed by univariate Cox regression. Nine genes (STC1, ANKZF1, STC2, SDHB, SUCLG2P2, P4HA1, PPFIA4, GPC1 and PCK1) in COAD and 4 genes (TSTA3, IDH3A, PKP2 and ACO2) in READ were found to be significantly associated with overall survival (OS) (p value < 0.05) ( Table 3). No same genes were both significantly associated with OS in COAD and READ. Subsequently, multivariate Cox regression analysis was, respectively, performed in COAD and READ. Six genes including ANKZF1, STC2, SUCLG2P2, P4HA1, GPC1 and PCK1 were proven to be independent prognostic factors in COAD (Table 4). Two genes, TSTA3 and PKP2, were proven to be independent prognostic factors in READ (Table 4). We compared the expression levels of the above genes between the cancer tissues and adjacent normal tissues, and found that ANKZF1, STC2 and P4HA1 were significantly upregulated, while SUCLG2P2 and PCK1 were downregulated in both COAD and READ among the independent prognostic genes of COAD (Fig. 1). Although GPC1 was significantly upregulated in COAD, it was not upregulated  in READ (Fig. 1). Among the independent prognostic genes of READ, we found that TSTA3 was significantly upregulated while PKP2 was downregulated in both COAD and READ (Fig. 2). Taken together, these results indicated that there existed different glycolysis-related prognosis genes between COAD and READ.

The glycolysis-related COAD prognostic model is not appropriate for READ
Above all, we established a risk score model based on the independent prognostic genes of COAD to, respectively, evaluate COAD and READ patient survival. The risk score model formula is as follows: risk score = expression of ANKZF1 ×  Table 4).
COAD patients were divided into a high-risk group and a low-risk group according to the median risk score of the whole COAD patients. We found that COAD patients with a high-risk score had a shorter OS than those with a low-risk score (Fig. 3A). A ROC curve was further drawn to evaluate the reliability of this prognostic model. The AUC of this predictive model accumulated steadily over time and came to 0.781 at 3 years, indicating a satisfactory sensitivity and specificity of this model in predicting survival for COAD patients (Fig. 3C). We wondered whether a risk score model of COAD could distinguish the prognosis of READ patients. The risk score of READ patients was calculated according to the risk model formula of COAD. READ patients were also divided into a high-risk group and a low-risk group according to the median risk score of the whole COAD patients. We found that the glycolysis-related prognostic model of COAD could not distinguish the prognosis of READ patients  (Fig. 3C). Moreover, the AUC of the ROC curve was 0.554 at 3 years in READ (Fig. 3D), which was lower than the AUC in COAD. In addition, READ patients were divided into a high-risk group and a low-risk group according to the median risk score of the whole READ patients based on COAD prognostic model. The survival curves of the high-risk group and low-risk group crossed with the p value > 0.05 in READ (Additional file 1: Fig. S2A). These results indicated that the glycolysis-related prognostic model of COAD is not appropriate for READ.

The glycolysis-related READ prognostic model is suitable for READ
Subsequently, we established risk score model based on the independent prognostic gene of READ to, respectively, evaluate READ and COAD patients' survival. The risk score model formula is as follows: risk score = expression of TSTA3 × (0.752) + expression of PKP2 × (− 0.725) ( Table 4). READ patients were divided into a high-risk group and a low-risk group according to the median risk score of the whole READ patients. We found that READ patients with a high-risk score had a shorter OS than those with a low-risk score (Fig. 4A). A ROC curve was drawn to evaluate the reliability of this prognostic model. The AUC of this predictive model accumulated steadily over time and came to 0.783 at 3 years, indicating a satisfactory sensitivity and specificity for this model in predicting the survival of READ patients (Fig. 4C). We wondered whether the risk score model of READ could distinguish the prognosis of COAD patients. Risk score of COAD patients was calculated according to the risk model formula of READ. COAD patients were also divided into a high-risk group and a low-risk group according to the median risk score of the whole READ patients. We found that the survival curves of the high-risk group and low-risk group crossed despite the p value < 0.05 in COAD (Fig. 4B). In addition, the AUC was 0.542 at 3 years in COAD (Fig. 4D), which was lower than the AUC in READ. In addition, COAD patients were divided into a high-risk group and a low-risk group according to the median risk score of the whole COAD patients based on READ prognostic model. The survival curves of the high-risk group and low-risk group still crossed with the p value > 0.05 in COAD (Additional file 1: Fig. S2B). These results indicated that the READ risk score model was more appropriate for READ than for COAD.
To provide visualization of risk score, survival status, and gene expression in READ. The risk score, survival status, and gene expression of the two genes in each READ patient are shown in Fig. 5A-C. PCA and t-SNE analysis confirmed that READ patients in the two subgroups were distributed in discrete directions (Fig. 5D-E).

Assessment of the glycolysis-related READ prognostic model in READ patients
To evaluate this prediction model's clinical utility, univariate and multivariate Cox proportional hazards regression analyses were conducted to compare this two-gene risk model with common clinicopathological features. Univariate analysis showed that age, TNM stage, and risk score were associated with rectal cancer patient survival (Fig. 6A). These three factors were further proven to be independent prognostic indicators in the subsequent multivariate Cox analysis (Fig. 6B), indicating that the two-gene risk model can serve as a promising tool for predicting the prognosis of patients with rectal cancer. The Kaplan-Meier curves illustrated that patients with older age, higher N stage, metastasis status, higher TNM stage, and a higher risk score had a poor prognosis. In contrast, gender and T stage had no effect on prognosis of rectal cancer patients (Fig. 6C-H).
To further verify the effectiveness of this two-gene model in predicting prognosis in rectal cancers, the patients were classified into different subgroups according to age (≤ 65 years vs > 65 years), gender (female vs male), T stage (T1-2 vs T3-4), N stage (N0 vs N1-2), M stage (M0 vs M1) and AJCC TNM stage (stage I-II vs stage III-IV). They were subsequently divided into the high-risk score and low-risk score groups based on the median risk score. Interestingly, the two-gene risk model still showed robust effectiveness in the different age, gender, M stage, and AJCC stage subgroups (Fig. 7A-F). However, high-risk scores did not suggest a poor Fig. 3 The assessment of glycolysis-related COAD prognostic model in READ and COAD patients. A Kaplan-Meier survival analysis on COAD patients between the high-risk and low-risk groups. B Kaplan-Meier survival analysis on READ patients between the high-risk and low-risk groups. C AUC of the glycolysis-related COAD prognostic model in predicting survival in COAD patients. D AUC of glycolysis-related COAD prognostic model in predicting survival in READ patients prognosis in the N0 stage and T1-2 stage subgroups ( Fig. 7A-F), indicating that the risk score model may be a more effective prognostic marker in the late stages of rectal cancer.

Construction and evaluation of a nomogram incorporating the two-gene risk score model and clinical features
To provide a clinically available and practical tool for oncologists to estimate rectal cancer patients' survival time, we established a nomogram combining risk score with clinicopathological characteristics (age and TNM stage) (Fig. 8A). Calibration plots suggested that the nomogram fitted well compared with the ideal model representing by the 45° line, as shown in Fig. 8B-D. The C-index of the nomogram was 0.852, which indicated that this nomogram had an outstanding stability. ROC curves showed that the AUCs of the nomogram at 1, 2, and 3 years was 0.838, 0.849, and 0.859, respectively, which were better than those of the clinical factors or risk score model alone (Fig. 8E).

Functional analyses between high-risk and low-risk groups in patients with rectal cancer
To elucidate the different biological functions or pathways that might play potential roles in the high-risk group and low-risk group, GO enrichment and KEGG analyses were performed using the DEGs (|log2FC| ≥ 0.6, FDR < 0.05) between the two groups. As expected, the DEGs were enriched in several glycolysis-related molecular functions or pathways, such as oxidative phosphorylation (p value < 0.05) (Fig. 9A-D). These results indicated that glycolysis-related molecular functions or pathways were indeed different between high-risk group and low-risk group according to glycolysis-related READ prognostic model.

Discussion
Cancer metabolism, especially glucose metabolism, has drawn great attraction in oncology research during the in recent decades [16][17][18]. Enhanced aerobic glycolysis, which serves as a metabolic driver, has been proven to promote tumor development and aggressiveness in CRC cells, and further results in resistance to treatment [20,21]. In our research, we compared glycolysis-related gene sets between COAD and READ. Three gene sets, HALL-MARK_GLYCOLYSIS, REACTOME_GLYCOLYSIS, and REACTOME_REGULATION_OF_GLYCOLYSIS_BY_ FRUCTOSE_2_6_BISPHOSPHATE_METABOLISM, were significantly enriched in both COAD and READ. We also compared glycolysis-related independent prognostic genes between COAD and READ. We found that six genes including ANKZF1, STC2, SUCLG2P2, P4HA1, GPC1 and PCK1 were independent prognostic genes in COAD, while TSTA3 and PKP2, were independent prognostic genes in READ. These results indicated that COAD might have different glycolysis-related prognostic gene signatures than READ. We, respectively, constructed glycolysis-related prognostic models in COAD and READ and, respectively, assessed the efficiency of each model in COAD and READ. We found that the glycolysis-related prognostic model of COAD was not appropriate for READ, while glycolysisrelated prognostic model of READ was more appropriate for READ than COAD. These results indicated that COAD might have many different patterns such as genomic signatures, drug efficacy, and prognosis compared with READ. We also developed a two-gene signature (TSTA3 and PKP2) to predict the prognosis of rectal cancer, which was shown to be an independent prognostic indicator through multivariate Cox analysis. Subgroup analysis confirmed that the READ glycolysis-related prognostic model still showed robust effectiveness in the different age, gender, M stage, and AJCC stage subgroups. Moreover, a nomogram integrating the risk model and clinicopathological factors was depicted to provide clinicians with a practical tool for predicting the prognosis of rectal cancer patients.
Recent research has revealed that traditional clinicopathological factors are inadequate for accurate cancer prognosis predictions [13,14]. With the rapid development of high-throughput sequencing and the accumulation of cancer genomic data, molecular signatures based on data mining in the public databases have become a reality to predict the outcome of cancer patients, demonstrating higher sensitivity and specificity than the traditional single genes or markers models [11,12,15]. Glycolysis-related risk scores have shown excellent performance in predicting prognosis in various solid tumors [14,[22][23][24][25][26][27]. Liu et al. [23] developed a four-gene signature (AGRN, AKR1A1, DDIT4, and HMMR) related to glycolysis to predict the lung adenocarcinoma patient outcomes that showed desirable accuracy. Another fourgene glycolytic signature (NUP205, NUPL2, PFKFB1, and PKM) showed excellent performance in predicting the OS of bladder cancer patients [27]. Similarly, Chen et al. established a risk score model containing seven glycolysis-related genes (PPARGC1A, DLAT, 6PC2, P4HA1, STC2, ANKZF1, and GPC1) in their recent study that can effectively predict the outcome of colon adenocarcinoma [14]. However, the genes identified in the COAD background in Chen's research are different from the genes identified in our present study, resulting in two different risk models, which further validate the mainstream view that colon cancer and rectal cancer are two diseases [3][4][5].
Plakophilin-2 (PKP2) was initially identified as a desmosomal protein, but further studies have revealed that it is localized in the cytoplasm and nucleus as well [28]. PKP2 expression is upregulated in many cancers, such as lung [29], ovarian [30], glioma [31], and bladder cancers [32]. Increased PKP2 expression was also related to a malignant phenotype and poor prognosis in some cancers [29][30][31][32]. Arimoto et al. found that PKP2 enhanced the dimerization of EGFR and activated downstream signaling pathways, which further promoted cell proliferation and tumor metastasis [33]. Moreover, PKP2 may function as a feedback inhibitor of Wnt/β-catenin signaling in CRC stromal fibroblasts and regulate Wnt activity in CRC cells [34]. Our present study found that PKP2 was downregulated in rectal cancer tissues compared with the adjacent normal tissues. Low PKP2 expression was correlated with a poor prognosis, indicating a distinct role of PKP2 in rectal cancer development, which needs to be further investigated. Tissue-specific transplantation antigen P35B (TSTA3), also called GDP-L-fucose synthase, is one of the two rate-limiting enzymes in the de novo synthesis pathway of GDP-L-fucose [35][36][37]. Serving as a one common glycosylation modification by conjugating fucose to protein-or lipid-bound oligosaccharides, fucosylation is dysregulated in various cancers and is related to carcinogenesis, invasion, and metastasis [38,39]. The process of fucosylation may provide novel targets for cancer therapeutics [40]. Recently, Zhang et al. [41] found that increased TSTA3 expression could promote esophageal cancer progression by fucosylation of LAMP2 and ERBB2 and could predict a poor prognosis [42]. Wang et al. [43] found that knocking out TSTA3 in mice could lead to fucosylation deficiency and further result in colitis and adenocarcinoma. In contrast, we discovered that TSTA3 was upregulated in rectal cancer tissues compared to adjacent normal tissues. High expression was correlated with a poor prognosis, indicating that TSTA3 may function as a tumor suppressor in rectal cancer. Since TSTA3 plays different roles in diverse kinds of cancer [42][43][44][45], the accurate biological function of TSTA3 in rectal cancer remains to be elucidated.
The two-gene risk model associated with glycolysis demonstrated effective performance in predicting the clinical outcomes of patients with rectal cancer. However, we could not find a large sample database of highthroughput gene expression READ for validation and we found two more than 100 samples colorectal cancer (CRC) database with microarray gene expression database and OS information in Gene Expression Omnibus (GEO) database for validation. The high-risk group had a higher OS rate than that of low-risk group with the p value = 0.029 according to GSE39582 database (Additional file 1: Fig. S3A). The survival curves of the high-risk group and low-risk group crossed with the  Fig. S3B). Although the above results were inconsistent with the results of READ, we consider that the glycolysis-related prognostic model of READ might be appropriate for the high-throughput gene expression database not for microarray gene expression database in READ. In future, with new READ high-throughput gene expression database available for public, the glycolysis-related prognostic model of READ might be validated. Besides, the exact biological functions of the predictive genes, PKP2 and TSAT3, remain unclear in rectal cancer and need to be elucidated in our subsequent studies. Fig. 9 Functional analyses between high-risk and low-risk groups in READ patients. A, B GO enrichment analysis on DEGs between high-risk and low-risk groups. C, D KEGG pathway analysis on DEGs between high-risk and low-risk groups