Modeling thousands of markers simultaneously has been of great interest in testing association between genetic biomarkers and disease or disease-related quantitative traits. Recently, an expectation-maximization (EM) approach to Bayesian variable selection (EMVS) facilitating the Bayesian computation was developed for continuous or binary outcome using a fast EM algorithm. However, it is not suitable to the analyses of time-to-event outcome in many public databases such as The Cancer Genome Atlas (TCGA).

Results

We extended the EMVS to high-dimensional parametric survival regression framework (SurvEMVS). A variant of cyclic coordinate descent (CCD) algorithm was used for efficient iteration in M-step, and the extended Bayesian information criteria (EBIC) was employed to make choice on hyperparameter tuning. We evaluated the performance of SurvEMVS using numeric simulations and illustrated the effectiveness on two real datasets. The results of numerical simulations and two real data analyses show the well performance of SurvEMVS in aspects of accuracy and computation. Some potential markers associated with survival of lung or stomach cancer were identified.

Conclusions

These results suggest that our model is effective and can cope with high-dimensional omics data.

Introduction

With the development of high-throughput sequence technology, large-scale omics data are generated rapidly for discovering new biomarkers [1, 2]. The public databases such as The Cancer Genome Atlas (TCGA) and Gene Expression Omnibus (GEO) provide great opportunities to understand complex diseases comprehensively on a molecular level [3, 4] and subsequently facilitate growing demanding statistical approaches designed to cope with these large-scale data [5]. Analyzing biomarkers one at a time is the most common strategy to detect the underlying causal markers [6, 7]. However, this one-by-one method ignores the correlation between biomarkers and needs multiple corrections for controlling false positives. Furthermore, multiple regression is performed increasingly because it is powerful to identify causal markers after the strongest associations have been accounted for [8,9,10]. It can also avoid multiple correction and enable accurate effect estimation [11, 12]. Nevertheless, omics data generally have the property of high dimensionality, which makes the classical multiple regression yield unstable parameter estimations with high standard errors. Due to this limitation, least absolute shrinkage and selection operator (LASSO) regression and its variants shrink the effects of noise toward zero while adding a penalty term to the likelihood function. The approach can easily be conducted on a large-scale variable selection analysis [13, 14]. The LASSO can also be explained from Bayesian perspectives, namely Bayesian LASSO (BL) [15]. That is, we can generate the LASSO estimators by imposing a Laplace prior on coefficients of explanatory variables. Compared with a frequentist penalty, Bayesian regression is more flexible to induce different shrinkage by specifying various priors.

George and McCulloch [16] proposed a two-component “spike-and-slab” mixture prior, consisting of the “spike” to be either a point mass at zero or a normal distribution with a very narrow variance while a “slab” be a normal distribution with a large variance. Indicator variables are utilized to denote which component each marker belongs to, which is known as Bayesian variable selection (BVS). BVS and its expansions are widely used in genomic studies, including marker detection and disease risk prediction [11, 12, 17]. But, most Bayesian studies have used a Markov Chain Monte Carlo (MCMC) algorithm to explore the posterior distribution of unknown parameters via numerical approximations in high-dimensional models, which is quite time consuming for getting a stable chain. For example, the BayesR model required ~ 18 h to complete an analysis of a bipolar disorder genome-wide association data with ~ 3800 individuals (80% of whole samples) and ~ 300,000 genotype markers [12]. In order to facilitate Bayesian computation, variational inference [18,19,20] and expectation-maximization (EM) algorithm are commonly used for Bayesian posterior inference [21]. Ročková and George [22] proposed EM variable selection (EMVS) for continuous outcomes to rapidly identify promising high posterior models and parameter estimates. The continuous conjugate spike-and-slab prior adopted by EMVS leads to fast closed form expression for the EM algorithm. Nevertheless, there has been considerable interest in discovering associations between biomarkers and prognosis.

Analyzing time-to-event data, namely survival analysis, plays a very important role in statistics, which arises in many fields, such as medicine, genetics, industrial engineering, sociology, and economics [23,24,25]. Modeling survival data using Cox proportional hazards regression is popular for its robust to the unknown baseline hazard [26]. Alternatively, being well known for parametric survival analysis, accelerated failure time (AFT) model tends to give more precise estimates of interest parameters if the distribution of survival time is chosen correctly, in addition, the parameter estimates from AFT are robust to omitted covariates [27]. Under the scenario of high-dimensional survival analysis, a lot of works have been done usually by adding a penalty term to likelihood. In a Bayesian framework, we usually need to assign a semi-parametric or nonparametric prior processes to the (cumulative) baseline hazard function in a Cox model [28, 29], which does not allow us to naturally choose a fully parametric survival model for the subsequent analyses. As a parametric model, the Weibull regression induces a very flexible model since it is a unique parametric model which has both AFT and the proportional hazards properties [30].

In this study, we extended EMVS to parametric survival model (SurvEMVS) with Weibull distribution assumption. A fast EM algorithm was used to obtain posterior modes of interested parameters, in which a variant of the fast cyclic coordinate descent (CCD) method is nested. We used simulation trials to explore performance in comparison with an alternative frequentist variable selection strategy, namely Cox LASSO regression. After that, we applied SurvEMVS to a lung cancer genotype data and a stomach cancer gene expression data. Further details of this work were given below.

Methods

Statistical framework

Survival times of n individuals in sample are designed by T_{i} = min(t_{i}, c_{i}), i = 1, …, n, where t_{i} and c_{i} are lifetime and fixed censoring time for a specific individual i, respectively. The survival outcome from a follow-up study can be conveniently represented by pair of random variables (T_{i}, δ_{i}), where δ_{i} indicates whether the lifetime t_{i} corresponds to an event (δ_{i} = 1) or is censored (δ_{i} = 0). In this study, we consider right censoring scheme and non-informative censoring mechanism for each individual without note elsewhere. Under parametric framework, f(T_{i}| θ) and S(T_{i}| θ) are defined as probability density and survival function of survival time T_{i}, respectively, parametrized by θ. Let \( L={\prod}_{i=1}^nf{\left({T}_i|\theta \right)}^{\delta_i}S{\left({T}_i|\theta \right)}^{\left(1-{\delta}_i\right)} \) be the likelihood of the parametric model with an i.i.d assumption. Weibull distribution can be fully parametrized by the parameter pair θ = (λ, α), where λ and α are scale and shape parameter, respectively. The density and survival function of Weibull distribution T are f(T| λ, α) = αT^{α − 1}λ exp(−λT^{α}) and S(T| λ, α) = exp(−λT^{α}), respectively. Typically in a regression model, the scale parameter is defined as λ = 1/ exp(Zu + Xβ), where X_{n × p} represents a column-scaled matrix of tumor biomarkers such as gene expression, genotype or DNA methylation, β is a p × 1vector of marker effects, Z_{n × (1 + q)}is a covariates matrix of intercept and q clinical variables such as gender, age, and tumor histological grade, u is a (q + 1)-dimensional effect vector of Z_{n × (1 + q)}. With the condition of p > n(i.e., high dimension), a penalize term is necessary for inducing a sparse solution of β. Under the Bayesian framework, we want to impose a well-known spike-and-slab prior on each β_{j} to facilitate Bayesian variable selection [16]. A vector of binary latent variables γ = (γ_{1}, …, γ_{p})^{T}, γ_{j}∈ {0, 1} are introduced as indicator variables, where γ_{j} = 1 donates that jth explanatory variable is to be included in the model. Conditional on γ, the continuous prior being assigned to β is,

where \( {\mathbf{D}}_{\sigma^2,\boldsymbol{\upgamma}}={\sigma}^2\mathit{\operatorname{diag}}\left({a}_1,\dots, {a}_p\right) \)and a_{j} = (1 − γ_{j})υ_{0} + γ_{j}υ_{1}. As suggested by [22], we set these hyperparameters υ_{0} and υ_{1} to be small and large positive values, respectively. σ^{2} is a common variance parameter with an inverse gamma prior π(σ^{2}) = IG(ν/2, νη/2). The binomial prior is chosen for the indicator variable γ, i.e., \( \pi \left(\boldsymbol{\upgamma} |\theta \right)=\prod \limits_{j=1}^p{\theta}^{\gamma_j}{\left(1-\theta \right)}^{1-{\gamma}_j} \), where θ is a hyperparameter and we impose a Beta(a, b) prior on it. The constant prior is chosen for α and u, i.e., π(α) ∝ 1 and π(u) ∝ 1.

Generally speaking, MCMC is usually used for simulating the posterior distribution of unknown parameters β, γ, θ, σ^{2}, α, and u. But here we employ an EM algorithm to seek the posterior mode of each parameter because this algorithm provides substantial computational advantage, especially for high-dimensional data analysis. Concerning about the unknown status γ for variables, we replace this “missing data” by its conditional expectation given the current estimates for other parameters and observed data (E-step) [31]. Then, an M-step is followed by maximizing the expected complete-data log-posterior with respect to β, θ, σ^{2}, α, and u. As a result of iterations between the E-step and M-step, each estimator will converge toward a local maximum of the posterior distribution. More specifically, the objective function can be expressed as:

where E_{γ ∣ ⋅}(⋅) denotes the expectation with regard to γ given estimations in current iteration. Furthermore, \( {E}_{\boldsymbol{\upgamma} \mid \cdot}\left[\log \pi \left(\boldsymbol{\upbeta} |{\sigma}^2,\boldsymbol{\upgamma} \right)\right]={C}_1-\frac{p}{2}\log {\sigma}^2-\frac{1}{2{\sigma}^2}{\sum}_{j=1}^p{\beta}_j^2{E}_{\boldsymbol{\upgamma} \mid \cdot }{\left[\left(1-{\gamma}_j\right){\upsilon}_0+{\gamma}_j{\upsilon}_1\right]}^{-1} \), and \( {E}_{\boldsymbol{\upgamma} \mid \cdot}\left[\log \pi \left(\boldsymbol{\upgamma} |\theta \right)\right]={\sum}_{j=1}^p{E}_{\boldsymbol{\upgamma} \mid \cdot}\left[{\gamma}_j\right]\log \left(\frac{\theta }{1-\theta}\right)+p\log \left(1-\theta \right) \), both C and C_{1} are constants. Next, our EM algorithm for Bayesian Weibull regression proceeds as follows.

(1)

Initialize the unknown parameters: β^{(0)}, θ^{(0)}, σ^{2(0)}, α^{(0)}, u^{(0)}.

(2)

E-step:

As can be seen from the formula (2) above, there are two parts that need further evaluation, namely E_{γ ∣ ⋅}[γ_{j}] and E_{γ ∣ ⋅}[(1 − γ_{j})υ_{0} + γ_{j}υ_{1}]^{−1}. In particular, E_{γ ∣ ⋅}[γ_{j}] is a conditional expectation of γ_{j} and depends on observed data (T, δ) only by means of current parameter estimates β^{(k)}, θ^{(k)}, σ^{2(k)} because of the hierarchical structure of γ; therefore, we have

where D^{∗1/2} is the square root of the p × p diagonal matrix \( {\mathbf{D}}^{\ast }=\mathit{\operatorname{diag}}\left({d}_1^{\ast },\dots, {d}_p^{\ast}\right) \). It can be shown that the formula (5) above for Weibull model is convex, and a wide variety of numerical optimization algorithms can be applicable. For generalizing our algorithm to high-dimensional data, the commonly used multidimensional Newton-Raphson method is not recommended because of large memory requirements and intensive computations. In this research, we employ the cyclic coordinate descent (CCD) algorithm due to its efficiency and ease of implementation, which makes one-dimensional optimization available [32]. Briefly, the CCD minimizes the objective function with regard to β_{j}, holding all other variables constant. Similar to the combined local and global (CLG) algorithm of [33, 34], we modify the update for β_{j} in two ways. First, in order to avoid big steps in Newton iteration, we specify a positive value Δ_{j} for the jth maker to restrict the maximum change of β_{j} between two adjacent iterations. Second, for one-dimensional optimization, we update β_{j} only once instead of multiple iterations till convergence before updating β_{j + 1}. Moreover, considering that it is unnecessary to take much time to update the amount of possible neural markers (that is, their minuscule effects contribute less to outcome, with no need for accurate estimations) when p is large, we partially update those markers with “large” effect (|β_{j}|is greater than a threshold β^{′}, e.g., 1E−08) after a small number (k^{′}) of full iterations for all makers, which speeds up computation considerably.

2)

Differentiating Q(⋅| T, δ) with regard to u_{j}, j = 1, …, q + 1one-by-one using one-dimensional optimization reveals the following form

Iterations between the E-step and M-step are in progress. SurvEMVS will be terminated if the convergence criterion is satisfied as follows: \( {\sum}_{i=1}^n\left|{\mathbf{X}}_i^T\left({\boldsymbol{\upbeta}}^{(k)}-{\boldsymbol{\upbeta}}^{\left(k-1\right)}\right)\right|/\left(1+{\sum}_{i=1}^n\left|{\mathbf{X}}_i^T{\boldsymbol{\upbeta}}^{(k)}\right|\right)<\xi \), where Xβ^{(k)} is predictive vector at kth iteration, and ξ is a small value (say 10^{− 4}). Summarizing, Additional file 1: Figure S1 presents pseudocode for our implementation of SurvEMVS.

Simulation studies

In this section, we used simulations to validate the performance of proposed SurvEMVS. Cox LASSO model [35] was considered as a benchmark for comparison. The effect sizes and directions of Cox LASSO estimates were adjusted for consistency with our parametric model, which made the direct comparison between two methods. For each simulation scenario, we replicated the simulation 50 times and then summarized these results.

Marker values were simulated from a multivariate normal distribution N_{50}(0, ∑), where ∑ is a variance-covariance symmetric matrix with ∑_{jj} = 1 and ∑_{ij} = 0.6^{|i − j|}, i ≠ j. For large p markers, we repeatedly sampled from the above distribution and then combined them by column. Thus, we obtained an n × p matrix with multiple independent blocks and 50 makers in each block were correlated. Assuming \( {\lambda}_i=\exp \left(-{\sum}_j^p{x}_{ij}{\beta}_j\right) \), we simulated survival time t_{i} for each subject from an exponential distribution t_{i}~exponential(λ_{i}) , and random censoring time c_{i} from a uniform distribution c_{i}~U(0, K), and chose K such that on average 40% subjects were right censored. The observed censored survival time T_{i} was generated by min(t_{i}, c_{i}). Furthermore, additional simulations where survival times are generated from Weibull distribution (with α = 2) were used to show effectiveness of our method. As is well-known, the true distribution of survival time in a real data is unclear and does not coincide with the Weibull assumption exactly. Therefore, we simulated a vector of Gamma-distributed survival time on purpose, thus assigning weakness setting to SurvEMVS. The shape and rate parameters of gamma distribution were set to be 0.8 and 1/λ_{i}, respectively. We randomly sampled six causal markers and set coefficients to be {0.2, − 0.2, 0.3, − 0.3, 0.4, − 0.4}. Sample size (n) was set to be 500, and number of makers is set to be 1000 or 5000. Moreover, we generated 100 samples as a test dataset in each replication to appreciate the predictive performance of two approaches. The detailed simulation scenarios were summarized in Table 1.

Hyperparameters tuning and performance metric

The performance of SurvEMVS depended on the hyperparameters υ_{0} and υ_{1}, which made us be interested in a running model based on more than one combination of υ_{0} and υ_{1}. The speed of the EM algorithm made it feasible to consider a sequence of models as (υ_{0}, υ_{1}) varied over many candidates, from which we could select an optimal combination based on some criteria. According to [22], large υ_{1} and small υ_{0} could accommodate all plausible β_{j}. In order to acquire sparse solution in large p data, we set a sequence of candidates {1/10p, 1/5p, 1/2p, 1/p, 2/p, 5/p, 10/p, 0.05} (0.05 was reserved if p > 500) to υ_{0} unless otherwise noted, which was dynamic with p. Three candidates from {10, 100, 500} were assigned to υ_{1}. Therefore, there were 24 combinations for hyperparameter tuning. The procedure of fitting Cox LASSO by widely used R package glmnet gave the similar parameter tuning as we did here.

On account of making parameter selection from the 24 combinations of υ_{0} and υ_{1}, we needed a metric to measure the performance of the fitted models. Cross-validation partial likelihood (CVPL) was generally used for parameter selection in Cox model [35, 36], while some employed cross-validation score in parametric survival model [36]. Subsequently, one candidate of hyperparameters was chosen to minimize one of these metrics. However, these metrics involving random cross-validation (CV) make the hyperparameter selection not stable as well as time demanding, and depend on the folds. Other criteria without CV like the Akaike information criterion (AIC), the Bayesian information criterion (BIC), and the generalized cross-validation (GCV) tend to select many spurious variables especially in high-dimensional problem [37, 38]. In this paper, we considered a metric, namely extended Bayesian Information Criteria (EBIC), which was utilized for model selection at first [39]. The constant prior behind the BIC assigns high probabilities to the models with a larger number of markers, which was apparently unreasonable and strongly against the principle of parsimony. The EBIC was proposed to take away this disadvantage of the BIC. The EBIC is defined as,

where p_{m} is the number of selected variables in a fitted model. The EBIC with τ = 0 reduces to the original BIC. Minimizing the EBIC with larger τ will get a much more parsimonious model. Thus, EBIC1, EBIC2, and EBIC3 were served as metrics for hyperparameter tuning with regard to τ = 0, 0.5, 1, respectively.

In simulation studies, the optimal SurvEMVS model selected by the EBIC was compared with Cox LASSO in aspects with variable selection, effect estimation, and model prediction. We utilized false positive rate (FPR), true positive rate (TPR), and false discovery rate (FDR) as evaluation indicators for variable selection. Note that \( P\left({\gamma}_j=1\left|{\widehat{\beta}}_j,\widehat{\theta},{\widehat{\sigma}}^2\right.\right)\ge 0.5 \) meant the jth maker was selected. Mean square error (MSE) denoted by \( {\sum}_{j=1}^p{\left({\widehat{\beta}}_j-{\beta}_j\right)}^2/p \) was used to appreciate effect estimations for makers. The predictive accuracy of the fitted model be applied to test dataset was evaluated by Harrell’s c statistic [40], as known as the area under the ROC curve (AUC).

Implementation

We considered a = b = 1 in beta prior for θ which yielded a uniform distribution. As noted in [22], we had inverse gamma prior for σ^{2} with ν = η = 1 to make this prior relatively non-influential. We ran all analyses using R software (v3.41) on a machine with Intel® Xeon® X5690 3.46-GHz processors. Cox LASSO model was implemented with the glmnet package in R. Tenfold cross-validation was used to choose an optimal penalization parameter λ_{lasso} in glmnet, which determined an optimal Cox LASSO model. Two LASSO models selected by “minimum cvm” and “1 standard error” of λ_{lasso} were considered as LASSO.min and LASSO.se, respectively. We employed the PLINK tool for quality control of genotype data [41].

Results

Simulation studies

Iteration and tuning plot

By analogy with LASSO solution path plot that shows the estimates change with an increasing penalty parameter, here we want to investigate the impact of parameters tuning for υ_{0}. Figure 1a displays a solution path of SurvEMVS under Scenario 1 with υ_{1}=10. Large effects (red dots) will be firstly incorporated into the fitting model with υ_{0} increasing, and a remarkable separation between the positive and negative effects appears when υ_{0} is larger than 0.03. However, the estimated effects for zeros inflate because of less shrinkage at large υ_{0}. We also present the iteration plot to detect the convergence property of our approach. From Fig. 1b, SurvEMVS makes a fast convergence to posterior estimates only in several steps. Moreover, neutral effects (black dot line) get close to zero in the fourth iteration, which means that we can concentrate iterations on large effects after a number of full iterations for all makers and further accelerate the posterior computation.

Variable selection

In SurvEMVS, conditional posterior probabilities are used to guide variable selection. Table 2 shows comparison results between SurvEMVS and Cox LASSO in variable selection for Scenarios 1 and 2. For Scenario 1 with p = 1,000, all models except LASSO.min can reduce noise markers with low FDRs, and all of three models with regard to SurvEMVS acquire high TPRs. When simulated p increases to 5000 (Scenario 2), FPR and FDR of BIC (i.e., EBIC1) inflate seriously. These indicate that proper extra penalty on the BIC of SurvEMVS brings to a moderate result of variable selection. We summarize the results of Scenarios 3 and 4 with Weibull distribution in Additional file 1: Table S1, and the results of Scenarios 5 and 6 with gamma distribution in Additional file 1: Table S2. Each of them presents a similar trend with scenarios of exponential distribution.

Parametric estimation

Figure 2 and Additional file 1: Figures S2-S6 show the parameter estimations of five models for all scenarios. Averaged estimated effects (black vertical lines) of estimated effects for all makers are calculated over 50 trials. Triangles in all figures label the locations and effect sizes of the pre-specified causal makers. In all scenarios, SurvEMVS gives a lower bias than Cox LASSO. Biases of all models increase with number of variants. Two models of SurvEMVS (i.e., EBIC2 and EBIC3) present a similar estimator, while the estimated effects in EBIC1 for zeros inflate under scenarios with p = 5,000 (rough X-axis in Additional file 1: Figures S2, S4 and S6). In order to make a comprehensive evaluation of bias and variance, we use the MSE metric and present the results in the left panels of Fig. 3 and Additional file 1: Figures S7 and S8. Both the EBIC2 and EBIC3 are well performed under all scenarios, whereas the EBIC1 model get a high MSE with p = 5,000. There is no apparent difference between the results of exponential, Weibull, and gamma distribution.

Prediction accuracy

In order to appreciate prediction accuracy of the fitted models, we summarize AUC results by box plot in the right panels of Fig. 3 and Additional file 1: Figures S7 and S8. Generally speaking, the EBIC2 model performs best under our simulation settings, while the LASSO.min presents similar prediction but with larger variance. In accordance with the conclusion of “Parametric estimation”, prediction accuracy of the BIC model descends with p varying from 1000 to 5000. Moreover, SurvEMVS with exponential or Weibull settings gain slightly larger AUC than those with the gamma settings. Furthermore, the LASSO.se model almost provides the lowest AUC among simulation scenarios. All the above results indicate that the BIC is not suitable for large p scenario.

In summary, the EBIC2 model works best under almost all scenarios in terms of variable selection, parameter estimation, and prediction accuracy. Besides, Additional file 1: Table S3 shows the time consumption in different scenarios. In comparison with Cox LASSO, SurvEMVS takes more computational time but is still fast enough. Note that time used in Additional file 1: Table S3 is for reference only, as it varies depending on context, such as convergence criterion, programming language, computer performance, and algorithm optimization.

Real data analysis

Harvard lung cancer data

This dataset from The Harvard Lung Cancer Susceptibility Study GWAS includes 526 late-stage (III and IV) patients with non-small cell lung cancer (NSCLC) recruited from Massachusetts General Hospital (Boston, MA). More details about participants’ recruitment have been described previously [42]. We note that it is appropriate to assess an association study restricted to late-stage cancer because some gene functions work primarily in the late stage and are not present in preinvasive stages of cancer [43]. DNA was genotyped using Illumina 610K Quad chip. After quality control protocol described by [44], there were 512,885 SNPs remaining. Those patients with more than 5 years of overall survival were considered as right censored, and finally, the censor rate was equal to 20.27%. We assumed an additive genetic model and imputed missing genotypes by mean of each SNP. We adjusted for age, sex, smoking status, cell type, stage, surgery (yes vs. no), and the top four principal components in the subsequent analysis. Considering that the number of SNPs related to NSCLC survival was not expected to be too large, we filtered the SNPs by a commonly used single locus Cox model. This filter yielded a de-noising of outcome so that the subsequent analyses became more efficient. By setting a threshold of P value less than 5E−3, 3911 SNPs were left for the subsequent analysis.

The EBIC2 was used to choose an optimal model from the candidates we noted above. Finally, 14 SNPs were detected by the proposed EBIC2 model with υ_{1} = 100 and υ_{0}=2.56E-03. We further analyzed and annotated these SNPs by TCGA (with an online tool UALCAN [45]), KEGG pathway, and PubMed database. Interestingly, seven of all listed in Table 3 may have potential functional influence on carcinogenesis or prognosis. For example, rs1506943 is located at ~ 33 kb downstream of RXRG (retinoid X receptor gamma) on Chromosome 1. This gene is expressed at significantly lower levels in TCGA-LUAD (lung adenocarcinoma) and TCGA-LUSC (lung squamous cell carcinoma) tumor samples and samples of other research [46]. Additionally, it also participates in non-small cell lung cancer pathways and other cancer-related pathways. rs2074986 is located at DNase I hypersensitive site (DHS) of GFRA1 on Chromosome 10. GFRA1 released by nerves enhances cancer cell perineural invasion [47], whose expression is reduced in tumor samples of TCGA-LUAD and TCGA-LUSC compared with normal samples. In addition, the high expression level in TCGA-LUAD tumor samples may contribute to good prognosis (P = 0.0025). We also provided the estimated effects of 14 SNPs by SurvEMVS in Additional file 1: Table S4 along with classical Weibull regression estimations for them (that is, only 14 SNPs and clinical variables are fitted by Weibull regression). In this way, SurvEMVS applied in high-dimensional data can also generate approximate estimates with Weibull regression. We plotted Kaplan-Meier (KM) survival curve of patients with high, moderate, and low risk defined by tertiles of risk scores \( -{\sum}_j^p{x}_{ij}{\beta}_j \) (Additional file 1: Figure S9). The log-rank test was used to compare the survival estimates among the three groups, and the results show that higher prognostic risk score is significantly associated with shorter survival (P < 1E−16). However, Cox LASSO models did not identify any SNP.

TCGA stomach adenocarcinoma (TCGA-STAD) expression data

We accessed this RNA-seq transcriptomic data from TCGA database by R/Bioconductor package TCGAbiolinks [48], which was used to do subsequent quality control, normalization, differential expression analysis (DEA), and visualization. The clinical information is summarized in Additional file 1: Table S5. Similar to the first real data analysis, we built final models only using filtered expression markers by DEA rather than all markers. There were 2711 markers left passing a selection threshold defined at fold change (FC) > 2 and testing FDR < 0.01. Due to the relative high missing rate, we made use of multivariate imputation by chained equations (MICE) to deal with missing clinical covariates [49]. After removing the patients with missing or zero survival time, we apply SurvEMVS and Cox LASSO to a matrix with 390 rows and 2716 columns (including 5 clinical covariates listed in Additional file 1: Table S5).

We used the EBIC2 to select best model from the candidates above. Three markers were identified by the proposed model (υ_{0} = 3.69E-03 and υ_{1} = 500) including CTLA4, NACAD, and SERPINE1, mapped to 2q33.2, 7p13, and 7q22.1, respectively. Meanwhile, the LASSO.min detected ALG11, GAMT, and PLCXD3 in addition to the overlap genes CTLA4 and SERPINE1. Estimated effects of the selected markers in both two models are provided (Additional file 1: Table S6) along with classical Weibull regression and Cox model estimations for them (like the first application). We can see that many effects estimated by Cox LASSO are small while SurvEMVS presents similar estimations with its counterpart in the low-dimensional Weibull regression. According to tertiles of risk scores, we equally divided the patients into high-, moderate-, and low-risk groups. Figure 4 presents the KM curves of SurvEMVS (left panel) and Cox LASSO (right panel), respectively, and both of them show a higher risk score which is significantly associated with shorter survival (Log-rank test P = 3.87E−07 for SurvEMVS and P = 1.02E−04 for Cox LASSO).

Furthermore, in order to validate the results, we used external data from GEO database. Five datasets (GSE14210, GSE15459, GES29272, GSE51105, and GSE62254) are included for their proper sample sizes. Table 4 presents the estimated hazard ratios along with 95% confidence intervals (CI) and P value extracted from an online tool (KM plotter, Web resource). We also show the combined results using meta-analysis. As a result, CTLA4 and NACAD were successfully validated and have the same direction of the effects on prognosis with those estimated by SurvEMVS in TCGA-STAD data. Interestingly, CTLA4 encodes CTLA-4 (cytotoxic T-lymphocyte-associated protein 4) which inhibits T cell activation and downregulates immune response. Antagonistic antibody against CTLA has become a targeted drug (Ipilimumab, approved by FDA for melanoma in 2011), which is the first approved and popular immune checkpoint blockade therapy [50, 51]. Some research also indicates CTLA4 may have influence on gastric cancer germination and progression [52, 53]. GAMT and PLCXD3 detected by Cox LASSO display strong heterogeneity on effects (81.8% and 82.1%, respectively) with five datasets and no significant association in the combined analysis. Note that ALG11 is not identified in the five data. Although we acquire a negative consequence of SERPINE1 (also known as plasminogen activator inhibitor-1, PAI-1) in validation analysis; interestingly, it has been widely studied and is well known for participating p53 signaling pathway and playing a crucial role in tumor progression and angiogenesis [54, 55].

Discussion

High-throughput sequencing technology, which has become cheaper, promotes the development of precision medicine [56]. Picking up underlying markers affecting disease prognosis from thousands of candidates calls for high-dimensional survival model besides generally used one-by-one Cox proportional hazards model. In this paper, we propose a parametric survival counterpart of EMVS, namely SurvEMVS, which employs a fast EM algorithm to fit all candidate biomarkers simultaneously and to explore posterior distribution of the unknown parameters, consequently to identify important signals and make effect estimations.

Much work has concentrated on developing new Bayesian methods on high-dimensional parametric survival model in application to medical or genetic data. For example, Sha et al. built AFT models with less common distribution (i.e., log-normal and log-t) for microarray data using a discrete spike-and-slab prior, where a time-consuming MCMC procedure was employed to simulate posterior distribution [57]. Mittal et al. developed four parametric models, i.e., exponential, Weibull, log-logistic, and log-normal distribution, by assigning Gaussian and Laplace prior to effects, where maximum a posterior (MAP) was used to acquire posterior modes of effects; however, this work lacked numerical study to evaluate their models, as well as discussion on variables selected in real data analysis from medical reasonability [58]. Newcombe et al. imposed a discrete spike-and-slab prior on coefficients of Weibull regression, where reversible jump MCMC being used for the posterior computation is ineffective. Moreover, it is unrepresentative of their application to a low-dimensional breast cancer data [59].

SurvEMVS imposes a continuous spike-and-slab mixture prior on effects to facilitate the separation of different effect sizes. This two-component prior can provide an indicator vector to guide variable selection whereby using a local version of the median probability model [22, 60]. In contrast, the EM algorithm or variational approximation employing one-component prior such as Laplace or t distribution does not involve variable inclusion indicators, and consequently makes variable selection indirectly [20]. Due to the unavailable closed form of maximization about maker effects in M-step, a variant of CCD algorithm serves to be feasible for obtaining approximate solutions. Consequently, our EM steps incorporating this fast CCD make the fitting much effective. One focus of this study is how to choose an optimal model from the hyperparameter tuning process. The EBIC, an extension of the BIC, is adopted with reason as follows: in comparison with the EBIC, the normally used AIC, BIC, or GCV would generate more spurious signal when applied to high-dimensional data, while CV-based metrics demand more computation and are unstable since the folds in CV are selected randomly. This is, to our knowledge, the first application of EBIC to the high-dimensional parametric survival analysis.

Over a range of simulation scenarios, our method with EBIC2 generally performs better than Cox LASSO in variable selection, parameter estimation, and prediction. In contrast, owing to imposing a single penalty on all effects, Cox LASSO yields high biased estimators. Our simulations also show the EBIC is appropriate for model selection, while the BIC (i.e., EBIC1) perform worst in situation with very large number of markers. For p > n problem in omics data, we recommend τ = 0.5 for EBIC in multi-stage study because it offers a good trade-off between the well-controlled FDR and the TPR and provides more opportunity to new findings. However, if one is strict to control false discovery, τ = 1, is recommended. Subsequently, we conducted two real data applications. In the first study with a lung cancer genotype data, 14 SNPs were detected by SurvEMVS, and further validation analyses using external data or function annotation resulted in 7 outstanding SNPs. In order to widen the application range of SurvEMVS considerably, we utilized a stomach cancer expression data in the second study. Expression levels of three genes are associated with cancer prognosis, and two of them are validated by extra GEO datasets along with one (namely SERPINE1) involved in tumor progression. The identification of well-known CTLA4 illustrates the availability of SurvEMVS. However, further functional experiments are needed for evidence of biologic plausibility of those identified markers.

Although we did not directly compare our EM algorithm with its MCMC counterpart, the speed advantage is apparent, according to the results of our EM algorithm under each optimal model converge needing 174 and 238 iterations in the two real data applications, which is much less than the chain length being set up for MCMC (usually > 10,000 for stable estimation). The model thus resembles a distinct iteration increase of real data application relative to a simulation study. We can explain that ideal conditions (e.g., sparse structure with independent markers) along with strong shrinkage of candidate hyperparameters (producing a parsimonious model) favor rapid convergence, but generally are not available under real data applications, which leads to more iterations demanding (but still fast enough) not only for SurvEMVS but also for other model [20].

However, we acknowledge that there are several limitations of the present study. First, SurvEMVS incorporating the EM, which seeks for a posterior mode rather than a whole posterior distribution of the parameter, cannot provide uncertain measure for estimators. We can directly disentangle this disadvantage by the bootstrap method, but have to bear expensive computations. Actually, another compromise may be adopted: the estimates of SurvEMVS can be considered as initial values of a following MCMC algorithm, which makes the MCMC procedure avoid the burn-in stage and finally yield fast and accurate estimators with uncertain measurement. Second, the most worrying thing of the parametric model is a situation of going against the parametric assumption for survival distribution. We show that SurvEMVS is robust for the status with moderately deviating from the Weibull premise. However, we believe that SurvEMVS will be less effective if the real survival time distinctly violates the Weibull distribution. We can bypass this limitation using a non-parametric AFT model like in [61], in which a Dirichlet process is used to make the model robust over a wider range of unknown baseline hazard. In addition, a lot of new directions for methodological work will arise from the current study. One obvious extension to our method will consider multivariate “g-priors” to reflect the effect correlations within the correlated markers [62]. Another interesting extension will involve introducing a newly developed spike-and-slab Laplace prior [63]. Going forward, the meaningful extension of SurvEMVS will integrate functional annotations or multi-omics data to powerfully mine association signals in future work.

Conclusions

We present a new implementation of the EM algorithm for Bayesian variable selection under a Weibull survival model. Both of our simulation studies and two real data analyses show that the proposed method is effective and can cope with high-dimensional omics data.

Abbreviations

AFT:

Accelerated failure time

CCD:

Cyclic coordinate descent

EBIC:

Extended Bayesian Information Criteria

EM:

Expectation-maximization

GEO:

Gene Expression Omnibus

LASSO:

Least absolute shrinkage and selection operator

MCMC:

Markov Chain Monte Carlo

TCGA:

The Cancer Genome Atlas

References

Metzker ML. Sequencing technologies - the next generation. Nat Rev Genet. 2010;11(1):31.

Hu Z, Wu C, Shi Y, Guo H, Zhao X, Yin Z, Yang L, Dai J, Hu L, Tan W. A genome-wide association study identifies two new lung cancer susceptibility loci at 13q12.12 and 22q12.2 in Han Chinese. Nat Genet. 2011;43(8):792–6.

Dong J, Hu Z, Wu C, Guo H, Zhou B, Lv J, Lu D, Chen K, Shi Y, Chu M. Association analyses identify multiple new lung cancer susceptibility loci and their interactions with smoking in the Chinese population. Nat Genet. 2012;44(8):895.

Chen H, Wang C, Conomos MP, Stilp AM, Li Z, Sofer T, Szpiro AA, Chen W, Brehm JM, Celedón JC. Control for population structure and relatedness for binary traits in genetic association studies via logistic mixed models. Am J Hum Genet. 2016;98(4):653–66.

Yang J, Zaitlen NA, Goddard ME, Visscher PM, Price AL. Advantages and pitfalls in the application of mixed-model association methods. Nat Genet. 2014;46(2):100–6.

Guan Y, Stephens M. Bayesian variable selection regression for genome-wide association studies and other large-scale problems. Ann Appl Stat. 2011;5(3):1780–815.

Moser G, Sang HL, Hayes BJ, Goddard ME, Wray NR, Visscher PM. Simultaneous discovery, estimation and prediction analysis of complex traits using a Bayesian mixture model. PLoS Genet. 2015;11(4):e1004969.

Carbonetto P, Stephens M. Scalable variational inference for Bayesian variable selection in regression, and its accuracy in genetic association studies. Bayesian Anal. 2012;7(1):73–107.

Logsdon BA, Carty CL, Reiner AP, Dai JY, Kooperberg C. A novel variational Bayes multiple locus Z-statistic for genome-wide association studies with Bayesian model averaging. Bioinformatics. 2012;28(13):1738.

Duan W, Zhao Y, Wei Y, Yang S, Bai J, Shen S, Du M, Huang L, Hu Z, Chen F. A fast algorithm for Bayesian multi-locus model in genome-wide association studies. Mol Gen Genet. 2017;292(4):923–34.

Keiding N, Andersen PK, Klein JP. The role of frailty models and accelerated failure time models in describing heterogeneity due to omitted covariates. Stat Med. 1997;16(1–3):215.

Simon N, Friedman J, Hastie T, Tibshirani R. Regularization paths for Cox’s proportional hazards model via coordinate descent. J Stat Softw. 2011;39(05):1–13.

Jr HF, Lee KL, Mark DB. Multivariable prognostic models: issues in developing models, evaluating assumptions and adequacy, and measuring and reducing errors. Stat Med. 1996;15(4):361–87.

Purcell S, Neale B, Todd-Brown K, Thomas L, Ferreira MAR, Bender D, Maller J, Sklar P, Bakker PIWD, Daly MJ. PLINK: a tool set for whole-genome association and population-based linkage analyses. Am J Hum Genet. 2007;81(3):559–75.

Asomaning K, Miller DP, Liu G, Wain JC, Lynch TJ, Su L, Christiani DC. Second hand smoke, age of exposure and lung cancer risk. Lung Cancer. 2008;61(1):13.

Machida EO, Brock MV, Hooker CM, Nakayama J, Ishida A, Amano J, Picchi MA, Belinsky SA, Herman JG, Taniguchi S. Hypermethylation of ASC/TMS1 is a sputum marker for late-stage lung cancer. Cancer Res. 2006;66(12):6210.

Zhao Y, Wei Q, Hu L, Chen F, Hu Z, Heist RS, Su L, Amos CI, Shen H, Christiani DC. Polymorphisms in MicroRNAs are associated with survival in non-small cell lung cancer. Cancer Epidemiol Biomarkers Prev. 2014;23(11):2503–11.

Brabender J, Danenberg KD, Metzger R, Schneider PM, Lord RV, Groshen S, Tsao-Wei DD, Park J, Salonga D, Holscher AH, et al. The role of retinoid X receptor messenger RNA expression in curatively resected non-small cell lung cancer. Clin Cancer Res. 2002;8(2):438–43.

He S, Chen CH, Chernichenko N, He S, Bakst RL, Barajas F, Deborde S, Allen PJ, Vakiani E, Yu Z. GFRα1 released by nerves enhances cancer cell perineural invasion through GDNF-RET signaling. Proc Natl Acad Sci U S A. 2014;111(19):E2008.

Hou R, Cao B, Chen Z, Li Y, Ning T, Li C, Xu C, Chen Z. Association of cytotoxic T lymphocyte-associated antigen-4 gene haplotype with the susceptibility to gastric cancer. Mol Biol Rep. 2010;37(1):515–20.

Kim JW, Nam KH, Ahn SH, Park DJ, Kim HH, Kim SH, Chang H, Lee JO, Kim YJ, Lee HS, et al. Prognostic implications of immunosuppressive protein expression in tumors as well as immune cell infiltration within the tumor microenvironment in gastric cancer. Gastric Cancer. 2016;19(1):42–52.

Rakic JM, Maillard C, Jost M, Bajou K, Masson V, Devy L, Lambert V, Foidart JM, Noel A. Role of plasminogen activator-plasmin system in tumor angiogenesis. Cell Mol Life Sci. 2003;60(3):463–73.

Takayama Y, Hattori N, Hamada H, Masuda T, Omori K, Akita S, Iwamoto H, Fujitaka K, Kohno N. Inhibition of PAI-1 limits tumor angiogenesis regardless of angiogenic stimuli in malignant pleural mesothelioma. Cancer Res. 2016;76(11):3285.

Sha N, Tadesse MG, Vannucci M. Bayesian variable selection for the analysis of microarray data with censored outcomes. Bioinformatics. 2006;22(18):2262–8.

Newcombe P, Raza AH, Blows F, Provenzano E, Pharoah P, Caldas C, Richardson S. Weibull regression with Bayesian variable selection to identify prognostic tumour markers of breast cancer survival. Stat Methods Med Res. 2014;26(1):414.

Zhang Z, Sinha S, Maiti T, Shipp E. Bayesian variable selection in the accelerated failure time model with an application to the surveillance, epidemiology, and end results breast cancer data. Stat Methods Med Res. 2016. https://doi.org/10.1177/0962280215626947.

Gao AC, Lou W, Isaacs JT. Enhanced GBX2 expression stimulates growth of human prostate cancer cells via transcriptional up-regulation of the interleukin 6 gene. Clin Cancer Res. 2000;6(2):493–7.

Gao AC, Lou W, Isaacs JT. Down-regulation of homeobox gene GBX2 expression inhibits human prostate cancer clonogenic ability and tumorigenicity. Cancer Res. 1998;58(7):1391.

Nimmrich I, Erdmann S, Melchers U, Chtarbova S, Finke U, Hentsch S, Hoffmann I, Oertel M, Hoffmann W, Müller O. The novel ependymin related gene UCC1 is highly expressed in colorectal tumor cells. Cancer Lett. 2001;165(1):71–9.

Liu Z, Zhang J, Gao Y, Pei L, Zhou J, Gu L, Zhang L, Zhu B, Hattori N, Ji J. Large-scale characterization of DNA methylation changes in human gastric carcinomas with and without metastasis. Clin Cancer Res. 2014;20(17):4598–612.

Godinheymann N, Brabetz S, Murillo MM, Saponaro M, Santos CR, Lobley A, East P, Chakravarty P, Matthews N, Kelly G. Tumour-suppression function of KLF12 through regulation of anoikis. Oncogene. 2015;35(25):3324.

Yu N, Migita T, Hosoda F, Okada N, Gotoh M, Arai Y, Fukushima M, Ohki M, Miyata S, Takeuchi K. Krüppel-like factor 12 plays a significant role in poorly differentiated gastric cancer progression. Int J Cancer. 2009;125(8):1859.

Rozenblum E, Vahteristo P, Sandberg T, Bergthorsson JT, Syrjakoski K, Weaver D, Haraldsson K, Johannsdottir HK, Vehmanen P, Nigam S, et al. A genomic map of a 6-Mb region at 13q21-q22 implicated in cancer development: identification and characterization of candidate genes. Hum Genet. 2002;110(2):111–21.

We thank the participants and staff for their important contributions to this study.

Funding

This study was funded by National Natural Science Foundation of China (81530088 and 81473070 to F.C.), National Key Research and Development Program of China (2016YFE0204900 to F.C.), the National Institutes of Health (CA092824 and CA209414 to D.C.C.), Natural Science Foundation of the Jiangsu Higher Education Institutions of China (14KJA310002 to F.C.), Research and Innovation Project for College Graduates of Jiangsu Province of China (KYLX16_1123 to W.D.), Priority Academic Program Development of Jiangsu Higher Education Institutions (PAPD), and Top-notch Academic Programs Project of Jiangsu Higher Education Institutions (TAPP: PPZY2015A067). Y.W. and R.Z. were partially supported by the Outstanding Young Teachers Training Program of Nanjing Medical University.

Availability of data and materials

The first real dataset (i.e., Harvard lung cancer data) is available upon request. The second real dataset (i.e., TCGA stomach adenocarcinoma expression data) can be found in TCGA database (Web resources). The validation datasets of gene expression are available at GEO database with accession numbers: GSE14210, GSE15459, GES29272, GSE51105, and GSE62254. The summary statistics of the GEO data can be extracted from KM plotter (Web resources) directly.

China International Cooperation Center for Environment and Human Health, Nanjing Medical University, 101 Longmian Avenue, Nanjing, 211166, Jiangsu, China

Weiwei Duan, Ruyang Zhang, Yang Zhao, Sipeng Shen, Yongyue Wei, Feng Chen & David C. Christiani

Joint Laboratory of Health and Environmental Risk Assessment (HERA), Nanjing Medical University School of Public Health / Harvard School of Public Health, 101 Longmian Avenue, Nanjing, 211166, Jiangsu, China

Weiwei Duan, Ruyang Zhang, Yang Zhao, Sipeng Shen, Yongyue Wei, Feng Chen & David C. Christiani

Key Laboratory of Biomedical Big Data of Nanjing Medical University, 101 Longmian Avenue, Nanjing, 211166, Jiangsu, China

WD, RZ, FC, and DCC contributed to the conception and design. WD contributed to the development of methodology. YZ and DCC contributed to the acquisition of data. WD, RZ, YZ, YW, and SS contributed to the analysis and interpretation of data. WD, RZ, and FC contributed to the writing, review, and revision of the manuscript. All authors read and approved the final manuscript.

All procedures performed in studies involving human participants were in accordance with the ethical standards of the institutional ethics committee and with the 1964 Helsinki Declaration and its later amendments or comparable ethical standards.

Consent for publication

Not applicable.

Competing interests

The authors declare that they have no competing interests.

Publisher’s Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Table S1. TPR, FPR and FDR in variable selection with 50 replications (Weibull distribution). Table S2. TPR, FPR and FDR in variable selection with 50 replications (Gamma distribution). Table S3. Computational time (minutes) for application in simulation trials. Table S4. The estimated effects of 14 SNPs by SurvEMVS and classical Weibull regression. Table S5. Demographic and clinical characteristics of STAD patients. Table S6. The estimated effects of gene expression levels selected by SurvEMVS and Cox LASSO with their counterparts in low dimension scenario (i.e., Weibull regession and Cox model, respectively). Figure S1. Pseudocode for implementation of SurvEMVS. Figure S2. Averaged estimated effect (black vertical lines) for each marker over 50 replications under Scenario 2. Figure S3. Averaged estimated effect (black vertical lines) for each marker over 50 replications under Scenario 3. Figure S4. Averaged estimated effect (black vertical lines) for each marker over 50 replications under Scenario 4. Figure S5. Averaged estimated effect (black vertical lines) for each marker over 50 replications under Scenario 5. Figure S6 Averaged estimated effect (black vertical lines) for each marker over 50 replications under Scenario 6. Figure S7. MSE of parameter estimation and AUC of prognosis prediction for Scenarios 3 and 4. Figure S8. MSE of parameter estimation and AUC of prognosis prediction for Scenarios 5 and 6. Figure S9. Kaplan-Meier survival curve of patients with high, moderate, and low risk. (DOCX 9932 kb)

Rights and permissions

Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.

Duan, W., Zhang, R., Zhao, Y. et al. Bayesian variable selection for parametric survival model with applications to cancer omics data.
Hum Genomics12, 49 (2018). https://doi.org/10.1186/s40246-018-0179-x