Bayesian variable selection for parametric survival model with applications to cancer omics data

Background Modeling thousands of markers simultaneously has been of great interest in testing association between genetic biomarkers and disease or disease-related quantitative traits. Recently, an expectation-maximization (EM) approach to Bayesian variable selection (EMVS) facilitating the Bayesian computation was developed for continuous or binary outcome using a fast EM algorithm. However, it is not suitable to the analyses of time-to-event outcome in many public databases such as The Cancer Genome Atlas (TCGA). Results We extended the EMVS to high-dimensional parametric survival regression framework (SurvEMVS). A variant of cyclic coordinate descent (CCD) algorithm was used for efficient iteration in M-step, and the extended Bayesian information criteria (EBIC) was employed to make choice on hyperparameter tuning. We evaluated the performance of SurvEMVS using numeric simulations and illustrated the effectiveness on two real datasets. The results of numerical simulations and two real data analyses show the well performance of SurvEMVS in aspects of accuracy and computation. Some potential markers associated with survival of lung or stomach cancer were identified. Conclusions These results suggest that our model is effective and can cope with high-dimensional omics data. Electronic supplementary material The online version of this article (10.1186/s40246-018-0179-x) contains supplementary material, which is available to authorized users.


Introduction
With the development of high-throughput sequence technology, large-scale omics data are generated rapidly for discovering new biomarkers [1,2]. The public databases such as The Cancer Genome Atlas (TCGA) and Gene Expression Omnibus (GEO) provide great opportunities to understand complex diseases comprehensively on a molecular level [3,4] and subsequently facilitate growing demanding statistical approaches designed to cope with these large-scale data [5]. Analyzing biomarkers one at a time is the most common strategy to detect the underlying causal markers [6,7]. However, this one-by-one method ignores the correlation between biomarkers and needs multiple corrections for controlling false positives. Furthermore, multiple regression is performed increasingly because it is powerful to identify causal markers after the strongest associations have been accounted for [8][9][10]. It can also avoid multiple correction and enable accurate effect estimation [11,12]. Nevertheless, omics data generally have the property of high dimensionality, which makes the classical multiple regression yield unstable parameter estimations with high standard errors. Due to this limitation, least absolute shrinkage and selection operator (LASSO) regression and its variants shrink the effects of noise toward zero while adding a penalty term to the likelihood function. The approach can easily be conducted on a large-scale variable selection analysis [13,14]. The LASSO can also be explained from Bayesian perspectives, namely Bayesian LASSO (BL) [15]. That is, we can generate the LASSO estimators by imposing a Laplace prior on coefficients of explanatory variables. Compared with a frequentist penalty, Bayesian regression is more flexible to induce different shrinkage by specifying various priors.
George and McCulloch [16] proposed a two-component "spike-and-slab" mixture prior, consisting of the "spike" to be either a point mass at zero or a normal distribution with a very narrow variance while a "slab" be a normal distribution with a large variance. Indicator variables are utilized to denote which component each marker belongs to, which is known as Bayesian variable selection (BVS). BVS and its expansions are widely used in genomic studies, including marker detection and disease risk prediction [11,12,17]. But, most Bayesian studies have used a Markov Chain Monte Carlo (MCMC) algorithm to explore the posterior distribution of unknown parameters via numerical approximations in high-dimensional models, which is quite time consuming for getting a stable chain. For example, the BayesR model required~18 h to complete an analysis of a bipolar disorder genome-wide association data with~3800 individuals (80% of whole samples) and~300,000 genotype markers [12]. In order to facilitate Bayesian computation, variational inference [18][19][20] and expectation-maximization (EM) algorithm are commonly used for Bayesian posterior inference [21]. Ročková and George [22] proposed EM variable selection (EMVS) for continuous outcomes to rapidly identify promising high posterior models and parameter estimates. The continuous conjugate spike-and-slab prior adopted by EMVS leads to fast closed form expression for the EM algorithm. Nevertheless, there has been considerable interest in discovering associations between biomarkers and prognosis.
Analyzing time-to-event data, namely survival analysis, plays a very important role in statistics, which arises in many fields, such as medicine, genetics, industrial engineering, sociology, and economics [23][24][25]. Modeling survival data using Cox proportional hazards regression is popular for its robust to the unknown baseline hazard [26]. Alternatively, being well known for parametric survival analysis, accelerated failure time (AFT) model tends to give more precise estimates of interest parameters if the distribution of survival time is chosen correctly, in addition, the parameter estimates from AFT are robust to omitted covariates [27]. Under the scenario of high-dimensional survival analysis, a lot of works have been done usually by adding a penalty term to likelihood. In a Bayesian framework, we usually need to assign a semi-parametric or nonparametric prior processes to the (cumulative) baseline hazard function in a Cox model [28,29], which does not allow us to naturally choose a fully parametric survival model for the subsequent analyses. As a parametric model, the Weibull regression induces a very flexible model since it is a unique parametric model which has both AFT and the proportional hazards properties [30].
In this study, we extended EMVS to parametric survival model (SurvEMVS) with Weibull distribution assumption. A fast EM algorithm was used to obtain posterior modes of interested parameters, in which a variant of the fast cyclic coordinate descent (CCD) method is nested. We used simulation trials to explore performance in comparison with an alternative frequentist variable selection strategy, namely Cox LASSO regression. After that, we applied SurvEMVS to a lung cancer genotype data and a stomach cancer gene expression data. Further details of this work were given below.

Statistical framework
Survival times of n individuals in sample are designed by T i = min(t i , c i ), i = 1, …, n, where t i and c i are lifetime and fixed censoring time for a specific individual i, respectively. The survival outcome from a follow-up study can be conveniently represented by pair of random variables (T i , δ i ), where δ i indicates whether the lifetime t i corresponds to an event (δ i = 1) or is censored (δ i = 0). In this study, we consider right censoring scheme and non-informative censoring mechanism for each individual without note elsewhere. Under parametric framework, f(T i | θ) and S(T i | θ) are defined as probability density and survival function of survival time T i , respectively, parametrized by θ. Let L ¼ Q n i¼1 f ðT i jθÞ δ i SðT i jθÞ ð1−δ i Þ be the likelihood of the parametric model with an i.i.d assumption. Weibull distribution can be fully parametrized by the parameter pair θ = (λ, α), where λ and α are scale and shape parameter, respectively. The density and survival function of Weibull distribution T are f(T| λ, α) = αT α − 1 λ exp(−λT α ) and S(T| λ, α) = exp(−λT α ), respectively. Typically in a regression model, the scale parameter is defined as λ = 1/ exp(Zu + Xβ), where X n × p represents a column-scaled matrix of tumor biomarkers such as gene expression, genotype or DNA methylation, β is a p × 1vector of marker effects, Z n × (1 + q) is a covariates matrix of intercept and q clinical variables such as gender, age, and tumor histological grade, u is a (q + 1)-dimensional effect vector of Z n × (1 + q) . With the condition of p > n(i.e., high dimension), a penalize term is necessary for inducing a sparse solution of β. Under the Bayesian framework, we want to impose a well-known spike-and-slab prior on each β j to facilitate Bayesian variable selection [16]. A vector of binary latent variables γ = (γ 1 , …, γ p ) T , γ j ∈ {0, 1} are introduced as indicator variables, where γ j = 1 donates that jth explanatory variable is to be included in the model. Conditional on γ, the continuous prior being assigned to β is, where D σ 2 ;γ ¼ σ 2 diagða 1 ; …; a p Þand a j = (1 − γ j )υ 0 + γ j υ 1 . As suggested by [22], we set these hyperparameters υ 0 and υ 1 to be small and large positive values, respectively. σ 2 is a common variance parameter with an inverse gamma prior π(σ 2 ) = IG(ν/2, νη/2). The binomial prior is chosen for the indicator variable γ, i.e., πðγjθÞ ¼ Q p j¼1 θ γ j ð1−θÞ 1−γ j , where θ is a hyperparameter and we impose a Beta(a, b) prior on it. The constant prior is chosen for α and u, i.e., π(α) ∝ 1 and π(u) ∝ 1. Generally speaking, MCMC is usually used for simulating the posterior distribution of unknown parameters β, γ, θ, σ 2 , α, and u. But here we employ an EM algorithm to seek the posterior mode of each parameter because this algorithm provides substantial computational advantage, especially for high-dimensional data analysis. Concerning about the unknown status γ for variables, we replace this "missing data" by its conditional expectation given the current estimates for other parameters and observed data (E-step) [31]. Then, an M-step is followed by maximizing the expected complete-data log-posterior with respect to β, θ, σ 2 , α, and u. As a result of iterations between the E-step and M-step, each estimator will converge toward a local maximum of the posterior distribution. More specifically, the objective function can be expressed as: Qðβ; θ; σ 2 ; α; ujT ; δÞ ¼ E γjÁ ½log πðβ; γ; θ; σ 2 ; α; ujT ; δÞ þlog πðθÞ þ log πðσ 2 Þ þ log πðαÞ þ log πðuÞ ð2Þ where E γ | ⋅ (⋅) denotes the expectation with regard to γ given estimations in current iteration. Furthermore, E γjÁ ½ logπðβjσ 2 ; γÞ 1−θ Þ þ p logð1−θÞ , both C and C 1 are constants. Next, our EM algorithm for Bayesian Weibull regression proceeds as follows.
1) Differentiating Q(⋅| T, δ) with regard to β needs to solve the following expression where D * 1/2 is the square root of the p × p diagonal matrix D Ã ¼ diagðd Ã 1 ; …; d Ã p Þ . It can be shown that the formula (5) above for Weibull model is convex, and a wide variety of numerical optimization algorithms can be applicable. For generalizing our algorithm to highdimensional data, the commonly used multidimensional Newton-Raphson method is not recommended because of large memory requirements and intensive computations. In this research, we employ the cyclic coordinate descent (CCD) algorithm due to its efficiency and ease of implementation, which makes one-dimensional optimization available [32]. Briefly, the CCD minimizes the objective function with regard to β j , holding all other variables constant. Similar to the combined local and global (CLG) algorithm of [33,34], we modify the update for β j in two ways. First, in order to avoid big steps in Newton iteration, we specify a positive value Δ j for the jth maker to restrict the maximum change of β j between two adjacent iterations. Second, for onedimensional optimization, we update β j only once instead of multiple iterations till convergence before updating β j + 1 . Moreover, considering that it is unnecessary to take much time to update the amount of possible neural markers (that is, their minuscule effects contribute less to outcome, with no need for accurate estimations) when p is large, we partially update those markers with "large" effect (|β j |is greater than a threshold β ′ , e.g., 1E−08) after a small number (k ′ ) of full iterations for all makers, which speeds up computation considerably.
2) Differentiating Q(⋅| T, δ) with regard to u j , j = 1, …, q + 1one-by-one using one-dimensional optimization reveals the following form with all other parameters being fixed at current estimates.
3) Differentiating Q(⋅| T, δ) with regard to α derives the following form 4) For σ 2 , we have (4) Iterations between the E-step and M-step are in progress. SurvEMVS will be terminated if the convergence criterion is satisfied as follows: where Xβ (k) is predictive vector at kth iteration, and ξ is a small value (say 10 − 4 ). Summarizing, Additional file 1: Figure S1 presents pseudocode for our implementation of SurvEMVS.

Simulation studies
In this section, we used simulations to validate the performance of proposed SurvEMVS. Cox LASSO model [35] was considered as a benchmark for comparison. The effect sizes and directions of Cox LASSO estimates were adjusted for consistency with our parametric model, which made the direct comparison between two methods. For each simulation scenario, we replicated the simulation 50 times and then summarized these results.
Marker values were simulated from a multivariate normal distribution N 50 (0, ∑), where ∑ is a variance-covariance symmetric matrix with ∑ jj = 1 and ∑ ij = 0.6 |i − j| , i ≠ j. For large p markers, we repeatedly sampled from the above distribution and then combined them by column. Thus, we obtained an n × p matrix with multiple independent blocks and 50 makers in each block were correlated. Assuming λ i ¼ expð− P p j x ij β j Þ, we simulated survival time t i for each subject from an exponential distribution t i~e xponential(λ i ) , and random censoring time c i from a uniform distribution c i~U (0, K), and chose K such that on average 40% subjects were right censored. The observed censored survival time T i was generated by min(t i , c i ). Furthermore, additional simulations where survival times are generated from Weibull distribution (with α = 2) were used to show effectiveness of our method. As is well-known, the true distribution of survival time in a real data is unclear and does not coincide with the Weibull assumption exactly. Therefore, we simulated a vector of Gamma-distributed survival time on purpose, thus assigning weakness setting to SurvEMVS. The shape and rate parameters of gamma distribution were set to be 0.8 and 1/λ i , respectively. We randomly sampled six causal markers and set coefficients to be {0.2, − 0.2, 0.3, − 0.3, 0.4, − 0.4}. Sample size (n) was set to be 500, and number of makers is set to be 1000 or 5000. Moreover, we generated 100 samples as a test dataset in each replication to appreciate the predictive performance of two approaches. The detailed simulation scenarios were summarized in Table 1.

Hyperparameters tuning and performance metric
The performance of SurvEMVS depended on the hyperparameters υ 0 and υ 1 , which made us be interested in a running model based on more than one combination of υ 0 and υ 1 . The speed of the EM algorithm made it feasible to consider a sequence of models as (υ 0 , υ 1 ) varied over many candidates, from which we could select an optimal combination based on some criteria. According to [22], large υ 1 and small υ 0 could accommodate all plausible β j . In order to acquire sparse solution in large p data, we set a sequence of candidates {1/10p, 1/5p, 1/2p, 1/p, 2/p, 5/p, 10/p, 0.05} (0.05 was reserved if p > 500) to υ 0 unless otherwise noted, which was dynamic with p. Three candidates from {10, 100, 500} were assigned to υ 1 . Therefore, there were 24 combinations for hyperparameter tuning. The procedure of fitting Cox LASSO by widely used R package glmnet gave the similar parameter tuning as we did here.
On account of making parameter selection from the 24 combinations of υ 0 and υ 1 , we needed a metric to measure the performance of the fitted models. Cross-validation partial likelihood (CVPL) was generally used for parameter selection in Cox model [35,36], while some employed cross-validation score in parametric survival model [36]. Subsequently, one candidate of hyperparameters was chosen to minimize one of these metrics. However, these metrics involving random cross-validation (CV) make the hyperparameter selection not stable as well as time demanding, and depend on the folds. Other criteria without CV like the Akaike information criterion (AIC), the Bayesian information criterion (BIC), and the generalized cross-validation (GCV) tend to select many spurious variables especially in high-dimensional problem [37,38]. In this paper, we considered a metric, namely extended Bayesian Information Criteria (EBIC), which was utilized for model selection at first [39]. The constant prior behind the BIC assigns high probabilities to the models with a larger number of markers, which was apparently unreasonable and strongly against the principle of parsimony. The EBIC was proposed to take away this disadvantage of the BIC. The EBIC is defined as, where p m is the number of selected variables in a fitted model. The EBIC with τ = 0 reduces to the original BIC.
Minimizing the EBIC with larger τ will get a much more parsimonious model. Thus, EBIC1, EBIC2, and EBIC3 were served as metrics for hyperparameter tuning with regard to τ = 0, 0.5, 1, respectively. In simulation studies, the optimal SurvEMVS model selected by the EBIC was compared with Cox LASSO in aspects with variable selection, effect estimation, and model prediction. We utilized false positive rate (FPR), true positive rate (TPR), and false discovery rate (FDR) as evaluation indicators for variable selection. Note that Pðγ j ¼ 1jβ j ;θ;σ 2 Þ ≥ 0:5 meant the jth maker was selected. Mean square error (MSE) denoted by P p j¼1 ðβ j −β j Þ 2 =p was used to appreciate effect estimations for makers. The predictive accuracy of the fitted model be applied to test dataset was evaluated by Harrell's c statistic [40], as known as the area under the ROC curve (AUC).

Implementation
We considered a = b = 1 in beta prior for θ which yielded a uniform distribution. As noted in [22], we had inverse gamma prior for σ 2 with ν = η = 1 to make this prior relatively non-influential. We ran all analyses using R software (v3.41) on a machine with Intel® Xeon® X5690 3.46-GHz processors. Cox LASSO model was implemented with the glmnet package in R. Tenfold cross-validation was used to choose an optimal penalization parameter λ lasso in glmnet, which determined an optimal Cox LASSO model. Two LASSO models selected by "minimum cvm" and "1 standard error" of λ lasso were considered as LASSO.min and LASSO.se, respectively. We employed the PLINK tool for quality control of genotype data [41].

Simulation studies Iteration and tuning plot
By analogy with LASSO solution path plot that shows the estimates change with an increasing penalty parameter, here we want to investigate the impact of parameters tuning for υ 0 . Figure 1a displays a solution path of SurvEMVS under Scenario 1 with υ 1 =10. Large effects (red dots) will be firstly incorporated into the fitting model with υ 0 increasing, and a remarkable separation between the positive and negative effects appears when υ 0 is larger than 0.03. However, the estimated effects for zeros inflate because of less shrinkage at large υ 0 . We also present the iteration plot to detect the convergence property of our approach. From Fig. 1b, SurvEMVS makes a fast convergence to posterior estimates only in several steps. Moreover, neutral effects (black dot line) get close to zero in the fourth iteration, which means that we can concentrate iterations on large effects after a number of full iterations for all makers and further accelerate the posterior computation.

Variable selection
In SurvEMVS, conditional posterior probabilities are used to guide variable selection.  Table S1, and the results of Scenarios 5 and 6 with gamma distribution in Additional file 1: Table S2. Each of them presents a similar trend with scenarios of exponential distribution. There is no apparent difference between the results of exponential, Weibull, and gamma distribution.

Prediction accuracy
In order to appreciate prediction accuracy of the fitted models, we summarize AUC results by box plot in the right panels of Fig. 3 and Additional file 1: Figures S7 and  S8. Generally speaking, the EBIC2 model performs best under our simulation settings, while the LASSO.min presents similar prediction but with larger variance. In accordance with the conclusion of "Parametric estimation", prediction accuracy of the BIC model descends with p varying from 1000 to 5000. Moreover, SurvEMVS with exponential or Weibull settings gain slightly larger AUC than those with the gamma settings. Furthermore, the  In summary, the EBIC2 model works best under almost all scenarios in terms of variable selection, parameter estimation, and prediction accuracy. Besides, Additional file 1: Table S3 shows the time consumption in different scenarios. In comparison with Cox LASSO, SurvEMVS takes more computational time but is still fast enough. Note that time used in Additional file 1: Table S3 is for reference only, as it varies depending on context, such as convergence criterion, programming language, computer performance, and algorithm optimization.

Real data analysis Harvard lung cancer data
This dataset from The Harvard Lung Cancer Susceptibility Study GWAS includes 526 late-stage (III and IV) patients with non-small cell lung cancer (NSCLC) recruited from Massachusetts General Hospital (Boston, MA). More details about participants' recruitment have been described previously [42]. We note that it is appropriate to assess an association study restricted to late-stage cancer because some gene functions work primarily in the late stage and are not present in preinvasive stages of cancer [43]. DNA was genotyped using Illumina 610K Quad chip. After quality control protocol described by [44], there were 512,885 SNPs remaining. Those patients with more than 5 years of overall survival were considered as right censored, and finally, the censor rate was equal to 20.27%. We assumed an additive genetic model and imputed missing genotypes by mean of each SNP. We adjusted for age, sex, smoking status, cell type, stage, surgery (yes vs. no), and the top four principal components in the subsequent analysis. Considering that the number of SNPs related to NSCLC survival was not expected to be too large, we filtered the SNPs by a commonly used single locus Cox model. This filter yielded a de-noising of outcome so that the subsequent analyses became more efficient. By setting a threshold of P value less than 5E−3, 3911 SNPs were left for the subsequent analysis.
The EBIC2 was used to choose an optimal model from the candidates we noted above. Finally, 14 SNPs were detected by the proposed EBIC2 model with υ 1 = 100 and υ 0 =2.56E-03. We further analyzed and annotated these SNPs by TCGA (with an online tool UALCAN [45]), KEGG pathway, and PubMed database. Interestingly, seven of all listed in Table 3 may have potential functional influence on carcinogenesis or prognosis. For example, rs1506943 is located at~33 kb downstream of RXRG (retinoid X receptor gamma) on Chromosome 1. This gene is expressed at significantly lower levels in TCGA-LUAD (lung adenocarcinoma)  [46]. Additionally, it also participates in non-small cell lung cancer pathways and other cancer-related pathways. rs2074986 is located at DNase I hypersensitive site (DHS) of GFRA1 on Chromosome 10. GFRA1 released by nerves enhances cancer cell perineural invasion [47], whose expression is reduced in tumor samples of TCGA-LUAD and TCGA-LUSC compared with normal samples. In addition, the high expression level in TCGA-LUAD tumor samples may contribute to good prognosis (P = 0.0025). We also provided the estimated effects of 14 SNPs by SurvEMVS in Additional file 1: Table S4 along with classical Weibull regression estimations for them (that is, only 14 SNPs and clinical variables are fitted by Weibull regression). In this way, SurvEMVS applied in high-dimensional data can also generate approximate estimates with Weibull regression. We plotted Kaplan-Meier (KM) survival curve of patients with high, moderate, and low risk defined by tertiles of risk scores − P p j x ij β j (Additional file 1: Figure S9). The log-rank test was used to compare the survival estimates among the three groups, and the results show that higher prognostic risk score is significantly associated with shorter survival (P < 1E−16). However, Cox LASSO models did not identify any SNP.

TCGA stomach adenocarcinoma (TCGA-STAD) expression data
We accessed this RNA-seq transcriptomic data from TCGA database by R/Bioconductor package TCGAbiolinks [48], which was used to do subsequent quality control, normalization, differential expression analysis    Table S5. Similar to the first real data analysis, we built final models only using filtered expression markers by DEA rather than all markers. There were 2711 markers left passing a selection threshold defined at fold change (FC) > 2 and testing FDR < 0.01. Due to the relative high missing rate, we made use of multivariate imputation by chained equations (MICE) to deal with missing clinical covariates [49]. After removing the patients with missing or zero survival time, we apply SurvEMVS and Cox LASSO to a matrix with 390 rows and 2716 columns (including 5 clinical covariates listed in Additional file 1: Table S5). We used the EBIC2 to select best model from the candidates above. Three markers were identified by the proposed model (υ 0 = 3.69E-03 and υ 1 = 500) including CTLA4, NACAD, and SERPINE1, mapped to 2q33.2, 7p13, and 7q22.1, respectively. Meanwhile, the LASSO.min detected ALG11, GAMT, and PLCXD3 in addition to the overlap genes CTLA4 and SERPINE1. Estimated effects of the selected markers in both two models are provided (Additional file 1: Table S6) along with classical Weibull regression and Cox model estimations for them (like the first application). We can see that many effects estimated by Cox LASSO are small while SurvEMVS presents similar estimations with its counterpart in the low-dimensional Weibull regression. According to tertiles of risk scores, we equally divided the patients into high-, moderate-, and low-risk groups. Figure 4 presents the KM curves of SurvEMVS (left panel) and Cox LASSO (right panel), respectively, and both of them show a higher risk score which is significantly associated with shorter survival (Log-rank test P = 3.87E−07 for SurvEMVS and P = 1.02E −04 for Cox LASSO).
Furthermore, in order to validate the results, we used external data from GEO database. Five datasets (GSE14210, GSE15459, GES29272, GSE51105, and GSE62254) are included for their proper sample sizes. Table 4 presents the estimated hazard ratios along with 95% confidence intervals (CI) and P value extracted from an online tool (KM plotter, Web resource). We also show the combined results using meta-analysis. As a result, CTLA4 and NACAD were successfully validated and have the same direction of the effects on prognosis with those estimated by SurvEMVS in TCGA-STAD data. Interestingly, CTLA4 encodes CTLA-4 (cytotoxic T-lymphocyte-associated protein 4) which inhibits T cell activation and downregulates immune response. Antagonistic antibody against CTLA has become a targeted drug (Ipilimumab, approved by FDA for melanoma in Fig. 4 Kaplan-Meier survival curve of patients with high, moderate, and low risk. P value is calculated using log-rank test  The contents of each cell represent estimated hazard ratio (HR), 95%CI of HR, and hypothesis testing P value of HR, and in each dataset, the samples are categorized into two groups using a best cutoff of expression level. Combined results are derived by meta-analysis with random effect model a Expression data of the gene is not available in the corresponding GEO database 2011), which is the first approved and popular immune checkpoint blockade therapy [50,51]. Some research also indicates CTLA4 may have influence on gastric cancer germination and progression [52,53]. GAMT and PLCXD3 detected by Cox LASSO display strong heterogeneity on effects (81.8% and 82.1%, respectively) with five datasets and no significant association in the combined analysis. Note that ALG11 is not identified in the five data. Although we acquire a negative consequence of SERPINE1 (also known as plasminogen activator inhibitor-1, PAI-1) in validation analysis; interestingly, it has been widely studied and is well known for participating p53 signaling pathway and playing a crucial role in tumor progression and angiogenesis [54,55].

Discussion
High-throughput sequencing technology, which has become cheaper, promotes the development of precision medicine [56]. Picking up underlying markers affecting disease prognosis from thousands of candidates calls for high-dimensional survival model besides generally used one-by-one Cox proportional hazards model. In this paper, we propose a parametric survival counterpart of EMVS, namely SurvEMVS, which employs a fast EM algorithm to fit all candidate biomarkers simultaneously and to explore posterior distribution of the unknown parameters, consequently to identify important signals and make effect estimations. Much work has concentrated on developing new Bayesian methods on high-dimensional parametric survival model in application to medical or genetic data. For example, Sha et al. built AFT models with less common distribution (i.e., log-normal and log-t) for microarray data using a discrete spike-and-slab prior, where a time-consuming MCMC procedure was employed to simulate posterior distribution [57]. Mittal et al. developed four parametric models, i.e., exponential, Weibull, log-logistic, and log-normal distribution, by assigning Gaussian and Laplace prior to effects, where maximum a posterior (MAP) was used to acquire posterior modes of effects; however, this work lacked numerical study to evaluate their models, as well as discussion on variables selected in real data analysis from medical reasonability [58]. Newcombe et al. imposed a discrete spike-and-slab prior on coefficients of Weibull regression, where reversible jump MCMC being used for the posterior computation is ineffective. Moreover, it is unrepresentative of their application to a low-dimensional breast cancer data [59].
SurvEMVS imposes a continuous spike-and-slab mixture prior on effects to facilitate the separation of different effect sizes. This two-component prior can provide an indicator vector to guide variable selection whereby using a local version of the median probability model [22,60]. In contrast, the EM algorithm or variational approximation employing one-component prior such as Laplace or t distribution does not involve variable inclusion indicators, and consequently makes variable selection indirectly [20]. Due to the unavailable closed form of maximization about maker effects in M-step, a variant of CCD algorithm serves to be feasible for obtaining approximate solutions. Consequently, our EM steps incorporating this fast CCD make the fitting much effective. One focus of this study is how to choose an optimal model from the hyperparameter tuning process. The EBIC, an extension of the BIC, is adopted with reason as follows: in comparison with the EBIC, the normally used AIC, BIC, or GCV would generate more spurious signal when applied to high-dimensional data, while CV-based metrics demand more computation and are unstable since the folds in CV are selected randomly. This is, to our knowledge, the first application of EBIC to the high-dimensional parametric survival analysis.
Over a range of simulation scenarios, our method with EBIC2 generally performs better than Cox LASSO in variable selection, parameter estimation, and prediction. In contrast, owing to imposing a single penalty on all effects, Cox LASSO yields high biased estimators. Our simulations also show the EBIC is appropriate for model selection, while the BIC (i.e., EBIC1) perform worst in situation with very large number of markers. For p > n problem in omics data, we recommend τ = 0.5 for EBIC in multi-stage study because it offers a good trade-off between the well-controlled FDR and the TPR and provides more opportunity to new findings. However, if one is strict to control false discovery, τ = 1, is recommended. Subsequently, we conducted two real data applications. In the first study with a lung cancer genotype data, 14 SNPs were detected by SurvEMVS, and further validation analyses using external data or function annotation resulted in 7 outstanding SNPs. In order to widen the application range of SurvEMVS considerably, we utilized a stomach cancer expression data in the second study. Expression levels of three genes are associated with cancer prognosis, and two of them are validated by extra GEO datasets along with one (namely SERPINE1) involved in tumor progression. The identification of well-known CTLA4 illustrates the availability of Sur-vEMVS. However, further functional experiments are needed for evidence of biologic plausibility of those identified markers.
Although we did not directly compare our EM algorithm with its MCMC counterpart, the speed advantage is apparent, according to the results of our EM algorithm under each optimal model converge needing 174 and 238 iterations in the two real data applications, which is much less than the chain length being set up for MCMC (usually > 10,000 for stable estimation). The model thus resembles a distinct iteration increase of real data application relative to a simulation study. We can explain that ideal conditions (e.g., sparse structure with independent markers) along with strong shrinkage of candidate hyperparameters (producing a parsimonious model) favor rapid convergence, but generally are not available under real data applications, which leads to more iterations demanding (but still fast enough) not only for SurvEMVS but also for other model [20].
However, we acknowledge that there are several limitations of the present study. First, SurvEMVS incorporating the EM, which seeks for a posterior mode rather than a whole posterior distribution of the parameter, cannot provide uncertain measure for estimators. We can directly disentangle this disadvantage by the bootstrap method, but have to bear expensive computations. Actually, another compromise may be adopted: the estimates of Sur-vEMVS can be considered as initial values of a following MCMC algorithm, which makes the MCMC procedure avoid the burn-in stage and finally yield fast and accurate estimators with uncertain measurement. Second, the most worrying thing of the parametric model is a situation of going against the parametric assumption for survival distribution. We show that SurvEMVS is robust for the status with moderately deviating from the Weibull premise. However, we believe that SurvEMVS will be less effective if the real survival time distinctly violates the Weibull distribution. We can bypass this limitation using a non-parametric AFT model like in [61], in which a Dirichlet process is used to make the model robust over a wider range of unknown baseline hazard. In addition, a lot of new directions for methodological work will arise from the current study. One obvious extension to our method will consider multivariate "g-priors" to reflect the effect correlations within the correlated markers [62]. Another interesting extension will involve introducing a newly developed spike-and-slab Laplace prior [63]. Going forward, the meaningful extension of SurvEMVS will integrate functional annotations or multi-omics data to powerfully mine association signals in future work.

Conclusions
We present a new implementation of the EM algorithm for Bayesian variable selection under a Weibull survival model. Both of our simulation studies and two real data analyses show that the proposed method is effective and can cope with high-dimensional omics data.

Additional file
Additional file 1: Table S1. TPR, FPR and FDR in variable selection with 50 replications (Weibull distribution). Table S2. TPR, FPR and FDR in variable selection with 50 replications (Gamma distribution). Table S3. Computational time (minutes) for application in simulation trials. Table S4. The estimated effects of 14 SNPs by SurvEMVS and classical Weibull regression. Table S5. Demographic and clinical characteristics of STAD patients. Table S6. The estimated effects of gene expression levels selected by SurvEMVS and Cox LASSO with their counterparts in low dimension scenario (i.e., Weibull regession and Cox model, respectively). Figure S1. Pseudocode for implementation of SurvEMVS. Figure S2. Averaged estimated effect (black vertical lines) for each marker over 50 replications under Scenario 2. Figure S3. Averaged estimated effect (black vertical lines) for each marker over 50 replications under Scenario 3. Figure S4. Averaged estimated effect (black vertical lines) for each marker over 50 replications under Scenario 4. Figure S5. Averaged estimated effect (black vertical lines) for each marker over 50 replications under Scenario 5. Figure S6 Averaged estimated effect (black vertical lines) for each marker over 50 replications under Scenario 6. Figure S7. MSE of parameter estimation and AUC of prognosis prediction for Scenarios 3 and 4. Figure S8. MSE of parameter estimation and AUC of prognosis prediction for Scenarios 5 and 6. Figure S9.