Bayesian-frequentist hybrid inference framework for single cell RNA-seq analyses

Background Single cell RNA sequencing technology (scRNA-seq) has been proven useful in understanding cell-specific disease mechanisms. However, identifying genes of interest remains a key challenge. Pseudo-bulk methods that pool scRNA-seq counts in the same biological replicates have been commonly used to identify differentially expressed genes. However, such methods may lack power due to the limited sample size of scRNA-seq datasets, which can be prohibitively expensive. Results Motivated by this, we proposed to use the Bayesian-frequentist hybrid (BFH) framework to increase the power and we showed in simulated scenario, the proposed BFH would be an optimal method when compared with other popular single cell differential expression methods if both FDR and power were considered. As an example, the method was applied to an idiopathic pulmonary fibrosis (IPF) case study. Conclusion In our IPF example, we demonstrated that with a proper informative prior, the BFH approach identified more genes of interest. Furthermore, these genes were reasonable based on the current knowledge of IPF. Thus, the BFH offers a unique and flexible framework for future scRNA-seq analyses.


Background
Single cell RNA sequencing (scRNA-seq) is a powerful sequencing technology that allows for the profiling of gene expression in individual cells.Traditional bulk RNA sequencing technologies measure the average expression level of all cells in the population, which mask the uniqueness of each cell.In contrast, isolation of cells is an important step in scRNA-seq.It enables the identification of different cell types within complex tissues [36].As demonstrated in Keren-Shaul et al. 's research, by analyzing immune cell populations in mouse brains, they discovered a novel microglia type associated with neurodegenerative diseases using scRNA-seq [21].
scRNA-seq has the advantage in processing thousands or even millions of single cells simultaneously [48] and has extensive applications across different fields of biology and medical research.By comparing the gene expression level between patients and healthy controls, scRNA-seq can provide important insights into the disease associated genes and pathways.In drug discovery area, it has become an essential tool to identify novel drug targets and to test the efficacy of drugs on specific cell types.For instance, in the study by Wu et al. [42] on diabetic kidney disease (DKD), they generated single cell data from nearly 1 million cells and analyzed the response of a murine DKD model to five treatment approaches.They found that different medications affected different cell types, and combination therapies achieved better outcomes in rescuing DKD-associated transcriptional changes.
One important question in analyzing scRNA-seq data is the identification of differentially expressed genes (DEG) between groups.Compared to the gene expression data generated from other sequencing technologies, scRNA-seq data have some unique features including overdispersion, sparsity, high proportion of zeros due to dropout events (i.e., scRNA-seq data only captures a small fraction of the transcriptome of each cell), and the hierarchical structure embedded in the data [10].Early scRNA-seq studies often collect many cells from one or a few individuals.With the rapid advancement in the technology, scientists have started to collect single cell data from multiple individuals.In a multi-individual, multicondition experiment, other than cell-to-cell variation within each individual, heterogeneity also exists among different conditions, individuals and across different cell types.Those distinctive challenges need to be considered when we explore DEG in scRNA-seq data.
With more and more gene expression data becoming publicly available, many approaches and tools for the differential expression (DE) analysis have been developed for scRNA-seq data.For example, ZINB-WaVE [32] and ZingeR (Van den [37]) assume the expression counts follow a zero-inflated negative binomial (ZINB) distribution and apply Expectation-Maximization (EM) algorithms to estimate model parameters.In contrast, SCDE [22] models the observed abundance using a mixture of the Poisson (dropout component) and negative binomial (amplification component) distribution.These approaches require several distributional assumptions which may fail to be satisfied by real data.MAST [12] applied a hurdle model to simultaneously model the expression rate and mean expression values for a specific gene, then DE testing is performed between two cell populations using likelihood ratio test statistic.Nonparametric approaches were also applied on analyzing scRNA-seq data, such as Wilcoxon signed rank test and ROSeq [16], both of which use test statistics based on ranks.In summary, many single-cell-specific DE methods which apply different strategies, have been developed in recent years.However, some existing approaches are inappropriate for individual level differential expression testing (such as comparison between patients and healthy controls), as the sampling units for these approaches are cells, not individuals [47].Failing to account for the intrinsic variability of individuals causes a systematic underestimation of the variance of gene expression, compromising the ability to generate biologically accurate results.
Pseudo-bulk methods, which pool the scRNA-seq counts in the same biological replicate, have been developed to address this variability.Squair et al. [34] evaluated the performance across fourteen different DE methods using eighteen datasets and found pseudo-bulk methods outperform other cell-level based DE methods in scRNA-seq data.Murphy and Skene [29] also recommended the use of pseudo-bulk approaches after the simulation analysis from multiple scenarios.Biased inference and highly inflated type 1 error rates were observed when scientists assume cells from the same individual are statistically independent.Zimmerman, Espeland, and Langefeld [49] proposed a generalized linear mixed model that incorporates a random effect for individual, to address the correlation structure from cells within an individual.NEBULA [20] is an efficient negative binomial mixed model accounting for both individual-level and cell-level overdispersions.Another method IDEAS [47] captures the gene expression profile in each individual by a probability distribution and then compares such distributions across two groups of individuals.
Regardless, pseudo-bulk methods could still lack power to detect genes of interest due to the limited sample size.To overcome these limitations, we propose to use a Bayesian-frequentist hybrid (BFH) inference method to analyze the scRNA-seq data at the individual level.The BFH theoretical framework was originally proposed by Yuan [45] and the computation framework based on the EM algorithm and Monte-Carlo Markov Chain was proposed by Han et al. [19].In BFH, part of the model parameters is frequentist, and others are Bayesian.The goal of BFH is to obtain estimation of both types of parameters and quantify the variation in the estimation.BFH is achieved by maximizing the likelihood function given the Bayesian parameters and simultaneously minimizing the posterior expected loss function given the frequentist parameters.We extended the work of Han et al. [19] using a linear regression model based on normal distribution, where both the frequentist and Bayesian estimators have tractable analytic forms.We also derive the estimation error (or standard error) of the frequentist and Bayesian parameters.With a point estimate and a standard error of an estimator, we can construct confidence intervals of the coefficients, which can also be used to test whether predictors (such as disease group) are significantly associated with gene expression.

Methods
The hybrid inference in existing literature BFH inference is designed for models that have both frequentist and Bayesian parameters [45].Suppose the frequentist and Bayesian parameters are θ A and θ B , respectively, the data is Y , and the prior for θ B is π(θ B ) .Given a decision d(Y ) , a loss function W (d(Y ), θ B ) , and the distribution likelihood f (Y |θ A , θ B ) , the hybrid esti- mators of θ A and θ B are where inf and sup were taken in the space of d(Y ) and θ A respectively so that θB minimizes the posterior risk given θA and θA maximizes the likelihood function given θB .The frequentist parameter θA is defined (and can be numeri- cally calculated) as integration of the loss function over the posterior distribution.Yuan [45] proved that the hybrid estimator is a consistent estimator, and the standard error of the hybrid estimators can converge to that of the frequentist estimators.As a result, the variancecovariance matrix can be quantified using Fisher information matrix.Han et al. [19] developed an EM computational algorithm to compute θA , θB for any loss function ensuring that the hybrid inference is applicable to general practical problems and different data settings.Han et al. [19] demonstrated, in extensive simulation studies, that the hybrid inference based on the EM algorithm can outperform Bayesian inference and frequentist inference.In this paper we adopt the EM algorithm in Han et al. [19] to make inference.Data, statistical models, and more details about the computation are given in section "Introduction of frequentist, Bayesian, and hybrid inference in linear regression with conjugate priors".

Single-cell RNA sequencing methods comparison using semi-synthetic dataset
Motivation of semi-synthetic data We employed semi-synthetic data derived from actual single-cell RNA sequencing (scRNA-seq) data to assess the power and false discovery rate (FDR) of our proposed method in comparison to other widely used approaches.Inspired by Li et al. 's work [26], where semi-synthetic data was employed to evaluate bulk RNA-seq differential expression (DE) methods, we sought to extend this approach to scRNA-seq.Traditional simulated datasets often struggle to accurately capture the biological signals and intricate correlation structures present in real datasets, leading to challenges in maintaining cellular population heterogeneity [8].Consequently, analyses based on diverse simulations and packages may yield conflicting conclusions.For instance, Zimmerman et al. [49] observed superior Type I error rate control and power in mixed models compared to pseudo-bulk methods using simulated data.However, Murphy et al., in a different simulation setup, found that pseudo-bulk methods exhibited the lowest Type II error rate among all tested methods, with equal Type I error rates.Recognizing these discrepancies, we advocate for a more realistic evaluation procedure for DE methods.

Semi-synthetic data source and data generation:
HypoMap, a compilation of mouse hypothalamus singlecell RNA sequencing (scRNA-seq) data sourced from 18 publications [35], encompasses 100 normal chow mice, yielding a dataset containing190,710 neuron cells.Given the substantial number of subjects and cells within this dataset, it serves as a valuable resource for generating multiple synthetic datasets.As a subset, 55 mice were identified with a minimum of 1,000 cells each, constituting a total of 170,874 neuron cells.Based on this subset, we derived our semi-synthetic datasets.
In our scRNA-seq semi-synthetic scenario, we randomly selected 20 out of 55 mice.Among these, 10 were arbitrarily assigned to the 'disease' group, while the remaining mice served as the 'normal chow' group.It's important to note that in the original data, the 'disease' group was under normal chow conditions.To achieve around 1,000 cells per mouse, we sampled a specific number of cells from each mouse using a Poisson distribution with a mean of 1,000, ensuring an average of 1,000 cells across the 20 mice.
For the generation of true positives (true differentially expressed genes or true DE genes), we initially focused on genes expressed in at least 10% of the cells, amounting to approximately 8,000 genes.Subsequently, we randomly selected 5% of these genes from the 'disease' group and an artificial fixed effect was introduced to these genes by multiplying the counts under the 'disease' condition by a constant of 2. Consequently, these genes represent true positives, while the remaining genes serve as true negatives (non-DE genes).This process was iterated 100 times to generate 100 semi-synthetic datasets.
Differential expression method selection We carefully selected representative approaches from three domains of previously proposed Differential Expression (DE) methods: mixed models, pseudo-bulk methods, and single-cell methods.Our choice for the mixed model approach was NEB-ULA [20,27], edgeR [33], and limma-voom [25].

Lungmap dataset
The Lungmap dataset used in this study was from a published human lung tissue study [39].The cells were clustered by Seurat v3 [7] and annotated to 31 cell types based on canonical lineage-defining markers.Lungmap dataset served as the reference dataset for Hierarchical XGBoost [9] algorithm to obtain the probability for a cell being an alveolar macrophage cell.In addition to output of probabilities indicating the likelihood of a candidate cell belonging to each individual cell type, HierXGB offers an additional capability.HierXGB can provide cell identity directly using a naive Bayesian approach, assigning the cell type based on the maximum of such predicted probability.
IPF scRNA-seq dataset The idiopathic pulmonary fibrosis (IPF) scRNA-seq dataset was obtained from a previous study on pulmonary fibrosis (PF) disease mechanisms and the corresponding cell types in human lung tissues [17].This dataset contains over 114,000 cells from 22 donors who had cell observations.Among these 22 donors, 10 are from the control group and 12 from the IPF group.The IPF dataset served as the query data for HierXGB to obtain the probability for a cell being an alveolar macrophage cell.
IPF bulk RNA-seq dataset We also obtained bulk RNA-seq IPF data from human lung tissues [13] in the previous research on the relationship between chronic hypersensitivity pneumonitis and idiopathic pulmonary fibrosis.This bulk IPF dataset contains 18,838 genes from 103 idiopathic IPF samples and 103 unaffected controls samples.For each gene, the mean differences of expression level between the IPF and control groups were calculated using linear regression with the adjustment of In our analysis the outcome Y is the weighted average of a gene's expression for a particular cell type (e.g., TGF-β1 in alveolar macrophage) at the individual level, p = 1, and X the design matrix is composed of a vector of 1 s, X 1 the disease group of the individual (control or IPF), and X 2 the predictive probability of each cell belonging to alveolar macrophage averaged per individual with subsequent negative log transformation.The linear model is , where I is an identity matrix with dimension n by n.So Y ∼ N Xβ, σ 2 I .The regres- sion parameters are β = (β 0 , β 1 , β 2 ), which β 0 , β 1 and β 2 are the intercept, regression parameter for disease group, and regression parameter for probability of the cell belonging to alveolar macrophage, respectively.The likelihood value given data ( Y , X ) is The conjugate prior of the regression parameter can be written as π (β) ∼ N µ β , � β .Then the posterior distri- bution can be derived as where µ new Without the prior π(β) , the frequentist's estimate and its variance of β can be written is the ordinary least square estimate.Finally, following the EM algorithm in Han et al. [19], the hybrid Bayesian analysis can be written in the following iterative procedures: , where β 0 is the intercept, and (β 1 , β 2 ) are the slope parameters for dis- ease group X 1 and the predictive probability of each cell belonging to alveolar macrophage averaged per individual with subsequent negative log transformation X 2 , respectively.
[Step 2.] Given the current value of frequentist parameters as the posterior mean of β 1 , given a conjugate (normal) prior of β 1 .

Data preprocessing of IPF datasets
For scRNA-seq dataset, we selected genes with average expression level across cells greater than 0.1.We removed cells when the number of detected genes was below the lower 2-percentile or with more than 10% of mitochondrial gene expression.For the cell weight calculation, we aligned Lungmap dataset and IPF single cell RNAseq datasets by their common genes, resulting in 13,988 common genes and 44,294 and 50,383 cells for Lungmap and IPF, respectively.For coefficient estimation of each gene, we matched the IPF scRNA-seq and IPF bulk datasets and obtained 7,886 common genes.

Introduction of frequentist, Bayesian, and hybrid inference in linear regression with conjugate priors
Here, we would like to introduce each of the methods we attempted to identify genes associated with IPF.In linear regression analysis, given the sample size n and the num- ber of regression parameters p , the data can be arranged in Y of dimension n × 1 as the response, and X of dimen- sion n × p as the design matrix.The regression parameter in the model β can be arranged in a vector of dimension p × 1.
[Step 4.] Iterate steps 2-3 as in EM algorithm.For frequentist, Bayesian, and hybrid inferences can all generate parameter estimates and the corresponding estimation variances.An estimate and its variance are used to construct a 95% confidence interval (estimate minus and plus 1.96 times of the standard error) and to calculate a p-value from the two-sided test (by calculating a z-score of estimate divided by the standard error) of whether this value is equal 0 or not based on the underlying normal distribution.

Acquiring weights for alveolar macrophage cells using HierXGB method
Alveolar macrophage cells have been recognized to play a crucial role in the pathogenesis of IPF [2,46].Rather than analyzing gene expression levels across all cell types, we are specifically interested in the association between gene expression levels and disease group (IPF vs. control) in alveolar macrophage cells.A simple approach to obtain alveolar macrophage gene expression is to take the unweighted average across the annotated alveolar macrophage cells in the original study.However, single-cell data are high-dimensional, and annotations for different cells have varying degrees of uncertainty.To better characterize the alveolar macrophage gene expression levels, we took this uncertainty into account by assigning higher weights to cells that we are more certain of them being alveolar macrophages.We quantified such uncertainty with probabilities of cells being alveolar macrophages calculated by HierXGB method [9].
HierXGB is a supervised machine learning algorithm that aims to classify each single cell in the query dataset into one of cell types from a reference dataset.With a pre-defined cell-type hierarchical tree structure, the algorithm annotates the cell from ancestor to one of descendant subtypes iteratively until reaching the bottom layer.For dataset with a clear cell type hierarchy, including Lungmap, it outperforms other state-of-arts methods in terms of both accuracy and efficiency [9].We performed a comparative analysis using the same IPF dataset in with singleR [1], a widely used method for scRNA-seq data annotation to demonstrate the usefulness of HierXGB method in our setting.
In the analysis, we first had Lungmap and IPF scRNAseq datasets aligned using batch effect correction [40].Then the HierXGB model was trained by Lungmap and produced the predictive probability of a cell belonging to alveolar macrophage for the IPF scRNA-seq data.The obtained probabilities were used as the weights when we combined gene expression across cells to obtain cell-type-specific expression summary for each gene per individual.

Generating predictor X 2 based on alveolar macrophage probabilities and outcome Y as the weighted expression average of alveolar macrophage per patient
In our example, we averaged the probability of being in alveolar macrophages across cells within each of the 22 donors and generated a length-22 vector as the predictor X 2 .A higher X 2 indicates that the cells from this donor are more likely from alveolar macrophages.We also used this alveolar macrophage probability to generate outcome Y.For each gene, Y was also a length-22 vector, where the value was the weighted average of counts in this gene across cells.The weights were alveolar macrophage probabilities, and such weighted averages are called pseudobulk counts in single cell data analysis.Weighted averages are more robust to different cell numbers per donor.A higher value of Y indicates that this gene expresses highly in alveolar macrophage cells.

Acquiring priors of the contrast between IPF and healthy and other covariates (β) and their variance-covariance matrix Σ,
We use non-informative priors such as (0, 0, 0) and diag(100, 100, 100) for µ β and � β , respectively, when we have little information on parameters.However, for RNA-sequencing data, bulk RNA-seq data which are characterized by its affordability and wide availability, can serve as good informative priors.Although bulk data may not have the same high-resolution as single-cell data, they still provide the overall expression level of each gene within the targeted tissue.In our analysis, the key parameter is β 1 for the disease group, and we incorporated bulk RNA-seq into its estimation.For each gene, we used the difference in mean expression levels between IPF and control samples as the prior of β 1 .We obtained the dif- ference from the coefficient of IPF indicator in a linear regression model adjusting for other covariates including age, smoke, sex, and race.We observed that the sample variance of the difference was relatively small compared with the magnitude of the difference.Directly using it as the prior for the variance of β 1 would result in a very strong prior distribution concentrated around the mean.To offer a prior with more moderate dispersion, we used the squared root of sample variance of difference as the variance for the prior .For intercept and average negative log expression, we had no prior information and hence kept the non-informative priors for (β 0 , β 2 ).

Pathway analysis based on differentially expressed genes from frequentist, Bayesian and hybrid methods
Pathway analysis is usually performed using a set of selected features, in this case, differentially expressed genes [23].The goal of the analysis is to identify common biological pathways or networks and analyze how they interact to form biological processes.The analysis typically involves comparing a list of genes of interest to a reference database that contains information about functional categories such as biological pathways, molecular interactions, canonical signaling, disease biomarker and other areas of biomedical knowledge.The analysis determines whether the genes in the list are significantly enriched or depleted in any of the categories, compared with what would be observed by chance.A pathway with significantly enriched genes would yield a significantly small p-value.We further studied pathways based on a gene subset that satisfies the following criteria.For a set of differentially expressed genes detected by each method, cut-off points were set to obtain a gene subset with FDR less than 0.01 and absolute estimated difference greater than 0.585.The reference database we used was the R package metabaser that collects all system biology products including MetaCore, MetaDrug, and others [6,30].The final pathways were identified based on a p-value less than 0.05.Please see the top10 and detailed summary of pathways in Table 1 and 2 and supplement material S5 and S6.

Comparison of differential expression methods using semi-synthetic dataset
For each method, we obtained p-values and fold changes (FC) in mean expression between two groups ("disease" vs. normal chow) for the genes in each semi-synthetic dataset.Initially, we adjusted the p-values of tested genes for multiple comparisons using the Benjamini-Hochberg procedure [3].The FC also serves as a crucial metric in determining if a gene is differentially expressed.We applied a 0.05 False Discovery Rate (FDR) cutoff and a 1.5 FC cutoff in this setting.Consequently, a gene was classified as differentially expressed when its adjusted p-value was less than 0.05 and its FC exceeded 1.5.
Power and FDR calculations, as discussed in section "Single-cell RNA sequencing methods comparison using semi-synthetic dataset", are summarized in Fig. 1.Our findings reveal that our proposed hybrid methods  could be optimal when both power and FDR were considered especially with an informative prior.MAST demonstrated high power akin to the hybrid approach, albeit with an inflated FDR exceeding 50%, aligning with Squair et al. 's findings.In contrast, NEBULA exhibited insufficient power and inflated FDR.Finally, none of the pseudo-bulk methods achieved more than 1% power, although the FDR closely approached the nominal level of 5%.
Fig. 1 The comparison of different DE methods using 100 runs of semi-synthetic data.The proposed hybrid methods were implemented with non-informative and different informative priors

Analysis of transforming growth factor beta 1 (TGF-β1) gene from frequentist, Bayesian and BFH methods
Through our comparison analysis using singleR and HierXGB methods, the accuracy and mean F1 of alveolar macrophages are 0.897, 0.921 and 0.829, 0.914 for HierXGB and SingleR respectively when compared with the annotation from the original paper.This demonstrated our HierXGB method was at least in par with other popular single cell annotation methods.Next, we exemplified and compared the frequentist, Bayesian, and hybrid inferences with and without informative prior.TGF-β pathway is well-known in terms of its role in pulmonary fibrosis, therefore, we used the results of TGF-β1 as an example [15].In the analysis, the outcome or response variable ( y) was the weighted average gene expression level per individual (see sections "Acquiring weights for alveolar macrophage cells using HierXGB method" and "Generating predictor based on alveolar macrophage probabilities and outcome Y as the weighted expression average of alveolar macrophage per patient" for details).The two independent variables include 1) whether the individual was in the control or IPF group ( X 1 ) and 2) the average negative log probability of being the macrophage cell ( X 2 ).The model parameters (β 0 , β 1 , β 2 ) were intercept, coefficients for X 1 and X 2 , in the regression model, respectively.
Figure 2 shows boxplots of average of gene expression per person grouped by IPF or control.In this sample 12 patients were in the IPF group and 10 were in the control group.Panel (a) has the average gene expressions of all cells for each person, weighted by the probability of each cell being alveolar macrophage cell.Panel (b) has the average gene expression from the cells that were predicted to be alveolar macrophages based on HierXGB prediction.In both (a) and (b), the expression of TGF-β1 in IPF is lower than control, but not statistically significant.The averages of gene expression for control in (a) and (b) are 1.21 and 1.22, respectively.The averages of gene expression for IPF in (a) and (b) are 0.93 and 0.86, respectively.Numerically the expressions from IPF are lower than from control, but Wilcoxon rank sum test p-values are 0.448 for (a) and 0.419 for (b), both are not statistically significant.
Table 3 is a summary of analysis result for gene TGF-β1 and model coefficient for disease group (i.e.IPF and control) ( β 1 ), including in the columns the coefficient estimate, standard error, 95% confidence interval, and p-value for testing if the estimated coefficient is different from 0. The linear regression also included intercept and gene probability of being microphage data as a covariate.The five rows in Table 3 correspond to 5 inferences about β 1 .
Frequentist: All model parameters were frequentist parameters, and ordinary least square estimates were reported.
Bayesian inference with non-informative prior: A non-informative prior distribution was imposed on all the parameters β 0 , β 1 , β 2 .Mean and standard deviation of the posterior distribution ( µ new β , � new β ) were reported.Hybrid inference with non-informative prior: The same non-informative prior was imposed on β 1 , while β 0 , β 2 were frequentist parameters.
The Bayesian and hybrid Bayesian inference with informative prior had the Normal distribution with mean −0.31 and variance 0.096 as the prior for β 1 , while β 0 , β 2 still had the non-informative priors.
The frequentist and Bayesian analyses for the samples with averaged expression across all predicted alveolar macrophages cells by HierXGB had parameters β 0 , β 1 only, and the Bayesian inference was based on the same non-informative prior for β 0 , β 1 as in Bayesian infer- ence with non-informative prior.
With non-informative prior, the frequentist with Bayesian inferences resulted in similar estimates of β 1 .Bayes- ian inference had slightly wider 95% confidence interval (CI), (−0.692, 0.147), compared with the 95% CI (−0.695, 0.145) from the frequentist inference.The hybrid inference had a similar estimate of β 1 but less standard error (0.144) and shorter 95% CI (−0.557, 0.008) than the Bayesian inference.The hybrid inference with noninformative was marginally significant with p-value 0.057.Given the informative prior, both Bayesian and hybrid inference showed significant effect on β 1 , with p-values of 0.017 and 0.005, respectively.The estimates of β 1 were identical (−0.299), but the standard error from hybrid inference (0.106) was less than that from Bayesian inference (0.126), leading to a smaller, more significant p-value from the hybrid inference.This is consistent with Table 3 The estimation, standard error, 95% confidence interval (95% CI), p-value of the difference between IPF and healthy ( β 1 ) for gene TGF-β from 7 models: frequentist, Bayesian inference with non-informative and informative priors, hybrid inference with noninformative and informative priors for all cells; and frequentist and Bayesian analysis for the predicted alveolar macrophages the published literature of TGF-β1's critical role for pulmonary fibrosis [15].We also conducted analysis on the samples with average expression of cells predicted as alveolar macrophage by HierXGB.In this analysis the probability of alveolar macrophage prediction was no longer used but the expression was averaged across identified alveolar macrophages cell types as described in Sect."scRNA-seq and bulkRNA-seq Data source" using naïve Bayesian approach.This analysis was consistent with the traditional pseudo bulk analysis, ignoring the predictive probability of cell identity.The frequentist regression analysis p-value is equivalent to that from the two-sample t-test, because the independent variable phenotype is binary, and the F-statistic from regression (or ANOVA) is the square of the t-statistic in the t-test.In this analysis, the Bayesian inference with non-informative prior had similar results as frequentist and both were not significant.Such inference was worse than BFH method with informative prior.
As a result, the analysis of TGF-β1 gene indicates that BFH inference outperforms both frequentist and Bayesian inference.The inclusion of cell type predicted probability for all the cells (regardless how small the probability was), and the informative prior were all valuable for identifying potential significant genes.

Interpretation of differentially expression results from frequentist, Bayesian and BFH methods
We applied each of the five methods to the IPF single cell dataset.The hybrid method with an informative prior detected the largest quantity of genes and biological meaningful pathways.Figure 3 summarizes the number of genes detected by each method.See detailed estimation, standard error, and p-value etc. in Table S2-S4.The hybrid method with an informative prior has the highest power with 436 genes detected.Compared with the Bayesian method with an informative prior that discovered 416 genes, the hybrid method detected all of them with an additional 20 genes (Table S1).Among the 20 genes, TREM1 and CCL24 are the most interesting discoveries.Multiple studies have shown their association with IPF.TREM-1 is a receptor expressed on myeloid cells that could serve as an inflammatory biomarker.For example, Dong et al. [11] studied a highly selective inhibitor to suppress TREM-1 expression and inflammation in murine macrophage.A previous study by Xiong et al. [43] found that TREM-1 was upregulated in bleomycin (BLM)-induced pulmonary fibrosis (PF) mouse model.They further discovered a pro-fibrotic effect of TREM-1 in PF, a potential strategy for treating fibrotic diseases could be provided.CCL24 protein promotes immune cell trafficking and activation as well as activities that lead to fibrosis.Kohan et al. [24] revealed that eotaxin-2, the protein encodes by CCL24, stimulated human lung fibroblast proliferation.Mor et al. [28] concluded that CCL24 plays an important role in skin and lung inflammation and fibrosis pathological progression.
The hybrid method with informative discovered most pathways with a total of 38, whereas the Bayesian method discovered 36 (Fig. 4).The top pathway from each method involves TGF-β1 in fibrosis development.See Table 1, 2, 3, 4. TGF-β is a multifunctional cytokine that belongs to the transforming growth factor superfamily and has multiple isoforms, TGF-β1 is one of them.It has been well established that TGF-β1 plays a role in acute respiratory distress syndrome and pulmonary fibrosis [15].Past publications have studied the role of TGF-β in alveolar macrophages development.For example, Yu et al. [44] revealed that TGF-β plays an essential role in controlling the origin, development, and survival of alveolar macrophages.Woo, Jeong, and Chung [41] reviewed the role of TGF-β in alveolar macrophages development, provided new information and insight into its functions.Grunwell et al. [15] discovered that targeting the TGF-β1 signalling pathway disruption may be a novel therapeutic approach to improve alveolar macrophage function.In comparison, the hybrid method with non-informative method discovered only two pathways, and their linkage to IPF remains unclear (Table 4).In both gene and pathway discoveries, the hybrid method with an informative prior showed supreme detection power against other methods.

Conclusion and discussion
Analysing scRNA-seq data has been a challenging topic, especially given the high cost of running the experiments, which typically results in limited sample sizes.Pseudo-bulk methods, which pool scRNA-seq counts per patient per cell-type, have been commonly used for DE gene detection.However, its performance also relies on the sample size, and hence may lack detection power when sample size is limited.A natural way to overcome this challenge is to borrow information from other studies.Here, we have shown that BFH with informative priors should be considered and has advantages over other approaches.This could be seen in our method comparison using semi-synthetic dataset.BFH can be viewed  4 pathways detected by the five methods based on the genes detected and the threshold for pathway analysis is based on q value < 0.05 as an adaption of Bayesian method and can incorporate prior information with potentially less uncertainty compared with Bayesian methods.As can be seen from the comparison, acquiring valuable priors is crucial for successfully identifying DE genes.Although we only showed one IPF dataset with bulk RNA-seq as prior, which may have inflated FDR, similar to the Bayesian framework, our BFH method can be implemented iteratively, especially when multiple datasets are accessible.In this iterative process, the posterior obtained from previous analyses becomes a more informative prior for the subsequent analysis.Over iterations, the prior and posterior regarding the identification of differentially expressed genes gradually converge.This convergence significantly improves our ability to pinpoint the correct genes associated with the disease using single-cell RNA sequencing (scRNA-seq) data with potential reduction in FDR.
In our BFH analysis of the IPF study, we used pseudobulk summarization as the response variable.However, the way to calculate pseudo-bulk is still a topic of discussion in the field.While many researchers have used the annotated cell types directly and summarized the data within a particular annotated cell type, such summarization may potentially lose information since the annotation is based on classifiers with certain thresholds to define the cell types.In our study, we derived the celltype-specific probability for each cell instead of relying on a classifier to define the cell types.We used this probability as the weight to summarize the data to avoid loss of useful information.
To boost the power of our coefficient estimate, we used bulk RNA-seq data as the prior for each gene.Bulk RNAseq data are not cell-specific, thus if the cell of interest is relatively scarce, they may not be able to provide useful information.Literature suggested that alveolar macrophages were abundant in the lung and could play a pivotal role in immunity [38].Although bulk RNA-seq data may not be as informative as scRNA-seq dataset to be used as prior, they offer several advantages.First, bulk data have better coverage than scRNA-seq data, thus providing prior information on a more compressive gene list than a typical scRNA-seq experiment.Second, it is relatively inexpensive and readily available.As we have summarized the scRNA-seq data into pseudobulk format, the prior derived from bulk RNA-seq data is compatible with our scRNA-seq data.In addition, we transformed the sample variance for β 1 so that the prior would have better dispersion.Properly setting priors remains as an interesting topic and should be explored further.The BFH method inherits flexibility from the Bayesian framework and can be used iteratively to integrate the current results as new prior information with the new data when appropriate.
Our case example demonstrated the substantial increase in detection power of BFH framework when using informative priors.When non-informative priors were employed, either no differentially expressed genes were identified or only a small number were found.The use of informative priors significantly increased the detection power, as evidenced by the reasonable pathway identified for IPF in terms of the underlying mechanism.The work reinforces the importance for TGF-β pathway and cytokines such as TNF and IL1/IL4, which are wellknown for their roles in the IPF mechanism [5,14,31].Consequently, our work brings valuable biological insight into the IPF disease for researchers.
A potential limitation of the BFH method is its heavy reliance on informative priors.In situations where relevant bulk RNA-seq or scRNA-seq data are unavailable, alternative data types such as methylation data could be considered as prior information.However, developing such priors needs biological justification and consideration of how to align such data types to pseudo-bulk format of scRNA-seq data.When alternative data types are unavailable, the use of non-informative prior for parameters is inevitable.As discussed in literature [18,19], using Bayesian analysis with non-informative prior can lead to estimation bias and incorrect p-values if the sample size is relatively small.Despite its limitations, the BFH method is a flexible approach with the capability to incorporate informative prior to enhance detection power.The current framework of the BFH method is implemented using conjugate priors, which reduces the computation time and makes it a suitable method for high throughput analyses in the future.

Fig. 2
Fig. 2 Boxplot of individual level TGF-β1 gene expression by phenotype in the whole sample (a) and predicted alveolar macrophages (b)

Table 1
Top 10 pathways detected by Bayesian, informative method, ranked by qvalue * Threshold: qvalue < 0.05; r: intersection of ontology term with experiment list; R: size of experiment list; n: size of ontology term; N: size of background list; zscore: z-score of enrichment; pvalue: hypergeometric test enrichment p-value; qvalue: FDR-adjusted pvalue

Table 2
Top 10 pathways detected by Hybrid, informative method, ranked by qvalue * Threshold: qvalue < 0.05; r: intersection of ontology term with experiment list; R: size of experiment list; n: size of ontology term; N: size of background list; zscore: z-score of enrichment; pvalue: hypergeometric test enrichment p-value; qvalue: FDR-adjusted p value

Table 4
A detailed list of pathways detected by Hybrid, non-informative method