Use of pathway information in molecular epidemiology
© Henry Stewart Publications 2009
Received: 23 June 2009
Accepted: 23 June 2009
Published: 1 October 2009
Candidate gene studies are generally motivated by some form of pathway reasoning in the selection of genes to be studied, but seldom has the logic of the approach been carried through to the analysis. Marginal effects of polymorphisms in the selected genes, and occasionally pairwise gene-gene or gene-environment interactions, are often presented, but a unified approach to modelling the entire pathway has been lacking. In this review, a variety of approaches to this problem is considered, focusing on hypothesis-driven rather than purely exploratory methods. Empirical modelling strategies are based on hierarchical models that allow prior knowledge about the structure of the pathway and the various reactions to be included as 'prior covariates'. By contrast, mechanistic models aim to describe the reactions through a system of differential equations with rate parameters that can vary between individuals, based on their genotypes. Some ways of combining the two approaches are suggested, and Bayesian model averaging methods for dealing with uncertainty about the true model form in either framework are discussed. Biomarker measurements can be incorporated into such analyses, and two-phase sampling designs stratified on some combination of disease, genes and exposures can be an efficient way of obtaining data that would be too expensive or difficult to obtain on a full candidate gene sample. The review concludes with some thoughts about potential uses of pathways in genome-wide association studies.
Keywords: colorectal cancer, complex diseases, folate, gene-environment interactions, gene-gene interactions
Molecular epidemiology has advanced from testing associations of disease with single polymorphisms, to exhaustive examination of all polymorphisms in a candidate gene using haplotype tagging single nucleotide polymorphisms (SNPs), to studying increasing numbers of candidate genes simultaneously. Often, gene-environment and gene-gene interactions are considered at the same time. As the number of main effects and interactions proliferates, there is a growing need for a more systematic approach to model development.
In recognition of this need, the American Association for Cancer Research held a special conference in May 2007, bringing together experts in epidemiology, genetics, statistics, computational biology, systems biology, toxicology, bioinformatics and other fields to discuss various multidisciplinary approaches to this problem.
A broad range of exploratory methods have been developed recently for identifying interactions, such as neural nets, classification and regression trees, multi-factor dimension reduction, random forests, hierarchical clustering, etc. [3–7] Our focus here, however, is instead on hypothesis-driven methods based on prior understanding about the structure of biological pathways postulated to be relevant to a particular disease. Our primary purpose is to contrast mechanistic and empirical methods and explore ways of combining the two.
The folate pathway as an example
Table 1. Marginal odds ratios (ORs) for the association of each gene with disease under various choices of reaction rates or intermediate metabolite concentrations as the causal risk factor (ORs are expressed relative to the low enzyme activity rate genotype). Columns: simulated causal intermediate variable (β = 2 per SD), including pyrimidine synthesis (TS), purine synthesis (AICART) and DNA methylation (DNMT). Environmental factors: 1. intracellular folate; 2. methionine intake.
Table 1 shows the univariate associations of each gene with disease under each assumption about the causal risk factor. In these simulations, only one of these was taken as causal at a time, each scaled with the relative risk coefficient β = 2.0 per standard deviation of the respective risk factor. When homocysteine concentration was taken as the causal factor, the strongest association was with genetic variation in the cystathionine β-synthase (CBS) and S-adenosylhomocysteine hydrolase (SAHH) genes. The remaining three columns relate to various reaction rates as causal mechanisms. For pyrimidine synthesis (characterised here by the TS reaction rate), the strongest influence was seen for genetic variation in TS and the 5,10-methyleneTHF dehydrogenase (MTD) gene. For purine synthesis (reflected in the AICART reaction rate), the strongest associations were with genetic variation in the phosphoribosyl glycinamide transferase (PGT) gene and somewhat weaker for MTD and 5,10-methyleneTHF cyclohydrolase (MTCH) and serine hydroxymethyltransferase (SHMT) genes; interestingly, the disease risk is not particularly related to the AICART genotype itself. When DNA methylation (reflected by the DNMT reaction rate) was assumed to be causal, none of the genetic associations were as strong as for the other three causal mechanisms, the strongest being with the 5,10-methyleneTHF reductase (MTHFR) gene, SAHH and MTD. Genetic variation in DNMT was not explicitly simulated, but the reaction rates for this enzyme were identical to those for methionine adenosyl transferase (MAT-II) and SAHH, reflecting a rate-limiting step. Thus, genetic variation in MAT-II had no effect on risk, the reaction rate being driven entirely by SAHH. Other rate-limited combinations included dihydrofolate reductase (DHFR) with TS, MTD with MTCH, and PGT with AICART.
Methionine intake was the strongest environmental exposure factor for the simulation with homocysteine as the causal mechanism, whereas intracellular folate had a stronger effect under the other three mechanisms.
Mechanistic vs empirical models
Table 2. Multiple stepwise logistic regression models, including only main effects or main effects and G × G/G × E interaction terms, for four different choices of the simulated causal risk factor (gene names are given in Table 1; E1 = intracellular folate concentration; E2 = methionine intake). Terms considered for each choice: G, E, G × E, G × G.
Hierarchical models for disease-pathway associations
the sum being taken over the range of terms included in the X vector. Note that all possible effects of some predetermined complexity (eg all main effects and two-way, or perhaps higher order, interactions possibly limited to subsets relevant to the hypothesised pathway structure) are included, rather than using some form of model selection, as was done in the stepwise analyses summarised in Table 2.
There are many possibilities for what could be included in the set of prior covariates, ranging from indicator variables for which of several pathways each gene might act in, to in silico predictions of the functional significance of polymorphisms in each gene, [18, 19] to genomic annotation from formal ontologies. Summaries of the effects of genes on expression levels ('genetical genomics') or of associations of genes with relevant biomarkers might also be used as prior covariates. Rebbeck et al. provide a good review of available tools that could be used for constructing prior covariates.
For example, suppose the X vector comprised effects for different polymorphisms within each gene and one had some prior predictors of the effects of each polymorphism (eg in silico predictions of functional effects or evolutionary conservation) and other predictors of the general effects of genes (eg their roles in different pathways or the number of other genes that they are connected to in a pathway). Then, it might be appropriate to include the former in the π'Z part of the model for the means, and the latter in the ψ'Z part of the model for the variances.
This is known as the conditional autoregressive model, and is widely used in spatial statistics. Sample WinBUGS code to implement these and other models described below is available in an online supplement.
In applications to the folate simulation, we tried two variants of this model. First, we considered three prior covariates in Z: an indicator for whether a gene is involved in the methionine cycle; whether it is involved in the folate cycle; and the number of other genes it is connected to in the entire network (a measure of the extent to which it might have a critical role as a 'hub' gene). The A matrix was specified in terms of whether a pair of genes had a metabolite in common, either as substrate or product.
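As a rough illustration of how prior covariates and an adjacency matrix of this kind might be assembled, the following Python sketch builds Z and A from a small reaction list. The gene names are taken from the text, but the metabolite labels, the cycle assignments and the function names are placeholders chosen for illustration only; they do not reproduce the folate network used in the simulation.

```python
# Toy reaction list: gene -> (substrates, products, cycle label).
# The metabolites "m1".."m6" and the cycle labels are hypothetical
# placeholders, not the actual folate pathway.
REACTIONS = {
    "MTHFR": ({"m1"}, {"m2"}, "folate"),
    "TS": ({"m1"}, {"m3"}, "folate"),
    "SAHH": ({"m4"}, {"m5"}, "methionine"),
    "CBS": ({"m5"}, {"m6"}, "methionine"),
}


def build_adjacency(reactions):
    """A[g][h] = 1 if genes g and h share a metabolite (substrate or product)."""
    genes = sorted(reactions)
    adjacency = {g: {h: 0 for h in genes} for g in genes}
    for g in genes:
        mets_g = reactions[g][0] | reactions[g][1]
        for h in genes:
            if g != h:
                mets_h = reactions[h][0] | reactions[h][1]
                adjacency[g][h] = int(bool(mets_g & mets_h))
    return adjacency


def build_prior_covariates(reactions, adjacency):
    """Z[g] = [in methionine cycle, in folate cycle, number of connected genes]."""
    covariates = {}
    for gene, (_, _, cycle) in reactions.items():
        covariates[gene] = [
            int(cycle == "methionine"),
            int(cycle == "folate"),
            sum(adjacency[gene].values()),
        ]
    return covariates


A = build_adjacency(REACTIONS)
Z = build_prior_covariates(REACTIONS, A)
```

In a real application the reaction list would presumably be derived from a curated pathway database rather than typed by hand, and the connection count would be computed over the entire network.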
Table 3. Summary of hierarchical modelling fits (parameter estimates [SEs]) for selected genetic effects (βG), prior covariates (Z'π) and prior correlations (σ²A) for simulation with homocysteine concentration as the causal variable. Rows: genetic main effects; prior covariate effects on the means (π1: folate; π2: methionine; π3: connections) and on the variances (ψ1: folate; ψ2: methionine; ψ3: connections); posterior variances and correlations (σβ = SD(β|Z); σπ = SD(π); σψ = SD(ψ); ρ = corr(β|A)).
where λ and μ denote the low-dose slopes of the two reactions. These solutions can be either upwardly or downwardly curvilinear in E, depending on whether the term in parentheses is positive or negative (basically, whether the creation of the intermediate exceeds the rate at which it can be removed). For the fitted values in the application below (third block of Table 4), the dose-response relationship for M|E is upwardly curved for all genotype combinations (not shown).
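To make the steady-state behaviour concrete, the sketch below computes the equilibrium concentration of a single intermediate M under two assumed forms of kinetics: linear, dM/dt = λE − μM, and Michaelis-Menten for both creation and removal. The specific differential equations, the function names and the parameter names are assumptions made for illustration and are not necessarily the exact parameterisation fitted in Table 4.

```python
def steady_state_linear(exposure, lam, mu):
    """Steady state of dM/dt = lam*E - mu*M, namely M = lam*E/mu."""
    return lam * exposure / mu


def steady_state_mm(exposure, vmax_in, km_in, vmax_out, km_out):
    """Steady state of dM/dt = vmax_in*E/(km_in+E) - vmax_out*M/(km_out+M).

    Setting the derivative to zero gives M = km_out*v_in/(vmax_out - v_in),
    where v_in is the (saturable) creation rate at exposure E.
    """
    v_in = vmax_in * exposure / (km_in + exposure)
    if v_in >= vmax_out:
        # Creation outpaces the maximal removal rate: M grows without bound.
        raise ValueError("no finite steady state for these parameters")
    return km_out * v_in / (vmax_out - v_in)
```

At low exposure the Michaelis-Menten solution approaches the linear one with slopes λ = vmax_in/km_in and μ = vmax_out/km_out, which is one way of checking the two functions against each other.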
A more realistic and more flexible model would allow for stochastic variation in the reaction rates λ_ij and μ_ij (or similarly in their corresponding V_max and K_m) for each individual i, conditional on their genotypes G_ij. The population genotype-specific rates are, in turn, assumed to have log-normal prior distributions (and similarly for the μs), with vague priors on the population means, inter-individual variances and between-genotype variances. The individual data might be further supplemented by available biomarker measurements B_ij of either the enzyme activity levels or intermediate metabolite concentrations, each with its own measurement model.
The WinBUGS software has an add-in called PKBUGS, which implements a Bayesian analysis of population pharmacokinetic parameters [27–31]. More complex models can, in principle, be fitted using the add-in WBDIFF (http://www.winbugs-development.org.uk/wbdiff.html), which allows user-specified differential equations as nodes in a Bayesian graphical model.
Table 4. Results of Markov chain Monte Carlo fitting of single-compartment models with homocysteine as an unobserved intermediate metabolite, created at a rate depending on SAHH (λ) and removed at a rate depending on CBS (μ), applied to the simulation taking homocysteine concentration as the causal variable. Row labels: λ, μ fixed; λ, μ random; λ, μ fixed: ln(μ0/λ0), ln[V_max(1)/V_max(0)]; γs random: ln(μ0/λ0), ln[V_max(1)/V_max(0)]; stochastic differential equations: N = 10 fixed.
Combining mechanistic and statistical models
Such an approach is likely to be impractical for complex looped pathways like folate, however. In this case, one might use the results of a preliminary exploratory or hierarchical model to simplify the pathway to a few key rate-limiting steps, so as to yield a simpler unidirectional model for which the differential equation steady-state solutions can be obtained in closed form.
where N now controls the dispersion of the distribution. More complex solutions for Michaelis-Menten kinetics with a finite number of binding sites have been provided by Kou et al., who showed that the classical solutions still held in expectation, but other properties, like the distribution of waiting times in various binding states, were different, appearing to demonstrate a non-Markov memory phenomenon, particularly at high substrate concentrations. Further stochastic variability arises from fluctuations in binding affinity due to continual changes in enzyme conformation.
To illustrate the general idea, we fitted this simplified version of the model, treating λ and μ as fixed genotype-specific population values, yielding the estimates shown in the last line of Table 4. The dispersion parameter N cannot be estimated, but the results for other parameters are relatively insensitive to this choice; the results in Table 4 are based on either a fixed value N = 10 or using an informative Γ(100,1) prior; as N gets very large, the estimates converge to those in the first line for linear kinetics with fixed genotype-specific λ and μ.
For more complex models, for which analytic solution of the differential equations may be intractable, the technique of approximate Bayesian computation may be helpful. The basic idea is, at each Markov chain Monte Carlo cycle, to simulate data from the differential equations model using the current and proposed estimates of model parameters and evaluate the 'closeness' of the simulated data to the observed data in terms of some simple statistics. This is then used to decide whether to accept or reject the proposed new estimates, rather than having to compute the likelihood itself.
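The accept/reject idea can be sketched in a few lines of Python. The version below is a simplified likelihood-free Markov chain Monte Carlo sampler that assumes a flat prior on a bounded interval and a symmetric random-walk proposal, so that a proposal is accepted whenever data simulated from it fall within a tolerance eps of the observed summary statistic; unlike the scheme described above, it simulates only from the proposed value. The function simulate is a hypothetical stand-in for a differential-equations simulator, and the observed values, tolerance and step size are arbitrary illustrative choices.

```python
import random
import statistics


def simulate(theta, n, rng, noise_sd=0.25):
    """Hypothetical stand-in for a differential-equations simulator:
    returns n noisy 'metabolite' values centred on theta."""
    return [rng.gauss(theta, noise_sd) for _ in range(n)]


def abc_mcmc(observed, n_iter=2000, eps=0.15, step=0.2, bounds=(0.0, 5.0), seed=1):
    """Likelihood-free MCMC with a flat prior on `bounds` and a symmetric proposal."""
    rng = random.Random(seed)
    s_obs = statistics.fmean(observed)  # summary statistic of the observed data
    theta = s_obs  # start at a value consistent with the data
    chain = []
    for _ in range(n_iter):
        proposal = theta + rng.gauss(0.0, step)
        if bounds[0] <= proposal <= bounds[1]:
            s_sim = statistics.fmean(simulate(proposal, len(observed), rng))
            # Accept only if simulated and observed summaries are close enough.
            if abs(s_sim - s_obs) < eps:
                theta = proposal
        chain.append(theta)
    return chain


observed = [1.8, 2.1, 2.4, 1.9, 2.2, 2.0, 1.7, 2.3]  # made-up measurements
chain = abc_mcmc(observed)
posterior_mean = statistics.fmean(chain)
```

With an informative prior or an asymmetric proposal, the acceptance step would also need the usual Metropolis-Hastings ratio of prior and proposal densities; the choice of summary statistic and tolerance largely determines how well the result approximates the true posterior.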
A simpler approach uses the output of a PBPK simulation model as prior covariates in a hierarchical model. Let Z_ge = E[M(G_g, E_e)] denote the predicted steady-state concentrations of the final metabolite from a differential equations model for a particular combination of genes and/or exposures (thus, Z_gg' might represent the predicted effect of a G × G interaction between genes g and g'). As discussed above, other Zs could comprise variances of predicted Ms across a range of input values as a measure of the sensitivity of the output to variation in that particular combination of inputs. Z_ge could also be a vector of several different predicted metabolite concentrations if there were multiple hypotheses about which was the most aetiologically relevant.
Table 5. Summary of hierarchical modelling fits for selected genetic effects (βG), prior covariates (Z'π) and prior standard deviations (σβ and σπ) for simulation with different intermediates as the causal variable, using the Z matrix derived from independent data from the same simulation model (see text). Columns: simulated causal variable, including pyrimidine synthesis (TS), purine synthesis (AICART) and DNA methylation (DNMT). Rows: genetic main effects (eg G2: TS); posterior standard deviations (SDs): σβ = SD(β|Z), σπ = SD(π).
Table 6. Estimated log relative risk per unit change of true long-term homocysteine concentrations, treated as a latent variable in a single compartment linear-kinetics model; data simulated assuming homocysteine is the causal variable. Subsamples stratified by G, E and Y: 8 × 10 = 80 and 8 × 25 = 200 subjects.
Designs incorporating biomarkers
Ultimately, it may be helpful to incorporate various markers of the internal workings of a postulated pathway, perhaps in the form of biomarker measurements of intermediate metabolites, external bioinformatic knowledge about the structure and parameters of the network, or toxicological assays of the biological effects of the agents under study. For example, in a multi-city study of air pollution, we are applying stored particulate samples from each city to cell cultures with a range of genes experimentally knocked down to assess their genotype-specific biological activities. We will then incorporate these measurements directly into the analysis of G × E interactions in epidemiological data. See Thomas, Thomas et al., Conti et al. and Parl et al. for further discussion about approaches to incorporating biomarkers and other forms of biological knowledge into pathway-driven analyses.
Typically, biomarker measurements are difficult to obtain and are only feasible to collect on a subset of a large epidemiological study. While one might consider using a simple random sample for this purpose, greater efficiency can often be obtained by stratified sampling. Suppose the parent study is a case-control study with exposure information and DNA already obtained. One might then consider sampling on the basis of some combination of disease status, exposure and the genotypes of one or more genes thought to be particularly important for the intermediate phenotype(s) for which biomarkers are to be obtained. The optimal design would require knowledge of the true model (which, of course, is unknown), but a balanced design, selecting the subsample so as to obtain equal numbers in the various strata defined by disease and predictors, is often nearly optimal [37–39]. The analysis can then be conducted by full maximum likelihood, integrating the biomarkers for unmeasured subjects over their distribution (given the available genotype, exposure and disease data) or by some form of multiple imputation, quasi-likelihood or MCMC methods. Here, the interest is not in the association of disease with the biomarker B itself, but rather with the unobserved intermediate phenotype M for which it is a surrogate. The disease model is thus of the form Pr(Y|M), with a latent process model for Pr(M|G, E) and a measurement model for Pr(B|M).
Again, using the folate simulation as the example, we simulated biomarkers for samples of ten or 25 individuals selected at random from each of the eight cells defined by disease status, the MTHFR genotype and high or low folate intake. A measurement B of either homocysteine concentration or the TS enzyme activity level was assumed to be normally distributed around its simulated equilibrium value, with a standard deviation equal to 10 per cent of the true long-term average.
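A balanced two-phase design of this kind can be sketched as follows. The Python code draws an equal number of subjects at random from each cell defined by disease status Y, a binary genotype G and a binary exposure E, and then attaches a biomarker measurement B with a 10 per cent coefficient of variation around each subject's true value M. The parent-study records, field names and function names are hypothetical and serve only to illustrate the sampling step.

```python
import random
from collections import defaultdict


def balanced_subsample(subjects, n_per_cell, rng):
    """Draw up to n_per_cell subjects at random from each (Y, G, E) cell."""
    cells = defaultdict(list)
    for subject in subjects:
        cells[(subject["Y"], subject["G"], subject["E"])].append(subject)
    sample = []
    for key in sorted(cells):
        members = cells[key]
        sample.extend(rng.sample(members, min(n_per_cell, len(members))))
    return sample


def measure(true_value, rng, cv=0.10):
    """Biomarker B ~ Normal(M, (cv*M)^2): measurement error SD is 10% of M."""
    return rng.gauss(true_value, cv * true_value)


rng = random.Random(2009)
# Hypothetical parent study: 400 subjects, 50 in each of the 8 cells.
subjects = [
    {"id": i, "Y": i % 2, "G": (i // 2) % 2, "E": (i // 4) % 2, "M": 10.0 + (i % 7)}
    for i in range(400)
]
phase2 = balanced_subsample(subjects, 10, rng)
for subject in phase2:
    subject["B"] = measure(subject["M"], rng)
```

The sampling fractions differ across cells whenever the cells are of unequal size in the parent study, so any subsequent analysis would need to account for the design.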
For joint analyses of homocysteine and TS activity measurements, M and B were assumed to be bivariate normally distributed with M ~ N2(X'A, Σ) and B ~ N2(M, T), and Y as having a multiple logistic dependence on M. Only the main effects of the 14 genes and two environmental factors were included in X for this analysis. While the model can be fitted by maximum likelihood, it is convenient to use MCMC methods, which more readily deal with arbitrary patterns of missing B data. Thus, it is not essential for the different biomarkers to be measured on the same subset of subjects, but some overlap is needed to estimate the covariances Σ12 and T12. More complex mechanistic models could, of course, be used in place of the regression model M|X. For this model to be identifiable, however, it is essential that distinct biomarkers be available for each of the intermediate phenotypes included in the disease model.
Estimates of the effects of both homocysteine and TS enzyme activity were highly significant in univariate analyses, even though the simulated causal variable is homocysteine. In bivariate analyses, however, the TS effect became non-significant, owing to the strong positive correlation (rΣ = 0.45; 95 per cent confidence interval [CI] 0.21, 0.71) between the residuals of M, while correlation between the residuals of the measurement errors was not significant (rT = 0.34; 95 per cent CI -0.12, +0.63). Although the standard errors varied strongly with subsample size, stratified sampling did not seem to improve the precision of the estimates. The reason for this appears to be that the biased sampling is not properly allowed for in the Bayesian analysis. Further work is needed to explore whether incorporating the sampling fractions into a conditional likelihood would yield more efficient estimators in the stratified designs.
Dealing with reverse causation: Mendelian randomisation
The foregoing development assumes that the biomarker measurement B or the underlying phenotype M of which it is a measurement is not affected by the disease process. While this may be a reasonable assumption in a cohort or nested case-control study where biomarker measurements are made on stored specimens obtained at entry to the cohort rather than after the disease has already occurred, it is a well known problem (known as 'reverse causation') in case-control studies. In this situation, one might want to restrict biomarker measurements only to controls and use marginal likelihood or imputation to deal with the unmeasured biomarkers for cases. Alternatively, one might consider using case measurements in a model that includes terms for differential error in the measurement model, Pr(B|M, Y).
These ideas have been formalised in literature known as 'Mendelian randomisation' (MR), [43–47] sometimes referred to as 'Mendelian deconfounding'. Here, the focus of attention is not the genes themselves, but intermediate phenotypes (M) as risk factors for disease. The genes that influence M are treated as 'instrumental variables' (IVs) [49–54] in an analysis that indirectly infers the M-Y relationship from separate analyses of the G-M and G-Y relationships. The appeal of the approach is that uncontrolled confounding and reverse causation are less likely to distort these relationships than they are to distort the M-Y relationship if studied directly. In essence, the idea of imputing M values using G as an IV in a regression of Y on E(M|G) is a form of MR argument. Nevertheless, the approach is not without its pitfalls, [55–58] both as a means of testing the null hypothesis of no causal connection between M and Y and as a means of estimating the magnitude of its effect. Particularly key is the assumption that the effect of G on Y is mediated solely through M. For complex pathways, the simple MR approach is unlikely to be of much help, but the idea of using samples free of reverse causation to learn about parts of the model from biomarker measurements and incorporating these into the analysis of a latent variable model is promising.
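For a single instrument, the indirect inference of the M-Y relationship from the G-M and G-Y relationships is commonly expressed as a ratio of the two regression coefficients. The sketch below computes this ratio with a first-order (delta method) standard error, assuming the two coefficients are estimated independently. The symbols α (effect of G on the biomarker) and γ (effect of G on disease) follow the notation used for the MR estimates in this section; the function name and the numerical inputs are illustrative only.

```python
import math


def wald_ratio(alpha, se_alpha, gamma, se_gamma):
    """Instrumental-variable ratio estimate beta = gamma / alpha.

    alpha: estimated effect of G on the intermediate (B or M)
    gamma: estimated effect of G on disease (log relative risk scale)
    The standard error uses the delta method and ignores any covariance
    between the two estimates (an assumption of this sketch).
    """
    beta = gamma / alpha
    variance = se_gamma ** 2 / alpha ** 2 + gamma ** 2 * se_alpha ** 2 / alpha ** 4
    return beta, math.sqrt(variance)


# Made-up coefficients, for illustration only.
beta_hat, se_beta = wald_ratio(alpha=0.5, se_alpha=0.05, gamma=0.25, se_gamma=0.1)
```

A weak instrument (α close to zero) inflates both the estimate and its standard error, which is one of the pitfalls of the approach.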
Table 7. Mendelian randomisation estimates of the effect of homocysteine on disease risk. Parameters: α in B|G; γ in Y|G; β in Y|B; δ in B|G, Y. Listed entries, in order: R2 = 0.43; E(B|G, Y)|Y = 0; R2 = 0.71; E(B|G, Y = 0); R2 = 0.43.
A stepwise multiple regression model for E(B|G) included 13 main effects and G × G interactions and attained an R2 of 0.43. Treating these predicted homocysteine concentrations as the covariate yielded a single imputation estimate of the log RR for disease of 1.32 (0.16), only slightly less precise than that from the logistic regression of disease directly on the measured values. While robust to uncontrolled confounding, this approach is not robust to reverse causation or misspecification of the prediction model; for example, it fails to include any exposure effects, which we have excluded to avoid distortion by reverse causation. More importantly, it also assumes that the entire effect of the predictors is mediated through homocysteine; this is true for this simulation, but is unlikely to be so in practice. While not quite as downwardly biased as the Mendelian randomisation estimates (resulting from the improved prediction of B|G), the incompleteness of the model has still produced some underestimation.
Since we have simulated the case where the biomarker measurements are distorted by disease status, one might consider one of two alternative single imputation analyses. If both cases and controls have biomarker measurements available, one might include disease status in a model for E(B|G, Y), and then set Y = 0 in the fitted regression in order to estimate the predisease values for the cases. Alternatively, one could fit the model for E(B|G) using data only from controls and then apply the fitted model to all subjects, cases and controls. In either case, one would use only the predicted values for all subjects, not the actual biomarker measurements for those having them. In these simulated data, these approaches yield log RR estimates of 1.28 (0.20) and 1.31 (0.20), respectively. Either of these approaches avoids the circularity of using disease status to predict B|G, Y and then using it again in the regression of Y on the predicted B. While the first approach uses more of the data, it requires a stronger assumption that the effect of Y on B is correctly specified, including possible interactions with G. In this simulation, the estimate of δ is 1.33 (0.06), substantially biased away from the simulated value of 0.50 because it includes some of the causal effect of X on Y. A fully Bayesian analysis jointly estimates the bias term δY in the full model. In this simulation, the fully Bayesian analysis yielded an estimate of β = 2.95 (0.22) and δ = -0.02 (1.02). Obviously, δ is so poorly estimated and β so overestimated that this approach appears to suffer from problems of identifiability that require further investigation.
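The controls-only variant of single imputation can be illustrated with a deliberately minimal Python sketch. Here the prediction model is reduced to genotype-specific means of B for a single genotype, rather than the stepwise regression on many genes used in the analysis above, and the records and function names are hypothetical. The point is only that the model is fitted on controls and the predicted values are then used for every subject, including those with a measured biomarker.

```python
from collections import defaultdict
from statistics import fmean


def fit_controls_only(subjects):
    """Estimate E(B|G) as the genotype-specific mean of B among controls (Y = 0)."""
    by_genotype = defaultdict(list)
    for subject in subjects:
        if subject["Y"] == 0 and subject.get("B") is not None:
            by_genotype[subject["G"]].append(subject["B"])
    return {genotype: fmean(values) for genotype, values in by_genotype.items()}


def impute_all(subjects, genotype_means):
    """Replace B by its predicted value for all subjects, cases and controls alike."""
    return [genotype_means[subject["G"]] for subject in subjects]


# Made-up records: case measurements may be distorted by disease, so they are
# ignored when fitting and are never used directly in the disease model.
subjects = [
    {"G": 0, "Y": 0, "B": 1.0},
    {"G": 0, "Y": 0, "B": 3.0},
    {"G": 1, "Y": 0, "B": 5.0},
    {"G": 1, "Y": 1, "B": 9.0},
    {"G": 0, "Y": 1, "B": None},
]
means = fit_controls_only(subjects)
predicted = impute_all(subjects, means)
```

The predicted values would then enter a logistic regression for disease in place of the measured biomarker; this sketch would fail for any genotype with no measured controls, which a real analysis would need to handle.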
In the Colon Cancer Family Registries, we have pre-disease biospecimens on several hundred relatives of probands who were initially unaffected and subsequently became cases themselves. In a currently ongoing substudy of biomarkers for the folate pathway, it will be possible to use these samples to estimate the effect of reverse causation directly. Of course, it would have been even more informative to have both pre- and post-diagnostic biomarker measurements on incident cases to model reverse causation more accurately.
Incorporating external information: Ontologies
There are now numerous databases available that catalogue various types of genomic information. The Kyoto Encyclopedia of Genes and Genomes (KEGG) is perhaps the most familiar of these for knowledge about the structure of pathways and the parameters of each step therein. Others include the Gene Ontology, Biomolecular Interaction Network Database, Reactome, PANTHER, Ingenuity Pathway Analysis, BioCARTA, GATHER, DAVID and the Human Protein Reference Database (see, for example, Meier and Gehring, Thomas et al. and Werner for reviews). Literature mining is emerging as another tool for this purpose, although potentially biased by the vagaries of research and publication trends. Such repositories form part of a system for organising knowledge known as an 'ontology'. Representation of our knowledge via an ontology may provide a more useful and broadly informative platform to generate system-wide hypotheses about how variation in human genes ultimately impacts on organism-level phenotypes via the underlying pathway or complex system. Since the biological and environmental knowledge relevant to most diseases spans many research fields, each with specific theories guiding ongoing research, expertise across the entire system by one individual scientist is limited. While the information that contributes to each knowledge domain may contain uncertainties and sources of error stemming from the underlying experiments and studies, biases in the selection of genes and pathways chosen to be included and lack of comparability across terms and databases, an ontology as a whole can generate hypotheses and links across research disciplines that may only arise when information is integrated from several disciplines across the entire span of suspected disease aetiology. An ontology should not be taken as the truth, but rather as the current representation of knowledge that can, and should, be updated as new findings arise and hypotheses are tested.
Evaluation of the accuracy of ontologies is an active research area.
Although both approaches to building prior covariates, via either the visual interpretation of a network or the use of Gene Ontology, use knowledge of biological mechanisms, they lack a formal link of these mechanisms to disease risk or organism-level phenotypes. Such links may be critical when generating hypotheses or informing statistical analyses using biological mechanisms. Many publicly available ontologies provide a vast amount of structural information on various biological processes, but interpretation or weighting of the importance of those processes in relation to specific phenotypes will only come when ontologies from biological domains are linked to ontologies characterising phenotypes. As one example, Thomas et al. created a novel ontological representation linking smoking-related phenotypes and response to smoking cessation treatments with the underlying biological mechanisms, mainly nicotine metabolism. Most of the ontological concepts created for this specific ontology were created using concept definitions from existing ontologies, such as SOPHARM and Gene Ontology. This ontology was used in Conti et al. to demonstrate its use in pathway analysis as a systematic way of eliciting priors for a hierarchical model. Specifically, the ontology was used to generate quantitative priors to reduce the space of potential models and to inform subsequent analysis via a Bayesian model selection approach.
Dealing with uncertainty in pathway structure
A more general question is how to deal with model uncertainty in any of these modelling strategies. The general hierarchical modelling strategy was first extended by Conti et al. to deal with uncertainty about the set of main effects and interactions to be included in X using stochastic search variable selection. Specifically, they replaced the second-level model by a pair of models, a logistic regression for the probability that β_p = 0 and a linear regression of the form of Eq. (2) for the expected values of the coefficient, given that it was not zero. In turn, the pair of second-level models inform the probability that any given term will be included in the model at the current iteration of the stochastic search. Thus, over the course of MCMC iterations, variables are entered and removed, and one can then estimate the posterior probability or Bayes factor (1) for each factor or possible model, (2) for whether each factor has a non-zero β averaging over the set of other variables in the model, or (3) the posterior mean of each β, given that it is non-zero. Other alternatives include the Lasso prior, which requires only a single hyperparameter to accomplish both shrinkage and variable selection in a natural way, and the elastic net, which combines the Lasso and normal priors and can be implemented in a hierarchical fashion combining variable selection at lower levels (eg among SNPs within a pathway) and shrinkage at higher levels (eg between genes within a pathway or between pathways) (Chen et al. Presented at the Eastern North American Region Meeting of the Biometric Society; San Antonio, TX: February 2009).
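The summaries that come out of such a stochastic search can be computed directly from the MCMC output. The sketch below takes, for one term, the per-iteration inclusion indicators and coefficient draws and returns the posterior probability of inclusion, the corresponding Bayes factor (posterior odds divided by prior odds) and the posterior mean of β given inclusion. The function name, the prior inclusion probability and the draws are illustrative assumptions, not output from the analyses described here.

```python
def inclusion_summary(indicator_draws, beta_draws, prior_prob):
    """Summarise stochastic-search output for a single term.

    indicator_draws: 1 if the term was in the model at that iteration, else 0
    beta_draws: sampled coefficient at each iteration (ignored when excluded)
    prior_prob: prior probability that the term is included
    """
    n_draws = len(indicator_draws)
    n_included = sum(indicator_draws)
    post_prob = n_included / n_draws
    prior_odds = prior_prob / (1.0 - prior_prob)
    if 0.0 < post_prob < 1.0:
        bayes_factor = (post_prob / (1.0 - post_prob)) / prior_odds
    else:
        # Degenerate chains give no finite estimate of the posterior odds.
        bayes_factor = float("inf") if post_prob == 1.0 else 0.0
    if n_included:
        kept = [b for d, b in zip(indicator_draws, beta_draws) if d]
        mean_if_included = sum(kept) / n_included
    else:
        mean_if_included = float("nan")
    return post_prob, bayes_factor, mean_if_included


# Made-up draws for one gene, for illustration only.
prob, bf, mean_beta = inclusion_summary([1, 1, 1, 0], [0.4, 0.6, 0.5, 0.0], prior_prob=0.5)
```

With a prior inclusion probability of one half the Bayes factor equals the posterior odds; informative prior covariates change the prior odds term by term.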
In an analysis, utilising the methods described by Conti et al., of the simulated data when homocysteine is the causal variable (Table 5, first column) and incorporating an exchangeable prior structure in which all genes are treated equally (ie intercept only in the prior covariate matrix, Z), the posterior probabilities of including the two modestly significant genes TS and FTD are 0.57 and 0.48, respectively. By contrast, when the prior covariate matrix is derived from the 'external database' from the simulation model and is thus more informative of the underlying mechanism, these posterior probabilities change to 0.84 and 0.14, respectively. These changes in the posterior probabilities of inclusion reflect the covariate values for these genes in relation to homocysteine concentration and the AICART reaction velocity (the two prior covariates with the largest estimated second-level effects). In the case of TS, the values of these covariates are large, resulting in an increase in the posterior probability of inclusion. By contrast, for FTD these values are much smaller and there is a subsequent decrease.
For mechanistic models, the 'topology' of the model Λ and the corresponding vector of model parameters θΛ are treated as unknown quantities, about which we might have some general prior knowledge in the form of the 'ontology' Z. In the microarray analysis world, Bayesian network analysis has emerged as a powerful technique for inferring the structure of a complex network of genes. Might such a technique prove helpful for epidemiological analysis?
One promising approach is 'logic regression', which considers a set of tree-structured models relating measurable inputs (genes and exposures) to a disease trait through a network of unobserved intermediate nodes representing logical operators (AND, OR, XOR etc). To allow for uncertainty about model form, an MCMC method is used to update the structure of the graphical model by adding, deleting, moving or changing the types of the intermediate nodes. Although appealing as a way of representing the biochemical pathways, logic regression does not exploit any external information about the form of the network. It also treats all intermediate nodes as binary, so it is more suitable for modelling regulatory than metabolic pathways where the intermediate nodes would represent continuous metabolite concentrations.
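The tree-structured form of a logic regression model can be illustrated with a short recursive evaluator. In this Python sketch a tree is either the name of a binary input or a tuple of an operator and two subtrees; the particular tree, input names and representation are hypothetical and show only how binary intermediate nodes combine inputs, not how the tree would be searched or fitted.

```python
import operator

# Logical operators allowed at the intermediate nodes.
OPS = {"AND": operator.and_, "OR": operator.or_, "XOR": operator.xor}


def evaluate(node, inputs):
    """Evaluate a logic tree.

    node is either an input name (str) or a tuple (op, left, right);
    inputs maps input names to 0/1 values (genes or exposures).
    """
    if isinstance(node, str):
        return bool(inputs[node])
    op, left, right = node
    return OPS[op](evaluate(left, inputs), evaluate(right, inputs))


# Hypothetical tree: risk is raised if (G1 AND E1) OR (G2 XOR G3).
tree = ("OR", ("AND", "G1", "E1"), ("XOR", "G2", "G3"))
at_risk = evaluate(tree, {"G1": 1, "E1": 1, "G2": 0, "G3": 0})
```

The binary value at the root would then typically enter a regression model for the disease trait, and a search over trees would propose changes to the operators and the arrangement of inputs.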
When both input nodes (the 'parents' pj = [pj1, pj2]) are binary, various combinations of θs will yield the full range of possible logical operators (eg AND = [0,0], OR = [1,1]), but this framework allows great flexibility in modelling interactions between continuous nodes, while remaining identifiable. The Ms are treated as deterministic nodes, so the final metabolite concentration MJ(E, G; Λ, θ) can be calculated via a simple recursion. The disease risk is assumed to have a logistic dependence on MJ. Prior knowledge about the topology can be incorporated by use of a measure of similarity of each fitted network to the postulated true network (eg the proportion of connections in the true graph which are represented in the fitted one, minus the number of connections in the fitted graph which are not represented in the true one). In the spirit of Monte Carlo logic regression, the topology of the graph is modified by proposing to add or delete nodes or to move a connection between them using the Metropolis-Hastings algorithm. Finally, the model parameters are updated conditional on the current model form. By post-processing the resulting set of graphs, various kinds of inference can be drawn, such as the posterior probability that a given input appears in the fitted graphs, that a pair of inputs is represented by a node in the graph, or the marginal effect of any input or combination of inputs on the disease risk. In small simulations, we demonstrated that the model could correctly identify the true network structure (or logically equivalent ones) and estimate the parameters well, while not identifying any incorrect models. In an application to data on ten candidate genes from the Children's Health Study, we were able to replicate the interactions found by a purely exploratory technique and identified several alternative networks with comparable Bayes factors.
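The similarity measure and the accept/reject step described above can be sketched as follows. The similarity function follows the verbal definition in the text (a proportion minus a count). How it enters the prior (here, as an exponential weight with a hypothetical `strength` parameter) and the assumption of a symmetric proposal are illustrative choices, not details taken from the original method.

```python
import math
import random


def topology_similarity(fitted_edges, true_edges):
    """Proportion of postulated connections recovered by the fitted graph,
    minus the number of fitted connections absent from the postulated graph."""
    fitted, true = set(fitted_edges), set(true_edges)
    if not true:
        raise ValueError("the postulated graph must have at least one edge")
    recovered = len(fitted & true) / len(true)
    extra = len(fitted - true)
    return recovered - extra


def log_topology_prior(fitted_edges, true_edges, strength=1.0):
    """Assumed form: log prior proportional to the similarity score."""
    return strength * topology_similarity(fitted_edges, true_edges)


def accept_move(log_post_current, log_post_proposed, rng=random):
    """Metropolis acceptance rule, assuming a symmetric proposal
    (otherwise a Hastings correction term would be needed)."""
    log_ratio = log_post_proposed - log_post_current
    return log_ratio >= 0 or rng.random() < math.exp(log_ratio)


# Hypothetical postulated and fitted graphs, as (parent, child) pairs.
postulated = {("G1", "N1"), ("G2", "N1"), ("N1", "D")}
fitted = {("G1", "N1"), ("G3", "N1"), ("N1", "D")}
```

In this example the fitted graph recovers two of the three postulated connections and contains one connection that was not postulated, giving a similarity of 2/3 - 1.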
The folate pathway poses difficulties for mechanistic modelling because it is not a directed acyclic graph (DAG); although each arrow in Figure 1 is directed, the graph contains numerous cycles (feedback loops), making direct computation of probabilities difficult. In some instances, such cycles can be treated as single composite nodes with complex deterministic or stochastic laws, thereby rendering the remainder of the graph acyclic, but when there are many interconnected cycles, as in the folate pathway, such decomposition may be difficult or impossible to identify. Might it be possible, however, to identify a simpler DAG that captures the key behaviour of the network? Since any DAG would be an oversimplification and there could be many such DAGs that provide a reasonable approximation, the problem of model uncertainty is important.
A further extension of the Baurley et al. approach to the folate simulation will now be summarised. As in their approach, we assume that each node has exactly two inputs, but now distinguish three basic types of nodes, G × G, G × M (or G × E) and M × M. G × G nodes are treated as logical operators, yielding a binary output as high or low risk. G × M and G × E nodes represent intermediate metabolite concentrations, treated as continuous variables with deterministic values given by Michaelis-Menten kinetics with rate parameters Vmax(G) and Km. M × M nodes are regression expressions yielding a continuous output variable with the mean parameterised as in Eq. (5). Disease risk is assumed to have a logistic dependence on one or more of the Zs. Finally, each measured biomarker B is assumed to be log-normally distributed around one of the Ms, with some measurement error variance. Rather than treating the intermediate nodes as deterministic, the likelihood of the entire graph is now calculated by peeling over possible states of all the intermediate nodes.
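The building blocks named above can be sketched as follows, using the standard Michaelis-Menten form v = Vmax × S / (Km + S) for a G × M node, a log-normal density for a biomarker and a logistic link for disease risk. All numerical values, the genotype coding (0, 1, 2) and the function names are hypothetical. The sketch evaluates single nodes only and does not implement the peeling calculation over the whole graph.

```python
import math

# Hypothetical genotype-specific maximum velocities (illustrative values only).
VMAX_BY_GENOTYPE = {0: 1.0, 1: 0.7, 2: 0.4}


def michaelis_menten_node(substrate, genotype, km=0.5,
                          vmax_by_genotype=VMAX_BY_GENOTYPE):
    """Output of a G x M node: Vmax(G) * S / (Km + S)."""
    vmax = vmax_by_genotype[genotype]
    return vmax * substrate / (km + substrate)


def biomarker_log_density(b, m, sigma):
    """Log density of a biomarker B that is log-normally distributed
    around the node value m, with log-scale measurement error sd sigma."""
    z = (math.log(b) - math.log(m)) / sigma
    return -math.log(b * sigma * math.sqrt(2.0 * math.pi)) - 0.5 * z * z


def disease_risk(m, beta0=-2.0, beta1=1.0):
    """Logistic dependence of disease risk on a node value (assumed coefficients)."""
    return 1.0 / (1.0 + math.exp(-(beta0 + beta1 * m)))
```

For instance, `michaelis_menten_node(0.5, genotype=2)` gives the metabolite level for a carrier of two copies of a hypothetical low-activity allele, which could then be passed to `disease_risk` or compared with a measured biomarker through `biomarker_log_density`.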
Pathways in a genome-wide context
Genome-wide association studies (GWAS) are generally seen as 'agnostic' -- the antithesis of hypothesis-driven pathway-based studies. Aside from the daunting computational challenge, their primary goal is, after all, the discovery of novel genetic associations, possibly in genes with unknown function or even with genomic variation in 'gene desert' regions not known to harbour genes. How, then, could one hope to incorporate prior knowledge in a GWAS? The response has generally been to wait until the GWAS has been completed (after a multi-stage scan and independent replication) and then conduct various in vitro functional studies of the novel associations before attempting any pathway modelling.
The idea of incorporating prior knowledge from genomic annotation databases or other sources as a way of improving the power of a genome-wide scan for discovery has, however, been suggested by several authors. Roeder et al., Saccone et al., Wakefield [76–78] and Whittemore introduced variants of a weighted false discovery rate (FDR), while Lewinger et al. and Chen and Witte described hierarchical modelling approaches for this purpose. These could be applied at any stage of a GWAS to improve the prioritisation of variants to be taken forward to the next stage. For example, Sebastiani et al. used a Bayesian test to incorporate external information for prioritising SNP associations from the first stage of a GWAS using pooled DNA, to be subsequently tested using individual genotyping. Roeder et al. originally suggested the idea of exploiting external information in the context of using a prior linkage scan to focus attention on regions of the genome more likely to harbour causal variants, but subsequent authors have noted that various other types of information, such as linkage disequilibrium, functional characterisation or evolutionary conservation, could be included as predictors. An advantage of hierarchical modelling is that multiple sources can be readily incorporated in a flexible regression framework, whereas the weighted FDR requires a priori choice of a specific weighting scheme.
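To make the weighting idea concrete, the following sketch applies the Benjamini-Hochberg step-up rule to p-values divided by prior weights that have been rescaled to average one. This is a generic weighted procedure shown under stated assumptions, not the specific scheme of any of the authors cited above; the function name `weighted_bh` and the example p-values and weights are hypothetical.

```python
def weighted_bh(p_values, weights, alpha=0.05):
    """Indices of hypotheses rejected by Benjamini-Hochberg applied to p_i / w_i.

    Weights express prior plausibility (eg from a linkage scan or annotation)
    and are rescaled to have mean one so that the overall error budget is kept.
    """
    m = len(p_values)
    if m == 0 or len(weights) != m:
        raise ValueError("p_values and weights must be non-empty and equal in length")
    if any(w <= 0 for w in weights):
        raise ValueError("weights must be positive")
    mean_w = sum(weights) / m
    scaled = [p / (w / mean_w) for p, w in zip(p_values, weights)]
    order = sorted(range(m), key=lambda i: scaled[i])
    cutoff_rank = 0
    for rank, i in enumerate(order, start=1):
        # Step-up rule: keep the largest rank whose scaled p-value passes.
        if scaled[i] <= alpha * rank / m:
            cutoff_rank = rank
    return sorted(order[:cutoff_rank])


# Hypothetical example: the second marker lies in a region favoured a priori.
p = [0.001, 0.04, 0.04, 0.5]
unweighted = weighted_bh(p, [1.0, 1.0, 1.0, 1.0])
weighted = weighted_bh(p, [1.0, 2.0, 0.5, 0.5])
```

With equal weights only the first marker is declared significant; doubling the weight of the second marker (at the expense of the third and fourth) brings it below its threshold as well, which illustrates both the potential gain and the dependence on the chosen weighting scheme.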
A recent trend has been the incorporation of pathway inference in genome-wide association scans [75, 83–89], borrowing ideas from the extensive literature on network analysis of gene expression array data [90, 91]. Currently, the most widely used tool for this purpose is gene set enrichment analysis, which in GWAS applications aims to test whether groups of genes in a common pathway tend to rank higher in significance. Several published applications have yielded novel insights using this approach [93–96], although others have found that no specific pathway outranks the most significant single markers [89, 97, 98], suggesting that the approach may not be ideal for all complex diseases. Many other empirical approaches have been used in the gene-expression field, including Bayesian network analysis [69, 99, 100], neural networks, support vector machines and a variety of other techniques from the fields of bioinformatics, computational or systems biology and machine learning [103–111]. Most of these are empirical, albeit in the sense of trying to reconstruct the unknown network structure from observational data, rather than using a known network to analyse the observational data. It is less obvious how such methods could be applied to mining single-marker associations from a GWAS, but they could be helpful in mining G × G interactions. Even simple analyses of GWAS data can be computationally demanding, particularly if all possible G × G interactions are to be included, and analyses incorporating pathway information are likely to be even more daunting. Recent developments in computational algorithms for searching high-dimensional spaces and parallel cluster computing implementations may, however, make this feasible.
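The ranking idea behind gene set enrichment analysis can be illustrated with an unweighted Kolmogorov-Smirnov-style running sum, the simplest form of the enrichment statistic. This is a sketch under simplifying assumptions: genes are already ranked from most to least significant, every gene counts equally, and significance is assessed by shuffling the ranking, whereas GWAS applications would more usually permute phenotypes to preserve linkage disequilibrium. The function names and gene labels are hypothetical.

```python
import random


def enrichment_score(ranked_genes, gene_set):
    """Maximum deviation from zero of a running sum that steps up at pathway
    genes and down at all other genes, walking down the ranked list."""
    hits = [g in gene_set for g in ranked_genes]
    n_hit = sum(hits)
    n_miss = len(hits) - n_hit
    if n_hit == 0 or n_miss == 0:
        raise ValueError("the ranked list must contain both pathway and other genes")
    running, best = 0.0, 0.0
    for is_hit in hits:
        running += 1.0 / n_hit if is_hit else -1.0 / n_miss
        if abs(running) > abs(best):
            best = running
    return best


def permutation_p_value(ranked_genes, gene_set, n_perm=1000, seed=0):
    """One-sided p-value for enrichment near the top of the list,
    obtained by randomly shuffling the ranking (an assumed null model)."""
    rng = random.Random(seed)
    observed = enrichment_score(ranked_genes, gene_set)
    shuffled = list(ranked_genes)
    exceed = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)
        if enrichment_score(shuffled, gene_set) >= observed:
            exceed += 1
    return (exceed + 1) / (n_perm + 1)


# Hypothetical ranking in which both pathway genes sit at the top.
ranking = ["A", "B", "C", "D", "E", "F"]
pathway = {"A", "B"}
```

A pathway whose genes cluster at the top of the ranking gives a score close to +1, and one whose genes cluster at the bottom gives a score close to -1.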
Recently, several authors [112–116] have undertaken analyses of the association of genome-wide expression data with genome-wide SNP genotypes in search of patterns of genetic control that would identify cis- and trans-activating factors and master regulatory regions. Ultimately, one could foresee using networks inferred from gene expression directly as priors in a hierarchical modelling analysis for GWAS data, or a joint analysis of the two phenotypes, but this has yet to be attempted. Other novel technologies, such as whole-genome sequencing, metabolomics, proteomics and so on may provide other types of data that will inform pathway-based analysis on a genome-wide scale.
As in any other form of statistical modelling, the analyst should be cautious in interpretation. As pointed out by Jansen:
'So, the modeling of the interplay of many genes -- which is the aim of complex systems biology -- is not without danger. Any model can be wrong (almost by definition), but particularly complex (overparameterized) models have much flexibility to hide their lack of biological relevance' [emphasis added].
A good fit to a particular model does not, of course, establish the truth of the model. Instead, the value of models, whether descriptive or mechanistic, lies in their ability to organise a range of hypotheses into a systematic framework in which simpler models can be tested against more complex alternatives. The usefulness of the Armitage-Doll multistage model of carcinogenesis, for example, lies not in our belief that it is a completely accurate description of the process, but rather in its ability to distinguish whether a carcinogen appears to act early or late in the process or at more than one stage. Similarly, the importance of the Moolgavkar-Knudson two-stage clonal-expansion model lies in its ability to test whether a carcinogen acts as an 'initiator' (ie on the mutation rates) or a 'promoter' (ie on proliferation rates). Such inferences can be valuable, even if the model itself is an incomplete description of the process, as must always be the case.
Although mechanistic models do make some testable predictions about such things as the shape of the dose-response relationship and the modifying effects of time-related variables, testing such patterns against epidemiological data tends to provide only weak evidence in support of the alternative models, and only within the context of all the other assumptions involved. Generally, comparisons of alternative models (or specific sub-models) can only be accomplished by direct fitting. Visualisation of the fit to complex epidemiological datasets can be challenging. Any mechanistic interpretations of model fits should therefore consider carefully the robustness of these conclusions to possible misspecification of other parts of the model.
This work was supported in part by NIH grants R01-CA92562, P50-ES07048, R01-CA112237 and U01-ES015090 (D.C.T., D.V.C., J.B.), R01-CA105437, R01-CA105145, R01-CA59045 (C.M.U.) and NSF grants DMS-0616710 and DMS-0109872 (F.N., M.R.). The authors are particularly grateful to Wei Liang and Fan Yang for programming support.
- Thomas DC: The need for a comprehensive approach to complex pathways in molecular epidemiology. Cancer Epidemiol Biomarkers Prev. 2005, 14: 557-559. 10.1158/1055-9965.EPI-14-3-EDB.
- Thomas DC, Baurley JW, Brown EE, Figueiredo J, et al: Approaches to complex pathways in molecular epidemiology: Summary of an AACR special conference. Cancer Res. 2008, 68: 10028-10030. 10.1158/0008-5472.CAN-08-1690.
- Cook NR, Zee RY, Ridker PM: Tree and spline based association analysis of gene-gene interaction models for ischemic stroke. Stat Med. 2004, 23: 1439-1453. 10.1002/sim.1749.
- Ritchie MD, Hahn LW, Roodi N, Bailey LR, et al: Multifactor-dimensionality reduction reveals high-order interactions among estrogen-metabolism genes in sporadic breast cancer. Am J Hum Genet. 2001, 69: 138-147. 10.1086/321276.
- Hoh J, Ott J: Mathematical multi-locus approaches to localizing complex human trait genes. Nat Rev Genet. 2003, 4: 701-709.
- Tamayo P, Slonim D, Mesirov J, Zhu Q, et al: Interpreting patterns of gene expression with self-organizing maps: Methods and application to hematopoietic differentiation. Proc Natl Acad Sci USA. 1999, 96: 2907-2912. 10.1073/pnas.96.6.2907.
- Tahri-Daizadeh N, Tregouet DA, Nicaud V, Manuel N, et al: Automated detection of informative combined effects in genetic association studies of complex traits. Genome Res. 2003, 13: 1952-1960.
- Potter JD: Colorectal cancer: Molecules and populations. J Natl Cancer Inst. 1999, 91: 916-932. 10.1093/jnci/91.11.916.
- Frosst P, Blom HJ, Milos R, Goyette P, et al: A candidate genetic risk factor for vascular disease: A common mutation in methylenetetrahydrofolate reductase. Nat Genet. 1995, 10: 111-113. 10.1038/ng0595-111.
- Ulrich CM, Potter JD: Folate supplementation: Too much of a good thing?. Cancer Epidemiol Biomarkers Prev. 2006, 15: 189-193. 10.1158/1055-9965.EPI-152CO.
- Molloy AM, Brody LC, Mills JL, Scott JM, et al: The search for genetic polymorphisms in the homocysteine/folate pathway that contribute to the etiology of human neural tube defects. Birth Defects Res A Clin Mol Teratol. 2009, 85: 285-294. 10.1002/bdra.20566.
- Nijhout HF, Reed MC, Budu P, Ulrich CM: A mathematical model of the folate cycle: New insights into folate homeostasis. J Biol Chem. 2004, 279: 55008-55016. 10.1074/jbc.M410818200.
- Nijhout HF, Reed MC, Ulrich CM: Mathematical models of folate-mediated one-carbon metabolism. Vitam Horm. 2008, 79: 45-82.
- Reed MC, Nijhout HF, Neuhouser ML, Gregory JF, et al: A mathematical model gives insights into nutritional and genetic aspects of folate-mediated one-carbon metabolism. J Nutr. 2006, 136: 2653-2661.
- Reed MC, Thomas RL, Pavisic J, James SJ, et al: A mathematical model of glutathione metabolism. Theor Biol Med Model. 2008, 5: 8. 10.1186/1742-4682-5-8.
- Ulrich CM, Neuhouser M, Liu AY, Boynton A, et al: Mathematical modeling of folate metabolism: Predicted effects of genetic polymorphisms on mechanisms and biomarkers relevant to carcinogenesis. Cancer Epidemiol Biomarkers Prev. 2008, 17: 1822-1831. 10.1158/1055-9965.EPI-07-2937.
- Hung RJ, Brennan P, Malaveille C, Porru S, et al: Using hierarchical modeling in genetic association studies with multiple markers: Application to a case-control study of bladder cancer. Cancer Epidemiol Biomarkers Prev. 2004, 13: 1013-1021.
- Capanu M, Orlow I, Berwick M, Hummer AJ, et al: The use of hierarchical models for estimating relative risks of individual genetic variants: An application to a study of melanoma. Stat Med. 2008, 27: 1973-1992. 10.1002/sim.3196.
- Hung RJ, Baragatti M, Thomas D, McKay J, et al: Inherited predisposition of lung cancer: A hierarchical modeling approach to DNA repair and cell cycle control pathways. Cancer Epidemiol Biomarkers Prev. 2007, 16: 2736-2744. 10.1158/1055-9965.EPI-07-0494.
- Conti DV, Lewinger JP, Swan GE, Tyndale RF, et al: Using ontologies in hierarchical modeling of genes and exposures in biologic pathways. Phenotypes and Endophenotypes: Foundations for Genetic Studies of Nicotine Use and Dependence. Edited by: Swan GE. 2009, NCI Tobacco Control Monographs, Bethesda, MD, 539-584.
- Rebbeck TR, Spitz M, Wu X: Assessing the function of genetic variants in candidate gene association studies. Nat Rev Genet. 2004, 5: 589-597.
- Besag J, York J, Mollie A: Bayesian image restoration with two applications in spatial statistics (with discussion). Ann Inst Statist Math. 1991, 43: 1-59. 10.1007/BF00116466.
- Cortessis V, Thomas DC: Toxicokinetic genetics: An approach to gene-environment and gene-gene interactions in complex metabolic pathways. Mechanistic Considerations in the Molecular Epidemiology of Cancer. Edited by: Bird P, Boffetta P, Buffler P, Rice J. 2003, IARC Scientific Publications, Lyon, France, 127-150.
- Du L, Conti DV, Thomas DC: Physiologically-based pharmacokinetic modeling platform for genetic and exposure effects in metabolic pathways. Genet Epidemiol. 2006, 29: 234.
- Lunn DJ, Thomas A, Best N, Spiegelhalter D: WinBUGS -- A Bayesian modelling framework: Concepts, structure, and extensibility. Stat Comput. 2000, 10: 325-337. 10.1023/A:1008929526011.
- Lunn DJ, Best N, Thomas A, Wakefield J, Spiegelhalter D: Bayesian analysis of population PK/PD models: General concepts and software. J Pharmacokinet Pharmacodyn. 2002, 29 (3): 271-307. 10.1023/A:1020206907668.
- Racine-Poon A, Wakefield J: Statistical methods for population pharmacokinetic modelling. Stat Meth Med Res. 1998, 7: 63-84. 10.1191/096228098670696372.
- Bois FY: Applications of population approaches in toxicology. Toxicol Lett. 2001, 120: 385-394. 10.1016/S0378-4274(01)00270-3.
- Bennett JE, Wakefield JC: A comparison of a Bayesian population method with two methods as implemented in commercially available software. J Pharmacokinet Biopharm. 1996, 24: 403-432. 10.1007/BF02353520.
- Wakefield J: Bayesian individualization via sampling-based methods. J Pharmacokinet Biopharm. 1996, 24: 103-131. 10.1007/BF02353512.
- Best NG, Tan KK, Gilks WR, Spiegelhalter DJ: Estimation of population pharmacokinetics using the Gibbs sampler. J Pharmacokinet Biopharm. 1995, 23: 407-435. 10.1007/BF02353641.
- Kou SC, Cherayil BJ, Min W, English BP, et al: Single-molecule Michaelis-Menten equations. J Phys Chem B. 2005, 109: 19068-19081. 10.1021/jp051490q.
- English BP, Min W, van Oijen AM, Lee KT, et al: Ever-fluctuating single enzyme molecules: Michaelis-Menten equation revisited. Nat Chem Biol. 2006, 2: 87-94. 10.1038/nchembio759.
- Marjoram P, Molitor J, Plagnol V, Tavare S: Markov chain Monte Carlo without likelihoods. Proc Natl Acad Sci USA. 2003, 100: 15324-15328. 10.1073/pnas.0306899100.
- Thomas DC: Using gene-environment interactions to dissect the effects of complex mixtures. J Expo Sci Environ Epidemiol. 2007, 17 (Suppl 2): S71-S74.
- Parl F, Crooke P, Conti DV, Thomas DC: Pathway-based methods in molecular cancer epidemiology. Fundamentals of Molecular Epidemiology. Edited by: Rebbeck TR, Ambrosone CB, Shields PG. 2008, Informa Healthcare, New York, NY, 189-204.
- Spiegelman D, Carroll RJ, Kipnis V: Efficient regression calibration for logistic regression in main study/internal validation study designs with an imperfect reference instrument. Stat Med. 2001, 20: 139-160. 10.1002/1097-0258(20010115)20:1<139::AID-SIM644>3.0.CO;2-K.
- Holcroft CA, Spiegelman D: Design of validation studies for estimating the odds ratio of exposure-disease relationships when exposure is misclassified. Biometrics. 1999, 55: 1193-1201. 10.1111/j.0006-341X.1999.01193.x.
- Thomas DC: Multistage sampling for latent variable models. Lifetime Data Anal. 2007, 13: 565-581. 10.1007/s10985-007-9061-1.
- Breslow NE, Chatterjee N: Design and analysis of two-phase studies with binary outcome applied to Wilms tumor prognosis. Appl Statist. 1999, 48: 457-468. 10.1111/1467-9876.00165.
- Carroll RJ, Ruppert D, Stefanski LA, Crainiceanu CM: Measurement Error in Nonlinear Models: A Modern Perspective. 2006, Chapman and Hall CRC Press, London, UK, 2nd edn.
- Thomas DC, Stram D, Dwyer J: Exposure measurement error: Influence on exposure-disease relationships and methods of correction. Annu Rev Publ Health. 1993, 14: 69-93. 10.1146/annurev.pu.14.050193.000441.
- Davey Smith G, Ebrahim S: "Mendelian randomization": Can genetic epidemiology contribute to understanding environmental determinants of disease?. Int J Epidemiol. 2003, 32: 1-22. 10.1093/ije/dyg070.
- Davey Smith G, Ebrahim S: Mendelian randomization: Prospects, potentials, and limitations. Int J Epidemiol. 2004, 33: 30-42. 10.1093/ije/dyh132.
- Davey Smith G, Ebrahim S: What can Mendelian randomisation tell us about modifiable behavioural and environmental exposures?. BMJ. 2005, 330: 1076-1079. 10.1136/bmj.330.7499.1076.
- Lewis SJ, Davey Smith G: Alcohol, ALDH2, and esophageal cancer: A meta-analysis which illustrates the potentials and limitations of a Mendelian randomization approach. Cancer Epidemiol Biomarkers Prev. 2005, 14: 1967-1971. 10.1158/1055-9965.EPI-05-0196.
- Thompson JR, Minelli C, Abrams KR, Tobin MD, et al: Meta-analysis of genetic studies using Mendelian randomization -- A multivariate approach. Stat Med. 2005, 24: 2241-2254. 10.1002/sim.2100.
- Tobin MD, Minelli C, Burton PR, Thompson JR: Commentary: Development of Mendelian randomization: From hypothesis test to "Mendelian deconfounding". Int J Epidemiol. 2004, 33: 26-29. 10.1093/ije/dyh016.
- Glynn RJ: Commentary. Genes as instruments for evaluation of markers and causes. Int J Epidemiol. 2006, 35: 932-934. 10.1093/ije/dyl107.
- Hernan MA, Robins JM: Instruments for causal inference: An epidemiologist's dream?. Epidemiology. 2006, 17: 360-372. 10.1097/01.ede.0000222409.00878.37.
- Brookhart MA, Wang PS, Solomon DH, Schneeweiss S: Instrumental variable analysis of secondary pharmacoepidemiologic data. Epidemiology. 2006, 17: 373-374. 10.1097/01.ede.0000222026.42077.ee.
- Buzas JS, Stefanski LA: Instrumental variable estimation in generalized linear measurement error models. J Am Stat Assoc. 1996, 91: 999-1006.
- Greenland S: An introduction to instrumental variables for epidemiologists. Int J Epidemiol. 2000, 29: 1102. 10.1093/oxfordjournals.ije.a019909.
- Martens EP, Pestman WR, de Boer A, Belitser SV, et al: Instrumental variables: Application and limitations. Epidemiology. 2006, 17: 260-267. 10.1097/01.ede.0000215160.88317.cb.
- Didelez V, Sheehan N: Mendelian randomization as an instrumental variable approach to causal inference. Stat Meth Med Res. 2007, 16: 309-330. 10.1177/0962280206077743.
- Nitsch D, Molokhia M, Smeeth L, DeStavola BL, et al: Limits to causal inference based on Mendelian randomization: A comparison with randomized controlled trials. Am J Epidemiol. 2006, 163: 397-403. 10.1093/aje/kwj062.
- Bautista LE, Smeeth L, Hingorani AD, Casas JP: Estimation of bias in nongenetic observational studies using "Mendelian triangulation". Ann Epidemiol. 2006, 16: 675-680. 10.1016/j.annepidem.2006.02.001.
- Thomas DC, Conti DV: Commentary. The concept of "Mendelian randomization". Int J Epidemiol. 2004, 33: 21-25. 10.1093/ije/dyh048.
- Newcomb PA, Baron J, Cotterchio M, Gallinger S, et al: Colon cancer family registry: An international resource for studies of the genetic epidemiology of colon cancer. Cancer Epidemiol Biomarkers Prev. 2007, 16: 2331-2343. 10.1158/1055-9965.EPI-07-0648.
- Meier S, Gehring C: A guide to the integrated application of on-line data mining tools for the inference of gene functions at the systems level. Biotechnol J. 2008, 3: 1375-1387. 10.1002/biot.200800142.
- Thomas PD, Mi H, Swan GE, Lerman C, et al: A systems biology network model for genetic association studies of nicotine addiction and treatment. Pharmacogenet Genomics. 2009, 19: 538-551. 10.1097/FPC.0b013e32832e2ced.
- Werner T: Bioinformatics applications for pathway analysis of microarray data. Curr Opin Biotechnol. 2008, 19: 50-54. 10.1016/j.copbio.2007.11.005.
- Jensen LJ, Saric J, Bork P: Literature mining for the biologist: From information retrieval to biological discovery. Nat Rev Genet. 2006, 7: 119-129. 10.1038/nrg1768.
- Ashburner M, Ball CA, Blake JA, Botstein D, et al: Gene ontology: Tool for the unification of biology. Nat Genet. 2000, 25: 25-29. 10.1038/75556.
- Conti DV, Cortessis V, Molitor J, Thomas DC: Bayesian modeling of complex metabolic pathways. Hum Hered. 2003, 56: 83-93. 10.1159/000073736.
- George EI, McCulloch RE: Variable selection via Gibbs sampling. J Am Stat Assoc. 1993, 88: 881-889.
- Park T, Casella G: The Bayesian lasso. J Am Stat Assoc. 2008, 103: 681-686. 10.1198/016214508000000337.
- Zou H, Hastie T: Regularization and variable selection via the elastic net. J R Stat Soc Ser B. 2005, 67: 301-320. 10.1111/j.1467-9868.2005.00503.x.
- Friedman N: Inferring cellular networks using probabilistic graphical models. Science. 2004, 303: 799-805. 10.1126/science.1094068.
- Ruczinski I, Kooperberg C, LeBlanc ML: Exploring interactions in high-dimensional genomic data: An overview of logic regression, with applications. J Multivar Anal. 2004, 90: 178-195. 10.1016/j.jmva.2004.02.010.
- Kooperberg C, Ruczinski I: Identifying interacting SNPs using Monte Carlo logic regression. Genet Epidemiol. 2005, 28: 157-170. 10.1002/gepi.20042.
- Hastings W: Monte Carlo sampling methods using Markov chains and their applications. Biometrika. 1970, 57: 97-109. 10.1093/biomet/57.1.97.
- Millstein J, Conti DV, Gilliland FD, Gauderman WJ: A testing framework for identifying susceptibility genes in the presence of epistasis. Am J Hum Genet. 2006, 78: 15-27. 10.1086/498850.
- Roeder K, Devlin B, Wasserman L: Improving power in genome-wide association studies: Weights tip the scale. Genet Epidemiol. 2007, 31: 741-747. 10.1002/gepi.20237.
- Saccone SF, Saccone NL, Swan GE, Madden PA, et al: Systematic biological prioritization after a genome-wide association study: An application to nicotine dependence. Bioinformatics. 2008, 24: 1805-1811. 10.1093/bioinformatics/btn315.
- Wakefield J: Bayes factors for genome-wide association studies: Comparison with p-values. Genet Epidemiol. 2008, 33: 79-86.
- Wakefield J: Reporting and interpretation in genome-wide association studies. Int J Epidemiol. 2008, 37: 641-653. 10.1093/ije/dym257.
- Wakefield J: A Bayesian measure of the probability of false discovery in genetic epidemiology studies. Am J Hum Genet. 2007, 81: 208-227. 10.1086/519024.
- Whittemore AS: A Bayesian false discovery rate for multiple testing. J Appl Statist. 2007, 34: 1-9. 10.1080/02664760600994745.
- Lewinger JP, Conti DV, Baurley JW, Triche TJ, et al: Hierarchical Bayes prioritization of marker associations from a genome-wide association scan for further investigation. Genet Epidemiol. 2007, 31: 871-882. 10.1002/gepi.20248.
- Chen GK, Witte JS: Enriching the analysis of genome-wide association studies with hierarchical modeling. Am J Hum Genet. 2007, 81: 397-404. 10.1086/519794.
- Sebastiani P, Zhao Z, Abad-Grau MM, Riva A, et al: A hierarchical and modular approach to the discovery of robust associations in genome-wide association studies from pooled DNA samples. BMC Genet. 2008, 9: 6.
- Wang K, Li M, Bucan M: Pathway-based approaches for analysis of genomewide association studies. Am J Hum Genet. 2007, 81: 1278-1283. 10.1086/522374.
- Elbers CC, van Eijk KR, Franke L, Mulder F, et al: Using genome-wide pathway analysis to unravel the etiology of complex diseases. Genet Epidemiol. 2009, 33: 419-431. 10.1002/gepi.20395.
- Chasman DI: On the utility of gene set methods in genome-wide association studies of quantitative traits. Genet Epidemiol. 2008, 32: 658-668. 10.1002/gepi.20334.
- Holden M, Deng S, Wojnowski L, Kulle B: GSEA-SNP: Applying gene set enrichment analysis to SNP data from genome-wide association studies. Bioinformatics. 2008, 24: 2784-2785. 10.1093/bioinformatics/btn516.
- Bush WS, Dudek SM, Ritchie MD: Biofilter: A knowledge-integration system for the multi-locus analysis of genome-wide association studies. Pac Symp Biocomput. 2009, 368-379.
- Rajagopalan D, Agarwal P: Inferring pathways from gene lists using a literature-derived network of biological relationships. Bioinformatics. 2005, 21: 788-793. 10.1093/bioinformatics/bti069.
- Hong MG, Pawitan Y, Magnusson PK, Prince JA: Strategies and issues in the detection of pathway enrichment in genome-wide association studies. Hum Genet. 2009.
- Mootha VK, Lindgren CM, Eriksson KF, Subramanian A, et al: PGC-1alpha-responsive genes involved in oxidative phosphorylation are coordinately downregulated in human diabetes. Nat Genet. 2003, 34: 267-273. 10.1038/ng1180.
- Pan W: Incorporating biological information as a prior in an empirical Bayes approach to analyzing microarray data. Stat Appl Genet Mol Biol. 2005, 4: Art. 12.
- Subramanian A, Tamayo P, Mootha VK, Mukherjee S, et al: Gene set enrichment analysis: A knowledge-based approach for interpreting genome-wide expression profiles. Proc Natl Acad Sci USA. 2005, 102: 15545-15550.
- Lesnick TG, Papapetropoulos S, Mash DC, Ffrench-Mullen J, et al: A genomic pathway approach to a complex disease: Axon guidance and Parkinson disease. PLoS Genet. 2007, 3: e98. 10.1371/journal.pgen.0030098.
- Baranzini SE, Galwey NW, Wang J, Khankhanian P, et al: Pathway and network-based analysis of genome-wide association studies in multiple sclerosis. Hum Mol Genet. 2009, 18: 2078-2090. 10.1093/hmg/ddp120.
- Torkamani A, Topol EJ, Schork NJ: Pathway analysis of seven common diseases assessed by genome-wide association. Genomics. 2008, 92: 265-272. 10.1016/j.ygeno.2008.07.011.
- Vink JM, Smit AB, de Geus EJ, Sullivan P, et al: Genome-wide association study of smoking initiation and current smoking. Am J Hum Genet. 2009, 84: 367-379. 10.1016/j.ajhg.2009.02.001.
- Perry JR, McCarthy MI, Hattersley AT, Zeggini E, et al: Interrogating type 2 diabetes genome-wide association data using a biological pathway-based approach. Diabetes. 2009, 58: 1463-1467. 10.2337/db08-1378.
- Kasperaviciute D, Weale ME, Shianna KV, Banks GT, et al: Large-scale pathways-based association study in amyotrophic lateral sclerosis. Brain. 2007, 130: 2292-2301. 10.1093/brain/awm055.
- Friedman N, Linial M, Nachman I, Pe'er D: Using Bayesian networks to analyze expression data. J Comput Biol. 2000, 7: 601-620. 10.1089/106652700750050961.
- Yu J, Smith VA, Wang PP, Hartemink AJ, et al: Advances to Bayesian network inference for generating causal networks from observational biological data. Bioinformatics. 2004, 20: 3594-3603. 10.1093/bioinformatics/bth448.
- Ritchie MD, White BC, Parker JS, Hahn CW, et al: Optimization of neural network architecture using genetic programming improves detection and modeling of gene-gene interactions in studies of human diseases. BMC Bioinformatics. 2003, 4: 28. 10.1186/1471-2105-4-28.
- Byvatov E, Schneider G: Support vector machine applications in bioinformatics. Appl Bioinformatics. 2: 67-77.
- Schafer J, Strimmer K: An empirical Bayes approach to inferring large-scale gene association networks. Bioinformatics. 2003, 21: 754-764.
- Wu CC, Huang HC, Juan HF, Chen ST: GeneNetwork: An interactive tool for reconstruction of genetic networks using microarray data. Bioinformatics. 2004, 20: 3691-3693. 10.1093/bioinformatics/bth428.
- Franke L, van Bakel H, Fokkens L, de Jong ED, et al: Reconstruction of a functional human gene network, with an application for prioritizing positional candidate genes. Am J Hum Genet. 2006, 78: 1011-1025. 10.1086/504300.
- Basso K, Margolin AA, Stolovitzky G, Klein U, et al: Reverse engineering of regulatory networks in human B cells. Nat Genet. 2005, 37: 382-390. 10.1038/ng1532.
- Kim TH, Ren B: Genome-wide analysis of protein-DNA interactions. Annu Rev Genom Hum Genet. 2006, 7: 81-102. 10.1146/annurev.genom.7.080505.115634.
- Tu Z, Wang L, Arbeitman MN, Chen T, et al: An integrative approach for causal gene identification and gene regulatory pathway inference. Bioinformatics. 2006, 22: e489-e496. 10.1093/bioinformatics/btl234.
- Yu H, Zhu X, Greenbaum D, et al: Topnet: A tool for comparing biological sub-networks, correlating protein properties with topological statistics. Nucleic Acids Res. 2004, 32: 328-337. 10.1093/nar/gkh164.PubMed CentralPubMedView ArticleGoogle Scholar
- Blais A, Dynlacht BD: Constructing transcriptional regulatory networks. Genes Dev. 2005, 19: 1499-1511. 10.1101/gad.1325605.PubMedView ArticleGoogle Scholar
- Xie Y, Pan W, Jeong KS, Khodursky A: Incorporating prior information via shrinkage: A combined analysis of genome-wide location data and gene expression data. Stat Med. 2007, 26: 2258-2275. 10.1002/sim.2703.PubMedView ArticleGoogle Scholar
- Dixon AL, Liang L, Moffatt MF, Chen W, et al: A genome-wide association study of global gene expression. Nat Genet. 2007, 39: 1202-1207. 10.1038/ng2109.PubMedView ArticleGoogle Scholar
- Stranger BE, Forrest MS, Clark AG, Minichiello MJ, et al: Genome-wide associations of gene expression variation in humans. PLoS Genet. 2005, 1: e78-10.1371/journal.pgen.0010078.PubMed CentralPubMedView ArticleGoogle Scholar
- Morley M, Molony CM, Weber TM, Devlin JL, et al: Genetic analysis of genome-wide variation in human gene expression. Nature. 2004, 430: 743-747. 10.1038/nature02797.PubMed CentralPubMedView ArticleGoogle Scholar
- Cheung VG, Spielman RS, Ewens KG, Weber TM, et al: Mapping determinants of human gene expression by regional and genome-wide association. Nature. 2005, 437: 1365-1369. 10.1038/nature04244.PubMed CentralPubMedView ArticleGoogle Scholar
- Cheung VG, Conlin LK, Weber TM, Arcaro M, et al: Natural variation in human gene expression assessed in lymphoblastoid cells. Nat Genet. 2003, 33: 422-425. 10.1038/ng1094.PubMedView ArticleGoogle Scholar
- Jansen RC: Studying complex biological systems using multifactorial perturbation. Nat Rev Genet. 2003, 4: 145-151.PubMedView ArticleGoogle Scholar
- Armitage P, Doll R: The age distribution of cancer and multi-stage theory of carcinogenesis. Br J Cancer. 1954, 8 (1): 1-12. 10.1038/bjc.1954.1.PubMed CentralPubMedView ArticleGoogle Scholar
- Moolgavkar S, Knudson A: Mutation and cancer: A model for human carcinogenesis. JNCI. 1981, 66: 1037-1052.PubMedGoogle Scholar