- Review

# Use of pathway information in molecular epidemiology

*Human Genomics*
**volume 4**, Article number: 21 (2009)

## Abstract

Candidate gene studies are generally motivated by some form of pathway reasoning in the selection of genes to be studied, but seldom has the logic of the approach been carried through to the analysis. Marginal effects of polymorphisms in the selected genes, and occasionally pairwise gene-gene or gene-environment interactions, are often presented, but a unified approach to modelling the entire pathway has been lacking. In this review, a variety of approaches to this problem is considered, focusing on hypothesis-driven rather than purely exploratory methods. Empirical modelling strategies are based on hierarchical models that allow prior knowledge about the structure of the pathway and the various reactions to be included as 'prior covariates'. By contrast, mechanistic models aim to describe the reactions through a system of differential equations with rate parameters that can vary between individuals, based on their genotypes. Some ways of combining the two approaches are suggested, and Bayesian model averaging methods for dealing with uncertainty about the true model form in either framework are discussed. Biomarker measurements can be incorporated into such analyses, and two-phase sampling designs stratified on some combination of disease, genes and exposures can be an efficient way of obtaining data that would be too expensive or difficult to obtain on a full candidate gene sample. The review concludes with some thoughts about potential uses of pathways in genome-wide association studies.

## Introduction

Molecular epidemiology has advanced from testing associations of disease with single polymorphisms, to exhaustive examination of all polymorphisms in a candidate gene using haplotype tagging single nucleotide polymorphisms (SNPs), to studying increasing numbers of candidate genes simultaneously. Often, gene-environment and gene-gene interactions are considered at the same time. As the number of main effects and interactions proliferates, there is a growing need for a more systematic approach to model development [1].

In recognition of this need, the American Association for Cancer Research held a special conference [2] in May 2007, bringing together experts in epidemiology, genetics, statistics, computational biology, systems biology, toxicology, bioinformatics and other fields to discuss various multidisciplinary approaches to this problem.

A broad range of exploratory methods have been developed recently for identifying interactions, such as neural nets, classification and regression trees, multi-factor dimension reduction, random forests, hierarchical clustering, etc. [3–7] Our focus here, however, is instead on hypothesis-driven methods based on prior understanding about the structure of biological pathways postulated to be relevant to a particular disease. Our primary purpose is to contrast mechanistic and empirical methods and explore ways of combining the two.

### The folate pathway as an example

Folate metabolism provides a rich example to illustrate these challenges. Folate has been implicated in colorectal cancer, [8] coronary heart disease [9] and neural tube defects, [10, 11] among other conditions. Several steps in the metabolism of folate could be involved in these various diseases (Figure 1) and could have quite different effects. The pathway is complex, involving 19 enzymes or carrier proteins, with various feedback loops and two main cycles, the folate and the methionine cycles. The former is involved in pyrimidine synthesis through the action of thymidylate synthase (TS), potentially leading to uracil misincorporation into DNA and subsequent DNA damage and repair or misrepair. The latter is involved in DNA methylation through the conversion of S-adenosyl methionine (SAM) to S-adenosyl homocysteine (SAH) by DNA-methyltransferase (DNMT). These two mechanisms in particular have been suggested as important links between folate and carcinogenesis, although other possibilities include purine synthesis (via the aminoimidazole-carboxamide ribonucleotide transferase [AICART] reaction) and homocysteine itself. Because polymorphisms that tend to increase one of these effects may decrease others, their effects on disease endpoints can be quite different, depending on which part of the pathway is more important. A detailed mathematical model for this system has been developed by Nijhout *et al.* [12, 13] and Reed *et al*., [14, 15] based on the equilibrium solution to a set of linked ordinary differential equations for Michaelis-Menten kinetics and implemented in software available at http://metabolism.math.duke.edu/.

To illustrate the various approaches, we simulated some typical data in the form that might be available from a molecular epidemiology study -- specifically data on genetic variants, various environmental exposures, a disease outcome or clinical trait, and, possibly, biomarker measurements on some or all subjects. We began with a population of 10,000 individuals with randomly generated values of intracellular folate *E*_{1} (the total tetrahydrofolate [THF] concentration in the six compartments forming the closed loop shown on the left-hand side of Figure 1), methionine intake *E*_{2} (METin, log-normally distributed) and 14 of the key genes *G* shown in Figure 1. For each gene, a person-specific value of the corresponding *V*_{max} was sampled from log-normal distributions with genotype-specific geometric means (GMs = 0.6, 0.8, or 1.1 times the overall GM) and common geometric standard deviations (GSD = 1.1) and *K*_{m} appropriate for that enzyme (see Table 1 in Reed *et al.* [14] for reference values for *V*_{max} and *K*_{m} for each gene). The differential equations were then evaluated to determine the steady-state solutions for ten intermediate metabolite concentrations and 14 reaction rates for each individual, based on their specific environmental variables and enzyme activity rates. The probability of disease was calculated under a logistic model for each of four scenarios for the causal biological mechanism -- homocysteine concentration, the rate of DNA methylation reactions and the rates of purine and pyrimidine synthesis -- and a binary disease indicator *Y* was sampled with the corresponding probability. Only the data on (*Y, E, G*) were retained from the first 500 cases and 500 controls for the first level of the epidemiological analysis. For some analyses, we also simulated biomarkers [16] on stratified subsamples of these subjects, as will be described later. Various summaries of the correlations among the (*X, E, G*) values for the remaining 9,000 subjects were deposited into what we shall call the 'external database' for use in constructing priors, as described below (no individual *Y* data were used for this purpose).
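The sampling step for the enzyme activities can be sketched as follows. This is a minimal illustration, not the authors' simulation code: the allele frequency `MAF`, the assignment of the multipliers 0.6, 0.8 and 1.1 to zero, one and two variant alleles, and the unit reference value `overall_gm` are all assumptions made here, and the differential-equation step that follows in the text is omitted.

```python
import numpy as np

rng = np.random.default_rng(0)

N_SUBJECTS = 10_000
N_GENES = 14
GM_MULTIPLIERS = np.array([0.6, 0.8, 1.1])  # genotype-specific GM ratios from the text
GSD = 1.1                                   # common geometric standard deviation
MAF = 0.3                                   # hypothetical allele frequency (not in the text)

# Genotype coded as the number of variant alleles (0, 1, 2), Hardy-Weinberg assumed.
genotypes = rng.binomial(2, MAF, size=(N_SUBJECTS, N_GENES))

# Placeholder for the enzyme-specific reference V_max values (assumed to be 1 here).
overall_gm = np.ones(N_GENES)

# Log-normal sampling: the mean of log(V_max) is the log of the genotype-specific GM,
# and the standard deviation of log(V_max) is log(GSD).
log_gm = np.log(overall_gm * GM_MULTIPLIERS[genotypes])
vmax = np.exp(rng.normal(loc=log_gm, scale=np.log(GSD)))
```

Working on the log scale makes the GM and GSD in the text map directly onto the mean and standard deviation of a normal draw.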

Table 1 shows the univariate associations of each gene with disease under each assumption about the causal risk factor. In these simulations, only one of these was taken as causal at a time, each scaled with the relative risk coefficient β = 2.0 per standard deviation of the respective risk factor. When homocysteine concentration was taken as the causal factor, the strongest association was with genetic variation in the cystathionine β-synthase (*CBS*) and S-adenosylhomocysteine hydrolase (*SAHH*) genes. The remaining three columns relate to various reaction rates as causal mechanisms. For pyrimidine synthesis (characterised here by the TS reaction rate), the strongest influence was seen for genetic variation in TS and the 5,10-methyleneTHF dehydrogenase (*MTD*) gene. For purine synthesis (reflected in the AICART reaction rate), the strongest associations were with genetic variation in the phosphoribosyl glycinamide transferase (*PGT*) gene and somewhat weaker for *MTD* and 5,10-methyleneTHF cyclohydrolase (*MTCH*) and serine hydroxymethyltransferase (*SHMT*) genes; interestingly, the disease risk is not particularly related to the *AICART* genotype itself. When DNA methylation (reflected by the *DNMT* reaction rate) was assumed to be causal, none of the genetic associations were as strong as for the other three causal mechanisms, the strongest being with the 5,10-methyleneTHF reductase (*MTHFR*) gene, *SAHH* and *MTD*. Genetic variation in *DNMT* was not explicitly simulated, but the reaction rates for this enzyme were identical to those for methionine adenosyl transferase (*MAT-II*) and SAHH, reflecting a rate-limiting step. Thus, genetic variation in *MAT-II* had no effect on risk, the reaction rate being driven entirely by *SAHH*. Other rate-limited combinations included dihydrofolate reductase (*DHFR*) with TS, MTD with MTCH, and PGT with AICART.
Methionine intake was the strongest environmental exposure factor for the simulation with homocysteine as the causal mechanism, whereas intracellular folate had a stronger effect under the other three mechanisms.

### Mechanistic vs empirical models

For the four highlighted simulations, we also conducted multiple logistic regressions in a stepwise manner, offering methionine, folate, the 14 genotypes and all 91 pairwise G × G and 28 G × E interactions (Table 2). These are difficult to interpret, however, owing to the large numbers of comparisons and unstable regression coefficients, particularly in the models that include interaction terms. In an attempt to gain greater insight into mechanisms, attention will now be turned to more pathway-driven modelling approaches, based on hierarchical or mechanistic models. The former extend the standard logistic models summarised in Table 2 by the addition of 'prior covariates' incorporating knowledge about the relative risk coefficients predicted by the pathway. The latter attempt to model the pathways explicitly, using simplified versions of physiologically based pharmacokinetic (PBPK) models, thereby requiring stronger assumptions about reaction dynamics and population distributions of rate parameters.

### Hierarchical models for disease-pathway associations

In the first level, the epidemiological data are fitted using a conventional 'empirical' model for the main effects and interactions among the various input genotypes and exposures, here denoted generically as **X** = (*X*_{ip})_{p = 1 ... P} = (*E, G, G* × *G, G* × *E, G* × *G* × *E*, ...); for example, a logistic regression model of the form

$$\mathrm{logit}\,\Pr(Y_i = 1 \mid \mathbf{X}_i) = \beta_0 + \sum_{p=1}^{P} \beta_p X_{ip} \qquad (1)$$

the sum being taken over the range of terms included in the **X** vector. Note that all possible effects of some predetermined complexity (eg all main effects and two-way, or perhaps higher order, interactions possibly limited to subsets relevant to the hypothesised pathway structure) are included, rather than using some form of model selection, as was done in the stepwise analyses summarised in Table 2.

In the second-level model, each of the regression coefficients from Eq. (1) is in turn regressed on a vector **Z**_{p} = (*Z*_{pv})_{v = 1 ... V} of 'prior covariates' describing characteristics of the corresponding terms in **X**; for example,

$$\beta_p \sim N\left(\boldsymbol{\pi}'\mathbf{Z}_p,\ \sigma^2\right) \qquad (2)$$

There are many possibilities for what could be included in the set of prior covariates, ranging from indicator variables for which of several pathways each gene might act in, [17] to *in silico* predictions of the functional significance of polymorphisms in each gene, [18, 19] to genomic annotation from formal ontologies [20]. Summaries of the effects of genes on expression levels ('genetical genomics') or of associations of genes with relevant biomarkers might also be used as prior covariates. Rebbeck *et al.* [21] provide a good review of available tools that could be used for constructing prior covariates.
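A quick way to see what a second-level regression on prior covariates does is the semi-Bayes approximation sketched below. It is a swapped-in simplification, not the fully Bayesian WinBUGS fit used in this review: the residual prior variance `tau2` is fixed by the analyst instead of being estimated, the second-level coefficients are obtained by weighted least squares, and the function name and the toy numbers are hypothetical.

```python
import numpy as np

def semi_bayes_shrinkage(beta_hat, se, Z, tau2):
    """Shrink first-level estimates towards a second-level regression on Z.

    beta_hat, se : first-level estimates and their standard errors (length P)
    Z            : P x V matrix of prior covariates
    tau2         : assumed residual prior variance of beta around Z @ pi
    """
    Z1 = np.column_stack([np.ones(len(beta_hat)), Z])  # add an intercept
    w = 1.0 / (se**2 + tau2)                           # marginal precisions
    ZtW = Z1.T * w                                     # weight each observation
    pi_hat = np.linalg.solve(ZtW @ Z1, ZtW @ beta_hat)
    prior_mean = Z1 @ pi_hat
    shrink = tau2 / (tau2 + se**2)                     # weight on the data
    beta_post = shrink * beta_hat + (1.0 - shrink) * prior_mean
    return pi_hat, beta_post

# Toy inputs: four coefficients and one indicator covariate
# (e.g. membership of a given cycle); all values are made up.
beta_hat = np.array([0.50, -0.20, 0.10, 0.80])
se = np.array([0.20, 0.30, 0.25, 0.40])
Z = np.array([[1.0], [0.0], [0.0], [1.0]])
pi_hat, beta_post = semi_bayes_shrinkage(beta_hat, se, Z, tau2=0.05)
```

Imprecise estimates (large `se`) are pulled most strongly towards the value predicted by their prior covariates, which is the 'borrowing of strength' that the hierarchical model formalises.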

Alternatively, one could model the variances, for example:
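One form consistent with this description is given here as a sketch. It assumes a log-linear link between the prior covariates and the variance; the link function is an assumption made for illustration, and only the symbols **π** and **φ** come from the surrounding text:

$$\beta_p \sim N\left(\boldsymbol{\pi}'\mathbf{Z}_p,\ \sigma_p^2\right), \qquad \log \sigma_p^2 = \boldsymbol{\varphi}'\mathbf{Z}_p$$

The log link keeps each prior variance positive whatever the sign of the coefficients in **φ**.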

For example, suppose the **X** vector comprised effects for different polymorphisms within each gene and one had some prior predictors of the effects of each polymorphism (eg *in silico* predictions of functional effects or evolutionary conservation) and other predictors of the general effects of genes (eg their roles in different pathways or the number of other genes that they are connected to in a pathway). Then, it might be appropriate to include the former in the **π**'**Z** part of the model for the means, and the latter in the **φ**'**Z** part of the model for the variances.

So far, the second-level models have assumed an independence prior for each of the regression coefficients; but now, suppose we have some prior information about the relationships *among* the genes, such as might come from networks inferred from gene co-expression data. Let **A** = (*A*_{pq})_{p, q = 1 ... P} denote a matrix of prior connectivities between pairs of genes -- for example, taking the value 1 if the two are adjacent (connected) in some network or 0 otherwise. Then, one might replace the independence prior of Eq. (2) by a multivariate prior of the form:
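A common parameterisation of such a prior is sketched here as an assumption, since several equivalent forms are in use. The symbols σ² and ρ are chosen to match the later discussion of ρ and positive definiteness, and a non-zero prior mean **π**'**Z** could replace the zero vector:

$$\boldsymbol{\beta} \sim N_P\left(\mathbf{0},\ \sigma^2 (\mathbf{I} - \rho \mathbf{A})^{-1}\right) \qquad (3)$$

Setting ρ = 0 recovers an independence prior, while non-zero ρ induces correlation between the coefficients of connected genes.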

This is known as the conditional autoregressive model, and is widely used in spatial statistics [22]. Sample WinBUGS code to implement these and other models described below is available in an online supplement.

In applications to the folate simulation, we tried two variants of this model. First, we considered three prior covariates in **Z**: an indicator for whether a gene is involved in the methionine cycle; whether it is involved in the folate cycle; and the number of other genes it is connected to in the entire network (a measure of the extent to which it might have a critical role as a 'hub' gene). The **A** matrix was specified in terms of whether a pair of genes had a metabolite in common, either as substrate or product.
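The construction of a connectivity matrix from shared metabolites can be sketched as below. The five-gene subset and its metabolite sets are a simplified illustration, not the 14-gene network of Figure 1. The covariance form σ²(**I** − ρ**A**)^{-1} is one common conditional autoregressive parameterisation and is assumed here; the helper names are hypothetical.

```python
import numpy as np

# Simplified, illustrative substrate/product sets for a few folate-pathway enzymes.
GENE_METABOLITES = {
    "MTHFR": {"5,10-CH2-THF", "5-CH3-THF"},
    "TS": {"5,10-CH2-THF", "DHF"},
    "DHFR": {"DHF", "THF"},
    "MS": {"5-CH3-THF", "THF", "Hcy", "Met"},
    "CBS": {"Hcy", "cystathionine"},
}

def adjacency(gene_metabolites):
    """A[p, q] = 1 if genes p and q share a metabolite (substrate or product)."""
    genes = list(gene_metabolites)
    A = np.zeros((len(genes), len(genes)))
    for p, gp in enumerate(genes):
        for q, gq in enumerate(genes):
            if p != q and gene_metabolites[gp] & gene_metabolites[gq]:
                A[p, q] = 1.0
    return genes, A

def rho_bounds(A):
    """Range of rho for which I - rho * A is positive definite."""
    eig = np.linalg.eigvalsh(A)
    return 1.0 / eig.min(), 1.0 / eig.max()

def car_covariance(A, rho, sigma2=1.0):
    """Prior covariance under the assumed form sigma2 * (I - rho * A)^-1."""
    return sigma2 * np.linalg.inv(np.eye(len(A)) - rho * A)

genes, A = adjacency(GENE_METABOLITES)
lo, hi = rho_bounds(A)
cov = car_covariance(A, rho=0.9 * hi)
```

The bounds follow from the eigenvalues of **A**: the matrix **I** − ρ**A** is positive definite exactly when 1 − ρ·e is positive for every eigenvalue e, which is why the posterior for ρ is confined to a finite interval.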

Table 3 summarises the results of several models, including these three prior covariates in the means or variance model, as well as the connectivities in the covariance model. As would be expected, in the zero mean model, all the significant parameter estimates were shrunk towards zero because of the large number of genes with no true effect in the model. In general, none of the prior covariates significantly predicted the means. The estimates of the βs in all these models were much closer to the simple maximum likelihood estimates (the first column), however, and their standard errors were generally somewhat smaller, indicating the 'borrowing of strength' from each other. In the model with covariates in the prior variances, however, the number of connections for each gene was significantly associated with the variance. In the final model, with correlations between genes being given by indicators for whether they were connected in the graph, the posterior distribution for the parameter **ρ** is constrained by the requirement that the covariance matrix be positive definite, but showed strong evidence of gene-gene correlations following the pattern given by the connectivities in Figure 1. The generally weak effects of prior covariates in these models may simply reflect the crudeness of these classifications. Below, we will revisit these models with more informative covariates based on the quantitative predictions of the differential equations model.

### Mechanistic models

Whereas hierarchical models are generally applicable whenever one has external information about the genes and exposures available, in some circumstances the dynamics of the underlying biological process may be well enough understood to support mechanistic modelling. These are typically based on systems of ordinary differential equations (ODEs) describing each of the intermediate nodes in a graphical model as deterministic quantities given by their various inputs (exposures or previous substrates) with reaction rates determined by genotypes (Figure 2). For example, in a sequence *j* = 1, ..., *J* of linear kinetic steps, with conversion from metabolite *M*_{j} to *M*_{j+1} at rate λ_{j} and removal at rate μ_{j}, the instantaneous concentration is given by the differential equation:

$$\frac{dM_j}{dt} = \lambda_{j-1} M_{j-1} - (\lambda_j + \mu_j) M_j \qquad (4)$$

leading to the equilibrium solution for the final metabolite *M*_{J} as:

$$M_J(E, \mathbf{G}) = X_0 \prod_{j=1}^{J} \frac{\lambda_{j-1}}{\lambda_j + \mu_j}$$

where *X*_{0} (taking the place of *M*_{0}) denotes the concentration of exposure *E*. This predicted equilibrium concentration of the final metabolite in the graph is then treated as a covariate in a logistic model for the risk of disease:

$$\mathrm{logit}\,\Pr(Y = 1 \mid E, \mathbf{G}) = \beta_0 + \beta_1 M_J(E, \mathbf{G})$$

If sufficient external knowledge about the genotype-specific reaction rates is available, these could be treated as fixed constants, but more likely they would need to be estimated jointly with the βs in the disease model using a non-linear fitting program. More sophisticated non-linear models are possible -- for example, incorporating Michaelis-Menten kinetics by replacing each of the λ*M* terms in Eq. (4) by expressions of the form:

$$\frac{V_{max}^{\lambda}(G)\, M}{K_m^{\lambda} + M}$$

and similarly for the μ*M* terms. The resulting equilibrium solutions for *M*_{J}(*E*, **G**) are now more complex solutions to a polynomial equation. For example, with only a single intermediate metabolite with one activation rate λ and one detoxification rate μ, the solution becomes:

$$M(E, \mathbf{G}) = \frac{\lambda E}{\mu - \left(\dfrac{\lambda}{K_m^{\mu}} - \dfrac{\mu}{K_m^{\lambda}}\right) E}$$

where $\lambda = V_{max}^{\lambda}(G_1)/K_m^{\lambda}$ and $\mu = V_{max}^{\mu}(G_2)/K_m^{\mu}$ denote the low-dose slopes of the two reactions. These solutions can be either upwardly or downwardly curvilinear in *E*, depending on whether the term in parentheses is positive or negative (basically, whether the creation of the intermediate exceeds the rate at which it can be removed). For the fitted values in the application below (third block of Table 4), the dose-response relationship for *M*|*E* is upwardly curved for all genotype combinations (not shown).
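The single-intermediate case can be checked numerically. The sketch below assumes that the intermediate is created from the exposure at the Michaelis-Menten rate V^λ·E/(K^λ + E) and removed at the rate V^μ·M/(K^μ + M), and solves for the concentration at which the two rates balance. The function names and the numerical values are illustrative only.

```python
import numpy as np

def steady_state_linear(E, lam, mu):
    """Equilibrium of dM/dt = lam * E - mu * M."""
    return lam * E / mu

def steady_state_mm(E, vmax_l, km_l, vmax_m, km_m):
    """Equilibrium when creation and removal both follow Michaelis-Menten kinetics."""
    creation = vmax_l * E / (km_l + E)
    if np.any(creation >= vmax_m):
        # Removal saturates below the creation rate, so no finite equilibrium exists.
        raise ValueError("creation rate exceeds maximal removal rate")
    return km_m * creation / (vmax_m - creation)

# Illustrative parameter values (not taken from Table 4).
vmax_l, km_l, vmax_m, km_m = 1.0, 2.0, 3.0, 4.0
lam, mu = vmax_l / km_l, vmax_m / km_m   # low-dose slopes

E = 1.0
m_numeric = steady_state_mm(E, vmax_l, km_l, vmax_m, km_m)
# Closed form written in terms of the low-dose slopes.
m_closed = lam * E / (mu - (lam / km_m - mu / km_l) * E)
```

At very low exposure the Michaelis-Menten solution approaches the linear one, λ*E*/μ, which is why λ and μ are described as low-dose slopes.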

A more realistic and more flexible model would allow for stochastic variation in the reaction rates λ_{ij} and μ_{ij} for each individual *i* conditional on their genotypes *G*_{ij}; for example, $\lambda_{ij} \sim LN\left(\bar{\lambda}_j(G_{ij}), \sigma_j^2\right)$ and likewise for μ_{ij} [23] or similarly for their corresponding *V*_{max} and *K*_{m} [24]. The population genotype-specific rates are, in turn, assumed to have log-normal prior distributions $\bar{\lambda}_j(g) \sim LN\left(\bar{\bar{\lambda}}_j, \omega_j^2\right)$ (and similarly for the μs), with vague priors on the population means $\bar{\bar{\lambda}}_j$, inter-individual variances $\sigma_j^2$ and between-genotype variances $\omega_j^2$. The individual data might be further supplemented by available biomarker measurements *B*_{ij} of either the enzyme activity levels or intermediate metabolite concentrations, modelled as $B_{ij} \sim LN\left(\lambda_{ij}, \omega^2\right)$ and $B_{ij} \sim LN\left(M_{ij}, \omega^2\right)$ respectively.

The WinBUGS software [25] has an add-in called PKBUGS, [26] which implements a Bayesian analysis of population pharmacokinetic parameters [27–31]. More complex models can, in principle, be fitted using the add-in WBDIFF (http://www.winbugs-development.org.uk/wbdiff.html), which allows user-specified differential equations as nodes in a Bayesian graphical model.

To illustrate the approach, we consider a highly simplified model with only a single intermediate metabolite *M* (homocysteine). We assume this is created at linear rate λ determined by *SAHH* and removed at linear rate μ determined by *CBS*. The ratios of λ and μ between genotypes are estimated jointly with β. The first two lines of Table 4 provide the results of fitting the linear kinetics model, with and without inter-individual variability in the two rate parameters. Although, of course, many other genes are involved in the simulated model, the estimated homocysteine concentrations *M* are strongly predictive of disease, and both genes have highly significant effects on their respective rates. Allowing additional random variability in these rates slightly increased the population average genetic effects. For the Michaelis-Menten models, we allowed the *V*_{max}s to depend on genotype, while keeping the *K*_{m}s fixed. Not all the parameters can be independently estimated, but only the ratios μ_{0}/λ_{0} and $K_m^{\mu}/K_m^{\lambda}$, along with the genetic rate ratios λ_{1}/λ_{0} and μ_{1}/μ_{0}. Allowing the *V*_{max}s and *K*_{m}s to vary between subjects leads to some instability, but did not substantially alter the population mean parameter estimates. Adding in biomarker measurements *B*_{i} as surrogates for *M*_{i} for even a subset of subjects, as described below, substantially improved the precision of estimation of all the model parameters (results not shown).

### Combining mechanistic and statistical models

Such an approach is likely to be impractical for complex looped pathways like folate, however. In this case, one might use the results of a preliminary exploratory or hierarchical model to simplify the pathway to a few key rate-limiting steps, so as to yield a simpler unidirectional model for which the differential equation steady-state solutions can be obtained in closed form.

Rather than taking *M*(*E, G*) as a deterministic node in the mechanistic modelling approach described above, a fully Bayesian treatment would use *stochastic* differential equations to derive Pr(*M*|*E, G*). For example, suppose one postulated that the rate of change *dM/dt* depends on the rate at which the metabolite is created, taken as the constant λ(*G*_{1})*E*, and the rate at which it is removed, μ(*G*_{2})*M*. (Of course, the exposures *E* could be time dependent, in which case one would be interested in the long-term average of *M* rather than its steady state, but in most epidemiological studies there is little information available on short-term variation in exposures, so the following discussion is limited to the case of time-constant exposures.) Consider a discrete number of molecules and let *p*_{m}(*t*) = Pr(*M* = *m*|*T* = *t*). Then, the resulting stochastic differential equation becomes:

$$\frac{dp_m(t)}{dt} = \lambda E\, p_{m-1}(t) - (\lambda E + \mu m)\, p_m(t) + \mu (m+1)\, p_{m+1}(t)$$

The solution turns out to be simply a Poisson distribution for *m* with mean *E*(*m*) = λ*E*/μ. This suggests as a distribution for continuous metabolite concentrations *M* in some volume of size *N*:

where *N* now controls the dispersion of the distribution. More complex solutions for Michaelis-Menten kinetics with a finite number of binding sites have been provided by Kou *et al*., [32] who showed that the classical solutions still held in expectation, but other properties -- like the distribution of waiting times in various binding states -- were different, appearing to demonstrate a non-Markov memory phenomenon, particularly at high substrate concentrations. Further stochastic variability arises from fluctuations in binding affinity due to continual changes in enzyme conformation [33].
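The Poisson result for the discrete process can be verified numerically. The sketch below assumes a birth-and-death process in which molecules are created at the constant rate λ*E* and each of the *m* molecules present is removed at rate μ; it truncates the state space at a hypothetical `m_max`, solves for the stationary distribution of the resulting generator matrix, and compares it with the Poisson probabilities. The rate values are illustrative.

```python
import math
import numpy as np

def stationary_distribution(birth, mu, m_max):
    """Stationary distribution of a birth-death chain on 0..m_max.

    birth : constant creation rate (lambda * E)
    mu    : per-molecule removal rate, so the removal rate in state m is mu * m
    """
    n = m_max + 1
    Q = np.zeros((n, n))
    for m in range(n):
        if m < m_max:
            Q[m, m + 1] = birth       # creation of one molecule
        if m > 0:
            Q[m, m - 1] = mu * m      # removal of one molecule
        Q[m, m] = -Q[m].sum()         # rows of a generator sum to zero
    # Solve pi @ Q = 0 subject to sum(pi) = 1 by least squares.
    lhs = np.vstack([Q.T, np.ones(n)])
    rhs = np.zeros(n + 1)
    rhs[-1] = 1.0
    pi, *_ = np.linalg.lstsq(lhs, rhs, rcond=None)
    return pi

def poisson_pmf(m, mean):
    return math.exp(m * math.log(mean) - mean - math.lgamma(m + 1))

lam, E, mu = 2.5, 2.0, 1.0            # illustrative values, mean = lam * E / mu = 5
pi = stationary_distribution(lam * E, mu, m_max=60)
poisson = np.array([poisson_pmf(m, lam * E / mu) for m in range(61)])
```

The truncation point only needs to lie far enough into the upper tail that the neglected probability is negligible.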

To illustrate the general idea, we fitted this simplified version of the model, treating λ and μ as fixed genotype-specific population values, yielding the estimates shown in the last line of Table 4. The dispersion parameter *N* cannot be estimated, but the results for other parameters are relatively insensitive to this choice; the results in Table 4 are based on either a fixed value *N* = 10 or using an informative Γ(100,1) prior; as *N* gets very large, the estimates converge to those in the first line for linear kinetics with fixed genotype-specific λ and μ.

For more complex models, for which analytic solution of the differential equations may be intractable, the technique of approximate Bayesian computation [34] may be helpful. The basic idea is, at each Markov chain Monte Carlo cycle, to simulate data from the differential equations model using the current and proposed estimates of model parameters and evaluate the 'closeness' of the simulated data to the observed data in terms of some simple statistics. This is then used to decide whether to accept or reject the proposed new estimates, rather than having to compute the likelihood itself.
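The accept/reject step can be sketched on a toy problem. The example assumes Poisson-distributed metabolite counts with mean λ*E*/μ, a known μ, a uniform prior on λ and a single summary statistic; it follows the common likelihood-free MCMC recipe in which only the proposed value is used to simulate data. The function name, tolerance `eps` and step size are hypothetical tuning choices, and this is not the algorithm of reference [34] in detail.

```python
import numpy as np

def abc_mcmc(m_obs, exposure, n_iter=5000, eps=0.1, step=0.2, mu=1.0, seed=0):
    """Likelihood-free Metropolis sampler for lambda with a Uniform(0, 10) prior."""
    rng = np.random.default_rng(seed)

    def summary(m):
        # Total count per unit exposure: a simple statistic informative about lambda / mu.
        return m.sum() / exposure.sum()

    s_obs = summary(m_obs)
    lam = s_obs * mu                  # start where simulated data can match the observed
    chain = np.empty(n_iter)
    n_accept = 0
    for t in range(n_iter):
        prop = lam + step * rng.normal()          # symmetric random-walk proposal
        if 0.0 < prop < 10.0:                     # outside the prior: always rejected
            m_sim = rng.poisson(prop * exposure / mu)
            if abs(summary(m_sim) - s_obs) < eps: # 'closeness' replaces the likelihood
                lam = prop
                n_accept += 1
        chain[t] = lam
    return chain, n_accept / n_iter

# Toy data generated with a true lambda of 2.0.
data_rng = np.random.default_rng(42)
exposure = data_rng.uniform(0.5, 2.0, size=200)
m_obs = data_rng.poisson(2.0 * exposure)
chain, accept_rate = abc_mcmc(m_obs, exposure)
```

Because the prior is flat and the proposal symmetric, acceptance reduces to the closeness check alone; a non-uniform prior would add a prior-ratio term. Starting the chain near a plausible value matters, since a chain started far from the data may never accept a move.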

A simpler approach uses the output of a PBPK simulation model as prior covariates in a hierarchical model. Let *Z*_{ge} = *E*[*M*(*G*_{g}, *E*_{e})] denote the predicted steady-state concentrations of the final metabolite from a differential equations model for a particular combination of genes and/or exposures (thus, *Z*_{gg'} might represent the predicted effect of a G × G interaction between genes *g* and *g'*). As discussed above, other *Z*s could comprise variances of predicted *M*s across a range of input values as a measure of the sensitivity of the output to variation in that particular combination of inputs. *Z*_{ge} could also be a vector of several different predicted metabolite concentrations if there were multiple hypotheses about which was the most aetiologically relevant.

For the folate application, the *Z* matrix was obtained by correlating the simulated intermediate phenotypes *v* (reaction rates or metabolite concentrations) with the 14 genotypes, 91 G × G and 28 G × E interaction terms. The resulting correlation coefficients for the four simulated causal variables were then used as a vector of *in silico* prior covariates *Z*_{p} = (*Z*_{pv})_{v = 1 ... 4} for the relative risk coefficients β_{p}. The full set of correlations *Z*_{pv} across all ten metabolites and nine non-redundant velocities were also used to compute an adjacency matrix as *A*_{pq} = corr_{v}(*Z*_{pv}, *Z*_{qv}), representing the extent to which a pair of genes had similar effects across the whole range of intermediate phenotypes. The effects of these *in silico* covariates (Table 5) were substantially stronger than for the simple indicator variables illustrated earlier. In each simulation, the prior covariate corresponding to the causal variable was the strongest predictor of the genetic main effects.
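The two-step construction of *Z* and **A** can be sketched as follows. The genotype terms and intermediate phenotypes here are random stand-ins for the output of the differential equations model, and the dimensions, the effect matrix `W` and the function name are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(1)
n, P, V = 2000, 6, 5   # subjects, model terms, intermediate phenotypes (toy sizes)

# Stand-ins for the external database: genotype terms and simulated phenotypes.
X = rng.binomial(2, 0.3, size=(n, P)).astype(float)
W = rng.normal(size=(P, V))                 # hypothetical effects of terms on phenotypes
pheno = X @ W + rng.normal(size=(n, V))

def prior_covariates(X, pheno):
    """Z[p, v] = Pearson correlation between model term p and phenotype v."""
    Xs = (X - X.mean(axis=0)) / X.std(axis=0)
    Ps = (pheno - pheno.mean(axis=0)) / pheno.std(axis=0)
    return Xs.T @ Ps / len(X)

Z = prior_covariates(X, pheno)

# A[p, q] = correlation, across phenotypes v, of the profiles Z[p, :] and Z[q, :].
# np.corrcoef treats each row as one variable, which is what is wanted here.
A = np.corrcoef(Z)
```

Two terms receive a large *A*_{pq} when they push the whole set of intermediate phenotypes in the same directions, whether or not they are adjacent in the pathway diagram.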

### Designs incorporating biomarkers

Ultimately, it may be helpful to incorporate various markers of the internal workings of a postulated pathway, perhaps in the form of biomarker measurements of intermediate metabolites, external bioinformatic knowledge about the structure and parameters of the network, or toxicological assays of the biological effects of the agents under study. For example, in a multi-city study of air pollution, we are applying stored particulate samples from each city to cell cultures with a range of genes experimentally knocked down to assess their genotype-specific biological activities. We will then incorporate these measurements directly into the analysis of G × E interactions in epidemiological data [35]. See Thomas, [1] Thomas *et al*., [2] Conti *et al.* [20] and Parl *et al.* [36] for further discussion about approaches to incorporating biomarkers and other forms of biological knowledge into pathway-driven analyses.

Typically biomarker measurements are difficult to obtain and are only feasible to collect on a subset of a large epidemiological study. While one might consider using a simple random sample for this purpose, greater efficiency can often be obtained by stratified sampling. Suppose the parent study is a case-control study with exposure information and DNA already obtained. One might then consider sampling on the basis of some combination of disease status, exposure and the genotypes of one or more genes thought to be particularly important for the intermediate phenotype(s) for which biomarkers are to be obtained. The optimal design would require knowledge of the true model (which, of course, is unknown), but a balanced design, selecting the subsample so as to obtain equal numbers in the various strata defined by disease and predictors is often nearly optimal [37–39]. The analysis can then be conducted by full maximum likelihood, integrating the biomarkers for unmeasured subjects over their distribution (given the available genotype, exposure and disease data) or by some form of multiple imputation, quasi-likelihood [40] or MCMC methods. Here, the interest is not in the association of disease with the biomarker *B* itself, but rather with the unobserved intermediate phenotype *M* it is a surrogate for. The disease model is thus of the form Pr(*Y*|*M*), with a latent process model for Pr(*M*|*G, E*) and a measurement model for Pr(*B*|*M*).

Again, using the folate simulation as the example, we simulated biomarkers for samples of ten or 25 individuals selected at random from each of the eight cells defined by disease status, the *MTHFR* genotype and high or low folate intake. A measurement *B* of either homocysteine concentration or the TS enzyme activity level was assumed to be normally distributed around its simulated equilibrium value, with a standard deviation of 10 per cent of the true long-term average.
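Selecting a balanced subsample of this kind is straightforward to sketch. The example below assumes binary codings for disease status, genotype (carrier or not) and folate intake (high or low); the cohort is a random stand-in, and the function name and the frequencies are hypothetical.

```python
import numpy as np

def balanced_subsample(strata, n_per_stratum, rng):
    """Draw up to n_per_stratum subjects at random from each stratum."""
    chosen = []
    for s in np.unique(strata):
        idx = np.flatnonzero(strata == s)
        k = min(n_per_stratum, idx.size)   # small cells contribute all their members
        chosen.append(rng.choice(idx, size=k, replace=False))
    return np.sort(np.concatenate(chosen))

rng = np.random.default_rng(3)
n = 1000
disease = np.repeat([1, 0], n // 2)        # 500 cases and 500 controls
carrier = rng.binomial(1, 0.4, size=n)     # stand-in for a dichotomised genotype
high_folate = rng.binomial(1, 0.5, size=n) # stand-in for dichotomised intake

# Eight cells: disease x genotype x exposure.
strata = disease * 4 + carrier * 2 + high_folate
sample = balanced_subsample(strata, n_per_stratum=10, rng=rng)
```

Any analysis of the subsample then has to account for the unequal sampling fractions across cells.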

These data were analysed within a conventional measurement error framework [41, 42] by treating the true long-term average values of homocysteine or TS activity as a latent variable *M* in a model given by the following equations:
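A univariate form consistent with the three model components named in the text (a latent process model, a measurement model and a disease model) is sketched below. The symbols **α**, σ², τ², β_{0} and β_{1} are labels chosen here for illustration, and the constant measurement variance is a simplification:

$$M_i \sim N\left(\mathbf{X}_i'\boldsymbol{\alpha},\ \sigma^2\right), \qquad B_i \mid M_i \sim N\left(M_i,\ \tau^2\right), \qquad \mathrm{logit}\,\Pr(Y_i = 1 \mid M_i) = \beta_0 + \beta_1 M_i$$

Subjects without a biomarker measurement still contribute through the first and third components, with *M*_{i} integrated out or sampled.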

For joint analyses of homocysteine and TS activity measurements, **M** and **B** were assumed to be bivariate normally distributed with **M** ~ *N*_{2}(**X**'**A**, Σ) and **B** ~ *N*_{2}(**M**, **T**), and *Y* as having a multiple logistic dependence on **M**. Only the main effects of the 14 genes and two environmental factors were included in **X** for this analysis. While the model can be fitted by maximum likelihood, it is convenient to use MCMC methods, which more readily deal with arbitrary patterns of missing **B** data. Thus, it is not essential for the different biomarkers to be measured on the same subset of subjects, but some overlap is needed to estimate the covariances Σ_{12} and *T*_{12}. More complex mechanistic models could, of course, be used in place of the regression model **M**|**X**. For this model to be identifiable, however, it is essential that distinct biomarkers be available for each of the intermediate phenotypes included in the disease model.

Estimates of the effects of both homocysteine and TS enzyme activity were highly significant in univariate analyses, even though the simulated causal variable is homocysteine. In bivariate analyses, however, the TS effect became non-significant, owing to the strong positive correlation (*r*_{Σ} = 0.45; 95 per cent confidence interval [CI] 0.21, 0.71) between the residuals of *M*, while correlation between the residuals of the measurement errors was not significant (*r*_{T} = 0.34; 95 per cent CI -0.12, +0.63). Although the standard errors varied strongly with subsample size, stratified sampling did not seem to improve the precision of the estimates. The reason for this appears to be that the biased sampling is not properly allowed for in the Bayesian analysis. Further work is needed to explore whether incorporating the sampling fractions into a conditional likelihood would yield more efficient estimators in the stratified designs.

### Dealing with reverse causation: Mendelian randomisation

The foregoing development assumes that the biomarker measurement *B* or the underlying phenotype *M* of which it is a measurement is not affected by the disease process. While this may be a reasonable assumption in a cohort or nested case-control study where biomarker measurements are made on stored specimens obtained at entry to the cohort rather than after the disease has already occurred, it is a well known problem (known as 'reverse causation') in case-control studies. In this situation, one might want to restrict biomarker measurements only to controls and use marginal likelihood or imputation to deal with the unmeasured biomarkers for cases. Alternatively, one might consider using case measurements in a model that includes terms for differential error in the measurement model, Pr(*B*|*M, Y*).

These ideas have been formalised in literature known as 'Mendelian randomisation' (MR), [43–47] sometimes referred to as 'Mendelian deconfounding' [48]. Here, the focus of attention is not the genes themselves, but intermediate phenotypes (*M*) as risk factors for disease. The genes that influence *M* are treated as 'instrumental variables' (IVs) [49–54] in an analysis that indirectly infers the *M-Y* relationship from separate analyses of the *G-M* and *G-Y* relationships. The appeal of the approach is that uncontrolled confounding and reverse causation are less likely to distort these relationships than they are to distort the *M-Y* relationship if studied directly. In essence, the idea of imputing *M* values using *G* as an IV in a regression of *Y* on *E*(*M*|*G*) is a form of MR argument. Nevertheless, the approach is not without its pitfalls, [55–58] both as a means of testing the null hypothesis of no causal connection between *M* and *Y* and as a means of estimating the magnitude of its effect. Particularly key is the assumption that the effect of *G* on *Y* is mediated solely through *M*. For complex pathways, the simple MR approach is unlikely to be of much help, but the idea of using samples free of reverse causation to learn about parts of the model from biomarker measurements and incorporating these into the analysis of a latent variable model is promising.

To illustrate these methods, consider the scenario where homocysteine is the causal variable for disease. The logistic regression of disease directly on homocysteine yields a log*RR* coefficient β of 2.57 (SE 0.22) per SD change of homocysteine (Table 7). This estimate is, however, potentially subject to confounding and reverse causation, and indeed in this simulation we generated an upward bias in *B*|*M* of 50 per cent of the SD of *M*, which produced a substantial overestimate of the simulated β = 2. An MR estimate could in principle be obtained by using any of the genes in the pathway as an IV, *MTHFR* being the most widely studied. The regression of homocysteine on *MTHFR* yields a regression coefficient of α = 0.216 (0.079) and a logistic regression of disease on *MTHFR* yields a regression coefficient of γ = 0.112 (0.142), to produce an MR estimate of β = γ/α = 0.52 (0.68). Since *MTHFR* is only a relatively weak predictor of homocysteine concentrations in this simulation, however, it is a poor instrumental variable, as reflected in the large SE of the ratio estimate. Several other genes, exposures and interactions have much stronger effects on both homocysteine and disease risk -- notably, *SAHH* and *CBS*, which yield significant MR estimates, 1.27 (0.33) and 1.09 (0.20), respectively. These differences between estimates using different IVs and their underestimation of the simulated β suggest that simple Mendelian randomisation is inadequate to deal with complex pathways.
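The ratio estimate quoted above can be checked with a few lines of arithmetic. The sketch below computes β = γ/α together with a first-order delta-method standard error. The text does not say how the SE of the ratio was obtained, so the delta-method formula and the assumption that the two estimates are uncorrelated are ours; the function name `wald_ratio` is hypothetical.

```python
import math

def wald_ratio(gamma, se_gamma, alpha, se_alpha):
    """Ratio (Wald) MR estimate beta = gamma / alpha.

    The standard error uses a first-order delta-method approximation and
    assumes (our assumption) that the G-Y and G-M estimates are uncorrelated.
    """
    beta = gamma / alpha
    variance = (se_gamma / alpha) ** 2 + (gamma * se_alpha) ** 2 / alpha ** 4
    return beta, math.sqrt(variance)

# Coefficients quoted in the text for MTHFR used as the instrument.
beta, se = wald_ratio(gamma=0.112, se_gamma=0.142, alpha=0.216, se_alpha=0.079)
print(f"beta = {beta:.2f} (SE {se:.2f})")  # prints: beta = 0.52 (SE 0.68)
```

Under these assumptions the approximation agrees with the reported 0.52 (0.68). It also makes the weak-instrument problem visible: the dominant variance term is the SE of γ divided by a small α.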

A stepwise multiple regression model for $\hat{M} = E(B \mid \mathbf{G})$ included 13 main effects and G × G interactions and attained an *R*^{2} of 0.43. Treating these predicted homocysteine concentrations as the covariate yielded a single imputation estimate of the log RR for disease of 1.32 (0.16), only slightly less precise than that from the logistic regression of disease directly on the measured values. While robust to uncontrolled confounding, this approach is not robust to reverse causation or misspecification of the prediction model; for example, it fails to include any exposure effects, which we have excluded to avoid distortion by reverse causation. More importantly, it also assumes that the entire effect of the predictors is mediated through homocysteine; this is true for this simulation, but is unlikely to be in practice. While not quite as downwardly biased as the Mendelian randomisation estimates (resulting from the improved prediction of *B*|**G**), the incompleteness of the model has still produced some underestimation.

Since we have simulated the case where the biomarker measurements are distorted by disease status, one might consider one of two alternative single imputation analyses. If both cases and controls have biomarker measurements available, one might include disease status in a model for $\hat{M} = E(B \mid \mathbf{G}, Y) = \boldsymbol{\alpha}'\mathbf{G} + \delta Y$, and then set *Y* = 0 in the fitted regression in order to estimate the predisease values for the cases. Alternatively, one could fit the model for $\hat{M} = E(B \mid \mathbf{G})$ using data *only* from controls and then apply the fitted model to *all* subjects, cases and controls. In either case, one would use only the *predicted* values for all subjects, not the *actual* biomarker measurements for those having them. In these simulated data, these approaches yield log RR estimates of 1.28 (0.20) and 1.31 (0.20), respectively. Either of these approaches avoids the circularity of using disease status to predict *B*|**G**, *Y* and then using it again in the regression of *Y* on $\hat{M} = E(B \mid \mathbf{G}, Y)$. While the first approach uses more of the data, it requires a stronger assumption that the effect of *Y* on *B* is correctly specified, including possible interactions with **G**. In this simulation, the estimate of δ is 1.33 (0.06), substantially biased away from the simulated value of 0.50 because it includes some of the causal effect of *X* on *Y*. A fully Bayesian analysis jointly estimates the bias term δ*Y* in the full model $p_{\alpha}(M \mid E, G)\, p_{\beta}(Y \mid M)\, p_{\gamma,\delta}(B \mid M, Y)$. In this simulation, the fully Bayesian analysis yielded an estimate of β = 2.95 (0.22) and δ = -0.02 (1.02). Obviously, δ is so poorly estimated and β so overestimated that this approach appears to suffer from problems of identifiability that require further investigation.
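The controls-only strategy can be illustrated on toy data. The sketch below is not the folate simulation: it assumes a single binary gene, arbitrary parameter values and a genotype-specific mean as the prediction model, and the names `simulate` and `genotype_means` are hypothetical. It shows only the first stage, in which E(*B*|**G**) is fitted among controls and the predicted value is then assigned to every subject.

```python
import math
import random

def simulate(n=5000, alpha=0.5, beta=2.0, delta=0.5, seed=1):
    """Toy data: one binary gene G, latent M, disease Y, biomarker B.

    All parameter values are illustrative assumptions.
    """
    rng = random.Random(seed)
    data = []
    for _ in range(n):
        g = 1 if rng.random() < 0.3 else 0
        m = alpha * g + rng.gauss(0.0, 1.0)               # M | G
        p = 1.0 / (1.0 + math.exp(-(-2.0 + beta * m)))    # Pr(Y = 1 | M)
        y = 1 if rng.random() < p else 0
        b = m + delta * y + rng.gauss(0.0, 0.5)           # B | M, Y: shifted upward in cases
        data.append((g, y, b))
    return data

def genotype_means(data, controls_only):
    """Estimate E(B | G) as the mean of B within each genotype."""
    total = {0: 0.0, 1: 0.0}
    count = {0: 0, 1: 0}
    for g, y, b in data:
        if controls_only and y == 1:
            continue
        total[g] += b
        count[g] += 1
    return {g: total[g] / count[g] for g in total}

data = simulate()
fit_controls = genotype_means(data, controls_only=True)
fit_all = genotype_means(data, controls_only=False)

# Predicted values are assigned to ALL subjects, cases included; the measured
# B values themselves are not carried forward into the disease model.
m_hat = [fit_controls[g] for g, _, _ in data]
```

In this toy setting the all-subject means in `fit_all` sit above the controls-only means in `fit_controls`, because case measurements are inflated both by the disease-related shift and by the higher underlying *M* of cases. The second stage, a logistic regression of *Y* on `m_hat`, is omitted here.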

In the Colon Cancer Family Registries, [59] we have pre-disease biospecimens on several hundred relatives of probands who were initially unaffected and subsequently became cases themselves. In a currently ongoing substudy of biomarkers for the folate pathway, it will be possible to use these samples to estimate the effect of reverse causation directly. Of course, it would have been even more informative to have both pre- and post-diagnostic biomarker measurements on incident cases to model reverse causation more accurately.

### Incorporating external information: Ontologies

There are now numerous databases available that catalogue various types of genomic information. The Kyoto Encyclopedia of Genes and Genomes (KEGG) is perhaps the most familiar of these for knowledge about the structure of pathways and the parameters of each step therein. Others include the Gene Ontology, Biomolecular Interaction Network Database, Reactome, PANTHER, Ingenuity Pathway Analysis, BioCARTA, GATHER, DAVID and the Human Protein Reference Database (see, for example, Meier and Gehring, [60] Thomas *et al.* [61] and Werner [62] for reviews). Literature mining is emerging as another tool for this purpose, [63] although potentially biased by the vagaries of research and publication trends. Such repositories form part of a system for organising knowledge known as an 'ontology' [64]. Representation of our knowledge via an ontology may provide a more useful and broadly informative platform to generate system-wide hypotheses about how variation in human genes ultimately impacts on organism-level phenotypes via the underlying pathway or complex system. Since the biological and environmental knowledge relevant to most diseases spans many research fields, each with specific theories guiding ongoing research, expertise across the entire system by one individual scientist is limited. While the information that contributes to each knowledge domain may contain uncertainties and sources of error stemming from the underlying experiments and studies, biases in the selection of genes and pathways chosen to be included and lack of comparability across terms and databases, an ontology as a whole can generate hypotheses and links across research disciplines that may only arise when information is integrated from several disciplines across the entire span of suspected disease aetiology. An ontology should not be taken as the truth, but rather as the current representation of knowledge that can, and should, be updated as new findings arise and hypotheses are tested. Evaluation of the accuracy of ontologies is an active research area.

In our folate simulation, we considered three prior covariates for Z in Table 3. The creation of these priors followed directly from the network representation given in Figure 1, obtained from a previously published article representing one research group's interpretation of the folate pathway [14]. An ontology, such as Gene Ontology (GO), has the potential advantage of allowing for the construction of prior covariates across a range of biological mechanisms. For example, a very refined biological process captured by the GO term *folic acid and derivative biosynthetic process* indicates two genes (*MTCH* and *MS*) from our example set of genes. A more general term, *methionine biosynthetic process*, identifies three genes (*MTCH, MTHFR* and *MS*). Finally, a broad process, such as *one-carbon compound metabolic process*, identifies five genes (*SAHH, DHFR, MAT-II, MTCH* and *SHMT*). Since an ontology has a hierarchical structure in an easily computable format, one may consider more quantitative approaches in generating prior covariates, such as the distance between two genes in the ontology. Across the full range of 184 GO terms involving one or more of these 14 genes, positively correlated sets include (*MTHFR, MTCH, MS*), (*MTD, CBS*), (*FTD, DHFR*) and (*AICART, TS*), while *PGT* and *MTCH* are negatively correlated. Figure 3 represents these correlations using a complete agglomerative clustering.
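As a minimal sketch of how GO term membership might be turned into prior covariates, the three terms and gene sets named above can be coded as a 0/1 matrix with one row per gene. The Jaccard index used here as a gene-gene similarity is our illustrative choice, not the correlation measure across 184 terms described in the text, and the variable and function names are hypothetical.

```python
# Gene sets for the three GO terms, as listed in the text.
go_terms = {
    "folic acid and derivative biosynthetic process": {"MTCH", "MS"},
    "methionine biosynthetic process": {"MTCH", "MTHFR", "MS"},
    "one-carbon compound metabolic process": {"SAHH", "DHFR", "MAT-II", "MTCH", "SHMT"},
}

genes = sorted(set().union(*go_terms.values()))
terms = list(go_terms)

# Prior covariate matrix Z: one row per gene, one 0/1 column per GO term.
Z = {g: [1 if g in go_terms[t] else 0 for t in terms] for g in genes}

def jaccard(gene1, gene2):
    """Share of GO terms annotated to either gene that are annotated to both."""
    a = {t for t in terms if gene1 in go_terms[t]}
    b = {t for t in terms if gene2 in go_terms[t]}
    return len(a & b) / len(a | b)

for g in genes:
    print(g, Z[g])
print("MTCH vs MS:", jaccard("MTCH", "MS"))
```

With only three terms the similarities are coarse. A pairwise similarity matrix of this kind is the sort of input that a hierarchical clustering would take.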

Although both approaches to building prior covariates, via either the visual interpretation of a network or the use of Gene Ontology, use knowledge of biological mechanisms, they lack a formal link of these mechanisms to disease risk or organism-level phenotypes. Such links may be critical when generating hypotheses or informing statistical analyses using biological mechanisms. Many publicly available ontologies provide a vast amount of structural information on various biological processes, but interpretation or weighting of the importance of those processes in relation to specific phenotypes will only come when ontologies from biological domains are linked to ontologies characterising phenotypes. As one example, Thomas *et al.* [61] created a novel ontological representation linking smoking-related phenotypes and response to smoking cessation treatments with the underlying biological mechanisms, mainly nicotine metabolism. Most of the ontological concepts created for this specific ontology were created using concept definitions from existing ontologies, such as SOPHARM and Gene Ontology. This ontology was used in Conti *et al.* [20] to demonstrate its use in pathway analysis as a systematic way of eliciting priors for a hierarchical model. Specifically, the ontology was used to generate quantitative priors to reduce the space of potential models and to inform subsequent analysis via a Bayesian model selection approach.

### Dealing with uncertainty in pathway structure

A more general question is how to deal with model uncertainty in any of these modelling strategies. The general hierarchical modelling strategy was first extended by Conti *et al.* [65] to deal with uncertainty about the set of main effects and interactions to be included in X using stochastic search variable selection [66]. Specifically, they replaced the second-level model by a pair of models, a logistic regression for the probability that β_{p} = 0 and a linear regression of the form of Eq. (2) for the expected values of the coefficient, given that it was not zero. In turn, the pair of second-level models inform the probability that any given term will be included in the model at the current iteration of the stochastic search. Thus, over the course of MCMC iterations, variables are entered and removed, and one can then estimate (1) the posterior probability or Bayes factor for each possible model, (2) the posterior probability or Bayes factor for whether each factor has a non-zero β, averaging over the set of other variables in the model, or (3) the posterior mean of each β, given that it is non-zero. Other alternatives include the Lasso prior, [67] which requires only a single hyperparameter to accomplish both shrinkage and variable selection in a natural way, and the elastic net, [68] which combines the Lasso and normal priors and can be implemented in a hierarchical fashion combining variable selection at lower levels (eg among SNPs within a pathway) and shrinkage at higher levels (eg between genes within a pathway or between pathways) (Chen *et al.* Presented at the Eastern North American Region Meeting of the Biometric Society; San Antonio, TX: February 2009).
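The pair of second-level models, and the summaries drawn from the MCMC output, can be sketched as follows. This is a schematic under our own assumptions rather than the implementation of Conti *et al.*: the logistic model is written for the probability of inclusion (the complement of the probability that the coefficient is zero), an excluded coefficient is recorded as exactly zero in the draws, and the names `second_level`, `summarise`, `omega` and `pi` are hypothetical.

```python
import math

def second_level(z, omega, pi):
    """Pair of second-level models for a single coefficient.

    z     : prior covariates for the term (first entry 1 for the intercept)
    omega : coefficients of the logistic model for Pr(beta != 0)
    pi    : coefficients of the linear model for E(beta | beta != 0)
    """
    linear_inclusion = sum(w * x for w, x in zip(omega, z))
    prob_nonzero = 1.0 / (1.0 + math.exp(-linear_inclusion))
    mean_if_nonzero = sum(c * x for c, x in zip(pi, z))
    return prob_nonzero, mean_if_nonzero

def summarise(draws):
    """Posterior inclusion probability and conditional posterior mean from
    MCMC draws of one coefficient, with exclusion recorded as exactly 0.0."""
    nonzero = [b for b in draws if b != 0.0]
    inclusion = len(nonzero) / len(draws)
    conditional_mean = sum(nonzero) / len(nonzero) if nonzero else float("nan")
    return inclusion, conditional_mean

# Exchangeable prior: intercept-only covariates give every term the same prior.
print(second_level(z=[1.0], omega=[0.0], pi=[0.3]))
print(summarise([0.0, 0.4, 0.0, 0.6]))
```

With an intercept-only prior covariate vector, every term receives the same prior inclusion probability, which corresponds to the exchangeable structure. Informative covariates move that probability term by term.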

In an analysis of the simulated data when homocysteine is the causal variable (Table 5, first column), using the methods described by Conti *et al.* [20] and incorporating an exchangeable prior structure in which all genes are treated equally (ie intercept only in the prior covariate matrix, Z), the posterior probabilities of including the two modestly significant genes *TS* and *FTD* are 0.57 and 0.48, respectively. By contrast, when the prior covariate matrix is derived from the 'external database' from the simulation model and is thus more informative of the underlying mechanism, these posterior probabilities change to 0.84 and 0.14, respectively. These changes in the posterior probabilities of inclusion reflect the covariate values for these genes in relation to homocysteine concentration and the AICART reaction velocity (the two prior covariates with the largest estimated second-level effects). In the case of *TS*, the velocities for these covariates are large, resulting in an increase in the posterior probability of inclusion. By contrast, for *FTD* these values are much smaller and there is a subsequent decrease.

For mechanistic models, the 'topology' of the model Λ and the corresponding vector of model parameters θ_{Λ} are treated as unknown quantities, about which we might have some general prior knowledge in the form of the 'ontology' *Z*. In the microarray analysis world, Bayesian network analysis has emerged as a powerful technique for inferring the structure of a complex network of genes [69]. Might such a technique prove helpful for epidemiological analysis?

One promising approach is 'logic regression', which considers a set of tree-structured models relating measurable inputs (genes and exposures) to a disease trait through a network of unobserved intermediate nodes representing logical operators (AND, OR, XOR etc) [70]. To allow for uncertainty about model form, an MCMC method is used to update the structure of the graphical model by adding, deleting, moving or changing the types of the intermediate nodes [71]. Although appealing as a way of representing the biochemical pathways, logic regression does not exploit any external information about the form of the network. It also treats all intermediate nodes as binary, so it is more suitable for modelling regulatory than metabolic pathways where the intermediate nodes would represent continuous metabolite concentrations.

To overcome some of these difficulties, we relaxed the restriction to binary nodes, parameterising the model as:

When both input nodes (the 'parents' p_{j} = [*p*_{j1}, *p*_{j2}]) are binary, various combinations of θs will yield the full range of possible logical operators (eg AND = [0,0], OR = [1,1]), but this framework allows great flexibility in modelling interactions between continuous nodes, while remaining identifiable. The *M*s are treated as deterministic nodes, so the final metabolite concentration *M*_{J}(E, G; Λ, θ) can be calculated via a simple recursion. The disease risk is assumed to have a logistic dependence on *M*_{J}. Prior knowledge about the topology can be incorporated by use of a measure of similarity of each fitted network to the postulated true network (eg the proportion of connections in the true graph which are represented in the fitted one, minus the number of connections in the fitted graph which are not represented in the true one). In the spirit of Monte Carlo logic regression, the topology of the graph is modified by proposing to add or delete nodes or to move a connection between them using the Metropolis-Hastings algorithm [72]. Finally, the model parameters are updated conditional on the current model form. By post-processing the resulting set of graphs, various kinds of inference can be drawn, such as the posterior probability that a given input appears in the fitted graphs, that a pair of inputs is represented by a node in the graph, or the marginal effect of any input or combination of inputs on the disease risk. In small simulations, we demonstrated that the model could correctly identify the true network structure (or logically equivalent ones) and estimate the parameters well, while not identifying any incorrect models. In an application to data on ten candidate genes from the Children's Health Study, we were able to replicate the interactions found by a purely exploratory technique [73] and identified several alternative networks with comparable Bayes factors.
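To make the recursion concrete, the sketch below assumes one two-parent form that is consistent with the operator values quoted above, namely M = θ₁p₁ + θ₂p₂ + (1 − θ₁ − θ₂)p₁p₂, which reduces to AND at θ = [0,0] and to OR at θ = [1,1] for binary parents. This form is an assumption made for illustration and is not necessarily the authors' parameterisation; the names `node` and `evaluate` and the example graph are hypothetical.

```python
def node(p1, p2, theta1, theta2):
    """Two-parent node under the assumed form
    M = theta1*p1 + theta2*p2 + (1 - theta1 - theta2)*p1*p2."""
    return theta1 * p1 + theta2 * p2 + (1.0 - theta1 - theta2) * p1 * p2

def evaluate(inputs, graph):
    """Deterministic recursion over a tree-structured graph.

    inputs : dict of measured values (genes, exposures)
    graph  : list of (name, parent1, parent2, theta1, theta2), with parents
             listed before the nodes that use them; the last entry plays the
             role of the final node M_J.
    """
    values = dict(inputs)
    for name, parent1, parent2, theta1, theta2 in graph:
        values[name] = node(values[parent1], values[parent2], theta1, theta2)
    return values[graph[-1][0]]

# Hypothetical topology: (G1 OR G2) AND E.
graph = [("M1", "G1", "G2", 1.0, 1.0), ("M2", "M1", "E", 0.0, 0.0)]
print(evaluate({"G1": 1, "G2": 0, "E": 1}, graph))
```

Intermediate values of θ, or continuous parents such as exposure levels, give graded rather than logical responses, which is the flexibility the relaxed model is meant to provide.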

The folate pathway poses difficulties for mechanistic modelling because it is not a directed acyclic graph (DAG); although each arrow in Figure 1 is directed, the graph contains numerous cycles (feedback loops), making direct computation of probabilities difficult. In some instances, such cycles can be treated as single composite nodes with complex deterministic or stochastic laws, thereby rendering the remainder of the graph acyclic, but when there are many interconnected cycles, as in the folate pathway, such decomposition may be difficult or impossible to identify. Might it be possible, however, to identify a simpler DAG that captures the key behaviour of the network? Since any DAG would be an oversimplification and there could be many such DAGs that provide a reasonable approximation, the problem of model uncertainty is important.

A further extension of the Baurley *et al.* approach to the folate simulation will now be summarised. As in their approach, we assume that each node has exactly two inputs, but now distinguish three basic types of nodes: *G* × *G*, *G* × *M* (or *G* × *E*) and *M* × *M*. *G* × *G* nodes are treated as logical operators, yielding a binary output as high or low risk. *G* × *M* and *G* × *E* nodes represent intermediate metabolite concentrations, treated as continuous variables with deterministic values given by Michaelis-Menten kinetics with rate parameters *V*_{max}(*G*) and *K*_{m}. *M* × *M* nodes are regression expressions yielding a continuous output variable with the mean parameterised as in Eq. (5). Disease risk is assumed to have a logistic dependence on one or more of the *Z*s. Finally, each measured biomarker *B* is assumed to be log-normally distributed around one of the *M*s, with some measurement error variance. Rather than treating the intermediate nodes as deterministic, the likelihood of the entire graph is now calculated by peeling over possible states of all the intermediate nodes.
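A *G* × *M* node of this kind might be sketched as a Michaelis-Menten rate whose maximum velocity depends on genotype. The genotype-specific *V*_{max} values, the value of *K*_{m} and the function names below are hypothetical, and the sketch returns a reaction velocity rather than a steady-state concentration.

```python
def michaelis_menten(substrate, v_max, k_m):
    """Michaelis-Menten rate: v = V_max * S / (K_m + S)."""
    return v_max * substrate / (k_m + substrate)

# Hypothetical genotype-specific V_max values (variant alleles lower activity).
V_MAX_BY_GENOTYPE = {0: 1.0, 1: 0.7, 2: 0.4}

def gxm_node(genotype, substrate, k_m=2.0):
    """G x M node: the genotype sets V_max, the upstream metabolite is the substrate."""
    return michaelis_menten(substrate, V_MAX_BY_GENOTYPE[genotype], k_m)

for g in (0, 1, 2):
    print(g, gxm_node(g, substrate=2.0))
```

When the substrate concentration equals *K*_{m}, the rate is half of *V*_{max}, so the three genotypes in this example differ only through the scaling of the saturation curve.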

Figure 4 shows the topologies discovered by the MCMC search. The largest Bayes factors are obtained when using no prior topologies. With a prior topology, essentially the same networks are found, with somewhat different Bayes factors.

### Pathways in a genome-wide context

Genome-wide association studies (GWAS) are generally seen as 'agnostic' -- the antithesis of hypothesis-driven pathway-based studies. Aside from the daunting computational challenge, their primary goal is, after all, the discovery of novel genetic associations, possibly in genes with unknown function or even with genomic variation in 'gene desert' regions not known to harbour genes. How, then, could one hope to incorporate prior knowledge in a GWAS? The response has generally been to wait until the GWAS has been completed (after a multi-stage scan and independent replication) and then conduct various *in vitro* functional studies of the novel associations before attempting any pathway modelling.

The idea of incorporating prior knowledge from genomic annotation databases or other sources as a way of improving the power of a genome-wide scan for discovery has, however, been suggested by several authors. Roeder *et al*., [74] Saccone *et al*., [75] Wakefield [76–78] and Whittemore [79] introduced variants of a weighted false discovery rate, while Lewinger *et al.* [80] and Chen and Witte [81] described hierarchical modelling approaches for this purpose. These could be applied at any stage of a GWAS to improve the prioritisation of variants to be taken forward to the next stage. For example, Sebastiani *et al.* [82] used a Bayesian test to incorporate external information for prioritising SNP associations from the first stage of a GWAS using pooled DNA, to be subsequently tested using individual genotyping. Roeder *et al.* [74] originally suggested the idea of exploiting external information in the context of using a prior linkage scan to focus attention in regions of the genome more likely to harbour causal variants, but subsequent authors have noted that various other types of information, such as linkage disequilibrium, functional characterisation or evolutionary conservation, could be included as predictors. An advantage of hierarchical modelling is that multiple sources can be readily incorporated in a flexible regression framework, whereas the weighted FDR requires *a priori* choice of a specific weighting scheme.
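One common form of weighted false discovery rate control divides each *p*-value by a prior weight, with the weights averaging one, and then applies the Benjamini-Hochberg step-up rule. The sketch below is a generic version of that idea and not the specific scheme of any of the papers cited; the function name `weighted_bh` and the example weights are hypothetical.

```python
def weighted_bh(pvalues, weights, alpha=0.05):
    """Indices rejected by Benjamini-Hochberg applied to p_i / w_i,
    after rescaling the weights so that they average one."""
    m = len(pvalues)
    mean_weight = sum(weights) / m
    scaled = [
        min(1.0, p * mean_weight / w) if w > 0 else 1.0
        for p, w in zip(pvalues, weights)
    ]
    order = sorted(range(m), key=lambda i: scaled[i])
    n_rejected = 0
    for rank, i in enumerate(order, start=1):
        if scaled[i] <= alpha * rank / m:
            n_rejected = rank          # step-up rule: keep the largest passing rank
    return sorted(order[:n_rejected])

pvals = [0.001, 0.02, 0.04, 0.5]
print(weighted_bh(pvals, [1.0, 1.0, 1.0, 1.0]))   # unweighted: [0, 1]
print(weighted_bh(pvals, [0.5, 1.0, 2.0, 0.5]))   # third test up-weighted: [0, 1, 2]
```

In this example the third variant is rejected only when its prior weight is raised. This illustrates both the potential gain in power and the dependence on an *a priori* choice of weighting scheme.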

A recent trend has been the incorporation of pathway inference in genome-wide association scans, [75, 83–89] borrowing ideas from the extensive literature on network analysis of gene expression array data [90, 91]. Currently, the most widely used tool for this purpose is gene set enrichment analysis, [92] which in GWAS applications aims to test whether groups of genes in a common pathway tend to rank higher in significance. Several published applications have yielded novel insights using this approach, [93–96] although others have found that no specific pathway outranks the most significant single markers, [89, 97, 98] suggesting that the approach may not be ideal for all complex diseases. Many other empirical approaches have been used in the gene-expression field, including Bayesian network analysis, [69, 99, 100] neural networks, [101] support vector machines [102] and a variety of other techniques from the fields of bioinformatics, computational or systems biology and machine learning [103–111]. Most of these are empirical in the sense of trying to reconstruct the unknown network structure from observational data, rather than using a known network to analyse the observational data. It is less obvious how such methods could be applied to mining single-marker associations from a GWAS, but they could be helpful in mining G × G interactions. Even simple analyses of GWAS data can be computationally demanding, particularly if all possible G × G interactions are to be included, and analyses incorporating pathway information are likely to be even more daunting. Recent developments in computational algorithms for searching high-dimensional spaces and parallel cluster computing implementations may, however, make this feasible.
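The ranking idea behind gene set enrichment analysis can be illustrated with an unweighted running-sum statistic of the Kolmogorov-Smirnov type. This is a simplification made for illustration: published implementations typically weight hits by the strength of association and assess significance by permutation, neither of which is shown, and the function name `enrichment_score` is hypothetical.

```python
def enrichment_score(ranked_genes, gene_set):
    """Unweighted running-sum enrichment score.

    ranked_genes : genes ordered from most to least significant
    gene_set     : genes belonging to the pathway of interest
    Walks down the ranking, stepping up by 1/n_hit at each pathway gene and
    down by 1/n_miss otherwise, and returns the largest deviation from zero.
    """
    hits = [g in gene_set for g in ranked_genes]
    n_hit = sum(hits)
    n_miss = len(ranked_genes) - n_hit
    if n_hit == 0 or n_miss == 0:
        raise ValueError("gene set must contain some, but not all, ranked genes")
    running = 0.0
    best = 0.0
    for is_hit in hits:
        running += 1.0 / n_hit if is_hit else -1.0 / n_miss
        if abs(running) > abs(best):
            best = running
    return best

ranking = ["A", "B", "C", "D", "E", "F"]
print(enrichment_score(ranking, {"A", "B"}))   # pathway genes at the top
print(enrichment_score(ranking, {"E", "F"}))   # pathway genes at the bottom
```

A score near +1 indicates that the pathway genes cluster at the top of the ranking and a score near −1 that they cluster at the bottom. In a GWAS setting the ranking would come from gene-level summaries of SNP associations.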

Recently, several authors [112–116] have undertaken analyses of the association of genome-wide expression data with genome-wide SNP genotypes in search of patterns of genetic control that would identify *cis*- and *trans*-activating factors and master regulatory regions. Ultimately, one could foresee using networks inferred from gene expression directly as priors in a hierarchical modelling analysis for GWAS data, or a joint analysis of the two phenotypes, but this has yet to be attempted. Other novel technologies, such as whole-genome sequencing, metabolomics, proteomics and so on may provide other types of data that will inform pathway-based analysis on a genome-wide scale.

## Conclusions

As in any other form of statistical modelling, the analyst should be cautious in interpretation. As pointed out by Jansen: [117]

'So, the modeling of the interplay of many genes -- which is the aim of complex systems biology -- is not without danger. Any model can be wrong (almost by definition), *but particularly complex (overparameterized) models have much flexibility to hide their lack of biological relevance'* [emphasis added].

A good fit to a particular model does not, of course, establish the truth of the model. Instead, the value of models, whether descriptive or mechanistic, lies in their ability to organise a range of hypotheses into a systematic framework in which simpler models can be tested against more complex alternatives. The usefulness of the Armitage-Doll [118] multistage model of carcinogenesis, for example, lies not in our belief that it is a completely accurate description of the process, but rather in its ability to distinguish whether a carcinogen appears to act early or late in the process or at more than one stage. Similarly, the importance of the Moolgavkar-Knudson two-stage clonal-expansion model [119] lies in its ability to test whether a carcinogen acts as an 'initiator' (ie on the mutation rates) or a 'promoter' (ie on proliferation rates). Such inferences can be valuable, even if the model itself is an incomplete description of the process, as must always be the case.

Although mechanistic models do make some testable predictions about such things as the shape of the dose-response relationship and the modifying effects of time-related variables, testing such patterns against epidemiological data tends to provide only weak evidence in support of the alternative models, and only within the context of all the other assumptions involved. Generally, comparisons of alternative models (or specific sub-models) can only be accomplished by direct fitting. Visualisation of the fit to complex epidemiological datasets can be challenging. Any mechanistic interpretations of model fits should therefore consider carefully the robustness of these conclusions to possible misspecification of other parts of the model.

## References

Thomas DC: The need for a comprehensive approach to complex pathways in molecular epidemiology. Cancer Epidemiol Biomarkers Prev. 2005, 14: 557-559. 10.1158/1055-9965.EPI-14-3-EDB.

Thomas DC, Baurley JW, Brown EE, Figueiredo J, et al: Approaches to complex pathways in molecular epidemiology: Summary of an AACR special conference. Cancer Res. 2008, 68: 10028-10030. 10.1158/0008-5472.CAN-08-1690.

Cook NR, Zee RY, Ridker PM: Tree and spline based association analysis of gene-gene interaction models for ischemic stroke. Stat Med. 2004, 23: 1439-1453. 10.1002/sim.1749.

Ritchie MD, Hahn LW, Roodi N, Bailey LR, et al: Multifactor-dimensionality reduction reveals high-order interactions among estrogen-metabolism genes in sporadic breast cancer. Am J Hum Genet. 2001, 69: 138-147. 10.1086/321276.

Hoh J, Ott J: Mathematical multi-locus approaches to localizing complex human trait genes. Nat Rev Genet. 2003, 4: 701-709.

Tamayo P, Slonim D, Mesirov J, Zhu Q, et al: Interpreting patterns of gene expression with self-organizing maps: Methods and application to hematopoietic differentiation. Proc Natl Acad Sci USA. 1999, 96: 2907-2912. 10.1073/pnas.96.6.2907.

Tahri-Daizadeh N, Tregouet DA, Nicaud V, Manuel N, et al: Automated detection of informative combined effects in genetic association studies of complex traits. Genome Res. 2003, 13: 1952-1960.

Potter JD: Colorectal cancer: Molecules and populations. J Natl Cancer Inst. 1999, 91: 916-932. 10.1093/jnci/91.11.916.

Frosst P, Blom HJ, Milos R, Goyette P, et al: A candidate genetic risk factor for vascular disease: A common mutation in methylenetetrahydrofolate reductase. Nat Genet. 1995, 10: 111-113. 10.1038/ng0595-111.

Ulrich CM, Potter JD: Folate supplementation: Too much of a good thing?. Cancer Epidemiol Biomarkers Prev. 2006, 15: 189-193. 10.1158/1055-9965.EPI-152CO.

Molloy AM, Brody LC, Mills JL, Scott JM, et al: The search for genetic polymorphisms in the homocysteine/folate pathway that contribute to the etiology of human neural tube defects. Birth Defects Res A Clin Mol Teratol. 2009, 85: 285-294. 10.1002/bdra.20566.

Nijhout HF, Reed MC, Budu P, Ulrich CM: A mathematical model of the folate cycle: New insights into folate homeostasis. J Biol Chem. 2004, 279: 55008-55016. 10.1074/jbc.M410818200.

Nijhout HF, Reed MC, Ulrich CM: Mathematical models of folate-mediated one-carbon metabolism. Vitam Horm. 2008, 79: 45-82.

Reed MC, Nijhout HF, Neuhouser ML, Gregory JF, et al: A mathematical model gives insights into nutritional and genetic aspects of folate-mediated one-carbon metabolism. J Nutr. 2006, 136: 2653-2661.

Reed MC, Thomas RL, Pavisic J, James SJ, et al: A mathematical model of glutathione metabolism. Theor Biol Med Model. 2008, 5: 8. 10.1186/1742-4682-5-8.

Ulrich CM, Neuhouser M, Liu AY, Boynton A, et al: Mathematical modeling of folate metabolism: Predicted effects of genetic polymorphisms on mechanisms and biomarkers relevant to carcinogenesis. Cancer Epidemiol Biomarkers Prev. 2008, 17: 1822-1831. 10.1158/1055-9965.EPI-07-2937.

Hung RJ, Brennan P, Malaveille C, Porru S, et al: Using hierarchical modeling in genetic association studies with multiple markers: Application to a case-control study of bladder cancer. Cancer Epidemiol Biomarkers Prev. 2004, 13: 1013-1021.

Capanu M, Orlow I, Berwick M, Hummer AJ, et al: The use of hierarchical models for estimating relative risks of individual genetic variants: An application to a study of melanoma. Stat Med. 2008, 27: 1973-1992. 10.1002/sim.3196.

Hung RJ, Baragatti M, Thomas D, McKay J, et al: Inherited predisposition of lung cancer: A hierarchical modeling approach to DNA repair and cell cycle control pathways. Cancer Epidemiol Biomarkers Prev. 2007, 16: 2736-2744. 10.1158/1055-9965.EPI-07-0494.

Conti DV, Lewinger JP, Swan GE, Tyndale RF, et al: Using ontologies in hierarchical modeling of genes and exposures in biologic pathways. Phenotypes and Endophenotypes: Foundations for Genetic Studies of Nicotine Use and Dependence. Edited by: Swan GE. 2009, NCI Tobacco Control Monographs, Bethesda, MD, 539-584.

Rebbeck TR, Spitz M, Wu X: Assessing the function of genetic variants in candidate gene association studies. Nat Rev Genet. 2004, 5: 589-597.

Besag J, York J, Mollie A: Bayesian image restoration with two applications in spatial statistics (with discussion). Ann Inst Statist Math. 1991, 43: 1-59. 10.1007/BF00116466.

Cortessis V, Thomas DC: Toxicokinetic genetics: An approach to gene-environment and gene-gene interactions in complex metabolic pathways. Mechanistic Considerations in the Molecular Epidemiology of Cancer. Edited by: Bird P, Boffetta P, Buffler P, Rice J. 2003, IARC Scientific Publications, Lyon, France, 127-150.

Du L, Conti DV, Thomas DC: Physiologically-based pharmacokinetic modeling platform for genetic and exposure effects in metabolic pathways. Genet Epidemiol. 2006, 29: 234-

Lunn DJ, Thomas A, Best N, Spiegelhalter D: WinBUGS -- A Bayesian modelling framework: Concepts, structure, and extensibility. Stat Comput. 2000, 10: 325-337. 10.1023/A:1008929526011.

Lunn DJ, Best N, Thomas A, Wakefield J, Spiegelhalter D: Bayesian analysis of population PK/PD models: General concepts and software. J Pharmacokinet Pharmacodyn. 2002, 29 (3): 271-307. 10.1023/A:1020206907668.

Racine-Poon A, Wakefield J: Statistical methods for population pharmacokinetic modelling. Stat Meth Med Res. 1998, 7: 63-84. 10.1191/096228098670696372.

Bois FY: Applications of population approaches in toxicology. Toxicol Lett. 2001, 120: 385-394. 10.1016/S0378-4274(01)00270-3.

Bennett JE, Wakefield JC: A comparison of a Bayesian population method with two methods as implemented in commercially available software. J Pharmacokinet Biopharm. 1996, 24: 403-432. 10.1007/BF02353520.

Wakefield J: Bayesian individualization via sampling-based methods. J Pharmacokinet Biopharm. 1996, 24: 103-131. 10.1007/BF02353512.

Best NG, Tan KK, Gilks WR, Spiegelhalter DJ: Estimation of population pharmacokinetics using the Gibbs sampler. J Pharmacokinet Biopharm. 1995, 23: 407-435. 10.1007/BF02353641.

Kou SC, Cherayil BJ, Min W, English BP, et al: Single-molecule Michaelis-Menten equations. J Phys Chem B. 2005, 109: 19068-19081. 10.1021/jp051490q.

English BP, Min W, van Oijen AM, Lee KT, et al: Ever-fluctuating single enzyme molecules: Michaelis-Menten equation revisited. Nat Chem Biol. 2006, 2: 87-94. 10.1038/nchembio759.

Marjoram P, Molitor J, Plagnol V, Tavare S: Markov chain Monte Carlo without likelihoods. Proc Natl Acad Sci USA. 2003, 100: 15324-15328. 10.1073/pnas.0306899100.

Thomas DC: Using gene-environment interactions to dissect the effects of complex mixtures. J Expo Sci Environ Epidemiol. 2007, 17 (Suppl 2): S71-S74.

Parl F, Crooke P, Conti DV, Thomas DC: Pathway-based methods in molecular cancer epidemiology. Fundamentals of Molecular Epidemiology. Edited by: Rebbeck TR, Ambrosone CB, Shields PG. 2008, Informa Healthcare, New York, NY, 189-204.

Spiegelman D, Carroll RJ, Kipnis V: Efficient regression calibration for logistic regression in main study/internal validation study designs with an imperfect reference instrument. Stat Med. 2001, 20: 139-160. 10.1002/1097-0258(20010115)20:1<139::AID-SIM644>3.0.CO;2-K.

Holcroft CA, Spiegelman D: Design of validation studies for estimating the odds ratio of exposure-disease relationships when exposure is misclassified. Biometrics. 1999, 55: 1193-1201. 10.1111/j.0006-341X.1999.01193.x.

Thomas DC: Multistage sampling for latent variable models. Lifetime Data Anal. 2007, 13: 565-581. 10.1007/s10985-007-9061-1.

Breslow NE, Chatterjee N: Design and analysis of two-phase studies with binary outcome applied to Wilms tumor prognosis. Appl Statist. 1999, 48: 457-468. 10.1111/1467-9876.00165.

Carroll RJ, Ruppert D, Stefanski LA, Crainiceanu CM: Measurement Error in Nonlinear Models: A Modern Perspective. 2006, Chapman and Hall CRC Press, London, UK, 2nd edition.

Thomas DC, Stram D, Dwyer J: Exposure measurement error: Influence on exposure-disease relationships and methods of correction. Annu Rev Publ Health. 1993, 14: 69-93. 10.1146/annurev.pu.14.050193.000441.

Davey Smith G, Ebrahim S: "Mendelian randomization": Can genetic epidemiology contribute to understanding environmental determinants of disease?. Int J Epidemiol. 2003, 32: 1-22. 10.1093/ije/dyg070.

Davey Smith G, Ebrahim S: Mendelian randomization: Prospects, potentials, and limitations. Int J Epidemiol. 2004, 33: 30-42. 10.1093/ije/dyh132.

Davey Smith G, Ebrahim S: What can Mendelian randomisation tell us about modifiable behavioural and environmental exposures?. BMJ. 2005, 330: 1076-1079. 10.1136/bmj.330.7499.1076.

Lewis SJ, Davey Smith G: Alcohol, ALDH2, and esophageal cancer: A meta-analysis which illustrates the potentials and limitations of a Mendelian randomization approach. Cancer Epidemiol Biomarkers Prev. 2005, 14: 1967-1971. 10.1158/1055-9965.EPI-05-0196.

Thompson JR, Minelli C, Abrams KR, Tobin MD, et al: Meta-analysis of genetic studies using Mendelian randomization -- A multivariate approach. Stat Med. 2005, 24: 2241-2254. 10.1002/sim.2100.

Tobin MD, Minelli C, Burton PR, Thompson JR: Commentary: Development of Mendelian randomization: From hypothesis test to "Mendelian deconfounding". Int J Epidemiol. 2004, 33: 26-29. 10.1093/ije/dyh016.

Glynn RJ: Commentary. Genes as instruments for evaluation of markers and causes. Int J Epidemiol. 2006, 35: 932-934. 10.1093/ije/dyl107.

Hernan MA, Robins JM: Instruments for causal inference: An epidemiologist's dream?. Epidemiology. 2006, 17: 360-372. 10.1097/01.ede.0000222409.00878.37.

Brookhart MA, Wang PS, Solomon DH, Schneeweiss S: Instrumental variable analysis of secondary pharmacoepidemiologic data. Epidemiology. 2006, 17: 373-374. 10.1097/01.ede.0000222026.42077.ee.

Buzas JS, Stefanski LA: Instrumental variable estimation in generalized linear measurement error models. J Am Stat Assoc. 1996, 91: 999-1006.

Greenland S: An introduction to instrumental variables for epidemiologists. Int J Epidemiol. 2000, 29: 1102. 10.1093/oxfordjournals.ije.a019909.

Martens EP, Pestman WR, de Boer A, Belitser SV, et al: Instrumental variables: Application and limitations. Epidemiology. 2006, 17: 260-267. 10.1097/01.ede.0000215160.88317.cb.

Didelez V, Sheehan N: Mendelian randomization as an instrumental variable approach to causal inference. Stat Meth Med Res. 2007, 16: 309-330. 10.1177/0962280206077743.

Nitsch D, Molokhia M, Smeeth L, DeStavola BL, et al: Limits to causal inference based on Mendelian randomization: A comparison with randomized controlled trials. Am J Epidemiol. 2006, 163: 397-403. 10.1093/aje/kwj062.

Bautista LE, Smeeth L, Hingorani AD, Casas JP: Estimation of bias in nongenetic observational studies using "Mendelian triangulation". Ann Epidemiol. 2006, 16: 675-680. 10.1016/j.annepidem.2006.02.001.

Thomas DC, Conti DV: Commentary. The concept of "Mendelian randomization". Int J Epidemiol. 2004, 33: 21-25. 10.1093/ije/dyh048.

Newcomb PA, Baron J, Cotterchio M, Gallinger S, et al: Colon cancer family registry: An international resource for studies of the genetic epidemiology of colon cancer. Cancer Epidemiol Biomarkers Prev. 2007, 16: 2331-2343. 10.1158/1055-9965.EPI-07-0648.

Meier S, Gehring C: A guide to the integrated application of on-line data mining tools for the inference of gene functions at the systems level. Biotechnol J. 2008, 3: 1375-1387. 10.1002/biot.200800142.

Thomas PD, Mi H, Swan GE, Lerman C, et al: A systems biology network model for genetic association studies of nicotine addiction and treatment. Pharmacogenet Genomics. 2009, 19: 538-551. 10.1097/FPC.0b013e32832e2ced.

Werner T: Bioinformatics applications for pathway analysis of microarray data. Curr Opin Biotechnol. 2008, 19: 50-54. 10.1016/j.copbio.2007.11.005.

Jensen LJ, Saric J, Bork P: Literature mining for the biologist: From information retrieval to biological discovery. Nat Rev Genet. 2006, 7: 119-129. 10.1038/nrg1768.

Ashburner M, Ball CA, Blake JA, Botstein D, et al: Gene ontology: Tool for the unification of biology. Nat Genet. 2000, 25: 25-29. 10.1038/75556.

Conti DV, Cortessis V, Molitor J, Thomas DC: Bayesian modeling of complex metabolic pathways. Hum Hered. 2003, 56: 83-93. 10.1159/000073736.

George EI, McCulloch RE: Variable selection via Gibbs sampling. J Am Stat Assoc. 1993, 88: 881-889.

Park T, Casella G: The Bayesian lasso. J Am Stat Assoc. 2008, 103: 681-686. 10.1198/016214508000000337.

Zou H, Hastie T: Regularization and variable selection via the elastic net. J R Stat Soc Ser B. 2005, 67: 301-320. 10.1111/j.1467-9868.2005.00503.x.

Friedman N: Inferring cellular networks using probabilistic graphical models. Science. 2004, 303: 799-805. 10.1126/science.1094068.

Ruczinski I, Kooperberg C, LeBlanc ML: Exploring interactions in high-dimensional genomic data: An overview of logic regression, with applications. J Multivar Anal. 2004, 90: 178-195. 10.1016/j.jmva.2004.02.010.

Kooperberg C, Ruczinski I: Identifying interacting SNPs using Monte Carlo logic regression. Genet Epidemiol. 2005, 28: 157-170. 10.1002/gepi.20042.

Hastings W: Monte Carlo sampling methods using Markov chains and their applications. Biometrika. 1970, 57: 97-109. 10.1093/biomet/57.1.97.

Millstein J, Conti DV, Gilliland FD, Gauderman WJ: A testing framework for identifying susceptibility genes in the presence of epistasis. Am J Hum Genet. 2006, 78: 15-27. 10.1086/498850.

Roeder K, Devlin B, Wasserman L: Improving power in genome-wide association studies: Weights tip the scale. Genet Epidemiol. 2007, 31: 741-747. 10.1002/gepi.20237.

Saccone SF, Saccone NL, Swan GE, Madden PA, et al: Systematic biological prioritization after a genome-wide association study: An application to nicotine dependence. Bioinformatics. 2008, 24: 1805-1811. 10.1093/bioinformatics/btn315.

Wakefield J: Bayes factors for genome-wide association studies: Comparison with p-values. Genet Epidemiol. 2008, 33: 79-86.

Wakefield J: Reporting and interpretation in genome-wide association studies. Int J Epidemiol. 2008, 37: 641-653. 10.1093/ije/dym257.

Wakefield J: A Bayesian measure of the probability of false discovery in genetic epidemiology studies. Am J Hum Genet. 2007, 81: 208-227. 10.1086/519024.

Whittemore AS: A Bayesian false discovery rate for multiple testing. J Appl Statist. 2007, 34: 1-9. 10.1080/02664760600994745.

Lewinger JP, Conti DV, Baurley JW, Triche TJ, et al: Hierarchical Bayes prioritization of marker associations from a genome-wide association scan for further investigation. Genet Epidemiol. 2007, 31: 871-882. 10.1002/gepi.20248.

Chen GK, Witte JS: Enriching the analysis of genome-wide association studies with hierarchical modeling. Am J Hum Genet. 2007, 81: 397-404. 10.1086/519794.

Sebastiani P, Zhao Z, Abad-Grau MM, Riva A, et al: A hierarchical and modular approach to the discovery of robust associations in genome-wide association studies from pooled DNA samples. BMC Genet. 2008, 9: 6.

Wang K, Li M, Bucan M: Pathway-based approaches for analysis of genomewide association studies. Am J Hum Genet. 2007, 81: 1278-1283. 10.1086/522374.

Elbers CC, van Eijk KR, Franke L, Mulder F, et al: Using genome-wide pathway analysis to unravel the etiology of complex diseases. Genet Epidemiol. 2009, 33: 419-431. 10.1002/gepi.20395.

Chasman DI: On the utility of gene set methods in genome-wide association studies of quantitative traits. Genet Epidemiol. 2008, 32: 658-668. 10.1002/gepi.20334.

Holden M, Deng S, Wojnowski L, Kulle B: GSEA-SNP: Applying gene set enrichment analysis to SNP data from genome-wide association studies. Bioinformatics. 2008, 24: 2784-2785. 10.1093/bioinformatics/btn516.

Bush WS, Dudek SM, Ritchie MD: Biofilter: A knowledge-integration system for the multi-locus analysis of genome-wide association studies. Pac Symp Biocomput. 2009, 368-379.

Rajagopalan D, Agarwal P: Inferring pathways from gene lists using a literature-derived network of biological relationships. Bioinformatics. 2005, 21: 788-793. 10.1093/bioinformatics/bti069.

Hong MG, Pawitan Y, Magnusson PK, Prince JA: Strategies and issues in the detection of pathway enrichment in genome-wide association studies. Hum Genet. 2009.

Mootha VK, Lindgren CM, Eriksson KF, Subramanian A, et al: PGC-1alpha-responsive genes involved in oxidative phosphorylation are coordinately downregulated in human diabetes. Nat Genet. 2003, 34: 267-273. 10.1038/ng1180.

Pan W: Incorporating biological information as a prior in an empirical Bayes approach to analyzing microarray data. Stat Appl Genet Mol Biol. 2005, 4: Art. 12.

Subramanian A, Tamayo P, Mootha VK, Mukherjee S, et al: Gene set enrichment analysis: A knowledge-based approach for interpreting genome-wide expression profiles. Proc Natl Acad Sci USA. 2005, 102: 15545-15550.

Lesnick TG, Papapetropoulos S, Mash DC, Ffrench-Mullen J, et al: A genomic pathway approach to a complex disease: Axon guidance and Parkinson disease. PLoS Genet. 2007, 3: e98. 10.1371/journal.pgen.0030098.

Baranzini SE, Galwey NW, Wang J, Khankhanian P, et al: Pathway and network-based analysis of genome-wide association studies in multiple sclerosis. Hum Mol Genet. 2009, 18: 2078-2090. 10.1093/hmg/ddp120.

Torkamani A, Topol EJ, Schork NJ: Pathway analysis of seven common diseases assessed by genome-wide association. Genomics. 2008, 92: 265-272. 10.1016/j.ygeno.2008.07.011.

Vink JM, Smit AB, de Geus EJ, Sullivan P, et al: Genome-wide association study of smoking initiation and current smoking. Am J Hum Genet. 2009, 84: 367-379. 10.1016/j.ajhg.2009.02.001.

Perry JR, McCarthy MI, Hattersley AT, Zeggini E, et al: Interrogating type 2 diabetes genome-wide association data using a biological pathway-based approach. Diabetes. 2009, 58: 1463-1467. 10.2337/db08-1378.

Kasperaviciute D, Weale ME, Shianna KV, Banks GT, et al: Large-scale pathways-based association study in amyotrophic lateral sclerosis. Brain. 2007, 130: 2292-2301. 10.1093/brain/awm055.

Friedman N, Linial M, Nachman I, Pe'er D: Using Bayesian networks to analyze expression data. J Comput Biol. 2000, 7: 601-620. 10.1089/106652700750050961.

Yu J, Smith VA, Wang PP, Hartemink AJ, et al: Advances to Bayesian network inference for generating causal networks from observational biological data. Bioinformatics. 2004, 20: 3594-3603. 10.1093/bioinformatics/bth448.

Ritchie MD, White BC, Parker JS, Hahn CW, et al: Optimization of neural network architecture using genetic programming improves detection and modeling of gene-gene interactions in studies of human diseases. BMC Bioinformatics. 2003, 4: 28. 10.1186/1471-2105-4-28.

Byvatov E, Schneider G: Support vector machine applications in bioinformatics. Appl Bioinformatics. 2: 67-77.

Schafer J, Strimmer K: An empirical Bayes approach to inferring large-scale gene association networks. Bioinformatics. 2005, 21: 754-764.

Wu CC, Huang HC, Juan HF, Chen ST: GeneNetwork: An interactive tool for reconstruction of genetic networks using microarray data. Bioinformatics. 2004, 20: 3691-3693. 10.1093/bioinformatics/bth428.

Franke L, van Bakel H, Fokkens L, de Jong ED, et al: Reconstruction of a functional human gene network, with an application for prioritizing positional candidate genes. Am J Hum Genet. 2006, 78: 1011-1025. 10.1086/504300.

Basso K, Margolin AA, Stolovitzky G, Klein U, et al: Reverse engineering of regulatory networks in human B cells. Nat Genet. 2005, 37: 382-390. 10.1038/ng1532.

Kim TH, Ren B: Genome-wide analysis of protein-DNA interactions. Annu Rev Genom Hum Genet. 2006, 7: 81-102. 10.1146/annurev.genom.7.080505.115634.

Tu Z, Wang L, Arbeitman MN, Chen T, et al: An integrative approach for causal gene identification and gene regulatory pathway inference. Bioinformatics. 2006, 22: e489-e496. 10.1093/bioinformatics/btl234.

Yu H, Zhu X, Greenbaum D, et al: TopNet: A tool for comparing biological sub-networks, correlating protein properties with topological statistics. Nucleic Acids Res. 2004, 32: 328-337. 10.1093/nar/gkh164.

Blais A, Dynlacht BD: Constructing transcriptional regulatory networks. Genes Dev. 2005, 19: 1499-1511. 10.1101/gad.1325605.

Xie Y, Pan W, Jeong KS, Khodursky A: Incorporating prior information via shrinkage: A combined analysis of genome-wide location data and gene expression data. Stat Med. 2007, 26: 2258-2275. 10.1002/sim.2703.

Dixon AL, Liang L, Moffatt MF, Chen W, et al: A genome-wide association study of global gene expression. Nat Genet. 2007, 39: 1202-1207. 10.1038/ng2109.

Stranger BE, Forrest MS, Clark AG, Minichiello MJ, et al: Genome-wide associations of gene expression variation in humans. PLoS Genet. 2005, 1: e78. 10.1371/journal.pgen.0010078.

Morley M, Molony CM, Weber TM, Devlin JL, et al: Genetic analysis of genome-wide variation in human gene expression. Nature. 2004, 430: 743-747. 10.1038/nature02797.

Cheung VG, Spielman RS, Ewens KG, Weber TM, et al: Mapping determinants of human gene expression by regional and genome-wide association. Nature. 2005, 437: 1365-1369. 10.1038/nature04244.

Cheung VG, Conlin LK, Weber TM, Arcaro M, et al: Natural variation in human gene expression assessed in lymphoblastoid cells. Nat Genet. 2003, 33: 422-425. 10.1038/ng1094.

Jansen RC: Studying complex biological systems using multifactorial perturbation. Nat Rev Genet. 2003, 4: 145-151.

Armitage P, Doll R: The age distribution of cancer and multi-stage theory of carcinogenesis. Br J Cancer. 1954, 8 (1): 1-12. 10.1038/bjc.1954.1.

Moolgavkar S, Knudson A: Mutation and cancer: A model for human carcinogenesis. JNCI. 1981, 66: 1037-1052.

## Acknowledgements

This work was supported in part by NIH grants R01-CA92562, P50-ES07048, R01-CA112237 and U01-ES015090 (D.C.T., D.V.C., J.B.), R01-CA105437, R01-CA105145, R01-CA59045 (C.M.U.) and NSF grants DMS-0616710 and DMS-0109872 (F.N., M.R.). The authors are particularly grateful to Wei Liang and Fan Yang for programming support.

## About this article

### Cite this article

Thomas, D.C., Conti, D.V., Baurley, J. *et al.* Use of pathway information in molecular epidemiology.
*Hum Genomics* **4**, 21 (2009). https://doi.org/10.1186/1479-7364-4-1-21

DOI: https://doi.org/10.1186/1479-7364-4-1-21

### Keywords

- colorectal cancer
- complex diseases
- folate
- gene-environment interactions
- gene-gene interactions