
Beyond genomics: understanding exposotypes through metabolomics



Over the past 20 years, advances in genomic technology have enabled unparalleled access to the information contained within the human genome. However, the multiple genetic variants associated with various diseases typically account for only a small fraction of the disease risk. This may be due to the multifactorial nature of disease mechanisms, the strong impact of the environment, and the complexity of gene-environment interactions. Metabolomics is the quantification of small molecules produced by metabolic processes within a biological sample. Metabolomics datasets contain a wealth of information that reflects the disease state and is consequent to both genetic variation and environment. Thus, metabolomics is being widely adopted for epidemiologic research to identify disease risk traits. In this review, we discuss the evolution and challenges of metabolomics in epidemiologic research, particularly for assessing environmental exposures and providing insights into gene-environment interactions and mechanisms of biological impact.

Main text

Metabolomics can be used to measure the complex global modulating effect that an exposure event has on an individual phenotype. Combining information derived from all levels of protein synthesis and subsequent enzymatic action on metabolite production can reveal the individual exposotype. We discuss some of the methodological and statistical challenges in dealing with this type of high-dimensional data, such as the impact of study design, analytical biases, and biological variance. We show examples of disease risk inference from metabolic traits using metabolome-wide association studies. We also evaluate how these studies may drive precision medicine approaches and pharmacogenomics, which have so far been inefficient. Finally, we discuss how to promote transparency and open science to improve reproducibility and credibility in metabolomics.


Comparison of exposotypes at the human population level may help in understanding how environmental exposures affect biology at the systems level to determine cause, effect, and susceptibilities. Juxtaposition and integration of genomics and metabolomics information may offer additional insights. Clinical utility of this information for single individuals and populations has yet to be routinely demonstrated, but hopefully, recent advances to improve the robustness of large-scale metabolomics will facilitate clinical translation.


The main concepts underpinning genetic epidemiology developed rapidly after the delineation of the structure of DNA. Neel and Schull provided the first description of these concepts in 1954 [1, 2]. While the original goal of genetic epidemiology was to understand the nature of population and familial genetic inheritance, it soon became evident that environmental factors and gene-environment interactions were important to consider simultaneously [3].

Currently, the study of the whole genome (genomics) has evolved into a multidisciplinary area of science with highly diverse applications [4, 5]. Improved efficiency of genome technology combined with a sharp decrease in cost has enabled genomic assessments in large study populations [6, 7] using genotyping and next-generation-sequencing (NGS) approaches [8]. Thousands of genome-wide association studies (GWAS) have tracked relationships between base-pair/gene patterns in genomic loci and hundreds of diseases or exposures [9]. However, the discovered loci from these large-scale studies still explain only the minority of presumed heritability for most phenotypes of interest [10]. Moreover, it has been established that genes alone account for the minority of disease etiology for many important illnesses such as cancer, and environmental and lifestyle influences play a critical role [11]. However, quantifying the myriad of environmental and lifestyle risk factors including diet, smoking, exposure to hazardous chemicals, and pathogenic microorganisms is challenging [12, 13]. An individual can be exposed to a complex mix of chemical and biological contaminants, with multiple sources, for varying durations across their life course. This concept has been termed the “exposome,” a framework for the collective analysis, and measurement of an individual’s exposures over their lifetime [14]. Moreover, different environmental exposures may be heavily correlated with each other or may act in concert to produce adverse effects, which makes studying them one at a time challenging for assigning causality [15]. Therefore, it is essential to find tools that can measure the cumulative impact of multiple exposures alongside their interactions with the genetic background of individuals. 
Several multidimensional analytical approaches have been developed, beyond genomics, that try to capture different aspects of this complexity, and their integration into environmental health is discussed in this review.

Application of high-dimensional biology to the environmental health paradigm

Referred to as high-dimensional biology, or a multi-omics/systems-level approach, the combined analysis of data from the genome (genomics), RNA transcription (transcriptomics), proteins/peptides (proteomics), and metabolites (metabolomics) enables researchers to overlay gene information onto complementary datasets towards a more systemic understanding of diseases or other phenotypes of interest [16]. The complexity of high-dimensional datasets becomes even more convoluted when the interaction of environmental exposures is added to the system.

The environmental health paradigm (Fig. 1) integrates the knowledge of exposures and environmental health sciences to gain a deeper understanding of the consequences of exposure towards expression of a disease phenotype [17]. Exposures can elicit subtle effects at different stages of gene-encoding, protein synthesis, and on circulating metabolites. Multi-omics approaches using combined data from genomics, proteomics, and metabolomics techniques can identify downstream chemical alterations contributing to the development of an exposotype, the exposure phenotype (Fig. 1), that describes the accrued biological changes within a system that has undergone a specific exposure event [18]. Combining information from all levels of protein synthesis and subsequent enzymatic action on metabolite production is an essential step to start comprehending the complex global modulating effect that an exposure event has on an individual phenotype. This may allow for a greater direct understanding of molecular mechanisms that underpin the route of exposure, and the effect of molecular transit on different areas of metabolism, cellular reproduction, and ultimately the resulting exposotype.

Fig. 1

a Environmental health paradigm. b Exposure and the central dogma of molecular biology

Metabolites are the substrates and products of metabolism that drive essential cellular processes such as energy production, and signal transduction [19]. Of all the molecular entities (genes, transcripts, proteins, metabolites), metabolites have the closest relationship to expressed phenotype as they are the final end-points of upstream biochemical processing. Quantitative readouts of metabolite abundance reflect both this cellular processing and xenobiotics (foreign substances such as environmental chemicals, pollutants, drugs, food additives, dyes) that are physico-chemically distinct from molecular entities that originate in the host. Xenobiotics can be processed by enzymatic machinery, and metabolomics also allows quantification of these metabolites. Therefore, metabolomics can simultaneously analyze both exogenous chemicals and their metabolites, and changes to the endogenous metabolome, to allow assessment of broadly defined exposures and their biological impact [20,21,22,23]. One such example was a recent study of occupational exposure to trichloroethylene (TCE) [24]. TCE metabolites were identified in human plasma and associated with changes to endogenous metabolites that were known to be involved in immunosuppression, hepatotoxicity, and nephrotoxicity. This allowed the investigation into how the toxic effects of TCE exposure were manifested [24]. Another study, from the EXPOsOMICS project, examined human biofluids and exhaled breath for exposure to swimming pool disinfection by-products (DBPs) and for concomitant changes to endogenous metabolites. The study revealed a possible association between DBPs and perturbations to metabolites in the tryptophan pathway [25]. However, these studies and others which have measured exposures in relation to the metabolome highlight the challenge of attempting to unravel the effect of one circumscribed exposure versus combinations of different environmental exposures on the metabolome [26, 27].

One of the major bottlenecks of metabolomics is metabolite identification. However, the expansion and development of metabolite databases have eased this issue. Tens of thousands of metabolites have been identified and uploaded onto metabolite databases such as The Human Metabolome Database (HMDB), which to date houses 114,113 metabolites with associated chemical, clinical, and biochemical information. HMDB also hosts four additional databases including the Toxic Exposome Database (T3DB), which contains information on 3763 toxins [28, 29]. METLIN, another large database containing 961,829 metabolites, recently expanded due to the integration of xenobiotics from the United States Environmental Protection Agency’s “Distributed Structure-Searchable Toxicity (DSSTox)” database [30, 31]. The Exposome-Explorer database was recently designed to contain information on biomarkers of exposure to environmental risk factors for diseases. This database has information on 692 dietary and pollutant biomarkers, and importantly concentration values measured in biospecimens, with correlation values to assess quality of the biomarkers [32]. These databases, and others that house both xenobiotics and endogenous metabolites, appear in Table 1 [33,34,35,36,37,38]. With the recent expansion of these databases to include xenobiotics, metabolomics can facilitate biomonitoring of exposures, assessment of biological impact, and identification of exposotypes [39]. However, one potential gap in these databases still exists: the prediction of phase I and phase II biotransformed metabolites of xenobiotics, which can be used as proxy biomarkers for the chemical exposure. Metabolomics has revealed numerous novel metabolites of previously well-characterized pharmaceutical drugs such as acetaminophen [40], dietary supplements [41], and the genotoxic heterocyclic amine 2-amino-1-methyl-6-phenylimidazo[4,5-b]pyridine (PhIP) [42], present in meats cooked at high temperatures.
Metabolomics provides a window to identifying these new metabolites, as the biotransformed metabolite will only be present in a sample from an exposed individual. Secondly, there is typically more than one biotransformation metabolite present for each xenobiotic, which will have a similar covariance and correlation within the biological sample examined, thus making it possible to easily map out the related metabolites. One way to overcome this gap in the metabolite databases would be to have a tool housed on these databases that could automatically predict any potential biotransformations, and display the resultant important chemical information for identification. A few tools currently available for predicting phase I and II drug metabolism have been recently reviewed, along with the development of “DrugBug” which can predict xenobiotic metabolism by human gut microbiota enzymes [43]. Integration of such tools would facilitate exposome analysis.
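To make the idea of an automatic biotransformation predictor concrete, the following is a minimal sketch in Python. It is not one of the tools reviewed in [43]; it only adds a few standard monoisotopic mass shifts to a parent compound and looks for matching [M+H]+ signals. The function names, the 10 ppm tolerance, and the observed feature list are all hypothetical, and the mass shifts are textbook chemistry values rather than figures from this review.

```python
# Monoisotopic mass shifts in Da for three common biotransformations.
# These are standard chemistry values, not taken from the review.
MASS_SHIFTS = {
    "hydroxylation": 15.9949,     # phase I, +O
    "glucuronidation": 176.0321,  # phase II, +C6H8O6
    "sulfation": 79.9568,         # phase II, +SO3
}
PROTON_MASS = 1.00728  # added to a neutral mass to give the [M+H]+ ion


def predict_metabolites(parent_mass):
    """Return predicted neutral monoisotopic masses keyed by biotransformation."""
    return {name: parent_mass + shift for name, shift in MASS_SHIFTS.items()}


def match_features(observed_mz, parent_mass, tol_ppm=10.0):
    """Map each observed [M+H]+ m/z to a predicted biotransformation, if any."""
    matches = {}
    for mz in observed_mz:
        for name, mass in predict_metabolites(parent_mass).items():
            expected = mass + PROTON_MASS
            if abs(mz - expected) / expected * 1e6 <= tol_ppm:
                matches[mz] = name
    return matches


acetaminophen = 151.0633  # neutral monoisotopic mass of C8H9NO2
observed = [152.0706, 232.0274, 328.1027, 300.1234]  # made-up feature list
matches = match_features(observed, acetaminophen)
print(matches)
```

A real predictor would need enzyme-specific rules and structural checks (for example, whether the parent has a site that can be conjugated); a mass match alone is only a candidate annotation.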

Table 1 Mass spectrometry metabolite databases for identification of environmental exposures

The broad range of chemical classes that exist among the thousands of endogenous and environmentally derived metabolites contained within a biological sample has given rise to the need for analytical strategies that can separate and detect as much chemical diversity as possible from within the biological system under examination. The assessment of all metabolites present in a sample, untargeted metabolomics, is typically carried out using chromatography-based mass spectrometry and/or nuclear magnetic resonance spectroscopy, alongside bioinformatics that help understand the complex data generated [44]. Metabolomics research has undergone significant refocus over the past few years due to the improvements made in bioanalytical protocols and an evident shift towards the development of new chemoinformatic and bioinformatic tools [45]. These tools are designed to improve metabolite identification, particularly for microbial metabolites, and biological interpretation, which remain a major challenge for the field. For example, the mass spectrometry data generated in a metabolomics study have a high degree of degeneracy where the same metabolite can be represented as multiple signals [46]. Tools such as CAMERA [47], RAMClust [48], and “Credentialing” [49] have helped overcome this problem and improve peak annotation. Other notable tools include CSI:FingerID [50] which predicts the fragmentation of metabolites using an in silico method, thus aiding in metabolite identification, and “integrated-omics” housed on XCMSOnline [51], which aids in both metabolite identification and biological interpretation. Excellent reviews on the technological advancements in this area can be found elsewhere [52,53,54]; in addition, an extensive list of all current metabolomics software and data analysis resources is available [55, 56].
For population-level studies, the application of metabolomics for the analysis of thousands of samples has been optimized and demonstrated [57, 58], but the field could still benefit from decades’ worth of research and lessons learned in genetic epidemiology related to study design, statistical analyses, and reproducibility in large-scale population consortia.

Methodological challenges and considerations

Relevant and a priori formulated research questions and rigorous study designs and methods lay the foundation to perform a potentially successful piece of population-based research, after which replication is essential to confirm any associations, and to avoid the dissemination of potentially false research claims [59,60,61]. Prospective cohort studies follow a predefined population over time, capturing exposure information prior to occurrence of health events. This study design accommodates the appropriate temporal relationship between exposure and outcome, allows for testing of multiple risk factors and health outcomes, and permits collection of multiple pre-clinical biological specimens throughout the follow-up period. Although this is ideal from a metabolomics perspective, this study design often requires long follow-up durations and great expense. Case-control studies can be more efficient, and less expensive ways to test associations, but they lack the temporality criterion for causality, and metabolic profiles may be influenced by disease status. The use of nested case-control studies offers an efficient approach with the appropriate temporality between exposure and outcome. “Meet-in-the-middle” approaches, which involve linking intermediate biomarkers to both the exposure and outcome within cohort and nested case-control studies, are gaining popularity for their ability to reveal important linkages along the exposure-outcome pathway [62, 63].

While systems-level approaches hold great promise, they also pose challenges in the analysis of high-dimensional, complex data structure. The use of appropriate statistical tests within genomics, metabolomics, and epidemiology is dictated by the study design and the number of dimensions of data under investigation, with univariate or multivariate techniques applied to low-dimensional and high-dimensional datasets, respectively. Incorrect analytical decisions and interpretations that are made when conducting a study are a direct threat to reproducibility [64]. Table 2 [65,66,67,68,69,70,71,72,73,74,75,76,77,78,79,80,81,82,83,84,85,86,87] provides a list of some of the most commonly used statistical methods and tests in the interface of epidemiology, genetics, and metabolomics.

Table 2 Common statistical methods and tests used in epidemiology, genetics, and metabolomics, with reference link to descriptive articles on appropriate general use

Many analyses in metabolomics involve the use of null hypothesis significance testing (NHST) and the reporting of p values. The p value, one of the most misused statistics in science [88], has not escaped the focus of members of the fields of epidemiology [89], metabolomics [90], and general biomedicine [91]. Poor application has contributed to the irreproducible nature of many studies, so much so that the American Statistical Association felt moved to release a statement highlighting six underlying principles to dictate the proper use and interpretation of the p value [92, 93]. One should examine in each application whether NHST is best suited as an inferential tool or whether alternative approaches, such as the use of Bayesian methods or false discovery rates (FDR), are preferable [90, 94,95,96]. If p values are still used in multidimensional experiments, proper correction for multiplicity is important. There are numerous methods for accommodating family-wise error rates [90]. There are also some standard thresholds that can be used in specific settings, e.g., genome-wide significance p < 5 × 10⁻⁸ for genome-wide analyses. Some multiplicity corrections are more conservative than others; for instance, the Bonferroni correction (dividing the p value threshold required for significance by the number of tests performed) may be too conservative [97]. FDR and variants of FDR may be better suited [96] and can accommodate correlation structures between the multiple tested variables [98, 99].
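The difference between a family-wise correction and an FDR procedure can be illustrated with a short Python sketch. It contrasts the Bonferroni threshold described above with the Benjamini-Hochberg step-up procedure, which is one common FDR method and is chosen here as an assumption, since the text does not name a specific variant. The p values are invented for illustration.

```python
def bonferroni(p_values, alpha=0.05):
    """Flag tests whose p value is at or below alpha divided by the number of tests."""
    threshold = alpha / len(p_values)
    return [p <= threshold for p in p_values]


def benjamini_hochberg(p_values, alpha=0.05):
    """Flag discoveries using the Benjamini-Hochberg step-up FDR procedure."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    # Find the largest rank k such that p_(k) <= (k / m) * alpha.
    max_rank = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank / m * alpha:
            max_rank = rank
    # Every test ranked at or below max_rank is declared a discovery.
    rejected = [False] * m
    for rank, idx in enumerate(order, start=1):
        if rank <= max_rank:
            rejected[idx] = True
    return rejected


# Invented p values for ten metabolite features.
p_values = [0.001, 0.008, 0.039, 0.041, 0.042, 0.06, 0.074, 0.205, 0.212, 0.216]
n_bonferroni = sum(bonferroni(p_values))
n_bh = sum(benjamini_hochberg(p_values))
print(n_bonferroni, n_bh)
```

On this toy input Bonferroni retains one feature and Benjamini-Hochberg retains two, which shows the more conservative behavior of the family-wise correction. The basic Benjamini-Hochberg procedure assumes independent or positively dependent tests; variants that handle other correlation structures are not shown.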

Several methods are available that can help reduce complexity, detect trends, and generate predictive models within multidimensional datasets (Table 2) such as those generated by NGS and mass spectrometry when target genes or metabolites are not known. Unsupervised methods such as principal component analysis (PCA) provide an initial step to help reduce the complexity and indicate variables of interest by determining discriminant features linked to the “loadings” of different clusters. These loadings can be considered as the impact that a certain variable has on measured variance, so a high-level loading value displays a strong influence on clustered groups [100]. There also exist several extensions of the PCA architecture such as multiblock PCA, consensus PCA, or ANOVA-PCA that enable the user to control for underlying influential factors within datasets such as the intra-patient variability or other experimental confounders [65]. These approaches have been used for metabolomics and genetics analyses and also lend themselves to other cross-validation methods [66]. Supervised methods apply grouping stratification to the data based on some already known outcome variable(s). They aim to develop models that can accurately predict the correct grouping based on the input and identify genes, metabolites, or other statistical associations that underlie the grouping. The most commonly used methods are variants of regression tools (Table 2). Regression modeling can identify associations relevant to the disease [101], can predict association within gene expression patterns [102], and in metabolomics [103] can generate sample classification. However, as these tests are supervised, one of the issues with multivariate regression is that it tends to over-fit the data. Therefore, cross-validation (in the same dataset) and external validation (in additional datasets) are essential.
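As a small illustration of what PCA loadings and scores are, the sketch below extracts the first principal component of a toy data matrix using power iteration on the covariance matrix. This is a simplification chosen to stay within the Python standard library; in practice a numerical library would be used, and features are usually scaled as well as centered. The data and function name are hypothetical.

```python
import random


def first_principal_component(rows, n_iter=200, seed=0):
    """Return (loadings, scores) for the first principal component.

    rows: list of samples, each a list of feature values.
    Data are mean-centered but not scaled (an assumption of this sketch).
    """
    n, p = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(p)]
    centered = [[r[j] - means[j] for j in range(p)] for r in rows]
    # Sample covariance matrix of the features.
    cov = [[sum(c[a] * c[b] for c in centered) / (n - 1) for b in range(p)]
           for a in range(p)]
    # Power iteration converges to the leading eigenvector of cov.
    rng = random.Random(seed)
    v = [rng.random() for _ in range(p)]
    for _ in range(n_iter):
        w = [sum(cov[a][b] * v[b] for b in range(p)) for a in range(p)]
        norm = sum(x * x for x in w) ** 0.5
        v = [x / norm for x in w]
    # Scores are the projections of each centered sample onto the loadings.
    scores = [sum(c[j] * v[j] for j in range(p)) for c in centered]
    return v, scores


# Toy matrix: 6 samples x 3 features; feature 0 separates two groups.
rows = [
    [10.0, 1.0, 0.5],
    [11.0, 1.2, 0.4],
    [10.5, 0.9, 0.6],
    [20.0, 1.1, 0.5],
    [21.0, 1.0, 0.4],
    [20.5, 1.1, 0.6],
]
loadings, scores = first_principal_component(rows)
print([round(x, 3) for x in loadings])
print([round(s, 2) for s in scores])
```

Here the loading for feature 0 dominates, and the scores split the first three samples from the last three, which is the sense in which a high loading indicates a variable that drives the observed clustering.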

Perhaps, the biggest challenge yet for exposome researchers is integration of the multiple types of data generated from systems-level analyses and assessing the role of one versus multiple exposures on the phenotype. Currently, there are platforms that enable biochemical pathway analysis and integration of systems-level data, and these platforms can identify pathways and networks that are related to a known exposure or health outcome (such as disease). Dissection of pathways may help direct mechanistic studies into causality. The most useful to date for untargeted metabolomics data is “mummichog,” which uses computational algorithms to predict metabolic pathway effects directly from spectral feature tables without prior identification of metabolites [104]. Mummichog was recently integrated onto the XCMSOnline platform, with an added function to upload transcriptomic and proteomic data, for integrated pathway analysis [51]. Other notable software includes MarVis-Pathway [105], InCroMAP [106], GAM [107], and MetaCore™ (Thomson Reuters Corporation, Toronto, Canada) that can integrate multiple types of systems-level data for pathway interrogation. Combining this type of data with multiple measurements of xenobiotics has not yet been demonstrated, but tools are under development. Up to now, studies have primarily assessed the effect of individual exposures and have combined multiple systems-level approaches to assess biological response (i.e., benzene exposure and toxicity, susceptibility genes, mRNA and DNA methylation) [108]. Phenome data has also been integrated into studies to account for population variability and reduce false positives [22]. 
A recent example, from the analysis of preterm birth in the Rhea mother-child cohort study, selected those metabolites that had significant association with birth outcomes in logistic regression models and significant correlation coefficients with metabolic syndrome traits (BMI, blood pressure, blood glucose) to construct odds ratios [109]. Moreover, new tools are being specifically designed with the exposome in mind; xMWAS can integrate metabolomics data with that derived from the transcriptome [110], microbiome [111], and cytokine [112] and can be used for genome, epigenome, proteome, and other integrated omics analyses. However, modeling the effect of combined exposures is extremely complex. Co-exposures can be linked and cause an additive effect on the biological outcome, but it is not possible to know beforehand which combinations of exposures may have the largest biological effect. A recent novel method was developed that first estimates the correlation between pairs of exposures, then groups the highly correlated exposures by unsupervised machine learning [26], and identifies co-occurring exposure networks. This technique reduces the total number of combinations of exposures to “prevalent co-occurring combinations”; however, integration with other systems-level data still remains very complex. The additional challenges associated with integrating exposome data with metabolomics, genomics, and proteomics have been recently reviewed [27] and were also highlighted in a recent symposium report [113].
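The two-step idea of estimating pairwise correlations and then grouping highly correlated exposures can be sketched as follows. This is a simplified stand-in, not the method of [26]: it links any two exposures whose absolute Pearson correlation meets a threshold and reports the connected groups. The exposure names, values, and the 0.8 threshold are hypothetical.

```python
from itertools import combinations


def pearson(x, y):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5


def group_correlated(exposures, threshold=0.8):
    """Group exposures linked by |r| >= threshold (connected components)."""
    names = list(exposures)
    parent = {name: name for name in names}

    def find(a):
        # Follow parent links to the group representative, compressing the path.
        while parent[a] != a:
            parent[a] = parent[parent[a]]
            a = parent[a]
        return a

    for a, b in combinations(names, 2):
        if abs(pearson(exposures[a], exposures[b])) >= threshold:
            parent[find(a)] = find(b)

    groups = {}
    for name in names:
        groups.setdefault(find(name), []).append(name)
    return sorted(sorted(g) for g in groups.values())


# Invented measurements of four exposures in six individuals.
exposures = {
    "pm25": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0],
    "no2": [2.1, 3.9, 6.2, 8.1, 9.8, 12.0],
    "benzene": [1.5, 2.2, 2.9, 4.4, 4.8, 6.1],
    "alcohol": [3.0, 1.0, 4.0, 1.0, 5.0, 2.0],
}
groups = group_correlated(exposures)
print(groups)
```

In this toy example the three air-pollution variables collapse into one co-occurring group while alcohol stays separate, so downstream models would consider two exposure groups rather than four individual exposures.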

Analytical bias and biological variance in metabolomics analyses for epidemiologic studies

Metabolomics analyses in epidemiologic studies require additional consideration of sources of variability beyond traditional epidemiologic studies. There are a very large number of chemical features that can be detected by current highly sensitive mass spectrometers, and differences in metabolite recovery may arise from biological samples that are not collected under identical protocols. Additional batch variation can be introduced when handling large sample numbers [114], due to contaminant build-up and sample degradation [115].

Analytical bias in genomics and metabolomics can arise from practical laboratory aspects that, by their nature, favor the preselection of one type of variable (single nucleotide polymorphism (SNP) or chemical) over another. This is particularly evident when performing “untargeted” analyses in which the researcher is looking to maximize chemical coverage with a technology that cannot cover the full chemical space. With currently over 24 million SNPs having been documented within the human genome [116], the technology within SNP microarray chips has yet to catch up to this depth of coverage. The same issues are also present within metabolomics as no single technology can analyze the thousands of different metabolites within a sample. Therefore, pre-selecting approaches are commonly applied, be it using a gene-expression chip predefined for a subset of SNPs [117,118,119,120] or untargeted chromatography methods for metabolomics with a restricted spectrum of which metabolites can be captured [121]. These analytical biases are described in Fig. 2, but include the type of metabolite extraction method and column chemistry, which can enhance the analysis of some chemical functional groups and classes over others. For example, reversed-phase liquid chromatography (RPLC) can effectively analyze non-polar compounds such as lipids, carnitines, and bile acids, whereas hydrophilic interaction liquid chromatography (HILIC) is more suitable for the analysis of polar metabolites such as nucleotides, sugars, and amino acids. The two column chemistries have an analytical overlap of only 34%; thus, both column chemistries are needed if one wishes to obtain a relative quantification of the broadest chemical classes from a sample [122]. All types of study design need to consider inherent biological intra-individual variability as a potential source of variation (Fig. 2) as well as a source of discriminatory features. 
In addition to understanding and addressing potential methodological challenges and various sources of biases, open science practices are necessary to support the subsequent verification of research and use of the obtained data and results in subsequent secondary analyses and meta-analyses.

Fig. 2

The biological and analytical aspects of bias and variance that can lead to a tendency towards erroneous results in both untargeted and targeted metabolomics

Moving from genome-wide association studies (GWAS) to metabolome-wide association studies (MWAS)

One of the most-used study approaches in big data genome research, first demonstrated in 2005, is GWAS [123]. This technique examines genome-wide sets of genetic variants in samples of individuals to determine if any variants are associated with a trait and help pinpoint genes that may contribute to a person’s risk for a certain disease or other phenotype of interest. GWAS can be described as an untargeted and sometimes a hypothesis-generating approach to associate genetic variants with specific phenotypes. GWAS and consortia-based meta-analyses have been conducted with increasing sample size [124], allowing for improved power [125] to detect genome-wide significant signals for what are typically very small effect sizes. Due to the analytical uniformity of sequencing, this is one area where genomic research has advanced more quickly than metabolomics.

Most of the early untargeted metabolomics experiments have had limited sample sizes (n = 10–100), often as a result of technological, run-time, and statistical limitations. Given the large number of metabolic features that are typically generated by untargeted metabolomics (typically 1000s for liquid chromatography mass spectrometry), using such small sample sizes has led to over-fitting of data and spurious results [100]. Moreover, the highly collinear nature of metabolomics multivariate data [67] has not generally been properly factored into a priori power and sample size calculations, and there is no widely accepted method for sample size determination in metabolomics. In the absence of a specific metabolic target hypothesis, one can use a data-driven sample size determination (DSD) algorithm [126] where sample size estimation depends on the purpose of the study: whether it aims to find at least one statistically significant variation (biomarker discovery) or a maximum of statistically significant variations (metabolic exploration). Alternatively, one may adapt methods that have been developed for use with microarray gene expression data [127,128,129]. One common problem is that there is often high correlation between variables in one dataset, and in addition, not all variables have the same power. However, new more promising approaches have been generated using multivariate simulation to deal with this type of data structure [130].

Predictive power increases with sample size, and the current application of metabolomics to larger longitudinal cohort studies (n > 1000) is helping to give access to broader population data that can be linked to specific exposures such as alcohol [131, 132]. These types of studies are needed to improve biomarker discovery and inference of molecular mechanisms. Key issues continuously arise in the application of metabolomics to human subjects which can be overcome by putting metabolomics into epidemiological context. Common problems include causal and mechanistic claims based on differences between groups that have low numbers of individuals, lack of longitudinal data to avoid the possibility of reverse causation (a health outcome influencing pharmacokinetics and metabolite concentrations), limited information on lifestyle, socioeconomic and other influences, and the lack of multiple statistical tests and biological replication [133]. As metabolomics is incorporated into more population-level studies, it may be possible to more reliably model potential associations of metabolic profiles with phenotypes. The goal is to stratify metabolic data over exposure event data and ultimately determine the related disease risk. Confounding associations may still distort results and lead to erroneous conclusions. Yet with larger study numbers and longitudinal testing, it is more readily possible to control confounding by matching samples into related sub-groups defined by age, sex, or level of exposure.

Metabolome-wide association studies (MWAS) were first described in 2008 as the capture of “environmental and genomic influences to investigate the connections between phenotype variation and disease risk factors” [134, 135], thus helping reveal the complex gene-environment interactions on disease outcome. The method differs from conventional metabolomics in that high-throughput metabolomics is applied to large-scale epidemiologic studies at the population level and uses specialized algorithms to maximize the identification of biomarkers of disease risk [57]; for example, a recent algorithm was developed to correct for multiple testing using a permutation-based method to derive a metabolome-wide significance level controlling the family-wise error rate [136]. Initial studies showed that using high-throughput metabolomics, MWAS can be carried out on large population cohorts to provide individual metabolic phenotypes (metabotypes), and metabolic biomarkers correlated to exposures [137], and/or biological outcomes [138]. The proof-of-principle study used to coin the term MWAS identified discriminatory biomarkers of blood pressure and cardiovascular risk in 4630 individuals [138]. These types of studies may point to otherwise unknown features of the disease etiology or pathophysiology, which may be used to lead further mechanistic studies and potentially new avenues for therapeutic design, although the complexity of mechanisms makes such translation to therapeutic discovery very difficult. Comparison of metabotypes at the human population level can identify a signature of metabolites statistically correlated to disease risk and/or an exposure. Recent studies have shown the application of MWAS to identify metabolites correlated with cardiovascular events in a dietary intervention trial [139]. 
In another study, trimethylamine N-oxide (TMAO) was identified as a biomarker predictive of cardiovascular disease risk [140, 141] and was also shown to be involved in the production of atherosclerotic plaques. This discovery has resulted in a clinical blood test for TMAO, first offered by Cleveland HeartLab; therapeutics are currently being designed to inhibit TMAO production, and dietary changes are being recommended. Another application is to identify the enrichment of metabolites within specific biochemical pathways [142] to aid in the identification of genes and proteins/enzymes that may be related to the mechanism of disease. This method has gained traction within drug evaluation studies [143] that seek a more comprehensive understanding of individual responses to drug therapy [144, 145]. This application may be particularly useful for the design of immunotherapeutics, where metabolites have been shown to modulate autoimmunity and can be targeted to improve the efficacy of these drugs [146, 147]. It should be acknowledged, however, that therapeutic discovery or improvement in therapeutic management with known interventions has not yet been accomplished using metabolomics data. Nevertheless, metabolomics technologies are improving markedly in both their bioanalytical and chemometric components, and thus there is optimism for clinical translation.
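The permutation-based metabolome-wide significance level mentioned earlier in this section [136] can be illustrated with a generic max-statistic permutation scheme. The sketch below is an assumption-laden illustration and not the published algorithm: it uses Pearson correlation as the association measure, shuffles the outcome to break any true association, records the largest absolute correlation across all metabolites in each permutation, and takes an upper quantile of those maxima as a threshold that approximately controls the family-wise error rate. All function names and defaults are hypothetical.

```python
import random
from math import sqrt


def pearson_r(x, y):
    """Pearson correlation of two equal-length sequences (0.0 if either is constant)."""
    n = len(x)
    mx = sum(x) / n
    my = sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0
    return sxy / sqrt(sxx * syy)


def max_abs_r(features, outcome):
    """Largest absolute correlation between any metabolite feature and the outcome."""
    return max(abs(pearson_r(f, outcome)) for f in features)


def permutation_threshold(features, outcome, n_perm=200, alpha=0.05, seed=0):
    """Threshold on |r| exceeded by roughly a fraction alpha of permuted-data maxima.

    features: list of metabolite vectors, each with one value per sample.
    outcome:  list with one phenotype or exposure value per sample.
    """
    rng = random.Random(seed)
    shuffled = list(outcome)
    null_maxima = []
    for _ in range(n_perm):
        rng.shuffle(shuffled)  # break the feature-outcome link
        null_maxima.append(max_abs_r(features, shuffled))
    null_maxima.sort()
    n_exceed = int(alpha * n_perm)  # permutations allowed above the threshold
    k = max(0, n_perm - 1 - n_exceed)
    return null_maxima[k]
```

A metabolite whose observed absolute correlation with the outcome exceeds the returned threshold would be declared significant at the metabolome-wide level. Because the maximum is taken across all features, the correlation structure among metabolites is respected automatically, which is the main reason to prefer this kind of scheme over a Bonferroni correction for highly correlated metabolomics data.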

Transparency, reproducibility, and open science

There is growing recognition of the need for improved transparency, reproducibility, and replication in the biomedical literature [64, 91, 148, 149]. With respect to multidimensional, big data analyses, transparency can be improved with the sharing of data, protocols, and analytical code. Furthermore, few metabolomics studies have investigated reproducibility across multiple research centers, and ongoing interlaboratory efforts have struggled to generate metabolite data that are both accurate and reproducible across different labs [150]. Replication has been accepted as a sine qua non in certain disciplines, such as human genome epidemiology [149], and the same should apply across all multidimensional fields using big data. The research community is aware of this issue, and groups are convening to address it. For example, the European Centre for Ecotoxicology and Toxicology of Chemicals has provided a framework to facilitate the regulatory applicability and use of big data in chemical risk assessment [151, 152].

It is also important to protect inferences from data dredging/p-hacking (mining datasets prior to specifying a causal hypothesis) and from unaccounted-for multiple comparisons in complex datasets, which can inflate false-positive rates. Therefore, to improve the reproducibility of metabolomics, it is necessary to understand certain methodological and statistical challenges, to protect against analytical biases and biological variance, and to promote transparency and open science. These open science practices, which include “the process of making the content and process of producing evidence and claims transparent and accessible to other researchers” [64], can increase the credibility of research. For metabolomics in particular, both raw data and metadata are essential to facilitate reproducibility, secondary analyses, and the synthesis of evidence by external metabolomics researchers [153]. Several measures can support the transparency and reproducibility of metabolomics. For maximal impact, the whole metabolomics research community should adopt and adhere to standards that promote the uniform preparation of study results. The metabolomics standards initiative (MSI), which was conceived in 2005 by the Metabolomics Society, highlights a range of minimum reporting standards covering biological [154], chemical [155], analytical, and data reporting methods [156] within the metabolomics experimental pipeline. Ideally, however, metabolomics funders, reviewers, editors, and journals should require researchers to share their protocols, raw data, and analytical code. Broadly speaking, this does not happen (the Springer journal Metabolomics and the MDPI journal Metabolites being notable exceptions, in which MSI compliance is requested from authors and assessed by reviewers). Currently, most journals leave the suitability of metabolite submission data to reviewer and editor discretion.
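The inflation of false positives from unaccounted-for multiple comparisons can be shown with a few lines of arithmetic. The sketch below assumes independent tests, which is a simplification for correlated metabolite features, and shows the family-wise error rate for a given number of tests alongside two standard corrections: the Bonferroni threshold and the Benjamini-Hochberg step-up procedure for the false discovery rate [96]. The function names are illustrative choices.

```python
def family_wise_error_rate(alpha, n_tests):
    """Probability of at least one false positive among n independent tests."""
    return 1.0 - (1.0 - alpha) ** n_tests


def bonferroni_threshold(alpha, n_tests):
    """Per-test significance level that keeps the family-wise error rate at or below alpha."""
    return alpha / n_tests


def benjamini_hochberg(p_values, q=0.05):
    """Return a list of booleans marking which hypotheses are rejected at FDR level q."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])

    # Find the largest rank whose p-value falls under its step-up threshold.
    largest_rank = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank * q / m:
            largest_rank = rank

    # Reject every hypothesis ranked at or below that point.
    rejected = [False] * m
    for rank, idx in enumerate(order, start=1):
        if rank <= largest_rank:
            rejected[idx] = True
    return rejected
```

Under the independence assumption, 100 uncorrected tests at a nominal level of 0.05 give a better than 99% chance of at least one false positive, which is why an untargeted dataset with thousands of features cannot be interpreted from nominal p-values alone.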

Support is also beginning to appear from some funding bodies to help improve the reliability and efficiency of metabolomics. For example, the Data Repository and Coordination Center, which is part of the United States National Institutes of Health (NIH) Common Fund’s Metabolomics Program, has created the Metabolomics Data Repository. All research projects supported by the NIH Common Fund Metabolomics Program that create metabolomics data as part of the funded research are required to submit all raw data (e.g., spectrometric, spectrographic, and chromatographic data) and metadata (e.g., details on how samples were obtained and the analytical methods that were used) to the repository [157]. In addition, the European Union funded data repository MetaboLights had assembled data from 317 metabolomics studies as of December 2017. Common data submission formats, such as mzML/mzXML for mass spectrometry, nmrML for NMR data, and ISA-Tab format for metadata, have helped to unify this process [158, 159]. But the research community must be careful not to generate an excess of unconnected data repositories. Multiple and potentially overlapping repositories could confuse researchers as to where they should submit their data and therefore limit the chance of uniform acceptance and adoption of standards. To this end, the COSMOS project (COordination of Standards in MetabOlomicS) has been designed to address the challenges of e-infrastructure diversity in metabolomics by developing an interface that globally links community projects and output.

The predominant reason behind the lack of data sharing in metabolomics is the complexity and lack of standardization in the data generated. For research areas such as genomics, transcriptomics, and, to a lesser extent, proteomics, the chemistry of the molecules under detection is highly symmetrical. Regardless of nucleobase-pair connectivity, DNA and RNA constructs can be detected and typed using highly reproducible sequencing chips that can work in a high-throughput manner. The sheer range of molecular chemistries available within the human metabolome demands a multitude of separation strategies when mass spectrometry is used as the detection technology. Consequently, different research groups align their experimental pipelines to one of the many instrument vendors (often dictated by geography and cost), leading to a multitude of protocols that cover all aspects of experimentation. Just within the confines of liquid chromatography mass spectrometry-based metabolomics, 84% use open source software and/or commercial software from instrument vendors, and within the open source software group, the majority use XCMS, and a smaller percentage use MZmine and MZmine 2. Therefore, variability in the data processing alone limits integration of the MSI. One way to enable standardized data processing and biostatistics is to encourage the use of a universal workflow platform such as Galaxy [160]. In addition, the use of a standard reference material that can normalize and compare the detection levels from different instruments would be of value. A concerted effort is still needed by the community to enable broader reproducibility [161]. The lack of standardization and reporting is preventing the validation of metabolomics research [162].
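One way to picture how a shared reference material could make detection levels comparable across instruments is a simple per-metabolite scaling. The sketch below is hypothetical: it assumes every lab measures the same reference material, that a consensus value exists for each metabolite in it, and that instrument response is linear, none of which is guaranteed in practice. The function names and data layout are invented for illustration.

```python
def reference_scale_factors(reference_measured, reference_consensus):
    """Per-metabolite factor mapping one lab's reference reading onto the consensus value.

    Metabolites that are missing from the consensus table or were not detected
    (reading of zero or below) are skipped.
    """
    factors = {}
    for metabolite, measured in reference_measured.items():
        if metabolite in reference_consensus and measured > 0:
            factors[metabolite] = reference_consensus[metabolite] / measured
    return factors


def normalize_sample(sample, factors):
    """Rescale a study sample with the lab's factors; unscalable metabolites are dropped."""
    return {m: value * factors[m] for m, value in sample.items() if m in factors}
```

For example, if one instrument reads every metabolite at twice the consensus level and another at half, both labs' readings of the same study sample map onto the same values after scaling, so the two datasets can be pooled or compared directly.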


Human populations are exposed to a complex mix of chemicals and toxicants, from multiple sources, for varying durations. These exposures are affecting the health of the global population dramatically; for example, over seven million premature deaths annually are linked to air pollution exposure alone [163]. It is vital to gain a more comprehensive understanding of how these environmental exposures affect biology at the systems level in order to determine cause, effect, and susceptibilities. In doing so, a compound-specific “exposotype” can be developed that accounts for the totality of the multileveled downstream biological changes that an individual exposure event produces [18]. To better understand these effects, metabolomics can be used not only to develop metabolic biomarkers of exposure but also to build metabolic models that identify upstream genetic and enzymatic changes. This may complement GWAS, as knowledge of a potential enzymatic mutation can narrow down the DNA search space needed to identify relevant SNPs linked to the exposure [144, 145].

In-depth biological data generated by metabolomics can be used to enhance exposure studies by supplying information not only on directly affected metabolic pathways but also on off-target metabolic effects. The value of metabolomics for identifying gene-environment interactions lends itself to the study of the exposome, which will be the most complex and important integration of metabolomics to date. Further characterization of gene variants associated with those metabolic pathways could help forecast disease prevalence by combining pre-diagnostic metabolic signatures (collections of metabolites that change prior to disease onset) with genetic risk data. Preventive measures may then be tailored specifically for the individuals at risk. The combination of metabolomics with genomics offers one tool that may prove helpful towards materializing precision medicine. Success in precision medicine has been difficult to achieve [164], but the recent US Food and Drug Administration approval of pembrolizumab, a “tumor-agnostic” therapeutic which targets any solid tumor with a specific genetic feature, shows that the field is starting to head in that direction [165]. Given recent evidence that non-genomic influences such as the microbiome can influence therapeutic response, metabolomics may be used in this context to identify factors that distinguish non-responders from responders [166].

However, some of the caveats of conventional metabolomics and population studies remain, such as accurate identification of new metabolites, controlling for multiple levels of confounders, and the integration of different forms of data from different analytical platforms. Further advancement can be made by routine application of appropriate statistical tools to metabolomics as well as the adoption and promotion of transparent and reproducible research practices. Reproducible, transparent advances may then be examined for their impact in changing outcomes in single patients and at the population level to judge their utility.



FDR: False discovery rate

GWAS: Genome-wide association studies

MSI: Metabolomics standards initiative

MWAS: Metabolome-wide association studies

NGS: Next-generation sequencing

NHST: Null hypothesis significance testing

PCA: Principal component analysis

SNP: Single nucleotide polymorphism


  1. 1.

    Neel JV, Schull WJ. Human heredity. Chicago: Chicago Press; 1954.

    Google Scholar 

  2. 2.

    DeWan AT. Five classic articles in genetic epidemiology. Yale J Biol Med. 2010;83:87–90.

    PubMed  PubMed Central  Google Scholar 

  3. 3.

    Beaty TH, Khoury MJ. Interface of genetics and epidemiology. EpidemiolRev. 2000;22:120–5.

    CAS  Google Scholar 

  4. 4.

    Sanger F, Nicklen S, Coulson AR. DNA sequencing with chain-terminating inhibitors. Proc Natl Acad Sci U S A. 1977;74:5463–7.

    CAS  PubMed  PubMed Central  Article  Google Scholar 

  5. 5.

    National Human Genome Research Institute. All about the Human Genome Project (HGP). 2014. Available from: Accessed 17 Jan 2018.

  6. 6.

    Pareek CS, Smoczynski R, Tretyn A. Sequencing technologies and genome sequencing. J Appl Genet. 2011;52:413–35.

    CAS  PubMed  PubMed Central  Article  Google Scholar 

  7. 7.

    Hayden EC. The $1,000 genome. Nature. 2014;507:295.

    Article  CAS  Google Scholar 

  8. 8.

    Goldfeder RL, Wall DP, Khoury MJ, JPA I. Human genome sequencing at population scale: a primer on high throughput DNA sequencing and analysis. Am J Epidemiol. 2017;186:1000–9.

    PubMed  Article  Google Scholar 

  9. 9.

    Goodwin S, JD MP, WR MC. Coming of age: ten years of next-generation sequencing technologies. Nat Rev Genet. 2016;17:333–51.

    CAS  PubMed  Article  Google Scholar 

  10. 10.

    Manolio TA, Collins FS, Cox NJ, Goldstein DB, Hindorff LA, Hunter DJ, et al. Finding the missing heritability of complex diseases. Nature. 2009;461:747–53.

    CAS  PubMed  PubMed Central  Article  Google Scholar 

  11. 11.

    Theodoratou E, Timofeeva M, Li X, Meng X, JPA I. Nature, nurture, and cancer risks: genetic and nutritional contributions to cancer. Annu Rev Nutr. 2017;21:293–320.

    Article  CAS  Google Scholar 

  12. 12.

    Willett WC. Balancing life-style and genomics research for disease prevention. Science (80- ). 2002;296:695–8.

    CAS  Article  Google Scholar 

  13. 13.

    Rappaport SM, Smith MT. Environment and disease risks. Science (80-. ). 2010;330:460–1.

    CAS  Article  Google Scholar 

  14. 14.

    Wild CP. Complementing the genome with an “Exposome”: the outstanding challenge of environmental exposure measurement in molecular epidemiology. Cancer Epidemiol Biomarkers. 2005;14:1847. LP-1850

    CAS  Article  Google Scholar 

  15. 15.

    Patel CJ, Ioannidis JPA. Studying the elusive environment in large scale. JAMA. 2014;311:2173–4.

    CAS  PubMed  PubMed Central  Article  Google Scholar 

  16. 16.

    Romero R, Espinoza J, Gotsch F, Kusanovic JP, Friel LA, Erez O, et al. The use of high-dimensional biology (genomics, transcriptomics, proteomics, and metabolomics) to understand the preterm parturition syndrome. BJOG. 2006;113:118–35.

    CAS  PubMed  Article  Google Scholar 

  17. 17.

    Wilson SH. Disease-first: a new paradigm for environmental health science research. Environ Health Perspect. 2006;114:2006.

    Article  Google Scholar 

  18. 18.

    Rattray NJW, Charkoftaki G, Rattray Z, Hansen JE, Vasiliou V, Johnson CH. Environmental influences in the etiology of colorectal cancer: the premise of metabolomics. Curr Pharmacol Reports. 2017;3:114–25.

    CAS  Article  Google Scholar 

  19. 19.

    Patti GJ, Yanes O, Siuzdak G. Innovation: Metabolomics: the apogee of the omics trilogy. Nat Rev Mol Cell Biol. 2012;13:263–9.

    CAS  PubMed  PubMed Central  Article  Google Scholar 

  20. 20.

    Ellis JK, Athersuch TJ, Thomas LD, Teichert F, Perez-Trujillo M, Svendsen C, et al. Metabolic profiling detects early effects of environmental and lifestyle exposure to cadmium in a human population. BMC Med. 2012;10:61.

    CAS  PubMed  PubMed Central  Article  Google Scholar 

  21. 21.

    Andra SS, Austin C, Wright RO, Arora M. Reconstructing pre-natal and early childhood exposure to multi-class organic chemicals using teeth: towards a retrospective temporal exposome. Environ Int. 2015;83:137–45.

    CAS  PubMed  PubMed Central  Article  Google Scholar 

  22. 22.

    Maitre L, Villanueva CM, Lewis MR, Ibarluzea J, Santa-Marina L, Vrijheid M, et al. Maternal urinary metabolic signatures of fetal growth and associated clinical and environmental factors in the INMA study. BMC Med. 2016;14:1–12.

    Article  Google Scholar 

  23. 23.

    Baker MG, Simpson CD, Lin YS, Shireman LM, Seixas N. Original article the use of metabolomics to identify biological signatures of manganese exposure. Ann Work Expo Heal. 2017;61:406–15.

    Article  Google Scholar 

  24. 24.

    Walker DI, Uppal K, Zhang L, Vermeulen R, Smith M, Hu W, et al. High-resolution metabolomics of occupational exposure to trichloroethylene. Int J Epidemiol. 2016;45:1517–27.

    PubMed  PubMed Central  Article  Google Scholar 

  25. 25.

    van Veldhoven K, Keski-Rahkonen P, Barupal DK, Villanueva CM, Font-Ribera L, Scalbert A, et al. Effects of exposure to water disinfection by-products in a swimming pool: a metabolome-wide association study. Environ Int Elsevier. 2018;111:60–70.

    CAS  Article  Google Scholar 

  26. 26.

    Patel CJ. Analytic complexity and challenges in identifying mixtures of exposures associated with phenotypes in the Exposome era. Curr Epidemiol Reports. 2017;4:22–30.

    Article  Google Scholar 

  27. 27.

    Patel CJ, Kerr J, Thomas DC, Mukherjee B, Ritz B, Chatterjee N, et al. Opportunities and challenges for environmental exposure assessment in population-based studies. Cancer Epidemiol Biomarkers Prev. 2017;26:cebp.0459.2017.

    Article  Google Scholar 

  28. 28.

    Wishart D, Arndt D, Pon A, Sajed T, Guo AC, Djoumbou Y, et al. T3DB: the toxic exposome database. Nucleic Acids Res. 2015;43:D928–34.

    CAS  PubMed  Article  Google Scholar 

  29. 29.

    Lim E, Pon A, Djoumbou Y, Knox C, Shrivastava S, Guo AC, et al. T3DB: a comprehensively annotated database of common toxins and their targets. Nucleic Acids Res. 2009;38:781–6.

    Article  CAS  Google Scholar 

  30. 30.

    Warth B, Spangler S, Fang M, Johnson CH, Forsberg EM, Granados A, et al. Exposome-scale investigations guided by global metabolomics, pathway analysis, and cognitive computing. Anal Chem. 2017; In-Press

  31. 31.

    Richard AM, Williams CR. Distributed structure-searchable toxicity (DSSTox) public database network: a proposal. Mutat Res Fundam Mol Mech Mutagen. 2002;499:27–52.

    CAS  Article  Google Scholar 

  32. 32.

    Neveu V, Moussy A, Rouaix H, Wedekind R, Pon A, Knox C, et al. Exposome-explorer: a manually-curated database on biomarkers of exposure to dietary and environmental factors. Nucleic Acids Res. 2017;45:D979–84.

    PubMed  Article  Google Scholar 

  33. 33.

    Wishart DS, Tzur D, Knox C, Eisner R, Guo AC, Young N, et al. HMDB: the human metabolome database. Nucleic Acids Res. 2007;35:521–6.

    Article  Google Scholar 

  34. 34.

    Smith CA, O’Maille G, Want EJ, Qin C, Trauger SA, Brandon TR, et al. A metabolite mass spectral database. Ther Drug Monit. 2005;27:747–51.

    CAS  PubMed  Article  Google Scholar 

  35. 35.

    Cui Q, Lewis IA, Hegeman AD, Anderson ME, Li J, Schulte CF, et al. Metabolite identification via the Madison Metabolomics Consortium Database [3]. Nat Biotechnol. 2008;26:162–4.

    CAS  PubMed  Article  Google Scholar 

  36. 36.

    Wishart DS, Knox C, Guo AC, Shrivastava S, Hassanali M, Stothard P, et al. DrugBank: a comprehensive resource for in silico drug discovery and exploration. Nucleic Acids Res. 2006;34:D668–72.

    CAS  PubMed  Article  Google Scholar 

  37. 37.

    Kaiser J. Chemists want NIH to curtail database. Science (80-. ). 2005;308:774.

    CAS  Article  Google Scholar 

  38. 38.

    Williams AJ, Grulke CM, Edwards J, AD ME, Mansouri K, Baker NC, et al. The CompTox chemistry dashboard: a community data resource for environmental chemistry. J Cheminform. 2017;9:61.

    PubMed  PubMed Central  Article  Google Scholar 

  39. 39.

    Beger RD, Dunn W, Schmidt MA, Gross SS, Kirwan JA, Cascante M, et al. Metabolomics enables precision medicine: “a white paper, community perspective”. Metabolomics. 2016;12:149.

    PubMed  PubMed Central  Article  CAS  Google Scholar 

  40. 40.

    Chen C, Krausz KW, Idle JR, Gonzalez FJ. Identification of novel toxicity-associated metabolites by metabolomics and mass isotopomer analysis of acetaminophen metabolism in wild-type and Cyp2e1-null mice. J Biol Chem. 2008;283:4543–59.

    CAS  PubMed  Article  Google Scholar 

  41. 41.

    Johnson CH, Krausz KW, Kang DW, Patterson AD, Kim J, Luecke H, et al. Novel metabolites and roles for a-tocopherol in humans and mice discovered by mass spectrometry-based metabolomics 1–5. Am J Clin Nutr. 2012;96:818–30.

    CAS  PubMed  PubMed Central  Article  Google Scholar 

  42. 42.

    Chen C, Ma X, Malfatti MA, Krausz KW, Kimura S, Felton JS, et al. A comprehensive investigation of 2-amino-1-methyl-6-phenylimidazo[4,5-b]pyridine (PhIP) metabolism in the mouse using a multivariate data analysis approach. Chem Res Toxicol. 2007;20:531–42.

    CAS  PubMed  PubMed Central  Article  Google Scholar 

  43. 43.

    Sharma AK, Jaiswal SK, Chaudhary N, Sharma VK. A novel approach for the prediction of species-specific biotransformation of xenobiotic/drug molecules by the human gut microbiota. Sci Rep. 2017;7:1–13.

    Article  Google Scholar 

  44. 44.

    Gavaghan CL, Holmes E, Lenz E, Wilson ID, Nicholson JK. An NMR-based metabonomic approach to investigate the biochemical consequences of genetic strain differences: application to the C57BL10J and Alpk:ApfCD mouse. FEBS Lett 2000;484:169–174.

  45. 45.

    Johnson CH, Ivanisevic J, Siuzdak G. Metabolomics: beyond biomarkers and towards mechanisms. Nat Rev Mol Cell Biol. 2016;17:451–9.

  46. 46.

    Mahieu NG, Patti GJ. Systems-level annotation of a metabolomics data set reduces 25 000 features to fewer than 1000 unique metabolites. Anal Chem. 2017;89:10397–406.

    CAS  PubMed  Article  Google Scholar 

  47. 47.

    Kuhl C, Tautenhahn R, Böttcher C, Larson TR, Neumann S. CAMERA: an integrated strategy for compound spectra extraction and annotation of liquid chromatography/mass spectrometry data sets. Anal Chem. 2012;84:283–9.

    CAS  PubMed  Article  Google Scholar 

  48. 48.

    Broeckling CD, Afsar FA, Neumann S, Ben-Hur A, Prenni JE. RAMClust: a novel feature clustering method enables spectral-matching-based annotation for metabolomics data. Anal Chem. 2014;86:6812–7.

    CAS  PubMed  Article  Google Scholar 

  49. 49.

    Mahieu NG, Huang X, Chen YJ, Patti GJ. Credentialing features: a platform to benchmark and optimize untargeted metabolomic methods. Anal Chem. 2014;86:9583–9.

    CAS  PubMed  PubMed Central  Article  Google Scholar 

  50. 50.

    da Silva RR, Dorrestein PC, Quinn RA. Illuminating the dark matter in metabolomics. Proc Natl Acad Sci. 2015;112:12549–50.

    CAS  PubMed  PubMed Central  Article  Google Scholar 

  51. 51.

    Huan T, Forsberg EM, Rinehart D, Johnson CH, Ivanisevic J, Benton HP, et al. Systems biology guided by XCMS online metabolomics. Nat Methods. 2017;14:461–2.

    CAS  PubMed  Article  Google Scholar 

  52. 52.

    Evans AM, DeHaven CD, Barrett T, Mitchell M, Milgram E. Integrated, nontargeted ultrahigh performance liquid chromatography/electrospray ionization tandem mass spectrometry platform for the identification and relative quantification of the small-molecule complement of biological systems. Anal Chem. 2009;81:6656–67.

    CAS  PubMed  Article  Google Scholar 

  53. 53.

    Lankadurai BP, Nagato EG, Simpson MJ. Environmental metabolomics: an emerging approach to study organism responses to environmental stressors. Environ Rev. 2013;21:180–205.

    CAS  Article  Google Scholar 

  54. 54.

    Johnson CH, Ivanisevic J, Benton HP, Siuzdak G. Bioinformatics: the next frontier of metabolomics. Anal Chem. 2015;87:147–56.

    CAS  PubMed  Article  Google Scholar 

  55. 55.

    Misra BB, van der Hooft JJJ. Updates in metabolomics tools and resources: 2014-2015. Electrophoresis. 2016;37:86–110.

    CAS  PubMed  Article  Google Scholar 

  56. 56.

    Misra BB, Fahrmann JF, Grapov D. Review of emerging metabolomic tools and resources: 2015–2016. Electrophoresis. 2017;38:2257–74.

    CAS  PubMed  Article  Google Scholar 

  57. 57.

    Chan Q, Loo R, Ebbels T, Van Horn L, Daviglus M, Stamler J, et al. Metabolic phenotyping for discovery of urinary biomarkers of diet, xenobiotics and blood pressure in the INTERMAP study: an overview. Hypertens Res. 2016;40:1–10.

    Google Scholar 

  58. 58.

    Karaman I, Ferreira DLS, Boulangé CL, Kaluarachchi MR, Herrington D, Dona AC, et al. Workflow for integrated processing of multicohort untargeted 1H NMR metabolomics data in large-scale metabolic epidemiology. J Proteome Res. 2016;15:4188–94.

    CAS  PubMed  Article  Google Scholar 

  59. 59.

    Ioannidis J, Allison D, Ball C, Coulibaly I, Cui X, Culhane A, et al. Repeatability of published microarray gene expression analyses. Nat Genet. 2009;41:149–204.

    CAS  PubMed  Article  Google Scholar 

  60. 60.

    Kraft P, Zeggini E, Ioannidis J. Replication in genome-wide association studies. Stat Sci. 2010;24:561–73.

    Article  Google Scholar 

  61. 61.

    Ioannidis JPA. Why most published research findings are false. PLoS Med. 2005;2:0696–701.

    Google Scholar 

  62. 62.

    Chadeau-Hyam M, Athersuch TJ, Keun HC, De Iorio M, TMD E, Jenab M, et al. Meeting-in-the-middle using metabolic profiling—a strategy for the identification of intermediate biomarkers in cohort studies. Biomarkers. 2011;16:83–8.

    CAS  PubMed  Article  Google Scholar 

  63. 63.

    Vineis P, Perera F. Molecular epidemiology and biomarkers in etiologic cancer research: the new in light of the old. Cancer Epidemiol Biomark Prev. 2007;16:1954–65.

    CAS  Article  Google Scholar 

  64. 64.

    Munafò MR, Nosek BA, Bishop DVM, Button KS, Chambers CD, Percie du Sert N, et al. A manifesto for reproducible science. Nat Hum Behav. 2017;1:1–9.

    Article  Google Scholar 

  65. 65.

    Xu Y, Goodacre R. Multiblock principal component analysis: an efficient tool for analyzing metabolomics data which contain two influential factors. Metabolomics. 2012;8:37–51.

    CAS  Article  Google Scholar 

  66. 66.

    Abdi H, Williams LJ, Valentin D. Multiple factor analysis: principal component analysis for multitable and multiblock data sets. Wiley Interdiscip Rev Comput Stat. 2013;5:149–79.

    Article  Google Scholar 

  67. 67.

    Vinaixa M, Samino S, Saez I, Duran J, Guinovart JJ, Yanes O. A guideline to univariate statistical analysis for LC/MS-based untargeted metabolomics-derived data. Meta. 2012;2:775–95.

    CAS  Google Scholar 

  68. 68.

    Sanderson S, Tatt ID, Higgins JPT. Tools for assessing quality and susceptibility to bias in observational studies in epidemiology: a systematic review and annotated bibliography. Int J Epidemiol. 2007;36:666–76.

    PubMed  Article  Google Scholar 

  69. 69.

    Szklo M, Nieto FJ. Epidemiology: beyond the basics. 3rd Ed. Aspen: Jones & Bartlett Learning; 2000.

  70. 70.

    van den Berg RA, Hoefsloot HCJ, Westerhuis JA, Smilde AK, van der Werf MJ. Centering, scaling, and transformations: improving the biological information content of metabolomics data. BMC Genomics. 2006;7:142.

    PubMed  PubMed Central  Article  CAS  Google Scholar 

  71. 71.

    Chin L, Hahn WC, Getz G, Meyerson M. Making sense of cancer genomic data. Genes Dev. 2011;25:534–55.

  72. 72.

    Dohoo IR, Ducrot C, Fourichon C, Donald A, Hurnik D. An overview of techniques for dealing with large numbers of independent variables in epidemiologic studies. Prev Vet Med. 1997;29:221–39.

    CAS  PubMed  Article  Google Scholar 

  73. 73.

    Eriksson L, Antti H, Gottfries J, Holmes E, Johansson E, Lindgren F, et al. Using chemometrics for navigating in the large data sets of genomics, proteomics, and metabonomics (gpm). Anal Bioanal Chem. 2004;380:419–29.

    CAS  PubMed  Article  Google Scholar 

  74. 74.

    DiBello JR, Kraft P, ST MG, Goldberg R, Campos H, Baylin A. Comparison of 3 methods for identifying dietary patterns associated with risk of disease. Am J Epidemiol. 2008;168:1433–43.

    PubMed  PubMed Central  Article  Google Scholar 

  75. 75.

    Westerhuis JA, Kourti T, MacGregor JF. Analysis of multiblock and hierarchical PCA and PLS models. J Chemom. 1998;12:301–21.

    CAS  Article  Google Scholar 

  76. 76.

    Zwanenburg G, Huub CJ, Westerhuis JA, Jansen JJ, Smilde AK. ANOVA-principal component analysis and ANOVA-simultaneous component analysis: a comparison. J Chemom. 2011;25:561–7.

    CAS  Article  Google Scholar 

  77. 77.

    Gromski PS, Muhamadali H, Ellis DI, Xu Y, Correa E, Turner ML, et al. A tutorial review: metabolomics and partial least squares-discriminant analysis—a marriage of convenience or a shotgun wedding. Anal Chim Acta. 2015;879:10–23.

    CAS  PubMed  Article  Google Scholar 

  78. 78.

    Jombart T, Devillard S, Balloux F, Falush D, Stephens M, Pritchard J, et al. Discriminant analysis of principal components: a new method for the analysis of genetically structured populations. BMC Genet. 2010;11:94.

    PubMed  PubMed Central  Article  Google Scholar 

  79. 79.

    Ogutu JO, Schulz-Streeck T, Piepho H-P. Genomic selection using regularized linear regression models: ridge regression, lasso, elastic net and their extensions. BMC Proc. 2012;6:S10.

    PubMed  PubMed Central  Article  Google Scholar 

  80. 80.

    Acharjee A, Finkers R, Visser RG, Maliepaard C. Comparison of regularized regression methods for ~omics data. Metabolomics. 2013;3:126.

    Google Scholar 

  81. 81.

    Tzoulaki I, Ebbels TMD, Valdes A, Elliott P, JPA I. Design and analysis of metabolomics studies in epidemiologic research: a primer on-omic technologies. Am J Epidemiol. 2014;180:129–39.

    PubMed  Article  Google Scholar 

  82. 82.

    Abdi H. Partial least squares (PLS) regression. Encycl Res Methods Soc Sci. 2003;2003:792–5.

  83. 83.

    Bylesjo M, Rantalainen M, Cloarec O, Nicholson JK, Holmes E, Trygg J. OPLS discriminant analysis: combining the strengths of PLS-DA and SIMCA classification. J Chemom. 2006;20:3541–351.

    Article  CAS  Google Scholar 

  84. 84.

    Wold S, Sjöström M, Eriksson L. PLS-regression: a basic tool of chemometrics. Chemom Intell Lab Syst. 2001;58:109–30.

    CAS  Article  Google Scholar 

  85. 85.

    Waldron L, Pintilie M, Tsao MS, Shepherd FA, Huttenhower C, Jurisica I. Optimized application of penalized regression methods to diverse genomic data. Bioinformatics. 2011;27:3399–406.

    CAS  PubMed  PubMed Central  Article  Google Scholar 

  86. 86.

    Tibshirani R. Regression selection and shrinkage via the lasso. J R Stat Soc B. 1996;58:267–88.

    Google Scholar 

  87. 87.

    Vaarhorst AAM, Verhoeven A, Weller CM, Böhringer S, Göraler S, Meissner A, et al. A metabolomic profile is associated with the risk of incident coronary heart disease. Am Heart J. 2014;168:45–52. e7

    CAS  PubMed  Article  Google Scholar 

  88. 88.

    Baker M. Statisticians issue warning over misuse of P values. Nature. 2016;531:151.

    CAS  PubMed  Article  Google Scholar 

  89. 89.

    Greenland S, Senn SJ, Rothman KJ, Carlin JB, Poole C, Goodman SN, et al. Statistical tests, P values, confidence intervals, and power: a guide to misinterpretations. Eur J Epidemiol Springer Netherlands. 2016;31:337–50.

    Article  Google Scholar 

  90. 90.

    Broadhurst DI, Kell DB. Statistical strategies for avoiding false discoveries in metabolomics and related experiments. Metabolomics. 2006;2:171–96.

    CAS  Article  Google Scholar 

  91. 91.

    Chavalarias D, Wallach JD, Li AHT, Ioannidis JPA, Gigerenzer G, Berlin J, et al. Evolution of reporting P values in the biomedical literature, 1990–2015. JAMA. 2016;315:1141.

    CAS  PubMed  Article  Google Scholar 

  92. 92.

    The American Statistical Association. Statement on statistical significance and P-values. 2016;

    Google Scholar 

  93. 93.

    Wasserstein RL, Lazar NA. The ASA’s statement on p-values: context, process, and purpose. Am Stat. 2016;70:129–33.

    Article  Google Scholar 

  94. 94.

    Chong EY, Huang Y, Wu H, Ghasemzadeh N, Uppal K, Quyyumi AA, et al. Local false discovery rate estimation using feature reliability in LC/MS metabolomics data. Sci Rep. 2015;5:17221.

    CAS  PubMed  PubMed Central  Article  Google Scholar 

  95. 95.

    Sugimoto M, Kawakami M, Robert M, Soga T, Tomita M. Bioinformatics tools for mass spectroscopy-based metabolomic data processing and analysis. Curr Bioinforma. 2012;7:96–108.

  96.

    Benjamini Y, Hochberg Y. Controlling the false discovery rate: a practical and powerful approach to multiple testing. J R Stat Soc B. 1995:289–300.

  97.

    McDonald JH. Handbook of biological statistics. Baltimore: Sparky House Publishing; 2015.

  98.

    Efron B. Size, power and false discovery rates. Ann Stat. 2007;35:1351–77.

  99.

    Efron B. Microarrays, empirical Bayes and the two-groups model. Stat Sci. 2008;23:1–22.

  100.

    Bartel J, Krumsiek J, Theis FJ. Statistical methods for the analysis of high-throughput metabolomics data. Comput Struct Biotechnol J. 2013;4:e201301009

  101.

    Lewis FI, Ward MP. Improving epidemiologic data analyses through multivariate regression modelling. Emerg Themes Epidemiol. 2013;10:2–11.

  102.

    Zapala MA, Schork NJ. Multivariate regression analysis of distance matrices for testing associations between gene expression patterns and related variables. Proc Natl Acad Sci U S A. 2006;103:19430–5.

  103.

    Saccenti E, Hoefsloot HCJ, Smilde AK, Westerhuis JA, Hendriks MMWB. Reflections on univariate and multivariate analysis of metabolomics data. Metabolomics. 2014;10:361–74.

  104.

    Li S, Park Y, Duraisingham S, Strobel FH, Khan N, Soltow QA, et al. Predicting network activity from high throughput metabolomics. PLoS Comput Biol. 2013;9

  105.

    Kaever A, Landesfeind M, Feussner K, Mosblech A, Heilmann I, Morgenstern B, et al. MarVis-pathway: integrative and exploratory pathway analysis of non-targeted metabolomics data. Metabolomics. 2015;11:764–77.

  106.

    Wrzodek C, Eichner J, Büchel F, Zell A. InCroMAP: integrated analysis of cross-platform microarray and pathway data. Bioinformatics. 2013;29:506–8.

  107.

    Sergushichev AA, Loboda AA, Jha AK, Vincent EE, Driggers EM, Jones RG, et al. GAM: a web-service for integrated transcriptional and metabolic network analysis. Nucleic Acids Res. 2016;44:W194–200.

  108.

    Zhang L, McHale CM, Rothman N, Li G, Ji Z, Vermeulen R, et al. Systems biology of human benzene exposure. Chem Biol Interact. 2010;184:86–93.

  109.

    Maitre L, Fthenou E, Athersuch T, Coen M, Toledano MB, Holmes E, et al. Urinary metabolic profiles in early pregnancy are associated with preterm birth and fetal growth restriction in the Rhea mother-child cohort study. BMC Med. 2014;12:1–14.

  110.

    Roede JR, Uppal K, Park Y, Tran VL, Jones DP. Transcriptome-metabolome wide association study (TMWAS) of maneb and paraquat neurotoxicity reveals network level interactions in toxicologic mechanism. Toxicol Rep. 2014;1:435–44.

  111.

    Cribbs SK, Uppal K, Li S, Jones DP, Huang L, Tipton L, et al. Correlation of the lung microbiota with metabolic profiles in bronchoalveolar lavage fluid in HIV infection. Microbiome. 2016;4:1–11.

  112.

    Chandler JD, Hu X, Ko E-J, Park S, Lee Y-T, Orr ML, et al. Metabolic pathways of lung inflammation revealed by high-resolution metabolomics (HRM) of H1N1 influenza virus infection in mice. Am J Physiol Regul Integr Comp Physiol. 2016;ajpregu.00298.2016.

  113.

    Johnson CH, Athersuch TJ, Collman GW, Dhungana S, Grant DF, Jones DP, et al. Yale school of public health symposium on lifetime exposures and human health: the exposome; summary and future reflections. Hum Genomics. 2017;11:32.

  114.

    Wang SY, Kuo CH, Tseng YJ. Batch normalizer: a fast total abundance regression calibration method to simultaneously adjust batch and injection order effects in liquid chromatography/time-of-flight mass spectrometry-based metabolomics data and comparison with current calibration met. Anal Chem. 2013;85:1037–46.

  115.

    Reisetter AC, Muehlbauer MJ, Bain JR, Nodzenski M, Stevens RD, Ilkayeva O, et al. Mixture model normalization for non-targeted gas chromatography/mass spectrometry metabolomics data. BMC Bioinformatics. 2017;18:84.

  116.

    NCBI dbSNP Database - Accessed 6 Nov 2017.

  117.

    Sims D, Sudbery I, Ilott NE, Heger A, Ponting CP. Sequencing depth and coverage: key considerations in genomic analyses. Nat Rev Genet. 2014;15:121–32.

  118.

    Ramasamy A, Mondry A, Holmes CC, Altman DG. Key issues in conducting a meta-analysis of gene expression microarray datasets. PLoS Med. 2008;5:1320–32.

  119.

    Stefano GB. Comparing bioinformatic gene expression profiling methods: microarray and RNA-Seq. Med Sci Monit Basic Res. 2014;20:138–42.

  120.

    Siddiqui AS, Delaney AD, Schnerch A, Griffith OL, Jones SJM, Marra MA. Sequence biases in large scale gene expression profiling data. Nucleic Acids Res. 2006;34:e84.

  121.

    Büscher JM, Czernik D, Ewald JC, Sauer U, Zamboni N. Cross-platform comparison of methods for quantitative metabolomics of primary metabolism. Anal Chem. 2009;81:2135–43.

  122.

    Ivanisevic J, Zhu ZJ, Plate L, Tautenhahn R, Chen S, O’Brien PJ, et al. Toward ‘omic scale metabolite profiling: a dual separation-mass spectrometry approach for coverage of lipid and central carbon metabolism. Anal Chem. 2013;85:6876–84.

  123.

    Klein RJ, Zeiss C, Chew EY, Tsai JY, Sackler RS, Haynes C, et al. Complement factor H polymorphism in age-related macular degeneration. Science. 2005;308:385–9.

  124.

    Panagiotou OA, Willer CJ, Hirschhorn JN, Ioannidis JPA. The power of meta-analysis in genome-wide association studies. Annu Rev Genomics Hum Genet. 2013;14:441–65.

  125.

    Hong EP, Park JW. Sample size and statistical power calculation in genetic association studies. Genomics Inform. 2012;10:117–22.

  126.

    Blaise BJ. Data-driven sample size determination for metabolic phenotyping studies. Anal Chem. 2013;85:8943–50.

  127.

    Van Iterson M, 't Hoen PAC, Pedotti P, Hooiveld GJ, Den Dunnen JT, van Ommen GJ, et al. Relative power and sample size analysis on gene expression profiling data. BMC Genomics. 2009;10:439.

  128.

    Ferreira JA, Zwinderman A. Approximate power and sample size calculations with the Benjamini-Hochberg method. Int J Biostat. 2006;2:1–36.

  129.

    Langaas M, Lindqvist BH, Ferkingstad E. Estimating the proportion of true null hypotheses, with application to DNA microarray data. J R Stat Soc Ser B Stat Methodol. 2005;67:555–72.

  130.

    Blaise BJ, Correia G, Tin A, Young JH, Vergnaud AC, Lewis M, et al. Power analysis and sample size determination in metabolic phenotyping. Anal Chem. 2016;88:5179–88.

  131.

    Jaremek M, Yu Z, Mangino M, Mittelstrass K, Prehn C, Singmann P, et al. Alcohol-induced metabolomic differences in humans. Transl Psychiatry. 2013;3:e276.

  132.

    Homuth G, Teumer A, Völker U, Nauck M. A description of large-scale metabolomics studies: increasing value by combining metabolomics with genome-wide SNP genotyping and transcriptional profiling. J Endocrinol. 2012;215:17–28.

  133.

    Mäkinen V-P, Ala-Korpela M. Metabolomics of aging requires large-scale longitudinal studies with replication. Proc Natl Acad Sci U S A. 2016;113:E3470.

  134.

    Nicholson JK, Holmes E, Elliott P. The metabolome-wide association study: a new look at human disease risk factors. J Proteome Res. 2008;7:3637–8.

  135.

    Chadeau-Hyam M, Ebbels TM, Brown IJ, Chan Q, Stamler J, et al. Metabolic profiling and the metabolome-wide association study: significance level for biomarker identification. J Proteome Res. 2010;9:4620–7.

  136.

    Castagné R, Boulangé CL, Karaman I, Campanella G, Santos Ferreira DL, Kaluarachchi MR, et al. Improving visualization and interpretation of metabolome-wide association studies: an application in a population-based cohort using untargeted 1H NMR metabolic profiling. J Proteome Res. 2017;16:3623–33.

  137.

    Walker DI, Pennell KD, Uppal K, Xia X, Hopke PK, Utell MJ, et al. Pilot metabolome-wide association study of benzo(a)pyrene in serum from military personnel. J Occup Environ Med. 2016;58:S44–52.

  138.

    Bictash M, Ebbels TM, Chan Q, Loo RL, Yap IKS, Brown IJ, et al. Opening up the “black box”: metabolic phenotyping and metabolome-wide association studies in epidemiology. J Clin Epidemiol. 2010;63:970–9.

  139.

    Toledo E, Wang DD, Ruiz-Canela López M, Clish CB, Razquin C, Zheng Y, et al. Plasma lipidomic profiles and cardiovascular events in a randomized intervention trial with the Mediterranean diet. Am J Clin Nutr. 2017;106:973–83.

  140.

    Li XS, Obeid S, Klingenberg R, Gencer B, Mach F, Räber L, et al. Gut microbiota-dependent trimethylamine N-oxide in acute coronary syndromes: a prognostic marker for incident cardiovascular events beyond traditional risk factors. Eur Heart J. 2017;14:814–24.

  141.

    Wang Z, Klipfell E, Bennett BJ, Koeth R, Levison BS, Dugar B, et al. Gut flora metabolism of phosphatidylcholine promotes cardiovascular disease. Nature. 2011;472:57–63.

  142.

    Igari M, Alexander JC, Ji Y, Qi XL, Papke RL, Bruijnzeel AW. Varenicline and cytisine diminish the dysphoric-like state associated with spontaneous nicotine withdrawal in rats. Neuropsychopharmacology. 2014;39:445–55.

  143.

    Renier N, Adams EL, Kirst C, Wu Z, Azevedo R, Kohl J, et al. Mapping of brain activity by automated volume analysis of immediate early genes. Cell. 2016;165:1789–802.

  144.

    Gupta M, Neavin D, Liu D, Biernacka J, Hall-Flavin D, Bobo WV, et al. TSPAN5, ERICH3 and selective serotonin reuptake inhibitors in major depressive disorder: pharmacometabolomics-informed pharmacogenomics. Mol Psychiatry. 2016;21:1717–25.

  145.

    Ji Y, Hebbring S, Zhu H, Jenkins GD, Biernacka J, Snyder K, et al. Glycine and a glycine dehydrogenase (GLDC) SNP as citalopram/escitalopram response biomarkers in depression: pharmacometabolomics-informed pharmacogenomics. Clin Pharmacol Ther. 2011;89:97–104.

  146.

    Kepp O, Loos F, Liu P, Kroemer G. Extracellular nucleosides and nucleotides as immunomodulators. Immunol Rev. 2017;280:83–92.

  147.

    Johnson CH, Spilker ME, Goetz L, Peterson SN, Siuzdak G. Metabolite and microbiome interplay in cancer immunotherapy. Cancer Res. 2016;76:6146–52.

  148.

    Ioannidis JPA, Greenland S, Hlatky MA, Khoury MJ, Macleod MR, Moher D, et al. Increasing value and reducing waste in research design, conduct, and analysis. Lancet. 2014;383:166–75.

  149.

    Iqbal SA, Wallach JD, Khoury MJ, Schully SD, Ioannidis JPA. Reproducible research practices and transparency across the biomedical literature. PLoS Biol. 2016;14:1–13.

  150.

    Siskos AP, Jain P, Römisch-Margl W, Bennett M, Achaintre D, Asad Y, et al. Interlaboratory reproducibility of a targeted metabolomics platform for analysis of human serum and plasma. Anal Chem. 2017;89:656–65.

  151.

    Buesen R, Chorley BN, da Silva Lima B, Daston G, Deferme L, Ebbels T, et al. Applying ‘omics technologies in chemicals risk assessment: report of an ECETOC workshop. Regul Toxicol Pharmacol. 2017:1–11.

  152.

    Kauffmann HM, Kamp H, Fuchs R, Chorley BN, Deferme L, Ebbels T, et al. Framework for the quality assurance of ‘omics technologies considering GLP requirements. Regul Toxicol Pharmacol. 2017;91:1–9.

  153.

    Sud M, Fahy E, Cotter D, Azam K, Vadivelu I, Burant C, et al. Metabolomics workbench: an international repository for metabolomics data and metadata, metabolite standards, protocols, tutorials and training, and analysis tools. Nucleic Acids Res. 2016;44:D463–70.

  154.

    Griffin JL, Nicholls AW, Daykin CA, Heald S, Keun HC, Schuppe-Koistinen I, et al. Standard reporting requirements for biological samples in metabolomics experiments: mammalian/in vivo experiments. Metabolomics. 2007;3:179–88.

  155.

    Sumner LW, Amberg A, Barrett D, Beale MH, Beger R, Daykin CA, et al. Proposed minimum reporting standards for chemical analysis: chemical analysis working group (CAWG) metabolomics standards initiative (MSI). Metabolomics. 2007;3:211–21.

  156.

    Goodacre R, Broadhurst D, Smilde AK, Kristal BS, Baker JD, Beger R, et al. Proposed minimum reporting standards for data analysis in metabolomics. Metabolomics. 2007;3:231–41.

  157.

    Metabolomics workbench - Accessed 17 Jan 2018.

  158.

    Schober D, Jacob D, Wilson M, Cruz JA, Marcu A, Grant JR, et al. nmrML: a community supported open data standard for the description, storage, and exchange of NMR data. Anal Chem. 2017. In-Press

  159.

    Rocca-Serra P, Salek RM, Arita M, Correa E, Dayalan S, Gonzalez-Beltran A, et al. Data standards can boost metabolomics research, and if there is a will, there is a way. Metabolomics. 2016;12:1–13.

  160.

    Weber RJM, Lawson TN, Salek RM, Ebbels TMD, Glen RC, Goodacre R, et al. Computational tools and workflows in metabolomics: an international survey highlights the opportunity for harmonisation through Galaxy. Metabolomics. 2017;13:1–5.

  161.

    Salek RM, Steinbeck C, Viant MR, Goodacre R, Dunn WB. The role of reporting standards for metabolite annotation and identification in metabolomic studies. Gigascience. 2013;2:13.

  162.

    van Rijswijk M, Beirnaert C, Caron C, Cascante M, Dominguez V, Dunn WB, et al. The future of metabolomics in ELIXIR. F1000Research. 2017;6:1649.

  163.

    WHO. 7 million premature deaths annually linked to air pollution - Accessed 17 Jan 2018.

  164.

    Shin SH, Bode AM, Dong Z. Precision medicine: the foundation of future cancer therapeutics. Precis Oncol. 2017;1:12.

  165.

    FDA approves first cancer treatment for any solid tumor with a specific genetic feature - Accessed 17 Jan 2018.

  166.

    Gilbert JA, Quinn RA, Debelius J, Xu ZZ, Morton J, Garg N, et al. Microbiome-wide association studies link dynamic microbial consortia to disease. Nature. 2016;535:94–103.

  167.

    Zou H, Hastie T. Regularization and variable selection via the elastic-net. J R Stat Soc. 2005;67:301–20.



Acknowledgements

Not applicable.


Funding

This work is supported in part by NIH grants EY17963 (VV), AA021724 (VV), and AA022057 (VV) and American Cancer Society (ACS) grant MRSG-15-147-01-CNE (ND).

Availability of data and materials

Not applicable.

Author information




All authors were involved in writing and contributing to the manuscript. All authors read and approved the final manuscript.

Corresponding author

Correspondence to Caroline H. Johnson.

Ethics declarations

Ethics approval and consent to participate

Not applicable.

Consent for publication

Not applicable.

Competing interests

JDW receives research support through Yale University from the Laura and John Arnold Foundation to support the Collaboration for Research Integrity and Transparency.

Publisher’s Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License, which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver applies to the data made available in this article, unless otherwise stated.

About this article

Cite this article

Rattray, N.J.W., Deziel, N.C., Wallach, J.D. et al. Beyond genomics: understanding exposotypes through metabolomics. Hum Genomics 12, 4 (2018).


Keywords

  • Chemometrics
  • Exposome
  • Exposotype
  • Genomics
  • Genetic epidemiology
  • Metabolomics