Study population
The ARIC study is a prospective community-based cohort of 15,792 individuals who were recruited and enrolled between 1987 and 1989 from four US communities (Forsyth County, NC; Jackson, MS; Minneapolis suburbs, MN; Washington County, MD). Details on the ARIC study design and methods have been previously published [38]. During the fifth study visit, between 2011 and 2013, blood samples were collected for quantification of plasma protein and serum metabolite levels. Institutional review boards at each field center approved the study, and written informed consent was obtained from participants at baseline and follow-up visits. All 4046 participants with available proteomic and metabolomic profiling at visit 5 (61.6% of visit 5 participants) were included. The censoring date for follow-up was December 31, 2018.
The AASK study was a trial of 1094 adult African Americans aged 18–70 years with hypertensive chronic kidney disease (measured glomerular filtration rate [mGFR] 20–65 ml/min per 1.73 m2) recruited from 21 clinical centers in the United States. AASK trial enrollment occurred between February 1995 and September 1998, and the trial phase ended in September 2001. All 694 participants with available proteomic and metabolomic profiling at baseline in the trial phase were included in our analysis [39].
Proteomic and metabolomic profiling
ARIC has a uniform blood collection protocol (https://sites.cscc.unc.edu/aric/Cohort_Manuals/Blood_Collection_And_Processing_7.PDF) for serum separator tubes (SSTs) and EDTA tubes across all four sites. EDTA tubes were centrifuged (3000 g for 10 min at 4 °C) and the plasma was frozen. Similarly, AASK has a routine blood collection protocol for SSTs (https://repository.niddk.nih.gov/studies/aask-trial/MOOP/). In ARIC, 5282 plasma proteins were quantified in samples collected at visit 5 using a Slow Off-rate Modified Aptamer-based capture array (SomaScan® platform v4). Similar procedures, using the expanded SomaScan® v4.1 platform, were applied to serum samples from the baseline visit in AASK, resulting in quantification of 7596 serum proteins [39]. For both studies, protein values were log2-transformed to account for skewed raw value distributions, and values beyond 5 SDs on the log2 scale were winsorized. In addition, we excluded proteins if the Bland-Altman coefficient of variation among blind duplicate samples was greater than 0.5 (Fig. 1). The final analysis included only human proteins that were quantified in both cohorts (N = 4616).
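The protein preprocessing described above can be illustrated with a minimal sketch. It assumes that the winsorization bounds are the mean plus or minus 5 sample SDs of the log2 values for one protein and that the duplicate-sample coefficient of variation has already been computed elsewhere; the function names `log2_winsorize` and `keep_protein` are hypothetical and are not taken from the study code.

```python
import math
import statistics


def log2_winsorize(values, n_sd=5.0):
    """Log2-transform positive values, then clip them to mean +/- n_sd SDs.

    Assumption: bounds are computed from all log2 values of one protein,
    including any outliers.
    """
    logged = [math.log2(v) for v in values]
    mu = statistics.mean(logged)
    sd = statistics.stdev(logged)
    lo, hi = mu - n_sd * sd, mu + n_sd * sd
    return [min(max(v, lo), hi) for v in logged]


def keep_protein(duplicate_cv, max_cv=0.5):
    """Retain a protein only if its blind-duplicate CV does not exceed max_cv."""
    return duplicate_cv <= max_cv
```

In small samples a single value can never lie more than 5 sample SDs from the mean, so clipping of this kind only takes effect in cohorts of realistic size.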
Serum metabolite profiling was performed using untargeted mass spectrometry following standard protocols at Metabolon, Inc. (Morrisville, NC) using the SST samples in both studies (HD4 Platform). There were 970 and 820 metabolites of known identity quantified in the ARIC and AASK studies, respectively [40]. Xenobiotics were excluded during preprocessing. Endogenous metabolites with > 80% missing values were excluded. All metabolites were scaled to a median of 1 and log2-transformed, and metabolites with variance < 0.01 on the log2 scale were removed. The final analysis included only metabolites that were quantified in both cohorts (N = 474). Missing data were imputed with minimum values (0.71% of the combined protein and metabolite analysis dataset) and capped at 5 standard deviations above or below the mean (Fig. 1).
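The metabolite steps above can be sketched for a single metabolite as follows. This is an illustration under stated assumptions, not the study pipeline: missing values are represented as `None`, the minimum used for imputation is the per-metabolite minimum of the observed log2 values, and the steps run in the order in which they are listed in the text. The function name `preprocess_metabolite` is hypothetical.

```python
import math
import statistics


def preprocess_metabolite(raw, max_missing=0.80, min_var=0.01, n_sd=5.0):
    """Return cleaned log2 values for one metabolite, or None if it is dropped."""
    observed = [v for v in raw if v is not None]
    # Drop metabolites with too few observed values or > 80% missing.
    if len(observed) < 2 or 1 - len(observed) / len(raw) > max_missing:
        return None
    # Scale to a median of 1, then log2-transform (missing stays missing).
    med = statistics.median(observed)
    logged = [None if v is None else math.log2(v / med) for v in raw]
    obs_log = [v for v in logged if v is not None]
    # Drop near-constant metabolites.
    if statistics.variance(obs_log) < min_var:
        return None
    # Impute missing values with the observed minimum (assumed per metabolite).
    floor = min(obs_log)
    imputed = [floor if v is None else v for v in logged]
    # Cap at n_sd standard deviations above or below the mean.
    mu, sd = statistics.mean(imputed), statistics.stdev(imputed)
    lo, hi = mu - n_sd * sd, mu + n_sd * sd
    return [min(max(v, lo), hi) for v in imputed]
```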
Module formation
Netboost is an unsupervised three-step dimension reduction technique developed in the context of DNA methylation and gene expression data [13]. In brief, first, unrelated variable pairs are filtered such that a sparse correlation-based network can be constructed on the strongest network edges. Second, variables are hierarchically clustered into modules based on the sparse network. Modules form a data-driven partition of all metabolites and proteins included in the analysis. The background module consists of 81 proteins and 12 metabolites that were left without closely related components. Third, module-aggregated measures are quantified using the principal components (PCs) of each module except the background module. In this study, we used Netboost to characterize modules using combined proteomic and metabolomic data, similar to previous applications to mass spectrometry data [41, 42]. The minimal module size was set to two, distance measures were based on Spearman coefficients, and robust PCs were used [13]. Highly correlated preface modules (i.e., modules with correlation of the first PCs greater than 0.9) were merged to further reduce the dimensionality. Three PCs of the modules were exported, or fewer if they already accounted for at least 50% of the module variance.
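The export rule for module PCs can be made concrete with a small sketch. It reflects one reading of the rule, namely that PCs are added in order until their cumulative share of module variance reaches 50% or three PCs have been taken. The function `n_pcs_to_export` is hypothetical and is not part of the Netboost package; the explained-variance fractions are assumed to come from a PCA computed elsewhere.

```python
def n_pcs_to_export(explained, max_pcs=3, target=0.50):
    """Number of leading PCs to keep for one module.

    `explained` lists the fraction of module variance explained by each PC,
    in decreasing order. PCs are added until the cumulative fraction reaches
    `target` or `max_pcs` PCs have been taken, whichever comes first.
    """
    total = 0.0
    for k, frac in enumerate(explained[:max_pcs], start=1):
        total += frac
        if total >= target:
            return k
    # Target not reached: keep as many PCs as allowed and available.
    return min(max_pcs, len(explained))
```

Under this reading, a module whose first PC explains 60% of the variance contributes a single PC, whereas a diffuse module contributes three.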
Characterization of modules and association with mortality
After identifying modules of proteins and metabolites in ARIC using Netboost, we characterized the modules by regressing module PCs on clinical traits. Clinical traits included age, sex, eGFR, ACR, HDL, body mass index (BMI), fasting plasma glucose, total cholesterol, systolic blood pressure, history of cardiovascular disease (CVD), and history of smoking. eGFR was estimated with the CKD-EPI 2009 equation using creatinine and cystatin C.
Next, we evaluated the associations between the module PCs and mortality using Cox proportional hazards models. Analyses were adjusted for age, sex, race-center, eGFR [43], CVD, history of smoking, diabetes, fasting plasma glucose, log2-transformed ACR, systolic blood pressure, antihypertensive medications, HDL, total cholesterol, and BMI. Adjustment for total cholesterol and BMI used linear splines with knots at 200 mg/dL and 25 kg/m2, respectively [44, 45].
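A linear spline with a single knot lets the covariate effect change slope at the knot. One common parameterization, shown here as an assumption because the exact coding used in the analysis is not stated, enters the covariate x together with the hinge term max(x - knot, 0), so that the linear predictor contains b1 * x + b2 * max(x - knot, 0). The helper name `linear_spline_terms` is hypothetical; the knot values are those given in the text.

```python
CHOLESTEROL_KNOT_MG_DL = 200.0  # knot for total cholesterol, from the text
BMI_KNOT_KG_M2 = 25.0           # knot for BMI, from the text


def linear_spline_terms(x, knot):
    """Return the two model terms for a one-knot linear spline.

    The first term carries the slope below the knot; the second (hinge)
    term carries the change in slope above the knot.
    """
    return (x, max(x - knot, 0.0))
```

For example, a BMI of 30 kg/m2 contributes the terms (30, 5), while a BMI of 22 kg/m2 contributes (22, 0), so only the first coefficient applies below the knot.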
Transferability of modules and relevance in a cohort with CKD
We next evaluated whether module membership transferred to a separate cohort of patients with CKD. To do this, module memberships and PC loadings developed in the ARIC cohort were applied to the AASK cohort. Cross-sectional regression models with the same clinical traits were used to characterize the modules, and the results were compared with those in ARIC. To account for the AASK study design, in which participants were selected based on mGFR 20–65 ml/min per 1.73 m2, we additionally calculated correlations with age residuals from a regression on GFR.
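The two operations in this paragraph, applying fixed loadings to a new cohort and residualizing age on GFR, can be sketched as follows. This is a simplified illustration: it assumes that a module score is a weighted sum of centered analyte values, that the centering constants are supplied by the caller, and that the age residuals come from a simple least-squares regression of age on GFR. The function names `module_scores` and `residualize` are hypothetical.

```python
def module_scores(rows, loadings, centers):
    """Project each participant (one row of analyte values) onto fixed loadings.

    `loadings` and `centers` are assumed to come from the discovery cohort
    and are applied unchanged to the new cohort.
    """
    return [
        sum(w * (x - c) for w, x, c in zip(loadings, row, centers))
        for row in rows
    ]


def residualize(y, x):
    """Residuals of y from a simple least-squares regression of y on x.

    Assumes x is not constant.
    """
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((xi - mx) ** 2 for xi in x)
    slope = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y)) / sxx
    intercept = my - slope * mx
    return [yi - (intercept + slope * xi) for xi, yi in zip(x, y)]
```

Calling `residualize(age, gfr)` gives the part of age that is not linearly explained by GFR, which can then be correlated with the module scores.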
As in ARIC, a Cox proportional hazards model was used to test for associations between the module PCs and mortality. Only those modules that had a statistically significant association with mortality in ARIC were tested in AASK. In AASK, model covariates included age, sex, mGFR, CVD, history of smoking, fasting plasma glucose, log2-transformed 24-h urine protein levels, systolic blood pressure, HDL, total cholesterol, and BMI. Again, adjustment for total cholesterol and BMI used linear splines with knots at 200 mg/dL and 25 kg/m2, respectively [44, 45].
Both ARIC and AASK study analyses accounted for multiple testing by a Bonferroni adjustment for the number of analyses (P-value < 0.05/371 and P-value < 0.05/64, respectively).
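For reference, the per-test significance thresholds implied by these Bonferroni adjustments can be computed directly. The helper name `bonferroni_threshold` is hypothetical; the test counts (371 and 64) are those stated above.

```python
def bonferroni_threshold(alpha, n_tests):
    """Per-test P-value threshold under a Bonferroni correction."""
    return alpha / n_tests


aric_threshold = bonferroni_threshold(0.05, 371)  # approximately 1.35e-4
aask_threshold = bonferroni_threshold(0.05, 64)   # approximately 7.81e-4
```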