Open Access

Application of pooled genotyping to scan candidate regions for association with HDL cholesterol levels

  • David A Hinds1,
  • Albert B Seymour2,
  • L Kathryn Durham3,
  • Poulabi Banerjee2,
  • Dennis G Ballinger1,
  • Patrice M Milos2,
  • David R Cox1,
  • John F Thompson2 and
  • Kelly A Frazer1
Human Genomics 2004, 1:421

DOI: 10.1186/1479-7364-1-6-421

Received: 10 August 2004

Accepted: 10 August 2004

Published: 1 November 2004


Association studies are used to identify genetic determinants of complex human traits of medical interest. With the large number of validated single nucleotide polymorphisms (SNPs) currently available, two limiting factors in association studies are genotyping capability and costs. Pooled DNA genotyping has been proposed as an efficient means of screening SNPs for allele frequency differences in case-control studies and for prioritising them for subsequent individual genotyping analysis. Here, we apply quantitative pooled genotyping followed by individual genotyping and replication to identify associations with human serum high-density lipoprotein (HDL) cholesterol levels. The DNA from individuals with low and high HDL cholesterol levels was pooled separately, each pool was amplified by polymerase chain reaction in triplicate and each amplified product was separately hybridised to a high-density oligonucleotide array. Allele frequency differences between case and control groups with low and high HDL cholesterol levels were estimated for 7,283 SNPs distributed across 71 candidate gene regions spanning a total of 17.1 megabases. A novel method was developed to take advantage of independently derived haplotype map information to improve the pooled estimates of allele frequency differences. A subset of SNPs with the largest estimated allele frequency differences between low and high HDL cholesterol groups was chosen for individual genotyping in the study population, as well as in a separate replication population. Four SNPs in a single haplotype block within the cholesteryl ester transfer protein (CETP) gene interval were significantly associated with HDL cholesterol levels in both populations. Our study is among the first to demonstrate the application of pooled genotyping followed by confirmation with individual genotyping to identify genetic determinants of a complex trait.


Keywords: association study, HDL cholesterol, CETP, SNPs, pooled genotyping, haplotypes


Association studies are widely viewed as one of the most promising methods for identifying the genetic determinants of human phenotypic traits of medical interest, such as common diseases and individual responses to the drugs used to treat those diseases [1]. Therefore, a considerable amount of research has been focused on developing methodologies that efficiently screen candidate gene regions or whole genomes for associations of complex phenotypes with genetic markers, such as single nucleotide polymorphisms (SNPs). The methodology relies on having a set of common genetic markers at a sufficiently dense coverage across the genome, such that either the causal variant itself or a marker in linkage disequilibrium with the causal variant will be tested in the association study. Thus, to be comprehensive and reproducible, a whole genome scan study requires the assay of hundreds of thousands of densely spaced SNP markers in a large number of samples. There is a considerable body of experimental [2-6] and theoretical [7-9] work that suggests genotyping of pools consisting of DNA from many individuals is a viable alternative to individual genotyping. Pooled assays replace many measurements of individual samples with a few measurements of a pooled sample, with a corresponding reduction in cost, time and labour. Here, we describe one of the first large-scale SNP association studies in which this methodology has been applied and validated.

Human population studies have shown that serum high density lipoprotein (HDL) cholesterol concentrations are inversely correlated with the development of premature coronary heart disease [10]. In this report, we describe a two-stage study to identify genetic markers associated with HDL cholesterol levels. First, we use a pooled genotyping screen to identify SNPs likely to have large frequency differences between low and high HDL cholesterol groups. Starting from 7,283 SNPs distributed across 71 candidate regions, we use the pooled data to select about 300 SNPs with the strongest evidence for association. We then individually genotype these SNPs, to confirm their allele frequency differences in the low and high HDL cholesterol individuals in the study group. We confirm associations identified in the study population by individually genotyping SNPs with significant allele frequency differences in a replicate population.

We also describe a novel method for using independently derived haplotype map data to improve the power of an association study based on pooled genotyping. Using genotype data from a separate set of ethnically diverse individuals, we determine haplotype blocks and sets of common haplotype patterns that together account for most of the variation in a given genomic interval. From pooled genotype data, we estimate frequency differences for these common patterns between case and control groups. These pattern differences enable us to make more accurate estimates of the individual SNP frequency differences that exploit redundancies in the haplotype map, thereby reducing experimental error in the individual SNP measurements.

Materials and methods

SNP discovery and haplotype map construction

In an independent, previously described study, a genome-wide SNP collection was obtained using high-density oligonucleotide array-based resequencing [11]. Briefly, we generated somatic cell hybrids by fusing lymphoblast cell lines from the Coriell Polymorphism Discovery Resource [12] with a hamster cell line to form between 20 and 50 haploid somatic cell hybrids for each human chromosome. DNA was isolated and amplified by long-range polymerase chain reaction (PCR), and the PCR products were fragmented, labelled and hybridised to a series of SNP discovery arrays. These arrays were designed such that each base of the reference sequence was queried by eight 25-mer probes. We identified SNPs from the resulting fluorescence intensity data using a pattern recognition algorithm.

We used a dynamic programming algorithm [13] to partition these haploid SNP discovery data into haplotype blocks. SNPs having minor allele frequencies of at least 10 per cent in the SNP discovery data were included in the map. We required all blocks to satisfy the condition that at least 80 per cent of the haploid samples could be assigned to common haplotype patterns having greater than 10 per cent frequency. For a block having N common haplotype patterns, we also required at least N - 1 patterns to have tagging SNPs that distinguished each of those patterns from all of the others.
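The two block acceptance conditions can be illustrated with a short check. This is only a sketch under simplifying assumptions: haplotypes are complete strings with one character per SNP, a sample counts as assigned only when it matches a common pattern exactly, and the function name `block_is_acceptable` is hypothetical. It does not reproduce the dynamic programming partition itself.

```python
from collections import Counter


def block_is_acceptable(haplotypes, min_pattern_freq=0.10, min_coverage=0.80):
    """Check the two block conditions on a list of equal-length haplotype strings."""
    n = len(haplotypes)
    counts = Counter(haplotypes)
    # Common patterns: seen in more than 10 per cent of the haploid samples.
    common = [h for h, c in counts.items() if c / n > min_pattern_freq]
    if not common:
        return False
    # Condition 1: common patterns must account for at least 80 per cent of samples.
    covered = sum(counts[h] for h in common) / n
    if covered < min_coverage:
        return False

    # Condition 2: a pattern is tagged if, at some SNP, its allele differs
    # from the allele carried by every other common pattern.
    def is_tagged(pattern):
        others = [q for q in common if q != pattern]
        return any(all(q[i] != pattern[i] for q in others) for i in range(len(pattern)))

    n_tagged = sum(is_tagged(p) for p in common)
    return n_tagged >= len(common) - 1


example = ["AAC"] * 5 + ["GAT"] * 3 + ["GCT"] * 2
print(block_is_acceptable(example))
```

In the example, two of the three common patterns have a tagging SNP, which is enough to satisfy the N - 1 requirement.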

Sample selection

The study population was derived from a cohort of individuals (self-reported Caucasian) from the ACCESS study [14], which was made up of males, postmenopausal females and premenopausal females that either had, or were at risk for, cardiovascular disease. Whole blood from subjects participating in this study was obtained in accordance with the Declaration of Helsinki (2000) of the World Medical Association, in addition to appropriate informed consent documentation defining the study design and providing an assessment of the risks and benefits associated with study participation. Individuals with high and low HDL cholesterol levels were selected as the top and bottom 15 per cent of the continuous HDL distribution from each group, resulting in the following samples: 166 high HDL (≥ 54.9 mg/dl) and 182 low HDL (≤ 36.1 mg/dl) males; 140 high HDL (≥ 64.0 mg/dl) and 142 low HDL (≤ 47.3 mg/dl) postmenopausal females; and 17 high HDL (≥ 67.4 mg/dl) and 24 low HDL (≤ 42.2 mg/dl) premenopausal females. HDL cholesterol was measured in fasting samples from four preclinical visits; all DNA samples were collected at baseline (ie without drug treatment). In this population, the interaction between age and HDL did not warrant an adjustment for age in the selection of cases and controls.

The replicate population consisted of 83 low HDL and 78 high HDL samples from postmenopausal women (self-reported Caucasians). These samples represented the 25 per cent tails from both ends of the continuous HDL distribution of an independent cohort from an osteoporosis study (the cohort was not selected on the basis of their HDL cholesterol levels or other cardiovascular risk factors), with the high HDL cut-off at 62 mg/dl, the low HDL cut-off at 42 mg/dl and a mean age of 54.4 years.

Construction of DNA pools

We constructed four DNA pools for estimation of SNP allele frequency differences between the low and high HDL cholesterol groups. Five of the 671 samples of the study population were excluded from pooled genotyping due to insufficient amount of DNA or failed normalisation. After removal of these samples, there were 345 low HDL samples and 321 high HDL samples remaining. The low HDL cholesterol samples were randomly split into two subgroups and used to construct pool A (consisting of 173 individuals) and pool B (consisting of 172 individuals). Likewise, the high HDL cholesterol samples were randomly split into two subgroups and used to construct pool C (consisting of 161 individuals) and pool D (consisting of 160 individuals).

Genomic DNA was extracted from whole blood using the PureGene DNA isolation system (Gentra) per manufacturer's protocol. DNA samples were quantified using a PicoGreen assay kit (Molecular Probes) and SpectraFluor Plus Tecan plate reader according to the manufacturers' instructions, and then diluted to a standard concentration using a Packard Multi-Probe Robot. Equimolar aliquots of DNA were transferred into one of four pool tubes using a Packard MultiProbe robot. Each pool was then requantified by PicoGreen assay and the pools diluted to 20 ng/μl for use as a PCR template.

SNP selection for pooled genotyping

We selected 71 gene targets based on a variety of criteria. One gene, the cholesteryl ester transfer protein (CETP), had been previously shown to be associated with HDL cholesterol [10, 15, 16] and served as a positive control. The remaining 70 candidate genes (Table 1; see also supplementary Table 1, posted online with the other supplementary tables) were either known or suspected to be involved in lipid metabolism. We did not include some genes previously shown to be associated with HDL cholesterol levels because our goal was to identify novel associations; for example, we did not include hepatic triglyceride lipase (LIPC), lipoprotein lipase (LPL), low-density lipoprotein cholesterol receptor (LDLR) or ATP-binding cassette transporter A1 (ABCA1) [17]. For the 70 candidate genes, we selected SNPs within the genomic DNA sequences encoding the transcripts, as well as 80 kilobases (kb) upstream and downstream of each transcript. We examined a larger region, spanning 1.5 megabases (Mb) upstream and downstream of CETP. The targeted 17.1 Mb of DNA sequence included 50 partial and 180 complete transcripts in addition to the 71 selected candidates, based on the National Center for Biotechnology Information Build 30 (see supplementary Table 2). We identified 7,283 SNP markers in these regions, at an average density of one SNP every 2.3 kb. Of these, 112 were in transcribed sequences of the 71 candidate genes, 180 were in the transcribed sequences of the 230 non-candidate genes in the intervals examined and 72 represented amino acid changes (supplementary Table 2). More than 50 per cent of the 17.1 Mb is covered by inter-SNP intervals of 10 kb or less and more than 80 per cent is in inter-SNP intervals of less than 50 kb. The 71 selected intervals contain 955 haplotype blocks, having an average of about six common SNPs and three common haplotype patterns per block.
Table 1

Seventy candidate genes analysed in the association study. (Table body not available.)

High-density oligonucleotide arrays

High-density oligonucleotide arrays were designed so that each SNP would be interrogated by 80 25-mer oligonucleotide probes synthesised on a glass substrate. These 80 features consisted of four sets of 20 features, corresponding to reference and alternate alleles for forward and reverse strands. A set of 20 features consisted of five sets of four probes, with offsets of -2, -1, 0, +1 and +2 bases between the centre of the 25-mer probe and the SNP position. For each offset, we tiled features for each of four nucleotides substituted for the centre position of the 25-mer probe; thus, at each offset we had one perfect-match feature and three mismatch features for the corresponding SNP allele (Figure 1).
Figure 1

Genotyping single nucleotide polymorphisms (SNPs) using high-density oligonucleotide arrays. Each SNP is queried by 80 25-mer oligonucleotides synthesised on a glass substrate. The ten oligonucleotides shown are perfect-match probes for the reference (R) and alternate (A) alleles at five offsets on the forward strand sequence relative to the SNP (-2, -1, 0, +1, +2). Not shown are additional mismatch probes where the middle positions of the probes shown are replaced by the three alternate nucleotides, and an equivalent set of probes for the reverse strand.

Determination of pooled allele frequency estimates

For pooled genotyping, 7.25 ng genomic DNA (pooled samples) was amplified using long-range PCR reactions, pooled, labelled, hybridised to high-density arrays, stained and detected as described [11]. The four DNA pools (low HDL pools A and B and high HDL pools C and D) were each amplified by PCR using 1,222 long-range primer pairs in three replicates. The 12 sets of PCR products were hybridised to separate chips.

The fluorescence intensities of the reference and alternate perfect-match features on an array were correlated with the concentration of the corresponding SNP allele in the DNA sample. Our estimates of allele frequency, $\hat{p}$, were computed from ratios of trimmed means of intensities of the perfect-match features after subtracting a measure of background computed from trimmed means of intensities of mismatch features:

$$\hat{p} = \frac{\tilde{I}_{PM,Ref} - \tilde{I}_{MM}}{(\tilde{I}_{PM,Ref} - \tilde{I}_{MM}) + (\tilde{I}_{PM,Alt} - \tilde{I}_{MM})}$$

$$\tilde{I}_{MM} = \frac{1}{4}\left(\tilde{I}_{MM,Ref,Fwd} + \tilde{I}_{MM,Ref,Rev} + \tilde{I}_{MM,Alt,Fwd} + \tilde{I}_{MM,Alt,Rev}\right)$$

$$\tilde{I}_{PM,Ref} = \frac{1}{2}\left(\tilde{I}_{PM,Ref,Fwd} + \tilde{I}_{PM,Ref,Rev}\right)$$

$$\tilde{I}_{PM,Alt} = \frac{1}{2}\left(\tilde{I}_{PM,Alt,Fwd} + \tilde{I}_{PM,Alt,Rev}\right)$$

The $\tilde{I}$ terms denote trimmed mean intensities for a set of features denoted by the subscript. The trimmed means are arithmetic means of the intensity measurements after discarding the highest and lowest 25 per cent of values. In cases where this did not yield an integer number of terms, one more term was included and the smallest and the largest terms received half weight. Each set of 20 features contributed five perfect-match measurements for one allele, one perfect-match measurement for the other allele at offset 0, and 14 mismatch measurements. Thus, for example, there were six perfect-match features for the reference allele on the forward strand, and:

$$\tilde{I}_{PM,Ref,Fwd} = \frac{1}{3}\left(\frac{1}{2} I_{PM,Ref,Fwd,2} + I_{PM,Ref,Fwd,3} + I_{PM,Ref,Fwd,4} + \frac{1}{2} I_{PM,Ref,Fwd,5}\right)$$

where the numeric subscripts denote positions in the list of six sorted intensities.
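The trimmed mean and the background-corrected frequency estimate can be sketched in a few lines. This is an illustration rather than the authors' code: the function names and the dictionary layout keyed by (allele, strand) are assumptions, and the fractional weighting used here is one way to generalise the half-weight rule (it reproduces the half weights for the six-value case described above).

```python
def trimmed_mean(values, trim=0.25):
    """Mean after discarding the top and bottom `trim` fraction of values.

    When the cut falls inside a value, that value gets a fractional weight,
    which gives half weights for six values (cut at 1.5 from each end).
    """
    xs = sorted(values)
    n = len(xs)
    cut = n * trim
    # Weight of sorted item k = overlap of [k, k + 1] with the kept range [cut, n - cut].
    weights = [max(0.0, min(k + 1, n - cut) - max(k, cut)) for k in range(n)]
    return sum(w * x for w, x in zip(weights, xs)) / sum(weights)


def estimate_p_hat(pm, mm):
    """pm and mm map (allele, strand) to lists of raw feature intensities."""
    background = sum(
        trimmed_mean(mm[allele, strand])
        for allele in ("ref", "alt")
        for strand in ("fwd", "rev")
    ) / 4
    ref = (trimmed_mean(pm["ref", "fwd"]) + trimmed_mean(pm["ref", "rev"])) / 2 - background
    alt = (trimmed_mean(pm["alt", "fwd"]) + trimmed_mean(pm["alt", "rev"])) / 2 - background
    return ref / (ref + alt)


print(trimmed_mean([1, 2, 3, 4, 5, 6]))
```

For six sorted values the weights are 0, 1/2, 1, 1, 1/2 and 0, matching the worked expression for the forward-strand reference allele.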

Two quality control metrics were used to assess the reliability of the intensities for a SNP in an array scan. The first metric, 'conformance', measured the presence of specific target DNA for that SNP. The second metric, signal to background ratio, measured the relative amounts of specific and non-specific binding. Cut-offs were applied to both metrics, and SNP feature sets that failed either metric were discarded from further analysis.

Conformance was computed independently for the reference and alternate allele feature sets, and the maximum of the two values was taken. The conformance for a particular allele was defined as the fraction of feature sets for which the perfect-match feature was brighter than all three mismatch features. In the 80-feature SNP tiling, each allele had ten such sets of four features. SNP measurements having conformance < 0.9 were discarded from further evaluations.

The signal to background ratio was calculated from intensity measurements for both alleles, for the perfect-match versus mismatch features, as:
$$\text{signal} = \sqrt{\tilde{I}_{PM,Ref}^{2} + \tilde{I}_{PM,Alt}^{2}}$$

$$\text{background} = \sqrt{\tilde{I}_{MM,Ref}^{2} + \tilde{I}_{MM,Alt}^{2}}$$

The trimmed mean intensities for perfect-match and mismatch feature sets were obtained as described above. SNP measurements having signal/background < 1.5 were discarded from further evaluations.
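The two quality filters can be sketched as follows. The function names and input layout are hypothetical, and the code assumes that signal and background each combine the two alleles as a root sum of squares of trimmed mean intensities; that reading of the formula is an assumption of this sketch.

```python
import math


def conformance(feature_sets):
    """feature_sets: list of (perfect_match, [mm1, mm2, mm3]) intensities for one allele."""
    hits = sum(1 for pm, mismatches in feature_sets if pm > max(mismatches))
    return hits / len(feature_sets)


def passes_qc(ref_sets, alt_sets, pm_ref, pm_alt, mm_ref, mm_alt,
              min_conformance=0.9, min_ratio=1.5):
    """Apply both cut-offs; pm_* and mm_* are trimmed mean intensities."""
    # Conformance is the larger of the per-allele values.
    conf = max(conformance(ref_sets), conformance(alt_sets))
    # Assumed form: root sum of squares over the two alleles.
    signal = math.hypot(pm_ref, pm_alt)
    background = math.hypot(mm_ref, mm_alt)
    return conf >= min_conformance and signal / background >= min_ratio


ref_sets = [(100, [10, 20, 30])] * 9 + [(5, [10, 20, 30])]
alt_sets = [(5, [10, 20, 30])] * 10
print(passes_qc(ref_sets, alt_sets, 300, 400, 30, 40))
```

Taking the maximum conformance over the two alleles means that a SNP that is nearly fixed for one allele in the pool is not rejected merely because the other allele gives no specific signal.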

For each SNP, we obtained a total of 12 allele frequency estimates, $\hat{p}$, as three independent measurements for each of the four DNA pools. Estimated allele frequency differences, $\Delta\hat{p}$, between low and high HDL groups were determined from averages of the replicates for each pool:

$$\Delta\hat{p} = \frac{1}{2}\left(\hat{p}_A + \hat{p}_B\right) - \frac{1}{2}\left(\hat{p}_C + \hat{p}_D\right)$$
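As a small worked sketch of this calculation, the function below averages the usable replicates within each pool and then takes the difference between the low HDL pools (A, B) and the high HDL pools (C, D). The function name and the dictionary input are assumptions made for illustration, as is the choice to return `None` when a pool has no usable replicate.

```python
from statistics import mean


def delta_p_hat(replicates):
    """replicates maps pool name ('A'..'D') to its list of usable p-hat values."""
    if any(len(replicates[pool]) == 0 for pool in "ABCD"):
        return None  # no estimate if every replicate failed for some pool
    pool_means = {pool: mean(replicates[pool]) for pool in "ABCD"}
    low = (pool_means["A"] + pool_means["B"]) / 2   # low HDL pools
    high = (pool_means["C"] + pool_means["D"]) / 2  # high HDL pools
    return low - high


example = {
    "A": [0.30, 0.32, 0.34],
    "B": [0.36, 0.36, 0.36],
    "C": [0.28, 0.30, 0.32],
    "D": [0.30, 0.30],  # one replicate failed quality control
}
print(delta_p_hat(example))
```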

Haplotype block fitting algorithm

In order to limit the number of SNPs requiring subsequent genotyping in individual samples, we developed an analysis method that used our independently derived haplotype map information to refine estimates of SNP allele frequency differences between pooled DNA samples in case-control studies. The method exploits the fact that within a haplotype block, most of the variation in SNP allele frequencies can be accounted for by variation in the frequencies of a relatively small set of common haplotype patterns -- defined as patterns present at a frequency of at least 10 per cent in the ethnically diverse population used for SNP discovery. Within a block, the sum of differences in these pattern frequencies between two groups should be approximately 0, to the extent that those patterns in the haplotype map accurately represent the total genetic diversity of that interval (Figure 2).
Figure 2

Haplotype block-fitting analysis. Starting from estimated allele frequency differences for each individual single nucleotide polymorphism (SNP) from pooled genotyping, we use linear regression to solve for frequency differences of the underlying common haplotype patterns. In this example, we show a hypothetical haplotype block consisting of six SNPs and three common haplotypes. Measured frequency differences are shown for the haplotype tagging alleles for each SNP, which are also indicated by boxes in the haplotype patterns. In this example, we are estimating two free parameters from six SNP measurements, since the three pattern differences are constrained to sum to 0. Thus, these pattern differences should have lower variance than the individual SNP measurements. From the pattern differences, we are able to estimate the true allele frequency differences for each SNP more accurately.

The method uses linear regression to determine these underlying haplotype pattern frequency differences, given a set of estimated SNP allele frequency differences for a haplotype block. Our method for haplotype map construction guarantees that in every block, there are at least enough SNPs to determine the frequencies of the common haplotype patterns. Most SNPs are in blocks that contain additional redundant SNPs, so if measurement errors are uncorrelated, regression should yield estimates that are more accurate than the original SNP measurements. From the fitted pattern differences, more accurate estimates of the true allele frequency differences for individual SNPs can then be determined.

Let $\Delta\hat{p}_i$ be the estimated frequency difference of the 'reference' alleles for SNP $i$ within a haplotype block, and let $\Delta f_j$ be the (unknown) frequency difference of common haplotype pattern $j = 1 \ldots N$. Our model proposes that:

$$\Delta\hat{p}_i = \sum_{j=1}^{N} m_{ij}\,\Delta f_j + \varepsilon$$

where $m_{ij}$ is a coefficient that takes a value of +0.5 if the allele at position $i$ in pattern $j$ matches the reference allele and -0.5 if it matches the alternate allele for that SNP. The reason for the 0.5 factor is that the frequency difference for an allele would otherwise be double counted when differences for the complete set of patterns are evaluated. We further require that the pattern frequency differences must sum to 0; this constraint can be folded into the previous equation by eliminating $\Delta f_N$ and defining $r_{ij} = m_{ij} - m_{iN}$ to obtain:

$$\Delta\hat{p}_i = \sum_{j=1}^{N-1} r_{ij}\,\Delta f_j + \varepsilon$$

Solving these equations given $\Delta\hat{p}_i$ and $r_{ij}$ is a linear regression problem. Standard regression statistics (R² and the P value for an F test) can be used to judge the quality of the fit of the SNP data to the haplotype pattern information. Deviations from a perfect fit can arise both from experimental errors and inaccuracies in the haplotype model. In instances where the quality of the fit to the haplotype map is good, the fitted allele frequency differences should have lower variance than the raw data for individual SNPs because they incorporate information about the expected correlations between SNPs.
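The constrained regression can be sketched with ordinary least squares. The example below is a minimal illustration using NumPy, not the authors' implementation: the function name `fit_block`, the string encoding of haplotype patterns and the toy block of six SNPs and three patterns are all assumptions. It builds the ±0.5 coefficients, eliminates the last pattern so that the pattern differences sum to 0, and returns both the fitted pattern differences and the fitted per-SNP differences.

```python
import numpy as np


def fit_block(patterns, ref_alleles, delta_p):
    """Least-squares fit of pattern frequency differences within one block.

    patterns: list of N common haplotype strings (one character per SNP)
    ref_alleles: reference allele for each SNP
    delta_p: measured allele frequency difference for each SNP
    """
    n_snps, n_pat = len(ref_alleles), len(patterns)
    # m[i, j] = +0.5 if pattern j carries the reference allele at SNP i, else -0.5.
    m = np.array([[0.5 if patterns[j][i] == ref_alleles[i] else -0.5
                   for j in range(n_pat)] for i in range(n_snps)])
    # Fold in the sum-to-zero constraint by eliminating the last pattern.
    r = m[:, :-1] - m[:, [-1]]
    y = np.asarray(delta_p, dtype=float)
    coef, *_ = np.linalg.lstsq(r, y, rcond=None)
    delta_f = np.append(coef, -coef.sum())  # restore the eliminated pattern
    fitted = r @ coef                       # fitted per-SNP differences
    return delta_f, fitted


patterns = ["AABBBB", "BBAABB", "BBBBAA"]
measured = [0.05, 0.03, -0.01, -0.01, -0.02, -0.04]
delta_f, fitted = fit_block(patterns, "AAAAAA", measured)
print(delta_f, fitted)
```

In this toy block, pairs of SNPs tag the same pattern, so the fitted value for each SNP pools the information from its redundant partner, which is the source of the variance reduction described above.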

A similar method could be used to estimate haplotype pattern frequencies in each pooled sample, with a constraint that the frequencies of common patterns add up to 1. We chose to work in the space of allele frequency differences for several reasons. The frequency differences are the quantities we are ultimately interested in, and it seemed most parsimonious to evaluate a fit for these differences directly, rather than performing separate fits on frequencies in each pool and then combining these to obtain differences. Also, the quality of a fit on absolute frequencies would be sensitive to the presence of rare haplotypes not included in the model, even under the null hypothesis of no pool differences. Our constraint on frequency differences summing to 0 only implies that the proportion of rare haplotypes in case and control pools is similar. Finally, due to experimental differences in SNP hybridisation characteristics, we have more confidence in our ability to detect pool differences than to obtain unbiased estimates of absolute allele frequencies.

Determination of individual genotypes

For individual genotyping by high-density oligonucleotide arrays, samples were amplified by short-range multiplex PCR, labelled, hybridised to the arrays, stained and detected as described [18].

The individual genotypes for an SNP were determined by clustering measurements from multiple scans in the two-dimensional space defined by reference and alternate perfect-match trimmed mean intensities. Trimmed mean intensities were computed as described above. We used a K-means algorithm to assign $\hat{p}$ measurements to clusters representing distinct diploid genotypes. Instead of estimating the background intensity term $\tilde{I}_{MM}$ from a single scan, we determined an optimal value for each SNP that minimised the variance in $\hat{p}$ within the assigned genotype clusters. The K-means and background optimisation steps were iterated until cluster membership and background estimates converged. To determine the appropriate number of genotype clusters, we repeated the analysis for one, two and three clusters and selected the most likely solution, considering likelihoods of the data and the cluster parameters. The data likelihood was determined using a normal mixture model for the distribution of $\hat{p}$ around the cluster means. The model likelihood was calculated using a prior distribution of expected cluster positions (ie homozygous reference allele near $\hat{p} = 1.0$, heterozygote near $\hat{p} = 0.5$ and homozygous alternate allele near $\hat{p} = 0.0$).

For individual genotyping by template-directed dye-terminator incorporation with fluorescence-polarisation detection (FP-TDI) [19], samples were amplified by PCR, primer extension was performed using AcycloPrime FP SNP detection kit (Perkin Elmer Life Sciences) and changes in fluorescence polarisation were measured using Analyst HT (LJL Biosystems) as described [16].


Population stratification analysis

In an association study, systematic differences in ancestry between case and control groups can produce large numbers of statistically significant but spurious associations [20, 21]. We examined the 348 individuals with low HDL levels and the 329 individuals with high HDL levels in the study population to ensure that they were adequately matched prior to constructing DNA pools. We individually genotyped the samples for 300 SNPs that are genetically unlinked and uniformly spaced across the genome, as described previously [18].

In χ² tests for association with the HDL cholesterol phenotype, we observed a small excess of moderate p values. For 280 SNPs giving high-quality genotype data, 43 had p < 0.1 versus 28 expected. A sensitive global test for population structure based on the sum of χ² statistics [22] was significant (p < 0.001); however, a permutation analysis of the genotype data indicated that the expected increase in variance of allele frequency measurements due to stratification of this magnitude was less than 1 per cent. We also analysed the genotype data for population structure using the structure program [23]. The structure program uses a model-based clustering method for identifying subpopulations such that, within a cluster, all markers are in Hardy-Weinberg and linkage equilibrium. This analysis did not show convincing evidence for more than one subpopulation. In runs with between two and five assumed clusters, most samples were assigned similar admixture proportions in each predicted subpopulation; for two clusters, 75 per cent of samples were given admixture proportions between 0.4 and 0.6. Based on these results, and the limited accuracy of pooled genotyping assays, we judged that the low and high HDL cholesterol groups were adequately matched.
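For readers who want to reproduce this kind of per-SNP test, a minimal sketch is given below. It assumes an allelic 2 × 2 test (reference versus alternate allele counts in the low and high HDL groups) without continuity correction; the text does not state which form of the χ² test was used, and the function name is hypothetical. For one degree of freedom the p value has the closed form erfc(√(χ²/2)).

```python
import math


def allele_chi_square(low_ref, low_alt, high_ref, high_alt):
    """Chi-square test (1 df) on a 2 x 2 table of allele counts."""
    table = [[low_ref, low_alt], [high_ref, high_alt]]
    total = sum(map(sum, table))
    row = [sum(r) for r in table]
    col = [table[0][k] + table[1][k] for k in range(2)]
    chi2 = 0.0
    for i in range(2):
        for k in range(2):
            expected = row[i] * col[k] / total
            chi2 += (table[i][k] - expected) ** 2 / expected
    # Survival function of the chi-square distribution with 1 degree of freedom.
    p_value = math.erfc(math.sqrt(chi2 / 2))
    return chi2, p_value


chi2, p_value = allele_chi_square(60, 40, 40, 60)
print(chi2, p_value)
```

Under the null hypothesis, p values are uniformly distributed, so about 280 × 0.1 = 28 of the 280 SNPs would be expected to show p < 0.1, which is the baseline quoted above.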

Pooled genotyping results

For each SNP, we estimated an allele frequency difference, $\Delta\hat{p}$, between the low HDL cholesterol and high HDL cholesterol pools. We then excluded a small proportion of the pooled data due to spurious experimental errors, such as saturated features or inconsistent hybridisation patterns. We also excluded SNPs where all three measurements for any one of the four pools failed and SNPs where the standard error of $\Delta\hat{p}$ exceeded 0.025. Of the 7,283 SNPs tiled on the array, 6,611 (91 per cent) passed all of these data quality filters.

Haplotype block fitting analysis

Of the 6,611 SNPs for which we obtained good pooled genotyping data, 4,387 SNPs were included in the haplotype map. Table 2 shows the results of the haplotype block fitting analysis for these SNPs; the results for all blocks, the subset of blocks that are informative (those that contain redundant SNP information) and the subset of these that had p < 0.05 in an F test for agreement of the $\Delta\hat{p}$ with the haplotype model for that block are shown. Good fits should only be possible for blocks that have real allele frequency differences between the low and high HDL cholesterol pools, either due to sampling variation or association with the phenotype. Thus, we would expect most blocks to have poor p values, due to a lack of significant allele frequency differences. In fact, more than 40 per cent of the 4,387 SNPs are in blocks with good agreement between $\Delta\hat{p}$ and the haplotype model, and these tend to be the larger blocks. Uninformative blocks often contain just one or two SNPs and while they represent a large fraction of all blocks, they represent a much smaller proportion of SNPs and base pairs covered. Here, informative blocks represented 53 per cent of all blocks, but included 86 per cent of SNPs in the haplotype map and about 75 per cent of the DNA sequence.
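Table 2

Haplotype block-fitting results and analysis of variance.

Columns: All blocks | Informative blocks (a) | p < 0.05 (b)

Rows: Haplotype blocks; SNPs passing quality filters; SNPs contributing to fits; Fitted degrees of freedom; Residual degrees of freedom; % degrees of freedom used; Total sum of squares; Fitted sum of squares; Residual sum of squares; % variance explained

(Cell values not available, apart from a single residual sum of squares entry of 0.516 whose column cannot be determined.)
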
a Blocks having redundant information, ie at least as many SNP measurements as common haplotype patterns.

b Informative blocks for which an F test on the fit of the SNP data to the haplotype structure had p < 0.05.

SNPs, single nucleotide polymorphisms.

Analysis of variance allows us to determine how much of the variation in SNP allele frequencies observed between the DNA pools is consistent with the haplotype map and how much is residual variation due to experimental errors in the $\Delta\hat{p}$ measurements, the contribution of rare patterns not represented in the haplotype map and errors in the haplotype map. We can measure the effectiveness of the algorithm by the extent to which the fraction of variance explained by the fitted haplotype patterns exceeds the fraction of degrees of freedom used in the fits. In this analysis (Table 2), we found that about 77 per cent of all the variance in the data was consistent with the model based on common haplotypes. Based on the number of free parameters in the haplotype model, we would have expected only 42 per cent of the variance to be accounted for by chance. We repeated this analysis after permuting the individual $\Delta\hat{p}$ measurements. Here, the haplotype map explained only 43 per cent of the variance and only 5 per cent of SNPs were in blocks having p < 0.05 in an F test. This analysis shows that the agreement between the haplotype model and the original $\Delta\hat{p}$ data is very unlikely to have arisen by chance.

Selection of SNPs for individual genotyping

Selecting the SNP markers that are the most likely to have large allele frequency differences based on the pooled array data is difficult. The set of SNPs having the largest absolute $\Delta\hat{p}$ is dominated by a subset of measurements with very high experimental variance. A t-test is also inadequate, because it favours SNPs with low experimental variance, even if the $\Delta\hat{p}$ is too small to be of biological interest and is probably due to sampling variation. The experimental variance is poorly determined from the limited number of data points available. Due to differences in SNP calibration in our genotyping assay, our ability to estimate absolute allele frequencies, and hence sampling variance, is similarly limited. Based on data from experiments with pools of known composition, we found that the strategy of excluding data for SNPs with very high standard errors, and then selecting SNPs with the largest $\Delta\hat{p}$, performed as well as or better than tests based on variance estimates for each SNP (data not shown).

A total of 312 SNP markers were chosen for individual genotyping based on the capacity of a small high-density oligonucleotide array. Based on the pooled allele frequency data, we selected 284 SNPs expected to have large allele frequency differences. Half of the 284 SNPs were chosen to be 'haplotype conforming' (belonging to informative haplotype blocks that had good fits with p < 0.05), while the other half were chosen to be 'non-conforming' SNPs selected from the remainder based only on pooled estimates of $\Delta\hat{p}$. We ranked 1,934 conforming SNPs by the smaller of their actual and fitted $\Delta\hat{p}$ values, and selected the top 142 SNPs, yielding a cut-off of $|\Delta\hat{p}| > 0.037$. For 4,677 non-conforming SNPs, ranking by absolute $\Delta\hat{p}$ and selecting the top 142 yielded a cut-off of $|\Delta\hat{p}| > 0.048$. We selected a higher proportion of conforming SNPs for individual genotyping because their consistency with the haplotype map provided additional evidence for allele frequency differences at those positions. We did include non-conforming SNPs, however, so as not to overlook signals that were not in blocks, or for which the fit to the haplotype map was poor for other reasons. An additional 28 SNPs that did not meet these criteria were selected because they were either in candidate loci of interest or had been independently genotyped in the same population using fluorescence polarisation. They were used to assess the accuracy of our high-density array-based individual genotyping.
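The ranking rule for conforming SNPs can be sketched as below. The sketch assumes that 'the smaller of their actual and fitted values' refers to the smaller absolute value, so a SNP ranks highly only if both the raw and the fitted estimate are large; the function name and tuple layout are hypothetical.

```python
def select_conforming(snps, n_select):
    """snps: list of (snp_id, measured_delta, fitted_delta) tuples.

    Returns the ids of the top n_select SNPs and the score of the last one chosen.
    """
    def score(snp):
        _, measured, fitted = snp
        # Conservative score: the smaller of the two absolute differences.
        return min(abs(measured), abs(fitted))

    ranked = sorted(snps, key=score, reverse=True)
    chosen = ranked[:n_select]
    cutoff = score(chosen[-1])
    return [snp[0] for snp in chosen], cutoff


candidates = [
    ("snp_a", 0.06, 0.05),
    ("snp_b", -0.08, 0.01),   # large raw value, but not supported by the fit
    ("snp_c", -0.04, -0.045),
    ("snp_d", 0.02, 0.02),
]
print(select_conforming(candidates, 2))
```

With this score, `snp_b` is passed over despite having the largest raw difference, because the haplotype fit does not support it.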

Individual genotyping data quality analysis

A total of 832 DNA samples in the study and replicate populations were individually genotyped for the 312 selected SNPs. Three quality-control filters were applied to the individual genotype data. We first required that SNPs have an unambiguous genotype call in at least 80 per cent of the 832 DNA samples assayed. Secondly, we required that both SNP alleles be segregating in the population (ie have at least two identifiable genotype clusters). Finally, we required that the SNP alleles be in Hardy-Weinberg equilibrium (p > 0.001). We found that large deviations from Hardy-Weinberg equilibrium were generally associated with systematic hybridisation artefacts. Of the 312 SNPs, 284 (91 per cent) passed all three data quality filters.
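The three filters can be expressed as a single predicate over the genotype counts of one SNP. The sketch below assumes a 1-d.f. χ² goodness-of-fit test for Hardy-Weinberg proportions; the text does not say which test the authors used, and the function names and example counts are hypothetical.

```python
import math


def hwe_chi2_p_value(n_aa, n_ab, n_bb):
    """P value of a 1-d.f. chi-square goodness-of-fit test for Hardy-Weinberg proportions."""
    n = n_aa + n_ab + n_bb
    p = (2 * n_aa + n_ab) / (2 * n)   # frequency of allele A
    q = 1.0 - p
    expected = (n * p * p, 2 * n * p * q, n * q * q)
    observed = (n_aa, n_ab, n_bb)
    chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
    # Survival function of a chi-square variable with 1 degree of freedom.
    return math.erfc(math.sqrt(chi2 / 2.0))


def passes_qc(n_aa, n_ab, n_bb, n_samples, min_call_rate=0.80, min_hwe_p=0.001):
    """Apply the call-rate, segregation and Hardy-Weinberg filters in turn."""
    called = n_aa + n_ab + n_bb
    if called / n_samples < min_call_rate:
        return False                                   # filter 1: call rate
    if sum(1 for count in (n_aa, n_ab, n_bb) if count > 0) < 2:
        return False                                   # filter 2: fewer than two genotype clusters
    return hwe_chi2_p_value(n_aa, n_ab, n_bb) > min_hwe_p  # filter 3: HWE


# Illustrative genotype counts out of 832 samples (not data from the study).
print(passes_qc(400, 330, 70, 832))   # well-behaved SNP
print(passes_qc(600, 0, 0, 832))      # call rate below 80 per cent
print(passes_qc(800, 0, 0, 832))      # only one genotype cluster
print(passes_qc(500, 0, 300, 832))    # no heterozygotes: gross HWE violation
```

The segregation check runs before the Hardy-Weinberg test so that the expected counts used in the χ² statistic are never zero.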

To estimate the quality of the individual genotypes called using the high-density oligonucleotide array platform, we compared our genotype calls with those obtained using FP-TDI for 19 SNPs in three gene regions (CETP, endothelial lipase [LIPG] and liver X receptor alpha [LXRα]). The call rate (the fraction of assigned genotypes out of potential genotypes) for the array platform is above 98 per cent, very similar to that obtained with FP-TDI on the same DNA samples (supplementary Table 3). Of the genotypes called by both methods, the concordance (the fraction of genotypes assigned by both methods that were in agreement) between the oligonucleotide array and FP-TDI methodologies is greater than 99 per cent.
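The two quality measures defined above are simple ratios. A minimal sketch, assuming genotype calls are stored as strings with `None` marking a missing call (a hypothetical representation, with made-up example data):

```python
def call_rate(calls):
    """Fraction of potential genotypes that received a call (None marks a no-call)."""
    return sum(call is not None for call in calls) / len(calls)


def concordance(calls_a, calls_b):
    """Fraction of genotypes called by both methods on which the two methods agree."""
    both_called = [
        (a, b) for a, b in zip(calls_a, calls_b) if a is not None and b is not None
    ]
    return sum(a == b for a, b in both_called) / len(both_called)


# Illustrative calls for five samples at one SNP (not data from the study).
array_calls = ["AA", "AB", None, "BB", "AB"]
fp_tdi_calls = ["AA", "AB", "AB", "BB", "BB"]

print(call_rate(array_calls))                  # 4 of 5 potential genotypes called
print(concordance(array_calls, fp_tdi_calls))  # 3 of the 4 jointly called genotypes agree
```

Concordance is computed only over genotypes called by both platforms, so a missing call lowers the call rate but does not count as a disagreement.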

Evaluation of the pooled genotyping screen

To evaluate the effectiveness of the pooled genotyping step for estimating allele frequency differences between the case and control DNA pools, we examined the relationship between pooled allele frequency estimates and allele frequencies determined by individual genotyping. For each of the 284 SNPs selected from the pooled data, we have allele frequency estimates for four pooled samples and corresponding individual genotype data for all the samples used to compose the pools. While the numbers of data points and ranges of allele frequencies for each SNP are small, we can still use these data to examine the relationship between a pooled p̂ and an allele frequency p determined by individual genotyping. This relationship for an individual SNP is very nearly linear; however, there is substantial variation in slope and intercept between SNPs (Figure 3). A regression of p̂, averaged over the four HDL pools, against the allele frequency p determined by individual genotyping, across all 284 SNPs, has an R² of 0.71. When we examined the independent measurements of p and p̂ in the four pools for each of the 284 individual SNPs, the median R² was 0.85. In principle, we could calibrate assays for each SNP using samples of known allele frequency; however, this becomes impractical when many thousands of assays are analysed. Some variation in sensitivity can be tolerated because the pooled data are only used as a screen for selecting SNP candidates for individual genotyping.
Figure 3

Relationship between pooled allele frequency estimates and allele frequencies determined by individual genotyping. The frequency estimates from pooled genotyping, p̂, are linearly related to allele frequencies, p, determined by individual genotyping. (A) Across all SNPs that were individually genotyped, variation in slope and intercept partially obscures the strength of this relationship. Here, we show p̂ plotted against p averaged over the four high-density lipoprotein pools for 284 SNPs. (B) For each SNP, we have independent measurements of p and p̂ in four pools. We show representative data for four SNPs having (due to sampling variation or association) relatively large separation between the four values of p.
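The per-SNP relationship between p̂ and p can be examined with an ordinary least-squares fit over the four pools. The sketch below is a generic fit, not the authors' analysis code, and the four data points are invented to show a SNP whose assay has a slope and intercept different from the ideal values of 1 and 0.

```python
def linear_fit(x, y):
    """Ordinary least-squares fit of y on x; returns (slope, intercept, r_squared)."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    syy = sum((yi - mean_y) ** 2 for yi in y)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    r_squared = sxy ** 2 / (sxx * syy)
    return slope, intercept, r_squared


# Illustrative data for one SNP in four pools (not data from the study):
# p from individual genotyping, p_hat from the pooled assay.
p = [0.20, 0.25, 0.30, 0.35]
p_hat = [0.26, 0.30, 0.34, 0.38]   # exactly 0.1 + 0.8 * p

slope, intercept, r_squared = linear_fit(p, p_hat)
print(slope, intercept, r_squared)
```

A SNP like this one is perfectly linear within itself, yet would add scatter to a regression across all SNPs because its slope and intercept differ from those of other assays.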

To evaluate the sensitivity of SNP selection from pooled genotyping, we used χ² tests to measure allelic association of each SNP with the HDL cholesterol phenotype. To the extent that pooled allele frequency differences are predictive, we should see an excess of small p values in tests for the 284 SNPs selected based on the pooled results. Given tests of N SNPs at a threshold of p < X, we expect (N × X) SNPs to meet that threshold due to sampling variation in allele frequencies between the pools. In fact, we see far more small p values than would be expected based on 284 tests (Table 3). From the entire 6,611 SNPs we used to choose the 284 for individual genotyping, we would expect 6,611 × 0.01 = 66 to have p < 0.01, and 6,611 × 0.04 = 284 to have p < 0.04. A perfect SNP selection procedure would have captured all of these. In fact, we captured 32 per cent of the expected total number of SNPs at the p < 0.04 level, and 62 per cent of the expected number at the p < 0.01 level. Thus, the pooled assay has sufficient sensitivity to capture a substantial fraction of SNPs having even very modest allele frequency differences at the level of sampling variation. Sensitivity for larger allele frequency differences indicative of causal associations should be correspondingly higher.
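The allelic association test and the chance expectation N × X can be sketched as below. The Pearson χ² test on a 2 × 2 table of allele counts without continuity correction is an assumption, since the text does not specify the exact variant used; the allele counts are illustrative.

```python
import math


def allelic_chi2_p_value(ref_low, alt_low, ref_high, alt_high):
    """Pearson chi-square test (1 d.f., no continuity correction) on a 2x2 allele-count table."""
    a, b, c, d = ref_low, alt_low, ref_high, alt_high
    n = a + b + c + d
    chi2 = n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))
    # Survival function of a chi-square variable with 1 degree of freedom.
    return math.erfc(math.sqrt(chi2 / 2.0))


def expected_by_chance(n_tests, threshold):
    """Number of tests expected to fall below `threshold` under the null hypothesis."""
    return n_tests * threshold


# Illustrative allele counts for the low and high HDL groups (not data from the study).
p_value = allelic_chi2_p_value(120, 80, 100, 100)
print(p_value)

# Chance expectation quoted in the text for the p < 0.01 threshold.
print(round(expected_by_chance(6611, 0.01)))
```

Dividing the number of selected SNPs that actually meet a threshold by this expectation gives the capture fractions quoted in the text.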
Table 3

Distribution of χ² test scores in SNPs selected for individual genotyping.

[Table body not recoverable. Surviving column headings: All SNPs; Selected SNPs. Surviving row labels: Number meeting quality criteria; p < 0.04 (c); p < 0.01 (c).]

a SNPs selected based on corroborating evidence from the haplotype-fitting procedure.

b SNPs selected based on a large Δp̂ but without supporting haplotype information.

c SNPs meeting this threshold in a χ² test of allelic association with the HDL phenotype.

SNPs, single nucleotide polymorphisms.

To assess the effectiveness of the haplotype-fitting procedure, we looked at the numbers of 'haplotype conforming' and 'non-conforming' SNPs having small χ² test p values in individual genotyping (Table 3). Of SNPs having p < 0.04, about 65 per cent came from the 'haplotype conforming' category, and this excess was statistically significant (p ≈ 0.001). The same trend was seen among SNPs having p < 0.01; however, the numbers of observations were insufficient to reach statistical significance. Thus, SNPs selected based on corroborating haplotype data indeed seem to be more likely to show larger allele frequency differences in individual genotyping.
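One way to test for such an excess is a one-sided exact binomial test against the share of conforming SNPs expected if the two categories performed equally. The text does not state which test the authors used, so this is only a sketch of the general idea, with hypothetical counts.

```python
from math import comb


def binomial_upper_tail(k, n, prob):
    """P(X >= k) for X ~ Binomial(n, prob)."""
    return sum(
        comb(n, i) * prob ** i * (1 - prob) ** (n - i) for i in range(k, n + 1)
    )


# Hypothetical example: 8 of 10 significant SNPs come from the conforming category,
# where half would be expected if the two categories performed equally.
tail = binomial_upper_tail(8, 10, 0.5)
print(tail)   # (45 + 10 + 1) / 1024
```

With the real counts, `prob` would be the fraction of conforming SNPs among all SNPs that passed the quality filters, which need not be exactly one half.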

To further examine the impact of using haplotype block information to improve estimates of SNP allele frequency differences between pooled DNA samples in case-control studies, we compared the correspondence of estimates of Δp̂ to actual allele frequency differences determined by individual genotyping (Figure 4). While the correspondence of pooled estimates of Δp̂ to actual allele frequency differences was good (r² = 0.82), the correspondence of the haplotype-fitted estimates of Δp̂ to actual allele frequency differences was substantially better (r² = 0.90). These results demonstrate that the accuracy of estimating SNP allele frequency differences in pooled genotyping is improved using haplotype block information.
Figure 4

Correspondence between estimated pooled allele frequency differences and actual differences determined by individual genotyping. For this analysis, we included data for single nucleotide polymorphisms (SNPs) having call rates of at least 90 per cent. (A) The correlation between estimated and actual allele frequency differences for 267 SNPs is 0.82. (B) For 164 SNPs in haplotype blocks with fitted p < 0.05, the correlation between allele frequency difference and the estimate based on the haplotype-fitting procedure is 0.90. In the upper right corner of each plot are five SNPs in the genomic interval encoding CETP.

SNP associations with HDL levels in the study population

Our two-stage experimental design posed a tricky multiple testing problem. While we performed tests on just 312 individually genotyped SNPs, these were selected as likely to have large allele frequency differences from a total of 6,611 SNPs with good pooled genotyping data quality. If our pooled assay were perfect, then we were effectively testing all 6,611 SNPs; if the pooled estimates were uncorrelated with allele frequencies, then we were really only testing the 312 SNPs selected for individual genotyping. Based on our capture rates for SNPs with small p values, we consider that the effective number of tests we were performing was a substantial fraction of 6,611.

Using a conservative Bonferroni correction, a global false-positive rate of 0.05 for 6,611 SNPs tested would require p < 7.6 × 10⁻⁶ for an association to be significant. Considering only the study cohort used for pooled genotyping, there were six SNPs, all in the CETP gene, that met this threshold of significance (Table 4). Of these six SNPs, four (rs711752, rs708272, rs11508026, rs7205804) had been selected based on positive results in the pooled genotyping screen and two (rs1800775, rs11076175) had been included to test cross-technology concordance of genotype calling. The four SNPs that had been assayed in the pooled screen were all in the same haplotype block (Figure 5), which had a fitted p value of ~0.002 in our haplotype analysis of the pooled data, and the largest fitted Δp̂ values observed in the study. The next best SNP in the data for the study cohort outside the CETP gene had a χ² test p value of 0.0015, which would not be significant, even using a generous threshold based on 312 independent tests.
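The thresholds discussed above follow directly from dividing the global false-positive rate by the number of tests. A short worked check, using only numbers given in the text and in Table 4 (the variable names are illustrative):

```python
def bonferroni_threshold(alpha, n_tests):
    """Per-test significance threshold for a global false-positive rate `alpha`."""
    return alpha / n_tests


strict = bonferroni_threshold(0.05, 6611)     # treating all screened SNPs as tested
generous = bonferroni_threshold(0.05, 312)    # treating only genotyped SNPs as tested

# Chi-square p values reported in Table 4 for the six CETP SNPs.
cetp_p_values = [1.1e-11, 7.2e-9, 2.4e-8, 9.1e-8, 9.7e-8, 1.2e-7]
best_non_cetp_p = 0.0015

print(f"strict threshold:   {strict:.1e}")
print(f"generous threshold: {generous:.1e}")
print(all(p < strict for p in cetp_p_values))   # every CETP SNP passes the strict threshold
print(best_non_cetp_p < generous)               # the best non-CETP SNP fails even the generous one
```

The two thresholds bracket the effective number of tests discussed in the preceding paragraph, and the conclusion is the same at either extreme.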
Table 4

SNPs having significant association with HDL cholesterol in the study population.

χ² test p values for the six SNPs, in table order: 1.1 × 10⁻¹¹; 7.2 × 10⁻⁹; 2.4 × 10⁻⁸; 9.1 × 10⁻⁸; 9.7 × 10⁻⁸; 1.2 × 10⁻⁷.

[The remaining columns are not recoverable; the footnotes indicate that they included position (a), allele counts (b) and allele frequency difference (c).]

a Position on GenBank sequence NC_000016.4.

b Counts of reference and alternate alleles for low and high HDL cholesterol groups.

c Allele frequency difference, calculated as refL /(refL + altL) - refH /(refH + altH).

dbSNP, single nucleotide polymorphism database; HDL, high-density lipoprotein.
Figure 5

Single nucleotide polymorphisms (SNPs) and haplotype structure in the cholesteryl ester transfer protein (CETP) gene. (A) The CETP transcript (GenBank accession NM_000078) is roughly 22 kilobases and contains 16 exons. We obtained good quality genotype data for 14 SNPs in this region. Of these, nine were common SNPs that were included in our haplotype map. In the map, these were assigned to two haplotype blocks, each having three common haplotype patterns. The 5' block includes the TaqIB SNP, rs708272 [28, 29]. (B) We measured linkage disequilibrium, D', for all pairs of SNPs in this interval, using an expectation maximisation (EM) algorithm to determine haplotypes. These data also show two strong blocks of high disequilibrium.

SNP associations in the replicate population

To replicate an association in the study cohort, we only need to consider tests in the replicate samples for the subset of SNPs that gave significant χ² scores in that analysis. In the replicate population, there are only five SNPs with good data quality having p < 0.01, and all are in CETP (Table 5). For the six SNPs in CETP that showed significant association in the study samples, a conservative Bonferroni correction for a global false-positive level of 0.05 would require p < 8.3 × 10⁻³ to count as a replication. Of these six SNPs, four (rs711752, rs708272, rs11508026, rs7205804) were also associated at this significance level in the replicate population. Two SNPs (rs1800775, rs11076175) that had small p values in the study population did not meet significance levels in the replicate DNA samples. These differences were not unexpected, given the limited sample size of the replicate cohort and incomplete linkage disequilibrium of the SNPs in this interval.
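The replication criterion can be checked against the p values in Table 5. In the sketch below, the pairing of study and replicate p values assumes that each row of the table lists the study, replicate and overall p values in that order; SNP identifiers are deliberately not attached, because the mapping of rows to rs numbers is not stated here.

```python
ALPHA = 0.05

# (study p value, replicate p value) for the six CETP SNPs that passed the
# study-cohort threshold, read from Table 5 under the row-order assumption above.
candidates = [
    (1.1e-11, 1.5e-1),
    (2.4e-8, 5.0e-4),
    (7.2e-9, 3.2e-4),
    (9.0e-8, 1.9e-3),
    (9.7e-8, 5.6e-3),
    (1.2e-7, 8.9e-1),
]

# Only the six candidates are tested in the replicate cohort.
replication_threshold = ALPHA / len(candidates)
replicated = [pair for pair in candidates if pair[1] < replication_threshold]

print(f"threshold = {replication_threshold:.1e}")
print(f"replicated = {len(replicated)} of {len(candidates)}")
```

Restricting the correction to six tests is what makes replication achievable in the smaller replicate cohort; applying the study-cohort threshold again would demand far more power than that cohort provides.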
Table 5

Genotyping results for 14 CETP gene SNPs in the test and replicate samples.

[SNP identifiers, positions (a) and odds ratios (c) are not recoverable; the surviving p value columns are shown in table order.]

| Study samples, p value | Replicate samples, p value | Overall p value (b) |
| --- | --- | --- |
| 1.2 × 10⁻² | 2.8 × 10⁻² | 1.4 × 10⁻³ |
| 1.1 × 10⁻¹¹ | 1.5 × 10⁻¹ | 1.8 × 10⁻¹¹ |
| 2.4 × 10⁻⁸ | 5.0 × 10⁻⁴ | 5.9 × 10⁻¹¹ |
| 7.2 × 10⁻⁹ | 3.2 × 10⁻⁴ | 1.2 × 10⁻¹¹ |
| 9.0 × 10⁻⁸ | 1.9 × 10⁻³ | 7.1 × 10⁻¹⁰ |
| 9.7 × 10⁻⁸ | 5.6 × 10⁻³ | 1.9 × 10⁻⁹ |
| 1.2 × 10⁻⁷ | 8.9 × 10⁻¹ | 1.6 × 10⁻⁶ |
| 1.6 × 10⁻² | 3.9 × 10⁻¹ | 1.1 × 10⁻² |
| 5.0 × 10⁻³ | 8.7 × 10⁻³ | 2.4 × 10⁻⁴ |
| 7.5 × 10⁻¹ | 5.8 × 10⁻² | 2.5 × 10⁻¹ |
| 4.4 × 10⁻³ | 8.0 × 10⁻¹ | 7.7 × 10⁻³ |
| 1.1 × 10⁻³ | 2.9 × 10⁻¹ | 4.2 × 10⁻³ |
| 2.0 × 10⁻¹ | 2.9 × 10⁻¹ | 5.0 × 10⁻¹ |
| 1.2 × 10⁻¹ | 4.6 × 10⁻² | 3.0 × 10⁻¹ |

a Position on GenBank sequence NC_000016.4.

b Calculated from a χ² test for the combined study and replicate samples.

c Calculated as (refL /altL)/(refH /altH), where these are allele counts as in Table 4.

SNPs in bold are those with significant association with HDL cholesterol in both the study and replicate populations.

CETP, cholesteryl ester transfer protein; dbSNP, single nucleotide polymorphism database; HDL, high-density lipoprotein.
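The two summary statistics defined in the footnotes to Tables 4 and 5 can be computed directly from the four allele counts. The function names and the example counts below are illustrative, not values from the tables.

```python
def allele_frequency_difference(ref_low, alt_low, ref_high, alt_high):
    """refL / (refL + altL) - refH / (refH + altH), as in footnote c of Table 4."""
    return ref_low / (ref_low + alt_low) - ref_high / (ref_high + alt_high)


def odds_ratio(ref_low, alt_low, ref_high, alt_high):
    """(refL / altL) / (refH / altH), as in footnote c of Table 5."""
    return (ref_low / alt_low) / (ref_high / alt_high)


# Illustrative allele counts for the low and high HDL groups (not data from the study).
counts = (120, 80, 90, 110)
print(allele_frequency_difference(*counts))   # 0.60 - 0.45
print(odds_ratio(*counts))                    # (120 * 110) / (80 * 90)
```

A positive frequency difference and an odds ratio above 1 both indicate that the reference allele is more common in the low HDL group.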

Linkage disequilibrium across the CETP locus

Previous studies have identified two major blocks of linkage disequilibrium across the CETP locus [4, 17, 24]. Of the 14 SNP markers in CETP that we examined for association with HDL cholesterol levels (supplementary Table 4), nine are members of our whole-genome haplotype map (Figure 5). Consistent with these other studies, in our map, these nine SNPs are divided into two haplotype blocks, each having three common haplotype patterns. We computed D' for all pairs of the 14 SNPs in CETP, using an expectation maximisation (EM) algorithm to determine haplotype frequencies [25, 26]. These results, again, show two blocks of very strong disequilibrium.
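For a pair of biallelic SNPs, the EM step is needed only to resolve the phase of double heterozygotes. The sketch below is a standard two-locus EM followed by the calculation of |D'|; it is not the authors' code, and the genotype-count layout, function names, iteration count and example data are assumptions made for illustration.

```python
def em_haplotype_frequencies(n, n_iter=200):
    """Estimate two-locus haplotype frequencies (AB, Ab, aB, ab) by EM.

    n[i][j] is the number of individuals carrying i copies of allele A at the
    first SNP and j copies of allele B at the second SNP (i, j in 0..2).
    """
    total_haplotypes = 2 * sum(sum(row) for row in n)
    # Haplotype counts from genotypes whose phase is unambiguous.
    base_AB = 2 * n[2][2] + n[2][1] + n[1][2]
    base_Ab = 2 * n[2][0] + n[2][1] + n[1][0]
    base_aB = 2 * n[0][2] + n[1][2] + n[0][1]
    base_ab = 2 * n[0][0] + n[1][0] + n[0][1]
    double_hets = n[1][1]   # phase is either AB/ab or Ab/aB

    f_AB = f_Ab = f_aB = f_ab = 0.25
    for _ in range(n_iter):
        # E-step: probability that a double heterozygote carries AB/ab.
        cis, trans = f_AB * f_ab, f_Ab * f_aB
        w = cis / (cis + trans) if cis + trans > 0 else 0.5
        # M-step: update frequencies from expected haplotype counts.
        f_AB = (base_AB + w * double_hets) / total_haplotypes
        f_ab = (base_ab + w * double_hets) / total_haplotypes
        f_Ab = (base_Ab + (1 - w) * double_hets) / total_haplotypes
        f_aB = (base_aB + (1 - w) * double_hets) / total_haplotypes
    return f_AB, f_Ab, f_aB, f_ab


def d_prime(f_AB, f_Ab, f_aB, f_ab):
    """Absolute value of Lewontin's D' from haplotype frequencies."""
    p_A, p_B = f_AB + f_Ab, f_AB + f_aB
    d = f_AB - p_A * p_B
    if d >= 0:
        d_max = min(p_A * (1 - p_B), (1 - p_A) * p_B)
    else:
        d_max = min(p_A * p_B, (1 - p_A) * (1 - p_B))
    return abs(d) / d_max if d_max > 0 else 0.0


# Illustrative genotype counts consistent with only two haplotypes, AB and ab.
genotype_counts = [
    [20, 0, 0],   # aa individuals: bb, Bb, BB
    [0, 50, 0],   # Aa individuals
    [0, 0, 30],   # AA individuals
]
freqs = em_haplotype_frequencies(genotype_counts)
print(freqs, d_prime(*freqs))
```

In this example the unambiguous genotypes contain only AB and ab haplotypes, so the EM assigns the double heterozygotes to the AB/ab phase and the estimated |D'| approaches 1, the pattern expected for SNP pairs within a block of strong disequilibrium.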


The goal of this study was to determine the effectiveness of a large-scale pooled genotyping screen to identify common variants associated with a complex trait. CETP, which transfers cholesteryl esters from the anti-atherogenic HDL to the proatherogenic very-low- and low-density lipoprotein fractions, plays an important role in HDL cholesterol metabolism and served as our positive control. Associations between SNPs in the genomic interval encoding CETP, variations in the mass/activity of the CETP protein and corresponding HDL levels have been intensively studied [10, 16, 27] and consistently demonstrated. CETP is estimated to account for ~5 per cent of the variability of HDL levels in the general population [17]. Here, four SNPs in CETP had strong signals and were independently identified as being associated with HDL levels in the pooled screen. The fact that we identified CETP in this study as being associated with HDL levels confirms that pooled genotyping can be used in genetic association studies to identify genes underlying complex phenotypes.

While we find CETP to be replicably and convincingly associated with serum HDL cholesterol levels, none of the 70 candidate genes or 230 other genes in the 17.1 Mb of DNA screened appears to play a major role in the genetic variability of HDL cholesterol levels in this population. Based on the strong association of CETP with HDL observed in our study, we are likely to have had sufficient power to identify similar effect sizes in the other candidate genes. Recent work suggests that there are likely to be several additional genes that contribute to HDL phenotypic variance and are as yet unidentified [17]. We examined SNPs distributed across only about 0.5 per cent of the genome, and thus it is likely that these unidentified genes are located in genomic intervals that we did not examine.

In our candidate region study, we used a design incorporating stratification analysis, pooled genotyping, confirmation of promising candidate loci by individual genotyping and replication in an independent cohort. We have demonstrated that independently derived haplotype map information can be used to improve SNP selection from a pooled genotyping screen. High-density oligonucleotide arrays permit the scale and efficiency required for very large-scale association studies. These experimental methods and analysis strategies can be directly scaled up to whole-genome association studies.



We would like to thank Charles Shear and the Lipitor Team for assistance in the clinical trials; Pascual Starink, Ellen Jacobs and Eric Buljan for LIMS development and support; Geoff Nilsen, Michael Jen and Wade Barrett for designing the high-density arrays and assistance with data analysis; John Sheehan for quality analysis of the high-density array data; Albert Yee, Reed George, Julie Marschner, Joe Karbowski and Mike Kennemer for the DNA normalisation and pooling; Patrick Chu for the pooled genotyping; Clariza LaRosa, Matt Morenzoni, Pei-Hua Wang, Rajal Patel, Rhode Vergara, Robin Li, Thai Lai, Vincent Mendoza, Karel Konvicka and Renee Stokowski for the high-density array individual genotyping; Maruja Lira, Amy Mank-Seymour and Jodi Richmond for the fluorescence polarisation genotyping; Erica Beilharz for assistance with manuscript preparation; Paul Feeney and Michael Swietek for assistance in sample handling; and the patients for donating samples for research.

Authors’ Affiliations

  1. Perlegen Sciences
  2. Genomic and Proteomic Sciences, Pfizer Global Research and Development
  3. Biostatistical Applications, Pfizer Global Research and Development


  1. Risch N, Merikangas K: The future of genetic studies of complex human diseases. Science. 1996, 273: 1516-1517. 10.1126/science.273.5281.1516.
  2. Germer S, Holland MJ, Higuchi R: High-throughput SNP allele-frequency determination in pooled DNA samples by kinetic PCR. Genome Res. 2000, 10: 258-266. 10.1101/gr.10.2.258.
  3. Uhl GR, Liu Q, Walther D, et al: Polysubstance abuse-vulnerability genes: Genome scans for association, using 1,004 subjects and 1,494 single-nucleotide polymorphisms. Am J Hum Genet. 2001, 69: 1290-1300. 10.1086/324467.
  4. Bansal A, van den Boom D, Kammerer S, et al: Association testing by DNA pooling: An effective initial screen. Proc Natl Acad Sci USA. 2002, 99: 16871-16874. 10.1073/pnas.262671399.
  5. Gruber JD, Colligan JK, Wolford JK: Estimation of single nucleotide polymorphism allele frequency in DNA pools by using pyrosequencing. Hum Genet. 2002, 110: 395-401. 10.1007/s00439-002-0722-6.
  6. Xiao M, Kwok PY: DNA analysis by fluorescence quenching detection. Genome Res. 2003, 13: 932-939. 10.1101/gr.987803.
  7. Barratt BJ, Payne F, Rance HE, et al: Identification of the sources of error in allele frequency estimations from pooled DNA indicates an optimal experimental design. Ann Hum Genet. 2002, 66: 393-405. 10.1046/j.1469-1809.2002.00125.x.
  8. Jawaid A, Bader JS, Purcell S, et al: Optimal selection strategies for QTL mapping using pooled DNA samples. Eur J Hum Genet. 2002, 10: 125-132. 10.1038/sj.ejhg.5200771.
  9. Sham P, Bader JS, Craig I, et al: DNA pooling: A tool for large-scale association studies. Nat Rev Genet. 2002, 3: 862-871.
  10. Barter PJ, Brewer HB, Chapman MJ, et al: Cholesteryl ester transfer protein: A novel target for raising HDL and inhibiting atherosclerosis. Arterioscler Thromb Vasc Biol. 2003, 23: 160-167. 10.1161/01.ATV.0000054658.91146.64.
  11. Patil N, Berno AJ, Hinds DA, et al: Blocks of limited haplotype diversity revealed by high-resolution scanning of human chromosome 21. Science. 2001, 294: 1719-1723. 10.1126/science.1065573.
  12. Collins FS, Brooks LD, Chakravarti A: A DNA polymorphism discovery resource for research on human genetic variation. Genome Res. 1998, 8: 1229-1231.
  13. Zhang K, Deng M, Chen T, et al: A dynamic programming algorithm for haplotype block partitioning. Proc Natl Acad Sci USA. 2002, 99: 7335-7339. 10.1073/pnas.102186799.
  14. Ballantyne CM, Andrews TC, Hsia JA, et al: Correlation of non-high-density lipoprotein cholesterol with apolipoprotein B: Effect of 5 hydroxymethylglutaryl coenzyme A reductase inhibitors on non-high-density lipoprotein cholesterol levels. Am J Cardiol. 2001, 88: 265-269. 10.1016/S0002-9149(01)01638-1.
  15. Eiriksdottir G, Bolla MK, Thorsson B, et al: The -629C > A polymorphism in the CETP gene does not explain the association of TaqIB polymorphism with risk and age of myocardial infarction in Icelandic men. Atherosclerosis. 2001, 159: 187-192. 10.1016/S0021-9150(01)00489-0.
  16. Thompson JF, Lira ME, Durham LK, et al: Polymorphisms in the CETP gene and association with CETP mass and HDL levels. Atherosclerosis. 2003, 167: 195-204. 10.1016/S0021-9150(03)00005-4.
  17. Knoblauch H, Bauerfeind A, Toliat MR, et al: Haplotypes and SNPs in 13 lipid-relevant genes explain most of the genetic variance in high-density lipoprotein and low-density lipoprotein cholesterol. Hum Mol Genet. 2004, 13: 993-1004. 10.1093/hmg/ddh119.
  18. Hinds DA, Stokowski RP, Patil N, et al: Matching strategies for genetic association studies in structured populations. Am J Hum Genet. 2004, 74: 317-325. 10.1086/381716.
  19. Chen X, Levine L, Kwok PY: Fluorescence polarization in homogeneous nucleic acid analysis. Genome Res. 1999, 9: 492-498.
  20. Knowler WC, Williams RC, Pettitt DJ, et al: Gm3;5,13,14 and type 2 diabetes mellitus: An association in American Indians with genetic admixture. Am J Hum Genet. 1988, 43: 520-526.
  21. Lander ES, Schork NJ: Genetic dissection of complex traits. Science. 1994, 265: 2037-2048. 10.1126/science.8091226.
  22. Pritchard JK, Rosenberg NA: Use of unlinked genetic markers to detect population stratification in association studies. Am J Hum Genet. 1999, 65: 220-228. 10.1086/302449.
  23. Pritchard JK, Stephens M, Donnelly P: Inference of population structure using multilocus genotype data. Genetics. 2000, 155: 945-959.
  24. Corbex M, Poirier O, Fumeron F, et al: Extensive association analysis between the CETP gene and coronary heart disease phenotypes reveals several putative functional polymorphisms and gene-environment interaction. Genet Epidemiol. 2000, 19: 64-80. 10.1002/1098-2272(200007)19:1<64::AID-GEPI5>3.0.CO;2-E.
  25. Lewontin RC: The interaction of selection and linkage. I. General considerations; heterotic models. Genetics. 1964, 49: 49-67.
  26. Weir BS: Genetic data analysis II. 1996, Sinauer Associates, Sunderland, MA.
  27. Ordovas JM, Cupples LA, Corella D, et al: Association of cholesteryl ester transfer protein-TaqIB polymorphism with variations in lipoprotein subclasses and coronary heart disease risk: The Framingham study. Arterioscler Thromb Vasc Biol. 2000, 20: 1323-1329. 10.1161/01.ATV.20.5.1323.
  28. Drayna D, Lawn R: Multiple RFLPs at the human cholesteryl ester transfer protein (CETP) locus. Nucleic Acids Res. 1987, 15: 4698. 10.1093/nar/15.11.4698.
  29. Fumeron F, Betoulle D, Luc G, et al: Alcohol intake modulates the effect of a polymorphism of the cholesteryl ester transfer protein gene on plasma high density lipoprotein and the risk of myocardial infarction. J Clin Invest. 1995, 96: 1664-1671. 10.1172/JCI118207.


© Henry Stewart Publications 2004