  • Primary research

Characterisation of SNP haplotype structure in chemokine and chemokine receptor genes using CEPH pedigrees and statistical estimation


Chemokine signals and their cell-surface receptors are important modulators of HIV-1 disease and cancer. To aid future case/control association studies, we aim to further characterise the haplotype structure of variation in chemokine and chemokine receptor genes. To perform haplotype analysis in a population-based association study, haplotypes must be determined by estimation, in the absence of family information or laboratory methods to establish phase. Here, we test the accuracy of estimates of haplotype frequency and linkage disequilibrium by comparing estimated haplotypes generated with the expectation maximisation (EM) algorithm to haplotypes determined from Centre d'Etude Polymorphisme Humain (CEPH) pedigree data. To do this, we have characterised haplotypes comprising alleles at 11 biallelic loci in four chemokine receptor genes (CCR3, CCR2, CCR5 and CCRL2), which span 150 kb on chromosome 3p21, and haplotypes of nine biallelic loci in six chemokine genes [MCP-1 (CCL2), Eotaxin (CCL11), RANTES (CCL5), MPIF-1 (CCL23), PARC (CCL18) and MIP-1α (CCL3)] on chromosome 17q11-12. Forty multi-generation CEPH families, totalling 489 individuals, were genotyped by the TaqMan 5'-nuclease assay. Phased haplotypes and haplotypes estimated from unphased genotypes were compared in 103 grandparents who were assumed to have mated at random.

For the 3p21 single nucleotide polymorphism (SNP) data, haplotypes determined by pedigree analysis and haplotypes generated by the EM algorithm were nearly identical. Linkage disequilibrium, measured by the D' statistic, was nearly maximal across the 150 kb region, with complete disequilibrium maintained at the extremes between CCR3-Y17Y and CCRL2-I243V. D'-values calculated from estimated haplotypes on 3p21 had high concordance with pairwise comparisons between pedigree-phased chromosomes. Conversely, there was less agreement between analyses of haplotype frequencies and linkage disequilibrium using estimated haplotypes when compared with pedigree-phased haplotypes of SNPs on chromosome 17q11-12. These results suggest that, while estimations of haplotype frequency and linkage disequilibrium may be relatively simple in the 3p21 chemokine receptor cluster in population samples, the more complex environment on chromosome 17q11-12 will require a higher resolution haplotype analysis.


Chemokines are small intercellular signalling molecules that recruit immune cells to sites of inflammation and infection. The two major subfamilies of chemokine proteins are defined as CC, with two adjacent cysteine residues, or as CXC, with an intervening non-conserved amino acid. Other chemokine family members have cysteine residues separated by more than one intervening amino acid (eg CX3C or Fractalkine) [1, 2], or are characterised by having only one cysteine (eg XCL1 or Lymphotactin) [3, 4]. Chemokine receptors are defined by the subfamily of chemokine ligand that they bind. Both the chemokine and the chemokine receptor genes are generally clustered in four distinct chromosomal regions: CC on 17q11-21, CXC on 4q12-21, both CCR and CXCR on 3p21-24 and CXCR on 2q21-35.

Variation in chemokines, or their cell-surface receptors, influences an individual's susceptibility to HIV-1 infection and modulates progression to AIDS [5-11]. Chemokine signals are also important in the angiogenic [12-14] and metastatic [15, 16] processes of cancer. Therefore, describing the genetic variation and haplotype structure of chemokine and chemokine receptor gene clusters is necessary for further disease association analyses of these candidate genes.

The focus of the present analysis is to describe the structure of haplotypes comprising multiple single nucleotide polymorphisms (SNPs) in chemokine genes on chromosome 17q11-12 and chemokine receptor genes on chromosome 3p21 in Centre d'Etude Polymorphisme Humain (CEPH) pedigrees (n = 489). A secondary goal is to use the empirically phased haplotypes to determine the accuracy of estimated measures of haplotype frequencies and linkage disequilibrium using the subset of CEPH grandparents (n = 103).

Samples and methods

Study samples

SNP screening and validation were performed using two population panels: a 16-individual panel (four European-Americans, four African-Americans, four Chinese and four self-identified Hispanic-Americans) and an 88-individual panel (30 African-Americans, 34 European-Americans and 24 Hispanics). Forty multi-generation CEPH families, a total of 489 individuals, were genotyped for 23 SNPs scattered over two gene clusters: CC-chemokines on 17q11-12 and CC-chemokine receptors on 3p21 (see Table 1). Genotype data from a subsample of 103 unrelated grandparents were used for comparative haplotype analyses. The use of all anonymous DNA samples was either reviewed by the NIH Internal Review Board or determined 'exempt' from review.

Table 1 Biallelic loci typed in CEPH pedigrees

Chemokine and chemokine receptor SNPs

Conditions for SNP detection in the CCR2 promoter

Four of the 23 SNPs included in the haplotype analysis (Table 1) have not previously been reported and were discovered by direct sequencing. Three kilobases of the CCR2 promoter region were amplified using the Invitrogen Platinum Taq™ kit in a panel of 16 individuals (32 chromosomes), including four European-Americans, four African-Americans, four Chinese and four Hispanic-Americans (self-identified). For 100 μL polymerase chain reactions (PCRs), 200 nM deoxyribonucleotide triphosphates (dNTPs), 200 nM of each primer, 400 nM MgSO4, 10 μL of 10 × Platinum Taq™ buffer and 1 μL Platinum Taq™ were mixed with approximately 100 ng of genomic DNA. Primer sequences for the 3 kb product were as follows: 5'-TCATCTGCTTCTTAATTGCCTTCAG-3' (forward) and 5'-CAGGGTTTCTCTAACATCTCCTGGT-3' (reverse). PCR was performed in a PE Biosystems 9700 ThermoCycler with long-range PCR conditions recommended for Platinum Taq™.

Sequencing was performed on the 3 kb segment at intervals of 400-500 bp with internal primers using the BigDye™ (Applied Biosystems) cycle sequencing kit with some modifications. Sequencing reactions were performed as follows: 15-30 ng of purified product was added to 10 μL reaction solution, which included 2 μL of BigDye™ mix, 1 μL of standard 5 × dilution buffer, 1.1 μL of 0.5 μM primer stock and double-distilled water (ddH2O) for the remaining volume. Reactions were cycled in a PE Biosystems 9700 ThermoCycler under the following conditions: 95°C for five minutes, and 30 cycles of 95°C for 30 seconds, 50°C for ten seconds and 60°C for four minutes. All individuals were sequenced for the entire 3 kb in both forward and reverse directions on an ABI 3700 capillary sequencer. Sequence trace files were analysed by the Phred/Phrap/Consed system [17-20], and PolyPhred was used to detect putative SNPs [21].

Eight SNPs (-5983 G/A, -5047 G/T, -4866 G/C, -4599 T/G, -4419 C/T, -4338 A/T, -3433 T/C and -3232 C/T) were confirmed by visual inspection of the CCR2 promoter sequence of the 16-individual screening panel. Five of these SNPs (-5983 G/A, -5047 G/T, -4866 G/C, -4599 T/G and -3433 T/C) were validated by direct sequencing in a larger sample set that comprised 88 individuals from three populations: 30 African-Americans, 34 European-Americans and 24 self-described Hispanics. The other three SNPs were not validated in the larger sample set, as they are in nearly complete linkage disequilibrium with at least one of the five SNPs chosen for further study. Four of the five validated SNPs listed in Table 1 (-5983 G/A, -5047 G/T, -4866 G/C and -3433 T/C) were successfully optimised for 5'-nuclease assays.

Conditions for screening putative SNPs

The remaining 19 SNPs listed in Table 1 were previously characterised in this laboratory by denaturing high performance liquid chromatography (dHPLC) or single-stranded conformation polymorphism (SSCP) analysis, or were taken from published works or public databases. Flanking primers were designed for a total of 22 polymorphisms from dbSNP [22] using Primer 3.0 from MIT, Cambridge, MA [23]. PCR was performed in 25 μL-scale reactions with the following components: 50 ng genomic DNA, 3 mM MgCl2, 200 nM dNTPs, 200 nM of each primer, 1 U TaqGold™ (Applied Biosystems) and 2.5 μL 10 × TaqGold™ Buffer. The cycling conditions (PE Biosystems 9700) for all reactions were as follows: a 95°C hold for ten minutes, then a touch-down cluster of 12 cycles (95°C for 30 seconds, 62-57°C (decreasing by 0.5°C every cycle) for one minute and 72°C for one minute), a standard cluster of 30 cycles (95°C for 30 seconds, 57°C for one minute and 72°C for one minute) and a final 72°C hold for seven minutes. PCR products were purified using 10 U exonuclease I and 2 U shrimp alkaline phosphatase (SAP) enzymes under the protocol specified by the Washington University Sequencing Center [24].

All purified reaction solutions were sequenced as follows: 15-30 ng of purified product was added to 10 μL reaction solution, which included 2 μL of BigDye™ mix, 1 μL of standard 5 × dilution buffer, 1.1 μL of 0.5 μM primer stock and ddH2O for the remaining volume. Reactions were cycled in a PE Biosystems 9700 ThermoCycler under the following conditions: 95°C for five minutes, and 30 cycles of 95°C for 30 seconds, 50°C for ten seconds, and 60°C for four minutes. Nine of the 22 primer pairs produced viable sequences, and the corresponding SNPs were polymorphic in at least one individual of the 16-individual population panel. Those 'confirmed' SNPs were further characterised by either sequencing or genotyping in the larger sample of 88 individuals (data not shown).

SNP genotyping

All 23 SNPs were genotyped using the 5'-nuclease assay under a set of universal assay conditions. Dual-labelled TaqMan™ (Applied Biosystems) probes with standard, Turbo and Minor-Groove Binding (MGB) chemistries were designed using Primer Express™ (Applied Biosystems). Previous analysis of genotyping accuracy using the TaqMan method revealed 14 discordancies out of 1,165 duplicate genotype pairs, a 1.2 per cent error rate averaged over multiple TaqMan assays [25]. PCR conditions for genotyping (reaction components and cycling conditions), as described in Morin et al. (1999) and Clark et al. (2001), were used for all SNPs typed in this study [25, 26]. PCR was performed in 96-well plates that included positive genotypic controls (for both homozygote states and the heterozygote state for each SNP) and reactions with no DNA as a negative control. All 5'-nuclease assay plates were read on the ABI 7700 Sequence Detector, and analysed using the 'dye components' feature of the SDS v1.6.3 or v1.7 software package (Applied Biosystems). Genotype determinations for each reaction were made manually by visual inspection of a scatter-plot of the data, with reference to the results of the genotype control samples. CEPH pedigree data for all 23 genotyping assays were checked for concordance with Mendelian inheritance using PEDCHECK [27].
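As a rough illustration of what a Mendelian concordance check involves, the following minimal Python sketch tests one biallelic SNP in a single father-mother-child trio. It is not PEDCHECK's algorithm, which handles whole pedigrees and missing data; the function name and the 1/2 allele coding are assumptions made for illustration.

```python
from itertools import product


def mendelian_consistent(father, mother, child):
    """Return True if the child's unordered genotype can be formed by
    taking one allele from the father and one from the mother.

    Each genotype is a pair of allele codes, e.g. (1, 2).
    (Hypothetical helper; not part of PEDCHECK.)
    """
    child_sorted = tuple(sorted(child))
    return any(
        tuple(sorted((paternal, maternal))) == child_sorted
        for paternal, maternal in product(father, mother)
    )


# Example trio: a 1/1 father and a 1/2 mother can have a 1/2 child,
# but two 1/1 parents cannot.
trio_ok = mendelian_consistent((1, 1), (1, 2), (1, 2))
trio_bad = mendelian_consistent((1, 1), (1, 1), (1, 2))
```

A genotype flagged by such a check would normally be re-examined or re-typed before haplotypes are built from it.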

Haplotype analysis

Haplotype phase was determined using the CYRILLIC II pedigree drawing software (Cherwell Scientific) to establish the inheritance of multi-locus genotypes. The algorithm developed by Guo and Thompson (1992) was used to determine whether the distribution of whole haplotypes in the CEPH grandparent sample (n = 103) deviates from Hardy-Weinberg proportions [28]. Significance is determined by an exact test, with a cut-off of p = 0.05. Haplotype states and frequencies on both chromosomes 3p21 and 17q11-12 were estimated in sets of unphased genotype data by MLOCUS [29, 30], which uses the expectation-maximisation (EM) algorithm [31], a maximum-likelihood based method. A previously described three-step procedure to determine the most likely set of haplotypes to describe the genotype data was used here to analyse the haplotype states and frequencies for all datasets [32]. Haplotype blocks on 3p21 were assessed using HaploBlock-Finder [33], which performs the four-gamete test (FGT) between each pairwise SNP to identify past recombination events [34]. The minimum-D' method [35, 36] (with minimum D' = 0.80) was also used to assess haplotype block structure in the 150 kb region of 3p21.
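The EM step can be illustrated with a short Python sketch. It is a generic textbook-style EM for biallelic SNPs under Hardy-Weinberg proportions, not the MLOCUS implementation, and it omits the three-step selection procedure mentioned above. The genotype coding (number of copies of allele 2 per SNP) and all function names are assumptions made for illustration.

```python
from collections import defaultdict
from itertools import product


def compatible_pairs(genotype):
    """List the unordered haplotype pairs consistent with one unphased genotype.

    genotype: tuple with one entry per SNP giving the number of copies of
    allele 2 (0, 1 or 2). Haplotypes are strings of '1' and '2'.
    """
    het = [i for i, g in enumerate(genotype) if g == 1]
    base = ['2' if g == 2 else '1' for g in genotype]
    pairs = set()
    for bits in product('12', repeat=len(het)):
        h1, h2 = base[:], base[:]
        for i, b in zip(het, bits):
            h1[i] = b
            h2[i] = '1' if b == '2' else '2'
        pairs.add(tuple(sorted((''.join(h1), ''.join(h2)))))
    return sorted(pairs)


def em_haplotype_frequencies(genotypes, n_iter=200, tol=1e-10):
    """Estimate haplotype frequencies from unphased genotypes by EM."""
    expansions = [compatible_pairs(g) for g in genotypes]
    haps = sorted({h for pairs in expansions for pair in pairs for h in pair})
    freq = {h: 1.0 / len(haps) for h in haps}  # uniform starting values
    for _ in range(n_iter):
        counts = defaultdict(float)
        for pairs in expansions:
            # E-step: weight each phase resolution by its probability
            # under Hardy-Weinberg proportions (2*p*q for unlike pairs).
            weights = [(2.0 if a != b else 1.0) * freq[a] * freq[b]
                       for a, b in pairs]
            total = sum(weights)
            for (a, b), w in zip(pairs, weights):
                counts[a] += w / total
                counts[b] += w / total
        # M-step: expected haplotype counts divided by 2n chromosomes.
        new_freq = {h: counts[h] / (2 * len(genotypes)) for h in haps}
        delta = max(abs(new_freq[h] - freq[h]) for h in haps)
        freq = new_freq
        if delta < tol:
            break
    return freq


# Toy data: two SNPs, four 1/1-1/1 individuals, four 2/2-2/2 individuals
# and two double heterozygotes whose phase is ambiguous.
toy = [(0, 0)] * 4 + [(2, 2)] * 4 + [(1, 1)] * 2
estimates = em_haplotype_frequencies(toy)
```

In the toy data the double heterozygotes are resolved almost entirely as 11/22, because those two haplotypes are already common among the unambiguous individuals. This is the behaviour that makes EM estimates reliable when linkage disequilibrium is high and less so when many haplotypes segregate.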

Validation of haplotype estimation

Haplotype frequencies were determined by direct counting of whole chromosomes in the grandparents after haplotypes had been established by pedigree analysis. Haplotypes were estimated using MLOCUS with unphased genotype data from these same individuals. Comparisons of the two methods were performed with genotype data from two regions: the chemokine cluster (six genes) on chromosome 17q11-12 and the chemokine receptor cluster (four genes) on chromosome 3p21. For the 17q11-12 data, two analyses were performed: one included all nine SNPs typed in all six genes arrayed over 2 Mb, and the other included only six of these SNPs in the 77 kb 'core' region of three genes (MPIF-1, PARC, MIP-1α) on 17q11-12. The analysis of the 3p21 chemokine receptor genes included 14 SNPs arrayed over 150 kb.

The IF and IH algorithm performance indices suggested by Excoffier and Slatkin (1995) were used to quantitatively evaluate the estimation results in the CEPH grandparents [37]. The IH index evaluates the performance of the algorithm to identify the actual haplotypes, and the IF statistic examines how close the estimated frequencies are to the pedigree haplotype frequencies. IH and IF values were calculated using only those haplotypes above the threshold frequency (1/2n). A mean squared error (MSE) statistic was also used to compare the estimated haplotype frequencies to the pedigree-derived frequencies [38]. To determine whether omitting those grandparents who could not be phased from the analysis generates skewed pedigree-derived haplotype frequencies, MLOCUS haplotype estimations of the total sample (n = 103) were compared to the 'phased-only' sample using the above-described performance indices.
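The three performance measures can be sketched in Python as follows. The formulas are the forms in which these indices are usually stated (IF as one minus half the summed absolute frequency differences, IH as twice the number of correctly identified haplotypes divided by the sum of true and estimated haplotype counts, and MSE as the mean squared frequency difference); how haplotypes below the threshold were handled in this study is not spelled out, so the function names and the treatment of absent haplotypes as frequency 0 are assumptions.

```python
def similarity_index(true_freq, est_freq):
    """I_F: 1 minus half the summed absolute frequency differences.

    Both arguments map haplotype -> frequency; a haplotype absent
    from one mapping is treated as having frequency 0 there.
    """
    haps = set(true_freq) | set(est_freq)
    return 1.0 - 0.5 * sum(
        abs(est_freq.get(h, 0.0) - true_freq.get(h, 0.0)) for h in haps
    )


def identity_index(k_true, k_est, k_missed):
    """I_H: 2 * (true haplotypes recovered) / (true + estimated counts).

    k_true   number of haplotypes in the reference (pedigree) set
    k_est    number of estimated haplotypes above the 1/(2n) threshold
    k_missed number of reference haplotypes not recovered
    """
    return 2.0 * (k_true - k_missed) / (k_true + k_est)


def mean_squared_error(true_freq, est_freq):
    """MSE of estimated versus reference frequencies over all haplotypes."""
    haps = set(true_freq) | set(est_freq)
    return sum(
        (est_freq.get(h, 0.0) - true_freq.get(h, 0.0)) ** 2 for h in haps
    ) / len(haps)


# Toy example with two haplotypes.
pedigree = {'111': 0.6, '222': 0.4}
estimated = {'111': 0.5, '222': 0.5}
i_f = similarity_index(pedigree, estimated)
mse = mean_squared_error(pedigree, estimated)
```

As a consistency check on the assumed IH formula, taking 35 pedigree haplotypes (a count inferred from the 14 common, 11 doubleton and 10 singleton nine-SNP haplotypes reported for 17q11-12), 24 estimated haplotypes and 13 missed haplotypes gives 44/59, or about 0.7458, which is close to the value reported for that dataset.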

Estimating linkage disequilibrium in population data

D' statistics were calculated with phased haplotypes derived from pedigree analysis with DnaSP (v3.53) [39]. Linkage disequilibrium estimates generated by haplotypes determined by pedigree analysis in the CEPH grandparents were compared with those estimates calculated from MLOCUS reconstructed haplotypes in the same datasets. PAIRWISE was used to estimate linkage disequilibrium from the estimated haplotypes generated by MLOCUS [30]. PAIRWISE generates Lewontin's normalised D' statistic [40] and the p-value determined from an exact test of association between all pairs of polymorphic loci in the dataset.
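For readers unfamiliar with the statistic, a minimal Python sketch of signed D' for two biallelic SNPs in a list of phased haplotypes is given below. It is not the DnaSP or PAIRWISE code; the function name, the '1'/'2' allele coding and the choice to return 0.0 when either site is monomorphic are assumptions made for illustration.

```python
def d_prime(haplotypes, i, j):
    """Signed Lewontin's D' between SNP positions i and j.

    haplotypes: list of equal-length strings of '1' and '2', one per
    phased chromosome. D is computed for the '1'-'1' haplotype, so a
    negative value means allele 1 at the first SNP travels with
    allele 2 at the second.
    """
    n = len(haplotypes)
    p_a = sum(h[i] == '1' for h in haplotypes) / n
    p_b = sum(h[j] == '1' for h in haplotypes) / n
    p_ab = sum(h[i] == '1' and h[j] == '1' for h in haplotypes) / n
    d = p_ab - p_a * p_b
    if d >= 0:
        d_max = min(p_a * (1 - p_b), (1 - p_a) * p_b)
    else:
        d_max = min(p_a * p_b, (1 - p_a) * (1 - p_b))
    # Monomorphic sites carry no information about association.
    return d / d_max if d_max > 0 else 0.0


# Toy chromosomes: complete coupling, complete repulsion, no association.
coupled = d_prime(['11', '11', '22', '22'], 0, 1)
repulsed = d_prime(['12', '12', '21', '21'], 0, 1)
independent = d_prime(['11', '12', '21', '22'], 0, 1)
```

The same function applies to EM-estimated haplotypes if each haplotype is weighted by its estimated frequency rather than counted, which is the comparison made between the two approaches in this study.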


Haplotype analysis of 3p21 SNPs

To determine the haplotype structure of SNPs in the 3p21 region, we typed 14 polymorphisms in the CEPH pedigrees. Eleven of the 14 SNPs were polymorphic and none of these SNPs deviated from Hardy-Weinberg equilibrium (HWE) at the p = 0.05 significance level. Haplotype phase was established for every grandparent in the sample (n = 103). Haplotype frequencies were then determined by direct counting of whole haplotypes (Table 2). Nine haplotypes explained nearly all of the variation (98 per cent) in the CEPH grandparents. The remaining 2 per cent is composed of two haplotypes that occur only once.

Table 2 Results from a comparison of pedigree-derived and estimated haplotype frequencies (n = 103).

The diplotypes, or multi-locus genotypes, were also counted in the CEPH grandparent sample. The diplotype combination of haplotypes 1 and 3 was the most frequent in the sample, at 13 per cent. In the CEPH grandparent sample, the 3p21 haplotypes were in HWE, as the randomisation test of the distribution of diplotypes yielded a non-significant p-value of 0.2708. When analysed individually, the 11 polymorphisms demonstrated no deviations from HWE in the CEPH grandparent sample.

Both haplotype block tests, the FGT and the minimum-D' method (set to the default of a minimum D' = 0.80), found a break between CCR2-N260N and CCR5-208. This indicates a past recombination event somewhere in the 20 kb between CCR2 and CCR5. The pedigree haplotypes support this: although no recombination event was directly observed in the pedigree data, one haplotype (11112121211121) appeared to be a recombinant of haplotypes 4 (211111121211121) and 7 (11112121111111).
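The logic of the four-gamete test can be shown in a few lines of Python. This is a sketch of the test itself, not of HaploBlock-Finder, and it only scans adjacent SNP pairs, whereas block-finding software typically compares all pairs within a candidate block. The function names and toy haplotypes are hypothetical.

```python
def four_gamete_test(haplotypes, i, j):
    """Return True if all four two-locus gametes occur at positions i and j.

    Under an infinite-sites model, seeing 11, 12, 21 and 22 together
    implies at least one past recombination (or recurrent mutation)
    between the two SNPs.
    """
    gametes = {(h[i], h[j]) for h in haplotypes}
    return len(gametes) == 4


def adjacent_breaks(haplotypes):
    """List adjacent SNP index pairs that fail the four-gamete test."""
    n_sites = len(haplotypes[0])
    return [(i, i + 1) for i in range(n_sites - 1)
            if four_gamete_test(haplotypes, i, i + 1)]


# Toy phased haplotypes over three SNPs: only two gametes occur at
# SNPs 0-1, but all four occur at SNPs 1-2.
toy_haplotypes = ['111', '112', '121', '122']
breaks = adjacent_breaks(toy_haplotypes)
```

A block boundary would be placed wherever such a pair is reported, which corresponds to the single break inferred between CCR2 and CCR5 in the 3p21 data.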

Haplotype analysis of 17q11-12 SNPs

To characterise the chemokine loci on chromosome 17, haplotype analysis was performed using all nine SNPs (over a 2 Mb region), as well as a subset of six SNPs arrayed over the 73 kb region, which includes MPIF-1, PARC and MIP-1α. Conclusive phase was established for only 87 individuals of the 103 in the CEPH grandparent sample for nine-SNP haplotypes. A total of 70 per cent of the variation of the total sample (n = 103) was explained by 14 haplotypes (of nine SNPs) (Table 3). The remaining portion included 11 doubleton haplotypes (found in two individuals), ten singletons (occurred only once), as well as the 32 unphased chromosomes. When the analysis was reduced to six SNPs in the 73 kb region (Table 4), we were able to phase 96 grandparents by visual inspection of the pedigrees. Haplotype phase was not definitely assigned to seven of the 103 grandparents because two or more haplotype combinations could be inferred, given the diplotypes of their children, or because of missing data. Eight six-SNP haplotypes explain 90 per cent of the variation in the CEPH grandparent sample (n = 103), and 41 per cent of the total number of chromosomes carry the most common haplotype (111111) (Table 4). The remaining 10 per cent of the total number of chromosomes (2n = 206) comprises two doubletons, two singletons and the 14 unphased chromosomes.

Table 3 Comparison of pedigree-phased haplotypes for nine SNPs over 2 Mb of 17q11-12 in Centre d'Etude Polymorphisme Humain (CEPH) grandparents (n = 87) with MLOCUS estimates from unphased genotype data from these same individuals
Table 4 Comparison of MLOCUS estimated to pedigree-phased haplotypes (n = 96) for six SNPs in 79 kb 'core' region of 17q11-12

Diplotypes were assigned to all individuals for which phase was established (n = 96) for the six SNPs in MPIF-1, PARC and MIP-1α. The most frequent diplotype combination included haplotypes 1 and 2, at 28 per cent in the CEPH grandparents. There was no significant deviation from Hardy-Weinberg proportions for the six-SNP multi-site genotypes, with a randomisation p-value of 0.1102. When analysing the SNPs individually for HWE, one SNP, PARC (-116), showed a significant deviation using a χ2 test, at p = 0.012, which did not survive a Bonferroni multiple-test correction.
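The single-SNP test and the multiple-test correction can be sketched in Python as below. This is a plain one-degree-of-freedom χ2 goodness-of-fit test and a simple Bonferroni adjustment; the genotype counts are invented for illustration, the function names are hypothetical, and the number of tests used for the correction in this study is not stated, so the value of six used here is an assumption.

```python
import math


def hwe_chi_square(n11, n12, n22):
    """Chi-square test (1 df) of Hardy-Weinberg proportions for one SNP.

    n11, n12, n22: observed counts of the three genotypes.
    Returns (chi2, p_value).
    """
    n = n11 + n12 + n22
    p = (2 * n11 + n12) / (2 * n)  # frequency of allele 1
    q = 1 - p
    expected = (n * p * p, 2 * n * p * q, n * q * q)
    observed = (n11, n12, n22)
    if min(expected) == 0:
        return 0.0, 1.0  # monomorphic site: nothing to test
    chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
    # Survival function of the chi-square distribution with 1 df.
    p_value = math.erfc(math.sqrt(chi2 / 2))
    return chi2, p_value


def bonferroni(p_value, n_tests):
    """Bonferroni-adjusted p-value, capped at 1."""
    return min(1.0, p_value * n_tests)


# Invented counts showing a heterozygote deficit.
chi2, p_value = hwe_chi_square(30, 40, 30)

# Reported p = 0.012, adjusted for an assumed six single-SNP tests.
adjusted = bonferroni(0.012, 6)
```

Under that assumption the adjusted value is 0.072, above the 0.05 cut-off, which is consistent with the statement that the deviation did not survive correction.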

Validation of the EM algorithm on 3p21 and 17q11-12

To validate the accuracy of the EM algorithm, we compared the pedigree-derived haplotypes to those estimated haplotypes generated by MLOCUS. The 3p21 haplotype distributions were nearly identical to the estimated frequencies (Table 2). The similarity (IF) and identity (IH) indices were calculated for haplotypes in the CEPH grandparent sample (n = 103) for 14 SNPs. For the 14-SNP haplotypes in 3p21, as indicated in Table 2, the similarity index (IF) was 0.9869. An IF of 1.0 would indicate perfect concordance between the haplotype frequencies generated by the two methods. The identity index (IH) for these data was exactly 1.0, as all haplotypes derived by pedigree analysis were present in the MLOCUS results. One estimated haplotype was dropped from the analysis, as it was below the frequency threshold (1/2n = 0.004854), as suggested by Excoffier and Slatkin (1995) [37]. The MSE incorporates the overall difference in frequencies between actual (pedigree-derived) and estimated frequencies for all H haplotypes. The MSE for the 3p21 haplotypes was small (0.00001), which, again, indicates that the two frequency distributions are nearly identical.

As mentioned previously, phase could not be determined for the nine SNPs typed on chromosome 17q11-12 for all grandparents. Haplotype frequencies were determined, both by whole chromosome counting and by estimation, with data from 87 out of 103 individuals. The similarity index (IF) for the distribution of frequencies for the 43 haplotypes (nine SNPs) in this region is 0.8196, as indicated in Table 3. The haplotype estimation yielded 24 haplotypes with frequencies over the threshold value (1/(2n) = 0.0057), and missed 13 haplotypes that were present in the pedigree data. The IH statistic for these data is 0.7457. The MSE for the nine-SNP haplotypes is 0.0002, as indicated in Table 3. The EM algorithm also generated seven low-frequency haplotypes (less than 1 per cent, not shown) that were not observed in the pedigree analysis. Constraining the MLOCUS analysis by removing these haplotypes did not significantly improve the MLE. This constrained analysis also resulted in the generation of other spurious low-frequency haplotypes, indicating that the EM algorithm could not effectively resolve haplotype phase for some individuals in the nine-SNP dataset.

Not surprisingly, paring the analysis down to the six SNPs in the 77 kb region that contains MPIF-1, PARC and MIP-1α yields more accurate haplotype estimates. Ninety-six grandparents were included in this analysis, as phase could not be determined for seven of the 103 individuals in the total sample. As indicated in Table 4, the IF statistic increased to 0.9491, and the IH of 0.9167 is closer to perfect identity (1.0). The MSE is also closer to zero, at 0.0001.

Comparisons of MLOCUS haplotype estimates for 17q11-12

Omitting the unphased chromosomes from the pedigree haplotype frequency calculation of the 17q11-12 SNPs is a potential source of bias, as those individuals for whom complete resolution is not possible may have a higher per-site heterozygosity than randomly sampled individuals. Additionally, those 'unphasable' individuals may carry haplotypes that are not present in the phased portion of the sample. To test whether using only the phased individuals generates skewed 'pedigree-derived' 17q11-12 haplotype frequencies, MLOCUS haplotype frequency estimates were generated from both the total dataset of unphased genotypes (n = 103; data not shown) and the genotypes of the phased individuals only: n = 87 for the nine-SNP haplotypes (Table 3) and n = 96 for the six-SNP haplotypes (Table 4). Comparisons of nine-SNP MLOCUS haplotypes (above 1 per cent frequency) from the whole sample (n = 103) and the phased sample (n = 87) yielded an IH of 0.9729, an IF of 0.9313 and an MSE of 0.00007. The same comparison performed on the six-SNP haplotypes yielded an IH of 1, an IF of 0.9838 and an MSE of 0.00002. One nine-SNP haplotype present in the total sample (at a frequency of 0.015) was missed in the 'phased-only' sample, while in the six-SNP analysis, both sets of genotypes generated identical haplotypes. The potential bias of removing the unphased grandparents from the haplotype analyses appears to be slight, as the index values indicate that the haplotype frequencies generated by the two datasets (the complete sample and the 'phased-only' sample) are very similar, particularly for the six-SNP haplotypes.

Comparisons of methods to estimate linkage disequilibrium

Both phased haplotypes and unphased genotype data from the CEPH grandparents (n = 103) were used to estimate the extent of pairwise linkage disequilibrium (described by D') between SNPs in the chemokine receptor region on chromosome 3p21 and the chemokine cluster on chromosome 17q11-12. The D' statistic (above the diagonal) and the measure of statistical significance (p-value) (below the diagonal) are presented for pairwise comparisons of the 11 polymorphic sites in 3p21 in Table 5. Negative values indicate that there is disequilibrium between opposite alleles at the two SNPs (ie allele 1 at the first SNP and allele 2 at the second SNP, where the common allele is allele 1).

Table 5 Estimated D' values generated by two methods for all polymorphic loci in the 3p21 chemokine receptor gene region in the CEPH sample

The D'-values generated from analyses of the 3p21 polymorphisms by the DnaSP and PAIRWISE programs were, for the most part, very similar. The three differences, noted in bold, are slight. As discussed previously, the haplotypes generated by the EM algorithm were essentially identical when compared with those discerned by pedigree analysis for the variants in this region. The analysis of both the haplotypes and the unphased genotype data indicated that linkage disequilibrium in this 150 kb region of 3p21 is high in the CEPH grandparents. There is intact linkage disequilibrium (D' = 1) between two SNPs at the extremes of the region (CCR3-Y17Y and CCRL2-I243V), preserved primarily on haplotype 4 (211111121211121). The relative loss of linkage disequilibrium in the centre of the region, between CCR2-N260N and two SNPs in the CCR5 promoter, CCR5-208 (D' = 0.326) and CCR5-676 (D' = 0.326), was detected by haplotype block analysis, indicating past recombination between these two genes.

It is not surprising that the DnaSP analysis of haplotypes on 17q11-12 indicated no evidence of long-range linkage disequilibrium between variants at the extremes of the 2 Mb region. There is significant linkage disequilibrium between the SNPs typed in MCP-1 and nearby Eotaxin (D' = -1) at the centromeric end of the region. Likewise, there is some significant allelic association between SNPs in MIP-1α, PARC and MPIF-1, which are within 77 kb of each other. The relative lack of association between more distal SNPs seems to have hampered the ability of the PAIRWISE analysis of unphased genotype data to accurately detect the extent of linkage disequilibrium, when compared with the DnaSP analysis of whole haplotypes. This lack of sensitivity is especially evident in the analyses of all nine SNPs, as the multitude of haplotypes (including spurious haplotypes generated by the EM estimation) created false-positive associations between distal variants (such as between MCP-1 and SNPs in PARC) (Tables 6 and 7).

Table 6 Estimated D' values generated by two methods for all nine SNPs in the 2 Mb chemokine gene region on chromosome 17q11-12 in CEPH grandparents (n = 87)
Table 7 Estimated D' values generated by two methods for six SNPs in the 79 kb 'core' region of three chemokine genes on chromosome 17q11-12 in CEPH


Given the potential accuracy of low-cost statistical methods, and the current high cost of molecular haplotyping and pedigree analysis, statistical estimation to determine haplotypes may be a cost-effective strategy for many gene regions. As a minimum, statistical estimation can be used to determine the overall need for molecular haplotyping and to specify where in the dataset molecular haplotyping would provide the most benefit [41-43]. Independent assessments of the effectiveness of the EM algorithm have been discussed at length [38, 44, 45]. Xu et al. (2002) discuss a comparison of three computational algorithms for estimating haplotype frequencies: the Clark (1990) rule-based method [47], the EM algorithm and the Stephens et al. (2001) Bayesian PHASE method [48]. Using previously described criteria [37], Xu et al. found that all three methods performed better for regions with a high degree of linkage disequilibrium, such as in the NAT2 gene, than for regions where linkage disequilibrium is not maintained (chromosome 8p22) when compared with haplotypes determined by molecular methods [46].

The purpose of the evaluation presented here is to establish the accuracy of statistical estimation in these chemokine and chemokine receptor gene clusters. Estimated haplotypes from unphased genotypes were compared with haplotypes derived empirically from pedigree analysis in the CEPH grandparent sample (n = 103). How the EM algorithm responds to irregular linkage disequilibrium, sample size, different levels of polymorphism and deviations from HWE is critical for the effectiveness of haplotype estimation [38]. These conditions will be affected by the genomic environment of the region of interest, the history of the population from which the samples were selected and the quality of the genotype data. While these validation results cannot control for all these variables, an attempt was made to explore how the EM algorithm responds to the conditions of the gene clusters studied on chromosomes 3p21 and 17q11-12 in a European-derived sample set.

A greater degree of linkage disequilibrium between SNPs, and therefore fewer haplotypes, increases the accuracy of the EM algorithm and aids subsequent estimates of measures of linkage disequilibrium (such as D'). This is evident from the results of estimations of haplotype frequency and linkage disequilibrium in the 150 kb region on 3p21. Relatively few haplotypes explain the variation between these SNPs, at least in the CEPH grandparent sample. Indeed, there is intact linkage disequilibrium at the extremes of this region, as CCR3-Y17Y and CCRL2-I243V have a pairwise D'-value of 1. The haplotype block analysis also indicates a fairly simple structure, as both tests applied here found only two blocks, with what appeared to be a past recombination event between CCR2 and CCR5.

The degree of linkage disequilibrium between SNPs is one of the most important factors in the ability of the EM algorithm to properly detect haplotypes in population samples [38, 44]. The analysis presented here shows that the EM algorithm accurately describes the haplotype structure and patterns of pairwise linkage disequilibrium on chromosome 3p21 (a region of higher linkage disequilibrium). As for chromosome 17, it is important to note that, because of the relatively few SNPs assessed (a total of nine), this analysis is a low-resolution evaluation of haplotypes and linkage disequilibrium across a large region (2 Mb). While including only the 'core' region of 17q11-12 yields more accurate estimates of haplotype frequencies and linkage disequilibrium, these analyses still include a relatively sparse sampling of SNPs (six in 77 kb). The results of the pedigree analysis indicate that, while haplotype estimations in the chemokine receptor cluster on 3p21 may be fairly straightforward, special care must be taken for any haplotype inference in the chemokine genes on chromosome 17. More SNP genotype data, especially in the chromosome 17 chemokine genes, will no doubt aid in further characterisation of variation and linkage disequilibrium in these gene regions, as well as improve the accuracy of future haplotype analyses.


  1. Bazan JF, Bacon KB, Hardiman G, et al: 'A new class of membrane-bound chemokine with a CX3C motif'. Nature. 1997, 385: 640-644. 10.1038/385640a0.


  2. Pan Y, Lloyd C, Zhou W, et al: 'Neurotactin, a membrane-anchored chemokine upregulated in brain inflammation [published erratum appears in Nature (1997), Vol. 389, p.100]'. Nature. 1997, 387: 611-617. 10.1038/42491.


  3. Yoshida T, Imai T, Kakizaki H, et al: 'Molecular cloning of a novel C or gamma type chemokine, SCM-1'. FEBS Lett. 1995, 360: 155-159. 10.1016/0014-5793(95)00093-O.


  4. Yoshida T, Imai T, Kakizaki H, et al: 'Identification of single C motif-1/lymphotactin receptor XCR1'. J Biol Chem. 1998, 273: 16551-16554. 10.1074/jbc.273.26.16551.


  5. Cocchi F, DeVico AL, Garzino-Demo A, et al: 'Identification of RANTES, MIP-1 alpha, and MIP-1 beta as the major HIV-suppressive factors produced by CD8 + T cells'. Science. 1995, 270: 1811-1815. 10.1126/science.270.5243.1811.


  6. Dean M, Carrington M, Winkler C, et al: 'Genetic restriction of HIV-1 infection and progression to AIDS by a deletion allele of the CKR5 structural gene. Hemophilia Growth and Development Study, Multicenter AIDS Cohort Study, Multicenter Hemophilia Cohort Study, San Francisco City Cohort, ALIVE Study'. Science. 1996, 273: 1856-1862. 10.1126/science.273.5283.1856.

  7. Smith MW, Carrington M, Winkler C, et al: 'CCR2 chemokine receptor and AIDS progression'. Nat Med. 1997, 3: 1052-1053.

  8. Kostrikis LS, Huang Y, Moore JP, et al: 'A chemokine receptor CCR2 allele delays HIV-1 disease progression and is associated with a CCR5 promoter mutation'. Nat Med. 1998, 4: 350-353. 10.1038/nm0398-350.

  9. Martin MP, Dean M, Smith MW, et al: 'Genetic acceleration of AIDS progression by a promoter variant of CCR5'. Science. 1998, 282: 1907-1911.

  10. Winkler C, Modi W, Smith MW, et al: 'Genetic restriction of AIDS pathogenesis by an SDF-1 chemokine gene variant. ALIVE Study, Hemophilia Growth and Development Study (HGDS), Multicenter AIDS Cohort Study (MACS), Multicenter Hemophilia Cohort Study (MHCS), San Francisco City Cohort (SFCC)'. Science. 1998, 279: 389-393. 10.1126/science.279.5349.389.

  11. An P, Martin MP, Nelson GW, et al: 'Influence of CCR5 promoter haplotypes on AIDS progression in African-Americans'. AIDS. 2000, 14: 2117-2122. 10.1097/00002030-200009290-00007.

  12. Strieter RM, Polverini PJ, Arenberg DA, et al: 'Role of C-X-C chemokines as regulators of angiogenesis in lung cancer'. J Leukoc Biol. 1995, 57: 752-762.

  13. Arenberg DA, Kunkel SL, Polverini PJ, et al: 'Interferon-gamma-inducible protein 10 (IP-10) is an angiostatic factor that inhibits human non-small cell lung cancer (NSCLC) tumorigenesis and spontaneous metastases'. J Exp Med. 1996, 184: 981-999. 10.1084/jem.184.3.981.

  14. Moore BB, Arenberg DA, Addison CL, et al: 'Tumor angiogenesis is regulated by CXC chemokines'. J Lab Clin Med. 1998, 132: 97-103. 10.1016/S0022-2143(98)90004-X.

  15. Wang JM, Chertov O, Proost P, et al: 'Purification and identification of chemokines potentially involved in kidney-specific metastases by a murine lymphoma variant: Induction of migration and NFKB activation'. Int J Cancer. 1998, 75: 900-907. 10.1002/(SICI)1097-0215(19980316)75:6<900::AID-IJC13>3.0.CO;2-6.

  16. Muller A, Homey B, Soto H, et al: 'Involvement of chemokine receptors in breast cancer metastasis'. Nature. 2001, 410: 50-56. 10.1038/35065016.

  17. Gordon D, Abajian C, Green P, et al: 'Consed: A graphical tool for sequence finishing'. Genome Res. 1998, 8: 195-202.

  18. Kwok PY, Carlson C, Yager JD, et al: 'Comparative analysis of human DNA variations by fluorescence-based sequencing of PCR products'. Genomics. 1994, 23: 138-144. 10.1006/geno.1994.1469.

  19. Ewing B, Green P: 'Base-calling of automated sequencer traces using phred. II. Error probabilities'. Genome Res. 1998, 8: 186-194.

  20. Ewing B, Hillier L, Wendl MC, et al: 'Base-calling of automated sequencer traces using phred. I. Accuracy assessment'. Genome Res. 1998, 8: 175-185.

  21. Nickerson DA, Tobe VO, Taylor SL, et al: 'PolyPhred: Automating the detection and genotyping of single nucleotide substitutions using fluorescence-based resequencing'. Nucleic Acids Res. 1997, 25: 2745-2751. 10.1093/nar/25.14.2745.

  22. NIH (2001-2003), dbSNP: National Center for Biotechnology Information, National Institutes of Health, []

  23. Whitehead Institute (2001-2003), Primer 3.0: Whitehead Institute, []

  24. Genome Sequencing Center (2001), Exo-SAP Protocol: Genome Sequencing Center, Washington University, []

  25. Clark VJ, Metheny N, Dean M, Peterson R: 'Statistical estimation and pedigree analysis of CCR2-CCR5 haplotypes'. Hum Genet. 2001, 108: 484-493. 10.1007/s004390100512.

  26. Morin PA, Saiz R, Monjazeb A: 'High-throughput single nucleotide polymorphism genotyping by fluorescent 5' exonuclease assay'. Biotechniques. 1999, 27: 538-540, 542, 544.

  27. O'Connell JR, Weeks DE: 'PEDCHECK: A program for identification of genotype incompatibilities in linkage analysis'. Am J Hum Genet. 1998, 63: 259-266. 10.1086/301904.

  28. Guo SW, Thompson EA: 'Performing the exact test of Hardy-Weinberg proportion for multiple alleles'. Biometrics. 1992, 48: 361-372. 10.2307/2532296.

  29. Long JC, Williams RC, Urbanek M: 'An E-M algorithm and testing strategy for multiple-locus haplotypes'. Am J Hum Genet. 1995, 56: 799-810.

  30. Long JC: 'Multiple locus haplotype analysis (MLOCUS, OBSHAP, PAIRWISE), software and documentation distributed by the author'. 1999, Bethesda, MD, Section on Population Genetics and Linkage, Laboratory of Neurogenetics, NIAAA, National Institutes of Health

  31. Dempster AP, Laird NM, Rubin DB: 'Maximum likelihood from incomplete data via the EM algorithm'. J R Stat Soc B. 1977, 39: 1-38.

  32. Peterson RJ, Goldman D, et al: 'Effects of worldwide population subdivision on ALDH2 linkage disequilibrium'. Genome Res. 1999, 9: 844-852. 10.1101/gr.9.9.844.

  33. Zhang K, Jin L: 'HaploBlockFinder: Haplotype block analyses'. Bioinformatics. 2003, 19: 1300-1301. 10.1093/bioinformatics/btg142.

  34. Wang N, Akey JM, Zhang K, et al: 'Distribution of recombination crossovers and the origin of haplotype blocks: The interplay of population history, recombination, and mutation'. Am J Hum Genet. 2002, 71: 1227-1234. 10.1086/344398.

  35. Daly MJ, Rioux JD, Schaffner SF, et al: 'High-resolution haplotype structure in the human genome'. Nat Genet. 2001, 29: 229-232. 10.1038/ng1001-229.

  36. Gabriel SB, Schaffner SF, Nguyen H, et al: 'The structure of haplotype blocks in the human genome'. Science. 2002, 296: 2225-2229. 10.1126/science.1069424.

  37. Excoffier L, Slatkin M: 'Maximum-likelihood estimation of molecular haplotype frequencies in a diploid population'. Mol Biol Evol. 1995, 12: 921-927.

  38. Fallin D, Schork NJ: 'Accuracy of haplotype frequency estimation for biallelic loci, via the expectation-maximization algorithm for unphased diploid genotype data'. Am J Hum Genet. 2000, 67: 947-959. 10.1086/303069.

  39. Rozas J, Rozas R: 'DnaSP version 3: An integrated program for molecular population genetics and molecular evolution analysis'. Bioinformatics. 1999, 15: 174-175. 10.1093/bioinformatics/15.2.174.

  40. Lewontin RC: 'The interaction of selection and linkage. I. General considerations: Heterotic models'. Genetics. 1964, 49: 49-67.

  41. Michalatos-Beloin S, Tishkoff SA, Bentley KL, et al: 'Molecular haplotyping of genetic markers 10 kb apart by allele-specific long-range PCR'. Nucleic Acids Res. 1996, 24: 4841-4843. 10.1093/nar/24.23.4841.

  42. Tishkoff SA, Dietzsch E, Speed W, et al: 'Global patterns of linkage disequilibrium at the CD4 locus and modern human origins'. Science. 1996, 271: 1380-1387. 10.1126/science.271.5254.1380.

  43. Clark AG, Weiss KM, Nickerson DA, et al: 'Haplotype structure and population genetic inferences from nucleotide sequence variation in human lipoprotein lipase'. Am J Hum Genet. 1998, 63: 595-612. 10.1086/301977.

  44. Tishkoff SA, Pakstis AJ, Ruano G, Kidd KK: 'The accuracy of statistical methods for estimation of haplotype frequencies: An example from the CD4 locus'. Am J Hum Genet. 2000, 67: 518-522. 10.1086/303000.

  45. McKeigue PM: 'Efficiency of estimation of haplotype frequencies: Use of marker phenotypes of unrelated individuals versus counting of phase-known gametes'. Am J Hum Genet. 2000, 67: 1626-1627. 10.1086/316912.

  46. Xu W, Tse HF, Chan FH, et al: 'New Bayesian discriminator for detection of atrial tachyarrhythmias'. Circulation. 2002, 105: 1472-1479. 10.1161/01.CIR.0000012349.14270.54.

  47. Clark AG: 'Inference of haplotypes from PCR-amplified samples of diploid populations'. Mol Biol Evol. 1990, 7: 111-122.

  48. Stephens M, Smith NJ, Donnelly P, et al: 'A new statistical method for haplotype reconstruction from population data'. Am J Hum Genet. 2001, 68: 978-989. 10.1086/319501.


The authors would like to thank Bert Gold, Jim Lautenberger, Raymond Peterson and George Nelson for helpful discussion of haplotype analysis. Bill Modi, Noah Metheny and Julie Bergeron provided technical assistance with some TaqMan assays, and the LGD Cell Repository aided with DNA extraction. Jeff Long provided software and Carrie Pfaff performed necessary program modifications for this analysis. We also wish to thank the two anonymous reviewers for helpful comments.

Author information

Corresponding author

Correspondence to Vanessa J Clark.

About this article

Cite this article

Clark, V.J., Dean, M. Characterisation of SNP haplotype structure in chemokine and chemokine receptor genes using CEPH pedigrees and statistical estimation. Hum Genomics 1, 195 (2004).
