Characterisation of SNP haplotype structure in chemokine and chemokine receptor genes using CEPH pedigrees and statistical estimation

Chemokine signals and their cell-surface receptors are important modulators of HIV-1 disease and cancer. To aid future case/control association studies, the authors aim to further characterise the haplotype structure of variation in chemokine and chemokine receptor genes. To perform haplotype analysis in a population-based association study, haplotypes must be determined by estimation, in the absence of family information or laboratory methods to establish phase. Here, the authors test the accuracy of estimates of haplotype frequency and linkage disequilibrium by comparing estimated haplotypes generated with the expectation maximisation (EM) algorithm to haplotypes determined from Centre d'Etude Polymorphisme Humain (CEPH) pedigree data. To do this, they have characterised haplotypes comprising alleles at 11 biallelic loci in four chemokine receptor genes (CCR3, CCR2, CCR5 and CCRL2), which span 150 kb on chromosome 3p21, and haplotypes of nine biallelic loci in six chemokine genes [MCP-1 (CCL2), Eotaxin (CCL11), RANTES (CCL5), MPIF-1 (CCL23), PARC (CCL18) and MIP-1α (CCL3)] on chromosome 17q11-12. Forty multi-generation CEPH families, totalling 489 individuals, were genotyped by the TaqMan 5'-nuclease assay. Phased haplotypes and haplotypes estimated from unphased genotypes were compared in 103 grandparents who were assumed to have mated at random. For the 3p21 single nucleotide polymorphism (SNP) data, haplotypes determined by pedigree analysis and haplotypes generated by the EM algorithm were nearly identical. Linkage disequilibrium, measured by the D' statistic, was nearly maximal across the 150 kb region, with complete disequilibrium maintained at the extremes between CCR3-Y17Y and CCRL2-I243V. D'-values calculated from estimated haplotypes on 3p21 had high concordance with pairwise comparisons between pedigree-phased chromosomes.
Conversely, there was less agreement between analyses of haplotype frequencies and linkage disequilibrium using estimated haplotypes when compared with pedigree-phased haplotypes of SNPs on chromosome 17q11-12. These results suggest that, while estimations of haplotype frequency and linkage disequilibrium may be relatively simple in the 3p21 chemokine receptor cluster in population samples, the more complex environment on chromosome 17q11-12 will require a higher resolution haplotype analysis.


Introduction
Chemokines are small intercellular signalling molecules that recruit immune cells to sites of inflammation and infection. The two major subfamilies of chemokine proteins are defined as CC, with two adjacent cysteine residues, or as CXC, with an intervening non-conserved amino acid. Other chemokine family members have cysteine residues separated by more than one intervening amino acid (eg CX3C or Fractalkine), 1,2 or are characterised by having only one cysteine (eg XCL1 or Lymphotactin). 3,4 Chemokine receptors are defined by the subfamily of chemokine ligand that they bind. Both the chemokine and the chemokine receptor genes are generally clustered in four distinct chromosomal regions: CC on 17q11-21, CXC on 4q12-21, both CCR and CXCR on 3p21-24 and CXCR on 2q21-35. Variation in chemokines, or their cell-surface receptors, influences an individual's susceptibility to HIV-1 infection and modulates progression to AIDS. 5-11 Chemokine signals are also important in the angiogenic 12-14 and metastatic 15,16 processes of cancer. Therefore, describing the genetic variation and haplotype structure of chemokine and chemokine receptor gene clusters is necessary for further disease association analyses of these candidate genes.
PRIMARY RESEARCH. © HENRY STEWART PUBLICATIONS 1473-9542. HUMAN GENOMICS. VOL 1. NO 3. 195-207 MARCH 2004
The focus of the present analysis is to describe the structure of multi-single nucleotide polymorphism (SNP) haplotypes in chemokine genes on chromosome 17q11-12 and chemokine receptor genes on chromosome 3p21 in Centre d'Etude Polymorphisme Humain (CEPH) pedigrees (n = 489). Secondary to this goal is to use the empirically phased haplotypes to determine the accuracy of estimated measures of haplotype frequencies and linkage disequilibrium using the subset of CEPH grandparents (n = 103).

Study samples
SNP screening and validation were performed using two population panels: a 16-individual panel (four European-Americans, four African-Americans, four Chinese and four self-identified Hispanic-Americans) and an 88-individual panel (30 African-Americans, 34 European-Americans and 24 Hispanics). Forty multi-generation CEPH families, a total of 489 individuals, were genotyped for 23 SNPs scattered over two gene clusters: CC-chemokines on 17q11-12 and CC-chemokine receptors on 3p21 (see Table 1). Genotype data from a subsample of 103 unrelated grandparents were used for comparative haplotype analyses. The use of all anonymous DNA samples was either reviewed by the NIH Internal Review Board or determined 'exempt' from review.

Chemokine and chemokine receptor SNPs
Conditions for SNP detection in the CCR2 promoter. Four of the 23 SNPs included in the haplotype analysis (Table 1) have not previously been reported and were discovered by direct sequencing. Three kilobases of the CCR2 promoter region were amplified using the Invitrogen Platinum Taq kit in a panel of 16 individuals (32 chromosomes), including four European-Americans, four African-Americans, four Chinese and four Hispanic-Americans (self-identified). For 100 µL polymerase chain reactions (PCRs), 200 nM deoxyribonucleotide triphosphates (dNTPs), 200 nM of each primer, 400 nM MgSO4, 10 µL of 10× Platinum Taq buffer and 1 µL Platinum Taq were mixed with approximately 100 ng of genomic DNA. Primer sequences for the 3 kb product were as follows: 5'-TCATCTGCTTCTTAATTGCCTTCAG-3' (forward) and 5'-CAGGGTTTCTCTAACATCTCCTGGT-3' (reverse). PCR was performed in a PE Biosystems 9700 ThermoCycler with long-range PCR conditions recommended for Platinum Taq.
Sequencing was performed on a 3 kb segment at intervals of 400-500 bp with internal primers using the BigDye (Applied Biosystems) cycle sequencing kit with some modifications. Sequencing reactions were performed as follows: 15-30 ng of purified product was added to 10 µL reaction solution, which included 2 µL of BigDye mix, 1 µL of standard 5× dilution buffer, 1.1 µL of 0.5 mM primer stock and double-distilled water (ddH2O) for the remaining volume. Reactions were cycled in a PE Biosystems 9700 ThermoCycler under the following conditions: 95°C for five minutes, and 30 cycles of 95°C for 30 seconds, 50°C for ten seconds and 60°C for four minutes. All individuals were sequenced for the entire 3 kb in both forward and reverse directions on an ABI 3700 capillary sequencer. Sequence trace files were analysed by the Phred/Phrap/Consed system, 17-20 and PolyPhred was used to detect putative SNPs. 21 Eight SNPs (-5983 G/A, -5047 G/T, -4866 G/C, -4599 T/G, -4419 C/T, -4338 A/T, -3433 T/C and -3232 C/T) were confirmed by visual inspection of the CCR2 promoter sequence of the 16-individual screening panel. Five of these SNPs (-5983 G/A, -5047 G/T, -4866 G/C, -4599 T/G and -3433 T/C) were validated by direct sequencing in a larger sample set that comprised 88 individuals from three populations: 30 African-Americans, 34 European-Americans and 24 self-described Hispanics. The other three SNPs were not validated in the larger sample set, as they are in nearly complete linkage disequilibrium with at least one of the five SNPs chosen for further study. Four of the five validated SNPs listed in Table 1 (-5983 G/A, -5047 G/T, -4866 G/C and -3433 T/C) were successfully optimised for 5'-nuclease assays.
Conditions for screening putative SNPs. The remaining 19 SNPs listed in Table 1 were previously characterised in this laboratory by denaturing high performance liquid chromatography (dHPLC) or single-stranded conformation polymorphism (SSCP) analysis, or were taken from published works or public databases. Flanking primers were designed for a total of 22 polymorphisms from dbSNP 22 using Primer 3.0 from MIT, Cambridge, MA. 23 PCR was performed in 25 µL-scale reactions with the following components: 50 ng genomic DNA, 3 mM MgCl2, 200 nM dNTPs, 200 nM of each primer, 1 U TaqGold (Applied Biosystems) and 2.5 µL 10× TaqGold Buffer. The cycling conditions (PE Biosystems 9700) for all reactions were as follows: a 95°C hold for ten minutes, then a touch-down cluster of 12 cycles (95°C for 30 seconds, 62-57°C (decreasing by 0.5°C every cycle) for one minute and 72°C for one minute), a standard cluster of 30 cycles (95°C for 30 seconds, 57°C for one minute and 72°C for one minute) and a final 72°C hold for seven minutes. PCR products were purified using 10 U exonuclease I and 2 U shrimp alkaline phosphatase (SAP) enzymes under the protocol specified by the Washington University Sequencing Center. 24 All purified reaction solutions were sequenced as follows: 15-30 ng of purified product was added to 10 µL reaction solution, which included 2 µL of BigDye mix, 1 µL of standard 5× dilution buffer, 1.1 µL of 0.5 mM primer stock and ddH2O for the remaining volume. Reactions were cycled under the following conditions: 95°C for five minutes, and 30 cycles of 95°C for 30 seconds, 50°C for ten seconds, and 60°C for four minutes. Nine of the 22 primer pairs produced viable sequences and the SNPs were polymorphic in at least one of the 16-individual population panel. Those 'confirmed' SNPs were further characterised by either sequencing or genotyping in the larger sample of 88 individuals (data not shown).

SNP genotyping
All 23 SNPs were genotyped using the 5'-nuclease assay under a set of universal assay conditions. Dual-labelled TaqMan (Applied Biosystems) probes with standard, Turbo and Minor-Groove Binding (MGB) chemistries were designed using Primer Express (Applied Biosystems). Previous analysis of genotyping accuracy using the TaqMan method revealed 14 discordancies out of 1,165 duplicate genotype pairs, a 1.2 per cent error rate averaged over multiple TaqMan assays. 25 PCR conditions for genotyping (reaction components and cycling conditions), as described in Morin et al. (1999) and Clark et al. (2001), were used for all SNPs typed in this study. 25,26 PCR was performed in 96-well plates that included positive genotypic controls (for both homozygote states and the heterozygote state for each SNP) and reactions with no DNA as a negative control. All 5'-nuclease assay plates were read on the ABI 7700 Sequence Detector, and analysed using the 'dye components' feature of the SDS v1.6.3 or v1.7 software package (Applied Biosystems). Genotype determinations for each reaction were made manually by visual inspection of a scatter-plot of the data, with reference to the results of the genotype control samples. CEPH pedigree data for all 23 genotyping assays were checked for concordance with Mendelian inheritance using PEDCHECK. 27
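The idea behind a Mendelian concordance check can be illustrated for one biallelic SNP in a single parent-offspring trio. The sketch below is not PEDCHECK's algorithm, which handles whole pedigrees and missing data; the function name and the 1/2 allele coding are assumptions made for illustration only.

```python
from itertools import product

def is_mendelian_consistent(father, mother, child):
    """Return True if the child's genotype can be formed by taking one
    allele from each parent. Genotypes are unordered pairs of alleles,
    e.g. (1, 2); the 1/2 coding is an illustrative assumption."""
    possible = {tuple(sorted((a, b))) for a, b in product(father, mother)}
    return tuple(sorted(child)) in possible

# A 1/1 x 1/2 mating can yield 1/1 or 1/2 children, but never 2/2.
print(is_mendelian_consistent((1, 1), (1, 2), (1, 2)))  # True
print(is_mendelian_consistent((1, 1), (1, 2), (2, 2)))  # False
```

A genotype flagged as inconsistent in this way would typically be re-examined or re-typed before haplotypes are phased through the pedigree.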

Haplotype analysis
Haplotype phase was determined using the CYRILLIC II pedigree drawing software (Cherwell Scientific) to establish the inheritance of multi-locus genotypes. The algorithm developed by Guo and Thompson (1992) was used to determine whether the distribution of whole haplotypes in the CEPH grandparent sample (n = 103) deviates from Hardy-Weinberg proportions. 28 Significance is determined by an exact test, with a cut-off of p = 0.05. Haplotype states and frequencies on both chromosomes 3p21 and 17q11-12 were estimated in sets of unphased genotype data by MLOCUS, 29,30 which uses the expectation-maximization (EM) algorithm, 31 a maximum-likelihood based method. A previously described three-step procedure to determine the most likely set of haplotypes to describe the genotype data was used here to analyse the haplotype states and frequencies for all datasets. 32 Haplotype blocks on 3p21 were assessed using HaploBlockFinder, 33 which performs the four-gamete test (FGT) between each pairwise SNP to identify past recombination events. 34 The minimum-D' method 35,36 (with minimum D' = 0.80) was also used to assess haplotype block structure in the 150 kb region of 3p21.
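To make the estimation step concrete, the following is a minimal sketch of EM haplotype-frequency estimation from unphased biallelic genotypes, assuming Hardy-Weinberg proportions. It is not MLOCUS and does not reproduce the three-step procedure cited above; the function names, the 0/1/2 genotype coding (copies of allele 2 per locus) and the convergence settings are all assumptions for illustration.

```python
from itertools import product
from collections import defaultdict

def compatible_pairs(genotype):
    """List the unordered haplotype pairs consistent with an unphased
    genotype, given as a tuple of allele-2 counts (0, 1 or 2) per locus.
    Haplotypes are tuples of 0 (allele 1) and 1 (allele 2)."""
    het = [i for i, g in enumerate(genotype) if g == 1]
    base = [1 if g == 2 else 0 for g in genotype]
    pairs = set()
    for bits in product((0, 1), repeat=len(het)):
        h1, h2 = list(base), list(base)
        for i, b in zip(het, bits):
            h1[i], h2[i] = b, 1 - b
        pairs.add(tuple(sorted((tuple(h1), tuple(h2)))))
    return sorted(pairs)

def em_haplotype_frequencies(genotypes, n_iter=200, tol=1e-10):
    """Estimate haplotype frequencies by EM, starting from equal
    frequencies for every haplotype compatible with the data."""
    cands = [compatible_pairs(g) for g in genotypes]
    haps = sorted({h for pairs in cands for pair in pairs for h in pair})
    freq = {h: 1.0 / len(haps) for h in haps}
    n_chrom = 2 * len(genotypes)
    for _ in range(n_iter):
        counts = defaultdict(float)
        for pairs in cands:
            # E-step: weight each phase resolution by its probability
            # under random mating, given current frequency estimates.
            w = [freq[a] * freq[b] * (1 if a == b else 2) for a, b in pairs]
            total = sum(w)
            for (a, b), wi in zip(pairs, w):
                counts[a] += wi / total
                counts[b] += wi / total
        # M-step: new frequencies are expected counts per chromosome.
        new = {h: counts[h] / n_chrom for h in haps}
        delta = max(abs(new[h] - freq[h]) for h in haps)
        freq = new
        if delta < tol:
            break
    return freq

# Toy data: the two double heterozygotes are the only ambiguous individuals.
toy = [(0, 0)] * 4 + [(2, 2)] * 4 + [(1, 1)] * 2
for hap, p in sorted(em_haplotype_frequencies(toy).items()):
    print(hap, round(p, 4))
```

In the toy data the homozygotes support only two haplotypes, so the EM resolves the double heterozygotes in favour of those two. With many heterozygous loci the number of candidate pairs grows as 2^(k-1) per individual, which is one reason estimation becomes harder as more SNPs are included.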

Validation of haplotype estimation
Haplotype frequencies were determined by direct counting of whole chromosomes in the grandparents after haplotypes were established by pedigree analysis. Haplotypes were estimated using MLOCUS with unphased genotype data from these same individuals. Comparisons of the two methods were performed with genotype data from two regions: the chemokine cluster (six genes) on chromosome 17q11-12 and the chemokine receptor cluster (four genes) on chromosome 3p21. For the 17q11-12 data, two analyses were performed: one included all nine SNPs typed in all six genes arrayed over 2 Mb, and the other included only six of these SNPs in the 77 kb 'core' region of three genes (MPIF-1, PARC, MIP-1α) on 17q11-12. The analysis of the 3p21 chemokine receptor genes included 14 SNPs arrayed over 150 kb.
The I_F and I_H algorithm performance indices suggested by Excoffier and Slatkin (1995) were used to quantitatively evaluate the estimation results in the CEPH grandparents. 37 The I_H index evaluates the performance of the algorithm to identify the actual haplotypes, and the I_F statistic examines how close the estimated frequencies are to the pedigree haplotype frequencies. I_H and I_F values were calculated using only those haplotypes above the threshold frequency (1/2n). A mean squared error (MSE) statistic was also used to compare the estimated haplotype frequencies to the pedigree-derived frequencies. 38 To determine whether omitting those grandparents who could not be phased from the analysis generates skewed pedigree-derived haplotype frequencies, MLOCUS haplotype estimations of the total sample (n = 103) were compared to the 'phased-only' sample using the above-described performance indices.
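The three performance measures can be sketched in a few lines. The forms used below, I_F = 1 - (1/2) Σ|p̂ - p| and I_H = 2(k_true - k_missed)/(k_true + k_est), are the ones commonly attributed to Excoffier and Slatkin; treating a haplotype at exactly 1/(2n) as retained, summing over the union of retained haplotypes, and averaging squared differences over that union for the MSE are assumptions of this sketch, not details confirmed by the text.

```python
def performance_indices(true_freq, est_freq, n_individuals):
    """Return (I_F, I_H, MSE) for pedigree-derived (true) and estimated
    haplotype frequencies, given as dicts mapping haplotype -> frequency."""
    # Assumption: haplotypes at or above 1/(2n) are retained.
    threshold = 1.0 / (2 * n_individuals) - 1e-12
    true_kept = {h for h, p in true_freq.items() if p >= threshold}
    est_kept = {h for h, p in est_freq.items() if p >= threshold}
    union = true_kept | est_kept

    diffs = [est_freq.get(h, 0.0) - true_freq.get(h, 0.0) for h in union]
    i_f = 1.0 - 0.5 * sum(abs(d) for d in diffs)

    k_true, k_est = len(true_kept), len(est_kept)
    k_missed = len(true_kept - est_kept)
    i_h = 2.0 * (k_true - k_missed) / (k_true + k_est)

    mse = sum(d * d for d in diffs) / len(union)
    return i_f, i_h, mse

# Hypothetical frequencies for three true and three estimated haplotypes.
true_f = {"11": 0.5, "12": 0.3, "21": 0.2}
est_f = {"11": 0.55, "12": 0.25, "22": 0.2}
print(performance_indices(true_f, est_f, n_individuals=10))
```

Both indices equal 1 when the estimated set and frequencies match the pedigree-derived ones exactly, and fall towards 0 as haplotypes are missed or frequencies diverge.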
Estimating linkage disequilibrium in population data
D' statistics were calculated with phased haplotypes derived from pedigree analysis with DnaSP (v3.53). 39 Linkage disequilibrium estimates generated by haplotypes determined by pedigree analysis in the CEPH grandparents were compared with those estimates calculated from MLOCUS reconstructed haplotypes in the same datasets. PAIRWISE was used to estimate linkage disequilibrium from the estimated haplotypes generated by MLOCUS. 30 PAIRWISE generates Lewontin's normalised D' statistic 40 and the p-value determined from an exact test of association between all pairs of polymorphic loci in the dataset.
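For a pair of biallelic loci, Lewontin's D' can be computed directly from phased haplotypes. The sketch below is neither DnaSP nor PAIRWISE; it assumes alleles are coded 1 and 2 and keeps the sign of D, so that a negative value indicates association between allele 1 at the first locus and allele 2 at the second. The function name is hypothetical.

```python
def lewontin_d_prime(haplotypes):
    """Signed D' for two biallelic loci. `haplotypes` is a list of
    (allele_at_locus_A, allele_at_locus_B) tuples with alleles coded 1/2."""
    n = len(haplotypes)
    p_a = sum(1 for a, _ in haplotypes if a == 1) / n
    p_b = sum(1 for _, b in haplotypes if b == 1) / n
    p_ab = sum(1 for a, b in haplotypes if a == 1 and b == 1) / n
    d = p_ab - p_a * p_b
    # Normalise by the largest |D| possible given the allele frequencies.
    if d >= 0:
        d_max = min(p_a * (1 - p_b), (1 - p_a) * p_b)
    else:
        d_max = min(p_a * p_b, (1 - p_a) * (1 - p_b))
    if d_max == 0:
        raise ValueError("D' is undefined when a locus is monomorphic")
    return d / d_max

# Only three of the four possible gametes are present, so |D'| is maximal.
three_gametes = [(1, 1)] * 5 + [(1, 2)] * 3 + [(2, 2)] * 2
print(round(lewontin_d_prime(three_gametes), 3))
```

Because D' reaches its maximum whenever at least one of the four gametes is absent, it is sensitive to low-frequency haplotypes, including spurious ones introduced by estimation.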

Haplotype analysis of 3p21 SNPs
To determine the haplotype structure of SNPs in the 3p21 region, we typed 14 polymorphisms in the CEPH pedigrees. Eleven of the 14 SNPs were polymorphic and none of these SNPs deviated from Hardy-Weinberg equilibrium (HWE) at the p = 0.05 significance level. Haplotype phase was established for every grandparent in the sample (n = 103). Haplotype frequencies were then determined by direct counting of whole haplotypes (Table 2). Nine haplotypes explained nearly all of the variation (98 per cent) in the CEPH grandparents. The remaining 2 per cent is composed of two haplotypes that occur only once.
The diplotypes, or multi-locus genotypes, were also counted in the CEPH grandparent sample. The diplotype combination of haplotypes 1 and 3 was the most frequent in the sample, at 13 per cent. In the CEPH grandparent sample, the 3p21 haplotypes were in HWE, as the randomisation test of the distribution of diplotypes yielded a non-significant p-value of 0.2708. When analysed individually, the 11 polymorphisms demonstrated no deviations from HWE in the CEPH grandparent sample.
Both haplotype block tests, the FGT and the minimal-D' method (set to the default of a minimum D' = 0.80), found a break between CCR2-N260N and CCR5-208. This indicates a past recombination event somewhere in the 20 kb between CCR2 and CCR5. The pedigree haplotypes support this: although there was no direct observation of a recombination event in the pedigree data, one haplotype (11112121211121) appeared to be a recombinant of haplotypes 4 (211111121211121) and 7 (11112121111111).
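The four-gamete test itself is simple to state: if all four allele combinations are observed at a pair of biallelic loci, then (assuming no recurrent mutation) at least one recombination event must have occurred between them. The sketch below applies the test to adjacent loci in a toy set of phased haplotypes; it is not HaploBlockFinder, and the function names and the toy haplotype strings are hypothetical.

```python
def four_gamete_test(haplotypes, i, j):
    """Return True if all four gametes are observed at loci i and j,
    which is taken as evidence of past recombination between them.
    Assumes biallelic loci."""
    gametes = {(h[i], h[j]) for h in haplotypes}
    return len(gametes) == 4

def adjacent_breaks(haplotypes):
    """List the adjacent locus pairs (i, i + 1) that fail the test,
    i.e. candidate block boundaries."""
    n_loci = len(haplotypes[0])
    return [(i, i + 1) for i in range(n_loci - 1)
            if four_gamete_test(haplotypes, i, i + 1)]

# Toy haplotypes over four loci, alleles coded '1' and '2'.
toy_haplotypes = ["1111", "1122", "2211", "2222"]
print(adjacent_breaks(toy_haplotypes))  # [(1, 2)]
```

In the toy data, loci 0-1 and loci 2-3 each show only two gametes, while loci 1 and 2 show all four, so a single boundary is placed between them, giving two blocks.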

Haplotype analysis of 17q11-12 SNPs
To characterise the chemokine loci on chromosome 17, haplotype analysis was performed using all nine SNPs (over a 2 Mb region), as well as a subset of six SNPs arrayed over the 73 kb region, which includes MPIF-1, PARC and MIP-1α. Conclusive phase was established for only 87 individuals of the 103 in the CEPH grandparent sample for nine-SNP haplotypes. A total of 70 per cent of the variation of the total sample (n = 103) was explained by 14 haplotypes (of nine SNPs) (Table 3). The remaining portion included 11 doubleton haplotypes (found in two individuals), ten singletons (occurred only once), as well as the 32 unphased chromosomes. When the analysis was reduced to six SNPs in the 73 kb region (Table 4), we were able to phase 96 grandparents by visual inspection of the pedigrees. Haplotype phase was not definitely assigned to seven of the 103 grandparents because two or more haplotype combinations could be inferred, given the diplotypes of their children or because of missing data. Eight six-SNP haplotypes explain 90 per cent of the variation in the CEPH grandparent sample (n = 103), and 41 per cent of the total number of chromosomes carry the most common haplotype (1 1 1 1 1 1) (Table 3). The remaining 10 per cent of the total number of chromosomes (2n = 206) is comprised of two doubletons, two singletons and the 14 unphased chromosomes.
Diplotypes were assigned to all individuals for which phase was established (n = 96) for the six SNPs in MPIF-1, PARC and MIP-1α. The most frequent diplotype combination

Validation of the EM algorithm on 3p21 and 17q11-12
To validate the accuracy of the EM algorithm, we compared the pedigree-derived haplotypes to those estimated haplotypes generated by MLOCUS. The 3p21 haplotype distributions were nearly identical to the estimated frequencies (Table 2). The similarity (I_F) and identity (I_H) indices were calculated for haplotypes in the CEPH grandparent sample (n = 103) for 14 SNPs. For the 14-SNP haplotypes in 3p21, as indicated in Table 2, the similarity index (I_F) was 0.9869. An I_F of 1.0 would indicate perfect concordance between the haplotype frequencies generated by the two methods. The identity index (I_H) for these data was exactly 1.0, as all haplotypes derived by pedigree analysis were present in the MLOCUS results. One estimated haplotype was dropped from the analysis, as it was below the frequency threshold (1/2n = 0.004854), as suggested by Excoffier and Slatkin (1995). 37 The MSE incorporates the overall difference in frequencies between actual (pedigree-derived) and estimated frequencies for all H haplotypes. The MSE for the 3p21 haplotypes was small (0.00001), which, again, indicates that the two frequency distributions are nearly identical.
As mentioned previously, phase could not be determined for the nine SNPs typed on chromosome 17q11-12 for all grandparents. Haplotype frequencies were determined, both by whole chromosome counting and by estimation, with data from 87 out of 103 individuals. The similarity index (I_F) for the distribution of frequencies for the 43 haplotypes (nine SNPs) in this region is 0.8196, as indicated in Table 3. The haplotype estimation yielded 24 haplotypes with frequencies over the threshold value (1/(2n) = 0.0057), and missed 13 haplotypes that were present in the pedigree data. The I_H statistic for these data is 0.7457. The MSE for the nine-SNP haplotypes is 0.0002, as indicated in Table 3.
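As a consistency check on the reported nine-SNP values, the threshold and the identity index can be written out. The form of I_H is the one commonly attributed to Excoffier and Slatkin; the count of 35 pedigree-derived haplotypes is an assumption, taken here as the 14 common haplotypes plus 11 doubletons plus 10 singletons, and is not stated explicitly in the text.

```latex
% Frequency thresholds for the two samples
\frac{1}{2n}\bigg|_{n=103} = \frac{1}{206} \approx 0.004854 , \qquad
\frac{1}{2n}\bigg|_{n=87} = \frac{1}{174} \approx 0.0057

% Identity index for the nine-SNP haplotypes
% (k_true = 35 is assumed; k_missed = 13 and k_est = 24 are reported)
I_H = \frac{2\,(k_{\mathrm{true}} - k_{\mathrm{missed}})}{k_{\mathrm{true}} + k_{\mathrm{est}}}
    = \frac{2\,(35 - 13)}{35 + 24} = \frac{44}{59} \approx 0.7458
```

Under that assumption the result agrees with the reported value of 0.7457 to within rounding.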
The EM algorithm also generated seven low frequency haplotypes (less than 1 per cent, not shown) that were not observed in the pedigree analysis. Constraining the MLOCUS analysis by removing these haplotypes did not significantly improve the MLE. This constrained analysis also resulted in the generation of other spurious low-frequency haplotypes, indicating that the EM algorithm could not effectively resolve haplotype phase for some individuals in the nine-SNP dataset.
Not surprisingly, paring the analysis down to the six SNPs in the 77 kb region that contains MPIF-1, PARC and MIP-1α yields more accurate haplotype estimates. Ninety-six grandparents were included in this analysis, as phase could not be determined for seven of the 103 individuals in the total sample. As indicated in Table 4, the I_F statistic increased to 0.9491, and the I_H of 0.9167 is closer to perfect identity (1.0). The MSE is also closer to zero, at 0.0001.

Comparisons of MLOCUS haplotype estimates for 17q11-12
Omitting the unphased chromosomes from the pedigree haplotype frequency calculation of the 17q11-12 SNPs is a potential source of bias, as those individuals for whom complete resolution is not possible may have a higher per site heterozygosity than randomly sampled individuals. Additionally, those 'unphasable' individuals may carry haplotypes that are not present in the phased portion of the sample. To test if using only the phased individuals generates skewed 'pedigree-derived' 17q11-12 haplotype frequencies, MLOCUS haplotype frequency estimates were generated from both the total dataset of unphased genotypes (n = 103; data not shown) and those genotypes only from the phased individuals (n = 87 for the nine-SNP haplotypes, Table 3; n = 96 for the six-SNP haplotypes, Table 4). Comparisons of nine-SNP MLOCUS haplotypes (above 1 per cent frequency) from the whole sample (n = 103) and the phased sample (n = 87) yielded an I_H of 0.9729, an I_F of 0.9313 and an MSE of 0.00007. The same comparison performed on the six-SNP haplotypes yielded an I_H of 1, an I_F of 0.9838 and an MSE of 0.00002. One nine-SNP haplotype present in the total sample (at a frequency of 0.015) was missed in the 'phased-only' sample, while in the six-SNP analysis, both sets of genotypes generated identical haplotypes. The potential bias of removing the unphased grandparents from the haplotype analyses appears to be slight, as the index values indicate that the haplotype frequencies generated by the two datasets (the complete sample and the 'phased-only' sample) are very similar, particularly for the six-SNP haplotypes.

Comparisons of methods to estimate linkage disequilibrium
Both phased haplotypes and unphased genotype data from the CEPH grandparents (n = 103) were used to estimate the extent of pairwise linkage disequilibrium (described by D') between SNPs in the chemokine receptor region on chromosome 3p21 and the chemokine cluster on chromosome 17q11-12. The D' statistic (above the diagonal) and the measure of statistical significance (p-value) (below the diagonal) are presented for pairwise comparisons of the 11 polymorphic sites in 3p21 in Table 5. Negative values indicate that there is disequilibrium between opposite alleles at the two SNPs (ie allele 1 at the first SNP and allele 2 at the second SNP, where the common allele is allele 1). The D'-values generated from analyses of the 3p21 polymorphisms by the DnaSP and PAIRWISE programs were, for the most part, very similar. The three differences, noted in bold, are slight. As discussed previously, the haplotypes generated by the EM algorithm were essentially identical when compared with those discerned by pedigree analysis for the variants in this region. The analysis of both the haplotypes and the unphased genotype data indicated that linkage disequilibrium in this 150 kb region of 3p21 is high in the CEPH grandparents. There is intact linkage disequilibrium (D' = 1) between two SNPs at the extremes of the region (CCR3-Y17Y and CCRL2-I243V), preserved primarily on haplotype 4 (2 1 1 1 1 1 1 2 1 2 1 1 1 2 1). The relative loss of linkage disequilibrium in the centre of the region, between CCR2-N260N and two SNPs in the CCR5 promoter, CCR5-208 (D' = 0.326) and CCR5-676 (D' = 0.326), was detected by haplotype block analysis, indicating past recombination between these two genes.
It is not surprising that the DnaSP analysis of haplotypes on 17q11-12 indicated no evidence of long-range linkage disequilibrium between variants at the extremes of the 2 Mb region. There is significant linkage disequilibrium between the SNPs typed in MCP-1 and nearby Eotaxin (D' = -1) at the centromeric end of the region. Likewise, there is some significant allelic association between SNPs in MIP-1α, PARC and MPIF-1, which are within 77 kb of each other. The relative lack of association between more distal SNPs seems to have hampered the ability of the PAIRWISE analysis of unphased genotype data to accurately detect the extent of linkage disequilibrium, when compared with the DnaSP analysis of whole haplotypes. This lack of sensitivity is especially evident in the analyses of all nine SNPs, as the multitude of haplotypes (including spurious haplotypes generated by the EM estimation) created false-positive associations between distal variants (such as between MCP-1 and SNPs in PARC) (Tables 6 and 7).

Discussion
Given the potential accuracy of low-cost statistical methods, and the current high cost of molecular haplotyping and pedigree analysis, statistical estimation to determine haplotypes may be a cost-effective strategy for many gene regions. At a minimum, statistical estimation can be used to determine the overall need for molecular haplotyping and to specify where in the dataset molecular haplotyping would provide the most benefit. 41-43 Independent assessments of the effectiveness of the EM algorithm have been discussed at length. 38 48 Using previously described criteria, 37 Xu et al. found that all three methods performed better for regions with a high degree of linkage disequilibrium, such as in the NAT2 gene, than for regions where linkage disequilibrium is not maintained (chromosome 8p22) when compared with haplotypes determined by molecular methods. 46 The purpose of the evaluation presented here is to establish the accuracy of statistical estimation in these chemokine and chemokine receptor gene clusters. Estimated haplotypes from unphased genotypes were compared with haplotypes derived empirically from pedigree analysis in the CEPH grandparent sample (n = 103). How the EM algorithm responds to irregular linkage disequilibrium, sample size, different levels of polymorphism and deviations from HWE is critical for the effectiveness of haplotype estimation. 38 These conditions will be affected by the genomic environment of the region of interest, the history of the population from which the samples were selected and the quality of the genotype data. While these validation results cannot control for all these variables, an attempt was made to explore how the EM algorithm responds to the conditions of the gene clusters studied on chromosomes 3p21 and 17q11-12 in a European-derived sample set.
A greater degree of linkage disequilibrium between SNPs, and therefore fewer haplotypes, increases the accuracy of the EM algorithm and aids subsequent estimates of measures of linkage disequilibrium (such as D'). This is evident from the results of estimations of haplotype frequency and linkage disequilibrium in the 150 kb region on 3p21. Relatively few haplotypes explain the variation between these SNPs, at least in the CEPH grandparent sample. Indeed, there is intact linkage disequilibrium at the extremes of this region, as CCR3-Y17Y and CCRL2-I243V have a pairwise D'-value of 1. The haplotype block analysis also indicates a fairly simple structure, as both tests applied here found only two blocks, with what appeared to be a past recombination event between CCR2 and CCR5.
The degree of linkage disequilibrium between SNPs is one of the most important factors in the ability of the EM algorithm to properly detect haplotypes in population samples. 38,44 The analysis presented here shows that the EM algorithm accurately describes the haplotype structure and patterns of pairwise linkage disequilibrium on chromosome 3p21 (a region of higher linkage disequilibrium). As for chromosome 17, it is important to note that, because of the relatively few SNPs assessed (a total of nine), this analysis is a low resolution evaluation of haplotypes and linkage disequilibrium across a large region (2 Mb). While including only the 'core' region of 17q11-12 yields more accurate estimates of haplotype frequencies and linkage disequilibrium, these analyses still include a relatively sparse sampling of SNPs (six in 77 kb). The results of the pedigree analysis indicate that, while haplotype estimations in the chemokine receptor cluster on 3p21 may be fairly straightforward, special care must be taken for any haplotype inference in the chemokine genes on chromosome 17. More SNP genotype data, especially in the chromosome 17 chemokine genes, will no doubt aid in further characterisation of variation and linkage disequilibrium in these gene regions, as well as improve the accuracy of future haplotype analyses.