Stepwise haplotype analysis: Are LD patterns repeatable?

A variety of techniques exist to describe and depict patterns of pairwise linkage disequilibrium (LD). In the current paper, a new log-linear framework is proposed for the summarisation of local interactions among single nucleotide polymorphisms (SNPs). Our approach provides a straightforward means of capturing the diversity of higher-order LD relationships for small numbers of loci by investigating inter-marker interactions. Our method was applied to a dataset of 76 SNP markers spanning a genomic interval of length 2.8 megabases. The analysis of three short sub-regions is described in detail here. Model and graphical representations of contiguous markers in medium to high LD are presented. In the regions studied, evidence for sub-structure was detected, supporting the view that the genomic reality is complex. Interestingly, a critical evaluation of the method by bootstrapping showed that while some LD relationships were captured in a highly repeatable fashion, the majority were not. Large numbers of small interactions, both direct and indirect, mean that many models can adequately summarise the data at hand. Our results suggest that repeatability should be further investigated in the application of LD-based approaches.


Introduction
The abundance of single nucleotide polymorphisms (SNPs) and the limited power, in somes ituations, of single-locus analysish as led to increased use of haplotype-inference methods such as Clark'sa lgorithm, 1 the Expectation-Maximisation (EM)a lgorithm 2 and iterative-sampling algorithms to resolvep hase ambiguity by both coalescent and non-coalescent models. 3,4 Recent studies [5][6][7][8][9] have shown that the humangenomecan be viewed in terms of haplotype blocks, givenb yd iscrete regions of highl inkage disequilibrium (LD), and separated by shorter regions of lowL D. Haplotype block identification has been conducted via evaluation of measures of LD,s uch as Lewontin'sD ', as well as by methods of directly assessing evidence of recombination. 10 Thec orollaryo ft he block conceptw as that as mallp roportion of the SNPs, the 'haplotype-tagging'S NPs, should be sufficient to capturet he majority of theh aplotype structure contained in blocks genome-wide. 11 More recently,B ayesian graphicalm odelling has been applied to describe more complex patterns of relationship,f or examplea mongl oci that arep roximal but non-adjacent. 12 We introduce anovel application of log-linear modelling, to describe higher-order interactions amongSNPs.The log-linear step is embedded within the EM algorithm in order correctly to model phase.P reviously,l og-linear models have been used to formt he basis of Bayesian priorsi nr esolving phase and to model different levelso fL Dw ith known phase. 13,14 We showt hat the log-linear model mayb eu sed to describe discrete islandso fL D, 15 as well as smaller conditionally independent sub-fragments of high LD.W et est the repeatability of our findings by bootstrapping and find instances of complex LD for which model repeatability is low.

Materials and methods
The methods described beloww ere applied to ad ataset consisting of arandom sample of 150 Caucasian controls from the Preventiono fR EStenosis with Tranilast and its Outcomes (PRESTO) study. 16,17 Appropriate consent waso btained and these samples were genotyped across 76 SNPs spanning approximately2 .8 megabases (Mb), within and around the UGT1A1 gene.T hese data and their analyses are described in detail elsewhere. 18 frequencies, while the likelihood of the data, giventhe model, is maximised in the M-step.T he process proceeds iteratively.
More formally, 3 givenasample of n diploidi ndividuals from ap opulation, let G ¼ (G 1 , ...,G n ) denote the known genotypes, let H ¼ (H 1 , ... ,H n ) denote the unknown corresponding haplotype pairsa nd let F ¼ (F 1 , ...,F m ) be the unknown population haplotype frequencies. The algorithm starts under random assignmento fg enotypes. The M-step of the EM algorithm then finds the set of haplotype frequencies, F, which maximises the following likelihood: Under Hardy-Weinberg equilibrium, theg enotype probabilities can be partitionedi nto the product of haplotype probabilities: where V i is the set of all (ordered) haplotype pairsc onsistent with the multilocus genotype G i . The E-step of the algorithm, used here,t hen estimates the population haplotype frequencies F by using the log-linear model and not the traditional counting method. Investigation of the saturated log-linear model, however, in which all loci and interactions arer epresented, is challenging due to the necessarily high number of parameters. Therefore, as tepwise approach of fitting intermediate modelsh as been used. These intermediate models contain more parameterst han am odel of completel inkage equilibrium (LE) but fewer parameters than the saturated model. 19,20 In the current paper,w es how hows uch models provide the framework for quantifying the patterns of LD.

Notation
Notationf or the remainder of the paper will focuso nt he composition of the log-linear model, as it is this that is of interest in describing patterns of SNP interaction. Thevariable corresponding to the i th SNP is givenb y l i and models are specified by usingt he Wilkinson and Rogers notation, where the SNP variables arec ombined by ' þ 'tod enote independence,a nd ' * 'tod enote interaction. 21 For example, l 1 þ l 2 denotes independence between the first and second SNP and l 3 * l 4 denotes interaction between the 3rd and 4th.

Forwards tepwise algorithm
We propose af orwards tepwisea pproach to determining a parsimonious model of LD.Starting withamodel of complete LE, higher-order LD terms are added sequentially to the model until ap arsimoniousm odel is found. This procedure has been implemented as the command swblock within STATA 22 and is available using the ssc command. Al ikelihood ratio test (LRT) wasu sed to measuret he strength of LD or inter-SNP interaction, although other test statistics are possible.The LRT wasp erformed using hapipf,acommand 20 implemented in STATA. 22 More formally,the algorithm examines aregion of n SNPs. In order to preserve efficiency of the EM algorithm,f ewer than ten SNPs is practical. Thefi rsts tep is to estimatet he log-likelihood under the base model of LE l 1 þ l 2 þ ... þ l n . Then,e very pairwise SNP interaction termi sa dded to this model and the LRT, comparing the new model with the base model is re-evaluated.T he mosts ignificant interaction termi st hen added to the base model, this becomes then ew base model and the process repeats. Anominal p -value of 0.05 wasinitially chosen to compare new models with the base model; however, other thresholds of p ¼ 0.01 and p ¼ 0.001 were also investigated. Once no more pairwise interactions are significant, the algorithm proceeds to the next order of interaction terms, and so on. This approach accommodates the fact that pairwise interactions can occur over greater distancesthan contiguous pairsa nd that LD does not decay monotonically with distance.Ateach step,the number of degrees of freedom is minimised in the sequence of LRTs, and the algorithm continues until the highest interaction termi se valuated.

Application to LD structure
Certain LD features have been helpfully described in ar eview by Wa ll and Pritchard, 23 who established three criteria, derived using pairwiseL D, for assessing haplotype blocks. They introduced concepts of 'holes' and 'overlapping blocks' in regions of high LD,a nd thesec oncepts are applicable to more general evaluationso fc omplex LD structure.A s described below, these concepts can be presented in terms of log-linear models.
Holes arise when the outermostSNPs are not in strong LD with an SNPormultiple SNPs that lie in between. To translate this to al og-linear framework, consider at riplet of markers parameterised as l l ,l 2 and l 3 .I f l 1 and l 3 showh igh LD, but intervening pairs( l 1 ,l 2 and l 2 ,l 3 )d on ot showh igh LD, as can happen withl ow frequency SNPs,t hen this situation mayb ed escribed by the model l 1 * l 3 þ l 2 .
This representation can be extended to af ourth SNP, l 4 ,i n as imilar fashion.C ontinuing the example of ah ole at SNP2 (variable l 2 ), one model describing the interactionsw ould be l 1 * l 3 * l 4 þ l 2 .A lternatively,i ft he three-way interaction is not needed,t hen another suitable model mightb e l 1 * l 3 þ l 3 * l 4 þ l 1 * l 4 þ l 2 ,w here,a gain, SNP2 ( l 2 )i s independent of the other three SNPs.
Also defined by Wa ll and Pritchard, 23 another feature of certain regions is the presence of SNPs that area ssignable to more than one region of high LD.I nt he simplest case of four SNPs ( l 1 -l 4 ), twoo verlapping sets of relationships might be specified as l 1 * l 2 * l 3 þ l 2 * l 3 * l 4 .I nt his model, SNP1 and SNP4 arec onditionally independent,g iven SNP2 and SNP3.

Stepwise haplotypeanalysis
Review PRIMARY RESEARCH In reality,t here mayb eacombination of holes and overlapping regions. The method mayb ee asily generalised.

Investigating the repeatability of derived models
Regression approaches are better suited to hypothesis generation than to inference,d ue to the large number of models evaluated, but repeatability is of importance when assessing theu tility of these methods. Data from 150 controls were available for study.T oi nvestigate the repeatability of our models,t his control set wass ubjected to bootstrapping. In other words, the following process wasc arried out 12 times: 150 samples were selected withr eplacement from the entire control seta nd model fittinga pplied.I nt his way, 12 models were derived for each genomic interval.
In the first round of analysis, at hreshold of p ¼ 0.05 was used in the stepwise regression to select parametersf or inclusion in the model.A cknowledging that this threshold mayb ec onsidered generous in the context of the large number of tests being applied,the whole analysiswas repeated using thresholds of p ¼ 0.01a nd p ¼ 0.001. Again, model repeatability wasa ssessed.

Results
All pairwise R 2 statistics for the 76 SNPs were produced using STATA 22 and the pwld command( available from http:\\www-gene.cimr.cam.ac.uk/clayton).F igure 1d isplays estimates of all of the pairwise statistics. The diagonal cells are shown as white,a st he program does not calculate R 2 values for these. Elsewhere, increasingly high R 2 is denoted by increasingly dark grey shading. Af ew areas had very high R 2 values, givenb yt he black squares. Three regions were selected for model fitting. They were chosen by eye, based upon Figure 1, as having different characteristics, and while they do not provideacomprehensiveevaluation of the region, they provide an interesting insight into the question of repeatability.T he three areb oxed in Figure 1, and resultant models are shown graphically in Figures 2-5.T hese subsequent graphs were constructed usingt he command gipf within STATA,i nstalled using the command ssc.I nt hese graphs, each node represents aS NP and an edge represents as ignificant pairwise relationship.T hree-way relationships are givenb ys olid bold lines and four-wayi nteractions are givenb yb rokenb old lines.

Forwards tepwise analysis of SNP1 -SNP5
For the group SNP1-SNP5, the LD plot ( Figure 1) suggested as imple patterno fu niformly high LD.W hen model fitting wasa pplied,h owever, more complex models were derived. Figure 2s hows 12 graphs that represent the series of models derived from bootstrapping. Some features, such as the l 2 * l 3 and l 1 * l 4 pairwise relationships, were captured in everymodel; however, relationships amongS NPs 3-5w ere captured in three different ways. The variability of these1 2s imple graphs and of the models they denote is interesting, givent he apparently uniform' block' of LD seen in Figure 1. This may, however, be attributed to thef act that as mall percentage of overall variability can be explained in an umbero fw ays. Importantly,t he overall conclusion from this regional analysis-that all markersa re strongly inter-related -i sn ot affected by the nuances in the models selected. When the threshold for parameter inclusion in stepwise model fitting wasd ecreased from p ¼ 0.  (results not shown), the models derived were almost identical. The only change wast he loss of the l 4 * l 5 parameter in twoo f the bootstrap samples at p ¼ 0.01,a nd from three of the samplesa t p ¼ 0.001.

Forwards tepwise analysis of SNP36 -SNP41
This group of markerswas selected because they mark the join between twoa pparent regions of high LD (Figure1 ). The models derived from this more complexr egion are shown in Figure 3. In this more complex example,m odel repeatability wasl ower.T he strong block-liker elationship among markers SNP38 -SNP41 wasc learly visible as at angle of graphical relationships in each bootstraps ample,b ut the exact positioning of edges tended to vary.L owering the threshold for parameter inclusion from p ¼ 0.05 to p ¼ 0.01l ed to the loss of edges in everyb ootstrap sample (Figure 4). At this lowert hreshold, thetwo regions of high LD (SNP36 -SNP37 and SNP38 -SNP41) became more visibly distinct. Fewmodel changesw ere observeda st he threshold wasl owered from p ¼ 0.01 to p ¼ 0.001 (results not shown).

Forwards tepwise analysis of SNP44 -SNP49
Lastly,f or SNP44 -SNP49, the LD plot ( Figure 1) suggests a complex set of interrelationships. The parsimoniousm odels for the 12 bootstrap samples are giveni nF igure 5. The large number of edges suggests that this is an area of high haplotype diversity and this intervalp rovides the most striking example of lack of repeatability in model fitting. Only twof eatures were captured in all models.T hese were at hree-way interaction (SNP46, SNP47,S NP48)a nd at wo-way interaction (SNP44, SNP45). Other features were variously described. This is ac lear example where overall variability of relationship can be explained in amodel in anumberofw ays.

Discussion
This study investigated the performance of stepwise log-linear modellingi nt he evaluation of LD in three genomic loci. Bootstrapping of the data demonstrated that although certain LD features were consistently captured by this approach, derived models were generally not repeatable.F urthermore, altering the significancet hreshold for inclusion of parameters in the stepwisea nalysisd id not materially change our models. It is noteworthyt hat sample size mayb eaconsideration. Repeatingt he bootstrap analyses with as maller sample size (n ¼ 75) led to models with agreater number of higher-order interactions (results not shown). With as ample size of n ¼ 150, these same relationships tended to manifest as two-way interactions in the model. Clearly,t he allele frequency distribution of the markersavailable must be amajor component of the patterns derived, and while it is not appropriate to extrapolate our findings to all genomic regions and/or all methodologies, these findings do raise interesting questions of repeatability.L D-based inference is widely used, both for exploratorya nalysis and for the efficient selection of markersf or genotyping.
Model complexity and/or lack of repeatability should come as no real surprise.L Dm appinge xploits historical recombination events to narrowc andidate regions for disease genes; however, the patterno fL Di sa lso influenced by mutation and other stochastic factorswhich create associations between markerst hat do not have as imple relationship with distance.O ur models shed no lighto nt he 'source' of any complexity; they merely supporti ts existence.G reater repeatability of inferred LD has been observeda tc omparatively lowr esolution, when ac lose relationship is maintained between recombination and LD patterns. 6 Our models reflect the more stochastic picture seen at comparatively high resolution.
Other investigatorsh avep resented methods of modelling non-adjacent SNP interactions. Thomas and Camp 12 derived Bayesian graphicalm odels usingat ailored Metropolis -Hastings approach. Earlier evaluationso fp artial LD models have also been made.O ne group commented on the exceedingc omplicationa rising from the inclusion of higher-order interactions. 24 Model complexity is indeed an outcome of applying this method to al arge genomic region. In terms of applicability,our approach is limited to arelatively small number of SNPs -f ewer than ten -a nd thus it is restrictedt os mall genomic regions.F or most current-day situations, it would be impractical to apply stepwise log-linear modellingf or thep urposes of tag selection. The great wealth of marker data available nowf romt he HapMap and other sources, combined with the ever-decreasingc ost of genotyping, makei ta nunlikely avenue to pursue.For small candidate gene studies, however, it would be possible to use al og-linear approach to identify' sensible' modelst ot est in the analysis of asubsequent replication study,therebyreducing the burden of multiple testing. Such models would include an additionalp arameter pertaining to the disease locus but would be derived in exactlythe same way. It is hoped that the visual immediacy of this approach will aid hypothesis generation and serve as au seful additiont oafi ne-mappingt ool kit that already includes coalescent modelling, 25 for example.
It is nowg enerally agreed that the genome is not simply composed of discrete haplotype blocks of uniformly high LD; indeed, our models support am ore complex reality. Nevertheless, LD has an important role to playi nd esigning efficientm arker sets for genetic study.R esources sucha st he HapMap lessen the need for marker validation and provide am eans of allowing the selectiono fi nformativem arkers for genotyping. Our bootstrap results showt hat LD-based inference can be sample dependent, even within an ethnic group.T herefore, in utilisings uch data, it mayb eb eneficial to investigate the repeatabilityo fo ne'sc hosen methodology and, if appropriate,t oa llowg reater redundancy in marker selection.