A non-parametric approach to population structure inference using multilocus genotypes

Inference of population structure from genetic markers is helpful in diverse situations, such as association and evolutionary studies. In this paper, we describe a two-stage strategy for inferring population structure using multilocus genotype data. In the first stage, we use dimension reduction methods such as singular value decomposition to reduce the dimension of the data, and in the second stage, we use clustering methods on the reduced data to identify population structure. The strategy has the ability to identify population structure and assign each individual to its corresponding subpopulation. The strategy does not depend on any population genetics assumptions (such as Hardy-Weinberg equilibrium and linkage equilibrium between loci within populations) and can be used with any genotype data. When applied to real and simulated data, the strategy is found to have similar or better performance compared with STRUCTURE, the most popular method in current use. Therefore, the proposed strategy provides a useful alternative for analysing population data.


Introduction
Information about the population structure of species is useful in a variety of situations, such as admixture mapping, subspecies classification, genetic barrier detection and evolutionary study. 1-5 For example, anthropologists may have the debris of ancient people, supplied by archaeologists, and want to learn about the relationship between the ancient people and modern populations to infer the evolutionary history of human beings. Population structure can be identified based on visible characteristics such as language, culture, physical appearance and geographical region, but this can be subjective and may bear no relevance to genetics. 3 In other situations, the presence of population structure may constitute a practical nuisance. In association studies, a case-control design is often used to identify genetic variants underlying complex traits by comparing allele frequencies between unrelated individuals who are affected and those who are unaffected. The presence of population structure can lead to spurious associations between a candidate marker and a phenotype, however, as a result of population structure in the sample. 6,7 In forensic studies, the identification of reference groups is very important, but this can be difficult when population structure exists. 4,8 In all of these situations, the first step is to identify population substructure.
Pritchard et al. introduced a model-based clustering method to infer population structure and assign individuals to populations using multilocus genotype data. 3 They used a Bayesian formulation to generate the posterior distribution using a Markov chain Monte Carlo (MCMC) method based on Gibbs sampling. Their main modelling assumptions were Hardy-Weinberg equilibrium within populations and linkage equilibrium between loci within each population. 3 This is the predominant method currently used in genetic studies. Some other methods have been proposed, 9-12 but all are model-based (parametric) methods. These methods have their own advantages and disadvantages, and all carry model assumptions because of their parametric nature.
Here, we describe a two-stage strategy for inferring population structure, which is an alternative and complementary approach to STRUCTURE 3,10 for exploring data. In the first stage, we use methods such as singular value decomposition (SVD) to reduce the dimension of the data and then perform clustering on the reduced data. This two-stage strategy is widely used in knowledge induction and representation to determine similarities between the meanings of words and passages by analysis of large text corpora. 13,14 Our method does not use assumptions such as Hardy-Weinberg equilibrium for populations or linkage equilibrium for loci. Here we show that our method is faster than STRUCTURE and has comparable (when model assumptions hold) or better (when model assumptions fail) performance when applied to real and simulated data. 3,10 In the next section, we describe the strategy and methods we use and some of the advantages of the approach we take. We illustrate our method with examples and make comparisons with STRUCTURE in the Results section. In the Discussion section, we highlight issues in the methods, the potential use of the methods, and future work.

Methods
It is well known that cluster analysis is difficult in high-dimensional space because standard clustering algorithms such as the Expectation-Maximization (EM) algorithm 15 and the K-means method are prone to being trapped in local minima. 13,16 Although many initialisation methods have been proposed to deal with this problem, they have had only limited success. 13 Therefore, a two-stage procedure seems valuable: first reduce the dimension of the original space and then cluster in the reduced (low-dimensional) space. In general, any dimension reduction method and any clustering method can be plugged into this two-stage framework. In this paper, we use SVD as the dimension-reduction method, and the mixture model and K-means as the clustering methods. We also propose a non-parametric clustering method, which can be viewed as a variant of the K-means method, for small sample sizes.
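As a concrete illustration, the two-stage framework can be sketched in a few lines of NumPy: SVD for stage one and a plain K-means (Lloyd's algorithm) for stage two. The simulated allele counts and the farthest-point initialisation below are our own simplifications for this sketch, not part of any published method.

```python
import numpy as np

def reduce_then_cluster(X, n_dims=2, k=2, n_iter=50):
    """Stage 1: SVD dimension reduction; stage 2: basic K-means on the reduced data."""
    Xc = X - X.mean(axis=0)                        # centre each marker column
    U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
    Z = U[:, :n_dims] * s[:n_dims]                 # coordinates in the reduced space
    # Deterministic farthest-point initialisation (k = 2 sketch): the first
    # point and the point farthest from it seed the two centroids.
    centroids = np.array([Z[0], Z[np.argmax(np.linalg.norm(Z - Z[0], axis=1))]])
    for _ in range(n_iter):
        dists = np.linalg.norm(Z[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        centroids = np.array([Z[labels == j].mean(axis=0) for j in range(k)])
    return labels, Z

# Toy data: two populations with different minor-allele frequencies at 50 loci.
rng = np.random.default_rng(1)
X = np.vstack([rng.binomial(2, 0.2, (30, 50)),
               rng.binomial(2, 0.8, (30, 50))]).astype(float)
labels, Z = reduce_then_cluster(X)
```

With well-separated allele frequencies, the first two singular dimensions separate the two groups cleanly and K-means recovers the population labels.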

Dimension reduction
SVD is widely used in knowledge induction and representation and in information retrieval. For example, SVD plays a key role in latent semantic analysis (LSA) or latent semantic indexing (LSI). The semantic dimensions are thought to contain redundant and noisy information, which can be separated out and should be ignored. Bartell et al. showed that the document representations given by LSI are equivalent to the optimal representations found when solving a particular multidimensional scaling problem in which the given inter-object similarity information is provided by the inner product similarities between the documents themselves. 17 LSI automatically computes a much smaller semantic subspace from the original text collection. This improves recall and precision in information retrieval; information filtering or text classification; word sense disambiguation; word sorting and relatedness judgments; the prediction of learning from text; and summarising skills. 14,18 The effectiveness of LSI in empirical studies is often attributed to the reduction of noise, redundancy and ambiguity. 14,18,19 By introducing a dual probabilistic model based on similarity concepts, Ding showed that semantic correlations could be characterised quantitatively by their statistical significance; that is, the likelihood. 18 He further showed that LSI is the optimal solution of the model and proved the existence of the optimal semantic subspace. This model explains theoretically the performance improvements observed for LSI.
Mathematically speaking, SVD is a matrix decomposition technique. A real-valued m-by-n matrix (say X) can be represented uniquely (up to certain trivial rearrangements of columns and subspace rotations, in the case of duplicated singular values) as the product of three matrices:

X = U S V^T,   (1)

where both U and V are column orthonormal and S is a diagonal matrix of singular values. 20 There is a direct relationship between SVD and principal component analysis (PCA) when PCA is performed from the covariance matrix. From equation (1),

X X^T = U S V^T V S U^T = U S^2 U^T.   (2)

If each row of X is normalised (centred and unitary), the covariance matrix Σ of the data X is

Σ = X X^T,   (3)

and we know that its eigendecomposition is

Σ = A Λ A^T,   (4)

where A is an orthonormal matrix and the λ values on the diagonal of Λ are the eigenvalues of Σ. The decomposition is unique up to some trivial column rearrangements. Matrix A contains the principal components of the columns of X. From equations (2) and (4), we can see that the left singular vectors U are the same as the principal components of the columns of X. Similarly, the right singular vectors V are the same as the principal components of the rows of X.
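The equivalence between the left singular vectors and the principal components can be checked numerically. The following NumPy sketch normalises the rows of a random matrix as described above and verifies that the eigenvectors of X X^T coincide, up to sign, with the columns of U.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((6, 10))
X = X - X.mean(axis=1, keepdims=True)             # centre each row
X = X / np.linalg.norm(X, axis=1, keepdims=True)  # make each row unit length

U, s, Vt = np.linalg.svd(X, full_matrices=False)

Sigma = X @ X.T                                   # covariance matrix of the rows
evals, evecs = np.linalg.eigh(Sigma)
order = np.argsort(evals)[::-1]                   # eigh sorts ascending; reverse
evals, evecs = evals[order], evecs[:, order]

# Eigenvalues of XX^T are the squared singular values, and (up to sign)
# the eigenvectors are the left singular vectors U.
assert np.allclose(evals, s**2)
assert np.allclose(np.abs(evecs), np.abs(U), atol=1e-6)
```

The absolute values are compared because each eigenvector is only defined up to a global sign flip.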

Clustering
We choose to use two clustering methods: one is mixture 4, proposed by Figueiredo et al. based on the mixture model; 21 the other is K-means. The advantages of mixture 4 are that it is capable of selecting the number of components (ie the number of clusters) and that it is relatively robust to the initialisation of the parameters. Figueiredo et al. 21 used the following finite mixture model:

p(y | θ) = Σ_{m=1}^{k} α_m p(y | θ_m),

where Y = [Y_1, ..., Y_d]^T is a d-dimensional random variable with y = [y_1, ..., y_d]^T being a realisation of Y; α_1, ..., α_k are the mixing probabilities; each θ_m is the set of parameters defining the m-th component; and θ = {θ_1, ..., θ_k, α_1, ..., α_k} is the complete set of parameters needed to specify the mixture. They used the minimum message length criterion and the component-wise EM algorithm to integrate estimation and model selection in a single algorithm. 22 This method can avoid another well-known drawback of the EM algorithm for mixture fitting, namely the possibility of convergence towards a singular estimate at the boundary of the parameter space. 21 K-means is a commonly used non-parametric clustering method, but it has some drawbacks. We propose a clustering method (density-based mean clustering [DBMC]), which is a variant of K-means but can avoid some of its drawbacks. The details of DBMC are provided in Appendix 1.
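A minimal illustration of mixture-model clustering by EM is sketched below. This is a plain one-dimensional Gaussian EM written for this example only; it omits the minimum-message-length model selection that distinguishes mixture 4, and the toy data are our own.

```python
import numpy as np

def em_mixture_1d(y, k=2, n_iter=200):
    """Plain EM for a 1-D Gaussian mixture (k = 2 sketch); a simplified
    stand-in for mixture 4, without its model-selection machinery."""
    mu = np.array([y.min(), y.max()], dtype=float)[:k]  # spread the means out
    var = np.full(k, y.var())
    alpha = np.full(k, 1.0 / k)                         # mixing probabilities
    for _ in range(n_iter):
        # E-step: responsibilities r[i, m] = P(component m | y_i)
        dens = alpha * np.exp(-0.5 * (y[:, None] - mu) ** 2 / var) \
               / np.sqrt(2 * np.pi * var)
        r = dens / dens.sum(axis=1, keepdims=True)
        # M-step: re-estimate weights, means and variances
        n_m = r.sum(axis=0)
        alpha = n_m / len(y)
        mu = (r * y[:, None]).sum(axis=0) / n_m
        var = (r * (y[:, None] - mu) ** 2).sum(axis=0) / n_m
    return alpha, mu, var, r.argmax(axis=1)

rng = np.random.default_rng(2)
y = np.concatenate([rng.normal(-5, 1, 100), rng.normal(5, 1, 100)])
alpha, mu, var, labels = em_mixture_1d(y)
```

On well-separated data, EM recovers the two component means and assigns each point to the nearest component via its responsibilities.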

Number of subpopulations
There are many methods for estimating the number of clusters, which can also be used for estimating the number of subpopulations. Zhu et al. showed that the Bayesian information criterion (BIC) from their mixture model performed better than STRUCTURE in inferring the number of subpopulations. 23 All of these methods can be integrated into the clustering procedure.
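The BIC-based choice of the number of subpopulations can be sketched as follows: fit a mixture for each candidate k, compute BIC = -2 log L + p log n, and keep the k with the smallest value. The one-dimensional EM fitter here is a toy stand-in for the mixture models used in the paper, with our own initialisation and variance floor.

```python
import numpy as np

def em_loglik_1d(y, k, n_iter=300):
    """Fit a 1-D Gaussian mixture by plain EM and return its log-likelihood."""
    mu = np.quantile(y, (np.arange(k) + 0.5) / k)   # spread initial means out
    var = np.full(k, y.var())
    alpha = np.full(k, 1.0 / k)
    for _ in range(n_iter):
        dens = alpha * np.exp(-0.5 * (y[:, None] - mu) ** 2 / var) \
               / np.sqrt(2 * np.pi * var)
        r = dens / dens.sum(axis=1, keepdims=True)
        n_m = r.sum(axis=0)
        alpha = n_m / len(y)
        mu = (r * y[:, None]).sum(axis=0) / n_m
        var = np.maximum((r * (y[:, None] - mu) ** 2).sum(axis=0) / n_m, 1e-3)
    dens = alpha * np.exp(-0.5 * (y[:, None] - mu) ** 2 / var) \
           / np.sqrt(2 * np.pi * var)
    return np.log(dens.sum(axis=1)).sum()

def bic_select(y, k_max=4):
    """Pick the number of components by minimising BIC = -2 logL + p log n."""
    n, bics = len(y), {}
    for k in range(1, k_max + 1):
        p = 3 * k - 1                 # k means, k variances, k - 1 free weights
        bics[k] = -2 * em_loglik_1d(y, k) + p * np.log(n)
    return min(bics, key=bics.get), bics

rng = np.random.default_rng(3)
y = np.concatenate([rng.normal(-5, 1, 100), rng.normal(5, 1, 100)])
best_k, bics = bic_select(y)
```

For two well-separated modes, the likelihood gain from extra components is small relative to the p log n penalty, so BIC settles on k = 2.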

Missing data imputation
It is not uncommon to have missing values in genetic studies. Such data can be manually flagged and excluded from subsequent analyses. 24 Many analytical methods, such as PCA or SVD, require complete matrices. 25 Although some studies have reported dealing with SVD/PCA with missing data, 26-29 they often rely on specific probabilistic models and have limited generalisability. Although one solution to the missing data problem is to repeat the experiment, and this method has been used in the validation of microarray analysis algorithms, 30 this strategy may be too expensive and impractical for most studies. Therefore, we need to estimate the missing values by non-experimental methods.
There is little published literature concerning missing value estimation for genotype data from human populations. The uniqueness of this issue is that the genotype data are categorical by nature. Note that Sen and Churchill 31 and Broman et al. 32 discussed using the EM algorithm and the hidden Markov model to deal with missing genotypes, but they were mostly concerned with experiments involving inbred animals.
Genotype data are usually in the form of large matrices of genotypes of marker loci (columns) from different persons (rows). 3 Without loss of information, we can transform this person-marker matrix into a genotype-person matrix. For each marker, all of the genotypes appearing in the data are listed, one genotype per row, with a value of one for a cell if the person (column) has that genotype, and zero otherwise. Using this reformatting, we now have a large 0-1 matrix.
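The person-marker to genotype-person transformation can be sketched directly. The helper below is our own illustration, with hypothetical two-marker data; it builds the 0-1 matrix one (marker, genotype) row at a time.

```python
import numpy as np

def genotype_person_matrix(G):
    """Convert a person-by-marker genotype table into a 0/1 genotype-person
    matrix. Each output row is one (marker index, genotype) pair observed in
    the data; each output column is one person."""
    n_persons, n_markers = len(G), len(G[0])
    rows, row_labels = [], []
    for j in range(n_markers):
        for g in sorted({G[i][j] for i in range(n_persons)}):
            rows.append([1 if G[i][j] == g else 0 for i in range(n_persons)])
            row_labels.append((j, g))
    return np.array(rows), row_labels

# Hypothetical data: three persons typed at two markers.
G = [["AA", "CT"],
     ["AG", "CC"],
     ["AA", "CC"]]
M, row_labels = genotype_person_matrix(G)
```

Each column sums to the number of markers (every person has exactly one genotype per marker), which is why the original person-marker matrix is fully recoverable from the 0-1 form.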
We can view this genotype-person matrix as a frequency matrix, with each cell denoting the frequency with which the person (its column) has the genotype denoted by its row. Such frequency matrices are commonly used in LSA and are called 'word-document matrices' or 'term-document matrices'. 14,18,33 Because we can fully reconstruct the original person-marker matrix from this genotype-person matrix, there is no information loss in this transformation. We impute the missing values on the basis of this genotype-person matrix. In this study, we used an imputation method similar to the 'K nearest-neighbour' (KNN)-based method used in Troyanskaya et al. 34 The rationale underlying this method is that where data points are clustered together (similar) in the lower dimension, we can expect them to be clustered together (similar) in the higher dimension as well. In this way, the missing dimensions of a data point (an individual, in our case) can be estimated from its neighbours (those which are very similar to the data point under study) with no missing data in these dimensions. Details of this method are provided in Appendix 2. In this study, we did not iterate to impute the missing values (ie we only use KNN once).
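A hedged sketch of the KNN-style imputation idea (not the exact procedure of Appendix 2): for each row with missing entries, find the k most cosine-similar complete rows on the observed dimensions and average their values in the missing dimensions. The toy 0/1 data below are our own.

```python
import numpy as np

def knn_impute(X, k=3):
    """Fill missing entries (NaN) using the k most similar complete rows,
    with cosine similarity computed over the observed dimensions only."""
    X = X.astype(float).copy()
    complete = ~np.isnan(X).any(axis=1)
    for i in np.where(~complete)[0]:
        miss = np.isnan(X[i])
        a = X[i, ~miss]                          # observed part of this row
        B = X[complete][:, ~miss]                # same dimensions, complete rows
        sims = B @ a / (np.linalg.norm(B, axis=1) * np.linalg.norm(a) + 1e-12)
        nn = np.argsort(sims)[::-1][:k]          # k nearest neighbours
        X[i, miss] = X[complete][nn][:, miss].mean(axis=0)
    return X

rng = np.random.default_rng(4)
base = np.vstack([rng.binomial(1, 0.9, (5, 8)),
                  rng.binomial(1, 0.1, (5, 8))]).astype(float)
Xm = base.copy()
Xm[0, 3] = np.nan                                # knock out one cell
Xi = knn_impute(Xm, k=3)
```

Because the neighbours are drawn from the same (similar) group, the imputed cell is an average over rows with the same profile, exactly the "similar in low dimension, similar in high dimension" rationale.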

Data
To evaluate fully the performance of the proposed strategy, we applied it to two real datasets and two simulated datasets. The first real dataset was that reported in Rosenberg et al., 5 which has genotypes at 377 autosomal microsatellite markers in 1,056 individuals from 52 populations. Here, we considered the whole dataset, as well as two American populations, the Pima and Surui populations, with 25 and 21 individuals, respectively, to demonstrate our methods. The other real dataset was from the HapMap project; 35 we used genotype data from 45 Chinese and 44 Japanese individuals on chromosome 17, released in October 2005. A dataset was formed from the 500 most informative single nucleotide polymorphisms (SNPs) using the methods proposed in Rosenberg et al. 36 One simulated dataset was generated under the coalescent model using MS, a program developed by Hudson. 37 A progenitor population gave rise to two subpopulations 3,000 generations ago. Subpopulation 1 had a constant size of 10,000, began to grow exponentially 1,000 generations ago and has now reached 40,000. Subpopulation 2 had a constant size of 2,000 until 2,000 generations ago, then instantaneously expanded to 10,000 and has remained at that size until the present. We also assume that the mutation rate per site per generation is 10^-8 and that we are interested in a segment of 10 kilobases. No recombination is set. In this fashion, we generated 100 such chromosomal segments, each segment harbouring 27-77 SNPs. The chromosomal segments are pooled together to produce more genotype data. A dataset is formed by randomly sampling 400 haplotypes

Population structure identification
We used STRUCTURE 2.0 10 and our SVD-based procedure on the datasets. In the second stage (the clustering stage) of our procedure, we used both the mixture model and K-means methods for clustering. For STRUCTURE, we tried the four available models. If not stated explicitly, we used the default model with admixture and correlated allele frequencies and set both the burn-in length and the number of MCMC replications after burn-in to 20,000.
For Rosenberg et al.'s full dataset, 5 we followed their procedure and ran analyses (STRUCTURE 2.0 and mixture 4) multiple times for numbers of populations (clusters) from two to six. Table 1 shows the Pearson correlation coefficients between the results of the STRUCTURE and mixture 4 analyses using the first five principal components. For numbers of populations (clusters) K = 5 and 6, the relatively small correlation coefficients of two clusters (cluster 5 for K = 5 and cluster 1 for K = 6) are caused by the accumulated differences between the estimates of the two methods (the probabilities of belonging to one cluster, or membership coefficients, must sum to 1 across clusters).
We tried the four models available in STRUCTURE 2.0 on the Pima-Surui data subset with 100 randomly chosen markers. The model assuming independent allele frequencies among populations with no admixture yielded the best results. The burn-in length and the number of MCMC replications after burn-in were both 10,000 in these analyses. The results are summarised in Table 2. Clearly, it is difficult to conclude from the above results that there are two populations in the dataset, but once we set K = 2, STRUCTURE 2.0 can assign each individual correctly to the population it belongs to.
Before performing SVD, we first transformed the data into the genotype-person format. Wall et al. reported that pre-processing is critical in SVD/PCA, 38 which is well known in LSA. 14,18,19,39 We applied the so-called tf-idf transformation 18,39 to the genotype-person matrix. For the Pima-Surui sub-dataset with 100 randomly chosen markers, on the reduced two-dimensional space, mixture 4 finds two clusters. The mixture 4 and K-means methods assign individuals correctly to their populations with two clusters. Figure 1 plots pairwise cosine similarities between individuals in the reduced two-dimensional space. Figure 2 shows the pairwise cosine similarities between individuals using the original data without reduction. It seems that the original data are noisier and that SVD not only reduces the dimension but also reduces noise.
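One common variant of the tf-idf weighting can be sketched as follows (several variants exist in the literature, and the paper does not specify which was used). Note how a genotype shared by every person receives zero weight, ie is treated as non-informative; the tiny frequency matrix is hypothetical.

```python
import numpy as np

def tf_idf(F):
    """tf-idf weighting of a genotype-by-person frequency matrix F.

    tf: each column scaled to sum to 1; idf: log(n / document frequency),
    where the 'document frequency' of a genotype row is the number of
    persons carrying it. This is one variant among several."""
    F = F.astype(float)
    tf = F / F.sum(axis=0, keepdims=True)
    df = (F > 0).sum(axis=1)          # in how many persons each genotype appears
    idf = np.log(F.shape[1] / df)
    return tf * idf[:, None]

# Hypothetical genotype-person frequency matrix: 3 genotype rows, 3 persons.
F = np.array([[1, 0, 1],
              [0, 1, 0],
              [1, 1, 1]])              # last genotype is carried by everyone
W = tf_idf(F)
```

The last row is down-weighted to zero (idf = log 1 = 0), while rarer genotypes such as the second row receive the largest weights, which is the intended noise- and redundancy-suppressing effect of the transformation.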
To evaluate our method's performance, we reduced the number of markers. When the numbers of markers were 80, 60, 40 and 20, both methods performed equally well (data not shown). When the number of markers was reduced to ten, both methods still performed well, with STRUCTURE slightly better when the marker information was limited (data not shown).
To compare the performance of the methods fully, we conducted a simulation study using population genetics models. Table 3 shows the results for three datasets from the simulation study. It has been reported that STRUCTURE provides very stable estimates when the model assumptions hold. 5 In the presence of tightly linked SNPs, STRUCTURE not only performed worse but was also not very stable. For example, Table 3 shows that when 494 SNPs were used, the best performance of STRUCTURE misclassified only three individuals; however, the median number of individuals misclassified by STRUCTURE was 33. For the same dataset, the time taken for the analyses (using a laptop with an Intel [...]) [...]; our method classified all individuals correctly. The inferior performance of STRUCTURE was partly due to the fact that some of the markers in the dataset were tightly linked. For example, the distances between some SNPs were less than 100 base pairs. It is well known that the Chinese and Japanese populations are closely related and very difficult to distinguish in genetic studies. It is not too difficult, however, to distinguish between them with informative markers. This also confirms that SNPs can be very informative in inferring population structure. 40 We used the simulated data with admixed individuals to evaluate the performance of the proposed strategy on data exhibiting population admixture. Figure 3 shows the results. The membership coefficients were calculated following the method of Nascimento et al. 41 We found that STRUCTURE performed slightly better than the proposed strategy. This is expected because, when the model assumptions hold, parametric methods with correct assumptions should always perform better.

Evaluation of DBMC
We used our DBMC method on the reduced data to evaluate its performance. DBMC performed well in different reduced dimensions. Figure 4 shows the formation of the initial partitioning by DBMC. There were four points (blue circles without other symbols superimposed) left ungrouped by the Gap statistics. They were classified into the second cluster by the initialisation procedure. Therefore, after the density-based initialisation, all but one individual (number 42 in the figures, with coordinates -5.95710, -0.19782) were correctly classified. Only one iteration was required to finish the clustering correctly.

Discussion
We have described a two-stage strategy for using multilocus genotype data to examine population structure and to assign individuals to populations. We prefer to call this approach a strategy, instead of a method, because it provides a framework, not just one method. One can choose different dimension reduction and clustering methods to fit into the framework.
Our strategy does not rely on any population genetic assumptions, such as Hardy-Weinberg equilibrium and linkage equilibrium between loci within populations. This means that violation of these assumptions does not invalidate our strategy. For model-based methods, the violation of assumptions makes these methods invalid, at least theoretically, although some methods may be robust to certain departures from assumptions. We have shown, through simulation and real data analyses, that the proposed approach is not affected by departure from the linkage equilibrium assumption for markers in the data; however, tightly linked markers may provide redundant information, so more markers are usually needed. In this situation, the validity of our strategy is not affected, but the validity of model-based methods becomes questionable.
It has been reported that pre-transformation is critical in SVD. 18,38,39 We chose to use the tf-idf transformation, which is widely used in information retrieval, but other pre-transformations are possible. Before the pre-transformation, it would be helpful to eliminate the non-informative rows and columns. 19 In our experiment, we only eliminated markers in situations where just one genotype appears in the whole sample. It is possible to use some criteria (such as entropy) to filter out the non-informative data. This would make the analysis faster (because the matrix becomes smaller) and more efficient (because the remaining matrix is more informative).
There are many dimension reduction techniques besides SVD, such as PCA and its variants (probabilistic PCA, non-linear PCA etc), correspondence analysis, multidimensional scaling, independent component analysis, projection pursuit and projection pursuit regression, principal curves and methods based on topologically continuous maps and neural networks. 42 The reason we chose SVD is that many efficient algorithms exist for this method. Because we only need a few principal components, even more efficient algorithms are available for this purpose. Although the person-marker matrix can be large if there are many markers or individuals, the SVD procedure can be performed very quickly.

In addition, it would be more statistically sound to view our genotype-person matrix as a binary matrix. Several methods have been proposed for PCA on binary data, 43-45 but they implicitly assume that the observations are independent across the dimensions, which does not apply in our case: the genotypes are by no means independent at each marker. Nevertheless, we have tried logistic PCA 44 on our data, but the results were not as good as those achieved by treating the genotype-person matrix as a real matrix. Figure 5 shows the first three principal components from logistic PCA using the criterion that the log-likelihood change is less than 0.1. It is obvious that the two clusters are not clearly separated. Although we could perform dimension reduction using probabilistic PCA on the original genotype data matrix, which is a categorical matrix, no methods are available for such analyses. One PCA method for categorical data is implemented in SPSS software (SPSS Inc.); it uses the optimal scaling (or optimal scoring) approach to turn the categorical problem into a quantitative one, 46 and applies PCA to the numeric matrix.
The coding scheme of the categorical data in the original matrix may affect the resulting numeric matrix, however; moreover, one needs to set the optimal scaling level for analysis variables subjectively, whereas we intend to avoid subjective choices in our strategy.
It is an open question how to choose the dimensionality of the reduced space when using SVD or PCA. Much work has been done on this topic, including the likelihood ratio, minimum description length (MDL), Akaike information criterion (AIC), BIC, 47 Laplace's method, 48 and the probabilistic PCA model. 26 All of these methods are based on some probabilistic model, however, usually with a normality assumption. In our case, the assumption is obviously not appropriate. Laplace's method seems to give a more reasonable choice than the others, but can only serve as a guideline. Because our purpose is clustering, one possible way of choosing the optimal dimension is by clustering results. For each given dimension, we can perform cluster analysis on the reduced space and evaluate the resulting clusters, for example by the between- to within-cluster variation. We can then select the optimal dimension as the one with the best clustering evaluation. Methods based on appropriate models (perhaps binary or categorical models) or non-parametric (empirical) approaches should be more appropriate for our problem, and we are planning to investigate this in the future.
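The dimension-selection idea described above can be sketched as a loop: reduce to each candidate dimension, cluster, and score the partition by the between- to within-cluster sum-of-squares ratio. Everything below (the toy data and the deterministic K-means initialisation) is our own illustration, not the paper's procedure.

```python
import numpy as np

def between_within_ratio(Z, labels):
    """Between-cluster to within-cluster sum-of-squares ratio (higher is better)."""
    overall = Z.mean(axis=0)
    between = within = 0.0
    for j in np.unique(labels):
        C = Z[labels == j]
        between += len(C) * np.sum((C.mean(axis=0) - overall) ** 2)
        within += np.sum((C - C.mean(axis=0)) ** 2)
    return between / within

def kmeans2(Z, n_iter=50):
    """Basic 2-cluster K-means with deterministic farthest-point initialisation."""
    centroids = np.array([Z[0], Z[np.argmax(np.linalg.norm(Z - Z[0], axis=1))]])
    for _ in range(n_iter):
        labels = np.linalg.norm(Z[:, None] - centroids[None], axis=2).argmin(axis=1)
        centroids = np.array([Z[labels == j].mean(axis=0) for j in range(2)])
    return labels

rng = np.random.default_rng(5)
X = np.vstack([rng.normal(0, 1, (40, 30)), rng.normal(1.5, 1, (40, 30))])
Xc = X - X.mean(axis=0)
U, s, Vt = np.linalg.svd(Xc, full_matrices=False)

scores = {}
for d in range(1, 6):                       # candidate reduced dimensions
    Z = U[:, :d] * s[:d]
    scores[d] = between_within_ratio(Z, kmeans2(Z))
best_d = max(scores, key=scores.get)
```

Here the group separation lies entirely along the first principal direction, so adding further dimensions only adds within-cluster variance and the score selects d = 1; on real data the curve would typically peak at a larger dimension.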
Ando observed that, in LSA using SVD, the topics underlying outlier documents (ie those documents that are very different from other documents) tend to be lost as lower numbers of dimensions are chosen. 33 A general explanation of the good performance of LSI is that when eigenvectors with smaller eigenvalues are left out, noise is eliminated and, as a result, the similarities among the linguistic units are measured more accurately than in the original space. According to the mathematical formulation of SVD, dimension reduction comes from two sources: outlier documents and minor terms. These two types of noise are mathematically equivalent and are inseparable under SVD. However, people do not want to consider the outlier documents as 'noise' when their interest is in characterising the relationships among the documents while all the documents are assumed to be equal. In our case, fine structure (small numbers of individuals who are very different from the others) may be lost, especially when the sample size is small. Hastie et al. noted that finer structure can be lost with any dimension reduction method. 49 Ando proposed an algorithm which differs from SVD in that terms and documents are treated in a non-symmetrical way. 33 By scaling the vectors in each computation of eigenvectors, his algorithm tries to eliminate noise from the minor terms without eliminating the influence of the outlier documents. Further analyses are needed to evaluate this method.
In this paper, we chose to use mixture models as our clustering methods. The advantages of the mixture models include their readiness for use, their ability to choose the number of clusters automatically and their computational efficiency. These are by no means the only choice, however, and we have also considered K-means methods here. In fact, both mixture models and DBMC perform well. Conventional K-means performs a little worse (given the number of clusters a priori). In our analysis, it happened that some initial values produced very different (worse) clustering results with conventional K-means. In general, when the sample size is large and the model provides a reasonable description of the data, mixture models (model-based methods in general) perform well. When the clusters are restricted to globular regions, K-means should work well. In our analysis of the Pima-Surui dataset, our sample size was not small (25 and 21 for two clusters) and the cluster shapes were convex (data not shown), so it is not surprising that both mixture models and K-means performed well.
We used cosine similarity to measure 'similarity' between individuals. Because cosine similarity is easy to interpret and simple to compute, it is widely used in text mining and information retrieval. 14,18,19 It is natural to measure 'similarity' between vectors by their inner product. Cosine is closely related to the inner product and to correlation. If the vectors have unit length, cosine is equivalent to the inner product. If the vectors are centred, cosine is the same as correlation.
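These two identities are easy to verify numerically on arbitrary vectors:

```python
import numpy as np

def cosine(a, b):
    """Cosine similarity between two vectors."""
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

rng = np.random.default_rng(6)
x = rng.standard_normal(20)
y = rng.standard_normal(20)

# Unit-length vectors: cosine equals the inner product.
xu, yu = x / np.linalg.norm(x), y / np.linalg.norm(y)
assert np.isclose(cosine(xu, yu), xu @ yu)

# Centred vectors: cosine equals the Pearson correlation of the originals.
xc, yc = x - x.mean(), y - y.mean()
assert np.isclose(cosine(xc, yc), np.corrcoef(x, y)[0, 1])
```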
Our strategy can be used for identifying populations and assigning individuals in situations where there is little information about population structure. It should also be useful in situations where cryptic population structure is a concern, such as in case-control studies in association mapping.
In summary, we find that the strategy we have described in this paper has the ability to identify population structure, make correct inferences about the number of subpopulations and assign individuals to their corresponding subpopulations. Most of all, it is model free and does not depend on any genetic assumptions. Although it has several advantages over its parametric counterparts, as pointed out by Tang et al., 12 'no one method is universally preferable'; nevertheless, it provides a useful alternative for analysing genetic data for population structure inference.

Acknowledgments
We thank Dr Mário Figueiredo for providing the source code for his mixture model and patiently answering questions; Mr Preslav Nakov for providing information about preprocessing; Dr Thomas Minka for providing the source code for the choice of dimensionality for PCA; Dr Michael Tipping for answering questions about dimension reduction for binary data; Dr Andrew Schein for providing the code for logistic PCA; Dr Noah Rosenberg for providing the data; Drs Rebecka Jörnsten and Hongyuan Zha for discussion on clustering number selection; Dr Andrew Pakstis for providing information about marker selection; and Dr Baolin Wu for general discussions. This work was supported in part by National Institutes of Health grants GM59507 and GM57672.

Appendix 1
A variant of the K-means method

K-means is a commonly used non-parametric clustering method, but it has the following drawbacks: (1) the initial partition may affect the results, and randomisation is often used but has had limited success; 13 (2) the procedure may not converge: if the procedure is not well defined, it is quite possible for it to oscillate indefinitely between two or more partitions and never converge, a defect recognised by the developer of the K-means method; 50 (3) it cannot determine the number of clusters, which is either preset or visually determined.

We propose a clustering method which is based on K-means and can avoid the above drawbacks. The basic idea of our method is to identify a cluster starting from the point with the highest density around it in the current dataset. To be more specific, suppose we are given n data points X_1, X_2, ..., X_n. Let π_1, π_2, ..., π_k denote a partitioning of the data into k disjoint clusters such that π_1 ∪ π_2 ∪ ... ∪ π_k = {X_1, ..., X_n} and π_i ∩ π_j = ∅ for i ≠ j. The algorithm of this method (density-based mean clustering [DBMC]) is as follows. Vary the total number of clusters from k = 1, 2, ..., K. For each k, perform the following procedure: [...] compute the new centroids and repeat this updating procedure until a certain stopping criterion, described below, is met. 6. Evaluate the resulting clusters as described below.
Among all of the k values studied, select the best clustering according to the quality evaluated in step 6.
For step 5, a stopping criterion is needed; an example is to stop when E({π_j^(t)}) no longer increases with t, where one choice for E(·) is the objective function discussed by Dhillon and Modha 19 and an alternative candidate is the between- to total- (between-cluster plus within-cluster) sums of squares. 51 To ensure the convergence of {π_j^(t)}_{j=1}^{k} when we use the between- to total- sums of squares as the stopping criterion, at each iteration we choose the new partitioning to have a larger value of the between- to within- sums of squares, or else the iteration stops. Therefore, the algorithm outlined above never results in a decrease in the E(·) value, which is bounded from above by some constant. 19 Therefore, if DBMC is iterated indefinitely, then the value of E(·) will eventually converge. Note that this only means that the algorithm procedure will converge; it does not imply that the underlying partitioning {π_j^(t)}_{j=1}^{k} converges. 19,52

There are a number of methods for estimating the number of clusters. 53 Here, we chose the Gap statistic using the resampling method. 54 Suppose that the maximum possible number of clusters in the data is M. The basic idea of the Gap method for estimating the number of clusters K is to identify K̂, 1 ≤ K̂ ≤ M, which provides the strongest significant evidence against the null hypothesis H_0 of K = 1, that is, 'no cluster' in the data. The Gap method employs the so-called uniformity hypothesis, which states that the data are sampled from a uniform distribution in the d-dimensional space. It compares an observed internal index, such as the within-clusters sum of squares, to its expectation under a reference null distribution via resampling, and chooses the smallest k which maximises the Gap statistic as the number of clusters. 53,54 The basic idea of the Gap statistics for estimating cluster size 49 is similar to that of the Gap method for estimating the number of clusters.
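The monotone stopping criterion can be sketched as follows: a reassignment pass is accepted only if the between- to total-sums-of-squares ratio increases, so the criterion value never decreases and the iteration must terminate. This illustrates only the convergence argument, not the full DBMC algorithm (the density-based initialisation and the Gap statistic are omitted), and the toy data are our own.

```python
import numpy as np

def between_total_ratio(Z, labels):
    """Between-cluster to total sum-of-squares ratio, bounded above by 1."""
    overall = Z.mean(axis=0)
    total = np.sum((Z - overall) ** 2)
    between = sum(np.sum(labels == j) * np.sum((Z[labels == j].mean(axis=0) - overall) ** 2)
                  for j in np.unique(labels))
    return between / total

def monotone_kmeans_pass(Z, labels, k, max_iter=100):
    """Accept a centroid/reassignment pass only if the criterion grows."""
    score = between_total_ratio(Z, labels)
    for _ in range(max_iter):
        centroids = np.array([Z[labels == j].mean(axis=0) for j in range(k)])
        new_labels = np.linalg.norm(Z[:, None] - centroids[None], axis=2).argmin(axis=1)
        new_score = between_total_ratio(Z, new_labels)
        if new_score <= score:
            break                      # the criterion stopped improving
        labels, score = new_labels, new_score
    return labels, score

rng = np.random.default_rng(7)
Z = np.vstack([rng.normal(-3, 1, (20, 2)), rng.normal(3, 1, (20, 2))])
labels0 = np.array([0] * 20 + [1] * 20)
labels0[:3] = 1                        # corrupt the initial partition slightly
final, score = monotone_kmeans_pass(Z, labels0, k=2)
```

Starting from a slightly corrupted partition, one accepted pass repairs the misassigned points and the next pass leaves the criterion unchanged, so the loop stops, mirroring the bounded-monotone argument in the text.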