Reconstructing the genomic architecture of mammalian ancestors using multispecies comparative maps

Rapidly developing comparative gene maps in selected mammal species are providing an opportunity to reconstruct the genomic architecture of mammalian ancestors and study rearrangements that transformed this ancestral genome into existing mammalian genomes. Here, the recently developed Multiple Genome Rearrangement (MGR) algorithm is applied to human, mouse, cat and cattle comparative maps (with 311-470 shared markers) to impute the ancestral mammalian genome. Reconstructed ancestors consist of 70-100 conserved segments shared across the genomes that have been exchanged by rearrangement events along the ordinal lineages leading to modern species genomes. Genomic distances between species, dominated by inversions (reversals) and translocations, are presented in a first multispecies attempt using ordered mapping data to reconstruct the evolutionary exchanges that preceded modern placental mammal genomes.


Introduction
Great strides in understanding the evolutionary history of whole vertebrate genomes have been made over the past decade with the explosion of comparative mapping and sequencing data from diverse organisms. 1-7 Comparative maps from birds and mammals, coupled with recent human and mouse genomic sequences, have already provided many interesting insights into the evolutionary patterns and potential forces behind chromosomal rearrangements in vertebrates. 5-9 Previous vertebrate gene order comparisons, however, have been limited either to single chromosome comparisons of multiple genomes 5,6,10-12 or to defining conserved segments between two whole genomes, rather than between multiple whole genomes. 3-6,11,13-16 Comparative studies to identify and quantify the extent of conserved segments between two genomes are often based on the breakpoint analysis approach pioneered by Nadeau and Taylor. 17 These early studies of rearrangements between human and mouse genomes considered breakpoints independently, without revealing combinatorial dependencies between related breakpoints. Kececioglu and Sankoff 18 were the first to explore the importance of dependencies between breakpoints, and developed an approximation algorithm for the reversal distance problem (eg studies of rearrangements in unichromosomal genomes). Hannenhalli and Pevzner 19,20 developed a polynomial-time algorithm for the reversal distance problem, which was extended to the genomic distance problem of finding a most parsimonious scenario for multichromosomal genomes under inversions (reversals), translocations, fusions and fissions of chromosomes. 21-23 Although these studies provided efficient algorithms to study rearrangements between two genomes, integrating data from multiple genomes (genome phylogeny) poses a more difficult problem. Previous genome phylogeny analyses were based on breakpoint distances that measure the number of breakpoints between two genomes. 24-26
Bourque and Pevzner 27 proposed a new approach, the Multiple Genome Rearrangement (MGR) algorithm, based upon the reversal/genomic distance. The MGR applications demonstrated important advantages of the reversal/genomic distance over the breakpoint distance. One strength of this new method is that it is directly adaptable to multichromosomal genomes, a variable unexplored in breakpoint distance approaches to date. The method is applicable to any group of multichromosomal organisms with comparative mapping data on the same set of markers, and can provide an estimate of original synteny (an ancestral genome) in the organisms under study. 28,29 Recently, other methods studying rearrangement scenarios using the reversal distance were developed 30,31 but, so far, these methods are restricted to the median problem of unichromosomal genomes.
Here, an expanded set of homologous syntenic markers between the human, cat and mouse genomes is analysed, along with a set shared between human, cat and cattle genomes. Moreover, we derive parsimonious genome rearrangement scenarios for these species and impute the hypothetical ancestral genomes for these index species. These inferences were largely concordant with reconstructions of the ancestral placental mammal karyotype based on comparative cytogenetic approaches, 8,32,33 validating the MGR approach 27 for using moderately dense comparative maps across mammalian orders to define the exchanges that led to modern genome reorganisation in each lineage.
Supporting information on the two datasets (human-cat-cow and human-cat-mouse) has been posted at www.ingenta.com.

MGR algorithm
The MGR algorithm developed by Bourque and Pevzner 27 reconstructs a rearrangement-based evolutionary tree, considering reversals (more commonly called inversions), translocations, fusions and fissions. MGR is based on the Hannenhalli-Pevzner theory 34 and a fast implementation of the multichromosomal genome rearrangement algorithm 22,23 called GRIMM. The MGR algorithm works in two stages. Assume one wishes to attempt to reconstruct the rearrangement scenario of m genomes. In the first stage, rearrangement events in genome i (1 ≤ i ≤ m) that bring it closer to each of the remaining m − 1 genomes are iteratively carried out in a carefully selected order. The rearrangements performed in the first stage are very reliable. 27 In fact, when there are only three genomes (m = 3), all three genomes are converted into the real ancestor if the tree is additive. In the case of non-additive trees, the first step stops before converging to an ancestor and an intermediate genome, or preancestor, is left. Because the moves made to reach the preancestors in the first stage were made with the highest confidence, it can be argued that studying them can provide insights into the global rearrangement scenario. In the second stage, the conditions for rearrangements to be carried out are relaxed by choosing a rearrangement in genome i that brings it closer to t = m − 2 out of m − 1 other genomes. We stop once again if all genomes converge to an ancestor. Otherwise, the parameter t is further lowered. For a full description of the algorithm, see Bourque and Pevzner. 27 In the context of genome rearrangements, genomes are typically viewed as signed permutations, where each integer corresponds to a unique gene/marker and the sign corresponds to its orientation. By contrast, comparative maps usually correspond to unsigned permutations, ie no information on the sign of the markers is available.
Since no efficient algorithms for rearrangement analysis of unsigned permutations are available, Bourque and Pevzner 27 searched for strips in the unsigned permutations to infer the signed permutations from the original data. 35 A strip is two or more markers that appear consecutively in all genomes in the exact same order, or reversed order (to which we assign reversed signs), without any interruption by other markers. A marker that is not part of any strip is called a singleton and is dropped from the signed permutation due to uncertainty in its sign. Below, we propose a new, more flexible, method to recover the signed permutation from the comparative mapping data that uses clusters (two or more markers located closely to each other in all genomes) instead of strips. This new method is less sensitive to local mapping errors and to micro-rearrangements that can complicate the recovery of the global rearrangement scenario.
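The strip definition above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names (`positions`, `neighbours`, `find_strips`, `signed_permutation`) and the representation of a genome as a list of chromosomes, each an ordered list of integer marker identifiers, are assumptions made for illustration, and all genomes are assumed to carry the same marker set.

```python
from typing import Dict, List, Tuple

Genome = List[List[int]]  # chromosomes, each an ordered list of marker ids


def positions(genome: Genome) -> Dict[int, Tuple[int, int]]:
    """Map each marker to (chromosome index, index within that chromosome)."""
    return {m: (c, i) for c, chrom in enumerate(genome) for i, m in enumerate(chrom)}


def neighbours(pos: Dict[int, Tuple[int, int]], a: int, b: int) -> bool:
    """True if a and b sit next to each other on the same chromosome."""
    (ca, ia), (cb, ib) = pos[a], pos[b]
    return ca == cb and abs(ia - ib) == 1


def find_strips(genomes: List[Genome]) -> List[List[int]]:
    """Maximal runs of two or more markers, read along genomes[0], that are
    consecutive in every genome (same or reversed order). A chain of pairwise
    neighbours necessarily keeps one direction within a genome, so only
    adjacency needs to be checked."""
    others = [positions(g) for g in genomes[1:]]
    strips = []
    for chrom in genomes[0]:
        if not chrom:
            continue
        run = [chrom[0]]
        for a, b in zip(chrom, chrom[1:]):
            if all(neighbours(p, a, b) for p in others):
                run.append(b)
            else:
                if len(run) >= 2:
                    strips.append(run)
                run = [b]  # singletons are simply never recorded
        if len(run) >= 2:
            strips.append(run)
    return strips


def signed_permutation(genome: Genome, strips: List[List[int]]) -> List[List[int]]:
    """Rewrite a genome as signed strip labels (1-based, numbered along
    genomes[0]); a strip read in reversed order gets a negative sign."""
    label = {m: k for k, s in enumerate(strips, start=1) for m in s}
    result = []
    for chrom in genome:
        index = {m: i for i, m in enumerate(chrom)}
        signed, seen = [], set()
        for m in chrom:
            k = label.get(m)
            if k is None or k in seen:
                continue  # dropped singleton, or strip already emitted
            seen.add(k)
            first, second = strips[k - 1][0], strips[k - 1][1]
            signed.append(k if index[first] < index[second] else -k)
        result.append(signed)
    return result
```

For the three single-chromosome marker orders used in the Figure 1 caption (1,2,3,4,5,6; 1,2,3,6,5,4; 3,1,2,4,5,6), `find_strips` returns the strips [1,2] and [4,5,6], marker 3 is dropped as a singleton, and the second genome becomes the signed permutation 1, -2.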

GRIMM-synteny algorithm for cluster generation
A particularly confounding variable in comparative genome analysis is the distinction between small micro-rearrangements that interrupt conserved segments and exceptional singleton markers that reflect imprecise map orders or mapping/assembly errors. Making this aspect more perplexing are recent comparisons of full genome sequences for mouse and human which show significantly more rearrangements than previously predicted, due to evidence of multiple micro-rearrangements within previously defined conserved segments. 3-5,36 Here, the notion of conserved segments is relaxed and the notion of a gene (marker) cluster introduced. Every cluster (comparable to a synteny block) corresponds to a set of markers located close to each other in each of the genomes under study. The order of markers within the cluster may vary from one genome to another, and may reflect mapping imprecision or actual micro-rearrangements. 37 Thus, clusters are the fragments of the genome that can be converted into conserved segments by micro-rearrangements (eg by reversals spanning relatively few markers). Local errors in comparative maps and micro-rearrangements make it non-trivial to find clusters. 25,38-40 Here, we describe the clustering algorithm using three genomes (human, cat and mouse) with comparative mapping data, but the algorithm applies to two or more genomes. 27,36
To perform the multispecies genome comparisons, we first concatenate chromosomes in human, cat and mouse to form a single coordinate system for each genome based on n markers. The markers in each concatenation are assigned coordinates 1, 2, ..., n. A marker located at position h in human, c in cat and m in mouse is assigned a coordinate (h, c, m) that can be viewed as an element of a three-dimensional n by n by n grid. Triplets of chromosomes divide this grid into boxes (the human, cat and mouse comparison has 24 × 20 × 21 boxes).
Every marker is on a triplet of chromosomes (one from human, one from cat and one from mouse). The distance between two points (h1, c1, m1) and (h2, c2, m2) from the same chromosome triplet (the same box) is the Manhattan distance |h2 − h1| + |c2 − c1| + |m2 − m1|. The distance between points from different chromosome triplets is defined as infinity.
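As a minimal sketch of this distance, the following Python fragment assumes a hypothetical `Marker` record holding one coordinate and one chromosome name per genome; neither the record nor the function name comes from the original software.

```python
import math
from typing import NamedTuple, Tuple


class Marker(NamedTuple):
    coords: Tuple[int, ...]   # position in each concatenated genome, e.g. (h, c, m)
    chroms: Tuple[str, ...]   # chromosome in each genome (illustrative names)


def marker_distance(a: Marker, b: Marker) -> float:
    """Manhattan distance within one chromosome tuple (one box of the grid);
    infinity if the two markers lie in different boxes."""
    if a.chroms != b.chroms:
        return math.inf
    return sum(abs(x - y) for x, y in zip(a.coords, b.coords))
```

Two markers that are adjacent in all three genomes and share a chromosome triplet are at distance 1 + 1 + 1 = 3, whereas markers that differ in the chromosome of even one genome can never be linked directly, whatever the threshold.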
MGR can be directly applied to all genetic markers shared by human, cat and mouse to find a rearrangement scenario; however, this scenario is likely to be flawed, since comparative maps will have some unreliably positioned markers that impute a false rearrangement. Therefore, we apply the GRIMM-synteny algorithm to filter out spurious markers that occur as isolated points (or 'small clusters') in a marker matrix. The GRIMM-synteny algorithm for comparative data invokes a distance threshold, G, as a parameter. The distance threshold is defined as the number of chromosomal interruptions below which markers are deemed to be part of the same synteny block.

GRIMM-synteny algorithm
(1) Form a marker graph whose vertex set is the set of markers.
(2) Connect vertices in the marker graph by an edge if the distance between them is smaller than the distance threshold G.
(3) Define clusters as connected components in the marker graph.
(4) Delete singletons (clusters with just one marker).
(5) Determine the cluster order and signs (orientation) for each genome.
We define the span of a cluster in human (or cat or mouse) as the interval between its minimum and maximum coordinates. Note that, although different clusters are not supposed to overlap in three dimensions, they often overlap in one dimension (ie their span intervals may overlap in human or cat or mouse). Therefore, defining the cluster order for intermingled clusters should be carried out with caution. To do this, we compute the centre of mass of all markers forming the cluster, and order clusters in human by the coordinates of their centres of mass. Cluster numbers are assigned according to their order on the human genome and then ordered in the other genomes in terms of these labels. We define rearrangements of markers within a cluster as micro-rearrangements, while rearrangements of the order and orientation of clusters are called macro-rearrangements.
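Steps (1) to (5) can be sketched in Python as follows. This is an illustrative reading of the description above, not the GRIMM-synteny program itself: the marker representation, the function names and the quadratic pairwise comparison are assumptions, cluster order in the non-reference genomes is taken here from centres of mass in those genomes, and the assignment of cluster signs in step (5) is left out because the text does not spell out how it is done.

```python
import math
from itertools import combinations
from typing import Dict, List, Sequence, Tuple

# A marker is (coordinates, chromosomes), one entry per genome,
# e.g. ((h, c, m), ("14", "B3", "12")). The layout is illustrative.
Marker = Tuple[Tuple[int, ...], Tuple[str, ...]]


def distance(a: Marker, b: Marker) -> float:
    """Manhattan distance inside one chromosome tuple, infinity otherwise."""
    if a[1] != b[1]:
        return math.inf
    return sum(abs(x - y) for x, y in zip(a[0], b[0]))


def centre_of_mass(markers: Sequence[Marker], cluster: List[int], genome: int) -> float:
    """Mean coordinate of a cluster's markers in the given genome."""
    return sum(markers[i][0][genome] for i in cluster) / len(cluster)


def grimm_synteny_clusters(markers: Sequence[Marker], G: int) -> List[List[int]]:
    """Steps 1-4, plus ordering by centre of mass in genome 0 (part of step 5).
    Clusters are returned as lists of marker indices."""
    parent = list(range(len(markers)))          # step 1: one vertex per marker

    def find(i: int) -> int:
        while parent[i] != i:
            parent[i] = parent[parent[i]]       # path halving
            i = parent[i]
        return i

    for i, j in combinations(range(len(markers)), 2):
        if distance(markers[i], markers[j]) < G:    # step 2: edge if d < G
            parent[find(i)] = find(j)

    components: Dict[int, List[int]] = {}           # step 3: connected components
    for i in range(len(markers)):
        components.setdefault(find(i), []).append(i)

    clusters = [c for c in components.values() if len(c) > 1]   # step 4
    clusters.sort(key=lambda c: centre_of_mass(markers, c, 0))
    return clusters


def cluster_order(markers: Sequence[Marker], clusters: List[List[int]], genome: int) -> List[int]:
    """Cluster labels (1-based, numbered along genome 0) listed in the order
    of their centres of mass in another genome."""
    ranked = sorted(range(len(clusters)),
                    key=lambda k: centre_of_mass(markers, clusters[k], genome))
    return [k + 1 for k in ranked]
```

The union-find structure is a design convenience; any connected-components routine would serve. For marker sets of a few hundred points, as in the datasets analysed here, comparing all pairs is inexpensive.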
Maximum distance threshold. We illustrate the influence of the maximum distance threshold G on the set of derived clusters in the case of three genomes A, B and C. Consider two markers, x and y, that are adjacent in all three genomes, either as x, y or as y, x. Their distance is d(x, y) = 1 + 1 + 1 = 3, and they will be placed in the same cluster only if G ≥ 4. Conversely, distances larger than 3 indicate that a pair of markers fails to be adjacent in one or more genomes. Hence, the threshold, G, limits the maximum number of chromosomal interruptions, d(x, y), between markers x and y across m genome comparisons.
Recall that a strip is a sequence of markers x, y, ..., z that appear consecutively or reversed in all three genomes, without interruption by other markers. At G < 4, each marker forms its own singleton cluster and is deleted. At G = 4, each strip forms its own cluster. As G increases, some clusters may be merged together to form a larger cluster with micro-rearrangements. An example of this is shown in Figure 1.
Thus, for m genomes, G ≤ m puts each marker into its own singleton cluster that is deleted. G = m + 1 puts each strip into its own cluster. G = m + 2 allows for clusters that form a strip in all but one genome, which instead has a pair of adjacent markers in that strip which are inverted (there can be multiple inverted pairs within a cluster, as long as no two pairs are adjacent). Therefore, increasing the value of G allows for clusters with more complex micro-rearrangements.
Table 1. Conserved markers, clusters and reversal distances computed with GRIMM-synteny and Multiple Genome Rearrangement Algorithm analysis of comparative gene maps of 470 Type I gene homologues aligned between human (H), mouse (M) and cat (C) genomes. The common ancestor of all three genomes is denoted A, while preancestors for human, mouse and cat are denoted H*, M* and C*, respectively. The total distance between the three genomes, d(H, M, C), is defined as d(H, M) + d(M, C) + d(C, H). The tree score is defined as d
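The threshold cases stated for m genomes follow directly from the definition of the distance, as the short derivation below sketches. The symbol \pi_g(x), denoting the coordinate of marker x in concatenated genome g, is notation assumed here for illustration; x and y are distinct markers in the same chromosome tuple, and the `align*` environment requires the amsmath package.

```latex
% \pi_g(x): coordinate of marker x in genome g (assumed notation)
\[
  d(x, y) \;=\; \sum_{g=1}^{m} \bigl\lvert \pi_g(x) - \pi_g(y) \bigr\rvert \;\ge\; m ,
  \qquad \text{with equality iff $x$ and $y$ are adjacent in all $m$ genomes.}
\]
% An edge is drawn when d(x, y) < G, hence:
\begin{align*}
  G \le m   &: \quad d(x, y) < G \text{ never holds, so every marker is a singleton and is deleted;} \\
  G = m + 1 &: \quad d(x, y) < G \iff d(x, y) = m, \text{ so edges join exactly the pairs adjacent in every genome;} \\
  G = m + 2 &: \quad \text{pairs with } d(x, y) = m + 1 \text{ are also joined, ie pairs adjacent in } m - 1 \text{ genomes} \\
            &  \quad \text{and two positions apart in the remaining one.}
\end{align*}
```

With m = 3 this reproduces the thresholds quoted above: no clusters survive below G = 4, and at G = 4 the connected components are exactly the strips.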

Comparative mapping data
Feline-human comparative mapping data (590 shared coding gene markers) have been described by Murphy et al. and Menotti-Raymond et al. 11,41 Human-mouse comparative mapping data were derived online, from http://www.ncbi.nlm.nih.gov/Homology. Cattle-human comparative mapping data were derived from Band et al. 15 and associated mouse data were derived from the previously listed mouse databases. For cases where mapped homologous loci did not exist for a given species pair, we found the most physically proximal human gene, which was taken as a 'virtual' coordinate to find a mapped mouse homologue in genetic or radiation hybrid (RH) maps. Cattle homologues were considered equivalent 'common' markers if their human homologue resided within 20 centirays (map units) of the human-cat anchors and were consistent with previously defined blocks of human-cattle synteny. 15 In a few cases, the virtual marker was extended to 50 centirays, but only where it was certain that there were no violations of previously defined syntenies. For this analysis, we assembled two comparative datasets: (1) Human-mouse-cat (470 shared markers), which represented two conserved (few rearrangements from the ancestral placental genome 8 ) mammalian genomes (human and cat) with one significantly reshuffled mammalian genome (mouse). (2) Human-cat-cow (311 shared markers), which represented two conserved mammalian genomes (human and cat) and one moderately reshuffled mammalian genome (cow). 8 The number of identified homologous mapped markers (actual plus virtual) between the species pairs human-cat, human-cow, cat-mouse and cat-cow was 551, 633, 470 and 311, respectively.

Human-mouse-cat dataset
The genomic maps of homologous markers were first compared between human, mouse and cat using the MGR and GRIMM-synteny algorithms. The comparison involved 470 Type I coding gene markers with MGR distance threshold parameters set at G = 4, 5, 6, 8 and 20 (Table 1). The results reveal several important patterns that can be interpreted in a comparative genomics context. First, increasing the distance threshold typically results in an increase in the number of markers returned in clusters, as fewer singletons are dropped. Another consequence of the threshold increase is that the number of clusters typically decreases, as does the overall rearrangement distance. This is the result of reducing the number of local rearrangements due to poor mapping resolution of tightly linked markers, or derived micro-rearrangements, in the mouse genome. At very high thresholds (eg G = 20), almost all internal inversions are not counted, in many cases collapsing entire chromosomes into single conserved segments. We show results at high thresholds only to demonstrate the failure to resolve chromosome associations (see below) with a few diagnostic markers, while enhancing recovery of single chromosome syntenies (Table 2). In practice, however, we do not advocate using such high thresholds, as they result in loss of almost all intrachromosomal detail.
Figure 1. Illustrating the effect of the distance threshold, G, on cluster formation. Suppose genome A has marker order 1,2,3,4,5,6; genome B has 1,2,3,6,5,4; and genome C has 3,1,2,4,5,6. The strips are [1,2], [3], [4,5,6]. The clusters at G = 4 (a) are [1,2] and [4,5,6] (the singleton [3] is deleted). At G = 5 (not shown), some of these are combined together. Specifically, d(2, 3) = 1 + 1 + 2 = 4 < 5, so an edge is added between markers 2 and 3, joining their clusters together. The clusters at G = 5 are [1,2,3] and [4,5,6] and the order within the clusters varies by genome, giving micro-rearrangements. At G = 6 and 7 (b), edges are added within clusters, but not between clusters, so clusters do not change. At G = 8 (not shown), two edges are added that would join the clusters into [1,2,3,4,5,6]. Specifically, d(2, 4) = 2 + 4 + 1 = 7 < 8 and d(3, 4)
A chromosome association is defined as clusters of two different human chromosomes that are adjacent on a single chromosome in another genome (eg fragments of human chromosomes 14 and 15 fused together (denoted 14/15) on cat chromosome B3) or in an ancestor. Table 2 illustrates the sensitivity of the algorithm to threshold in recovery of ancestral chromosomes predicted by previous studies on chromosome painting and comparative mapping data. 8 It should be noted that previous studies were based on lower-resolution datasets generated for much larger sets of mammalian species (20-40 species from as many as eight placental orders). In general, increasing the threshold, G, tends to improve the consistency of the overall reconstruction with previous chromosomal syntenies.
Table 2. Comparison of the Multiple Genome Rearrangement (MGR) algorithm-derived syntenies found in the common ancestors of the human-cat-mouse (HCM) and the human-cat-cow (HCC) datasets, with predicted syntenies based on comparative cytogenetic analyses (left-hand column 8
Figure 2 depicts a reconstructed ancestral genome from which the human, cat and mouse genomes descended, based on MGR-GRIMM (G = 6). The putative three-species ancestor contains 19 autosomes and the sex chromosomes, and shares a number of chromosomes and chromosome associations hypothesised to be in the ancestral placental mammal: these include associations 3/21 (human chromosome 3 fused to human chromosome 21), 4/8p, 7/16p, 16q/19q and single chromosome syntenies 2q, 8q, 9 and 17.
This reconstruction differs from previous hypotheses by lacking, for example, the 14/15 chromosome association and one of the two 12/22 associations. If, however, the three preancestors (defined here as genomes on the path towards the ancestor on which rearrangements have only been performed with the highest confidence) are examined at threshold 4, there is evidence of these predicted associations in at least one of the preancestors (see supporting information at www.ingenta.com).
Table 3 shows the results of applying GRIMM-synteny and MGR to the 311 marker human-cat-cow dataset. As observed with the previous dataset, increasing the thresholds tends to add more markers but decreases conserved segment resolution. This dataset also recovers most of the human chromosome associations predicted in the placental ancestor, although fewer markers resulted in loss of some of the segments within the 4/8p and 12/22 associations (Table 2 and Figure 3). By contrast with the human-mouse-cat dataset, the more conserved human-cat-cow genome triple, with lower and more equal distances (Table 3), recovers more of the single human chromosome syntenies at lower thresholds (eg 4 and 5), while threshold 6 shows more of these single syntenies instead as associations (eg 13 with 5 and 2p + q). All datasets, descriptions of clusters and results from analyses of both human-cat-mouse and human-cat-cow datasets can be found in the supporting information at www.ingenta.com.
Table 4 shows a comparison of the proportions of each type of rearrangement at the varying thresholds for the human-mouse-cat versus the human-cat-cow datasets. Reversals (inversions) represent a very frequent category of rearrangement event in both datasets. The fact that this event category is even more common in the human-cat-cow dataset than in the human-cat-mouse dataset is consistent with previous analyses of mammalian comparative maps. 28 Recent human-mouse genomic sequence comparisons, however, 3-5 reveal that intrachromosomal rearrangements (reversals) are the most frequent rearrangement event, as will probably become more evident in the human-cat-mouse rearrangement scenario as the number

Discussion
Using multispecies mammalian comparative maps, coupled with new computational tools for multichromosomal rearrangement analysis, we have been able to demonstrate the promise of generating ancestral chromosome architectures from small numbers of taxa and fewer than 500 shared markers. Our results using two three-taxa datasets (human-cat-mouse and human-cat-cow) reconstruct, under different assumptions about treating local mapping errors and micro-rearrangements, mammalian ancestral genomes containing most of the chromosome associations and syntenies hypothesised based on chromosome painting inferences. 8,32,33 Of course, increasing the number of species and markers will improve the accuracy of the ancestor reconstructions and rearrangement scenarios. Despite having fewer common markers, the human-cat-cow dataset recovers the single chromosome syntenies (eg 5 and 13) at a higher frequency than the human-mouse-cat dataset, where they tend to be intact yet fused to other chromosomes (Table 2 and Figure 3). This is best explained by the overall slower rate of change among these three species (Table 3) and the tendency of most of these chromosomes to be fused to other human syntenic regions in the rearranged mouse genome. This confirms the conclusion that increasingly additive trees produce more reliable ancestors 27 and suggests that inclusion of more slowly evolving genomes will aid in the reconstruction of placental ancestral genomes.
One finding of interest is that, even though the mouse is highly rearranged compared with most species, increasing the threshold of considered micro-rearrangements (which have occurred largely on the mouse lineage) allows the algorithm to compensate and converge on a relatively unshuffled ancestor. Although there are some unexpected ancestral chromosomes in different analyses of the human-cat-mouse dataset, most of these represent fusions of intact human chromosomes that are thought to have been distinct chromosomes in the placental ancestor. One example is the fusion of human 2p and 20 into a single ancestral chromosome in almost all analyses within and between both datasets. This 2p/20 association is found intact in the cat genome and is believed to be ancestral for carnivores. 8,42,43 This has never been found in another placental karyotype examined with molecular methods, except in mouse, where human 20 is syntenic with a small fragment of human chromosome 2p. In rare cases like this, the apparently common carnivore-rodent association is best explained by convergence through the extensive chromosomal scrambling observed in the mouse genome. 1,4 This is supported by inspection of the rat genome, 1,14 which does not share this association. As with any phylogenetic analysis, increasing taxon (genome) sampling will decrease the effects of homoplasy and increase the reliability of the tree and ancestral reconstruction.
Because MGR inferences are parsimony-based, saturation and long-branch attraction issues remain outstanding problems that will need to be addressed in future applications of this method to infer mammalian genome rearrangements. Therefore, the choice of genomes will affect chromosomal reconstructions, hence caution must be exercised when making interpretations from ancestors imputed with combinations of slowly and rapidly evolving genomes. A good illustration of this principle is observed in the difficulty of recovering the 14/15 association with the human-mouse-cat dataset. Human chromosomes 14 and 15 are syntenic in the large majority of placental mammal genomes examined to date, 8,32,33 although this synteny has independently been lost in the human-ape lineage and the murid rodent lineage. Thus, two of three genomes in the human-mouse-cat dataset lack this chromosome association (otherwise widespread in mammals), resulting in difficulty in recovering this ancestral chromosome. It should be noted that the human-cat-cow dataset, where two of three genomes do have the 14/15 association, recovers this ancestral chromosome at low thresholds, although it recovers it less well when the threshold is increased, due to loss of marker resolution. Increased marker density will ultimately improve reconstruction accuracy. This was suggested by the improvement of the current human-mouse-cat ancestor over a previously computed scenario using these same three species, but with a much smaller number of markers. 27 This result supports previous conclusions emphasising that the number of markers should exceed a certain threshold to provide reliable ancestral reconstructions. 27
As the number of ordered comparative maps from different mammalian species increases, along with an increase in shared markers, it is expected that the ancestral reconstructions (both whole chromosomes and orders within chromosomes) will become more reliable and more accurate reflections of the ancestral mammalian genome. These advances will initially proceed from the mapping stage, where a broader taxonomic sampling from whole genome descriptions is currently available (or in development). The promise and application of this approach to multiple mammalian genomic sequences from several orders will surely provide the greatest accuracy and insight into whole genome evolution, as demonstrated by current human-mouse whole genome comparisons. 3,4,36