A genome-wide survey of segmental duplications that mediate common human genetic variation of chromosomal architecture
© Henry Stewart Publications. 2004
Received: 20 May 2004
Accepted: 20 May 2004
Published: 1 August 2004
Recent studies have identified a small number of genomic rearrangements that occur frequently in the general population. Bioinformatics tools are now available for systematic genome-wide surveys of higher-order structures predisposing to such common variations in genomic architecture. Segmental duplications (SDs) constitute up to 5 per cent of the genome and play an important role in generating additional rearrangements and in disease aetiology. We conducted a genome-wide database search for a form of SD, palindromic segmental duplications (PSDs), which consist of paired, inverted duplications, and which predispose to inversions, duplications and deletions. The survey was complemented by a search for SDs in tandem orientation (TSDs) that can mediate duplications and deletions but not inversions. We found more than 230 distinct loci with higher-order genomic structure that can mediate genomic variation; of these, about 180 contained a PSD. A number of these sites were previously identified as harbouring common inversions or as being associated with specific genomic diseases characterised by duplications, deletions or inversions. Most of the regions, however, were previously unidentified; their characterisation should identify further common rearrangements and may indicate localisations for additional genomic disorders. The widespread distribution of complex chromosomal architecture suggests a potentially high degree of plasticity of the human genome and could uncover another level of genetic variation within human populations.
Keywords: genomic architecture, segmental duplications, inversion polymorphism, genomic variation
Investigation of human genetic variation has focused mainly on single nucleotide polymorphisms (SNPs) and minisatellite and microsatellite repeat sequences. Although it has long been known that genomic rearrangements predispose to numerous disease phenotypes, it has only recently become apparent that some such rearrangements occur frequently in the general population. Investigation of specific loci or chromosomal regions has identified the few known common variations in genomic architecture. Now, however, the availability of bioinformatics tools, for searching for patterns in genome sequences, enables genome-wide surveys for particular types of higher-order structure predisposing to genomic variation. Segmental duplications (SDs) represent a form of genome architecture constituting up to 5 per cent of the human genome.[1–4] Non-allelic homologous recombination between these paralogous sequences results in changes of genomic structure creating inversions and other types of chromosomal rearrangements, sometimes leading to disease.[5–8] Recently, two genomic regions were identified that harbour a common inversion polymorphism. On chromosome 8p23, two large low-copy repeat regions containing olfactory-receptor (OR) gene clusters spanning approximately 350 kilobases (kb) each, and separated by approximately 4 megabases (Mb) of unique sequence, mediate recurrent genomic rearrangements. An inversion polymorphism in this segment is present in heterozygous form in about 25 per cent of Europeans. Additionally, a homologous structure of two pairs of OR gene clusters at 4p16, separated by almost 6 Mb, mediates a relatively common translocation between chromosomes 4p16 and 8p23. More than 10 per cent of Europeans sampled are heterozygous for an inversion at 4p16. 
Detailed examination of the regions at 4p16 and 8p23, which contain the submicroscopic genomic inversions, revealed specific higher-order structures involving SDs, termed palindromic segmental duplications (PSDs), which predispose to inversions, duplications and deletions. A PSD consists of paired, inverted duplications within limited physical distance of each other. In addition, non-allelic homologous recombination between segmental duplications in tandem orientation (TSDs) is known to mediate duplications and deletions but not inversions, leading to changes in copy number of intervening DNA sequences. Recurrent deletions and duplications are a known cause of genomic disorders and are observed relatively frequently, whereas submicroscopic inversion events without change of copy number (of a gene) are hard to detect and may not necessarily lead to a distinct phenotype. We hypothesised that the genomic architecture containing PSDs associated with common inversion polymorphisms is not unique to the 8p23 and 4p16 regions. The existence in the human genome of recurrent SDs that mediate common inversion polymorphisms without known association with human diseases raised the possibility that there are many more such PSD structures throughout the genome. In this paper we describe the results of a genome-wide database search for loci containing chromosomal architecture that could mediate genomic variation, with a special emphasis on PSD structures, and discuss the implications of these findings for our understanding of genome plasticity.
Materials and methods
Chromosome 8-4-11-3 PSD family analysis
Using documented markers from the SDs that mediate the genomic rearrangements on chromosomes 8p23 and 4p16, we downloaded all four repetitive regions from the public domain of the National Center for Biotechnology Information (NCBI). Each of the four SDs was compared with the others in six pair-wise alignments using Miropeats. To eliminate alignment redundancy, we designed a Combine-and-Color (CC) algorithm, described below, that modifies the Miropeats output. The parameters used for the CC algorithm were an internal spacing threshold of 50 base pairs (bp) and a maximum spacing difference of 75 bp.
The purpose of the CC algorithm is to combine overlapping or closely neighbouring alignments into more comprehensive alignments, and to colour the corresponding alignment blocks for visualisation. The combining algorithm exhaustively compares all neighbouring local alignments from two sequences, two at a time. Using the relative start and stop locations of the alignment on each sequence, the algorithm computes the internal spacing on each sequence and the spacing difference between the two. The internal spacing, calculated once for each sequence, is the distance between the end of the first alignment and the beginning of the second. If the alignments overlap, the internal spacing is therefore negative. Once the spacing between the alignments on both sequences has been calculated, the spacing difference is determined by calculating the absolute difference between the two internal spacings. The spacing difference ensures that the two alignments are uniformly spaced on both sequences. If both the internal spacings and the spacing difference are less than predefined thresholds, the two alignments are combined so that a new single alignment spans the regions on both sequences, defined by the previous two alignments. After all possible combinations have occurred, the alignment blocks are coloured according to size to aid in visualisation. Alignments of less than 100 bp are coloured yellow, and the colouration increases in darkness (ie orange, cyan, purple, green, red, blue) and ends in black as the alignment size increases to greater than 4 kb.
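The combining step described above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the `Alignment` tuple, the function names and the single greedy pass over sorted alignments are assumptions made here for brevity (the text describes an exhaustive pair-wise comparison), and coordinates are assumed to be normalised so that both sequences run in the same direction. The default thresholds are the 50 bp and 75 bp values quoted for the Miropeats analysis.

```python
from typing import List, NamedTuple


class Alignment(NamedTuple):
    """A local alignment block; field names are illustrative, not from the paper."""
    q_start: int  # start on the first sequence
    q_end: int    # end on the first sequence
    s_start: int  # start on the second sequence
    s_end: int    # end on the second sequence


def can_combine(a: Alignment, b: Alignment, max_spacing: int, max_diff: int) -> bool:
    """Apply the two criteria from the text to neighbouring alignments a, b."""
    # Internal spacing on each sequence; negative when the alignments overlap.
    gap_q = b.q_start - a.q_end
    gap_s = b.s_start - a.s_end
    # Spacing difference: absolute difference between the two internal spacings.
    diff = abs(gap_q - gap_s)
    return gap_q < max_spacing and gap_s < max_spacing and diff < max_diff


def combine(alignments: List[Alignment],
            max_spacing: int = 50, max_diff: int = 75) -> List[Alignment]:
    """Greedy single pass (a simplification): merge each alignment into the
    previous block when both thresholds are satisfied."""
    merged: List[Alignment] = []
    for aln in sorted(alignments):
        if merged and can_combine(merged[-1], aln, max_spacing, max_diff):
            last = merged[-1]
            # The new alignment spans the regions covered by both inputs.
            merged[-1] = Alignment(min(last.q_start, aln.q_start),
                                   max(last.q_end, aln.q_end),
                                   min(last.s_start, aln.s_start),
                                   max(last.s_end, aln.s_end))
        else:
            merged.append(aln)
    return merged
```

With these defaults, two blocks separated by 30 bp on one sequence and 40 bp on the other are joined, whereas blocks separated by 10 bp on one sequence and 200 bp on the other are kept apart because the spacing is not uniform.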
Genome-wide BLAST analysis
For the genome-wide detection of PSD and TSD pairs, all sequence data for each chromosome were downloaded from the University of California, Santa Cruz (UCSC) Genome July 2003 Freeze. To reduce computation time and background noise, chromosomes were 'fuguised' by removing both repetitive elements masked by RepeatMasker (A.F.A. Smit and P. Green, unpublished), and unsequenced gaps from the July 2003 Freeze. PSD pairs were defined as two segmental duplications of at least 10 kb in length and with ≥ 90 per cent sequence identity, in inverted orientation, and with an internal spacing between the two members of the pair of a maximum of 8 Mb. TSD pairs were similarly defined as two segmental duplications in tandem orientation with identical criteria, as described for PSDs. We used the NCBI's stand-alone BLAST release 2.2.6 for the alignment of the chromosomes. Using bl2seq, we aligned pairs of sequences consisting of a query sequence of 100 kb and a subject sequence of 8 Mb. The first pair consisted of the first 100 kb and first 8 Mb of the chromosome. The BLAST results from this pair constituted our first BLAST pair output. The process was repeated, stepping our 8 Mb window forward by 100 kb with each iteration. For our BLAST analysis, we implemented the restriction of a maximum expectation value (E-value) of e-20. BLAST hits that met our criteria of 90 per cent identity and plus/minus and plus/plus orientation for PSDs and TSDs, respectively, were stored in a duplication file for annotation.
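The windowing scheme and hit filter described above might be sketched as follows. This is an illustrative outline only: the function names, the strand labels and the reading of "e-20" as 1e-20 are assumptions, and no BLAST program is actually invoked here. The 10 kb length criterion is not applied at this stage, since the text applies it only after neighbouring hits have been combined.

```python
from typing import Iterator, Optional, Tuple

Span = Tuple[int, int]  # half-open (start, end) coordinates on one chromosome


def window_pairs(chrom_len: int,
                 query_len: int = 100_000,
                 subject_len: int = 8_000_000) -> Iterator[Tuple[Span, Span]]:
    """Yield (query, subject) windows: a 100 kb query and the 8 Mb subject
    starting at the same position, stepped forward by 100 kb each time."""
    for start in range(0, chrom_len, query_len):
        query = (start, min(start + query_len, chrom_len))
        subject = (start, min(start + subject_len, chrom_len))
        yield query, subject


def classify_hit(identity: float, evalue: float,
                 query_strand: str, subject_strand: str,
                 min_identity: float = 90.0,
                 max_evalue: float = 1e-20) -> Optional[str]:
    """Return 'PSD' for plus/minus hits, 'TSD' for plus/plus hits, and None
    for hits that fail the identity or E-value thresholds."""
    if identity < min_identity or evalue > max_evalue:
        return None
    if (query_strand, subject_strand) == ("plus", "minus"):
        return "PSD"
    if (query_strand, subject_strand) == ("plus", "plus"):
        return "TSD"
    return None
```

Because each query window lies inside its own subject window, a real run would also need to discard the trivial self-alignment; the text does not say how this was handled, so the sketch leaves it out.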
To annotate the PSD pairs, we used our CC algorithm to join neighbouring BLAST hits in our duplication file that belonged to the same PSD pair member. All BLAST hits that met our criteria were sorted by starting position of the query sequence to group neighbouring hits together. To preserve the structure of the PSD pair, neighbouring hits from both regions of the PSD pair were examined simultaneously. The thresholds for internal spacing and spacing difference were both 2 kb. After the combining algorithm was completed, the physical locations of all PSD pairs whose two members were both greater than 10 kb were stored in a segment file. Annotation of TSD pairs was performed in identical fashion.
PSD pair BLAST database
Using the physical locations defined in the segment file, all PSD pair sequence data were extracted from the 'fuguised' chromosomes for preparation of a PSD pair BLAST database. To determine the sequence similarity among the members of our PSD pair set, each PSD pair member was BLASTed against the database using blastall. For each PSD pair element queried, all PSD pair elements in our database containing at least one hit with an E-value of zero were considered significant, and the ranges of coverage were recorded in a coverage file.
Structure of common inversion regions on 8p23 and 4p16
Genome-wide search for PSDs and TSDs
Since PSDs seem to cluster in specific regions, and multiple PSD hits may represent single instances of duplicated segments in inverted orientation, we assessed the redundancy of PSDs from our screen. This was done by combining overlapping pairs of PSDs into one, as well as joining PSD pairs that were within 100 kb. We estimated that there are 179 distinct PSD-containing regions. Likewise, 144 distinct TSD loci were identified. For the combined PSD and TSD data, we identified 233 distinct regions in the genome containing large blocks of duplicated sequences in opposite or tandem orientation. Of these 233 regions, 86 loci contain exclusively PSDs whereas about half that number (46) feature only TSDs. These numbers not only show that there are more PSD than TSD structures in the human genome but also indicate that PSDs and TSDs frequently co-localise within the same region of the genome.
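The redundancy reduction described above amounts to interval merging with a gap tolerance. The sketch below is one way to express it, under the assumption (not stated in the text) that each PSD pair is represented by the full span from the start of its first member to the end of its second member; the function name and the coordinates in the usage example are made up for illustration.

```python
from typing import Dict, Iterable, List, Tuple


def merge_regions(spans: Iterable[Tuple[str, int, int]],
                  max_gap: int = 100_000) -> Dict[str, List[Tuple[int, int]]]:
    """Merge (chrom, start, end) spans that overlap or lie within max_gap
    of each other, returning distinct regions per chromosome."""
    merged: Dict[str, List[List[int]]] = {}
    for chrom, start, end in sorted(spans):
        regions = merged.setdefault(chrom, [])
        if regions and start - regions[-1][1] <= max_gap:
            # Overlapping or within the gap tolerance: extend the last region.
            regions[-1][1] = max(regions[-1][1], end)
        else:
            regions.append([start, end])
    return {chrom: [(s, e) for s, e in regions]
            for chrom, regions in merged.items()}


# Hypothetical spans, not taken from the survey results.
example = [
    ("chr8", 7_000_000, 12_000_000),
    ("chr8", 7_500_000, 11_900_000),   # nested inside the first span
    ("chr8", 12_050_000, 12_300_000),  # 50 kb beyond the first span
    ("chr8", 20_000_000, 21_000_000),  # far away: a separate region
    ("chr4", 3_000_000, 9_000_000),
]
distinct = merge_regions(example)
```

Counting the merged regions across all chromosomes would then give the number of distinct loci under this definition.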
The initial finding of the 8-4-11-3 PSD family led to the suggestion that perhaps more distinct PSD families could be identified. We therefore performed a sequence analysis between each of the PSDs using the BLAST algorithm. For 550 out of the 861 total PSD pairs (63.9 per cent) we found no inter-chromosomal hit, and for 118 of the 861 PSD pairs (13.7 per cent) we found neither an inter- nor an intra-chromosomal hit. The vast majority of hits (94 per cent) were intra-chromosomal. Further sequence comparisons of all PSD sequences against the human genome revealed a large number of unpaired duplicated segments throughout the genome (data not shown).
Known genomic regions with PSDs and TSDs
We have performed a genome-wide survey of specific patterns of chromosomal architecture in the human genome, based on results of our initial analysis of two regions containing inversion polymorphisms on chromosomes 8p23 and 4p16.[9, 10] Close examination of these two loci revealed the presence of PSDs, consisting of paired, inverted duplications, flanking the inversion region with unique sequence. Moreover, we confirmed the high degree of sequence similarity between these PSDs across the different chromosomes and identified two other loci on chromosomes 3q21 and 11q13, containing almost identical PSDs in inverse orientation, with similar spacing but without known genomic variation. This result showed that the genomic architecture of the 8p23 and 4p16 inversion regions is not unique. Moreover, the existence of recurrent SDs in the human genome, mediating inversion polymorphisms, raised the possibility that there are many more such PSD structures throughout the human genome; this possibility led us to perform a genome-wide survey of PSDs that can mediate genomic variation of chromosomal architecture. The genome-wide analysis also included a survey of TSDs, which can also mediate ectopic sequence exchange, causing deletions or duplications. We focused particularly on regions with PSDs that could mediate 'balanced' inversion events, that is, without loss or gain of intervening sequence. Our results revealed a large number of loci harbouring these structural features; most of these were previously unknown and await further confirmation and characterisation.
Distribution of PSDs and TSDs
The chromosomal distribution of PSDs, as well as TSDs, shows distinctive patterns. Segmental duplications within pericentromeric and subtelomeric regions are well documented [24–26] but their distribution and number vary by chromosome. Our results demonstrated an over-representation of PSDs and TSDs near centromeres, although not all centromeres are characterised by a high density of these kinds of structures. The reason for the abundance of PSDs and TSDs near some centromeres could lie in the fact that these pericentromeric regions harbour greater overall plasticity.[27, 28] The scarcity of these structures in subtelomeric regions, which are also recognised as sites of rapid genomic change, however, may suggest differences in higher-order structure and/or type of plasticity between centromeres and telomeres in the human genome. Comparative studies of these regions are required to ascertain whether this is a particular structural characteristic for centromeres and telomeres in general, or is possibly restricted to the human genome.
We observed that SDs, including both PSDs and TSDs, are usually arranged in a complex structure consisting of multiple modules, some in direct orientation and others in inverted orientation. Of the relatively few SDs that are uniquely PSDs or TSDs, there is a preponderance of PSD regions. The high prevalence of PSDs suggests that SDs within close proximity preferentially have occurred in inverted orientation. It is unclear why this preference would occur, as there is no apparent advantage for inverted duplications over tandem duplications. It is possible, however, that a preponderance of PSDs exists because of three-dimensional structural advantages at the chromatin level for these events to occur. If this is true, one might expect that comparison of genomic structures of SDs across species will reveal the same bias in distribution of PSDs versus TSDs. Further study is required to confirm this hypothesis.
The observation that the 8p23 and 4p16 inversion-mediating PSDs are members of a family with at least two additional, nearly identical loci at 11q13 and 3q21, led to the suggestion that perhaps more distinct PSD families could be identified. Sequence comparison between all identified PSDs revealed that the vast majority of PSDs were related to at least one other PSD sequence in the genome, with almost all sequences found on the same chromosome. This high rate of intra-chromosomal hits suggests a closer relationship between PSDs on the same chromosome than between PSDs on different chromosomes. This strong bias is not surprising, since any PSD pair, by definition, consists of an intra-chromosomal duplicated segment in close proximity. Similar numbers have been found for paralogous sequences in general,[1, 2] suggesting that PSDs are not an inherently different group of duplicated segments within the human genome.
It is important to note that the current sequence assembly of the human genome, even though in its advanced stages, is still incomplete and contains gaps. This could lead to under-representation of duplicated segments, misalignments of some sequence data [1, 29] and uncertainties in the proper orientation especially of low-copy repeat sequences. For this reason, independent molecular confirmation is required for any region identified in our survey that may underlie common genomic variations. The combination of SD data in inverted and tandem orientation (PSD and TSD, respectively) is thus necessary to identify these sites. Moreover, under-representation of SDs in the human genome assembly will lead to an underestimation of loci with higher-order structure that are potentially involved in genomic variation. These factors certainly limit our survey, in that PSDs may be recognised as TSDs or vice versa, or alternatively that the actual number of these types of loci in the human genome is higher than our data suggest. Nevertheless, our effort to identify specific sequence structures involving SDs in the human genome is an important step to corroborate the concept that human genome plasticity is probably very substantial and is not limited to the pericentromeric and subtelomeric regions.
One example of a region associated with a genomic disorder is that of 17p11, associated with the Smith-Magenis syndrome (SMS; MIM 182290). This region comprises three SDs (distal, middle and proximal) combined into two PSDs forming one TSD structure, commonly deleted in subjects with SMS. Interestingly, within this SMS region the breakpoint region for a common isochromosome 17q [i(17q)] in human neoplasia was recently reported; i(17q) is associated with loss of 17p, which includes the tumour-suppressor gene TP53. The recurrent breakpoint of i(17q) was described as a PSD locus. This example suggests, again, that somatic rearrangements are not random but that genome architecture, such as PSDs and TSDs, may also be important in chromosomal rearrangements associated with human neoplasia. Moreover, a different study identified an abundance of SDs in this 17p region and also reported that the particular genomic architecture is involved in non-recurrent chromosomal rearrangements and unusual-sized deletions.
The presence of a great number of regions in the human genome harbouring higher-order structures predisposing to genomic variation also implies that the chromosomal structure of these loci may vary between human populations. It is interesting, for example, that for Sotos syndrome, microdeletions are commonly observed in Japanese patients but only in a very small fraction of non-Japanese patients.[19, 20, 32] This large difference in the frequency of microdeletions could be due to a patient-selection bias, but one could also argue that it arises because some 5q35 alleles carry a particular variation in genomic architecture that predisposes to these events in the respective populations. Even though the latter may be incorrect for the 5q35 Sotos syndrome region, it may be the correct scenario for one of the many other regions in the genome containing complex chromosomal architecture.
Our results indicate that the human genome harbours a considerable number of regions whose higher-order structure may vary within human populations. The approach that we employed, however, was restricted to a single set of criteria, focusing on PSDs with a minimum length for each segment and with a given maximum spacing. These criteria were applied to maximise the chances of identifying regions that predispose to recurrent inversions such as those seen in 8p and 4p. There are, however, already examples of PSD regions with smaller segments that mediate inversion polymorphisms, for example at the EMD locus on chromosome Xq28. This, in addition to the limitations of the current genome assembly, as previously mentioned, suggests that further investigation may reveal additional regions in which there is common variability of genomic structure. Such variability in higher-order structure of the genome could also alter our interpretation of genetic maps and haplotypes, especially at high resolution. Under the assumption of uniform architecture, we consider maps to represent a fixed order of markers, although we recognise that the genetic distance between two markers is variable (eg between males and females) and therefore represents an average. Similarly, we may need to consider that the order also represents an average for specific regions in the genome. Indeed, unrecognised variability in the order of markers could increase uncertainty in estimates of the distances between them. Systematic identification of relatively widespread genetic variations in genome structure may be important for comparative genomic studies, for analysis of recombination in the human genome and, in particular, for mapping phenotypes. 
While only a small proportion of SNPs may have a functional effect, it is likely that a relatively high proportion of variants in higher-order structure have either direct or indirect effects on the function of one or more genes, given the large amount of genome sequence incorporated within each variant.
We thank Susan Service and York Mararhens for comments. M.R.M. was supported by NSF IGERT Training Award #DGE-9987641. This work was funded by a NARSAD Young Investigator Award to R.A.O. and by grants from the US National Institutes of Health (R01 MH 49499 to N.B.F. and R01 GM 068875-01 to R.A.O.).
- Bailey JA, Yavor AM, Massa HF, et al: 'Segmental duplications: Organization and impact within the current human genome project assembly'. Genome Res. 2001, 11: 1005-1017. 10.1101/gr.GR-1871R.
- Bailey JA, Gu Z, Clark RA, et al: 'Recent segmental duplications in the human genome'. Science. 2002, 297: 1003-1007. 10.1126/science.1072047.
- Cheung J, Estivill X, Khaja R, et al: 'Genome-wide detection of segmental duplications and potential assembly errors in the human genome sequence'. Genome Biol. 2003, 4: R25. 10.1186/gb-2003-4-4-r25.
- Lander ES, Linton LM, Birren B, et al: 'Initial sequencing and analysis of the human genome'. Nature. 2001, 409: 860-921. 10.1038/35057062.
- Samonte RV, Eichler EE: 'Segmental duplications and the evolution of the primate genome'. Nat Rev Genet. 2002, 3: 65-72.
- Stankiewicz P, Lupski JR: 'Molecular-evolutionary mechanisms for genomic disorders'. Curr Opin Genet Dev. 2002, 12: 312-319. 10.1016/S0959-437X(02)00304-0.
- Emanuel BS, Shaikh TH: 'Segmental duplications: An 'expanding' role in genomic instability and disease'. Nat Rev Genet. 2001, 2: 791-800.
- Mazzarella R, Schlessinger D: 'Pathological consequences of sequence duplications in the human genome'. Genome Res. 1998, 8: 1007-1021.
- Giglio S, Broman KW, Matsumoto N, et al: 'Olfactory receptor-gene clusters, genomic-inversion polymorphisms, and common chromosome rearrangements'. Am J Hum Genet. 2001, 68: 874-883. 10.1086/319506.
- Giglio S, Calvari V, Gregato G, et al: 'Heterozygous submicroscopic inversions involving olfactory receptor-gene clusters mediate the recurrent t(4;8)(p16;p23) translocation'. Am J Hum Genet. 2002, 71: 276-285. 10.1086/341610.
- Inoue K, Lupski JR: 'Molecular mechanisms for genomic disorders'. Annu Rev Genomics Hum Genet. 2002, 3: 199-242. 10.1146/annurev.genom.3.032802.120023.
- Parsons JD: 'Miropeats: Graphical DNA sequence comparisons'. Comput Appl Biosci. 1995, 11: 615-619.
- Altschul SF, Gish W, Miller W, et al: 'Basic local alignment search tool'. J Mol Biol. 1990, 215: 403-410.
- Parsons JD: 'Improved tools for DNA comparison and clustering'. Comput Appl Biosci. 1995, 11: 603-613.
- Skaletsky H, Kuroda-Kawaguchi T, et al: 'The male-specific region of the human Y chromosome is a mosaic of discrete sequence classes'. Nature. 2003, 423: 825-837. 10.1038/nature01722.
- Kong A, Gudbjartsson DF, Sainz J, et al: 'A high-resolution recombination map of the human genome'. Nat Genet. 2002, 31: 241-247.
- Osborne LR, Li M, Pober B, et al: 'A 1.5 million-base pair inversion polymorphism in families with Williams-Beuren syndrome'. Nat Genet. 2001, 29: 321-325. 10.1038/ng753.
- Gimelli G, Pujana MA, Patricelli MG, et al: 'Genomic inversions of human chromosome 15q11-q13 in mothers of Angelman syndrome patients with class II (BP2/3) deletions'. Hum Mol Genet. 2003, 12: 849-858. 10.1093/hmg/ddg101.
- Kurotaki N, Imaizumi K, Harada N, et al: 'Haploinsufficiency of NSD1 causes Sotos syndrome'. Nat Genet. 2002, 30: 365-366. 10.1038/ng863.
- Kurotaki N, Harada N, Shimokawa O, et al: 'Fifty microdeletions among 112 cases of Sotos syndrome: Low copy repeats possibly mediate the common deletion'. Hum Mutat. 2003, 22: 378-387. 10.1002/humu.10270.
- Kuroda-Kawaguchi T, Skaletsky H, Brown LG, et al: 'The AZFc region of the Y chromosome features massive palindromes and uniform recurrent deletions in infertile men'. Nat Genet. 2001, 29: 279-286. 10.1038/ng757.
- Repping S, Skaletsky H, Brown L, et al: 'Polymorphism for a 1.6-Mb deletion of the human Y chromosome persists through balance between recurrent mutation and haploid selection'. Nat Genet. 2003, 35: 247-251. 10.1038/ng1250.
- Small K, Iber J, Warren ST: 'Emerin deletion reveals a common X-chromosome inversion mediated by inverted repeats'. Nat Genet. 1997, 16: 96-99. 10.1038/ng0597-96.
- Eichler EE: 'Recent duplication, domain accretion and the dynamic mutation of the human genome'. Trends Genet. 2001, 7: 661-669.
- Mefford HC, Trask BJ: 'The complex structure and dynamic evolution of human subtelomeres'. Nat Rev Genet. 2002, 3: 91-102.
- Guy J, Hearn T, Crosier M, et al: 'Genomic sequence and transcriptional profile of the boundary between pericentromeric satellites and genes on human chromosome arm 10p'. Genome Res. 2003, 13: 159-172. 10.1101/gr.644503.
- Ventura M, Archidiacono N, Rocchi M: 'Centromere emergence in evolution'. Genome Res. 2001, 11: 595-599. 10.1101/gr.152101.
- Amor DJ, Choo KH: 'Neocentromeres: Role in human disease, evolution, and centromere study'. Am J Hum Genet. 2002, 71: 695-714. 10.1086/342730.
- Eichler EE: 'Segmental duplications: What's missing, misassigned, and misassembled -- and should we care?'. Genome Res. 2001, 11: 653-656. 10.1101/gr.188901.
- Stankiewicz P, Shaw CJ, Dapper JD, et al: 'Genome architecture catalyzes nonrecurrent chromosomal rearrangements'. Am J Hum Genet. 2003, 72: 1101-1116. 10.1086/374385.
- Barbouti A, Stankiewicz P, Nusbaum C, et al: 'The breakpoint region of the most common isochromosome, i(17q), in human neoplasia is characterized by a complex genomic architecture with large, palindromic, low-copy repeats'. Am J Hum Genet. 2004, 74: 1-10. 10.1086/380648.
- Douglas J, Hanks S, Temple IK, et al: 'NSD1 mutations are the major cause of Sotos syndrome and occur in some cases of Weaver syndrome but are rare in other overgrowth phenotypes'. Am J Hum Genet. 2003, 72: 132-143. 10.1086/345647.