Genetic-variant hotspots and hotspot clusters in the human genome facilitating adaptation while increasing instability

Background: Genetic variants, underlying phenotypic diversity, are known to be unevenly distributed in the human genome. A comprehensive understanding of the distributions of different genetic variants is important for insights into genetic functions and disorders.

Methods: Herein, a sliding-window scan of regional densities of eight kinds of germline genetic variants, including single-nucleotide polymorphisms (SNPs) and four size-classes of copy-number variations (CNVs), has been performed in the human genome.

Results: The study has identified 44,379 hotspots with high genetic-variant densities, and 1,135 hotspot clusters comprising more than one type of hotspot, accounting for 3.1% and 0.2% of the genome respectively. The hotspots and clusters are found to co-localize with different functional genomic features, as exemplified by the association of middle-size CNV hotspots with histone-modification sites. They work with balancing and positive selection to meet the need for diversity in immune proteins, and facilitate the development of sensory-perception and neuroactive ligand-receptor interaction pathways in the function-sparse late-replicating genomic sequences. Genetic variants of different lengths co-localize with retrotransposons of different ages on a “long-with-young” and “short-with-all” basis. Hotspots and clusters are highly associated with tumor suppressor genes and oncogenes (p < 10^-10), and enriched with somatic tumor CNVs and the trait- and disease-associated SNPs identified by genome-wide association studies, exceeding tenfold enrichment in clusters comprising SNPs and extra-long CNVs.

Conclusions: The genetic-variant hotspots and clusters represent two-edged swords that spearhead both positive and negative genomic changes. Their strong associations with complex traits and diseases also open up a potential “Common Disease-Hotspot Variant” approach to the missing heritability problem.
Supplementary Information: The online version contains supplementary material available at 10.1186/s40246-021-00318-3.

hotspot identified using SCNV density as the criterion. Bottom panel: hotspot region identified using weighted SCNV density as the criterion in sliding windows. In the middle and bottom panels, sliding windows with SCNV densities above the threshold (red dashed line) are identified as hotspots (red bins). With the weighting scheme described in Methods, sliding windows overlapping the SCNV peaks are assigned different levels of weighted SCNV density depending on where the SCNV peaks fall within each window, thereby enabling refinement of the hotspot boundary.
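The thresholding step in this caption can be illustrated with a minimal Python sketch. It uses plain (unweighted) density only, because the weighting scheme is described in Methods and not reproduced here. The function names, window size, step, and threshold are hypothetical choices for illustration, not the values used in the study.

```python
from typing import List, Tuple

Window = Tuple[int, int, float]  # (start, end, density in variants per bp)


def window_densities(positions: List[int], region_start: int, region_end: int,
                     window: int, step: int) -> List[Window]:
    """Slide a fixed-size window along the region and compute variant density."""
    out = []
    start = region_start
    while start + window <= region_end:
        end = start + window
        count = sum(1 for p in positions if start <= p < end)
        out.append((start, end, count / window))
        start += step
    return out


def call_hotspots(windows: List[Window], threshold: float) -> List[Tuple[int, int]]:
    """Keep windows above the threshold and merge those that overlap or touch."""
    hotspots: List[Tuple[int, int]] = []
    for start, end, density in windows:
        if density <= threshold:
            continue
        if hotspots and start <= hotspots[-1][1]:
            # Extend the previous hotspot instead of opening a new one.
            hotspots[-1] = (hotspots[-1][0], max(hotspots[-1][1], end))
        else:
            hotspots.append((start, end))
    return hotspots


# Illustrative variant positions (made up, not data from the study).
example_positions = [105, 110, 115, 120, 125, 400, 900]
example_windows = window_densities(example_positions, 0, 1000, window=100, step=50)
example_hotspots = call_hotspots(example_windows, threshold=0.03)
```

With unweighted density, every window touching the variant peak is called in full, so the merged hotspot is wider than the peak itself. This is the boundary imprecision that the weighted density in the bottom panel is meant to reduce.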

SNP hotspot detection: 34,487 SNP hotspots with a mean width of 1,994 bp (SD = 4,528 bp), amounting to 2.54% of the autosomal region analyzed.

Top windows = the top-ranked windows whose summed SNP entries reach but do not exceed 5% of the SNPs in the zone. D_min = the minimum D_win of these top windows.

Window classes: windows entirely inside the specified type of zone; windows partially inside the specified type of zone.

Top windows = the windows whose D_win equals or exceeds D_min.
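The two "top windows" rules can be sketched in Python as follows. The sketch assumes that the cumulative 5% rule applies to windows entirely inside a zone and the D_min rule to windows partially inside it; the figure residue suggests this pairing but does not state it. The function names and the tuple layout are hypothetical.

```python
from typing import List, Optional, Tuple

WindowRec = Tuple[str, int, float]  # (window id, SNP count, density D_win)


def select_top_windows(windows: List[WindowRec], zone_snp_total: int,
                       fraction: float = 0.05) -> Tuple[List[str], Optional[float]]:
    """Rank windows by density and keep the top ones while the running SNP
    sum stays at or below `fraction` of the SNPs in the zone.
    Returns the selected window ids and D_min (lowest density among them)."""
    budget = fraction * zone_snp_total
    ranked = sorted(windows, key=lambda w: w[2], reverse=True)
    top: List[WindowRec] = []
    cumulative = 0
    for rec in ranked:
        if cumulative + rec[1] > budget:
            break  # adding this window would exceed the SNP budget
        cumulative += rec[1]
        top.append(rec)
    d_min = min(d for _, _, d in top) if top else None
    return [wid for wid, _, _ in top], d_min


def select_by_d_min(windows: List[WindowRec], d_min: Optional[float]) -> List[str]:
    """Keep windows whose density equals or exceeds D_min."""
    if d_min is None:
        return []
    return [wid for wid, _, d in windows if d >= d_min]


# Illustrative inputs (made up, not data from the study).
inside = [("a", 30, 0.030), ("b", 15, 0.015), ("c", 10, 0.010), ("d", 5, 0.005)]
partial = [("p", 8, 0.020), ("q", 3, 0.012)]
top_ids, d_min = select_top_windows(inside, zone_snp_total=1000)
partial_top = select_by_d_min(partial, d_min)
```

In this toy input the SNP budget is 50, so windows "a" and "b" (45 SNPs in total) are kept and "c" is rejected because it would raise the sum to 55. D_min is then the density of "b", and only the partial window "p" passes it.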

Axis labels: Rank of window; Weighted SNP density (w_win).

Supplementary Figure S5. Density of GWAS-identified SNPs in ten groups of hotspots and clusters ranked by minor allele frequency from low (group 1) to high (group 10). Densities in simulated regions with matching minor allele frequencies are shown by the grey violin plots. A combined result of the ten groups is shown in Fig. 3d.

Supplementary Figure S9. Length distribution of MCNVs in the MCNV hotspots with high or low levels of histone modifications.
Based on the intensity of each type of histone modification, the MCNV hotspots are divided into two groups showing above-average (orange) or below-average (blue) intensities. The contrast between the MCNV hotspots with high levels of histone modification (orange shaded peaks) and those with low levels (blue shaded peaks) is particularly evident in the dashed boxes at 146-226 bp, which corresponds to the length of DNA wrapped around one nucleosome (146 bp) plus the linker DNA (up to 80 bp). 'Av.' stands for the average CNV length in bp, as indicated by the dashed vertical orange or blue line. Fold-changes of the density/intensity of the genomic features are expressed similarly to those in Fig. 6.
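The grouping described in this caption can be sketched in Python: hotspots are split at the mean modification intensity, and the share of CNV lengths inside the 146-226 bp range (146 bp nucleosomal DNA plus up to 80 bp linker) is compared between groups. The function names and the input layout are hypothetical, and the numbers are made up for illustration rather than taken from the study.

```python
from typing import List, Tuple

NUCLEOSOME_BP = 146   # DNA wrapped around one nucleosome
MAX_LINKER_BP = 80    # upper bound on linker DNA length

Hotspot = Tuple[float, List[int]]  # (histone-modification intensity, CNV lengths in bp)


def split_by_mean_intensity(hotspots: List[Hotspot]) -> Tuple[List[int], List[int]]:
    """Pool CNV lengths from hotspots above vs. at-or-below the mean intensity."""
    mean = sum(intensity for intensity, _ in hotspots) / len(hotspots)
    high: List[int] = []
    low: List[int] = []
    for intensity, lengths in hotspots:
        (high if intensity > mean else low).extend(lengths)
    return high, low


def fraction_in_nucleosome_range(lengths: List[int]) -> float:
    """Fraction of CNVs whose length falls within 146-226 bp (inclusive)."""
    if not lengths:
        return 0.0
    lower, upper = NUCLEOSOME_BP, NUCLEOSOME_BP + MAX_LINKER_BP
    return sum(lower <= n <= upper for n in lengths) / len(lengths)


# Illustrative hotspots (made up, not data from the study).
toy_hotspots = [(2.0, [150, 200, 500]), (0.5, [60, 1000]),
                (1.8, [180, 220]), (0.3, [300, 146])]
high_lengths, low_lengths = split_by_mean_intensity(toy_hotspots)
high_fraction = fraction_in_nucleosome_range(high_lengths)
low_fraction = fraction_in_nucleosome_range(low_lengths)
```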

Axis label: No. of genes in hotspots located in Distal S4 & G2

Supplementary Figure S11. Enrichment Map for Distal-zone-enriched genes annotated using g:Profiler based on Gene Ontology biological process and KEGG. Each circular node represents a pathway (with 3 to 350 genes) significantly enriched in Distal zones with a Benjamini-Hochberg false discovery rate < 0.05, and the node size is proportional to the number of Distal-zone genes belonging to the pathway, as illustrated by the two circle signs of '5 genes' and '118 genes'. The pathway IDs inside the nodes correspond to those given in the table on the right. Pathways are connected by a grey [...] represented by the red-blue thermal scale and shown in the 'Fold' column of the [...]

[...] Enrichment of y-axis GVs at different distances from x-axis GVs, ranging from 0 bp to ± 500 bp in 50-bp increments. Enrichment of each y-axis GV is expressed as its density fold-change relative to its autosomal average density. Small indels (SID) are further separated into small insertions (SINS) and small deletions (SDEL). In addition to the eight kinds of GVs employed in hotspot detection, the variations analyzed in this figure also include the 'calls' of eight other kinds of structural variations given in the dbVar database.

Supplementary Figure S14. Distribution of GVs around the telomere ends. GV densities are measured in 5-Mb sequence windows located near the start of the short arm (left) and the end of the long arm (right) of 22 autosomes. The autosomal average density of each kind of GV is indicated by the dashed line. Error bars represent 95% CI. Density at each distance is compared to that of all 5-Mb windows in 22 autosomes using a one-tailed t-test. Red or blue asterisks represent significant (Benjamini-corrected p < 0.05) enrichment or depletion of a GV relative to its autosomal average.
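Two quantities recur in these captions: the density fold-change relative to the autosomal average, and Benjamini-Hochberg correction of p-values. The Python sketch below shows both in their standard form. It is not the authors' pipeline, the one-tailed t-test itself is not reproduced, and the function names and inputs are hypothetical.

```python
from typing import List


def fold_change(region_count: int, region_bp: int,
                autosome_count: int, autosome_bp: int) -> float:
    """Density in a region divided by the autosomal average density."""
    return (region_count / region_bp) / (autosome_count / autosome_bp)


def benjamini_hochberg(p_values: List[float]) -> List[float]:
    """Return BH-adjusted p-values in the same order as the input."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])  # indices by ascending p
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so that adjusted values stay monotone.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


# Illustrative p-values (made up, not results from the study).
raw_p = [0.01, 0.04, 0.03, 0.20]
adj_p = benjamini_hochberg(raw_p)
significant = [p < 0.05 for p in adj_p]
```

In this toy input only the first test remains below 0.05 after adjustment, although three of the four raw p-values were below 0.05.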