Contribution of 3D genome topological domains to genetic risk of cancers: a genome-wide computational study

Background Genome-wide association studies have identified statistical associations between various diseases, including cancers, and a large number of single-nucleotide polymorphisms (SNPs). However, they provide no direct explanation of the mechanisms underlying the association. Based on the recent discovery that changes in three-dimensional genome organization may have functional consequences on gene regulation favoring diseases, we investigated systematically the genome-wide distribution of disease-associated SNPs with respect to a specific feature of 3D genome organization: topologically associating domains (TADs) and their borders. Results For each of 449 diseases, we tested whether the associated SNPs are present in TAD borders more often than observed by chance, where chance (i.e., the null model in statistical terms) corresponds to the same number of pointwise loci drawn at random either in the entire genome, or in the entire set of disease-associated SNPs listed in the GWAS catalog. Our analysis shows that a fraction of diseases displays such a preferential localization of their risk loci. Moreover, cancers are relatively more frequent among these diseases, and this predominance is generally enhanced when considering only intergenic SNPs. The structure of SNP-based diseasome networks confirms that localization of risk loci in TAD borders differs between cancers and non-cancer diseases. Furthermore, different TAD border enrichments are observed in embryonic stem cells and differentiated cells, consistent with changes in topological domains along embryogenesis and delineating their contribution to disease risk. Conclusions Our results suggest that, for certain diseases, part of the genetic risk lies in a local genetic variation affecting the genome partitioning in topologically insulated domains. Investigating this possible contribution to genetic risk is particularly relevant in cancers. This study thus opens a way of interpreting genome-wide association studies, by distinguishing two types of disease-associated SNPs: one with an effect on an individual gene, the other acting in interplay with 3D genome organization. Supplementary Information The online version contains supplementary material available at 10.1186/s40246-022-00375-2.

Rule graph of the integrated pipeline. The figure displays an automated drawing of the bioinformatic and statistical pipeline devised for the study. For each disease, the potential TAD border enrichment in disease-associated SNPs is assessed from the associations available in the GWAS catalog and Hi-C data, downloaded from ftp://cooler.csail.mit.edu/coolers. The test involves the determination of TADs using TopDom algorithm [S1] and the computation of the p-value quantifying the statistical significance of the enrichment (see Methods). The subsequent steps of the analysis, for instance the aggregation of the results over several values of the parameter k of the TAD caller, the plots of enrichment histograms (Fig 2) or percentages (Fig 3), have been integrated in the pipeline, available at: https://github.com/kpj/GeneticRiskAndTADs. The whole analysis can thus be implemented straightforwardly for any Hi-C dataset in .cool format.

Supplementary File 1: Lists of cancers displaying TAD border enrichment in their associated SNPs
We present below two lists obtained using Hi-C data from [S2], at 10kb resolution, for NHEK and IMR90ncell lines, including only cancers displaying TAD border enrichment when border SNPs are defined using a majority rule (i.e. as the SNPs being in a border for a majority of values of the TAD caller parameter k). The presence or absence of an entry has a variable degree of robustness, so these lists are provided only to illustrate the outcomes of our analysis. Reliable conclusions for a specific cancer would require a dedicated detailed study.

Supplementary Fig 2. Variation of TADs and TAD borders at varying TopDom parameter k.
The top panel displays the Hi-C contact matrix as a heat map (see the color bar, the redder the more contacts), here for a region of chr11 (chr11: 123050000-123750000, hg19 coordinates, embedding the region in Fig 1), drawn from Hi-C data published in [S2], for IMR90 cell type, at 10kb-resolution. TADs determined with TopDom, for a window size k=10, are underlined together with their internal 20kb-borders. Vertical lines pinpoint SNPs located in a TAD border (full line for cancer-associated SNPs, dashed lines for SNPs associated with non-cancer diseases). TADs obtained at increasing value of k, from k=3 to k=20, are schematically displayed in the lines below (blue: TAD body, red: TAD border, value of k indicated above each line at its left end).

Supplementary Fig 3. Variation of TAD and TAD borders across data sources.
Same as the previous Supplementary Fig 2, now for different data sources, indicated above each line, at a fixed value k=10 of TopDom window parameter (blue: TAD body, red: TAD border defined as the end region of size 20kb within the TAD, yellow: situations where a TAD is too short and its two borders overlap). The Hi-C map is based on Hi-C data from [S2], IMR90 cell type, at 10kb-resolution, for the region chr11: 123050000-123750000, hg19 coordinates (same region as in Supplementary Fig 2 and embedding the region in Fig 1).

Supplementary Fig 4. Comparison of two null models for assessing TAD border enrichment in daSNPs.
The figure presents a scatter plot of the enrichment statistical significance [-log(p-value)], one dot per disease, for two different null models (see Methods) and, on the right and top respectively, the p-value distribution over all diseases for each null model. The first null model (horizontal axis in the scatter plot, distribution on top) is defined for each disease as a homogeneous distribution of the associated SNPs along the genome, at the base pair level, whereas the second one (vertical axis in the scatter plot, distribution on the right) is defined for each disease as a homogeneous distribution of the associated SNPs within the ensemble of all disease-associated SNPs gathered from the GWAS catalog (not including SNPs associated with non-pathological traits). The p-value assessing for each disease the overrepresentation of its associated SNPs in TAD borders is computed using the cumulative hypergeometric distribution, then corrected for multiple testing using Benjamini-Hochberg procedure [S3] separately for cancers and non-cancer diseases (see Methods). The correlation coefficient between the results obtained with each null model is 0.92. Hi-C data from [S2], IMR90 cell type, 10kb resolution; TAD determination with TopDom window parameter value k=10.

Supplementary Fig 5. Multiple testing correction in assessing TAD border enrichment in daSNPs.
Histograms of [-log(p-value)] for cancers (orange) and non-cancer diseases (blue, overlap in grey) (A) without multiple-testing correction; (B) with a correction applied to all diseases jointly; (C) with a correction applied separately to cancers and non-cancer diseases (group-wise correction, see Fig 2). In cases B and C, the correction followed Benjamini-Hochberg procedure [S3] controlling the false discovery rate. The significance threshold at 5% is indicated by the dashed red line. Histograms are normalized separately for cancers and non-cancer diseases. Same underlying Hi-C data and setting as in Fig 2.  Supplementary Fig 6. Multiple testing correction in assessing TAD border enrichment in intergenic daSNPs. Same as Supplementary Fig 5 but considering intergenic SNPs only. The figure displays the normalized histograms of [-log(p-value)] for cancers (orange) and non-cancer diseases (blue, overlap in grey) and (A) without multiple-testing correction; (B) with a correction applied to all diseases jointly; (C) with a correction applied separately to cancers and non-cancer diseases (group-wise correction, see Fig 2).

Supplementary Fig 7. TAD border enrichment across data sources.
The relative dominance of cancers among the diseases displaying a preferential location of their associated SNPs in TAD borders is investigated across Hi-C datasets (same setting as in Fig 3). The datasets were obtained in different laboratories (main ordering of the panels), or/and with different restriction enzymes (HindIII, MboI or DpnII, as indicated in the caption, see next page Supplementary Fig 8) or/and in different cell types: human embryonic stem cells (H1 hESC) and derived cell lines: mesendoderm (H1_ME), neural progenitors (H1_NP), trophoblast-like cells (H1_TB) and mesenchymal cells (H1_MS); human lymphoblastoid cell line (GM12878); fetal lung fibroblasts of Caucasian origin (IMR90); human mammary epithelial cells (HMEC); normal human epidermal keratinocytes (NHEK). human umbilical vein endothelial cells (HUVEC) [S2, S4-S7].

Supplementary Table 1: Values of network coherence for subgraphs in SNP-based diseasome networks.
Nodes in the diseasomes represent diseases, distinguishing cancers or non-cancer diseases. Four SNP-based diseasome networks have been constructed, depending on the meaning of an edge between two nodes. In the network labelled 'border', an edge is drawn between two diseases when they share at least one border SNP, i.e. a SNP located in a TAD border for a majority of values of TopDom parameter k. Non-border SNPs are defined as the complementary set of SNPs, yielding the network labelled 'non-border'. The networks labelled 'border intergenic', and 'non-border intergenic' are obtained when the additional condition of being intergenic is imposed on the shared SNPs. Note that these graphs are not disjoint; two diseases can be linked in two or more of these diseasome networks. The level of clustering (density of internal links) of various subgraphs, compared to a situation of random wiring, has been estimated using the notion of network coherence, defined as the fraction of connected nodes within the subgraph, z-transformed with a null model of randomly drawn node sets of the same size (see Methods).
The considered subgraphs correspond to sets of diseases: cancers, non-cancer diseases, cancers whose associated SNPs are preferentially located in TAD borders for a majority of values of k ('enriched cancer') or not ('not enriched cancer') and similar definitions for enriched and not enriched non-cancer diseases. Network coherence provides an absolute quantification, here computed for the 6 subgraphs in the four diseasome networks. A random subgraph would have a vanishing network coherence, a positive value indicates a high internal connectivity while a negative value reveals that the subgraph is less connected than a random set of nodes in the overall network.  For each disease, we computed the number of pairs of associated SNPs located at a distance closer than a threshold, equal to either 10kb (size of a bin), 20kb (size of a TAD border), 200kb or 1Mb (upper bounds on LD range). An average was then taken over several categories of diseases (cancers and non-cancer diseases, displaying or not TAD border enrichment) distinguishing when pairs are between any SNPs or only between border SNPs (the borders being determined as previously in the typical case of IMR90, data from [S2], TopDom parameter k=10). These numbers(rounded to the lower integer) show that the potential contribution of pairs of SNPs in strong LD is negligible in the interpretation of our enrichment results.

Supplementary File 2: Typical numbers of diseases and SNPs involved in the analysis
About 350 Mb are located within TAD borders, representing about 12% of the genome and 13% of the number of base pairs located in TADs. However, this number of base pairs located in TAD borders varies with the value of the parameter k used in the TAD caller TopDom, roughly decreasing when k increases. Exact values in the case of IMR90 cell type, data from [S2], are given in Supplementary Table 3.

Supplementary Table 3: Genome fraction located in TAD borders.
The fraction of the genome (resp. of the total number of base pairs in TADs) located in TAD borders is given for different values of the parameter k (window size) of the TAD caller TopDom (underlying Hi-C data from [S2], IMR90 cell type).
Only the statistical analysis could support statements about TAD border enrichment in daSNPs. However, to give a feeling of the orders of magnitude, we give below some typical numbers (by default, TAD borders are those determined with k=10 for the IMR90 cell type, data from [S2]).
449 EFOs have been considered in the study, among which 71 cancers (that is, 16%). . The overall number of base pairs located in TAD borders and border SNP counts vary with the datasets, as described in Supplementary Table 4. The number of disease-associated SNPs (without multiplicity) is 21,183 among which about 2800 (13%) are located in TAD borders, resp. 3,319 cancer-associated SNPs among which about 470 (14%) are located in TAD borders.
There is on average 9 border SNPs (rounded value) per disease, with no significant difference between cancers and non-cancer diseases. The standard deviation of the number of border SNPs per disease is larger than its mean (here 23 compared to 9) which reflects a broad distribution and the presence of outlier extreme values (here 227) The mean number of border SNPs associated with a disease dramatically increases when considering only diseases displaying TAD border enrichment in their associated SNPs, reaching on average 30 border SNPs for cancers and a similar value (29) for non-cancer diseases, while the mean decreases to 5 (resp. 7) for cancers (respect. non-cancer diseases) that do not display TAD border enrichment. We again underline that these figures are not sufficient to assess TAD border enrichment, and they can only motivate further statistical tests.
8,438 (40%) disease-associated SNPs are intergenic. 1,058 are intergenic and located in TAD borders. 1,275 are intergenic and associated with cancer. 176 are intergenic, associated with cancer and located in TAD borders.
11,552 (55%) disease-associated SNPs are intronic. 1,508 are intronic and located in TAD borders. 1,854 are intronic and associated with cancer. 257 are intronic, associated with cancer and located in TAD borders.