Genotypic changes in nontumor and paratumor tissues
White blood cells (B), tumor tissue (T), paratumor tissue (P) immediately adjacent to the tumor, and more remote nontumor tissue (N) were collected in 12 same-patient tetra sample cases consisted of four breast carcinomas (BRCA), five stomach adenocarcinomas (STAD), and three hepatocellular carcinomas (LIHC) (Additional files 1 and 2: Tables S1 and S2) and subjected to DNA analysis using the AluScan platform based on inter-Alu polymerase chain reaction (PCR) followed by massively parallel sequencing as described in the “Methods” section. The genotype of a base residue was referred to as a major allele (M) when it matches the sequence of human reference genome hg19 or as a minor allele (m) when there was no match, thereby enabling the identification of changes in the form of MM-to-Mm GOH (“GOH-M”), mm-to-Mm GOH (“GOH-m”), Mm-to-MM LOH (“LOH-M”), or Mm-to-mm LOH (“LOH-m”) [17]. Figure 1a, b show the total number of residue-by-residue changes in the N-, P-, or T-sample genomes relative to B-sample, viz. ∆NB, ∆PB, or ∆TB respectively in terms of GOH-M, GOH-m, and LOH (sum of instances of LOH-M and LOH-m). Since the numbers of GOH-M, GOH-m, and LOH mutations were higher in ΔNB than in ΔTB, and comparable in ΔPB and ΔTB, both the N-sample and P-sample cells had to be regarded as premalignant or early malignant cells despite their normal morphology and expression of immunohistochemistry (IHC) markers, in contrast with T-sample cells showing enlarged nuclei (Fig. 1c) and reduced expression of IHC markers (Additional file 1: Table S1). Since the residues of minor bases different from “m” were rare in the samples analyzed, mutations involving them are listed in Additional files 3 and 4: Tables S3 and S4 but not shown in Fig. 1a, b. Notably, 96% of the B-to-N-stage forward GOH mutations were reversed in P-stage via steps L1 and L8, and 56% and 75% of the B-to-N-stage forward LOHs were reversed via steps G11 and G12, respectively. On the other hand, only 16% and 0% of the B-to-N-stage forward GOH mutations were reversed in T-stage via steps L5 and L12, and only 1% and 0% of the B-to-N-stage LOH mutations were reversed in T-stage via steps G13 and G16.
Moreover, the LOH mutations partitioned Mm genotypes between MM and mm products on a non-random basis. Thus, the ratio of MM/mm products from the L1 and L2 steps was 1518/1, whereas the ratio of MM/mm products from the L7 and L8 steps was 0/357. Likewise, the ratio of MM/mm products from the L3 and L4 steps was 1116/0, and the ratio of MM/mm products from the L9 and L10 steps was 0/34. Therefore, the partition of LOH products in each of these instances was biased by strong lineage effect in favor of restoring the original germline genotype that gave rise to the Mm residue in the first place (as highlighted by yellow triangles in Fig. 1b).
Notably, in Fig. 1b right panel, the partition of the germline Mm genotypes via LOH steps L13 and L14 yielded a greater MM/mm product ratio than the partition via LOH steps L15 and L16, and greater still than the partition via LOH steps L21 and L22, although in each instance, MM products exceeded mm products (Fig. 2a). Since all these three successive partitions emanated from the germline Mm genotypes, their diminishing MM/mm ratios could not be the consequence of lineage effects. Instead, because MM genotypes in the genome have been optimized in general for growth in the course of human evolution, they tended to be favored over mm genotypes. The finding of [L13/L14 = 5.7] > [L15/L16 = 2.9] > [L21/L22 = 1.7] could be the result of the N-stage cells having gone through a more prolonged period of positive selection for MM genotypes than the P-stage cells, and the P-stage cells in turn have gone through a more prolonged period of positive selection than T-stage cells.
When the trinucleotide-based mutational profile method [32] was employed to classify the GOHs and LOHs observed in the B-N-T-P tetra samples into the C>A, C>G, C>T, T>A, T>C, and T>G groups, the results showed that C>T and T>C mutations were particularly prominent among both GOHs and LOHs, in keeping with the expectation that transitions would exceed transversions in SNVs (Fig. 1d). The C>T GOHs among the ∆NB changes displayed peak frequencies at the four NCG triplets, conforming to the “signature 1A” (marked by four solid arrowheads) common to cancers, and likely ascribable to the contribution of spontaneous deamination of 5-methylcytosine at methylated CpG to form thymidine [32, 39]. These deaminations would also explain the ~ 50% greater occurrence of CG>TG GOHs than TG>CG GOHs in the ∆NB changes. In support of this, Fig. 2b shows that although there were less CG dimers than other dimers among AluScan captured as well as whole-genome sequences (Additional file 5: Figure S1), more CG dimers underwent SNV mutations than any other dimers. In Fig. 1d, all SNV frequency columns in the ∆NB tier were represented by a solid segment and an open segment; the mutations in the solid segments were reversed in the next ∆PN tier, whereas the open segments were unreversed. Both the C>T and T>C GOHs show large solid segments indicating their extensive reversals in the ∆PN changes; since the T>C LOHs in the ∆PN tier were mostly reversals of the C>T GOHs in the ∆NB tier, these T>C LOHs were likewise more abundant than ∆PN C>T LOHs and showed four NTG peaks (marked by open arrows), which may be referred to as a “signature 1A”-like LOH feature.
Figure 2c summarizes the forward and reverse mutation occurrences in the B-N-P-T samples; more SNVs and CNVs occurred in N-stage (viz. sum of types II, III, IVa, and IVb patterns) than in P- and T-stages combined (viz. type I). Reversals of N-stage SNVs and CNVs (viz. sum of types IVa and IVb patterns) were common, amounting to ~ 70% of N-stage SNVs or ~ 40% of N-stage CNVs, and far more of such reversals took place in P-stage (type IVa) than in T-stage (type IVb), see Additional files 3 and 6: Tables S3 and S5 for detailed numbers of the SNVs and CNVs at different stages.
When another 17 B-N-T trio sample sets consisted of 1 BRCA, 2 LIHC, and 14 non-small cell lung cancers (NSCLC) were analyzed with respect to the GOH and LOH changes in the N- and T-stage cells relative to B-stage cells (Fig. 3, Additional files 7 and 8: Tables S6 and S7), the results obtained showed the same regularities as the B-N-P-T tetra samples: the B genomes displayed much higher LOH (L5, L6) rates and GOH-m (G3, G4) rates than GOH-M (G1, G2) rates, strong lineage effects in LOH partitions between MM and mm products (highlighted by yellow triangles), and prominent FR-mutations, viz. L1 reversing G1, L4 reversing G3, G5 reversing L5, and G6 reversing L6.
Genotypic changes in cultured HeLa cells
When frozen HeLa cells were restarted in culture and sequentially sampled for AluScan sequencing, the results obtained also showed a wave of forward mutations followed by reverse mutations. Figure 4a shows the changes in the genotypes of base residues between day 10 and day 5 (viz. ∆10–5) and between day 14 and day 5 (viz. ∆14–5), and these changes are indicated in the patch diagrams in Fig. 4b. Notably, of the 273 MM residues that mutated to Mm via the G1 step, 263 of them were reverted to MM by day 14, and none was mutated to mm. Similarly, of the 95 mm residues that mutated to Mm via the G3 step, 83 of them were reverted to mm, and none was mutated to MM. Thus, the ratio of MM/mm products from the yellow-highlighted L1 and L2 LOH steps was 263/0 and that for the L3 and L4 LOH steps was 0/83, displaying striking lineage effects in both instances comparable to the lineage effects displayed by the N-stage cells in Fig. 1b that were also yellow-highlighted. Since HeLa cells were transformed cells, the forward-reverse mutation cycles formed by the G1-L1 steps, or by the G3-L4 steps, in Fig. 4b could not be related to the oncogenic transformation. Instead, they likely represented a mechanism employed by the cells in the process of adapting to replating and growth.
Figure 4c shows the mutational profiles of GOHs (left panel) and LOHs (right panel) observed in the transitions between day 5 and day 10 (viz. ∆10–5, upper tier) and between day 10 and day 14 (viz. ∆14–10, lower tier), where the solid or open segments in the ∆10–5 tier represent the mutations that were reversed or unreversed respectively in the ∆14–10 tier. As in the case of the profiles for the ∆NB and ∆PN changes in Fig. 1d, both the CG>TG (blue) and TG>CG (pink) GOH peaks in the ∆10–5 tier were extensively reversed in the ∆14–10 tier, giving rise to the prominent TG>CG (pink) and CG>TG (blue) LOH peaks respectively in the ∆14–10 tier. The G1, L1, L2, G3, L3, and L4 rates in Fig. 4b were also similar to their counterpart G1, L1, L2, G6, L7, and L8 rates in Fig. 1b.
Genotypic changes in primary and metastatic tumors
Figures 5 and 6 compare the mutations observed in five cancer groups based on same-patient N-stage, T-stage, and metastatic stage (M-stage) samples: (i) AluScan group of 2 N-T-M trio sets analyzed with AluScan sequencing, (ii) WGS-Liver-M group of 4 trio sets of liver-to-lung metastasis analyzed by Ouyang et al. [27] using WGS, and 67 trio sets involving brain metastases analyzed with WES by Brastianos et al. [28], which were separated into (iii) 38 WES-Non-Lung cancers, (iv) 6 WES-NSCLC-L (L = low in C>A GOHs) cancers, and (v) 23 WES-NSCLC-H (H = high in C>A GOHs) cancers. Although the five N-T-M trio groups compared in Fig. 5a were analyzed using variously the AluScan, WES, and WGS platforms, the ratios of the [∆TN]/N and [∆MN]/N counts both indicated that the rates of LOH far surpassed the rates of GOH-m, which in turn far surpassed the rates of GOH-M (Additioanl files 9 and 10: Tables S8 and S9). All five groups also displayed pronounced lineage effects in Fig. 5b in the partitions of LOH mutations of Mm genotypes between MM and mm products (highlighted by yellow triangles).
In Fig. 6a, the relative prominences of ∆TN GOHs, ∆TN LOHs, ∆MT GOHs, and ∆MT LOHs varied among the five different cancer groups. This could arise in part from biological dissimilarities between the sequences analyzed on the different platforms on account of their varied sequence coverages of the genome. The SNV sites observed in the five groups displayed non-identical distributions among the genic, proximal, and distal sequence zones [35], as well as non-identical replication timings during the cell cycle (Fig. 6b). The proportion of ∆TN GOHs that became reversed in the ∆MT changes, marked by solid segments of the GOH frequency bars in the ∆TN tiers, was highest in the AluScan group, also quite high in the WES-NSCLC-L group, modest in the WES-Non-Lung group, and lowest in the WGS-Liver-M and WES-NSCLC-H groups, even though the WES-NSCLC-L, WES-Non-Lung, and WES-NSCLC-H groups were all analyzed based on the WES platform [28].
The WES-NSCLC-H group was unique in its display of particularly eminent C>A GOHs. Previously, C>A transversions were linked to polycyclic aromatic hydrocarbons [40] and acrolein [41] in tobacco smoke. The 23 WES-NSCLC-H samples were derived entirely from smokers, in accord with smoking being a significant factor for their elevated C>A GOHs. However, the WES-NSCLC-L samples with much more subdued C>A GOHs included two non-smokers and four smokers, suggesting that smoking or high C>A GOHs could play a less important carcinogenic role in a minority of smokers.
When the AluScan-capturable regions were extracted from the four N-T-M trio samples in the WGS-Liver-M group and analyzed with respect to their genotypic changes, the results obtained were similar to those obtained from the entire WGS sequences for the same samples: (a) among the ∆TN changes, the LOH/GOH-M ratios of 2.5 for the WGS-based samples (Fig. 5a), and 2.1 for the AluScan-based samples (Additional file 11: Figure S2a), were both substantially greater than unity; (b) among the ∆MN changes, the LOH/GOH-M ratios of 0.89 for the WGS-based samples, and 1.1 for the AluScan-based samples, were both close to unity; (c) in the patch diagram for the total WGS-based mutations originating from MM residues in the four samples (Fig. 5b), the rates for the L1 and L2 steps were 0.36 and 1.2E−4, respectively. In the total AluScan-based samples (Additional file 11: Figure S2b), the rates for the L1 and L2 steps were 0.50 and 4.4E-3 respectively. Thus, both WGS-based and AluScan-based analyses yielded a high L1/L2 > 100 rate ratio indicative of strong lineage effects in the LOH mutations that reversed the GOH mutation in the G1 step; and (d) the mutational profiles for the AluScan-based ∆TN and ∆MT changes (Additional file 11: Figure S2c) were highly similar to the WGS-based ones (Fig. 6a, WGS-Liver-M) with respect to the major mutation peaks in both the N-to-T and T-to-M transitions. The patch diagrams for the four individual AluScan-based cases (cases 1–4, Fig. 5c) were all similar to that for their sum total (WGS-Liver-M, Fig. 5c) in the much larger numbers of LOHs arising from L1 step compared to L2 step, testifying in each instance to a strong lineage effect.
Whole-genome sequencing confirmed the abundance of interstitial LOHs
A total of 246 tumor-control pairs from the International Cancer Genome Consortium (ICGC) collection of WGS data [42] were analyzed to yield LOH and GOH types of SNVs in each paired samples. These included a panel of 63 pan-cancer cases (pilot-63) (Fig. 7a), 22 intrahepatic cholangiocarcinoma (ICC), 86 LIHC, and 75 NSCLC cases (Fig. 7b, Additioanl file 12: Table S10), showing prominent LOHs in each instance. In the Pilot-63 dataset, the different ΔTB mutation counts in T-stage cells relative to B-stage cells (Fig. 7a left panel) yielded a rate ratio of 4300 between LOH and GOH-M, which was comparable to the 5400, 2700, and 5300 rate ratios observed in Figs. 1 and 3 and earlier in Reference [17], respectively, indicating a vastly greater rate of LOH than GOH in the cancer cells in all four instances. As well, in all four instances, the three mutation rates remain in the same order of RLOH > RGOH-m > RGOH-M, with LOH rate being the highest. The massive interstitial LOH rates observed earlier based on AluScan data were thus confirmed by the ICGC Pilot-63 WGS dataset.
Evidence from mutational profiles for gene conversions in LOH production
For the ∆TB SNVs of Pilot-63 WGS dataset, Fig. 7c shows the alteration-group plots of mutational profiles, which are rearranged in Fig. 7d so that opposing GOH pairs or opposing LOH pairs are placed side-by-side, e.g., by pairing the ATG>ACG GOH (pink bar) with the ACG>ATG GOH (blue bar) in section 3 of the left panel and similarly pairing the ATG>ACG LOH (pink bar) with the ACG>ATG LOH (blue bar) in section 3 of the right panel (marked by arrowheads). Figure 7d shows the strikingly similar heights of the opposing C>T (blue) or T>C (pink) LOH bars in the right panel, but the generally dissimilar heights of the opposing C>T (blue) or T>C (pink) GOH bars in the left panel. The context-group plots in Additional file 13: Figure S3a show comparable rates for different pairs of opposing LOHs in contrast to the generally unequal rates for different pairs of opposing GOHs. The rates for the individual pairs of opposing GOHs are further displayed in the mutation-rate diagrams in Additional file 13: Figure S3a and those for the opposing LOHs in the mutation-rate diagrams in Additional file 13: Figure S3b. One of the GOH diagrams from Additional file 13: Figure S3a and one of the LOH diagrams from Additional file 13: Figure S3b are illustrated in Fig. 7e, showing the frequencies for all opposing GOH pairs or opposing LOH pairs, e.g., the rates of ACG>ATG GOH and ATG>ACG GOH changes (arrowhead-marked in Fig. 7d left panel) were dissimilar (229 in blue versus 95 in pink), but the rates of LOH changes of the same triplet duplexes (arrowhead-marked in Fig. 7d right panel) were similar (1617 in blue versus 1558 in pink). The ratios between the frequencies (or rates) of the different LOH pairs were 192/183, 150/141, 713/704, 1617/1558, 204/200, and 178/174, which varied only between 1.01–1.06. The results from Additional file 13: Figure S3a and b are summarized in Fig. 7f, where over 61% of the opposing pair frequencies were greater than 2 and spread between 1 and 14 for the GOHs (blue bars), but 100% between 1 and 2 for the LOHs (striped red bars), clearly indicative of the different mutational mechanisms employed for the production of the GOHs versus the LOHs. This divergence of the rate ratios between opposing GOHs and opposing LOHs was in accord with our proposal that the LOHs in cancer cells were generated mainly by double-strand break (DSB) repairs through gene conversion, whereas the GOHs were produced by more diverse mechanisms including mutations due to the highly error-prone nature of the DNA polymerase employed for interhomolog recombination [17] and deaminations that accounted for the ~ 50% greater occurrence of CG>TG GOHs than TG>CG GOHs in the ∆NB changes (Fig. 1d). In a DSB at a heterozygous C/T, LOH by gene conversion could yield either a C/C or T/T homozygous position at comparable rates, depending on which homologous chromatid bears the DSB. On the other hand, because GOHs depend on point mutations rather than gene conversions, this comparable-rate constraint would not apply to GOHs.
Moreover, for the B-N-P-T tetra samples, 95.5% of the forward LOHs in the B-to-N transition (steps L13 and L14, Fig. 1b), 98.7% of the reverse LOHs in the N-to-P transition (steps L1 and L8, Fig. 1b), and 95.2% of the reverse LOHs in the P-to-T transition (steps L3 and L10, Fig. 1b) occurred within the copy number neutral regions (Additional file 14: Figure S4), suggesting that both the forward LOHs and the reverse LOHs were mostly brought about by gene conversion.
Distances between SNVs and recurrent CNVs
That the N-stage SNVs and CNVs in the B-N-P-T tetra samples both underwent active reversions, and more in P-stage than in T-stage (Fig. 2c) suggest some form of possible correlation between these two types of mutations. This was supported by Fig. 2d which shows that the sites of C>T GOHs with NCG context occurring in the ∆NB changes, and T>C LOHs with NTG context occurring in the ∆PN changes, were located particularly close to the recurrent CNVs compared to the mutations with other contexts or in other stages of change, p < 10−7. Furthermore, these two groups of SNVs declined with the age at diagnosis (Fig. 2e), in resemblance to the decrease of global DNA methylation in old age [43]. The correlation between somatic CNVs with CpGe and MeMRE (Fig. 2f), the increased SNVs at CpG sites (Fig. 2b), and the high tendency of methylated CpG conversion to TpG [44] also pointed to some SNV-CNV relationships in the CNV production process, such as breakpoint misrepair and merit investigation.
Frequency classes of serial CNV changes
In the B-N-P-T tetra sample cases, the status of any CNV in the N-, P-, and T-stages could be CN-unaltered (U), CN-gain (G), or CN-loss (L) relative to its status in B-stage. Arranging in serial order, the CN-status found in the N-P-T stages (Additional file 6 and 15: Tables S5 and S11) yielded 26 different serial orders, and their frequencies fell into three classes (Fig. 8a, b). In the LUG order, for example, each CNV site was CN-loss in N-stage, CN-unaltered in P-stage, and CN-gain in T-stage, and the total number of sequence windows in the B-N-P-T sequences analyzed that exhibited such an LUG order made up the frequency on the y-axis of Fig. 8a. The three frequency classes separated by vertical dashed lines in the figure were:
-
I.
Class I (U = 2)—comprising six different orders, where a U status occurred in two of the N-, P-, and T-stages.
-
II.
Class II (U ≤ 1)—comprising eight different orders, where the U status occurred in no more than one of the N-, P-, and T-stages.
-
III.
Class III (disadvantaged)—comprising 12 different orders, where 10 of them (viz. outside of LUG and GUL) included an abrupt double-dose change directly from G to L, or L to G in the order.
The plausible basis for these different classes could be straightforward; the CNV orders in class I entailed minimal copy-number departures from the starting B-stage and were therefore well tolerated; in comparison, the class II of CNV orders incurred greater departures from U and were less well tolerated. Every CNV order in class III involved at least one double-dose change jumping either from G to L or from L to G between two successive stages of cancer development, a distinct disadvantage that led to their lowest frequencies.
The double-dose disadvantage explained the low frequencies of GLU, LGU, UGL, ULG, LGL, GLG, LGG, GGL, GLL, and LLG, but not the low frequencies of LUG and GUL which fell into class III even though they did not incur any double-dose copy number changes, in contrast to GUG and LUL which belonged to the more abundant class II. The contrast indicates that lineage effects were important not only to LOH partitions (Figs. 1b, 3b, and 5b) but also to the frequencies of different CNV orders. In GUG, the G status of T-stage cells constituted a reversion to the G status of N-stage cells. In LUL, the L status of T-stage cells likewise constituted a reversion to the L status of the N-stage cells. Thus, both these reversions were favored by lineage effects, allowing GUG and LUL to join class II even though they each incurred two CNV status changes. In contrast, lineage effects acted against LUG and GUL, because the CNV status of the T-stage cells in these cases was not a restoration of the CNV status of the N-stage cells, thereby explaining their diminished frequencies. These lineage effects were observable when different sizes of sequence windows were employed for CNV identification: as shown in Fig. 8c, both the quotients (Q value) of GUG/GUL and LUL/LUG greatly exceeded unity (marked by dashed red line), yielding significant lineage effects of p < 10−16 for all sizes of sequence windows ranging from 50 to 500 kb.