Targeted exome sequencing identifies mutational landscape in a cohort of 1500 Chinese patients with non-small cell lung carcinoma (NSCLC)

Background Non-small cell lung carcinoma (NSCLC) is one of the most common human cancers, comprising approximately 80–85% of all lung carcinomas. The estimated incidence of NSCLC is approximately 2 million new cases per year worldwide. Results In the past decade, the treatment of NSCLC has made breakthrough progress owing to the large number of targeted therapies approved for clinical use. Epidemiology, genetic susceptibility, and molecular profiles of patients are likely to play an important role in response rates and survival benefits from these targeted treatments and thus warrant further investigation of ethnic differences in NSCLC. In this study, a total of 1500 Chinese patient samples, 1000 formalin-fixed paraffin-embedded (FFPE) and 500 blood samples, from patients with NSCLC were analyzed by targeted sequencing to explore the mutational landscape in ethnic groups associated with China. Conclusions Overall, the data presented here provide a comprehensive analysis of the NSCLC mutational landscape in Chinese patients, and the findings are discussed in the context of similar studies on different ethnic groups.

to treatment. However, previous studies raised the possibility that the distribution of these mutations shows a race-dependent pattern, with one study estimating that 10% of Caucasians but as many as 50% of Asians carry drug-sensitizing mutations of EGFR [2]. The high variation in mutation frequency observed across demographic subgroups calls for large-scale studies that systematically investigate mutation landscapes in specific populations and offer better insight into which genes have to be tested before choosing a targeted therapy [3,4].
Next-generation sequencing (NGS) has revolutionized the identification and systematic characterization of genomic alterations, including single nucleotide variations and small insertions/deletions (InDels), and cancer societies will likely issue recommendations in the very near future on its routine use in clinical oncology practice. Indeed, upfront tumor genotyping is now widely considered an essential step in guiding treatment decision-making in the management of patients with NSCLC [5].
In this study, 1000 formalin-fixed paraffin-embedded (FFPE) samples and 500 blood samples from patients with NSCLC were analyzed by NGS-based targeted sequencing. To our knowledge, this study represents one of the largest efforts so far to systematically characterize the mutational landscape in a Chinese NSCLC cohort.

Clinical features of the patient samples
Discovery and quantification of genetic alterations in NSCLC, from point mutations to large genomic rearrangements, require a comprehensive genome-wide approach and a large sample cohort. We collected 1000 formalin-fixed paraffin-embedded (FFPE) tumor samples and 500 blood samples from a total of 1500 patients diagnosed with NSCLC between June 2017 and April 2019. Tissue and blood samples were obtained from independent patient groups. The detailed clinical characteristics of the patients are shown in Table 1. Briefly, lung adenocarcinoma accounted for 84.3% of the FFPE samples (843/1000), squamous cell carcinoma for 14.2% (142/1000), and others for 1.5% (15/1000). Among the blood samples, lung adenocarcinoma accounted for 80.4% (402/500), squamous cell carcinoma for 17% (85/500), and others for 2.6% (13/500). In total, 39 samples were excluded because they did not pass quality standards during sample processing and sequencing.
Overview of the genomic alterations of 1000 tissue and 500 blood samples of NSCLC patients
The clinical significance of identifying hypermutated tumors has recently been demonstrated in several NSCLC studies [6,7]. However, there is a large variability in mutation burden within tumor types in NSCLCs [8]. To begin to explore the mutation burden in our cohort, we first identified the overall mutation landscape across the tissue and blood samples. We subclassified mutations into four main types: single mutation (single base variation, insertion or deletion, SM), multiple single mutations (MM), amplification (AMP), and fusion (FUS) (Fig. 1). As for the FFPE NSCLC tissue samples, a total of 968/1000 samples had at least one type of the above-  (Fig. 1).
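The four-way subclassification described above can be illustrated with a minimal sketch. The label names SM, MM, AMP, and FUS come from the text; the input format, the function name `classify_sample`, and the reading of MM as "more than one small variant in the same sample" are assumptions made for illustration and are not the authors' pipeline.

```python
from typing import Iterable, Set

# Alteration kinds counted as "single mutations" in the text:
# single base variation, insertion, or deletion.
SMALL_VARIANT_KINDS = {"snv", "insertion", "deletion"}


def classify_sample(alterations: Iterable[str]) -> Set[str]:
    """Return the set of mutation-type labels for one sample.

    `alterations` is a hypothetical list of alteration kinds detected in the
    sample, e.g. ["snv", "fusion"]. Assumption: MM means that more than one
    small variant was detected in the same sample.
    """
    kinds = list(alterations)
    labels: Set[str] = set()

    n_small = sum(1 for kind in kinds if kind in SMALL_VARIANT_KINDS)
    if n_small == 1:
        labels.add("SM")
    elif n_small > 1:
        labels.add("MM")

    if "amplification" in kinds:
        labels.add("AMP")
    if "fusion" in kinds:
        labels.add("FUS")
    return labels
```

Under this reading, a sample with no detected alteration receives an empty label set, which corresponds to the samples not counted among those with at least one alteration type.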

Mutation patterns of frequently altered cancer genes
Next, we set out to determine the cancer genes most commonly enriched for SNVs/InDels in our NSCLC patient cohort. We identified many genes previously found to be mutated in NSCLC, including the tumor suppressor genes TP53 [9] and CDKN2A [10] and the oncogenes EGFR [11] and KRAS [12]. Notably, we observed a high frequency of TP53 and EGFR mutations in both blood and tissue samples of NSCLC patients (Fig. 2a, b). Co-occurrence of EGFR and TP53 mutations was notable in the tissue samples (>25%). The EGFR mutation rate was significantly higher in tissue (~55%) than in blood (~35%). In addition, we found several other genes that were significantly mutated in our cohort, such as PTCH1 and PIK3CA (Fig. 2a, b). Other, less frequently detected but previously identified genes included tumor suppressor genes (APC) and tyrosine kinase genes (ERBB2, FGFR, and NTRK genes). Next, we assessed the distribution of nonsynonymous frameshift insertions and deletions, missense mutations, stop-gain mutations, and other infrequent alterations (e.g., splicing) in both the tissue and blood samples (Fig. 2c-e). In addition to identifying previously known NSCLC-associated genes, such as TP53, KRAS, EGFR, and CDKN2A, the analysis revealed the GNAQ gene, which was previously implicated mostly in melanomas and was linked to lung cancer only in a very recent study (Fig. 2c-e) [13]. Identified mutations of GNAQ included p.R60G, p.P174R, p.A93D, p.M59L, and p.Q81H.

Recurrent SNV mutations in NSCLC
Next, we explored the positional distribution and recurrence of SNV mutations in the most frequently mutated genes: TP53, EGFR, KRAS, CDKN2A, PTCH1, and PIK3CA (Fig. 3).
Most clinical studies suggest that lung cancer with alterations detected in TP53 carries an overall worse prognosis and that such cases are more resistant to chemotherapy and radiation [14]. Indeed, as shown in Fig. 2, mutations of the TP53 gene occurred in over 50% of NSCLC samples in our cohort. In our cohort, only 8 samples showed mutations at codon 157, 6 samples at codon 158, 11 samples at codon 179, and 27 samples at codon 248 of TP53. These codons are typically mutated in lung cancer from smokers and uncommonly observed in lung cancer from nonsmokers [15].
Previous analysis of the tyrosine kinase (TK) domain of EGFR by Shigematsu et al. identified that all mutations in lung cancer specimens occurred within exons 18-21, with a prevalence of 21% [11,16,17]. Consistent with these previous reports, EGFR mainly had three subtypes of mutation (p.L858R, Exon 19del, and p.T790M). EGFR p.L858R and Exon 19del were the most common activating EGFR mutations, which may be sensitive to EGFR tyrosine kinase inhibitors (EGFR-TKIs) such as gefitinib, erlotinib, or afatinib. We found that the percentages of these mutations in FFPE and blood samples were similar: p.L858R accounted for 42.4% in blood samples and 44.4% in FFPE samples. Similarly, Exon 19del accounted for 38.5.4% in blood samples and 34.2% in FFPE samples. Interestingly, the percentage of p.T790M differed significantly between FFPE and blood samples, at 24.8% and 2.4%, respectively.
We found that mutations in KRAS were mostly detected at amino acid positions 12, 13, and 61, in regions that are considered mutational hotspots (Fig. 3). Recurrent mutations included p.G12C, p.G12V, p.G13D, and p.Q61H. In addition, we found p.A146T in two tissue samples.
In addition to the previously described mutations involving the TP53, EGFR, and KRAS genes, our analysis in this large cohort revealed several other recurrent point mutations in NSCLC. For instance, recurrent point mutations (E545K) in the PIK3CA gene were identified. Somatic mutations of the PIK3CA gene have also been described in NSCLC [18,19].

Structural rearrangement signatures and overview of aberration frequencies identified in our NSCLC patient cohort
Previous studies have detected significant copy number alterations in lung adenocarcinomas [21,22]. Sequencing of the coding exons of the 65 pre-selected candidate cancer genes in our study identified gene amplifications in both lung adenocarcinoma (LUAD) and lung squamous cell carcinoma (LUSC) (Fig. 4a, b). Consistent with previous reports, we found that both EGFR and KRAS gene copy number gains occur frequently in NSCLC [23,24].

Combination of SNV, amplification, and fusion of significantly mutated genes
Finally, to further explore the mutations in the most common cancer genes involved in Chinese NSCLC patients, we also assessed the co-occurrence of single nucleotide variations with other mutational events. Strikingly, the majority of samples (~90%) carrying KRAS mutations did not contain any other type of mutation (Fig. 5). In contrast, EGFR mutations often co-occurred with other mutations.
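The kind of tally behind such a statement can be sketched as follows. This is only an illustration under stated assumptions: the toy input format, the type labels ("SNV", "AMP"), and the function name `fraction_snv_only` are hypothetical, and the sketch adopts one possible reading of the analysis, namely that a sample counts as "SNV only" for a gene when no other alteration type is recorded for that same gene.

```python
from typing import Dict, Optional, Set, Tuple

# Hypothetical input: sample ID -> set of (gene, alteration type) pairs.
SampleAlterations = Dict[str, Set[Tuple[str, str]]]


def fraction_snv_only(samples: SampleAlterations, gene: str) -> Optional[float]:
    """Among samples carrying an SNV in `gene`, return the fraction in which
    SNV is the only alteration type recorded for that gene.

    Returns None when no sample carries an SNV in the gene.
    """
    carriers = 0
    snv_only = 0
    for alterations in samples.values():
        types_in_gene = {alt_type for g, alt_type in alterations if g == gene}
        if "SNV" not in types_in_gene:
            continue
        carriers += 1
        if types_in_gene == {"SNV"}:
            snv_only += 1
    if carriers == 0:
        return None
    return snv_only / carriers


# Toy data for illustration only; these are not values from the study.
toy = {
    "S1": {("KRAS", "SNV")},
    "S2": {("KRAS", "SNV"), ("KRAS", "AMP")},
    "S3": {("EGFR", "SNV"), ("EGFR", "AMP")},
    "S4": {("KRAS", "SNV"), ("TP53", "SNV")},
}
kras_fraction = fraction_snv_only(toy, "KRAS")  # 2 of 3 KRAS carriers
```

If the intended comparison is instead across genes (no alteration in any other gene), the set comprehension would need to collect alterations outside `gene` rather than inside it.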

Discussion
In this study, we analyzed genomic events in a large set of FFPE and blood samples from patients with NSCLC. Specifically, we used targeted sequencing of selected candidate genes to identify the most common mutations in a large cohort of Chinese NSCLC patients. The vast amount of genomic information generated in this and similar studies is expected to transform our current understanding of lung cancer and advance personalized lung cancer therapy. We also anticipate that our study, along with other studies implementing tumor mutation landscape analysis using targeted and genome-wide NGS across different ethnic groups in lung cancer, will greatly expand our knowledge base in lung cancer biology, treatment strategy, new drug target development, and NSCLC outcome.
In fact, discoveries based on previous mutational analyses have already significantly improved and expanded the availability of targeted therapies. The development of new receptor kinase inhibitors, such as erlotinib and gefitinib (against EGFR) and most recently crizotinib (against rearranged ALK), and of antibodies such as cetuximab (against EGFR) are all good examples of how NGS can help to improve personalized medicine [26]. However, while these drugs are effective in a subset of patients, our analysis and other studies clearly suggest a very complex mutational landscape in NSCLC and call for further targeted drug development to reduce the still high mortality rate of NSCLC.
An interesting target that emerged from our analysis is GNAQ (Fig. 2). GNAQ (guanine nucleotide binding protein [G protein], q polypeptide) is known as a subunit of one of the heterotrimeric guanine nucleotide binding proteins (G proteins) that is involved in multiple processes of mammary cells including hormonal signal transduction, metabolism, development, cell survival, and sensory functions. Previous studies mostly implicated its mutations in melanoma, and GNAQ mutations have not been documented in NSCLC. We found several
Another interesting candidate for follow-up studies was the tumor suppressor Patched 1 (PTCH1), a multipass transmembrane protein that is overexpressed in many metastatic cancers. In an unbound inactive state, PTCH1 acts as a negative regulator of smoothened (SMO), while upon activation it leads to activation of the GLI1 proto-oncoprotein. Since PTCH1 is a multidrug transporter, it contributes to chemotherapy resistance by the efflux of chemotherapeutic agents such as doxorubicin [27]. PTCH1-altered tumors can now be targeted with three different FDA-approved SMO inhibitors, namely sonidegib, vismodegib, and glasdegib [27].
An important context to discuss is that of health disparities, which are a recognized and well-documented phenomenon in the cancer field but have not yet been addressed in the case of NSCLC. Socioeconomic and cultural differences across ethnic groups undoubtedly account for some of the disparities, namely that certain groups may bear a disproportionate burden of cancer compared with other groups. Our study specifically aimed to collect and explore data from a well-defined group of patients based on geographic location. Our data collection and exploration did not yet include gathering information on income, education, disabilities, and other possibly relevant characteristics. Nevertheless, it is important to highlight that all analyzed samples came from non-smoking patients and that we gathered information on gender, which will be further correlated with mutational landscapes in follow-up studies.
While a number of cancer centers have already begun to integrate molecular profiling and even clinical next-generation sequencing (NGS) into the pipeline of routine cancer diagnosis in order to increase the accuracy and efficiency of treatments, it is important to recognize and discuss the limitations of targeted therapy in the treatment of NSCLC. For instance, although EGFR inhibitors, such as gefitinib, erlotinib, or afatinib, can effectively shrink tumors for several months, these drugs eventually stop working for most patients, usually because the cancer cells within the tumor develop additional mutation(s) in the EGFR gene. Studies investigating the clinicopathological factors influencing post-recurrence survival and the effect of post-recurrence therapy in NSCLC will be critical to further advance therapies.
Fig. 5 An overview of significantly mutated genes. Assessment of single mutations (SNVs and InDels), multiple mutations, and amplifications across the top most frequently mutated genes, excluding TP53. Genes were depicted according to aberration frequencies

Conclusions
In summary, using targeted exome sequencing of 65 genes, we have identified mutations in a large cohort of Chinese NSCLC blood and tissue samples and provided an overview of the mutational landscape by analyzing CNVs, fusions, and SNVs/InDels in detail.

Samples
The study was conducted in accordance with the Helsinki Declaration and was approved by the institute's Ethics Committee. All enrolled patients were informed about the content and purposes of this study and signed informed consent. In this study, we have collected and processed a total of 1000 formalin-
The FFPE and blood samples were sequenced on the Illumina NovaSeq platform. For the FFPE samples, the mean sequencing depth was nearly 1200x, the coverage rate was 99.99%, and the fraction of bases mapped to the target region was between 40 and 70%. A coverage of at least 200x and a mutant allele fraction of at least 1% were used as the standard cutoffs for the final variant call. For the blood samples, the mean sequencing depth was nearly 10000x, the coverage rate was 99.99%, and the fraction of bases mapped to the target region was between 4 and 70%. A coverage of at least 2000x and a mutant allele fraction of at least 0.5% were used as the cutoffs for the final variant call.
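The depth and allele-fraction cutoffs above can be expressed as a simple filter. The following is a minimal sketch, not the authors' variant-calling code: the cutoff values come from the text, while the `Candidate` record, its field names, and the function `passes_cutoff` are hypothetical and introduced only for illustration.

```python
from dataclasses import dataclass

# Cutoffs quoted in the text: (minimum coverage, minimum mutant allele fraction).
CUTOFFS = {
    "FFPE": (200, 0.01),     # at least 200x and 1%
    "blood": (2000, 0.005),  # at least 2000x and 0.5%
}


@dataclass
class Candidate:
    """Hypothetical candidate variant record (field names are assumptions)."""
    gene: str
    depth: int      # total reads covering the site
    alt_reads: int  # reads supporting the mutant allele


def passes_cutoff(candidate: Candidate, sample_type: str) -> bool:
    """Apply the sample-type-specific depth and allele-fraction cutoffs."""
    min_depth, min_fraction = CUTOFFS[sample_type]
    if candidate.depth < min_depth:
        return False
    allele_fraction = candidate.alt_reads / candidate.depth
    return allele_fraction >= min_fraction
```

For example, a blood-sample candidate with 60 mutant reads at 10000x depth (0.6%) would pass, whereas one with 40 mutant reads at the same depth (0.4%) would not.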

Bioinformatics analysis
Our initial analysis aimed to explore genomic alterations, including gene rearrangements, copy number variations (CNVs), single nucleotide variants (SNVs), and short and long insertions/deletions (InDels). Raw sequencing reads were aligned to the human reference genome (hg19) using Burrows-Wheeler Aligner (BWA). Consensus reads were generated for error suppression, and PCR duplicates were removed using the in-house software ECR. Read depth and coverage of the targeted regions were calculated by the in-house software LibraryQC. The log-ratio per region of each target gene was calculated, and customized algorithms were used to detect copy number variations. Focal amplifications were defined as genes with ≥4 copies. Gene rearrangements and long indels were detected using CREST [28] and Manta [29]. SNVs and short indels were identified by MuTect [30] and Pindel [31].

Availability of data and materials
All data generated or analyzed during this study are included in this published article. The sequence data will be provided upon request.
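As an illustration of the copy-number step described under Bioinformatics analysis, a log-ratio can be converted to an estimated copy number and compared with the ≥4-copy threshold. This is a simplified sketch and not the customized algorithm used in the study: it assumes a base-2 log-ratio of sample depth to a diploid reference depth and ignores tumor purity and normalization; the function names are hypothetical.

```python
import math

AMPLIFICATION_THRESHOLD = 4.0  # copies; the ">=4 copies" criterion from the text


def log2_ratio(sample_depth: float, reference_depth: float) -> float:
    """Base-2 log-ratio of sample depth to reference depth for one region."""
    return math.log2(sample_depth / reference_depth)


def estimated_copies(ratio: float, reference_ploidy: int = 2) -> float:
    """Convert a log2 ratio to an estimated copy number.

    Assumption: the reference is diploid and the sample is 100% tumor,
    so copies = ploidy * 2 ** ratio.
    """
    return reference_ploidy * 2.0 ** ratio


def is_focal_amplification(ratio: float) -> bool:
    """Flag a gene whose estimated copy number reaches the threshold."""
    return estimated_copies(ratio) >= AMPLIFICATION_THRESHOLD
```

Under these assumptions, a region with twice the reference depth has a log2 ratio of 1 and an estimated copy number of 4, which is the smallest value that would be flagged as a focal amplification.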

Declarations
Ethics approval and consent to participate
The study was conducted in accordance with the Helsinki Declaration and was approved by the institute's Ethics Committee. All enrolled patients were informed about the content and purposes of this study and signed informed consent.