Novel clinical, molecular and bioinformatics insights into the genetic background of autism

Talli, Ioanna; Dovrolis, Nikolas; Oulas, Anastasis; Stavrakaki, Stavroula; Makedou, Kali; Spyrou, George M.; Maroulakou, Ioanna

doi:10.1186/s40246-022-00415-x

Research
Open access
Published: 18 September 2022

Human Genomics volume 16, Article number: 39 (2022)

Abstract

Background

Clinical classification of autistic patients based on current WHO criteria provides a valuable but simplified depiction of the true nature of the disorder. Our goal is to determine the biology of the disorder and the ASD-associated genes that lead to differences in the severity and variability of clinical features, which can enhance the ability to predict clinical outcomes.

Method

Novel Whole Exome Sequencing data from children (n = 33) with ASD were collected along with extended cognitive and linguistic assessments. A machine learning methodology and a literature-based approach took into consideration known effects of genetic variation on the translated proteins, linking them with specific ASD clinical manifestations, namely non-verbal IQ, memory, attention and oral language deficits.

Results

The linear regression polygenic risk score model classified severe and mild ASD samples with an 81.81% prediction accuracy. The literature-based approach revealed 14 genes present in all sub-phenotypes (independent of severity) and others which seem to impair individual ones, highlighting genetic profiles specific to mild and severe ASD, which concern non-verbal IQ, memory, attention and oral language skills.

Conclusions

These genes can potentially contribute toward a diagnostic gene-set for determining ASD severity. However, due to the limited number of patients in this study, our classification approach is mostly centered on the prediction and verification of these genes and does not hold a diagnostic nature per se. Substantial further experimentation is required to validate their role as diagnostic markers. The use of these genes as input for functional analysis highlights important biological processes and bridges the gap between genotype and phenotype in ASD.

Background

According to the Diagnostic and Statistical Manual of Mental Disorders [1], Autism Spectrum Disorder (ASD) is associated with abnormalities in communication and social interaction in the early developmental period and with restricted and repetitive patterns of behavior or interests. Cognitive skills such as intelligence, memory and attention, as well as language skills, may also be affected in ASD. Forty-four percent of children identified with ASD have an average or above average intelligence quotient (IQ > 85), 25% have a below average IQ (71–85) and 31% are within the range of intellectual disability (IQ < 70) [2, 3]. Since ASD is a heterogeneous disorder, researchers usually adopt one of two main classifications: one based on the presence or absence of intellectual disability (ID) and one based on oral language skills (i.e., verbal or minimally verbal children).

The most common classification distinguishes two main subgroups: individuals in whom ASD coexists with intellectual disability and individuals with average or above average intellectual functioning, whose linguistic, cognitive and social characteristics differ from those of individuals with intellectual disability [4]. Another classification distinguishes verbal children from non-verbal or "minimally verbal" children, i.e., children who have very limited use of spoken language for communication purposes. The reception of language might also be affected, and the autistic symptoms are usually severe in terms of behavior [5,6,7,8]. It has been commonly believed that non-verbal cognitive abilities predict expressive and receptive language [2, 3]. However, Hanson et al. [9] have shown that there are minimally verbal children with autism who do not have low non-verbal IQ, others who have both low expressive and low receptive language skills, and others who have low expressive but good receptive language skills. Consequently, categorization of subgroups in ASD is problematic.

Additionally, the association of genetic loci with specific behavioral characteristics in ASD contributes significantly to the understanding of the influence of genetic factors on clinical phenotype. This connection arises from studies whose methodology includes, in addition to genetic analysis, behavioral assessment, such as language and cognitive assessment. Recently, various researchers have suggested that genetic findings are more informative about the clinical phenotypes of ASD than vice versa [6]. Several chromosomal copy number variants (CNVs) and single-nucleotide variants (SNVs) (such as deletions and duplications at chromosomal regions 1q21, 7q11.23, 15q11–13, 16p11.2, and 22q11.2) have been identified as genetic risk factors for ASD [7, 8] and have been shown to have predictive value for the clinical phenotype of ASD [5]. For example, 15q11.2 duplications are linked to ASD and schizophrenia [10] in addition to their connection to high rates of epilepsy [11]. Other deletions, including 16p11.2, have been linked to cognitive deficits such as intellectual disability [9, 12] as well as developmental coordination disorder, phonological processing disorder, and expressive and receptive language disorders [13]. Prognostication of the clinical profiles of individuals with ASD based on specific genes that could serve as reliable biomarkers is important for early diagnosis and eventually for early and effective treatment. These findings would not be possible without contemporary sequencing and bioinformatics methods.

The technological advancements of next-generation sequencing (NGS), including whole genome sequencing (WGS) and whole exome sequencing (WES), have enabled researchers to perform detailed gene variation analyses such as genome-wide association studies (GWAS) en masse. The newfound accessibility of these technologies not only enables experimental high-throughput protocols to be undertaken but also provides clinicians with powerful tools for assessing disease pathogenesis, progression and outcome. It has also enabled clinicians to provide more gene-guided counseling on matters such as therapy (through pharmacogenomics) and pre/peri-natal consulting. However, a variety of factors need to be taken into account, especially given the complex nature of various diseases and the idiosyncrasies of individual patients regarding their genetic background. These considerations bring precision medicine to the fore.

In precision medicine, genetic variation screening provides an important tool for detecting individuals at high risk of specific genetic disorders. Odds ratio analysis, as employed in traditional GWAS, helps ascertain disease-variant associations by comparing how frequently variants occur in case and control groups. These variants can act either protectively or as instigators of disease. To make this determination, researchers can employ polygenic risk score predictions by training risk models on pools of variants highlighted in specific case–control studies [14, 15]. Alternatively, variant annotation using in-silico approaches like GEMINI [16] provides information for each variant found in a study’s samples through several genomic databases (ENCODE [17], UCSC [18], OMIM [19], dbSNP [20], KEGG [21], and HPRD [22]) and reports on population frequency (e.g., from ExAC [23] and 1000GP [24]) and on the protein-level impact of amino acid changes caused by these variants (ClinVar [25], COSMIC [26], CADD [27], Polyphen [28] and SIFT [29]).

Current WHO criteria for classification and grading of ASD provide a valuable but simplified depiction of the true nature of the disorder. Moreover, it is often difficult to predict clinical outcome using the current grading scheme. The aim of this study is to elucidate, through clinical assessment and bioinformatics, the differences in the genetic background of different phenotypical manifestations of ASD. More specifically, it aims at investigating whether there are specific genes that can account for differences in the clinical profiles of children with ASD at the linguistic and cognitive level by reporting on the analyses of a new autistic patient WES dataset (n = 33). We first extracted the sequenced genotypes (WES) of blood samples of school-aged (6–12-year-old) children with ASD. We then conducted clinical assessment by administering standardized tests of non-verbal IQ, memory, attention and oral language skills and, based on these assessments, separated the children into mild and severe phenotypes in each of these cognitive and linguistic categories. The next step was to search several genomic databases in order to identify common high-risk variants in the sample dataset that were previously reported in the literature, and possibly also de novo variants not previously reported. Finally, we used a linear regression polygenic risk score machine learning algorithm to obtain biologically significant genes with the potential to aid in the grading of autistic samples based on their sequenced genotypes, derive specific molecular signatures from severe and non-severe subtypes of autistic samples and assess whether these molecular signatures outline functional subclasses. At this stage we should stress that, given the limited number of patients in the dataset used in our study, the results require additional validation through further experimentation. With this in mind, we report eighty-four identified variants which could be assigned to specific functional categories related to ASD and intellectual disability, as well as other disorders. Classification of our samples using these variants was in agreement with the clinical classification for our dataset with 81.81% prediction accuracy. The six samples that showed a differential molecular diagnosis were further assessed using clinical information in order to substantiate the classification provided by our risk model.

Methods

Participants

Thirty-three children with ASD who attended both mainstream and special education schools were recruited from private speech therapy centers. Only those children whose parents gave written permission to participate in the research were included in the study. All children were diagnosed with ASD by public hospitals and public medical-pedagogical centers according to the ICD-10 (https://apps.who.int/iris/handle/10665/37958) and DSM-V [1] official criteria. Children were initially divided into two groups based on their non-verbal IQ. The first included 18 children (average age: 9.5 years) with typical non-verbal IQ (> 80 in Raven Progressive Matrices, RPM) (ASD_MH group) and the second included 15 children (average age: 8.5 years) with low non-verbal IQ (< 60 in RPM) (ASD_L group). We then divided them based on whether they were verbal (acquired spoken language) or minimally verbal (absence of spoken language) children with ASD. Children were classed as minimally verbal if they scored 0 in two language tasks that required spoken language (see the expressive vocabulary and narration tasks below). There were 19 verbal and 14 minimally verbal children. Moreover, we divided them into two groups (severe and mild) based on their attention and memory skills. Regarding the attention skills, the criterion for a child to fall under the severe phenotype was performance under the 10th percentile in both auditory and visual attention tasks, and performance at or above the 10th percentile for the mild phenotype. There were 9 children in the mild and 24 in the severe phenotype concerning attention skills. Regarding the memory skills, the criterion for a child to fall under the severe phenotype was performance under the 10th percentile in both auditory and visual memory tasks. There were 18 children in the mild and 15 in the severe phenotype.

Participants were assessed at their school individually in one or two sessions of a total duration of 45 min. Moreover, blood samples were obtained by experienced microbiologists in microbiology laboratories and were then sent to a genetic lab for Whole Exome Sequencing analysis.

Clinical phenotype assessment

In this study, children were assessed with cognitive as well as language tasks. Standardized tests for Greek were employed. More specifically, our assessment materials included:

Cognitive measures

Non-verbal IQ. Non-verbal IQ was assessed with the Greek version of Raven Standard Progressive Matrices [30, 31]. Both standard scores and percentiles were taken into consideration.

Auditory and visual attention. Auditory and visual attention were assessed using three subtests of the Test for the Assessment of Attention and Concentration [32]: (i) Sustained auditory attention, (ii) Sustained visual attention and (iii) Range of visual attention. The Total Attention Score of all three auditory and visual attention subtests was also calculated.

Verbal short-term memory, visual and auditory memory. There were seven measures in total: VSTM Sentence recall [32], VSTM word recall [33], Immediate visual memory, Delayed visual memory, Visual information recall, Information retention factor (Story Recall subtest of the Memory Test; see Narration below) and Recognition.

Language measures

Expressive vocabulary. Expressive vocabulary was assessed using the Greek version of the Crichton Vocabulary Scales [31]. The test contains 80 words to be defined, presented orally and arranged in order of increasing difficulty (interruption criterion: four consecutive errors). Only one child from the ASD_L group was able to provide a few definitions, so the Picture Naming and Comprehension Subscale was administered to all the other children in the ASD_L group.

Picture Comprehension. It was administered only to the ASD_L group, because all but one were minimally verbal. Receptive vocabulary was assessed using Picture Comprehension Subscale (Detection of Speech and Language Disorders Test Preschool, [DSLD Test] [34]), in which the child was asked to point to the picture (among 4) that corresponded to the word presented orally by the examiner.

Narration. Narration was assessed using the Story Recall subtest of the Memory Test [33]. The child listened to two short stories and repeated them back immediately after the examiner and again after a short break (scoring: total number of elements and total number of sections s/he remembered correctly).

Sequencing, mapping, alignment and variant calling

The exome enrichment library was prepared with the Agilent SureSelectXT Human All Exon V6 kit as per the manufacturer’s instructions. Read files (Fastq) were generated from the sequencing platform (Illumina Hiseq). The samples were sequenced in paired-end, 2 × 100 bp mode and deep coverage was obtained with approx. 6–7 Gb per sample (approx. 100 × av. coverage). Quality assessment and trimming were performed using the FastQC (version 0.11.7) and FASTX (version 0.0.14) toolkits, respectively. The Burrows-Wheeler Aligner (BWA) [35], version 0.7.15, was used to map the raw reads to the human genome (build hg19/b37). Duplicate reads, which are likely to be the result of PCR bias, were marked using Picard (http://broadinstitute.github.io/picard/), version 2.6.0. Samtools [36], version 0.1.19, was used for additional BAM/SAM file manipulations. The Genome Analysis Toolkit (GATK) [37], version 3.6.0, HaplotypeCaller method was used for single-nucleotide polymorphism (SNP) and insertion/deletion (indel) variant calling.

Variant annotation

Variants were annotated with gene functional data from Ensembl version 90 using the Variant Effect Predictor (VEP) tool, version 90.6. Known variants were labeled using dbSNP (Release 147), allowing for rapid identification of novel variants. Additional exploration of the results was performed using GEMINI, version 0.20.0, which provides a framework for analyzing, filtering and exploring genomic variation.

Odds ratio analysis

PLINK [38] was used to perform odds ratio analysis in order to obtain variants with high disease association. PLINK allows for the detection of variants that occur more frequently in severe than in non-severe cases, which are labeled as high-risk or pro-severe variants. In contrast, variants that occur more frequently in non-severe than in severe cases are labeled as protective or pro-non-severe variants.
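The labeling rule described above can be illustrated with a minimal Python sketch. It is not PLINK's implementation: the function names, the continuity correction for empty cells and the allele counts in the usage example are all assumptions made for illustration.

```python
import math

def odds_ratio(severe_alt, severe_ref, mild_alt, mild_ref, correction=0.5):
    """Allelic odds ratio with an approximate 95% confidence interval.

    Arguments are allele counts from a 2x2 table (severe vs. non-severe,
    alternate vs. reference allele). The continuity correction is an
    assumption of this sketch and is applied only if a cell is zero.
    """
    cells = [severe_alt, severe_ref, mild_alt, mild_ref]
    if any(c == 0 for c in cells):
        cells = [c + correction for c in cells]
    a, b, c, d = cells
    estimate = (a * d) / (b * c)
    # Standard error of log(OR) from the usual large-sample formula.
    se = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)
    lower = math.exp(math.log(estimate) - 1.96 * se)
    upper = math.exp(math.log(estimate) + 1.96 * se)
    return estimate, lower, upper

def label_variant(estimate):
    """Map an odds ratio onto the labels used in the text."""
    if estimate > 1:
        return "pro-severe"
    if estimate < 1:
        return "pro-non-severe"
    return "neutral"

# Hypothetical counts: 15 severe children give 30 alleles, 18 non-severe give 36.
estimate, lower, upper = odds_ratio(12, 18, 4, 32)
print(round(estimate, 3), label_variant(estimate))
```

With so few samples the confidence interval around any single odds ratio is wide, which is one reason the ranking of variants is best treated as a feature-selection step rather than as evidence of association on its own.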

Risk score prediction using linear logistic regression analysis and risk model construction

Linear logistic regression fitting was performed using the PredictABEL [39] package (available in R). Specifically, risk models were constructed using the fitLogRegModel function, and the predRisk function was used to assess their performance and predict risks. Additional functions provided by the package were used to compute the various measures of model performance. In genetic risk prediction studies these commonly include: (i) plotting receiver operating characteristic (ROC) curves and calculating area under the curve (AUC) values using the plotROC function and (ii) constructing the reclassification table and calculating the net reclassification improvement (NRI) and integrated discrimination improvement (IDI) using the reclassification function. The NRI and IDI are important comparative measures that provide an assessment of how well a new model reclassifies the data [40]. Graphical representations of the results were obtained using the plotRiskDistribution function for plotting risk distributions, the plotDiscriminationBox function for plotting discrimination box plots and the plotPredictivenessCurve function for plotting predictiveness curves. Better model performance was achieved by substituting the glm (generalized linear model) function utilized by PredictABEL with the bayesglm function available from the arm R package [41].
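To make the AUC measure concrete, the following Python sketch computes it directly from predicted risks, using its interpretation as the probability that a randomly chosen severe sample receives a higher predicted risk than a randomly chosen non-severe one. It is a stand-in written for illustration, not the PredictABEL code, and the risks and labels in the example are invented.

```python
def roc_auc(risks, labels):
    """Area under the ROC curve by pairwise comparison.

    risks  : predicted risks, one per sample
    labels : 1 for severe, 0 for non-severe
    Ties between a severe and a non-severe sample count as half a win.
    """
    severe = [r for r, y in zip(risks, labels) if y == 1]
    mild = [r for r, y in zip(risks, labels) if y == 0]
    if not severe or not mild:
        raise ValueError("both classes must be present")
    wins = 0.0
    for s in severe:
        for m in mild:
            if s > m:
                wins += 1.0
            elif s == m:
                wins += 0.5
    return wins / (len(severe) * len(mild))

# Hypothetical predicted risks for three severe and three non-severe samples.
risks = [0.90, 0.80, 0.35, 0.60, 0.20, 0.10]
labels = [1, 1, 1, 0, 0, 0]
print(roc_auc(risks, labels))
```

An AUC of 0.5 corresponds to a model that ranks samples no better than chance, and 1.0 to a model that ranks every severe sample above every non-severe one.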

Machine learning data analysis

The samples obtained were separated into two classes based on their non-verbal IQ, representing severe autism (n = 15) and non-severe autism (n = 18), and used as input for training the linear regression model classifier described above. Classifier performance was assessed using a leave-one-out cross-validation (LOOCV) procedure. During each round of cross-validation, one sample was removed in turn, feature selection was performed on the remaining samples in the dataset, and the model was then trained and used to classify the left-out sample. PLINK was used for the feature (variant) selection. To find the optimal set of variants, the leave-one-out cross-validation was first run with the top six variants, and the number of variants was then increased sequentially for each run until classification accuracy reached a saturation point with no further improvement. LOOCV performance was assessed using prediction accuracy, sensitivity, specificity, and the Matthews correlation coefficient (MCC). MCC is a balanced measure of classification quality which takes into account true and false positives and negatives, and it returns values within the range [− 1, 1]. The flowchart of the classification procedure is shown in Fig. 1.
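The procedure above can be sketched in Python as follows. This is a generic outline rather than the authors' pipeline: select_features, train and predict are hypothetical callbacks standing in for the PLINK ranking and the regression model. The usage example takes the class sizes from the text (15 severe, 18 non-severe) and the six discordant samples mentioned in the Background, but how those six split between the two classes is not stated here, so the 3 and 3 split is an assumption.

```python
import math

def loocv_predictions(samples, labels, select_features, train, predict):
    """Leave-one-out CV with feature selection repeated inside every fold.

    samples and labels are lists. The left-out sample is never seen by
    select_features or train, which avoids selection bias.
    """
    predicted = []
    for i in range(len(samples)):
        train_x = samples[:i] + samples[i + 1:]
        train_y = labels[:i] + labels[i + 1:]
        features = select_features(train_x, train_y)
        model = train(train_x, train_y, features)
        predicted.append(predict(model, samples[i], features))
    return predicted

def classification_metrics(true, predicted, positive=1):
    """Accuracy, sensitivity, specificity and MCC from two label lists."""
    pairs = list(zip(true, predicted))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    tn = sum(1 for t, p in pairs if t != positive and p != positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return {
        "accuracy": (tp + tn) / len(pairs),
        "sensitivity": tp / (tp + fn) if tp + fn else float("nan"),
        "specificity": tn / (tn + fp) if tn + fp else float("nan"),
        # MCC is conventionally set to 0 when a margin of the table is empty.
        "mcc": (tp * tn - fp * fn) / denom if denom else 0.0,
    }

# 15 severe (1) and 18 non-severe (0) samples; assumed 3 errors in each class.
true = [1] * 15 + [0] * 18
predicted = [1] * 12 + [0] * 3 + [0] * 15 + [1] * 3
print(classification_metrics(true, predicted))
```

Under this assumed split, 27 of 33 samples are classified correctly, which corresponds to the reported accuracy; the sensitivity, specificity and MCC printed here depend on the assumed split and are not results from the study.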

Literature-based genomics approach

To complement our de novo classification approach, which predicts which of the two main autism classes under investigation (mild or severe) a sample belongs to, we utilized a novel pipeline for the identification of genes which characterize the IQ, verbal, memory and attention measurements based on current scientific knowledge. This pipeline consists of several distinct steps:

Creating a subgroup of the variants in each of our samples which exhibit homozygosity to the alternate allele of our reference.

From the previous subgroup, discarding any variant which is not flagged both in the SIFT [29] database as “deleterious” or “deleterious_low_confidence” and in the Polyphen [28] database as “possibly_damaging” or “probably_damaging”. The genes carrying the remaining variants were deemed “important” (IGs).

Keeping only the above variants for each sample, dividing the samples into sub-phenotype groups based on our original phenotypical assessment measurements.

Running all samples of a sub-phenotypical group through R’s SuperExactTest [42] package to identify genes common among at least n-2 samples (where n is the total number of samples in each category).

Finally, comparing the genes in each sample grouping via VENNY [43] to identify genes which characterize a group but also genes which are common among them and describe a more generic autism genetic signature. This approach is visualized in Fig. 2.
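The filtering and intersection steps of this pipeline can be approximated with a short Python sketch. The dictionary keys (genotype, sift, polyphen, gene) and the function names are hypothetical, and the counting step is only a simplified stand-in for SuperExactTest, which additionally evaluates the statistical significance of multi-set intersections.

```python
from collections import Counter

SIFT_DAMAGING = {"deleterious", "deleterious_low_confidence"}
POLYPHEN_DAMAGING = {"possibly_damaging", "probably_damaging"}

def important_genes(variants):
    """Genes with at least one homozygous-alternate variant that is
    flagged as damaging by both SIFT and Polyphen.

    Each variant is a dict with the hypothetical keys
    'genotype', 'sift', 'polyphen' and 'gene'.
    """
    return {
        v["gene"]
        for v in variants
        if v["genotype"] == "hom_alt"
        and v["sift"] in SIFT_DAMAGING
        and v["polyphen"] in POLYPHEN_DAMAGING
    }

def genes_shared_by_group(sample_gene_sets, max_missing=2):
    """Genes present in at least n - max_missing of the n samples."""
    n = len(sample_gene_sets)
    counts = Counter(g for genes in sample_gene_sets for g in set(genes))
    return {g for g, c in counts.items() if c >= n - max_missing}

# Placeholder gene names; one gene set per sample in a sub-phenotype group.
group = [{"GENE_A", "GENE_B"}, {"GENE_A"}, {"GENE_A", "GENE_C"}, {"GENE_B"}]
print(sorted(genes_shared_by_group(group)))
```

Once each sub-phenotype group has its shared gene set, the final comparison step reduces to ordinary set operations: intersections give genes common to several groups and differences give genes specific to one group.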

Validation gene dataset

To validate our results against genes already known in the literature to be connected to autism, we created a gene dataset by extracting all autism-related genes from 5 databases (AutismKB [44], SFARI [45], HuVarBase [46], DisGeNET [47] and OpenTargets [48]), on which, using Venn diagrams, we superimposed the results of our 2 approaches. This also allowed us to identify genes that were found de novo by our study as being implicated in autism.
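The comparison behind those Venn diagrams amounts to a union followed by an intersection and a difference, as in the Python sketch below. The function name and the placeholder gene identifiers are assumptions; real inputs would be the gene lists exported from each database.

```python
def split_known_and_novel(candidate_genes, database_gene_sets):
    """Split candidate genes into those already listed in any reference
    database and those absent from all of them.

    database_gene_sets maps a database name to its set of gene symbols.
    """
    known_universe = set().union(*database_gene_sets.values())
    candidates = set(candidate_genes)
    return candidates & known_universe, candidates - known_universe

# Placeholder contents; not actual database entries.
databases = {
    "AutismKB": {"GENE_A", "GENE_B"},
    "SFARI": {"GENE_B", "GENE_C"},
}
known, novel = split_known_and_novel({"GENE_A", "GENE_C", "GENE_X"}, databases)
print(sorted(known), sorted(novel))
```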

Functional analysis

To better understand the pathways and mechanisms involved in Severe-Mild autism classification and sub-phenotyping, we performed a variety of functional analyses on our gene data from both approaches. REACTOME [49] was used for the first pass identifying genes involved in various pathways. The genes not found in REACTOME were manually researched in literature and other sources like GeneCards [50], STRING [51], Uniprot [52], Mammalian phenotype Ontology [53] and Gene Ontology [54] for their functional associations.

Results

The performance of the two groups on the experimental tasks, according to our clinical measures, is presented in Table 3 together with the between-group comparisons. For the different tasks (all except picture comprehension), non-parametric tests (Mann–Whitney test) were conducted to compare the performance of the ASD_MH and ASD_L groups.