An application of slow feature analysis to the genetic sequences of coronaviruses and influenza viruses

Background: Mathematical approaches have been used for decades to probe the structure of DNA sequences, which has led to the development of bioinformatics. In this exploratory work, a novel mathematical method is applied to probe the DNA structure of two related viral families: the coronaviruses and the influenza viruses. The coronaviruses are SARS-CoV-2, SARS-CoV-1, and MERS. The influenza viruses include H1N1-1918, H1N1-2009, H2N2-1957, and H3N2-1968.
Methods: The mathematical method used is slow feature analysis (SFA), a rather new but promising method to delineate complex structure in DNA sequences.
Results: The analysis indicates that the DNA sequences exhibit an elaborate and convoluted structure akin to complex networks. We define a measure of complexity and show that each DNA sequence exhibits a certain degree of complexity within itself, while at the same time there exist complex inter-relationships between the sequences within a family and between the two families. From these relationships, we find evidence, especially for the coronavirus family, that increasing complexity in a sequence is associated with a higher transmission rate but with lower mortality.
Conclusions: The complexity measure defined here may hold promise and could become a useful tool in the prediction of transmission and mortality rates in future new viral strains.
Supplementary Information: The online version contains supplementary material available at 10.1186/s40246-021-00327-2.

delineating its complex structure more effectively. This analysis is rooted, theoretically, in the time-embedding theorems. In this method, a one-dimensional time series is embedded in a multi-dimensional space consisting of the original time series and lagged copies thereof. SFA further uses a nonlinear expansion to map this multidimensional input signal onto an even larger feature space, and then solves a linear problem to find a linear combination of feature-space variables that minimizes their time derivative (rate of change) [11]. The objective of SFA is to find the optimally filtered signals that vary as slowly as possible but still carry significant information. To ensure this, the output signals need to be uncorrelated and have unit variance [12]. This approach has been successfully applied in many areas, including climate science [13,14].
In mathematical terms [8], the goal of SFA is, given an n-dimensional input signal $x(t)$, to find a set of real-valued input-output functions $g_j(x)$ such that the output signals $y_j(t) = g_j(x(t))$ minimize
$$\Delta(y_j) := \langle \dot{y}_j^2 \rangle_t$$
under the constraints
$$\langle y_j \rangle_t = 0 \quad \text{(zero mean)},$$
$$\langle y_j^2 \rangle_t = 1 \quad \text{(unit variance)},$$
$$\forall i < j: \quad \langle y_i y_j \rangle_t = 0 \quad \text{(decorrelation and order)},$$
with $\langle \cdot \rangle_t$ and $\dot{y}$ indicating temporal averaging and the derivative of $y$, respectively. The Δ-value is a measure of the temporal slowness of the signal $y(t)$. It is given by the mean square of the signal's time derivative. Small Δ-values correspond to slowly varying signals. The first two constraints avoid the trivial constant solution, while the last constraint guarantees that the output functions $g_j$ are distinct and hence extract different information from the input signal.
For a tutorial on this method, the reader could consult reference [8] or a more recent presentation in [15]. In that tutorial, a simple example of a two-dimensional input signal $x_1(t) = \sin(t) + \cos^2(11t)$ and $x_2(t) = \cos(11t)$ is considered. Both components are quickly varying, but hidden in the signal is the slowly varying "feature" $y(t) = x_1(t) - x_2^2(t) = \sin(t)$, which can be extracted with a polynomial of degree two, namely $h(x) = x_1 - x_2^2$. In the situation with one observable (time series of some variable) from an unknown system where the actual state space is not known (as is the case here), embedding is necessary (and essential) to delineate the underlying dynamics, much like in attractor reconstructions.
The SFA algorithm can be summarized as follows. Consider a time series $\{x(t)\}_{t = t_1, \ldots, t_n}$, where $t$ denotes time and $n$ indicates the length of the time series. First, we embed $\{x(t)\}$ into an m-dimensional state space using time-delayed copies of $x(t)$:
$$X(t) = \{x_1(t), x_2(t), \ldots, x_m(t)\}_{t = t_1, \ldots, t_N},$$
where $x_1(t) = x(t)$, $x_2(t) = x(t + \tau)$, and so on, $\tau$ is the delay, and $N = n - m + 1$.
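The embedding step described above can be sketched in a few lines of Python. This is only an illustrative sketch, not the authors' code: the function name `delay_embed` and the use of NumPy are assumptions, and the row count is written as n − (m − 1)τ, which reduces to the N = n − m + 1 quoted in the text when τ = 1.

```python
import numpy as np

def delay_embed(x, m, tau=1):
    """Return an (N, m) matrix whose j-th column is x shifted forward by j*tau samples.

    Hypothetical helper for illustration; N = len(x) - (m - 1) * tau.
    """
    x = np.asarray(x, dtype=float)
    n_rows = len(x) - (m - 1) * tau  # equals n - m + 1 when tau = 1
    if n_rows <= 0:
        raise ValueError("time series too short for this m and tau")
    # Column j holds x(t + j*tau) for the first n_rows time points.
    return np.column_stack([x[j * tau : j * tau + n_rows] for j in range(m)])

# Toy usage: a ramp of length 10 embedded with m = 3, tau = 1.
X = delay_embed(np.arange(10), m=3, tau=1)
print(X.shape)  # (8, 3)
print(X[0])     # [0. 1. 2.]
```

With the settings used later in the paper (m = 15, τ = 1), a sequence of n bases would give a matrix with n − 14 rows and 15 columns.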
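Then, nonlinear expansions (usually second-order polynomials) are used to generate a k-dimensional function state space:
$$H(t) = \{x_1(t), \ldots, x_m(t), x_1^2(t), x_1(t) x_2(t), \ldots, x_m^2(t)\}_{t = t_1, \ldots, t_N},$$
which can also be written as
$$H(t) = \{h_1(t), h_2(t), \ldots, h_k(t)\}_{t = t_1, \ldots, t_N}.$$
The expanded signal $H(t)$ is then centered and normalized to zero mean and unit variance. This process is referred to as whitening or sphering. Thus, we have
$$H'(t) = \{h'_1(t), h'_2(t), \ldots, h'_k(t)\}_{t = t_1, \ldots, t_N}, \qquad \langle h'_j \rangle_t = 0, \quad \langle {h'_j}^2 \rangle_t = 1.$$
Using the Schmidt algorithm, $H'(t)$ is orthogonalized into
$$Z(t) = \{z_1(t), z_2(t), \ldots, z_k(t)\}_{t = t_1, \ldots, t_N},$$
where the transformed signal matrix $Z$ is column orthogonal:
$$Z^T Z = I.$$
The final step of SFA is to find the set of coefficients $(a_1, a_2, \ldots, a_k)$ such that the time series
$$y(t) = a_1 z_1(t) + a_2 z_2(t) + \cdots + a_k z_k(t)$$
varies as slowly as possible. This set is given by the eigenvector $W_1$ of the time-derivative covariance matrix $B = \dot{Z}^T \dot{Z}$ corresponding to the smallest eigenvalue $\lambda_1$. Here $\dot{Z}$ denotes the time derivative of $Z$. Using $W_1$, the optimally filtered slow-feature signal (also known as a driving force factor, which can be composed of one or more components) can be written as:
$$y(t) = r \, W_1 \cdot Z(t) + c, \qquad (1)$$
where $r$ and $c$ are constants derived to best match $y(t)$ and the original time series $x(t)$.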
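The expansion, whitening, and eigenvalue steps can be checked numerically on the two-dimensional tutorial signal quoted earlier, whose slow feature is sin(t). The sketch below is a minimal illustration under stated assumptions, not the authors' implementation: the function names are hypothetical, whitening is done by an eigendecomposition of the covariance matrix rather than by the Schmidt algorithm, the derivative is approximated by a first difference, and the final rescaling by r and c is omitted.

```python
import numpy as np

def quadratic_expand(X):
    """Append all degree-2 monomials x_i * x_j (i <= j) to the columns of X."""
    m = X.shape[1]
    quad = [X[:, i] * X[:, j] for i in range(m) for j in range(i, m)]
    return np.column_stack([X] + quad)

def slowest_feature(X, tol=1e-10):
    """Return the slowest unit-variance output of quadratic SFA and its Delta-value."""
    H = quadratic_expand(np.asarray(X, dtype=float))
    H = H - H.mean(axis=0)  # center each expanded component
    # Whitening via eigendecomposition of the covariance matrix
    # (assumption: used here in place of Schmidt orthogonalization).
    evals, evecs = np.linalg.eigh(H.T @ H / len(H))
    keep = evals > tol * evals.max()  # drop numerically degenerate directions
    Z = H @ evecs[:, keep] / np.sqrt(evals[keep])
    # First difference as a stand-in for the time derivative.
    Zdot = np.diff(Z, axis=0)
    B = Zdot.T @ Zdot / len(Zdot)
    lam, W = np.linalg.eigh(B)  # eigenvalues in ascending order
    return Z @ W[:, 0], lam[0]

# Tutorial signal: both inputs vary quickly, but x1 - x2**2 = sin(t) varies slowly.
t = np.linspace(0, 2 * np.pi, 5000)
x1 = np.sin(t) + np.cos(11 * t) ** 2
x2 = np.cos(11 * t)
y, delta = slowest_feature(np.column_stack([x1, x2]))
corr = np.corrcoef(y, np.sin(t))[0, 1]
print(abs(corr))  # expected to be very close to 1 (the sign of y is arbitrary)
```

For a DNA-derived series, the input to `slowest_feature` would be the time-delay embedding matrix rather than a two-column signal; with m = 15 the quadratic expansion has 15 + 15·16/2 = 135 components.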
Once the optimally filtered (low-frequency) SFA signal has been identified, its significant periodicities can be found from the time-averaged wavelet power spectrum. Wavelet analysis has been widely used to analyze localized structures and spectral properties of time series. For example, [16] provides a detailed description of wavelet analysis, along with a toolkit for conducting it step by step, including a statistical significance test based on red-noise surrogate data (see http://paos.colorado.edu/research/wavelets/). Here we used the Morlet wavelet with the wavenumber set to 4 to match the smoothness of the SFA-derived slow-feature signal, focusing on the spectral peaks that are statistically significant at the 5% level. Note also that SFA is applicable to non-stationary data, so no data preprocessing is required.
The combination of the SFA and wavelet analyses used in the present study has been shown to be more effective in diagnosing low-frequency periodicities in data sets of limited length than direct spectral analysis methods. Note that the driving force may consist not of just one component but of several, which, as we will see below, correspond to forcings or signals at certain time scales. The success of SFA in delineating these slow signals lies in the fact that embedding the time series in high enough dimensions and the subsequent dynamical procedure remove the noise and small-scale features that may obscure or suppress those slow signals, thereby delineating more accurately the complex structure of a sequence.
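As a rough illustration of the time-averaged wavelet power spectrum, the sketch below computes a Morlet transform in Fourier space with wavenumber 4, using the standard Morlet normalization and scale-to-period conversion. It is an assumption-laden sketch rather than the toolkit of [16]: the function name, the period grid, and the lack of zero padding are choices made here, and the red-noise significance test is not included.

```python
import numpy as np

def morlet_global_spectrum(x, omega0=4.0, dt=1.0, min_period=4.0, dj=0.1):
    """Time-averaged Morlet wavelet power on a geometric grid of Fourier periods."""
    x = np.asarray(x, dtype=float)
    x = x - x.mean()
    n = len(x)
    # For the Morlet wavelet, Fourier period = fourier_factor * scale.
    fourier_factor = 4 * np.pi / (omega0 + np.sqrt(2 + omega0 ** 2))
    n_oct = np.log2(n * dt / (2 * min_period))  # largest period is about n*dt/2
    periods = min_period * 2 ** (dj * np.arange(int(n_oct / dj) + 1))
    scales = periods / fourier_factor
    omega = 2 * np.pi * np.fft.fftfreq(n, d=dt)  # angular frequencies
    x_hat = np.fft.fft(x)
    power = np.empty(len(scales))
    for i, s in enumerate(scales):
        # Normalized Morlet daughter wavelet in Fourier space (positive frequencies only).
        psi_hat = (np.sqrt(2 * np.pi * s / dt) * np.pi ** -0.25
                   * np.exp(-0.5 * (s * omega - omega0) ** 2) * (omega > 0))
        coeffs = np.fft.ifft(x_hat * psi_hat)
        power[i] = np.mean(np.abs(coeffs) ** 2)  # average over time
    return periods, power

# Toy usage: a pure oscillation with period 64 samples.
t = np.arange(2048)
periods, power = morlet_global_spectrum(np.sin(2 * np.pi * t / 64.0))
peak_period = periods[np.argmax(power)]
print(round(float(peak_period), 1))  # expected to lie close to 64
```

A low wavenumber such as 4 gives a broader spectral peak than the more common value of 6, which trades frequency resolution for smoothness.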

Analysis and results
We first analyzed the DNA sequences of three viruses from the same family: SARS-CoV-2, SARS-CoV-1, and MERS. These sequences are approximately 30,450 bases long and belong to the now world-infamous coronavirus family. Since a nucleotide sequence is a string of the bases A, T (U in RNA), C, and G, we first transformed it to a time series of integers in the interval [1, 4]. Here, we need to stress that a time series represents a particular type of process, where some quantity is sampled in time, t. A DNA sequence is a very similar object, but the "sampling" is over space. In a time series, we are interested in the dependency of observations at different time scales, whereas in DNA sequences, we are interested in dependencies at different space scales. As such, the mathematical tools to identify structures in time can in principle be applied to identify structure in space, as long as t is thought of as a parameter identifying the scale. Transforming a DNA sequence into a time series has been used in the past to identify interesting properties in DNA sequences (such as the well-known period 3; see [4,5] and references therein). Note also that the above transformation of A➔1, T/U➔2, C➔3, and G➔4 may, depending on the frequency distribution of A, T/U, C, and G in the sequence, result in a nonstationary time series. However, unlike other spectral methods, SFA is not affected by nonstationarity in the data.
Once we have a time series, we apply SFA, and once we have the SFA signal (which, as mentioned above, may comprise several components; see Eq. 1), we extract the SFA components by wavelet analysis. Figure 1 shows the SFA signal for the SARS-CoV-2 virus for m=15 and τ=1. As explained above, this signal is normalized to zero mean and unit variance. Figure 2 shows the wavelet transform of the time series in Fig. 1. In order to extract the peak "periods" of the driving force signal, we used the Morlet wavelet to compute the time-averaged power spectrum of the wavelet transform [16]. The black solid line in Fig. 3 is the time-averaged power spectrum of the wavelet transform of the driving force, and the dashed line represents the 95% confidence level, estimated using AR-1 surrogate data [16]. The dots show the periods of the oscillatory components of the driving force that are significant above the 95% level.
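The base-to-integer transformation described above is simple enough to sketch directly. The snippet below is illustrative only: the function name `encode_sequence` is hypothetical, and rejecting any symbol other than A, T, U, C, and G is a choice made here, since the text does not say how ambiguous symbols are handled.

```python
import numpy as np

# Mapping stated in the text: A -> 1, T/U -> 2, C -> 3, G -> 4.
BASE_TO_INT = {"A": 1, "T": 2, "U": 2, "C": 3, "G": 4}

def encode_sequence(seq):
    """Map a nucleotide string to a numeric series using the mapping above."""
    seq = seq.upper()
    unknown = set(seq) - set(BASE_TO_INT)
    if unknown:
        # Assumption: ambiguous symbols (e.g. N) are rejected rather than imputed.
        raise ValueError(f"unsupported symbols: {sorted(unknown)}")
    return np.array([BASE_TO_INT[base] for base in seq], dtype=float)

print(encode_sequence("ATTAAAGGUC"))  # [1. 2. 2. 1. 1. 1. 4. 4. 2. 3.]
```

The resulting array plays the role of the time series x(t), with position along the sequence standing in for time.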
Fig. 1 The SFA signal of the DNA sequence of SARS-CoV-2. Note the oscillatory components at many scales
Fig. 3 The time-averaged power spectrum of the wavelet transform extracted from Fig. 2. The dashed line represents the 95% confidence level. The dots show the periods of the oscillatory components of the driving force that are significant above the 95% level
Table 1 Ratios between the peaks in (2)
The significant peak periodicities for SARS-CoV-2 are listed in Eq. (2). Given these periodicities, we next construct Table 1, which shows the ratios between the peaks. We observe exact relations between peak periods (Eq. 3) and almost exact relationships (Eq. 4), the latter based on the criterion $|P - \text{nearest integer}| / \text{nearest integer} < 0.25\%$. Keeping only those relationships, we are left with Table 2, which can be thought of as portraying the degree of structure or complexity in the SARS-CoV-2 sequence. We observe in the exact relationships multiples of a power of 2 and in the almost exact relationships multiples of 19 (152 = 2×76 = 8×19) and 83. Clearly, a sophisticated and rather convoluted structure, with numerous processes embedded in the sequence, is present. Keep in mind that the factors 19 and 83 (odd numbers) will appear in the rest of the sequences studied here. We define the number of entries above the diagonal in Table 2 as the degree of complexity, C. In this case, C=13.
In the supplementary material, Tables ST3 and ST4 are similar to Tables 1 and 2 but for MERS. According to Figure S3, the peak periods for SARS-CoV-1 give a set of exact relationships and three almost exact ones. Again here, we observe in the exact relationships multiples of a power of 2, and the almost exact relationships hold between $P_7$ and $P_1$, $P_7$ and $P_2$, and between $P_5$ and $P_2$. Note again the multiples of 19 and 83. Note also that from the almost exact relationships, it follows that $P_7/P_5 = 4$, which is one of the exact relationships. Here, the degree of complexity is C=8.
According to Figure S6, the peak periods for MERS give, as in (3) and (4), a set of exact relationships and three almost exact relationships. $P_9$, $P_8$, and $P_6$ are multiples (again by a power of 2) of $P_1$, $P_2$, and $P_4$ (ordered in a bottom-top "symmetric" way), while $P_6$, $P_7$, and $P_8$ are multiples of $P_2$, $P_3$, and $P_1$, with factors that are not powers of 2 but, again, multiples of 19 and 83 (it is interesting to note that the odd multiples of 19 and 83 appear in all three sequences). Here, the degree of complexity is C=6.
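The counting behind the degree of complexity C can be sketched as follows. This is a hedged illustration, not the authors' procedure verbatim: it assumes that P in the criterion is the ratio of a larger to a smaller peak period, that ratios rounding to 1 are ignored, and that "exact" means equal to an integer up to floating-point tolerance. The periods in the example are hypothetical and are not those reported for any virus.

```python
from itertools import combinations

def classify_ratio(p_small, p_large, tol=0.0025, exact_eps=1e-9):
    """Label the ratio p_large / p_small as 'exact', 'almost exact', or None."""
    ratio = p_large / p_small
    nearest = round(ratio)
    if nearest < 2:  # assumption: a ratio near 1 is not counted as a relationship
        return None
    rel_dev = abs(ratio - nearest) / nearest
    if rel_dev <= exact_eps:
        return "exact"
    if rel_dev < tol:  # the 0.25% criterion quoted in the text
        return "almost exact"
    return None

def degree_of_complexity(periods):
    """Count retained relationships over all pairs (entries above the diagonal)."""
    relations = {}
    for a, b in combinations(sorted(periods), 2):
        kind = classify_ratio(a, b)
        if kind is not None:
            relations[(a, b)] = kind
    return len(relations), relations

# Hypothetical peak periods, for illustration only.
hypothetical_periods = [19.0, 38.0, 50.0, 76.0, 152.1]
C, relations = degree_of_complexity(hypothetical_periods)
print(C)  # 6
```

In this toy case three pairs are exact (ratios 2, 4, and 2) and three are almost exact (ratios within 0.07% of 8, 4, and 2), while every pair involving the period 50 is discarded.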
By comparing Tables 2, ST2, and ST4 (and their associated C), one may argue that there is more embedded complexity and intricate patterning in SARS-CoV-2 than SARS-CoV-1 and MERS.

Other important relationships
Keeping in mind that all three sequences belong to the same coronavirus family, there are similarities and interrelationships between the sequences; examples are easy to observe and are collected in Eq. 9. In general, SFA reveals a consistent picture across these sequences: a very intricate structure with details at many scales, indicating very elaborate and sophisticated embedded processes, with complexity increasing from MERS to SARS-CoV-1 to SARS-CoV-2.

Extension of the analysis to the influenza viruses of H1N1-1918, H1N1-2009, H2N2-1957, and H3N2-1968
In an effort to provide further support for the efficiency and consistency of SFA in the analysis of nucleotide sequences, we consider four other viral sequences from a different viral family, that of the influenza viruses or the Orthomyxoviridae family [17].

Inter-relationships
As in the case of the coronaviruses, the influenza virus sequence analysis also revealed plenty of interrelationships as expected, since the four viruses belong to the same family.
Interestingly, we found that many relationships exist between the two viral families investigated here. Comparing the results in this section with those of the previous section yields the inter-family relationships of Eq. 19. More on this is discussed next.

Discussion
If we consider the peak SFA periodicities from a sequence as nodes of a community, and their relationships as links between the nodes, then a visualization of the results for the SARS-CoV-2 community would look like the top left panel of Fig. 4. Since there are 10 peak periodicities, we have 10 nodes. Then, from Eqs. 3 and 4, we have 13 links between them (recall that C=13), shown in blue. The remaining panels correspond to the other sequences in both families. The red lines give the links between the communities within a family (from Eq. 9 for the coronavirus family and from Eq. 18 for the influenza family). The black lines are the links between the two families (Eq. 19). This picture is a perfect example of complex networks, which are often characterized by a community structure, where in each community the nodes are connected in a certain way (meaning the community obeys its own dynamics), but where there also exist some connections (or interactions) between the communities (see for example [18,19]). We note two interesting observations: (1) the influenza virus family is much more connected (more red links) than the coronavirus family, possibly indicating that the influenza strains are less mutated than the coronavirus strains, and (2) SARS-CoV-1 has no direct links to the influenza family. This result supports our claims that SFA has the potential and efficiency to delineate the complex mathematical structure of genetic sequences and that it could become a useful tool in such analyses. We need to stress here that, given the mathematics behind SFA, while we can make direct comparisons of the complexity measure "C" within a certain family (where the number of bases is more or less the same), we cannot compare complexities based on "C" between different families. This is due to the differences in nucleotide length between viral families.
The coronavirus family sequence length is approximately 30,450 bases, whereas the influenza family sequence length is approximately 13,500 bases. As such, SFA may "see" longer oscillations in the coronavirus family than in the influenza family. Thus, there will be more entries above the diagonal (in tables such as Table 2), and therefore higher complexity, in the coronavirus family.
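The community picture of Fig. 4 discussed above can be represented with a very small data structure: each node is a (virus, peak index) pair, and each link is coloured by whether its endpoints lie in the same sequence (blue), in different sequences of the same family (red), or in different families (black). The sketch below assumes this representation; the function name is hypothetical and the links listed are invented for illustration, not taken from the paper's equations.

```python
from collections import Counter

FAMILY = {
    "SARS-CoV-2": "coronavirus", "SARS-CoV-1": "coronavirus", "MERS": "coronavirus",
    "H1N1-1918": "influenza", "H1N1-2009": "influenza",
    "H2N2-1957": "influenza", "H3N2-1968": "influenza",
}

def link_colour(node_a, node_b):
    """Classify a link by the communities of its endpoints; nodes are (virus, peak index)."""
    virus_a, virus_b = node_a[0], node_b[0]
    if virus_a == virus_b:
        return "blue"   # relationship within one sequence
    if FAMILY[virus_a] == FAMILY[virus_b]:
        return "red"    # relationship between sequences of the same family
    return "black"      # relationship between the two families

# Hypothetical links for illustration only; NOT the relationships reported in the paper.
hypothetical_links = [
    (("SARS-CoV-2", 1), ("SARS-CoV-2", 4)),
    (("SARS-CoV-2", 2), ("SARS-CoV-2", 7)),
    (("SARS-CoV-2", 3), ("MERS", 2)),
    (("H1N1-1918", 1), ("H1N1-2009", 1)),
    (("MERS", 5), ("H3N2-1968", 2)),
]
counts = Counter(link_colour(a, b) for a, b in hypothetical_links)
print(counts["blue"], counts["red"], counts["black"])  # 2 2 1
```

Under this representation, the degree of complexity C of a sequence is simply the number of blue links whose endpoints both belong to that sequence.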
Finally, it is interesting to note that the complexity measure "C" in the case of the coronavirus family relates to mortality and severity of symptoms as well as to the rate of transmission. As "C" increases, the transmission rate to humans increases, but the mortality rate decreases. It is reported that the symptoms of SARS-CoV-2 are milder than those of SARS-CoV-1 and MERS; however, its human-to-human transmission rate is greater than that of the other family members. The mortality rate of SARS-CoV-2 is lower (3.4%) than that of SARS-CoV-1 (9.6%) and MERS (35%) [20]. This relationship is not as clear, however, in the case of the influenza virus family. Unfortunately, in this case, the outbreaks span over a century, and the actual numbers are skewed by several factors such as deaths by secondary infection (due to the unavailability of antibiotics), hygiene, lack of experience and lack of proper healthcare, especially in the early outbreaks, and other problems. For example, H1N1-1918 (C=6) infected 30% of the planet's population and H1N1-2009 (C=4) infected 10% of the population. This is consistent with "increasing C ➔ higher infection rate", but it is not consistent with "increasing C ➔ lower mortality rate". H1N1-1918 killed about 8% of the infected, whereas H1N1-2009 killed only 0.0025% of the infected [21–28]. But how can we compare the conditions in 1918 and 2009? To complicate comparisons further, there are hardly any reliable data on infection rates for H2N2 and H3N2. In any case, the complexity measure "C" may hold promise and could become a useful tool in the prediction of transmission and mortality rates in future new viral strains.

Conclusions
In this exploratory work, a relatively recent mathematical method (SFA) is applied to probe the structure of the DNA sequences of two related viral families: the coronaviruses and the influenza viruses. The coronaviruses are SARS-CoV-2, SARS-CoV-1, and MERS. The influenza viruses include H1N1-1918, H1N1-2009, H2N2-1957, and H3N2-1968. The analysis indicates that the DNA sequences exhibit an elaborate and convoluted structure akin to complex networks. We define a measure of complexity and show that each DNA sequence exhibits a certain degree of complexity within itself, while at the same time there exist complex interrelationships between the sequences within a family and between the two families. From these relationships, we find evidence, especially for the coronavirus family, that increasing complexity in a sequence is associated with a higher transmission rate but with lower mortality. As such, the complexity measure defined here may hold promise and could become a useful tool in the prediction of transmission and mortality rates in future new viral strains.

Additional file 1. Supplementary figures and tables
Additional file 2. DNA sequences used in the study
Fig. 4 A complex network visualization of the relationships (connections) between individual nucleotide sequences (blue), between sequences within each individual family (the coronavirus family and the influenza family; red), and between the two families (black) resulting from the SFA. This picture is akin to structures of complex networks where in each community the nodes are connected in a certain way (meaning the community obeys its own dynamics), but where there are also connections between the communities