Prediction of microbial communities for urban metagenomics using neural network approach
Human Genomics volume 13, Article number: 47 (2019)
Abstract
Background
Microbes are closely associated with human health and disease, especially in densely populated cities. Understanding the microbial ecosystem in an urban environment is essential for cities to monitor the transmission of infectious diseases and detect potentially urgent threats. To this end, DNA samples have been collected and analyzed at subway stations in major cities. However, city-scale sampling with fine-grained geospatial resolution is expensive and laborious. In this paper, we introduce MetaMLAnn, a neural-network-based approach to infer microbial communities at unsampled locations given information reflecting different factors, including subway line networks, sampling material types, and microbial composition patterns.
Results
We evaluate the effectiveness of MetaMLAnn on public metagenomics datasets collected from multiple locations in the New York and Boston subway systems. The experimental results suggest that MetaMLAnn consistently performs better than five other conventional classifiers across different taxonomic ranks. At the genus level, MetaMLAnn achieves F1 scores of 0.63 and 0.72 on the New York and Boston datasets, respectively.
Conclusions
By exploiting heterogeneous features, MetaMLAnn captures the hidden interactions between microbial compositions and the urban environment, which enables precise predictions of microbial communities at unmeasured locations.
Background
Metagenomics studies the genomic content obtained from a human body site or an environment with a goal of understanding microbial diversity. The microorganisms in our environment are greatly associated with human health and disease.
Human microbiome studies are already rich enough to uncover the microbial diversity within the human body [1]. Environmental metagenomics, though it has lagged behind in past years, has also become increasingly important due to the growing awareness of its impact on public health, especially in densely populated urban areas [2–8]. Therefore, the effectiveness of a city’s long-term disease surveillance and health management relies heavily on how well we understand and predict metagenomic composition at a fine-grained level.
Much recent research has been devoted to building city-scale metagenomic profiles [9, 10]. For example, Afshinnekoo et al. [9] created a citywide metagenomic profile for New York City by collecting samples from different surfaces across the entire New York subway system. Taxonomic assignments were generated by read alignment, and the relative abundances were computed at the species level. The profile described the pattern of metagenomic communities and revealed how humans interact with new microbes or dangerous pathogens. Another study conducted by Hsu et al. [10] provided a more comprehensive metagenomic profile of the Boston transportation system, which described microbial communities across multiple surface types. However, collecting, sequencing, and analyzing the metagenomics data at every station cost a great deal of money and time. Given that, our study focuses on developing a model to automatically predict the microbial communities for unsampled locations.
It is challenging to predict the microbial communities for unsampled locations. First, the characteristics of microbial communities can vary enormously in a complicated urban system due to factors such as geographical topology and the public transit network. Many recent works have investigated how network connectivity affects the similarity of microbiomes. For example, Leung et al. [2] conducted a Mantel test on the Hong Kong subway (MTR) lines and found that closely connected MTR lines shared more similar microbial communities than pairs that are further apart (R=0.47, P=0.03), probably because of distance-dependent dispersal and transferring commuters. To further evaluate this assumption, we conduct a clustering analysis based on microbial community similarity at different locations. As shown in Fig. 1, different microbes are separated by geographical boundaries.
Second, the formation and transmission of microbial communities are also affected by the material type of the surfaces where the samples are collected [10]. Lastly, within each community, the genetic properties of individual microorganisms and the correlations between them also contribute to the complexity. Considering the mixed effects of these factors, a simple model for each station along the same subway line is unlikely to be sufficient.
To address these challenges, we formulate the prediction of microbial communities at unsampled locations as a multi-label classification (MLC) task. Based on a set of heterogeneous features extracted from the urban environment, we aim to predict the presence or absence of a list of microbes at a nearby location. For MLC, each location is considered as an instance and each label represents a microbe.
Since different class labels have to be predicted simultaneously [11], MLC is suitable for solving the microbe inference problem, with the label dependencies exploited at the same time. These properties reflect the nature of microbial communities.
In the field of urban computing, statistical models like regression trees have been applied to real-time air quality prediction. For example, in U-Air [12], the authors inferred the fine-grained air quality in a city by using a semi-supervised learning approach. The model was able to predict air quality at non-monitored stations based on the air quality data reported by existing monitoring stations. The spatial classifier of their model was based on an artificial neural network (ANN). However, this model only estimated a single value (i.e., the air quality index) for each location, so it is inadequate for the MLC task we formulated.
In the field of metagenomics, several computational models, such as BioMiCo [13] and NMF [14], have been developed to infer microbial community structures. To estimate the composition of each sample given the abundance profile, BioMiCo uses a supervised Bayesian model while NMF leverages matrix factorization.
Nevertheless, these works cannot directly infer the microbial community for unsampled locations in the urban environment due to their inability to incorporate spatial information.
None of the models mentioned above can both address the complicated environmental conditions and handle the intricate relationships between microbial compositions and the urban environment. In our recent work [15], we proposed MetaMLAnn (Metagenomic Multi-Label Artificial neural network), a supervised, neural-network-based learning model to predict the microbial community for city-scale metagenomics. MetaMLAnn is built on the widely used feedforward neural network model. Unlike the conventional feedforward neural network, which predicts each label individually, it leverages an extra shared structure to capture the dependencies among different labels (microbes). To begin with, we train MetaMLAnn using a state-of-the-art network embedding technique to integrate features constructed from different data sources. Next, we leverage manifold regularization to extend our model. By incorporating domain knowledge, our model is robust to sparse samples with limited labeled data. To further improve our model, we also introduce an ensemble model, MetaMLAnn+, which can outperform each individual model by leveraging the diversified information from MetaMLAnn and other classification models with a strong signal. To the best of our knowledge, our work is the first attempt to predict the microbial community for urban metagenomics using a neural network model. In this paper, we extend our previous work by presenting detailed theoretical foundations and additional statistical analyses.
We summarize the contributions of this paper as follows:

This is the first series of in-depth studies of microbial community inference for unsampled locations. The inference task is formulated as a multi-label classification problem, and a neural network learning technique (MetaMLAnn) is proposed to solve it.

We integrate manifold regularization into our framework to guide the training of MetaMLAnn. We provide detailed theoretical foundations showing how the domain knowledge of microbial evolutionary relationships helps.

Important features are extracted from multiple data sources, including city-scale transit features and surface material. An in-depth feature importance study is also provided.

We evaluate MetaMLAnn on the New York and Boston subway metagenomic DNA sequencing data samples. We discuss in detail how MetaMLAnn performs against five baseline methods on the two datasets at different taxonomic levels. We also analyze the importance of using the ensemble model.
Materials and methods
In this section, we present the detailed design of our framework and describe the datasets used in this work.
Preliminaries and problem definition
We start with formalizing the mathematical notations of our model. Table 1 summarizes the symbols we use in this article.
Definition 1
(Microbe Index) Microbe Index is defined as an alphabetically ordered list of microbial names of identified organisms. Each element in the list is a taxonomic name.
Definition 2
(Microbial Distribution Matrix) All samples at different locations are represented as a matrix Y∈R^{n×m}, where n is the number of sampling locations, and m is the total number of microbes in the Microbe Index. Each row Y_{i} represents the microbial distribution vector of location i. Each element Y_{ij} represents whether the j^{th} microbe exists (or its relative abundance meets a threshold γ) in the i^{th} location. More specifically,
\(Y_{ij} = \begin{cases} 1, & \text{if the } j^{th} \text{ microbe exists at location } i \text{ (relative abundance} \geq \gamma\text{)}, \\ 0, & \text{otherwise.} \end{cases}\)
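As an illustration of Definition 2, the following minimal sketch turns a table of relative abundances into a binary presence matrix. The function name `binarize_abundance`, the toy values, and the rule that a microbe must have non-zero abundance of at least γ are assumptions made for this example, not details taken from the paper.

```python
def binarize_abundance(abundance, gamma=0.0):
    """Turn an n x m relative-abundance table into a 0/1 presence matrix Y."""
    # Assumption: a microbe counts as present when its abundance is
    # non-zero and at least the threshold gamma.
    return [[1 if (a > 0 and a >= gamma) else 0 for a in row]
            for row in abundance]

# Two locations (rows) and three microbes (columns), toy values.
abundance = [[0.60, 0.00, 0.40],
             [0.05, 0.95, 0.00]]
Y = binarize_abundance(abundance, gamma=0.1)
```

Here the second location's first microbe falls below the threshold of 0.1 and is therefore treated as absent.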
Definition 3
(Multi-Label Classification) Given \(\mathcal {X} \in R^{n\times k}\), a set of n instances, each being a k-dimensional feature vector, and \(\mathcal {Y} = \{y_{1}, y_{2}, \ldots, y_{m}\}= \{0,1\}^{m}\), a set of labels, where each element is 1 if the label is relevant and 0 otherwise, the classification model learns an estimation function f:R^{k}→2^{m} that assigns a subset of labels to a given instance.
In our microbial community inference case, we extract feature vectors of n samples and represent them as X. The Microbe Index created from known locations is used as Y, where the order of microbes is preserved.
Problem statement. Suppose S=S_{1}∪S_{2}={s_{1},s_{2},…,s_{n}}, where S_{1} and S_{2} are sets of sampled and unsampled locations, respectively. Each sampled location s_{i}∈S_{1} is associated with a microbial distribution vector \(Y_{s_{i}}\). Our goal is to predict \(Y_{s_{j}}\) of each s_{j}∈S_{2}, which is not sampled.
The framework of MetaMLAnn is shown in Fig. 2. It contains two major components and one model: the blue component for learning and the red component for inference, together with the MetaMLAnn model. In the following subsections, we introduce how MetaMLAnn is constructed, explain the regularization framework, discuss how feature extraction has been done to train MetaMLAnn, and present the ensemble model.
Model: MetaMLAnn
We start by introducing the one-hidden-layer feedforward neural network model [17]. In the neural network model, there are p hidden units. The input layer x∈R^{k×1} is connected to hidden layer h∈R^{p×1} with weights W^{(1)}∈R^{p×k} and biases b^{(1)}∈R^{p×1}. The hidden nodes are then connected to output nodes o∈R^{m×1} via weights W^{(2)}∈R^{m×p} and biases b^{(2)}∈R^{m×1}.
We denote f_{θ}:x→o as the feedforward neural network below:
\(f_{\theta}(x) = f_{o}\left(W^{(2)} f_{h}\left(W^{(1)}x + b^{(1)}\right) + b^{(2)}\right)\)
where θ={W^{(1)},W^{(2)},b^{(1)},b^{(2)}}, and f_{o} and f_{h} are the activation functions in the output layer and the hidden layer, respectively. Specifically, the function f_{θ}(x) can be simplified by using vector representation, where z^{(1)} and z^{(2)} are the weighted sums of the inputs and of the hidden activations:
\(z^{(1)} = W^{(1)}x + b^{(1)}, \quad h = f_{h}\left(z^{(1)}\right), \quad z^{(2)} = W^{(2)}h + b^{(2)}, \quad o = f_{o}\left(z^{(2)}\right)\)
Given the cost function J(θ;x,y), we seek a parameter vector θ that minimizes it. J(θ;x,y) measures the difference between the given targets y and the predictions of the network. Here, we choose CrossEntropy [18] as our cost function:
\(J(\theta; x, y) = -\sum_{i=1}^{m}\left(y_{i}\log o_{i} + (1-y_{i})\log(1-o_{i})\right)\)
where y_{i} and o_{i} are the ground truth and the predicted scores for label i, respectively. The sigmoid activation function o=σ(z)=f_{o}(z)=1/(1+exp(−z)) is applied in the output layer.
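The forward pass and cost described above can be sketched in a few lines of plain Python. This is a minimal sketch, not the authors' implementation: the hidden activation f_{h} is assumed to be a sigmoid (the text only fixes the output activation), and the helper names and the small `eps` guard inside the logarithm are choices made for this example.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def affine(W, x, b):
    """Compute W x + b for a matrix W (list of rows), a vector x and a bias b."""
    return [sum(w * xi for w, xi in zip(row, x)) + bi for row, bi in zip(W, b)]

def forward(x, W1, b1, W2, b2):
    """One-hidden-layer network; sigmoid is assumed for the hidden layer too."""
    h = [sigmoid(z) for z in affine(W1, x, b1)]   # hidden layer, p units
    o = [sigmoid(z) for z in affine(W2, h, b2)]   # one score per label
    return o

def cross_entropy(y, o, eps=1e-12):
    """Sum of binary cross-entropies over the m labels."""
    return -sum(yi * math.log(oi + eps) + (1 - yi) * math.log(1 - oi + eps)
                for yi, oi in zip(y, o))

# k = 2 inputs, p = 2 hidden units, m = 3 labels; all-zero weights give 0.5 scores.
W1, b1 = [[0.0, 0.0], [0.0, 0.0]], [0.0, 0.0]
W2, b2 = [[0.0, 0.0]] * 3, [0.0, 0.0, 0.0]
o = forward([1.0, 0.0], W1, b1, W2, b2)
loss = cross_entropy([1, 0, 1], o)
```

With all-zero parameters every score is 0.5, so the loss equals 3·ln 2, which is a convenient sanity check before training.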
In MetaMLAnn, we extend the basic feedforward neural network by leveraging a heterogeneous architecture. Figure 3 depicts the detailed design of MetaMLAnn. Instead of using multiple hidden nodes of the same type in the hidden layer, we define two different types of sub-hidden layers, which we call blocks (B). The first type consists of the individual blocks, B_{1} to B_{m}, where m is the number of labels. The second type, B_{share}, is a shared block that connects to all output neurons. Therefore, each output neuron connects to a corresponding individual block and to the commonly shared block. All blocks contain one hidden layer with p neurons.
Therefore, we replace the p-unit hidden layer with m+1 blocks B. Each block consists of a hidden layer with p hidden neurons. For each i, the input layer x∈R^{k×1} is connected to each block B_{i}∈R^{p×1} with weights \(W_{i}^{(1)} \in R^{p\times k}\) and biases \(b_{i}^{(1)} \in R^{p\times 1}\). Then, the blocks B_{i} and B_{share} are connected to output node o_{i}∈R via weights \(W_{i}^{(2)} \in R^{1\times p}\) and biases b^{(2)}∈R.
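To make the block structure concrete, the sketch below computes one score per label from that label's individual block plus the shared block. It rests on assumptions that the text does not spell out: the two contributions are summed before the output sigmoid, the shared block has its own per-label output weights (`out_shared`, a hypothetical name), and the hidden activation is a sigmoid. All names and the random toy weights are illustrative only.

```python
import math
import random

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def block(x, W, b):
    """Hidden activations of one block: p units, each sigmoid(w . x + bias)."""
    return [sigmoid(dot(row, x) + bias) for row, bias in zip(W, b)]

def metamlann_forward(x, indiv_blocks, shared_block, out_indiv, out_shared, out_bias):
    """Score each label from its own block plus the shared block (assumed sum)."""
    h_shared = block(x, *shared_block)      # computed once, reused by every label
    scores = []
    for (W_i, b_i), w_i, w_s, c_i in zip(indiv_blocks, out_indiv, out_shared, out_bias):
        h_i = block(x, W_i, b_i)            # label-specific hidden units
        scores.append(sigmoid(dot(w_i, h_i) + dot(w_s, h_shared) + c_i))
    return scores

# Tiny random instance: k = 3 features, p = 2 units per block, m = 4 labels.
rng = random.Random(0)
k, p, m = 3, 2, 4

def rand_vector(n):
    return [rng.uniform(-1, 1) for _ in range(n)]

def rand_matrix(rows, cols):
    return [rand_vector(cols) for _ in range(rows)]

indiv_blocks = [(rand_matrix(p, k), rand_vector(p)) for _ in range(m)]
shared_block = (rand_matrix(p, k), rand_vector(p))
out_indiv = [rand_vector(p) for _ in range(m)]
out_shared = [rand_vector(p) for _ in range(m)]
out_bias = rand_vector(m)

scores = metamlann_forward([1.0, 0.0, 1.0], indiv_blocks, shared_block,
                           out_indiv, out_shared, out_bias)
```

Because the shared hidden activations are reused for every label, gradients from all labels flow into the same shared parameters, which is how such a design can pick up dependencies among labels.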
We use stochastic gradient descent (SGD) [19] to efficiently optimize the cost function in Eq. 4. For each individual block, we randomly sample a location i and a unit from y_{i} to compute B_{i}. For the shared block B_{share}, we randomly sample a location i and a unit from all the classes y_{1} to y_{m} to capture the global properties shared by all microbes. The updating rules for the variables W and b can be derived by taking the derivatives of the above objective function and applying SGD. Training our model is efficient with SGD and backpropagation. More specifically, the time complexity of training our model is O(t·n·∣θ∣), where t is the number of training epochs, n is the number of training examples, and θ is the set of parameters in the model. To demonstrate the convergence of the proposed algorithm, we plot the values of the loss function over different optimization epochs in Fig. 4.
Finally, the heterogeneous neural network model f_{θ}:x→o can be reformatted as follows:
Manifold regularization
Neural networks tend to suffer when training examples are limited, and with only a few instances of each label, it is challenging to train MetaMLAnn. One potential solution to compensate for the data sparsity is to incorporate prior knowledge. Inspired by the general observation that evolutionary relationships are expected to be associated with patterns of community composition [20], we presume that groups of microbes tend to co-occur in the same community when they are closely related to each other in the taxonomy.
Taxonomy here refers to the identification, naming, and classification of organisms. We choose to use evolutionary similarity as the domain knowledge, which is then fed into our regularizer. This is because taxonomy is often informed by the evolutionary relationships among different microbes (i.e., phylogeny). To incorporate such microbial similarity, many regularization techniques can be used. We choose one of the most popular techniques, the Graph Laplacian regularizer, to build our regularization framework [21–25].
Definition 4
(Graph Laplacian matrix L) Given a pairwise similarity matrix P∈R^{m×m}, the Graph Laplacian matrix is defined as L=D−P, where D is a diagonal matrix with j^{th} diagonal element \(D_{j,j} = \sum _{j' =1}^{m}\left (P_{j,j'}\right).\)
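By minimizing
\(\sum_{i}\sum_{i'} P_{i,i'}\left(\beta_{i} - \beta_{i'}\right)^{2},\)
the regularizer can preserve the local geometrical structure of a parameter vector β with length I. According to the definition, L has the following property that makes it suitable for regularization. Given the trace operator tr(·) and a symmetric P:
\(tr\left(\beta^{T} L \beta\right) = tr\left(\beta^{T} D \beta\right) - tr\left(\beta^{T} P \beta\right) = \frac{1}{2}\sum_{i}\sum_{i'} P_{i,i'}\left(\beta_{i} - \beta_{i'}\right)^{2}\)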
From the above equations, the two parameters β_{i} and \(\beta_{i'}\) are enforced to be similar, which can be incorporated into the cost function. The regularized cost function is defined as:
where y_{i} and o_{i} are the ground truth label and the predicted score for sample i.
The Graph Laplacian regularizer can represent any pairwise relationships between parameters. Here we discuss how to use the evolutionary similarities as priors and the corresponding Laplacian regularizers to incorporate structured domain knowledge. The Laplacian matrix L is firstly obtained by constructing the pairwise evolutionary similarity matrix (P) of different microbes.
Upon obtaining the predicted microbial distribution vector \(Y_{i}^{*}\) for a given location i from the blocks, each vector is regularized by feeding \(Y_{i}^{*}\) into Eq. 5, where β refers to the predicted vector \(Y_{i}^{*}\), and β_{i} and β_{j} refer to microbe i and microbe j at this location, respectively.
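The effect of the Graph Laplacian regularizer can be checked numerically on a toy example. The sketch below builds L=D−P from a small symmetric similarity matrix and compares the quadratic form β^{T}Lβ with the pairwise squared-difference penalty. The matrix values and function names are made up for illustration, and symmetry of P is assumed.

```python
def laplacian(P):
    """Graph Laplacian L = D - P, where D_jj is the j-th row sum of P."""
    m = len(P)
    return [[(sum(P[j]) if j == jp else 0.0) - P[j][jp] for jp in range(m)]
            for j in range(m)]

def quadratic_form(beta, L):
    """beta^T L beta for a vector beta and a square matrix L."""
    m = len(beta)
    return sum(beta[i] * L[i][j] * beta[j] for i in range(m) for j in range(m))

def pairwise_penalty(beta, P):
    """0.5 * sum_{i,i'} P[i][i'] * (beta_i - beta_i')^2."""
    m = len(beta)
    return 0.5 * sum(P[i][j] * (beta[i] - beta[j]) ** 2
                     for i in range(m) for j in range(m))

# Toy similarity: microbes 0 and 1 are closely related, microbe 2 is distant.
P = [[0.0, 0.9, 0.1],
     [0.9, 0.0, 0.2],
     [0.1, 0.2, 0.0]]
L = laplacian(P)

agree = [0.8, 0.8, 0.1]      # related microbes receive similar scores
disagree = [0.8, 0.1, 0.8]   # related microbes receive very different scores
penalty_agree = quadratic_form(agree, L)
penalty_disagree = quadratic_form(disagree, L)
```

Predictions that give similar scores to closely related microbes incur a smaller penalty, which is the behavior the regularizer is meant to encourage.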
Feature extraction
Here we describe how we extract the features from various data sources. These feature extraction methods can serve as a general pipeline for any urban-scale metagenomics study.
We define a feature vector as F∈R^{k}, i.e., a k-dimensional real-valued vector.
For this work, we extract the following features: subway station information, inter-station connections, and sampling surface materials. All features are concatenated into a feature vector for each sample and are used to train MetaMLAnn.
Subway station features (F_{s}): The first set of features that we extract is the subway station information. We obtain the MTA and MBTA subway station data for New York and Boston, respectively. Each location is associated with the closest stations within a predefined radius, r=0.01 miles. This radius is an empirical parameter and can be tuned. The feature vector is then created based on the lines that pass through the current station. If no station information is available within this range, we find the two nearest stations and check whether their subway line information matches. If it does, we assign that subway line to the location. Otherwise, we do not assign any subway line information to the location. This process specifically handles sampling locations that are not stations but lie between two subway stations on the same line.
It has been shown that the number of riders is positively correlated with the amount of DNA collected in a station [9]. Therefore, we also retrieve the public MTA data with the turnstile usage information at each station. The corresponding node vector is then weighted by the average number of riders at each station on the DNA collection dates.
For example, there are in total 25 different subway lines in New York, so we create a binary vector of size 25 in which each element indicates whether the corresponding line passes through this location. For station l, the subway line feature vector is defined as \(F_{s_{l}} = (v_{1}, v_{2}, \ldots, v_{25})\). If v_{i}=1, then line i passes through this location. Finally, \(F_{s_{l}}\) is weighted based on the busyness of station l.
Note that a location may be associated with multiple lines or with no line. In the case of multiple lines, more than one v_{i} equals 1. In the case of no line, we simply remove the location, since we focus on inference at stations. Therefore, every remaining location is associated with a subway line feature vector.
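The subway line feature described above can be sketched as follows. The line names, the function name `line_feature`, and the choice to apply the busyness weight by simple multiplication with a pre-normalized `ridership_weight` are assumptions for illustration; the text does not specify how the weighting is applied.

```python
def line_feature(lines_at_location, all_lines, ridership_weight=1.0):
    """Binary line-membership vector, scaled by a (hypothetical) busyness weight."""
    return [ridership_weight if line in lines_at_location else 0.0
            for line in all_lines]

# Hypothetical system with four lines instead of the 25 in New York.
ALL_LINES = ["A", "B", "C", "D"]
feature = line_feature({"A", "C"}, ALL_LINES, ridership_weight=0.5)
no_line = line_feature(set(), ALL_LINES)   # such locations are removed in the text
```

A location served by several lines simply has several non-zero entries, while an all-zero vector marks a location with no line information.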
Interconnection features (F_{c}): We first describe how we construct the subway system network. Each subway station is denoted as a node, and each connection between two stations is represented as an edge. The weight of edge (i,j) is the minimum number of stops from station i to station j. We also account for express trains: if an express train directly connects two stations, we assign 1 as the weight of that edge.
Upon obtaining the station network, we apply the network embedding algorithm Node2Vec [26]. Each node is embedded into a low-dimensional vector based on the generated network.
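The edge weights of the station network can be obtained with a breadth-first search over stop-to-stop connections. The sketch below is one way to do this under stated assumptions: the toy adjacency list, the station names, and the modeling of an express connection as an extra direct link are all hypothetical, and the Node2Vec embedding step itself is not shown.

```python
from collections import deque

def min_stops(adjacency, source):
    """Minimum number of stops from `source` to every reachable station (BFS)."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for v in adjacency[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist

# Hypothetical local line A-B-C-D, plus an express link directly joining A and D.
adjacency = {
    "A": ["B", "D"],
    "B": ["A", "C"],
    "C": ["B", "D"],
    "D": ["C", "A"],
}
stops_from_a = min_stops(adjacency, "A")
```

The resulting stop counts can then serve as edge weights when the network is passed to an embedding method.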
Surface materials features (F_{m}): The surface materials are strongly correlated with the microbial communities, as discussed in [10]. Therefore, we represent such information by using another set of vectors. Based on the type of material the sample was collected from, a vector of length equal to the number of material types is constructed. For the New York dataset, each element represents one type of material: ‘concrete’, ‘metal’, ‘plastic’, ‘water’, or ‘wood’, and the vectors are of length 5. As for the Boston dataset, the vector is of length 4 with four types of materials: ‘glass’, ‘polyester’, ‘PVC’, and ‘steel’.
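The surface material encoding and the final concatenation of the three feature groups can be sketched as below. The material lists come from the text; the function names and the toy values standing in for F_{s} and F_{c} are assumptions for illustration.

```python
NEW_YORK_MATERIALS = ["concrete", "metal", "plastic", "water", "wood"]
BOSTON_MATERIALS = ["glass", "polyester", "PVC", "steel"]

def material_feature(material, material_types):
    """One-hot vector marking the surface material a sample was taken from."""
    if material not in material_types:
        raise ValueError(f"unknown material: {material}")
    return [1 if material == t else 0 for t in material_types]

def concat_features(*groups):
    """Concatenate feature groups (e.g. F_s, F_c, F_m) into a single vector."""
    return [value for group in groups for value in group]

f_m = material_feature("metal", NEW_YORK_MATERIALS)
# Toy stand-ins for a weighted line vector F_s and an embedding F_c.
feature_vector = concat_features([0.5, 0.0], [0.12, -0.30], f_m)
```

Keeping each group as its own list until the last step makes it easy to train on individual groups or on their combination.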
Ensemble with hybrid prediction
To alleviate the lack of training data, in addition to the regularization, we also propose to construct an ensemble of MetaMLAnn with any other model that needs fewer training samples.
For each label i, let o_{i} be the predicted score of MetaMLAnn. Given the score from the other model m as \(o^{m}_{i}\), we conduct a linear hybrid prediction for the ensemble as follows:
\(\hat{o}_{i} = \alpha\, o_{i} + (1-\alpha)\, o^{m}_{i}\)
where 0≤α≤1 is a parameter that decides the weights of the two models. When α=1 the prediction is that of MetaMLAnn, and when α=0 the prediction is that of the model m.
We denote the ensemble approach as MetaMLAnn+.
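The linear hybrid prediction is a per-label convex combination of the two score vectors, as in this minimal sketch. The function name and the toy scores are assumptions; the default α=0.7 is simply the value reported later in the experiments.

```python
def hybrid_scores(o_meta, o_other, alpha=0.7):
    """Blend MetaMLAnn scores with another model's scores, label by label."""
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    return [alpha * a + (1.0 - alpha) * b for a, b in zip(o_meta, o_other)]

o_meta = [0.9, 0.2]    # toy MetaMLAnn scores for two labels
o_other = [0.5, 0.6]   # toy scores from the second model
blended = hybrid_scores(o_meta, o_other, alpha=0.7)
```

Setting `alpha=1.0` returns the MetaMLAnn scores unchanged, and `alpha=0.0` returns the second model's scores.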
Sample collection and data preprocessing
We apply our model on the New York and Boston datasets obtained from the MetaSUB InterCity Challenges track of the 2017 CAMDA Contest.
They both contain mass-transit metagenomic raw read data, supplemented with sample descriptions.
The New York dataset contains 1572 samples, representing different sites. These samples were collected from open subway stations for all 24 subway lines of the NYC Metropolitan Transit Authority (MTA). At subway stations, samples were collected in triplicate, with one sample taken inside a train at the station and two samples from the station itself, as reported by [9]. DNA samples collected from each site were sequenced using the Illumina platform, with a total of 10.4 billion paired-end DNA sequencing reads.
In addition, each sample is also associated with meta information, including the latitude and longitude showing where the sample was collected, and the surface material. All of this information is indispensable for feature generation.
Similarly, there are 141 samples in the Boston dataset, which were also collected from the local subway system, consisting of 5 lines (red, orange, blue, green, and silver) that extend from downtown Boston into the surrounding suburbs. As mentioned in [10], most samples are 16S rRNA gene amplicon sequencing data, and a subset of the samples was subjected to shotgun metagenomic sequencing. Each sample is also supplemented with additional information, which describes the date of collection, station information, and surface type. For the 16S rRNA samples, the corresponding abundance profiles are also provided.
For each sample in the New York dataset and samples subjected to shotgun metagenomic sequencing in the Boston dataset, we conduct the following preprocessing steps:

1) To be consistent with the processing procedure in [9], from which the New York data is collected, we use MetaPhlAn2 [16] to perform microbial profiling. Each profile contains the relative abundances as a percentage from the kingdom level to the species level.

2) As described in [9], 48.3% of the reads in the New York dataset do not match any known organism. Therefore, when we construct the microbial distribution vector, those unknown microbes are removed and the relative abundances of the remaining known microbes are recomputed.
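The second preprocessing step amounts to dropping the unknown fraction and rescaling what remains. A minimal sketch follows, assuming the profile is a dictionary from taxon name to percentage and that unknown reads are stored under a key named `"unclassified"`; both the data layout and the key name are hypothetical.

```python
def renormalize(profile, unknown_key="unclassified"):
    """Drop the unknown fraction and rescale the known taxa to sum to 100%."""
    known = {taxon: a for taxon, a in profile.items() if taxon != unknown_key}
    total = sum(known.values())
    if total == 0:
        return {taxon: 0.0 for taxon in known}
    return {taxon: 100.0 * a / total for taxon, a in known.items()}

# Toy profile in percent; the taxa and values are illustrative only.
profile = {"Pseudomonas": 30.0, "Bacillus": 10.0, "unclassified": 60.0}
known_profile = renormalize(profile)
```

After rescaling, the known taxa keep their relative proportions (here 3:1) while summing to 100%.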
Supplemental data sources
We use the New York and Boston subway station data from the MTA and MBTA websites, respectively, to construct the subway line features. They contain geographic locations, subway station names, and the subway lines that pass each station. We also obtain the turnstile data of MTA and MBTA to quantify the busyness of all stations. The detailed descriptions can be found in Table 2.
To capture the underlying microbiota structure, we construct a pairwise similarity matrix to represent the evolutionary relationship between two species. We retrieve the 16S ribosomal RNA sequences for bacteria and archaea, 5S ribosomal RNA for eukaryotes, and the whole DNA sequences for viruses from the NCBI [27–29] and the Silva [30, 31] databases. We perform sequence alignments to compute the pairwise similarity within each kingdom. We normalize the similarity values to the range of 0 to 1, and we assign a similarity of 0 to cross-kingdom species pairs. Finally, for each pair of genera, we take the mean of the similarity scores of all species pairs under them and use it as the score of the genus pair (Eq. 9). In this way, we obtain the similarity matrix at the genus level.
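Given two genera g_{a} and g_{b} as sets of species, the similarity score between the pair of genera can be computed as:
\(sim(g_{a}, g_{b}) = \frac{1}{|g_{a}|\,|g_{b}|} \sum_{sp_{a} \in g_{a}} \sum_{sp_{b} \in g_{b}} sim(sp_{a}, sp_{b})\)
where sp_{a} and sp_{b} are the species of g_{a} and g_{b}, respectively.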
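The genus-level aggregation can be sketched as a mean over all cross-genus species pairs. The species names, the dictionary layout keyed by species pairs, and the function name are assumptions for this example.

```python
def genus_similarity(genus_a, genus_b, species_sim):
    """Mean pairwise similarity between the species of two genera."""
    def lookup(a, b):
        # Similarities are symmetric, so accept either key order.
        return species_sim[(a, b)] if (a, b) in species_sim else species_sim[(b, a)]
    pairs = [(a, b) for a in genus_a for b in genus_b]
    return sum(lookup(a, b) for a, b in pairs) / len(pairs)

# Toy species-level similarities, already normalized to [0, 1].
species_sim = {("a1", "b1"): 0.8, ("a2", "b1"): 0.6}
score = genus_similarity(["a1", "a2"], ["b1"], species_sim)
```

Because the lookup accepts either key order, swapping the two genera gives the same score.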
Results
To demonstrate the effectiveness of MetaMLAnn, we conduct comprehensive experiments by using both the New York and Boston datasets. In this section, we will discuss the experiment setup, evaluation metrics, baselines and results.
Experimental settings
After we conduct data processing, each sample is associated with an abundance vector.
We observe that, at all taxonomic levels, many microbes are seriously underrepresented in the abundance profiles (i.e., appearing at only one location). We choose to focus on the genus-level abundance to alleviate issues including underrepresented microbes, missing species-level taxonomy, and very similar microbial species.
Together with the number of features obtained, the detailed microbial composition of both datasets can be found in Table 3.
We use k-fold cross-validation for all experiments. Setting k to three, we randomly split the data into three equal, non-overlapping subsets. Each subset is used once to test the model while the remaining subsets are used to train it.
The average performance of each method from these three folds is reported. In addition, we also justify the effectiveness of our feature construction by comparing the performance of individual features and their combination with the same classifier.
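A three-fold split of this kind can be generated with the standard library alone. The sketch below is an assumption about the mechanics (a seeded shuffle followed by striding), not a description of the authors' tooling.

```python
import random

def kfold_indices(n, k=3, seed=0):
    """Yield (train, test) index lists for k non-overlapping folds."""
    indices = list(range(n))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]   # k near-equal folds
    for i in range(k):
        test = folds[i]
        train = [j for fold in folds[:i] + folds[i + 1:] for j in fold]
        yield train, test

splits = list(kfold_indices(9, k=3, seed=0))
```

Each sample lands in exactly one test fold, so averaging the per-fold scores uses every sample for testing once.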
Evaluation metrics
We assess the performance of our classifier in several ways. While accuracy is the simplest and the most straightforward measure, it is biased toward classes with a larger sample size. Instead, we report precision, recall, and F1 score as our evaluation metrics. These metrics are defined as:
where, given m labels, tp_{i}, tn_{i}, fp_{i}, and fn_{i} represent the true positives, true negatives, false positives, and false negatives for the i^{th} label, respectively.
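As a worked example of these metrics, the sketch below pools the per-label counts before computing precision, recall, and F1. Pooling the counts over all labels (micro-averaging) is an assumption made here; averaging the per-label scores instead (macro-averaging) would be an equally plausible reading of the definitions.

```python
def micro_prf(Y_true, Y_pred):
    """Micro-averaged precision, recall and F1 over all samples and labels."""
    tp = fp = fn = 0
    for y_true, y_pred in zip(Y_true, Y_pred):
        for t, p in zip(y_true, y_pred):
            tp += int(t == 1 and p == 1)
            fp += int(t == 0 and p == 1)
            fn += int(t == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1

Y_true = [[1, 0, 1], [0, 1, 0]]
Y_pred = [[1, 1, 0], [0, 1, 0]]
precision, recall, f1 = micro_prf(Y_true, Y_pred)
```

In this toy case there are two true positives, one false positive, and one false negative, so all three metrics equal 2/3.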
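Finally, we also use ranking loss, which averages over n samples the number of label pairs that are incorrectly ordered, i.e. true labels have a lower score (\(\hat {f}\)) than false labels, weighted by the inverse number of false and true labels, as shown below:
\(ranking\_loss(y, \hat{f}) = \frac{1}{n} \sum_{i=1}^{n} \frac{1}{|y_{i}|\,(m - |y_{i}|)} \left|\left\{(k, l) : \hat{f}_{ik} < \hat{f}_{il},\ y_{ik} = 1,\ y_{il} = 0\right\}\right|\)
where |y_{i}| is the number of true labels of sample i.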
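The ranking loss described above can be computed directly from its verbal definition. In this sketch, ties are not counted as errors and samples whose labels are all true or all false contribute zero; both are choices made for the example rather than details stated in the text.

```python
def ranking_loss(Y_true, scores):
    """Average fraction of (true, false) label pairs that are wrongly ordered."""
    total = 0.0
    for y, f in zip(Y_true, scores):
        pos = [f[j] for j, v in enumerate(y) if v == 1]
        neg = [f[j] for j, v in enumerate(y) if v == 0]
        if not pos or not neg:
            continue   # no (true, false) pairs: this sample contributes 0
        wrong = sum(1 for s_pos in pos for s_neg in neg if s_pos < s_neg)
        total += wrong / (len(pos) * len(neg))
    return total / len(Y_true)

Y_true = [[1, 0, 0], [1, 1, 0]]
scores = [[0.2, 0.5, 0.1],    # the true label is outranked by one false label
          [0.9, 0.8, 0.1]]    # perfectly ordered
loss = ranking_loss(Y_true, scores)
```

The first sample has one of its two (true, false) pairs misordered and the second has none, giving an average of 0.25.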
Baselines
As we formalize the inference problem as a multi-label classification (MLC) problem, we adopt several widely used MLC algorithms as the baseline methods, including Inverse Distance Weighting (IDW) interpolation, k-Nearest Neighbor (kNN) [32], Support Vector Machine (SVM) [33], Random Forest [34], and Neural Network [35].

Inverse Distance Weighting (IDW): Inverse distance weighting is a deterministic, nonlinear interpolation technique that uses a weighted average of the attribute values from nearby sample points to estimate the magnitude of that attribute at nonsampled locations. The weight a particular point is assigned depends upon the sampled point’s distance to the nonsampled location.

k-Nearest Neighbor: This classifier computes the classification from a simple majority vote of the nearest neighbors of each point: a query point is assigned the data class that has the most representatives among the nearest neighbors of the point.

SVM with one-vs-all: This baseline assumes that all the label predictions are independent. The problem is decomposed into binary classification tasks (one for each label), with SVM as the base classifier. The one-vs-all strategy is then used, which fits one classifier per class against all the other classes. Finally, the predictions of the SVMs for all labels are combined to make the final prediction.

Random Forest: This baseline method is an ensemble of decision tree classifiers. Based on various subsamples of the dataset, the random forest uses averaging to improve the predictive accuracy and control overfitting. In this baseline, we feed all the features equally into the decision trees.

Single-layer Perceptron classifier (Vanilla Neural Network): We choose the single-layer feedforward neural network model in the experiments for its simplicity and generality. It is the classification model most similar to MetaMLAnn.
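Of the baselines listed above, IDW is the simplest to sketch. The version below predicts a score per microbe as a distance-weighted average of the label vectors at sampled locations. Planar Euclidean distance, the power of 2, and the 0.5 threshold for presence are assumptions of this example rather than settings reported in the paper.

```python
import math

def idw_predict(target, sampled, power=2.0):
    """Inverse-distance-weighted average of the label vectors of sampled points."""
    if not sampled:
        raise ValueError("at least one sampled location is required")
    weight_sum = 0.0
    weighted = [0.0] * len(sampled[0][1])
    for location, labels in sampled:
        d = math.hypot(target[0] - location[0], target[1] - location[1])
        if d == 0.0:
            return [float(v) for v in labels]   # target coincides with a sample
        w = 1.0 / d ** power
        weight_sum += w
        for j, v in enumerate(labels):
            weighted[j] += w * v
    return [v / weight_sum for v in weighted]

# Two sampled locations with presence vectors for two microbes.
sampled = [((0.0, 0.0), [1, 0]), ((2.0, 0.0), [0, 1])]
midpoint_scores = idw_predict((1.0, 0.0), sampled)
near_first_scores = idw_predict((0.5, 0.0), sampled)
presence = [1 if s >= 0.5 else 0 for s in near_first_scores]
```

A point halfway between the two samples gets equal scores for both microbes, while a point closer to the first sample is dominated by that sample's labels.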
Performance of MetaMLAnn
Using the combined features, Tables 4 and 5 show the performance of MetaMLAnn and the aforementioned baselines on the New York and Boston datasets, respectively. As discussed in the experimental settings, we focus on genus-level inference. We observe that MetaMLAnn and MetaMLAnn+ outperform all baselines on F1 score and ranking loss.
In the New York dataset, MetaMLAnn and MetaMLAnn+ perform the best in terms of F1 score and ranking loss, though the precision and recall of MetaMLAnn each rank second among all methods. IDW achieves the highest recall but its precision is the lowest, which offsets its high recall. As an unsupervised learning model using the inverse distance weighting of surrounding microbial distribution vectors, IDW tends to predict more microbes than the other methods. However, most of them are false positives. On the other hand, SVM shows a slightly higher precision than all other methods but results in a poor recall. This implies that SVM-based methods tend to be conservative in predicting the “presence” of species, which does not meet our expectations. MetaMLAnn has the best balance of precision and recall, which results in the best overall F1 score. In addition to MetaMLAnn, we also report the result of the ensemble model with IDW, MetaMLAnn+, where we use α=0.7 after parameter tuning.
As can be seen from the table, the F1 score can be further boosted by more than 1%, which is better than either single model.
As for the Boston dataset, our model outperforms all the baseline models in terms of precision, F1 score, and ranking loss. Even though Random Forest achieves a slightly higher recall than our model, its precision suffers from predicting too many microbes. However, after we leverage the Random Forest model as part of our ensemble model with the same parameter as for New York, α=0.7, MetaMLAnn+ achieves the best score in all metrics against the other baselines.
Discussion
Feature analysis
As feature extraction is crucial for inferring microbial communities in a complicated urban system with heterogeneous data sources, we first demonstrate the effectiveness of our feature construction. Recall that we have three groups of features: subway station features (F_{s}), interconnection features (F_{c}), and surface material features (F_{m}). As shown in Table 6, a random forest model is used to compare the performance of individual features and their combinations. Overall, the complete feature set has the best performance in precision, F1 score, and ranking loss. Note that we intentionally use Random Forest instead of our model, MetaMLAnn, to conduct these experiments. This is to demonstrate that our feature extraction techniques are beneficial in general to the microbial community inference problem without favoring our model.
Analysis on different taxonomic levels
To further demonstrate the generality of our model, we compare the performance of MetaMLAnn with the aforementioned baselines at different taxonomic levels, from phylum to species. We ignore the kingdom level because it contains too few classes.
As seen in Fig. 5, as the taxonomic level becomes more specific, the performance of all methods decreases because the prediction task becomes more complex. Against all competitors, MetaMLAnn and MetaMLAnn+ (ensembled with IDW) consistently achieve the highest F1 score and the lowest ranking loss across all taxonomic levels. The advantage of MetaMLAnn becomes more pronounced at finer taxonomic granularity.
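For readers less familiar with ranking loss, the following minimal sketch uses the common multi-label definition: the fraction of (present, absent) label pairs in which the absent label is scored at least as high as the present one, averaged over locations. This definition, the treatment of ties as errors, and the function names are assumptions of the sketch and may differ in detail from the metric used in the experiments.

```python
def ranking_loss(truth, scores):
    """Ranking loss for one location.

    truth:  list of 0/1 labels (1 = taxon present).
    scores: list of predicted scores, one per taxon.
    Ties between a present and an absent taxon count as errors
    (one possible convention).
    """
    present = [s for t, s in zip(truth, scores) if t == 1]
    absent = [s for t, s in zip(truth, scores) if t == 0]
    if not present or not absent:
        return 0.0  # no (present, absent) pairs to compare
    misordered = sum(1 for p in present for a in absent if a >= p)
    return misordered / (len(present) * len(absent))

def mean_ranking_loss(truths, score_rows):
    """Average the per-location ranking loss over all locations."""
    losses = [ranking_loss(t, s) for t, s in zip(truths, score_rows)]
    return sum(losses) / len(losses)

# Toy example: four taxa at one location (made-up scores).
truth = [1, 0, 1, 0]
scores = [0.9, 0.2, 0.4, 0.6]
loss = ranking_loss(truth, scores)
print(loss)  # one of four (present, absent) pairs is misordered
```

A lower value means that present taxa are more consistently ranked above absent ones, which is why smaller is better for this metric.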
Parameter selection of the ensemble model
Here, we analyze how the ensemble weight α affects the prediction performance.
Figure 6 shows the F1 score and the ranking loss over different ensemble weights α of MetaMLAnn and IDW on the New York dataset. The left vertical axis shows the F1 score (the larger the better) and the right vertical axis shows the ranking loss (the smaller the better). Recall that our ensemble model is defined in Eq. 8, where α closer to 1 puts more weight on MetaMLAnn and α closer to 0 puts more weight on the additional model. The results suggest that with a good mixture of the two models (i.e., α=0.7 in this case), the ensemble model achieves the best F1 score and the best ranking loss. This is because the additional model (IDW) contains orthogonal information, which can compensate for information missing from the training of MetaMLAnn. Without the ensemble, MetaMLAnn tends to be conservative because of the sparsity of the dataset. In contrast, IDW tends to predict more microbes, and combining the two boosts the overall performance.
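The role of α can be sketched as follows. The snippet assumes that the ensemble is a convex combination of the two models' per-taxon scores, α·(main model) + (1−α)·(additional model), which is inferred from the description of α above rather than copied from Eq. 8. The function names, the 0.5 decision threshold, and the toy scores are all hypothetical.

```python
def ensemble(p_main, p_aux, alpha):
    """Convex combination of two score vectors (assumed ensemble form)."""
    return [alpha * a + (1 - alpha) * b for a, b in zip(p_main, p_aux)]

def f1_at_threshold(truth, probs, threshold=0.5):
    """F1 score after thresholding scores into presence/absence."""
    pred = [p >= threshold for p in probs]
    tp = sum(1 for t, y in zip(truth, pred) if t and y)
    fp = sum(1 for t, y in zip(truth, pred) if not t and y)
    fn = sum(1 for t, y in zip(truth, pred) if t and not y)
    return 2 * tp / (2 * tp + fp + fn) if tp else 0.0

def best_alpha(truth, p_main, p_aux, grid):
    """Pick the grid value of alpha with the highest F1 (first on ties)."""
    return max(
        grid,
        key=lambda a: f1_at_threshold(truth, ensemble(p_main, p_aux, a)),
    )

# Toy data for five taxa at one location (made-up scores).
truth = [1, 1, 0, 0, 1]
p_main = [0.9, 0.4, 0.1, 0.2, 0.3]  # conservative: misses two present taxa
p_aux = [0.8, 0.9, 0.7, 0.3, 0.9]   # liberal: adds one false positive

grid = [i / 10 for i in range(11)]
alpha = best_alpha(truth, p_main, p_aux, grid)
f1_main = f1_at_threshold(truth, p_main)
f1_aux = f1_at_threshold(truth, p_aux)
f1_mix = f1_at_threshold(truth, ensemble(p_main, p_aux, alpha))
print(alpha, f1_main, f1_aux, f1_mix)
```

In this toy case an intermediate α beats both single models, mirroring the qualitative behavior described above: the liberal model recovers taxa the conservative model misses, while the conservative model suppresses the liberal model's false positive. In practice α would be selected on held-out data rather than on the evaluation set.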
Ablation study of the shared block B_{shared}
Table 7 shows the results of the ablation study of the shared block B_{shared} and the individual blocks B_{i}, where i=1…m. The configuration with both is denoted B_{shared}+B_{i}. In the New York dataset, removing the shared block slightly decreases the F1 score and increases the ranking loss, while using only the shared block lowers the F1 score by around 3% and doubles the ranking loss. In the Boston dataset, dropping either of the two units substantially impairs the performance of MetaMLAnn. These results reflect the importance of having both the individual and the shared hidden blocks in our model for predicting microbial communities.
Conclusions
Profiling city-scale microbial diversity is important for long-term urban disease surveillance and health management. Despite great efforts to collect DNA samples in densely populated cities, obtaining metagenomic profiles at fine-grained geospatial resolution remains a challenge. To address this issue, we first define the task of inferring microbial communities for city-scale metagenomics as a multi-label classification problem. We then propose MetaMLAnn, a neural-network-based approach to infer microbial communities at unsampled locations given information from multiple data sources in the urban environment, including subway line information, sampling materials, and microbial compositions at sparsely sampled locations. The model captures the interactions between microbes and the urban environment through a shared hidden layer, and fuses the heterogeneous urban transit information with embeddings for feature extraction.
Additionally, by incorporating signals from other strong models, the ensemble technique MetaMLAnn+ further improves the performance of the model. Extensive experiments demonstrate the effectiveness of our approach. In this work, we focus mainly on New York and Boston subway stations because of limited data availability. In the future, as more cities are sampled, we plan to extend our model to the regional scale to solve the inter-city metagenomic inference problem.
References
Qin J, Li R, Raes J, Arumugam M, Burgdorf KS, Manichanh C, Nielsen T, Pons N, Levenez F, Yamada T, et al. A human gut microbial gene catalogue established by metagenomic sequencing. Nature. 2010; 464(7285):59–65.
Leung MH, Wilkins D, Li EK, Kong FK, Lee PK. Indoorair microbiome in an urban subway network: diversity and dynamics. Appl Environ Microbiol. 2014; 80(21):6760–70.
Robertson CE, Baumgartner LK, Harris JK, Peterson KL, Stevens MJ, Frank DN, Pace NR. Cultureindependent analysis of aerosol microbiology in a metropolitan subway system. Appl Environ Microbiol. 2013; 79(11):3485–93.
Cao C, Jiang W, Wang B, Fang J, Lang J, Tian G, Jiang J, Zhu TF. Inhalable microorganisms in beijing’s pm2.5 and pm10 pollutants during a severe smog event. Environ Sci Technol. 2014; 48(3):1499.
Yooseph S, Andrews-Pfannkoch C, Tenney A, McQuaid J, Williamson S, Thiagarajan M, Brami D, Zeigler-Allen L, Hoffman J, Goll JB, et al. A metagenomic framework for the study of airborne microbial communities. PLoS ONE. 2013; 8(12):81862.
Firth C, Bhat M, Firth MA, Williams SH, Frye MJ, Simmonds P, Conte JM, Ng J, Garcia J, Bhuva NP, et al. Detection of zoonotic pathogens and characterization of novel viruses carried by commensal rattus norvegicus in new york city. MBio. 2014; 5(5):01933–14.
Conceição T, Diamantino F, Coelho C, de Lencastre H, AiresdeSousa M. Contamination of public buses with mrsa in lisbon, portugal: a possible transmission route of major mrsa clones within the community. PLoS ONE. 2013; 8(11):77812.
Reese AT, Savage A, Youngsteadt E, McGuire KL, Koling A, Watkins O, Frank SD, Dunn RR. Urban stress is associated with variation in microbial species composition but not richness in manhattan. ISME J. 2016; 10(3):751–60.
Afshinnekoo E, Meydan C, Chowdhury S, Jaroudi D, Boyer C, Bernstein N, Maritz JM, Reeves D, Gandara J, Chhangawala S, et al. Geospatial resolution of human and bacterial diversity with cityscale metagenomics. Cell Syst. 2015; 1(1):72–87.
Hsu T, Joice R, Vallarino J, AbuAli G, Hartmann EM, Shafquat A, DuLong C, Baranowski C, Gevers D, Green JL, Morgan XC, Spengler JD, Huttenhower C. Urban transit system microbial communities differ by surface type and interaction with humans and the environment. mSystems. 2016;1(3). https://doi.org/10.1128/mSystems.0001816. http://msystems.asm.org/content/1/3/e0001816.full.pdf.
Dembczyński K, Waegeman W, Cheng W, Hüllermeier E. On label dependence and loss minimization in multilabel classification. Mach Learn. 2012; 88(12):5–45.
Zheng Y, Liu F, Hsieh HP. Uair: When urban air quality inference meets big data. In: Proceedings of the 19th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. New York: ACM: 2013. p. 1436–44.
Shafiei M, Dunn KA, Boon E, MacDonald SM, Walsh DA, Gu H, Bielawski JP. Biomico: a supervised bayesian model for inference of microbial community structure. Microbiome. 2015; 3(1):8.
Cai Y, Gu H, Kenney T. Learning microbial community structures with supervised and unsupervised nonnegative matrix factorization. Microbiome. 2017; 5(1):110.
Zhou G, Jiang JY, Ju CJT, Wang W. Inferring microbial communities for city scale metagenomics using neural networks. In: 2018 IEEE International Conference on Bioinformatics and Biomedicine (BIBM). Piscataway: IEEE: 2018. p. 603–8.
Truong DT, Franzosa EA, Tickle TL, Scholz M, Weingart G, Pasolli E, Tett A, Huttenhower C, Segata N. Metaphlan2 for enhanced metagenomic taxonomic profiling. Nat Methods. 2015; 12(10):902–3.
Hornik K, Stinchcombe M, White H. Multilayer feedforward networks are universal approximators. Neural Netw. 1989; 2(5):359–66.
Deng LY. The crossentropy method: a unified approach to combinatorial optimization, MonteCarlo simulation, and machine learning. Alexandria: Taylor & Francis; 2006.
Robbins H, Monro S. A stochastic approximation method. Ann Math Stat. 1951; 22(3):400–7.
Lovette IJ, Hochachka WM. Simultaneous effects of phylogenetic niche conservatism and competition on avian community structure. Ecology. 2006; 87(sp7):S14–S28. Wiley Online Library.
Zhang T, Popescul A, Dom B. Linear prediction models with graph regularization for webpage categorization. In: Proceedings of the 12th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. New York: ACM: 2006. p. 821–6.
Ando RK, Zhang T. Learning on graph with laplacian regularization. In: Advances in Neural Information Processing Systems: 2007. p. 25–32.
Weinberger KQ, Sha F, Zhu Q, Saul LK. Graph laplacian regularization for largescale semidefinite programming. In: Advances in Neural Information Processing Systems: 2007. p. 1489–96.
Belkin M, Niyogi P, Sindhwani V. Manifold regularization: A geometric framework for learning from labeled and unlabeled examples. J Mach Learn Res. 2006; 7(Nov):2399–434.
Che Z, Kale D, Li W, Bahadori MT, Liu Y. Deep computational phenotyping. In: Proceedings of the 21th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. New York: ACM: 2015. p. 507–16.
Grover A, Leskovec J. node2vec: Scalable feature learning for networks. In: Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. New York: ACM: 2016. p. 855–64.
O’Leary NA, Wright MW, Brister JR, Ciufo S, Haddad D, McVeigh R, Rajput B, Robbertse B, Smith-White B, Ako-Adjei D, et al. Reference sequence (refseq) database at ncbi: current status, taxonomic expansion, and functional annotation. Nucleic Acids Res. 2015; 44(D1):733–45.
Tatusova T, DiCuccio M, Badretdin A, Chetvernin V, Nawrocki EP, Zaslavsky L, Lomsadze A, Pruitt KD, Borodovsky M, Ostell J. Ncbi prokaryotic genome annotation pipeline. Nucleic Acids Res. 2016; 44(14):6614–24.
Brister JR, AkoAdjei D, Bao Y, Blinkova O. Ncbi viral genomes resource. Nucleic Acids Res. 2014; 43(D1):571–7.
Quast C, Pruesse E, Yilmaz P, Gerken J, Schweer T, Yarza P, Peplies J, Glöckner FO. The silva ribosomal rna gene database project: improved data processing and webbased tools. Nucleic Acids Res. 2012; 41(D1):590–6.
Yilmaz P, Parfrey LW, Yarza P, Gerken J, Pruesse E, Quast C, Schweer T, Peplies J, Ludwig W, Glöckner FO. The silva and “allspecies living tree project (ltp)” taxonomic frameworks. Nucleic Acids Res. 2013; 42(D1):643–8.
Cover T, Hart P. Nearest neighbor pattern classification. IEEE Trans Inf Theory. 1967; 13(1):21–7.
Cortes C, Vapnik V. Supportvector networks. Mach Learn. 1995; 20(3):273–97.
Breiman L. Random forests. Mach Learn. 2001; 45(1):5–32.
McCulloch WS, Pitts W. A logical calculus of the ideas immanent in nervous activity. Bull Math Biophys. 1943; 5(4):115–33.
Acknowledgments
The authors thank Mr. Patrick Tan and Dr. Xiuli Ma for proofreading. We also thank the reviewers for their helpful comments.
About this supplement
This article has been published as part of Human Genomics Volume 13 Supplement 1, 2019: Selected articles from the IEEE BIBM International Conference on Bioinformatics & Biomedicine (BIBM) 2018: human genomics. The full contents of the supplement are available online at https://humgenomics.biomedcentral.com/articles/supplements/volume13supplement1.
Funding
The work was partially supported by NSF DBI1565137 and NIH R01GM115833.
Availability of data and materials
The datasets used and/or analyzed during the current study are available from the corresponding author on reasonable request.
Author information
Contributions
All authors materially participated in the study and manuscript preparation. GY and JJ participated in the design of the study and the implementation of the model. GY and CJ performed experiments and analysis. WW designed the study and revised the manuscript. All authors approved the final article.
Ethics declarations
Publisher’s Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Ethics approval and consent to participate
Not applicable.
Consent for publication
Not applicable.
Competing interests
The authors declare that they have no competing interests.
Rights and permissions
Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.
About this article
Cite this article
Zhou, G., Jiang, JY., Ju, C.JT. et al. Prediction of microbial communities for urban metagenomics using neural network approach. Hum Genomics 13 (Suppl 1), 47 (2019). https://doi.org/10.1186/s4024601902244
DOI: https://doi.org/10.1186/s4024601902244