Prediction of microbial communities for urban metagenomics using neural network approach

Background Microbes are greatly associated with human health and disease, especially in densely populated cities. It is essential to understand the microbial ecosystem in an urban environment for cities to monitor the transmission of infectious diseases and detect potentially urgent threats. To achieve this goal, the DNA sample collection and analysis have been conducted at subway stations in major cities. However, city-scale sampling with the fine-grained geo-spatial resolution is expensive and laborious. In this paper, we introduce MetaMLAnn, a neural network based approach to infer microbial communities at unsampled locations given information reflecting different factors, including subway line networks, sampling material types, and microbial composition patterns. Results We evaluate the effectiveness of MetaMLAnn based on the public metagenomics dataset collected from multiple locations in the New York and Boston subway systems. The experimental results suggest that MetaMLAnn consistently performs better than other five conventional classifiers under different taxonomic ranks. At genus level, MetaMLAnn can achieve F1 scores of 0.63 and 0.72 on the New York and the Boston datasets, respectively. Conclusions By exploiting heterogeneous features, MetaMLAnn captures the hidden interactions between microbial compositions and the urban environment, which enables precise predictions of microbial communities at unmeasured locations.


Background
Metagenomics studies the genomic content obtained from a human body site or an environment with a goal of understanding microbial diversity. The microorganisms in our environment are greatly associated with human health and disease.
Human microbiome studies are already rich enough to uncover the microbial diversity within the human body [1]. Environmental metagenomics, though falling behind in the past years, has also become increasingly important due to the increasing awareness of its impacts on public health, especially in densely populated urban areas transportation system, which described microbial communities across multiple surface types. However, collecting, sequencing, and analyzing the metagenomics data at every station cost them a great amount of money and time. Given that, our study focuses on developing a model to automatically predict the microbial communities for unsampled locations.
It is challenging to predict the microbial communities for unsampled locations. First, the characteristics of microbial communities can vary enormously in a complicated urban system due to various factors like geographical topology and public transit network. Many recent works have investigated how network connectivity affects the similarity of microbiomes. For examples, Leung et al. [2] conducted a Mantel test of Hong Kong subway line (MTR), and found that closely connected MTR lines shared more similar microbial communities than pairs that are further apart (R = 0.47, P = 0.03), probably because of distance-dependent dispersal and transferring commuters. To further evaluate the assumption, we conduct a clustering analysis based on microbial community similarity at different locations. As shown in Fig. 1, different microbes are separated by geographical boundaries.
Second, the formation and transmission of microbial communities are also affected by the material type of surfaces where the samples are collected [10]. Lastly, within each community, the genetic properties of each individual Fig. 1 There are three groups of subway stations based on the hierarchical clustering of the microbial community abundance in each location. We set the number of clusters to be three and use the Pearson correlation as the distance metric. We observe that the East river is a clear boundary that separates the three districts: Manhattan (blue dots), Brooklyn (yellow marks), and the Roosevelt Island (one red dot at top right) microorganisms and the correlation between individual microorganisms also contribute to the complexity. Considering the mixed effects from various factors, a simple model for each station along the same subway line should not be enough.
To address these challenges, we formulate the prediction of microbial communities at unsampled locations as a multi-label classification (MLC) task. Based on a set of heterogeneous features extracted from the urban environment, we aim to predict the presence or absence of a list of microbes at a nearby location. For MLC, each location is considered as an instance and each label represents a microbe.
Since different class labels have to be predicted simultaneously [11], MLC is suitable for solving the microbes inference problem, with their dependencies exploited at the same time. These properties reflect the nature of microbial communities.
In the field of urban computing, statistical models like regression trees have been applied to do real-time air quality prediction. For example, in U-Air [12], the authors inferred the fine-grained air quality in a city by using a semi-supervised learning approach. The model was able to predict air quality at non-monitored stations based on the air quality data reported by existing monitor stations. The spatial classifier for their model was based on an artificial neural network (ANN). However, this model only estimated a single value (i.e. the air quality index) for each location, so it was also inadequate to address the MLC task we formulated.
In the field of metagenomics, several computational models, such as BioMiCo [13] and NMF [14] have been developed to infer microbial community structures. To estimate the composition of each sample given the abundance profile, BioMiCo uses the supervised Bayesian model while NMF leverages the matrix factorization.
Nevertheless, these works cannot directly infer the microbial community for unsampled locations in the urban environment due to their inability to incorporate spatial information.
All the models mentioned above either cannot address the complicated environmental conditions or handle the intricate relationships between microbial compositions and the urban environment. In our recent work [15], we propose MetaMLAnn (Metagenomic Multi Label Artifical neural network), a neural network based and supervised learning model to predict the microbial community for city-scale metagenomics. MetaMLAnn is built on the widely-used feed-forward neural network model. But unlike the conventional feed-forward neural network model that predicts each label individually, it leverages an extra shared structure to capture the dependencies among different labels (microbes). To begin with, we train MetaMLAnn using a state-of-the-art network embedding technique to integrate features constructed from different data sources. Next, we leverage manifold regularization to extend our model. Our model is robust to the sparse samples with limited labeled data by incorporating the domain knowledge. To further improve our model, we also introduce an ensemble model, MetaMLAnn+, which can outperform each individual model by leveraging the diversified information from MetaMLAnn and different classification models with the strong signal. To our best knowledge, our work is the initial attempt to predict the microbial community for urban metagenomics by using the neural network model. In this paper, we extend our previous work by presenting detailed theoretical foundations and additional statistical analyses.
We summarize the contribution of this paper as follows: • This is the first series of in-depth study of microbial communities inference for unsampled locations. The inference task is formulated as a multi-label classification problem and a neural network learning technique (MetaMLAnn) is proposed to solve it. • We integrate the manifold regularization into our framework to guide the training of MetaMLAnn. We provide detailed theoretical foundations of showing how the domain knowledge of microbial evolutionary relationships helps. • Important features are extracted from multiple data sources, including city-scale transit features and surface material. An in-depth feature importance study has also been provided. • We evaluate MetaMLAnn on the New York and Boston subway metagenomic DNA sequencing data samples. We present detailed discussions about that MetaMLAnn performs better against five baseline methods under two datasets with different level of the taxonomy. We also analyze the importance of using the ensemble model.

Materials and methods
In this section, we present the detailed designed of our framework and describe the dataset used in this work.

Preliminaries and problem definition
We start with formalizing the mathematical notations of our model. Table 1 summarizes the symbols we use in this article.   The classification model is to learn an estimation function f : R k → 2 m that assigns a subset of labels to a given instance.
In our microbial community inference case, we extract feature vectors of n samples and represent them as X. The Microbe Index created from known locations is used as Y, where the order of microbes is preserved.
Problem statement. Suppose S = S 1 ∪ S 2 = {s 1 , s 2 , . . . , s n }, where S 1 and S 2 are sets of sampled and unsampled locations, respectively. Each sampled location s i ∈ S 1 is associated with a microbial distribution vector Y s i . Our goal is to predict Y s j of each s j ∈ S 2 , which is not sampled.
The framework of MetaMLAnn is shown in Fig. 2. It contains two major components and one model: the blue component for learning and the red component for inference, together with the MetaMLAnn model. In the following subsections, we introduce how MetaMLAnn is constructed, explain the regularization framework, discuss how feature extraction has been done to train MetaMLAnn, and present the ensemble model. Our general framework. Starting from the map, we simulate the inference task by splitting the samples into the training set (blue dots) and test set (red dots). We use Metaphlan2 [16] to obtain the microbial distribution profiles from the raw sequencing data. We first extract and integrate features for both training and test data. We also construct the evolutionary (phylogenetic) microbial similarity matrix, using the 16s rRNA of the bacteria as a regularizer. Then, we feed the training data's features and the similarity matrix into MetaMLAnn, which will perform microbial inference based on the features of test dataset. Our model can also be integrated with other classification models trained with same features as an ensemble model

Model: MetaMLAnn
We start with introducing the one hidden layer feedforward neural network model [17]. In the neural network model, there are p hidden units. The input layer x ∈ R k×1 is connected to hidden layer h ∈ R p×1 with weights W (1) ∈ R p×k and biases b (1) ∈ R p×1 . The hidden nodes are then connected to output nodes o ∈ R m×1 via weights W (2) ∈ R m×p and biases b (2) We denote f θ : x → o as the feed-forward neural network below: where, θ = W (1) , W (2) , b (1) , b (2) . f o and f h are activation functions in the output layer and the hidden layer respectively. Specifically, the function f θ (x) can be simplified by using vector representation as follows, where z (1) and z (2) are the vector representations of the weighted sums of inputs and hidden activation functions as follows: Given the cost function J(θ; x, y), we seek for a parameter vector θ which minimizes it. J(θ; x, y) measures the difference of given targets y and predictions of the network. Here, we choose Cross-Entropy [18] as our cost function: (4) where y i and o i are the ground truth and the predicted scores for label i, respectively. The sigmoid activation In MetaMLAnn, we extend the basic form feed-forward neural network by leveraging a heterogeneous architecture. Figure 3 depicts the detailed design of MetaMLAnn. Instead of using multiple hidden nodes of the same type in the hidden layer, we denote two different types of sub hidden layers which we call blocks (B). The first set of blocks are called individual blocks, B 1 to B m where m is the number of labels. The second type of block, B share , is a shared block that connects to all output neurons. Therefore, each output neuron connects to a corresponding individual block and a commonly shared block. All blocks contain one hidden layer with p neurons. Therefore, we replace the p units hidden layer with m + 1 blocks B. Each block consists of a hidden layer with p hidden neurons. For each i, the input layer x ∈ R k×1 is connected to each block B i ∈ R p×1 with weights W (1) i ∈ R p×k and biases b (1) i ∈ R p×1 . Then, the blocks B i and B share are connected to output node o i ∈ R via weights W We use stochastic gradient descent (SGD) [19] to efficiently optimize the cost function in Eq. 4. We randomly sample a location i and a unit from y i to compute B i for each individual block. We randomly sample a location i and a unit from all the classes among y 1 and y m to capture the global properties shared by all microbes for the shared block B share ,. The updating rules for different variables W and b can be derived by taking the derivatives of the above objective function and applying SGD. Training our model is efficient with SGD and back-propagation. More specifically, the time complexity of training our where t is the number of training epochs; n is the number of training examples; θ is the set of parameters in the model. To demonstrate the convergence of the proposed algorithm, we plot the values of the loss function over different optimization epochs in Fig. 4.   Fig. 3 The architecture of our proposed model MetaMLAnn. Starting from the left, the input layer with k nodes will receive k features respectively. Then, every input node will connect to all blocks, where each block contains p hidden units. The shared block (B share ) is connected to every output label, and every other individual blocks (B 1 . . . B m ) is connected to its corresponding output label. It is regularized by the evolutionary (phylogenetic) microbial similarity matrix x → o can be reformatted as follows: (1) ,

Manifold regularization
Neural networks tend to suffer from limited training examples. However, with only a few instances of each label, it is challenging to train MetaMLAnn. One potential solution to compensate for the data sparsity is to incorporate prior knowledge. Inspired by the general observation that evolutionary relationships are expected to be associated with patterns of community composition [20], we presume that the groups of microbes tend to co-occur in the same community when they are closely related to each other in the taxonomy.
The taxonomy here is referred as the identification, naming, and classification of organisms. We choose to use the evolutionary similarity as the domain knowledge, which is then fed into our regularizer. This is because taxonomy is often informed by the evolutionary relationships among different microbes (i.e., phylogenetic). To incorporate such microbial similarity, many regularization techniques can be used. We choose one of the most popular techniques, Graph Laplacian regularizer, to build our regularization frameworks [21][22][23][24][25].

Definition 4 (Graph Laplacian matrix L) Given a pairwise similarity matrix P ∈ R m×m , the Graph Laplacian matrix is defined as L
By minimizing the regularizer can preserve the local geometrical structure of a parameter vector β with length I. According to the definition, we observe that L has the following property that makes it suitable for regularization. Given the trace operator tr(·): From the above equations, the two parameters β i and β i are enforced to be similar, which can be incorporated into the cost function. The regularized cost function is defined as: where y i and o i are the ground truth label and the predicted score for sample i. The Graph Laplacian regularizer can represent any pairwise relationships between parameters. Here we discuss how to use the evolutionary similarities as priors and the corresponding Laplacian regularizers to incorporate structured domain knowledge. The Laplacian matrix L is firstly obtained by constructing the pairwise evolutionary similarity matrix (P) of different microbes.
Upon obtaining the predicted microbial distribution vector Y * i for given location i from the blocks, each vector is regularized by feeding Y * i into Eq. 5, where β refers to the predicted vector Y * i and β i , β j refers to microbe i and microbe j at this location, respectively.

Feature extraction
Here we describe how we extract the features from various data sources. These feature extraction methods can serve as a general pipeline for any urban-scale metagenomics study.
We define a feature vector as F : R k , where R is a k dimensional feature.
For this work, we extract the following features: subway station information, inter-station connections, and sampling surface materials. All features are concatenated into a feature vector for each sample and are used to train MetaMLAnn.
Subway station features (F s ): The first set of features that we extracted is the subway station information. We obtain the MTA and MBTA subway station data for New York and Boston. Each location is associated with the closest stations within a predefined radius, r = 0.01 miles. This radius value is an empirical parameter and can be tuned. The feature vector is then created based on the lines that pass through the current station. If there is no station information available in this range, we will find the 2 nearest stations and see if their subway line information matches. If they do match, we will align the subway line to this location. Otherwise, we will not assign any subway line information to this location. This process is specifically for dealing with sampling locations which are not stations, but in between two subway stations on the same line.
It has been shown that the number of riders is positively correlated with the amount of DNA collected in a station [9]. Therefore, we also retrieve the public MTA data with the turnstiles usage information at each station. The corresponding node vector is then weighted by the average number of riders within DNA collection date at each station.
For example, there are in total 25 different subway lines in New York, thus we create a binary vector of size 25, each element in the vector indicates whether this line will pass this location or not. For example, for station l, the subway line feature vector is defined as If v i = 1, then line i passes through this location. Finally, F s l will be weighted based on the busyness of station l.
Note that it is possible one location is associated with multiple lines or no lines. For the multiple lines' case, there will be more than one v i equal to 1. For the case of no line, we will simply remove such location since we focus on the inference at stations. Therefore, all locations will be associated with a subway line feature as a vector.
Interconnection features (F c ): We first describe how we construct the subway system network. Each subway station is denoted as a node and each interaction between two stations is drawn an edge. The weight of edge(i, j) is computed by the minimum number of stops from station i to station j. We also consider the case of express trains and if there exist express trains directly connecting two stations, we assign 1 as the weight to that edge.
Upon obtaining the station network, we apply the network embedding algorithm Node2Vec [26]. Each node is embedded into a low dimensional vector based on the generated network.
Surface materials features (F m ): The surface materials are strongly correlated with the microbial communities, as discussed in [10]. Therefore, we represent such information by using another set of vectors. Based on the type of materials it was collected from, a vector of length equal to the number of material types is constructed. For the New York dataset, each element represents one type of material: 'concrete' , 'metal' , 'plastic' , 'water' or 'wood' and the vectors are of length 5. As for the Boston dataset, the vector is of length 4 with four types of materials: 'glass' , 'polyester' , 'PVC' , and 'steel' .

Ensemble with hybrid prediction
To alleviate the lack of training data, in addition to the regularization, we also propose to construct an ensemble of MetaMLAnn with any other model that needs fewer training samples.
For each label i, let o i be the predicted score of MetaMLAnn. Given the score from the other model m as o m i , we conduct a linear hybrid prediction for ensemble as follows: where 0 ≤ α ≤ 1 is a parameter to decide the weights of two models. When α = 1 the prediction is MetaMLAnn, and when α = 0 the prediction is the model m.
We denote the ensemble approach as MetaMLAnn+.

Sample collection and data preprocessing
We apply our model on the New York and Boston datasets obtained from the MetaSUB Inter-City Challenges track of the 2017 CAMDA Contest. They both contain mass-transit metagenomic raw reads data, supplemented with sample descriptions.
The New York dataset contains 1572 samples, representing different sites. These samples were collected from open subway stations for all 24 subway lines of the NYC Metropolitan Transit Authority (MTA). At subway stations, samples were collected in triplicate, with one sample taken inside a train at the station and two samples from the station itself, as reported by [9]. DNA samples collected from each site were sequenced using Illumina platform, with a total of 10.4 billion paired-end DNA sequencing reads.
In addition, each sample is also associated with meta information, including the latitude and longitude showing where the sample was collected, and surface materials. All these information are indispensable for the enrichment of feature generation.
Similarly, there are 141 samples in the Boston dataset, which have been also collected from the local subway system, consisting of 5 lines (red, orange, blue, green, and silver) that extend from downtown Boston into the surrounding suburbs. As mentioned in [10], most samples are 16S rRNA gene amplification sequence data, and a subset of the samples are subjected to shotgun metagenomic sequencing. Each sample is also supplemented with additional information, which describes the date of collection, station information, and surface type. For the 16S rRNA samples, the corresponding abundances profiles are also provided.
For each sample in the New York dataset and samples subjected to shotgun metagenomic sequencing in the Boston dataset, we conduct the following preprocessing steps: 1) To be consistent with the processing procedure in [9] from which the New York data is collected, We use MetaPhlan2 [16] to perform microbial profiling. Each profile contains the relative abundances as a percentage from the kingdom level to the species level.
2) There are 48.3% of the reads that do not match to any known organism in the New York dataset, as described in [9]. Therefore, when we construct the microbial distribution vector, those unknown microbes are removed and the relative abundances of the remaining known microbes are recomputed.

Supplemental data sources
We use the New York subway station data and the Boston subway station data from the MTA and MBTA website respectively to construct the subway line features. They contain geographic locations, subway station names, and subway lines that pass each station. We also obtain the turnstile data of MTA and MBTA to count the busyness of all stations. The detailed descriptions can be found in Table 2.
To capture the underlying microbiota structure, we construct a pairwise similarity matrix to represent the evolutionary relationship between two species. We retrieve the 16S ribosomal RNA sequence for bacteria and archaea, 5S ribosomal RNA for eukaryotes, and the whole DNA sequences for viruses from the NCBI [27][28][29] and the Silva [30,31] database. We perform sequence alignments to compute the pairwise similarity within each kingdom. We normalize the similarity values to the range of 0 to 1 and we assign 0 to their similarity for cross-kingdom species pairs. Finally, we take the mean of all species' similarity scores under that level and aggregate them as the new score for each genus pairs (Eq. 9). In this way, we can obtain the similarity matrix between genus level.
Given two genus g a and g b as sets of species, the similarity score between the pair of genus can be computed as: where sp a and sp b are the species of g a and g b , respectively.

Results
To demonstrate the effectiveness of MetaMLAnn, we conduct comprehensive experiments by using both the New York and Boston datasets. In this section, we will discuss the experiment setup, evaluation metrics, baselines and results.

Experimental settings
After we conduct data processing, each sample is associated with an abundance vector. It is observed that many species are seriously underrepresented (i.e. appearing at only one location) for the abundance at all levels. We choose to focus on the genuslevel abundance to alleviate the issues including underrepresented microbes, missing species-level taxonomy, and very similar microbial species.
Together with the number of features obtained, the detailed microbial composition of both dataset can be found in Table 3.
We use k-fold cross-validation for all experiments. Setting the value of k to be three, we randomly and equally split the data into three non-overlapping subsets. Each subset has a chance to train the model and to test the model.
The average performance of each method from these three folds is reported. In addition, we also justify the effectiveness of our feature construction by comparing the performance of individual features and their combination with the same classifier.

Evaluation metrics
We assess the performance of our classifier in several ways. While accuracy is the simplest and the most straightforward measure, it is biased toward classes with a larger sample size. Instead, we report precision, recall, and F1 score as our evaluation metrics. These metrics are defined as: F1 score = 2 * precision * recall precision + recall (12) where given m labels, tp i , tn i , fp i and fn i represents true positives, true negatives, false positives and false negatives for i t h label respectively. Finally, we also use ranking loss, which averages over n samples the number of label pairs that are incorrectly ordered, i.e. true labels have a lower score (f ) than false labels, weighted by the inverse number of false and true labels, as shown below:
• Inverse Distance Weighting (IDW): Inverse distance weighting is a deterministic, nonlinear interpolation technique that uses a weighted average of the attribute values from nearby sample points to estimate the magnitude of that attribute at non-sampled locations. The weight a particular point is assigned depends upon the sampled point's distance to the non-sampled location.

Performance of MetaMLAnn
Using the combined features, Tables 4 and 5 show the performance of MetaMLAnn and other aforementioned baselines on New York and Boston datasets, respectively. As discussed in experimental settings, we focus on the genus level inference. We observe that MetaMLAnn and MetaMLAnn+, outperform all baselines on F1 score and ranking loss.
In the New York dataset, MetaMLAnn and MetaMLAnn+ perform the best in terms of F1 score and ranking loss, though the precision and recall of MetaMLAnn rank second among other baselines. IDW achieves the highest recall but its precision is the lowest, which offsets its high recall. As an unsupervised learning model using the inverse distance weighting of surrounding microbial distribution vectors, IDW tends to predict more microbes than others. However, most of them are false positives. On the other hand, SVM shows a slightly higher precision than all methods but results in a poor recall. This implies that SVM based methods tend to be conservative in predicting the "presence" of species, which do not meet our expectation. MetaMLAnn tends to have the best balance of both precision and recall, which results in the best overall F1 score. In addition to MetaMLAnn, we also report the result of the ensemble model with IDW where we use α = 0.7 as MetaMLAnn+ after parameter tuning.
As can be seen from the table, the F1 score can be further boosted by more than 1%, which is better than either of the single model.  As for the Boston dataset, our model outperforms all the baseline models in terms of precision, F1 score and ranking loss. Even though Random Forest achieves a bit higher recall than our model, its precision suffers from the issue of predicting too many microbes. However, after we leverage the Random Forest model as part of our ensemble model with the same parameter as New York, α = 0.7, MetaMLAnn+ achieves the best score in all metrics against other baselines.

Feature analysis
As feature extraction is crucial for inferring microbial communities in a complicated urban system with heterogeneous data sources, we first demonstrate the effectiveness of our feature construction. Recall that we have three groups of features: subway station features (F s ), interconnection features (F c ), and surface material features (F m ). As shown in Table 6, a random forest model is used to compare the performance of individual features and their combinations. Overall, the complete features set have the best performance in precision, F1 score, and ranking loss. Note that we intentionally choose to use  Random Forest instead of our model, MetaMLAnn, to conduct experiments. This is to demonstrate that our feature extraction techniques are beneficial in general to the microbial community inference problem without favoring our model.

Analysis on different taxonomic levels
To further demonstrate the generality of our model, we compare the performance of MetaMLAnn with other aforementioned baselines under different taxonomic levels from phylum to species. We ignore Kingdom level due to few numbers of classes.
As seen in Fig. 5, with the level of taxonomy becoming more specific, the performances of all methods decrease due to the increase of complexity. Against all competitors, MetaMLAnn and MetaMLAnn+ IDW constantly achieve the highest F1 score and the lowest ranking loss across all taxonomic levels. The advantage of MetaMLAnn becomes more obvious with a finer granularity of taxonomic level.

Parameter selection of the ensemble model
Here, we analyze how the ensemble weight α affects the prediction performance. Fig. 6 The F1 score and ranking loss performance on the New York dataset at genus level for the ensemble model MetaMLAnn+ that aggregates MetaMLAnn and IDW over different weights α Figure 6 shows the F1 score and the ranking loss over different ensemble weights α of MetaMLAnn and IDW under the New York dataset. On the left vertical axis, we have F1 score (the larger the better) and on the right vertical axis, we have the ranking loss (the smaller the better). Recall that our ensemble model is defined in Eq. 8, where alpha closer to 1 means more weight on MetaMLAnn and closer to 0 means more weight on the additional model. The results suggest that with a good mixture of two models (i.e. α = 0.7 for this case), the ensemble model can achieve the best for both F1 score and ranking loss. This is because the additional model (IDW) contains orthogonal information, which can compensate for the missing information from the training of MetaMLAnn. Without the ensemble model, MetaMLAnn tends to be conservative due to the sparsity of dataset. On the contrary, IDW tends to predict more microbes, which boosts the overall performance. Table 7 shows the results of the ablation study of the shared block B shared and individual blocks B i , where i = 1 . . . m. B shared + B i . In the New York dataset, removing the shared block slightly decrease the F1 score and increase the loss while using only the shared block will downgrade the F1 score by around 3% and double the ranking loss. In the Boston dataset, dropping any of the two units largely impair the performance of MetaMLAnn. These results reflect the importance of having both the individual and shared hidden blocks in our model for predicting microbial communities.

Conclusions
Profiling city-scale microbial diversity is important for urban long-term disease surveillance and health manage- Higher precision, recall, F1 score, and lower ranking loss indicate better performance. Bold entries indicate best performance among different methods ment. The great efforts to collect DNA samples in densely populated cities still cannot meet the challenge to obtain the metagenomic profiles at fine-grained geo-spatial resolutions. To address this issue, we first define the task of inferring microbial community for city-scale metagenomics as a multi-label classification problem. We then propose MetaMLAnn, a neural network based approach to infer microbial communities of unsampled locations given the information from multiple data sources in the urban environment, including subway line information, sampling materials, and microbial compositions in sparsely sampled locations. The model captures the interactions between microbes and the urban environment by a shared hidden layer, and fuses the heterogeneous urban transit information with embedding for feature extraction. Additionally, by incorporating signals from other strong models, the ensemble technique MetaMLAnn+ further improves the performance of the model. Extensive experiments demonstrate the effectiveness of our approach. In this work, we mainly focus on New York and Boston subway stations due to the limitation of data availability. In the future, with more cities being sampled, we plan to extend our model to the regional scale to solve the inter-city metagenomic inference problem.