Microbial diversity in complex microbial communities can be assessed based on the lengths of one or more of the nine variable regions of 16S rRNA (Figure 1). The PCR amplicons can be analysed using other techniques, including terminal restriction fragment length polymorphism (T-RFLP) analysis and amplicon length heterogeneity (LH) analysis [24–29]. The fragments are separated and analysed using a capillary electrophoresis-based genetic analyser. The data generated can be subjected to bioinformatics and statistical analysis to increase their reliability. The resulting output can provide a community profile and can putatively identify individual organisms at the strain, species or genus level. Recent developments in high-throughput sequencing enable rapid sequencing of the amplicons (bacterial and fungal, with the use of appropriate primers), which is likely to lead to a rapid understanding of the community structure of any complex niche.
T-RFLP analysis
This technique relies on the inherent variation of the sequence of a molecular marker [30] and is the most widely used method in identifying phylogenetic specificity in bacterial communities [31]. T-RFLP analysis includes PCR amplification, using one primer that is fluorescently end-labelled, restriction enzyme digestion of the amplicon and detection of the terminal restriction fragment by an automated DNA sequencer or capillary electrophoresis [31]. The resulting output consists of a microbial profile where each detected length is that of specific fragments from the digested PCR product. Each length represents one or more bacteria that have the same terminal restriction fragment length. T-RFLP profiles can be used for community differentiation, identification of specific organisms in populations and comparison of the relative phylotype richness and community structure [30].
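The relationship between sequence variation and the detected fragment length can be illustrated with a small in silico digest. The following is a minimal sketch, not the procedure of any cited study: the function names are hypothetical, the three sequences are invented toy strings rather than real 16S rRNA data, and the amplicon is assumed to be labelled at its 5' end. HpaII (recognition site CCGG, cutting after the first C) is used as the example enzyme.

```python
def terminal_fragment_length(amplicon, site, cut_offset):
    """Length of the labelled 5' terminal fragment after digestion.

    `site` is the recognition sequence and `cut_offset` is the number of
    bases of the site left on the 5' side of the cut. An amplicon without
    the site is reported at its full (undigested) length.
    """
    pos = amplicon.upper().find(site.upper())
    if pos == -1:
        return len(amplicon)
    return pos + cut_offset


def trflp_profile(amplicons, site, cut_offset):
    """Group organisms by the terminal fragment length they would produce."""
    profile = {}
    for name, seq in amplicons.items():
        length = terminal_fragment_length(seq, site, cut_offset)
        profile.setdefault(length, []).append(name)
    return profile


# Toy sequences, invented purely for illustration (not real 16S rRNA data).
toy_amplicons = {
    "organism_A": "ACGTTGCCGGATCA",
    "organism_B": "TTGACAGCCGGTTA",
    "organism_C": "ACGTAGCCGGTTTT",
}

# HpaII recognises CCGG and cuts after the first C, so cut_offset is 1.
profile = trflp_profile(toy_amplicons, site="CCGG", cut_offset=1)
print(profile)
```

In this toy case organism_A and organism_C share a terminal fragment length, which is the situation described above in which one detected length represents more than one bacterium.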
This method has been successful in the differentiation of bacterial communities present in many environments, including marine samples, soil samples and sputum samples from CF patients [30–33]. Rogers et al. [32] analysed T-RFLP amplicons of CF patient sputa and bronchoscopy samples using a computer program called MapSort (Wisconsin Package version 10.3; Accelrys, Inc., San Diego, CA, USA), which includes a database of restriction patterns and fragment lengths generated for known 16S rRNA bacterial sequences. The analysis suggested the presence of P. aeruginosa, B. cenocepacia, S. aureus, and H. influenzae in the CF samples [32].
The T-RFLP method is fast and data can be easily replicated for statistical analysis. The major disadvantage of T-RFLP is that many bacteria produce similar fragment sizes, and thus not all peaks in the profiles are species specific. Some peaks may even represent more than one genus [30, 32]. There are also inherent problems in using restriction enzymes, such as incomplete digestion, which can produce DNA fragments that do not correlate with the correct bacterium [33]. Therefore, to achieve better identification of the organism, further analysis -- such as sequencing of the 16S rRNA gene -- must be performed.
LH
LH techniques analyse microbial populations based on the lengths of generated PCR products produced from the hypervariable regions of the 16S rRNA [33–38]. Profiles from one region are produced for the microbial community. These profiles represent the minimum diversity of bacteria present within the eubacterial community. The profiles contain peaks at specific amplicon lengths (Figure 2) representative of the number of nucleotides in the hypervariable region between the conserved regions. The peak heights are representative of the relative abundance of amplicons of that length present in the community. To identify individual bacterial organisms in the community, a database is needed. This can be generated by in silico analysis of known 16S rRNA sequences and the expected amplicon fragment length with a particular primer set that would be produced during an LH-PCR. The fragment lengths in the sample profile are compared against the database to identify the putative organisms. A profile resulting from this analysis suggests the presence of certain organisms and the definitive absence of others. In cases where the amplicon length is not species specific, it is often genus specific [29]. LH profiles can also be used to compare community profiles from multiple samples. Previous research has shown that the compositions of bacterial communities are highly specific to the environment in which they are found, and these differences are represented in LH profiles [33, 35]. Changes in the community's niche can drastically influence bacteria and thus add specificity to the profile of a bacterial community, showing that the overall bacterial community has many unique features from sample to sample [33, 35].
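The comparison of a sample profile against a database of expected amplicon lengths can be sketched as a simple lookup. This is an illustrative sketch only: the function names, the taxon labels and every length in the toy database are hypothetical, and the `tolerance` parameter (allowed sizing error in base pairs) is an assumption added here rather than something specified in the text.

```python
def match_peaks(observed_lengths, database, tolerance=0):
    """Map each observed fragment length to the taxa whose expected
    amplicon length lies within `tolerance` base pairs of it."""
    hits = {}
    for length in observed_lengths:
        hits[length] = sorted(
            taxon
            for taxon, expected in database.items()
            if abs(expected - length) <= tolerance
        )
    return hits


def absent_taxa(observed_lengths, database, tolerance=0):
    """Taxa whose expected length matches none of the observed peaks."""
    return sorted(
        taxon
        for taxon, expected in database.items()
        if all(abs(expected - length) > tolerance for length in observed_lengths)
    )


# Hypothetical database: taxon label -> expected amplicon length (bp).
toy_database = {"taxon_A": 312, "taxon_B": 312, "taxon_C": 330, "taxon_D": 345}
observed = [312, 345]

putative = match_peaks(observed, toy_database)
excluded = absent_taxa(observed, toy_database)
print(putative)
print(excluded)
```

The 312 bp peak returns two candidates, showing why a length alone gives only a putative identification, whereas taxon_C can be excluded because no peak of its expected length is present.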
The main advantages of LH-PCR are that it rapidly surveys relative gene frequencies within complex mixtures of DNA, is reproducible, requires small sample sizes and can be performed simultaneously with many samples [29]. The LH profiles provide information about the members of the entire bacterial community (not just specific isolates) and their relative abundance. These data allow one to make taxonomic inferences and sample comparisons [29]. A major disadvantage of this technique is that one amplicon in the profile can represent more than one bacterium; therefore, identification at the species level cannot be guaranteed. This is also true of many length-based molecular techniques, such as T-RFLP; however, the fragments are discrete 'units' of information that can be used for comparative analyses [30]. Analysis of different combinations of the 16S rRNA variable regions will increase the power of microbial detection and sample discrimination and lead to more definitive identification.
LH was the first technique used in several ecological research projects to compare microbial communities between samples and to identify members within one community [33, 35, 38]. Fourteen CF sputum samples were analysed using LH-PCR for the presence of eubacteria [32]. The raw data generated from the genetic analyser were first processed with corresponding software, such as GeneimageIR v.3.56 (Scanalytics, Fairfax, VA, USA) [32] or GeneMapper (Applied Biosystems) [35], to produce amplicon fragment lengths in base pairs. To identify presumptively the bacteria present in the CF samples, the fragment lengths were compared with a database of theoretical fragment lengths constructed using GAP (Wisconsin Package version 10.3) [32]. For example, P. aeruginosa was identified presumptively in all 14 CF samples, five of which were confirmed by cloning and sequencing [32]. In another study, LH analysis presumptively identified P. aeruginosa in 19 south Florida CF patients, all of whom were clinically diagnosed with this pathogen [39]. The LH fragment representing B. cenocepacia was not found in any of the patients, and clinical diagnosis and sequencing results confirmed the absence of this organism [39].
To assist in the identification of individual microbial organisms in a community, we developed a software package called AmpliQué, to be used in conjunction with LH-PCR [39]. For all the bacterial and archaeal 16S rRNA sequences available from the Ribosomal Database Project (RDP) (http://rdp.cme.msu.edu/), AmpliQué computes the length of the amplicon for any specified (degenerate) primer sequence pair. For a given sample on which PCR has been performed with a fixed pair of primers, and given the lengths of the PCR products, AmpliQué infers the bacterial and archaeal organisms present in the sample. AmpliQué has recently been generalised to handle lengths of PCR products from more than one pair of primers, enhancing the power of this in silico identification method. AmpliQué was used to determine the presumptive identity of organisms present in 19 south Florida CF patients based on the fragment lengths produced by LH-PCR. Oral-associated bacteria, such as Lactobacillus mali, Capnocytophaga gingivalis, Porphyromonas spp. and Prevotella spp., and the known CF-associated lung pathogens P. aeruginosa, H. influenzae, B. cenocepacia, Achromobacter xylosoxidans, Serratia marcescens, S. maltophilia and Sarcina ventriculi, were presumptively identified [39].
To expand the use of LH-PCR in clinical settings, Bjerketorp et al. [40] combined it with a lab-on-a-chip (LOC) system, which is used for sizing and quantifying DNA, to analyse samples containing mixtures of known human gut microbes. An Agilent 2100 Bioanalyzer (Agilent Technologies, Santa Clara, CA, USA), a bench-top instrument that uses microfluidics-based separation, was used to detect the LH fragments. This modified method allows LH-PCR to be more affordable and faster, and thus more convenient and suitable for clinical and diagnostic situations [40]. To test this system, samples containing mixtures of human gut microbes and known human gut bacterial isolates were analysed using both LOC and a capillary electrophoresis-based genetic analyser. The latter method had a higher resolution and was thus able to resolve more peaks or fragments from one another. It is important to separate PCR fragments clearly, as LH identification is based on the lengths of PCR products. Single base pair length differences are known to occur between species and even at the genus level. The level of resolution for the LOC LH-PCR technique is a weakness but the technique is rapid, economical and easier to analyse than the traditional system. Future modifications may improve the resolution, making it more useful for clinical diagnosis [40].
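The in silico step of computing an amplicon length for a degenerate primer pair can be sketched as follows. This is not AmpliQué's implementation, which is not described here; it is a minimal illustration that assumes exact, degeneracy-aware matching with no mismatches allowed, and the function names, template and primers are all hypothetical. Degenerate positions follow the standard IUPAC nucleotide codes.

```python
# IUPAC nucleotide codes: each symbol maps to the bases it can represent.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
}
COMPLEMENT = str.maketrans("ACGTRYSWKMBDHVN", "TGCAYRSWMKVHDBN")


def reverse_complement(seq):
    """Reverse complement, preserving IUPAC degeneracy."""
    return seq.upper().translate(COMPLEMENT)[::-1]


def find_primer(template, primer, start=0):
    """Index of the first degeneracy-aware match at or after `start`, else -1."""
    template, primer = template.upper(), primer.upper()
    for i in range(start, len(template) - len(primer) + 1):
        if all(template[i + j] in IUPAC[p] for j, p in enumerate(primer)):
            return i
    return -1


def amplicon_length(template, forward, reverse):
    """Predicted amplicon length (primers included), or None if either
    primer has no binding site on the template."""
    fwd_start = find_primer(template, forward)
    if fwd_start == -1:
        return None
    # The reverse primer anneals to the opposite strand, so search the
    # template for its reverse complement downstream of the forward primer.
    rev_site = reverse_complement(reverse)
    rev_start = find_primer(template, rev_site, fwd_start + len(forward))
    if rev_start == -1:
        return None
    return rev_start + len(rev_site) - fwd_start


# Toy template and primers, invented for illustration only.
toy_template = "GGACGTACTTTTTTTTGGATTCAA"
print(amplicon_length(toy_template, "ACGYAC", "GAATCC"))
```

Running the same function over every sequence in a reference collection would give the table of expected lengths against which observed LH fragment lengths are compared.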
LH-related bioinformatics
Regardless of whether LH is being used to compare communities or to identify members of a community, statistics and bioinformatics must be used to derive any information from the data produced by the technique. The first aspect of the LH-PCR system is that it profiles a community based on the patterns of lengths of amplified products (amplicons) and allows one community to be distinguished from other communities, without necessarily identifying individual species or genera.
Microbial diversity and community dynamics were first studied by computing measures such as species richness and dominance or evenness indices [41]. Theoretical models of microbial diversity based on log-normal distributions have also been used [42]. LH and T-RFLP data derived from soil communities have been clustered using the unweighted pair-group method using arithmetic averages (UPGMA) algorithm, based on distance metrics such as the Jaccard, Hellinger or Pearson distances [43–45]. Such unsupervised methods have been used to support claims that certain relationships between communities can be discerned, that the groupings are natural and that outliers can be identified.
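A minimal sketch of this kind of unsupervised clustering is given below, using the Hellinger distance between relative-abundance profiles and a plain UPGMA agglomeration. It is an illustration under stated assumptions rather than the analysis of the cited studies: the sample names and peak heights are invented, the function names are hypothetical, and all profiles are assumed to be aligned so that each position refers to the same amplicon length.

```python
import math
from itertools import combinations


def relative_abundance(peak_heights):
    """Scale peak heights so that they sum to 1."""
    total = sum(peak_heights)
    return [h / total for h in peak_heights]


def hellinger(p, q):
    """Hellinger distance between two relative-abundance vectors (0 to 1)."""
    return math.sqrt(
        sum((math.sqrt(a) - math.sqrt(b)) ** 2 for a, b in zip(p, q)) / 2
    )


def average_distance(c1, c2, dist):
    """Mean pairwise distance between the members of two clusters."""
    total = sum(dist[frozenset((a, b))] for a in c1 for b in c2)
    return total / (len(c1) * len(c2))


def upgma(names, dist):
    """Repeatedly merge the two clusters with the smallest average distance.

    `dist` maps frozenset({a, b}) to the distance between samples a and b.
    Returns the merge history as (cluster_1, cluster_2, distance) tuples.
    """
    clusters = [(name,) for name in names]
    merges = []
    while len(clusters) > 1:
        c1, c2 = min(
            combinations(clusters, 2),
            key=lambda pair: average_distance(pair[0], pair[1], dist),
        )
        merges.append((c1, c2, average_distance(c1, c2, dist)))
        clusters = [c for c in clusters if c not in (c1, c2)] + [c1 + c2]
    return merges


# Invented peak heights at three amplicon lengths, for illustration only.
toy_profiles = {
    "soil_1": [10, 30, 60],
    "soil_2": [12, 28, 60],
    "soil_3": [70, 20, 10],
}
rel = {name: relative_abundance(p) for name, p in toy_profiles.items()}
dist = {
    frozenset((a, b)): hellinger(rel[a], rel[b]) for a, b in combinations(rel, 2)
}
merges = upgma(list(rel), dist)
for c1, c2, d in merges:
    print(c1, c2, round(d, 3))
```

The two similar profiles are joined first and the dissimilar one is attached last, which is how such a dendrogram flags natural groupings and outliers.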
The statistical analysis of LH profiles is used to differentiate between two or more microbial communities. Without rigorous statistical analysis, it is impossible to differentiate between significant differences and random events. The identification of individual organisms in the community will be discussed later.
Statistical analysis based on ecological indices
Many statistical techniques have been applied to ecological indices that measure the diversity of microbial communities. A number of diversity indices have been used with microbial communities [41]. Traditional indices include the richness (S), the Shannon information index (H) and the evenness (E) derived from it, and are defined as follows in Equations (1), (2) and (3), respectively:
$$S = \text{number of peaks in the LH profile} \qquad (1)$$
$$H = -\sum_{i=1}^{S} p_i \ln p_i \qquad (2)$$
where $p_i$ is the ratio of individual peak height to the sum total of the heights of all the peaks in the LH profile.
$$E = \frac{H}{H_{\max}} \qquad (3)$$
where $H_{\max} = \ln(S)$. Note that the traditional diversity indices are based on the clear definition of an ecological description of an individual species. Here, the definitions have been modified for presumptive identification of LH profiles by replacing the definition of an individual species with that of individual peaks in LH profiles.
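The three indices can be computed directly from the peak heights of a single LH profile. The sketch below uses the standard Shannon form, H = -sum(p_i ln p_i), with richness taken as the number of peaks and evenness as H / ln(S). The function name and the example heights are hypothetical, and returning an evenness of 0 for a single-peak profile (where ln(S) = 0) is a convention assumed here.

```python
import math


def diversity_indices(peak_heights):
    """Return (richness S, Shannon index H, evenness E) for one LH profile."""
    heights = [h for h in peak_heights if h > 0]  # ignore empty positions
    total = sum(heights)
    richness = len(heights)
    proportions = [h / total for h in heights]
    shannon = -sum(p * math.log(p) for p in proportions)
    # E = H / Hmax with Hmax = ln(S); undefined for S = 1, so report 0.0.
    evenness = shannon / math.log(richness) if richness > 1 else 0.0
    return richness, shannon, evenness


# Invented peak heights, for illustration only.
even_profile = [25, 25, 25, 25]
skewed_profile = [97, 1, 1, 1]

print(diversity_indices(even_profile))
print(diversity_indices(skewed_profile))
```

Both toy profiles have the same richness, but the profile dominated by one peak has a much lower Shannon index and evenness, which is the distinction these indices are intended to capture.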
Once appropriate diversity indices are chosen, multivariate statistical techniques, such as analysis of variance (ANOVA), can be applied to compare microbial communities.
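As an illustration of such a comparison, the F statistic of a one-way ANOVA can be computed from a diversity index measured on replicate profiles of each community. This is a minimal sketch: the function name is hypothetical, the index values are invented, and the conversion of F to a p-value (which requires the F distribution with k - 1 and n - k degrees of freedom) is left out.

```python
def one_way_anova_f(groups):
    """F statistic for a one-way ANOVA over a list of groups of values."""
    k = len(groups)
    n = sum(len(g) for g in groups)
    means = [sum(g) / len(g) for g in groups]
    grand_mean = sum(sum(g) for g in groups) / n
    # Variation of group means around the grand mean.
    ss_between = sum(len(g) * (m - grand_mean) ** 2 for g, m in zip(groups, means))
    # Variation of individual values around their own group mean.
    ss_within = sum((x - m) ** 2 for g, m in zip(groups, means) for x in g)
    return (ss_between / (k - 1)) / (ss_within / (n - k))


# Invented Shannon index values for replicate profiles of two communities.
community_1 = [1.0, 2.0, 3.0]
community_2 = [2.0, 3.0, 4.0]
f_stat = one_way_anova_f([community_1, community_2])
print(f_stat)
```

A large F relative to the critical value for the chosen significance level would indicate that the between-community difference in the index exceeds what replicate-to-replicate variation alone would explain.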
Statistical analysis based on abundance models
Even with the availability of the numerous diversity indices, analysing microbial diversity and communities merely using ecological indices has its short-comings [46]. Although each index represents an attempt to distil diversity information into a single quantity, each one ends up measuring specific aspects of diversity. Diversity indices vary in their sensitivity to different abundance classes. Species abundance models are considered to be more sophisticated tools to investigate diversity because they examine the distribution of abundances in a population.
Statistical models used for species abundance of microbial communities include log series distribution, log-normal distribution [47], the broken stick model and the overlapping niche model [41]. The most frequently used statistical model for species abundance distributions is the log-normal distribution. In log-normal communities, the null model for bacterial species abundance is a log-normal distribution defined as follows:
where $S(R)$ is the number of species that contain $R$ individuals, $S_T$ is the total number of species in the community, and $\sigma^2$ is the variance of the distribution. The parameters $S_T$ and $\sigma^2$ can be estimated from a sample of measured species abundance data by using statistical techniques such as the method of moments or least squares analysis [47].
Supervised analysis of LH profiles
In addition to the unsupervised methods introduced above, computational tools based on supervised classification methods from machine learning have also been used for analyses based on microbial diversity [38]. These methods are used to 'learn' the differences between the diversities in the microbial communities of two sets of samples. Two well-known supervised classification tools are support vector machines (SVM) and the k-nearest neighbour method (KNN). These tools have the ability to 'learn' to classify samples after being 'trained' with 'features' from a collection of known, labelled samples. Both are computational machine-learning tools that treat the data as points or vectors in Euclidean space. These vectors are usually referred to as 'feature vectors' because their coordinates correspond to quantified 'features' of the data. These features are usually obtained after a feature extraction process. A new sample is likewise represented by a feature vector. In both methods, classification of the new sample is based on the location of its feature vector in relation to the location of the labelled feature vectors in the feature space [48–51]. SVMs have been shown to perform well in a variety of research areas, including pattern recognition [52], face recognition [53], classifications based on microarray gene expression data [54–58], detecting remote protein homologies [59] and classifying G-protein-coupled receptors [60]. In particular, SVMs are well suited for dealing with high-dimensional data [48, 51]. KNN classifiers have been successfully used in applications such as classification of handwritten digits and satellite image scenes [50].
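The KNN idea can be made concrete with a few lines of code. The sketch below is illustrative only and is not the classifier used in the cited work: the function name, the labels and the feature vectors (relative peak abundances at three amplicon lengths) are invented, Euclidean distance is assumed, and ties in the vote are resolved arbitrarily by the order of the nearest neighbours.

```python
import math
from collections import Counter


def knn_classify(training_set, query, k=3):
    """Label `query` by majority vote among its k nearest labelled vectors.

    `training_set` is a list of (feature_vector, label) pairs.
    """
    nearest = sorted(training_set, key=lambda item: math.dist(item[0], query))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


# Invented feature vectors: relative abundance at three amplicon lengths.
toy_training = [
    ([0.10, 0.30, 0.60], "soil_type_1"),
    ([0.15, 0.25, 0.60], "soil_type_1"),
    ([0.12, 0.33, 0.55], "soil_type_1"),
    ([0.70, 0.20, 0.10], "soil_type_2"),
    ([0.65, 0.25, 0.10], "soil_type_2"),
    ([0.60, 0.30, 0.10], "soil_type_2"),
]

print(knn_classify(toy_training, [0.20, 0.30, 0.50]))
print(knn_classify(toy_training, [0.68, 0.22, 0.10]))
```

An SVM differs in that it learns a separating boundary from the training vectors during the training phase, rather than consulting the stored samples at classification time.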
Computational machine learning classifiers based on SVMs and KNNs have been used to identify and compare microbial communities from different types of soil samples [38]. After a learning phase, the resulting classifiers were able to classify with high accuracy. Detailed studies using these tools revealed the limitations of the data and the minimum amount of information from LH assays that was necessary to perform reliable classification for microbial communities [38].
Sequencing
Even with the combined use of bioinformatics tools and LH, certain members of a community may not be identified. Sequencing of the 16S rRNA gene is imperative to identify an organism with near certainty. The most common method of sequencing is the Sanger method, developed in 1977 [61]. Once the sequences are generated they are compared with known 16S rRNA sequences (stored in the Ribosomal Database Project II [62], Greengenes [63] and GenBank [64]) to identify organisms in any samples, including the CF lung [10, 65]. Sequencing of the RFLP-PCR products from the total metagenomic DNA from BAL samples of CF children identified known CF pathogens, such as P. aeruginosa, S. aureus, S. maltophilia and H. influenzae [65]. Potentially novel pathogens from the genera Lysobacter, Coxiellaceae and Rickettsiales were also found [65].
Another study, which involved sequencing of the 16S rRNA gene, has shown that CF sputum contains Streptococcus mitis, S. pneumoniae, Prevotella melaninogenica, Veillonella spp., Granulicatella para-adiacens and Exiguobacterium spp., besides the normal CF pathogens, such as P. aeruginosa. In this study, clones were screened using LH-PCR to ensure that plasmids containing a wide array of 16S rRNA genes were sequenced.
Although sequencing technologies are able to identify bacteria in a sample more accurately, the cost of reagents and labour may be too high for widespread clinical use [66]. For some bacteria, partial sequencing of the gene would lead to identification; for others, the entire gene would need to be analysed. Sequencing isolates can be performed in a timely manner and the data produced are fairly easy to analyse, especially with the use of commercial sequencing kits [67]; however, sequencing cannot differentiate between some species (eg Mycobacterium chelonae and M. abscessus are 99 per cent similar) [66]. Bacterial identification would still have to be achieved using a polyphasic approach.
As with most molecular methods, non-culturable bacteria can be sequenced but this requires additional protocols, reagents and time. With traditional sequencing methods, cloning must be performed to isolate individual 16S rRNA genes amplified by PCR. Even then, further screening must be performed to ensure that multiple copies of the same 16S rRNA gene are not repetitively sequenced, thereby wasting time, reagents and money. LH can be used as a screening method to ensure that only clones of interest are sequenced. Thus, efficient identification of non-isolates poses many challenges.
Pyrosequencing
New developments in sequencing technologies are revolutionising the way that microbial communities are being studied [68, 69]. Recently developed pyrosequencing techniques that allow faster sequencing at a lower cost are opening doors for many laboratories to use sequence data for microbial identification. Pyrosequencing relies on a process referred to as sequencing-by-synthesis [70], a technique that allows for real-time monitoring of DNA synthesis [71]. Pyrosequencing is based on the principle that pyrophosphate (PPi) is released when the DNA polymerase adds a nucleotide to the growing complementary strand. The PPi is converted to adenosine triphosphate (ATP), which is used as a substrate in a chemical reaction that results in visible light emission. The amount of light detected is proportional to the amount of synthesis [71]. As with the Sanger method, pyrosequencing can only sequence individual PCR products, and thus must be used in conjunction with cloning to study microbial communities.
Pyrosequencing has been used to identify bacterial isolates by using the first and the third variable regions of the 16S rRNA [72, 73]. Importantly, pyrosequencing surpassed traditional methods of detection in a clinical setting by identifying 90 per cent of the isolates at least at the genus level [74]. The remaining 10 per cent of the isolates could not be identified owing to the short sequencing reads, a clear drawback of pyrosequencing [74]. Pyrosequencing may help bacterial identification in samples that do not lend themselves to polyphasic approaches [75, 76]. This technique has also been shown to distinguish clearly between multiple species of Mycobacterium, although three species, Mycobacterium kansasii, M. scrofulaceum and M. gordonae, require further sequencing analyses to obtain accurate identifications [75]. To implement pyrosequencing successfully as a diagnostic tool, the technique needs to be improved to address its limitations. Bioinformatics tools need to be refined or newly designed to handle the large amounts of data. Also, further research needs to be performed to validate the technique. In addition, issues regarding management and use of pyrosequencing in a clinical laboratory need to be addressed [74].
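The link between light output and the amount of synthesis can be illustrated with an idealised pyrogram. The sketch below is a simplification under stated assumptions: nucleotides are dispensed one at a time in a fixed cyclic order (taken here to be A, C, G, T), the signal is exactly equal to the number of bases incorporated on each dispensation, and there is no noise or signal decay. The function names and the example sequence are hypothetical.

```python
def simulate_pyrogram(synthesised, dispensation_order="ACGT"):
    """Idealised pyrogram for the strand being synthesised (5' to 3').

    Returns a list of (nucleotide, signal) pairs, where the signal is the
    number of bases incorporated when that nucleotide was dispensed.
    """
    seq = synthesised.upper()
    if any(base not in dispensation_order for base in seq):
        raise ValueError("sequence contains a base that is never dispensed")
    flows = []
    pos = 0
    while pos < len(seq):
        for nucleotide in dispensation_order:
            run = 0
            # A homopolymer run is incorporated in one dispensation and
            # therefore gives a proportionally stronger signal.
            while pos < len(seq) and seq[pos] == nucleotide:
                run += 1
                pos += 1
            flows.append((nucleotide, run))
            if pos == len(seq):
                break
    return flows


def call_bases(flows):
    """Reconstruct the sequence from an idealised pyrogram."""
    return "".join(nucleotide * signal for nucleotide, signal in flows)


flows = simulate_pyrogram("AAGTC")
print(flows)
print(call_bases(flows))
```

In real data the signal for long homopolymer runs is harder to quantify exactly, which is one reason base calling is less straightforward in practice than in this idealised model.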
454 sequencing
This is a new technique which allows whole-genome sequencing in a matter of days. To circumvent the need for cloning, 454 sequencing, which performs many PPi-sequencing reactions in parallel, was developed [77]. 454 sequencing combines an emulsion-based method that isolates and amplifies DNA fragments in vitro with an instrument that performs pyrosequencing in picolitre-sized wells [7]. The reactions are resolved on a Genome Sequencer FLX (454 Life Sciences, Inc., Branford, CT, USA), which reads 200-300 bases and in one run can read up to 400,000 bases [78]. This method has been used to study the microbial diversity of the deep sea [79] and the metagenome found in honey bees, which led to the discovery of a possible causative agent of colony collapse disorder [80]. A large number of microbial communities can be studied quickly and efficiently with 454 sequencing.