Approaches to analyse dynamic microbial communities such as those seen in cystic fibrosis lung

Microbial communities play vital roles in many aspects of our lives, although our understanding of microbial biogeography and community profiles remains limited. The number and diversity of microbes, even in small environmental niches, is staggering. Current microbiological methods used to analyse these communities are limited, in that many microorganisms cannot be cultured. Even for the isolates that can be cultured, the expense of identifying them definitively is much too high to be practical. Many recent molecular technologies, combined with bioinformatic tools, are raising the bar by improving the sensitivity and reliability of microbial community analysis. These tools and techniques range from those that attempt to understand a microbial community from its length heterogeneity profiles to those that help to identify the strains and species of a random sampling of the microbes in a given sample. These technologies are reviewed here, using the microbial communities present in the lungs of cystic fibrosis patients as a paradigm.


Introduction
Microbial communities play important roles in agriculture, bioremediation, and animal and human health, although our understanding of microbial biogeography and community profiles remains unclear. Current microbiological methods used to analyse these communities are limited, in that many microorganisms cannot be cultured or definitively identified. The application of recent molecular and bioinformatics tools is improving the sensitivity and reliability of microbial community analysis. These tools range from those using a 'broad brush strokes' approach to shed light on a microbial community profile to those involving identification of the strains and species of a random sampling of the microbes in a sample. The environmental genome shotgun survey of the Sargasso Sea 1 highlights the tremendous microbial diversity present in nature and the enormity of the effort needed to assess diversity and to understand a meta-community. This review discusses these technologies in the context of analysing the microbial communities present in the lungs of cystic fibrosis (CF) patients.

CF
CF is a fatal inherited disease primarily affecting Caucasians. In the USA, 3,500 children are born with the disease each year. 2 The gene responsible for CF encodes a protein called the CF transmembrane conductance regulator (CFTR). 3 The CFTR is a secretory epithelial cyclic-AMP-activated chloride channel; mutations in the CFTR lead to decreased fluid secretion and dehydration of epithelial surfaces. 4 Oversecretion of thick mucus in the airway leads to congestion of the respiratory tract and increased susceptibility to chronic broncho-pulmonary infection, which is the major cause of morbidity and mortality among patients with CF. 4 To retard the rate of decreasing lung function, bacterial infections are treated with antibiotics; however, these must be tailored to the particular infection, which is often polymicrobial. For example, anti-pseudomonal drugs are often ineffective for patients treated for Burkholderia cenocepacia infection owing to resistance. 5 Thus, it is important to identify the infecting pathogens correctly in order to prescribe an appropriate antibiotic regimen.
CF sputum bacterial flora
Staphylococcus aureus, Haemophilus influenzae and Pseudomonas aeruginosa are the primary pathogens found in the polymicrobial infection of CF patients. 6 Other opportunistic pathogens have also emerged, such as B. cenocepacia, Alcaligenes xylosoxidans, Ralstonia pickettii, Burkholderia gladioli, Stenotrophomonas maltophilia and Mycobacterium species. 6,7 S. aureus, the predominant pathogen in children, is succeeded by H. influenzae during early childhood, and P. aeruginosa becomes the predominant pathogen during adolescence, reaching a prevalence rate of 80 per cent in adults. 8 The occurrence of the more recently emerging organisms increases with advancing age and severity of lung disease. 8,9

Common assays used for clinical identification of bacteria and their limitations
Currently, the pathogens present in a CF sputum sample, throat swab or bronchoalveolar lavage (BAL) fluid are determined based on commercially available culture-based biochemical and phenotypic identification systems. These systems can either be manual, such as the API 20 NE (BioMérieux, Marcy l'Etoile, France) or fully or partly automated, such as MicroScan (Dade Behring, West Sacramento, CA, USA), BD Phoenix (Becton Dickinson, Sparks, MD, USA), and VITEK (BioMérieux). 10 These systems allow clinical microbiologists to identify bacteria accurately and rapidly, ultimately leading to better and more cost-effective patient management. 11 Misdiagnosis can nevertheless result from the limitations of the system's reference database 10 or from strain variation. 12 Since only about 1 per cent of eubacteria in the environment can be cultured, 13-15 a number of pathogenic species that are potentially present in the CF lung can be missed. 16 Other bacterial species (eg Mycobacterium) can be cultured but are still easily misdiagnosed because of their slow growth and similar phenotypes. 17 Misidentification problems can be reduced or completely eliminated by using genotype-based molecular identification methods. 18

Molecular analysis of isolates
In the CF lung, some bacteria can be identified through culture whereas others would require molecular analysis. Molecular-based assays using polymerase chain reaction (PCR) and molecular markers such as 16S rRNA have been designed to identify pure isolates of many types of bacteria, including Mycobacterium, and will be discussed in detail.
PCR
PCR amplifies template material from minimal amounts of extracted DNA. 19,20 This technique heralded a new era for the detection and identification of various microorganisms in any samples. Thus, the most recent techniques that study microorganisms are molecular based, using both universal and species-specific primers to select molecular markers. 19

Molecular marker 16S ribosomal RNA (rRNA)
rRNA plays a catalytic role in protein synthesis. The basic ribosome structure is evolutionarily conserved, although variations in overall proportions and sizes of RNA and protein exist. 21,22 A component of the small ribosomal subunit, 16S rRNA, is composed of alternating evolutionarily conserved and variable regions, 23 and is the most commonly used molecular marker. 24 The conserved regions in 16S rRNA (Figure 1) can be used to link organisms to their distant ancestors, while the highly variable regions can be used to identify evolutionary relationships between closely related organisms, at the genus and species level. 23 Studying these evolutionary relationships, however, requires the sequencing of the 16S rRNA gene.

Mycobacterium spp. identification
DNA-based commercial assays have been developed to identify slow-growing Mycobacteria. Mycobacterium tuberculosis can be identified using the Cobas Amplicor assay, which is based on DNA hybridisation of a fragment of the 16S rRNA gene. 25 Hain Lifescience (Baden-Württemberg, Germany) developed a genotype Mycobacteria direct assay for the detection of M. tuberculosis complex and four atypical Mycobacteria. 25 This technique uses nucleic acid sequence-based amplification of the 23S rRNA gene. The MicroSeq system (Applied Biosystems, Foster City, CA, USA) is able to identify many Mycobacterium species based on the first 500 base pairs of the 16S rRNA gene. 25,26 The most used identification method is AccuProbe (Gen-Probe, San Diego, CA, USA). Isolates are grown in either solid or liquid cultures. The cells are lysed using sonication, and labelled DNA probes bind to the targeted rRNA. The resulting light emission is measured, thus identifying the isolate based on the DNA probe used in the experiment. 25 The emergence of non-tuberculous mycobacteria in CF and immunocompromised patients has created a need to assure accurate identification of these organisms. The sensitivity and accuracy of each of these assays and others vary, based on the species of Mycobacteria being analysed. These assays all rely on the isolation of bacteria and are not used to analyse complex samples.
A sample containing two types of bacteria can be analysed using matrix-assisted laser desorption ionisation-time of flight mass spectrometry (MALDI-TOF-MS). 27,28 This method identifies cultivated organisms based upon the profile of proteins and peptides detected from the bacteria. In one study, CF-associated bacteria were analysed using MALDI-TOF-MS. 27 Each organism gave a specific spectrum, irrespective of how the organism had been grown (ie incubation time or media) or the presence of a mucoid phenotype. The authors concluded that this identification technique is cost-effective, rapid and easy to use. Like the isolate-based assays described above, however, this technique cannot be used to analyse complex communities.

Molecular tools for community studies
Microbial diversity in complex microbial communities can be assessed based on the lengths of one or more of the nine variable regions of 16S rRNA (Figure 1). The PCR amplicons can be analysed using techniques that include terminal restriction fragment length polymorphism (T-RFLP) analysis and amplicon length heterogeneity (LH). 24,29 The fragments are separated and analysed using a capillary electrophoresis-based genetic analyser. The data generated can be subjected to bioinformatics and statistical analysis to increase their reliability. The resulting output can provide a community profile and can putatively identify individual organisms at the strain, species or genus level. The recent developments in high-throughput sequencing enable rapid sequencing of the amplicons (bacterial and fungal, with the use of appropriate primers), which is likely to lead to a rapid understanding of the community structure of any complex niche.

T-RFLP analysis
This technique relies on the inherent variation of the sequence of a molecular marker 30 and is the most widely used method in identifying phylogenetic specificity in bacterial communities. 31 T-RFLP analysis includes PCR amplification, using one primer that is fluorescently end-labelled, restriction enzyme digestion of the amplicon and detection of the terminal restriction fragment by an automated DNA sequencer or capillary electrophoresis. 31 The resulting output consists of a microbial profile where each detected length is that of specific fragments from the digested PCR product. Each length represents one or more bacteria that have the same terminal restriction fragment length. T-RFLP profiles can be used for community differentiation, identification of specific organisms in populations and comparison of the relative phylotype richness and community structure. 30 This method has been successful in the differentiation of bacterial communities present in many environments, including marine samples, soil samples and sputum samples from CF patients. 30-33 Rogers et al. 32 analysed T-RFLP amplicons of CF patient sputa and bronchoscopy samples using a computer program called MapSort (Wisconsin Package version 10.3; Accelrys, Inc., San Diego, CA, USA), which includes a database of restriction patterns and fragment lengths generated for known 16S rRNA bacterial sequences. The analysis suggested the presence of P. aeruginosa, B. cenocepacia, S. aureus, and H. influenzae in the CF samples. 32 The T-RFLP method is fast and data can be easily replicated for statistical analysis. The major disadvantage of T-RFLP is that many bacteria produce similar fragment sizes, and thus not all peaks in the profiles are species specific. Some peaks may even represent more than one genus. 30,32 There are also inherent problems in using restriction enzymes, such as incomplete digestion, which can produce DNA fragments that do not correlate with the correct bacterium. 33 Therefore, to achieve better identification of the organism, further analysis, such as sequencing of the 16S rRNA gene, must be performed.
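
The ambiguity described above, where several organisms share one terminal restriction fragment length, can be illustrated with a minimal in silico sketch. Everything in it is an assumption made for illustration: the toy sequences and taxon names are invented, the function names are hypothetical, and MspI (recognition site CCGG, cutting after the first C) is used only as a convenient example enzyme. It is not the MapSort procedure.

```python
def terminal_fragment_length(amplicon: str, site: str, cut_offset: int) -> int:
    """Length of the labelled 5' terminal fragment after a complete digest.

    `site` is the recognition sequence and `cut_offset` the number of bases
    of the site that stay on the 5' fragment. If the site is absent, the
    amplicon is uncut and its full length is returned.
    """
    pos = amplicon.upper().find(site.upper())
    if pos == -1:
        return len(amplicon)
    return pos + cut_offset


def trflp_profile(amplicons: dict[str, str], site: str, cut_offset: int) -> dict[int, list[str]]:
    """Group taxa by predicted terminal fragment length."""
    profile: dict[int, list[str]] = {}
    for name, seq in amplicons.items():
        length = terminal_fragment_length(seq, site, cut_offset)
        profile.setdefault(length, []).append(name)
    return profile


# Hypothetical toy amplicons (not real 16S rRNA sequences).
toy_amplicons = {
    "taxon_a": "ATATATCCGGTTTT",
    "taxon_b": "GGATTACCGGAAAA",
    "taxon_c": "ATATATATATCCGG",
    "taxon_d": "ATATATATATATAT",
}

# MspI is assumed here: site CCGG, cut after the first base (C^CGG).
toy_profile = trflp_profile(toy_amplicons, site="CCGG", cut_offset=1)
for length in sorted(toy_profile):
    print(length, toy_profile[length])
```

In this toy case taxon_a and taxon_b give the same fragment length, so a single peak cannot distinguish them, which is why sequencing is needed for confirmation.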

LH
LH techniques analyse microbial populations based on the lengths of PCR products generated from the hypervariable regions of the 16S rRNA. 33-38 Profiles from one region are produced for the microbial community. These profiles represent the minimum diversity of bacteria present within the eubacterial community. The profiles contain peaks at specific amplicon lengths (Figure 2) representative of the number of nucleotides in the hypervariable region between the conserved regions. The peak heights are representative of the relative abundance of amplicons of that length present in the community. To identify individual bacterial organisms in the community, a database is needed. This can be generated by in silico analysis of known 16S rRNA sequences and the expected amplicon fragment length with a particular primer set that would be produced during an LH-PCR. The fragment lengths in the sample profile are compared against the database to identify the putative organisms. A profile resulting from this analysis suggests the presence of certain organisms and the definitive absence of others. In cases where the amplicon length is not species specific, it is often genus specific. 29 LH profiles can also be used to compare community profiles from multiple samples. Previous research has shown that the compositions of bacterial communities are highly specific to the environment in which they are found, and these differences are represented in LH profiles. 33,35 Changes in the community's niche can drastically influence bacteria and thus add specificity to the profile of a bacterial community, showing that the overall bacterial community has many unique features from sample to sample. 33,35 The main advantages of LH-PCR are that it rapidly surveys relative gene frequencies within complex mixtures of DNA, is reproducible, requires small sample sizes and can be performed simultaneously with many samples. 29 The LH profiles provide information about the members of the entire bacterial community (not just specific isolates) and their relative abundance. These data allow one to make taxonomic inferences and sample comparisons. 29 A major disadvantage of this technique is that one amplicon in the profile can represent more than one bacterium; therefore, identification at the species level cannot be guaranteed. This is also true of many length-based molecular techniques, such as T-RFLP; however, the fragments are discrete 'units' of information that can be used for comparative analyses. 30 Analysis of different combinations of the 16S rRNA variable regions will increase the power of microbial detection and sample discrimination and lead to more definitive identification.
LH was the first technique used in several ecological research projects to compare microbial communities between samples and to identify members within one community. 33,35,38 Fourteen CF sputum samples were analysed using LH-PCR for the presence of eubacteria. 32 The raw data generated from the genetic analyser were first processed with corresponding software, such as GeneimageIR v.3.56 (Scanalytics, Fairfax, VA, USA) 32 or GeneMapper (Applied Biosystems), 35 to produce amplicon fragment lengths in base pairs. To identify presumptively the bacteria present in the CF samples, the fragment lengths were compared with a database of theoretical fragment lengths constructed using GAP (Wisconsin Package version 10.3). 32 For example, P. aeruginosa was identified presumptively in all 14 CF samples, five of which were confirmed by cloning and sequencing. 32 In another study, LH analysis presumptively identified P. aeruginosa in 19 south Florida CF patients, all of whom had been clinically diagnosed with this pathogen. 39 The LH fragment representing B. cenocepacia was not found in any of the patients, and clinical diagnosis and sequencing results confirmed the absence of this organism. 39 To assist in the identification of individual microbial organisms in a community, we developed a software package called AmpliQué, to be used in conjunction with LH-PCR. 39 For all the bacterial and archaeal 16S rRNA sequences available from the Ribosomal Database Project (RDP) (http://rdp.cme.msu.edu/), AmpliQué computes the length of the amplicon for any specified (degenerate) primer sequence pair. For a given sample on which PCR has been performed with a fixed pair of primers, and given the lengths of the PCR products, AmpliQué infers the bacterial and archaeal organisms present in the sample. AmpliQué has recently been generalised also to handle lengths of PCR products from more than one pair of primers, enhancing the power of this in silico identification method.
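
The general idea of computing expected amplicon lengths for a degenerate primer pair, and then looking up observed peak lengths, can be sketched as follows. This is not AmpliQué's actual implementation, which is not described here; the function names, the toy primers and the toy reference sequences are all hypothetical, and real use would require mismatch tolerance and full-length reference sequences.

```python
import re

# IUPAC degenerate nucleotide codes expressed as regular-expression classes.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "[AG]", "Y": "[CT]", "S": "[CG]", "W": "[AT]",
    "K": "[GT]", "M": "[AC]", "B": "[CGT]", "D": "[AGT]",
    "H": "[ACT]", "V": "[ACG]", "N": "[ACGT]",
}
COMPLEMENT = str.maketrans("ACGTRYSWKMBDHVN", "TGCAYRSWMKVHDBN")


def reverse_complement(primer: str) -> str:
    return primer.upper().translate(COMPLEMENT)[::-1]


def primer_regex(primer: str) -> str:
    return "".join(IUPAC[base] for base in primer.upper())


def amplicon_length(seq: str, forward: str, reverse: str):
    """Predicted amplicon length (primers included), or None if a primer
    site is missing. `reverse` is written 5'->3' on the opposite strand."""
    seq = seq.upper()
    fwd = re.search(primer_regex(forward), seq)
    if fwd is None:
        return None
    rev = re.search(primer_regex(reverse_complement(reverse)), seq[fwd.end():])
    if rev is None:
        return None
    return fwd.end() + rev.end() - fwd.start()


def build_length_database(references: dict[str, str], forward: str, reverse: str) -> dict[int, list[str]]:
    """Map each predicted amplicon length to the reference names producing it."""
    database: dict[int, list[str]] = {}
    for name, seq in references.items():
        length = amplicon_length(seq, forward, reverse)
        if length is not None:
            database.setdefault(length, []).append(name)
    return database


def candidates_for_peaks(peaks: list[int], database: dict[int, list[str]]) -> dict[int, list[str]]:
    """Presumptive organisms for each observed peak length (exact match only)."""
    return {peak: database.get(peak, []) for peak in peaks}


# Hypothetical toy references and primers, for illustration only.
toy_references = {
    "org_a": "GGACGTACCCCCCTGCAAGG",
    "org_b": "ACGTGCCCCTGCAA",
    "org_c": "TTACGTGAAAATGCAATT",
    "org_d": "GGGGGGGG",
}
toy_database = build_length_database(toy_references, forward="ACGTR", reverse="TTGCA")
print(candidates_for_peaks([14, 20], toy_database))
```

A peak with no database entry (20 in the toy call) returns an empty list, and a peak shared by several references returns all of them, mirroring the presumptive nature of LH-based identification.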
AmpliQué was used to determine the presumptive identity of organisms present in 19 south Florida CF patients based on the fragment lengths produced by LH-PCR. Oral-associated bacteria, such as Lactobacillus mali, Capnocytophaga gingivalis, Porphyromonas spp. and Prevotella spp. and the known CF-associated lung pathogens P. aeruginosa, H. influenzae, B. cenocepacia, Achromobacter xylosoxidans, Serratia marcescens, S. maltophilia and Sarcina ventriculi, were presumptively identified. 39 To expand the use of LH-PCR in clinical settings, Bjerketorp et al. 40 combined it with a lab-on-a-chip (LOC) system, which is used for sizing and quantifying DNA, to analyse samples containing mixtures of known human gut microbes. An Agilent 2100 Bioanalyzer (Agilent Technologies, Santa Clara, CA, USA), a benchtop instrument that uses microfluidics-based separation, was used to detect the LH fragments. This modified method allows LH-PCR to be more affordable and faster, and thus more convenient and suitable for clinical and diagnostic situations. 40 To test this system, samples containing mixtures of human gut microbes and known human gut bacteria isolates were analysed using both LOC and a capillary electrophoresis-based genetic analyser. The latter method had a higher resolution and was thus able to resolve more peaks or fragments from one another. It is important to separate PCR fragments clearly, as LH identification is based on the lengths of PCR products. Single base pair length differences are known to occur between species and even at the genus level. The level of resolution for the LOC LH-PCR technique is a weakness but the technique is rapid, economical and easier to analyse than the traditional system. Future modifications may improve the resolution, making it more useful for clinical diagnosis. 40

LH-related bioinformatics
Regardless of whether LH is being used to compare communities or to identify members of a community, statistics and bioinformatics must be used to derive any information produced by the technique. The first aspect of the LH-PCR system is that it profiles a community based on the patterns of lengths of amplified products (amplicons) and allows one community to be distinguished from other communities, without necessarily identifying individual species or genera.
Microbial diversity and community dynamics were first studied by computing measures such as species richness and dominance or evenness indices. 41 Theoretical models of microbial diversity based on the log-normal distributions have also been used. 42 LH and T-RFLP data derived from soil communities have been clustered using the unweighted pair-group method using arithmetic averages (UPGMA) algorithm based on the use of distance metrics (such as the Jaccard, Hellinger or Pearson distances). 43-45 Such unsupervised methods have been used to support claims that certain relationships between communities can be discerned, that the groupings are natural and that outliers can be identified.
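
As a minimal sketch of this kind of unsupervised analysis, the following computes Jaccard distances between presence/absence profiles (sets of peak lengths) and clusters them by UPGMA, where the distance between two clusters is the mean of all pairwise distances between their members. The sample names and peak lengths are invented for illustration, and the function names are hypothetical; published analyses would normally rely on an established statistics package.

```python
from itertools import combinations


def jaccard_distance(a: set, b: set) -> float:
    """1 - |intersection| / |union| for two sets of peak lengths."""
    union = a | b
    if not union:
        return 0.0
    return 1.0 - len(a & b) / len(union)


def upgma(profiles: dict[str, set[int]]) -> list[tuple]:
    """Return the merge history as (cluster_1, cluster_2, distance) tuples."""
    pairwise = {
        frozenset((x, y)): jaccard_distance(profiles[x], profiles[y])
        for x, y in combinations(profiles, 2)
    }

    def average(c1: tuple, c2: tuple) -> float:
        # UPGMA linkage: unweighted mean over all member pairs.
        total = sum(pairwise[frozenset((x, y))] for x in c1 for y in c2)
        return total / (len(c1) * len(c2))

    active = [(name,) for name in sorted(profiles)]
    merges = []
    while len(active) > 1:
        c1, c2 = min(combinations(active, 2), key=lambda pair: average(*pair))
        merges.append((c1, c2, average(c1, c2)))
        active = [c for c in active if c not in (c1, c2)]
        active.append(tuple(sorted(c1 + c2)))
    return merges


# Hypothetical presence/absence LH profiles (peak lengths in base pairs).
toy_profiles = {
    "soil_1": {312, 320, 341, 350},
    "soil_2": {312, 320, 341, 352},
    "sputum_1": {330, 341, 347},
    "sputum_2": {330, 341, 347, 350},
}
toy_merges = upgma(toy_profiles)
for first, second, distance in toy_merges:
    print(first, second, round(distance, 3))
```

With these toy profiles the two sputum samples merge first, then the two soil samples, and the two groups join last, which is the kind of "natural grouping" such methods are used to support.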
The statistical analysis of LH profiles is used to differentiate between two or more microbial communities. Without rigorous statistical analysis, it is impossible to differentiate between significant differences and random events. The identification of individual organisms in the community will be discussed later.

Statistical analysis based on ecological indices
Many statistical techniques have been applied to ecological indices that measure the diversity of microbial communities. A number of diversity indices have been used with microbial communities. 41 Traditional indices include the richness (S), the Shannon information index (H) and the evenness (E) derived from it. With S taken as the number of peaks in the LH profile, H and E are defined as follows in Equations (1) and (2):

$$H = -\sum_{i=1}^{S} p_i \ln p_i \qquad (1)$$

$$E = \frac{H}{H_{\max}} \qquad (2)$$

where $p_i$ is the ratio of individual peak height to the sum total of the heights of all the peaks in the LH profile, and $H_{\max} = \ln(S)$. Note that the traditional diversity indices are based on the clear definition of an ecological description of an individual species. Here, the definitions have been modified for presumptive identification of LH profiles by replacing the definition of an individual species with that of individual peaks in LH profiles. Once appropriate diversity indices are chosen, multivariate statistical techniques, such as analysis of variance (ANOVA), can be applied to compare microbial communities.
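
A minimal sketch of these indices for a single LH profile is given below, assuming that richness S is the number of peaks with non-zero height, that the Shannon index is H = -sum(p_i ln p_i) over the relative peak heights p_i, and that evenness is E = H / ln(S). The function name and the peak heights are hypothetical.

```python
import math


def diversity_indices(peak_heights: list[float]) -> tuple[int, float, float]:
    """Return (richness S, Shannon index H, evenness E) for one LH profile."""
    heights = [h for h in peak_heights if h > 0]
    richness = len(heights)
    total = sum(heights)
    proportions = [h / total for h in heights]
    shannon = -sum(p * math.log(p) for p in proportions)
    # Evenness is undefined for fewer than two peaks; 0.0 is used by convention here.
    evenness = shannon / math.log(richness) if richness > 1 else 0.0
    return richness, shannon, evenness


# Hypothetical peak heights (relative fluorescence units).
even_profile = [100.0, 100.0, 100.0, 100.0]
skewed_profile = [370.0, 10.0, 10.0, 10.0]
print(diversity_indices(even_profile))
print(diversity_indices(skewed_profile))
```

Both toy profiles have the same richness, but the profile dominated by a single peak has lower H and E, which is the distinction the evenness index is meant to capture.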

Statistical analysis based on abundance models
Even with the availability of the numerous diversity indices, analysing microbial diversity and communities merely using ecological indices has its shortcomings. 46 Although each index represents an attempt to distil diversity information into a single quantity, each one ends up measuring specific aspects of diversity. Diversity indices vary in their sensitivity to different abundance classes. Species abundance models are considered to be more sophisticated tools to investigate diversity because they examine the distribution of abundances in a population.
Statistical models used for species abundance of microbial communities include log series distribution, log-normal distribution, 47 the broken stick model and the overlapping niche model. 41 The most frequently used statistical model for species abundance distributions is the log-normal distribution. In log-normal communities, the null model for bacterial species abundance is a log-normal distribution in which S(R) is the number of species that contain R individuals, $S_T$ is the total number of species in the community, and $\sigma^2$ is the variance of the distribution. The parameters $S_T$ and $\sigma^2$ can be estimated from a sample of measured species abundance data by using statistical techniques such as the method of moments or least squares analysis. 47
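
For orientation only, one common textbook parameterisation of a log-normal species abundance curve is shown below. It should not be read as the exact equation used in the cited work: the location parameter $\mu$ (the mean of $\ln R$) is an assumption introduced here, whereas $S(R)$, $S_T$ and the variance $\sigma^2$ follow the definitions in the text.

```latex
% Assumed parameterisation: S_T times the log-normal density in R.
% \mu (mean of \ln R) is introduced here for illustration only.
S(R) = \frac{S_T}{R\,\sigma\sqrt{2\pi}}
       \exp\!\left( -\frac{(\ln R - \mu)^2}{2\sigma^2} \right)
```

Under this form, integrating $S(R)$ over all abundances $R > 0$ returns $S_T$, so the curve distributes the total number of species across abundance classes.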

Supervised analysis of LH profiles
In addition to the unsupervised methods introduced above, computational tools based on supervised classification methods from machine learning have also been used for analyses based on microbial diversity. 38 These methods are used to 'learn' the differences between the diversities in the microbial communities of two sets of samples. Two well-known supervised classification tools include support vector machines (SVM) and the k-nearest neighbour method (KNN). These tools have the ability to 'learn' to classify samples after being 'trained' with 'features' from a collection of known, labelled samples. Both are computational machine-learning tools that treat the data as points or vectors in Euclidean space. These vectors are usually referred to as 'feature vectors' because their coordinates correspond to quantified 'features' of the data. These features are usually obtained after a feature extraction process. Given a new sample, it too is represented by a feature vector. In both methods, classification of the new sample is based on the location of its feature vector in relation to the location of the labelled feature vectors in the feature space. 48-51 SVMs have been shown to perform well in a variety of research areas, including pattern recognition, 52 face recognition, 53 classifications based on microarray gene expression data, 54-58 detecting remote protein homologies 59 and classifying G-protein-coupled receptors. 60 In particular, SVMs are well suited for dealing with high-dimensional data. 48,51 KNN classifiers have been successfully used in applications such as classification of handwritten digits and satellite image scenes. 50 Computational machine learning classifiers based on SVMs and KNNs have been used to identify and compare microbial communities from different types of soil samples. 38 After a learning phase, the resulting classifiers were able to classify with high accuracy.
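
The feature-vector idea can be made concrete with a minimal k-nearest neighbour sketch. It assumes that each LH profile is turned into a vector of relative peak heights over a fixed list of amplicon lengths and that Euclidean distance is used; the labels, peak lengths and heights are invented, the function names are hypothetical, and this is not the classifier configuration of the cited study.

```python
import math
from collections import Counter


def profile_to_vector(profile: dict[int, float], lengths: list[int]) -> list[float]:
    """Relative peak heights over a fixed, ordered list of amplicon lengths."""
    total = sum(profile.values())
    return [profile.get(length, 0.0) / total for length in lengths]


def knn_classify(training: list[tuple[list[float], str]], query: list[float], k: int = 3) -> str:
    """Label the query by majority vote among its k nearest training vectors."""
    nearest = sorted(training, key=lambda item: math.dist(item[0], query))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


# Hypothetical labelled LH profiles: {amplicon length: peak height}.
lengths = [312, 330, 341, 350]
labelled_profiles = [
    ({312: 60, 341: 30, 350: 10}, "site_A"),
    ({312: 50, 341: 40, 350: 10}, "site_A"),
    ({312: 55, 341: 35, 350: 10}, "site_A"),
    ({330: 70, 341: 20, 350: 10}, "site_B"),
    ({330: 60, 341: 30, 350: 10}, "site_B"),
    ({330: 65, 341: 25, 350: 10}, "site_B"),
]
training_set = [(profile_to_vector(p, lengths), label) for p, label in labelled_profiles]

new_sample = profile_to_vector({312: 5, 330: 60, 341: 25, 350: 10}, lengths)
predicted_label = knn_classify(training_set, new_sample, k=3)
print(predicted_label)
```

The choice of k and of the feature set are tuning decisions; an SVM would use the same feature vectors but learn a separating boundary instead of voting among neighbours.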
Detailed studies using these tools revealed the limitations of the data and the minimum amount of information from LH assays that was necessary to perform reliable classification for microbial communities. 38

Sequencing
Even with the combined use of bioinformatics tools and LH, certain members of a community may not be identified. Sequencing of the 16S rRNA gene is imperative to identify an organism with near certainty. The most common method of sequencing is the Sanger method, developed in 1977. 61 Once the sequences are generated they are compared with known 16S rRNA sequences (stored in the Ribosomal Database Project II, 62 Greengenes 63 and GenBank 64) to identify organisms in any samples, including the CF lung. 10,65 Sequencing of the RFLP-PCR products from the total metagenomic DNA from BAL samples of CF children identified known CF pathogens, such as P. aeruginosa, S. aureus, S. maltophilia and H. influenzae. 65 Potentially novel pathogens from the taxa Lysobacter, Coxiellaceae and Rickettsiales were also found. 65 Another study, which involved the sequencing of the 16S rRNA gene, has shown that CF sputum contains Streptococcus mitis, S. pneumoniae, Prevotella melaninogenica, Veillonella spp., Granulicatella para-adiacens and Exiguobacterium spp., besides the normal CF pathogens, such as P. aeruginosa. In this study, clones were screened using LH-PCR to ensure that plasmids containing a wide array of 16S rRNA genes were sequenced.
Although sequencing technologies are able to identify bacteria in a sample more accurately, the cost of reagents and labour may be too high for widespread clinical use. 66 For some bacteria, partial sequencing of the gene would lead to identification; for others, the entire gene would need to be analysed. Sequencing isolates can be performed in a timely manner and the data produced are fairly easy to analyse, especially with the use of commercial sequencing kits; 67 however, sequencing cannot differentiate between some species (eg Mycobacterium chelonae and M. abscessus are 99 per cent similar). 66 In such cases, bacterial identification would still have to be achieved using a polyphasic approach.
As with most molecular methods, non-culturable bacteria can be sequenced but this requires additional protocols, reagents and time. With traditional sequencing methods, cloning must be performed to isolate individual 16S rRNA genes amplified by PCR. Even then, further screening must be performed to ensure that multiple copies of the same 16S rRNA gene are not repetitively sequenced, thereby wasting time, reagents and money. LH can be used as a screening method to ensure that only clones of interest are sequenced. Thus, efficient identification of non-isolates poses many challenges.

Pyrosequencing
New developments in sequencing technologies are revolutionising the way that microbial communities are being studied. 68,69 Recently developed pyrosequencing techniques that allow faster sequencing at a lower cost are opening doors for many laboratories to use sequence data for microbial identification. Pyrosequencing relies on a process referred to as sequencing-by-synthesis, 70 a technique that allows for real-time monitoring of DNA synthesis. 71 Pyrosequencing is based on the principle that pyrophosphate (PPi) is released when the DNA polymerase adds a nucleotide to the growing complementary strand. The PPi is converted to adenosine triphosphate (ATP), which is used as a substrate in a chemical reaction that results in visible light emission. The amount of light detected is proportional to the number of nucleotides incorporated. 71 As with the Sanger method, pyrosequencing can only sequence individual PCR products, and thus must be used in conjunction with cloning to study microbial communities.
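
The relationship between light signal and nucleotide incorporation can be illustrated with an idealised simulation. The sketch assumes that nucleotides are dispensed one at a time in a fixed cyclic order (the order "TACG" is an arbitrary choice here), that each flow yields a noise-free signal equal to the number of bases incorporated, and that the read contains only A, C, G and T. The function names are hypothetical and real signals are noisy, particularly for long homopolymers.

```python
def simulate_flowgram(read: str, flow_order: str = "TACG") -> list[int]:
    """Ideal signal per nucleotide flow while synthesising `read`.

    Each value is the number of bases incorporated during that flow,
    which in this idealised model equals the light signal.
    """
    read = read.upper()
    if any(base not in flow_order for base in read):
        raise ValueError("read contains a base that is never dispensed")
    signals = []
    position = 0
    flow = 0
    while position < len(read):
        base = flow_order[flow % len(flow_order)]
        incorporated = 0
        while position < len(read) and read[position] == base:
            incorporated += 1
            position += 1
        signals.append(incorporated)
        flow += 1
    return signals


def decode_flowgram(signals: list[int], flow_order: str = "TACG") -> str:
    """Reconstruct the read from ideal integer signals."""
    return "".join(
        flow_order[index % len(flow_order)] * count
        for index, count in enumerate(signals)
    )


toy_signals = simulate_flowgram("TTAGGC")
print(toy_signals)
print(decode_flowgram(toy_signals))
```

A flow with signal 0 means the dispensed nucleotide was not the next base, and a signal of 2 corresponds to a two-base homopolymer such as "TT".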
Pyrosequencing has been used to identify bacterial isolates by using the first and the third variable regions of the 16S rRNA. 72,73 Importantly, pyrosequencing surpassed traditional methods of detection in a clinical setting by identifying 90 per cent of the isolates at least at the genus level. 74 The remaining 10 per cent of the isolates could not be identified owing to the short sequencing reads, a clear drawback of pyrosequencing. 74 Pyrosequencing may help bacterial identification in samples that do not lend themselves to polyphasic approaches. 75,76 This technique has also been shown to distinguish clearly between multiple species of Mycobacterium. Three species, however, Mycobacterium kansasii, M. scrofulaceum and M. gordonae, require further sequencing analyses to obtain accurate identifications. 75 To implement pyrosequencing successfully as a diagnostic tool, the technique needs to be improved to address its limitations. Bioinformatics tools need to be refined or newly designed to handle the large amounts of data. Also, further research needs to be performed to validate the technique. In addition, issues regarding management and use of pyrosequencing in a clinical laboratory need to be addressed. 74

454 sequencing
This is a new technique which allows whole-genome sequencing in a matter of days. To circumvent the need for cloning, 454 sequencing, which performs many PPi-sequencing reactions in parallel, was developed. 77 The 454 sequencing combines an emulsion-based method that isolates and amplifies DNA fragments in vitro with an instrument that performs pyrosequencing in picolitre-sized wells. 77 The reactions are resolved on a Genome Sequencer FLX (454 Life Sciences, Inc., Branford, CT, USA), which produces reads of 200-300 bases and can generate up to 400,000 reads in one run. 78 This method has been used to study the microbial diversity of the deep sea 79 and the metagenome found in honey bees, which led to the discovery of a possible causative agent of colony collapse disorder. 80 A large number of microbial communities can be studied quickly and efficiently with 454 sequencing.

Conclusion
The members of a microbial community and the associated dynamics of the niche can be studied using various methods. LH, T-RFLP and sequencing have all been used to study microbial community profiles, as well as to identify bacteria found in the CF lung. Each of these techniques has its drawbacks but can produce data that can be used (with the help of bioinformatics) to understand the composition of the community and the factors that drive it. Recent advances in technology are now the driving force behind community profiling. With the advances in high-throughput sequencing-based technologies, entire niches of organisms can be studied in a relatively short period of time. As a result, a vast amount of complex data is produced from these experiments. With the use of newly designed bioinformatics tools, data can be interpreted correctly and provide researchers with information that can ultimately be used to address community interactions that dictate the outcomes of human health studies.