Bioinformatic tools for testing microbial ecology theory in natural environments through metagenomics.
Dissertation, Doctor of Philosophy in Bioinformatics, Georgia Institute of Technology, School of Biological Sciences. 2016.
The study of microbial ecology has traditionally been hampered by the inability to sample members of microbial communities uniformly at random in their natural environments. However, advances in molecular techniques during the past three decades have allowed the characterization of communities through DNA census. The existence of a global phylogenetic reference framework (Woese & Fox, 1977) sparked the popularization of 16S/18S ribosomal RNA gene amplification (SSU-rRNA amplicons) for the characterization of microbial communities. The use of SSU-rRNA amplicons has been further promoted by the availability of large standardized reference databases such as the Ribosomal Database Project, RDP (Cole et al., 2014), allowing unprecedented advances in microbial ecology. However, the universality of the SSU-rRNA gene implies a degree of conservation that comes at the cost of low resolution near and below the species level (Cole, Konstantinidis, Farris, & Tiedje, 2010; Rodriguez-R, Castro, & Konstantinidis, In preparation). To address this shortcoming for isolated organisms, recent advances in whole-genome comparisons have provided the framework necessary to redefine bacterial and archaeal species on the basis of genome-aggregate Average Nucleotide Identity, ANI (Konstantinidis, Ramette, & Tiedje, 2006; Konstantinidis & Tiedje, 2005a, 2005b). The extension and application of this theoretical framework to the study of natural populations and communities is now possible thanks to the availability and increasing popularization of metagenomics. Such an advance has the potential to bring species-level resolution to the characterization of microbial communities, and is the subject of chapter I. However, this potential can be fully realized only when the proper tools and techniques are made available.
Therefore, a guide to computational tools to explore and quantitatively compare metagenomic datasets is presented in chapter II, and a suite of bioinformatic tools for genomics and metagenomics, the enveomics collection, is presented in chapter III. A particularly pervasive but underappreciated problem in the use of metagenomics is the issue of sequencing coverage, i.e., the fraction of the microbial community in a sample characterized by sequencing; insufficient coverage can decrease the accuracy of both individual sample characterizations and comparative analyses, as discussed in chapter IV. In order to assess sequencing coverage we developed Nonpareil, a computational tool that accurately estimates abundance-weighted average coverage in a metagenomic sample using read redundancy, as described in chapter V. Moreover, Nonpareil can be used to determine the sequence diversity in a community independently of databases or sequencing coverage, allowing higher accuracy in the determination of alpha-diversity, as described in chapter VI. Chapter VI also describes recent computational optimizations in Nonpareil 3 using k-mer matching and high-performance computing in order to cope with the increasing volume of data available from environmental and clinical metagenomic surveys. The application of novel computational and statistical techniques, including those presented here, has the potential to close the gap between ecology theory and its testing in microbial systems. We addressed factors that drive microbial community assembly using time-series metagenomics in two different ecosystems. First, we documented the post-disturbance successional patterns in shoreline sediments in Pensacola Beach (Florida, USA) after the large-scale deposition of hydrocarbons caused by the 2010 Macondo oil spill in the Gulf of Mexico.
This study and our approach to test the specialization-disturbance hypothesis based on the oiled beach sand microbial communities are the subject of chapter VII. Next, the characterization of a freshwater meta-community in the Southeast USA monitored for six years in seven locations was used to quantify the different biogeographic factors contributing to community assembly. Chapter VIII describes the results of this study, documenting distinct microbial provinces within interconnected habitats along the Chattahoochee River, revealing a similarly high impact of seasonality and geographic distance on community variation and extant diversity, and documenting a modest effect of landscape and only a minor effect of other environmental factors on community assembly.