Large-Scale Metagenomic Sequence Clustering and Inference of Environment-Microbe and Microbe-Microbe Associations
Abstract: The diversity of the microbial community in an environmental sample (e.g., from marine, freshwater, soil, or human body habitats) can be characterized by the identities of its taxonomic units and their abundance levels. With advances in next-generation sequencing technology, it is now possible to directly sequence DNA obtained from environmental samples. In this talk, we focus on targeted 16S rRNA gene sequencing, which directly profiles the diversity of microbial communities. We present an unsupervised Bayesian method for clustering 16S rRNA sequences for taxonomic prediction, and then describe how we speed it up to cluster billions of sequences.
Understanding associations among microbes and associations between microbes and their environmental factors from metagenomic sequencing data is a key research topic in microbial ecology. It could help us unravel real interactions (e.g., commensalism, parasitism, competition) in a community and understand community-wide dynamics. Although several statistical tools have been developed for metagenomic association studies, they either suffer from compositional bias or fail to take into account environmental factors that directly affect the composition of a microbial community, leading to false positive associations. Here, we propose metagenomic Lognormal-Dirichlet-Multinomial (mLDM), a hierarchical Bayesian model with sparsity constraints that bypasses compositional bias and discovers new associations among microbes and between microbes and their environmental factors. The mLDM model is able to: 1) infer both conditionally dependent associations among microbes and direct associations between microbes and environmental factors; 2) consider both compositional bias and variance of metagenomic data; and 3) estimate absolute abundance for microbes. Conditionally dependent associations thus capture the direct relationships underlying pairs of microbes and remove the indirect connections induced by other common factors.
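The compositional bias mentioned above can be illustrated with a small simulation. The sketch below is not mLDM. It only assumes independent lognormal absolute abundances (the parameters and function names are illustrative) and shows that, once samples are normalized to relative abundances, each row of the covariance matrix sums to zero, so every taxon acquires at least one negative covariance that was not built into the absolute data.

```python
import random


def covariance(xs, ys):
    """Sample covariance of two equally long sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    return sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / (n - 1)


def simulate_absolute(n_samples=200, n_taxa=4, seed=0):
    """Draw independent lognormal absolute abundances (an assumption for illustration)."""
    rng = random.Random(seed)
    return [[rng.lognormvariate(5.0, 0.5) for _ in range(n_taxa)]
            for _ in range(n_samples)]


def to_proportions(rows):
    """Normalize each sample so its taxa sum to 1 (relative abundance)."""
    return [[value / sum(row) for value in row] for row in rows]


def covariance_matrix(rows):
    """Taxon-by-taxon covariance matrix; rows are samples, columns are taxa."""
    columns = list(zip(*rows))
    return [[covariance(a, b) for b in columns] for a in columns]


absolute = simulate_absolute()
relative = to_proportions(absolute)
cov_rel = covariance_matrix(relative)

# Because each sample's proportions sum to 1, every row of cov_rel sums to ~0.
row_sums = [sum(row) for row in cov_rel]
# The smallest off-diagonal entry per taxon is therefore negative.
min_off_diag = [min(v for j, v in enumerate(row) if j != i)
                for i, row in enumerate(cov_rel)]
print(row_sums)
print(min_off_diag)
```

The negative entries arise purely from the normalization step, which is why association methods applied directly to relative abundances can report spurious relationships.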
Reconstructing Cell Cycle Pseudo Time-Series via Single-cell Transcriptome Data
Single-cell mRNA sequencing (scRNA-seq), which permits whole transcriptional profiling of individual cells, has been widely applied to study the growth and development of tissues and tumors. Resolving the cell cycle for such groups of cells is important, but may not be adequately achieved by commonly used approaches. Here we develop a computational method, based on the traveling salesman problem (TSP) and a hidden Markov model (HMM), to recover cell cycle along time (reCAT) for unsynchronized single-cell transcriptome data. We independently test reCAT for accuracy and reliability using several datasets. We find that cell cycle genes cluster into two major waves of expression, which correspond to the two well-known checkpoints, G1 and G2. Moreover, we leverage reCAT to reveal methylation variation along the recovered cell cycle. Thus, reCAT shows the potential to elucidate diverse profiles of the cell cycle, as well as other cyclic or circadian processes (e.g., in liver), at single-cell resolution.
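To make the TSP idea concrete, the sketch below orders unsynchronized "cells" along a closed tour. It is a generic heuristic (nearest-neighbor construction followed by 2-opt), not the reCAT implementation. The cells are hypothetical 2-D points placed on a circle to stand in for a low-dimensional cell cycle embedding, and all function names are illustrative.

```python
import math
import random


def nearest_neighbor_tour(points):
    """Greedy initial tour: always move to the closest unvisited point."""
    unvisited = set(range(1, len(points)))
    tour = [0]
    while unvisited:
        last = points[tour[-1]]
        nxt = min(unvisited, key=lambda k: math.dist(last, points[k]))
        unvisited.remove(nxt)
        tour.append(nxt)
    return tour


def two_opt(tour, points):
    """Repeatedly reverse segments while doing so shortens the closed tour."""
    n = len(tour)
    improved = True
    while improved:
        improved = False
        for i in range(n - 1):
            for j in range(i + 2, n):
                if i == 0 and j == n - 1:
                    continue  # these two edges share a node; nothing to swap
                a, b = points[tour[i]], points[tour[i + 1]]
                c, d = points[tour[j]], points[tour[(j + 1) % n]]
                delta = (math.dist(a, c) + math.dist(b, d)
                         - math.dist(a, b) - math.dist(c, d))
                if delta < -1e-12:
                    tour[i + 1:j + 1] = tour[i + 1:j + 1][::-1]
                    improved = True
    return tour


# Toy data: 12 cells at unevenly spaced phases on a circle (index = true phase rank).
n = 12
angles = [2 * math.pi * (i + 0.3 * math.sin(i)) / n for i in range(n)]
cells = [(math.cos(a), math.sin(a)) for a in angles]

# Scramble the cells to mimic an unsynchronized population.
order = list(range(n))
random.Random(0).shuffle(order)
shuffled = [cells[k] for k in order]

tour = two_opt(nearest_neighbor_tour(shuffled), shuffled)
recovered = [order[t] for t in tour]  # true phase rank of each cell along the tour
print(recovered)
```

The tour is closed, so the result fixes a cyclic order but neither the starting point nor the direction. Assigning cell cycle stages along the cycle is a separate step, for which the abstract names an HMM.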