Title | : | Computational methods to systematically dissect microbial states in health and disease: Tackling challenges in TB and beyond |
Speaker | : | Brintha V P (IITM) |
Details | : | Thu, 27 Mar, 2025 2:30 PM @ SSB 233 |
Abstract | : | Microbes such as bacteria and viruses inhabit various organisms, including humans, and their states (species composition and function) can play important roles in the health and disease of the host organism. Advances in DNA sequencing technologies and in the associated computational methods for analyzing sequencing data have revealed the diversity, abundance and functionality of the different microbes in a sample of interest, facilitating personalized healthcare and disease surveillance. However, computational challenges remain in analyzing microbial sequencing data: handling sequencing errors, identifying and quantifying novel bacterial variants, and developing scalable workflows. In this talk, we discuss the design and development of computational methods that address these challenges in the context of three applications related to Tuberculosis (TB), a respiratory disease caused by a bacterial pathogen. First, we address the problem of delineating the different bacterial strains, including novel ones, that are mixed in a TB sample of interest. Existing methods for identifying and quantifying mixed infections from microbial sequencing reads rely on a reference database of known strains and thereby miss novel strains. We present Demixer, a probabilistic generative model that extends Latent Dirichlet Allocation, a popular topic model in text mining, to represent known mutations in a reference database as well as discover novel ones. We have carefully tuned our method using parallelization and other heuristics to perform efficient multi-sample inference. In both synthetic and experimental benchmarks, Demixer precisely detected the identity and proportions of the mixed strains. As Demixer is a shallow bag-of-words based probabilistic model, we also compared it with deep sequence models such as DNABERT-2 and DeepMicrobes, and found that Demixer is able to better learn the mutational profiles of both seen and unseen strains.
Second, we developed a workflow to study the community of microbes under single-disease or double-disease conditions (such as co-infection with TB and COVID-19). Specifically, we generated a novel TB+COVID microbial sequencing dataset and systematically analysed it using a workflow of statistical methods and linear regression models to identify potential biomarkers. Our findings highlight that co-infection with COVID-19 can exacerbate the severity of lung infection in TB patients by altering the microbial composition and the related pathways. Differential abundance analysis identified species such as Capnocytophaga gingivalis, Prevotella melaninogenica, Escherichia coli and Veillonella parvula that were significantly elevated in the TB+COVID group compared to the TB-only group. Functional profiling with PICRUSt2 indicated that pathways related to dysregulation of pulmonary lipids were significantly elevated in the TB+COVID group. The pipeline can be used to analyse similar co-infection datasets in the future. Finally, we address the problem of predicting resistance of the TB-causing bacteria to certain drugs using language models. Most existing methods rely only on mutations to predict resistance to different drugs and often struggle to predict resistance to second-line TB drugs. We therefore propose a novel method that uses contextual embeddings from large DNA language models, such as Evo2, to better predict drug resistance. Preliminary results from this ongoing work will be discussed. Overall, we have developed and applied different computational methods for efficient analysis of microbial sequencing data pertaining to TB, and are working with collaborators at the National Institute for Research in Tuberculosis to test the applicability of these methods in support of the Government's mission of TB eradication. |
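
The mixed-infection problem in the abstract can be illustrated with a toy model that is much simpler than Demixer. The sketch below is not the speaker's method: it assumes the per-strain mutation profiles are already known and fixed, and uses plain expectation-maximization (EM) to estimate only the strain proportions in one sample. It therefore cannot discover novel strains, which is the part the LDA-based model addresses. The strain names, marker names, probabilities and counts are all hypothetical.

```python
# Toy illustration only: EM for mixing proportions with FIXED strain profiles.
# This is not Demixer; all names and numbers below are hypothetical.

def estimate_proportions(counts, profiles, n_iter=200):
    """Estimate strain proportions in one sample.

    counts   : dict mapping a mutation "word" to its observed read count
    profiles : dict mapping a strain name to a dict {word: probability}
    """
    strains = list(profiles)
    pi = {s: 1.0 / len(strains) for s in strains}  # start from a uniform mix
    for _ in range(n_iter):
        new_pi = {s: 0.0 for s in strains}
        for word, c in counts.items():
            # E-step: responsibility of each strain for this word
            weights = {s: pi[s] * profiles[s].get(word, 0.0) for s in strains}
            norm = sum(weights.values())
            if norm == 0.0:
                continue  # word not explained by any known strain; skipped
            for s in strains:
                new_pi[s] += c * weights[s] / norm
        explained = sum(new_pi.values())
        if explained == 0.0:
            raise ValueError("no observed word matches any strain profile")
        # M-step: renormalize the expected counts into proportions
        pi = {s: new_pi[s] / explained for s in strains}
    return pi


# Two hypothetical strains, each with one private marker and one shared marker.
profiles = {
    "strain_A": {"snp_1": 0.5, "shared": 0.5},
    "strain_B": {"snp_2": 0.5, "shared": 0.5},
}
# Hypothetical read counts consistent with a 70:30 mixture of the two strains.
counts = {"snp_1": 35, "snp_2": 15, "shared": 50}

proportions = estimate_proportions(counts, profiles)
print({s: round(p, 3) for s, p in proportions.items()})
```

With these made-up counts the estimate settles at roughly 0.7 for strain_A and 0.3 for strain_B. Words that match no known profile are simply skipped here, which is exactly the limitation of reference-only approaches that the abstract describes.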