Title | : | Empowering low-resource language ASR via large-scale machine-generated datasets |
Speaker | : | Kaushal Santosh Bhogale (IITM) |
Details | : | Tue, 29 Apr, 2025 12:00 PM @ SSB 334 |
Abstract | : | Building accurate automatic speech recognition (ASR) systems for low-resource languages is challenging due to the scarcity of labeled data. This work addresses this fundamental bottleneck by proposing scalable and cost-effective frameworks to generate large-scale machine-labeled and unlabeled speech datasets for Indian languages. First, we introduce Shrutilipi, a high-quality supervised dataset containing over 6,400 hours of aligned speech-text pairs across 12 Indian languages, constructed by mining All India Radio archives and aligning transcripts using a robust variant of the Needleman-Wunsch algorithm. Human evaluations confirm the quality and diversity of Shrutilipi, and downstream experiments show significant reductions in word error rate (WER) when it is used to train both large and efficient ASR models. Next, to address the lack of pretraining resources, we develop a framework for collecting large-scale raw audio, resulting in MahaDhwani, a corpus of 279K hours spanning 22 Indian languages. Leveraging this data, we pretrain multilingual Conformer-based models and introduce a hybrid multi-softmax decoder to support both cross-lingual knowledge transfer and language-specific learning. Evaluations on the IndicVoices benchmark demonstrate notable gains, especially in low-resource settings. Finally, we explore pseudo-labeling on the MahaDhwani dataset to further improve ASR performance. We propose a modular framework that combines multiple ASR models and evaluation heuristics to filter high-quality pseudo-labeled data, and we validate it on the IndicYT benchmark. Results show consistent improvements when this data augments existing training sets, without degrading out-of-domain performance. Together, these contributions demonstrate the power of large-scale machine-generated datasets in enabling accurate and robust ASR systems for low-resource languages, and lay the foundation for scalable data-centric approaches in multilingual speech recognition. |
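For readers unfamiliar with the alignment step named in the abstract, the following is a minimal sketch of the textbook Needleman-Wunsch global alignment applied to word sequences. It is not the robust variant used to build Shrutilipi, whose details the abstract does not give. The function name `needleman_wunsch`, the scoring values (`match`, `mismatch`, `gap`), and the example sentences are all assumptions chosen for illustration.

```python
def needleman_wunsch(ref, hyp, match=1, mismatch=-1, gap=-1):
    """Globally align two word lists; return (score, aligned word pairs).

    Illustrative textbook version with hypothetical scoring values.
    A pair (word, None) or (None, word) marks a gap on one side.
    """
    n, m = len(ref), len(hyp)
    # score[i][j] = best score for aligning ref[:i] with hyp[:j]
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap

    def sub(i, j):
        # Score for pairing ref[i-1] with hyp[j-1]
        return match if ref[i - 1] == hyp[j - 1] else mismatch

    for i in range(1, n + 1):
        for j in range(1, m + 1):
            score[i][j] = max(
                score[i - 1][j - 1] + sub(i, j),  # pair the two words
                score[i - 1][j] + gap,            # ref word left unpaired
                score[i][j - 1] + gap,            # hyp word left unpaired
            )

    # Trace back from the bottom-right cell to recover one optimal alignment
    pairs = []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + sub(i, j):
            pairs.append((ref[i - 1], hyp[j - 1]))
            i, j = i - 1, j - 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            pairs.append((ref[i - 1], None))
            i -= 1
        else:
            pairs.append((None, hyp[j - 1]))
            j -= 1
    pairs.reverse()
    return score[n][m], pairs


# Hypothetical example: a transcript versus a recognizer output missing one word
total, alignment = needleman_wunsch(
    "the news at nine".split(), "the news nine".split()
)
print(total)      # 2
print(alignment)  # [('the', 'the'), ('news', 'news'), ('at', None), ('nine', 'nine')]
```

In this sketch, gaps show where one sequence has a word the other lacks. A mining pipeline could plausibly use such gaps to decide which transcript spans match the audio, though that use is an assumption here and not a description of the speaker's method.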