Title | : | Novel Methods for Information Bottleneck based Audio Diarization |
Speaker | : | Nauman Abdul Razzak Dawalatabad (IITM) |
Details | : | Thu, 8 Aug, 2019 3:00 PM @ AM Turing Hall |
Abstract | : | Given an audio signal, speaker diarization answers the question “Who spoke when?”. A speaker diarization system annotates the input audio with information that attributes temporal regions of the signal to their respective sources, which may include both speech and non-speech events. Diarization is a clustering process and is mostly performed in a bottom-up agglomerative fashion. Short speech segments are clustered such that each cluster is speaker homogeneous. Diarization is used as a preprocessing module in many speech-related applications. The information bottleneck (IB) based approach is a widely used technique for diarizing meeting recordings. Owing to the non-parametric nature of the IB approach, the run time of the clustering algorithm is small. In this approach, the audio is divided into short segments of uniform duration. Posteriors extracted from the short segments are then clustered in a bottom-up manner. Normalized mutual information is used as the stopping criterion to end the clustering process and obtain the final speaker clusters. The major challenges in building a diarization system lie in the initialization of segments for clustering, obtaining speaker discriminative features, deciding on the number of speakers, and detecting overlapped speaker segments. In this work, we address two of these challenges: (i) segment initialization for IB based systems, and (ii) the use of recording-specific speaker discriminative features in the diarization process. The first proposed approach uses a varying length segment initialization technique for the IB based system. This approach focuses on better initialization of the speech segments using phoneme rate as side information. In the second part of the work, we propose a Two-Pass IB (TPIB) based system that focuses on speaker discriminative features for diarization. The TPIB system uses the first pass IB output to refine speaker boundaries in the second pass.
We discuss different variants of the TPIB system that use a multi-layer feedforward neural network and linear discriminant analysis. We show that when both proposed approaches are used in tandem, the performance of the diarization system improves significantly. Finally, we present diarization-enhanced source separation as applied to Taniavartanam in Carnatic music concerts. We use diarization to obtain separate clusters for different percussive instruments. The cluster with overlapping segments is identified, and source separation is performed only on the segments in that cluster. We show that using diarization as a front-end improves the overall quality of the source-separated output. |
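The bottom-up IB clustering described in the abstract can be illustrated with a minimal sketch. It is not the speaker's implementation: it assumes uniform segment priors, a merge cost equal to the weighted Jensen-Shannon divergence between cluster posteriors (the large-beta simplification of agglomerative IB), and a rule that refuses any merge that would push the normalized mutual information below a threshold. The function names, the `nmi_threshold` parameter, and the toy posteriors are all hypothetical.

```python
import numpy as np


def mutual_information(p_c, p_y_given_c):
    """I(C;Y) in nats, given cluster priors p(c) and rows p(y|c)."""
    p_y = p_c @ p_y_given_c
    joint = p_c[:, None] * p_y_given_c
    denom = p_c[:, None] * p_y[None, :]
    mask = joint > 0
    return float(np.sum(joint[mask] * np.log(joint[mask] / denom[mask])))


def merge_cost(p_i, p_j, q_i, q_j):
    """Loss in I(C;Y) caused by merging two clusters:
    (p_i + p_j) times the weighted Jensen-Shannon divergence."""
    w = p_i + p_j
    pi_i, pi_j = p_i / w, p_j / w
    mix = pi_i * q_i + pi_j * q_j

    def kl(a, b):
        m = a > 0
        return np.sum(a[m] * np.log(a[m] / b[m]))

    return w * (pi_i * kl(q_i, mix) + pi_j * kl(q_j, mix))


def agglomerative_ib(p_y_given_x, nmi_threshold=0.5):
    """Greedy bottom-up clustering of segment posteriors.

    p_y_given_x: (n_segments, n_components) array, each row sums to 1.
    Returns a list of clusters, each a list of segment indices.
    Assumption: segments are equally likely (uniform duration).
    """
    n = p_y_given_x.shape[0]
    p_c = np.full(n, 1.0 / n)
    q = np.asarray(p_y_given_x, dtype=float).copy()
    clusters = [[i] for i in range(n)]
    i_xy = mutual_information(p_c, q)  # information before any merge

    while len(clusters) > 1:
        # Find the pair whose merge loses the least information.
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                cost = merge_cost(p_c[a], p_c[b], q[a], q[b])
                if best is None or cost < best[0]:
                    best = (cost, a, b)
        _, a, b = best

        # Build the merged state tentatively (a < b, so index a stays valid).
        w = p_c[a] + p_c[b]
        merged_q = (p_c[a] * q[a] + p_c[b] * q[b]) / w
        new_p = np.delete(p_c, b)
        new_p[a] = w
        new_q = np.delete(q, b, axis=0)
        new_q[a] = merged_q

        # Stopping criterion: NMI = I(C;Y) / I(X;Y) must stay above threshold.
        if i_xy > 0 and mutual_information(new_p, new_q) / i_xy < nmi_threshold:
            break

        p_c, q = new_p, new_q
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]

    return clusters


# Toy posteriors over two components for six segments (made-up values).
toy_posteriors = np.array([
    [0.90, 0.10], [0.85, 0.15], [0.95, 0.05],
    [0.10, 0.90], [0.15, 0.85], [0.05, 0.95],
])
print(agglomerative_ib(toy_posteriors, nmi_threshold=0.5))
```

The exhaustive pair search is quadratic per merge, which is acceptable for a sketch. A practical system would cache pairwise costs and update only those involving the merged cluster.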