Title: Large Language Models for Speech Recognition and Understanding
Speaker: Andreas Stolcke (Uniphore)
Details: Thu, 3 Apr 2025, 11:30 AM @ CS-25
Abstract: The talk will summarize several lines of work that aim to leverage the knowledge encoded in LLMs for speech recognition and understanding tasks. One approach is to use LLMs to postprocess ASR outputs, either to rerank or to edit (correct) hypotheses. We show that this is possible even without fine-tuning the LLMs for the task, via instruction prompting and in-context learning. Another line of work augments LLMs originally trained on text with acoustic information, making them attend to cues that are unique to speech and conversation, while also leveraging pretrained acoustic embedding models. This yields multimodal models that process speech as more than transduced text, while still exploiting the long-span language "understanding" capabilities LLMs are known for. In one case, we show that LLMs can use acoustic information to model utterance sentiment and thereby improve word prediction. In another, we use LLMs to predict where in a conversation one should take the turn, back-channel, or continue listening. I will also discuss some applications of so-called "SpeechLLMs", which map acoustic speech encodings into an LLM's token embedding space, a technique that opens up new possibilities for end-to-end speech processing but also has some shortcomings. Finally, I will show that while LLMs are powerful, they are still limited and far from having common sense or general intelligence, so enthusiasm about them should be tempered.
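
To make the first idea concrete, here is a minimal sketch (my own illustration, not code from the talk) of zero-shot N-best reranking via instruction prompting. The model name, prompt wording, and the made-up 3-best list are assumptions; any instruction-following causal LM could stand in.

```python
# Sketch: ask an instruction-tuned LLM to pick the most plausible ASR
# hypothesis from an N-best list, with no fine-tuning (zero-shot prompting).
from transformers import pipeline

def rerank_hypotheses(hypotheses, generator):
    """Return the hypothesis the LLM judges most fluent and plausible."""
    numbered = "\n".join(f"{i + 1}. {h}" for i, h in enumerate(hypotheses))
    prompt = (
        "The following are candidate transcripts of the same utterance, "
        "produced by a speech recognizer. Reply with only the number of "
        "the most fluent and plausible transcript.\n"
        f"{numbered}\nAnswer:"
    )
    out = generator(prompt, max_new_tokens=4, do_sample=False)[0]["generated_text"]
    reply = out[len(prompt):].strip()
    # Fall back to the recognizer's top hypothesis if the reply is unparseable.
    try:
        idx = int(reply.split()[0].rstrip(".")) - 1
        return hypotheses[idx] if 0 <= idx < len(hypotheses) else hypotheses[0]
    except (ValueError, IndexError):
        return hypotheses[0]

# Illustrative 3-best list; strings are made up for the example.
nbest = [
    "i scream you scream we all scream for ice cream",
    "ice cream you scream we all scream for ice cream",
    "i scream you scream we all scream for i scream",
]
generator = pipeline("text-generation", model="Qwen/Qwen2.5-0.5B-Instruct")
print(rerank_hypotheses(nbest, generator))
```

Few-shot (in-context) variants simply prepend worked examples of N-best lists with their correct answers to the same prompt.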
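The "SpeechLLM" idea can likewise be sketched in a few lines. The following is an assumed, simplified illustration (not the speaker's system): a learned projection maps frames from a pretrained acoustic encoder into the LLM's token embedding space, and the projected frames are prepended to the text embeddings so the LLM consumes speech as if it were tokens. All dimensions are placeholders.

```python
# Sketch: project acoustic-encoder outputs into an LLM's token embedding
# space and concatenate with text embeddings (illustrative dimensions).
import torch
import torch.nn as nn

class SpeechAdapter(nn.Module):
    def __init__(self, acoustic_dim: int, llm_embed_dim: int):
        super().__init__()
        # Learned map from acoustic-embedding space to token-embedding space.
        self.proj = nn.Linear(acoustic_dim, llm_embed_dim)

    def forward(self, acoustic_frames: torch.Tensor) -> torch.Tensor:
        # acoustic_frames: (batch, n_frames, acoustic_dim),
        # e.g. from a pretrained encoder such as wav2vec 2.0.
        return self.proj(acoustic_frames)

# Placeholder sizes: 768-dim acoustic encoder, 4096-dim LLM embeddings.
adapter = SpeechAdapter(acoustic_dim=768, llm_embed_dim=4096)
speech = torch.randn(1, 50, 768)        # 50 frames of acoustic features
text_embeds = torch.randn(1, 10, 4096)  # embeddings of a 10-token text prompt
# The LLM then runs on the concatenated sequence in place of token embeddings.
inputs_embeds = torch.cat([adapter(speech), text_embeds], dim=1)
print(inputs_embeds.shape)              # torch.Size([1, 60, 4096])
```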