Wu, Minhua
Turn-taking and Backchannel Prediction with Acoustic and Large Language Model Fusion
Wang, Jinhan, Chen, Long, Khare, Aparna, Raju, Anirudh, Dheram, Pranav, He, Di, Wu, Minhua, Stolcke, Andreas, Ravichandran, Venkatesh
We propose an approach for continuous prediction of turn-taking and backchanneling locations in spoken dialogue by fusing a neural acoustic model with a large language model (LLM). Experiments on the Switchboard human-human conversation dataset demonstrate that our approach consistently outperforms single-modality baseline models. We also develop a novel multi-task instruction fine-tuning strategy to further leverage LLM-encoded knowledge for understanding the tasks and conversational contexts, leading to additional improvements. Our approach demonstrates the potential of combining LLMs and acoustic models for more natural, conversational interaction between humans and speech-enabled AI agents.
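A minimal sketch of one way such acoustic/LLM fusion could be wired up for frame-level turn-taking and backchannel classification is shown below; the module name, dimensions, label set, and concatenation-based fusion are illustrative assumptions, not the architecture from the paper.

```python
# Hypothetical sketch of acoustic/LLM late fusion for turn-taking and
# backchannel prediction; dimensions, label set, and fusion strategy are
# illustrative assumptions, not the paper's exact design.
import torch
import torch.nn as nn

class FusionTurnTakingClassifier(nn.Module):
    def __init__(self, acoustic_dim=256, llm_dim=4096, hidden_dim=256, num_classes=3):
        super().__init__()
        # Project the (typically much larger) LLM hidden state down to the
        # same space as the acoustic embedding before fusion.
        self.llm_proj = nn.Linear(llm_dim, hidden_dim)
        self.acoustic_proj = nn.Linear(acoustic_dim, hidden_dim)
        # Concatenation fusion followed by a per-frame classifier over
        # e.g. {continue, turn-take, backchannel}.
        self.classifier = nn.Sequential(
            nn.Linear(2 * hidden_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, num_classes),
        )

    def forward(self, acoustic_emb, llm_emb):
        # acoustic_emb: (batch, frames, acoustic_dim) from the acoustic encoder
        # llm_emb:      (batch, frames, llm_dim) LLM states aligned to frames
        fused = torch.cat(
            [self.acoustic_proj(acoustic_emb), self.llm_proj(llm_emb)], dim=-1
        )
        return self.classifier(fused)  # per-frame logits
```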
Guided contrastive self-supervised pre-training for automatic speech recognition
Khare, Aparna, Wu, Minhua, Bhati, Saurabhchand, Droppo, Jasha, Maas, Roland
Contrastive Predictive Coding (CPC) is a representation learning method that maximizes the mutual information between intermediate latent representations and the output of a given model, and it can be used to effectively initialize the encoder of an Automatic Speech Recognition (ASR) model. We present a novel modification of CPC called Guided Contrastive Predictive Coding (GCPC), which maximizes the mutual information between representations from a prior-knowledge model and the output of the model being pre-trained, allowing prior knowledge to be injected during pre-training. We validate our method on three ASR tasks: German, French and English. GCPC outperforms CPC pre-training on all three datasets: relative to training from scratch, it reduces the Word Error Rate (WER) by 4.44%, 6.55% and 15.43% on the German, French and English (LibriSpeech) tasks respectively, whereas CPC pre-training brings only 2.96%, 1.01% and 14.39% relative WER reductions.
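The guided contrastive idea can be illustrated with an InfoNCE-style objective in which time-aligned outputs of the encoder being pre-trained are contrasted against representations from a prior-knowledge model; the sketch below is an approximation under that assumption and does not reproduce the exact GCPC loss.

```python
# Illustrative InfoNCE-style objective for guided contrastive pre-training:
# encoder outputs are contrasted against representations from a separate
# prior-knowledge model. A minimal sketch, not the paper's exact formulation.
import torch
import torch.nn.functional as F

def guided_contrastive_loss(encoder_out, prior_repr, temperature=0.1):
    # encoder_out: (batch, dim) outputs of the encoder being pre-trained
    # prior_repr:  (batch, dim) frame-aligned representations from a
    #              prior-knowledge model (e.g. a phone classifier)
    encoder_out = F.normalize(encoder_out, dim=-1)
    prior_repr = F.normalize(prior_repr, dim=-1)
    # Similarity of every encoder output against every prior representation;
    # the diagonal holds the positive (time-aligned) pairs, the off-diagonal
    # entries act as negatives drawn from other frames in the batch.
    logits = encoder_out @ prior_repr.t() / temperature
    targets = torch.arange(logits.size(0), device=logits.device)
    return F.cross_entropy(logits, targets)
```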
Improving noise robustness of automatic speech recognition via parallel data and teacher-student learning
Mošner, Ladislav, Wu, Minhua, Raju, Anirudh, Parthasarathi, Sree Hari Krishnan, Kumatani, Kenichi, Sundaram, Shiva, Maas, Roland, Hoffmeister, Björn
In this work, we adopt the teacher-student (T/S) learning technique using a parallel clean and noisy corpus to improve automatic speech recognition (ASR) performance under multimedia noise. On top of that, we apply a logits selection method that preserves only the k highest values, preventing the transfer of wrongly emphasized knowledge from the teacher and reducing the bandwidth needed for transferring data. We incorporate up to 8,000 hours of untranscribed data for training and present results on sequence-trained models in addition to cross-entropy-trained ones. The best sequence-trained student model yields relative word error rate (WER) reductions of approximately 10.1%, 28.7% and 19.6% on our clean, simulated noisy and real test sets, respectively, compared to a sequence-trained teacher.
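The top-k logits selection described above can be sketched as a cross-entropy between the student's posteriors and a truncated, renormalized teacher distribution; the choice of k and the renormalization details below are illustrative assumptions rather than the paper's exact recipe.

```python
# Minimal sketch of teacher-student training with top-k logit selection;
# the value of k and the renormalization are assumptions for illustration.
import torch
import torch.nn.functional as F

def topk_teacher_student_loss(student_logits, teacher_logits, k=20):
    # student_logits, teacher_logits: (batch, frames, num_senones)
    # Keep only the k largest teacher posteriors per frame, so the student is
    # not pushed to match the teacher's near-zero tail and less data needs to
    # be transferred when teacher targets are computed offline.
    teacher_post = F.softmax(teacher_logits, dim=-1)
    topk_vals, topk_idx = teacher_post.topk(k, dim=-1)
    topk_vals = topk_vals / topk_vals.sum(dim=-1, keepdim=True)  # renormalize
    student_logprob = F.log_softmax(student_logits, dim=-1)
    # Cross entropy against the truncated teacher distribution, evaluated
    # only at the selected senone indices.
    selected_logprob = student_logprob.gather(-1, topk_idx)
    return -(topk_vals * selected_logprob).sum(dim=-1).mean()
```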