Self-Taught Recognizer: Toward Unsupervised Adaptation for Speech Foundation Models

Neural Information Processing Systems 

We propose an unsupervised adaptation framework, Self-Taught Recognizer (STAR), which leverages unlabeled data to enhance the robustness of automatic speech recognition (ASR) systems in diverse target domains, such as noisy or accented speech. STAR is developed for prevalent speech foundation models built on Transformer-related architectures with auto-regressive decoding (e.g., Whisper, Canary, SeamlessM4T).