u-HuBERT: Unified Mixed-Modal Speech Pretraining And Zero-Shot Transfer to Unlabeled Modality
–arXiv.org Artificial Intelligence
While audio-visual speech models can yield superior performance and robustness compared to audio-only models, their development and adoption are hindered by the lack of labeled and unlabeled audio-visual data and by the cost of deploying one model per modality. In this paper, we present u-HuBERT, a self-supervised pre-training framework that can leverage both multimodal and unimodal speech with a unified masked cluster prediction objective. By utilizing modality dropout during pre-training, we demonstrate that a single fine-tuned model can achieve performance on par with or better than state-of-the-art modality-specific models. Moreover, our model fine-tuned only on audio can perform well with audio-visual and visual speech input, achieving zero-shot modality generalization for multiple speech processing tasks.
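As a rough illustration of the modality dropout idea mentioned in the abstract, the sketch below randomly keeps both streams, audio only, or video only for each training example. The abstract does not give the sampling probabilities, the fusion scheme, or how a dropped stream is represented, so the function name `modality_dropout`, the default probabilities, the zero-filling of a dropped stream, and the per-frame concatenation are all assumptions made for illustration, not the authors' implementation.

```python
import random


def modality_dropout(audio, video, p_both=0.5, p_audio_only=0.25, rng=random):
    """Randomly keep both streams, audio only, or video only.

    audio, video: lists of per-frame feature vectors (lists of floats),
    one entry per frame, with the same number of frames in each stream.
    The probabilities are hypothetical defaults, not values from the paper.
    """
    if len(audio) != len(video):
        raise ValueError("audio and video must have the same number of frames")

    r = rng.random()
    if r < p_both:
        kept = "audio-visual"
    elif r < p_both + p_audio_only:
        kept = "audio"
        # Assumption: a dropped stream is zero-filled so the fused shape is unchanged.
        video = [[0.0] * len(frame) for frame in video]
    else:
        kept = "video"
        audio = [[0.0] * len(frame) for frame in audio]

    # Assumption: fuse by concatenating the two feature vectors frame by frame.
    fused = [a + v for a, v in zip(audio, video)]
    return fused, kept
```

Because the fused input has the same shape whichever streams survive, one downstream encoder can be trained on audio-visual, audio-only, and video-only examples alike, which is the property the abstract relies on for serving several modalities with a single model.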
Nov-27-2022
- Country:
- North America > United States (0.04)
- Europe
- Asia > Japan > Honshū > Kansai > Kyoto Prefecture > Kyoto (0.04)
- Genre:
- Research Report (1.00)
- Technology:
- Information Technology > Artificial Intelligence
- Vision (1.00)
- Speech > Speech Recognition (1.00)
- Natural Language > Machine Translation (1.00)
- Machine Learning > Statistical Learning (0.68)