AV-data2vec: Self-supervised Learning of Audio-Visual Speech Representations with Contextualized Target Representations
Jiachen Lian, Alexei Baevski, Wei-Ning Hsu, Michael Auli
arXiv.org Artificial Intelligence
Self-supervision has shown great potential for audio-visual speech recognition by vastly reducing the amount of labeled data required to build good systems. However, existing methods are either not entirely end-to-end or do not train joint representations of both modalities. In this paper, we introduce AV-data2vec, which addresses these challenges by building audio-visual representations through the prediction of contextualized target representations, an approach that has been successful in the uni-modal case. The model uses a shared transformer encoder for both audio and video and can combine both modalities to improve speech recognition. Results on LRS3 show that AV-data2vec consistently outperforms existing methods under most settings.
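The objective described in the abstract follows the data2vec recipe: a student network sees a masked input and regresses the contextualized representations produced by an EMA teacher that sees the unmasked input, averaged over layers. A minimal NumPy sketch of this idea follows; the two-layer toy "encoder", the fixed mask, and all hyperparameters are illustrative stand-ins, not the paper's actual architecture:

```python
import numpy as np

rng = np.random.default_rng(0)
D = 8  # feature dimension of the toy model


def encoder(x, weights):
    """Toy stand-in for a shared transformer: returns every layer's output."""
    outs, h = [], x
    for W in weights:
        h = np.tanh(h @ W)
        outs.append(h)
    return outs


# Student parameters and an EMA teacher initialized as a copy.
student_w = [rng.standard_normal((D, D)) * 0.1 for _ in range(2)]
teacher_w = [w.copy() for w in student_w]

x = rng.standard_normal((10, D))        # 10 frames of fused audio-visual features
mask = np.zeros(10, dtype=bool)
mask[:5] = True                          # positions the student must predict

# Teacher sees the full input; the target averages its layer outputs
# (the "contextualized target representation").
target = np.mean(encoder(x, teacher_w), axis=0)

# Student sees the masked input and predicts from its top layer.
x_masked = x.copy()
x_masked[mask] = 0.0
pred = encoder(x_masked, student_w)[-1]

# Regression loss computed only at masked positions.
loss = float(np.mean((pred[mask] - target[mask]) ** 2))

# Teacher weights track the student via an exponential moving average.
tau = 0.999
teacher_w = [tau * tw + (1 - tau) * sw for tw, sw in zip(teacher_w, student_w)]
```

In the full method, the same shared encoder processes audio, video, or both, so the teacher's targets are themselves audio-visual; the sketch above only illustrates the masked-prediction-plus-EMA training loop.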
Feb-9-2023