AV-data2vec: Self-supervised Learning of Audio-Visual Speech Representations with Contextualized Target Representations
Jiachen Lian, Alexei Baevski, Wei-Ning Hsu, Michael Auli
arXiv.org Artificial Intelligence
Self-supervision has shown great potential for audio-visual speech recognition by vastly reducing the amount of labeled data required to build good systems. However, existing methods are either not entirely end-to-end or do not train joint representations of both modalities. In this paper, we introduce AV-data2vec, which addresses these challenges by building audio-visual representations through the prediction of contextualized target representations, an approach that has been successful in the uni-modal case. The model uses a shared transformer encoder for both audio and video and can combine both modalities to improve speech recognition. Results on LRS3 show that AV-data2vec consistently outperforms existing methods under most settings.
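The objective described in the abstract follows the data2vec recipe: a student network sees a masked input and regresses the contextualized representations produced by an EMA teacher that sees the unmasked input, averaged over layers. A minimal NumPy sketch of this idea follows; the two-layer toy "encoder", the fixed mask, and all hyperparameters are illustrative stand-ins, not the paper's actual architecture:

```python
import numpy as np

rng = np.random.default_rng(0)
D = 8  # feature dimension of the toy model


def encoder(x, weights):
    """Toy stand-in for a shared transformer: returns every layer's output."""
    outs, h = [], x
    for W in weights:
        h = np.tanh(h @ W)
        outs.append(h)
    return outs


# Student parameters and an EMA teacher initialized as a copy.
student_w = [rng.standard_normal((D, D)) * 0.1 for _ in range(2)]
teacher_w = [w.copy() for w in student_w]

x = rng.standard_normal((10, D))        # 10 frames of fused audio-visual features
mask = np.zeros(10, dtype=bool)
mask[:5] = True                          # positions the student must predict

# Teacher sees the full input; the target averages its layer outputs
# (the "contextualized target representation").
target = np.mean(encoder(x, teacher_w), axis=0)

# Student sees the masked input and predicts from its top layer.
x_masked = x.copy()
x_masked[mask] = 0.0
pred = encoder(x_masked, student_w)[-1]

# Regression loss computed only at masked positions.
loss = float(np.mean((pred[mask] - target[mask]) ** 2))

# Teacher weights track the student via an exponential moving average.
tau = 0.999
teacher_w = [tau * tw + (1 - tau) * sw for tw, sw in zip(teacher_w, student_w)]
```

In the full method, the same shared encoder processes audio, video, or both, so the teacher's targets are themselves audio-visual; the sketch above only illustrates the masked-prediction-plus-EMA training loop.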
Feb-9-2023