Leveraging Audio-Visual Data to Reduce the Multilingual Gap in Self-Supervised Speech Models
