Towards Robust Speech Representation Learning for Thousands of Languages

Chen, William, Zhang, Wangyou, Peng, Yifan, Li, Xinjian, Tian, Jinchuan, Shi, Jiatong, Chang, Xuankai, Maiti, Soumi, Livescu, Karen, Watanabe, Shinji

Jul-2-2024, arXiv.org Artificial Intelligence

Self-supervised learning (SSL) has helped extend speech technologies to more languages by reducing the need for labeled data. However, models are still far from supporting the world's 7000+ languages. We propose XEUS, a Cross-lingual Encoder for Universal Speech, trained on over 1 million hours of data across 4057 languages, extending the language coverage of SSL models 4-fold. We combine 1 million hours of speech from existing publicly accessible corpora with a newly created corpus of 7400+ hours from 4057 languages, which will be publicly released. To handle the diverse conditions of multilingual speech data, we augment the typical SSL masked prediction approach with a novel dereverberation objective, increasing robustness. We evaluate XEUS on several benchmarks and show that it consistently outperforms or achieves comparable results to state-of-the-art (SOTA) SSL models across a variety of tasks. XEUS sets a new SOTA on the ML-SUPERB benchmark: it outperforms MMS 1B and w2v-BERT 2.0 v2 by 0.8% and 4.4% respectively, despite having fewer parameters or less pre-training data. Checkpoints, code, and data are available at https://www.wavlab.org/activities/2024/xeus/.
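
The abstract does not spell out how the dereverberation objective is combined with masked prediction. As a rough illustration only, the sketch below assumes one plausible setup: the model input is a reverberant copy of the utterance, while the masked-prediction targets are computed from the clean signal, so the model has to look past the reverberation to predict them. Everything here is hypothetical and not taken from the XEUS code: the function names, the toy exponentially decaying noise used as a room impulse response, the random codebook standing in for learned cluster targets, and all parameter values.

```python
import numpy as np

def synthetic_rir(rng, sample_rate=16000, rt60=0.4, length_s=0.5):
    """Toy room impulse response (assumption): white noise with a 60 dB decay over rt60 seconds."""
    n = int(sample_rate * length_s)
    t = np.arange(n) / sample_rate
    envelope = 10.0 ** (-3.0 * t / rt60)  # amplitude falls by 60 dB at t = rt60
    rir = rng.standard_normal(n) * envelope
    rir[0] = 1.0  # direct path
    return rir

def reverberate(clean, rir):
    """Simulate reverberation by convolving with the RIR, truncated to the input length."""
    return np.convolve(clean, rir)[: len(clean)]

def toy_targets(clean, codebook, frame_len=320):
    """Assign each clean frame to its nearest codebook entry (stand-in for learned discrete targets)."""
    n_frames = len(clean) // frame_len
    frames = clean[: n_frames * frame_len].reshape(n_frames, frame_len)
    dists = ((frames[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=-1)
    return dists.argmin(axis=1)

def make_example(clean, rng, codebook, mask_prob=0.3):
    """Build one training example: reverberant input, frame mask, and targets from the clean signal."""
    wet = reverberate(clean, synthetic_rir(rng))
    targets = toy_targets(clean, codebook)          # depends only on the clean audio
    mask = rng.random(len(targets)) < mask_prob     # frames whose targets the model must predict
    return wet, mask, targets

rng = np.random.default_rng(0)
clean = rng.standard_normal(16000)        # one second of placeholder "speech" at 16 kHz
codebook = rng.standard_normal((8, 320))  # hypothetical codebook with 8 entries
wet, mask, targets = make_example(clean, rng, codebook)
```

In this sketch the targets stay fixed no matter which impulse response is sampled, which is the property that would push an encoder toward reverberation-invariant representations. A real pipeline would use recorded or simulated room responses and learned targets instead of the toy stand-ins used here.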

Country:
- South America > Chile
  - Santiago Metropolitan Region > Santiago Province > Santiago (0.04)
- North America
  - Mexico > Puebla (0.04)
  - United States
    - Pennsylvania > Allegheny County
      - Pittsburgh (0.04)
    - Illinois > Cook County
      - Chicago (0.04)
- Europe
  - United Kingdom > England
    - Cambridgeshire > Cambridge (0.04)
  - France > Provence-Alpes-Côte d'Azur
    - Bouches-du-Rhône > Marseille (0.04)
- Asia
  - East Asia (0.04)
  - South Korea > Gyeonggi-do
    - Suwon (0.04)
  - China
    - Shanghai > Shanghai (0.04)
    - Beijing > Beijing (0.04)

Genre:
- Research Report (0.82)

Industry:
- Media (0.68)

Technology:
- Information Technology > Artificial Intelligence
  - Speech > Speech Recognition (1.00)
  - Machine Learning > Neural Networks
    - Deep Learning (0.67)
