Do Discrete Self-Supervised Representations of Speech Capture Tone Distinctions?

Oct-25-2024–arXiv.org Artificial Intelligence

Discrete representations of speech, obtained from Self-Supervised Learning (SSL) foundation models, are widely used, especially where there are limited data for the downstream task, such as for a low-resource language. Typically, discretization of speech into a sequence of symbols is achieved by unsupervised clustering of the latents from an SSL model. Our study evaluates whether discrete symbols - found using k-means - adequately capture tone in two example languages, Mandarin and Yoruba. We compare latent vectors with discrete symbols, obtained from HuBERT base, MandarinHuBERT, or XLS-R, for vowel and tone classification. We find that using discrete symbols leads to a substantial loss of tone information, even for language-specialised SSL models. We suggest that discretization needs to be task-aware, particularly for tone-dependent downstream tasks.

artificial intelligence, machine learning, natural language, (19 more...)

arXiv.org Artificial Intelligence

Oct-25-2024

arXiv.org PDF

Add feedback

Country:
- South America (0.04)
- Africa > Nigeria (0.04)
- North America
  - United States > Rhode Island (0.04)
  - Mexico (0.04)
  - Canada
    - Quebec > Montreal (0.04)
    - Ontario > Toronto (0.04)
- Europe
  - Greece (0.04)
  - United Kingdom > England
    - Cambridgeshire > Cambridge (0.04)
  - Sweden > Stockholm
    - Stockholm (0.04)
  - Ireland > Leinster
    - County Dublin > Dublin (0.04)
- Asia
  - East Asia (0.04)
  - South Korea > Incheon
    - Incheon (0.04)
  - China > Shanghai
    - Shanghai (0.04)

Genre:
- Research Report (1.00)

Technology:
- Information Technology > Artificial Intelligence
  - Natural Language (1.00)
  - Machine Learning > Statistical Learning
    - Clustering (0.34)