Self-supervised Fine-tuning for Improved Content Representations by Speaker-invariant Clustering

Chang, Heng-Jui, Liu, Alexander H., Glass, James

May 18, 2023 – arXiv.org Artificial Intelligence

Self-supervised speech representation models have succeeded in various tasks, but improving them for content-related problems using unlabeled data is challenging. We propose speaker-invariant clustering (Spin), a novel self-supervised learning method that clusters speech representations and performs swapped prediction between the original and speaker-perturbed utterances. Spin disentangles speaker information and preserves content representations with just 45 minutes of fine-tuning on a single GPU. Spin improves pre-trained networks and outperforms prior methods in speech recognition and acoustic unit discovery.
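The swapped prediction idea mentioned above can be illustrated with a minimal sketch. The abstract does not give the loss, so this follows the generic swapped-prediction formulation (each view predicts the cluster assignment of the other view) and should not be read as the authors' implementation. The function names, the temperatures, and the use of a sharpened softmax as the target distribution are all assumptions made for illustration. The inputs are hypothetical per-frame cluster scores for an utterance and its speaker-perturbed copy.

```python
import math


def softmax(scores, temperature=1.0):
    """Numerically stable softmax over one frame's cluster scores."""
    scaled = [s / temperature for s in scores]
    peak = max(scaled)
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(target, pred, eps=1e-12):
    """Cross-entropy between a target distribution and a prediction."""
    return -sum(t * math.log(max(p, eps)) for t, p in zip(target, pred))


def swapped_prediction_loss(scores_orig, scores_pert,
                            pred_temp=0.1, target_temp=0.05):
    """Average swapped-prediction loss over aligned frames.

    scores_orig / scores_pert: lists of frames, each a list of K cluster
    scores for the original and the speaker-perturbed utterance.
    Assumption: targets are a sharpened softmax of the same scores; a real
    system may compute them differently (e.g. with a balancing step).
    """
    total = 0.0
    for z_o, z_p in zip(scores_orig, scores_pert):
        p_o = softmax(z_o, pred_temp)    # prediction from the original view
        p_p = softmax(z_p, pred_temp)    # prediction from the perturbed view
        q_o = softmax(z_o, target_temp)  # stand-in target, original view
        q_p = softmax(z_p, target_temp)  # stand-in target, perturbed view
        # Swap: each view is trained to predict the other view's target.
        total += 0.5 * (cross_entropy(q_p, p_o) + cross_entropy(q_o, p_p))
    return total / len(scores_orig)


# Two frames, three clusters. In the first pair both views agree on the
# cluster; in the second pair the perturbed view lands in another cluster.
agree = swapped_prediction_loss([[5.0, 0.0, 0.0], [0.0, 5.0, 0.0]],
                                [[5.0, 0.0, 0.0], [0.0, 5.0, 0.0]])
disagree = swapped_prediction_loss([[5.0, 0.0, 0.0], [0.0, 5.0, 0.0]],
                                   [[0.0, 5.0, 0.0], [5.0, 0.0, 0.0]])
```

In this sketch the loss is small when both views fall into the same cluster and large when speaker perturbation changes the assignment, which is the pressure that pushes the representation toward speaker invariance.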

artificial intelligence, machine learning, representation

Country:
- North America > United States
  - Massachusetts > Middlesex County > Cambridge (0.04)
- Europe > Poland
  - Lower Silesia Province > Wroclaw (0.04)

Genre:
- Research Report > Experimental Study (0.46)

Technology:
- Information Technology > Artificial Intelligence
  - Speech > Speech Recognition (0.36)
  - Machine Learning
    - Inductive Learning (0.55)
    - Neural Networks (0.48)
    - Statistical Learning (0.48)
    - Unsupervised or Indirectly Supervised Learning (0.35)
