DinoSR: Self-Distillation and Online Clustering for Self-supervised Speech Representation Learning

Liu, Alexander H., Chang, Heng-Jui, Auli, Michael, Hsu, Wei-Ning, Glass, James R.

May-17-2023–arXiv.org Artificial Intelligence

In this paper, we introduce self-distillation and online clustering for self-supervised speech representation learning (DinoSR), which combines masked language modeling, self-distillation, and online clustering. We show that these concepts complement each other and result in a strong representation learning model for speech. DinoSR first extracts contextualized embeddings from the input audio with a teacher network, then runs an online clustering system on the embeddings to yield a machine-discovered phone inventory, and finally uses the discretized tokens to guide a student network. We show that DinoSR surpasses previous state-of-the-art performance in several downstream tasks, and we provide a detailed analysis of the model and the learned discrete units. The source code will be made available after the anonymity period.
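
The three steps named in the abstract (teacher embeddings, online clustering, discrete targets for the student) can be illustrated with a small NumPy sketch. The abstract does not say how the teacher is maintained or how the clusters are updated, so the exponential-moving-average updates, the Euclidean nearest-neighbor assignment, the linear toy encoders, and every function name below are assumptions made for illustration, not the authors' implementation.

```python
import numpy as np

def assign_codes(embeddings, codebook):
    """Index of the nearest codeword (squared Euclidean distance) per frame."""
    # (T, 1, D) - (1, K, D) broadcasts to (T, K, D); sum over D gives (T, K).
    dists = ((embeddings[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=-1)
    return dists.argmin(axis=1)

def update_codebook(sums, counts, embeddings, codes, decay=0.9):
    """Assumed online clustering rule: EMA of per-code sums and counts."""
    one_hot = np.eye(sums.shape[0])[codes]               # (T, K)
    sums = decay * sums + (1.0 - decay) * (one_hot.T @ embeddings)
    counts = decay * counts + (1.0 - decay) * one_hot.sum(axis=0)
    codebook = sums / np.maximum(counts, 1e-8)[:, None]  # codeword = running mean
    return sums, counts, codebook

def masked_cross_entropy(logits, codes, mask):
    """Mean cross-entropy between student logits and codes on masked frames."""
    shifted = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    picked = log_probs[np.arange(len(codes)), codes]
    return -(picked * mask).sum() / max(mask.sum(), 1.0)

def ema_update(teacher, student, decay=0.999):
    """Assumed self-distillation rule: teacher tracks the student by EMA."""
    return decay * teacher + (1.0 - decay) * student

def demo_step(seed=0, T=20, D_in=8, D=4, K=5):
    """One toy training step; linear maps stand in for the real encoders."""
    rng = np.random.default_rng(seed)
    feats = rng.normal(size=(T, D_in))         # stand-in for audio features
    student_w = rng.normal(size=(D_in, D))
    teacher_w = student_w.copy()
    head_w = rng.normal(size=(D, K))           # student head predicting codes
    codebook = rng.normal(size=(K, D))
    sums, counts = codebook.copy(), np.ones(K)
    mask = (rng.random(T) < 0.5).astype(float)

    # 1) Teacher embeds the unmasked input.
    teacher_emb = feats @ teacher_w
    # 2) Online clustering discretizes the embeddings and refines the codebook.
    codes = assign_codes(teacher_emb, codebook)
    sums, counts, codebook = update_codebook(sums, counts, teacher_emb, codes)
    # 3) Student sees masked input and is scored against the teacher's codes.
    #    A real student would be contextual; this linear toy is not.
    student_in = feats * (1.0 - mask)[:, None]
    logits = (student_in @ student_w) @ head_w
    loss = masked_cross_entropy(logits, codes, mask)
    # The gradient step on the student is omitted; the teacher then follows it.
    teacher_w = ema_update(teacher_w, student_w)
    return loss

print(demo_step())
```

In this sketch the codebook is never trained by gradient descent: it only follows the teacher's embeddings, and the teacher only follows the student, so the student is the sole component that would receive gradients.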
