Self-Supervised Speech Representation Learning: A Review

Mohamed, Abdelrahman, Lee, Hung-yi, Borgholt, Lasse, Havtorn, Jakob D., Edin, Joakim, Igel, Christian, Kirchhoff, Katrin, Li, Shang-Wen, Livescu, Karen, Maaløe, Lars, Sainath, Tara N., Watanabe, Shinji

Oct-27-2022–arXiv.org Artificial Intelligence

Although supervised deep learning has revolutionized speech and audio processing, it has necessitated the building of specialist models for individual tasks and application scenarios. It is likewise difficult to apply supervised methods to dialects and languages for which only limited labeled data is available. Self-supervised representation learning methods promise a single universal model that would benefit a wide variety of tasks and domains. Such methods have shown success in natural language processing and computer vision domains, achieving new levels of performance while reducing the number of labels required for many downstream scenarios. Speech representation learning is experiencing similar progress in three main categories: generative, contrastive, and predictive methods. Other approaches rely on multi-modal data for pre-training, mixing text or visual data streams with speech. Although self-supervised speech representation is still a nascent research area, it is closely related to acoustic word embedding and learning with zero lexical resources, both of which have seen active research for many years. This review presents approaches for self-supervised speech representation learning and their connection to other research areas. Since many current methods focus solely on automatic speech recognition as a downstream task, we review recent efforts on benchmarking learned representations to extend the application beyond speech recognition.

artificial intelligence, machine learning, natural language

Country:
- Oceania > Australia (0.04)
- Africa (0.04)
- North America
  - United States
    - New York (0.04)
    - California (0.04)
    - Pennsylvania > Allegheny County
      - Pittsburgh (0.04)
    - Oregon > Multnomah County
      - Portland (0.04)
    - Louisiana > Orleans Parish
      - New Orleans (0.04)
    - Illinois > Cook County
      - Chicago (0.04)
  - Canada
    - Quebec > Montreal (0.04)
    - Alberta > Census Division No. 15
      - Improvement District No. 9 > Banff (0.04)
- Europe
  - Germany (0.04)
  - Spain > Catalonia
    - Barcelona Province > Barcelona (0.04)
  - Romania > Sud - Muntenia Development Region
    - Giurgiu County > Giurgiu (0.04)
  - Italy > Calabria
    - Catanzaro Province > Catanzaro (0.04)
  - Denmark > Capital Region
    - Copenhagen (0.04)
- Asia
  - Middle East > Jordan (0.04)
  - Taiwan (0.04)
  - Japan > Honshū
    - Chūbu > Aichi Prefecture > Nagoya (0.04)
  - China > Beijing
    - Beijing (0.04)

Genre:
- Overview (1.00)
- Research Report > New Finding (0.67)

Industry:
- Information Technology > Services (0.67)
- Education > Educational Setting (0.45)
- Media > News (0.45)

Technology:
- Information Technology > Artificial Intelligence
  - Vision (1.00)
  - Speech > Speech Recognition (1.00)
  - Natural Language > Text Processing (1.00)
  - Machine Learning
    - Statistical Learning (1.00)
    - Neural Networks > Deep Learning (1.00)
    - Learning Graphical Models > Undirected Networks
      - Markov Models (0.67)
