Fill in the Gap! Combining Self-supervised Representation Learning with Neural Audio Synthesis for Speech Inpainting

Asaad, Ihab, Jacquelin, Maxime, Perrotin, Olivier, Girin, Laurent, Hueber, Thomas

May-30-2024–arXiv.org Artificial Intelligence

Most speech self-supervised learning (SSL) models are trained with a pretext task that consists of predicting missing parts of the input signal, either future segments (causal prediction) or segments masked anywhere within the input (non-causal prediction). The learned speech representations can then be efficiently transferred to downstream tasks (e.g., automatic speech recognition or speaker recognition). In the present study, we investigate the use of a speech SSL model for speech inpainting, that is, reconstructing a missing portion of a speech signal from its surrounding context, a downstream task that is very similar to the pretext task. To this end, we combine an SSL encoder, namely HuBERT, with a neural vocoder, namely HiFiGAN, which plays the role of a decoder. In particular, we propose two solutions to match the HuBERT output with the HiFiGAN input: either freezing one model and fine-tuning the other, or vice versa. Both approaches were assessed in single- and multi-speaker settings, for both informed and blind inpainting configurations (i.e., the position of the mask is known or unknown, respectively), using several objective metrics and a perceptual evaluation. The results show that, although both solutions correctly reconstruct signal portions of up to 200 ms (and even 400 ms in some cases), fine-tuning the SSL encoder provides a more accurate signal reconstruction in the single-speaker setting, while freezing it (and training the neural vocoder instead) is a better strategy when dealing with multi-speaker data.
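
To make the informed and blind configurations concrete, the following minimal Python sketch masks a gap in a waveform and shows one plausible way the two configurations could differ when assembling the output. It is an illustration under stated assumptions, not the authors' implementation: the 16 kHz sample rate, the function names (`mask_gap`, `merge_informed`, `merge_blind`), and the merging strategy are all hypothetical, and the `reconstructed` signal stands in for the output of the encoder-vocoder pipeline.

```python
from typing import List, Tuple

SAMPLE_RATE = 16_000  # Hz; an assumption, not stated in the abstract


def mask_gap(signal: List[float], start_ms: float, length_ms: float,
             sample_rate: int = SAMPLE_RATE) -> Tuple[List[float], List[bool]]:
    """Zero out a gap of length_ms starting at start_ms.

    Returns the masked signal and a boolean mask (True inside the gap).
    """
    start = int(round(start_ms * sample_rate / 1000))
    stop = min(len(signal), start + int(round(length_ms * sample_rate / 1000)))
    mask = [start <= i < stop for i in range(len(signal))]
    masked = [0.0 if m else x for x, m in zip(signal, mask)]
    return masked, mask


def merge_informed(masked: List[float], reconstructed: List[float],
                   mask: List[bool]) -> List[float]:
    """Informed case (assumed strategy): the mask position is known, so the
    original context is kept and model output is used only inside the gap."""
    return [r if m else x for x, r, m in zip(masked, reconstructed, mask)]


def merge_blind(reconstructed: List[float]) -> List[float]:
    """Blind case (assumed strategy): the mask position is unknown, so the
    whole output is taken from the model."""
    return list(reconstructed)


if __name__ == "__main__":
    one_second = [1.0] * SAMPLE_RATE
    masked, mask = mask_gap(one_second, start_ms=100, length_ms=200)
    print(sum(mask), "samples masked")  # 200 ms at 16 kHz
```

In this sketch a 200 ms gap at 16 kHz covers 3200 samples; the mask lengths reported in the abstract (200 ms and 400 ms) can be tried by changing `length_ms`.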

Country:
- North America
  - United States
    - Rhode Island (0.04)
    - Utah > Salt Lake County
      - Salt Lake City (0.04)
    - Texas > Dallas County
      - Dallas (0.04)
    - Nevada > Clark County
      - Las Vegas (0.04)
    - Hawaii > Honolulu County
      - Honolulu (0.04)
    - California > San Diego County
      - San Diego (0.04)
  - Canada > Ontario
    - Toronto (0.04)
- Europe
  - Greece (0.04)
  - Germany (0.04)
  - Spain > Catalonia
    - Barcelona Province > Barcelona (0.04)
  - Ireland > Leinster
    - County Dublin > Dublin (0.04)
  - France > Auvergne-Rhône-Alpes
    - Isère > Grenoble (0.05)
  - Finland > Uusimaa
    - Helsinki (0.04)
  - Czechia > South Moravian Region
    - Brno (0.04)
  - Austria
    - Vienna (0.14)
    - Styria > Graz (0.04)
- Asia > China
  - Shanghai > Shanghai (0.04)

Genre:
- Research Report > New Finding (0.68)

Technology:
- Information Technology > Artificial Intelligence
  - Speech (1.00)
  - Machine Learning
    - Neural Networks > Deep Learning (0.93)
    - Statistical Learning (0.68)
