Fill in the Gap! Combining Self-supervised Representation Learning with Neural Audio Synthesis for Speech Inpainting
Asaad, Ihab, Jacquelin, Maxime, Perrotin, Olivier, Girin, Laurent, Hueber, Thomas
arXiv.org Artificial Intelligence
Most speech self-supervised learning (SSL) models are trained with a pretext task that consists of predicting missing parts of the input signal, either future segments (causal prediction) or segments masked anywhere within the input (non-causal prediction). The learned speech representations can then be efficiently transferred to downstream tasks (e.g., automatic speech or speaker recognition). In the present study, we investigate the use of a speech SSL model for speech inpainting, that is, reconstructing a missing portion of a speech signal from its surrounding context, a downstream task that is very similar to the pretext task. To that end, we combine an SSL encoder, namely HuBERT, with a neural vocoder, namely HiFiGAN, which plays the role of a decoder. In particular, we propose two solutions to match the HuBERT output with the HiFiGAN input: freezing the vocoder and fine-tuning the encoder, or freezing the encoder and training the vocoder. The performance of both approaches was assessed in single- and multi-speaker settings, for both informed and blind inpainting configurations (i.e., the position of the mask is known or unknown, respectively), with different objective metrics and a perceptual evaluation. Results show that while both solutions can correctly reconstruct signal portions of up to 200 ms (and even 400 ms in some cases), fine-tuning the SSL encoder provides a more accurate signal reconstruction in the single-speaker setting, whereas freezing it (and training the neural vocoder instead) is a better strategy when dealing with multi-speaker data.
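To make the inpainting setup concrete, the following is a minimal sketch of how a gap might be carved out of a waveform and mapped to encoder frames. It is not the authors' code. It assumes 16 kHz audio and a 20 ms encoder hop (320 samples), which matches the publicly released HuBERT models, and it assumes the gap is simply zeroed out; the function names `mask_gap` and `masked_frames` are hypothetical. In the informed configuration the mask would be passed to the model along with the signal; in the blind configuration only the masked signal would be available.

```python
SAMPLE_RATE = 16_000  # Hz; assumed input sampling rate
FRAME_HOP = 320       # samples per encoder frame (20 ms at 16 kHz); assumed


def mask_gap(signal, start_ms, gap_ms, sample_rate=SAMPLE_RATE):
    """Zero out a gap in the signal (an assumption about how the gap is encoded).

    Returns (masked_signal, mask), where mask[i] == 1 inside the gap.
    """
    start = int(start_ms * sample_rate / 1000)
    end = min(len(signal), start + int(gap_ms * sample_rate / 1000))
    masked = list(signal)
    mask = [0] * len(signal)
    for i in range(start, end):
        masked[i] = 0.0
        mask[i] = 1
    return masked, mask


def masked_frames(mask, hop=FRAME_HOP):
    """Indices of encoder frames whose hop window overlaps the gap."""
    n_frames = len(mask) // hop
    return [f for f in range(n_frames) if any(mask[f * hop:(f + 1) * hop])]


# One second of placeholder audio with a 200 ms gap starting at 400 ms.
signal = [0.1] * SAMPLE_RATE
masked, mask = mask_gap(signal, start_ms=400, gap_ms=200)
frames = masked_frames(mask)
```

Under these assumptions, a 200 ms gap covers 3200 samples, so the decoder has to regenerate 10 consecutive encoder frames from context alone; a 400 ms gap doubles that.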
May-30-2024