Miipher: A Robust Speech Restoration Model Integrating Self-Supervised Speech and Text Representations
Yuma Koizumi, Heiga Zen, Shigeki Karita, Yifan Ding, Kohei Yatabe, Nobuyuki Morioka, Yu Zhang, Wei Han, Ankur Bapna, Michiel Bacchiani
Speech restoration (SR) is the task of converting degraded speech signals into high-quality ones. In this study, we propose a robust SR model called Miipher and apply it to a new SR application: increasing the amount of high-quality training data for speech generation by converting speech samples collected from the Web to studio quality. To make our SR model robust against various forms of degradation, we use (i) a speech representation extracted from w2v-BERT as the input feature, and (ii) a text representation extracted from transcripts via PnG-BERT as a linguistic conditioning feature. Experiments show that Miipher (i) is robust against various audio degradations and (ii) enables us to train a high-quality text-to-speech (TTS) model from restored speech samples collected from the Web. Audio samples are available at our demo page: google.github.io/df-conformer/miipher/
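The core idea in the abstract is to clean speech in a learned feature space while conditioning on the transcript. The sketch below is a minimal, hypothetical illustration of that conditioning pattern and is not the authors' implementation: degraded speech features (standing in for w2v-BERT frames) cross-attend to text features (standing in for PnG-BERT token embeddings), and the network predicts cleaned speech features from which a neural vocoder could synthesize the restored waveform. All module names, dimensions, and the choice of PyTorch are assumptions.

```python
# Hypothetical sketch of text-conditioned speech-feature restoration.
# Not the Miipher architecture; only the conditioning idea is illustrated.
import torch
import torch.nn as nn


class FeatureCleaner(nn.Module):
    def __init__(self, speech_dim=1024, text_dim=768, hidden=512, heads=8):
        super().__init__()
        self.speech_proj = nn.Linear(speech_dim, hidden)
        self.text_proj = nn.Linear(text_dim, hidden)
        # Cross-attention: each speech frame queries the transcript tokens,
        # injecting linguistic information as a conditioning signal.
        self.cross_attn = nn.MultiheadAttention(hidden, heads, batch_first=True)
        self.ffn = nn.Sequential(
            nn.Linear(hidden, hidden * 4), nn.GELU(), nn.Linear(hidden, hidden)
        )
        self.out_proj = nn.Linear(hidden, speech_dim)

    def forward(self, speech_feats, text_feats):
        # speech_feats: (batch, frames, speech_dim)  degraded speech features
        # text_feats:   (batch, tokens, text_dim)    transcript features
        q = self.speech_proj(speech_feats)
        kv = self.text_proj(text_feats)
        attn_out, _ = self.cross_attn(q, kv, kv)
        h = q + attn_out
        h = h + self.ffn(h)
        # Predict cleaned speech features; a vocoder would then render audio.
        return self.out_proj(h)


if __name__ == "__main__":
    cleaner = FeatureCleaner()
    noisy = torch.randn(1, 200, 1024)  # 200 frames of degraded features
    text = torch.randn(1, 40, 768)     # 40 transcript-token embeddings
    restored = cleaner(noisy, text)
    print(restored.shape)              # torch.Size([1, 200, 1024])
```

Cross-attention is used here only to make the conditioning explicit; the paper's actual feature cleaner and vocoder are described in the full text.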
arXiv.org Artificial Intelligence
Aug-14-2023
- Genre:
  - Research Report > New Finding (0.35)
- Industry:
  - Media (0.46)
- Technology:
  - Information Technology > Artificial Intelligence
    - Machine Learning > Neural Networks (1.00)
    - Natural Language (0.93)
    - Speech (1.00)