Investigating Pre-trained Audio Encoders in the Low-Resource Condition

Hao Yang, Jinming Zhao, Gholamreza Haffari, Ehsan Shareghi

arXiv.org Artificial Intelligence 

Pre-trained speech encoders have been central to pushing state-of-the-art results across various speech understanding and generation tasks. Nonetheless, the capabilities of these encoders in low-resource settings are yet to be thoroughly explored. To address this, we conduct a comprehensive set of experiments using a representative set of 3 state-of-the-art encoders (Wav2vec2, WavLM, Whisper) in the low-resource setting across 7 speech understanding and generation tasks. We provide various quantitative and qualitative analyses on task performance, convergence speed, and representational properties of the encoders.

To better understand the interplay between pre-training protocols of speech encoders, the amount of fine-tuning data, and speech task types, we conduct a comprehensive study in this work. We evaluate a set of three very recent speech models (Wav2vec2, WavLM, and Whisper) and assess their performance on 7 downstream tasks (covering content, speaker and semantic types) in the low-resource setting. Through extensive experiments in the low-resource setting, we found that Whisper significantly outperforms Wav2vec2 and WavLM by a large margin on content-related (content, semantics) tasks, and shows performance degradation when speaker information is required.
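To make the evaluation setup concrete, the sketch below shows one way to extract utterance-level representations from the three studied encoders via Hugging Face Transformers. The checkpoint names, the `encode` helper, and the mean-pooling step are assumptions for illustration only, not the authors' actual pipeline.

```python
# Minimal sketch (assumed setup, not the paper's code): extract a fixed-size
# utterance embedding from each of the three studied encoder families.
import numpy as np
import torch
from transformers import AutoFeatureExtractor, AutoModel

# Hypothetical checkpoint choices; the abstract does not specify exact sizes.
ENCODERS = {
    "wav2vec2": "facebook/wav2vec2-base",
    "wavlm": "microsoft/wavlm-base",
    "whisper": "openai/whisper-base",
}

@torch.no_grad()
def encode(checkpoint: str, waveform: np.ndarray, sampling_rate: int = 16000) -> torch.Tensor:
    """Mean-pooled last-layer hidden states for a mono 16 kHz waveform."""
    extractor = AutoFeatureExtractor.from_pretrained(checkpoint)
    model = AutoModel.from_pretrained(checkpoint).eval()
    inputs = extractor(waveform, sampling_rate=sampling_rate, return_tensors="pt")
    if "whisper" in checkpoint:
        # Whisper is an encoder-decoder ASR model; use only its encoder,
        # which consumes log-mel spectrogram features.
        hidden = model.encoder(inputs.input_features).last_hidden_state
    else:
        # Wav2vec2 and WavLM are self-supervised encoders over raw waveforms.
        hidden = model(inputs.input_values).last_hidden_state
    return hidden.mean(dim=1)  # pool over time -> shape (1, hidden_size)

if __name__ == "__main__":
    dummy = np.random.randn(16000).astype(np.float32)  # 1 second of noise
    for name, ckpt in ENCODERS.items():
        print(name, tuple(encode(ckpt, dummy).shape))
```

Representations such as these could then be probed with lightweight task-specific heads across the 7 downstream tasks; whether the encoder stays frozen or is fine-tuned on the limited data is exactly the kind of variable a low-resource study like this one manipulates.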
