Cooper, Erica
An Initial Investigation of Language Adaptation for TTS Systems under Low-resource Scenarios
Gong, Cheng, Cooper, Erica, Wang, Xin, Qiang, Chunyu, Geng, Mengzhe, Wells, Dan, Wang, Longbiao, Dang, Jianwu, Tessier, Marc, Pine, Aidan, Richmond, Korin, Yamagishi, Junichi
Self-supervised learning (SSL) representations from massively multilingual models offer a promising solution for low-resource language speech tasks. Despite advancements, language adaptation in TTS systems remains an open problem. This paper explores the language adaptation capability of ZMM-TTS, a recent SSL-based multilingual TTS system proposed in our previous work. We conducted experiments on 12 languages using limited data with various fine-tuning configurations. We demonstrate that the similarity in phonetics between the pre-training and target languages, as well as the language category, affects the target language's adaptation performance. Additionally, we find that the fine-tuning dataset size and number of speakers influence adaptability. Surprisingly, we also observed that using paired data for fine-tuning is not always optimal compared to audio-only data. Beyond speech intelligibility, our analysis covers speaker similarity, language identification, and predicted MOS.
Spoof Diarization: "What Spoofed When" in Partially Spoofed Audio
Zhang, Lin, Wang, Xin, Cooper, Erica, Diez, Mireia, Landini, Federico, Evans, Nicholas, Yamagishi, Junichi
This paper defines Spoof Diarization as a novel task in the Partial Spoof (PS) scenario. It aims to determine what spoofed when, which includes not only locating spoof regions but also clustering them according to different spoofing methods. As a pioneering study in spoof diarization, we focus on defining the task, establishing evaluation metrics, and proposing a benchmark model, namely the Countermeasure-Condition Clustering (3C) model. Utilizing this model, we first explore how to effectively train countermeasures to support spoof diarization using three labeling schemes. We then utilize spoof localization predictions to enhance the diarization performance. This first study reveals the high complexity of the task, even in restricted scenarios where only a single speaker per audio file and an oracle number of spoofing methods are considered. Our code is available at https://github.com/nii-yamagishilab/PartialSpoof.
Uncertainty as a Predictor: Leveraging Self-Supervised Learning for Zero-Shot MOS Prediction
Ravuri, Aditya, Cooper, Erica, Yamagishi, Junichi
This paper addresses the gap in We are particularly inspired by approaches in biology where efficient audio quality prediction, especially in low-resource zero-shot prediction is possible using a model's uncertainty settings where extensive MOS data from large-scale listening estimates, where uncertainties act as proxies for downstream tests may be unavailable. We demonstrate that uncertainty tasks [4]. Our main hypotheses are that, measures derived from out-of-the-box pretrained selfsupervised learning (SSL) models, such as wav2vec, correlate 1. uncertainty estimates can be derived from the outputs with MOS scores. These findings are based on data from the of SSL models such as wav2vec, and that, 2022 and 2023 VoiceMOS challenges. We explore the extent 2. these uncertainties can be used as proxies to MOS of this correlation across different models and language scores as high model uncertainty around the contents contexts, revealing insights into how inherent uncertainties in of an audio sequence must correspond to low audio SSL models can serve as effective proxies for audio quality quality.
Range-Based Equal Error Rate for Spoof Localization
Zhang, Lin, Wang, Xin, Cooper, Erica, Evans, Nicholas, Yamagishi, Junichi
Spoof localization, also called segment-level detection, is a crucial task that aims to locate spoofs in partially spoofed audio. The equal error rate (EER) is widely used to measure performance for such biometric scenarios. Although EER is the only threshold-free metric, it is usually calculated in a point-based way that uses scores and references with a pre-defined temporal resolution and counts the number of misclassified segments. Such point-based measurement overly relies on this resolution and may not accurately measure misclassified ranges. To properly measure misclassified ranges and better evaluate spoof localization performance, we upgrade point-based EER to range-based EER. Then, we adapt the binary search algorithm for calculating range-based EER and compare it with the classical point-based EER. Our analyses suggest utilizing either range-based EER, or point-based EER with a proper temporal resolution can fairly and properly evaluate the performance of spoof localization.