reference speaker
Lina-Speech: Gated Linear Attention is a Fast and Parameter-Efficient Learner for text-to-speech synthesis
Lemerle, Théodor, Vanderbyl, Harrison, Srivastav, Vaibhav, Obin, Nicolas, Roebel, Axel
Neural codec language models have achieved state-of-the-art performance in text-to-speech (TTS) synthesis, leveraging scalable architectures like autoregressive transformers and large-scale speech datasets. By framing voice cloning as a prompt continuation task, these models excel at cloning voices from short audio samples. However, this approach is limited in its ability to handle numerous or lengthy speech excerpts, since the concatenation of source and target speech must fall within the maximum context length which is determined during training. In this work, we introduce Lina-Speech, a model that replaces traditional self-attention mechanisms with emerging recurrent architectures like Gated Linear Attention (GLA). Building on the success of initial-state tuning on RWKV, we extend this technique to voice cloning, enabling the use of multiple speech samples and full utilization of the context window in synthesis. This approach is fast, easy to deploy, and achieves performance comparable to fine-tuned baselines when the dataset size ranges from 3 to 15 minutes. Notably, Lina-Speech matches or outperforms state-of-the-art baseline models, including some with a parameter count up to four times higher or trained in an end-to-end style. We release our code and checkpoints. Audio samples are available at https://theodorblackbird.github.io/blog/demo_lina/.
- North America > United States (0.14)
- Europe > France > Île-de-France > Paris > Paris (0.04)
- Oceania > Australia > New South Wales > Sydney (0.04)
- Asia > Japan > Honshū > Tōhoku > Iwate Prefecture > Morioka (0.04)
Optimal Transport Maps are Good Voice Converters
Asadulaev, Arip, Korst, Rostislav, Shutov, Vitalii, Korotin, Alexander, Grebnyak, Yaroslav, Egiazarian, Vahe, Burnaev, Evgeny
Recently, neural network-based methods for computing optimal transport maps have been effectively applied to style transfer problems. However, the application of these methods to voice conversion is underexplored. In our paper, we fill this gap by investigating optimal transport as a framework for voice conversion. We present a variety of optimal transport algorithms designed for different data representations, such as mel-spectrograms and latent representation of self-supervised speech models. For the mel-spectogram data representation, we achieve strong results in terms of Frechet Audio Distance (FAD). This performance is consistent with our theoretical analysis, which suggests that our method provides an upper bound on the FAD between the target and generated distributions. Within the latent space of the WavLM encoder, we achived state-of-the-art results and outperformed existing methods even with limited reference speaker data.
Small-E: Small Language Model with Linear Attention for Efficient Speech Synthesis
Lemerle, Théodor, Obin, Nicolas, Roebel, Axel
Recent advancements in text-to-speech (TTS) powered by language models have showcased remarkable capabilities in achieving naturalness and zero-shot voice cloning. Notably, the decoder-only transformer is the prominent architecture in this domain. However, transformers face challenges stemming from their quadratic complexity in sequence length, impeding training on lengthy sequences and resource-constrained hardware. Moreover they lack specific inductive bias with regards to the monotonic nature of TTS alignments. In response, we propose to replace transformers with emerging recurrent architectures and introduce specialized cross-attention mechanisms for reducing repeating and skipping issues. Consequently our architecture can be efficiently trained on long samples and achieve state-of-the-art zero-shot voice cloning against baselines of comparable size. Our implementation and demos are available at https://github.com/theodorblackbird/lina-speech.
- Europe > Italy > Calabria > Catanzaro Province > Catanzaro (0.04)
- Europe > France > Île-de-France > Paris > Paris (0.04)
UNIT-DSR: Dysarthric Speech Reconstruction System Using Speech Unit Normalization
Wang, Yuejiao, Wu, Xixin, Wang, Disong, Meng, Lingwei, Meng, Helen
Dysarthric speech reconstruction (DSR) systems aim to automatically convert dysarthric speech into normal-sounding speech. The technology eases communication with speakers affected by the neuromotor disorder and enhances their social inclusion. NED-based (Neural Encoder-Decoder) systems have significantly improved the intelligibility of the reconstructed speech as compared with GAN-based (Generative Adversarial Network) approaches, but the approach is still limited by training inefficiency caused by the cascaded pipeline and auxiliary tasks of the content encoder, which may in turn affect the quality of reconstruction. Inspired by self-supervised speech representation learning and discrete speech units, we propose a Unit-DSR system, which harnesses the powerful domain-adaptation capacity of HuBERT for training efficiency improvement and utilizes speech units to constrain the dysarthric content restoration in a discrete linguistic space. Compared with NED approaches, the Unit-DSR system only consists of a speech unit normalizer and a Unit HiFi-GAN vocoder, which is considerably simpler without cascaded sub-modules or auxiliary tasks. Results on the UASpeech corpus indicate that Unit-DSR outperforms competitive baselines in terms of content restoration, reaching a 28.2% relative average word error rate reduction when compared to original dysarthric speech, and shows robustness against speed perturbation and noise.
OpenVoice: Versatile Instant Voice Cloning
Qin, Zengyi, Zhao, Wenliang, Yu, Xumin, Sun, Xin
We introduce OpenVoice, a versatile voice cloning approach that requires only a short audio clip from the reference speaker to replicate their voice and generate speech in multiple languages. OpenVoice represents a significant advancement in addressing the following open challenges in the field: 1) Flexible Voice Style Control. OpenVoice enables granular control over voice styles, including emotion, accent, rhythm, pauses, and intonation, in addition to replicating the tone color of the reference speaker. The voice styles are not directly copied from and constrained by the style of the reference speaker. Previous approaches lacked the ability to flexibly manipulate voice styles after cloning. 2) Zero-Shot Cross-Lingual Voice Cloning. OpenVoice achieves zero-shot cross-lingual voice cloning for languages not included in the massive-speaker training set. Unlike previous approaches, which typically require extensive massive-speaker multi-lingual (MSML) dataset for all languages, OpenVoice can clone voices into a new language without any massive-speaker training data for that language. OpenVoice is also computationally efficient, costing tens of times less than commercially available APIs that offer even inferior performance. To foster further research in the field, we have made the source code and trained model publicly accessible. We also provide qualitative results in our demo website. Prior to its public release, our internal version of OpenVoice was used tens of millions of times by users worldwide between May and October 2023, serving as the backend of MyShell.
- North America > United States > Massachusetts > Middlesex County > Cambridge (0.04)
- North America > United States > Hawaii > Honolulu County > Honolulu (0.04)
- Europe > United Kingdom > England > Cambridgeshire > Cambridge (0.04)
Do Prosody Transfer Models Transfer Prosody?
Sigurgeirsson, Atli Thor, King, Simon
Some recent models for Text-to-Speech synthesis aim to transfer the prosody of a reference utterance to the generated target synthetic speech. This is done by using a learned embedding of the reference utterance, which is used to condition speech generation. During training, the reference utterance is identical to the target utterance. Yet, during synthesis, these models are often used to transfer prosody from a reference that differs from the text or speaker being synthesized. To address this inconsistency, we propose to use a different, but prosodically-related, utterance during training too. We believe this should encourage the model to learn to transfer only those characteristics that the reference and target have in common. If prosody transfer methods do indeed transfer prosody they should be able to be trained in the way we propose. However, results show that a model trained under these conditions performs significantly worse than one trained using the target utterance as a reference. To explain this, we hypothesize that prosody transfer models do not learn a transferable representation of prosody, but rather an utterance-level representation which is highly dependent on both the reference speaker and reference text.
- North America > Canada > Quebec > Montreal (0.04)
- Europe > Italy > Calabria > Catanzaro Province > Catanzaro (0.04)