vall-e
Continuous Autoregressive Modeling with Stochastic Monotonic Alignment for Speech Synthesis
We propose a novel autoregressive modeling approach for speech synthesis, combining a variational autoencoder (VAE) with a multi-modal latent space and an autoregressive model that uses Gaussian Mixture Models (GMM) as the conditional probability distribution. Unlike previous methods that rely on residual vector quantization, our model leverages continuous speech representations from the VAE's latent space, greatly simplifying the training and inference pipelines. We also introduce a stochastic monotonic alignment mechanism to enforce strict monotonic alignments. Our approach significantly outperforms the state-of-the-art autoregressive model VALL-E in both subjective and objective evaluations, achieving these results with only 10.3% of VALL-E's parameters. This demonstrates the potential of continuous speech language models as a more efficient alternative to existing quantization-based speech language models. Sample audio can be found at https://tinyurl.com/gmm-lm-tts.

Transformers trained with autoregressive (AR) objectives have become the dominant approach in natural language processing (NLP) (Radford et al., 2019; Brown et al., 2020; Vaswani, 2017). The vast successes in NLP have inspired researchers to apply Transformers and autoregressive objectives to the image and speech domains as well (Ramesh et al., 2021; Wang et al.; Betker, 2023). Since speech and images are continuous signals, discretization is a critical first step before applying discrete autoregressive training. As a result, autoregressive modeling of images and audio typically involves two stages of training. In the first stage, a VQ-VAE (Van Den Oord et al., 2017) or a variant of VQ-VAE (Zeghidour et al., 2021; Défossez et al., 2022; Kumar et al., 2024) is trained to encode input data into discrete latent representations using a vector quantization bottleneck. After training the VQ-VAE, an autoregressive model is trained on the discrete latent codes produced by the encoder (Ramesh et al., 2021). The AR model captures the sequential dependencies in the latent space, learning to predict the next latent code based on previous ones, which enables high-fidelity generation.
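The key modeling choice above, an autoregressive model whose per-step conditional distribution is a Gaussian mixture over continuous latent frames, can be sketched as follows. This is a minimal PyTorch illustration; the module, its dimensions (`hidden_dim`, `latent_dim`, `num_mixtures`), and the training/sampling helpers are assumptions for exposition, not the paper's implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GMMHead(nn.Module):
    """Predicts a K-component diagonal-Gaussian mixture over the next
    continuous latent frame, given the AR model's hidden state."""
    def __init__(self, hidden_dim=512, latent_dim=64, num_mixtures=8):
        super().__init__()
        self.latent_dim = latent_dim
        self.num_mixtures = num_mixtures
        # One linear layer emits mixture logits, means, and log-scales.
        self.proj = nn.Linear(hidden_dim, num_mixtures * (1 + 2 * latent_dim))

    def forward(self, h):
        # h: (batch, hidden_dim) hidden state for the current step
        out = self.proj(h)
        logits, mu, log_sigma = torch.split(
            out,
            [self.num_mixtures,
             self.num_mixtures * self.latent_dim,
             self.num_mixtures * self.latent_dim],
            dim=-1,
        )
        mu = mu.view(-1, self.num_mixtures, self.latent_dim)
        log_sigma = log_sigma.view(-1, self.num_mixtures, self.latent_dim)
        return logits, mu, log_sigma

    def nll(self, h, target):
        # Negative log-likelihood of the ground-truth next frame.
        logits, mu, log_sigma = self.forward(h)
        target = target.unsqueeze(1)                      # (B, 1, D)
        comp = torch.distributions.Normal(mu, log_sigma.exp())
        log_prob = comp.log_prob(target).sum(-1)          # (B, K)
        log_mix = F.log_softmax(logits, dim=-1)
        return -torch.logsumexp(log_mix + log_prob, dim=-1).mean()

    @torch.no_grad()
    def sample(self, h):
        # Pick a mixture component, then draw a continuous frame from it.
        logits, mu, log_sigma = self.forward(h)
        k = torch.distributions.Categorical(logits=logits).sample()   # (B,)
        idx = k.view(-1, 1, 1).expand(-1, 1, self.latent_dim)
        mu_k = mu.gather(1, idx).squeeze(1)
        sigma_k = log_sigma.gather(1, idx).exp().squeeze(1)
        return mu_k + sigma_k * torch.randn_like(mu_k)
```

In training, `nll` would take the place of the cross-entropy loss used by discrete-token models; at inference, `sample` draws the next continuous frame directly, with no codebook lookup.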
- Asia > China > Hong Kong (0.04)
- South America > Chile > Santiago Metropolitan Region > Santiago Province > Santiago (0.04)
- North America > Canada > Alberta > Census Division No. 15 > Improvement District No. 9 > Banff (0.04)
- Information Technology > Artificial Intelligence > Speech > Speech Recognition (1.00)
- Information Technology > Artificial Intelligence > Natural Language (1.00)
- Information Technology > Artificial Intelligence > Speech > Speech Synthesis (0.86)
- Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.46)
Codec Does Matter: Exploring the Semantic Shortcoming of Codec for Audio Language Model
Ye, Zhen, Sun, Peiwen, Lei, Jiahe, Lin, Hongzhan, Tan, Xu, Dai, Zheqi, Kong, Qiuqiang, Chen, Jianyi, Pan, Jiahao, Liu, Qifeng, Guo, Yike, Xue, Wei
Recent advancements in audio generation have been significantly propelled by the capabilities of Large Language Models (LLMs). Existing research on audio LLMs has primarily focused on enhancing the architecture and scale of audio language models and on leveraging larger datasets; generally, acoustic codecs such as EnCodec are used for audio tokenization. However, these codecs were originally designed for audio compression, which may lead to suboptimal performance in the context of audio LLMs. Our research aims to address the shortcomings of current audio LLM codecs, particularly their challenges in maintaining semantic integrity in generated audio. For instance, existing methods like VALL-E, which condition acoustic token generation on text transcriptions, often suffer from content inaccuracies and elevated word error rates (WER) due to semantic misinterpretations of acoustic tokens, resulting in word skipping and errors. To overcome these issues, we propose a straightforward yet effective approach called X-Codec. X-Codec incorporates semantic features from a pre-trained semantic encoder before the Residual Vector Quantization (RVQ) stage and introduces a semantic reconstruction loss after RVQ. By enhancing the semantic ability of the codec, X-Codec significantly reduces WER in speech synthesis tasks and extends these benefits to non-speech applications, including music and sound generation. Our experiments in text-to-speech, music continuation, and text-to-sound tasks demonstrate that integrating semantic information substantially improves the overall performance of language models in audio generation. Our code and demo are available (Demo: https://x-codec-audio.github.io Code: https://github.com/zhenye234/xcodec).
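To make the "semantic features before RVQ, semantic loss after RVQ" recipe concrete, here is a rough sketch. The fusion by concatenation, the cosine-based semantic reconstruction loss, and the module interfaces (`acoustic_enc`, `semantic_enc`, `rvq`, `acoustic_dec`) are assumptions of this illustration and may differ from the released X-Codec code.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SemanticAwareCodec(nn.Module):
    """Sketch of the X-Codec idea: fuse semantic features with acoustic
    features before RVQ, and add a semantic reconstruction loss after RVQ.
    `acoustic_enc`, `semantic_enc`, `rvq`, and `acoustic_dec` are assumed to
    be provided modules (e.g. an EnCodec-style encoder/decoder, a frozen
    HuBERT-style encoder, and a residual vector quantizer), with frame rates
    assumed to match for simplicity."""
    def __init__(self, acoustic_enc, semantic_enc, rvq, acoustic_dec,
                 acoustic_dim=128, semantic_dim=768, fused_dim=256):
        super().__init__()
        self.acoustic_enc = acoustic_enc
        self.semantic_enc = semantic_enc          # frozen, pre-trained
        self.rvq = rvq
        self.acoustic_dec = acoustic_dec
        self.fuse = nn.Linear(acoustic_dim + semantic_dim, fused_dim)
        self.sem_head = nn.Linear(fused_dim, semantic_dim)   # predicts semantic feats

    def forward(self, wav):
        a = self.acoustic_enc(wav)                 # (B, T, acoustic_dim)
        with torch.no_grad():
            s = self.semantic_enc(wav)             # (B, T, semantic_dim)
        z = self.fuse(torch.cat([a, s], dim=-1))   # fused pre-RVQ features
        z_q, codes, vq_loss = self.rvq(z)          # residual vector quantization
        wav_hat = self.acoustic_dec(z_q)           # acoustic reconstruction
        s_hat = self.sem_head(z_q)                 # semantic reconstruction after RVQ
        sem_loss = 1.0 - F.cosine_similarity(s_hat, s, dim=-1).mean()
        return wav_hat, codes, vq_loss, sem_loss
```

The semantic reconstruction term would simply be added to the usual codec losses (waveform/spectral reconstruction, VQ commitment, adversarial terms) during training.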
- Asia > China > Hong Kong (0.04)
- North America > United States > New York > New York County > New York City (0.04)
- Asia > Japan > Honshū > Kantō > Tokyo Metropolis Prefecture > Tokyo (0.04)
- Asia > China > Beijing > Beijing (0.04)
- Leisure & Entertainment (1.00)
- Media > Music (0.94)
Speech Slytherin: Examining the Performance and Efficiency of Mamba for Speech Separation, Recognition, and Synthesis
Jiang, Xilin, Li, Yinghao Aaron, Florea, Adrian Nicolas, Han, Cong, Mesgarani, Nima
It is too early to conclude that Mamba is a better alternative to transformers for speech before Mamba and transformers have been compared in terms of both performance and efficiency across multiple speech-related tasks. To enable this comparison, we propose and evaluate three models for three tasks: Mamba-TasNet for speech separation, ConMamba for speech recognition, and VALL-M for speech synthesis. We compare them with transformers of similar sizes in performance, memory, and speed. Our Mamba or Mamba-transformer hybrid models show comparable or higher performance than their transformer counterparts: Sepformer, Conformer, and VALL-E. They are more efficient than transformers in memory and speed for speech longer than a threshold duration, which is inversely related to the resolution of a speech token. Mamba for separation is the most efficient, and Mamba for recognition is the least. Further, we show that Mamba is not more efficient than transformers for speech shorter than the threshold duration and performs worse in models that require joint modeling of text and speech, such as cross or masked attention over two inputs. Therefore, we argue that the superiority of Mamba or transformers depends on the particular problem and model. Code available at https://github.com/xi-j/Mamba-TasNet and https://github.com/xi-j/Mamba-ASR.
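The claimed crossover, where Mamba becomes more efficient than a transformer only beyond a threshold duration that shrinks as token resolution grows, follows from comparing a quadratic attention cost with a linear scan cost. The toy calculation below uses made-up per-token constants purely to illustrate that inverse relationship, not measured numbers from the paper.

```python
# Illustration (with made-up constants) of why the efficiency crossover
# duration is inversely related to token resolution: self-attention cost
# grows as c_attn * L^2 while an SSM scan grows as c_ssm * L, and the
# sequence length is L = token_rate * duration.
def crossover_duration(token_rate_hz, c_attn=1.0, c_ssm=200.0):
    """Duration (seconds) above which c_attn * L^2 exceeds c_ssm * L.
    The constants are hypothetical per-token costs, not measured numbers."""
    # c_attn * L^2 > c_ssm * L  <=>  L > c_ssm / c_attn
    return (c_ssm / c_attn) / token_rate_hz

for rate in (50, 75, 320):  # e.g. coarse semantic tokens vs. fine codec frames
    print(f"{rate:4d} tok/s -> crossover ≈ {crossover_duration(rate):.2f} s")
```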
- North America > United States (0.14)
- Europe > Italy > Calabria > Catanzaro Province > Catanzaro (0.04)
TacoLM: GaTed Attention Equipped Codec Language Model are Efficient Zero-Shot Text to Speech Synthesizers
Song, Yakun, Chen, Zhuo, Wang, Xiaofei, Ma, Ziyang, Yang, Guanrou, Chen, Xie
Neural codec language models (LMs) have demonstrated strong capability in zero-shot text-to-speech (TTS) synthesis. However, such codec LMs often suffer from limitations in inference speed and stability, due to their auto-regressive nature and the implicit alignment between text and audio. In this work, to handle these challenges, we introduce a new variant of the neural codec LM, namely TacoLM. Specifically, TacoLM introduces a gated attention mechanism to improve training and inference efficiency and reduce the model size. Meanwhile, an additional gated cross-attention layer is included for each decoder layer, which improves the efficiency and content accuracy of the synthesized speech. In evaluations on the LibriSpeech corpus, the proposed TacoLM achieves a better word error rate, speaker similarity, and mean opinion score, with 90% fewer parameters and a 5.2 times speed-up, compared with VALL-E. Demo and code are available at https://ereboas.github.io/TacoLM/.
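The abstract does not spell out the gating design, so the sketch below shows one common form of gated (single-head) attention, in the GAU style, as an illustrative stand-in; TacoLM's actual layer may differ in its normalization, gating nonlinearity, and attention kernel.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GatedAttentionUnit(nn.Module):
    """One common form of gated single-head attention (GAU-like). This is
    only an illustrative stand-in for the 'gated attention' named in the
    abstract, not TacoLM's exact layer."""
    def __init__(self, d_model=512, d_qk=128, expansion=2):
        super().__init__()
        d_inner = d_model * expansion
        self.to_uv = nn.Linear(d_model, 2 * d_inner)   # gate and value branches
        self.to_qk = nn.Linear(d_model, 2 * d_qk)
        self.out = nn.Linear(d_inner, d_model)
        self.norm = nn.LayerNorm(d_model)
        self.scale = d_qk ** -0.5

    def forward(self, x, causal=True):
        # x: (B, T, d_model)
        h = self.norm(x)
        u, v = self.to_uv(h).chunk(2, dim=-1)          # (B, T, d_inner) each
        q, k = self.to_qk(h).chunk(2, dim=-1)          # (B, T, d_qk) each
        attn = torch.einsum("btd,bsd->bts", q, k) * self.scale
        if causal:
            T = x.size(1)
            mask = torch.triu(torch.ones(T, T, dtype=torch.bool,
                                          device=x.device), diagonal=1)
            attn = attn.masked_fill(mask, float("-inf"))
        attn = attn.softmax(dim=-1)
        # Gate the attended values elementwise with the u branch.
        out = F.silu(u) * torch.einsum("bts,bsd->btd", attn, v)
        return x + self.out(out)
```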
- Asia > China > Shanghai > Shanghai (0.04)
- North America > United States (0.04)
- Research Report > New Finding (0.46)
- Research Report > Experimental Study (0.46)
Generative Pre-trained Speech Language Model with Efficient Hierarchical Transformer
Zhu, Yongxin, Su, Dan, He, Liqiang, Xu, Linli, Yu, Dong
While recent advancements in speech language models have achieved significant progress, they face considerable challenges in modeling the long acoustic sequences of neural audio codecs. In this paper, we introduce the Generative Pre-trained Speech Transformer (GPST), a hierarchical transformer designed for efficient speech language modeling. GPST quantizes audio waveforms into two distinct types of discrete speech representations and integrates them within a hierarchical transformer architecture, allowing for a unified one-stage generation process and enhancing Hi-Res audio generation capabilities. By training on large corpora of speech in an end-to-end unsupervised manner, GPST can generate syntactically consistent speech with diverse speaker identities. Given a brief 3-second prompt, GPST can produce natural and coherent personalized speech, demonstrating in-context learning abilities. Moreover, our approach can be easily extended to cross-lingual speech generation by incorporating multi-lingual semantic tokens and universal acoustic tokens. Experimental results indicate that GPST significantly outperforms existing speech language models in terms of word error rate, speech quality, and speaker similarity. See https://youngsheen.github.io/GPST/demo for demo samples.
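One way to read "hierarchical transformer over two token types" is a two-level decoder: a global transformer over per-frame summaries and a local transformer over the codes within each frame. The sketch below follows that generic pattern only; GPST's actual split into semantic and acoustic tokens, its conditioning, positional encodings, and sizes are not reproduced here.

```python
import torch
import torch.nn as nn

class HierarchicalSpeechLM(nn.Module):
    """Two-level decoder sketch: a global transformer runs over one summary
    embedding per frame, and a local transformer predicts the stack of
    residual codes inside each frame. Positional encodings are omitted and
    all sizes are placeholders; this is not GPST's actual configuration."""
    def __init__(self, n_codebooks=8, codebook_size=1024, d_model=512):
        super().__init__()
        self.n_codebooks = n_codebooks
        self.codebook_size = codebook_size
        self.code_emb = nn.Embedding(n_codebooks * codebook_size, d_model)
        self.global_tf = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True),
            num_layers=6)
        self.local_tf = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True),
            num_layers=2)
        self.head = nn.Linear(d_model, codebook_size)

    @staticmethod
    def _causal(sz, device):
        # Additive causal mask: -inf above the diagonal, 0 elsewhere.
        return torch.full((sz, sz), float("-inf"), device=device).triu(1)

    def forward(self, codes):
        # codes: (B, T, n_codebooks) residual code indices for each frame.
        B, T, Q = codes.shape
        offsets = torch.arange(Q, device=codes.device) * self.codebook_size
        emb = self.code_emb(codes + offsets)              # (B, T, Q, d_model)
        frame = emb.sum(dim=2)                            # one vector per frame
        ctx = self.global_tf(frame, mask=self._causal(T, codes.device))
        # Local level: predict code q from frame context plus codes < q.
        local_in = torch.cat([ctx.unsqueeze(2), emb[:, :, :-1]], dim=2)
        local_in = local_in.reshape(B * T, Q, -1)
        out = self.local_tf(local_in, mask=self._causal(Q, codes.device))
        return self.head(out).view(B, T, Q, -1)           # logits per code slot
```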
- North America > United States > Minnesota > Hennepin County > Minneapolis (0.14)
- Asia > China (0.05)
- North America > United States > Louisiana > Orleans Parish > New Orleans (0.04)
- (9 more...)
RALL-E: Robust Codec Language Modeling with Chain-of-Thought Prompting for Text-to-Speech Synthesis
Xin, Detai, Tan, Xu, Shen, Kai, Ju, Zeqian, Yang, Dongchao, Wang, Yuancheng, Takamichi, Shinnosuke, Saruwatari, Hiroshi, Liu, Shujie, Li, Jinyu, Zhao, Sheng
We present RALL-E, a robust language modeling method for text-to-speech (TTS) synthesis. While previous work based on large language models (LLMs) shows impressive performance on zero-shot TTS, such methods often suffer from poor robustness, such as unstable prosody (weird pitch and rhythm/duration) and a high word error rate (WER), due to the autoregressive prediction style of language models. The core idea behind RALL-E is chain-of-thought (CoT) prompting, which decomposes the task into simpler steps to enhance the robustness of LLM-based TTS. To accomplish this idea, RALL-E first predicts prosody features (pitch and duration) of the input text and uses them as intermediate conditions to predict speech tokens in a CoT style. Second, RALL-E utilizes the predicted duration prompt to guide the computation of self-attention weights in the Transformer, so that the model focuses on the corresponding phonemes and prosody features when predicting speech tokens. Results of comprehensive objective and subjective evaluations demonstrate that, compared to the powerful baseline VALL-E, RALL-E significantly improves the WER of zero-shot TTS from 5.6% (without reranking) and 1.7% (with reranking) to 2.5% and 1.0%, respectively. Furthermore, we demonstrate that RALL-E correctly synthesizes sentences that are hard for VALL-E and reduces the error rate from 68% to 4%.
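The duration-guided attention idea, letting each speech frame attend mainly to the phonemes its predicted duration aligns it with, can be illustrated with a small mask-building helper. The hard masking and window size below are assumptions of this sketch; RALL-E's actual guidance may use softer weighting.

```python
import torch

def duration_guided_bias(durations, num_frames, window=1, penalty=-1e9):
    """Additive attention bias of shape (num_frames, num_phonemes) built from
    predicted phoneme durations: each speech frame may only attend to its
    aligned phoneme plus `window` neighbours on either side. Illustrative
    only; the real scheme may weight rather than mask.

    durations: 1-D int tensor, number of frames predicted for each phoneme."""
    ends = torch.cumsum(durations, dim=0)                 # cumulative frame ends
    frame_idx = torch.arange(num_frames)
    # Phoneme index that each frame falls under, given the durations.
    phone_of_frame = torch.searchsorted(ends, frame_idx, right=True)
    phone_of_frame = phone_of_frame.clamp(max=durations.numel() - 1)

    phone_idx = torch.arange(durations.numel())
    dist = (phone_idx.unsqueeze(0) - phone_of_frame.unsqueeze(1)).abs()
    bias = torch.full(dist.shape, penalty)
    bias[dist <= window] = 0.0
    return bias

# Example: four phonemes lasting 2, 3, 1, and 2 frames -> 8 speech frames.
durs = torch.tensor([2, 3, 1, 2])
bias = duration_guided_bias(durs, num_frames=int(durs.sum()))
print(bias.shape)  # torch.Size([8, 4]); added to attention logits over phonemes
```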
- Asia > Japan > Honshū > Kantō > Tokyo Metropolis Prefecture > Tokyo (0.04)
- Asia > China > Hong Kong (0.04)
- North America > United States > California > San Diego County > San Diego (0.04)
- Asia > China > Guangdong Province > Shenzhen (0.04)
ELLA-V: Stable Neural Codec Language Modeling with Alignment-guided Sequence Reordering
Song, Yakun, Chen, Zhuo, Wang, Xiaofei, Ma, Ziyang, Chen, Xie
The language model (LM) approach based on acoustic and linguistic prompts, such as VALL-E, has achieved remarkable progress in the field of zero-shot audio generation. However, existing methods still have some limitations: 1) repetitions, transpositions, and omissions in the output synthesized speech due to limited alignment constraints between audio and phoneme tokens; 2) challenges of fine-grained control over the synthesized speech with autoregressive (AR) language model; 3) infinite silence generation due to the nature of AR-based decoding, especially under the greedy strategy. To alleviate these issues, we propose ELLA-V, a simple but efficient LM-based zero-shot text-to-speech (TTS) framework, which enables fine-grained control over synthesized audio at the phoneme level. The key to ELLA-V is interleaving sequences of acoustic and phoneme tokens, where phoneme tokens appear ahead of the corresponding acoustic tokens. The experimental findings reveal that our model outperforms VALL-E in terms of accuracy and delivers more stable results using both greedy and sampling-based decoding strategies. The code of ELLA-V will be open-sourced after cleanups. Audio samples are available at https://ereboas.github.io/ELLAV/.
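The interleaving scheme is simple to state in code: each phoneme token is emitted just before the acoustic tokens aligned to it. The sketch below assumes an end-of-phoneme sentinel and a precomputed phoneme-to-token alignment; both are illustrative choices rather than a claim about ELLA-V's exact vocabulary.

```python
def interleave_phonemes_and_codes(phonemes, codes_per_phoneme,
                                  eop_token="<EOP>"):
    """Builds an interleaved training sequence in the ELLA-V spirit: each
    phoneme token is placed immediately before the acoustic tokens aligned
    to it. The <EOP> sentinel marking the end of each phoneme's span is an
    assumption of this sketch.

    phonemes:           list of phoneme symbols, e.g. ["HH", "AY"]
    codes_per_phoneme:  list of lists of acoustic token ids aligned to each
                        phoneme (e.g. from a forced aligner)."""
    seq = []
    for phone, codes in zip(phonemes, codes_per_phoneme):
        seq.append(phone)        # phoneme token appears ahead of its audio
        seq.extend(codes)        # the acoustic tokens aligned to this phoneme
        seq.append(eop_token)    # sentinel: this phoneme's span is finished
    return seq

# Example alignment: "hi" -> HH (2 frames), AY (3 frames)
print(interleave_phonemes_and_codes(["HH", "AY"], [[101, 102], [57, 58, 59]]))
# ['HH', 101, 102, '<EOP>', 'AY', 57, 58, 59, '<EOP>']
```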
- North America > Canada > Quebec > Montreal (0.05)
- Asia > China > Shanghai > Shanghai (0.04)
- North America > United States > Kansas > Cowley County (0.04)
- Europe > Italy > Calabria > Catanzaro Province > Catanzaro (0.04)
Boosting Large Language Model for Speech Synthesis: An Empirical Study
Hao, Hongkun, Zhou, Long, Liu, Shujie, Li, Jinyu, Hu, Shujie, Wang, Rui, Wei, Furu
Large language models (LLMs) have made significant advancements in natural language processing and are concurrently extending their language abilities to other modalities, such as speech and vision. Nevertheless, most previous work focuses on prompting LLMs with perception abilities such as auditory comprehension, and the effective approach for augmenting LLMs with speech synthesis capabilities remains ambiguous. In this paper, we conduct a comprehensive empirical exploration of boosting LLMs with the ability to generate speech, by combining the pre-trained LLMs LLaMA/OPT and the text-to-speech synthesis model VALL-E. We compare three integration methods between LLMs and speech synthesis models: directly fine-tuned LLMs, superposed layers of LLMs and VALL-E, and coupled LLMs and VALL-E using the LLM as a powerful text encoder. Experimental results show that using the LoRA method to fine-tune LLMs directly to boost speech synthesis capability does not work well, while superposing LLMs and VALL-E improves the quality of generated speech in both speaker similarity and word error rate (WER). Among these three methods, the coupled method, which leverages the LLM as the text encoder, achieves the best performance, outperforming the original speech synthesis models with consistently better speaker similarity and a significant (10.9%) WER reduction.
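Of the three integrations, the best-performing "coupled" variant (LLM as text encoder) can be sketched as a cross-attention bridge from LLM hidden states into an acoustic-token decoder. Everything below, the frozen LLM call, the bridge layer, and the decoder sizes, is an assumed stand-in rather than the paper's configuration.

```python
import torch
import torch.nn as nn

class CoupledLLMTTS(nn.Module):
    """Sketch of the 'coupled' integration: a pre-trained LLM is used purely
    as a text encoder, and its hidden states condition an acoustic-token
    decoder through cross-attention. `llm` is assumed to be a Hugging
    Face-style causal LM that returns hidden states; the bridge layer and
    decoder sizes are placeholders, not the paper's configuration."""
    def __init__(self, llm, llm_dim=4096, d_model=1024,
                 codebook_size=1024, n_layers=12):
        super().__init__()
        self.llm = llm                                   # e.g. LLaMA/OPT, kept frozen here
        self.bridge = nn.Linear(llm_dim, d_model)
        self.code_emb = nn.Embedding(codebook_size, d_model)
        layer = nn.TransformerDecoderLayer(d_model, nhead=16, batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, num_layers=n_layers)
        self.head = nn.Linear(d_model, codebook_size)

    def forward(self, text_ids, acoustic_codes):
        # Text side: final hidden states of the LLM serve as the text encoding.
        with torch.no_grad():
            text_h = self.llm(text_ids,
                              output_hidden_states=True).hidden_states[-1]
        memory = self.bridge(text_h)                     # (B, T_text, d_model)
        # Speech side: causal decoder over (first-codebook) acoustic tokens.
        tgt = self.code_emb(acoustic_codes)              # (B, T_audio, d_model)
        T = tgt.size(1)
        causal = torch.full((T, T), float("-inf"), device=tgt.device).triu(1)
        out = self.decoder(tgt, memory, tgt_mask=causal)
        return self.head(out)                            # next-token logits
```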
- South America > Chile > Santiago Metropolitan Region > Santiago Province > Santiago (0.04)
- North America > United States > California > San Diego County > San Diego (0.04)
- Asia > China > Shanghai > Shanghai (0.04)
- Africa > Ethiopia > Addis Ababa > Addis Ababa (0.04)
Voicebox: Text-Guided Multilingual Universal Speech Generation at Scale
Le, Matthew, Vyas, Apoorv, Shi, Bowen, Karrer, Brian, Sari, Leda, Moritz, Rashel, Williamson, Mary, Manohar, Vimal, Adi, Yossi, Mahadeokar, Jay, Hsu, Wei-Ning
Large-scale generative models such as GPT and DALL-E have revolutionized the research community. These models not only generate high-fidelity outputs, but are also generalists that can solve tasks not explicitly taught. In contrast, speech generative models are still primitive in terms of scale and task generalization. In this paper, we present Voicebox, the most versatile text-guided generative model for speech at scale. Voicebox is a non-autoregressive flow-matching model trained to infill speech given audio context and text, using over 50K hours of speech that are neither filtered nor enhanced. Similar to GPT, Voicebox can perform many different tasks through in-context learning, but is more flexible as it can also condition on future context. Voicebox can be used for mono- or cross-lingual zero-shot text-to-speech synthesis, noise removal, content editing, style conversion, and diverse sample generation. In particular, Voicebox outperforms the state-of-the-art zero-shot TTS model VALL-E on both intelligibility (5.9% vs 1.9% word error rate) and audio similarity (0.580 vs 0.681) while being up to 20 times faster. Audio samples can be found at https://voicebox.metademolab.com.
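A minimal version of the training objective, conditional flow matching with masked infilling, looks roughly as follows; the `model(x_t, t, x_ctx, text_emb)` signature, the mask convention, and `sigma_min` are assumptions of this sketch rather than Voicebox's exact recipe.

```python
import torch
import torch.nn.functional as F

def flow_matching_infill_loss(model, x1, ctx_mask, text_emb, sigma_min=1e-5):
    """One training step of conditional flow matching for speech infilling.
    `model(x_t, t, x_ctx, text_emb)` is an assumed signature for a
    non-autoregressive network predicting a vector field; x1 is the target
    feature sequence (B, T, D), ctx_mask is (B, T, 1) with 1.0 where audio
    context is visible and 0.0 on the span to infill."""
    b = x1.size(0)
    x0 = torch.randn_like(x1)                              # noise sample
    t = torch.rand(b, 1, 1, device=x1.device)              # per-example time
    # Optimal-transport conditional path and its target velocity.
    x_t = (1.0 - (1.0 - sigma_min) * t) * x0 + t * x1
    u_t = x1 - (1.0 - sigma_min) * x0
    # The masked-out span is what the model must infill; context stays visible.
    x_ctx = x1 * ctx_mask
    v_pred = model(x_t, t.squeeze(-1).squeeze(-1), x_ctx, text_emb)
    # Loss is computed only on the infilled (non-context) positions.
    infill = 1.0 - ctx_mask
    denom = (infill.sum() * x1.size(-1)).clamp(min=1)
    return (F.mse_loss(v_pred, u_t, reduction="none") * infill).sum() / denom
```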
- Europe > United Kingdom > North Sea > Southern North Sea (0.04)
- North America > Canada > Quebec > Montreal (0.04)
- Asia > Japan (0.04)
- (4 more...)
SC VALL-E: Style-Controllable Zero-Shot Text to Speech Synthesizer
Kim, Daegyeom, Hong, Seongho, Choi, Yong-Hoon
Expressive speech synthesis models are trained by adding corpora with diverse speakers, various emotions, and different speaking styles to the dataset, in order to control various characteristics of speech and generate the desired voice. In this paper, we propose a style control (SC) VALL-E model based on the neural codec language model VALL-E, which follows the structure of the generative pre-trained transformer 3 (GPT-3). The proposed SC VALL-E takes input from text sentences and prompt audio and is designed to generate controllable speech, not by simply mimicking the characteristics of the prompt audio but by controlling its attributes to produce diverse voices. We identify tokens in the style embedding matrix of the newly designed style network that represent attributes such as emotion, speaking rate, pitch, and voice intensity, and design a model that can control these attributes. To evaluate the performance of SC VALL-E, we conduct comparative experiments with three representative expressive speech synthesis models: global style token (GST) Tacotron2, variational autoencoder (VAE) Tacotron2, and the original VALL-E. We measure word error rate (WER), F0 voiced error (FVE), and F0 gross pitch error (F0GPE) as evaluation metrics to assess the accuracy of generated sentences. For comparing the quality of synthesized speech, we measure the comparative mean opinion score (CMOS) and similarity mean opinion score (SMOS). To evaluate the style control ability of the generated speech, we observe the changes in F0 and mel-spectrogram when modifying the trained tokens. When using prompt audio that is not present in the training data, SC VALL-E generates a variety of expressive sounds and demonstrates competitive performance compared to the existing models. Our implementation, pretrained models, and audio samples are located on GitHub.
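The style-token mechanism can be illustrated with a GST-style layer: a learned token matrix, attention weights derived from the prompt audio, and manual re-weighting of individual tokens at inference to steer attributes. The sketch below is that generic pattern with illustrative sizes, not SC VALL-E's specific style network.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class StyleTokenLayer(nn.Module):
    """GST-style token layer as a stand-in for the 'style network' described
    above: a reference embedding attends over a learned style-token matrix,
    and the resulting weighted sum conditions the synthesizer. At inference
    the attention weights can be set by hand to scale individual tokens
    (e.g. ones observed to track pitch or speaking rate)."""
    def __init__(self, ref_dim=256, num_tokens=10, token_dim=256):
        super().__init__()
        self.tokens = nn.Parameter(torch.randn(num_tokens, token_dim) * 0.02)
        self.query = nn.Linear(ref_dim, token_dim)
        self.scale = token_dim ** -0.5

    def forward(self, ref_emb, manual_weights=None):
        # ref_emb: (B, ref_dim) embedding of the prompt/reference audio.
        if manual_weights is None:
            q = self.query(ref_emb)                       # (B, token_dim)
            scores = q @ self.tokens.t() * self.scale     # (B, num_tokens)
            weights = F.softmax(scores, dim=-1)
        else:
            weights = manual_weights                      # user-chosen control
        style = weights @ self.tokens                     # (B, token_dim)
        return style, weights

# Usage: boost token 3 (hypothetically tied to speaking rate) at inference.
layer = StyleTokenLayer()
ref = torch.randn(1, 256)
w = torch.full((1, 10), 0.05)
w[0, 3] = 0.55
style, _ = layer(ref, manual_weights=w)
```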
- Europe > Sweden > Stockholm > Stockholm (0.04)
- North America > United States > Rhode Island (0.04)
- North America > Canada > Ontario > Toronto (0.04)
- (9 more...)