Continuous Autoregressive Modeling with Stochastic Monotonic Alignment for Speech Synthesis

Lin, Weiwei, He, Chenghan

arXiv.org Artificial Intelligence 

We propose a novel autoregressive modeling approach for speech synthesis, combining a variational autoencoder (VAE) with a multi-modal latent space and an autoregressive model that uses Gaussian Mixture Models (GMMs) as the conditional probability distribution. Unlike previous methods that rely on residual vector quantization, our model leverages continuous speech representations from the VAE's latent space, greatly simplifying the training and inference pipelines. We also introduce a stochastic monotonic alignment mechanism to enforce strictly monotonic alignment between text and speech. Our approach significantly outperforms the state-of-the-art autoregressive model VALL-E in both subjective and objective evaluations, achieving these results with only 10.3% of VALL-E's parameters. This demonstrates the potential of continuous speech language models as a more efficient alternative to existing quantization-based speech language models. Sample audio can be found at https://tinyurl.com/gmm-lm-tts.

Transformers trained with autoregressive (AR) objectives have become the dominant approach in natural language processing (NLP) (Radford et al., 2019; Brown et al., 2020; Vaswani, 2017). These successes have inspired researchers to apply Transformers and autoregressive objectives to the image and speech domains as well (Ramesh et al., 2021; Wang et al.; Betker, 2023). Since speech and images are continuous signals, discretization is a critical first step before applying discrete autoregressive training. As a result, autoregressive modeling of images and audio typically involves two stages of training. In the first stage, a VQ-VAE (Van Den Oord et al., 2017) or a variant of VQ-VAE (Zeghidour et al., 2021; Défossez et al., 2022; Kumar et al., 2024) is trained to encode the input data into discrete latent representations using a vector quantization bottleneck. After training the VQ-VAE, an autoregressive model is trained on the discrete latent codes produced by the encoder (Ramesh et al., 2021).
The AR model captures the sequential dependencies in the latent space, learning to predict the next latent code based on previous ones, which enables high-fidelity generation.
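To make the modeling idea concrete, the following is a minimal sketch of evaluating the negative log-likelihood of one continuous latent frame under a diagonal-covariance Gaussian mixture, the kind of conditional distribution the abstract describes for the AR model (in contrast to the discrete next-code prediction above). The function name, dimensions, and the flat parameter layout are illustrative assumptions, not the paper's actual implementation.

```python
import numpy as np

def gmm_nll(params, target, n_mix, dim):
    """Negative log-likelihood of a continuous latent frame under a
    diagonal-covariance GMM.

    params: flat vector of length n_mix * (1 + 2*dim); per mixture
            component it holds one weight logit, `dim` means, and
            `dim` log-variances (a hypothetical layout).
    target: the latent frame to score, shape (dim,).
    """
    p = params.reshape(n_mix, 1 + 2 * dim)
    logits, means, log_var = p[:, 0], p[:, 1:1 + dim], p[:, 1 + dim:]
    # log mixture weights via a numerically stable log-softmax
    log_w = logits - logits.max() - np.log(np.sum(np.exp(logits - logits.max())))
    # per-component log-density of the target under a diagonal Gaussian
    log_comp = -0.5 * np.sum(
        log_var + np.log(2 * np.pi) + (target - means) ** 2 / np.exp(log_var),
        axis=1,
    )
    # log-sum-exp over components gives the mixture log-likelihood
    m = np.max(log_w + log_comp)
    return -(m + np.log(np.sum(np.exp(log_w + log_comp - m))))

# Usage: score a random 8-dim latent frame under a 4-component GMM whose
# parameters would, in the AR model, be predicted from the previous frames.
rng = np.random.default_rng(0)
n_mix, dim = 4, 8
params = rng.normal(size=n_mix * (1 + 2 * dim))
frame = rng.normal(size=dim)
loss = gmm_nll(params, frame, n_mix, dim)
```

In an AR model of this kind, `params` would be the network's output at each step, and the training loss would be the mean of `gmm_nll` over all frames; sampling the next frame then amounts to picking a component and drawing from its Gaussian.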
