Continuous Autoregressive Modeling with Stochastic Monotonic Alignment for Speech Synthesis

Lin, Weiwei, He, Chenghan

arXiv.org Artificial Intelligence 

We propose a novel autoregressive modeling approach for speech synthesis, combining a variational autoencoder (VAE) with a multi-modal latent space and an autoregressive model that uses Gaussian Mixture Models (GMMs) as the conditional probability distribution. Unlike previous methods that rely on residual vector quantization, our model leverages continuous speech representations from the VAE's latent space, greatly simplifying the training and inference pipelines. We also introduce a stochastic monotonic alignment mechanism to enforce strictly monotonic alignment between text and speech. Our approach significantly outperforms the state-of-the-art autoregressive model VALL-E in both subjective and objective evaluations, achieving these results with only 10.3% of VALL-E's parameters. This demonstrates the potential of continuous speech language models as a more efficient alternative to existing quantization-based speech language models. Sample audio can be found at https://tinyurl.com/gmm-lm-tts.

Transformers trained with autoregressive (AR) objectives have become the dominant approach in natural language processing (NLP) (Radford et al., 2019; Brown et al., 2020; Vaswani, 2017). These successes have inspired researchers to apply Transformers and autoregressive objectives to the image and speech domains as well (Ramesh et al., 2021; Wang et al.; Betker, 2023). Since speech and images are continuous signals, discretization is a critical first step before applying discrete autoregressive training. As a result, autoregressive modeling of images and audio typically involves two stages of training. In the first stage, a VQ-VAE (Van Den Oord et al., 2017) or a variant of VQ-VAE (Zeghidour et al., 2021; Défossez et al., 2022; Kumar et al., 2024) is trained to encode the input data into discrete latent representations using a vector quantization bottleneck. After training the VQ-VAE, an autoregressive model is trained on the discrete latent codes produced by the encoder (Ramesh et al., 2021).
The AR model captures the sequential dependencies in the latent space, learning to predict the next latent code based on previous ones, which enables high-fidelity generation.
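To make the modeling idea concrete, the following is a minimal sketch of evaluating the negative log-likelihood of one continuous latent frame under a diagonal-covariance Gaussian mixture, the kind of conditional distribution the abstract describes for the AR model (in contrast to the discrete next-code prediction above). The function name, dimensions, and the flat parameter layout are illustrative assumptions, not the paper's actual implementation.

```python
import numpy as np

def gmm_nll(params, target, n_mix, dim):
    """Negative log-likelihood of a continuous latent frame under a
    diagonal-covariance GMM.

    params: flat vector of length n_mix * (1 + 2*dim); per mixture
            component it holds one weight logit, `dim` means, and
            `dim` log-variances (a hypothetical layout).
    target: the latent frame to score, shape (dim,).
    """
    p = params.reshape(n_mix, 1 + 2 * dim)
    logits, means, log_var = p[:, 0], p[:, 1:1 + dim], p[:, 1 + dim:]
    # log mixture weights via a numerically stable log-softmax
    log_w = logits - logits.max() - np.log(np.sum(np.exp(logits - logits.max())))
    # per-component log-density of the target under a diagonal Gaussian
    log_comp = -0.5 * np.sum(
        log_var + np.log(2 * np.pi) + (target - means) ** 2 / np.exp(log_var),
        axis=1,
    )
    # log-sum-exp over components gives the mixture log-likelihood
    m = np.max(log_w + log_comp)
    return -(m + np.log(np.sum(np.exp(log_w + log_comp - m))))

# Usage: score a random 8-dim latent frame under a 4-component GMM whose
# parameters would, in the AR model, be predicted from the previous frames.
rng = np.random.default_rng(0)
n_mix, dim = 4, 8
params = rng.normal(size=n_mix * (1 + 2 * dim))
frame = rng.normal(size=dim)
loss = gmm_nll(params, frame, n_mix, dim)
```

In an AR model of this kind, `params` would be the network's output at each step, and the training loss would be the mean of `gmm_nll` over all frames; sampling the next frame then amounts to picking a component and drawing from its Gaussian.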
