AITopics

Technology: Information Technology > Artificial Intelligence > Machine Learning (0.40)

Neural Information Processing Systems, Dec-24-2025, 02:07:40 GMT

Glow-TTS: A Generative Flow for Text-to-Speech via Monotonic Alignment Search

Recently, text-to-speech (TTS) models such as FastSpeech and ParaNet have been proposed to generate mel-spectrograms from text in parallel. Despite the advantage, the parallel TTS models cannot be trained without guidance from autoregressive TTS models as their external aligners. In this work, we propose Glow-TTS, a flow-based generative model for parallel TTS that does not require any external aligner. By combining the properties of flows and dynamic programming, the proposed model searches for the most probable monotonic alignment between text and the latent representation of speech on its own. We demonstrate that enforcing hard monotonic alignments enables robust TTS, which generalizes to long utterances, and employing generative flows enables fast, diverse, and controllable speech synthesis.
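The dynamic-programming search mentioned in the abstract can be illustrated with a small sketch. This is not the authors' implementation: the function name `monotonic_alignment_search`, the plain nested-list input, and the tie-breaking rule are assumptions made for illustration. It assumes `log_lik[i][j]` holds the log-likelihood of latent frame `j` under text token `i`, and that a valid alignment starts at the first token, ends at the last, and never skips or revisits a token.

```python
import math

def monotonic_alignment_search(log_lik):
    """Return (alignment, score) for the most probable monotonic alignment.

    log_lik[i][j] is assumed to be the log-likelihood of latent frame j
    under text token i. alignment[j] is the token index assigned to frame j.
    """
    n_text, n_frames = len(log_lik), len(log_lik[0])
    if n_text > n_frames:
        raise ValueError("need at least as many frames as text tokens")

    neg_inf = -math.inf
    # q[i][j]: best total log-likelihood of aligning frames 0..j to tokens 0..i
    q = [[neg_inf] * n_frames for _ in range(n_text)]
    q[0][0] = log_lik[0][0]
    for j in range(1, n_frames):
        # Token i can be reached at frame j only if i <= j.
        for i in range(min(j + 1, n_text)):
            stay = q[i][j - 1]                              # same token as previous frame
            advance = q[i - 1][j - 1] if i > 0 else neg_inf  # move to the next token
            q[i][j] = max(stay, advance) + log_lik[i][j]

    # Backtrack from the last token at the last frame.
    alignment = [0] * n_frames
    i = n_text - 1
    for j in range(n_frames - 1, -1, -1):
        alignment[j] = i
        if j > 0 and i > 0 and q[i - 1][j - 1] >= q[i][j - 1]:
            i -= 1
    return alignment, q[n_text - 1][n_frames - 1]


# Toy example: token 0 fits the first two frames, token 1 the last two.
toy = [[-1.0, -1.0, -5.0, -5.0],
       [-5.0, -5.0, -1.0, -1.0]]
alignment, score = monotonic_alignment_search(toy)
```

In this toy case the search assigns frames 0 and 1 to token 0 and frames 2 and 3 to token 1. The cost is proportional to the number of tokens times the number of frames, which is what makes a hard alignment search of this kind cheap enough to run at every training step.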

generative flow, glow-tts, text-to-speech, (6 more...)

Technology: Information Technology > Artificial Intelligence > Speech > Speech Synthesis (0.64)

arXiv.org Artificial Intelligence, Nov-18-2025

VoiceCraft-X: Unifying Multilingual, Voice-Cloning Speech Synthesis and Speech Editing

Zheng, Zhisheng, Peng, Puyuan, Diwan, Anuj, Huynh, Cong Phuoc, Sun, Xiaohang, Liu, Zhu, Bhat, Vimal, Harwath, David

We introduce VoiceCraft-X, an autoregressive neural codec language model which unifies multilingual speech editing and zero-shot Text-to-Speech (TTS) synthesis across 11 languages: English, Mandarin, Korean, Japanese, Spanish, French, German, Dutch, Italian, Portuguese, and Polish. VoiceCraft-X utilizes the Qwen3 large language model for phoneme-free cross-lingual text processing and a novel token reordering mechanism with time-aligned text and speech tokens to handle both tasks as a single sequence generation problem. The model generates high-quality, natural-sounding speech, seamlessly creating new audio or editing existing recordings within one framework. VoiceCraft-X shows robust performance in diverse linguistic settings, even with limited per-language data, underscoring the power of unified autoregressive approaches for advancing complex, real-world multilingual speech applications. Audio samples are available at https://zhishengzheng.com/voicecraft-x/.

arxiv preprint arxiv, large language model, machine learning, (15 more...)

2511.12347

Country:

Europe (1.00)
North America > United States > Texas (0.28)

Genre: Research Report (0.64)

Industry: Information Technology > Security & Privacy (0.64)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (0.91)
Information Technology > Artificial Intelligence > Speech > Speech Synthesis (0.88)

arXiv.org Artificial Intelligence, Nov-6-2025

PolyNorm: Few-Shot LLM-Based Text Normalization for Text-to-Speech

Wong, Michel, Alshehri, Ali, Kao, Sophia, He, Haotian

Text Normalization (TN) is a key preprocessing step in Text-to-Speech (TTS) systems, converting written forms into their canonical spoken equivalents. Traditional TN systems can exhibit high accuracy, but involve substantial engineering effort, are difficult to scale, and pose challenges to language coverage, particularly in low-resource settings. We propose PolyNorm, a prompt-based approach to TN using Large Language Models (LLMs), aiming to reduce the reliance on manually crafted rules and enable broader linguistic applicability with minimal human intervention. Additionally, we present a language-agnostic pipeline for automatic data curation and evaluation, designed to facilitate scalable experimentation across diverse languages. Experiments across eight languages show consistent reductions in the word error rate (WER) compared to a production-grade-based system. To support further research, we release PolyNorm-Benchmark, a multilingual data set covering a diverse range of text normalization phenomena.
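For readers unfamiliar with the metric, word error rate is conventionally the word-level edit distance between a reference and a hypothesis, divided by the number of reference words. The sketch below shows that conventional definition only. The abstract does not say how PolyNorm tokenizes or scores its outputs, so the whitespace tokenization and the function name `word_error_rate` are assumptions.

```python
def word_error_rate(reference, hypothesis):
    """Word-level Levenshtein distance divided by the reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j]: edit distance between the reference words seen so far and hyp[:j]
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            curr.append(min(prev[j] + 1,              # deletion
                            curr[j - 1] + 1,          # insertion
                            prev[j - 1] + (r != h)))  # substitution or match
        prev = curr
    return prev[-1] / len(ref)


# Hypothetical normalization output with one wrong word out of three.
wer = word_error_rate("twenty five dollars", "twenty five dollar")
```

Here one substitution against a three-word reference gives a WER of one third.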

large language model, machine learning, normalization, (18 more...)

2511.0308

Genre: Research Report (0.40)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.31)

Huynh-Nguyen, Hieu-Nghia, Dang, Huynh Nguyen, Nguyen, Ngoc-Son, Nguyen, Van

Flamed-TTS: Flow Matching Attention-Free Models for Efficient Generating and Dynamic Pacing Zero-shot Text-to-Speech

arXiv.org Artificial Intelligence, Oct-6-2025

Zero-shot Text-to-Speech (TTS) has recently advanced significantly, enabling models to synthesize speech from text using short, limited-context prompts. These prompts serve as voice exemplars, allowing the model to mimic speaker identity, prosody, and other traits without extensive speaker-specific data. Although recent approaches incorporating language models, diffusion, and flow matching have proven their effectiveness in zero-shot TTS, they still encounter challenges such as unreliable synthesis caused by token repetition or unexpected content transfer, along with slow inference and substantial computational overhead. Moreover, temporal diversity, crucial for enhancing the naturalness of synthesized speech, remains largely underexplored. To address these challenges, we propose Flamed-TTS, a novel zero-shot TTS framework that emphasizes low computational cost, low latency, and high speech fidelity alongside rich temporal diversity. To achieve this, we reformulate the flow matching training paradigm and incorporate both discrete and continuous representations corresponding to different attributes of speech. Experimental results demonstrate that Flamed-TTS surpasses state-of-the-art models in terms of intelligibility, naturalness, speaker similarity, acoustic characteristics preservation, and dynamic pace. Notably, Flamed-TTS achieves the best WER of 4% compared to the leading zero-shot TTS baselines, while maintaining low latency in inference and high fidelity in generated speech. Code and audio samples are available at our demo page https://flamed-tts.github.io.

large language model, machine learning, natural language, (20 more...)

2510.02848

Country: Asia (0.28)

Genre: Research Report > Promising Solution (0.34)

Technology:

Information Technology > Artificial Intelligence > Speech (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (0.93)

Falai, Alessio, Zhang, Ziyao, Gangoly, Akos

Unseen Speaker and Language Adaptation for Lightweight Text-To-Speech with Adapters

arXiv.org Artificial Intelligence, Aug-26-2025

In this paper we investigate cross-lingual Text-To-Speech (TTS) synthesis through the lens of adapters, in the context of lightweight TTS systems. In particular, we compare the tasks of unseen speaker and language adaptation with the goal of synthesising a target voice in a target language for which that voice has no recordings. Results from objective evaluations demonstrate the effectiveness of adapters in learning language-specific and speaker-specific information, allowing pre-trained models to learn unseen speaker identities or languages, while avoiding catastrophic forgetting of the original model's speaker or language information. Additionally, to measure how native the generated voices are in terms of accent, we propose and validate an objective metric inspired by mispronunciation detection techniques in second-language (L2) learners. The paper also provides insights into the impact of adapter placement, configuration and the number of speakers used.

artificial intelligence, machine learning, natural language, (19 more...)

2508.18006

Genre: Research Report (0.84)

Technology:

Information Technology > Artificial Intelligence > Natural Language (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (1.00)
Information Technology > Artificial Intelligence > Speech > Speech Synthesis (0.88)
Information Technology > Artificial Intelligence > Vision > Optical Character Recognition (0.63)

arXiv.org Artificial Intelligence, Aug-6-2025

Pseudo-Autoregressive Neural Codec Language Models for Efficient Zero-Shot Text-to-Speech Synthesis

Yang, Yifan, Liu, Shujie, Li, Jinyu, Hu, Yuxuan, Wu, Haibin, Wang, Hui, Yu, Jianwei, Meng, Lingwei, Sun, Haiyang, Liu, Yanqing, Lu, Yan, Yu, Kai, Chen, Xie

Recent zero-shot text-to-speech (TTS) systems face a common dilemma: autoregressive (AR) models suffer from slow generation and lack duration controllability, while non-autoregressive (NAR) models lack temporal modeling and typically require complex designs. In this paper, we introduce a novel pseudo-autoregressive (PAR) codec language modeling approach that unifies AR and NAR modeling. Combining explicit temporal modeling from AR with parallel generation from NAR, PAR generates dynamic-length spans at fixed time steps. Building on PAR, we propose PALLE, a two-stage TTS system that leverages PAR for initial generation followed by NAR refinement. In the first stage, PAR progressively generates speech tokens along the time dimension, with each step predicting all positions in parallel but only retaining the left-most span. In the second stage, low-confidence tokens are iteratively refined in parallel, leveraging the global contextual information. Experiments demonstrate that PALLE, trained on LibriTTS, outperforms state-of-the-art systems trained on large-scale data, including F5-TTS, E2-TTS, and MaskGCT, on the LibriSpeech test-clean set in terms of speech quality, speaker similarity, and intelligibility, while achieving up to ten times faster inference speed. Audio samples are available at https://microsoft.com/research/project/vall-e-x/palle.

large language model, machine learning, proc, (15 more...)

2504.10352

Country:

North America > United States (0.68)
Asia > China (0.48)
North America > Canada (0.46)

Genre: Research Report (1.00)

Technology:

Information Technology > Artificial Intelligence > Speech > Speech Synthesis (1.00)
Information Technology > Artificial Intelligence > Speech > Speech Recognition (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (0.93)

Vu, Thi, Nguyen, Linh The, Nguyen, Dat Quoc

Zero-Shot Text-to-Speech for Vietnamese

arXiv.org Artificial Intelligence, Jun-3-2025

This paper introduces PhoAudiobook, a newly curated dataset comprising 941 hours of high-quality audio for Vietnamese text-to-speech. Using PhoAudiobook, we conduct experiments on three leading zero-shot TTS models: VALL-E, VoiceCraft, and XTTS-V2. Our findings demonstrate that PhoAudiobook consistently enhances model performance across various metrics. Moreover, VALL-E and VoiceCraft exhibit superior performance in synthesizing short sentences, highlighting their robustness in handling diverse linguistic contexts. We publicly release PhoAudiobook to facilitate further research and development in Vietnamese text-to-speech.

artificial intelligence, large language model, natural language, (19 more...)

2506.01322

Genre: Research Report > New Finding (1.00)

Technology:

Information Technology > Artificial Intelligence > Speech > Speech Synthesis (1.00)
Information Technology > Artificial Intelligence > Vision > Optical Character Recognition (0.83)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (0.74)

Neural Information Processing Systems, Jan-24-2025, 18:18:02 GMT

Review for NeurIPS paper: Glow-TTS: A Generative Flow for Text-to-Speech via Monotonic Alignment Search

Weaknesses: I was a little confused about how the grouped 1x1 convolutions interact with the coupling layers. If the standard (half-and-half) partitioning is used for the coupling layers and the grouped 1x1 convolutions never mix channels outside of their group of 4, then half of the channels will never be transformed by any coupling layer. I'm assuming the authors deal with this issue somehow (since the results are good), but I only briefly scanned the code and didn't want to work through all of the index gymnastics. I could see readers being confused by these missing details. Update: In their response, the authors said they will explain more of the details of the grouped 1x1 convolutions in their revised version.
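The reviewer's concern can be reproduced with a toy reachability model. The sketch below is not taken from the Glow-TTS code. It only assumes that each coupling layer transforms the second half of the channels and that a grouped 1x1 convolution mixes channels strictly within its own group. The function name `reachable_channels` and both example groupings are hypothetical, and the second grouping is one possible remedy, not a description of what the authors do.

```python
def reachable_channels(n_channels, groups, n_layers=4):
    """Return the channel positions that some coupling layer can influence."""
    half = n_channels // 2
    touched = set()
    for _ in range(n_layers):
        # Coupling layer: the second half is transformed, the first half passes through.
        touched |= set(range(half, n_channels))
        # Grouped 1x1 convolution: mixing happens only inside each group, so a
        # group containing any touched channel makes all of its channels touched.
        for group in groups:
            if touched & set(group):
                touched |= set(group)
    return sorted(touched)


# Groups of 4 that never cross the half-and-half boundary.
contiguous = [[0, 1, 2, 3], [4, 5, 6, 7]]
# Hypothetical alternative: each group takes two channels from each half.
straddling = [[0, 1, 4, 5], [2, 3, 6, 7]]

stuck = reachable_channels(8, contiguous)
mixed = reachable_channels(8, straddling)
```

With the contiguous grouping only channels 4 to 7 are ever transformed, however many layers are stacked, which is the failure mode described in the review. When every group straddles both halves, all eight channels become reachable.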

generative flow, monotonic alignment search, vocoder, (7 more...)

Technology:

Information Technology > Artificial Intelligence > Vision > Optical Character Recognition (0.40)
Information Technology > Artificial Intelligence > Speech > Speech Synthesis (0.40)
Information Technology > Artificial Intelligence > Assistive Technologies (0.40)

Neural Information Processing Systems, Jan-24-2025, 18:17:55 GMT

Review for NeurIPS paper: Glow-TTS: A Generative Flow for Text-to-Speech via Monotonic Alignment Search

After rebuttal and discussion, all four reviewers provide very favorable reviews. The reviewers point out a novel methodology, combining flows with dynamic programming (hard monotonic alignment). The paper is therefore accepted for an oral.

generative flow, monotonic alignment search, text-to-speech, (2 more...)

Technology:

Information Technology > Artificial Intelligence > Vision > Optical Character Recognition (0.40)
Information Technology > Artificial Intelligence > Speech > Speech Synthesis (0.40)
Information Technology > Artificial Intelligence > Assistive Technologies (0.40)