AITopics

Technology: Information Technology > Artificial Intelligence (1.00)

Neural Information Processing SystemsFeb-14-2026, 05:47:01 GMT

5a016da670821af25f151f523a2e563f-Paper-Conference.pdf

large language model, machine learning, natural language, (21 more...)

Country:

North America > Canada > Ontario > Toronto (0.04)
Asia > Singapore (0.04)
Asia > Indonesia > Bali (0.04)
Asia > China (0.04)

Genre:

Research Report > Experimental Study (0.93)
Research Report > New Finding (0.93)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (0.95)
Information Technology > Artificial Intelligence > Vision (0.93)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.69)
(2 more...)

arXiv.org Artificial IntelligenceNov-18-2025

VoiceCraft-X: Unifying Multilingual, Voice-Cloning Speech Synthesis and Speech Editing

Zheng, Zhisheng, Peng, Puyuan, Diwan, Anuj, Huynh, Cong Phuoc, Sun, Xiaohang, Liu, Zhu, Bhat, Vimal, Harwath, David

We introduce VoiceCraft-X, an autoregressive neural codec language model which unifies multilingual speech editing and zero-shot Text-to-Speech (TTS) synthesis across 11 languages: English, Mandarin, Korean, Japanese, Spanish, French, German, Dutch, Italian, Portuguese, and Polish. VoiceCraft-X utilizes the Qwen3 large language model for phoneme-free cross-lingual text processing and a novel token reordering mechanism with time-aligned text and speech tokens to handle both tasks as a single sequence generation problem. The model generates high-quality, natural-sounding speech, seamlessly creating new audio or editing existing recordings within one framework. VoiceCraft-X shows robust performance in diverse linguistic settings, even with limited per-language data, underscoring the power of unified autoregressive approaches for advancing complex, real-world multilingual speech applications. Audio samples are available at https://zhishengzheng.com/voicecraft-x/.

arxiv preprint arxiv, large language model, machine learning, (15 more...)

2511.12347

Country:

Europe (1.00)
North America > United States > Texas (0.28)

Genre: Research Report (0.64)

Industry: Information Technology > Security & Privacy (0.64)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (0.91)
Information Technology > Artificial Intelligence > Speech > Speech Synthesis (0.88)

Neural Information Processing SystemsOct-10-2025, 03:33:21 GMT

5a016da670821af25f151f523a2e563f-Paper-Conference.pdf

codec language model, dataset, language model, (15 more...)

Country:

North America > Canada > Ontario > Toronto (0.04)
Asia > Singapore (0.04)
Asia > Indonesia > Bali (0.04)
Asia > China (0.04)

Genre:

Research Report > Experimental Study (0.93)
Research Report > New Finding (0.93)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (0.95)
Information Technology > Artificial Intelligence > Vision (0.93)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.69)
(2 more...)

Neural Information Processing SystemsMay-27-2025, 02:17:20 GMT

SpeechAlign: Aligning Speech Generation to Human Preferences

artificial intelligence, language model, speechalign, (9 more...)

Technology: Information Technology > Artificial Intelligence (1.00)

arXiv.org Artificial IntelligenceOct-29-2024

Fast and High-Quality Auto-Regressive Speech Synthesis via Speculative Decoding

Li, Bohan, Wang, Hankun, Zhang, Situo, Guo, Yiwei, Yu, Kai

The auto-regressive architecture, like GPTs, is widely used in modern Text-to-Speech (TTS) systems. However, it incurs substantial inference time, particularly due to the challenges in the next-token prediction posed by lengthy sequences of speech tokens. In this work, we introduce VADUSA, one of the first approaches to accelerate auto-regressive TTS through speculative decoding. Our results show that VADUSA not only significantly improves inference speed but also enhances performance by incorporating draft heads to predict future speech content auto-regressively. Furthermore, the inclusion of a tolerance mechanism during sampling accelerates inference without compromising quality. Our approach demonstrates strong generalization across large datasets and various types of speech tokens.

artificial intelligence, machine learning, natural language, (17 more...)

2410.21951

Country:

North America > United States (0.57)
Asia > China > Shanghai > Shanghai (0.05)
Asia > Japan > Honshū > Kantō > Tokyo Metropolis Prefecture > Tokyo (0.04)
Asia > China > Jiangsu Province (0.04)

Genre: Research Report > New Finding (0.86)

Technology:

Information Technology > Artificial Intelligence > Natural Language (1.00)
Information Technology > Artificial Intelligence > Machine Learning (1.00)
Information Technology > Artificial Intelligence > Speech > Speech Synthesis (0.87)

Nercessian, Shahan, Imort, Johannes, Devis, Ninon, Blang, Frederik

Generating Sample-Based Musical Instruments Using Neural Audio Codec Language Models

arXiv.org Artificial IntelligenceJul-22-2024

In this paper, we propose and investigate the use of neural audio codec language models for the automatic generation of sample-based musical instruments based on text or reference audio prompts. Our approach extends a generative audio framework to condition on pitch across an 88-key spectrum, velocity, and a combined text/audio embedding. We identify maintaining timbral consistency within the generated instruments as a major challenge. To tackle this issue, we introduce three distinct conditioning schemes. We analyze our methods through objective metrics and human listening tests, demonstrating that our approach can produce compelling musical instruments. Specifically, we introduce a new objective metric to evaluate the timbral consistency of the generated instruments and adapt the average Contrastive Language-Audio Pretraining (CLAP) score for the text-to-instrument case, noting that its naive application is unsuitable for assessing this task. Our findings reveal a complex interplay between timbral consistency, the quality of generated samples, and their correspondence to the input prompt.

clap, instrument, proceedings, (16 more...)

2407.15641

Country: North America > United States > California > San Francisco County > San Francisco (0.14)

Genre: Research Report (0.84)

Industry:

Media > Music (1.00)
Leisure & Entertainment (1.00)

Technology:

Information Technology > Artificial Intelligence > Natural Language (1.00)
Information Technology > Artificial Intelligence > Machine Learning (1.00)

arXiv.org Artificial IntelligenceJul-11-2024

Autoregressive Speech Synthesis without Vector Quantization

Meng, Lingwei, Zhou, Long, Liu, Shujie, Chen, Sanyuan, Han, Bing, Hu, Shujie, Liu, Yanqing, Li, Jinyu, Zhao, Sheng, Wu, Xixin, Meng, Helen, Wei, Furu

We present MELLE, a novel continuous-valued tokens based language modeling approach for text to speech synthesis (TTS). MELLE autoregressively generates continuous mel-spectrogram frames directly from text condition, bypassing the need for vector quantization, which are originally designed for audio compression and sacrifice fidelity compared to mel-spectrograms. Specifically, (i) instead of cross-entropy loss, we apply regression loss with a proposed spectrogram flux loss function to model the probability distribution of the continuous-valued tokens. (ii) we have incorporated variational inference into MELLE to facilitate sampling mechanisms, thereby enhancing the output diversity and model robustness. Experiments demonstrate that, compared to the two-stage codec language models VALL-E and its variants, the single-stage MELLE mitigates robustness issues by avoiding the inherent flaws of sampling discrete codes, achieves superior performance across multiple metrics, and, most importantly, offers a more streamlined paradigm. See https://aka.ms/melle for demos of our work.

arxiv preprint arxiv, language model, melle, (14 more...)

2407.08551

Country: Asia > China > Hong Kong (0.04)

Genre: Research Report (1.00)

Technology:

Information Technology > Artificial Intelligence > Speech > Speech Synthesis (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (0.96)
Information Technology > Artificial Intelligence > Speech > Speech Recognition (0.93)
(2 more...)

arXiv.org Artificial IntelligenceJun-17-2024

VALL-E 2: Neural Codec Language Models are Human Parity Zero-Shot Text to Speech Synthesizers

Chen, Sanyuan, Liu, Shujie, Zhou, Long, Liu, Yanqing, Tan, Xu, Li, Jinyu, Zhao, Sheng, Qian, Yao, Wei, Furu

This paper introduces VALL-E 2, the latest advancement in neural codec language models that marks a milestone in zero-shot text-to-speech synthesis (TTS), achieving human parity for the first time. Based on its predecessor, VALL-E, the new iteration introduces two significant enhancements: Repetition Aware Sampling refines the original nucleus sampling process by accounting for token repetition in the decoding history. It not only stabilizes the decoding but also circumvents the infinite loop issue. Grouped Code Modeling organizes codec codes into groups to effectively shorten the sequence length, which not only boosts inference speed but also addresses the challenges of long sequence modeling. Our experiments on the LibriSpeech and VCTK datasets show that VALL-E 2 surpasses previous systems in speech robustness, naturalness, and speaker similarity. It is the first of its kind to reach human parity on these benchmarks. Moreover, VALL-E 2 consistently synthesizes high-quality speech, even for sentences that are traditionally challenging due to their complexity or repetitive phrases. The advantages of this work could contribute to valuable endeavors, such as generating speech for individuals with aphasia or people with amyotrophic lateral sclerosis.

arxiv preprint arxiv, speech, vall-e 2, (12 more...)

2406.0537

Country: Asia > Middle East > Israel (0.04)

Genre: Research Report (0.82)

Industry: Information Technology (0.46)

Technology:

Information Technology > Artificial Intelligence > Speech > Speech Synthesis (1.00)
Information Technology > Artificial Intelligence > Speech > Speech Recognition (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)

arXiv.org Artificial IntelligenceApr-8-2024

SpeechAlign: Aligning Speech Generation to Human Preferences

Zhang, Dong, Li, Zhaowei, Li, Shimin, Zhang, Xin, Wang, Pengyu, Zhou, Yaqian, Qiu, Xipeng

Speech language models have significantly advanced in generating realistic speech, with neural codec language models standing out. However, the integration of human feedback to align speech outputs to human preferences is often neglected. This paper addresses this gap by first analyzing the distribution gap in codec language models, highlighting how it leads to discrepancies between the training and inference phases, which negatively affects performance. Then we explore leveraging learning from human feedback to bridge the distribution gap. We introduce SpeechAlign, an iterative self-improvement strategy that aligns speech language models to human preferences. SpeechAlign involves constructing a preference codec dataset contrasting golden codec tokens against synthetic tokens, followed by preference optimization to improve the codec language model. This cycle of improvement is carried out iteratively to steadily convert weak models to strong ones. Through both subjective and objective evaluations, we show that SpeechAlign can bridge the distribution gap and facilitating continuous self-improvement of the speech language model. Moreover, SpeechAlign exhibits robust generalization capabilities and works for smaller models. Code and models will be available at https://github.com/0nutation/SpeechGPT.

codec language model, human feedback, language model, (16 more...)

2404.056

Country:

North America > Canada > Ontario > Toronto (0.04)
Asia > Singapore (0.04)
Asia > Indonesia > Bali (0.04)

Genre: Research Report > New Finding (0.46)

Technology:

Information Technology > Artificial Intelligence > Speech > Speech Recognition (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (0.97)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.71)