AITopics | fastspeech

Listening to the Brain: Multi-Band sEEG Auditory Reconstruction via Dynamic Spatio-Temporal Hypergraphs

Neural Information Processing Systems | Jun-22-2026, 10:16:59 GMT

Speech is a fundamental form of human communication, and speech perception constitutes the initial stage of language comprehension. Although brain-to-speech interface technologies have made significant progress in recent years, most existing studies focus on neural decoding during speech production. Such approaches heavily rely on articulatory motor regions, rendering them unsuitable for individuals with speech motor impairments, such as those with aphasia or locked-in syndrome. To address this limitation, we construct and release NeuroListen, the first publicly available stereo-electroencephalography (sEEG) dataset specifically designed for auditory reconstruction. It contains over 10 hours of neural-speech paired recordings from 5 clinical participants, covering a wide range of semantic categories. Building on this dataset, we propose HyperSpeech, a multi-band neural decoding framework that employs dynamic spatio-temporal hypergraph neural networks to capture high-order dependencies across frequency, spatial, and temporal dimensions. Experimental results demonstrate that HyperSpeech significantly outperforms existing methods across multiple objective speech quality metrics, and achieves superior performance in human subjective evaluations, validating its effectiveness and advancement. This study provides a dedicated dataset and modeling framework for auditory speech decoding, offering foundations for neural language processing and assistive communication systems.

artificial intelligence, machine learning, natural language, (21 more...)

Neural Information Processing Systems

Country: Asia > China (0.68)

Genre:

Research Report > Experimental Study (1.00)
Research Report > New Finding (0.88)

Industry:

Health & Medicine > Health Care Technology (0.88)
Health & Medicine > Therapeutic Area > Neurology > Epilepsy (0.46)

Technology:

Information Technology > Artificial Intelligence > Speech (1.00)
Information Technology > Artificial Intelligence > Natural Language (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning (0.88)

FastSpeech: Fast, Robust and Controllable Text to Speech

Yi Ren, Yangjun Ruan, Xu Tan, Tao Qin, Sheng Zhao, Zhou Zhao, Tie-Yan Liu

Neural Information Processing Systems | Feb-15-2026, 03:49:24 GMT

Prominent methods (e.g., Tacotron 2) usually first generate mel-spectrogram from text, and then synthesize speech from the mel-spectrogram using vocoder such as WaveNet. Compared with traditional concatenative and statistical parametric approaches, neural network based end-to-end models suffer from slow inference speed, and the synthesized speech is usually not robust (i.e., some words are skipped or repeated) and lack of controllability (voice speed or prosody control).

artificial intelligence, fastspeech, machine learning, (15 more...)

Neural Information Processing Systems

Country:

Asia > China (0.05)
North America > Canada > British Columbia > Metro Vancouver Regional District > Vancouver (0.04)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (0.49)

FastSpeech: Fast, Robust and Controllable Text to Speech

Neural Information Processing Systems | Dec-26-2025, 03:47:21 GMT

Neural network based end-to-end text to speech (TTS) has significantly improved the quality of synthesized speech. Prominent methods (e.g., Tacotron 2) usually first generate mel-spectrogram from text, and then synthesize speech from the mel-spectrogram using vocoder such as WaveNet. Compared with traditional concatenative and statistical parametric approaches, neural network based end-to-end models suffer from slow inference speed, and the synthesized speech is usually not robust (i.e., some words are skipped or repeated) and lack of controllability (voice speed or prosody control). In this work, we propose a novel feed-forward network based on Transformer to generate mel-spectrogram in parallel for TTS. Specifically, we extract attention alignments from an encoder-decoder based teacher model for phoneme duration prediction, which is used by a length regulator to expand the source phoneme sequence to match the length of the target mel-spectrogram sequence for parallel mel-spectrogram generation. Experiments on the LJSpeech dataset show that our parallel model matches autoregressive models in terms of speech quality, nearly eliminates the problem of word skipping and repeating in particularly hard cases, and can adjust voice speed smoothly. Most importantly, compared with autoregressive Transformer TTS, our model speeds up mel-spectrogram generation by 270x and the end-to-end speech synthesis by 38x. Therefore, we call our model FastSpeech.
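The duration extraction and length regulator described in this abstract can be illustrated with a small sketch. It is not the authors' implementation: the attention matrix, the hidden states, and the function names `durations_from_attention` and `length_regulate` are hypothetical, and the `alpha` speed factor is an assumption about how voice speed could be adjusted by scaling durations. The sketch assumes that a phoneme's duration is the number of mel frames whose strongest teacher attention falls on that phoneme.

```python
def durations_from_attention(attn):
    """Count, for each phoneme, the mel frames that attend to it most strongly.

    attn is a list of rows, one per mel frame; each row holds the attention
    weights of that frame over the phonemes (hypothetical teacher output).
    """
    n_phonemes = len(attn[0])
    durations = [0] * n_phonemes
    for frame_weights in attn:
        strongest = max(range(n_phonemes), key=lambda i: frame_weights[i])
        durations[strongest] += 1
    return durations


def length_regulate(hidden, durations, alpha=1.0):
    """Repeat each phoneme state round(duration * alpha) times.

    alpha > 1 lengthens the output (slower speech), alpha < 1 shortens it.
    """
    expanded = []
    for state, duration in zip(hidden, durations):
        expanded.extend([state] * round(duration * alpha))
    return expanded


# Toy example: 6 mel frames attending over 3 phonemes.
attn = [
    [0.9, 0.1, 0.0],
    [0.8, 0.2, 0.0],
    [0.1, 0.8, 0.1],
    [0.0, 0.9, 0.1],
    [0.1, 0.7, 0.2],
    [0.0, 0.2, 0.8],
]
hidden = [[0.0, 1.0], [2.0, 3.0], [4.0, 5.0]]  # one state vector per phoneme

durations = durations_from_attention(attn)
normal = length_regulate(hidden, durations)            # matches the 6 mel frames
slow = length_regulate(hidden, durations, alpha=2.0)   # twice as many frames
```

Because the expanded sequence already has the target length, every mel frame can be produced in one parallel pass instead of frame by frame. Note that with small `alpha` the rounding can shrink a short phoneme to zero frames, so a real system would need to guard against that.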

fastspeech, name change, robust and controllable text, (7 more...)

Neural Information Processing Systems

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (0.82)
Information Technology > Artificial Intelligence > Speech > Speech Synthesis (0.62)

Thanks all the reviewers for the detailed and thoughtful comments

Neural Information Processing Systems | Oct-3-2025, 00:27:52 GMT

Thanks all the reviewers for the detailed and thoughtful comments. HMM-based works [1, 2, 3], all of which proposed methods to estimate alignments from unsegmented data. We've not thoroughly explored to improve the duration predictor and simply follow the same We design the grouped 1x1 convolutions to be able to mix channels. For example, to generate a speech of 5.8 Therefore, adopting parallel TTS models significantly improves the sampling speed of end-to-end systems. In Section 5.3, we showed that varying temperature can change We will add a reference about Viterbi training.

artificial intelligence, machine learning, tacotron 2, (17 more...)

Neural Information Processing Systems

Technology: Information Technology > Artificial Intelligence > Machine Learning > Learning Graphical Models > Undirected Networks > Markov Models (0.36)

FastSpeech: Fast, Robust and Controllable Text to Speech

Yi Ren, Yangjun Ruan, Xu Tan, Tao Qin, Sheng Zhao, Zhou Zhao, Tie-Yan Liu

Neural Information Processing Systems | Aug-20-2025, 09:32:38 GMT

Prominent methods (e.g., Tacotron 2) usually first generate mel-spectrogram from text, and then synthesize speech from the mel-spectrogram using vocoder such as WaveNet. Compared with traditional concatenative and statistical parametric approaches, neural network based end-to-end models suffer from slow inference speed, and the synthesized speech is usually not robust (i.e., some words are skipped or repeated) and lack of controllability (voice speed or prosody control).

fastspeech, mel-spectrogram sequence, sequence, (10 more...)

Neural Information Processing Systems

Country:

Asia > China (0.04)
North America > Canada (0.04)
Europe > Italy > Calabria > Catanzaro Province > Catanzaro (0.04)

Genre: Research Report (0.47)

Technology:

Information Technology > Artificial Intelligence > Speech (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.46)

Reviews: FastSpeech: Fast, Robust and Controllable Text to Speech

Neural Information Processing Systems | Jun-1-2025, 23:53:13 GMT

Originality: Although phoneme duration prediction is widely adopted in conventional TTS systems, jointly training it in a neural TTS model is new. This paper is one of the first works on non-autoregressive text-to-spectrogram modeling. Quality: This paper seems sound overall, except for a few issues in the comments below. Some of these issues must be addressed before acceptance. Clarity: A well written paper. Significance: The advantages over its autoregressive counterparts are significant, especially for industrial use.

fastspeech, robust and controllable text, tacotron 2, (4 more...)

Neural Information Processing Systems

Technology:

Information Technology > Artificial Intelligence > Vision > Optical Character Recognition (0.40)
Information Technology > Artificial Intelligence > Speech > Speech Synthesis (0.40)
Information Technology > Artificial Intelligence > Assistive Technologies (0.40)

Reviews: FastSpeech: Fast, Robust and Controllable Text to Speech

Neural Information Processing Systems | Jun-1-2025, 23:53:02 GMT

The paper proposes a novel non-autoregressive parallelisation approach for TTS that uses the mel-spectrogram as an intermediate representation. The reviewers concur that the paper introduces two novel explicit components into TTS systems (length and duration modules) and that the results on inference speedup and high-quality audio generation are relevant.

fastspeech, review, robust and controllable text

Neural Information Processing Systems

Technology:

Information Technology > Artificial Intelligence > Vision > Optical Character Recognition (0.40)
Information Technology > Artificial Intelligence > Speech > Speech Synthesis (0.40)
Information Technology > Artificial Intelligence > Assistive Technologies (0.40)

FastSpeech: Fast, Robust and Controllable Text to Speech

Neural Information Processing Systems | Oct-11-2024, 06:22:38 GMT

Neural network based end-to-end text to speech (TTS) has significantly improved the quality of synthesized speech. Prominent methods (e.g., Tacotron 2) usually first generate mel-spectrogram from text, and then synthesize speech from the mel-spectrogram using vocoder such as WaveNet. Compared with traditional concatenative and statistical parametric approaches, neural network based end-to-end models suffer from slow inference speed, and the synthesized speech is usually not robust (i.e., some words are skipped or repeated) and lack of controllability (voice speed or prosody control). In this work, we propose a novel feed-forward network based on Transformer to generate mel-spectrogram in parallel for TTS. Specifically, we extract attention alignments from an encoder-decoder based teacher model for phoneme duration prediction, which is used by a length regulator to expand the source phoneme sequence to match the length of the target mel-spectrogram sequence for parallel mel-spectrogram generation.

Grad-TTS: A Diffusion Probabilistic Model for Text-to-Speech

Popov, Vadim, Vovk, Ivan, Gogoryan, Vladimir, Sadekova, Tasnima, Kudinov, Mikhail

arXiv.org Machine Learning | May-13-2021

Recently, denoising diffusion probabilistic models and generative score matching have shown high potential in modelling complex data distributions while stochastic calculus has provided a unified point of view on these techniques allowing for flexible inference schemes. In this paper we introduce Grad-TTS, a novel text-to-speech model with score-based decoder producing mel-spectrograms by gradually transforming noise predicted by encoder and aligned with text input by means of Monotonic Alignment Search. The framework of stochastic differential equations helps us to generalize conventional diffusion probabilistic models to the case of reconstructing data from noise with different parameters and allows to make this reconstruction flexible by explicitly controlling trade-off between sound quality and inference speed. Subjective human evaluation shows that Grad-TTS is competitive with state-of-the-art text-to-speech approaches in terms of Mean Opinion Score. We will make the code publicly available shortly.
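The Monotonic Alignment Search mentioned in this abstract can be sketched as a small dynamic program. This is a generic illustration rather than the Grad-TTS code: the function name `monotonic_alignment_search` and the toy `log_lik` matrix are hypothetical, and the sketch assumes the usual formulation in which each mel frame is assigned to exactly one text token, the assignment never moves backwards or skips a token, and the total log-likelihood is maximized.

```python
import math


def monotonic_alignment_search(log_lik):
    """Return, for each frame, the index of the text token it aligns to.

    log_lik[i][j] is the (hypothetical) log-likelihood of frame j under
    text token i. The alignment starts at token 0, ends at the last token,
    and moves forward by 0 or 1 token per frame.
    """
    n_tokens, n_frames = len(log_lik), len(log_lik[0])
    if n_frames < n_tokens:
        raise ValueError("need at least one frame per token")

    # q[i][j]: best total score of a monotonic path ending at (token i, frame j).
    q = [[-math.inf] * n_frames for _ in range(n_tokens)]
    q[0][0] = log_lik[0][0]
    for j in range(1, n_frames):
        for i in range(min(j + 1, n_tokens)):
            stay = q[i][j - 1]
            advance = q[i - 1][j - 1] if i > 0 else -math.inf
            q[i][j] = log_lik[i][j] + max(stay, advance)

    # Backtrack from the last token at the last frame.
    alignment = [0] * n_frames
    i = n_tokens - 1
    for j in range(n_frames - 1, -1, -1):
        alignment[j] = i
        if j > 0 and i > 0 and q[i - 1][j - 1] >= q[i][j - 1]:
            i -= 1
    return alignment


# Toy example: 3 tokens, 6 frames, with one clearly preferred alignment.
target = [0, 0, 1, 1, 1, 2]
log_lik = [[0.0 if target[j] == i else -10.0 for j in range(6)] for i in range(3)]
alignment = monotonic_alignment_search(log_lik)
```

Counting how many frames map to each token turns the alignment into per-token durations, which is how such a search is commonly used to supervise a duration predictor.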

diffusion probabilistic model, grad-tts, reverse diffusion, (13 more...)

arXiv.org Machine Learning

2105.06337

Country:

Europe > Italy > Calabria > Catanzaro Province > Catanzaro (0.04)
Europe > Russia > Central Federal District > Moscow Oblast > Moscow (0.04)
Europe > France > Hauts-de-France > Nord > Lille (0.04)
Asia > Russia (0.04)

Genre: Research Report (1.00)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (1.00)
Information Technology > Artificial Intelligence > Speech > Speech Synthesis (0.91)
Information Technology > Artificial Intelligence > Vision > Optical Character Recognition (0.90)
(2 more...)

FastSpeech: Fast, Robust and Controllable Text to Speech

Ren, Yi, Ruan, Yangjun, Tan, Xu, Qin, Tao, Zhao, Sheng, Zhao, Zhou, Liu, Tie-Yan

Neural Information Processing Systems | Mar-18-2020, 21:46:47 GMT

Neural network based end-to-end text to speech (TTS) has significantly improved the quality of synthesized speech. Prominent methods (e.g., Tacotron 2) usually first generate mel-spectrogram from text, and then synthesize speech from the mel-spectrogram using vocoder such as WaveNet. Compared with traditional concatenative and statistical parametric approaches, neural network based end-to-end models suffer from slow inference speed, and the synthesized speech is usually not robust (i.e., some words are skipped or repeated) and lack of controllability (voice speed or prosody control). In this work, we propose a novel feed-forward network based on Transformer to generate mel-spectrogram in parallel for TTS. Specifically, we extract attention alignments from an encoder-decoder based teacher model for phoneme duration prediction, which is used by a length regulator to expand the source phoneme sequence to match the length of the target mel-spectrogram sequence for parallel mel-spectrogram generation. Experiments on the LJSpeech dataset show that our parallel model matches autoregressive models in terms of speech quality, nearly eliminates the problem of word skipping and repeating in particularly hard cases, and can adjust voice speed smoothly.

fastspeech, mel-spectrogram generation, robust and controllable text, (4 more...)

Neural Information Processing Systems

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (0.86)
Information Technology > Artificial Intelligence > Speech > Speech Synthesis (0.78)
Information Technology > Artificial Intelligence > Vision > Optical Character Recognition (0.64)
