tacotron
Deep Voice 2: Multi-Speaker Neural Text-to-Speech
Gibiansky, Andrew, Arik, Sercan, Diamos, Gregory, Miller, John, Peng, Kainan, Ping, Wei, Raiman, Jonathan, Zhou, Yanqi
We introduce a technique for augmenting neural text-to-speech (TTS) with low-dimensional trainable speaker embeddings to generate different voices from a single model. As a starting point, we show improvements over the two state-of-the-art approaches for single-speaker neural TTS: Deep Voice 1 and Tacotron. We introduce Deep Voice 2, which is based on a pipeline similar to Deep Voice 1 but constructed with higher-performance building blocks, and demonstrates a significant audio quality improvement over Deep Voice 1. We improve Tacotron by introducing a post-processing neural vocoder, and demonstrate a significant audio quality improvement. We then demonstrate our technique for multi-speaker speech synthesis for both Deep Voice 2 and Tacotron on two multi-speaker TTS datasets. We show that a single neural TTS system can learn hundreds of unique voices from less than half an hour of data per speaker, while achieving high audio quality synthesis and preserving the speaker identities almost perfectly.
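As a rough illustration of the core idea, the sketch below shows how a low-dimensional, trainable speaker embedding could condition a shared TTS decoder so that one model produces many voices. The layer sizes, the GRU decoder, and the conditioning sites are illustrative assumptions, not the actual Deep Voice 2 architecture.

```python
# Minimal sketch: conditioning a shared TTS decoder on a trainable
# speaker embedding (illustrative sizes; not the exact Deep Voice 2 layers).
import torch
import torch.nn as nn

class MultiSpeakerDecoder(nn.Module):
    def __init__(self, num_speakers, speaker_dim=16, text_dim=256, mel_dim=80):
        super().__init__()
        # Low-dimensional, trainable speaker embeddings (one vector per speaker).
        self.speaker_table = nn.Embedding(num_speakers, speaker_dim)
        # The speaker vector also initializes the recurrent state.
        self.state_proj = nn.Linear(speaker_dim, 256)
        self.rnn = nn.GRU(text_dim + speaker_dim, 256, batch_first=True)
        self.mel_out = nn.Linear(256, mel_dim)

    def forward(self, text_encoding, speaker_id):
        # text_encoding: (batch, time, text_dim); speaker_id: (batch,)
        spk = self.speaker_table(speaker_id)                        # (batch, speaker_dim)
        spk_seq = spk.unsqueeze(1).expand(-1, text_encoding.size(1), -1)
        # Concatenate the speaker vector to every encoder frame.
        rnn_in = torch.cat([text_encoding, spk_seq], dim=-1)
        h0 = torch.tanh(self.state_proj(spk)).unsqueeze(0)          # speaker-dependent initial state
        out, _ = self.rnn(rnn_in, h0)
        return self.mel_out(out)                                    # predicted mel frames

# Usage: one model, many voices -- only the integer speaker_id changes.
model = MultiSpeakerDecoder(num_speakers=108)
mels = model(torch.randn(2, 50, 256), torch.tensor([3, 77]))
print(mels.shape)  # torch.Size([2, 50, 80])
```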
Very Attentive Tacotron: Robust and Unbounded Length Generalization in Autoregressive Transformer-Based Text-to-Speech
Battenberg, Eric, Skerry-Ryan, RJ, Stanton, Daisy, Mariooryad, Soroosh, Shannon, Matt, Salazar, Julian, Kao, David
Autoregressive (AR) Transformer-based sequence models are known to have difficulty generalizing to sequences longer than those seen during training. When applied to text-to-speech (TTS), these models tend to drop or repeat words or produce erratic output, especially for longer utterances. In this paper, we introduce enhancements aimed at AR Transformer-based encoder-decoder TTS systems that address these robustness and length generalization issues. Our approach uses an alignment mechanism to provide cross-attention operations with relative location information. The associated alignment position is learned as a latent property of the model via backprop and requires no external alignment information during training. While the approach is tailored to the monotonic nature of TTS input-output alignment, it is still able to benefit from the flexible modeling power of interleaved multi-head self- and cross-attention operations. A system incorporating these improvements, which we call Very Attentive Tacotron, matches the naturalness and expressiveness of a baseline T5-based TTS system, while eliminating problems with repeated or dropped words and enabling generalization to any practical utterance length.
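The sketch below illustrates one way cross-attention could be given relative location information via a learned, monotonically advancing alignment position, in the spirit of the abstract above. The Gaussian-shaped distance bias, the softplus-parameterized position update, and all dimensions are assumptions for illustration, not the paper's exact formulation.

```python
# Hedged sketch of location-relative cross-attention: a scalar alignment
# position advances monotonically and biases the attention logits by
# relative distance to each encoder index.
import torch
import torch.nn as nn
import torch.nn.functional as F

class LocationRelativeAttention(nn.Module):
    def __init__(self, query_dim=256, key_dim=256, width=10.0):
        super().__init__()
        self.q_proj = nn.Linear(query_dim, key_dim)
        self.delta_proj = nn.Linear(query_dim, 1)   # predicts how far to advance the alignment
        self.width = width                          # how sharply distance penalizes attention

    def forward(self, query, keys, position):
        # query: (batch, query_dim); keys: (batch, T, key_dim); position: (batch,)
        scores = torch.einsum("bd,btd->bt", self.q_proj(query), keys)
        idx = torch.arange(keys.size(1), device=keys.device).float()   # encoder indices 0..T-1
        rel = idx.unsqueeze(0) - position.unsqueeze(1)                 # signed distance to alignment
        scores = scores - (rel ** 2) / self.width                      # distant positions are downweighted
        weights = F.softmax(scores, dim=-1)
        context = torch.bmm(weights.unsqueeze(1), keys).squeeze(1)
        # Monotonic update: the alignment position can only move forward.
        new_position = position + F.softplus(self.delta_proj(query)).squeeze(-1)
        return context, weights, new_position

attn = LocationRelativeAttention()
ctx, w, pos = attn(torch.randn(2, 256), torch.randn(2, 40, 256), torch.zeros(2))
```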
WellSaid attracts $10M A round for higher quality synthetic speech – TechCrunch
WellSaid Labs, whose tools create synthetic speech that could be mistaken for the real thing, has raised a $10M Series A to grow the business. The company's home-baked text-to-speech engine works faster than real time and produces natural-sounding clips of pretty much any length, from quick snippets to hours-long readings. WellSaid came out of the Allen Institute for AI incubator in 2019, and its goal was to make synthetic voices that didn't sound so robotic for common business purposes like training and marketing content. It achieved that first by basing its solution on Tacotron, a speech engine developed by Google and academic researchers. But soon it had built its own that was more efficient, resulted in more convincing voices, and could produce clips of arbitrary lengths.
AI based Presentation Creator With Customized Audio Content Delivery
Mansoor, Muvazima, Chandar, Srikanth, Srinath, Ramamoorthy
In this paper, we propose an architecture for a problem that has grown more pressing with the increased demand for virtual content delivery during the COVID-19 pandemic. Educational institutions, workplaces, and research centers are bridging the communication gap of socially distanced times with online content delivery, typically by creating presentations and then delivering them over virtual meeting platforms. We aim to reduce the time spent creating and delivering such presentations by using Machine Learning (ML) algorithms and Natural Language Processing (NLP) modules to automate the creation of a slide-based presentation from a document, and then using state-of-the-art voice cloning models to deliver the content in the desired author's voice. We consider a structured document, such as a research paper, as the content to be presented. The paper is first summarized using BERT summarization techniques and condensed into bullet points that populate the slides. A Tacotron-inspired architecture with an encoder, a synthesizer, and a Generative Adversarial Network (GAN) based vocoder is then used to convey the contents of the slides in the author's voice (or any customized voice). With learning largely shifted online and professionals working from home, presentations have become a primary vehicle for imparting information; we aim to cut the considerable time their creation takes by automating the process and delivering the result in a customized voice, using a content delivery mechanism that can clone any voice from a short audio clip.
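A minimal sketch of the pipeline described above: summarize a document into per-slide bullets, then synthesize narration in a cloned voice. The helper names (summarize_sections, synthesize, build_presentation) are hypothetical placeholders standing in for the BERT summarizer and the encoder/synthesizer/GAN-vocoder stack; they are not a published API.

```python
# Hedged pipeline sketch with placeholder components.
from dataclasses import dataclass

@dataclass
class Slide:
    title: str
    bullets: list

def summarize_sections(sections, max_bullets=4):
    """Placeholder summarizer: keep the first sentences of each section."""
    slides = []
    for title, text in sections.items():
        sentences = [s.strip() for s in text.split(".") if s.strip()]
        slides.append(Slide(title=title, bullets=sentences[:max_bullets]))
    return slides

def synthesize(text, speaker_vector):
    """Placeholder for the TTS stack (encoder -> synthesizer -> GAN vocoder)."""
    return b""  # a real system would return waveform audio here

def build_presentation(sections, reference_audio):
    speaker_vector = None  # a real system would derive this from reference_audio
    slides = summarize_sections(sections)
    narration = [synthesize(" ".join(s.bullets), speaker_vector) for s in slides]
    return slides, narration

slides, narration = build_presentation(
    {"Introduction": "We study TTS. We clone voices.", "Method": "We use a GAN vocoder."},
    reference_audio=b"",
)
print([s.title for s in slides])
```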
Investigation of learning abilities on linguistic features in sequence-to-sequence text-to-speech synthesis
Yasuda, Yusuke, Wang, Xin, Yamagishi, Junichi
Neural sequence-to-sequence text-to-speech synthesis (TTS) can produce high-quality speech directly from text or simple linguistic features such as phonemes. Unlike traditional pipeline TTS, neural sequence-to-sequence TTS does not require manually annotated and complicated linguistic features such as part-of-speech tags and syntactic structures for system training. However, it must be carefully designed and well optimized so that it can implicitly extract useful linguistic features from the input features. In this paper, we investigate under what conditions neural sequence-to-sequence TTS can work well in Japanese and English, along with comparisons with deep neural network (DNN) based pipeline TTS systems. Unlike past comparative studies, the pipeline systems also use autoregressive probabilistic modeling and a neural vocoder. We investigated systems from three aspects: a) model architecture, b) model parameter size, and c) language. For the model architecture aspect, we adopt modified Tacotron systems that we previously proposed and their variants using an encoder from Tacotron or Tacotron2. For the model parameter size aspect, we investigate two model parameter sizes. For the language aspect, we conduct listening tests in both Japanese and English to see if our findings can be generalized across languages. Our experiments suggest that a) a neural sequence-to-sequence TTS system should have a sufficient number of model parameters to produce high-quality speech, b) it should also use a powerful encoder when it takes characters as inputs, and c) the encoder still has room for improvement and needs an improved architecture to learn supra-segmental features more appropriately.
Joint training framework for text-to-speech and voice conversion using multi-source Tacotron and WaveNet
Zhang, Mingyang, Wang, Xin, Fang, Fuming, Li, Haizhou, Yamagishi, Junichi
We investigated the training of a shared model for both text-to-speech (TTS) and voice conversion (VC) tasks. We propose using an extended Tacotron architecture, that is, a multi-source sequence-to-sequence model with a dual attention mechanism, as the shared model for both the TTS and VC tasks. The model accomplishes the two different tasks according to the type of input: an end-to-end speech synthesis task is conducted when the model is given text as the input, while a sequence-to-sequence voice conversion task is conducted when it is given the speech of a source speaker as the input. Waveform signals are generated using WaveNet, conditioned on a predicted mel-spectrogram. We propose jointly training a shared model as a decoder for a target speaker that supports multiple sources. Listening experiments show that our proposed multi-source encoder-decoder model can efficiently achieve both the TTS and VC tasks.
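The sketch below shows one plausible shape for a multi-source decoder step with dual attention: the same decoder serves TTS when given a text-encoder memory and VC when given a source-speech memory. The averaging of contexts, the GRU cell, and all dimensions are illustrative assumptions rather than the paper's exact mechanism.

```python
# Hedged sketch of a dual-attention decoder step shared by TTS and VC.
import torch
import torch.nn as nn
import torch.nn.functional as F

def attend(query, memory):
    # query: (batch, d); memory: (batch, T, d) -> context: (batch, d)
    scores = torch.einsum("bd,btd->bt", query, memory)
    weights = F.softmax(scores, dim=-1)
    return torch.bmm(weights.unsqueeze(1), memory).squeeze(1)

class DualSourceDecoderStep(nn.Module):
    def __init__(self, d=256, mel_dim=80):
        super().__init__()
        self.cell = nn.GRUCell(mel_dim + d, d)
        self.out = nn.Linear(d, mel_dim)

    def forward(self, prev_mel, state, text_memory=None, speech_memory=None):
        # Dual attention: gather a context from each source that is provided.
        contexts = []
        if text_memory is not None:
            contexts.append(attend(state, text_memory))    # TTS path (text input)
        if speech_memory is not None:
            contexts.append(attend(state, speech_memory))  # VC path (source speech input)
        context = torch.stack(contexts, dim=0).mean(dim=0)
        state = self.cell(torch.cat([prev_mel, context], dim=-1), state)
        return self.out(state), state  # predicted mel frame would feed a WaveNet-style vocoder

step = DualSourceDecoderStep()
mel, state = step(torch.zeros(2, 80), torch.zeros(2, 256),
                  text_memory=torch.randn(2, 30, 256))  # TTS mode: text memory only
```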