AITopics | transformer tts

f63f65b503e22cb970527f23c9ad7db1-AuthorFeedback.pdf

Neural Information Processing Systems, Aug-22-2025, 02:26:33 GMT

machine translation, new version, transformer tts, (13 more...)

Technology: Information Technology > Artificial Intelligence > Natural Language > Machine Translation (0.35)

A Vector Quantized Approach for Text to Speech Synthesis on Real-World Spontaneous Speech

Chen, Li-Wei, Watanabe, Shinji, Rudnicky, Alexander

arXiv.org Artificial Intelligence, Feb-8-2023

Recent Text-to-Speech (TTS) systems trained on read or acted speech corpora have achieved near human-level naturalness. The diversity of human speech, however, often goes beyond the coverage of these corpora. We believe the ability to handle such diversity is crucial for AI systems to achieve human-level communication. Our work explores the use of more abundant real-world data for building speech synthesizers. We train TTS systems using real-world speech from YouTube and podcasts. We observe a mismatch between training and inference alignments in mel-spectrogram-based autoregressive models, which leads to unintelligible synthesis, and demonstrate that learned discrete codes within multiple code groups effectively resolve this issue. We introduce our MQTTS system, whose architecture is designed for multiple code generation and monotonic alignment, along with the use of a clean silence prompt to improve synthesis quality. We conduct ablation analyses to assess the efficacy of our methods. We show that MQTTS outperforms existing TTS systems in several objective and subjective measures.
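The idea of representing a frame with discrete codes in multiple code groups can be illustrated with a toy example. The sketch below is not the MQTTS quantizer (which is learned during training); it assumes small hand-written codebooks and hypothetical helper names (`nearest_code`, `quantize_groups`, `dequantize_groups`) purely to show how one feature vector is split into groups and each group is mapped to its nearest codebook entry.

```python
from typing import List, Sequence


def nearest_code(chunk: Sequence[float], codebook: Sequence[Sequence[float]]) -> int:
    """Return the index of the codebook entry closest to chunk (squared L2 distance)."""
    best_idx, best_dist = 0, float("inf")
    for idx, code in enumerate(codebook):
        dist = sum((a - b) ** 2 for a, b in zip(chunk, code))
        if dist < best_dist:
            best_idx, best_dist = idx, dist
    return best_idx


def quantize_groups(vector: Sequence[float],
                    codebooks: Sequence[Sequence[Sequence[float]]]) -> List[int]:
    """Split vector into len(codebooks) equal groups and pick one code index per group."""
    n_groups = len(codebooks)
    if n_groups == 0 or len(vector) % n_groups != 0:
        raise ValueError("vector length must be divisible by the number of groups")
    size = len(vector) // n_groups
    return [
        nearest_code(vector[g * size:(g + 1) * size], codebooks[g])
        for g in range(n_groups)
    ]


def dequantize_groups(codes: Sequence[int],
                      codebooks: Sequence[Sequence[Sequence[float]]]) -> List[float]:
    """Rebuild an approximate vector by concatenating the selected code of each group."""
    out: List[float] = []
    for idx, codebook in zip(codes, codebooks):
        out.extend(codebook[idx])
    return out


# Toy codebooks (assumed values): two groups, two 2-dimensional codes each.
toy_codebooks = [
    [[0.0, 0.0], [1.0, 1.0]],
    [[0.0, 1.0], [1.0, 0.0]],
]
toy_codes = quantize_groups([0.9, 1.2, 0.8, 0.1], toy_codebooks)
toy_reconstruction = dequantize_groups(toy_codes, toy_codebooks)
```

With G groups of K codes each, a frame can take K^G distinct values while each codebook stays small, which is one common motivation for grouping codes.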

alignment, artificial intelligence, speech recognition, (17 more...)

2302.04215

Country:

North America > United States > Pennsylvania > Allegheny County > Pittsburgh (0.04)
Oceania > Australia > Victoria > Melbourne (0.04)
North America > United States > California > San Diego County > San Diego (0.04)
(5 more...)

Genre: Research Report (1.00)

Technology:

Information Technology > Artificial Intelligence > Speech > Speech Synthesis (1.00)
Information Technology > Artificial Intelligence > Speech > Speech Recognition (1.00)

NatiQ: An End-to-end Text-to-Speech System for Arabic

Abdelali, Ahmed, Durrani, Nadir, Demiroglu, Cenk, Dalvi, Fahim, Mubarak, Hamdy, Darwish, Kareem

arXiv.org Artificial Intelligence, Nov-16-2022

NatiQ is an end-to-end text-to-speech system for Arabic. Our speech synthesizer uses an encoder-decoder architecture with attention. We used both Tacotron-based models (Tacotron 1 and Tacotron 2) and the faster Transformer model for generating mel-spectrograms from characters. We concatenated Tacotron 1 with the WaveRNN vocoder, Tacotron 2 with the WaveGlow vocoder, and the ESPnet Transformer with the Parallel WaveGAN vocoder to synthesize waveforms from the spectrograms. To train our models, we used in-house speech data for two voices: 1) neutral male "Hamza", narrating general content and news, and 2) expressive female "Amina", narrating children's story books. Our best systems achieve an average Mean Opinion Score (MOS) of 4.21 and 4.40 for Amina and Hamza respectively. The objective evaluation of the systems using word and character error rate (WER and CER), as well as the response time measured by real-time factor, favored the end-to-end architecture ESPnet. The NatiQ demo is available online at https://tts.qcri.org
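The objective metrics named above can be sketched in a few lines. The abstract does not describe the evaluation pipeline, so the following assumes the usual definitions: WER and CER as Levenshtein edit distance over words or characters divided by the reference length, and real-time factor as synthesis time divided by the duration of the generated audio. All function names are illustrative.

```python
from typing import Sequence


def edit_distance(ref: Sequence, hyp: Sequence) -> int:
    """Levenshtein distance: minimum substitutions, insertions, and deletions."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        curr = [i]
        for j, h in enumerate(hyp, 1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level edit distance over the reference word count."""
    ref_words = reference.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)


def cer(reference: str, hypothesis: str) -> float:
    """Character error rate: character-level edit distance over the reference length."""
    if not reference:
        raise ValueError("reference must not be empty")
    return edit_distance(reference, hypothesis) / len(reference)


def real_time_factor(synthesis_seconds: float, audio_seconds: float) -> float:
    """Values below 1.0 mean synthesis runs faster than playback."""
    if audio_seconds <= 0:
        raise ValueError("audio duration must be positive")
    return synthesis_seconds / audio_seconds
```

For TTS, the hypothesis text is typically obtained by running a speech recognizer on the synthesized audio and comparing its transcript with the input text; that step is outside this sketch.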

architecture, artificial intelligence, speech synthesis, (14 more...)

2206.07373

Country:

Europe > Italy > Calabria > Catanzaro Province > Catanzaro (0.05)
North America > United States > California > Santa Clara County > Los Gatos (0.04)
North America > United States > California > San Diego County > San Diego (0.04)
(5 more...)

Genre: Research Report (0.40)

Technology: Information Technology > Artificial Intelligence > Speech > Speech Synthesis (1.00)

Sprachsynthese -- State-of-the-Art in englischer und deutscher Sprache (Speech Synthesis: State of the Art in English and German)

Peinl, René

arXiv.org Artificial Intelligence, Jun-11-2021

Reading text aloud is an important feature for modern computer applications. It not only facilitates access to information for visually impaired people, but is also a pleasant convenience for non-impaired users. In this article, the state of the art of speech synthesis is presented separately for mel-spectrogram generation and vocoders. It concludes with an overview of available data sets for English and German and a discussion of how well the good speech synthesis results for English transfer to German.

arxiv prepr, synthesis, tacotron 2, (13 more...)

2106.0623

Country: Europe > Italy > Calabria > Catanzaro Province > Catanzaro (0.04)

Genre: Research Report (0.40)

Technology:

Information Technology > Artificial Intelligence > Natural Language (0.93)
Information Technology > Artificial Intelligence > Speech > Speech Synthesis (0.58)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (0.48)

FastSpeech: New text-to-speech model improves on speed, accuracy, and controllability - Microsoft Research

#artificialintelligence, Dec-12-2019, 20:05:32 GMT

Text to speech (TTS) has attracted a lot of attention recently due to advancements in deep learning. Neural network-based TTS models (such as Tacotron 2, DeepVoice 3 and Transformer TTS) have outperformed conventional concatenative and statistical parametric approaches in terms of speech quality. Neural network-based TTS models usually first generate a mel-scale spectrogram (or mel-spectrogram) autoregressively from text input and then synthesize speech from the mel-spectrogram using a vocoder. (A spectrogram is a visual representation of frequencies measured over time.) To address the limitations of such autoregressive models, researchers from Microsoft and Zhejiang University propose FastSpeech, a novel feed-forward network that generates mel-spectrograms with fast generation speed, robustness, controllability, and high quality.
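One way a feed-forward model can produce a mel-spectrogram without frame-by-frame autoregression is to expand each input token's hidden state by a predicted duration, which is the role of FastSpeech's length regulator. The sketch below is a simplified illustration, not the authors' implementation: the function name, the use of plain Python lists for hidden states, and the rounding rule are assumptions.

```python
from typing import List, Sequence, TypeVar

T = TypeVar("T")


def length_regulate(hidden_states: Sequence[T],
                    durations: Sequence[float],
                    alpha: float = 1.0) -> List[T]:
    """Repeat each hidden state according to its duration, scaled by alpha.

    durations[i] is the number of mel frames predicted for token i.
    alpha > 1.0 stretches the output (slower speech); alpha < 1.0 shortens it.
    """
    if len(hidden_states) != len(durations):
        raise ValueError("need exactly one duration per hidden state")
    expanded: List[T] = []
    for state, duration in zip(hidden_states, durations):
        # Assumed rounding rule: nearest integer, never negative.
        repeats = max(0, int(round(duration * alpha)))
        expanded.extend([state] * repeats)
    return expanded


# Three tokens with assumed durations of 2, 1, and 3 frames.
frames = length_regulate(["h1", "h2", "h3"], [2, 1, 3])
slower = length_regulate(["h1", "h2", "h3"], [2, 1, 3], alpha=2.0)
```

Because the output length is fixed before decoding, all frames can be generated in parallel, and scaling the durations gives direct control over speaking rate.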

fastspeech, length regulator, sequence, (14 more...)

Genre: Summary/Review (0.40)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (1.00)
