Microsoft has reached a milestone in text-to-speech synthesis with a production system that uses deep neural networks to make computer voices nearly indistinguishable from recordings of people. With human-like prosody and clear articulation of words, Neural TTS significantly reduces listening fatigue when you interact with AI systems. Our team demonstrated our neural-network-powered text-to-speech capability at the Microsoft Ignite conference in Orlando, Florida, this week. The capability is currently available in preview through Azure Cognitive Services Speech Services. Neural text-to-speech can make interactions with chatbots and virtual assistants more natural and engaging, convert digital texts such as e-books into audiobooks, and enhance in-car navigation systems.
As an iOS user, how many times a day do you talk to Siri? If you are a keen observer, you will have noticed that Siri's voice sounds much more human in iOS 11 than it did before. This is because Apple is investing more deeply in artificial intelligence, machine learning, and deep learning to offer its users the best personal assistant experience. From its introduction with the iPhone 4S through to iOS 11, Siri has evolved to sound more human and to relate better to its users. To reply to users' voice commands, Siri uses speech synthesis combined with deep learning.
While the pandemic slowed down the development of businesses and entire industries, it did not affect the ongoing development of AI-generated speech. According to analysts at Meticulous Research, the global voice technology market is growing at 17.2% annually. By 2025 its volume is expected to reach $26.8 billion. What makes voice synthesis such a rapidly developing niche, and what impact is that development having on speech-based applications today? Implementing speech-based applications helps businesses significantly improve customer experiences.
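To make the growth figures concrete, the following minimal sketch back-calculates the market size that the quoted numbers imply for the years before 2025. It assumes the 17.2% rate compounds at a constant annual pace up to the $26.8 billion figure; the source does not state the base year or the compounding convention, and the function name `implied_market_size` is a hypothetical helper introduced only for this illustration.

```python
def implied_market_size(target_value: float, annual_growth: float,
                        years_before_target: int) -> float:
    """Back out the market size a given number of years before the target year.

    Assumes constant compound annual growth (an assumption, not a figure
    taken from the analysts' report).
    """
    return target_value / (1.0 + annual_growth) ** years_before_target


if __name__ == "__main__":
    TARGET_YEAR = 2025
    TARGET_VALUE = 26.8   # billions of USD, as quoted in the text
    GROWTH = 0.172        # 17.2% per year, as quoted in the text

    # Print the implied size for the target year and the four years before it.
    for years_back in range(0, 5):
        year = TARGET_YEAR - years_back
        size = implied_market_size(TARGET_VALUE, GROWTH, years_back)
        print(f"{year}: ${size:.1f} billion")
```

Under this assumption, each earlier year is the following year's value divided by 1.172, so the implied market roughly halves over the four to five years before 2025.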
Speech synthesis (text-to-speech, TTS) is the generation of a speech signal from written text. In a way, it is the opposite of speech recognition. Speech synthesis is used in medicine, dialogue systems, voice assistants, and many other business applications. With a single speaker, the task looks fairly clear at first glance. With several speakers, the situation becomes more complicated and further tasks arise, such as voice cloning and voice conversion, which are discussed later in the text.
In this paper, we propose an architecture for a problem that has become prominent as the COVID-19 pandemic increased the demand for virtual content delivery. Educational institutions, workplaces, and research centers are bridging the communication gap of these socially distanced times with online content delivery: with most learning moved online and professionals working from home, teachers and professionals now typically create slide presentations and deliver them over virtual meeting platforms. We aim to reduce the considerable time spent creating and delivering such presentations. Our approach uses Machine Learning (ML) algorithms and Natural Language Processing (NLP) modules to automate the creation of a slide-based presentation from a document, and then uses state-of-the-art voice cloning models to deliver the content in the desired author's voice. We take a structured document, such as a research paper, as the content to be presented. The paper is first summarized using BERT summarization techniques and condensed into bullet points that go into the slides. A Tacotron-inspired architecture with an Encoder, a Synthesizer, and a Generative Adversarial Network (GAN) based vocoder then conveys the contents of the slides in the author's voice (or any customized voice), using a content delivery mechanism that can clone a voice from a short audio clip.
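The data flow described above (document, then summary bullets, then slides, then narration in a cloned voice) can be sketched as follows. This is an illustrative outline only, not the paper's implementation: the word-frequency scorer is a deliberately simple stand-in for the BERT summarizer, and the voice-cloning stage (encoder, synthesizer, GAN vocoder) is represented by a hypothetical callable, `clone_voice`, that the caller must supply. All names here (`Slide`, `summarize`, `build_slides`, `deliver`) are assumptions introduced for the example.

```python
from __future__ import annotations

import re
from collections import Counter
from dataclasses import dataclass
from typing import Callable, Dict, List

# Hypothetical interface for the voice-cloning stage:
# (text to speak, path to a short reference clip) -> raw audio bytes.
VoiceCloner = Callable[[str, str], bytes]


@dataclass
class Slide:
    title: str
    bullets: List[str]


def split_sentences(text: str) -> List[str]:
    """Naive splitter: breaks after '.', '!' or '?' followed by whitespace."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]


def summarize(text: str, max_sentences: int = 3) -> List[str]:
    """Stand-in extractive summarizer based on average word frequency.

    This is NOT the BERT summarization used in the paper; it only keeps
    the sketch self-contained.
    """
    sentences = split_sentences(text)
    freq = Counter(re.findall(r"[a-z']+", text.lower()))

    def score(sentence: str) -> float:
        words = re.findall(r"[a-z']+", sentence.lower())
        return sum(freq[w] for w in words) / max(len(words), 1)

    ranked = sorted(range(len(sentences)),
                    key=lambda i: score(sentences[i]), reverse=True)
    keep = sorted(ranked[:max_sentences])  # restore document order
    return [sentences[i] for i in keep]


def build_slides(sections: Dict[str, str], max_bullets: int = 3) -> List[Slide]:
    """One slide per document section, with summary sentences as bullets."""
    return [Slide(title, summarize(body, max_bullets))
            for title, body in sections.items()]


def narration_script(slide: Slide) -> str:
    """Text handed to the voice-cloning stage for a single slide."""
    return f"{slide.title}. " + " ".join(slide.bullets)


def deliver(slides: List[Slide], reference_clip: str,
            clone_voice: VoiceCloner) -> List[bytes]:
    """Produce one audio track per slide using the supplied voice cloner."""
    return [clone_voice(narration_script(s), reference_clip) for s in slides]
```

In a real system, `summarize` would be replaced by a BERT-based summarizer and `clone_voice` by the encoder, synthesizer, and vocoder chain. Keeping the cloner behind a plain callable means the slide-generation half can be exercised without any audio model.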