Moore's law has driven silicon chip circuitry to the point where we are surrounded by devices equipped with microprocessors. The devices themselves are frequently wonderful; communicating with them is often not. Pressing buttons on smart devices or keyboards is clumsy, and never the method of choice when effective voice communication is possible. The keyword in the previous sentence is "effective". Technology has advanced to the point where we are in the early stages of being able to communicate with our devices using voice recognition.
Amazon Lex is a service for building conversational interfaces into any application using voice and text. It provides the advanced deep learning functionality of automatic speech recognition (ASR) to convert speech to text, and natural language understanding (NLU) to recognize the intent behind that text, enabling you to build applications with highly engaging user experiences and lifelike conversational interactions. With Amazon Lex, the same deep learning technologies that power Amazon Alexa are available to any developer, so you can quickly and easily build sophisticated, natural-language conversational bots ("chatbots").
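To make this concrete, here is a minimal sketch of sending one text utterance to a Lex (V1) bot through the boto3 runtime client. The bot name `OrderFlowers`, alias `prod`, and user ID are placeholders you would replace with your own deployed bot's values, and the call assumes AWS credentials are configured in your environment.

```python
def build_post_text_params(bot_name, bot_alias, user_id, text):
    """Assemble the parameters for the lex-runtime post_text call."""
    return {
        "botName": bot_name,
        "botAlias": bot_alias,
        "userId": user_id,
        "inputText": text,
    }

def ask_bot(text):
    # boto3 is imported lazily so the parameter-building logic above
    # can be exercised without the AWS SDK installed.
    import boto3
    client = boto3.client("lex-runtime")
    params = build_post_text_params("OrderFlowers", "prod", "demo-user", text)
    response = client.post_text(**params)
    # Lex returns the recognized intent and the bot's reply message.
    return response.get("intentName"), response.get("message")

if __name__ == "__main__":
    # Show the request that would be sent (no network call here).
    print(build_post_text_params("OrderFlowers", "prod", "demo-user",
                                 "I would like to order roses"))
```

For voice input, `post_content` accepts an audio stream instead of text and performs the ASR step before NLU; the conversational flow is otherwise the same.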
There are many situations in which running deep learning inference on local devices is preferable for both individuals and companies: imagine traveling with no reliable internet connection, or dealing with the privacy concerns and latency of transferring data to cloud-based services. Edge computing addresses these problems by processing and analyzing data at the edge of the network. Take the "Ok Google" feature as an example: after being trained on a user's voice, that user's phone activates when it hears the keyword. This kind of small-footprint keyword-spotting (KWS) inference usually happens on-device, so you don't have to worry that the service provider is listening to you all the time; the cloud-based service is invoked only after you issue a command.
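A toy illustration of the on-device half of this pipeline: the sketch below frames an audio signal, computes a per-frame energy profile, and slides a recorded keyword template over the stream, "waking" only when the profiles match closely. Real small-footprint KWS systems use learned acoustic features and a small neural network rather than raw energy matching; the frame length, threshold, and cosine-similarity detector here are illustrative assumptions, not any production algorithm.

```python
import math

def frame_energies(samples, frame_len=160):
    """Split a 1-D audio signal into frames and return per-frame energy."""
    return [sum(x * x for x in samples[i:i + frame_len])
            for i in range(0, len(samples) - frame_len + 1, frame_len)]

def cosine(a, b):
    """Cosine similarity between two equal-length feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0

def spot_keyword(stream, template, threshold=0.95):
    """Slide the template's energy profile over the stream.

    Returns True at the first window whose profile is similar enough;
    in a real device this is the point where the cloud service is woken.
    """
    s = frame_energies(stream)
    t = frame_energies(template)
    for i in range(len(s) - len(t) + 1):
        if cosine(s[i:i + len(t)], t) >= threshold:
            return True
    return False
```

Everything up to the `return True` runs locally; only after the keyword fires would audio leave the device, which is exactly the privacy property the paragraph describes.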
To learn more about conversational AI, check out Yishay Carmiel's session, "Applications of neural-based models for conversational speech," at the Artificial Intelligence Conference in San Francisco, Sept. 17-20, 2017. The dream of speech recognition is a system that truly understands humans speaking in different environments, with a variety of accents and languages. For decades, researchers tackled this problem without success; pinpointing effective strategies for building such a system seemed impossible. In recent years, however, breakthroughs in AI and deep learning have changed everything in the quest for speech recognition.
To understand why WaveNet improves on the current state of the art, it is useful to understand how text-to-speech (TTS), or speech synthesis, systems work today. The majority of these are based on so-called concatenative TTS, which uses a large database of high-quality recordings collected from a single voice actor over many hours. These recordings are split into tiny chunks that can then be combined, or concatenated, to form complete utterances as needed. However, such systems can produce unnatural-sounding voices and are also difficult to modify, because a whole new database must be recorded every time a change, such as a new emotion or intonation, is needed. To overcome some of these problems, an alternative model known as parametric TTS is sometimes used.
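The concatenation step described above can be sketched in a few lines. The toy "database" below maps hypothetical unit labels to short prerecorded waveforms (here just constant-valued lists standing in for audio), and the joins are smoothed with a linear crossfade; real concatenative systems select units by linguistic context and cost functions, which this sketch omits entirely.

```python
def crossfade_concat(units, overlap=4):
    """Concatenate recorded units, linearly crossfading `overlap` samples
    at each join to smooth the seams. Assumes each unit is longer than
    `overlap` samples."""
    out = list(units[0])
    for unit in units[1:]:
        tail, head = out[-overlap:], unit[:overlap]
        out = out[:-overlap]
        # Blend the end of the previous unit into the start of the next.
        for k in range(overlap):
            w = (k + 1) / (overlap + 1)
            out.append((1 - w) * tail[k] + w * head[k])
        out.extend(unit[overlap:])
    return out

# Toy unit database: labels and waveforms are purely illustrative.
db = {"he": [0.1] * 8, "el": [0.2] * 8, "lo": [0.3] * 8}
wave = crossfade_concat([db["he"], db["el"], db["lo"]])
```

The rigidity the paragraph mentions is visible even here: changing the voice's character means re-recording every entry in `db`, since the synthesizer can only rearrange what was captured.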