During a blockbuster press event last week, Amazon took the wraps off a redesigned Echo Show, Echo Plus, and Echo Spot, and nine other new voice-activated accessories, peripherals, and smart speakers powered by Alexa. Also in tow: the Alexa Presentation Language, which lets developers build "multimodal" Alexa apps -- skills -- that combine voice, touch, text, images, graphics, audio, and video in a single interface. Developing the frameworks that underlie it was easier said than done, according to Amazon senior speech scientist Vishal Naik. In a blog post today, he explained how Alexa leverages multiple neural networks -- layered math functions that loosely mimic the human brain's physiology -- to resolve ambiguous requests. The work is also detailed in a paper ("Context Aware Conversational Understanding for Intelligent Agents with a Screen") that was presented earlier this year at the Association for the Advancement of Artificial Intelligence.
Recent years have seen significant advances in end-to-end (E2E) spoken language understanding (SLU) systems, which directly predict intents and slots from spoken audio. While dialogue history has been exploited to improve conventional text-based natural language understanding systems, current E2E SLU approaches have not yet incorporated such critical contextual signals in multi-turn and task-oriented dialogues. In this work, we propose a contextual E2E SLU model architecture that uses a multi-head attention mechanism over encoded previous utterances and dialogue acts (actions taken by the voice assistant) of a multi-turn dialogue. We detail alternative methods to integrate these contexts into the state-of-the-art recurrent and transformer-based models. When applied to a large de-identified dataset of utterances collected by a voice assistant, our method reduces average word and semantic error rates by 10.8% and 12.6%, respectively. We also present results on a publicly available dataset and show that our method significantly improves performance over a noncontextual baseline.
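As a rough illustration of attending over encoded dialogue context, the following is a minimal sketch and not the architecture from the paper. It assumes the current utterance and each context item (a previous utterance or dialogue act) are already encoded as equal-length vectors, and it forms heads by splitting those vectors into equal chunks, with no learned projections. The function names and example vectors are hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(query, keys, values):
    """Scaled dot-product attention for a single head."""
    scale = math.sqrt(len(query))
    scores = [sum(q * k for q, k in zip(query, key)) / scale for key in keys]
    weights = softmax(scores)
    dim = len(values[0])
    return [sum(w * v[i] for w, v in zip(weights, values)) for i in range(dim)]


def multi_head_context(query, context, num_heads):
    """Split vectors into num_heads chunks, attend per chunk, concatenate.

    Assumption: heads are plain slices of the input vectors; a trained
    model would apply learned projections per head instead.
    """
    dim = len(query)
    if dim % num_heads != 0:
        raise ValueError("vector size must be divisible by num_heads")
    step = dim // num_heads
    output = []
    for head in range(num_heads):
        lo, hi = head * step, (head + 1) * step
        head_context = [c[lo:hi] for c in context]
        output.extend(attend(query[lo:hi], head_context, head_context))
    return output


# Hypothetical encodings of the current utterance and three context items.
current = [1.0, 0.0, 0.5, 0.5]
context = [
    [1.0, 0.0, 0.0, 1.0],  # e.g. an encoded previous utterance
    [0.0, 1.0, 1.0, 0.0],  # e.g. an encoded dialogue act
    [0.5, 0.5, 0.5, 0.5],
]
summary = multi_head_context(current, context, num_heads=2)
```

The resulting `summary` vector has the same size as the query and could, under these assumptions, be combined with the current utterance encoding before intent and slot prediction.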
Khatri, Chandra, Hedayatnia, Behnam, Venkatesh, Anu, Nunn, Jeff, Pan, Yi, Liu, Qing, Song, Han, Gottardi, Anna, Kwatra, Sanjeev, Pancholi, Sanju, Cheng, Ming, Chen, Qinglang, Stubel, Lauren, Gopalakrishnan, Karthik, Bland, Kate, Gabriel, Raefer, Mandal, Arindam, Hakkani-Tur, Dilek, Hwang, Gene, Michel, Nate, King, Eric, Prasad, Rohit
Building open domain conversational systems that allow users to have engaging conversations on topics of their choice is a challenging task. Alexa Prize was launched in 2016 to tackle the problem of achieving natural, sustained, coherent and engaging open-domain dialogs. In the second iteration of the competition in 2018, university teams advanced the state of the art by using context in dialog models, leveraging knowledge graphs for language understanding, handling complex utterances, building statistical and hierarchical dialog managers, and leveraging model-driven signals from user responses. The 2018 competition also provided competitors with a suite of tools and models, including the CoBot (conversational bot) toolkit, topic and dialog act detection models, conversation evaluators, and a sensitive content detection model, so that the competing teams could focus on building knowledge-rich, coherent and engaging multi-turn dialog systems. This paper outlines the advances developed by the university teams as well as the Alexa Prize team to achieve the common goal of advancing the science of Conversational AI. We address several key open-ended problems such as conversational speech recognition, open domain natural language understanding, commonsense reasoning, statistical dialog management and dialog evaluation. These collaborative efforts have improved the experience of Alexa users, raising the average rating to 3.61, the median duration to 2 minutes 18 seconds, and the average number of turns to 14.6, increases of 14%, 92%, and 54%, respectively, since the launch of the 2018 competition. For conversational speech recognition, we have improved our relative Word Error Rate by 55% and our relative Entity Error Rate by 34% since the launch of the Alexa Prize.
Socialbots improved in quality significantly more rapidly in 2018, in part due to the release of the CoBot toolkit, with new entrants attaining an average rating of 3.35 just 1 week into the semifinals, compared to 9 weeks in the 2017 competition.
Representation learning is an essential problem in a wide range of applications and it is important for performing downstream tasks successfully. In this paper, we propose a new model that learns coupled representations of domains, intents, and slots by taking advantage of their hierarchical dependency in a Spoken Language Understanding system. Our proposed model learns the vector representation of intents based on the slots tied to these intents by aggregating the representations of the slots. Similarly, the vector representation of a domain is learned by aggregating the representations of the intents tied to a specific domain. To the best of our knowledge, this is the first approach to jointly learn the representations of domains, intents, and slots using their hierarchical relationships.
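The bottom-up aggregation described above can be sketched as follows. This is a minimal illustration, not the model from the paper: the abstract does not state which aggregation function is used, so element-wise averaging is an assumption here, and the slot, intent, and domain names and their vectors are hypothetical.

```python
def mean_vector(vectors):
    """Element-wise average of equal-length vectors (assumed aggregator)."""
    if not vectors:
        raise ValueError("cannot aggregate an empty list of vectors")
    dim = len(vectors[0])
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]


def build_hierarchy(slot_vectors, intent_slots, domain_intents):
    """Derive intent vectors from slots, then domain vectors from intents."""
    intent_vectors = {
        intent: mean_vector([slot_vectors[s] for s in slots])
        for intent, slots in intent_slots.items()
    }
    domain_vectors = {
        domain: mean_vector([intent_vectors[i] for i in intents])
        for domain, intents in domain_intents.items()
    }
    return intent_vectors, domain_vectors


# Hypothetical two-dimensional slot representations.
slot_vectors = {
    "artist": [1.0, 0.0],
    "song": [0.0, 1.0],
    "city": [1.0, 1.0],
}
# Hypothetical schema: which slots belong to which intent,
# and which intents belong to which domain.
intent_slots = {
    "PlayMusic": ["artist", "song"],
    "AddToPlaylist": ["song"],
    "GetWeather": ["city"],
}
domain_intents = {
    "Music": ["PlayMusic", "AddToPlaylist"],
    "Weather": ["GetWeather"],
}

intent_vectors, domain_vectors = build_hierarchy(
    slot_vectors, intent_slots, domain_intents
)
```

In this sketch only the slot vectors are free parameters; intent and domain vectors are fully determined by the hierarchy, which is one simple way to couple the three levels.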
User interaction with voice-powered agents generates large amounts of unlabeled utterances. In this paper, we explore techniques to efficiently transfer the knowledge from these unlabeled utterances to improve model performance on Spoken Language Understanding (SLU) tasks. We use Embeddings from Language Models (ELMo) to take advantage of unlabeled data by learning contextualized word representations. Additionally, we propose ELMo-Light (ELMoL), a faster and simpler unsupervised pre-training method for SLU. Our findings suggest that unsupervised pre-training on a large corpus of unlabeled utterances leads to significantly better SLU performance than training from scratch, and it can even outperform conventional supervised transfer. Additionally, we show that the gains from unsupervised transfer techniques can be further improved by supervised transfer. The improvements are more pronounced in low-resource settings: when using only 1000 labeled in-domain samples, our techniques match the performance of training from scratch on 10-15x more labeled in-domain data.