In spoken dialogue systems, we aim to deploy artificial intelligence to build automated dialogue agents that can converse with humans. Dialogue systems are increasingly being designed to move beyond just imitating conversation and also improve from such interactions over time. In this survey, we present a broad overview of methods developed to build dialogue systems over the years. Different use cases for dialogue systems ranging from task-based systems to open domain chatbots motivate and necessitate specific systems. Starting from simple rule-based systems, research has progressed towards increasingly complex architectures trained on a massive corpus of datasets, like deep learning systems. Motivated with the intuition of resembling human dialogues, progress has been made towards incorporating emotions into the natural language generator, using reinforcement learning. While we see a trend of highly marginal improvement on some metrics, we find that limited justification exists for the metrics, and evaluation practices are not uniform. To conclude, we flag these concerns and highlight possible research directions.
Recommender system has been proven to be significantly crucial in many fields and is widely used by various domains. Most of the conventional recommender systems rely on the numeric rating given by a user to reflect his opinion about a consumed item; however, these ratings are not available in many domains. As a result, a new source of information represented by the user-generated reviews is incorporated in the recommendation process to compensate for the lack of these ratings. The reviews contain prosperous and numerous information related to the whole item or a specific feature that can be extracted using the sentiment analysis field. This paper gives a comprehensive overview to help researchers who aim to work with recommender system and sentiment analysis. It includes a background of the recommender system concept, including phases, approaches, and performance metrics used in recommender systems. Then, it discusses the sentiment analysis concept and highlights the main points in the sentiment analysis, including level, approaches, and focuses on aspect-based sentiment analysis.
The goal of dialogue state tracking (DST) is to predict the current dialogue state given all previous dialogue contexts. Existing approaches generally predict the dialogue state at every turn from scratch. However, the overwhelming majority of the slots in each turn should simply inherit the slot values from the previous turn. Therefore, the mechanism of treating slots equally in each turn not only is inefficient but also may lead to additional errors because of the redundant slot value generation. To address this problem, we devise the two-stage DSS-DST which consists of the Dual Slot Selector based on the current turn dialogue, and the Slot Value Generator based on the dialogue history. The Dual Slot Selector determines each slot whether to update slot value or to inherit the slot value from the previous turn from two aspects: (1) if there is a strong relationship between it and the current turn dialogue utterances; (2) if a slot value with high reliability can be obtained for it through the current turn dialogue. The slots selected to be updated are permitted to enter the Slot Value Generator to update values by a hybrid method, while the other slots directly inherit the values from the previous turn. Empirical results show that our method achieves 56.93%, 60.73%, and 58.04% joint accuracy on MultiWOZ 2.0, MultiWOZ 2.1, and MultiWOZ 2.2 datasets respectively and achieves a new state-of-the-art performance with significant improvements.
To successfully negotiate a deal, it is not enough to communicate fluently: pragmatic planning of persuasive negotiation strategies is essential. While modern dialogue agents excel at generating fluent sentences, they still lack pragmatic grounding and cannot reason strategically. We present DialoGraph, a negotiation system that incorporates pragmatic strategies in a negotiation dialogue using graph neural networks. DialoGraph explicitly incorporates dependencies between sequences of strategies to enable improved and interpretable prediction of next optimal strategies, given the dialogue context. Our graph-based method outperforms prior state-of-the-art negotiation models both in the accuracy of strategy/dialogue act prediction and in the quality of downstream dialogue response generation. We qualitatively show further benefits of learned strategy-graphs in providing explicit associations between effective negotiation strategies over the course of the dialogue, leading to interpretable and strategic dialogues.
The impact of user satisfaction in policy learning task-oriented dialogue systems has long been a subject of research interest. Most current models for estimating the user satisfaction either (i) treat out-of-context short-texts, such as product reviews, or (ii) rely on turn features instead of on distributed semantic representations. In this work we adopt deep neural networks that use distributed semantic representation learning for estimating the user satisfaction in conversations. We evaluate the impact of modelling context length in these networks. Moreover, we show that the proposed hierarchical network outperforms state-of-the-art quality estimators. Furthermore, we show that applying these networks to infer the reward function in a Partial Observable Markov Decision Process (POMDP) yields to a great improvement in the task success rate.
Lin, Chien-Wei, Auvray, Vincent, Elkind, Daniel, Biswas, Arijit, Fazel-Zarandi, Maryam, Belgamwar, Nehal, Chandra, Shubhra, Zhao, Matt, Metallinou, Angeliki, Chung, Tagyoung, Zhu, Charlie Shucheng, Adhikari, Suranjit, Hakkani-Tur, Dilek
Goal-oriented dialog systems enable users to complete specific goals like requesting information about a movie or booking a ticket. Typically the dialog system pipeline contains multiple ML models, including natural language understanding, state tracking and action prediction (policy learning). These models are trained through a combination of supervised or reinforcement learning methods and therefore require collection of labeled domain specific datasets. However, collecting annotated datasets with language and dialog-flow variations is expensive, time-consuming and scales poorly due to human involvement. In this paper, we propose an approach for automatically creating a large corpus of annotated dialogs from a few thoroughly annotated sample dialogs and the dialog schema. Our approach includes a novel goal-sampling technique for sampling plausible user goals and a dialog simulation technique that uses heuristic interplay between the user and the system (Alexa), where the user tries to achieve the sampled goal. We validate our approach by generating data and training three different downstream conversational ML models. We achieve 18 ? 50% relative accuracy improvements on a held-out test set compared to a baseline dialog generation approach that only samples natural language and entity value variations from existing catalogs but does not generate any novel dialog flow variations. We also qualitatively establish that the proposed approach is better than the baseline. Moreover, several different conversational experiences have been built using this method, which enables customers to have a wide variety of conversations with Alexa.
The migration of conversational AI agents across different embodiments in order to maintain the continuity of the task has been recently explored to further improve user experience. However, these migratable agents lack contextual understanding of the user information and the migrated device during the dialog conversations with the user. This opens the question of how an agent might behave when migrated into an embodiment for contextually predicting the next utterance. We collected a dataset from the dialog conversations between crowdsourced workers with the migration context involving personal and non-personal utterances in different settings (public or private) of embodiment into which the agent migrated. We trained the generative and information retrieval models on the dataset using with and without migration context and report the results of both qualitative metrics and human evaluation. We believe that the migration dataset would be useful for training future migratable AI systems.
Dialogue level quality estimation is vital for optimizing data driven dialogue management. Current automated methods to estimate turn and dialogue level user satisfaction employ hand-crafted features and rely on complex annotation schemes, which reduce the generalizability of the trained models. We propose a novel user satisfaction estimation approach which minimizes an adaptive multi-task loss function in order to jointly predict turn-level Response Quality labels provided by experts and explicit dialogue-level ratings provided by end users. The proposed BiLSTM based deep neural net model automatically weighs each turn's contribution towards the estimated dialogue-level rating, implicitly encodes temporal dependencies, and removes the need to hand-craft features. On dialogues sampled from 28 Alexa domains, two dialogue systems and three user groups, the joint dialogue-level satisfaction estimation model achieved up to an absolute 27% (0.43->0.70) and 7% (0.63->0.70) improvement in linear correlation performance over baseline deep neural net and benchmark Gradient boosting regression models, respectively.
End-to-end task-oriented dialogue systems aim to generate system responses directly from plain text inputs. There are two challenges for such systems: one is how to effectively incorporate external knowledge bases (KBs) into the learning framework; the other is how to accurately capture the semantics of dialogue history. In this paper, we address these two challenges by exploiting the graph structural information in the knowledge base and in the dependency parsing tree of the dialogue. To effectively leverage the structural information in dialogue history, we propose a new recurrent cell architecture which allows representation learning on graphs. To exploit the relations between entities in KBs, the model combines multi-hop reasoning ability based on the graph structure. Experimental results show that the proposed model achieves consistent improvement over state-of-the-art models on two different task-oriented dialogue datasets.
Inducing a meaningful structural representation from one or a set of dialogues is a crucial but challenging task in computational linguistics. Advancement made in this area is critical for dialogue system design and discourse analysis. It can also be extended to solve grammatical inference. In this work, we propose to incorporate structured attention layers into a Variational Recurrent Neural Network (VRNN) model with discrete latent states to learn dialogue structure in an unsupervised fashion. Compared to a vanilla VRNN, structured attention enables a model to focus on different parts of the source sentence embeddings while enforcing a structural inductive bias. Experiments show that on two-party dialogue datasets, VRNN with structured attention learns semantic structures that are similar to templates used to generate this dialogue corpus. While on multi-party dialogue datasets, our model learns an interactive structure demonstrating its capability of distinguishing speakers or addresses, automatically disentangling dialogues without explicit human annotation.