Collaborating Authors

Deep Reinforcement Learning for Chatbots Using Clustered Actions and Human-Likeness Rewards Artificial Intelligence

Training chatbots using the reinforcement learning paradigm is challenging due to high-dimensional states, infinite action spaces and the difficulty in specifying the reward function. We address such problems using clustered actions instead of infinite actions, and a simple but promising reward function based on human-likeness scores derived from human-human dialogue data. We train Deep Reinforcement Learning (DRL) agents using chitchat data in raw text, without any manual annotations. Experimental results using different splits of training data show the following. First, our agents learn reasonable policies in the environments they are familiarised with, but their performance drops substantially when they are exposed to a test set of unseen dialogues. Second, the choice of sentence embedding size between 100 and 300 dimensions makes no significant difference on test data. Third, our proposed human-likeness rewards are reasonable for training chatbots as long as they use lengthy dialogue histories of at least 10 sentences.

Ensemble-Based Deep Reinforcement Learning for Chatbots Artificial Intelligence

Such an agent is typically characterised by: (i) a finite set of states $S = \{s_i\}$ that describe all possible situations in the environment; (ii) a finite set of actions $A = \{a_j\}$ to change the environment from one situation to another; (iii) a state transition function $T(s,a,s')$ that specifies the next state $s'$ for having taken action $a$ in the current state $s$; (iv) a reward function $R(s,a,s')$ that specifies a numerical value given to the agent for taking action $a$ in state $s$ and transitioning to state $s'$; and (v) a policy $\pi: S \rightarrow A$ that defines a mapping from states to actions [2, 30]. The goal of a reinforcement learning agent is to find an optimal policy by maximising its cumulative discounted reward defined as $Q^*(s,a) = \max_\pi \mathbb{E}[r_t + \gamma r_{t+1} + \gamma^2 r_{t+2} + \dots \mid s_t = s, a_t = a, \pi]$, where function $Q^*$ represents the maximum sum of rewards $r_t$ discounted by factor $\gamma$ at each time step. While a reinforcement learning agent takes actions with probability $\Pr(a \mid s)$ during training, it selects the best action at test time according to $\pi^*(s) = \arg\max_{a \in A} Q^*(s,a)$. A deep reinforcement learning agent approximates $Q^*$ using a multi-layer neural network [31]. The Q function is parameterised as $Q(s,a;\theta)$, where $\theta$ are the parameters or weights of the neural network (a recurrent neural network in our case). Estimating these weights requires a dataset of learning experiences $D = \{e_1, \dots, e_N\}$ (also referred to as 'experience replay memory'), where every experience is described as a tuple $e_t = (s_t, a_t, r_t, s_{t+1})$. Inducing a Q function consists of applying Q-learning updates over minibatches of experience $MB = \{(s,a,r,s') \sim U(D)\}$ drawn uniformly at random from the full dataset $D$. This process is implemented in learning algorithms using Deep Q-Networks (DQN) such as those described in [31, 32, 33], and the following section describes a DQN-based algorithm for human-chatbot interaction.
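The replay-based update described above can be illustrated with a minimal sketch. It is not the authors' DQN: it assumes a lookup table in place of the neural network $Q(s,a;\theta)$, and the toy states, actions, rewards, function names and hyperparameters (`gamma`, `alpha`, `batch_size`, `num_updates`) are all hypothetical. It only shows the mechanics of sampling minibatches uniformly from a replay memory $D$, applying Q-learning updates, and acting greedily at test time.

```python
import random
from collections import defaultdict

# Hypothetical toy replay memory D of experiences (s, a, r, s_next).
# s_next is None when the dialogue ends. All values are illustrative only.
ACTIONS = ["good_reply", "bad_reply"]
MEMORY = [
    ("turn_1", "good_reply", 1.0, "turn_2"),
    ("turn_1", "bad_reply", 0.0, "turn_2"),
    ("turn_2", "good_reply", 1.0, None),
    ("turn_2", "bad_reply", 0.0, None),
]


def q_learning_replay(memory, actions, gamma=0.9, alpha=0.5,
                      batch_size=4, num_updates=500, seed=0):
    """Tabular stand-in for a DQN: Q-learning over uniformly sampled minibatches."""
    rng = random.Random(seed)
    q = defaultdict(float)  # q[(state, action)] -> estimated value, 0.0 by default
    for _ in range(num_updates):
        # Minibatch drawn uniformly at random (with replacement) from the memory.
        minibatch = [rng.choice(memory) for _ in range(batch_size)]
        for s, a, r, s_next in minibatch:
            # Bootstrapped target: r + gamma * max_a' Q(s', a'), or just r at the end.
            if s_next is None:
                best_next = 0.0
            else:
                best_next = max(q[(s_next, b)] for b in actions)
            target = r + gamma * best_next
            # Move the current estimate a fraction alpha towards the target.
            q[(s, a)] += alpha * (target - q[(s, a)])
    return q


def greedy_policy(q, state, actions):
    """Test-time action selection: argmax over actions of Q(state, action)."""
    return max(actions, key=lambda a: q[(state, a)])


q_table = q_learning_replay(MEMORY, ACTIONS)
print(greedy_policy(q_table, "turn_1", ACTIONS))
```

In a DQN the table lookup and the in-place update would be replaced by a forward pass and a gradient step on the squared difference between the target and $Q(s,a;\theta)$. The sampling and target computation keep the same shape.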

Is the User Enjoying the Conversation? A Case Study on the Impact on the Reward Function Artificial Intelligence

The impact of user satisfaction on policy learning for task-oriented dialogue systems has long been a subject of research interest. Most current models for estimating user satisfaction either (i) treat out-of-context short texts, such as product reviews, or (ii) rely on turn features instead of distributed semantic representations. In this work we adopt deep neural networks that use distributed semantic representation learning for estimating user satisfaction in conversations. We evaluate the impact of modelling context length in these networks. Moreover, we show that the proposed hierarchical network outperforms state-of-the-art quality estimators. Furthermore, we show that applying these networks to infer the reward function in a Partially Observable Markov Decision Process (POMDP) yields a great improvement in the task success rate.

Towards a Metric for Automated Conversational Dialogue System Evaluation and Improvement Artificial Intelligence

We present "AutoJudge", an automated evaluation method for conversational dialogue systems. The method works by first generating dialogues based on self-talk, i.e. a dialogue system talking to itself. Then, it uses human ratings on these dialogues to train an automated judgement model. Our experiments show that AutoJudge correlates well with the human ratings and can be used to automatically evaluate dialogue systems, even in deployed systems. In a second part, we attempt to apply AutoJudge to improve existing systems. This works well for re-ranking a set of candidate utterances. However, our experiments show that AutoJudge cannot be applied as a reward for reinforcement learning, although the metric can distinguish good from bad dialogues. We discuss potential reasons, but note that this remains an open question for further research.

Neural User Simulation for Corpus-based Policy Optimisation for Spoken Dialogue Systems Artificial Intelligence

User Simulators are one of the major tools that enable offline training of task-oriented dialogue systems. For this task the Agenda-Based User Simulator (ABUS) is often used. The ABUS is based on hand-crafted rules and its output is in semantic form. Both properties cause problems, such as limited diversity and the inability to interface with a text-level belief tracker. This paper introduces the Neural User Simulator (NUS), whose behaviour is learned from a corpus and which generates natural language, and therefore needs a less extensively labelled dataset than simulators generating a semantic output. In contrast to much of the past work on this topic, which evaluates user simulators on corpus-based metrics, we use the NUS to train the policy of a reinforcement learning based Spoken Dialogue System. The NUS is compared to the ABUS by evaluating the policies that were trained using the simulators. Cross-model evaluation is performed, i.e. training on one simulator and testing on the other. Furthermore, the trained policies are tested on real users. In both evaluation tasks the NUS outperformed the ABUS.