Reinforcement Learning
Planning With Pixels in (Almost) Real Time
Bandres, Wilmer (Universitat Pompeu Fabra) | Bonet, Blai (Universidad Simón Bolívar) | Geffner, Hector (ICREA & Universitat Pompeu Fabra)
Recently, width-based planning methods have been shown to yield state-of-the-art results in the Atari 2600 video games. For this, the states were associated with the (RAM) memory states of the simulator. In this work, we consider the same planning problem but using the screen instead. By using the same visual inputs, the planning results can be compared with those of humans and learning methods. We show that the planning approach, out of the box and without training, results in scores that compare well with those obtained by humans and learning methods, and moreover, by developing an episodic, rollout version of the IW(k) algorithm, we show that such scores can be obtained in almost real time.
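As a rough illustration of the width-based idea mentioned in the abstract above, the following sketch shows the novelty test of plain breadth-first IW(1): a generated state is kept only if it makes some feature atom true for the first time in the search. This is not the episodic rollout variant the paper develops, and the grid world, the `successors` function and the `features` function are hypothetical stand-ins for the emulator and the screen features.

```python
from collections import deque

GRID = 3  # hypothetical 3x3 grid standing in for a game state space


def successors(state):
    """Toy transition function (assumption): move right or up inside the grid."""
    x, y = state
    result = []
    if x + 1 < GRID:
        result.append((x + 1, y))
    if y + 1 < GRID:
        result.append((x, y + 1))
    return result


def features(state):
    """Toy feature atoms (assumption): one atom per coordinate value."""
    x, y = state
    return {("x", x), ("y", y)}


def iw1(initial_state):
    """Breadth-first search that prunes states with no novel atom (IW(1))."""
    seen_atoms = set(features(initial_state))
    queue = deque([initial_state])
    expanded = []
    while queue:
        state = queue.popleft()
        expanded.append(state)
        for nxt in successors(state):
            new_atoms = features(nxt) - seen_atoms
            if new_atoms:  # novelty 1: at least one atom seen for the first time
                seen_atoms |= new_atoms
                queue.append(nxt)
    return expanded


expanded = iw1((0, 0))
print(expanded)
```

In this toy setting the search expands only a fraction of the nine grid cells, since the number of kept states is bounded by the number of distinct atoms rather than by the number of states.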
Reinforcement Learning for Relation Classification From Noisy Data
Feng, Jun (Tsinghua University) | Huang, Minlie (Tsinghua University) | Zhao, Li (Microsoft Research Asia) | Yang, Yang (Zhejiang University) | Zhu, Xiaoyan (Tsinghua University)
Existing relation classification methods that rely on distant supervision assume that the sentences in a bag mentioning an entity pair all describe a relation for that entity pair. Such methods, which perform classification at the bag level, cannot identify the mapping between a relation and a sentence, and suffer greatly from the noisy labeling problem. In this paper, we propose a novel model for relation classification at the sentence level from noisy data. The model has two modules: an instance selector and a relation classifier. The instance selector chooses high-quality sentences with reinforcement learning and feeds the selected sentences into the relation classifier, and the relation classifier makes sentence-level predictions and provides rewards to the instance selector. The two modules are trained jointly to optimize the instance selection and relation classification processes. Experimental results show that our model can deal with the noise in the data effectively and obtains better performance for relation classification at the sentence level.
Large Scaled Relation Extraction With Reinforcement Learning
Zeng, Xiangrong (Institute of Automation, Chinese Academy of Sciences) | He, Shizhu (Institute of Automation, Chinese Academy of Sciences) | Liu, Kang (Institute of Automation, Chinese Academy of Sciences) | Zhao, Jun (Institute of Automation, Chinese Academy of Sciences)
Sentence relation extraction aims to extract relational facts from sentences and is an important task in natural language processing. Previous models rely on manually labeled supervised datasets. However, human annotation is costly and limits the number of relations and the data size, which makes it difficult to scale to large domains. To conduct large-scale relation extraction, we use an existing knowledge base to heuristically align with texts, which does not rely on human annotation and is easy to scale. However, using distantly supervised data for relation extraction poses a new challenge: sentences in the distantly supervised dataset are not directly labeled, and not all sentences that mention an entity pair represent the relation between them. To solve this problem, we propose a novel model with reinforcement learning. The relation of the entity pair is used as distant supervision and guides the training of the relation extractor with the help of a reinforcement learning method. We conduct two types of experiments on a publicly released dataset. Experimental results demonstrate the effectiveness of the proposed method compared with baseline models, achieving a 13.36% improvement.
Learning to Extract Coherent Summary via Deep Reinforcement Learning
Wu, Yuxiang (Hong Kong University of Science and Technology) | Hu, Baotian (University of Massachusetts Medical School)
Coherence plays a critical role in producing a high-quality summary from a document. In recent years, neural extractive summarization has become increasingly attractive. However, most such methods ignore the coherence of summaries when extracting sentences. As an effort towards extracting coherent summaries, we propose a neural coherence model to capture the cross-sentence semantic and syntactic coherence patterns. The proposed neural coherence model obviates the need for feature engineering and can be trained in an end-to-end fashion using unlabeled data. Empirical results show that the proposed neural coherence model can efficiently capture the cross-sentence coherence patterns. Using the combined output of the neural coherence model and the ROUGE package as the reward, we design a reinforcement learning method to train the proposed neural extractive summarizer, which is named the Reinforced Neural Extractive Summarization (RNES) model. The RNES model learns to optimize the coherence and informative importance of the summary simultaneously. The experimental results show that the proposed RNES outperforms existing baselines and achieves state-of-the-art performance in terms of ROUGE on the CNN/Daily Mail dataset. The qualitative evaluation indicates that summaries produced by RNES are more coherent and readable.
MathDQN: Solving Arithmetic Word Problems via Deep Reinforcement Learning
Wang, Lei (UESTC) | Zhang, Dongxiang (UESTC) | Gao, Lianli (UESTC) | Song, Jingkuan (UESTC) | Guo, Long (Peking University) | Shen, Heng Tao (UESTC)
Designing an automatic solver for math word problems has been considered a crucial step towards general AI, as it requires both natural language understanding and logical inference. The state-of-the-art performance was achieved by enumerating all the possible expressions from the quantities in the text and customizing a scoring function to identify the one with the maximum probability. However, this incurs a search space that is exponential in the number of quantities, and beam search has to be applied to trade accuracy for efficiency. In this paper, we make the first attempt at applying deep reinforcement learning to solve arithmetic word problems. The motivation is that the deep Q-network has been successful in solving various problems with large search spaces and achieves promising performance in terms of both accuracy and running time. To fit the math problem scenario, we propose MathDQN, which is customized from the general deep reinforcement learning framework. Technically, we design the states, actions, and reward function, together with a feed-forward neural network as the deep Q-network. Extensive experimental results validate our superiority over state-of-the-art methods. MathDQN yields remarkable improvement on most of the datasets and boosts the average precision among all the benchmark datasets by 15%.
CoChat: Enabling Bot and Human Collaboration for Task Completion
Luo, Xufang (Beihang University) | Lin, Zijia (Microsoft Research) | Wang, Yunhong (Beihang University) | Nie, Zaiqing (Alibaba AI Labs)
Chatbots have drawn significant attention of late in both industry and academia. For most task completion bots in the industry, human intervention is the only means of avoiding mistakes in complex real-world cases. However, to the best of our knowledge, there is no existing research work modeling the collaboration between task completion bots and human workers. In this paper, we introduce CoChat, a dialog management framework to enable effective collaboration between bots and human workers. In CoChat, human workers can introduce new actions at any time to handle previously unseen cases. We propose a memory-enhanced hierarchical RNN (MemHRNN) to handle the one-shot learning challenges caused by instantly introducing new actions in CoChat. Extensive experiments on real-world datasets demonstrate that CoChat can relieve most of the human workers’ workload and achieve better user satisfaction rates compared to other state-of-the-art frameworks.
HogRider: Champion Agent of Microsoft Malmo Collaborative AI Challenge
Xiong, Yanhai (Nanyang Technological University) | Chen, Haipeng (Nanyang Technological University) | Zhao, Mengchen (Nanyang Technological University) | An, Bo (Nanyang Technological University)
It has been an open challenge for self-interested agents to make optimal sequential decisions in complex multiagent systems, where agents might achieve higher utility via collaboration. The Microsoft Malmo Collaborative AI Challenge (MCAC), which is designed to encourage research relating to various problems in Collaborative AI, takes the form of a Minecraft mini-game where players might work together to catch a pig or deviate from cooperation in pursuit of high scores to win the challenge. Various characteristics, such as complex interactions among agents, uncertainties, sequential decision making and limited learning trials, all make it extremely challenging to find effective strategies. We present HogRider, the champion agent of MCAC in 2017 out of 81 teams from 26 countries. One key innovation of HogRider is a generalized agent type hypothesis framework to identify the behavior model of the other agents, which is demonstrated to be robust to observation uncertainty. On top of that, a second key innovation is a novel Q-learning approach to learn effective policies against each type of collaborating agent. Various ideas are proposed to adapt traditional Q-learning to handle complexities in the challenge, including state-action abstraction to reduce problem scale, a warm start approach using human reasoning to address limited learning trials, and an active greedy strategy to balance exploitation and exploration. Challenge results show that HogRider outperforms all the other teams by a significant margin, in terms of both optimality and stability.
Action Branching Architectures for Deep Reinforcement Learning
Tavakoli, Arash (Imperial College London) | Pardo, Fabio (Imperial College London) | Kormushev, Petar (Imperial College London)
Discrete-action algorithms have been central to numerous recent successes of deep reinforcement learning. However, applying these algorithms to high-dimensional action tasks requires tackling the combinatorial increase of the number of possible actions with the number of action dimensions. This problem is further exacerbated for continuous-action tasks that require fine control of actions via discretization. In this paper, we propose a novel neural architecture featuring a shared decision module followed by several network branches, one for each action dimension. This approach achieves a linear increase of the number of network outputs with the number of degrees of freedom by allowing a level of independence for each individual action dimension. To illustrate the approach, we present a novel agent, called Branching Dueling Q-Network (BDQ), as a branching variant of the Dueling Double Deep Q-Network (Dueling DDQN). We evaluate the performance of our agent on a set of challenging continuous control tasks. The empirical results show that the proposed agent scales gracefully to environments with increasing action dimensionality and indicate the significance of the shared decision module in coordination of the distributed action branches. Furthermore, we show that the proposed agent performs competitively against a state-of-the-art continuous control algorithm, Deep Deterministic Policy Gradient (DDPG).
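The scaling argument in the abstract above can be made concrete with a little arithmetic. The sketch below compares the number of network outputs needed when every joint action gets its own Q-value with the number needed when each action dimension gets its own branch, and shows per-branch greedy selection. The function names, the bin counts and the example Q-values are assumptions chosen for illustration, not taken from the paper.

```python
def joint_outputs(bins_per_dim: int, num_dims: int) -> int:
    """Outputs needed if every combination of sub-actions has its own Q-value."""
    return bins_per_dim ** num_dims


def branching_outputs(bins_per_dim: int, num_dims: int) -> int:
    """Outputs needed if each action dimension has its own branch of Q-values."""
    return bins_per_dim * num_dims


def greedy_joint_action(branch_q_values):
    """Pick the argmax sub-action independently in every branch."""
    return [max(range(len(q)), key=q.__getitem__) for q in branch_q_values]


# Hypothetical setting: 4 action dimensions, each discretized into 3 bins.
n_joint = joint_outputs(3, 4)
n_branch = branching_outputs(3, 4)
print(n_joint, n_branch)

# Made-up Q-values for two branches with three sub-actions each.
action = greedy_joint_action([[0.1, 0.7, 0.2], [0.5, 0.4, 0.9]])
print(action)
```

The joint count grows as a power of the number of dimensions while the branching count grows linearly, which is the trade-off the abstract describes; the price is that sub-actions are selected per dimension rather than over the full joint space.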
Reinforcement Learning in POMDPs With Memoryless Options and Option-Observation Initiation Sets
Steckelmacher, Denis (Vrije Universiteit Brussels) | Roijers, Diederik M. (Vrije Universiteit Brussels) | Harutyunyan, Anna (Vrije Universiteit Brussels) | Vrancx, Peter (PROWLER.io) | Plisnier, Hélène (Vrije Universiteit Brussels) | Nowé, Ann (Vrije Universiteit Brussels)
Many real-world reinforcement learning problems have a hierarchical nature, and often exhibit some degree of partial observability. While hierarchy and partial observability are usually tackled separately (for instance by combining recurrent neural networks and options), we show that addressing both problems simultaneously is simpler and more efficient in many cases. More specifically, we make the initiation set of options conditional on the previously-executed option, and show that options with such Option-Observation Initiation Sets (OOIs) are at least as expressive as Finite State Controllers (FSCs), a state-of-the-art approach for learning in POMDPs. OOIs are easy to design based on an intuitive description of the task, lead to explainable policies and keep the top-level and option policies memoryless. Our experiments show that OOIs allow agents to learn optimal policies in challenging POMDPs, while being much more sample-efficient than a recurrent neural network over options.
Source Traces for Temporal Difference Learning
Pitis, Silviu (Georgia Institute of Technology)
This paper motivates and develops source traces for temporal difference (TD) learning in the tabular setting. Source traces are like eligibility traces, but model potential histories rather than immediate ones. This allows TD errors to be propagated to potential causal states and leads to faster generalization. Source traces can be thought of as the model-based, backward view of successor representations (SR), and share many of the same benefits. This view, however, suggests several new ideas. First, a TD(λ)-like source learning algorithm is proposed and its convergence is proven. Then, a novel algorithm for learning the source map (or SR matrix) is developed and shown to outperform the previous algorithm. Finally, various approaches to using the source/SR model are explored, and it is shown that source traces can be effectively combined with other model-based methods like Dyna and experience replay.
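For readers unfamiliar with the successor representation mentioned in the abstract above, the following sketch computes the standard tabular SR matrix for a fixed policy, M = I + γP + γ²P² + ..., by fixed-point iteration and uses it to recover state values as V = M r. It does not reproduce the paper's source-learning algorithms; the three-state chain, the reward vector and the discount factor are made-up examples, and identifying this matrix with the paper's source map is an assumption based only on the abstract's wording.

```python
def successor_matrix(P, gamma, sweeps=500):
    """Iterate M <- I + gamma * P @ M, which converges to (I - gamma * P)^-1."""
    n = len(P)
    identity = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    M = [row[:] for row in identity]
    for _ in range(sweeps):
        M = [
            [
                identity[i][j] + gamma * sum(P[i][k] * M[k][j] for k in range(n))
                for j in range(n)
            ]
            for i in range(n)
        ]
    return M


def state_values(M, r):
    """Values under the fixed policy: V = M r."""
    n = len(r)
    return [sum(M[i][j] * r[j] for j in range(n)) for i in range(n)]


# Hypothetical chain 0 -> 1 -> 2, where state 2 is absorbing and pays reward 1.
P = [
    [0.0, 1.0, 0.0],
    [0.0, 0.0, 1.0],
    [0.0, 0.0, 1.0],
]
r = [0.0, 0.0, 1.0]
gamma = 0.9

M = successor_matrix(P, gamma)
V = state_values(M, r)
print(V)
```

Once M is available, a change in the reward vector only requires recomputing the product M r, which is one reason this kind of model is attractive in the tabular setting.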