 Reinforcement Learning


Construction of Macro Actions for Deep Reinforcement Learning

arXiv.org Artificial Intelligence

Conventional deep reinforcement learning typically determines an appropriate primitive action at each timestep, which requires an enormous amount of time and effort to learn an effective policy, especially in large and complex environments. To deal with the issue fundamentally, we incorporate macro actions, defined as sequences of primitive actions, into the primitive action space to form an augmented action space. The problem lies in how to find an appropriate macro action to augment the primitive action space. An agent using a proper augmented action space is able to jump to a farther state and thus speed up the exploration process as well as facilitate the learning procedure. In previous research, macro actions were developed by mining the most frequently used action sequences or by repeating previous actions. However, the most frequently used action sequences are extracted from a past policy, which may only reinforce the original behavior of that policy. On the other hand, repeating actions may limit the diversity of behaviors of the agent. Instead, we propose to construct macro actions by a genetic algorithm, which eliminates the dependency of the macro action derivation procedure on the past policies of the agent. Our approach appends macro actions to the primitive action space one at a time and evaluates whether the augmented action space leads to promising performance or not. We perform extensive experiments and show that the constructed macro actions are able to speed up the learning process for a variety of deep reinforcement learning methods. Our experimental results also demonstrate that the macro actions suggested by our approach are transferable among deep reinforcement learning methods and similar environments. We further provide a comprehensive set of ablation analyses to validate the proposed methodology.
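The idea of an augmented action space can be illustrated with a minimal sketch. This is not the paper's implementation: the `Corridor` toy environment, the `MacroActionEnv` wrapper, and the `(observation, reward, done)` step convention are all assumptions made for illustration, and the genetic search that proposes macros is not shown.

```python
from typing import List, Sequence, Tuple


class Corridor:
    """Hypothetical 1-D toy environment (not from the paper)."""

    def __init__(self, length: int = 5) -> None:
        self.length = length
        self.pos = 0

    def reset(self) -> int:
        self.pos = 0
        return self.pos

    def step(self, action: int) -> Tuple[int, float, bool]:
        # Primitive actions: 0 = move left, 1 = move right.
        self.pos = max(0, self.pos + (1 if action == 1 else -1))
        done = self.pos >= self.length
        return self.pos, (1.0 if done else 0.0), done


class MacroActionEnv:
    """Exposes primitive actions plus macro actions as one discrete action space."""

    def __init__(self, env: Corridor, n_primitive: int,
                 macros: Sequence[Sequence[int]]) -> None:
        self.env = env
        # Indices 0..n_primitive-1 are primitives; macros are appended after them.
        # Assumption: every macro is a non-empty sequence of primitive actions.
        self.actions: List[Tuple[int, ...]] = (
            [(a,) for a in range(n_primitive)] + [tuple(m) for m in macros]
        )

    @property
    def n_actions(self) -> int:
        return len(self.actions)

    def reset(self) -> int:
        return self.env.reset()

    def step(self, index: int) -> Tuple[int, float, bool]:
        # Run the chosen sequence, summing rewards and stopping early on termination.
        total = 0.0
        for primitive in self.actions[index]:
            obs, reward, done = self.env.step(primitive)
            total += reward
            if done:
                break
        return obs, total, done


env = MacroActionEnv(Corridor(length=5), n_primitive=2, macros=[(1, 1, 1)])
env.reset()
first = env.step(2)   # the macro "right, right, right"
second = env.step(2)  # terminates partway through the macro
```

Here a single choice of action index 2 moves the agent three cells, which is the "jump to a farther state" effect the abstract describes; a learner sees only a slightly larger discrete action set.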


Dueling Posterior Sampling for Preference-Based Reinforcement Learning

arXiv.org Artificial Intelligence

In preference-based reinforcement learning (RL), an agent interacts with the environment while receiving preferences instead of absolute feedback. While there is increasing research activity in preference-based RL, the design of formal frameworks that admit tractable theoretical analysis remains an open challenge. Building upon ideas from preference-based bandit learning and posterior sampling in RL, we present Dueling Posterior Sampling (DPS), which employs preference-based posterior sampling to learn both the system dynamics and the underlying utility function that governs the user's preferences. Because preference feedback is provided on trajectories rather than individual state/action pairs, we develop a Bayesian approach to solving the credit assignment problem, translating user preferences to a posterior distribution over state/action reward models. We prove an asymptotic no-regret rate for DPS with a Bayesian logistic regression credit assignment model; to our knowledge, this is the first regret guarantee for preference-based RL. We also discuss possible avenues for extending this proof methodology to analyze other credit assignment models. Finally, we evaluate the approach empirically, showing competitive performance against existing baselines.
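As a rough illustration of trajectory-level credit assignment, the sketch below computes the probability that one trajectory is preferred over another under a logistic link on the difference of their total state/action rewards. It is an assumption-laden simplification, not the DPS algorithm: the function names, the visit-count features, and the fixed reward vector are hypothetical, and the posterior sampling over rewards and dynamics is omitted.

```python
import math
from typing import List, Sequence, Tuple

Trajectory = Sequence[Tuple[int, int]]  # sequence of (state, action) pairs


def visit_counts(trajectory: Trajectory, n_states: int, n_actions: int) -> List[float]:
    """Count how often each state/action pair occurs in a trajectory."""
    counts = [0.0] * (n_states * n_actions)
    for state, action in trajectory:
        counts[state * n_actions + action] += 1.0
    return counts


def preference_probability(rewards: Sequence[float], traj_a: Trajectory,
                           traj_b: Trajectory, n_states: int, n_actions: int) -> float:
    """P(traj_a preferred over traj_b) under a logistic link (an assumed model)."""
    xa = visit_counts(traj_a, n_states, n_actions)
    xb = visit_counts(traj_b, n_states, n_actions)
    # Difference in total reward, expressed through the visit-count features.
    diff = sum(r * (ca - cb) for r, ca, cb in zip(rewards, xa, xb))
    return 1.0 / (1.0 + math.exp(-diff))


# Two states, two actions; only the pair (state 0, action 1) carries reward.
rewards = [0.0, 1.0, 0.0, 0.0]
traj_a = [(0, 1), (1, 0)]
traj_b = [(0, 0), (1, 0)]
p_ab = preference_probability(rewards, traj_a, traj_b, n_states=2, n_actions=2)
p_ba = preference_probability(rewards, traj_b, traj_a, n_states=2, n_actions=2)
```

In a Bayesian treatment the reward vector would be uncertain, and this likelihood would be used to update a posterior over it from observed preferences.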


Improving Deep Reinforcement Learning in Minecraft with Action Advice

arXiv.org Artificial Intelligence

Training deep reinforcement learning agents to perform complex behaviors in 3D virtual environments requires significant computational resources. This is especially true in environments with high degrees of aliasing, where many states share nearly identical visual features. Minecraft is an exemplar of such an environment. We hypothesize that interactive machine learning (IML), wherein human teachers play a direct role in training through demonstrations, critique, or action advice, may alleviate agent susceptibility to aliasing. However, interactive machine learning is only practical when the number of human interactions is limited, requiring a balance between human teacher effort and agent performance. We conduct experiments with two reinforcement learning algorithms which enable human teachers to give action advice--Feedback Arbitration and Newtonian Action Advice--under visual aliasing conditions. To assess potential cognitive load per advice type, we vary the accuracy and frequency of various human action advice techniques. The training efficiency, robustness against infrequent and inaccurate advisor input, and sensitivity to aliasing are examined.


Combining learned skills and reinforcement learning for robotic manipulations

arXiv.org Artificial Intelligence

Manipulation tasks such as preparing a meal or assembling furniture remain highly challenging for robotics and vision. The supervised approach of imitation learning can handle short tasks but suffers from compounding errors and the need for many demonstrations for longer and more complex tasks. Reinforcement learning (RL) can find solutions beyond demonstrations but requires tedious and task-specific reward engineering for multi-step problems. In this work we address the difficulties of both methods and explore their combination. To this end, we propose RL policies operating on pre-trained skills that can learn composite manipulations using no intermediate rewards and no demonstrations of full tasks. We also propose an efficient training of basic skills from few synthetic demonstrated trajectories by exploring recent CNN architectures and data augmentation. We show successful learning of policies for composite manipulation tasks such as making a simple breakfast. Notably, our method achieves high success rates on a real robot, while using synthetic training data only.


Unsupervised Learning: The Next Wave in AI Revolution Analytics Insight

#artificialintelligence

Over the last decade, machine learning has made exceptional progress in areas as varied as image recognition, self-driving vehicles and playing complex games like Go. These successes have largely been achieved by training deep neural networks with one of two learning paradigms: supervised learning and reinforcement learning. Both paradigms require training signals to be designed by a human and then passed to the computer. In the case of supervised learning, these are the "objectives" (for example, the right name for a picture); in the case of reinforcement learning, they are the "rewards" for successful behavior (for example, getting a high score in an Atari game). The limits of learning are in this way defined by the human trainers.


Reinforcement Learning Explained: Overview, Comparisons and Applications in Business

#artificialintelligence

Imagine you're completing a mission in a computer game. Maybe you're going through a military depot to find a secret weapon. You get points for the right actions (killing an enemy) and lose them for the wrong ones (falling into a pit or getting hit). If you're playing on high difficulty, you might not conclude this task in just one attempt. Try after try, you learn which consecutive actions are needed to get out of a location safe, armed, and equipped with bonuses like extra health points or small artifacts in your bag.


Optimality and Approximation with Policy Gradient Methods in Markov Decision Processes

arXiv.org Machine Learning

Policy gradient methods are among the most effective methods in challenging reinforcement learning problems with large state and/or action spaces. However, little is known about even their most basic theoretical convergence properties, including: if and how fast they converge to a globally optimal solution (say with a sufficiently rich policy class); how they cope with approximation error due to using a restricted class of parametric policies; or their finite sample behavior. Such characterizations are important not only to compare these methods to their approximate value function counterparts (where such issues are relatively well understood, at least in the worst case) but also to help with more principled approaches to algorithm design. This work provides provable characterizations of computational, approximation, and sample size issues with regard to policy gradient methods in the context of discounted Markov Decision Processes (MDPs). We focus on both: 1) "tabular" policy parameterizations, where the optimal policy is contained in the class and where we show global convergence to the optimal policy, and 2) restricted policy classes, which may not contain the optimal policy and where we provide agnostic learning results. One insight of this work is in formalizing the importance of how a favorable initial state distribution provides a means to circumvent worst-case exploration issues. Overall, these results place policy gradient methods on a solid theoretical footing, analogous to the global convergence guarantees of iterative value function based algorithms.
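A "tabular" softmax parameterization can be made concrete in the simplest possible case: an MDP with a single state, where the policy has one parameter per action. The sketch below runs exact gradient ascent on the expected reward; it is a degenerate illustration under assumed rewards and step size, not the setting or the analysis of the paper.

```python
import math
from typing import List, Sequence


def softmax(theta: Sequence[float]) -> List[float]:
    """Numerically stable softmax over per-action parameters."""
    m = max(theta)
    exps = [math.exp(t - m) for t in theta]
    z = sum(exps)
    return [e / z for e in exps]


def policy_gradient_step(theta: Sequence[float], rewards: Sequence[float],
                         lr: float) -> List[float]:
    """One exact gradient-ascent step on J(theta) = sum_a pi_a * r_a."""
    pi = softmax(theta)
    value = sum(p * r for p, r in zip(pi, rewards))
    # For the softmax policy, dJ/dtheta_a = pi_a * (r_a - J).
    return [t + lr * p * (r - value) for t, p, r in zip(theta, pi, rewards)]


# Assumed toy problem: three actions with fixed, known rewards.
rewards = [0.2, 1.0, 0.5]
theta = [0.0, 0.0, 0.0]
for _ in range(2000):
    theta = policy_gradient_step(theta, rewards, lr=0.5)
final_policy = softmax(theta)
```

With a uniform start, the probability mass drifts toward the highest-reward action, although the gradient shrinks as the policy becomes nearly deterministic, which is one reason convergence rates for this parameterization are a nontrivial question.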


Robby is Not a Robber (anymore): On the Use of Institutions for Learning Normative Behavior

arXiv.org Artificial Intelligence

We show how norms can be used to guide a reinforcement learning agent towards achieving normative behavior and apply the same set of norms over different domains. Thus, we are able to: (1) provide a way to intuitively encode social knowledge (through norms); (2) guide learning towards normative behaviors (through an automatic norm reward system); and (3) achieve a transfer of learning by abstracting policies. Finally, (4) the method is not dependent on a particular RL algorithm. We show how our approach can be seen as a means to achieve abstract representation and learn procedural knowledge based on the declarative semantics of norms and discuss possible implications of this in some areas of cognitive science.

Index Terms: Norms, Institutions, Automatic Reward Shaping, Transfer of Learning, Abstract Policies, Abstraction, State-Space Selection, Schema

I. INTRODUCTION

In order to be accepted in human society, robots need to comply with human social norms.


Neural Simplex Architecture

arXiv.org Artificial Intelligence

We present the Neural Simplex Architecture (NSA), a new approach to runtime assurance that provides safety guarantees for neural controllers (obtained e.g. using reinforcement learning) of complex autonomous and other cyber-physical systems (CPSs) without unduly sacrificing performance. NSA is inspired by the Simplex control architecture of Sha et al., but with some significant differences. In the traditional Simplex approach, the advanced controller (AC) is treated as a black box; there are no techniques for correcting the AC after it generates a potentially unsafe control input that causes a failover to the baseline controller (BC). Our NSA addresses this limitation. NSA not only provides safety assurances for CPSs in the presence of a possibly faulty neural controller, but can also improve the safety of such a controller in an online setting via retraining, without degrading its performance. NSA also offers reverse switching strategies, which allow the AC to resume control of the system under reasonable conditions, allowing the mission to continue unabated. Our experimental results on several significant case studies, including a target-seeking ground rover navigating an obstacle field and a neural controller for an artificial pancreas system, demonstrate NSA's benefits.
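The switching logic of a Simplex-style architecture can be sketched as a decision module that accepts the advanced controller's input only when the predicted next state is still acceptable. Everything below is a hypothetical toy: the function names, the one-step lookahead check, and the 1-D dynamics are assumptions for illustration, and the online retraining that distinguishes NSA is not shown.

```python
from typing import Callable, Tuple


def simplex_control(state: float,
                    advanced: Callable[[float], float],
                    baseline: Callable[[float], float],
                    dynamics: Callable[[float, float], float],
                    is_recoverable: Callable[[float], bool]) -> Tuple[float, str]:
    """Return the control input to apply and which controller produced it."""
    proposed = advanced(state)
    # Assumed check: a one-step lookahead on the advanced controller's proposal.
    if is_recoverable(dynamics(state, proposed)):
        return proposed, "AC"
    # Fail over to the baseline controller when the proposal looks unsafe.
    return baseline(state), "BC"


# Toy 1-D system (hypothetical): the state must stay within [-1, 1].
def dynamics(x: float, u: float) -> float:
    return x + u


def advanced(x: float) -> float:
    return 0.5  # always pushes right, so it is eventually unsafe


def baseline(x: float) -> float:
    return -0.5 if x > 0 else 0.5  # steers back toward the origin


def is_recoverable(x: float) -> bool:
    return abs(x) <= 1.0


safe_case = simplex_control(0.0, advanced, baseline, dynamics, is_recoverable)
unsafe_case = simplex_control(0.8, advanced, baseline, dynamics, is_recoverable)
```

Because the check is re-evaluated at every step, the advanced controller regains control as soon as its proposal passes again, which is a very simple stand-in for the reverse switching strategies mentioned in the abstract.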


Reinforcement Learning for Personalized Dialogue Management

arXiv.org Artificial Intelligence

Language systems have been of great interest to the research community and have recently reached the mass market through various assistant platforms on the web. Reinforcement Learning methods that optimize dialogue policies have seen successes in past years and have recently been extended into methods that personalize the dialogue, e.g. take the personal context of users into account. These works, however, are limited to personalization to a single user with whom they require multiple interactions and do not generalize the usage of context across users. This work introduces a problem where a generalized usage of context is relevant and proposes two Reinforcement Learning (RL)-based approaches to this problem. The first approach uses a single learner and extends the traditional POMDP formulation of dialogue state with features that describe the user context. The second approach segments users by context and then employs a learner per context. We compare these approaches in a benchmark of existing non-RL and RL-based methods in three established and one novel application domain of financial product recommendation. We compare the influence of context and training experiences on performance and find that learning approaches generally outperform a handcrafted gold standard.