Reinforcement Learning
Non-asymptotic Convergence Analysis of Two Time-scale (Natural) Actor-Critic Algorithms
Xu, Tengyu, Wang, Zhe, Liang, Yingbin
As an important type of reinforcement learning algorithm, actor-critic (AC) and natural actor-critic (NAC) algorithms are often executed in two ways for finding optimal policies. In the first, nested-loop design, each actor update of the policy is followed by an entire loop of critic updates of the value function, and the finite-sample analysis of such AC and NAC algorithms has recently been well established. The second, two time-scale design, in which actor and critic update simultaneously but with different learning rates, has far fewer tuning parameters than the nested-loop design and is hence substantially easier to implement. Although two time-scale AC and NAC have been shown to converge in the literature, the finite-sample convergence rate has not been established. In this paper, we provide the first such non-asymptotic convergence rate for two time-scale AC and NAC under Markovian sampling and with the actor using a general policy class approximation. We show that two time-scale AC requires an overall sample complexity of the order of $\mathcal{O}(\epsilon^{-2.5}\log^3(\epsilon^{-1}))$ to attain an $\epsilon$-accurate stationary point, and two time-scale NAC requires an overall sample complexity of the order of $\mathcal{O}(\epsilon^{-4}\log^2(\epsilon^{-1}))$ to attain an $\epsilon$-accurate global optimal point. We develop novel techniques for bounding the bias error of the actor due to dynamically changing Markovian sampling and for analyzing the convergence rate of the linear critic with dynamically changing base functions and transition kernel.
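The two time-scale design described in this abstract can be illustrated with a minimal sketch. It is not the authors' algorithm or analysis setting: the two-state toy MDP, the tabular softmax actor, the tabular critic, and the step-size exponents (0.9 for the actor, 0.6 for the critic) are all assumptions chosen only to show actor and critic updating on every sample of a single Markovian trajectory, with the critic on the faster time scale.

```python
import math
import random


def softmax(prefs):
    """Numerically stable softmax over a list of action preferences."""
    m = max(prefs)
    exps = [math.exp(p - m) for p in prefs]
    total = sum(exps)
    return [e / total for e in exps]


def two_timescale_ac(steps=5000, gamma=0.9, seed=0):
    """Toy two time-scale actor-critic on a hypothetical 2-state, 2-action MDP.

    Assumed dynamics: action a moves to state a with probability 0.9,
    and the reward is 1 whenever the next state is state 1.
    """
    rng = random.Random(seed)
    theta = [[0.0, 0.0], [0.0, 0.0]]  # actor: softmax preferences per state
    v = [0.0, 0.0]                    # critic: tabular state values
    s = 0
    for t in range(steps):
        # Assumed step sizes: the actor decays faster than the critic,
        # so the critic effectively runs on the faster time scale.
        alpha = 0.5 / (1 + t) ** 0.9  # actor step size
        beta = 0.5 / (1 + t) ** 0.6   # critic step size

        pi = softmax(theta[s])
        a = 0 if rng.random() < pi[0] else 1
        s_next = a if rng.random() < 0.9 else 1 - a
        r = 1.0 if s_next == 1 else 0.0

        # Both updates use the same sample from one trajectory.
        delta = r + gamma * v[s_next] - v[s]  # TD error
        v[s] += beta * delta                  # critic: TD(0) update
        for b in range(2):
            grad_log_pi = (1.0 if b == a else 0.0) - pi[b]
            theta[s][b] += alpha * delta * grad_log_pi  # actor update
        s = s_next
    return [softmax(row) for row in theta], v
```

Calling `two_timescale_ac()` returns the learned per-state action probabilities and the critic's value estimates. By contrast, a nested-loop variant would run an inner loop of critic updates between consecutive actor updates.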
Adaptive Dialog Policy Learning with Hindsight and User Modeling
Cao, Yan, Lu, Keting, Chen, Xiaoping, Zhang, Shiqi
Reinforcement learning methods have been used to compute dialog policies from language-based interaction experiences. Efficiency is of particular importance in dialog policy learning, because of the considerable cost of interacting with people and the very poor user experience that results from low-quality conversations. Aiming at improving the efficiency of dialog policy learning, we develop the algorithm LHUA (Learning with Hindsight, User modeling, and Adaptation) that, for the first time, enables dialog agents to adaptively learn with hindsight from both simulated and real users. Simulation and hindsight provide the dialog agent with more experience and more (positive) reinforcements, respectively. Experimental results suggest that, in success rate and policy quality, LHUA outperforms competitive baselines from the literature, including its no-simulation, no-adaptation, and no-hindsight counterparts.
Reinforcement Learning with Feedback Graphs
Dann, Christoph, Mansour, Yishay, Mohri, Mehryar, Sekhari, Ayush, Sridharan, Karthik
We study episodic reinforcement learning in Markov decision processes when the agent receives additional feedback per step in the form of several transition observations. Such additional observations are available in a range of tasks through extended sensors or prior knowledge about the environment (e.g., when certain actions yield similar outcomes). We formalize this setting using a feedback graph over state-action pairs and show that model-based algorithms can leverage the additional feedback for more sample-efficient learning. We give a regret bound that, ignoring logarithmic factors and lower-order terms, depends only on the size of the maximum acyclic subgraph of the feedback graph, in contrast with a polynomial dependency on the number of states and actions in the absence of a feedback graph. Finally, we highlight challenges when leveraging a small dominating set of the feedback graph as compared to the bandit setting and propose a new algorithm that can use knowledge of such a dominating set for more sample-efficient learning of a near-optimal policy.
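A feedback graph over state-action pairs can be sketched as a simple data structure. This is an illustration, not the paper's algorithm: the names `feedback_graph`, `observe_with_feedback`, and `toy_transition`, and the toy environment itself, are hypothetical. The point it shows is that playing one pair yields transition samples for every pair it is linked to, so a model-based learner's counts grow faster than one sample per step.

```python
from collections import defaultdict


def observe_with_feedback(counts, feedback_graph, played, transition):
    """Record one transition sample for every pair linked to `played`.

    `feedback_graph[played]` lists the state-action pairs whose outcome is
    revealed when `played` is taken (assumed to include `played` itself).
    """
    for pair in feedback_graph.get(played, [played]):
        next_state = transition(pair)
        counts[pair][next_state] += 1


# Hypothetical graph: taking "left" also reveals the outcome of "slow_left".
feedback_graph = {
    ("s0", "left"): [("s0", "left"), ("s0", "slow_left")],
    ("s0", "slow_left"): [("s0", "slow_left")],
}


def toy_transition(pair):
    """Deterministic toy dynamics: any 'left' action leads from s0 to s1."""
    state, action = pair
    return "s1" if "left" in action else state


# counts[(state, action)][next_state] -> number of observed transitions
counts = defaultdict(lambda: defaultdict(int))
observe_with_feedback(counts, feedback_graph, ("s0", "left"), toy_transition)
```

After this single step, both `("s0", "left")` and `("s0", "slow_left")` have one recorded transition, although only one action was actually played.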
Plan2Vec: Unsupervised Representation Learning by Latent Plans
Yang, Ge, Zhang, Amy, Morcos, Ari S., Pineau, Joelle, Abbeel, Pieter, Calandra, Roberto
In this paper we introduce plan2vec, an unsupervised representation learning approach that is inspired by reinforcement learning. Plan2vec constructs a weighted graph on an image dataset using near-neighbor distances, and then extrapolates this local metric to a global embedding by distilling the path integral over planned paths. When applied to control, plan2vec offers a way to learn goal-conditioned value estimates that are accurate over long horizons, in a manner that is both compute- and sample-efficient. We demonstrate the effectiveness of plan2vec on one simulated and two challenging real-world image datasets. Experimental results show that plan2vec successfully amortizes the planning cost, enabling reactive planning that is linear in memory and computation complexity rather than exhaustive over the entire state space.
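The graph-then-plan idea in this abstract can be sketched as follows. This is not the plan2vec implementation. It uses 2D points in place of images, Euclidean distance as the assumed local metric, and Dijkstra's algorithm as the planner, and it omits the distillation of planned path lengths into a learned embedding. The function names and the `radius` parameter are hypothetical.

```python
import heapq
import math


def build_neighbor_graph(points, radius):
    """Connect points whose local (Euclidean) distance is at most `radius`."""
    graph = {i: [] for i in range(len(points))}
    for i in range(len(points)):
        for j in range(i + 1, len(points)):
            d = math.dist(points[i], points[j])
            if d <= radius:
                graph[i].append((j, d))
                graph[j].append((i, d))
    return graph


def shortest_path_lengths(graph, source):
    """Dijkstra: length of the shortest planned path from `source` to each node."""
    dist = {source: 0.0}
    heap = [(0.0, source)]
    while heap:
        d, u = heapq.heappop(heap)
        if d > dist.get(u, math.inf):
            continue  # stale heap entry
        for v, w in graph[u]:
            nd = d + w
            if nd < dist.get(v, math.inf):
                dist[v] = nd
                heapq.heappush(heap, (nd, v))
    return dist


# Points along an L-shaped corridor; only adjacent points are neighbors.
points = [(0, 0), (1, 0), (2, 0), (2, 1), (2, 2)]
graph = build_neighbor_graph(points, radius=1.1)
planned = shortest_path_lengths(graph, source=0)
```

Here `planned[4]` is the path length along the corridor, which is longer than the straight-line distance between the two end points. Path lengths of this kind are what a global embedding would be trained to reproduce.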
The future of deep-reinforcement learning, our contemporary AI superhero - TechCrunch
It was not long ago that the world watched World Chess Champion Garry Kasparov lose a decisive match against a supercomputer. IBM's Deep Blue embodied the state of the art in the late 1990s, when a machine defeating a world (human) champion at a complex game such as chess was still unheard of. Fast-forward to today, and not only have supercomputers greatly surpassed Deep Blue in chess, they have managed to achieve superhuman performance in a string of other games, often much more complex than chess, ranging from Go to Dota to classic Atari titles. Many of these games have been mastered just in the last five years, pointing to a pace of innovation much quicker than the two decades prior. Recently, Google released work on Agent57, which for the first time showcased superior performance over existing benchmarks across all 57 Atari 2600 games. The class of AI algorithms underlying these feats -- deep-reinforcement learning -- has demonstrated the ability to learn at very high levels in constrained domains, such as the ones offered by games.
Does on-policy data collection fix errors in off-policy reinforcement learning?
Reinforcement learning has seen a great deal of success in solving complex decision making problems ranging from robotics to games to supply chain management to recommender systems. Despite their success, deep reinforcement learning algorithms can be exceptionally difficult to use, due to unstable training, sensitivity to hyperparameters, and generally unpredictable and poorly understood convergence properties. Multiple explanations, and corresponding solutions, have been proposed for improving the stability of such methods, and we have seen good progress over the last few years on these algorithms. In this blog post, we will dive deep into analyzing a central and underexplored reason behind some of the problems with the class of deep RL algorithms based on dynamic programming, which encompasses the popular DQN and soft actor-critic (SAC) algorithms: the detrimental connection between data distributions and learned models. Before diving deep into a description of this problem, let us quickly recap some of the main concepts in dynamic programming.
Robotic Arm Control and Task Training through Deep Reinforcement Learning
Franceschetti, Andrea, Tosello, Elisa, Castaman, Nicola, Ghidoni, Stefano
This paper presents a detailed and extensive comparison of Trust Region Policy Optimization and Deep Q-Network with Normalized Advantage Functions against other state-of-the-art algorithms, namely Deep Deterministic Policy Gradient and Vanilla Policy Gradient. Comparisons demonstrate that the former perform better than the latter when robotic arms are asked to accomplish manipulation tasks such as reaching a random target pose and picking and placing an object. Both simulated and real-world experiments are provided. Simulation lets us show the procedures that we adopted to precisely estimate the algorithms' hyper-parameters and to correctly design good policies. Real-world experiments show that our policies, if correctly trained in simulation, can be transferred and executed in a real environment with almost no changes.
Safe Reinforcement Learning through Meta-learned Instincts
Grbic, Djordje, Risi, Sebastian
An important goal in reinforcement learning is to create agents that can quickly adapt to new goals while avoiding situations that might cause damage to themselves or their environments. One way agents learn is through exploration mechanisms, which are needed to discover new policies. However, in deep reinforcement learning, exploration is normally done by injecting noise in the action space. While performing well in many domains, this setup has the inherent risk that the noisy actions performed by the agent lead to unsafe states in the environment. Here we introduce a novel approach called Meta-Learned Instinctual Networks (MLIN) that allows agents to safely learn during their lifetime while avoiding potentially hazardous states. At the core of the approach is a plastic network trained through reinforcement learning and an evolved "instinctual" network, which does not change during the agent's lifetime but can modulate the noisy output of the plastic network. We test our idea on a simple 2D navigation task with no-go zones, in which the agent has to learn to approach new targets during deployment. MLIN outperforms standard meta-trained networks and allows agents to learn to navigate to new targets without colliding with any of the no-go zones. These results suggest that meta-learning augmented with an instinctual network is a promising new approach for safe AI, which may enable progress in this area on a variety of different domains.
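The modulation described in this abstract, where a fixed instinct overrides noisy exploratory actions near hazards, can be sketched in a few lines. This is not the MLIN architecture. The instinct here is a hand-written rule standing in for the evolved network, and the combination rule (scale the noisy action by a suppression factor, then add a corrective action) and the `margin` and `retreat` parameters are assumptions made for illustration.

```python
def instinct(distance_to_hazard, margin=0.5, retreat=0.1):
    """Hand-written stand-in for an evolved instinct network (assumption).

    Returns (suppression, correction): suppression in [0, 1] scales the
    policy's action, and correction pushes away from the hazard.
    """
    if distance_to_hazard >= margin:
        return 1.0, 0.0  # far from the hazard: leave the policy alone
    suppression = max(0.0, distance_to_hazard) / margin
    return suppression, -retreat * (1.0 - suppression)


def modulated_action(noisy_action, distance_to_hazard):
    """Combine the plastic policy's noisy action with the instinct's signal."""
    suppression, correction = instinct(distance_to_hazard)
    return suppression * noisy_action + correction
```

Far from the no-go zone the noisy action passes through unchanged, so exploration is unaffected. At the boundary the noisy action is fully suppressed and only the retreat correction remains.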
An AI can simulate an economy millions of times to create fairer tax policy
Income inequality is one of the overarching problems of economics. One of the most effective tools policymakers have to address it is taxation: governments collect money from people according to what they earn and redistribute it either directly, via welfare schemes, or indirectly, by using it to pay for public projects. But though more taxation can lead to greater equality, taxing people too much can discourage them from working or motivate them to find ways to avoid paying--which reduces the overall pot. Getting the balance right is not easy. Economists typically rely on assumptions that are hard to validate.
Nik Bear Brown posted on LinkedIn
INFO 7375 - Special Topics in Artificial Intelligence Engineering and Applications - Computational Skepticism is looking for experts to speak online this summer on a variety of subjects. The Computational Skepticism class is starting today!!! I'd like to thank Kinesso, H2O.ai, Squark Ai, ArrowDx, and the Computational Radiology Laboratory at Harvard/BCH for expressing an interest in speaking with the class. These are all online talks and can be with just a small group of around 20, or we can invite the thousands of Masters students in MGENs Boston, Silicon Valley, and Seattle campuses. These subjects include data quality and completeness, bias and fairness, AutoML, model interpretability, causal inference, counterfactual models, deep learning pipeline (AutoDL), time-series pipeline (AutoTS), feature engineering pipeline (AutoFE), autoVisualization (AutoViz), reinforcement learning pipeline (AutoRL), and evidence knowledge graphs (EKG). We are looking for more companies and research groups that may be willing to share data and present how they are using machine learning.