Goto

Collaborating Authors

 Reinforcement Learning


Bias-reduced multi-step hindsight experience replay

arXiv.org Artificial Intelligence

Multi-goal reinforcement learning is widely used in planning and robot manipulation. Two main challenges in multi-goal reinforcement learning are sparse rewards and sample inefficiency. Hindsight Experience Replay (HER) aims to tackle the two challenges with hindsight knowledge. However, HER and its previous variants still need millions of samples and a huge computation. In this paper, we propose \emph{Multi-step Hindsight Experience Replay} (MHER) based on $n$-step relabeling, incorporating multi-step relabeled returns to improve sample efficiency. Despite the advantages of $n$-step relabeling, we theoretically and experimentally prove the off-policy $n$-step bias introduced by $n$-step relabeling may lead to poor performance in many environments. To address the above issue, two bias-reduced MHER algorithms, MHER($\lambda$) and Model-based MHER (MMHER) are presented. MHER($\lambda$) exploits the $\lambda$ return while MMHER benefits from model-based value expansions. Experimental results on numerous multi-goal robotic tasks show that our solutions can successfully alleviate off-policy $n$-step bias and achieve significantly higher sample efficiency than HER and Curriculum-guided HER with little additional computation beyond HER.


Positive Reinforcements Help Algorithm Forecast Underground Natural Reserves

#artificialintelligence

Texas A&M University and University of Oklahoma researchers have designed a reinforcement-based algorithm that automates the prediction of underground oil and gas reserves. Texas A&M University (TAMU) and University of Oklahoma researchers have developed a reinforcement-based algorithm that automates forecasting of subterranean properties, enabling accurate prediction of oil and gas reserves. The algorithm focuses on the correct characterization of the underground environment based on rewards accumulated for making correct predictions of pressure and flow anticipated from boreholes. The TAMU team learned that within 10 iterations of reinforcement learning, the algorithm could correctly and rapidly predict the properties of simple subsurface scenarios. TAMU's Siddharth Misra said, "We have turned history matching into a sequential decision-making problem, which has the potential to reduce engineers' efforts, mitigate human bias, and remove the need of large sets of labeled training data."


3 ways to get into reinforcement learning

#artificialintelligence

When I was in graduate school in the 1990s, one of my favorite classes was neural networks. Back then, we didn't have access to TensorFlow, PyTorch, or Keras; we programmed neurons, neural networks, and learning algorithms by hand with the formulas from textbooks. We didn't have access to cloud computing, and we coded sequential experiments that often ran overnight. There weren't platforms like Alteryx, Dataiku, SageMaker, or SAS to enable a machine learning proof of concept or manage the end-to-end MLops lifecycles. I was most interested in reinforcement learning algorithms, and I recall writing hundreds of reward functions to stabilize an inverted pendulum.


Multi-Agent Deep Reinforcement Learning in 13 Lines of Code Using PettingZoo

#artificialintelligence

This tutorial provides a simple introduction to using multi-agent reinforcement learning, assuming a little experience in machine learning and knowledge of Python. Reinforcement stems from using machine learning to optimally control an agent in an environment. It works by learning a policy, a function that maps an observation obtained from its environment to an action. Policy functions are typically deep neural networks, which gives rise to the name "deep reinforcement learning." The goal of reinforcement learning is to learn an optimal policy, a policy that achieves the maximum expected reward from the environment when acting.


How reinforcement learning chooses the ads you see

#artificialintelligence

Every day, digital advertisement agencies serve billions of ads on news websites, search engines, social media networks, video streaming websites, and other platforms. And they all want to answer the same question: Which of the many ads they have in their catalog is more likely to appeal to a certain viewer? Finding the right answer to this question can have a huge impact on revenue when you are dealing with hundreds of websites, thousands of ads, and millions of visitors. Fortunately (for the ad agencies, at least), reinforcement learning (RL), the branch of artificial intelligence that has become renowned for mastering board and video games, provides a solution. Reinforcement learning models seek to maximize rewards.


A Sufficient Statistic for Influence in Structured Multiagent Environments

Journal of Artificial Intelligence Research

Making decisions in complex environments is a key challenge in artificial intelligence (AI). Situations involving multiple decision makers are particularly complex, leading to computational intractability of principled solution methods. A body of work in AI has tried to mitigate this problem by trying to distill interaction to its essence: how does the policy of one agent influence another agent? If we can find more compact representations of such influence, this can help us deal with the complexity, for instance by searching the space of influences rather than the space of policies. However, so far these notions of influence have been restricted in their applicability to special cases of interaction. In this paper we formalize influence-based abstraction (IBA), which facilitates the elimination of latent state factors without any loss in value, for a very general class of problems described as factored partially observable stochastic games (fPOSGs). On the one hand, this generalizes existing descriptions of influence, and thus can serve as the foundation for improvements in scalability and other insights in decision making in complex multiagent settings. On the other hand, since the presence of other agents can be seen as a generalization of single agent settings, our formulation of IBA also provides a sufficient statistic for decision making under abstraction for a single agent. We also give a detailed discussion of the relations to such previous works, identifying new insights and interpretations of these approaches. In these ways, this paper deepens our understanding of abstraction in a wide range of sequential decision making settings, providing the basis for new approaches and algorithms for a large class of problems.


Combining Off and On-Policy Training in Model-Based Reinforcement Learning

arXiv.org Artificial Intelligence

The combination of deep learning and Monte Carlo Tree Search (MCTS) has shown to be effective in various domains, such as board and video games. AlphaGo represented a significant step forward in our ability to learn complex board games, and it was rapidly followed by significant advances, such as AlphaGo Zero and AlphaZero. Recently, MuZero demonstrated that it is possible to master both Atari games and board games by directly learning a model of the environment, which is then used with MCTS to decide what move to play in each position. During tree search, the algorithm simulates games by exploring several possible moves and then picks the action that corresponds to the most promising trajectory. When training, limited use is made of these simulated games since none of their trajectories are directly used as training examples. Even if we consider that not all trajectories from simulated games are useful, there are thousands of potentially useful trajectories that are discarded. Using information from these trajectories would provide more training data, more quickly, leading to faster convergence and higher sample efficiency. Recent work introduced an off-policy value target for AlphaZero that uses data from simulated games. In this work, we propose a way to obtain off-policy targets using data from simulated games in MuZero. We combine these off-policy targets with the on-policy targets already used in MuZero in several ways, and study the impact of these targets and their combinations in three environments with distinct characteristics. When used in the right combinations, our results show that these targets speed up the training process and lead to faster convergence and higher rewards than the ones obtained by MuZero.


Hybrid Car-Following Strategy based on Deep Deterministic Policy Gradient and Cooperative Adaptive Cruise Control

arXiv.org Artificial Intelligence

Deep deterministic policy gradient (DDPG) based car-following strategy can break through the constraints of the differential equation model due to the ability of exploration on complex environments. However, the car-following performance of DDPG is usually degraded by unreasonable reward function design, insufficient training and low sampling efficiency. In order to solve this kind of problem, a hybrid car-following strategy based on DDPG and cooperative adaptive cruise control (CACC) is proposed. Firstly, the car-following process is modeled as markov decision process to calculate CACC and DDPG simultaneously at each frame. Given a current state, two actions are obtained from CACC and DDPG, respectively. Then an optimal action, corresponding to the one offering a larger reward, is chosen as the output of the hybrid strategy. Meanwhile, a rule is designed to ensure that the change rate of acceleration is smaller than the desired value. Therefore, the proposed strategy not only guarantees the basic performance of car-following through CACC, but also makes full use of the advantages of exploration on complex environments via DDPG. Finally, simulation results show that the car-following performance of proposed strategy is improved significantly as compared with that of DDPG and CACC in the whole state space.


Improved Regret Bound and Experience Replay in Regularized Policy Iteration

arXiv.org Machine Learning

In this work, we study algorithms for learning in infinite-horizon undiscounted Markov decision processes (MDPs) with function approximation. We first show that the regret analysis of the Politex algorithm (a version of regularized policy iteration) can be sharpened from $O(T^{3/4})$ to $O(\sqrt{T})$ under nearly identical assumptions, and instantiate the bound with linear function approximation. Our result provides the first high-probability $O(\sqrt{T})$ regret bound for a computationally efficient algorithm in this setting. The exact implementation of Politex with neural network function approximation is inefficient in terms of memory and computation. Since our analysis suggests that we need to approximate the average of the action-value functions of past policies well, we propose a simple efficient implementation where we train a single Q-function on a replay buffer with past data. We show that this often leads to superior performance over other implementation choices, especially in terms of wall-clock time. Our work also provides a novel theoretical justification for using experience replay within policy iteration algorithms.


Online Policy Gradient for Model Free Learning of Linear Quadratic Regulators with $\sqrt{T}$ Regret

arXiv.org Machine Learning

We consider the task of learning to control a linear dynamical system under fixed quadratic costs, known as the Linear Quadratic Regulator (LQR) problem. While model-free approaches are often favorable in practice, thus far only model-based methods, which rely on costly system identification, have been shown to achieve regret that scales with the optimal dependence on the time horizon T. We present the first model-free algorithm that achieves similar regret guarantees. Our method relies on an efficient policy gradient scheme, and a novel and tighter analysis of the cost of exploration in policy space in this setting.