 Reinforcement Learning


Adaptive Temporal Difference Learning with Linear Function Approximation

arXiv.org Machine Learning

This paper revisits the celebrated temporal difference (TD) learning algorithm for policy evaluation in reinforcement learning. Typically, the performance of the plain-vanilla TD algorithm is sensitive to the choice of stepsizes, and TD often suffers from slow convergence. Motivated by the tight connection between the TD learning algorithm and stochastic gradient methods, we develop the first adaptive variant of the TD learning algorithm with linear function approximation, which we term AdaTD. In contrast to the original TD, AdaTD is robust, or at least less sensitive, to the choice of stepsizes. Analytically, we establish that to reach an $\epsilon$ accuracy, the number of iterations needed is $\tilde{O}(\epsilon^{-2}\ln^4\frac{1}{\epsilon}/\ln^4\frac{1}{\rho})$, where $\rho$ represents the rate at which the underlying Markov chain converges to the stationary distribution. This implies that the iteration complexity of AdaTD is no worse than that of TD in the worst case. Going beyond TD, we further develop an adaptive variant of TD($\lambda$), which is referred to as AdaTD($\lambda$). We evaluate the empirical performance of AdaTD and AdaTD($\lambda$) on several standard reinforcement learning tasks in OpenAI Gym with both linear and nonlinear function approximation. The results demonstrate the effectiveness of our new approaches over existing ones.
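As a rough illustration of the idea of pairing TD with an adaptive stepsize, the sketch below runs TD(0) with a linear value function and scales each update by an AdaGrad-norm-style stepsize. This is not the AdaTD update from the paper, whose exact form the abstract does not give; the function names, the two-state toy chain, the one-hot features, and the constants `alpha0` and `eps` are all assumptions made for the example.

```python
import math


def td0_adaptive(transitions, features, gamma=0.9, alpha0=0.5, eps=1e-8):
    """TD(0) with V(s) = theta . phi(s) and an AdaGrad-norm-style stepsize.

    Illustrative only: the stepsize rule is an assumption, not the paper's AdaTD.
    """
    dim = len(next(iter(features.values())))
    theta = [0.0] * dim
    grad_sq_sum = 0.0  # running sum of squared semi-gradient norms
    for s, reward, s_next in transitions:
        phi, phi_next = features[s], features[s_next]
        v = sum(t * f for t, f in zip(theta, phi))
        v_next = sum(t * f for t, f in zip(theta, phi_next))
        delta = reward + gamma * v_next - v  # TD error
        grad = [delta * f for f in phi]  # semi-gradient direction
        grad_sq_sum += sum(g * g for g in grad)
        step = alpha0 / math.sqrt(grad_sq_sum + eps)  # shrinks as updates accumulate
        theta = [t + step * g for t, g in zip(theta, grad)]
    return theta


def two_state_cycle(n_steps):
    """Hypothetical toy chain: 0 -> 1 (reward 1), 1 -> 0 (reward 0), repeated."""
    s = 0
    for _ in range(n_steps):
        s_next = 1 - s
        yield s, (1.0 if s == 0 else 0.0), s_next
        s = s_next


# One-hot features make the tabular case a special case of linear approximation.
features = {0: [1.0, 0.0], 1: [0.0, 1.0]}
theta = td0_adaptive(two_state_cycle(20000), features)
```

For this toy chain with `gamma = 0.9`, the Bellman equations give V(0) = 1/(1 - 0.81) and V(1) = 0.9 V(0), so `theta` can be checked against those values. The stepsize is set from the observed TD errors rather than from a hand-tuned schedule, which is the general motivation for adaptive variants.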


From Poincaré Recurrence to Convergence in Imperfect Information Games: Finding Equilibrium via Regularization

arXiv.org Machine Learning

In this paper we investigate the Follow the Regularized Leader dynamics in sequential imperfect information games (IIG). We generalize existing results of Poincaré recurrence from normal-form games to zero-sum two-player imperfect information games and other sequential game settings. We then investigate how adapting the reward (by adding a regularization term) of the game can give strong convergence guarantees in monotone games. We continue by showing how this reward adaptation technique can be leveraged to build algorithms that converge exactly to the Nash equilibrium. Finally, we show how these insights can be directly used to build state-of-the-art model-free algorithms for zero-sum two-player Imperfect Information Games (IIG).


Sim2Real Transfer for Reinforcement Learning without Dynamics Randomization

arXiv.org Artificial Intelligence

In this work we show how to use the Operational Space Control framework (OSC) under joint and Cartesian constraints for reinforcement learning in Cartesian space. Our method is therefore able to learn quickly and with adjustable degrees of freedom, and we are able to transfer policies without additional dynamics randomization on a KUKA LBR iiwa peg-in-hole task. Before learning in simulation starts, we perform a system identification to align the simulation environment as closely as possible with the dynamics of a real robot. Adding constraints to the OSC controller allows us to learn safely on the real robot, or to learn a flexible, goal-conditioned policy that can be easily transferred from simulation to the real robot.


Using AI for Mitigating the Impact of Network Delay in Cloud-based Intelligent Traffic Signal Control

arXiv.org Artificial Intelligence

Recent advancements in cloud services, the Internet of Things (IoT), and cellular networks have made cloud computing an attractive option for intelligent traffic signal control (ITSC). Such a method significantly reduces the cost of cables, installation, number of devices used, and maintenance. ITSC systems based on cloud computing lower system cost and make it possible to scale the system by utilizing existing powerful cloud platforms. While such systems have significant potential, one of the critical problems that should be addressed is network delay. It is well known that network delay in message propagation is hard to prevent, and it could degrade the performance of the system or even create safety issues for vehicles at intersections. In this paper, we introduce a new traffic signal control algorithm based on reinforcement learning, which performs well even under severe network delay. The framework introduced in this paper can be helpful for all agent-based systems using remote computing resources where network delay could be a critical concern. Extensive simulation results obtained for different scenarios show the viability of the designed algorithm to cope with network delay.


Efficient Deep Reinforcement Learning through Policy Transfer

arXiv.org Artificial Intelligence

Transfer Learning (TL) has shown great potential to accelerate Reinforcement Learning (RL) by leveraging prior knowledge from past learned policies of relevant tasks. Existing transfer approaches either explicitly compute the similarity between tasks or select appropriate source policies to provide guided exploration for the target task. However, a way to directly optimize the target policy by alternately utilizing knowledge from appropriate source policies, without explicitly measuring the similarity, is currently missing. In this paper, we propose a novel Policy Transfer Framework (PTF) to accelerate RL by taking advantage of this idea. Our framework learns when and which source policy is the best to reuse for the target policy, and when to terminate it, by modeling multi-policy transfer as an option learning problem. PTF can be easily combined with existing deep RL approaches. Experimental results show it significantly accelerates the learning process and surpasses state-of-the-art policy transfer methods in terms of learning efficiency and final performance in both discrete and continuous action spaces.


Value-driven Hindsight Modelling

arXiv.org Machine Learning

Value estimation is a critical component of the reinforcement learning (RL) paradigm. The question of how to effectively learn predictors for value from data is one of the major problems studied by the RL community, and different approaches exploit structure in the problem domain in different ways. Model learning can make use of the rich transition structure present in sequences of observations, but this approach is usually not sensitive to the reward function. In contrast, model-free methods directly leverage the quantity of interest from the future but have to make do with a potentially weak scalar signal (an estimate of the return). In this paper we develop an approach for representation learning in RL that sits in between these two extremes: we propose to learn what to model in a way that can directly help value prediction. To this end we determine which features of the future trajectory provide useful information to predict the associated return. This provides us with tractable prediction targets that are directly relevant for a task, and can thus accelerate learning of the value function. The idea can be understood as reasoning, in hindsight, about which aspects of the future observations could help past value prediction. We show how this can help dramatically even in simple policy evaluation settings. We then test our approach at scale in challenging domains, including on 57 Atari 2600 games.


How To Make Sure Your Robot Doesn't Drop Your Wine Glass

#artificialintelligence

From microelectronics to mechanics and machine learning, modern-day robots are a marvel of multiple engineering disciplines. They use sensors, image processing and reinforcement learning algorithms to move objects around and to navigate around obstacles. However, this breaks down when it comes to handling objects such as glass. Glass is transparent, and its non-uniform light reflection makes it difficult for the sensors mounted on the robot to work out how to perform even a simple pick-and-place operation. To address this problem, researchers at Google AI, along with Synthesis AI and Columbia University, devised a novel machine-learning algorithm called ClearGrasp that is capable of estimating accurate 3D data of transparent objects from RGB-D images.


Using Rotation, Translation, and Cropping to Boost Generalization in Deep Reinforcement Learning…

#artificialintelligence

"Generalization" is an AI buzzword these days for good reason: most scientists would love to see the models they're training in simulations and video game environments evolve and expand to take on meaningful real-world challenges -- for example in safety, conservation, medicine, etc. One concerned research area is deep reinforcement learning (DRL), which implements deep learning architectures with reinforcement learning algorithms to enable AI agents to learn the best actions possible to attain their goals in virtual environments. DRL has been widely applied in games and robotics. Such DRL agents have an impressive track record on Starcraft II and Dota-2. But because they were trained in fixed environments, studies suggest DRL agents can fail to generalize to even slight variations of their training environments. In a new paper, researchers from the New York University and Modl.ai, a company applying machine learning to game developing, suggest that simple spacial processing methods such as rotation, translation and cropping could help increase model generality.


MoTiAC: Multi-Objective Actor-Critics for Real-Time Bidding

arXiv.org Artificial Intelligence

Online real-time bidding (RTB) is known as a complex auction game where ad platforms seek to consider various influential key performance indicators (KPIs), like revenue and return on investment (ROI). The trade-off among these competing goals needs to be balanced on a massive scale. To address the problem, we propose a multi-objective reinforcement learning algorithm, named MoTiAC, for the problem of bidding optimization with various goals. Specifically, in MoTiAC, instead of using a fixed and linear combination of multiple objectives, we compute adaptive weights over time on the basis of how well the current state agrees with the agent's prior. In addition, we provide interesting properties of model updating and further prove that Pareto optimality can be guaranteed. We demonstrate the effectiveness of our method on a real-world commercial dataset. Experiments show that the model outperforms all state-of-the-art baselines.


Adaptive Estimator Selection for Off-Policy Evaluation

arXiv.org Machine Learning

We develop a generic data-driven method for estimator selection in off-policy policy evaluation settings. We establish a strong performance guarantee for the method, showing that it is competitive with the oracle estimator, up to a constant factor. Via in-depth case studies in contextual bandits and reinforcement learning, we demonstrate the generality and applicability of the method. We also perform comprehensive experiments, demonstrating the empirical efficacy of our approach and comparing with related approaches. In both case studies, our method compares favorably with existing methods.