Reinforcement Learning


Provably Efficient Exploration for RL with Unsupervised Learning

arXiv.org Artificial Intelligence

We study how to use unsupervised learning for efficient exploration in reinforcement learning with rich observations generated from a small number of latent states. We present a novel algorithmic framework that is built upon two components: an unsupervised learning algorithm and a no-regret reinforcement learning algorithm. We show that our algorithm provably finds a near-optimal policy with sample complexity polynomial in the number of latent states, which is significantly smaller than the number of possible observations. Our result gives theoretical justification to the prevailing paradigm of using unsupervised learning for efficient exploration [tang2017exploration,bellemare2016unifying].


Building an AI-powered Battlesnake with reinforcement learning on Amazon SageMaker | Amazon Web Services

#artificialintelligence

Battlesnake is an AI competition based on the traditional snake game in which multiple AI-powered snakes compete to be the last snake surviving. Battlesnake attracts a community of developers at all levels. Hundreds of snakes compete and rise up in the ranks in the online Battlesnake global arena. Battlesnake also hosts several offline events that are attended by more than a thousand developers and non-developers alike and are streamed on Twitch. Teams of developers build snakes for the competition and learn new tech skills, learn to collaborate, and have fun. Teams can build snakes by using a variety of strategies ranging from state-of-the-art deep reinforcement learning (RL) algorithms to unique heuristics-based strategies. This post shows how to use Amazon SageMaker to build an RL-based snake.


Multi-Agent Actor-Critic for Mixed Cooperative-Competitive Environments

arXiv.org Artificial Intelligence

We explore deep reinforcement learning methods for multi-agent domains. We begin by analyzing the difficulty of traditional algorithms in the multi-agent case: Q-learning is challenged by an inherent non-stationarity of the environment, while policy gradient suffers from a variance that increases as the number of agents grows. We then present an adaptation of actor-critic methods that considers action policies of other agents and is able to successfully learn policies that require complex multi-agent coordination. Additionally, we introduce a training regimen utilizing an ensemble of policies for each agent that leads to more robust multi-agent policies. We show the strength of our approach compared to existing methods in cooperative as well as competitive scenarios, where agent populations are able to discover various physical and informational coordination strategies.


Creating a Tendon-Driven Robot That Teaches Itself to Walk with Reinforcement Learning

#artificialintelligence

The robotic limb has an architecture resembling the muscle and tendon structure that powers human and vertebrate movement [1,2]. Tendons connect muscles to bones, making it possible for the biological motors (muscles) to exert force on bones from a distance [3,4]. While tendons have mechanical and structural advantages, a tendon-driven robot is significantly more challenging to control than a traditional robot, where a simple PID controller that controls joint angles directly is often sufficient. In a tendon-driven robotic limb, multiple motors may act on a single joint, and a given motor may act on multiple joints. As a result, the system is simultaneously nonlinear, over-determined, and under-determined, which greatly increases the control design complexity and calls for a new control design approach.


Sparse Graphical Memory for Robust Planning

arXiv.org Artificial Intelligence

To operate effectively in the real world, artificial agents must act from raw sensory input such as images and achieve diverse goals across long time horizons. On the one hand, recent strides in deep reinforcement and imitation learning have demonstrated an impressive ability to learn goal-conditioned policies from high-dimensional image input, though only for short-horizon tasks. On the other hand, classical graphical methods like A* search are able to solve long-horizon tasks, but assume that the graph structure is abstracted away from raw sensory input and can only be constructed with task-specific priors. We wish to combine the strengths of deep learning and classical planning to solve long-horizon tasks from raw sensory input. To this end, we introduce Sparse Graphical Memory (SGM), a new data structure that stores observations and feasible transitions in a sparse memory. SGM can be combined with goal-conditioned RL or imitative agents to solve long-horizon tasks across a diverse set of domains. We show that SGM significantly outperforms current state-of-the-art methods on long-horizon, sparse-reward visual navigation tasks. Project video and code are available at https://mishalaskin.github.io/sgm/
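As a rough illustration of storing observations and feasible transitions in a sparse memory, the sketch below keeps a new observation only when no stored node lies within a merge radius, and links nodes that lie within an edge radius. The Euclidean distance test, the class name `SparseMemory`, and both radii are assumptions made for illustration and are not the criterion used by SGM. Planning here uses breadth-first search in place of A*.

```python
from collections import deque
import math


class SparseMemory:
    """Hypothetical sparse graph over low-dimensional observations."""

    def __init__(self, merge_radius, edge_radius):
        self.merge_radius = merge_radius  # assumed threshold for treating two observations as one node
        self.edge_radius = edge_radius    # assumed threshold for a feasible transition
        self.nodes = []                   # stored observations
        self.edges = {}                   # node index -> set of neighbour indices

    def add(self, obs):
        """Return the index of the node representing obs, adding a node if needed."""
        for i, node in enumerate(self.nodes):
            if math.dist(node, obs) <= self.merge_radius:
                return i  # obs is redundant with an existing node
        idx = len(self.nodes)
        self.nodes.append(obs)
        self.edges[idx] = set()
        for j in range(idx):
            if math.dist(self.nodes[j], obs) <= self.edge_radius:
                self.edges[idx].add(j)
                self.edges[j].add(idx)
        return idx

    def shortest_path(self, start, goal):
        """Breadth-first search over stored transitions; returns node indices or None."""
        parents = {start: None}
        queue = deque([start])
        while queue:
            cur = queue.popleft()
            if cur == goal:
                path = []
                while cur is not None:
                    path.append(cur)
                    cur = parents[cur]
                return path[::-1]
            for nxt in sorted(self.edges[cur]):
                if nxt not in parents:
                    parents[nxt] = cur
                    queue.append(nxt)
        return None


memory = SparseMemory(merge_radius=0.5, edge_radius=1.5)
ids = [memory.add(p) for p in [(0.0, 0.0), (0.1, 0.1), (1.0, 0.0), (2.0, 0.0)]]
path = memory.shortest_path(ids[0], ids[3])
```

In this toy run the second observation is merged into the first node, so four observations yield three nodes, and the returned path gives the sequence of waypoints that a goal-conditioned policy could follow one hop at a time.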


Application of Deep Q-Network in Portfolio Management

arXiv.org Machine Learning

Machine learning algorithms and neural networks are widely applied in many different areas, such as stock market prediction, face recognition and population analysis. This paper introduces a strategy for portfolio management in the stock market based on the classic deep reinforcement learning algorithm, Deep Q-Network (DQN). DQN is a type of deep neural network that is optimized by Q-learning. To make the DQN adapt to the financial market, we first discretize the action space, which is defined as the weight of the portfolio in different assets, so that portfolio management becomes a problem that DQN can solve. Next, we combine a convolutional neural network and a dueling Q-net to enhance the recognition ability of the algorithm. Experimentally, we chose five weakly correlated American stocks to test the model. The result demonstrates that the DQN-based strategy outperforms ten other traditional strategies. The profit of the DQN algorithm is 30% more than the profit of the other strategies. Moreover, the Sharpe ratio associated with Max Drawdown demonstrates that the risk of the policy made with DQN is the lowest.
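One way to picture the discretization step is to enumerate every weight vector on a fixed grid whose entries sum to one, so that each vector becomes one discrete action. The sketch below does this under stated assumptions: the function name `discretize_weights`, the grid resolution, and the long-only constraint are illustrative choices, not details taken from the paper.

```python
from itertools import product


def discretize_weights(n_assets, n_steps):
    """Enumerate long-only weight vectors on a grid of step 1/n_steps that sum to 1.

    Each returned tuple is one discrete action (assumed encoding).
    """
    actions = []
    for combo in product(range(n_steps + 1), repeat=n_assets):
        if sum(combo) == n_steps:  # integer units must add up to the full budget
            actions.append(tuple(c / n_steps for c in combo))
    return actions


# Three assets with weights in multiples of 0.25.
actions = discretize_weights(n_assets=3, n_steps=4)
```

The number of actions grows combinatorially with the number of assets and the grid resolution, which is one reason a coarse grid is a natural choice for a Q-network with one output per action.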


Interference and Generalization in Temporal Difference Learning

arXiv.org Machine Learning

We study the link between generalization and interference in temporal-difference (TD) learning. Interference is defined as the inner product of two different gradients, representing their alignment. This quantity emerges as being of interest from a variety of observations about neural networks, parameter sharing and the dynamics of learning. We find that TD easily leads to low-interference, under-generalizing parameters, while the effect seems reversed in supervised learning. We hypothesize that the cause can be traced back to the interplay between the dynamics of interference and bootstrapping. This is supported empirically by several observations: the negative relationship between the generalization gap and interference in TD, the negative effect of bootstrapping on interference and the local coherence of targets, and the contrast between the propagation rate of information in TD(0) versus TD($\lambda$) and regression tasks such as Monte-Carlo policy evaluation. We hope that these new findings can guide the future discovery of better bootstrapping methods.
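The definition of interference as a gradient inner product can be made concrete with a small example. The sketch below assumes a linear model with a squared-error loss and computes per-sample gradients by hand; the model, the loss, and the helper names are illustrative assumptions, not the setup used in the paper.

```python
def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def grad_squared_error(w, x, target):
    """Gradient of 0.5 * (w . x - target)^2 with respect to w."""
    err = dot(w, x) - target
    return [err * xi for xi in x]


def interference(w, sample_a, sample_b):
    """Inner product of the per-sample gradients for two (features, target) pairs."""
    g_a = grad_squared_error(w, *sample_a)
    g_b = grad_squared_error(w, *sample_b)
    return dot(g_a, g_b)


w = [0.5, -1.0]
a = ([1.0, 0.0], 1.0)
b = ([0.0, 1.0], 0.0)  # features orthogonal to those of a
c = ([1.0, 1.0], 2.0)  # features overlapping with those of a

rho_ab = interference(w, a, b)
rho_ac = interference(w, a, c)
```

A positive value means that a gradient step on one sample also reduces the loss on the other, a negative value means the two updates conflict, and zero means the samples do not affect each other at these parameters.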


Taylor Expansion Policy Optimization

arXiv.org Machine Learning

In this work, we investigate the application of Taylor expansions in reinforcement learning. In particular, we propose Taylor expansion policy optimization, a policy optimization formalism that generalizes prior work (e.g., TRPO) as a first-order special case. We also show that Taylor expansions intimately relate to off-policy evaluation. Finally, we show that this new formulation entails modifications which improve the performance of several state-of-the-art distributed algorithms.


Multiplicative Controller Fusion: A Hybrid Navigation Strategy For Deployment in Unknown Environments

arXiv.org Artificial Intelligence

Learning-based approaches often outperform hand-coded algorithmic solutions for many problems in robotics. However, learning long-horizon tasks on real robot hardware can be intractable, and transferring a learned policy from simulation to reality is still extremely challenging. We present a novel approach to model-free reinforcement learning that can leverage existing sub-optimal solutions as an algorithmic prior during training and deployment. During training, our gated fusion approach enables the prior to guide the initial stages of exploration, increasing sample-efficiency and enabling learning from sparse long-horizon reward signals. Importantly, the policy can learn to improve beyond the performance of the sub-optimal prior since the prior's influence is annealed gradually. During deployment, the policy's uncertainty provides a reliable strategy for transferring a simulation-trained policy to the real world by falling back to the prior controller in uncertain states. We show the efficacy of our Multiplicative Controller Fusion approach on the task of robot navigation and demonstrate safe transfer from simulation to the real world without any fine tuning. The code for this project is made publicly available at https://sites.google.com/view/mcf-nav/home.
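To illustrate how a multiplicative fusion can fall back to the prior when the policy is uncertain, the sketch below multiplies two one-dimensional Gaussian action distributions and renormalizes the result. Treating both the policy and the prior as Gaussians, and the function name `fuse_gaussians`, are assumptions made for this example; the abstract does not specify the distributions used.

```python
def fuse_gaussians(mu_policy, var_policy, mu_prior, var_prior):
    """Mean and variance of the normalized product of two 1-D Gaussians.

    The fused mean is a precision-weighted average, so whichever
    distribution has the lower variance dominates the result.
    """
    precision = 1.0 / var_policy + 1.0 / var_prior
    var = 1.0 / precision
    mu = var * (mu_policy / var_policy + mu_prior / var_prior)
    return mu, var


# Equally confident policy and prior: the fused action lies halfway between them.
mu_equal, var_equal = fuse_gaussians(0.0, 1.0, 1.0, 1.0)

# Very uncertain policy: the fused action stays close to the prior's action.
mu_uncertain, var_uncertain = fuse_gaussians(0.0, 100.0, 1.0, 1.0)
```

Under this assumption no explicit switching rule is needed: the relative uncertainty of the two distributions decides how much each one contributes to the executed action.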


Regret Bound of Adaptive Control in Linear Quadratic Gaussian (LQG) Systems

arXiv.org Machine Learning

One of the core challenges in the fields of control theory and reinforcement learning is adaptive control. It is the problem of controlling dynamical systems when the dynamics of the systems are unknown to the decision-making agents. In adaptive control, agents interact with given systems in order to explore and control them, while the long-term objective is to minimize the overall average associated costs. The agent has to balance exploration and exploitation, learn the dynamics, strategize for further exploration, and exploit the estimation to minimize the overall costs. The sequential nature of agent-system interaction results in challenges in system identification, estimation, and control under uncertainty, and these challenges are magnified when the systems are partially observable, i.e., contain hidden underlying dynamics. In linear systems, when the underlying dynamics are fully observable, the asymptotic optimality of estimation methods has been a topic of study in the last decades [Lai et al., 1982, Lai and Wei, 1987]. Recently, novel techniques and learning algorithms have been developed to study the finite-time behavior of adaptive control algorithms and shed light on the design of optimal methods [Peña et al., 2009, Fiechter, 1997, Abbasi-Yadkori and Szepesvári, 2011]. In particular, Abbasi-Yadkori and Szepesvári [2011] propose to use the principle of optimism in the face of uncertainty (OFU) to balance exploration and exploitation in LQR, where the state of the system is observable.