 Reinforcement Learning



AI develops its own 'alien' language, the better to mock human underlings - ExtremeTech

#artificialintelligence

Even more amazing, the researchers never explicitly programmed this AI communication. Instead, it "evolved" as a response to a reinforcement learning problem. While the jargon can get a bit technical, the OpenAI blog does a decent job of parsing it. The important thing to grok is that the language was never defined, but rather hit upon as a solution to a general problem of learning to communicate. This type of AI method is called reinforcement learning, and it uses a reward signal to continually guide the agent towards an optimum outcome.


The meta-parameter slot machine

#artificialintelligence

Today we'll step back a bit and consider the psychology of a machine learning researcher when he does his job, a subject which interests me deeply and one that I've already touched on in another post. Some of this comes from my own introspection, as I've been doing machine learning for quite a few years now. It is a well known fact from biology that little achievements trigger the release of small amounts of dopamine - a neurotransmitter that is believed to be involved in reinforcement learning. The dopamine makes us feel good and also triggers plasticity in certain parts of the brain (likely allowing the brain to "remember" what behaviour led to the reward). Reinforcement learning, however, has its issues, since the reward can appear by coincidence and therefore reinforce the "wrong cause". This is very much visible these days with the Internet, emails and texts: since receiving an important and rewarding message reinforces the behaviour which led to it - and that most likely was pressing the "get mail" button - we get addicted to checking email!


Evolution Strategies as a Scalable Alternative to Reinforcement Learning

#artificialintelligence

Our finding continues the modern trend of achieving strong results with decades-old ideas. For example, in 2012, the "AlexNet" paper showed how to design, scale and train convolutional neural networks (CNNs) to achieve extremely strong results on image recognition tasks, at a time when most researchers thought that CNNs were not a promising approach to computer vision. Similarly, in 2013, the Deep Q-Learning paper showed how to combine Q-Learning with CNNs to successfully solve Atari games, reinvigorating RL as a research field with exciting experimental (rather than theoretical) results. Likewise, our work demonstrates that ES achieves strong performance on RL benchmarks, dispelling the common belief that ES methods are impossible to apply to high dimensional problems. ES is easy to implement and scale.
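As a rough illustration of why ES is described as easy to implement, here is a minimal sketch of a basic evolution strategy: perturb the parameters with Gaussian noise, score each perturbation, and move the parameters towards the better-scoring ones. This is not the authors' implementation. The function names, the toy reward, and the hyperparameters (`sigma`, `alpha`, `population`, `iterations`) are all assumptions chosen for the example.

```python
import random


def evolution_strategy(reward_fn, theta, sigma=0.1, alpha=0.001,
                       population=50, iterations=300, seed=0):
    """Maximise reward_fn with a basic Gaussian-perturbation ES (illustrative)."""
    rng = random.Random(seed)
    theta = list(theta)
    n = len(theta)
    for _ in range(iterations):
        # Sample one Gaussian noise vector per population member.
        noises = [[rng.gauss(0.0, 1.0) for _ in range(n)]
                  for _ in range(population)]
        # Evaluate the black-box reward at each perturbed parameter vector.
        rewards = [reward_fn([t + sigma * e for t, e in zip(theta, eps)])
                   for eps in noises]
        # Standardise rewards so the update does not depend on reward scale.
        mean = sum(rewards) / population
        std = (sum((r - mean) ** 2 for r in rewards) / population) ** 0.5
        if std == 0.0:
            continue  # all perturbations scored the same; nothing to learn
        advantages = [(r - mean) / std for r in rewards]
        # Move theta along the reward-weighted average of the noise vectors.
        for i in range(n):
            step = sum(a * eps[i] for a, eps in zip(advantages, noises))
            theta[i] += alpha * step / (population * sigma)
    return theta


# Hypothetical toy problem: the reward peaks when params equal TARGET.
TARGET = [0.5, 0.1, -0.3]


def toy_reward(params):
    return -sum((p - t) ** 2 for p, t in zip(params, TARGET))


solution = evolution_strategy(toy_reward, [0.0, 0.0, 0.0])
print([round(x, 2) for x in solution])
```

Each reward evaluation in the inner loop is independent of the others, so in a distributed setting the evaluations could be farmed out to separate workers. No gradients of `reward_fn` are computed anywhere in the sketch.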


2017: The Year of Neuroevolution

#artificialintelligence

This month OpenAI published a paper, "Evolution Strategies as a Scalable Alternative to Reinforcement Learning" by Tim Salimans, Jonathan Ho, Xi Chen and Ilya Sutskever, which shows that Evolution Strategies (ES) can be a strong alternative to Reinforcement Learning (RL) and have a number of advantages: ease of implementation, invariance to the length of the episode and settings with sparse rewards, better exploration behaviour than policy gradient methods, and ease of scaling in a distributed setting. Running on a computing cluster of 80 machines and 1,440 CPU cores, the authors' implementation was able to train a 3D MuJoCo humanoid walker in only 10 minutes (A3C on 32 cores takes about 10 hours). Using 720 cores they can also obtain comparable performance to A3C on Atari while cutting down the training time from 1 day to 1 hour. The communication overhead of implementing ES in a distributed setting is lower than for reinforcement learning methods such as policy gradients and Q-learning. By not requiring backpropagation, black box optimizers (those that make no assumptions about the structure of the function being optimized) reduce the amount of computation per episode by about two thirds, and memory by potentially much more.


Deep Learning of Robotic Tasks without a Simulator using Strong and Weak Human Supervision

arXiv.org Artificial Intelligence

We propose a scheme for training a computerized agent to perform complex human tasks such as highway steering. The scheme is designed to follow a natural learning process whereby a human instructor teaches a computerized trainee. The learning process consists of five elements: (i) unsupervised feature learning; (ii) supervised imitation learning; (iii) supervised reward induction; (iv) supervised safety module construction; and (v) reinforcement learning. We implemented the last four elements of the scheme using deep convolutional networks and applied it to successfully create a computerized agent capable of autonomous highway steering over the well-known racing game Assetto Corsa. We demonstrate that the use of the last four elements is essential to effectively carry out the steering task using vision alone, without access to a driving simulator's internals, and operating in wall-clock time. This is also made possible through the introduction of a safety network, a novel way of preventing the agent from making catastrophic mistakes during the reinforcement learning stage.


Inverse Reinforcement Learning in Swarm Systems

arXiv.org Artificial Intelligence

Inverse reinforcement learning (IRL) has become a useful tool for learning behavioral models from demonstration data. However, IRL remains mostly unexplored for multi-agent systems. In this paper, we show how the principle of IRL can be extended to homogeneous large-scale problems, inspired by the collective swarming behavior of natural systems. In particular, we make the following contributions to the field: 1) We introduce the swarMDP framework, a sub-class of decentralized partially observable Markov decision processes endowed with a swarm characterization. 2) Exploiting the inherent homogeneity of this framework, we reduce the resulting multi-agent IRL problem to a single-agent one by proving that the agent-specific value functions in this model coincide. 3) To solve the corresponding control problem, we propose a novel heterogeneous learning scheme that is particularly tailored to the swarm setting. Results on two example systems demonstrate that our framework is able to produce meaningful local reward models from which we can replicate the observed global system dynamics.


Pit.ai puts a financial twist on reinforcement learning to outperform hedge funds

#artificialintelligence

Despite mystery and intrigue, the reality is that most hedge funds don't make money. This hasn't stopped a growing list of startups from trying their hands at employing machine learning to tip the scales in their favor. But Pit.ai, a new machine learning-powered hedge fund, adopted into the YC W17 class, thinks it can best Numerai, Quantopian and others with its own unique recipe for automating money making. Hedge funds employ aggressive trading strategies to "seek alpha," which is industry jargon for above market returns. These are not your standard trading shops, and over the last decade firms have gone to great lengths to seize data for information arbitrage.


Unsupervised Basis Function Adaptation for Reinforcement Learning

arXiv.org Machine Learning

When using reinforcement learning (RL) algorithms to evaluate a policy it is common, given a large state space, to introduce some form of approximation architecture for the value function (VF). The exact form of this architecture can have a significant effect on the accuracy of the VF estimate, however, and determining a suitable approximation architecture can often be a highly complex task. Consequently there is a large amount of interest in the potential for allowing RL algorithms to adaptively generate (i.e. to learn) approximation architectures. We investigate a method of adapting approximation architectures which uses feedback regarding the frequency with which an agent has visited certain states to guide which areas of the state space to approximate with greater detail. We introduce an algorithm based upon this idea which adapts a state aggregation approximation architecture on-line. Assuming $S$ states, we demonstrate theoretically that - provided the following relatively non-restrictive assumptions are satisfied: (a) the number of cells $X$ in the state aggregation architecture is of order $\sqrt{S}\ln{S}\log_2{S}$ or greater, (b) the policy and transition function are close to deterministic, and (c) the prior for the transition function is uniformly distributed - our algorithm can guarantee, assuming we use an appropriate scoring function to measure VF error, error which is arbitrarily close to zero as $S$ becomes large. It is able to do this despite having only $O(X\log_2{S})$ space complexity (and negligible time complexity). We conclude by generating a set of empirical results which support the theoretical results.
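To give a feel for condition (a) in the abstract above, the short sketch below evaluates the expression $\sqrt{S}\ln{S}\log_2{S}$ for a few state-space sizes and compares it with $S$ itself. It is only arithmetic on the stated formula: the abstract says "of order", so any constant factor is unknown and is ignored here, and the function names and the chosen values of $S$ are assumptions made for illustration.

```python
import math


def min_cells(num_states: int) -> float:
    """Evaluate sqrt(S) * ln(S) * log2(S), ignoring any constant factor."""
    return (math.sqrt(num_states)
            * math.log(num_states)
            * math.log2(num_states))


def fraction_of_states(num_states: int) -> float:
    """Ratio of the cell-count expression to the number of states S."""
    return min_cells(num_states) / num_states


# Illustrative state-space sizes (not taken from the paper).
sizes = [10**6, 10**9, 10**12]
fractions = [fraction_of_states(s) for s in sizes]
for s, frac in zip(sizes, fractions):
    print(f"S = {s:.0e}: X/S is roughly {frac:.4f}")
```

Under these assumptions the ratio shrinks as $S$ grows, which is consistent with the abstract's claim that the required number of cells is small relative to the state space when $S$ is large.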


Deep Exploration via Randomized Value Functions

arXiv.org Machine Learning

We study the use of randomized value functions to guide deep exploration in reinforcement learning. This offers an elegant means for synthesizing statistically and computationally efficient exploration with common practical approaches to value function learning. We present several reinforcement learning algorithms that leverage randomized value functions and demonstrate their efficacy through computational studies. We also prove a regret bound that establishes statistical efficiency with a tabular representation.