AITopics | Reinforcement Learning

Collaborating Authors

Reinforcement Learning

"Reinforcement learning is learning what to do – how to map situations to actions – so as to maximize a numerical reward signal. The learner is not told which actions to take, as in most forms of machine learning, but instead must discover which actions yield the most reward by trying them."
– Sutton, Richard S. and Andrew G. Barto. Reinforcement Learning: An Introduction. (1.1). MIT Press, Cambridge, MA, 1998.

News Overviews Instructional Materials AI-Alerts Classics

Reinforcement Learning under Model Mismatch

Roy, Aurko, Xu, Huan, Pokutta, Sebastian

arXiv.org Machine LearningNov-8-2017

Reinforcement learning is concerned with learning a good policy for sequential decision making problems modeled as a Markov Decision Process (MDP), via interacting with the environment [22, 20]. In this work we address the problem of reinforcement learning from a misspecified model. As a motivating example, consider the scenario where the problem of interest is not directly accessible, but instead the agent can interact with a simulator whose dynamics is reasonably close to the true problem. Another plausible application is when the parameters of the model may evolve over time but can still be reasonably approximated by an MDP. To address this problem we use the framework of robust MDPs which was proposed by [2, 17, 13] to solve the planning problem under model misspecification. The robust MDP framework considers a class of models and finds the robust optimal policy which is a policy that performs best under the worst model. It was shown by [2, 17, 13] that the robust optimal policy satisfies the robust Bellman equation which naturally leads to exact dynamic programming algorithms to find an optimal policy. However, this approach is model dependent and does not immediately generalize to the model-free case where the parameters of the model are unknown. Essentially, reinforcement learning is a model-free framework to solve the Bellman equation using samples.

artificial intelligence, machine learning, reinforcement learning, (17 more...)

arXiv.org Machine Learning

1706.04711

Country: North America > United States (0.28)

Genre: Research Report (1.00)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (1.00)

Add feedback

What is reinforcement learning? - Quora

#artificialintelligenceNov-7-2017, 12:10:32 GMT

Thank you for your feedback! Is this answer still relevant and up to date?

artificial intelligence, machine learning, reinforcement learning, (2 more...)

#artificialintelligence

Technology:

Information Technology > Communications > Social Media (0.40)
Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (0.40)

Add feedback

Boltzmann Exploration Done Right

Cesa-Bianchi, Nicolò, Gentile, Claudio, Lugosi, Gábor, Neu, Gergely

arXiv.org Machine LearningNov-7-2017

Boltzmann exploration is a classic strategy for sequential decision-making under uncertainty, and is one of the most standard tools in Reinforcement Learning (RL). Despite its widespread use, there is virtually no theoretical understanding about the limitations or the actual benefits of this exploration scheme. Does it drive exploration in a meaningful way? Is it prone to misidentifying the optimal actions or spending too much time exploring the suboptimal ones? What is the right tuning for the learning rate? In this paper, we address several of these questions in the classic setup of stochastic multi-armed bandits. One of our main results is showing that the Boltzmann exploration strategy with any monotone learning-rate sequence will induce suboptimal behavior. As a remedy, we offer a simple non-monotone schedule that guarantees near-optimal performance, albeit only when given prior access to key problem parameters that are typically not available in practical situations (like the time horizon $T$ and the suboptimality gap $\Delta$). More importantly, we propose a novel variant that uses different learning rates for different arms, and achieves a distribution-dependent regret bound of order $\frac{K\log^2 T}{\Delta}$ and a distribution-independent bound of order $\sqrt{KT}\log K$ without requiring such prior knowledge. To demonstrate the flexibility of our technique, we also propose a variant that guarantees the same performance bounds even if the rewards are heavy-tailed.

big data, exploration, upstream oil & gas, (19 more...)

arXiv.org Machine Learning

1705.10257

Country:

Europe > Spain (0.14)
North America > United States > California (0.14)
Europe > Italy (0.14)

Genre: Research Report (0.82)

Industry: Energy > Oil & Gas > Upstream (0.34)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (0.87)
Information Technology > Data Science > Data Mining > Big Data (0.69)

Add feedback

UCB Exploration via Q-Ensembles

Chen, Richard Y., Sidor, Szymon, Abbeel, Pieter, Schulman, John

arXiv.org Machine LearningNov-7-2017

We show how an ensemble of $Q^*$-functions can be leveraged for more effective exploration in deep reinforcement learning. We build on well established algorithms from the bandit setting, and adapt them to the $Q$-learning setting. We propose an exploration strategy based on upper-confidence bounds (UCB). Our experiments show significant gains on the Atari benchmark.

exploration, health & medicine, upstream oil & gas, (13 more...)

arXiv.org Machine Learning

1706.01502

Country:

North America > United States > California (0.14)
Europe > United Kingdom > England > Cambridgeshire > Cambridge (0.14)

Genre: Research Report (0.50)

Industry:

Health & Medicine > Pharmaceuticals & Biotechnology (0.67)
Leisure & Entertainment > Games (0.48)
Energy > Oil & Gas > Upstream (0.35)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Learning Graphical Models > Directed Networks > Bayesian Learning (0.93)

Add feedback

Machine Learning Scientist (Distributed Systems, Tensorflow) - Cambridge - November-04-2017 (FcARx)

@machinelearnbotNov-5-2017, 17:25:14 GMT

We are currently seeking a hands-on Machine Learning Scientist (Distributed Systems, Tensorflow) for our new research-led startup, focussing on the application of artificial intelligence in the real world; particularly smart city simulations and bots. We're looking for a hardcore Machine Learning Scientist/Engineer who thrives wants to work with the latest technology in multi-agent learning algorithms, Gaussian process and reinforcement learning. As a Machine Learning Scientist/Engineer, you will be a core member of the machine learning team; working closely with the Machine Learning researchers, transforming their algorithmic research into highly innovative products which will be attractive and accessible to the world. Key Skills: Machine Learning Engineer/ML Scientist, Tensorflow, C, C, Java, Python, C#, Distributed Algorithms. Distributed systems, BSc, MSc, MPhil, PhD, Post-Doc, Research, R&D, startup, Multithreading.

machine learning, machine learning scientist, reinforcement learning, (8 more...)

@machinelearnbot

Country: Europe > United Kingdom > England > Cambridgeshire > Cambridge (0.40)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (0.30)

Add feedback

Learning to Mix n-Step Returns: Generalizing lambda-Returns for Deep Reinforcement Learning

Sharma, Sahil, J, Girish Raguvir, Ramesh, Srivatsan, Ravindran, Balaraman

arXiv.org Artificial IntelligenceNov-5-2017

Reinforcement Learning (RL) can model complex behavior policies for goal-directed sequential decision making tasks. A hallmark of RL algorithms is Temporal Difference (TD) learning: value function for the current state is moved towards a bootstrapped target that is estimated using next state's value function. $\lambda$-returns generalize beyond 1-step returns and strike a balance between Monte Carlo and TD learning methods. While lambda-returns have been extensively studied in RL, they haven't been explored a lot in Deep RL. This paper's first contribution is an exhaustive benchmarking of lambda-returns. Although mathematically tractable, the use of exponentially decaying weighting of n-step returns based targets in lambda-returns is a rather ad-hoc design choice. Our second major contribution is that we propose a generalization of lambda-returns called Confidence-based Autodidactic Returns (CAR), wherein the RL agent learns the weighting of the n-step returns in an end-to-end manner. This allows the agent to learn to decide how much it wants to weigh the n-step returns based targets. In contrast, lambda-returns restrict RL agents to use an exponentially decaying weighting scheme. Autodidactic returns can be used for improving any RL algorithm which uses TD learning. We empirically demonstrate that using sophisticated weighted mixtures of multi-step returns (like CAR and lambda-returns) considerably outperforms the use of n-step returns. We perform our experiments on the Asynchronous Advantage Actor Critic (A3C) algorithm in the Atari 2600 domain.

artificial intelligence, machine learning, reinforcement learning, (16 more...)

arXiv.org Artificial Intelligence

1705.07445

Country: Asia (0.68)

Genre: Research Report (0.82)

Industry: Leisure & Entertainment > Games (0.68)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.72)

Add feedback

Double Q($\sigma$) and Q($\sigma, \lambda$): Unifying Reinforcement Learning Control Algorithms

Dumke, Markus

arXiv.org Machine LearningNov-5-2017

Temporal-difference (TD) learning is an important field in reinforcement learning. Sarsa and Q-Learning are among the most used TD algorithms. The Q($\sigma$) algorithm (Sutton and Barto (2017)) unifies both. This paper extends the Q($\sigma$) algorithm to an online multi-step algorithm Q($\sigma, \lambda$) using eligibility traces and introduces Double Q($\sigma$) as the extension of Q($\sigma$) to double learning. Experiments suggest that the new Q($\sigma, \lambda$) algorithm can outperform the classical TD control methods Sarsa($\lambda$), Q($\lambda$) and Q($\sigma$).

artificial intelligence, machine learning, reinforcement learning, (15 more...)

arXiv.org Machine Learning

1711.01569

Country: North America > United States (0.14)

Genre: Research Report (0.84)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (1.00)

Add feedback

A Deep Reinforcement Learning Chatbot

Serban, Iulian V., Sankar, Chinnadhurai, Germain, Mathieu, Zhang, Saizheng, Lin, Zhouhan, Subramanian, Sandeep, Kim, Taesup, Pieper, Michael, Chandar, Sarath, Ke, Nan Rosemary, Rajeshwar, Sai, de Brebisson, Alexandre, Sotelo, Jose M. R., Suhubdy, Dendi, Michalski, Vincent, Nguyen, Alexandre, Pineau, Joelle, Bengio, Yoshua

arXiv.org Machine LearningNov-5-2017

We present MILABOT: a deep reinforcement learning chatbot developed by the Montreal Institute for Learning Algorithms (MILA) for the Amazon Alexa Prize competition. MILABOT is capable of conversing with humans on popular small talk topics through both speech and text. The system consists of an ensemble of natural language generation and retrieval models, including template-based models, bag-of-words models, sequence-to-sequence neural network and latent variable neural network models. By applying reinforcement learning to crowdsourced data and real-world user interactions, the system has been trained to select an appropriate response from the models in its ensemble. The system has been evaluated through A/B testing with real-world users, where it performed significantly better than many competing systems. Due to its machine learning architecture, the system is likely to improve with additional data.

candidate response, machine learning, reinforcement learning, (19 more...)

arXiv.org Machine Learning

1709.02349

Country:

North America > United States (1.00)
North America > Canada > Quebec > Montreal (0.68)

Genre:

Research Report > New Finding (1.00)
Personal > Interview (1.00)

Industry:

Media > Film (1.00)
Leisure & Entertainment (1.00)
Government > Regional Government > North America Government > United States Government (0.92)
(2 more...)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Learning Graphical Models > Undirected Networks > Markov Models (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Regression (0.68)

Add feedback

Deep Reinforcement Learning: From Toys to Enteprise

@machinelearnbotNov-4-2017, 06:01:05 GMT

Reinforcement learning is an increasingly popular machine learning technique that is particularly well suited for addressing problems within dynamic and adaptive environments. When paired with simulations, reinforcement learning is a powerful tool for training AI models that can help increase automation or optimize operational efficiency of sophisticated systems such as robotics, manufacturing, and supply chain logistics. However, moving from the games commonly used to demonstrate these techniques into real-world applications isn't always straightforward. Structuring solutions to move beyond purely data-driven training introduces all sorts of new complexity, requiring you to consider things like how to use simulations to target your learning objectives, what kinds of simulations are applicable, how to deal with long-running simulations, how to incorporate ongoing training refinement once deployed, how to account for scaling and performance, and ultimately how to bridge from simulation to the real world. I was recently able to talk about how to effectively leverage reinforcement learning in real-world use cases at the O'Reilly AI conference in San Francisco.

machine learning, reinforcement, reinforcement learning, (13 more...)

@machinelearnbot

Country: North America > United States > California > San Francisco County > San Francisco (0.25)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (1.00)

Add feedback

Analysis of Agent Expertise in Ms. Pac-Man using Value-of-Information-based Policies

Sledge, Isaac J., Principe, Jose C.

arXiv.org Artificial IntelligenceNov-4-2017

Conventional reinforcement learning methods for Markov decision processes rely on weakly-guided, stochastic searches to drive the learning process. It can therefore be difficult to predict what agent behaviors might emerge. In this paper, we consider an information-theoretic cost function for performing constrained stochastic searches that promote the formation of risk-averse to risk-favoring behaviors. This cost function is the value of information, which provides the optimal trade-off between the expected return of a policy and the policy's complexity; policy complexity is measured by number of bits and controlled by a single hyperparameter on the cost function. As the policy complexity is reduced, the agents will increasingly eschew risky actions. This reduces the potential for high accrued rewards. As the policy complexity increases, the agents will take actions, regardless of the risk, that can raise the long-term rewards. The obtainable reward depends on a single, tunable hyperparameter that regulates the degree of policy complexity. We evaluate the performance of value-of-information-based policies on a stochastic version of Ms. Pac-Man. A major component of this paper is the demonstration that ranges of policy complexity values yield different game-play styles and explaining why this occurs. We also show that our reinforcement-learning search mechanism is more efficient than the others we utilize. This result implies that the value of information theory is appropriate for framing the exploitation-exploration trade-off in reinforcement learning.

artificial intelligence, machine learning, reinforcement learning, (11 more...)

arXiv.org Artificial Intelligence

1702.08628

Country:

Europe (1.00)
North America > United States > New York (0.46)

Genre: Research Report > New Finding (0.92)

Industry: Leisure & Entertainment > Games > Computer Games (1.00)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning > Optimization (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Learning Graphical Models > Undirected Networks > Markov Models (0.34)

Add feedback