 reinforcement-learning algorithm


Gradient Descent for General Reinforcement Learning

Neural Information Processing Systems

A simple learning rule is derived, the VAPS algorithm, which can be instantiated to generate a wide range of new reinforcement-learning algorithms. These algorithms solve a number of open problems, define several new approaches to reinforcement learning, and unify different approaches to reinforcement learning under a single theory. These algorithms all have guaranteed convergence, and include modifications of several existing algorithms that were known to fail to converge on simple MDPs. These include Q-learning, SARSA, and advantage learning. Simulation results are given, and several areas for future research are discussed.
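
For readers unfamiliar with two of the algorithms named above, the following is a minimal sketch of the standard tabular Q-learning and SARSA updates. It is not the VAPS rule, which the abstract does not spell out; the function names, the dictionary-based table, and the step size `alpha` and discount `gamma` values are illustrative assumptions.

```python
from collections import defaultdict


def q_learning_update(Q, s, a, r, s_next, actions, alpha=0.1, gamma=0.9):
    """Off-policy update: bootstrap from the best action available in s_next."""
    target = r + gamma * max(Q[(s_next, b)] for b in actions)
    Q[(s, a)] += alpha * (target - Q[(s, a)])
    return Q[(s, a)]


def sarsa_update(Q, s, a, r, s_next, a_next, alpha=0.1, gamma=0.9):
    """On-policy update: bootstrap from the action actually taken in s_next."""
    target = r + gamma * Q[(s_next, a_next)]
    Q[(s, a)] += alpha * (target - Q[(s, a)])
    return Q[(s, a)]


def make_table():
    """Hypothetical two-action table with known values in state 's1'."""
    Q = defaultdict(float)
    Q[("s1", "left")] = 1.0
    Q[("s1", "right")] = 2.0
    return Q


actions = ["left", "right"]

# Same transition (s0, right, reward 0, s1), two different bootstrap targets.
q_val = q_learning_update(make_table(), "s0", "right", 0.0, "s1", actions)
sarsa_val = sarsa_update(make_table(), "s0", "right", 0.0, "s1", "left")
print(q_val, sarsa_val)
```

The only difference between the two updates is the bootstrap target: Q-learning uses the maximum over next actions, while SARSA uses the next action the agent actually selected.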


Interval Estimation for Reinforcement-Learning Algorithms in Continuous-State Domains

Neural Information Processing Systems

The reinforcement learning community has explored many approaches to obtaining value estimates and models to guide decision making; these approaches, however, do not usually provide a measure of confidence in the estimate. Accurate estimates of an agent's confidence are useful for many applications, such as biasing exploration and automatically adjusting parameters to reduce dependence on parameter-tuning. Computing confidence intervals on reinforcement learning value estimates, however, is challenging because data generated by the agent-environment interaction rarely satisfies traditional assumptions. Samples of value-estimates are dependent, likely non-normally distributed and often limited, particularly in early learning when confidence estimates are pivotal. In this work, we investigate how to compute robust confidences for value estimates in continuous Markov decision processes.
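
As a rough illustration of what a confidence interval that avoids a normality assumption can look like, here is a minimal sketch of a percentile bootstrap over sampled returns. This is a generic technique offered as an assumption, not necessarily the method the authors investigate; the function name `bootstrap_ci` and the sample data are hypothetical. Note that it resamples as if the samples were independent, which the abstract points out is often not true of agent-environment data.

```python
import random


def bootstrap_ci(samples, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile-bootstrap confidence interval for the mean of `samples`.

    Assumes i.i.d. samples, an assumption that value-estimate samples
    often violate; treat the result as a rough guide only.
    """
    rng = random.Random(seed)
    n = len(samples)
    means = []
    for _ in range(n_resamples):
        # Resample with replacement and record the resampled mean.
        resample = [samples[rng.randrange(n)] for _ in range(n)]
        means.append(sum(resample) / n)
    means.sort()
    lo_idx = int((alpha / 2) * n_resamples)
    hi_idx = int((1 - alpha / 2) * n_resamples) - 1
    return means[lo_idx], means[hi_idx]


# Hypothetical sampled returns from a single state.
returns = [1.2, 0.8, 1.5, 0.3, 2.1, 0.9, 1.1, 0.7, 1.8, 1.0]
low, high = bootstrap_ci(returns)
print(f"95% interval for the value estimate: [{low:.3f}, {high:.3f}]")
```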


Computably Continuous Reinforcement-Learning Objectives are PAC-learnable

arXiv.org Artificial Intelligence

In reinforcement learning, the classic objectives of maximizing discounted and finite-horizon cumulative rewards are PAC-learnable: There are algorithms that learn a near-optimal policy with high probability using a finite amount of samples and computation. In recent years, researchers have introduced objectives and corresponding reinforcement-learning algorithms beyond the classic cumulative rewards, such as objectives specified as linear temporal logic formulas. However, questions about the PAC-learnability of these new objectives have remained open. This work demonstrates the PAC-learnability of general reinforcement-learning objectives through sufficient conditions for PAC-learnability in two analysis settings. In particular, for the analysis that considers only sample complexity, we prove that if an objective given as an oracle is uniformly continuous, then it is PAC-learnable. Further, for the analysis that considers computational complexity, we prove that if an objective is computable, then it is PAC-learnable. In other words, if a procedure computes successive approximations of the objective's value, then the objective is PAC-learnable. We give three applications of our condition on objectives from the literature with previously unknown PAC-learnability and prove that these objectives are PAC-learnable. Overall, our result helps verify existing objectives' PAC-learnability. Also, as some studied objectives that are not uniformly continuous have been shown to be not PAC-learnable, our results could guide the design of new PAC-learnable objectives.
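
One way to build intuition for the "successive approximations" idea is the classic discounted objective: if rewards are bounded by `r_max` in absolute value, truncating the return after H steps changes it by at most gamma^H * r_max / (1 - gamma). The sketch below computes the smallest such H for a target error. It illustrates this standard bound only and is not the paper's construction; the function names and parameter values are assumptions.

```python
import math


def tail_bound(horizon, gamma, r_max=1.0):
    """Upper bound on the discounted reward mass after `horizon` steps."""
    return gamma ** horizon * r_max / (1 - gamma)


def truncation_horizon(epsilon, gamma, r_max=1.0):
    """Smallest H such that tail_bound(H, gamma, r_max) <= epsilon."""
    if not (0 < gamma < 1) or epsilon <= 0 or r_max <= 0:
        raise ValueError("need 0 < gamma < 1, epsilon > 0, r_max > 0")
    ratio = epsilon * (1 - gamma) / r_max
    if ratio >= 1:
        return 0
    horizon = math.ceil(math.log(ratio) / math.log(gamma))
    # Guard against floating-point rounding in the logarithms.
    while tail_bound(horizon, gamma, r_max) > epsilon:
        horizon += 1
    return horizon


def truncated_return(rewards, gamma, horizon):
    """Discounted return computed from the first `horizon` rewards only."""
    return sum(gamma ** t * r for t, r in enumerate(rewards[:horizon]))


H = truncation_horizon(epsilon=0.1, gamma=0.9)
print(H, tail_bound(H, 0.9))
```

With `gamma = 0.9` and unit-bounded rewards, a finite prefix of each trajectory is enough to estimate the return to within 0.1, which is the sense in which the objective's value can be approximated from finite data.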


[FREE] Modern Reinforcement-learning Using Deep Learning

#artificialintelligence

Udemy is the biggest website in the world offering courses in many categories, including languages, design, marketing, and many others, so whenever you want to buy a course and pay for a new skill, Udemy is the best place to look. You can find paid courses, 100% free courses, and coupons across more than 12 categories, which makes it likely you will find the domain and the skill you are looking for. Our job is to search for 100%-off courses and free coupons. In my Deep reinforcement-learning course you will learn the newest state-of-the-art Deep reinforcement-learning knowledge. A generalization of an MDP in which the agent cannot observe the state.


DeepMind's AI can control superheated plasma inside a fusion reactor

#artificialintelligence

Controlling nuclear fusion on Earth is hard, however. The problem is that atomic nuclei repel each other. Smashing them together inside a reactor can only be done at extremely high temperatures, often reaching hundreds of millions of degrees--hotter than the center of the sun. At these temperatures, matter is neither solid, liquid, nor gas. It enters a fourth state, known as plasma: a roiling, superheated soup of particles.


DeepMind's AI can control superheated plasma inside a fusion reactor

#artificialintelligence

In nuclear fusion, the atomic nuclei of hydrogen atoms get forced together to form heavier atoms, like helium. This produces a lot of energy relative to a tiny amount of fuel, making it a very efficient source of power. It is far cleaner and safer than fossil fuels or conventional nuclear power, which is created by fission--forcing nuclei apart. It is also the process that powers stars. Controlling nuclear fusion on Earth is hard, however.


Reinforcement Learning for General LTL Objectives Is Intractable

arXiv.org Artificial Intelligence

In recent years, researchers have made significant progress in devising reinforcement-learning algorithms for optimizing linear temporal logic (LTL) objectives and LTL-like objectives. Despite these advancements, there are fundamental limitations to how well this problem can be solved that previous studies have alluded to but, to our knowledge, have not examined in depth. In this paper, we address theoretically the hardness of learning with general LTL objectives. We formalize the problem under the probably approximately correct learning in Markov decision processes (PAC-MDP) framework, a standard framework for measuring sample complexity in reinforcement learning. In this formalization, we prove that the optimal policy for any LTL formula is PAC-MDP-learnable only if the formula is in the most limited class in the LTL hierarchy, consisting of only finite-horizon-decidable properties. Practically, our result implies that it is impossible for a reinforcement-learning algorithm to obtain a PAC-MDP guarantee on the performance of its learned policy after finitely many interactions with an unconstrained environment for non-finite-horizon-decidable LTL objectives.


Tiny alterations in training data can introduce "backdoors" into machine learning models

#artificialintelligence

In TrojDRL: Trojan Attacks on Deep Reinforcement Learning Agents, a group of Boston University researchers demonstrates an attack on machine learning systems trained with "reinforcement learning," in which ML systems derive solutions to complex problems by iteratively trying multiple solutions. The attack is related to adversarial examples, a class of attacks that involve probing a machine-learning model to find "blind spots" -- very small changes (usually imperceptible to humans) that cause machine learning classifiers' accuracy to fall off rapidly (for example, a small change to a model of a gun can make an otherwise reliable classifier think it's looking at a helicopter). It's not clear whether it's possible to create a machine learning model that's immune to adversarial examples (the expert I trust most on this told me off the record that they think it's not), but what the researchers behind TrojDRL propose is a method for deliberately introducing adversarial examples by slipping difficult-to-spot changes into training data, which will produce defects in the eventual model that can serve as a "backdoor" that future adversaries can exploit. Training data sets are often ad-hoc in nature; they're so large that it's hard to create version-by-version snapshots, and they're also so prone to mislabeling that researchers are always making changes to them in order to improve their accuracy. All of this suggests that poisoning training data might be easier than it sounds.


Tainted Data Can Teach Algorithms the Wrong Lessons

#artificialintelligence

An important leap for artificial intelligence in recent years is machines' ability to teach themselves, through endless practice, to solve problems, from mastering ancient board games to navigating busy roads. But a few subtle tweaks in the training regime can poison this "reinforcement learning," so that the resulting algorithm responds--like a sleeper agent--to a specified trigger by misbehaving in strange or harmful ways. "In essence, this type of back door gives the attacker some ability to directly control" the algorithm, says Wenchao Li, an assistant professor at Boston University who devised the attack with colleagues. Their recent paper is the latest in a growing body of evidence suggesting that AI programs can be sabotaged by the data used to train them. As companies, governments, and militaries rush to deploy AI, the potential for mischief could be serious.