 Reinforcement Learning


Researchers use reinforcement learning to train gliders to soar like birds

#artificialintelligence

The words "fly like an eagle" are famously part of a song, but they may also make some scientists scratch their heads, especially when it comes to soaring birds like eagles, falcons and hawks, which seem to ascend to great heights over hills, canyons and mountain tops with ease. Scientists know that upward currents of warm air assist the birds in their flight, but they don't know how the birds find and navigate these thermal plumes. To figure it out, researchers from the University of California San Diego used reinforcement learning to train gliders to autonomously navigate atmospheric thermals, soaring to heights of 700 meters (nearly 2,300 feet). The novel research results, published in the Sept. 19 issue of Nature, highlight the role of vertical wind accelerations and roll-wise torques as viable biological cues for soaring birds.


AI Show: What are the different types of machine learning?

#artificialintelligence

Scott: Welcome to the AI show. What are those big questions? Usually people think about three different types, like reinforcement learning, unsupervised, or supervised. Susan: Reinforcement learning is learning from a series of actions where you get a series of choices and rewards along the way. So, a classic one that's been in the news is AlphaGo. A large chunk of reinforcement learning techniques are used in there, specifically Monte Carlo tree search techniques. Scott: So, people play the game Go. AlphaGo is a machine playing Go, being very good at it, and beating the world's top Go players. Susan: Another one that's really fun and has popped up the last couple of years is based off of Atari games.


r/MachineLearning - [N] Stable-Baselines v2.0.0 Released

#artificialintelligence

Has anyone tried to use Stable-Baselines? How does it compare to the official Baselines from OpenAI in your experience? Stable Baselines is a set of improved implementations of reinforcement learning algorithms based on OpenAI Baselines. You can read a detailed presentation of Stable Baselines in the Medium article. These algorithms will make it easier for the research community and industry to replicate, refine, and identify new ideas, and will create good baselines to build projects on top of.


Finite Sample Analysis of the GTD Policy Evaluation Algorithms in Markov Setting

arXiv.org Artificial Intelligence

In reinforcement learning (RL), one of the key components is policy evaluation, which aims to estimate the value function (i.e., expected long-term accumulated reward) of a policy. With a good policy evaluation method, RL algorithms can estimate the value function more accurately and find a better policy. When the state space is large or continuous, Gradient-based Temporal Difference (GTD) policy evaluation algorithms with linear function approximation are widely used. Considering that the collection of evaluation data is costly in both time and reward, a clear understanding of the finite sample performance of policy evaluation algorithms is very important to reinforcement learning. Under the assumption that data are generated i.i.d., previous work provided a finite sample analysis of the GTD algorithms with constant step size by converting them into convex-concave saddle point problems. However, it is well known that in RL problems the data are generated from Markov processes rather than i.i.d. In this paper, in the realistic Markov setting, we derive finite sample bounds for general convex-concave saddle point problems, and hence for the GTD algorithms. Our bounds lead to the following observations. (1) With variants of step size, GTD algorithms converge. (2) The convergence rate is determined by the step size, with the mixing time of the Markov process as the coefficient. The faster the Markov process mixes, the faster the convergence. (3) We explain that the experience replay trick is effective by improving the mixing property of the Markov process. To the best of our knowledge, our analysis is the first to provide finite sample bounds for the GTD algorithms in the Markov setting.
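
To make the setting concrete, here is a minimal sketch of GTD2, one member of the GTD family, with linear function approximation and constant step sizes. The update rule follows the standard GTD2 form; the two-state chain, the one-hot features, the step sizes `alpha` and `beta`, and the function names are illustrative assumptions, not taken from the paper.

```python
def dot(u, v):
    """Inner product of two equal-length lists."""
    return sum(a * b for a, b in zip(u, v))


def gtd2(transitions, features, gamma, alpha, beta):
    """Estimate linear value weights from (state, reward, next_state) samples."""
    n = len(features[0])
    theta = [0.0] * n  # value-function weights: V(s) ~ theta . phi(s)
    w = [0.0] * n      # auxiliary weights tracking the expected TD error
    for s, r, s_next in transitions:
        phi, phi_next = features[s], features[s_next]
        delta = r + gamma * dot(theta, phi_next) - dot(theta, phi)  # TD error
        phi_w = dot(w, phi)
        for i in range(n):
            # Both updates use the values of delta and phi_w computed above.
            theta[i] += alpha * (phi[i] - gamma * phi_next[i]) * phi_w
            w[i] += beta * (delta - phi_w) * phi[i]
    return theta


# Hypothetical two-state chain: 0 -> 1 (reward 0), 1 -> 0 (reward 1).
FEATURES = [[1.0, 0.0], [0.0, 1.0]]  # one-hot features, so theta[s] plays the role of V(s)
TRANSITIONS = [(0, 0.0, 1), (1, 1.0, 0)] * 10000  # one long trajectory, not i.i.d. samples

theta = gtd2(TRANSITIONS, FEATURES, gamma=0.5, alpha=0.05, beta=0.1)
print(theta)
```

For this chain with gamma = 0.5, solving the Bellman equations by hand gives V(0) = 2/3 and V(1) = 4/3, which provides a reference point for the printed weights. The samples come from a single trajectory, which is the Markov (non-i.i.d.) sampling regime the abstract refers to, although this toy chain is deterministic.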


Constrained Exploration and Recovery from Experience Shaping

arXiv.org Artificial Intelligence

We consider the problem of reinforcement learning under safety requirements, in which an agent is trained to complete a given task, typically formalized as the maximization of a reward signal over time, while concurrently avoiding undesirable actions or states, which are associated with lower rewards or penalties. The construction and balancing of different reward components can be difficult in the presence of multiple objectives, yet is crucial for producing a satisfying policy. For example, in reaching a target while avoiding obstacles, low collision penalties can lead to reckless movements while high penalties can discourage exploration. To circumvent this limitation, we examine the effect of past actions in terms of safety to estimate which are acceptable or should be avoided in the future. We then actively reshape the action space of the agent during reinforcement learning, so that reward-driven exploration is constrained within safety limits. We propose an algorithm enabling the learning of such safety constraints in parallel with reinforcement learning and demonstrate its effectiveness in terms of both task completion and training time.


Target Transfer Q-Learning and Its Convergence Analysis

arXiv.org Artificial Intelligence

Q-learning is one of the most popular methods in Reinforcement Learning (RL). Transfer Learning aims to use the knowledge learned from source tasks to improve the sample complexity of new tasks. Considering that data collection in RL is costly in both time and resources and that Q-learning converges slowly compared to supervised learning, different kinds of transfer RL algorithms have been designed. However, most of them are heuristic, with no theoretical guarantee on the convergence rate. Therefore, it is important to clearly understand when and how transfer learning will help RL methods, and to provide a theoretical guarantee for the improvement in sample complexity. In this paper, we propose to transfer the Q-function learned in the source task to the target of the Q-learning in the new task when certain safe conditions are satisfied. We call this new transfer Q-learning method target transfer Q-learning. The safe conditions are necessary to avoid harming the new task and thus ensure the convergence of the algorithm. We study the convergence rate of target transfer Q-learning. We prove that if the two tasks are similar with respect to their MDPs, the optimal Q-functions in the source and new RL tasks are similar, which means the error of the transferred target Q-function in the new MDP is small. Also, the convergence rate analysis shows that target transfer Q-learning will converge faster than Q-learning if the error of the transferred target Q-function is smaller than that of the current Q-function in the new task. Based on our theoretical results, we design the safe condition to be that the Bellman error of the transferred target Q-function is less than that of the current Q-function. Our experiments are consistent with our theoretical findings and verify the effectiveness of our proposed target transfer Q-learning method.
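
As a rough illustration of the idea, the sketch below runs tabular Q-learning on a small deterministic MDP and uses a transferred Q-function as the bootstrap target whenever it passes a safe-condition check. The MDP (`MODEL`), the helper names, the use of a known model to compute the Bellman error, and checking the condition once per sweep are all assumptions made for illustration; the paper's actual algorithm and its way of estimating the Bellman error may differ.

```python
# Hypothetical deterministic MDP: (state, action) -> (reward, next_state).
MODEL = {
    (0, 0): (0.0, 0), (0, 1): (0.0, 1),
    (1, 0): (0.0, 0), (1, 1): (0.0, 2),
    (2, 0): (0.0, 1), (2, 1): (1.0, 2),
}
N_STATES, N_ACTIONS, GAMMA = 3, 2, 0.9


def bellman_error(q, model, gamma):
    """Largest one-step Bellman residual of q over all (state, action) pairs."""
    return max(abs(r + gamma * max(q[s2]) - q[s][a])
               for (s, a), (r, s2) in model.items())


def q_learning(model, gamma, lr, sweeps, q_source=None):
    """Tabular Q-learning; bootstraps from q_source while the safe condition holds."""
    q = [[0.0] * N_ACTIONS for _ in range(N_STATES)]
    for _ in range(sweeps):
        # Assumed safe condition: the transferred Q-function must have a
        # smaller Bellman error than the current estimate.
        safe = (q_source is not None and
                bellman_error(q_source, model, gamma) < bellman_error(q, model, gamma))
        target_q = q_source if safe else q
        for (s, a), (r, s2) in model.items():
            target = r + gamma * max(target_q[s2])
            q[s][a] += lr * (target - q[s][a])
    return q


def max_gap(q1, q2):
    """Largest absolute difference between two Q-tables."""
    return max(abs(x - y) for row1, row2 in zip(q1, q2) for x, y in zip(row1, row2))


# Stand-in for a Q-function learned on a similar source task: here simply a
# near-converged solution of the same MDP (lr=1.0 makes each update a full backup).
q_source = q_learning(MODEL, GAMMA, lr=1.0, sweeps=2000)
plain = q_learning(MODEL, GAMMA, lr=0.5, sweeps=20)
transfer = q_learning(MODEL, GAMMA, lr=0.5, sweeps=20, q_source=q_source)
print(max_gap(plain, q_source), max_gap(transfer, q_source))
```

The two printed numbers are the remaining gaps to the reference Q-table after the same number of sweeps, without and with the transferred target. Using the same MDP as its own source task is the most favorable case possible; a source task that differs from the new one would give a less accurate transferred target, and the safe condition would then switch back to ordinary Q-learning targets once the current estimate has the smaller Bellman error.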


Interpretable Multi-Objective Reinforcement Learning through Policy Orchestration

arXiv.org Artificial Intelligence

Autonomous cyber-physical agents and systems play an increasingly large role in our lives. To ensure that agents behave in ways aligned with the values of the societies in which they operate, we must develop techniques that allow these agents to not only maximize their reward in an environment, but also to learn and follow the implicit constraints of society. These constraints and norms can come from any number of sources including regulations, business process guidelines, laws, ethical principles, social norms, and moral values. We detail a novel approach that uses inverse reinforcement learning to learn a set of unspecified constraints from demonstrations of the task, and reinforcement learning to learn to maximize the environment rewards. More precisely, we assume that an agent can observe traces of behavior of members of the society but has no access to the explicit set of constraints that give rise to the observed behavior. Inverse reinforcement learning is used to learn such constraints, which are then combined with a possibly orthogonal value function through the use of a contextual bandit-based orchestrator that makes a contextually appropriate choice between the two policies (constraint-based and environment reward-based) when taking actions. The contextual bandit orchestrator allows the agent to mix policies in novel ways, taking the best actions from either a reward-maximizing or constrained policy. In addition, the orchestrator is transparent about which policy is being employed at each time step. We test our algorithms using a Pac-Man domain and show that the agent is able to learn to act optimally, act within the demonstrated constraints, and mix these two functions in complex ways.
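
A minimal sketch of the orchestration idea follows. It swaps in a simple epsilon-greedy contextual bandit with tabular contexts, since the abstract does not say which bandit algorithm the authors use; the class name `Orchestrator`, the context labels, and the toy reward rule are hypothetical and only stand in for the Pac-Man setting.

```python
import random

CONSTRAINED, REWARD_MAX = 0, 1  # arm indices for the two policies


class Orchestrator:
    """Epsilon-greedy contextual bandit that picks one policy per time step."""

    def __init__(self, n_arms=2, epsilon=0.1, seed=0):
        self.n_arms = n_arms
        self.epsilon = epsilon
        self.rng = random.Random(seed)
        self.counts = {}  # context -> number of pulls per arm
        self.means = {}   # context -> running mean reward per arm

    def _stats(self, context):
        counts = self.counts.setdefault(context, [0] * self.n_arms)
        means = self.means.setdefault(context, [0.0] * self.n_arms)
        return counts, means

    def greedy(self, context):
        """Arm with the highest estimated reward in this context."""
        _, means = self._stats(context)
        return max(range(self.n_arms), key=lambda arm: means[arm])

    def choose(self, context):
        counts, _ = self._stats(context)
        for arm in range(self.n_arms):
            if counts[arm] == 0:  # try every arm once per context first
                return arm
        if self.rng.random() < self.epsilon:
            return self.rng.randrange(self.n_arms)  # explore
        return self.greedy(context)                 # exploit

    def update(self, context, arm, reward):
        counts, means = self._stats(context)
        counts[arm] += 1
        means[arm] += (reward - means[arm]) / counts[arm]  # incremental mean


# Toy feedback rule (an assumption): the constrained policy pays off when a
# ghost is near, the reward-maximizing policy pays off otherwise.
orch = Orchestrator(seed=1)
for step in range(200):
    context = "ghost_near" if step % 2 == 0 else "open"
    arm = orch.choose(context)
    reward = 1.0 if (context == "ghost_near") == (arm == CONSTRAINED) else 0.0
    orch.update(context, arm, reward)

print(orch.greedy("ghost_near"), orch.greedy("open"))
```

Because `choose` returns the index of the selected policy, a caller can log which policy acted at every step, which is one simple way to get the kind of transparency the abstract describes.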



r/MachineLearning - [D] reinforcement learning theoretical results

#artificialintelligence

Since I understand that there are a lot of empirical results but not many theoretical results in reinforcement learning, I wanted to ask you: what kind of theoretical RL question would you like to see answered? Are convergence guarantees and rates of convergence fully covered for tabular algorithms?


AI for Government Programs and the Federal Marketplace - CTOvision.com

#artificialintelligence

ScaleUP USA has developed a free "Artificial Intelligence for Government" program to help government employees and contractors learn foundational skills around Artificial Intelligence and the unique challenges faced in using it in government. The program is focused on beginners in Artificial Intelligence (AI), Machine Learning (ML), and Reinforcement Learning (RL) with minimal technical fluency. No STEM / Computer Science degree required. The program targets government executives trying to understand how to use Artificial Intelligence, government contractors wanting to learn how to integrate AI into their offerings, and startups trying to understand how to work with the government on AI, ML, and RL. ScaleUP USA is also building a video-based marketplace of AI technologies for government where companies can showcase their products, platforms, and services relevant to governments.