AITopics

1911.07246

Country: North America > United States > California (0.14)

Genre: Research Report (0.41)

Industry:

Leisure & Entertainment > Games > Computer Games (0.93)
Retail (0.84)

Technology:

Information Technology > Artificial Intelligence > Vision (1.00)
Information Technology > Artificial Intelligence > Robots (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (0.53)
(2 more...)

Islam, Riashat, Teru, Komal K., Sharma, Deepak

Off-Policy Policy Gradient Algorithms by Constraining the State Distribution Shift

Off-policy deep reinforcement learning (RL) algorithms are incapable of learning solely from batch offline data without online interactions with the environment, due to the phenomenon known as extrapolation error. This is often because past data in the replay buffer may be quite different from the data distribution under the current policy. We argue that most off-policy learning methods fundamentally suffer from a state distribution shift due to the mismatch between the state visitation distributions of the data collected by the behavior and target policies. This data distribution shift between current and past samples can significantly impact the performance of most modern off-policy policy optimization algorithms. In this work, we first do a systematic analysis of state distribution mismatch in off-policy learning, and then develop a novel off-policy policy optimization method to constrain the state distribution shift. To do this, we first estimate the state distribution based on features of the state, using a density estimator, and then develop a novel constrained off-policy gradient objective that minimizes the state distribution shift. Our experimental results on continuous control tasks show that minimizing this distribution mismatch can significantly improve performance in most popular practical off-policy policy gradient algorithms.

algorithm, state distribution, state distribution shift, (14 more...)

1911.0697

Country:

North America > Canada > Quebec > Montreal (0.14)
Europe > Sweden > Stockholm > Stockholm (0.04)
Oceania > Australia > New South Wales > Sydney (0.04)
(4 more...)

Genre: Research Report > New Finding (0.68)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (1.00)

Loynd, Ricky, Fernandez, Roland, Celikyilmaz, Asli, Swaminathan, Adith, Hausknecht, Matthew

Working Memory Graphs

Transformers have increasingly outperformed gated RNNs in obtaining new state-of-the-art results on supervised tasks involving text sequences. Inspired by this trend, we study the question of how Transformer-based models can improve the performance of sequential decision-making agents. We present the Working Memory Graph (WMG), an agent that employs multi-head self-attention to reason over a dynamic set of vectors representing observed and recurrent state. We evaluate WMG in two partially observable environments, one that requires complex reasoning over past observations, and another that features factored observations. We find that WMG significantly outperforms gated RNNs on these tasks, supporting the hypothesis that WMG's inductive bias in favor of learning and leveraging factored representations can dramatically boost sample efficiency in environments featuring such structure.

In the RNN-based approach of Sutskever et al. (2014), an encoder RNN maps an input sentence to a series of internal hidden state vectors. The encoder's final hidden state is copied into a decoder RNN, which then generates another sequence of hidden states that determine the selection of output tokens in the target language. This model can be trained to translate sentences, but translation quality deteriorates on long sentences where long-term dependencies become critical.

agent, factored observation, vector, (15 more...)

1911.07141

Country:

North America > United States > California > San Francisco County > San Francisco (0.14)
North America > United States > Massachusetts > Middlesex County > Cambridge (0.04)
North America > United States > California > San Diego County > San Diego (0.04)
Asia > Middle East > Jordan (0.04)

Genre: Research Report > New Finding (0.46)

Industry: Health & Medicine (0.60)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Natural Language (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (1.00)
(2 more...)

Missingness as Stability: Understanding the Structure of Missingness in Longitudinal EHR data and its Impact on Reinforcement Learning in Healthcare

Fleming, Scott L., Jeyapragasan, Kuhan, Duan, Tony, Ding, Daisy, Gombar, Saurabh, Shah, Nigam, Brunskill, Emma

There is an emerging trend in the reinforcement learning for healthcare literature. In order to prepare longitudinal, irregularly sampled, clinical datasets for reinforcement learning algorithms, many researchers will resample the time series data to short, regular intervals and use last-observation-carried-forward (LOCF) imputation to fill in these gaps. Typically, they will not maintain any explicit information about which values were imputed. In this work, we (1) call attention to this practice and discuss its potential implications; (2) propose an alternative representation of the patient state that addresses some of these issues; and (3) demonstrate in a novel but representative clinical data set that our alternative representation yields consistently better results for achieving optimal control, as measured by off-policy policy evaluation, compared to representations that do not incorporate missingness information.
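The LOCF practice described in this abstract, and the idea of keeping track of which values were imputed, can be illustrated with a small sketch. This is not the representation proposed in the paper, which the abstract does not specify; it only assumes a single resampled series in which missing intervals are marked with None, and it pairs each filled value with a binary "imputed" flag. The function name locf_with_mask and the sample readings are hypothetical.

```python
from typing import List, Optional, Tuple


def locf_with_mask(
    values: List[Optional[float]],
) -> Tuple[List[Optional[float]], List[bool]]:
    """Last-observation-carried-forward over one regularly resampled series.

    Returns the filled series together with a parallel list of flags that
    is True wherever the value was imputed (or is still unknown because
    nothing has been observed yet) rather than actually measured.
    """
    filled: List[Optional[float]] = []
    imputed: List[bool] = []
    last: Optional[float] = None  # most recent real observation, if any
    for v in values:
        if v is None:
            filled.append(last)   # carry the previous observation forward
            imputed.append(True)  # record that this entry is not a measurement
        else:
            last = v
            filled.append(v)
            imputed.append(False)
    return filled, imputed


# Hypothetical series with gaps (None = no measurement in that interval).
series = [None, 98.6, None, None, 101.2, None]
filled, imputed = locf_with_mask(series)
print(filled)
print(imputed)
```

Feeding only the filled series to a learner hides the difference between a measurement taken in an interval and one carried over from earlier; passing the flags alongside it is one simple way to keep that information available.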

information, missingness, reinforcement, (12 more...)

1911.07084

Country:

South America > Chile > Santiago Metropolitan Region > Santiago Province > Santiago (0.04)
North America > United States > California > Santa Clara County > Palo Alto (0.04)
Europe > United Kingdom > England > Cambridgeshire > Cambridge (0.04)

Genre: Research Report > Experimental Study (0.49)

Industry:

Health & Medicine > Consumer Health (1.00)
Health & Medicine > Therapeutic Area > Hematology (0.46)
Health & Medicine > Therapeutic Area > Cardiology/Vascular Diseases (0.46)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (1.00)

On Value Discrepancy of Imitation Learning

Xu, Tian, Li, Ziniu, Yu, Yang

Imitation learning trains a policy from expert demonstrations. Imitation learning approaches have been designed from various principles, such as behavioral cloning via supervised learning, apprenticeship learning via inverse reinforcement learning, and GAIL via generative adversarial learning. In this paper, we propose a framework to analyze the theoretical properties of imitation learning approaches based on discrepancy propagation analysis. Under the infinite-horizon setting, the framework leads to a value discrepancy for behavioral cloning on the order of O((1-\gamma)^{-2}). We also show that the framework leads to a value discrepancy for GAIL on the order of O((1-\gamma)^{-1}). This implies that GAIL suffers from fewer compounding errors than behavioral cloning, which is also verified empirically in this paper. To the best of our knowledge, we are the first to analyze GAIL's performance theoretically. The above results indicate that the proposed framework is a general tool to analyze imitation learning approaches. We hope our theoretical results can provide insights for future improvements in imitation learning algorithms.
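To get a feel for the gap between the two orders quoted in this abstract, the sketch below evaluates only the horizon-dependent factors (1-\gamma)^{-2} and (1-\gamma)^{-1} for a few discount factors. The constants and error terms that multiply these factors in the actual bounds are not given in the abstract and are deliberately omitted, so the numbers indicate growth rates, not bound values. The function name horizon_factors is hypothetical.

```python
from typing import Tuple


def horizon_factors(gamma: float) -> Tuple[float, float]:
    """Return ((1 - gamma)^-2, (1 - gamma)^-1) for a discount factor gamma.

    The first factor is the order quoted for behavioral cloning and the
    second the order quoted for GAIL; multiplicative constants are omitted.
    """
    if not 0.0 <= gamma < 1.0:
        raise ValueError("gamma must lie in [0, 1)")
    effective_horizon = 1.0 / (1.0 - gamma)
    return effective_horizon ** 2, effective_horizon


for gamma in (0.5, 0.9, 0.99):
    quadratic, linear = horizon_factors(gamma)
    # The ratio of the two factors is itself the effective horizon 1/(1-gamma),
    # so the gap between the two orders widens as gamma approaches 1.
    print(f"gamma={gamma}: quadratic={quadratic:.1f}, linear={linear:.1f}")
```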

algorithm, discrepancy, imitation, (13 more...)

1911.07027

Country:

North America > United States > Louisiana > Orleans Parish > New Orleans (0.04)
Asia > Japan > Honshū > Chūbu > Ishikawa Prefecture > Kanazawa (0.04)
Asia > China > Jiangsu Province > Nanjing (0.04)

Genre: Research Report (0.50)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (1.00)

#artificialintelligence Nov-15-2019, 21:50:56 GMT

What is Reinforcement Learning? AI 101

Thank you to Jeff, Gerald, Milan, Ian, Becky, Jino, Daniel, Narskogr, Jason, and Mariano for being $5/month Patrons! Follow me on Twitter! http://twitter.com/jordanbharrod

openai, reinforcement learning, wikipedia, (2 more...)

#artificialintelligence

Technology:

Information Technology > Communications > Social Media (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (0.94)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (0.62)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning > Generative AI (0.41)

Luck, Kevin Sebastian, Vecerik, Mel, Stepputtis, Simon, Amor, Heni Ben, Scholz, Jonathan

Improved Exploration through Latent Trajectory Optimization in Deep Deterministic Policy Gradient

arXiv.org Artificial Intelligence Nov-15-2019

Model-free reinforcement learning algorithms such as Deep Deterministic Policy Gradient (DDPG) often require additional exploration strategies, especially if the actor is of deterministic nature. This work evaluates the use of model-based trajectory optimization methods used for exploration in Deep Deterministic Policy Gradient when trained on a latent image embedding. In addition, an extension of DDPG is derived using a value function as critic, making use of a learned deep dynamics model to compute the policy gradient. This approach leads to a symbiotic relationship between the deep reinforcement learning algorithm and the latent trajectory optimizer. The trajectory optimizer benefits from the critic learned by the RL algorithm and the latter from the enhanced exploration generated by the planner. The developed methods are evaluated on two continuous control tasks, one in simulation and one in the real world. In particular, a Baxter robot is trained to perform an insertion task, while only receiving sparse rewards and images as observations from the environment.

Introduction. Reinforcement learning (RL) methods enabled the development of autonomous systems that can autonomously learn and master a task when provided with an objective function. RL has been successfully applied to a wide range of tasks including flying [24], [17], manipulation [26], [9], [12], [3], [1], locomotion [10], [13], and even autonomous driving [6], [7].

artificial intelligence, exploration, upstream oil & gas, (16 more...)

1911.06833

Country:

North America > United States > Arizona (0.14)
Europe > Switzerland (0.14)

Genre: Research Report (0.82)

Industry:

Energy > Oil & Gas > Upstream (0.36)
Transportation (0.34)
Information Technology (0.34)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (1.00)

Mai, Tien, Chan, Kennard, Jaillet, Patrick

Generalized Maximum Causal Entropy for Inverse Reinforcement Learning

arXiv.org Machine Learning Nov-15-2019

We consider the problem of learning from demonstrated trajectories with inverse reinforcement learning (IRL). Motivated by a limitation of the classical maximum entropy model (Ziebart, Bagnell, and Dey 2010) in capturing the structure of the network of states, we propose an IRL model based on a generalized version of the causal entropy maximization problem, which allows us to generate a class of maximum entropy IRL models. Our generalized model has an advantage of being able to recover, in addition to a reward function, another expert's function that would (partially) capture the impact of the connecting structure of the states on experts' decisions. Empirical evaluation on a real-world dataset and a grid-world dataset shows that our generalized model outperforms the classical ones, in terms of recovering reward functions and demonstrated trajectories.

irl model, reward function, trajectory, (12 more...)

arXiv.org Machine Learning

1911.06928

Country:

North America > United States > Massachusetts > Middlesex County > Cambridge (0.04)
North America > United States > Illinois > Cook County > Chicago (0.04)
Europe > United Kingdom > England > Cambridgeshire > Cambridge (0.04)
Asia > Singapore (0.04)

Genre: Research Report (0.64)

Industry: Transportation > Ground > Road (0.46)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (1.00)

Gaudet, Brian, Linares, Richard, Furfaro, Roberto

Six Degree-of-Freedom Hovering using LIDAR Altimetry via Reinforcement Meta-Learning

arXiv.org Artificial Intelligence Nov-15-2019

We optimize a six degrees of freedom hovering policy using reinforcement meta-learning. The policy maps flash LIDAR measurements directly to on/off spacecraft body-frame thrust commands, allowing hovering at a fixed position and attitude in the asteroid body-fixed reference frame. Importantly, the policy does not require position and velocity estimates, and can operate in environments with unknown dynamics, and without an asteroid shape model or navigation aids. Indeed, during optimization the agent is confronted with a new randomly generated asteroid for each episode, ensuring that it does not learn an asteroid's shape, texture, or environmental dynamics. This allows the deployed policy to generalize well to novel asteroid characteristics, which we demonstrate in our experiments. The hovering controller has the potential to simplify mission planning by allowing asteroid body-fixed hovering immediately upon the spacecraft's arrival at an asteroid. This in turn simplifies shape model generation and allows resource mapping via remote sensing immediately upon arrival at the target asteroid.

asteroid, optimization, spacecraft, (16 more...)

1911.08553

Country:

North America > United States > Massachusetts > Middlesex County > Cambridge (0.04)
North America > United States > Arizona > Pima County > Tucson (0.04)
Asia > Middle East > Jordan (0.04)

Genre: Research Report > New Finding (0.34)

Industry: Energy > Renewable (0.34)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (0.68)
Information Technology > Artificial Intelligence > Machine Learning > Learning Graphical Models > Undirected Networks > Markov Models (0.47)
Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (0.46)

Mai, Tien, Nguyen, Quoc Phong, Low, Kian Hsiang, Jaillet, Patrick

Inverse Reinforcement Learning with Missing Data

arXiv.org Artificial Intelligence Nov-15-2019

We consider the problem of recovering an expert's reward function with inverse reinforcement learning (IRL) when there are missing/incomplete state-action pairs or observations in the demonstrated trajectories. This issue of missing trajectory data or information occurs in many situations, e.g., GPS signals from vehicles moving on a road network are intermittent. In this paper, we propose a tractable approach to directly compute the log-likelihood of demonstrated trajectories with incomplete/missing data. Our algorithm is efficient in handling a large number of missing segments in the demonstrated trajectories, as it performs the training with incomplete data by solving a sequence of systems of linear equations, and the number of such systems to be solved does not depend on the number of missing segments. Empirical evaluation on a real-world dataset shows that our training algorithm outperforms other conventional techniques.

algorithm, linear equation, trajectory, (15 more...)

1911.0693

Country:

North America > United States > Massachusetts > Middlesex County > Cambridge (0.04)
North America > United States > Illinois > Cook County > Chicago (0.04)
Asia > Singapore (0.04)

Genre: Research Report > New Finding (0.46)

Industry: Transportation > Ground > Road (0.34)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (1.00)