AITopics

2008.06799

Genre: Research Report (1.00)

Industry: Leisure & Entertainment > Games > Computer Games (0.68)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.50)

Hubbs, Christian D., Perez, Hector D., Sarwar, Owais, Sahinidis, Nikolaos V., Grossmann, Ignacio E., Wassick, John M.

OR-Gym: A Reinforcement Learning Library for Operations Research Problems

arXiv.org Artificial Intelligence, Aug-14-2020

Reinforcement learning (RL) has been widely applied to game-playing and has surpassed the best human-level performance in many domains, yet there are few use-cases in industrial or commercial settings. We introduce OR-Gym, an open-source library for developing reinforcement learning algorithms to address operations research problems. In this paper, we apply reinforcement learning to the knapsack, multi-dimensional bin packing, multi-echelon supply chain, and multi-period asset allocation model problems, and benchmark the RL solutions against MILP and heuristic models. These problems arise in logistics, finance, and engineering, and are common in many business operation settings. We develop environments based on prototypical models in the literature and implement various optimization and heuristic models in order to benchmark the RL results. By re-framing a series of classic optimization problems as RL tasks, we seek to provide a new tool for the operations research community, while also opening those in the RL community to many of the problems and challenges in the OR field.
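As a rough illustration of what re-framing a classic optimization problem as an RL task can look like, the sketch below casts a small 0/1 knapsack instance as an environment with a gym-like reset/step interface and runs a greedy value-to-weight heuristic as a baseline. The class `TinyKnapsackEnv`, the helper names, and the reward design are assumptions made for this example; they are not OR-Gym's actual API or the paper's environment definitions.

```python
from typing import List, Tuple


class TinyKnapsackEnv:
    """Hypothetical 0/1 knapsack environment (not OR-Gym's API).

    Assumes all item weights are strictly positive.
    """

    def __init__(self, weights: List[int], values: List[int], capacity: int):
        self.weights = weights
        self.values = values
        self.capacity = capacity
        self.reset()

    def reset(self) -> Tuple[int, Tuple[bool, ...]]:
        self.remaining = self.capacity
        self.taken = [False] * len(self.weights)
        return self._state()

    def _state(self) -> Tuple[int, Tuple[bool, ...]]:
        # State: remaining capacity plus which items are already packed.
        return self.remaining, tuple(self.taken)

    def step(self, action: int):
        """Try to pack item `action`; the reward is its value if it fits."""
        fits = (not self.taken[action]) and self.weights[action] <= self.remaining
        reward = 0
        if fits:
            self.taken[action] = True
            self.remaining -= self.weights[action]
            reward = self.values[action]
        # The episode ends on an invalid action or when nothing else fits.
        done = (not fits) or not any(
            (not t) and w <= self.remaining
            for t, w in zip(self.taken, self.weights)
        )
        return self._state(), reward, done


def greedy_policy(env: TinyKnapsackEnv) -> int:
    """Heuristic baseline: feasible item with the best value/weight ratio."""
    feasible = [
        i for i, (t, w) in enumerate(zip(env.taken, env.weights))
        if not t and w <= env.remaining
    ]
    return max(feasible, key=lambda i: env.values[i] / env.weights[i])


def run_episode(env: TinyKnapsackEnv) -> int:
    """Roll out the greedy policy once and return the total packed value."""
    env.reset()
    total, done = 0, False
    while not done:
        _, reward, done = env.step(greedy_policy(env))
        total += reward
    return total
```

On the instance with weights 10, 20, 30, values 60, 100, 120, and capacity 50, the greedy rollout packs the first two items for a total value of 160, while packing the last two items gives 220. A gap of this kind is what a benchmark against exact and heuristic models is meant to expose.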

artificial intelligence, machine learning, reinforcement learning, (16 more...)

2008.06319

Country:

Asia > Middle East > Jordan (0.04)
North America > United States > Pennsylvania > Allegheny County > Pittsburgh (0.04)
North America > United States > Michigan > Midland County > Midland (0.04)
(3 more...)

Genre:

Research Report (0.64)
Overview (0.46)

Industry: Banking & Finance > Trading (0.88)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (1.00)

arXiv.org Artificial Intelligence, Aug-14-2020

Decision-making at Unsignalized Intersection for Autonomous Vehicles: Left-turn Maneuver with Deep Reinforcement Learning

Liu, Teng, Mu, Xingyu, Huang, Bing, Tang, Xiaolin, Zhao, Fuqing, Wang, Xiao, Cao, Dongpu

The decision-making module enables autonomous vehicles to select appropriate maneuvers in complex urban environments, especially at intersections. This work proposes a deep reinforcement learning (DRL) based left-turn decision-making framework for autonomous vehicles at unsignalized intersections. The objective of the studied automated vehicle is to make an efficient and safe left-turn maneuver at a four-way unsignalized intersection. The DRL methods used include deep Q-learning (DQL) and double DQL. Simulation results indicate that the presented decision-making strategy can effectively reduce the collision rate and improve transport efficiency. This work also shows that the constructed left-turn control structure has great potential to be applied in real time.
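For readers unfamiliar with the difference between the two methods named above, the following minimal sketch contrasts the standard bootstrap targets of deep Q-learning and double deep Q-learning. The function names, the discount factor, and the plain-list representation of Q-values are assumptions for illustration; the abstract does not describe the paper's network architecture, reward, or action set.

```python
from typing import Sequence


def dql_target(reward: float, q_next_target: Sequence[float],
               gamma: float = 0.99, done: bool = False) -> float:
    """Deep Q-learning target.

    The target network both selects and evaluates the next action,
    which tends to overestimate action values.
    """
    if done:
        return reward
    return reward + gamma * max(q_next_target)


def double_dql_target(reward: float, q_next_online: Sequence[float],
                      q_next_target: Sequence[float],
                      gamma: float = 0.99, done: bool = False) -> float:
    """Double deep Q-learning target.

    The online network selects the next action and the target network
    evaluates it, decoupling selection from evaluation.
    """
    if done:
        return reward
    best_action = max(range(len(q_next_online)), key=lambda a: q_next_online[a])
    return reward + gamma * q_next_target[best_action]
```

With online estimates `[1.0, 2.0, 0.5]` and target estimates `[3.0, 1.0, 0.0]` for the next state, the two targets differ: the first bootstraps from the target network's maximum, the second from the target network's value at the action the online network prefers.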

artificial intelligence, machine learning, reinforcement learning, (17 more...)

2008.06595

Country:

North America > Canada > Ontario > Waterloo Region > Waterloo (0.28)
Asia > China > Chongqing Province > Chongqing (0.06)
Asia > China > Beijing > Beijing (0.05)
(9 more...)

Genre:

Personal (0.46)
Research Report (0.40)

Industry:

Transportation > Ground > Road (1.00)
Automobiles & Trucks (1.00)

Technology:

Information Technology > Artificial Intelligence > Robots > Autonomous Vehicles (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (1.00)

Wachi, Akifumi, Sui, Yanan

Safe Reinforcement Learning in Constrained Markov Decision Processes

arXiv.org Artificial Intelligence, Aug-14-2020

Safe reinforcement learning has been a promising approach for optimizing the policy of an agent that operates in safety-critical applications. In this paper, we propose an algorithm, SNO-MDP, that explores and optimizes Markov decision processes under unknown safety constraints. Specifically, we take a stepwise approach for optimizing safety and cumulative reward. In our method, the agent first learns safety constraints by expanding the safe region, and then optimizes the cumulative reward in the certified safe region. We provide theoretical guarantees on both the satisfaction of the safety constraint and the near-optimality of the cumulative reward under proper regularity assumptions. We demonstrate the effectiveness of SNO-MDP through two experiments: one uses synthetic data in a new, openly available environment named GP-SAFETY-GYM, and the other simulates Mars surface exploration using real observation data.

artificial intelligence, machine learning, reinforcement learning, (17 more...)

2008.06626

Country:

Asia > Japan > Honshū > Kantō > Tokyo Metropolis Prefecture > Tokyo (0.14)
Asia > China > Beijing > Beijing (0.04)

Genre: Research Report (1.00)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Learning Graphical Models > Undirected Networks > Markov Models (0.62)

Cautious Adaptation For Reinforcement Learning in Safety-Critical Settings

Zhang, Jesse, Cheung, Brian, Finn, Chelsea, Levine, Sergey, Jayaraman, Dinesh

Reinforcement learning (RL) in real-world safety-critical target settings like urban driving is hazardous, imperiling the RL agent, other agents, and the environment. To overcome this difficulty, we propose a "safety-critical adaptation" task setting: an agent first trains in non-safety-critical "source" environments such as in a simulator, before it adapts to the target environment where failures carry heavy costs. We propose a solution approach, CARL, that builds on the intuition that prior experience in diverse environments equips an agent to estimate risk, which in turn enables relative safety through risk-averse, cautious adaptation. CARL first employs model-based RL to train a probabilistic model to capture uncertainty about transition dynamics and catastrophic states across varied source environments. Then, when exploring a new safety-critical environment with unknown dynamics, the CARL agent plans to avoid actions that could lead to catastrophic states. In experiments on car driving, cartpole balancing, half-cheetah locomotion, and robotic object manipulation, CARL successfully acquires cautious exploration behaviors, yielding higher rewards with fewer failures than strong RL adaptation baselines. Website at https://sites.google.com/berkeley.edu/carl.

artificial intelligence, machine learning, reinforcement learning, (17 more...)

2008.06622

Country:

North America > United States > Pennsylvania (0.04)
North America > United States > California > Santa Clara County > Stanford (0.04)
North America > United States > California > Alameda County > Berkeley (0.04)

Genre: Research Report (0.50)

Technology:

Information Technology > Artificial Intelligence > Robots (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (1.00)

Willems, Lucas, Lahlou, Salem, Bengio, Yoshua

Mastering Rate based Curriculum Learning

Recently, deep reinforcement learning algorithms have been successfully applied to a wide range of domains ([1], [2], [3], [4]). However, their success relies heavily on dense rewards being given to the agent, and learning in environments with sparse rewards is still a major limitation of RL due to the low sample efficiency of current algorithms in such scenarios. In sparse-reward settings, the sample inefficiency is essentially caused by the low likelihood of the agent obtaining a reward by random exploration. Recent attempts to tackle this issue revolve around providing the agent with an intrinsic reward that encourages exploring new states of the environment, thus increasing the likelihood of reaching the reward ([5], [6], [7]). An alternative way to improve sample efficiency is curriculum learning ([8]). It consists of first training the agent on an easy version of the task at hand, where it can obtain reward more easily and learn, then training on increasingly difficult versions using the previously learned policy, and finally training on the task at hand. Its usage is not limited to reinforcement learning and robotics tasks; it also applies to supervised tasks. Curriculum learning may be decomposed into two parts: 1. Defining the curriculum, i.e. the set of tasks the learner may be trained on.
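The easy-to-hard training schedule described above can be sketched as a loop that carries the learned policy from one task to the next. This is only a toy illustration of plain curriculum learning under stated assumptions: `train_with_curriculum` and `toy_train` are hypothetical names, the "policy" is reduced to a single skill number, and the paper's own task-selection algorithms are not reproduced here.

```python
from typing import Callable, Iterable, TypeVar

Policy = TypeVar("Policy")
Task = TypeVar("Task")


def train_with_curriculum(
    policy: Policy,
    tasks: Iterable[Task],
    train_on_task: Callable[[Policy, Task], Policy],
) -> Policy:
    """Train on tasks in the given order, reusing the previous policy each time."""
    for task in tasks:
        policy = train_on_task(policy, task)
    return policy


def toy_train(skill: float, difficulty: float) -> float:
    """Hypothetical learner standing in for a sparse-reward RL update.

    It only obtains reward, and therefore only improves, when the task
    is at most one unit harder than its current skill.
    """
    if difficulty <= skill + 1.0:
        return max(skill, difficulty)
    return skill


# Easy-to-hard ordering reaches the final task; training on it directly does not.
with_curriculum = train_with_curriculum(0.0, [1.0, 2.0, 3.0], toy_train)
without_curriculum = train_with_curriculum(0.0, [3.0], toy_train)
```

In this toy setting the learner trained through the curriculum ends with skill 3.0, while the learner trained only on the hardest task never receives reward and stays at 0.0.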

artificial intelligence, machine learning, reinforcement learning, (15 more...)

2008.06456

Country:

North America > Canada > Quebec > Montreal (0.04)
Oceania > Australia > New South Wales > Sydney (0.04)
North America > United States > California > Los Angeles County > Long Beach (0.04)

Genre: Research Report (0.64)

Industry:

Education (1.00)
Leisure & Entertainment > Games > Computer Games (0.46)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (1.00)

Defending Adversarial Attacks without Adversarial Attacks in Deep Reinforcement Learning

Qu, Xinghua, Ong, Yew-Soon, Gupta, Abhishek, Sun, Zhu

Many recent studies in deep reinforcement learning (DRL) have proposed to boost adversarial robustness through policy distillation utilizing adversarial training, where additional adversarial examples are added in the training process of the student policy; this makes the robustness improvement less flexible and more computationally expensive. In contrast, we propose an efficient policy distillation paradigm called robust policy distillation that is capable of achieving an adversarially robust student policy without relying on any adversarial example during student policy training. To this end, we devise a new policy distillation loss that consists of two terms: 1) a prescription gap maximization loss aiming at simultaneously maximizing the likelihood of the action selected by the teacher policy and the entropy over the remaining actions; 2) a Jacobian regularization loss that minimizes the magnitude of the Jacobian with respect to the input state. The theoretical analysis proves that our distillation loss is guaranteed to increase the prescription gap and the adversarial robustness. Meanwhile, experiments on five Atari games verify the superiority of our policy distillation in boosting adversarial robustness compared to other state-of-the-art methods.

artificial intelligence, machine learning, reinforcement learning, (18 more...)

2008.06199

Country: Asia > Singapore (0.04)

Genre: Research Report > New Finding (0.46)

Industry:

Information Technology > Security & Privacy (0.89)
Government > Military (0.79)
Leisure & Entertainment > Games > Computer Games (0.55)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (1.00)

Geng, Sinong, Nassif, Houssam, Manzanares, Carlos A., Reppen, A. Max, Sircar, Ronnie

Deep PQR: Solving Inverse Reinforcement Learning using Anchor Actions

We propose a reward function estimation framework for inverse reinforcement learning with deep energy-based policies. We name our method PQR, as it sequentially estimates the Policy, the $Q$-function, and the Reward function by deep learning. PQR does not assume that the reward solely depends on the state; instead, it allows for a dependency on the choice of action. Moreover, PQR allows for stochastic state transitions. To accomplish this, we assume the existence of one anchor action whose reward is known, typically the action of doing nothing, yielding no reward. We present both estimators and algorithms for the PQR method. When the environment transition is known, we prove that the PQR reward estimator uniquely recovers the true reward. With unknown transitions, we bound the estimation error of PQR. Finally, the performance of PQR is demonstrated on synthetic and real-world datasets.

artificial intelligence, machine learning, reinforcement learning, (15 more...)

2007.07443

Country:

North America > United States > Minnesota > Hennepin County > Minneapolis (0.14)
Europe > Austria > Vienna (0.14)
North America > United States > Illinois > Cook County > Chicago (0.04)
(14 more...)

Genre: Research Report (0.82)

Industry:

Transportation > Passenger (1.00)
Transportation > Air (1.00)
Consumer Products & Services > Travel (1.00)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Learning Graphical Models > Undirected Networks > Markov Models (0.67)

Chang, Michael, Kaushik, Sidhant, Weinberg, S. Matthew, Griffiths, Thomas L., Levine, Sergey

Decentralized Reinforcement Learning: Global Decision-Making via Local Economic Transactions

This paper seeks to establish a framework for directing a society of simple, specialized, self-interested agents to solve what traditionally are posed as monolithic single-agent sequential decision problems. What makes it challenging to use a decentralized approach to collectively optimize a central objective is the difficulty in characterizing the equilibrium strategy profile of non-cooperative games. To overcome this challenge, we design a mechanism for defining the learning environment of each agent for which we know that the optimal solution for the global objective coincides with a Nash equilibrium strategy profile of the agents optimizing their own local objectives. The society functions as an economy of agents that learn the credit assignment process itself by buying and selling to each other the right to operate on the environment state. We derive a class of decentralized reinforcement learning algorithms that are broadly applicable not only to standard reinforcement learning but also for selecting options in semi-MDPs and dynamically composing computation graphs. Lastly, we demonstrate the potential advantages of a society's inherent modular structure for more efficient transfer learning.

artificial intelligence, equilibrium, machine learning, (17 more...)

2007.02382

Country:

Asia > Middle East > Jordan (0.04)
North America > United States > New Jersey > Mercer County > Princeton (0.04)
North America > United States > Massachusetts > Middlesex County > Cambridge (0.04)
(3 more...)

Genre: Research Report (0.81)

Industry: Education (0.88)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning > Optimization (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning > Agents (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.68)

arXiv.org Artificial Intelligence, Aug-13-2020

Offline Meta-Reinforcement Learning with Advantage Weighting

Mitchell, Eric, Rafailov, Rafael, Peng, Xue Bin, Levine, Sergey, Finn, Chelsea

This paper introduces the offline meta-reinforcement learning (offline meta-RL) problem setting and proposes an algorithm that performs well in this setting. Offline meta-RL is analogous to the widely successful supervised learning strategy of pretraining a model on a large batch of fixed, pre-collected data (possibly from various tasks) and fine-tuning the model to a new task with relatively little data. That is, in offline meta-RL, we meta-train on fixed, pre-collected data from several tasks and adapt to a new task with a very small amount (less than 5 trajectories) of data from the new task. By nature of being offline, algorithms for offline meta-RL can utilize the largest possible pool of training data available and eliminate potentially unsafe or costly data collection during meta-training. This setting inherits the challenges of offline RL, but it differs significantly because offline RL does not generally consider a) transfer to new tasks or b) limited data from the test task, both of which we face in offline meta-RL. Targeting the offline meta-RL setting, we propose Meta-Actor Critic with Advantage Weighting (MACAW). MACAW is an optimization-based meta-learning algorithm that uses simple, supervised regression objectives for both the inner and outer loop of meta-training. On offline variants of common meta-RL benchmarks, we empirically find that this approach enables fully offline meta-reinforcement learning and achieves notable gains over prior methods. Meta-reinforcement learning (meta-RL) has emerged as a promising strategy for tackling the high sample complexity of reinforcement learning algorithms, when the goal is to ultimately learn many tasks. Meta-RL algorithms exploit shared structure among tasks during meta-training, amortizing the cost of learning across tasks and enabling rapid adaptation to new tasks during meta-testing from only a small amount of experience. Yet unlike in supervised learning, where large amounts of pre-collected data can be pooled from many sources to train a single model, existing meta-RL algorithms assume the ability to collect millions of environment interactions online during meta-training.
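To give a sense of what a supervised regression objective with advantage weighting can look like, the sketch below computes a generic advantage-weighted policy loss: each logged action's log-probability is weighted by the exponentiated advantage, so better-than-average actions are imitated more strongly. This follows the general advantage-weighted regression idea; the function name, the temperature parameter, and the exact form are assumptions for illustration and are not taken from the abstract, which does not spell out MACAW's losses.

```python
import math
from typing import Sequence


def advantage_weighted_loss(
    log_probs: Sequence[float],
    advantages: Sequence[float],
    temperature: float = 1.0,
) -> float:
    """Generic advantage-weighted regression loss over a batch of logged actions.

    log_probs[i] is the policy's log-probability of the i-th logged action,
    advantages[i] is an estimate of how much better that action was than
    the average behavior in its state.
    """
    if len(log_probs) != len(advantages):
        raise ValueError("log_probs and advantages must have the same length")
    # Exponentiated advantages act as per-sample imitation weights.
    weights = [math.exp(a / temperature) for a in advantages]
    weighted = [w * lp for w, lp in zip(weights, log_probs)]
    # Negative sign: minimizing the loss maximizes weighted log-likelihood.
    return -sum(weighted) / len(weighted)
```

Because the loss only needs logged actions and advantage estimates, it can be evaluated on a fixed dataset without further environment interaction, which is what makes this family of objectives a natural fit for offline settings.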

artificial intelligence, machine learning, reinforcement learning, (15 more...)

2008.06043

Country:

North America > United States > Texas (0.04)
North America > United States > New York > New York County > New York City (0.04)
North America > United States > California > Santa Clara County > Palo Alto (0.04)
(4 more...)

Genre: Research Report (1.00)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (1.00)