AITopics

arXiv.org Machine LearningJun-30-2020

Regularly Updated Deterministic Policy Gradient Algorithm

Han, Shuai, Zhou, Wenbo, Lü, Shuai, Yu, Jiayu

Deep Deterministic Policy Gradient (DDPG) algorithm is one of the most well-known reinforcement learning methods. However, this method is inefficient and unstable in practical applications. On the other hand, the bias and variance of the Q estimation in the target function are sometimes difficult to control. This paper proposes a Regularly Updated Deterministic (RUD) policy gradient algorithm for these problems. This paper theoretically proves that the learning procedure with RUD can make better use of new data in replay buffer than the traditional procedure. In addition, the low variance of the Q value in RUD is more suitable for the current Clipped Double Q-learning strategy. This paper has designed a comparison experiment against previous methods, an ablation experiment with the original DDPG, and other analytical experiments in Mujoco environments. The experimental results demonstrate the effectiveness and superiority of RUD.

artificial intelligence, machine learning, reinforcement learning, (14 more...)

arXiv.org Machine Learning

2007.00169

Country:

North America > United States > California > Alameda County > Livermore (0.04)
Asia > China > Jilin Province (0.04)

Genre: Research Report > New Finding (0.34)

Industry: Energy (0.46)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (1.00)

Mondal, Arnab Kumar, Nair, Pratheeksha, Siddiqi, Kaleem

Group Equivariant Deep Reinforcement Learning

In Reinforcement Learning (RL), Convolutional Neural Networks(CNNs) have been successfully applied as function approximators in Deep Q-Learning algorithms, which seek to learn action-value functions and policies in various environments. However, to date, there has been little work on the learning of symmetry-transformation equivariant representations of the input environment state. In this paper, we propose the use of Equivariant CNNs to train RL agents and study their inductive bias for transformation equivariant Q-value approximation. We demonstrate that equivariant architectures can dramatically enhance the performance and sample efficiency of RL agents in a highly symmetric environment while requiring fewer parameters. Additionally, we show that they are robust to changes in the environment caused by affine transformations.

artificial intelligence, machine learning, reinforcement learning, (14 more...)

2007.03437

Country:

North America > Canada > Quebec > Montreal (0.15)
North America > United States > Pennsylvania > Allegheny County > Pittsburgh (0.04)

Genre: Research Report (0.50)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.34)

Lee, Heunchul, Girnyk, Maksym, Jeong, Jaeseong

Deep reinforcement learning approach to MIMO precoding problem: Optimality and Robustness

In this paper, we propose a deep reinforcement learning (RL)-based precoding framework that can be used to learn an optimal precoding policy for complex multiple-input multiple-output (MIMO) precoding problems. We model the precoding problem for a single-user MIMO system as an RL problem in which a learning agent sequentially selects the precoders to serve the environment of MIMO system based on contextual information about the environmental conditions, while simultaneously adapting the precoder selection policy based on the reward feedback from the environment to maximize a numerical reward signal. We develop the RL agent with two canonical deep RL (DRL) algorithms, namely deep Q-network (DQN) and deep deterministic policy gradient (DDPG). To demonstrate the optimality of the proposed DRL-based precoding framework, we explicitly consider a simple MIMO environment for which the optimal solution can be obtained analytically and show that DQN- and DDPG-based agents can learn the near-optimal policy to map the environment state of MIMO system to a precoder that maximizes the reward function, respectively, in the codebook-based and non-codebook based MIMO precoding systems. Furthermore, to investigate the robustness of DRL-based precoding framework, we examine the performance of the two DRL algorithms in a complex MIMO environment, for which the optimal solution is not known. The numerical results confirm the effectiveness of the DRL-based precoding framework and show that the proposed DRL-based framework can outperform the conventional approximation algorithm in the complex MIMO environment.

artificial intelligence, machine learning, reinforcement learning, (19 more...)

2006.16646

Country:

North America > United States > Massachusetts > Middlesex County > Cambridge (0.04)
Europe > Sweden > Stockholm > Stockholm (0.04)

Genre: Research Report > New Finding (0.87)

Industry: Telecommunications (1.00)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.46)

Testing match-3 video games with Deep Reinforcement Learning

Napolitano, Nicholas

Testing a video game is a critical step for the production process and requires a great effort in terms of time and resources spent. Some software houses are trying to use the artificial intelligence to reduce the need of human resources using systems able to replace a human agent. We study the possibility to use the Deep Reinforcement Learning to automate the testing process in match-3 video games and suggest to approach the problem in the framework of a Dueling Deep Q-Network paradigm. We test this kind of network on the Jelly Juice game, a match-3 video game developed by the redBit Games. The network extracts the essential information from the game environment and infers the next move. We compare the results with the random player performance, finding that the network shows a highest success rate. The results are in most cases similar with those obtained by real users, and the network also succeeds in learning over time the different features that distinguish the game levels and adapts its strategy to the increasing difficulties.

artificial intelligence, machine learning, reinforcement learning, (18 more...)

2007.01137

Country:

North America > United States (0.04)
Europe > Italy (0.04)

Genre: Research Report (1.00)

Industry: Leisure & Entertainment > Games > Computer Games (1.00)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (1.00)

Planning to Explore via Self-Supervised World Models

Sekar, Ramanan, Rybkin, Oleh, Daniilidis, Kostas, Abbeel, Pieter, Hafner, Danijar, Pathak, Deepak

Reinforcement learning allows solving complex tasks, however, the learning tends to be task-specific and the sample efficiency remains a challenge. We present Plan2Explore, a self-supervised reinforcement learning agent that tackles both these challenges through a new approach to self-supervised exploration and fast adaptation to new tasks, which need not be known during exploration. During exploration, unlike prior methods which retrospectively compute the novelty of observations after the agent has already reached them, our agent acts efficiently by leveraging planning to seek out expected future novelty. After exploration, the agent quickly adapts to multiple downstream tasks in a zero or a few-shot manner. We evaluate on challenging control tasks from high-dimensional image inputs. Without any training supervision or task-specific interaction, Plan2Explore outperforms prior self-supervised exploration methods, and in fact, almost matches the performances oracle which has access to rewards. Videos and code at https://ramanans1.github.io/plan2explore/

artificial intelligence, machine learning, reinforcement learning, (16 more...)

2005.0596

Country:

North America > Canada > Ontario > Toronto (0.14)
Europe > Austria > Vienna (0.14)
North America > United States > Pennsylvania (0.04)
(2 more...)

Genre: Research Report (0.82)

Industry: Education (0.46)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (0.94)

Reinforcement Learning based Control of Imitative Policies for Near-Accident Driving

Cao, Zhangjie, Bıyık, Erdem, Wang, Woodrow Z., Raventos, Allan, Gaidon, Adrien, Rosman, Guy, Sadigh, Dorsa

Autonomous driving has achieved significant progress in recent years, but autonomous cars are still unable to tackle high-risk situations where a potential accident is likely. In such near-accident scenarios, even a minor change in the vehicle's actions may result in drastically different consequences. To avoid unsafe actions in near-accident scenarios, we need to fully explore the environment. However, reinforcement learning (RL) and imitation learning (IL), two widely-used policy learning methods, cannot model rapid phase transitions and are not scalable to fully cover all the states. To address driving in near-accident scenarios, we propose a hierarchical reinforcement and imitation learning (H-ReIL) approach that consists of low-level policies learned by IL for discrete driving modes, and a high-level policy learned by RL that switches between different driving modes. Our approach exploits the advantages of both IL and RL by integrating them into a unified learning framework. Experimental results and user studies suggest our approach can achieve higher efficiency and safety compared to other methods. Analyses of the policies demonstrate our high-level policy appropriately switches between different low-level policies in near-accident driving situations.

artificial intelligence, machine learning, reinforcement learning, (14 more...)

2007.00178

Country:

North America > United States > Illinois > Cook County > Chicago (0.04)
North America > United States > California > Santa Clara County > Palo Alto (0.04)
Asia > Middle East > Republic of Türkiye > Karaman Province > Karaman (0.04)

Genre: Research Report > Experimental Study (0.66)

Industry:

Transportation > Ground > Road (1.00)
Automobiles & Trucks (1.00)

Technology:

Information Technology > Artificial Intelligence > Robots > Autonomous Vehicles (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning > Agents (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (1.00)

Capasso, Alessandro Paolo, Bacchiani, Giulio, Broggi, Alberto

From Simulation to Real World Maneuver Execution using Deep Reinforcement Learning

Deep Reinforcement Learning has proved to be able to solve many control tasks in different fields, but the behavior of these systems is not always as expected when deployed in real-world scenarios. This is mainly due to the lack of domain adaptation between simulated and real-world data together with the absence of distinction between train and test datasets. In this work, we investigate these problems in the autonomous driving field, especially for a maneuver planning module for roundabout insertions. In particular, we present a system based on multiple environments in which agents are trained simultaneously, evaluating the behavior of the model in different scenarios. Finally, we analyze techniques aimed at reducing the gap between simulated and real-world data showing that this increased the generalization capabilities of the system both on unseen and real-world scenarios.

artificial intelligence, machine learning, reinforcement learning, (14 more...)

2005.07023

Country:

Europe > Italy (0.04)
Africa > Middle East > Egypt > Cairo Governorate > Cairo (0.04)

Genre: Research Report (0.50)

Industry: Transportation > Ground > Road (0.85)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (1.00)

Erskine, Jordan, Lehnert, Chris

Developing cooperative policies for multi-stage tasks

arXiv.org Machine LearningJun-30-2020

This paper proposes the Cooperative Soft Actor Critic (CSAC) method of enabling consecutive reinforcement learning agents to cooperatively solve a long time horizon multi-stage task. This method is achieved by modifying the policy of each agent to maximise both the current and next agent's critic. Cooperatively maximising each agent's critic allows each agent to take actions that are beneficial for its task as well as subsequent tasks. Using this method in a multi-room maze domain, the cooperative policies were able to outperform both uncooperative policies as well as a single agent trained across the entire domain. CSAC achieved a success rate of at least 20\% higher than the uncooperative policies, and converged on a solution at least 4 times faster than the single agent.

artificial intelligence, machine learning, reinforcement learning, (16 more...)

arXiv.org Machine Learning

2007.00203

Country:

Oceania > Australia > Queensland > Brisbane (0.04)
Asia > Middle East > Jordan (0.04)

Genre: Research Report (0.82)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning > Agents (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (1.00)

arXiv.org Machine LearningJun-30-2020

Dynamic Regret of Policy Optimization in Non-stationary Environments

Fei, Yingjie, Yang, Zhuoran, Wang, Zhaoran, Xie, Qiaomin

We consider reinforcement learning (RL) in episodic MDPs with adversarial full-information reward feedback and unknown fixed transition kernels. We propose two model-free policy optimization algorithms, POWER and POWER++, and establish guarantees for their dynamic regret. Compared with the classical notion of static regret, dynamic regret is a stronger notion as it explicitly accounts for the non-stationarity of environments. The dynamic regret attained by the proposed algorithms interpolates between different regimes of non-stationarity, and moreover satisfies a notion of adaptive (near-)optimality, in the sense that it matches the (near-)optimal static regret under slow-changing environments. The dynamic regret bound features two components, one arising from exploration, which deals with the uncertainty of transition kernels, and the other arising from adaptation, which deals with non-stationary environments. Specifically, we show that POWER++ improves over POWER on the second component of the dynamic regret by actively adapting to non-stationarity through prediction. To the best of our knowledge, our work is the first dynamic regret analysis of model-free RL algorithms in non-stationary environments.

artificial intelligence, machine learning, reinforcement learning, (16 more...)

arXiv.org Machine Learning

2007.00148

Genre: Research Report (0.40)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Learning Graphical Models > Undirected Networks > Markov Models (0.35)