 Reinforcement Learning


What is deep reinforcement learning: The next step in AI and deep learning

#artificialintelligence

Reinforcement learning has traditionally occupied a niche status in the world of artificial intelligence. But it has started to assume a larger role in many AI initiatives in the past few years. Its application sweet spot is calculating the optimal actions for agents to take in environmentally contextualized decision scenarios. Using trial-and-error approaches to maximize an algorithmic reward function, reinforcement learning is well suited to many adaptive-control and multiagent automation applications in IT operations management, energy, health care, commerce, finance, and transportation. And it's being used to train the AI that powers both its traditional focus areas--robotics, gaming, and simulation--and a new generation of AI solutions in edge analytics, natural language processing, machine translation, computer vision, and digital assistants.


End-to-end deep reinforcement learning without reward engineering

Robohub

Communicating the goal of a task to another person is easy: we can use language, show them an image of the desired outcome, point them to a how-to video, or use some combination of all of these. On the other hand, specifying a task to a robot for reinforcement learning requires substantial effort. Most prior work that has applied deep reinforcement learning to real robots makes use of specialized sensors to obtain rewards or studies tasks where the robot's internal sensors can be used to measure reward. Since such instrumentation needs to be done for any new task that we may wish to learn, it poses a significant bottleneck to widespread adoption of reinforcement learning for robotics, and precludes the use of these methods directly in open-world environments that lack this instrumentation. We have developed an end-to-end method that allows robots to learn from a modest number of images that depict successful completion of a task, without any manual reward engineering.


Recurrent Existence Determination Through Policy Optimization

arXiv.org Artificial Intelligence

Binary determination of the presence of objects is one of the problems where humans perform extraordinarily better than computer vision systems, in terms of both speed and precision. One possible reason is that humans can skip most of the clutter and attend only to salient regions. Recurrent attention models (RAM) are the first computational models to imitate the way humans process images via the REINFORCE algorithm. Although RAM was originally designed for image recognition, we extend it and present recurrent existence determination, an attention-based mechanism to solve the existence determination problem. Our algorithm employs a novel $k$-maximum aggregation layer and a new reward mechanism to address the issue of delayed rewards, which would otherwise destabilize the training process. The experimental analysis demonstrates significant efficiency and accuracy improvements over existing approaches, on both synthetic and real-world datasets.


Proximal Reliability Optimization for Reinforcement Learning

arXiv.org Machine Learning

In recent years, reinforcement learning has seen incremental growth in replacing classical dynamic programming in the field of control engineering, because it makes limited to no assumptions about the dynamics of the system. Instead, it depends upon the universal approximating capabilities of the control structure to develop a good control function through trial-and-error experimentation. The challenge of this approach is to efficiently carry out the exploration, which allows the controller to adapt to a control strategy with satisfactory global performance. We can envision the implausibility of directly employing a reinforcement learning approach in designing a controller for a physical system, as the controller may crash during the thousands or even tens of thousands of trials needed before it finds a stable control function, making it impractical for designing robust controllers. Since conducting trials in reality is often infeasible, usually a mathematical model of the physical system is constructed in the form of a simulator, the controller is designed for the model, and then the controller is implemented on the physical system. If there are substantial differences between the model and the physical system, often called the reality gap, then the controller may operate with compromised performance and possibly be unstable. Physical systems often possess underlying dynamics that are difficult to measure accurately, such as friction, density distribution, and unknown torques. Furthermore, the dynamics of the system often change over time; the change can be gradual, as when devices wear or new systems break in, or abrupt, as in the catastrophic failure of a sub-component or the replacement of an old part with a new one.


Using a Logarithmic Mapping to Enable Lower Discount Factors in Reinforcement Learning

arXiv.org Machine Learning

In an effort to better understand the different ways in which the discount factor affects the optimization process in reinforcement learning, we designed a set of experiments to study each effect in isolation. Our analysis reveals that the common perception that poor performance of low discount factors is caused by (too) small action-gaps requires revision. We propose an alternative hypothesis, which identifies the size-difference of the action-gap across the state-space as the primary cause. We then introduce a new method that enables more homogeneous action-gaps by mapping value estimates to a logarithmic space. We prove convergence for this method under standard assumptions and demonstrate empirically that it indeed enables lower discount factors for approximate reinforcement-learning methods. This in turn allows tackling a class of reinforcement-learning problems that are challenging to solve with traditional methods.
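
The hypothesis about action-gap size differences can be illustrated on a toy problem. The sketch below is not the method from the paper; it assumes a hypothetical deterministic chain (the environment and the function names `chain_q_values`, `action_gaps`, and `log_action_gaps` are invented here) in which moving forward from the last state yields reward 1 and a "stay" action wastes one step. In that setting the raw action-gap shrinks geometrically with distance from the goal, while the gap between log-mapped values is the same in every state.

```python
import math


def chain_q_values(n_states, gamma):
    """Optimal (Q_forward, Q_stay) per state in a hypothetical deterministic chain.

    Moving forward from the last state ends the episode with reward 1; every
    other reward is 0. Staying wastes one step, so Q_stay = gamma * Q_forward.
    """
    q_values = []
    for state in range(n_states):
        steps_after_this_one = n_states - 1 - state
        q_forward = gamma ** steps_after_this_one
        q_stay = gamma * q_forward
        q_values.append((q_forward, q_stay))
    return q_values


def action_gaps(q_values):
    """Raw action-gap Q_forward - Q_stay for each state."""
    return [q_fwd - q_stay for q_fwd, q_stay in q_values]


def log_action_gaps(q_values):
    """Action-gap after mapping both values to logarithmic space."""
    return [math.log(q_fwd) - math.log(q_stay) for q_fwd, q_stay in q_values]


if __name__ == "__main__":
    q = chain_q_values(n_states=10, gamma=0.5)
    gaps = action_gaps(q)
    log_gaps = log_action_gaps(q)
    # Raw gaps differ by a factor of gamma ** -(n_states - 1) across the chain.
    print("largest / smallest raw gap:", max(gaps) / min(gaps))
    # Log-space gaps all equal -log(gamma), up to floating-point error.
    print("largest / smallest log-space gap:", max(log_gaps) / min(log_gaps))
```

With `gamma = 0.5` and ten states, the raw gaps span a factor of 2^9 = 512, whereas every log-space gap equals log 2. The example only covers strictly positive values; how to treat zero or negative value estimates is outside this sketch.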


Stabilizing Off-Policy Q-Learning via Bootstrapping Error Reduction

arXiv.org Machine Learning

Off-policy reinforcement learning aims to leverage experience collected from prior policies for sample-efficient learning. However, in practice, commonly used off-policy approximate dynamic programming methods based on Q-learning and actor-critic methods are highly sensitive to the data distribution, and can make only limited progress without collecting additional on-policy data. As a step towards more robust off-policy algorithms, we study the setting where the off-policy experience is fixed and there is no further interaction with the environment. We identify bootstrapping error as a key source of instability in current methods. Bootstrapping error is due to bootstrapping from actions that lie outside of the training data distribution, and it accumulates via the Bellman backup operator. We theoretically analyze bootstrapping error, and demonstrate how carefully constraining action selection in the backup can mitigate it. Based on our analysis, we propose a practical algorithm, bootstrapping error accumulation reduction (BEAR). We demonstrate that BEAR is able to learn robustly from different off-policy distributions, including random and suboptimal demonstrations, on a range of continuous control tasks.
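
The idea of constraining action selection in the backup can be sketched in a tabular setting. The code below is not BEAR, which targets continuous control; it is a simplified analogue under stated assumptions: a fixed batch of `(state, action, reward, next_state, done)` tuples, deterministic transitions (so a learning rate of 1 is used), and a hypothetical function `batch_q_iteration` whose `constrain` flag restricts the max in the Bellman backup to actions that actually appear in the batch for the next state.

```python
from collections import defaultdict


def batch_q_iteration(transitions, n_actions, gamma=0.9, sweeps=50,
                      q_init=0.0, constrain=True):
    """Tabular Q-iteration on a fixed batch, with an optional backup constraint.

    constrain=True  -> bootstrap only from actions observed at next_state.
    constrain=False -> bootstrap from all actions, including never-observed
                       ones whose values are still at q_init.
    """
    # Record which actions the batch contains for each state.
    seen = {}
    for state, action, _, _, _ in transitions:
        seen.setdefault(state, set()).add(action)

    q = defaultdict(lambda: [q_init] * n_actions)
    for _ in range(sweeps):
        for state, action, reward, next_state, done in transitions:
            if done:
                target = reward
            else:
                if constrain:
                    candidates = seen.get(next_state, ())
                else:
                    candidates = range(n_actions)
                # Assumption of this sketch: with no usable action at
                # next_state, the bootstrap term is taken to be 0.
                bootstrap = max((q[next_state][b] for b in candidates),
                                default=0.0)
                target = reward + gamma * bootstrap
            q[state][action] = target
    return dict(q)


if __name__ == "__main__":
    # Two-step chain; only action 0 is ever observed. q_init=10.0 stands in
    # for an erroneously high estimate of the unobserved action 1.
    batch = [(0, 0, 0.0, 1, False), (1, 0, 1.0, 2, True)]
    print(batch_q_iteration(batch, n_actions=2, q_init=10.0, constrain=True)[0][0])
    print(batch_q_iteration(batch, n_actions=2, q_init=10.0, constrain=False)[0][0])
```

In the unconstrained run, the inflated value of the never-observed action propagates backward through the backup; in the constrained run it is ignored, so the estimate for state 0 reflects only the returns supported by the data.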


Load Balancing for Ultra-Dense Networks: A Deep Reinforcement Learning Based Approach

arXiv.org Machine Learning

In this paper, we propose a deep reinforcement learning (DRL) based mobility load balancing (MLB) algorithm along with a two-layer architecture to solve the large-scale load balancing problem for ultra-dense networks (UDNs). Our contribution is three-fold. First, this work proposes a two-layer architecture to solve the large-scale load balancing problem in a self-organized manner. The proposed architecture can alleviate the global traffic variations by dynamically grouping small cells into self-organized clusters according to their historical loads, and further adapt to local traffic variations through intra-cluster load balancing afterwards. Second, for the intra-cluster load balancing, this paper proposes an off-policy DRL-based MLB algorithm to autonomously learn the optimal MLB policy under an asynchronous parallel learning framework, without any prior knowledge assumed over the underlying UDN environments. Moreover, the algorithm enables joint exploration with multiple behavior policies, such that the traditional MLB methods can be used to guide the learning process, thereby improving the learning efficiency and stability. Third, this work proposes an offline-evaluation based safeguard mechanism to ensure that the online system can always operate with the optimal and well-trained MLB policy, which not only stabilizes the online performance but also enables the exploration beyond current policies to make full use of machine learning in a safe way. Empirical results verify that the proposed framework outperforms the existing MLB methods, in terms of load balancing performance, in general UDN environments featuring irregular network topologies, coupled interferences, and random user movements.


Deep Reinforcement Learning Architecture for Continuous Power Allocation in High Throughput Satellites

arXiv.org Machine Learning

In the coming years, the satellite broadband market will experience significant increases in the service demand, especially for the mobility sector, where demand is burstier. Many of the next generation of satellites will be equipped with numerous degrees of freedom in power and bandwidth allocation capabilities, making manual resource allocation impractical and inefficient. Therefore, it is desirable to automate the operation of these highly flexible satellites. This paper presents a novel power allocation approach based on Deep Reinforcement Learning (DRL) that represents the problem as continuous state and action spaces. We make use of the Proximal Policy Optimization (PPO) algorithm to optimize the allocation policy for minimum Unmet System Demand (USD) and power consumption. The performance of the algorithm is analyzed through simulations of a multibeam satellite system, which show promising results for DRL to be used as a dynamic resource allocation algorithm.
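
For readers unfamiliar with PPO, the following is a minimal sketch of its commonly used clipped surrogate objective, written in plain Python. It is not taken from the paper: the abstract does not say which PPO variant, clipping range, or network architecture the authors used, so the function name `ppo_clipped_objective` and the default `clip_eps=0.2` are assumptions made for illustration.

```python
def ppo_clipped_objective(ratios, advantages, clip_eps=0.2):
    """Mean clipped surrogate objective over a batch of samples.

    ratios:     new_policy_prob / old_policy_prob for each sampled action.
    advantages: advantage estimate for each sampled action.
    clip_eps:   clipping range (assumed default, not from the paper).
    """
    if len(ratios) != len(advantages) or not ratios:
        raise ValueError("ratios and advantages must be non-empty and equal length")
    total = 0.0
    for ratio, advantage in zip(ratios, advantages):
        clipped_ratio = min(max(ratio, 1.0 - clip_eps), 1.0 + clip_eps)
        # Taking the minimum removes any incentive to push the ratio
        # outside [1 - clip_eps, 1 + clip_eps] in the favorable direction.
        total += min(ratio * advantage, clipped_ratio * advantage)
    return total / len(ratios)


if __name__ == "__main__":
    # Positive advantage: the gain is capped once the ratio exceeds 1 + clip_eps.
    print(ppo_clipped_objective([1.5], [1.0]))
    # Negative advantage: the penalty is not reduced below the clipped value.
    print(ppo_clipped_objective([0.5], [-1.0]))
```

In a training loop this quantity would be maximized with respect to the policy parameters; here it is evaluated on fixed numbers only to show how the clipping behaves.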


Sequential Triggers for Watermarking of Deep Reinforcement Learning Policies

arXiv.org Artificial Intelligence

This paper proposes a novel scheme for the watermarking of Deep Reinforcement Learning (DRL) policies. This scheme provides a mechanism for the integration of a unique identifier within the policy in the form of its response to a designated sequence of state transitions, while incurring minimal impact on the nominal performance of the policy. The applications of this watermarking scheme include detection of unauthorized replications of proprietary policies, as well as enabling the graceful interruption or termination of DRL activities by authorized entities. We demonstrate the feasibility of our proposal via experimental evaluation of watermarking a DQN policy trained in the Cartpole environment.


Adversarial Exploitation of Policy Imitation

arXiv.org Artificial Intelligence

This paper investigates a class of attacks targeting the confidentiality aspect of security in Deep Reinforcement Learning (DRL) policies. Recent research has established the vulnerability of supervised machine learning models (e.g., classifiers) to model extraction attacks. Such attacks leverage the loosely-restricted ability of the attacker to iteratively query the model for labels, thereby allowing for the forging of a labeled dataset which can be used to train a replica of the original model. In this work, we demonstrate the feasibility of exploiting imitation learning techniques in launching model ...

Typically, the settings of imitation learning are concerned with learning from human demonstrations. However, it is straightforward to deduce that the techniques developed for those settings may also be applied to learning from artificial experts, such as DRL agents. Of particular relevance to this research is the emerging area of Reinforcement Learning with Expert Demonstrations (RLED) [Piot et al., 2014]. The techniques of RLED aim to minimize the effect of modeling imperfections on the efficacy of the final RL policy, while minimizing the cost of training by leveraging the information available in demonstrations to reduce the search space of the policy.