Goto

Collaborating Authors

 Reinforcement Learning


Reinforcement learning techniques for Outer Loop Link Adaptation in 4G/5G systems

arXiv.org Machine Learning

Wireless systems perform rate adaptation to transmit at highest possible instantaneous rates. Rate adaptation has been increasingly granular over generations of wireless systems. The base-station uses SINR and packet decode feedback called acknowledgement/no acknowledgement (ACK/NACK) to perform rate adaptation. SINR is used for rate anchoring called inner look adaptation and ACK/NACK is used for fine offset adjustments called Outer Loop Link Adaptation (OLLA). We cast the OLLA as a reinforcement learning problem of the class of Multi-Armed Bandits (MAB) where the different offset values are the arms of the bandit. In OLLA, as the offset values increase, the probability of packet error also increase, and every user equipment (UE) has a desired Block Error Rate (BLER) to meet certain Quality of Service (QoS) requirements. For this MAB we propose a binary search based algorithm which achieves a Probably Approximately Correct (PAC) solution making use of bounds from large deviation theory and confidence bounds. In addition to this we also discuss how a Thompson sampling or UCB based method will not help us meet the target objectives. Finally, simulation results are provided on an LTE system simulator and thereby prove the efficacy of our proposed algorithm.


Perturbation Training for Human-Robot Teams

Journal of Artificial Intelligence Research

In this work, we design and evaluate a computational learning model that enables a human-robot team to co-develop joint strategies for performing novel tasks that require coordination. The joint strategies are learned through "perturbation training," a human team-training strategy that requires team members to practice variations of a given task to help their team generalize to new variants of that task. We formally define the problem of human-robot perturbation training and develop and evaluate the first end-to-end framework for such training, which incorporates a multi-agent transfer learning algorithm, human-robot co-learning framework and communication protocol. Our transfer learning algorithm, Adaptive Perturbation Training (AdaPT), is a hybrid of transfer and reinforcement learning techniques that learns quickly and robustly for new task variants. We empirically validate the benefits of AdaPT through comparison to other hybrid reinforcement and transfer learning techniques aimed at transferring knowledge from multiple source tasks to a single target task. We also demonstrate that AdaPT's rapid learning supports live interaction between a person and a robot, during which the human-robot team trains to achieve a high level of performance for new task variants. We augment AdaPT with a co-learning framework and a computational bi-directional communication protocol so that the robot can co-train with a person during live interaction. Results from large-scale human subject experiments (n=48) indicate that AdaPT enables an agent to learn in a manner compatible with a human's own learning process, and that a robot undergoing perturbation training with a human results in a high level of team performance. Finally, we demonstrate that human-robot training using AdaPT in a simulation environment produces effective performance for a team incorporating an embodied robot partner.


Advantages and Limitations of using Successor Features for Transfer in Reinforcement Learning

arXiv.org Machine Learning

One question central to Reinforcement Learning is how to learn a feature representation that supports algorithm scaling and re-use of learned information from different tasks. Successor Features approach this problem by learning a feature representation that satisfies a temporal constraint. We present an implementation of an approach that decouples the feature representation from the reward function, making it suitable for transferring knowledge between domains. We then assess the advantages and limitations of using Successor Features for transfer.


How businesses can leverage reinforcement learning?

#artificialintelligence

It branches out from Artificial Intelligence and is classified as a Machine Learning type. Leveraging reinforcement learning, software agents and machines are made to ascertain the ideal behavior in a specific context with the aim of maximizing its performance. When the learning agent acts on trial and error, it is termed as exploration, and when it acts based on the knowledge gained from the environment, it is referred to as exploitation. The environment rewards the agent for correct actions, which is the reinforcement signal. Leveraging the rewards obtained, the agent improves its environment knowledge to select the next action.


Black Hat USA 2017: Machine learning is not a silver bullet for security - SD Times

#artificialintelligence

The framework takes a game-like approach, accesses the system, learns about the system, and figures out how it can be attacked and how it can evade an attack. "Reinforcement learning has produced models that top human performance in a myriad of games. Using similar techniques, our PE malware evasion technique can be framed as a competitive game between our agent and the machine learning model detector. Our agent inspects a PE file and selects a sequence of functionality-preserving mutations to the PE file which best evade the malware detection model. The agent learns through the experience of thousands of "games" against the detector, which sequence of actions is most likely to result in an [invasive variant.]


Batch Reinforcement Learning on the Industrial Benchmark: First Experiences

arXiv.org Artificial Intelligence

The Particle Swarm Optimization Policy (PSO-P) has been recently introduced and proven to produce remarkable results on interacting with academic reinforcement learning benchmarks in an off-policy, batch-based setting. To further investigate the properties and feasibility on real-world applications, this paper investigates PSO-P on the so-called Industrial Benchmark (IB), a novel reinforcement learning (RL) benchmark that aims at being realistic by including a variety of aspects found in industrial applications, like continuous state and action spaces, a high dimensional, partially observable state space, delayed effects, and complex stochasticity. The experimental results of PSO-P on IB are compared to results of closed-form control policies derived from the model-based Recurrent Control Neural Network (RCNN) and the model-free Neural Fitted Q-Iteration (NFQ). Experiments show that PSO-P is not only of interest for academic benchmarks, but also for real-world industrial applications, since it also yielded the best performing policy in our IB setting. Compared to other well established RL techniques, PSO-P produced outstanding results in performance and robustness, requiring only a relatively low amount of effort in finding adequate parameters or making complex design decisions.


DARLA: Improving Zero-Shot Transfer in Reinforcement Learning

arXiv.org Machine Learning

Domain adaptation is an important open problem in deep reinforcement learning (RL). In many scenarios of interest data is hard to obtain, so agents may learn a source policy in a setting where data is readily available, with the hope that it generalises well to the target domain. We propose a new multi-stage RL agent, DARLA (DisentAngled Representation Learning Agent), which learns to see before learning to act. DARLA's vision is based on learning a disentangled representation of the observed environment. Once DARLA can see, it is able to acquire source policies that are robust to many domain shifts - even with no access to the target domain. DARLA significantly outperforms conventional baselines in zero-shot domain adaptation scenarios, an effect that holds across a variety of RL environments (Jaco arm, DeepMind Lab) and base RL algorithms (DQN, A3C and EC).


Learning Sparse Representations in Reinforcement Learning with Sparse Coding

arXiv.org Machine Learning

A variety of representation learning approaches have been investigated for reinforcement learning; much less attention, however, has been given to investigating the utility of sparse coding. Outside of reinforcement learning, sparse coding representations have been widely used, with non-convex objectives that result in discriminative representations. In this work, we develop a supervised sparse coding objective for policy evaluation. Despite the non-convexity of this objective, we prove that all local minima are global minima, making the approach amenable to simple optimization strategies. We empirically show that it is key to use a supervised objective, rather than the more straightforward unsupervised sparse coding approach. We compare the learned representations to a canonical fixed sparse representation, called tile-coding, demonstrating that the sparse coding representation outperforms a wide variety of tilecoding representations.


Self-Correcting Models for Model-Based Reinforcement Learning

arXiv.org Artificial Intelligence

When an agent cannot represent a perfectly accurate model of its environment's dynamics, model-based reinforcement learning (MBRL) can fail catastrophically. Planning involves composing the predictions of the model; when flawed predictions are composed, even minor errors can compound and render the model useless for planning. Hallucinated Replay (Talvitie 2014) trains the model to "correct" itself when it produces errors, substantially improving MBRL with flawed models. This paper theoretically analyzes this approach, illuminates settings in which it is likely to be effective or ineffective, and presents a novel error bound, showing that a model's ability to self-correct is more tightly related to MBRL performance than one-step prediction error. These results inspire an MBRL algorithm for deterministic MDPs with performance guarantees that are robust to model class limitations.


Deep Reinforcement Learning with Successor Features for Navigation across Similar Environments

arXiv.org Artificial Intelligence

Autonomous navigation is one of the core problems in mobile robotics. It can roughly be characterized as the ability of a robot to get from its current position to a designated goal location solely based on the input it receives from its on-board sensors. A popular approach to this problem relies on the successful combination of a series of different algorithms for the problems of simultaneous localization and mapping (SLAM), localization in a given map as well as path planning and control, all of which often depend on additional information given to the agent. Although individually the problems of SLAM, localization, path planning and control are well understood [1], [2], [3], and a lot of progress has been made on learning control [4], they have mainly been treated as separable problems within robotics and some often require human assistance during setup-time. For example, the majority of SLAM solutions are implemented as passive procedures relying on special exploration strategies or a human controlling the robot for sensory data acquisition. In addition, they typically require an expert to check as to whether the obtained map is accurate enough for path planning and localization. Our goal in this paper is to make first steps towards a solution for navigation tasks without explicit localization, mapping and path planning procedures.