 Reinforcement Learning


Adversarial Reinforcement Learning for Observer Design in Autonomous Systems under Cyber Attacks

arXiv.org Machine Learning

Complex autonomous control systems are subject to sensor failures, cyber-attacks, sensor noise, communication channel failures, and similar disturbances that introduce errors in the measurements. The corrupted information, if used for making decisions, can lead to degraded performance. We develop a framework for using adversarial deep reinforcement learning to design observer strategies that are robust to adversarial errors in information channels. We further show through simulation studies that the learned observation strategies perform remarkably well when the adversary's injected errors are bounded in some sense. We use neural networks as function approximators in our studies, with the understanding that any other suitable class of function approximators can be used within our framework.


Sampled Policy Gradient for Learning to Play the Game Agar.io

arXiv.org Artificial Intelligence

In this paper, a new offline actor-critic learning algorithm is introduced: Sampled Policy Gradient (SPG). SPG samples in the action space to calculate an approximated policy gradient by using the critic to evaluate the samples. This sampling allows SPG to search the action-Q-value space more globally than deterministic policy gradient (DPG), enabling it to theoretically avoid more local optima. SPG is compared to Q-learning and the actor-critic algorithms CACLA and DPG in a pellet collection task and a self-play environment in the game Agar.io. The online game Agar.io has become massively popular on the internet due to its intuitive game design and the ability to instantly compete against players around the world. From the point of view of artificial intelligence this game is also very intriguing: it has a continuous input and action space and allows diverse agents with complex strategies to compete against each other. The experimental results show that Q-learning and CACLA outperform a pre-programmed greedy bot in the pellet collection task, but all algorithms fail to outperform this bot in a fighting scenario. The SPG algorithm is analyzed to have great extendability through offline exploration, and it matches DPG in performance even in its basic form without extensive sampling.
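
The sampling idea in the abstract above can be illustrated with a minimal sketch. It is one plausible reading of "samples in the action space ... using the critic to evaluate the samples", not the authors' implementation: the quadratic `critic`, the Gaussian perturbation, the one-dimensional action, and all parameter names are assumptions made for illustration.

```python
import random


def critic(state, action):
    # Hypothetical critic (an assumption): Q is highest when action == 0.5 * state.
    return -(action - 0.5 * state) ** 2


def sampled_target_action(state, actor_action, q_fn, n_samples=16, sigma=0.3, rng=None):
    """Return the best action among the actor's output and Gaussian
    perturbations of it, as judged by the critic q_fn."""
    if rng is None:
        rng = random.Random(0)  # fixed seed so the sketch is reproducible
    best_action = actor_action
    best_q = q_fn(state, actor_action)
    for _ in range(n_samples):
        candidate = actor_action + rng.gauss(0.0, sigma)
        q = q_fn(state, candidate)
        if q > best_q:
            best_action, best_q = candidate, q
    return best_action


state = 1.0
actor_action = 0.0  # the current (poor) actor output for this state
target = sampled_target_action(state, actor_action, critic)

# Move the actor output a small step toward the better sampled action.
# In a real actor-critic setup this would be a gradient step on the actor network.
learning_rate = 0.1
updated_action = actor_action + learning_rate * (target - actor_action)
```

Because the actor's own action is always one of the candidates, the selected target is never worse under the critic than the current action; with more samples the search covers a wider part of the action space.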


Incorporating Behavioral Constraints in Online AI Systems

arXiv.org Artificial Intelligence

AI systems that learn through reward feedback about the actions they take are increasingly deployed in domains that have significant impact on our daily life. However, in many cases the online rewards should not be the only guiding criteria, as there are additional constraints and/or priorities imposed by regulations, values, preferences, or ethical principles. We detail a novel online agent that learns a set of behavioral constraints by observation and uses these learned constraints as a guide when making decisions in an online setting while still being reactive to reward feedback. To define this agent, we propose to adopt a novel extension to the classical contextual multi-armed bandit setting and we provide a new algorithm called Behavior Constrained Thompson Sampling (BCTS) that allows for online learning while obeying exogenous constraints. Our agent learns a constrained policy that implements the observed behavioral constraints demonstrated by a teacher agent, and then uses this constrained policy to guide the reward-based online exploration and exploitation. We characterize the upper bound on the expected regret of the contextual bandit algorithm that underlies our agent and provide a case study with real world data in two application domains. Our experiments show that the designed agent is able to act within the set of behavior constraints without significantly degrading its overall reward performance.


Learning Robust Manipulation Skills with Guided Policy Search via Generative Motor Reflexes

arXiv.org Artificial Intelligence

Guided Policy Search enables robots to learn control policies for complex manipulation tasks efficiently. Therein, the control policies are represented as high-dimensional neural networks which derive robot actions based on states. However, due to the small number of real-world trajectory samples in Guided Policy Search, the resulting neural networks are only robust in the neighbourhood of the trajectory distribution explored by real-world interactions. In this paper, we present a new policy representation called Generative Motor Reflexes, which is able to generate robust actions over a broader state space compared to previous methods. In contrast to prior state-action policies, Generative Motor Reflexes map states to parameters for a state-dependent motor reflex, which is then used to derive actions. Robustness is achieved by generating similar motor reflexes for arbitrary states. We evaluate the presented method in simulated and real-world manipulation tasks, including contact-rich peg-in-hole tasks. Using these evaluation tasks, we show that policies represented as Generative Motor Reflexes lead to robust manipulation skills even outside the explored trajectory distribution, with lower training needs than previous methods. Therefore, the presented approach serves as a step towards reliable applications of reinforcement learning for manipulation.


Online Cyber-Attack Detection in Smart Grid: A Reinforcement Learning Approach

arXiv.org Machine Learning

Early detection of cyber-attacks is crucial for a safe and reliable operation of the smart grid. In the literature, outlier detection schemes making sample-by-sample decisions and online detection schemes requiring perfect attack models have been proposed. In this paper, we formulate the online attack/anomaly detection problem as a partially observable Markov decision process (POMDP) problem and propose a universal robust online detection algorithm using the framework of model-free reinforcement learning (RL) for POMDPs. Numerical studies illustrate the effectiveness of the proposed RL-based algorithm in timely and accurate detection of cyber-attacks targeting the smart grid.
A. Background and Related Work
The next generation power grid, i.e., the smart grid, relies on advanced control and communication technologies. This critical cyber infrastructure makes the smart grid vulnerable to hostile cyber-attacks [1]-[3]. The main objective of attackers is to damage or mislead the state estimation mechanism in the smart grid to cause wide-area power blackouts or to manipulate electricity market prices [4]. There are many types of cyber-attacks; among them, false data injection (FDI), jamming, and denial of service (DoS) attacks are well known. FDI attacks add malicious fake data to meter measurements [5]-[8], jamming attacks corrupt meter measurements via additive noise [9], and DoS attacks block the system's access to meter measurements [8], [10], [11]. The smart grid is a complex network, and any failure or anomaly in a part of the system may lead to huge damage to the overall system in a short period of time. Hence, early detection of cyber-attacks is critical for a timely and effective response. In this context, the framework of quickest change detection [12]-[15] is quite useful.
In quickest change detection problems, a change occurs in the sensing environment at an unknown time, and the aim is to detect the change as soon as possible with a minimal level of false alarms, based on measurements that become available sequentially over time. After obtaining measurements at a given time, the decision maker either declares a change or waits for the next time interval to obtain further measurements.
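
The declare-or-wait structure of quickest change detection described above can be illustrated with the classical CUSUM test for an upward mean shift. This is a textbook baseline shown only as a sketch; it is not the RL-based detector proposed in the paper, and the `drift` and `threshold` values as well as the synthetic measurements are assumptions chosen for illustration.

```python
def cusum_alarm_time(samples, drift, threshold):
    """Process measurements one at a time; return the index at which a
    change is declared, or None if no alarm is raised."""
    stat = 0.0
    for t, x in enumerate(samples):
        # Accumulate evidence of an upward shift; clamp at zero so that
        # pre-change samples cannot build up negative "credit".
        stat = max(0.0, stat + x - drift)
        if stat > threshold:
            return t  # declare a change
        # otherwise wait for the next measurement
    return None


# Synthetic, noise-free measurements (an assumption): the mean jumps
# from 0 to 1 at time index 20.
pre_change = [0.0] * 20
post_change = [1.0] * 20
alarm = cusum_alarm_time(pre_change + post_change, drift=0.5, threshold=2.0)
delay = alarm - len(pre_change)
```

Raising `threshold` lowers the false alarm rate on noisy data at the cost of a longer detection delay, which is the trade-off the quickest change detection formulation makes explicit.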


Robustness of Adaptive Quantum-Enhanced Phase Estimation

arXiv.org Machine Learning

Because all physical adaptive quantum-enhanced metrology (AQEM) schemes operate under noisy conditions with only partially understood noise characteristics, a practical control policy must be robust even to unknown noise. We aim to devise a test to evaluate the robustness of AQEM policies and assess the resources used by the policies. The robustness test is performed on adaptive phase estimation by simulating the scheme under four phase noise models corresponding to normal-distribution noise, random telegraph noise, skew-normal-distribution noise, and log-normal-distribution noise. The control policies are devised either by a reinforcement-learning algorithm in the same noise condition, albeit ignorant of its properties, or by a Bayesian-based feedback method that assumes no noise. Our robustness test and resource comparison can be used to determine the efficacy of candidate policies and to select a suitable one.


VPE: Variational Policy Embedding for Transfer Reinforcement Learning

arXiv.org Machine Learning

Reinforcement learning methods are capable of solving complex problems, but the resulting policies might perform poorly in environments that are even slightly different. In robotics especially, training and deployment conditions often vary and data collection is expensive, making retraining undesirable. Simulation training allows for feasible training times, but suffers from a reality gap when applied in real-world settings. This raises the need for efficient adaptation of policies acting in new environments. We consider this as a problem of transferring knowledge within a family of similar Markov decision processes. For this purpose we assume that Q-functions are generated by some low-dimensional latent variable. Given such a Q-function, we can find a master policy that can adapt given different values of this latent variable. Our method learns both the generative mapping and an approximate posterior of the latent variables, enabling identification of policies for new tasks by searching only in the latent space, rather than the space of all policies. The low-dimensional space and master policy found by our method enable policies to quickly adapt to new environments. We demonstrate the method on both a pendulum swing-up task in simulation and simulation-to-real transfer on a pushing task.


Towards Better Interpretability in Deep Q-Networks

arXiv.org Machine Learning

Deep reinforcement learning techniques have demonstrated superior performance in a wide variety of environments. While improvements in training algorithms continue at a brisk pace, theoretical and empirical studies on understanding what these networks seem to learn are far behind. In this paper we propose an interpretable neural network architecture for Q-learning which provides a global explanation of the model's behavior using key-value memories, attention, and reconstructible embeddings. With a directed exploration strategy, our model can reach training rewards comparable to the state-of-the-art deep Q-learning models. However, results suggest that the features extracted by the neural network are extremely shallow, and subsequent testing using out-of-sample examples shows that the agent can easily overfit to trajectories seen during training.


CM3: Cooperative Multi-goal Multi-stage Multi-agent Reinforcement Learning

arXiv.org Machine Learning

We propose CM3, a new deep reinforcement learning method for cooperative multi-agent problems where agents must coordinate for joint success in achieving different individual goals. We restructure multi-agent learning into a two-stage curriculum, consisting of a single-agent stage for learning to accomplish individual tasks, followed by a multi-agent stage for learning to cooperate in the presence of other agents. These two stages are bridged by modular augmentation of neural network policy and value functions. We further adapt the actor-critic framework to this curriculum by formulating local and global views of the policy gradient and learning via a double critic, consisting of a decentralized value function and a centralized action-value function. We evaluated CM3 on a new high-dimensional multi-agent environment with sparse rewards: negotiating lane changes among multiple autonomous vehicles in the Simulation of Urban Mobility (SUMO) traffic simulator. Detailed ablation experiments show the positive contribution of each component in CM3, and the overall synthesis converges significantly faster to higher performance policies than existing cooperative multi-agent methods.


Model-Based Reinforcement Learning via Meta-Policy Optimization

arXiv.org Machine Learning

Model-based reinforcement learning approaches carry the promise of being data efficient. However, due to challenges in learning dynamics models that sufficiently match the real-world dynamics, they struggle to achieve the same asymptotic performance as model-free methods. We propose Model-Based Meta-Policy-Optimization (MB-MPO), an approach that forgoes the strong reliance on accurate learned dynamics models. Using an ensemble of learned dynamics models, MB-MPO meta-learns a policy that can quickly adapt to any model in the ensemble with one policy gradient step. This steers the meta-policy towards internalizing consistent dynamics predictions among the ensemble while shifting the burden of behaving optimally w.r.t. the model discrepancies towards the adaptation step. Our experiments show that MB-MPO is more robust to model imperfections than previous model-based approaches. Finally, we demonstrate that our approach is able to match the asymptotic performance of model-free methods while requiring significantly less experience.