Reinforcement Learning
Autonomous Penetration Testing using Reinforcement Learning
Schwartz, Jonathon, Kurniawati, Hanna
Penetration testing (pentesting) involves performing a controlled attack on a computer system in order to assess it's security. Although an effective method for testing security, pentesting requires highly skilled practitioners and currently there is a growing shortage of skilled cyber security professionals. One avenue for alleviating this problem is automate the pentesting process using artificial intelligence techniques. Current approaches to automated pentesting have relied on model-based planning, however the cyber security landscape is rapidly changing making maintaining up-to-date models of exploits a challenge. This project investigated the application of model-free Reinforcement Learning (RL) to automated pentesting. Model-free RL has the key advantage over model-based planning of not requiring a model of the environment, instead learning the best policy through interaction with the environment. We first designed and built a fast, low compute simulator for training and testing autonomous pentesting agents. We did this by framing pentesting as a Markov Decision Process with the known configuration of the network as states, the available scans and exploits as actions, the reward determined by the value of machines on the network. We then used this simulator to investigate the application of model-free RL to pentesting. We tested the standard Q-learning algorithm using both tabular and neural network based implementations. We found that within the simulated environment both tabular and neural network implementations were able to find optimal attack paths for a range of different network topologies and sizes without having a model of action behaviour. However, the implemented algorithms were only practical for smaller networks and numbers of actions. Further work is needed in developing scalable RL algorithms and testing these algorithms in larger and higher fidelity environments.
Synthesis of Provably Correct Autonomy Protocols for Shared Control
Cubuktepe, Murat, Jansen, Nils, Alsiekh, Mohammed, Topcu, Ufuk
We synthesize shared control protocols subject to probabilistic temporal logic specifications. More specifically, we develop a framework in which a human and an autonomy protocol can issue commands to carry out a certain task. We blend these commands into a joint input to a robot. We model the interaction between the human and the robot as a Markov decision process (MDP) that represents the shared control scenario. Using inverse reinforcement learning, we obtain an abstraction of the human's behavior and decisions. We use randomized strategies to account for randomness in human's decisions, caused by factors such as complexity of the task specifications or imperfect interfaces. We design the autonomy protocol to ensure that the resulting robot behavior satisfies given safety and performance specifications in probabilistic temporal logic. Additionally, the resulting strategies generate behavior as similar to the behavior induced by the human's commands as possible. We solve the underlying problem efficiently using quasiconvex programming. Case studies involving autonomous wheelchair navigation and unmanned aerial vehicle mission planning showcase the applicability of our approach.
A Contextual-Based Framework for Opinion Formation
Santos, Eugene (Dartmouth College) | Nyanhongo, Clement (Dartmouth College)
During opinion formation, interacting agents can be assumed to be engaging in learning and decision-making processes to satisfy their individual goals. These goals are determined by the agentsโ preferences โ which are often unknown, complex and unpredictable. Most opinion formation frameworks however, assume static preferences and fail to model practical situations where human preferences change. We propose a new framework to simulate the process of opinion formation under uncertainty and dynamism. Agents who are unaware of their implicit con-textual preferences utilize inverse reinforcement learning to compute reward functions that determines their preferences. Reinforcement learning is subsequently used to optimize the agentsโ behavior and satisfy their individual goals. The novelty of our approach lies in its ability to capture uncertainty and dynamism in the agentโs preferences, which are assumed to be unknown initially. This framework is compared to a baseline method based on reinforcement learning, and results show its ability to per-form better under dynamic scenarios.
Explaining Reward Functions in Markov Decision Processes
Russell, Jacob (Dartmouth College) | Santos, Eugene (Dartmouth College)
Rewards in Markov Decision Processes (MDP) define the behavior of the model. Without a clear interpretation of what the reward function is and is not capturing, one cannot trust their model nor diagnose when the model is giving incorrect recommendations. Increasing complexity of state-of-the-art models used to represent the reward function and model-free methods that attempt to avoid representing this function make trusting the model much more difficult. We map these reward functions onto a standard classification problem where we can explain what factors the model considers in making decisions in local and global contexts and quantify whether the fit of the reward function is likely to be good for explaining the behavior of the model. We evaluate our proof-of-concept on both the standard version and a modified version of the Object World domain to add more nonlinearity.
A Deep Reinforcement Learn-Based FIFA Agent with Naive State Representations and Portable Connection Interfaces
Faria, Matheus Prado Prandini (Federal University of Uberlรขndi) | Julia, Rita Maria Silva (Federal University of Uberlรขndi) | Tomaz, Lรญdia Bononi Paiva (Federal University of Triรขngulo Mineiroi)
Video games have proved to be a very defying laboratory to study machine-learning techniques, such as Deep Reinforcement Learning (DRL) algorithms. This paper presents a new approach for a DRL-based agent trained through Deep Q-Network (DQN) technique to perform free kicks in FIFA 18 game. The main motivation for choosing this case study is the fact that, like in many situations of the real life, FIFA represents a stochastic environment. Coping with this task, the main contributions of the present paper consist on: inspired on the OpenAI Gym and on the OpenAI Universe platforms, implementing a new user-friendly interface (in terms of portability and use simplicity) to connect the learning module with the 3D FIFA's game environment; implementing a DRL-based agent for free kicks in FIFA that uses two distinct data representations retrieved from lower cost computational procedures. The results were validated through two evaluative parameters: score of well succeed kicks and training time.
Generative NNI Transformation Strategies in Binary Trees Using Reinforcement Learning
Shirvani, Shirin (University of Texas at Arlington) | Huber, Manfred (University of Texas at Arlington)
Learning strategies to address problems on graph and tree structures with no a-priori size limitations in cases where no known solution exists (and thus supervised data is hard to obtain), is a difficult problem with potential applications in a wide range of domains ranging from biological networks to protein folding and social network search. The main challenges here arise from the variable size representation that needs to be resolved in the context of Reinforcement Learning (RL) to address the problem. In this paper we consider a common, specific tree problem and show that it can be addressed using a combination of feature engineering and carefully designed learning processes. In particular, We consider the classical Nearest Neighbor Interchange (NNI) distance between unrooted labeled trees, which is defined as the minimum-cost sequence of operations that transform one tree into another. We introduce a representation and a reinforcement learning method that learns the transition dynamics and iteratively changes an arbitrary initial labeled tree into a goal configuration reachable through NNI. The differential tree representation and NNI actions permits the system to learn a strategy that is applicable to arbitrary sized trees. To train the system, we introduce a training process that uses randomly sampled trajectories to incrementally train more and more complex problems to overcome the difficulty of the overall strategy space. Experiments performed show that the system can successfully learn a strategy for effective NNI on complex trees.
Improving Safety in Reinforcement Learning Using Model-Based Architectures and Human Intervention
Prakash, Bharat (University of Maryland, Baltimore County) | Khatwani, Mohit (University of Maryland, Baltimore County) | Waytowich, Nicholas (US Army Research Laboratory) | Mohsenin, Tinoosh (University of Maryland, Baltimore County)
Recent progress in AI and Reinforcement learning has shown great success in solving complex problems with high dimensional state spaces. However, most of these successes have been primarily in simulated environments where failure is of little or no consequence. Most real-world applications, however, require training solutions that are safe to operate as catastrophic failures are inadmissible especially when there is human interaction involved. Currently, Safe RL systems use human oversight during training and exploration in order to make sure the RL agent does not go into a catastrophic state. These methods require a large amount of human labor and it is very difficult to scale up. We present a hybrid method for reducing the human intervention time by combining model-based approaches and training a supervised learner to to improve sample efficiency while also ensuring safety. We evaluate these methods on various grid-world environments using both standard and visual representations and show that our approach achieves better performance in terms of sample efficiency, number of catastrophic states reached as well as overall task performance compared to traditional model-free approaches.
Meta reinforcement learning as task inference
Humplik, Jan, Galashov, Alexandre, Hasenclever, Leonard, Ortega, Pedro A., Teh, Yee Whye, Heess, Nicolas
Humans achieve efficient learning by relying on prior knowledge about the structure of naturally occurring tasks. There has been considerable interest in designing reinforcement learning algorithms with similar properties. This includes several proposals to learn the learning algorithm itself, an idea also referred to as meta learning. One formal interpretation of this idea is in terms of a partially observable multi-task reinforcement learning problem in which information about the task is hidden from the agent. Although agents that solve partially observable environments can be trained from rewards alone, shaping an agent's memory with additional supervision has been shown to boost learning efficiency. It is thus natural to ask what kind of supervision, if any, facilitates meta-learning. Here we explore several choices and develop an architecture that separates learning of the belief about the unknown task from learning of the policy, and that can be used effectively with privileged information about the task during training. We show that this approach can be very effective at solving standard meta-RL environments, as well as a complex continuous control environment in which a simulated robot has to execute various movement sequences.
Reinforcement Learning for Robotics and Control with Active Uncertainty Reduction
Patwardhan, Narendra, Wang, Zequn
Model-free reinforcement learning based methods such as Proximal Policy Optimization, or Q-learning typically require thousands of interactions with the environment to approximate the optimum controller which may not always be feasible in robotics due to safety and time consumption. Model-based methods such as PILCO or BlackDrops, while data-efficient, provide solutions with limited robustness and complexity. To address this tradeoff, we introduce active uncertainty reduction-based virtual environments, which are formed through limited trials conducted in the original environment. We provide an efficient method for uncertainty management, which is used as a metric for self-improvement by identification of the points with maximum expected improvement through adaptive sampling. Capturing the uncertainty also allows for better mimicking of the reward responses of the original system. Our approach enables the use of complex policy structures and reward functions through a unique combination of model-based and model-free methods, while still retaining the data efficiency. We demonstrate the validity of our method on several classic reinforcement learning problems in OpenAI gym. We prove that our approach offers a better modeling capacity for complex system dynamics as compared to established methods.
Stochastic approximation with cone-contractive operators: Sharp $\ell_\infty$-bounds for $Q$-learning
Motivated by the study of $Q$-learning algorithms in reinforcement learning, we study a class of stochastic approximation procedures based on operators that satisfy monotonicity and quasi-contractivity conditions with respect to an underlying cone. We prove a general sandwich relation on the iterate error at each time, and use it to derive non-asymptotic bounds on the error in terms of a cone-induced gauge norm. These results are derived within a deterministic framework, requiring no assumptions on the noise. We illustrate these general bounds in application to synchronous $Q$-learning for discounted Markov decision processes with discrete state-action spaces, in particular by deriving non-asymptotic bounds on the $\ell_\infty$-norm for a range of stepsizes. These results are the sharpest known to date, and we show via simulation that the dependence of our bounds cannot be improved in a worst-case sense. These results show that relative to a model-based $Q$-iteration, the $\ell_\infty$-based sample complexity of $Q$-learning is suboptimal in terms of the discount factor $\gamma$.