Goto

Collaborating Authors

 Markov Models


Reinforcement Learning for Learning of Dynamical Systems in Uncertain Environment: a Tutorial

arXiv.org Artificial Intelligence

In this paper, a review of model-free reinforcement learning for learning of dynamical systems in uncertain environments has discussed. For this purpose, the Markov Decision Process (MDP) will be reviewed. Furthermore, some learning algorithms such as Temporal Difference (TD) learning, Q-Learning, and Approximate Q-learning as model-free algorithms which constitute the main part of this article have been investigated, and benefits and drawbacks of each algorithm will be discussed. The discussed concepts in each section are explaining with details and examples.


Online Convex Optimization in Adversarial Markov Decision Processes

arXiv.org Machine Learning

We consider online learning in episodic loopfree We propose a novel algorithm for the adversarial MDP Markov decision processes (MDPs), where model where the transition function is unknown to the the loss function can change arbitrarily between learner and the losses change arbitrarily over time. Our episodes, and the transition function is not known algorithm, UC-O-REPS, uses two important ingredients, to the learner. We show ร•(L X A T) regret the first is Online Mirror Descent (OMD) (Shalev-Shwartz, bound, where T is the number of episodes, X 2012) and the second is UCRL-2 (Auer et al., 2008). A is the state space, A is the action space, and L major challenge in this work is to handle convex performance is the length of each episode. Our online algorithm criteria, which model different ways of aggregating is implemented using entropic regularization the losses of each episode. In order to handle convex performance methodology, which allows to extend the criteria, we use the methodology of OMD, which is original adversarial MDP model to handle convex widely used for online convex optimization, and we implement performance criteria (different ways to aggregate it in the adversarial MDP setting. In order to overcome the losses of a single episode), as well as the unknown dynamics (stochastic transition function) improve previous regret bounds.


Evolving Rewards to Automate Reinforcement Learning

arXiv.org Machine Learning

Many continuous control tasks have easily formulated objectives, yet using them directly as a reward in reinforcement learning (RL) leads to suboptimal policies. Therefore, many classical control tasks guide RL training using complex rewards, which require tedious hand-tuning. We automate the reward search with AutoRL, an evolutionary layer over standard RL that treats reward tuning as hyperparameter optimization and trains a population of RL agents to find a reward that maximizes the task objective. AutoRL, evaluated on four Mujoco continuous control tasks over two RL algorithms, shows improvements over baselines, with the the biggest uplift for more complex tasks. The video can be found at: \url{https://youtu.be/svdaOFfQyC8}.


Optimizing Sequential Medical Treatments with Auto-Encoding Heuristic Search in POMDPs

arXiv.org Artificial Intelligence

Health-related data is noisy and stochastic in implying the true physiological states of patients, limiting information contained in single-moment observations for sequential clinical decision making. We model patient-clinician interactions as partially observable Markov decision processes (POMDPs) and optimize sequential treatment based on belief states inferred from history sequence. To facilitate inference, we build a variational generative model and boost state representation with a recurrent neural network (RNN), incorporating an auxiliary loss from sequence auto-encoding. Meanwhile, we optimize a continuous policy of drug levels with an actor-critic method where policy gradients are obtained from a stablized off-policy estimate of advantage function, with the value of belief state backed up by parallel best-first suffix trees. We exploit our methodology in optimizing dosages of vasopressor and intravenous fluid for sepsis patients using a retrospective intensive care dataset and evaluate the learned policy with off-policy policy evaluation (OPPE). The results demonstrate that modelling as POMDPs yields better performance than MDPs, and that incorporating heuristic search improves sample efficiency.


Exploration-Exploitation Trade-off in Reinforcement Learning on Online Markov Decision Processes with Global Concave Rewards

arXiv.org Machine Learning

We consider an agent who is involved in a Markov decision process and receives a vector of outcomes every round. Her objective is to maximize a global concave reward function on the average vectorial outcome. The problem models applications such as multi-objective optimization, maximum entropy exploration, and constrained optimization in Markovian environments. In our general setting where a stationary policy could have multiple recurrent classes, the agent faces a subtle yet consequential trade-off in alternating among different actions for balancing the vectorial outcomes. In particular, stationary policies are in general sub-optimal. We propose a no-regret algorithm based on online convex optimization (OCO) tools (Agrawal and Devanur 2014) and UCRL2 (Jaksch et al. 2010). Importantly, we introduce a novel gradient threshold procedure, which carefully controls the switches among actions to handle the subtle trade-off. By delaying the gradient updates, our procedure produces a non-stationary policy that diversifies the outcomes for optimizing the objective. The procedure is compatible with a variety of OCO tools.


Emergence in Multi-Agent Systems

AAAI Conferences

In a multiagent system or MAS, due to agent interactions, the agents as a group may make decisions that none of them would make alone; this phenomenon is called emergence. Emergence is characterized by an unanticipated system behavior caused by nonlinear interactions. This paper detects such emergence in a MAS by analyzing agent behaviors across two simple strategies. In the first strategy, agents make decisions based on the local information; in the second strategy, agents make decisions based on global information provided via communication. The proposed method identifies when and how nonlinear interactions cause behavior change, and quantitatively defines emergence based on the change in team performance. It then proves several theorems about emergence in a MAS. It also explores several emergence-related factors like the communication cost and the reward gap quantitatively. Experimental results on several benchmarks demonstrate the promising performance of the proposed framework in detecting emergence in a MAS.


Autonomous Penetration Testing using Reinforcement Learning

arXiv.org Artificial Intelligence

Penetration testing (pentesting) involves performing a controlled attack on a computer system in order to assess it's security. Although an effective method for testing security, pentesting requires highly skilled practitioners and currently there is a growing shortage of skilled cyber security professionals. One avenue for alleviating this problem is automate the pentesting process using artificial intelligence techniques. Current approaches to automated pentesting have relied on model-based planning, however the cyber security landscape is rapidly changing making maintaining up-to-date models of exploits a challenge. This project investigated the application of model-free Reinforcement Learning (RL) to automated pentesting. Model-free RL has the key advantage over model-based planning of not requiring a model of the environment, instead learning the best policy through interaction with the environment. We first designed and built a fast, low compute simulator for training and testing autonomous pentesting agents. We did this by framing pentesting as a Markov Decision Process with the known configuration of the network as states, the available scans and exploits as actions, the reward determined by the value of machines on the network. We then used this simulator to investigate the application of model-free RL to pentesting. We tested the standard Q-learning algorithm using both tabular and neural network based implementations. We found that within the simulated environment both tabular and neural network implementations were able to find optimal attack paths for a range of different network topologies and sizes without having a model of action behaviour. However, the implemented algorithms were only practical for smaller networks and numbers of actions. Further work is needed in developing scalable RL algorithms and testing these algorithms in larger and higher fidelity environments.


Synthesis of Provably Correct Autonomy Protocols for Shared Control

arXiv.org Artificial Intelligence

We synthesize shared control protocols subject to probabilistic temporal logic specifications. More specifically, we develop a framework in which a human and an autonomy protocol can issue commands to carry out a certain task. We blend these commands into a joint input to a robot. We model the interaction between the human and the robot as a Markov decision process (MDP) that represents the shared control scenario. Using inverse reinforcement learning, we obtain an abstraction of the human's behavior and decisions. We use randomized strategies to account for randomness in human's decisions, caused by factors such as complexity of the task specifications or imperfect interfaces. We design the autonomy protocol to ensure that the resulting robot behavior satisfies given safety and performance specifications in probabilistic temporal logic. Additionally, the resulting strategies generate behavior as similar to the behavior induced by the human's commands as possible. We solve the underlying problem efficiently using quasiconvex programming. Case studies involving autonomous wheelchair navigation and unmanned aerial vehicle mission planning showcase the applicability of our approach.


Integrating Typed Model Counting into First-Order Maximum Entropy Computations and the Connection to Markov Logic Networks

AAAI Conferences

The principle of maximum entropy (MaxEnt) provides a well-founded methodology for commonsense reasoning based on probabilistic conditional knowledge. We show how to calculate MaxEnt distributions in a first-order setting by using typed model counting and condensed iterative scaling. Further, we discuss the connection to Markov Logic Networks for drawing inferences.


Modeling the Dynamics of User Preferences for Sequence-Aware Recommendation Using Hidden Markov Models

AAAI Conferences

In a variety of online settings involving interaction with end-users it is critical for the systems to adapt to changes in user preference. User preferences on items tend to change over time due to a variety of factors such as change in context, the task being performed, or other short-term or long-term external factors. Recommender systems, in particular need to be able to capture these dynamics in user preferences in order to remain tuned to the most current interests of users. In this work we present a recommendation framework which takes into account the dynamics of user preferences. We propose an approach based on Hidden Markov Models (HMM) to identify change-points in the sequence of user interactions which reflect significant changes in preference according to the sequential behavior of all the users in the data. The proposed framework leverages the identified change points to generate recommendations using a sequence-aware non-negative matrix factorization model. We empirically demonstrate the effectiveness of the HMM-based change detection method as compared to standard baseline methods. Additionally, we evaluate the performance of the proposed recommendation method and show that it compares favorably to state-of-the-art sequence-aware recommendation models.