Markov Models
Safe Exploration in Markov Decision Processes with Time-Variant Safety using Spatio-Temporal Gaussian Process
Wachi, Akifumi, Kajino, Hiroshi, Munawar, Asim
In many real-world applications (e.g., planetary exploration, robot navigation), an autonomous agent must be able to explore a space with guaranteed safety. Most safe exploration algorithms in the field of reinforcement learning and robotics have been based on the assumption that the safety features are a priori known and time-invariant. This paper presents a learning algorithm called ST-SafeMDP for exploring Markov decision processes (MDPs) that is based on the assumption that the safety features are a priori unknown and time-variant. In this setting, the agent explores MDPs while keeping the probability of entering unsafe states, as defined by a safety function, below a threshold. The unknown and time-variant safety values are modeled using a spatio-temporal Gaussian process. However, there remains an issue that an agent may have no viable action in a shrinking true safe space. To address this issue, we formulate a problem maximizing the cumulative number of safe states in the worst-case scenario with respect to future observations. The effectiveness of this approach was demonstrated in two simulation settings, including one using real lunar terrain data.
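As a rough illustration of the idea in the abstract above, the sketch below pairs a product kernel over space and time with a lower-confidence-bound safety test. It is not the ST-SafeMDP algorithm itself: the RBF kernels, the length scales, the `beta` multiplier, and all function names are assumptions chosen for illustration, and the posterior mean and standard deviation are taken as given rather than computed from data.

```python
import math


def rbf(a, b, length_scale):
    """Squared-exponential similarity between two scalars."""
    return math.exp(-0.5 * ((a - b) / length_scale) ** 2)


def st_kernel(x1, x2, ls_space=1.0, ls_time=5.0):
    """Hypothetical spatio-temporal kernel: spatial RBF times temporal RBF.

    Each input is a (position, time) pair. The product form means two
    observations are correlated only if they are close in BOTH space and time.
    """
    (s1, t1), (s2, t2) = x1, x2
    return rbf(s1, s2, ls_space) * rbf(t1, t2, ls_time)


def safe_set(mean, std, threshold, beta=2.0):
    """Indices of states whose lower confidence bound clears the threshold.

    mean[i], std[i] are assumed to be the GP posterior mean and standard
    deviation of the safety value at state i for the time step of interest.
    """
    return [
        i
        for i, (m, s) in enumerate(zip(mean, std))
        if m - beta * s >= threshold
    ]


# Made-up posterior values for three states.
mean = [0.9, 0.5, 0.2]
std = [0.1, 0.3, 0.02]
print(safe_set(mean, std, threshold=0.1))
```

State 1 has the second-highest mean but is excluded because its uncertainty is large; this is the conservative behavior a probabilistic safety constraint is meant to produce.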
Endowing Robots with Longer-term Autonomy by Recovering from External Disturbances in Manipulation through Grounded Anomaly Classification and Recovery Policies
Wu, Hongmin, Luo, Shuangqi, Chen, Longxin, Duan, Shuangda, Chumkamon, Sakmongkon, Liu, Dong, Guan, Yisheng, Rojas, Juan
Robot manipulation is increasingly poised to interact with humans in co-shared workspaces. Despite increasingly robust manipulation and control algorithms, failure modes continue to exist whenever models do not capture the dynamics of the unstructured environment. To obtain longer-term horizons in robot automation, robots must develop introspection and recovery abilities. We contribute a set of recovery policies to deal with anomalies produced by external disturbances, as well as anomaly classification through the use of non-parametric statistics with memoized variational inference with scalable adaptation. A recovery critic sits atop a tightly integrated, graph-based online motion-generation and introspection system that resolves a wide range of anomalous situations. Policies, skills, and introspection models are learned incrementally and contextually in a task. Two task-level recovery policies, re-enactment and adaptation, resolve accidental and persistent anomalies, respectively. The introspection system uses non-parametric priors along with Markov jump linear systems and memoized variational inference with scalable adaptation to learn a model from the data. In extensive real-robot experiments, various strenuous anomalous conditions are induced and resolved at different phases of a task and in different combinations. The system executes around-the-clock introspection and recovery and even elicits self-recovery when misclassifications occur.
Collapsed Variational Inference for Nonparametric Bayesian Group Factor Analysis
Group factor analysis (GFA) methods have been widely used to infer the common structure and the group-specific signals from multiple related datasets in various fields including systems biology and neuroimaging. To date, most available GFA models require Gibbs sampling or slice sampling to perform inference, which prevents the practical application of GFA to large-scale data. In this paper we present an efficient collapsed variational inference (CVI) algorithm for the nonparametric Bayesian group factor analysis (NGFA) model built upon a hierarchical beta-Bernoulli process. Our CVI algorithm proceeds by marginalizing out the group-specific beta process parameters, and then approximating the true posterior in the collapsed space using mean-field methods. Experimental results on both synthetic and real-world data demonstrate the effectiveness of our CVI algorithm for the NGFA compared with state-of-the-art GFA methods.
Energy Disaggregation via Deep Temporal Dictionary Learning
Khodayar, Mahdi, Wang, Jianhui, Wang, Zhaoyu
This paper addresses the energy disaggregation problem, i.e., decomposing the electricity signal of a whole home into the signals of its operating devices. First, we cast the problem as a dictionary learning (DL) problem where the key electricity patterns representing consumption behaviors are extracted for each device and stored in a dictionary matrix. The electricity signal of each device is then modeled by a linear combination of such patterns with sparse coefficients that determine the contribution of each device to the total electricity. Although popular, the classic DL approach is prone to high error in real-world applications including energy disaggregation, as it merely finds linear dictionaries. Moreover, this method lacks a recurrent structure; thus, it is unable to leverage the temporal structure of energy signals. Motivated by such shortcomings, we propose a novel optimization program where the dictionary and its sparse coefficients are optimized simultaneously with a deep neural model extracting powerful nonlinear features from the energy signals. A long short-term memory auto-encoder (LSTM-AE) is proposed with tunable time-dependent states to capture the temporal behavior of energy signals for each device. We learn the dictionary in the space of temporal features captured by the LSTM-AE rather than the original space of the energy signals; hence, in contrast to traditional DL, a nonlinear dictionary is learned using powerful temporal features extracted from our deep model. Experiments on real data from the publicly available Reference Energy Disaggregation Dataset (REDD) show significant improvement compared to state-of-the-art methodologies in terms of disaggregation accuracy and F-score.
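The classic linear dictionary model that the abstract above starts from can be sketched in a few lines. This shows only the baseline formulation (each device signal as a sparse linear combination of dictionary atoms, and the whole-home signal as the sum over devices), not the proposed LSTM-AE method. The atoms, coefficients, and function names are made up for illustration; `soft_threshold` is the standard proximal operator of the L1 norm, which is one common way to encourage sparse coefficients.

```python
import math


def reconstruct(dictionary, coeffs):
    """Device signal as a linear combination of dictionary atoms.

    dictionary: list of atoms, each a list of T samples.
    coeffs: one weight per atom (mostly zeros when the code is sparse).
    """
    n_samples = len(dictionary[0])
    return [
        sum(c * atom[t] for c, atom in zip(coeffs, dictionary))
        for t in range(n_samples)
    ]


def soft_threshold(x, lam):
    """Shrink x toward zero by lam; values with |x| <= lam become zero."""
    return math.copysign(max(abs(x) - lam, 0.0), x)


# Hypothetical per-device dictionaries over a 4-sample window.
fridge_atoms = [[1, 1, 0, 0], [0, 0, 1, 1]]
kettle_atoms = [[0, 1, 1, 0]]

fridge = reconstruct(fridge_atoms, [100, 0])   # only the first atom is active
kettle = reconstruct(kettle_atoms, [2000])

# The whole-home signal is the sum of the device signals.
aggregate = [f + k for f, k in zip(fridge, kettle)]
print(aggregate)
```

Disaggregation runs this in reverse: given only `aggregate`, find sparse coefficients for each device's dictionary whose reconstructions sum to it.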
A Low-Cost Ethics Shaping Approach for Designing Reinforcement Learning Agents
This paper proposes a low-cost, easily realizable strategy to equip a reinforcement learning (RL) agent with the capability of behaving ethically. Our model allows the designers of RL agents to focus solely on the task to achieve, without having to worry about the implementation of multiple trivial ethical patterns to follow. Based on the assumption that the majority of human behavior, regardless of which goals it pursues, is ethical, our design integrates human policy with the RL policy to achieve the target objective with less chance of violating the ethical code that human beings normally obey.
Generic Probabilistic Interactive Situation Recognition and Prediction: From Virtual to Real
Li, Jiachen, Ma, Hengbo, Zhan, Wei, Tomizuka, Masayoshi
Accurate and robust recognition and prediction of traffic situations play an important role in autonomous driving and are a prerequisite for risk assessment and effective decision making. Although many works model the driving behavior of a single object, it remains a challenge to make predictions for multiple highly interactive agents that react to each other simultaneously. In this work, we propose a generic probabilistic hierarchical recognition and prediction framework which employs a two-layer Hidden Markov Model (TLHMM) to obtain the distribution of potential situations and a learning-based dynamic scene evolution model to sample a group of future trajectories. Instead of predicting motions of a single entity, we propose to get the joint distribution by modeling multiple interactive agents as a whole system. Moreover, due to the decoupling property of the layered structure, our model is suitable for knowledge transfer from simulation to real-world applications as well as among different traffic scenarios, which can reduce the computational effort of training and the demand for large amounts of data. A case study of a highway ramp-merging scenario is presented to verify the effectiveness and accuracy of the proposed framework.
I. INTRODUCTION Accurate and efficient recognition and prediction of future traffic scene evolution plays a significant role in autonomous driving and is a prerequisite for risk assessment, decision making, and high-quality motion planning.
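Obtaining a distribution over potential situations from a hidden Markov model comes down to forward filtering. The sketch below is a minimal single-layer version, not the two-layer TLHMM of the paper; the two situations ("yield", "pass"), the observation symbols, and every probability are hypothetical values chosen for illustration.

```python
def forward_filter(prior, transition, emission, observations):
    """Posterior over hidden situations after a sequence of observations.

    prior[i]         : P(situation i) before any observation
    transition[i][j] : P(next situation j | current situation i)
    emission[i][o]   : P(observation symbol o | situation i)
    """
    n = len(prior)
    belief = list(prior)
    for step, obs in enumerate(observations):
        if step > 0:
            # Predict: push the belief through the transition model.
            belief = [
                sum(belief[i] * transition[i][j] for i in range(n))
                for j in range(n)
            ]
        # Update: weight by the observation likelihood, then normalize.
        unnormalized = [belief[j] * emission[j][obs] for j in range(n)]
        total = sum(unnormalized)
        belief = [u / total for u in unnormalized]
    return belief


# Situations: 0 = "yield", 1 = "pass". Observations: 0 = decelerating, 1 = accelerating.
prior = [0.5, 0.5]
transition = [[0.9, 0.1], [0.1, 0.9]]
emission = [[0.8, 0.2], [0.3, 0.7]]

belief = forward_filter(prior, transition, emission, observations=[0, 0])
print(belief)
```

Two consecutive "decelerating" observations shift most of the probability mass to the "yield" situation, while the belief remains a proper distribution rather than a hard label.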
Optimal and Low-Complexity Dynamic Spectrum Access for RF-Powered Ambient Backscatter System with Online Reinforcement Learning
Van Huynh, Nguyen, Hoang, Dinh Thai, Nguyen, Diep N., Dutkiewicz, Eryk, Niyato, Dusit, Wang, Ping
Ambient backscatter has been introduced with a wide range of applications for low-power wireless communications. In this article, we propose an optimal and low-complexity dynamic spectrum access framework for an RF-powered ambient backscatter system. Under the dynamics of the ambient signals, we first adopt the Markov decision process (MDP) framework to obtain the optimal policy for the secondary transmitter, aiming to maximize the system throughput. However, the MDP-based optimization requires complete knowledge of environment parameters, e.g., the probability of a channel being idle and the probability of a successful packet transmission, which may not be practical to obtain. To cope with such incomplete knowledge of the environment, we develop a low-complexity online reinforcement learning algorithm that allows the secondary transmitter to "learn" from its decisions and then attain the optimal policy. Simulation results show that the proposed learning algorithm not only efficiently deals with the dynamics of the environment, but also improves the average throughput by up to 50% and reduces the blocking probability and delay by up to 80% compared with conventional methods.
Dynamic spectrum access (DSA) has been considered a promising solution to improve the utilization of radio spectrum [2]. As DSA standard frameworks, the Federal Communications Commission and the European Telecommunications Standardization Institute have recently proposed Spectrum Access Systems (SAS) and Licensed Shared Access (LSA), respectively [3]. In both SAS and LSA, spectrum users are prioritized at different levels/tiers (e.g., there are three types of users in decreasing order of priority: Incumbent Users (IUs), Priority Access Licensees (PALs), and General Authorized Access (GAAs)). Without loss of generality, in this work, we refer to users with higher priority as IUs and users with lower priority as secondary users (SUs). DSA harvests under-utilized spectrum chunks by allowing an SU to dynamically access (temporarily) idle spectrum bands/whitespaces to transmit data.
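When the environment parameters are fully known, an MDP of the kind described above can be solved by value iteration. The sketch below is a generic solver applied to a deliberately tiny toy problem, not the model from the article: the two channel states, the two actions, the idle probability of 0.4, and the rewards are all hypothetical, and the channel is assumed to evolve independently of the secondary transmitter's action.

```python
def q_value(P, R, V, s, a, gamma):
    """Expected return of taking action a in state s, then following V."""
    return R[s][a] + gamma * sum(p * V[s2] for p, s2 in P[s][a])


def value_iteration(P, R, gamma=0.9, tol=1e-8):
    """Solve a finite MDP with known dynamics.

    P[s][a] : list of (probability, next_state) pairs
    R[s][a] : expected immediate reward
    Returns the optimal state values and a greedy policy.
    """
    n_states = len(P)
    V = [0.0] * n_states
    while True:
        new_V = [
            max(q_value(P, R, V, s, a, gamma) for a in range(len(P[s])))
            for s in range(n_states)
        ]
        delta = max(abs(x - y) for x, y in zip(new_V, V))
        V = new_V
        if delta < tol:
            break
    policy = [
        max(range(len(P[s])), key=lambda a: q_value(P, R, V, s, a, gamma))
        for s in range(n_states)
    ]
    return V, policy


# States: 0 = channel busy, 1 = channel idle.
# Actions: 0 = backscatter, 1 = active transmission.
channel = [(0.6, 0), (0.4, 1)]            # assumed: busy w.p. 0.6, idle w.p. 0.4
P = [[channel, channel], [channel, channel]]
R = [[1.0, 0.0],                          # busy: backscatter pays, transmitting does not
     [0.0, 2.0]]                          # idle: nothing to backscatter, transmitting pays

V, policy = value_iteration(P, R)
print(policy, V)
```

The online learning setting in the abstract addresses the case where `P` and `R` are not available in advance, so the transmitter has to estimate good actions from its own experience instead of from a known model.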
CASC: Context-Aware Segmentation and Clustering for Motif Discovery in Noisy Time Series Data
Jain, Saachi, Hallac, David, Sosic, Rok, Leskovec, Jure
Complex systems, such as airplanes, cars, or financial markets, produce multivariate time series data consisting of system observations over a period of time. Such data can be interpreted as a sequence of segments, where each segment is associated with a certain state of the system. An important problem in this domain is to identify repeated sequences of states, known as motifs. Such motifs correspond to complex behaviors that capture common sequences of state transitions. For example, a motif of "making a turn" might manifest in sensor data as a sequence of states: slowing down, turning the wheel, and then speeding back up. However, discovering these motifs is challenging, because the individual states are unknown and need to be learned from the noisy time series. Simultaneously, the time series also needs to be precisely segmented and each segment needs to be associated with a state. Here we develop context-aware segmentation and clustering (CASC), a method for discovering common motifs in time series data. We formulate the problem of motif discovery as a large optimization problem, which we then solve using a greedy alternating minimization-based approach. CASC performs well in the presence of noise in the input data and is scalable to very large datasets. Furthermore, CASC leverages common motifs to more robustly segment the time series and assign segments to states. Experiments on synthetic data show that CASC outperforms state-of-the-art baselines by up to 38.2%, and two case studies demonstrate how our approach discovers insightful motifs in real-world time series data.
Budget Constrained Bidding by Model-free Reinforcement Learning in Display Advertising
Wu, Di, Chen, Xiujun, Yang, Xun, Wang, Hao, Tan, Qing, Zhang, Xiaoxun, Xu, Jian, Gai, Kun
Real-time bidding (RTB) is an important mechanism in online display advertising, where a proper bid for each page view plays an essential role in achieving good marketing results. Budget constrained bidding is a typical scenario in RTB where the advertisers hope to maximize the total value of the winning impressions under a pre-set budget constraint. However, the optimal bidding strategy is hard to derive due to the complexity and volatility of the auction environment. To address these challenges, in this paper, we formulate budget constrained bidding as a Markov Decision Process and propose a model-free reinforcement learning framework to resolve the optimization problem. Our analysis shows that the immediate reward from the environment is misleading under a critical resource constraint. Therefore, we introduce a reward function design methodology for reinforcement learning problems with constraints. Based on the new reward design, we employ a deep neural network to learn the appropriate reward so that the optimal policy can be learned effectively. Unlike prior model-based work, which suffers from scalability problems, our framework is easy to deploy in large-scale industrial applications. The experimental evaluations demonstrate the effectiveness of our framework on large-scale real datasets.
Reinforcement Learning under Threats
Gallego, Víctor, Naveiro, Roi, Insua, David Ríos
In several reinforcement learning (RL) scenarios, mainly in security settings, there may be adversaries trying to interfere with the reward generating process. In this paper, we introduce Threatened Markov Decision Processes (TMDPs), which provide a framework to support a decision maker against a potential adversary in RL. Furthermore, we propose a level-$k$ thinking scheme resulting in a new learning framework to deal with TMDPs. After introducing our framework and deriving theoretical results, relevant empirical evidence is given via extensive experiments, showing the benefits of accounting for adversaries while the agent learns.