Markov Models
Pixel-Attentive Policy Gradient for Multi-Fingered Grasping in Cluttered Scenes
Wu, Bohan, Akinola, Iretiayo, Allen, Peter K.
Recent advances in on-policy reinforcement learning (RL) methods enabled learning agents in virtual environments to master complex tasks with high-dimensional and continuous observation and action spaces. However, leveraging this family of algorithms in multi-fingered robotic grasping remains a challenge due to large sim-to-real fidelity gaps and the high sample complexity of on-policy RL algorithms. This work aims to bridge these gaps by first reinforcement-learning a multi-fingered robotic grasping policy in simulation that operates in the pixel space of the input: a single depth image. Using a mapping from pixel space to Cartesian space according to the depth map, this method transfers to the real world with high fidelity and introduces a novel attention mechanism that substantially improves grasp success rate in cluttered environments. Finally, the direct-generative nature of this method allows learning of multi-fingered grasps that have flexible end-effector positions, orientations and rotations, as well as all degrees of freedom of the hand.
On Convergence Rate of the Gaussian Belief Propagation Algorithm for Markov Networks
Gaussian Belief Propagation (BP) algorithm is one of the most important distributed algorithms in signal processing and statistical learning involving Markov networks. It is well known that the algorithm correctly computes marginal density functions from a high dimensional joint density function over a Markov network in a finite number of iterations when the underlying Gaussian graph is acyclic. It is also known more recently that the algorithm produces correct marginal means asymptotically for cyclic Gaussian graphs under the condition of walk summability. This paper extends this convergence result further by showing that the convergence is exponential under the walk summability condition, and provides a simple bound for the convergence rate.
Scaling up budgeted reinforcement learning
Carrara, Nicolas, Leurent, Edouard, Laroche, Romain, Urvoy, Tanguy, Maillard, Odalric-Ambrym, Pietquin, Olivier
Can we learn a control policy able to adapt its behaviour in real time so as to take any desired amount of risk? The general Reinforcement Learning framework solely aims at optimising a total reward in expectation, which may not be desirable in critical applications. In stark contrast, the Budgeted Markov Decision Process (BMDP) framework is a formalism in which the notion of risk is implemented as a hard constraint on a failure signal. Existing algorithms solving BMDPs rely on strong assumptions and have so far only been applied to toy-examples. In this work, we relax some of these assumptions and demonstrate the scalability of our approach on two practical problems: a spoken dialogue system and an autonomous driving task. On both examples, we reach similar performances as Lagrangian Relaxation methods with a significant improvement in sample and memory efficiency.
Attack Graph Obfuscation
Puzis, Rami, Polad, Hadar, Shapira, Bracha
Before executing an attack, adversaries usually explore the victim's network in an attempt to infer the network topology and identify vulnerabilities in the victim's servers and personal computers. Falsifying the information collected by the adversary post penetration may significantly slower lateral movement and increase the amount of noise generated within the victim's network. We investigate the effect of fake vulnerabilities within a real enterprise network on the attacker performance. We use the attack graphs to model the path of an attacker making its way towards a target in a given network. We use combinatorial optimization in order to find the optimal assignments of fake vulnerabilities. We demonstrate the feasibility of our deception-based defense by presenting results of experiments with a large scale real network. We show that adding fake vulnerabilities forces the adversary to invest a significant amount of effort, in terms of time and exploitability cost.
Probabilistic Modeling for Novelty Detection with Applications to Fraud Identification
Novelty detection is the unsupervised problem of identifying anomalies in test data which significantly differ from the training set. Novelty detection is one of the classic challenges in Machine Learning and a core component of several research areas such as fraud detection, intrusion detection, medical diagnosis, data cleaning, and fault prevention. While numerous algorithms were designed to address this problem, most methods are only suitable to model continuous numerical data. Tackling datasets composed of mixed-type features, such as numerical and categorical data, or temporal datasets describing discrete event sequences is a challenging task. In addition to the supported data types, the key criteria for efficient novelty detection methods are the ability to accurately dissociate novelties from nominal samples, the interpretability, the scalability and the robustness to anomalies located in the training data. In this thesis, we investigate novel ways to tackle these issues. In particular, we propose (i) an experimental comparison of novelty detection methods for mixed-type data (ii) an experimental comparison of novelty detection methods for sequence data, (iii) a probabilistic nonparametric novelty detection method for mixed-type data based on Dirichlet process mixtures and exponential-family distributions and (iv) an autoencoder-based novelty detection model with encoder/decoder modelled as deep Gaussian processes.
PROPS: Probabilistic personalization of black-box sequence models
Wojnowicz, Michael Thomas, Zhao, Xuan
We present PROPS, a lightweight transfer learning mechanism for sequential data. PROPS learns probabilistic perturbations around the predictions of one or more arbitrarily complex, pre-trained black box models (such as recurrent neural networks). The technique pins the black-box prediction functions to "source nodes" of a hidden Markov model (HMM), and uses the remaining nodes as "perturbation nodes" for learning customized perturbations around those predictions. In this paper, we describe the PROPS model, provide an algorithm for online learning of its parameters, and demonstrate the consistency of this estimation. We also explore the utility of PROPS in the context of personalized language modeling. In particular, we construct a baseline language model by training a LSTM on the entire Wikipedia corpus of 2.5 million articles (around 6.6 billion words), and then use PROPS to provide lightweight customization into a personalized language model of President Donald J. Trump's tweeting. We achieved good customization after only 2,000 additional words, and find that the PROPS model, being fully probabilistic, provides insight into when President Trump's speech departs from generic patterns in the Wikipedia corpus. Python code (for both the PROPS training algorithm as well as experiment reproducibility) is available at https://github.com/cylance/perturbed-sequence-model.
Learning Exploration Policies for Navigation
Chen, Tao, Gupta, Saurabh, Gupta, Abhinav
Numerous past works have tackled the problem of task-driven navigation. But, how to effectively explore a new environment to enable a variety of downstream tasks has received much less attention. In this work, we study how agents can autonomously explore realistic and complex 3D environments without the context of task-rewards. We propose a learning-based approach and investigate different policy architectures, reward functions, and training paradigms. We find that use of policies with spatial memory that are bootstrapped with imitation learning and finally finetuned with coverage rewards derived purely from on-board sensors can be effective at exploring novel environments. We show that our learned exploration policies can explore better than classical approaches based on geometry alone and generic learning-based exploration techniques. Finally, we also show how such task-agnostic exploration can be used for downstream tasks. Imagine your first day at a new workplace. If you are like most people, the first task you set for yourself is to become familiar with the office so that the next day when you have to attend meetings and perform tasks, you can navigate efficiently and seamlessly. To achieve that goal, you explore your office without the task context of target locations you have to reach and build a generic understanding of space. This step of task-independent exploration is quite critical yet often ignored in current approaches for navigation. When it comes to navigation, currently there are two paradigms: (a) geometric reconstruction and path-planning based approaches (Hartley & Zisserman, 2003; Thrun et al., 2005; LaValle, 2006), and (b) learning-based approaches (Mirowski et al., 2017; Gupta et al., 2017; Savinov et al., 2018; Zhu et al., 2017).
Microscopic Traffic Simulation by Cooperative Multi-agent Deep Reinforcement Learning
Bacchiani, Giulio, Molinari, Daniele, Patander, Marco
Expert human drivers perform actions relying on traffic laws and their previous experience. While traffic laws are easily embedded into an artificial brain, modeling human complex behaviors which come from past experience is a more challenging task. One of these behaviors is the capability of communicating intentions and negotiating the right of way through driving actions, as when a driver is entering a crowded roundabout and observes other cars movements to guess the best time to merge in. In addition, each driver has its own unique driving style, which is conditioned by both its personal characteristics, such as age and quality of sight, and external factors, such as being late or in a bad mood. For these reasons, the interaction between different drivers is not trivial to simulate in a realistic manner. In this paper, this problem is addressed by developing a microscopic simulator using a Deep Reinforcement Learning Algorithm based on a combination of visual frames, representing the perception around the vehicle, and a vector of numerical parameters. In particular, the algorithm called Asynchronous Advantage Actor-Critic has been extended to a multi-agent scenario in which every agent needs to learn to interact with other similar agents. Moreover, the model includes a novel architecture such that the driving style of each vehicle is adjustable by tuning some of its input parameters, permitting to simulate drivers with different levels of aggressiveness and desired cruising speeds.
QuickStop: A Markov Optimal Stopping Approach for Quickest Misinformation Detection
Wei, Honghao, Kang, Xiaohan, Wang, Weina, Ying, Lei
This paper combines data-driven and model-driven methods for real-time misinformation detection. Our algorithm, named QuickStop, is an optimal stopping algorithm based on a probabilistic information spreading model obtained from labeled data. The algorithm consists of an offline machine learning algorithm for learning the probabilistic information spreading model and an online optimal stopping algorithm to detect misinformation. The online detection algorithm has both low computational and memory complexities. Our numerical evaluations with a real-world dataset show that QuickStop outperforms existing misinformation detection algorithms in terms of both accuracy and detection time (number of observations needed for detection). Our evaluations with synthetic data further show that QuickStop is robust to (offline) learning errors.
Bernoulli Race Particle Filters
Schmon, Sebastian M, Doucet, Arnaud, Deligiannidis, George
When the weights in a particle filter are not available analytically, standard resampling methods cannot be employed. To circumvent this problem state-of-the-art algorithms replace the true weights with non-negative unbiased estimates. This algorithm is still valid but at the cost of higher variance of the resulting filtering estimates in comparison to a particle filter using the true weights. We propose here a novel algorithm that allows for resampling according to the true intractable weights when only an unbiased estimator of the weights is available. We demonstrate our algorithm on several examples.