Goto

Collaborating Authors

 Undirected Networks


Variational Bayes In Private Settings (VIPS)

arXiv.org Machine Learning

Many applications of Bayesian data analysis involve sensitive information, motivating methods which ensure that privacy is protected. We introduce a general privacy-preserving framework for Variational Bayes (VB), a widely used optimization-based Bayesian inference method. Our framework respects differential privacy, the gold-standard privacy criterion, and encompasses a large class of probabilistic models, called the Conjugate Exponential (CE) family. We observe that we can straightforwardly privatise VB's approximate posterior distributions for models in the CE family, by perturbing the expected sufficient statistics of the complete-data likelihood. For a broadly-used class of non-CE models, those with binomial likelihoods, we show how to bring such models into the CE family, such that inferences in the modified model resemble the private variational Bayes algorithm as closely as possible, using the Polya-Gamma data augmentation scheme. The iterative nature of variational Bayes presents a further challenge since iterations increase the amount of noise needed. We overcome this by combining: (1) an improved composition method for differential privacy, called the moments accountant, which provides a tight bound on the privacy cost of multiple VB iterations and thus significantly decreases the amount of additive noise; and (2) the privacy amplification effect of subsampling mini-batches from large-scale data in stochastic learning. We empirically demonstrate the effectiveness of our method in CE and non-CE models including latent Dirichlet allocation, Bayesian logistic regression, and sigmoid belief networks, evaluated on real-world datasets.


Variance-Aware Regret Bounds for Undiscounted Reinforcement Learning in MDPs

arXiv.org Machine Learning

The problem of reinforcement learning in an unknown and discrete Markov Decision Process (MDP) under the average-reward criterion is considered, when the learner interacts with the system in a single stream of observations, starting from an initial state without any reset. We revisit the minimax lower bound for that problem by making appear the local variance of the bias function in place of the diameter of the MDP. Furthermore, we provide a novel analysis of the KL-UCRL algorithm establishing a high-probability regret bound scaling as $\widetilde {\mathcal O}\Bigl({\textstyle \sqrt{S\sum_{s,a}{\bf V}^\star_{s,a}T}}\Big)$ for this algorithm for ergodic MDPs, where $S$ denotes the number of states and where ${\bf V}^\star_{s,a}$ is the variance of the bias function with respect to the next-state distribution following action $a$ in state $s$. The resulting bound improves upon the best previously known regret bound $\widetilde {\mathcal O}(DS\sqrt{AT})$ for that algorithm, where $A$ and $D$ respectively denote the maximum number of actions (per state) and the diameter of MDP. We finally compare the leading terms of the two bounds in some benchmark MDPs indicating that the derived bound can provide an order of magnitude improvement in some cases. Our analysis leverages novel variations of the transportation lemma combined with Kullback-Leibler concentration inequalities, that we believe to be of independent interest.


Recurrent Predictive State Policy Networks

arXiv.org Machine Learning

We introduce Recurrent Predictive State Policy (RPSP) networks, a recurrent architecture that brings insights from predictive state representations to reinforcement learning in partially observable environments. Predictive state policy networks consist of a recursive filter, which keeps track of a belief about the state of the environment, and a reactive policy that directly maps beliefs to actions, to maximize the cumulative reward. The recursive filter leverages predictive state representations (PSRs) (Rosencrantz and Gordon, 2004; Sun et al., 2016) by modeling predictive state-- a prediction of the distribution of future observations conditioned on history and future actions. This representation gives rise to a rich class of statistically consistent algorithms (Hefny et al., 2018) to initialize the recursive filter. Predictive state serves as an equivalent representation of a belief state. Therefore, the policy component of the RPSP-network can be purely reactive, simplifying training while still allowing optimal behaviour. Moreover, we use the PSR interpretation during training as well, by incorporating prediction error in the loss function. The entire network (recursive filter and reactive policy) is still differentiable and can be trained using gradient based methods. We optimize our policy using a combination of policy gradient based on rewards (Williams, 1992) and gradient descent based on prediction error. We show the efficacy of RPSP-networks under partial observability on a set of robotic control tasks from OpenAI Gym. We empirically show that RPSP-networks perform well compared with memory-preserving networks such as GRUs, as well as finite memory models, being the overall best performing method.


Fast and Scalable Distributed Deep Convolutional Autoencoder for fMRI Big Data Analytics

arXiv.org Machine Learning

In recent years, analyzing task-based fMRI (tfMRI) data has become an essential tool for understanding brain function and networks. However, due to the sheer size of tfMRI data, its intrinsic complex structure, and lack of ground truth of underlying neural activities, modeling tfMRI data is hard and challenging. Previously proposed data-modeling methods including Independent Component Analysis (ICA) and Sparse Dictionary Learning only provided a weakly established model based on blind source separation under the strong assumption that original fMRI signals could be linearly decomposed into time series components with corresponding spatial maps. Meanwhile, analyzing and learning a large amount of tfMRI data from a variety of subjects has been shown to be very demanding but yet challenging even with technological advances in computational hardware. Given the Convolutional Neural Network (CNN), a robust method for learning high-level abstractions from low-level data such as tfMRI time series, in this work we propose a fast and scalable novel framework for distributed deep Convolutional Autoencoder model. This model aims to both learn the complex hierarchical structure of the tfMRI data and to leverage the processing power of multiple GPUs in a distributed fashion. To implement such a model, we have created an enhanced processing pipeline on the top of Apache Spark and Tensorflow library, leveraging from a very large cluster of GPU machines. Experimental data from applying the model on the Human Connectome Project (HCP) show that the proposed model is efficient and scalable toward tfMRI big data analytics, thus enabling data-driven extraction of hierarchical neuroscientific information from massive fMRI big data in the future.


Hamiltonian Monte Carlo explained by Alex Rogozhnikov

#artificialintelligence

MCMC (Markov chain Monte Carlo) is a family of methods that are applied in computational physics and chemistry and also widely used in bayesian machine learning. It is used to simulate physical systems with Gibbs canonical distribution: $$ p(\vx) \propto \exp\left( - \frac{U(\vx)}{T} \right) $$ Probability $ p(\vx) $ of a system to be in the state $ \vx $ depends on the energy of the state $U(\vx)$ and temperature $ T $ . This distribution describes positions and velocities of particles in the gas, for instance. In bayesian machine learning, it defines distribution of model parameters (such as weights of a neural network). For example, consider a multivariate normal distribution: $$ p(\vx) \propto \exp\left( - \dfrac{1}{2} (\vx - \mu) T \Sigma {-1} (\vx - \mu) \right) $$ which corresponds to the following potential energy: $$ U(\vx) \dfrac{1}{2} (\vx - \mu) T \Sigma {-1} (\vx - \mu), \qquad T 1. $$ Any distribution can be rewritten as Gibbs canonical distribution, but for many problems such energy-based distributions appear very naturally.


Nonnegative Matrix Factorization for Signal and Data Analytics: Identifiability, Algorithms, and Applications

arXiv.org Machine Learning

Nonnegative matrix factorization (NMF) has become a workhorse for signal and data analytics, triggered by its model parsimony and interpretability. Perhaps a bit surprisingly, the understanding to its model identifiability---the major reason behind the interpretability in many applications such as topic mining and hyperspectral imaging---had been rather limited until recent years. Beginning from the 2010s, the identifiability research of NMF has progressed considerably: Many interesting and important results have been discovered by the signal processing (SP) and machine learning (ML) communities. NMF identifiability has a great impact on many aspects in practice, such as ill-posed formulation avoidance and performance-guaranteed algorithm design. On the other hand, there is no tutorial paper that introduces NMF from an identifiability viewpoint. In this paper, we aim at filling this gap by offering a comprehensive and deep tutorial on model identifiability of NMF as well as the connections to algorithms and applications. This tutorial will help researchers and graduate students grasp the essence and insights of NMF, thereby avoiding typical `pitfalls' that are often times due to unidentifiable NMF formulations. This paper will also help practitioners pick/design suitable factorization tools for their own problems.


What is the difference between Markov chain approximation and variational approximation?

#artificialintelligence

PageRank and RBMs are not Markov chain approximations, rather they use Markov chains in their implementation. Similarly, LDA (Latent Dirichlet Allocation) is a generative probabilistic model (aka Bayesian hierarchical model) and not a variational approximation. LDA may use variational approximation methods for inference. Let me take the LDA model as an example. In LDA, a complicated generative model is constructed to learn the topic allocation probabilities of different documents.


Consequentialist conditional cooperation in social dilemmas with imperfect information

arXiv.org Artificial Intelligence

Social dilemmas, where mutual cooperation can lead to high payoffs but participants face incentives to cheat, are ubiquitous in multi-agent interaction. We wish to construct agents that cooperate with pure cooperators, avoid exploitation by pure defectors, and incentivize cooperation from the rest. However, often the actions taken by a partner are (partially) unobserved or the consequences of individual actions are hard to predict. We show that in a large class of games good strategies can be constructed by conditioning one's behavior solely on outcomes (ie. one's past rewards). We call this consequentialist conditional cooperation. We show how to construct such strategies using deep reinforcement learning techniques and demonstrate, both analytically and experimentally, that they are effective in social dilemmas beyond simple matrix games. We also show the limitations of relying purely on consequences and discuss the need for understanding both the consequences of and the intentions behind an action.


Analyzing Business Process Anomalies Using Autoencoders

arXiv.org Artificial Intelligence

Businesses are naturally interested in detecting anomalies in their internal processes, because these can be indicators for fraud and inefficiencies. Within the domain of business intelligence, classic anomaly detection is not very frequently researched. In this paper, we propose a method, using autoencoders, for detecting and analyzing anomalies occurring in the execution of a business process. Our method does not rely on any prior knowledge about the process and can be trained on a noisy dataset already containing the anomalies. We demonstrate its effectiveness by evaluating it on 700 different datasets and testing its performance against three state-of-the-art anomaly detection methods. This paper is an extension of our previous work from 2016 [30]. Compared to the original publication we have further refined the approach in terms of performance and conducted an elaborate evaluation on more sophisticated datasets including real-life event logs from the Business Process Intelligence Challenges of 2012 and 2017. In our experiments our approach reached an F1 score of 0.87, whereas the best unaltered state-of-the-art approach reached an F1 score of 0.72. Furthermore, our approach can be used to analyze the detected anomalies in terms of which event within one execution of the process causes the anomaly.


On Polynomial Time PAC Reinforcement Learning with Rich Observations

arXiv.org Machine Learning

We study episodic reinforcement learning (RL) when the observations may be realistically rich, such as images or text. We aim for methods that use function approximation in a provably effective manner to find the best possible policy through systematic exploration. While such problems are central to empirical RL research [22], most theoretical results on systematic exploration have focused on tabular MDPs with small state spaces [e.g., 19]. Until recently, little was known about how to engage in sophisticated exploration in the general function approximation setting to achieve global optimality in a statistically efficient manner. Indeed, as pointed out by Krishnamurthy et al. [20], no algorithm achieving polynomial sample complexity is possible without further assumptions. Nevertheless, when the underlying problem exhibits additional structure, it was recently shown that learning becomes statistically feasible. In particular, Krishnamurthy et al. [20] showed that reactive POMDPs with rich observations and deterministic dynamics over M hidden states can be learned with polynomial sample complexity that depends on M. Later, Jiang et al. [16] provided a new algorithm called O LIVE that In this paper, we directly address this difficult computational challenge. We adopt a reduction approach, meaning that we aim to design algorithms whose computation can be reduced to common optimization oracles over function spaces, such as linear optimization and cost-sensitive classification, while retaining the statistical properties of prior works.