Uncertainty
Statistical inference using SGD
Li, Tianyang, Liu, Liu, Kyrillidis, Anastasios, Caramanis, Constantine
We present a novel method for frequentist statistical inference in $M$-estimation problems, based on stochastic gradient descent (SGD) with a fixed step size: we demonstrate that the average of such SGD sequences can be used for statistical inference, after proper scaling. An intuitive analysis using the Ornstein-Uhlenbeck process suggests that such averages are asymptotically normal. From a practical perspective, our SGD-based inference procedure is a first order method, and is well-suited for large scale problems. To show its merits, we apply it to both synthetic and real datasets, and demonstrate that its accuracy is comparable to classical statistical methods, while requiring potentially far less computation.
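The idea of averaging fixed-step SGD iterates can be illustrated with a minimal sketch. It is not the authors' procedure: it assumes the simplest $M$-estimation problem (estimating a mean under squared loss), and the function name `sgd_segment_averages`, the burn-in, the segment length and the discard gap are all illustrative choices. The scaling factor the abstract refers to is not reproduced here; the sketch only shows that segment averages cluster around the estimate and have a measurable spread.

```python
import random
import statistics

def sgd_segment_averages(data, step_size=0.1, burn_in=500,
                         segment_length=200, discard=50,
                         num_segments=20, seed=0):
    """Run fixed-step SGD on the loss 0.5 * (theta - x)^2 and return the
    average iterate of each retained segment (all settings are assumptions)."""
    rng = random.Random(seed)
    theta = 0.0

    def sgd_step(current):
        x = rng.choice(data)                         # one sample, with replacement
        return current - step_size * (current - x)   # gradient of 0.5*(theta-x)^2

    for _ in range(burn_in):                         # forget the starting point
        theta = sgd_step(theta)

    averages = []
    for _ in range(num_segments):
        total = 0.0
        for _ in range(segment_length):
            theta = sgd_step(theta)
            total += theta
        averages.append(total / segment_length)
        for _ in range(discard):                     # reduce correlation between segments
            theta = sgd_step(theta)
    return averages

data_rng = random.Random(1)
data = [data_rng.gauss(2.0, 1.0) for _ in range(1000)]
averages = sgd_segment_averages(data)
point_estimate = statistics.mean(averages)
spread = statistics.stdev(averages)
print(f"estimate={point_estimate:.3f}, spread of segment averages={spread:.3f}")
```

Each update touches a single sample and needs only a gradient, which is what makes a procedure of this kind a first order method.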
The amazing predictive power of conditional probability in Bayes Nets
Using conditional probability gives Bayes Nets strong analytical advantages over traditional regression-based models. This adds to several advantages we discussed in an earlier article. But what is conditional probability and what makes it different? In short, conditional probability means that the effects of one variable depend on, or flow from, the distribution of another variable (or others). The complete state of one variable determines how another acts.
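A small worked example may make the dependence concrete. The two-node network below (Rain -> WetGrass) and all of its probabilities are hypothetical numbers chosen for illustration; they do not come from the article.

```python
# Hypothetical two-node Bayes net: Rain -> WetGrass.
p_rain = 0.2                                   # P(Rain = True), assumed
p_wet_given_rain = {True: 0.9, False: 0.1}     # P(Wet = True | Rain), assumed

def joint(rain, wet):
    """P(Rain = rain, Wet = wet) = P(Rain) * P(Wet | Rain)."""
    p_r = p_rain if rain else 1.0 - p_rain
    p_w = p_wet_given_rain[rain] if wet else 1.0 - p_wet_given_rain[rain]
    return p_r * p_w

def posterior_rain_given_wet():
    """Bayes' rule: P(Rain | Wet) = P(Rain, Wet) / P(Wet)."""
    evidence = joint(True, True) + joint(False, True)
    return joint(True, True) / evidence

p_wet = joint(True, True) + joint(False, True)
print(f"P(Wet) = {p_wet:.2f}, P(Rain | Wet) = {posterior_rain_given_wet():.3f}")
```

Observing wet grass raises the probability of rain from the prior 0.2 to about 0.69, which is the sense in which the state of one variable determines how another behaves.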
Robust Synthetic Control
Amjad, Muhammad Jehangir, Shah, Devavrat, Shen, Dennis
We present a robust generalization of the synthetic control method for comparative case studies. Like the classical method, we present an algorithm to estimate the unobservable counterfactual of a treatment unit. A distinguishing feature of our algorithm is that of de-noising the data matrix via singular value thresholding, which renders our approach robust in multiple facets: it automatically identifies a good subset of donors, overcomes the challenges of missing data, and continues to work well in settings where covariate information may not be provided. To begin, we establish the condition under which the fundamental assumption in synthetic control-like approaches holds, i.e. when the linear relationship between the treatment unit and the donor pool prevails in both the pre- and post-intervention periods. We provide the first finite sample analysis for a broader class of models, the Latent Variable Model, in contrast to Factor Models previously considered in the literature. Further, we show that our de-noising procedure accurately imputes missing entries, producing a consistent estimator of the underlying signal matrix provided $p = \Omega( T^{-1 + \zeta})$ for some $\zeta > 0$; here, $p$ is the fraction of observed data and $T$ is the time interval of interest. Under the same setting, we prove that the mean-squared-error (MSE) in our prediction estimation scales as $O(\sigma^2/p + 1/\sqrt{T})$, where $\sigma^2$ is the noise variance. Using a data aggregation method, we show that the MSE can be made as small as $O(T^{-1/2+\gamma})$ for any $\gamma \in (0, 1/2)$, leading to a consistent estimator. We also introduce a Bayesian framework to quantify the model uncertainty through posterior probabilities. Our experiments, using both real-world and synthetic datasets, demonstrate that our robust generalization yields an improvement over the classical synthetic control method.
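The de-noising step can be sketched as follows, assuming NumPy is available. The function name `singular_value_threshold`, the zero-filling of missing entries, the rescaling by the observed fraction and the threshold value are assumptions of this sketch; the paper should be consulted for the exact estimator and for how the threshold is chosen.

```python
import numpy as np

def singular_value_threshold(observed, mask, threshold):
    """Zero-fill missing entries, keep singular values >= threshold,
    and rescale by the observed fraction (a sketch, not the paper's code)."""
    p_hat = max(mask.mean(), 1.0 / mask.size)       # estimated fraction observed
    filled = np.where(mask, observed, 0.0)          # missing entries become 0
    u, s, vt = np.linalg.svd(filled, full_matrices=False)
    keep = s >= threshold                           # hard thresholding
    denoised = (u[:, keep] * s[keep]) @ vt[keep, :]
    return denoised / p_hat, int(keep.sum())

rng = np.random.default_rng(0)
# Synthetic rank-1 signal plus noise, fully observed for simplicity.
signal = np.outer(rng.uniform(1.0, 2.0, size=20), rng.uniform(1.0, 2.0, size=30))
noisy = signal + 0.1 * rng.standard_normal(signal.shape)
mask = np.ones(signal.shape, dtype=bool)

estimate, rank = singular_value_threshold(noisy, mask, threshold=5.0)
relative_error = np.linalg.norm(estimate - signal) / np.linalg.norm(signal)
print(f"kept rank={rank}, relative error={relative_error:.3f}")
```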
General Bayesian Inference over the Stiefel Manifold via the Givens Transform
Pourzanjani, Arya A, Jiang, Richard M, Mitchell, Brian, Atzberger, Paul J, Petzold, Linda R
We introduce the Givens Transform, a novel transform between the space of orthonormal matrices and $\mathbb{R}^D$. The Givens Transform allows for the application of any general Bayesian inference algorithm to probabilistic models containing constrained unit-vectors or orthonormal matrix parameters. This includes a variety of matrix factorizations and dimensionality reduction models such as Probabilistic PCA (PPCA), Exponential Family PPCA (BXPCA), and Canonical Correlation Analysis (CCA). While previous Bayesian approaches to these models relied on separate sampling update rules for constrained and unconstrained parameters, the Givens Transform enables the treatment of unit-vectors and orthonormal matrices agnostically as unconstrained parameters. Thus any Bayesian inference algorithm can be used on these models without modification. This opens the door to not just sampling algorithms, but Variational Inference (VI) as well. We illustrate, with several examples and supplied code, how the Givens Transform allows end-users to easily build complex models in their favorite Bayesian modeling framework such as Stan, Edward, or PyMC3, a task that was previously intractable due to technical constraints.
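One direction of such a transform, from unconstrained angles to an orthonormal matrix, can be sketched with plain Givens rotations. The ordering of the rotations and the function names below are assumptions of this sketch, not the authors' supplied code; the reverse map, the angle ranges, and the Jacobian term needed for a density are not shown.

```python
import math

def matmul(a, b):
    """Plain list-of-lists matrix product."""
    return [[sum(a[r][k] * b[k][c] for k in range(len(b)))
             for c in range(len(b[0]))] for r in range(len(a))]

def givens_rotation(n, i, j, theta):
    """n x n identity with a planar rotation by theta in coordinates (i, j)."""
    rot = [[float(r == c) for c in range(n)] for r in range(n)]
    cos_t, sin_t = math.cos(theta), math.sin(theta)
    rot[i][i], rot[i][j] = cos_t, -sin_t
    rot[j][i], rot[j][j] = sin_t, cos_t
    return rot

def angles_to_orthonormal(angles, n, p):
    """Map n*p - p*(p+1)/2 angles to an n x p matrix with orthonormal columns."""
    pairs = [(i, j) for i in range(p) for j in range(i + 1, n)]
    if len(angles) != len(pairs):
        raise ValueError(f"expected {len(pairs)} angles, got {len(angles)}")
    q = [[float(r == c) for c in range(n)] for r in range(n)]
    for (i, j), theta in zip(pairs, angles):
        q = matmul(q, givens_rotation(n, i, j, theta))
    return [row[:p] for row in q]        # first p columns of an orthogonal matrix

def gram(y):
    """Return Y^T Y, which is the identity when the columns are orthonormal."""
    p = len(y[0])
    return [[sum(row[a] * row[b] for row in y) for b in range(p)] for a in range(p)]

y = angles_to_orthonormal([0.3, -1.1, 0.7, 2.0, 0.5], n=4, p=2)
print(gram(y))
```

Because any real-valued angles produce a valid orthonormal matrix, a generic sampler or optimizer can work on the angles without knowing about the constraint.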
Rate-Distortion Bounds on Bayes Risk in Supervised Learning
Nokleby, Matthew, Beirami, Ahmad, Calderbank, Robert
We present an information-theoretic framework for bounding the number of labeled samples needed to train a classifier in a parametric Bayesian setting. We derive bounds on the average $L_p$ distance between the learned classifier and the true maximum a posteriori classifier, which are well-established surrogates for the excess classification error due to imperfect learning. We provide lower and upper bounds on the rate-distortion function, using $L_p$ loss as the distortion measure, of a maximum a posteriori classifier in terms of the differential entropy of the posterior distribution and a quantity called the interpolation dimension, which characterizes the complexity of the parametric distribution family. In addition to expressing the information content of a classifier in terms of lossy compression, the rate-distortion function also expresses the minimum number of bits a learning machine needs to extract from training data to learn a classifier to within a specified $L_p$ tolerance. We use results from universal source coding to express the information content in the training data in terms of the Fisher information of the parametric family and the number of training samples available. The result is a framework for computing lower bounds on the Bayes $L_p$ risk. This framework complements the well-known probably approximately correct (PAC) framework, which provides minimax risk bounds involving the Vapnik-Chervonenkis dimension or Rademacher complexity. Whereas the PAC framework provides upper bounds on the risk for the worst-case data distribution, the proposed rate-distortion framework lower bounds the risk averaged over the data distribution. We evaluate the bounds for a variety of data models, including categorical, multinomial, and Gaussian models. In each case the bounds are provably tight orderwise, and in two cases we prove that the bounds are tight up to multiplicative constants.
A primer on universal function approximation with deep learning (in Torch and R)
Arthur C. Clarke famously stated that "any sufficiently advanced technology is indistinguishable from magic." No current technology embodies this statement more than neural networks and deep learning. And like any good magic, it not only dazzles and inspires but also puts fear into people's hearts. One known property of artificial neural networks (ANNs) is that they are universal function approximators. This means that, given enough hidden units, a neural network can approximate any continuous function on a bounded domain to arbitrary accuracy.
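The approximation property can be illustrated without any training. The sketch below is in plain Python rather than the Torch and R used by the article, and it constructs the weights by hand instead of learning them: a one-hidden-layer ReLU network is built to reproduce the piecewise-linear interpolant of a target function, so its error shrinks as hidden units are added. The function names are illustrative.

```python
import math

def relu(z):
    return max(0.0, z)

def build_relu_network(f, a, b, hidden_units):
    """One hidden ReLU layer whose output equals the piecewise-linear
    interpolant of f on [a, b] with `hidden_units` equal-width pieces."""
    knots = [a + (b - a) * k / hidden_units for k in range(hidden_units + 1)]
    slopes = [(f(knots[k + 1]) - f(knots[k])) / (knots[k + 1] - knots[k])
              for k in range(hidden_units)]
    # Output weights are the changes in slope at each knot.
    weights = [slopes[0]] + [slopes[k] - slopes[k - 1]
                             for k in range(1, hidden_units)]
    thresholds = knots[:-1]              # each hidden unit switches on at one knot
    output_bias = f(a)

    def network(x):
        return output_bias + sum(w * relu(x - t)
                                 for w, t in zip(weights, thresholds))
    return network

def max_abs_error(f, network, a, b, grid=1000):
    """Largest |f - network| on an evenly spaced grid over [a, b]."""
    points = [a + (b - a) * k / grid for k in range(grid + 1)]
    return max(abs(f(x) - network(x)) for x in points)

err_small = max_abs_error(math.sin, build_relu_network(math.sin, 0.0, math.pi, 5),
                          0.0, math.pi)
err_large = max_abs_error(math.sin, build_relu_network(math.sin, 0.0, math.pi, 50),
                          0.0, math.pi)
print(f"5 units: {err_small:.4f}, 50 units: {err_large:.6f}")
```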
HodgeRank with Information Maximization for Crowdsourced Pairwise Ranking Aggregation
Xu, Qianqian, Xiong, Jiechao, Chen, Xi, Huang, Qingming, Yao, Yuan
Recently, crowdsourcing has emerged as an effective paradigm for human-powered large scale problem solving in various domains. However, a task requester usually has a limited budget, so it is desirable to have a policy that allocates the budget wisely to achieve better quality. In this paper, we study the principle of information maximization for active sampling strategies in the framework of HodgeRank, an approach based on Hodge Decomposition of pairwise ranking data with multiple workers. The principle exhibits two scenarios of active sampling: Fisher information maximization that leads to unsupervised sampling based on a sequential maximization of graph algebraic connectivity without considering labels; and Bayesian information maximization that selects samples with the largest information gain from prior to posterior, which gives a supervised sampling involving the labels collected. Experiments show that the proposed methods boost the sampling efficiency as compared to traditional sampling schemes and are thus valuable to practical crowdsourcing experiments.
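A minimal sketch of the two ingredients named above, assuming NumPy is available: a least-squares global ranking from pairwise margins, and a greedy choice of the next pair to query by algebraic connectivity (the second-smallest Laplacian eigenvalue). The function names and the toy comparison data are hypothetical, and the greedy rule is a simplification of the unsupervised strategy the abstract describes.

```python
import numpy as np

def hodgerank_scores(num_items, comparisons):
    """comparisons: (i, j, y) meaning item j beats item i by margin y.
    Returns least-squares scores (summing to zero) and the graph Laplacian."""
    incidence = np.zeros((len(comparisons), num_items))
    margins = np.zeros(len(comparisons))
    for row, (i, j, y) in enumerate(comparisons):
        incidence[row, i], incidence[row, j] = -1.0, 1.0
        margins[row] = y
    scores, *_ = np.linalg.lstsq(incidence, margins, rcond=None)  # min-norm solution
    return scores, incidence.T @ incidence

def algebraic_connectivity(laplacian):
    """Second-smallest eigenvalue of a symmetric graph Laplacian."""
    return float(np.linalg.eigvalsh(laplacian)[1])

def best_next_pair(laplacian, candidates):
    """Greedy label-free choice: the pair whose edge most increases connectivity."""
    def connectivity_after(pair):
        i, j = pair
        updated = laplacian.copy()
        updated[i, i] += 1.0
        updated[j, j] += 1.0
        updated[i, j] -= 1.0
        updated[j, i] -= 1.0
        return algebraic_connectivity(updated)
    return max(candidates, key=connectivity_after)

scores, triangle = hodgerank_scores(3, [(0, 1, 1.0), (1, 2, 1.0), (0, 2, 2.0)])
_, path = hodgerank_scores(4, [(0, 1, 1.0), (1, 2, 1.0), (2, 3, 1.0)])
next_pair = best_next_pair(path, [(0, 2), (0, 3), (1, 3)])
print(scores, algebraic_connectivity(triangle), next_pair)
```

On the four-item path, closing the cycle with the pair (0, 3) raises the connectivity the most, so that pair is queried next without looking at any labels.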
Neural Variational Inference and Learning in Undirected Graphical Models
Kuleshov, Volodymyr, Ermon, Stefano
Many problems in machine learning are naturally expressed in the language of undirected graphical models. Here, we propose black-box learning and inference algorithms for undirected models that optimize a variational approximation to the log-likelihood of the model. Central to our approach is an upper bound on the log-partition function parametrized by a function q that we express as a flexible neural network. Our bound makes it possible to track the partition function during learning, to speed-up sampling, and to train a broad class of hybrid directed/undirected models via a unified variational inference framework. We empirically demonstrate the effectiveness of our method on several popular generative modeling datasets.
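To give a feel for an upper bound on the partition function that is parametrized by a distribution q, the sketch below checks one such bound numerically: by Jensen's inequality, $E_q[(\tilde p(x)/q(x))^2] \ge Z^2$, with equality when q is proportional to $\tilde p$. Whether this matches the exact form used in the paper is an assumption here, and q is a lookup table over four states rather than a neural network. The tiny two-spin model and its parameters are hypothetical.

```python
import itertools
import math

def unnormalized_p(x, coupling=0.8, field=0.3):
    """Unnormalized probability of a tiny two-spin model, spins in {-1, +1}."""
    return math.exp(coupling * x[0] * x[1] + field * (x[0] + x[1]))

states = list(itertools.product([-1, 1], repeat=2))
partition = sum(unnormalized_p(x) for x in states)       # exact Z by enumeration

def second_moment_bound(q):
    """E_q[(p~(x) / q(x))^2], an upper bound on Z^2 for any q > 0."""
    return sum(q[x] * (unnormalized_p(x) / q[x]) ** 2 for x in states)

uniform_q = {x: 1.0 / len(states) for x in states}
exact_q = {x: unnormalized_p(x) / partition for x in states}

print(f"Z^2 = {partition ** 2:.4f}")
print(f"bound with uniform q = {second_moment_bound(uniform_q):.4f}")
print(f"bound with exact q   = {second_moment_bound(exact_q):.4f}")
```

Minimizing a bound of this kind over q tightens it toward the true value, which is what makes it usable as a training signal for a flexible q.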
Hinge-Loss Markov Random Fields and Probabilistic Soft Logic
Bach, Stephen H., Broecheler, Matthias, Huang, Bert, Getoor, Lise
A fundamental challenge in developing high-impact machine learning technologies is balancing the need to model rich, structured domains with the ability to scale to big data. Many important problem areas are both richly structured and large scale, from social and biological networks, to knowledge graphs and the Web, to images, video, and natural language. In this paper, we introduce two new formalisms for modeling structured data, and show that they can both capture rich structure and scale to big data. The first, hinge-loss Markov random fields (HL-MRFs), is a new kind of probabilistic graphical model that generalizes different approaches to convex inference. We unite three approaches from the randomized algorithms, probabilistic graphical models, and fuzzy logic communities, showing that all three lead to the same inference objective. We then define HL-MRFs by generalizing this unified objective. The second new formalism, probabilistic soft logic (PSL), is a probabilistic programming language that makes HL-MRFs easy to define using a syntax based on first-order logic. We introduce an algorithm for inferring most-probable variable assignments (MAP inference) that is much more scalable than general-purpose convex optimization methods, because it uses message passing to take advantage of sparse dependency structures. We then show how to learn the parameters of HL-MRFs. The learned HL-MRFs are as accurate as analogous discrete models, but much more scalable. Together, these algorithms enable HL-MRFs and PSL to model rich, structured data at scales not previously possible.
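The hinge-loss potentials can be illustrated with a short sketch. It assumes the Łukasiewicz relaxation of a logical rule "body_1 AND ... AND body_k -> head" over truth values in [0, 1], and it evaluates a weighted sum of hinge terms for fixed values. The function names and the example groundings are hypothetical, and this is not the PSL implementation or its MAP inference algorithm.

```python
def rule_distance(body_values, head_value):
    """Distance to satisfaction of `body_1 AND ... AND body_k -> head`
    under Lukasiewicz logic, with all truth values in [0, 1]."""
    body_truth = max(0.0, sum(body_values) - (len(body_values) - 1))
    return max(0.0, body_truth - head_value)

def hinge_loss_energy(groundings, exponent=1):
    """Weighted sum of hinge potentials; exponent 1 or 2 keeps each term convex."""
    return sum(weight * rule_distance(body, head) ** exponent
               for weight, body, head in groundings)

# Hypothetical ground rules: (weight, body truth values, head truth value).
groundings = [
    (2.0, [1.0, 1.0], 0.0),   # body fully true, head false: fully violated
    (1.0, [0.5, 0.7], 0.4),   # head is at least as true as the body: satisfied
]
energy = hinge_loss_energy(groundings)
print(f"energy = {energy}")
```

Each term is a hinge of a linear function of the truth values, so the total energy is convex in them, and finding the most probable assignment amounts to minimizing it over [0, 1].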
AI and Automation: 10-point guide to its impact on KYC Process in Banks
The term Artificial Intelligence (AI) has been around for a while. A quick search on the web reveals that the field of modern AI was born in the year 1950, when Alan Turing published a paper on thinking machines. Here we are, almost seven decades later, still in the early days of this emerging technology. Over the last few years, Google CEO Sundar Pichai has been speaking about the increasing role of AI in software, and it seems like this year might be the inflection point for the field. In May 2017, Pichai explained how Google is taking an "AI-first" approach for several of its products.