Statistical Learning
Modeling Temporal Crowd Work Quality with Limited Supervision
Jung, Hyun Joon (University of Texas at Austin) | Lease, Matthew (University of Texas at Austin)
While recent work has shown that a workerโs performance can be more accurately modeled by temporal correlation in task performance, a fundamental challenge remains in the need for expert gold labels to evaluate a workerโs performance. To solve this problem, we explore two methods of utilizing limited gold labels, initial training and periodic updating. Furthermore, we present a novel way of learning a prediction model in the absence of gold labels with uncertaintyaware learning and soft-label updating. Our experiment with a real crowdsourcing dataset demonstrates that periodic updating tends to show better performance than initial training when the number of gold labels are very limited (< 25).
From "In" to "Over": Behavioral Experiments on Whole-Network Computation
Dworkin, Lili (University of Pennsylvania) | Kearns, Michael (University of Pennsylvania)
We report on a series of behavioral experiments in human computation on three different tasks over networks: graph coloring, community detection (or graph clustering), and competitive contagion. While these tasks share similar action spaces and interfaces, they capture a diversity of computational challenges: graph coloring is a search problem, clustering is an optimization problem, and competitive contagion is a game-theoretic problem. In contrast with most of the prior literature on human-subject experiments in networks, in which collectives of subjects are embedded "in" the network, and have only local information and interactions, here individual subjects have a global (or "over") view and must solve "whole network" problems alone. Our primary findings are that subject performance is impressive across all three problem types; that subjects find diverse and novel strategies for solving each task; and that collective performance can often be strongly correlated with known algorithms.
A Complete Recipe for Stochastic Gradient MCMC
Ma, Yi-An, Chen, Tianqi, Fox, Emily B.
Many recent Markov chain Monte Carlo (MCMC) samplers leverage continuous dynamics to define a transition kernel that efficiently explores a target distribution. In tandem, a focus has been on devising scalable variants that subsample the data and use stochastic gradients in place of full-data gradients in the dynamic simulations. However, such stochastic gradient MCMC samplers have lagged behind their full-data counterparts in terms of the complexity of dynamics considered since proving convergence in the presence of the stochastic gradient noise is non-trivial. Even with simple dynamics, significant physical intuition is often required to modify the dynamical system to account for the stochastic gradient noise. In this paper, we provide a general recipe for constructing MCMC samplers--including stochastic gradient versions--based on continuous Markov processes specified via two matrices. We constructively prove that the framework is complete. That is, any continuous Markov process that provides samples from the target distribution can be written in our framework. We show how previous continuous-dynamic samplers can be trivially "reinvented" in our framework, avoiding the complicated sampler-specific proofs. We likewise use our recipe to straightforwardly propose a new state-adaptive sampler: stochastic gradient Riemann Hamiltonian Monte Carlo (SGRHMC). Our experiments on simulated data and a streaming Wikipedia analysis demonstrate that the proposed SGRHMC sampler inherits the benefits of Riemann HMC, with the scalability of stochastic gradient methods.
Copula variational inference
Tran, Dustin, Blei, David M., Airoldi, Edoardo M.
We develop a general variational inference method that preserves dependency among the latent variables. Our method uses copulas to augment the families of distributions used in mean-field and structured approximations. Copulas model the dependency that is not captured by the original variational distribution, and thus the augmented variational family guarantees better approximations to the posterior. With stochastic optimization, inference on the augmented distribution is scalable. Furthermore, our strategy is generic: it can be applied to any inference procedure that currently uses the mean-field or structured approach. Copula variational inference has many advantages: it reduces bias; it is less sensitive to local optima; it is less sensitive to hyperparameters; and it helps characterize and interpret the dependency among the latent variables.
Blitzkriging: Kronecker-structured Stochastic Gaussian Processes
Nickson, Thomas, Gunter, Tom, Lloyd, Chris, Osborne, Michael A, Roberts, Stephen
We present Blitzkriging, a new approach to fast inference for Gaussian processes, applicable to regression, optimisation and classification. State-of-the-art (stochastic) inference for Gaussian processes on very large datasets scales cubically in the number of 'inducing inputs', variables introduced to factorise the model. Blitzkriging shares state-of-the-art scaling with data, but reduces the scaling in the number of inducing points to approximately linear. Further, in contrast to other methods, Blitzkriging: does not force the data to conform to any particular structure (including grid-like); reduces reliance on error-prone optimisation of inducing point locations; and is able to learn rich (covariance) structure from the data. We demonstrate the benefits of our approach on real data in regression, time-series prediction and signal-interpolation experiments.
Clustering With Side Information: From a Probabilistic Model to a Deterministic Algorithm
Khashabi, Daniel, Wieting, John, Liu, Jeffrey Yufei, Liang, Feng
In this paper, we propose a model-based clustering method (TVClust) that robustly incorporates noisy side information as soft-constraints and aims to seek a consensus between side information and the observed data. Our method is based on a nonparametric Bayesian hierarchical model that combines the probabilistic model for the data instance and the one for the side-information. An efficient Gibbs sampling algorithm is proposed for posterior inference. Using the small-variance asymptotics of our probabilistic model, we then derive a new deterministic clustering algorithm (RDP-means). It can be viewed as an extension of K-means that allows for the inclusion of side information and has the additional property that the number of clusters does not need to be specified a priori. Empirical studies have been carried out to compare our work with many constrained clustering algorithms from the literature on both a variety of data sets and under a variety of conditions such as using noisy side information and erroneous k values. The results of our experiments show strong results for our probabilistic and deterministic approaches under these conditions when compared to other algorithms in the literature.
ADASECANT: Robust Adaptive Secant Method for Stochastic Gradient
Gulcehre, Caglar, Moczulski, Marcin, Bengio, Yoshua
Stochastic gradient algorithms have been the main focus of large-scale learning problems and they led to important successes in machine learning. The convergence of SGD depends on the careful choice of learning rate and the amount of the noise in stochastic estimates of the gradients. In this paper, we propose a new adaptive learning rate algorithm, which utilizes curvature information for automatically tuning the learning rates. The information about the element-wise curvature of the loss function is estimated from the local statistics of the stochastic first order gradients. We further propose a new variance reduction technique to speed up the convergence. In our preliminary experiments with deep neural networks, we obtained better performance compared to the popular stochastic gradient algorithms.
Sparse Variational Bayesian Approximations for Nonlinear Inverse Problems: applications in nonlinear elastography
Franck, Isabell M., Koutsourelakis, P. S.
This paper presents an efficient Bayesian framework for solving nonlinear, high-dimensional model calibration problems. It is based on a Variational Bayesian formulation that aims at approximating the exact posterior by means of solving an optimization problem over an appropriately selected family of distributions. The goal is two-fold. Firstly, to find lower-dimensional representations of the unknown parameter vector that capture as much as possible of the associated posterior density, and secondly to enable the computation of the approximate posterior density with as few forward calls as possible. We discuss how these objectives can be achieved by using a fully Bayesian argumentation and employing the marginal likelihood or evidence as the ultimate model validation metric for any proposed dimensionality reduction. We demonstrate the performance of the proposed methodology for problems in nonlinear elastography where the identification of the mechanical properties of biological materials can inform non-invasive, medical diagnosis. An Importance Sampling scheme is finally employed in order to validate the results and assess the efficacy of the approximations provided.
Maximum Likelihood Learning With Arbitrary Treewidth via Fast-Mixing Parameter Sets
Inference is typically intractable in high-treewidth undirected graphical models, making maximum likelihood learning a challenge. One way to overcome this is to restrict parameters to a tractable set, most typically the set of tree-structured parameters. This paper explores an alternative notion of a tractable set, namely a set of "fast-mixing parameters" where Markov chain Monte Carlo (MCMC) inference can be guaranteed to quickly converge to the stationary distribution. While it is common in practice to approximate the likelihood gradient using samples obtained from MCMC, such procedures lack theoretical guarantees. This paper proves that for any exponential family with bounded sufficient statistics, (not just graphical models) when parameters are constrained to a fast-mixing set, gradient descent with gradients approximated by sampling will approximate the maximum likelihood solution inside the set with high-probability. When unregularized, to find a solution epsilon-accurate in log-likelihood requires a total amount of effort cubic in 1/epsilon, disregarding logarithmic factors. When ridge-regularized, strong convexity allows a solution epsilon-accurate in parameter distance with effort quadratic in 1/epsilon. Both of these provide of a fully-polynomial time randomized approximation scheme.
Gaussian Process Random Fields
Moore, David A., Russell, Stuart J.
Gaussian processes have been successful in both supervised and unsupervised machine learning tasks, but their computational complexity has constrained practical applications. We introduce a new approximation for large-scale Gaussian processes, the Gaussian Process Random Field (GPRF), in which local GPs are coupled via pairwise potentials. The GPRF likelihood is a simple, tractable, and parallelizeable approximation to the full GP marginal likelihood, enabling latent variable modeling and hyperparameter selection on large datasets. We demonstrate its effectiveness on synthetic spatial data as well as a real-world application to seismic event location.