Statistical Learning
Predicting Next Label Quality: A Time-Series Model of Crowdwork
Jung, Hyun Joon (University of Texas at Austin) | Park, Yubin (University of Texas at Austin) | Lease, Matthew (University of Texas at Austin)
While temporal behavioral patterns can be discerned to underlie real crowd work, prior studies have typically modeled worker performance under a simplified i.i.d. assumption. To better model such temporal worker behavior, we propose a time-series label prediction model for crowd work. This latent variable model captures and summarizes past worker behavior, enabling us to better predict the quality of each worker's next label. Given inherent uncertainty in prediction, we also investigate a decision reject option to balance the tradeoff between prediction accuracy and coverage. Results show that our model improves both the accuracy of label prediction on real crowd worker data and data quality overall.
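The accuracy/coverage tradeoff behind a reject option can be illustrated with a minimal sketch. It assumes the predictor outputs a probability that the next label is correct; the function names and the confidence threshold are hypothetical and are not taken from the paper.

```python
from typing import List, Optional, Tuple


def predict_with_reject(probs: List[float], threshold: float) -> List[Optional[int]]:
    """Return 1 or 0 when the prediction is confident enough, else None (reject).

    `probs` are assumed probabilities that the next label is correct;
    `threshold` is a hypothetical confidence cutoff in [0.5, 1].
    """
    decisions: List[Optional[int]] = []
    for p in probs:
        confidence = max(p, 1.0 - p)
        if confidence >= threshold:
            decisions.append(1 if p >= 0.5 else 0)
        else:
            decisions.append(None)  # abstain on uncertain cases
    return decisions


def accuracy_and_coverage(
    decisions: List[Optional[int]], truth: List[int]
) -> Tuple[float, float]:
    """Accuracy over non-rejected items, and the fraction of items not rejected."""
    kept = [(d, t) for d, t in zip(decisions, truth) if d is not None]
    coverage = len(kept) / len(truth)
    accuracy = sum(d == t for d, t in kept) / len(kept) if kept else float("nan")
    return accuracy, coverage
```

Raising the threshold rejects more predictions, which lowers coverage and usually raises accuracy on the predictions that remain.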
Greedy Subspace Clustering
Park, Dohyung, Caramanis, Constantine, Sanghavi, Sujay
We consider the problem of subspace clustering: given points that lie on or near the union of many low-dimensional linear subspaces, recover the subspaces. To this end, one first identifies sets of points close to the same subspace and uses the sets to estimate the subspaces. As the geometric structure of the clusters (linear subspaces) forbids proper performance of general distance based approaches such as K-means, many model-specific methods have been proposed. In this paper, we provide new simple and efficient algorithms for this problem. Our statistical analysis shows that the algorithms are guaranteed exact (perfect) clustering performance under certain conditions on the number of points and the affinity between subspaces. These conditions are weaker than those considered in the standard statistical literature. Experimental results on synthetic data generated from the standard unions of subspaces model demonstrate our theory. We also show that our algorithm performs competitively against state-of-the-art algorithms on real-world applications such as motion segmentation and face clustering, with much simpler implementation and lower computational cost.
Altitude Training: Strong Bounds for Single-Layer Dropout
Wager, Stefan, Fithian, William, Wang, Sida, Liang, Percy
Dropout training, originally designed for deep neural networks, has been successful on high-dimensional single-layer natural language tasks. This paper proposes a theoretical explanation for this phenomenon: we show that, under a generative Poisson topic model with long documents, dropout training improves the exponent in the generalization bound for empirical risk minimization. Dropout achieves this gain much like a marathon runner who practices at altitude: once a classifier learns to perform reasonably well on training examples that have been artificially corrupted by dropout, it will do very well on the uncorrupted test set. We also show that, under similar conditions, dropout preserves the Bayes decision boundary and should therefore induce minimal bias in high dimensions.
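As a rough illustration of "training examples that have been artificially corrupted by dropout", the sketch below zeroes each feature independently with probability `delta`. Rescaling the surviving features by 1/(1 - delta) is an assumption (a common convention that keeps the corrupted features unbiased), and the function name is hypothetical; the paper's exact noising scheme may differ.

```python
import random
from typing import List


def dropout_corrupt(x: List[float], delta: float, rng: random.Random) -> List[float]:
    """Drop each feature with probability `delta`; rescale the survivors.

    The 1 / (1 - delta) rescaling is an assumed convention so that the
    expected value of each corrupted feature equals the original feature.
    """
    if not 0.0 <= delta < 1.0:
        raise ValueError("delta must be in [0, 1)")
    keep_scale = 1.0 / (1.0 - delta)
    return [xj * keep_scale if rng.random() >= delta else 0.0 for xj in x]


if __name__ == "__main__":
    rng = random.Random(0)
    word_counts = [3.0, 0.0, 1.0, 5.0]  # e.g. a bag-of-words document
    print(dropout_corrupt(word_counts, delta=0.5, rng=rng))
```

A classifier would be fit on many such corrupted copies of each training document and then evaluated on uncorrupted test documents.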
Semi-Supervised Learning with Deep Generative Models
Kingma, Diederik P., Rezende, Danilo J., Mohamed, Shakir, Welling, Max
The ever-increasing size of modern data sets combined with the difficulty of obtaining label information has made semi-supervised learning one of the problems of significant practical importance in modern data analysis. We revisit the approach to semi-supervised learning with generative models and develop new models that allow for effective generalisation from small labelled data sets to large unlabelled ones. Generative approaches have thus far been either inflexible, inefficient or non-scalable. We show that deep generative models and approximate Bayesian inference exploiting recent advances in variational methods can be used to provide significant improvements, making generative approaches highly competitive for semi-supervised learning.
Feedback Detection for Live Predictors
Wager, Stefan, Chamandy, Nick, Muralidharan, Omkar, Najmi, Amir
A predictor that is deployed in a live production system may perturb the features it uses to make predictions. Such a feedback loop can occur, for example, when a model that predicts a certain type of behavior ends up causing the behavior it predicts, thus creating a self-fulfilling prophecy. In this paper we analyze predictor feedback detection as a causal inference problem, and introduce a local randomization scheme that can be used to detect non-linear feedback in real-world problems. We conduct a pilot study for our proposed methodology using a predictive system currently deployed as a part of a search engine.
Partition-wise Linear Models
Oiwa, Hidekazu, Fujimaki, Ryohei
Region-specific linear models are widely used in practical applications because of their non-linear but highly interpretable model representations. One of the key challenges in their use is non-convexity in simultaneous optimization of regions and region-specific models. This paper proposes novel convex region-specific linear models, which we refer to as partition-wise linear models. Our key ideas are 1) assigning linear models not to regions but to partitions (region-specifiers) and representing region-specific linear models by linear combinations of partition-specific models, and 2) optimizing regions via partition selection from a large number of given partition candidates by means of convex structured regularizations. In addition to providing initialization-free globally-optimal solutions, our convex formulation makes it possible to derive a generalization bound and to use such advanced optimization techniques as proximal methods and decomposition of the proximal maps for sparsity-inducing regularizations. Experimental results demonstrate that our partition-wise linear models perform better than or are at least competitive with state-of-the-art region-specific or locally linear models.
Sparse principal component regression with adaptive loading
Kawano, Shuichi, Fujisawa, Hironori, Takada, Toyoyuki, Shiroishi, Toshihiko
Principal component analysis (PCA) (Jolliffe, 2002) is a fundamental statistical tool for dimensionality reduction, data processing, and visualization of multivariate data, with various applications in biology, engineering, and social science. In regression analysis, it can be useful to replace many original explanatory variables with a few principal components, which is called the principal component regression (PCR) (Massy, 1965; Jolliffe, 1982). PCR is widely used in various fields of research and many extensions of PCR have been proposed (see, e.g., Hartnett et al., 1998; Rosital et al., 2001; Reiss and Ogden, 2007; Wang and Abbott, 2008). Whereas PCR is a useful tool for analyzing multivariate data, this method may not have enough prediction accuracy if the response variable depends on the principal components with small eigenvalues. The problem arises from the two-stage procedure for PCR; a few principal components are selected with large eigenvalues, but without any relation to response variable, and then the regression model is constructed using them as new explanatory variables. In this paper, we deal with PCA and regression analysis simultaneously, and propose a one-stage procedure for PCR to address this problem. The procedure combines two loss functions; one is the ordinary regression analysis loss and the other is PCA loss with some devices proposed by Zou et al. (2006).
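The classical two-stage PCR procedure described above can be sketched as follows. This is a minimal NumPy illustration of the baseline only, not the one-stage method the paper proposes; the function name `two_stage_pcr` and the toy data are hypothetical.

```python
import numpy as np


def two_stage_pcr(X: np.ndarray, y: np.ndarray, k: int):
    """Classical PCR: PCA on X first, then least squares on the top-k scores.

    Returns (beta, intercept) expressed in the original feature space.
    """
    x_mean = X.mean(axis=0)
    y_mean = y.mean()
    Xc = X - x_mean
    # Stage 1: PCA via SVD. Rows of Vt are loading vectors, ordered by
    # decreasing singular value; y plays no role in this selection.
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    V_k = Vt[:k].T                      # shape (p, k)
    Z = Xc @ V_k                        # principal component scores
    # Stage 2: ordinary least squares of the centered response on the scores.
    gamma, *_ = np.linalg.lstsq(Z, y - y_mean, rcond=None)
    beta = V_k @ gamma
    intercept = y_mean - x_mean @ beta
    return beta, intercept


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # Toy data: the response depends only on the low-variance direction.
    X = rng.normal(size=(200, 2)) * np.array([10.0, 0.1])
    y = 5.0 * X[:, 1]
    for k in (1, 2):
        beta, b0 = two_stage_pcr(X, y, k)
        resid = y - (X @ beta + b0)
        print(k, 1.0 - resid.var() / y.var())  # R^2 for k components
```

In the toy data the response is driven by the small-eigenvalue component, so keeping only the leading component discards most of the signal. This is the failure mode that motivates selecting components with the response in view.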
A random forest system combination approach for error detection in digital dictionaries
Bloodgood, Michael, Ye, Peng, Rodrigues, Paul, Zajic, David, Doermann, David
When digitizing a print bilingual dictionary, whether via optical character recognition or manual entry, it is inevitable that errors are introduced into the electronic version that is created. We investigate automating the process of detecting errors in an XML representation of a digitized print dictionary using a hybrid approach that combines rule-based, feature-based, and language model-based methods. We investigate combining methods and show that using random forests is a promising approach. We find that in isolation, unsupervised methods rival the performance of supervised methods. Random forests typically require training data so we investigate how we can apply random forests to combine individual base methods that are themselves unsupervised without requiring large amounts of training data. Experiments reveal empirically that a relatively small amount of data is sufficient and can potentially be further reduced through specific selection criteria.
Causal Inference through a Witness Protection Program
One of the most fundamental problems in causal inference is the estimation of a causal effect when variables are confounded. This is difficult in an observational study, because one has no direct evidence that all confounders have been adjusted for. We introduce a novel approach for estimating causal effects that exploits observational conditional independencies to suggest "weak" paths in an unknown causal graph. The widely used faithfulness condition of Spirtes et al. is relaxed to allow for varying degrees of "path cancellations" that imply conditional independencies but do not rule out the existence of confounding causal paths. The outcome is a posterior distribution over bounds on the average causal effect via a linear programming approach and Bayesian inference. We claim this approach should be used in regular practice along with other default tools in observational studies.
Discovering Structure in High-Dimensional Data Through Correlation Explanation
Steeg, Greg Ver, Galstyan, Aram
We introduce a method to learn a hierarchy of successively more abstract representations of complex data based on optimizing an information-theoretic objective. Intuitively, the optimization searches for a set of latent factors that best explain the correlations in the data as measured by multivariate mutual information. The method is unsupervised, requires no model assumptions, and scales linearly with the number of variables, which makes it an attractive approach for very high-dimensional systems. We demonstrate that Correlation Explanation (CorEx) automatically discovers meaningful structure for data from diverse sources including personality tests, DNA, and human language.
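One standard way to formalize "multivariate mutual information" is total correlation. The abstract does not spell out the definition, so reading it this way is an assumption, and the symbols below (X_1, ..., X_n for the observed variables, Y for the latent factors) are chosen here for illustration.

```latex
% Total correlation of the observed variables X = (X_1, \dots, X_n):
TC(X) = \sum_{i=1}^{n} H(X_i) - H(X_1, \dots, X_n)

% Correlation in X that is explained by latent factors Y:
TC(X; Y) = TC(X) - TC(X \mid Y),
\qquad
TC(X \mid Y) = \sum_{i=1}^{n} H(X_i \mid Y) - H(X \mid Y)
```

Under this reading, TC(X) is zero exactly when the X_i are independent, and latent factors Y explain all of the correlation when TC(X | Y) = 0, that is, when the X_i are conditionally independent given Y.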