Statistical Learning
Differentially Private Bayesian Optimization
Kusner, Matt J., Gardner, Jacob R., Garnett, Roman, Weinberger, Kilian Q.
Bayesian optimization is a powerful tool for fine-tuning the hyper-parameters of a wide variety of machine learning models. The success of machine learning has led practitioners in diverse real-world settings to learn classifiers for practical problems. As machine learning becomes commonplace, Bayesian optimization becomes an attractive method for practitioners to automate the process of classifier hyper-parameter tuning. A key observation is that the data used for tuning models in these settings is often sensitive. Certain data such as genetic predisposition, personal email statistics, and car accident history, if not properly private, may be at risk of being inferred from Bayesian optimization outputs. To address this, we introduce methods for releasing the best hyper-parameters and classifier accuracy privately. Leveraging the strong theoretical guarantees of differential privacy and known Bayesian optimization convergence bounds, we prove that under a GP assumption these private quantities are also near-optimal. Finally, even if this assumption is not satisfied, we can use different smoothness guarantees to protect privacy.
Using NLP to measure democracy
This paper uses natural language processing to create the first machine-coded democracy index, which I call Automated Democracy Scores (ADS). The ADS are based on 42 million news articles from 6,043 different sources and cover all independent countries in the 1993-2012 period. Unlike the democracy indices we have today the ADS are replicable and have standard errors small enough to actually distinguish between cases. The ADS are produced with supervised learning. Three approaches are tried: a) a combination of Latent Semantic Analysis and tree-based regression methods; b) a combination of Latent Dirichlet Allocation and tree-based regression methods; and c) the Wordscores algorithm. The Wordscores algorithm outperforms the alternatives, so it is the one on which the ADS are based. There is a web application where anyone can change the training set and see how the results change: democracy-scores.org
Clustering multi-way data: a novel algebraic approach
Kernfeld, Eric, Aeron, Shuchin, Kilmer, Misha
In this paper, we develop a method for unsupervised clustering of two-way (matrix) data by combining two recent innovations from different fields: the Sparse Subspace Clustering (SSC) algorithm [10], which groups points coming from a union of subspaces into their respective subspaces, and the t-product [18], which was introduced to provide a matrix-like multiplication for third order tensors. Our algorithm is analogous to SSC in that an "affinity" between different data points is built using a sparse self-representation of the data. Unlike SSC, we employ the t-product in the self-representation. This allows us more flexibility in modeling; infact, SSC is a special case of our method. When using the t-product, three-way arrays are treated as matrices whose elements (scalars) are n-tuples or tubes. Convolutions take the place of scalar multiplication. This framework allows us to embed the 2-D data into a vector-space-like structure called a free module over a commutative ring. These free modules retain many properties of complex inner-product spaces, and we leverage that to provide theoretical guarantees on our algorithm. We show that compared to vector-space counterparts, SSmC achieves higher accuracy and better able to cluster data with less preprocessing in some image clustering problems. In particular we show the performance of the proposed method on Weizmann face database, the Extended Yale B Face database and the MNIST handwritten digits database.
Sensitivity Analysis for Computationally Expensive Models using Optimization and Objective-oriented Surrogate Approximations
Wang, Yilun, Shoemaker, Christine A.
In this paper, we focus on developing efficient sensitivity analysis methods for a computationally expensive objective function $f(x)$ in the case that the minimization of it has just been performed. Here "computationally expensive" means that each of its evaluation takes significant amount of time, and therefore our main goal to use a small number of function evaluations of $f(x)$ to further infer the sensitivity information of these different parameters. Correspondingly, we consider the optimization procedure as an adaptive experimental design and re-use its available function evaluations as the initial design points to establish a surrogate model $s(x)$ (or called response surface). The sensitivity analysis is performed on $s(x)$, which is an lieu of $f(x)$. Furthermore, we propose a new local multivariate sensitivity measure, for example, around the optimal solution, for high dimensional problems. Then a corresponding "objective-oriented experimental design" is proposed in order to make the generated surrogate $s(x)$ better suitable for the accurate calculation of the proposed specific local sensitivity quantities. In addition, we demonstrate the better performance of the Gaussian radial basis function interpolator over Kriging in our cases, which are of relatively high dimensionality and few experimental design points. Numerical experiments demonstrate that the optimization procedure and the "objective-oriented experimental design" behavior much better than the classical Latin Hypercube Design. In addition, the performance of Kriging is not as good as Gaussian RBF, especially in the case of high dimensional problems.
Deep Learning using Linear Support Vector Machines
Recently, fully-connected and convolutional neural networks have been trained to achieve state-of-the-art performance on a wide variety of tasks such as speech recognition, image classification, natural language processing, and bioinformatics. For classification tasks, most of these "deep learning" models employ the softmax activation function for prediction and minimize cross-entropy loss. In this paper, we demonstrate a small but consistent advantage of replacing the softmax layer with a linear support vector machine. Learning minimizes a margin-based loss instead of the cross-entropy loss. While there have been various combinations of neural nets and SVMs in prior art, our results using L2-SVMs show that by simply replacing softmax with linear SVMs gives significant gains on popular deep learning datasets MNIST, CIFAR-10, and the ICML 2013 Representation Learning Workshop's face expression recognition challenge.
MILJS : Brand New JavaScript Libraries for Matrix Calculation and Machine Learning
Miura, Ken, Mano, Tetsuaki, Kanehira, Atsushi, Tsuchiya, Yuichiro, Harada, Tatsuya
MILJS is a collection of state-of-the-art, platform-independent, scalable, fast JavaScript libraries for matrix calculation and machine learning. Our core library offering a matrix calculation is called Sushi, which exhibits far better performance than any other leading machine learning libraries written in JavaScript. Especially, our matrix multiplication is 177 times faster than the fastest JavaScript benchmark. Based on Sushi, a machine learning library called Tempura is provided, which supports various algorithms widely used in machine learning research. We also provide Soba as a visualization library. The implementations of our libraries are clearly written, properly documented and thus can are easy to get started with, as long as there is a web browser. These libraries are available from http://mil-tokyo.github.io/
NP-Hardness and Inapproximability of Sparse PCA
The earliest reference to principal components analysis (PCA) is in [14]. Since then, PCA has evolved into a classic tool for data analysis. A challenge for the interpretation of the principal components (or factors) is that they can be linear combinations of all the original variables. When the original variables have direct physical significance (e.g.
Pairwise Constraint Propagation: A Survey
As one of the most important types of (weaker) supervised information in machine learning and pattern recognition, pairwise constraint, which specifies whether a pair of data points occur together, has recently received significant attention, especially the problem of pairwise constraint propagation. At least two reasons account for this trend: the first is that compared to the data label, pairwise constraints are more general and easily to collect, and the second is that since the available pairwise constraints are usually limited, the constraint propagation problem is thus important. This paper provides an up-to-date critical survey of pairwise constraint propagation research. There are two underlying motivations for us to write this survey paper: the first is to provide an up-to-date review of the existing literature, and the second is to offer some insights into the studies of pairwise constraint propagation. To provide a comprehensive survey, we not only categorize existing propagation techniques but also present detailed descriptions of representative methods within each category.
Finding Dantzig selectors with a proximity operator based fixed-point algorithm
Prater, Ashley, Shen, Lixin, Suter, Bruce W.
In this paper, we study a simple iterative method for finding the Dantzig selector, which was designed for linear regression problems. The method consists of two main stages. The first stage is to approximate the Dantzig selector through a fixed-point formulation of solutions to the Dantzig selector problem. The second stage is to construct a new estimator by regressing data onto the support of the approximated Dantzig selector. We compare our method to an alternating direction method, and present the results of numerical simulations using both the proposed method and the alternating direction method on synthetic and real data sets. The numerical simulations demonstrate that the two methods produce results of similar quality, however the proposed method tends to be significantly faster.
A New Sampling Technique for Tensors
Bhojanapalli, Srinadh, Sanghavi, Sujay
In this paper we propose new techniques to sample arbitrary third-order tensors, with an objective of speeding up tensor algorithms that have recently gained popularity in machine learning. Our main contribution is a new way to select, in a biased random way, only $O(n^{1.5}/\epsilon^2)$ of the possible $n^3$ elements while still achieving each of the three goals: \\ {\em (a) tensor sparsification}: for a tensor that has to be formed from arbitrary samples, compute very few elements to get a good spectral approximation, and for arbitrary orthogonal tensors {\em (b) tensor completion:} recover an exactly low-rank tensor from a small number of samples via alternating least squares, or {\em (c) tensor factorization:} approximating factors of a low-rank tensor corrupted by noise. \\ Our sampling can be used along with existing tensor-based algorithms to speed them up, removing the computational bottleneck in these methods.