Statistical Learning
LOCO: Distributing Ridge Regression with Random Projections
Heinze, Christina, McWilliams, Brian, Meinshausen, Nicolai, Krummenacher, Gabriel
We propose Loco, an algorithm for large-scale ridge regression which distributes the features across workers on a cluster. Important dependencies between variables are preserved using structured random projections which are cheap to compute and must only be communicated once. We show that Loco obtains a solution which is close to the exact ridge regression solution in the fixed design setting. We verify this experimentally in a simulation study as well as an application to climate prediction. Furthermore, we show that Loco achieves significant speedups compared with a state-of-the-art distributed algorithm on a large-scale regression problem.
A Topological Approach to Spectral Clustering
The analysis of complex, high-dimensional data is one of the major research challenges in contemporary computer science and statistics. In recent years, geometric and topological approaches to data analysis have begun to yield important insights into the structure of complex data sets (see, for instance, [1] for an example of spectral geometry applied to dimension reduction, and [6], [2] for surveys on homological methods of data analysis and visualization). The common point of departure of these methods is the assumption that data in highdimensional spaces is often concentrated around a low-dimensional manifold or other topological space. In this note, we begin from the assumption that the data comes from a uniform distribution supported on a topologically disconnected space, and that clusters in the data reflect this lack of topological connectivity. Geometric techniques for data analysis have concentrated on approximating the geometry of the data as a step toward nonlinear dimension reduction. Once the dimension is reduced, standard statistical techniques are then used to analyze the data in the lower-dimensional space.
Linear Convergence of the Randomized Feasible Descent Method Under the Weak Strong Convexity Assumption
Ma, Chenxin, Tappenden, Rachael, Takรกฤ, Martin
In this paper we generalize the framework of the feasible descent method (FDM) to a randomized (R-FDM) and a coordinate-wise random feasible descent method (RC-FDM) framework. We show that the famous SDCA algorithm for optimizing the SVM dual problem, or the stochastic coordinate descent method for the LASSO problem, fits into the framework of RC-FDM. We prove linear convergence for both R-FDM and RC-FDM under the weak strong convexity assumption. Moreover, we show that the duality gap converges linearly for RC-FDM, which implies that the duality gap also converges linearly for SDCA applied to the SVM dual problem.
Population Empirical Bayes
Kucukelbir, Alp, Blei, David M.
Bayesian predictive inference analyzes a dataset to make predictions about new observations. When a model does not match the data, predictive accuracy suffers. We develop population empirical Bayes (POP-EB), a hierarchical framework that explicitly models the empirical population distribution as part of Bayesian analysis. We introduce a new concept, the latent dataset, as a hierarchical variable and set the empirical population as its prior. This leads to a new predictive density that mitigates model mismatch. We efficiently apply this method to complex models by proposing a stochastic variational inference algorithm, called bumping variational inference (BUMP-VI). We demonstrate improved predictive accuracy over classical Bayesian inference in three models: a linear regression model of health data, a Bayesian mixture model of natural images, and a latent Dirichlet allocation topic model of scientific documents.
String Gaussian Process Kernels
Samo, Yves-Laurent Kom, Roberts, Stephen
Kernels are often used as a flexible way of departing from linear hypotheses in learning machines, thereby allowing for more complex nonlinear patterns [1, 2]. They have indeed been successfully applied to problems of classification, clustering, density estimation and regression. The duality between kernels and covariance functions has made kernels a critical tool for both frequentist and Bayesian statisticians. In the Bayesian nonparametrics community, kernels are often used as a covariance function of a Gaussian process (GP), introduced as a prior over a latent function. The family of covariance functions postulated for the GP is typically chosen so as to express prior domain knowledge about the underlying function, such as periodicity, regularity and range.
Primal Method for ERM with Flexible Mini-batching Schemes and Non-convex Losses
Csiba, Dominik, Richtรกrik, Peter
In this work we develop a new algorithm for regularized empirical risk minimization. Our method extends recent techniques of Shalev-Shwartz [02/2015], which enable a dual-free analysis of SDCA, to arbitrary mini-batching schemes. Moreover, our method is able to better utilize the information in the data defining the ERM problem. For convex loss functions, our complexity results match those of QUARTZ, which is a primal-dual method also allowing for arbitrary mini-batching schemes. The advantage of a dual-free analysis comes from the fact that it guarantees convergence even for non-convex loss functions, as long as the average loss is convex. We illustrate through experiments the utility of being able to design arbitrary mini-batching schemes.
Kernel Manifold Alignment
Tuia, Devis, Camps-Valls, Gustau
We introduce a kernel method for manifold alignment (KEMA) and domain adaptation that can match an arbitrary number of data sources without needing corresponding pairs, just few labeled examples in all domains. KEMA has interesting properties: 1) it generalizes other manifold alignment methods, 2) it can align manifolds of very different complexities, performing a sort of manifold unfolding plus alignment, 3) it can define a domain-specific metric to cope with multimodal specificities, 4) it can align data spaces of different dimensionality, 5) it is robust to strong nonlinear feature deformations, and 6) it is closed-form invertible which allows transfer across-domains and data synthesis. We also present a reduced-rank version for computational efficiency and discuss the generalization performance of KEMA under Rademacher principles of stability. KEMA exhibits very good performance over competing methods in synthetic examples, visual object recognition and recognition of facial expressions tasks.
High-dimensional Ordinary Least-squares Projection for Screening Variables
Variable selection is a challenging issue in statistical applications when the number of predictors $p$ far exceeds the number of observations $n$. In this ultra-high dimensional setting, the sure independence screening (SIS) procedure was introduced to significantly reduce the dimensionality by preserving the true model with overwhelming probability, before a refined second stage analysis. However, the aforementioned sure screening property strongly relies on the assumption that the important variables in the model have large marginal correlations with the response, which rarely holds in reality. To overcome this, we propose a novel and simple screening technique called the high-dimensional ordinary least-squares projection (HOLP). We show that HOLP possesses the sure screening property and gives consistent variable selection without the strong correlation assumption, and has a low computational complexity. A ridge type HOLP procedure is also discussed. Simulation study shows that HOLP performs competitively compared to many other marginal correlation based methods. An application to a mammalian eye disease data illustrates the attractiveness of HOLP.
JUMP-Means: Small-Variance Asymptotics for Markov Jump Processes
Huggins, Jonathan H., Narasimhan, Karthik, Saeedi, Ardavan, Mansinghka, Vikash K.
Markov jump processes (MJPs) are used to model a wide range of phenomena from disease progression to RNA path folding. However, maximum likelihood estimation of parametric models leads to degenerate trajectories and inferential performance is poor in nonparametric models. We take a small-variance asymptotics (SVA) approach to overcome these limitations. We derive the small-variance asymptotics for parametric and nonparametric MJPs for both directly observed and hidden state models. In the parametric case we obtain a novel objective function which leads to non-degenerate trajectories. To derive the nonparametric version we introduce the gamma-gamma process, a novel extension to the gamma-exponential process. We propose algorithms for each of these formulations, which we call \emph{JUMP-means}. Our experiments demonstrate that JUMP-means is competitive with or outperforms widely used MJP inference approaches in terms of both speed and reconstruction accuracy.
Low-Cost Learning via Active Data Procurement
Abernethy, Jacob, Chen, Yiling, Ho, Chien-Ju, Waggoner, Bo
We design mechanisms for online procurement of data held by strategic agents for machine learning tasks. The challenge is to use past data to actively price future data and give learning guarantees even when an agent's cost for revealing her data may depend arbitrarily on the data itself. We achieve this goal by showing how to convert a large class of no-regret algorithms into online posted-price and learning mechanisms. Our results in a sense parallel classic sample complexity guarantees, but with the key resource being money rather than quantity of data: With a budget constraint $B$, we give robust risk (predictive error) bounds on the order of $1/\sqrt{B}$. Because we use an active approach, we can often guarantee to do significantly better by leveraging correlations between costs and data. Our algorithms and analysis go through a model of no-regret learning with $T$ arriving pairs (cost, data) and a budget constraint of $B$. Our regret bounds for this model are on the order of $T/\sqrt{B}$ and we give lower bounds on the same order.