Goto

Collaborating Authors

 Statistical Learning


On the accuracy of self-normalized log-linear models

arXiv.org Machine Learning

Calculation of the log-normalizer is a major computational obstacle in applications of log-linear models with large output spaces. The problem of fast normalizer computation has therefore attracted significant attention in the theoretical and applied machine learning literature. In this paper, we analyze a recently proposed technique known as "self-normalization", which introduces a regularization term in training to penalize log normalizers for deviating from zero. This makes it possible to use unnormalized model scores as approximate probabilities. Empirical evidence suggests that self-normalization is extremely effective, but a theoretical understanding of why it should work, and how generally it can be applied, is largely lacking. We prove generalization bounds on the estimated variance of normalizers and upper bounds on the loss in accuracy due to self-normalization, describe classes of input distributions that self-normalize easily, and construct explicit examples of high-variance input distributions. Our theoretical results make predictions about the difficulty of fitting self-normalized models to several classes of distributions, and we conclude with empirical validation of these predictions.


A simple application of FIC to model selection

arXiv.org Machine Learning

Although the predictivity of a model is a central objective in model building in science, it is only one of a wide range of criteria considered. We also seek models that are motivated by our understanding of the underlying mechanisms that give rise to phenomena and the idea of model parsimony is often a useful guiding principle, especially in physics. In contrast to this broad view of model selection, this paper describes the application of a theory for model selection motivated and entirely justified by a narrow definition of model predictivity: the ability of a model to predict a new observation generated by a stochastic process, after the model parameters have been fit to a finite number of previous observations.


Dependent Multinomial Models Made Easy: Stick Breaking with the P\'olya-Gamma Augmentation

arXiv.org Machine Learning

Many practical modeling problems involve discrete data that are best represented as draws from multinomial or categorical distributions. For example, nucleotides in a DNA sequence, children's names in a given state and year, and text documents are all commonly modeled with multinomial distributions. In all of these cases, we expect some form of dependency between the draws: the nucleotide at one position in the DNA strand may depend on the preceding nucleotides, children's names are highly correlated from year to year, and topics in text may be correlated and dynamic. These dependencies are not naturally captured by the typical Dirichlet-multinomial formulation. Here, we leverage a logistic stick-breaking representation and recent innovations in P\'olya-gamma augmentation to reformulate the multinomial distribution in terms of latent variables with jointly Gaussian likelihoods, enabling us to take advantage of a host of Bayesian inference techniques for Gaussian models with minimal overhead.


Optimal model-free prediction from multivariate time series

arXiv.org Machine Learning

Forecasting a time series from multivariate predictors constitutes a challenging problem, especially using model-free approaches. Most techniques, such as nearest-neighbor prediction, quickly suffer from the curse of dimensionality and overfitting for more than a few predictors which has limited their application mostly to the univariate case. Therefore, selection strategies are needed that harness the available information as efficiently as possible. Since often the right combination of predictors matters, ideally all subsets of possible predictors should be tested for their predictive power, but the exponentially growing number of combinations makes such an approach computationally prohibitive. Here a prediction scheme that overcomes this strong limitation is introduced utilizing a causal pre-selection step which drastically reduces the number of possible predictors to the most predictive set of causal drivers making a globally optimal search scheme tractable. The information-theoretic optimality is derived and practical selection criteria are discussed. As demonstrated for multivariate nonlinear stochastic delay processes, the optimal scheme can even be less computationally expensive than commonly used sub-optimal schemes like forward selection. The method suggests a general framework to apply the optimal model-free approach to select variables and subsequently fit a model to further improve a prediction or learn statistical dependencies. The performance of this framework is illustrated on a climatological index of El Ni\~no Southern Oscillation.


Convolutional Dictionary Learning through Tensor Factorization

arXiv.org Machine Learning

Tensor methods have emerged as a powerful paradigm for consistent learning of many latent variable models such as topic models, independent component analysis and dictionary learning. Model parameters are estimated via CP decomposition of the observed higher order input moments. However, in many domains, additional invariances such as shift invariances exist, enforced via models such as convolutional dictionary learning. In this paper, we develop novel tensor decomposition algorithms for parameter estimation of convolutional models. Our algorithm is based on the popular alternating least squares method, but with efficient projections onto the space of stacked circulant matrices. Our method is embarrassingly parallel and consists of simple operations such as fast Fourier transforms and matrix multiplications. Our algorithm converges to the dictionary much faster and more accurately compared to the alternating minimization over filters and activation maps.


Stay on path: PCA along graph paths

arXiv.org Machine Learning

We introduce a variant of (sparse) PCA in which the set of feasible support sets is determined by a graph. In particular, we consider the following setting: given a directed acyclic graph $G$ on $p$ vertices corresponding to variables, the non-zero entries of the extracted principal component must coincide with vertices lying along a path in $G$. From a statistical perspective, information on the underlying network may potentially reduce the number of observations required to recover the population principal component. We consider the canonical estimator which optimally exploits the prior knowledge by solving a non-convex quadratic maximization on the empirical covariance. We introduce a simple network and analyze the estimator under the spiked covariance model. We show that side information potentially improves the statistical complexity. We propose two algorithms to approximate the solution of the constrained quadratic maximization, and recover a component with the desired properties. We empirically evaluate our schemes on synthetic and real datasets.


Optimally Combining Classifiers Using Unlabeled Data

arXiv.org Machine Learning

We develop a worst-case analysis of aggregation of classifier ensembles for binary classification. The task of predicting to minimize error is formulated as a game played over a given set of unlabeled data (a transductive setting), where prior label information is encoded as constraints on the game. The minimax solution of this game identifies cases where a weighted combination of the classifiers can perform significantly better than any single classifier.


Learning Scale-Free Networks by Dynamic Node-Specific Degree Prior

arXiv.org Machine Learning

Learning the network structure underlying data is an important problem in machine learning. This paper introduces a novel prior to study the inference of scale-free networks, which are widely used to model social and biological networks. The prior not only favors a desirable global node degree distribution, but also takes into consideration the relative strength of all the possible edges adjacent to the same node and the estimated degree of each individual node. To fulfill this, ranking is incorporated into the prior, which makes the problem challenging to solve. We employ an ADMM (alternating direction method of multipliers) framework to solve the Gaussian Graphical model regularized by this prior. Our experiments on both synthetic and real data show that our prior not only yields a scale-free network, but also produces many more correctly predicted edges than the others such as the scale-free inducing prior, the hub-inducing prior and the $l_1$ norm.


Exact Hybrid Covariance Thresholding for Joint Graphical Lasso

arXiv.org Machine Learning

This paper considers the problem of estimating multiple related Gaussian graphical models from a $p$-dimensional dataset consisting of different classes. Our work is based upon the formulation of this problem as group graphical lasso. This paper proposes a novel hybrid covariance thresholding algorithm that can effectively identify zero entries in the precision matrices and split a large joint graphical lasso problem into small subproblems. Our hybrid covariance thresholding method is superior to existing uniform thresholding methods in that our method can split the precision matrix of each individual class using different partition schemes and thus split group graphical lasso into much smaller subproblems, each of which can be solved very fast. In addition, this paper establishes necessary and sufficient conditions for our hybrid covariance thresholding algorithm. The superior performance of our thresholding method is thoroughly analyzed and illustrated by a few experiments on simulated data and real gene expression data.


Generalized Additive Model Selection

arXiv.org Machine Learning

We introduce GAMSEL (Generalized Additive Model Selection), a penalized likelihood approach for fitting sparse generalized additive models in high dimension. Our method interpolates between null, linear and additive models by allowing the effect of each variable to be estimated as being either zero, linear, or a low-complexity curve, as determined by the data. We present a blockwise coordinate descent procedure for efficiently optimizing the penalized likelihood objective over a dense grid of the tuning parameter, producing a regularization path of additive models. We demonstrate the performance of our method on both real and simulated data examples, and compare it with existing techniques for additive model selection.