Goto

Collaborating Authors

 Statistical Learning


Stability of Multi-Task Kernel Regression Algorithms

arXiv.org Machine Learning

We study the stability properties of nonlinear multi-task regression in reproducing Hilbert spaces with operator-valued kernels. Such kernels, a.k.a. multi-task kernels, are appropriate for learning prob- lems with nonscalar outputs like multi-task learning and structured out- put prediction. We show that multi-task kernel regression algorithms are uniformly stable in the general case of infinite-dimensional output spaces. We then derive under mild assumption on the kernel generaliza- tion bounds of such algorithms, and we show their consistency even with non Hilbert-Schmidt operator-valued kernels . We demonstrate how to apply the results to various multi-task kernel regression methods such as vector-valued SVR and functional ridge regression.


Spectral Experts for Estimating Mixtures of Linear Regressions

arXiv.org Machine Learning

Discriminative latent-variable models are typically learned using EM or gradient-based optimization, which suffer from local optima. In this paper, we develop a new computationally efficient and provably consistent estimator for a mixture of linear regressions, a simple instance of a discriminative latent-variable model. Our approach relies on a low-rank linear regression to recover a symmetric tensor, which can be factorized into the parameters using a tensor power method. We prove rates of convergence for our estimator and provide an empirical evaluation illustrating its strengths relative to local optimization (EM).


Early stopping and non-parametric regression: An optimal data-dependent stopping rule

arXiv.org Machine Learning

The strategy of early stopping is a regularization technique based on choosing a stopping time for an iterative algorithm. Focusing on non-parametric regression in a reproducing kernel Hilbert space, we analyze the early stopping strategy for a form of gradient-descent applied to the least-squares loss function. We propose a data-dependent stopping rule that does not involve hold-out or cross-validation data, and we prove upper bounds on the squared error of the resulting function estimate, measured in either the $L^2(P)$ and $L^2(P_n)$ norm. These upper bounds lead to minimax-optimal rates for various kernel classes, including Sobolev smoothness classes and other forms of reproducing kernel Hilbert spaces. We show through simulation that our stopping rule compares favorably to two other stopping rules, one based on hold-out data and the other based on Stein's unbiased risk estimate. We also establish a tight connection between our early stopping strategy and the solution path of a kernel ridge regression estimator.


Constrained fractional set programs and their application in local clustering and community detection

arXiv.org Machine Learning

The (constrained) minimization of a ratio of set functions is a problem frequently occurring in clustering and community detection. As these optimization problems are typically NP-hard, one uses convex or spectral relaxations in practice. While these relaxations can be solved globally optimally, they are often too loose and thus lead to results far away from the optimum. In this paper we show that every constrained minimization problem of a ratio of non-negative set functions allows a tight relaxation into an unconstrained continuous optimization problem. This result leads to a flexible framework for solving constrained problems in network analysis. While a globally optimal solution for the resulting non-convex problem cannot be guaranteed, we outperform the loose convex or spectral relaxations by a large margin on constrained local clustering problems.


Hyperparameter Optimization and Boosting for Classifying Facial Expressions: How good can a "Null" Model be?

arXiv.org Machine Learning

One of the goals of the ICML workshop on representation and learning is to establish benchmark scores for a new data set of labeled facial expressions. This paper presents the performance of a "Null model" consisting of convolutions with random weights, PCA, pooling, normalization, and a linear readout. Our approach focused on hyperparameter optimization rather than novel model components. On the Facial Expression Recognition Challenge held by the Kaggle website, our hyperparameter optimization approach achieved a score of 60% accuracy on the test data. This paper also introduces a new ensemble construction variant that combines hyperparameter optimization with the construction of ensembles. This algorithm constructed an ensemble of four models that scored 65.5% accuracy. These scores rank 12th and 5th respectively among the 56 challenge participants. It is worth noting that our approach was developed prior to the release of the data set, and applied without modification; our strong competition performance suggests that the TPE hyperparameter optimization algorithm and domain expertise encoded in our Null model can generalize to new image classification data sets.


Classifying Single-Trial EEG during Motor Imagery with a Small Training Set

arXiv.org Machine Learning

Before the operation of a motor imagery based brain-computer interface (BCI) adopting machine learning techniques, a cumbersome training procedure is unavoidable. The development of a practical BCI posed the challenge of classifying single-trial EEG with a small training set. In this letter, we addressed this problem by employing a series of signal processing and machine learning approaches to alleviate overfitting and obtained test accuracy similar to training accuracy on the datasets from BCI Competition III and our own experiments.


Physeter catodon localization by sparse coding

arXiv.org Machine Learning

This paper presents a spermwhale' localization architecture using jointly a bag-of-features (BoF) approach and machine learning framework. BoF methods are known, especially in computer vision, to produce from a collection of local features a global representation invariant to principal signal transformations. Our idea is to regress supervisely from these local features two rough estimates of the distance and azimuth thanks to some datasets where both acoustic events and ground-truth position are now available. Furthermore, these estimates can feed a particle filter system in order to obtain a precise spermwhale' position even in mono-hydrophone configuration. Anti-collision system and whale watching are considered applications of this work.


Sparse Inverse Covariance Matrix Estimation Using Quadratic Approximation

arXiv.org Machine Learning

The L1-regularized Gaussian maximum likelihood estimator (MLE) has been shown to have strong statistical guarantees in recovering a sparse inverse covariance matrix, or alternatively the underlying graph structure of a Gaussian Markov Random Field, from very limited samples. We propose a novel algorithm for solving the resulting optimization problem which is a regularized log-determinant program. In contrast to recent state-of-the-art methods that largely use first order gradient information, our algorithm is based on Newton's method and employs a quadratic approximation, but with some modifications that leverage the structure of the sparse Gaussian MLE problem. We show that our method is superlinearly convergent, and present experimental results using synthetic and real-world application data that demonstrate the considerable improvements in performance of our method when compared to other state-of-the-art methods.


A Linear Approximation to the chi^2 Kernel with Geometric Convergence

arXiv.org Machine Learning

We propose a new analytical approximation to the $\chi^2$ kernel that converges geometrically. The analytical approximation is derived with elementary methods and adapts to the input distribution for optimal convergence rate. Experiments show the new approximation leads to improved performance in image classification and semantic segmentation tasks using a random Fourier feature approximation of the $\exp-\chi^2$ kernel. Besides, out-of-core principal component analysis (PCA) methods are introduced to reduce the dimensionality of the approximation and achieve better performance at the expense of only an additional constant factor to the time complexity. Moreover, when PCA is performed jointly on the training and unlabeled testing data, further performance improvements can be obtained. Experiments conducted on the PASCAL VOC 2010 segmentation and the ImageNet ILSVRC 2010 datasets show statistically significant improvements over alternative approximation methods.


Non-parametric Power-law Data Clustering

arXiv.org Machine Learning

It has always been a great challenge for clustering algorithms to automatically determine the cluster numbers according to the distribution of datasets. Several approaches have been proposed to address this issue, including the recent promising work which incorporate Bayesian Nonparametrics into the $k$-means clustering procedure. This approach shows simplicity in implementation and solidity in theory, while it also provides a feasible way to inference in large scale datasets. However, several problems remains unsolved in this pioneering work, including the power-law data applicability, mechanism to merge centers to avoid the over-fitting problem, clustering order problem, e.t.c.. To address these issues, the Pitman-Yor Process based k-means (namely \emph{pyp-means}) is proposed in this paper. Taking advantage of the Pitman-Yor Process, \emph{pyp-means} treats clusters differently by dynamically and adaptively changing the threshold to guarantee the generation of power-law clustering results. Also, one center agglomeration procedure is integrated into the implementation to be able to merge small but close clusters and then adaptively determine the cluster number. With more discussion on the clustering order, the convergence proof, complexity analysis and extension to spectral clustering, our approach is compared with traditional clustering algorithm and variational inference methods. The advantages and properties of pyp-means are validated by experiments on both synthetic datasets and real world datasets.