Goto

Collaborating Authors

 Statistical Learning


Phase transitions for high-dimensional joint support recovery

Neural Information Processing Systems

We consider the following instance of transfer learning: given a pair of regression problems, suppose that the regression coefficients share a partially common support, parameterized by the overlap fraction $\overlap$ between the two supports. This set-up suggests the use of $1, \infty$-regularized linear regression for recovering the support sets of both regression vectors. Our main contribution is to provide a sharp characterization of the sample complexity of this $1,\infty$ relaxation, exactly pinning down the minimal sample size $n$ required for joint support recovery as a function of the model dimension $\pdim$, support size $\spindex$ and overlap $\overlap \in [0,1]$. For measurement matrices drawn from standard Gaussian ensembles, we prove that the joint $1,\infty$-regularized method undergoes a phase transition characterized by order parameter $\orpar(\numobs, \pdim, \spindex, \overlap) = \numobs{(4 - 3 \overlap) s \log(p-(2-\overlap)s)}$. More precisely, the probability of successfully recovering both supports converges to $1$ for scalings such that $\orpar > 1$, and converges to $0$ to scalings for which $\orpar < 1$. An implication of this threshold is that use of $1, \infty$-regularization leads to gains in sample complexity if the overlap parameter is large enough ($\overlap > 2/3$), but performs worse than a naive approach if $\overlap < 2/3$. We illustrate the close agreement between these theoretical predictions, and the actual behavior in simulations. Thus, our results illustrate both the benefits and dangers associated with block-$1,\infty$ regularization in high-dimensional inference.


Fast Learning from Non-i.i.d. Observations

Neural Information Processing Systems

We prove an oracle inequality for generic regularized empirical risk minimization algorithms learning from $\a$-mixing processes. To illustrate this oracle inequality, we use it to derive learning rates for some learning methods including least squares SVMs. Since the proof of the oracle inequality uses recent localization ideas developed for independent and identically distributed (i.i.d.) processes, it turns out that these learning rates are close to the optimal rates known in the i.i.d. case.


AUC optimization and the two-sample problem

Neural Information Processing Systems

The purpose of the paper is to explore the connection between multivariate homogeneity testsand AUC optimization. The latter problem has recently received much attention in the statistical learning literature. From the elementary observation that,in the two-sample problem setup, the null assumption corresponds to the situation where the area under the optimal ROC curve is equal to 1/2, we propose atwo-stage testing method based on data splitting. A nearly optimal scoring function in the AUC sense is first learnt from one of the two half-samples. Data from the remaining half-sample are then projected onto the real line and eventually rankedaccording to the scoring function computed at the first stage. The last step amounts to performing a standard Mann-Whitney Wilcoxon test in the onedimensional framework.We show that the learning step of the procedure does not affect the consistency of the test as well as its properties in terms of power, provided the ranking produced is accurate enough in the AUC sense. The results of a numerical experiment are eventually displayed in order to show the efficiency of the method.


Posterior Consistency of the Silverman g-prior in Bayesian Model Choice

Neural Information Processing Systems

Kernel supervised learning methods can be unified by utilizing the tools from regularization theory. The duality between regularization and prior leads to interpreting regularization methods in terms of maximum a posteriori estimation and has motivated Bayesian interpretations of kernel methods. In this paper we pursue a Bayesian interpretation of sparsity in the kernel setting by making use of a mixture of a point-mass distribution and prior that we refer to as ``Silverman's g-prior.'' We provide a theoretical analysis of the posterior consistency of a Bayesian model choice procedure based on this prior. We also establish the asymptotic relationship between this procedure and the Bayesian information criterion.


Hierarchical Fisher Kernels for Longitudinal Data

Neural Information Processing Systems

We develop new techniques for time series classification based on hierarchical Bayesian generative models (called mixed-effect models) and the Fisher kernel derived from them. A key advantage of the new formulation is that one can compute the Fisher information matrix despite varying sequence lengths and varying sampling intervals. This avoids the commonly-used ad hoc replacement of the Fisher information matrix with the identity which destroys the geometric invariance of the kernel. Our construction retains the geometric invariance, resulting in a kernel that is properly invariant under change of coordinates in the model parameter space. Experiments on detecting cognitive decline show that classifiers based on the proposed kernel outperform those based on generative models and other feature extraction routines, and on Fisher kernels that use the identity in place of the Fisher information.


Cyclizing Clusters via Zeta Function of a Graph

Neural Information Processing Systems

Detecting underlying clusters from large-scale data plays a central role in machine learning research. In this paper, we attempt to tackle clustering problems for complex data of multiple distributions and large multi-scales. To this end, we develop an algorithm named Zeta $l$-links, or Zell which consists of two parts: Zeta merging with a similarity graph and an initial set of small clusters derived from local $l$-links of the graph. More specifically, we propose to structurize a cluster using cycles in the associated subgraph. A mathematical tool, Zeta function of a graph, is introduced for the integration of all cycles, leading to a structural descriptor of the cluster in determinantal form. The popularity character of the cluster is conceptualized as the global fusion of variations of the structural descriptor by means of the leave-one-out strategy in the cluster. Zeta merging proceeds, in the agglomerative fashion, according to the maximum incremental popularity among all pairwise clusters. Experiments on toy data, real imagery data, and real sensory data show the promising performance of Zell. The $98.1\%$ accuracy, in the sense of the normalized mutual information, is obtained on the FRGC face data of 16028 samples and 466 facial clusters. The MATLAB codes of Zell will be made publicly available for peer evaluation.


Learning the Semantic Correlation: An Alternative Way to Gain from Unlabeled Text

Neural Information Processing Systems

In this paper, we address the question of what kind of knowledge is generally transferable from unlabeled text. We suggest and analyze the semantic correlation of words as a generally transferable structure of the language and propose a new method to learn this structure using an appropriately chosen latent variable model. This semantic correlation contains structural information of the language space and can be used to control the joint shrinkage of model parameters for any specific task in the same space through regularization. In an empirical study, we construct 190 different text classification tasks from a real-world benchmark, and the unlabeled documents are a mixture from all these tasks. We test the ability of various algorithms to use the mixed unlabeled text to enhance all classification tasks. Empirical results show that the proposed approach is a reliable and scalable method for semi-supervised learning, regardless of the source of unlabeled data, the specific task to be enhanced, and the prediction model used.


Kernel Measures of Independence for non-iid Data

Neural Information Processing Systems

Many machine learning algorithms can be formulated in the framework of statistical independence such as the Hilbert Schmidt Independence Criterion. In this paper, we extend this criterion to deal with with structured and interdependent observations. This is achieved by modeling the structures using undirected graphical models and comparing the Hilbert space embeddings of distributions. We apply this new criterion to independent component analysis and sequence clustering.


Multi-stage Convex Relaxation for Learning with Sparse Regularization

Neural Information Processing Systems

We study learning formulations with non-convex regularizaton that are natural for sparse linear models. There are two approaches to this problem: (1) Heuristic methods such as gradient descent that only find a local minimum. A drawback of this approach is the lack of theoretical guarantee showing that the local minimum gives a good solution. (2) Convex relaxation such as $L_1$-regularization that solves the problem under some conditions. However it often leads to sub-optimal sparsity in reality. This paper tries to remedy the above gap between theory and practice. In particular, we investigate a multi-stage convex relaxation scheme for solving problems with non-convex regularization. Theoretically, we analyze the behavior of a resulting two-stage relaxation scheme for the capped-$L_1$ regularization. Our performance bound shows that the procedure is superior to the standard $L_1$ convex relaxation for learning sparse targets. Experiments confirm the effectiveness of this method on some simulation and real data.


Adaptive Forward-Backward Greedy Algorithm for Sparse Learning with Linear Models

Neural Information Processing Systems

Consider linear prediction models where the target function is a sparse linear combination of a set of basis functions. We are interested in the problem of identifying those basis functions with non-zero coefficients and reconstructing the target function from noisy observations. Two heuristics that are widely used in practice are forward and backward greedy algorithms. First, we show that neither idea is adequate. Second, we propose a novel combination that is based on the forward greedy algorithm but takes backward steps adaptively whenever beneficial. We prove strong theoretical results showing that this procedure is effective in learning sparse representations. Experimental results support our theory.