Goto

Collaborating Authors

 Genre


Discriminative Probabilistic Prototype Learning

arXiv.org Machine Learning

In this paper we propose a simple yet powerful method for learning representations in supervised learning scenarios where each original input datapoint is described by a set of vectors and their associated outputs may be given by soft labels indicating, for example, class probabilities. We represent an input datapoint as a mixture of probabilities over the corresponding set of feature vectors where each probability indicates how likely each vector is to belong to an unknown prototype pattern. We propose a probabilistic model that parameterizes these prototype patterns in terms of hidden variables and therefore it can be trained with conventional approaches based on likelihood maximization. More importantly, both the model parameters and the prototype patterns can be learned from data in a discriminative way. We show that our model can be seen as a probabilistic generalization of learning vector quantization (LVQ). We apply our method to the problems of shape classification, hyperspectral imaging classification and people's work class categorization, showing the superior performance of our method compared to the standard prototype-based classification approach and other competitive benchmark methods.


Copula-based Kernel Dependency Measures

arXiv.org Machine Learning

The paper presents a new copula based method for measuring dependence between random variables. Our approach extends the Maximum Mean Discrepancy to the copula of the joint distribution. We prove that this approach has several advantageous properties. Similarly to Shannon mutual information, the proposed dependence measure is invariant to any strictly increasing transformation of the marginal variables. This is important in many applications, for example in feature selection. The estimator is consistent, robust to outliers, and uses rank statistics only. We derive upper bounds on the convergence rate and propose independence tests too. We illustrate the theoretical contributions through a series of experiments in feature selection and low-dimensional embedding of distributions.


LPQP for MAP: Putting LP Solvers to Better Use

arXiv.org Machine Learning

MAP inference for general energy functions remains a challenging problem. While most efforts are channeled towards improving the linear programming (LP) based relaxation, this work is motivated by the quadratic programming (QP) relaxation. We propose a novel MAP relaxation that penalizes the Kullback-Leibler divergence between the LP pairwise auxiliary variables, and QP equivalent terms given by the product of the unaries. We develop two efficient algorithms based on variants of this relaxation. The algorithms minimize the non-convex objective using belief propagation and dual decomposition as building blocks. Experiments on synthetic and real-world data show that the solutions returned by our algorithms substantially improve over the LP relaxation.


Factorized Asymptotic Bayesian Hidden Markov Models

arXiv.org Machine Learning

This paper addresses the issue of model selection for hidden Markov models (HMMs). We generalize factorized asymptotic Bayesian inference (FAB), which has been recently developed for model selection on independent hidden variables (i.e., mixture models), for time-dependent hidden variables. As with FAB in mixture models, FAB for HMMs is derived as an iterative lower bound maximization algorithm of a factorized information criterion (FIC). It inherits, from FAB for mixture models, several desirable properties for learning HMMs, such as asymptotic consistency of FIC with marginal log-likelihood, a shrinkage effect for hidden state selection, monotonic increase of the lower FIC bound through the iterative optimization. Further, it does not have a tunable hyper-parameter, and thus its model selection process can be fully automated. Experimental results shows that FAB outperforms states-of-the-art variational Bayesian HMM and non-parametric Bayesian HMM in terms of model selection accuracy and computational efficiency.


Linear Regression with Limited Observation

arXiv.org Machine Learning

We consider the most common variants of linear regression, including Ridge, Lasso and Support-vector regression, in a setting where the learner is allowed to observe only a fixed number of attributes of each example at training time. We present simple and efficient algorithms for these problems: for Lasso and Ridge regression they need the same total number of attributes (up to constants) as do full-information algorithms, for reaching a certain accuracy. For Support-vector regression, we require exponentially less attributes compared to the state of the art. By that, we resolve an open problem recently posed by Cesa-Bianchi et al. (2010). Experiments show the theoretical bounds to be justified by superior performance compared to the state of the art.


Semi-Supervised Learning of Class Balance under Class-Prior Change by Distribution Matching

arXiv.org Machine Learning

In real-world classification problems, the class balance in the training dataset does not necessarily reflect that of the test dataset, which can cause significant estimation bias. If the class ratio of the test dataset is known, instance re-weighting or resampling allows systematical bias correction. However, learning the class ratio of the test dataset is challenging when no labeled data is available from the test domain. In this paper, we propose to estimate the class ratio in the test dataset by matching probability distributions of training and test input data. We demonstrate the utility of the proposed approach through experiments.


Comparison-Based Learning with Rank Nets

arXiv.org Machine Learning

We consider the problem of search through comparisons, where a user is presented with two candidate objects and reveals which is closer to her intended target. We study adaptive strategies for finding the target, that require knowledge of rank relationships but not actual distances between objects. We propose a new strategy based on rank nets, and show that for target distributions with a bounded doubling constant, it finds the target in a number of comparisons close to the entropy of the target distribution and, hence, of the optimum. We extend these results to the case of noisy oracles, and compare this strategy to prior art over multiple datasets.


Group Sparse Additive Models

arXiv.org Machine Learning

We consider the problem of sparse variable selection in nonparametric additive models, with the prior knowledge of the structure among the covariates to encourage those variables within a group to be selected jointly. Previous works either study the group sparsity in the parametric setting (e.g., group lasso), or address the problem in the non-parametric setting without exploiting the structural information (e.g., sparse additive models). In this paper, we present a new method, called group sparse additive models (GroupSpAM), which can handle group sparsity in additive models. We generalize the l1/l2 norm to Hilbert spaces as the sparsity-inducing penalty in GroupSpAM. Moreover, we derive a novel thresholding condition for identifying the functional sparsity at the group level, and propose an efficient block coordinate descent algorithm for constructing the estimate. We demonstrate by simulation that GroupSpAM substantially outperforms the competing methods in terms of support recovery and prediction accuracy in additive models, and also conduct a comparative experiment on a real breast cancer dataset.


Efficient Active Algorithms for Hierarchical Clustering

arXiv.org Machine Learning

Advances in sensing technologies and the growth of the internet have resulted in an explosion in the size of modern datasets, while storage and processing power continue to lag behind. This motivates the need for algorithms that are efficient, both in terms of the number of measurements needed and running time. To combat the challenges associated with large datasets, we propose a general framework for active hierarchical clustering that repeatedly runs an off-the-shelf clustering algorithm on small subsets of the data and comes with guarantees on performance, measurement complexity and runtime complexity. We instantiate this framework with a simple spectral clustering algorithm and provide concrete results on its performance, showing that, under some assumptions, this algorithm recovers all clusters of size ?(log n) using O(n log^2 n) similarities and runs in O(n log^3 n) time for a dataset of n objects. Through extensive experimentation we also demonstrate that this framework is practically alluring.


Approximate Principal Direction Trees

arXiv.org Machine Learning

We introduce a new spatial data structure for high dimensional data called the \emph{approximate principal direction tree} (APD tree) that adapts to the intrinsic dimension of the data. Our algorithm ensures vector-quantization accuracy similar to that of computationally-expensive PCA trees with similar time-complexity to that of lower-accuracy RP trees. APD trees use a small number of power-method iterations to find splitting planes for recursively partitioning the data. As such they provide a natural trade-off between the running-time and accuracy achieved by RP and PCA trees. Our theoretical results establish a) strong performance guarantees regardless of the convergence rate of the power-method and b) that $O(\log d)$ iterations suffice to establish the guarantee of PCA trees when the intrinsic dimension is $d$. We demonstrate this trade-off and the efficacy of our data structure on both the CPU and GPU.