 Statistical Learning


Min-Max Kernels

arXiv.org Machine Learning

The min-max kernel is a generalization of the popular resemblance kernel (which is designed for binary data). In this paper, we demonstrate, through an extensive classification study using kernel machines, that the min-max kernel often provides an effective measure of similarity for nonnegative data. As the min-max kernel is nonlinear and may be difficult to use in industrial applications with massive data, we show that the min-max kernel can be linearized via hashing techniques. This allows practitioners to apply the min-max kernel to large-scale applications using mature linear algorithms such as linear SVM or logistic regression. The previous remarkable work on consistent weighted sampling (CWS) produces samples in the form of $(i^*, t^*)$, where $i^*$ records the location (and in fact also the weights) information analogous to the samples produced by classical minwise hashing on binary data. Because $t^*$ is theoretically unbounded, it was not immediately clear how to effectively implement CWS for building large-scale linear classifiers. In this paper, we provide a simple solution by discarding $t^*$ (which we refer to as the "0-bit" scheme). Via an extensive empirical study, we show that this 0-bit scheme does not lose essential information. We then apply the "0-bit" CWS for building linear classifiers to approximate min-max kernel classifiers, as extensively validated on a wide range of publicly available classification datasets. We expect this work will generate interest among data mining practitioners who would like to efficiently utilize the nonlinear information of non-binary and nonnegative data.
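As a rough illustration of the ideas in this abstract, the sketch below computes the min-max kernel, taken here to be $\sum_i \min(u_i, v_i) / \sum_i \max(u_i, v_i)$ for nonnegative vectors, and a "0-bit" hash that keeps only $i^*$. The sampling recipe follows the commonly cited form of consistent weighted sampling (Gamma(2,1) and Uniform(0,1) draws); that recipe, the function names, and the shared-seed convention are assumptions of this example and are not taken from the abstract.

```python
import numpy as np

def minmax_kernel(u, v):
    """Min-max kernel for nonnegative vectors (assumes not both all-zero)."""
    u, v = np.asarray(u, float), np.asarray(v, float)
    return np.minimum(u, v).sum() / np.maximum(u, v).sum()

def zero_bit_cws(u, k, seed=0):
    """Return k hashed indices i* for vector u, discarding t* (0-bit scheme).

    Assumption: the random draws follow the usual CWS recipe. The same seed
    must be used for every vector so that all vectors share the same draws.
    """
    u = np.asarray(u, float)
    d = u.size
    rng = np.random.default_rng(seed)
    r = rng.gamma(2.0, 1.0, size=(k, d))
    c = rng.gamma(2.0, 1.0, size=(k, d))
    beta = rng.uniform(0.0, 1.0, size=(k, d))

    nz = np.flatnonzero(u > 0)              # only positive entries can be sampled
    t = np.floor(np.log(u[nz]) / r[:, nz] + beta[:, nz])
    # log a_i = log c_i - r_i * (t_i - beta_i + 1); i* minimizes a_i
    log_a = np.log(c[:, nz]) - r[:, nz] * (t - beta[:, nz] + 1.0)
    return nz[np.argmin(log_a, axis=1)]

u = np.array([1.0, 2.0, 0.0])
v = np.array([2.0, 1.0, 1.0])
exact = minmax_kernel(u, v)                 # (1 + 1 + 0) / (2 + 2 + 1) = 0.4
estimate = np.mean(zero_bit_cws(u, 2000) == zero_bit_cws(v, 2000))
```

The collision rate of the hashed indices is only an approximation of the kernel, since dropping $t^*$ can add collisions. To feed a linear classifier, each hashed index would typically be one-hot encoded and the $k$ encodings concatenated.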


A General Hybrid Clustering Technique

arXiv.org Machine Learning

Here, we propose a clustering technique for general clustering problems including those that have non-convex clusters. For a given desired number of clusters $K$, we use three stages to find a clustering. The first stage uses a hybrid clustering technique to produce a series of clusterings of various sizes (randomly selected). The key steps are to find a $K$-means clustering using $K_\ell$ clusters, where $K_\ell \gg K$, and then join these small clusters by using single linkage clustering. The second stage stabilizes the result of stage one by reclustering via the "membership matrix" under Hamming distance to generate a dendrogram. The third stage is to cut the dendrogram to get $K^*$ clusters, where $K^* \geq K$, and then prune back to $K$ to give a final clustering. A variant on our technique also gives a reasonable estimate for $K_T$, the true number of clusters. We provide a series of arguments to justify the steps in the stages of our methods and we provide numerous examples involving real and simulated data to compare our technique with other related techniques.
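The first stage described above (over-cluster with $K$-means, then merge with single linkage) can be sketched as follows. This is a simplified illustration, not the authors' procedure: it applies single linkage to the $K_\ell$ centers rather than to the small clusters themselves, uses a plain Lloyd iteration, and omits the stabilization and pruning stages. The function names and parameters are hypothetical.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

def lloyd_kmeans(X, k, iters=25, seed=0):
    """Plain Lloyd iteration; returns (centers, labels)."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)].copy()
    for _ in range(iters):
        d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = d2.argmin(axis=1)
        for j in range(k):
            pts = X[labels == j]
            if len(pts):                    # keep the old center if a cluster empties
                centers[j] = pts.mean(axis=0)
    d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    return centers, d2.argmin(axis=1)

def hybrid_cluster(X, K, K_l, seed=0):
    """Over-cluster into K_l groups, then join them into K by single linkage."""
    X = np.asarray(X, float)
    centers, small = lloyd_kmeans(X, K_l, seed=seed)
    Z = linkage(centers, method="single")   # dendrogram over the K_l centers
    merged = fcluster(Z, t=K, criterion="maxclust")  # labels in 1..K per center
    return merged[small] - 1                # map every point through its center

rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0.0, 0.5, size=(50, 2)),
               rng.normal(10.0, 0.5, size=(50, 2))])
labels = hybrid_cluster(X, K=2, K_l=10)
```

The two well-separated blobs here are only a smoke test. The motivation in the abstract is non-convex clusters, where many small convex pieces chained together by single linkage can follow a curved shape that a single $K$-means run cannot.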


Efficient Estimation of Mutual Information for Strongly Dependent Variables

arXiv.org Machine Learning

We demonstrate that a popular class of nonparametric mutual information (MI) estimators based on k-nearest-neighbor graphs requires a number of samples that scales exponentially with the true MI. Consequently, accurate estimation of MI between two strongly dependent variables is possible only with a prohibitively large sample size. This important yet overlooked shortcoming of the existing estimators is due to their implicit reliance on local uniformity of the underlying joint distribution. We introduce a new estimator that is robust to local non-uniformity, works well with limited data, and is able to capture relationship strengths over many orders of magnitude. We demonstrate the superior performance of the proposed estimator on both synthetic and real-world data.


Sparsistency and agnostic inference in sparse PCA

arXiv.org Machine Learning

The presence of a sparse "truth" has been a constant assumption in the theoretical analysis of sparse PCA and is often implicit in its methodological development. This naturally raises questions about the properties of sparse PCA methods and how they depend on the assumption of sparsity. Under what conditions can the relevant variables be selected consistently if the truth is assumed to be sparse? What can be said about the results of sparse PCA without assuming a sparse and unique truth? We answer these questions by investigating the properties of the recently proposed Fantope projection and selection (FPS) method in the high-dimensional setting. Our results provide general sufficient conditions for sparsistency of the FPS estimator. These conditions are weak and can hold in situations where other estimators are known to fail. On the other hand, without assuming sparsity or identifiability, we show that FPS provides a sparse, linear dimension-reducing transformation that is close to the best possible in terms of maximizing the predictive covariance.


Local Expectation Gradients for Doubly Stochastic Variational Inference

arXiv.org Machine Learning

We introduce local expectation gradients, a general-purpose stochastic variational inference algorithm for constructing stochastic gradients through sampling from the variational distribution. This algorithm divides the problem of estimating the stochastic gradients over multiple variational parameters into smaller sub-tasks so that each sub-task intelligently exploits the information coming from the most relevant part of the variational distribution. This is achieved by performing an exact expectation over the single random variable that correlates most with the variational parameter of interest, resulting in a Rao-Blackwellized estimate that has low variance and can work efficiently for both continuous and discrete random variables. Furthermore, the proposed algorithm has interesting similarities with Gibbs sampling but, unlike Gibbs sampling, it can be trivially parallelized.


Bethe Learning of Conditional Random Fields via MAP Decoding

arXiv.org Machine Learning

Many machine learning tasks can be formulated in terms of predicting structured outputs. In frameworks such as the structured support vector machine (SVM-Struct) and the structured perceptron, discriminative functions are learned by iteratively applying efficient maximum a posteriori (MAP) decoding. However, maximum likelihood estimation (MLE) of probabilistic models over these same structured spaces requires computing partition functions, which is generally intractable. This paper presents a method for learning discrete exponential family models using the Bethe approximation to the MLE. Remarkably, this problem also reduces to iterative (MAP) decoding. This connection emerges by combining the Bethe approximation with a Frank-Wolfe (FW) algorithm on a convex dual objective which circumvents the intractable partition function. The result is a new single-loop algorithm MLE-Struct, which is substantially more efficient than previous double-loop methods for approximate maximum likelihood estimation. Our algorithm outperforms existing methods in experiments involving image segmentation, matching problems from vision, and a new dataset of university roommate assignments.


Heteroscedastic Treed Bayesian Optimisation

arXiv.org Machine Learning

Optimising black-box functions is important in many disciplines, such as tuning machine learning models, robotics, finance and mining exploration. Bayesian optimisation is a state-of-the-art technique for the global optimisation of black-box functions which are expensive to evaluate. At the core of this approach is a Gaussian process prior that captures our belief about the distribution over functions. However, in many cases a single Gaussian process is not flexible enough to capture non-stationarity in the objective function. Consequently, heteroscedasticity negatively affects performance of traditional Bayesian methods. In this paper, we propose a novel prior model with hierarchical parameter learning that tackles the problem of non-stationarity in Bayesian optimisation. Our results demonstrate substantial improvements in a wide range of applications, including automatic machine learning and mining exploration.


Guaranteed Non-Orthogonal Tensor Decomposition via Alternating Rank-$1$ Updates

arXiv.org Machine Learning

In this paper, we provide local and global convergence guarantees for recovering the CP (Candecomp/Parafac) tensor decomposition. The main step of the proposed algorithm is a simple alternating rank-$1$ update, which is the alternating version of the tensor power iteration adapted for asymmetric tensors. Local convergence guarantees are established for third order tensors of rank $k$ in $d$ dimensions, when $k=o \bigl( d^{1.5} \bigr)$ and the tensor components are incoherent. Thus, we can recover overcomplete tensor decompositions. We also strengthen the results to global convergence guarantees under a stricter rank condition $k \le \beta d$ (for arbitrary constant $\beta > 1$) through a simple initialization procedure where the algorithm is initialized by top singular vectors of random tensor slices. Furthermore, approximate local convergence guarantees for $p$-th order tensors are also provided under the rank condition $k=o \bigl( d^{p/2} \bigr)$. The guarantees also include a tight perturbation analysis given a noisy tensor.
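A minimal sketch of an alternating rank-$1$ update for an asymmetric third-order tensor is shown below: each factor is updated in turn by contracting the tensor with the other two factors and normalizing. The random initialization and the function name are assumptions of this example; the abstract instead describes initializing with top singular vectors of random tensor slices, which is not implemented here, and recovering several components would need additional steps.

```python
import numpy as np

def alternating_rank1(T, a, b, c, iters=50):
    """Alternating power-style updates for one rank-1 component of T.

    Each step contracts T with two of the current factors and normalizes
    the result; the weight is read off at the end as T(a, b, c).
    """
    for _ in range(iters):
        a = np.einsum("ijk,j,k->i", T, b, c)
        a /= np.linalg.norm(a)
        b = np.einsum("ijk,i,k->j", T, a, c)
        b /= np.linalg.norm(b)
        c = np.einsum("ijk,i,j->k", T, a, b)
        c /= np.linalg.norm(c)
    weight = np.einsum("ijk,i,j,k->", T, a, b, c)
    return weight, a, b, c

def unit(x):
    return x / np.linalg.norm(x)

rng = np.random.default_rng(0)
a0, b0, c0 = (unit(rng.normal(size=n)) for n in (4, 5, 6))
T = 3.0 * np.einsum("i,j,k->ijk", a0, b0, c0)   # exact rank-1 test tensor

init = [unit(rng.normal(size=n)) for n in (4, 5, 6)]
weight, a, b, c = alternating_rank1(T, *init)
```

On an exact rank-1 tensor the factors are recovered up to sign after a single sweep, and the sign ambiguity cancels in the weight. The noisy and overcomplete regimes analyzed in the paper are where initialization and the incoherence conditions matter.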


Group-Sparse Model Selection: Hardness and Relaxations

arXiv.org Machine Learning

Group-based sparsity models have proven instrumental in linear regression problems for recovering signals from far fewer measurements than standard compressive sensing. The main promise of these models is the recovery of "interpretable" signals through the identification of their constituent groups. In this paper, we establish a combinatorial framework for group-model selection problems and highlight the underlying tractability issues. In particular, we show that the group-model selection problem is equivalent to the well-known NP-hard weighted maximum coverage problem (WMC). Leveraging a graph-based understanding of group models, we describe group structures which enable correct model selection in polynomial time via dynamic programming. Furthermore, group structures that lead to totally unimodular constraints have tractable discrete as well as convex relaxations. We also present a generalization of the group-model that allows for within-group sparsity, which can be used to model hierarchical sparsity. Finally, we study the Pareto frontier of group-sparse approximations for two tractable models, including the tree sparsity model, and illustrate selection and computation trade-offs between our framework and the existing convex relaxations.


Kernel Interpolation for Scalable Structured Gaussian Processes (KISS-GP)

arXiv.org Machine Learning

We introduce a new structured kernel interpolation (SKI) framework, which generalises and unifies inducing point methods for scalable Gaussian processes (GPs). SKI methods produce kernel approximations for fast computations through kernel interpolation. The SKI framework clarifies how the quality of an inducing point approach depends on the number of inducing (aka interpolation) points, interpolation strategy, and GP covariance kernel. SKI also provides a mechanism to create new scalable kernel methods, through choosing different kernel interpolation strategies. Using SKI, with local cubic kernel interpolation, we introduce KISS-GP, which is 1) more scalable than inducing point alternatives, 2) naturally enables Kronecker and Toeplitz algebra for substantial additional gains in scalability, without requiring any grid data, and 3) can be used for fast and expressive kernel learning. KISS-GP costs O(n) time and storage for GP inference. We evaluate KISS-GP for kernel matrix approximation, kernel learning, and natural sound modelling.
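The kernel interpolation idea can be illustrated with the approximation $K_{XX} \approx W K_{UU} W^\top$, where $U$ is a grid of inducing points and $W$ holds interpolation weights. The sketch below is a one-dimensional toy under stated assumptions: it swaps in linear interpolation for the local cubic interpolation named in the abstract, uses an RBF kernel, and stores $W$ densely for readability, so it does not show the scalability gains. All function names are hypothetical.

```python
import numpy as np

def rbf(a, b, ell=1.0):
    """RBF kernel matrix between 1-D point sets a and b."""
    return np.exp(-0.5 * (a[:, None] - b[None, :]) ** 2 / ell**2)

def linear_interp_weights(x, grid):
    """Weights W such that f(x) is approximated by W @ f(grid).

    Assumes a regular, increasing grid that covers every point in x.
    Each row has at most two nonzero entries.
    """
    h = grid[1] - grid[0]
    left = np.clip(np.floor((x - grid[0]) / h).astype(int), 0, len(grid) - 2)
    frac = (x - grid[left]) / h
    W = np.zeros((len(x), len(grid)))
    rows = np.arange(len(x))
    W[rows, left] = 1.0 - frac
    W[rows, left + 1] = frac
    return W

def ski_kernel(x, grid, ell=1.0):
    """Approximate K(x, x) by interpolating from the grid kernel K(U, U)."""
    W = linear_interp_weights(x, grid)
    return W @ rbf(grid, grid, ell) @ W.T

rng = np.random.default_rng(0)
x = rng.uniform(0.0, 5.0, size=40)
grid = np.linspace(0.0, 5.0, 200)
max_err = np.abs(ski_kernel(x, grid) - rbf(x, x)).max()
```

In a practical implementation $W$ would be kept sparse and $K_{UU}$ on a regular grid would be exploited through its Toeplitz (or, in higher dimensions, Kronecker) structure, which is where the fast matrix-vector products come from.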