Feature-aware Label Space Dimension Reduction for Multi-label Classification

Neural Information Processing Systems

Label space dimension reduction (LSDR) is an efficient and effective paradigm for multi-label classification with many classes. Existing approaches to LSDR, such as compressive sensing and principal label space transformation, exploit only the label part of the dataset, but not the feature part. In this paper, we propose a novel approach to LSDR that considers both the label and the feature parts. The approach, called conditional principal label space transformation, is based on minimizing an upper bound of the popular Hamming loss. The minimization step of the approach can be carried out efficiently by a simple use of singular value decomposition. In addition, the approach can be extended to a kernelized version that allows the use of sophisticated feature combinations to assist LSDR. The experimental results verify that the proposed approach is more effective than existing LSDR approaches across many real-world datasets.
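
For concreteness, here is a minimal sketch of the feature-aware encode/train/decode pipeline the abstract describes: project centred labels onto the principal directions of the feature-conditioned label matrix via an eigendecomposition (equivalently, an SVD), regress from features to the reduced codes, and round the decoded predictions. The ridge regularizer and variable names are illustrative assumptions, not the paper's exact formulation.

```python
import numpy as np

def cplst_fit(X, Y, m, lam=1e-3):
    """Feature-aware label space dimension reduction (sketch).

    X: (n, d) features, Y: (n, K) binary labels, m: reduced label dimension.
    """
    y_mean = Y.mean(axis=0)
    Z = Y - y_mean                                   # centred labels
    # Hat matrix of a ridge regression from X onto the label space.
    H = X @ np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T)
    # Principal directions of the label space, conditioned on the features.
    _, V = np.linalg.eigh(Z.T @ H @ Z)
    V = V[:, -m:]                                    # top-m eigenvectors, (K, m)
    T = Z @ V                                        # reduced label codes, (n, m)
    # Ridge regressor from features to the reduced codes.
    W = np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ T)
    return W, V, y_mean

def cplst_predict(X, W, V, y_mean):
    T_hat = X @ W                                    # predict reduced codes
    return (T_hat @ V.T + y_mean > 0.5).astype(int)  # decode and round
```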


Graphical Gaussian Vector for Image Categorization

Neural Information Processing Systems

This paper proposes a novel image representation called the Graphical Gaussian Vector, which is a counterpart of the codebook and local feature matching approaches. In our method, we model the distribution of local features as a Gaussian Markov Random Field (GMRF), which can efficiently represent the spatial relationships among local features. We treat the parameters of the GMRF as the feature vector of the image. Using concepts from information geometry, we obtain proper parameters and a metric for the GMRF. Finally, we define a new image feature by embedding the metric into the parameters, which can be directly fed to scalable linear classifiers. Our method outperforms state-of-the-art methods on standard object recognition datasets and achieves comparable performance on the scene dataset. Because the proposed method simply calculates local auto-correlations of local features, it achieves both high classification accuracy and high efficiency.
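
The closing sentence notes that the computation reduces to local auto-correlations of local features; the fragment below sketches only that ingredient for a grid of descriptors, under assumed array shapes. The GMRF fitting and information-geometric embedding of the full method are omitted.

```python
import numpy as np

def local_autocorrelation_vector(F):
    """Rough sketch: F is an (H, W, d) grid of local descriptors.

    Returns a vector built from the descriptor mean and the
    auto-correlations of each descriptor with its right and lower
    neighbours (the spatial relationships a GMRF would model).
    """
    H, W, d = F.shape
    mean = F.reshape(-1, d).mean(axis=0)
    right = np.einsum('ijk,ijl->kl', F[:, :-1], F[:, 1:]) / (H * (W - 1))
    down = np.einsum('ijk,ijl->kl', F[:-1, :], F[1:, :]) / ((H - 1) * W)
    return np.concatenate([mean, right.ravel(), down.ravel()])
```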


Fiedler Random Fields: A Large-Scale Spectral Approach to Statistical Network Modeling

Neural Information Processing Systems

Statistical models for networks have typically committed to strong prior assumptions concerning the form of the modeled distributions. Moreover, the vast majority of currently available models are explicitly designed to capture specific graph properties (such as power-law degree distributions), which makes them unsuitable for domains where the behavior of the target quantities is not known a priori. The key contribution of this paper is twofold. First, we introduce the Fiedler delta statistic, based on the Laplacian spectrum of graphs, which allows us to dispense with any parametric assumption concerning the modeled network properties. Second, we use the defined statistic to develop the Fiedler random field model, which allows for efficient estimation of edge distributions over large-scale random networks. After analyzing the dependence structure involved in Fiedler random fields, we estimate them over several real-world networks, showing that they achieve much higher modeling accuracy than other well-known statistical approaches.
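
One plausible reading of the Fiedler delta statistic, for readers who want something concrete: the change in the Fiedler value (the second-smallest eigenvalue of the graph Laplacian) when a single edge is toggled. The dense-matrix sketch below is illustrative only; the paper targets large-scale networks, where one would not form dense Laplacians.

```python
import numpy as np

def fiedler_value(A):
    """Second-smallest eigenvalue of the graph Laplacian of adjacency A."""
    L = np.diag(A.sum(axis=1)) - A
    return np.sort(np.linalg.eigvalsh(L))[1]

def fiedler_delta(A, u, v):
    """Change in the Fiedler value when edge (u, v) is toggled on vs. off."""
    A_with, A_without = A.copy(), A.copy()
    A_with[u, v] = A_with[v, u] = 1
    A_without[u, v] = A_without[v, u] = 0
    return fiedler_value(A_with) - fiedler_value(A_without)
```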


Repulsive Mixtures

Neural Information Processing Systems

Discrete mixtures are used routinely in a broad range of applications, from unsupervised settings to fully supervised multi-task learning. Indeed, finite mixtures and infinite mixtures, relying on Dirichlet processes and their modifications, have become a standard tool. One important issue that arises in using discrete mixtures is low separation between the components; in particular, components can be introduced that are very similar and hence redundant. Such redundancy leads to too many clusters that are too similar, degrading performance in unsupervised learning and leading to computational problems and an unnecessarily complex model in supervised settings. Redundancy can arise in the absence of a penalty on components placed close together, even when a Bayesian approach is used to learn the number of components. To solve this problem, we propose a novel prior that generates components from a repulsive process, automatically penalizing redundant components. We characterize this repulsive prior theoretically and propose a Markov chain Monte Carlo sampling algorithm for posterior computation. The methods are illustrated using synthetic examples and an iris data set.
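
A minimal sketch of how a pairwise repulsive prior term could be evaluated inside a sampler; the specific repulsion function g(d) = 1 - exp(-tau * d^2) is an illustrative choice and not necessarily the prior proposed in the paper.

```python
import numpy as np
from itertools import combinations

def log_repulsive_prior(mu, tau=1.0):
    """Log of a pairwise repulsive term over component locations mu (k, d).

    Uses g(d) = 1 - exp(-tau * d**2) as an illustrative repulsion function:
    the prior vanishes as any two components coincide, penalising redundancy.
    """
    logp = 0.0
    for i, j in combinations(range(len(mu)), 2):
        d2 = np.sum((mu[i] - mu[j]) ** 2)
        logp += np.log1p(-np.exp(-tau * d2))  # log(1 - exp(-tau * d^2))
    return logp
```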


Active Comparison of Prediction Models

Neural Information Processing Systems

We address the problem of comparing the risks of two given predictive models - for instance, a baseline model and a challenger - as confidently as possible on a fixed labeling budget. This problem occurs whenever models cannot be compared on held-out training data, possibly because the training data are unavailable or do not reflect the desired test distribution. In this case, new test instances have to be drawn and labeled at a cost. We devise an active comparison method that selects instances according to an instrumental sampling distribution. We derive the sampling distribution that maximizes the power of a statistical test applied to the observed empirical risks, and thereby minimizes the likelihood of choosing the inferior model. Empirically, we investigate model selection problems on several classification and regression tasks and study the accuracy of the resulting p-values.
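
To make the setup concrete, the sketch below draws labeled test points from an instrumental distribution, importance-weights the per-instance loss differences, and applies a paired test to the result. The proposal used here (proportional to model disagreement) is a stand-in heuristic; the paper derives the power-maximizing distribution instead.

```python
import numpy as np
from scipy import stats

def compare_models_actively(pool_X, label_fn, model_a, model_b, budget, rng):
    """Sketch: sample test points from an instrumental distribution q,
    then test whether the importance-weighted risks of two models differ.

    q is a simple heuristic (models' disagreement plus a floor); the paper
    derives the power-maximising q instead.
    """
    disagree = np.abs(model_a(pool_X) - model_b(pool_X))
    q = disagree + 1e-3
    q /= q.sum()
    idx = rng.choice(len(pool_X), size=budget, replace=True, p=q)
    x, w = pool_X[idx], (1.0 / len(pool_X)) / q[idx]      # importance weights
    y = np.array([label_fn(xi) for xi in x])               # pay the labeling cost
    # Weighted per-instance squared-loss differences between the two models.
    delta = w * ((model_a(x) - y) ** 2 - (model_b(x) - y) ** 2)
    t, p = stats.ttest_1samp(delta, popmean=0.0)
    return t, p
```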


The Lovász ϑ function, SVMs and finding large dense subgraphs

Neural Information Processing Systems

The Lovász $\theta$ function of a graph is a fundamental tool in combinatorial optimization and approximation algorithms. Computing $\theta$ involves solving an SDP and is extremely expensive even for moderately sized graphs. In this paper we establish that the Lovász $\theta$ function is equivalent to a kernel learning problem related to the one-class SVM. This interesting connection opens up many opportunities for bridging graph-theoretic algorithms and machine learning. We show that there exist graphs, which we call $SVM-\theta$ graphs, on which the Lovász $\theta$ function can be approximated well by a one-class SVM. This leads to a novel use of SVM techniques to solve algorithmic problems in large graphs, e.g., identifying a planted clique of size $\Theta(\sqrt{n})$ in a random graph $G(n,\frac{1}{2})$. A classic approach to this problem involves computing the $\theta$ function; however, it is not scalable due to the SDP computation. We show that a random graph with a planted clique is an example of an $SVM-\theta$ graph, and as a consequence an SVM-based approach easily identifies the clique in large graphs and is competitive with the state of the art. Further, we introduce the notion of a ''common orthogonal labelling,'' which extends the notion of an ''orthogonal labelling'' of a single graph (used in defining the $\theta$ function) to multiple graphs. The problem of finding the optimal common orthogonal labelling is cast as a Multiple Kernel Learning problem and is used to identify a large common dense region in multiple graphs. The proposed algorithm achieves an order-of-magnitude improvement in scalability over the state of the art.
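
A rough sketch of the one-class-SVM view, assuming the kernel K = A/rho + I with rho at least the negated smallest eigenvalue of the adjacency matrix; the quadratic program max_{alpha >= 0} 2*sum(alpha) - alpha' K alpha is solved by projected gradient ascent. The exact kernel construction and solver here are illustrative assumptions.

```python
import numpy as np

def svm_theta(A, iters=2000, lr=None):
    """Sketch of the SVM-theta approximation: a one-class-SVM-style problem
    on the kernel K = A/rho + I, solved by projected gradient ascent.

    Returns an estimate of omega(K) = max_{alpha >= 0} 2*sum(alpha) - alpha' K alpha,
    which upper-bounds the Lovasz theta function and approximates it well
    on SVM-theta graphs.
    """
    n = A.shape[0]
    rho = max(-np.linalg.eigvalsh(A).min(), 1e-9)     # make K positive semidefinite
    K = A / rho + np.eye(n)
    if lr is None:
        lr = 1.0 / (2 * np.linalg.eigvalsh(K).max())  # safe step size
    alpha = np.ones(n) / n
    for _ in range(iters):
        grad = 2.0 - 2.0 * K @ alpha
        alpha = np.maximum(alpha + lr * grad, 0.0)    # project onto alpha >= 0
    return 2 * alpha.sum() - alpha @ K @ alpha
```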


Sketch-Based Linear Value Function Approximation

Neural Information Processing Systems

Hashing is a common method to reduce large, potentially infinite feature vectors to a fixed-size table. In reinforcement learning, hashing is often used in conjunction with tile coding to represent states in continuous spaces. Hashing is also a promising approach to value function approximation in large discrete domains such as Go and Hearts, where feature vectors can be constructed by exhaustively combining a set of atomic features. Unfortunately, the typical use of hashing in value function approximation results in biased value estimates due to the possibility of collisions. Recent work in data stream summaries has led to the development of the tug-of-war sketch, an unbiased estimator for approximating inner products. Our work investigates the application of this new data structure to linear value function approximation. Although in the reinforcement learning setting the use of the tug-of-war sketch leads to biased value estimates, we show that this bias can be orders of magnitude less than that of standard hashing. We provide empirical results on two RL benchmark domains and fifty-five Atari 2600 games to highlight the superior learning performance obtained when using tug-of-war hashing.
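
A toy sketch of sign-hashed (tug-of-war style) features used for linear value function approximation with a TD(0) update; the hash construction below is a simple stand-in, not the paper's implementation.

```python
import numpy as np

class TugOfWarSketch:
    """Sketch of tug-of-war hashing for linear value function approximation.

    Each raw feature index is mapped to a bucket and a random sign; the signs
    make collisions cancel in expectation, so inner products are far less
    biased than with plain (unsigned) hashing.
    """

    def __init__(self, width):
        self.width = width
        self.theta = np.zeros(width)                  # learned weights in sketch space

    def _bucket_sign(self, feature_index):
        # Illustrative stand-in for a pair of independent hash functions.
        r = np.random.default_rng(hash((feature_index, 1)) % (2 ** 32))
        return r.integers(self.width), 1 if r.integers(2) == 0 else -1

    def project(self, active_features):
        """Project a sparse binary feature vector into the sketch."""
        phi = np.zeros(self.width)
        for f in active_features:
            b, s = self._bucket_sign(f)
            phi[b] += s
        return phi

    def value(self, active_features):
        return self.theta @ self.project(active_features)

    def td_update(self, active_features, td_error, alpha=0.1):
        """Semi-gradient TD(0) update on the sketched features."""
        self.theta += alpha * td_error * self.project(active_features)
```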


Learning optimal spike-based representations

Neural Information Processing Systems

How do neural networks learn to represent information? Here, we address this question by assuming that neural networks seek to generate an optimal population representation for a fixed linear decoder. We define a loss function for the quality of the population read-out and derive the dynamical equations for both neurons and synapses from the requirement to minimize this loss. The dynamical equations yield a network of integrate-and-fire neurons undergoing Hebbian plasticity. We show that, through learning, initially regular and highly correlated spike trains evolve towards Poisson-distributed and independent spike trains with much lower firing rates. The learning rule drives the network into an asynchronous, balanced regime where all inputs to the network are represented optimally for the given decoder. We show that the network dynamics and synaptic plasticity jointly balance the excitation and inhibition received by each unit as tightly as possible and, in doing so, minimize the prediction error between the inputs and the decoded outputs. In turn, spikes are only signalled whenever this prediction error exceeds a certain value, thereby implementing a predictive coding scheme. Our work suggests that several of the features reported in cortical networks, such as the high trial-to-trial variability, the balance between excitation and inhibition, and spike-timing dependent plasticity, are simply signatures of an efficient, spike-based code.
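
A toy simulation of the rule the abstract describes, with a fixed linear decoder and no synaptic learning: each neuron's membrane potential tracks the prediction error projected onto its decoding weight, and it spikes only when doing so reduces the decoding error. Thresholds and decay constants are illustrative assumptions.

```python
import numpy as np

def simulate_spiking_decoder(x, D, steps=500, dt=1e-3, lam=10.0):
    """Minimal sketch of a 'spike only when it reduces the decoding error'
    rule for a fixed linear decoder D of shape (dim, N) and target input x.

    A neuron fires when its potential exceeds ||D_i||^2 / 2, i.e. when a
    spike would lower the squared readout error ||x - D r||^2.
    """
    N = D.shape[1]
    r = np.zeros(N)                      # filtered spike trains (decoded rates)
    thresholds = 0.5 * np.sum(D ** 2, axis=0)
    spikes = []
    for _ in range(steps):
        error = x - D @ r                # prediction error of the readout
        V = D.T @ error                  # membrane potentials
        fired = np.flatnonzero(V > thresholds)
        r[fired] += 1.0                  # instantaneous effect of a spike
        r *= np.exp(-lam * dt)           # leaky readout decay
        spikes.append(fired)
    return r, spikes
```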


Learning Probability Measures with respect to Optimal Transport Metrics

Neural Information Processing Systems

We study the problem of estimating, in the sense of optimal transport metrics, a measure which is assumed supported on a manifold embedded in a Hilbert space. By establishing a precise connection between optimal transport metrics, optimal quantization, and learning theory, we derive new probabilistic bounds for the performance of a classic algorithm in unsupervised learning (k-means), when used to produce a probability measure derived from the data. In the course of the analysis, we arrive at new lower bounds, as well as probabilistic upper bounds on the convergence rate of empirical to population measures, which, unlike existing bounds, are applicable to a wide class of measures.
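
A small sketch of the estimator studied in the abstract: run k-means on the samples and read off a discrete measure supported on the centres, with the quantization error serving as a proxy for the transport cost to the empirical measure. Uses scikit-learn for convenience.

```python
import numpy as np
from sklearn.cluster import KMeans

def kmeans_measure_estimate(samples, k):
    """Turn n samples into a discrete probability measure on k atoms.

    Returns the atoms (cluster centres), their weights (cluster proportions),
    and the mean quantization error, which relates to the squared transport
    cost between the empirical measure and its k-point quantization.
    """
    km = KMeans(n_clusters=k, n_init=10).fit(samples)
    atoms = km.cluster_centers_
    weights = np.bincount(km.labels_, minlength=k) / len(samples)
    quantization_error = km.inertia_ / len(samples)   # mean squared distance to centre
    return atoms, weights, quantization_error
```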


A new metric on the manifold of kernel matrices with application to matrix geometric means

Neural Information Processing Systems

Symmetric positive definite (spd) matrices are remarkably pervasive in a multitude of scientific disciplines, including machine learning and optimization. We consider the fundamental task of measuring distances between two spd matrices; a task that is often nontrivial whenever an application demands the distance function to respect the non-Euclidean geometry of spd matrices. Unfortunately, typical non-Euclidean distance measures such as the Riemannian metric $\delta_R(X,Y)=\|\log(X^{-1}Y)\|_{\mathrm{F}}$ are computationally demanding and also complicated to use. To allay some of these difficulties, we introduce a new metric on spd matrices: this metric not only respects non-Euclidean geometry, it also offers faster computation than $\delta_R$ while being less complicated to use. We support our claims theoretically via a series of theorems that relate our metric to $\delta_R(X,Y)$, and experimentally by studying the nonconvex problem of computing matrix geometric means based on squared distances.
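
For reference, the Riemannian distance quoted in the abstract can be computed as below. The abstract does not spell out the new metric, so the second function shows a log-determinant based distance as an assumed, Cholesky-only candidate of the kind the abstract alludes to; it should not be read as the paper's exact definition.

```python
import numpy as np
from scipy.linalg import logm, cholesky

def riemannian_distance(X, Y):
    """delta_R(X, Y) = ||log(X^{-1} Y)||_F, as quoted in the abstract."""
    return np.linalg.norm(logm(np.linalg.solve(X, Y)), 'fro')

def logdet_distance(X, Y):
    """Assumed candidate for a faster metric: a log-determinant based
    distance, sqrt(logdet((X+Y)/2) - 0.5*logdet(X) - 0.5*logdet(Y)).
    Only Cholesky factorizations are needed, avoiding the matrix logarithm."""
    def logdet(M):
        return 2.0 * np.sum(np.log(np.diag(cholesky(M))))
    return np.sqrt(logdet((X + Y) / 2.0) - 0.5 * (logdet(X) + logdet(Y)))
```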